20 September 2024

What are Character Sets and Collations in MySQL and MariaDB?

Why switching from utf8 to utf8mb4 in MySQL and MariaDB is crucial for supporting the full Unicode set, including emojis and special characters.

When working with MySQL or derivatives such as Percona Server or MariaDB, you will often come across the concepts of character set and collation, which are essential for correctly storing and manipulating textual data in a database. For many developers who are just starting out with these DBMSs, however, these concepts can seem complex or unclear.

In this article, we will explore in detail what character sets and collations are in MySQL and MariaDB, why they matter, and how they affect data storage and management. We will cover the main character sets (utf8, utf8mb3, utf8mb4), discuss collations such as utf8mb4_general_ci, utf8mb4_unicode_ci, and utf8mb4_unicode_520_ci, and see how these settings can affect query speed.

What is a Character Set?

A character set is a set of symbols together with their binary encodings. MySQL and MariaDB use the character set to decide how characters are encoded and stored in table columns.

Character Set Examples

Several character sets are used in databases. Some of the most common are:

  • latin1: a single-byte character set for Western European languages. It is usually described as ISO-8859-1, although the MySQL implementation is based on Windows cp1252.
  • utf8: a character set that stores data in UTF-8 encoding, using between 1 and 3 bytes per character. In MySQL the name “utf8” is misleading, because it cannot represent characters that need 4 bytes (more on that later).
  • utf8mb4: a variant of UTF-8 that supports all Unicode characters, including emoji and symbols that require up to 4 bytes.
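
To see which of these character sets a given server offers, and how many bytes each may use per character, you can ask the server itself. This is a minimal sketch, assuming a MySQL or MariaDB client session; the exact rows vary by version, and newer servers list the legacy set as utf8mb3 rather than utf8.

```sql
-- Maxlen is the maximum number of bytes a single character can occupy.
SHOW CHARACTER SET LIKE 'latin1';
SHOW CHARACTER SET LIKE 'utf8%';
```

The Maxlen column should read 1 for latin1, 3 for utf8 (or utf8mb3), and 4 for utf8mb4.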

UTF8 vs UTF8MB4: What's the Difference?

One of the most important points to understand is the difference between utf8 and utf8mb4 in MySQL and MariaDB.

  • utf8: a character set that supports UTF-8 characters of at most 3 bytes. It can therefore represent only the Basic Multilingual Plane (code points up to U+FFFF, roughly 65,000 characters) out of the 1,112,064 code points that full UTF-8 can encode. It does not support characters that require 4 bytes, such as most emoji and some rarer CJK ideographs.
  • utf8mb4: the complete implementation of UTF-8 in MySQL and MariaDB. It supports all Unicode characters, including those that require 4 bytes, and is the character set to use if your database needs to handle emoji or other characters outside the 3-byte range.

Practical example:

If you try to save an emoji (e.g. 😊) in a column that uses the utf8 character set, the result depends on the SQL mode: in strict mode the INSERT fails with an “Incorrect string value” error, while in non-strict mode the value is stored truncated or altered and only a warning is raised. The character requires 4 bytes, and utf8 supports at most 3. With utf8mb4, the emoji is saved correctly.
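
This behavior can be reproduced with a small test table. The sketch below rests on a few assumptions: the table emoji_demo and its columns are hypothetical, the session runs in strict SQL mode, and the client sends UTF-8. On older servers that do not accept the utf8mb3 name, utf8 can be used instead.

```sql
-- Make sure the connection itself can carry 4-byte characters.
SET NAMES utf8mb4;

-- Hypothetical demo table: one legacy 3-byte column, one utf8mb4 column.
CREATE TABLE emoji_demo (
  legacy VARCHAR(20) CHARACTER SET utf8mb3,
  modern VARCHAR(20) CHARACTER SET utf8mb4
);

-- Stored correctly: the column can hold 4-byte characters.
INSERT INTO emoji_demo (modern) VALUES ('😊');

-- In strict mode this is expected to fail with "Incorrect string value".
INSERT INTO emoji_demo (legacy) VALUES ('😊');

-- LENGTH counts bytes, CHAR_LENGTH counts characters.
SELECT LENGTH(modern) AS bytes, CHAR_LENGTH(modern) AS chars
FROM emoji_demo;
```

For the row that was stored, the last query should report 4 bytes for a single character.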

Using UTF8MB3

You will sometimes see the name utf8mb3, which is an alternative name for the utf8 character set in MySQL. It was introduced to make explicit that this character set supports at most 3 bytes per character, in contrast to utf8mb4, which supports the entire Unicode range, including 4-byte characters such as emoji and some less common Asian characters. In practice utf8mb3 and utf8 are equivalent; the utf8mb3 name simply highlights that the legacy utf8 covers only a subset of Unicode.

In recent years the ecosystem has been moving steadily towards full Unicode support, including 4-byte characters. As a result, utf8mb4 is becoming the universal choice, both for compatibility with newer standards and for more complete character handling.
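
Because the same character set can appear under either name, it helps to search for both when auditing a server. A minimal sketch using the standard information_schema.COLUMNS view; the list of excluded system schemas is an assumption and may need adjusting for your installation.

```sql
-- List columns still using the 3-byte character set, under either name.
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME,
       CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE CHARACTER_SET_NAME IN ('utf8', 'utf8mb3')
  AND TABLE_SCHEMA NOT IN
      ('mysql', 'information_schema', 'performance_schema', 'sys')
ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME;
```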

The “gear shift” towards utf8mb4

In some configurations, especially in newer versions of MariaDB, you can observe a “gear shift” in how character sets are handled. Traditionally, utf8 (or utf8mb3) was considered sufficient for most applications that did not need complex characters. With the growing need to handle multilingual content, emoji, and other special characters, utf8mb4 has become the new norm.

This change is visible in the default behavior of the servers. While utf8 was widely used in the past, newer versions of MySQL and MariaDB are moving to utf8mb4 as the default to provide broader, more modern character support. MySQL 8.0, for example, uses utf8mb4 as its default character set.

In some recent releases, a database that has historically stored strings as utf8 may therefore end up using utf8mb4 implicitly if the character set is not configured explicitly. This can lead to unexpected changes, such as a larger maximum storage size for VARCHAR or TEXT columns, and to potential performance effects on indexing and comparison operations.
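
The storage effect is simple arithmetic: the maximum byte length of a column is its character length multiplied by the maximum bytes per character. The sketch below uses a hypothetical VARCHAR(255) column and the 767-byte index key limit that applies to InnoDB tables with the older COMPACT or REDUNDANT row formats. Newer row formats allow larger keys, so treat the figures as an illustration rather than a rule for every server.

```sql
SELECT
  255 * 3        AS max_bytes_utf8mb3,        -- 765
  255 * 4        AS max_bytes_utf8mb4,        -- 1020
  FLOOR(767 / 3) AS indexable_chars_utf8mb3,  -- 255
  FLOOR(767 / 4) AS indexable_chars_utf8mb4;  -- 191
```

This is why a fully indexed VARCHAR(255) column that worked under utf8 can hit a key length error after a switch to utf8mb4 on such tables.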

Implications of MySQL and MariaDB Configuration

To manage this transition correctly, it is essential to check and configure your database settings carefully, at the server level as well as for individual databases, tables, and columns. In MySQL and MariaDB, many character set and collation settings can be defined in the main configuration files, such as my.cnf in MySQL or server.cnf in MariaDB.
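
Before editing any configuration file, it is worth checking what the running server and the current session actually use. A minimal sketch for an interactive client session; the collation in the SET NAMES statement is only an example choice.

```sql
-- Character sets for server, database, client, connection and results.
SHOW VARIABLES LIKE 'character_set%';

-- Matching collations for server, database and connection.
SHOW VARIABLES LIKE 'collation%';

-- Align the current session (client, connection, results) with utf8mb4.
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
```

The server-wide equivalents are the character-set-server and collation-server options in the [mysqld] section of the configuration file.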

What is a Collation?

A collation is a set of rules that determines how characters are compared and sorted in a database. Each character set has one or more associated collations, which specify how characters are compared in operations such as ORDER BY, GROUP BY, and equality comparisons.
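
The effect of a collation on equality is easy to observe with two string literals. In this small sketch the _utf8mb4 introducer is used so that the result does not depend on the connection's character set.

```sql
SELECT
  _utf8mb4'a' = _utf8mb4'A' COLLATE utf8mb4_general_ci AS equal_ci,   -- 1
  _utf8mb4'a' = _utf8mb4'A' COLLATE utf8mb4_bin        AS equal_bin;  -- 0
```

The same two strings are equal under a case-insensitive collation and different under a binary one.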


Main Collations in MySQL

Collation names follow a specific convention. For example, utf8mb4_general_ci is made up of three parts:

  • utf8mb4: the character set the collation belongs to.
  • general: the type of comparison rules.
  • ci: stands for case insensitive, meaning the collation does not distinguish between uppercase and lowercase.
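
The collations available for a character set, and which one is its default, can be listed directly. This is a minimal sketch, and the output differs between MySQL and MariaDB versions.

```sql
-- The Default column marks the collation used when only the
-- character set is specified.
SHOW COLLATION WHERE Charset = 'utf8mb4';

-- Narrow the list to case-insensitive collations by their _ci suffix.
SHOW COLLATION LIKE 'utf8mb4%\_ci';
```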

Here are some of the main collations used in MySQL and MariaDB:

  • utf8mb4_general_ci: the default collation for utf8mb4 in MySQL 5.7 (MySQL 8.0 changed the default to utf8mb4_0900_ai_ci). It is case insensitive and uses simplified, general comparison rules, which makes it fast for sorting and string comparison. Because of those simplifications it is less rigorous than the Unicode standard in handling some linguistic details. It is often preferred where speed is critical and linguistic precision is not a concern.
  • utf8mb4_unicode_ci: this collation follows the Unicode Collation Algorithm (version 4.0.0) for character comparison. It is more accurate than utf8mb4_general_ci when working with different languages, accents, complex symbols, and special characters. That accuracy comes at a cost: queries can be slightly slower, especially on large datasets, because the comparison rules are more detailed. It is recommended for applications that require high linguistic accuracy.
  • utf8mb4_unicode_520_ci: an updated variant of utf8mb4_unicode_ci that implements the collation rules of Unicode 5.2. Besides retaining the features of the previous version, it covers the characters and symbols introduced up to that version of the Unicode standard, which makes it a better fit for more recent or special characters. Again, accuracy comes at the cost of possibly slower queries compared with less precise collations.

Differences between Collations

utf8mb4_general_ci vs utf8mb4_unicode_ci

utf8mb4_general_ci is faster because it applies simpler comparison rules: it compares strings one character at a time and does not support expansions, contractions, or ignorable characters. This is usually adequate for European languages, but it does not handle every linguistic case well. For example, it treats the German ß as equal to s, whereas utf8mb4_unicode_ci treats it as equal to ss.

utf8mb4_unicode_ci, on the other hand, follows the Unicode rules and handles special characters, accents, and symbols correctly, which makes it the better choice where linguistic precision is essential.
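
One difference documented for MySQL is how the two collations treat the German ß. The sketch below compares it against s and ss under each collation. The character is written as its UTF-8 bytes (X'C39F') so that the result does not depend on the client encoding, and the column aliases are only illustrative.

```sql
SELECT
  _utf8mb4 X'C39F' = _utf8mb4's'  COLLATE utf8mb4_general_ci AS general_eq_s,
  _utf8mb4 X'C39F' = _utf8mb4'ss' COLLATE utf8mb4_general_ci AS general_eq_ss,
  _utf8mb4 X'C39F' = _utf8mb4's'  COLLATE utf8mb4_unicode_ci AS unicode_eq_s,
  _utf8mb4 X'C39F' = _utf8mb4'ss' COLLATE utf8mb4_unicode_ci AS unicode_eq_ss;
```

According to the MySQL documentation, general_eq_s and unicode_eq_ss should both be true: the general collation maps ß to a single s, while the Unicode collation expands it to ss.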

Impact on performance

The choice of collation can have a significant impact on query performance. More complex collations, such as utf8mb4_unicode_ci or utf8mb4_unicode_520_ci, may take longer to perform comparisons and sorts because they must follow more detailed rules.

For example, if you have a table with millions of rows and run an ORDER BY on a column that uses utf8mb4_unicode_ci, the query may take longer than on a table that uses utf8mb4_general_ci. The Unicode collation has to handle complex characters, accents, and other special symbols correctly, while utf8mb4_general_ci applies simpler comparison rules.

The graph shows a performance comparison between different collations in MySQL 5.7, measured as throughput (tps) against the number of threads used (4, 24, 64, 128). The collations compared are:
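
A rough way to observe this on your own data is to sort the same column under each collation and compare the elapsed time reported by the client. This is only a sketch: users and name are hypothetical names for a large table with a utf8mb4 text column, and the timings depend heavily on data, hardware, and server version.

```sql
-- Sort with the simpler rules.
SELECT name FROM users
ORDER BY name COLLATE utf8mb4_general_ci
LIMIT 100;

-- Sort with the full Unicode rules.
SELECT name FROM users
ORDER BY name COLLATE utf8mb4_unicode_ci
LIMIT 100;
```

An explicit COLLATE in ORDER BY may prevent the server from using an index for the sort, so both queries end up comparing strings directly. That is what makes the comparison cost visible.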

  • utf8mb4_general_ci (default) (in blue)
  • utf8mb4_bin (in red)
  • utf8mb4_unicode_ci (in yellow)
  • utf8mb4_unicode_520_ci (in green)

Remarks:

  1. utf8mb4_bin (red) has the highest throughput at every thread count, showing the best performance.
  2. utf8mb4_general_ci (blue), the default collation, is the second fastest, with performance that remains constant and is very close to that of utf8mb4_bin at 128 threads.
  3. utf8mb4_unicode_ci (yellow) performs worse than utf8mb4_bin and utf8mb4_general_ci, with visibly lower throughput, especially from 24 threads upwards.
  4. utf8mb4_unicode_520_ci (green) is the collation with the worst performance, especially as the number of threads increases, where the drop in throughput is noticeable.

If you use a collation such as utf8mb4_unicode_ci or utf8mb4_unicode_520_ci, you should expect lower throughput, especially at high thread counts, than with lighter collations such as utf8mb4_general_ci or utf8mb4_bin.

Practical use cases

If you are developing an application that needs to support Western European languages and you are not too concerned about the accuracy of the collation rules for other languages, utf8mb4_general_ci may be a reasonable choice. If, however, your database needs to support multiple languages and you need character comparisons to follow standard Unicode rules, then utf8mb4_unicode_ci or utf8mb4_unicode_520_ci is the better choice.

Choosing the Right Character Set and Collation

The choice of character set and collation depends largely on your application's requirements and the type of data you plan to store in the database.

When to use UTF8MB4

In general, if you are working on a new project, you should use utf8mb4 as the default character set. Even if you do not plan to handle emoji or 4-byte Unicode symbols right now, utf8mb4 gives you the flexibility to store any Unicode character in the future. It has no significant disadvantages compared with utf8, apart from a slight increase in storage space for characters that require more bytes.

Practical implementation example:

CREATE DATABASE testdb
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;


In this example, you are creating a database called testdb with the utf8mb4 character set and the utf8mb4_unicode_ci collation. This configuration ensures that the database supports all Unicode characters, including emoji, and that it follows standard Unicode rules for comparing and sorting characters.
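
To confirm that the settings were applied, and that new tables pick them up, you can query the data dictionary. A minimal sketch, in which the notes table is a hypothetical example created only for the check.

```sql
-- Defaults recorded for the database.
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'testdb';

-- A table created without its own CHARACTER SET clause
-- inherits the database defaults.
CREATE TABLE testdb.notes (body TEXT);
SHOW CREATE TABLE testdb.notes;
```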

Collation and performance

As already mentioned, a more complex collation can affect performance. If you are developing an application where query speed is critical and linguistic accuracy is a lesser concern, you may want to choose a simpler collation such as utf8mb4_general_ci.

On the other hand, if your application has to handle multiple languages and requires strict linguistic precision, you should opt for a more complex collation such as utf8mb4_unicode_ci.

Impact of Collations on Indexes and Searches

Collation also affects indexes. When you create an index on a column, the rules of that column's collation determine how the index is ordered. This can affect the performance of searches, as illustrated by the Percona blog's analysis of collation performance.

For example, an index on a column that uses utf8mb4_general_ci could be more efficient than an index on a column that uses utf8mb4_unicode_ci, since the comparison rules of the general collation are simpler.

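ALTER TABLE users MODIFY name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
CREATE INDEX idx_name ON users (name);

In this example, the collation is set on the column itself, because in MySQL and MariaDB a plain index follows the collation of the column it covers rather than taking a COLLATE clause in CREATE INDEX. The VARCHAR(100) type is an assumption made for illustration. The index on the name column then uses utf8mb4_general_ci, which may offer better search performance than an index on a utf8mb4_unicode_ci column.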
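
Whether a query can actually use such an index also depends on the collation applied in the comparison. This sketch assumes a hypothetical users table whose name column is utf8mb4_general_ci and carries an index called idx_name.

```sql
-- The comparison uses the column's own collation,
-- so the index can typically be used.
EXPLAIN SELECT * FROM users WHERE name = 'alice';

-- Overriding the collation in the query usually prevents index use,
-- because the index is ordered by utf8mb4_general_ci.
EXPLAIN SELECT * FROM users
WHERE name COLLATE utf8mb4_unicode_ci = 'alice';
```

Comparing the key column of the two EXPLAIN outputs shows whether idx_name was chosen in each case.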

Conclusions

Character sets and collations are crucial for managing text data correctly in MySQL and MariaDB. Choosing the right character set (preferably utf8mb4 for new projects) and an appropriate collation has a significant impact on the database's ability to handle complex characters such as emoji, and on how sorting and comparison operations behave.

In summary, here are six practical tips for managing character sets and collations in MySQL and MariaDB:

  1. Use utf8mb4 to support all Unicode characters: it is the best choice for compatibility with complex characters, emoji, and 4-byte symbols, and it prepares your database for modern, multilingual content.
  2. If you care about query speed and do not need precise Unicode rules, choose utf8mb4_general_ci: this collation is faster thanks to its simpler comparison rules and suits contexts where linguistic accuracy is not critical.
  3. If precision in collation rules is important, use utf8mb4_unicode_ci or utf8mb4_unicode_520_ci: these collations are ideal for multilingual applications that require accurate, Unicode-compliant comparisons. utf8mb4_unicode_520_ci also supports the newer characters introduced with Unicode 5.2.
  4. Consider storage space and indexes when using utf8mb4: since it reserves up to 4 bytes per character instead of 3, you may need to account for index length limits and larger maximum column sizes. For example, under InnoDB's 767-byte key limit (COMPACT and REDUNDANT row formats), an indexed column can hold 191 utf8mb4 characters instead of 255. Incorrect configurations can cause errors or increase resource usage.
  5. Align the character set and collation settings across server, database, tables, and clients: differences between these levels can cause encoding problems and corrupted data. Set the configuration file (my.cnf or server.cnf) correctly to ensure consistency.
  6. Update existing applications that still rely on utf8 (utf8mb3): if your application is built on the utf8 character set (also known as utf8mb3), carefully consider migrating to utf8mb4, especially if you expect to handle complex data, emoji, or multilingual symbols in the future.
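
The migration mentioned in the last tip can be sketched as follows. The names testdb and users are placeholders, and the collation is only one possible choice. Take a backup first and try the conversion on a copy: it rewrites the table, and on some configurations it can fail if an indexed column exceeds the key length limit once it is sized for 4-byte characters.

```sql
-- Change the defaults used for tables created from now on.
ALTER DATABASE testdb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Convert an existing table: text column definitions and the
-- stored data are re-encoded.
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Confirm the result.
SHOW CREATE TABLE users;
```

Remember that the application's connection must also use utf8mb4 (for example through SET NAMES utf8mb4 or the driver's charset option), otherwise 4-byte characters can still be lost on the way to the server.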

Being aware of the implications of these choices will help you optimize your text data management and ensure that your application functions correctly and efficiently.

