When working with MySQL or derivatives like Percona Server or MariaDB, you will often come across the concepts of character set and collation, which are essential for storing and manipulating textual data correctly. However, for many developers who are just starting out with these DBMSs, these concepts can be complex or unclear.
In this article, we will explore in detail what character sets and collations are in MySQL and MariaDB, why they are important, and how they affect data storage and management. We will cover the main character sets such as utf8, utf8mb3, and utf8mb4, discuss collations like utf8mb4_general_ci, utf8mb4_unicode_ci, and utf8mb4_unicode_520_ci, and see how these settings can impact query speed.
What is a Character Set?
A character set is a set of symbols together with their binary encodings. Every relational database, MySQL and MariaDB included, uses a character set to determine how characters are encoded and stored in table columns.
Character Set Examples
Several character sets are used in databases. Some of the more common include:
- latin1: a single-byte character set for Western European languages, based on ISO-8859-1 (MySQL's latin1 is in fact Windows cp1252).
- utf8: A character set that encodes data using UTF-8 encoding. Each character can take up between 1 and 3 bytes. However, in MySQL, the name “utf8” is a bit misleading because it only represents characters up to 3 bytes (more on that later).
- utf8mb4: A variant of UTF-8 that fully supports all Unicode characters, including emoji and symbols that require up to 4 bytes.
UTF8 vs UTF8MB4: What's the Difference?
One of the most important points to understand is the difference between utf8 and utf8mb4 in MySQL and MariaDB.
- utf8: a character set that supports UTF-8 characters only up to 3 bytes each. It can therefore represent only the Basic Multilingual Plane (code points U+0000 to U+FFFF), a subset of the 1,112,064 code points that full UTF-8 can encode. It does not support characters that require 4 bytes, such as most emoji and some less common CJK characters.
- utf8mb4: the complete implementation of UTF-8 encoding in MySQL and MariaDB. utf8mb4 supports all Unicode characters, including those that require 4 bytes. This is the character set you should use if your database needs to correctly handle emoji or other characters that require more than 3 bytes.
Practical example:
If you try to save an emoji (e.g. 😊) in a column that uses the utf8 character set, you will get an error (in strict SQL mode) or the data will be truncated with a warning (in non-strict mode), since that character requires 4 bytes while utf8 supports at most 3 bytes per character. With utf8mb4, the emoji is saved correctly.
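The behavior described above can be sketched as follows. The table and column names (charset_demo, note_mb3, note_mb4) are hypothetical, and the snippet assumes a client connection that can send 4-byte characters, which is why it starts with SET NAMES utf8mb4.

```sql
-- Make sure the connection itself can carry 4-byte characters.
SET NAMES utf8mb4;

-- Hypothetical demo table: one column per character set.
CREATE TABLE charset_demo (
  id INT AUTO_INCREMENT PRIMARY KEY,
  note_mb3 VARCHAR(20) CHARACTER SET utf8mb3,
  note_mb4 VARCHAR(20) CHARACTER SET utf8mb4
);

-- Works: utf8mb4 can store the 4-byte emoji.
INSERT INTO charset_demo (note_mb4) VALUES ('hello 😊');

-- Rejected with an "Incorrect string value" error in strict SQL mode;
-- in non-strict mode the value is truncated and a warning is raised.
INSERT INTO charset_demo (note_mb3) VALUES ('hello 😊');

-- LENGTH() counts bytes, CHAR_LENGTH() counts characters.
SELECT note_mb4, LENGTH(note_mb4) AS bytes, CHAR_LENGTH(note_mb4) AS chars
FROM charset_demo
WHERE note_mb4 IS NOT NULL;
```

The stored string is 7 characters long but occupies 10 bytes, because the emoji alone takes 4 bytes.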
Using UTF8MB3
Sometimes you will see the term utf8mb3, which is an alternative name for the utf8 character set in MySQL. This name was introduced to make it clearer that utf8 in MySQL only supports characters up to 3 bytes, in contrast to utf8mb4, which supports the entire Unicode character set, including 4-byte characters such as emoji or some less common Asian characters. So utf8mb3 and utf8 are equivalent, but the name utf8mb3 highlights the limitation hidden behind the old name utf8: only a subset of Unicode characters is supported.
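To see how your own server names these character sets, you can list them. This is only a quick check: the exact rows depend on the server version, since some versions list the 3-byte set as utf8 and others as utf8mb3.

```sql
-- The Maxlen column reports the maximum number of bytes per character:
-- 3 for utf8 / utf8mb3, 4 for utf8mb4.
SHOW CHARACTER SET LIKE 'utf8%';
```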
In recent years, software has increasingly moved towards full Unicode support, including 4-byte characters. For this reason, utf8mb4 is becoming the standard choice, both for compatibility with current standards and for more complete character handling.
The “gear shift” towards utf8mb4
In some configurations, especially in newer versions of MariaDB, you can observe a “gear shift” in how character sets are handled. Traditionally, utf8 (or utf8mb3) was considered sufficient for most applications that did not need complex characters. However, with the growing need to handle multilingual content, emoji, and other special characters, utf8mb4 has begun to take hold as the new norm.
An example of this change is the default behavior of the servers themselves. While utf8 was widely used in the past, newer versions of MySQL and MariaDB are moving to utf8mb4 as the default (MySQL 8.0, for instance, uses utf8mb4 as its default character set) to ensure broader and more modern character support.
In some recent releases, a database that historically used utf8 to store strings may, without explicit configuration, implicitly switch to utf8mb4. This can lead to unexpected changes in data management, such as increased storage size for VARCHAR or TEXT columns, and potentially to performance impacts on indexing and on comparison operations involving complex characters.
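Before relying on any default, it can help to know which columns still use the 3-byte character set. The query below is a minimal sketch that reads the standard information_schema.COLUMNS view; the list of excluded system schemas is an assumption and may need adjusting for your server.

```sql
-- List columns still stored with the 3-byte character set
-- (reported as utf8 or utf8mb3 depending on the server version).
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE CHARACTER_SET_NAME IN ('utf8', 'utf8mb3')
  AND TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME;
```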
Implications of MySQL and MariaDB Configuration
To manage this transition correctly, it is essential to carefully check and configure your database settings at the server level as well as for individual tables and columns. In MySQL and MariaDB, many of the character set and collation settings can be defined in the main configuration files, such as my.cnf in MySQL or server.cnf in MariaDB.
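Whatever the configuration files say, you can check what the running server actually uses. A minimal sketch with standard statements:

```sql
-- Server, database, client, and connection character sets.
SHOW VARIABLES LIKE 'character_set%';

-- The matching collations.
SHOW VARIABLES LIKE 'collation%';

-- Default character set and collation of each database.
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA;
```

Comparing the server-level values with the per-database defaults shows quickly whether a database was created under different settings from the ones currently configured.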
What is a Collation?
A collation is a set of rules that determine how characters are compared and sorted in a database. Each character set has one or more associated collations, which specify how characters are compared in operations such as ORDER BY, GROUP BY, or equality comparisons.
Main Collations in MySQL
Collation names follow a specific convention. For example, utf8mb4_general_ci is divided into three parts:
- utf8mb4: indicates the character set to which the collation belongs.
- general: indicates the type of comparison rules.
- ci: stands for case insensitive, meaning that the collation does not distinguish between uppercase and lowercase letters.
Here are some of the main collations used in MySQL and MariaDB:
- utf8mb4_general_ci: This has long been the default collation for utf8mb4 (MySQL 8.0 changed the default to utf8mb4_0900_ai_ci) and is case insensitive. It uses simplified, general comparison rules, which makes it particularly fast for operations such as sorting and string comparison. However, because of its simplified nature, it handles some linguistic complexities less rigorously than the Unicode standard requires. For applications where speed is critical and linguistic precision is not a concern, it is often the preferred choice.
- utf8mb4_unicode_ci: This collation follows the Unicode Collation Algorithm (based on UCA 4.0.0) for character comparison. It is more accurate than utf8mb4_general_ci when working with different languages, accents, complex symbols, and special characters. However, its accuracy comes at the cost of performance: it can be slower in queries, especially on large datasets, due to the more detailed collation rules. It is recommended for applications that require high linguistic accuracy.
- utf8mb4_unicode_520_ci: This is an updated variant of utf8mb4_unicode_ci that implements the collation rules of Unicode 5.2. In addition to retaining the features of the previous version, it covers characters and symbols introduced up to that version of the Unicode standard, making it a suitable choice for handling more recent or special characters. Again, accuracy comes at the cost of a possible slowdown in queries compared to less precise collations.
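To see which of these collations your server offers, you can list them. This is a simple check; the rows returned and the collation marked as default vary by server and version.

```sql
-- List every collation that belongs to utf8mb4.
-- The Default column marks the one used when no collation is specified.
SHOW COLLATION LIKE 'utf8mb4%';
```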
Differences between Collations
utf8mb4_general_ci vs utf8mb4_unicode_ci
utf8mb4_general_ci is faster because it applies simpler comparison rules: it compares strings character by character and does not support expansions, contractions, or ignorable characters. As a result, it does not handle all linguistic complexities well. For example, it treats the German ß as equal to s rather than to ss.
On the other hand, utf8mb4_unicode_ci follows the Unicode collation rules, handling special characters, accents, and symbols more accurately, which makes it better suited to situations where linguistic precision is essential.
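The difference can be observed directly by comparing string literals under an explicit collation. The sketch below assumes a utf8mb4 connection (hence SET NAMES); the column aliases are only illustrative.

```sql
SET NAMES utf8mb4;

SELECT
  'a' = 'A' COLLATE utf8mb4_general_ci  AS case_general,    -- case-insensitive comparison
  'a' = 'A' COLLATE utf8mb4_bin         AS case_bin,        -- binary comparison of code points
  'ß' = 'ss' COLLATE utf8mb4_general_ci AS eszett_general,  -- no expansion support
  'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS eszett_unicode;  -- ß expands to ss
```

According to the documented behavior of these collations, the result should be 1, 0, 0, 1: general_ci ignores case but does not expand ß, while unicode_ci treats ß and ss as equal.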
Impact on performance
The choice of collation can have a significant impact on query performance. More complex collations, such as utf8mb4_unicode_ci or utf8mb4_unicode_520_ci, may take longer to perform comparisons and sorts, as they must follow more detailed rules.
For example, if you have a table with millions of rows and you run an ORDER BY on a column with the utf8mb4_unicode_ci collation, the query may take longer than on a table that uses utf8mb4_general_ci. This is because the Unicode collation must correctly handle complex characters, accents, and other special symbols, while utf8mb4_general_ci applies simpler comparison rules.
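If you want to try this on your own data, a collation can be overridden for a single query with the COLLATE clause. The sketch below assumes a hypothetical users table whose name column uses the utf8mb4 character set; COLLATE raises an error if the collation does not belong to the column's character set.

```sql
-- Sort using the column's own collation (an index on name, if present, can be used).
SELECT name FROM users ORDER BY name LIMIT 20;

-- Sort the same column with a different collation for this query only.
SELECT name FROM users ORDER BY name COLLATE utf8mb4_unicode_ci LIMIT 20;
```

Keep in mind that an index is ordered by the column's own collation, so the second query generally cannot use it for sorting. Timings taken this way measure both the collation cost and the extra sort.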
The graph shows a performance comparison between different collations in MySQL 5.7, measured as throughput (tps) against the number of threads used (4, 24, 64, 128). The collations compared are:
- utf8mb4_general_ci (default) (in blue)
- utf8mb4_bin (in red)
- utf8mb4_unicode_ci (in yellow)
- utf8mb4_unicode_520_ci (in green)
Remarks:
- utf8mb4_bin (red) has the highest throughput with all thread amounts, showing the best performance.
- utf8mb4_general_ci (blue), the default collation, is the second fastest, with performance that remains consistent and comes very close to that of utf8mb4_bin at 128 threads.
- utf8mb4_unicode_ci (yellow) performs worse than utf8mb4_bin and utf8mb4_general_ci, with visibly lower throughput especially from 24 threads upward.
- utf8mb4_unicode_520_ci (green) is the collation with the worst performance, especially as the number of threads increases, confirming a noticeable drop in throughput.
If you use a collation like utf8mb4_unicode_ci or utf8mb4_unicode_520_ci, you can expect a measurable performance impact, especially at high thread counts, compared to lighter collations such as utf8mb4_general_ci or utf8mb4_bin.
Practical use cases
If you are developing an application that needs to support Western European languages and you are not too concerned about the accuracy of the collation rules for other languages, utf8mb4_general_ci might be a reasonable choice. If, however, your database needs to support multiple languages and you need character comparisons to follow standard Unicode rules, then utf8mb4_unicode_ci or utf8mb4_unicode_520_ci are better choices.
Choosing the Right Character Set and Collation
The choice of character set and collation depends heavily on your application requirements and the type of data you plan to handle in the database.
When to use UTF8MB4
In general, if you are working on a new project, you should use utf8mb4 as the default character set. Even if you don't plan on handling emoji or 4-byte Unicode symbols right now, utf8mb4 gives you the flexibility to handle any Unicode character in the future. There are no significant disadvantages to using utf8mb4 compared to utf8, except for a slight increase in storage space for characters that require more bytes.
Practical implementation example:
CREATE DATABASE testdb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
In this example, you are creating a database called testdb with the character set utf8mb4 and the collation utf8mb4_unicode_ci. This configuration ensures that the database supports all Unicode characters, including emoji, and that it follows standard Unicode rules for comparing and sorting characters.
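For a database that already exists, the same settings can be applied after the fact. The following is a minimal sketch, not a complete migration procedure: the table name articles is hypothetical, and on real data you would want a backup and a test run first.

```sql
-- Change the defaults used for tables created from now on.
-- Existing tables and columns are not touched by this statement.
ALTER DATABASE testdb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Convert an existing table: its default and all its text columns are rewritten.
-- 'articles' is a placeholder table name.
ALTER TABLE articles CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```

Note that CONVERT TO rebuilds the table, and the server may widen some text column types (for example TEXT to MEDIUMTEXT) so that the converted data still fits.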
Collation and performance
As already mentioned, a more complex collation can impact performance. Therefore, if you are developing an application where query speed is critical and you are not too concerned about linguistic accuracy, you may want to choose a simpler collation such as utf8mb4_general_ci.
On the other hand, if your application has to handle multiple languages and requires strict linguistic precision, you should opt for a more complex collation such as utf8mb4_unicode_ci.
Impact of Collations on Indexes and Searches
Collation also affects indexes. When you create an index on a column, the column's collation rules determine how the index entries are sorted. This can impact the performance of database searches, a point also made in Percona's blog post on collation performance.
For example, an index created on a column with utf8mb4_general_ci could be more efficient than an index on a column with utf8mb4_unicode_ci, since the comparison rules of the general collation are simpler.
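ALTER TABLE users MODIFY name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci; -- column type and length are assumed; keep your own definition
CREATE INDEX idx_name ON users (name); -- the index inherits the column's collation
In MySQL and MariaDB an index takes its collation from the column definition, so the collation is set on the column rather than in the CREATE INDEX statement. In this example, the index on the name column uses the utf8mb4_general_ci collation, which may offer better search performance than an index that uses utf8mb4_unicode_ci.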
Conclusions
Character sets and collations are crucial components for properly managing text data in MySQL and MariaDB. Choosing the correct character set (preferably utf8mb4 for new projects) and an adequate collation can have a significant impact on the database's ability to handle complex characters, such as emoji, and on how operations such as sorting and comparing data are performed.
Being aware of the implications of these choices will help you optimize your text data management and ensure that your application functions correctly and efficiently.
If your database or WordPress installation is unable to save special characters, please contact us for a consultation and to resolve the issue.