Menu:
* charset-unicode-conversion:: Converting Between 3-Byte and 4-Byte Unicode Character Sets
The Unicode Standard includes characters from the Basic Multilingual Plane (BMP) and supplementary characters that lie outside the BMP. This section describes support for Unicode in MySQL. For information about the Unicode Standard itself, visit the Unicode Consortium website (http://www.unicode.org/).
BMP characters have these characteristics:
Their code point values are between 0 and 65535 (or 'U+0000' and 'U+FFFF').
They can be encoded in a variable-length encoding using 8, 16, or 24 bits (1 to 3 bytes).
They can be encoded in a fixed-length encoding using 16 bits (2 bytes).
They are sufficient for almost all characters in major languages.
Supplementary characters lie outside the BMP:
Their code point values are between 'U+10000' and 'U+10FFFF'.
Unicode support for supplementary characters requires character sets whose range extends beyond the BMP; these characters therefore take more space than BMP characters (up to 4 bytes per character).
The UTF-8 (Unicode Transformation Format with 8-bit units) method for encoding Unicode data is implemented according to RFC 3629, which describes encoding sequences that take from one to four bytes. The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:
Basic Latin letters, digits, and punctuation signs use one byte.
Most European and Middle East script letters fit into a 2-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.
Korean, Chinese, and Japanese ideographs use 3-byte or 4-byte sequences.
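As an informal way to see these lengths from SQL (a sketch, not from the reference text; the hex literals are assumed to be the UTF-8 encodings of 'CYRILLIC SMALL LETTER DE' and the Han ideograph 'U+4E2D'), compare 'LENGTH()', which counts bytes, with 'CHAR_LENGTH()', which counts characters:
/* Each value is one character; the byte lengths differ by script */
SELECT LENGTH(_utf8mb4 'A')           AS latin_bytes,     /* 1 */
       LENGTH(_utf8mb4 0xD0B4)        AS cyrillic_bytes,  /* 2 */
       LENGTH(_utf8mb4 0xE4B8AD)      AS han_bytes,       /* 3 */
       CHAR_LENGTH(_utf8mb4 0xE4B8AD) AS han_chars;       /* 1 */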
MySQL supports these Unicode character sets:
'utf8mb4': A UTF-8 encoding of the Unicode character set using one to four bytes per character.
'utf8mb3': A UTF-8 encoding of the Unicode character set using one to three bytes per character.
'utf8': An alias for 'utf8mb3'.
'ucs2': The UCS-2 encoding of the Unicode character set using two bytes per character.
'utf16': The UTF-16 encoding for the Unicode character set using two or four bytes per character. Like 'ucs2' but with an extension for supplementary characters.
'utf16le': The UTF-16LE encoding for the Unicode character set. Like 'utf16' but little-endian rather than big-endian.
'utf32': The UTF-32 encoding for the Unicode character set using four bytes per character.
*note charset-unicode-charset-characteristics:: summarizes the general characteristics of Unicode character sets supported by MySQL.
Unicode Character Set General Characteristics

Character Set       Supported Characters     Required Storage Per Character
'utf8mb3', 'utf8'   BMP only                 1, 2, or 3 bytes
'ucs2'              BMP only                 2 bytes
'utf8mb4'           BMP and supplementary    1, 2, 3, or 4 bytes
'utf16'             BMP and supplementary    2 or 4 bytes
'utf16le'           BMP and supplementary    2 or 4 bytes
'utf32'             BMP and supplementary    4 bytes
Characters outside the BMP compare as 'REPLACEMENT CHARACTER' and convert to '?' when converted to a Unicode character set that supports only BMP characters ('utf8mb3' or 'ucs2').
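For example (an illustrative sketch; the hex literal is assumed to be the 'utf8mb4' encoding of the supplementary character 'U+1F600'), converting such a character to the 3-byte 'utf8' yields '?':
SELECT CONVERT(_utf8mb4 0xF09F9880 USING utf8)      AS converted,  /* '?' */
       HEX(CONVERT(_utf8mb4 0xF09F9880 USING utf8)) AS hex_value;  /* 3F  */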
If you use character sets that support supplementary characters and thus are 'wider' than the BMP-only 'utf8mb3' and 'ucs2' character sets, there are potential incompatibility issues for your applications; see *note charset-unicode-conversion::. That section also describes how to convert tables from the (3-byte) 'utf8mb3' to the (4-byte) 'utf8mb4', and what constraints may apply in doing so.
A similar set of collations is available for most Unicode character sets. For example, each has a Danish collation, the names of which are 'utf8mb4_danish_ci', 'utf8mb3_danish_ci', 'utf8_danish_ci', 'ucs2_danish_ci', 'utf16_danish_ci', and 'utf32_danish_ci'. The exception is 'utf16le', which has only two collations. For information about Unicode collations and their differentiating properties, including collation properties for supplementary characters, see *note charset-unicode-sets::.
The MySQL implementation of UCS-2, UTF-16, and UTF-32 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of values. Other database systems might use little-endian byte order or a BOM. In such cases, conversion of values must be performed when transferring data between those systems and MySQL. The implementation of UTF-16LE is little-endian.
MySQL uses no BOM for UTF-8 values.
Client applications that communicate with the server using Unicode should set the client character set accordingly (for example, by issuing a 'SET NAMES 'utf8mb4'' statement). Some character sets cannot be used as the client character set. Attempting to use them with *note 'SET NAMES': set-names. or *note 'SET CHARACTER SET': set-character-set. produces an error. See *note charset-connection-impermissible-client-charset::.
The following sections provide additional detail on the Unicode character sets in MySQL.
File: manual.info.tmp, Node: charset-unicode-utf8mb4, Next: charset-unicode-utf8mb3, Prev: charset-unicode, Up: charset-unicode
The 'utf8mb4' character set has these characteristics:
Supports BMP and supplementary characters.
Requires a maximum of four bytes per multibyte character.
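As a quick illustration (a sketch; the hex literal is assumed to be the four-byte UTF-8 encoding 'F0 9F 98 80' of the supplementary character 'U+1F600'), such a character occupies four bytes but counts as a single character:
SELECT LENGTH(_utf8mb4 0xF09F9880)      AS byte_len,  /* 4 */
       CHAR_LENGTH(_utf8mb4 0xF09F9880) AS char_len;  /* 1 */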
'utf8mb4' contrasts with the 'utf8mb3' character set, which supports only BMP characters and uses a maximum of three bytes per character:
For a BMP character, 'utf8mb4' and 'utf8mb3' have identical storage characteristics: same code values, same encoding, same length.
For a supplementary character, 'utf8mb4' requires four bytes to store it, whereas 'utf8mb3' cannot store the character at all. When converting 'utf8mb3' columns to 'utf8mb4', you need not worry about converting supplementary characters because there are none.
'utf8mb4' is a superset of 'utf8mb3', so for an operation such as the following concatenation, the result has character set 'utf8mb4' and the collation of 'utf8mb4_col':
SELECT CONCAT(utf8mb3_col, utf8mb4_col);
Similarly, the following comparison in the 'WHERE' clause works according to the collation of 'utf8mb4_col':
SELECT * FROM utf8mb3_tbl, utf8mb4_tbl
WHERE utf8mb3_tbl.utf8mb3_col = utf8mb4_tbl.utf8mb4_col;
For information about data type storage as it relates to multibyte character sets, see *note data-types-storage-reqs-strings::.
File: manual.info.tmp, Node: charset-unicode-utf8mb3, Next: charset-unicode-utf8, Prev: charset-unicode-utf8mb4, Up: charset-unicode
The 'utf8mb3' character set has these characteristics:
Supports BMP characters only (no support for supplementary characters)
Requires a maximum of three bytes per multibyte character.
Applications that use UTF-8 data but require supplementary character support should use 'utf8mb4' rather than 'utf8mb3' (see *note charset-unicode-utf8mb4::).
Exactly the same set of characters is available in 'utf8mb3' and 'ucs2'. That is, they have the same repertoire.
'utf8' is an alias for 'utf8mb3'; the character limit is implicit rather than explicit in the name.
'utf8mb3' can be used in 'CHARACTER SET' clauses, and 'utf8mb3_COLLATION_SUBSTRING' in 'COLLATE' clauses, where COLLATION_SUBSTRING is 'bin', 'czech_ci', 'danish_ci', 'esperanto_ci', 'estonian_ci', and so forth. For example:
CREATE TABLE t (s1 CHAR(1)) CHARACTER SET utf8mb3;
SELECT * FROM t WHERE s1 COLLATE utf8mb3_general_ci = 'x';
DECLARE x VARCHAR(5) CHARACTER SET utf8mb3 COLLATE utf8mb3_danish_ci;
SELECT CAST('a' AS CHAR CHARACTER SET utf8) COLLATE utf8_czech_ci;
MySQL immediately converts instances of 'utf8mb3' in statements to 'utf8', so in statements such as 'SHOW CREATE TABLE' or 'SELECT CHARACTER_SET_NAME FROM INFORMATION_SCHEMA.COLUMNS' or 'SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS', users see the name 'utf8' or 'utf8_COLLATION_SUBSTRING'.
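For example (a minimal sketch; the table name is arbitrary), a table created with 'utf8mb3' is displayed with 'utf8' in its definition:
CREATE TABLE alias_demo (c CHAR(1)) CHARACTER SET utf8mb3;
SHOW CREATE TABLE alias_demo;  /* the definition shows DEFAULT CHARSET=utf8 */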
'utf8mb3' is also valid in contexts other than 'CHARACTER SET' clauses. For example:
mysqld --character-set-server=utf8mb3
SET NAMES 'utf8mb3'; /* and other SET statements that have similar effect */
SELECT _utf8mb3 'a';
For information about data type storage as it relates to multibyte character sets, see *note data-types-storage-reqs-strings::.
File: manual.info.tmp, Node: charset-unicode-utf8, Next: charset-unicode-ucs2, Prev: charset-unicode-utf8mb3, Up: charset-unicode
'utf8' is an alias for the 'utf8mb3' character set. For more information, see *note charset-unicode-utf8mb3::.
File: manual.info.tmp, Node: charset-unicode-ucs2, Next: charset-unicode-utf16, Prev: charset-unicode-utf8, Up: charset-unicode
In UCS-2, every character is represented by a 2-byte Unicode code with the most significant byte first. For example: 'LATIN CAPITAL LETTER A' has the code '0x0041' and it is stored as a 2-byte sequence: '0x00 0x41'. 'CYRILLIC SMALL LETTER YERU' (Unicode '0x044B') is stored as a 2-byte sequence: '0x04 0x4B'. For Unicode characters and their codes, please refer to the Unicode Consortium website (http://www.unicode.org/).
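You can observe this encoding directly (a sketch; 'HEX()' returns the hexadecimal form of the stored bytes, and the second hex literal is assumed to be the UTF-8 encoding of 'CYRILLIC SMALL LETTER YERU'):
SELECT HEX(CONVERT('A' USING ucs2))             AS latin_a;  /* 0041 */
SELECT HEX(CONVERT(_utf8mb4 0xD18B USING ucs2)) AS yeru;     /* 044B */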
The 'ucs2' character set has these characteristics:
Supports BMP characters only (no support for supplementary characters)
Uses a fixed-length 16-bit encoding and requires two bytes per character.
File: manual.info.tmp, Node: charset-unicode-utf16, Next: charset-unicode-utf16le, Prev: charset-unicode-ucs2, Up: charset-unicode
The 'utf16' character set is the 'ucs2' character set with an extension that enables encoding of supplementary characters:
For a BMP character, 'utf16' and 'ucs2' have identical storage characteristics: same code values, same encoding, same length.
For a supplementary character, 'utf16' has a special sequence for representing the character using 32 bits. This is called the 'surrogate' mechanism: Subtract '0x10000' from the code point, add the high 10 bits of the result to '0xd800' and put them in the first 16-bit word, then add the low 10 bits to '0xdc00' and put them in the second 16-bit word. Consequently, all supplementary characters require 32 bits, where the first 16 bits are a number between '0xd800' and '0xdbff', and the last 16 bits are a number between '0xdc00' and '0xdfff'. Examples are in Section 15.5 Surrogates Area (http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf) of the Unicode 4.0 document.
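A worked example (an illustrative sketch, assuming 'U+10400': '0x10400' - '0x10000' = '0x0400'; the high 10 bits are 1, giving '0xd801', and the low 10 bits are 0, giving '0xdc00'):
CREATE TABLE surrogate_demo (c CHAR(1) CHARACTER SET utf16);
INSERT INTO surrogate_demo VALUES (0xd801dc00);  /* surrogate pair for U+10400 */
SELECT HEX(c)                      AS utf16_bytes,  /* D801DC00 */
       HEX(CONVERT(c USING utf32)) AS utf32_bytes   /* 00010400 */
FROM surrogate_demo;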
Because 'utf16' supports surrogates and 'ucs2' does not, there is a validity check that applies only in 'utf16': You cannot insert a top surrogate without a bottom surrogate, or vice versa. For example:
INSERT INTO t (ucs2_column) VALUES (0xd800); /* legal */
INSERT INTO t (utf16_column) VALUES (0xd800); /* illegal */
There is no validity check for characters that are technically valid but are not true Unicode (that is, characters that Unicode considers to be 'unassigned code points' or 'private use' characters or even 'illegals' like '0xffff'). For example, since 'U+F8FF' is the Apple Logo, this is legal:
INSERT INTO t (utf16_column) VALUES (0xf8ff); /* legal */
Such characters cannot be expected to mean the same thing to everyone.
Because MySQL must allow for the worst case (that one character requires four bytes), the maximum length of a 'utf16' column or index is only half of the maximum length for a 'ucs2' column or index. For example, the maximum length of a 'MEMORY' table index key is 3072 bytes, so these statements create tables with the longest permitted indexes for 'ucs2' and 'utf16' columns:
CREATE TABLE tf (s1 VARCHAR(1536) CHARACTER SET ucs2) ENGINE=MEMORY;
CREATE INDEX i ON tf (s1);
CREATE TABLE tg (s1 VARCHAR(768) CHARACTER SET utf16) ENGINE=MEMORY;
CREATE INDEX i ON tg (s1);
File: manual.info.tmp, Node: charset-unicode-utf16le, Next: charset-unicode-utf32, Prev: charset-unicode-utf16, Up: charset-unicode
This is the same as 'utf16' but is little-endian rather than big-endian.
File: manual.info.tmp, Node: charset-unicode-utf32, Next: charset-unicode-conversion, Prev: charset-unicode-utf16le, Up: charset-unicode
The 'utf32' character set is fixed length (like 'ucs2' and unlike 'utf16'). 'utf32' uses 32 bits for every character, unlike 'ucs2' (which uses 16 bits for every character), and unlike 'utf16' (which uses 16 bits for some characters and 32 bits for others).
'utf32' takes twice as much space as 'ucs2' and more space than 'utf16', but 'utf32' has the same advantage as 'ucs2' that it is predictable for storage: The required number of bytes for 'utf32' equals the number of characters times 4. Also, unlike 'utf16', there are no tricks for encoding in 'utf32', so the stored value equals the code value.
To demonstrate how the latter advantage is useful, here is an example that shows how to determine a 'utf8mb4' value given the 'utf32' code value:
/* Assume code value = 100cc LINEAR B WHEELED CHARIOT */
CREATE TABLE tmp (utf32_col CHAR(1) CHARACTER SET utf32,
utf8mb4_col CHAR(1) CHARACTER SET utf8mb4);
INSERT INTO tmp VALUES (0x000100cc,NULL);
UPDATE tmp SET utf8mb4_col = utf32_col;
SELECT HEX(utf32_col),HEX(utf8mb4_col) FROM tmp;
MySQL is very forgiving about additions of unassigned Unicode characters or private-use-area characters. There is in fact only one validity check for 'utf32': No code value may be greater than '0x10ffff'. For example, this is illegal:
INSERT INTO t (utf32_column) VALUES (0x110000); /* illegal */
File: manual.info.tmp, Node: charset-unicode-conversion, Prev: charset-unicode-utf32, Up: charset-unicode
This section describes issues that you may face when converting character data between the 'utf8mb3' and 'utf8mb4' character sets.
Note:
This discussion focuses primarily on converting between 'utf8mb3' and 'utf8mb4', but similar principles apply to converting between the 'ucs2' character set and character sets such as 'utf16' or 'utf32'.
The 'utf8mb3' and 'utf8mb4' character sets differ as follows:
'utf8mb3' supports only characters in the Basic Multilingual Plane (BMP). 'utf8mb4' additionally supports supplementary characters that lie outside the BMP.
'utf8mb3' uses a maximum of three bytes per character. 'utf8mb4' uses a maximum of four bytes per character.
Note:
This discussion refers to the 'utf8mb3' and 'utf8mb4' character set names to be explicit about referring to 3-byte and 4-byte UTF-8 character set data. The exception is that in table definitions, 'utf8' is used because MySQL converts instances of 'utf8mb3' specified in such definitions to 'utf8', which is an alias for 'utf8mb3'.
One advantage of converting from 'utf8mb3' to 'utf8mb4' is that this enables applications to use supplementary characters. One tradeoff is that this may increase data storage space requirements.
In terms of table content, conversion from 'utf8mb3' to 'utf8mb4' presents no problems:
For a BMP character, 'utf8mb4' and 'utf8mb3' have identical storage characteristics: same code values, same encoding, same length.
For a supplementary character, 'utf8mb4' requires four bytes to store it, whereas 'utf8mb3' cannot store the character at all. When converting 'utf8mb3' columns to 'utf8mb4', you need not worry about converting supplementary characters because there are none.
In terms of table structure, these are the primary potential incompatibilities:
For the variable-length character data types (*note 'VARCHAR': char. and the *note 'TEXT': blob. types), the maximum permitted length in characters is less for 'utf8mb4' columns than for 'utf8mb3' columns.
For all character data types (*note 'CHAR': char, *note 'VARCHAR': char, and the *note 'TEXT': blob. types), the maximum number of characters that can be indexed is less for 'utf8mb4' columns than for 'utf8mb3' columns.
Consequently, to convert tables from 'utf8mb3' to 'utf8mb4', it may be necessary to change some column or index definitions.
Tables can be converted from 'utf8mb3' to 'utf8mb4' by using *note 'ALTER TABLE': alter-table. Suppose that a table has this definition:
CREATE TABLE t1 (
col1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
col2 CHAR(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
) CHARACTER SET utf8;
The following statement converts 't1' to use 'utf8mb4':
ALTER TABLE t1
DEFAULT CHARACTER SET utf8mb4,
MODIFY col1 CHAR(10)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
MODIFY col2 CHAR(10)
CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;
The catch when converting from 'utf8mb3' to 'utf8mb4' is that the maximum length of a column or index key is unchanged in terms of bytes. Therefore, it is smaller in terms of characters because the maximum length of a character is four bytes instead of three. For the *note 'CHAR': char, *note 'VARCHAR': char, and *note 'TEXT': blob. data types, watch for these issues when converting your MySQL tables:
Check all definitions of 'utf8mb3' columns and make sure they do not exceed the maximum length for the storage engine.
Check all indexes on 'utf8mb3' columns and make sure they do not exceed the maximum length for the storage engine. Sometimes the maximum can change due to storage engine enhancements.
If the preceding conditions apply, you must either reduce the defined length of columns or indexes, or continue to use 'utf8mb3' rather than 'utf8mb4'.
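To help locate affected columns before converting, a query along the following lines can be useful (an illustrative sketch, not part of the documented procedure; it simply lists 'utf8' columns and their defined lengths):
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE CHARACTER_SET_NAME = 'utf8'
AND TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');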
Here are some examples where structural changes may be needed:
A *note 'TINYTEXT': blob. column can hold up to 255 bytes, so it can hold up to 85 3-byte or 63 4-byte characters. Suppose that you have a *note 'TINYTEXT': blob. column that uses 'utf8mb3' but must be able to contain more than 63 characters. You cannot convert it to 'utf8mb4' unless you also change the data type to a longer type such as *note 'TEXT': blob.
Similarly, a very long *note 'VARCHAR': char. column may need to be changed to one of the longer *note 'TEXT': blob. types if you want to convert it from 'utf8mb3' to 'utf8mb4'.
'InnoDB' has a maximum index length of 767 bytes for tables that use 'COMPACT' or 'REDUNDANT' row format, so for 'utf8mb3' or 'utf8mb4' columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have 'utf8mb3' columns with indexes longer than 191 characters, you must index a smaller number of characters.
In an 'InnoDB' table that uses 'COMPACT' or 'REDUNDANT' row format, these column and index definitions are legal:
col1 VARCHAR(500) CHARACTER SET utf8, INDEX (col1(255))
To use 'utf8mb4' instead, the index must be smaller:
col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))
Note:
For 'InnoDB' tables that use 'COMPRESSED' or 'DYNAMIC' row format, you can enable the 'innodb_large_prefix' option to permit index key prefixes longer than 767 bytes (up to 3072 bytes). (Creating such tables also requires the option settings 'innodb_file_format=barracuda' and 'innodb_file_per_table=true'.) In this case, enabling 'innodb_large_prefix' enables you to index a maximum of 1024 or 768 characters for 'utf8mb3' or 'utf8mb4' columns, respectively. For related information, see *note innodb-limits::.
The preceding types of changes are most likely to be required only if you have very long columns or indexes. Otherwise, you should be able to convert your tables from 'utf8mb3' to 'utf8mb4' without problems, using *note 'ALTER TABLE': alter-table. as described previously.
The following items summarize other potential incompatibilities:
'SET NAMES 'utf8mb4'' causes use of the 4-byte character set for connection character sets. As long as no 4-byte characters are sent from the server, there should be no problems. Otherwise, applications that expect to receive a maximum of three bytes per character may have problems. Conversely, applications that expect to send 4-byte characters must ensure that the server understands them. (A quick way to check the resulting connection character sets is shown in the example following this list.)
For replication, if character sets that support supplementary characters are to be used on the source, all replicas must understand them as well.
Also, keep in mind the general principle that if a table has different definitions on the source and replica, this can lead to unexpected results. For example, the differences in maximum index key length make it risky to use 'utf8mb3' on the source and 'utf8mb4' on the replica.
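The following sketch shows one way to confirm the connection character sets that a 'SET NAMES' statement establishes (the system variables shown are the standard connection character set variables):
SET NAMES 'utf8mb4';
SELECT @@character_set_client, @@character_set_connection, @@character_set_results;
/* all three should report utf8mb4 */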
If you have converted to 'utf8mb4', 'utf16', 'utf16le', or 'utf32', and then decide to convert back to 'utf8mb3' or 'ucs2' (for example, to downgrade to an older version of MySQL), these considerations apply:
'utf8mb3' and 'ucs2' data should present no problems.
The server must be recent enough to recognize definitions referring to the character set from which you are converting.
For object definitions that refer to the 'utf8mb4' character set, you can dump them with *note 'mysqldump': mysqldump. prior to downgrading, edit the dump file to change instances of 'utf8mb4' to 'utf8', and reload the file in the older server, as long as there are no 4-byte characters in the data. The older server sees 'utf8' in the dump file object definitions and creates new objects that use the (3-byte) 'utf8' character set.
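A minimal sketch of that procedure (the database name, dump file name, and the use of 'sed' for the edit are illustrative assumptions, not part of the documented steps):
mysqldump --routines --events db_name > dump.sql
sed -i 's/utf8mb4/utf8/g' dump.sql
mysql db_name < dump.sql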
File: manual.info.tmp, Node: charset-charsets, Next: charset-restrictions, Prev: charset-unicode, Up: charset