10.9 Unicode Support

The Unicode Standard includes characters from the Basic Multilingual Plane (BMP) and supplementary characters that lie outside the BMP. This section describes support for Unicode in MySQL. For information about the Unicode Standard itself, visit the Unicode Consortium website (http://www.unicode.org/).

BMP characters have these characteristics:

Supplementary characters lie outside the BMP:

The UTF-8 (Unicode Transformation Format with 8-bit units) method for encoding Unicode data is implemented according to RFC 3629, which describes encoding sequences that take from one to four bytes. The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:

MySQL supports these Unicode character sets:

*note charset-unicode-charset-characteristics::, summarizes the general characteristics of Unicode character sets supported by MySQL.

Unicode Character Set General Characteristics

Character Supported Characters Required Storage Per Set Character

'utf8mb3', 'utf8'

BMP only 1, 2, or 3 bytes

'ucs2'

BMP only 2 bytes

'utf8mb4'

BMP and 1, 2, 3, or 4 bytes supplementary

'utf16'

BMP and 2 or 4 bytes supplementary

'utf16le'

BMP and 2 or 4 bytes supplementary

'utf32'

BMP and 4 bytes supplementary

Characters outside the BMP compare as REPLACEMENT CHARACTER and convert to ''?'' when converted to a Unicode character set that supports only BMP characters ('utf8mb3' or 'ucs2').

If you use character sets that support supplementary characters and thus are 'wider' than the BMP-only 'utf8mb3' and 'ucs2' character sets, there are potential incompatibility issues for your applications; see *note charset-unicode-conversion::. That section also describes how to convert tables from the (3-byte) 'utf8mb3' to the (4-byte) 'utf8mb4', and what constraints may apply in doing so.

A similar set of collations is available for most Unicode character sets. For example, each has a Danish collation, the names of which are 'utf8mb4_danish_ci', 'utf8mb3_danish_ci', 'utf8_danish_ci', 'ucs2_danish_ci', 'utf16_danish_ci', and 'utf32_danish_ci'. The exception is 'utf16le', which has only two collations. For information about Unicode collations and their differentiating properties, including collation properties for supplementary characters, see *note charset-unicode-sets::.

The MySQL implementation of UCS-2, UTF-16, and UTF-32 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of values. Other database systems might use little-endian byte order or a BOM. In such cases, conversion of values must be performed when transferring data between those systems and MySQL. The implementation of UTF-16LE is little-endian.

MySQL uses no BOM for UTF-8 values.

Client applications that communicate with the server using Unicode should set the client character set accordingly (for example, by issuing a 'SET NAMES 'utf8mb4'' statement). Some character sets cannot be used as the client character set. Attempting to use them with note 'SET NAMES': set-names. or note 'SET CHARACTER SET': set-character-set. produces an error. See *note charset-connection-impermissible-client-charset::.

The following sections provide additional detail on the Unicode character sets in MySQL.

 File: manual.info.tmp, Node: charset-unicode-utf8mb4, Next: charset-unicode-utf8mb3, Prev: charset-unicode, Up: charset-unicode

10.9.1 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)

The 'utf8mb4' character set has these characteristics:

'utf8mb4' contrasts with the 'utf8mb3' character set, which supports only BMP characters and uses a maximum of three bytes per character:

'utf8mb4' is a superset of 'utf8mb3', so for an operation such as the following concatenation, the result has character set 'utf8mb4' and the collation of 'utf8mb4_col':

 SELECT CONCAT(utf8mb3_col, utf8mb4_col);

Similarly, the following comparison in the 'WHERE' clause works according to the collation of 'utf8mb4_col':

 SELECT * FROM utf8mb3_tbl, utf8mb4_tbl
 WHERE utf8mb3_tbl.utf8mb3_col = utf8mb4_tbl.utf8mb4_col;

For information about data type storage as it relates to multibyte character sets, see *note data-types-storage-reqs-strings::.

 File: manual.info.tmp, Node: charset-unicode-utf8mb3, Next: charset-unicode-utf8, Prev: charset-unicode-utf8mb4, Up: charset-unicode

10.9.2 The utf8mb3 Character Set (3-Byte UTF-8 Unicode Encoding)

The 'utf8mb3' character set has these characteristics:

Applications that use UTF-8 data but require supplementary character support should use 'utf8mb4' rather than 'utf8mb3' (see *note charset-unicode-utf8mb4::).

Exactly the same set of characters is available in 'utf8mb3' and 'ucs2'. That is, they have the same repertoire.

'utf8' is an alias for 'utf8mb3'; the character limit is implicit, rather than explicit in the name.

'utf8mb3' can be used in 'CHARACTER SET' clauses, and 'utf8mb3_COLLATION_SUBSTRING' in 'COLLATE' clauses, where COLLATION_SUBSTRING is 'bin', 'czech_ci', 'danish_ci', 'esperanto_ci', 'estonian_ci', and so forth. For example:

 CREATE TABLE t (s1 CHAR(1)) CHARACTER SET utf8mb3;
 SELECT * FROM t WHERE s1 COLLATE utf8mb3_general_ci = 'x';
 DECLARE x VARCHAR(5) CHARACTER SET utf8mb3 COLLATE utf8mb3_danish_ci;
 SELECT CAST('a' AS CHAR CHARACTER SET utf8) COLLATE utf8_czech_ci;

MySQL immediately converts instances of 'utf8mb3' in statements to 'utf8', so in statements such as 'SHOW CREATE TABLE' or 'SELECT CHARACTER_SET_NAME FROM INFORMATION_SCHEMA.COLUMNS' or 'SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS', users see the name 'utf8' or 'utf8_COLLATION_SUBSTRING'.

'utf8mb3' is also valid in contexts other than 'CHARACTER SET' clauses. For example:

 mysqld --character-set-server=utf8mb3

 SET NAMES 'utf8mb3'; /* and other SET statements that have similar effect */
 SELECT _utf8mb3 'a';

For information about data type storage as it relates to multibyte character sets, see *note data-types-storage-reqs-strings::.

 File: manual.info.tmp, Node: charset-unicode-utf8, Next: charset-unicode-ucs2, Prev: charset-unicode-utf8mb3, Up: charset-unicode

10.9.3 The utf8 Character Set (Alias for utf8mb3)

'utf8' is an alias for the 'utf8mb3' character set. For more information, see *note charset-unicode-utf8mb3::.

 File: manual.info.tmp, Node: charset-unicode-ucs2, Next: charset-unicode-utf16, Prev: charset-unicode-utf8, Up: charset-unicode

10.9.4 The ucs2 Character Set (UCS-2 Unicode Encoding)

In UCS-2, every character is represented by a 2-byte Unicode code with the most significant byte first. For example: 'LATIN CAPITAL LETTER A' has the code '0x0041' and it is stored as a 2-byte sequence: '0x00 0x41'. 'CYRILLIC SMALL LETTER YERU' (Unicode '0x044B') is stored as a 2-byte sequence: '0x04 0x4B'. For Unicode characters and their codes, please refer to the Unicode Consortium website (http://www.unicode.org/).

The 'ucs2' character set has these characteristics:

 File: manual.info.tmp, Node: charset-unicode-utf16, Next: charset-unicode-utf16le, Prev: charset-unicode-ucs2, Up: charset-unicode

10.9.5 The utf16 Character Set (UTF-16 Unicode Encoding)

The 'utf16' character set is the 'ucs2' character set with an extension that enables encoding of supplementary characters:

Because 'utf16' supports surrogates and 'ucs2' does not, there is a validity check that applies only in 'utf16': You cannot insert a top surrogate without a bottom surrogate, or vice versa. For example:

 INSERT INTO t (ucs2_column) VALUES (0xd800); /* legal */
 INSERT INTO t (utf16_column)VALUES (0xd800); /* illegal */

There is no validity check for characters that are technically valid but are not true Unicode (that is, characters that Unicode considers to be 'unassigned code points' or 'private use' characters or even 'illegals' like '0xffff'). For example, since 'U+F8FF' is the Apple Logo, this is legal:

 INSERT INTO t (utf16_column)VALUES (0xf8ff); /* legal */

Such characters cannot be expected to mean the same thing to everyone.

Because MySQL must allow for the worst case (that one character requires four bytes) the maximum length of a 'utf16' column or index is only half of the maximum length for a 'ucs2' column or index. For example, the maximum length of a 'MEMORY' table index key is 3072 bytes, so these statements create tables with the longest permitted indexes for 'ucs2' and 'utf16' columns:

 CREATE TABLE tf (s1 VARCHAR(1536) CHARACTER SET ucs2) ENGINE=MEMORY;
 CREATE INDEX i ON tf (s1);
 CREATE TABLE tg (s1 VARCHAR(768) CHARACTER SET utf16) ENGINE=MEMORY;
 CREATE INDEX i ON tg (s1);

 File: manual.info.tmp, Node: charset-unicode-utf16le, Next: charset-unicode-utf32, Prev: charset-unicode-utf16, Up: charset-unicode

10.9.6 The utf16le Character Set (UTF-16LE Unicode Encoding)

This is the same as 'utf16' but is little-endian rather than big-endian.

 File: manual.info.tmp, Node: charset-unicode-utf32, Next: charset-unicode-conversion, Prev: charset-unicode-utf16le, Up: charset-unicode

10.9.7 The utf32 Character Set (UTF-32 Unicode Encoding)

The 'utf32' character set is fixed length (like 'ucs2' and unlike 'utf16'). 'utf32' uses 32 bits for every character, unlike 'ucs2' (which uses 16 bits for every character), and unlike 'utf16' (which uses 16 bits for some characters and 32 bits for others).

'utf32' takes twice as much space as 'ucs2' and more space than 'utf16', but 'utf32' has the same advantage as 'ucs2' that it is predictable for storage: The required number of bytes for 'utf32' equals the number of characters times 4. Also, unlike 'utf16', there are no tricks for encoding in 'utf32', so the stored value equals the code value.

To demonstrate how the latter advantage is useful, here is an example that shows how to determine a 'utf8mb4' value given the 'utf32' code value:

 /* Assume code value = 100cc LINEAR B WHEELED CHARIOT */
 CREATE TABLE tmp (utf32_col CHAR(1) CHARACTER SET utf32,
                   utf8mb4_col CHAR(1) CHARACTER SET utf8mb4);
 INSERT INTO tmp VALUES (0x000100cc,NULL);
 UPDATE tmp SET utf8mb4_col = utf32_col;
 SELECT HEX(utf32_col),HEX(utf8mb4_col) FROM tmp;

MySQL is very forgiving about additions of unassigned Unicode characters or private-use-area characters. There is in fact only one validity check for 'utf32': No code value may be greater than '0x10ffff'. For example, this is illegal:

 INSERT INTO t (utf32_column) VALUES (0x110000); /* illegal */

 File: manual.info.tmp, Node: charset-unicode-conversion, Prev: charset-unicode-utf32, Up: charset-unicode

10.9.8 Converting Between 3-Byte and 4-Byte Unicode Character Sets

This section describes issues that you may face when converting character data between the 'utf8mb3' and 'utf8mb4' character sets.

Note:

This discussion focuses primarily on converting between 'utf8mb3' and 'utf8mb4', but similar principles apply to converting between the 'ucs2' character set and character sets such as 'utf16' or 'utf32'.

The 'utf8mb3' and 'utf8mb4' character sets differ as follows:

Note:

This discussion refers to the 'utf8mb3' and 'utf8mb4' character set names to be explicit about referring to 3-byte and 4-byte UTF-8 character set data. The exception is that in table definitions, 'utf8' is used because MySQL converts instances of 'utf8mb3' specified in such definitions to 'utf8', which is an alias for 'utf8mb3'.

One advantage of converting from 'utf8mb3' to 'utf8mb4' is that this enables applications to use supplementary characters. One tradeoff is that this may increase data storage space requirements.

In terms of table content, conversion from 'utf8mb3' to 'utf8mb4' presents no problems:

In terms of table structure, these are the primary potential incompatibilities:

Consequently, to convert tables from 'utf8mb3' to 'utf8mb4', it may be necessary to change some column or index definitions.

Tables can be converted from 'utf8mb3' to 'utf8mb4' by using *note 'ALTER TABLE': alter-table. Suppose that a table has this definition:

 CREATE TABLE t1 (
   col1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
   col2 CHAR(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
 ) CHARACTER SET utf8;

The following statement converts 't1' to use 'utf8mb4':

 ALTER TABLE t1
   DEFAULT CHARACTER SET utf8mb4,
   MODIFY col1 CHAR(10)
     CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
   MODIFY col2 CHAR(10)
     CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;

The catch when converting from 'utf8mb3' to 'utf8mb4' is that the maximum length of a column or index key is unchanged in terms of bytes. Therefore, it is smaller in terms of characters because the maximum length of a character is four bytes instead of three. For the note 'CHAR': char, note 'VARCHAR': char, and *note 'TEXT': blob. data types, watch for these issues when converting your MySQL tables:

If the preceding conditions apply, you must either reduce the defined length of columns or indexes, or continue to use 'utf8mb3' rather than 'utf8mb4'.

Here are some examples where structural changes may be needed:

The preceding types of changes are most likely to be required only if you have very long columns or indexes. Otherwise, you should be able to convert your tables from 'utf8mb3' to 'utf8mb4' without problems, using *note 'ALTER TABLE': alter-table. as described previously.

The following items summarize other potential incompatibilities:

If you have converted to 'utf8mb4', 'utf16', 'utf16le', or 'utf32', and then decide to convert back to 'utf8mb3' or 'ucs2' (for example, to downgrade to an older version of MySQL), these considerations apply:

 File: manual.info.tmp, Node: charset-charsets, Next: charset-restrictions, Prev: charset-unicode, Up: charset