10.10 Supported Character Sets and Collations

This section indicates which character sets MySQL supports. There is one subsection for each group of related character sets. For each character set, the permissible collations are listed.

To list the available character sets and their default collations, use the note 'SHOW CHARACTER SET': show-character-set. statement or query the 'INFORMATION_SCHEMA' note 'CHARACTER_SETS': information-schema-character-sets-table. table. For example:

 mysql> SHOW CHARACTER SET;
 +----------+---------------------------------+---------------------+--------+
 | Charset  | Description                     | Default collation   | Maxlen |
 +----------+---------------------------------+---------------------+--------+
 | big5     | Big5 Traditional Chinese        | big5_chinese_ci     |      2 |
 | dec8     | DEC West European               | dec8_swedish_ci     |      1 |
 | cp850    | DOS West European               | cp850_general_ci    |      1 |
 | hp8      | HP West European                | hp8_english_ci      |      1 |
 | koi8r    | KOI8-R Relcom Russian           | koi8r_general_ci    |      1 |
 | latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
 | latin2   | ISO 8859-2 Central European     | latin2_general_ci   |      1 |
 | swe7     | 7bit Swedish                    | swe7_swedish_ci     |      1 |
 | ascii    | US ASCII                        | ascii_general_ci    |      1 |
 | ujis     | EUC-JP Japanese                 | ujis_japanese_ci    |      3 |
 | sjis     | Shift-JIS Japanese              | sjis_japanese_ci    |      2 |
 | hebrew   | ISO 8859-8 Hebrew               | hebrew_general_ci   |      1 |
 | tis620   | TIS620 Thai                     | tis620_thai_ci      |      1 |
 | euckr    | EUC-KR Korean                   | euckr_korean_ci     |      2 |
 | koi8u    | KOI8-U Ukrainian                | koi8u_general_ci    |      1 |
 | gb2312   | GB2312 Simplified Chinese       | gb2312_chinese_ci   |      2 |
 | greek    | ISO 8859-7 Greek                | greek_general_ci    |      1 |
 | cp1250   | Windows Central European        | cp1250_general_ci   |      1 |
 | gbk      | GBK Simplified Chinese          | gbk_chinese_ci      |      2 |
 | latin5   | ISO 8859-9 Turkish              | latin5_turkish_ci   |      1 |
 | armscii8 | ARMSCII-8 Armenian              | armscii8_general_ci |      1 |
 | utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
 | ucs2     | UCS-2 Unicode                   | ucs2_general_ci     |      2 |
 | cp866    | DOS Russian                     | cp866_general_ci    |      1 |
 | keybcs2  | DOS Kamenicky Czech-Slovak      | keybcs2_general_ci  |      1 |
 | macce    | Mac Central European            | macce_general_ci    |      1 |
 | macroman | Mac West European               | macroman_general_ci |      1 |
 | cp852    | DOS Central European            | cp852_general_ci    |      1 |
 | latin7   | ISO 8859-13 Baltic              | latin7_general_ci   |      1 |
 | utf8mb4  | UTF-8 Unicode                   | utf8mb4_general_ci  |      4 |
 | cp1251   | Windows Cyrillic                | cp1251_general_ci   |      1 |
 | utf16    | UTF-16 Unicode                  | utf16_general_ci    |      4 |
 | utf16le  | UTF-16LE Unicode                | utf16le_general_ci  |      4 |
 | cp1256   | Windows Arabic                  | cp1256_general_ci   |      1 |
 | cp1257   | Windows Baltic                  | cp1257_general_ci   |      1 |
 | utf32    | UTF-32 Unicode                  | utf32_general_ci    |      4 |
 | binary   | Binary pseudo charset           | binary              |      1 |
 | geostd8  | GEOSTD8 Georgian                | geostd8_general_ci  |      1 |
 | cp932    | SJIS for Windows Japanese       | cp932_japanese_ci   |      2 |
 | eucjpms  | UJIS for Windows Japanese       | eucjpms_japanese_ci |      3 |
 | gb18030  | China National Standard GB18030 | gb18030_chinese_ci  |      4 |
 +----------+---------------------------------+---------------------+--------+

In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.

 File: manual.info.tmp, Node: charset-unicode-sets, Next: charset-we-sets, Prev: charset-charsets, Up: charset-charsets

10.10.1 Unicode Character Sets

This section describes the collations available for Unicode character sets and their differentiating properties. For general information about Unicode, see *note charset-unicode::.

MySQL supports multiple Unicode character sets:

'utf8mb4', 'utf16', 'utf16le', and 'utf32' support Basic Multilingual Plane (BMP) characters and supplementary characters that lie outside the BMP. 'utf8' and 'ucs2' support only BMP characters.

Most Unicode character sets have a general collation (indicated by '_general' in the name or by the absence of a language specifier), a binary collation (indicated by '_bin' in the name), and several language-specific collations (indicated by language specifiers). For example, for 'utf8mb4', 'utf8mb4_general_ci' and 'utf8mb4_bin' are its general and binary collations, and 'utf8mb4_danish_ci' is one of its language-specific collations.

Collation support for 'utf16le' is limited. The only collations available are 'utf16le_general_ci' and 'utf16le_bin'. These are similar to 'utf16_general_ci' and 'utf16_bin'.

Unicode Collation Algorithm (UCA) Versions

MySQL implements the 'XXX_unicode_ci' collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. The 'XXX_unicode_ci' collations have only partial support for the Unicode Collation Algorithm. Some characters are not supported, and combining marks are not fully supported. This affects primarily Vietnamese, Yoruba, and some smaller languages such as Navajo. A combined character is considered different from the same character written with a single unicode character in string comparisons, and the two characters are considered to have a different length (for example, as returned by the 'CHAR_LENGTH()' function or in result set metadata).

Unicode collations based on UCA versions higher than 4.0.0 include the version in the collation name. Thus, 'utf8_unicode_520_ci' is based on UCA 5.2.0 weight keys (http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt).

The 'LOWER()' and 'UPPER()' functions perform case folding according to the collation of their argument. A character that has uppercase and lowercase versions only in a Unicode version higher than 4.0.0 is converted by these functions only if the argument collation uses a high enough UCA version.

Language-Specific Collations

MySQL implements language-specific Unicode collations if the ordering based only on the Unicode Collation Algorithm (UCA) does not work well for a language. Language-specific collations are UCA-based, with additional language tailoring rules. Examples of such rules appear later in this section. For questions about particular language orderings, http://unicode.org provides Common Locale Data Repository (CLDR) collation charts at http://www.unicode.org/cldr/charts/30/collation/index.html.

A language name shown in the following table indicates a language-specific collation. Unicode character sets may include collations for one or more of these languages.

Unicode Collation Language Specifiers

Language Language Specifier

Classical Latin 'roman'

Croatian 'croatian'

Czech 'czech'

Danish 'danish'

Esperanto 'esperanto'

Estonian 'estonian'

German phone book order 'german2'

Hungarian 'hungarian'

Icelandic 'icelandic'

Latvian 'latvian'

Lithuanian 'lithuanian'

Persian 'persian'

Polish 'polish'

Romanian 'romanian'

Sinhala 'sinhala'

Slovak 'slovak'

Slovenian 'slovenian'

Modern Spanish 'spanish'

Traditional Spanish 'spanish2'

Swedish 'swedish'

Turkish 'turkish'

Vietnamese 'vietnamese'

Croatian collations are tailored for these Croatian letters: 'Č', 'Ć', 'Dž', 'Đ', 'Lj', 'Nj', 'Š', 'Ž'.

Danish collations may also be used for Norwegian.

For Classical Latin collations, 'I' and 'J' compare as equal, and 'U' and 'V' compare as equal.

Spanish collations are available for modern and traditional Spanish. For both, 'ñ' (n-tilde) is a separate letter between 'n' and 'o'. In addition, for traditional Spanish, 'ch' is a separate letter between 'c' and 'd', and 'll' is a separate letter between 'l' and 'm'.

Traditional Spanish collations may also be used for Asturian and Galician.

Swedish collations include Swedish rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:

 U" = Y < O"

*_general_ci Versus _unicode_ci Collations*

For any Unicode character set, operations performed using the 'XXX_general_ci' collation are faster than those for the 'XXX_unicode_ci' collation. For example, comparisons for the 'utf8_general_ci' collation are faster, but slightly less correct, than comparisons for 'utf8_unicode_ci'. The reason is that 'utf8_unicode_ci' supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, 'ss' is equal to 'ss' in German and some other languages. 'utf8_unicode_ci' also supports contractions and ignorable characters. 'utf8_general_ci' is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

To further illustrate, the following equalities hold in both 'utf8_general_ci' and 'utf8_unicode_ci' (for the effect of this in comparisons or searches, see *note charset-collation-effect::):

 A" = A
 O" = O
 U" = U

A difference between the collations is that this is true for 'utf8_general_ci':

 ss = s

Whereas this is true for 'utf8_unicode_ci', which supports the German DIN-1 ordering (also known as dictionary order):

 ss = ss

MySQL implements 'utf8' language-specific collations if the ordering with 'utf8_unicode_ci' does not work well for a language. For example, 'utf8_unicode_ci' works fine for German dictionary order and French, so there is no need to create special 'utf8' collations.

'utf8_general_ci' also is satisfactory for both German and French, except that 'ss' is equal to 's', and not to 'ss'. If this is acceptable for your application, you should use 'utf8_general_ci' because it is faster. If this is not acceptable (for example, if you require German dictionary order), use 'utf8_unicode_ci' because it is more accurate.

If you require German DIN-2 (phone book) ordering, use the 'utf8_german2_ci' collation, which compares the following sets of characters equal:

 A" = Æ = AE
 O" = Œ = OE
 U" = UE
 ss = ss

'utf8_german2_ci' is similar to 'latin1_german2_ci', but the latter does not compare 'Æ' equal to 'AE' or 'Œ' equal to 'OE'. There is no 'utf8_german_ci' corresponding to 'latin1_german_ci' for German dictionary order because 'utf8_general_ci' suffices.

Character Collating Weights

A character's collating weight is determined as follows:

Collating weights can be displayed using the 'WEIGHT_STRING()' function. (See *note string-functions::.) If a collation uses a weight lookup table, but a character is not in the table (for example, because it is a 'new' character), collating weight determination becomes more complex:

There is a difference between 'ordering by the character's code value' and 'ordering by the character's binary representation,' a difference that appears only with 'utf16_bin', because of surrogates.

Suppose that 'utf16_bin' (the binary collation for 'utf16') was a binary comparison 'byte by byte' rather than 'character by character.' If that were so, the order of characters in 'utf16_bin' would differ from the order in 'utf8_bin'. For example, the following chart shows two rare characters. The first character is in the range 'E000'-'FFFF', so it is greater than a surrogate but less than a supplementary. The second character is a supplementary.

 Code point  Character                    utf8         utf16
 ----------  ---------                    ----         -----
 0FF9D       HALFWIDTH KATAKANA LETTER N  EF BE 9D     FF 9D
 10384       UGARITIC LETTER DELTA        F0 90 8E 84  D8 00 DF 84

The two characters in the chart are in order by code point value because '0xff9d' < '0x10384'. And they are in order by 'utf8' value because '0xef' < '0xf0'. But they are not in order by 'utf16' value, if we use byte-by-byte comparison, because '0xff' > '0xd8'.

So MySQL's 'utf16_bin' collation is not 'byte by byte.' It is 'by code point.' When MySQL sees a supplementary-character encoding in 'utf16', it converts to the character's code-point value, and then compares. Therefore, 'utf8_bin' and 'utf16_bin' are the same ordering. This is consistent with the SQL:2008 standard requirement for a UCS_BASIC collation: 'UCS_BASIC is a collation in which the ordering is determined entirely by the Unicode scalar values of the characters in the strings being sorted. It is applicable to the UCS character repertoire. Since every character repertoire is a subset of the UCS repertoire, the UCS_BASIC collation is potentially applicable to every character set. NOTE 11: The Unicode scalar value of a character is its code point treated as an unsigned integer.'

If the character set is 'ucs2', comparison is byte-by-byte, but 'ucs2' strings should not contain surrogates, anyway.

Miscellaneous Information

The 'XXX_general_mysql500_ci' collations preserve the pre-5.1.24 ordering of the original 'XXX_general_ci' collations and permit upgrades for tables created before MySQL 5.1.24 (Bug #27877).

 File: manual.info.tmp, Node: charset-we-sets, Next: charset-ce-sets, Prev: charset-unicode-sets, Up: charset-charsets

10.10.2 West European Character Sets

Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.

 File: manual.info.tmp, Node: charset-ce-sets, Next: charset-se-me-sets, Prev: charset-we-sets, Up: charset-charsets

10.10.3 Central European Character Sets

MySQL provides some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, Poland, and Serbia (Latin).

 File: manual.info.tmp, Node: charset-se-me-sets, Next: charset-baltic-sets, Prev: charset-ce-sets, Up: charset-charsets

10.10.4 South European and Middle East Character Sets

South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish.

 File: manual.info.tmp, Node: charset-baltic-sets, Next: charset-cyrillic-sets, Prev: charset-se-me-sets, Up: charset-charsets

10.10.5 Baltic Character Sets

The Baltic character sets cover Estonian, Latvian, and Lithuanian languages.

 File: manual.info.tmp, Node: charset-cyrillic-sets, Next: charset-asian-sets, Prev: charset-baltic-sets, Up: charset-charsets

10.10.6 Cyrillic Character Sets

The Cyrillic character sets and collations are for use with Belarusian, Bulgarian, Russian, Ukrainian, and Serbian (Cyrillic) languages.

 File: manual.info.tmp, Node: charset-asian-sets, Next: charset-binary-set, Prev: charset-cyrillic-sets, Up: charset-charsets

10.10.7 Asian Character Sets

The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets must allow for thousands of different characters. See note charset-cp932::, for additional information about the 'cp932' and 'sjis' character sets. See note charset-gb18030::, for additional information about character set support for the Chinese National Standard GB 18030.

For answers to some common questions and problems relating support for Asian character sets in MySQL, see *note faqs-cjk::.

The 'big5_chinese_ci' collation sorts on number of strokes.

 File: manual.info.tmp, Node: charset-cp932, Next: charset-gb18030, Prev: charset-asian-sets, Up: charset-asian-sets

10.10.7.1 The cp932 Character Set .................................

Why is 'cp932' needed?

In MySQL, the 'sjis' character set corresponds to the 'Shift_JIS' character set defined by IANA, which supports JIS X0201 and JIS X0208 characters. (See http://www.iana.org/assignments/character-sets.)

However, the meaning of 'SHIFT JIS' as a descriptive term has become very vague and it often includes the extensions to 'Shift_JIS' that are defined by various vendors.

For example, 'SHIFT JIS' used in Japanese Windows environments is a Microsoft extension of 'Shift_JIS' and its exact name is 'Microsoft Windows Codepage : 932' or 'cp932'. In addition to the characters supported by 'Shift_JIS', 'cp932' supports extension characters such as NEC special characters, NEC selected--IBM extended characters, and IBM selected characters.

Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:

The MySQL 'cp932' character set is designed to solve these problems.

Because MySQL supports character set conversion, it is important to separate IANA 'Shift_JIS' and 'cp932' into two different character sets because they provide different conversion rules.

How does 'cp932' differ from 'sjis'?

The 'cp932' character set differs from 'sjis' in the following ways:

For some characters, conversion to and from 'ucs2' is different for 'sjis' and 'cp932'. The following tables illustrate these differences.

Conversion to 'ucs2':

'sjis'/'cp932' Value 'sjis' -> 'ucs2' 'cp932' -> 'ucs2' Conversion Conversion

5C

005C 005C

7E

007E 007E

815C

2015 2015

815F

005C FF3C

8160

301C FF5E

8161

2016 2225

817C

2212 FF0D

8191

00A2 FFE0

8192

00A3 FFE1

81CA

00AC FFE2

Conversion from 'ucs2':

'ucs2' value 'ucs2' -> 'sjis' 'ucs2' -> 'cp932' Conversion Conversion

005C

815F 5C

007E

7E 7E

00A2

8191 3F

00A3

8192 3F

00AC

81CA 3F

2015

815C 815C

2016

8161 3F

2212

817C 3F

2225

3F 8161

301C

8160 3F

FF0D

3F 817C

FF3C

3F 815F

FF5E

3F 8160

FFE0

3F 8191

FFE1

3F 8192

FFE2

3F 81CA

Users of any Japanese character sets should be aware that using '--character-set-client-handshake' (or '--skip-character-set-client-handshake') has an important effect. See *note server-options::.

 File: manual.info.tmp, Node: charset-gb18030, Prev: charset-cp932, Up: charset-asian-sets

10.10.7.2 The gb18030 Character Set ...................................

In MySQL, the 'gb18030' character set corresponds to the 'Chinese National Standard GB 18030-2005: Information technology --- Chinese coded character set', which is the official character set of the People's Republic of China (PRC).

Characteristics of the MySQL gb18030 Character Set

Supported Collations

 File: manual.info.tmp, Node: charset-binary-set, Prev: charset-asian-sets, Up: charset-charsets

10.10.8 The Binary Character Set

The 'binary' character set is the character set for binary strings, which are sequences of bytes. The 'binary' character set has one collation, also named 'binary'. Comparison and sorting are based on numeric byte values, rather than on numeric character code values (which for multibyte characters differ from numeric byte values). For information about the differences between the 'binary' collation of the 'binary' character set and the '_bin' collations of nonbinary character sets, see *note charset-binary-collations::.

For the 'binary' character set, the concepts of lettercase and accent equivalence do not apply:

To perform lettercase conversion of a binary string, first convert it to a nonbinary string using a character set appropriate for the data stored in the string:

 mysql> SET @str = BINARY 'New York';
 mysql> SELECT LOWER(@str), LOWER(CONVERT(@str USING utf8mb4));
 +-------------+------------------------------------+
 | LOWER(@str) | LOWER(CONVERT(@str USING utf8mb4)) |
 +-------------+------------------------------------+
 | New York    | new york                           |
 +-------------+------------------------------------+

To convert a string expression to a binary string, these constructs are equivalent:

 BINARY EXPR
 CAST(EXPR AS BINARY)
 CONVERT(EXPR USING BINARY)

If a value is a character string literal, the '_binary' introducer may be used to designate it as a binary string. For example:

 _binary 'a'

The '_binary' introducer is permitted for hexadecimal literals and bit-value literals as well, but unnecessary; such literals are binary strings by default.

For more information about introducers, see *note charset-introducer::.

Note:

Within the note 'mysql': mysql. client, binary strings display using hexadecimal notation, depending on the value of the '--binary-as-hex'. For more information about that option, see note mysql::.

 File: manual.info.tmp, Node: charset-restrictions, Next: error-message-language, Prev: charset-charsets, Up: charset