Menu:
charset-binary-set:: The Binary Character Set
This section indicates which character sets MySQL supports. There is one subsection for each group of related character sets. For each character set, the permissible collations are listed.
To list the available character sets and their default collations, use the note 'SHOW CHARACTER SET': show-character-set. statement or query the 'INFORMATION_SCHEMA' note 'CHARACTER_SETS': information-schema-character-sets-table. table. For example:
mysql> SHOW CHARACTER SET;
+----------+---------------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+---------------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
| sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 |
| tis620 | TIS620 Thai | tis620_thai_ci | 1 |
| euckr | EUC-KR Korean | euckr_korean_ci | 2 |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |
| greek | ISO 8859-7 Greek | greek_general_ci | 1 |
| cp1250 | Windows Central European | cp1250_general_ci | 1 |
| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci | 1 |
| utf8 | UTF-8 Unicode | utf8_general_ci | 3 |
| ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |
| cp866 | DOS Russian | cp866_general_ci | 1 |
| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci | 1 |
| macce | Mac Central European | macce_general_ci | 1 |
| macroman | Mac West European | macroman_general_ci | 1 |
| cp852 | DOS Central European | cp852_general_ci | 1 |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci | 4 |
| cp1251 | Windows Cyrillic | cp1251_general_ci | 1 |
| utf16 | UTF-16 Unicode | utf16_general_ci | 4 |
| utf16le | UTF-16LE Unicode | utf16le_general_ci | 4 |
| cp1256 | Windows Arabic | cp1256_general_ci | 1 |
| cp1257 | Windows Baltic | cp1257_general_ci | 1 |
| utf32 | UTF-32 Unicode | utf32_general_ci | 4 |
| binary | Binary pseudo charset | binary | 1 |
| geostd8 | GEOSTD8 Georgian | geostd8_general_ci | 1 |
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci | 2 |
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci | 3 |
| gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 |
+----------+---------------------------------+---------------------+--------+
In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.
File: manual.info.tmp, Node: charset-unicode-sets, Next: charset-we-sets, Prev: charset-charsets, Up: charset-charsets
This section describes the collations available for Unicode character sets and their differentiating properties. For general information about Unicode, see *note charset-unicode::.
MySQL supports multiple Unicode character sets:
'utf8mb4': A UTF-8 encoding of the Unicode character set using one to four bytes per character.
'utf8mb3': A UTF-8 encoding of the Unicode character set using one to three bytes per character.
'utf8': An alias for 'utf8mb3'.
'ucs2': The UCS-2 encoding of the Unicode character set using two bytes per character.
'utf16': The UTF-16 encoding for the Unicode character set using two or four bytes per character. Like 'ucs2' but with an extension for supplementary characters.
'utf16le': The UTF-16LE encoding for the Unicode character set. Like 'utf16' but little-endian rather than big-endian.
'utf32': The UTF-32 encoding for the Unicode character set using four bytes per character.
'utf8mb4', 'utf16', 'utf16le', and 'utf32' support Basic Multilingual Plane (BMP) characters and supplementary characters that lie outside the BMP. 'utf8' and 'ucs2' support only BMP characters.
Most Unicode character sets have a general collation (indicated by '_general' in the name or by the absence of a language specifier), a binary collation (indicated by '_bin' in the name), and several language-specific collations (indicated by language specifiers). For example, for 'utf8mb4', 'utf8mb4_general_ci' and 'utf8mb4_bin' are its general and binary collations, and 'utf8mb4_danish_ci' is one of its language-specific collations.
Collation support for 'utf16le' is limited. The only collations available are 'utf16le_general_ci' and 'utf16le_bin'. These are similar to 'utf16_general_ci' and 'utf16_bin'.
*note charset-unicode-sets-uca::
*note charset-unicode-sets-language-specific-collations::
*note charset-unicode-sets-general-versus-unicode::
*note charset-unicode-sets-collating-weights::
*note charset-unicode-sets-miscellaneous::
Unicode Collation Algorithm (UCA) Versions
MySQL implements the 'XXX_unicode_ci' collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. The 'XXX_unicode_ci' collations have only partial support for the Unicode Collation Algorithm. Some characters are not supported, and combining marks are not fully supported. This affects primarily Vietnamese, Yoruba, and some smaller languages such as Navajo. A combined character is considered different from the same character written with a single unicode character in string comparisons, and the two characters are considered to have a different length (for example, as returned by the 'CHAR_LENGTH()' function or in result set metadata).
Unicode collations based on UCA versions higher than 4.0.0 include the version in the collation name. Thus, 'utf8_unicode_520_ci' is based on UCA 5.2.0 weight keys (http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt).
The 'LOWER()' and 'UPPER()' functions perform case folding according to the collation of their argument. A character that has uppercase and lowercase versions only in a Unicode version higher than 4.0.0 is converted by these functions only if the argument collation uses a high enough UCA version.
Language-Specific Collations
MySQL implements language-specific Unicode collations if the ordering based only on the Unicode Collation Algorithm (UCA) does not work well for a language. Language-specific collations are UCA-based, with additional language tailoring rules. Examples of such rules appear later in this section. For questions about particular language orderings, http://unicode.org provides Common Locale Data Repository (CLDR) collation charts at http://www.unicode.org/cldr/charts/30/collation/index.html.
A language name shown in the following table indicates a language-specific collation. Unicode character sets may include collations for one or more of these languages.
Unicode Collation Language Specifiers
Language Language Specifier
Classical Latin 'roman'
Croatian 'croatian'
Czech 'czech'
Danish 'danish'
Esperanto 'esperanto'
Estonian 'estonian'
German phone book order 'german2'
Hungarian 'hungarian'
Icelandic 'icelandic'
Latvian 'latvian'
Lithuanian 'lithuanian'
Persian 'persian'
Polish 'polish'
Romanian 'romanian'
Sinhala 'sinhala'
Slovak 'slovak'
Slovenian 'slovenian'
Modern Spanish 'spanish'
Traditional Spanish 'spanish2'
Swedish 'swedish'
Turkish 'turkish'
Vietnamese 'vietnamese'
Croatian collations are tailored for these Croatian letters: 'Č', 'Ć', 'Dž', 'Đ', 'Lj', 'Nj', 'Š', 'Ž'.
Danish collations may also be used for Norwegian.
For Classical Latin collations, 'I' and 'J' compare as equal, and 'U' and 'V' compare as equal.
Spanish collations are available for modern and traditional Spanish. For both, 'ñ' (n-tilde) is a separate letter between 'n' and 'o'. In addition, for traditional Spanish, 'ch' is a separate letter between 'c' and 'd', and 'll' is a separate letter between 'l' and 'm'.
Traditional Spanish collations may also be used for Asturian and Galician.
Swedish collations include Swedish rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:
U" = Y < O"
*_general_ci Versus _unicode_ci Collations*
For any Unicode character set, operations performed using the 'XXX_general_ci' collation are faster than those for the 'XXX_unicode_ci' collation. For example, comparisons for the 'utf8_general_ci' collation are faster, but slightly less correct, than comparisons for 'utf8_unicode_ci'. The reason is that 'utf8_unicode_ci' supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, 'ss' is equal to 'ss' in German and some other languages. 'utf8_unicode_ci' also supports contractions and ignorable characters. 'utf8_general_ci' is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both 'utf8_general_ci' and 'utf8_unicode_ci' (for the effect of this in comparisons or searches, see *note charset-collation-effect::):
A" = A
O" = O
U" = U
A difference between the collations is that this is true for 'utf8_general_ci':
ss = s
Whereas this is true for 'utf8_unicode_ci', which supports the German DIN-1 ordering (also known as dictionary order):
ss = ss
MySQL implements 'utf8' language-specific collations if the ordering with 'utf8_unicode_ci' does not work well for a language. For example, 'utf8_unicode_ci' works fine for German dictionary order and French, so there is no need to create special 'utf8' collations.
'utf8_general_ci' also is satisfactory for both German and French, except that 'ss' is equal to 's', and not to 'ss'. If this is acceptable for your application, you should use 'utf8_general_ci' because it is faster. If this is not acceptable (for example, if you require German dictionary order), use 'utf8_unicode_ci' because it is more accurate.
If you require German DIN-2 (phone book) ordering, use the 'utf8_german2_ci' collation, which compares the following sets of characters equal:
A" = Æ = AE
O" = Œ = OE
U" = UE
ss = ss
'utf8_german2_ci' is similar to 'latin1_german2_ci', but the latter does not compare 'Æ' equal to 'AE' or 'Œ' equal to 'OE'. There is no 'utf8_german_ci' corresponding to 'latin1_german_ci' for German dictionary order because 'utf8_general_ci' suffices.
Character Collating Weights
A character's collating weight is determined as follows:
For all Unicode collations except the '_bin' (binary) collations, MySQL performs a table lookup to find a character's collating weight.
For '_bin' collations, the weight is based on the code point, possibly with leading zero bytes added.
Collating weights can be displayed using the 'WEIGHT_STRING()' function. (See *note string-functions::.) If a collation uses a weight lookup table, but a character is not in the table (for example, because it is a 'new' character), collating weight determination becomes more complex:
For BMP characters in general collations ('XXX_general_ci'), the weight is the code point.
For BMP characters in UCA collations (for example, 'XXX_unicode_ci' and language-specific collations), the following algorithm applies:
if (code >= 0x3400 && code <= 0x4DB5)
base= 0xFB80; /* CJK Ideograph Extension */
else if (code >= 0x4E00 && code <= 0x9FA5)
base= 0xFB40; /* CJK Ideograph */
else
base= 0xFBC0; /* All other characters */
aaaa= base + (code >> 15);
bbbb= (code & 0x7FFF) | 0x8000;
The result is a sequence of two collating elements, 'aaaa' followed by 'bbbb'. For example:
mysql> SELECT HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci));
+----------------------------------------------------------+
| HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci)) |
+----------------------------------------------------------+
| FBC084CF |
+----------------------------------------------------------+
Thus, 'U+04cf CYRILLIC SMALL LETTER PALOCHKA' is, with all UCA 4.0.0 collations, greater than 'U+04c0 CYRILLIC LETTER PALOCHKA'. With UCA 5.2.0 collations, all palochkas sort together.
For supplementary characters in general collations, the weight is the weight for '0xfffd REPLACEMENT CHARACTER'. For supplementary characters in UCA 4.0.0 collations, their collating weight is '0xfffd'. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.
An example with Deseret characters and 'COUNT(DISTINCT)':
CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci);
INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */
INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */
INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */
SELECT COUNT(DISTINCT s1) FROM t;
The result is 2 because in the MySQL 'XXX_unicode_ci' collations, the replacement character has a weight of '0x0dc6', whereas Deseret Bee and Deseret Tee both have a weight of '0xfffd'. (Were the 'utf32_general_ci' collation used instead, the result is 1 because all three characters have a weight of '0xfffd' in that collation.)
An example with cuneiform characters and 'WEIGHT_STRING()':
/*
The four characters in the INSERT string are
00000041 # LATIN CAPITAL LETTER A
0001218F # CUNEIFORM SIGN KAB
000121A7 # CUNEIFORM SIGN KISH
00000042 # LATIN CAPITAL LETTER B
*/
CREATE TABLE t (s1 CHAR(4) CHARACTER SET utf32 COLLATE utf32_unicode_ci);
INSERT INTO t VALUES (0x000000410001218f000121a700000042);
SELECT HEX(WEIGHT_STRING(s1)) FROM t;
The result is:
0E33 FFFD FFFD 0E4A
'0E33' and '0E4A' are primary weights as in UCA 4.0.0 (ftp://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt). 'FFFD' is the weight for KAB and also for KISH.
The rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it is very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by the MySQL rule and secondarily by code point value, it is easy:
ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin
For supplementary characters based on UCA versions higher than 4.0.0 (for example, 'XXX_unicode_520_ci'), supplementary characters do not necessarily all have the same collating weight. Some have explicit weights from the UCA 'allkeys.txt' file. Others have weights calculated from this algorithm:
aaaa= base + (code >> 15);
bbbb= (code & 0x7FFF) | 0x8000;
There is a difference between 'ordering by the character's code value' and 'ordering by the character's binary representation,' a difference that appears only with 'utf16_bin', because of surrogates.
Suppose that 'utf16_bin' (the binary collation for 'utf16') was a binary comparison 'byte by byte' rather than 'character by character.' If that were so, the order of characters in 'utf16_bin' would differ from the order in 'utf8_bin'. For example, the following chart shows two rare characters. The first character is in the range 'E000'-'FFFF', so it is greater than a surrogate but less than a supplementary. The second character is a supplementary.
Code point Character utf8 utf16
---------- --------- ---- -----
0FF9D HALFWIDTH KATAKANA LETTER N EF BE 9D FF 9D
10384 UGARITIC LETTER DELTA F0 90 8E 84 D8 00 DF 84
The two characters in the chart are in order by code point value because '0xff9d' < '0x10384'. And they are in order by 'utf8' value because '0xef' < '0xf0'. But they are not in order by 'utf16' value, if we use byte-by-byte comparison, because '0xff' > '0xd8'.
So MySQL's 'utf16_bin' collation is not 'byte by byte.' It is 'by code point.' When MySQL sees a supplementary-character encoding in 'utf16', it converts to the character's code-point value, and then compares. Therefore, 'utf8_bin' and 'utf16_bin' are the same ordering. This is consistent with the SQL:2008 standard requirement for a UCS_BASIC collation: 'UCS_BASIC is a collation in which the ordering is determined entirely by the Unicode scalar values of the characters in the strings being sorted. It is applicable to the UCS character repertoire. Since every character repertoire is a subset of the UCS repertoire, the UCS_BASIC collation is potentially applicable to every character set. NOTE 11: The Unicode scalar value of a character is its code point treated as an unsigned integer.'
If the character set is 'ucs2', comparison is byte-by-byte, but 'ucs2' strings should not contain surrogates, anyway.
Miscellaneous Information
The 'XXX_general_mysql500_ci' collations preserve the pre-5.1.24 ordering of the original 'XXX_general_ci' collations and permit upgrades for tables created before MySQL 5.1.24 (Bug #27877).
File: manual.info.tmp, Node: charset-we-sets, Next: charset-ce-sets, Prev: charset-unicode-sets, Up: charset-charsets
Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.
'ascii' (US ASCII) collations:
* 'ascii_bin'
* 'ascii_general_ci' (default)
'cp850' (DOS West European) collations:
* 'cp850_bin'
* 'cp850_general_ci' (default)
'dec8' (DEC Western European) collations:
* 'dec8_bin'
* 'dec8_swedish_ci' (default)
'hp8' (HP Western European) collations:
* 'hp8_bin'
* 'hp8_english_ci' (default)
'latin1' (cp1252 West European) collations:
* 'latin1_bin'
* 'latin1_danish_ci'
* 'latin1_general_ci'
* 'latin1_general_cs'
* 'latin1_german1_ci'
* 'latin1_german2_ci'
* 'latin1_spanish_ci'
* 'latin1_swedish_ci' (default)
'latin1' is the default character set. MySQL's 'latin1' is the same as the Windows 'cp1252' character set. This means it is the same as the official 'ISO 8859-1' or IANA (Internet Assigned Numbers Authority) 'latin1', except that IANA 'latin1' treats the code points between '0x80' and '0x9f' as 'undefined,' whereas 'cp1252', and therefore MySQL's 'latin1', assign characters for those positions. For example, '0x80' is the Euro sign. For the 'undefined' entries in 'cp1252', MySQL translates '0x81' to Unicode '0x0081', '0x8d' to '0x008d', '0x8f' to '0x008f', '0x90' to '0x0090', and '0x9d' to '0x009d'.
The 'latin1_swedish_ci' collation is the default that probably is used by the majority of MySQL customers. Although it is frequently said that it is based on the Swedish/Finnish collation rules, there are Swedes and Finns who disagree with this statement.
The 'latin1_german1_ci' and 'latin1_german2_ci' collations are based on the DIN-1 and DIN-2 standards, where DIN stands for Deutsches Institut fu"r Normung (the German equivalent of ANSI). DIN-1 is called the 'dictionary collation' and DIN-2 is called the 'phone book collation.' For an example of the effect this has in comparisons or when doing searches, see *note charset-collation-effect::.
* 'latin1_german1_ci' (dictionary) rules:
A" = A
O" = O
U" = U
ss = s
* 'latin1_german2_ci' (phone-book) rules:
A" = AE
O" = OE
U" = UE
ss = ss
In the 'latin1_spanish_ci' collation, 'ñ' (n-tilde) is a separate letter between 'n' and 'o'.
'macroman' (Mac West European) collations:
* 'macroman_bin'
* 'macroman_general_ci' (default)
'swe7' (7bit Swedish) collations:
* 'swe7_bin'
* 'swe7_swedish_ci' (default)
File: manual.info.tmp, Node: charset-ce-sets, Next: charset-se-me-sets, Prev: charset-we-sets, Up: charset-charsets
MySQL provides some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, Poland, and Serbia (Latin).
'cp1250' (Windows Central European) collations:
* 'cp1250_bin'
* 'cp1250_croatian_ci'
* 'cp1250_czech_cs'
* 'cp1250_general_ci' (default)
* 'cp1250_polish_ci'
'cp852' (DOS Central European) collations:
* 'cp852_bin'
* 'cp852_general_ci' (default)
'keybcs2' (DOS Kamenicky Czech-Slovak) collations:
* 'keybcs2_bin'
* 'keybcs2_general_ci' (default)
'latin2' (ISO 8859-2 Central European) collations:
* 'latin2_bin'
* 'latin2_croatian_ci'
* 'latin2_czech_cs'
* 'latin2_general_ci' (default)
* 'latin2_hungarian_ci'
'macce' (Mac Central European) collations:
* 'macce_bin'
* 'macce_general_ci' (default)
File: manual.info.tmp, Node: charset-se-me-sets, Next: charset-baltic-sets, Prev: charset-ce-sets, Up: charset-charsets
South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish.
'armscii8' (ARMSCII-8 Armenian) collations:
* 'armscii8_bin'
* 'armscii8_general_ci' (default)
'cp1256' (Windows Arabic) collations:
* 'cp1256_bin'
* 'cp1256_general_ci' (default)
'geostd8' (GEOSTD8 Georgian) collations:
* 'geostd8_bin'
* 'geostd8_general_ci' (default)
'greek' (ISO 8859-7 Greek) collations:
* 'greek_bin'
* 'greek_general_ci' (default)
'hebrew' (ISO 8859-8 Hebrew) collations:
* 'hebrew_bin'
* 'hebrew_general_ci' (default)
'latin5' (ISO 8859-9 Turkish) collations:
* 'latin5_bin'
* 'latin5_turkish_ci' (default)
File: manual.info.tmp, Node: charset-baltic-sets, Next: charset-cyrillic-sets, Prev: charset-se-me-sets, Up: charset-charsets
The Baltic character sets cover Estonian, Latvian, and Lithuanian languages.
'cp1257' (Windows Baltic) collations:
* 'cp1257_bin'
* 'cp1257_general_ci' (default)
* 'cp1257_lithuanian_ci'
'latin7' (ISO 8859-13 Baltic) collations:
* 'latin7_bin'
* 'latin7_estonian_cs'
* 'latin7_general_ci' (default)
* 'latin7_general_cs'
File: manual.info.tmp, Node: charset-cyrillic-sets, Next: charset-asian-sets, Prev: charset-baltic-sets, Up: charset-charsets
The Cyrillic character sets and collations are for use with Belarusian, Bulgarian, Russian, Ukrainian, and Serbian (Cyrillic) languages.
'cp1251' (Windows Cyrillic) collations:
* 'cp1251_bin'
* 'cp1251_bulgarian_ci'
* 'cp1251_general_ci' (default)
* 'cp1251_general_cs'
* 'cp1251_ukrainian_ci'
'cp866' (DOS Russian) collations:
* 'cp866_bin'
* 'cp866_general_ci' (default)
'koi8r' (KOI8-R Relcom Russian) collations:
* 'koi8r_bin'
* 'koi8r_general_ci' (default)
'koi8u' (KOI8-U Ukrainian) collations:
* 'koi8u_bin'
* 'koi8u_general_ci' (default)
File: manual.info.tmp, Node: charset-asian-sets, Next: charset-binary-set, Prev: charset-cyrillic-sets, Up: charset-charsets
Menu:
charset-gb18030:: The gb18030 Character Set
The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets must allow for thousands of different characters. See note charset-cp932::, for additional information about the 'cp932' and 'sjis' character sets. See note charset-gb18030::, for additional information about character set support for the Chinese National Standard GB 18030.
For answers to some common questions and problems relating support for Asian character sets in MySQL, see *note faqs-cjk::.
'big5' (Big5 Traditional Chinese) collations:
* 'big5_bin'
* 'big5_chinese_ci' (default)
*note 'cp932': charset-cp932. (SJIS for Windows Japanese) collations:
* 'cp932_bin'
* 'cp932_japanese_ci' (default)
'eucjpms' (UJIS for Windows Japanese) collations:
* 'eucjpms_bin'
* 'eucjpms_japanese_ci' (default)
'euckr' (EUC-KR Korean) collations:
* 'euckr_bin'
* 'euckr_korean_ci' (default)
'gb2312' (GB2312 Simplified Chinese) collations:
* 'gb2312_bin'
* 'gb2312_chinese_ci' (default)
'gbk' (GBK Simplified Chinese) collations:
* 'gbk_bin'
* 'gbk_chinese_ci' (default)
*note 'gb18030': charset-gb18030. (China National Standard GB18030) collations:
* 'gb18030_bin'
* 'gb18030_chinese_ci' (default)
* 'gb18030_unicode_520_ci'
'sjis' (Shift-JIS Japanese) collations:
* 'sjis_bin'
* 'sjis_japanese_ci' (default)
'tis620' (TIS620 Thai) collations:
* 'tis620_bin'
* 'tis620_thai_ci' (default)
'ujis' (EUC-JP Japanese) collations:
* 'ujis_bin'
* 'ujis_japanese_ci' (default)
The 'big5_chinese_ci' collation sorts on number of strokes.
File: manual.info.tmp, Node: charset-cp932, Next: charset-gb18030, Prev: charset-asian-sets, Up: charset-asian-sets
10.10.7.1 The cp932 Character Set .................................
Why is 'cp932' needed?
In MySQL, the 'sjis' character set corresponds to the 'Shift_JIS' character set defined by IANA, which supports JIS X0201 and JIS X0208 characters. (See http://www.iana.org/assignments/character-sets.)
However, the meaning of 'SHIFT JIS' as a descriptive term has become very vague and it often includes the extensions to 'Shift_JIS' that are defined by various vendors.
For example, 'SHIFT JIS' used in Japanese Windows environments is a Microsoft extension of 'Shift_JIS' and its exact name is 'Microsoft Windows Codepage : 932' or 'cp932'. In addition to the characters supported by 'Shift_JIS', 'cp932' supports extension characters such as NEC special characters, NEC selected--IBM extended characters, and IBM selected characters.
Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:
MySQL automatically converts character sets.
Character sets are converted using Unicode ('ucs2').
The 'sjis' character set does not support the conversion of these extension characters.
There are several conversion rules from so-called 'SHIFT JIS' to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).
The MySQL 'cp932' character set is designed to solve these problems.
Because MySQL supports character set conversion, it is important to separate IANA 'Shift_JIS' and 'cp932' into two different character sets because they provide different conversion rules.
How does 'cp932' differ from 'sjis'?
The 'cp932' character set differs from 'sjis' in the following ways:
'cp932' supports NEC special characters, NEC selected--IBM extended characters, and IBM selected characters.
Some 'cp932' characters have two different code points, both of which convert to the same Unicode code point. When converting from Unicode back to 'cp932', one of the code points must be selected. For this 'round trip conversion,' the rule recommended by Microsoft is used. (See http://support.microsoft.com/kb/170559/EN-US/.)
The conversion rule works like this:
* If the character is in both JIS X 0208 and NEC special
characters, use the code point of JIS X 0208.
* If the character is in both NEC special characters and IBM
selected characters, use the code point of NEC special
characters.
* If the character is in both IBM selected characters and NEC
selected--IBM extended characters, use the code point of IBM
extended characters.
The table shown at https://msdn.microsoft.com/en-us/goglobal/cc305152.aspx provides information about the Unicode values of 'cp932' characters. For 'cp932' table entries with characters under which a four-digit number appears, the number represents the corresponding Unicode ('ucs2') encoding. For table entries with an underlined two-digit value appears, there is a range of 'cp932' character values that begin with those two digits. Clicking such a table entry takes you to a page that displays the Unicode value for each of the 'cp932' characters that begin with those digits.
The following links are of special interest. They correspond to the encodings for the following sets of characters:
* NEC special characters (lead byte '0x87'):
<https://msdn.microsoft.com/en-us/goglobal/gg674964>
* NEC selected--IBM extended characters (lead byte '0xED' and
'0xEE'):
<https://msdn.microsoft.com/en-us/goglobal/gg671837>
<https://msdn.microsoft.com/en-us/goglobal/gg671838>
* IBM selected characters (lead byte '0xFA', '0xFB', '0xFC'):
<https://msdn.microsoft.com/en-us/goglobal/gg671839>
<https://msdn.microsoft.com/en-us/goglobal/gg671840>
<https://msdn.microsoft.com/en-us/goglobal/gg671841>
'cp932' supports conversion of user-defined characters in combination with 'eucjpms', and solves the problems with 'sjis'/'ujis' conversion. For details, please refer to http://www.sljfaq.org/afaq/encodings.html.
For some characters, conversion to and from 'ucs2' is different for 'sjis' and 'cp932'. The following tables illustrate these differences.
Conversion to 'ucs2':
'sjis'/'cp932' Value 'sjis' -> 'ucs2' 'cp932' -> 'ucs2' Conversion Conversion
5C
005C 005C
7E
007E 007E
815C
2015 2015
815F
005C FF3C
8160
301C FF5E
8161
2016 2225
817C
2212 FF0D
8191
00A2 FFE0
8192
00A3 FFE1
81CA
00AC FFE2
Conversion from 'ucs2':
'ucs2' value 'ucs2' -> 'sjis' 'ucs2' -> 'cp932' Conversion Conversion
005C
815F 5C
007E
7E 7E
00A2
8191 3F
00A3
8192 3F
00AC
81CA 3F
2015
815C 815C
2016
8161 3F
2212
817C 3F
2225
3F 8161
301C
8160 3F
FF0D
3F 817C
FF3C
3F 815F
FF5E
3F 8160
FFE0
3F 8191
FFE1
3F 8192
FFE2
3F 81CA
Users of any Japanese character sets should be aware that using '--character-set-client-handshake' (or '--skip-character-set-client-handshake') has an important effect. See *note server-options::.
File: manual.info.tmp, Node: charset-gb18030, Prev: charset-cp932, Up: charset-asian-sets
10.10.7.2 The gb18030 Character Set ...................................
In MySQL, the 'gb18030' character set corresponds to the 'Chinese National Standard GB 18030-2005: Information technology --- Chinese coded character set', which is the official character set of the People's Republic of China (PRC).
Characteristics of the MySQL gb18030 Character Set
Supports all code points defined by the GB 18030-2005 standard. Unassigned code points in the ranges (GB+8431A439, GB+90308130) and (GB+E3329A36, GB+EF39EF39) are treated as ''?'' (0x3F). Conversion of unassigned code points return ''?''.
Supports UPPER and LOWER conversion for all GB18030 code points. Case folding defined by Unicode is also supported (based on 'CaseFolding-6.3.0.txt').
Supports Conversion of data to and from other character sets.
Supports SQL statements such as *note 'SET NAMES': set-names.
Supports comparison between 'gb18030' strings, and between 'gb18030' strings and strings of other character sets. There is a conversion if strings have different character sets. Comparisons that include or ignore trailing spaces are also supported.
The private use area (U+E000, U+F8FF) in Unicode is mapped to 'gb18030'.
There is no mapping between (U+D800, U+DFFF) and GB18030. Attempted conversion of code points in this range returns ''?''.
If an incoming sequence is illegal, an error or warning is returned. If an illegal sequence is used in 'CONVERT()', an error is returned. Otherwise, a warning is returned.
For consistency with 'utf8' and 'utf8mb4', UPPER is not supported for ligatures.
Searches for ligatures also match uppercase ligatures when using the 'gb18030_unicode_520_ci' collation.
If a character has more than one uppercase character, the chosen uppercase character is the one whose lowercase is the character itself.
The minimum multibyte length is 1 and the maximum is 4. The character set determines the length of a sequence using the first 1 or 2 bytes.
Supported Collations
'gb18030_bin': A binary collation.
'gb18030_chinese_ci': The default collation, which supports Pinyin. Sorting of non-Chinese characters is based on the order of the original sort key. The original sort key is 'GB(UPPER(ch))' if 'UPPER(ch)' exists. Otherwise, the original sort key is 'GB(ch)'. Chinese characters are sorted according to the Pinyin collation defined in the Unicode Common Locale Data Repository (CLDR 24). Non-Chinese characters are sorted before Chinese characters with the exception of 'GB+FE39FE39', which is the code point maximum.
'gb18030_unicode_520_ci': A Unicode collation. Use this collation if you need to ensure that ligatures are sorted correctly.
File: manual.info.tmp, Node: charset-binary-set, Prev: charset-asian-sets, Up: charset-charsets
The 'binary' character set is the character set for binary strings, which are sequences of bytes. The 'binary' character set has one collation, also named 'binary'. Comparison and sorting are based on numeric byte values, rather than on numeric character code values (which for multibyte characters differ from numeric byte values). For information about the differences between the 'binary' collation of the 'binary' character set and the '_bin' collations of nonbinary character sets, see *note charset-binary-collations::.
For the 'binary' character set, the concepts of lettercase and accent equivalence do not apply:
For single-byte characters stored as binary strings, character and byte boundaries are the same, so lettercase and accent differences are significant in comparisons. That is, the 'binary' collation is case-sensitive and accent-sensitive.
mysql> SET NAMES 'binary';
mysql> SELECT CHARSET('abc'), COLLATION('abc');
+----------------+------------------+
| CHARSET('abc') | COLLATION('abc') |
+----------------+------------------+
| binary | binary |
+----------------+------------------+
mysql> SELECT 'abc' = 'ABC', 'a' = 'a"';
+---------------+------------+
| 'abc' = 'ABC' | 'a' = 'a"' |
+---------------+------------+
| 0 | 0 |
+---------------+------------+
For multibyte characters stored as binary strings, character and byte boundaries differ. Character boundaries are lost, so comparisons that depend on them are not meaningful.
To perform lettercase conversion of a binary string, first convert it to a nonbinary string using a character set appropriate for the data stored in the string:
mysql> SET @str = BINARY 'New York';
mysql> SELECT LOWER(@str), LOWER(CONVERT(@str USING utf8mb4));
+-------------+------------------------------------+
| LOWER(@str) | LOWER(CONVERT(@str USING utf8mb4)) |
+-------------+------------------------------------+
| New York | new york |
+-------------+------------------------------------+
To convert a string expression to a binary string, these constructs are equivalent:
BINARY EXPR
CAST(EXPR AS BINARY)
CONVERT(EXPR USING BINARY)
If a value is a character string literal, the '_binary' introducer may be used to designate it as a binary string. For example:
_binary 'a'
The '_binary' introducer is permitted for hexadecimal literals and bit-value literals as well, but unnecessary; such literals are binary strings by default.
For more information about introducers, see *note charset-introducer::.
Note:
Within the note 'mysql': mysql. client, binary strings display using hexadecimal notation, depending on the value of the '--binary-as-hex'. For more information about that option, see note mysql::.
File: manual.info.tmp, Node: charset-restrictions, Next: error-message-language, Prev: charset-charsets, Up: charset