A.11 MySQL 5.7 FAQ: MySQL Chinese, Japanese, and Korean Character Sets

This set of Frequently Asked Questions derives from the experience of MySQL's Support and Development groups in handling many inquiries about CJK (Chinese-Japanese-Korean) issues.

What CJK character sets are available in MySQL?

The list of CJK character sets may vary depending on your MySQL version. For example, the 'gb18030' character set is not supported prior to MySQL 5.7.4. However, since the name of the applicable language appears in the 'DESCRIPTION' column for every entry in the *note 'INFORMATION_SCHEMA.CHARACTER_SETS': information-schema-character-sets-table. table, you can obtain a current list of all the non-Unicode CJK character sets using this query:

 mysql> SELECT CHARACTER_SET_NAME, DESCRIPTION
        FROM INFORMATION_SCHEMA.CHARACTER_SETS
        WHERE DESCRIPTION LIKE '%Chin%'
        OR DESCRIPTION LIKE '%Japanese%'
        OR DESCRIPTION LIKE '%Korean%'
        ORDER BY CHARACTER_SET_NAME;
 +--------------------+---------------------------------+
 | CHARACTER_SET_NAME | DESCRIPTION                     |
 +--------------------+---------------------------------+
 | big5               | Big5 Traditional Chinese        |
 | cp932              | SJIS for Windows Japanese       |
 | eucjpms            | UJIS for Windows Japanese       |
 | euckr              | EUC-KR Korean                   |
 | gb18030            | China National Standard GB18030 |
 | gb2312             | GB2312 Simplified Chinese       |
 | gbk                | GBK Simplified Chinese          |
 | sjis               | Shift-JIS Japanese              |
 | ujis               | EUC-JP Japanese                 |
 +--------------------+---------------------------------+

(For more information, see *note information-schema-character-sets-table::.)

MySQL supports three variants of the GB (Guojia Biaozhun, or National Standard, or Simplified Chinese) character sets which are official in the People's Republic of China: 'gb2312', 'gbk', and (as of MySQL 5.7.4) 'gb18030'.

Sometimes people try to insert 'gbk' characters into 'gb2312', and it works most of the time because 'gbk' is a superset of 'gb2312'. But eventually they try to insert a rarer Chinese character and it does not work. (For an example, see Bug #16072).

Here, we try to clarify exactly what characters are legitimate in 'gb2312' or 'gbk', with reference to the official documents. Please check these references before reporting 'gb2312' or 'gbk' bugs:

The MySQL 'gbk' character set is in reality 'Microsoft code page 936'. This differs from the official 'gbk' for characters 'A1A4' (middle dot), 'A1AA' (em dash), 'A6E0-A6F5', and 'A8BB-A8C0'.
For a listing of 'gbk'/Unicode mappings, see http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT.

It is also possible to store CJK characters in Unicode character sets, although the available collations may not sort characters quite as you expect:

The 'utf8' and 'ucs2' character sets support the characters from Unicode Basic Multilingual Plane (BMP). These characters have code point values between 'U+0000' and 'U+FFFF'.
The 'utf8mb4', 'utf16', 'utf16le', and 'utf32' character sets support BMP characters, as well as supplementary characters that lie outside the BMP. Supplementary characters have code point values between 'U+10000' and 'U+10FFFF'.

The collation used for a Unicode character set determines the ability to sort (that is, distinguish) characters in the set:

Collations based on Unicode Collation Algorithm (UCA) 4.0.0 distinguish only BMP characters.
Collations based on UCA 5.2.0 or 9.0.0 distinguish BMP and supplementary characters.
Non-UCA collations may not distinguish all Unicode characters. For example, the 'utf8mb4' default collation is 'utf8mb4_general_ci', which distinguishes only BMP characters.

Moreover, distinguishing characters is not the same as ordering them per the conventions of a given CJK language. Currently, MySQL has only one CJK-specific UCA collation, 'gb18030_unicode_520_ci' (which requires use of the non-Unicode 'gb18030' character set).

For information about Unicode collations and their differentiating properties, including collation properties for supplementary characters, see *note charset-unicode-sets::.

I have inserted CJK characters into my table. Why does *note 'SELECT': select. display them as '?' characters?

This problem is usually due to a setting in MySQL that does not match the settings for the application program or the operating system. Here are some common steps for correcting these types of issues:

Be certain of what MySQL version you are using.

Use the statement 'SELECT VERSION();' to determine this.

Make sure that the database is actually using the desired character set.

People often think that the client character set is always the same as either the server character set or the character set used for display purposes. However, both of these are false assumptions. You can make sure by checking the result of 'SHOW CREATE TABLE TABLENAME' or, better yet, by using this statement:
```
  SELECT character_set_name, collation_name
      FROM information_schema.columns
      WHERE table_schema = 'your_database_name'
          AND table_name = 'your_table_name'
          AND column_name = 'your_column_name';
```
Determine the hexadecimal value of the character or characters that are not being displayed correctly.

You can obtain this information for a column COLUMN_NAME in the table TABLE_NAME using the following query:
```
  SELECT HEX(COLUMN_NAME)
  FROM TABLE_NAME;
```
If the result contains '3F', which is the encoding for the '?' character, then '?' is the character actually stored in the column. This most often happens because of a problem converting a particular character from your client character set to the target character set.

Make sure that a round trip is possible. When you select LITERAL (or INTRODUCER HEXADECIMAL-VALUE), do you obtain LITERAL as a result?

For example, the Japanese Katakana character Pe ('ペ') exists in all CJK character sets, and has the code point value (hexadecimal coding) '0x30da'. To test a round trip for this character, use this query:
```
  SELECT 'ペ' AS `ペ`;         /* or SELECT _ucs2 0x30da; */
```
If the result is not also 'ペ', the round trip failed.

For bug reports regarding such failures, we might ask you to follow up with 'SELECT HEX('ペ');'. Then we can determine whether the client encoding is correct.

Make sure that the problem is not with the browser or other application, rather than with MySQL.

Use the *note 'mysql': mysql. client program to accomplish this task. If *note 'mysql': mysql. displays characters correctly but your application does not, your problem is probably due to system settings.

To determine your settings, use the *note 'SHOW VARIABLES': show-variables. statement, whose output should resemble what is shown here:
```
  mysql> SHOW VARIABLES LIKE 'char%';
  +--------------------------+----------------------------------------+
  | Variable_name            | Value                                  |
  +--------------------------+----------------------------------------+
  | character_set_client     | utf8                                   |
  | character_set_connection | utf8                                   |
  | character_set_database   | latin1                                 |
  | character_set_filesystem | binary                                 |
  | character_set_results    | utf8                                   |
  | character_set_server     | latin1                                 |
  | character_set_system     | utf8                                   |
  | character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
  +--------------------------+----------------------------------------+
```
These are typical character-set settings for an internationally oriented client (notice the use of 'utf8' Unicode) connected to a server in the West ('latin1' is a Western European character set).

Although Unicode (usually the 'utf8' variant on Unix, and the 'ucs2' variant on Windows) is preferable to Latin, it is often not what your operating system utilities support best. Many Windows users find that a Microsoft character set, such as 'cp932' for Japanese Windows, is suitable.

If you cannot control the server settings, and you have no idea what setting your underlying computer uses, try changing to a common character set for the country that you're in ('euckr' = Korea; 'gb18030', 'gb2312' or 'gbk' = People's Republic of China; 'big5' = Taiwan; 'sjis', 'ujis', 'cp932', or 'eucjpms' = Japan; 'ucs2' or 'utf8' = anywhere). Usually it is necessary to change only the client and connection and results settings. The *note 'SET NAMES': set-names. statement changes all three at once. For example:
```
  SET NAMES 'big5';
```
Once the setting is correct, you can make it permanent by editing 'my.cnf' or 'my.ini'. For example, you might add lines like these:
```
  [mysqld]
  character-set-server=big5
  [client]
  default-character-set=big5
```
It is also possible that there are issues with the API configuration setting being used in your application; see 'Why does my GUI front end or browser not display CJK characters correctly...?' for more information.
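
The round-trip test described in the steps above can also be approximated on the client side, outside MySQL. The following is a minimal Python sketch, not part of MySQL: the function name `round_trips` is hypothetical, and Python's codec tables are only assumed to be close to MySQL's character set mappings, so treat the result as a hint rather than proof.

```python
def round_trips(text: str, codec: str) -> bool:
    """Return True if text survives encode + decode in the given Python codec."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeError:
        # The codec cannot represent at least one of the characters.
        return False


pe = "\u30da"  # KATAKANA LETTER PE, the character used in the example above
for codec in ("utf-8", "shift_jis", "euc_jp", "latin-1"):
    print(codec, round_trips(pe, codec))
```

A character that fails this check for the client's own encoding cannot reach the server intact, whatever the server settings are.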

What problems should I be aware of when working with the Big5 Chinese character set?

MySQL supports the Big5 character set which is common in Hong Kong and Taiwan (Republic of China). The MySQL 'big5' character set is in reality Microsoft code page 950, which is very similar to the original 'big5' character set.

A feature request for adding 'HKSCS' extensions has been filed. People who need this extension may find the suggested patch for Bug #13577 to be of interest.

Why do Japanese character set conversions fail?

MySQL supports the 'sjis', 'ujis', 'cp932', and 'eucjpms' character sets, as well as Unicode. A common need is to convert between character sets. For example, there might be a Unix server (typically with 'sjis' or 'ujis') and a Windows client (typically with 'cp932').

In the following conversion table, the 'ucs2' column represents the source, and the 'sjis', 'cp932', 'ujis', and 'eucjpms' columns represent the destinations; that is, the last 4 columns provide the hexadecimal result when we use 'CONVERT(ucs2)' or we assign a 'ucs2' column containing the value to an 'sjis', 'cp932', 'ujis', or 'eucjpms' column.

Character Name             ucs2  sjis  cp932  ujis    eucjpms
BROKEN BAR                 00A6  3F    3F     8FA2C3  3F
FULLWIDTH BROKEN BAR       FFE4  3F    FA55   3F      8FA2
YEN SIGN                   00A5  3F    3F     20      3F
FULLWIDTH YEN SIGN         FFE5  818F  818F   A1EF    3F
TILDE                      007E  7E    7E     7E      7E
OVERLINE                   203E  3F    3F     20      3F
HORIZONTAL BAR             2015  815C  815C   A1BD    A1BD
EM DASH                    2014  3F    3F     3F      3F
REVERSE SOLIDUS            005C  815F  5C     5C      5C
FULLWIDTH REVERSE SOLIDUS  FF3C  3F    815F   3F      A1C0
WAVE DASH                  301C  8160  3F     A1C1    3F
FULLWIDTH TILDE            FF5E  3F    8160   3F      A1C1
DOUBLE VERTICAL LINE       2016  8161  3F     A1C2    3F
PARALLEL TO                2225  3F    8161   3F      A1C2
MINUS SIGN                 2212  817C  3F     A1DD    3F
FULLWIDTH HYPHEN-MINUS     FF0D  3F    817C   3F      A1DD
CENT SIGN                  00A2  8191  3F     A1F1    3F
FULLWIDTH CENT SIGN        FFE0  3F    8191   3F      A1F1
POUND SIGN                 00A3  8192  3F     A1F2    3F
FULLWIDTH POUND SIGN       FFE1  3F    8192   3F      A1F2
NOT SIGN                   00AC  81CA  3F     A2CC    3F
FULLWIDTH NOT SIGN         FFE2  3F    81CA   3F      A2CC

Now consider the following portion of the table.

Character Name       ucs2  sjis  cp932
NOT SIGN             00AC  81CA  3F
FULLWIDTH NOT SIGN   FFE2  3F    81CA

This means that MySQL converts the 'NOT SIGN' (Unicode 'U+00AC') to 'sjis' code point '0x81CA' and to 'cp932' code point '3F'. ('3F' is the question mark ('?'). This is what is always used when the conversion cannot be performed.)

What should I do if I want to convert SJIS '81CA' to 'cp932'?

Our answer is: '?'. There are disadvantages to this, and many people would prefer a 'loose' conversion, so that '81CA (NOT SIGN)' in 'sjis' becomes '81CA (FULLWIDTH NOT SIGN)' in 'cp932'.
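
The strict behavior can be pictured as a two-step lookup that goes through Unicode. The sketch below is illustrative Python, not MySQL code. It hard-codes only the two NOT SIGN rows from the table above; the names `SJIS_TO_UCS2`, `UCS2_TO_CP932`, `LOOSE_FALLBACK`, and `sjis_to_cp932` are hypothetical, and the "loose" option models what some users would prefer, not something MySQL offers.

```python
# Rows copied from the conversion table (hex strings, as printed there).
SJIS_TO_UCS2 = {"81CA": "00AC"}   # sjis 81CA is NOT SIGN (U+00AC)
UCS2_TO_CP932 = {"FFE2": "81CA"}  # cp932 81CA is FULLWIDTH NOT SIGN (U+FFE2)

# Hypothetical "loose" fallback, NOT MySQL behavior: treat NOT SIGN as
# FULLWIDTH NOT SIGN when the target character set has no NOT SIGN.
LOOSE_FALLBACK = {"00AC": "FFE2"}


def sjis_to_cp932(sjis_hex: str, loose: bool = False) -> str:
    """Convert one sjis code (hex) to cp932 (hex), using Unicode as the pivot."""
    ucs2 = SJIS_TO_UCS2.get(sjis_hex)
    if ucs2 is None:
        return "3F"  # unknown source code: question mark
    if loose and ucs2 not in UCS2_TO_CP932:
        ucs2 = LOOSE_FALLBACK.get(ucs2, ucs2)
    return UCS2_TO_CP932.get(ucs2, "3F")


print(sjis_to_cp932("81CA"))              # strict conversion: 3F
print(sjis_to_cp932("81CA", loose=True))  # hypothetical loose conversion: 81CA
```

The strict result is '3F' because the pivot character U+00AC has no entry on the 'cp932' side, even though the byte value '81CA' exists in both character sets.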

How does MySQL represent the Yen ('¥') sign?

A problem arises because some versions of Japanese character sets (both 'sjis' and 'euc') treat '5C' as a reverse solidus ('\', also known as a backslash), whereas others treat it as a yen sign ('¥').

MySQL follows only one version of the JIS (Japanese Industrial Standards) standard description. In MySQL, '5C' is always the reverse solidus ('\').

Of what issues should I be aware when working with Korean character sets in MySQL?

In theory, while there have been several versions of the 'euckr' (Extended Unix Code Korea) character set, only one problem has been noted. We use the 'ASCII' variant of EUC-KR, in which the code point '0x5c' is REVERSE SOLIDUS, that is '\', instead of the 'KS-Roman' variant of EUC-KR, in which the code point '0x5c' is 'WON SIGN' ('₩'). This means that you cannot convert Unicode 'U+20A9' to 'euckr':

 mysql> SELECT
            CONVERT('₩' USING euckr) AS euckr,
            HEX(CONVERT('₩' USING euckr)) AS hexeuckr;
 +-------+----------+
 | euckr | hexeuckr |
 +-------+----------+
 | ?     | 3F       |
 +-------+----------+

Why do I get 'Incorrect string value' error messages?

To see the problem, create a table with one Unicode ('ucs2') column and one Chinese ('gb2312') column.

 mysql> CREATE TABLE ch
        (ucs2 CHAR(3) CHARACTER SET ucs2,
        gb2312 CHAR(3) CHARACTER SET gb2312);

In nonstrict SQL mode, try to place the rare character '汌' in both columns.

 mysql> SET sql_mode = '';
 mysql> INSERT INTO ch VALUES ('A汌B','A汌B');
 Query OK, 1 row affected, 1 warning (0.00 sec)

The *note 'INSERT': insert. produces a warning. Use the following statement to see what it is:

 mysql> SHOW WARNINGS\G
 *************************** 1. row ***************************
   Level: Warning
    Code: 1366
 Message: Incorrect string value: '\xE6\xB1\x8CB' for column 'gb2312' at row 1

So it is a warning about the 'gb2312' column only.

 mysql> SELECT ucs2,HEX(ucs2),gb2312,HEX(gb2312) FROM ch;
 +-------+--------------+--------+-------------+
 | ucs2  | HEX(ucs2)    | gb2312 | HEX(gb2312) |
 +-------+--------------+--------+-------------+
 | A汌B | 00416C4C0042 | A?B    | 413F42      |
 +-------+--------------+--------+-------------+

Several things need explanation here:

The '汌' character is not in the 'gb2312' character set, as described earlier.
If you are using an old version of MySQL, you may see a different message.
A warning occurs rather than an error because MySQL is not set to use strict SQL mode. In nonstrict mode, MySQL tries to do what it can, to get the best fit, rather than give up. With strict SQL mode, the 'Incorrect string value' message occurs as an error rather than a warning, and the *note 'INSERT': insert. fails.
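
The difference between the two SQL modes resembles the difference between lenient and strict encoding in a client library. The following Python sketch is only an analogy: it assumes that Python's `gb2312` codec covers the same repertoire as the MySQL 'gb2312' character set for this character, and it does not talk to a server.

```python
text = "A\u6c4cB"  # the string 'A汌B' from the example above

# UCS-2 stores all three characters (UTF-16 big-endian is identical to
# UCS-2 for BMP characters).
ucs2_hex = text.encode("utf-16-be").hex().upper()

# Lenient encoding, comparable to nonstrict SQL mode: the character that
# gb2312 lacks is replaced by '?'.
lenient_hex = text.encode("gb2312", errors="replace").hex().upper()

# Strict encoding, comparable to strict SQL mode: the operation fails.
try:
    text.encode("gb2312")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ucs2_hex, lenient_hex, strict_ok)
```

The two hexadecimal strings correspond to the 'HEX(ucs2)' and 'HEX(gb2312)' columns in the query result shown earlier.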

Why does my GUI front end or browser display CJK characters incorrectly in my application using Access, PHP, or another API?

Obtain a direct connection to the server using the *note 'mysql': mysql. client, and try the same query there. If *note 'mysql': mysql. responds correctly, the trouble may be that your application interface requires initialization. Use *note 'mysql': mysql. to tell you what character set or sets it uses with the statement 'SHOW VARIABLES LIKE 'char%';'.

If you are using Access, you are most likely connecting with Connector/ODBC. In this case, you should check Configuring Connector/ODBC (https://dev.mysql.com/doc/connector-odbc/en/connector-odbc-configuration.html). If, for example, you use 'big5', you would enter 'SET NAMES 'big5''. (In this case, no ';' character is required.)

If you are using ASP, you might need to add *note 'SET NAMES': set-names. in the code. Here is an example that has worked in the past:

 <%
 Session.CodePage=0
 Dim strConnection
 Dim Conn
 strConnection="driver={MySQL ODBC 3.51 Driver};server=SERVER;uid=USERNAME;" _
                & "pwd=PASSWORD;database=DATABASE;stmt=SET NAMES 'big5';"
 Set Conn = Server.CreateObject("ADODB.Connection")
 Conn.Open strConnection
 %>

In much the same way, if you are using any character set other than 'latin1' with Connector/NET, you must specify the character set in the connection string. See Connector/NET Connections (https://dev.mysql.com/doc/connector-net/en/connector-net-connections.html), for more information.

If you are using PHP, try this:

 <?php
   $link = new mysqli($host, $usr, $pwd, $db);

   if( mysqli_connect_errno() )
   {
     printf("Connect failed: %s\n", mysqli_connect_error());
     exit();
   }

   $link->query("SET NAMES 'utf8'");
 ?>

In this case, we used *note 'SET NAMES': set-names. to change 'character_set_client', 'character_set_connection', and 'character_set_results'.

Another issue often encountered in PHP applications has to do with assumptions made by the browser. Sometimes adding or changing a '<meta>' tag suffices to correct the problem: for example, to ensure that the user agent interprets page content as 'UTF-8', include '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' in the '<head>' section of the HTML page.

If you are using Connector/J, see Using Character Sets and Unicode (https://dev.mysql.com/doc/connector-j/en/connector-j-reference-charsets.html).

I've upgraded to MySQL 5.7. How can I revert to behavior like that in MySQL 4.0 with regard to character sets?

In MySQL Version 4.0, there was a single 'global' character set for both server and client, and the decision as to which character set to use was made by the server administrator. This changed starting with MySQL Version 4.1. What happens now is a 'handshake', as described in *note charset-connection:::

When a client connects, it sends to the server the name of the character set that it wants to use. The server uses the name to set the 'character_set_client', 'character_set_results', and 'character_set_connection' system variables. In effect, the server performs a *note 'SET NAMES': set-names. operation using the character set name.

The effect of this is that you cannot control the client character set by starting *note 'mysqld': mysqld. with '--character-set-server=utf8'. However, some Asian customers prefer the MySQL 4.0 behavior. To make it possible to retain this behavior, we added a *note 'mysqld': mysqld. switch, '--character-set-client-handshake', which can be turned off with '--skip-character-set-client-handshake'. If you start *note 'mysqld': mysqld. with '--skip-character-set-client-handshake', then, when a client connects, it sends to the server the name of the character set that it wants to use. However, the server ignores this request from the client.

By way of example, suppose that your favorite server character set is 'latin1'. Suppose further that the client uses 'utf8' because this is what the client's operating system supports. Start the server with 'latin1' as its default character set:

 mysqld --character-set-server=latin1

And then start the client with the default character set 'utf8':

 mysql --default-character-set=utf8

The resulting settings can be seen by viewing the output of *note 'SHOW VARIABLES': show-variables.:

 mysql> SHOW VARIABLES LIKE 'char%';
 +--------------------------+----------------------------------------+
 | Variable_name            | Value                                  |
 +--------------------------+----------------------------------------+
 | character_set_client     | utf8                                   |
 | character_set_connection | utf8                                   |
 | character_set_database   | latin1                                 |
 | character_set_filesystem | binary                                 |
 | character_set_results    | utf8                                   |
 | character_set_server     | latin1                                 |
 | character_set_system     | utf8                                   |
 | character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
 +--------------------------+----------------------------------------+

Now stop the client, and stop the server using *note 'mysqladmin': mysqladmin. Then start the server again, but this time tell it to skip the handshake like so:

 mysqld --character-set-server=latin1 --skip-character-set-client-handshake

Start the client with 'utf8' once again as the default character set, then display the resulting settings:

 mysql> SHOW VARIABLES LIKE 'char%';
 +--------------------------+----------------------------------------+
 | Variable_name            | Value                                  |
 +--------------------------+----------------------------------------+
 | character_set_client     | latin1                                 |
 | character_set_connection | latin1                                 |
 | character_set_database   | latin1                                 |
 | character_set_filesystem | binary                                 |
 | character_set_results    | latin1                                 |
 | character_set_server     | latin1                                 |
 | character_set_system     | utf8                                   |
 | character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
 +--------------------------+----------------------------------------+

As you can see by comparing the differing results from *note 'SHOW VARIABLES': show-variables, the server ignores the client's initial settings if the '--skip-character-set-client-handshake' option is used.

Why do some 'LIKE' and 'FULLTEXT' searches with CJK characters fail?

For 'LIKE' searches, there is a very simple problem with binary string column types such as *note 'BINARY': binary-varbinary. and *note 'BLOB': blob.: we must know where characters end. With multibyte character sets, different characters might have different octet lengths. For example, in 'utf8', 'A' requires one byte but 'ペ' requires three bytes, as shown here:

 +-------------------------+---------------------------+
 | OCTET_LENGTH(_utf8 'A') | OCTET_LENGTH(_utf8 'ペ') |
 +-------------------------+---------------------------+
 |                       1 |                         3 |
 +-------------------------+---------------------------+

If we do not know where the first character in a string ends, we do not know where the second character begins, in which case even very simple searches such as 'LIKE '_A%'' fail. The solution is to use a nonbinary string column type defined to have the proper CJK character set. For example: 'mycol TEXT CHARACTER SET sjis'. Alternatively, convert to a CJK character set before comparing.
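
The boundary problem can be reproduced outside MySQL. The following Python sketch is only an illustration and does not reproduce MySQL's 'LIKE' implementation; it relies on the fact that in Shift-JIS the second byte of the Katakana letter A ('ア', bytes '83 41') has the same value as ASCII 'A'. The regular expressions stand in for the pattern '_A%'.

```python
import re

text = "\u30a2B"                # 'アB': KATAKANA LETTER A followed by 'B'
raw = text.encode("shift_jis")  # bytes 83 41 42; 0x41 is also ASCII 'A'

# Byte-wise view of LIKE '_A%': any one byte, then the byte 0x41.
bytewise_match = re.match(rb".A", raw, re.DOTALL) is not None

# Character-wise view of LIKE '_A%': any one character, then 'A'.
charwise_match = re.match(r".A", text, re.DOTALL) is not None

print(raw.hex(), bytewise_match, charwise_match)
```

The byte-wise comparison reports a match even though the string contains no 'A' at all, which is the kind of false result that a nonbinary column with the proper character set avoids.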

This is one reason why MySQL cannot permit encodings of nonexistent characters. If it is not strict about rejecting bad input, it has no way of knowing where characters end.

For 'FULLTEXT' searches, we must know where words begin and end. With Western languages, this is rarely a problem because most (if not all) of these use an easy-to-identify word boundary: the space character. However, this is not usually the case with Asian writing. We could use arbitrary halfway measures, like assuming that all Han characters represent words, or (for Japanese) depending on changes from Katakana to Hiragana due to grammatical endings. However, the only sure solution requires a comprehensive word list, which means that we would have to include a dictionary in the server for each Asian language supported. This is simply not feasible.

How do I know whether character X is available in all character sets?

The majority of simplified Chinese and basic nonhalfwidth Japanese Kana characters appear in all CJK character sets. The following stored procedure accepts a 'UCS-2' Unicode character, converts it to other character sets, and displays the results in hexadecimal.

 DELIMITER //

 CREATE PROCEDURE p_convert(ucs2_char CHAR(1) CHARACTER SET ucs2)
 BEGIN

 CREATE TABLE tj
              (ucs2 CHAR(1) character set ucs2,
               utf8 CHAR(1) character set utf8,
               big5 CHAR(1) character set big5,
               cp932 CHAR(1) character set cp932,
               eucjpms CHAR(1) character set eucjpms,
               euckr CHAR(1) character set euckr,
               gb2312 CHAR(1) character set gb2312,
               gbk CHAR(1) character set gbk,
               sjis CHAR(1) character set sjis,
               ujis CHAR(1) character set ujis);

 INSERT INTO tj (ucs2) VALUES (ucs2_char);

 UPDATE tj SET utf8=ucs2,
               big5=ucs2,
               cp932=ucs2,
               eucjpms=ucs2,
               euckr=ucs2,
               gb2312=ucs2,
               gbk=ucs2,
               sjis=ucs2,
               ujis=ucs2;

 /* If there are conversion problems, UPDATE produces warnings. */

 SELECT hex(ucs2) AS ucs2,
        hex(utf8) AS utf8,
        hex(big5) AS big5,
        hex(cp932) AS cp932,
        hex(eucjpms) AS eucjpms,
        hex(euckr) AS euckr,
        hex(gb2312) AS gb2312,
        hex(gbk) AS gbk,
        hex(sjis) AS sjis,
        hex(ujis) AS ujis
 FROM tj;

 DROP TABLE tj;

 END//

 DELIMITER ;

The input can be any single 'ucs2' character, or it can be the code value (hexadecimal representation) of that character. For example, from Unicode's list of 'ucs2' encodings and names (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt), we know that the Katakana character Pe appears in all CJK character sets, and that its code value is 'X'30DA''. If we use this value as the argument to 'p_convert()', the result is as shown here:

 mysql> CALL p_convert(X'30DA');
 +------+--------+------+-------+---------+-------+--------+------+------+------+
 | ucs2 | utf8   | big5 | cp932 | eucjpms | euckr | gb2312 | gbk  | sjis | ujis |
 +------+--------+------+-------+---------+-------+--------+------+------+------+
 | 30DA | E3839A | C772 | 8379  | A5DA    | ABDA  | A5DA   | A5DA | 8379 | A5DA |
 +------+--------+------+-------+---------+-------+--------+------+------+------+

Since none of the column values is '3F' (that is, the question mark character, '?'), we know that every conversion worked.
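
A similar check can be made without a server. The sketch below is a rough client-side Python analogue of 'p_convert()', offered under loudly stated assumptions: the mapping `CODECS` pairs MySQL character set names with Python codecs that are assumed (not guaranteed) to be close equivalents, 'big5' and 'eucjpms' are left out because no close match is assumed here, and the function name `convert_hex` is hypothetical. It expects a single BMP character.

```python
# Python codec names used as stand-ins for MySQL character sets (assumption).
CODECS = {
    "utf8": "utf-8",
    "cp932": "cp932",
    "euckr": "euc_kr",
    "gb2312": "gb2312",
    "gbk": "gbk",
    "sjis": "shift_jis",
    "ujis": "euc_jp",
}


def convert_hex(ch: str) -> dict:
    """Return the hex encoding of one BMP character in each stand-in codec."""
    # UTF-16 big-endian equals UCS-2 for BMP characters.
    result = {"ucs2": ch.encode("utf-16-be").hex().upper()}
    for name, codec in CODECS.items():
        try:
            result[name] = ch.encode(codec).hex().upper()
        except UnicodeEncodeError:
            result[name] = "3F"  # mimic the '?' substitution
    return result


for name, value in convert_hex("\u30da").items():  # KATAKANA LETTER PE
    print(name, value)
```

Where the Python result and the stored procedure disagree, the stored procedure is the authority, because it uses the server's own conversion tables.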

Why do CJK strings sort incorrectly in Unicode? (I)

Note:

The CJK sorting problems described here can occur for MySQL versions prior to MySQL 8.0. As of MySQL 8.0, they can be solved by using the 'utf8mb4' character set and the 'utf8mb4_ja_0900_as_cs' collation.

Sometimes people observe that the result of a 'utf8_unicode_ci' or 'ucs2_unicode_ci' search, or of an 'ORDER BY' sort, is not what they think a native speaker would expect. Although we never rule out the possibility that there is a bug, we have found in the past that many people do not correctly read the standard table of weights for the Unicode Collation Algorithm. MySQL uses the tables found under http://www.unicode.org/Public/UCA/:

UCA 4.0.0 table: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt

This table applies to 'XXX_unicode_ci' collations with no version number in the collation name.

UCA 5.2.0 table: http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt

This table applies to collations with '520' in the collation name.

UCA 9.0.0 table: http://www.unicode.org/Public/UCA/9.0.0/allkeys.txt

This table applies to collations with '0900' in the collation name.

To handle newer UCA versions, we create new collations. We are very wary about changing ordering of existing collations because that affects indexes, which can bring about situations such as that reported in Bug #16526, illustrated as follows:

 mysql> CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);
 Query OK, 0 rows affected (0.05 sec)

 mysql> INSERT INTO tj VALUES ('が'),('か');
 Query OK, 2 rows affected (0.00 sec)
 Records: 2  Duplicates: 0  Warnings: 0

 mysql> SELECT * FROM tj WHERE s1 = 'か';
 +------+
 | s1   |
 +------+
 | が  |
 | か  |
 +------+

The character in the first result row is not the one that we searched for. Why did MySQL retrieve it? First we look for the Unicode code point value, which is possible by reading the hexadecimal number for the 'ucs2' version of the characters:

 mysql> SELECT s1, HEX(CONVERT(s1 USING ucs2)) FROM tj;
 +------+-----------------------------+
 | s1   | HEX(CONVERT(s1 USING ucs2)) |
 +------+-----------------------------+
 | が  | 304C                        |
 | か  | 304B                        |
 +------+-----------------------------+

Now we search for '304B' and '304C' in the '4.0.0 allkeys' table, and find these lines:

 304B  ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA
 304C  ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM

The official Unicode names (following the '#' mark) tell us the Japanese syllabary (Hiragana), the informal classification (letter, digit, or punctuation mark), and the Western identifier ('KA' or 'GA', which happen to be the unvoiced and voiced components of the same letter pair). More importantly, the primary weight (the first hexadecimal number inside the square brackets) is '1E57' on both lines. For comparisons in both searching and sorting, MySQL pays attention to the primary weight only, ignoring all the other numbers. This means that we are sorting 'が' and 'か' correctly according to the Unicode specification.

If we wanted to distinguish them, we'd have to use a non-UCA (Unicode Collation Algorithm) collation ('utf8_bin' or 'utf8_general_ci'), or to compare the 'HEX()' values, or use 'ORDER BY CONVERT(s1 USING sjis)'. Being correct 'according to Unicode' is not enough, of course: the person who submitted the bug was equally correct. To solve this, we need another collation for Japanese according to the JIS X 4061 standard, in which voiced/unvoiced letter pairs like 'KA'/'GA' are distinguishable for ordering purposes.
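
Reading the primary weights out of the 'allkeys' lines can be automated. The following Python sketch is a simplification under stated assumptions: it parses only the two lines quoted above, the helper name `primary_weights` is hypothetical, and it ignores everything else that a full Unicode Collation Algorithm implementation does when building sort keys.

```python
import re

ALLKEYS_LINES = [
    "304B  ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA",
    "304C  ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM",
]


def primary_weights(line: str) -> list:
    """Return the nonzero primary weights of the collation elements on a line."""
    elements_part = line.split("#")[0]  # drop the trailing comment
    # The primary weight is the first 4-digit field of each [...] element.
    weights = re.findall(r"\[[.*]([0-9A-F]{4})\.", elements_part)
    # A primary weight of 0000 is ignorable at the primary level.
    return [w for w in weights if w != "0000"]


ka = primary_weights(ALLKEYS_LINES[0])
ga = primary_weights(ALLKEYS_LINES[1])
print(ka, ga, ka == ga)
```

Both characters reduce to the same list of primary weights, which is why a comparison that looks only at primary weights treats them as equal.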

Why do CJK strings sort incorrectly in Unicode? (II)

If you are using Unicode ('ucs2' or 'utf8'), and you know what the Unicode sort order is (see *note faqs-cjk::), but MySQL still seems to sort your table incorrectly, first verify the character set in the table definition:

 mysql> SHOW CREATE TABLE t\G
 ******************** 1. row ******************
 Table: t
 Create Table: CREATE TABLE `t` (
 `s1` char(1) CHARACTER SET ucs2 DEFAULT NULL
 ) ENGINE=MyISAM DEFAULT CHARSET=latin1

Since the character set for the column 's1' appears to be correct ('ucs2'), check what information the *note 'INFORMATION_SCHEMA.COLUMNS': information-schema-columns-table. table can provide about this column:

 mysql> SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE COLUMN_NAME = 's1'
        AND TABLE_NAME = 't';
 +-------------+--------------------+-----------------+
 | COLUMN_NAME | CHARACTER_SET_NAME | COLLATION_NAME  |
 +-------------+--------------------+-----------------+
 | s1          | ucs2               | ucs2_general_ci |
 +-------------+--------------------+-----------------+

(See *note information-schema-columns-table::, for more information.)

You can see that the collation is 'ucs2_general_ci' instead of 'ucs2_unicode_ci'. The reason why this is so can be found using *note 'SHOW CHARACTER SET': show-character-set, as shown here:

 mysql> SHOW CHARSET LIKE 'ucs2%';
 +---------+---------------+-------------------+--------+
 | Charset | Description   | Default collation | Maxlen |
 +---------+---------------+-------------------+--------+
 | ucs2    | UCS-2 Unicode | ucs2_general_ci   |      2 |
 +---------+---------------+-------------------+--------+

For 'ucs2' and 'utf8', the default collation is 'general'. To specify a Unicode UCA collation, use 'COLLATE ucs2_unicode_ci', as shown in the preceding item.

Why are my supplementary characters rejected by MySQL?

Supplementary characters lie outside the Unicode Basic Multilingual Plane / Plane 0. BMP characters have code point values between 'U+0000' and 'U+FFFF'. Supplementary characters have code point values between 'U+10000' and 'U+10FFFF'.

To store supplementary characters, you must use a character set that permits them:

The 'utf8' and 'ucs2' character sets support BMP characters only.

The 'utf8' character set permits only 'UTF-8' characters that take up to three bytes. This has led to reports such as that found in Bug #12600, which we rejected as 'not a bug'. With 'utf8', MySQL must truncate an input string when it encounters bytes that it does not understand. Otherwise, it is unknown how long the bad multibyte character is.

One possible workaround is to use 'ucs2' instead of 'utf8', in which case the 'bad' characters are changed to question marks and no truncation takes place. You can also change the data type to *note 'BLOB': blob. or *note 'BINARY': binary-varbinary, which perform no validity checking.

The 'utf8mb4', 'utf16', 'utf16le', and 'utf32' character sets support BMP characters, as well as supplementary characters outside the BMP.
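
Before choosing between 'utf8' and 'utf8mb4', an application can check whether its data contains supplementary characters. The following is a minimal Python sketch for use on the client side; the function names `is_supplementary` and `fits_in_bmp` are hypothetical and are not part of any MySQL API.

```python
def is_supplementary(ch: str) -> bool:
    """True if the character lies outside the BMP (above U+FFFF)."""
    return ord(ch) > 0xFFFF


def fits_in_bmp(text: str) -> bool:
    """True if every character is a BMP character (U+0000 to U+FFFF)."""
    return not any(is_supplementary(ch) for ch in text)


bmp_char = "\u30da"             # KATAKANA LETTER PE, a BMP character
supplementary = "\U00020000"    # a CJK ideograph from a supplementary plane

print(len(bmp_char.encode("utf-8")), fits_in_bmp(bmp_char))
print(len(supplementary.encode("utf-8")), fits_in_bmp(supplementary))
```

Text for which `fits_in_bmp()` returns false needs four bytes per supplementary character in UTF-8, so it calls for one of the character sets that support supplementary characters.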

Should 'CJK' be 'CJKV'?

No. The term 'CJKV' (Chinese Japanese Korean Vietnamese) refers to Vietnamese character sets which contain Han (originally Chinese) characters. MySQL supports the modern Vietnamese script with Western characters, but does not support the old Vietnamese script using Han characters.

As of MySQL 5.6, there are Vietnamese collations for Unicode character sets, as described in *note charset-unicode-sets::.

Does MySQL permit CJK characters to be used in database and table names?

Yes.

Where can I find translations of the MySQL Manual into Chinese, Japanese, and Korean?

The Japanese translation of the MySQL 5.6 manual can be downloaded from https://dev.mysql.com/doc/.

Where can I get help with CJK and related issues in MySQL?

The following resources are available:

A listing of MySQL user groups can be found at https://wikis.oracle.com/display/mysql/List+of+MySQL+User+Groups.
View feature requests relating to character set issues at http://tinyurl.com/y6xcuf.
Visit the MySQL Collation, Unicode Forum. http://forums.mysql.com/ also provides foreign-language forums.
