10.2 Character Sets and Collations in MySQL

MySQL Server supports multiple character sets. To display the available character sets, use the 'INFORMATION_SCHEMA' note 'CHARACTER_SETS': information-schema-character-sets-table. table or the note 'SHOW CHARACTER SET': show-character-set. statement. A partial listing follows. For more complete information, see *note charset-charsets::.

 mysql> SHOW CHARACTER SET;
 +----------+---------------------------------+---------------------+--------+
 | Charset  | Description                     | Default collation   | Maxlen |
 +----------+---------------------------------+---------------------+--------+
 | big5     | Big5 Traditional Chinese        | big5_chinese_ci     |      2 |
 ...
 | latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
 | latin2   | ISO 8859-2 Central European     | latin2_general_ci   |      1 |
 ...
 | utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
 | ucs2     | UCS-2 Unicode                   | ucs2_general_ci     |      2 |
 ...
 | utf8mb4  | UTF-8 Unicode                   | utf8mb4_general_ci  |      4 |
 ...
 | binary   | Binary pseudo charset           | binary              |      1 |
 ...

By default, the *note 'SHOW CHARACTER SET': show-character-set. statement displays all available character sets. It takes an optional 'LIKE' or 'WHERE' clause that indicates which character set names to match. For example:

 mysql> SHOW CHARACTER SET LIKE 'latin%';
 +---------+-----------------------------+-------------------+--------+
 | Charset | Description                 | Default collation | Maxlen |
 +---------+-----------------------------+-------------------+--------+
 | latin1  | cp1252 West European        | latin1_swedish_ci |      1 |
 | latin2  | ISO 8859-2 Central European | latin2_general_ci |      1 |
 | latin5  | ISO 8859-9 Turkish          | latin5_turkish_ci |      1 |
 | latin7  | ISO 8859-13 Baltic          | latin7_general_ci |      1 |
 +---------+-----------------------------+-------------------+--------+

A given character set always has at least one collation, and most character sets have several. To list the display collations for a character set, use the 'INFORMATION_SCHEMA' note 'COLLATIONS': information-schema-collations-table. table or the note 'SHOW COLLATION': show-collation. statement.

By default, the *note 'SHOW COLLATION': show-collation. statement displays all available collations. It takes an optional 'LIKE' or 'WHERE' clause that indicates which collation names to display. For example, to see the collations for the default character set, 'latin1' (cp1252 West European), use this statement:

 mysql> SHOW COLLATION WHERE Charset = 'latin1';
 +-------------------+---------+----+---------+----------+---------+
 | Collation         | Charset | Id | Default | Compiled | Sortlen |
 +-------------------+---------+----+---------+----------+---------+
 | latin1_german1_ci | latin1  |  5 |         | Yes      |       1 |
 | latin1_swedish_ci | latin1  |  8 | Yes     | Yes      |       1 |
 | latin1_danish_ci  | latin1  | 15 |         | Yes      |       1 |
 | latin1_german2_ci | latin1  | 31 |         | Yes      |       2 |
 | latin1_bin        | latin1  | 47 |         | Yes      |       1 |
 | latin1_general_ci | latin1  | 48 |         | Yes      |       1 |
 | latin1_general_cs | latin1  | 49 |         | Yes      |       1 |
 | latin1_spanish_ci | latin1  | 94 |         | Yes      |       1 |
 +-------------------+---------+----+---------+----------+---------+

The 'latin1' collations have the following meanings.

Collation Meaning

'latin1_bin' Binary according to 'latin1' encoding

'latin1_danish_ci' Danish/Norwegian

'latin1_general_ci' Multilingual (Western European)

'latin1_general_cs' Multilingual (ISO Western European), case-sensitive

'latin1_german1_ci' German DIN-1 (dictionary order)

'latin1_german2_ci' German DIN-2 (phone book order)

'latin1_spanish_ci' Modern Spanish

'latin1_swedish_ci' Swedish/Finnish

Collations have these general characteristics:

When a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing an inappropriate collation, perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.

 File: manual.info.tmp, Node: charset-repertoire, Next: charset-metadata, Prev: charset-mysql, Up: charset-mysql

10.2.1 Character Set Repertoire

The repertoire of a character set is the collection of characters in the set.

String expressions have a repertoire attribute, which can have two values:

The 'ASCII' range is a subset of 'UNICODE' range, so a string with 'ASCII' repertoire can be converted safely without loss of information to the character set of any string with 'UNICODE' repertoire. It can also be converted safely to any character set that is a superset of the 'ascii' character set. (All MySQL character sets are supersets of 'ascii' with the exception of 'swe7', which reuses some punctuation characters for Swedish accented characters.)

The use of repertoire enables character set conversion in expressions for many cases where MySQL would otherwise return an 'illegal mix of collations' error when the rules for collation coercibility are insufficient to resolve ambiguities. (For information about coercibility, see *note charset-collation-coercibility::.)

The following discussion provides examples of expressions and their repertoires, and describes how the use of repertoire changes string expression evaluation:

 File: manual.info.tmp, Node: charset-metadata, Prev: charset-repertoire, Up: charset-mysql

10.2.2 UTF-8 for Metadata

Metadata is 'the data about the data.' Anything that describes the database--as opposed to being the contents of the database--is metadata. Thus column names, database names, user names, version names, and most of the string results from *note 'SHOW': show. are metadata. This is also true of the contents of tables in 'INFORMATION_SCHEMA' because those tables by definition contain information about database objects.

Representation of metadata must satisfy these requirements:

To satisfy both requirements, MySQL stores metadata in a Unicode character set, namely UTF-8. This does not cause any disruption if you never use accented or non-Latin characters. But if you do, you should be aware that metadata is in UTF-8.

The metadata requirements mean that the return values of the 'USER()', 'CURRENT_USER()', 'SESSION_USER()', 'SYSTEM_USER()', 'DATABASE()', and 'VERSION()' functions have the UTF-8 character set by default.

The server sets the 'character_set_system' system variable to the name of the metadata character set:

 mysql> SHOW VARIABLES LIKE 'character_set_system';
 +----------------------+-------+
 | Variable_name        | Value |
 +----------------------+-------+
 | character_set_system | utf8  |
 +----------------------+-------+

Storage of metadata using Unicode does not mean that the server returns headers of columns and the results of note 'DESCRIBE': describe. functions in the 'character_set_system' character set by default. When you use 'SELECT column1 FROM t', the name 'column1' itself is returned from the server to the client in the character set determined by the value of the 'character_set_results' system variable, which has a default value of 'utf8'. If you want the server to pass metadata results back in a different character set, use the note 'SET NAMES': set-names. statement to force the server to perform character set conversion. note 'SET NAMES': set-names. sets the 'character_set_results' and other related system variables. (See note charset-connection::.) Alternatively, a client program can perform the conversion after receiving the result from the server. It is more efficient for the client to perform the conversion, but this option is not always available for all clients.

If 'character_set_results' is set to 'NULL', no conversion is performed and the server returns metadata using its original character set (the set indicated by 'character_set_system').

Error messages returned from the server to the client are converted to the client character set automatically, as with metadata.

If you are using (for example) the 'USER()' function for comparison or assignment within a single statement, don't worry. MySQL performs some automatic conversion for you.

 SELECT * FROM t1 WHERE USER() = latin1_column;

This works because the contents of 'latin1_column' are automatically converted to UTF-8 before the comparison.

 INSERT INTO t1 (latin1_column) SELECT USER();

This works because the contents of 'USER()' are automatically converted to 'latin1' before the assignment.

Although automatic conversion is not in the SQL standard, the standard does say that every character set is (in terms of supported characters) a 'subset' of Unicode. Because it is a well-known principle that 'what applies to a superset can apply to a subset,' we believe that a collation for Unicode can apply for comparisons with non-Unicode strings. For more information about coercion of strings, see *note charset-collation-coercibility::.

 File: manual.info.tmp, Node: charset-syntax, Next: charset-connection, Prev: charset-mysql, Up: charset