10.2 Character Sets and Collations in MySQL

Menu:
charset-repertoire:: Character Set Repertoire
charset-metadata:: UTF-8 for Metadata

MySQL Server supports multiple character sets. To display the available character sets, use the 'INFORMATION_SCHEMA' note 'CHARACTER_SETS': information-schema-character-sets-table. table or the note 'SHOW CHARACTER SET': show-character-set. statement. A partial listing follows. For more complete information, see *note charset-charsets::.

 mysql> SHOW CHARACTER SET;
 +----------+---------------------------------+---------------------+--------+
 | Charset  | Description                     | Default collation   | Maxlen |
 +----------+---------------------------------+---------------------+--------+
 | big5     | Big5 Traditional Chinese        | big5_chinese_ci     |      2 |
 ...
 | latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
 | latin2   | ISO 8859-2 Central European     | latin2_general_ci   |      1 |
 ...
 | utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
 | ucs2     | UCS-2 Unicode                   | ucs2_general_ci     |      2 |
 ...
 | utf8mb4  | UTF-8 Unicode                   | utf8mb4_general_ci  |      4 |
 ...
 | binary   | Binary pseudo charset           | binary              |      1 |
 ...

By default, the *note 'SHOW CHARACTER SET': show-character-set. statement displays all available character sets. It takes an optional 'LIKE' or 'WHERE' clause that indicates which character set names to match. For example:

 mysql> SHOW CHARACTER SET LIKE 'latin%';
 +---------+-----------------------------+-------------------+--------+
 | Charset | Description                 | Default collation | Maxlen |
 +---------+-----------------------------+-------------------+--------+
 | latin1  | cp1252 West European        | latin1_swedish_ci |      1 |
 | latin2  | ISO 8859-2 Central European | latin2_general_ci |      1 |
 | latin5  | ISO 8859-9 Turkish          | latin5_turkish_ci |      1 |
 | latin7  | ISO 8859-13 Baltic          | latin7_general_ci |      1 |
 +---------+-----------------------------+-------------------+--------+

A given character set always has at least one collation, and most character sets have several. To list the display collations for a character set, use the 'INFORMATION_SCHEMA' note 'COLLATIONS': information-schema-collations-table. table or the note 'SHOW COLLATION': show-collation. statement.

By default, the *note 'SHOW COLLATION': show-collation. statement displays all available collations. It takes an optional 'LIKE' or 'WHERE' clause that indicates which collation names to display. For example, to see the collations for the default character set, 'latin1' (cp1252 West European), use this statement:

 mysql> SHOW COLLATION WHERE Charset = 'latin1';
 +-------------------+---------+----+---------+----------+---------+
 | Collation         | Charset | Id | Default | Compiled | Sortlen |
 +-------------------+---------+----+---------+----------+---------+
 | latin1_german1_ci | latin1  |  5 |         | Yes      |       1 |
 | latin1_swedish_ci | latin1  |  8 | Yes     | Yes      |       1 |
 | latin1_danish_ci  | latin1  | 15 |         | Yes      |       1 |
 | latin1_german2_ci | latin1  | 31 |         | Yes      |       2 |
 | latin1_bin        | latin1  | 47 |         | Yes      |       1 |
 | latin1_general_ci | latin1  | 48 |         | Yes      |       1 |
 | latin1_general_cs | latin1  | 49 |         | Yes      |       1 |
 | latin1_spanish_ci | latin1  | 94 |         | Yes      |       1 |
 +-------------------+---------+----+---------+----------+---------+

The 'latin1' collations have the following meanings.

Collation Meaning

'latin1_bin' Binary according to 'latin1' encoding

'latin1_danish_ci' Danish/Norwegian

'latin1_general_ci' Multilingual (Western European)

'latin1_general_cs' Multilingual (ISO Western European), case-sensitive

'latin1_german1_ci' German DIN-1 (dictionary order)

'latin1_german2_ci' German DIN-2 (phone book order)

'latin1_spanish_ci' Modern Spanish

'latin1_swedish_ci' Swedish/Finnish

Collations have these general characteristics:

Two different character sets cannot have the same collation.
Each character set has a default collation. For example, the default collations for 'latin1' and 'utf8' are 'latin1_swedish_ci' and 'utf8_general_ci', respectively. The 'INFORMATION_SCHEMA' note 'CHARACTER_SETS': information-schema-character-sets-table. table and the note 'SHOW CHARACTER SET': show-character-set. statement indicate the default collation for each character set. The 'INFORMATION_SCHEMA' note 'COLLATIONS': information-schema-collations-table. table and the note 'SHOW COLLATION': show-collation. statement have a column that indicates for each collation whether it is the default for its character set ('Yes' if so, empty if not).
Collation names start with the name of the character set with which they are associated, generally followed by one or more suffixes indicating other collation characteristics. For additional information about naming conventions, see *note charset-collation-names::.

When a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing an inappropriate collation, perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.

File: manual.info.tmp, Node: charset-repertoire, Next: charset-metadata, Prev: charset-mysql, Up: charset-mysql

10.2.1 Character Set Repertoire

The repertoire of a character set is the collection of characters in the set.

String expressions have a repertoire attribute, which can have two values:

'ASCII': The expression can contain only ASCII characters; that is, characters in the Unicode range 'U+0000' to 'U+007F'.
'UNICODE': The expression can contain characters in the Unicode range 'U+0000' to 'U+10FFFF'. This includes characters in the Basic Multilingual Plane (BMP) range ('U+0000' to 'U+FFFF') and supplementary characters outside the BMP range ('U+10000' to 'U+10FFFF').

The 'ASCII' range is a subset of 'UNICODE' range, so a string with 'ASCII' repertoire can be converted safely without loss of information to the character set of any string with 'UNICODE' repertoire. It can also be converted safely to any character set that is a superset of the 'ascii' character set. (All MySQL character sets are supersets of 'ascii' with the exception of 'swe7', which reuses some punctuation characters for Swedish accented characters.)

The use of repertoire enables character set conversion in expressions for many cases where MySQL would otherwise return an 'illegal mix of collations' error when the rules for collation coercibility are insufficient to resolve ambiguities. (For information about coercibility, see *note charset-collation-coercibility::.)

The following discussion provides examples of expressions and their repertoires, and describes how the use of repertoire changes string expression evaluation:

The repertoire for a string constant depends on string content and may differ from the repertoire of the string character set. Consider these statements:
```
  SET NAMES utf8; SELECT 'abc';
  SELECT _utf8'def';
  SELECT N'MySQL';
```
Although the character set is 'utf8' in each of the preceding cases, the strings do not actually contain any characters outside the ASCII range, so their repertoire is 'ASCII' rather than 'UNICODE'.

A column having the 'ascii' character set has 'ASCII' repertoire because of its character set. In the following table, 'c1' has 'ASCII' repertoire:

  CREATE TABLE t1 (c1 CHAR(1) CHARACTER SET ascii);

The following example illustrates how repertoire enables a result to be determined in a case where an error occurs without repertoire:

  CREATE TABLE t1 (
    c1 CHAR(1) CHARACTER SET latin1,
    c2 CHAR(1) CHARACTER SET ascii
  );
  INSERT INTO t1 VALUES ('a','b');
  SELECT CONCAT(c1,c2) FROM t1;

Without repertoire, this error occurs:

  ERROR 1267 (HY000): Illegal mix of collations (latin1_swedish_ci,IMPLICIT)
  and (ascii_general_ci,IMPLICIT) for operation 'concat'

Using repertoire, subset to superset ('ascii' to 'latin1') conversion can occur and a result is returned:

  +---------------+
  | CONCAT(c1,c2) |
  +---------------+
  | ab            |
  +---------------+

Functions with one string argument inherit the repertoire of their argument. The result of 'UPPER(_utf8'abc')' has 'ASCII' repertoire because its argument has 'ASCII' repertoire. (Despite the '_utf8' introducer, the string ''abc'' contains no characters outside the ASCII range.)

For functions that return a string but do not have string arguments and use 'character_set_connection' as the result character set, the result repertoire is 'ASCII' if 'character_set_connection' is 'ascii', and 'UNICODE' otherwise:

  FORMAT(NUMERIC_COLUMN, 4);

Use of repertoire changes how MySQL evaluates the following example:

  SET NAMES ascii;
  CREATE TABLE t1 (a INT, b VARCHAR(10) CHARACTER SET latin1);
  INSERT INTO t1 VALUES (1,'b');
  SELECT CONCAT(FORMAT(a, 4), b) FROM t1;

Without repertoire, this error occurs:

  ERROR 1267 (HY000): Illegal mix of collations (ascii_general_ci,COERCIBLE)
  and (latin1_swedish_ci,IMPLICIT) for operation 'concat'

With repertoire, a result is returned:

  +-------------------------+
  | CONCAT(FORMAT(a, 4), b) |
  +-------------------------+
  | 1.0000b                 |
  +-------------------------+

Functions with two or more string arguments use the 'widest' argument repertoire for the result repertoire, where 'UNICODE' is wider than 'ASCII'. Consider the following 'CONCAT()' calls:
```
  CONCAT(_ucs2 X'0041', _ucs2 X'0042')
  CONCAT(_ucs2 X'0041', _ucs2 X'00C2')
```
For the first call, the repertoire is 'ASCII' because both arguments are within the ASCII range. For the second call, the repertoire is 'UNICODE' because the second argument is outside the ASCII range.
The repertoire for function return values is determined based on the repertoire of only those arguments that affect the result's character set and collation.
```
  IF(column1 < column2, 'smaller', 'greater')
```
The result repertoire is 'ASCII' because the two string arguments (the second argument and the third argument) both have 'ASCII' repertoire. The first argument does not matter for the result repertoire, even if the expression uses string values.

File: manual.info.tmp, Node: charset-metadata, Prev: charset-repertoire, Up: charset-mysql

10.2.2 UTF-8 for Metadata

Metadata is 'the data about the data.' Anything that describes the database--as opposed to being the contents of the database--is metadata. Thus column names, database names, user names, version names, and most of the string results from *note 'SHOW': show. are metadata. This is also true of the contents of tables in 'INFORMATION_SCHEMA' because those tables by definition contain information about database objects.

Representation of metadata must satisfy these requirements:

All metadata must be in the same character set. Otherwise, neither the note 'SHOW': show. statements nor note 'SELECT': select. statements for tables in 'INFORMATION_SCHEMA' would work properly because different rows in the same column of the results of these operations would be in different character sets.
Metadata must include all characters in all languages. Otherwise, users would not be able to name columns and tables using their own languages.

To satisfy both requirements, MySQL stores metadata in a Unicode character set, namely UTF-8. This does not cause any disruption if you never use accented or non-Latin characters. But if you do, you should be aware that metadata is in UTF-8.

The metadata requirements mean that the return values of the 'USER()', 'CURRENT_USER()', 'SESSION_USER()', 'SYSTEM_USER()', 'DATABASE()', and 'VERSION()' functions have the UTF-8 character set by default.

The server sets the 'character_set_system' system variable to the name of the metadata character set:

 mysql> SHOW VARIABLES LIKE 'character_set_system';
 +----------------------+-------+
 | Variable_name        | Value |
 +----------------------+-------+
 | character_set_system | utf8  |
 +----------------------+-------+

Storage of metadata using Unicode does not mean that the server returns headers of columns and the results of note 'DESCRIBE': describe. functions in the 'character_set_system' character set by default. When you use 'SELECT column1 FROM t', the name 'column1' itself is returned from the server to the client in the character set determined by the value of the 'character_set_results' system variable, which has a default value of 'utf8'. If you want the server to pass metadata results back in a different character set, use the note 'SET NAMES': set-names. statement to force the server to perform character set conversion. note 'SET NAMES': set-names. sets the 'character_set_results' and other related system variables. (See note charset-connection::.) Alternatively, a client program can perform the conversion after receiving the result from the server. It is more efficient for the client to perform the conversion, but this option is not always available for all clients.

If 'character_set_results' is set to 'NULL', no conversion is performed and the server returns metadata using its original character set (the set indicated by 'character_set_system').

Error messages returned from the server to the client are converted to the client character set automatically, as with metadata.

If you are using (for example) the 'USER()' function for comparison or assignment within a single statement, don't worry. MySQL performs some automatic conversion for you.

 SELECT * FROM t1 WHERE USER() = latin1_column;

This works because the contents of 'latin1_column' are automatically converted to UTF-8 before the comparison.

 INSERT INTO t1 (latin1_column) SELECT USER();

This works because the contents of 'USER()' are automatically converted to 'latin1' before the assignment.

Although automatic conversion is not in the SQL standard, the standard does say that every character set is (in terms of supported characters) a 'subset' of Unicode. Because it is a well-known principle that 'what applies to a superset can apply to a subset,' we believe that a collation for Unicode can apply for comparisons with non-Unicode strings. For more information about coercion of strings, see *note charset-collation-coercibility::.

File: manual.info.tmp, Node: charset-syntax, Next: charset-connection, Prev: charset-mysql, Up: charset