10.3 Specifying Character Sets and Collations

There are default settings for character sets and collations at four levels: server, database, table, and column. The description in the following sections may appear complex, but it has been found in practice that multiple-level defaulting leads to natural and obvious results.

'CHARACTER SET' is used in clauses that specify a character set. 'CHARSET' can be used as a synonym for 'CHARACTER SET'.

Character set issues affect not only data storage, but also communication between client programs and the MySQL server. If you want the client program to communicate with the server using a character set different from the default, you'll need to indicate which one. For example, to use the 'utf8' Unicode character set, issue this statement after connecting to the server:

 SET NAMES 'utf8';

For more information about character set-related issues in client/server communication, see *note charset-connection::.

 File: manual.info.tmp, Node: charset-collation-names, Next: charset-server, Prev: charset-syntax, Up: charset-syntax

10.3.1 Collation Naming Conventions

MySQL collation names follow these conventions:

 File: manual.info.tmp, Node: charset-server, Next: charset-database, Prev: charset-collation-names, Up: charset-syntax

10.3.2 Server Character Set and Collation

MySQL Server has a server character set and a server collation. By default, these are 'latin1' and 'latin1_swedish_ci', but they can be set explicitly at server startup on the command line or in an option file and changed at runtime.

Initially, the server character set and collation depend on the options that you use when you start *note 'mysqld': mysqld. You can use '--character-set-server' for the character set. Along with it, you can add '--collation-server' for the collation. If you don't specify a character set, that is the same as saying '--character-set-server=latin1'. If you specify only a character set (for example, 'latin1') but not a collation, that is the same as saying '--character-set-server=latin1' '--collation-server=latin1_swedish_ci' because 'latin1_swedish_ci' is the default collation for 'latin1'. Therefore, the following three commands all have the same effect:

 mysqld
 mysqld --character-set-server=latin1
 mysqld --character-set-server=latin1 \
   --collation-server=latin1_swedish_ci

One way to change the settings is by recompiling. To change the default server character set and collation when building from sources, use the 'DEFAULT_CHARSET' and 'DEFAULT_COLLATION' options for 'CMake'. For example:

 cmake . -DDEFAULT_CHARSET=latin1

Or:

 cmake . -DDEFAULT_CHARSET=latin1 \
   -DDEFAULT_COLLATION=latin1_german1_ci

Both *note 'mysqld': mysqld. and 'CMake' verify that the character set/collation combination is valid. If not, each program displays an error message and terminates.

The server character set and collation are used as default values if the database character set and collation are not specified in *note 'CREATE DATABASE': create-database. statements. They have no other purpose.

The current server character set and collation can be determined from the values of the 'character_set_server' and 'collation_server' system variables. These variables can be changed at runtime.

 File: manual.info.tmp, Node: charset-database, Next: charset-table, Prev: charset-server, Up: charset-syntax

10.3.3 Database Character Set and Collation

Every database has a database character set and a database collation. The note 'CREATE DATABASE': create-database. and note 'ALTER DATABASE': alter-database. statements have optional clauses for specifying the database character set and collation:

 CREATE DATABASE DB_NAME
     [[DEFAULT] CHARACTER SET CHARSET_NAME]
     [[DEFAULT] COLLATE COLLATION_NAME]

 ALTER DATABASE DB_NAME
     [[DEFAULT] CHARACTER SET CHARSET_NAME]
     [[DEFAULT] COLLATE COLLATION_NAME]

The keyword 'SCHEMA' can be used instead of 'DATABASE'.

All database options are stored in a text file named 'db.opt' that can be found in the database directory.

The 'CHARACTER SET' and 'COLLATE' clauses make it possible to create databases with different character sets and collations on the same MySQL server.

Example:

 CREATE DATABASE DB_NAME CHARACTER SET latin1 COLLATE latin1_swedish_ci;

MySQL chooses the database character set and database collation in the following manner:

The character set and collation for the default database can be determined from the values of the 'character_set_database' and 'collation_database' system variables. The server sets these variables whenever the default database changes. If there is no default database, the variables have the same value as the corresponding server-level system variables, 'character_set_server' and 'collation_server'.

To see the default character set and collation for a given database, use these statements:

 USE DB_NAME;
 SELECT @@character_set_database, @@collation_database;

Alternatively, to display the values without changing the default database:

 SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
 FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'DB_NAME';

The database character set and collation affect these aspects of server operation:

 File: manual.info.tmp, Node: charset-table, Next: charset-column, Prev: charset-database, Up: charset-syntax

10.3.4 Table Character Set and Collation

Every table has a table character set and a table collation. The note 'CREATE TABLE': create-table. and note 'ALTER TABLE': alter-table. statements have optional clauses for specifying the table character set and collation:

 CREATE TABLE TBL_NAME (COLUMN_LIST)
     [[DEFAULT] CHARACTER SET CHARSET_NAME]
     [COLLATE COLLATION_NAME]]

 ALTER TABLE TBL_NAME
     [[DEFAULT] CHARACTER SET CHARSET_NAME]
     [COLLATE COLLATION_NAME]

Example:

 CREATE TABLE t1 ( ... )
 CHARACTER SET latin1 COLLATE latin1_danish_ci;

MySQL chooses the table character set and collation in the following manner:

The table character set and collation are used as default values for column definitions if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.

 File: manual.info.tmp, Node: charset-column, Next: charset-literal, Prev: charset-table, Up: charset-syntax

10.3.5 Column Character Set and Collation

Every 'character' column (that is, a column of type note 'CHAR': char, note 'VARCHAR': char, a note 'TEXT': blob. type, or any synonym) has a column character set and a column collation. Column definition syntax for note 'CREATE TABLE': create-table. and *note 'ALTER TABLE': alter-table. has optional clauses for specifying the column character set and collation:

 COL_NAME {CHAR | VARCHAR | TEXT} (COL_LENGTH)
     [CHARACTER SET CHARSET_NAME]
     [COLLATE COLLATION_NAME]

These clauses can also be used for note 'ENUM': enum. and note 'SET': set. columns:

 COL_NAME {ENUM | SET} (VAL_LIST)
     [CHARACTER SET CHARSET_NAME]
     [COLLATE COLLATION_NAME]

Examples:

 CREATE TABLE t1
 (
     col1 VARCHAR(5)
       CHARACTER SET latin1
       COLLATE latin1_german1_ci
 );

 ALTER TABLE t1 MODIFY
     col1 VARCHAR(5)
       CHARACTER SET latin1
       COLLATE latin1_swedish_ci;

MySQL chooses the column character set and collation in the following manner:

The 'CHARACTER SET' and 'COLLATE' clauses are standard SQL.

If you use *note 'ALTER TABLE': alter-table. to convert a column from one character set to another, MySQL attempts to map the data values, but if the character sets are incompatible, there may be data loss.

 File: manual.info.tmp, Node: charset-literal, Next: charset-national, Prev: charset-column, Up: charset-syntax

10.3.6 Character String Literal Character Set and Collation

Every character string literal has a character set and a collation.

For the simple statement 'SELECT 'STRING'', the string has the connection default character set and collation defined by the 'character_set_connection' and 'collation_connection' system variables.

A character string literal may have an optional character set introducer and 'COLLATE' clause, to designate it as a string that uses a particular character set and collation:

 [_CHARSET_NAME]'STRING' [COLLATE COLLATION_NAME]

The '_CHARSET_NAME' expression is formally called an introducer. It tells the parser, 'the string that follows uses character set CHARSET_NAME.' An introducer does not change the string to the introducer character set like 'CONVERT()' would do. It does not change the string value, although padding may occur. The introducer is just a signal. See *note charset-introducer::.

Examples:

 SELECT 'abc';
 SELECT _latin1'abc';
 SELECT _binary'abc';
 SELECT _utf8'abc' COLLATE utf8_danish_ci;

Character set introducers and the 'COLLATE' clause are implemented according to standard SQL specifications.

MySQL determines the character set and collation of a character string literal in the following manner:

Examples:

An introducer indicates the character set for the following string, but does not change how the parser performs escape processing within the string. Escapes are always interpreted by the parser according to the character set given by 'character_set_connection'.

The following examples show that escape processing occurs using 'character_set_connection' even in the presence of an introducer. The examples use note 'SET NAMES': set-names. (which changes 'character_set_connection', as discussed in note charset-connection::), and display the resulting strings using the 'HEX()' function so that the exact string contents can be seen.

Example 1:

 mysql> SET NAMES latin1;
 mysql> SELECT HEX('a`\n'), HEX(_sjis'a`\n');
 +------------+-----------------+
 | HEX('a`\n')  | HEX(_sjis'a`\n')  |
 +------------+-----------------+
 | E00A       | E00A            |
 +------------+-----------------+

Here, 'a`' (hexadecimal value 'E0') is followed by '', the escape sequence for newline. The escape sequence is interpreted using the 'character_set_connection' value of 'latin1' to produce a literal newline (hexadecimal value '0A'). This happens even for the second string. That is, the '_sjis' introducer does not affect the parser's escape processing.

Example 2:

 mysql> SET NAMES sjis;
 mysql> SELECT HEX('a`\n'), HEX(_latin1'a`\n');
 +------------+-------------------+
 | HEX('a`\n')  | HEX(_latin1'a`\n')  |
 +------------+-------------------+
 | E05C6E     | E05C6E            |
 +------------+-------------------+

Here, 'character_set_connection' is 'sjis', a character set in which the sequence of 'a`' followed by '' (hexadecimal values '05' and '5C') is a valid multibyte character. Hence, the first two bytes of the string are interpreted as a single 'sjis' character, and the '' is not interpreted as an escape character. The following 'n' (hexadecimal value '6E') is not interpreted as part of an escape sequence. This is true even for the second string; the '_latin1' introducer does not affect escape processing.

 File: manual.info.tmp, Node: charset-national, Next: charset-introducer, Prev: charset-literal, Up: charset-syntax

10.3.7 The National Character Set

Standard SQL defines note 'NCHAR': char. or note 'NATIONAL CHAR': char. as a way to indicate that a *note 'CHAR': char. column should use some predefined character set. MySQL uses 'utf8' as this predefined character set. For example, these data type declarations are equivalent:

 CHAR(10) CHARACTER SET utf8
 NATIONAL CHARACTER(10)
 NCHAR(10)

As are these:

 VARCHAR(10) CHARACTER SET utf8
 NATIONAL VARCHAR(10)
 NVARCHAR(10)
 NCHAR VARCHAR(10)
 NATIONAL CHARACTER VARYING(10)
 NATIONAL CHAR VARYING(10)

You can use 'N'LITERAL'' (or 'n'LITERAL'') to create a string in the national character set. These statements are equivalent:

 SELECT N'some text';
 SELECT n'some text';
 SELECT _utf8'some text';

 File: manual.info.tmp, Node: charset-introducer, Next: charset-examples, Prev: charset-national, Up: charset-syntax

10.3.8 Character Set Introducers

A character string literal, hexadecimal literal, or bit-value literal may have an optional character set introducer and 'COLLATE' clause, to designate it as a string that uses a particular character set and collation:

 [_CHARSET_NAME] LITERAL [COLLATE COLLATION_NAME]

The '_CHARSET_NAME' expression is formally called an introducer. It tells the parser, 'the string that follows uses character set CHARSET_NAME.' An introducer does not change the string to the introducer character set like 'CONVERT()' would do. It does not change the string value, although padding may occur. The introducer is just a signal.

For character string literals, space between the introducer and the string is permitted but optional.

For character set literals, an introducer indicates the character set for the following string, but does not change how the parser performs escape processing within the string. Escapes are always interpreted by the parser according to the character set given by 'character_set_connection'. For additional discussion and examples, see *note charset-literal::.

Examples:

 SELECT 'abc';
 SELECT _latin1'abc';
 SELECT _binary'abc';
 SELECT _utf8'abc' COLLATE utf8_danish_ci;

 SELECT _latin1 X'4D7953514C';
 SELECT _utf8 0x4D7953514C COLLATE utf8_danish_ci;

 SELECT _latin1 b'1000001';
 SELECT _utf8 0b1000001 COLLATE utf8_danish_ci;

Character set introducers and the 'COLLATE' clause are implemented according to standard SQL specifications.

Character string literals can be designated as binary strings by using the '_binary' introducer. Hexadecimal literals and bit-value literals are binary strings by default, so '_binary' is permitted, but unnecessary.

MySQL determines the character set and collation of a character string literal, hexadecimal literal, or bit-value literal in the following manner:

Examples:

 File: manual.info.tmp, Node: charset-examples, Next: charset-compatibility, Prev: charset-introducer, Up: charset-syntax

10.3.9 Examples of Character Set and Collation Assignment

The following examples show how MySQL determines default character set and collation values.

Example 1: Table and Column Definition

 CREATE TABLE t1
 (
     c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci
 ) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;

Here we have a column with a 'latin1' character set and a 'latin1_german1_ci' collation. The definition is explicit, so that is straightforward. Notice that there is no problem with storing a 'latin1' column in a 'latin2' table.

Example 2: Table and Column Definition

 CREATE TABLE t1
 (
     c1 CHAR(10) CHARACTER SET latin1
 ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

This time we have a column with a 'latin1' character set and a default collation. Although it might seem natural, the default collation is not taken from the table level. Instead, because the default collation for 'latin1' is always 'latin1_swedish_ci', column 'c1' has a collation of 'latin1_swedish_ci' (not 'latin1_danish_ci').

Example 3: Table and Column Definition

 CREATE TABLE t1
 (
     c1 CHAR(10)
 ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

We have a column with a default character set and a default collation. In this circumstance, MySQL checks the table level to determine the column character set and collation. Consequently, the character set for column 'c1' is 'latin1' and its collation is 'latin1_danish_ci'.

Example 4: Database, Table, and Column Definition

 CREATE DATABASE d1
     DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_cs;
 USE d1;
 CREATE TABLE t1
 (
     c1 CHAR(10)
 );

We create a column without specifying its character set and collation. We're also not specifying a character set and a collation at the table level. In this circumstance, MySQL checks the database level to determine the table settings, which thereafter become the column settings.) Consequently, the character set for column 'c1' is 'latin2' and its collation is 'latin2_czech_cs'.

 File: manual.info.tmp, Node: charset-compatibility, Prev: charset-examples, Up: charset-syntax

10.3.10 Compatibility with Other DBMSs

For MaxDB compatibility these two statements are the same:

 CREATE TABLE t1 (f1 CHAR(N) UNICODE);
 CREATE TABLE t1 (f1 CHAR(N) CHARACTER SET ucs2);

 File: manual.info.tmp, Node: charset-connection, Next: charset-applications, Prev: charset-syntax, Up: charset