10.14 Adding a Collation to a Character Set

A collation is a set of rules that defines how to compare and sort character strings. Each collation in MySQL belongs to a single character set. Every character set has at least one collation, and most have two or more collations.

A collation orders characters based on weights. Each character in a character set maps to a weight. Characters with equal weights compare as equal, and characters with unequal weights compare according to the relative magnitude of their weights.

The 'WEIGHT_STRING()' function can be used to see the weights for the characters in a string. The value that it returns to indicate weights is a binary string, so it is convenient to use 'HEX(WEIGHT_STRING(STR))' to display the weights in printable form. The following example shows that weights do not differ for lettercase for the letters in ''AaBb'' if it is a nonbinary case-insensitive string, but do differ if it is a binary string:

 mysql> SELECT HEX(WEIGHT_STRING('AaBb' COLLATE latin1_swedish_ci));
 +------------------------------------------------------+
 | HEX(WEIGHT_STRING('AaBb' COLLATE latin1_swedish_ci)) |
 +------------------------------------------------------+
 | 41414242                                             |
 +------------------------------------------------------+
 mysql> SELECT HEX(WEIGHT_STRING(BINARY 'AaBb'));
 +-----------------------------------+
 | HEX(WEIGHT_STRING(BINARY 'AaBb')) |
 +-----------------------------------+
 | 41614262                          |
 +-----------------------------------+

MySQL supports several collation implementations, as discussed in *note charset-collation-implementations::. Some of these can be added to MySQL without recompiling:

The following sections describe how to add user-defined collations of the first two types to existing character sets. All existing character sets already have a binary collation, so there is no need here to describe how to add one.

Summary of the procedure for adding a new user-defined collation:

  1. Choose a collation ID.

  2. Add configuration information that names the collation and describes the character-ordering rules.

  3. Restart the server.

  4. Verify that the server recognizes the collation.

The instructions here cover only user-defined collations that can be added without recompiling MySQL. To add a collation that does require recompiling (as implemented by means of functions in a C source file), use the instructions in *note adding-character-set::. However, instead of adding all the information required for a complete character set, just modify the appropriate files for an existing character set. That is, based on what is already present for the character set's current collations, add data structures, functions, and configuration information for the new collation.

Note:

If you modify an existing user-defined collation, that may affect the ordering of rows for indexes on columns that use the collation. In this case, rebuild any such indexes to avoid problems such as incorrect query results. See *note rebuilding-tables::.

Additional Resources

 File: manual.info.tmp, Node: charset-collation-implementations, Next: adding-collation-choosing-id, Prev: adding-collation, Up: adding-collation

10.14.1 Collation Implementation Types

MySQL implements several types of collations:

Simple collations for 8-bit character sets

This kind of collation is implemented using an array of 256 weights that defines a one-to-one mapping from character codes to weights. 'latin1_swedish_ci' is an example. It is a case-insensitive collation, so the uppercase and lowercase versions of a character have the same weights and they compare as equal.

 mysql> SET NAMES 'latin1' COLLATE 'latin1_swedish_ci';
 Query OK, 0 rows affected (0.01 sec)

 mysql> SELECT HEX(WEIGHT_STRING('a')), HEX(WEIGHT_STRING('A'));
 +-------------------------+-------------------------+
 | HEX(WEIGHT_STRING('a')) | HEX(WEIGHT_STRING('A')) |
 +-------------------------+-------------------------+
 | 41                      | 41                      |
 +-------------------------+-------------------------+
 1 row in set (0.01 sec)

 mysql> SELECT 'a' = 'A';
 +-----------+
 | 'a' = 'A' |
 +-----------+
 |         1 |
 +-----------+
 1 row in set (0.12 sec)

For implementation instructions, see *note adding-collation-simple-8bit::.

Complex collations for 8-bit character sets

This kind of collation is implemented using functions in a C source file that define how to order characters, as described in *note adding-character-set::.

Collations for non-Unicode multibyte character sets

For this type of collation, 8-bit (single-byte) and multibyte characters are handled differently. For 8-bit characters, character codes map to weights in case-insensitive fashion. (For example, the single-byte characters ''a'' and ''A'' both have a weight of '0x41'.) For multibyte characters, there are two types of relationship between character codes and weights:

For implementation instructions, see *note adding-character-set::.

Collations for Unicode multibyte character sets

Some of these collations are based on the Unicode Collation Algorithm (UCA), others are not.

Non-UCA collations have a one-to-one mapping from character code to weight. In MySQL, such collations are case-insensitive and accent-insensitive. 'utf8_general_ci' is an example: ''a'', ''A'', ''A`'', and ''a''' each have different character codes but all have a weight of '0x0041' and compare as equal.

 mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
 Query OK, 0 rows affected (0.00 sec)

 mysql> CREATE TABLE t1
        (c1 CHAR(1) CHARACTER SET UTF8 COLLATE utf8_general_ci);
 Query OK, 0 rows affected (0.01 sec)

 mysql> INSERT INTO t1 VALUES ('a'),('A'),('A`'),('a'');
 Query OK, 4 rows affected (0.00 sec)
 Records: 4  Duplicates: 0  Warnings: 0

 mysql> SELECT c1, HEX(c1), HEX(WEIGHT_STRING(c1)) FROM t1;
 +------+---------+------------------------+
 | c1   | HEX(c1) | HEX(WEIGHT_STRING(c1)) |
 +------+---------+------------------------+
 | a    | 61      | 0041                   |
 | A    | 41      | 0041                   |
 | A`    | C380    | 0041                   |
 | a'    | C3A1    | 0041                   |
 +------+---------+------------------------+
 4 rows in set (0.00 sec)

UCA-based collations in MySQL have these properties:

A many-characters-to-many-weights mapping is also possible (this is contraction with expansion), but is not supported by MySQL.

For implementation instructions, for a non-UCA collation, see note adding-character-set::. For a UCA collation, see note adding-collation-unicode-uca::.

Miscellaneous collations

There are also a few collations that do not fall into any of the previous categories.

 File: manual.info.tmp, Node: adding-collation-choosing-id, Next: adding-collation-simple-8bit, Prev: charset-collation-implementations, Up: adding-collation

10.14.2 Choosing a Collation ID

Each collation must have a unique ID. To add a collation, you must choose an ID value that is not currently used. MySQL supports two-byte collation IDs. The range of IDs from 1024 to 2047 is reserved for user-defined collations.

The collation ID that you choose appears in these contexts:

To determine the largest currently used ID, issue the following statement:

 mysql> SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;
 +---------+
 | MAX(ID) |
 +---------+
 |     247 |
 +---------+

To display a list of all currently used IDs, issue this statement:

 mysql> SELECT ID FROM INFORMATION_SCHEMA.COLLATIONS ORDER BY ID;
 +-----+
 | ID  |
 +-----+
 |   1 |
 |   2 |
 | ... |
 |  52 |
 |  53 |
 |  57 |
 |  58 |
 | ... |
 |  98 |
 |  99 |
 | 128 |
 | 129 |
 | ... |
 | 247 |
 +-----+

Warning:

Before upgrading, you should save the configuration files that you change. If you upgrade in place, the process replaces the modified files.

 File: manual.info.tmp, Node: adding-collation-simple-8bit, Next: adding-collation-unicode-uca, Prev: adding-collation-choosing-id, Up: adding-collation

10.14.3 Adding a Simple Collation to an 8-Bit Character Set

This section describes how to add a simple collation for an 8-bit character set by writing the '' elements associated with a '' character set description in the MySQL 'Index.xml' file. The procedure described here does not require recompiling MySQL. The example adds a collation named 'latin1_test_ci' to the 'latin1' character set.

  1. Choose a collation ID, as shown in *note adding-collation-choosing-id::. The following steps use an ID of
  2. Modify the 'Index.xml' and 'latin1.xml' configuration files. These files are located in the directory named by the 'character_sets_dir' system variable. You can check the variable value as follows, although the path name might be different on your system:

      mysql> SHOW VARIABLES LIKE 'character_sets_dir';
      +--------------------+-----------------------------------------+
      | Variable_name      | Value                                   |
      +--------------------+-----------------------------------------+
      | character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
      +--------------------+-----------------------------------------+
  3. Choose a name for the collation and list it in the 'Index.xml' file. Find the '' element for the character set to which the collation is being added, and add a '' element that indicates the collation name and ID, to associate the name with the ID. For example:

      <charset name="latin1">
        ...
        <collation name="latin1_test_ci" id="1024"/>
        ...
      </charset>
  4. In the 'latin1.xml' configuration file, add a '' element that names the collation and that contains a '' element that defines a character code-to-weight mapping table for character codes 0 to 255. Each value within the '' element must be a number in hexadecimal format.

      <collation name="latin1_test_ci">
      <map>
       00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
       10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
       20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
       30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
       40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
       50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
       60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
       50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
       80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
       90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
       A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
       B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
       41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
       44 4E 4F 4F 4F 4F 5C D7 5C 55 55 55 59 59 DE DF
       41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
       44 4E 4F 4F 4F 4F 5C F7 5C 55 55 55 59 59 DE FF
      </map>
      </collation>
  5. Restart the server and use this statement to verify that the collation is present:

      mysql> SHOW COLLATION WHERE Collation = 'latin1_test_ci';
      +----------------+---------+------+---------+----------+---------+
      | Collation      | Charset | Id   | Default | Compiled | Sortlen |
      +----------------+---------+------+---------+----------+---------+
      | latin1_test_ci | latin1  | 1024 |         |          |       1 |
      +----------------+---------+------+---------+----------+---------+

 File: manual.info.tmp, Node: adding-collation-unicode-uca, Prev: adding-collation-simple-8bit, Up: adding-collation

10.14.4 Adding a UCA Collation to a Unicode Character Set

This section describes how to add a UCA collation for a Unicode character set by writing the '' element within a '' character set description in the MySQL 'Index.xml' file. The procedure described here does not require recompiling MySQL. It uses a subset of the Locale Data Markup Language (LDML) specification, which is available at http://www.unicode.org/reports/tr35/. With this method, you need not define the entire collation. Instead, you begin with an existing 'base' collation and describe the new collation in terms of how it differs from the base collation. The following table lists the base collations of the Unicode character sets for which UCA collations can be defined. It is not possible to create user-defined UCA collations for 'utf16le'; there is no 'utf16le_unicode_ci' collation that would serve as the basis for such collations.

MySQL Character Sets Available for User-Defined UCA Collations

Character Set Base Collation

'utf8' 'utf8_unicode_ci'

'ucs2' 'ucs2_unicode_ci'

'utf16' 'utf16_unicode_ci'

'utf32' 'utf32_unicode_ci'

The following sections show how to add a collation that is defined using LDML syntax, and provide a summary of LDML rules supported in MySQL.

 File: manual.info.tmp, Node: ldml-collation-example, Next: ldml-rules, Prev: adding-collation-unicode-uca, Up: adding-collation-unicode-uca

10.14.4.1 Defining a UCA Collation Using LDML Syntax ....................................................

To add a UCA collation for a Unicode character set without recompiling MySQL, use the following procedure. If you are unfamiliar with the LDML rules used to describe the collation's sort characteristics, see *note ldml-rules::.

The example adds a collation named 'utf8_phone_ci' to the 'utf8' character set. The collation is designed for a scenario involving a Web application for which users post their names and phone numbers. Phone numbers can be given in very different formats:

 +7-12345-67
 +7-12-345-67
 +7 12 345 67
 +7 (12) 345 67
 +71234567

The problem raised by dealing with these kinds of values is that the varying permissible formats make searching for a specific phone number very difficult. The solution is to define a new collation that reorders punctuation characters, making them ignorable.

  1. Choose a collation ID, as shown in *note adding-collation-choosing-id::. The following steps use an ID of
  2. To modify the 'Index.xml' configuration file. This file is located in the directory named by the 'character_sets_dir' system variable. You can check the variable value as follows, although the path name might be different on your system:

      mysql> SHOW VARIABLES LIKE 'character_sets_dir';
      +--------------------+-----------------------------------------+
      | Variable_name      | Value                                   |
      +--------------------+-----------------------------------------+
      | character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
      +--------------------+-----------------------------------------+
  3. Choose a name for the collation and list it in the 'Index.xml' file. In addition, you'll need to provide the collation ordering rules. Find the '' element for the character set to which the collation is being added, and add a '' element that indicates the collation name and ID, to associate the name with the ID. Within the '' element, provide a '' element containing the ordering rules:

      <charset name="utf8">
        ...
        <collation name="utf8_phone_ci" id="1029">
          <rules>
            <reset>\u0000</reset>
            <i>\u0020</i> <!-- space -->
            <i>\u0028</i> <!-- left parenthesis -->
            <i>\u0029</i> <!-- right parenthesis -->
            <i>\u002B</i> <!-- plus -->
            <i>\u002D</i> <!-- hyphen -->
          </rules>
        </collation>
        ...
      </charset>
  4. If you want a similar collation for other Unicode character sets, add other '' elements. For example, to define 'ucs2_phone_ci', add a '' element to the '' element. Remember that each collation must have its own unique ID.

  5. Restart the server and use this statement to verify that the collation is present:

      mysql> SHOW COLLATION WHERE Collation = 'utf8_phone_ci';
      +---------------+---------+------+---------+----------+---------+
      | Collation     | Charset | Id   | Default | Compiled | Sortlen |
      +---------------+---------+------+---------+----------+---------+
      | utf8_phone_ci | utf8    | 1029 |         |          |       8 |
      +---------------+---------+------+---------+----------+---------+

Now test the collation to make sure that it has the desired properties.

Create a table containing some sample phone numbers using the new collation:

 mysql> CREATE TABLE phonebook (
          name VARCHAR(64),
          phone VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_phone_ci
        );
 Query OK, 0 rows affected (0.09 sec)

 mysql> INSERT INTO phonebook VALUES ('Svoj','+7 912 800 80 02');
 Query OK, 1 row affected (0.00 sec)

 mysql> INSERT INTO phonebook VALUES ('Hf','+7 (912) 800 80 04');
 Query OK, 1 row affected (0.00 sec)

 mysql> INSERT INTO phonebook VALUES ('Bar','+7-912-800-80-01');
 Query OK, 1 row affected (0.00 sec)

 mysql> INSERT INTO phonebook VALUES ('Ramil','(7912) 800 80 03');
 Query OK, 1 row affected (0.00 sec)

 mysql> INSERT INTO phonebook VALUES ('Sanja','+380 (912) 8008005');
 Query OK, 1 row affected (0.00 sec)

Run some queries to see whether the ignored punctuation characters are in fact ignored for comparison and sorting:

 mysql> SELECT * FROM phonebook ORDER BY phone;
 +-------+--------------------+
 | name  | phone              |
 +-------+--------------------+
 | Sanja | +380 (912) 8008005 |
 | Bar   | +7-912-800-80-01   |
 | Svoj  | +7 912 800 80 02   |
 | Ramil | (7912) 800 80 03   |
 | Hf    | +7 (912) 800 80 04 |
 +-------+--------------------+
 5 rows in set (0.00 sec)

 mysql> SELECT * FROM phonebook WHERE phone='+7(912)800-80-01';
 +------+------------------+
 | name | phone            |
 +------+------------------+
 | Bar  | +7-912-800-80-01 |
 +------+------------------+
 1 row in set (0.00 sec)

 mysql> SELECT * FROM phonebook WHERE phone='79128008001';
 +------+------------------+
 | name | phone            |
 +------+------------------+
 | Bar  | +7-912-800-80-01 |
 +------+------------------+
 1 row in set (0.00 sec)

 mysql> SELECT * FROM phonebook WHERE phone='7 9 1 2 8 0 0 8 0 0 1';
 +------+------------------+
 | name | phone            |
 +------+------------------+
 | Bar  | +7-912-800-80-01 |
 +------+------------------+
 1 row in set (0.00 sec)

 File: manual.info.tmp, Node: ldml-rules, Next: collation-diagnostics, Prev: ldml-collation-example, Up: adding-collation-unicode-uca

10.14.4.2 LDML Syntax Supported in MySQL ........................................

This section describes the LDML syntax that MySQL recognizes. This is a subset of the syntax described in the LDML specification available at http://www.unicode.org/reports/tr35/, which should be consulted for further information. MySQL recognizes a large enough subset of the syntax that, in many cases, it is possible to download a collation definition from the Unicode Common Locale Data Repository and paste the relevant part (that is, the part between the '' and '' tags) into the MySQL 'Index.xml' file. The rules described here are all supported except that character sorting occurs only at the primary level. Rules that specify differences at secondary or higher sort levels are recognized (and thus can be included in collation definitions) but are treated as equality at the primary level.

The MySQL server generates diagnostics when it finds problems while parsing the 'Index.xml' file. See *note collation-diagnostics::.

Character Representation

Characters named in LDML rules can be written literally or in '' format, where NNNN is the hexadecimal Unicode code point value. For example, 'A' and 'a'' can be written literally or as '041' and '0E1'. Within hexadecimal values, the digits 'A' through 'F' are not case-sensitive; '0E1' and '0e1' are equivalent. For UCA 4.0.0 collations, hexadecimal notation can be used only for characters in the Basic Multilingual Plane, not for characters outside the BMP range of '0000' to 'FFFF'. For UCA 5.2.0 collations, hexadecimal notation can be used for any character.

The 'Index.xml' file itself should be written using UTF-8 encoding.

Syntax Rules

LDML has reset rules and shift rules to specify character ordering. Orderings are given as a set of rules that begin with a reset rule that establishes an anchor point, followed by shift rules that indicate how characters sort relative to the anchor point.

MySQL-Specific LDML Extensions

An extension to LDML rules permits the '' element to include an optional 'version' attribute in '' tags to indicate the UCA version on which the collation is based. If the 'version' attribute is omitted, its default value is '4.0.0'. For example, this specification indicates a collation that is based on UCA 5.2.0:

 <collation id="NNN" name="utf8_XXX_ci" version="5.2.0">
 ...
 </collation>

 File: manual.info.tmp, Node: collation-diagnostics, Prev: ldml-rules, Up: adding-collation-unicode-uca

10.14.4.3 Diagnostics During Index.xml Parsing ..............................................

The MySQL server generates diagnostics when it finds problems while parsing the 'Index.xml' file:

  • Unknown tags are written to the error log. For example, the following message results if a collation definition contains a '' tag:

      [Warning] Buffered warning: Unknown LDML tag:
      'charsets/charset/collation/rules/aaa'
  • If collation initialization is not possible, the server reports an 'Unknown collation' error, and also generates warnings explaining the problems, such as in the previous example. In other cases, when a collation description is generally correct but contains some unknown tags, the collation is initialized and is available for use. The unknown parts are ignored, but a warning is generated in the error log.

  • Problems with collations generate warnings that clients can display with *note 'SHOW WARNINGS': show-warnings. Suppose that a reset rule contains an expansion longer than the maximum supported length of 6 characters:

      <reset>abcdefghi</reset>
      <i>x</i>

    An attempt to use the collation produces warnings:

      mysql> SELECT _utf8'test' COLLATE utf8_test_ci;
      ERROR 1273 (HY000): Unknown collation: 'utf8_test_ci'
      mysql> SHOW WARNINGS;
      +---------+------+----------------------------------------+
      | Level   | Code | Message                                |
      +---------+------+----------------------------------------+
      | Error   | 1273 | Unknown collation: 'utf8_test_ci'      |
      | Warning | 1273 | Expansion is too long at 'abcdefghi=x' |
      +---------+------+----------------------------------------+

 File: manual.info.tmp, Node: charset-configuration, Next: locale-support, Prev: adding-collation, Up: charset