10.14 Adding a Collation to a Character Set

Menu:
charset-collation-implementations:: Collation Implementation Types
adding-collation-choosing-id:: Choosing a Collation ID
adding-collation-simple-8bit:: Adding a Simple Collation to an 8-Bit Character Set
adding-collation-unicode-uca:: Adding a UCA Collation to a Unicode Character Set

A collation is a set of rules that defines how to compare and sort character strings. Each collation in MySQL belongs to a single character set. Every character set has at least one collation, and most have two or more collations.

A collation orders characters based on weights. Each character in a character set maps to a weight. Characters with equal weights compare as equal, and characters with unequal weights compare according to the relative magnitude of their weights.

The 'WEIGHT_STRING()' function can be used to see the weights for the characters in a string. The value that it returns to indicate weights is a binary string, so it is convenient to use 'HEX(WEIGHT_STRING(STR))' to display the weights in printable form. The following example shows that weights do not differ for lettercase for the letters in ''AaBb'' if it is a nonbinary case-insensitive string, but do differ if it is a binary string:

 mysql> SELECT HEX(WEIGHT_STRING('AaBb' COLLATE latin1_swedish_ci));
 +------------------------------------------------------+
 | HEX(WEIGHT_STRING('AaBb' COLLATE latin1_swedish_ci)) |
 +------------------------------------------------------+
 | 41414242                                             |
 +------------------------------------------------------+
 mysql> SELECT HEX(WEIGHT_STRING(BINARY 'AaBb'));
 +-----------------------------------+
 | HEX(WEIGHT_STRING(BINARY 'AaBb')) |
 +-----------------------------------+
 | 41614262                          |
 +-----------------------------------+

MySQL supports several collation implementations, as discussed in *note charset-collation-implementations::. Some of these can be added to MySQL without recompiling:

Simple collations for 8-bit character sets.
UCA-based collations for Unicode character sets.
Binary ('XXX_bin') collations.

The following sections describe how to add user-defined collations of the first two types to existing character sets. All existing character sets already have a binary collation, so there is no need here to describe how to add one.

Summary of the procedure for adding a new user-defined collation:

Choose a collation ID.
Add configuration information that names the collation and describes the character-ordering rules.
Restart the server.
Verify that the server recognizes the collation.

The instructions here cover only user-defined collations that can be added without recompiling MySQL. To add a collation that does require recompiling (as implemented by means of functions in a C source file), use the instructions in *note adding-character-set::. However, instead of adding all the information required for a complete character set, just modify the appropriate files for an existing character set. That is, based on what is already present for the character set's current collations, add data structures, functions, and configuration information for the new collation.

Note:

If you modify an existing user-defined collation, that may affect the ordering of rows for indexes on columns that use the collation. In this case, rebuild any such indexes to avoid problems such as incorrect query results. See *note rebuilding-tables::.

Additional Resources

Example showing how to add a collation for full-text searches: *note full-text-adding-collation::
The Unicode Collation Algorithm (UCA) specification: http://www.unicode.org/reports/tr10/
The Locale Data Markup Language (LDML) specification: http://www.unicode.org/reports/tr35/

File: manual.info.tmp, Node: charset-collation-implementations, Next: adding-collation-choosing-id, Prev: adding-collation, Up: adding-collation

10.14.1 Collation Implementation Types

MySQL implements several types of collations:

Simple collations for 8-bit character sets

This kind of collation is implemented using an array of 256 weights that defines a one-to-one mapping from character codes to weights. 'latin1_swedish_ci' is an example. It is a case-insensitive collation, so the uppercase and lowercase versions of a character have the same weights and they compare as equal.

 mysql> SET NAMES 'latin1' COLLATE 'latin1_swedish_ci';
 Query OK, 0 rows affected (0.01 sec)

 mysql> SELECT HEX(WEIGHT_STRING('a')), HEX(WEIGHT_STRING('A'));
 +-------------------------+-------------------------+
 | HEX(WEIGHT_STRING('a')) | HEX(WEIGHT_STRING('A')) |
 +-------------------------+-------------------------+
 | 41                      | 41                      |
 +-------------------------+-------------------------+
 1 row in set (0.01 sec)

 mysql> SELECT 'a' = 'A';
 +-----------+
 | 'a' = 'A' |
 +-----------+
 |         1 |
 +-----------+
 1 row in set (0.12 sec)

For implementation instructions, see *note adding-collation-simple-8bit::.

Complex collations for 8-bit character sets

This kind of collation is implemented using functions in a C source file that define how to order characters, as described in *note adding-character-set::.

Collations for non-Unicode multibyte character sets

For this type of collation, 8-bit (single-byte) and multibyte characters are handled differently. For 8-bit characters, character codes map to weights in case-insensitive fashion. (For example, the single-byte characters ''a'' and ''A'' both have a weight of '0x41'.) For multibyte characters, there are two types of relationship between character codes and weights:

Weights equal character codes. 'sjis_japanese_ci' is an example of this kind of collation. The multibyte character ''ぢ'' has a character code of '0x82C0', and the weight is also '0x82C0'.

  mysql> CREATE TABLE t1
         (c1 VARCHAR(2) CHARACTER SET sjis COLLATE sjis_japanese_ci);
  Query OK, 0 rows affected (0.01 sec)

  mysql> INSERT INTO t1 VALUES ('a'),('A'),(0x82C0);
  Query OK, 3 rows affected (0.00 sec)
  Records: 3  Duplicates: 0  Warnings: 0

  mysql> SELECT c1, HEX(c1), HEX(WEIGHT_STRING(c1)) FROM t1;
  +------+---------+------------------------+
  | c1   | HEX(c1) | HEX(WEIGHT_STRING(c1)) |
  +------+---------+------------------------+
  | a    | 61      | 41                     |
  | A    | 41      | 41                     |
  | ぢ    | 82C0    | 82C0                   |
  +------+---------+------------------------+
  3 rows in set (0.00 sec)

Character codes map one-to-one to weights, but a code is not necessarily equal to the weight. 'gbk_chinese_ci' is an example of this kind of collation. The multibyte character ''膰'' has a character code of '0x81B0' but a weight of '0xC286'.

  mysql> CREATE TABLE t1
         (c1 VARCHAR(2) CHARACTER SET gbk COLLATE gbk_chinese_ci);
  Query OK, 0 rows affected (0.33 sec)

  mysql> INSERT INTO t1 VALUES ('a'),('A'),(0x81B0);
  Query OK, 3 rows affected (0.00 sec)
  Records: 3  Duplicates: 0  Warnings: 0

  mysql> SELECT c1, HEX(c1), HEX(WEIGHT_STRING(c1)) FROM t1;
  +------+---------+------------------------+
  | c1   | HEX(c1) | HEX(WEIGHT_STRING(c1)) |
  +------+---------+------------------------+
  | a    | 61      | 41                     |
  | A    | 41      | 41                     |
  | 膰    | 81B0    | C286                   |
  +------+---------+------------------------+
  3 rows in set (0.00 sec)

For implementation instructions, see *note adding-character-set::.

Collations for Unicode multibyte character sets

Some of these collations are based on the Unicode Collation Algorithm (UCA), others are not.

Non-UCA collations have a one-to-one mapping from character code to weight. In MySQL, such collations are case-insensitive and accent-insensitive. 'utf8_general_ci' is an example: ''a'', ''A'', ''A`'', and ''a''' each have different character codes but all have a weight of '0x0041' and compare as equal.

 mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
 Query OK, 0 rows affected (0.00 sec)

 mysql> CREATE TABLE t1
        (c1 CHAR(1) CHARACTER SET UTF8 COLLATE utf8_general_ci);
 Query OK, 0 rows affected (0.01 sec)

 mysql> INSERT INTO t1 VALUES ('a'),('A'),('A`'),('a'');
 Query OK, 4 rows affected (0.00 sec)
 Records: 4  Duplicates: 0  Warnings: 0

 mysql> SELECT c1, HEX(c1), HEX(WEIGHT_STRING(c1)) FROM t1;
 +------+---------+------------------------+
 | c1   | HEX(c1) | HEX(WEIGHT_STRING(c1)) |
 +------+---------+------------------------+
 | a    | 61      | 0041                   |
 | A    | 41      | 0041                   |
 | A`    | C380    | 0041                   |
 | a'    | C3A1    | 0041                   |
 +------+---------+------------------------+
 4 rows in set (0.00 sec)

UCA-based collations in MySQL have these properties:

If a character has weights, each weight uses 2 bytes (16 bits).
A character may have zero weights (or an empty weight). In this case, the character is ignorable. Example: "U+0000 NULL" does not have a weight and is ignorable.

A character may have one weight. Example: ''a'' has a weight of '0x0E33'.

  mysql> SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
  Query OK, 0 rows affected (0.05 sec)

  mysql> SELECT HEX('a'), HEX(WEIGHT_STRING('a'));
  +----------+-------------------------+
  | HEX('a') | HEX(WEIGHT_STRING('a')) |
  +----------+-------------------------+
  | 61       | 0E33                    |
  +----------+-------------------------+
  1 row in set (0.02 sec)

A character may have many weights. This is an expansion. Example: The German letter ''ss'' (SZ ligature, or SHARP S) has a weight of '0x0FEA0FEA'.

  mysql> SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
  Query OK, 0 rows affected (0.11 sec)

  mysql> SELECT HEX('ss'), HEX(WEIGHT_STRING('ss'));
  +-----------+--------------------------+
  | HEX('ss')  | HEX(WEIGHT_STRING('ss'))  |
  +-----------+--------------------------+
  | C39F      | 0FEA0FEA                 |
  +-----------+--------------------------+
  1 row in set (0.00 sec)

Many characters may have one weight. This is a contraction. Example: ''ch'' is a single letter in Czech and has a weight of '0x0EE2'.

  mysql> SET NAMES 'utf8' COLLATE 'utf8_czech_ci';
  Query OK, 0 rows affected (0.09 sec)

  mysql> SELECT HEX('ch'), HEX(WEIGHT_STRING('ch'));
  +-----------+--------------------------+
  | HEX('ch') | HEX(WEIGHT_STRING('ch')) |
  +-----------+--------------------------+
  | 6368      | 0EE2                     |
  +-----------+--------------------------+
  1 row in set (0.00 sec)

A many-characters-to-many-weights mapping is also possible (this is contraction with expansion), but is not supported by MySQL.

For implementation instructions, for a non-UCA collation, see note adding-character-set::. For a UCA collation, see note adding-collation-unicode-uca::.

Miscellaneous collations

There are also a few collations that do not fall into any of the previous categories.

File: manual.info.tmp, Node: adding-collation-choosing-id, Next: adding-collation-simple-8bit, Prev: charset-collation-implementations, Up: adding-collation

10.14.2 Choosing a Collation ID

Each collation must have a unique ID. To add a collation, you must choose an ID value that is not currently used. MySQL supports two-byte collation IDs. The range of IDs from 1024 to 2047 is reserved for user-defined collations.

The collation ID that you choose appears in these contexts:

The 'ID' column of the Information Schema *note 'COLLATIONS': information-schema-collations-table. table.
The 'Id' column of *note 'SHOW COLLATION': show-collation. output.
The 'charsetnr' member of the 'MYSQL_FIELD' C API data structure.
The 'number' member of the 'MY_CHARSET_INFO' data structure returned by the 'mysql_get_character_set_info()' (https://dev.mysql.com/doc/c-api/5.7/en/mysql-get-character-set-info.html) C API function.

To determine the largest currently used ID, issue the following statement:

 mysql> SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;
 +---------+
 | MAX(ID) |
 +---------+
 |     247 |
 +---------+

To display a list of all currently used IDs, issue this statement:

 mysql> SELECT ID FROM INFORMATION_SCHEMA.COLLATIONS ORDER BY ID;
 +-----+
 | ID  |
 +-----+
 |   1 |
 |   2 |
 | ... |
 |  52 |
 |  53 |
 |  57 |
 |  58 |
 | ... |
 |  98 |
 |  99 |
 | 128 |
 | 129 |
 | ... |
 | 247 |
 +-----+

Warning:

Before upgrading, you should save the configuration files that you change. If you upgrade in place, the process replaces the modified files.

File: manual.info.tmp, Node: adding-collation-simple-8bit, Next: adding-collation-unicode-uca, Prev: adding-collation-choosing-id, Up: adding-collation

10.14.3 Adding a Simple Collation to an 8-Bit Character Set

This section describes how to add a simple collation for an 8-bit character set by writing the '' elements associated with a '' character set description in the MySQL 'Index.xml' file. The procedure described here does not require recompiling MySQL. The example adds a collation named 'latin1_test_ci' to the 'latin1' character set.

Choose a collation ID, as shown in *note adding-collation-choosing-id::. The following steps use an ID of

Modify the 'Index.xml' and 'latin1.xml' configuration files. These files are located in the directory named by the 'character_sets_dir' system variable. You can check the variable value as follows, although the path name might be different on your system:

  mysql> SHOW VARIABLES LIKE 'character_sets_dir';
  +--------------------+-----------------------------------------+
  | Variable_name      | Value                                   |
  +--------------------+-----------------------------------------+
  | character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
  +--------------------+-----------------------------------------+

Choose a name for the collation and list it in the 'Index.xml' file. Find the '' element for the character set to which the collation is being added, and add a '' element that indicates the collation name and ID, to associate the name with the ID. For example:
```
 <charset name="latin1">
 ...
 <collation name="latin1_test_ci" id="1024"/>
 ...
 </charset>
```

In the 'latin1.xml' configuration file, add a '' element that names the collation and that contains a '' element that defines a character code-to-weight mapping table for character codes 0 to 255. Each value within the '' element must be a number in hexadecimal format.

  <collation name="latin1_test_ci">
  <map>
   00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
   10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
   20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
   30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
   40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
   50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
   60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
   50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
   80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
   90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
   A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
   B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
   41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
   44 4E 4F 4F 4F 4F 5C D7 5C 55 55 55 59 59 DE DF
   41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
   44 4E 4F 4F 4F 4F 5C F7 5C 55 55 55 59 59 DE FF
  </map>
  </collation>

Restart the server and use this statement to verify that the collation is present:

  mysql> SHOW COLLATION WHERE Collation = 'latin1_test_ci';
  +----------------+---------+------+---------+----------+---------+
  | Collation      | Charset | Id   | Default | Compiled | Sortlen |
  +----------------+---------+------+---------+----------+---------+
  | latin1_test_ci | latin1  | 1024 |         |          |       1 |
  +----------------+---------+------+---------+----------+---------+

File: manual.info.tmp, Node: adding-collation-unicode-uca, Prev: adding-collation-simple-8bit, Up: adding-collation

10.14.4 Adding a UCA Collation to a Unicode Character Set

Menu:
ldml-collation-example:: Defining a UCA Collation Using LDML Syntax
ldml-rules:: LDML Syntax Supported in MySQL
collation-diagnostics:: Diagnostics During Index.xml Parsing

This section describes how to add a UCA collation for a Unicode character set by writing the '' element within a '' character set description in the MySQL 'Index.xml' file. The procedure described here does not require recompiling MySQL. It uses a subset of the Locale Data Markup Language (LDML) specification, which is available at http://www.unicode.org/reports/tr35/. With this method, you need not define the entire collation. Instead, you begin with an existing 'base' collation and describe the new collation in terms of how it differs from the base collation. The following table lists the base collations of the Unicode character sets for which UCA collations can be defined. It is not possible to create user-defined UCA collations for 'utf16le'; there is no 'utf16le_unicode_ci' collation that would serve as the basis for such collations.

MySQL Character Sets Available for User-Defined UCA Collations

Character Set Base Collation

'utf8' 'utf8_unicode_ci'

'ucs2' 'ucs2_unicode_ci'

'utf16' 'utf16_unicode_ci'

'utf32' 'utf32_unicode_ci'

The following sections show how to add a collation that is defined using LDML syntax, and provide a summary of LDML rules supported in MySQL.

File: manual.info.tmp, Node: ldml-collation-example, Next: ldml-rules, Prev: adding-collation-unicode-uca, Up: adding-collation-unicode-uca

10.14.4.1 Defining a UCA Collation Using LDML Syntax ....................................................

To add a UCA collation for a Unicode character set without recompiling MySQL, use the following procedure. If you are unfamiliar with the LDML rules used to describe the collation's sort characteristics, see *note ldml-rules::.

The example adds a collation named 'utf8_phone_ci' to the 'utf8' character set. The collation is designed for a scenario involving a Web application for which users post their names and phone numbers. Phone numbers can be given in very different formats:

 +7-12345-67
 +7-12-345-67
 +7 12 345 67
 +7 (12) 345 67
 +71234567

The problem raised by dealing with these kinds of values is that the varying permissible formats make searching for a specific phone number very difficult. The solution is to define a new collation that reorders punctuation characters, making them ignorable.

Choose a collation ID, as shown in *note adding-collation-choosing-id::. The following steps use an ID of

To modify the 'Index.xml' configuration file. This file is located in the directory named by the 'character_sets_dir' system variable. You can check the variable value as follows, although the path name might be different on your system:

  mysql> SHOW VARIABLES LIKE 'character_sets_dir';
  +--------------------+-----------------------------------------+
  | Variable_name      | Value                                   |
  +--------------------+-----------------------------------------+
  | character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
  +--------------------+-----------------------------------------+

Choose a name for the collation and list it in the 'Index.xml' file. In addition, you'll need to provide the collation ordering rules. Find the '' element for the character set to which the collation is being added, and add a '' element that indicates the collation name and ID, to associate the name with the ID. Within the '' element, provide a '' element containing the ordering rules:
```
 <charset name="utf8">
 ...
 <collation name="utf8_phone_ci" id="1029">
 <rules>
 <reset>\u0000</reset>
 \u0020 
 \u0028 
 \u0029 
 \u002B 
 \u002D 
 </rules>
 </collation>
 ...
 </charset>
```
If you want a similar collation for other Unicode character sets, add other '' elements. For example, to define 'ucs2_phone_ci', add a '' element to the '' element. Remember that each collation must have its own unique ID.

Restart the server and use this statement to verify that the collation is present:

  mysql> SHOW COLLATION WHERE Collation = 'utf8_phone_ci';
  +---------------+---------+------+---------+----------+---------+
  | Collation     | Charset | Id   | Default | Compiled | Sortlen |
  +---------------+---------+------+---------+----------+---------+
  | utf8_phone_ci | utf8    | 1029 |         |          |       8 |
  +---------------+---------+------+---------+----------+---------+

Now test the collation to make sure that it has the desired properties.

Create a table containing some sample phone numbers using the new collation:

 mysql> CREATE TABLE phonebook (
          name VARCHAR(64),
          phone VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_phone_ci
        );
 Query OK, 0 rows affected (0.09 sec)

 mysql> INSERT INTO phonebook VALUES ('Svoj','+7 912 800 80 02');
 Query OK, 1 row affected (0.00 sec)

 mysql> INSERT INTO phonebook VALUES ('Hf','+7 (912) 800 80 04');
 Query OK, 1 row affected (0.00 sec)

 mysql> INSERT INTO phonebook VALUES ('Bar','+7-912-800-80-01');
 Query OK, 1 row affected (0.00 sec)

 mysql> INSERT INTO phonebook VALUES ('Ramil','(7912) 800 80 03');
 Query OK, 1 row affected (0.00 sec)

 mysql> INSERT INTO phonebook VALUES ('Sanja','+380 (912) 8008005');
 Query OK, 1 row affected (0.00 sec)

Run some queries to see whether the ignored punctuation characters are in fact ignored for comparison and sorting:

 mysql> SELECT * FROM phonebook ORDER BY phone;
 +-------+--------------------+
 | name  | phone              |
 +-------+--------------------+
 | Sanja | +380 (912) 8008005 |
 | Bar   | +7-912-800-80-01   |
 | Svoj  | +7 912 800 80 02   |
 | Ramil | (7912) 800 80 03   |
 | Hf    | +7 (912) 800 80 04 |
 +-------+--------------------+
 5 rows in set (0.00 sec)

 mysql> SELECT * FROM phonebook WHERE phone='+7(912)800-80-01';
 +------+------------------+
 | name | phone            |
 +------+------------------+
 | Bar  | +7-912-800-80-01 |
 +------+------------------+
 1 row in set (0.00 sec)

 mysql> SELECT * FROM phonebook WHERE phone='79128008001';
 +------+------------------+
 | name | phone            |
 +------+------------------+
 | Bar  | +7-912-800-80-01 |
 +------+------------------+
 1 row in set (0.00 sec)

 mysql> SELECT * FROM phonebook WHERE phone='7 9 1 2 8 0 0 8 0 0 1';
 +------+------------------+
 | name | phone            |
 +------+------------------+
 | Bar  | +7-912-800-80-01 |
 +------+------------------+
 1 row in set (0.00 sec)

File: manual.info.tmp, Node: ldml-rules, Next: collation-diagnostics, Prev: ldml-collation-example, Up: adding-collation-unicode-uca

10.14.4.2 LDML Syntax Supported in MySQL ........................................

This section describes the LDML syntax that MySQL recognizes. This is a subset of the syntax described in the LDML specification available at http://www.unicode.org/reports/tr35/, which should be consulted for further information. MySQL recognizes a large enough subset of the syntax that, in many cases, it is possible to download a collation definition from the Unicode Common Locale Data Repository and paste the relevant part (that is, the part between the '' and '' tags) into the MySQL 'Index.xml' file. The rules described here are all supported except that character sorting occurs only at the primary level. Rules that specify differences at secondary or higher sort levels are recognized (and thus can be included in collation definitions) but are treated as equality at the primary level.

The MySQL server generates diagnostics when it finds problems while parsing the 'Index.xml' file. See *note collation-diagnostics::.

Character Representation

Characters named in LDML rules can be written literally or in '' format, where NNNN is the hexadecimal Unicode code point value. For example, 'A' and 'a'' can be written literally or as '041' and '0E1'. Within hexadecimal values, the digits 'A' through 'F' are not case-sensitive; '0E1' and '0e1' are equivalent. For UCA 4.0.0 collations, hexadecimal notation can be used only for characters in the Basic Multilingual Plane, not for characters outside the BMP range of '0000' to 'FFFF'. For UCA 5.2.0 collations, hexadecimal notation can be used for any character.

The 'Index.xml' file itself should be written using UTF-8 encoding.

Syntax Rules

LDML has reset rules and shift rules to specify character ordering. Orderings are given as a set of rules that begin with a reset rule that establishes an anchor point, followed by shift rules that indicate how characters sort relative to the anchor point.

A '' rule does not specify any ordering in and of itself. Instead, it 'resets' the ordering for subsequent shift rules to cause them to be taken in relation to a given character. Either of the following rules resets subsequent shift rules to be taken in relation to the letter ''A'':
```
 <reset>A</reset>

 <reset>\u0041</reset>
```

The '

', '~~', and '' shift rules define primary, secondary, and tertiary differences of a character from another character:~~

* Use primary differences to distinguish separate letters. * Use secondary differences to distinguish accent variations. * Use tertiary differences to distinguish lettercase variations.

Either of these rules specifies a primary shift rule for the ''G'' character:

G \u0047

The '' shift rule indicates that one character sorts identically to another. The following rules cause ''b'' to sort the same as ''a'':

<reset>a</reset> b

Abbreviated shift syntax specifies multiple shift rules using a single pair of tags. The following table shows the correspondence between abbreviated syntax rules and the equivalent nonabbreviated rules.

Abbreviated Shift Syntax

Abbreviated Syntax Nonabbreviated Syntax
'xyz' '
x

y

z

'

'xyz' 'xyz'

'xyz' 'xyz'

'xyz' 'xyz'

An expansion is a reset rule that establishes an anchor point for a multiple-character sequence. MySQL supports expansions 2 to 6 characters long. The following rules put ''z'' greater at the primary level than the sequence of three characters ''abc'':

<reset>abc</reset> z

A contraction is a shift rule that sorts a multiple-character sequence. MySQL supports contractions 2 to 6 characters long. The following rules put the sequence of three characters ''xyz'' greater at the primary level than ''a'':

<reset>a</reset> xyz

Long expansions and long contractions can be used together. These rules put the sequence of three characters ''xyz'' greater at the primary level than the sequence of three characters ''abc'':

<reset>abc</reset> xyz

Normal expansion syntax uses '' plus '' elements to specify an expansion. The following rules put the character ''k'' greater at the secondary level than the sequence ''ch''. That is, ''k'' behaves as if it expands to a character after ''c'' followed by ''h'':

<reset>c</reset> <x><s>k</s><extend>h</extend></x>

This syntax permits long sequences. These rules sort the sequence ''ccs'' greater at the tertiary level than the sequence ''cscs'':

<reset>cs</reset> <x><t>ccs</t><extend>cs</extend></x>

The LDML specification describes normal expansion syntax as 'tricky.' See that specification for details.

Previous context syntax uses '' plus '' elements to specify that the context before a character affects how it sorts. The following rules put ''-'' greater at the secondary level than ''a'', but only when ''-'' occurs after ''b'':

<reset>a</reset> <x><context>b</context><s>-</s></x>

Previous context syntax can include the '' element. These rules put ''def'' greater at the primary level than ''aghi'', but only when ''def'' comes after ''abc'':

<reset>a</reset> <x><context>abc</context>def<extend>ghi</extend></x>

Reset rules permit a 'before' attribute. Normally, shift rules after a reset rule indicate characters that sort after the reset character. Shift rules after a reset rule that has the 'before' attribute indicate characters that sort before the reset character. The following rules put the character ''b'' immediately before ''a'' at the primary level:

<reset before="primary">a</reset> b

Permissible 'before' attribute values specify the sort level by name or the equivalent numeric value:

<reset before="primary"> <reset before="1"> <reset before="secondary"> <reset before="2"> <reset before="tertiary"> <reset before="3">

A reset rule can name a logical reset position rather than a literal character:

<first_tertiary_ignorable/> <last_tertiary_ignorable/> <first_secondary_ignorable/> <last_secondary_ignorable/> <first_primary_ignorable/> <last_primary_ignorable/> <first_variable/> <last_variable/> <first_non_ignorable/> <last_non_ignorable/> <first_trailing/> <last_trailing/>

These rules put ''z'' greater at the primary level than nonignorable characters that have a Default Unicode Collation Element Table (DUCET) entry and that are not CJK:

<reset><last_non_ignorable/></reset> z

Logical positions have the code points shown in the following table.

Logical Reset Position Code Points

Logical Position Unicode 4.0.0 Code Unicode 5.2.0 Code Point Point

''

U+02D0 U+02D0

''

U+A48C U+1342E

''

U+0332 U+0332

''

U+20EA U+101FD

''

U+0000 U+0000

''

U+FE73 U+FE73

''

U+0000 U+0000

''

U+FE73 U+FE73

''

U+0000 U+0000

''

U+0000 U+0000

''

U+0009 U+0009

''

U+2183 U+1D371

The '' element permits a 'shift-after-method' attribute that affects character weight calculation for shift rules. The attribute has these permitted values:

* 'simple': Calculate character weights as for reset rules that do not have a 'before' attribute. This is the default if the attribute is not given. * 'expand': Use expansions for shifts after reset rules.

Suppose that ''0'' and ''1'' have weights of '0E29' and '0E2A' and we want to put all basic Latin letters between ''0'' and ''1'':

<reset>0</reset> <pc>abcdefghijklmnopqrstuvwxyz</pc>

For simple shift mode, weights are calculated as follows:

'a' has weight 0E29+1 'b' has weight 0E29+2 'c' has weight 0E29+3 ...

However, there are not enough vacant positions to put 26 characters between ''0'' and ''1''. The result is that digits and letters are intermixed.

To solve this, use 'shift-after-method="expand"'. Then weights are calculated like this:

'a' has weight [0E29][233D+1] 'b' has weight [0E29][233D+2] 'c' has weight [0E29][233D+3] ...

'233D' is the UCA 4.0.0 weight for character '0xA48C', which is the last nonignorable character (a sort of the greatest character in the collation, excluding CJK). UCA 5.2.0 is similar but uses '3ACA', for character '0x1342E'.

MySQL-Specific LDML Extensions

An extension to LDML rules permits the '' element to include an optional 'version' attribute in '' tags to indicate the UCA version on which the collation is based. If the 'version' attribute is omitted, its default value is '4.0.0'. For example, this specification indicates a collation that is based on UCA 5.2.0:

<collation id="NNN" name="utf8_XXX_ci" version="5.2.0"> ... </collation>

File: manual.info.tmp, Node: collation-diagnostics, Prev: ldml-rules, Up: adding-collation-unicode-uca

10.14.4.3 Diagnostics During Index.xml Parsing ..............................................

The MySQL server generates diagnostics when it finds problems while parsing the 'Index.xml' file:

Unknown tags are written to the error log. For example, the following message results if a collation definition contains a '' tag:

[Warning] Buffered warning: Unknown LDML tag: 'charsets/charset/collation/rules/aaa'

If collation initialization is not possible, the server reports an 'Unknown collation' error, and also generates warnings explaining the problems, such as in the previous example. In other cases, when a collation description is generally correct but contains some unknown tags, the collation is initialized and is available for use. The unknown parts are ignored, but a warning is generated in the error log.

Problems with collations generate warnings that clients can display with *note 'SHOW WARNINGS': show-warnings. Suppose that a reset rule contains an expansion longer than the maximum supported length of 6 characters:

<reset>abcdefghi</reset> x

An attempt to use the collation produces warnings:

mysql> SELECT _utf8'test' COLLATE utf8_test_ci; ERROR 1273 (HY000): Unknown collation: 'utf8_test_ci' mysql> SHOW WARNINGS; +---------+------+----------------------------------------+ | Level | Code | Message | +---------+------+----------------------------------------+ | Error | 1273 | Unknown collation: 'utf8_test_ci' | | Warning | 1273 | Expansion is too long at 'abcdefghi=x' | +---------+------+----------------------------------------+

File: manual.info.tmp, Node: charset-configuration, Next: locale-support, Prev: adding-collation, Up: charset