Menu:
adding-collation-unicode-uca:: Adding a UCA Collation to a Unicode Character Set
A collation is a set of rules that defines how to compare and sort character strings. Each collation in MySQL belongs to a single character set. Every character set has at least one collation, and most have two or more collations.
A collation orders characters based on weights. Each character in a character set maps to a weight. Characters with equal weights compare as equal, and characters with unequal weights compare according to the relative magnitude of their weights.
The 'WEIGHT_STRING()' function can be used to see the weights for the characters in a string. The value that it returns to indicate weights is a binary string, so it is convenient to use 'HEX(WEIGHT_STRING(STR))' to display the weights in printable form. The following example shows that weights do not differ for lettercase for the letters in ''AaBb'' if it is a nonbinary case-insensitive string, but do differ if it is a binary string:
mysql> SELECT HEX(WEIGHT_STRING('AaBb' COLLATE latin1_swedish_ci));
+------------------------------------------------------+
| HEX(WEIGHT_STRING('AaBb' COLLATE latin1_swedish_ci)) |
+------------------------------------------------------+
| 41414242 |
+------------------------------------------------------+
mysql> SELECT HEX(WEIGHT_STRING(BINARY 'AaBb'));
+-----------------------------------+
| HEX(WEIGHT_STRING(BINARY 'AaBb')) |
+-----------------------------------+
| 41614262 |
+-----------------------------------+
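The idea of comparing by weight rather than by character code can be sketched in a few lines of Python. This is only an illustrative model, not MySQL's implementation: the function name `weight_string` and the two tables are assumptions made for the example, and the case-insensitive table folds only ASCII letters, whereas the real 'latin1_swedish_ci' table also assigns weights to accented letters.

```python
def weight_string(text, weights):
    """Map each latin1 byte of text to its weight and return the weights as bytes."""
    return bytes(weights[b] for b in text.encode("latin-1"))

# Assumed toy tables, not MySQL's actual data:
# fold ASCII letters to uppercase, leave every other code unchanged.
CASE_INSENSITIVE = [ord(chr(c).upper()) if c < 128 else c for c in range(256)]
# A binary comparison uses the character codes themselves as weights.
BINARY = list(range(256))

print(weight_string("AaBb", CASE_INSENSITIVE).hex().upper())  # 41414242
print(weight_string("AaBb", BINARY).hex().upper())            # 41614262
```

Two strings compare as equal under a table exactly when their weight strings are equal, which is why ''Aa'' and ''aA'' match in the case-insensitive model but not in the binary one.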
MySQL supports several collation implementations, as discussed in *note charset-collation-implementations::. Some of these can be added to MySQL without recompiling:
* Simple collations for 8-bit character sets.
* UCA-based collations for Unicode character sets.
* Binary ('XXX_bin') collations.
The following sections describe how to add user-defined collations of the first two types to existing character sets. All existing character sets already have a binary collation, so there is no need here to describe how to add one.
Summary of the procedure for adding a new user-defined collation:
1. Choose a collation ID.
2. Add configuration information that names the collation and describes the character-ordering rules.
3. Restart the server.
4. Verify that the server recognizes the collation.
The instructions here cover only user-defined collations that can be added without recompiling MySQL. To add a collation that does require recompiling (as implemented by means of functions in a C source file), use the instructions in *note adding-character-set::. However, instead of adding all the information required for a complete character set, just modify the appropriate files for an existing character set. That is, based on what is already present for the character set's current collations, add data structures, functions, and configuration information for the new collation.
Note:
If you modify an existing user-defined collation, that may affect the ordering of rows for indexes on columns that use the collation. In this case, rebuild any such indexes to avoid problems such as incorrect query results. See *note rebuilding-tables::.
Additional Resources
Example showing how to add a collation for full-text searches: *note full-text-adding-collation::
The Unicode Collation Algorithm (UCA) specification: http://www.unicode.org/reports/tr10/
The Locale Data Markup Language (LDML) specification: http://www.unicode.org/reports/tr35/
File: manual.info.tmp, Node: charset-collation-implementations, Next: adding-collation-choosing-id, Prev: adding-collation, Up: adding-collation
MySQL implements several types of collations:
Simple collations for 8-bit character sets
This kind of collation is implemented using an array of 256 weights that defines a one-to-one mapping from character codes to weights. 'latin1_swedish_ci' is an example. It is a case-insensitive collation, so the uppercase and lowercase versions of a character have the same weights and they compare as equal.
mysql> SET NAMES 'latin1' COLLATE 'latin1_swedish_ci';
Query OK, 0 rows affected (0.01 sec)
mysql> SELECT HEX(WEIGHT_STRING('a')), HEX(WEIGHT_STRING('A'));
+-------------------------+-------------------------+
| HEX(WEIGHT_STRING('a')) | HEX(WEIGHT_STRING('A')) |
+-------------------------+-------------------------+
| 41 | 41 |
+-------------------------+-------------------------+
1 row in set (0.01 sec)
mysql> SELECT 'a' = 'A';
+-----------+
| 'a' = 'A' |
+-----------+
| 1 |
+-----------+
1 row in set (0.12 sec)
For implementation instructions, see *note adding-collation-simple-8bit::.
Complex collations for 8-bit character sets
This kind of collation is implemented using functions in a C source file that define how to order characters, as described in *note adding-character-set::.
Collations for non-Unicode multibyte character sets
For this type of collation, 8-bit (single-byte) and multibyte characters are handled differently. For 8-bit characters, character codes map to weights in case-insensitive fashion. (For example, the single-byte characters ''a'' and ''A'' both have a weight of '0x41'.) For multibyte characters, there are two types of relationship between character codes and weights:
Weights equal character codes. 'sjis_japanese_ci' is an example of this kind of collation. The multibyte character ''ぢ'' has a character code of '0x82C0', and the weight is also '0x82C0'.
mysql> CREATE TABLE t1
(c1 VARCHAR(2) CHARACTER SET sjis COLLATE sjis_japanese_ci);
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO t1 VALUES ('a'),('A'),(0x82C0);
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> SELECT c1, HEX(c1), HEX(WEIGHT_STRING(c1)) FROM t1;
+------+---------+------------------------+
| c1 | HEX(c1) | HEX(WEIGHT_STRING(c1)) |
+------+---------+------------------------+
| a | 61 | 41 |
| A | 41 | 41 |
| ぢ | 82C0 | 82C0 |
+------+---------+------------------------+
3 rows in set (0.00 sec)
Character codes map one-to-one to weights, but a code is not necessarily equal to the weight. 'gbk_chinese_ci' is an example of this kind of collation. The multibyte character ''膰'' has a character code of '0x81B0' but a weight of '0xC286'.
mysql> CREATE TABLE t1
(c1 VARCHAR(2) CHARACTER SET gbk COLLATE gbk_chinese_ci);
Query OK, 0 rows affected (0.33 sec)
mysql> INSERT INTO t1 VALUES ('a'),('A'),(0x81B0);
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> SELECT c1, HEX(c1), HEX(WEIGHT_STRING(c1)) FROM t1;
+------+---------+------------------------+
| c1 | HEX(c1) | HEX(WEIGHT_STRING(c1)) |
+------+---------+------------------------+
| a | 61 | 41 |
| A | 41 | 41 |
| 膰 | 81B0 | C286 |
+------+---------+------------------------+
3 rows in set (0.00 sec)
For implementation instructions, see *note adding-character-set::.
Collations for Unicode multibyte character sets
Some of these collations are based on the Unicode Collation Algorithm (UCA); others are not.
Non-UCA collations have a one-to-one mapping from character code to weight. In MySQL, such collations are case-insensitive and accent-insensitive. 'utf8_general_ci' is an example: ''a'', ''A'', ''À'', and ''á'' each have different character codes but all have a weight of '0x0041' and compare as equal.
mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE t1
(c1 CHAR(1) CHARACTER SET UTF8 COLLATE utf8_general_ci);
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO t1 VALUES ('a'),('A'),('À'),('á');
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> SELECT c1, HEX(c1), HEX(WEIGHT_STRING(c1)) FROM t1;
+------+---------+------------------------+
| c1   | HEX(c1) | HEX(WEIGHT_STRING(c1)) |
+------+---------+------------------------+
| a    | 61      | 0041                   |
| A    | 41      | 0041                   |
| À    | C380    | 0041                   |
| á    | C3A1    | 0041                   |
+------+---------+------------------------+
4 rows in set (0.00 sec)
UCA-based collations in MySQL have these properties:
If a character has weights, each weight uses 2 bytes (16 bits).
A character may have zero weights (or an empty weight). In this case, the character is ignorable. Example: "U+0000 NULL" does not have a weight and is ignorable.
A character may have one weight. Example: ''a'' has a weight of '0x0E33'.
mysql> SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
Query OK, 0 rows affected (0.05 sec)
mysql> SELECT HEX('a'), HEX(WEIGHT_STRING('a'));
+----------+-------------------------+
| HEX('a') | HEX(WEIGHT_STRING('a')) |
+----------+-------------------------+
| 61 | 0E33 |
+----------+-------------------------+
1 row in set (0.02 sec)
A character may have many weights. This is an expansion. Example: The German letter ''ß'' (SZ ligature, or SHARP S) expands to two weights, '0x0FEA0FEA'.
mysql> SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
Query OK, 0 rows affected (0.11 sec)
mysql> SELECT HEX('ß'), HEX(WEIGHT_STRING('ß'));
+----------+-------------------------+
| HEX('ß') | HEX(WEIGHT_STRING('ß')) |
+----------+-------------------------+
| C39F     | 0FEA0FEA                |
+----------+-------------------------+
1 row in set (0.00 sec)
Many characters may have one weight. This is a contraction. Example: ''ch'' is a single letter in Czech and has a weight of '0x0EE2'.
mysql> SET NAMES 'utf8' COLLATE 'utf8_czech_ci';
Query OK, 0 rows affected (0.09 sec)
mysql> SELECT HEX('ch'), HEX(WEIGHT_STRING('ch'));
+-----------+--------------------------+
| HEX('ch') | HEX(WEIGHT_STRING('ch')) |
+-----------+--------------------------+
| 6368 | 0EE2 |
+-----------+--------------------------+
1 row in set (0.00 sec)
A many-characters-to-many-weights mapping is also possible (this is contraction with expansion), but is not supported by MySQL.
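The four UCA cases listed above (ignorable character, single weight, expansion, contraction) can be modeled with a small lookup table. The following is a minimal sketch, not the server's algorithm: the name `weight_string_hex` and the table `WEIGHTS` are hypothetical, the table holds only the values quoted in this section, and it mixes a Czech contraction with 'utf8_unicode_ci' values purely for illustration.

```python
# Toy weight table built only from values quoted in the text above.
WEIGHTS = {
    "\u0000": [],                # ignorable: no weights
    "a": [0x0E33],               # one character, one weight
    "\u00DF": [0x0FEA, 0x0FEA],  # expansion: sharp s, two weights
    "ch": [0x0EE2],              # contraction (Czech): two characters, one weight
}

def weight_string_hex(text, table=WEIGHTS):
    """Return the 2-byte weights of text as a hex string, longest match first."""
    longest = max(len(key) for key in table)
    weights = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so contractions win over single characters.
        for size in range(min(longest, len(text) - pos), 0, -1):
            chunk = text[pos:pos + size]
            if chunk in table:
                weights.extend(table[chunk])
                pos += size
                break
        else:
            raise KeyError(f"no weight defined for {text[pos]!r}")
    return "".join(f"{w:04X}" for w in weights)

print(weight_string_hex("a"))       # 0E33
print(weight_string_hex("\u00DF"))  # 0FEA0FEA
print(weight_string_hex("ch"))      # 0EE2
```

Because the ignorable character contributes no weights, a string containing it produces the same weight string as the string without it.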
For implementation instructions for a non-UCA collation, see *note adding-character-set::. For a UCA collation, see *note adding-collation-unicode-uca::.
Miscellaneous collations
There are also a few collations that do not fall into any of the previous categories.
File: manual.info.tmp, Node: adding-collation-choosing-id, Next: adding-collation-simple-8bit, Prev: charset-collation-implementations, Up: adding-collation
Each collation must have a unique ID. To add a collation, you must choose an ID value that is not currently used. MySQL supports two-byte collation IDs. The range of IDs from 1024 to 2047 is reserved for user-defined collations.
The collation ID that you choose appears in these contexts:
The 'ID' column of the Information Schema *note 'COLLATIONS': information-schema-collations-table. table.
The 'Id' column of *note 'SHOW COLLATION': show-collation. output.
The 'charsetnr' member of the 'MYSQL_FIELD' C API data structure.
The 'number' member of the 'MY_CHARSET_INFO' data structure returned by the 'mysql_get_character_set_info()' (https://dev.mysql.com/doc/c-api/5.7/en/mysql-get-character-set-info.html) C API function.
To determine the largest currently used ID, issue the following statement:
mysql> SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;
+---------+
| MAX(ID) |
+---------+
| 247 |
+---------+
To display a list of all currently used IDs, issue this statement:
mysql> SELECT ID FROM INFORMATION_SCHEMA.COLLATIONS ORDER BY ID;
+-----+
| ID |
+-----+
| 1 |
| 2 |
| ... |
| 52 |
| 53 |
| 57 |
| 58 |
| ... |
| 98 |
| 99 |
| 128 |
| 129 |
| ... |
| 247 |
+-----+
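Given the list of IDs returned by the query above, picking a free value in the user-defined range is a simple search. The sketch below assumes the IDs have already been fetched into a Python list; the function name `next_free_collation_id` is hypothetical, and the range bounds are the 1024 to 2047 range stated earlier in this section.

```python
def next_free_collation_id(used_ids, low=1024, high=2047):
    """Return the smallest ID in [low, high] that is not in used_ids."""
    used = set(used_ids)
    for candidate in range(low, high + 1):
        if candidate not in used:
            return candidate
    raise ValueError("no free user-defined collation ID in the reserved range")

# With only built-in collations present, the first reserved ID is free.
print(next_free_collation_id([1, 2, 52, 53, 247]))  # 1024
```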
Warning:
Before upgrading, you should save the configuration files that you change. If you upgrade in place, the process replaces the modified files.
File: manual.info.tmp, Node: adding-collation-simple-8bit, Next: adding-collation-unicode-uca, Prev: adding-collation-choosing-id, Up: adding-collation
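This section describes how to add a simple collation for an 8-bit character set by writing the '<collation>' elements associated with a '<charset>' character set description in the MySQL 'Index.xml' file. The procedure described here does not require recompiling MySQL. The example adds a collation named 'latin1_test_ci' to the 'latin1' character set.
Choose a collation ID, as shown in *note adding-collation-choosing-id::. The following steps use an ID of 1024.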
Modify the 'Index.xml' and 'latin1.xml' configuration files. These files are located in the directory named by the 'character_sets_dir' system variable. You can check the variable value as follows, although the path name might be different on your system:
mysql> SHOW VARIABLES LIKE 'character_sets_dir';
+--------------------+----------------------------------------+
| Variable_name      | Value                                  |
+--------------------+----------------------------------------+
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
+--------------------+----------------------------------------+
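Choose a name for the collation and list it in the 'Index.xml' file. Find the '<charset>' element for the character set to which the collation is being added, and add a '<collation>' element that indicates the collation name and ID, to associate the name with the ID. For example: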
<charset name="latin1">
...
<collation name="latin1_test_ci" id="1024"/>
...
</charset>
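In the 'latin1.xml' configuration file, add a '<collation>' element that names the collation and that contains a '<map>' element defining a character code-to-weight mapping table for character codes 0 to 255. Each value within the '<map>' element must be a number in hexadecimal format.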
<collation name="latin1_test_ci">
<map>
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
44 4E 4F 4F 4F 4F 5C D7 5C 55 55 55 59 59 DE DF
41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
44 4E 4F 4F 4F 4F 5C F7 5C 55 55 55 59 59 DE FF
</map>
</collation>
Restart the server and use this statement to verify that the collation is present:
mysql> SHOW COLLATION WHERE Collation = 'latin1_test_ci';
+----------------+---------+------+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+----------------+---------+------+---------+----------+---------+
| latin1_test_ci | latin1 | 1024 | | | 1 |
+----------------+---------+------+---------+----------+---------+
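A 256-entry weight table is tedious to type by hand, so it can help to generate the body of the '<map>' element from a list of weights. The following sketch is an assumption-laden helper, not a MySQL tool: the names `format_map_rows` and `ascii_fold` are hypothetical, and the sample table folds only ASCII lowercase letters to uppercase, so its upper half differs from the 'latin1_test_ci' example above, which also assigns weights to accented letters.

```python
def format_map_rows(weights):
    """Format 256 weights as 16 rows of 16 two-digit hexadecimal values."""
    if len(weights) != 256:
        raise ValueError("a simple 8-bit collation needs exactly 256 weights")
    return [
        " ".join(f"{w:02X}" for w in weights[start:start + 16])
        for start in range(0, 256, 16)
    ]

# Assumed sample table: ASCII letters fold to uppercase, other codes map to themselves.
ascii_fold = [ord(chr(c).upper()) if c < 128 else c for c in range(256)]

for row in format_map_rows(ascii_fold):
    print(row)
```

The rows for codes 0x60 to 0x7F come out as '60 41 42 ... 4F' and '50 51 ... 5A 7B 7C 7D 7E 7F', matching the corresponding rows of the example map.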
File: manual.info.tmp, Node: adding-collation-unicode-uca, Prev: adding-collation-simple-8bit, Up: adding-collation
Menu:
collation-diagnostics:: Diagnostics During Index.xml Parsing
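This section describes how to add a UCA collation for a Unicode character set by writing the '<collation>' element within a '<charset>' character set description in the MySQL 'Index.xml' file. The procedure described here does not require recompiling MySQL. You need not define the entire collation; instead, you begin with an existing "base" collation and describe how the new collation differs from it. The following table lists the base collations of the Unicode character sets for which UCA collations can be defined.
MySQL Character Sets Available for User-Defined UCA Collations
Character Set   Base Collation
'utf8'          'utf8_unicode_ci'
'ucs2'          'ucs2_unicode_ci'
'utf16'         'utf16_unicode_ci'
'utf32'         'utf32_unicode_ci'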
The following sections show how to add a collation that is defined using LDML syntax, and provide a summary of LDML rules supported in MySQL.
File: manual.info.tmp, Node: ldml-collation-example, Next: ldml-rules, Prev: adding-collation-unicode-uca, Up: adding-collation-unicode-uca
10.14.4.1 Defining a UCA Collation Using LDML Syntax
....................................................
To add a UCA collation for a Unicode character set without recompiling MySQL, use the following procedure. If you are unfamiliar with the LDML rules used to describe the collation's sort characteristics, see *note ldml-rules::.
The example adds a collation named 'utf8_phone_ci' to the 'utf8' character set. The collation is designed for a scenario involving a Web application for which users post their names and phone numbers. Phone numbers can be given in very different formats:
+7-12345-67
+7-12-345-67
+7 12 345 67
+7 (12) 345 67
+71234567
The problem raised by dealing with these kinds of values is that the varying permissible formats make searching for a specific phone number very difficult. The solution is to define a new collation that reorders punctuation characters, making them ignorable.
Choose a collation ID, as shown in *note adding-collation-choosing-id::. The following steps use an ID of 1029.
Modify the 'Index.xml' configuration file. This file is located in the directory named by the 'character_sets_dir' system variable. You can check the variable value as follows, although the path name might be different on your system:
mysql> SHOW VARIABLES LIKE 'character_sets_dir';
+--------------------+----------------------------------------+
| Variable_name      | Value                                  |
+--------------------+----------------------------------------+
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
+--------------------+----------------------------------------+
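Choose a name for the collation and list it in the 'Index.xml' file. In addition, you need to provide the collation ordering rules. Find the '<charset>' element for the character set to which the collation is being added, and add a '<collation>' element that indicates the collation name and ID, to associate the name with the ID. Within the '<collation>' element, provide a '<rules>' element containing the ordering rules: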
<charset name="utf8">
...
<collation name="utf8_phone_ci" id="1029">
<rules>
<reset>\u0000</reset>
<i>\u0020</i> <!-- space -->
<i>\u0028</i> <!-- left parenthesis -->
<i>\u0029</i> <!-- right parenthesis -->
<i>\u002B</i> <!-- plus -->
<i>\u002D</i> <!-- hyphen -->
</rules>
</collation>
...
</charset>
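If you want a similar collation for other Unicode character sets, add other '<collation>' elements. For example, to define 'ucs2_phone_ci', add a '<collation>' element to the '<charset name="ucs2">' element. Remember that each collation must have its own unique ID.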
Restart the server and use this statement to verify that the collation is present:
mysql> SHOW COLLATION WHERE Collation = 'utf8_phone_ci';
+---------------+---------+------+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------+---------+------+---------+----------+---------+
| utf8_phone_ci | utf8 | 1029 | | | 8 |
+---------------+---------+------+---------+----------+---------+
Now test the collation to make sure that it has the desired properties.
Create a table containing some sample phone numbers using the new collation:
mysql> CREATE TABLE phonebook (
name VARCHAR(64),
phone VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_phone_ci
);
Query OK, 0 rows affected (0.09 sec)
mysql> INSERT INTO phonebook VALUES ('Svoj','+7 912 800 80 02');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Hf','+7 (912) 800 80 04');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Bar','+7-912-800-80-01');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Ramil','(7912) 800 80 03');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Sanja','+380 (912) 8008005');
Query OK, 1 row affected (0.00 sec)
Run some queries to see whether the ignored punctuation characters are in fact ignored for comparison and sorting:
mysql> SELECT * FROM phonebook ORDER BY phone;
+-------+--------------------+
| name | phone |
+-------+--------------------+
| Sanja | +380 (912) 8008005 |
| Bar | +7-912-800-80-01 |
| Svoj | +7 912 800 80 02 |
| Ramil | (7912) 800 80 03 |
| Hf | +7 (912) 800 80 04 |
+-------+--------------------+
5 rows in set (0.00 sec)
mysql> SELECT * FROM phonebook WHERE phone='+7(912)800-80-01';
+------+------------------+
| name | phone |
+------+------------------+
| Bar | +7-912-800-80-01 |
+------+------------------+
1 row in set (0.00 sec)
mysql> SELECT * FROM phonebook WHERE phone='79128008001';
+------+------------------+
| name | phone |
+------+------------------+
| Bar | +7-912-800-80-01 |
+------+------------------+
1 row in set (0.00 sec)
mysql> SELECT * FROM phonebook WHERE phone='7 9 1 2 8 0 0 8 0 0 1';
+------+------------------+
| name | phone |
+------+------------------+
| Bar | +7-912-800-80-01 |
+------+------------------+
1 row in set (0.00 sec)
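The effect of making the five punctuation characters ignorable can be approximated outside the server by removing them before comparing. This is a simplified model under stated assumptions: the names `IGNORABLE` and `phone_key` are hypothetical, and the sketch ignores everything else the collation does (for example, case-insensitive comparison of letters).

```python
# The characters that the <rules> element above makes ignorable.
IGNORABLE = set(" ()+-")

def phone_key(number):
    """Return the comparison key: the number with ignorable characters removed."""
    return "".join(ch for ch in number if ch not in IGNORABLE)

phonebook = [
    ("Svoj", "+7 912 800 80 02"),
    ("Hf", "+7 (912) 800 80 04"),
    ("Bar", "+7-912-800-80-01"),
    ("Ramil", "(7912) 800 80 03"),
    ("Sanja", "+380 (912) 8008005"),
]

# Sort by key, as ORDER BY phone does under the new collation.
ordered = [name for name, _ in sorted(phonebook, key=lambda row: phone_key(row[1]))]
print(ordered)  # ['Sanja', 'Bar', 'Svoj', 'Ramil', 'Hf']
```

All three search strings used in the queries above reduce to the same key as Bar's stored number, which is why each query returns that row.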
File: manual.info.tmp, Node: ldml-rules, Next: collation-diagnostics, Prev: ldml-collation-example, Up: adding-collation-unicode-uca
10.14.4.2 LDML Syntax Supported in MySQL
........................................
This section describes the LDML syntax that MySQL recognizes. This is a subset of the syntax described in the LDML specification available at http://www.unicode.org/reports/tr35/, which should be consulted for further information. MySQL recognizes a large enough subset of the syntax that, in many cases, it is possible to download a collation definition from the Unicode Common Locale Data Repository and paste the relevant part (that is, the part between the '<rules>' and '</rules>' tags) into the MySQL 'Index.xml' file.
The MySQL server generates diagnostics when it finds problems while parsing the 'Index.xml' file. See *note collation-diagnostics::.
Character Representation
Characters named in LDML rules can be written literally or in '\uNNNN' format, where NNNN is the hexadecimal Unicode code point value. For example, ''A'' and ''á'' can be written literally or as '\u0041' and '\u00E1'. Within hexadecimal values, the digits 'A' through 'F' are not case-sensitive; '\u00E1' and '\u00e1' are equivalent. For UCA 4.0.0 collations, hexadecimal notation can be used only for characters in the Basic Multilingual Plane, not for characters outside the BMP range of '0000' to 'FFFF'. For UCA 5.2.0 collations, hexadecimal notation can be used for any character.
The 'Index.xml' file itself should be written using UTF-8 encoding.
Syntax Rules
LDML has reset rules and shift rules to specify character ordering. Orderings are given as a set of rules that begin with a reset rule that establishes an anchor point, followed by shift rules that indicate how characters sort relative to the anchor point.
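A '<reset>' rule does not specify any ordering in and of itself. Instead, it "resets" the ordering for subsequent shift rules so that they are taken in relation to a given character. Either of the following rules resets subsequent shift rules to be taken in relation to the letter ''A'':
<reset>A</reset>
<reset>\u0041</reset>
The '<p>', '<s>', and '<t>' shift rules define primary, secondary, and tertiary differences of a character from another character: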
* Use primary differences to distinguish separate letters.
* Use secondary differences to distinguish accent variations.
* Use tertiary differences to distinguish lettercase variations.
Either of these rules specifies a primary shift rule for the ''G'' character:
<p>G</p>
<p>\u0047</p>
The '<i>' shift rule indicates that one character sorts identically to another. The following rules cause ''b'' to sort the same as ''a'':
<reset>a</reset>
<i>b</i>
Abbreviated shift syntax specifies multiple shift rules using a single pair of tags. The following table shows the correspondence between abbreviated syntax rules and the equivalent nonabbreviated rules.
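Abbreviated Shift Syntax
Abbreviated Syntax   Nonabbreviated Syntax
'<pc>xyz</pc>'       '<p>x</p><p>y</p><p>z</p>'
'<sc>xyz</sc>'       '<s>x</s><s>y</s><s>z</s>'
'<tc>xyz</tc>'       '<t>x</t><t>y</t><t>z</t>'
'<ic>xyz</ic>'       '<i>x</i><i>y</i><i>z</i>'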
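The correspondence between abbreviated and nonabbreviated shift rules is mechanical: each character inside a '<pc>', '<sc>', '<tc>', or '<ic>' element becomes its own '<p>', '<s>', '<t>', or '<i>' element. The sketch below illustrates that expansion; the function name `expand_abbreviated` is hypothetical, and it assumes the characters are written literally (a '\uNNNN' escape would need to be kept together as one unit, which this simple version does not do).

```python
import re

# Abbreviated tag -> single-character shift tag.
_ABBREVIATED = {"pc": "p", "sc": "s", "tc": "t", "ic": "i"}

def expand_abbreviated(rule):
    """Rewrite abbreviated shift rules as one shift element per character."""
    def expand(match):
        tag = _ABBREVIATED[match.group(1)]
        return "".join(f"<{tag}>{ch}</{tag}>" for ch in match.group(2))
    return re.sub(r"<(pc|sc|tc|ic)>(.*?)</\1>", expand, rule)

print(expand_abbreviated("<pc>xyz</pc>"))  # <p>x</p><p>y</p><p>z</p>
```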
An expansion is a reset rule that establishes an anchor point for a multiple-character sequence. MySQL supports expansions 2 to 6 characters long. The following rules put ''z'' greater at the primary level than the sequence of three characters ''abc'':
<reset>abc</reset>
<p>z</p>
A contraction is a shift rule that sorts a multiple-character sequence. MySQL supports contractions 2 to 6 characters long. The following rules put the sequence of three characters ''xyz'' greater at the primary level than ''a'':
<reset>a</reset>
<p>xyz</p>
Long expansions and long contractions can be used together. These rules put the sequence of three characters ''xyz'' greater at the primary level than the sequence of three characters ''abc'':
<reset>abc</reset>
<p>xyz</p>
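Normal expansion syntax uses '<x>' plus '<extend>' elements to specify an expansion. The following rules put the character ''k'' greater at the secondary level than the sequence ''ch''. That is, ''k'' behaves as if it expands to a character after ''c'' followed by ''h'':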
<reset>c</reset>
<x><s>k</s><extend>h</extend></x>
This syntax permits long sequences. These rules sort the sequence ''ccs'' greater at the tertiary level than the sequence ''cscs'':
<reset>cs</reset>
<x><t>ccs</t><extend>cs</extend></x>
The LDML specification describes normal expansion syntax as 'tricky.' See that specification for details.
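Previous context syntax uses '<x>' plus '<context>' elements to specify that the context before a character affects how it sorts. The following rules put ''-'' greater at the secondary level than ''a'', but only when ''-'' occurs after ''b'':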
<reset>a</reset>
<x><context>b</context><s>-</s></x>
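Previous context syntax can include the '<extend>' element. These rules put ''def'' greater at the primary level than ''aghi'', but only when ''def'' comes after ''abc'':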
<reset>a</reset>
<x><context>abc</context><p>def</p><extend>ghi</extend></x>
Reset rules permit a 'before' attribute. Normally, shift rules after a reset rule indicate characters that sort after the reset character. Shift rules after a reset rule that has the 'before' attribute indicate characters that sort before the reset character. The following rules put the character ''b'' immediately before ''a'' at the primary level:
<reset before="primary">a</reset>
<p>b</p>
Permissible 'before' attribute values specify the sort level by name or the equivalent numeric value:
<reset before="primary">
<reset before="1">
<reset before="secondary">
<reset before="2">
<reset before="tertiary">
<reset before="3">
A reset rule can name a logical reset position rather than a literal character:
<first_tertiary_ignorable/>
<last_tertiary_ignorable/>
<first_secondary_ignorable/>
<last_secondary_ignorable/>
<first_primary_ignorable/>
<last_primary_ignorable/>
<first_variable/>
<last_variable/>
<first_non_ignorable/>
<last_non_ignorable/>
<first_trailing/>
<last_trailing/>
These rules put ''z'' greater at the primary level than nonignorable characters that have a Default Unicode Collation Element Table (DUCET) entry and that are not CJK:
<reset><last_non_ignorable/></reset>
<p>z</p>
Logical positions have the code points shown in the following table.
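Logical Reset Position Code Points
Logical Position                 Unicode 4.0.0 Code Point   Unicode 5.2.0 Code Point
'<first_non_ignorable/>'         U+02D0                     U+02D0
'<last_non_ignorable/>'          U+A48C                     U+1342E
'<first_primary_ignorable/>'     U+0332                     U+0332
'<last_primary_ignorable/>'      U+20EA                     U+101FD
'<first_secondary_ignorable/>'   U+0000                     U+0000
'<last_secondary_ignorable/>'    U+FE73                     U+FE73
'<first_tertiary_ignorable/>'    U+0000                     U+0000
'<last_tertiary_ignorable/>'     U+FE73                     U+FE73
'<first_trailing/>'              U+0000                     U+0000
'<last_trailing/>'               U+0000                     U+0000
'<first_variable/>'              U+0009                     U+0009
'<last_variable/>'               U+2183                     U+1D371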
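The '<collation>' element permits a 'shift-after-method' attribute that affects character weight calculation for shift rules. The attribute has these permitted values: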
* 'simple': Calculate character weights as for reset rules that
do not have a 'before' attribute. This is the default if the
attribute is not given.
* 'expand': Use expansions for shifts after reset rules.
Suppose that ''0'' and ''1'' have weights of '0E29' and '0E2A' and we want to put all basic Latin letters between ''0'' and ''1'':
<reset>0</reset>
<pc>abcdefghijklmnopqrstuvwxyz</pc>
For simple shift mode, weights are calculated as follows:
'a' has weight 0E29+1
'b' has weight 0E29+2
'c' has weight 0E29+3
...
However, there are not enough vacant positions to put 26 characters between ''0'' and ''1''. The result is that digits and letters are intermixed.
To solve this, use 'shift-after-method="expand"'. Then weights are calculated like this:
'a' has weight [0E29][233D+1]
'b' has weight [0E29][233D+2]
'c' has weight [0E29][233D+3]
...
'233D' is the UCA 4.0.0 weight for character '0xA48C', which is the last nonignorable character (a sort of the greatest character in the collation, excluding CJK). UCA 5.2.0 is similar but uses '3ACA', for character '0x1342E'.
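The difference between the two shift modes can be checked with a little arithmetic. The sketch below models weights as tuples and uses the values given above ('0E29' for ''0'', '0E2A' for ''1'', and '233D' for the last nonignorable character in UCA 4.0.0). The function name `shifted_weights` is hypothetical, and the tuple comparison is only a stand-in for how the server compares weight sequences.

```python
def shifted_weights(anchor, count, method="simple", last_non_ignorable=0x233D):
    """Return one weight tuple for each of count characters shifted after anchor."""
    if method == "simple":
        # Each character takes the next weight after the anchor.
        return [(anchor + i,) for i in range(1, count + 1)]
    if method == "expand":
        # Each character keeps the anchor weight and adds a second weight
        # placed after the last nonignorable character.
        return [(anchor, last_non_ignorable + i) for i in range(1, count + 1)]
    raise ValueError(f"unknown shift-after-method: {method!r}")

ZERO, ONE = (0x0E29,), (0x0E2A,)

simple = shifted_weights(0x0E29, 26, "simple")
expand = shifted_weights(0x0E29, 26, "expand")

# Simple mode: 'a' lands on the weight of '1', so letters and digits intermix.
print(simple[0] == ONE)                       # True
# Expand mode: all 26 letters sort strictly between '0' and '1'.
print(all(ZERO < w < ONE for w in expand))    # True
```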
MySQL-Specific LDML Extensions
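An extension to LDML rules permits the '<collation>' element to include an optional 'version' attribute that indicates the UCA version on which the collation is based. If the 'version' attribute is omitted, its default value is '4.0.0'. For example, this specification indicates a collation that is based on UCA 5.2.0: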
<collation id="NNN" name="utf8_XXX_ci" version="5.2.0">
...
</collation>
File: manual.info.tmp, Node: collation-diagnostics, Prev: ldml-rules, Up: adding-collation-unicode-uca
10.14.4.3 Diagnostics During Index.xml Parsing
..............................................
The MySQL server generates diagnostics when it finds problems while parsing the 'Index.xml' file:
Unknown tags are written to the error log. For example, the following message results if a collation definition contains a '<aaa>' tag:
[Warning] Buffered warning: Unknown LDML tag:
'charsets/charset/collation/rules/aaa'
If collation initialization is not possible, the server reports an 'Unknown collation' error, and also generates warnings explaining the problems, such as in the previous example. In other cases, when a collation description is generally correct but contains some unknown tags, the collation is initialized and is available for use. The unknown parts are ignored, but a warning is generated in the error log.
Problems with collations generate warnings that clients can display with *note 'SHOW WARNINGS': show-warnings. Suppose that a reset rule contains an expansion longer than the maximum supported length of 6 characters:
<reset>abcdefghi</reset>
<i>x</i>
An attempt to use the collation produces warnings:
mysql> SELECT _utf8'test' COLLATE utf8_test_ci;
ERROR 1273 (HY000): Unknown collation: 'utf8_test_ci'
mysql> SHOW WARNINGS;
+---------+------+----------------------------------------+
| Level | Code | Message |
+---------+------+----------------------------------------+
| Error | 1273 | Unknown collation: 'utf8_test_ci' |
| Warning | 1273 | Expansion is too long at 'abcdefghi=x' |
+---------+------+----------------------------------------+
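Problems like the overlong expansion above can be caught before restarting the server by checking the rules offline. The following sketch is a hypothetical pre-flight check, not part of MySQL: the name `overlong_resets` is an assumption, the limit of 6 is the maximum expansion length stated in this section, and the check assumes that each '\uNNNN' escape counts as a single character.

```python
import re
import xml.etree.ElementTree as ET

MAX_EXPANSION = 6  # maximum expansion length stated in the text above

def overlong_resets(rules_xml, limit=MAX_EXPANSION):
    """Return the text of every <reset> rule longer than limit characters."""
    root = ET.fromstring(rules_xml)
    found = []
    for reset in root.iter("reset"):
        # A reset that names a logical position has no text of its own.
        text = (reset.text or "").strip()
        # Assumption: each \uNNNN escape stands for one character.
        length = len(re.sub(r"\\u[0-9A-Fa-f]{4}", "X", text))
        if length > limit:
            found.append(text)
    return found

rules = (
    "<rules>"
    "<reset>abcdefghi</reset><i>x</i>"
    "<reset>abc</reset><p>z</p>"
    "<reset><last_non_ignorable/></reset><p>z</p>"
    "</rules>"
)
print(overlong_resets(rules))  # ['abcdefghi']
```

A check like this only covers one kind of problem; the server's error log and 'SHOW WARNINGS' output remain the authoritative diagnostics.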
File: manual.info.tmp, Node: charset-configuration, Next: locale-support, Prev: adding-collation, Up: charset