This section discusses the procedure for adding a character set to MySQL. The proper procedure depends on whether the character set is simple or complex:
* If the character set does not need special string collating routines for sorting and does not need multibyte character support, it is simple.
* If the character set needs either of those features, it is complex.
For example, 'greek' and 'swe7' are simple character sets, whereas 'big5' and 'czech' are complex character sets.
To use the following instructions, you must have a MySQL source distribution. In the instructions, MYSET represents the name of the character set that you want to add.
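Add a '<charset>' element for MYSET to the 'sql/share/charsets/Index.xml' file. Use the existing contents of the file as a guide to adding new contents. A partial listing for the 'latin1' '<charset>' element follows: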
<charset name="latin1">
<family>Western</family>
<description>cp1252 West European</description>
...
<collation name="latin1_swedish_ci" id="8" order="Finnish, Swedish">
<flag>primary</flag>
<flag>compiled</flag>
</collation>
<collation name="latin1_danish_ci" id="15" order="Danish"/>
...
<collation name="latin1_bin" id="47" order="Binary">
<flag>binary</flag>
<flag>compiled</flag>
</collation>
...
</charset>
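The '<charset>' element must list all the collations for the character set. These must include at least a binary collation and a default (primary) collation. The default collation should have a 'primary' flag, and the binary collation should have a 'binary' flag.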
You must assign a unique ID number to each collation. The range of IDs from 1024 to 2047 is reserved for user-defined collations. To find the maximum of the currently used collation IDs, use this query:
SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;
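As a rough illustration of the ID rule, the following C sketch suggests a candidate ID from the reserved range, given the maximum returned by the query above. The function name 'next_user_collation_id' and its "one above the current maximum" policy are assumptions made for this example; they are not part of MySQL.

```c
/* Bounds of the ID range reserved for user-defined collations. */
#define USER_COLLATION_ID_MIN 1024
#define USER_COLLATION_ID_MAX 2047

/*
 * Hypothetical helper, not a MySQL function: given the largest collation
 * ID currently in use, suggest an ID in the reserved range.
 * Returns 0 when this simple "max + 1" policy cannot find one.
 */
int next_user_collation_id(int max_used_id)
{
    if (max_used_id < USER_COLLATION_ID_MIN)
        return USER_COLLATION_ID_MIN;   /* reserved range is still unused */
    if (max_used_id < USER_COLLATION_ID_MAX)
        return max_used_id + 1;         /* next ID above the maximum */
    return 0;                           /* nothing left above the maximum */
}
```

A return value of 0 only means that no ID exists above the current maximum; lower IDs in the range may still be free.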
This step depends on whether you are adding a simple or complex character set. A simple character set requires only a configuration file, whereas a complex character set requires a C source file that defines collation functions, multibyte functions, or both.
For a simple character set, create a configuration file, 'MYSET.xml', that describes the character set properties. Create this file in the 'sql/share/charsets' directory. You can use a copy of 'latin1.xml' as the basis for this file. The syntax for the file is very simple:
* Comments are written as ordinary XML comments ('<!-- TEXT
-->').
* Words within '<map>' array elements are separated by arbitrary
amounts of whitespace.
* Each word within '<map>' array elements must be a number in
hexadecimal format.
* The '<map>' array element for the '<ctype>' element has 257
words. The other '<map>' array elements after that have 256
words. See *note character-arrays::.
* For each collation listed in the '<charset>' element for the
character set in 'Index.xml', 'MYSET.xml' must contain a
'<collation>' element that defines the character ordering.
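The syntax rules above can be checked mechanically. The following C sketch parses the text content of one '<map>' element into an array of numbers. The function name 'parse_map_words' and its error convention are assumptions made for this example; this is not how the MySQL server itself reads the file.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of MySQL: parse the text of one <map>
 * element (hexadecimal words separated by arbitrary whitespace) into 'out'.
 * Returns the number of words stored, or -1 if a word is not valid
 * hexadecimal or there are more than 'max_words' words.
 */
int parse_map_words(const char *text, unsigned long *out, int max_words)
{
    int count = 0;

    for (;;) {
        char *end;
        unsigned long value;

        while (isspace((unsigned char)*text))
            text++;                     /* skip whitespace between words */
        if (*text == '\0')
            return count;               /* reached the end of the text */
        if (!isxdigit((unsigned char)*text))
            return -1;                  /* word does not start with a hex digit */

        value = strtoul(text, &end, 16);
        if (end == text || (*end != '\0' && !isspace((unsigned char)*end)))
            return -1;                  /* trailing junk inside the word */
        if (count == max_words)
            return -1;                  /* too many words for this array */

        out[count++] = value;
        text = end;
    }
}
```

A caller could then compare the returned count against the expected size: 257 for the '<ctype>' map and 256 for the other maps.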
For a complex character set, create a C source file that describes the character set properties and defines the support routines necessary to properly perform operations on the character set:
* Create the file 'ctype-MYSET.c' in the 'strings' directory.
Look at one of the existing 'ctype-*.c' files (such as
'ctype-big5.c') to see what needs to be defined. The arrays
in your file must have names like 'ctype_MYSET',
'to_lower_MYSET', and so on. These correspond to the arrays
for a simple character set. See *note character-arrays::.
* For each '<collation>' element listed in the '<charset>'
element for the character set in 'Index.xml', the
'ctype-MYSET.c' file must provide an implementation of the
collation.
* If the character set requires string collating functions, see
*note string-collating::.
* If the character set requires multibyte character support, see
*note multibyte-characters::.
Modify the configuration information. Use the existing configuration information as a guide to adding information for MYSET. The example here assumes that the character set has default and binary collations, but more lines are needed if MYSET has additional collations.
Edit 'mysys/charset-def.c', and 'register' the collations for the new character set.
Add these lines to the 'declaration' section:
#ifdef HAVE_CHARSET_MYSET
extern CHARSET_INFO my_charset_MYSET_general_ci;
extern CHARSET_INFO my_charset_MYSET_bin;
#endif
Add these lines to the 'registration' section:
#ifdef HAVE_CHARSET_MYSET
add_compiled_collation(&my_charset_MYSET_general_ci);
add_compiled_collation(&my_charset_MYSET_bin);
#endif
If the character set uses 'ctype-MYSET.c', edit 'strings/CMakeLists.txt' and add 'ctype-MYSET.c' to the definition of the 'STRINGS_SOURCES' variable.
Edit 'cmake/character_sets.cmake':
1. Add MYSET to the value of 'CHARSETS_AVAILABLE' in
alphabetic order.
2. Add MYSET to the value of 'CHARSETS_COMPLEX' in
alphabetic order. This is needed even for simple
character sets, or 'CMake' does not recognize
'-DDEFAULT_CHARSET=MYSET'.
Reconfigure, recompile, and test.
File: manual.info.tmp, Node: character-arrays, Next: string-collating, Prev: adding-character-set, Up: adding-character-set
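Each simple character set has a configuration file located in the 'sql/share/charsets' directory. For a character set named MYSET, the file is named 'MYSET.xml'. It uses '<map>' array elements to list character set properties. '<map>' elements appear within these elements:
* '<ctype>' defines attributes for each character.
* '<lower>' and '<upper>' list the lowercase and uppercase characters.
* '<unicode>' maps 8-bit character values to Unicode values.
* '<collation>' elements indicate character ordering for comparison and sorting, one element per collation. Binary collations need no '<map>' element because the character codes themselves provide the ordering.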
For a complex character set as implemented in a 'ctype-MYSET.c' file in the 'strings' directory, there are corresponding arrays: 'ctype_MYSET[]', 'to_lower_MYSET[]', and so forth. Not every complex character set has all of the arrays. See also the existing 'ctype-*.c' files for examples. See the 'CHARSET_INFO.txt' file in the 'strings' directory for additional information.
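Most of the arrays are indexed by character value and have 256 elements. The '<ctype>' array is indexed by character value + 1 and has 257 elements. This is a legacy convention for handling 'EOF'.
'<ctype>' array elements are bit values. Each element describes the attributes of a single character in the character set. Each attribute is associated with a bitmask, as defined in 'include/m_ctype.h':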
#define _MY_U 01 /* Upper case */
#define _MY_L 02 /* Lower case */
#define _MY_NMR 04 /* Numeral (digit) */
#define _MY_SPC 010 /* Spacing character */
#define _MY_PNT 020 /* Punctuation */
#define _MY_CTR 040 /* Control character */
#define _MY_B 0100 /* Blank */
#define _MY_X 0200 /* heXadecimal digit */
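The '<ctype>' value for a given character should be the union of the applicable bitmask values that describe the character. For example, 'A' is an uppercase character ('_MY_U') as well as a hexadecimal digit ('_MY_X'), so its 'ctype' value should be defined like this: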
ctype['A'+1] = _MY_U | _MY_X = 01 | 0200 = 0201
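The bitmask values in 'm_ctype.h' are octal values, but the elements of the '<ctype>' array in 'MYSET.xml' should be written as hexadecimal values. For example, the octal value 0201 computed for 'A' is written as 81.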
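To make the bitmask arithmetic concrete, the following C sketch builds a 257-element attribute array for the ASCII range. The helper names 'ascii_ctype_value' and 'fill_ascii_ctype', and the mapping from the standard '<ctype.h>' predicates (in the "C" locale) to the bitmasks, are assumptions made for this example; an actual character set may classify characters differently.

```c
#include <ctype.h>

/* Attribute bitmasks with the same octal values as listed above. */
#define _MY_U   01    /* Upper case */
#define _MY_L   02    /* Lower case */
#define _MY_NMR 04    /* Numeral (digit) */
#define _MY_SPC 010   /* Spacing character */
#define _MY_PNT 020   /* Punctuation */
#define _MY_CTR 040   /* Control character */
#define _MY_B   0100  /* Blank */
#define _MY_X   0200  /* heXadecimal digit */

/* Hypothetical helper: union of the bitmasks that apply to ASCII character c. */
unsigned char ascii_ctype_value(int c)
{
    unsigned char v = 0;

    if (isupper(c))  v |= _MY_U;
    if (islower(c))  v |= _MY_L;
    if (isdigit(c))  v |= _MY_NMR;
    if (isspace(c))  v |= _MY_SPC;
    if (ispunct(c))  v |= _MY_PNT;
    if (iscntrl(c))  v |= _MY_CTR;
    if (c == ' ')    v |= _MY_B;
    if (isxdigit(c)) v |= _MY_X;
    return v;
}

/*
 * Fill a 257-element array: slot 0 is the extra leading entry, and the
 * value for character c is stored at index c + 1. Bytes above 0x7F are
 * left without attributes in this sketch.
 */
void fill_ascii_ctype(unsigned char ctype[257])
{
    int c;

    ctype[0] = 0;
    for (c = 0; c < 256; c++)
        ctype[c + 1] = (c < 128) ? ascii_ctype_value(c) : 0;
}
```

With this table, the entry at index 'A' + 1 is octal 0201, which is the same number as hexadecimal 81.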
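The '<lower>' and '<upper>' arrays hold the lowercase and uppercase characters corresponding to each member of the character set. For example: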
lower['A'] should contain 'a'
upper['a'] should contain 'A'
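A minimal C sketch of such case-mapping arrays follows. It handles only the ASCII letters and maps every other byte to itself; the function name 'fill_ascii_case_maps' is hypothetical, and a real character set would also need entries for its accented or non-Latin letters.

```c
/*
 * Hypothetical helper: fill 256-element lowercase and uppercase maps.
 * Only 'A'-'Z' and 'a'-'z' are converted; all other values are unchanged.
 */
void fill_ascii_case_maps(unsigned char lower[256], unsigned char upper[256])
{
    int c;

    for (c = 0; c < 256; c++) {
        lower[c] = (unsigned char)((c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c);
        upper[c] = (unsigned char)((c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c);
    }
}
```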
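Each '<collation>' array indicates how characters should be ordered for comparison and sorting purposes. MySQL sorts characters based on the values of this information. In some cases, this is the same as the '<upper>' array, which means that sorting is case-insensitive. For more complicated sorting rules (for complex character sets), see the discussion of string collating in *note string-collating::.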
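A collation array can be read as a table of sort weights indexed by byte value. The following C sketch compares two byte strings by such a table. The function 'compare_by_sort_order' is hypothetical and is not MySQL's collation handler interface; it only illustrates why a table equal to an uppercase map gives case-insensitive ordering.

```c
#include <stddef.h>

/*
 * Hypothetical comparison routine: compare two byte strings using the
 * weights in a 256-element sort-order array.
 * Returns a negative value, 0, or a positive value, like strcmp().
 */
int compare_by_sort_order(const unsigned char sort_order[256],
                          const unsigned char *a, size_t a_len,
                          const unsigned char *b, size_t b_len)
{
    size_t n = (a_len < b_len) ? a_len : b_len;
    size_t i;

    for (i = 0; i < n; i++) {
        int wa = sort_order[a[i]];   /* weight of the byte from a */
        int wb = sort_order[b[i]];   /* weight of the byte from b */

        if (wa != wb)
            return wa - wb;
    }
    if (a_len == b_len)
        return 0;                    /* same weights and same length */
    return (a_len < b_len) ? -1 : 1; /* the shorter string sorts first */
}
```

If the table maps 'a' through 'z' to the weights of 'A' through 'Z', the strings 'abc' and 'ABC' compare as equal.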
File: manual.info.tmp, Node: string-collating, Next: multibyte-characters, Prev: character-arrays, Up: adding-character-set
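For a simple character set named MYSET, sorting rules are specified in the 'MYSET.xml' configuration file using '<map>' array elements within '<collation>' elements. If the sorting rules for your language are too complex to be handled with simple arrays, you must define string collating functions in the 'ctype-MYSET.c' source file in the 'strings' directory.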
The existing character sets provide the best documentation and examples to show how these functions are implemented. Look at the 'ctype-*.c' files in the 'strings' directory, such as the files for the 'big5', 'czech', 'gbk', 'sjis', and 'tis620' character sets. Take a look at the 'MY_COLLATION_HANDLER' structures to see how they are used. See also the 'CHARSET_INFO.txt' file in the 'strings' directory for additional information.
File: manual.info.tmp, Node: multibyte-characters, Prev: string-collating, Up: adding-character-set
If you want to add support for a new character set named MYSET that includes multibyte characters, you must use multibyte character functions in the 'ctype-MYSET.c' source file in the 'strings' directory.
The existing character sets provide the best documentation and examples to show how these functions are implemented. Look at the 'ctype-*.c' files in the 'strings' directory, such as the files for the 'euc_kr', 'gb2312', 'gbk', 'sjis', and 'ujis' character sets. Take a look at the 'MY_CHARSET_HANDLER' structures to see how they are used. See also the 'CHARSET_INFO.txt' file in the 'strings' directory for additional information.
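To give a feel for what multibyte character functions do, the following C sketch works on an invented encoding: bytes 0x00 to 0x7F are single-byte characters, and a lead byte in 0xA1 to 0xFE followed by a trail byte in 0xA1 to 0xFE forms one two-byte character. The encoding rules and the names 'myset_char_length' and 'myset_numchars' are assumptions for illustration only; they do not reproduce the signatures used in 'MY_CHARSET_HANDLER'.

```c
#include <stddef.h>

/*
 * Hypothetical: length in bytes of the character starting at p.
 * Returns 1 or 2 for a valid character, 0 if the bytes are invalid
 * or the character is cut off by 'end'.
 */
int myset_char_length(const unsigned char *p, const unsigned char *end)
{
    if (p >= end)
        return 0;
    if (p[0] < 0x80)
        return 1;                           /* single-byte character */
    if (p[0] >= 0xA1 && p[0] <= 0xFE &&
        end - p >= 2 &&
        p[1] >= 0xA1 && p[1] <= 0xFE)
        return 2;                           /* lead byte plus trail byte */
    return 0;                               /* invalid or truncated */
}

/*
 * Hypothetical: count characters (not bytes) in a buffer.
 * Returns -1 if the buffer contains an invalid sequence.
 */
long myset_numchars(const unsigned char *s, size_t len)
{
    const unsigned char *end = s + len;
    long count = 0;

    while (s < end) {
        int n = myset_char_length(s, end);

        if (n == 0)
            return -1;
        s += n;
        count++;
    }
    return count;
}
```

The point of the sketch is that length, position, and comparison operations must step through a string character by character rather than byte by byte.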
File: manual.info.tmp, Node: adding-collation, Next: charset-configuration, Prev: adding-character-set, Up: charset