10.13 Adding a Character Set

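This section discusses the procedure for adding a character set to MySQL. The proper procedure depends on whether the character set is simple or complex:

  * If the character set does not need special string collating routines for sorting and does not need multibyte character support, it is simple.

  * If the character set needs either of those features, it is complex.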

For example, 'greek' and 'swe7' are simple character sets, whereas 'big5' and 'czech' are complex character sets.

To use the following instructions, you must have a MySQL source distribution. In the instructions, MYSET represents the name of the character set that you want to add.

  1. Add a '<charset>' element for MYSET to the 'sql/share/charsets/Index.xml' file. Use the existing contents in the file as a guide to adding new contents. A partial listing for the 'latin1' '<charset>' element follows:

      <charset name="latin1">
        <family>Western</family>
        <description>cp1252 West European</description>
        ...
        <collation name="latin1_swedish_ci" id="8" order="Finnish, Swedish">
          <flag>primary</flag>
          <flag>compiled</flag>
        </collation>
        <collation name="latin1_danish_ci" id="15" order="Danish"/>
        ...
        <collation name="latin1_bin" id="47" order="Binary">
          <flag>binary</flag>
          <flag>compiled</flag>
        </collation>
        ...
      </charset>

    The '<charset>' element must list all the collations for the character set. These must include at least a binary collation and a default (primary) collation. The default collation is often named using a suffix of 'general_ci' (general, case-insensitive). It is possible for the binary collation to be the default collation, but usually they are different. The default collation should have a 'primary' flag. The binary collation should have a 'binary' flag.

    You must assign a unique ID number to each collation. The range of IDs from 1024 to 2047 is reserved for user-defined collations. To find the maximum of the currently used collation IDs, use this query:

      SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;
  2. This step depends on whether you are adding a simple or complex character set. A simple character set requires only a configuration file, whereas a complex character set requires a C source file that defines collation functions, multibyte functions, or both.

    For a simple character set, create a configuration file, 'MYSET.xml', that describes the character set properties. Create this file in the 'sql/share/charsets' directory. You can use a copy of 'latin1.xml' as the basis for this file. The syntax for the file is very simple:

    * Comments are written as ordinary XML comments ('<!-- TEXT
      -->').
    
    * Words within '<map>' array elements are separated by arbitrary
      amounts of whitespace.
    
    * Each word within '<map>' array elements must be a number in
      hexadecimal format.
    
    * The '<map>' array element for the '<ctype>' element has 257
      words.  The other '<map>' array elements after that have 256
      words.  See Section 10.13.1, "Character Definition Arrays".
    
    * For each collation listed in the '<charset>' element for the
      character set in 'Index.xml', 'MYSET.xml' must contain a
      '<collation>' element that defines the character ordering.

    For a complex character set, create a C source file that describes the character set properties and defines the support routines necessary to properly perform operations on the character set:

    * Create the file 'ctype-MYSET.c' in the 'strings' directory.
      Look at one of the existing 'ctype-*.c' files (such as
      'ctype-big5.c') to see what needs to be defined.  The arrays
      in your file must have names like 'ctype_MYSET',
      'to_lower_MYSET', and so on.  These correspond to the arrays
      for a simple character set.  See Section 10.13.1, "Character
      Definition Arrays".
    
    * For each '<collation>' element listed in the '<charset>'
      element for the character set in 'Index.xml', the
      'ctype-MYSET.c' file must provide an implementation of the
      collation.
    
    * If the character set requires string collating functions, see
      Section 10.13.2, "String Collating Support for Complex
      Character Sets".
    
    * If the character set requires multibyte character support, see
      Section 10.13.3, "Multi-Byte Character Support for Complex
      Character Sets".
  3. Modify the configuration information. Use the existing configuration information as a guide to adding information for MYSET. The example here assumes that the character set has default and binary collations, but more lines are needed if MYSET has additional collations.

    1. Edit 'mysys/charset-def.c', and 'register' the collations for the new character set.

      Add these lines to the 'declaration' section:

         #ifdef HAVE_CHARSET_MYSET
         extern CHARSET_INFO my_charset_MYSET_general_ci;
         extern CHARSET_INFO my_charset_MYSET_bin;
         #endif

      Add these lines to the 'registration' section:

         #ifdef HAVE_CHARSET_MYSET
           add_compiled_collation(&my_charset_MYSET_general_ci);
           add_compiled_collation(&my_charset_MYSET_bin);
         #endif
    2. If the character set uses 'ctype-MYSET.c', edit 'strings/CMakeLists.txt' and add 'ctype-MYSET.c' to the definition of the 'STRINGS_SOURCES' variable.

    3. Edit 'cmake/character_sets.cmake':

      1. Add MYSET to the value of 'CHARSETS_AVAILABLE' in
         alphabetic order.
      
      2. Add MYSET to the value of 'CHARSETS_COMPLEX' in
         alphabetic order.  This is needed even for simple
         character sets, or 'CMake' does not recognize
         '-DDEFAULT_CHARSET=MYSET'.
  4. Reconfigure, recompile, and test.
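
Before rebuilding, the requirements from step 1 can be checked mechanically. The following is a minimal sketch in Python, not part of the MySQL source tree. The function name 'check_charset_entry' is hypothetical, and the sketch assumes only the element layout visible in the 'latin1' listing: '<collation>' elements with 'name' and 'id' attributes and optional '<flag>' children.

```python
import xml.etree.ElementTree as ET


def check_charset_entry(index_xml_text, charset_name):
    """Return a list of problems found for one <charset> element.

    Hypothetical helper: assumes <collation name=... id=...> elements
    with optional <flag> children, as in the latin1 listing.
    """
    root = ET.fromstring(index_xml_text)
    problems = []

    # Collect every collation ID in the file so that duplicates are
    # detected across character sets, not just within the new one.
    seen_ids = {}
    for coll in root.iter("collation"):
        coll_id = coll.get("id")
        name = coll.get("name")
        if coll_id is None:
            problems.append("collation %s has no id" % name)
        elif coll_id in seen_ids:
            problems.append(
                "id %s used by both %s and %s" % (coll_id, seen_ids[coll_id], name)
            )
        else:
            seen_ids[coll_id] = name

    # Locate the <charset> element (iter() also considers the root itself).
    charset = None
    for elem in root.iter("charset"):
        if elem.get("name") == charset_name:
            charset = elem
            break
    if charset is None:
        return problems + ["no <charset> element named %s" % charset_name]

    # A primary (default) and a binary collation must both be present.
    flags = set()
    for coll in charset.iter("collation"):
        for flag in coll.findall("flag"):
            flags.add((flag.text or "").strip())
    if "primary" not in flags:
        problems.append("no collation has the primary flag")
    if "binary" not in flags:
        problems.append("no collation has the binary flag")
    return problems
```

An empty list means the entry passed these particular checks; it does not prove that the server will accept the character set.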


10.13.1 Character Definition Arrays

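Each simple character set has a configuration file located in the 'sql/share/charsets' directory. For a character set named MYSET, the file is named 'MYSET.xml'. It uses '<map>' array elements to list character set properties. '<map>' elements appear within these elements:

  * '<ctype>' defines attributes for each character.

  * '<lower>' and '<upper>' list the lowercase and uppercase characters.

  * '<unicode>' maps 8-bit character values to Unicode values.

  * '<collation>' elements indicate character ordering for comparison and sorting, one element per collation. Binary collations need no '<map>' element because the character codes themselves provide the ordering.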

For a complex character set as implemented in a 'ctype-MYSET.c' file in the 'strings' directory, there are corresponding arrays: 'ctype_MYSET[]', 'to_lower_MYSET[]', and so forth. Not every complex character set has all of the arrays. See also the existing 'ctype-*.c' files for examples. See the 'CHARSET_INFO.txt' file in the 'strings' directory for additional information.

Most of the arrays are indexed by character value and have 256 elements. The '<ctype>' array is indexed by character value + 1 and has 257 elements. This is a legacy convention for handling 'EOF'.
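
The word counts are easy to get wrong when editing a configuration file by hand. The following Python sketch shows one way to check them. It is not a MySQL tool: the function name 'check_map_arrays' is hypothetical, and the sketch assumes that each array is a '<map>' element nested directly inside its parent element and that its words are separated by whitespace.

```python
import re
import xml.etree.ElementTree as ET

HEX_WORD = re.compile(r"[0-9A-Fa-f]+")


def check_map_arrays(charset_xml_text):
    """Return a list of problems with the <map> arrays in a charset file.

    Hypothetical helper: the ctype array must have 257 words and every
    other array 256 words, and each word must be hexadecimal.
    """
    root = ET.fromstring(charset_xml_text)
    problems = []
    for parent in root.iter():
        # Use the collation name when there is one, else the tag name.
        label = parent.get("name") or parent.tag
        for map_elem in parent.findall("map"):
            words = (map_elem.text or "").split()
            expected = 257 if parent.tag == "ctype" else 256
            if len(words) != expected:
                problems.append(
                    "%s: expected %d words, found %d"
                    % (label, expected, len(words))
                )
            for word in words:
                if not HEX_WORD.fullmatch(word):
                    problems.append("%s: %r is not hexadecimal" % (label, word))
    return problems
```

Calling the function on the text of 'MYSET.xml' returns an empty list when every array has the expected size and contains only hexadecimal words.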

'<ctype>' array elements are bit values. Each element describes the attributes of a single character in the character set. Each attribute is associated with a bitmask, as defined in 'include/m_ctype.h':

 #define _MY_U   01      /* Upper case */
 #define _MY_L   02      /* Lower case */
 #define _MY_NMR 04      /* Numeral (digit) */
 #define _MY_SPC 010     /* Spacing character */
 #define _MY_PNT 020     /* Punctuation */
 #define _MY_CTR 040     /* Control character */
 #define _MY_B   0100    /* Blank */
 #define _MY_X   0200    /* heXadecimal digit */

The '<ctype>' value for a given character should be the union of the applicable bitmask values that describe the character. For example, 'A' is an uppercase character ('_MY_U') as well as a hexadecimal digit ('_MY_X'), so its 'ctype' value should be defined like this:

 ctype['A'+1] = _MY_U | _MY_X = 01 | 0200 = 0201

The bitmask values in 'm_ctype.h' are octal values, but the elements of the '<ctype>' array in 'MYSET.xml' should be written as hexadecimal values.

The '<lower>' and '<upper>' arrays hold the lowercase and uppercase characters corresponding to each member of the character set. For example:

 lower['A'] should contain 'a'
 upper['a'] should contain 'A'
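
As a rough sketch of how such arrays might be produced, the following Python code builds both arrays for a hypothetical character set whose only cased letters are ASCII 'A' to 'Z' and 'a' to 'z', then formats one of them as hexadecimal words. The function names are illustrative and not part of MySQL.

```python
def build_case_maps():
    """Build 256-element to_lower and to_upper arrays.

    Hypothetical character set: only ASCII A-Z and a-z have case.
    """
    to_lower = list(range(256))   # default: each character maps to itself
    to_upper = list(range(256))
    for code in range(ord("A"), ord("Z") + 1):
        to_lower[code] = code + 32    # 'A' (0x41) -> 'a' (0x61)
        to_upper[code + 32] = code    # 'a' (0x61) -> 'A' (0x41)
    return to_lower, to_upper


def map_words(array, per_line=16):
    """Format an array as hexadecimal words, per_line words on each line."""
    lines = []
    for start in range(0, len(array), per_line):
        chunk = array[start:start + per_line]
        lines.append(" ".join("%02X" % value for value in chunk))
    return "\n".join(lines)


to_lower, to_upper = build_case_maps()
print(map_words(to_lower))
```

Characters without a case counterpart map to themselves, so both arrays always contain a full 256 entries.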

Each '<collation>' array indicates how characters should be ordered for comparison and sorting purposes. MySQL sorts characters based on the values in this array. In some cases, this is the same as the '<upper>' array, which means that sorting is case-insensitive. For more complicated sorting rules (for complex character sets), see the discussion of string collating in Section 10.13.2, "String Collating Support for Complex Character Sets".
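
The effect of a sort-order array can be seen in a small Python sketch. It assumes a hypothetical character set in which the array equals an uppercase-mapping array for ASCII letters; the names 'sort_key' and 'sort_order' are illustrative, and the sketch does not reproduce how the server itself compares strings.

```python
def sort_key(text, sort_order):
    """Map each byte of text to its weight in a 256-element array."""
    return [sort_order[byte] for byte in text]


# Hypothetical case-insensitive ordering: lowercase ASCII letters get
# the same weight as their uppercase counterparts.
sort_order = list(range(256))
for code in range(ord("a"), ord("z") + 1):
    sort_order[code] = code - 32

words = [b"pear", b"Apple", b"apple", b"Banana"]
by_weight = sorted(words, key=lambda w: sort_key(w, sort_order))
by_code = sorted(words)   # plain byte order, as a binary collation would use

print(by_weight)
print(by_code)
```

With the weight array, "Apple" and "apple" compare as equal and sort together ahead of "Banana". In plain byte order, every uppercase letter sorts before every lowercase one.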


10.13.2 String Collating Support for Complex Character Sets

For a simple character set named MYSET, sorting rules are specified in the 'MYSET.xml' configuration file using '<map>' array elements within '<collation>' elements. If the sorting rules for your language are too complex to be handled with simple arrays, you must define string collating functions in the 'ctype-MYSET.c' source file in the 'strings' directory.

The existing character sets provide the best documentation and examples to show how these functions are implemented. Look at the 'ctype-*.c' files in the 'strings' directory, such as the files for the 'big5', 'czech', 'gbk', 'sjis', and 'tis620' character sets. Take a look at the 'MY_COLLATION_HANDLER' structures to see how they are used. See also the 'CHARSET_INFO.txt' file in the 'strings' directory for additional information.
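
To see why a per-character array is sometimes not enough, consider a rule in which two letters sort as a single unit. In Czech, for example, "ch" is ordered after "h" and before "i". The Python sketch below illustrates that one rule with made-up weights. It is not how the 'czech' collation in the MySQL sources is implemented, it handles only ASCII letters, and the names 'collation_key' and 'compare' are hypothetical.

```python
def collation_key(text):
    """Return a list of weights, treating "ch" as one collating element.

    Illustrative weights: letter position times 2, which leaves room
    to place "ch" between "h" and "i". ASCII letters only.
    """
    text = text.lower()
    key = []
    i = 0
    while i < len(text):
        if text.startswith("ch", i):
            key.append(2 * (ord("h") - ord("a")) + 1)   # just after "h"
            i += 2
        else:
            key.append(2 * (ord(text[i]) - ord("a")))
            i += 1
    return key


def compare(a, b):
    """Three-way comparison: -1, 0, or 1."""
    key_a, key_b = collation_key(a), collation_key(b)
    return (key_a > key_b) - (key_a < key_b)


words = ["cihla", "chata", "hrad", "cena", "ideal"]
print(sorted(words, key=collation_key))
print(sorted(words))
```

The first result places "chata" after "hrad", while plain code-point order places it directly after "cena". No 256-element weight array can express this, because the weight of "c" would have to depend on the character that follows it.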


10.13.3 Multi-Byte Character Support for Complex Character Sets

If you want to add support for a new character set named MYSET that includes multibyte characters, you must use multibyte character functions in the 'ctype-MYSET.c' source file in the 'strings' directory.

The existing character sets provide the best documentation and examples to show how these functions are implemented. Look at the 'ctype-*.c' files in the 'strings' directory, such as the files for the 'euc_kr', 'gb2312', 'gbk', 'sjis', and 'ujis' character sets. Take a look at the 'MY_CHARSET_HANDLER' structures to see how they are used. See also the 'CHARSET_INFO.txt' file in the 'strings' directory for additional information.
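
The central task of such functions is deciding how many bytes the next character occupies. The following Python sketch shows the idea for a hypothetical two-byte encoding; the byte ranges are invented for the example and do not describe any particular MySQL character set, and the function names are illustrative.

```python
def char_length(data, pos):
    """Return the byte length (1 or 2) of the character starting at pos.

    Hypothetical encoding: a lead byte in 0x81-0xFE followed by a trail
    byte in 0x40-0xFE forms a two-byte character. Every other byte,
    including a lead byte with no valid trail byte, counts as one byte.
    """
    lead = data[pos]
    if 0x81 <= lead <= 0xFE and pos + 1 < len(data):
        if 0x40 <= data[pos + 1] <= 0xFE:
            return 2
    return 1


def count_chars(data):
    """Count characters (not bytes) by stepping over each character."""
    count = 0
    pos = 0
    while pos < len(data):
        pos += char_length(data, pos)
        count += 1
    return count


sample = bytes([0x41, 0x81, 0x40, 0x42, 0xB0, 0xA1])
print(len(sample), count_chars(sample))
```

Operations such as computing string length in characters or finding a character position build on this kind of stepping, which is why a multibyte character set needs code rather than only lookup arrays.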
