12.9 Full-Text Search Functions

'MATCH (COL1,COL2,...) AGAINST (EXPR [SEARCH_MODIFIER])'

 SEARCH_MODIFIER:
   {
        IN NATURAL LANGUAGE MODE
      | IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
      | IN BOOLEAN MODE
      | WITH QUERY EXPANSION
   }

MySQL has support for full-text indexing and searching:

Full-text searching is performed using 'MATCH() AGAINST()' syntax. 'MATCH()' takes a comma-separated list that names the columns to be searched. 'AGAINST' takes a string to search for, and an optional modifier that indicates what type of search to perform. The search string must be a string value that is constant during query evaluation. This rules out, for example, a table column because that can differ for each row.

There are three types of full-text searches:

For information about 'FULLTEXT' query performance, see *note column-indexes::.

For more information about 'InnoDB' 'FULLTEXT' indexes, see *note innodb-fulltext-index::.

Constraints on full-text searching are listed in *note fulltext-restrictions::.

The note 'myisam_ftdump': myisam-ftdump. utility dumps the contents of a 'MyISAM' full-text index. This may be helpful for debugging full-text queries. See note myisam-ftdump::.

 File: manual.info.tmp, Node: fulltext-natural-language, Next: fulltext-boolean, Prev: fulltext-search, Up: fulltext-search

12.9.1 Natural Language Full-Text Searches

By default or with the 'IN NATURAL LANGUAGE MODE' modifier, the 'MATCH()' function performs a natural language search for a string against a text collection. A collection is a set of one or more columns included in a 'FULLTEXT' index. The search string is given as the argument to 'AGAINST()'. For each row in the table, 'MATCH()' returns a relevance value; that is, a similarity measure between the search string and the text in that row in the columns named in the 'MATCH()' list.

 mysql> CREATE TABLE articles (
     ->   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
     ->   title VARCHAR(200),
     ->   body TEXT,
     ->   FULLTEXT (title,body)
     -> ) ENGINE=InnoDB;
 Query OK, 0 rows affected (0.08 sec)

 mysql> INSERT INTO articles (title,body) VALUES
     ->   ('MySQL Tutorial','DBMS stands for DataBase ...'),
     ->   ('How To Use MySQL Well','After you went through a ...'),
     ->   ('Optimizing MySQL','In this tutorial, we show ...'),
     ->   ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
     ->   ('MySQL vs. YourSQL','In the following database comparison ...'),
     ->   ('MySQL Security','When configured properly, MySQL ...');
 Query OK, 6 rows affected (0.01 sec)
 Records: 6  Duplicates: 0  Warnings: 0

 mysql> SELECT * FROM articles
     -> WHERE MATCH (title,body)
     -> AGAINST ('database' IN NATURAL LANGUAGE MODE);
 +----+-------------------+------------------------------------------+
 | id | title             | body                                     |
 +----+-------------------+------------------------------------------+
 |  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
 |  5 | MySQL vs. YourSQL | In the following database comparison ... |
 +----+-------------------+------------------------------------------+
 2 rows in set (0.00 sec)

By default, the search is performed in case-insensitive fashion. To perform a case-sensitive full-text search, use a binary collation for the indexed columns. For example, a column that uses the 'latin1' character set of can be assigned a collation of 'latin1_bin' to make it case-sensitive for full-text searches.

When 'MATCH()' is used in a 'WHERE' clause, as in the example shown earlier, the rows returned are automatically sorted with the highest relevance first as long as the following conditions are met:

Given the conditions just listed, it is usually less effort to specify using 'ORDER BY' an explicit sort order when one is necessary or desired.

Relevance values are nonnegative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row (document), the number of unique words in the row, the total number of words in the collection, and the number of rows that contain a particular word.

Note:

The term 'document' may be used interchangeably with the term 'row', and both terms refer to the indexed part of the row. The term 'collection' refers to the indexed columns and encompasses all rows.

To simply count matches, you could use a query like this:

 mysql> SELECT COUNT(*) FROM articles
     -> WHERE MATCH (title,body)
     -> AGAINST ('database' IN NATURAL LANGUAGE MODE);
 +----------+
 | COUNT(*) |
 +----------+
 |        2 |
 +----------+
 1 row in set (0.00 sec)

You might find it quicker to rewrite the query as follows:

 mysql> SELECT
     -> COUNT(IF(MATCH (title,body) AGAINST ('database' IN NATURAL LANGUAGE MODE), 1, NULL))
     -> AS count
     -> FROM articles;
 +-------+
 | count |
 +-------+
 |     2 |
 +-------+
 1 row in set (0.03 sec)

The first query does some extra work (sorting the results by relevance) but also can use an index lookup based on the 'WHERE' clause. The index lookup might make the first query faster if the search matches few rows. The second query performs a full table scan, which might be faster than the index lookup if the search term was present in most rows.

For natural-language full-text searches, the columns named in the 'MATCH()' function must be the same columns included in some 'FULLTEXT' index in your table. For the preceding query, the columns named in the 'MATCH()' function ('title' and 'body') are the same as those named in the definition of the 'article' table's 'FULLTEXT' index. To search the 'title' or 'body' separately, you would create separate 'FULLTEXT' indexes for each column.

You can also perform a boolean search or a search with query expansion. These search types are described in note fulltext-boolean::, and note fulltext-query-expansion::.

A full-text search that uses an index can name columns only from a single table in the 'MATCH()' clause because an index cannot span multiple tables. For 'MyISAM' tables, a boolean search can be done in the absence of an index (albeit more slowly), in which case it is possible to name columns from multiple tables.

The preceding example is a basic illustration that shows how to use the 'MATCH()' function where rows are returned in order of decreasing relevance. The next example shows how to retrieve the relevance values explicitly. Returned rows are not ordered because the *note 'SELECT': select. statement includes neither 'WHERE' nor 'ORDER BY' clauses:

 mysql> SELECT id, MATCH (title,body)
     -> AGAINST ('Tutorial' IN NATURAL LANGUAGE MODE) AS score
     -> FROM articles;
 +----+---------------------+
 | id | score               |
 +----+---------------------+
 |  1 | 0.22764469683170319 |
 |  2 |                   0 |
 |  3 | 0.22764469683170319 |
 |  4 |                   0 |
 |  5 |                   0 |
 |  6 |                   0 |
 +----+---------------------+
 6 rows in set (0.00 sec)

The following example is more complex. The query returns the relevance values and it also sorts the rows in order of decreasing relevance. To achieve this result, specify 'MATCH()' twice: once in the *note 'SELECT': select. list and once in the 'WHERE' clause. This causes no additional overhead, because the MySQL optimizer notices that the two 'MATCH()' calls are identical and invokes the full-text search code only once.

 mysql> SELECT id, body, MATCH (title,body)
     ->   AGAINST ('Security implications of running MySQL as root'
     ->   IN NATURAL LANGUAGE MODE) AS score
     -> FROM articles
     ->   WHERE MATCH (title,body)
     ->   AGAINST('Security implications of running MySQL as root'
     ->   IN NATURAL LANGUAGE MODE);
 +----+-------------------------------------+-----------------+
 | id | body                                | score           |
 +----+-------------------------------------+-----------------+
 |  4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
 |  6 | When configured properly, MySQL ... | 1.3114095926285 |
 +----+-------------------------------------+-----------------+
 2 rows in set (0.00 sec)

A phrase that is enclosed within double quote ('"') characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in the 'FULLTEXT' index for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, '"test phrase"' matches '"test, phrase"'. If the phrase contains no words that are in the index, the result is empty. For example, if all words are either stopwords or shorter than the minimum length of indexed words, the result is empty.

The MySQL 'FULLTEXT' implementation regards any sequence of true word characters (letters, digits, and underscores) as a word. That sequence may also contain apostrophes ('''), but not more than one in a row. This means that 'aaa'bbb' is regarded as one word, but 'aaa''bbb' is regarded as two words. Apostrophes at the beginning or the end of a word are stripped by the 'FULLTEXT' parser; ''aaa'bbb'' would be parsed as 'aaa'bbb'.

The built-in 'FULLTEXT' parser determines where words start and end by looking for certain delimiter characters; for example, ' ' (space), ',' (comma), and '.' (period). If words are not separated by delimiters (as in, for example, Chinese), the built-in 'FULLTEXT' parser cannot determine where a word begins or ends. To be able to add words or other indexed terms in such languages to a 'FULLTEXT' index that uses the built-in 'FULLTEXT' parser, you must preprocess them so that they are separated by some arbitrary delimiter. Alternatively, you can create 'FULLTEXT' indexes using the ngram parser plugin (for Chinese, Japanese, or Korean) or the MeCab parser plugin (for Japanese).

It is possible to write a plugin that replaces the built-in full-text parser. For details, see The MySQL Plugin API (https://dev.mysql.com/doc/extending-mysql/5.7/en/plugin-api.html). For example parser plugin source code, see the 'plugin/fulltext' directory of a MySQL source distribution.

Some words are ignored in full-text searches:

See note fulltext-stopwords:: to view default stopword lists and how to change them. The default minimum word length can be changed as described in note fulltext-fine-tuning::.

Every correct word in the collection and in the query is weighted according to its significance in the collection or query. Thus, a word that is present in many documents has a lower weight, because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are combined to compute the relevance of the row. This technique works best with large collections.

MyISAM Limitation:

For very small tables, word distribution does not adequately reflect their semantic value, and this model may sometimes produce bizarre results for search indexes on 'MyISAM' tables. For example, although the word 'MySQL' is present in every row of the 'articles' table shown earlier, a search for the word in a 'MyISAM' search index produces no results:

 mysql> SELECT * FROM articles
     -> WHERE MATCH (title,body)
     -> AGAINST ('MySQL' IN NATURAL LANGUAGE MODE);
 Empty set (0.00 sec)

The search result is empty because the word 'MySQL' is present in at least 50% of the rows, and so is effectively treated as a stopword. This filtering technique is more suitable for large data sets, where you might not want the result set to return every second row from a 1GB table, than for small data sets where it might cause poor results for popular terms.

The 50% threshold can surprise you when you first try full-text searching to see how it works, and makes 'InnoDB' tables more suited to experimentation with full-text searches. If you create a 'MyISAM' table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results until the table contains more rows. Users who need to bypass the 50% limitation can build search indexes on 'InnoDB' tables, or use the boolean search mode explained in *note fulltext-boolean::.

 File: manual.info.tmp, Node: fulltext-boolean, Next: fulltext-query-expansion, Prev: fulltext-natural-language, Up: fulltext-search

12.9.2 Boolean Full-Text Searches

MySQL can perform boolean full-text searches using the 'IN BOOLEAN MODE' modifier. With this modifier, certain characters have special meaning at the beginning or end of words in the search string. In the following query, the '+' and '-' operators indicate that a word must be present or absent, respectively, for a match to occur. Thus, the query retrieves all the rows that contain the word 'MySQL' but that do not contain the word 'YourSQL':

 mysql> SELECT * FROM articles WHERE MATCH (title,body)
     -> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
 +----+-----------------------+-------------------------------------+
 | id | title                 | body                                |
 +----+-----------------------+-------------------------------------+
 |  1 | MySQL Tutorial        | DBMS stands for DataBase ...        |
 |  2 | How To Use MySQL Well | After you went through a ...        |
 |  3 | Optimizing MySQL      | In this tutorial, we show ...       |
 |  4 | 1001 MySQL Tricks     | 1. Never run mysqld as root. 2. ... |
 |  6 | MySQL Security        | When configured properly, MySQL ... |
 +----+-----------------------+-------------------------------------+

Note:

In implementing this feature, MySQL uses what is sometimes referred to as implied Boolean logic, in which

Boolean full-text searches have these characteristics:

The boolean full-text search capability supports the following operators:

The following examples demonstrate some search strings that use boolean full-text operators:

Relevancy Rankings for InnoDB Boolean Mode Search

note 'InnoDB': innodb-storage-engine. full-text search is modeled on the Sphinx (http://sphinxsearch.com/) full-text search engine, and the algorithms used are based on BM25 (http://en.wikipedia.org/wiki/Okapi_BM25) and TF-IDF (http://en.wikipedia.org/wiki/TF-IDF) ranking algorithms. For these reasons, relevancy rankings for 'InnoDB' boolean full-text search may differ from note 'MyISAM': myisam-storage-engine. relevancy rankings.

'InnoDB' uses a variation of the 'term frequency-inverse document frequency' ('TF-IDF') weighting system to rank a document's relevance for a given full-text search query. The 'TF-IDF' weighting is based on how frequently a word appears in a document, offset by how frequently the word appears in all documents in the collection. In other words, the more frequently a word appears in a document, and the less frequently the word appears in the document collection, the higher the document is ranked.

How Relevancy Ranking is Calculated

The term frequency ('TF') value is the number of times that a word appears in a document. The inverse document frequency ('IDF') value of a word is calculated using the following formula, where 'total_records' is the number of records in the collection, and 'matching_records' is the number of records that the search term appears in.

 ${IDF} = log10( ${total_records} / ${matching_records} )

When a document contains a word multiple times, the IDF value is multiplied by the TF value:

 ${TF} * ${IDF}

Using the 'TF' and 'IDF' values, the relevancy ranking for a document is calculated using this formula:

 ${rank} = ${TF} * ${IDF} * ${IDF}

The formula is demonstrated in the following examples.

Relevancy Ranking for a Single Word Search

This example demonstrates the relevancy ranking calculation for a single-word search.

 mysql> CREATE TABLE articles (
     ->   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
     ->   title VARCHAR(200),
     ->   body TEXT,
     ->   FULLTEXT (title,body)
     ->)  ENGINE=InnoDB;
 Query OK, 0 rows affected (1.04 sec)

 mysql> INSERT INTO articles (title,body) VALUES
     ->   ('MySQL Tutorial','This database tutorial ...'),
     ->   ("How To Use MySQL",'After you went through a ...'),
     ->   ('Optimizing Your Database','In this database tutorial ...'),
     ->   ('MySQL vs. YourSQL','When comparing databases ...'),
     ->   ('MySQL Security','When configured properly, MySQL ...'),
     ->   ('Database, Database, Database','database database database'),
     ->   ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
     ->   ('MySQL Full-Text Indexes', 'MySQL fulltext indexes use a ..');
 Query OK, 8 rows affected (0.06 sec)
 Records: 8  Duplicates: 0  Warnings: 0

 mysql> SELECT id, title, body,
     ->   MATCH (title,body) AGAINST ('database' IN BOOLEAN MODE) AS score
     ->   FROM articles ORDER BY score DESC;
 +----+------------------------------+-------------------------------------+---------------------+
 | id | title                        | body                                | score               |
 +----+------------------------------+-------------------------------------+---------------------+
 |  6 | Database, Database, Database | database database database          |  1.0886961221694946 |
 |  3 | Optimizing Your Database     | In this database tutorial ...       | 0.36289870738983154 |
 |  1 | MySQL Tutorial               | This database tutorial ...          | 0.18144935369491577 |
 |  2 | How To Use MySQL             | After you went through a ...        |                   0 |
 |  4 | MySQL vs. YourSQL            | When comparing databases ...        |                   0 |
 |  5 | MySQL Security               | When configured properly, MySQL ... |                   0 |
 |  7 | 1001 MySQL Tricks            | 1. Never run mysqld as root. 2. ... |                   0 |
 |  8 | MySQL Full-Text Indexes      | MySQL fulltext indexes use a ..     |                   0 |
 +----+------------------------------+-------------------------------------+---------------------+
 8 rows in set (0.00 sec)

There are 8 records in total, with 3 that match the 'database' search term. The first record ('id 6') contains the search term 6 times and has a relevancy ranking of '1.0886961221694946'. This ranking value is calculated using a 'TF' value of 6 (the 'database' search term appears 6 times in record 'id 6') and an 'IDF' value of 0.42596873216370745, which is calculated as follows (where 8 is the total number of records and 3 is the number of records that the search term appears in):

 ${IDF} = LOG10( 8 / 3 ) = 0.42596873216370745

The 'TF' and 'IDF' values are then entered into the ranking formula:

 ${rank} = ${TF} * ${IDF} * ${IDF}

Performing the calculation in the MySQL command-line client returns a ranking value of 1.088696164686938.

 mysql> SELECT 6*LOG10(8/3)*LOG10(8/3);
 +-------------------------+
 | 6*LOG10(8/3)*LOG10(8/3) |
 +-------------------------+
 |       1.088696164686938 |
 +-------------------------+
 1 row in set (0.00 sec)

Note:

You may notice a slight difference in the ranking values returned by the 'SELECT ... MATCH ... AGAINST' statement and the MySQL command-line client ('1.0886961221694946' versus '1.088696164686938'). The difference is due to how the casts between integers and floats/doubles are performed internally by 'InnoDB' (along with related precision and rounding decisions), and how they are performed elsewhere, such as in the MySQL command-line client or other types of calculators.

Relevancy Ranking for a Multiple Word Search

This example demonstrates the relevancy ranking calculation for a multiple-word full-text search based on the 'articles' table and data used in the previous example.

If you search on more than one word, the relevancy ranking value is a sum of the relevancy ranking value for each word, as shown in this formula:

 ${rank} = ${TF} * ${IDF} * ${IDF} + ${TF} * ${IDF} * ${IDF}

Performing a search on two terms ('mysql tutorial') returns the following results:

 mysql> SELECT id, title, body, MATCH (title,body)
     ->   AGAINST ('mysql tutorial' IN BOOLEAN MODE) AS score
     ->   FROM articles ORDER BY score DESC;
 +----+------------------------------+-------------------------------------+----------------------+
 | id | title                        | body                                | score                |
 +----+------------------------------+-------------------------------------+----------------------+
 |  1 | MySQL Tutorial               | This database tutorial ...          |   0.7405621409416199 |
 |  3 | Optimizing Your Database     | In this database tutorial ...       |   0.3624762296676636 |
 |  5 | MySQL Security               | When configured properly, MySQL ... | 0.031219376251101494 |
 |  8 | MySQL Full-Text Indexes      | MySQL fulltext indexes use a ..     | 0.031219376251101494 |
 |  2 | How To Use MySQL             | After you went through a ...        | 0.015609688125550747 |
 |  4 | MySQL vs. YourSQL            | When comparing databases ...        | 0.015609688125550747 |
 |  7 | 1001 MySQL Tricks            | 1. Never run mysqld as root. 2. ... | 0.015609688125550747 |
 |  6 | Database, Database, Database | database database database          |                    0 |
 +----+------------------------------+-------------------------------------+----------------------+
 8 rows in set (0.00 sec)

In the first record ('id 8'), 'mysql' appears once and 'tutorial' appears twice. There are six matching records for 'mysql' and two matching records for 'tutorial'. The MySQL command-line client returns the expected ranking value when inserting these values into the ranking formula for a multiple word search:

 mysql> SELECT (1*log10(8/6)*log10(8/6)) + (2*log10(8/2)*log10(8/2));
 +-------------------------------------------------------+
 | (1*log10(8/6)*log10(8/6)) + (2*log10(8/2)*log10(8/2)) |
 +-------------------------------------------------------+
 |                                    0.7405621541938003 |
 +-------------------------------------------------------+
 1 row in set (0.00 sec)

Note:

The slight difference in the ranking values returned by the 'SELECT ... MATCH ... AGAINST' statement and the MySQL command-line client is explained in the preceding example.

 File: manual.info.tmp, Node: fulltext-query-expansion, Next: fulltext-stopwords, Prev: fulltext-boolean, Up: fulltext-search

12.9.3 Full-Text Searches with Query Expansion

Full-text search supports query expansion (and in particular, its variant 'blind query expansion'). This is generally useful when a search phrase is too short, which often means that the user is relying on implied knowledge that the full-text search engine lacks. For example, a user searching for 'database' may really mean that 'MySQL', 'Oracle', 'DB2', and 'RDBMS' all are phrases that should match 'databases' and should be returned, too. This is implied knowledge.

Blind query expansion (also known as automatic relevance feedback) is enabled by adding 'WITH QUERY EXPANSION' or 'IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION' following the search phrase. It works by performing the search twice, where the search phrase for the second search is the original search phrase concatenated with the few most highly relevant documents from the first search. Thus, if one of these documents contains the word 'databases' and the word 'MySQL', the second search finds the documents that contain the word 'MySQL' even if they do not contain the word 'database'. The following example shows this difference:

 mysql> SELECT * FROM articles
     WHERE MATCH (title,body)
     AGAINST ('database' IN NATURAL LANGUAGE MODE);
 +----+-------------------+------------------------------------------+
 | id | title             | body                                     |
 +----+-------------------+------------------------------------------+
 |  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
 |  5 | MySQL vs. YourSQL | In the following database comparison ... |
 +----+-------------------+------------------------------------------+
 2 rows in set (0.00 sec)

 mysql> SELECT * FROM articles
     WHERE MATCH (title,body)
     AGAINST ('database' WITH QUERY EXPANSION);
 +----+-----------------------+------------------------------------------+
 | id | title                 | body                                     |
 +----+-----------------------+------------------------------------------+
 |  5 | MySQL vs. YourSQL     | In the following database comparison ... |
 |  1 | MySQL Tutorial        | DBMS stands for DataBase ...             |
 |  3 | Optimizing MySQL      | In this tutorial we show ...             |
 |  6 | MySQL Security        | When configured properly, MySQL ...      |
 |  2 | How To Use MySQL Well | After you went through a ...             |
 |  4 | 1001 MySQL Tricks     | 1. Never run mysqld as root. 2. ...      |
 +----+-----------------------+------------------------------------------+
 6 rows in set (0.00 sec)

Another example could be searching for books by Georges Simenon about Maigret, when a user is not sure how to spell 'Maigret'. A search for 'Megre and the reluctant witnesses' finds only 'Maigret and the Reluctant Witnesses' without query expansion. A search with query expansion finds all books with the word 'Maigret' on the second pass.

Note:

Because blind query expansion tends to increase noise significantly by returning nonrelevant documents, use it only when a search phrase is short.

 File: manual.info.tmp, Node: fulltext-stopwords, Next: fulltext-restrictions, Prev: fulltext-query-expansion, Up: fulltext-search

12.9.4 Full-Text Stopwords

The stopword list is loaded and searched for full-text queries using the server character set and collation (the values of the 'character_set_server' and 'collation_server' system variables). False hits or misses might occur for stopword lookups if the stopword file or columns used for full-text indexing or searches have a character set or collation different from 'character_set_server' or 'collation_server'.

Case sensitivity of stopword lookups depends on the server collation. For example, lookups are case-insensitive if the collation is 'latin1_swedish_ci', whereas lookups are case-sensitive if the collation is 'latin1_general_cs' or 'latin1_bin'.

Stopwords for InnoDB Search Indexes

'InnoDB' has a relatively short list of default stopwords, because documents from technical, literary, and other sources often use short words as keywords or in significant phrases. For example, you might search for 'to be or not to be' and expect to get a sensible result, rather than having all those words ignored.

To see the default 'InnoDB' stopword list, query the Information Schema *note 'INNODB_FT_DEFAULT_STOPWORD': information-schema-innodb-ft-default-stopword-table. table.

 mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
 +-------+
 | value |
 +-------+
 | a     |
 | about |
 | an    |
 | are   |
 | as    |
 | at    |
 | be    |
 | by    |
 | com   |
 | de    |
 | en    |
 | for   |
 | from  |
 | how   |
 | i     |
 | in    |
 | is    |
 | it    |
 | la    |
 | of    |
 | on    |
 | or    |
 | that  |
 | the   |
 | this  |
 | to    |
 | was   |
 | what  |
 | when  |
 | where |
 | who   |
 | will  |
 | with  |
 | und   |
 | the   |
 | www   |
 +-------+
 36 rows in set (0.00 sec)

To define your own stopword list for all 'InnoDB' tables, define a table with the same structure as the note 'INNODB_FT_DEFAULT_STOPWORD': information-schema-innodb-ft-default-stopword-table. table, populate it with stopwords, and set the value of the 'innodb_ft_server_stopword_table' option to a value in the form 'DB_NAME/TABLE_NAME' before creating the full-text index. The stopword table must have a single note 'VARCHAR': char. column named 'value'. The following example demonstrates creating and configuring a new global stopword table for 'InnoDB'.

 -- Create a new stopword table

 mysql> CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB;
 Query OK, 0 rows affected (0.01 sec)

 -- Insert stopwords (for simplicity, a single stopword is used in this example)

 mysql> INSERT INTO my_stopwords(value) VALUES ('Ishmael');
 Query OK, 1 row affected (0.00 sec)

 -- Create the table

 mysql> CREATE TABLE opening_lines (
 id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
 opening_line TEXT(500),
 author VARCHAR(200),
 title VARCHAR(200)
 ) ENGINE=InnoDB;
 Query OK, 0 rows affected (0.01 sec)

 -- Insert data into the table

 mysql> INSERT INTO opening_lines(opening_line,author,title) VALUES
 ('Call me Ishmael.','Herman Melville','Moby-Dick'),
 ('A screaming comes across the sky.','Thomas Pynchon','Gravity\'s Rainbow'),
 ('I am an invisible man.','Ralph Ellison','Invisible Man'),
 ('Where now? Who now? When now?','Samuel Beckett','The Unnamable'),
 ('It was love at first sight.','Joseph Heller','Catch-22'),
 ('All this happened, more or less.','Kurt Vonnegut','Slaughterhouse-Five'),
 ('Mrs. Dalloway said she would buy the flowers herself.','Virginia Woolf','Mrs. Dalloway'),
 ('It was a pleasure to burn.','Ray Bradbury','Fahrenheit 451');
 Query OK, 8 rows affected (0.00 sec)
 Records: 8  Duplicates: 0  Warnings: 0

 -- Set the innodb_ft_server_stopword_table option to the new stopword table

 mysql> SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';
 Query OK, 0 rows affected (0.00 sec)

 -- Create the full-text index (which rebuilds the table if no FTS_DOC_ID column is defined)

 mysql> CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
 Query OK, 0 rows affected, 1 warning (1.17 sec)
 Records: 0  Duplicates: 0  Warnings: 1

Verify that the specified stopword ('Ishmael') does not appear by querying the Information Schema *note 'INNODB_FT_INDEX_TABLE': information-schema-innodb-ft-index-table-table. table.

Note:

By default, words less than 3 characters in length or greater than 84 characters in length do not appear in an 'InnoDB' full-text search index. Maximum and minimum word length values are configurable using the 'innodb_ft_max_token_size' and 'innodb_ft_min_token_size' variables. This default behavior does not apply to the ngram parser plugin. ngram token size is defined by the 'ngram_token_size' option.

 mysql> SET GLOBAL innodb_ft_aux_table='test/opening_lines';
 Query OK, 0 rows affected (0.00 sec)

 mysql> SELECT word FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE LIMIT 15;
 +-----------+
 | word      |
 +-----------+
 | across    |
 | all       |
 | burn      |
 | buy       |
 | call      |
 | comes     |
 | dalloway  |
 | first     |
 | flowers   |
 | happened  |
 | herself   |
 | invisible |
 | less      |
 | love      |
 | man       |
 +-----------+
 15 rows in set (0.00 sec)

To create stopword lists on a table-by-table basis, create other stopword tables and use the 'innodb_ft_user_stopword_table' option to specify the stopword table that you want to use before you create the full-text index.

Stopwords for MyISAM Search Indexes

The stopword file is loaded and searched using 'latin1' if 'character_set_server' is 'ucs2', 'utf16', 'utf16le', or 'utf32'.

To override the default stopword list for MyISAM tables, set the 'ft_stopword_file' system variable. (See *note server-system-variables::.) The variable value should be the path name of the file containing the stopword list, or the empty string to disable stopword filtering. The server looks for the file in the data directory unless an absolute path name is given to specify a different directory. After changing the value of this variable or the contents of the stopword file, restart the server and rebuild your 'FULLTEXT' indexes.

The stopword list is free-form, separating stopwords with any nonalphanumeric character such as newline, space, or comma. Exceptions are the underscore character ('_') and a single apostrophe (''') which are treated as part of a word. The character set of the stopword list is the server's default character set; see *note charset-server::.

The following list shows the default stopwords for 'MyISAM' search indexes. In a MySQL source distribution, you can find this list in the 'storage/myisam/ft_static.c' file.

 a's           able          about         above         according
 accordingly   across        actually      after         afterwards
 again         against       ain't         all           allow
 allows        almost        alone         along         already
 also          although      always        am            among
 amongst       an            and           another       any
 anybody       anyhow        anyone        anything      anyway
 anyways       anywhere      apart         appear        appreciate
 appropriate   are           aren't        around        as
 aside         ask           asking        associated    at
 available     away          awfully       be            became
 because       become        becomes       becoming      been
 before        beforehand    behind        being         believe
 below         beside        besides       best          better
 between       beyond        both          brief         but
 by            c'mon         c's           came          can
 can't         cannot        cant          cause         causes
 certain       certainly     changes       clearly       co
 com           come          comes         concerning    consequently
 consider      considering   contain       containing    contains
 corresponding could         couldn't      course        currently
 definitely    described     despite       did           didn't
 different     do            does          doesn't       doing
 don't         done          down          downwards     during
 each          edu           eg            eight         either
 else          elsewhere     enough        entirely      especially
 et            etc           even          ever          every
 everybody     everyone      everything    everywhere    ex
 exactly       example       except        far           few
 fifth         first         five          followed      following
 follows       for           former        formerly      forth
 four          from          further       furthermore   get
 gets          getting       given         gives         go
 goes          going         gone          got           gotten
 greetings     had           hadn't        happens       hardly
 has           hasn't        have          haven't       having
 he            he's          hello         help          hence
 her           here          here's        hereafter     hereby
 herein        hereupon      hers          herself       hi
 him           himself       his           hither        hopefully
 how           howbeit       however       i'd           i'll
 i'm           i've          ie            if            ignored
 immediate     in            inasmuch      inc           indeed
 indicate      indicated     indicates     inner         insofar
 instead       into          inward        is            isn't
 it            it'd          it'll         it's          its
 itself        just          keep          keeps         kept
 know          known         knows         last          lately
 later         latter        latterly      least         less
 lest          let           let's         like          liked
 likely        little        look          looking       looks
 ltd           mainly        many          may           maybe
 me            mean          meanwhile     merely        might
 more          moreover      most          mostly        much
 must          my            myself        name          namely
 nd            near          nearly        necessary     need
 needs         neither       never         nevertheless  new
 next          nine          no            nobody        non
 none          noone         nor           normally      not
 nothing       novel         now           nowhere       obviously
 of            off           often         oh            ok
 okay          old           on            once          one
 ones          only          onto          or            other
 others        otherwise     ought         our           ours
 ourselves     out           outside       over          overall
 own           particular    particularly  per           perhaps
 placed        please        plus          possible      presumably
 probably      provides      que           quite         qv
 rather        rd            re            really        reasonably
 regarding     regardless    regards       relatively    respectively
 right         said          same          saw           say
 saying        says          second        secondly      see
 seeing        seem          seemed        seeming       seems
 seen          self          selves        sensible      sent
 serious       seriously     seven         several       shall
 she           should        shouldn't     since         six
 so            some          somebody      somehow       someone
 something     sometime      sometimes     somewhat      somewhere
 soon          sorry         specified     specify       specifying
 still         sub           such          sup           sure
 t's           take          taken         tell          tends
 th            than          thank         thanks        thanx
 that          that's        thats         the           their
 theirs        them          themselves    then          thence
 there         there's       thereafter    thereby       therefore
 therein       theres        thereupon     these         they
 they'd        they'll       they're       they've       think
 third         this          thorough      thoroughly    those
 though        three         through       throughout    thru
 thus          to            together      too           took
 toward        towards       tried         tries         truly
 try           trying        twice         two           un
 under         unfortunately unless        unlikely      until
 unto          up            upon          us            use
 used          useful        uses          using         usually
 value         various       very          via           viz
 vs            want          wants         was           wasn't
 way           we            we'd          we'll         we're
 we've         welcome       well          went          were
 weren't       what          what's        whatever      when
 whence        whenever      where         where's       whereafter
 whereas       whereby       wherein       whereupon     wherever
 whether       which         while         whither       who
 who's         whoever       whole         whom          whose
 why           will          willing       wish          with
 within        without       won't         wonder        would
 wouldn't      yes           yet           you           you'd
 you'll        you're        you've        your          yours
 yourself      yourselves    zero

 File: manual.info.tmp, Node: fulltext-restrictions, Next: fulltext-fine-tuning, Prev: fulltext-stopwords, Up: fulltext-search

12.9.5 Full-Text Restrictions

 File: manual.info.tmp, Node: fulltext-fine-tuning, Next: full-text-adding-collation, Prev: fulltext-restrictions, Up: fulltext-search

12.9.7 Adding a User-Defined Collation for Full-Text Indexing

This section describes how to add a user-defined collation for full-text searches using the built-in full-text parser. The sample collation is like 'latin1_swedish_ci' but treats the ''-'' character as a letter rather than as a punctuation character so that it can be indexed as a word character. General information about adding collations is given in *note adding-collation::; it is assumed that you have read it and are familiar with the files involved.

To add a collation for full-text indexing, use the following procedure. The instructions here add a collation for a simple character set, which as discussed in *note adding-collation::, can be created using a configuration file that describes the character set properties. For a complex character set such as Unicode, create collations using C source files that describe the character set properties.

  1. Add a collation to the 'Index.xml' file. The permitted range of IDs for user-defined collations is given in *note adding-collation-choosing-id::. The ID must be unused, so choose a value different from 1025 if that ID is already taken on your system.

      <charset name="latin1">
      ...
      <collation name="latin1_fulltext_ci" id="1025"/>
      </charset>
  2. Declare the sort order for the collation in the 'latin1.xml' file. In this case, the order can be copied from 'latin1_swedish_ci':

      <collation name="latin1_fulltext_ci">
      <map>
      00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
      10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
      20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
      30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
      40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
      50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
      60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
      50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
      80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
      90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
      A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
      B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
      41 41 41 41 5C 5B 5C 43 45 45 45 45 49 49 49 49
      44 4E 4F 4F 4F 4F 5D D7 D8 55 55 55 59 59 DE DF
      41 41 41 41 5C 5B 5C 43 45 45 45 45 49 49 49 49
      44 4E 4F 4F 4F 4F 5D F7 D8 55 55 55 59 59 DE FF
      </map>
      </collation>
  3. Modify the 'ctype' array in 'latin1.xml'. Change the value corresponding to 0x2D (which is the code for the ''-'' character) from 10 (punctuation) to 01 (uppercase letter). In the following array, this is the element in the fourth row down, third value from the end.

      <ctype>
      <map>
      00
      20 20 20 20 20 20 20 20 20 28 28 28 28 28 20 20
      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
      48 10 10 10 10 10 10 10 10 10 10 10 10 01 10 10
      84 84 84 84 84 84 84 84 84 84 10 10 10 10 10 10
      10 81 81 81 81 81 81 01 01 01 01 01 01 01 01 01
      01 01 01 01 01 01 01 01 01 01 01 10 10 10 10 10
      10 82 82 82 82 82 82 02 02 02 02 02 02 02 02 02
      02 02 02 02 02 02 02 02 02 02 02 10 10 10 10 20
      10 00 10 02 10 10 10 10 10 10 01 10 01 00 01 00
      00 10 10 10 10 10 10 10 10 10 02 10 02 00 02 01
      48 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
      10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
      01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
      01 01 01 01 01 01 01 10 01 01 01 01 01 01 01 02
      02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
      02 02 02 02 02 02 02 10 02 02 02 02 02 02 02 02
      </map>
      </ctype>
  4. Restart the server.

  5. To employ the new collation, include it in the definition of columns that are to use it:

      mysql> DROP TABLE IF EXISTS t1;
      Query OK, 0 rows affected (0.13 sec)
    
      mysql> CREATE TABLE t1 (
          a TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci,
          FULLTEXT INDEX(a)
          ) ENGINE=InnoDB;
      Query OK, 0 rows affected (0.47 sec)
  6. Test the collation to verify that hyphen is considered as a word character:

      mysql> INSERT INTO t1 VALUEs ('----'),('....'),('abcd');
      Query OK, 3 rows affected (0.22 sec)
      Records: 3  Duplicates: 0  Warnings: 0
    
      mysql> SELECT * FROM t1 WHERE MATCH a AGAINST ('----' IN BOOLEAN MODE);
      +------+
      | a    |
      +------+
      | ---- |
      +------+
      1 row in set (0.00 sec)

 File: manual.info.tmp, Node: fulltext-search-ngram, Next: fulltext-search-mecab, Prev: full-text-adding-collation, Up: fulltext-search

12.9.8 ngram Full-Text Parser

The built-in MySQL full-text parser uses the white space between words as a delimiter to determine where words begin and end, which is a limitation when working with ideographic languages that do not use word delimiters. To address this limitation, MySQL provides an ngram full-text parser that supports Chinese, Japanese, and Korean (CJK). The ngram full-text parser is supported for use with note 'InnoDB': innodb-storage-engine. and note 'MyISAM': myisam-storage-engine.

Note:

MySQL also provides a MeCab full-text parser plugin for Japanese, which tokenizes documents into meaningful words. For more information, see *note fulltext-search-mecab::.

An ngram is a contiguous sequence of N characters from a given sequence of text. The ngram parser tokenizes a sequence of text into a contiguous sequence of N characters. For example, you can tokenize 'abcd' for different values of N using the ngram full-text parser.

 n=1: 'a', 'b', 'c', 'd'
 n=2: 'ab', 'bc', 'cd'
 n=3: 'abc', 'bcd'
 n=4: 'abcd'

The ngram full-text parser is a built-in server plugin. As with other built-in server plugins, it is automatically loaded when the server is started.

The full-text search syntax described in *note fulltext-search:: applies to the ngram parser plugin. Differences in parsing behavior are described in this section. Full-text-related configuration options, except for minimum and maximum word length options ('innodb_ft_min_token_size', 'innodb_ft_max_token_size', 'ft_min_word_len', 'ft_max_word_len') are also applicable.

Configuring ngram Token Size

The ngram parser has a default ngram token size of 2 (bigram). For example, with a token size of 2, the ngram parser parses the string 'abc def' into four tokens: 'ab', 'bc', 'de' and 'ef'.

ngram token size is configurable using the 'ngram_token_size' configuration option, which has a minimum value of 1 and maximum value of 10.

Typically, 'ngram_token_size' is set to the size of the largest token that you want to search for. If you only intend to search for single characters, set 'ngram_token_size' to 1. A smaller token size produces a smaller full-text search index, and faster searches. If you need to search for words comprised of more than one character, set 'ngram_token_size' accordingly. For example, 'Happy Birthday' is '生日快乐' in simplified Chinese, where '生日' is 'birthday', and '快乐' translates as 'happy'. To search on two-character words such as these, set 'ngram_token_size' to a value of 2 or higher.

As a read-only variable, 'ngram_token_size' may only be set as part of a startup string or in a configuration file:

Note:

The following minimum and maximum word length configuration options are ignored for 'FULLTEXT' indexes that use the ngram parser: 'innodb_ft_min_token_size', 'innodb_ft_max_token_size', 'ft_min_word_len', and 'ft_max_word_len'.

Creating a FULLTEXT Index that Uses the ngram Parser

To create a 'FULLTEXT' index that uses the ngram parser, specify 'WITH PARSER ngram' with note 'CREATE TABLE': create-table, note 'ALTER TABLE': alter-table, or *note 'CREATE INDEX': create-index.

The following example demonstrates creating a table with an 'ngram' 'FULLTEXT' index, inserting sample data (Simplified Chinese text), and viewing tokenized data in the Information Schema *note 'INNODB_FT_INDEX_CACHE': information-schema-innodb-ft-index-cache-table. table.

 mysql> USE test;

 mysql> CREATE TABLE articles (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       title VARCHAR(200),
       body TEXT,
       FULLTEXT (title,body) WITH PARSER ngram
     ) ENGINE=InnoDB CHARACTER SET utf8mb4;

 mysql> SET NAMES utf8mb4;

 INSERT INTO articles (title,body) VALUES
     ('数据库管理','在本教程中我将向你展示如何管理数据库'),
     ('数据库应用开发','学习开发数据库应用程序');

 mysql> SET GLOBAL innodb_ft_aux_table="test/articles";

 mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;

To add a 'FULLTEXT' index to an existing table, you can use note 'ALTER TABLE': alter-table. or note 'CREATE INDEX': create-index. For example:

 CREATE TABLE articles (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       title VARCHAR(200),
       body TEXT
      ) ENGINE=InnoDB CHARACTER SET utf8;

 ALTER TABLE articles ADD FULLTEXT INDEX ft_index (title,body) WITH PARSER ngram;

 # Or:

 CREATE FULLTEXT INDEX ft_index ON articles (title,body) WITH PARSER ngram;

ngram Parser Space Handling

The ngram parser eliminates spaces when parsing. For example:

ngram Parser Stopword Handling

The built-in MySQL full-text parser compares words to entries in the stopword list. If a word is equal to an entry in the stopword list, the word is excluded from the index. For the ngram parser, stopword handling is performed differently. Instead of excluding tokens that are equal to entries in the stopword list, the ngram parser excludes tokens that contain stopwords. For example, assuming 'ngram_token_size=2', a document that contains 'a,b' is parsed to 'a,' and ',b'. If a comma (',') is defined as a stopword, both 'a,' and ',b' are excluded from the index because they contain a comma.

By default, the ngram parser uses the default stopword list, which contains a list of English stopwords. For a stopword list applicable to Chinese, Japanese, or Korean, you must create your own. For information about creating a stopword list, see *note fulltext-stopwords::.

Stopwords greater in length than 'ngram_token_size' are ignored.

ngram Parser Term Search

For natural language mode search, the search term is converted to a union of ngram terms. For example, the string 'abc' (assuming 'ngram_token_size=2') is converted to 'ab bc'. Given two documents, one containing 'ab' and the other containing 'abc', the search term 'ab bc' matches both documents.

For boolean mode search, the search term is converted to an ngram phrase search. For example, the string 'abc' (assuming 'ngram_token_size=2') is converted to ''ab bc''. Given two documents, one containing 'ab' and the other containing 'abc', the search phrase ''ab bc'' only matches the document containing 'abc'.

ngram Parser Wildcard Search

Because an ngram 'FULLTEXT' index contains only ngrams, and does not contain information about the beginning of terms, wildcard searches may return unexpected results. The following behaviors apply to wildcard searches using ngram 'FULLTEXT' search indexes:

ngram Parser Phrase Search

Phrase searches are converted to ngram phrase searches. For example, The search phrase 'abc' is converted to 'ab bc', which returns documents containing 'abc' and 'ab bc'.

The search phrase 'abc def' is converted to 'ab bc de ef', which returns documents containing 'abc def' and 'ab bc de ef'. A document that contains 'abcdef' is not returned.

 File: manual.info.tmp, Node: fulltext-search-mecab, Prev: fulltext-search-ngram, Up: fulltext-search

12.9.9 MeCab Full-Text Parser Plugin

The built-in MySQL full-text parser uses the white space between words as a delimiter to determine where words begin and end, which is a limitation when working with ideographic languages that do not use word delimiters. To address this limitation for Japanese, MySQL provides a MeCab full-text parser plugin. The MeCab full-text parser plugin is supported for use with note 'InnoDB': innodb-storage-engine. and note 'MyISAM': myisam-storage-engine.

Note:

MySQL also provides an ngram full-text parser plugin that supports Japanese. For more information, see *note fulltext-search-ngram::.

The MeCab full-text parser plugin is a full-text parser plugin for Japanese that tokenizes a sequence of text into meaningful words. For example, MeCab tokenizes 'データベース管理' ('Database Management') into 'データベース' ('Database') and '管理' ('Management'). By comparison, the *note ngram: fulltext-search-ngram. full-text parser tokenizes text into a contiguous sequence of N characters, where N represents a number between 1 and 10.

In addition to tokenizing text into meaningful words, MeCab indexes are typically smaller than ngram indexes, and MeCab full-text searches are generally faster. One drawback is that it may take longer for the MeCab full-text parser to tokenize documents, compared to the ngram full-text parser.

The full-text search syntax described in *note fulltext-search:: applies to the MeCab parser plugin. Differences in parsing behavior are described in this section. Full-text related configuration options are also applicable.

For additional information about the MeCab parser, refer to the MeCab: Yet Another Part-of-Speech and Morphological Analyzer (http://taku910.github.io/mecab/) project on Github.

Installing the MeCab Parser Plugin

The MeCab parser plugin requires 'mecab' and 'mecab-ipadic'.

On supported Fedora, Debian and Ubuntu platforms (except Ubuntu 12.04 where the system 'mecab' version is too old), MySQL dynamically links to the system 'mecab' installation if it is installed to the default location. On other supported Unix-like platforms, 'libmecab.so' is statically linked in 'libpluginmecab.so', which is located in the MySQL plugin directory. 'mecab-ipadic' is included in MySQL binaries and is located in 'MYSQL_HOME'.

You can install 'mecab' and 'mecab-ipadic' using a native package management utility (on Fedora, Debian, and Ubuntu), or you can build 'mecab' and 'mecab-ipadic' from source. For information about installing 'mecab' and 'mecab-ipadic' using a native package management utility, see Installing MeCab From a Binary Distribution (Optional). If you want to build 'mecab' and 'mecab-ipadic' from source, see Building MeCab From Source (Optional).

On Windows, 'libmecab.dll' is found in the MySQL 'bin' directory. 'mecab-ipadic' is located in 'MYSQL_HOME/lib/mecab'.

To install and configure the MeCab parser plugin, perform the following steps:

  1. In the MySQL configuration file, set the 'mecab_rc_file' configuration option to the location of the 'mecabrc' configuration file, which is the configuration file for MeCab. If you are using the MeCab package distributed with MySQL, the 'mecabrc' file is located in 'MYSQL_HOME/lib/mecab/etc/'.

      [mysqld]
      loose-mecab-rc-file=MYSQL_HOME/lib/mecab/etc/mecabrc

    The 'loose' prefix is an *note option modifier: option-modifiers. The 'mecab_rc_file' option is not recognized by MySQL until the MeCaB parser plugin is installed but it must be set before attempting to install the MeCaB parser plugin. The 'loose' prefix allows you restart MySQL without encountering an error due to an unrecognized variable.

    If you use your own MeCab installation, or build MeCab from source, the location of the 'mecabrc' configuration file may differ.

    For information about the MySQL configuration file and its location, see *note option-files::.

  2. Also in the MySQL configuration file, set the minimum token size to 1 or 2, which are the values recommended for use with the MeCab parser. For 'InnoDB' tables, minimum token size is defined by the 'innodb_ft_min_token_size' configuration option, which has a default value of 3. For 'MyISAM' tables, minimum token size is defined by 'ft_min_word_len', which has a default value of 4.

      [mysqld]
      innodb_ft_min_token_size=1
  3. Modify the 'mecabrc' configuration file to specify the dictionary you want to use. The 'mecab-ipadic' package distributed with MySQL binaries includes three dictionaries ('ipadic_euc-jp', 'ipadic_sjis', and 'ipadic_utf-8'). The 'mecabrc' configuration file packaged with MySQL contains and entry similar to the following:

      dicdir =  /path/to/mysql/lib/mecab/lib/mecab/dic/ipadic_euc-jp

    To use the 'ipadic_utf-8' dictionary, for example, modify the entry as follows:

      dicdir=MYSQL_HOME/lib/mecab/dic/ipadic_utf-8

    If you are using your own MeCab installation or have built MeCab from source, the default 'dicdir' entry in the 'mecabrc' file differs, as do the dictionaries and their location.

    Note:

    After the MeCab parser plugin is installed, you can use the 'mecab_charset' status variable to view the character set used with MeCab. The three MeCab dictionaries provided with the MySQL binary support the following character sets.

    * The 'ipadic_euc-jp' dictionary supports the 'ujis' and
      'eucjpms' character sets.
    
    * The 'ipadic_sjis' dictionary supports the 'sjis' and 'cp932'
      character sets.
    
    * The 'ipadic_utf-8' dictionary supports the 'utf8' and
      'utf8mb4' character sets.

    'mecab_charset' only reports the first supported character set. For example, the 'ipadic_utf-8' dictionary supports both 'utf8' and 'utf8mb4'. 'mecab_charset' always reports 'utf8' when this dictionary is in use.

  4. Restart MySQL.

  5. Install the MeCab parser plugin:

    The MeCab parser plugin is installed using note 'INSTALL PLUGIN': install-plugin. The plugin name is 'mecab', and the shared library name is 'libpluginmecab.so'. For additional information about installing plugins, see note plugin-loading::.

      INSTALL PLUGIN mecab SONAME 'libpluginmecab.so';

    Once installed, the MeCab parser plugin loads at every normal MySQL restart.

  6. Verify that the MeCab parser plugin is loaded using the *note 'SHOW PLUGINS': show-plugins. statement.

      mysql> SHOW PLUGINS;

    A 'mecab' plugin should appear in the list of plugins.

Creating a FULLTEXT Index that uses the MeCab Parser

To create a 'FULLTEXT' index that uses the mecab parser, specify 'WITH PARSER ngram' with note 'CREATE TABLE': create-table, note 'ALTER TABLE': alter-table, or *note 'CREATE INDEX': create-index.

This example demonstrates creating a table with a 'mecab' 'FULLTEXT' index, inserting sample data, and viewing tokenized data in the Information Schema *note 'INNODB_FT_INDEX_CACHE': information-schema-innodb-ft-index-cache-table. table:

 mysql> USE test;

 mysql> CREATE TABLE articles (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       title VARCHAR(200),
       body TEXT,
       FULLTEXT (title,body) WITH PARSER mecab
     ) ENGINE=InnoDB CHARACTER SET utf8;

 mysql> SET NAMES utf8;

 mysql> INSERT INTO articles (title,body) VALUES
     ('データベース管理','このチュートリアルでは、私はどのようにデータベースを管理する方法を紹介します'),
     ('データベースアプリケーション開発','データベースアプリケーションを開発することを学ぶ');

 mysql> SET GLOBAL innodb_ft_aux_table="test/articles";

 mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;

To add a 'FULLTEXT' index to an existing table, you can use note 'ALTER TABLE': alter-table. or note 'CREATE INDEX': create-index. For example:

 CREATE TABLE articles (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       title VARCHAR(200),
       body TEXT
      ) ENGINE=InnoDB CHARACTER SET utf8;

 ALTER TABLE articles ADD FULLTEXT INDEX ft_index (title,body) WITH PARSER mecab;

 # Or:

 CREATE FULLTEXT INDEX ft_index ON articles (title,body) WITH PARSER mecab;

MeCab Parser Space Handling

The MeCab parser uses spaces as separators in query strings. For example, the MeCab parser tokenizes データベース管理 as データベース and 管理.

MeCab Parser Stopword Handling

By default, the MeCab parser uses the default stopword list, which contains a short list of English stopwords. For a stopword list applicable to Japanese, you must create your own. For information about creating stopword lists, see *note fulltext-stopwords::.

MeCab Parser Term Search

For natural language mode search, the search term is converted to a union of tokens. For example, データベース管理 is converted to データベース 管理.

 SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース管理' IN NATURAL LANGUAGE MODE);

For boolean mode search, the search term is converted to a search phrase. For example, データベース管理 is converted to データベース 管理.

 SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース管理' IN BOOLEAN MODE);

MeCab Parser Wildcard Search

Wildcard search terms are not tokenized. A search on データベース管理* is performed on the prefix, データベース管理.

 SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース*' IN BOOLEAN MODE);

MeCab Parser Phrase Search

Phrases are tokenized. For example, データベース管理 is tokenized as データベース 管理.

 SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('"データベース管理"' IN BOOLEAN MODE);

Installing MeCab From a Binary Distribution (Optional)

This section describes how to install 'mecab' and 'mecab-ipadic' from a binary distribution using a native package management utility. For example, on Fedora, you can use Yum to perform the installation:

 yum mecab-devel

On Debian or Ubuntu, you can perform an APT installation:

 apt-get install mecab
 apt-get install mecab-ipadic

Installing MeCab From Source (Optional)

If you want to build 'mecab' and 'mecab-ipadic' from source, basic installation steps are provided below. For additional information, refer to the MeCab documentation.

  1. Download the tar.gz packages for 'mecab' and 'mecab-ipadic' from http://taku910.github.io/mecab/#download. As of February, 2016, the latest available packages are 'mecab-0.996.tar.gz' and 'mecab-ipadic-2.7.0-20070801.tar.gz'.

  2. Install 'mecab':

      tar zxfv mecab-0.996.tar
      cd mecab-0.996
      ./configure
      make
      make check
      su
      make install
  3. Install 'mecab-ipadic':

      tar zxfv mecab-ipadic-2.7.0-20070801.tar
      cd mecab-ipadic-2.7.0-20070801
      ./configure
      make
      su
      make install
  4. Compile MySQL using the 'WITH_MECAB' CMake option. Set the 'WITH_MECAB' option to 'system' if you have installed 'mecab' and 'mecab-ipadic' to the default location.

      -DWITH_MECAB=system

    If you defined a custom installation directory, set 'WITH_MECAB' to the custom directory. For example:

      -DWITH_MECAB=/path/to/mecab

 File: manual.info.tmp, Node: cast-functions, Next: xml-functions, Prev: fulltext-search, Up: functions