MySQL ngram Full-Text Parser

Summary: in this tutorial, you will learn how to use the MySQL ngram full-text parser to support full-text searches for ideographic languages such as Chinese, Japanese, and Korean.

Introduction to MySQL ngram full-text parser

The built-in MySQL full-text parser uses white space to determine where words begin and end. This is a limitation for ideographic languages such as Chinese, Japanese, and Korean, which do not use word delimiters.

To address this issue, MySQL provides the ngram full-text parser. Since version 5.7.6, MySQL has included the ngram full-text parser as a built-in server plugin, which means that MySQL loads it automatically when the database server starts. The ngram full-text parser is supported for both the InnoDB and MyISAM storage engines.

By definition, an ngram is a contiguous sequence of n characters from a given sequence of text. The main function of the ngram full-text parser is to tokenize a sequence of text into contiguous sequences of n characters.

The following illustrates how the ngram full-text parser tokenizes the text mysql for different values of n:

n = 1: 'm', 'y', 's', 'q', 'l'
n = 2: 'my', 'ys', 'sq', 'ql'
n = 3: 'mys', 'ysq', 'sql'
n = 4: 'mysq', 'ysql'
n = 5: 'mysql'
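To make the tokenization rule concrete, here is a minimal sketch that emulates it in plain SQL with a recursive common table expression. It assumes MySQL 8.0 or later (for WITH RECURSIVE). The user variables @txt and @n and the CTE name seq are illustrative only and are not part of the ngram parser itself.

```sql
-- Illustrative inputs: the text to split and the token size n.
SET @txt = 'mysql', @n = 2;

-- Generate every start position 1 .. (length - n + 1),
-- then cut an n-character substring at each position.
WITH RECURSIVE seq AS (
    SELECT 1 AS start_pos
    UNION ALL
    SELECT start_pos + 1
    FROM seq
    WHERE start_pos + 1 <= CHAR_LENGTH(@txt) - @n + 1
)
SELECT start_pos, SUBSTRING(@txt, start_pos, @n) AS ngram
FROM seq;
```

With @n = 2 this should list my, ys, sq, and ql, matching the n = 2 case above. Changing @n reproduces the other cases.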

Creating FULLTEXT indexes with ngram parser

To create a FULLTEXT index that uses the ngram full-text parser, you add the WITH PARSER ngram clause to the CREATE TABLE, ALTER TABLE, or CREATE INDEX statement.
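For a table that already exists, the clause can be attached with ALTER TABLE or CREATE INDEX. The sketch below assumes a hypothetical articles table with title and body columns and a hypothetical index name ft_ngram. Only one of the two statements is needed.

```sql
-- Option 1: add the index with ALTER TABLE.
ALTER TABLE articles
    ADD FULLTEXT INDEX ft_ngram (title, body) WITH PARSER ngram;

-- Option 2: add the same index with CREATE INDEX instead.
-- CREATE FULLTEXT INDEX ft_ngram ON articles (title, body) WITH PARSER ngram;
```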

Consider the following example.

First, create a new posts table and add the title and body columns to a FULLTEXT index that uses the ngram full-text parser:

DROP TABLE IF EXISTS posts;

CREATE TABLE posts (
    id INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255),
    body TEXT,
    FULLTEXT (title, body) WITH PARSER NGRAM
) ENGINE=INNODB CHARACTER SET UTF8MB4;

Second, use the SET NAMES statement to set the connection character set to utf8mb4:

SET NAMES utf8mb4;

Third, insert two new rows into the posts table:

INSERT INTO posts(title, body)
VALUES
    ('MySQL全文搜索', 'MySQL提供了具有许多好的功能的内置全文搜索'),
    ('MySQL教程', '学习MySQL快速,简单和有趣');

Fourth, to see how the ngram parser tokenizes the text, use the following statements (they assume that the posts table is in the test database):

SET GLOBAL innodb_ft_aux_table = "test/posts";

SELECT *
FROM information_schema.innodb_ft_index_cache
ORDER BY doc_id, position;

This query is useful for troubleshooting. For example, if a word does not appear in the search results, it may not have been indexed because it is a stopword or for some other reason.
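When troubleshooting a single term, it can help to narrow the listing to one token. The sketch below assumes the default token size of 2, that posts lives in a database named test, and that the account has the privileges needed to set global variables and read INFORMATION_SCHEMA.

```sql
-- Point the InnoDB full-text diagnostic tables at test.posts
-- (the database name test is an assumption).
SET GLOBAL innodb_ft_aux_table = 'test/posts';

-- Show where the bigram 搜索 was indexed, if it was indexed at all.
SELECT word, doc_id, position
FROM information_schema.innodb_ft_index_cache
WHERE word = '搜索'
ORDER BY doc_id, position;
```

An empty result suggests that the token is not in the index cache. Keep in mind that this table covers newly inserted rows that are still held in the full-text cache.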

Setting ngram token size

As shown in the previous example, the default ngram token size (n) is 2. To change the token size, use the ngram_token_size configuration option, which accepts a value between 1 and 10.

Note that a smaller token size produces a smaller full-text search index and faster searches. Typically, you set ngram_token_size to the size of the largest token that you want to search for.

Because ngram_token_size is a read-only variable, you can set its value only in one of two ways:

First, as a startup option on the command line:

mysqld --ngram_token_size=1

Second, in the configuration file:

[mysqld]
ngram_token_size=1
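After the server restarts, the setting can be checked from any client session. This is a small sketch that only reads the variable.

```sql
-- Both statements report the token size the server was started with.
SHOW VARIABLES LIKE 'ngram_token_size';
SELECT @@ngram_token_size;
```

Indexes created before the change were tokenized with the old size, so it is reasonable to assume that they need to be dropped and re-created before searches behave as expected.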

ngram parser phrase search

MySQL converts a phrase search into an ngram phrase search. For example, with a token size of 2, "abc" is converted into "ab bc", which returns documents that contain "abc" and "ab bc".

The following example shows how to search for the phrase 搜索 in the posts table:

SELECT id, title, body
FROM posts
WHERE MATCH (title, body) AGAINST ('搜索');

Processing search results with ngram

Natural language mode

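In NATURAL LANGUAGE MODE searches, the search term is converted to a union of ngram terms. For example, if the token size is 2 (bigram), the search term mysql is converted to my, ys, sq, and ql, and a document matches if it contains any of them. The following query searches in natural language mode: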

SELECT *
FROM posts
WHERE MATCH (title, body) AGAINST ('简单和有趣' IN NATURAL LANGUAGE MODE);

Boolean mode

In BOOLEAN MODE searches, the search term is converted to an ngram phrase search, so a document matches only if it contains the ngrams in sequence. For example:

SELECT *
FROM posts
WHERE MATCH (title, body) AGAINST ('简单和有趣' IN BOOLEAN MODE);
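The difference between the two modes is easier to see with a search term that does not occur as a whole in any row. The sketch below uses the illustrative term 全文教程 (chosen here, not taken from the sample data) and assumes the default token size of 2 and the two rows inserted earlier.

```sql
-- Natural language mode: union of the bigrams 全文, 文教, 教程.
SELECT id, title,
       MATCH (title, body) AGAINST ('全文教程' IN NATURAL LANGUAGE MODE) AS score
FROM posts
WHERE MATCH (title, body) AGAINST ('全文教程' IN NATURAL LANGUAGE MODE);

-- Boolean mode: the same term is treated as the phrase "全文 文教 教程".
SELECT id, title
FROM posts
WHERE MATCH (title, body) AGAINST ('全文教程' IN BOOLEAN MODE);
```

Following the rules described above, the first query would be expected to match both sample rows, because one contains 全文 and the other contains 教程. The second would be expected to match neither, because no row contains 全文教程 as a contiguous string.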


ngram wildcard search

An ngram FULLTEXT index contains only ngrams and no information about where terms begin. Therefore, wildcard searches may return unexpected results.

The following rules apply to wildcard searches that use ngram FULLTEXT indexes:

If the prefix term in the wildcard is shorter than the ngram token size, the query returns all documents that contain ngram tokens starting with the prefix term. For example, with the default token size of 2:

SELECT id, title, body
FROM posts
WHERE MATCH (title, body) AGAINST ('m*' IN BOOLEAN MODE);

If the prefix term in the wildcard is longer than the ngram token size, MySQL converts the prefix term into an ngram phrase and ignores the wildcard operator. See the following example:

SELECT id, title, body
FROM posts
WHERE MATCH (title, body) AGAINST ('mysqld*' IN BOOLEAN MODE);

In this example, with the default token size of 2, the term "mysqld" is converted into the ngram phrase "my ys sq ql ld" and the wildcard is ignored. Therefore, only documents that contain these ngrams in sequence, that is, the string mysqld, are returned.

Handling stopwords

The ngram parser excludes tokens that contain a stopword from the stopword list. For example, suppose ngram_token_size is 2 and the document contains "abc". The ngram parser tokenizes the document into "ab" and "bc". If "b" is a stopword, the parser excludes both "ab" and "bc" because they contain "b".

Note that you must define your own stopword list if the language is not English. In addition, stopwords that are longer than ngram_token_size are ignored.
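For InnoDB, a custom stopword list is an ordinary table with a single VARCHAR column named value, registered through the innodb_ft_server_stopword_table variable. The sketch below is illustrative: the table name my_stopwords, the database name test, and the choice of 的 as a stopword are assumptions.

```sql
-- Inspect the default (English) stopword list.
SELECT * FROM information_schema.innodb_ft_default_stopword;

-- Create a stopword table; InnoDB expects one VARCHAR column named value.
CREATE TABLE my_stopwords (value VARCHAR(30))
    ENGINE = INNODB CHARACTER SET utf8mb4;
INSERT INTO my_stopwords (value) VALUES ('的');

-- Register it in db_name/table_name form (test is an assumed database name).
SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';
```

Existing FULLTEXT indexes keep the tokens they were built with, so the index on posts would likely need to be re-created for the new list to take effect.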

In this tutorial, you have learned how to use the MySQL ngram full-text parser to handle full-text searches for ideographic languages.
