Pinyin Analysis for Elasticsearch
This Pinyin Analysis plugin converts between Chinese characters and Pinyin. It integrates the NLP tools from nlp-lang (https://github.com/NLPchina/nlp-lang).
--------------------------------------------------
| Pinyin Analysis Plugin | Elasticsearch |
--------------------------------------------------
| master | 7.x -> master |
--------------------------------------------------
| 6.x | 6.x |
--------------------------------------------------
| 5.x | 5.x |
--------------------------------------------------
| 1.8.1 | 2.4.1 |
--------------------------------------------------
| 1.7.5 | 2.3.5 |
--------------------------------------------------
| 1.6.1 | 2.2.1 |
--------------------------------------------------
| 1.5.0 | 2.1.0 |
--------------------------------------------------
| 1.4.0 | 2.0.x |
--------------------------------------------------
| 1.3.0 | 1.6.x |
--------------------------------------------------
| 1.2.2 | 1.0.x |
--------------------------------------------------
The plugin provides an analyzer, a tokenizer, and a token filter, all named pinyin.
Optional Parameters
- keep_first_letter: when enabled, the first letters of the pinyin syllables are kept as one term, e.g. 刘德华>ldh. Default: true.
- keep_separate_first_letter: when enabled, the first letters are kept as separate terms, e.g. 刘德华>l,d,h. Default: false. NOTE: query results may be too fuzzy because these terms are very frequent.
- limit_first_letter_length: maximum length of the first-letter result. Default: 16.
- keep_full_pinyin: when enabled, the full pinyin of each character is kept, e.g. 刘德华>[liu,de,hua]. Default: true.
- keep_joined_full_pinyin: when enabled, the full pinyin is kept as one joined term, e.g. 刘德华>[liudehua]. Default: false.
- keep_none_chinese: keep non-Chinese letters and numbers in the result. Default: true.
- keep_none_chinese_together: keep non-Chinese letters together. Default: true, e.g. DJ音乐家->DJ,yin,yue,jia. When set to false, e.g. DJ音乐家->D,J,yin,yue,jia. NOTE: keep_none_chinese should be enabled first.
- keep_none_chinese_in_first_letter: keep non-Chinese letters in the first-letter term, e.g. 刘德华AT2016->ldhat2016. Default: true.
- keep_none_chinese_in_joined_full_pinyin: keep non-Chinese letters in the joined full pinyin, e.g. 刘德华2016->liudehua2016. Default: false.
- none_chinese_pinyin_tokenize: break non-Chinese letters into separate pinyin terms if they are pinyin, e.g. liudehuaalibaba13zhuanghan->liu,de,hua,a,li,ba,ba,13,zhuang,han. Default: true. NOTE: keep_none_chinese and keep_none_chinese_together should be enabled first.
- keep_original: when enabled, the original input is kept as well. Default: false.
- lowercase: lowercase non-Chinese letters. Default: true.
- trim_whitespace: Default: true.
- remove_duplicated_term: when enabled, duplicated terms are removed to save index space, e.g. de的>de. Default: false. NOTE: position-related queries may be affected.
- ignore_pinyin_offset: since 6.0, offsets are strictly constrained and overlapping tokens are not allowed. With this parameter enabled, overlapping tokens are allowed because offsets are ignored. Note that all position-related queries and highlighting then become incorrect; use multi-fields with different settings for different query purposes. If you need offsets, set it to false. Default: true.
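To make the interaction of the main options concrete, here is a toy Python model of the term lists described above. It is only a sketch: `TOY_PINYIN` and `pinyin_terms` are hypothetical names, the three-entry lookup table stands in for the plugin's real dictionary, and the order of the returned terms is not claimed to match the plugin's output.

```python
# Toy illustration only: a hand-written lookup table stands in for the
# plugin's dictionary, and pinyin_terms is a hypothetical helper.
TOY_PINYIN = {"刘": "liu", "德": "de", "华": "hua"}


def pinyin_terms(text, keep_first_letter=True, keep_separate_first_letter=False,
                 keep_full_pinyin=True, keep_joined_full_pinyin=False,
                 keep_original=False, limit_first_letter_length=16):
    """Return the terms the documented options describe for `text`."""
    syllables = [TOY_PINYIN[ch] for ch in text]  # only handles known characters
    terms = []
    if keep_full_pinyin:
        terms.extend(syllables)                           # liu, de, hua
    if keep_joined_full_pinyin:
        terms.append("".join(syllables))                  # liudehua
    if keep_original:
        terms.append(text)                                # 刘德华
    if keep_first_letter:
        first = "".join(s[0] for s in syllables)
        terms.append(first[:limit_first_letter_length])   # ldh
    if keep_separate_first_letter:
        terms.extend(s[0] for s in syllables)             # l, d, h
    return terms


print(pinyin_terms("刘德华"))
print(pinyin_terms("刘德华", keep_full_pinyin=False, keep_joined_full_pinyin=True))
```

With the documented defaults the first call yields liu, de, hua and ldh; the second swaps the per-character terms for the single joined term liudehua.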
1.Create an index with a custom pinyin analyzer
PUT /medcl/
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "pinyin_analyzer" : {
          "tokenizer" : "my_pinyin"
        }
      },
      "tokenizer" : {
        "my_pinyin" : {
          "type" : "pinyin",
          "keep_separate_first_letter" : false,
          "keep_full_pinyin" : true,
          "keep_original" : true,
          "limit_first_letter_length" : 16,
          "lowercase" : true,
          "remove_duplicated_term" : true
        }
      }
    }
  }
}
2.Test the analyzer by analyzing a Chinese name, such as 刘德华
GET /medcl/_analyze
{
  "text": ["刘德华"],
  "analyzer": "pinyin_analyzer"
}
{
  "tokens" : [
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hua",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "刘德华",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 4
    }
  ]
}
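To see which terms cover which part of the input, the response above can be grouped by offset range. The following Python sketch does this on a copy of that response, trimmed to the fields it needs; `tokens_by_span` is a hypothetical helper name, not part of the plugin or of any Elasticsearch client.

```python
import json
from collections import defaultdict

# Tokens copied from the _analyze response above, trimmed to the fields used here.
RESPONSE = json.loads("""
{"tokens": [
  {"token": "liu",    "start_offset": 0, "end_offset": 1, "position": 0},
  {"token": "de",     "start_offset": 1, "end_offset": 2, "position": 1},
  {"token": "hua",    "start_offset": 2, "end_offset": 3, "position": 2},
  {"token": "刘德华", "start_offset": 0, "end_offset": 3, "position": 3},
  {"token": "ldh",    "start_offset": 0, "end_offset": 3, "position": 4}
]}
""")


def tokens_by_span(response):
    """Map each (start_offset, end_offset) pair to the tokens that cover it."""
    spans = defaultdict(list)
    for tok in response["tokens"]:
        spans[(tok["start_offset"], tok["end_offset"])].append(tok["token"])
    return dict(spans)


for span, tokens in sorted(tokens_by_span(RESPONSE).items()):
    print(span, tokens)
```

The per-character terms each cover one character, while the original text and the first-letter term ldh both cover the whole range 0 to 3.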
3.Create mapping
POST /medcl/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "fields": {
        "pinyin": {
          "type": "text",
          "store": false,
          "term_vector": "with_offsets",
          "analyzer": "pinyin_analyzer",
          "boost": 10
        }
      }
    }
  }
}
4.Indexing
POST /medcl/_create/andy
{"name":"刘德华"}
5.Let's search
curl "http://localhost:9200/medcl/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E"
curl "http://localhost:9200/medcl/_search?q=name.pinyin:%e5%88%98%e5%be%b7"
curl "http://localhost:9200/medcl/_search?q=name.pinyin:liu"
curl "http://localhost:9200/medcl/_search?q=name.pinyin:ldh"
curl "http://localhost:9200/medcl/_search?q=name.pinyin:de+hua"
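The percent-encoded values in the search URLs above are the UTF-8 bytes of the Chinese query text. A small Python sketch of how such a URL could be built with the standard library follows; `search_url` is a hypothetical helper, and the host and index are taken from the examples in this README.

```python
from urllib.parse import quote_plus

# Host and index as used in the examples above.
BASE = "http://localhost:9200/medcl/_search"


def search_url(field, value):
    """Build a URI-search URL with the query value percent-encoded as UTF-8."""
    # quote_plus encodes non-ASCII characters as %XX and spaces as '+'.
    return f"{BASE}?q={field}:{quote_plus(value)}"


print(search_url("name", "刘德华"))
print(search_url("name.pinyin", "刘德"))
print(search_url("name.pinyin", "de hua"))
```

The first call reproduces the `%E5%88%98%E5%BE%B7%E5%8D%8E` form used in the first search, and the last one produces `de+hua`.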
6.Using the pinyin token filter
PUT /medcl1/
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "user_name_analyzer" : {
          "tokenizer" : "whitespace",
          "filter" : "pinyin_first_letter_and_full_pinyin_filter"
        }
      },
      "filter" : {
        "pinyin_first_letter_and_full_pinyin_filter" : {
          "type" : "pinyin",
          "keep_first_letter" : true,
          "keep_full_pinyin" : false,
          "keep_none_chinese" : true,
          "keep_original" : false,
          "limit_first_letter_length" : 16,
          "lowercase" : true,
          "trim_whitespace" : true,
          "keep_none_chinese_in_first_letter" : true
        }
      }
    }
  }
}
Token test: 刘德华 张学友 郭富城 黎明 四大天王
GET /medcl1/_analyze
{
  "text": ["刘德华 张学友 郭富城 黎明 四大天王"],
  "analyzer": "user_name_analyzer"
}
{
  "tokens" : [
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zxy",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "gfc",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "lm",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "sdtw",
      "start_offset" : 15,
      "end_offset" : 19,
      "type" : "word",
      "position" : 4
    }
  ]
}
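The offsets in this response refer to the original whitespace-separated names, not to the shorter first-letter terms. The Python sketch below recomputes those ranges by splitting the test string on whitespace; it only mimics what a whitespace tokenizer reports for this input and is not the tokenizer's implementation. Python string indexes count code points, which coincide with the reported offsets here because every character is in the Basic Multilingual Plane.

```python
import re

text = "刘德华 张学友 郭富城 黎明 四大天王"

# Each run of non-whitespace characters becomes one token; offsets are
# character indexes into the original string (the end index is exclusive).
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

for token, start, end in spans:
    print(token, start, end)
```

The ranges printed for the five names match the start_offset and end_offset values of ldh, zxy, gfc, lm and sdtw in the response above.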
7.Use in phrase queries
- option 1
PUT /medcl2/
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "pinyin_analyzer" : {
          "tokenizer" : "my_pinyin"
        }
      },
      "tokenizer" : {
        "my_pinyin" : {
          "type" : "pinyin",
          "keep_first_letter" : false,
          "keep_separate_first_letter" : false,
          "keep_full_pinyin" : true,
          "keep_original" : false,
          "limit_first_letter_length" : 16,
          "lowercase" : true
        }
      }
    }
  }
}
GET /medcl2/_search
{
  "query": {
    "match_phrase": {
      "name.pinyin": "刘德华"
    }
  }
}
- option 2
PUT /medcl3/
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "pinyin_analyzer" : {
          "tokenizer" : "my_pinyin"
        }
      },
      "tokenizer" : {
        "my_pinyin" : {
          "type" : "pinyin",
          "keep_first_letter" : true,
          "keep_separate_first_letter" : true,
          "keep_full_pinyin" : true,
          "keep_original" : false,
          "limit_first_letter_length" : 16,
          "lowercase" : true
        }
      }
    }
  }
}
POST /medcl3/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "fields": {
        "pinyin": {
          "type": "text",
          "store": false,
          "term_vector": "with_offsets",
          "analyzer": "pinyin_analyzer",
          "boost": 10
        }
      }
    }
  }
}
GET /medcl3/_analyze
{
  "text": ["刘德华"],
  "analyzer": "pinyin_analyzer"
}
POST /medcl3/_create/andy
{"name":"刘德华"}
GET /medcl3/_search
{
  "query": {
    "match_phrase": {
      "name.pinyin": "刘德h"
    }
  }
}
GET /medcl3/_search
{
  "query": {
    "match_phrase": {
      "name.pinyin": "刘dh"
    }
  }
}
GET /medcl3/_search
{
  "query": {
    "match_phrase": {
      "name.pinyin": "liudh"
    }
  }
}
GET /medcl3/_search
{
  "query": {
    "match_phrase": {
      "name.pinyin": "liudeh"
    }
  }
}
GET /medcl3/_search
{
  "query": {
    "match_phrase": {
      "name.pinyin": "liude华"
    }
  }
}
8.That's all, have fun.