Installing the IK Chinese Analyzer Plugin for Elasticsearch
- October 6, 2019
- Notes
Environment: Elasticsearch 5.5.2, installed under /usr/local/elasticsearch-5.5.2.
- Download

Extract the plugin archive into /usr/local/elasticsearch-5.5.2/plugins/. The resulting directory structure looks like this:
```
├── plugins
│   └── elasticsearch-analysis-ik
│       ├── commons-codec-1.9.jar
│       ├── commons-logging-1.2.jar
│       ├── config
│       │   ├── extra_main.dic
│       │   ├── extra_single_word.dic
│       │   ├── extra_single_word_full.dic
│       │   ├── extra_single_word_low_freq.dic
│       │   ├── extra_stopword.dic
│       │   ├── IKAnalyzer.cfg.xml
│       │   ├── main.dic
│       │   ├── preposition.dic
│       │   ├── quantifier.dic
│       │   ├── stopword.dic
│       │   ├── suffix.dic
│       │   └── surname.dic
│       ├── elasticsearch-analysis-ik-5.5.2.jar
│       ├── httpclient-4.5.2.jar
│       ├── httpcore-4.4.4.jar
│       └── plugin-descriptor.properties
```
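The `config` directory above holds IK's built-in dictionaries. Custom words and stopwords can be added by creating `.dic` files (UTF-8, one word per line) and referencing them in `IKAnalyzer.cfg.xml`; the sketch below follows the format shipped with the plugin, but the `custom/...` file names are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- placeholder paths, relative to the config directory -->
    <entry key="ext_dict">custom/mydict.dic</entry>
    <entry key="ext_stopwords">custom/my_stopword.dic</entry>
</properties>
```

Elasticsearch must be restarted for dictionary changes to be picked up.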
- Restart Elasticsearch
- Test

Check the tokenization results with the two analyzers below.

The ik_max_word analyzer (exhaustive segmentation):
```
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}
```

Result:

```
{
  "tokens": [
    { "token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 },
    { "token": "中华人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "中华", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2 },
    { "token": "华人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3 },
    { "token": "人民共和国", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4 },
    { "token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5 },
    { "token": "共和国", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6 },
    { "token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7 },
    { "token": "国", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8 },
    { "token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 9 }
  ]
}
```
The ik_smart analyzer (coarse-grained segmentation):

```
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}
```

Result:

```
{
  "tokens": [
    { "token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 },
    { "token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 1 }
  ]
}
```
Modify the definition of text-type fields in the mapping so they use the IK analyzer (note that `include_in_all` takes a boolean, not a string):

```
...
"title": {
  "type": "text",
  "analyzer": "ik_max_word",
  "search_analyzer": "ik_max_word",
  "include_in_all": true
},
...
```
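For context, a complete index-creation request carrying this mapping might look like the following sketch; the index name `news` and mapping type `article` are made up for illustration (Elasticsearch 5.x still requires a mapping type):

```
PUT /news
{
  "mappings": {
    "article": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}
```

The analyzer of an existing text field cannot be changed in place, which is why new data must go into a freshly created index.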
If the index already holds a large amount of data, it must be reindexed for the new analyzer to take effect.
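One way to do this is the `_reindex` API: create a new index with the IK mapping first, then copy the documents over. The index names `news_old` and `news_new` below are placeholders:

```
POST _reindex
{
  "source": { "index": "news_old" },
  "dest":   { "index": "news_new" }
}
```

After the copy completes, searches can be pointed at the new index (or an alias can be switched over).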
Reference: https://github.com/medcl/elasticsearch-analysis-ik