Integrating a Chinese Analyzer with Elasticsearch
- January 11, 2022
- Notes
- Elasticsearch
IK is a lightweight, dictionary-based Chinese word-segmentation toolkit that can be integrated into Elasticsearch through its plugin mechanism.
I. Integration Steps
1. Create an ik directory under the plugins directory of the Elasticsearch installation.
2. Download the matching version of the IK plugin from GitHub:
//github.com/medcl/elasticsearch-analysis-ik/releases/tag/v6.8.12
3. Unzip the plugin archive into that directory and restart Elasticsearch; the startup log shows that the ik plugin has been loaded:
[2022-01-11T15:22:54,341][INFO ][o.e.p.PluginsService ] [4EvvJl1] loaded plugin [analysis-ik]
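Put together, the three steps above might look like the following shell sketch. The release tag matches the page linked above, but ${ES_HOME}, the use of wget/unzip, and the exact asset file name are assumptions; check the release page for the artifact that matches your Elasticsearch version.
# assumed Elasticsearch home directory
cd ${ES_HOME}
mkdir -p plugins/ik
# download the release that matches your Elasticsearch version (asset name assumed)
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.8.12/elasticsearch-analysis-ik-6.8.12.zip
# unpack into the new plugin directory
unzip elasticsearch-analysis-ik-6.8.12.zip -d plugins/ik
# restart Elasticsearch afterwards so the plugin is loaded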
II. Trying Out the IK Analyzers
IK provides two analyzers: ik_smart and ik_max_word.
The ik_max_word analyzer segments text as exhaustively as possible, producing fairly fine-grained tokens.
POST _analyze
{
"analyzer": "ik_max_word",
"text":"這次出差我們住的是閆團如家快捷酒店"
}
{
"tokens" : [
{
"token" : "這次",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "出差",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "我們",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "住",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "的",
"start_offset" : 7,
"end_offset" : 8,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "是",
"start_offset" : 8,
"end_offset" : 9,
"type" : "CN_CHAR",
"position" : 5
},
{
"token" : "閆",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 6
},
{
"token" : "團",
"start_offset" : 10,
"end_offset" : 11,
"type" : "CN_CHAR",
"position" : 7
},
{
"token" : "如家",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 8
},
{
"token" : "快捷酒店",
"start_offset" : 13,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 9
}
]
}
ik_smart, by comparison, segments at a coarser granularity.
POST _analyze
{
"analyzer": "ik_smart",
"text":"這次出差我們住的是閆團如家快捷酒店"
}
{
"tokens" : [
{
"token" : "這次",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "出差",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "我們",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "住",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "的",
"start_offset" : 7,
"end_offset" : 8,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "是",
"start_offset" : 8,
"end_offset" : 9,
"type" : "CN_CHAR",
"position" : 5
},
{
"token" : "閆",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 6
},
{
"token" : "團",
"start_offset" : 10,
"end_offset" : 11,
"type" : "CN_CHAR",
"position" : 7
},
{
"token" : "如家",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 8
},
{
"token" : "快捷酒店",
"start_offset" : 13,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 9
}
]
}
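In practice the two analyzers are often paired in an index mapping, with ik_max_word applied at index time and ik_smart at query time. A minimal sketch, assuming Elasticsearch 7.x mapping syntax; the index name hotels and the field description are made up for illustration:
PUT /hotels
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}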
III. Extending the IK Dictionary
Because 閆團 is the name of a fairly small place, IK's dictionary does not contain it, so it is split into two single characters. We can add it to IK's dictionary ourselves.
Create a my.dic file in the config directory under IK's installation directory and put 閆團 in it; then edit IKAnalyzer.cfg.xml to register the new dictionary file (the file operations are sketched after the configuration below):
<properties>
<comment>IK Analyzer extended configuration</comment>
<!-- users can configure their own extension dictionaries here -->
<entry key="ext_dict">my.dic</entry>
<!-- users can configure their own extension stop-word dictionaries here -->
<entry key="ext_stopwords"></entry>
<!-- users can configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- users can configure a remote extension stop-word dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
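Creating the dictionary file itself is just a matter of writing one entry per line in UTF-8. A sketch, assuming the plugin was unzipped into plugins/ik as in section I (if it was installed with elasticsearch-plugin, the config directory may live under config/analysis-ik instead):
# assumed plugin location under the Elasticsearch home directory
cd ${ES_HOME}/plugins/ik/config
# one dictionary entry per line, UTF-8 encoded
echo "閆團" > my.dic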
Restart Elasticsearch and run the analysis again; the place name is now produced as a single token:
POST _analyze
{
"analyzer": "ik_smart",
"text":"這次出差我們住的是閆團如家快捷酒店"
}
{
"tokens" : [
{
"token" : "這次",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "出差",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "我們",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "住",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "的",
"start_offset" : 7,
"end_offset" : 8,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "是",
"start_offset" : 8,
"end_offset" : 9,
"type" : "CN_CHAR",
"position" : 5
},
{
"token" : "閆團",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "如家",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "快捷酒店",
"start_offset" : 13,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 8
}
]
}
IV. Trying the HanLP Analyzer and Custom Dictionaries
HanLP is a Java toolkit built from a collection of models and algorithms. Starting from Chinese word segmentation, it covers common NLP tasks such as part-of-speech tagging, named-entity recognition, syntactic parsing, and text classification. It offers a rich API and is widely used in search platforms such as Lucene, Solr, and Elasticsearch. As far as segmentation goes, it supports algorithms including shortest-path segmentation, N-shortest-path segmentation, and CRF segmentation.
Download the HanLP plugin package from the following address:
//github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.9.2/elasticsearch-analysis-hanlp-7.9.2.zip
Install the HanLP plugin package:
bin\elasticsearch-plugin install file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
-> Installing file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
-> Downloading file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: plugin requires additional permissions @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.io.FilePermission plugins/analysis-hanlp/data/-#plus read,write,delete
* java.io.FilePermission plugins/analysis-hanlp/hanlp.cache#plus read,write,delete
* java.lang.RuntimePermission getClassLoader
* java.lang.RuntimePermission setContextClassLoader
* java.net.SocketPermission * connect,resolve
* java.util.PropertyPermission * read,write
See //docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.
Continue with installation? [y/N]y
-> Installed analysis-hanlp
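Before restarting, you can double-check that the plugin was registered with the standard plugin listing command (the output will also include any other plugins you have installed, such as analysis-ik from section I):
bin\elasticsearch-plugin list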
Analyze the text with the hanlp_standard analyzer:
POST _analyze
{
"analyzer": "hanlp_standard",
"text":"這次出差我們住的是閆團如家快捷酒店"
}
{
"tokens" : [
{
"token" : "這次",
"start_offset" : 0,
"end_offset" : 2,
"type" : "r",
"position" : 0
},
{
"token" : "出差",
"start_offset" : 2,
"end_offset" : 4,
"type" : "vi",
"position" : 1
},
{
"token" : "我們",
"start_offset" : 4,
"end_offset" : 6,
"type" : "rr",
"position" : 2
},
{
"token" : "住",
"start_offset" : 6,
"end_offset" : 7,
"type" : "vi",
"position" : 3
},
{
"token" : "的",
"start_offset" : 7,
"end_offset" : 8,
"type" : "ude1",
"position" : 4
},
{
"token" : "是",
"start_offset" : 8,
"end_offset" : 9,
"type" : "vshi",
"position" : 5
},
{
"token" : "閆團",
"start_offset" : 9,
"end_offset" : 11,
"type" : "nr",
"position" : 6
},
{
"token" : "如家",
"start_offset" : 11,
"end_offset" : 13,
"type" : "r",
"position" : 7
},
{
"token" : "快捷酒店",
"start_offset" : 13,
"end_offset" : 17,
"type" : "ntch",
"position" : 8
}
]
}
We can see that HanLP recognizes 閆團 as a single word out of the box.
Running the following test, however, shows that HanLP does not treat 小地方 as a single token:
POST _analyze
{
"analyzer": "hanlp_standard",
"text":"閆團是一個小地方"
}
{
"tokens" : [
{
"token" : "閆團",
"start_offset" : 0,
"end_offset" : 2,
"type" : "nr",
"position" : 0
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "vshi",
"position" : 1
},
{
"token" : "一個",
"start_offset" : 3,
"end_offset" : 5,
"type" : "mq",
"position" : 2
},
{
"token" : "小",
"start_offset" : 5,
"end_offset" : 6,
"type" : "a",
"position" : 3
},
{
"token" : "地方",
"start_offset" : 6,
"end_offset" : 8,
"type" : "n",
"position" : 4
}
]
}
To customize the segmentation, create my.dic under ${ES_HOME}/plugins/analysis-hanlp/data/dictionary/custom and add 小地方 to it.
Then copy the hanlp.properties file from the plugin package to ${ES_HOME}/config/analysis-hanlp/hanlp.properties and add the new file to CustomDictionaryPath (the file operations are sketched after the property line below):
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; ModernChineseSupplementaryWord.txt; ChinesePlaceName.txt ns; PersonalName.txt; OrganizationName.txt; ShanghaiPlaceName.txt ns;data/dictionary/person/nrf.txt nrf;data/dictionary/custom/my.dic;
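Taken together, the file operations might look like the following sketch. The paths assume the default install locations used above, and the source path of hanlp.properties is a placeholder for wherever you unpacked the plugin zip:
# assumed Elasticsearch home directory
cd ${ES_HOME}
# custom dictionary: one word per line, UTF-8 encoded
mkdir -p plugins/analysis-hanlp/data/dictionary/custom
echo "小地方" > plugins/analysis-hanlp/data/dictionary/custom/my.dic
# copy hanlp.properties out of the plugin package, then edit CustomDictionaryPath as shown above
mkdir -p config/analysis-hanlp
cp /path/to/elasticsearch-analysis-hanlp-7.9.2/hanlp.properties config/analysis-hanlp/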
Restart Elasticsearch and run the test again:
POST _analyze
{
"analyzer": "hanlp",
"text":"閆團是一個小地方"
}
{
"tokens" : [
{
"token" : "閆團",
"start_offset" : 0,
"end_offset" : 2,
"type" : "nr",
"position" : 0
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "vshi",
"position" : 1
},
{
"token" : "一個",
"start_offset" : 3,
"end_offset" : 5,
"type" : "mq",
"position" : 2
},
{
"token" : "小地方",
"start_offset" : 5,
"end_offset" : 8,
"type" : "n",
"position" : 3
}
]
}