Integrating Chinese Analyzers with Elasticsearch

IK is a lightweight, dictionary-based Chinese word-segmentation toolkit that can be integrated via Elasticsearch's plugin mechanism.

I. Integration Steps

1. Create an ik directory under the plugins directory of the Elasticsearch installation;

2. Download the IK plugin release matching your Elasticsearch version from GitHub:

https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v6.8.12

3. Unzip the plugin archive into that directory and restart Elasticsearch; the following log entry shows that the ik plugin has been loaded:

[2022-01-11T15:22:54,341][INFO ][o.e.p.PluginsService     ] [4EvvJl1] loaded plugin [analysis-ik]
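The three steps above can be sketched as shell commands. This is only a sketch: the `ES_HOME` variable and the exact release asset name are assumptions based on the steps described, so adjust them to your installation.

```shell
# ES_HOME is assumed to point at the Elasticsearch install directory
cd "$ES_HOME"
mkdir -p plugins/ik

# download the IK release matching your Elasticsearch version (asset name is an assumption)
curl -L -o ik.zip \
  "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.8.12/elasticsearch-analysis-ik-6.8.12.zip"

# unpack into the plugin directory, then restart Elasticsearch
unzip ik.zip -d plugins/ik
bin/elasticsearch
```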

II. Trying Out the IK Analyzers

IK provides two analyzers: ik_smart and ik_max_word.

The ik_max_word analyzer segments the text as exhaustively as possible, producing fairly fine-grained tokens:

POST _analyze
{
  "analyzer": "ik_max_word",
  "text":"這次出差我們住的是閆團如家快捷酒店"
}


{
  "tokens" : [
    {
      "token" : "這次",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "出差",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "我們",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "住",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "是",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "閆",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 6
    },
    {
      "token" : "團",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_CHAR",
      "position" : 7
    },
    {
      "token" : "如家",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "快捷酒店",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 9
    }
  ]
}



ik_smart, by contrast, produces coarser-grained tokens:

POST _analyze
{
  "analyzer": "ik_smart",
  "text":"這次出差我們住的是閆團如家快捷酒店"
}

{
  "tokens" : [
    {
      "token" : "這次",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "出差",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "我們",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "住",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "是",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "閆",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 6
    },
    {
      "token" : "團",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_CHAR",
      "position" : 7
    },
    {
      "token" : "如家",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "快捷酒店",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 9
    }
  ]
}
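For this particular sentence the two analyzers happen to agree; in general, ik_max_word emits every dictionary word it finds at every position, while ik_smart picks a single non-overlapping segmentation. The toy Python sketch below illustrates the two granularities. It is not IK's real algorithm, and the mini dictionary is made up for the example:

```python
# Toy illustration of the two IK granularities (NOT IK's actual algorithm;
# the mini dictionary below is invented for this sketch).
DICT = {"這次", "出差", "我們", "如家", "快捷酒店", "快捷", "酒店"}
MAX_LEN = max(len(w) for w in DICT)

def max_word(text):
    """Fine-grained: emit every dictionary word found at every position."""
    tokens = []
    for i in range(len(text)):
        for j in range(i + 1, min(i + MAX_LEN, len(text)) + 1):
            if text[i:j] in DICT:
                tokens.append(text[i:j])
    return tokens

def smart(text):
    """Coarse-grained: greedy forward longest match, one segmentation."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(i + MAX_LEN, len(text)), i, -1):  # longest first
            if text[i:j] in DICT:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown char falls through as a single token
            i += 1
    return tokens

print(max_word("快捷酒店"))  # ['快捷', '快捷酒店', '酒店']
print(smart("快捷酒店"))     # ['快捷酒店']
print(smart("閆團"))         # ['閆', '團'] — not in the dictionary
```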

III. Extending the IK Dictionary

Because 閆團 is a rather small place, it is not in IK's dictionary, and so it gets split into two single characters. We can add it to IK's dictionary ourselves.

Create a my.dic file under the config directory of the IK installation and put 閆團 in it. Then edit IKAnalyzer.cfg.xml to register the new dictionary file:

<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- users can configure their own extension dictionaries here -->
	<entry key="ext_dict">my.dic</entry>
	<!-- users can configure their own extension stopword dictionaries here -->
	<entry key="ext_stopwords"></entry>
	<!-- users can configure remote extension dictionaries here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- users can configure remote extension stopword dictionaries here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
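The dictionary file itself is plain UTF-8 text with one entry per line. A minimal sketch of creating it (a temp directory stands in for IK's config directory here):

```shell
cd "$(mktemp -d)"           # stand-in for the IK config directory in this sketch
printf '閆團\n' > my.dic     # one word per line, UTF-8 encoded
cat my.dic
```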

Restart Elasticsearch and run the analysis again; the place name is now emitted as a single token:

POST _analyze
{
  "analyzer": "ik_smart",
  "text":"這次出差我們住的是閆團如家快捷酒店"
}

{
  "tokens" : [
    {
      "token" : "這次",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "出差",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "我們",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "住",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "是",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "閆團",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "如家",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "快捷酒店",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 8
    }
  ]
}

IV. Trying Out the HanLP Analyzer and Custom Dictionaries

HanLP is a Java toolkit built from a collection of models and algorithms. Starting from Chinese word segmentation, it covers common NLP tasks such as part-of-speech tagging, named entity recognition, syntactic parsing, and text classification. It offers a rich API and is widely used in search platforms such as Lucene, Solr, and Elasticsearch. For segmentation specifically, it supports algorithms including shortest-path, N-shortest-path, and CRF segmentation.

Download the HanLP plugin package from the following address:

https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.9.2/elasticsearch-analysis-hanlp-7.9.2.zip

Install the HanLP plugin package:

bin\elasticsearch-plugin install file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
-> Installing file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
-> Downloading file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.io.FilePermission plugins/analysis-hanlp/data/-#plus read,write,delete
* java.io.FilePermission plugins/analysis-hanlp/hanlp.cache#plus read,write,delete
* java.lang.RuntimePermission getClassLoader
* java.lang.RuntimePermission setContextClassLoader
* java.net.SocketPermission * connect,resolve
* java.util.PropertyPermission * read,write
See //docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
-> Installed analysis-hanlp
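A quick way to confirm the plugin is registered is to list the installed plugins; the installed plugin names (here, analysis-hanlp) are printed one per line:

```shell
bin/elasticsearch-plugin list
```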

Analyze the text with the hanlp_standard analyzer:

POST _analyze
{
  "analyzer": "hanlp_standard",
  "text":"這次出差我們住的是閆團如家快捷酒店"
}

{
  "tokens" : [
    {
      "token" : "這次",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "r",
      "position" : 0
    },
    {
      "token" : "出差",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "vi",
      "position" : 1
    },
    {
      "token" : "我們",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "rr",
      "position" : 2
    },
    {
      "token" : "住",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "vi",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "ude1",
      "position" : 4
    },
    {
      "token" : "是",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "vshi",
      "position" : 5
    },
    {
      "token" : "閆團",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "nr",
      "position" : 6
    },
    {
      "token" : "如家",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "r",
      "position" : 7
    },
    {
      "token" : "快捷酒店",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "ntch",
      "position" : 8
    }
  ]
}

We can see that HanLP recognizes 閆團 as a single token out of the box.

Running the following test shows that HanLP does not treat 小地方 as a single token:

POST _analyze
{
  "analyzer": "hanlp_standard",
  "text":"閆團是一個小地方"
}

{
  "tokens" : [
    {
      "token" : "閆團",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "nr",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "vshi",
      "position" : 1
    },
    {
      "token" : "一個",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "mq",
      "position" : 2
    },
    {
      "token" : "小",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "a",
      "position" : 3
    },
    {
      "token" : "地方",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "n",
      "position" : 4
    }
  ]
}

To customize the segmentation, create my.dic under ${ES_HOME}/plugins/analysis-hanlp/data/dictionary/custom and add 小地方 to it.

Then copy the hanlp.properties file from the plugin package to ${ES_HOME}/config/analysis-hanlp/hanlp.properties and modify CustomDictionaryPath to include the new file:

CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; ModernChineseSupplementaryWord.txt; ChinesePlaceName.txt ns; PersonalName.txt; OrganizationName.txt; ShanghaiPlaceName.txt ns;data/dictionary/person/nrf.txt nrf;data/dictionary/custom/my.dic;
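Note the format above: each dictionary path may be followed by a default part-of-speech tag (for example `ns` for place names). Entries in my.dic itself are one word per line, optionally followed by a POS tag and a frequency. A sketch of what the file might contain; the `ns` tag and `1024` frequency are illustrative values, not taken from the original setup:

```
小地方
閆團 ns 1024
```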

Restart Elasticsearch and run the test again:

POST _analyze
{
  "analyzer": "hanlp",
  "text":"閆團是一個小地方"
}

{
  "tokens" : [
    {
      "token" : "閆團",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "nr",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "vshi",
      "position" : 1
    },
    {
      "token" : "一個",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "mq",
      "position" : 2
    },
    {
      "token" : "小地方",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "n",
      "position" : 3
    }
  ]
}