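ElasticSearch (7.2.2): Introduction to Analyzers and How to Use Them
- November 4, 2019
- Notes
Overview: what an analyzer is, and which analyzers are built in
What is an analyzer
- A tool that takes a piece of user-supplied text and breaks it into multiple terms according to a defined set of rules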
- example: The best 3-points shooter is Curry!
Common built-in analyzers
- standard analyzer
- simple analyzer
- whitespace analyzer
- stop analyzer
- language analyzer
- pattern analyzer
standard analyzer
- The standard analyzer is the default analyzer; it is used whenever no analyzer is specified.
- POST localhost:9200/_analyze
{ "analyzer": "standard", "text": "The best 3-points shooter is Curry!" }
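The same request can be issued from a script. The following is a minimal Python sketch, assuming a node listening on http://localhost:9200 without authentication; the helper names `build_analyze_request` and `analyze` are made up for illustration and are not part of any client library.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, host="http://localhost:9200"):
    """Build (but do not send) a POST request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, host="http://localhost:9200"):
    """Send the request and return only the token strings (needs a running node)."""
    with urllib.request.urlopen(build_analyze_request(analyzer, text, host)) as resp:
        payload = json.load(resp)
    # The response carries a "tokens" list; each entry has a "token" field.
    return [entry["token"] for entry in payload["tokens"]]
```

With a node running, `analyze("standard", "The best 3-points shooter is Curry!")` returns the terms produced by the standard analyzer; swapping the first argument is enough to try the other analyzers in these notes.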
simple analyzer
- The simple analyzer splits text into terms whenever it encounters a character that is not a letter, and it lowercases every term.
- POST localhost:9200/_analyze
{ "analyzer": "simple", "text": "The best 3-points shooter is Curry!" }
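To make the rule concrete, here is a rough pure-Python approximation of the simple analyzer. It is only a sketch of the behaviour described above (split on non-letters, then lowercase), not the Lucene implementation, and `simple_analyze` is a hypothetical name.

```python
import re


def simple_analyze(text):
    """Approximate the simple analyzer: split on non-letters, then lowercase."""
    # [^\W\d_]+ matches runs of letters (word characters minus digits and underscore).
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


print(simple_analyze("The best 3-points shooter is Curry!"))
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```

Note that the digit `3` disappears entirely, because digits are not letters.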
whitespace analyzer
- The whitespace analyzer splits text into terms whenever it encounters a whitespace character.
- POST localhost:9200/_analyze
{ "analyzer": "whitespace", "text": "The best 3-points shooter is Curry!" }
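The behaviour is close to Python's own `str.split()`. A minimal sketch, with the hypothetical helper name `whitespace_analyze`:

```python
def whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on runs of whitespace only."""
    # No lowercasing and no punctuation stripping takes place.
    return text.split()


print(whitespace_analyze("The best 3-points shooter is Curry!"))
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```

Case and punctuation are preserved, so `Curry!` stays a single term including the exclamation mark.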
stop analyzer
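- The stop analyzer is very similar to the simple analyzer; the only difference is that it also removes stop words, using the English stop word list by default
- Stop words are a predefined list of words to drop, such as the, a, an, this, of, at
- POST localhost:9200/_analyze
{ "analyzer": "stop", "text": "The best 3-points shooter is Curry!" }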
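The stop analyzer can be sketched as the simple analyzer followed by a filter step. The word set below is an illustrative list modeled on the default English stop words and should be treated as an assumption; `stop_analyze` is a hypothetical name.

```python
import re

# Illustrative stop word list (assumed, modeled on the default English list).
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def stop_analyze(text, stop_words=ENGLISH_STOP_WORDS):
    """Approximate the stop analyzer: letter tokens, lowercased, stop words removed."""
    tokens = [token.lower() for token in re.findall(r"[^\W\d_]+", text)]
    return [token for token in tokens if token not in stop_words]


print(stop_analyze("The best 3-points shooter is Curry!"))
# ['best', 'points', 'shooter', 'curry']
```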
language analyzer
- Analyzers for specific languages, for example english (the English analyzer). Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
- POST localhost:9200/_analyze
{ "analyzer": "english", "text": "The best 3-points shooter is Curry!" }
pattern analyzer
- Splits text into terms using a regular expression; the default pattern is \W+ (one or more non-word characters)
- POST localhost:9200/_analyze
{ "analyzer": "pattern", "text": "The best 3-points shooter is Curry!" }
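Splitting on the default pattern can be mimicked with Python's `re.split`. This is a sketch only: it assumes the default pattern `\W+` and that terms are lowercased by default, and `pattern_analyze` is a hypothetical name.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Approximate the pattern analyzer: split on a regex, optionally lowercase."""
    # re.split can yield empty strings at the edges; drop them.
    tokens = [token for token in re.split(pattern, text) if token]
    return [token.lower() for token in tokens] if lowercase else tokens


print(pattern_analyze("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

Unlike the simple analyzer, digits count as word characters here, so `3` survives as its own term.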
Choosing an analyzer
- PUT localhost:9200/my_index
{ "settings": { "analysis": { "analyzer": { "my_analyzer": { "type": "whitespace" } } } }, "mappings": { "properties": { "name": { "type": "text" }, "team_name": { "type": "text" }, "position": { "type": "text" }, "play_year": { "type": "long" }, "jerse_no": { "type": "keyword" }, "title": { "type": "text", "analyzer": "my_analyzer" } } } }
- PUT localhost:9200/my_index/_doc/1
{ "name": "Curry", "team_name": "Warriors", "position": "point guard", "play_year": 10, "jerse_no": "30", "title": "The best 3-points shooter is Curry!" }
- POST localhost:9200/my_index/_search
{ "query": { "match": { "title": "Curry!" } } }
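Why does searching for `Curry!` find the document? The `title` field is analyzed with `my_analyzer` (whitespace) at index time and, as long as no separate search analyzer is configured, at query time as well, so both sides produce the exact term `Curry!`. The following is a small simulation of that matching logic, not the actual scoring code; `whitespace_analyze` and `match` are hypothetical helpers.

```python
def whitespace_analyze(text):
    """Split on whitespace only, keeping case and punctuation."""
    return text.split()


def match(field_text, query_text, analyzer=whitespace_analyze):
    """Return True if any analyzed query term equals an indexed term."""
    indexed_terms = set(analyzer(field_text))
    return any(term in indexed_terms for term in analyzer(query_text))


title = "The best 3-points shooter is Curry!"
print(match(title, "Curry!"))  # True
print(match(title, "Curry"))   # False: the indexed term keeps the "!"
print(match(title, "curry!"))  # False: whitespace analysis is case-sensitive
```

This is the trade-off of choosing the whitespace analyzer for `title`: queries must reproduce case and punctuation exactly, whereas the standard analyzer would let `curry` match.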