ElasticSearch (7.2.2): An Introduction to Analyzers and How to Use Them

  • November 4, 2019
  • Notes

Overview: what an analyzer is, and which analyzers are built in

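What is an analyzer
  • A tool that takes a piece of text entered by the user and breaks it, according to defined rules, into a series of terms
  • Example: The best 3-points shooter is Curry!
Commonly used built-in analyzers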
  • standard analyzer
  • simple analyzer
  • whitespace analyzer
  • stop analyzer
  • language analyzer
  • pattern analyzer
standard analyzer
  • The standard analyzer is the default analyzer; it is used whenever no analyzer is specified.
  • POST localhost:9200/_analyze
{
  "analyzer": "standard",
  "text": "The best 3-points shooter is Curry!"
}
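The _analyze requests in these notes can also be sent from a script. Below is a minimal Python sketch using only the standard library. The function names build_analyze_request and analyze are made up for this example, and the base URL assumes an unsecured local node on port 9200, as in the requests above.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) a POST request for the _analyze API.

    build_analyze_request is a hypothetical helper name, not part of any client library.
    """
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url="http://localhost:9200"):
    """Send the request and return just the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # The response carries a "tokens" array; each entry has a "token" field.
    return [entry["token"] for entry in payload["tokens"]]


# Usage (needs a running Elasticsearch node at base_url):
#   analyze("standard", "The best 3-points shooter is Curry!")
```

Swapping the first argument for "simple", "whitespace", "stop", "english" or "pattern" would reproduce the other requests in these notes.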
simple analyzer
  • The simple analyzer splits the text into terms whenever it encounters a character that is not a letter, and it lowercases every term.
  • POST localhost:9200/_analyze
{
  "analyzer": "simple",
  "text": "The best 3-points shooter is Curry!"
}
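To get a feel for the rule without a running cluster, the behavior described above can be approximated in a few lines of Python. This is only a rough sketch of the idea (split on anything that is not a letter, then lowercase), not Elasticsearch's actual implementation. The name simple_analyze is made up for this example.

```python
import re


def simple_analyze(text):
    """Approximate the simple analyzer: keep runs of letters, lowercase them."""
    # [^\W\d_] matches a word character that is neither a digit nor an underscore,
    # i.e. a letter. Digits and punctuation therefore act as separators.
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


print(simple_analyze("The best 3-points shooter is Curry!"))
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```

Note that the digit in "3-points" disappears, because digits are not letters.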
whitespace analyzer
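  • The whitespace analyzer splits the text into terms whenever it encounters a whitespace character. It does not lowercase the terms or strip punctuation.
  • POST localhost:9200/_analyze
{
  "analyzer": "whitespace",
  "text": "The best 3-points shooter is Curry!"
}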
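The whitespace rule is simple enough to approximate with Python's built-in str.split. This is a sketch for illustration only, and whitespace_analyze is a made-up name.

```python
def whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on runs of whitespace only."""
    # str.split() with no arguments splits on any whitespace and drops empty strings.
    # Case and punctuation are left untouched.
    return text.split()


print(whitespace_analyze("The best 3-points shooter is Curry!"))
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```

The capital letters and the trailing "!" survive, which matters later when searching a field that uses this analyzer.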
stop analyzer
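  • The stop analyzer is much like the simple analyzer. The only difference is that it also removes stop words; by default it uses the English stop word list.
  • Stop words are a predefined list of common words, such as the, a, an, this, of, at, and so on.
  • POST localhost:9200/_analyze
{
  "analyzer": "stop",
  "text": "The best 3-points shooter is Curry!"
}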
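The stop analyzer can be pictured as the simple analyzer followed by a filter. The Python sketch below illustrates that idea. The stop list is assumed to mirror the default English list (check the documentation for your Elasticsearch version), and stop_analyze is a made-up name.

```python
import re

# Assumed to mirror the default English stop word list; verify against your version.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
}


def stop_analyze(text, stop_words=ENGLISH_STOP_WORDS):
    """Approximate the stop analyzer: letters only, lowercased, stop words removed."""
    # Step 1: same rule as the simple analyzer (runs of letters, lowercased).
    terms = [term.lower() for term in re.findall(r"[^\W\d_]+", text)]
    # Step 2: drop every term that appears in the stop word list.
    return [term for term in terms if term not in stop_words]


print(stop_analyze("The best 3-points shooter is Curry!"))
# ['best', 'points', 'shooter', 'curry']
```

Compared with the simple analyzer, "the" and "is" are gone.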
language analyzer
  • A language analyzer is built for a specific language (for example, english for English text). Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
  • POST localhost:9200/_analyze
{
  "analyzer": "english",
  "text": "The best 3-points shooter is Curry!"
}
pattern analyzer
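  • The pattern analyzer uses a regular expression to split the text into terms. The default pattern is \W+ (one or more non-word characters).
  • POST localhost:9200/_analyze
{
  "analyzer": "pattern",
  "text": "The best 3-points shooter is Curry!"
}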
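Splitting on a regular expression can be tried out directly with Python's re module. The sketch below assumes the default pattern \W+ and assumes that terms are lowercased by default. Python's regex dialect is not identical to the Java one Elasticsearch uses, so treat it as an approximation. The name pattern_analyze is made up for this example.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Approximate the pattern analyzer: the regex matches separators, not terms."""
    # re.split can yield empty strings at the edges (e.g. after the final "!"),
    # so those are filtered out.
    terms = [term for term in re.split(pattern, text) if term]
    return [term.lower() for term in terms] if lowercase else terms


print(pattern_analyze("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

Unlike the simple analyzer, the digit "3" is kept as its own term, because digits count as word characters.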
Choosing an analyzer
  • PUT localhost:9200/my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "team_name": {
        "type": "text"
      },
      "position": {
        "type": "text"
      },
      "play_year": {
        "type": "long"
      },
      "jerse_no": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
  • PUT localhost:9200/my_index/_doc/1
{
  "name": "Curry",
  "team_name": "Warriors",
  "position": "Point Guard",
  "play_year": 10,
  "jerse_no": "30",
  "title": "The best 3-points shooter is Curry!"
}
  • POST localhost:9200/my_index/_search
{
  "query": {
    "match": {
      "title": "Curry!"
    }
  }
}
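Because the title field uses a whitespace-type analyzer, the indexed terms keep their case and punctuation, and a match query on that field is analyzed the same way. The toy Python sketch below imitates this with a tiny in-memory inverted index to show why "Curry!" finds the document while "curry" would not. It is a simplified model, not how Elasticsearch is implemented, and the names build_index and match are made up for this example.

```python
def whitespace_tokens(text):
    """Split on whitespace only, keeping case and punctuation."""
    return text.split()


def build_index(docs, field):
    """Map each term of the given field to the set of document ids containing it."""
    index = {}
    for doc_id, doc in docs.items():
        for term in whitespace_tokens(doc[field]):
            index.setdefault(term, set()).add(doc_id)
    return index


def match(index, query):
    """Return ids of documents containing at least one query term (OR semantics)."""
    hits = set()
    for term in whitespace_tokens(query):
        hits |= index.get(term, set())
    return hits


docs = {"1": {"title": "The best 3-points shooter is Curry!"}}
index = build_index(docs, "title")

print(match(index, "Curry!"))  # {'1'}
print(match(index, "curry"))   # set()
```

If case-insensitive search on title is wanted, an analyzer that lowercases terms (such as standard or simple) would be the more natural choice.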