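How Arrays Are Indexed in Elasticsearch
- September 19, 2020
- Notes
- elasticsearch, tags
1. Introduction
Elasticsearch has no dedicated array type. By default any field can hold zero or more values, as long as all of them share the same data type. A field that holds more than one value is simply treated as an array of its mapped type.
The following uses video data as an example to show how Elasticsearch indexes array data and how to search the values in an array.
The test video data has the following format: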
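The idea that one field covers zero, one, or many values can be illustrated with a small toy model. This is only a sketch in plain Python, not Elasticsearch code; the helper names `field_values` and `build_index` and the three sample documents are made up for illustration.

```python
from collections import defaultdict

def field_values(value):
    """Normalize a field to a list: missing -> [], scalar -> [v], list -> as is."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

def build_index(docs, field):
    """Map every value of `field` to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for value in field_values(doc.get(field)):
            index[value].add(doc_id)
    return index

# Hypothetical documents: a single value, several values, and no value.
docs = {
    1: {"media_id": 1, "tags": "電影"},
    2: {"media_id": 2, "tags": ["電影", "科技"]},
    3: {"media_id": 3, "tags": []},
}
index = build_index(docs, "tags")
print(sorted(index["電影"]))  # [1, 2]
print(sorted(index["科技"]))  # [2]
```

In this model the single-valued and multi-valued documents end up in the same lookup structure, which is the behavior the paragraph above describes.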
{
"media_id": 88992211,
"tags": ["電影","科技","恐怖","電競"]
}
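media_id is the video ID, and tags holds the video's tags. A single video can have several tags, and the business requirement is to look up all videos under a given tag.
The Elasticsearch cluster used in this demonstration is version 7.6.2.
2. Test Walkthrough
2.1 Create the index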
PUT test_arrays
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"properties": {
"media_id": {
"type": "long"
},
"tags": {
"type": "text"
}
}
}
}
2.2 Write test data to the test_arrays index
POST test_arrays/_doc
{
"media_id": 887722,
"tags": [
"電影",
"科技",
"恐怖",
"電競"
]
}
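2.3 Inspect how test_arrays indexes the tags field
Running the four tag values through the _analyze API against the tags field returns the following tokens: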
{
"tokens" : [
{
"token" : "電",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "影",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "科",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 102
},
{
"token" : "技",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 103
},
{
"token" : "恐",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 204
},
{
"token" : "怖",
"start_offset" : 7,
"end_offset" : 8,
"type" : "<IDEOGRAPHIC>",
"position" : 205
},
{
"token" : "電",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<IDEOGRAPHIC>",
"position" : 306
},
{
"token" : "競",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<IDEOGRAPHIC>",
"position" : 307
}
]
}
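The response shows that each value in the tags array is split into multiple tokens: the standard analyzer, the default for text fields, emits every Chinese character as its own token. The jump in position between array values (from 1 to 102, and so on) comes from the text field's default position_increment_gap of 100, which keeps phrase queries from matching across array elements.
2.4 Search the values in the tags array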
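The positions and offsets in the response above can be reproduced with a simplified model. The sketch below assumes that every character becomes one token (which holds for the ideographs in this example, not for text in general) and that the gap between array values is the default `position_increment_gap` of 100. The function name `analyze_array` is made up for illustration.

```python
def analyze_array(values, position_increment_gap=100):
    """Toy model of analyzing an array of ideograph-only strings on a text field."""
    tokens = []
    position = -1      # position of the last emitted token
    offset_base = 0    # character offset at which the current value starts
    for i, value in enumerate(values):
        if i > 0:
            # Insert the gap between two array values.
            position += position_increment_gap
        for j, ch in enumerate(value):
            position += 1
            tokens.append({
                "token": ch,
                "start_offset": offset_base + j,
                "end_offset": offset_base + j + 1,
                "position": position,
            })
        # Offsets of consecutive values are separated by one character.
        offset_base += len(value) + 1
    return tokens

tokens = analyze_array(["電影", "科技", "恐怖", "電競"])
print([t["position"] for t in tokens])
# [0, 1, 102, 103, 204, 205, 306, 307]
```

The printed positions match the `position` values in the response, which suggests the gaps are a property of how array elements are concatenated rather than of the tags themselves.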
POST test_arrays/_search
{
"query": {
"match": {
"tags": "電影"
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.68324494,
"hits" : [
{
"_index" : "test_arrays",
"_type" : "_doc",
"_id" : "MyhnpXQBGXOapfjvSpOW",
"_score" : 0.68324494,
"_source" : {
"media_id" : 887722,
"tags" : [
"電影",
"科技",
"恐怖",
"電競"
]
}
}
]
}
}
Partial match on a single character of a tag:
POST test_arrays/_search
{
"query": {
"match": {
"tags": "影"
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "test_arrays",
"_type" : "_doc",
"_id" : "MyhnpXQBGXOapfjvSpOW",
"_score" : 0.2876821,
"_source" : {
"media_id" : 887722,
"tags" : [
"電影",
"科技",
"恐怖",
"電競"
]
}
}
]
}
}
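Why a single character such as 影 matches can be sketched with a toy model of the text mapping. It assumes per-character tokens at index time and at query time, and the match query's default OR operator. The function names are made up for illustration.

```python
def unigram_terms(values):
    """Index-time model: every character of every array value becomes a term."""
    return {ch for value in values for ch in value}

def match_query(query, doc_terms):
    """Query-time model: split the query into characters and OR them together."""
    return any(ch in doc_terms for ch in query)

doc_terms = unigram_terms(["電影", "科技", "恐怖", "電競"])
print(match_query("電影", doc_terms))  # True
print(match_query("影", doc_terms))    # True
print(match_query("影技", doc_terms))  # True, although no such tag exists
print(match_query("音樂", doc_terms))  # False
```

Under these assumptions a query can hit a document through characters that belong to different tags, so the text mapping is too loose for tag lookups.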
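The business requirement, however, is an exact match on a tag that returns all videos under that tag. To get this behavior, the tags field type has to be changed to keyword. The type of an existing field cannot be changed in place, so the index must first be deleted (DELETE test_arrays) and then recreated with the following mappings: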
PUT test_arrays
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"properties": {
"media_id": {
"type": "long"
},
"tags": {
"type": "keyword"
}
}
}
}
Each value in the tags array now corresponds to exactly one token, so a tag can be queried precisely for the videos under it.
{
"tokens" : [
{
"token" : "電影",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "科技",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 1
},
{
"token" : "恐怖",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "電競",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 3
}
]
}
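The effect of the keyword mapping can be modeled in a few lines: each array value is stored as one unanalyzed term, and a lookup succeeds only on the whole tag. This is a plain-Python sketch, not Elasticsearch code; the function names and the second document are hypothetical.

```python
from collections import defaultdict

def build_keyword_index(docs):
    """Map each whole tag to the media_ids of the documents that carry it."""
    index = defaultdict(set)
    for doc in docs:
        for tag in doc["tags"]:
            index[tag].add(doc["media_id"])
    return index

def videos_with_tag(index, tag):
    """Return the sorted media_ids whose tags contain exactly `tag`."""
    return sorted(index.get(tag, set()))

docs = [
    {"media_id": 887722, "tags": ["電影", "科技", "恐怖", "電競"]},
    {"media_id": 887723, "tags": ["電影", "喜劇"]},  # hypothetical second document
]
index = build_keyword_index(docs)
print(videos_with_tag(index, "電影"))  # [887722, 887723]
print(videos_with_tag(index, "影"))    # []
```

Unlike the text mapping, a single character no longer matches, because terms are compared as whole tags.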
In real business scenarios, video tags may not be stored as an array. Instead, all tags may be stored in a single string, separated by commas.
{
"media_id": 88992211,
"tags": "電影,科技,恐怖,電競"
}
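With this storage format, exact retrieval of the videos under a single tag is still possible by keeping the field as text and giving it a custom analyzer that splits on commas. After the previous index is deleted, test_arrays is configured as follows: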
PUT test_arrays
{
"settings": {
"number_of_shards": 1,
"analysis" : {
"analyzer" : {
"comma_analyzer": {
"tokenizer": "comma_tokenizer"
}
},
"tokenizer" : {
"comma_tokenizer": {
"type": "simple_pattern_split",
"pattern": ","
}
}
}
},
"mappings": {
"properties": {
"media_id": {
"type": "long"
},
"tags": {
"search_analyzer" : "simple",
"analyzer" : "comma_analyzer",
"type" : "text"
}
}
}
}
Write one test document to the test_arrays index:
POST test_arrays/_doc
{
"media_id": 887722,
"tags": "電影,科技,恐怖,電競"
}
The tags field is indexed as follows. Again, each tag corresponds to one token.
{
"tokens" : [
{
"token" : "電影",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "科技",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 1
},
{
"token" : "恐怖",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "電競",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 3
}
]
}
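The token list above can be reproduced with a small model of the comma-splitting step. The sketch assumes a literal "," separator and skips empty parts. The handling of empty parts is an assumption, since the article does not cover that edge case. The function name `comma_split` is made up for illustration.

```python
def comma_split(text, sep=","):
    """Toy model of splitting a tag string on a literal separator."""
    tokens = []
    start = 0
    position = 0
    for part in text.split(sep):
        if part:  # assumption: empty parts produce no token
            tokens.append({
                "token": part,
                "start_offset": start,
                "end_offset": start + len(part),
                "position": position,
            })
            position += 1
        # Advance past this part and the separator that followed it.
        start += len(part) + len(sep)
    return tokens

tokens = comma_split("電影,科技,恐怖,電競")
print([(t["token"], t["start_offset"], t["end_offset"]) for t in tokens])
# [('電影', 0, 2), ('科技', 3, 5), ('恐怖', 6, 8), ('電競', 9, 11)]
```

The offsets and positions agree with the response shown above: one token per tag, with consecutive positions.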
Exact-match query by tag.
Request:
POST test_arrays/_search
{
"query": {
"match": {
"tags": "電影"
}
}
}
Response:
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "test_arrays",
"_type" : "_doc",
"_id" : "3i2ipXQBGXOapfjv3THH",
"_score" : 0.2876821,
"_source" : {
"media_id" : 887722,
"tags" : "電影,科技,恐怖,電競"
}
}
]
}
}
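How the query side interacts with the comma-split index can be sketched as follows. The model assumes the simple analyzer splits on every non-letter character and lowercases the result, while the index side only splits on commas. The function names and the "Sci-Fi" tag are hypothetical.

```python
def index_terms(tags_string):
    """Index-time model: split on commas, keep each tag unchanged."""
    return {part for part in tags_string.split(",") if part}

def simple_analyze(text):
    """Query-time model: split on non-letters and lowercase each piece."""
    terms, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms

def matches(query, tags_string):
    """True if any analyzed query term equals an indexed tag."""
    terms = index_terms(tags_string)
    return any(term in terms for term in simple_analyze(query))

print(matches("電影", "電影,科技,恐怖,電競"))  # True
print(matches("影", "電影,科技,恐怖,電競"))    # False
print(matches("Sci-Fi", "Sci-Fi,電影"))        # False
```

The last line hints at a limitation under these assumptions: tags with uppercase Latin letters, digits, or punctuation are indexed unchanged but rewritten at query time, so they may not match. Tags made only of Chinese characters, as in this article, are unaffected.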
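3. Summary
Elasticsearch lets a single data type handle both single-valued and multi-valued fields. This reduces the number of data types and lowers the complexity of index configuration, which makes it an excellent design.
In addition, tag data can be organized either as an array or as a delimiter-separated string, which shows how flexible Elasticsearch is.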