Elasticsearch 7.6 Study Notes 1: Getting Started with Elasticsearch
- April 10, 2020
- Notes
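Preface
The Chinese edition of the Definitive Guide only covers 2.x, but Elasticsearch is already at 7.6, so I installed the latest version to learn with.
Installation
This is an installation for learning; a production installation follows different rules.
Windows
Elasticsearch download: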
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.0-windows-x86_64.zip
Kibana download:
https://artifacts.elastic.co/downloads/kibana/kibana-7.6.0-windows-x86_64.zip
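The latest official release is currently 7.6.0, but the download is painfully slow. With Xunlei (Thunder) the download speed can reach xM.
bin\elasticsearch.bat
bin\kibana.bat
Double-click the two .bat files to start Elasticsearch and Kibana.
Installing with Docker
For testing and learning, the official Docker image is faster and more convenient.
For installation steps, see: https://www.cnblogs.com/woshimrf/p/docker-es7.html
The following content comes from: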
https://www.elastic.co/guide/en/elasticsearch/reference/7.6/getting-started.html
Index some documents
These tests use Kibana directly; you can also reach localhost:9200 with curl or Postman.
Open localhost:5601 and click Dev Tools.
Create a customer index
PUT /{index-name}/_doc/{id}
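PUT /customer/_doc/1
{
  "name": "John Doe"
}
PUT is the HTTP method. If the index customer does not exist in Elasticsearch, it is created and a document is inserted with id 1 and name=John Doe.
If the document already exists, it is updated. Note that the update overwrites the whole document: whatever the JSON body contains is the final result.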
The response looks like this:
{ "_index" : "customer", "_type" : "_doc", "_id" : "1", "_version" : 7, "result" : "updated", "_shards" : { "total" : 2, "successful" : 2, "failed" : 0 }, "_seq_no" : 6, "_primary_term" : 1 }
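- _index is the index name.
- _type is always _doc.
- _id is the document's primary key, that is, the pk of a record.
- _version is the number of times this _id has been written; I had already updated it 7 times here.
- _shards reports the result across shard copies. We deployed two nodes here, and the write succeeded on both.
In Kibana, under Management > Index Management, you can check the status of an index. For example, this record has one primary and one replica shard.
Once a record has been saved, it can be read back immediately: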
GET /customer/_doc/1
Response:
{ "_index" : "customer", "_type" : "_doc", "_id" : "1", "_version" : 15, "_seq_no" : 14, "_primary_term" : 1, "found" : true, "_source" : { "name" : "John Doe" } }
_source is the content of the document we stored.
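The overwrite behaviour of PUT described above can be illustrated with a toy in-memory model. This is only a sketch of the semantics, not how Elasticsearch is implemented; the `ToyIndex` class and its methods are hypothetical names made up for these notes.

```python
class ToyIndex:
    """Toy model of PUT /{index}/_doc/{id}: each write replaces the whole document."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, source):
        old = self._docs.get(doc_id)
        # The version starts at 1 and increases by one on every write to the same id.
        version = 1 if old is None else old["_version"] + 1
        # The stored source is replaced entirely; fields are not merged.
        self._docs[doc_id] = {"_version": version, "_source": dict(source)}
        return {
            "_id": doc_id,
            "_version": version,
            "result": "created" if old is None else "updated",
        }

    def get(self, doc_id):
        doc = self._docs.get(doc_id)
        if doc is None:
            return {"_id": doc_id, "found": False}
        return {
            "_id": doc_id,
            "_version": doc["_version"],
            "found": True,
            "_source": dict(doc["_source"]),
        }


index = ToyIndex()
first = index.put("1", {"name": "John Doe", "age": 30})
second = index.put("1", {"name": "Jane Doe"})
print(first["result"], second["result"])
print(index.get("1")["_source"])  # the age field is gone after the second PUT
```

In this model, as in the real API, keeping a field across writes means sending it again in every PUT body.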
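Bulk insert
When many documents need to be inserted, we can insert them in bulk. Download the prepared file, then import it into Elasticsearch with an HTTP request.
Create an index named bank. The number of shards cannot be changed after the index is created (the number of replicas can still be adjusted later), so the shards must be configured at creation time. Here we configure 3 shards and 2 replicas.
PUT /bank
{
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "2"
    }
  }
}
File location: https://gitee.com/mirrors/elasticsearch/raw/master/docs/src/test/resources/accounts.json
After downloading it, send the file with curl or Postman:
curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"
Each record looks like this once indexed: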
{ "_index": "bank", "_type": "_doc", "_id": "1", "_version": 1, "_score": 0, "_source": { "account_number": 1, "balance": 39225, "firstname": "Amber", "lastname": "Duke", "age": 32, "gender": "M", "address": "880 Holmes Lane", "employer": "Pyrami", "email": "[email protected]", "city": "Brogan", "state": "IL" } }
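For reference, the `_bulk` endpoint expects newline-delimited JSON: an action line followed by the document source, with a newline after the last line. accounts.json is already in that shape. Below is a minimal sketch of building such a body in Python; the helper name `to_bulk_body` and the choice of `account_number` as the document id are assumptions made for illustration.

```python
import json


def to_bulk_body(docs, id_field="account_number"):
    """Build an NDJSON body for POST /{index}/_bulk from a list of dicts."""
    lines = []
    for doc in docs:
        # Action line: index this document under the given _id.
        lines.append(json.dumps({"index": {"_id": str(doc[id_field])}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


sample = [
    {"account_number": 1, "firstname": "Amber", "lastname": "Duke"},
    {"account_number": 6, "firstname": "Hattie", "lastname": "Bond"},
]
body = to_bulk_body(sample)
print(body)
```

The resulting string can be saved to a file and sent with the same curl command as above.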
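In Kibana's monitoring page, choose self monitoring, then find the bank index under Indices to see how the imported data is distributed.
You can see that the 3 shards are spread across different nodes and that each has 2 replicas.
Start searching
With some data bulk-loaded, we can start learning how to query. As shown above, the data is a set of bank account records. Let's query all accounts, sorted by account number.
The SQL equivalent would be roughly:
select * from bank order by account_number asc limit 3 offset 2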
Query DSL
GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ],
  "size": 3,
  "from": 2
}
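- _search means this is a search request.
- query is the search condition; here it matches everything.
- size is the number of hits returned per request, that is, the page size. It defaults to 10 when omitted. The hits appear in the hits array of the response.
- from is the zero-based offset of the first hit to return.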
Response:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "bank", "_type" : "_doc", "_id" : "2", "_score" : null, "_source" : { "account_number" : 2, "balance" : 28838, "firstname" : "Roberta", "lastname" : "Bender", "age" : 22, "gender" : "F", "address" : "560 Kingsway Place", "employer" : "Chillium", "email" : "[email protected]", "city" : "Bennett", "state" : "LA" }, "sort" : [ 2 ] }, { "_index" : "bank", "_type" : "_doc", "_id" : "3", "_score" : null, "_source" : { "account_number" : 3, "balance" : 44947, "firstname" : "Levine", "lastname" : "Burks", "age" : 26, "gender" : "F", "address" : "328 Wilson Avenue", "employer" : "Amtap", "email" : "[email protected]", "city" : "Cochranville", "state" : "HI" }, "sort" : [ 3 ] }, { "_index" : "bank", "_type" : "_doc", "_id" : "4", "_score" : null, "_source" : { "account_number" : 4, "balance" : 27658, "firstname" : "Rodriquez", "lastname" : "Flores", "age" : 31, "gender" : "F", "address" : "986 Wyckoff Avenue", "employer" : "Tourmania", "email" : "[email protected]", "city" : "Eastvale", "state" : "HI" }, "sort" : [ 4 ] } ] } }
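The response provides the following information:
- took: how long Elasticsearch took to run the query, in milliseconds.
- timed_out: whether the search timed out.
- _shards: how many shards were searched, and how many succeeded, failed, or were skipped. A shard is simply a slice of the data: the documents of an index are split into several pieces, much like splitting a table by id.
- max_score: the score of the most relevant document.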
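The way sort, from, and size interact in the query above can be sketched with plain Python lists. This is a simplified model that ignores shards and scoring; `search_page` is a hypothetical helper, not an Elasticsearch API.

```python
def search_page(docs, sort_field, size=10, from_=0):
    """Sort ascending by sort_field, skip from_ hits, then return at most size hits."""
    ordered = sorted(docs, key=lambda d: d[sort_field])
    return ordered[from_:from_ + size]


# Six accounts in arbitrary order, standing in for the bank index.
accounts = [{"account_number": n} for n in (5, 3, 0, 4, 1, 2)]
page = search_page(accounts, "account_number", size=3, from_=2)
print([d["account_number"] for d in page])  # [2, 3, 4]
```

With from=2 and size=3 the first two accounts are skipped, which matches the response above starting at account_number 2.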
Next, let's try queries with conditions.
Full-text (analyzed) query
Find addresses that contain mill or lane.
GET /bank/_search
{
  "query": {
    "match": { "address": "mill lane" }
  },
  "size": 2
}
Response:
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 19, "relation" : "eq" }, "max_score" : 9.507477, "hits" : [ { "_index" : "bank", "_type" : "_doc", "_id" : "136", "_score" : 9.507477, "_source" : { "account_number" : 136, "balance" : 45801, "firstname" : "Winnie", "lastname" : "Holland", "age" : 38, "gender" : "M", "address" : "198 Mill Lane", "employer" : "Neteria", "email" : "[email protected]", "city" : "Urie", "state" : "IL" } }, { "_index" : "bank", "_type" : "_doc", "_id" : "970", "_score" : 5.4032025, "_source" : { "account_number" : 970, "balance" : 19648, "firstname" : "Forbes", "lastname" : "Wallace", "age" : 28, "gender" : "M", "address" : "990 Mill Road", "employer" : "Pheast", "email" : "[email protected]", "city" : "Lopezo", "state" : "AK" } } ] } }
- I asked for 2 hits, but 19 documents actually matched.
Phrase match query
GET /bank/_search
{
  "query": {
    "match_phrase": { "address": "mill lane" }
  }
}
This time only one document matches the exact phrase:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 9.507477, "hits" : [ { "_index" : "bank", "_type" : "_doc", "_id" : "136", "_score" : 9.507477, "_source" : { "account_number" : 136, "balance" : 45801, "firstname" : "Winnie", "lastname" : "Holland", "age" : 38, "gender" : "M", "address" : "198 Mill Lane", "employer" : "Neteria", "email" : "[email protected]", "city" : "Urie", "state" : "IL" } } ] } }
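The difference between match and match_phrase in the two queries above can be sketched as follows. This is a rough approximation: it lowercases and splits on word characters instead of running a real analyzer, and it ignores scoring and slop. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Very rough stand-in for an analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def match(field_text, query):
    """match: true if ANY query token occurs in the field (default OR behaviour)."""
    tokens = set(tokenize(field_text))
    return any(t in tokens for t in tokenize(query))


def match_phrase(field_text, query):
    """match_phrase: true only if the query tokens occur consecutively, in order."""
    tokens = tokenize(field_text)
    phrase = tokenize(query)
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


for address in ("198 Mill Lane", "990 Mill Road", "560 Kingsway Place"):
    print(address, match(address, "mill lane"), match_phrase(address, "mill lane"))
```

"990 Mill Road" satisfies match because it contains mill, but not match_phrase, which is why it shows up in the first response and not in the second.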
Multi-condition queries
Real-world queries usually combine several conditions:
GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
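- bool combines multiple query clauses.
- must, should, and must_not are the sub-clauses of a boolean query.
- must and should contribute to the relevance score; results are sorted by score by default.
- must_not acts as a filter: it affects which documents are returned but not the score; it only removes documents from the results.
You can also specify arbitrary filters explicitly to include or exclude documents based on structured data.
For example, find accounts whose balance is between 20000 and 30000: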
GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": { "gte": 20000, "lte": 30000 }
        }
      }
    }
  }
}
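When such requests are generated from application code, it can help to assemble the bool clauses programmatically. The sketch below only builds the JSON body and does not talk to a cluster; `bool_query` is a hypothetical helper written for these notes, not part of any client library.

```python
import json


def bool_query(must=None, should=None, must_not=None, filter=None):
    """Assemble a bool query body, leaving out clauses that were not given."""
    clauses = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filter,
    }
    return {"query": {"bool": {k: v for k, v in clauses.items() if v}}}


# Same conditions as the two requests above.
age_query = bool_query(
    must=[{"match": {"age": "40"}}],
    must_not=[{"match": {"state": "ID"}}],
)
balance_query = bool_query(
    must={"match_all": {}},
    filter={"range": {"balance": {"gte": 20000, "lte": 30000}}},
)
print(json.dumps(age_query, indent=2))
print(json.dumps(balance_query, indent=2))
```

The printed JSON can be pasted under a GET /bank/_search line in Dev Tools.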
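Aggregations (group by)
Count accounts per state
In SQL this might be written as:
select state, count(*) AS doc_count from tbl_bank group by state order by doc_count desc limit 3;
The corresponding Elasticsearch request is: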
GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3
      }
    }
  }
}
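- size=0 limits the returned hits. Elasticsearch would otherwise return the matching documents, and we only want the aggregated values.
- aggs is the keyword that introduces aggregations.
- group_by_state is the name of one aggregation result; the name is user-defined.
- terms buckets documents by the exact value of a field; this is the field to group by.
- state.keyword: state is a text field, and a string field that is aggregated or grouped on must be of type keyword.
- size=3 limits the number of groups returned, here the top 3. The default is the top 10 and the system-wide maximum is 10000, which can be changed through search.max_buckets. Note that multiple shards introduce accuracy problems, which we will study in more depth later.
Response: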
{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : 26, "sum_other_doc_count" : 928, "buckets" : [ { "key" : "MD", "doc_count" : 28 }, { "key" : "ID", "doc_count" : 23 }, { "key" : "TX", "doc_count" : 21 } ] } } }
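- hits: the documents matching the query. Because size=0, this is [].
- total: this query matched 1000 documents.
- aggregations: the aggregation results.
- group_by_state: the name we gave the aggregation in the request.
- doc_count_error_upper_bound: an upper bound on the document count of terms that may exist but were not returned by this aggregation. As the name says, it is the worst-case estimate of what was left out of the final result. The larger it is, the more likely the result is inaccurate. A value of 0 means the counts are exact, but a non-zero value does not necessarily mean the result is wrong.
- sum_other_doc_count: the total document count of the buckets that were not returned in the response.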
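As a quick sanity check on sum_other_doc_count, the numbers in the response above add up: the three returned buckets plus the "other" documents equal the total hits. The sketch below assumes every document has exactly one state value, which holds for this data set; the helper name is made up for these notes.

```python
def sum_other_doc_count(total_hits, bucket_counts):
    """Documents that fall into buckets not included in the response."""
    return total_hits - sum(bucket_counts)


# Values taken from the response above: MD=28, ID=23, TX=21 out of 1000 hits.
print(sum_other_doc_count(1000, [28, 23, 21]))  # 928
```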
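It is worth asking whether this top 3 is accurate. doc_count_error_upper_bound is non-zero, so the counts may well be inaccurate; the top 3 we got have counts 28, 23, and 21. Let's add one more parameter and compare the results: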
GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3,
        "shard_size": 60
      }
    }
  }
}
-----------------------------------------
"aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 915, "buckets" : [ { "key" : "TX", "doc_count" : 30 }, { "key" : "MD", "doc_count" : 28 }, { "key" : "ID", "doc_count" : 27 } ] } }
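shard_size is the number of buckets each shard computes. In a terms aggregation every shard computes its own result, and those results are then merged into the final one. When data is unevenly distributed across shards, each shard's top N differs, so the merged result may undercount some terms, which is why doc_count_error_upper_bound is non-zero. The default shard_size is size*1.5+10, which gives 14.5 for size=3 (shard_size takes an integer, so presumably 14 in effect); I verified that setting shard_size to this value returns the same result as omitting it. With shard_size set to 60, the error finally drops to 0, which guarantees that these 3 really are the top 3. In other words, set shard_size generously for aggregations, for example 20 times size.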
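The effect of shard_size can be reproduced with a small simulation. This is a simplified model of the merge step, not Elasticsearch's actual implementation: each shard returns only its own top shard_size terms, and the coordinating step sums what it receives. The names `terms_agg` and `default_shard_size` are hypothetical, and truncating 14.5 to 14 is an assumption based on shard_size being an integer.

```python
from collections import Counter


def default_shard_size(size):
    """Default from the formula size * 1.5 + 10, truncated to an integer (assumption)."""
    return int(size * 1.5 + 10)


def terms_agg(shards, size, shard_size):
    """Merge per-shard top-N term counts and return the overall top `size` terms."""
    merged = Counter()
    for counts in shards:
        # Each shard only reports its own top `shard_size` terms.
        for key, n in Counter(counts).most_common(shard_size):
            merged[key] += n
    return merged.most_common(size)


# Term B is second on both shards, yet it has the highest total (18).
shards = [
    {"A": 10, "B": 9},
    {"C": 11, "B": 9},
]
print(terms_agg(shards, size=1, shard_size=1))  # [('C', 11)]  B is missed entirely
print(terms_agg(shards, size=1, shard_size=2))  # [('B', 18)]  correct once shards report more terms
print(default_shard_size(3))  # 14
```

A larger shard_size makes each shard report more candidates, at the cost of more data being sent to and merged by the coordinating node.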
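Count accounts per state and compute the average balance
To see the average balance for each state, the SQL might be:
select state, avg(balance) AS average_balance, count(*) AS doc_count from tbl_bank group by state order by doc_count desc limit 3
In Elasticsearch the query looks like this: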
GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3,
        "shard_size": 60
      },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        },
        "sum_balance": {
          "sum": { "field": "balance" }
        }
      }
    }
  }
}
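- The second aggs computes metrics for each state bucket.
- average_balance is a user-defined name; its value is the avg of balance over the documents with the same state.
- sum_balance is a user-defined name; its value is the sum of balance over the documents with the same state.
The result: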
{ "took" : 12, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 915, "buckets" : [ { "key" : "TX", "doc_count" : 30, "sum_balance" : { "value" : 782199.0 }, "average_balance" : { "value" : 26073.3 } }, { "key" : "MD", "doc_count" : 28, "sum_balance" : { "value" : 732523.0 }, "average_balance" : { "value" : 26161.535714285714 } }, { "key" : "ID", "doc_count" : 27, "sum_balance" : { "value" : 657957.0 }, "average_balance" : { "value" : 24368.777777777777 } } ] } } }
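What the nested aggregation computes can be mirrored in a few lines of Python over an in-memory list. This is a sketch of the logic only (single process, no shards); `group_by_state` and the sample accounts are made up for illustration.

```python
from collections import defaultdict


def group_by_state(accounts, size):
    """Bucket accounts by state, compute count/sum/avg of balance, keep the top `size` by count."""
    balances = defaultdict(list)
    for acc in accounts:
        balances[acc["state"]].append(acc["balance"])
    buckets = [
        {
            "key": state,
            "doc_count": len(values),
            "sum_balance": sum(values),
            "average_balance": sum(values) / len(values),
        }
        for state, values in balances.items()
    ]
    # Like the terms aggregation, order by doc_count descending by default.
    buckets.sort(key=lambda b: b["doc_count"], reverse=True)
    return buckets[:size]


sample = [
    {"state": "TX", "balance": 100},
    {"state": "MD", "balance": 50},
    {"state": "TX", "balance": 300},
]
print(group_by_state(sample, size=3))
```

The same relationship holds in the response above: for TX, sum_balance / doc_count = 782199 / 30 = 26073.3, which is the reported average_balance.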
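Count accounts per state and sort by average balance
A terms aggregation sorts by count in descending order by default. If we want a different order, the SQL might look like this:
select state, avg(balance) AS average_balance, count(*) AS doc_count from tbl_bank group by state order by average_balance desc limit 3
The corresponding Elasticsearch query: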
GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": { "average_balance": "desc" },
        "size": 3
      },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        }
      }
    }
  }
}
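The top 3 are now different from before. doc_count_error_upper_bound is -1 here because the error bound cannot be determined when the terms are ordered by a sub-aggregation: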
"aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : -1, "sum_other_doc_count" : 983, "buckets" : [ { "key" : "DE", "doc_count" : 2, "average_balance" : { "value" : 39040.5 } }, { "key" : "RI", "doc_count" : 5, "average_balance" : { "value" : 36035.4 } }, { "key" : "NE", "doc_count" : 10, "average_balance" : { "value" : 35648.8 } } ] } }
References
- Chinese community: https://elasticsearch.cn/
- Official Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html
- Official Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index.html
- Inaccurate terms aggregation counts: https://www.dongwm.com/post/elasticsearch-terms-agg-is-not-accurate/