Elasticsearch 7.6 Study Notes 1: Getting Started with Elasticsearch
- April 10, 2020
- Notes
Foreword
The Chinese translation of the Definitive Guide only covers 2.x, while Elasticsearch is already at 7.6, so let's install the latest version and learn from there.
Installation
This is an installation for learning; production installs follow a different set of practices.
Windows
Elasticsearch download:
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.0-windows-x86_64.zip
Kibana download:
https://artifacts.elastic.co/downloads/kibana/kibana-7.6.0-windows-x86_64.zip
The latest official release is currently 7.6.0, but the download speed is dismal; with Thunder (Xunlei) it reaches a few MB/s.
bin\elasticsearch.bat
bin\kibana.bat
Double-click the .bat files to start them.
Docker installation
For testing and learning, the official Docker image is faster and more convenient.
Installation steps: https://www.cnblogs.com/woshimrf/p/docker-es7.html
The content below is based on:
https://www.elastic.co/guide/en/elasticsearch/reference/7.6/getting-started.html
Index some documents
This walkthrough uses Kibana directly; you can also reach localhost:9200 with curl or Postman.
Open localhost:5601 and click Dev Tools.
Create a customer index:
PUT /{index-name}/_doc/{id}

PUT /customer/_doc/1
{
  "name": "John Doe"
}
PUT is the HTTP method. If the index customer does not exist, Elasticsearch creates it and inserts a document with id 1 and name "John Doe".
If the document already exists, it is updated. Note that an update is a full overwrite: whatever the body JSON contains becomes the final document.
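To make the overwrite semantics concrete, here is a minimal sketch using a plain Python dict as a stand-in for the index (put_doc is a made-up helper, not a client API):

```python
# Minimal sketch of PUT's overwrite semantics: a PUT to an existing id
# replaces the whole document, it does not merge fields.
index = {}

def put_doc(doc_id, body):
    created = doc_id not in index
    index[doc_id] = dict(body)  # full overwrite, like PUT /customer/_doc/{id}
    return "created" if created else "updated"

print(put_doc(1, {"name": "John Doe", "age": 30}))  # created
print(put_doc(1, {"name": "John Doe"}))             # updated
print(index[1])  # the "age" field is gone: the body replaced the document
```

This is why a PUT with a smaller body silently drops fields: the second body replaces the first in full.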
The response:

{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 7,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 6,
  "_primary_term" : 1
}
- _index is the index name.
- _type is always _doc.
- _id is the document's primary key, i.e., the pk of a record.
- _version is the number of times this _id has been updated; here I have already updated it 7 times.
- _shards shows the result per shard. We deployed two nodes here, and both writes succeeded.
In Kibana, under Settings -> Index Management, you can check the index status; for example, this record has a primary shard and a replica.
Once the record is saved, it can be read back immediately:

GET /customer/_doc/1

Response:

{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 15,
  "_seq_no" : 14,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "John Doe"
  }
}
_source is the content of our record.
Bulk insert
When there are many documents to insert, we can insert them in bulk. Download the prepared data file and import it into Elasticsearch over HTTP.
Create an index bank. The number of shards cannot be changed after an index is created (replicas can be adjusted later), so configure it at creation time. Here we configure 3 shards and 2 replicas:

PUT /bank
{
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "2"
    }
  }
}

Data file: https://gitee.com/mirrors/elasticsearch/raw/master/docs/src/test/resources/accounts.json
After downloading it, send the file with curl or Postman:

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"
Each record looks like this:

{
  "_index": "bank",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "_score": 0,
  "_source": {
    "account_number": 1,
    "balance": 39225,
    "firstname": "Amber",
    "lastname": "Duke",
    "age": 32,
    "gender": "M",
    "address": "880 Holmes Lane",
    "employer": "Pyrami",
    "email": "[email protected]",
    "city": "Brogan",
    "state": "IL"
  }
}
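For reference, the file sent to _bulk is in NDJSON format: every document is preceded by an action/metadata line, and the body must end with a newline. A rough sketch of assembling such a body (build_bulk_body is a hypothetical helper; fields taken from the sample record above):

```python
import json

# Sketch: assemble a _bulk request body (NDJSON) for the bank index.
# Each document becomes two lines: an action/metadata line, then the source.
def build_bulk_body(docs):
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_id": str(doc["account_number"])}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline

docs = [
    {"account_number": 1, "balance": 39225, "firstname": "Amber"},
    {"account_number": 2, "balance": 28838, "firstname": "Roberta"},
]
body = build_bulk_body(docs)
print(body)
```

The body could then be POSTed to localhost:9200/bank/_bulk with Content-Type: application/json, exactly as the curl command above does with the downloaded file.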
In Kibana Monitoring, enable self monitoring, then find the bank index under Indices to see how the imported data is distributed.
You can see the 3 shards spread across different nodes, each with 2 replicas.
Start querying
With some data bulk-inserted, we can start learning to query. As noted above, the data is a set of bank account records; let's query all accounts, sorted by account number.
The SQL equivalent:

select * from bank order by account_number asc limit 3

Query DSL:

GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ],
  "size": 3,
  "from": 2
}
- _search marks this as a search request.
- query is the query condition; here it matches everything.
- size is the number of documents per request, i.e., the page size; the default is 10. The results show up under hits in the response.
- from is the offset to start from.
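The from/size pair behaves like an offset/limit slice over the sorted hits; a quick sketch of the semantics (plain Python, pretend account numbers):

```python
# Sketch of from/size pagination: an offset/limit slice over the sorted hits.
# Pretend account_number values 0..9, sorted ascending as in the query above.
accounts = list(range(10))
frm, size = 2, 3                 # from=2, size=3
page = accounts[frm:frm + size]  # skip `from` results, take `size`
print(page)  # [2, 3, 4]
```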
Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : { "value" : 1000, "relation" : "eq" },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : null,
        "_source" : { "account_number" : 2, "balance" : 28838, "firstname" : "Roberta", "lastname" : "Bender", "age" : 22, "gender" : "F", "address" : "560 Kingsway Place", "employer" : "Chillium", "email" : "[email protected]", "city" : "Bennett", "state" : "LA" },
        "sort" : [ 2 ]
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : null,
        "_source" : { "account_number" : 3, "balance" : 44947, "firstname" : "Levine", "lastname" : "Burks", "age" : 26, "gender" : "F", "address" : "328 Wilson Avenue", "employer" : "Amtap", "email" : "[email protected]", "city" : "Cochranville", "state" : "HI" },
        "sort" : [ 3 ]
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : null,
        "_source" : { "account_number" : 4, "balance" : 27658, "firstname" : "Rodriquez", "lastname" : "Flores", "age" : 31, "gender" : "F", "address" : "986 Wyckoff Avenue", "employer" : "Tourmania", "email" : "[email protected]", "city" : "Eastvale", "state" : "HI" },
        "sort" : [ 4 ]
      }
    ]
  }
}
The response provides the following information:
- took: how long the search took, in milliseconds.
- timed_out: whether the search timed out.
- _shards: how many shards were searched, and how many succeeded, failed, or were skipped. A shard, simply put, is a slice of the data: an index's data is split into several shards, much like sharding a table by id.
- max_score: the score of the most relevant document.
Next, let's try queries with conditions.
Analyzed (full-text) query
Query for addresses containing mill or lane:

GET /bank/_search
{
  "query": { "match": { "address": "mill lane" } },
  "size": 2
}
Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : { "value" : 19, "relation" : "eq" },
    "max_score" : 9.507477,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "136",
        "_score" : 9.507477,
        "_source" : { "account_number" : 136, "balance" : 45801, "firstname" : "Winnie", "lastname" : "Holland", "age" : 38, "gender" : "M", "address" : "198 Mill Lane", "employer" : "Neteria", "email" : "[email protected]", "city" : "Urie", "state" : "IL" }
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "970",
        "_score" : 5.4032025,
        "_source" : { "account_number" : 970, "balance" : 19648, "firstname" : "Forbes", "lastname" : "Wallace", "age" : 28, "gender" : "M", "address" : "990 Mill Road", "employer" : "Pheast", "email" : "[email protected]", "city" : "Lopezo", "state" : "AK" }
      }
    ]
  }
}

- I set size to 2, but 19 documents actually matched.
Phrase match query

GET /bank/_search
{
  "query": { "match_phrase": { "address": "mill lane" } }
}

This time only one document matches the exact phrase:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 9.507477,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "136",
        "_score" : 9.507477,
        "_source" : { "account_number" : 136, "balance" : 45801, "firstname" : "Winnie", "lastname" : "Holland", "age" : 38, "gender" : "M", "address" : "198 Mill Lane", "employer" : "Neteria", "email" : "[email protected]", "city" : "Urie", "state" : "IL" }
      }
    ]
  }
}
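The difference between match and match_phrase can be sketched with a toy analyzer (lowercase plus whitespace split, a simplification of Elasticsearch's real analysis chain):

```python
def analyze(text):
    # toy analyzer: lowercase and split on whitespace
    return text.lower().split()

def match(field_value, query):
    # match: any query term appearing in the field is a hit (OR semantics)
    terms = set(analyze(query))
    return bool(terms & set(analyze(field_value)))

def match_phrase(field_value, query):
    # match_phrase: the query terms must appear consecutively, in order
    tokens, phrase = analyze(field_value), analyze(query)
    return any(tokens[i:i + len(phrase)] == phrase
               for i in range(len(tokens) - len(phrase) + 1))

print(match("198 Mill Lane", "mill lane"))         # True
print(match("990 Mill Road", "mill lane"))         # True (contains "mill")
print(match_phrase("198 Mill Lane", "mill lane"))  # True
print(match_phrase("990 Mill Road", "mill lane"))  # False
```

This mirrors the results above: match found 19 addresses containing either term, while match_phrase kept only "198 Mill Lane".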
Compound queries
Real-world queries usually combine multiple conditions:

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
- bool combines multiple query clauses.
- must, should, and must_not are the sub-clauses of a boolean query.
- must and should contribute to the relevance score; results are sorted by score by default.
- must_not acts as a filter: it affects which documents are returned but not their scores; it only filters documents out of the results.
You can also explicitly use arbitrary filters to include or exclude documents based on structured data.
For example, query for balances between 20000 and 30000:

GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}
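The bool clauses can be sketched over plain records (a toy evaluation that ignores scoring; the field values are made up):

```python
# Sketch of bool query semantics over plain dicts: must matches, must_not
# excludes, and filter narrows the result without affecting scoring.
records = [
    {"age": 40, "state": "ID", "balance": 25000},
    {"age": 40, "state": "WA", "balance": 25000},
    {"age": 40, "state": "TX", "balance": 50000},
    {"age": 30, "state": "WA", "balance": 25000},
]

hits = [r for r in records
        if r["age"] == 40                     # must: match age 40
        and r["state"] != "ID"                # must_not: exclude state ID
        and 20000 <= r["balance"] <= 30000]   # filter: range on balance
print(hits)  # only the WA record with age 40 and balance in range remains
```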
Aggregations (group by)
Count accounts by state.
The SQL might be:

select state AS group_by_state, count(*) from tbl_bank group by state limit 3;

The corresponding Elasticsearch request:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3
      }
    }
  }
}
- size=0 limits the returned hits: Elasticsearch would otherwise return the matching documents, and we only want the aggregate values.
- aggs is the aggregation keyword.
- group_by_state names an aggregation result; the name is user-defined.
- terms matches exact values of a field; this is the field to group by.
- state.keyword: state is a text field; to aggregate or group on a string field, you must use its keyword sub-field.
- size=3 (inside terms) limits how many groups are returned, here the top 3. The default is the top 10 and the system maximum is 10000, adjustable via search.max_buckets. Note that multiple shards introduce accuracy issues; more on that below.
The response:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : { "value" : 1000, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : 26,
      "sum_other_doc_count" : 928,
      "buckets" : [
        { "key" : "MD", "doc_count" : 28 },
        { "key" : "ID", "doc_count" : 23 },
        { "key" : "TX", "doc_count" : 21 }
      ]
    }
  }
}
- hits: documents matching the query; since size=0, it is [].
- total: this query matched 1000 documents.
- aggregations: the aggregation results.
- group_by_state: the name we chose in the query.
- doc_count_error_upper_bound: counts that were not returned by this aggregation but might exist. The name means "upper bound": it is the worst-case estimate of documents left out of the reported buckets. The larger it is, the more likely the final numbers are inaccurate. A value of 0 means the counts are exact; a non-zero value does not necessarily mean the aggregation is wrong.
- sum_other_doc_count: the number of documents not counted in the returned buckets.
Is this top 3 actually accurate? We can see that doc_count_error_upper_bound is non-zero, so the result may well be inaccurate, and the top 3 counts are 28, 23, 21. Let's add another parameter and compare the results:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3,
        "shard_size": 60
      }
    }
  }
}
-----------------------------------------
"aggregations" : {
  "group_by_state" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 915,
    "buckets" : [
      { "key" : "TX", "doc_count" : 30 },
      { "key" : "MD", "doc_count" : 28 },
      { "key" : "ID", "doc_count" : 27 }
    ]
  }
}
shard_size is how many terms each shard computes. An aggregation is computed per shard and the per-shard results are then merged into the final answer; since data is distributed unevenly, each shard's top N differs, so the merged result can undercount some terms. That is what makes doc_count_error_upper_bound non-zero. The default shard_size is size * 1.5 + 10, which for size=3 is 14.5; passing shard_size=14.5 indeed returns the same result as not passing it. With shard_size=60 the error finally drops to 0, meaning these 3 are guaranteed to be the true top 3. In short, for accurate aggregations set shard_size as large as you can afford, e.g., 20x size.
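The undercounting is easy to reproduce with a small simulation: each shard reports only its local top shard_size terms, and the coordinator merges what it received. A sketch with made-up, deliberately skewed per-shard counts:

```python
from collections import Counter

# Sketch of why a terms aggregation can undercount: each shard reports only
# its local top-`shard_size` terms, so the merged totals miss what was cut.
shards = [
    Counter({"TX": 10, "MD": 9, "ID": 8, "NY": 7}),
    Counter({"NY": 10, "MD": 9, "ID": 8, "TX": 2}),
]

def terms_agg(shards, shard_size):
    merged = Counter()
    for shard in shards:
        for term, count in shard.most_common(shard_size):
            merged[term] += count  # coordinator only sees each shard's top N
    return merged

approx = terms_agg(shards, shard_size=3)  # small shard_size cuts terms off
exact = terms_agg(shards, shard_size=4)   # large enough to cover every term
print(approx["TX"], exact["TX"])  # 10 12 -- shard 2's TX count was dropped
print(approx["NY"], exact["NY"])  # 10 17 -- shard 1's NY count was dropped
```

Note that with the small shard_size, NY's true total of 17 is reported as 10, which could even push the true #2 term out of the reported top 3; this is exactly the error doc_count_error_upper_bound warns about.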
Count by state and compute the average balance
We want the average balance per state. The SQL might be:

select state, avg(balance) AS average_balance, count(*) from tbl_bank group by state limit 3

In Elasticsearch:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3,
        "shard_size": 60
      },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        },
        "sum_balance": {
          "sum": { "field": "balance" }
        }
      }
    }
  }
}
- The second, nested aggs computes per-state metrics.
- average_balance is a user-defined name; its value is the avg of balance over documents with the same state.
- sum_balance is a user-defined name; its value is the sum of balance over documents with the same state.
The result:

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : { "value" : 1000, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 915,
      "buckets" : [
        {
          "key" : "TX",
          "doc_count" : 30,
          "sum_balance" : { "value" : 782199.0 },
          "average_balance" : { "value" : 26073.3 }
        },
        {
          "key" : "MD",
          "doc_count" : 28,
          "sum_balance" : { "value" : 732523.0 },
          "average_balance" : { "value" : 26161.535714285714 }
        },
        {
          "key" : "ID",
          "doc_count" : 27,
          "sum_balance" : { "value" : 657957.0 },
          "average_balance" : { "value" : 24368.777777777777 }
        }
      ]
    }
  }
}
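The nested aggregation is essentially a group-by with per-group metrics; a sketch over plain records (made-up balances):

```python
from collections import defaultdict

# Sketch of the nested aggregation: group records by state, then compute
# doc_count, sum(balance) and avg(balance) per bucket.
records = [
    {"state": "TX", "balance": 30000},
    {"state": "TX", "balance": 20000},
    {"state": "MD", "balance": 26000},
]

groups = defaultdict(list)
for r in records:
    groups[r["state"]].append(r["balance"])

buckets = [
    {"key": state,
     "doc_count": len(balances),
     "sum_balance": sum(balances),
     "average_balance": sum(balances) / len(balances)}
    for state, balances in groups.items()
]
print(buckets)
# TX: doc_count 2, sum 50000, avg 25000.0; MD: doc_count 1, sum 26000, avg 26000.0
```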
Count by state, ordered by average balance
The terms aggregation orders by count descending by default. For a different order, the SQL might be:

select state, avg(balance) AS average_balance, count(*) from tbl_bank group by state order by average_balance desc limit 3

The corresponding query:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": { "average_balance": "desc" },
        "size": 3
      },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        }
      }
    }
  }
}
The top 3 is now different from before:

"aggregations" : {
  "group_by_state" : {
    "doc_count_error_upper_bound" : -1,
    "sum_other_doc_count" : 983,
    "buckets" : [
      { "key" : "DE", "doc_count" : 2, "average_balance" : { "value" : 39040.5 } },
      { "key" : "RI", "doc_count" : 5, "average_balance" : { "value" : 36035.4 } },
      { "key" : "NE", "doc_count" : 10, "average_balance" : { "value" : 35648.8 } }
    ]
  }
}
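Ordering buckets by a sub-aggregation amounts to sorting the groups by the computed metric; a sketch using average values in the style of the responses above:

```python
# Sketch: buckets ordered by their average balance (descending), like
# "order": { "average_balance": "desc" } in the terms aggregation.
buckets = [
    {"key": "TX", "doc_count": 30, "average_balance": 26073.3},
    {"key": "DE", "doc_count": 2, "average_balance": 39040.5},
    {"key": "RI", "doc_count": 5, "average_balance": 36035.4},
]

top = sorted(buckets, key=lambda b: b["average_balance"], reverse=True)[:3]
print([b["key"] for b in top])  # ['DE', 'RI', 'TX']
```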
References
- Chinese community: https://elasticsearch.cn/
- Official docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html
- Official docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index.html
- On inaccurate terms aggregations: https://www.dongwm.com/post/elasticsearch-terms-agg-is-not-accurate/