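Elasticsearch Aggregation Analysis

Introduction to Aggregations

Aggregation is an important database feature: it computes summary values over the data set returned by a query, such as the maximum, minimum, sum, or average of a field (or of a computed expression). As both a search engine and a data store, ES offers powerful aggregation capabilities as well.

Aggregations that compute metrics such as the maximum, minimum, sum, or average over a data set are called metric aggregations in ES.

Besides aggregate functions, relational databases can also group the queried data with group by and then compute metrics on each group. In ES, group by is called bucketing, and these are the bucket aggregations.

ES also provides matrix aggregations and pipeline aggregations, but these are still being refined.

Writing Aggregation Queries in ES

In the request body of a query, define aggregations under the aggregations node using the following syntax (aggregations can also be abbreviated as aggs):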

"aggregations" : {
"<aggregation_name>" : { <!-- name of the aggregation -->
"<aggregation_type>" : { <!-- type of the aggregation -->
<aggregation_body> <!-- aggregation body: which fields to aggregate on -->
}
[,"meta" : { [<meta_data_body>] } ]? <!-- metadata -->
[,"aggregations" : { [<sub_aggregation>]+ } ]? <!-- sub-aggregations defined inside this aggregation -->
}
[,"<aggregation_name_2>" : { ... } ]* <!-- further named aggregations -->
}
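To make the template concrete, here is a minimal Python sketch that assembles such a request body as a dictionary. The helper name make_agg and the aggregation names by_age and max_balance are made up for illustration; only the name / type / body / sub-aggregation nesting comes from the syntax above.

```python
import json


def make_agg(agg_type, body, sub_aggs=None, meta=None):
    """Build one aggregation node: {<type>: <body>, "meta": ..., "aggs": ...}."""
    node = {agg_type: body}
    if meta is not None:
        node["meta"] = meta
    if sub_aggs:
        node["aggs"] = sub_aggs
    return node


# A bucket aggregation named "by_age" with a metric sub-aggregation "max_balance".
request_body = {
    "size": 0,
    "aggs": {
        "by_age": make_agg(
            "terms",
            {"field": "age"},
            sub_aggs={"max_balance": make_agg("max", {"field": "balance"})},
        )
    },
}

print(json.dumps(request_body, indent=2))
```

The printed JSON could be sent as the body of a _search request; how it is sent (client library, curl, console) is left open here.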

The values an aggregation operates on can come from a field or from the result of a script.

Metric Aggregations

max min sum avg

Maximum value:

POST /bank/_search
{
"size": 0,
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
}
}
}

Maximum balance among customers aged 24:

POST /bank/_search
{
"size": 20,
"query": {
"match": {
"age": 24
}
},
"sort": [
{
"balance": {
"order": "desc"
}
}
],
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
}
}
}

With values produced by a script: compute the average age of all customers, as well as the average age plus 10:

POST /bank/_search?size=0
{
"aggs": {
"avg_age": {
"avg": {
"script": {
"source": "doc.age.value"
}
}
},
"avg_age10": {
"avg": {
"script": {
"source": "doc.age.value + 10"
}
}
}
}
}

With a field specified, the script reads the field's value through _value:

POST /bank/_search?size=0
{
"aggs": {
"sum_balance": {
"sum": {
"field": "balance",
"script": {
"source": "_value * 1.03"
}
}
}
}
}

Use missing to supply a value for documents that have none. If it is not specified, documents missing the field are ignored.

POST /bank/_search?size=0
{
"aggs": {
"avg_age": {
"avg": {
"field": "age",
"missing": 18
}
}
}
}
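The effect of missing can be sketched in plain Python. This is only an illustration of the semantics described above, not how Elasticsearch computes it; the function name and the sample documents are made up.

```python
def avg_with_missing(docs, field, missing=None):
    """Average `field` over docs; docs without the field use `missing`, or are skipped."""
    values = []
    for doc in docs:
        value = doc.get(field)
        if value is None:
            if missing is None:
                continue  # no default given: the document is ignored
            value = missing
        values.append(value)
    return sum(values) / len(values) if values else None


# Hypothetical sample: the third document has no age.
docs = [{"age": 20}, {"age": 30}, {}]

print(avg_with_missing(docs, "age"))              # only two documents counted
print(avg_with_missing(docs, "age", missing=18))  # third document counted as 18
```

Without a default the average is taken over two documents; with missing set to 18 it is taken over three, which pulls the result down.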

Document count (count)

Count the documents in the bank index whose age is 24:

POST /bank/accounts/_count
{
"query": {
"match": {
"age" : 24
}
}
}

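value_count: counts the documents that have a value for a field (strictly, it counts the values themselves, so a multi-valued field contributes once per value)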

POST /bank/_search?size=1
{
"aggs": {
"age_count": {
"value_count": {
"field": "age"
}
}
}
}

cardinality: count of distinct values

POST /bank/_search?size=1
{
"aggs": {
"age_count": {
"cardinality": {
"field": "age"
}
},
"state_count": {
"cardinality": {
"field": "state.keyword"
}
}
}
}

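Note: the state field is aggregated through its keyword sub-field (state.keyword), because text fields cannot be aggregated by default.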
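To contrast value_count with cardinality, here is a small plain-Python sketch. It is exact, whereas the cardinality aggregation in Elasticsearch is an approximation (based on HyperLogLog++), so on large data the real result can deviate slightly. Function names and sample documents are made up for illustration.

```python
def value_count(docs, field):
    """Count every value of `field`; a list-valued field contributes one per element."""
    total = 0
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue
        total += len(value) if isinstance(value, list) else 1
    return total


def cardinality(docs, field):
    """Count the distinct values of `field` exactly."""
    distinct = set()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue
        distinct.update(value if isinstance(value, list) else [value])
    return len(distinct)


# Hypothetical sample: age 24 appears twice, one document has no age at all.
docs = [{"age": 24}, {"age": 31}, {"age": 24}, {"state": "TX"}]

print(value_count(docs, "age"))  # number of age values
print(cardinality(docs, "age"))  # number of distinct age values
```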

stats: returns five values (count, max, min, avg, sum)

POST /bank/_search?size=0
{
"aggs": {
"age_stats": {
"stats": {
"field": "age"
}
}
}
}

Extended stats

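Extended statistics. Compared with stats, it returns four more results: the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations.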

POST /bank/_search?size=0
{
"aggs": {
"age_stats": {
"extended_stats": {
"field": "age"
}
}
}
}
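The values returned by stats and extended_stats can be reproduced by hand. The sketch below is plain Python over a made-up list of numbers; it assumes the variance is the population variance (sum of squares divided by count, minus the squared mean) and that the bounds use two standard deviations, as described above.

```python
import math


def extended_stats(values, sigma=2.0):
    """Compute the stats values plus the four extended ones for a list of numbers."""
    if not values:
        raise ValueError("need at least one value")
    count = len(values)
    total = sum(values)
    avg = total / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance; clamp tiny negative rounding errors to zero.
    variance = max(sum_of_squares / count - avg * avg, 0.0)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


stats = extended_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(stats)
```

The first five keys correspond to what stats returns; the remaining four are the extras added by extended_stats.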

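Percentiles: the values at given percentiles

For the specified field (or script), the values are sorted in ascending order and the share of documents (as a percentage of all hits) is accumulated value by value; the aggregation returns the value reached at each requested percentage. By default it returns the values at the percentiles [ 1, 5, 25, 50, 75, 95, 99 ].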

POST /bank/_search?size=0
{
"aggs": {
"age_percents": {
"percentiles": {
"field": "age"
}
}
}
}

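Explanation of the result: at the 50th percentile the age value is 31, meaning that 50% of the documents have age <= 31; put the other way, documents with age <= 31 make up 50% of all hits.

Specifying the percentiles: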

POST /bank/_search?size=0
{
"aggs": {
"age_percents": {
"percentiles": {
"field": "age",
"percents" : [95, 99, 99.9]
}
}
}
}

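Percentile ranks: the percentage of documents whose value is less than or equal to a given value

Compute the percentage of documents whose age is at most 25 and at most 30; this is the inverse of the percentiles aggregation above.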

POST /bank/_search?size=0
{
"aggs": {
"age_perc_rank": {
"percentile_ranks": {
"field": "age",
"values": [
25,
30
]
}
}
}
}

Explanation of the result: 26.1% of the documents have an age of 25 or below, and 49.2% have an age of 30 or below.
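The inverse relationship between percentiles and percentile ranks can be sketched in plain Python. This uses an exact nearest-rank definition over a made-up list of ages; Elasticsearch computes both with an approximation algorithm (TDigest by default), so real results may differ slightly from an exact calculation.

```python
import math


def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to `threshold`."""
    return 100.0 * sum(1 for v in values if v <= threshold) / len(values)


def percentile(values, percent):
    """Nearest-rank percentile: smallest value covering at least `percent` % of the values."""
    ordered = sorted(values)
    index = max(math.ceil(percent * len(ordered) / 100) - 1, 0)
    return ordered[index]


# Hypothetical ages, not taken from the bank index.
ages = [20, 22, 24, 25, 28, 30, 31, 35, 38, 40]

print(percentile_rank(ages, 25))
print(percentile_rank(ages, 30))
median_age = percentile(ages, 50)
print(median_age, percentile_rank(ages, median_age))
```

Feeding the value returned by percentile back into percentile_rank gives back the requested percentage, which is the sense in which the two aggregations are inverses.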

Geo Bounds aggregation: computes the bounding box of the geo points in the document set

See the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geobounds-aggregation.html

Geo Centroid aggregation: computes the centroid of the geo points

See the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geocentroid-aggregation.html

Bucket Aggregations

Terms Aggregation: group by the terms (values) of a field

POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age"
}
}
}
}

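Explanation of the results:

"doc_count_error_upper_bound": 0: the upper bound of the error on the document counts
"sum_other_doc_count": 463: the number of documents belonging to the other terms that were not returned

By default, the top 10 groups are returned, ordered by document count from high to low.

In this result, there are 61 documents with age 31 and 60 documents with age 39.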
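What a terms aggregation returns can be mimicked on a single in-memory list. The sketch below is plain Python with made-up documents; it shows where the buckets and sum_other_doc_count come from, and it ignores sharding, so there is no count error to report.

```python
from collections import Counter


def terms_agg(docs, field, size=10):
    """Group docs by `field`, keep the `size` largest groups, and count the rest."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # Highest document count first; ties broken by key for a stable order.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    returned = sum(count for _, count in top)
    return {
        "buckets": [{"key": key, "doc_count": count} for key, count in top],
        "sum_other_doc_count": sum(counts.values()) - returned,
    }


# Hypothetical documents: three aged 31, two aged 39, one aged 24.
docs = [{"age": a} for a in (31, 39, 31, 24, 39, 31)]
result = terms_agg(docs, "age", size=2)
print(result)
```

With size set to 2, the group for age 24 is cut off and its single document shows up in sum_other_doc_count.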

** size: how many groups to return **

Return 20 groups:

POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"size": 20
}
}
}
}

Show the doc count error on each group:

POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"size": 5,
"shard_size": 20,
"show_term_doc_count_error": true
}
}
}
}
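Why a per-group error exists at all can be seen in a small simulation. The sketch below is plain Python under simplifying assumptions: each shard is a made-up Counter of term counts, every shard returns only its top shard_size terms, and the error bound for a term is taken as the sum of the last returned counts of the shards that cut off their list without reporting that term. It illustrates the idea only and is not Elasticsearch's implementation.

```python
from collections import Counter


def shard_top(counts, shard_size):
    """Return (top terms, truncated?) for one shard, ordered by count descending."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:shard_size], len(ordered) > shard_size


def merge_terms(shards, size, shard_size):
    """Merge per-shard top lists and attach an error upper bound to each bucket."""
    partials = [shard_top(counts, shard_size) for counts in shards]
    merged = Counter()
    for top, _ in partials:
        for term, count in top:
            merged[term] += count
    buckets = []
    for term, count in sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:size]:
        # A shard that truncated its list without this term may still hold
        # up to as many documents for it as its last reported term.
        error = sum(
            top[-1][1]
            for top, truncated in partials
            if truncated and term not in dict(top)
        )
        buckets.append(
            {"key": term, "doc_count": count, "doc_count_error_upper_bound": error}
        )
    return buckets


# Two hypothetical shards; term "c" is third on shard 1, so it is not reported there.
shards = [
    Counter({"a": 5, "b": 4, "c": 3}),
    Counter({"c": 6, "a": 5, "b": 1}),
]
buckets = merge_terms(shards, size=3, shard_size=2)
for bucket in buckets:
    print(bucket)
```

Term "c" really has 9 documents but is reported with 6, and the bound of 4 covers the 3 that went missing. Raising shard_size to 3 makes every shard report everything and the error drops to 0.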

** shard_size ** how many groups each shard returns

The default value of shard_size is: size if the index has a single shard, and size * 1.5 + 10 if it has multiple shards.

POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"size": 5,
"shard_size": 20
}
}
}
}
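The default rule for shard_size stated above is simple arithmetic. The following sketch just encodes that rule as written in this article; the function name is made up, and truncating the result to an integer is an assumption.

```python
def default_shard_size(size, number_of_shards):
    """Default shard_size as described in the text: size, or size * 1.5 + 10."""
    if number_of_shards == 1:
        return size
    return int(size * 1.5 + 10)  # assumption: fractional part is dropped


print(default_shard_size(5, 1))
print(default_shard_size(5, 3))
print(default_shard_size(10, 5))
```

So for size 5 on a multi-shard index, each shard would return 17 groups by default; the request above overrides that with 20.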

** order ** how the groups are sorted

Sort by document count:

POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order" : { "_count" : "asc" }
}
}
}
}

Sort by group key:

POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order" : { "_key" : "asc" }
}
}
}
}

Sort by a metric computed within each group:

POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order": {
"max_balance": "asc"
}
},
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
},
"min_balance": {
"min": {
"field": "balance"
}
}
}
}
}
}
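Ordering groups by a sub-aggregation amounts to computing the metric per group first and sorting afterwards. A plain-Python sketch with made-up documents follows; the function name is hypothetical, and the bucket keys mirror the names used in the request above.

```python
from collections import defaultdict


def terms_ordered_by_max(docs, group_field, metric_field):
    """Group docs, compute max/min of the metric per group, sort by max ascending."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[metric_field])
    buckets = [
        {
            "key": key,
            "doc_count": len(values),
            "max_balance": max(values),
            "min_balance": min(values),
        }
        for key, values in groups.items()
    ]
    # Equivalent of "order": {"max_balance": "asc"}; key breaks ties.
    buckets.sort(key=lambda b: (b["max_balance"], b["key"]))
    return buckets


# Hypothetical accounts.
docs = [
    {"age": 24, "balance": 3000},
    {"age": 24, "balance": 1200},
    {"age": 31, "balance": 2500},
    {"age": 39, "balance": 4100},
    {"age": 39, "balance": 900},
]
buckets = terms_ordered_by_max(docs, "age", "balance")
for bucket in buckets:
    print(bucket)
```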

Filtering groups: matching values with a regular expression

GET /_search
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"include" : ".*sport.*",
"exclude" : "water_.*"
}
}
}
}
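How include and exclude interact can be sketched with Python's re module. Two caveats: Elasticsearch patterns use Lucene's regular expression syntax, which is not identical to Python's (the two patterns above happen to be valid in both), and they must match the whole term, which re.fullmatch imitates here. The term list is made up.

```python
import re


def filter_terms(terms, include=None, exclude=None):
    """Keep terms that fully match `include` and do not fully match `exclude`."""
    kept = []
    for term in terms:
        if include is not None and not re.fullmatch(include, term):
            continue
        if exclude is not None and re.fullmatch(exclude, term):
            continue  # exclude wins over include
        kept.append(term)
    return kept


# Hypothetical tag values.
tags = ["sport", "water_sports", "motorsport", "music"]
kept = filter_terms(tags, include=".*sport.*", exclude="water_.*")
print(kept)
```

"water_sports" matches the include pattern but is dropped because it also matches the exclude pattern.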

Filtering groups: an explicit list of values

GET /_search
{
"aggs" : {
"JapaneseCars" : {
"terms" : {
"field" : "make",
"include" : ["mazda", "honda"]
}
},
"ActiveCarManufacturers" : {
"terms" : {
"field" : "make",
"exclude" : ["rover", "jensen"]
}
}
}
}

Grouping by a script-computed value

GET /_search
{
"aggs" : {
"genres" : {
"terms" : {
"script" : {
"source": "doc['genre'].value",
"lang": "painless"
}
}
}
}
}

Handling missing values

GET /_search
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"missing": "N/A"
}
}
}
}

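Filter Aggregation: aggregate over the documents that match a filter query

From the documents hit by the query, select those that satisfy the filter and aggregate over them: filter first, then aggregate.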

POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"filter": {"match":{"gender":"F"}},
"aggs": {
"avg_age": {
"avg": {
"field": "age"
}
}
}
}
}
}

Filters Aggregation: aggregate over multiple filter groups

Prepare the data:

PUT /logs/_doc/_bulk?refresh
{"index":{"_id":1}}
{"body":"warning: page could not be rendered"}
{"index":{"_id":2}}
{"body":"authentication error"}
{"index":{"_id":3}}
{"body":"warning: connection timed out"}

Get the aggregation results for the combined filters:

GET logs/_search
{
"size": 0,
"aggs": {
"messages": {
"filters": {
"filters": {
"errors": {
"match": {
"body": "error"
}
},
"warnings": {
"match": {
"body": "warning"
}
}
}
}
}
}
}

Specifying a key for the bucket of other values

GET logs/_search
{
"size": 0,
"aggs": {
"messages": {
"filters": {
"other_bucket_key": "other_messages",
"filters": {
"errors": {
"match": {
"body": "error"
}
},
"warnings": {
"match": {
"body": "warning"
}
}
}
}
}
}
}
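The bucketing logic of the filters aggregation, including the extra bucket named by other_bucket_key, can be sketched in plain Python. The match query is imitated by a crude whole-word check, and the fourth document is a made-up addition so that the other bucket is not empty; neither is part of the sample data above.

```python
import re


def has_token(text, token):
    """Crude stand-in for a match query: lowercase the text and compare whole words."""
    return token in re.findall(r"\w+", text.lower())


def filters_agg(docs, filters, other_bucket_key=None):
    """Count docs per named filter; docs matching no filter go to the other bucket."""
    counts = {name: 0 for name in filters}
    other = 0
    for doc in docs:
        matched = False
        for name, predicate in filters.items():
            if predicate(doc):
                counts[name] += 1
                matched = True
        if not matched:
            other += 1
    if other_bucket_key is not None:
        counts[other_bucket_key] = other
    return counts


docs = [
    {"body": "warning: page could not be rendered"},
    {"body": "authentication error"},
    {"body": "warning: connection timed out"},
    {"body": "info: user logged in"},  # hypothetical extra document
]
filters = {
    "errors": lambda doc: has_token(doc["body"], "error"),
    "warnings": lambda doc: has_token(doc["body"], "warning"),
}
counts = filters_agg(docs, filters, other_bucket_key="other_messages")
print(counts)
```

A document can land in several named buckets if it matches several filters; only documents matching none of them are counted in the other bucket.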

Range Aggregation: group by ranges

POST /bank/_search?size=0
{
"aggs": {
"age_range": {
"range": {
"field": "age",
"ranges": [
{
"to": 25
},
{
"from": 25,
"to": 35
},
{
"from": 35
}
]
},
"aggs": {
"bmax": {
"max": {
"field": "balance"
}
}
}
}
}
}
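One detail worth spelling out is that each range includes its from value and excludes its to value, so a customer aged exactly 25 falls into the second group and not the first. A plain-Python sketch over made-up ages follows; the default key format is written from memory of typical responses and should be treated as an assumption.

```python
def range_agg(values, ranges):
    """Count values per range; `from` is inclusive, `to` is exclusive."""
    buckets = []
    for spec in ranges:
        lower, upper = spec.get("from"), spec.get("to")
        count = sum(
            1
            for v in values
            if (lower is None or v >= lower) and (upper is None or v < upper)
        )
        # Assumed default key format, e.g. "*-25.0" or "25.0-35.0".
        default_key = "{}-{}".format(
            "*" if lower is None else float(lower),
            "*" if upper is None else float(upper),
        )
        buckets.append({"key": spec.get("key", default_key), "doc_count": count})
    return buckets


# Hypothetical ages, including the boundary values 25 and 35.
ages = [20, 24, 25, 30, 34, 35, 40]
buckets = range_agg(ages, [{"to": 25}, {"from": 25, "to": 35}, {"from": 35}])
for bucket in buckets:
    print(bucket)
```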

Specifying a key for each group

POST /bank/_search?size=0
{
"aggs": {
"age_range": {
"range": {
"field": "age",
"keyed": true,
"ranges": [
{
"to": 25,
"key": "Ld"
},
{
"from": 25,
"to": 35,
"key": "Md"
},
{
"from": 35,
"key": "Od"
}
]
}
}
}
}

Date Range Aggregation: group by date ranges

POST /bank/_search?size=0
{
"aggs": {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyyy",
"ranges": [
{
"to": "now-10M/M"
},
{
"from": "now-10M/M"
}
]
}
}
}
}

Date Histogram Aggregation: histogram (bar chart) over time

That is, aggregate by day, month, year, and so on. Buckets can use the intervals year (1y), quarter (1q), month (1M), week (1w), day (1d), hour (1h), minute (1m), second (1s), or a custom time interval.

POST /bank/_search?size=0
{
"aggs": {
"sales_over_time": {
"date_histogram": {
"field": "date",
"interval": "month"
}
}
}
}
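Bucketing by month boils down to rounding every date down to the first day of its month and counting. The sketch below is plain Python with made-up dates; it only lists months that contain at least one document and does not deal with time zones.

```python
from collections import Counter
from datetime import date


def month_histogram(dates):
    """Count dates per calendar month, keyed by the first day of the month."""
    counts = Counter(d.replace(day=1) for d in dates)
    return [
        {"key_as_string": start.isoformat(), "doc_count": count}
        for start, count in sorted(counts.items())
    ]


# Hypothetical dates: two in January 2018, one in March 2018.
dates = [date(2018, 1, 5), date(2018, 1, 20), date(2018, 3, 2)]
histogram = month_histogram(dates)
for bucket in histogram:
    print(bucket)
```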

Missing Aggregation: a bucket for documents with a missing value

POST /bank/_search?size=0
{
"aggs" : {
"account_without_a_age" : {
"missing" : { "field" : "age" }
}
}
}

Geo Distance Aggregation: group by geographic distance

See the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geodistance-aggregation.html