Let's try searching and aggregating with Python and Elasticsearch

Contents

1. Introduction
2. Preparation
3. Search
4. Aggregation
5. Cleanup
6. Closing

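1. Introduction

This article is the day 13 entry of the RevComm Advent Calendar 2020. Day 12 was "Trying WebSocket communication with Apollo Client" by tomohiro86.

RevComm develops MiiTel, an AI-powered phone service with voice analysis.
MiiTel provides a full transcription of each conversation and lets you search the transcribed text by word.

This article shows how to search and aggregate with Python (elasticsearch_dsl) and Elasticsearch (AWS).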

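2. Preparation

Create an IAM user

Starting from the "Add user" button, create a user with access to the AWS Elasticsearch Service. Note down the access key ID, the secret access key, and the ARN.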
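
When these keys are later used from a script, one option is to read them from environment variables rather than pasting them into the source. A minimal sketch, assuming the conventional AWS variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set; load_credentials is a hypothetical helper, not part of any library:

```python
import os


def load_credentials(env=os.environ):
    """Return (access_key_id, secret_access_key) read from env.

    load_credentials is a hypothetical helper written for this article.
    """
    try:
        return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as exc:
        # Name the missing variable so the failure is easy to diagnose.
        raise RuntimeError(f"missing environment variable: {exc.args[0]}") from exc
```

The returned pair can be passed as the first two arguments of AWS4Auth in place of literal strings.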

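Create an Elasticsearch domain

Prepare the data

I compiled the top 10 all-time box office grosses in Japan. The income values are in units of 100 million yen.
The data is as of November 29, 2020. (How high will Demon Slayer go?)
Source: http://www.kogyotsushin.com/archives/alltime/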

[
    {
        "title": "千と千尋の神隠し",
        "company": "東宝",
        "type": "animation",
        "year": 2011,
        "rank": 1,
        "income": 308.0
    },
    {
        "title": "劇場版「鬼滅の刃」無限列車編",
        "company": "東宝",
        "type": "animation",
        "year": 2020,
        "rank": 2,
        "income": 275.1
    },
    {
        "title": "タイタニック",
        "company": "FOX",
        "type": "live_action",
        "year": 1997,
        "rank": 3,
        "income": 262.0
    },
    {
        "title": "アナと雪の女王",
        "company": "ディズニー",
        "type": "animation",
        "year": 2014,
        "rank": 4,
        "income": 255.0
    },
    {
        "title": "君の名は。",
        "company": "東宝",
        "type": "animation",
        "year": 2016,
        "rank": 5,
        "income": 250.3
    },
    {
        "title": "ハリー・ポッターと賢者の石",
        "company": "ワーナー",
        "type": "live_action",
        "year": 2001,
        "rank": 6,
        "income": 203.0
    },
    {
        "title": "ハウルの動く城",
        "company": "東宝",
        "type": "animation",
        "year": 2004,
        "rank": 7,
        "income": 196.0
    },
    {
        "title": "もののけ姫",
        "company": "東宝",
        "type": "animation",
        "year": 1997,
        "rank": 8,
        "income": 193.0
    },
    {
        "title": "踊る大捜査線 THE MOVIE2 レインボーブリッジを封鎖せよ！",
        "company": "東宝",
        "type": "live_action",
        "year": 2003,
        "rank": 9,
        "income": 173.5
    },
    {
        "title": "ハリー・ポッターと秘密の部屋",
        "company": "ワーナー",
        "type": "live_action",
        "year": 2002,
        "rank": 10,
        "income": 173.0
    }
]

Install the libraries

pip install elasticsearch elasticsearch_dsl requests_aws4auth

Add the data

Let's use a Python script to add the data to the Elasticsearch domain we prepared.

import json
from elasticsearch import Elasticsearch, RequestsHttpConnection, helpers
from requests_aws4auth import AWS4Auth

HOST = 'search-test-foobar.ap-northeast-1.es.amazonaws.com'
awsauth = AWS4Auth(
    'ACCESS_KEY_ID',      # replace with your access key ID
    'SECRET_ACCESS_KEY',  # replace with your secret access key
    'ap-northeast-1',
    'es'
)

es = Elasticsearch(
    hosts=[{'host': HOST, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

def generate():

    # The titles are Japanese, so read the file explicitly as UTF-8.
    with open('movies.json', 'r', encoding='utf-8') as f:
        movies = json.load(f)

    for movie in movies:
        yield {
            "_op_type": "create",
            "_index": "movies",
            "_source": movie
        }

helpers.bulk(es, generate())
print(es.count(index="movies")["count"])
# Result
# 10  (if it is not 10, wait a moment and try again)
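
Newly indexed documents become visible to counts and searches only after the index refreshes, which is why the count can briefly be lower than 10. Instead of retrying by hand, a small polling helper can wait for the expected number. This is a sketch: wait_for_count is a hypothetical helper, and get_count is any zero-argument callable, for example lambda: es.count(index="movies")["count"].

```python
import time


def wait_for_count(get_count, expected, retries=10, delay=1.0):
    """Poll get_count() until it returns expected.

    Returns True as soon as the count matches, or False once all
    retries are used up. wait_for_count is a hypothetical helper.
    """
    for attempt in range(retries):
        if get_count() == expected:
            return True
        # Sleep between attempts, but not after the last one.
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

Calling es.indices.refresh(index="movies") right after the bulk request is another option, since it forces the refresh instead of waiting for it.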

Make sure fielddata is set to true for the text fields.
Without it, text fields cannot be used in aggregations.

from pprint import pprint

MAPPING = {
    "properties": {
        "company": {"fields": {"keyword": {"ignore_above": 256, "type": "keyword"}}, "type": "text", "fielddata": True},
        "title": {"fields": {"keyword": {"ignore_above": 256, "type": "keyword"}}, "type": "text", "fielddata": True},
        "type": {"fields": {"keyword": {"ignore_above": 256, "type": "keyword"}}, "type": "text", "fielddata": True},
    }
}

es.indices.put_mapping(index="movies", body=MAPPING)
pprint(es.indices.get_mapping()["movies"])
# Result
# {'mappings': {'properties': {'company': {'fielddata': True,
#                                         'fields': {'keyword': {'ignore_above': 256,
#                                                                'type': 'keyword'}},
#                                         'type': 'text'},
#                             'income': {'type': 'float'},
#                             'rank': {'type': 'long'},
#                             'title': {'fielddata': True,
#                                       'fields': {'keyword': {'ignore_above': 256,
#                                                              'type': 'keyword'}},
#                                       'type': 'text'},
#                             'type': {'fielddata': True,
#                                      'fields': {'keyword': {'ignore_above': 256,
#                                                             'type': 'keyword'}},
#                                      'type': 'text'},
#                             'year': {'type': 'long'}}}}

Preparation is complete.

3. Search

Now let's search.
Elasticsearch documentation
elasticsearch_dsl documentation

from elasticsearch_dsl import Search

s = Search(using=es).sort('rank').query("match", type="animation").query("match", company="東宝")
for hit in s:
    print(hit.title)

# Result
# 千と千尋の神隠し
# 劇場版「鬼滅の刃」無限列車編
# 君の名は。
# ハウルの動く城
# もののけ姫
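
Chaining two .query() calls combines them with AND. As far as I understand elasticsearch_dsl, the search above is sent as a bool query with both match clauses under must. The dictionary below is a hand-written sketch of that request body, not output captured from the library; Search.to_dict() shows the body the library actually builds.

```python
import json

# Hand-written approximation of the request body for the chained search:
# both match clauses must hold, and hits are sorted by rank.
body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"type": "animation"}},
                {"match": {"company": "東宝"}},
            ]
        }
    },
    "sort": ["rank"],
}

print(json.dumps(body, ensure_ascii=False, indent=2))
```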

- s = Search(using=es).sort('rank').query("match", type="animation").query("match", company="東宝")
+ s = Search(using=es).filter("range", year={"lte": 2000})

# Result
# タイタニック
# もののけ姫
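
.filter() adds the clause in filter context, so it narrows the hits without contributing to the relevance score. The range condition itself is easy to check against the source data. A sketch using (title, year) pairs copied by hand from the movie list; years_up_to is a hypothetical helper:

```python
# (title, year) pairs copied from the data set, in rank order
MOVIES = [
    ("千と千尋の神隠し", 2011),
    ("劇場版「鬼滅の刃」無限列車編", 2020),
    ("タイタニック", 1997),
    ("アナと雪の女王", 2014),
    ("君の名は。", 2016),
    ("ハリー・ポッターと賢者の石", 2001),
    ("ハウルの動く城", 2004),
    ("もののけ姫", 1997),
    ("踊る大捜査線 THE MOVIE2 レインボーブリッジを封鎖せよ！", 2003),
    ("ハリー・ポッターと秘密の部屋", 2002),
]


def years_up_to(movies, limit):
    """Return titles whose year is <= limit, like {"range": {"year": {"lte": limit}}}."""
    return [title for title, year in movies if year <= limit]


print(years_up_to(MOVIES, 2000))
# ['タイタニック', 'もののけ姫']
```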

from elasticsearch_dsl import Search, Q

q = Q("multi_match", query='ハリー アナ', fields=['title'])

s = Search(using=es).sort('rank').query(q)
for hit in s:
    print(hit.title)

# Result
# アナと雪の女王
# ハリー・ポッターと賢者の石
# ハリー・ポッターと秘密の部屋

- q = Q("multi_match", query='ハリー アナ', fields=['title'])
+ q = Q('bool', must=[Q("match", company="東宝"), ~Q("match", title="の")])

# Result
# 踊る大捜査線 THE MOVIE2 レインボーブリッジを封鎖せよ！
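
The ~ operator negates a query; in elasticsearch_dsl it wraps the query in a bool query with must_not. The condition can be mimicked locally to see why only one title remains. This is only an approximation: it uses a substring test where Elasticsearch compares analyzed tokens, and TOHO_TITLES is a hand-copied list of the 東宝 titles from the data. without_term is a hypothetical helper.

```python
# Titles of the six 東宝 movies, copied from the data set
TOHO_TITLES = [
    "千と千尋の神隠し",
    "劇場版「鬼滅の刃」無限列車編",
    "君の名は。",
    "ハウルの動く城",
    "もののけ姫",
    "踊る大捜査線 THE MOVIE2 レインボーブリッジを封鎖せよ！",
]


def without_term(titles, term):
    """Keep titles that do not contain term (rough stand-in for must_not + match)."""
    return [title for title in titles if term not in title]


print(without_term(TOHO_TITLES, "の"))
# ['踊る大捜査線 THE MOVIE2 レインボーブリッジを封鎖せよ！']
```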

4. Aggregation

Next, let's aggregate.
Elasticsearch documentation
elasticsearch_dsl documentation

First, let's try the terms aggregation.

from pprint import pprint
from elasticsearch_dsl import Search, A

s = Search(using=es)

a = A('terms',  field='company')
s.aggs.bucket('results', a)
resp = s.execute()
# to_dict() returns the plain dict without reaching into the private _d_ attribute.
pprint(resp.aggregations.to_dict()['results']['buckets'])

# Result
# [{'doc_count': 6, 'key': '宝'},
#  {'doc_count': 6, 'key': '東'},
#  {'doc_count': 2, 'key': 'ワーナー'},
#  {'doc_count': 1, 'key': 'fox'},
#  {'doc_count': 1, 'key': 'ディズニー'}]

Now we know that 東宝 has 6 films, ワーナー has 2, and FOX and ディズニー have 1 each!
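
The separate 東 and 宝 buckets appear because company is an analyzed text field, so 東宝 is split into two tokens and FOX is lowercased. Aggregating on the company.keyword sub-field defined in the mapping should group by the whole, unmodified value instead. That variant was not run against the cluster here, so the snippet below only counts the values locally with collections.Counter to show the buckets one would expect; COMPANIES is copied by hand from the data and terms_buckets is a hypothetical helper.

```python
from collections import Counter

# company values copied from the ten movies, in rank order
COMPANIES = ["東宝", "東宝", "FOX", "ディズニー", "東宝",
             "ワーナー", "東宝", "東宝", "東宝", "ワーナー"]


def terms_buckets(values):
    """Mimic the shape of a terms aggregation: buckets sorted by doc_count, descending."""
    return [{"key": key, "doc_count": count}
            for key, count in Counter(values).most_common()]


for bucket in terms_buckets(COMPANIES):
    print(bucket)
```

With elasticsearch_dsl the corresponding change would be A('terms', field='company.keyword').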

- a = A('terms',  field='company')
+ a = A('terms',  field='type')

# Result
# [{'doc_count': 6, 'key': 'animation'},
#  {'doc_count': 4, 'key': 'live_action'}]

Now we know there are 6 animated films and 4 live-action films!

- a = A('terms',  field='type')
+ a = A('histogram', field='year', interval='5', offset='1901')

# Result
# [{'doc_count': 2, 'key': 1996.0},
#  {'doc_count': 4, 'key': 2001.0},
#  {'doc_count': 0, 'key': 2006.0},
#  {'doc_count': 2, 'key': 2011.0},
#  {'doc_count': 2, 'key': 2016.0}]

So the period from 2001 to 2005 has the most films, with 4 in total!

There are many other aggregations, so check the documentation and give them a try.
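
The bucket keys follow from the interval and offset: the Elasticsearch documentation gives the key as floor((value - offset) / interval) * interval + offset. The sketch below applies that formula to the year values from the data (copied by hand) to reproduce the counts. histogram_counts is a hypothetical helper, and unlike Elasticsearch it does not emit the empty 2006 bucket.

```python
import math
from collections import Counter

# year values copied from the ten movies, in rank order
YEARS = [2011, 2020, 1997, 2014, 2016, 2001, 2004, 1997, 2003, 2002]


def histogram_counts(values, interval, offset):
    """Count values per bucket key, using the histogram bucket key formula."""
    keys = (math.floor((v - offset) / interval) * interval + offset for v in values)
    # Sort by key so the buckets come out in chronological order.
    return dict(sorted(Counter(keys).items()))


print(histogram_counts(YEARS, interval=5, offset=1901))
# {1996: 2, 2001: 4, 2011: 2, 2016: 2}
```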

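5. Cleanup

Finally, remember to delete the IAM user and the Elasticsearch domain you just created.

Delete the IAM user

Check the user you want to delete, then click the delete button: https://console.aws.amazon.com/iam/home?#/users

Delete the Elasticsearch domain

Click the target domain, then choose Actions > Delete domain.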

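6. Closing

I had never used Elasticsearch before, but at our company, if you say you want to try something, you are given an environment to try it in. I'm glad I had the chance to take this on. Next year I hope to keep taking on new challenges and keep growing!

Tomorrow is "Algorithms running behind RevComm [binary search edition]" by tatakahashi35. Stay tuned!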