Operating MongoDB with Python: Aggregation Edition

3 years ago

科, 雅


This article describes how to connect to MongoDB from Python and use aggregate (the counterpart of aggregate functions in SQL).
For how to start MongoDB and install pymongo, please see the following article.
https://qiita.com/bc_yuuuuuki/items/2b92598434f6cc320112

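Preparing the data
The data used here is the Qiita article information that was stored in MongoDB in the following article: [Python] Storing Qiita article information in mongoDB.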

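How to use aggregate
For people who are used to SQL, MongoDB's aggregate can be hard to grasp at first. The table below maps SQL clauses and functions to their aggregate counterparts.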

SQL         aggregate
WHERE       $match
GROUP BY    $group
HAVING      $match
SELECT      $project
ORDER BY    $sort
LIMIT       $limit
SUM()       $sum
COUNT()     $sum
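
As a rough illustration of this mapping, here is how one SQL query might be written as an aggregation pipeline. The query and its thresholds are made up for this sketch, and `build_pipeline` is a hypothetical helper; only the field names `tag1` and `page_views_count` come from the data used in this article.

```python
# SQL (illustrative only):
#   SELECT tag1, SUM(page_views_count) AS total, COUNT(*) AS articles
#   FROM qiita
#   WHERE page_views_count >= 100
#   GROUP BY tag1
#   HAVING COUNT(*) >= 2
#   ORDER BY total DESC
#   LIMIT 5

def build_pipeline(min_views=100, min_articles=2, limit=5):
    """Return aggregation stages that correspond to the SQL above."""
    return [
        {"$match": {"page_views_count": {"$gte": min_views}}},    # WHERE
        {"$group": {                                               # GROUP BY
            "_id": "$tag1",
            "total": {"$sum": "$page_views_count"},                # SUM()
            "articles": {"$sum": 1},                               # COUNT()
        }},
        {"$match": {"articles": {"$gte": min_articles}}},          # HAVING
        {"$project": {"_id": 0, "tag": "$_id", "total": 1, "articles": 1}},  # SELECT
        {"$sort": {"total": -1}},                                  # ORDER BY
        {"$limit": limit},                                         # LIMIT
    ]


pipeline = build_pipeline()
# Print only the stage names, in order.
print([next(iter(stage)) for stage in pipeline])
```

Note that `$match` appears twice: before `$group` it plays the role of WHERE, and after `$group` it plays the role of HAVING.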

MongoDB operation class
I am building a class with pymongo that wraps various MongoDB operations.

from pymongo import MongoClient

class MongoSample(object):

    def __init__(self, dbName, collectionName):
        self.client = MongoClient()
        self.db = self.client[dbName]  # select the database
        self.collection = self.db.get_collection(collectionName)

    def aggregate(self, pipeline, **keyword):
        # Forward extra options as keyword arguments instead of a positional dict
        return self.collection.aggregate(pipeline, **keyword)

All this does is prepare a function that calls aggregate.
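
Options such as `allowDiskUse` are passed to pymongo's `aggregate` as keyword arguments, so a wrapper of this kind needs to forward them with `**`. The sketch below uses a stand-in collection (the hypothetical `RecordingCollection`, which is not part of pymongo) so that it runs without a MongoDB server; it only shows how the arguments are forwarded.

```python
class RecordingCollection:
    """Hypothetical stand-in for a pymongo collection; it only records calls."""

    def __init__(self):
        self.calls = []

    def aggregate(self, pipeline, **kwargs):
        self.calls.append((pipeline, kwargs))
        return iter([])  # a real collection would return a cursor


class AggregateWrapper:
    """Wrapper in the style of MongoSample, with the collection injected."""

    def __init__(self, collection):
        self.collection = collection

    def aggregate(self, pipeline, **keyword):
        # Unpack the options so they arrive as keyword arguments.
        return self.collection.aggregate(pipeline, **keyword)


fake = RecordingCollection()
wrapper = AggregateWrapper(fake)
pipeline = [{"$group": {"_id": None, "total": {"$sum": "$page_views_count"}}}]
list(wrapper.aggregate(pipeline, allowDiskUse=True))
print(fake.calls[0][1])  # the options as the collection received them
```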

Retrieving data from MongoDB
First, the code.

from mongo_sample import MongoSample
import pprint
# arg1:DB Name
# arg2:Collection Name
mongo = MongoSample("db", "qiita")

# Maximum
pipeline = [
    {"$group":{ "_id":"title","page_max_view":{"$max":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------Maximum-----------------------------")
pprint.pprint(list(results))

# Minimum
pipeline = [
    {"$group":{ "_id":"title","page_min_view":{"$min":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------Minimum-----------------------------")
pprint.pprint(list(results))

# Average
pipeline = [
    {"$group":{ "_id":"average","page_average_view":{"$avg":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------Average-----------------------------")
pprint.pprint(list(results))

# Total
pipeline = [
    {"$group":{"_id":"page_total_count","total":{"$sum":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------Total-----------------------------")
pprint.pprint(list(results))

# Count how often each tag appears
pipeline = [
    { "$unwind": "$tag_list"}, 
    { "$group": { "_id": "$tag_list", "count": { "$sum":1}}},
    { "$sort": {"count": -1, "_id":1}}
]

results = mongo.aggregate(pipeline)
print("------------------------Tag counts-----------------------------")
pprint.pprint(list(results))

Nothing fancy is going on here. The code retrieves the maximum, minimum, average, and total view counts, plus the number of articles per tag.

pprint is part of the Python standard library, so it does not need to be installed.

Let's compare each of these with the corresponding MongoDB shell operation.

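Maximum / minimum / average / total
First, the MongoDB command.
The maximum is used as the example. Replacing $max with $min, $avg, or $sum yields the minimum, average, or total.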

db.qiita.aggregate([{$group:{_id:"page_max_views",total:{$max:"$page_views_count"}}}])

pipeline = [
    {"$group":{ "_id":"title","page_max_view":{"$max":"$page_views_count"}}}
]

Result

[{'_id': 'title', 'page_max_view': 2461}]

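With this approach, "_id" is fixed to the constant string "title", so every document falls into a single group and the maximum across all records is returned.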
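
To make the role of `_id` concrete, the following pure-Python sketch imitates what a single `$group` stage does with `$max`, `$min`, `$avg`, and `$sum`. It is not how MongoDB is implemented; the helper `group_stat` and the three sample documents are invented for illustration.

```python
from statistics import mean

# Invented sample documents; only the field names come from this article.
docs = [
    {"title": "A", "tag1": "Python", "page_views_count": 2461},
    {"title": "B", "tag1": "Python", "page_views_count": 1137},
    {"title": "C", "tag1": "Java", "page_views_count": 617},
]

ACCUMULATORS = {"$max": max, "$min": min, "$avg": mean, "$sum": sum}


def group_stat(documents, group_id, operator, field="page_views_count"):
    """Imitate {"$group": {"_id": group_id, "value": {operator: "$" + field}}}."""
    buckets = {}
    for doc in documents:
        # "$name" means "use the value of that field"; anything else is a constant.
        if isinstance(group_id, str) and group_id.startswith("$"):
            key = doc[group_id[1:]]
        else:
            key = group_id
        buckets.setdefault(key, []).append(doc[field])
    return {key: ACCUMULATORS[operator](values) for key, values in buckets.items()}


print(group_stat(docs, "title", "$max"))  # constant _id: one group for everything
print(group_stat(docs, "$tag1", "$max"))  # field reference: one group per tag
print(group_stat(docs, "title", "$sum"))  # same grouping, different accumulator
```

Writing `"title"` without the leading `$` is what makes it a constant rather than a reference to the title field.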

However, I want to know which article was read the most, so I would like the article title to be shown as well.

MongoDB command

> db.qiita.aggregate([{$project:{title:1,page_views_count:1}},{$group:{_id:"$title", total:{$max:"$page_views_count"}}},{$sort:{total:-1}}])
{ "_id" : "Pythonでmongodbを操作する～その２：find編～", "total" : 2461 }
{ "_id" : "Pythonでmongodbを操作する～その３：update編～", "total" : 1137 }
{ "_id" : "Pythonでmongodbを操作する～その４：insert編～", "total" : 1102 }
{ "_id" : "pymongoを使った様々な検索条件（AND／OR／部分一致／範囲検索）", "total" : 1019 }
(remaining rows omitted)

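With this command you can see each article's title and page view count.
Of course, because it groups by article title, every group contains a single article, so the aggregation itself is not very meaningful.
If no grouping is needed, it is better to sort and limit with find.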
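
A minimal sketch of that alternative, assuming a pymongo collection object is passed in. The function name `top_articles` is hypothetical; `find`, `sort`, and `limit` are the usual pymongo calls. To keep the example runnable without a server, it is exercised here with a small in-memory stand-in (`FakeCollection` and `FakeCursor`, both invented) rather than a real collection.

```python
def top_articles(collection, n=1):
    """Return the n most viewed articles without any grouping."""
    cursor = (
        collection.find({}, {"_id": 0, "title": 1, "page_views_count": 1})
        .sort("page_views_count", -1)  # -1 means descending
        .limit(n)
    )
    return list(cursor)


class FakeCursor:
    """Tiny stand-in that mimics the sort/limit chaining of a pymongo cursor."""

    def __init__(self, docs):
        self.docs = list(docs)

    def sort(self, key, direction):
        self.docs.sort(key=lambda d: d[key], reverse=(direction == -1))
        return self

    def limit(self, n):
        self.docs = self.docs[:n]
        return self

    def __iter__(self):
        return iter(self.docs)


class FakeCollection:
    """Stand-in collection; ignores the query and applies a simple inclusion projection."""

    def __init__(self, docs):
        self.docs = docs

    def find(self, query, projection):
        wanted = [k for k, v in projection.items() if v == 1]
        return FakeCursor({k: d[k] for k in wanted if k in d} for d in self.docs)


sample = FakeCollection([
    {"_id": 1, "title": "find", "page_views_count": 2461},
    {"_id": 2, "title": "update", "page_views_count": 1137},
    {"_id": 3, "title": "insert", "page_views_count": 1102},
])
print(top_articles(sample, n=2))
```

With the class from this article, the call would look something like `top_articles(mongo.collection, n=5)`.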

Let's get the maximum for each tag instead.

> db.qiita.aggregate([{$group:{_id:"$tag1", total:{$max:"$page_views_count"}}},{$sort:{total:-1}}])
{ "_id" : "Python", "total" : 2461 }
{ "_id" : "Vagrant", "total" : 946 }
{ "_id" : "Java", "total" : 617 }
{ "_id" : "Hyperledger", "total" : 598 }
{ "_id" : "solidity", "total" : 363 }
{ "_id" : "Ethereum", "total" : 347 }
{ "_id" : "ブロックチェーン", "total" : 232 }
{ "_id" : "Blockchain", "total" : 201 }
{ "_id" : "coverage", "total" : 199 }

Good. That worked reasonably well.

For now, let me change the Python code to match.

# Maximum per tag
pipeline = [
    {"$group":{ "_id":"$tag1","page_max_view":{"$max":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------Maximum-----------------------------")
pprint.pprint(list(results))
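
The shell command also sorts by the maximum, which this Python pipeline does not. A sketch of a pipeline with the same ordering follows; the function name `max_views_per_tag_pipeline` is made up, and the sort key is the output field `page_max_view` used in the Python code.

```python
def max_views_per_tag_pipeline():
    """Group by tag1, take the largest view count, and sort descending."""
    return [
        {"$group": {"_id": "$tag1", "page_max_view": {"$max": "$page_views_count"}}},
        {"$sort": {"page_max_view": -1}},
    ]


pipeline = max_views_per_tag_pipeline()
# results = mongo.aggregate(pipeline)  # run against the collection as before
print(pipeline[-1])
```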

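Aggregating by tag
I want to count how many articles have been written for each tag.
The aggregation uses the tag_list field, which looks like this in the data.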

> db.qiita.find({},{_id:0,tag_list:1})
{ "tag_list" : [ "Python", "MongoDB", "Python3", "pymongo" ] }
{ "tag_list" : [ "Python", "Python3" ] }
{ "tag_list" : [ "Python", "Python3", "Blockchain", "ブロックチェーン", "Hyperledger-Iroha" ] }
{ "tag_list" : [ "Blockchain", "ブロックチェーン", "Hyperledger-Iroha" ] }
{ "tag_list" : [ "Blockchain", "Ethereum", "Hyperledger", "Hyperledger-sawtooth" ] }
{ "tag_list" : [ "ブロックチェーン", "Hyperledger", "Hyperledger-sawtooth" ] }
{ "tag_list" : [ "Java", "ブロックチェーン", "Hyperledger", "Hyperledger-Iroha" ] }
{ "tag_list" : [ "ブロックチェーン", "Hyperledger", "Hyperledger-Iroha" ] }
{ "tag_list" : [ "Java", "Ethereum", "ブロックチェーン", "Hyperledger", "Hyperledger-Iroha" ] }
{ "tag_list" : [ "Java", "ブロックチェーン", "Hyperledger", "Hyperledger-Iroha" ] }
{ "tag_list" : [ "Hyperledger", "Hyperledger-Iroha", "Hyperledger-burrow", "Hyperledger-sawtooth", "Hyperledger-besu" ] }
{ "tag_list" : [ "Vagrant", "VirtualBox", "Hyper-V" ] }
{ "tag_list" : [ "Java", "Ethereum", "solidity", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Java", "Ethereum", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Java", "Ethereum", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Java", "Ethereum", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Java", "Ethereum", "solidity", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Java", "Ethereum", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Java", "Ethereum", "ブロックチェーン", "web3j" ] }
{ "tag_list" : [ "Ethereum", "ブロックチェーン" ] }

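Aggregating data stored in this format would be quite a hassle in SQL.

With MongoDB's $unwind, list-valued data can be split into one document per element and then aggregated.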

> db.qiita.aggregate([ { $project:{tag_list:1}}, { $unwind: "$tag_list"}, { $group: { _id: "$tag_list", count: { $sum:1}}},{ $sort: {"count": -1, "_id":1}} ])
{ "_id" : "ブロックチェーン", "count" : 16 }
{ "_id" : "Ethereum", "count" : 11 }
{ "_id" : "Java", "count" : 10 }
{ "_id" : "Python", "count" : 9 }
{ "_id" : "Python3", "count" : 9 }
{ "_id" : "Hyperledger", "count" : 7 }
{ "_id" : "Hyperledger-Iroha", "count" : 7 }
{ "_id" : "MongoDB", "count" : 7 }
{ "_id" : "web3j", "count" : 7 }
{ "_id" : "solidity", "count" : 4 }
{ "_id" : "Blockchain", "count" : 3 }
{ "_id" : "Hyperledger-sawtooth", "count" : 3 }
{ "_id" : "Hyper-V", "count" : 1 }
{ "_id" : "Hyperledger-besu", "count" : 1 }
{ "_id" : "Hyperledger-burrow", "count" : 1 }
{ "_id" : "Vagrant", "count" : 1 }
{ "_id" : "VirtualBox", "count" : 1 }
{ "_id" : "coverage", "count" : 1 }
{ "_id" : "pymongo", "count" : 1 }
{ "_id" : "truffle", "count" : 1 }
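
To show what `$unwind` followed by `$group` and `$sort` is doing, here is a pure-Python imitation run on the first four `tag_list` values listed earlier. The helpers `unwind` and `count_tags` are invented for this sketch and are not MongoDB's implementation.

```python
from collections import Counter

# The first four tag_list values from the find() output in this article.
docs = [
    {"tag_list": ["Python", "MongoDB", "Python3", "pymongo"]},
    {"tag_list": ["Python", "Python3"]},
    {"tag_list": ["Python", "Python3", "Blockchain", "ブロックチェーン", "Hyperledger-Iroha"]},
    {"tag_list": ["Blockchain", "ブロックチェーン", "Hyperledger-Iroha"]},
]


def unwind(documents, field):
    """Yield one copy of each document per element of the list in `field`."""
    for doc in documents:
        for item in doc.get(field, []):
            yield {**doc, field: item}


def count_tags(documents):
    """Count unwound tags, then order by count descending and tag ascending."""
    counts = Counter(doc["tag_list"] for doc in unwind(documents, "tag_list"))
    # Mirrors {"$sort": {"count": -1, "_id": 1}}
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"_id": tag, "count": n} for tag, n in ordered]


for row in count_tags(docs):
    print(row)
```

The four documents become 14 single-tag documents after unwinding, and the grouping then works exactly like a GROUP BY on an ordinary column.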

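The Python code earlier in this article does not include { "$project": {"tag_list": 1}}.
The result is the same with or without it.
I am not really sure how this $project is meant to be used.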
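
For what it's worth, `$project` chooses which fields are passed on to the next stage. In this pipeline, `$unwind` and `$group` only ever read `tag_list`, so dropping the other fields first cannot change the counts, which would explain the identical result. The pure-Python sketch below imitates this with invented sample documents and hypothetical helpers (`project`, `tag_counts`); it is not MongoDB's implementation.

```python
from collections import Counter

# Invented sample documents; only the field names come from this article.
docs = [
    {"title": "A", "page_views_count": 10, "tag_list": ["Python", "MongoDB"]},
    {"title": "B", "page_views_count": 20, "tag_list": ["Python"]},
]


def project(documents, fields):
    """Keep only the listed fields, like an inclusion-style $project.

    (The real stage also keeps _id unless it is explicitly excluded.)
    """
    return [{k: d[k] for k in fields if k in d} for d in documents]


def tag_counts(documents):
    """Imitate $unwind on tag_list followed by $group with {"$sum": 1}."""
    return Counter(tag for d in documents for tag in d["tag_list"])


with_project = tag_counts(project(docs, ["tag_list"]))
without_project = tag_counts(docs)
print(with_project == without_project)
```

The stage becomes useful when later stages or the final output should see fewer fields, or renamed ones.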

Impressions
If you are used to SQL, some parts are hard to grasp at first, but features such as $unwind make flexible aggregation queries possible.