Creating JIRA tickets when alerts fire, using the webhook_configs feature of Prometheus/Alertmanager

Introduction

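When an alert fires, I want it to trigger some action automatically.

For example, kicking off a script for auto-healing.

This time I used Alertmanager's webhook_configs to get alerts managed as tickets in JIRA.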

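What is Alertmanager's webhook_configs?

The official documentation describes it under webhook_config on the Configuration page: it sends alert information to an arbitrary webhook endpoint. The key you actually write in the configuration file is webhook_configs, but the heading in the official documentation has no trailing s.

Alertmanager POSTs the alert information as JSON to the endpoint specified in webhook_config. An example of the payload: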

{
  "receiver": "webhook-trouble-handler",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "sample_error",
        "category": "pushgateway",
        "channel": "sample",
        "environment": "hoge-env",
        "exported_instance": "TEST_INSTANCE",
        "exported_job": "sample_exporter",
        "instance": "localhost:9091",
        "job": "pushgateway",
        "severity": "critical"
      },
      "annotations": {
        "resolved_text": "SAMPLE is OK.",
        "summary": "SAMPLE is NG."
      },
      "startsAt": "2017-12-10T23:20:08.822+09:00",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://localhost/prometheus/graph?XXXXXXXXXXXXXXXX"
    }
  ],
  "groupLabels": {
    "alertname": "sample_error",
    "channel": "sample",
    "instance": "localhost:9091",
    "job": "pushgateway"
  },
  "commonLabels": {
    "alertname": "sample_error",
    "category": "pushgateway",
    "channel": "sample",
    "environment": "hoge-env",
    "exported_instance": "TEST_INSTANCE",
    "exported_job": "sample_exporter",
    "instance": "localhost:9091",
    "job": "pushgateway",
    "severity": "critical"
  },
  "commonAnnotations": {
    "resolved_text": "SAMPLE is OK.",
    "summary": "SAMPLE is NG."
  },
  "externalURL": "/alertmanager",
  "version": "3",
  "groupKey": "000000000000000000"
}
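
To make the structure of this payload concrete, here is a minimal sketch that walks the alerts array and pulls out a few fields per alert. The function name summarize_alerts and the selection of fields are assumptions for illustration, not part of Alertmanager. Lookups use .get because labels and annotations depend on how each alert rule is written.

```python
def summarize_alerts(payload):
    """Return one small dict per alert in an Alertmanager webhook payload."""
    summaries = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        summaries.append({
            "status": alert.get("status"),
            "alertname": labels.get("alertname"),
            "severity": labels.get("severity"),
            # Not every alert rule defines a summary annotation.
            "summary": annotations.get("summary", ""),
            "starts_at": alert.get("startsAt"),
        })
    return summaries
```

A payload can carry several alerts in one POST because Alertmanager groups them, so iterating over the whole array is usually safer than reading only the first element.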

What to do

Two things are needed: add a webhook_configs setting to the Alertmanager configuration file, and prepare the endpoint it points to (implemented here in Python as webhook_reciever.py).

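1. Alertmanager configuration

Define the webhook-trouble-handler receiver.

Define the routes with continue: true

Besides creating a JIRA ticket I also want to notify Slack, so the alert has to keep matching and branch to a route for another receiver as well (continue: true makes Alertmanager go on to evaluate the following sibling routes)

Define webhook_configs with the address where webhook_reciever.py listens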

...
  routes:
  - match:
      channel: sample
    routes:
    - match:
        severity: critical
      receiver: webhook-trouble-handler
      repeat_interval: 1680h
      continue: true
...
- name: 'webhook-trouble-handler'
  webhook_configs:
    - url: 'http://localhost:9083'
      send_resolved: true
...

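2. Implementing webhook_reciever.py

It does the following:
– Receives POST requests from Alertmanager and handles the body as JSON alert information.
– Creates a JIRA ticket from the alert information.
– Runs as a simple web server listening on port 9083.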

import json
import logging
from http.server import BaseHTTPRequestHandler
from http.server import HTTPServer
from jira import JIRA # install this package by pip in advance

logging.basicConfig(level=logging.DEBUG, format="%(asctime)-15s %(message)s")
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True

class TroubleHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        self.send_response(200)
        self.end_headers()
        data = json.loads(self.rfile.read(int(self.headers['Content-Length'])))

        alert_data = self.build_alert_data(data)
        logging.info("received data:%s" % alert_data)
        self.create_alert_jira_issue(alert_data, data["status"])

    def build_alert_data(self, data):
        # customize by your metrics
        alert_data = {
            "status": data["status"],
            "alertname": data["alerts"][0]["labels"]["alertname"],
            "starts_at": data["alerts"][0]["startsAt"],
            "summary": data["alerts"][0]["annotations"]["summary"],
            "group_key": data["groupKey"]
        }
        return alert_data

    def create_alert_jira_issue(self, alert_data, alert_status):
        j = JiraPoster()
        j.create_alert_jira(alert_data)

class JiraPoster():
    def __init__(self):
        # fill your JIRA info
        server = "https://***********.atlassian.net/"
        basic_auth = ('*************', '***********')
        self.jira = JIRA(server=server, basic_auth=basic_auth)

    def create_alert_jira(self, data):
        # customize as you want
        issue_dict = {
            'project': {"key": "TEST"},
            'summary': "[ALERT] %s" % data["summary"],
            'description': "h4.alertname\n%s\nh4.starts at\n%s\nh4.summary\n%s\nh4.group key\n%s" % (data["alertname"], data["starts_at"], data["summary"], data["group_key"]),
            'issuetype': {'name': 'Task'},
        }
        # Only create a ticket while the alert is firing; resolved notifications are ignored here.
        if data["status"] == "firing":
            self.jira.create_issue(fields=issue_dict)

if __name__ == "__main__":
    httpd = HTTPServer(('', 9083), TroubleHandler)
    httpd.serve_forever()

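For practical use you would typically add things like the following; I won't go into detail here.

Use the group key so that, on resolved, a comment is added to the ticket or the ticket status is changed
If the monitored targets have a concept like a host name, search for it with JQL to check for an existing ticket and, if one exists, only add a comment instead of creating a new ticket, to save tickets
If Kibana or an admin API is available, fetch the logs and state from the time of the alert and append them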
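
The first two ideas could be sketched roughly as follows. This is one possible approach under stated assumptions, not the implementation used in this article: the helper names group_label and handle_alert are hypothetical, the group key is hashed into a JIRA label because raw group keys can contain characters that are awkward in JQL, and the project key TEST is carried over from the example above. The jira argument is expected to be a client from the jira package; only search_issues, add_comment and create_issue are used.

```python
import hashlib


def group_label(group_key):
    """Derive a short, JQL-friendly label from an Alertmanager groupKey."""
    digest = hashlib.sha1(group_key.encode("utf-8")).hexdigest()
    return "alert-" + digest[:12]


def handle_alert(jira, alert_data, project_key="TEST"):
    """Create a ticket for a new alert group, or comment on the open one."""
    label = group_label(alert_data["group_key"])
    jql = 'project = "%s" AND labels = "%s" AND statusCategory != Done' % (
        project_key, label)
    existing = jira.search_issues(jql, maxResults=1)

    if alert_data["status"] == "firing":
        if existing:
            # An open ticket already tracks this group: comment instead of duplicating.
            jira.add_comment(existing[0], "Still firing: %s" % alert_data["summary"])
            return existing[0]
        return jira.create_issue(fields={
            "project": {"key": project_key},
            "summary": "[ALERT] %s" % alert_data["summary"],
            "issuetype": {"name": "Task"},
            "labels": [label],
        })

    # Resolved notification: note it on the open ticket, if there is one.
    if existing:
        jira.add_comment(existing[0], "Alert resolved.")
        return existing[0]
    return None
```

Changing the ticket status on resolve would need a workflow transition, which depends on how the JIRA project is configured, so it is left out of this sketch.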

Verification

It worked!
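
If you want to exercise the receiver without waiting for a real alert, one option is to POST a hand-built payload to it. The sketch below uses only the standard library; build_test_payload and post_alert are hypothetical helper names, and the payload contains just the fields that webhook_reciever.py reads. Be aware that sending a firing payload to the real receiver creates a real JIRA ticket.

```python
import json
import urllib.request


def build_test_payload(status="firing"):
    """Build a minimal payload with the fields webhook_reciever.py reads."""
    return {
        "status": status,
        "alerts": [
            {
                "status": status,
                "labels": {"alertname": "sample_error"},
                "annotations": {"summary": "SAMPLE is NG."},
                "startsAt": "2017-12-10T23:20:08.822+09:00",
            }
        ],
        "groupKey": "000000000000000000",
    }


def post_alert(url, payload, timeout=5):
    """POST the payload as JSON and return the HTTP status code."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urllib adds the Content-Length header, which the receiver relies on.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With the receiver running locally, post_alert("http://localhost:9083", build_test_payload()) sends the test alert to the port configured in webhook_configs above.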

Wrap-up

Being able to customize what happens after a Prometheus alert in an event-driven way is nice.

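Open questions

Whether it is cleaner operationally to dispatch processing in Alertmanager or inside the webhook
If a custom exporter already watches my own service's API and the webhook also wants to fetch extra information from it, the service's API ends up being called from several places for monitoring purposes, which feels awkward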
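
For comparison, dispatching inside the webhook could look something like the sketch below, where Alertmanager sends everything to one receiver and the Python side picks the actions from a table keyed by severity. All names here (create_ticket, notify_chat, HANDLERS, dispatch) are hypothetical, and the two handlers are placeholders that only return strings. Note that commonLabels holds only the labels shared by every alert in the group, so severity may be missing when a group mixes severities.

```python
def create_ticket(payload):
    # Placeholder for the JIRA ticket creation.
    return "ticket: %s" % payload["commonLabels"].get("alertname")


def notify_chat(payload):
    # Placeholder for a chat notification.
    return "chat: %s" % payload["status"]


# Which actions run for which severity; adjust to your own labels.
HANDLERS = {
    "critical": [create_ticket, notify_chat],
    "warning": [notify_chat],
}


def dispatch(payload, handlers=HANDLERS):
    """Run every handler registered for the group's common severity label."""
    severity = payload.get("commonLabels", {}).get("severity")
    return [handler(payload) for handler in handlers.get(severity, [])]
```

The trade-off is that the routing logic then lives in two places, the Alertmanager routes and this table, which is exactly the operational question raised above.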

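Future work

For trouble-ticket management like this, a PagerDuty-like feature such as using Slack reactions to show who is handling an alert also looks easy to implement, so that is future work.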

That's all.