Making an Application Directly Support Prometheus Metrics

3 years ago

宇, 华


Introduction

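Prometheus is a pull-based monitoring system: the Prometheus server collects metrics by accessing its monitoring targets over HTTP (scraping).
To be monitored by Prometheus, a target therefore has to expose an HTTP endpoint that serves metrics in the Prometheus format (instrumentation).
For many pieces of software, an external program called an Exporter already exists that outputs metrics in the Prometheus format.
List of Exporters
In addition, applications such as Kubernetes, Etcd, and SkyDNS support Prometheus-format metric output directly.
List of software that exposes Prometheus-format metrics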

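For metrics that neither of these covers, there are two options:
1. Write your own Exporter.
2. Have the application itself support Prometheus metric output directly.

If you cannot (or do not want to) modify the application you are monitoring, you need to write an Exporter. An Exporter is easy to build and leaves the application free of any Prometheus dependency, but it can only collect metrics that are obtainable from outside the application. Direct support, on the other hand, lets you collect more detailed metrics from inside the application, provided of course that you can modify it. In this article we assume that you want to collect application-specific metrics, and we try out adding Prometheus metric output directly to the application itself.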

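Prometheus client libraries

Whether you write an Exporter or add direct support, the most convenient way to output metrics in the Prometheus format is to use a Prometheus client library. The client libraries implement the Prometheus metric types, so creating Prometheus-format metrics with them is straightforward.

The naming may differ from language to language, but the metric-collection part of each library has roughly the following structure.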

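Collector … the interface through which each metric is retrieved

Counter … a Collector implementation for collecting Counter-type metrics

Gauge … a Collector implementation for collecting Gauge-type metrics

Histogram … a Collector implementation for collecting Histogram-type metrics

Summary … a Collector implementation for collecting Summary-type metrics

CollectorRegistry … the class that ties the Collectors together. Each time it is scraped, it calls the callback (e.g., collect) of every registered Collector to collect and output the metrics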

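You prepare a Collector for each metric you want to collect and register it with the CollectorRegistry. When a scrape happens, the callbacks of the Collectors registered with the CollectorRegistry are called to collect and output the metrics.

That is the basic flow.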
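
The flow described above can be sketched in plain Go without any library. This is only an illustration of the Collector/registry idea, not how the official client library is implemented; the types Sample, Collector, Counter, and Registry below are hypothetical names invented for this sketch.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Sample is one metric value to be exposed (hypothetical type for this sketch).
type Sample struct {
	Name  string
	Value float64
}

// Collector is anything that can report its current samples on demand.
type Collector interface {
	Collect() []Sample
}

// Counter is a Collector holding a single, only-increasing value.
type Counter struct {
	mu    sync.Mutex
	name  string
	value float64
}

func NewCounter(name string) *Counter { return &Counter{name: name} }

// Inc adds one to the counter.
func (c *Counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value++
}

// Collect reports the current value; the registry calls it on every scrape.
func (c *Counter) Collect() []Sample {
	c.mu.Lock()
	defer c.mu.Unlock()
	return []Sample{{Name: c.name, Value: c.value}}
}

// Registry keeps the registered collectors together.
type Registry struct {
	mu         sync.Mutex
	collectors []Collector
}

// Register adds a collector to the registry.
func (r *Registry) Register(c Collector) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.collectors = append(r.collectors, c)
}

// Gather calls every collector's callback and renders "name value" lines,
// sorted so that the output is stable.
func (r *Registry) Gather() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	var lines []string
	for _, c := range r.collectors {
		for _, s := range c.Collect() {
			lines = append(lines, fmt.Sprintf("%s %g", s.Name, s.Value))
		}
	}
	sort.Strings(lines)
	return strings.Join(lines, "\n") + "\n"
}

func main() {
	reg := &Registry{}
	errs := NewCounter("greeting_error_count_total")
	reg.Register(errs)

	errs.Inc()
	errs.Inc()
	fmt.Print(reg.Gather()) // roughly what a scrape would return
}
```

The point of the design is that application code only touches the Counter, while the registry decides when values are read, namely at scrape time.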

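Implementation example in Go

This time we will use the Go client library.
As the target we take a simple web server and add support for Prometheus-format metric output to it.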

package main

import (
    "fmt"
    "log"
    "math/rand"
    "net/http"
    "time"
)

func main() {
    // helloHandler replies with one of three greetings or an error message,
    // chosen at random, and logs what it returned.
    helloHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        rand.Seed(time.Now().UnixNano())
        switch rand.Intn(4) {
        case 0:
            log.Println("Hello!")
            fmt.Fprint(w, "Hello!")
        case 1:
            log.Println("Hi!")
            fmt.Fprint(w, "Hi!")
        case 2:
            log.Println("Hey!")
            fmt.Fprint(w, "Hey!")
        case 3:
            log.Println("Error!")
            fmt.Fprint(w, "Error!!")
        }
    })
    http.Handle("/", helloHandler)
    // Serve on port 8080; ListenAndServe only returns on error.
    log.Fatal(http.ListenAndServe(":8080", nil))
}

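Source code (Gist)

On every access, this web server simply returns one of three greetings or an error message, chosen at random. So that we can check afterwards what was returned, it also writes log output like the following.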

% go run httpserver.go
2017/12/07 17:29:23 Hello!
2017/12/07 17:29:31 Hey!
2017/12/07 17:29:33 Hi!
2017/12/07 17:29:34 Error!
2017/12/07 17:29:36 Hi!
2017/12/07 17:29:37 Hello!
2017/12/07 17:29:39 Hi!

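Step 1: Count the errors

The number of failures is an important metric, so we first collect the number of times "Error!" occurs using a Counter.

Reference (on instrumenting failures): https://prometheus.io/docs/practices/instrumentation/#failures

Add the following code.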

% git diff 14eb176 daa222e
diff --git a/httpserver.go b/httpserver.go
index a4b223e..fea3c6e 100644
--- a/httpserver.go
+++ b/httpserver.go
@@ -6,8 +6,22 @@ import (
        "math/rand"
        "net/http"
        "time"
+
+       "github.com/prometheus/client_golang/prometheus"
+       "github.com/prometheus/client_golang/prometheus/promhttp"
 )

+var (
+       errorCount = prometheus.NewCounter(prometheus.CounterOpts{
+               Name: "greeting_error_count_total",
+               Help: "Counter of HTTP requests resulting in an error.",
+       })
+)
+
+func init() {
+       prometheus.MustRegister(errorCount)
+}
+
 func main() {
        helloHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                rand.Seed(time.Now().UnixNano())
@@ -23,9 +37,11 @@ func main() {
                        fmt.Fprint(w, "Hey!")
                case 3:
                        log.Println("Error!")
+                       errorCount.Inc()
                        fmt.Fprint(w, "Error!!")
                }
        })
        http.Handle("/", helloHandler)
+       http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
 }

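Source code (Gist)

All we do is register a Collector named errorCount and call errorCount.Inc() whenever an error occurs.
Finally, we expose the metrics at the /metrics path.
After starting the server, accessing http://localhost:8080/metrics returns metrics in the Prometheus format like the following.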

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
...(omitted)
# HELP greeting_error_count_total Counter of HTTP requests resulting in an error.
# TYPE greeting_error_count_total counter
greeting_error_count_total 1

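By default, the Go client also outputs Go runtime metrics (such as GC duration and the number of running goroutines).
The last line confirms that the error count we defined is being output correctly.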
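
To check such output programmatically rather than by eye, a small helper can pull one sample value out of the text. The following is a rough sketch, not a full parser for the exposition format: the function name sampleValue is made up for this example, and it assumes sample lines carry no trailing timestamp and no spaces inside label values.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleValue scans Prometheus text-format output and returns the value of
// the first sample whose name (including any label set) equals series.
func sampleValue(body, series string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// Assumed shape of a sample line: "<series> <value>".
		i := strings.LastIndex(line, " ")
		if i < 0 || line[:i] != series {
			continue
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			continue
		}
		return v, true
	}
	return 0, false
}

func main() {
	body := `# HELP greeting_error_count_total Counter of HTTP requests resulting in an error.
# TYPE greeting_error_count_total counter
greeting_error_count_total 1
`
	v, ok := sampleValue(body, "greeting_error_count_total")
	fmt.Println(v, ok)
}
```

In a real check, body would be the response read from http://localhost:8080/metrics.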

Every metric can also carry labels.
SkyDNS, for example, attaches two labels, system and cause, to its error metric.
See: https://github.com/skynetservices/skydns/blob/master/metrics/metrics.go#L82

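Step 2: Count the log messages

Next, we use a Counter to collect the number of times each greeting is returned.
This time we set a label so that the counts can be grouped by greeting pattern.

Reference: https://prometheus.io/docs/practices/instrumentation/#logging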

% git diff daa222e 51d2d01
diff --git a/httpserver.go b/httpserver.go
index fea3c6e..74a90fb 100644
--- a/httpserver.go
+++ b/httpserver.go
@@ -16,10 +16,15 @@ var (
                Name: "greeting_error_count_total",
                Help: "Counter of HTTP requests resulting in an error.",
        })
+       successCount = prometheus.NewCounterVec(prometheus.CounterOpts{
+               Name: "greeting_success_count_total",
+               Help: "Counter of HTTP requests resulting in a success.",
+       }, []string{"type"})
 )

 func init() {
        prometheus.MustRegister(errorCount)
+       prometheus.MustRegister(successCount)
 }

 func main() {
@@ -28,12 +33,15 @@ func main() {
                switch rand.Intn(4) {
                case 0:
                        log.Println("Hello!")
+                       successCount.WithLabelValues("hello").Inc()
                        fmt.Fprint(w, "Hello!")
                case 1:
                        log.Println("Hi!")
+                       successCount.WithLabelValues("hi").Inc()
                        fmt.Fprint(w, "Hi!")
                case 2:
                        log.Println("Hey!")
+                       successCount.WithLabelValues("hey").Inc()
                        fmt.Fprint(w, "Hey!")
                case 3:
                        log.Println("Error!")

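Source code (Gist)

We prepare a Collector named successCount with a label named type, and increment it with the label value specified.
As shown below, this yields metrics with labels.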

# HELP greeting_success_count_total Counter of HTTP requests resulting in a success.
# TYPE greeting_success_count_total counter
greeting_success_count_total{type="hello"} 3
greeting_success_count_total{type="hey"} 1
greeting_success_count_total{type="hi"} 1
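
What a labeled counter does conceptually is keep one independent count per combination of label values, each of which becomes its own series in the output. The sketch below illustrates that idea in plain Go; CounterVec here is a hypothetical, simplified type written for this example and not the client library's implementation (in particular, it uses Go's %q quoting, which only matches the Prometheus format for simple values).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// CounterVec is a simplified stand-in for a labeled counter:
// one independent count per combination of label values.
type CounterVec struct {
	mu     sync.Mutex
	name   string
	labels []string
	counts map[string]float64 // key: rendered label set
}

func NewCounterVec(name string, labels ...string) *CounterVec {
	return &CounterVec{name: name, labels: labels, counts: map[string]float64{}}
}

// Inc increments the child counter selected by the given label values.
func (v *CounterVec) Inc(values ...string) error {
	if len(values) != len(v.labels) {
		return fmt.Errorf("expected %d label values, got %d", len(v.labels), len(values))
	}
	pairs := make([]string, len(values))
	for i, val := range values {
		pairs[i] = fmt.Sprintf("%s=%q", v.labels[i], val)
	}
	key := strings.Join(pairs, ",")
	v.mu.Lock()
	defer v.mu.Unlock()
	v.counts[key]++
	return nil
}

// Expose renders one line per label combination, sorted for stable output.
func (v *CounterVec) Expose() string {
	v.mu.Lock()
	defer v.mu.Unlock()
	keys := make([]string, 0, len(v.counts))
	for k := range v.counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s} %g\n", v.name, k, v.counts[k])
	}
	return b.String()
}

func main() {
	success := NewCounterVec("greeting_success_count_total", "type")
	for _, t := range []string{"hello", "hi", "hello", "hey", "hello"} {
		if err := success.Inc(t); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Print(success.Expose())
}
```

Because every distinct label value creates a new series, labels are best kept to a small, bounded set of values such as the three greeting types used here.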

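Step 3: Collect HTTP-related metrics

Finally, we look at collecting metrics with Go's promhttp package.
The promhttp package provides a family of functions named InstrumentHandlerX.

Reference: https://godoc.org/github.com/prometheus/client_golang/prometheus/promhttp

When you pass one of these functions a Collector and a net/http handler, it returns the handler wrapped so that it collects metrics. This makes it easy to collect HTTP-related metrics such as the number of requests, request duration, and response size.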
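
The wrapping idea itself can be illustrated with the standard library alone. The sketch below is not promhttp's implementation; recorder, observation, instrument, and serveOnce are hypothetical names for this example. It shows how a wrapper can sit between the server and a handler and note the status code, method, response size, and duration of each request.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// recorder wraps a ResponseWriter to remember the status code and bytes written.
type recorder struct {
	http.ResponseWriter
	status int
	size   int
}

func (r *recorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func (r *recorder) Write(p []byte) (int, error) {
	n, err := r.ResponseWriter.Write(p)
	r.size += n
	return n, err
}

// observation is what the wrapper reports for one request.
type observation struct {
	Code     int
	Method   string
	Size     int
	Duration time.Duration
}

// instrument returns a handler that calls next and then reports an observation.
func instrument(next http.Handler, observe func(observation)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		// Status defaults to 200 because handlers may never call WriteHeader.
		rec := &recorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, req)
		observe(observation{
			Code:     rec.status,
			Method:   req.Method,
			Size:     rec.size,
			Duration: time.Since(start),
		})
	})
}

var helloHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, "Hello!")
})

// serveOnce drives the wrapped handler in-process (no port is opened)
// and returns what the wrapper observed.
func serveOnce(next http.Handler, method string) observation {
	var seen observation
	h := instrument(next, func(o observation) { seen = o })
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(method, "/", nil))
	return seen
}

func main() {
	o := serveOnce(helloHandler, http.MethodGet)
	fmt.Println(o.Code, o.Method, o.Size)
}
```

In this sketch the observe callback is where a counter would be incremented or a histogram updated; the handler being wrapped does not need to change at all.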

Add the following code.

% git diff 51d2d01 c4756da
diff --git a/httpserver.go b/httpserver.go
index 74a90fb..5d46163 100644
--- a/httpserver.go
+++ b/httpserver.go
@@ -20,11 +20,28 @@ var (
                Name: "greeting_success_count_total",
                Help: "Counter of HTTP requests resulting in a success.",
        }, []string{"type"})
+       requestCount = prometheus.NewCounterVec(prometheus.CounterOpts{
+               Name: "http_request_count_total",
+               Help: "Counter of HTTP requests made.",
+       }, []string{"code", "method"})
+       requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
+               Name:    "http_request_duration_seconds",
+               Help:    "A histogram of latencies for requests.",
+               Buckets: append([]float64{0.000001, 0.001, 0.003}, prometheus.DefBuckets...),
+       }, []string{"code", "method"})
+       responseSize = prometheus.NewHistogramVec(prometheus.HistogramOpts{
+               Name:    "http_response_size_bytes",
+               Help:    "A histogram of response sizes for requests.",
+               Buckets: []float64{0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20},
+       }, []string{"code", "method"})
 )

 func init() {
        prometheus.MustRegister(errorCount)
        prometheus.MustRegister(successCount)
+       prometheus.MustRegister(requestCount)
+       prometheus.MustRegister(requestDuration)
+       prometheus.MustRegister(responseSize)
 }

 func main() {
@@ -49,7 +66,13 @@ func main() {
                        fmt.Fprint(w, "Error!!")
                }
        })
-       http.Handle("/", helloHandler)
+       // Instrument helloHandler
+       wrappedHelloHandler := promhttp.InstrumentHandlerCounter(requestCount,
+               promhttp.InstrumentHandlerDuration(requestDuration,
+                       promhttp.InstrumentHandlerResponseSize(responseSize, helloHandler),
+               ),
+       )
+       http.Handle("/", wrappedHelloHandler)
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
 }

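Source code (Gist)

We prepare Collectors for the request count, request duration, and response size. Then, simply by wrapping the handler with the InstrumentHandlerX functions, the corresponding metrics are collected automatically, as shown below.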

# HELP http_request_count_total Counter of HTTP requests made.
# TYPE http_request_count_total counter
http_request_count_total{code="200",method="get"} 15
# HELP http_request_duration_seconds A histogram of latencies for requests.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{code="200",method="get",le="1e-06"} 0
http_request_duration_seconds_bucket{code="200",method="get",le="0.001"} 15
http_request_duration_seconds_bucket{code="200",method="get",le="0.003"} 15
...(omitted)
http_request_duration_seconds_bucket{code="200",method="get",le="5"} 15
http_request_duration_seconds_bucket{code="200",method="get",le="10"} 15
http_request_duration_seconds_bucket{code="200",method="get",le="+Inf"} 15
http_request_duration_seconds_sum{code="200",method="get"} 0.0008033
http_request_duration_seconds_count{code="200",method="get"} 15
# HELP http_response_size_bytes A histogram of response sizes for requests.
# TYPE http_response_size_bytes histogram
http_response_size_bytes_bucket{code="200",method="get",le="0"} 0
http_response_size_bytes_bucket{code="200",method="get",le="2"} 0
http_response_size_bytes_bucket{code="200",method="get",le="4"} 7
http_response_size_bytes_bucket{code="200",method="get",le="6"} 11
http_response_size_bytes_bucket{code="200",method="get",le="8"} 15
...(omitted)
http_response_size_bytes_bucket{code="200",method="get",le="20"} 15
http_response_size_bytes_bucket{code="200",method="get",le="+Inf"} 15
http_response_size_bytes_sum{code="200",method="get"} 76
http_response_size_bytes_count{code="200",method="get"} 15
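
Note that the _bucket lines in the histogram output are cumulative: each le bucket counts every observation less than or equal to its upper bound, so the counts never decrease and the +Inf bucket equals _count. The sketch below reproduces the response-size numbers with a simplified, hypothetical Histogram type. The mix of responses (4 "Hello!", 4 "Hi!", 3 "Hey!", 4 "Error!!") is an assumption chosen because it is consistent with the output above; the actual mix is not stated.

```go
package main

import "fmt"

// Histogram is a simplified histogram with fixed upper bounds.
type Histogram struct {
	bounds []float64 // ascending upper bounds; +Inf is implicit
	counts []uint64  // counts[i]: observations <= bounds[i] (cumulative)
	sum    float64
	count  uint64
}

func NewHistogram(bounds ...float64) *Histogram {
	return &Histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
}

// Observe records one value in every bucket whose upper bound covers it.
func (h *Histogram) Observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.sum += v
	h.count++
}

func main() {
	h := NewHistogram(0, 2, 4, 6, 8)
	// Assumed mix of the 15 responses, keyed by response body.
	responses := map[string]int{"Hello!": 4, "Hi!": 4, "Hey!": 3, "Error!!": 4}
	for body, n := range responses {
		for i := 0; i < n; i++ {
			h.Observe(float64(len(body))) // response size in bytes
		}
	}
	for i, b := range h.bounds {
		fmt.Printf("le=%g %d\n", b, h.counts[i])
	}
	fmt.Printf("le=+Inf %d\nsum %g\ncount %d\n", h.count, h.sum, h.count)
}
```

With this assumed mix, the le=4, le=6, and le=8 buckets come out as 7, 11, and 15, and the sum as 76, matching the values shown above.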

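Conclusion

To collect application-specific metrics with Prometheus, we tried out making the application itself output Prometheus metrics directly.
With a Prometheus client library, it is easy to expose system-specific metrics that are not otherwise published.
This approach is only available when you can modify the application itself, but from a white-box monitoring point of view I think it is very effective.
For cloud-native applications that assume Kubernetes and Prometheus, supporting Prometheus-format metric output directly is a reasonable option.

References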

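Etcd: Collector creation/registration, metric collection

SkyDNS: Collector creation/registration, metric collection, tests

Kubernetes: Collector creation/registration

Implementation examples of (official) Exporters. Note: these are outside the scope of this article, but they were very helpful as a reference when writing an Exporter.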

Consul Exporter
Memcached Exporter
MySQL Exporter
Node Collector
HA Proxy Exporter

Notes

Preparing an HTTP endpoint that exposes metrics is called "instrumenting", and Prometheus fetching the data from it for monitoring is called "scraping".

It is also common to implement a Collector yourself. Looking at several implementations, Exporters often implement their own Collector that retrieves the target system's metrics in bulk, while applications with direct support more often use the ready-made Collector types as they are.