Monitoring a Kubernetes Cluster with Prometheus

2 years ago

清, 宇

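I want to monitor a Kubernetes cluster with Prometheus!

This is my first post.

There are several options for monitoring a Kubernetes cluster, such as Prometheus, EFK, and the SaaS offering Datadog. This time we will use Prometheus.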

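Setting up a Prometheus environment for Kubernetes

We will build the Prometheus environment from this link. Its last update was in September and its Deployment is still in beta, but this time we will ignore that and use it anyway.

Running the command from the Quickstart automatically creates prometheus, grafana, node-exporter, and more in a namespace named monitoring.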

kubectl apply \
  --filename https://raw.githubusercontent.com/giantswarm/kubernetes-prometheus/master/manifests-all.yaml

Once the build finishes, you can see the various Pods running in the monitoring namespace.

# kubectl get pods --namespace=monitoring
NAME                                  READY     STATUS    RESTARTS   AGE
alertmanager-56f6fdd9f6-z4vl8         1/1       Running   0          2h
grafana-core-867b94888d-td7b4         1/1       Running   0          5h
kube-state-metrics-694fdcf55f-797th   1/1       Running   0          5h
kube-state-metrics-694fdcf55f-tsvh5   1/1       Running   0          5h
node-directory-size-metrics-8rjvx     2/2       Running   0          5h
node-directory-size-metrics-z86cs     2/2       Running   0          5h
prometheus-core-5cf65c7b68-2dg5r      1/1       Running   0          2h
prometheus-node-exporter-8dccv        1/1       Running   0          5h
prometheus-node-exporter-rmlwk        1/1       Running   0          5h

Components such as alertmanager and kube-state-metrics are started as well.

NodePort Services are also created, so let's check them from their URLs.

# kubectl get svc --namespace=monitoring
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
alertmanager               NodePort    100.64.36.59     <none>        9093:32651/TCP   5h
grafana                    NodePort    100.70.155.49    <none>        3000:31676/TCP   5h
kube-state-metrics         ClusterIP   100.66.157.125   <none>        8080/TCP         5h
prometheus                 NodePort    100.71.101.61    <none>        9090:32296/TCP   5h
prometheus-node-exporter   ClusterIP   None             <none>        9100/TCP         5h

Both Prometheus and Grafana are up.
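The URLs checked above combine a node address with the NodePort shown in the PORT(S) column. As a rough sketch, the port can be read with kubectl's JSONPath output instead of copying it by hand. The function name nodeport_url and the NODE_IP placeholder are made up for illustration, and the sketch assumes the Service's first port is the one you want and that the node address is reachable from your machine.

```shell
# Print the NodePort URL for a Service in the monitoring namespace.
# Usage: nodeport_url <service> <node-ip>
# (nodeport_url is a hypothetical helper, not part of kubectl.)
nodeport_url() {
  svc="$1"
  node_ip="$2"
  # Read the nodePort of the Service's first port entry.
  port="$(kubectl get svc "$svc" --namespace=monitoring \
    --output=jsonpath='{.spec.ports[0].nodePort}')"
  echo "http://${node_ip}:${port}"
}

# Example (NODE_IP is a placeholder for any reachable node address):
#   nodeport_url prometheus "$NODE_IP"
#   nodeport_url grafana "$NODE_IP"
```

With the listing above, the prometheus Service would resolve to port 32296 and grafana to 31676.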

However, when you open kubernetes-pod-resources, the values show as N/A and cannot be checked properly.
This is because the Prometheus configuration does not scrape cAdvisor.

Changing the Prometheus configuration

Edit the Prometheus ConfigMap so that cAdvisor is scraped.

kubectl edit configmap prometheus-core --namespace=monitoring

Add the following job_name under scrape_configs.
(Copied from the official example: https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml)

- job_name: 'kubernetes-cadvisor'

  # Default to scraping over https. If required, just disable this or change to
  # `http`.
  scheme: https

  # This TLS & bearer token file config is used to connect to the actual scrape
  # endpoints for cluster components. This is separate to discovery auth
  # configuration because discovery & scraping are two separate concerns in
  # Prometheus. The discovery auth config is automatic if Prometheus runs inside
  # the cluster. Otherwise, more config options have to be provided within the
  # <kubernetes_sd_config>.
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  kubernetes_sd_configs:
  - role: node

  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

After applying the setting, delete the Pod once so that the ConfigMap change takes effect.

kubectl delete pods prometheus-core-XXX --namespace=monitoring

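If the Pod shows Running after it restarts, all is well.
If it fails to start, the cause may be an editing mistake in the YAML or something similar (that is what happened to me).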
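When the restarted Pod does not come up, its logs are usually the quickest way to find a YAML mistake. The following is a minimal sketch of that check. The function name check_pod is hypothetical, and it assumes the Pod has a single container, so it only looks at the first entry of containerStatuses.

```shell
# Report whether a Pod's first container is ready; show recent logs otherwise.
# Usage: check_pod <pod-name>
# (check_pod is a hypothetical helper, not part of kubectl.)
check_pod() {
  pod="$1"
  # "ready" is true only once the container is up and passing its checks.
  ready="$(kubectl get pod "$pod" --namespace=monitoring \
    --output=jsonpath='{.status.containerStatuses[0].ready}')"
  if [ "$ready" = "true" ]; then
    echo "OK: $pod is ready"
  else
    echo "NG: $pod is not ready" >&2
    # A broken ConfigMap typically surfaces as a config parse error here.
    kubectl logs "$pod" --namespace=monitoring --tail=20 >&2
    return 1
  fi
}

# Example (use the new Pod name from "kubectl get pods"):
#   check_pod prometheus-core-XXX
```

The sketch checks the container's ready flag rather than the Pod phase, because a Pod whose container keeps crashing can still report the Running phase.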

Once it is running again, open Grafana to view CPU, memory, Pod resources, and more across the nodes.
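Besides looking at Grafana, one way to confirm that the new job is being scraped is to ask the Prometheus HTTP API for the up series of the kubernetes-cadvisor job. This is only a sketch: the function name cadvisor_up is made up, and the base URL is assumed to be the Prometheus NodePort URL on a node you can reach.

```shell
# Query Prometheus for the "up" series of the kubernetes-cadvisor job.
# Usage: cadvisor_up <prometheus-base-url>
# (cadvisor_up is a hypothetical helper; the base URL is an assumption.)
cadvisor_up() {
  base_url="$1"
  # --get turns the url-encoded data into a query string on a GET request.
  curl --silent --get "${base_url}/api/v1/query" \
    --data-urlencode 'query=up{job="kubernetes-cadvisor"}'
}

# Example (NODE_IP is a placeholder; 32296 is the NodePort from the listing):
#   cadvisor_up "http://${NODE_IP}:32296"
```

In the JSON response, a value of 1 for a node means its cAdvisor endpoint was scraped successfully, and 0 means the scrape failed.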

We can now monitor the state of the Nodes graphically!

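Summary

With Alertmanager, you can send alerts to Slack based on the state of the nodes.

With Datadog, you only need to start the dd-agent DaemonSet to collect metrics and set up alerts, so if your setup is small or you have enough budget, I think Datadog is a fine choice.

This is Datadog's Kubernetes cluster monitoring screen. It is very intuitive and easy to understand.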