Deploying Prometheus/Grafana on EKS for machine learning

3 years ago

韵, 科

5 minutes

Introduction

In this series, I want to try running machine learning workloads on Amazon EKS.

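Series contents

Machine Learning on EKS #1: Preparation
Machine Learning on EKS #2: Creating the cluster
Machine Learning on EKS #3: Creating managed worker nodes
Machine Learning on EKS #4: Creating GPU managed worker nodes
Machine Learning on EKS #5: Setting up the Cluster Autoscaler
Machine Learning on EKS #6: Setting up HPA
Machine Learning on EKS #7: Setting up EFS
Machine Learning on EKS #8: Building a CD environment with Argo CD
Machine Learning on EKS #9: Introducing the SageMaker Operator
Machine Learning on EKS #10: Introducing Container Insights
Machine Learning on EKS #11: Introducing Prometheus/Grafana (this article)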

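What is the purpose of this article?

Last time we introduced Container Insights; this time we try introducing Prometheus/Grafana.
In production I think it is rare to run both at once, so running them side by side here is purely for verification.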

References

I followed this guide: https://eksworkshop.com/intermediate/240_monitoring/
If the link is dead, searching for "eksworkshop prometheus" should turn up the relevant material.

Installing Helm

According to the reference document, the installation is done with helm, so helm has to be installed first. Reference: https://eksworkshop.com/beginner/060_helm/helm_intro/install/

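# Install Helm 3 with the installer script from the Helm repository
curl -sSL https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
helm version --short
# Register the stable chart repository. The original location,
# kubernetes-charts.storage.googleapis.com, has since been retired in favor of charts.helm.sh/stable.
helm repo add stable https://charts.helm.sh/stable
helm repo update
helm search repo stable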

Installing Prometheus

Create the namespace.

kind: Namespace
apiVersion: v1
metadata:
  name: prometheus

kubectl apply -f namespace.yml
namespace/prometheus created

Install Prometheus with the helm command.

helm install prometheus stable/prometheus \
    --namespace prometheus \
    --set alertmanager.persistentVolume.storageClass="gp2" \
    --set server.persistentVolume.storageClass="gp2"

(output)
NAME: prometheus
LAST DEPLOYED: Sat Mar  7 14:49:44 2020
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Prometheus server can be accessed via port 80 on the following DNS name from within your cluster:
prometheus-server.prometheus.svc.cluster.local


Get the Prometheus server URL by running these commands in the same shell:
  export POD_NAME=$(kubectl get pods --namespace prometheus -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
  kubectl --namespace prometheus port-forward $POD_NAME 9090


The Prometheus alertmanager can be accessed via port 80 on the following DNS name from within your cluster:
prometheus-alertmanager.prometheus.svc.cluster.local


Get the Alertmanager URL by running these commands in the same shell:
  export POD_NAME=$(kubectl get pods --namespace prometheus -l "app=prometheus,component=alertmanager" -o jsonpath="{.items[0].metadata.name}")
  kubectl --namespace prometheus port-forward $POD_NAME 9093
#################################################################################
######   WARNING: Pod Security Policy has been moved to a global property.  #####
######            use .Values.podSecurityPolicy.enabled with pod-based      #####
######            annotations                                               #####
######            (e.g. .Values.nodeExporter.podSecurityPolicy.annotations) #####
#################################################################################


The Prometheus PushGateway can be accessed via port 9091 on the following DNS name from within your cluster:
prometheus-pushgateway.prometheus.svc.cluster.local


Get the PushGateway URL by running these commands in the same shell:
  export POD_NAME=$(kubectl get pods --namespace prometheus -l "app=prometheus,component=pushgateway" -o jsonpath="{.items[0].metadata.name}")
  kubectl --namespace prometheus port-forward $POD_NAME 9091

For more information on running Prometheus, visit:
https://prometheus.io/

Verification

kubectl get all -n prometheus
NAME                                                 READY   STATUS    RESTARTS   AGE
pod/prometheus-alertmanager-86bfcc75db-bbmvl         2/2     Running   0          107s
pod/prometheus-kube-state-metrics-5ccb885bdc-gzcqn   1/1     Running   0          107s
pod/prometheus-node-exporter-dwv7c                   1/1     Running   0          107s
pod/prometheus-node-exporter-q9w4m                   1/1     Running   0          107s
pod/prometheus-pushgateway-7867ddb5cf-vjq5x          1/1     Running   0          107s
pod/prometheus-server-68677bcbd9-tn7nq               2/2     Running   0          107s

NAME                                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/prometheus-alertmanager         ClusterIP   10.100.46.107    <none>        80/TCP     107s
service/prometheus-kube-state-metrics   ClusterIP   10.100.193.251   <none>        8080/TCP   107s
service/prometheus-node-exporter        ClusterIP   None             <none>        9100/TCP   107s
service/prometheus-pushgateway          ClusterIP   10.100.111.192   <none>        9091/TCP   107s
service/prometheus-server               ClusterIP   10.100.61.62     <none>        80/TCP     107s

NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/prometheus-node-exporter   2         2         2       2            2           <none>          107s

NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prometheus-alertmanager         1/1     1            1           107s
deployment.apps/prometheus-kube-state-metrics   1/1     1            1           107s
deployment.apps/prometheus-pushgateway          1/1     1            1           107s
deployment.apps/prometheus-server               1/1     1            1           107s

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/prometheus-alertmanager-86bfcc75db         1         1         1       107s
replicaset.apps/prometheus-kube-state-metrics-5ccb885bdc   1         1         1       107s
replicaset.apps/prometheus-pushgateway-7867ddb5cf          1         1         1       107s
replicaset.apps/prometheus-server-68677bcbd9               1         1         1       107s

We can confirm that the objects were created correctly.
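
The same check can be scripted instead of read off the listing by eye. The following is a minimal sketch, assuming the Deployment names shown in the output above; the function name `wait_for_deployments` and the 120-second timeout are illustrative choices and do not come from the reference workshop.

```shell
#!/usr/bin/env bash
# Wait until each named Deployment in a namespace has finished rolling out.
# Hypothetical helper; the name and the timeout are illustrative.
wait_for_deployments() {
  local ns=$1
  shift
  local d
  for d in "$@"; do
    # `kubectl rollout status` blocks until the rollout completes or times out.
    kubectl rollout status "deployment/${d}" --namespace "$ns" --timeout=120s || return 1
  done
}

# Usage (requires access to the cluster):
#   wait_for_deployments prometheus \
#     prometheus-alertmanager prometheus-kube-state-metrics \
#     prometheus-pushgateway prometheus-server
```

Note that prometheus-node-exporter is a DaemonSet, so it is not covered by this Deployment-only loop.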

Next, let's take a look at the UI.


kubectl port-forward -n prometheus deploy/prometheus-server 8080:9090

Since I am working in Cloud9, I followed the reference document and opened /targets in the preview, and confirmed that the page displays correctly.
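
The information on the /targets page is also available from the Prometheus HTTP API at `/api/v1/targets`, which is handy when no browser preview is at hand. Below is a rough sketch, assuming the port-forward to local port 8080 shown above is still running. The helper name `count_targets_by_health` is hypothetical, and the grep-based counting assumes the API returns compact JSON (no spaces after colons); with `jq` installed, a proper JSON query would be more robust.

```shell
#!/usr/bin/env bash
# Count scrape targets by health from a Prometheus /api/v1/targets response.
# Reads the JSON on stdin and uses grep only, so jq is not required.
count_targets_by_health() {
  local json up down
  json=$(cat)
  # grep -o prints one line per match, so wc -l yields the match count.
  up=$(printf '%s' "$json" | grep -o '"health":"up"' | wc -l | tr -d ' ')
  down=$(printf '%s' "$json" | grep -o '"health":"down"' | wc -l | tr -d ' ')
  echo "up=${up} down=${down}"
}

# Usage (assumes the port-forward on local port 8080 is running):
#   curl -s http://localhost:8080/api/v1/targets | count_targets_by_health
```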

Installing Grafana

Create the grafana namespace.

kind: Namespace
apiVersion: v1
metadata:
  name: grafana

kubectl apply -f namespace.yml
namespace/grafana created

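Install Grafana with the helm command.

Setting adminPassword to EKS!sAWSome (the value in the reference document) would give everyone the same password, so change it to a password of your own choosing.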

helm install grafana stable/grafana \
    --namespace grafana \
    --set persistence.storageClassName="gp2" \
    --set adminPassword='xxxxxx' \
    --set datasources."datasources\.yaml".apiVersion=1 \
    --set datasources."datasources\.yaml".datasources[0].name=Prometheus \
    --set datasources."datasources\.yaml".datasources[0].type=prometheus \
    --set datasources."datasources\.yaml".datasources[0].url=http://prometheus-server.prometheus.svc.cluster.local \
    --set datasources."datasources\.yaml".datasources[0].access=proxy \
    --set datasources."datasources\.yaml".datasources[0].isDefault=true \
    --set service.type=LoadBalancer

NAME: grafana
LAST DEPLOYED: Sat Mar  7 15:03:25 2020
NAMESPACE: grafana
STATUS: deployed
REVISION: 1
NOTES:
1. Get your 'admin' user password by running:

   kubectl get secret --namespace grafana grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:

   grafana.grafana.svc.cluster.local

   Get the Grafana URL to visit by running these commands in the same shell:
NOTE: It may take a few minutes for the LoadBalancer IP to be available.
        You can watch the status of by running 'kubectl get svc --namespace grafana -w grafana'
     export SERVICE_IP=$(kubectl get svc --namespace grafana grafana -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
     http://$SERVICE_IP:80

3. Login with the password from step 1 and the username: admin
#################################################################################
######   WARNING: Persistence is disabled!!! You will lose your data when   #####
######            the Grafana pod is terminated.                            #####
#################################################################################
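
The long chain of `--set` flags used for the install can also be written as a values file, which is easier to review and keep in version control. This is a sketch under the assumption that the chart accepts the same keys through `-f` as through `--set`. The helper name `write_grafana_values` and the file name `grafana-values.yaml` are hypothetical, and the commented-out `enabled: true` line is my guess at why the warning above reports persistence as disabled; it is not something the reference document sets.

```shell
#!/usr/bin/env bash
# Write a Helm values file equivalent to the --set flags used above.
# Hypothetical helper; the function name and default file name are illustrative.
write_grafana_values() {
  local out=${1:-grafana-values.yaml}
  cat > "$out" <<'EOF'
persistence:
  storageClassName: gp2
  # Assumption: the chart only creates a volume when this is enabled,
  # which would explain the "Persistence is disabled" warning.
  # enabled: true
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-server.prometheus.svc.cluster.local
        access: proxy
        isDefault: true
service:
  type: LoadBalancer
EOF
}

# Usage (adminPassword stays on the command line so it is not written to disk):
#   write_grafana_values grafana-values.yaml
#   helm install grafana stable/grafana --namespace grafana \
#     -f grafana-values.yaml --set adminPassword='xxxxxx'
```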

Verification

kubectl get all -n grafana
NAME                           READY   STATUS    RESTARTS   AGE
pod/grafana-794598bb56-mslrg   1/1     Running   0          48s

NAME              TYPE           CLUSTER-IP      EXTERNAL-IP                                                              PORT(S)        AGE
service/grafana   LoadBalancer   10.100.79.227   xxxxxxxxxxxxx-365458450.us-west-2.elb.amazonaws.com   80:31243/TCP   48s

NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/grafana   1/1     1            1           48s

NAME                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/grafana-794598bb56   1         1         1       48s

Accessing the UI

Check the ELB.

kubectl get svc -n grafana
NAME      TYPE           CLUSTER-IP      EXTERNAL-IP                                                              PORT(S)        AGE
grafana   LoadBalancer   10.100.79.227   xxxxxx-365458450.us-west-2.elb.amazonaws.com   80:31243/TCP   16m

Open the ELB DNS name shown here in a browser.
The user is admin, and the password is the one you set.
If the login succeeds, access is working.
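
The URL can also be derived from the Service directly. The chart NOTES above read `.status.loadBalancer.ingress[0].ip`, but on AWS the ELB is reported under `hostname`, as the EXTERNAL-IP column shows. A small sketch, with hypothetical helper names `grafana_url` and `grafana_admin_password`; the Secret lookup is the same command the chart NOTES print.

```shell
#!/usr/bin/env bash
# Print the Grafana URL from the Service's load balancer hostname.
# On AWS the ELB is reported under .hostname rather than .ip.
grafana_url() {
  local host
  host=$(kubectl get svc --namespace grafana grafana \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
  if [ -z "$host" ]; then
    echo "load balancer hostname not assigned yet" >&2
    return 1
  fi
  echo "http://${host}"
}

# Print the admin password stored in the chart's Secret
# (same lookup as in the chart NOTES).
grafana_admin_password() {
  kubectl get secret --namespace grafana grafana \
    -o jsonpath='{.data.admin-password}' | base64 --decode
  echo
}

# Usage (requires access to the cluster):
#   grafana_url
#   grafana_admin_password
```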

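Creating dashboards from templates

This is easy to do by following the linked reference.
The screens look the same as in the reference, so I have omitted the screenshots.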

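Summary

Following the EKS Workshop steps, we introduced Prometheus/Grafana into the EKS environment.
Because this is only a verification environment, data will be lost if a Pod is recreated.
We plan to look into a configuration suited to production operation in the near future.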