Machine Learning on EKS: Deploying Prometheus/Grafana
Introduction
In this series, I want to try out machine learning on Amazon EKS.
Series contents
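Machine Learning on EKS #1: Preparation
Machine Learning on EKS #2: Creating the cluster
Machine Learning on EKS #3: Creating managed worker nodes
Machine Learning on EKS #4: Creating GPU managed worker nodes
Machine Learning on EKS #5: Setting up Cluster Autoscaler
Machine Learning on EKS #6: Setting up HPA
Machine Learning on EKS #7: Setting up EFS
Machine Learning on EKS #8: Building a CD environment with Argo CD
Machine Learning on EKS #9: Introducing SageMaker Operator
Machine Learning on EKS #10: Introducing Container Insights
Machine Learning on EKS #11: Introducing Prometheus/Grafana (this article)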
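What is the purpose of this article?
Last time we introduced Container Insights; this time we will try introducing Prometheus/Grafana.
In a production environment I think it is rare to run both, so this is only an experiment in running them side by side for verification.
References
I followed this link: https://eksworkshop.com/intermediate/240_monitoring/
If the link is dead, searching for "eksworkshop prometheus" should turn up the relevant page.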
Installing Helm
According to the reference, the installation is done with Helm, so Helm has to be installed first. Reference: https://eksworkshop.com/beginner/060_helm/helm_intro/install/
curl -sSL https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
helm version --short
helm repo add stable https://charts.helm.sh/stable
helm search repo stable
helm repo update
Installing Prometheus
Create the namespace
kind: Namespace
apiVersion: v1
metadata:
  name: prometheus
k apply -f namespace.yml
namespace/prometheus created
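As an alternative to keeping a namespace.yml per namespace, the same step can be sketched as a small idempotent shell helper. The function name ensure_namespace is hypothetical, and the sketch assumes a kubectl new enough to accept --dry-run=client (older releases only accepted a bare --dry-run).

```shell
# Hypothetical helper: create a namespace if it is missing, do nothing otherwise.
# Assumes kubectl supports --dry-run=client.
ensure_namespace() {
  ns="$1"
  # Render the Namespace manifest locally, then apply it, so re-running is safe.
  kubectl create namespace "$ns" --dry-run=client -o yaml \
    | kubectl apply -f -
}
```

Calling ensure_namespace prometheus here, and ensure_namespace grafana later, would replace the two manifest files used in this article.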
Install Prometheus with the helm command.
helm install prometheus stable/prometheus \
--namespace prometheus \
--set alertmanager.persistentVolume.storageClass="gp2" \
--set server.persistentVolume.storageClass="gp2"
(output)
NAME: prometheus
LAST DEPLOYED: Sat Mar 7 14:49:44 2020
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Prometheus server can be accessed via port 80 on the following DNS name from within your cluster:
prometheus-server.prometheus.svc.cluster.local
Get the Prometheus server URL by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace prometheus -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace prometheus port-forward $POD_NAME 9090
The Prometheus alertmanager can be accessed via port 80 on the following DNS name from within your cluster:
prometheus-alertmanager.prometheus.svc.cluster.local
Get the Alertmanager URL by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace prometheus -l "app=prometheus,component=alertmanager" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace prometheus port-forward $POD_NAME 9093
#################################################################################
###### WARNING: Pod Security Policy has been moved to a global property. #####
###### use .Values.podSecurityPolicy.enabled with pod-based #####
###### annotations #####
###### (e.g. .Values.nodeExporter.podSecurityPolicy.annotations) #####
#################################################################################
The Prometheus PushGateway can be accessed via port 9091 on the following DNS name from within your cluster:
prometheus-pushgateway.prometheus.svc.cluster.local
Get the PushGateway URL by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace prometheus -l "app=prometheus,component=pushgateway" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace prometheus port-forward $POD_NAME 9091
For more information on running Prometheus, visit:
https://prometheus.io/
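The two --set flags passed to helm install can also be kept in a values file, which is easier to review and version. The following is a minimal sketch under that assumption; the file name prometheus-values.yaml and the wrapper function install_prometheus are hypothetical, while the keys mirror the flags used above.

```shell
# Write the same settings as the --set flags into a values file
# (file name is an assumption, not part of the original steps).
cat > prometheus-values.yaml <<'EOF'
alertmanager:
  persistentVolume:
    storageClass: gp2
server:
  persistentVolume:
    storageClass: gp2
EOF

# Hypothetical wrapper: same release name, chart and namespace as above,
# but reading the settings from the values file.
install_prometheus() {
  helm install prometheus stable/prometheus \
    --namespace prometheus \
    -f prometheus-values.yaml
}
```

Running install_prometheus should then be equivalent to the command shown earlier.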
Verification
kubectl get all -n prometheus
NAME READY STATUS RESTARTS AGE
pod/prometheus-alertmanager-86bfcc75db-bbmvl 2/2 Running 0 107s
pod/prometheus-kube-state-metrics-5ccb885bdc-gzcqn 1/1 Running 0 107s
pod/prometheus-node-exporter-dwv7c 1/1 Running 0 107s
pod/prometheus-node-exporter-q9w4m 1/1 Running 0 107s
pod/prometheus-pushgateway-7867ddb5cf-vjq5x 1/1 Running 0 107s
pod/prometheus-server-68677bcbd9-tn7nq 2/2 Running 0 107s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/prometheus-alertmanager ClusterIP 10.100.46.107 <none> 80/TCP 107s
service/prometheus-kube-state-metrics ClusterIP 10.100.193.251 <none> 8080/TCP 107s
service/prometheus-node-exporter ClusterIP None <none> 9100/TCP 107s
service/prometheus-pushgateway ClusterIP 10.100.111.192 <none> 9091/TCP 107s
service/prometheus-server ClusterIP 10.100.61.62 <none> 80/TCP 107s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/prometheus-node-exporter 2 2 2 2 2 <none> 107s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/prometheus-alertmanager 1/1 1 1 107s
deployment.apps/prometheus-kube-state-metrics 1/1 1 1 107s
deployment.apps/prometheus-pushgateway 1/1 1 1 107s
deployment.apps/prometheus-server 1/1 1 1 107s
NAME DESIRED CURRENT READY AGE
replicaset.apps/prometheus-alertmanager-86bfcc75db 1 1 1 107s
replicaset.apps/prometheus-kube-state-metrics-5ccb885bdc 1 1 1 107s
replicaset.apps/prometheus-pushgateway-7867ddb5cf 1 1 1 107s
replicaset.apps/prometheus-server-68677bcbd9 1 1 1 107s
We can confirm that the objects were created correctly.
Let's take a look at the UI.
kubectl port-forward -n prometheus deploy/prometheus-server 8080:9090
Since I am working in Cloud9, I followed the reference and opened /targets in the preview, and confirmed that it is displayed correctly.
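The same check can be made from a terminal while the port-forward is running, by asking the Prometheus HTTP API for its targets and counting them by health. This is a rough sketch: the function name prom_target_health is hypothetical, the default URL assumes the 8080 local port used above, and the JSON is matched with grep rather than a real JSON parser.

```shell
# Hypothetical helper: summarize scrape target health via /api/v1/targets.
# Assumes the port-forward above is listening on localhost:8080.
prom_target_health() {
  base_url="${1:-http://localhost:8080}"
  # Pull out every "health":"..." field and count occurrences per value.
  curl -s "${base_url}/api/v1/targets" \
    | grep -o '"health": *"[a-z]*"' \
    | sort | uniq -c
}
```

Running prom_target_health prints one line per health value with its count; any targets reported as down are worth inspecting on the /targets page.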

Installing Grafana
Create the grafana namespace
kind: Namespace
apiVersion: v1
metadata:
  name: grafana
k apply -f namespace.yml
namespace/grafana created
Install it with the helm command.
Leaving adminPassword as EKS!sAWSome would mean everyone ends up with the same password, so change it to a password of your own choosing.
helm install grafana stable/grafana \
--namespace grafana \
--set persistence.storageClassName="gp2" \
--set adminPassword='xxxxxx' \
--set datasources."datasources\.yaml".apiVersion=1 \
--set datasources."datasources\.yaml".datasources[0].name=Prometheus \
--set datasources."datasources\.yaml".datasources[0].type=prometheus \
--set datasources."datasources\.yaml".datasources[0].url=http://prometheus-server.prometheus.svc.cluster.local \
--set datasources."datasources\.yaml".datasources[0].access=proxy \
--set datasources."datasources\.yaml".datasources[0].isDefault=true \
--set service.type=LoadBalancer
NAME: grafana
LAST DEPLOYED: Sat Mar 7 15:03:25 2020
NAMESPACE: grafana
STATUS: deployed
REVISION: 1
NOTES:
1. Get your 'admin' user password by running:
kubectl get secret --namespace grafana grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:
grafana.grafana.svc.cluster.local
Get the Grafana URL to visit by running these commands in the same shell:
NOTE: It may take a few minutes for the LoadBalancer IP to be available.
You can watch the status of by running 'kubectl get svc --namespace grafana -w grafana'
export SERVICE_IP=$(kubectl get svc --namespace grafana grafana -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
http://$SERVICE_IP:80
3. Login with the password from step 1 and the username: admin
#################################################################################
###### WARNING: Persistence is disabled!!! You will lose your data when #####
###### the Grafana pod is terminated. #####
#################################################################################
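The escaped datasources."datasources\.yaml" flags are easier to read as a values file. The sketch below writes an equivalent file; the file name grafana-values.yaml and the function install_grafana are hypothetical, and the password is the same placeholder as above. The commented persistence.enabled line is an assumption about how the warning above could be addressed; it is not part of the original command.

```shell
# Values-file equivalent of the --set flags (file name is an assumption).
cat > grafana-values.yaml <<'EOF'
persistence:
  storageClassName: gp2
  # enabled: true   # assumption: not set in the original, hence the warning
adminPassword: "xxxxxx"   # placeholder, replace with your own password
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-server.prometheus.svc.cluster.local
        access: proxy
        isDefault: true
service:
  type: LoadBalancer
EOF

# Hypothetical wrapper: same release, chart and namespace as the command above.
install_grafana() {
  helm install grafana stable/grafana \
    --namespace grafana \
    -f grafana-values.yaml
}
```

Keeping the data source definition in a file also avoids the shell quoting and backslash escaping that the --set form requires.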
Verification
kubectl get all -n grafana
NAME READY STATUS RESTARTS AGE
pod/grafana-794598bb56-mslrg 1/1 Running 0 48s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/grafana LoadBalancer 10.100.79.227 xxxxxxxxxxxxx-365458450.us-west-2.elb.amazonaws.com 80:31243/TCP 48s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/grafana 1/1 1 1 48s
NAME DESIRED CURRENT READY AGE
replicaset.apps/grafana-794598bb56 1 1 1 48s
Accessing the UI
Check the ELB.
k get svc -n grafana
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana LoadBalancer 10.100.79.227 xxxxxx-365458450.us-west-2.elb.amazonaws.com 80:31243/TCP 16m
Open the ELB DNS name shown here in a browser.
The user is admin and the password is the one you set.
If the login succeeds, access is confirmed.
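The chart's NOTES read the load balancer address from the ip field, but as the Service listing shows, the ELB is exposed as a DNS name, so the hostname field is the one that is populated on AWS. A small sketch for printing the URL directly, where the function name grafana_url is hypothetical:

```shell
# Hypothetical helper: print the Grafana URL from the Service's ELB hostname.
grafana_url() {
  host="$(kubectl get svc grafana --namespace grafana \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"
  # The hostname is empty until the load balancer has been provisioned.
  if [ -z "$host" ]; then
    echo "load balancer hostname not assigned yet" >&2
    return 1
  fi
  echo "http://${host}"
}
```

If the function reports that no hostname is assigned yet, waiting a few minutes and retrying is usually enough.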

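Creating dashboards from templates
This is easy to do by following this link.
The screens look the same as in the reference, so screenshots are omitted.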
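Summary
Following the EKS Workshop steps, we introduced Prometheus/Grafana into the EKS environment.
Because this is only a verification environment, the data will be lost if the Pod is recreated.
We plan to look at a configuration suited to production operation in the near future.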