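Building an auto-recovery mechanism with Stackdriver + GitLab CI + Ansible
Introduction
We monitor processes on GCE with Stackdriver and send an alert to Slack whenever a process goes down.
But that misses the point! Raising an alert is not the goal; restoring the service as quickly as possible is the goal!
In practice, once an alert fires, all we do is check the state with systemctl status XXXXX and then bring the service back with sudo systemctl start XXXXX.
So I decided to combine Stackdriver + GitLab CI + Ansible into a mechanism that not only sends an alert but also recovers the service.
A teammate has written an article about building a nice auto-recovery mechanism with a Slack bot, so if you would rather do this with a Slack bot, see the link below.
Auto-recovering services with Stackdriver and slackbot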
https://qiita.com/andromeda/items/fcb2ea02e9bb32e329e4
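Overall architecture

Supplementary note
As described later, the Stackdriver webhook goes through a proxy because its source IP cannot be fixed, so it cannot be added to the firewall rules in front of GitLab.
To keep a reasonable level of security, the proxy is protected with basic authentication.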
Prerequisites
- GitLab itself is already installed and running
- A GCP project is available for use
Building the GitLab Runner + Ansible server
Create a GCE instance as needed.
This walkthrough uses a CentOS 7 image.
Installing Ansible
Install it in whatever way suits you.
sudo yum install ansible
Ansible files
The layout is very simple; this is all there is.
$ tree
.
├── inventories
│   └── hosts
├── roles
│   └── nginx
│       └── tasks
│           └── main.yml
└── site_test.yml

inventories/hosts
[nginx]
XXX.XXX.XXX.XXX

roles/nginx/tasks/main.yml
- name: start nginx
  systemd:
    name: nginx
    state: started

site_test.yml
- name: start nginx
  hosts: nginx
  become: yes
  roles:
    - nginx
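For reference, the start nginx task boils down to "start the unit only if it is not already running". The following is a minimal Python sketch of that idea, and it is not how Ansible's systemd module is implemented. The names ensure_started and run_command are made up for this illustration.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argument vector and returns the command's exit code.
Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(list(argv), check=False).returncode


def ensure_started(unit: str, run: Runner = run_command) -> str:
    """Start `unit` only when it is not already active.

    Returns "ok" when nothing had to be done and "changed" when the unit
    was started, borrowing the wording of Ansible's task results.
    """
    # `systemctl is-active --quiet` exits with 0 only when the unit is active.
    if run(["systemctl", "is-active", "--quiet", unit]) == 0:
        return "ok"
    if run(["sudo", "systemctl", "start", unit]) != 0:
        raise RuntimeError(f"failed to start {unit}")
    return "changed"
```

Calling ensure_started("nginx") on a host where Nginx is stopped would return "changed"; a second call would return "ok", which is the same idempotent behavior the playbook relies on.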
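.gitlab-ci.yml configuration
Create a user named ansible and run the playbook as that user.
The ansible user's public key also has to be registered in the GCP project that hosts the Nginx server.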
stages:
  - start-service

start-service:
  stage: start-service
  tags:
    - autohealing
  script:
    - sudo ansible-playbook -i ./ansible/inventories/hosts -u ansible --private-key=/home/ansible/.ssh/id_rsa ./ansible/site_test.yml
Installing GitLab Runner
Install it by following the official documentation.
$ sudo curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.rpm.sh | sudo bash
$ sudo yum install gitlab-runner
# Verify that the service is running
$ systemctl status gitlab-runner
● gitlab-runner.service - GitLab Runner
Loaded: loaded (/etc/systemd/system/gitlab-runner.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2018-10-20 01:04:37 JST; 11h ago
Main PID: 25635 (gitlab-runner)
CGroup: /system.slice/gitlab-runner.service
└─25635 /usr/lib/gitlab-runner/gitlab-runner run --working-directory /home/gitlab-runner --config /etc/gitlab-runner/config.toml --service gitla...
Grant sudo privileges to gitlab-runner.
$ sudo visudo
# Append the following line at the end
gitlab-runner ALL=(ALL) NOPASSWD: ALL
Registering GitLab Runner
Following the official documentation, register the runner interactively as shown below.
$ sudo gitlab-runner register
Running in system-mode.
Please enter the gitlab-ci coordinator URL (e.g. https://gitlab.com/):
https://example.gitlab.com/
Please enter the gitlab-ci token for this runner:
XXXXXXXXXXXXXXX
Please enter the gitlab-ci description for this runner:
[ansible-server] : autohealing
Please enter the gitlab-ci tags for this runner (comma separated):
autohealing
Registering runner... succeeded runner=yJC3cma3
Please enter the executor: docker+machine, docker-ssh+machine, kubernetes, docker, virtualbox, shell, ssh, docker-ssh, parallels:
shell
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!

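Creating .netrc
To avoid being prompted for credentials on operations such as git pull, it is a good idea to create a .netrc file in advance. In the example below, ci-user and ci-password are placeholders for your own CI account.
$ su - gitlab-runner
$ vi .netrc
machine gitlab.com
login ci-user
password ci-password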
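If you want to confirm that the .netrc file parses and contains an entry for the GitLab host, Python's standard netrc module can read it. This is only a sketch; the function name lookup_credentials is an assumption of this example, and the path and host are passed in by the caller.

```python
import netrc
from typing import Optional, Tuple


def lookup_credentials(netrc_path: str, host: str) -> Optional[Tuple[str, str]]:
    """Return (login, password) for `host`, or None when there is no entry."""
    entry = netrc.netrc(netrc_path).authenticators(host)
    if entry is None:
        return None
    # authenticators() returns (login, account, password); account is unused here.
    login, _account, password = entry
    return login, password
```

For example, lookup_credentials("/home/gitlab-runner/.netrc", "gitlab.com") would return the login and password stored above. Since the file holds a plain-text password, keeping it readable only by its owner (chmod 600) is a common precaution.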
Setting up a pipeline trigger
Create a trigger in GitLab and obtain the token to use for the webhook.
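GitLab documents a webhook-style trigger URL of the form /api/v4/projects/<project_id>/ref/<ref_name>/trigger/pipeline?token=<token>. The sketch below assembles such a URL from its parts. The function name and the sample values (project ID 9, ref master, token TOKEN) are placeholders chosen for this example, not values from this setup.

```python
from urllib.parse import quote, urlencode


def build_trigger_webhook_url(base_url: str, project_id: int, ref: str, token: str) -> str:
    """Build the webhook-style pipeline trigger URL for a GitLab project."""
    # The ref is part of the path, so escape characters such as "/" in branch names.
    path = f"/api/v4/projects/{project_id}/ref/{quote(ref, safe='')}/trigger/pipeline"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'token': token})}"


if __name__ == "__main__":
    # Placeholder values; substitute your own host, project ID, and trigger token.
    print(build_trigger_webhook_url("https://example.gitlab.com/", 9, "master", "TOKEN"))
```

In this setup the host part would be the address of the proxy rather than GitLab itself, because the webhook has to pass through the proxy first.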

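Creating the GitLab proxy server
The source IP of Stackdriver alert webhooks is not fixed (Google support gave the same answer when asked).
If GitLab sits behind an IP-based firewall, the webhook notification cannot get through.
So we set up an Nginx server as a proxy and give it a fixed IP address that GitLab's firewall can allow.
Basic authentication on Nginx then provides a certain level of security.
An issue has been filed asking for the source IP to be fixable, so if you keep watching it you will know when this GitLab proxy is no longer needed. Please give it a star!
Installing Nginx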
$ sudo yum install -y nginx
Setting up basic authentication
$ sudo yum install -y httpd-tools
$ sudo htpasswd -c /etc/nginx/.htpasswd <any-user-id>
New password: <any-password>
Re-type new password: <any-password>
$ sudo vi /etc/nginx/nginx.conf
# Inside the server { ... } block:
auth_basic "Restricted";
auth_basic_user_file /etc/nginx/.htpasswd;
location / {
    proxy_pass https://<gitlab-address>/;
}
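With basic authentication in front of the proxy, every caller has to send an Authorization header built from the ID and password registered with htpasswd. The sketch below shows how that header is formed and prepares a POST request without sending it. The function names and the URL and credentials passed by the caller are assumptions of this example.

```python
import base64
import urllib.request


def basic_auth_header(user: str, password: str) -> str:
    """Return the Authorization header value for HTTP basic authentication."""
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_proxy_request(url: str, user: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the proxy with credentials attached."""
    req = urllib.request.Request(url, data=b"{}", method="POST")
    req.add_header("Authorization", basic_auth_header(user, password))
    req.add_header("Content-Type", "application/json")
    return req
```

Passing the prepared request to urllib.request.urlopen would be a quick way to test the proxy by hand; without valid credentials Nginx should answer with 401.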
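Building the monitored server
Next we build the server to be monitored. If Nginx on this server goes down, Stackdriver detects it and notifies GitLab, which in turn triggers the GitLab Runner.
Installing Nginx
Install Nginx on GCE in whatever way suits you.
$ sudo yum install -y nginx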
$ systemctl status nginx
● nginx.service - The nginx HTTP and reverse proxy server
Loaded: loaded (/usr/lib/systemd/system/nginx.service; disabled; vendor preset: disabled)
Active: active (running) since Sat 2018-10-20 04:29:30 UTC; 42min ago
Process: 1726 ExecStart=/usr/sbin/nginx (code=exited, status=0/SUCCESS)
Process: 1723 ExecStartPre=/usr/sbin/nginx -t (code=exited, status=0/SUCCESS)
Process: 1722 ExecStartPre=/usr/bin/rm -f /run/nginx.pid (code=exited, status=0/SUCCESS)
Main PID: 1728 (nginx)
CGroup: /system.slice/nginx.service
├─1728 nginx: master process /usr/sbin/nginx
└─1729 nginx: worker process
Installing the Stackdriver agent
The Stackdriver Monitoring agent must be installed in order to collect process metrics.
$ curl -sSO https://dl.google.com/cloudagents/install-monitoring-agent.sh
$ sudo bash install-monitoring-agent.sh
# Verify that the service is running
$ systemctl status stackdriver-agent
● stackdriver-agent.service - LSB: start and stop Stackdriver Agent
Loaded: loaded (/etc/rc.d/init.d/stackdriver-agent; bad; vendor preset: disabled)
Active: active (running) since Sat 2018-10-20 11:27:42 UTC; 37s ago
Docs: man:systemd-sysv-generator(8)
Process: 9464 ExecStart=/etc/rc.d/init.d/stackdriver-agent start (code=exited, status=0/SUCCESS)
Main PID: 9495 (stackdriver-col)
CGroup: /system.slice/stackdriver-agent.service
└─9495 /opt/stackdriver/collectd/sbin/stackdriver-collectd -C /etc/stackdriver/collectd.conf -P /var/run/stackdriver-agent.pid
Stackdriver alert configuration
Webhook settings

Alerting policy settings

Notification settings

Trying it out
Stopping Nginx
$ sudo systemctl stop nginx
$ systemctl status nginx
● nginx.service - The nginx HTTP and reverse proxy server
Loaded: loaded (/usr/lib/systemd/system/nginx.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Oct 25 05:05:39 web-test systemd[1]: Started The nginx HTTP and reverse proxy server.
Oct 25 05:36:07 web-test systemd[1]: Stopping The nginx HTTP and reverse proxy server...
Oct 25 05:36:07 web-test systemd[1]: Stopped The nginx HTTP and reverse proxy server.
Oct 25 05:44:40 web-test systemd[1]: Starting The nginx HTTP and reverse proxy server...
Oct 25 05:44:40 web-test nginx[13630]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
Oct 25 05:44:40 web-test nginx[13630]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Oct 25 05:44:40 web-test systemd[1]: Failed to read PID from file /run/nginx.pid: Invalid argument
Oct 25 05:44:40 web-test systemd[1]: Started The nginx HTTP and reverse proxy server.
Oct 26 08:33:01 web-test systemd[1]: Stopping The nginx HTTP and reverse proxy server...
Oct 26 08:33:01 web-test systemd[1]: Stopped The nginx HTTP and reverse proxy server.

And then... the pipeline starts running.

Running with gitlab-runner 11.3.1 (0aa5179e)
on autohealing 2cd3b42d
Using Shell executor...
Running on ansible-server...
Fetching changes...
HEAD is now at 324102c 改装を変更
Checking out 324102c4 as master...
Skipping Git submodules setup
$ sudo ansible-playbook -i ./ansible/inventories/hosts -u ansible --private-key=/home/ansible/.ssh/id_rsa ./ansible/site_test.yml
PLAY [start nginx] *************************************************************
TASK [Gathering Facts] *********************************************************
ok: [XXX.XXX.XXX.XXX]
TASK [nginx : start nginx] *****************************************************
changed: [XXX.XXX.XXX.XXX]
PLAY RECAP *********************************************************************
XXX.XXX.XXX.XXX : ok=2 changed=1 unreachable=0 failed=0
Job succeeded
Checking the Nginx status... it has come back to life!
$ systemctl status nginx
● nginx.service - The nginx HTTP and reverse proxy server
Loaded: loaded (/usr/lib/systemd/system/nginx.service; disabled; vendor preset: disabled)
Active: active (running) since Fri 2018-10-26 08:41:40 UTC; 1min 16s ago
Process: 15845 ExecStart=/usr/sbin/nginx (code=exited, status=0/SUCCESS)
Process: 15842 ExecStartPre=/usr/sbin/nginx -t (code=exited, status=0/SUCCESS)
Process: 15841 ExecStartPre=/usr/bin/rm -f /run/nginx.pid (code=exited, status=0/SUCCESS)
Main PID: 15847 (nginx)
CGroup: /system.slice/nginx.service
├─15847 nginx: master process /usr/sbin/nginx
└─15848 nginx: worker process
Oct 26 08:41:39 web-test systemd[1]: Starting The nginx HTTP and reverse proxy server...
Oct 26 08:41:40 web-test nginx[15842]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
Oct 26 08:41:40 web-test nginx[15842]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Oct 26 08:41:40 web-test systemd[1]: Failed to read PID from file /run/nginx.pid: Invalid argument
Oct 26 08:41:40 web-test systemd[1]: Started The nginx HTTP and reverse proxy server.

Future work
When sending the notification to GitLab, I would like to attach variables so that the Ansible playbook can be adjusted dynamically.
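One possible direction, sketched here and not something this setup does yet: GitLab's trigger URL accepts variables[NAME]=value query parameters, which become environment variables in the triggered job. The example below adds such parameters to the webhook URL. TARGET_SERVICE is a hypothetical variable name, and the function name is made up for this illustration.

```python
from typing import Mapping
from urllib.parse import quote, urlencode


def build_trigger_url_with_variables(
    base_url: str, project_id: int, ref: str, token: str, variables: Mapping[str, str]
) -> str:
    """Build a pipeline trigger URL that also passes CI variables."""
    query = [("token", token)]
    # Each variable is sent as variables[NAME]=value; urlencode escapes the brackets.
    query += [(f"variables[{name}]", value) for name, value in variables.items()]
    path = f"/api/v4/projects/{project_id}/ref/{quote(ref, safe='')}/trigger/pipeline"
    return f"{base_url.rstrip('/')}{path}?{urlencode(query)}"
```

Inside the job, the value could then be handed to the playbook with something like ansible-playbook --extra-vars "target_service=$TARGET_SERVICE", assuming the role is rewritten to use that variable. Because the webhook URL is fixed when it is registered, this would likely mean one webhook per service.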