Building an auto-recovery mechanism with Stackdriver + GitLab CI + Ansible

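Introduction

We monitor processes on GCE with Stackdriver and send an alert to Slack when a process goes down.

But no! Sending the alert is not the goal; getting the service back up as quickly as possible is.

In any case, once an alert fires, all we actually do is check the state with systemctl status XXXXX and then start the service with sudo systemctl start XXXXX.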
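
The manual routine above is small enough to sketch as a shell helper. This is only an illustration of the check-then-start step, not part of the setup built in this article; the function name ensure_running is made up for the example.

```shell
#!/bin/sh
# ensure_running is a hypothetical helper name, not something from this setup.
# It checks a systemd unit and starts it only when it is not active.
ensure_running() {
  unit="$1"
  # "systemctl is-active --quiet" exits 0 only when the unit is active.
  if systemctl is-active --quiet "$unit"; then
    echo "$unit is already running"
  else
    echo "$unit is not running; starting it"
    sudo systemctl start "$unit"
  fi
}

# Example: ensure_running nginx
```

The rest of this article automates the same two steps, with Ansible doing the start and Stackdriver deciding when to run it.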

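So I decided to combine Stackdriver, GitLab CI, and Ansible to build a mechanism that not only sends an alert but also restores the service.

A teammate has written an article about building a nice auto-recovery mechanism with a Slack bot, so if you would rather do this with a Slack bot, the link below should help.
Auto-recovering services with Stackdriver and slackbot
https://qiita.com/andromeda/items/fcb2ea02e9bb32e329e4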

Overall architecture

[Figure: overall architecture diagram (autohealing (2).png)]

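Note

As described later, the Stackdriver webhook goes through a proxy because its source IP cannot be fixed, so it cannot be added to the firewall rules in front of GitLab.
To keep a reasonable level of security, the proxy uses basic authentication.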

Prerequisites

    • GitLab itself is already installed and running
    • A GCP project is available

Building the GitLab Runner + Ansible server

Create a GCE instance as appropriate.
This time I used a CentOS 7 image.

Installing Ansible

Install it however you like.

sudo yum install ansible

Ansible files

The layout is very simple:

$ tree
.
├── inventories
│   └── hosts
├── roles
│   └── nginx
│       └── tasks
│           └── main.yml
└── site_test.yml

inventories/hosts:

[nginx]
XXX.XXX.XXX.XXX

roles/nginx/tasks/main.yml:

- name: start nginx
  systemd:
    name: nginx
    state: started

site_test.yml:

- name: start nginx
  hosts: nginx
  become: yes
  roles:
    - nginx

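.gitlab-ci.yml configuration

Create a user named ansible and run the playbook as that user.
In addition, the ansible user's public key must be registered in the GCP project that hosts the Nginx server.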

stages:
  - start-service

start-service:
  stage: start-service
  tags:
    - autohealing
  script:
    - sudo ansible-playbook -i ./ansible/inventories/hosts -u ansible --private-key=/home/ansible/.ssh/id_rsa ./ansible/site_test.yml

Installing GitLab Runner

Install it by following the official documentation.

$ sudo curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.rpm.sh | sudo bash

$ sudo yum install gitlab-runner

# Confirm that it is running
$ systemctl status gitlab-runner
● gitlab-runner.service - GitLab Runner
   Loaded: loaded (/etc/systemd/system/gitlab-runner.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2018-10-20 01:04:37 JST; 11h ago
 Main PID: 25635 (gitlab-runner)
   CGroup: /system.slice/gitlab-runner.service
           └─25635 /usr/lib/gitlab-runner/gitlab-runner run --working-directory /home/gitlab-runner --config /etc/gitlab-runner/config.toml --service gitla...

Grant sudo privileges to gitlab-runner.

$ sudo visudo
# Add the following line at the end
gitlab-runner ALL=(ALL) NOPASSWD: ALL

Registering GitLab Runner

Following the official documentation, register the runner interactively as shown below.

$ sudo gitlab-runner register
Running in system-mode.                            

Please enter the gitlab-ci coordinator URL (e.g. https://gitlab.com/):
https://example.gitlab.com/
Please enter the gitlab-ci token for this runner:
XXXXXXXXXXXXXXX
Please enter the gitlab-ci description for this runner:
[ansible-server] : autohealing
Please enter the gitlab-ci tags for this runner (comma separated):
autohealing
Registering runner... succeeded                     runner=yJC3cma3
Please enter the executor: docker+machine, docker-ssh+machine, kubernetes, docker, virtualbox, shell, ssh, docker-ssh, parallels:
shell
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded! 
[Screenshot: GitLab CI / CD Settings page for blcloud-infra / blcloud-autohealing, 2018-10-20]

Creating .netrc

To avoid the hassle of being asked for credentials on operations such as git pull, create a .netrc file in advance.

$ su - gitlab-runner
$ vi .netrc
machine gitlab.com
login ci-user (your choice)
password ci-password (your choice)

Setting up a pipeline trigger

Set up a trigger in GitLab and obtain the token to use in the webhook.

[Screenshot: pipeline trigger settings in GitLab, 2018-10-20]
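
For reference, GitLab documents a webhook-style URL for pipeline triggers in which the ref and the token are part of the URL itself, which suits a sender such as Stackdriver that only POSTs to a fixed address. The sketch below just assembles that URL. The host, project ID, and token are placeholders, and build_trigger_url is a name invented for this example.

```shell
#!/bin/sh
# build_trigger_url is a hypothetical helper; all arguments are placeholders.
# It prints the webhook-style pipeline trigger URL:
#   <base>/api/v4/projects/<id>/ref/<ref>/trigger/pipeline?token=<token>
build_trigger_url() {
  base_url="${1%/}"   # drop one trailing slash, if present
  project_id="$2"
  ref="$3"
  token="$4"
  printf '%s/api/v4/projects/%s/ref/%s/trigger/pipeline?token=%s\n' \
    "$base_url" "$project_id" "$ref" "$token"
}

# Example (placeholder values):
#   build_trigger_url https://example.gitlab.com/ 42 master XXXXXXXX
```

To try the trigger by hand before wiring up Stackdriver, something like curl -X POST "$(build_trigger_url ...)" should start a pipeline. When the request goes through a proxy with basic authentication, use the proxy's address as the base URL and add -u user:password to curl.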

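Building the GitLab proxy server

The source IP of Stackdriver alert webhooks is not fixed (Google support confirmed this when I asked).
If GitLab sits behind an IP-based firewall, webhook notifications cannot get through.
So we set up an Nginx server as a proxy and give it a fixed IP.
Basic authentication on Nginx then provides a reasonable level of security.

A request to make the source IP fixable has been filed, so if you follow it you will know when this GitLab proxy is no longer needed. Please give it a star!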

Installing Nginx

$ sudo yum install -y nginx

Setting up basic authentication

$ sudo yum install -y httpd-tools
$ sudo htpasswd -c /etc/nginx/.htpasswd <any-id>
New password: <any-password>
Re-type new password: <any-password>

$ sudo vi /etc/nginx/nginx.conf

        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;

        location / {
            proxy_pass https://<GitLab address>/;
        }

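Building the monitored server

Next we build the server to be monitored. If Nginx on this server goes down, Stackdriver detects it and notifies GitLab, which triggers GitLab Runner.

Installing Nginx

Install Nginx on GCE however you like.

$ sudo yum install -y nginx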
$ systemctl status nginx
● nginx.service - The nginx HTTP and reverse proxy server
   Loaded: loaded (/usr/lib/systemd/system/nginx.service; disabled; vendor preset: disabled)
   Active: active (running) since Sat 2018-10-20 04:29:30 UTC; 42min ago
  Process: 1726 ExecStart=/usr/sbin/nginx (code=exited, status=0/SUCCESS)
  Process: 1723 ExecStartPre=/usr/sbin/nginx -t (code=exited, status=0/SUCCESS)
  Process: 1722 ExecStartPre=/usr/bin/rm -f /run/nginx.pid (code=exited, status=0/SUCCESS)
 Main PID: 1728 (nginx)
   CGroup: /system.slice/nginx.service
           ├─1728 nginx: master process /usr/sbin/nginx
           └─1729 nginx: worker process

Installing the Stackdriver agent

The Stackdriver Monitoring agent must be installed in order to collect process metrics.

$ curl -sSO https://dl.google.com/cloudagents/install-monitoring-agent.sh
$ sudo bash install-monitoring-agent.sh

# Confirm that it is running
$ systemctl status stackdriver-agent
● stackdriver-agent.service - LSB: start and stop Stackdriver Agent
   Loaded: loaded (/etc/rc.d/init.d/stackdriver-agent; bad; vendor preset: disabled)
   Active: active (running) since Sat 2018-10-20 11:27:42 UTC; 37s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 9464 ExecStart=/etc/rc.d/init.d/stackdriver-agent start (code=exited, status=0/SUCCESS)
 Main PID: 9495 (stackdriver-col)
   CGroup: /system.slice/stackdriver-agent.service
           └─9495 /opt/stackdriver/collectd/sbin/stackdriver-collectd -C /etc/stackdriver/collectd.conf -P /var/run/stackdriver-agent.pid

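Stackdriver alert configuration

Webhook settings

[Screenshot: Stackdriver webhook settings, 2018-10-25]

Alert settings

[Screenshot: Stackdriver alert settings, 2018-10-25]

Notification settings

[Screenshot: Stackdriver notification settings, 2018-10-25]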

Trying it out

Stopping Nginx

$ sudo systemctl stop nginx
$ systemctl status nginx
● nginx.service - The nginx HTTP and reverse proxy server
   Loaded: loaded (/usr/lib/systemd/system/nginx.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Oct 25 05:05:39 web-test systemd[1]: Started The nginx HTTP and reverse proxy server.
Oct 25 05:36:07 web-test systemd[1]: Stopping The nginx HTTP and reverse proxy server...
Oct 25 05:36:07 web-test systemd[1]: Stopped The nginx HTTP and reverse proxy server.
Oct 25 05:44:40 web-test systemd[1]: Starting The nginx HTTP and reverse proxy server...
Oct 25 05:44:40 web-test nginx[13630]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
Oct 25 05:44:40 web-test nginx[13630]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Oct 25 05:44:40 web-test systemd[1]: Failed to read PID from file /run/nginx.pid: Invalid argument
Oct 25 05:44:40 web-test systemd[1]: Started The nginx HTTP and reverse proxy server.
Oct 26 08:33:01 web-test systemd[1]: Stopping The nginx HTTP and reverse proxy server...
Oct 26 08:33:01 web-test systemd[1]: Stopped The nginx HTTP and reverse proxy server.
[Screenshot: 2018-10-26 17.41.43.png]

And then... the pipeline starts running.

[Screenshot: GitLab Pipelines page for blcloud-infra / blcloud-autohealing, 2018-10-25]
Running with gitlab-runner 11.3.1 (0aa5179e)
  on autohealing 2cd3b42d
Using Shell executor...
Running on ansible-server...
Fetching changes...
HEAD is now at 324102c 改装を変更
Checking out 324102c4 as master...
Skipping Git submodules setup
$ sudo ansible-playbook -i ./ansible/inventories/hosts -u ansible --private-key=/home/ansible/.ssh/id_rsa ./ansible/site_test.yml

PLAY [start nginx] *************************************************************

TASK [Gathering Facts] *********************************************************
ok: [XXX.XXX.XXX.XXX]

TASK [nginx : start nginx] *****************************************************
changed: [XXX.XXX.XXX.XXX]

PLAY RECAP *********************************************************************
XXX.XXX.XXX.XXX             : ok=2    changed=1    unreachable=0    failed=0   

Job succeeded

Checking the Nginx status... it has come back to life!

$ systemctl status nginx
● nginx.service - The nginx HTTP and reverse proxy server
   Loaded: loaded (/usr/lib/systemd/system/nginx.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2018-10-26 08:41:40 UTC; 1min 16s ago
  Process: 15845 ExecStart=/usr/sbin/nginx (code=exited, status=0/SUCCESS)
  Process: 15842 ExecStartPre=/usr/sbin/nginx -t (code=exited, status=0/SUCCESS)
  Process: 15841 ExecStartPre=/usr/bin/rm -f /run/nginx.pid (code=exited, status=0/SUCCESS)
 Main PID: 15847 (nginx)
   CGroup: /system.slice/nginx.service
           ├─15847 nginx: master process /usr/sbin/nginx
           └─15848 nginx: worker process

Oct 26 08:41:39 web-test systemd[1]: Starting The nginx HTTP and reverse proxy server...
Oct 26 08:41:40 web-test nginx[15842]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
Oct 26 08:41:40 web-test nginx[15842]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Oct 26 08:41:40 web-test systemd[1]: Failed to read PID from file /run/nginx.pid: Invalid argument
Oct 26 08:41:40 web-test systemd[1]: Started The nginx HTTP and reverse proxy server.
[Screenshot: 2018-10-26 18.08.07.png]

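Things I want to do in the future

When the notification is sent to GitLab, I would like to attach variables so that the Ansible playbook can be adjusted dynamically.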
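
One possible direction, sketched under assumptions: GitLab pipeline triggers accept variables[KEY]=VALUE parameters, which appear in the job as environment variables, and ansible-playbook accepts --extra-vars. The trigger variable TARGET_SERVICE, the Ansible variable service_name, and the helper build_extra_vars are all hypothetical; the playbook in this article does not use them yet.

```shell
#!/bin/sh
# build_extra_vars is a hypothetical helper. It turns a service name received
# from the trigger into an --extra-vars argument, rejecting empty values and
# anything outside a conservative set of unit-name characters.
build_extra_vars() {
  service="$1"
  case "$service" in
    ''|*[!A-Za-z0-9_.@-]*)
      echo "invalid service name: $service" >&2
      return 1
      ;;
  esac
  printf 'service_name=%s\n' "$service"
}

# Possible use inside the CI job (TARGET_SERVICE would come from the trigger):
#   extra_vars=$(build_extra_vars "$TARGET_SERVICE") || exit 1
#   sudo ansible-playbook -i ./ansible/inventories/hosts -u ansible \
#     --private-key=/home/ansible/.ssh/id_rsa \
#     --extra-vars "$extra_vars" ./ansible/site_test.yml
```

Validating the value before handing it to ansible-playbook matters here because it arrives from outside through a webhook. The role would also need to read service_name instead of the hard-coded nginx.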
