Pitfalls when starting a daemon from ngx_mruby

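I originally meant this to be the day 7 post, but day 22 was empty, so I moved it here. @udzura

Running a command from ngx_mruby to start a daemon is handy, because the work can then proceed asynchronously. Starting it with Kernel#system, however, comes with some pitfalls. This post walks through them and introduces mruby-clean-spawn, an mgem that addresses the problem.

Starting a daemon from Nginx

This is a fairly niche use case. First, here is the Ruby script that will serve as the daemon.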

#!/usr/bin/ruby
Process.daemon

require 'logger'
fd = Logger.new("/tmp/test.log.#{$$}")

loop do
  fd.info "Test!! #{Time.now}"
  sleep 1
end

The ngx_mruby configuration looks like this.

server {
  listen      80 default_server;
  server_name *.example.com;
  location / {
    mruby_content_handler /var/lib/mruby/hook.rb;
  }
}

The hook script looks like this.

unless system "/bin/sh -c /usr/local/bin/daemon.rb"
  Nginx.rputs '{"status": "FAILURE"}'
else
  Nginx.rputs '{"status": "OK"}'
end

With this in place, hitting the server with curl does leave a daemon running.

ubuntu@ubuntu-zesty:~$ curl -s localhost | jq .
{
  "status": "OK"
}
ubuntu@ubuntu-zesty:~$ ps auxf
...
nginx     15206  0.0  0.1  37176  1296 ?        Ss   09:55   0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx     15207  0.0  0.3  37600  4028 ?        S    09:55   0:00  \_ nginx: worker process
nginx     15218  0.0  0.7  40548  7272 ?        Sl   09:55   0:00 /usr/bin/ruby /usr/local/bin/daemon.rb

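It looks fine, but...

The daemon has inherited Nginx's set of file descriptors as-is. fd 0 through fd 2 are of course reopened, since the process daemonizes itself, but everything else is exactly what Nginx had open...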

ubuntu@ubuntu-zesty:~$ sudo ls -l /proc/15218/fd
total 0
lrwx------ 1 nginx nginx 64 Dec 22 09:58 0 -> /dev/null
lrwx------ 1 nginx nginx 64 Dec 22 09:58 1 -> /dev/null
l-wx------ 1 nginx nginx 64 Dec 22 09:58 10 -> pipe:[48162]
l-wx------ 1 nginx nginx 64 Dec 22 09:58 11 -> /tmp/test.log.15218
lrwx------ 1 nginx nginx 64 Dec 22 09:58 2 -> /dev/null
lrwx------ 1 nginx nginx 64 Dec 22 09:58 3 -> socket:[48154]
lr-x------ 1 nginx nginx 64 Dec 22 09:58 4 -> pipe:[48161]
l-wx------ 1 nginx nginx 64 Dec 22 09:58 5 -> pipe:[48161]
lrwx------ 1 nginx nginx 64 Dec 22 09:58 6 -> socket:[46974]
lr-x------ 1 nginx nginx 64 Dec 22 09:58 7 -> pipe:[48162]
lrwx------ 1 nginx nginx 64 Dec 22 09:58 8 -> anon_inode:[eventpoll]
lrwx------ 1 nginx nginx 64 Dec 22 09:58 9 -> anon_inode:[eventfd]

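mruby's Kernel#system simply calls the system(3) function, which does none of the cleanup needed afterwards. When the parent's file descriptors are inherited, whatever the child process does with them becomes a risk.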
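The inheritance can be reproduced outside Nginx. The following is a minimal sketch in CRuby, not mruby or ngx_mruby code. CRuby marks the descriptors it opens close-on-exec by default, so the flag is cleared by hand here to mimic a descriptor opened by a C program such as Nginx. The method name and the message written by the child are made up for the illustration.

```ruby
require 'rbconfig'

# Returns whatever a child started via Kernel#system managed to write into
# a pipe that was never explicitly handed to it.
def message_from_leaked_fd
  r, w = IO.pipe
  # Mimic a descriptor opened by plain C code: no close-on-exec flag.
  w.close_on_exec = false

  # The child only knows the descriptor *number*. It can use it because the
  # open descriptor itself survived fork and exec.
  child_code = "IO.for_fd(#{w.fileno}).syswrite('written by the child')"
  system(RbConfig.ruby, "-e", child_code, close_others: false)

  w.close   # drop the parent's copy so that r.read can reach EOF
  r.read
ensure
  r.close if r && !r.closed?
end

puts message_from_leaked_fd
```

If the descriptor is inherited as described above, the parent reads back the text that the child wrote. For Nginx, the equivalent would be a child that can touch the listening sockets and internal pipes.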

clean_spawn: a spawn that takes care of close-on-exec

So let's rewrite the hook to use clean_spawn.

unless clean_spawn "/bin/sh", "-c", "/usr/local/bin/daemon.rb"
  Nginx.rputs '{"status": "FAILURE"}'
else
  Nginx.rputs '{"status": "OK"}'
end

For a daemon started from this hook, the state of the fds is visibly different.

nginx     15271  0.0  0.7  40504  7428 ?        Sl   10:02   0:00 /usr/bin/ruby /usr/local/bin/daemon.rb
ubuntu@ubuntu-zesty:~$ sudo ls -l /proc/15271/fd
total 0
lrwx------ 1 nginx nginx 64 Dec 22 10:02 0 -> /dev/null
lrwx------ 1 nginx nginx 64 Dec 22 10:02 1 -> /dev/null
lrwx------ 1 nginx nginx 64 Dec 22 10:02 2 -> /dev/null
lr-x------ 1 nginx nginx 64 Dec 22 10:02 3 -> pipe:[48379]
l-wx------ 1 nginx nginx 64 Dec 22 10:02 4 -> pipe:[48379]
lr-x------ 1 nginx nginx 64 Dec 22 10:02 5 -> pipe:[48380]
l-wx------ 1 nginx nginx 64 Dec 22 10:02 6 -> pipe:[48380]
l-wx------ 1 nginx nginx 64 Dec 22 10:02 7 -> /tmp/test.log.15271

Pipes 3 through 6 are created by CRuby itself, and 7 is the log file opened inside the script. In other words, no file descriptors from Nginx remain.

More specifically, inside clean_spawn the work around the exec, including the close-on-exec handling, is done with care.
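The general pattern can be sketched in CRuby as follows. This is an illustration under assumptions, not the mgem's actual implementation: spawn_with_cloexec and leaks_to_child? are hypothetical names, and the upper bound of 255 on descriptor numbers is arbitrary. The check relies on a pipe only reaching EOF once every copy of its write end has been closed.

```ruby
# Hypothetical helper for illustration; this is not the mgem's actual code.
# Forks, flags every descriptor from 3 upward close-on-exec, then execs.
def spawn_with_cloexec(*argv)
  fork do
    (3..255).each do |n|   # 255 is an arbitrary upper bound for the sketch
      begin
        IO.for_fd(n, autoclose: false).close_on_exec = true
      rescue StandardError
        # Descriptor n is not open, or is reserved by the Ruby VM; skip it.
      end
    end
    exec(*argv)
  end
end

# Reports whether the write end of a pipe leaks into the process started by
# the given block. The block must return the child's PID.
def leaks_to_child?
  r, w = IO.pipe
  w.close_on_exec = false   # mimic a descriptor opened by plain C code
  pid = yield
  w.close
  # With every other copy of w closed, r reaches EOF at once. If the child
  # still holds a copy, nothing becomes readable until the child exits.
  leaked = IO.select([r], nil, nil, 1).nil?
  Process.kill("TERM", pid)
  Process.wait(pid)
  r.close
  leaked
end

puts leaks_to_child? { fork { exec("sleep", "5") } }
puts leaks_to_child? { spawn_with_cloexec("sleep", "5") }
```

The first call uses a bare fork and exec and is expected to report a leak. The second goes through the helper and should not.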

A further pitfall

There is one more pitfall, though: restart Nginx in this state and the daemon disappears completely...

ubuntu@ubuntu-zesty:~$ sudo systemctl restart nginx                                                                                               
ubuntu@ubuntu-zesty:~$ ps auxf | grep ruby
ubuntu   15298  0.0  0.1  12960  1028 pts/0    S+   10:06   0:00              \_ grep --color=auto ruby

Who is doing the killing? gdb can track that down, so I'll attach it to the daemon.

nginx     15360  0.0  0.7  40580  7592 ?        Sl   10:07   0:00 /usr/bin/ruby /usr/local/bin/daemon.rb                                           
ubuntu@ubuntu-zesty:~$ sudo gdb -p 15360                                 
GNU gdb (Ubuntu 7.12.50.20170314-0ubuntu1.1) 7.12.50.20170314-git        
Copyright (C) 2017 Free Software Foundation, Inc.                        
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>                                                                     
This is free software: you are free to change and redistribute it.       
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"                                                                        
and "show warranty" for details.    
This GDB was configured as "x86_64-linux-gnu".                           
Type "show configuration" for configuration details.                     
For bug reporting instructions, please see:                              
<http://www.gnu.org/software/gdb/bugs/>.                                 
Find the GDB manual and other documentation resources online at:         
<http://www.gnu.org/software/gdb/documentation/>.                        
For help, type "help".              
Type "apropos word" to search for commands related to "word".            
Attaching to process 15360          
[New LWP 15362]                     
[Thread debugging using libthread_db enabled]                            
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".                                                                        
pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225                                          
225     ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No such file or directory.                                                    
(gdb) c                             
Continuing.

In this state, running "systemctl restart nginx" from another terminal gives the following. The signal turns out to be SIGTERM, and it is sent by PID 1, that is, systemd.

Thread 2 "ruby-timer-thr" received signal SIGTERM, Terminated.           
[Switching to Thread 0x7f7fef5d7700 (LWP 15362)]                         
0x00007f7feec6cd8d in poll () at ../sysdeps/unix/syscall-template.S:84   
84      ../sysdeps/unix/syscall-template.S: No such file or directory.   
(gdb) ptype $_siginfo               
type = struct {
    int si_signo;
    int si_errno;
    int si_code;
    union {
        int _pad __attribute__ ((vector_size(28)));
        struct {...} _kill;
        struct {...} _timer;
        struct {...} _rt;
        struct {...} _sigchld;
        struct {...} _sigfault;
        struct {...} _sigpoll;
    } _sifields;
}
(gdb) p $_siginfo._sifields._sigchld.si_pid
$1 = 1 <- !!!

What does this mean? It happens because the daemon is in the same cgroup as the original nginx service.

nginx     15703  0.0  0.1  37176  1296 ?        Ss   10:12   0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx     15705  0.0  0.3  37600  3972 ?        S    10:12   0:00  \_ nginx: worker process
nginx     15718  0.0  0.7  40488  7340 ?        Sl   10:14   0:00 /usr/bin/ruby /usr/local/bin/daemon.rb
ubuntu@ubuntu-zesty:~$ sudo grep systemd /proc/15703/cgroup 
1:name=systemd:/system.slice/nginx.service
ubuntu@ubuntu-zesty:~$ sudo grep systemd /proc/15718/cgroup 
1:name=systemd:/system.slice/nginx.service

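On restart, systemd consults this name=systemd hierarchy. Even for processes that the unit file does not manage directly, systemd sends SIGTERM and then SIGKILL to every process belonging to the group. See slide 29 for reference.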
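The comparison done by hand above can also be scripted. Below is a small sketch that assumes the cgroup v1 layout shown in the output above, where each line of /proc/<pid>/cgroup has the form id:controllers:path. Both method names are hypothetical.

```ruby
# Extracts the path in the name=systemd hierarchy from the text of a
# /proc/<pid>/cgroup file (cgroup v1 layout, as in the output above).
def systemd_cgroup_path(cgroup_text)
  cgroup_text.each_line do |line|
    _id, controllers, path = line.chomp.split(":", 3)
    return path if controllers == "name=systemd"
  end
  nil
end

# Hypothetical convenience wrapper: true when both PIDs sit in the same
# systemd cgroup, which is the situation that got the daemon killed.
def same_systemd_cgroup?(pid_a, pid_b)
  a = systemd_cgroup_path(File.read("/proc/#{pid_a}/cgroup"))
  b = systemd_cgroup_path(File.read("/proc/#{pid_b}/cgroup"))
  !a.nil? && a == b
end

sample = "1:name=systemd:/system.slice/nginx.service\n"
puts systemd_cgroup_path(sample)
```

With the PIDs from the session above, same_systemd_cgroup?(15703, 15718) would be the check to run. Reading another user's /proc entries may require root.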

Changing the systemd cgroup

For this reason, clean_spawn has an option that moves the daemon into a different cgroup.

CleanSpawn.cgroup_root_path = "/sys/fs/cgroup/systemd"
unless clean_spawn "/bin/sh", "-c", "/usr/local/bin/daemon.rb"
  Nginx.rputs '{"status": "FAILURE"}'
else
  Nginx.rputs '{"status": "OK"}'
end

For now, this option does not work unless Nginx is started as root.

With this setting, restarting Nginx no longer affects the daemon. What a relief...

ubuntu@ubuntu-zesty:~$ sudo systemctl restart nginx          
ubuntu@ubuntu-zesty:~$ ps auxf | grep ruby
ubuntu   15820  0.0  0.0  12960   936 pts/0    S+   10:18   0:00  |           \_ grep --color=auto ruby
root     15789  0.0  0.7  40500  7356 ?        Sl   10:17   0:00 /usr/bin/ruby /usr/local/bin/daemon.rb
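As a rough picture of what moving a process between cgroups involves: under cgroup v1, a process is moved by writing its PID to the cgroup.procs file of the target directory. The sketch below assumes that mechanism. It is not taken from the mgem, and move_to_cgroup is a hypothetical name. It writes to a temporary directory rather than /sys/fs/cgroup/systemd, so it can be tried without root.

```ruby
require 'tmpdir'

# Hypothetical helper: moves a process into the cgroup rooted at cgroup_dir
# by writing its PID to that directory's cgroup.procs file (cgroup v1).
def move_to_cgroup(pid, cgroup_dir)
  File.open(File.join(cgroup_dir, "cgroup.procs"), "w") { |f| f.puts(pid) }
end

# Dry run against a temporary directory instead of /sys/fs/cgroup/systemd,
# so the sketch can be tried without root privileges.
Dir.mktmpdir do |dir|
  move_to_cgroup(Process.pid, dir)
  puts File.read(File.join(dir, "cgroup.procs"))
end
```

Writing to the real cgroup.procs file normally requires root, which fits the restriction noted above.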

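That was an introduction to an mgem for a rather specialized ngx_mruby task.

What I set out to do was a bit geeky, but I picked up some fairly important Linux/UNIX knowledge along the way, so I wrote it down. That's all for today...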