process-exporter did not work out that well for me

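Introduction

Hello everyone. How do you monitor processes? Prometheus has an exporter called process-exporter, and this post is my memo of what I found when I tried it out.

If you are actually running this exporter and getting good use out of it, please let me know!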

Verification

The test environment was built locally with Vagrant. It is a bit old, but I used Ubuntu 14.04.6 LTS.

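Installation

I followed the article below as a reference.
https://qiita.com/nekoneck/items/a9deab623da277afc4be

The GitHub repository is here:
https://github.com/ncabatoff/process-exporter/blob/master/README.md

This time I installed from the binary release rather than using Docker.

First, download and extract the archive by hand.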

$ curl -LO https://github.com/ncabatoff/process-exporter/releases/download/v0.2.11/process-exporter_0.2.11_linux_amd64.tar.gz 
$ tar xvf process-exporter_0.2.11_linux_amd64.tar.gz

This extracts the following files.

$ ls
LICENSE  README.md  process-exporter  process-exporter_0.2.11_linux_amd64.tar.gz

Startup

process-exporter is a single executable. It is run with flags, so let's see which options are available.

$ ./process-exporter -h
Usage of ./process-exporter:
  -children
        if a proc is tracked, track with it any children that aren't part of their own group (default true)
  -config.path string
        path to YAML config file
  -man
        print manual
  -namemapping string
        comma-seperated list, alternating process name and capturing regex to apply to cmdline
  -once-to-stdout
        Don't bind, instead just print the metrics once to stdout and exit
  -procfs string
        path to read proc data from (default "/proc")
  -procnames string
        comma-seperated list of process names to monitor
  -recheck
        recheck process names on each scrape
  -threads
        report on per-threadname metrics as well
  -web.listen-address string
        Address on which to expose metrics and web interface. (default ":9256")
  -web.telemetry-path string
        Path under which to expose metrics. (default "/metrics")

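Basically, you either load an external config file or run it with the -procnames flag.

I found this configuration a bit hard to understand, which is why I came away with mixed feelings this time.

First, let's look at the processes to monitor. I used the PostgreSQL processes as the target.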

$ ps -ef |grep -v grep | grep postgres
postgres 21127     1  0 Apr15 ?        00:00:00 /usr/lib/postgresql/9.3/bin/postgres -D /var/lib/postgresql/9.3/main -c config_file=/etc/postgresql/9.3/main/postgresql.conf
postgres 21129 21127  0 Apr15 ?        00:00:00 postgres: checkpointer process
postgres 21130 21127  0 Apr15 ?        00:00:00 postgres: writer process
postgres 21131 21127  0 Apr15 ?        00:00:00 postgres: wal writer process
postgres 21132 21127  0 Apr15 ?        00:00:00 postgres: autovacuum launcher process
postgres 21133 21127  0 Apr15 ?        00:00:01 postgres: stats collector process

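I wrote a config file. According to the README on GitHub, comm is taken from the second field of /proc/{pid}/stat. The README excerpt comes first, followed by my config.yaml.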

  # comm is the second field of /proc/<pid>/stat minus parens.
  # It is the base executable name, truncated at 15 chars.  
  # It cannot be modified by the program, unlike exe.
  - comm:
    - bash

$ cat config.yaml
process_names:
  - comm:
    - postgresql

Now let's run it.

$ ./process-exporter -config.path config.yaml
2019/04/16 05:02:48 Reading metrics from /proc based on "config.yaml"

Let's hit the endpoint.

$ curl http://192.168.33.31:9256/metrics |grep postgresql
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7215  100  7215    0     0  1081k      0 --:--:-- --:--:-- --:--:-- 1174k

Hmm, nothing was picked up. Let's change the config file slightly.

$ cat config.yaml
process_names:
  - comm:
    - postgres
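
The value that comm has to match can be checked before editing the config. The following is a minimal sketch, assuming a stat line in the usual Linux layout (PID, then comm in parentheses, then the remaining fields); the helper name comm_from_stat and the trailing fields of the sample line are made up for illustration.

```shell
# Print the comm field (second field, without the parentheses) of one
# /proc/<pid>/stat line read from stdin.
# The greedy (.*) keeps any parentheses that are part of comm itself.
comm_from_stat() {
  sed -E 's/^[0-9]+ \((.*)\) .*$/\1/'
}

# Sample line shaped like /proc/21127/stat; the fields after the state
# letter are placeholders, not real values from the test machine.
printf '21127 (postgres) S 1 21127 21127 0 -1\n' | comm_from_stat
```

On a live host the same helper could be fed from the real file, for example comm_from_stat < /proc/21127/stat. For this process it prints postgres, not postgresql, which is why only the second config matches.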

I request the endpoint again.

$ curl http://192.168.33.31:9256/metrics |grep postgres
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres"} 0
namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres: autovacuum launcher process   "} 0
namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres: checkpointer process   "} 0
namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres: stats collector process   "} 0
namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres: wal writer process   "} 0
namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres: writer process   "} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres"} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres: autovacuum launcher process   "} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres: checkpointer process   "} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres: stats collector process   "} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres: wal writer process   "} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres: writer process   "} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres"} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres: autovacuum launcher process   "} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres: checkpointer process   "} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres: stats collector process   "} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres: wal writer process   "} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres: writer process   "} 0
namedprocess_namegroup_memory_bytes{groupname="postgres",memtype="resident"} 1.2640256e+07
namedprocess_namegroup_memory_bytes{groupname="postgres",memtype="virtual"} 2.52489728e+08
namedprocess_namegroup_memory_bytes{groupname="postgres: autovacuum launcher process   ",memtype="resident"} 2.822144e+06
namedprocess_namegroup_memory_bytes{groupname="postgres: autovacuum launcher process   ",memtype="virtual"} 2.5337856e+08
namedprocess_namegroup_memory_bytes{groupname="postgres: checkpointer process   ",memtype="resident"} 3.244032e+06
namedprocess_namegroup_memory_bytes{groupname="postgres: checkpointer process   ",memtype="virtual"} 2.52628992e+08
namedprocess_namegroup_memory_bytes{groupname="postgres: stats collector process   ",memtype="resident"} 1.794048e+06
namedprocess_namegroup_memory_bytes{groupname="postgres: stats collector process   ",memtype="virtual"} 1.04841216e+08
namedprocess_namegroup_memory_bytes{groupname="postgres: wal writer process   ",memtype="resident"} 1.695744e+06
namedprocess_namegroup_memory_bytes{groupname="postgres: wal writer process   ",memtype="virtual"} 2.52489728e+08
namedprocess_namegroup_memory_bytes{groupname="postgres: writer process   ",memtype="resident"} 2.506752e+06
namedprocess_namegroup_memory_bytes{groupname="postgres: writer process   ",memtype="virtual"} 2.52489728e+08
namedprocess_namegroup_minor_page_faults_total{groupname="postgres"} 0
namedprocess_namegroup_minor_page_faults_total{groupname="postgres: autovacuum launcher process   "} 0
namedprocess_namegroup_minor_page_faults_total{groupname="postgres: checkpointer process   "} 0
namedprocess_namegroup_minor_page_faults_total{groupname="postgres: stats collector process   "} 0
namedprocess_namegroup_minor_page_faults_total{groupname="postgres: wal writer process   "} 0
namedprocess_namegroup_minor_page_faults_total{groupname="postgres: writer process   "} 0
namedprocess_namegroup_num_procs{groupname="postgres"} 1
namedprocess_namegroup_num_procs{groupname="postgres: autovacuum launcher process   "} 1
namedprocess_namegroup_num_procs{groupname="postgres: checkpointer process   "} 1
namedprocess_namegroup_num_procs{groupname="postgres: stats collector process   "} 1
namedprocess_namegroup_num_procs{groupname="postgres: wal writer process   "} 1
namedprocess_namegroup_num_procs{groupname="postgres: writer process   "} 1
namedprocess_namegroup_num_threads{groupname="postgres"} 1
namedprocess_namegroup_num_threads{groupname="postgres: autovacuum launcher process   "} 1
namedprocess_namegroup_num_threads{groupname="postgres: checkpointer process   "} 1
namedprocess_namegroup_num_threads{groupname="postgres: stats collector process   "} 1
namedprocess_namegroup_num_threads{groupname="postgres: wal writer process   "} 1
namedprocess_namegroup_num_threads{groupname="postgres: writer process   "} 1
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres"} 1.55536655e+09
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres: autovacuum launcher process   "} 1.55536655e+09
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres: checkpointer process   "} 1.55536655e+09
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres: stats collector process   "} 1.55536655e+09
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres: wal writer process   "} 1.55536655e+09
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres: writer process   "} 1.55536655e+09
namedprocess_namegroup_open_filedesc{groupname="postgres"} 0
namedprocess_namegroup_open_filedesc{groupname="postgres: autovacuum launcher process   "} 0
namedprocess_namegroup_open_filedesc{groupname="postgres: checkpointer process   "} 0
namedprocess_namegroup_open_filedesc{groupname="postgres: stats collector process   "} 0
namedprocess_namegroup_open_filedesc{groupname="postgres: wal writer process   "} 0
namedprocess_namegroup_open_filedesc{groupname="postgres: writer process   "} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres"} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres: autovacuum launcher process   "} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres: checkpointer process   "} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres: stats collector process   "} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres: wal writer process   "} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres: writer process   "} 0
namedprocess_namegroup_states{groupname="postgres",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres",state="Running"} 0
namedprocess_namegroup_states{groupname="postgres",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="postgres",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres",state="Zombie"} 0
namedprocess_namegroup_states{groupname="postgres: autovacuum launcher process   ",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres: autovacuum launcher process   ",state="Running"} 0
namedprocess_namegroup_states{groupname="postgres: autovacuum launcher process   ",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="postgres: autovacuum launcher process   ",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres: autovacuum launcher process   ",state="Zombie"} 0
namedprocess_namegroup_states{groupname="postgres: checkpointer process   ",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres: checkpointer process   ",state="Running"} 0
namedprocess_namegroup_states{groupname="postgres: checkpointer process   ",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="postgres: checkpointer process   ",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres: checkpointer process   ",state="Zombie"} 0
namedprocess_namegroup_states{groupname="postgres: stats collector process   ",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres: stats collector process   ",state="Running"} 0
namedprocess_namegroup_states{groupname="postgres: stats collector process   ",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="postgres: stats collector process   ",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres: stats collector process   ",state="Zombie"} 0
namedprocess_namegroup_states{groupname="postgres: wal writer process   ",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres: wal writer process   ",state="Running"} 0
namedprocess_namegroup_states{groupname="postgres: wal writer process   ",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="postgres: wal writer process   ",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres: wal writer process   ",state="Zombie"} 0
namedprocess_namegroup_states{groupname="postgres: writer process   ",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres: writer process   ",state="Running"} 0
namedprocess_namegroup_states{groupname="postgres: writer process   ",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="postgres: writer process   ",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres: writer process   ",state="Zombie"} 0
namedprocess_namegroup_worst_fd_ratio{groupname="postgres"} 0
100 18856  100 18856    0     0  1769k      0 --:--:-- --:--:-- --:--:-- 1841k
namedprocess_namegroup_worst_fd_ratio{groupname="postgres: autovacuum launcher process   "} 0
namedprocess_namegroup_worst_fd_ratio{groupname="postgres: checkpointer process   "} 0
namedprocess_namegroup_worst_fd_ratio{groupname="postgres: stats collector process   "} 0
namedprocess_namegroup_worst_fd_ratio{groupname="postgres: wal writer process   "} 0
namedprocess_namegroup_worst_fd_ratio{groupname="postgres: writer process   "} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres"} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres: autovacuum launcher process   "} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres: checkpointer process   "} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres: stats collector process   "} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres: wal writer process   "} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres: writer process   "} 0

Oh! Something showed up. It looks like the postgres processes are split into groups by the groupname label.
So, well, this looks usable.
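
The full output is long, so it can help to reduce it to one metric per group. The following is a rough sketch that reads the text-format metrics on stdin; the helper name num_procs_by_group is hypothetical, and the sample lines are in the format shown in this article.

```shell
# Read Prometheus text-format metrics on stdin and print "groupname=value"
# for every namedprocess_namegroup_num_procs sample. Comment lines
# (# HELP, # TYPE) and all other metrics are dropped.
num_procs_by_group() {
  sed -nE 's/^namedprocess_namegroup_num_procs\{groupname="([^"]*)"\} (.*)$/\1=\2/p'
}

# Sample input in the same format as the output above.
printf '%s\n' \
  '# TYPE namedprocess_namegroup_num_procs gauge' \
  'namedprocess_namegroup_num_procs{groupname="postgres"} 1' \
  'namedprocess_namegroup_num_procs{groupname="postgres: writer process   "} 1' \
  | num_procs_by_group
```

Against the running exporter this could be used as curl -s http://192.168.33.31:9256/metrics | num_procs_by_group. The -s flag silences the curl progress meter that is mixed into the output above.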

Next, let's try starting it with the -procnames option.

$ ./process-exporter -procnames postgres
2019/04/16 05:15:46 Reading metrics from /proc for procnames: [postgres]

I hit the endpoint.

$ curl http://192.168.33.31:9256/metrics |grep postgres
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres"} 0.010000000000000009
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres"} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres"} 0
namedprocess_namegroup_memory_bytes{groupname="postgres",memtype="resident"} 2.4702976e+07
namedprocess_namegroup_memory_bytes{groupname="postgres",memtype="virtual"} 1.368317952e+09
namedprocess_namegroup_minor_page_faults_total{groupname="postgres"} 61
namedprocess_namegroup_num_procs{groupname="postgres"} 6
namedprocess_namegroup_num_threads{groupname="postgres"} 6
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres"} 1.55536655e+09
namedprocess_namegroup_open_filedesc{groupname="postgres"} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres"} 0
namedprocess_namegroup_states{groupname="postgres",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres",state="Running"} 0
100 10261  100 10261    0     0  1126k      0 --:--:-- --:--:-- --:--:-- 1252k
namedprocess_namegroup_states{groupname="postgres",state="Sleeping"} 6
namedprocess_namegroup_states{groupname="postgres",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres",state="Zombie"} 0
namedprocess_namegroup_worst_fd_ratio{groupname="postgres"} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres"} 0

Oh, the output changed. It appears to be an aggregate of what was printed earlier.
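
The aggregation can be checked with the numbers already shown. As a small sketch (the helper name sum_resident is made up), summing the six resident-memory samples from the config-file run gives 24702976, which is the 2.4702976e+07 reported for the single postgres group here.

```shell
# Sum the values of all resident-memory samples read on stdin.
# The value is always the last whitespace-separated field, even when the
# groupname label itself contains spaces.
sum_resident() {
  awk '/^namedprocess_namegroup_memory_bytes/ && /memtype="resident"/ { total += $NF }
       END { printf "%d\n", total }'
}

# The six resident samples copied from the earlier config-file run.
printf '%s\n' \
  'namedprocess_namegroup_memory_bytes{groupname="postgres",memtype="resident"} 1.2640256e+07' \
  'namedprocess_namegroup_memory_bytes{groupname="postgres: autovacuum launcher process   ",memtype="resident"} 2.822144e+06' \
  'namedprocess_namegroup_memory_bytes{groupname="postgres: checkpointer process   ",memtype="resident"} 3.244032e+06' \
  'namedprocess_namegroup_memory_bytes{groupname="postgres: stats collector process   ",memtype="resident"} 1.794048e+06' \
  'namedprocess_namegroup_memory_bytes{groupname="postgres: wal writer process   ",memtype="resident"} 1.695744e+06' \
  'namedprocess_namegroup_memory_bytes{groupname="postgres: writer process   ",memtype="resident"} 2.506752e+06' \
  | sum_resident
```
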
To change the name shown in groupname, rewrite the config as follows.

$ cat config.yaml
process_names:
  - name: aaaaaa
    comm:
      - 'postgres'

Hit the endpoint again.

$ curl http://192.168.33.31:9256/metrics |grep namedprocess_namegroup_num_procs
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP namedprocess_namegroup_num_procs number of processes in this group
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="aaaaaa"} 6
100 10009  100 10009    0     0  1094k      0 --:--:-- --:--:-- --:--:-- 1221k

Moving on.
Next I tried the selector called exe. The README on GitHub describes it as follows.

  # exe is argv[0]. If no slashes, only basename of argv[0] need match.
  # If exe contains slashes, argv[0] must match exactly.
  - exe:
    - postgres
    - /usr/local/bin/prometheus

So I set up the following config. The idea is to pick up the process by its program name.

process_names:
  - name: postgres process
    exe:
      - '/usr/lib/postgresql/9.3/bin/postgres'

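Hit the endpoint. Hm? It returned 6. Perhaps the child processes are counted as well; the -children flag in the help output above defaults to true, which would be consistent with that. If so, the difference from comm is not very clear.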

$ curl http://192.168.33.31:9256/metrics |grep namedprocess_namegroup_num_procs
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP namedprocess_namegroup_num_procs number of processes in this group
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="postgres process"} 6
100 10180  100 10180    0     0  1145k      0 --:--:-- --:--:-- --:--:-- 1242k
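
The count of 6 lines up with the ps listing shown earlier: one master process plus five children whose PPID is the master's PID. As a quick sketch (count_family is a hypothetical helper name), the rows can be counted like this.

```shell
# Count rows of `ps -ef` output (read on stdin) whose PID (column 2)
# or PPID (column 3) equals the PID given as the first argument.
count_family() {
  awk -v pid="$1" '$2 == pid || $3 == pid { n++ } END { print n + 0 }'
}

# The ps output from earlier in this article.
count_family 21127 <<'EOF'
postgres 21127     1  0 Apr15 ?        00:00:00 /usr/lib/postgresql/9.3/bin/postgres -D /var/lib/postgresql/9.3/main -c config_file=/etc/postgresql/9.3/main/postgresql.conf
postgres 21129 21127  0 Apr15 ?        00:00:00 postgres: checkpointer process
postgres 21130 21127  0 Apr15 ?        00:00:00 postgres: writer process
postgres 21131 21127  0 Apr15 ?        00:00:00 postgres: wal writer process
postgres 21132 21127  0 Apr15 ?        00:00:00 postgres: autovacuum launcher process
postgres 21133 21127  0 Apr15 ?        00:00:01 postgres: stats collector process
EOF
```

This prints 6, the same value as namedprocess_namegroup_num_procs above. It only shows that the numbers agree; it does not prove how the exporter arrives at its count.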

Finally, cmdline. This selector accepts regular expressions.

  # cmdline is a list of regexps applied to argv.
  # Each must match, and any captures are added to the .Matches map.
  - name: "{{.ExeFull}}:{{.Matches.Cfgfile}}"
    exe: 
    - /usr/local/bin/process-exporter
    cmdline: 
    - -config.path\s+(?P<Cfgfile>\S+)

I configured it like this.

process_names:
  # - name: postgres process
  - exe:
      - '/usr/lib/postgresql/9.3/bin/postgres'
    cmdline:
      - .*postgresql.conf

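I hit the endpoint. As the output below shows, when name is not specified a default group name is used. Here it comes out as postgres, the same string as comm, although the earlier comm-only run produced names such as "postgres: writer process", so the default appears to follow the executable name (argv[0]) rather than comm.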

$ curl http://192.168.33.31:9256/metrics |grep namedprocess_namegroup_num_procs
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP namedprocess_namegroup_num_procs number of processes in this group
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="postgres"} 6
100 10239  100 10239    0     0  1225k      0 --:--:-- --:--:-- --:--:-- 1249k

Hmm, no change.

Conclusion

The metrics I think are usable are namedprocess_namegroup_num_procs and namedprocess_namegroup_states, but both only report per-group counts.

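Can this be used to detect, for example, that every process in a group has died? And if you want to watch one particular Java process, they may all end up lumped into a single java group, which could make that difficult.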
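
One way to approach the "everything died" check is to read num_procs for a single group and treat a missing sample as zero. This is only a sketch: what the exporter actually reports after every process in a group has exited is not verified here, and the helper name group_procs is made up.

```shell
# Print the namedprocess_namegroup_num_procs value for the group named in
# the first argument, reading metrics on stdin.
# Assumption: a group with no sample at all is treated as having 0 processes.
group_procs() {
  awk -v want="namedprocess_namegroup_num_procs{groupname=\"$1\"}" '
    index($0, want " ") == 1 { print $NF; found = 1 }
    END { if (!found) print 0 }'
}

# One sample line in the format shown in this article.
sample='namedprocess_namegroup_num_procs{groupname="postgres"} 6'
printf '%s\n' "$sample" | group_procs postgres   # group present
printf '%s\n' "$sample" | group_procs java       # group absent
```

The first call prints 6 and the second prints 0. A wrapper script could feed this from curl -s and raise an alert whenever the printed value is 0.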

If I am using it wrong or you have any corrections, please let me know!

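Addendum (2019-05-16)

After publishing this post I looked into it a bit more and found another way to use it, so I am adding it here.

It comes down to one thing:

Combine comm and cmdline in the config file. As long as the processes have different PIDs, they can then be identified as separate processes.

I wrote two Java applications that each listen on a port, and started them.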

$ ps -ef |grep java
vagrant   4136  4060  0 05:14 pts/1    00:00:00 java -jar Sample1.jar 8080
vagrant   4148  4060  1 05:14 pts/1    00:00:00 java -jar Sample2.jar 8081
vagrant   4161  4060  0 05:14 pts/1    00:00:00 grep --color=auto java

Here is the config file.

$ cat config.yaml
process_names:
  - name: test1
    comm:
    - java
    cmdline:
    - Sample1.jar
  - name: test2
    comm:
    - java
    cmdline:
    - Sample2.jar

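Here is what each setting does.
name sets the groupname. In this example, test1 and test2.
comm specifies the string found in the second field of /proc/{pid}/stat.
With comm alone, every java process would match, so cmdline is added. It specifies a string to match against the command line, and regular expressions are allowed. I think it should be a string that uniquely identifies the process (for example, the jar file name).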
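
The combination can be pictured as "comm must be equal AND the cmdline pattern must match". The sketch below is a simplified stand-in, not the exporter's own code: it uses grep -E instead of Go regular expressions, assumes the pattern is tested against the arguments joined by spaces, and the helper name entry_matches is hypothetical.

```shell
# Succeed when a process would be selected by one config entry:
#   $1 = comm of the process      $2 = command line of the process
#   $3 = comm from the config     $4 = cmdline regex from the config
# Both conditions have to hold.
entry_matches() {
  proc_comm=$1
  proc_cmdline=$2
  want_comm=$3
  want_regex=$4
  [ "$proc_comm" = "$want_comm" ] &&
    printf '%s\n' "$proc_cmdline" | grep -Eq -- "$want_regex"
}

# The two Java processes from the ps output, checked against the test1 entry.
entry_matches java 'java -jar Sample1.jar 8080' java 'Sample1.jar' && echo "Sample1: test1 matches"
entry_matches java 'java -jar Sample2.jar 8081' java 'Sample1.jar' || echo "Sample2: test1 does not match"
```

Because the pattern is a regular expression, the dot in Sample1.jar matches any character; writing it as Sample1\.jar would be stricter.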

Let's hit the endpoint.

$ curl http://192.168.33.31:9256/metrics | grep namedprocess_namegroup_num_procs
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP namedprocess_namegroup_num_procs number of processes in this group
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="test1"} 1
namedprocess_namegroup_num_procs{groupname="test2"} 1
100 11500  100 11500    0     0  1169k      0 --:--:-- --:--:-- --:--:-- 1247k

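The Java processes were successfully captured as separate groups.

Note
When a parent process forks children, as PostgreSQL does, this method can tell the child processes apart, but it does not work for the parent (the children are counted together with it).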
