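process-exporter did not work very well for me
Introduction
Hi everyone, how do you monitor processes? For Prometheus there is an exporter called process-exporter, and this post is my memo of the results from trying it out.
If you actually run this exporter in production and use it effectively, please let me know!!!
Verification
The test environment was built locally with Vagrant. It is a bit old, but it runs Ubuntu 14.04.6 LTS.
Installation
I followed the article below.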
https://qiita.com/nekoneck/items/a9deab623da277afc4be
The GitHub repository is here:
https://github.com/ncabatoff/process-exporter/blob/master/README.md
This time I install from the release binary rather than using Docker.
First, download and extract the archive by hand.
$ curl -LO https://github.com/ncabatoff/process-exporter/releases/download/v0.2.11/process-exporter_0.2.11_linux_amd64.tar.gz
$ tar xvf process-exporter_0.2.11_linux_amd64.tar.gz
This extracts the following files.
$ ls
LICENSE README.md process-exporter process-exporter_0.2.11_linux_amd64.tar.gz
Startup
process-exporter runs as a single executable and is controlled through flags, so let's first see which options are available.
$ ./process-exporter -h
Usage of ./process-exporter:
-children
if a proc is tracked, track with it any children that aren't part of their own group (default true)
-config.path string
path to YAML config file
-man
print manual
-namemapping string
comma-seperated list, alternating process name and capturing regex to apply to cmdline
-once-to-stdout
Don't bind, instead just print the metrics once to stdout and exit
-procfs string
path to read proc data from (default "/proc")
-procnames string
comma-seperated list of process names to monitor
-recheck
recheck process names on each scrape
-threads
report on per-threadname metrics as well
-web.listen-address string
Address on which to expose metrics and web interface. (default ":9256")
-web.telemetry-path string
Path under which to expose metrics. (default "/metrics")
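Basically, you either load an external config file or run with the -procnames flag.
I found the configuration hard to understand, which is why the tool left me with mixed feelings this time.
First, let's look at the processes to target. I'll use the PostgreSQL processes.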
$ ps -ef |grep -v grep | grep postgres
postgres 21127 1 0 Apr15 ? 00:00:00 /usr/lib/postgresql/9.3/bin/postgres -D /var/lib/postgresql/9.3/main -c config_file=/etc/postgresql/9.3/main/postgresql.conf
postgres 21129 21127 0 Apr15 ? 00:00:00 postgres: checkpointer process
postgres 21130 21127 0 Apr15 ? 00:00:00 postgres: writer process
postgres 21131 21127 0 Apr15 ? 00:00:00 postgres: wal writer process
postgres 21132 21127 0 Apr15 ? 00:00:00 postgres: autovacuum launcher process
postgres 21133 21127 0 Apr15 ? 00:00:01 postgres: stats collector process
I wrote a config file. According to the README on GitHub, comm is taken from the second field of /proc/{pid}/stat.
# comm is the second field of /proc/<pid>/stat minus parens.
# It is the base executable name, truncated at 15 chars.
# It cannot be modified by the program, unlike exe.
- comm:
- bash
$ cat config.yaml
process_names:
- comm:
- postgresql
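Before running the exporter, it can help to check which comm value a process really reports, since that is the string the comm selector is compared against. The following is a minimal Python sketch, assuming the /proc/<pid>/stat layout described in the README excerpt above. The helper names are my own, and the sample line is made up for illustration rather than captured from the test VM.

```python
def comm_from_stat(stat_line):
    """Return the comm field (second field, parentheses stripped) of a stat line."""
    # comm may itself contain spaces or parentheses, so take everything
    # between the first "(" and the last ")".
    start = stat_line.index("(") + 1
    end = stat_line.rindex(")")
    return stat_line[start:end]


def comm_of_pid(pid, procfs="/proc"):
    """Read comm for a live process (Linux only)."""
    with open("{}/{}/stat".format(procfs, pid)) as f:
        return comm_from_stat(f.read())


# Hypothetical stat line, shortened for illustration
sample = "21127 (postgres) S 1 21127 21127 0 -1 4202496"
print(comm_from_stat(sample))  # postgres
```

On a Linux host, comm_of_pid(21127) would read the value for the PostgreSQL parent process shown earlier.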
Now let's run it.
$ ./process-exporter -config.path config.yaml
2019/04/16 05:02:48 Reading metrics from /proc based on "config.yaml"
Let's hit the endpoint.
$ curl http://192.168.33.31:9256/metrics |grep postgresql
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 7215 100 7215 0 0 1081k 0 --:--:-- --:--:-- --:--:-- 1174k
Hmm, nothing is picked up. Let's change the config file slightly.
$ cat config.yaml
process_names:
- comm:
- postgres
Request the endpoint again.
$ curl http://192.168.33.31:9256/metrics |grep postgres
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres"} 0
namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres: autovacuum launcher process "} 0
namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres: checkpointer process "} 0
namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres: stats collector process "} 0
namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres: wal writer process "} 0
namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres: writer process "} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres"} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres: autovacuum launcher process "} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres: checkpointer process "} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres: stats collector process "} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres: wal writer process "} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres: writer process "} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres"} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres: autovacuum launcher process "} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres: checkpointer process "} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres: stats collector process "} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres: wal writer process "} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres: writer process "} 0
namedprocess_namegroup_memory_bytes{groupname="postgres",memtype="resident"} 1.2640256e+07
namedprocess_namegroup_memory_bytes{groupname="postgres",memtype="virtual"} 2.52489728e+08
namedprocess_namegroup_memory_bytes{groupname="postgres: autovacuum launcher process ",memtype="resident"} 2.822144e+06
namedprocess_namegroup_memory_bytes{groupname="postgres: autovacuum launcher process ",memtype="virtual"} 2.5337856e+08
namedprocess_namegroup_memory_bytes{groupname="postgres: checkpointer process ",memtype="resident"} 3.244032e+06
namedprocess_namegroup_memory_bytes{groupname="postgres: checkpointer process ",memtype="virtual"} 2.52628992e+08
namedprocess_namegroup_memory_bytes{groupname="postgres: stats collector process ",memtype="resident"} 1.794048e+06
namedprocess_namegroup_memory_bytes{groupname="postgres: stats collector process ",memtype="virtual"} 1.04841216e+08
namedprocess_namegroup_memory_bytes{groupname="postgres: wal writer process ",memtype="resident"} 1.695744e+06
namedprocess_namegroup_memory_bytes{groupname="postgres: wal writer process ",memtype="virtual"} 2.52489728e+08
namedprocess_namegroup_memory_bytes{groupname="postgres: writer process ",memtype="resident"} 2.506752e+06
namedprocess_namegroup_memory_bytes{groupname="postgres: writer process ",memtype="virtual"} 2.52489728e+08
namedprocess_namegroup_minor_page_faults_total{groupname="postgres"} 0
namedprocess_namegroup_minor_page_faults_total{groupname="postgres: autovacuum launcher process "} 0
namedprocess_namegroup_minor_page_faults_total{groupname="postgres: checkpointer process "} 0
namedprocess_namegroup_minor_page_faults_total{groupname="postgres: stats collector process "} 0
namedprocess_namegroup_minor_page_faults_total{groupname="postgres: wal writer process "} 0
namedprocess_namegroup_minor_page_faults_total{groupname="postgres: writer process "} 0
namedprocess_namegroup_num_procs{groupname="postgres"} 1
namedprocess_namegroup_num_procs{groupname="postgres: autovacuum launcher process "} 1
namedprocess_namegroup_num_procs{groupname="postgres: checkpointer process "} 1
namedprocess_namegroup_num_procs{groupname="postgres: stats collector process "} 1
namedprocess_namegroup_num_procs{groupname="postgres: wal writer process "} 1
namedprocess_namegroup_num_procs{groupname="postgres: writer process "} 1
namedprocess_namegroup_num_threads{groupname="postgres"} 1
namedprocess_namegroup_num_threads{groupname="postgres: autovacuum launcher process "} 1
namedprocess_namegroup_num_threads{groupname="postgres: checkpointer process "} 1
namedprocess_namegroup_num_threads{groupname="postgres: stats collector process "} 1
namedprocess_namegroup_num_threads{groupname="postgres: wal writer process "} 1
namedprocess_namegroup_num_threads{groupname="postgres: writer process "} 1
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres"} 1.55536655e+09
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres: autovacuum launcher process "} 1.55536655e+09
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres: checkpointer process "} 1.55536655e+09
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres: stats collector process "} 1.55536655e+09
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres: wal writer process "} 1.55536655e+09
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres: writer process "} 1.55536655e+09
namedprocess_namegroup_open_filedesc{groupname="postgres"} 0
namedprocess_namegroup_open_filedesc{groupname="postgres: autovacuum launcher process "} 0
namedprocess_namegroup_open_filedesc{groupname="postgres: checkpointer process "} 0
namedprocess_namegroup_open_filedesc{groupname="postgres: stats collector process "} 0
namedprocess_namegroup_open_filedesc{groupname="postgres: wal writer process "} 0
namedprocess_namegroup_open_filedesc{groupname="postgres: writer process "} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres"} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres: autovacuum launcher process "} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres: checkpointer process "} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres: stats collector process "} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres: wal writer process "} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres: writer process "} 0
namedprocess_namegroup_states{groupname="postgres",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres",state="Running"} 0
namedprocess_namegroup_states{groupname="postgres",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="postgres",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres",state="Zombie"} 0
namedprocess_namegroup_states{groupname="postgres: autovacuum launcher process ",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres: autovacuum launcher process ",state="Running"} 0
namedprocess_namegroup_states{groupname="postgres: autovacuum launcher process ",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="postgres: autovacuum launcher process ",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres: autovacuum launcher process ",state="Zombie"} 0
namedprocess_namegroup_states{groupname="postgres: checkpointer process ",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres: checkpointer process ",state="Running"} 0
namedprocess_namegroup_states{groupname="postgres: checkpointer process ",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="postgres: checkpointer process ",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres: checkpointer process ",state="Zombie"} 0
namedprocess_namegroup_states{groupname="postgres: stats collector process ",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres: stats collector process ",state="Running"} 0
namedprocess_namegroup_states{groupname="postgres: stats collector process ",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="postgres: stats collector process ",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres: stats collector process ",state="Zombie"} 0
namedprocess_namegroup_states{groupname="postgres: wal writer process ",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres: wal writer process ",state="Running"} 0
namedprocess_namegroup_states{groupname="postgres: wal writer process ",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="postgres: wal writer process ",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres: wal writer process ",state="Zombie"} 0
namedprocess_namegroup_states{groupname="postgres: writer process ",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres: writer process ",state="Running"} 0
namedprocess_namegroup_states{groupname="postgres: writer process ",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="postgres: writer process ",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres: writer process ",state="Zombie"} 0
namedprocess_namegroup_worst_fd_ratio{groupname="postgres"} 0
100 18856 100 18856 0 0 1769k 0 --:--:-- --:--:-- --:--:-- 1841k
namedprocess_namegroup_worst_fd_ratio{groupname="postgres: autovacuum launcher process "} 0
namedprocess_namegroup_worst_fd_ratio{groupname="postgres: checkpointer process "} 0
namedprocess_namegroup_worst_fd_ratio{groupname="postgres: stats collector process "} 0
namedprocess_namegroup_worst_fd_ratio{groupname="postgres: wal writer process "} 0
namedprocess_namegroup_worst_fd_ratio{groupname="postgres: writer process "} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres"} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres: autovacuum launcher process "} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres: checkpointer process "} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres: stats collector process "} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres: wal writer process "} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres: writer process "} 0
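Oh! This time something shows up. The postgres processes are split into separate groups, distinguished by the groupname label.
With this, well, it looks usable.
Next, let's start it with the -procnames option instead.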
$ ./process-exporter -procnames postgres
2019/04/16 05:15:46 Reading metrics from /proc for procnames: [postgres]
Access the endpoint.
$ curl http://192.168.33.31:9256/metrics |grep postgres
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0namedprocess_namegroup_cpu_system_seconds_total{groupname="postgres"} 0.010000000000000009
namedprocess_namegroup_cpu_user_seconds_total{groupname="postgres"} 0
namedprocess_namegroup_major_page_faults_total{groupname="postgres"} 0
namedprocess_namegroup_memory_bytes{groupname="postgres",memtype="resident"} 2.4702976e+07
namedprocess_namegroup_memory_bytes{groupname="postgres",memtype="virtual"} 1.368317952e+09
namedprocess_namegroup_minor_page_faults_total{groupname="postgres"} 61
namedprocess_namegroup_num_procs{groupname="postgres"} 6
namedprocess_namegroup_num_threads{groupname="postgres"} 6
namedprocess_namegroup_oldest_start_time_seconds{groupname="postgres"} 1.55536655e+09
namedprocess_namegroup_open_filedesc{groupname="postgres"} 0
namedprocess_namegroup_read_bytes_total{groupname="postgres"} 0
namedprocess_namegroup_states{groupname="postgres",state="Other"} 0
namedprocess_namegroup_states{groupname="postgres",state="Running"} 0
100 10261 100 10261 0 0 1126k 0 --:--:-- --:--:-- --:--:-- 1252k
namedprocess_namegroup_states{groupname="postgres",state="Sleeping"} 6
namedprocess_namegroup_states{groupname="postgres",state="Waiting"} 0
namedprocess_namegroup_states{groupname="postgres",state="Zombie"} 0
namedprocess_namegroup_worst_fd_ratio{groupname="postgres"} 0
namedprocess_namegroup_write_bytes_total{groupname="postgres"} 0
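Oh, the output changed. The values that were reported per process before now seem to be aggregated into a single group.
To change the displayed groupname, rewrite the config as follows.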
$ cat config.yaml
process_names:
- name: aaaaaa
comm:
- 'postgres'
Call the endpoint again.
$ curl http://192.168.33.31:9256/metrics |grep namedprocess_namegroup_num_procs
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0# HELP namedprocess_namegroup_num_procs number of processes in this group
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="aaaaaa"} 6
100 10009 100 10009 0 0 1094k 0 --:--:-- --:--:-- --:--:-- 1221k
Moving on.
Let's try the selector called exe. For reference, the GitHub README describes it as follows.
# exe is argv[0]. If no slashes, only basename of argv[0] need match.
# If exe contains slashes, argv[0] must match exactly.
- exe:
- postgres
- /usr/local/bin/prometheus
So I tried the following config. The idea is to match on the program's executable path.
process_names:
- name: postgres master process
exe:
- '/usr/lib/postgresql/9.3/bin/postgres'
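Call the endpoint. Hmm? It reports 6, so the child processes seem to be counted too. That would be consistent with the -children flag in the help output, which defaults to true and tracks a matched process's children along with it. Even so, the difference from comm is not clear to me.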
$ curl http://192.168.33.31:9256/metrics |grep namedprocess_namegroup_num_procs
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0# HELP namedprocess_namegroup_num_procs number of processes in this group
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="postgres process"} 6
100 10180 100 10180 0 0 1145k 0 --:--:-- --:--:-- --:--:-- 1242k
Finally, cmdline. This selector accepts regular expressions.
# cmdline is a list of regexps applied to argv.
# Each must match, and any captures are added to the .Matches map.
- name: "{{.ExeFull}}:{{.Matches.Cfgfile}}"
exe:
- /usr/local/bin/process-exporter
cmdline:
- -config.path\s+(?P<Cfgfile>\S+)
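To make the capture mechanism in this excerpt concrete, here is a small Python sketch that applies the same pattern to a command line. It is an illustration only: I am assuming the arguments are joined with single spaces before matching, and the argv values are hypothetical. The named-group syntax (?P<Cfgfile>...) happens to be valid in both Go and Python regular expressions.

```python
import re

# Pattern copied from the README excerpt above
pattern = re.compile(r"-config.path\s+(?P<Cfgfile>\S+)")

# Hypothetical argv; assumption: arguments are joined with single spaces
argv = ["/usr/local/bin/process-exporter", "-config.path", "config.yaml"]
match = pattern.search(" ".join(argv))

# Named captures are what the README calls the .Matches map
matches = match.groupdict() if match else {}
print(matches)  # {'Cfgfile': 'config.yaml'}
```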
I configured it like this.
process_names:
# - name: postgres process
- exe:
- '/usr/lib/postgresql/9.3/bin/postgres'
cmdline:
- .*postgresql.conf
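Access the endpoint. As the output shows, when name is not specified the group gets a default name, which here is postgres, the same string as comm. (Judging by the first comm-only run, where the child processes were named after their rewritten argv[0], the default appears to follow the executable name in argv[0] rather than comm itself.)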
$ curl http://192.168.33.31:9256/metrics |grep namedprocess_namegroup_num_procs
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0# HELP namedprocess_namegroup_num_procs number of processes in this group
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="postgres"} 6
100 10239 100 10239 0 0 1225k 0 --:--:-- --:--:-- --:--:-- 1249k
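Hmm, no change.
Conclusion
The metrics I consider usable are namedprocess_namegroup_num_procs and namedprocess_namegroup_states, but both only report per-group counts.
Can this be used to detect, say, every process in a group dying? And if you want to monitor one specific Java process, for example, all Java processes might end up lumped into a single java group, which could make that difficult.
If I'm using it wrong or you have any corrections, please let me know!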
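As a rough illustration of the "every process in a group died" check raised above, the following Python sketch parses the text returned by the /metrics endpoint and lists the expected groups that have no running process. The function names are hypothetical. I have not verified whether a group whose processes have all exited is still exported with a value of 0 or disappears from the output, so the sketch treats both cases as dead.

```python
import re

# Matches lines such as: namedprocess_namegroup_num_procs{groupname="postgres"} 6
NUM_PROCS = re.compile(
    r'^namedprocess_namegroup_num_procs\{groupname="([^"]*)"\}\s+(\S+)$'
)


def num_procs_by_group(metrics_text):
    """Map each groupname to its num_procs value."""
    counts = {}
    for line in metrics_text.splitlines():
        m = NUM_PROCS.match(line.strip())
        if m:
            counts[m.group(1)] = float(m.group(2))
    return counts


def dead_groups(metrics_text, expected):
    """Return expected groups that are missing or report zero processes."""
    counts = num_procs_by_group(metrics_text)
    return [g for g in expected if counts.get(g, 0) == 0]


sample = """# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="postgres"} 6
"""
print(dead_groups(sample, ["postgres", "java"]))  # ['java']
```

In a real setup the same condition would more likely be written as an alert on namedprocess_namegroup_num_procs in Prometheus itself. The sketch only shows that a per-group count is enough for this kind of check.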
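Added on 2019-05-16
After publishing this article I looked into it a little more and found a different way to use it, so I'm adding it here.
The short version:
By combining comm and cmdline in the config file, processes running under different PIDs can be identified as separate processes, even when they are the same program.
I prepared small Java applications that listen on a port and started them.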
$ ps -ef |grep java
vagrant 4136 4060 0 05:14 pts/1 00:00:00 java -jar Sample1.jar 8080
vagrant 4148 4060 1 05:14 pts/1 00:00:00 java -jar Sample2.jar 8081
vagrant 4161 4060 0 05:14 pts/1 00:00:00 grep --color=auto java
Here is the config file.
$ cat config.yaml
process_names:
- name: test1
comm:
- java
cmdline:
- Sample1.jar
- name: test2
comm:
- java
cmdline:
- Sample2.jar
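Each item explained:
name sets the groupname. In this example, test1 and test2.
comm specifies the string found in the second field of /proc/{pid}/stat.
With comm alone, every java process would match, so cmdline is added. It specifies a string to match against the command line, and regular expressions are allowed. I think it should be a string that uniquely identifies the process (the jar file name, for example).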
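The selection logic described in these three points can be modeled in a few lines of Python. This is a simplified sketch of my understanding, not the exporter's actual code. It assumes that a rule matches when comm is equal to one of the listed names and every cmdline regex matches the space-joined arguments, and that the first matching rule wins. The RULES structure and group_for function are hypothetical names.

```python
import re

# Rules mirroring the config.yaml above
RULES = [
    {"name": "test1", "comm": ["java"], "cmdline": [r"Sample1.jar"]},
    {"name": "test2", "comm": ["java"], "cmdline": [r"Sample2.jar"]},
]


def group_for(comm, argv, rules=RULES):
    """Return the name of the first rule matching the process, else None."""
    joined = " ".join(argv)
    for rule in rules:
        comm_ok = comm in rule["comm"]
        cmdline_ok = all(re.search(rx, joined) for rx in rule["cmdline"])
        if comm_ok and cmdline_ok:
            return rule["name"]
    return None


print(group_for("java", ["java", "-jar", "Sample1.jar", "8080"]))  # test1
print(group_for("java", ["java", "-jar", "Sample2.jar", "8081"]))  # test2
print(group_for("grep", ["grep", "--color=auto", "java"]))         # None
```

The grep process from the ps output is excluded by comm even though its command line contains java, which is why both selectors are needed.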
Let's hit the endpoint.
$ curl http://192.168.33.31:9256/metrics | grep namedprocess_namegroup_num_procs
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0# HELP namedprocess_namegroup_num_procs number of processes in this group
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="test1"} 1
namedprocess_namegroup_num_procs{groupname="test2"} 1
100 11500 100 11500 0 0 1169k 0 --:--:-- --:--:-- --:--:-- 1247k
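The Java processes are now captured separately.
Note
When a parent process forks children, as PostgreSQL does, this approach can distinguish the children, but not the parent (the children are counted together with the parent).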
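One way to picture why the parent's count includes its children is the -children behaviour from the help output ("if a proc is tracked, track with it any children that aren't part of their own group"). The Python sketch below is a simplified model based only on that help text, not on the exporter's source. The PIDs are taken from the ps output earlier in this post, and the function name is hypothetical.

```python
def group_of(pid, parents, direct):
    """Return the group of pid, inheriting from the nearest grouped ancestor."""
    seen = set()
    while pid in parents and pid not in seen:
        if pid in direct:
            return direct[pid]
        seen.add(pid)
        pid = parents[pid]
    return direct.get(pid)


# pid -> ppid, from the ps output for PostgreSQL
parents = {21127: 1, 21129: 21127, 21130: 21127,
           21131: 21127, 21132: 21127, 21133: 21127}

# Only the parent matches the exe selector directly
direct = {21127: "postgres"}

members = [p for p in parents if group_of(p, parents, direct) == "postgres"]
print(len(members))  # 6
```

If a child had its own entry in direct (for example, matched by a separate comm and cmdline rule), it would be counted in that group instead, which matches the note above that children can be distinguished while the parent cannot.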