## Configure the DingTalk robot

1. Open the Smart Group Assistant (智能群助手) of the DingTalk group and click Add Robot.

   ![image-20210210143408015](../../../images/image-20210210143408015.png)

2. Choose the custom robot.

   ![image-20210210143511253](../../../images/image-20210210143511253.png)

   ![image-20210210144458936](../../../images/image-20210210143832145.png)

3. Copy the webhook address, then click Save.

   ![image-20210210143924405](../../../images/image-20210210143924405.png)

   ![image-20210210144032019](../../../images/image-20210210144032019.png)

   ![image-20210210144051553](../../../images/image-20210210144051553.png)

## Install the DingTalk service

### Binary installation

1. Before deploying, check the GitHub releases page for the latest package: https://github.com/timonwong/prometheus-webhook-dingtalk/releases

2. At the time of writing the latest version is `1.4.0`. If a newer version is released, adjust the version number in the commands below accordingly.

3. Log in to the Linux server (CentOS 7.x in this example) and download the package. Downloads from GitHub can be slow; if you prefer, download the file on your development machine and upload it to the server. The package is `prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz`.

   ~~~shell
   [root@JD ~]# cd /usr/local/src/
   [root@JD src]# wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
   ~~~

   ![image-20210210000348510](../../../images/image-20210210000348510.png)

4. Once the download has finished, install the package.

   ~~~shell
   [root@JD src]# tar xf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz -C /data
   [root@JD src]# mv /data/prometheus-webhook-dingtalk-1.4.0.linux-amd64 /data/webhook_dingtalk
   ~~~

   ~~~shell
   [root@JD src]# cd /data/webhook_dingtalk
   [root@JD webhook_dingtalk]# ls
   config.example.yml  contrib  LICENSE  prometheus-webhook-dingtalk
   ~~~

5. Write the configuration file, filling in the DingTalk webhook address obtained above.

   ~~~shell
   [root@JD webhook_dingtalk]# vim dingtalk.yml
   ~~~

   ~~~yaml
   timeout: 5s
   targets:
     webhook_robot:
       # Webhook address generated when the DingTalk robot was created
       url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
     webhook_mention_all:
       # Webhook address generated when the DingTalk robot was created
       url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
       # Mention everyone in the group
       mention:
         all: true
   ~~~

6. Write the systemd service.

   * Create the `webhook_dingtalk` unit file.

     ~~~shell
     [root@JD alertmanager]# cd /usr/lib/systemd/system
     [root@JD system]# vim webhook_dingtalk.service
     ~~~

   * Put the following content into `webhook_dingtalk.service`, then save with `:wq`.

     ~~~ini
     [Unit]
     Description=https://prometheus.io

     [Service]
     Restart=on-failure
     ExecStart=/data/webhook_dingtalk/prometheus-webhook-dingtalk --config.file=/data/webhook_dingtalk/dingtalk.yml --web.listen-address=:8060

     [Install]
     WantedBy=multi-user.target
     ~~~

   * Check the unit file.

     ~~~shell
     [root@JD system]# cat webhook_dingtalk.service
     [Unit]
     Description=https://prometheus.io

     [Service]
     Restart=on-failure
     ExecStart=/data/webhook_dingtalk/prometheus-webhook-dingtalk --config.file=/data/webhook_dingtalk/dingtalk.yml --web.listen-address=:8060

     [Install]
     WantedBy=multi-user.target
     ~~~

   * Reload the systemd configuration and start the service.

     ~~~shell
     [root@JD system]# systemctl daemon-reload
     [root@JD system]# systemctl start webhook_dingtalk.service
     ~~~

   * Check that the service is running.

     ~~~shell
     [root@JD system]# systemctl status webhook_dingtalk.service
     ~~~

     ![image-20210210152145947](../../../images/image-20210210152145947.png)

   * Enable the service at boot.

     ~~~shell
     [root@JD system]# systemctl enable webhook_dingtalk.service
     Created symlink from /etc/systemd/system/multi-user.target.wants/webhook_dingtalk.service to /usr/lib/systemd/system/webhook_dingtalk.service.
     ~~~

### Docker installation

1. A Docker deployment requires a working Docker environment. For installation, see section `6.2.3.3` of this documentation.

2. Pull the latest prometheus-webhook-dingtalk image.

   ~~~shell
   docker pull timonwong/prometheus-webhook-dingtalk
   ~~~

   ![image-20210210150155872](../../../images/image-20210210150155872.png)

3. Start the Docker container. The configuration file from the binary installation above can be reused if you mount it into the container; the command below instead passes the webhook address directly through `--ding.profile`.

   Note ⚠️: if you have already deployed the binary version, run `systemctl stop webhook_dingtalk.service` to stop it first, otherwise the ports will conflict. Alternatively, change the Docker port mapping to `-p 8160:8060`. Also replace `xxxxx` in the command below with the `access_token` of your own DingTalk robot.

   ~~~shell
   [root@JD data]# docker run --name webhook-dingtalk -d -p 8060:8060 timonwong/prometheus-webhook-dingtalk --ding.profile="webhook_robot=https://oapi.dingtalk.com/robot/send?access_token=xxxxx"
   ~~~

   ![image-20210210155242690](../../../images/image-20210210155242690.png)

4. Check the logs. If the custom API address appears there, the container started successfully.

   ![image-20210210155444498](../../../images/image-20210210155444498.png)

5. Note down the value `urls=http://localhost:8060/dingtalk/webhook_robot/send`; it is needed in the configuration that follows.

## Configure Alertmanager

1. Open `/data/alertmanager/alertmanager.yaml` and change it to the following.

   ~~~yaml
   global:
     # Time after which an alert is declared resolved if it has not been updated
     resolve_timeout: 5m
   route:
     # Labels used to group incoming alerts
     group_by: ["alertname"]
     # How long to wait before sending the first notification for a new group
     group_wait: 10s
     # How long to wait before notifying about new alerts added to an existing group
     group_interval: 30s
     # Interval between repeated notifications for an alert that is still firing
     repeat_interval: 5m
     # Default receiver
     receiver: "dingtalk"
   receivers:
     # DingTalk
     - name: 'dingtalk'
       webhook_configs:
         # Address of the prometheus-webhook-dingtalk service
         - url: http://1xx.xx.xx.7:8060/dingtalk/webhook_robot/send
           send_resolved: true
   ~~~

2. In the root of the Prometheus directory, add an `alert_rules.yaml` file with the following content.

   ~~~yaml
   groups:
     - name: alert_rules
       rules:
         - alert: CpuUsageAlertWarning
           expr: sum(avg(irate(node_cpu_seconds_total{mode!='idle'}[5m])) without (cpu)) by (instance) > 0.60
           for: 2m
           labels:
             level: warning
           annotations:
             summary: "Instance {{ $labels.instance }} CPU usage high"
             description: "{{ $labels.instance }} CPU usage above 60% (current value: {{ $value }})"
         - alert: CpuUsageAlertSerious
           #expr: sum(avg(irate(node_cpu_seconds_total{mode!='idle'}[5m])) without (cpu)) by (instance) > 0.85
           expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{job=~".*",mode="idle"}[5m])) * 100)) > 85
           for: 3m
           labels:
             level: serious
           annotations:
             summary: "Instance {{ $labels.instance }} CPU usage high"
             description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
         - alert: MemUsageAlertWarning
           expr: avg by(instance) ((1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100) > 70
           for: 2m
           labels:
             level: warning
           annotations:
             summary: "Instance {{ $labels.instance }} MEM usage high"
             description: "{{$labels.instance}}: MEM usage is above 70% (current value is: {{ $value }})"
         - alert: MemUsageAlertSerious
           expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.90
           for: 3m
           labels:
             level: serious
           annotations:
             summary: "Instance {{ $labels.instance }} MEM usage high"
             description: "{{ $labels.instance }} MEM usage above 90% (current value: {{ $value }})"
         - alert: DiskUsageAlertWarning
           expr: (1 - node_filesystem_free_bytes{fstype!="rootfs",mountpoint!="",mountpoint!~"/(run|var|sys|dev).*"} / node_filesystem_size_bytes) * 100 > 80
           for: 2m
           labels:
             level: warning
           annotations:
             summary: "Instance {{ $labels.instance }} Disk usage high"
             description: "{{$labels.instance}}: Disk usage is above 80% (current value is: {{ $value }})"
         - alert: DiskUsageAlertSerious
           expr: (1 - node_filesystem_free_bytes{fstype!="rootfs",mountpoint!="",mountpoint!~"/(run|var|sys|dev).*"} / node_filesystem_size_bytes) * 100 > 90
           for: 3m
           labels:
             level: serious
           annotations:
             summary: "Instance {{ $labels.instance }} Disk usage high"
             description: "{{$labels.instance}}: Disk usage is above 90% (current value is: {{ $value }})"
         - alert: NodeFileDescriptorUsage
           expr: avg by (instance) (node_filefd_allocated{} / node_filefd_maximum{}) * 100 > 60
           for: 2m
           labels:
             level: warning
           annotations:
             summary: "Instance {{ $labels.instance }} File Descriptor usage high"
             description: "{{$labels.instance}}: File Descriptor usage is above 60% (current value is: {{ $value }})"
         - alert: NodeLoad15
           expr: avg by (instance) (node_load15{}) > 80
           for: 2m
           labels:
             level: warning
           annotations:
             summary: "Instance {{ $labels.instance }} Load15 usage high"
             description: "{{$labels.instance}}: Load15 is above 80 (current value is: {{ $value }})"
         - alert: NodeAgentStatus
           expr: avg by (instance) (up{}) == 0
           for: 2m
           labels:
             level: warning
           annotations:
             summary: "{{$labels.instance}}: has been down"
             description: "{{$labels.instance}}: Node_Exporter Agent is down (current value is: {{ $value }})"
         - alert: NodeProcsBlocked
           expr: avg by (instance) (node_procs_blocked{}) > 10
           for: 2m
           labels:
             level: warning
           annotations:
             summary: "Instance {{ $labels.instance }} Process Blocked usage high"
             description: "{{$labels.instance}}: Node Blocked Procs detected! above 10 (current value is: {{ $value }})"
         - alert: NetworkTransmitRate
           #expr: avg by (instance) (floor(irate(node_network_transmit_bytes_total{device="ens192"}[2m]) / 1024 / 1024)) > 50
           expr: avg by (instance) (floor(irate(node_network_transmit_bytes_total{}[2m]) / 1024 / 1024 * 8 )) > 40
           for: 1m
           labels:
             level: warning
           annotations:
             summary: "Instance {{ $labels.instance }} Network Transmit Rate usage high"
             description: "{{$labels.instance}}: Node Transmit Rate (Upload) is above 40Mbps (current value is: {{ $value }}Mbps)"
         - alert: NetworkReceiveRate
           #expr: avg by (instance) (floor(irate(node_network_receive_bytes_total{device="ens192"}[2m]) / 1024 / 1024)) > 50
           expr: avg by (instance) (floor(irate(node_network_receive_bytes_total{}[2m]) / 1024 / 1024 * 8 )) > 40
           for: 1m
           labels:
             level: warning
           annotations:
             summary: "Instance {{ $labels.instance }} Network Receive Rate usage high"
             description: "{{$labels.instance}}: Node Receive Rate (Download) is above 40Mbps (current value is: {{ $value }}Mbps)"
         - alert: DiskReadRate
           expr: avg by (instance) (floor(irate(node_disk_read_bytes_total{}[2m]) / 1024 )) > 200
           for: 2m
           labels:
             level: warning
           annotations:
             summary: "Instance {{ $labels.instance }} Disk Read Rate usage high"
             description: "{{$labels.instance}}: Node Disk Read Rate is above 200KB/s (current value is: {{ $value }}KB/s)"
         - alert: DiskWriteRate
           expr: avg by (instance) (floor(irate(node_disk_written_bytes_total{}[2m]) / 1024 / 1024 )) > 20
           for: 2m
           labels:
             level: warning
           annotations:
             summary: "Instance {{ $labels.instance }} Disk Write Rate usage high"
             description: "{{$labels.instance}}: Node Disk Write Rate is above 20MB/s (current value is: {{ $value }}MB/s)"
   ~~~

3. Edit `prometheus.yaml` and change the top three sections to the following.

   **Note ⚠️: if Prometheus runs as a Docker container, stop and restart it, and mount the directory with `-v`; otherwise the rules file will not be read.**

   ~~~yaml
   global:
     scrape_interval: 15s
     evaluation_interval: 15s
   alerting:
     alertmanagers:
       - static_configs:
           # Alertmanager service address
           - targets: ['11x.xx.x.7:9093']
   rule_files:
     - "alert_rules.yaml"
   ~~~

4. Run `curl -XPOST localhost:9090/-/reload` to reload the Prometheus configuration. This endpoint only responds when Prometheus was started with `--web.enable-lifecycle`.

5. Run `systemctl restart alertmanager.service` or `docker restart alertmanager` to restart the Alertmanager service.

## Verify the configuration

1. Open the Prometheus web UI. The Alerts tab now lists the new rules.

   ![image-20210209145250954](../../../images/image-20210209144437502.png)

2. Manually stop one of the monitored targets.

   ~~~shell
   [root@JD ~]# docker stop mysqld-exporter
   ~~~

3. Refresh Prometheus. One rule changes color and enters the pending state.

   ![image-20210210152723906](../../../images/image-20210210152723906.png)

4. Wait a moment. Once the condition has held for the rule's `for` duration (`2m` for `NodeAgentStatus` in `alert_rules.yaml`), the rule turns red and enters the firing state.

   ![image-20210210152829210](../../../images/image-20210210152829210.png)

5. Open the Alertmanager web UI. The corresponding alert appears there as well.

   ![image-20210210152851241](../../../images/image-20210210152851241.png)

6. If everything is configured correctly, the DingTalk robot now posts a message.

   ![image-20210210152921360](../../../images/image-20210210152921360.png)

7. Start the mysqld-exporter service again.

   ~~~shell
   [root@JD ~]# docker start mysqld-exporter
   ~~~

8. After the configured waiting period, provided the service has not gone down again in the meantime, the DingTalk robot posts a recovery message.

   ![image-20210210153105732](../../../images/image-20210210153105732.png)

9. On a mobile phone the result looks like this.

   ![image-20210210155305550](../../../images/image-20210210155305550.png)

## Postscript

* DingTalk alerting is relatively complex to set up: it needs a separate service that has to be started and configured, and the amount of configuration makes mistakes likely, so there is a bit of a learning curve.
* DingTalk alerting is shown here for reference only. WeCom (企业微信, Enterprise WeChat) alerting is the approach we recommend; the next section covers its configuration in detail.
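
As a supplement to the verification steps in this section, the notification chain from Alertmanager through prometheus-webhook-dingtalk to DingTalk can probably also be exercised without stopping a real exporter, by posting a synthetic alert to Alertmanager's v2 API. The following is only a minimal sketch under some assumptions: the helper `build_test_alert`, the alert name `ManualTestAlert`, the instance `test-host:9100`, the environment variables `ALERTMANAGER_URL` and `SEND_TEST_ALERT`, and the address `http://localhost:9093` are all illustrative and not part of the original setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a minimal Alertmanager v2 alert payload (a JSON array).
# $1 = alert name, $2 = instance label. Both values are illustrative only.
build_test_alert() {
  local name="$1" instance="$2"
  printf '[{"labels":{"alertname":"%s","instance":"%s","level":"warning"},"annotations":{"summary":"manual test alert"}}]' \
    "$name" "$instance"
}

# Assumed address; replace it with your own Alertmanager host and port.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://localhost:9093}"

# Send only when explicitly requested, so that sourcing this file has no side effects.
if [ "${SEND_TEST_ALERT:-0}" = "1" ]; then
  build_test_alert "ManualTestAlert" "test-host:9100" |
    curl -sS -XPOST -H 'Content-Type: application/json' \
      --data-binary @- "${ALERTMANAGER_URL}/api/v2/alerts"
fi
```

Saved under a name of your choice, the script would be run with `SEND_TEST_ALERT=1`. With the `group_wait: 10s` setting shown earlier, a DingTalk message should arrive shortly afterwards if the chain is healthy. Because the payload carries no `endsAt`, Alertmanager should treat the alert as resolved once `resolve_timeout` has passed without a further update.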