Alert Rules for Common Components
For reference only.
Also useful for learning the various ways PromQL can be written.
Alert name | Description | Threshold
Expression (PromQL)
Host high CPU load | CPU usage (%) | 80%
(sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) * 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host CPU high iowait | I/O wait share of CPU time (%) | 10%
(avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host out of memory | Memory usage (%) | 90%
(100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host OOM kill detected | OOM kills | 0
(increase(node_vmstat_oom_kill[1m])) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host swap is filling up | Swap usage (%) | 80%
((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host out of disk space | Disk usage (%) | 90%
((100 - (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host unusual disk read rate | Disk read rate (MB/s) | 50 MB/s
(sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host unusual disk write rate | Disk write rate (MB/s) | 50 MB/s
(sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host unusual disk read latency | Disk read latency (ms) | 100 ms
(rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) * 1000 and rate(node_disk_reads_completed_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host unusual disk write latency | Disk write latency (ms) | 100 ms
(rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) * 1000 and rate(node_disk_writes_completed_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host out of inodes | Inode usage (%) | 90%
(100 - node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host unusual disk IO | Disk I/O utilization (%) | 50%
(rate(node_disk_io_time_seconds_total[1m]) * 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host filesystem device error | Filesystem device errors | 0
node_filesystem_device_error
Host unusual network throughput in | Network receive rate (MB/s) | 100 MB/s
(sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host unusual network throughput out | Network transmit rate (MB/s) | 100 MB/s
(sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host conntrack limit | Connection tracking table usage (%) | 80%
(node_nf_conntrack_entries / node_nf_conntrack_entries_limit * 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host Network Receive Errors | Receive error rate (% of packets) | 1%
(rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) * 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Host Network Transmit Errors | Transmit error rate (% of packets) | 1%
(rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) * 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Nginx high HTTP 4xx error rate | Client error rate (%) | 5%
sum(rate(nginx_server_requests{code="4xx"}[1m])) / sum(rate(nginx_server_requests{code="total"}[1m])) * 100
Nginx high HTTP 5xx error rate | Server error rate (%) | 5%
sum(rate(nginx_server_requests{code="5xx"}[1m])) / sum(rate(nginx_server_requests{code="total"}[1m])) * 100
cpu alert | CPU usage (%) | 50%
max(system_cpu_usage) * 100
load alert | 1-minute load average | 15
max(system_load_average_1m)
memory alert | JVM heap usage (%) | 60%
sum(jvm_memory_used_bytes{area="heap"})/sum(jvm_memory_max_bytes{area="heap"}) * 100
threads alert | JVM daemon thread count | 500
max(jvm_threads_daemon_threads)
gc alert | Major GC count increase (over 5 minutes) | 1
max(rate(jvm_gc_pause_seconds_count{action="end of major GC"}[5m])) * 300
notify task alert | Notify task count | 10
sum(nacos_monitor{name='notifyTask'})
rt alert | Request response time (ms) | 5000 ms
sum(rate(http_server_requests_seconds_sum[1m]))/sum(rate(http_server_requests_seconds_count[1m])) * 1000
long polling alert | Long-polling connection count | 5000
max(nacos_monitor{name='longPolling'})
config unhealth exception alert | Health check exception increase (per minute) | 1
sum(rate(nacos_exception_total{name='unhealth'}[1m])) * 60
db exception alert | Database exception increase (per minute) | 1
sum(rate(nacos_exception_total{name='db'}[1m])) * 60
failed push alert | Failed push count | 1
sum(nacos_monitor{name='failedPush'})
illegalArgument exception alert | Illegal request argument exception increase (per minute) | 1
sum(rate(nacos_exception_total{name='illegalArgument'}[1m])) * 60
naming disk exception alert | Disk exception increase (per minute) | 1
sum(rate(nacos_exception_total{name='disk'}[1m])) * 60
config notify exception alert | Notify exception increase (per minute) | 1
sum(rate(nacos_exception_total{name='configNotify'}[1m])) * 60
naming leader send beat failed exception alert | Heartbeat exception increase (per minute) | 1
sum(rate(nacos_exception_total{name='leaderSendBeatFailed'}[1m])) * 60
nacos exception alert | Internal exception increase (per minute) | 1
sum(rate(nacos_exception_total{name='nacos'}[1m])) * 60
Redis down | Service down | 0
redis_up
Redis disconnected slaves | Disconnected replica count | 0
count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1
Redis replication broken | Replica lost | 0
delta(redis_connected_slaves[1m])
Redis cluster flapping | Master/replica switchovers | 1
changes(redis_connected_slaves[1m])
Redis not enough connections | Client connection count | 5
redis_connected_clients
Redis rejected connections | Rejected connections | 0
increase(redis_rejected_connections_total[1m])
Zookeeper Down | Service down | 0
zk_up
Kafka brokers | Broker count | 3
kafka_brokers
RabbitMQ down | Service down | 0
rabbitmq_up
RabbitMQ cluster down | Running cluster node count | 3
sum(rabbitmq_running)
RabbitMQ out of memory | Memory usage (%) | 90%
rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100
RabbitMQ too many unack messages | Unacknowledged message count | 1000
sum(rabbitmq_queue_messages_unacknowledged) by (queue)
RabbitMQ no queue consumer | Consumer count | 1
rabbitmq_queue_consumers
RabbitMQ too many connections | Connection count | 1000
rabbitmq_connections
RabbitMQ cluster partition | Network partitions | 0
rabbitmq_partitions
Elasticsearch Cluster Yellow | Cluster status is yellow | 0
elasticsearch_cluster_health_status{color="yellow"}
Elasticsearch Cluster Red | Cluster status is red | 0
elasticsearch_cluster_health_status{color="red"}
Elasticsearch Heap Usage Too High | JVM heap usage (%) | 80%
(elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100
Elasticsearch disk out of space | Disk usage (%) | 90%
100 - elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100
Elasticsearch Healthy Nodes | Node count | 3
elasticsearch_cluster_health_number_of_nodes
Elasticsearch Healthy Data Nodes | Data node count | 3
elasticsearch_cluster_health_number_of_data_nodes
Elasticsearch unassigned shards | Unassigned shards | 0
elasticsearch_cluster_health_unassigned_shards
Doris fe query err | Query error count | 0
doris_fe_query_err
Doris fe edit log clean failed | Failed cleanups of historical metadata edit logs | 0
doris_fe_edit_log_clean{type='failed'}
Doris fe image clean failed | Failed cleanups of historical metadata image files | 0
doris_fe_image_clean{type='failed'}
Flink job failed check points | Failed checkpoint count | 0
flink_jobmanager_job_numberOfFailedCheckpoints
Flink job full restarts | Full restart count | 0
flink_jobmanager_job_fullRestarts
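
The table only lists expressions and thresholds. To fire an alert, each expression still has to be compared against its threshold inside a Prometheus rule file. A minimal sketch for two of the rows follows. The group name, alert names, comparison operators, `for` durations, `severity` label and annotation text are all assumptions added for illustration; none of them come from the table.

```yaml
groups:
  - name: example-alerts            # hypothetical group name
    rules:
      # Row "Host CPU high iowait": expression from the table, threshold 10%.
      - alert: HostCpuHighIowait    # hypothetical alert name
        expr: |
          (avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100)
            * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
            > 10
        for: 5m                     # assumed hold time before firing
        labels:
          severity: warning         # assumed label
        annotations:
          # node_uname_info always has the sample value 1, so the multiplication
          # leaves the result unchanged; group_left (nodename) only copies the
          # nodename label onto it, which is why it can be used here.
          summary: 'High iowait on {{ $labels.nodename }}'
          description: 'iowait is {{ printf "%.1f" $value }}% (threshold 10%)'

      # Row "Redis down": redis_up is 1 while the exporter can reach Redis,
      # so the assumed condition is "equals the threshold 0".
      - alert: RedisDown            # hypothetical alert name
        expr: redis_up == 0
        for: 1m                     # assumed hold time before firing
        labels:
          severity: critical        # assumed label
        annotations:
          summary: 'Redis is down on {{ $labels.instance }}'
```

Since `*` binds more tightly than `>` in PromQL, the comparison applies to the joined result as a whole. A file like this can be checked with `promtool check rules <file>` before Prometheus loads it.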