Flink Prometheus monitoring metrics
https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/metrics/
increase(flink_jobmanager_Status_JVM_CPU_Time[5m]) >0
With high availability (HA) enabled, this query can return more than one result.
Number of running jobs
flink_jobmanager_numRunningJobs and on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)
or
count(increase(flink_jobmanager_job_runningTime[5m]) >0)
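The liveness filter in the queries above (keep a series only if the job's runningTime grew in the last 5 minutes) recurs in many queries in these notes. A minimal sketch of a helper that appends it, assuming you assemble the query strings in Python; the function name and its parameters are hypothetical, not part of Flink or Prometheus:

```python
def only_live(metric: str, on_labels=("job",), window: str = "5m") -> str:
    """Append the 'job was running recently' filter to a PromQL expression.

    Hypothetical helper: it only builds a string, it does not talk to Prometheus.
    """
    labels = ",".join(on_labels)
    live = f"increase(flink_jobmanager_job_runningTime[{window}]) > 0"
    # "and on(...)" keeps only left-hand series that have a matching live job
    return f"{metric} and on({labels}) ({live})"


if __name__ == "__main__":
    print(only_live("flink_jobmanager_numRunningJobs"))
    print(only_live("flink_jobmanager_job_numRestarts", on_labels=("job", "job_id")))
```

Per-job metrics need job_id in the on(...) list, as in the checkpoint and restart queries in these notes.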
Number of TaskManagers
flink_jobmanager_numRegisteredTaskManagers and on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)
Available task slots
flink_jobmanager_taskSlotsAvailable and on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)
Total task slots
flink_jobmanager_taskSlotsTotal and on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)
JobManager: number of loaded classes
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded and on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)
JobManager: number of unloaded classes (classes are unloaded and reclaimed during a full GC)
flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded and on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)
On the TaskManager, classes are not unloaded even after a full GC and Metaspace keeps growing; possibly a memory leak?
JobManager heap memory used
flink_jobmanager_Status_JVM_Memory_Heap_Used and on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)
JobManager Metaspace used
flink_jobmanager_Status_JVM_Memory_Metaspace_Used and on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)
Number of failed checkpoints
flink_jobmanager_job_numberOfFailedCheckpoints and on(job,job_id) (increase(flink_jobmanager_job_runningTime[5m]) >0)
Number of restarts
flink_jobmanager_job_numRestarts and on(job,job_id) (increase(flink_jobmanager_job_runningTime[5m]) >0)
From the Flink metrics documentation:
fullRestarts: Attention: deprecated, use numRestarts.
numRestarts: The total number of restarts since this job was submitted, including full restarts and fine-grained restarts.
Number of changed records (numRecordsIn summed per job)
sum (flink_taskmanager_job_task_numRecordsIn ) by (job_id,job_name)
and on(job_id) (increase(flink_jobmanager_job_runningTime[5m]) >0)
Changed records per minute
60 * sum(rate(flink_taskmanager_job_task_numRecordsIn[5m]))
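rate() returns a per-second average over the window, which is why the query above multiplies by 60 to get records per minute. A simplified worked example of that arithmetic, using made-up counter values; Prometheus's real rate() also handles counter resets and extrapolates to the window edges, which this sketch ignores:

```python
def records_per_minute(first_value: float, last_value: float, window_seconds: float) -> float:
    """Approximate 60 * rate(counter[window]) from two counter samples.

    Simplification: no counter-reset handling and no extrapolation.
    """
    per_second = (last_value - first_value) / window_seconds
    return 60 * per_second


if __name__ == "__main__":
    # Hypothetical samples: the counter went from 1000 to 4000 over a 5-minute window
    print(records_per_minute(1000, 4000, 300))
```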
TaskManager Metaspace used (bytes, then MiB)
flink_taskmanager_Status_JVM_Memory_Metaspace_Used and on(tm_id) (increase(flink_taskmanager_Status_JVM_CPU_Time[5m]) >0 )
(flink_taskmanager_Status_JVM_Memory_Metaspace_Used / (1024 * 1024) ) and on(tm_id) (increase(flink_taskmanager_Status_JVM_CPU_Time[5m]) >0 )
Flink back-pressure monitoring
(flink_taskmanager_job_task_isBackPressured and on(job_id) (increase(flink_jobmanager_job_runningTime[5m]) >0)) >0
When back pressure appears, it is usually because writes to Doris are slow and time out, for example:
org.apache.doris.flink.sink.batch.DorisBatchStreamLoad [] - stream load error with 10.116.55.178:8040, to retry, cause by
java.net.SocketException: Connection timed out (Read failed)
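The back-pressure query above can also be run outside the Prometheus UI through the instant-query endpoint of the HTTP API (/api/v1/query). A minimal sketch, assuming a Prometheus server at a base URL of your choosing; the helper names are hypothetical, and the label names job_name, task_name and subtask_index are assumed to be what the Flink Prometheus reporter attaches, so check them against your own series:

```python
import json
from urllib.parse import urlencode

BACK_PRESSURE_QUERY = (
    "(flink_taskmanager_job_task_isBackPressured and on(job_id) "
    "(increase(flink_jobmanager_job_runningTime[5m]) > 0)) > 0"
)


def instant_query_url(base_url: str, query: str) -> str:
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def back_pressured_tasks(response_body: str):
    """Extract (job_name, task_name, subtask_index) from an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    tasks = []
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        # Label names are assumptions; missing ones are shown as "?"
        tasks.append((
            labels.get("job_name", "?"),
            labels.get("task_name", "?"),
            labels.get("subtask_index", "?"),
        ))
    return sorted(tasks)


if __name__ == "__main__":
    # Assumed address; fetch this URL with any HTTP client and pass the body on
    print(instant_query_url("http://localhost:9090", BACK_PRESSURE_QUERY))
```

Building the URL and parsing the body are kept separate so that either can be used with whatever HTTP client is already in place.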
TaskManager heap used and heap max (GiB)
flink_taskmanager_Status_JVM_Memory_Heap_Used{tm_id="10_116_55_38:41657_8ce5a1"} / (1024*1024*1024)
flink_taskmanager_Status_JVM_Memory_Heap_Max{tm_id="10_116_55_38:41657_8ce5a1"} / (1024*1024*1024)
TaskManager old-generation mark-sweep collector (PS MarkSweep): run count and total time
flink_taskmanager_Status_JVM_GarbageCollector_PS_MarkSweep_Count{tm_id="10_116_55_38:41657_8ce5a1"}
flink_taskmanager_Status_JVM_GarbageCollector_PS_MarkSweep_Time{tm_id="10_116_55_38:41657_8ce5a1"}
TaskManager young-generation parallel collector (PS Scavenge): run count and total time
flink_taskmanager_Status_JVM_GarbageCollector_PS_Scavenge_Count{tm_id="10_116_55_38:41657_8ce5a1"}
flink_taskmanager_Status_JVM_GarbageCollector_PS_Scavenge_Time{tm_id="10_116_55_38:41657_8ce5a1"}
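The Count and Time metrics above are cumulative totals (Time is in milliseconds), so a single reading says little on its own; the difference between two readings is more useful. A minimal sketch of the arithmetic for the average pause per collection between two scrapes, using made-up sample values and a hypothetical function name:

```python
from typing import Optional


def avg_gc_pause_ms(count_start: float, time_start_ms: float,
                    count_end: float, time_end_ms: float) -> Optional[float]:
    """Average GC pause (ms) between two readings of the Count and Time metrics.

    Returns None when no collection happened in between.
    """
    collections = count_end - count_start
    if collections <= 0:
        return None
    return (time_end_ms - time_start_ms) / collections


if __name__ == "__main__":
    # Hypothetical readings: 4 collections took 400 ms in total
    print(avg_gc_pause_ms(10, 500, 14, 900))
```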
JobManager GC run counts (old generation, then young generation)
flink_jobmanager_Status_JVM_GarbageCollector_PS_MarkSweep_Count{host="10_116_55_36", job="flinkeae2fd6d7a41eff4f057f5876275b058"}
flink_jobmanager_Status_JVM_GarbageCollector_PS_Scavenge_Count{host="10_116_55_36", job="flinkeae2fd6d7a41eff4f057f5876275b058"}
Node memory
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_MemFree_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes
(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes) / node_memory_MemAvailable_bytes {app="node-exporter", instance=~"192.168.1.10:9103|192.168.1.11:9103"}