Flink Prometheus monitoring metrics
https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/metrics/


increase(flink_jobmanager_Status_JVM_CPU_Time[5m]) >0
With high availability (HA) enabled, this query can return more than one result.
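The queries in these notes can be pasted into the Prometheus UI, or sent to the instant-query endpoint `/api/v1/query` of the Prometheus HTTP API. A minimal sketch that only builds the request URL; the base URL `http://localhost:9090` is an assumption, so substitute your own server:

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Assumed Prometheus address; not taken from the original notes.
url = instant_query_url(
    "http://localhost:9090",
    "increase(flink_jobmanager_Status_JVM_CPU_Time[5m]) > 0",
)
print(url)
```

The URL encoding matters because PromQL uses characters such as `[`, `]`, and `>` that are not safe in a query string.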

Number of running jobs
flink_jobmanager_numRunningJobs and on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)
or
count(increase(flink_jobmanager_job_runningTime[5m]) >0)
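Most queries in these notes append `and on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)` so that only series belonging to a job that was running in the last five minutes are kept. As a rough illustration of that matching rule (a simplified Python model, not how Prometheus is implemented; the label values are made up), `and on(job)` keeps a left-hand series only if some right-hand series has the same `job` label:

```python
def and_on(left, right, labels):
    """Simplified model of PromQL `left and on(labels) right`.

    Each series is a (labels_dict, value) pair. A left-hand series is kept,
    with its own labels and value, when at least one right-hand series has
    the same values for the `on` labels.
    """
    right_keys = {tuple(lbls.get(name, "") for name in labels) for lbls, _ in right}
    return [
        (lbls, value)
        for lbls, value in left
        if tuple(lbls.get(name, "") for name in labels) in right_keys
    ]


# Hypothetical samples: two clusters, only "flink-a" had a job whose
# runningTime increased in the window (the `> 0` filter is already applied).
num_running_jobs = [({"job": "flink-a"}, 2.0), ({"job": "flink-b"}, 0.0)]
running_time_increase = [({"job": "flink-a", "job_id": "j1"}, 300000.0)]

active = and_on(num_running_jobs, running_time_increase, ["job"])
print(active)  # only the flink-a series survives
```

The values on the right-hand side are never used; the right-hand side only acts as a filter, which is why the same suffix can be attached to any metric that carries the matching labels.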

Number of registered TaskManagers
flink_jobmanager_numRegisteredTaskManagers and  on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)

Available task slots
flink_jobmanager_taskSlotsAvailable and  on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)


Total task slots
flink_jobmanager_taskSlotsTotal  and  on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)


JobManager loaded class count
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded  and  on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)

JobManager unloaded class count. Classes are unloaded and reclaimed during a full GC.

flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded  and  on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)

On the TaskManager, classes are not unloaded even by a full GC and Metaspace keeps growing. Possibly a memory leak?




JobManager heap memory used
flink_jobmanager_Status_JVM_Memory_Heap_Used  and  on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)

flink_jobmanager_Status_JVM_Memory_Metaspace_Used  and  on(job) (increase(flink_jobmanager_job_runningTime[5m]) >0)


Number of failed checkpoints
flink_jobmanager_job_numberOfFailedCheckpoints and on(job,job_id)  (increase(flink_jobmanager_job_runningTime[5m]) >0)


Number of restarts
flink_jobmanager_job_numRestarts  and on(job,job_id)  (increase(flink_jobmanager_job_runningTime[5m]) >0)

fullRestarts	Attention: deprecated, use numRestarts.	
numRestarts	The total number of restarts since this job was submitted, including full restarts and fine-grained restarts.


Number of change records
sum (flink_taskmanager_job_task_numRecordsIn )  by (job_id,job_name)    
and on(job_id)  (increase(flink_jobmanager_job_runningTime[5m]) >0) 


Change records per minute
60 * sum(rate(flink_taskmanager_job_task_numRecordsIn[5m]))
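`rate()` returns a per-second average over the window, so multiplying by 60 gives records per minute. A minimal sketch of that arithmetic, assuming a counter that never resets within the window and ignoring the extrapolation Prometheus applies at the window boundaries; the sample values are invented:

```python
def per_minute_rate(samples):
    """Approximate 60 * rate(counter[window]) from (timestamp_seconds, value) samples.

    Simplifications (assumptions, not Prometheus's exact algorithm):
    no counter resets, and no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    per_second = (v_last - v_first) / (t_last - t_first)
    return 60 * per_second


# Hypothetical numRecordsIn samples scraped every 60 s over 5 minutes.
samples = [(0, 1000), (60, 1600), (120, 2200), (180, 2800), (240, 3400), (300, 4000)]
print(per_minute_rate(samples))  # 600.0
```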


TaskManager Metaspace size
flink_taskmanager_Status_JVM_Memory_Metaspace_Used and on(tm_id)  (increase(flink_taskmanager_Status_JVM_CPU_Time[5m]) >0 )


(flink_taskmanager_Status_JVM_Memory_Metaspace_Used / (1024 * 1024) ) and on(tm_id)  (increase(flink_taskmanager_Status_JVM_CPU_Time[5m]) >0 )




Flink backpressure monitoring
(flink_taskmanager_job_task_isBackPressured  and on(job_id)  (increase(flink_jobmanager_job_runningTime[5m]) >0)) >0
When backpressure shows up, it is usually because writes to Doris are slow or have timed out:
org.apache.doris.flink.sink.batch.DorisBatchStreamLoad       [] - stream load error with 10.116.55.178:8040, to retry, cause by
java.net.SocketException: Connection timed out (Read failed)




flink_taskmanager_Status_JVM_Memory_Heap_Used{tm_id="10_116_55_38:41657_8ce5a1"} / (1024*1024*1024)

flink_taskmanager_Status_JVM_Memory_Heap_Max{tm_id="10_116_55_38:41657_8ce5a1"} / (1024*1024*1024)
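The divisions by `1024*1024*1024` convert bytes to GiB (the Metaspace query earlier uses `1024 * 1024` for MiB). The same conversion in Python, plus a heap usage percentage derived from the two metrics; the percentage is an illustrative addition not present in the original queries, and the byte values are invented:

```python
MIB = 1024 * 1024
GIB = 1024 * 1024 * 1024


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def heap_usage_percent(used_bytes, max_bytes):
    """Heap used as a percentage of heap max; None if max is not positive."""
    if max_bytes <= 0:
        return None
    return 100.0 * used_bytes / max_bytes


# Hypothetical values for Heap_Used and Heap_Max of one TaskManager.
heap_used = 3 * GIB
heap_max = 4 * GIB
print(to_gib(heap_used), to_gib(heap_max), heap_usage_percent(heap_used, heap_max))
```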

TaskManager old-generation mark-sweep collector (PS MarkSweep): run count and time
flink_taskmanager_Status_JVM_GarbageCollector_PS_MarkSweep_Count{tm_id="10_116_55_38:41657_8ce5a1"}

flink_taskmanager_Status_JVM_GarbageCollector_PS_MarkSweep_Time{tm_id="10_116_55_38:41657_8ce5a1"}

TaskManager young-generation parallel collector (PS Scavenge): run count and time
flink_taskmanager_Status_JVM_GarbageCollector_PS_Scavenge_Count{tm_id="10_116_55_38:41657_8ce5a1"}
flink_taskmanager_Status_JVM_GarbageCollector_PS_Scavenge_Time{tm_id="10_116_55_38:41657_8ce5a1"}
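`Count` and `Time` are cumulative, so an average time per collection has to be computed from the difference between two scrapes. A sketch, assuming `Time` is reported in milliseconds (worth checking against the Flink metrics documentation for your version); the sample values are invented:

```python
def avg_gc_time_ms(count_start, count_end, time_start_ms, time_end_ms):
    """Average GC time per collection between two scrapes.

    Returns None when no collection happened in the interval.
    """
    collections = count_end - count_start
    if collections <= 0:
        return None
    return (time_end_ms - time_start_ms) / collections


# Hypothetical PS_MarkSweep_Count / PS_MarkSweep_Time values at two scrapes.
print(avg_gc_time_ms(10, 14, 2000, 2600))  # 150.0
```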




flink_jobmanager_Status_JVM_GarbageCollector_PS_MarkSweep_Count{ host="10_116_55_36", job="flinkeae2fd6d7a41eff4f057f5876275b058"}


flink_jobmanager_Status_JVM_GarbageCollector_PS_Scavenge_Count{host="10_116_55_36", job="flinkeae2fd6d7a41eff4f057f5876275b058"}




Node memory
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_MemFree_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes
(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes) / node_memory_MemAvailable_bytes {app="node-exporter", instance=~"192.168.1.10:9103|192.168.1.11:9103"}
