
Grafana Alerting Setup
https://grafana.com/products/cloud/alerting/

https://grafana.com/docs/grafana/latest/alerting/

How Grafana Alerting works

Labels match alert instances to notification policies and silences and can be used to group your alerts by severity.
Notification policy is the set of rules for where, when, and how the alerts get routed. 
Notification policies have a tree structure, where each policy can also match specific alert labels.

Contact points define how your contacts are notified when an alert fires. 
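The label-to-policy matching described above can be sketched in a few lines of Python. This is a simplified model, not Grafana's implementation: the Policy class, the route function, and the contact point names are all hypothetical, matchers are limited to exact equality, and the first matching child wins.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Policy:
    """One node in a hypothetical notification policy tree."""
    contact_point: str
    matchers: Dict[str, str] = field(default_factory=dict)  # label -> required value
    children: List["Policy"] = field(default_factory=list)

    def matches(self, labels: Dict[str, str]) -> bool:
        # A policy matches when every matcher equals the alert's label value.
        return all(labels.get(k) == v for k, v in self.matchers.items())


def route(policy: Policy, labels: Dict[str, str]) -> str:
    """Return the contact point of the deepest matching policy.

    Assumption: the root is the default policy and is used when no child matches.
    """
    for child in policy.children:
        if child.matches(labels):
            return route(child, labels)
    return policy.contact_point


# Hypothetical tree: a default policy, a Flink policy, and a nested critical policy.
root = Policy(
    contact_point="default-email",
    children=[
        Policy(
            contact_point="flink-webhook",
            matchers={"type": "flink"},
            children=[
                Policy(contact_point="oncall-sms",
                       matchers={"severity": "critical"}),
            ],
        ),
    ],
)

print(route(root, {"type": "flink"}))
print(route(root, {"type": "flink", "severity": "critical"}))
print(route(root, {"type": "kafka"}))
```

An alert labeled type=flink lands on the Flink policy, adding severity=critical moves it to the nested policy, and anything else falls back to the default.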



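Choose the metrics to monitor
CPU utilization, memory usage, network traffic, and so on

Set the alert conditions
These conditions determine when an alert notification is triggered; you can set thresholds, durations, and other related parameters

Choose the notification method
Email, SMS, webhook, and so on

Save and enable the alert rule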



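When choosing metrics to monitor, pick them according to your actual situation and needs. Do not monitor too many metrics indiscriminately, or you will face information overload.
When setting alert conditions, take the metric's volatility and peaks into account. Conditions that are too sensitive trigger notifications too often, while conditions that are not sensitive enough can miss important signals.
When choosing the notification method, weigh urgency against importance. In some cases an immediate SMS is the better fit; in others an email is more convenient.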





evaluate every  1m  for  5m 

evaluate every
How often the alert rule is evaluated to check whether it should fire.
evaluate for
Once a condition is breached, the alert goes into the Pending state.
If the condition remains breached for the duration specified, the alert transitions to the Firing state; otherwise it reverts to the Normal state.


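The rule is evaluated once per minute against the threshold. When the threshold is first exceeded, the alert moves from Normal (OK) to Pending. If the threshold stays exceeded for 5 minutes, the alert fires and a notification is sent; otherwise it returns to Normal.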


Evaluate every: the evaluation frequency; must be a multiple of 10s
For: how long the condition must stay true before the alert fires




Alert state transitions
Normal > Pending > Firing
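The interaction between "evaluate every" and "for" can be simulated with a small sketch. It is a simplified model under stated assumptions: validate_interval and simulate are hypothetical names, only the Normal, Pending, and Firing states are modeled, and the input is one boolean per evaluation saying whether the condition was breached.

```python
from datetime import timedelta
from typing import Iterable, List, Optional

BASE_INTERVAL = timedelta(seconds=10)  # "evaluate every" must be a multiple of this


def validate_interval(evaluate_every: timedelta) -> None:
    """Reject intervals that are not a positive multiple of 10s."""
    if evaluate_every <= timedelta(0) or evaluate_every % BASE_INTERVAL != timedelta(0):
        raise ValueError("evaluate every must be a positive multiple of 10s")


def simulate(breached: Iterable[bool], evaluate_every: timedelta,
             hold_for: timedelta) -> List[str]:
    """Return the alert state after each evaluation."""
    validate_interval(evaluate_every)
    states: List[str] = []
    breached_for: Optional[timedelta] = None  # how long the breach has lasted so far
    for is_breached in breached:
        if not is_breached:
            # Any healthy evaluation resets the alert to Normal.
            breached_for = None
            states.append("Normal")
            continue
        # The first breached evaluation starts the clock at zero;
        # each later one adds one evaluation interval.
        if breached_for is None:
            breached_for = timedelta(0)
        else:
            breached_for += evaluate_every
        states.append("Firing" if breached_for >= hold_for else "Pending")
    return states


# evaluate every 1m for 5m, with a short blip followed by a sustained breach
samples = [False, True, True, False, True, True, True, True, True, True]
for state in simulate(samples, timedelta(minutes=1), timedelta(minutes=5)):
    print(state)
```

The short blip only reaches Pending and drops back to Normal, while the sustained breach fires on the sixth consecutive breached evaluation, 5 minutes after it entered Pending.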

Create a Grafana managed alerting rule
https://grafana.com/docs/grafana/v8.5/alerting/unified-alerting/alerting-rules/create-grafana-managed-rule/
https://www.bookstack.cn/read/Grafana-8.5-en/36e63104e080a0c7.md

Annotations and labels for alerting rules
https://grafana.com/docs/grafana/v8.5/alerting/unified-alerting/alerting-rules/alert-annotation-label/
https://www.bookstack.cn/read/Grafana-8.5-en/53b1337424ed86a6.md

Manage alerting rules
https://grafana.com/docs/grafana/v8.5/alerting/unified-alerting/alerting-rules/rule-list/
https://www.bookstack.cn/read/Grafana-8.5-en/dea7378cb0a95ded.md





Create a new alert rule
New alert rule


Rule name
Flink data sync job count

folder=Alert  , group=flink_group 
count(rate(flink_jobmanager_job_runningTime[5m]) >0)

classic condition 
last() of A is BELOW 6 
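To check the query outside Grafana, a minimal sketch can build an instant-query URL for the Prometheus HTTP API and apply the same "below 6" comparison to the response. The Prometheus address http://localhost:9090, the helper names, and the sample response body are assumptions for illustration; the sketch does not send any request itself.

```python
import json
from typing import Optional
from urllib.parse import urlencode

QUERY = "count(rate(flink_jobmanager_job_runningTime[5m]) > 0)"


def build_query_url(base_url: str, query: str) -> str:
    """Build an instant-query URL (/api/v1/query) for the Prometheus HTTP API."""
    params = urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def last_value(response_body: str) -> Optional[float]:
    """Extract the single value of an instant-query response, or None when empty."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    result = payload["data"]["result"]
    if not result:
        # count() over an empty vector returns no series at all, i.e. "no data".
        return None
    _timestamp, value = result[0]["value"]
    return float(value)


def is_below(value: Optional[float], threshold: float = 6) -> Optional[bool]:
    """Mirror the classic condition: True when the value is below the threshold."""
    return None if value is None else value < threshold


# Made-up response body in the shape Prometheus returns for an instant vector.
sample = ('{"status":"success","data":{"resultType":"vector",'
          '"result":[{"metric":{},"value":[1700000000,"4"]}]}}')
empty = '{"status":"success","data":{"resultType":"vector","result":[]}}'

print(build_query_url("http://localhost:9090", QUERY))
print(is_below(last_value(sample)))
print(is_below(last_value(empty)))
```

Note the empty case: when no job has a positive rate, the query returns nothing rather than 0, so the rule sees no data instead of a value below 6.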


details for alert

message  (custom content, used to pass extra parameters to the webhook endpoint)
{"title":"Flink job count is below 6","serviceName":"Flink","alertLevel":"normal"}
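On the receiving side, the webhook endpoint has to decode this custom message. A minimal sketch of that step follows, assuming the receiver has already extracted the message string from the request body (which field carries it depends on the Grafana version and contact point settings and is not shown here). AlertMessage and parse_message are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class AlertMessage:
    title: str
    service_name: str
    alert_level: str


def parse_message(raw: str) -> AlertMessage:
    """Decode the custom JSON message; raise ValueError on bad input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"message is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("message must be a JSON object")
    missing = [k for k in ("title", "serviceName", "alertLevel") if k not in data]
    if missing:
        raise ValueError(f"message is missing fields: {missing}")
    return AlertMessage(data["title"], data["serviceName"], data["alertLevel"])


# A message shaped like the one configured on the rule.
raw = '{"title":"Flink job count is below 6","serviceName":"Flink","alertLevel":"normal"}'
msg = parse_message(raw)
print(msg.service_name, msg.alert_level)
```

Validating the fields up front keeps a malformed rule message from failing later in the notification pipeline.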

Custom Labels
type=flink 



http://localhost:9000/alerting/list
Filters
state   firing normal pending 
rule-type  alert recording 

View as
list 
grouped   (grouped by folder and group) 
state  (grouped by state) 




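Grafana also fires alerts when there is no data
On day T-1 there was data from 15:00 to 15:10 and none after 15:15; the panel showed No data
On day T an alert email arrived at 11:20 in the morning

Alert rule configuration page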

No Data & Error Handling 
if no data or all values are null  
set state to    

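No-data option     Description
No Data            Sets the alert rule state to NoData, which triggers a notification
Alerting           Sets the alert rule state to Alerting
Keep Last State    Keeps the alert rule's current state
Ok                 Sets the alert rule state to OK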



Execution errors or timeouts

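Error or timeout option    Description
Alerting                   Sets the alert rule state to Alerting
Keep Last State            Keeps the alert rule's current state
If the data is unstable, setting this to Keep Last State is recommended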
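The effect of these two settings can be modeled as a small lookup. This is only a sketch of the option tables above, not Grafana's evaluation code: resolve_state and the outcome names "data", "no_data", and "error" are hypothetical, and the Pending state and "for" duration are ignored.

```python
NO_DATA_OPTIONS = {"No Data", "Alerting", "Keep Last State", "Ok"}
ERROR_OPTIONS = {"Alerting", "Keep Last State"}


def resolve_state(outcome: str, last_state: str, condition_met: bool,
                  on_no_data: str, on_error: str) -> str:
    """Return the rule state after one evaluation.

    outcome is "data", "no_data", or "error" (hypothetical names).
    """
    if on_no_data not in NO_DATA_OPTIONS:
        raise ValueError(f"unknown no-data option: {on_no_data}")
    if on_error not in ERROR_OPTIONS:
        raise ValueError(f"unknown error option: {on_error}")
    if outcome == "data":
        # Normal evaluation: the condition alone decides the state.
        return "Alerting" if condition_met else "Ok"
    if outcome == "no_data":
        option = on_no_data
    elif outcome == "error":
        option = on_error
    else:
        raise ValueError(f"unknown outcome: {outcome}")
    if option == "Keep Last State":
        return last_state
    if option == "No Data":
        return "NoData"
    return option  # "Alerting" or "Ok"


# A rule that was Ok and then stops receiving data:
print(resolve_state("no_data", "Ok", False, "No Data", "Alerting"))
print(resolve_state("no_data", "Ok", False, "Keep Last State", "Alerting"))
```

With the No Data option the rule leaves Ok as soon as data stops arriving, which is how a notification can be sent even though no threshold was crossed; Keep Last State leaves the rule at Ok.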









References

Grafana alerting system configuration
https://blog.csdn.net/qq_38571773/article/details/128735955