Flink CDC 3.0: checkpoint and restart strategy configuration and testing
Flink 1.18.1
Flink CDC 3.0
flink-conf.yaml configuration
state.backend.type: filesystem
execution.checkpointing.interval: 3min
state.checkpoints.dir: file:///Users/dugang/work/test/flink_state/checkpoints
state.savepoints.dir: file:///Users/dugang/work/test/flink_state/savepoints
state.backend.incremental: false
execution.checkpointing.min-pause: 1000
execution.checkpointing.timeout: 60s
execution.checkpointing.max-concurrent-checkpoints: 500
execution.checkpointing.tolerable-failed-checkpoints: 10
# Retain the checkpoint when the job is cancelled from the web console
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 100
restart-strategy.fixed-delay.delay: 60s
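The fixed-delay settings above can be turned into a rough time budget, which shows how long a failure may last before the job gives up for good. This is a minimal sketch: restart_budget is a hypothetical helper, not a Flink API, and it counts only the waiting time between attempts, so the real window is somewhat longer.

```python
from datetime import timedelta


def restart_budget(attempts: int, delay: timedelta) -> timedelta:
    """Total waiting time between restarts before fixed-delay is exhausted.

    Ignores the time each attempt needs to start and fail again,
    so this is a lower bound on the real window.
    """
    return attempts * delay


# Values taken from the flink-conf.yaml above: 100 attempts, 60s delay.
budget = restart_budget(attempts=100, delay=timedelta(seconds=60))
print(budget)  # 1:40:00
```

With these values the job keeps retrying for at least 100 minutes before it fails permanently.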
Checkpoint directory layout
2253a8207f30ab3ca7b4eb967900427c
  chk-12
    _metadata
Each job gets its own directory, named after the job ID.
chk-12 is the 12th checkpoint.
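Given this layout, the newest retained checkpoint of a job can be located by scanning the job directory for the highest-numbered chk-N folder that contains a _metadata file. The sketch below assumes exactly the layout shown here; latest_checkpoint is a hypothetical helper name, not part of Flink.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(job_dir: Path) -> Optional[Path]:
    """Return the chk-N directory with the highest N that contains _metadata.

    Directories without _metadata are skipped, on the assumption that
    they belong to checkpoints that did not complete.
    """
    candidates = []
    for child in job_dir.glob("chk-*"):
        suffix = child.name[len("chk-"):]
        if child.is_dir() and suffix.isdigit() and (child / "_metadata").is_file():
            candidates.append((int(suffix), child))
    if not candidates:
        return None
    # Compare numerically so that chk-12 sorts after chk-3.
    return max(candidates, key=lambda c: c[0])[1]
```

For example, calling latest_checkpoint with the path of the 2253a8207f30ab3ca7b4eb967900427c directory under state.checkpoints.dir would return the chk-12 path for the layout above.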
Flink web console
Overview
Shows the number of records and bytes for the source and sink
Status
Bytes Received
Records Received
Bytes Sent
Records Sent
Parallelism
Start Time
Exceptions: shows error information
Checkpoints: shows checkpoint information
Overview
History
Summary
Configuration
Checkpointing Mode Exactly Once
Checkpoint Storage FileSystemCheckpointStorage
State Backend HashMapStateBackend
Interval 3m 0s
Configuration
Restart strategy
Restart with fixed delay (60000 ms). #100 restart attempts.
TimeLine
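Test scenario
Rename the table in Doris. Flink CDC then fails while syncing, and the restart strategy kicks in.
Rename the table back to its original name and the job recovers; subsequent syncing works.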
alter table t1 rename t1_001
alter table t1_001 rename t1
Key log messages
Caused by: org.apache.doris.flink.exception.DorisBatchLoadException: stream load error: [ANALYSIS_ERROR]TStatus: errCode = 7, detailMessage = unknown table, tableName=t1, see more in null
2024-03-27 17:41:02,960 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - 2 tasks will be restarted to recover the failed task 5abf87f4cc2fd607ed9659cb1647b0be_d40592faea9b13cc59503ebfb2b12986_0_1.
2024-03-27 17:41:02,961 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job mysql-doris-001 (2253a8207f30ab3ca7b4eb967900427c) switched from state RUNNING to RESTARTING.
2024-03-27 17:41:02,962 WARN org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 3 for job 2253a8207f30ab3ca7b4eb967900427c. (0 consecutive failed attempts so far)
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
Questions
How do you specify a checkpoint when submitting a job from the command line?
First save the job state as a savepoint:
./flink savepoint 93d57e55922989282f13fbf1804f4052 /Users/dugang/work/flinksavepoint
flinksavepoint
  savepoint-93d57e-096b303dd391
    _metadata
flink-conf.yaml configuration
execution.savepoint.path: /Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391
So far this is the only approach that has taken effect.
Setting it as a JVM parameter in flink-cdc.sh (had no effect):
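Since editing flink-conf.yaml is the approach that worked, the change can be scripted instead of done by hand. The following is a minimal sketch, assuming the flat "key: value" style of flink-conf.yaml shown in this post; set_conf_key is a hypothetical helper, not a Flink or Flink CDC tool.

```python
def set_conf_key(conf_text: str, key: str, value: str) -> str:
    """Set `key: value` in flat flink-conf.yaml text.

    Replaces the first uncommented line for `key` if one exists,
    otherwise appends a new line. Commented-out lines are left alone.
    """
    lines = conf_text.splitlines()
    new_line = f"{key}: {value}"
    for i, line in enumerate(lines):
        # The text before the first colon is the key; "# key" does not match.
        if line.split(":", 1)[0].strip() == key:
            lines[i] = new_line
            break
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


conf = "restart-strategy: fixed-delay\n"
conf = set_conf_key(
    conf,
    "execution.savepoint.path",
    "/Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391",
)
print(conf)
```

Because the setting lives in the shared configuration file, it presumably applies to every job submitted with that file, so it is worth removing the key again once the job has been restored.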
-Dexecution.savepoint.path=/Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391
-Dexecution.savepoint.path=file:///Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391
2024-03-28 06:39:42,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Starting job 2f03d8b451bd6c0a7a4db748923a321d from savepoint /Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391 ()
Flink CDC 3.0: how do you restore from a specific savepoint when restarting a job?
https://developer.aliyun.com/ask/608958
https://developer.aliyun.com/ask/602807