
Flink CDC 3.0 checkpoint and restart strategy: configuration and testing    Category: flink
Flink 1.18.1
Flink CDC 3.0

flink-conf.yaml configuration:

state.backend.type: filesystem
execution.checkpointing.interval: 3min
state.checkpoints.dir: file:///Users/dugang/work/test/flink_state/checkpoints
state.savepoints.dir: file:///Users/dugang/work/test/flink_state/savepoints
state.backend.incremental: false
execution.checkpointing.min-pause: 1000
execution.checkpointing.timeout: 60s
execution.checkpointing.max-concurrent-checkpoints: 500
execution.checkpointing.tolerable-failed-checkpoints: 10
# Retain checkpoints when the job is cancelled (e.g. from the web console)
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION



restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 100
restart-strategy.fixed-delay.delay: 60s
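Several of the values above are Flink durations: `3min` and `60s` carry explicit units, while a bare number such as `1000` for `execution.checkpointing.min-pause` is read as milliseconds. The sketch below is a minimal illustration, not Flink's own parser (it handles only a subset of the unit suffixes Flink accepts, and the helper names are hypothetical). It converts these values and works out roughly how long the fixed-delay strategy keeps retrying: 100 attempts spaced 60 s apart, not counting the time each attempt runs before failing again.

```python
import re
from datetime import timedelta

# Subset of the unit suffixes Flink accepts for durations; a bare number means milliseconds.
_UNITS_MS = {"ms": 1, "s": 1_000, "sec": 1_000, "m": 60_000, "min": 60_000,
             "h": 3_600_000, "d": 86_400_000}


def parse_duration(text: str) -> timedelta:
    """Convert a Flink-style duration such as '3min', '60s' or '1000' to a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    amount, unit = int(match.group(1)), match.group(2) or "ms"
    if unit not in _UNITS_MS:
        raise ValueError(f"unsupported unit {unit!r} in {text!r}")
    return timedelta(milliseconds=amount * _UNITS_MS[unit])


def parse_flat_conf(text: str) -> dict:
    """Read flat 'key: value' lines, skipping blank lines and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")  # split on the first colon only
        conf[key.strip()] = value.strip()
    return conf


def restart_window(conf: dict) -> timedelta:
    """Approximate total waiting time the fixed-delay strategy allows (attempts x delay)."""
    attempts = int(conf["restart-strategy.fixed-delay.attempts"])
    return attempts * parse_duration(conf["restart-strategy.fixed-delay.delay"])


CONF = """
execution.checkpointing.interval: 3min
execution.checkpointing.min-pause: 1000
execution.checkpointing.timeout: 60s
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 100
restart-strategy.fixed-delay.delay: 60s
"""

conf = parse_flat_conf(CONF)
print(parse_duration(conf["execution.checkpointing.interval"]))   # 0:03:00
print(parse_duration(conf["execution.checkpointing.min-pause"]))  # 0:00:01
print(restart_window(conf))                                       # 1:40:00
```

So a failure like the renamed table in the test below has to be fixed within about 100 minutes before the job gives up. Note also that the Flink documentation states a non-zero `min-pause` implies only one concurrent checkpoint, so `max-concurrent-checkpoints: 500` has no practical effect with this configuration.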



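Checkpoint directory layout: <state.checkpoints.dir>/2253a8207f30ab3ca7b4eb967900427c/chk-12/_metadata. Each job gets its own directory named after its job ID, and chk-12 holds the 12th checkpoint.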
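Because every retained checkpoint sits in `<state.checkpoints.dir>/<job id>/chk-<n>/_metadata`, the newest restorable one can be found by scanning the job's directory. A minimal sketch, assuming the checkpoint directory is on a local filesystem (as with the `file://` URI above, passed here without the scheme); the function name `latest_checkpoint` is hypothetical. A `chk-<n>` directory only counts if its `_metadata` file exists, since a checkpoint that never completed may not have one.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(checkpoints_dir: str, job_id: str) -> Optional[Path]:
    """Return the chk-<n> directory with the highest n that contains a _metadata file."""
    job_dir = Path(checkpoints_dir) / job_id
    if not job_dir.is_dir():
        return None
    best_n, best_dir = -1, None
    for child in job_dir.iterdir():
        match = re.fullmatch(r"chk-(\d+)", child.name)
        # Skip anything that is not a completed checkpoint directory.
        if match and (child / "_metadata").is_file():
            n = int(match.group(1))
            if n > best_n:
                best_n, best_dir = n, child
    return best_dir
```

For example, `latest_checkpoint("/Users/dugang/work/test/flink_state/checkpoints", "2253a8207f30ab3ca7b4eb967900427c")` would return the `chk-12` directory once that checkpoint has completed. Flink can resume from a retained checkpoint in the same way as from a savepoint, so that path is a candidate value for `execution.savepoint.path`.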
Flink web console:
- Overview: record and byte counts for the source and sink (Status, Bytes Received, Records Received, Bytes Sent, Records Sent, Parallelism, Start Time)
- Exceptions: error messages
- Checkpoints: checkpoint information, with the Overview, History, Summary and Configuration tabs. Configuration shows Checkpointing Mode: Exactly Once, Checkpoint Storage: FileSystemCheckpointStorage, State Backend: HashMapStateBackend, Interval: 3m 0s
- Configuration: the restart strategy, shown as "Restart with fixed delay (60000 ms). #100 restart attempts."
- TimeLine
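The checkpoint figures on the Checkpoints tab are also served by Flink's REST API at `GET /jobs/<job id>/checkpoints`. A minimal sketch, assuming the JobManager listens on the default `http://localhost:8081`; the fields read here (`counts`, `latest.completed.id`, `latest.completed.external_path`) follow the REST API's checkpoint statistics, and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_checkpoint_stats(job_id: str, base_url: str = "http://localhost:8081") -> dict:
    """Fetch checkpoint statistics for one job from the Flink REST API."""
    with urlopen(f"{base_url}/jobs/{job_id}/checkpoints", timeout=10) as resp:
        return json.load(resp)


def summarize(stats: dict) -> dict:
    """Pick out completed/failed counts and the latest completed checkpoint."""
    counts = stats.get("counts", {})
    latest = (stats.get("latest") or {}).get("completed") or {}
    return {
        "completed": counts.get("completed", 0),
        "failed": counts.get("failed", 0),
        "latest_id": latest.get("id"),
        "latest_path": latest.get("external_path"),
    }
```

Against a running cluster, `summarize(fetch_checkpoint_stats("2253a8207f30ab3ca7b4eb967900427c"))` gives the same counts as the web console plus the storage path of the latest completed checkpoint, which is handy for scripting a restore.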
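Test scenario: rename the Doris table. Flink CDC then fails while syncing and the restart strategy kicks in. After the table is renamed back to its original name, the job recovers and subsequent syncing works normally.
alter table t1 rename t1_001;
alter table t1_001 rename t1;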
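This test exercises the fixed-delay strategy: each failure consumes one restart attempt, the job waits the fixed delay, and it only fails for good once the attempts run out. The sketch below is a simplified simulation, not Flink code; `FixedDelayRestart`, `simulate` and the `table_exists` callback (a stand-in for the Doris table being renamed away and back) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FixedDelayRestart:
    """Simplified model of a fixed-delay restart strategy (illustrative, not Flink code)."""
    attempts: int  # restart-strategy.fixed-delay.attempts
    delay_s: int   # restart-strategy.fixed-delay.delay, in seconds
    used: int = 0  # restarts consumed so far

    def can_restart(self) -> bool:
        return self.used < self.attempts

    def on_failure(self) -> int:
        """Consume one restart attempt and return the wait before the next run."""
        self.used += 1
        return self.delay_s


def simulate(strategy: FixedDelayRestart, table_exists: Callable[[int], bool]) -> List[str]:
    """Run the job at simulated time t; it fails whenever the sink table is missing."""
    t, events = 0, []
    while True:
        if table_exists(t):
            events.append(f"t={t}s RUNNING")
            return events
        events.append(f"t={t}s FAILED (unknown table)")
        if not strategy.can_restart():
            events.append(f"t={t}s job FAILED, attempts exhausted")
            return events
        delay = strategy.on_failure()
        events.append(f"t={t}s RESTARTING, attempt {strategy.used}, next run in {delay}s")
        t += delay


# Table renamed away at t=0 and renamed back at t=150 (hypothetical timing).
strategy = FixedDelayRestart(attempts=100, delay_s=60)
for event in simulate(strategy, lambda t: t >= 150):
    print(event)
```

With a 60 s delay the job comes back on the first restart after the table is restored, which matches the behaviour observed in the test; with too few attempts it would have ended in a terminal failure instead.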
Key log output:
Caused by: org.apache.doris.flink.exception.DorisBatchLoadException: stream load error: [ANALYSIS_ERROR]TStatus: errCode = 7, detailMessage = unknown table, tableName=t1, see more in null
2024-03-27 17:41:02,960 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - 2 tasks will be restarted to recover the failed task 5abf87f4cc2fd607ed9659cb1647b0be_d40592faea9b13cc59503ebfb2b12986_0_1.
2024-03-27 17:41:02,961 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job mysql-doris-001 (2253a8207f30ab3ca7b4eb967900427c) switched from state RUNNING to RESTARTING.
2024-03-27 17:41:02,962 WARN org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 3 for job 2253a8207f30ab3ca7b4eb967900427c. (0 consecutive failed attempts so far)
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
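The `RUNNING to RESTARTING` line is what confirms the restart strategy took over. A minimal sketch for pulling job state transitions out of a JobManager log; the regex is written against the `ExecutionGraph` line shown above and is an assumption, so a different log pattern may need adjusting. `job_transitions` and `Transition` are hypothetical names.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Transition(NamedTuple):
    timestamp: str
    job_name: str
    job_id: str
    old_state: str
    new_state: str


# Matches lines like:
# 2024-03-27 17:41:02,961 INFO ... - Job mysql-doris-001 (2253...) switched from state RUNNING to RESTARTING.
_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?"
    r"Job (?P<name>.+?) \((?P<id>[0-9a-f]{32})\) switched from state (?P<old>\w+) to (?P<new>\w+)\."
)


def job_transitions(lines: Iterable[str]) -> Iterator[Transition]:
    """Yield one Transition per 'switched from state' log line; other lines are ignored."""
    for line in lines:
        m = _PATTERN.search(line)
        if m:
            yield Transition(m["ts"], m["name"], m["id"], m["old"], m["new"])
```

Counting the `RESTARTING` transitions per job over time gives a rough view of how many restart attempts have been spent.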
Question: when submitting a job from the command line, how do you specify the checkpoint to restore from?
First save the job state with a savepoint:
./flink savepoint 93d57e55922989282f13fbf1804f4052 /Users/dugang/work/flinksavepoint
This creates flinksavepoint/savepoint-93d57e-096b303dd391/_metadata.
Then set the path in flink-conf.yaml:
execution.savepoint.path: /Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391
So far this is the only approach that has worked. Setting it as a JVM parameter in flink-cdc.sh had no effect, with either form:
-Dexecution.savepoint.path=/Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391
-Dexecution.savepoint.path=file:///Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391
With the flink-conf.yaml setting, the log confirms the restore:
2024-03-28 06:39:42,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Starting job 2f03d8b451bd6c0a7a4db748923a321d from savepoint /Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391 ()
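Since editing flink-conf.yaml is the approach that worked here, the step can be scripted. A minimal sketch that sets (or replaces) the `execution.savepoint.path` line in the configuration text and removes it again; the helper names are hypothetical, and it assumes the flat `key: value` layout used in this post rather than nested YAML. The entry should be removed after the restore, otherwise every later submission using this configuration will try to start from the same savepoint.

```python
import re

KEY = "execution.savepoint.path"
_LINE = re.compile(rf"^{re.escape(KEY)}\s*:.*$", re.MULTILINE)


def with_savepoint_path(conf_text: str, path: str) -> str:
    """Return conf_text with execution.savepoint.path set to path (replacing an existing, uncommented entry)."""
    line = f"{KEY}: {path}"
    if _LINE.search(conf_text):
        # A function replacement avoids backslashes in the path being treated as escapes.
        return _LINE.sub(lambda _m: line, conf_text, count=1)
    sep = "" if conf_text.endswith("\n") or not conf_text else "\n"
    return f"{conf_text}{sep}{line}\n"


def without_savepoint_path(conf_text: str) -> str:
    """Drop the execution.savepoint.path line so later submissions start fresh."""
    return re.sub(rf"^{re.escape(KEY)}\s*:.*\n?", "", conf_text, flags=re.MULTILINE)
```

In use, read flink-conf.yaml, write back `with_savepoint_path(text, "/Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391")`, submit the job, then write back `without_savepoint_path(...)` once the "Starting job ... from savepoint" log line appears.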
Related question: when restarting a Flink CDC 3.0 job, how do you restore from a specific savepoint? See https://developer.aliyun.com/ask/608958 and https://developer.aliyun.com/ask/602807
