Flink CDC 3.0: checkpoint and restart strategy configuration and testing

Category: flink
Flink 1.18.1
Flink CDC 3.0

flink-conf.yaml configuration

state.backend.type: filesystem
execution.checkpointing.interval: 3min
state.checkpoints.dir: file:///Users/dugang/work/test/flink_state/checkpoints
state.savepoints.dir: file:///Users/dugang/work/test/flink_state/savepoints
state.backend.incremental: false
# no unit given, so Flink reads this duration as 1000 ms
execution.checkpointing.min-pause: 1000
execution.checkpointing.timeout: 60s
execution.checkpointing.max-concurrent-checkpoints: 500
execution.checkpointing.tolerable-failed-checkpoints: 10
# Keep the checkpoint when the job is cancelled from the web console
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION



restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 100
restart-strategy.fixed-delay.delay: 60s
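The two fixed-delay values together decide how long a broken job keeps retrying before it is marked as failed. The arithmetic can be sketched as below; `total_restart_delay` is a hypothetical helper written for this note, not part of Flink, and it counts only the waits between restarts, not the time each attempt runs before failing again.

```python
from datetime import timedelta


def total_restart_delay(attempts: int, delay: timedelta) -> timedelta:
    """Sum of the waits between restarts for a fixed-delay strategy.

    This is only the waiting time; the time each attempt spends running
    before it fails again comes on top.
    """
    return attempts * delay


# Values taken from the flink-conf.yaml above.
budget = total_restart_delay(attempts=100, delay=timedelta(seconds=60))
print(budget)  # 1:40:00
```

With these settings there are at least 100 minutes to fix the external problem before the restart attempts run out.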




Checkpoint storage directory
2253a8207f30ab3ca7b4eb967900427c
    chk-12
       _metadata

Each job gets its own directory, named after the job ID.
chk-12 is the 12th checkpoint.
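To find the most recent checkpoint for a job without browsing the directory by hand, the layout above can be scanned with a few lines of Python. This is a minimal sketch: `latest_checkpoint` is a hypothetical helper, and it assumes that only `chk-N` directories containing a `_metadata` file are usable for a restore.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(job_dir: Path) -> Optional[Path]:
    """Return the chk-N directory with the highest N that contains _metadata."""
    if not job_dir.is_dir():
        return None
    candidates = []
    for child in job_dir.iterdir():
        name = child.name
        if not (child.is_dir() and name.startswith("chk-")):
            continue
        suffix = name[len("chk-"):]
        # Skip anything that is not chk-<number>, and checkpoints without metadata.
        if suffix.isdigit() and (child / "_metadata").is_file():
            candidates.append((int(suffix), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Checkpoint root and job ID as used in this article.
root = Path("/Users/dugang/work/test/flink_state/checkpoints")
print(latest_checkpoint(root / "2253a8207f30ab3ca7b4eb967900427c"))
```

On a machine where the directory does not exist, the function returns None instead of raising.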


Flink web console

Overview
Shows the number of records and bytes moved by the source and sink.

Status
Bytes Received
Records Received
Bytes Sent
Records Sent
Parallelism
Start Time 

Exceptions: shows error messages
Checkpoints: shows checkpoint information
  Overview 
  History 
  Summary 
  Configuration
    Checkpointing Mode	Exactly Once
    Checkpoint Storage	FileSystemCheckpointStorage
    State Backend	HashMapStateBackend
    Interval	3m 0s

Configuration
Restart strategy
Restart with fixed delay (60000 ms). #100 restart attempts.

TimeLine



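Test scenario
Rename the target table in Doris. Flink CDC then fails while syncing, and the restart strategy kicks in.
Rename the table back to its original name and the job recovers; later changes sync normally.

ALTER TABLE t1 RENAME t1_001;
ALTER TABLE t1_001 RENAME t1;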



Key log messages

Caused by: org.apache.doris.flink.exception.DorisBatchLoadException: stream load error: [ANALYSIS_ERROR]TStatus: errCode = 7, detailMessage = unknown table, tableName=t1, see more in null

2024-03-27 17:41:02,960 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - 2 tasks will be restarted to recover the failed task 5abf87f4cc2fd607ed9659cb1647b0be_d40592faea9b13cc59503ebfb2b12986_0_1.
2024-03-27 17:41:02,961 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job mysql-doris-001 (2253a8207f30ab3ca7b4eb967900427c) switched from state RUNNING to RESTARTING.
2024-03-27 17:41:02,962 WARN  org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 3 for job 2253a8207f30ab3ca7b4eb967900427c. (0 consecutive failed attempts so far)
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.



Question
When submitting a job from the command line, how do you specify which checkpoint to restore from?

First save the job state as a savepoint:

./flink savepoint 93d57e55922989282f13fbf1804f4052  /Users/dugang/work/flinksavepoint 
flinksavepoint
    savepoint-93d57e-096b303dd391
        _metadata


Then set the savepoint path in flink-conf.yaml:
execution.savepoint.path: /Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391

So far this is the only approach that takes effect.
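Since the setting has to live in flink-conf.yaml, editing the file by hand before every restore is easy to get wrong. A minimal sketch of automating that edit is shown below; `set_conf_value` is a hypothetical helper, and it assumes the flat `key: value` layout used in this article rather than nested YAML.

```python
KEY = "execution.savepoint.path"


def set_conf_value(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with `key: value` set, replacing an existing entry."""
    new_line = f"{key}: {value}"
    lines = conf_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # leave comments untouched
        if stripped.split(":", 1)[0].strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


conf = "execution.checkpointing.interval: 3min\n"
savepoint = "/Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391"
print(set_conf_value(conf, KEY, savepoint), end="")
```

To apply it, read flink-conf.yaml, pass its text through the function, and write the result back before submitting the job. The entry probably needs to be removed again after the restore, on the assumption that every job submitted with this configuration would otherwise start from the same savepoint.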


Setting it as a JVM parameter in flink-cdc.sh (no effect; both forms below were tried)
-Dexecution.savepoint.path=/Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391
-Dexecution.savepoint.path=file:///Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391


Log output when the restore succeeds:
2024-03-28 06:39:42,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Starting job 2f03d8b451bd6c0a7a4db748923a321d from savepoint /Users/dugang/work/flinksavepoint/savepoint-93d57e-096b303dd391 ()





Flink CDC 3.0: when restarting a job, how do you restore from a specific savepoint?
https://developer.aliyun.com/ask/608958
https://developer.aliyun.com/ask/602807