Temporal architecture deep dive
Sharding and Routing
External Workflow and Activity Implementations
In practical systems, you don't want to call tasks directly,
because there can be issues with flow control, availability, or slowness.
So using queues to dispatch tasks is a very common technique.
流控 可用性 慢速度
Every practical workflow engine uses queues to dispatch tasks to workers
(working processes that host those tasks).
need an external timer service or timer queue that durably stores and dispatches these timers.
start a workflow
create state in the database
create a timer
pull tasks from the task queue for the WorkflowDefinition to pick up
when it gives a list of commands we need to update and create tasks and update the state
Workflow as unit of scalability
Every workflow should be limited in size, but we can infinitely scale out the number of workflows
use partitions within the database,
over-allocate the number of partitions,
allocate those partitions to specific physical hosts, and move them around if necessary.
hash workflows to a specific shard id and use consistent hashing to allocate a shard to a specific host.
Membership and Routing
You don't want to have a fat clientside library that understands the topology of your cluster,
so you will need to have a frontend which will know the membership of the cluster and route requests accordingly.
move the queue into its a separate component with its own persistence.
Local Transfer Queues
every shard which stores workflow state also stores a queue
Every time we make an update to a shard we can also make an update to the queue because it lives in the same partition.
start a workflow
create a state for that workflow
create workflow tasks for the worker to pick up
add the task to the local queue of that shard
This will be committed to the database atomically
a thread pulls from that queue and transfers that message to the queuing subsystem.
the ability to list workflows
Going to all 10,000 shards and asking for this information, even with indexing, would be impractical.
The way to solve this is to have a separate indexing component.
the History component is responsible for state transitions of individual workflows
we have Transfer Queues to be able to transactionally create tasks
a Matching component responsible for delivering tasks for queuing and matching poll for requests coming from external workers
a Front-end component because we need routing
an ElasticSearch component for indexing
a Worker component for background jobs.
Workflows and Activities to be implemented by application developers using one of the Temporal SDKs.
Shards, history, and matching will be redistributed automatically.
Clustered databases like Cassandra can sustain node failures.
Elasticsearch is fault tolerant.
Frontends are stateless so you can add and remove them anytime
If you want to provide very high availability, we have multi-cluster deployment with asynchronous replication.
高可用 多集群 异步复制
prometheus irate 函数说明
temporal cassandra 阶梯压测记录
sh -s 用法