Scaling and Metrics
All metrics emitted by the Server are listed in Temporal's source code.
At a high level, you will want to track these 3 categories of metrics:
Service metrics
For each request made by the service handler, the Server emits service_requests, service_errors, and service_latency metrics with type, operation, and namespace tags.
This gives you basic visibility into service usage and allows you to look at request rates across services, namespaces and even operations.
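As a rough illustration of slicing these counters by tag, the sketch below turns two snapshots of service_requests into per-second rates grouped by namespace and operation. The snapshot layout (a dict keyed by tag values), the function name request_rates, and the sample tag values are assumptions made for this example; in practice your metrics backend would normally do this aggregation for you.

```python
from collections import defaultdict

# Order of the tag values inside each snapshot key (an assumption of this sketch).
TAG_NAMES = ("type", "operation", "namespace")


def request_rates(prev, curr, interval_seconds, group_by=("namespace", "operation")):
    """Per-second request rate between two service_requests snapshots.

    prev and curr map a (type, operation, namespace) tuple to a counter value.
    """
    rates = defaultdict(float)
    for tags, value in curr.items():
        before = prev.get(tags, 0)
        # A counter that went backwards was reset (for example by a restart),
        # so count everything observed since the reset.
        delta = value - before if value >= before else value
        key = tuple(tags[TAG_NAMES.index(name)] for name in group_by)
        rates[key] += delta / interval_seconds
    return dict(rates)


# Illustrative sample data, not real measurements.
prev = {
    ("frontend", "StartWorkflowExecution", "orders"): 100,
    ("frontend", "PollWorkflowTaskQueue", "orders"): 400,
    ("history", "StartWorkflowExecution", "orders"): 90,
}
curr = {
    ("frontend", "StartWorkflowExecution", "orders"): 160,
    ("frontend", "PollWorkflowTaskQueue", "orders"): 1000,
    ("history", "StartWorkflowExecution", "orders"): 150,
}
rates = request_rates(prev, curr, interval_seconds=60)
```

Passing group_by=("type",) instead would give the rate per service, which is one way to compare load across services.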
Persistence metrics
The Server emits persistence_requests, persistence_errors, and persistence_latency metrics for each persistence operation.
These metrics include the operation tag, so you can get the request rates, error rates, or latencies per operation.
These are especially useful for identifying issues caused by the database.
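One way to use the operation tag is to compare persistence_errors against persistence_requests per operation and flag the operations that fail unusually often. The following is a minimal sketch under assumptions: the counts are already collected into plain dicts keyed by operation, the 1% threshold is arbitrary, and the function name and sample operation names are for illustration only.

```python
def noisy_persistence_operations(requests, errors, max_error_ratio=0.01):
    """Return {operation: error_ratio} for operations above max_error_ratio.

    requests and errors map an operation tag value to a counter value.
    """
    flagged = {}
    for operation, total in requests.items():
        if total == 0:
            # No traffic for this operation, so no ratio can be computed.
            continue
        ratio = errors.get(operation, 0) / total
        if ratio > max_error_ratio:
            flagged[operation] = ratio
    return flagged


# Illustrative sample data, not real measurements.
requests = {
    "GetWorkflowExecution": 2000,
    "UpdateWorkflowExecution": 1000,
    "CreateTask": 0,
}
errors = {"UpdateWorkflowExecution": 50, "GetWorkflowExecution": 2}
flagged = noisy_persistence_operations(requests, errors)
```

Here only UpdateWorkflowExecution is flagged, since 50 errors out of 1000 requests exceeds the assumed 1% threshold.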
Workflow Execution stats
The Server also emits counters when Workflow Executions complete.
These are useful in getting overall stats about Workflow Execution completions.
Use workflow_success, workflow_failed, workflow_timeout, workflow_terminate and workflow_cancel counters for each type of Workflow Execution completion.
These include the namespace tag.
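To get an overall picture from these counters, you could compute what share of completed Workflow Executions ended in each state for a given namespace. The sketch below assumes the five counter values for one namespace have already been read into a dict; the function name completion_breakdown and the sample numbers are hypothetical.

```python
COMPLETION_COUNTERS = (
    "workflow_success",
    "workflow_failed",
    "workflow_timeout",
    "workflow_terminate",
    "workflow_cancel",
)


def completion_breakdown(counters):
    """Fraction of completions per completion type for one namespace."""
    total = sum(counters.get(name, 0) for name in COMPLETION_COUNTERS)
    if total == 0:
        # Nothing has completed yet, so report zero for every type.
        return {name: 0.0 for name in COMPLETION_COUNTERS}
    return {name: counters.get(name, 0) / total for name in COMPLETION_COUNTERS}


# Illustrative counter values for a single namespace.
orders = {
    "workflow_success": 90,
    "workflow_failed": 5,
    "workflow_timeout": 3,
    "workflow_terminate": 1,
    "workflow_cancel": 1,
}
breakdown = completion_breakdown(orders)
```

A falling workflow_success share, or a rising workflow_timeout share, is the kind of trend this breakdown makes easy to spot.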
Checklist for Scaling Temporal
Some common bottlenecks to look out for:
The vast majority of the time the database will be the bottleneck.
We highly recommend setting alerts on schedule_to_start_latency to look out for this.
Also check if your database connection is getting saturated.
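An alert on schedule_to_start_latency usually boils down to comparing a high percentile against a threshold. The sketch below shows that check with a nearest-rank percentile over raw samples. The 200 ms threshold, the choice of the 95th percentile, and the function names are assumptions for illustration; pick values that match your own workload, and prefer your monitoring system's alert rules over hand-rolled code in production.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of samples; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def should_alert(latencies_ms, threshold_ms=200.0, pct=95):
    """True when the chosen percentile of the latency samples exceeds the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms


# Illustrative samples in milliseconds: one slow outlier versus a sustained slowdown.
one_outlier = [10] * 19 + [500]
sustained = [10] * 18 + [500, 500]
alert_on_outlier = should_alert(one_outlier)
alert_on_sustained = should_alert(sustained)
```

Using a percentile rather than the maximum keeps a single slow task from paging anyone, while a sustained rise still trips the alert.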
The next layer will be scaling the 4 internal services of Temporal (Frontend, Matching, History, and Worker).
Monitor each accordingly.
The Frontend service is more CPU bound, whereas the History and Matching services require more memory.
If you need more instances of each service, spin them up separately with different command line arguments.
You can learn more by cross-referencing our Helm chart with our Server Configuration reference.
See the Server Limits section below for other limits you will want to keep in mind when doing system design, including event history length.