Temporal monitoring metrics
https://docs.temporal.io/docs/server/production-deployment/
Scaling and Metrics
temporal_activity_schedule_to_start_latency histogram
temporal_workflow_task_schedule_to_start_latency histogram
All metrics emitted by the server are listed in Temporal's source:
https://github.com/temporalio/temporal/blob/master/common/metrics/defs.go
At a high level, you will want to track these 3 categories of metrics:
Service metrics
For each request made by the service handler we emit service_requests, service_errors, and service_latency metrics with type, operation, and namespace tags.
This gives you basic visibility into service usage and allows you to look at request rates across services, namespaces and even operations.
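As a rough illustration of how these tags can be used, the sketch below computes an error rate per namespace and operation from counter increases collected over the same time window. The sample layout (a tags dict paired with a count) and the function name `error_rates` are assumptions made for this example, not part of Temporal or of any metrics client.

```python
from collections import defaultdict


def error_rates(requests, errors):
    """Return {(namespace, operation): error_rate} for one time window.

    `requests` and `errors` are iterables of (tags, count) pairs, where
    `tags` is a dict holding at least "namespace" and "operation".
    This input shape is hypothetical and chosen only for the example.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [request_count, error_count]
    for tags, count in requests:
        totals[(tags["namespace"], tags["operation"])][0] += count
    for tags, count in errors:
        totals[(tags["namespace"], tags["operation"])][1] += count
    # Avoid dividing by zero when errors were reported without requests.
    return {
        key: (err / req if req else 0.0)
        for key, (req, err) in totals.items()
    }


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    reqs = [({"namespace": "default", "operation": "StartWorkflowExecution"}, 200)]
    errs = [({"namespace": "default", "operation": "StartWorkflowExecution"}, 5)]
    for key, rate in error_rates(reqs, errs).items():
        print(key, rate)
```

Grouping by both tags keeps a noisy namespace from hiding behind a healthy service-wide average.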
Persistence metrics
The Server emits persistence_requests, persistence_errors and persistence_latency metrics for each persistence operation.
These metrics include the operation tag so that you can get the request rates, error rates or latencies per operation.
These are especially useful in identifying issues caused by the database.
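One way to use the per-operation latencies is to rank operations by mean latency and look at the slowest first. The sketch below assumes the latency metric is available as a per-operation sum (in seconds) and sample count, which is how histogram backends such as Prometheus typically expose it. The function name `slowest_operations` and the operation names in the usage example are illustrative only.

```python
def slowest_operations(latency_sum, latency_count, top=3):
    """Rank persistence operations by mean latency, slowest first.

    `latency_sum` maps operation -> total observed seconds and
    `latency_count` maps operation -> number of observations, both
    for the same time window. This input shape is an assumption.
    """
    means = {
        op: latency_sum[op] / n
        for op, n in latency_count.items()
        if n > 0 and op in latency_sum
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    sums = {"UpdateWorkflowExecution": 12.0, "GetWorkflowExecution": 1.5}
    counts = {"UpdateWorkflowExecution": 300, "GetWorkflowExecution": 500}
    for op, mean in slowest_operations(sums, counts):
        print(f"{op}: {mean * 1000:.1f} ms")
```

A mean hides tail latency, so this is only a first pass. Quantiles computed from the histogram buckets give a better picture for alerting.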
Workflow Execution stats
The Server also emits counters for when Workflow Executions are complete.
These are useful in getting overall stats about Workflow Execution completions.
Use workflow_success, workflow_failed, workflow_timeout, workflow_terminate and workflow_cancel counters for each type of Workflow Execution completion.
These include the namespace tag.
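To make the completion counters easier to read, they can be turned into shares of all completions for a namespace. The following is a minimal sketch. It assumes the counter increases for one namespace and one time window have already been collected into a dict, and the function name `completion_shares` is hypothetical.

```python
COMPLETION_COUNTERS = (
    "workflow_success",
    "workflow_failed",
    "workflow_timeout",
    "workflow_terminate",
    "workflow_cancel",
)


def completion_shares(counts):
    """Return each completion type's share of all completions.

    `counts` maps counter name -> increase over the window for a single
    namespace. Missing counters are treated as zero. An empty dict is
    returned when nothing completed in the window.
    """
    total = sum(counts.get(name, 0) for name in COMPLETION_COUNTERS)
    if total == 0:
        return {}
    return {name: counts.get(name, 0) / total for name in COMPLETION_COUNTERS}


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    window = {"workflow_success": 90, "workflow_failed": 6, "workflow_timeout": 4}
    for name, share in completion_shares(window).items():
        print(f"{name}: {share:.1%}")
```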
Checklist for Scaling Temporal
Some common bottlenecks:
Database
The vast majority of the time the database will be the bottleneck.
We highly recommend setting alerts on schedule_to_start_latency to look out for this.
Also check if your database connection is getting saturated.
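Since the schedule_to_start latency metrics are histograms, an alert usually compares a quantile against a threshold. The sketch below estimates a quantile from cumulative histogram buckets using linear interpolation within the matching bucket, similar in spirit to what Prometheus' `histogram_quantile` does. The bucket layout, the function names and the 0.5 second threshold are assumptions for illustration, not values recommended by Temporal.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound, whose last bound is math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # The +Inf bucket has no upper edge, and an empty bucket has
            # nothing to interpolate over: fall back to the lower edge.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def should_alert(buckets, q=0.95, threshold_seconds=0.5):
    """Return True when the estimated quantile exceeds the threshold."""
    value = histogram_quantile(q, buckets)
    return not math.isnan(value) and value > threshold_seconds


if __name__ == "__main__":
    # Illustrative buckets only, not real measurements.
    sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, sample), should_alert(sample))
```

The estimate is only as precise as the bucket boundaries, so the threshold should sit near a bucket edge where possible.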
Internal services
The next layer will be scaling the 4 internal services of Temporal (Frontend, Matching, History, and Worker). Monitor each accordingly.
The Frontend service is more CPU bound, whereas the History and Matching services require more memory.
If you need more instances of each service, spin them up separately with different command line arguments.
You can learn more by cross-referencing our Helm chart with our Server Configuration reference.
See the Server Limits section of the production deployment documentation linked above for other limits you will want to keep in mind when doing system design, including event history length.
https://docs.temporal.io/docs/operation/how-to-tune-workers/