Temporal monitoring metrics

https://docs.temporal.io/docs/server/production-deployment/
Scaling and Metrics

temporal_activity_schedule_to_start_latency histogram
temporal_workflow_task_schedule_to_start_latency histogram

All metrics emitted by the server are listed in Temporal's source:
https://github.com/temporalio/temporal/blob/master/common/metrics/defs.go

At a high level, you will want to track these 3 categories of metrics:

Service metrics
For each request handled by a service handler, the Server emits service_requests, service_errors, and service_latency metrics with type, operation, and namespace tags.
This gives you basic visibility into service usage and allows you to look at request rates across services, namespaces and even operations.
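As a rough illustration of how these tags can be used, here is a minimal Python sketch that computes an error rate per (namespace, operation) pair. The sample format, the function name error_rates, and the numbers are assumptions made for this example; in practice the values would come from your metrics backend (for instance, counter increases over a time window).

```python
from collections import defaultdict


def error_rates(request_samples, error_samples):
    """Return errors / requests for each (namespace, operation) pair.

    Each sample is a (tags, value) tuple. `tags` is a dict carrying the
    'namespace' and 'operation' tags; `value` is the counter increase over
    the window being inspected. This input shape is hypothetical.
    """
    requests = defaultdict(float)
    errors = defaultdict(float)
    for tags, value in request_samples:
        requests[(tags["namespace"], tags["operation"])] += value
    for tags, value in error_samples:
        errors[(tags["namespace"], tags["operation"])] += value
    # Pairs with zero requests are skipped to avoid dividing by zero.
    return {
        key: errors.get(key, 0.0) / total
        for key, total in requests.items()
        if total > 0
    }


# Made-up sample data standing in for service_requests / service_errors.
service_requests = [
    ({"type": "frontend", "namespace": "default",
      "operation": "StartWorkflowExecution"}, 200.0),
    ({"type": "frontend", "namespace": "default",
      "operation": "PollActivityTaskQueue"}, 50.0),
]
service_errors = [
    ({"type": "frontend", "namespace": "default",
      "operation": "StartWorkflowExecution"}, 4.0),
]

rates = error_rates(service_requests, service_errors)
for (namespace, operation), rate in sorted(rates.items()):
    print(f"{namespace} {operation}: {rate:.2%}")
```

The same grouping works for persistence_requests and persistence_errors if you key on the operation tag alone.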

Persistence metrics
The Server emits persistence_requests, persistence_errors and persistence_latency metrics for each persistence operation.
These metrics include the operation tag, so you can get request rates, error rates, or latencies per operation.
They are especially useful for identifying issues caused by the database.

Workflow Execution stats
The Server also emits counters when Workflow Executions complete.
These are useful in getting overall stats about Workflow Execution completions.
Use workflow_success, workflow_failed, workflow_timeout, workflow_terminate and workflow_cancel counters for each type of Workflow Execution completion.
These include the namespace tag.
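To show how the completion counters might be turned into an overall picture, the following minimal Python sketch computes the share of each completion type for one namespace. The counter names come from the text above; the function name completion_breakdown, the input dict, and the numbers are assumptions for illustration.

```python
# Counter names as listed in the text above.
COMPLETION_COUNTERS = (
    "workflow_success",
    "workflow_failed",
    "workflow_timeout",
    "workflow_terminate",
    "workflow_cancel",
)


def completion_breakdown(counts):
    """Return the fraction of completions per outcome for one namespace.

    `counts` maps counter name -> number of completions in the window.
    Counters missing from the dict are treated as zero.
    """
    total = sum(counts.get(name, 0) for name in COMPLETION_COUNTERS)
    if total == 0:
        # No completions in the window: report zero for every outcome.
        return {name: 0.0 for name in COMPLETION_COUNTERS}
    return {name: counts.get(name, 0) / total for name in COMPLETION_COUNTERS}


# Made-up counts for a single namespace.
sample_counts = {
    "workflow_success": 90,
    "workflow_failed": 6,
    "workflow_timeout": 3,
    "workflow_terminate": 1,
}

breakdown = completion_breakdown(sample_counts)
for name, share in breakdown.items():
    print(f"{name}: {share:.1%}")
```

A rising share of workflow_failed or workflow_timeout in a given namespace is usually a more useful signal than the raw counts.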

Checklist for Scaling Temporal

Some common bottlenecks:

Database
In the vast majority of cases, the database is the bottleneck.
We highly recommend setting alerts on schedule_to_start_latency to watch for this.
Also check whether your database connections are getting saturated.
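An alert on schedule_to_start_latency usually compares a high percentile of the histogram against a threshold. The minimal Python sketch below estimates a quantile from cumulative histogram buckets using linear interpolation inside the matching bucket, which is similar in spirit to how Prometheus estimates quantiles. The bucket bounds, the counts, and the 0.5 second threshold are assumptions for this example, not recommended values.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) tuples
    sorted by upper bound; the last bound may be float('inf').
    Returns None when the histogram is empty.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Cannot interpolate in the open-ended bucket; fall back
                # to the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation inside the bucket containing the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative buckets for a schedule-to-start latency histogram.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (float("inf"), 100)]

ALERT_THRESHOLD_SECONDS = 0.5  # hypothetical threshold, tune for your system

p95 = histogram_quantile(0.95, latency_buckets)
should_alert = p95 is not None and p95 > ALERT_THRESHOLD_SECONDS
print(f"p95 schedule-to-start latency: {p95:.3f}s, alert: {should_alert}")
```

In a real deployment this logic would normally live in the alerting rules of your monitoring system rather than in application code; the sketch only shows the arithmetic behind such a rule.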

Internal services
The next layer is scaling Temporal's 4 internal services (Frontend, Matching, History, and Worker). Monitor each accordingly.
The Frontend service is more CPU bound, whereas the History and Matching services require more memory.
If you need more instances of each service, spin them up separately with different command line arguments.
You can learn more by cross-referencing our Helm chart with our Server Configuration reference.

See the Server Limits section of the production deployment documentation linked above for other limits to keep in mind when doing system design, including event history length.

https://docs.temporal.io/docs/operation/how-to-tune-workers/
