Envoy Metrics#
Envoy Metrics Overview#
One of the main goals of Envoy is to make the network easy to understand. Envoy generates a large number of statistics, depending on how it is configured. In general, statistics (metrics) fall into three categories:
Downstream: Downstream metrics relate to incoming connections/requests. They are generated by the listener, the HTTP connection manager (HCM), the TCP proxy filter, and so on.
Upstream: Upstream metrics relate to outgoing connections/requests. They are generated by the connection pool, the router filter, the TCP proxy filter, and so on.
Server: Server metrics describe the operation of the Envoy server instance, for example server uptime or the amount of memory allocated.
In the simplest scenario, a single Envoy proxy typically produces both Downstream and Upstream statistics. Together they reflect the operation of the network node from which they are taken. Statistics from the entire mesh provide very detailed information about the health of each network node and of the network as a whole. Envoy’s documentation has brief descriptions of these metrics.
Tag#
Envoy’s metrics also support a sub-concept: tags, also called dimensions. A tag is equivalent to a label on a Prometheus metric and can be understood as a categorical dimension.
Envoy’s metrics are identified by canonical strings. The dynamic parts (substrings) of these strings are extracted as tags. The extraction can be customized by specifying tag extraction rules (the Tag Specifier configuration).
As an example:
### 1. The original Envoy metrics ###
$ kubectl exec fortio-server -c istio-proxy -- curl 'localhost:15000/stats'
### Returns:
cluster.outbound|8080||fortio-server-l2.mark.svc.cluster.local.external.upstream_rq_2xx: 300
# where:
# - The `outbound|8080||fortio-server-l2.mark.svc.cluster.local` part is the name of the upstream cluster. It can be extracted as a tag.
# - The `2xx` part is the HTTP Status Code category. This can be extracted as a tag. The configuration of this extraction rule is described below.
### 2. Metrics for Prometheus ###
$ kubectl exec fortio-server -c istio-proxy -- curl 'localhost:15000/stats?format=prometheus' | grep 'outbound|8080||fortio-server-l2' | grep 'external_upstream_rq'
### Returns:
envoy_cluster_external_upstream_rq{response_code_class="2xx",cluster_name="outbound|8080||fortio-server-l2.mark.svc.cluster.local"} 300
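To make the tag-to-label mapping concrete, the following Python sketch splits one line of the Prometheus-format output into its metric name, labels, and value. It is only an illustration: `parse_prometheus_line` is a hypothetical helper, and the regular expressions assume label values contain no escaped quotes.

```python
import re

# Assumption: a simplified view of the exposition format, `name{labels} value`.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(?P<key>[A-Za-z_][A-Za-z0-9_]*)="(?P<val>[^"]*)"')


def parse_prometheus_line(line):
    """Split one exposition line into (metric name, labels dict, value)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    labels = {
        lm.group("key"): lm.group("val")
        for lm in LABEL_RE.finditer(m.group("labels"))
    }
    return m.group("name"), labels, float(m.group("value"))


sample = (
    'envoy_cluster_external_upstream_rq{response_code_class="2xx",'
    'cluster_name="outbound|8080||fortio-server-l2.mark.svc.cluster.local"} 300'
)
name, labels, value = parse_prometheus_line(sample)
print(name)    # metric name with the tags removed
print(labels)  # the extracted tags, exposed as Prometheus labels
```

The dynamic parts of the original Envoy stat name (the cluster name and the status code class) end up as labels, while the remaining static part becomes the metric name.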
Metrics data types#
Envoy emits three types of values as statistics:
Counters: unsigned integers that only increase, not decrease. For example, Total Requests.
Gauges: unsigned integers that can both increase and decrease. For example, currently active requests.
Histograms: unsigned integers that are part of a stream of values, which the collector aggregates to eventually produce summarized percentiles (the usual P99/P50/Pxx). For example, Upstream response time.
In Envoy’s internal implementation, Counters and Gauges are batched and flushed periodically to improve performance. Histograms are written as they are received.
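How a consumer typically uses these value types can be sketched in Python. This is an illustration of the concepts only, not of Envoy's internal implementation: `counter_rate` and `percentile` are hypothetical helpers, and the nearest-rank percentile is just one of several common definitions.

```python
import math


def counter_rate(prev, curr, interval_s):
    """Per-second rate from two readings of a counter.

    A counter only increases, so a lower reading is taken to mean the
    process restarted and the counter began again from zero.
    """
    delta = curr - prev if curr >= prev else curr
    return delta / interval_s


def percentile(samples, p):
    """Nearest-rank percentile of raw histogram samples, with p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds; one slow outlier.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 13, 15]
print(counter_rate(300, 450, 5))     # requests per second between two scrapes
print(percentile(latencies_ms, 50))  # median
print(percentile(latencies_ms, 99))  # tail latency, dominated by the outlier
```

A gauge needs no such post-processing: its current reading is already the quantity of interest.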
Metrics Interpretation#
Metrics can be categorized by where they are produced:
cluster manager: L3/L4/L7 metrics for upstream.
http connection manager (HCM): L7 metrics for upstream & downstream.
listeners: L3/L4 metrics for downstream.
server (global)
watch dog
Below I’ve selected only some of the key performance metrics to briefly explain.
cluster manager#
Envoy documentation: cluster manager stats
The documentation above already goes into some detail, so I’ll just add a few aspects to focus on when performance tuning. Which metrics deserve attention in general?
Let’s analyze this using the well-known Utilization, Saturation, and Errors (USE) methodology.
Utilization:
upstream_cx_total (Counter): total number of connections
upstream_rq_active
Saturation:
upstream_rq_time (Histogram): response latency
upstream_cx_connect_ms (Histogram)
upstream_cx_rx_bytes_buffered
upstream_cx_tx_bytes_buffered
upstream_rq_pending_total (Counter)
upstream_rq_pending_active (Gauge)
circuit_breakers.*cx_open
circuit_breakers.*cx_pool_open
circuit_breakers.*rq_pending_open
circuit_breakers.*rq_open
circuit_breakers.*rq_retry_open
Error:
upstream_cx_connect_fail (Counter): number of connection failures
upstream_cx_connect_timeout (Counter): number of connection timeouts
upstream_cx_overflow (Counter): total number of times the cluster’s connection circuit breaker overflowed
upstream_cx_pool_overflow
upstream_cx_destroy_local_with_active_rq
upstream_cx_destroy_remote_with_active_rq
upstream_rq_timeout
upstream_rq_retry
upstream_rq_rx_reset
upstream_rq_tx_reset
upstream_rq_pending_overflow (Counter): total number of requests that overflowed the connection pool, or (mainly for HTTP/2 and above) were failed by circuit breaking
Other:
upstream_rq_total (Counter): total number of requests; its rate gives TPS (throughput)
upstream_cx_destroy_local (Counter): number of connections closed locally by Envoy
upstream_cx_destroy_remote (Counter): number of connections closed by the remote peer
upstream_cx_length_ms (Histogram)
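The USE grouping above can be applied mechanically to the text returned by the admin `/stats` endpoint. The Python sketch below is a minimal illustration: `group_by_use` is a hypothetical helper, only a subset of the metrics is listed, and the sample values and the cluster name `foo` are made up.

```python
# Subset of the cluster manager metrics, keyed by USE category.
USE_SUFFIXES = {
    "utilization": ("upstream_cx_total", "upstream_rq_active"),
    "saturation": ("upstream_rq_pending_total", "upstream_rq_pending_active"),
    "errors": (
        "upstream_cx_connect_fail",
        "upstream_cx_connect_timeout",
        "upstream_cx_overflow",
        "upstream_rq_timeout",
        "upstream_rq_pending_overflow",
    ),
}


def group_by_use(stats_text):
    """Group `name: value` lines from the /stats output by USE category."""
    grouped = {category: {} for category in USE_SUFFIXES}
    for line in stats_text.splitlines():
        name, sep, raw = line.rpartition(": ")
        if not sep or not raw.strip().isdigit():
            continue  # skip histogram summaries and malformed lines
        for category, suffixes in USE_SUFFIXES.items():
            if name.endswith(tuple("." + s for s in suffixes)):
                grouped[category][name] = int(raw)
    return grouped


# Hypothetical sample; real output contains many more lines.
sample_stats = """\
cluster.foo.upstream_cx_total: 12
cluster.foo.upstream_rq_active: 3
cluster.foo.upstream_rq_pending_active: 0
cluster.foo.upstream_cx_connect_fail: 2
cluster.foo.upstream_rq_time: P0(nan,0) P25(nan,1.05)
"""
for category, values in group_by_use(sample_stats).items():
    print(category, values)
```

Matching on the suffix after the last dot keeps the sketch independent of the cluster name, which may itself contain dots and pipes.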
http connection manager(HCM)#
Envoy docs: http connection manager (HCM) stats
These can be thought of as L7 metrics for downstream and, in part, upstream.
Utilization:
downstream_cx_total
downstream_cx_active
downstream_cx_http1_active
downstream_rq_total
downstream_rq_http1_total
downstream_rq_active
Saturation:
downstream_cx_rx_bytes_buffered
downstream_cx_tx_bytes_buffered
downstream_flow_control_paused_reading_total
downstream_flow_control_resumed_reading_total
Error:
downstream_cx_destroy_local_active_rq
downstream_cx_destroy_remote_active_rq
downstream_rq_rx_reset
downstream_rq_tx_reset
downstream_rq_too_large
downstream_rq_max_duration_reached
downstream_rq_timeout
downstream_rq_overload_close
rs_too_large
Others:
downstream_cx_destroy_remote
downstream_cx_destroy_local
downstream_cx_length_ms
listeners#
These can be regarded as L3/L4 metrics for downstream.
Utilization:
downstream_cx_total
downstream_cx_active
Saturation:
downstream_pre_cx_active
Error:
downstream_cx_transport_socket_connect_timeout
downstream_cx_overflow
no_filter_chain_match
downstream_listener_filter_error
no_certificate
Others:
downstream_cx_length_ms
server#
Envoy basic info metrics
Utilization:
concurrency
Error:
days_until_first_cert_expiring
watch dog#
Envoy also includes a configurable watchdog system that adds statistics and optionally terminates the server when Envoy is not responding. The system has two separate watchdog configurations, one for the main thread and one for the worker threads, because different threads have different workloads. These statistics help to understand, at a high level, whether Envoy’s event loop is unresponsive because it is doing too much work, is blocked, or is not being scheduled by the operating system.
Saturation:
watchdog_mega_miss (Counter): number of mega misses
watchdog_miss (Counter): number of misses
If you are interested in the watchdog mechanism, see:
https://github.com/envoyproxy/envoy/issues/11391
https://github.com/envoyproxy/envoy/issues/11388
Event loop#
Envoy documentation: Event loop
The Envoy architecture is designed to optimize scalability and resource utilization by running the event loop on a small number of threads. The "main" thread is responsible for control plane processing, and each "worker" thread shares a portion of the data plane tasks. Envoy exposes two statistics to monitor the performance of all these threaded event loops.
Loop duration: each iteration of the event loop executes a number of tasks, and that number varies with the load. However, if one or more threads show an unusually long tail in loop duration, there may be a performance problem. For example, work may be unevenly distributed between worker threads, or a long blocking operation in a plugin may be impeding progress.
Polling delay: in each iteration of the event loop, the event scheduler polls for I/O events and “wakes up” the thread when some I/O event is ready or a timeout occurs, whichever comes first. In the case of a timeout, we can measure the difference between the expected wakeup time and the actual wakeup time after polling; this difference is called the polling delay. It is normal to see a small polling delay, usually equal to the kernel scheduler’s time slice or quantum, depending on the operating system running Envoy. If this number rises significantly above its normally observed baseline, it indicates that the kernel scheduler may be experiencing delays.
These statistics are enabled by setting [enable_dispatcher_stats](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/bootstrap/v3/bootstrap.proto#envoy-v3-api-field-config-bootstrap-v3-bootstrap-enable-dispatcher-stats) to true.
The event scheduler for the main thread has a statistics tree rooted at `server.dispatcher.`
The event scheduler for each worker thread has a statistics tree rooted at `listener_manager.worker_<id>.dispatcher.`
Each tree has the following statistics:
| Name | Type | Description |
|---|---|---|
| loop_duration_us | Histogram | Event loop duration in microseconds |
| poll_delay_us | Histogram | Polling delay in microseconds |
Note that this does not include any auxiliary (non-main and worker) threads.
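When reading these statistics, it helps to know which thread a given stat belongs to. The Python sketch below derives that from the two roots described above; `dispatcher_owner` is a hypothetical helper written for illustration.

```python
import re

# Roots of the two statistics trees described above.
MAIN_PREFIX = "server.dispatcher."
WORKER_RE = re.compile(r"^listener_manager\.worker_(\d+)\.dispatcher\.(.+)$")


def dispatcher_owner(stat_name):
    """Return (thread, stat) for a dispatcher stat name, or None otherwise."""
    if stat_name.startswith(MAIN_PREFIX):
        return "main", stat_name[len(MAIN_PREFIX):]
    m = WORKER_RE.match(stat_name)
    if m:
        return f"worker_{m.group(1)}", m.group(2)
    return None  # not a dispatcher statistic


print(dispatcher_owner("server.dispatcher.loop_duration_us"))
print(dispatcher_owner("listener_manager.worker_3.dispatcher.poll_delay_us"))
print(dispatcher_owner("server.uptime"))
```

Grouping `loop_duration_us` by worker in this way is one means of spotting the uneven distribution of work between worker threads mentioned earlier.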
Hint
Watch Dog and Event loop are both tools for detecting and monitoring delays in event processing. There are a lot of details and stories here, reaching all the way down to the Linux kernel. Hopefully there will be time later in the book to explore and analyze these interesting details with you.
Configuration#
Hint
If you read the introduction to this book What this book is not, it says it’s not a “user’s manual”, so why is it talking about configuration? Well, all I can say is that it’s better to start with understanding how to use it, and then learn how to implement it, than to come straight to the source code.
This section references: Envoy documentation
config.bootstrap.v3.Bootstrap#
Envoy docs: config.bootstrap.v3.Bootstrap proto
{
"node": {...},
"static_resources": {...},
"dynamic_resources": {...},
"cluster_manager": {...},
"stats_sinks": [],
"stats_config": {...},
"stats_flush_interval": {...},
"stats_flush_on_admin": ...,
...
}
Hint
What is a stats sink? This book does not explain it. Istio does not customize this configuration by default. The following covers only the parts of the configuration that concern us.
stats_config ([config.metrics.v3.StatsConfig](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-statsconfig)) Configuration for internal processing of statistics.
stats_flush_interval (Duration) Interval at which to flush statistics to the stats sink. For performance reasons, Envoy does not flush counters in real time; counters and gauges are only flushed periodically. If not specified, the default value is 5000 milliseconds. Only one of stats_flush_interval or stats_flush_on_admin can be set. The duration must be at least 1 millisecond and at most 5 minutes.
stats_flush_on_admin (bool) Flush statistics to the sink only when queried on the admin interface. If set, no flush timer is created. Only one of stats_flush_on_admin or stats_flush_interval can be set.
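The constraints on the two flush settings can be summarized in a short Python sketch. It only restates the rules described above (default of 5000 ms, mutual exclusivity, allowed range); `effective_flush_interval` is a hypothetical helper and is not how Envoy itself validates its configuration.

```python
from datetime import timedelta

DEFAULT_FLUSH = timedelta(milliseconds=5000)
MIN_FLUSH = timedelta(milliseconds=1)
MAX_FLUSH = timedelta(minutes=5)


def effective_flush_interval(stats_flush_interval=None, stats_flush_on_admin=False):
    """Return the flush timer interval, or None when flushing is admin-driven."""
    if stats_flush_interval is not None and stats_flush_on_admin:
        raise ValueError(
            "only one of stats_flush_interval or stats_flush_on_admin can be set"
        )
    if stats_flush_on_admin:
        return None  # no flush timer is created
    if stats_flush_interval is None:
        return DEFAULT_FLUSH
    if not MIN_FLUSH <= stats_flush_interval <= MAX_FLUSH:
        raise ValueError("stats_flush_interval must be between 1 ms and 5 minutes")
    return stats_flush_interval


print(effective_flush_interval())                           # default interval
print(effective_flush_interval(timedelta(seconds=10)))      # explicit interval
print(effective_flush_interval(stats_flush_on_admin=True))  # None: no timer
```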
config.metrics.v3.StatsConfig#
Envoy docs: config-metrics-v3-statsconfig
{
"stats_tags": [],
"use_all_default_tags": {...},
"stats_matcher": {...},
"histogram_bucket_settings": []
}
stats_tags (multiple [config.metrics.v3.TagSpecifier](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-tagspecifier)) Dimension extraction rules (corresponding to Prometheus label extraction). Each metric name string is processed independently by these tag rules. When a tag matches, the first capture group is not immediately removed from the name, so later [TagSpecifiers](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-tagspecifier) can also match the same portion. After all tag matching is complete, the matched portions are clipped from the metric name string, and the result is used as the metric name for the stats sink, e.g., the metric name for Prometheus.
use_all_default_tags (BoolValue) Use all of the default tag regular expressions defined in Envoy. These can be used together with the custom tags specified in stats_tags, and are processed before the custom tags. Istio defaults to false.
stats_matcher ([config.metrics.v3.StatsMatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-statsmatcher)) Specifies which metrics Envoy will output. Supports include/exclude rules. If not provided, all metrics are output. Excluding certain sets of metrics can improve Envoy performance somewhat.
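The "match everything first, clip afterwards" behavior of stats_tags can be sketched in Python. This is a simplified model rather than Envoy's implementation: `extract_tags` is a hypothetical helper, the two rules are illustrative, and it assumes (following the Envoy TagSpecifier documentation as I understand it) that the first capture group marks the text to remove and the second, if present, supplies the tag value. Overlapping matches are not handled.

```python
import re


def extract_tags(stat_name, specifiers):
    """Apply (tag_name, regex) rules to the full name, then clip all matches."""
    tags, spans = {}, []
    for tag_name, pattern in specifiers:
        # Every rule sees the original, unclipped name.
        m = re.search(pattern, stat_name)
        if m is None:
            continue
        # Group 1 is removed from the name; group 2 (if any) is the tag value.
        tags[tag_name] = m.group(2) if m.re.groups >= 2 else m.group(1)
        spans.append(m.span(1))
    # Clip from right to left so that earlier offsets stay valid.
    clipped = stat_name
    for start, end in sorted(spans, reverse=True):
        clipped = clipped[:start] + clipped[end:]
    return clipped, tags


# Illustrative rules and stat name, not Istio's actual configuration.
rules = [
    ("http_conn_manager_prefix", r"^http\.((.*?)\.)"),
    ("http_user_agent", r"^http(?=\.).*?\.user_agent\.((.+?)\.)\w+?$"),
]
name = "http.connection_manager_1.user_agent.ios.downstream_cx_total"
clipped, tags = extract_tags(name, rules)
print(clipped)
print(tags)
```

Because the second rule still sees `connection_manager_1` in the name, it can anchor on the full original string even though the first rule has already claimed that portion as a tag.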
config.metrics.v3.StatsMatcher#
Envoy docs: config-metrics-v3-statsmatcher
Configuration for disabling/enabling the calculation and output of statistical metrics.
{
"reject_all": ...,
"exclusion_list": {...},
"inclusion_list": {...}
}
reject_all (bool) If reject_all is true, all statistics are disabled. If reject_all is false, all statistics are enabled.
exclusion_list ([type.matcher.v3.ListStringMatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/type/matcher/v3/string.proto#envoy-v3-api-msg-type-matcher-v3-liststringmatcher)) Exclusion list: all statistics are enabled except those that match.
inclusion_list ([type.matcher.v3.ListStringMatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/type/matcher/v3/string.proto#envoy-v3-api-msg-type-matcher-v3-liststringmatcher)) Inclusion list: no statistics are enabled except those that match.
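The effect of a StatsMatcher on a given stat name can be modeled with a small Python sketch. It is an approximation under stated assumptions: `stat_enabled` and `matches` are hypothetical helpers, only the `exact`, `prefix`, and `suffix` forms of a string matcher are covered, and the semantics assumed are that an inclusion list enables only matching stats while an exclusion list disables only matching stats.

```python
def matches(pattern, name):
    """Evaluate one string-matcher-like dict (exact, prefix, or suffix only)."""
    if "exact" in pattern:
        return name == pattern["exact"]
    if "prefix" in pattern:
        return name.startswith(pattern["prefix"])
    if "suffix" in pattern:
        return name.endswith(pattern["suffix"])
    raise ValueError(f"unsupported matcher: {pattern}")


def stat_enabled(matcher, name):
    """Decide whether a stat name is enabled under a StatsMatcher-like dict."""
    if "reject_all" in matcher:
        return not matcher["reject_all"]
    if "inclusion_list" in matcher:
        patterns = matcher["inclusion_list"]["patterns"]
        return any(matches(p, name) for p in patterns)
    if "exclusion_list" in matcher:
        patterns = matcher["exclusion_list"]["patterns"]
        return not any(matches(p, name) for p in patterns)
    return True  # no matcher configured: every stat is enabled


# Illustrative configuration, not Istio's default.
matcher = {
    "inclusion_list": {
        "patterns": [{"prefix": "cluster."}, {"suffix": ".downstream_cx_active"}]
    }
}
print(stat_enabled(matcher, "cluster.foo.upstream_rq_total"))
print(stat_enabled(matcher, "server.uptime"))
```

An inclusion list of this kind is a common way to keep only the handful of metrics that are actually monitored.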
Note
This section references: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/observability/statistics
In the next section, an example of how Istio can be used with the configuration above will be shown.