Envoy Metrics#
Envoy Metrics Overview#
One of the main goals of Envoy is to make the network easy to understand. Envoy generates a large amount of statistical information depending on how it is configured. In general, statistics (metrics) fall into three categories:
Downstream: Downstream metrics are related to incoming connections/requests. They are generated by the listener, the HTTP connection manager (HCM), the TCP proxy filter, and so on.
Upstream: Upstream metrics are related to outgoing connections/requests. They are generated by the connection pool, the router filter, the TCP proxy filter, and so on.
Server: Server metrics describe the operation of the Envoy server instance, for example server uptime or the amount of memory allocated.
In the simplest scenario, a single Envoy proxy typically involves both Downstream and Upstream statistics. These two sets of metrics reflect the operation of the network node from which they are taken. Statistics from the entire mesh provide a very detailed picture of the health of each network node and of the network as a whole.
Envoy’s documentation has some brief descriptions of these metrics.
Tag#
Envoy’s metrics also support a sub-concept known as tags, or dimensions. A tag here is equivalent to a label on a Prometheus metric; that is, it can be understood as a categorical dimension.
Envoy’s metrics are identified by canonical strings. The dynamic parts (substrings) of these strings are extracted as tags. The extraction can be customized by specifying tag extraction rules (the Tag Specifier configuration).
As an example:
### 1. The original Envoy metrics ###
$ kubectl exec fortio-server -c istio-proxy -- curl 'localhost:15000/stats'
### Returns:
cluster.outbound|8080||fortio-server-l2.mark.svc.cluster.local.external.upstream_rq_2xx: 300
# where:
# - The `outbound|8080||fortio-server-l2.mark.svc.cluster.local` part is the name of the upstream cluster. It can be extracted as a tag.
# - The `2xx` part is the HTTP Status Code category. This can be extracted as a tag. The configuration of this extraction rule is described below.
### 2. Metrics for Prometheus ###
$ kubectl exec fortio-server -c istio-proxy -- curl 'localhost:15000/stats?format=prometheus' | grep 'outbound|8080||fortio-server-l2' | grep 'external.upstream_rq'
### Returns:
envoy_cluster_external_upstream_rq{response_code_class="2xx",cluster_name="outbound|8080||fortio-server-l2.mark.svc.cluster.local"} 300
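The extraction shown above can be sketched in a few lines of Python. This is only a simplified model: the two regular expressions are illustrative assumptions, not the rules shipped with Envoy or Istio, and the `envoy_` prefix with dots turned into underscores is inferred from the sample output. The sketch follows the TagSpecifier convention that the first capture group is the part removed from the name and the second capture group is the tag value.

```python
import re

# Illustrative rules only; these are NOT the regexes shipped with Envoy or Istio.
# Capture group 1 is the part removed from the stat name,
# capture group 2 is the tag value.
RULES = [
    ("cluster_name", re.compile(r"^cluster\.((.+?\.svc\.cluster\.local)\.)")),
    ("response_code_class", re.compile(r"_rq(_(\dxx))$")),
]


def extract_tags(stat_name):
    """Return (tag-extracted name, tags) for one canonical stat name."""
    tags = {}
    spans = []
    for tag_name, pattern in RULES:
        match = pattern.search(stat_name)
        if match:
            tags[tag_name] = match.group(2)
            spans.append(match.span(1))
    # Cut the matched spans only after every rule has run, from right to left,
    # so that the earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        stat_name = stat_name[:start] + stat_name[end:]
    return stat_name, tags


def prometheus_name(extracted_name):
    """Assumed naming scheme: 'envoy_' prefix, dots replaced by underscores."""
    return "envoy_" + extracted_name.replace(".", "_")


if __name__ == "__main__":
    raw = ("cluster.outbound|8080||fortio-server-l2.mark.svc.cluster.local"
           ".external.upstream_rq_2xx")
    name, tags = extract_tags(raw)
    print(prometheus_name(name), tags)
```

Running every rule against the untouched name and trimming afterwards mirrors the behavior described for `stats_tags` later in this section, where a matched portion is not removed immediately.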
Metrics data types#
Envoy emits three types of values as statistics:
Counters: unsigned integers that only increase and never decrease. For example, total requests.
Gauges: unsigned integers that can both increase and decrease. For example, currently active requests.
Histograms: unsigned integers that are part of a stream of values, which the collector then aggregates to produce summarized percentiles (the usual P50/P99/Pxx). For example, upstream response time.
In Envoy’s internal implementation, counters and gauges are batched and flushed periodically to improve performance. Histograms are written as they are received.
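To make the difference between the three types concrete, here is a toy model in Python. The class and method names are made up for this illustration and have nothing to do with Envoy’s C++ implementation; the percentile uses the simple nearest-rank method over raw samples, whereas real collectors work on buckets.

```python
import math


class Counter:
    """Unsigned integer that only increases."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """Unsigned integer that can increase and decrease."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount

    def dec(self, amount=1):
        if amount > self.value:
            raise ValueError("a gauge cannot go below zero")
        self.value -= amount


class Histogram:
    """Stream of samples; percentiles are computed at aggregation time."""

    def __init__(self):
        self.samples = []

    def record(self, value):
        self.samples.append(value)

    def percentile(self, p):
        """Nearest-rank percentile for an integer p between 1 and 100."""
        if not self.samples:
            raise ValueError("no recorded values")
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


if __name__ == "__main__":
    rq_total, rq_active, rq_time = Counter(), Gauge(), Histogram()
    for latency_ms in (3, 5, 4, 120, 6):
        rq_total.inc()       # one more request seen
        rq_active.inc()      # request starts
        rq_time.record(latency_ms)
        rq_active.dec()      # request ends
    print(rq_total.value, rq_active.value, rq_time.percentile(50))
```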
Metrics Interpretation#
Metrics can be categorized by where they are produced:
cluster manager: L3/L4/L7 level metrics for upstream.
http connection manager (HCM): L7 level metrics for upstream & downstream.
listeners: L3/L4 level metrics for downstream.
server (global)
watch dog
Below I’ve selected only some of the key performance metrics to briefly explain.
cluster manager#
Envoy documentation: cluster manager stats
The documentation above already goes into a bit more detail. I’ll just add some aspects to focus on when performance tuning. So, which metrics should we focus on in general?
Let’s analyze them using the well-known Utilization, Saturation, and Errors (USE) methodology.
Utilization:
upstream_cx_total (Counter): total number of connections
upstream_rq_active
Saturation:
upstream_rq_time (Histogram): response latency
upstream_cx_connect_ms (Histogram)
upstream_cx_rx_bytes_buffered
upstream_cx_tx_bytes_buffered
upstream_rq_pending_total (Counter)
upstream_rq_pending_active (Gauge)
circuit_breakers.*cx_open
circuit_breakers.*cx_pool_open
circuit_breakers.*rq_pending_open
circuit_breakers.*rq_open
circuit_breakers.*rq_retry_open
Error:
upstream_cx_connect_fail (Counter): number of connection failures
upstream_cx_connect_timeout (Counter): number of connection timeouts
upstream_cx_overflow (Counter): total number of times the cluster’s connection circuit breaker overflowed
upstream_cx_pool_overflow
upstream_cx_destroy_local_with_active_rq
upstream_cx_destroy_remote_with_active_rq
upstream_rq_timeout
upstream_rq_retry
upstream_rq_rx_reset
upstream_rq_tx_reset
upstream_rq_pending_overflow (Counter): total number of requests that overflowed the connection pool or request (mainly for HTTP/2 and higher) circuit breaking and were failed
Other:
upstream_rq_total (Counter): total requests, used to derive TPS (throughput)
upstream_cx_destroy_local (Counter): number of connections closed actively by Envoy
upstream_cx_destroy_remote (Counter): number of connections closed by the remote side
upstream_cx_length_ms (Histogram)
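Because counters only ever increase, a throughput figure such as TPS comes from the difference between two samples rather than from a single reading. The following is a minimal sketch, assuming the plain-text `name: value` lines shown earlier for the `/stats` endpoint. The helper names, the sample values, and the 10-second interval are all invented for illustration.

```python
PREFIX = "cluster.outbound|8080||fortio-server-l2.mark.svc.cluster.local."

# Two invented snapshots of the admin /stats output, taken 10 seconds apart.
SAMPLE_T0 = (
    PREFIX + "upstream_rq_total: 300\n"
    + PREFIX + "upstream_rq_timeout: 0\n"
)
SAMPLE_T1 = (
    PREFIX + "upstream_rq_total: 900\n"
    + PREFIX + "upstream_rq_timeout: 6\n"
)


def parse_stats(text):
    """Parse 'name: value' lines, keeping only integer-valued stats.

    Lines whose value is not a plain integer (for example histogram
    summaries) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(": ")
        if sep and value.strip().isdigit():
            stats[name] = int(value)
    return stats


def rate_per_second(before, after, name, interval_s):
    """Average per-second increase of one counter between two snapshots."""
    return (after[name] - before[name]) / interval_s


if __name__ == "__main__":
    t0, t1 = parse_stats(SAMPLE_T0), parse_stats(SAMPLE_T1)
    tps = rate_per_second(t0, t1, PREFIX + "upstream_rq_total", 10)
    timeouts = rate_per_second(t0, t1, PREFIX + "upstream_rq_timeout", 10)
    print(f"TPS={tps:.1f} timeout ratio={timeouts / tps:.2%}")
```

In practice a monitoring system such as Prometheus performs this subtraction for you; the sketch only shows why a raw counter value on its own says little about current load.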
http connection manager(HCM)#
Envoy docs: http connection manager (HCM) stats
These can be thought of as L7 metrics for downstream and, in part, upstream.
Utilization:
downstream_cx_total
downstream_cx_active
downstream_cx_http1_active
downstream_rq_total
downstream_rq_http1_total
downstream_rq_active
Saturation:
downstream_cx_rx_bytes_buffered
downstream_cx_tx_bytes_buffered
downstream_flow_control_paused_reading_total
downstream_flow_control_resumed_reading_total
Error:
downstream_cx_destroy_local_active_rq
downstream_cx_destroy_remote_active_rq
downstream_rq_rx_reset
downstream_rq_tx_reset
downstream_rq_too_large
downstream_rq_max_duration_reached
downstream_rq_timeout
downstream_rq_overload_close
rs_too_large
Others:
downstream_cx_destroy_remote
downstream_cx_destroy_local
downstream_cx_length_ms
listeners#
These can be thought of as L3/L4 level metrics for downstream.
Utilization:
downstream_cx_total
downstream_cx_active
Saturation:
downstream_pre_cx_active
Error:
downstream_cx_transport_socket_connect_timeout
downstream_cx_overflow
no_filter_chain_match
downstream_listener_filter_error
no_certificate
Others:
downstream_cx_length_ms
server#
Envoy basic info metrics
Utilization:
concurrency
Error:
days_until_first_cert_expiring
watch dog#
Envoy also includes a configurable watchdog system that emits statistics and can optionally terminate the server when Envoy stops responding. The system has two separate watchdog configurations, one for the main thread and one for the worker threads, because different threads have different workloads. These statistics help to understand, at a high level, whether Envoy’s event loop is unresponsive because it is doing too much work, is blocked, or is not being scheduled by the operating system.
Saturation:
watchdog_mega_miss (Counter): number of mega misses
watchdog_miss (Counter): number of misses
If you are interested in the watchdog mechanism, see:
https://github.com/envoyproxy/envoy/issues/11391
https://github.com/envoyproxy/envoy/issues/11388
Event loop#
Envoy documentation: Event loop
The Envoy architecture is designed to optimize scalability and resource utilization by running the event loop on a small number of threads. The "main" thread is responsible for control plane processing, and each "worker" thread handles a portion of the data plane tasks. Envoy exposes two statistics to monitor the performance of all these threaded event loops.
Time taken to run a round of the loop: each iteration of the event loop executes a number of tasks, and the number of tasks varies with the load. However, if one or more threads show an unusually long tail in loop duration, there may be a performance problem. For example, work may be unevenly distributed between worker threads, or a long blocking operation in a plugin may be impeding task progress.
Polling latency: in each iteration of the event loop, the event scheduler polls for I/O events and wakes the thread up when an I/O event is ready or a timeout occurs, whichever comes first. In the case of a timeout, we can measure the difference between the expected wake-up time and the actual wake-up time after polling; this difference is called the polling delay. It is normal to see a small polling delay, usually equal to the kernel scheduler’s time slice or quantum, depending on the operating system running Envoy. But if this number rises significantly above its normally observed baseline, the kernel scheduler may be experiencing delays.
These statistics are enabled by setting [enable_dispatcher_stats](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/bootstrap/v3/bootstrap.proto#envoy-v3-api-field-config-bootstrap-v3-bootstrap-enable-dispatcher-stats) to true.
The event scheduler for the main thread has a statistics tree rooted at `server.dispatcher.`, and the event scheduler for each worker thread has a statistics tree rooted at `listener_manager.worker_<id>.dispatcher.`.
Each tree has the following statistics:
Name | Type | Description |
---|---|---|
loop_duration_us | Histogram | Event loop duration in microseconds |
poll_delay_us | Histogram | Polling delay in microseconds |
Note that this does not include any auxiliary (non-main and worker) threads.
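The idea behind the polling delay described above can be tried out with a few lines of Python. This is a toy illustration only and not how Envoy measures `poll_delay_us`: `time.sleep` stands in for a poll call that ends by timeout, and the function name is made up.

```python
import time


def measure_poll_delay_us(timeout_s):
    """Return how late (in microseconds) a timed wait actually woke up."""
    expected_wakeup = time.monotonic() + timeout_s
    time.sleep(timeout_s)  # stands in for a poll call ending by timeout
    actual_wakeup = time.monotonic()
    return (actual_wakeup - expected_wakeup) * 1e6


if __name__ == "__main__":
    delays = [measure_poll_delay_us(0.01) for _ in range(20)]
    print(f"min={min(delays):.0f}us max={max(delays):.0f}us")
```

On a lightly loaded machine the values tend to stay small and fairly stable; a baseline that suddenly grows is the kind of signal the paragraph above refers to.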
Hint
Watch Dog and Event loop are both tools for monitoring and troubleshooting event processing delays and timing. There are a lot of details and stories here, reaching all the way down to the Linux kernel. Hopefully there will be time later in the book to learn and analyze these interesting details with you.
Configuration#
Hint
If you read the introduction to this book What this book is not, it says it’s not a “user’s manual”, so why is it talking about configuration? Well, all I can say is that it’s better to start with understanding how to use it, and then learn how to implement it, than to come straight to the source code.
This section references: Envoy documentation
config.bootstrap.v3.Bootstrap#
Envoy docs: config.bootstrap.v3.Bootstrap proto
{
"node": {...},
"static_resources": {...},
"dynamic_resources": {...},
"cluster_manager": {...},
"stats_sinks": [],
"stats_config": {...},
"stats_flush_interval": {...},
"stats_flush_on_admin": ...,
...
}
Hint
What is a stats sink? This book does not explain it. Istio does not customize this configuration by default.
The following covers only the parts of the configuration that concern us here.
stats_config ([config.metrics.v3.StatsConfig](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-statsconfig)) Configuration for the internal processing of statistics.
stats_flush_interval (Duration) Interval at which statistics are flushed to the stats sink. For performance reasons, Envoy does not flush counters in real time; counters and gauges are only flushed periodically. If not specified, the default value is 5000 milliseconds. Only one of stats_flush_interval or stats_flush_on_admin can be set. The duration must be at least 1 millisecond and at most 5 minutes.
stats_flush_on_admin (bool) Flush statistics to the sink only when they are queried on the admin interface. If set, no flush timer is created. Only one of stats_flush_on_admin or stats_flush_interval can be set.
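The constraints on the two flush options can be summarized as a small validation function. This is a sketch of the rules as stated above, not Envoy’s own validation code. The function name and return shape are hypothetical, and a Python `timedelta` stands in for the protobuf Duration.

```python
from datetime import timedelta

DEFAULT_INTERVAL = timedelta(milliseconds=5000)
MIN_INTERVAL = timedelta(milliseconds=1)
MAX_INTERVAL = timedelta(minutes=5)


def resolve_flush_policy(stats_flush_interval=None, stats_flush_on_admin=False):
    """Return ('periodic', interval) or ('on_admin', None), or raise ValueError."""
    if stats_flush_interval is not None and stats_flush_on_admin:
        raise ValueError(
            "only one of stats_flush_interval or stats_flush_on_admin can be set"
        )
    if stats_flush_on_admin:
        # No flush timer: flush only when the admin interface is queried.
        return ("on_admin", None)
    if stats_flush_interval is None:
        return ("periodic", DEFAULT_INTERVAL)
    if not MIN_INTERVAL <= stats_flush_interval <= MAX_INTERVAL:
        raise ValueError("stats_flush_interval must be between 1ms and 5min")
    return ("periodic", stats_flush_interval)


if __name__ == "__main__":
    print(resolve_flush_policy())
    print(resolve_flush_policy(stats_flush_interval=timedelta(seconds=1)))
    print(resolve_flush_policy(stats_flush_on_admin=True))
```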
config.metrics.v3.StatsConfig#
Envoy docs: config-metrics-v3-statsconfig
{
"stats_tags": [],
"use_all_default_tags": {...},
"stats_matcher": {...},
"histogram_bucket_settings": []
}
stats_tags (multiple [config.metrics.v3.TagSpecifier](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-tagspecifier)) Dimension extraction rules (these correspond to Prometheus label extraction). Each metric name string is processed independently by these tag rules. When a tag matches, the first capture group is not immediately removed from the name, so later [TagSpecifiers](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-tagspecifier) can also match the same portion. After all tag matching is complete, the matched portions are clipped from the metric name string, and the result is used as the metric name in the stats sink, for example as the Prometheus metric name.
use_all_default_tags (BoolValue) Use all the default tag regular expressions specified in Envoy. These can be combined with the custom tags specified in stats_tags, and they are processed before the custom tags. Istio defaults to false.
stats_matcher ([config.metrics.v3.StatsMatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-statsmatcher)) Specifies which metrics Envoy outputs. Supports include/exclude rules. If not provided, all metrics are output. Excluding certain sets of metrics can improve Envoy performance somewhat.
config.metrics.v3.StatsMatcher#
Envoy docs: config-metrics-v3-statsmatcher
Configuration for disabling/enabling the calculation and output of statistics.
{
"reject_all": ...,
"exclusion_list": {...},
"inclusion_list": {...}
}
reject_all (bool) If reject_all is true, all statistics are disabled. If reject_all is false, all statistics are enabled.
exclusion_list ([type.matcher.v3.ListStringMatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/type/matcher/v3/string.proto#envoy-v3-api-msg-type-matcher-v3-liststringmatcher)) Exclusion list: statistics whose names match are disabled.
inclusion_list ([type.matcher.v3.ListStringMatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/type/matcher/v3/string.proto#envoy-v3-api-msg-type-matcher-v3-liststringmatcher)) Inclusion list: only statistics whose names match are enabled.
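The way the three options decide whether a metric is output can be modeled as follows. This is a simplified reading of the configuration, not Envoy’s implementation: the function names are hypothetical, only the `exact`, `prefix`, and `suffix` forms of a string matcher are handled, and the matcher is passed as a plain dict shaped like the JSON above.

```python
def pattern_matches(pattern, name):
    """Evaluate one string matcher: a dict with 'exact', 'prefix' or 'suffix'."""
    if "exact" in pattern:
        return name == pattern["exact"]
    if "prefix" in pattern:
        return name.startswith(pattern["prefix"])
    if "suffix" in pattern:
        return name.endswith(pattern["suffix"])
    raise ValueError(f"unsupported pattern in this sketch: {pattern!r}")


def is_stat_enabled(stats_matcher, name):
    """Decide whether a stat name passes the given stats_matcher dict."""
    if not stats_matcher:
        return True  # no matcher provided: all metrics are output
    if "reject_all" in stats_matcher:
        return not stats_matcher["reject_all"]
    if "exclusion_list" in stats_matcher:
        patterns = stats_matcher["exclusion_list"]["patterns"]
        return not any(pattern_matches(p, name) for p in patterns)
    if "inclusion_list" in stats_matcher:
        patterns = stats_matcher["inclusion_list"]["patterns"]
        return any(pattern_matches(p, name) for p in patterns)
    return True


if __name__ == "__main__":
    matcher = {"inclusion_list": {"patterns": [
        {"prefix": "cluster.outbound"},
        {"suffix": "downstream_cx_active"},
    ]}}
    for stat in ("cluster.outbound|8080||x.upstream_rq_total", "server.uptime"):
        print(stat, is_stat_enabled(matcher, stat))
```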
Note
This section references: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/observability/statistics
In the next section, an example of how Istio can be used with the configuration above will be shown.