Envoy Metrics#

Envoy Metrics Overview#

One of the main goals of Envoy is to make the network easy to understand. Envoy generates a large amount of statistical information depending on how it is configured. In general, statistics (metrics) fall into three categories:

Downstream: Downstream metrics relate to incoming connections and requests. They are generated by listeners, the HTTP connection manager (HCM), the TCP proxy filter, and so on.
Upstream: Upstream metrics relate to outgoing connections and requests. They are generated by connection pools, the router filter, the TCP proxy filter, and so on.
Server: Server metrics describe the operation of the Envoy server instance itself, for example server uptime or the amount of memory allocated.

In the simplest scenario, a single Envoy proxy involves both downstream and upstream statistics. Together they reflect the operation of the network node from which they are taken. Statistics from the entire mesh provide detailed information about the health of each network node and of the network as a whole. Envoy’s documentation briefly describes these metrics.

Tag#

Envoy metrics also support a subconcept: tags, also called dimensions. A tag here is equivalent to a label on a Prometheus metric and can be understood as a categorical dimension.

Envoy metrics are identified by canonical name strings. The dynamic parts (substrings) of these strings are extracted as tags. The extraction can be customized by specifying tag extraction rules (the Tag Specifier configuration).

As an example:

### 1. The original Envoy metrics ###

$ kubectl exec fortio-server -c istio-proxy -- curl 'localhost:15000/stats'

### Returns:
cluster.outbound|8080||fortio-server-l2.mark.svc.cluster.local.external.upstream_rq_2xx: 300

# where:
# - The `outbound|8080||fortio-server-l2.mark.svc.cluster.local` part is the name of the upstream cluster. It can be extracted as a tag.
# - The `2xx` part is the HTTP Status Code category. This can be extracted as a tag. The configuration of this extraction rule is described below.

### 2. Metrics for Prometheus ###
$ kubectl exec fortio-server -c istio-proxy -- curl 'localhost:15000/stats?format=prometheus' | grep 'outbound|8080||fortio-server-l2' | grep 'external.upstream_rq'

# Returns:
envoy_cluster_external_upstream_rq{response_code_class="2xx",cluster_name="outbound|8080||fortio-server-l2.mark.svc.cluster.local" } 300
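
To make the relation between tags and Prometheus labels concrete, here is a minimal sketch that splits the Prometheus-format line above into a metric name, a label dictionary, and a value. The function name `parse_sample` and both regular expressions are my own illustration, not part of Envoy or Prometheus, and the parser deliberately ignores escaped quotes inside label values.

```python
import re

# Sample line copied from the Prometheus-format output above.
LINE = ('envoy_cluster_external_upstream_rq{response_code_class="2xx",'
        'cluster_name="outbound|8080||fortio-server-l2.mark.svc.cluster.local"} 300')

# name{labels} value  (simplified: no timestamps, no escaped quotes)
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split one labelled sample line into (metric name, labels, value)."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a labelled sample: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels")))
    return m.group("name"), labels, float(m.group("value"))


name, labels, value = parse_sample(LINE)
print(name)    # the metric name with the dynamic parts removed
print(labels)  # the dynamic parts, now available as tags/labels
print(value)
```

The cluster name and the status code class no longer appear in the metric name; they show up as the `cluster_name` and `response_code_class` labels instead.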

Metrics data types#

Envoy emits three types of values as statistics:

Counters: unsigned integers that only increase and never decrease. For example, total requests.
Gauges: unsigned integers that both increase and decrease. For example, currently active requests.
Histograms: unsigned integers that are part of a stream of values, which the collector then aggregates to produce summarized percentiles (the usual P50/P99/Pxx). For example, upstream response time.

In Envoy’s internal implementation, counters and gauges are batched and flushed periodically to improve performance. Histograms are written as the values are received.
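
Because a counter only ever increases, its raw value is rarely interesting on its own; what we usually look at is how fast it grows. The following sketch, with a hypothetical helper `counter_rate` and made-up sample values, shows one way to turn two counter samples into a per-second rate. A gauge, by contrast, can simply be read as is.

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second rate between two samples of a monotonically increasing counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # A counter that went backwards usually means the process restarted.
        # Assumption: treat the current value as the increase since the restart.
        delta = curr_value
    return delta / interval_seconds


# Hypothetical samples of a request counter taken 15 seconds apart.
print(counter_rate(300, 450, 15))  # 10.0 requests per second
```

Monitoring systems such as Prometheus do this kind of calculation for you; the sketch only illustrates why the "only increases" property of counters matters.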

Metrics Interpretation#

Metrics can be categorized by where they are produced:

cluster manager: L3/L4/L7 metrics for upstream.
http connection manager (HCM): L7 metrics for upstream and downstream.
listeners: L3/L4 metrics for downstream.
server (global)
watch dog

Below I briefly explain a selection of the key performance metrics.

cluster manager#

Envoy documentation: cluster manager stats

The documentation above already goes into some detail, so I will only add the aspects to focus on when performance tuning. Which metrics deserve attention in general?

Let’s analyze them with the well-known Utilization, Saturation, and Errors (USE) method.

Utilization:

upstream_cx_total (Counter): total number of connections
upstream_rq_active

Saturation:

upstream_rq_time (Histogram): response latency
upstream_cx_connect_ms (Histogram)
upstream_cx_rx_bytes_buffered
upstream_cx_tx_bytes_buffered
upstream_rq_pending_total (Counter)
upstream_rq_pending_active (Gauge)
circuit_breakers.*cx_open
circuit_breakers.*cx_pool_open
circuit_breakers.*rq_pending_open
circuit_breakers.*rq_open
circuit_breakers.*rq_retry_open

Error:

upstream_cx_connect_fail (Counter): total number of connection failures
upstream_cx_connect_timeout (Counter): total number of connection timeouts
upstream_cx_overflow (Counter): total number of times the cluster’s connection circuit breaker overflowed
upstream_cx_pool_overflow
upstream_cx_destroy_local_with_active_rq
upstream_cx_destroy_remote_with_active_rq
upstream_rq_timeout
upstream_rq_retry
upstream_rq_rx_reset
upstream_rq_tx_reset
upstream_rq_pending_overflow (Counter): total number of requests that overflowed the connection pool, or (mainly for HTTP/2 and above) that tripped request circuit breaking, and were failed

Other:

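upstream_rq_total (Counter): total number of requests; its rate of change gives the TPS (throughput)
upstream_cx_destroy_local (Counter): total number of connections closed locally, that is, actively closed by Envoy
upstream_cx_destroy_remote (Counter): total number of connections closed remotely, that is, closed by the peer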
upstream_cx_length_ms (Histogram)
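
As a rough illustration of how the error counters above can be put to use, the sketch below parses `name: value` lines as returned by the admin `/stats` endpoint and relates a few error counters to `upstream_rq_total`. The cluster name `demo`, the sample values, and the helper names are all made up; the histogram line is included only to show that non-integer values are skipped, and its exact format is an assumption.

```python
def parse_stats(text):
    """Parse 'name: value' lines; lines without an integer value are skipped."""
    stats = {}
    for line in text.splitlines():
        name, sep, raw = line.rpartition(": ")
        if not sep:
            continue
        try:
            stats[name] = int(raw)
        except ValueError:
            continue  # e.g. histogram summaries
    return stats


def error_ratios(stats, cluster):
    """Relate selected error counters of one cluster to its total request count."""
    prefix = f"cluster.{cluster}."
    total = stats.get(prefix + "upstream_rq_total", 0)
    ratios = {}
    for key in ("upstream_rq_pending_overflow", "upstream_rq_timeout", "upstream_rq_retry"):
        count = stats.get(prefix + key, 0)
        ratios[key] = count / total if total else 0.0
    return ratios


# Made-up sample; the last line only imitates a non-integer histogram value.
SAMPLE = """\
cluster.demo.upstream_rq_total: 300
cluster.demo.upstream_rq_pending_overflow: 3
cluster.demo.upstream_rq_timeout: 6
cluster.demo.upstream_rq_retry: 0
cluster.demo.upstream_rq_time: P0(nan,0) P50(nan,1.05)
"""

print(error_ratios(parse_stats(SAMPLE), "demo"))
```

Since these are counters accumulated since startup, in practice the ratios are more useful when computed over the increase within a time window rather than over the absolute values.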

http connection manager(HCM)#

Envoy docs: http connection manager (HCM) stats

These can be thought of as L7 metrics for downstream and, in part, upstream.

Utilization:

downstream_cx_total
downstream_cx_active
downstream_cx_http1_active
downstream_rq_total
downstream_rq_http1_total
downstream_rq_active

Saturation:

downstream_cx_rx_bytes_buffered
downstream_cx_tx_bytes_buffered
downstream_flow_control_paused_reading_total
downstream_flow_control_resumed_reading_total

Error:

downstream_cx_destroy_local_active_rq
downstream_cx_destroy_remote_active_rq
downstream_rq_rx_reset
downstream_rq_tx_reset
downstream_rq_too_large
downstream_rq_max_duration_reached
downstream_rq_timeout
downstream_rq_overload_close
rs_too_large

Others:

downstream_cx_destroy_remote
downstream_cx_destroy_local
downstream_cx_length_ms

listeners#

Envoy docs: listener stats

These can be thought of as L3/L4 metrics for downstream.

Utilization:

downstream_cx_total
downstream_cx_active

Saturation:

downstream_pre_cx_active

Error:

downstream_cx_transport_socket_connect_timeout
downstream_cx_overflow
no_filter_chain_match
downstream_listener_filter_error
no_certificate

Others:

downstream_cx_length_ms

server#

Envoy basic info metrics

Envoy docs: server stats

Utilization:

concurrency

Error:

days_until_first_cert_expiring

watch dog#

Envoy docs: Watchdog

Envoy also includes a configurable watchdog system that emits statistics and can optionally terminate the server when Envoy stops responding. The system has two separate watchdog configurations, one for the main thread and one for the worker threads, because different threads have different workloads. These statistics help to understand at a high level whether Envoy’s event loop is unresponsive because it is doing too much work, because it is blocked, or because it is not being scheduled by the operating system.

Saturation:

watchdog_mega_miss (Counter): number of mega misses
watchdog_miss (Counter): number of misses

If you are interested in the watchdog mechanism, see:

https://github.com/envoyproxy/envoy/issues/11391
https://github.com/envoyproxy/envoy/issues/11388

Event loop#

Envoy documentation: Event loop

The Envoy architecture is designed to optimize scalability and resource utilization by running event loops on a small number of threads. The "main" thread is responsible for control plane processing, and each "worker" thread takes a share of the data plane work. Envoy exposes two statistics for monitoring the event loops on all of these threads.

Event loop duration: each iteration of the event loop executes a number of tasks, and that number varies with the load. However, if one or more threads show unusually long-tailed loop durations, there may be a performance problem. For example, work may be unevenly distributed between worker threads, or a long blocking operation in a plugin may be impeding progress.

Polling delay: in each iteration of the event loop, the event scheduler polls for I/O events and wakes up when some I/O event is ready or a timeout occurs, whichever comes first. In the case of a timeout, we can measure the difference between the expected wakeup time and the actual wakeup time after polling; this difference is called the polling delay. It is normal to see a small polling delay, usually equal to the kernel scheduler’s time slice or quantum (which depends on the operating system running Envoy), but if this number rises significantly above its normally observed baseline, it indicates that the kernel scheduler may be experiencing delays.

These statistics are enabled by setting [enable_dispatcher_stats](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/bootstrap/v3/bootstrap.proto#envoy-v3-api-field-config-bootstrap-v3-bootstrap-enable-dispatcher-stats) to true.

The event scheduler for the main thread has a statistics tree rooted at server.dispatcher, and the event scheduler for each worker thread has a statistics tree rooted at listener_manager.worker_<id>.dispatcher.

Each tree has the following statistics:

Name	Type	Description
loop_duration_us	Histogram	event loop duration in microseconds
poll_delay_us	Histogram	Polling delay in microseconds

Note that auxiliary threads (those that are neither the main thread nor worker threads) are not included.
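
To see which stat names to look for once dispatcher statistics are enabled, the sketch below simply expands the two roots and two histogram names described above. The helper name `dispatcher_stat_names` is hypothetical, and numbering the workers from 0 is an assumption on my part; check the actual names in your own `/stats` output.

```python
def dispatcher_stat_names(worker_count):
    """List the event loop histogram names for the main thread and each worker."""
    stats = ("loop_duration_us", "poll_delay_us")
    roots = ["server.dispatcher"]  # main thread
    # Assumption: worker ids run from 0 to worker_count - 1.
    roots += [f"listener_manager.worker_{i}.dispatcher" for i in range(worker_count)]
    return [f"{root}.{stat}" for root in roots for stat in stats]


for stat_name in dispatcher_stat_names(2):
    print(stat_name)
```

With two worker threads this yields six histograms: two for the main thread and two for each worker.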

Hint

Watchdog and event loop statistics are both tools for monitoring and troubleshooting delays in event processing. There are many details and stories behind them, some reaching down into the Linux kernel. Hopefully there will be time later in the book to study and analyze these interesting details with you.

Configuration#

Hint

If you have read the “What this book is not” part of the introduction, you know it says this is not a “user’s manual”, so why talk about configuration here? All I can say is that it is better to first understand how to use something and then learn how it is implemented than to dive straight into the source code.

This section draws on the Envoy documentation.

config.bootstrap.v3.Bootstrap#

Envoy docs: config.bootstrap.v3.Bootstrap proto

{
  "node": {...},
  "static_resources": {...},
  "dynamic_resources": {...},
  "cluster_manager": {...},
  "stats_sinks": [],
  "stats_config": {...},
  "stats_flush_interval": {...},
  "stats_flush_on_admin": ...,
...
}

Hint

What is a stats sink? This book does not explain it. Istio does not customize this configuration by default. Only the parts of the configuration that are of interest are described below.

stats_config ([config.metrics.v3.StatsConfig](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-statsconfig)) Configuration for the internal processing of statistics.
stats_flush_interval (Duration) Interval between flushes to the configured stats sinks. For performance reasons, Envoy does not flush in real time: it latches counters and flushes counters and gauges only periodically. If not specified, the default is 5000 milliseconds. Only one of stats_flush_interval or stats_flush_on_admin can be set. The duration must be at least 1 millisecond and at most 5 minutes.
stats_flush_on_admin (bool) Flush statistics to the sinks only when queried on the admin interface. If set, no flush timer is created. Only one of stats_flush_on_admin or stats_flush_interval can be set.
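
The rules for the two flush settings can be summarized in a few lines of code. The sketch below is only a model of the constraints described above (default, bounds, mutual exclusion); it is not how Envoy validates its bootstrap, and the function name `effective_flush_interval` is made up.

```python
from datetime import timedelta

MIN_FLUSH = timedelta(milliseconds=1)
MAX_FLUSH = timedelta(minutes=5)
DEFAULT_FLUSH = timedelta(milliseconds=5000)


def effective_flush_interval(stats_flush_interval=None, stats_flush_on_admin=False):
    """Return the periodic flush interval, or None when flushing only on admin queries."""
    if stats_flush_interval is not None and stats_flush_on_admin:
        raise ValueError("only one of stats_flush_interval or stats_flush_on_admin can be set")
    if stats_flush_on_admin:
        return None  # no flush timer is created
    if stats_flush_interval is None:
        return DEFAULT_FLUSH
    if not (MIN_FLUSH <= stats_flush_interval <= MAX_FLUSH):
        raise ValueError("stats_flush_interval must be between 1 millisecond and 5 minutes")
    return stats_flush_interval


print(effective_flush_interval())                                # default: 0:00:05
print(effective_flush_interval(timedelta(seconds=10)))           # explicit interval
print(effective_flush_interval(stats_flush_on_admin=True))       # None
```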

config.metrics.v3.StatsConfig#

Envoy docs: config.metrics.v3.StatsConfig

{
  "stats_tags": [],
  "use_all_default_tags": {...},
  "stats_matcher": {...},
  "histogram_bucket_settings": []
}

stats_tags (multiple [config.metrics.v3.TagSpecifier](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-tagspecifier)) Dimension extraction rules (they correspond to Prometheus label extraction). Each metric name string is processed independently by these tag rules. When a tag matches, the first capture group is not immediately removed from the name, so later [TagSpecifiers](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-tagspecifier) can still match that same portion. After all tag matching has completed, the matched portions are removed from the metric name string, and the result is used as the metric name in stats sinks that represent tags, for example as the Prometheus metric name.
use_all_default_tags (BoolValue) Use all the default tag regular expressions specified in Envoy. These can be used in conjunction with the custom tags specified in stats_tags, and they are processed before the custom tags. Istio sets this to false by default.
stats_matcher ([config.metrics.v3.StatsMatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-statsmatcher)) Specifies which metrics Envoy will output. Supports include and exclude rules. If not provided, all metrics are output. Disabling statistics for certain sets of metrics can improve Envoy performance a little.
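
The "match first, remove afterwards" behaviour of stats_tags is easier to see in code. The sketch below is a simplified simulation, not Envoy's implementation: every rule is matched against the original name, the first capture group marks the portion to remove, and a nested second capture group supplies the tag value (this two-group convention is my understanding of the TagSpecifier regex field). The stat name `cluster.foo_cluster.upstream_rq_2xx` and both regular expressions are illustrative, not Istio's or Envoy's defaults.

```python
import re


def extract_tags(stat_name, rules):
    """Apply (tag_name, regex) rules; return (tag-extracted name, tags)."""
    tags = {}
    remove = [False] * len(stat_name)
    for tag_name, pattern in rules:
        # Each rule sees the ORIGINAL name, so rules cannot hide text from each other.
        m = re.search(pattern, stat_name)
        if m is None:
            continue
        tags[tag_name] = m.group(2)      # second group: the tag value
        start, end = m.span(1)           # first group: the portion to remove
        for i in range(start, end):
            remove[i] = True
    # Only after all rules have run are the matched portions cut out.
    stripped = "".join(ch for ch, drop in zip(stat_name, remove) if not drop)
    return stripped, tags


# Illustrative rules, written for this example only.
RULES = [
    ("cluster_name", r"^cluster\.((.+?)\.)"),
    ("response_code_class", r"_rq(_(\dxx))$"),
]

print(extract_tags("cluster.foo_cluster.upstream_rq_2xx", RULES))
```

The result is the name `cluster.upstream_rq` together with the tags `cluster_name=foo_cluster` and `response_code_class=2xx`, which mirrors the shape of the Prometheus example shown earlier in this chapter.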

config.metrics.v3.StatsMatcher#

Envoy docs: config.metrics.v3.StatsMatcher

Configuration for disabling/enabling the calculation and output of statistical metrics.

{
  "reject_all": ...,
  "exclusion_list": {...},
  "inclusion_list": {...}
}

reject_all (bool) If reject_all is true, all statistics are disabled. If reject_all is false, all statistics are enabled.
exclusion_list ([type.matcher.v3.ListStringMatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/type/matcher/v3/string.proto#envoy-v3-api-msg-type-matcher-v3-liststringmatcher)) Exclusion list: all statistics are enabled except those matching one of the listed string matchers.
inclusion_list ([type.matcher.v3.ListStringMatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/type/matcher/v3/string.proto#envoy-v3-api-msg-type-matcher-v3-liststringmatcher)) Inclusion list: no statistics are enabled except those matching one of the listed string matchers.
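
The three options can be modelled with a small decision function. This is a sketch under assumptions: the matcher dictionaries imitate the JSON form of the configuration, the `patterns` key and the `exact`/`prefix`/`suffix` matcher kinds reflect my reading of ListStringMatcher and StringMatcher (other kinds are left out), and `is_stat_enabled` is a made-up name.

```python
def string_matches(matcher, value):
    """Evaluate one simplified string matcher (exact, prefix, or suffix)."""
    if "exact" in matcher:
        return value == matcher["exact"]
    if "prefix" in matcher:
        return value.startswith(matcher["prefix"])
    if "suffix" in matcher:
        return value.endswith(matcher["suffix"])
    raise ValueError(f"unsupported matcher in this sketch: {matcher!r}")


def is_stat_enabled(stats_matcher, stat_name):
    """Decide whether a stat would be produced under the given stats_matcher."""
    if stats_matcher is None:
        return True  # no matcher configured: all metrics are output
    if "reject_all" in stats_matcher:
        return not stats_matcher["reject_all"]
    if "inclusion_list" in stats_matcher:
        patterns = stats_matcher["inclusion_list"]["patterns"]
        return any(string_matches(p, stat_name) for p in patterns)
    if "exclusion_list" in stats_matcher:
        patterns = stats_matcher["exclusion_list"]["patterns"]
        return not any(string_matches(p, stat_name) for p in patterns)
    return True


# Hypothetical matcher: keep only cluster and server stats.
matcher = {"inclusion_list": {"patterns": [{"prefix": "cluster."}, {"prefix": "server."}]}}
print(is_stat_enabled(matcher, "cluster.demo.upstream_rq_total"))   # True
print(is_stat_enabled(matcher, "http.inbound.downstream_rq_total")) # False
```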

Note

This section references: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/observability/statistics

In the next section, an example of how Istio can be used with the configuration above will be shown.
