Envoy request and response scheduling#
🎤 Before we begin, I’d like to explain the backstory of this section: why study Envoy’s request and response scheduling at all?
It started with a customer request to research fast recovery from Istio worker node failures. To do this, I went through a lot of Istio/Envoy documentation and blog posts, covering:

- Health checking
- Circuit breaking
- Envoy’s mysterious, intertwined timeout configurations
- Request retry
- TCP keepalive and `TCP_USER_TIMEOUT` configuration
In the end, I had to write a post to sort out the information: A First Look at Rapid Recovery from Istio Worker Node Failures. But while the information got sorted out, the underlying principles did not. So I decided to dig into Envoy’s documentation. Yes, Envoy’s documentation is actually quite detailed. However:

- The information is scattered across web pages; it is not organized chronologically or as a flow, so it does not form an organic whole.
- It is impossible to weigh these parameters sensibly by looking at them one by one, without understanding how they collaborate as a whole.
- The relationships between metrics, and between metrics and configuration parameters, are complex.

All of these relationships can be linked together through the request and response scheduling process.

For these reasons, I deduced the following flow from the documentation, the configuration parameters, and the metrics. Note: it has not yet been verified against the code, so please refer to it with caution.
Request and Response Scheduling#
Essentially, Envoy is a proxy. When talking about proxies, the first thing that comes to mind should be software/hardware with the following process:

1. receive a `Request` from the downstream
2. run some logic, modify the `Request` if necessary, and determine the upstream destination
3. forward the (modified) `Request` to the upstream
4. if the protocol is a `Request` & `Response` style protocol (e.g. HTTP), the proxy usually:
   1. receives a `Response` from the upstream
   2. runs some logic and modifies the `Response` if necessary
   3. forwards the `Response` to the downstream

Indeed, this is the outline of how Envoy proxies the HTTP protocol. But Envoy has to implement many more features:
- efficient downstream/upstream transfer ➡️ requires connection multiplexing and connection pooling
- flexible configuration of forwarding target service policies ➡️ requires `Router` configuration policies and implementation logic
- resilient micro-services ➡️ load balancing
- peak shaving of unexpected traffic bursts ➡️ request queuing: pending requests
- coping with abnormal upstreams and protecting services from avalanche ➡️ various timeout configurations, health checking, outlier detection, circuit breaking
- elastic retry ➡️ retry
- observability ➡️ ubiquitous performance metrics
- dynamic programmable configuration interface ➡️ xDS: EDS/LDS/…
To implement these features, the request and response process cannot be simple.
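To make the “pending request” queuing, circuit-breaking, and retry features above concrete, here are two configuration fragments deduced from Envoy’s documentation. They are a minimal sketch, not verified against a running Envoy; the cluster name `backend`, the route, and all numeric values are placeholder assumptions, not recommendations:

```yaml
# Fragment 1 (hypothetical cluster): circuit breaking bounds
# queuing and concurrency toward an upstream cluster.
clusters:
- name: backend                  # placeholder cluster name
  connect_timeout: 1s
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 1024      # cap on upstream connections
      max_pending_requests: 256  # cap on the pending-request queue
      max_requests: 1024         # cap on in-flight requests (HTTP/2)
      max_retries: 3             # cap on concurrently active retries
---
# Fragment 2 (hypothetical route): retry policy for elastic retry.
route:
  cluster: backend
  retry_policy:
    retry_on: "5xx,reset"        # retry on HTTP 5xx or connection reset
    num_retries: 2
    per_try_timeout: 2s
```

Note how the two interact: each retry attempted under the route’s `retry_policy` also consumes budget under the cluster’s `max_retries` circuit breaker, which is one concrete example of why these parameters cannot be tuned in isolation.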
Hint
The reader may wonder about the title of this section, “Request and Response Scheduling”. Does Envoy really need to schedule requests the way the Linux kernel schedules threads? Yes, you’ve hit the nail on the head.
Envoy applies the event-driven design pattern. Compared to a non-event-driven program, an event-driven program has fewer threads and more flexible control over which tasks to do when, i.e. more flexible scheduling logic. Even better: since little data is shared between threads, concurrency control between threads is greatly simplified at the same time.

In this context, the event types include at least:

- network readable, writable, and connection-closure events
- various types of timers
  - retry timers
  - timers for the various timeout configurations

Because an unbounded number of requests are assigned to a limited number of threads, and because requests may need to be retried, the threads must have logic to decide which requests should be processed first, and which requests should immediately return a failure because a timeout fired or resource usage exceeded a configured limit.
As is customary in this book, the diagram comes first, followed by a step-by-step expansion and explanation of it.
Hint
Interactive book:
It is recommended to open it with Draw.io. The diagrams contain a large number of links to the documentation descriptions of each component, configuration item, and indicator.
Reading with dual monitors, one for the diagrams and one for the text, is the recommended way to read this book. If you’re reading it on your phone, well, ignore me 🤦
Envoy request scheduling flow#
Let’s start with the request component flow. The flowchart below was reasoned out from the relevant documentation (not fully verified; partly deduction):
Request and Response Scheduling Timeline#
As mentioned at the beginning of this section, the immediate reason for writing it was the need to research fast recovery from Istio worker node failures. The premises of fast recovery are:

1. Quickly failing a request that has already been sent to, or is bound to, a failed upstream host.
2. Using outlier detection / health checking to identify the failed upstream host and remove it from the load-balancing list.

Both depend on one question: how do you define and detect that an upstream host has failed?

- Network partition, peer crash, or peer overload: in most cases, distributed systems can only detect such problems through timeouts. So, to quickly discover a failed upstream host or a failed request, the timeouts need to be configured appropriately.
- Peer responding with a Layer 7 failure (e.g. HTTP 500) or a lower-layer failure (e.g. TCP RST, no route to destination, ICMP error): these failures can be detected quickly.

For network partitions and crashed or overloaded peers, timeout-based discovery is required, and Envoy provides a rich set of timeout configurations. So rich, in fact, that sometimes it is hard to know which one is the right one to use, and it is easy to misconfigure, e.g. to set values whose relative lengths contradict the design of the implementation. So I tried to lay out the request and response scheduling timeline, and then look at which point on this timeline each timeout configuration relates to; then the whole logic becomes clear, and the configuration becomes easier to rationalize.
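As a sketch of premise 2 above: both active health checking and passive outlier detection are configured per cluster. The fragment below is deduced from the documentation and not verified; the cluster name `backend`, the probe path, and all values are placeholder assumptions:

```yaml
clusters:
- name: backend                    # placeholder cluster name
  connect_timeout: 1s
  # Passive detection: eject hosts that keep failing real traffic.
  outlier_detection:
    consecutive_5xx: 5             # eject after 5 consecutive 5xx responses
    interval: 10s                  # analysis sweep interval
    base_ejection_time: 30s        # ejected host stays out at least this long
    max_ejection_percent: 50       # never eject more than half the hosts
  # Active detection: probe hosts regardless of real traffic.
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 3         # consecutive failures before marking unhealthy
    healthy_threshold: 2           # consecutive successes before marking healthy
    http_health_check:
      path: /healthz               # placeholder probe path
```

The two mechanisms are complementary: active health checks can catch a dead host even when no traffic flows to it, while outlier detection reacts to failures that only show up under real load.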
The following diagram shows the request and response timeline, the associated timeout configurations and resulting metrics, and how they are related.
Briefly, the timeline:

1. downstream initiates a new connection (TCP handshake)
2. TLS handshake
3. Envoy receives the downstream request header & body
4. Envoy executes the Router rules to determine the upstream cluster of the next hop
5. Envoy executes the load-balancing algorithm to determine the upstream host within that cluster
6. Envoy initiates a new connection to the upstream host (TCP handshake)
7. Envoy initiates a TLS handshake with the upstream host
8. Envoy forwards the request header & body to the upstream host
9. Envoy receives the response header & body from the upstream host
10. the upstream connection starts to idle
11. Envoy forwards the response header & body to the downstream
12. the downstream connection starts to idle

If the downstream reuses a previous connection, steps 1 & 2 are skipped. If Envoy already has a free connection to the upstream host, steps 6 & 7 are skipped.
Accordingly, the timeout configurations are labeled against the timeline steps, in order from the start of the timeline:

- downstream `max_connection_duration`
- `transport_socket_connect_timeout`, with the metric `listener.downstream_cx_transport_socket_connect_timeout`
- `request_headers_timeout`
- `route.timeout`: Envoy’s `route.timeout` is Istio’s “Istio request timeout” (outbound). Note that this timeout covers the total time the request is being processed, including any retries. Related metrics: `cluster.upstream_rq_timeout` and `vhost.vcluster.upstream_rq_timeout`
- upstream `max_connection_duration`
- `connection_timeout`, with the metric `upstream_cx_connect_timeout`
- upstream `transport_socket_connect_timeout`
- `HttpProtocolOptions.idle_timeout`
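Putting the list above in one place, the following fragment sketches where each timeout lives in an Envoy v3 configuration, annotated with the timeline step it bounds. It is deduced from the documentation, not verified; `backend` and all durations are placeholder assumptions:

```yaml
static_resources:
  listeners:
  - name: ingress
    filter_chains:
    # Step 2: budget for the downstream TLS handshake; on expiry the metric
    # listener.downstream_cx_transport_socket_connect_timeout is incremented.
    - transport_socket_connect_timeout: 5s
      filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress
          # Step 3: how long Envoy waits for the complete downstream request headers.
          request_headers_timeout: 10s
          common_http_protocol_options:
            idle_timeout: 300s            # step 12: idle downstream connection
            max_connection_duration: 900s # hard cap on downstream connection lifetime
          route_config:
            name: local_route
            virtual_hosts:
            - name: default
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route:
                  cluster: backend
                  # Steps 4-11: total request time, *including* all retries;
                  # expiry increments cluster.upstream_rq_timeout.
                  timeout: 15s
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: backend
    # Step 6: budget for the upstream TCP handshake; expiry increments
    # the metric cluster.upstream_cx_connect_timeout.
    connect_timeout: 1s
```

Reading the configuration this way, each timeout guards one segment of the timeline, so sanity checks fall out naturally, e.g. a `route.timeout` shorter than the cluster `connect_timeout` contradicts the flow, since connecting is only one part of the total request budget.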
Summary#
If you want Envoy to perform as expected under stress and abnormal conditions, you need to configure it in a way that makes sense for your specific application and scenario. The prerequisite for configuring this set of parameters is insight into the processing flow and logic. Above, I walked through the request and response scheduling and the request and response scheduling timeline. I hope this helps in understanding these aspects.

This is not just about Envoy: for all middleware that does proxying, probably the most essential logic lies in this area. So don’t expect to absorb all of this knowledge at once. My aim here is simply to give the reader a thread through these processes, so that you can follow it to learn further without losing your way.