App outbound connecting timed out because App selected a ephemeral port which collisions with the existing socket on 15001(outbound) listener#

Sidecar intercept and TCP proxying all outbound TCP connection by default: (app --[conntrack DNAT]--> sidecar) -----> upstream-tcp-service

But in some scenarios, App just get a connect timed out error when connecting to the sidecar 15001(outbound) listener.

Scenarios:

  1. When sidecar has a half-open connection to App. e.g:

    $ ss
    tcp FIN-WAIT-2 0 0 127.0.0.1:15001  172.29.73.7:44410(POD_IP:ephemeral_port)
    

    This can happen, eg: TCP Proxy half-closed connection leak for 1 hour in some scenarios #43297

    There is no track entry in conntrack table because nf_conntrack_tcp_timeout_close_wait time out and expired.

  2. App invoke syscall connect(sockfd, peer_addr) , kernel allocation a ephemeral port(44410 in this case) , bind the new socket to that ephemeral port and sent SYN packet to peer.

  3. SYN packet reach conntrack and it create a track entry in conntrack table:

$ conntrack -L
tcp  6 108 SYN_SENT src=172.29.73.7 dst=172.21.206.198 sport=44410 dport=7777 src=127.0.0.1 dst=172.29.73.7 sport=15001 dport=44410
  1. SYN packet DNAT to 127.0.0.1:15001

  2. SYN packet reach the already existing FIN-WAIT-2 127.0.0.1:15001  172.29.73.7:44410 socket, then sidecar reply a TCP Challenge ACK (TCP seq-no is from the old FIN-WAIT-2) packet to App

  3. App reply the TCP Challenge ACK with a RST(TCP seq-no is from the TCP Challenge ACK)

  4. Conntrack get the RST packet and check it. In some kernel version, conntrack just invalid the RST packet because the seq-no is out of the track entry in conntrack table which created in step 3.

  5. App will retransmit SYN but all without an expected SYN/ACK reply. Connect timed out will happen on App user space.

Different kernel version may have different packet validate rule in step 7:

RST packet mark as invalid:
  SUSE Linux Enterprise Server 15 SP4:
    5.14.21-150400.24.21-default

# cat /proc/sys/net/netfilter/nf_conntrack_tcp_ignore_invalid_rst
0

####################
    
RST packet passed check and NATed:
  Ubuntu 20.04.2:
    5.4.0-137-generic
    
# cat /proc/sys/net/netfilter/nf_conntrack_tcp_ignore_invalid_rst
cat: /proc/sys/net/netfilter/nf_conntrack_tcp_ignore_invalid_rst: No such file or directory    

It seems related to kernel patch: Add tcp_ignore_invalid_rst sysctl to allow to disable out of segment RSTs which merge to kernel after kernel v5.14

Good news is that, someone will fix the problem at kernel v6.2-rc7: netfilter: conntrack: handle tcp challenge acks during connection reuse:

When a connection is re-used, following can happen:
[ connection starts to close, fin sent in either direction ]
 > syn   # initator quickly reuses connection
 < ack   # peer sends a challenge ack
 > rst   # rst, sequence number == ack_seq of previous challenge ack
 > syn   # this syn is expected to pass

Problem is that the rst will fail window validation, so it gets
tagged as invalid.

If ruleset drops such packets, we get repeated syn-retransmits until
initator gives up or peer starts responding with syn/ack.

But in some scenarios and kernel version, it will be an issue anyway.

Github issue: App outbound connecting timed out because sidecar select ephemeral port collisions with socket on 15001(outbound) listener #43301

Base knowledge#

TCP#

TCP Challenge ACK#

Conntrack table#

  • list conntrack table command and output explain:

    conntrack -L 
    
    tcp  6 54(entry expire seconds) CLOSE_WAIT(entry state) 
    (local endpoint):src=172.29.73.7 dst=172.21.206.198 sport=44410 dport=7777 
    (remote endpoint)src=127.0.0.1 dst=172.29.73.7 sport=15001 dport=44410 [ASSURED] mark=0 use=1
    

Conntrack by default will not DNAT any invalid(Not in tcp connection window) tcp RST packet.

Environment#

$ ./istioctl version
client version: 1.16.2
control plane version: 1.16.2
data plane version: 1.16.2 (4 proxies)

$ kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.25.3
Kustomize Version: v4.5.7
Server Version: v1.25.6

CNI: Calico

Testing POD#

IP: 172.29.73.7
POD Name: fortio-server-l2-0
pod running at worker node: sle

Worker Node:
Kernel: 5.14.21-150400.24.21-default
Linux distros: SUSE Linux Enterprise Server 15 SP4

TCP service pod#

IP: 172.21.206.198
POD Name: netshoot-0
# listen on port 7777

kubectl -it exec netshoot-0 -- nc -l -k 7777

enable conntrack’s invalid packet log on Testing POD#

enable nf_log_all_netns on worker node:

sle:/proc/sys # sudo su
sle:/proc/sys # echo 1 > /proc/sys/net/netfilter/nf_log_all_netns

add iptables rule on container:

fortio-server-l2-0:/ $

iptables -t mangle -N ISTIO_INBOUND_LOG
iptables -t mangle -A ISTIO_INBOUND_LOG -m conntrack --ctstate INVALID -j LOG --log-level debug --log-prefix 'ISTIO_INBOUND_INVALID: '
iptables -t mangle -A PREROUTING -p tcp -j ISTIO_INBOUND_LOG 

iptables -t mangle -N ISTIO_OUTBOUND_LOG
iptables -t mangle -A ISTIO_OUTBOUND_LOG -m conntrack --ctstate INVALID -j LOG --log-level debug --log-prefix 'ISTIO_OUTBOUND_INVALID: '
iptables -t mangle -A OUTPUT -p tcp -j ISTIO_OUTBOUND_LOG 

New connection timeout#

App build connection on the same ephemeral port#

  • Make a connection by specified ephemeral port. e.g:

    nc -p 44410 172.21.206.198 7777
    
  • List sockets with timer(state/retry timeout) info. e.g:

    ss -naoep
    
    tcp   SYN-SENT    0  1  172.29.73.7:44410    172.21.206.198:7777         users:(("nc",pid=144426,fd=3)) timer:(on,1.684ms(timeout),2(retry counter)) ino:2024378629 sk:aa4e2 <->
    

Case 1: connect timed out by collision ephemeral port#

用 Draw.io 打开

fortio-server-l2-0:/ # nc -p 44410 172.21.206.198 7777 #pid=144426

fortio-server-l2-0:/ # ss -naoep | egrep '44410|7777'

tcp   SYN-SENT    0  1  172.29.73.7:44410    172.21.206.198:7777         users:(("nc",pid=144426,fd=3)) timer:(on,1.684ms,2) ino:2024378629 sk:aa4e2 <->
tcp   FIN-WAIT-2  0  0     127.0.0.1:15001  172.29.73.7:44410        users:(("envoy",pid=51435,fd=121)) ino:2023763691 sk:aa125 <--

#tcp   CLOSE-WAIT  0  0  172.29.73.7:38076    172.21.206.198:7777         users:(("envoy",pid=51435,fd=135)) ino:2023763692 sk:aa123 -->
#tcp   CLOSE-WAIT  0  0  172.29.73.7:38078    172.21.206.198:7777         users:(("nc",pid=88235,fd=3)) ino:2024198076 sk:aa3d7 -->

fortio-server-l2-0:/ # conntrack -L 2>&1 | egrep "192.168.7.|10.96.94.44|172.21.206.198"

tcp  6 108 SYN_SENT src=172.29.73.7 dst=172.21.206.198 sport=44410 dport=7777 src=127.0.0.1 dst=172.29.73.7 sport=15001 dport=44410 mark=0 use=1


fortio-server-l2-0:/ # nstat -zas | egrep -i 'retr|cha'

TcpRetransSegs        165296        0.0
TcpExtTCPLostRetransmit    842      0.0
TcpExtTCPChallengeACK      140      0.0
TcpExtTCPSYNChallenge      140      0.0
TcpExtTCPSynRetrans        295      0.0


sle:/ $ dmesg --follow-new -T -l debug

[Sat Feb  4 16:46:40 2023] ISTIO_OUTBOUND_INVALID: IN= OUT=eth0 SRC=172.29.73.7 DST=172.21.206.198 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=44410 DPT=7777 WINDOW=0 RES=0x00 RST URGP=0 
[Sat Feb  4 16:46:41 2023] ISTIO_OUTBOUND_INVALID: IN= OUT=eth0 SRC=172.29.73.7 DST=172.21.206.198 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=44410 DPT=7777 WINDOW=0 RES=0x00 RST URGP=0 
[Sat Feb  4 16:46:43 2023] ISTIO_OUTBOUND_INVALID: IN= OUT=eth0 SRC=172.29.73.7 DST=172.21.206.198 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=44410 DPT=7777 WINDOW=0 RES=0x00 RST URGP=0 
[Sat Feb  4 16:46:47 2023] ISTIO_OUTBOUND_INVALID: IN= OUT=eth0 SRC=172.29.73.7 DST=172.21.206.198 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=44410 DPT=7777 WINDOW=0 RES=0x00 RST URGP=0 
[Sat Feb  4 16:46:55 2023] ISTIO_OUTBOUND_INVALID: IN= OUT=eth0 SRC=172.29.73.7 DST=172.21.206.198 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=44410 DPT=7777 WINDOW=0 RES=0x00 RST URGP=0 
[Sat Feb  4 16:47:11 2023] ISTIO_OUTBOUND_INVALID: IN= OUT=eth0 SRC=172.29.73.7 DST=172.21.206.198 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=44410 DPT=7777 WINDOW=0 RES=0x00 RST URGP=0 
[Sat Feb  4 16:47:43 2023] ISTIO_OUTBOUND_INVALID: IN= OUT=eth0 SRC=172.29.73.7 DST=172.21.206.198 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=44410 DPT=7777 WINDOW=0 RES=0x00 RST URGP=0 




fortio-server-l2-0:/ # sudo tcpdump -i lo -n -v "port 44410"
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
16:47:16.343525 IP (tos 0x0, ttl 64, id 20177, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.73.7.44410 > 127.0.0.1.15001: Flags [S], cksum 0x43fd (incorrect -> 0x3d28), seq 66001914, win 62720, options [mss 8960,sackOK,TS val 1250779694 ecr 0,nop,wscale 11], length 0

# TcpExtTCPChallengeACK: (from sidecar)
16:47:16.343547 IP (tos 0x0, ttl 64, id 40356, offset 0, flags [DF], proto TCP (6), length 52)
    172.21.206.198.7777 > 172.29.73.7.44410: Flags [.], cksum 0x7c2c (incorrect -> 0x0bf4), ack 1382665639, win 32, options [nop,nop,TS val 2150755275 ecr 1250324922], length 0

16:47:17.356027 IP (tos 0x0, ttl 64, id 20178, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.73.7.44410 > 127.0.0.1.15001: Flags [S], cksum 0x43fd (incorrect -> 0x3934), seq 66001914, win 62720, options [mss 8960,sackOK,TS val 1250780706 ecr 0,nop,wscale 11], length 0
16:47:17.356051 IP (tos 0x0, ttl 64, id 40357, offset 0, flags [DF], proto TCP (6), length 52)
    172.21.206.198.7777 > 172.29.73.7.44410: Flags [.], cksum 0x7c2c (incorrect -> 0x0800), ack 1, win 32, options [nop,nop,TS val 2150756287 ecr 1250324922], length 0
16:47:19.372023 IP (tos 0x0, ttl 64, id 20179, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.73.7.44410 > 127.0.0.1.15001: Flags [S], cksum 0x43fd (incorrect -> 0x3154), seq 66001914, win 62720, options [mss 8960,sackOK,TS val 1250782722 ecr 0,nop,wscale 11], length 0
16:47:19.372063 IP (tos 0x0, ttl 64, id 40358, offset 0, flags [DF], proto TCP (6), length 52)
    172.21.206.198.7777 > 172.29.73.7.44410: Flags [.], cksum 0x7c2c (incorrect -> 0x0020), ack 1, win 32, options [nop,nop,TS val 2150758303 ecr 1250324922], length 0
16:47:23.440028 IP (tos 0x0, ttl 64, id 20180, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.73.7.44410 > 127.0.0.1.15001: Flags [S], cksum 0x43fd (incorrect -> 0x2170), seq 66001914, win 62720, options [mss 8960,sackOK,TS val 1250786790 ecr 0,nop,wscale 11], length 0
16:47:23.440068 IP (tos 0x0, ttl 64, id 40359, offset 0, flags [DF], proto TCP (6), length 52)
    172.21.206.198.7777 > 172.29.73.7.44410: Flags [.], cksum 0x7c2c (incorrect -> 0xf03a), ack 1, win 32, options [nop,nop,TS val 2150762372 ecr 1250324922], length 0
16:47:31.628034 IP (tos 0x0, ttl 64, id 20181, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.73.7.44410 > 127.0.0.1.15001: Flags [S], cksum 0x43fd (incorrect -> 0x0174), seq 66001914, win 62720, options [mss 8960,sackOK,TS val 1250794978 ecr 0,nop,wscale 11], length 0
16:47:31.628066 IP (tos 0x0, ttl 64, id 40360, offset 0, flags [DF], proto TCP (6), length 52)
    172.21.206.198.7777 > 172.29.73.7.44410: Flags [.], cksum 0x7c2c (incorrect -> 0xd03e), ack 1, win 32, options [nop,nop,TS val 2150770560 ecr 1250324922], length 0
16:47:47.756038 IP (tos 0x0, ttl 64, id 20182, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.73.7.44410 > 127.0.0.1.15001: Flags [S], cksum 0x43fd (incorrect -> 0xc273), seq 66001914, win 62720, options [mss 8960,sackOK,TS val 1250811106 ecr 0,nop,wscale 11], length 0
16:47:47.756096 IP (tos 0x0, ttl 64, id 40361, offset 0, flags [DF], proto TCP (6), length 52)
    172.21.206.198.7777 > 172.29.73.7.44410: Flags [.], cksum 0x7c2c (incorrect -> 0x913e), ack 1, win 32, options [nop,nop,TS val 2150786688 ecr 1250324922], length 0
16:48:20.016036 IP (tos 0x0, ttl 64, id 20183, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.73.7.44410 > 127.0.0.1.15001: Flags [S], cksum 0x43fd (incorrect -> 0x446f), seq 66001914, win 62720, options [mss 8960,sackOK,TS val 1250843366 ecr 0,nop,wscale 11], length 0
16:48:20.016066 IP (tos 0x0, ttl 64, id 40362, offset 0, flags [DF], proto TCP (6), length 52)
    172.21.206.198.7777 > 172.29.73.7.44410: Flags [.], cksum 0x7c2c (incorrect -> 0x133a), ack 1, win 32, options [nop,nop,TS val 2150818948 ecr 1250324922], length 0



fortio-server-l2-0:/ # sudo tcpdump -i eth0 -n -v "port 44410"
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
# RST from 172.29.73.7.44410 (Conntrack Invalid)
16:47:16.343564 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    172.29.73.7.44410 > 172.21.206.198.7777: Flags [R], cksum 0x47ee (correct), seq 1382665639, win 0, length 0
16:47:17.356068 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    172.29.73.7.44410 > 172.21.206.198.7777: Flags [R], cksum 0x47ee (correct), seq 1382665639, win 0, length 0
16:47:19.372085 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    172.29.73.7.44410 > 172.21.206.198.7777: Flags [R], cksum 0x47ee (correct), seq 1382665639, win 0, length 0
16:47:23.440094 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    172.29.73.7.44410 > 172.21.206.198.7777: Flags [R], cksum 0x47ee (correct), seq 1382665639, win 0, length 0
16:47:31.628082 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    172.29.73.7.44410 > 172.21.206.198.7777: Flags [R], cksum 0x47ee (correct), seq 1382665639, win 0, length 0
16:47:47.756113 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    172.29.73.7.44410 > 172.21.206.198.7777: Flags [R], cksum 0x47ee (correct), seq 1382665639, win 0, length 0
16:48:20.016082 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    172.29.73.7.44410 > 172.21.206.198.7777: Flags [R], cksum 0x47ee (correct), seq 1382665639, win 0, length 0

Case 2: connect by collision ephemeral port but seq-no happens to be in the TCP window of the old connection#

When connect by collision ephemeral port but seq-no happens to be in the TCP window of the old connection(FIN-WAIT-2 state). Connect successed.

fortio-server-l2-0:/home/eccd # sudo tcpdump -i lo -n -v "port 44410"
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
# SYN retrans
16:23:32.466286 IP (tos 0x0, ttl 64, id 6422, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.73.7.44410 > 127.0.0.1.15001: Flags [S], cksum 0x43fd (incorrect -> 0xef14), seq 3587723838, win 62720, options [mss 8960,sackOK,TS val 1249355817 ecr 0,nop,wscale 11], length 0
# TcpExtTCPChallengeACK: (from sidecar)
16:23:32.466309 IP (tos 0x0, ttl 64, id 19983, offset 0, flags [DF], proto TCP (6), length 52)
    172.21.206.198.7777 > 172.29.73.7.44410: Flags [.], cksum 0x7c2c (incorrect -> 0x176f), ack 2082340714, win 32, options [nop,nop,TS val 2149331398 ecr 1248029014], length 0
# RST from 172.29.73.7.44410 (Conntrack Valid)
16:23:32.466320 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    172.29.73.7.44410 > 127.0.0.1.15001: Flags [R], cksum 0x0876 (correct), seq 2082340714, win 0, length 0
# SYN retrans
16:23:33.484037 IP (tos 0x0, ttl 64, id 6423, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.73.7.44410 > 127.0.0.1.15001: Flags [S], cksum 0x43fd (incorrect -> 0xeb1b), seq 3587723838, win 62720, options [mss 8960,sackOK,TS val 1249356834 ecr 0,nop,wscale 11], length 0
# syn/ack from   172.21.206.198.7777(sidecar)
16:23:33.484060 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.21.206.198.7777 > 172.29.73.7.44410: Flags [S.], cksum 0x7c34 (incorrect -> 0xa6d3), seq 3731074132, ack 3587723839, win 65483, options [mss 65495,sackOK,TS val 2149332416 ecr 1249356834,nop,wscale 11], length 0
# ack from 172.29.73.7.44410    
16:23:33.484072 IP (tos 0x0, ttl 64, id 6424, offset 0, flags [DF], proto TCP (6), length 52)
    172.29.73.7.44410 > 127.0.0.1.15001: Flags [.], cksum 0x43f5 (incorrect -> 0xeb72), ack 3731074133, win 31, options [nop,nop,TS val 1249356835 ecr 2149332416], length 0
# fin from 172.21.206.198.7777(sidecar, because upstream pod(172.21.206.198) not listen on 7777)
16:23:33.484498 IP (tos 0x0, ttl 64, id 49188, offset 0, flags [DF], proto TCP (6), length 52)
    172.21.206.198.7777 > 172.29.73.7.44410: Flags [F.], cksum 0x7c2c (incorrect -> 0xcf71), seq 1, ack 1, win 32, options [nop,nop,TS val 2149332416 ecr 1249356835], length 0
16:23:33.488011 IP (tos 0x0, ttl 64, id 6425, offset 0, flags [DF], proto TCP (6), length 52)
    172.29.73.7.44410 > 127.0.0.1.15001: Flags [.], cksum 0x43f5 (incorrect -> 0xeb6e), ack 2, win 31, options [nop,nop,TS val 1249356838 ecr 2149332416], length 0

Skills#

ssh $WORKER_NODE_OF_POD

export POD="fortio-server-0"
ENVOY_PIDS=$(pgrep envoy)
while IFS= read -r ENVOY_PID; do
    HN=$(sudo nsenter -u -t $ENVOY_PID hostname)
    if [[ "$HN" = "$POD" ]]; then # space between = is important
        sudo nsenter -u -t $ENVOY_PID hostname
        export POD_PID=$ENVOY_PID
    fi
done <<< "$ENVOY_PIDS"
echo $POD_PID
export PID=$POD_PID

sudo nsenter -t $PID -n -u
ssh $WORKER_NODE_OF_POD

export POD="fortio-server-l2-0"
ENVOY_PIDS=$(pgrep envoy)
while IFS= read -r ENVOY_PID; do
    HN=$(sudo nsenter -u -t $ENVOY_PID hostname)
    if [[ "$HN" = "$POD" ]]; then # space between = is important
        sudo nsenter -u -t $ENVOY_PID hostname
        export POD_PID=$ENVOY_PID
    fi
done <<< "$ENVOY_PIDS"
echo $POD_PID
export PID=$POD_PID

sudo nsenter -t $PID -n -u