RHEL7.6: NFSv3 client hangs after 5 minute idle timer drops the TCP connection and a subsequent TCP 3-way handhake fails due to duplicate SYN or unexpected RST from the NFS client

Solution Verified - Updated

Environment

  • NFS client

    • Red Hat Enterprise Linux 7.6
    • seen on kernel-3.10.0-957.el7 and earlier than kernel-3.10.0-957.5.1.el7
  • NFS server

    • seen with Solaris 10 NFS server
    • seen with Windows Server 2008
  • NFSv3

Issue

  • On an NFSv3 client, a NFS share that was previously active goes idle for 5 minutes and the NFS client initiates a close of the TCP connection. After some time, a process accesses the NFS share and the NFS client attempts a 3-way TCP handshake to the NFS server. During the 3-way handshake, the NFS client issues either a duplicate SYN packet for the first part of the 3-way handshake, or issues an RST to the response to the second part (SYN,ACK from NFS server). As a result, the NFS client is unable to complete the TCP 3-way handshake, and the NFS share hangs.

Resolution

Red Hat Enterprise Linux 7
  • Follow the same resolution in <https://access.redhat.com/solutions/3742871>.
  • The issue was tracked with bugzilla 1661304: Bug 1661304 - RHEL7.6: NFSv3 client reconnect may produce double SYN packets and/or RST packets during 3-way TCP handshake after 10 min idle disconnnect / reconnect sequence (RHEL 7 7.7.0). As of Fri, December 21 2018, the status of 1661304 is CLOSED DUPLICATE of 1657399.

Workarounds

  1. Downgrade to a kernel prior to kernel-3.10.0-957.el7
  2. Use some approach to ensure the NFS (and lockd) TCP connections do not go idle and drop (Note that the idle timer will drop a connection after 5 minutes). One possibility is to use a variation of a script to monitor for hung NFS mount points as described in <https://access.redhat.com/solutions/97873>.
  3. Use NFSv4.1 or NFSv4.0 (not fully reliable). NOTE: This issue can still occur on NFSv4.0 or NFSv4.1, but it is believed to be harder to reproduce due to the fact that NFS4.x periodically issues either NFS4 RENEW or SEQUENCE operations to keep the TCP connection active. However, on NFS 4.x any NFS operation completing with an error code (i.e. NFS4ERR_BAD_STATEID) will cause the TCP connection to drop so this issue can occur.

Diagnostic Steps

  • Take a tcpdump at the time of the issue

Analysis of tcpdump: duplicate SYN from NFS client

  • On an NFSv3 client, a NFS share that was previously active goes idle for 5 minutes and the TCP connection drops
  127 2018-12-18 00:30:04.939674 10.0.0.1 → 10.0.0.2 NFS 182 V3 GETATTR Reply (Call In 126)  Directory mode: 0775 uid: 102 gid: 102
  128 2018-12-18 00:30:04.939923 10.0.0.2 → 10.0.0.1 NFS 194 V3 ACCESS Call, FH: 0x1b2ea40f, [Check: RD LU MD XT DL]
  129 2018-12-18 00:30:04.944543 10.0.0.1 → 10.0.0.2 NFS 190 V3 ACCESS Reply (Call In 128), [Allowed: RD LU MD XT DL]
  130 2018-12-18 00:30:04.944616 10.0.0.2 → 10.0.0.1 NFS 190 V3 GETATTR Call, FH: 0x25780a51
  131 2018-12-18 00:30:04.951819 10.0.0.1 → 10.0.0.2 NFS 182 V3 GETATTR Reply (Call In 130)  Directory mode: 0755 uid: 103 gid: 102
  132 2018-12-18 00:30:04.953812 10.0.0.2 → 10.0.0.1 NFS 190 V3 GETATTR Call, FH: 0x1b2ea40f
  133 2018-12-18 00:30:04.957942 10.0.0.1 → 10.0.0.2 NFS 182 V3 GETATTR Reply (Call In 132)  Directory mode: 0775 uid: 102 gid: 102
  134 2018-12-18 00:30:04.997732 10.0.0.2 → 10.0.0.1 TCP 66 965 → 2049 [ACK] Seq=1299167297 Ack=1431197553 Win=49536 Len=0 TSval=129660299 TSecr=1950438082
  135 2018-12-18 00:31:04.962742 10.0.0.2 → 10.0.0.1 TCP 66 [TCP Keep-Alive] 965 → 2049 [ACK] Seq=1299167296 Ack=1431197553 Win=49536 Len=0 TSval=129720264 TSecr=1950438082
  136 2018-12-18 00:31:04.962942 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Keep-Alive ACK] 2049 → 965 [ACK] Seq=1431197553 Ack=1299167297 Win=1049792 Len=0 TSval=1950444083 TSecr=129660299
  137 2018-12-18 00:32:05.178749 10.0.0.2 → 10.0.0.1 TCP 66 [TCP Keep-Alive] 965 → 2049 [ACK] Seq=1299167296 Ack=1431197553 Win=49536 Len=0 TSval=129780480 TSecr=1950444083
  138 2018-12-18 00:32:05.179535 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Keep-Alive ACK] 2049 → 965 [ACK] Seq=1431197553 Ack=1299167297 Win=1049792 Len=0 TSval=1950450104 TSecr=129660299
  139 2018-12-18 00:33:05.338758 10.0.0.2 → 10.0.0.1 TCP 66 [TCP Keep-Alive] 965 → 2049 [ACK] Seq=1299167296 Ack=1431197553 Win=49536 Len=0 TSval=129840640 TSecr=1950450104
  140 2018-12-18 00:33:05.339061 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Keep-Alive ACK] 2049 → 965 [ACK] Seq=1431197553 Ack=1299167297 Win=1049792 Len=0 TSval=1950456120 TSecr=129660299
  141 2018-12-18 00:34:05.498740 10.0.0.2 → 10.0.0.1 TCP 66 [TCP Keep-Alive] 965 → 2049 [ACK] Seq=1299167296 Ack=1431197553 Win=49536 Len=0 TSval=129900800 TSecr=1950456120
  142 2018-12-18 00:34:05.498968 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Keep-Alive ACK] 2049 → 965 [ACK] Seq=1431197553 Ack=1299167297 Win=1049792 Len=0 TSval=1950462135 TSecr=129660299
  143 2018-12-18 00:35:05.658753 10.0.0.2 → 10.0.0.1 TCP 66 [TCP Keep-Alive] 965 → 2049 [ACK] Seq=1299167296 Ack=1431197553 Win=49536 Len=0 TSval=129960960 TSecr=1950462135
  144 2018-12-18 00:35:05.658778 10.0.0.2 → 10.0.0.1 TCP 66 965 → 2049 [FIN, ACK] Seq=1299167297 Ack=1431197553 Win=49536 Len=0 TSval=129960960 TSecr=1950462135
  145 2018-12-18 00:35:05.659079 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Dup ACK 133#1] 2049 → 965 [ACK] Seq=1431197553 Ack=1299167297 Win=1049792 Len=0 TSval=1950468151 TSecr=129660299
  146 2018-12-18 00:35:05.659102 10.0.0.1 → 10.0.0.2 TCP 66 2049 → 965 [ACK] Seq=1431197553 Ack=1299167298 Win=1049792 Len=0 TSval=1950468151 TSecr=129960960
  147 2018-12-18 00:35:05.659362 10.0.0.1 → 10.0.0.2 TCP 66 2049 → 965 [FIN, ACK] Seq=1431197553 Ack=1299167298 Win=1049792 Len=0 TSval=1950468151 TSecr=129960960
  148 2018-12-18 00:35:05.659381 10.0.0.2 → 10.0.0.1 TCP 66 965 → 2049 [ACK] Seq=1299167298 Ack=1431197554 Win=49536 Len=0 TSval=129960960 TSecr=1950468151
  • After some time, some process on the NFS client becomes active and accesses the share, so a TCP reconnect is started. However, the NFS client sends a duplicate SYN packet with multiple Seq numbers. This seems to confuse the NFS server and the client and server cannot agree on the 3-way handshake to complete the TCP connection setup. As a result the NFS share is hung.
  149 2018-12-18 00:45:01.281567 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Port numbers reused] 965 → 2049 [SYN] Seq=2466437126 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130556582 TSecr=0 WS=128
  150 2018-12-18 00:45:01.286665 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Port numbers reused] 965 → 2049 [SYN] Seq=2466516951 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130556587 TSecr=0 WS=128
  151 2018-12-18 00:45:01.287275 10.0.0.1 → 10.0.0.2 TCP 78 2049 → 965 [SYN, ACK] Seq=1703943566 Ack=2466437127 Win=32806 Len=0 TSval=1950527710 TSecr=130556582 MSS=1460 WS=32 SACK_PERM=1
  152 2018-12-18 00:45:01.291245 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Dup ACK 151#1] 2049 → 965 [ACK] Seq=1703943567 Ack=2466437127 Win=1049792 Len=0 TSval=1950527710 TSecr=130556582
  153 2018-12-18 00:45:01.291272 10.0.0.2 → 10.0.0.1 TCP 54 965 → 2049 [RST] Seq=2466437127 Win=0 Len=0
  154 2018-12-18 00:45:02.286741 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Retransmission] 965 → 2049 [SYN] Seq=2466516951 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130557588 TSecr=0 WS=128
  155 2018-12-18 00:45:02.287102 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Previous segment not captured] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1704526816 Ack=2466516952 Win=32806 Len=0 TSval=1950527810 TSecr=130557588 MSS=1460 WS=32 SACK_PERM=1
  156 2018-12-18 00:45:02.703059 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Retransmission] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1704526816 Ack=2466516952 Win=32806 Len=0 TSval=1950527852 TSecr=130557588 MSS=1460 WS=32 SACK_PERM=1
  157 2018-12-18 00:45:03.523085 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Retransmission] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1704526816 Ack=2466516952 Win=32806 Len=0 TSval=1950527934 TSecr=130557588 MSS=1460 WS=32 SACK_PERM=1
  158 2018-12-18 00:45:04.290737 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Retransmission] 965 → 2049 [SYN] Seq=2466516951 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130559592 TSecr=0 WS=128
  159 2018-12-18 00:45:04.290913 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Dup ACK 155#1] 2049 → 965 [ACK] Seq=1704526817 Ack=2466516952 Win=1049792 Len=0 TSval=1950528010 TSecr=130557588
  160 2018-12-18 00:45:05.143198 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Retransmission] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1704526816 Ack=2466516952 Win=32806 Len=0 TSval=1950528096 TSecr=130557588 MSS=1460 WS=32 SACK_PERM=1
  161 2018-12-18 00:45:08.298777 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Retransmission] 965 → 2049 [SYN] Seq=2466516951 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130563600 TSecr=0 WS=128
  162 2018-12-18 00:45:08.299263 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Dup ACK 155#2] 2049 → 965 [ACK] Seq=1704526817 Ack=2466516952 Win=1049792 Len=0 TSval=1950528411 TSecr=130557588
  163 2018-12-18 00:45:08.373383 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Retransmission] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1704526816 Ack=2466516952 Win=32806 Len=0 TSval=1950528419 TSecr=130557588 MSS=1460 WS=32 SACK_PERM=1
  164 2018-12-18 00:45:14.825157 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Retransmission] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1704526816 Ack=2466516952 Win=32806 Len=0 TSval=1950529064 TSecr=130557588 MSS=1460 WS=32 SACK_PERM=1
  165 2018-12-18 00:45:16.314750 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Retransmission] 965 → 2049 [SYN] Seq=2466516951 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130571616 TSecr=0 WS=128
  166 2018-12-18 00:45:16.315012 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Dup ACK 155#3] 2049 → 965 [ACK] Seq=1704526817 Ack=2466516952 Win=1049792 Len=0 TSval=1950529213 TSecr=130557588
  167 2018-12-18 00:45:27.704614 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Retransmission] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1704526816 Ack=2466516952 Win=32806 Len=0 TSval=1950530352 TSecr=130557588 MSS=1460 WS=32 SACK_PERM=1
  168 2018-12-18 00:45:32.346762 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Retransmission] 965 → 2049 [SYN] Seq=2466516951 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130587648 TSecr=0 WS=128
  169 2018-12-18 00:45:32.347218 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Dup ACK 155#4] 2049 → 965 [ACK] Seq=1704526817 Ack=2466516952 Win=1049792 Len=0 TSval=1950530816 TSecr=130557588
  170 2018-12-18 00:45:53.446268 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Retransmission] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1704526816 Ack=2466516952 Win=32806 Len=0 TSval=1950532926 TSecr=130557588 MSS=1460 WS=32 SACK_PERM=1
  171 2018-12-18 00:46:04.410744 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Retransmission] 965 → 2049 [SYN] Seq=2466516951 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130619712 TSecr=0 WS=128
  172 2018-12-18 00:46:04.410959 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Dup ACK 155#5] 2049 → 965 [ACK] Seq=1704526817 Ack=2466516952 Win=1049792 Len=0 TSval=1950534022 TSecr=130557588
  173 2018-12-18 00:46:44.919491 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Retransmission] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1704526816 Ack=2466516952 Win=32806 Len=0 TSval=1950538073 TSecr=130557588 MSS=1460 WS=32 SACK_PERM=1
  174 2018-12-18 00:47:08.602830 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Port numbers reused] 965 → 2049 [SYN] Seq=160864716 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130683904 TSecr=0 WS=128
  175 2018-12-18 00:47:08.603115 10.0.0.1 → 10.0.0.2 TCP 66 2049 → 965 [ACK] Seq=1704526817 Ack=2466516952 Win=32806 Len=0 TSval=1950540441 TSecr=130557588
  176 2018-12-18 00:47:08.603157 10.0.0.2 → 10.0.0.1 TCP 54 965 → 2049 [RST] Seq=2466516952 Win=0 Len=0
  177 2018-12-18 00:47:09.604738 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Retransmission] 965 → 2049 [SYN] Seq=160864716 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130684906 TSecr=0 WS=128
  178 2018-12-18 00:47:09.604948 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Previous segment not captured] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1744055023 Ack=160864717 Win=32806 Len=0 TSval=1950540541 TSecr=130684906 MSS=1460 WS=32 SACK_PERM=1
  179 2018-12-18 00:47:10.741140 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Retransmission] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1744055023 Ack=160864717 Win=32806 Len=0 TSval=1950540655 TSecr=130684906 MSS=1460 WS=32 SACK_PERM=1
  180 2018-12-18 00:47:11.610762 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Retransmission] 965 → 2049 [SYN] Seq=160864716 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130686912 TSecr=0 WS=128
  181 2018-12-18 00:47:11.610994 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Dup ACK 178#1] 2049 → 965 [ACK] Seq=1744055024 Ack=160864717 Win=1049792 Len=0 TSval=1950540741 TSecr=130684906
  182 2018-12-18 00:47:13.011265 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Retransmission] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1744055023 Ack=160864717 Win=32806 Len=0 TSval=1950540882 TSecr=130684906 MSS=1460 WS=32 SACK_PERM=1
  183 2018-12-18 00:47:15.618744 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Retransmission] 965 → 2049 [SYN] Seq=160864716 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130690920 TSecr=0 WS=128
  184 2018-12-18 00:47:15.618957 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Dup ACK 178#2] 2049 → 965 [ACK] Seq=1744055024 Ack=160864717 Win=1049792 Len=0 TSval=1950541142 TSecr=130684906
  185 2018-12-18 00:47:17.531577 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Retransmission] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1744055023 Ack=160864717 Win=32806 Len=0 TSval=1950541334 TSecr=130684906 MSS=1460 WS=32 SACK_PERM=1
  186 2018-12-18 00:47:23.642740 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Retransmission] 965 → 2049 [SYN] Seq=160864716 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130698944 TSecr=0 WS=128
  187 2018-12-18 00:47:23.643067 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Dup ACK 178#3] 2049 → 965 [ACK] Seq=1744055024 Ack=160864717 Win=1049792 Len=0 TSval=1950541945 TSecr=130684906
  188 2018-12-18 00:47:26.553920 10.0.0.1 → 10.0.0.2 TCP 78 [TCP Retransmission] [TCP Port numbers reused] 2049 → 965 [SYN, ACK] Seq=1744055023 Ack=160864717 Win=32806 Len=0 TSval=1950542236 TSecr=130684906 MSS=1460 WS=32 SACK_PERM=1
  189 2018-12-18 00:47:39.674750 10.0.0.2 → 10.0.0.1 TCP 74 [TCP Retransmission] 965 → 2049 [SYN] Seq=160864716 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=130714976 TSecr=0 WS=128
  190 2018-12-18 00:47:39.675022 10.0.0.1 → 10.0.0.2 TCP 66 [TCP Dup ACK 178#4] 2049 → 965 [ACK] Seq=17440550
Components
Category
Tags

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.