The NFS connectivity between a RHEL 6 NFS client and a RHEL 6 NFS server is not stable
Environment
- Red Hat Enterprise Linux 6.3
Issue
-
On a two node production RHEL 6.3 Oracle RAC system, which has two NFS shares mounted from another RHEL 6.3 system, Everything works fine for a few days. Suddenly one of the nodes will stop being able to access the NFS mounts while the other is working fine. The system can reach the NFS server, but cannot reach it through nfs. This results in the database hanging, since these mounts are used by oracle rac.
-
RHEL NFS client syslog contains:
Aug 19 10:50:17 hostname1 kernel: nfs: server NFS-SERVER-HOSTNAME not responding, still trying
Resolution
The customer changed the portchannel configuration on the switch and the problem went away and the client regained access immediately:
port-channel load-balance src-dst-ip
(http://www.cisco.com/en/US/docs/switches/lan/catalyst4500/12.2/20ewa/command/reference/int_sess.html#wp1977735)
Diagnostic Steps
- the NFS traffic seems to work fine until this point:
[...]
182 2012-08-20 04:58:44.362593 x.x.x.x 944 x.x.x.a 2049 NFS 338 V4 Call OPEN DH:0xb2e7ab25/.bash_history
183 2012-08-20 04:58:44.362754 x.x.x.a 2049 x.x.x.x 944 NFS 494 V4 Reply (Call In 182) OPEN StateID:0x6dee
184 2012-08-20 04:58:44.362832 x.x.x.x 944 x.x.x.a 2049 NFS 246 V4 Call OPEN_CONFIRM
185 2012-08-20 04:58:44.362949 x.x.x.a 2049 x.x.x.x 944 NFS 138 V4 Reply (Call In 184) OPEN_CONFIRM
186 2012-08-20 04:58:44.363007 x.x.x.x 944 x.x.x.a 2049 NFS 262 V4 Call CLOSE StateID:0x6843
187 2012-08-20 04:58:44.363179 x.x.x.a 2049 x.x.x.x 944 NFS 202 V4 Reply (Call In 186) CLOSE
188 2012-08-20 04:58:44.363576 x.x.x.x 944 x.x.x.a 2049 NFS 258 V4 Call LOOKUP DH:0xb2e7ab25/.inputrc
189 2012-08-20 04:58:44.363711 x.x.x.a 2049 x.x.x.x 944 NFS 122 V4 Reply (Call In 188) LOOKUP Status: NFS4ERR_NOENT
190 2012-08-20 04:58:44.403594 x.x.x.x 944 x.x.x.a 2049 TCP 66 944 > 2049 [ACK] Seq=8829 Ack=8449 Win=41728 Len=0 TSval=10411784 TSecr=350966614
- At this point we see a lot of external traffic:
- LACP starts
191 2012-08-20 04:58:46.262326 aa:bb:cc:dd:ee:ff aa:bb:cc:dd:ee:aa LACP 124 Link Aggregation Control Protocol
- STP starts as well:
192 2012-08-20 04:58:46.324501 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
195 2012-08-20 04:58:46.516037 x.x.x.x 944 x.x.x.a 2049 TCP 66 944 > 2049 [ACK] Seq=9097 Ack=8513 Win=41728 Len=0 TSval=10413896 TSecr=350968767
196 2012-08-20 04:58:48.331189 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
197 2012-08-20 04:58:49.177091 aa:bb:cc:dd:ee:ff aa:bb:cc:dd:ee:ff LOOP 60 Reply
198 2012-08-20 04:58:49.662349 aa:bb:cc:dd:ee:ee aa:bb:cc:dd:ee:ee LOOP 60 Reply
199 2012-08-20 04:58:50.337944 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
200 2012-08-20 04:58:52.352655 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
201 2012-08-20 04:58:54.368129 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
202 2012-08-20 04:58:56.369075 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
- In the meantime the NFS seems to work:
193 2012-08-20 04:58:46.515824 x.x.x.x 944 x.x.x.a 2049 NFS 334 V4 Call OPEN DH:0xb2e7ab25/.bash_logout
194 2012-08-20 04:58:46.516015 x.x.x.a 2049 x.x.x.x 944 NFS 130 V4 Reply (Call In 193) OPEN Status: NFS4ERR_NOENT
- Then we have more STP and LACP packets:
204 2012-08-20 04:58:58.373267 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
205 2012-08-20 04:58:59.179768 aa:bb:cc:dd:ee:ff aa:bb:cc:dd:ee:ff LOOP 60 Reply
207 2012-08-20 04:58:59.717152 aa:bb:cc:dd:ee:ee aa:bb:cc:dd:ee:ee LOOP 60 Reply
208 2012-08-20 04:59:00.377180 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
210 2012-08-20 04:59:02.296559 aa:bb:cc:dd:ee:ee aa:bb:cc:dd:ee:aa LACP 124 Link Aggregation Control Protocol
211 2012-08-20 04:59:02.392044 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
212 2012-08-20 04:59:04.400747 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
214 2012-08-20 04:59:06.403697 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
215 2012-08-20 04:59:08.407718 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
216 2012-08-20 04:59:09.187310 aa:bb:cc:dd:ee:ff aa:bb:cc:dd:ee:ff LOOP 60 Reply
217 2012-08-20 04:59:09.720363 aa:bb:cc:dd:ee:ee aa:bb:cc:dd:ee:ee LOOP 60 Reply
- ARP messages and MDNS (multicast DNS):
218 2012-08-20 04:59:10.295514 aa:bb:cc:dd:aa:ff aa:bb:cc:dd:bb:ff ARP 42 Who has x.x.x.a? Tell x.x.x.x
219 2012-08-20 04:59:10.295600 aa:bb:cc:dd:bb:ff aa:bb:cc:dd:aa:ff ARP 60 x.x.x.a is at aa:bb:cc:dd:bb:ff
223 2012-08-20 04:59:13.243786 x.x.x.c 5353 x.x.x.d 5353 MDNS 334 Standard query SRV CLSFRAME_257_1345085410 c:ac3233dc1c6dcf6dff1e2cb3c69177e4._CLSFRAME._tcp.local, "QM" question SRV CLSFRAME_861_1345426544 c:7ae0c86711e96f27bfbd094484af6bba._CLSFRAME._tcp.local, "QM" question SRV CLSFRAME_890_1345424499 c:7ae0c86711e96f27bfbd094484af6bba._CLSFRAME._tcp.local, "QM" question SRV CLSFRAME_303_1345407455 c:ac3233dc1c6dcf6dff1e2cb3c69177e4._CLSFRAME._tcp.local, "QM" question
224 2012-08-20 04:59:13.295482 x.x.x.x 739 x.x.x.a 2049 TCP 74 739 > 2049 [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=10440676 TSecr=0 WS=128
225 2012-08-20 04:59:13.301902 x.x.x.e 5353 x.x.x.d 5353 MDNS 167 Standard query response SRV, cache flush 0 0 8888 rict-racprod02
226 2012-08-20 04:59:13.325565 x.x.x.f 5353 x.x.x.d 5353 MDNS 166 Standard query response SRV, cache flush 0 0 8888 rict-racdts02
227 2012-08-20 04:59:13.349199 x.x.x.x 5353 x.x.x.d 5353 MDNS 167 Standard query response SRV, cache flush 0 0 8888 hostname2
228 2012-08-20 04:59:13.354341 x.x.x.g 5353 x.x.x.d 5353 MDNS 166 Standard query response SRV, cache flush 0 0 8888 rict-racdts01
- Followed by other STP packets:
229 2012-08-20 04:59:14.454172 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
230 2012-08-20 04:59:16.457159 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
231 2012-08-20 04:59:18.461111 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
232 2012-08-20 04:59:19.188249 aa:bb:cc:dd:ee:ff aa:bb:cc:dd:ee:ff LOOP 60 Reply
233 2012-08-20 04:59:19.723014 aa:bb:cc:dd:ee:ee aa:bb:cc:dd:ee:ee LOOP 60 Reply
234 2012-08-20 04:59:20.465058 aa:bb:cc:dd:ee:bb aa:bb:cc:dd:ee:cc STP 60 Conf. Root = 32768/321/aa:bb:cc:dd:ee:dd Cost = 0 Port = 0x82b8
- From here onwards we don't see any more NFS traffic. We see that the client is sending some SYN packets but never get a reply:
203 2012-08-20 04:58:58.295531 x.x.x.x 739 x.x.x.a 2049 TCP 74 739 > 2049 [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=10425676 TSecr=0 WS=128
206 2012-08-20 04:58:59.295511 x.x.x.x 739 x.x.x.a 2049 TCP 74 739 > 2049 [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=10426676 TSecr=0 WS=128
209 2012-08-20 04:59:01.295515 x.x.x.x 739 x.x.x.a 2049 TCP 74 739 > 2049 [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=10428676 TSecr=0 WS=128
213 2012-08-20 04:59:05.295520 x.x.x.x 739 x.x.x.a 2049 TCP 74 739 > 2049 [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=10432676 TSecr=0 WS=128
224 2012-08-20 04:59:13.295482 x.x.x.x 739 x.x.x.a 2049 TCP 74 739 > 2049 [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=10440676 TSecr=0 WS=128
On the server side (Filtering by src x.x.x.x):
- the NFS traffic we saw on the client as well:
941323 2012-08-20 04:58:44.998692 x.x.x.x 944 x.x.x.a 2049 NFS 262 V4 Call CLOSE StateID:0x6843
941327 2012-08-20 04:58:44.998764 x.x.x.a 2049 x.x.x.x 944 NFS 202 V4 Reply (Call In 941323) CLOSE
941342 2012-08-20 04:58:44.999262 x.x.x.x 944 x.x.x.a 2049 NFS 258 V4 Call LOOKUP DH:0xb2e7ab25/.inputrc
941345 2012-08-20 04:58:44.999339 x.x.x.a 2049 x.x.x.x 944 NFS 122 V4 Reply (Call In 941342) LOOKUP Status: NFS4ERR_NOENT
942189 2012-08-20 04:58:45.039305 x.x.x.x 944 x.x.x.a 2049 TCP 66 944 > 2049 [ACK] Seq=8829 Ack=8449 Win=41728 Len=0 TSval=10411784 TSecr=350966614
985310 2012-08-20 04:58:47.152411 x.x.x.x 944 x.x.x.a 2049 NFS 334 V4 Call OPEN DH:0xb2e7ab25/.bash_logout
985313 2012-08-20 04:58:47.152479 x.x.x.a 2049 x.x.x.x 944 NFS 130 V4 Reply (Call In 985310) OPEN Status: NFS4ERR_NOENT
985315 2012-08-20 04:58:47.152609 x.x.x.x 944 x.x.x.a 2049 TCP 66 944 > 2049 [ACK] Seq=9097 Ack=8513 Win=41728 Len=0 TSval=10413896 TSecr=350968767
- Then suddenly we don't have any more traffic coming from the NFS client.
The first packets are after the LACP/STP and MDNS traffic:
1557068 2012-08-20 04:59:13.996805 x.x.x.x 5353 x.x.x.d 5353 MDNS 167 Standard query response SRV, cache flush 0 0 8888 hostname2
1809628 2012-08-20 04:59:22.472867 x.x.x.x 5353 x.x.x.d 5353 MDNS 231 Standard query response SRV, cache flush 0 0 22 hostname2.local AAAA, cache flush aa:bb:cc:dd:cc:ff A, cache flush x.x.x.x SRV, cache flush 0 0 9 hostname2.local
- A possibly unrelated symptom was noticed on the NFS server. There were dropped packets on one interface of the bond:
$ ifconfig
bond0 Link encap:Ethernet HWaddr REDACTED
inet addr:REDACTED Bcast:REDACTED Mask:REDACTED
inet6 addr: REDACTED Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:889501484 errors:0 dropped:757 overruns:0 frame:0
TX packets:628371746 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:653783075814 (608.8 GiB) TX bytes:148817107063 (138.5 GiB)
$ grep rx_fw_discards sos_commands/networking/ethtool_-S_eth1
rx_fw_discards: 757
- Logs from the NFS client:
Aug 20 00:08:01 hostname2 kernel: nfs: server NFS-SERVER-HOSTNAME not responding, still trying
Aug 20 01:33:01 hostname2 kernel: nfs: server NFS-SERVER-HOSTNAME not responding, still trying
Aug 20 02:58:01 hostname2 kernel: nfs: server NFS-SERVER-HOSTNAME not responding, still trying
Aug 20 04:23:01 hostname2 kernel: nfs: server NFS-SERVER-HOSTNAME not responding, still trying
Aug 20 05:48:01 hostname2 kernel: nfs: server NFS-SERVER-HOSTNAME not responding, still trying
Aug 20 05:55:36 hostname2 kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Aug 20 06:08:19 hostname2 avahi-daemon[4001]: Registering new address record for x.x.x.h on bond0.IPv4.
Aug 20 06:08:19 hostname2 avahi-daemon[4001]: Withdrawing address record for x.x.x.h on bond0.
Aug 20 06:08:19 hostname2 avahi-daemon[4001]: Registering new address record for x.x.x.h on bond0.IPv4.
Aug 20 06:08:19 hostname2 avahi-daemon[4001]: Withdrawing address record for x.x.x.h on bond0.
Aug 20 06:08:19 hostname2 avahi-daemon[4001]: Registering new address record for x.x.x.h on bond0.IPv4.
Aug 20 06:08:19 hostname2 avahi-daemon[4001]: Registering new address record for x.x.x.i on bond0.IPv4.
Aug 20 06:08:19 hostname2 avahi-daemon[4001]: Withdrawing address record for x.x.x.i on bond0.
Aug 20 06:08:19 hostname2 avahi-daemon[4001]: Registering new address record for x.x.x.i on bond0.IPv4.
Aug 20 06:08:19 hostname2 avahi-daemon[4001]: Withdrawing address record for x.x.x.i on bond0.
Aug 20 06:08:19 hostname2 avahi-daemon[4001]: Registering new address record for x.x.x.i on bond0.IPv4.
Aug 20 06:08:21 hostname2 ntpd[4356]: Listening on interface #19 bond0:2, x.x.x.h#123 Enabled
Aug 20 06:08:21 hostname2 ntpd[4356]: Listening on interface #20 bond0:5, x.x.x.i#123 Enabled
Aug 20 06:26:22 hostname2 avahi-daemon[4001]: Withdrawing address record for x.x.x.h on bond0.
Aug 20 06:26:23 hostname2 avahi-daemon[4001]: Withdrawing address record for x.x.x.i on bond0.
Aug 20 06:26:25 hostname2 ntpd[4356]: Deleting interface #19 bond0:2, x.x.x.h#123, interface stats: received=0, sent=0, dropped=0, active_time=1084 secs
Aug 20 06:26:25 hostname2 ntpd[4356]: Deleting interface #20 bond0:5, x.x.x.i#123, interface stats: received=0, sent=0, dropped=0, active_time=1084 secs
Aug 20 07:13:01 hostname2 kernel: nfs: server NFS-SERVER-HOSTNAME not responding, still trying
The client is periodically losing contact with the NFS server.
From the tcpdump, the nfs client should use bond1:
bond1 Link encap:Ethernet HWaddr aa:bb:cc:dd:aa:ff
inet addr:x.x.x.x Bcast:x.x.x.j Mask:255.255.255.0
$ grep bond1 etc/sysconfig/network-scripts/ifcfg-*
etc/sysconfig/network-scripts/ifcfg-bond1:DEVICE=bond1
etc/sysconfig/network-scripts/ifcfg-bond1:NAME="System bond1"
etc/sysconfig/network-scripts/ifcfg-eth4:MASTER=bond1
etc/sysconfig/network-scripts/ifcfg-eth5:MASTER=bond1
No discards, errors on these interfaces:
$ egrep '(disc|err)' sos_commands/networking/ethtool_-S_eth{4,5}
sos_commands/networking/ethtool_-S_eth4: rx_error_bytes: 0
sos_commands/networking/ethtool_-S_eth4: tx_error_bytes: 0
sos_commands/networking/ethtool_-S_eth4: tx_mac_errors: 0
sos_commands/networking/ethtool_-S_eth4: tx_carrier_errors: 0
sos_commands/networking/ethtool_-S_eth4: rx_crc_errors: 0
sos_commands/networking/ethtool_-S_eth4: rx_align_errors: 0
sos_commands/networking/ethtool_-S_eth4: tx_deferred: 0
sos_commands/networking/ethtool_-S_eth4: rx_ftq_discards: 0
sos_commands/networking/ethtool_-S_eth4: rx_discards: 0
sos_commands/networking/ethtool_-S_eth4: rx_fw_discards: 0
sos_commands/networking/ethtool_-S_eth5: rx_error_bytes: 0
sos_commands/networking/ethtool_-S_eth5: tx_error_bytes: 0
sos_commands/networking/ethtool_-S_eth5: tx_mac_errors: 0
sos_commands/networking/ethtool_-S_eth5: tx_carrier_errors: 0
sos_commands/networking/ethtool_-S_eth5: rx_crc_errors: 0
sos_commands/networking/ethtool_-S_eth5: rx_align_errors: 0
sos_commands/networking/ethtool_-S_eth5: tx_deferred: 0
sos_commands/networking/ethtool_-S_eth5: rx_ftq_discards: 0
sos_commands/networking/ethtool_-S_eth5: rx_discards: 0
sos_commands/networking/ethtool_-S_eth5: rx_fw_discards: 0
Going back to the client tcpdump, this is about the lacp protocol:
$ tshark -t ad -n -r tcpdump.pcap slow
3 2012-08-20 04:58:17.055148 aa:bb:cc:dd:ee:ff aa:bb:cc:dd:ee:aa LACP 124 Link Aggregation Control ProtocolVersion 1. Actor Port = 1053 Partner Port = 2
16 2012-08-20 04:58:32.790975 aa:bb:cc:dd:ee:ee aa:bb:cc:dd:ee:aa LACP 124 Link Aggregation Control ProtocolVersion 1. Actor Port = 796 Partner Port = 1
191 2012-08-20 04:58:46.262326 aa:bb:cc:dd:ee:ff aa:bb:cc:dd:ee:aa LACP 124 Link Aggregation Control ProtocolVersion 1. Actor Port = 1053 Partner Port = 2
210 2012-08-20 04:59:02.296559 aa:bb:cc:dd:ee:ee aa:bb:cc:dd:ee:aa LACP 124 Link Aggregation Control ProtocolVersion 1. Actor Port = 796 Partner Port = 1
221 2012-08-20 04:59:11.948032 aa:bb:cc:dd:ee:ff aa:bb:cc:dd:ee:aa LACP 124 Link Aggregation Control ProtocolVersion 1. Actor Port = 1053 Partner Port = 2
As far as I understand lacp, this should be normal (partner and actor ports are the remote and local ports).
We see two messages because probably we have two NICs participating in the lacp group.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.