RHEL6: NFS mounts become unresponsive when NFS server experiences repeated hardware.py python backtrace / abrt

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 6 (NFS Server)
  • rhn-client-tools
    • between rhn-client-tools-1.0.0.1-4.el6 and rhn-client-tools-1.0.0.1-16.el6
  • sos (sosreport)
    • between sos-2.2-29.el6 and before sos-2.2-47.el6
    • sos-2.2-29.el6 includes fix for 'Bug 730641 - sosreport does not collect /proc/net details', and introduces the regression
    • sos-2.2-47.el6 fixes the regression in 'Bug 913201 - sosreport can potentially cause an RPC failure when copying RPC channel files'

Issue

  • RPC-data reaches NFSD but it seems to stall while doing svc_authenticate
  • NFS server TCP ACKs NFS client requests, then silently drops them, never responding at the NFS layer.
  • NFS mounts on NFS clients periodically hang with visible in /var/log/messages:
kernel: nfs: server hostname not responding, still trying
Oct 20 08:25:18 hostname abrt: detected unhandled Python exception in '/usr/share/rhn/up2date_client/hardware.py'
Oct 20 08:25:18 hostname abrtd: New client connected
Oct 20 08:25:18 hostname abrtd: Directory 'pyhook-2013-10-20-08:25:18-10226' creation detected
Oct 20 08:25:18 hostname abrt-server[10333]: Saved Python crash dump of pid 10226 to /var/spool/abrt/pyhook-2013-10-20-08:25:18-10226
Oct 20 08:25:25 hostname abrtd: Sending an email...
Oct 20 08:25:26 hostname abrtd: Email was sent to: root@localhost
Oct 20 08:25:26 hostname abrtd: Duplicate: UUID
Oct 20 08:25:26 hostname abrtd: DUP_OF_DIR: /var/spool/abrt/pyhook-2013-10-20-07:28:52-15636
Oct 20 08:25:26 hostname abrtd: Corrupted or bad directory '/var/spool/abrt/pyhook-2013-10-20-08:22:37-1222', deleting

Resolution

  1. The kernel NFS server bug is tracked by private Red Hat Bug 1055757 - "RHEL6 nfs server may suffer DoS as a result of repeated python backtrace / abrt running, which runs 'sosreport --only=networking'". This bug has not been reported outside an environment where the other two bugs were present and has been closed as WONT_FIX.
  2. The sosreport bug is covered by the resolution section in https://access.redhat.com/site/solutions/312913
  3. The rhn-client-tools bug is covered by the resolution in https://access.redhat.com/site/solutions/506353

Root Cause

  • A combination of the following three bugs explains the permanent NFS server DoS state. Fixing any one of the following should resolve any permanent NFS server unresponsiveness due to this issue.

  • Due to bug #1 (the kernel's NFS server heuristic) and bug #2 (sosreport reading 'channel' files), the kernel temporarily thinks a userspace process exists to service the RPC cache, and forwards a request down, only to timeout. After a number of tries, it gives up and will recover. However, due to bug #3 (repeatedly calling sosreport), sosreport is called very frequently, and the temporary condition of the NFS server being fooled into thinking a userspace daemon exists to service the RPC caches becomes permanent.

  1. NFS server kernel heuristic involving RPC caches and detection of a userspace process. The NFS server contains a heuristic where it tries to determine whether a userspace daemon exists to service RPC caches. It determines this based on the timestamp of when the last 'open' was done on the channel file, among other factors. However, it's possible that no userspace daemon exists, yet some other process running as root reads the 'channel' file and can fool the kernel into thinking a userspace daemon exists to service rpc caches. Assuming the 'channel' file is only opened / read one time, after a few tries, the kernel will eventually recover, realize no userspace process exists, and be able to respond to RPC requests normally. During this window of time though, it drops RPC requests due the RPC authentication failures.

  2. sosreport bug when run with --only=networking on an NFS server where it inadvertently reads the RPC channel files. This is covered by https://access.redhat.com/site/solutions/312913. The bug is the below line where it copies all files in /proc/net. The reason this is problematic is because there are special files in the directory which are used for userspace <--> kernel communications to update / validate various sunrpc caches, such as /proc/net/rpc/auth.unix.gid/channel When such special file gets read a message from the kernel could get read, which was intended for another process (such as rpc.mountd)

 grep -n "/proc/net" /usr/lib/python2.6/site-packages/sos/plugins/networking.py
66:        self.addCopySpec("/proc/net/")

3. rhn-client-tools bug involving repeated /usr/share/rhn/up2date_client/hardware.py backtraces. This is covered by https://access.redhat.com/site/solutions/506353 and causes sosreport to repeatedly get called. Normally sosreport would only be called very infrequently (usually only once).

Diagnostic Steps

  • NOTE: You can use the attached 3 scripts to gather tcpdump, rpcdebug, and ftrace data to verify this problem. The main script is nfsd_tcpdump_rpcdebug_ftrace.sh, which runs the other two scripts, ftrace-nfsd-dispatch.sh and save-ftrace-buffer.sh.

Diagnosis using tcpdump / packet capture

  • Objective: Using tcpdump, look for the NFS server to give TCP-level ACKs of NFS Call packets, but never any NFS Reply packets. It looks as though the NFS server is silently dropping the requests.

  • On the NFS server, gather a tcpdump at the time of the problem, leading up to and covering the "not responding" message. You should cover at least a couple minutes leading up to the message.

  • Process the tcpdump with wireshark, and use a filter to show traffic only on the NFS port (2049). If TCP is being used for NFS communication, then use 'tcp.port == 2049'. NOTE: If there are multiple NFS client connections, you may need additional filters to filter out one specific NFS client connection, such as an ip.addr filter or an additional tcp.port filter.

  • Also, you can process the tcpdump and look for RPC Calls without Responses. If this problem is occurring, you should see the number of RPC Calls much higher than the RPC Replys.

  • The following are two examples, one with 'good' / normal output, and one with 'bad' / abnormal output, which shows the problem.

  • GOOD: Example tcpdump / tshark output which shows normal NFS Reply packets.

$ tshark -n -rad -r ./tcpdump.pcap -R 'tcp.port == 2049 && tcp.port == 979'
  1   0.000000  192.168.1.5 -> 192.168.1.10  NFS 218 V4 Call ACCESS FH:0x62d40c52, [Check: RD LU MD XT DL]
  2   0.004029  192.168.1.10 -> 192.168.1.5  NFS 298 V4 Reply (Call In 1) ACCESS, [Allowed: RD LU MD XT DL]
  3   0.004393  192.168.1.5 -> 192.168.1.10  TCP 66 979 > 2049 [ACK] Seq=153 Ack=233 Win=20904 Len=0 TSval=1257229991 TSecr=1835188253
  4   0.004408  192.168.1.5 -> 192.168.1.10  NFS 310 V4 Call OPEN DH:0x62d40c52/file1-user33
  5   0.010571  192.168.1.10 -> 192.168.1.5  NFS 462 V4 Reply (Call In 4) OPEN StateID:0xbe51
  6   0.010899  192.168.1.5 -> 192.168.1.10  NFS 226 V4 Call OPEN_CONFIRM
  7   0.015733  192.168.1.10 -> 192.168.1.5  NFS 138 V4 Reply (Call In 6) OPEN_CONFIRM
  8   0.016096  192.168.1.5 -> 192.168.1.10  NFS 262 V4 Call SETATTR FH:0x45c725c7
  9   0.020481  192.168.1.10 -> 192.168.1.5  NFS 310 V4 Reply (Call In 8) SETATTR
 10   0.020828  192.168.1.5 -> 192.168.1.10  NFS 242 V4 Call CLOSE StateID:0xbbfc
 11   0.025019  192.168.1.10 -> 192.168.1.5  NFS 202 V4 Reply (Call In 10) CLOSE
 12   0.064657  192.168.1.5 -> 192.168.1.10  TCP 66 979 > 2049 [ACK] Seq=929 Ack=1081 Win=24120 Len=0 TSval=1257230052 TSecr=1835188274
  • BAD: Example tcpdump/tshark output that shows TCP-level ACKs of NFS Call packets (packets 16, and 18 show increasing "Ack=" numbers), but the NFS server never sends NFS Reply packets. Eventually (packet #19) the NFS client sends a FIN, and reconnects the TCP connection (packet #22), and re-sends the requests, only to see the same behaviour from the NFS server.
$ tshark -n -tad -r /tmp/tcpdump.pcap -R 'tcp.port == 2049 && ip.addr == 192.168.1.2'
 15  60.002113  192.168.1.2 -> 192.168.1.6  NFS 218 V4 Call ACCESS FH:0x62d40c52, [Check: RD LU MD XT DL]
 16  60.002265  192.168.1.6 -> 192.168.1.2  TCP 66 2049 > 979 [ACK] Seq=1 Ack=153 Win=15616 Len=0 TSval=1836365178 TSecr=1258406916
 17  83.094143  192.168.1.2 -> 192.168.1.6  NFS 190 V4 Call RENEW CID: 0xbce9
 18  83.094175  192.168.1.6 -> 192.168.1.2  TCP 66 2049 > 979 [ACK] Seq=1 Ack=277 Win=15616 Len=0 TSval=1836388270 TSecr=1258430008
 19 143.094087  192.168.1.2 -> 192.168.1.6  TCP 66 979 > 2049 [FIN, ACK] Seq=277 Ack=1 Win=14480 Len=0 TSval=1258490008 TSecr=1836388270
 20 143.095592  192.168.1.6 -> 192.168.1.2  TCP 66 2049 > 979 [FIN, ACK] Seq=1 Ack=278 Win=15616 Len=0 TSval=1836448272 TSecr=1258490008
 21 143.095961  192.168.1.2 -> 192.168.1.6  TCP 66 979 > 2049 [ACK] Seq=278 Ack=2 Win=14480 Len=0 TSval=1258490010 TSecr=1836448272
 22 143.095989  192.168.1.2 -> 192.168.1.6  TCP 74 [TCP Port numbers reused] 979 > 2049 [SYN] Seq=0 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=1258490010 TSecr=1836448272 WS=1
 23 143.096033  192.168.1.6 -> 192.168.1.2  TCP 74 2049 > 979 [SYN, ACK] Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=1836448272 TSecr=1258490010 WS=128
 24 143.096195  192.168.1.2 -> 192.168.1.6  TCP 66 979 > 2049 [ACK] Seq=1 Ack=1 Win=14480 Len=0 TSval=1258490010 TSecr=1836448272
 25 143.096576  192.168.1.2 -> 192.168.1.6  NFS 190 V4 Call RENEW CID: 0xbce9
 26 143.096727  192.168.1.6 -> 192.168.1.2  TCP 66 2049 > 979 [ACK] Seq=1 Ack=125 Win=14592 Len=0 TSval=1836448273 TSecr=1258490010
 27 143.096743  192.168.1.2 -> 192.168.1.6  NFS 218 V4 Call ACCESS FH:0x62d40c52, [Check: RD LU MD XT DL]
 28 143.096890  192.168.1.6 -> 192.168.1.2  TCP 66 2049 > 979 [ACK] Seq=1 Ack=277 Win=15616 Len=0 TSval=1836448273 TSecr=1258490010
 41 263.097177  192.168.1.2 -> 192.168.1.6  TCP 66 979 > 2049 [FIN, ACK] Seq=277 Ack=1 Win=14480 Len=0 TSval=1258610011 TSecr=1836448273

Diagnosis using rpcdebug

  • Objective: Look for evidence in rpcdebug output that RPC requests are being dropped.
  • Enable rpcdebug on the nfs server:
echo 32767 > /proc/sys/sunrpc/nfsd_debug 
echo 32767 > /proc/sys/sunrpc/rpc_debug 
  • The following are two examples, one with 'good' / normal output, and one with 'bad' / abnormal output, which shows the problem.

  • GOOD: Example of RPC debug which shows a request processed normally

Feb  6 09:38:47 rhel6-test kernel: svc: svc_authenticate (1)
Feb  6 09:38:47 rhel6-test kernel: RPC:       Want update, refage=1800, age=936   <--------- cache update needed
Feb  6 09:38:47 rhel6-test kernel: RPC:       Want update, refage=120, age=120   <--------- cache update needed
Feb  6 09:38:47 rhel6-test kernel: svc: calling dispatcher  <------------------ no delay here, request not dropped
Feb  6 09:38:47 rhel6-test kernel: nfsd_dispatch: vers 4 proc 1  <------------ nfsd_dispatch called to service request at NFS server layer
  • BAD: Example RPC debug output which shows requests are being dropped after a 5 second delay
Jan 20 20:40:49 rhel6-test kernel: svc: svc_authenticate (1)
Jan 20 20:40:49 rhel6-test kernel: RPC:       Want update, refage=120, age=105   <--------- cache update needed
Jan 20 20:40:54 rhel6-test kernel: svc: svc_process dropit   <-------------- 5 seconds later
Jan 20 20:40:54 rhel6-test kernel: svc: xprt ffff8800047a9000 dropped request

Diagnosis using ftrace

  • Objective: Look in the ftrace data, tracing svc_process_common, for the following trace on a single nfsd thread
    • Call tree: svcauth_unix_set_client -> cache_check -> cache_wait_req [context switch]
    • [context switch after 5 seconds] del_timer_sync -> svc_defer
  • Ftrace of svc_process_common indicates nfsd-2013 thread goes calls cache_check, waits inside cache_wait_req, and times out after 5 seconds, then calling svc_defer. Note that unix_gid_find is not shown in the ftrace output (perhaps due to it being a static/inlined function), even though it must be the case that it is called.
------------------------------------------
  2)   nfsd-2018    =>   nfsd-2013
 ------------------------------------------

  2)               |  svc_process_common() {
  2)   0.245 us    |    svc_tcp_prep_reply_hdr();
  2)               |    svc_authenticate() {
  2)               |      vprintk() {
...
  2) + 20.317 us   |      }
  2)   0.202 us    |      _spin_lock();
  2)               |      svcauth_unix_accept() {
  2)               |        groups_alloc() {
  2)               |          __kmalloc() {
  2)   0.207 us    |            _cond_resched();
  2)   0.734 us    |          }
  2)   1.169 us    |        }
  2)   1.757 us    |      }
  2) + 23.209 us   |    }
  2)               |    svc_set_client() {
  2)               |      svcauth_unix_set_client() {
  2)   0.203 us    |        _spin_lock();
  2)               |        cache_check() {  <----- 1st cache_check is for the ip cache, i.e. cache_check(&ip_map_cache, ...
  2)               |          getboottime() {
  2)   0.221 us    |            set_normalized_timespec();
  2)   0.652 us    |          }
  2)   0.198 us    |          get_seconds();
  2)   1.503 us    |        }
  2)   0.201 us    |        _spin_lock();
  2)   0.222 us    |        cache_put();
  2)               |        unix_gid_lookup() {  <------ Actual call tree: unix_gid_find -> unix_gid_lookup
  2)               |          sunrpc_cache_lookup() {
...
  2)   2.959 us    |          }
  2)   3.494 us    |        }
  2)               |        cache_check() {  <---- Actual call tree: unix_gid_find -> cache_check, i.e. cache_check(&unix_gid_cache, ....
  2)               |          getboottime() {
  2)   0.201 us    |            set_normalized_timespec();
  2)   0.677 us    |          }
  2)   0.199 us    |          get_seconds();
  2)               |          vprintk() {  <------------- the 'vprintks' are rpcdebug statements
...
  2) + 27.389 us   |          }
  2)               |          cache_wait_req() {   <------- NOTE: Actual call tree: cache_check -> cache_defer_req -> cache_wait_req
  2)   0.201 us    |            __init_waitqueue_head();
  2)               |            setup_deferral() {
  2)   0.207 us    |              _spin_lock();
  2)   0.637 us    |            }
  2)               |            wait_for_completion_interruptible_timeout() { <-- wait 'req->thread_wait' secs (svc_recv sets '5*HZ')
  2)               |              wait_for_common() {
  2)   0.202 us    |                _cond_resched();
  2)   0.204 us    |                _spin_lock_irq();
  2)               |                schedule_timeout() {
  2)   0.216 us    |                  init_timer_key();
  2)               |                  lock_timer_base() {
  2)   0.205 us    |                    _spin_lock_irqsave();
  2)   0.704 us    |                  }
  2)   0.205 us    |                  idle_cpu();
  2)   0.219 us    |                  internal_add_timer();
  2)   0.214 us    |                  _spin_unlock_irqrestore();
  2)               |                  schedule() {
....
 ------------------------------------------
  2)   nfsd-2018    =>   nfsd-2013
 ------------------------------------------

  2)   0.485 us    |                    finish_task_switch();
  2) ! 5002642 us |                  }  <----------- again timeout, no response from userspace ?
  2)               |                  del_timer_sync() {
  2)               |                    try_to_del_timer_sync() {
  2)               |                      lock_timer_base() {
  2)   0.372 us    |                        _spin_lock_irqsave();
  2)   0.767 us    |                      }
  2)   0.326 us    |                      _spin_unlock_irqrestore();
  2)   1.901 us    |                    }
  2)   2.522 us    |                  }
  2) ! 5002648 us |                }
  2)   0.321 us    |                _spin_lock_irq();
  2) ! 5002650 us |              }
  2) ! 5002651 us |            }
  2)   0.202 us    |            _spin_lock();
  2) ! 5002653 us |          }
  2)   0.363 us    |          svc_defer();  <-------- see cache_defer_req() { ... dreq = req->defer(req); ... }
  2)               |          setup_deferral() {
  2)   0.217 us    |            _spin_lock();
  2)   0.732 us    |          }
  2) ! 5002685 us |        }
  2) ! 5002692 us |      }
  2) ! 5002692 us |    }
  2)               |    svc_authorise() {
  2)               |      svcauth_unix_release() {
  2)   0.209 us    |        auth_domain_put();
  2)               |        groups_free() {
  2)               |          kfree() {
  2)   0.198 us    |            __phys_addr();
  2)   0.608 us    |          }
  2)   1.152 us    |        }
  2)   2.206 us    |      }
  2)   0.207 us    |      module_put();
  2)   3.043 us    |    }
  2)               |    vprintk() {
...
  2) + 37.674 us   |    }
  2) ! 5002758 us |  }

Reproducer

On the NFS server (rhel6):

# while true; do python /usr/share/rhn/up2date_client/hardware.py; sleep 6; done

On the NFS client (rhel6 or rhel5):

# for i in $(seq 1 10000); do dd if=/dev/zero of=/mnt/file.$i bs=512 count=3; done

After few writes, the client hang, dd is in D state, and the "not responding" message is seen in the log.

Without the python command all 10K files were written without issue.
Apparently stopping the python script solves the problem because the nfs client starts to write again.
As soon as I started the python command again (and therefore abrtd intervene) the client script hangs after a while.

Simplified reproducer

On the NFS server:

# while true; do echo `date` reading /proc/net/rpc/auth.unix.gid/channel; cat /proc/net/rpc/auth.unix.gid/channel>/dev/null; sleep 3; done

On the NFS client, attempt to mount an export from the NFS server, or do some NFS operation. It should hang.

Components
Category

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.