RHEL 7 system experiences connectivity issues while under load.

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 7
  • Possibly triggered by heavy network or memory load
  • Possibly more likely to occur in environments with large MTU (Jumbo frames)
  • Seen on an NFS client with kernel 3.10.0-957.5.1.el7 and on an NFS server with kernel 3.10.0-862.14.4.el7, but applications other than NFS are affected as well
  • ixgbe and mlx5 based NICs (possibly others)

Issue

  • A RHEL 7 host may lose some network connectivity, possibly for minutes at a time.
  • May affect NFS and trigger "nfs: server [...] not responding, still trying" log messages on an NFS client machine.
  • Connectivity with some subset of remote hosts may continue to function as expected while this is occurring.
  • No obvious OS network error counter related to the issue.

Resolution

  • Increase the sysctl vm.min_free_kbytes to roughly ten times its default value. See the article "How to tune vm.min_free_kbytes".

  • Newer kernels include an SNMP counter TcpExtPFMemallocDrop which is incremented when this condition is met. This counter is available in all RHEL 8 kernels and in RHEL 7.7 (kernel-3.10.0-1062.el7 and above). Please see the Diagnostic Steps section of this article below for how to use it.
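
As a rough illustration of the first item, the sketch below scales the current vm.min_free_kbytes value and prints the commands that would apply it. The helper name scaled_min_free_kbytes, the factor of 10 (taken from "roughly ten times" above) and the drop-in file name 90-min-free-kbytes.conf are assumptions made for this example, not values prescribed by this article.

```shell
#!/bin/sh
# Hypothetical helper: scale a vm.min_free_kbytes value by a factor (default 10).
scaled_min_free_kbytes() {
    echo $(( $1 * ${2:-10} ))
}

# Read the live value straight from procfs; skipped where the file is absent.
proc_file=/proc/sys/vm/min_free_kbytes
if [ -r "$proc_file" ]; then
    current=$(cat "$proc_file")
    target=$(scaled_min_free_kbytes "$current")
    echo "current vm.min_free_kbytes: $current"
    # Printed, not executed, so the change can be reviewed before applying it.
    echo "to apply now: sysctl -w vm.min_free_kbytes=$target"
    echo "to persist:   echo 'vm.min_free_kbytes = $target' > /etc/sysctl.d/90-min-free-kbytes.conf"
fi
```

The script multiplies the value currently in effect, which equals the default only if the sysctl has never been tuned on that host. If it has already been raised, pass the known default to the helper instead.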

Root Cause

  • Packets may be silently lost on receive in the sk_filter_trim_cap() function if it returns -ENOMEM:

      /**
       *      sk_filter_trim_cap - run a packet through a socket filter
       *      @sk: sock associated with &sk_buff
       *      @skb: buffer to filter
       *      @cap: limit on how short the eBPF program may trim the packet
       *
       * Run the filter code and then cut skb->data to correct size returned by
       * sk_run_filter. If pkt_len is 0 we toss packet. If skb->len is smaller
       * than pkt_len we keep whole skb->data. This is the socket level
       * wrapper to sk_run_filter. It returns 0 if the packet should
       * be accepted or -EPERM if the packet should be tossed.
       *
       */
      int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
      {
              int err;
              struct sk_filter *filter;

    >         /*
    >          * If the skb was allocated from pfmemalloc reserves, only
    >          * allow SOCK_MEMALLOC sockets to use it as this socket is
    >          * helping free memory
    >          */
    >         if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
    >                 return -ENOMEM;

              err = security_sock_rcv_skb(sk, skb);
              if (err)
                      return err;

              rcu_read_lock();
              filter = rcu_dereference(sk->sk_filter);
              if (filter) {
                      unsigned int pkt_len = SK_RUN_FILTER(filter, skb);

                      err = pkt_len ? pskb_trim(skb, max(cap, pkt_len)) : -EPERM;
              }
              rcu_read_unlock();

              return err;
      }
      EXPORT_SYMBOL(sk_filter_trim_cap);
    
  • Increasing the sysctl vm.min_free_kbytes avoids the condition.

  • The issue has been reported on systems using ixgbe based interfaces and mlx5 based interfaces.

  • Newer kernels add an SNMP counter in sk_filter_trim_cap() so the condition can be more easily recognized. See the upstream commit "net: add LINUX_MIB_PFMEMALLOCDROP counter" on git.kernel.org.
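
The check above fires for buffers taken from the pfmemalloc reserves, and vm.min_free_kbytes sets the "min" watermark of each memory zone. As a rough indicator of how close a host is to dipping into those reserves, the sketch below prints the free page count, the min watermark, and the difference between them for every zone. It assumes the usual /proc/zoneinfo layout ("Node N, zone NAME", "pages free", "min"). The function name zone_headroom and the term "headroom" are made up for this example. All values are in pages, not kilobytes.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/zoneinfo-style text on stdin and print,
# per zone, the free pages, the min watermark and the gap between them.
zone_headroom() {
    awk '
        $1 == "Node"                  { node = $2; sub(/,/, "", node); zone = $4 }
        $1 == "pages" && $2 == "free" { free = $3 }
        $1 == "min"                   { printf "node %s zone %s free=%d min=%d headroom=%d\n", node, zone, free, $2, free - $2 }
    '
}

# Inspect the live system; skipped where procfs is not available.
if [ -r /proc/zoneinfo ]; then
    zone_headroom < /proc/zoneinfo
fi
```

A headroom that repeatedly approaches zero under load would be consistent with the condition described here, but it is only a point-in-time snapshot and does not prove that packets were dropped.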

Diagnostic Steps

  • Check the kernel version in use:

    $ uname -r
    
  • For RHEL 8 and RHEL 7.7+ (kernel-3.10.0-1062.el7 and above) the nstat command can be used to check the TcpExtPFMemallocDrop counter:

    $ nstat -rsz | grep TcpExtPFMemallocDrop
    TcpExtPFMemallocDrop            0                  0.0
    
  • For older RHEL 7 kernels nothing is logged and no counter is incremented if the function sk_filter_trim_cap() returns -ENOMEM. In this case, the return value of the function can be probed with tools such as perf or SystemTap. An example probe using perf which will watch for the condition for 10 seconds:

        # perf probe -a 'sk_filter_trim_cap%return return=$retval:s32'
        # perf record -e probe:sk_filter_trim_cap -agR --filter 'return < 0' sleep 10
    
      Or, for kernels older than 3.10.0-1062.el7:
    
        # perf record -e probe:sk_filter_trim_cap__return -agR --filter 'return < 0' sleep 10
        
        # perf report
             Samples: 692  of event 'probe:sk_filter_trim_cap', Event count (approx.): 692
               Children      Self  Trace output
             +  100.00%   100.00%  (ffffffff8a0551c0 <- ffffffff8a0a735c) return=-12
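
To choose between the two approaches above without comparing kernel version strings, one option is to test whether the running kernel exposes the counter at all. The sketch below reads it from /proc/net/netstat, on the assumption that nstat's TcpExtPFMemallocDrop corresponds to the PFMemallocDrop field of the TcpExt lines in that file. The function name netstat_counter is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one TcpExt field from
# /proc/net/netstat-style text on stdin; exit status 1 if the field is absent.
netstat_counter() {
    awk -v name="$1" '
        # The first TcpExt line holds the field names: remember the wanted column.
        $1 == "TcpExt:" && !seen { for (i = 2; i <= NF; i++) if ($i == name) col = i; seen = 1; next }
        # The second TcpExt line holds the values.
        $1 == "TcpExt:" && col   { print $col; found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Inspect the live system; skipped where procfs is not available.
if [ -r /proc/net/netstat ]; then
    if drops=$(netstat_counter PFMemallocDrop < /proc/net/netstat); then
        echo "TcpExtPFMemallocDrop = $drops"
    else
        echo "counter not exposed by this kernel; use the perf probe instead"
    fi
fi
```

Running the script twice, before and after a connectivity loss, and comparing the two values shows whether drops occurred in that interval.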
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.