NFS packet trace analysis tips and tricks

Updated

This document assumes you have gathered a packet trace which includes NFS traffic and would like to analyze the problem. If you have not gathered a packet trace please refer to the following article: How to capture network packets with tcpdump?

Both NFSv3 and NFSv4

Timestamps in packet traces and matching other event timestamps

  • When using wireshark / tshark, and attempting to match date/times of a packet trace with an event in a log (for example /var/log/messages), prepend tshark / wireshark / capinfos command with an appropriate 'TZ' value. Often you can obtain the TZ value from a sosreport etc/sysconfig/clock. You should also always validate any other log timestamps (for example, /var/log/messages, strace output, etc) with the tcpdump. Use the capinfos command to look at the timeframe of the tcpdump, and compare that with the other log. For example
$ cat sosreport-xyz/etc/sysconfig/clock
ZONE="Europe/Helsinki"
$ TZ="Europe/Helsinki" capinfos tracefile.pcap | grep time
Start time:          Wed Nov 28 11:00:08 2012
End time:            Wed Nov 28 11:00:45 2012
$ head -1 strace.out 
30819 11:00:43.032273 execve("/bin/ls", ["ls"], ...
$ tail -1 strace.out
30819 11:00:43.038457 exit_group(2)     = ?

The above shows that the strace.out file was gathered at the same time as the tracefile.pcap file.

Matching errors in strace and network captures

Now we look for any specific errors in the strace matching up with any errors in the tcpdump.
NOTE: This is just a simple example to show one interesting error in an strace matching with one
interesting packet in the tcpdump - more complex analysis may be needed.

$ grep ESTALE strace.out
30819 11:00:43.036550 open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ESTALE (Stale NFS file handle) <0.001038>
$ TZ="Europe/Helsinki" tshark -tad -r tracefile.pcap -R "(nfs.procedure_v4) && (nfs.status != 0)"
246 2012-11-28 11:00:25.504376 10.224.180.88 -> 10.224.193.150 NFS V4 COMP Reply (Call In 245) LOOKUP(2)

Finding a filehandle for a specific filename

Only few procedures in NFS use actual filenames. Most procedures use filehandles. A filehandle is obtained with an OPEN or CREATE. Often it is useful to know which filename is associated with which filehandle.

It is possible to pick the filehandle from the RPC Reply that is related to an OPEN or CREATE. For this, it is needed to understand how to match a Reply that contains the filehandle, with the Call that contains the filename.

  1. find the CREATE (or OPEN) that contains the filename of interest

    $ tshark -r tracefile.pcap.gz \
      -Y 'nfs.name == "README" && nfs.procedure_v3 == CREATE'
    1108   8.614943 10.99.141.81 764 10.99.141.78 2049 NFS 230 V3 CREATE Call, DH: 0xcaeae0e0/README Mode: UNCHECKED
    
  2. from frame 1108 we can pick the RPC transaction ID (rpc.xid)

    $ tshark -r tracefile.pcap.gz -T fields -e rpc.xid 'frame.number == 1108'
    0xdea41c77
    
  3. the rpc.xid should be unique for each Call/Reply combination, so it is possible to get the Reply by filtering on the rpc.xid:

    $ tshark -r tracefile.pcap.gz 'rpc.xid == 0xdea41c77'
    1108   8.614943 10.99.141.81 764 10.99.141.78 2049 NFS 230 V3 CREATE Call, DH: 0xcaeae0e0/README Mode: UNCHECKED
    1109   8.647137 10.99.141.78 2049 10.99.141.81 764 NFS 338 V3 CREATE Reply (Call In 1108)
    
  4. as mentioned earlier, the Reply contains the filehandle:

    $ tshark -r fast.pcap.gz -T fields -e nfs.fh.hash frame.number == 1109
    0x95312866
    

Fortunately Wireshark offers some advanced options which make things easier. Browsing the options is easiest in the Wireshark GUI:

  • Edit
  • Preferences
  • Protocols
  • NFS

Wireshark: Edit -> Preferences -> Protocols -> NFS

These options are placed in ~/.wireshark/preferences, and are also applied when running tshark.

Enabling some of these options cause Wireshark to add additional attributes (like nfs.name) to packets that do not originally contain them (like a Reply on a CREATE). These additional attributes can be used in searching for packets too!

Miscellaneous commands

  • Print any NFS operations completing with error status
$ tshark -tad -r tracefile.pcap -R "(nfs.status != 0)"
  • NFS traffic showing status values as well as file handles and xids:
$ tshark -ntad -z proto,colinfo,nfs.status,nfs.status -z proto,colinfo,nfs.fh.hash,nfs.fh.hash -z proto,colinfo,rpc.xid,rpc.xid -R 'nfs' -r tracefile.pcap > /tmp/tracefile-nfs-status-fh-xid.txt
  • Print specific frame/packets in verbose mode (example prints frames 245 - 246)
$ tshark -V -tad -r tracefile.pcap -R 'frame.number >= 245 && frame.number <= 246'
  • Summarize RPC traffic statistics for specific file / handle, 0x12345678
$ tshark -R 'nfs.fh.hash==0x12345678' -r tracefile.pcap
  • Show RPC call retransmissions per file handle
$ tshark -tad -r tracefile.pcap -R rpc.call.dup -T fields -e nfs.fh.hash | sort | uniq -c | sort -nr 
  • Summarize all traffic, but this may be a good starting point if you've filtered properly during capture:
$ tshark -r tracefile.pcap
  • Summarize TCP conversations:
$ tshark -n -tad -r tracefile.pcap -q -z conv,tcp
  • All RPC traffic:
$ tshark -R rpc -r tracefile.pcap
  • Throughput:
$ tshark -q -z io,stat,10 -r tracefile.pcap
  • List the tshark/wireshark names and values for NFS related fields. NOTE: See the f17 (or above) man page for the description of the "-G" ('glossary') option to tshark.
$ tshark -G values | egrep '(nfs|rpc)' | less
$ tshark -G fields | egrep '(nfs|rpc)' | less

NFSv3 only

$ tshark -q -z rpc,rtt,100003,3 -r tracefile.pcap
  • Print any NFS operations completing with error status
$ tshark -tad -r tracefile.pcap -R "(nfs.procedure_v3) && (nfs.status != 0)"
  • Search for NFSv3 FSSTAT (operation 18) RPC operations which take longer than 120 seconds (the default hung_task timeout). Print out the xid and the RPC service time (in seconds). NOTE This requires a later version of wireshark (ran on wireshark-1.10.0-2.fc19).
$ tshark -2 -ntad  -T fields -e rpc.xid -e rpc.time -R 'rpc && nfs.procedure_v3 == 18' -r tcpdump.pcap | awk '{ if ($2 > 120) { print $0 } }'
0x7c3bb117      240.001021000
0x5d3bb117      360.002734000
0xe240b117      146.460670000
0xe640b117      120.019259000
0xc042b117      121.036544000
0xc042b117      121.036562000
0x6b60b117      180.001421000
0x6f60b117      180.001870000
  • Now show the packets associated with xid 0x5d3bb117 above, which took 360 seconds to complete:
$ tshark -2 -ntad -R 'rpc && rpc.xid == 0x5d3bb117' -r tcpdump.pcap
  1 2013-10-22 13:32:25.406285 10.1.1.1 -> 10.1.1.2 NFS 186 V3 FSSTAT Call (Reply In 4), FH: 0x3c3c7ee5
  2 2013-10-22 13:33:25.407628 10.1.1.1 -> 10.1.1.2 NFS 186 [RPC retransmission of #1]V3 FSSTAT Call (Reply In 4), FH: 0x3c3c7ee5
  3 2013-10-22 13:38:25.408456 10.1.1.1 -> 10.1.1.2 NFS 186 [RPC retransmission of #1]V3 FSSTAT Call (Reply In 4), FH: 0x3c3c7ee5
  4 2013-10-22 13:38:25.409019 10.1.1.2 -> 10.1.1.1 NFS 154 V3 FSSTAT Reply (Call In 1) 
  • Given a tcpdump capturing a MOUNT sequence, and given a FH 0xdb71c322 which is a root FH, show the NFS share associated with the FH:
$ tshark -V -R 'mount && rpc.msgtyp == 1' -r server-mount.pcap | grep -A 10 "Mount Service" | grep -A 3 0xdb71c322
        [hash (CRC-32): 0xdb71c322]
        [Name: 10.10.18.8:/export/foo/bar]
        [Full Name: 10.10.18.8:/export/foo/bar]
        decode type as: unknown

NFSv4 only

  • Print any COMPOUND operations which return NFS4ERR_* status (not NFS4_OK)
$ tshark -tad -r tracefile.pcap -R "(nfs.procedure_v4) && (nfs.status != 0)"
  • If any errors are seen, use editcap and tshark again to narrow the focus. For example:
$ tshark -tad -r tracefile.pcap -R "(nfs.procedure_v4) && (nfs.status != 0)"
 95 2012-02-23 16:10:45.112680   10.41.2.53 -> 10.37.99.32  NFS V4 COMP Reply (Call In 94) PUTFH OPEN(13)
100 2012-02-23 16:10:45.114325   10.41.2.53 -> 10.37.99.32  NFS V4 COMP Reply (Call In 99) PUTFH LOOKUP(2)
$ editcap -r tracefile.pcap network_errors.cap 94-95
Add_Selected: 94-95
Inclusive ... 94, 95
$ tshark -V -tad -r tracefile.pcap | egrep '(Filename|Status)'
                Filename: test.txt
    Status: NFS4ERR_ACCES (13)
            Status: NFS4_OK (0)
            Status: NFS4ERR_ACCES (13)
$ tshark -V -tad -r tracefile.pcap | egrep '(Filename|Status|Opcode)'
        Opcode: PUTFH (22)
        Opcode: OPEN (18)
                Filename: test.txt
        Opcode: GETFH (10)
        Opcode: GETATTR (9)
    Status: NFS4ERR_ACCES (13)
        Opcode: PUTFH (22)
            Status: NFS4_OK (0)
        Opcode: OPEN (18)
            Status: NFS4ERR_ACCES (13)
  • Print any NFS CLOSE operations completing with error status.
$ tshark -tad -r tracefile.pcap -R "(nfs.procedure_v4) && (nfs.status != 0) && (nfs.reply.operation == 4)"
  • Dump raw packet 12345 to ASCII text file
$ tshark -x -r tracefile.pcap -R "frame.number == 12345" > 12345-raw.txt
  • Summarize RPC traffic statistics in a nice concise format, which may be useful for performance cases.
$ tshark -q -z rpc,srt,100003,4 -r tcpdump.pcap
=======================================================
NFS Version 4 SRT Statistics:
Filter: 
Procedure        Calls    Min SRT    Max SRT    Avg SRT
COMPOUND          9296   0.000001   0.170105   0.004120
=======================================================
  • 4 msec average seems ok, use rpc.time to look closer at the bad rpc times
$ TZ=$timezone tshark -tad -Tfields -e frame.number -e frame.time -e rpc.time -r tcpdump.pcap -R "(nfs.procedure_v4) && (nfs.status == 0) && (rpc.time>.1)"
19329	Mar 21, 2014 20:26:01.052579280	0.169526720
19401	Mar 21, 2014 20:26:01.053166480	0.170105440
37609	Mar 21, 2014 20:26:02.029518840	0.107553841
$ TZ=$timezone tshark -tad -r tcpdump.pcap -R 'nfs.opcode == CLOSE || nfs.opcode == DELEGPURGE || nfs.opcode == DELEGRETURN || nfs.opcode == LOCK || nfs.opcode == LOCKU || nfs.opcode == OPEN || nfs.opcode == OPEN_CONFIRM || nfs.opcode == OPEN_DOWNGRADE || nfs.opcode == READ || nfs.opcode == RENEW || nfs.opcode == SETATTR || nfs.opcode == WRITE'
Category
Article Type