High I/O wait time

Solution Unverified - Updated

Environment

  • Red Hat Enterprise Linux

Issue

  • System I/O performance is poor. Copying a 4 GB file on local storage takes a long time.

Resolution

Check with the storage vendor. Both disks are VMware virtual disks, so VMware is the likely vendor to contact.

Root Cause

  • The iostat output shows slightly high await values:
Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00    11.83  0.02  0.28     0.48    96.90   325.63     0.01   37.17   1.62   0.05
dm-3              0.00     0.00  0.01 102.18     0.08   817.46     8.00     3.07   29.99   0.01   0.08
dm-7              0.00     0.00  0.02  3.30     0.15    26.42     8.01     0.18   55.61   0.43   0.14
dm-9              0.00     0.00  0.01  0.55     0.32     4.40     8.35     0.01   18.99   0.66   0.04
dm-10             0.00     0.00  0.03  0.43     1.38     3.46    10.41     0.05  104.03   2.29   0.11
dm-14             0.00     0.00  0.03  0.10     0.19     0.38     4.48     0.01   45.16   0.28   0.00
dm-15             0.00     0.00  0.02 11.22     0.36    89.77     8.02     0.65   57.71   0.04   0.04
dm-16             0.00     0.00  0.00  0.49     0.00     3.91     8.00     0.08  167.35   0.03   0.00
dm-17             0.00     0.00  0.00  0.22     0.06     1.77     8.25     0.04  174.25   0.18   0.00
dm-18             0.00     0.00  0.00  0.04     0.00     0.36     8.00     0.00   98.72   0.10   0.00
dm-19             0.00     0.00  0.00  0.05     0.00     0.39     8.00     0.00   80.95   0.27   0.00
dm-20             0.00     0.00  0.00  0.09     0.00     0.71     8.00     0.01  167.22   0.04   0.00

The avgqu-sz value, which indicates the average number of queued I/O requests, is within the normal range.

  • The await time is measured per I/O, from the front of the I/O scheduler until the I/O completes. It covers the time spent in the scheduler, driver, controller, transport (for example, a Fibre Channel SAN), and storage to complete each I/O.

  • Await is the average time, in milliseconds, for I/O requests completed by storage and includes the time spent by the requests in the scheduler queue and time spent by storage servicing them.

There are two queues within this code path: one in the scheduler and one somewhere out on the hardware side.

  • The avgqu-sz value is the average number of I/O requests held in the I/O scheduler queue and the storage LUN queue combined.

  • If the reported avgqu-sz in the sample is much lower than the allowed LUN queue_depth (and the sample interval is short, so averaging is not hiding a much larger peak queue size), then little time is spent in the scheduler queue. The scheduler keeps passing I/O to the driver as long as the number of I/O requests outstanding to the driver (I/O currently being worked on by storage) stays under the LUN's queue_depth limit. In such cases, the await time is dominated by storage servicing time alone. Under these circumstances (low avgqu-sz), a high await time points to a storage-side issue.
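
The reasoning above (high await together with low avgqu-sz points to storage) can be applied to iostat output mechanically. The following is a minimal sketch, not a supported tool: the function name flag_storage_latency is hypothetical, and both thresholds (await above 20 ms, avgqu-sz below 1) are illustrative assumptions rather than Red Hat recommendations. It assumes the iostat -x column layout shown in this article.

```shell
# Flag devices whose await is high while avgqu-sz is low.
# Reads `iostat -x` output on stdin; column positions are taken from the
# "Device:" header line, so the layout shown in this article is assumed.
# Usage: flag_storage_latency [max_await_ms] [max_avgqu_sz]
# Both thresholds are illustrative assumptions.
flag_storage_latency() {
    awk -v max_await="${1:-20}" -v max_queue="${2:-1}" '
        $1 ~ /^Device/ {
            # Locate the await and avgqu-sz columns by name.
            a = 0; q = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "await")    a = i
                if ($i == "avgqu-sz") q = i
            }
            next
        }
        # Data rows: only evaluated once both columns have been located.
        a && q && NF >= a && NF >= q && ($a + 0) > max_await && ($q + 0) < max_queue {
            printf "%s await=%s avgqu-sz=%s\n", $1, $a, $q
        }
    '
}
```

For example, `iostat -x 2 10 | flag_storage_latency 20 1` lists the devices that match in each report. The first iostat report shows averages since boot, so later reports are more representative of the current load.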

Based on the available information, this looks like a storage-side (hardware) issue. Check with the storage vendor; both disks are VMware virtual disks, so VMware is the likely vendor to contact.
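
To judge whether a sampled avgqu-sz is low relative to the configured limits, the two queue sizes can be read from sysfs. This is a minimal sketch under stated assumptions: the function name show_queue_limits is hypothetical, the optional second argument exists only so the function can be pointed at a different sysfs root, and device/queue_depth is only present for SCSI-type devices such as sdb (device-mapper devices do not have it).

```shell
# Print the block-layer request queue limit and the LUN queue depth
# for one block device.
# Usage: show_queue_limits <device> [sysfs_root]
# The sysfs_root argument is an illustrative addition (default: /sys).
show_queue_limits() {
    dev="$1"
    sysfs="${2:-/sys}"
    for f in queue/nr_requests device/queue_depth; do
        path="$sysfs/block/$dev/$f"
        if [ -r "$path" ]; then
            printf '%s %s=%s\n' "$dev" "${f##*/}" "$(cat "$path")"
        else
            # Device-mapper devices, for example, have no device/queue_depth.
            printf '%s %s=unavailable\n' "$dev" "${f##*/}"
        fi
    done
}
```

Running `show_queue_limits sdb` prints one line per limit, which can then be compared with the avgqu-sz column for the same device.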

Diagnostic Steps

Run the following commands in separate consoles on the server while the issue is occurring.

# top -n 5 -b > /tmp/top.out
# vmstat 1 50 > /tmp/vm.out
# iostat -x 2 10 > /tmp/io.out
# sar 1 50 > /tmp/sar.out

Then run the following command:

# tar -cvzf /tmp/perf.tar.gz /tmp/*.out

and attach the resulting /tmp/perf.tar.gz file.
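
If opening several consoles is inconvenient, the same collection can be scripted so that all four commands sample the same time window. The following is a minimal sketch, assuming the same commands and file names as above; the function name collect_perf_data and its optional output-directory argument are hypothetical.

```shell
# Run the four collectors concurrently, then bundle their output.
# Usage: collect_perf_data [output_dir]   (default: /tmp)
# The body runs in a subshell, so `wait` only waits for these collectors.
collect_perf_data() (
    outdir="${1:-/tmp}"

    # Start each collector in the background so that all of them
    # cover the same period.
    top -n 5 -b    > "$outdir/top.out" &
    vmstat 1 50    > "$outdir/vm.out"  &
    iostat -x 2 10 > "$outdir/io.out"  &
    sar 1 50       > "$outdir/sar.out" &

    # Block until every collector has finished (roughly 50 seconds).
    wait

    # Archive the results; -C stores the files without the directory prefix.
    tar -czf "$outdir/perf.tar.gz" -C "$outdir" top.out vm.out io.out sar.out
)
```

Calling `collect_perf_data` with no argument writes /tmp/perf.tar.gz, as in the manual steps. Start it while the slowdown is in progress so that the samples capture the problem.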

The sar and iostat commands are provided by the sysstat package. Install it if it is not already present:

# yum install sysstat
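
Before collecting data, it can help to confirm that every required command is available. A minimal sketch follows; the function name missing_tools is hypothetical.

```shell
# Print each named command that is not found in PATH.
# Usage: missing_tools <command>...
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

If `missing_tools top vmstat iostat sar` prints iostat or sar, the sysstat package still needs to be installed.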

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.