Disk performance problems can be hard
to track down, but they can cause a wide variety of issues. The disk
performance counters available in Windows are numerous, and being able to
select the right counters for a given situation is a great
troubleshooting skill. Here, we'll review two basic scenarios -
measuring overall disk performance and determining if the disks are a
bottleneck.
Measuring Disk Performance
When it comes to disk performance, there are two important
considerations: IOPS and byte throughput. IOPS is the raw number of
disk operations that are performed per second. Byte throughput is the
effective bandwidth the disk is achieving, usually expressed in MB/s.
These numbers are closely related - at a given IO size, a disk with more
IOPS provides proportionally more throughput.
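That relationship is easy to make concrete. The sketch below is a hypothetical helper (not a perfmon API) showing how the same IOPS figure translates into very different throughput depending on the average IO size:

```python
def throughput_mb_per_sec(iops: float, avg_io_bytes: float) -> float:
    """Byte throughput implied by an IOPS figure at a given average IO size."""
    return iops * avg_io_bytes / 1_000_000

# The same 3,600 IOPS at 4 KiB per IO vs. 64 KiB per IO:
small_io = throughput_mb_per_sec(3600, 4096)   # ~14.7 MB/s
large_io = throughput_mb_per_sec(3600, 65536)  # ~235.9 MB/s
```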
These can be measured in perfmon with the following counters:
- Disk Transfers/sec
- Total number of IOPS. This should be about equal to Disk Reads/sec + Disk Writes/sec
- Disk Reads/sec
- Disk read operations per second (IOPS which are read operations)
- Disk Writes/sec
- Disk write operations per second (IOPS which are write operations)
- Disk Bytes/sec
- Total disk throughput per second. This should be about equal to Disk Read Bytes/sec + Disk Write Bytes/sec
- Disk Read Bytes/sec
- Disk read throughput per second
- Disk Write Bytes/sec
- Disk write throughput per second
These performance counters are available in both the LogicalDisk and
PhysicalDisk categories. In a standard setup, with a 1:1 disk-partition
mapping, these would provide the same results. However, if you have a
more advanced setup with storage pools, spanned disks, or multiple
partitions on a single disk, you would need to choose the correct
category for the part of the stack you are measuring.
Here are the results on a test VM. In this test, diskspd was used to
simulate an average mixed read/write workload. The results show the
following:
- 3,610 IOPS
- 2,872 read IOPS
- 737 write IOPS
- 17.1 MB/s total throughput
- 11.2 MB/s read throughput
- 5.9 MB/s write throughput
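Dividing Disk Bytes/sec by Disk Transfers/sec gives the average IO size, which explains the combination of decent IOPS and low throughput. A quick check using the numbers above:

```python
# Values taken from the test results above.
total_bytes_per_sec = 17.1 * 1_000_000  # 17.1 MB/s
total_iops = 3610

# Average bytes per IO operation.
avg_io_bytes = total_bytes_per_sec / total_iops
print(round(avg_io_bytes))  # ~4737 bytes, i.e. mostly small (~4 KiB) IOs
```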
In this case, we're seeing a decent number of IOPS with fairly low
throughput. The expected results vary greatly depending on the
underlying storage and the type of workload that is running. In any
case, you can use these counters to get an idea of how a disk is
performing during real world usage.
Disk Bottlenecks
Determining if storage is a performance bottleneck relies on a
different set of counters than the above. Instead of looking at IOPS
and throughput, latency and queue lengths need to be checked. Latency
is the amount of time it takes to get a piece of requested data back
from the disk and is measured in milliseconds (ms). Queue length refers
to the number of outstanding IO requests that are in the queue to be
sent to the disk. This is measured as an absolute number of requests.
The specific perfmon counters are:
- Avg. Disk sec/Transfer
- The average number of seconds it takes to get a response from the disk. This is the total latency.
- Avg. Disk sec/Read
- The average number of seconds it takes to get a response from the disk for read operations. This is read latency.
- Avg. Disk sec/Write
- The average number of seconds it takes to get a response from the disk for write operations. This is write latency.
- Current Disk Queue Length
- The current number of IO requests in the queue waiting to be sent to the storage system.
- Avg. Disk Read Queue Length
- The average number of read IO requests in the queue waiting to be
sent to the storage system. The average is taken over the perfmon
sample interval (default of 1 second)
- Avg. Disk Write Queue Length
- The average number of write IO requests in the queue waiting to be
sent to the storage system. The average is taken over the perfmon
sample interval (default of 1 second)
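One caveat worth noting: the latency counters report values in seconds, so the raw samples need a x1000 conversion before comparing against millisecond guidance. A trivial helper (hypothetical, not part of perfmon):

```python
def counter_to_ms(avg_disk_sec_per_transfer: float) -> float:
    """Convert an Avg. Disk sec/Transfer sample (seconds) to milliseconds."""
    return avg_disk_sec_per_transfer * 1000

print(counter_to_ms(0.042))  # 42.0 ms
```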
Here are the results on a test VM. In this test, diskspd was used to
simulate an IO-intensive read/write workload. Here is what the test
shows:
- Total disk latency: 42 ms (the counter reports 0.042 seconds, equal to 42 milliseconds)
- Read latency: 5 ms
- Write latency: 80 ms
- Total disk queue: 48
- Read queue: 2.7
- Write queue: 45
These results show that the disk is clearly a bottleneck and
underperforming for the workload. Both the write latency and write
queue are very high. If this were a real environment, we would be
digging deeper into the storage to see where the issue is. It could be
that there's a problem on the storage side (like a bad drive or a
misconfiguration), or that the storage is simply too slow to handle the
workload.
Generally speaking, the performance tests can be interpreted with the following:
- Disk latency should be below 15 ms. Disk latency above 25 ms can
cause noticeable performance issues. Latency above 50 ms is indicative
of extremely underperforming storage.
- Disk queues should be no greater than twice the number of physical
disks serving the drive. For example, if the underlying storage is a 6
disk RAID 5 array, the total disk queue should be 12 or less. For
storage that isn't mapped directly to an array (such as in a private
cloud or in Azure), queues should be below 10 or so. Queue length isn't
directly indicative of performance issues but can help lead to that
conclusion.
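These rules of thumb can be folded into a simple classifier. The sketch below is only illustrative - the thresholds are the general guidance above, not hard limits:

```python
def classify_latency(latency_ms: float) -> str:
    """Bucket disk latency per the rules of thumb above."""
    if latency_ms < 15:
        return "healthy"
    if latency_ms <= 25:
        return "borderline"
    if latency_ms <= 50:
        return "noticeable performance impact"
    return "extremely underperforming"

def queue_ok(queue_length: float, physical_disks: int) -> bool:
    """Queue should be no more than twice the number of backing disks."""
    return queue_length <= 2 * physical_disks

print(classify_latency(42))   # from the test above: noticeable performance impact
print(queue_ok(48, 6))        # a queue of 48 on a 6-disk array: False
```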
These are general rules and may not apply in every scenario.
However, if you see the counters exceeding the thresholds above, it
warrants a deeper investigation.
General Troubleshooting Process
If a disk performance issue is suspected to be causing a larger
problem, we generally start off by running the second set of counters
above. This will determine if the storage is actually a bottleneck, or
if the problem is being caused by something else. If the counters
indicate that the disk is underperforming, we would then run the first
set of counters to see how many IOPS and how much throughput we are
getting. From there, we would determine if the storage is under-spec'ed
or if there is a problem on the storage side. In an on-premises
environment, that would be done by working with the storage team. In
Azure, we would review the disk configuration to see if we're getting
the advertised performance.
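The two-step process above can be summarized in a short sketch. The thresholds are borrowed from the guidance earlier in the article, and the function name and return strings are purely illustrative:

```python
def triage(total_latency_ms: float, queue_length: float,
           queue_limit: float = 10) -> str:
    """Step 1: decide whether storage is the bottleneck at all.

    queue_limit defaults to the ~10 rule of thumb for abstracted storage
    (private cloud / Azure); use 2x the physical disk count when known.
    """
    if total_latency_ms < 15 and queue_length <= queue_limit:
        return "storage looks healthy - investigate elsewhere"
    # Step 2: storage is suspect; capture IOPS and throughput next to see
    # whether it is under-spec'ed or misconfigured.
    return "capture IOPS/throughput counters and compare against expected performance"

print(triage(42, 48))  # the bottleneck scenario from the second test
```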