Lies, Damned Lies, and I/O Statistics
Given a description of gas dynamics and the atmosphere, you would be hard pressed to forecast tornadoes. The term emergence denotes the phenomenon of surprising behaviors arising in complex systems. Modern storage systems are complex and full of emergent behavior that makes forecasting application I/O performance fiendishly difficult. In collaboration with Kyle Hailey, Adam Leventhal, and others, I've learned some rules of thumb for making accurate I/O performance forecasts. I've also stepped on every rake imaginable along the way. For those of you who may also wish to forecast the storage performance an application will receive, this post summarizes some lessons learned.
When I began evaluating storage performance, I had a naive idea that the process would be like this:
I quickly discovered that the actual process was much more like this:
In going through this process, I've compiled a bestiary of performance anomalies you may expect to encounter if you interact with a variety of storage platforms, along with root causes for those anomalies and some explanatory notes. Following that are some approaches for designing I/O simulations or tests that avoid them.
Bestiary of I/O Performance Anomalies
Each entry below gives the anomaly name, its characteristics, and notes on the root cause.

Anomaly: Caching
Characteristics: Impossibly good performance:
- Higher throughput than the connection to storage could provide
- Latencies which imply faster than light travel over cables to the storage
Notes: Often the operating system and the underlying storage array will have large memory caches. Drivers will also tend to cache small amounts of data. This mostly occurs during read tests but, depending on the application write semantics, can also occur during write tests.

Anomaly: Shared drives
Characteristics: Inconsistent performance
Notes: It is common in storage systems to allocate LUNs or file systems from storage pools that are composed of large numbers of physical drives shared with other LUNs.

Anomaly: Shared connection to storage
Characteristics: Inconsistent performance, especially for:
- High throughput tests
- NAS storage with a 1 Gb Ethernet connection
Notes: For storage tests being done within a VM, other VMs on the same physical server can contend with your tests for access to the storage.

Anomaly: I/O request consolidation
Characteristics: Somewhat paradoxically, both higher latency and higher throughput than expected. Particularly common for small sequential non-O_[D]SYNC writes.
Notes: Various I/O layers can group together multiple I/Os issued by your application before issuing them to the storage or a lower layer.

Anomaly: I/O request fragmentation
Characteristics: Higher latency and lower throughput than expected, particularly for large I/Os or NFS based NAS storage
Notes: Large application I/O requests can be broken down into multiple, smaller I/Os that are issued serially by intervening layers.

Anomaly: Read ahead
Characteristics:
- Improbably good sequential read performance
- Unexpectedly poor random I/O performance
- Performance that changes dramatically midway through a test
Notes: Many layers may decide to read ahead - that is, to optimistically fetch data adjacent to the requested data in case it is needed. If you have a sequential read workload, read ahead will substantially improve performance. If you have a random read workload, read ahead ensures the storage subsystem components will be doing a lot of unnecessary work that may degrade performance. Finally, some systems will try to discern the random or sequential nature of your workload and dynamically enable / disable read ahead. This can lead to inconsistent behavior; for example, a sequential read test may start slowly and then speed up once read ahead kicks in.

Anomaly: Tiered storage migration
Characteristics: Unexpectedly bad performance, especially during initial tests on high powered SANs such as EMC VMAX
Notes: Some storage systems cleverly use a mix of very high performance flash drives, fast hard disks, and slower large capacity hard disks. These systems dynamically move data among these tiers depending on their access patterns. Often data newly created for a test will be initially located on the slow high capacity disks - I have seen 8 kB random read latencies averaging around 20 ms, with spikes to around 100 ms, for initial tests on these kinds of 'high performance' 'Enterprise' storage systems.

Anomaly: First write penalty
Characteristics: Unexpectedly bad write performance, especially if it happens early in testing and is not reproducible
Notes: Many storage systems, volume managers, and some file systems will use some form of thin provisioning. In these systems, when an initial write happens into some region, additional overhead is required, such as adjusting some metadata and formatting the region. Subsequent writes to the same region will be faster. For example, a thin provisioned VMDK on VMware must be zeroed on first write - so a 1 kB application write can trigger a write of an entire VMFS block of 1 megabyte or more.

Anomaly: Elided reads
Characteristics: Unexpectedly good read performance on raw devices or regions that have not been written
Notes: Some file systems and storage systems know whether a region has been written to. Attempts to read from uninitialized regions can result in an immediate software-provided response of: "Here you go, all zeros!" - without actually engaging the disk hardware at all. Both VMFS and ZFS will do this, depending on configuration.

Anomaly: Compressed I/O
Characteristics: Unexpectedly, or even impossibly, good write or read performance
Notes: Some file systems will compress data. If your I/O test is writing out a pattern that compresses well (such as all 0s or all 1s), the amount of I/O submitted to and read from the physical storage will be a tiny fraction of your test's intended I/O workload.

Anomaly: Storage maintenance
Characteristics: Unexpectedly poor performance
Notes: Often when I speak to a storage administrator after getting unacceptable performance results, I learn there was some kind of maintenance happening at the time, such as migration of data to another storage pool, rebuilding of RAID configurations, etc.
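To make the compressed I/O anomaly concrete, here is a minimal Python sketch. It uses zlib as a stand-in for whatever compression a file system might apply, which is an assumption: real file systems use their own algorithms and block sizes. The helper name compressed_fraction and the 1 MiB buffer size are invented for this illustration.

```python
import os
import zlib


def compressed_fraction(data: bytes) -> float:
    """Return compressed size divided by original size (zlib, default level)."""
    return len(zlib.compress(data)) / len(data)


BLOCK = 1024 * 1024  # 1 MiB buffer; size chosen arbitrarily for the demo

zeros = bytes(BLOCK)             # the kind of pattern a naive I/O test writes
random_data = os.urandom(BLOCK)  # effectively incompressible

print(f"all zeros: {compressed_fraction(zeros):.4f} of original size")
print(f"random:    {compressed_fraction(random_data):.4f} of original size")
```

The zero buffer shrinks to a tiny fraction of its size while the random buffer does not shrink at all. A test that writes zeros to compressing storage therefore exercises far less physical I/O than it appears to. Random data is the opposite extreme, so data resembling what the application will produce remains the better choice.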
Avoiding Anomalies While Testing
Here is a summary of how to avoid these anomalies, with details below:
- Use a real workload if possible
- When simulating, be sure to simulate the actual application workload
- Evaluate latencies using histograms, not averages
- Verify your performance tests give reproducible results
- Run tests at the same time as the production application will run, and for sufficiently long durations
- Ensure the test data is similar to what the application will use and produce
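As a small illustration of why the list above recommends histograms over averages, the following Python sketch buckets a set of latencies and compares the mean with the 95th percentile. The sample data is synthetic and invented for this example (80 cache-speed reads at 1 ms and 20 disk-speed reads at 15 ms). The bucket edges and function names are likewise assumptions, not output from any real tool.

```python
import math
import statistics
from bisect import bisect_right

# Bucket boundaries in milliseconds (illustrative choice).
EDGES_MS = [0.5, 2, 5, 10, 20, 50]


def histogram(latencies_ms, edges=EDGES_MS):
    """Count samples per bucket.

    Bucket 0 holds values below edges[0]; bucket i holds values in
    [edges[i-1], edges[i]); the last bucket holds values >= edges[-1].
    """
    counts = [0] * (len(edges) + 1)
    for value in latencies_ms:
        counts[bisect_right(edges, value)] += 1
    return counts


def percentile(latencies_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic samples: mostly cache hits, plus a minority of real disk reads.
samples = [1.0] * 80 + [15.0] * 20

labels = ["< 0.5", "0.5-2", "2-5", "5-10", "10-20", "20-50", ">= 50"]
for label, count in zip(labels, histogram(samples)):
    print(f"{label:>6} ms | {'#' * (count // 2)} {count}")

print(f"mean = {statistics.mean(samples):.1f} ms")
print(f"p95  = {percentile(samples, 95):.1f} ms")
```

With this made-up data the mean lands between the two clusters, at a latency that no individual I/O actually experienced. The histogram and the 95th percentile both expose the slower disk-speed population.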
Use a real workload if possible. Unfortunately, often this isn't possible. For example, you probably won't be able to determine the exact workload of the month end close for the new ERP system while that system is being architected, which is when you'll need to design and select the storage.
When you must simulate, be sure to simulate what the application actually does in terms of I/O. This means understanding the read and write mix, I/O sizes and rates, as well as the semantics of the I/O that is issued. For example: are writes O_SYNC or O_DSYNC? Is Direct I/O used? fio is an amazing tool for performing I/O simulation and tests: it can reproduce most application workloads, has good platform support, and has an active development and user community.
When measuring I/O performance, be sure to use histograms to evaluate latencies rather than looking just at averages. Histograms show the existence of anomalies, and can clarify the presence of caches as well as the actual I/O performance that the disks are delivering. See, for example, these two images from an actual test on a customer system:
First a sequential read I/O test was run, followed by a random read I/O test. If we looked only at averages, we would have seen a sequential read latency of around 4 ms, quite good. Looking at the histogram distribution, however, it is clear we are getting a mix of 10-20 ms disk latencies and 0.5-2 ms latencies, presumably from cache. In the subsequent random I/O test we see improbably good performance, with an average of 1 ms and I/Os ranging as low as 100 microseconds. Clearly our working set has been mostly cached here - we can see the few actual random read disk accesses that are occurring by looking at the small bar in the 10-20 ms range. Without histograms it would be easy to mistakenly conclude that the storage was screaming fast and that we would not see latencies over 5 ms. For this reason, at Delphix, we use the 95th percentile latency as the guideline for how storage is responding to our tests. Again, fio is an excellent tool for I/O testing that reports latency histogram information.
Run tests multiple times to verify reproducibility. If your second and third test runs show different results than the first, none is a good basis for making a forecast of eventual application performance. If performance is increasing for later tests, most likely your data is becoming cached. If performance moves up and down, most likely you are on shared infrastructure.
Since shared infrastructure is common, run at least one test at the same time as when the key workload will run. On shared infrastructure it is important to test at the time when the actual application performance will matter. For example, test during the peak load times of your application, not overnight or on a weekend. As an aside, I am occasionally misinformed by customers that the infrastructure is not shared, only to learn later that it is.
For read tests, ensure your test data size is comparable with the size of the eventual application data - or at least much larger than any intervening caches. For example, if you are developing a 1 TB OLTP system, try to test over 1 TB of data files. Typical SANs have on the order of 10 GB of cache shared among multiple users. Many operating systems (notably Linux and Solaris) will tend to use all available system RAM as a read cache. This suggests 100 GB would be the absolute minimum test data size that wouldn't see substantial caching.
Run tests for long enough that ramp up or dynamic workload detection changes don't contribute substantially to your result. In practice, I find an adequate duration by running the test workloads over and over while progressively doubling the duration until I get two runs whose performance is within a close margin of one another.
Initialize your test files with data similar to what your application will use. This avoids first write penalties in the test, and ensures your tests are consistent with application performance when the storage uses compression.
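The duration-doubling procedure described above can be sketched in Python as follows. This is only an outline under stated assumptions: run_test is a hypothetical caller-supplied function that runs the workload for a given number of seconds and returns a single number (for example, 95th percentile latency). The 5% tolerance, the starting duration, and the upper limit are arbitrary illustrative values, not recommendations from the text.

```python
def find_stable_duration(run_test, start_seconds=60, tolerance=0.05,
                         max_seconds=3600):
    """Double the test duration until two consecutive runs agree.

    Returns (duration, metric) for the first run whose metric is within
    `tolerance` (relative difference) of the previous, shorter run.
    Raises RuntimeError if no such pair is found before max_seconds.
    """
    duration = start_seconds
    previous = run_test(duration)
    while duration * 2 <= max_seconds:
        duration *= 2
        current = run_test(duration)
        scale = max(abs(current), abs(previous))
        if scale == 0 or abs(current - previous) / scale <= tolerance:
            return duration, current
        previous = current
    raise RuntimeError("results did not stabilize within max_seconds")


# Stand-in for a real test: a made-up metric whose ramp-up distortion
# fades as the run gets longer. Replace it with a function that launches
# your actual test tool and parses its result.
def fake_run(seconds):
    return 10 + 100 / seconds


duration, metric = find_stable_duration(fake_run)
print(f"stable at {duration} s, metric {metric:.2f}")
```

Comparing each run only with the one before it keeps the number of test runs small. A stricter variant could require three consecutive runs to agree, at the cost of more test time.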
Architects must often forecast the I/O performance an application will receive from existing storage. Making an accurate forecast is surprisingly tricky. If you are trying to use a test workload to evaluate storage, there is a risk that the test will trigger some anomaly that makes the results invalid. If you are lucky, these invalid results will be clearly impossible and lead you to do more testing. If you are unlucky, they will appear reasonable and problems will arise during the production rollout of the application. An awareness of common root causes of I/O performance anomalies, and some rules of thumb for avoiding them while testing, can improve the accuracy of a performance forecast and reduce risks to an application rollout.