Delphix on Flash
We recently published a white paper with Pure Storage. The experiment, results and conclusions have already been published in various forums: here, here and here.: here, here and here. A quick review of this work: I was able to run an Online Transaction Processing (OLTP) workload on 26 Virtual Databases (VDBs), while achieving 910,000 Transactions Per Minute (TPM) in aggregate.
The databases were provisioned from a Delphix Engine running with an all-flash array from Pure Storage. To achieve comparable performance on a setup with traditional rotating media based storage would cost 10x more. The big factor is the number of IO Operations Per Second(IOPS) required to sustain the high TPM each VDB was running at. In my experiment, each VDB was running at ~40K TPM, requiring close to 2.5K IOPS.
A disk-based system was bottlenecked on the IOPS, producing only around 35K TPM. The rotating media was benchmarked by Chas Dye from Pure Storage. The results are clear, but here I would like to talk about the journey. How did I get to 26 VDBs? why did I stop there? And how using VDBs enabled me to finish the study in a very short time. I picked the Sales Order Entry (SOE) benchmark from swingbench for this study.
This workload mimics the popular TPC-C benchmark from the Transaction Processing Council. While it is a relevant benchmark for our customers, it also stresses the I/O subsystem behind the database -- the key component in this study. I had 2 ESX hosts at my disposal for this work. I created a VDB, loaded 1TB of SOE schema onto it, and took a snapshot of the VDB.
A snapshot is simply a crash consistent copy of the database. Delphix allows you to take snapshots without any storage overhead in a matter of seconds. I provisioned the remaining VDBs off of this snapshot. This process propagates the optimizations and settings to the rest of the VDBs. This is a typical use case for many of our customers. A single VDB is provisioned from production DB and it is prepared for the non-production cycle using a variety of operations like transformations, masking, error injection, obfuscation, etc.
The rest of the downstream non-prod environments, like Test, Dev, Stage and Load-Test work off of VDBs provisioned from the first VDB. You will see shortly how this model made my task a whole lot easier. I started with a single VDB and with a little bit of tuning, my initial setup produced 150K TPM. The top wait events were CPU and read I/O from Delphix. Keeping the benchmark CPU bound allows me to scale the performance by adding more VDBs.
The goal is to add more VDBs and generate more parallel load on them, while maintaining the same wait events. When I added the second VDB, I noticed the benchmark score did not scale as expected. After digging around a bit, I narrowed the problem down to the benchmark client. Swingbench uses JDBC connections for its users to drive load onto the database. I saw JDBC connection timeouts even when the backend I/O coming from Delphix did not change.
On further investigation, I realized the problem -- the VDBs were running out of Oracle processes. This OTN article pointed me to the fix. As soon as I bumped up oracle processes on the VDBs to 250, I was back on track, saw much better scaling beyond 2 VDBs. Adding more VDBs to the train... up to 8.
The next issue I saw was also a problem on the benchmark client side. I was still getting connection drops from the client, though the responses from the backend remained good. Top wait events did not change, the system was still bound by CPU and IO responses from backend. My colleague helped me realize that once again the issue was with the load generator for the benchmark.
The Oracle JDBC driver relies on /dev/random to produce random numbers to login a user. Problem is, /dev/random blocks the requester when the entropy drains on load. This causes the JDBC driver to timeout its connections. I followed the simple solution that was suggested in this OTN forum thread about JDBC connection resets. I pointed /dev/random to /dev/urandom, which reduces the quality of output when the entropy drains instead of blocking the requester. This relieved the bottleneck and I could scale beyond 8 VDBs.
Adding more VDBs to the train...climbed to 10. Scaling problems popped up once more. I was powering the VDBs on 3 VMs, all running along with the Delphix engine on the same ESX host. Even though CPU utilization on the hosts only showed 70%, I saw that the VMs themselves were starved for CPU; %RDY was in the high 80s when I looked at 'esxtop'. Thanks to the easy-to-follow guidelines in my colleague Matt's blog about esxtop, I was able to identify this pathology. I moved the VMs onto a second ESX host and made sure the CPU and memory were reserved for the VMs.
Once I relieved this bottleneck, I started seeing much better scaling, even beyond 10 VDBs Adding more VDBs to the train... reached 15. The next bottleneck while increasing the number of VDBs was CPU on the host. I could not scale beyond 15 VDBs because the VMs were out of CPU horsepower. The OLTP workload I ran on the databases is fairly CPU intensive, especially when the backend I/O from Delphix is responding with low latency. In order to continue adding more VDBs, I drafted another ESX host in my lab and kept going.
A few days before Oracle Open World I was finally able to get 26 VDBs running.
One of the goals for this experiment was to use as little caching as possible. Even on the Delphix Engine I was using only 16GB of cache, less than the minimum we recommend for our customers.
The idea was to push I/O onto the backend, to demonstrate that Delphix can sustain high IOPS despite sharing blocks across all the VDBs. As I added more VDBs, I kept reducing the SGA to make space on the ESX hosts, while maintaining TPM comfortably higher compared to the score from rotating media. By the time I had 26 VDBs running, I had to drop the SGA for each of the VDBs down to 4GB.
Combined with the memory on the Delphix system, I was only using 1/5th of the RAM used on the setup for rotating media. The reduced SGA led each VDB to run only at ~39.5K TPM. I would like to now highlight two aspects of this journey that I felt were lost in all the press and the releases we did on this work.
Delphix+Pure Storage I/O Scaling
We were able to scale up to 26 VDBs, but why not any more?
Even though we achieved the hoped-for result once I had all 26 VDBs running, I was not yet satisfied. I saw that the backend I/O latency had not gone down enough. There was no change in the wait events on the VDBs. Average latency of reads as seen by the database was still around 3-5ms. I saw the potential to add more VDBs and push the score even higher. I had the I/O capacity from the Pure Storage back-end at my disposal, as well as from the Delphix Engine to continue adding more databases, but I was blocked by CPU on the ESX hosts.
While this ended my experiment, this is where we need to be if we want to scale this higher without exploding the cost. Most performance critical installations will have a bottleneck, ideally we want it to be a relatively cheap resource that can be scaled as desired. Today, CPU is an affordable commodity to scale, and the same level of performance gets cheaper every year.
Using Delphix database virtualization, I could easily provision more VDBs and continue to scale this benchmark much higher, but I had to stop at 26 VDBs given that I had run out of all the available ESX hosts in my lab.
Throughout the process of scaling to more VDBs, the experiment parameters required iterated adjustments.
At each step I had to tweak settings on the VDBs, check the result and readjust them before being able to proceed. I managed to run through numerous optimizations on many databases only because I was working with VDBs. Imagine instead the agonizing eternity of having to tweak an SGA parameter on 26+ databases running load simultaneously. You would have to run a workload for each 1TB, manually change and bounce all the databases after each trial, and rejig the parameter until you reach the desired result.
Working with VDBs, all of them sourced from a single VDB, made this entire operation relatively painless and much less time consuming. I changed the SGA on the source VDB, took a snapshot of it, then refreshed all the remaining VDBs to that snapshot. Being able to operate the snapshot and refresh process from our UI spared me from having to manually bounce the VDBs and repeat the changes 26 times for a single run-through. Such efficiency allowed me to wrap up this whole experiment in less than two weeks, just in time for Open World.
This process demonstrated to me first-hand the power of agility we provide to our customers.
Thanks to Chas Dye and the rest of the Pure Storage team for providing the storage and benchmarking on the rotating media and Adam/Rick Caccia for seeding the idea and the partnership.