LSO and LRO Hazardous to Inter-VM Communication

This is a tale of how a set of environmental conditions and networking protocol interactions conspired in the perfect way to cause an interesting disk I/O performance problem for VMware VMs.  The setting is an IT infrastructure composed almost entirely of VMware ESX servers hosting VMs for users' testing needs.  The problem, as originally (and vaguely) reported by users, was that their VMs were "running slowly"; this was quickly narrowed down to disk I/O operations within VMs taking upwards of seconds to complete.

Although stated as a disk I/O performance problem, the root cause turned out to be a networking problem.  This is not entirely surprising if you look at the architecture of the system involved.  Each test VM's disk image is itself stored on a VM (named "dcenter") which exports these images via NFS to the ESX hosts.  All VMs in this infrastructure are running DelphixOS, which is based on illumos.  The picture looks something like this:

As you can see in this picture, each ZFS filesystem I/O operation originating from within a VM is mapped to VFS filesystem I/O within ESX on an NFS-mounted filesystem which is served by another VM (dcenter).  The dcenter VM itself is running on one of the ESX hosts, so the VM originating the I/O may or may not be running on the same host.  We quickly noticed, however, that this problem was unique to VMs running on the same ESX host as the dcenter VM.  VMs running on other ESX hosts were not affected.

We were able to quickly determine that NFS reads over the ESX vmdk mount were taking a very long time by running some tests with dd on the host itself.  Read throughput using dd was on the order of 1 to 5 MB/s.  Given that we didn't have very much observability into what was going on within ESX, our next step was to measure the latency of NFS operations at various layers using DTrace on dcenter (the NFS server).  Measured at the NFS layer, operations were completing relatively quickly.  In other words, once a request was received by the NFS layer from RPC, the I/O operations through ZFS were relatively fast.  At the RPC layer, however, we observed that NFS operations were taking hundreds of milliseconds, with spikes on the order of seconds, which was in line with the I/O response times reported via iostat from within the VMs.  This implied that requests or replies were being enqueued within RPC.
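
The NFS-layer half of that measurement can be sketched with the stable nfsv3 DTrace provider.  The following is a minimal reconstruction rather than the exact script we used (and it only covers reads); timing the RPC layer is less straightforward, since there is no stable provider for the kernel RPC code:

        # dtrace -n '
        /* per-operation NFSv3 read latency, measured at the NFS layer */
        nfsv3:::op-read-start
        {
                start[args[0]->noi_xid] = timestamp;
        }

        nfsv3:::op-read-done
        /start[args[0]->noi_xid]/
        {
                @["NFSv3 read latency (ns)"] =
                    quantize(timestamp - start[args[0]->noi_xid]);
                start[args[0]->noi_xid] = 0;
        }'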

RPC in DelphixOS has two queueing mechanisms.  One is a service queue onto which all requests associated with a given service (e.g., NFS) are placed while waiting for an available service thread.  Each service has a thread pool, so as long as there is a non-busy service thread in the pool, requests should not stay in this queue for very long (essentially as long as it takes for RPC to signal a thread to look for new requests in the queue, or to signal the service to create a new thread to do the same).  Under ideal circumstances, this should take microseconds.  Using DTrace, we could see that requests were spending milliseconds in the service queue, but nothing to account for the hundreds of milliseconds of lost time at the RPC layer.

The other queueing mechanism in RPC is a flow control mechanism that is enabled when TCP itself is flow controlled on the reply side.  TCP has a transmit buffer of a specific size (it's really a transmit queue with a size limit on the aggregate size of all data in the queue).  TCP transmits segments from this queue, but if the rate at which the transmitter (RPC in this case) produces data exceeds the rate at which TCP can drain data out of this queue, then the queue fills up, at which point TCP enables flow control on the transmitter.  This causes RPC to enqueue replies in its output queue, and to enqueue requests in its input queue.  Again using DTrace, I measured that NFS operations were spending almost all of their time in one or both of these queues waiting for TCP to clear the flow control condition.
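
As an aside (this is illustrative, not something from the original investigation), on illumos-derived systems such as DelphixOS the size of that TCP transmit buffer is exposed as the send_buf protocol property, which can be inspected and, if needed, tuned with ipadm; the 1MB value below is just an example:

        # ipadm show-prop -p send_buf tcp           # current transmit buffer size
        # ipadm set-prop -p send_buf=1048576 tcp    # example: raise it to 1MB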

The focus then shifted to TCP.  Why was TCP enabling flow control for such lengthy periods of time?  Stated differently, why was TCP so slow at draining segments from its transmit queue?  The rate of TCP transmission is a function of many things, including the available network throughput, the round trip time to the peer, the reliability of the network (factors taken into account by the TCP congestion control algorithm), the TCP receive window of the peer, and, relatedly, the rate at which the peer application is able to read from the TCP receive buffer.  As a starting point, I wanted to see if we were limited by the congestion window.  I wrote a DTrace script that, using the tcp provider, traced each segment sent and received.  The script displays times, sequence number information (including ACK numbers), and the size of the congestion window.  The result is something like this:

        5579198      -> snd 23360 bytes (seq = 2357437641 cwnd = 23360)
        5689117      <- ACK 2357461001
        5689153      -> snd 24820 bytes (seq = 2357461001 cwnd = 24820)
        5799152      <- ACK 2357485821
        5799183      -> snd 24820 bytes (seq = 2357485821 cwnd = 24820)
        5799934      <- ACK 2357510641
        5799952      -> snd 24820 bytes (seq = 2357510641 cwnd = 24820)
        5800061      <- ACK 2357535461

The times are in microseconds, relative to the time the script was started.  We could immediately identify some interesting facts based on just a few segments traced.  The TCP MSS was 1460, yet TCP is sending segments much larger than that.  This is because LSO (large segment offload) is being used.  Another fact that stands out is that the congestion window is very small, and we're sending an entire congestion window in one (large) segment.  The result of this is that we must wait for acknowledgement of a segment from the peer before continuing to transmit.  This shouldn't be an issue since the peer is very close by (it's the ESX host that our VM is running on!), and our data should be acknowledged extremely fast (right?).
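
For reference, a minimal sketch of a script along these lines, built on the stable tcp DTrace provider, might look like the following.  This is a reconstruction rather than the exact script used; it assumes the connection of interest is the NFS connection (port 2049) and that the connection state (args[3]) is available for the traced segments:

        # dtrace -qn '
        BEGIN { start = timestamp; }

        /* outbound segments from the NFS server (local port 2049) */
        tcp:::send
        /args[4]->tcp_sport == 2049/
        {
                printf("%10d  -> snd %d bytes (seq = %u cwnd = %d)\n",
                    (timestamp - start) / 1000,
                    args[2]->ip_plength - args[4]->tcp_offset,
                    args[4]->tcp_seq, args[3]->tcps_cwnd);
        }

        /* inbound ACKs from the peer */
        tcp:::receive
        /args[4]->tcp_dport == 2049/
        {
                printf("%10d  <- ACK %u (cwnd = %d)\n",
                    (timestamp - start) / 1000,
                    args[4]->tcp_ack, args[3]->tcps_cwnd);
        }'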

Returning to the trace output above, the next fact comes into play.  We can see that it takes the peer roughly 100ms to acknowledge the data that we send it, and it acknowledges _all_ of the data at once.  The reason it acknowledges all of the data at once is that the network driver for the virtual NIC used by ESX uses LRO (large receive offload, the inverse of LSO), which means that the TCP implementation of the peer will receive multiple TCP segments' worth of data in one large chunk.  The ESX network virtualization layer is smart enough to know that the destination of the large segment we transmit is attached to the same virtual switch and has LRO enabled, and so it doesn't bother to do TCP segmentation at all.  This seems like a valuable optimization to save CPU cycles doing needless TCP segmentation.  The result is that the same large chunk we transmit using LSO is received as-is by the peer using LRO.

So far so good, except that waiting 100ms to send an acknowledgement is seriously hampering the throughput of this connection.  This looks like a TCP delayed-ACK scheme gone awry.  As stated in RFC 1122, "A TCP SHOULD implement a delayed ACK, but an ACK should not be excessively delayed; in particular, the delay MUST be less than 0.5 seconds, and in a stream of full-sized segments there SHOULD be an ACK for at least every second segment."  In the case of LRO, the peer is receiving multiple segments simultaneously (more than two segments), and it should therefore not delay the ACK.  On the surface, this seems like an ESX TCP LRO bug.

That said, we would not have noticed this if the congestion window were not so small, so the small congestion window needed to be root-caused.  Continuing to examine the trace data produced by the DTrace script, we noticed an additional pattern.  Occasionally, we would see:

        2143945      -> snd 21900 bytes (seq = 2050686021 cwnd = 21900)
        2245254      <- ACK (cwnd = 21900)
        2245268      -> snd 21900 bytes (seq = 2050707921 cwnd = 21900)
        2345425      <- ACK (cwnd = 21900)
        2345433      -> snd 21900 bytes (seq = 2050729821 cwnd = 21900)
        2817517      -> snd 1460 bytes (seq = 2050729821 cwnd = 1460)
        2923963      <- ACK (cwnd = 1460)

We see here that we had a retransmission timeout event.  We can verify with mdb that the retransmission timeout for this connection was roughly 500ms, and that's the amount of time that elapsed before we timed out and retransmitted.  Notice that the retransmission timeout event lowered the congestion window to 1 MSS.  This is the expected behavior for the TCP Tahoe and Reno congestion control algorithms when a retransmission timeout sends the connection back into slow-start.  From that point on, even with LSO enabled, we send only a few kilobytes of data at a time before waiting for an acknowledgement.  As per the congestion control algorithm, we increase the congestion window 1 MSS at a time until we get another drop event and retransmission timeout, after which we start the cycle over again.  It turns out in this case that we never increase the congestion window to more than 30KB or so before a segment is dropped and we time out.

As a result of this combination (LSO + LRO + delayed ACKs + packet loss), we can only transmit a few tens of kilobytes of data at best per 100ms round trip (the roughly 25KB congestion windows seen in the traces work out to about 250KB/s), giving us an effective throughput of < 1MB/s.

To test this theory, I disabled LSO.  On a DelphixOS NFS server, this is as simple as:

        # ipadm set-prop -p _lso_outbound=0 ip
        # svcadm restart nfs/server
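
To confirm that the property took effect, or to undo the change later, something along these lines should work (hedged slightly, since _lso_outbound is a private property and its handling may vary between releases):

        # ipadm show-prop -p _lso_outbound ip     # verify the current setting
        # ipadm reset-prop -p _lso_outbound ip    # restore the default (LSO enabled)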

After this change, we could immediately see an increase in throughput (into the Gb/s range).  We could still observe frequent retransmissions due to packet drops, but the performance impact of these drops was drastically reduced since the drops were now only of individual segments, allowing TCP to do fast retransmits (the peer keeps acknowledging the last in-order data as later segments arrive, and the resulting duplicate ACKs let us infer the drop rather than waiting for a 500ms timeout).

The only mystery remaining was why we saw frequent packet drops for communication between an ESX host and one of its VMs.  The host in question was over-provisioned, CPU-wise.  One theory was that this was contributing to packet drops.  In the end, offloading VMs to other hosts eliminated the packet drop issue, lending some credence to this theory (although we never did have a smoking gun for this aspect of the problem).

As a result of this experience, we've learned a few things:

  1. There is a delayed-ACK bug in the version of ESX we were running.

  2. The combination of LSO + LRO for VM-to-VM communication can be disastrous if the host is over-provisioned and drops packets.  While the problem I was dealing with was in communication between a VM and its host, the same problem can occur for VM-to-VM communication.  Even without (1) in the picture (100ms delays between every TCP transmit), the frequent 500ms retransmission timeouts and slow-start events will kill connection throughput on their own.

Problem 1 isn't hard to solve (file a bug with VMware).  Problem 2 is trickier because LSO and LRO are ubiquitous these days, and the general thinking is that they should only be disabled to work around bugs.  Some operating systems do not provide a supported means of disabling these features for this reason.  In this case, LSO and LRO can be problematic due to environmental conditions, and not necessarily due to a bug.  There is no obvious great solution to this problem, but perhaps one is to design the network virtualization layer to always perform segmentation in software for VM-to-VM communication.