It’s not everyday you see command aliasing causing native memory exhaustion.
Sep 26, 2019
At Delphix, some of the most exhilarating yet demanding (and almost painstakingly tough) experiences as an engineer have come from solving customer support cases. Throughout these challenges, I’ve learned that one of the most important yet ill-defined skills to master is debugging. Each of these incidents have given me an opportunity for growth and reinforced the need for teamwork, systematic thinking, and an openness to learning new things in brand-new ways.
Recently, a support case was brought to our attention when a Delphix engine crashed. Tony Slack, one of our support engineers, collected and analyzed information about the system and found that the Java process running Delphix’s management stack had run out of native memory. When this happens, the stack automatically takes a core dump and shuts down the application, knowing that we can’t make any more progress. My job was to figure out why we ran out of memory in the first place.
I had plenty to look at as we collect quite a bit of useful debugging information on customers’ systems. I used a number of Java debugging tools to figure out what was going on at the exact time of the crash. Using
jstack the first thing I noticed was that there were about 10 times more Java threads than we expected to be running on a standard Delphix system. Since each thread reserves some space for itself, that explained why it ran out of memory.
This led me to my next big question: where did all these threads come from and what do they want?
From the aggregated jstack output, I was able to identify three stacks that were each repeated nearly 900 times, which gave me another clue as to which threads were out of control.
The first stack was executing a KeepAlive routine for a SSH connection.
The second was trying to read from a socket (hooked up via a SSH connection).
The third was executing a subroutine of LogSync, a feature we use to keep a virtualized copy of the customer’s Oracle database up-to-date with the source.
After each sync, the LogSync process cleans up the files on the external host we are syncing the database from, but this was where the threads were hanging, in a function called removeFile, deleting a file on the remote machine by executing a command over SSH. This detail shed a light on what these threads were trying to do: open up a SSH connection, run a deletion command over that connection, and wait for the output from that command. The response was never coming because the deletions weren’t finishing.
My next strategy was to learn more about the SSH connections themselves to see if there was a common theme. I dug deeper on the information we gathered on the customer’s systems — the output of the
pfiles command which records all the files that a particular process has opened. Because sockets are just another kind of file, I looked for all the files of type
IFSOCK that were outgoing connections to a host. After aggregating by IP address, I saw hundreds of port 22 (i.e., SSH) connections going to the same IP. I was able to narrow the scope of the bug to a single machine because the customer had a number of remote hosts connected to their Delphix engine.
At this point, I decided to look for an insider perspective on this part of the product. Mike Stewart, an engineer on our Oracle team, came to the rescue and talked through the high-level implementation of the LogSync feature. He agreed the issue might have to do with the SSH connection doing the log removal, and he posited the removal command itself -
rm -f was hanging. With this theory, support worked with the customer to examine the host. There were a large number of these files left around, confirming that the deletion was not successful. However, there was nothing unexpected about the permissions of the files, which would have been the most obvious cause for the failed delete. We attempted a manual deletion and saw that rm asked for confirmation: are you sure you want to delete this?
This gave Mike an idea. What if the SSH connection that executed the file deletion didn’t know how to handle the rm confirmation message? He tried out this theory on a test VM and found the SSH thread did indeed hang when returned a prompt. But there was just one problem - that’s what the
rm -f (the command LogSync used) was supposed to take care of.
From rm’s man page:
-f ignore nonexistent files and arguments, never prompt
This seemed like a dead-end, unless there was some other way that a prompt was being produced disregarding the -f. After sharing this with a wider group, another engineer suggested the customer might have aliased (reassigned) the rm command to
rm -i, a configuration that might have been put in place to guard against accidental deletions.
-i prompt before every removal
Again, we worked with the customer to inspect their system and using the
alias command found that our theory was right! This meant that everytime SnapSync tried to forcibly delete a file, it actually ran
rm -i -f and got a prompt anyway. Interestingly,
rm -i -f does different things depending on which OS you’re using. On Linux,
rm -f -i gives a prompt but
rm -i -f does not. On Solaris (the OS of the customer’s host), however,
-i always negates
The short term solution for this customer was to remove the alias on their system. Support cleaned up the excess files and the SSH commands were able to start completing.
However, this bug also highlighted a number of areas for improvement in our product.
The way the SSH command execution hung on the command prompt instead of progressing or giving up.
The way connections were able to build up until running the process out of memory, rather than timing out or being cleaned up in another way.
At a high level, this issue highlighted an area where our software was susceptible to seemingly unrelated user system configuration, and breaking this dependency will make it more robust. Though we’d prefer customers to have a flawless product experience, escalations always teach us a lot. We discovered an unexpected failure condition and the design problems that led to it, and we look forward to taking these lessons forward to create a better product for everyone.
Though the solution to this specific escalation wasn’t especially complex, it was well hidden. The trail started with OS memory management and I followed clues that led me through Java process management, our Oracle SnapSync functionality, SSH connections, and finally to simple shell scripting. The journey drove me to learn new tools, immerse myself in unfamiliar areas of the product, and leverage collaboration between engineers all across the organization. Most of all, it reminded me that I enjoy debugging problems like this because it pushes me out of my comfort zone. Debugging always shows me how much I’ve grown and areas that I can work on next. Plus, there’s just something so satisfying about finally tracking down the answer - it’s not everyday you see command aliasing causing native memory exhaustion.