First customer support escalation
A lot gets written about software engineering that goes well. Less is written about the inevitable bugs that develop as software becomes increasingly complex. Recently, I had an opportunity to help diagnose and solve one of those bugs in some of our software that was deployed in the field. The underlying cause of the bug itself was very interesting, and I had a lot of fun troubleshooting it and developing a solution.
Delphix is shipped as a virtual machine that is deployed at customer sites. The Delphix Engine, our name for this virtual machine, connects to customer machines, which we refer to as environments, and collects information about which databases are running on those systems. We support a variety of databases, including Oracle, MSSQL, Postgres, and MySQL.
We also support many of the operating systems on which those databases run, such as Linux, Windows, AIX, and Solaris. Supporting many environments introduces a lot of complexity, and that complexity ended up leading to this bug. I received the bug as an escalation from our support staff. Escalations are typically filed against engineering when our support staff cannot resolve an issue on their own.
The customer was an international telecom provider whose Oracle database was running on a Solaris SPARC machine. SPARC is a RISC-based processor architecture that was developed by Sun Microsystems. During the 90s and early 2000s, Sun sold SPARC systems running a version of Solaris designed for SPARC.
When a Delphix Engine connects to a target environment, it pushes over several prebuilt binaries that are used for operations. These are run on the customer environment to discover and interact with their databases. One of these binaries, rsync, was crashing when our engine tried to execute it. It was crashing because it was linked against a version of libc that was newer than the one on the customer machine.
Our customer support had already diagnosed this and run nm libc.so.6 on the customer machine to dump the object symbols for libc, so I knew which version of libc the binary needed to be linked against in order to run. It turned out that the customer was using an older version of Solaris 10 than the one we had originally tested, which is why the versions of libc differed.
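The comparison that support did by hand can be approximated with a short script. The sketch below, in Python, assumes the BSD-style three-column output that GNU nm prints by default (value, type letter, name), and the symbol names and addresses in the sample data are made up for illustration. It lists symbols that a binary leaves undefined and that a given libc dump does not define. A real binary also pulls symbols from other libraries, so the result is a starting point, not a verdict.

```python
def parse_nm(text):
    """Split BSD-style nm output into (defined, undefined) symbol name sets."""
    defined, undefined = set(), set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:       # undefined symbols have no value column
            kind, name = parts
        elif len(parts) == 3:     # value, type letter, name
            _, kind, name = parts
        else:
            continue              # skip blank or unrecognized lines
        if kind == "U":
            undefined.add(name)
        else:
            defined.add(name)
    return defined, undefined


def missing_symbols(binary_nm, libc_nm):
    """Symbols the binary needs that this libc dump does not define."""
    libc_defined, _ = parse_nm(libc_nm)
    _, needed = parse_nm(binary_nm)
    return sorted(needed - libc_defined)


# Hypothetical sample dumps; none of these values come from the real systems.
BINARY_NM = """\
00010000 T main
         U memcpy
         U printf
         U hypothetical_new_call
"""

LIBC_NM = """\
0002a000 T memcpy
0003b000 T printf
0004c000 W strlen
"""

print(missing_symbols(BINARY_NM, LIBC_NM))
```

Anything the script prints is a symbol the binary expects and the older libc cannot supply, which is the kind of mismatch that makes the binary fail at load time.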
There were a couple of seemingly easy solutions that were not available to me. I couldn't build the binary and statically link it (as opposed to the dynamic linking used by the binary we were shipping) because the functions in libc depend on the kernel they run against, so a copy of libc baked in at build time isn't guaranteed to work on a different system.
Upgrading the customer system was out of the question since it was their production server. We didn't have a machine available in-house on which I could build the binary, since all our Solaris 10 systems had a more recent version of libc. Compiling the binary on the customer machine wasn't an option either, since the customer wouldn't want a full development toolchain installed on their production instance.
After quickly exhausting the most obvious options, I started to research other ways to get an appropriate build of the binary. One of the reasons this was a difficult problem is that getting a machine with the right version of Solaris is quite hard. Solaris SPARC doesn't have good virtual machine emulation since it isn't x86, and it was popular before virtual machines became widely used.
The installation media is also next to impossible to find. When I initially received the escalation, we filed a support request with Oracle to get the installation media for an older version of Solaris 10. However, we didn't want to depend on Oracle for this, and even once we got the media I would have needed to head down to the server closet at our headquarters and manually install the OS from disc onto the physical SPARC machines that we keep in-house for testing.
Another option I considered was using the sites that some universities run, which provide prebuilt packages of SPARC binaries for various versions of Solaris. However, the one I found was linked against an external library not on the customer system, and there were concerns about putting random binaries from the internet onto customer production instances. Another option that I eliminated was building the binary on a machine with a newer version of libc but telling the linker to link against an old version of libc. The most helpful article on the internet about this didn't work out for rsync.
After exhausting all these options, I decided to attempt a more complicated solution. Solaris has support for running containers, a concept popularized by Docker. In Solaris they are referred to as zones, and you can run a different version of Solaris than your base operating system in something called a branded zone. My plan was to use a Solaris 9 zone running on our Solaris 10 machine to build the binary. Since libc is backward compatible, a binary linked against the older libc on Solaris 9 would still run on the customer's Solaris 10 installation.
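The compatibility argument behind this plan can be sketched in a few lines of Python. This is a simplified model, not how the runtime linker is actually implemented: it assumes a dynamically linked binary records the set of libc interface versions it was built against, and that it loads only if the host's libc defines all of them. The version labels are hypothetical placeholders, not the real version names from any Solaris release.

```python
def can_run(required_versions, host_versions):
    """Return (ok, missing). ok is True when the host libc defines every
    interface version the binary was linked against."""
    missing = sorted(set(required_versions) - set(host_versions))
    return (not missing, missing)


# Hypothetical version labels. Each newer libc keeps all the older ones,
# which is what backward compatibility means in this model.
CUSTOMER_LIBC = {"v1", "v2", "v3"}          # older Solaris 10 (assumed)
IN_HOUSE_LIBC = {"v1", "v2", "v3", "v4"}    # newer Solaris 10 (assumed)

# Built on the newer in-house system: may end up requiring v4.
shipped = can_run({"v1", "v4"}, CUSTOMER_LIBC)

# Built in a Solaris 9 zone: can only require versions that already existed.
rebuilt = can_run({"v1", "v2"}, CUSTOMER_LIBC)

print(shipped)
print(rebuilt)
```

Under these assumptions the binary built on the newer system is rejected because the customer libc lacks one version, while the binary built on the older system requires nothing the customer does not already have.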
Fortunately for me, we have a number of ex-Sun employees at the company, and I was able to jump on our engineering Slack channel with my plan. After some initial confusion about why I wanted to set up a branded zone, I got some good information about how to set up a ZFS filesystem to back my zone. Then I used the Oracle docs to run through installing the zone.
The Solaris 9 installation media for the branded zone was on a rather obscure and underused Oracle site, but it worked. After churning through setup, I was able to log in to my Solaris 9 zone. I tried cd-ing around and found that the shell I had didn't have tab completion. Man, this was old.
I immediately encountered a few problems. The network stack didn't seem to want to come up, so I couldn't copy files into the zone using scp. I was able to find a cool Perl script that someone had hacked together that uses the zone login prompt to pipe a file into the zone filesystem. With that I copied over the source tarball for rsync. Then I found that I had no GNU build toolchain. I had expected this since it was a Sun operating system. More worrisome was that the zone installation media I had downloaded from Oracle didn't even include the Sun C compiler.
Without a compiler, I had no way to bootstrap a working version of gcc and, from there, the rest of my toolchain. After about an hour of searching the internet, I eventually found an abandoned FTP site with various packages for SPARC Solaris machines. After some educated guesses about the file naming scheme, I was able to find a copy of gcc that installed and worked in my zone.
From there I bootstrapped make and the rest of my toolchain. Then I edited the rsync build options to only link against libraries on the customer system and built rsync. When I ran ldd -v rsync, I saw that it required an old version of libc that was present on the customer system. That was a pretty awesome feeling. Then I spent a while figuring out how to copy the binary out of the zone. It turns out you can just mount the zone filesystem externally, so I never actually needed the Perl script. Sometimes it pays to read the documentation.
I assembled a hotfix package with the rsync binary, and a day later I heard that it had been installed and worked at the customer site. It was a great feeling to squash a complicated bug via a rather interesting solution, with the payoff of a happy customer.