Debugging the debugger

Recently I was looking into a boot hang issue that we had been seeing on some of the VMs in our test environment.

The Bug

Recently I was looking into a boot hang issue that we had been seeing on some of the VMs in our test environment. The first thing I usually do when a system is hung is issue an NMI to get into kmdb. Once in the debugger I usually examine the stack traces of all the threads running on the system using following command:

[0]> ::walk thread | ::findstack -v
stack pointer for thread fffffffffbc2fac0: fffffffffbc72210
[ fffffffffbc72210 _resume_from_idle+0xf1() ]

Each time I tried to do this on one of the hung VMs my kmdb session would hang after printing out a single stack frame and I was forced to power cycle the VM. I asked a few colleagues if they had been having trouble using kmdb and it turned out that I was not alone. A few people had theorized that the problem was somehow related to using gcc to build the kernel but we didn’t have any concrete data to support this.

Analysis

At this point it seemed worthwhile to take a diversion from the boot hang issue to try and first figure out why kmdb was hanging. At first I was a bit unsure of how to even attack the problem since the debugger itself was hung but for these types of problems the fact that the OS was running as a VM is very beneficial. The first thing I did was login to the ESX server which was hosting the VM to check esxtop.

My VM was at the top of the list and was consuming a ton of CPU resources. The logical conclusion here is that it was stuck in some sort of tight loop, but where ?

VMware has a facility named VProbes which provides functionality similar to dtrace and can be used to inspect guest VM behavior. This seemed like the perfect tool to debug this issue and after skimming through the VProbes manual I wrote my first script to gather guest stack traces as fast as possible (10 times a second):

(defstring stack)

(vprobe VMM10Hz
        (gueststack stack)
        (printf "%sn" stack))

Using mdb I converted the stack traces of virtual addresses into stack traces with symbolic names. One stack trace that was particularly interesting was the following:

kdi_traperr14:
kmdbmod`kmt_reader+0x2c:
kmdbmod`kmt_rw+0xef:
kmdbmod`kmt_read+0x61:
kmdbmod`mdb_tgt_vread+0x1f:
kmdbmod`mdb_amd64_kvm_stack_iter+0xbc:
kmdbmod`kmt_stack_common+0xa5:
kmdbmod`kmt_stackv+0x2a:
kmdbmod`dcmd_invoke+0x7e:
kmdbmod`mdb_call_idcmd+0x112:
kmdbmod`mdb_call+0x3f9:
kmdbmod`yyparse+0xdb4:

The kdi_trapper14 function is the interrupt handler for page faults. After some more skimming of the VProbes manual I found out there is a probe that fires for every page fault. I wrote a new script to log a stack trace on each page fault which revealed the following stack:

kmdbmod`kmt_bcopy+0x5a:         movl   (%rdi),%eax
kmdbmod`kmt_reader+0x2c:        movq   %rbx,%rax
kmdbmod`kmt_rw+0xef:            cmpq   %rbx,%rax
kmdbmod`kmt_read+0x61:          movq   -0x30(%rbp),%rbx
kmdbmod`mdb_tgt_vread+0x1f:     addq   $0x20,%rsp
kmdbmod`mdb_amd64_kvm_stack_iter+0xbc:          cmpq   %r14,%rax
kmdbmod`kmt_stack_common+0xa5:  xorl   %eax,%eax
kmdbmod`kmt_stackv+0x2a:        addq   $0x20,%rsp
kmdbmod`dcmd_invoke+0x7e:       cmpl   $0x2,%eax
kmdbmod`mdb_call_idcmd+0x112:   cmpl   $0x4,%eax
kmdbmod`mdb_call+0x3f9:         call   +0x932   <kmdbmod`mdb_intr_disable>
kmdbmod`yyparse+0xdb4:          testl  %eax,%eax

I tried using dynamic VProbes to debug the problem further but apparently they aren’t supported under ESX. Now that I knew where the general problem was I was able to do some iterative printf debugging to locate the root cause of the problem.

In this case we are triggering a page fault while performing a bcopy. The way that kmdb handles page faults is by using setjmp()/longjmp(). If we hit a page fault during a read we will longjmp() back to the setjmp() location to handle the error. Here is the relevant c code from kmt_rw():

size_t chunksz;                                                
        size_t n, ndone;                                                        
        jmp_buf *oldpcb = NULL;                                                 
        jmp_buf pcb;                                                            
        ssize_t res;                                                            
                                                                                
        kmdb_prom_check_interrupt();                                            
                                                                                
        if (nbytes == 0)                                                        
                return (0);                                                     
                                                                                
        /*                                                                      
         * Try to process the entire buffer, as requested.  If we catch a fault,
         * try smaller chunks.  This allows us to handle regions that cross     
         * mapping boundaries.                                                  
         */                                                                     
        chunksz = nbytes;                                                       
        ndone = 0;                                                              
        if (setjmp(pcb) != 0) {                                                 
                if (chunksz == 1) {                                             
                        /* We failed with the smallest chunk - give up */       
                        kmdb_dpi_restore_fault_hdlr(oldpcb);                    
                        return (ndone > 0 ? ndone : -1); /* errno set for us */ 
                } else if (chunksz > 4)                                         
                        chunksz = 4;                                            
                else                                                            
                        chunksz = 1;                                            
        }               

After a few iterations, we should return from this function with an error, but what I discovered through instrumentation was that the value of chunksz was never changing. Why? Obviously, if chunksz isn’t changing and was initially greater than 1, we will never error out of kmt_rw().

Again from the above c code we can see that chunksz (a local variable) is set to the function’s nbytes parameter. If we disassemble the kmt_rw code we can see how the chunksz local variable is modified:

kmt_rw+0x77:                    call   +0x30d44                      
kmt_rw+0x7c:                    testl  %eax,%eax                  (A)
kmt_rw+0x7e:                    movq   -0x98(%rbp),%r12           (B)
kmt_rw+0x85:                    je     +0x1c                    
kmt_rw+0x87:                    decq   %r12
kmt_rw+0x8a:                    je     +0x90                   
kmt_rw+0x90:                    cmpq   $0x5,-0x98(%rbp)                      
kmt_rw+0x98:                    sbbq   %r12,%r12
kmt_rw+0x9b:                    andq   $0xfffffffffffffffd,%r12              
kmt_rw+0x9f:                    addq   $0x4,%r12
kmt_rw+0xa3:                    leaq   -0x90(%rbp),%rdi                      
kmt_rw+0xaa:                    xorl   %r13d,%r13d
kmt_rw+0xad:                    call   -0xc192      
kmt_rw+0xb2:                    movq   %rax,%r14
kmt_rw+0xb5:                    jmp    +0xf                     
kmt_rw+0xb7:                    nop    
kmt_rw+0xb8:                    addq   %rax,-0xa8(%rbp)                      
kmt_rw+0xbf:                    addq   %rax,-0xa0(%rbp)                      
kmt_rw+0xc6:                    cmpq   %r12,-0x98(%rbp)                      
kmt_rw+0xcd:                    movq   -0x98(%rbp),%rbx                      
kmt_rw+0xd4:                    movq   -0xa8(%rbp),%rdx                      
kmt_rw+0xdb:                    movq   -0xa0(%rbp),%rdi                      
kmt_rw+0xe2:                    cmovq.a %r12,%rbx                            
kmt_rw+0xe6:                    movq   %rbx,%rsi
kmt_rw+0xe9:                    call   *-0xb0(%rbp)               (C)

(A) - This is where we will branch on the page fault.
(B) - This is where chunksz is set to the nbytes parameter value.
(C) - The page fault occurs in this call.

So from this code we can see that chunksz is a register and is always loaded from the stack. Nowhere between (A) and (C) is the value of -0x98(%rbp) modified. This explains why the value is never changing when we hit a page fault.

Looking at a version of the kmbdmod that was compiled with the Sun compiler shows that it does not have this problem:

kmt_rw+0x7f:                    call   +0x38d04 
kmt_rw+0x84:                    testl  %eax,%eax
kmt_rw+0x86:                    je     +0x58    
kmt_rw+0x88:                    movq   -0x90(%rbp),%r8           (A)
kmt_rw+0x8f:                    cmpq   $0x1,%r8
kmt_rw+0x93:                    je     +0x22    
kmt_rw+0x95:                    movq   $0x4,%r9
kmt_rw+0x9c:                    cmpq   %r8,%r9
kmt_rw+0x9f:                    jb     +0xd     
kmt_rw+0xa1:                    movq   $0x1,-0x90(%rbp)          (B)
kmt_rw+0xac:                    jmp    +0x32    
kmt_rw+0xae:                    movq   %r9,-0x90(%rbp)           (C)
kmt_rw+0xb5:                    jmp    +0x29    
kmt_rw+0xb7:                    movq   -0x98(%rbp),%rdi

(A) - Load into chunksz.
(B) - Store of chunksz into the stack.
(C) - Store of chunksz into the stack.

So it is true that this problem started showing up when we switched to gcc but according to the following excerpt from the setjmp() man page, this behavior is not a bug in gcc.”

All accessible objects have values as of the time longjmp() was called, except that the values of objects of automatic storage duration which are local to the function containing the invocation of the corresponding setjmp() which do not have volatile-qualified type and which are changed between the setjmp() invocation and longjmp() call are indeterminate.

Solution

The solution is straight-forward given the setjmp() documentation.  In the kmdb code we are modifying a non-volatile local variable (chunksz) between the setjmp() and longjmp() call so according to the setjmp() documentation its value will be indeterminate. I changed the code to make chunksz volatile, rebuilt the kmdb module and disassembled the code.

kmt_rw+0x7a:                    call   +0x30d61 
kmt_rw+0x7f:                    testl  %eax,%eax
kmt_rw+0x81:                    je     +0x33    
kmt_rw+0x83:                    movq   -0x58(%rbp),%rax
kmt_rw+0x87:                    cmpq   $0x1,%rax
kmt_rw+0x8b:                    je     +0xaf    
kmt_rw+0x91:                    cmpq   $0x5,%rax
kmt_rw+0x95:                    leaq   -0x58(%rbp),%rsi
kmt_rw+0x99:                    movq   $0xfffffffffbd78b12,%rdi
kmt_rw+0xa0:                    sbbq   %rax,%rax
kmt_rw+0xa3:                    andq   $0xfffffffffffffffd,%rax
kmt_rw+0xa7:                    addq   $0x4,%rax
kmt_rw+0xab:                    movq   %rax,-0x58(%rbp)          (A)

(A) - Store of chunksz.
 

Much better! chunksz is now being stored in the stack, and when I try it out, kmdb no longer hangs while examining the thread stacks.

This bug is a good example of some of the subtle risks that are involved when changing compilers. In this case gcc exposed a latent bug in the kmdb code that had undefined behavior which happened to work correctly when compiled with the Sun compiler.

Now back to debugging the boot hang…