Running the poor man's sampler detailed in my previous post, I found that I got a heap of samples looking like this:
Thread 1 (Thread 0xb6f436f0 (LWP 1657)):
#0 0xb7828430 in __kernel_vsyscall ()
#1 0xb7370f93 in read () from /lib/tls/i686/cmov/libc.so.6
#2 0xb731aedb in _IO_file_underflow () from /lib/tls/i686/cmov/libc.so.6
#3 0xb731dcc8 in __underflow () from /lib/tls/i686/cmov/libc.so.6
#4 0xb731a888 in ?? () from /lib/tls/i686/cmov/libc.so.6
#5 0xb731c7b8 in _IO_sgetn () from /lib/tls/i686/cmov/libc.so.6
#6 0xb73103be in fread () from /lib/tls/i686/cmov/libc.so.6
#7 0x0811028f in main (argc=2, argv=0xbffccf54) at src/server.cpp:252
That is, I was waiting for a read to complete, sometimes for 5-10 seconds, sometimes for a few minutes.
What was this troublesome read? An unexpected socket call? Nope - something I would never have expected. Here's the "offending" code.
FILE * f = fopen("/dev/random", "r");
unsigned seed;
fread( &seed, sizeof(seed), 1, f);
fclose(f);
srandom(seed);
This code seeds the random number generators using some values pulled from the random device /dev/random.
Reading from /dev/random blocks until there is enough entropy in its internal entropy pool to complete the read. On an isolated VM there's not much system noise generating entropy and so the pool was emptying quickly.
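One way to watch the pool drain is to read the kernel's own entropy estimate. The sketch below assumes a Linux system that exposes /proc/sys/kernel/random/entropy_avail; the function name entropy_available is made up for illustration and is not part of the server code. Newer kernels may report a constant value here, so treat the number as a rough indicator only.

```cpp
#include <cstdio>

// Hypothetical helper: returns the kernel's entropy estimate in bits,
// or -1 if the proc file is missing or cannot be parsed.
int entropy_available()
{
    FILE *f = std::fopen("/proc/sys/kernel/random/entropy_avail", "r");
    if (!f)
        return -1;

    int bits = -1;
    if (std::fscanf(f, "%d", &bits) != 1)
        bits = -1;  // unexpected file contents

    std::fclose(f);
    return bits;
}
```

Logging this value just before the seeding code runs should show whether the pool is close to empty at the moment the read stalls.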
Turns out there's a non-blocking random device that uses feedback of hashed values to prevent blocking when the entropy pool is low, so I switched this to use "/dev/urandom" and all was OK.
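For reference, here is a minimal sketch of what the seeding could look like after the switch, with the return values checked so a failed open or short read doesn't leave the seed uninitialised. The helper names read_seed and seed_from_urandom are my own for this example and are not taken from server.cpp.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: reads one unsigned value from the given device.
// Returns false if the device cannot be opened or the read comes up short.
bool read_seed(const char *path, unsigned *seed)
{
    FILE *f = std::fopen(path, "rb");
    if (!f)
        return false;

    // fread returns the number of complete items read; we expect exactly 1.
    const bool ok = std::fread(seed, sizeof(*seed), 1, f) == 1;
    std::fclose(f);
    return ok;
}

// Seeds random() from /dev/urandom, which does not block on a low pool.
bool seed_from_urandom()
{
    unsigned seed = 0;
    if (!read_seed("/dev/urandom", &seed))
        return false;

    srandom(seed);
    return true;
}
```

Calling seed_from_urandom() once at startup replaces the original five lines; if it returns false the caller can decide whether to fall back to something like the current time or give up.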
Now maybe I could have found this by doing some traditional debugging (or maybe not, as that may have generated enough noise to stop /dev/random blocking!). But using dumps from gdb, I found the problem in a few minutes.