Thursday, November 5, 2009

Debugging web apps with strace

Want to be an advanced debugger? My #1 Superman debugging tool is Linux’s strace. If you have ever run into problems where a user complains that the site is slow, and you can’t figure out why, you may want to give strace a try.


strace is a system call tracer, i.e. a debugging tool that prints out a trace of all the system calls made by another process or program.

In other words, strace tells you what a program is doing at the system call level: every request it makes to the kernel, such as opening a file, reading from a socket, or taking a lock. This is great for finding the problems where a page just "hangs" for no apparent reason. Let’s walk through what it takes to set up strace on Apache in a LAMP environment, with some real world examples that I’ve run into.

First, you’ll need to install strace, if it isn’t already installed. My favorite method is just yum install strace, but if you want to, you can download and compile it yourself.

Next, you will need a place where you can test the slow page. For the rest of this article, we will assume you have a development environment all to yourself, where you can start/stop Apache at will and no one else will be using it. Note: If a separate development environment isn’t available, I suggest running another Apache on a different port, say 81 instead of 80. This way you can debug on the production machine without affecting end users.

Environment set up? Good. Let’s get down to debugging.

1. Start Apache in "Debug Mode" with the -X option. This has Apache start one process, instead of a bunch of children, and then all the requests will go through one process.

# httpd -X

2. In another terminal window, find the process id for the listening Apache that you just started. This should do the trick:

# ps auxw | grep httpd

3. Once you have the process id, attach strace with the -p option:

# strace -p $processidofapacheprocess

4. Go to your browser and load the URL that is hanging. While it is running, watch the output from strace in your terminal window. You’ll see a ton of system calls stream by, but the important thing to look for is where it stops. What is it doing?
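For requests that are slow rather than completely hung, it can help to have strace print timing information while you watch. Here is a minimal sketch, assuming a reasonably recent strace. The helper name attach_strace is made up for illustration and is not part of strace or Apache:

```shell
# Hypothetical helper (not from the original post): attach strace to one
# process id with timestamps, so that a stall is easy to spot.
attach_strace() {
    pid=$1
    # Refuse anything that is not a plain number before calling strace.
    case "$pid" in
        ''|*[!0-9]*) echo "usage: attach_strace PID" >&2; return 2 ;;
    esac
    # -tt    : wall-clock time with microseconds at the start of every line
    # -T     : time spent inside each system call, appended as <seconds>
    # -s 128 : show up to 128 bytes of string arguments (the default is 32)
    strace -tt -T -s 128 -p "$pid"
}
```

With -T, each completed call ends with the time it took, so a long pause shows up as a large number at the end of the line. A call that never returns is simply left unfinished on the last line of output.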

I’ve used this approach to find several "Superman" level problems (problems that other people had spent at least a day, and sometimes weeks, trying to figure out). Here are some examples.

1. Sendmail hanging via PHP - The reported problem was that certain pages were slow (30-300 seconds). Load on the machines seemed fine, but certain requests were painfully slow. strace revealed that the PHP script was waiting for sendmail to come back with a response. Looking further, we found that sendmail was doing a reverse DNS lookup that was timing out, which resulted in a 30+ second delay. Problem resolved by reconfiguring sendmail.

2. PHP pages slow on an NFS server - The reported problem was a development environment with pages that were slow to load. strace revealed that the pages were hanging at a flock call to a directory that was mounted via NFS. Here’s the actual output from strace:

…pages of output snipped…
fcntl(24, F_SETFL, O_RDWR) = 0
sendto(24, "incr toys:stats:request_with_ses"…, 40, MSG_DONTWAIT, NULL, 0) = 40
poll([{fd=24, events=POLLIN|POLLERR|POLLHUP, revents=POLLIN}], 1, 500) = 1
recvfrom(24, "76\r\n", 8192, MSG_DONTWAIT, NULL, NULL) = 4
open("/home/phpsessions/sess_079113645a3da0fe50f68e4ce6ed58d2", O_RDWR|O_CREAT, 0600) = 25
flock(25, LOCK_EX

So we can see here that the file /home/phpsessions/sess_079113645a3da0fe50f68e4ce6ed58d2 has been opened, and the flock call is hanging. Turns out NFS doesn’t deal well with flock. When we saw this, there was a big smack on the forehead. Why on earth were the sessions being stored via NFS anyway? Especially for a development server, where only one box needed to store them. To solve the problem, we changed session.save_path in the PHP configuration file to a directory that was not on NFS.
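A quick way to check whether a directory such as the session path lives on NFS is to ask for its filesystem type. This is a minimal sketch, assuming GNU coreutils stat (the -f and -c options behave differently on BSD and macOS). fs_type and on_nfs are hypothetical helper names:

```shell
# Print the type of the filesystem that holds the given path.
# Assumes GNU stat: -f queries the filesystem, -c %T prints its type by name.
fs_type() {
    stat -f -c %T -- "$1"
}

# Succeed (exit status 0) if the path is on an NFS mount, fail otherwise.
on_nfs() {
    case "$(fs_type "$1")" in
        nfs*) return 0 ;;
        *)    return 1 ;;
    esac
}
```

For example, on_nfs /home/phpsessions && echo "sessions are on NFS" would flag the setup described above before it bites.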

3. Memcached hanging - Again, certain requests were hanging, causing pages to be slow to load. Again, strace to the rescue! Turns out PHP was hanging when talking to memcached. Once this was determined, we also ran strace on memcached, and found a bug with the particular memcached client we were using via PHP. We upgraded the memcached client to the latest version, and the problem was solved.

In all of the above cases, the problem could have been found through other means, but strace made it much easier and faster to figure out where the slowdowns were.

There are other helpful uses of strace. In addition to finding hanging web pages, I’ve also used strace to find why/where Apache was segfaulting. Just run strace and look at the last thing it did before the crash. It should give you an indication of why the script stopped when it did.

Also, I’ve used strace as a troubleshooting tool to find out where most of the time is being spent by analyzing the entire request.
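One way to do that kind of analysis is to save a trace with per-call timings and then filter it. This is a minimal sketch, assuming the trace was written with strace's -T and -o options (for example strace -T -o /tmp/httpd.trace -p PID, where the file name is just an example). slow_syscalls is a hypothetical helper name:

```shell
# Print every traced call that took at least MIN seconds.
# Usage: slow_syscalls MIN TRACEFILE
# Relies on the <seconds> suffix that `strace -T` appends to completed calls;
# unfinished calls have no suffix and are skipped.
slow_syscalls() {
    awk -v min="$1" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            # Strip the surrounding < and > and force a numeric value.
            secs = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (secs >= min) print
        }' "$2"
}
```

Running slow_syscalls 1 /tmp/httpd.trace would list only the calls that blocked for a second or more, which is usually a much shorter list than the full trace.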

Good luck in your adventures with strace. It’s been a big help for me. Feel free to leave a comment with your findings.
