magpiebrain

The site of Sam Newman, a consultant at ThoughtWorks

I’m in the process of migrating the many sites I manage from Slicehost over to EC2 (which is where this blog is currently running). I hit a snag in the last day or two – my Montastic alerts told me that the sites I had already migrated were not responding. I tried – and failed – to SSH into the box. The CloudWatch graphs for the instance showed 100% CPU use, explaining why SSH was unresponsive. The problem was that I couldn’t tell what was causing it. My only option was to restart the instance, which at least brought it back to life.

What I needed was something that would tell me what was causing the problem. After I reached out to The Hive Mind, Cosmin pointed me in the direction of some awk and ps foo. This little script gets a process listing and writes out every row where the CPU is above 20%, prepended with the current timestamp:

[plain light="true"]
ps aux | gawk '{ if ( $3 > 20 ) { print strftime("%Y-%m-%d %H:%M:%S")" "$0 } }'
[/plain]

My box rarely goes above 5% CPU use, and I was worried about the CPU ramping up so quickly that I wouldn’t get a sample, so this threshold seemed sensible. The magic is the if ( $3 > 20 ) – this only emits a line if the third column of the ps aux output (which is the CPU percentage) is above 20.
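To see why it’s $3, the header line of ps aux shows %CPU as the third column:

[plain light="true"]
$ ps aux | head -1
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
[/plain]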

I put the one-liner in a script, then stuck the following entry into cron to ensure that the script gets run every minute. If everything is OK, there’s no output; otherwise, I get the ps entries for whichever processes are above the threshold. This wouldn’t stop the box getting wedged again, but it would at least tell me what caused it. (A sketch of the script itself is below, after the cron entry.)

[plain light="true"]
* * * * * root /home/ubuntu/tools/cpu_log >> /var/log/cpu_log
[/plain]
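For completeness, the cpu_log script needn’t be anything more than the one-liner with a shebang – something along these lines would do the job (a minimal sketch):

[plain light="true"]
#!/bin/bash
# Log any process using more than 20% CPU, prefixed with a timestamp
ps aux | gawk '{ if ( $3 > 20 ) { print strftime("%Y-%m-%d %H:%M:%S")" "$0 } }'
[/plain]

Note that the cron line above has a user field (root), so it belongs in /etc/crontab or a file under /etc/cron.d, rather than in a per-user crontab (which has no user column).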

Lo and behold, several hours later the box got wedged once again. After a restart, the cpu_log showed this:

[plain light="true" wraplines="false"]
2011-07-11 17:55:42 postfix 6398 29.6 0.3 39428 2184 ? S 17:55 0:01 pickup -l -t fifo -u -c
2011-07-11 17:55:42 postfix 6398 29.6 0.3 39428 2180 ? S 17:55 0:01 pickup -l -t fifo -u -c
2011-07-11 17:55:42 postfix 6398 29.6 0.2 39428 1556 ? S 17:55 0:01 pickup -l -t fifo -u -c
2011-07-11 17:55:42 postfix 6398 24.6 0.2 39428 1368 ? S 17:55 0:01 pickup -l -t fifo -u -c
2011-07-11 18:16:43 root 6440 50.0 0.0 30860 344 ? R 18:16 0:01 pickup -l -t fifo -u -c
[/plain]

Matching what the CloudWatch graphs showed me, the CPU ramped up quite quickly before I lost all output (the figure after the PID is the CPU percentage). But this time we have a culprit – Postfix’s pickup process. I had configured Postfix just a day or two back, so clearly something was amiss. Nonetheless, I can at least now disable Postfix while I spend some time diagnosing the problem.
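Disabling Postfix on Ubuntu is simple enough – something along these lines should do it (assuming the stock init scripts):

[plain light="true"]
sudo service postfix stop         # stop the running daemon
sudo update-rc.d postfix disable  # stop it coming back at boot
[/plain]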

Limiting CPU

Something else that turned up in response to my cries for help was cpulimit. This utility lets me cap how much CPU a given process can use. If and when I re-enable Postfix, I’ll almost certainly use it to avoid future outages while I iron out any kinks.
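Usage looks straightforward – something like the following, by process name or by PID (check the man page for the flags your version supports):

[plain light="true"]
# cap any process named "pickup" at 20% CPU
sudo cpulimit -e pickup -l 20 &

# or target a specific PID instead
sudo cpulimit -p 6398 -l 20 &
[/plain]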

2 Responses to “Tracking down rogue processes”

  1. Jake

    It seems like the script is missing something. It gives me end of line errors. Is it missing a tick mark perhaps? Pretty interesting script. I am having the same problem with a WordPress site on EC2 that is eating up memory, and I am trying to track down the culprit script. Thanks for the posts on your experience!
