magpiebrain

Sam Newman's site, a Consultant at ThoughtWorks

It seems I spoke too soon. Just one day after I thought I had tracked down the source of the trouble, yesterday evening brought another outage. The graph in CloudWatch was all too familiar, showing a huge uptick in CPU use. The box was again unresponsive and had to be restarted. When I checked cpu_log for a likely culprit, the entries looked odd:

[plain light=”true”]
2011-07-13 00:12:22 www-data 26096 21.4 0.9 160732 5972 ? D Jul12 5:30 /usr/sbin/apache2 -k start
2011-07-13 00:12:25 www-data 26096 21.4 0.9 160736 6040 ? R Jul12 5:30 /usr/sbin/apache2 -k start
2011-07-13 00:12:22 www-data 26096 21.3 0.9 160732 5972 ? D Jul12 5:30 /usr/sbin/apache2 -k start
2011-07-13 00:12:22 root 26179 24.0 0.0 4220 584 ? S 00:12 0:00 /bin/sh /home/ubuntu/tools/cpu_log
[/plain]
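Reading entries like these by eye gets tedious once the log grows. Below is a minimal Python sketch of how they could be summarised. It assumes each cpu_log line is a timestamp followed by standard `ps aux` columns, which is what the excerpt above looks like; the names `SAMPLE_LOG`, `parse_line` and `summarise` are made up for illustration and are not part of the real cpu_log script.

```python
from collections import Counter

# Lines copied from the cpu_log excerpt above.
SAMPLE_LOG = """\
2011-07-13 00:12:22 www-data 26096 21.4 0.9 160732 5972 ? D Jul12 5:30 /usr/sbin/apache2 -k start
2011-07-13 00:12:25 www-data 26096 21.4 0.9 160736 6040 ? R Jul12 5:30 /usr/sbin/apache2 -k start
2011-07-13 00:12:22 www-data 26096 21.3 0.9 160732 5972 ? D Jul12 5:30 /usr/sbin/apache2 -k start
2011-07-13 00:12:22 root 26179 24.0 0.0 4220 584 ? S 00:12 0:00 /bin/sh /home/ubuntu/tools/cpu_log
"""

# Assumed column layout: timestamp (date, time) followed by `ps aux` columns.
FIELDS = ("date", "time", "user", "pid", "cpu", "mem", "vsz", "rss",
          "tty", "stat", "start", "cputime", "command")


def parse_line(line):
    """Split one log line into named fields, or return None if it is malformed."""
    parts = line.split(None, len(FIELDS) - 1)  # keep the command (with spaces) whole
    if len(parts) != len(FIELDS):
        return None
    entry = dict(zip(FIELDS, parts))
    entry["cpu"] = float(entry["cpu"])
    entry["rss"] = int(entry["rss"])  # resident memory in KB, as reported by ps
    return entry


def summarise(log_text):
    """Return (peak CPU per command, count of process states) for a log."""
    entries = [e for e in map(parse_line, log_text.splitlines()) if e]
    peak_cpu = {}
    for e in entries:
        peak_cpu[e["command"]] = max(peak_cpu.get(e["command"], 0.0), e["cpu"])
    states = Counter(e["stat"] for e in entries)
    return peak_cpu, states


peak_cpu, states = summarise(SAMPLE_LOG)
for command, cpu in sorted(peak_cpu.items(), key=lambda kv: -kv[1]):
    print(f"{cpu:5.1f}%  {command}")
print(dict(states))
```

In `ps` output a state of `D` means uninterruptible sleep, usually a process waiting on I/O, so counting states alongside CPU figures can hint at whether the box is busy computing or busy waiting on disk.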

No entries from Postfix – good – but now other processes are having trouble. This was starting to point away from one rogue process gobbling CPU, to high CPU use being a symptom of something else. What can cause very high CPU use? Among other things, swapping memory. A process chewing up all available memory could easily cause these kinds of symptoms. A quick scan through syslog showed me something I should have spotted earlier. If it wasn’t the smoking gun, then at least something pretty close:

[plain light=”true”]
Jul 13 00:11:28 domU-12-31-39-01-F0-E5 kernel: [38837.985499] apache2 invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[/plain]

This doesn’t mean that Apache is to blame, just that an apache2 process happened to be the one whose request for memory triggered oom-killer, which then picks a process to kill in order to free up memory. And just prior to the outage itself in the Apache logs:

[plain light=”true”]
[Tue Jul 12 23:48:24 2011] [error] (12)Cannot allocate memory: fork: Unable to fork new process
[/plain]
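Both of these messages are easy to miss in a busy log. As a rough sketch, something like the following Python could flag them. The two search strings come straight from the log lines quoted above; the names `MEMORY_PATTERNS` and `find_memory_events` are hypothetical, and in practice the lines would be read from the real syslog and Apache error log rather than from a list.

```python
import re

# Markers of memory pressure, taken from the two log excerpts above.
MEMORY_PATTERNS = {
    "oom_killer": re.compile(r"invoked oom-killer"),
    "fork_failed": re.compile(r"Cannot allocate memory"),
}


def find_memory_events(lines):
    """Return (event_name, line) pairs for every line matching a known marker."""
    events = []
    for line in lines:
        for name, pattern in MEMORY_PATTERNS.items():
            if pattern.search(line):
                events.append((name, line.strip()))
    return events


# Sample input: the two quoted lines plus one unrelated cpu_log-style line.
sample_lines = [
    "Jul 13 00:11:28 domU-12-31-39-01-F0-E5 kernel: [38837.985499] apache2 invoked oom-killer: "
    "gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0",
    "[Tue Jul 12 23:48:24 2011] [error] (12)Cannot allocate memory: fork: Unable to fork new process",
    "2011-07-13 00:12:22 root 26179 24.0 0.0 4220 584 ? S 00:12 0:00 /bin/sh /home/ubuntu/tools/cpu_log",
]

events = find_memory_events(sample_lines)
for name, line in events:
    print(f"{name}: {line}")
```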

At this point my mind was already turning to the fact that I hadn’t done *any* tuning of Apache processes or PHP. After googling around for a bit, a few things looked wrong in my config. Here was the untuned default that Ubuntu gave me:

[plain light=”true”]
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 150
MaxRequestsPerChild 0
</IfModule>
[/plain]

In general MaxClients refers to the maximum number of simultaneous requests that will be served. On prefork Apache, like mine, MaxClients is also the maximum number of child processes that get spawned. A simple ps showed that even after a restart, each Apache process was consuming up to 35MB of memory. The host in question has 1GB of RAM. It was clear that even with nothing else running on the box, with that sort of memory footprint I would exhaust memory way before the MaxClients threshold was reached. Even more worrying, MaxRequestsPerChild was set to zero, meaning that the child processes would never be restarted. If a memory leak was occurring inside a child Apache process, it could carry on eating memory until the box was brought to its knees. After some quick maths I decided to reduce MaxClients to a more manageable 25, and also to set MaxRequestsPerChild to 1000. My hope is that this may buy me some more time to try and track down where the memory use is occurring.
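The quick maths can be sketched in a few lines of Python. The 1GB of RAM, 35MB per process and MaxClients of 150 come from the post; the 128MB reserved for everything else on the box is purely an assumption for illustration, as is the helper name `max_clients_for`.

```python
def max_clients_for(ram_mb, per_process_mb, reserved_mb=0):
    """Largest number of child processes that fit in RAM without swapping."""
    usable_mb = ram_mb - reserved_mb
    return max(usable_mb // per_process_mb, 0)


RAM_MB = 1024            # 1GB host, from the post
PER_PROCESS_MB = 35      # worst-case size of one Apache child, from ps
DEFAULT_MAX_CLIENTS = 150

# Memory needed if Apache ever reached the default MaxClients.
worst_case_mb = DEFAULT_MAX_CLIENTS * PER_PROCESS_MB
print(worst_case_mb)     # 5250, roughly five times the RAM available

# Ceiling with nothing else running on the box.
print(max_clients_for(RAM_MB, PER_PROCESS_MB))                   # 29

# Ceiling with a hypothetical 128MB kept back for the OS and other services.
print(max_clients_for(RAM_MB, PER_PROCESS_MB, reserved_mb=128))  # 25
```

Any per-process leak makes the real ceiling lower over time, which is where a non-zero MaxRequestsPerChild helps: recycling children bounds how far each one can grow.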

This has spurred me on to finally invest some time looking at nginx. This weekend may see me putting in nginx side by side with a view to moving away from Apache. From all reports this may allow me to run my sites with a much lower footprint. But if the last couple of days have taught me anything, it’s that I shouldn’t be so quick to rush to the conclusion that I’ve finally tracked this problem down. If I’m still having trouble at the weekend, I may well just clone the box and try to reproduce the problem with some performance tests.

2 Responses to “Apache and the case of the missing memory”

  1. Sam

    Thanks for this!
    I have been seeing a similar issue with one of my Apache servers and your experience has helped me.
    Keep up the good work 🙂

  2. Ain

    Thanks for the post, did this conf change finally get you on a safe side or did you find something else that you had to do about resolving the problem?
