magpiebrain

Sam Newman's site, a Consultant at ThoughtWorks

On my current client project, I have split the management of configuration for our various environments into two problem spaces – provisioning hosts, and configuring hosts. Part of the reason for this separation is that although we are targeting AWS, we need to be able to support alternative services in the future, but I also consider the two types of task to be rather different and to require different types of tools.

For provisioning hosts I am using the Python AWS API Boto. For configuring the hosts once provisioned, I am using Puppet. I remain unconvinced as to the relative merits of PuppetMaster or Chef Server (see my previous post on the subject) and so have decided to stick with using PuppetSolo so I can manage versioning how I would like. This leaves me with a challenge – how do I apply the puppet configuration for the hosts once provisioned with Boto? I also wanted to provide a relatively uniform command-line interface to the development team for other tasks like running builds etc. Some people use cron-based polling for this, but I wanted a more direct form of control. I also wanted to avoid the need to run any additional infrastructure, so mcollective was never something I was particularly interested in.

After a brief review of my “Things I should look at later” list it looked like time to give Fabric a play.

Fabric is a Python-based tool/library which excels at creating command-line tools for machine management. Its bread and butter is script-based automation of machines via SSH – many people in fact use hand-rolled scripts on top of Fabric as an alternative to systems like Chef and Puppet. The documentation is very good, and I can heartily recommend the Fabric tutorial.

The workflow I wanted was simple. I wanted to be able to check out a specific version of the code locally, then run one command to bring up a host and apply a given configuration set. My potentially naive solution to this problem is to simply tar up my Puppet scripts, upload them, and then run Puppet. Here is the basic script:

[python]
from fabric.api import local, put, run, settings, task

@task
def provision_box():
    # Bring up a new EC2 host and get its public DNS name
    public_dns = provision_using_boto()

    # Bundle the Puppet scripts locally
    local("tar cfz /tmp/end-bundle.tgz path/to/puppet_scripts/*")
    with settings(host_string=public_dns, user="ec2-user", key_filename="path/to/private_key.pem"):
        # Bootstrap Puppet, then upload and apply the bundle
        run("sudo yum install -y puppet")
        put("/tmp/end-bundle.tgz", ".")
        run("tar xf end-bundle.tgz && sudo puppet --modulepath=/home/ec2-user/path/to/puppet_scripts/modules path/to/puppet_scripts/manifests/myscript.pp")
[/python]

The provision_using_boto() function is an exercise left to the reader, but the Boto documentation should point you in the right direction. If you put the above task in your fabfile.py, all you need to do is run fab provision_box to do the work. The first yum install command is there to handle bootstrapping of Puppet (as it is not on the AMIs we are using) – this will be a no-op if the target host already has it installed.
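For readers who want a starting point, here is a minimal sketch of what a provisioning function might look like. It is not the function from my actual scripts. It assumes conn is a boto EC2 connection (for example, the result of boto.ec2.connect_to_region), and the parameter names, default instance type and timeout values are all illustrative. Unlike the zero-argument call in the task above, this version takes its settings as arguments.

```python
import time


def provision_using_boto(conn, ami_id, key_name, instance_type="t1.micro",
                         poll_seconds=5, timeout_seconds=300):
    """Launch one instance and return its public DNS name once it is running.

    `conn` is assumed to be a boto EC2 connection; the AMI id, key name and
    instance type are placeholders supplied by the caller.
    """
    reservation = conn.run_instances(ami_id, key_name=key_name,
                                     instance_type=instance_type)
    instance = reservation.instances[0]

    waited = 0
    while instance.state != "running":
        if waited >= timeout_seconds:
            raise RuntimeError("instance did not reach 'running' in time")
        time.sleep(poll_seconds)
        waited += poll_seconds
        instance.update()  # refresh the instance state from EC2

    return instance.public_dns_name
```

Note that an instance reporting itself as running does not guarantee that SSH is accepting connections yet, so a real script would probably need to retry the first Fabric command.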

This example is much simpler than the actual scripts, as we have also implemented some logic to re-use EC2 instances to save time and money, and a simplistic role system to manage different classes of machines. I may write up those ideas in a future post.


I’ll be speaking on the cloud track at JAX London 2011. The talk “Private Cloud, A Convenient Fiction” attempts to puncture some of the FUD on the subject. The nuances between various hosting solutions are many and varied, and don’t suit being put into neat boxes like ‘public’ and ‘private’. When I talk to clients about what is right for them, the things we discuss place different providers on a number of axes, which I hope to get across in this talk.

As always, you can track the conference (or note that you plan to attend) at Simon & Nat’s startup Lanyrd, which is several shades of awesome.


It seems I spoke too soon. Just one day after I thought I had tracked down the source of the trouble, yesterday evening brought another outage. The graph in CloudWatch was all too familiar, showing the huge uptick in CPU use. The box was again unresponsive and had to be restarted. Checking cpu_log for a likely culprit, the entries looked odd:

[plain light="true"]
2011-07-13 00:12:22 www-data 26096 21.4 0.9 160732 5972 ? D Jul12 5:30 /usr/sbin/apache2 -k start
2011-07-13 00:12:25 www-data 26096 21.4 0.9 160736 6040 ? R Jul12 5:30 /usr/sbin/apache2 -k start
2011-07-13 00:12:22 www-data 26096 21.3 0.9 160732 5972 ? D Jul12 5:30 /usr/sbin/apache2 -k start
2011-07-13 00:12:22 root 26179 24.0 0.0 4220 584 ? S 00:12 0:00 /bin/sh /home/ubuntu/tools/cpu_log
[/plain]

No entries from Postfix – good – but now other processes are having trouble. This was starting to point away from one rogue process gobbling CPU, to high CPU use being a symptom of something else. What can cause very high CPU use? Among other things, swapping memory. A process chewing up all available memory could easily cause these kinds of symptoms. A quick scan through syslog showed me something I should have spotted earlier. If it wasn’t the smoking gun, then at least something pretty close:

[plain light="true"]
Jul 13 00:11:28 domU-12-31-39-01-F0-E5 kernel: [38837.985499] apache2 invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[/plain]

This doesn’t mean that Apache is to blame, just that it was the process whose request for memory triggered oom-killer in an attempt to free some up. And just prior to the outage itself, in the Apache logs:

[plain light="true"]
[Tue Jul 12 23:48:24 2011] [error] (12)Cannot allocate memory: fork: Unable to fork new process
[/plain]
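Scanning syslog by eye is easy to get wrong, which is probably why I missed this the first time. A rough sketch of automating the search for oom-killer lines like the one above might look like this. The function name and the regular expression are my own illustration, based only on the single log line shown here.

```python
import re

# Matches kernel messages such as "apache2 invoked oom-killer: ..."
OOM_PATTERN = re.compile(r"(\S+) invoked oom-killer")


def find_oom_events(lines):
    """Return (line, process_name) pairs for each oom-killer invocation."""
    events = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            events.append((line.rstrip("\n"), match.group(1)))
    return events
```

Passing an open file handle for the syslog file as lines should work, since the function only iterates over it.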

At this point my mind was already turning to the fact that I hadn’t done *any* tuning of Apache processes or PHP. After googling around for a bit, a few things looked wrong in my config. Here was the untuned default that Ubuntu gave me:

[plain light="true"]
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 150
MaxRequestsPerChild 0
</IfModule>
[/plain]

In general MaxClients refers to the maximum number of simultaneous requests that will be served. On prefork Apache, like mine, MaxClients also refers to the maximum number of child processes that get spawned. A simple ps showed that even after a restart, each Apache process was consuming up to 35MB of memory. The host in question has 1GB of RAM – it was clear that even with nothing else running on the box, with that sort of memory footprint I would exhaust memory way before the MaxClients threshold was reached. Even more worrying, MaxRequestsPerChild was set to zero, meaning that the child processes would never be restarted. If a memory leak was occurring inside a child Apache process, it could carry on eating memory until the box was brought to its knees. After some quick maths I decided to reduce MaxClients to a more manageable 25, and to set MaxRequestsPerChild to 1000. My hope is that this may buy me some more time to try and track down where the memory use is occurring.
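The quick maths goes roughly as follows. This is a back-of-the-envelope sketch using the numbers above, treating 1GB as 1024MB and taking the 35MB worst case for every child process. It ignores everything else running on the box, so the real headroom is smaller.

```python
TOTAL_RAM_MB = 1024        # the host has 1GB of RAM
PER_PROCESS_MB = 35        # worst case seen in ps after a restart
DEFAULT_MAX_CLIENTS = 150  # the untuned Ubuntu default
TUNED_MAX_CLIENTS = 25     # the value I settled on


def worst_case_mb(max_clients, per_process_mb=PER_PROCESS_MB):
    """Memory needed if every allowed child process reaches the worst case."""
    return max_clients * per_process_mb


def max_processes_that_fit(total_mb=TOTAL_RAM_MB, per_process_mb=PER_PROCESS_MB):
    """How many worst-case child processes fit in RAM with nothing else running."""
    return total_mb // per_process_mb


print(worst_case_mb(DEFAULT_MAX_CLIENTS))  # 5250 (MB), over five times the RAM
print(max_processes_that_fit())            # 29
print(worst_case_mb(TUNED_MAX_CLIENTS))    # 875 (MB)
```

So the default allows roughly five times more worst-case Apache memory than the box has, while 25 children stay under the 1GB ceiling with a little room to spare.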

This has spurred me on to finally invest some time looking at nginx. This weekend may see me putting in nginx side by side with a view to moving away from Apache – from all reports this may allow me to run my sites with a much lower footprint. But if the last couple of days have taught me anything, it’s that I shouldn’t be so quick to rush to the conclusion that I’ve finally tracked this problem down. If I’m still having trouble at the weekend, I may well just clone the box and try to reproduce the problem with some performance tests.


I’m in the process of migrating the many sites I manage from Slicehost over to EC2 (which is where this blog is currently running). I hit a snag in the last day or two – my Montastic alerts told me that the sites I had already migrated were not responding. I tried – and failed – to SSH into the box. The CloudWatch graphs for the instance showed 100% CPU use, explaining why SSH was unresponsive. The problem was that I couldn’t tell what was causing it. My only option was to restart the instance, which at least brought it back to life.

What I needed was something that would tell me what was causing the problem. After reaching out to The Hive Mind, Cosmin pointed me in the direction of some awk and ps foo. This little script gets a process listing, and writes out all those rows where the CPU is above 20%, prepended with the current timestamp:

[plain light="true"]
ps aux | gawk '{ if ( $3 > 20 ) { print strftime("%Y-%m-%d %H:%M:%S")" "$0 } }'
[/plain]

My box rarely goes above 5% CPU use, and I was worried about the CPU ramping up so quickly that I didn’t get a sample, so this threshold seemed sensible. The magic is the if ( $3 > 20 ) – this only emits the line if the third column of input from ps aux (which is the CPU) goes above 20.
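For anyone who doesn’t have gawk to hand, the same filter can be sketched in Python. This is an illustrative equivalent rather than what I actually run. The function name and the now parameter are my own, and it works on ps aux output that has already been captured as a string.

```python
from datetime import datetime


def high_cpu_lines(ps_output, threshold=20.0, now=None):
    """Return the ps aux rows whose %CPU exceeds threshold, timestamp first."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    selected = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            cpu = float(fields[2])  # third column of ps aux is %CPU
        except ValueError:
            continue  # header row: "%CPU" is not a number
        if cpu > threshold:
            selected.append(stamp + " " + line)
    return selected
```

The output of ps aux could be captured with the subprocess module and passed straight in.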

I put the one-liner in a script, then stuck the following entry into cron to ensure that the script gets run every minute. If everything is OK, there is no output. Otherwise, I’ll get a listing of the offending processes. This wouldn’t stop the box getting wedged again, but would at least tell me what caused it.

[plain light="true"]
* * * * * root /home/ubuntu/tools/cpu_log >> /var/log/cpu_log
[/plain]

Lo and behold, several hours later and the box got wedged once again. After a restart, the cpu_log showed this:

[plain light="true" wraplines="false"]
2011-07-11 17:55:42 postfix 6398 29.6 0.3 39428 2184 ? S 17:55 0:01 pickup -l -t fifo -u -c
2011-07-11 17:55:42 postfix 6398 29.6 0.3 39428 2180 ? S 17:55 0:01 pickup -l -t fifo -u -c
2011-07-11 17:55:42 postfix 6398 29.6 0.2 39428 1556 ? S 17:55 0:01 pickup -l -t fifo -u -c
2011-07-11 17:55:42 postfix 6398 24.6 0.2 39428 1368 ? S 17:55 0:01 pickup -l -t fifo -u -c
2011-07-11 18:16:43 root 6440 50.0 0.0 30860 344 ? R 18:16 0:01 pickup -l -t fifo -u -c
[/plain]

Matching what the CloudWatch graphs showed me, the CPU ramped up quite quickly, before I lose all output (counting the timestamp as a single column, the 4th column here is CPU). But this time, we have a culprit – Postfix’s pickup process. I had configured Postfix just a day or two back, so clearly something was amiss. For now, I can at least disable Postfix while I spend some time diagnosing the problem.

Limiting CPU

Something else that turned up in response to my cries for help was cpulimit. This utility would allow me to cap how much CPU a given process uses. If and when I re-enable Postfix, I’ll almost certainly use this to avoid future outages while I iron out any kinks.
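As a rough idea of how that might be wired into a script, the sketch below builds a cpulimit command line for a given process ID using its -p (pid) and -l (percentage limit) options. The helper name and the 20% figure are purely illustrative, and I haven’t settled on real values yet.

```python
def cpulimit_command(pid, limit_percent):
    """Build the argument list to cap the process `pid` at `limit_percent` CPU."""
    if limit_percent <= 0:
        raise ValueError("limit_percent must be positive")
    return ["cpulimit", "-p", str(pid), "-l", str(limit_percent)]


# Example: cap a (hypothetical) process 6398 at 20% CPU
print(" ".join(cpulimit_command(6398, 20)))
```

The resulting list could be handed to subprocess to actually start cpulimit.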
