magpiebrain

The site of Sam Newman, a consultant at ThoughtWorks

A week from today, Jez Humble and I will be jointly presenting a one-day Continuous Delivery tutorial at one of my favourite conferences, Goto: Aarhus. I'm also fortunate enough to have been asked to present my Designing For Rapid Release talk on Mike Nygard's track on Tuesday.

Goto: Aarhus is an excellent conference, and tickets may still be available if you can make it. If you do, make sure you use the code newm1000 so you can save DKK 1.000 (approx £115 / €135 / $180). I hope to see you there!


For over a year now I have been running day-long training sessions on AWS with ThoughtWorks. I have helped roll out this training globally, giving sessions in both Australia and the USA. The course covers the main building blocks of the AWS offering, including:

  • EC2
  • EBS
  • S3

A mix of theory and hands-on sections, it is ideal for anyone interested in getting started with AWS. Attendees need only a basic knowledge of the command line, although any *NIX experience is a bonus.

I have presented this course multiple times for public and private classes. We are currently offering this course in several locations – if you are interested in running it privately or at a conference, please contact me.

This talk focuses on the drivers behind, and the main advantages of, Continuous Delivery. Pitched at a higher level than some of my other talks, it should appeal to both technical and business audiences.

You can view a video of the presentation I gave at one of the ThoughtWorks QTB events in 2011.

I was out shooting today, putting the x-pro through its paces. No real purpose – meandering around the Portobello Road, shooting stall keepers and tourists alike. I turn for home, walking alongside the Westway, when a cyclist zips past. I see him, heading towards me, bring the camera up, track him and *click*. Enough time for one shot, and it's perfect. He is captured on the top right, pin sharp, looking straight into the lens. The background is beautifully blurred. Perfect shot.

I walk on, and get a tap on my shoulder. I turn around to see the cyclist.

“Can you delete that, please?”.

And I do.

Walking back, I think idly “I could probably recover that…”. Legally, I’ve done nothing wrong. And it is a great shot – one of the best I think I’ve taken. But I sigh, and know I won’t. At least this is one shot I can’t claim to have lost due to the Fuji x-pro’s AF.


I’ll be running my new talk “Designing For Rapid Release” at a couple of conferences in the first half of this year. First up is the delightfully named Crash & Burn in Stockholm, on the 2nd of March. Then later in May I’ll be at Poznan in Poland for GeeCon 2012.

This talk focuses on the kinds of constraints we should consider when evolving the architecture of our systems in order to enable rapid, frequent releases. So much of the conversation about Continuous Delivery focuses on the design of build pipelines, or the nuts and bolts of CI and infrastructure automation. But often the biggest constraints on being able to incrementally roll out new features are problems in the design of the system itself. I'll be pulling together a series of patterns that will help you identify what to look for in your own systems when moving towards Continuous Delivery.


On my current client project I have separated the management of environment configuration into two problem spaces – provisioning hosts, and configuring hosts. Part of the reason for this separation is that although we are targeting AWS, we need to be able to support alternative providers in the future; I also consider the two types of task to be rather different, requiring different types of tools.

For provisioning hosts I am using Boto, the Python AWS API. For configuring the hosts once provisioned, I am using Puppet. I remain unconvinced as to the relative merits of a Puppet master or Chef Server (see my previous post on the subject) and so have decided to stick with running Puppet solo so I can manage versioning how I would like. This leaves me with a challenge – how do I apply the Puppet configuration to the hosts once they have been provisioned with Boto? I also wanted to provide a relatively uniform command-line interface to the development team for other tasks like running builds. Some people use cron-based polling for this, but I wanted a more direct form of control. I also wanted to avoid the need to run any additional infrastructure, so MCollective was never something I was particularly interested in.

After a brief review of my “Things I should look at later” list it looked like time to give Fabric a play.

Fabric is a Python-based tool/library which excels at creating command-line tools for machine management. Its bread and butter is script-based automation of machines via SSH – many people in fact use hand-rolled scripts on top of Fabric as an alternative to systems like Chef and Puppet. The documentation is very good, and I can heartily recommend the Fabric tutorial.

The workflow I wanted was simple. I wanted to be able to checkout a specific version of code locally, run one command to bring up a host and also apply a given configuration set. My potentially naive solution to this problem is to simply tar up my puppet scripts, upload them, and then run puppet. Here is the basic script:

[python]
from fabric.api import task, local, settings, run, put

@task
def provision_box():
    public_dns = provision_using_boto()

    # Bundle up the Puppet scripts and copy them to the new host
    local("tar cfz /tmp/end-bundle.tgz path/to/puppet_scripts/*")
    with settings(host_string=public_dns, user="ec2-user", key_filename="path/to/private_key.pem"):
        run("sudo yum install -y puppet")
        put("/tmp/end-bundle.tgz", ".")
        run("tar xf end-bundle.tgz && sudo puppet --modulepath=/home/ec2-user/path/to/puppet_scripts/modules path/to/puppet_scripts/manifests/myscript.pp")
[/python]

The provision_using_boto() function is left as an exercise for the reader, but the documentation should point you in the right direction. If you put the above task in your fabfile.py, all you need to do is run fab provision_box to do the work. The first yum install command is there to handle bootstrapping Puppet (as it is not on the AMIs we are using) – this will be a no-op if the target host already has it installed.
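For completeness, here is a rough sketch of the shape provision_using_boto() might take using the classic boto EC2 API – the region, AMI ID, key pair and security group below are placeholders, not the values from the real scripts:

[python]
# A minimal sketch, assuming boto is installed and AWS credentials are
# available to it (e.g. via environment variables or a boto config file).
import time
import boto.ec2

def provision_using_boto():
    conn = boto.ec2.connect_to_region("eu-west-1")  # placeholder region
    reservation = conn.run_instances(
        "ami-xxxxxxxx",               # placeholder AMI ID
        key_name="my-keypair",        # placeholder key pair
        instance_type="t1.micro",
        security_groups=["default"])
    instance = reservation.instances[0]
    # Wait until the instance is running and has a public DNS name we can SSH to
    while instance.state != "running":
        time.sleep(5)
        instance.update()
    return instance.public_dns_name
[/python]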

This example is much simpler than the actual scripts, as we have also implemented some logic to reuse EC2 instances to save time & money, along with a simplistic role system to manage different classes of machines. I may write up those ideas in a future post.


I’ll be speaking on the cloud track at JAX London 2011. The talk “Private Cloud, A Convenient Fiction” attempts to puncture some of the FUD on the subject. The nuances between various hosting solutions are many and varied, and don’t suit being put into neat boxes like ‘public’ and ‘private’. When I talk to clients about what is right for them, the types of things we discuss place different providers on a number of axes, which I hope to get across in this talk.

As always, you can track the conference (or note that you plan to attend) at Simon & Nat’s startup Lanyrd, which is several shades of awesome.


It seems I spoke too soon. Just one day after I thought I had tracked down the source of the trouble, yesterday evening brought another outage. The graph in CloudWatch was all too familiar, showing a huge uptick in CPU use. The box was again unresponsive and had to be restarted. When I checked cpu_log for a likely culprit, the entries looked odd:

[plain light="true"]
2011-07-13 00:12:22 www-data 26096 21.4 0.9 160732 5972 ? D Jul12 5:30 /usr/sbin/apache2 -k start
2011-07-13 00:12:25 www-data 26096 21.4 0.9 160736 6040 ? R Jul12 5:30 /usr/sbin/apache2 -k start
2011-07-13 00:12:22 www-data 26096 21.3 0.9 160732 5972 ? D Jul12 5:30 /usr/sbin/apache2 -k start
2011-07-13 00:12:22 root 26179 24.0 0.0 4220 584 ? S 00:12 0:00 /bin/sh /home/ubuntu/tools/cpu_log
[/plain]

No entries from Postfix – good – but now other processes are having trouble. This was starting to point away from one rogue process gobbling CPU, to high CPU use being a symptom of something else. What can cause very high CPU use? Among other things, swapping memory. A process chewing up all available memory could easily cause these kinds of symptoms. A quick scan through syslog showed me something I should have spotted earlier. If it wasn’t the smoking gun, then at least something pretty close:

[plain light="true"]
Jul 13 00:11:28 domU-12-31-39-01-F0-E5 kernel: [38837.985499] apache2 invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[/plain]

This doesn’t necessarily mean that Apache is to blame, just that an allocation by an Apache process pushed the box over the edge and triggered the OOM killer as the kernel tried to free up memory. And just prior to the outage itself, in the Apache logs:

[plain light="true"]
[Tue Jul 12 23:48:24 2011] [error] (12)Cannot allocate memory: fork: Unable to fork new process
[/plain]

At this point my mind was already turning to the fact that I hadn’t done *any* tuning of Apache processes or PHP. After googling around for a bit, a few things looked wrong in my config. Here was the untuned default that Ubuntu gave me:

[plain light="true"]
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 150
MaxRequestsPerChild 0
</IfModule>
[/plain]

In general, MaxClients refers to the maximum number of simultaneous requests that will be served. On a prefork Apache like mine, MaxClients also sets the maximum number of child processes that get spawned. A simple ps showed that even after a restart, each Apache process was consuming up to 35MB of memory. The host in question has 1GB of RAM – it was clear that even with nothing else running on the box, with that sort of memory footprint I would exhaust memory well before the MaxClients threshold was reached. Even more worrying, MaxRequestsPerChild was set to zero, meaning that the child processes would never be restarted. If a memory leak was occurring inside a child Apache process, it could carry on eating memory until the box came crashing to its knees. After some quick maths I decided to reduce MaxClients down to a more manageable 25, and also set MaxRequestsPerChild to 1000. My hope is that this may buy me some more time to try and track down where the memory use is occurring.
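For what it’s worth, the quick maths is simple: 25 processes at up to 35MB each is roughly 875MB, which at least fits within 1GB with a little headroom for everything else. With just those two values changed and the rest of the defaults left alone, the prefork section ends up looking something like this:

[plain light="true"]
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 25
MaxRequestsPerChild 1000
</IfModule>
[/plain]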

This has spurred me on to finally invest some time looking at nginx. This weekend may see me putting nginx in side by side, with a view to moving away from Apache – from all reports this may allow me to run my sites with a much lower memory footprint. But if the last couple of days has taught me anything, it’s that I shouldn’t be too quick to conclude that I’ve finally tracked this problem down. If I’m still having trouble at the weekend, I may well just clone the box and try to reproduce the problem with some performance tests.
