magpiebrain

The site of Sam Newman, a Consultant at ThoughtWorks

Posts from the ‘AWS’ category

For over a year now I have been running day-long training sessions on AWS with ThoughtWorks. I have helped roll out this training globally, giving sessions in both Australia and the USA. The course covers the main building blocks of the AWS offering, including:

  • EC2
  • EBS
  • S3

A mix of theory and hands-on sections, it is ideal for anyone interested in getting started with AWS. Attendees need only a basic knowledge of the command line, although any *NIX experience is a bonus.

I have presented this course multiple times for public and private classes. We are currently offering this course in several locations – if you are interested in this course for a private occasion or conference then please contact me.

On my current client project, I have separated the management of environment configuration into two problem spaces – provisioning hosts, and configuring hosts. Part of the reason for this separation is that although we are targeting AWS, we need to leave room to support alternative services in the future; I also consider the two types of task to be rather different, requiring different types of tools.

For provisioning hosts I am using Boto, the Python AWS API. For configuring the hosts once provisioned, I am using Puppet. I remain unconvinced as to the relative merits of Puppet Master or Chef Server (see my previous post on the subject), and so have decided to stick with running Puppet solo so I can manage versioning how I would like. This leaves me with a challenge – how do I apply the Puppet configuration to the hosts once they have been provisioned with Boto? I also wanted to provide a relatively uniform command-line interface to the development team for other tasks, like running builds. Some people use cron-based polling for this, but I wanted a more direct form of control. I also wanted to avoid the need to run any additional infrastructure, so mcollective was never something I was particularly interested in.

After a brief review of my “Things I should look at later” list it looked like time to give Fabric a play.

Fabric is a Python-based tool/library which excels at creating command-line tools for machine management. Its bread and butter is script-based automation of machines via SSH – many people in fact use hand-rolled scripts on top of Fabric as an alternative to systems like Chef and Puppet. The documentation is very good, and I can heartily recommend the Fabric tutorial.

The workflow I wanted was simple. I wanted to be able to check out a specific version of code locally, then run one command to bring up a host and apply a given configuration set. My potentially naive solution to this problem is to simply tar up my Puppet scripts, upload them, and then run Puppet. Here is the basic script:

[python]
from fabric.api import task, local, put, run, settings

@task
def provision_box():
    # Bring up an EC2 instance via Boto and grab its public DNS name
    public_dns = provision_using_boto()

    # Bundle up the Puppet scripts and push them onto the new host
    local("tar cfz /tmp/end-bundle.tgz path/to/puppet_scripts/*")
    with settings(host_string=public_dns, user="ec2-user", key_filename="path/to/private_key.pem"):
        run("sudo yum install -y puppet")
        put("/tmp/end-bundle.tgz", ".")
        run("tar xf end-bundle.tgz && sudo puppet --modulepath=/home/ec2-user/path/to/puppet_scripts/modules path/to/puppet_scripts/manifests/myscript.pp")
[/python]

The provision_using_boto() function is left as an exercise for the reader, but the documentation should point you in the right direction. If you stick the above in your fabfile.py, all you need to do is run fab provision_box to do the work. The first yum install command is there to handle bootstrapping of Puppet (as it is not on the AMIs we are using) – it will be a no-op if the target host already has it installed.
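For what it's worth, a minimal sketch of what provision_using_boto() might look like using Boto's EC2 API is shown below – the region, AMI ID, key pair and security group are placeholder values rather than anything from the actual project:

[python]
import time
import boto.ec2

def provision_using_boto():
    # Placeholder values – substitute your own region, AMI, key pair and security group
    conn = boto.ec2.connect_to_region("us-east-1")
    reservation = conn.run_instances("ami-xxxxxxxx",
                                     key_name="my-keypair",
                                     instance_type="t1.micro",
                                     security_groups=["default"])
    instance = reservation.instances[0]

    # Poll until the instance is running and has a public DNS name
    while instance.state != "running":
        time.sleep(5)
        instance.update()
    return instance.public_dns_name
[/python]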

This example is much simpler than the actual scripts, as we have also implemented some logic to re-use EC2 instances to save time & money (see the sketch below), along with a simplistic role system to manage different classes of machines. I may write up those ideas in a future post.
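As a rough sketch of the instance re-use idea (not the project's actual implementation), you could look for a running instance carrying a hypothetical role tag before falling back to launching a fresh one:

[python]
import boto.ec2

def find_or_provision(role):
    # Hypothetical 'role' tag – look for a running instance already tagged with it
    conn = boto.ec2.connect_to_region("us-east-1")
    reservations = conn.get_all_instances(filters={"tag:role": role,
                                                   "instance-state-name": "running"})
    for reservation in reservations:
        for instance in reservation.instances:
            return instance.public_dns_name

    # No match – fall back to provisioning a fresh instance
    return provision_using_boto()
[/python]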


I’m in the process of migrating the many sites I manage from Slicehost over to EC2 (which is where this blog is currently running). I hit a snag in the last day or two – my Montastic alerts told me that the sites I had already migrated were not responding. I tried – and failed – to SSH into the box. The CloudWatch graphs for the instance showed 100% CPU use, explaining the unresponsive SSH. The problem was that I couldn’t tell what was causing it. My only option was to restart the instance, which at least brought it back to life.

What I needed was something that would tell me what was causing the problem. After reaching out to The Hive Mind, Cosmin pointed me in the direction of some awk and ps fu. This little script gets a process listing and writes out all those rows where CPU use is above 20%, prepended with the current timestamp:

[plain light="true"]
ps aux | gawk '{ if ( $3 > 20 ) { print strftime("%Y-%m-%d %H:%M:%S")" "$0 } }'
[/plain]

My box rarely goes above 5% CPU use, and I was worried about the CPU ramping up so quickly that I didn’t get a sample, so this threshold seemed sensible. The magic is the if ( $3 > 20 ) – this only emits the line if the third column of output from ps aux (which is the CPU percentage) goes above 20.

I put the one-liner in a script, then stuck the following entry into cron to ensure the script runs every minute. If everything is OK, there is no output. Otherwise, I’ll get the full process listing. This wouldn’t stop the box getting wedged again, but it would at least tell me what caused it.

[plain light="true"]
* * * * * root /home/ubuntu/tools/cpu_log >> /var/log/cpu_log
[/plain]

Lo and behold, several hours later the box got wedged once again. After a restart, the cpu_log showed this:

[plain light="true" wraplines="false"]
2011-07-11 17:55:42 postfix 6398 29.6 0.3 39428 2184 ? S 17:55 0:01 pickup -l -t fifo -u -c
2011-07-11 17:55:42 postfix 6398 29.6 0.3 39428 2180 ? S 17:55 0:01 pickup -l -t fifo -u -c
2011-07-11 17:55:42 postfix 6398 29.6 0.2 39428 1556 ? S 17:55 0:01 pickup -l -t fifo -u -c
2011-07-11 17:55:42 postfix 6398 24.6 0.2 39428 1368 ? S 17:55 0:01 pickup -l -t fifo -u -c
2011-07-11 18:16:43 root 6440 50.0 0.0 30860 344 ? R 18:16 0:01 pickup -l -t fifo -u -c
[/plain]

Matching what the CloudWatch graphs showed me, the CPU ramped up quite quickly before I lost all output (the fourth column here is the CPU). But this time we have a culprit – Postfix’s pickup process. I had configured Postfix just a day or two back, so clearly something was amiss. In the meantime, I can at least disable Postfix while I spend some time diagnosing the problem.

Limiting CPU

Something else that turned up in my cries for help was cpulimit. This utility would allow me to cap how much CPU a given process uses. If and when I re-enable Postfix, I’ll almost certainly use it to avoid future outages while I iron out any kinks.


As part of the day job at ThoughtWorks, I’m involved with training courses we run in partnership with Amazon, giving people an overview of the AWS service offerings. One of the things we cover that can often cause some confusion is the different types of images – specifically the difference between S3 and EBS-backed images. I tend to give specific advice on which type I prefer, but first some detail is in order.

S3-hosted Images

Instances launched from S3-hosted images use the local disk for the root partition. Once spun up, if you shut the machine down, the state of the operating system is lost. The only way to persist data is to store it off-instance – for example on an attached EBS volume, on S3, or elsewhere. Aside from the lack of persistence of the OS state, these images are limited to 10GB in size – not a problem for most Linux distros I know of, but a non-starter for some Microsoft images.
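As a rough sketch of keeping state off-instance, you could create and attach an EBS volume with Boto along these lines – the region, zone, size, instance ID and device name are all placeholders:

[python]
import boto.ec2

# Placeholder values – substitute your own region, zone, instance ID and device
conn = boto.ec2.connect_to_region("us-east-1")

# Create a 10GB volume in the same availability zone as the instance
volume = conn.create_volume(10, "us-east-1a")

# In practice you would wait for the volume to become 'available' before attaching
conn.attach_volume(volume.id, "i-xxxxxxxx", "/dev/sdf")
[/python]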

There is also a delay in spinning these images up, due to the time taken to copy the image from S3 to the physical machine – the whole image has to be downloaded before the instance can start the boot process. In practice, popular images seem to be pretty well cached, so the delay may not be an issue for you. As with everything in AWS, the barrier to entry is so low that it is worth testing this yourself to understand whether it is going to be a deal breaker.

EBS-backed Images

An EBS-backed instance is an EC2 instance which uses an EBS volume as its root partition, rather than local disk. EBS-backed images can boot in seconds (as opposed to minutes for less well cached S3 images), as they only need to stream enough of the image to start the boot process. They can also be up to 1TB in size – in other words, the maximum size of any EBS volume. EBS volumes also have different performance characteristics from local drives, which may need to be taken into account depending on your particular use.

The key benefit is that EBS-backed instances can be ‘suspended’ – by stopping an EBS-backed instance, you can resume it later with all EBS-backed state (e.g. the OS state) maintained. In this way it is more akin to the slices at Slicehost, or to a real, physical machine.
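To illustrate the suspend/resume behaviour, a minimal Boto sketch might look like the following – the instance ID is a placeholder, and this only works for EBS-backed instances:

[python]
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# Stop the (EBS-backed) instance – the root volume's state is preserved,
# and you stop paying for instance hours while it is stopped
conn.stop_instances(instance_ids=["i-xxxxxxxx"])

# Later, start it again with the OS state intact
conn.start_instances(instance_ids=["i-xxxxxxxx"])
[/python]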

There is one out-and-out downside when compared to S3-hosted images – your charges will be higher. You will be charged for IO, which could mount up depending on your particular usage patterns, and you are also charged for the allocated size of the EBS volume.

So which is best?

Aside from the increased cost, and the differing performance characteristics of block storage versus local disk, there is one key downside to an EBS-backed instance: it encourages you to think of EC2 instances as persistent, stable entities which are around forever. In general, I believe you’ll gain far more from embracing the ‘everything fails, deal with it’ ethos – that is, have your machines configure themselves afresh atop vanilla, gold S3-backed AMIs on boot (e.g. using EC2 instance data to download and install your software). This means you cannot fall back into the old thought patterns of constantly patching and tweaking an OS to the point where attempting to recreate it would be an act of sysadmin archaeology.
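As a sketch of that configure-on-boot approach, you could pass a small bootstrap script as EC2 user data (part of the instance data) when launching from a vanilla AMI – the AMI ID, key pair and Puppet paths here are hypothetical:

[python]
import boto.ec2

# Hypothetical bootstrap script passed as user data – on first boot the instance
# installs Puppet and applies its configuration from scratch
user_data = """#!/bin/bash
yum install -y puppet
# fetch your puppet scripts here, e.g. from S3, then:
puppet --modulepath=/path/to/modules /path/to/manifests/myscript.pp
"""

conn = boto.ec2.connect_to_region("us-east-1")
conn.run_instances("ami-xxxxxxxx",
                   key_name="my-keypair",
                   instance_type="t1.micro",
                   user_data=user_data)
[/python]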

There will remain scenarios where EBS-backed instances make sense (and those of you using Windows Server 2008 have no choice), but I always recommend that people moving to AWS limit their use and instead adapt their practices to use S3-backed images – thereby also embracing more of the promise of the AWS stack as a whole.