Sam Newman's site. Sam is a Consultant at ThoughtWorks.

Posts tagged ‘devops’

On my current client project, in terms of managing configuration of the various environments, I have separated things into two problem spaces – provisioning hosts, and configuring hosts. Part of the reason for this separation is that although we are targeting AWS, we need to be able to support alternative services in the future; I also consider the two types of task to be rather different, requiring different types of tools.

For provisioning hosts I am using Boto, the Python AWS API. For configuring the hosts once provisioned, I am using Puppet. I remain unconvinced as to the relative merits of Puppet Master or Chef Server (see my previous post on the subject) and so have decided to stick with running Puppet standalone so I can manage versioning how I would like. This leaves me with a challenge – how do I apply the Puppet configuration to the hosts once they are provisioned with Boto? I also wanted to provide a relatively uniform command-line interface to the development team for other tasks like running builds. Some people use cron-based polling for this, but I wanted a more direct form of control. I also wanted to avoid the need to run any additional infrastructure, so MCollective was never something I was particularly interested in.

After a brief review of my “Things I should look at later” list it looked like time to give Fabric a play.

Fabric is a Python-based tool/library which excels at creating command-line tools for machine management. Its bread and butter is script-based automation of machines via SSH – many people in fact use hand-rolled scripts on top of Fabric as an alternative to systems like Chef and Puppet. The documentation is very good, and I can heartily recommend the Fabric tutorial.

The workflow I wanted was simple. I wanted to be able to check out a specific version of code locally, then run one command to bring up a host and apply a given configuration set. My potentially naive solution to this problem is to simply tar up my Puppet scripts, upload them, and then run Puppet. Here is the basic script:

from fabric.api import local, put, run, settings

def provision_box():
    public_dns = provision_using_boto()

    # Bundle the Puppet scripts locally, then ship them to the new host over SSH
    local("tar cfz /tmp/end-bundle.tgz path/to/puppet_scripts/*")
    with settings(host_string=public_dns, user="ec2-user", key_filename="path/to/private_key.pem"):
        run("sudo yum install -y puppet")
        put("/tmp/end-bundle.tgz", ".")
        run("tar xf end-bundle.tgz && sudo puppet --modulepath=/home/ec2-user/path/to/puppet_scripts/modules path/to/puppet_scripts/manifests/myscript.pp")

The provision_using_boto() command is an exercise left to the reader, but the documentation should point you in the right direction. If you stick the above in your fabfile, all you need to do is run fab provision_box to do the work. The first yum install command is there to handle bootstrapping of Puppet (as it is not on the AMIs we are using) – this will be a noop if the target host already has it installed.
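One part of provision_using_boto worth sketching is the wait loop: the EC2 API returns before the host is actually reachable, so you need to poll until the instance reports "running" before handing its DNS name to Fabric. A minimal, library-agnostic sketch of that loop (the function names here are illustrative, not part of Boto):

    import time

    def wait_for_state(fetch_state, target="running", timeout=300, interval=5):
        """Poll fetch_state() until it returns `target`, or raise after `timeout` seconds."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if fetch_state() == target:
                return True
            time.sleep(interval)
        raise TimeoutError("instance never reached state %r" % target)

With Boto, fetch_state would wrap refreshing the instance and reading its state; once this returns, the instance's public DNS name is safe to use.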

This example is much simpler than the actual scripts, which also implement some logic to re-use EC2 instances to save time & money, and a simplistic role system to manage different classes of machines. I may write up those ideas in a future post.


I’ve been playing around with both Chef and Vagrant recently, getting my head around the current state of the art regarding configuration management. A rather good demo of Chef at the recent DevOpsDays Hamburg by John Willis pushed me towards Chef over Puppet, but I’m still way too early in my experimentation to know if that is the right choice.

I may speak more later about my experiences with Vagrant, but this post is primarily concerning Chef, and specifically thoughts regarding repeatability.


Most of us, I hope, check our code in. Some of us even have continuous integration, and perhaps even a fully fledged deployment pipeline which creates packages representing our code that have been validated as production ready. By checking in our code, we hope to bring about a situation whereby we can recreate a build of our software at a previous point in time.

Typically however, deploying these systems requires a bit more than simply running apt-get install or something similar. Machines need to be provisioned and dependencies configured, and this is where Chef and Puppet come in. Both systems allow you to write code that specifies the state you expect your nodes to be in for your systems to work. To my mind, it is therefore important that the version of the configuration code be kept in sync with your application version. Otherwise, when you deploy your software, you may find that the systems are not configured how you would like.

Isn’t It All About Checking In?

So, if we rely on checking our application code in to be able to reproduce a build, why not check our configuration code into the same place? On the face of it, this makes sense. The challenge here – at least as I understand the capabilities of Chef – is that much of the power of Chef comes from using Chef Server, which doesn’t play nicely with this model.

Chef Server is a server which tells nodes what they are expected to be. It is the system that gathers information about your configured systems, allowing discovery via tools like Knife, and it is also how you push configuration out to multiple machines. Whilst Chef Server itself is backed by version control, there doesn’t seem to be an obvious way for an application node to say “I need version 123 of the Web Server recipe”. That means that if I want to bring up an old version of a Web node, it could reach out and end up getting a much newer version of a recipe, thereby failing to correctly recreate the previous state.

Now, using Chef Solo, I could check out my code and system configuration together as a piece, then load that onto the nodes I want, but I lose a lot by not being able to do discovery using Knife and similar tools, and I lose the tracking etc.

Perhaps there is another way…

Chef does have a concept of environments. With an environment, you are able to specify that a node associated with a specific environment should use a specific version of a recipe, for example:

name "dev"
description "The development environment"
cookbook_versions "couchdb" => "= 11.0.0"
attributes "apache2" => { "listen_ports" => [ "80", "443" ] }

The problem here is that I think the ability to pin versions of my cookbooks is completely orthogonal to environments. Let’s remember the key goal – I want to be able to reproduce a running system based on a specific version of code, and identify the right version of the configuration (recipes) to apply for that version of the code. Am I missing something?

In a previous post, I showed how we could use Clojure and specifically Incanter to process access logs to graph hits on our site. Now, we’re going to adapt our solution to show the number of unique users over time.

We’re going to change the previous solution to pull out the core dataset representing the raw data we’re interested in from the access log – records-from-access-log remains unchanged from before:

(defn access-log-to-dataset [filename]
  (col-names (to-dataset (records-from-access-log filename)) ["Date" "User"]))

The raw dataset retrieved from this call looks like this:

Date User
11/Aug/2010:00:00:30 +0100 Bob
11/Aug/2010:00:00:31 +0100 Frank
11/Aug/2010:00:00:34 +0100 Frank

Now, we need to work out the number of unique users in a given time period. Like before, we’re going to use $rollup to group multiple records by minute, but we need to work out how to summarise the user column. To do this, we create a custom summarise function which calculates the number of unique users:

(defn num-unique-items [items]
  (count (set items)))

Then use that to modify the raw dataset and graph the resulting dataset:

(defn access-log-to-unique-user-dataset [access-log-dataset]
  ($rollup num-unique-items "User" "Date"
    (col-names (conj-cols ($map #(round-ms-down-to-nearest-min (as-millis %)) "Date" access-log-dataset) ($ "User" access-log-dataset)) ["Date" "User"])))
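If the Incanter pipeline is hard to follow, the same computation in plain Python may help: group records into minute buckets, then count the distinct users in each bucket. This is just a sketch of the idea, not part of the original code:

    from collections import defaultdict

    def unique_users_per_min(records):
        """records: iterable of (millis, user) pairs.
        Returns {minute-start-in-millis: number of distinct users in that minute}."""
        users_by_minute = defaultdict(set)
        for millis, user in records:
            # Round each timestamp down to the start of its minute
            users_by_minute[(millis // 60000) * 60000].add(user)
        return {minute: len(users) for minute, users in sorted(users_by_minute.items())}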

(defn concurrent-users-graph [dataset]
  (time-series-plot :Date :User
                    :x-label "Date"
                    :y-label "Unique Users"
                    :title "Users Per Min"
                    :data (access-log-to-unique-user-dataset dataset)))

(def access-log-dataset
  (access-log-to-dataset "/path/to/access.log"))

(save (concurrent-users-graph access-log-dataset) "unique-users.png")

You can see the full source code listing here.

Continuing a recurring series of posts showing my limited understanding of Clojure, today we’re using Clojure for log processing. This example is culled from some work I’m doing right now in the day job – we needed to extract usage information to better understand how the system is performing.

The Problem

We have an Apache-style access log showing hits on our site. We want to process this to extract information like peak hits per minute, and perhaps eventually more detailed information like the nature of the request, response time and so on.

 - username 05/Aug/2010:17:27:24 +0100 "GET /some/url HTTP/1.1" 200 24 "http://some.refering.domain/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv: Gecko/2009040821 Firefox/3.0.9 (.NET CLR 3.5.30729)"

Extracting The Information

We want to use Incanter to help us process the data & graph it. Incanter likes its data as a sequence of sequences – so that’s what we’ll create.

First up – processing a single line. I TDD’d this solution, but have excluded the tests from the source listing for brevity.

user=> (use 'clojure.contrib.str-utils)
user=> (use '[ :only (read-lines)])

user=> (defn extract-records-from-line [line-from-access-log]
  (let [[_ ip username date] (re-find #"^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (\w+) (.+? .+?) " line-from-access-log)]
    [date username]))

user=> (defn as-dataseries [access-log-lines]
  (map extract-records-from-line access-log-lines))

user=> (defn records-from-access-log [filename]
  (as-dataseries (read-lines filename)))

A few things to note. extract-records-from-line is matching more than is strictly needed – I just wanted to show the use of destructuring for matching parts of the log line. I’m pulling in the username & date – the username is not strictly needed for what follows. Secondly, note the use of read-lines from – rather than slurp, read-lines is lazy. We’ll have to process the whole file at some point, but it’s a good idea to use lazy functions where possible.
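For comparison, here is the same lazy extraction expressed in Python – a generator plays the role that the lazy seq from read-lines plays above. This is an illustrative sketch, not part of the original Clojure solution:

    import re

    # Same shape as the Clojure regex: IP, " - ", username, then "date tz "
    LOG_PATTERN = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}) - (\w+) (.+? .+?) ")

    def records_from_lines(lines):
        """Lazily yield (date, username) pairs from access-log lines."""
        for line in lines:
            match = LOG_PATTERN.match(line)
            if match:
                ip, username, date = match.groups()
                yield date, username

Because it is a generator, nothing is parsed until the records are actually consumed – the same win that read-lines gives us in Clojure.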

At this point, running records-from-access-log gives us our sequence of sequences – next up, pulling it into Incanter.

Getting The Data Into Incanter

We can check that our code is working properly by firing up Incanter. Given a sample log: - fred 05/Aug/2010:17:27:24 +0100 "GET /some/url HTTP/1.1" 200 24 "http://some.refering.domain/" "SomeUserAgent" - norman 05/Aug/2010:17:27:24 +0100 "GET /some/url HTTP/1.1" 200 24 "http://some.refering.domain/" "SomeUserAgent" - bob 05/Aug/2010:17:28:24 +0100 "GET /some/url HTTP/1.1" 200 24 "http://some.refering.domain/" "SomeUserAgent" - clare 05/Aug/2010:17:29:24 +0100 "GET /some/url HTTP/1.1" 200 24 "http://some.refering.domain/" "SomeUserAgent"

Let’s create a dataset from it, and view the resulting records:

user=> (use 'incanter.core)
user=> (def access-log-dataset
  (to-dataset (records-from-access-log "/path/to/example-access.log")))
user=> (view access-log-dataset)

The result of the view command shows our records but, unfortunately, no column names – that is easy to fix using col-names:

user=> (def access-log-dataset 
(col-names (to-dataset (records-from-access-log "/path/to/example-access.log")) ["Date" "User"]))
user=> (view access-log-dataset)

At this point you can see that it would be easy for us to pull in the URL, response code or other data rather than the username from the log – all we’d need to do is change extract-records-from-line and update the column names.

Graphing The Data

To graph the data, we need to get Incanter to register the date column as what it is – time. Currently it is in string format, so we need to fix that. Culling the basics from Jake McCray’s post, here is what I ended up with (note use of Joda-Time for date handling – you could use the standard SimpleDateFormat if you preferred):

user=> (import 'org.joda.time.format.DateTimeFormat)

user=> (defn as-millis [date-as-str]
  (.getMillis (.parseDateTime (DateTimeFormat/forPattern "dd/MMM/yyyy:HH:mm:ss Z") date-as-str)))

user=> (defn access-log-to-dataset [filename]
  (let [unmod-dataset (col-names (to-dataset (records-from-access-log filename)) ["Date" "User"])]
    (col-names (conj-cols ($map as-millis "Date" unmod-dataset) ($ "User" unmod-dataset)) ["Date Time In Ms" "User"])))

While the date parsing should be pretty straightforward to understand, there are a few interesting things going on with the Incanter code that we should dwell on briefly.

The $ function extracts a named column, whereas the $map function runs another function over the named column from the dataset, returning the modified column (pretty familiar if you’ve used map). conj-cols then takes these resulting sequences to create our final dataset.

We’re not quite done yet though. We have our time-series records – each representing one hit on our webserver – but don’t actually have values to graph. We also need to work out how we group hits to the nearest minute. What we’re going to do is replace our as-millis function with one that rounds to the nearest minute. Then, we’re going to use Incanter to group those rows together – summing the hits it finds per minute. But before that, we need to tell Incanter that each row represents a hit, by adding a ‘Hits’ column. We’re also going to ditch the user column, as it isn’t going to help us here:

user=> (defn access-log-to-dataset [filename]
  (let [unmod-dataset (col-names (to-dataset (records-from-access-log filename)) ["Date" "User"])]
    (col-names (conj-cols ($map as-millis "Date" unmod-dataset) (repeat 1)) ["Date" "Hits"])))

Next, we need to create a new function to round our date & time to the nearest minute.

Update: The earlier version of this post screwed up, and the presented round-ms-down-to-nearest-min actually rounded to the nearest second. This is a corrected version:

(defn round-ms-down-to-nearest-min [millis]
  (* 60000 (quot millis 60000)))

If you wanted hits per second, here is the function:

(defn round-ms-down-to-nearest-sec [millis]
  (* 1000 (quot millis 1000)))
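The arithmetic is the same trick in any language: integer-divide by the bucket size to throw away the remainder, then multiply back up. A quick Python sketch of both variants, purely for illustration:

    def round_ms_down(millis, bucket_ms=60000):
        """Round a millisecond timestamp down to the start of its bucket.
        bucket_ms=60000 rounds to the minute; bucket_ms=1000 to the second."""
        return (millis // bucket_ms) * bucket_ms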

And one more tweak to access-log-to-dataset to use the new function:

(defn access-log-to-dataset [filename]
  (let [unmod-dataset (col-names (to-dataset (records-from-access-log filename)) ["Date" "User"])]
    (col-names (conj-cols ($map #(round-ms-down-to-nearest-min (as-millis %)) "Date" unmod-dataset) (repeat 1)) ["Date" "Hits"])))

Finally, we need to roll our data up, summing the hits per minute – all this done using $rollup:

(defn access-log-to-dataset [filename]
  (let [unmod-dataset (col-names (to-dataset (records-from-access-log filename)) ["Date" "User"])]
    ($rollup :sum "Hits" "Date"
      (col-names (conj-cols ($map #(round-ms-down-to-nearest-min (as-millis %)) "Date" unmod-dataset) (repeat 1)) ["Date" "Hits"]))))

$rollup applies a summary function to a given column (in our case “Hits”), grouping the rows by another column (“Date” in our case). :sum here is one of the summary functions built into Incanter, but we could provide our own.
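In other words, $rollup is just a group-by with a summary function applied to each group. In Python terms (a sketch for illustration, assuming rows of (date, hits) pairs like our dataset):

    from collections import defaultdict

    def rollup_sum(rows):
        """rows: iterable of (date, hits) pairs.
        Returns {date: total hits for that date} - the group-by-and-sum that $rollup performs."""
        totals = defaultdict(int)
        for date, hits in rows:
            totals[date] += hits
        return dict(totals)

Swapping the summing for a different summary function is exactly how the unique-users variant works in the previous post.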

Now we have our dataset, let’s graph it:

user=> (defn hit-graph [dataset]
  (time-series-plot :Date :Hits
                    :x-label "Date"
                    :y-label "Hits"
                    :title "Hits"
                    :data dataset))

user=> (view (hit-graph (access-log-to-dataset "/path/to/example-access.log")))

This is deeply unexciting with such a small sample – but run the same code against a bigger dataset and the real shape of your traffic starts to emerge.


You can grab the final code here.

Incanter is much more than simply a way of graphing data. This (somewhat) brief example shows you how to get log data into an Incanter-friendly format – what you want to do with it then is up to you. I may well explore other aspects of Incanter in further posts.


I’ve been invited to speak on colleague Chris Read’s track at QCon London this March. The track itself is chock full of experienced professionals (including two ex-colleagues) so I fully intend to raise my game accordingly. We’re lucky enough to have Michael T. Nygard speaking too, author of Release It! – perhaps the best book written for software developers in years.

The track – “Dev and Ops – a Single Team” – attempts to address many of the issues software professionals have in getting their software live. It will cover many aspects, both on the hardcore technical and on the softer people side. Hopefully it will provide lots of useful information you can take back to your own teams.

My talk – From Dev To Production – will give an overview of build pipelines, and how they can be used to get the whole delivery team focused on the end objective – shipping quality software as quickly as possible. It draws on some of my recent writing on build patterns, and a wealth of knowledge built up inside ThoughtWorks over the past few years.

My experience of QCon SF last year was excellent – I can thoroughly recommend it to any IT professional involved in shipping software. If you haven’t got your ticket already, go and get one before the prices go up!

Pipeline from Flickr user Stuck in Customs

One of the problems quickly encountered when any new team adopts a Continuous Build is that builds become slow. Enforcing a Build Time Limit can help, but ultimately if all of your Continuous Build runs as one big monolithic block, there are limits to what you can do to decrease build times.

One of the main issues is that you don’t get fast feedback to tell you when there is an issue – by breaking up a monolithic build you can gain fast feedback without reducing code coverage, and often without any complex setup.

In a Chained Continuous Build, multiple build stages are chained together in a flow. The goal is for the earlier stages to provide the fastest feedback possible, so that build breakages can be detected early. For example, a simple flow might first compile the software and run the unit tests, with the next stage running the slower end-to-end tests.
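The chaining rule – a stage runs only if everything before it passed – is simple to express. A toy sketch of the idea (the stage names are illustrative, not from any particular CI tool):

    def run_chain(stages):
        """stages: ordered list of (name, callable) pairs; each callable returns True on success.
        Runs stages in order, stopping at the first failure. Returns (passed, results)."""
        results = {}
        for name, stage in stages:
            ok = stage()
            results[name] = ok
            if not ok:
                return False, results   # downstream stages never run
        return True, results

Fast stages (compile, unit tests) go first so a breakage is reported before the slow end-to-end stage ever starts.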

With the chain, a downstream stage only runs if the previous stage passed – so in the above example, the end-to-end stage only runs if the build-and-test stage passes.

Handling Breaks

As with a Continuous Build you need to have a clear escalation process by which the whole team understands what to do in case of a break. Most teams I have worked with tend to stick to the rule of downing tools to fix the build if any part of the Continuous Build is red – which is strongly recommended. It is important that if you do decide to split your continuous build into a chain that you don’t let the team ignore builds that happen further along the chain.

Build Artifacts Once vs Rebuild

It is strongly suggested that you build the required artifacts once, and pass them along the chain. Rebuilding artifacts takes time – and the whole point of a chained build is to improve feedback. Additionally getting into the habit of building an artifact once, and once only, will help when you start considering creating a proper Build Pipeline (see below).

And Build Pipelines

Note that a chained build is not necessarily the same thing as a Build Pipeline. A Chained Continuous Build simply represents one or more Continuous Builds in series, whereas a Build Pipeline fully models all the stages a software artifact moves from development right through to production. One or more Chained Continuous Builds may form part of a Build Pipeline, and a simplistic Build Pipeline might not represent anything other than Chained Continuous Builds, but Build Pipelines will often incorporate activities more varied than compilation or test running.

Fast Feedback vs Fast Total Build Time

One thing to note is that by breaking a big build up into smaller sections to speed up feedback, you may well – counterintuitively – end up increasing overall build time. Building and passing artifacts from one stage to another adds time, as does dispatching calls to build processes further down the chain. This balance has to be considered – be conservative in the splits you make, and always keep an eye on the total duration of your build chain.

Tool Support

Tooling can be complex. Simple straight-line chains can be built relatively easily using most continuous build systems. For example, a common approach is to have one build check in some artifact which is the trigger point for another Continuous Build to run. Such approaches have the downside that the chain isn’t explicitly modelled, and reporting of the current state of the chain ends up having to be jury-rigged, typically through custom dashboards. More complex still is dealing with branching chains.

Continuous Build systems have become more mature of late, with many of them supporting simple Chained Continuous Builds out of the box. TeamCity, Hudson, Cruise and others all have some (varying) form of support. Cruise probably has the best support for running stages in parallel (caveat: Cruise is written by ThoughtWorks, the company I currently work for), and some of the better support for visualising the chains, but given the way all of these tools are moving, expect support in this area to get much better over time.

Recently, both Paul Julius and Chris Read pointed out that I was perhaps the first person to document the concept of build pipelines, at least in terms of how it relates to continuous integration and the like. As it turns out, the original posts on the subject are from further back than I remembered.

I plan to pull together my previous posts on the subject and update them a little, but in the meantime thought I’d give a bit of background as to where much of this came from.

A Harsh Introduction

My first exposure to continuous integration was by being dropped in at the deep end during my first ThoughtWorks project. The project in question was for an electronic point of sale system, and at its peak had over 50 developers in three countries working on it. During this time I started reading up on the subject, specifically Martin Fowler & Matt Foemmel’s paper (Martin has since created an updated version).

Much of the experience at this first, large project was dominated by long, slow build times, caused in part by an inability to separate out activities being performed by individual teams. A full discussion of the things we learnt from that project can wait for another time, but I came out of that experience liking the concept of Continuous Integration, but feeling incredibly constrained by the actual implementation.

Monkeying Around

Subsequently, I worked as what we used to call a ‘Build Monkey’ at a London-based ISP. My role (which we now tend to call an Environment Specialist) was typically to identify the causes of build failure, keep the build running smoothly, and manage deployments to a number of different environments. Throughout this time, discussions around the theory of managing Continuous Builds for larger software teams were continuing – primarily with colleagues like Julian Simpson, Jack Bolles and others.

The challenge we seemed to face, time and again, was how you balance the various activities associated with getting software from developers machines into production, all whilst providing the fastest feedback possible.

Typically, we came at the problem from two different directions – in the first instance from the point of view of how to hammer our tools into supporting the kind of processes involved; the more important angle, though, was understanding what the pipeline – from developer workstation to production – actually was. This thinking can now best be framed in terms of Continuous Deployment – although that topic is far more nuanced than the often simplistic thinking regarding systems where 50 deployments a day is possible, or even desirable.

The Present Day

Since I wrote my original articles, many other people have done work in this area, to the extent that tools like ThoughtWorks’ own Cruise build support for build pipelining & visualisation directly into the tool.

Update 1: Corrected spelling of Paul’s surname – sorry Paul!