magpiebrain

Sam Newman's site, a Consultant at ThoughtWorks

Posts from the ‘clojure’ category

Some yak shaving while playing around with Riemann resulted in me creating my first leiningen plugin, lein-gentags. It uses etags based on instructions from Nurullah Akkaya’s original blogpost – perfect for improving navigation of Clojure code in Emacs. Feedback appreciated!

In a previous post, I showed how we could use Clojure and specifically Incanter to process access logs to graph hits on our site. Now, we’re going to adapt our solution to allow us to show the number of unique users over time.

We’re going to change the previous solution to pull out the core dataset representing the raw data we’re interested in from the access log – records-from-access-log remains unchanged from before:

(defn access-log-to-dataset
  [filename]
  (col-names (to-dataset (records-from-access-log filename)) ["Date" "User"]))

The raw dataset retrieved from this call looks like this:

Date User
11/Aug/2010:00:00:30 +0100 Bob
11/Aug/2010:00:00:31 +0100 Frank
11/Aug/2010:00:00:34 +0100 Frank

Now, we need to work out the number of unique users in a given time period. Like before, we’re going to use $rollup to group multiple records by minute, but we need to work out how to summarise the user column. To do this, we create a custom summarise function which calculates the number of unique users:

(defn num-unique-items
  [seq]
  (count (set seq)))
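To see what this does in isolation, here is a tiny check against a made-up column of usernames. The sample names are only illustrative (borrowed from the dataset above), and the parameter is renamed to items so it doesn't shadow clojure.core/seq:

```clojure
(defn num-unique-items
  [items]
  ;; Pouring the values into a set drops duplicates; count gives the distinct total.
  (count (set items)))

;; Hypothetical "User" column for a single minute of traffic.
(def users-in-one-minute ["Bob" "Frank" "Frank"])

(num-unique-items users-in-one-minute)
;; => 2
```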

Then use that to modify the raw dataset and graph the resulting dataset:

(defn access-log-to-unique-user-dataset
  [access-log-dataset]
    ($rollup num-unique-items "User" "Date" 
      (col-names (conj-cols ($map #(round-ms-down-to-nearest-min (as-millis %)) "Date" access-log-dataset) ($ "User" access-log-dataset)) ["Date" "User"])))

(defn concurrent-users-graph
  [dataset]
  (time-series-plot :Date :User
                             :x-label "Date"
                             :y-label "User"
                             :title "Users Per Min"
                             :data (access-log-to-unique-user-dataset dataset)))


(def access-log-dataset
  (access-log-to-dataset "/path/to/access.log"))

(save (concurrent-users-graph access-log-dataset) "unique-users.png")

You can see the full source code listing here.

Continuing in a recurring series of posts showing my limited understanding of Clojure, today we’re using Clojure for log processing. This example is culled from some work I’m doing right now in the day job – we needed to extract usage information to better understand how the system is performing.

The Problem

We have an Apache-style access log showing hits on our site. We want to process this log to extract information like peak hits per minute, and perhaps eventually more detailed information like the nature of the request, response time etc.

The log looks like this:

43.124.137.100 - username 05/Aug/2010:17:27:24 +0100 "GET /some/url HTTP/1.1" 200 24 "http://some.refering.domain/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.9) Gecko/2009040821 Firefox/3.0.9 (.NET CLR 3.5.30729)"

Extracting The Information

We want to use Incanter to help us process the data & graph it. Incanter likes its data as a sequence of sequences – so that’s what we’ll create.

First up – processing a single line. I TDD’d this solution, but have excluded the tests from the source listing for brevity.

user=> (use 'clojure.contrib.str-utils)
nil
user=> (use '[clojure.contrib.duck-streams :only (read-lines)])
nil

user=> (defn extract-records-from-line
  [line-from-access-log]
  (let [[_ ip username date] (re-find #"^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (\w+) (.+? .+?) " line-from-access-log)]
    [date username]))
#'user/extract-records-from-line

user=> (defn as-dataseries
  [access-log-lines]
  (map extract-records-from-line access-log-lines))
#'user/as-dataseries

user=> (defn records-from-access-log
  [filename]
  (as-dataseries (read-lines filename)))
#'user/records-from-access-log

A few things to note. extract-records-from-line is matching more than strictly needed – I just wanted to indicate the use of destructuring for matching parts of the log line. I’m pulling in the username & date – the username is not strictly needed for what follows. Secondly, note the use of read-lines from clojure.contrib.duck-streams – unlike slurp, read-lines is lazy. We’ll have to process the whole file at some point, but it’s a good idea to use lazy functions where possible.
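To make the destructuring a little more concrete, here is a self-contained sketch that runs the same kind of pattern over an in-memory log. The sample lines are shortened, parse-line and parse-log are illustrative names, and line-seq over a reader stands in for read-lines (an assumption made here so the sketch needs nothing beyond Clojure itself):

```clojure
(import '(java.io BufferedReader StringReader))

;; Two shortened, made-up log lines in the same shape as the access log above.
(def sample-log
  (str "56.24.137.230 - fred 05/Aug/2010:17:27:24 +0100 \"GET /some/url HTTP/1.1\" 200 24\n"
       "42.1.137.110 - bob 05/Aug/2010:17:28:24 +0100 \"GET /some/url HTTP/1.1\" 200 24\n"))

(defn parse-line
  [line]
  ;; re-find returns [whole-match group-1 group-2 group-3];
  ;; destructuring gives each captured group a name in one step.
  (let [[_ ip username date]
        (re-find #"^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (\w+) (.+? .+?) " line)]
    [date username]))

(defn parse-log
  [text]
  (with-open [rdr (BufferedReader. (StringReader. text))]
    ;; line-seq is lazy, so force it with doall before the reader closes.
    (doall (map parse-line (line-seq rdr)))))

(parse-log sample-log)
;; => (["05/Aug/2010:17:27:24 +0100" "fred"] ["05/Aug/2010:17:28:24 +0100" "bob"])
```

The doall is the price of laziness here: without it the reader would be closed before any line was actually read.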

At this point, running records-from-access-log gives us our sequence of sequences – next up, pulling it into Incanter.

Getting The Data Into Incanter

We can check that our code is working properly by firing up Incanter. Given a sample log:

56.24.137.230 - fred 05/Aug/2010:17:27:24 +0100 "GET /some/url HTTP/1.1" 200 24 "http://some.refering.domain/" "SomeUserAgent"
12.14.137.140 - norman 05/Aug/2010:17:27:24 +0100 "GET /some/url HTTP/1.1" 200 24 "http://some.refering.domain/" "SomeUserAgent"
42.1.137.110 - bob 05/Aug/2010:17:28:24 +0100 "GET /some/url HTTP/1.1" 200 24 "http://some.refering.domain/" "SomeUserAgent"
143.124.1.50 - clare 05/Aug/2010:17:29:24 +0100 "GET /some/url HTTP/1.1" 200 24 "http://some.refering.domain/" "SomeUserAgent"

Let’s create a dataset from it, and view the resulting records:

user=> (use 'incanter.core)
nil
user=> (def access-log-dataset 
(to-dataset (records-from-access-log "/path/to/example-access.log")))
#'user/access-log-dataset
user=> (view access-log-dataset)

The result of the view command:

Unfortunately, no column names – but that is easy to fix using col-names:

user=> (def access-log-dataset 
(col-names (to-dataset (records-from-access-log "/path/to/example-access.log")) ["Date" "User"]))
#'user/access-log-dataset
user=> (view access-log-dataset)

At this point you can see that it would be easy for us to pull in the URL, response code or other data rather than the username from the log – all we’d need to do is change extract-records-from-line and update the column names.
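As a rough illustration of that change, here is one way the line parser might look if it pulled out the URL and response code instead of the username. This is a sketch only: extract-request-from-line is a hypothetical name, and the pattern assumes every line has the same shape as the sample log above.

```clojure
(def sample-line
  (str "56.24.137.230 - fred 05/Aug/2010:17:27:24 +0100 "
       "\"GET /some/url HTTP/1.1\" 200 24 "
       "\"http://some.refering.domain/\" \"SomeUserAgent\""))

(defn extract-request-from-line
  [line]
  ;; Skip the IP and username, then capture the date, the requested URL
  ;; and the three-digit response code.
  (let [[_ date url status]
        (re-find #"^\S+ - \w+ (.+? .+?) \"\w+ (\S+) [^\"]*\" (\d{3}) " line)]
    [date url status]))

(extract-request-from-line sample-line)
;; => ["05/Aug/2010:17:27:24 +0100" "/some/url" "200"]
```

The matching column names would then be something like ["Date" "URL" "Status"].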

Graphing The Data

To graph the data, we need to get Incanter to register the date column as what it is – time. Currently it is in string format, so we need to fix that. Culling the basics from Jake McCrary’s post, here is what I ended up with (note use of Joda-Time for date handling – you could use the standard SimpleDateFormat if you preferred):

user=> (import 'org.joda.time.format.DateTimeFormat)
nil

user=> (defn as-millis
  [date-as-str]
  (.getMillis (.parseDateTime (DateTimeFormat/forPattern "dd/MMM/yyyy:HH:mm:ss Z") date-as-str)))
#'user/as-millis

user=> (defn access-log-to-dataset
  [filename]
  (let [unmod-dataset (col-names (to-dataset (records-from-access-log filename)) ["Date" "User"])]
    (col-names (conj-cols ($map as-millis "Date" unmod-dataset) ($ "User" unmod-dataset)) ["Date Time In Ms" "User"])))
#'user/access-log-to-dataset

While the date parsing should be pretty straightforward to understand, there are a few interesting things going on with the Incanter code that we should dwell on briefly.

The $ function extracts a named column, whereas the $map function runs another function over the named column from the dataset, returning the modified column (pretty familiar if you’ve used map). conj-cols then takes these resulting sequences to create our final dataset.
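Going back to the date parsing for a moment: the SimpleDateFormat route mentioned earlier might look something like the sketch below. It only uses classes from the JDK; as-millis-sdf is an illustrative name, and pinning the locale to English is an assumption made so that month names like "Aug" parse on any machine.

```clojure
(import '(java.text SimpleDateFormat)
        '(java.util Locale))

(defn as-millis-sdf
  [date-as-str]
  ;; SimpleDateFormat is not thread-safe, so build a fresh one per call.
  (let [fmt (SimpleDateFormat. "dd/MMM/yyyy:HH:mm:ss Z" Locale/ENGLISH)]
    ;; parse returns a java.util.Date; getTime gives milliseconds since the epoch.
    (.getTime (.parse fmt date-as-str))))

(as-millis-sdf "05/Aug/2010:17:27:24 +0100")
;; => 1281025644000
```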

We’re not quite done yet though. We have our time-series records – each representing one hit on our webserver – but don’t actually have values to graph. We also need to work out how we group hits by minute. What we’re going to do is replace our as-millis function with one that rounds down to the minute. Then, we’re going to use Incanter to group those rows together – summing the hits it finds per minute. But before that, we need to tell Incanter that each row represents a hit, by adding a ‘Hits’ column. We’re also going to ditch the user column, as it isn’t going to help us here:

user=> (defn access-log-to-dataset
  [filename]
  (let [unmod-dataset (col-names (to-dataset (records-from-access-log filename)) ["Date" "User"])]
    (col-names (conj-cols ($map as-millis "Date" unmod-dataset) (repeat 1)) ["Date" "Hits"])))
#'user/access-log-to-dataset

Next, we need to create a new function to round our date & time to the nearest minute.

Update: The earlier version of this post screwed up, and the presented round-ms-down-to-nearest-min actually rounded to the nearest second. This is a corrected version:

(defn round-ms-down-to-nearest-min
  [millis]
  (* 60000 (quot millis 60000)))

If you wanted hits per second, here is the function:

(defn round-ms-down-to-nearest-sec
  [millis]
  (* 1000 (quot millis 1000)))
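A quick check of the arithmetic, using a small made-up value and the millisecond timestamp for 05/Aug/2010:17:27:24 +0100:

```clojure
(defn round-ms-down-to-nearest-min
  [millis]
  ;; quot drops the remainder, so multiplying back gives the start of the minute.
  (* 60000 (quot millis 60000)))

(defn round-ms-down-to-nearest-sec
  [millis]
  (* 1000 (quot millis 1000)))

;; 125500 ms is 2 minutes 5.5 seconds
(round-ms-down-to-nearest-min 125500) ;; => 120000
(round-ms-down-to-nearest-sec 125500) ;; => 125000

;; 17:27:24 drops back to 17:27:00
(round-ms-down-to-nearest-min 1281025644000) ;; => 1281025620000
```

Any two hits within the same minute end up with the same value, which is what lets them be grouped together later.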

And one more tweak to access-log-to-dataset to use the new function:

(defn access-log-to-dataset
  [filename]
  (let [unmod-dataset (col-names (to-dataset (records-from-access-log filename)) ["Date" "User"])]
    (col-names (conj-cols ($map #(round-ms-down-to-nearest-min (as-millis %)) "Date" unmod-dataset) (repeat 1)) ["Date" "Hits"])))

Finally, we need to roll our data up, summing the hits per minute – all this done using $rollup:

(defn access-log-to-dataset
  [filename]
  (let [unmod-dataset (col-names (to-dataset (records-from-access-log filename)) ["Date" "User"])]
    ($rollup :sum "Hits" "Date" 
      (col-names (conj-cols ($map #(round-ms-down-to-nearest-min (as-millis %)) "Date" unmod-dataset) (repeat 1)) ["Date" "Hits"]))))

$rollup applies a summary function to a given column (in our case “Hits”), grouping the rows by the values of another column (“Date” in our case). :sum here is one of Incanter’s built-in summary functions, but we could provide our own.
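If it helps to see the idea without Incanter in the way, the grouping and summarising can be sketched in plain Clojure. This is a stand-in for the concept, not how Incanter implements $rollup; the rollup function, the keyword column names and the sample rows are all made up for illustration.

```clojure
(defn rollup
  "Group rows by group-key, then summarise the value-key column of each group."
  [summarise value-key group-key rows]
  (into (sorted-map)
        (for [[group-value grouped-rows] (group-by group-key rows)]
          [group-value (summarise (map value-key grouped-rows))])))

;; Hypothetical rows: three hits, two of them in the same minute.
(def hits
  [{:date 1281025620000 :hits 1}
   {:date 1281025620000 :hits 1}
   {:date 1281025680000 :hits 1}])

(rollup #(reduce + %) :hits :date hits)
;; => {1281025620000 2, 1281025680000 1}
```

Swapping #(reduce + %) for a different function is the plain-Clojure equivalent of passing your own summary function instead of :sum.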

And the resulting dataset:

Now we have our dataset, let’s graph it:

user=> (defn hit-graph
  [dataset]
  (time-series-plot :Date :Hits
                             :x-label "Date"
                             :y-label "Hits"
                             :title "Hits"
                             :data dataset))

user=> (view (hit-graph (access-log-to-dataset "/path/to/example-access.log")))

This is deeply unexciting – what about if we try a bigger dataset? Then we get things like this:

Conclusion

You can grab the final code here.

Incanter is much more than simply a way of graphing data. This (somewhat) brief example shows you how to get log data into an Incanter-friendly format – what you want to do with it then is up to you. I may well explore other aspects of Incanter in further posts.

In the spirit of making my mistakes in public – something which I have a long history of on this blog – I thought I’d post up a solution I came up with for a relatively simple problem. I’m not unhappy with the solution – it works – but I can’t help thinking I’m missing something and there is a more elegant solution out there.

The Problem

For our input we have a set of time series records. Each record contains a list of name value pairs. For each record, the keys are not fixed – they may vary. Let’s imagine that we’re recording viewing figures for the major UK soap operas – some soaps are on every day, some are only on a few days a week.

In Clojure, we’ve got our data in the following form:

(("Monday" {:eastenders 6.5, :thearchers 2.3, :corrinationstreet 5.6})
 ("Tuesday" {:eastenders 6.8, :thearchers 1.4})
 ...)

We want to convert this into a single table of data, with the keys from the source data representing the columns, with each row representing a different timestamp, so that we can visualise the data with something like gnuplot, Incanter or just plain old Excel. So we want to get to something like this (yes, I know The Archers is on at the weekend too, but this is just an example):

Day        Eastenders  The Archers  Coronation Street
Monday     6.5         2.3          5.6
Tuesday    6.8         1.4
Wednesday              2.3          7
Thursday   6.7         2.8
Friday     9.8         2.1          7

The challenge here (such as it is) is that we want a sparse table, and that our code cannot know beforehand the total universe of soap names (what if a new soap launched?).

The Header Row

The first part of this problem as I saw it was to determine which soaps our records represented to create a header row. The solution I came up with was to stick the keys for all records into a set:

user=> (def soaps 
'(("Monday" {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
("Tuesday" {"eastenders" 6.8, "thearchers" 1.4})))

#'user/soaps
user=> (defn as-columns [records]
(apply (partial conj #{}) (mapcat keys (map second records))))
#'user/as-columns

user=> (as-columns soaps)
#{"corrinationstreet" "eastenders" "thearchers"}

Assuming we want a CSV file to store our content, a function to create the header row becomes:

user=> (str "Date," (apply str (interpose "," (as-columns soaps))))
"Date,corrinationstreet,eastenders,thearchers"

The Data

Now we have a list of all possible keys (in this example, the names of the soaps), we can use this to extract data from the records for each row. Getting data from a map is straightforward – even handling the not-there case is simple enough:

user=> (def some-map 
{"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
#'user/some-map

user=> (get some-map "eastenders" "-")
6.5

user=> (get some-map "hollyoaks" "-")
"-"

Given a list of known columns, we can use a list comprehension to extract the data in a consistent order:

user=> (for [col #{"eastenders" "thearchers" "corrinationstreet"}] 
(get {"eastenders" 6.8, "thearchers" 1.4} col "-"))
("-" 6.8 1.4)

Pulling It All Together

Taking those various strands, we end up with the following solution:

(defn as-columns 
  [records]
  (apply (partial conj #{}) (mapcat keys (map second records))))

(defn header-row 
  [records]
  (str "Date," (apply str (interpose "," (as-columns records)))))

(defn values-for-record
  [columns values]
  (for [col columns] (get values col "-")))

(defn as-row
  [record columns]
  (let [day (first record)
        values (second record)]
   (str day ","
    (apply str (interpose "," (values-for-record columns values))))))

(defn as-data
  [records]
  (apply str 
    (interpose "\n" 
      (for [record records] 
          (as-row record (as-columns records))))))

(defn as-table
  [records]
    (apply str (header-row records) "\n" (as-data records)))

Things I like with the solution: it works.

Things I don’t like with the solution:

  • Duplicated call to as-columns, in both as-data and header-row
  • Still don’t think I’ve got the indentation right
  • Still worried I’m creating functions which are too large, or that are not readable
  • This solution was arrived at by hacking on the code in a REPL – not TDD. My Clojure skills are still lacking, so I have to embark on the occasional hack-a-thon to learn some things – this being one such exercise. I plan to re-implement this using TDD with what I’ve learnt to see what I end up with. It will be interesting to see if TDDing this allays my fears about function size.
  • Lots of (apply str (interpose… duplication going around – I should factor that out
  • Not sure if the list comprehension here is needed – have I missed something obvious?
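For what it's worth, a couple of those niggles can be addressed with clojure.string/join and a plain map. The following is only one possible shape, not a definitive answer: it computes the columns once, sorts them so the column order is predictable (a choice made here, not something the original code did), and folds the row and header helpers into a single function.

```clojure
(require '[clojure.string :as string])

(def soaps
  '(("Monday" {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
    ("Tuesday" {"eastenders" 6.8, "thearchers" 1.4})))

(defn as-table
  [records]
  (let [;; Work out the columns once and reuse them for the header and every row.
        columns (sort (set (mapcat (comp keys second) records)))
        csv     (fn [cells] (string/join "," cells))
        ;; map does the job of the list comprehension; "-" marks a missing value.
        row     (fn [[day values]]
                  (csv (cons day (map #(get values % "-") columns))))]
    (string/join "\n" (cons (csv (cons "Date" columns))
                            (map row records)))))

(println (as-table soaps))
;; Date,corrinationstreet,eastenders,thearchers
;; Monday,5.6,6.5,2.3
;; Tuesday,-,6.8,1.4
```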

Updated to reflect some feedback and one example of using commons-exec as an alternative to the plain old Runtime.exec

Second Update to reflect use of shell-out – thanks Scott!

Basic

Making use of clojure.contrib.duck-streams:

(ns utils
 (:use clojure.contrib.duck-streams))

(defn execute [command]
  (let [process (.exec (Runtime/getRuntime) command)]
    (if (= 0 (.waitFor  process))
        (read-lines (.getInputStream process))
        (read-lines (.getErrorStream process)))))

...
user=> (execute "ls")
("MyProject.iml" "lib" "out" "src" "test")

It could be improved obviously – for example catching some of the potential IOExceptions that can result to rethrow additional information, such as the command being executed, or the ability to take a seq of program arguments.

Error & Argument Handling

This version adds some basic (and ugly) exception handling, and also handles spacing out arguments passed in (so passing "ls" "-la" gets processed into "ls -la"):

(import 'java.io.IOException)

(defn execute
  "Executes a command-line program, returning stdout if a zero return code, else the
  error out. Takes a list of strings which represent the command & arguments"
  [& args]
  (try
    (let [process (.exec (Runtime/getRuntime) (reduce str (interleave args (iterate str " "))))]
      (if (= 0 (.waitFor  process))
          (read-lines (.getInputStream process))
          (read-lines (.getErrorStream process))))
    (catch IOException ioe
      (throw (new RuntimeException (str "Cannot run " args) ioe)))))

Using commons-exec

I had some problems with hanging processes, so knocked up a version using Apache’s commons-exec. This version has the added advantage of killing long-running processes, and I folded in Steve’s suggestion for a better way of splicing in the spaces in the command line args (see his comment). commons-exec is part of the special sauce inside Ant, so is a rock solid way of launching command-line processes (well, as rock solid as Java gets).

The use of the ByteArrayOutputStream is probably inefficient, and again, decent error handling is left as an exercise to the reader.

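(import '(java.io ByteArrayOutputStream)
        '(org.apache.commons.exec CommandLine DefaultExecutor ExecuteWatchdog PumpStreamHandler))

(defn alternative-execute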
  "Executes a command-line program, returning stdout if a zero return code, else the
  error out. Takes a list of strings which represent the command & arguments"
  [& args]
  (let [output-stream (new ByteArrayOutputStream)
        error-stream (new ByteArrayOutputStream)
        stream-handler (new PumpStreamHandler output-stream error-stream)
        executor (doto
                  (new DefaultExecutor)
                  (.setExitValue 0)
                  (.setStreamHandler stream-handler)
                  (.setWatchdog (new ExecuteWatchdog 20000)))]
     (if (= 0 (.execute executor (CommandLine/parse (apply str (interpose " " args)))))
       (.toString output-stream)
       (.toString error-stream))))

Using clojure.contrib.shell-out

Many thanks to Scott for this. clojure.contrib supplies the very neat shell-out:

user=> (use 'clojure.contrib.shell-out)
nil
user=> (sh "ls" "-la")

I haven’t probed further to see if this deals with my hanging process problem, but it certainly doesn’t seem to have any support for killing processes that run too long. If you’re worried about runaway tasks, the commons-exec version above might be the right choice for you.
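If pulling in commons-exec isn't an option, a timeout can also be sketched with nothing but the JDK. The version below is an assumption-laden sketch rather than a tested replacement: it needs Java 8 or later (for the timed waitFor), the usage example assumes a Unix-like machine with echo on the path, and run-with-timeout is a made-up name. Output goes to a temporary file so a chatty process cannot block on a full pipe while we wait.

```clojure
(import '(java.io File)
        '(java.util.concurrent TimeUnit))

(defn run-with-timeout
  "Runs a command, returning a map of :exit and :out. If the process is still
  running after timeout-ms milliseconds it is killed and :exit is :timed-out."
  [timeout-ms & args]
  (let [out-file (File/createTempFile "command-output" ".txt")]
    (try
      (let [^Process process (-> (ProcessBuilder. ^java.util.List (vec args))
                                 (.redirectErrorStream true) ; fold stderr into stdout
                                 (.redirectOutput out-file)
                                 (.start))]
        (if (.waitFor process (long timeout-ms) TimeUnit/MILLISECONDS)
          {:exit (.exitValue process) :out (slurp out-file)}
          (do (.destroyForcibly process)
              {:exit :timed-out :out (slurp out-file)})))
      (finally
        (.delete out-file)))))

(run-with-timeout 5000 "echo" "hello")
;; => {:exit 0, :out "hello\n"}
```

Unlike the earlier versions, this merges stderr into stdout rather than choosing between them based on the exit code.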

I’ve been playing around with partially applied functions in Clojure, and have hit an interesting snag when dealing with Java interop. First, let’s examine what partial does in Clojure, by cribbing off an example from Stuart Halloway’s Programming Clojure:

user=> (defn add [one two] (+ one two))
#'user/add
user=> (add 1 2)
3
user=> (def increment-by-two (partial add 2))
#'user/increment-by-two
user=> (increment-by-two 5)
7

What partial is doing is partially applying the function – in our case we have applied one of the two arguments our add implementation requires, and got back another function we can call later to pass in the second argument. This example is obviously rather trivial, but partially applied functions can be very handy in a number of situations.

Anyway, this wasn’t supposed to be a discussion of partial in general, but one problem I’ve hit with it when trying to partially apply a call to a Java static method. So, let’s implement our trivial add method in plain old Java:

public class Functions {
    public static int add(int first, int second) {
        return first + second;
    }
}

Then try using partial as before:

user=> (import com.xxx.yyy.Functions)
com.xxx.yyy.Functions
user=> (Functions/add 1 2)
3
user=> (def increment-by-two (partial Functions/add 2))
java.lang.Exception: Unable to find static field: add in class com.xxx.yyy.Functions (NO_SOURCE_FILE:3)
user=> 

So it seems like the partial call can’t handle static calls in this situation. But what if I wrap the call in another function?

user=> (defn java-add [arg1 arg2] (Functions/add arg1 arg2))
#'user/java-add
user=> (def increment-by-two (partial java-add 2))
#'user/increment-by-two
user=> (increment-by-two 10)
12

Which works. There is probably a reason why, but I can’t quite work it out right now.
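One plausible explanation, offered as a guess rather than a definitive answer: partial expects a function value, and in Clojure of this vintage a static method is not one. Outside the head of a call form, a symbol like Functions/add is read as a static field lookup, which is what the error message complains about. Wrapping the call in any Clojure function, named or anonymous, gives partial something it can hold on to. The sketch below uses Math/max from the JDK in place of the Functions class so that it runs without any extra Java; at-least-two and at-least-five are just illustrative names.

```clojure
;; Wrap the static call in an anonymous function so partial gets a real Clojure fn.
(def at-least-two
  (partial (fn [a b] (Math/max (long a) (long b))) 2))

(at-least-two 10) ;; => 10
(at-least-two 1)  ;; => 2

;; The same idea without partial: a function literal with the first argument fixed.
(def at-least-five #(Math/max 5 (long %)))

(at-least-five 3) ;; => 5
```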

Posted in the “I hope no-one else has to go through this” category in the hope that Google surfaces this for some other poor soul.

Picture a rather trivial split function:

(defn split [str delimiter]
  ((seq (.split str delimiter))))

Which helpfully spits out:

java.lang.ClassCastException: clojure.lang.ArraySeq cannot be cast to clojure.lang.IFn

The issue here is the additional set of parentheses – a hang-over from a previous edit. The outer pair makes Clojure treat the seq returned by the inner expression as a function and try to call it, hence the cast to IFn. Removing them fixed the trouble.
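For completeness, the working version looks something like this. The first parameter is renamed to s here (so it no longer shadows clojure.core/str) and given a String type hint; both are small liberties taken with the original.

```clojure
(defn split
  [^String s delimiter]
  ;; A single pair of parentheses: call seq on the array and return the result.
  (seq (.split s delimiter)))

(split "a,b,c" ",")
;; => ("a" "b" "c")
```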

I’ve recently been working on a Clojure application that I hope to open source soon. It’s been my first experience of using Clojure, and is almost certainly one of the most thought-provoking things I’ve done in a long while. One of the things that is still causing me issues is how to go about TDDing Clojure applications – or rather functional programs in general.

My natural inclination – for many reasons – is to use TDD as my process of choice for developing my code. Beyond its use as a design tool, it’s having a safety net to catch me if I screw something up. It allows me to be a little more brave, and drastically reduces the cycle between changing some code and being happy that it works. I’m used to that safety net – I feel lost without it.

Stuart Halloway said during his Clojure talk at Qcon SF that despite being a TDD fan he finds it hard to TDD in a new language, and I get exactly what he means. A big part of it is that you’re getting to grips with the idioms, capabilities, libraries and tools associated with your new language – and a lack of this knowledge is going to impact on your ability to write good tests, let alone worry about implementing them.

Typically, when learning a new language I try and write a small application that has a real world need. BigVisibleWall was my attempt to learn Scala – but it had a real goal. With BigVisibleWall, as with my current Clojure project, I started by implementing the system by just writing the production code. I’m pushing the limits of my knowledge constantly, attempting to understand the size and shape of the solution space that I find myself in with this new tool. Once I got BigVisibleWall working with a small set of features, I broke it down and rewrote it TDD style – at that point, I had enough Scala (and I mean *just* enough) to be able to do this without it feeling like I was wading through treacle.

I consciously decided to follow the same pattern with my Clojure project. Code the main logic, get it running, then break it down and rewrite it piece by piece using TDD. But then I hit a problem – Scala and Java are similar enough languages that my programming style didn’t have to change much from one to the other. Therefore the way I structured the code and thought about TDD didn’t have to shift much. In both cases I was driving the design of an Object Oriented system. With Clojure though it wasn’t just the language which was different – so many of the underlying concepts were different too. Put simply, I really don’t know where to begin.

My first instinct is to start decomposing functions, passing in stubs to the functions under test. But this just feels like I’m trying to shoehorn IOC-type patterns into a functional program. But what am I left with – testing large combinations of functions? That feels wrong too.

So what about you lot out there in blogland? Any other OO types trying to make the switch and encountering the same issues? Or any FP practitioners for whom TDD is second nature? Or does TDD just not fit with FP after all?

I’ve been working on a couple of spare time projects, both of which I hope to release more formally in the next few weeks. One of them involves development of a simple web application for deployment on Google App Engine. As part of the development, I had to modify an existing open source Clojure API – my changes are now available for all.

appengine-clj was written by ThoughtWorks colleague John Hume. It provides some Clojure-esque wrappers over Google App Engine’s user authentication and low level datastore API. John outlines his use of the library in a highly useful post on using Compojure & Clojure on the App Engine – it was this post which helped immensely in getting started myself.

There were a couple of minor issues with the latest version of John’s API which stopped me from being able to use it for my latest project – so I created a fork to make the changes I wanted. First, a general issue. I love projects which make checkout and build easy and bullet-proof. For me, that means check in the build tool & all dependencies. I know this is a contentious point – I may well write a post on it later. The other issue is that since the 1.2 version of the SDK some of the APIs have changed a little, so I updated the datastore testing macro accordingly.

My Clojure skills are highly limited, and the modest modifications are probably botched, but nonetheless it seems to work. My fork can be found on GitHub.

So all the cool Clojure kids keep wanting me to use Emacs. The problem is that I haven’t used Emacs for the last 10 years – since, in fact, I had to support a C application on about 7 different flavours of UNIX. As you can imagine, I’ve since expunged many of those past memories.

My IDE of choice – ever since I joined ThoughtWorks – has been IntelliJ. Yes, I had to spend my time in the wilderness with Eclipse, long enough that I feel well placed to compare the two and consider IntelliJ superior for the languages I use often. La Clojure now seems to play nicely with IntelliJ’s Community Edition, so I’m giving that a try.

Ultimately, I’m learning a new language, one which often requires my brain to work in a quite different fashion than it is used to. As such, I’m trying to limit the number of new things I have to deal with. If, however, I’m missing out on something by not using Emacs, I may be persuaded to give it a go. So can anyone out there tell me what I’m missing?