Sam Newman's site, a Consultant at ThoughtWorks

In the spirit of making my mistakes in public – something which I have a long history of on this blog – I thought I’d post up a solution I came up with for a relatively simple problem. I’m not unhappy with the solution – it works – but I can’t help thinking I’m missing something and there is a more elegant solution out there.

The Problem

For our input we have a set of time series records. Each record contains a list of name value pairs. For each record, the keys are not fixed – they may vary. Let’s imagine that we’re recording viewing figures for the major UK soap operas – some soaps are on every day, some are only on a few days a week.

In Clojure, we’re got our data in the following form:

(("Monday" {:eastenders 6.5, :thearchers 2.3, :corrinationstreet 5.6})
 ("Tuesday" {:eastenders 6.8, :thearchers 1.4})

We want to convert this into a single table of data, with the keys from the source data representing the columns, with each row representing a different timestamp, so that we can visualise the data with something like gnuplot, Incanter or just plain old excel. So we want to get to something like this (yes, I know The Archers is on at the weekend too, but this is just an example):

Day Eastenders The Archers Coronation Street
Monday 6.5 2.3 5.6
Tuesday 6.8 1.4
Wednesday   2.3 7
Thursday 6.7 2.8
Friday 9.8 2.1 7

The challenge here (such as it is) is that we want a sparse table, and that our code cannot know beforehand the total universe of soap names (what if a new soap launched?).

The Header Row

The first part of this problem as I saw it was to determine which soaps our records represented to create a header row. The solution I came up with was to stick the keys for all records into a set:

user=> (def soaps 
'(("Monday" {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
("Tuesday" {"eastenders" 6.8, "thearchers" 1.4})))

user=> (defn as-columns [records]
(apply (partial conj #{}) (mapcat keys (map second records)))))

user=> (as-columns soaps)
#{"corrinationstreet" "eastenders" "thearchers"}

Assuming we want a CSV file to store our content, a function to create the header row becomes:

user=> (str "Date," (apply str (interpose "," (as-columns soaps))))

The Data

Now we have a list of all possible keys (in this example, the names of the soaps), we can use this to extract data from the records for each row. Getting data from a map is straightforward – even handling the not-there case is simple enough:

user=> (def some-map 
{"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})

user=> (get some-map "eastenders" "-")

user=> (get some-map "hollyoaks" "-")

Given a list of known columns, we can use a list comprehension to extract the data in a consistent order:

user=> (for [col #{"eastenders" "thearchers" "corrinationstreet"}] 
(get {"eastenders" 6.8, "thearchers" 1.4} col "-"))
(["-"] [6.8] [1.4])

Pulling It All Together

Taking those various strands, we end up with the following solution:

(defn as-columns 
  (apply (partial conj #{}) (mapcat keys (map second records))))

(defn header-row 
  (str "Date," (apply str (interpose "," (as-columns records)))))

(defn values-for-record
  [columns values]
  (for [col columns] (get values col "-")))

(defn as-row
  [record columns]
  (let [day (first record)
        values (second record)]
   (str day ","
    (apply str (interpose "," (values-for-record columns values))))))

(defn as-data
  (apply str 
    (interpose "n" 
      (for [record records] 
          (as-row record (as-columns records))))))

(defn as-table
    (apply str (header-row records) "n" (as-data records)))

Things I like with the solution: it works.

Things I don’t like with the solution:

  • Duplicated call to as-columns, in both as-data and header-row
  • Still don’t think I’ve got the indentation right
  • Still worried I’m creating functions which are too large, or that are not readable
  • This solution was arrived at by hacking on the code in a REPL – not TDD. My Clojure skills are still lacking, so I have to embark on the occasional hack-a-thon to learn some things – this being one such exercise. I plan to re-implement this using TDD with what I’ve learnt to see what I end up with. It will be interesting to see if TDDing this allays my fears about function size.
  • Lots of (apply str (interpose… duplication going around – I should factor that out
  • Not sure if the list comprehension here is needed – have I missed something obvious?

3 Responses to “Creating Sparse Tabular Data With Clojure”

  1. Justin Kramer

    Some tips:
    * Check out clojure.string/join (1.2) or clojure.contrib.string/join (1.1)
    * A handy idiom: (map {:a 1 :b 2} [:a :b]) => (1 2)
    * Use destructuring to grab the first and second elements from a sequence
    Here’s a quick rewrite:

    (use '[clojure.string :only [join]])
    (defn to-csv [daily-stats]
    (let [cols (distinct (mapcat (comp keys second) daily-stats))]
    (join "," (cons "Day" cols)) "n"
    (join "n"
    (for [[day stats] daily-stats]
    (join "," (cons day (map #(stats % -) cols))))))))

  2. Sam Newman

    Spotted join about an hour after I wrote that – it’ll help a lot. And that map idiom is exactly what I was looking for. Many thanks Justin!

    I still think I’ll do a rewrite from scratch, but I now have a couple of new tools in my armory.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Basic HTML is allowed. Your email address will not be published.

Subscribe to this comment feed via RSS

%d bloggers like this: