Creating Sparse Tabular Data With Clojure

In the spirit of making my mistakes in public – something which I have a long history of on this blog – I thought I’d post up a solution I came up with for a relatively simple problem. I’m not unhappy with the solution – it works – but I can’t help thinking I’m missing something and there is a more elegant solution out there.

The Problem

For our input we have a set of time series records. Each record contains a list of name-value pairs. The keys are not fixed – they may vary from record to record. Let’s imagine that we’re recording viewing figures for the major UK soap operas – some soaps are on every day, some are only on a few days a week.

In Clojure, we’ve got our data in the following form:

(("Monday" {:eastenders 6.5, :thearchers 2.3, :corrinationstreet 5.6})
 ("Tuesday" {:eastenders 6.8, :thearchers 1.4})
 ...)

We want to convert this into a single table of data, with the keys from the source data as the columns and each row representing a different timestamp, so that we can visualise the data with something like gnuplot, Incanter or just plain old Excel. So we want to get to something like this (yes, I know The Archers is on at the weekend too, but this is just an example):

Day	Eastenders	The Archers	Coronation Street
Monday	6.5	2.3	5.6
Tuesday	6.8	1.4
Wednesday		2.3	7
Thursday	6.7	2.8
Friday	9.8	2.1	7

The challenge here (such as it is) is that we want a sparse table, and that our code cannot know beforehand the total universe of soap names (what if a new soap launched?).

The Header Row

The first part of this problem as I saw it was to determine which soaps our records represented to create a header row. The solution I came up with was to stick the keys for all records into a set:

user=> (def soaps 
'(("Monday" {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
("Tuesday" {"eastenders" 6.8, "thearchers" 1.4})))

#'user/soaps
user=> (defn as-columns [records]
(apply (partial conj #{}) (mapcat keys (map second records))))
#'user/as-columns

user=> (as-columns soaps)
#{"corrinationstreet" "eastenders" "thearchers"}

Assuming we want a CSV file to store our content, a function to create the header row becomes:

user=> (str "Date," (apply str (interpose "," (as-columns soaps))))
"Date,corrinationstreet,eastenders,thearchers"
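One caveat with the set: Clojure's hash sets make no promise about iteration order, so the column order in the header is an accident of hashing. A minimal sketch of one way around that, assuming alphabetical column order is acceptable and that clojure.string/join is available (Clojure 1.2 or later). The names sorted-columns and sorted-header-row are made up for this illustration and are not part of the solution in this post.

```clojure
(require '[clojure.string :as string])

(def soaps
  '(("Monday" {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
    ("Tuesday" {"eastenders" 6.8, "thearchers" 1.4})))

;; Hypothetical variant of as-columns: a sorted set still removes
;; duplicates, but always iterates in alphabetical order.
(defn sorted-columns [records]
  (into (sorted-set) (mapcat keys (map second records))))

;; Hypothetical variant of the header expression above.
(defn sorted-header-row [records]
  (string/join "," (cons "Date" (sorted-columns records))))

(sorted-header-row soaps)
;; => "Date,corrinationstreet,eastenders,thearchers"
```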

The Data

Now we have a list of all possible keys (in this example, the names of the soaps), we can use this to extract data from the records for each row. Getting data from a map is straightforward – even handling the not-there case is simple enough:

user=> (def some-map 
{"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
#'user/some-map

user=> (get some-map "eastenders" "-")
6.5

user=> (get some-map "hollyoaks" "-")
"-"
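One subtlety, sketched here with made-up data: the default passed to get is only used when the key is absent. A key that is present with a nil value still comes back as nil. If the records could ever contain nil figures (an assumption, the data in this post never does), a small helper such as the hypothetical value-or-dash below would treat both cases the same way.

```clojure
;; Made-up data: "hollyoaks" is present but has no figure recorded.
(def figures {"eastenders" 6.5, "hollyoaks" nil})

(get figures "eastenders" "-")  ;; => 6.5
(get figures "emmerdale" "-")   ;; => "-"  (key absent, default used)
(get figures "hollyoaks" "-")   ;; => nil  (key present, default ignored)

;; Hypothetical helper that treats nil the same as a missing key.
(defn value-or-dash [m k]
  (let [v (get m k)]
    (if (nil? v) "-" v)))

(value-or-dash figures "hollyoaks")  ;; => "-"
```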

Given a list of known columns, we can use a list comprehension to extract the data in a consistent order:

user=> (for [col #{"eastenders" "thearchers" "corrinationstreet"}] 
(get {"eastenders" 6.8, "thearchers" 1.4} col "-"))
("-" 6.8 1.4)
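To see why a fixed column list matters, here is a small sketch that runs the same comprehension over every record. It assumes the columns are held in a vector so that the order is predictable; the names columns and rows are only for illustration.

```clojure
;; Assumed column order, written out by hand for this example.
(def columns ["corrinationstreet" "eastenders" "thearchers"])

(def soaps
  '(("Monday" {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
    ("Tuesday" {"eastenders" 6.8, "thearchers" 1.4})))

;; One sequence of cells per record: the day, then one value per column,
;; with "-" standing in for soaps that were not on that day.
(def rows
  (for [[day values] soaps]
    (cons day (for [col columns] (get values col "-")))))

rows
;; => (("Monday" 5.6 6.5 2.3) ("Tuesday" "-" 6.8 1.4))
```

Every row ends up with the same number of cells in the same positions, which is exactly what the CSV output needs.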

Pulling It All Together

Taking those various strands, we end up with the following solution:

(defn as-columns 
  [records]
  (apply (partial conj #{}) (mapcat keys (map second records))))

(defn header-row 
  [records]
  (str "Date," (apply str (interpose "," (as-columns records)))))

(defn values-for-record
  [columns values]
  (for [col columns] (get values col "-")))

(defn as-row
  [record columns]
  (let [day (first record)
        values (second record)]
   (str day ","
    (apply str (interpose "," (values-for-record columns values))))))

(defn as-data
  [records]
  (apply str 
    (interpose "\n" 
      (for [record records] 
          (as-row record (as-columns records))))))

(defn as-table
  [records]
    (apply str (header-row records) "\n" (as-data records)))

Things I like with the solution: it works.

Things I don’t like with the solution:

* Duplicated call to as-columns, in both as-data and header-row
* Still don’t think I’ve got the indentation right
* Still worried I’m creating functions which are too large, or that are not readable
* This solution was arrived at by hacking on the code in a REPL – not TDD. My Clojure skills are still lacking, so I have to embark on the occasional hack-a-thon to learn some things – this being one such exercise. I plan to re-implement this using TDD with what I’ve learnt to see what I end up with. It will be interesting to see if TDDing this allays my fears about function size.
* Lots of (apply str (interpose… duplication going around – I should factor that out
* Not sure if the list comprehension here is needed – have I missed something obvious?
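A rough sketch of how the duplicated call to as-columns and the repeated (apply str (interpose ...)) pattern might be tidied up. It is one possible rework under stated assumptions, not a definitive version: it relies on clojure.string/join (Clojure 1.2 or later), it swaps the hash set for a sorted set so the column order is predictable, and csv-line is a made-up helper name.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper: the one place that knows cells are comma-separated.
(defn csv-line [cells]
  (string/join "," cells))

;; Reworked as-columns: a sorted set (an assumption, the original uses a hash set).
(defn as-columns [records]
  (into (sorted-set) (mapcat keys (map second records))))

;; Reworked as-row: destructures the record into its day and its map of values.
(defn as-row [columns [day values]]
  (csv-line (cons day (for [col columns] (get values col "-")))))

(defn as-table [records]
  (let [columns (as-columns records)]  ; computed once, shared by header and rows
    (string/join "\n"
                 (cons (csv-line (cons "Date" columns))
                       (map #(as-row columns %) records)))))

(def soaps
  '(("Monday" {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
    ("Tuesday" {"eastenders" 6.8, "thearchers" 1.4})))

(as-table soaps)
;; => "Date,corrinationstreet,eastenders,thearchers\nMonday,5.6,6.5,2.3\nTuesday,-,6.8,1.4"
```

Computing the columns in a let inside as-table means the records are only scanned for keys once, and every row is guaranteed to use the same column order as the header.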

August 10, 2010

3 Responses to “Creating Sparse Tabular Data With Clojure”

Justin Kramer August 11, 2010

Some tips:
* Check out clojure.string/join (1.2) or clojure.contrib.string/join (1.1)
* A handy idiom: (map {:a 1 :b 2} [:a :b]) => (1 2)
* Use destructuring to grab the first and second elements from a sequence
Here’s a quick rewrite:
(use '[clojure.string :only [join]])

(defn to-csv [daily-stats]
  (let [cols (distinct (mapcat (comp keys second) daily-stats))]
    (str (join "," (cons "Day" cols))
         "\n"
         (join "\n"
               (for [[day stats] daily-stats]
                 (join "," (cons day (map #(stats % "-") cols))))))))

Sam Newman August 11, 2010

Spotted join about an hour after I wrote that – it’ll help a lot. And that map idiom is exactly what I was looking for. Many thanks Justin!

I still think I’ll do a rewrite from scratch, but I now have a couple of new tools in my armory.

Graphing Hits Per Second Using Clojure & Incanter – Magpiebrain August 15, 2010

[…] in a re-occurring series of posts showing my limited understanding of Clojure, today we’re using Clojure for log processing. This example is culled from some work […]

magpiebrain

Sam Newman's site, a Consultant at ThoughtWorks
