In the spirit of making my mistakes in public – something which I have a long history of on this blog – I thought I’d post up a solution I came up with for a relatively simple problem. I’m not unhappy with the solution – it works – but I can’t help thinking I’m missing something and there is a more elegant solution out there.
The Problem
For our input we have a set of time series records. Each record contains a list of name-value pairs. The keys are not fixed – they may vary from record to record. Let’s imagine that we’re recording viewing figures for the major UK soap operas – some soaps are on every day, some are only on a few days a week.
In Clojure, we’ve got our data in the following form:
(("Monday" {:eastenders 6.5, :thearchers 2.3, :corrinationstreet 5.6})
 ("Tuesday" {:eastenders 6.8, :thearchers 1.4})
 ...)
We want to convert this into a single table of data, with the keys from the source data representing the columns and each row representing a different timestamp, so that we can visualise the data with something like gnuplot, Incanter or just plain old Excel. So we want to get to something like this (yes, I know The Archers is on at the weekend too, but this is just an example):
Day | Eastenders | The Archers | Coronation Street |
---|---|---|---|
Monday | 6.5 | 2.3 | 5.6 |
Tuesday | 6.8 | 1.4 | |
Wednesday | 2.3 | 7 | |
Thursday | 6.7 | 2.8 | |
Friday | 9.8 | 2.1 | 7 |
The challenge here (such as it is) is that we want a sparse table, and that our code cannot know beforehand the total universe of soap names (what if a new soap launched?).
The Header Row
The first part of this problem as I saw it was to determine which soaps our records represented to create a header row. The solution I came up with was to stick the keys for all records into a set:
user=> (def soaps '(("Monday" {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
                    ("Tuesday" {"eastenders" 6.8, "thearchers" 1.4})))
#'user/soaps
user=> (defn as-columns [records]
         (apply (partial conj #{}) (mapcat keys (map second records))))
#'user/as-columns
user=> (as-columns soaps)
#{"corrinationstreet" "eastenders" "thearchers"}
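One thing to keep in mind is that a plain set makes no promise about the order its elements come back in. A minimal sketch of one way around that, assuming the same `soaps` data as above; `as-sorted-columns` is a name I've made up for illustration and is not part of the solution in this post:

```clojure
;; Sample data copied from the REPL session above.
(def soaps
  '(("Monday"  {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
    ("Tuesday" {"eastenders" 6.8, "thearchers" 1.4})))

;; Hypothetical alternative to as-columns: pour every key into a sorted set,
;; so duplicates collapse and the columns come back in alphabetical order.
(defn as-sorted-columns [records]
  (into (sorted-set) (mapcat (comp keys second) records)))

(as-sorted-columns soaps)
;; => #{"corrinationstreet" "eastenders" "thearchers"}
```

Whether alphabetical order is the right order for the columns is a separate question; the point is only that the header and the rows would then agree on it.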
Assuming we want a CSV file to store our content, a function to create the header row becomes:
user=> (str "Date," (apply str (interpose "," (as-columns soaps))))
"Date,corrinationstreet,eastenders,thearchers"
The Data
Now we have a list of all possible keys (in this example, the names of the soaps), we can use this to extract data from the records for each row. Getting data from a map is straightforward – even handling the not-there case is simple enough:
user=> (def some-map {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
#'user/some-map
user=> (get some-map "eastenders" "-")
6.5
user=> (get some-map "hollyoaks" "-")
"-"
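There is one subtlety with the not-found argument that might matter for sparse data: it only applies when the key is absent, not when the key is present with a `nil` value. A small sketch, assuming a hypothetical `ratings` map in which one soap has been recorded with no figure (the data above never actually does this):

```clojure
;; Hypothetical data: "thearchers" is present, but its value is nil.
(def ratings {"eastenders" 6.5, "thearchers" nil})

(get ratings "hollyoaks")         ;; => nil  (missing key, no default given)
(get ratings "hollyoaks" "-")     ;; => "-"  (missing key, default used)
(get ratings "thearchers" "-")    ;; => nil  (key present, so the default is NOT used)
(contains? ratings "thearchers")  ;; => true
```

If records like that could turn up, the placeholder would need handling separately from the `get` default.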
Given a list of known columns, we can use a list comprehension to extract the data in a consistent order:
user=> (for [col #{"eastenders" "thearchers" "corrinationstreet"}]
         (get {"eastenders" 6.8, "thearchers" 1.4} col "-"))
("-" 6.8 1.4)
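For comparison, here is a rough sketch of the same extraction written with `map` and an explicit function instead of `for`. It assumes the columns are held in a vector (`columns` and `tuesday` are names chosen for this example only), so the order of the output is fixed:

```clojure
;; Columns held in a vector, so the iteration order is the order written here.
(def columns ["corrinationstreet" "eastenders" "thearchers"])
(def tuesday {"eastenders" 6.8, "thearchers" 1.4})

;; The list comprehension from above.
(for [col columns] (get tuesday col "-"))
;; => ("-" 6.8 1.4)

;; The same result using map with an anonymous function.
(map (fn [col] (get tuesday col "-")) columns)
;; => ("-" 6.8 1.4)
```

Both forms are lazy and produce the same sequence here, so the choice between them looks like a matter of taste.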
Pulling It All Together
Taking those various strands, we end up with the following solution:
;; Collect every key seen in any record into a set of column names.
(defn as-columns [records]
  (apply (partial conj #{}) (mapcat keys (map second records))))

;; Build the CSV header line.
(defn header-row [records]
  (str "Date," (apply str (interpose "," (as-columns records)))))

;; Look up each column in one record's map, using "-" for missing values.
(defn values-for-record [columns values]
  (for [col columns]
    (get values col "-")))

;; Build one CSV line for a single record.
(defn as-row [record columns]
  (let [day (first record)
        values (second record)]
    (str day "," (apply str (interpose "," (values-for-record columns values))))))

;; Build all the data lines, separated by newlines.
(defn as-data [records]
  (apply str (interpose "\n" (for [record records]
                               (as-row record (as-columns records))))))

;; Header plus data.
(defn as-table [records]
  (apply str (header-row records) "\n" (as-data records)))
Things I like with the solution: it works.
Things I don’t like with the solution:
- Duplicated call to as-columns, in both as-data and header-row
- Still don’t think I’ve got the indentation right
- Still worried I’m creating functions which are too large, or that are not readable
- This solution was arrived at by hacking on the code in a REPL – not TDD. My Clojure skills are still lacking, so I have to embark on the occasional hack-a-thon to learn some things – this being one such exercise. I plan to re-implement this using TDD with what I’ve learnt to see what I end up with. It will be interesting to see if TDDing this allays my fears about function size.
- Lots of (apply str (interpose… duplication going around – I should factor that out
- Not sure if the list comprehension here is needed – have I missed something obvious?
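To make a couple of those complaints concrete, here is a rough sketch of what factoring out the `(apply str (interpose …))` pattern and computing the columns only once might look like. The names `separated-by` and `as-table-once` are invented for this sketch, and sorting the columns is my own assumption to keep the output predictable; treat it as one possible direction rather than the finished answer:

```clojure
;; Sample data in the same shape as the examples in this post.
(def soaps
  '(("Monday"  {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
    ("Tuesday" {"eastenders" 6.8, "thearchers" 1.4})))

;; Hypothetical helper: the repeated (apply str (interpose ...)) in one place.
(defn separated-by [sep items]
  (apply str (interpose sep items)))

;; Hypothetical rewrite of as-table: the columns are computed once, sorted so
;; the order is predictable, then reused for the header and for every row.
(defn as-table-once [records]
  (let [columns (sort (distinct (mapcat (comp keys second) records)))
        header  (separated-by "," (cons "Date" columns))
        rows    (for [record records]
                  (separated-by ","
                                (cons (first record)
                                      (for [col columns]
                                        (get (second record) col "-")))))]
    (separated-by "\n" (cons header rows))))

(println (as-table-once soaps))
;; Date,corrinationstreet,eastenders,thearchers
;; Monday,5.6,6.5,2.3
;; Tuesday,-,6.8,1.4
```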
3 Responses to “Creating Sparse Tabular Data With Clojure”
Some tips:
* Check out clojure.string/join (1.2) or clojure.contrib.string/join (1.1)
* A handy idiom: (map {:a 1 :b 2} [:a :b]) => (1 2)
* Use destructuring to grab the first and second elements from a sequence
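As a small illustration of the destructuring tip, assuming a single record in the same shape as the post's data (`record` and `day-of` are names made up for this sketch):

```clojure
(def record '("Monday" {"eastenders" 6.5, "thearchers" 2.3}))

;; Positional destructuring in a let binding names the two elements directly,
;; replacing separate calls to first and second.
(let [[day stats] record]
  (str day ": " (get stats "eastenders" "-")))
;; => "Monday: 6.5"

;; The same pattern works in a function's parameter vector.
(defn day-of [[day _stats]] day)

(day-of record)
;; => "Monday"
```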
Here’s a quick rewrite:
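(use '[clojure.string :only [join]])
(defn to-csv [daily-stats]
  ;; All distinct keys across every record, in first-seen order.
  (let [cols (distinct (mapcat (comp keys second) daily-stats))]
    (str
      (join "," (cons "Day" cols)) "\n"
      (join "\n"
            (for [[day stats] daily-stats]
              ;; stats is a map, so calling it looks up the column; "-" is the default.
              (join "," (cons day (map #(stats % "-") cols))))))))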
Spotted join about an hour after I wrote that – it’ll help a lot. And that map idiom is exactly what I was looking for. Many thanks Justin!
I still think I’ll do a rewrite from scratch, but I now have a couple of new tools in my armory.