In the spirit of making my mistakes in public – something which I have a long history of on this blog – I thought I’d post up a solution I came up with for a relatively simple problem. I’m not unhappy with the solution – it works – but I can’t help thinking I’m missing something and there is a more elegant solution out there.
The Problem
For our input we have a set of time series records. Each record contains a list of name-value pairs. The keys are not fixed – they may vary from record to record. Let’s imagine that we’re recording viewing figures for the major UK soap operas – some soaps are on every day, some are only on a few days a week.
In Clojure, we’ve got our data in the following form:
(("Monday" {:eastenders 6.5, :thearchers 2.3, :corrinationstreet 5.6})
("Tuesday" {:eastenders 6.8, :thearchers 1.4})
...)
We want to convert this into a single table of data, with the keys from the source data as the columns and each row representing a different timestamp, so that we can visualise the data with something like gnuplot, Incanter or just plain old Excel. So we want to get to something like this (yes, I know The Archers is on at the weekend too, but this is just an example):
| Day | Eastenders | The Archers | Coronation Street |
|---|---|---|---|
| Monday | 6.5 | 2.3 | 5.6 |
| Tuesday | 6.8 | 1.4 | |
| Wednesday | 2.3 | 7 | |
| Thursday | 6.7 | 2.8 | |
| Friday | 9.8 | 2.1 | 7 |
The challenge here (such as it is) is that we want a sparse table, and that our code cannot know beforehand the total universe of soap names (what if a new soap launched?).
The Header Row
The first part of this problem as I saw it was to determine which soaps our records represented to create a header row. The solution I came up with was to stick the keys for all records into a set:
user=> (def soaps
         '(("Monday" {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
           ("Tuesday" {"eastenders" 6.8, "thearchers" 1.4})))
#'user/soaps
user=> (defn as-columns [records]
         (apply (partial conj #{}) (mapcat keys (map second records))))
#'user/as-columns
user=> (as-columns soaps)
#{"corrinationstreet" "eastenders" "thearchers"}
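One caveat with a plain hash set is that its iteration order is an implementation detail, so the column order in the output isn't something to rely on. Here is a minimal sketch of an alternative, assuming alphabetical column order is acceptable and that all keys are of one comparable type (strings here). The names `as-sorted-columns` and `sample-records` are made up for this example.

```clojure
;; Hypothetical variant of as-columns: collects every key into a sorted set,
;; so the columns always come back in alphabetical order.
(defn as-sorted-columns
  [records]
  (into (sorted-set) (mapcat (comp keys second) records)))

;; Same shape as the soaps data above.
(def sample-records
  '(("Monday" {"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
    ("Tuesday" {"eastenders" 6.8, "thearchers" 1.4})))

(as-sorted-columns sample-records)
;; => #{"corrinationstreet" "eastenders" "thearchers"}
```

A sorted set throws if its elements can't be compared with each other, so this only suits data where the keys are all strings or all keywords.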
Assuming we want a CSV file to store our content, a function to create the header row becomes:
user=> (str "Date," (apply str (interpose "," (as-columns soaps))))
"Date,corrinationstreet,eastenders,thearchers"
The Data
Now we have a list of all possible keys (in this example, the names of the soaps), we can use this to extract data from the records for each row. Getting data from a map is straightforward – even handling the not-there case is simple enough:
user=> (def some-map
{"eastenders" 6.5, "thearchers" 2.3, "corrinationstreet" 5.6})
#'user/some-map
user=> (get some-map "eastenders" "-")
6.5
user=> (get some-map "hollyoaks" "-")
"-"
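One edge case worth knowing about: the default passed to `get` is only used when the key is absent, not when the key is present with a `nil` value. The sketch below illustrates the difference; the `ratings` map and the `value-or-dash` helper are made up for this example, and whether `nil` should count as "no data" depends on where the records come from.

```clojure
;; A key that is present but nil, alongside a key that is missing altogether.
(def ratings {"eastenders" 6.5, "thearchers" nil})

(get ratings "eastenders" "-")   ; => 6.5
(get ratings "thearchers" "-")   ; => nil  (key exists, so the default is ignored)
(get ratings "hollyoaks" "-")    ; => "-"  (key is missing)

;; Hypothetical helper that treats a nil value the same as a missing key.
(defn value-or-dash
  [m k]
  (let [v (get m k)]
    (if (nil? v) "-" v)))

(value-or-dash ratings "thearchers")   ; => "-"
```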
Given a list of known columns, we can use a list comprehension to extract the data in a consistent order:
user=> (for [col #{"eastenders" "thearchers" "corrinationstreet"}]
         (get {"eastenders" 6.8, "thearchers" 1.4} col "-"))
("-" 6.8 1.4)
Pulling It All Together
Taking those various strands, we end up with the following solution:
(defn as-columns
  [records]
  (apply (partial conj #{}) (mapcat keys (map second records))))

(defn header-row
  [records]
  (str "Date," (apply str (interpose "," (as-columns records)))))

(defn values-for-record
  [columns values]
  (for [col columns] (get values col "-")))

(defn as-row
  [record columns]
  (let [day (first record)
        values (second record)]
    (str day ","
         (apply str (interpose "," (values-for-record columns values))))))

(defn as-data
  [records]
  (apply str
         (interpose "\n"
                    (for [record records]
                      (as-row record (as-columns records))))))

(defn as-table
  [records]
  (apply str (header-row records) "\n" (as-data records)))
Things I like with the solution: it works.
Things I don’t like with the solution:
- Duplicated call to as-columns, in both as-data and header-row
- Still don’t think I’ve got the indentation right
- Still worried I’m creating functions which are too large, or that are not readable
- This solution was arrived at by hacking on the code in a REPL – not TDD. My Clojure skills are still lacking, so I have to embark on the occasional hack-a-thon to learn some things – this being one such exercise. I plan to re-implement this using TDD with what I’ve learnt to see what I end up with. It will be interesting to see if TDDing this allays my fears about function size.
- Lots of (apply str (interpose… duplication going around – I should factor that out
- Not sure if the list comprehension here is needed – have I missed something obvious?
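Two of these gripes, the duplicated as-columns call and the repeated (apply str (interpose ...)) pattern, can be tackled together. What follows is only a sketch of one possible shape, not a finished rewrite: `join-with` is a made-up helper name, and `as-row` and `as-table` are reworked here to take the column set as an argument so that it is computed once.

```clojure
;; Hypothetical helper: the (apply str (interpose ...)) pattern, named once.
(defn join-with
  [separator items]
  (apply str (interpose separator items)))

(defn as-columns
  [records]
  (apply (partial conj #{}) (mapcat keys (map second records))))

(defn as-row
  [columns record]
  (let [day (first record)
        values (second record)]
    (join-with "," (cons day (for [col columns] (get values col "-"))))))

(defn as-table
  [records]
  ;; Columns are computed once and shared by the header and every data row.
  (let [columns (as-columns records)]
    (join-with "\n"
               (cons (join-with "," (cons "Date" columns))
                     (map (partial as-row columns) records)))))
```

Calling `(as-table soaps)` should give the header line followed by one line per record, with the columns in whatever order the set happens to iterate.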
3 Responses to “Creating Sparse Tabular Data With Clojure”
Some tips:
* Check out clojure.string/join (1.2) or clojure.contrib.string/join (1.1)
* A handy idiom: (map {:a 1 :b 2} [:a :b]) => (1 2)
* Use destructuring to grab the first and second elements from a sequence
Here’s a quick rewrite:
(use '[clojure.string :only [join]])

(defn to-csv [daily-stats]
  (let [cols (distinct (mapcat (comp keys second) daily-stats))]
    (str
      (join "," (cons "Day" cols)) "\n"
      (join "\n"
            (for [[day stats] daily-stats]
              (join "," (cons day (map #(stats % "-") cols))))))))
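For anyone following along at a REPL, the map idiom and the destructuring tip can be tried in isolation. This is just an illustrative sketch: the `stats` map below reuses the Tuesday figures from the post, and the column vector is hard-coded for the example.

```clojure
;; A map is a function of its keys, with an optional not-found value.
(def stats {:eastenders 6.8, :thearchers 1.4})

(stats :eastenders "-")          ; => 6.8
(stats :corrinationstreet "-")   ; => "-"

;; Mapping that lookup over a column list pulls values out in column order.
(map #(stats % "-") [:eastenders :thearchers :corrinationstreet])
;; => (6.8 1.4 "-")

;; Destructuring names the two elements of a record without first/second.
(let [[day day-stats] ["Tuesday" stats]]
  [day (count day-stats)])
;; => ["Tuesday" 2]
```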
Spotted join about an hour after I wrote that – it’ll help a lot. And that map idiom is exactly what I was looking for. Many thanks Justin!
I still think I’ll do a rewrite from scratch, but I now have a couple of new tools in my armory.
[…] in a re-occurring series of posts showing my limited understanding of Clojure, today we’re using Clojure for log processing. This example is culled from some work […]