Following up my “previous post(A brief history of ed, sed and Regular Expressions)”:http://www.magpiebrain.com/archives/000199.html on @sed@, I thought I’d post a little tutorial on using @sed@ and @grep@ to clean up log files. To follow the examples you’ll need both @sed@ and @grep@ for your platform.
You have a log file. A _big_ log file. You need some specific information to chase up a problem. We are assuming the log file is a standard log4j file, looking something like this:
2004-03-12 12:33:29,077 INFO [com.company.ExampleClass] Did something
2004-03-12 12:33:29,077 WARN [com.company.ExampleClass] Something might be wrong
2004-03-12 12:33:29,077 INFO [com.company.AnotherClass] Did something else
Filtering with grep
Let’s say you only want to display log messages from a single class, @com.company.ExampleClass@. Just run the following command:
$ grep "\[com.company.ExampleClass\]" logfile
This returns all lines in @logfile@ which contain the text “[com.company.ExampleClass]”. If you wanted to show everything _except_ @com.company.ExampleClass@, add the @-v@ option, which inverts the match:
$ grep -v "\[com.company.ExampleClass\]" logfile
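Bear in mind that @grep@ treats its pattern as a regular expression, so characters such as ‘.’ and ‘[’ have special meanings. If you want a purely literal match, the @-F@ option (fixed strings) treats the whole pattern as plain text. Here is a minimal sketch, assuming a throwaway file called @sample.log@ whose name and contents I have made up for illustration:

```shell
# Build a small sample log in a temporary directory (illustrative data only).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.log" <<'EOF'
2004-03-12 12:33:29,077 INFO [com.company.ExampleClass] Did something
2004-03-12 12:33:29,077 WARN [com.company.ExampleClass] Something might be wrong
2004-03-12 12:33:29,077 INFO [com.company.AnotherClass] Did something else
EOF

# -F treats the pattern as literal text, so the brackets need no escaping.
only_example=$(grep -F "[com.company.ExampleClass]" "$tmpdir/sample.log")

# -v inverts the match: every line except those from ExampleClass.
without_example=$(grep -F -v "[com.company.ExampleClass]" "$tmpdir/sample.log")

echo "$only_example"
echo "---"
echo "$without_example"
```

The trade-off is that @-F@ switches off wildcards as well, so it only suits the exact-text cases.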
So far we have been matching exact text. Imagine if we wanted to show all log messages from the @com.company@ package:
$ grep "\[com.company.*\]" logfile
Now we could chain a few of these together. Imagine that we want to display all messages from the package @com.company@, except those produced by @com.company.ExampleClass@:
$ grep "\[com.company.*\]" logfile | grep -v "\[com.company.ExampleClass\]"
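To see the chain in action, here is a small self-contained sketch. The sample file and the extra @org.other.Thing@ line are invented for illustration. I have also escaped the dots, which is stricter than the commands above, so that ‘.’ only matches a literal full stop.

```shell
# Sample log with one line from outside the com.company package (made-up data).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.log" <<'EOF'
2004-03-12 12:33:29,077 INFO [com.company.ExampleClass] Did something
2004-03-12 12:33:29,077 WARN [com.company.ExampleClass] Something might be wrong
2004-03-12 12:33:29,077 INFO [com.company.AnotherClass] Did something else
2004-03-12 12:33:30,101 INFO [org.other.Thing] Unrelated message
EOF

# The first grep keeps the com.company package; the second drops ExampleClass.
filtered=$(grep "\[com\.company\..*\]" "$tmpdir/sample.log" \
  | grep -v "\[com\.company\.ExampleClass\]")

echo "$filtered"
```

Only the @com.company.AnotherClass@ line should survive both filters.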
A very brief sed primer
We are going to be using @sed@’s global search and replace. These commands come in the form @s/regular expression/replacement/g@.
A regular expression can be a normal piece of text (like ‘bob’) or can contain special characters. The ‘.’ character matches any single character, so the expression @.bob@ would match @bbob@, @abob@, @vbob@ etc. The ‘*’ character states that the preceding term can appear zero or more times, so the expression @.*bob@ would match @bob@, @abob@, @aabob@, @aaabob@…
If you wanted to match the ‘.’ or ‘*’ characters themselves, then you need to escape them using a ‘\’. To match the text @I said hello. Then he said *@, you would write the expression @I said hello\. Then he said \*@.
There are lots of things you can do with regular expressions, but that will do for now.
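If you want to try these rules out before pointing them at a real log, you can pipe short strings through @sed@’s @s/regular expression/replacement/g@ command. This is just a quick sketch; the replacement word @MATCH@ is an arbitrary marker of my own choosing.

```shell
# '.' matches any single character, so '.bob' matches 'abob'.
r1=$(echo "abob" | sed 's/.bob/MATCH/g')

# '*' allows zero or more of the preceding term, so '.*bob' also matches plain 'bob'.
r2=$(echo "bob" | sed 's/.*bob/MATCH/g')

# Escaped '\.' and '\*' match only the literal characters.
r3=$(echo "I said hello. Then he said *" | sed 's/hello\. Then he said \*/MATCH/g')

# An escaped dot is no longer a wildcard, so 'helloX' is left untouched.
r4=$(echo "helloX" | sed 's/hello\./MATCH/g')

echo "$r1"
echo "$r2"
echo "$r3"
echo "$r4"
```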
Pruning with sed
@grep@ lets you easily filter whole lines out, but what if the individual lines themselves are too verbose? Here, @sed@ comes to the rescue. For a start, let’s assume we no longer need to know the class the message came from. We can remove this from the lines using the search and replace command:
$ sed 's/\[.*\]//g' logfile
Here, our search term @\[.*\]@ starts by matching the ‘[’ character, which we have to escape because @sed@ uses it for other things (it’s a special meta-character like ‘.’ or ==’*’== ). Next, we match any run of characters using the ‘.’ and ==’*’== meta-characters. Finally, we look for a closing ‘]’. This has the effect of matching all the class identifiers. Next, we define the replacement; since we want to remove the matched text, we leave it empty.
We could expand this command to remove all the timing information and the logging level:
$ sed 's/.*\[.*\]//g' logfile
Here we have added an extra @.*@ term to our expression. We are now matching any text before the square brackets, and the square brackets themselves.
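Here is a rough sketch of both commands applied to a single line, so you can see exactly what gets cut. The second sample line, with brackets inside the message, is my own invention to show one thing to watch for: @.*@ is greedy and will match as much as it can. The @[^]]*@ variant is one possible way around that, not the only one.

```shell
line='2004-03-12 12:33:29,077 INFO [com.company.ExampleClass] Did something'

# Remove only the bracketed class name (note the two spaces left behind).
no_class=$(echo "$line" | sed 's/\[.*\]//g')

# Remove everything up to and including the closing bracket.
message_only=$(echo "$line" | sed 's/.*\[.*\]//g')

# Hypothetical line whose message also contains square brackets.
tricky='2004-03-12 12:33:29,077 INFO [com.company.ExampleClass] Loaded items [1, 2]'

# Greedy '.*' runs from the first '[' to the LAST ']', eating the message too.
greedy=$(echo "$tricky" | sed 's/\[.*\]//g')

# '[^]]*' matches anything except ']', so the match stops at the first ']'.
# Dropping the 'g' flag limits the edit to the first bracketed group.
careful=$(echo "$tricky" | sed 's/\[[^]]*\]//')

echo "$no_class"
echo "$message_only"
echo "$greedy"
echo "$careful"
```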
Pulling it all together
Imagine we want to show just the messages from all classes in @com.company@ except those in @com.company.ExampleClass@, and we want to remove the timing information and the source class name. We simply combine @grep@ and @sed@:
$ grep "\[com.company.*\]" logfile | grep -v "\[com.company.ExampleClass\]" | sed 's/.*\[.*\]//g'
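As a final sanity check, here is the whole pipeline run against a made-up sample file. The second @-e@ expression, which trims the single space left in front of the message, is an extra of mine rather than part of the command above.

```shell
# Sample log (illustrative data; the org.other.Thing line is invented).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.log" <<'EOF'
2004-03-12 12:33:29,077 INFO [com.company.ExampleClass] Did something
2004-03-12 12:33:29,077 WARN [com.company.ExampleClass] Something might be wrong
2004-03-12 12:33:29,077 INFO [com.company.AnotherClass] Did something else
2004-03-12 12:33:30,101 INFO [org.other.Thing] Unrelated message
EOF

# Keep the package, drop one class, then strip the timestamp, level and class name.
messages=$(grep "\[com.company.*\]" "$tmpdir/sample.log" \
  | grep -v "\[com.company.ExampleClass\]" \
  | sed -e 's/.*\[.*\]//g' -e 's/^ //')

echo "$messages"
```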
Next time I’ll look at some more complex examples to show you how to create CSV files from your logs.