Following up my "previous post(A brief history of ed, sed and Regular Expressions)":http://www.magpiebrain.com/archives/000199.html on @sed@, I thought I'd post a little tutorial on using @sed@ and @grep@ to clean up logfiles. To follow the examples you'll need both @sed@ and @grep@ for your platform.
The situation
You have a log file. A _big_ log file. You need some specific information to chase up a problem. We are assuming the log file is a standard log4j file, looking something like this:
2004-03-12 12:33:29,077 INFO [com.company.ExampleClass] Did something
2004-03-12 12:33:29,077 WARN [com.company.ExampleClass] Something might be wrong
2004-03-12 12:33:29,077 INFO [com.company.AnotherClass] Did something else
Filtering with grep
Let's say you only want to display log messages from a single class, @com.company.ExampleClass@. Just run the following command:
$ grep "\[com.company.ExampleClass\]" logfile
This returns all lines in @logfile@ which contain the text “[com.company.ExampleClass]”. The square brackets are escaped with backslashes because @grep@ otherwise treats them as special characters. If you wanted to show everything _except_ @com.company.ExampleClass@, do the following:
$ grep -v "\[com.company.ExampleClass\]" logfile
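As a minimal sketch of the two commands above, the following builds a three-line sample log in a temporary file (the contents are illustrative, modelled on the log4j format shown earlier) and filters it both ways. The variable names are my own, and the last command is only there to show what happens when the brackets are left unescaped.

```shell
# Build a small sample log (illustrative contents only).
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
2004-03-12 12:33:29,077 INFO [com.company.ExampleClass] Did something
2004-03-12 12:33:29,077 WARN [com.company.ExampleClass] Something might be wrong
2004-03-12 12:33:29,077 INFO [com.company.AnotherClass] Did something else
EOF

# Lines logged by ExampleClass; \[ and \] match literal square brackets.
only_example=$(grep "\[com.company.ExampleClass\]" "$logfile")

# Everything except ExampleClass.
without_example=$(grep -v "\[com.company.ExampleClass\]" "$logfile")

# Unescaped, [...] is read as a set of single characters, so the
# pattern matches any line containing one of those characters.
unescaped_count=$(grep -c "[com.company.ExampleClass]" "$logfile")

printf '%s\n' "$only_example"
printf '%s\n' "$without_example"
echo "Lines matched without escaping: $unescaped_count"

rm -f "$logfile"
```

On this sample the unescaped pattern matches all three lines, which is why the escaping matters.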
So far we have been matching exact text. Imagine if we wanted to show all log messages from the @com.company@ package:
$ grep "\[com.company.*\]" logfile
Now we could chain a few of these together. Imagine that we want to display all messages from the package @com.company@, except those produced by @com.company.ExampleClass@:
$ grep "\[com.company.*\]" logfile | grep -v "\[com.company.ExampleClass\]"
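To see the chained filter at work, here is a small sketch run against a sample log. The fourth line, from an invented @org.other@ package, is an assumption added purely so that the first @grep@ has something to discard.

```shell
# Sample log with one line from an unrelated (invented) package.
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
2004-03-12 12:33:29,077 INFO [com.company.ExampleClass] Did something
2004-03-12 12:33:29,077 WARN [com.company.ExampleClass] Something might be wrong
2004-03-12 12:33:29,077 INFO [com.company.AnotherClass] Did something else
2004-03-12 12:33:30,102 INFO [org.other.Library] Loaded config
EOF

# First grep keeps the com.company package,
# second grep drops ExampleClass from what is left.
chained=$(grep "\[com.company.*\]" "$logfile" \
  | grep -v "\[com.company.ExampleClass\]")

printf '%s\n' "$chained"

rm -f "$logfile"
```

Only the @com.company.AnotherClass@ line survives both filters.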
A very brief sed primer
We are going to be using @sed@’s global search and replace. These commands come in the form:
s/Regular Expression/Replacement/g
A regular expression can be a normal piece of text (like ‘bob’) or can contain special characters. The ‘.’ character matches any single character, so the expression @.bob@ would match @bbob@, @abob@, @vbob@ etc. The ‘*’ character states that the preceding term can appear zero or more times, so the expression @.*bob@ would match @abob@, @aabob@, @aaabob@ and also plain @bob@.
If you wanted to match the ‘.’ or ‘*’ characters themselves, then you need to escape them using a backslash (‘\’). To match the text @I said hello. Then he said *@, you would write the expression @I said hello\. Then he said \*@.
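The rules above can be tried out directly by piping short strings through @sed@. This is only a sketch: the input strings and the replacement text @X@ are made up for illustration, and each result is kept in a variable so it can be inspected.

```shell
# '.' matches any single character, so each of the three words is replaced.
dot_result=$(printf '%s\n' 'bbob abob vbob' | sed 's/.bob/X/g')

# '.*' matches any run of characters, so the whole string is replaced.
star_result=$(printf '%s\n' 'aaabob' | sed 's/.*bob/X/g')

# Unescaped, '.' in a.b also matches the 'x' in axb ...
loose_result=$(printf '%s\n' 'a.b axb' | sed 's/a.b/X/g')

# ... escaped, it only matches a literal full stop.
strict_result=$(printf '%s\n' 'a.b axb' | sed 's/a\.b/X/g')

# Escaping both '.' and '*' to match the sentence literally.
literal_result=$(printf '%s\n' 'I said hello. Then he said *' \
  | sed 's/I said hello\. Then he said \*/X/g')

printf '%s\n' "$dot_result" "$star_result" "$loose_result" \
  "$strict_result" "$literal_result"
```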
There are lots of things you can do with regular expressions, but that will do for now.
Pruning with sed
@grep@ lets you easily filter whole lines out, but what if the individual lines themselves are too verbose? Here, @sed@ comes to the rescue. For a start, let's assume we no longer need to know the class the message came from. We can remove this from the lines using the search and replace command:
$ sed 's/\[.*\]//g' logfile
Here, our search term @\[.*\]@ starts by matching the ‘[’ character, which we have to escape because @sed@ uses it for other things (it's a special meta-character like ‘.’ or ==’*’== ). Next, we match any run of characters using the ‘.’ and ==’*’== meta-characters. Finally, we look for a closing ‘]’. This has the effect of matching all the class identifiers. The replacement is left empty, because we want to remove the matched text rather than substitute anything for it.
We could expand this command to remove all the timing information and the logging level:
$ sed 's/.*\[.*\]//g' logfile
Here we have added an extra @.*@ term to the start of our expression. We are now matching any text before the square brackets, as well as the square brackets themselves and everything inside them.
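Here is a rough sketch of both pruning commands applied to a sample log, so the effect on each line is visible. The third command is a variation of my own, not part of the original recipe: it also consumes the space after the closing bracket so the message starts in the first column.

```shell
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
2004-03-12 12:33:29,077 INFO [com.company.ExampleClass] Did something
2004-03-12 12:33:29,077 WARN [com.company.ExampleClass] Something might be wrong
2004-03-12 12:33:29,077 INFO [com.company.AnotherClass] Did something else
EOF

# Remove only the bracketed class name (the spaces either side remain).
no_class=$(sed 's/\[.*\]//g' "$logfile")

# Remove the timestamp, level and class name, leaving " message".
message_only=$(sed 's/.*\[.*\]//g' "$logfile")

# Variation (an assumption, not from the article): also drop the
# single space that follows the closing bracket.
trimmed=$(sed 's/.*\[.*\] //' "$logfile")

printf '%s\n' "$no_class" "$message_only" "$trimmed"

rm -f "$logfile"
```

One thing to bear in mind is that @.*@ is greedy, so if a message itself contains a ‘]’ the match will run through to the last one on the line.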
Pulling it all together
Imagine we want to show just the messages from all classes in @com.company@ except those from @com.company.ExampleClass@, and we want to remove the timing information and the source class name. We simply combine @grep@ and @sed@:
$ grep "\[com.company.*\]" logfile | grep -v "\[com.company.ExampleClass\]" | sed 's/.*\[.*\]//g'
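If you run this pipeline often, one option is to wrap it in a small shell function. The sketch below assumes a helper called @prune_log@ (a name invented here) that takes the log file as its only argument. The package and class names are hard-coded to match the example, and the @org.other@ line in the sample data is made up.

```shell
# prune_log FILE: hypothetical helper wrapping the grep/sed pipeline.
prune_log() {
  grep "\[com.company.*\]" "$1" \
    | grep -v "\[com.company.ExampleClass\]" \
    | sed 's/.*\[.*\]//g'
}

logfile=$(mktemp)
cat > "$logfile" <<'EOF'
2004-03-12 12:33:29,077 INFO [com.company.ExampleClass] Did something
2004-03-12 12:33:29,077 WARN [com.company.ExampleClass] Something might be wrong
2004-03-12 12:33:29,077 INFO [com.company.AnotherClass] Did something else
2004-03-12 12:33:30,102 INFO [org.other.Library] Loaded config
EOF

pruned=$(prune_log "$logfile")
printf '%s\n' "$pruned"

rm -f "$logfile"
```

For the sample data this leaves a single line, @ Did something else@, with the leading space that followed the closing bracket.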
Next time I’ll look at a more complex example to show you how to create CSV files from your logs.
4 Responses to “Pruning logfiles with Sed and grep”
Just want to say thanks for this post and the previous 2. They have inspired me to try using sed to modify the qif file I download from my bank to add categories before I import into Quicken.
No trouble at all Mike! I’m going to post at least one more on sed, so I hope it’s of interest…
Hello there. I found your Sed primer while searching Google. I have a very large log file that I am attempting to break down into smaller, individual files. It is an IRC log file for a channel from mid-2000 to mid-2001. I would like to break it down into individual daily log files. Any recommendation on how to do this? Manually cut’n’paste would take aeons. I have a rudimentary understanding of Sed since I’ve been teaching myself Unix recently. Any pointers would be great!
Each day of the large log file starts with a string like so: “Session Start: Tue Mar 21 03:54:00 2000”. It is timestamped and in MIRC format. Thank you very much in advance!
As always with these things, I suggest you start small and build up your solution. First off, construct a regexp capable of matching the individual years – matching the string @2000@ or @2001@ would be enough. Next, write a regexp to match a month – so @mar@, @jun@ etc and run this on the individual year files. So at this stage you should have files like mar-2000, apr-2000 etc. Finally, run a regexp to match the specific date to give you 21-mar-2000, 22-mar-2000 etc. Then you can start using some sed search/replace to extract the header lines.