Tuesday, September 17, 2013

Raspberry Pi Journal #23


Sorting Log Frequency List


Often, when analyzing log files, you want to see how often something happens. Maybe you want to know how often a particular user logs on in a day, or how many times a web page is accessed. If you want a daily report, you want to automate the process. There is an easy way to do it using Perl. But since I'm still learning the shell, I want to know if there's a way to do it in the shell. It doesn't have to be the most efficient way; as long as it's relatively quick, it should be all right.

The first thing you want is a way to get only the relevant piece of data. I want to do this in two steps:
1. Identify the wanted lines.
2. Put the identifying word from each of those lines on its own line in a file.

Let's say I want to parse /var/log/messages.1 for various processes. I would do this:

egrep -o 'raspberrypi [[:alnum:]]+' messages.1 | egrep -o '[[:alnum:]]+$' | sort > messlist.txt

Replace "raspberrypi" with your own hostname. The "-o" option says that only the matching text, instead of the whole line, will be piped out. I do it twice: first to grab the tell-tale hostname token together with the word after it, then to keep only the process name. I also sort the result and write it out to messlist.txt.
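To make the two passes concrete, here is a small sketch run against a single made-up log line instead of the real file. The sample text is an assumption for illustration only, and I use "grep -E", which behaves the same as egrep.

```shell
# A made-up syslog-style line; only the hostname "raspberrypi" comes from the article.
sample='Sep 15 06:25:01 raspberrypi kernel: [ 1234.5] example message'

# Pass 1: keep only the hostname and the word that follows it.
step1=$(printf '%s\n' "$sample" | grep -Eo 'raspberrypi [[:alnum:]]+')

# Pass 2: keep only the last word, which is the process name.
step2=$(printf '%s\n' "$step1" | grep -Eo '[[:alnum:]]+$')

echo "$step1"   # raspberrypi kernel
echo "$step2"   # kernel
```

Because [[:alnum:]] matches only letters and digits, a name such as wpa_supplicant is cut at the underscore and comes out as "wpa". That may be why short names like "wpa" show up in the counts.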

We now have a list of every occurrence. How do we get the unique names in the list? Simplicity itself. I sort it again, with the "-u" option.

sort -u messlist.txt > messuniq.txt

That's all there is to it!

Now that we have two files, one containing the unique ids and the other containing every instance of the ids, we can count them with no problem.

# -c counts matching lines; -x matches whole lines only, so "mtp" cannot match a longer name.
for NAM in $(cat messuniq.txt)
do
  echo "$NAM $(egrep -cx "$NAM" messlist.txt)"
done

If you are familiar with computer programming at all, you will probably recognize that this algorithm is not at all efficient. For each entry in messuniq.txt, it scans every entry in messlist.txt and counts the ones that match. It's pretty easy to do. It's as easy as a bubble sort, and just as inefficient.

But it works, and works well. So, there you go. If you want to do frequency counting, there's no need to bring out some large, bloated scripting engine. Just do it in the shell!
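For comparison, here is a sketch of a single-pass way to get the same kind of counts, using the standard "uniq -c" on sorted input. This is not the method used in this post; the function name count_sorted and the sample names are made up for illustration.

```shell
# count_sorted (hypothetical helper): read sorted names on stdin,
# print "name count" for each unique name.
count_sorted() {
  # uniq -c prints "<count> <name>"; awk swaps the two columns.
  uniq -c | awk '{ print $2, $1 }'
}

# Made-up sample standing in for the contents of messlist.txt.
printf '%s\n' kernel kernel kernel motion motion wpa | count_sorted
```

With the real data, this would be run as "count_sorted < messlist.txt". Since messlist.txt is already sorted, uniq only needs to read it once.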

By the way, I was confused by the instructions for using "for" loops. I kept typing the square brackets, and kept getting an invalid token error. It was only after I stepped back and considered all the possibilities that I guessed the square brackets are there only to indicate optional entries. So I eliminated all the square brackets, and it worked!
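Here is a short sketch of what that means in practice. As far as I can tell, the bash help text gives the synopsis roughly as "for NAME [in WORDS ... ] ; do COMMANDS; done", where the brackets only mark the "in WORDS" part as optional. The names below (result, show_args) are made up for the example.

```shell
# The brackets in the synopsis are notation; they are never typed.
result=""
for NAM in kernel motion wpa
do
  result="$result<$NAM>"
done
echo "$result"   # <kernel><motion><wpa>

# With the optional "in WORDS" part left out, the loop walks the
# positional parameters (here, the arguments passed to the function).
show_args() {
  for ARG
  do
    echo "arg: $ARG"
  done
}
show_args alpha beta
```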

Here is the script in its entirety:

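#!/bin/bash

# Extract the process name that follows the hostname, one per line, sorted.
egrep -o 'raspberrypi [[:alnum:]]+' messages.1 | egrep -o '[[:alnum:]]+$' | sort > messlist.txt
# Reduce the list to one line per unique name.
sort -u messlist.txt > messuniq.txt

# For each unique name, count the lines that match it exactly.
for NAM in $(cat messuniq.txt)
do
  echo "$NAM $(egrep -cx "$NAM" messlist.txt)"
done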



And here is the output:

kernel 6379
motion 1545
mtp 34
rsyslogd 19
shutdown 14
wpa 366
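
If you would rather see the busiest processes first, one more sort should do it. This is a sketch that assumes the "name count" layout shown above; the function name by_count is made up.

```shell
# by_count (hypothetical helper): sort "name count" lines by the
# second field, numerically, largest first.
by_count() {
  sort -k2,2nr
}

# The counts from the output above, fed in directly.
printf '%s\n' 'kernel 6379' 'motion 1545' 'mtp 34' \
  'rsyslogd 19' 'shutdown 14' 'wpa 366' | by_count
```

In the script itself, this would mean piping the loop's output through the same sort, for example by writing "done | sort -k2,2nr".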
