Monday, March 22, 2010

Text Processing and Data Extraction using sed and awk

A few weeks ago, a friend asked me for help. He needed to extract some data from a text file. I am familiar with this kind of work, since I did it quite often in my old job. As usual, the data was just a colon (:) separated file. He needed to extract every field that starts with or contains a particular text, and that text can appear in a different column on each row.

So, I played around using awk, and got something like the following:
awk -F\: '/CSP/ { for(i=1;i<=NF;i++) { if($i ~ "CSP") print $i } }' raw_text_file.txt
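To see what that one-liner does, here is a small sketch run against made-up data. The sample rows and the variable names (sample, awk_contains, awk_starts) are my own assumptions for illustration, not the friend's real file. The second command is a variant that keeps only fields that start with the text, by anchoring the match with ^.

```shell
# Hypothetical sample data; the layout of the real file is an assumption.
sample=$(mktemp)
cat > "$sample" <<'EOF'
id1:CSP,100:foo:bar
id2:foo:bar:baz
id3:foo:XCSP,7:CSP,42
EOF

# Split each line on ':' and print every field that contains "CSP".
awk_contains=$(awk -F: '/CSP/ { for (i = 1; i <= NF; i++) if ($i ~ /CSP/) print $i }' "$sample")

# Variant: print only the fields that start with "CSP".
awk_starts=$(awk -F: '{ for (i = 1; i <= NF; i++) if ($i ~ /^CSP/) print $i }' "$sample")

printf 'contains:\n%s\n' "$awk_contains"
printf 'starts with:\n%s\n' "$awk_starts"
rm -f "$sample"
```

The leading /CSP/ pattern only skips lines that have no match at all, so the field loop does not run on them. Dropping it, as in the second command, gives the same fields and just does a little more work.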


Not long before he asked, I had been playing around a lot with regex grouping in Python, so I just needed to translate my Python regex knowledge to sed. Not much work, of course. So, I came up with the following method, using sed:
sed -r 's/.*(CSP,[0-9]+[:;]).*/\1/g' raw_text_file.txt


Note: "CSP" is the text that needed to be extracted from the text file, along with the number(s) that follow it. In the pattern, it is followed by a comma, one or more digits, and a closing colon or semicolon.
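Two things about the sed approach are worth trying on sample data: lines without a match are printed unchanged, and the greedy leading .* keeps only the last match on a line. The sketch below uses made-up rows and my own variable names, and it uses -E in place of -r on the assumption that your sed accepts it (GNU and BSD sed both do). The grep -o line is a swapped-in alternative, not the method above, for when every match on a line is wanted.

```shell
# Hypothetical sample data; the layout of the real file is an assumption.
sample=$(mktemp)
cat > "$sample" <<'EOF'
id1:CSP,100:foo
id2:foo:bar
id3:CSP,7;x:CSP,42:end
EOF

# Same substitution as above: non-matching lines pass through unchanged.
sed_plain=$(sed -E 's/.*(CSP,[0-9]+[:;]).*/\1/' "$sample")

# -n plus the p flag prints only the lines where the substitution happened.
sed_only=$(sed -En 's/.*(CSP,[0-9]+[:;]).*/\1/p' "$sample")

# Alternative tool: grep -o prints every match, one per output line.
grep_all=$(grep -oE 'CSP,[0-9]+[:;]' "$sample")

printf 'sed:\n%s\n' "$sed_plain"
printf 'sed -n:\n%s\n' "$sed_only"
printf 'grep -o:\n%s\n' "$grep_all"
rm -f "$sample"
```

Note that the trailing colon or semicolon is part of the captured group, so it appears in the output. Moving it outside the parentheses would drop it.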

A lot can be done with regex.
Thanks to Dr. Bunker, who taught me so much about regex.

1 comment:

  1. Wow... really cool.

    So many new terms here... cool....
