Tuesday, February 24, 2009

Using sort and uniq to find duplicate lines in a file

I love my Mac, mostly because I can run all those great Unix utilities that make tasks easy. Today I want to point out a pair of tools that come in handy often: sort and uniq.

sort can be used to sort the lines in a given file. A simple use would be

echo -e "c\na\nb" | sort

which produces:

a
b
c
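
Sweet! The example pipes input in with echo -e, but sort reads files directly too (myfile.txt here is just a placeholder name):

sort myfile.txt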

Now let's look at uniq:

echo -e "a\na\nb\nc\nb" | uniq

gives:

a
b
c
b

It removed the duplicate a, but not the duplicate b! uniq only removes consecutive duplicate lines. So what if you want to remove all duplicates? Easy: sort first, so the duplicates end up next to each other:

echo -e "a\na\nb\nc\nb" | sort | uniq

gives:

a
b
c

Ta-da! That's easy!
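
As an aside, sort can drop duplicates on its own with its -u flag, so this one-liner gives the same result:

echo -e "a\na\nb\nc\nb" | sort -u

That said, uniq has a few tricks of its own that sort lacks, as we'll see next.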

Now this duo can be used in many useful ways. Just the other day I needed to find two XML elements that had the same value. I used 'sed' to pull out the value inside each element, sort to put those values in lexicographically sorted order, and uniq -d to print only the values that appeared more than once:

sed -n 's/.*<tag>\(.*\)<\/tag>/\1/p' file.xml | sort | uniq -d

For the 'sed' part you can put in whatever tag name you need to find duplicate values for, and file.xml should be the name of the file you are using. Note that each XML element in my file was on its own line, so this won't work for documents that aren't formatted nicely (for those, see xmllint --format :).
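
Here's the whole pipeline run end to end on some made-up input (the <isbn> values are just an invented example), so you can see it print the one duplicated value, 111:

echo -e "<isbn>111</isbn>\n<isbn>222</isbn>\n<isbn>111</isbn>\n<isbn>333</isbn>" | sed -n 's/.*<isbn>\(.*\)<\/isbn>/\1/p' | sort | uniq -d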
