Suppose you have a text file and you need to remove all of its duplicate lines.
TL;DR
To remove the duplicate lines preserving their order in the file use:
awk '!visited[$0]++' your_file > deduplicated_file
How it works
The script keeps an associative array with indices equal to the unique lines of the file and values equal to their occurrences. For each line of the file, if the line occurrences are zero then it increases them by one and prints the line, otherwise it just increases the occurrences without printing the line.
I was not familiar with awk
and I wanted to understand how is this accomplished with such a short script (awk
ward). I did my research and here is what is going on:
- the awk “script”
!visited[$0]++
is executed for each line of the input file visited[]
is a variable of type associative array (a.k.a. Map). We don’t have to initialize it,awk
will do this for us the first time we access it.- the
$0
variable holds the contents of the line currently being processed visited[$0]
accesses the value stored in the map with key equal to$0
(the line being processed), a.k.a. the occurrences (which we set below)- the
!
negates the occurrences value:- In awk, any nonzero numeric value or any nonempty string value is true
- By default, variables are initialized to the empty string, which is zero if converted to a number
- That being said:
- if
visited[$0]
returns a number greater than zero, this negation is resolved tofalse
. - if
visited[$0]
returns a number equal to zero or an empty string, this negation is resolved totrue
.
- if
- the
++
operation increases the variable’s value (visited[$0]
) by one.- If the value is empty,
awk
converts it to0
(number) automatically and then it gets increased. - Note: the operation is executed after we access the variable’s value.
- If the value is empty,
Summing up, the whole expression evaluates to:
- true if the occurrences are zero/empty string
- false if the occurrences are greater than zero
awk
statements consist of a pattern-expression and an associated action.
<pattern/expression> { <action> }
If the pattern succeeds then the associated action is being executed. If we don’t provide an action, awk
by default print
s the input.
An omitted action is equivalent to
{ print $0 }
Our script consists of one awk
statement with an expression, omitting the action.
So this:
awk '!visited[$0]++' your_file > deduplicated_file
is equivalent to this:
awk '!visited[$0]++ { print $0 }' your_file > deduplicated_file
For every line of the file, if the expression succeeds the line is printed to the output. Otherwise, the action is not executed, nothing is printed.
Why not use the uniq
command?
The uniq
commands removes only the adjacent duplicate lines. Demonstration:
$ cat test.txt
A
A
A
B
B
B
A
A
C
C
C
B
B
A
$ uniq < test.txt
A
B
A
C
B
A
Other approaches
Using the sort
command
We can also use the following sort
command to remove the duplicate lines but the line order is not preserved.
sort -u your_file > sorted_deduplicated_file
Using cat
, sort
and cut
The previous approach would produce a de-duplicated file whose lines would be sorted based on the contents. Piping a bunch of commands we can overcome this issue:
cat -n your_file | sort -uk2 | sort -nk1 | cut -f2-
How it works
Suppose we have the following file:
abc
ghi
abc
def
xyz
def
ghi
klm
cat -n test.txt
prepends the order number in each line.
1 abc
2 ghi
3 abc
4 def
5 xyz
6 def
7 ghi
8 klm
sort -uk2
sorts the lines based on the second column (k2
option) and keeps only the first occurrence of the lines with the same second column value (u
option)
1 abc
4 def
2 ghi
8 klm
5 xyz
sort -nk1
sorts the lines based on their first column (k1
option) treating the column as a number (-n
option)
1 abc
2 ghi
4 def
5 xyz
8 klm
Finally, cut -f2-
prints each line starting from the second column until its end (-f2-
option: note the -
suffix which instructs to include the rest of the line)
abc
ghi
def
xyz
klm
References
- The GNU Awk User’s Guide
- Arrays in Awk
- Awk - Truth values
- Awk expressions
- How can I delete duplicate lines in a file in Unix?
- Remove duplicate lines without sorting [duplicate]
- How does awk ‘!a[$0]++’ work?
That’s all. Cat photo.
Thanks for visiting!