Introduction
The awk, nawk and gawk (GNU Awk) utilities are easy-to-use text file parsers.
They efficiently handle text files whose data is delimited by a character.
With a syntax that is easy to understand, filtering rows, filtering columns, enriching content, converting formats, computing aggregates (averages or sums, for example), etc. become child’s play with these utilities.
awk, gawk and nawk differ only in a few very advanced features not discussed in this paper.
awk is the utility to use without hesitation to parse complex log files, for example, in seconds.
This paper is an introduction to the nawk utility through examples. Contrary to common belief, awk is also available on Windows platforms :
- UnxUtils for Windows (gawk) : http://sourceforge.net/projects/unxutils/
- Cygwin (gawk) : http://www.cygwin.com/
- MinGW - Minimalist GNU for Windows (awk) : http://www.mingw.org/
A little bit of history : awk was created in the 1970s for Unix operating systems. The name awk is formed from the initials of its authors : Aho, Weinberger and Kernighan.
Getting started with awk
The tutorial csv file (Hello World)
This tutorial uses the file file.txt below. Columns are separated by tabs :
file.txt
Name Gender Age
---------------------------------------
CAMILLE M 7
CHLOE F 12
CLARA F 11
CLEMENT M 7
EMMA F 6
THEO M 8
Retrieving columns with awk
To extract data from the file, for example the first 2 columns :
%> nawk '{ print $1, $2 }' file.txt
Name Gender
---------------------------------------
CAMILLE M
CHLOE F
CLARA F
CLEMENT M
EMMA F
THEO M
Notice the structure of an awk program between quotes and braces.
$1 matches the first column, $2 the second, $3 the third, and so on. $0 matches the entire row.
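In the output, the tabs are replaced with a space : the comma in print joins the values with the output separator, which is a space by default.
An important point to consider with awk : its particular behavior with spaces and tabs. With the default separator, contiguous spaces and tabs are treated as a single separator and leading blanks are ignored. This is the only exception : any other single-character separator is taken literally, so 2 consecutive delimiters enclose an empty field.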
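The separator behavior can be checked with a short sketch. The input rows are made up for illustration, and the command is written awk on the assumption that it is installed under that name (nawk or gawk can be substituted).

```shell
# Default separator: runs of spaces and tabs collapse into one separator
# and leading blanks are ignored, so this row has 3 fields.
printf '  CAMILLE \t  M    7\n' | awk '{ print NF }'        # prints 3

# Explicit tab separator: every tab counts, so the empty column
# between the 2 consecutive tabs is kept as an empty field.
printf 'CAMILLE\t\t7\n' | awk -F'\t' '{ print NF; print "[" $2 "]" }'
```

The second command prints 3, then [] for the empty second column.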
Applying filters with regular expressions with awk
The previous examples filtered columns, but awk is mainly used to filter rows with regular expressions.
To find the lines containing CAMILLE :
%> nawk '/CAMILLE/ { print $1, $3, $2 }' file.txt
CAMILLE 7 M
The order of the columns has been changed for the example.
Another more complex filter, to find the rows starting with C and containing the letter A or the letter O :
%> nawk '/^C.*[AO]/ { print $1, $3, $2 }' file.txt
CAMILLE 7 M
CHLOE 12 F
CLARA 11 F
For more information about regular expressions : Regular-Expressions.info
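The filters above are matched against the entire row. As a minimal sketch, the ~ operator restricts a regular expression to a single column. The tutorial file is recreated first so that the commands can be run as they are, and awk is assumed to be the installed command name.

```shell
# Recreate the tutorial file (columns separated by tabs).
printf 'Name\tGender\tAge\n---------------------------------------\nCAMILLE\tM\t7\nCHLOE\tF\t12\nCLARA\tF\t11\nCLEMENT\tM\t7\nEMMA\tF\t6\nTHEO\tM\t8\n' > file.txt

# /M/ alone would also select EMMA, because her name contains an M.
# $2 ~ /^M$/ keeps only the rows whose second column is exactly M.
awk '$2 ~ /^M$/ { print $1, $3 }' file.txt
```

This prints CAMILLE 7, CLEMENT 7 and THEO 8, one per row.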
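awk is also very useful and powerful to filter blocks of consecutive rows, using a range made of 2 patterns separated by a comma. To retrieve the rows from the first one starting with CL to the next one starting with E :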
%> nawk '/^CL/,/^E/ { print $0 }' file.txt
CLARA F 11
CLEMENT M 7
EMMA F 6
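One detail of range filters is worth a quick check : when the end pattern never matches, the range stays open until the end of the file. A small sketch, assuming the command is installed as awk and recreating the tutorial file first :

```shell
# Recreate the tutorial file (columns separated by tabs).
printf 'Name\tGender\tAge\n---------------------------------------\nCAMILLE\tM\t7\nCHLOE\tF\t12\nCLARA\tF\t11\nCLEMENT\tM\t7\nEMMA\tF\t6\nTHEO\tM\t8\n' > file.txt

# No row starts with Z, so the range opened by EMMA
# runs until the last row of the file.
awk '/^E/,/^Z/ { print $1 }' file.txt
```

This prints EMMA, then THEO.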
Internal variables with awk
awk provides useful variables that can be used, displayed, computed or assigned.
NR : number of records read (row number).
NF : number of fields (number of columns).
%> nawk '{ print NR, NF, $0 }' file.txt
1 3 Name Gender Age
2 1 ---------------------------------------
3 3 CAMILLE M 7
4 3 CHLOE F 12
5 3 CLARA F 11
6 3 CLEMENT M 7
7 3 EMMA F 6
8 3 THEO M 8
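NR and NF can also be used in the filter itself. The sketch below, which assumes the command is installed as awk, reports the rows of the tutorial file that do not have exactly 3 columns :

```shell
# Recreate the tutorial file (columns separated by tabs).
printf 'Name\tGender\tAge\n---------------------------------------\nCAMILLE\tM\t7\nCHLOE\tF\t12\nCLARA\tF\t11\nCLEMENT\tM\t7\nEMMA\tF\t6\nTHEO\tM\t8\n' > file.txt

# Only the row of dashes has a single field.
awk 'NF != 3 { print "line " NR ": " NF " field(s)" }' file.txt
```

This prints line 2: 1 field(s).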
FS : field separator (by default : space/tab).
OFS : output field separator (by default : space).
%> nawk '/CAMILLE/ { OFS="," ; print $2,$1 }' file.txt
M,CAMILLE
Notice the character ";" to separate the instructions in the same line and how a value is assigned to a variable (OFS=",").
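OFS can also be assigned before any row is read, with the standard -v option. The sketch below uses a made-up single-row input and assumes the command is installed as awk. It also shows that print $0 keeps the original separators until a field is assigned.

```shell
# -v assigns OFS once, before the input is read.
printf 'CAMILLE\tM\t7\n' | awk -v OFS=',' '{ print $2, $1 }'        # M,CAMILLE

# $0 is printed unchanged : the tabs are still there.
printf 'CAMILLE\tM\t7\n' | awk -v OFS=',' '{ print $0 }'

# Assigning any field rebuilds $0 with OFS.
printf 'CAMILLE\tM\t7\n' | awk -v OFS=',' '{ $1 = $1; print $0 }'   # CAMILLE,M,7
```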
The variable ENVIRON is an array storing the user’s environment variables. Below, the user’s EDITOR variable is displayed with awk :
%> nawk '/EMMA/ { OFS="," ; print $2,$1, ENVIRON["EDITOR"]}' file.txt
F,EMMA,vi
Notice the way to address the content of an array : array["tag"]
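Since the value of EDITOR depends on the user, here is a sketch that does not rely on the local setup. TUTORIAL_USER and TUTORIAL_UNSET_VAR are made-up variable names, and the command is assumed to be installed as awk.

```shell
# The variable is set for this single command only.
TUTORIAL_USER=emma awk 'BEGIN { print "user=" ENVIRON["TUTORIAL_USER"] }'   # user=emma

# A variable that is not set yields an empty string.
awk 'BEGIN { print "[" ENVIRON["TUTORIAL_UNSET_VAR"] "]" }'                 # []
```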
Scripts
So far, awk has been used in command line mode. When the awk program becomes complex, it can be stored in a file, prog.awk for example :
prog.awk
/^CL/,/^E/ {
    print NR, $0
}
then interpreted using the option -f
%> nawk -f prog.awk file.txt
5 CLARA F 11
6 CLEMENT M 7
7 EMMA F 6
Pre and Post operations
awk allows pre-processing (BEGIN) and post-processing (END) sections when parsing a file.
The structure of an awk script is :
BEGIN {
    action
}
/filter/,/filter/ { action }
{ action }
END {
    action
}
BEGIN and END blocks are not mandatory. There can be a BEGIN block without an END block, an END block without a BEGIN block, or neither of these 2 blocks.
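As a minimal sketch of these combinations (the input is made up, and the command is assumed to be installed as awk) :

```shell
# BEGIN only : no input is read, awk behaves like a calculator.
awk 'BEGIN { print 7 * 6 }'                    # 42

# END only : the rows are read silently, then NR holds their count.
printf 'a\nb\nc\n' | awk 'END { print NR }'    # 3
```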
Much more complex scripts can then be written.
For example, to extract 2 columns by replacing tabs with ";" and then display the number of rows :
prog.awk
BEGIN {
    FS="\t"     # input columns are separated by tabs
    OFS=";"     # output columns are separated by ";"
}
{
    print $1, $3
}
END {
    printf "\nThe file has %d lines\n", NR
}
%> nawk -f prog.awk file.txt
Name;Age
---------------------------------------;
CAMILLE;7
CHLOE;12
CLARA;11
CLEMENT;7
EMMA;6
THEO;8

The file has 8 lines
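The END block is also a natural place to report a value accumulated while the rows are read. A minimal sketch, assuming the command is installed as awk; the variable names sum and n are chosen for this example only :

```shell
# Recreate the tutorial file (columns separated by tabs).
printf 'Name\tGender\tAge\n---------------------------------------\nCAMILLE\tM\t7\nCHLOE\tF\t12\nCLARA\tF\t11\nCLEMENT\tM\t7\nEMMA\tF\t6\nTHEO\tM\t8\n' > file.txt

awk -F'\t' '
NR > 2 { sum += $3; n++ }      # skip the 2 header rows, add up the ages
END    { if (n > 0) printf "%d children, total age %d, average %.2f\n", n, sum, sum / n }
' file.txt
```

This prints 6 children, total age 51, average 8.50. The test on n avoids a division by zero on an empty file.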
Functions
Built-in functions
The awk parser offers many built-in functions for processing data.
See the awk utility manuals for the complete list of internal functions; here is a very partial list :
toupper, tolower
To convert text to upper or lower case using the functions toupper and tolower :
%> nawk '/THEO/ { print $1, tolower($1) }' file.txt
THEO theo
int
To convert a value to an integer using the function int :
%> nawk '/CHLOE/ { print $3, int($3/5) }' file.txt
12 2
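int truncates its argument, it does not round it. The sketch below, which assumes the command is installed as awk, shows the difference and one common way to round a positive value :

```shell
awk 'BEGIN {
    print int(12 / 5)          # 2.4 is truncated to 2
    print int(14 / 5)          # 2.8 is truncated to 2 as well
    print int(14 / 5 + 0.5)    # adding 0.5 first rounds to 3
    print int(-2.8)            # truncation goes toward zero : -2
}'
```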
printf
The function printf with awk works like the function printf in the C language to format the output :
%> nawk 'NR > 2 { printf "%10s %02d %-10s\n", $1,$3, $1}' file.txt
   CAMILLE 07 CAMILLE
     CHLOE 12 CHLOE
     CLARA 11 CLARA
   CLEMENT 07 CLEMENT
      EMMA 06 EMMA
      THEO 08 THEO
length
To display the size of a character string using the function length :
%> nawk '/CLEM/ { print $1, length($1) }' file.txt
CLEMENT 7
match
The function match returns the position of the first substring matching a regular expression, or 0 when nothing matches :
%> nawk 'NR >2 { print $1, match($1,"A")}' file.txt
CAMILLE 2
CHLOE 0
CLARA 3
CLEMENT 0
EMMA 4
THEO 0
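match also sets the standard variables RSTART (position of the match) and RLENGTH (its length), which combine well with substr. A small sketch on 2 made-up input rows, assuming the command is installed as awk :

```shell
# match() is used as the filter : rows without an L are skipped.
printf 'CAMILLE\nTHEO\n' |
awk 'match($1, /L+/) { print $1, RSTART, RLENGTH, substr($1, RSTART, RLENGTH) }'
```

This prints CAMILLE 5 2 LL, and nothing for THEO.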
gsub
To replace strings using the function gsub :
%> nawk 'NR >2 { gsub("A","_",$1) ; print $1 }' file.txt
C_MILLE
CHLOE
CL_R_
CLEMENT
EMM_
THEO
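gsub returns the number of replacements it made, and its companion sub replaces the first occurrence only. The sketch below works on a copy of the field (the variable name is chosen for this example) so that $1 stays untouched; the command is assumed to be installed as awk.

```shell
printf 'CLARA\n' | awk '{
    name = $1
    n = gsub(/A/, "_", name)     # replaces every A and returns the count
    print name, n                # CL_R_ 2

    name = $1
    n = sub(/A/, "_", name)      # replaces the first A only
    print name, n                # CL_RA 1
}'
```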
substr
To extract characters from a string using the function substr :
%> nawk '{ print $1, substr($1,2,3) }' file.txt
Name ame
--------------------------------------- ---
CAMILLE AMI
CHLOE HLO
CLARA LAR
CLEMENT LEM
EMMA MMA
THEO HEO
User defined functions
The ability to create user functions is one of the most important features of the awk utility.
Functions are defined with the keyword function.
prog.awk
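# build a tag from the first 3 letters of the name, in lower case, and the age
function gentag(nom,age) {
    tmp=tolower(substr(nom,1,3))
    return tmp "_" age
}
BEGIN {
    FS="\t"     # input columns are separated by tabs
    OFS=";"
}
{
    print $1, $3, gentag($1,$3)
}
END {
    print NR , "lines"
}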
%> nawk -f prog.awk file.txt
Name;Age;nam_Age
---------------------------------------;;---_
CAMILLE;7;cam_7
CHLOE;12;chl_12
CLARA;11;cla_11
CLEMENT;7;cle_7
EMMA;6;emm_6
THEO;8;the_8
8;lines
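In the function above, tmp is a global variable : awk has no declaration for local variables. A common idiom, sketched here as a variant of gentag and assuming the command is installed as awk, is to list the temporary variables as extra parameters, which makes them local to the function.

```shell
awk '
# Extra parameters after the real ones act as local variables;
# the wide gap before tmp is only a naming convention.
function gentag(name, age,    tmp) {
    tmp = tolower(substr(name, 1, 3))
    return tmp "_" age
}
BEGIN {
    tmp = "untouched"
    print gentag("CAMILLE", 7)   # cam_7
    print tmp                    # untouched : the global was not overwritten
}'
```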
Programming
The awk parser provides the usual programming structures : conditions, loops, iterations.
Conditions
Determine whether the children are in primary or middle school with if () {} else {} :
prog.awk
BEGIN {
    OFS=","
}
NR <=2 { next }
{
    if ( $3 < 11 ) {
        ecole="primary school"
    } else {
        ecole="middle school"
    }
    print $1, ecole
}
%> nawk -f prog.awk file.txt
CAMILLE,primary school
CHLOE,middle school
CLARA,middle school
CLEMENT,primary school
EMMA,primary school
THEO,primary school
Notice the way the header is discarded : NR <=2 { next }
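For a simple choice between 2 values, the conditional operator ? : is a compact alternative to if () {} else {}. A sketch on 2 made-up input rows, assuming the command is installed as awk :

```shell
printf 'CAMILLE\tM\t7\nCHLOE\tF\t12\n' |
awk -v OFS=',' '{ print $1, ($3 < 11 ? "primary school" : "middle school") }'
```

This prints CAMILLE,primary school then CHLOE,middle school.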
Loops
Replace the child's age by a number of dots with while() {}.
prog.awk
NR <=2 { next }
{
    min=1
    printf "%-10s", $1
    while ( min <= $3 ) {
        printf "."
        min++
    }
    printf "\n"
}
%> nawk -f prog.awk file.txt
CAMILLE   .......
CHLOE     ............
CLARA     ...........
CLEMENT   .......
EMMA      ......
THEO      ........
Iterations
Replace the child's age by a number of dots with for (i= ; i< ; i++ ) { }.
prog.awk
NR <=2 { next }
{
    printf "%-10s", $1
    for ( min=1 ; min <= $3; min++ ) {
        printf "."
    }
    printf "\n"
}
%> nawk -f prog.awk file.txt
CAMILLE   .......
CHLOE     ............
CLARA     ...........
CLEMENT   .......
EMMA      ......
THEO      ........
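A frequent use of the for loop is to walk through the columns of a row, from 1 to NF. A sketch on one made-up input row, assuming the command is installed as awk :

```shell
# $i is the field whose number is stored in i.
printf 'CAMILLE\tM\t7\n' | awk '{ for (i = 1; i <= NF; i++) print i ": " $i }'
```

This prints 1: CAMILLE, 2: M and 3: 7, one per row.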
Arrays
To end this brief presentation : arrays with awk, useful when computing aggregates.
The structure of an array with awk is very basic :
tab[index] = value
Compute the average age of children by gender :
prog.awk
{
    if ( NR <= 2 ) { next } # skip first 2 lines
    tab_age[$2]+=$3
    tab_cpt[$2]++
}
END {
    for ( gender in tab_age ) {
        print gender, " : ", "Average :", int(tab_age[gender]/tab_cpt[gender]), "years", "nb :", tab_cpt[gender]
    }
}
%> nawk -f prog.awk file.txt
F  :  Average : 9 years nb : 3
M  :  Average : 7 years nb : 3
Notice how the 2 arrays are filled row by row and processed at the end of the program.
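The same technique counts occurrences. The sketch below, which assumes the command is installed as awk and uses the array name count for illustration, counts how many children share each age. The loop for ( key in array ) visits the keys in an unspecified order, so the result is piped through sort.

```shell
# Recreate the tutorial file (columns separated by tabs).
printf 'Name\tGender\tAge\n---------------------------------------\nCAMILLE\tM\t7\nCHLOE\tF\t12\nCLARA\tF\t11\nCLEMENT\tM\t7\nEMMA\tF\t6\nTHEO\tM\t8\n' > file.txt

# The age is the key of the array, the number of children its value.
awk -F'\t' 'NR > 2 { count[$3]++ } END { for (age in count) print age, count[age] }' file.txt | sort -n
```

This prints 6 1, 7 2, 8 1, 11 1 and 12 1, one per row.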