Introduction
The awk, nawk and gawk (GNU Awk) utilities are easy-to-use text file parsers.
They efficiently handle text files whose data is delimited by a character.
With a syntax that is easy to understand, operations such as filtering rows, filtering columns, enriching content, converting formats and computing aggregates (averages and sums, for example) become child's play.
awk, gawk and nawk differ only in a few advanced features that this article does not cover.
awk is the utility to reach for without hesitation when complex log files, for example, must be parsed in seconds.
This article introduces the nawk utility through examples. Contrary to common belief, awk is also available on Windows platforms :
- UnxUtils for Windows (gawk) : http://sourceforge.net/projects/unxutils/
- Cygwin (gawk) : http://www.cygwin.com/
- MinGW - Minimalist GNU for Windows (awk) : http://www.mingw.org/
A little bit of history : awk was created in the 1970s for Unix operating systems. Its name is formed from the initials of its authors : Aho, Weinberger and Kernighan.
Getting started with awk
The tutorial data file (Hello World)
This tutorial uses the file file.txt below. Columns are separated by tabs :
file.txt
Name Gender Age
---------------------------------------
CAMILLE M 7
CHLOE F 12
CLARA F 11
CLEMENT M 7
EMMA F 6
THEO M 8
Retrieving columns with awk
To extract data from the file, for example the first 2 columns :
%> nawk '{ print $1, $2 }' file.txt
Name Gender
---------------------------------------
CAMILLE M
CHLOE F
CLARA F
CLEMENT M
EMMA F
THEO M
Notice the structure of an awk program between quotes and braces.
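$1 refers to the first column, $2 to the second, $3 to the third… and $0 to the entire row.
In the output, tabs are replaced with a space, which is the default output separator.
One point deserves attention : awk treats spaces and tabs in a particular way. With the default field separator, any run of contiguous spaces and tabs counts as a single separator, and leading and trailing blanks are ignored. This collapsing applies only to the default separator : any other single-character separator, such as a lone tab or a comma, splits the row at every occurrence.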
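The difference between the default separator and an explicit one can be sketched as follows. This is a minimal illustration, not part of the original tutorial : the sample row (with an empty second column) and the shell variable names are assumptions, and awk is used in place of nawk on the assumption that a POSIX awk is installed.

```shell
# Hypothetical sample row with an empty second column (two consecutive tabs).
row=$(printf 'CAMILLE\t\t7')

# Default FS: the run of tabs counts as one separator, so 7 is field 2.
default_split=$(printf '%s\n' "$row" | awk '{ print NF, $2 }')

# FS set to a tab: each tab separates, so the empty column is kept and 7 is field 3.
tab_split=$(printf '%s\n' "$row" | awk -F '\t' '{ print NF, $3 }')

printf '%s\n' "$default_split" "$tab_split"
```

The first command reports 2 fields and the second reports 3, which is why an explicit tab separator is the safer choice when a file may contain empty columns.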
Applying filters with regular expressions with awk
The previous example filtered columns, but awk is mainly used to filter rows with regular expressions.
To find the lines containing CAMILLE :
%> nawk '/CAMILLE/ { print $1, $3, $2 }' file.txt
CAMILLE 7 M
The order of the columns has been changed for the example.
Another more complex filter, to find the rows starting with C and containing the letter A or the letter O :
%> nawk '/^C.*[AO]/ { print $1, $3, $2 }' file.txt
CAMILLE 7 M
CHLOE 12 F
CLARA 11 F
For more information about regular expressions : Regular-Expressions.info
awk is also very useful for selecting a range of rows with a pair of filters. To retrieve the rows from the first one starting with CL to the next one starting with E :
%> nawk '/^CL/,/^E/ { print $0 }' file.txt
CLARA F 11
CLEMENT M 7
EMMA F 6
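A filter written as /regex/ is tested against the entire row. When only one column should be tested, the ~ operator can be applied to that field. The sketch below is an illustration under assumptions : the inline rows are a hypothetical copy of part of file.txt, the shell variable names are invented for the example, and awk stands in for nawk.

```shell
# Hypothetical copy of a few tab-separated rows from the tutorial file.
data=$(printf 'CAMILLE\tM\t7\nCHLOE\tF\t12\nEMMA\tF\t6\nTHEO\tM\t8')

# /M/ is tested against the whole row, so EMMA matches because of her name.
whole_line=$(printf '%s\n' "$data" | awk '/M/ { print $1 }')

# $2 ~ /^M$/ tests only the Gender column.
gender_only=$(printf '%s\n' "$data" | awk '$2 ~ /^M$/ { print $1 }')

printf '%s\n' "$whole_line" "--" "$gender_only"
```

The first filter returns CAMILLE, EMMA and THEO, while the second returns only CAMILLE and THEO.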
Internal variables with awk
awk provides useful variables that can be used, displayed, computed or assigned.
NR: number of records read (row number).
NF: number of fields (number of columns).
%> nawk '{ print NR, NF, $0 }' file.txt
1 3 Name Gender Age
2 1 ---------------------------------------
3 3 CAMILLE M 7
4 3 CHLOE F 12
5 3 CLARA F 11
6 3 CLEMENT M 7
7 3 EMMA F 6
8 3 THEO M 8
FS: field separator (by default : space/tab).
OFS: output field separator (by default : space).
%> nawk '/CAMILLE/ { OFS="," ; print $2,$1 }' file.txt
M,CAMILLE
Notice the character ";" that separates instructions on the same line, and how a value is assigned to a variable (OFS=",").
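One detail about OFS is worth a short sketch : it is used when print receives several arguments, or when the row is rebuilt after a field has been assigned, but print $0 alone leaves an untouched row as it was read. The example below is an illustration, with a hypothetical inline row and invented shell variable names, and awk in place of nawk.

```shell
# Hypothetical tab-separated row.
line=$(printf 'CAMILLE\tM\t7')

# No field is modified: $0 keeps its original tabs despite OFS.
unchanged=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "," } { print $0 }')

# Assigning a field (even to itself) rebuilds $0 with OFS.
rebuilt=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "," } { $1 = $1; print $0 }')

printf '%s\n' "$unchanged" "$rebuilt"
```

The second command prints CAMILLE,M,7, which makes $1 = $1 a common idiom for converting a whole row from one separator to another.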
The variable ENVIRON is an array storing the user’s environment variables. Below, the user’s EDITOR variable is displayed with awk :
%> nawk '/EMMA/ { OFS="," ; print $2,$1, ENVIRON["EDITOR"]}' file.txt
F,EMMA,vi
Notice how an element of an array is addressed by its key : array["key"]
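Since the value of EDITOR depends on the user's session, a reproducible variant can set a variable just for the awk command. In this sketch the variable TUTORIAL_TAG and its value are hypothetical, as is the inline row; awk is used in place of nawk.

```shell
# TUTORIAL_TAG is a made-up environment variable, set only for this awk call.
env_out=$(printf 'EMMA\tF\t6\n' | TUTORIAL_TAG=demo awk '
BEGIN { OFS = "," }
{ print $2, $1, ENVIRON["TUTORIAL_TAG"] }
')

printf '%s\n' "$env_out"
```

This prints F,EMMA,demo. Setting OFS in a BEGIN block, rather than inside the action, assigns it once instead of once per matching row.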
Scripts
So far, awk has been used in command line mode. When an awk program becomes complex, it can be stored in a file, here prog.awk :
prog.awk
/^CL/,/^E/ {
    print NR, $0
}
then interpreted using the option -f
%> nawk -f prog.awk file.txt
5 CLARA F 11
6 CLEMENT M 7
7 EMMA F 6
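The whole cycle (write the program file, run it with -f) can be sketched in one self-contained snippet. The temporary file created with mktemp and the inline rows are assumptions made for the illustration; the rows have no header here, so the record numbers start at the first child. awk is used in place of nawk.

```shell
# Store the awk program in a temporary file (stands in for prog.awk).
prog=$(mktemp)
cat > "$prog" <<'EOF'
# Print the record number and the row for the range CL..E
/^CL/,/^E/ {
    print NR, $0
}
EOF

# Run the stored program on hypothetical rows read from standard input.
script_out=$(printf 'CAMILLE M 7\nCHLOE F 12\nCLARA F 11\nCLEMENT M 7\nEMMA F 6\nTHEO M 8\n' | awk -f "$prog")
rm -f "$prog"

printf '%s\n' "$script_out"
```

A program file can contain comments introduced by #, which is one practical reason to prefer it over a long command line.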
Pre and Post operations
awk allows pre-processing (BEGIN) and post-processing (END) sections when parsing a file.
The structure of an awk script is :
BEGIN {
    action
}
/filter/,/filter/ { action }
{ action }
END {
    action
}
BEGIN and END blocks are not mandatory. There can be a BEGIN block without an END block, an END block without a BEGIN block, or neither of these 2 blocks.
Much more complex scripts can then be written.
For example, to extract 2 columns, replace the tab separator with ";" and then display the number of rows :
prog.awk
BEGIN {
    FS="\t"
    OFS=";"
}
{
    print $1, $3
}
END {
    printf "\nThe file has %d lines\n", NR
}
%> nawk -f prog.awk file.txt
Name;Age
---------------------------------------;
CAMILLE;7
CHLOE;12
CLARA;11
CLEMENT;7
EMMA;6
THEO;8

The file has 8 lines
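The order in which the three kinds of blocks run can be sketched with a small accumulator. This is an illustration under assumptions : the inline rows are a hypothetical header-less extract of file.txt, the variable total is invented for the example, and awk is used in place of nawk.

```shell
summary=$(printf 'CAMILLE\tM\t7\nCHLOE\tF\t12\nCLARA\tF\t11\n' | awk '
BEGIN { FS = "\t"; total = 0 }      # runs once, before the first row
      { total += $3 }               # runs for every row
END   { printf "%d children, total age %d\n", NR, total }  # runs once, after the last row
')

printf '%s\n' "$summary"
```

This prints "3 children, total age 30". In the END block, NR still holds the number of the last row read, which is why it can serve as a row count.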
Functions
Built-in functions
awk offers many built-in functions for processing data.
See the awk manuals for the complete list of built-in functions. Here is a very partial list :
toupper, tolower
To convert text to upper or lower case using the functions toupper and tolower :
%> nawk '/THEO/ { print $1, tolower($1) }' file.txt
THEO theo
int
To convert a value to an integer using the function int :
%> nawk '/CHLOE/ { print $3, int($3/5) }' file.txt
12 2
printf
printf in awk formats the output in the same way as the printf function of the C language :
%> nawk 'NR > 2 { printf "%10s %02d %-10s\n", $1,$3, $1}' file.txt
   CAMILLE 07 CAMILLE
     CHLOE 12 CHLOE
     CLARA 11 CLARA
   CLEMENT 07 CLEMENT
      EMMA 06 EMMA
      THEO 08 THEO
length
To display the size of a character string using the function length :
%> nawk '/CLEM/ { print $1, length($1) }' file.txt
CLEMENT 7
match
The function match returns the position of the first substring matching a regular expression, or 0 when there is no match :
%> nawk 'NR >2 { print $1, match($1,"A")}' file.txt
CAMILLE 2
CHLOE 0
CLARA 3
CLEMENT 0
EMMA 4
THEO 0
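Besides its return value, match also sets the built-in variables RSTART (position of the match) and RLENGTH (length of the match), which combine well with substr. A minimal sketch, with a hypothetical one-line input and awk in place of nawk :

```shell
# Find the first run of one or more L in the name and extract it.
match_out=$(printf 'CAMILLE\n' | awk '{
    if (match($1, /L+/))
        print RSTART, RLENGTH, substr($1, RSTART, RLENGTH)
}')

printf '%s\n' "$match_out"
```

For CAMILLE this prints "5 2 LL" : the run starts at position 5 and is 2 characters long.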
gsub
To replace strings using the function gsub :
%> nawk 'NR >2 { gsub("A","_",$1) ; print $1 }' file.txt
C_MILLE
CHLOE
CL_R_
CLEMENT
EMM_
THEO
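Two related points can be sketched briefly : gsub returns the number of replacements it made, and its sibling sub replaces only the first match. The one-line input and the variable names below are assumptions for the illustration, and awk is used in place of nawk.

```shell
# gsub replaces every A and returns how many replacements were made.
gsub_out=$(printf 'CLARA\n' | awk '{ name = $1; n = gsub(/A/, "_", name); print name, n }')

# sub replaces only the first A.
sub_out=$(printf 'CLARA\n' | awk '{ name = $1; sub(/A/, "_", name); print name }')

printf '%s\n' "$gsub_out" "$sub_out"
```

The first command prints "CL_R_ 2" and the second prints "CL_RA". Working on a copy (name) leaves the original field untouched.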
substr
To extract characters from a string using the function substr :
%> nawk '{ print $1, substr($1,2,3) }' file.txt
Name ame
--------------------------------------- ---
CAMILLE AMI
CHLOE HLO
CLARA LAR
CLEMENT LEM
EMMA MMA
THEO HEO
User defined functions
The ability to create user functions is one of the most important features of the awk utility.
Functions are defined with the keyword function.
prog.awk
function gentag(nom,age) {
    tmp=tolower(substr(nom,1,3))
    return tmp "_" age
}
BEGIN {
    FS="\t"
    OFS=";"
}
{
    print $1, $3, gentag($1,$3)
}
END {
    print NR , "lines"
}
%> nawk -f prog.awk file.txt
Name;Age;nam_Age
---------------------------------------;;---_
CAMILLE;7;cam_7
CHLOE;12;chl_12
CLARA;11;cla_11
CLEMENT;7;cle_7
EMMA;6;emm_6
THEO;8;the_8
8;lines
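In awk, a variable assigned inside a function is global unless it appears in the parameter list. A common convention, sketched here as a variant of gentag and not as the author's own code, is to declare such working variables as extra parameters that the caller never passes. The inline row is hypothetical and awk is used in place of nawk.

```shell
func_out=$(printf 'CAMILLE\tM\t7\n' | awk '
# tmp is declared as an extra parameter, which makes it local to gentag.
function gentag(name, age,    tmp) {
    tmp = tolower(substr(name, 1, 3))
    return tmp "_" age
}
BEGIN { FS = "\t" }
# Outside the function, tmp is still empty.
{ print gentag($1, $3), "[" tmp "]" }
')

printf '%s\n' "$func_out"
```

This prints "cam_7 []" : the tag is computed, and the empty brackets show that tmp did not leak into the main program.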
Programming
awk provides the usual programming structures : conditions, loops, iterations.
Conditions
Determine whether each child is in primary or middle school with if () {} else {} :
prog.awk
BEGIN {
    OFS=","
}
NR <=2 { next }
{
    if ( $3 < 11 ) {
        ecole="primary school"
    } else {
        ecole="middle school"
    }
    print $1, ecole
}
%> nawk -f prog.awk file.txt
CAMILLE,primary school
CHLOE,middle school
CLARA,middle school
CLEMENT,primary school
EMMA,primary school
THEO,primary school
Notice the way the header is discarded : NR <=2 { next }
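For a simple two-way choice like this one, the conditional operator ?: offers a more compact alternative to if/else. The sketch below is a variant for illustration, with hypothetical header-less rows and awk in place of nawk.

```shell
school_out=$(printf 'CAMILLE\tM\t7\nCHLOE\tF\t12\nCLARA\tF\t11\n' | awk -F '\t' '
BEGIN { OFS = "," }
# The parentheses keep the comparison and the choice together inside print.
{ print $1, ($3 < 11 ? "primary school" : "middle school") }
')

printf '%s\n' "$school_out"
```

The result is the same as with if/else : CAMILLE is in primary school, CHLOE and CLARA in middle school.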
Loops
Replace the child's age by a number of dots with while() {}.
prog.awk
NR <=2 { next }
{
    min=1
    printf "%-10s", $1
    while ( min <= $3 ) {
        printf "."
        min++
    }
    printf "\n"
}
%> nawk -f prog.awk file.txt
CAMILLE   .......
CHLOE     ............
CLARA     ...........
CLEMENT   .......
EMMA      ......
THEO      ........
Iterations
Replace the child's age by a number of dots with for (i= ; i< ; i++ ) { }.
prog.awk
NR <=2 { next }
{
    printf "%-10s", $1
    for ( min=1 ; min <= $3; min++ ) {
        printf "."
    }
    printf "\n"
}
%> nawk -f prog.awk file.txt
CAMILLE   .......
CHLOE     ............
CLARA     ...........
CLEMENT   .......
EMMA      ......
THEO      ........
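The same for construct is often combined with NF to walk through the columns of a row. As a sketch under assumptions (hypothetical inline row, awk in place of nawk), the columns can be printed in reverse order :

```shell
# Loop from the last field down to the first; end the line after field 1.
fields_out=$(printf 'CAMILLE\tM\t7\n' | awk -F '\t' '{
    for (i = NF; i >= 1; i--)
        printf "%s%s", $i, (i > 1 ? " " : "\n")
}')

printf '%s\n' "$fields_out"
```

This prints "7 M CAMILLE". Because the loop bound is NF, the same code works whatever the number of columns.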
Arrays
To end this brief presentation, a look at arrays, which are useful when computing aggregates.
The structure of an array with awk is very basic, and the index can be a number or a character string :
tab[index] = value
Compute the average age of children by gender :
prog.awk
{
    if ( NR <= 2 ) { next } # skip first 2 lines
    tab_age[$2]+=$3
    tab_cpt[$2]++
}
END {
    for ( gender in tab_age ) {
        print gender, " : ", "Average :", int(tab_age[gender]/tab_cpt[gender]), "years", "nb :", tab_cpt[gender]
    }
}
%> nawk -f prog.awk file.txt
F  :  Average : 9 years nb : 3
M  :  Average : 7 years nb : 3
Notice how the 2 arrays are filled while the file is read, then processed in the END block.
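Two refinements can be sketched on top of this aggregate. First, int truncates the average, whereas printf with %.1f keeps one decimal. Second, awk does not guarantee the order in which for ( key in array ) visits the keys, so piping the result through sort gives a stable order. The array names sum and count, the header-less inline rows and the use of LC_ALL=C (to fix the decimal point and the sort order) are assumptions made for this illustration; awk is used in place of nawk.

```shell
avg_out=$(printf 'CAMILLE\tM\t7\nCHLOE\tF\t12\nCLARA\tF\t11\nCLEMENT\tM\t7\nEMMA\tF\t6\nTHEO\tM\t8\n' | LC_ALL=C awk -F '\t' '
# Accumulate the total age and the number of children per gender.
{ sum[$2] += $3; count[$2]++ }
END {
    for (gender in sum)
        printf "%s %.1f %d\n", gender, sum[gender] / count[gender], count[gender]
}' | LC_ALL=C sort)

printf '%s\n' "$avg_out"
```

This prints "F 9.7 3" then "M 7.3 3", compared with the truncated averages 9 and 7 obtained with int.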