awk, nawk and gawk utilities

Introduction

The awk, nawk and gawk (GNU Awk) utilities are easy-to-use text file parsers. They efficiently process text files whose fields are delimited by a character. With a syntax that is easy to learn, operations such as filtering rows, filtering columns, enriching content, converting formats and computing aggregates (averages or sums, for example) become child’s play. awk, gawk and nawk differ only in a few advanced features that are not discussed in this paper.

awk is the utility to reach for when complex log files, for example, have to be parsed in a few seconds.

This paper is an introduction to the nawk utility through examples. Contrary to common belief, awk is also available on Windows platforms :

UnxUtils for Windows (gawk) : http://sourceforge.net/projects/unxutils/
Cygwin (gawk) : http://www.cygwin.com/
MinGW - Minimalist GNU for Windows (awk) : http://www.mingw.org/

A little bit of history : awk was created in the 1970s for Unix operating systems. The name awk is formed from the initials of its authors : Aho, Weinberger and Kernighan.

Getting started with awk

The tutorial data file (Hello World)

This tutorial uses the file file.txt below. Columns are separated by tabs :

file.txt

Name            Gender           Age
---------------------------------------
CAMILLE         M               7
CHLOE           F               12
CLARA           F               11
CLEMENT         M               7
EMMA            F               6
THEO            M               8

Retrieving columns with awk

To extract data from the file, for example the first 2 columns :

%> nawk '{ print $1, $2 }' file.txt
          
Name Gender
--------------------------------------- 
CAMILLE M
CHLOE F
CLARA F
CLEMENT M
EMMA F
THEO M

Notice the structure of an awk program : the program is enclosed in single quotes and the action in braces.

$1 matches the first column, $2 the second, $3 the third…
$0 matches the entire row.

In the output, the tabs are replaced with a space, which is the default output separator.

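An important point to consider with awk is its particular behavior with spaces and tabs. With the default field separator, contiguous spaces and tabs are treated as a single separator. This is the only exception : with any other single-character separator, each occurrence delimits a field.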
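
The separator behavior can be checked with a short sketch. It assumes a generic awk command is installed (nawk or gawk should behave the same here) and uses made-up input lines :

```shell
# Default separator : the run of spaces and tabs counts as ONE separator
default_nf=$(printf 'a  \t b\n' | awk '{ print NF }')

# Explicit separator "," : each comma separates, so the empty field is counted
comma_nf=$(printf 'a,,b\n' | awk -F, '{ print NF }')

echo "default: $default_nf fields, comma: $comma_nf fields"
```

The first command counts 2 fields and the second counts 3, because the empty field between the two commas is kept.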

Applying filters with regular expressions with awk

Previously, columns were filtered. awk is also widely used to filter rows with regular expressions.

To find the lines containing CAMILLE :

%> nawk '/CAMILLE/ { print $1, $3, $2 }' file.txt

CAMILLE 7 M

The order of the columns has been changed for the example.

Another more complex filter, to find the rows starting with C and containing the letter A or the letter O :

%> nawk '/^C.*[AO]/ { print $1, $3, $2 }' file.txt
CAMILLE 7 M
CHLOE 12 F
CLARA 11 F

For more information about regular expressions : Regular-Expressions.info
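
A filter written as /regex/ is tested against the whole row. To test a single column, the operator ~ can be used. The sketch below assumes a generic awk command and a few sample rows in the same layout as file.txt :

```shell
sample='CAMILLE\tM\t7\nEMMA\tF\t6\nTHEO\tM\t8\n'

# /M/ is tested against the entire row : names containing an M also match
whole_line=$(printf "$sample" | awk '/M/ { print $1 }')

# $2 ~ /^M$/ is tested against the second column only
by_column=$(printf "$sample" | awk '$2 ~ /^M$/ { print $1 }')

echo "whole row :" $whole_line
echo "column 2  :" $by_column
```

The first filter returns CAMILLE, EMMA and THEO, the second returns only CAMILLE and THEO.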

awk can also select blocks of consecutive rows with a range made of two patterns. To retrieve the rows from the first one starting with CL to the next one starting with E :

%> nawk '/^CL/,/^E/ { print $0 }' file.txt
          
CLARA F 11
CLEMENT M 7
EMMA F 6
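
The two sides of a range are ordinary conditions, so they are not limited to regular expressions. A minimal sketch, assuming a generic awk command and a made-up four-line input, selects rows by their number :

```shell
# Rows 2 to 3 selected with a range of two conditions on NR (the row number)
picked=$(printf 'one\ntwo\nthree\nfour\n' | awk 'NR == 2, NR == 3 { print NR ": " $0 }')
echo "$picked"
```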

Internal variables with awk

awk provides useful variables that can be used, displayed, computed or assigned.

NR : number of records read (row number).
NF : number of fields (number of columns).

%> nawk '{ print NR, NF, $0 }' file.txt

1 3 Name             Gender            Age
2 1 ---------------------------------------
3 3 CAMILLE             M               7
4 3 CHLOE               F               12
5 3 CLARA               F               11
6 3 CLEMENT             M               7
7 3 EMMA                F               6
8 3 THEO                M               8
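
NF can also be used as a filter, and $NF addresses the last column of a row. The sketch below assumes a generic awk command and a shortened copy of the tutorial data ; it keeps the rows with exactly 3 fields and skips the header :

```shell
# NF == 3 discards the separator row (1 field), NR > 1 discards the header
ages=$(printf 'Name\tGender\tAge\n-----\nEMMA\tF\t6\nTHEO\tM\t8\n' |
       awk 'NF == 3 && NR > 1 { print $1, $NF }')
echo "$ages"
```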

FS : field separator (by default : space/tab).
OFS : output field separator (by default : space).

%> nawk '/CAMILLE/ { OFS="," ; print $2,$1 }' file.txt

M,CAMILLE

Notice the character ";" to separate the instructions in the same line and how a value is assigned to a variable (OFS=",").
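
FS and OFS can also be set from the command line with the options -F and -v. The sketch below assumes a generic awk command and two sample rows ; it converts tab-separated rows to comma-separated rows :

```shell
# -F sets FS and -v assigns OFS before the first row is read.
# The assignment $1 = $1 forces awk to rebuild $0 with the new OFS.
csv=$(printf 'EMMA\tF\t6\nTHEO\tM\t8\n' | awk -F'\t' -v OFS=',' '{ $1 = $1; print }')
echo "$csv"
```

Without the assignment $1 = $1, print would display the row unchanged, because $0 is only rebuilt when a field is modified.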

The variable ENVIRON is an array storing the user’s environment variables. Below, the user’s EDITOR variable is displayed with awk :

%> nawk '/EMMA/ { OFS="," ; print $2,$1, ENVIRON["EDITOR"]}' file.txt

F,EMMA,vi

Notice the way to address the content of an array : array["tag"]
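
Values can be passed to an awk program through the environment or with the option -v. In the sketch below, which assumes a generic awk command, AWK_DEMO_GREETING and msg are hypothetical names chosen for the illustration :

```shell
# The environment variable is set only for this command and read through ENVIRON
greeting=$(AWK_DEMO_GREETING="hello" awk 'BEGIN { print ENVIRON["AWK_DEMO_GREETING"] }')

# The same value passed with -v becomes an ordinary awk variable
via_v=$(awk -v msg="hello" 'BEGIN { print msg }')

echo "$greeting $via_v"
```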

Scripts

awk has been used previously in command line mode. When the awk program becomes complex, it can be stored in a file, prog.awk here :

prog.awk

/^CL/,/^E/ { 
     print NR, $0 
}

then interpreted using the option -f

%> nawk -f prog.awk file.txt
          
5 CLARA F 11
6 CLEMENT M 7
7 EMMA F 6
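
A script file can be combined with the option -v to make it reusable. The sketch below assumes a generic awk command ; the temporary script and the variable who are hypothetical names used for the illustration :

```shell
# Script stored in a temporary file : prints the rows whose first column equals "who"
prog=$(mktemp)
cat > "$prog" <<'EOF'
$1 == who { print NR, $1 }
EOF

# -v assigns the awk variable "who", -f reads the program from the file
found=$(printf 'EMMA\tF\t6\nTHEO\tM\t8\n' | awk -v who="THEO" -f "$prog")
rm -f "$prog"
echo "$found"
```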

Pre and Post operations

awk allows pre-processing (BEGIN) and post-processing (END) sections when parsing a file. The structure of an awk script is :

BEGIN {
        action
}

/filter/,/filter/ { action }

{ action }

END {
        action
}

BEGIN and END blocks are not mandatory. There can be a BEGIN block without an END block, an END block without a BEGIN block, or neither of these 2 blocks.
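
A minimal sketch of the three kinds of blocks, assuming a generic awk command and two sample rows, sums the third column :

```shell
total=$(printf 'EMMA\tF\t6\nTHEO\tM\t8\n' | awk '
BEGIN { sum = 0 }                 # runs once, before the first row
      { sum += $3 }               # runs for every row
END   { print "total age:", sum } # runs once, after the last row
')
echo "$total"
```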

Much more complex scripts can then be written. For example, to extract 2 columns, replace the tabs with ";" and then display the number of rows :

prog.awk

BEGIN {
        FS=" "          # default separator : runs of spaces and tabs
        OFS=";"         # separator used by print between the columns
}
{ 
        print $1, $3 
}
END { 
        printf "\nThe file has %d lines\n", NR 
}

%> nawk -f prog.awk file.txt
          
Name;Age
---------------------------------------;
CAMILLE;7
CHLOE;12
CLARA;11
CLEMENT;7
EMMA;6
THEO;8

The file has 8 lines

Functions

Built-in functions

The awk parser offers many built-in functions for processing data. See the awk utility manuals for the complete list. Here is a very partial list :

toupper, tolower

To convert text to upper or lower case using the functions toupper and tolower :

%> nawk '/THEO/ { print $1, tolower($1) }' file.txt
          
THEO theo

int

To convert a value to an integer using the function int :

%> nawk '/CHLOE/ { print $3, int($3/5) }' file.txt
          
12 2
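
The function int truncates toward zero, it does not round. A common way to round a positive value is to add 0.5 before truncating. A short sketch, assuming a generic awk command :

```shell
# int(7.9) truncates, int(-7.9) truncates toward zero, int(7.9 + 0.5) rounds
result=$(awk 'BEGIN { print int(7.9), int(-7.9), int(7.9 + 0.5) }')
echo "$result"
```

The command displays 7 -7 8.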

printf

The function printf with awk works like the function printf in the C language to format the output :

%> nawk 'NR > 2 { printf "%10s %02d %-10s\n", $1,$3, $1}' file.txt

   CAMILLE 07 CAMILLE   
     CHLOE 12 CHLOE     
     CLARA 11 CLARA     
   CLEMENT 07 CLEMENT   
      EMMA 06 EMMA      
      THEO 08 THEO
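
The usual C formats are also available for decimal numbers. The sketch below assumes a generic awk command ; LC_ALL=C is set as a precaution so that the decimal separator is a dot :

```shell
# %-8s : left-aligned text, %6.2f : 2 decimals on 6 characters, %03d : zero padding
line=$(LC_ALL=C awk 'BEGIN { printf "%-8s|%6.2f|%03d\n", "avg", 29 / 3, 7 }')
echo "$line"
```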

length

To display the size of a character string using the function length :

%> nawk '/CLEM/ { print $1, length($1) }' file.txt

CLEMENT 7
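
The function length is often combined with a comparison, for example to find the longest value of a column. A minimal sketch, assuming a generic awk command and three sample names :

```shell
# max starts empty (treated as 0) and keeps the greatest length seen so far
longest=$(printf 'EMMA\nCLEMENT\nTHEO\n' |
          awk 'length($1) > max { max = length($1); name = $1 } END { print name, max }')
echo "$longest"
```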

match

The function match returns the position of the first substring matching a regular expression, or 0 when there is no match :

%> nawk 'NR >2 { print $1, match($1,"A")}' file.txt
          
CAMILLE 2
CHLOE 0
CLARA 3
CLEMENT 0
EMMA 4
THEO 0
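
After a call to match, the variables RSTART and RLENGTH hold the position and the length of the matched text, which can then be extracted with substr. A short sketch, assuming a generic awk command and a made-up input word :

```shell
# The match starts at the first E and extends as far as possible up to a T
found=$(echo 'CLEMENTINE' |
        awk '{ if (match($0, /E[A-Z]*T/)) print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }')
echo "$found"
```

The command displays 3 5 EMENT.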

gsub

To replace strings using the function gsub :

%> nawk 'NR >2 { gsub("A","_",$1) ; print $1 }' file.txt
          
C_MILLE
CHLOE
CL_R_
CLEMENT
EMM_
THEO
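
gsub replaces every occurrence and returns the number of replacements, whereas the function sub replaces only the first occurrence. A minimal sketch, assuming a generic awk command ; the variable names are chosen for the illustration :

```shell
out=$(echo 'CLARA' | awk '{
        all = $0; first = $0
        n = gsub(/A/, "_", all)   # replaces every A and returns the count
        sub(/A/, "_", first)      # replaces the first A only
        print n, all, first
}')
echo "$out"
```

The command displays 2 CL_R_ CL_RA.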

substr

To extract characters from a string using the function substr :

%> nawk '{ print $1, substr($1,2,3) }' file.txt

Name ame
--------------------------------------- ---
CAMILLE AMI
CHLOE HLO
CLARA LAR
CLEMENT LEM
EMMA MMA
THEO HEO
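
Positions start at 1, and when the third argument is omitted, substr returns the rest of the string. A short sketch, assuming a generic awk command :

```shell
# substr(s, 1, 2) : 2 characters from position 1 ; substr(s, 3) : from position 3 to the end
parts=$(awk 'BEGIN { s = "CLEMENT"; print substr(s, 1, 2), substr(s, 3) }')
echo "$parts"
```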

User defined functions

The ability to create user functions is one of the most important features of the awk utility. Functions are defined with the keyword function.

prog.awk

function gentag(nom,age) {
        tmp=tolower(substr(nom,1,3))
        return tmp "_" age
}

BEGIN { 
        FS=" "
        OFS=";"
}

{ 
        print $1, $3, gentag($1,$3)
}

END { 
        print NR , "lines"
}

%> nawk -f prog.awk file.txt
          
Name;Age;nam_Age
---------------------------------------;;---_
CAMILLE;7;cam_7
CHLOE;12;chl_12
CLARA;11;cla_11
CLEMENT;7;cle_7
EMMA;6;emm_6
THEO;8;the_8
8;lines
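
Variables assigned inside a function are global by default. A common awk idiom is to declare extra parameters, conventionally separated by additional spaces, to obtain local variables. The sketch below assumes a generic awk command ; the function name shorten is hypothetical :

```shell
out=$(awk '
# "tmp" is declared as an extra parameter, so it is local to the function
function shorten(name,    tmp) {
        tmp = tolower(substr(name, 1, 3))
        return tmp
}
BEGIN {
        tmp = "untouched"
        print shorten("CAMILLE"), tmp   # the global tmp keeps its value
}')
echo "$out"
```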

Programming

The awk parser provides the usual programming structures : conditions, loops, iterations.

Conditions

Are the children in primary or middle school ? A test with if() {} else {} :

prog.awk

BEGIN {
        OFS=","
}
NR <=2 { next }
{
        if ( $3 < 11 ) {
                ecole="primary school"
        } else {
                ecole="middle school"
        }

        print $1, ecole
}

%> nawk -f prog.awk file.txt
          
CAMILLE,primary school
CHLOE,middle school
CLARA,middle school
CLEMENT,primary school
EMMA,primary school
THEO,primary school

Notice the way the header is discarded : NR <=2 { next }
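
For a simple choice between two values, the conditional operator ?: is a compact alternative to if() {} else {}. A minimal sketch, assuming a generic awk command and two sample rows without header :

```shell
# ( condition ? value if true : value if false )
schools=$(printf 'EMMA\tF\t6\nCHLOE\tF\t12\n' |
          awk -v OFS=',' '{ print $1, ($3 < 11 ? "primary school" : "middle school") }')
echo "$schools"
```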

Loops

Replace the child's age by a number of dots with while() {}.

prog.awk

NR <=2 { next }
{
        min=1
        printf "%-10s", $1
        while  ( min <= $3 ) {
                printf "."
                min++
        }
        printf "\n"
}

%> nawk -f prog.awk file.txt

CAMILLE   .......
CHLOE     ............
CLARA     ...........
CLEMENT   .......
EMMA      ......
THEO      ........

Iterations

Replace the child's age by a number of dots with for (i= ; i< ; i++ ) { }.

prog.awk

NR <=2 { next }
{
        printf "%-10s", $1
        for ( min=1 ; min <= $3; min++ ) {
                printf "."
        }
        printf "\n"
}

%> nawk -f prog.awk file.txt
          
CAMILLE   .......
CHLOE     ............
CLARA     ...........
CLEMENT   .......
EMMA      ......
THEO      ........
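
A for loop is also a convenient way to walk through the columns of a row, since $i addresses the column number i. The sketch below assumes a generic awk command and a made-up input row ; it displays the columns in reverse order :

```shell
# i goes from the last column (NF) down to 1 ; a newline is printed after the last one
reversed=$(echo 'a b c' |
           awk '{ for (i = NF; i >= 1; i--) printf "%s%s", $i, (i > 1 ? " " : "\n") }')
echo "$reversed"
```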

Arrays

To end this brief presentation : arrays, which are useful when computing aggregates. awk arrays are associative and their structure is very basic :

tab[index] = value

Compute the average age of children by gender :

prog.awk

{
        if ( NR <= 2 ) { next } # skip first 2 lines
        tab_age[$2]+=$3         # sum of the ages per gender
        tab_cpt[$2]++           # number of children per gender
}
END {
        for ( gender in tab_age ) {
                print gender, ":", "Average :", int(tab_age[gender]/tab_cpt[gender]), "years", "nb :", tab_cpt[gender]
        }
}

%> nawk -f prog.awk file.txt
          
F : Average : 9 years nb : 3
M : Average : 7 years nb : 3

Notice how the 2 arrays are filled row by row and processed at the end of the program. The order in which for ( ... in ... ) visits the keys is not specified.
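
When a stable order is required, the output of the loop can be piped to the sort utility. A minimal sketch, assuming generic awk and sort commands and three sample rows ; the array name cpt is chosen for the illustration :

```shell
# cpt[$2]++ counts the rows per gender ; sort orders the rows produced by the END block
counts=$(printf 'EMMA\tF\nTHEO\tM\nCHLOE\tF\n' |
         awk '{ cpt[$2]++ } END { for (g in cpt) print g, cpt[g] }' |
         sort)
echo "$counts"
```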
