AWK (Aho, Weinberger, and Kernighan)

Using the AWK utility

The awk utility is similar to sed but more useful when dealing with tabular data, that is, files whose data is arranged in rows and aligned columns. awk commands may be executed directly from the command line, or awk may be used to execute scripts. awk is an excellent filter and may be used as a text report writing facility: many Linux commands and utilities generate rows and columns of information, and awk can process such data and format a report in a single command. The awk utility also allows the user to perform string and arithmetic manipulation, and awk programs may use associative arrays.

An awk command is pattern based and executed for each line of input. The following is its general structure:

    <pattern> { <action> }

The pattern is a condition; the action is performed for every input line that satisfies it. If the pattern is omitted, the action is performed for every line. The input stream is searched for the pattern line by line. A block of code in curly braces after the word BEGIN, if present, is executed before the first line is processed, and a block of code in curly braces after the word END is executed after the last line. This awk program uses the 'print' command to output each line of input as is; however, it prints 'Start of File' before the first line and 'End of File' after the last line:

    BEGIN   { print "Start of File" }
            { print }
    END     { print "End of File" }
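
To see the pattern part at work, here is a minimal sketch run from the shell. The sample input lines and the function name filter_bash are made up for illustration. The program prints only the lines that match the regular expression /bash/, and its END block reports how many lines matched.

```shell
# filter_bash is a hypothetical helper name used only for this illustration.
filter_bash() {
    awk '
        /bash/ { print; count++ }               # runs only for lines matching /bash/
        END    { print "Matches:", count + 0 }  # count + 0 prints 0 when nothing matched
    '
}

# Three made-up input lines; only the two containing "bash" are printed.
printf 'bash\nemacs\n/bin/bash\n' | filter_bash
```

The awk program is wrapped in single quotes so that the shell passes it to awk unchanged.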

Before we move to a practical exercise, let us examine basic awk syntax. There are two main differences between awk programs and shell scripts. awk interprets backslash escape sequences such as \t within double quotes; the shell does not. However, the shell evaluates and substitutes variables within double-quoted strings ("$MyVar" would be substituted in a shell script); awk does not perform variable substitution within strings.
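
The two differences can be seen side by side in the following sketch. The variable names shell_result, awk_literal, and awk_concat are chosen here for illustration. The last command shows one common way to get a shell value into awk output: the standard -v option copies it into an awk variable, which is then placed outside the quoted string.

```shell
MyVar="hello"

# The shell substitutes $MyVar inside double quotes.
shell_result="value: $MyVar"
echo "$shell_result"

# awk expands \t inside double quotes, but a variable name written inside an
# awk string stays literal text.
awk_literal=$(awk 'BEGIN { MyVar = "hello"; print "value:\tMyVar" }')
echo "$awk_literal"

# To use a variable in awk output, place it outside the string; -v copies the
# shell value into an awk variable (here also named MyVar).
awk_concat=$(awk -v MyVar="$MyVar" 'BEGIN { print "value:\t" MyVar }')
echo "$awk_concat"
```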

The most important elements of awk commands and programs are of the form $<column_number>. These represent the column or field number in the input stream (of tabular format). Backslash escapes are similar to those used by other Unix tools: \t stands for tab, \n for newline, and so on. The awk man page has a full list of awk's backslash escape sequences.
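
As a short illustration of field references, the sketch below splits one made-up line of input. The variable names line, fields, and count are assumptions for this example; NF is awk's built-in count of fields on the current line.

```shell
# One made-up line of whitespace-separated input.
line="LinuxUser 409 0.5 bash"

# $2 and $4 are the second and fourth fields; "\t" puts a tab between them.
fields=$(echo "$line" | awk '{ print $2 "\t" $4 }')

# NF holds the number of fields on the current line.
count=$(echo "$line" | awk '{ print NF }')

echo "$fields"
echo "$count"
```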

The awk utility's power and ease of use is demonstrated best by previewing its practical use. We will write an awk script that creates a report of a user's current processes (by PID), individual CPU usage, and corresponding commands. We will also calculate the total CPU usage for the user.

First, let us save a snapshot of the output of a ps command that outputs the information we want to see into a file:

    [ LinuxUser ] ~$ ps ux > processInfo

The following are the contents of the processInfo file:

    USER       PID %CPU %MEM   VSZ  RSS   TTY   STAT  START   TIME COMMAND
    LinuxUser  409  0.5  4.8 31096 18820   ?     S    23:35   0:03 /usr/lib/mozilla/
    LinuxUser  412  0.0  4.8 31096 18820   ?     S    23:36   0:00 /usr/lib/mozilla/
    LinuxUser  413  0.0  4.8 31096 18820   ?     S    23:36   0:00 /usr/lib/mozilla/
    LinuxUser  415  0.0  4.8 31096 18820   ?     S    23:52   0:00 /usr/lib/mozilla/
    LinuxUser  425  0.2  0.4 31096 18820  pts/0  S    23:52   0:00 bash
    LinuxUser  426  0.0  2.0 19408 9404   pts/0  S    23:53   0:00 more
    LinuxUser  432  0.0  4.8 31096 18820   ?     S    23:56   0:00 /usr/lib/mozilla/
    LinuxUser  436  0.2  0.3  2108 1160   pts/1  S    23:57   0:00 bash
    LinuxUser  437  0.4  1.4  9520 6484   pts/1  S    23:58   0:00 emacs
    LinuxUser  455  0.2  0.3  2112 1164   pts/2  S    00:05   0:00 bash
    LinuxUser  459  0.0  0.4  3560 1546   pts/2  R    00:06   0:00 ps ux

The columns of interest to us are columns 2 (PID), 3 (%CPU), and 11 (COMMAND). These are referenced in an awk program as $2, $3, and $11. Additionally, we may extract $1 to print a heading with the user's name. The following script performs the actions we need. Copy and paste it into a new file on your system and save it under the name "processReport.awk".

    #!/bin/awk -f
    BEGIN {
            totalCPU = 0;
    }

    {
            if (NR == 1) {
                    # Header line: keep only the three column labels we need.
                    header = $2 "\t" $3 "\t" $11;
            }
            else {
                    # Code executes for all lines but header:
                    if (NR == 2) {
                            print "Processes for User:", $1;
                            print header;
                    }
                    CPU = $3;
                    print $2 "\t" CPU "\t" $11;
                    totalCPU += CPU;
            }
    }

    END {
            print "Total CPU:", totalCPU, "%";
    }

To run the file, first add execute permission by issuing 'chmod ugo+x processReport.awk'.

The following command processes the file; its output is shown below it. Because awk splits each line on whitespace, $11 holds only the first word of the command, so 'ps ux' is reported as 'ps':

    [ LinuxUser ] ~$ ./processReport.awk < processInfo
    Processes for User: LinuxUser
    PID     %CPU    COMMAND
    409     0.5     /usr/lib/mozilla/
    412     0.0     /usr/lib/mozilla/
    413     0.0     /usr/lib/mozilla/
    415     0.0     /usr/lib/mozilla/
    425     0.2     bash
    426     0.0     more
    432     0.0     /usr/lib/mozilla/
    436     0.2     bash
    437     0.4     emacs
    455     0.2     bash
    459     0.0     ps
    Total CPU: 1.5 %

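The initial line is a directive that the script should be executed through the awk program. Note that the -f option is specified; it tells awk to read its program from the script file. If awk is installed elsewhere on your system (for example, in /usr/bin), adjust the path accordingly.

    #!/bin/awk -f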

The BEGIN section of the above code simply initializes the totalCPU variable to zero. This is used to hold the calculate-as-we-go total CPU.

    BEGIN {
            totalCPU = 0;
    }

The main section contains two if statements. The outer if uses the first line of the input, which holds the ps column headers, to build a header containing just the labels of the three columns of interest to us. This branch is executed only when the condition 'NR == 1' is true, that is, for the first line processed (NR is the number of input lines read so far, so it equals the current line number).

    {
            if (NR == 1) {
                    header = $2 "\t" $3 "\t" $11;
            }

The else branch is executed for every line number (NR) other than one. The inner if statement is executed only for line number two. It prints the user's name and the header saved from the first line. We have to do this here because the first line holds only column labels; the user's name first appears in $1 on the second line of the file. Then, we print out the PID, %CPU, and command using $2, $3 (copied into a variable named CPU), and $11. Finally, the totalCPU variable is incremented by the value of CPU for the current line.

            else {
                    # Code executes for all lines but header:
                    if (NR == 2) {
                            print "Processes for User:", $1;
                            print header;
                    }
                    CPU = $3;
                    print $2 "\t" CPU "\t" $11;
                    totalCPU += CPU;
            }
    }

The END section simply prints out the total CPU with a suitable label.

    END {
            print "Total CPU:", totalCPU, "%";
    }
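
For comparison, the same report can be sketched with patterns in place of the if/else statements. This is an alternative formulation, not the processReport.awk script itself; the function name report is made up here, and the awk statement next (which skips to the next input line) is not used elsewhere in this section. The input is the ps header plus two rows taken from the processInfo listing.

```shell
# report is a hypothetical name; the program mirrors processReport.awk but
# selects lines with patterns rather than if/else.
report() {
    awk '
        NR == 1 { header = $2 "\t" $3 "\t" $11; next }  # header line: save labels, skip to next line
        NR == 2 { print "Processes for User:", $1; print header }
                { print $2 "\t" $3 "\t" $11; totalCPU += $3 }
        END     { print "Total CPU:", totalCPU, "%" }
    '
}

# The ps header followed by two sample rows from processInfo.
printf '%s\n' \
    'USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND' \
    'LinuxUser 425 0.2 0.4 31096 18820 pts/0 S 23:52 0:00 bash' \
    'LinuxUser 437 0.4 1.4 9520 6484 pts/1 S 23:58 0:00 emacs' | report
```

Each pattern plays the role of one if condition, and the rule without a pattern runs for every line that reaches it.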

Although the above exercise used a whitespace-delimited file (awk's default, which splits fields on runs of spaces and tabs), we can easily change the delimiter by using the -F option while running awk at the command line (-F';' for semicolon-delimited files, where the quotes stop the shell from treating the semicolon as a command separator; -F: for colon-delimited files such as /etc/passwd), or by setting the internal input field separator variable 'FS'. The following code tells the script to use a comma (,) as the field separator:

    BEGIN {
            FS = ",";
    }
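
Both ways of setting the separator can be tried on a single line. The sketch below uses a made-up entry in the colon-separated layout of /etc/passwd; the variable names entry, with_flag, and with_fs are chosen for illustration only.

```shell
# A made-up line in the colon-separated layout of /etc/passwd.
entry='LinuxUser:x:1000:1000:Linux User:/home/LinuxUser:/bin/bash'

# -F sets the field separator on the command line.
with_flag=$(echo "$entry" | awk -F: '{ print $1, $7 }')

# Assigning FS in a BEGIN block does the same inside the program.
with_fs=$(echo "$entry" | awk 'BEGIN { FS = ":" } { print $1, $7 }')

echo "$with_flag"
echo "$with_fs"
```

Both commands print the login name and the shell, the first and seventh fields of the entry.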

The sed and awk tools are very powerful, but they can do real damage if they perform the wrong modifications, so they should be used with care. It is best to test regular expressions extensively on a small representative sample before using them on a large file.
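
One simple way to follow this advice is to pipe only the first few lines of a file through the program and inspect the result. In the sketch below, big.log is a hypothetical stand-in for a large file and is generated on the spot so that the example is self-contained; the substitution itself is only an illustration.

```shell
# big.log stands in for a large input file; it is created here only so the
# sketch is self-contained.
workdir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "record", i }' > "$workdir/big.log"

# Try the awk program on the first three lines only, and check the result
# before running it on the whole file.
preview=$(head -n 3 "$workdir/big.log" | awk '{ sub(/record/, "row"); print }')
echo "$preview"

rm -r "$workdir"
```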