AWK (Aho, Weinberger and Kernighan)
Using the AWK utility
The awk utility is similar to sed but more useful when dealing with tabular data, that is, files whose data is arranged in rows and columns. awk commands may be executed directly from the command line, or awk may be used to run scripts. awk is an excellent filter and may be used as a text report-writing facility. Many Linux commands and utilities generate rows and columns of information; awk can process such data and format reports in a single command. It also lets the user perform string and arithmetic manipulation, and associative arrays may be used in an awk program.
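As a small illustration of the associative arrays mentioned above, the following sketch counts how often each word appears in its input. The input lines and the array name `seen` are made up for this example.

```shell
# Hypothetical input: one command name per line.
counts=$(printf 'bash\nemacs\nbash\n' | awk '
    { seen[$1]++ }    # array indexed by the command name itself
END { print "bash", seen["bash"]; print "emacs", seen["emacs"] }
')
printf '%s\n' "$counts"
```

The array is indexed by a string rather than by a number, which is what makes it associative.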
An awk command is pattern based and executed for each line of input. The following is its general structure:
<pattern> { <action> }
The pattern is a condition that determines whether the action is performed. The input stream is tested against the pattern line by line; when the pattern is omitted, the action runs for every line. A block of code in curly braces after the word BEGIN, if present, is executed before the first line is processed, and a block of code in curly braces after the word END is executed after the last line. This awk program uses the 'print' statement to output each line of input as is; however, it prints 'Start of File' before the first line and 'End of File' after the last line:
BEGIN { print "Start of File" }
{ print }
END { print "End of File" }
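To try this from the command line, the program can be passed to awk inside single quotes. The sketch below feeds it three made-up lines and then shows a pattern in use; the sample text and the variable names are assumptions for illustration only.

```shell
# Hypothetical three-line sample input; any text would do.
sample='alpha
beta
gamma'

# The same program as above, passed inline in single quotes.
wrapped=$(printf '%s\n' "$sample" | awk '
BEGIN { print "Start of File" }
      { print }
END   { print "End of File" }
')
printf '%s\n' "$wrapped"

# A pattern restricts the action: only lines starting with "b" are printed.
matched=$(printf '%s\n' "$sample" | awk '/^b/ { print }')
printf '%s\n' "$matched"
```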
Before we move to a practical exercise, let us examine basic awk syntax. There are two main differences between awk programs and shell scripts. awk interprets backslash escape sequences inside double-quoted strings; the shell does not. Conversely, the shell substitutes variables inside double-quoted strings ("$MyVar" is replaced by its value in a shell script); awk does not perform variable substitution within strings.
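The two differences can be seen side by side in the following sketch. The variable names are made up, and passing the shell value to awk with the -v option is one common approach, not the only one.

```shell
MyVar="hello"

# The shell expands $MyVar inside double quotes but leaves \t untouched
# (printf '%s' prints its argument verbatim).
shell_out=$(printf '%s' "$MyVar\tworld")

# awk turns \t into a real tab. The shell value is handed over with -v and
# joined by concatenation, because awk does not expand variables in strings.
awk_out=$(awk -v v="$MyVar" 'BEGIN { print v "\tworld"; print "$v stays literal" }')

printf '%s\n' "$shell_out" "$awk_out"
```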
The most important elements of awk commands and programs are of the form $<column_number>. These refer to the column, or field, with that number in the current line of (tabular) input. Backslash escapes are quite similar to those used elsewhere in Unix: \t stands for tab, \n for newline, and so on. The man pages have an extensive list of awk's backslash escape sequences.
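A minimal sketch of field references on one made-up line of input follows. Besides $<column_number>, it uses the built-in variable NF (the number of fields on the line) and $0 (the whole line), which are standard awk features not introduced above.

```shell
# One hypothetical whitespace-separated record.
line='LinuxUser 425 0.2 bash'

fields=$(printf '%s\n' "$line" | awk '{
    print $2 "\t" $NF   # second field, a tab, then the last field
    print NF            # how many fields the line has
    print $0            # the entire line, unchanged
}')
printf '%s\n' "$fields"
```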
The awk utility's power and ease of use are best demonstrated through a practical example. We will write an awk script that creates a report of a user's current processes (by PID), individual CPU usage, and corresponding commands. We will also calculate the total CPU usage for the user.
First, let us save a snapshot of the output of a ps command that outputs the information we want to see into a file:
[ LinuxUser ] ~$ ps ux > processInfo
The following are the contents of the processInfo file:
USER       PID %CPU %MEM   VSZ   RSS TTY    STAT START  TIME COMMAND
LinuxUser  409  0.5  4.8 31096 18820 ?      S    23:35  0:03 /usr/lib/mozilla/
LinuxUser  412  0.0  4.8 31096 18820 ?      S    23:36  0:00 /usr/lib/mozilla/
LinuxUser  413  0.0  4.8 31096 18820 ?      S    23:36  0:00 /usr/lib/mozilla/
LinuxUser  415  0.0  4.8 31096 18820 ?      S    23:52  0:00 /usr/lib/mozilla/
LinuxUser  425  0.2  0.4 31096 18820 pts/0  S    23:52  0:00 bash
LinuxUser  426  0.0  2.0 19408  9404 pts/0  S    23:53  0:00 more
LinuxUser  432  0.0  4.8 31096 18820 ?      S    23:56  0:00 /usr/lib/mozilla/
LinuxUser  436  0.2  0.3  2108  1160 pts/1  S    23:57  0:00 bash
LinuxUser  437  0.4  1.4  9520  6484 pts/1  S    23:58  0:00 emacs
LinuxUser  455  0.2  0.3  2112  1164 pts/2  S    00:05  0:00 bash
LinuxUser  459  0.0  0.4  3560  1546 pts/2  R    00:06  0:00 ps ux
The columns of interest to us are columns 2 (PID), 3 (%CPU), and 11 (COMMAND). These are referenced in an awk program as $2, $3, and $11. Because awk splits fields on whitespace, $11 holds only the first word of a command. Additionally, we may extract $1 to print a heading with the user's name. The following script performs the actions we need. Copy and paste it into a new file on your system and save it under the name "processReport.awk".
#!/bin/awk -f
BEGIN {
    totalCPU = 0;
}
{
    if (NR == 1) {
        # Header line: keep only the labels of the three columns we report.
        header = $2 "\t" $3 "\t" $11;
    }
    else
    {
        # Code executes for all lines but header:
        if (NR == 2) {
            print "Processes for User: ", $1;
            print header;
        }
        CPU = $3;
        print $2,"\t",CPU,"\t",$11;
        totalCPU += CPU;
    }
}
END {
    print "Total CPU:", totalCPU,"%";
}
To run the file, first add execute permissions by issuing 'chmod ugo+x processReport.awk'.
The following command should be run to process the file. The output is also shown:
[ LinuxUser ] ~$ ./processReport.awk < processInfo
Processes for User: LinuxUser
PID %CPU COMMAND
409 0.5 /usr/lib/mozilla/
412 0.0 /usr/lib/mozilla/
413 0.0 /usr/lib/mozilla/
415 0.0 /usr/lib/mozilla/
425 0.2 bash
426 0.0 more
432 0.0 /usr/lib/mozilla/
436 0.2 bash
437 0.4 emacs
455 0.2 bash
459 0.0 ps
Total CPU: 1.5 %
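Because awk splits input on whitespace, a command that contains spaces occupies more than one field, and $11 alone holds only its first word. The following sketch shows one possible way to rejoin fields 11 through NF (the number of fields on the line); the sample row and the variable names are assumptions for illustration, not part of the script above.

```shell
# Hypothetical single row in the `ps ux` layout, with a two-word command.
row='LinuxUser 459 0.0 0.4 3560 1546 pts/2 R 00:06 0:00 ps ux'

full_cmd=$(printf '%s\n' "$row" | awk '{
    cmd = $11
    for (i = 12; i <= NF; i++)   # append any remaining words of the command
        cmd = cmd " " $i
    print $2 "\t" $3 "\t" cmd
}')
printf '%s\n' "$full_cmd"
```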
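The first line is a directive telling the system to run the script through the awk program. Note the -f option, which tells awk to read its program from a file, here the script itself. If awk is installed elsewhere on your system (often /usr/bin/awk), adjust the path accordingly.
#!/bin/awk -f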
The BEGIN section of the above code simply initializes the totalCPU variable to zero. This variable holds the running total of CPU usage as each line is processed.
BEGIN {
    totalCPU = 0;
}
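Explicit initialization makes the intent clear, but awk also treats a variable that has never been assigned as zero when it is used as a number. A minimal sketch with made-up %CPU values:

```shell
# Hypothetical %CPU values, one per line; `total` is never initialized.
sum=$(printf '0.5\n0.2\n' | awk '{ total += $1 } END { print total }')

# With no input lines at all, the untouched variable still yields 0
# once it is used in arithmetic.
empty=$(awk 'END { print total + 0 }' < /dev/null)

printf '%s\n' "$sum" "$empty"
```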
The main section contains two if statements. The first uses the first line of the input file, which holds the ps column headings, to build a header containing only the labels of the three columns of interest to us. This branch is executed only if the condition 'NR == 1' is true, that is, for the first line processed (NR holds the number of the current input line).
{
    if (NR == 1) {
        header = $2 "\t" $3 "\t" $11;
    }
The else branch is executed for every line number (NR) other than one. The inner if statement is executed only for line number two. It prints the user's name and the header saved from the first line. We have to do this here because the name of the user first becomes available to awk on the second line of the file. Then, we print the PID, %CPU, and command using $2, $3 (copied into a variable named CPU), and $11. Finally, the value of CPU for the current line is added to the totalCPU variable.
    else
    {
        # Code executes for all lines but header:
        if (NR == 2) {
            print "Processes for User: ", $1;
            print header;
        }
        CPU = $3;
        print $2,"\t",CPU,"\t",$11;
        totalCPU += CPU;
    }
}
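The same split between header and body can also be expressed with patterns instead of if statements. This is a sketch of an alternative style on made-up input, not a replacement for the script above; the `next` statement skips the remaining rules for the current line.

```shell
body=$(printf 'HEADER\nrow1\nrow2\n' | awk '
NR == 1 { next }              # skip the header line
        { print NR ": " $0 }  # runs for every remaining line
')
printf '%s\n' "$body"
```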
The END section simply prints out the total CPU with a suitable label.
END {
    print "Total CPU:", totalCPU,"%";
}
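If tighter control over the label is wanted, awk's printf statement can be used in the END block instead of print. The sketch below sums the non-zero %CPU values from the sample data and formats the total to one decimal place; the variable name `t` is made up.

```shell
# The five non-zero %CPU values from the sample, one per line.
total_line=$(printf '0.5\n0.2\n0.2\n0.4\n0.2\n' |
    awk '{ t += $1 } END { printf "Total CPU: %.1f%%\n", t }')
printf '%s\n' "$total_line"
```

In a printf format string, %% produces a literal percent sign.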
Although the above exercise used a whitespace-delimited file, we can easily change the delimiter by using the -F option when running awk at the command line (-F';' for semicolon-delimited files, -F: for colon-delimited files such as /etc/passwd, and so on), or by setting the built-in input field separator variable 'FS' as follows. This code tells the script to use a comma (,) as the field separator:
BEGIN {
    FS=",";
}
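Both ways of setting the separator are sketched below on made-up input. The sample line only imitates the layout of /etc/passwd; its contents are hypothetical.

```shell
# Hypothetical line in /etc/passwd layout (seven colon-separated fields).
passwd_line='LinuxUser:x:1000:1000:Linux User:/home/LinuxUser:/bin/bash'

# -F on the command line: print the login name and the shell.
by_flag=$(printf '%s\n' "$passwd_line" | awk -F: '{ print $1, $7 }')

# A semicolon must be quoted so the shell does not treat it as a command separator.
by_semicolon=$(printf 'a;b;c\n' | awk -F';' '{ print $3 }')

# FS set inside the program, as in the BEGIN block above.
by_fs=$(printf 'a,b,c\n' | awk 'BEGIN { FS = "," } { print $2 }')

printf '%s\n' "$by_flag" "$by_semicolon" "$by_fs"
```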
The sed and awk tools are very powerful, but they can do real damage if they perform the wrong modifications, so they should be used with care. It is best to test your regular expressions and commands extensively on a small representative sample before using them on a large file.
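One simple way to follow this advice is to pipe only the first few lines of a file through the program and inspect the result by eye. In this sketch the large file is a generated stand-in, and the file contents and the filter are made up for illustration.

```shell
# Hypothetical stand-in for a large file: 1000 generated rows.
big=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "user" i, i % 7 }' > "$big"

# Try the program on the first five rows only before running it on everything.
preview=$(head -n 5 "$big" | awk '$2 > 3 { print $1 }')
printf '%s\n' "$preview"

rm -f "$big"
```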