I don’t write large fancy programs in Awk (I’m not Brian Kernighan.) But I do use it a ton for simple things in shell scripts. It can do serious and fancy things and if your project calls for that, lucky you. I don’t have many notes for Awk because what I normally use it for is pretty simple and for everything else I have a copy of the ORA Sed & Awk book.

I did really like these notes, however.

I also find I understand the interesting capabilities of a thing like Awk by looking at small examples of interesting and useful tricks. Here’s another such resource.

Useful Built-In Variables

  • FS = Field Separator

  • OFS = Output Field Separator

  • RS = Record Separator

  • ORS = Output Record Separator

  • NF = Number of Fields

  • NR = Number of Record - This can effectively be used as a counter of what line number you’re on.

  • FNR = File’s Number of Records - Line number reset on each file.

  • FILENAME = Current filename being processed

  • FIELDWIDTHS = When set with a whitespace separated list of values, reads fields from those positions ignoring FS. Useful for fixed column inputs.

  • IGNORECASE = Non zero treats upper and lower alike.

  • OFMT = Output format for numbers - default "%.6g".

Note that these variables may not need $ to be resolved. In fact, something like $NF where there are 3 fields would be the same as $3. The following prints the line number with the number of fields (good for checking the integrity of a data file).

awk '{print NR,NF}'
awk '{if (NF != 25) print NR,NF}' # Check for exactly 25 fields.

Patterns

This demonstrates how to use "patterns. It will take some output that has the form of "ID_PARAM=value" and only for the parameters of interest, save the value. At the end it will compute what is needed.

mplayer -identify -vo null -ao null -frames 0 mysteryvid.mp4 \
| awk 'BEGIN{FS="="} \
       /ID_VIDEO_FPS/{rate=$2} /ID_LENGTH/{time=$2} \
       END{print rate*time}'

For complete information, see man awk and search for (/) "^ *Patterns".

Sum

Adding a list of numbers is quite handy. This format is good for other things like averages etc.

awk '{X+=$1}END{print X}'

Percentages

A slightly tricky problem involves sending a stream of numbers and returning what percent that number was of the entire sum of all numbers sent. This will necessarily take a full pass before answers can be computed.

Here L is an array of all lines which is built while adding the total sum, S, of each line. After all the input is in, the END clause runs with a for loop iterating over the array.

$ seq 5 10 50 | awk '{L[NR]=$1;S=S+$1}END{for(i in L)print L[i],(L[i]/S)}'
5 0.04
15 0.12
25 0.2
35 0.28
45 0.36

For billions of lines, this could be a problem. In that case you might want to run two complete passes.

This will make the percentages cumulative.

awk '{L[NR]=$1;S=S+$1}END{for(i in L){T+=L[i]/S;print L[i],T}}'

I used this for pie chart making.

Pi

Speaking of pi, here’s how you can get it in awk.

$ awk '{print atan2(0,-1)}' <(echo)
3.14159

Or just use this.

3.141592653589793238462643383279502884197169399375105820974

Or this many radians per circle.

6.283185307179586476925286766559005768394338798750211641949

Pie Charts

Mean and Standard Deviation

Mean.

awk '{X+=$1}END{print X/NR}'

This may be correct for standard deviation but should be checked.

awk '{X+=$1;Y+=$1^2}END{print sqrt(Y/NR-(X/NR)^2)}'

Removing Duplicate Words

If you have a line that contains a bunch of words and you want to remove any duplicate mentions of them, this does the trick.

$ cat duptest
nothing duplicated here
another another
stupidly duplicated ok unique stupidly duplicated fine

$ awk '{ while(++i<=NF) printf (!a[$i]++) ? $i FS : ""; i=split("",a); print "" }' ./duptest
nothing duplicated here
another
stupidly duplicated ok unique fine

Note that the split is really just a way to clear the array (see delete command which may also be a way to do this). It also resets i as a bonus. A bit of a dirty trick, but that’s how awk pros roll.

A Column Rearranging Factory

I had a situation where I had hundreds of files from a messy data source that needed to be homogenized. The files had many different categories and in each category, there was a correct form. That form might have fields "A B C D E F". Some of the other files in that category would have "A B D E F" or "A B E D F". A mapping just had to be made, but once it was made, the following awk snippet worked to rearrange everything automatically.

So for the first example, "A B D E F" to "A B C D E F", the fixing rule would be like this where X is the missing field to be inserted (blank of course).

AWKARGS='$1 $2 X $3 $4 $5'

This will make the old fifth column (i.e. F) the new sixth and inserting a new empty field between two and three (i.e. C). By defining a whole list of those strings that I could use, I could then send this string to the following code.

AWKFORM=$(echo ${AWKARGS} | sed -e 's/[^,][^,]*/%s/g' -e 's/,/$/g' -e 's/^/"/' -e 's/$/\\n"/')
awk 'BEGIN{FS="$"}{printf('${AWKFORM}','${AWKARGS}')}' the_file

Yes, these awful files were separated with dollar signs. If the format is already correct, you can just set AWKARGS to $0.

Counting Occurrence Of Character Within Line

If you have a bar separated value file and you need to know how many fields there are, this will count the bars. It basically substitutes out everything that’s not the character of interest (the bar) and prints the remaining length.

awk '{gsub("[^|]",""); print length();}' file.bsv

Splitting A Big File Into Many Small Ones

The example here is an SDF (structure definition format) molecule file. Each file can contain many molecules separated by the hideous separator string of four dollar signs on a line by itself. Here awk pulls apart such a file and writes it out to many sequentially named files.

:-> [crow][/tmp/sdftest]$ ls -l
total 44
-rw-rw-r-- 1 xed xed 41529 Dec 18 15:06 small.sdf
:-> [crow][/tmp/sdftest]$ wc -l small.sdf
948 small.sdf
:-> [crow][/tmp/sdftest]$ awk 'BEGIN{RS="[$]{4}\n"}{F++; print $0 "$$$$" > "partial-" sprintf("%04d",F) ".sdf" }' small.sdf
:-> [crow][/tmp/sdftest]$ ls -l
total 112
-rw-rw-r-- 1 xed xed  4895 Dec 18 15:49 partial-0001.sdf
-rw-rw-r-- 1 xed xed  4295 Dec 18 15:49 partial-0002.sdf
-rw-rw-r-- 1 xed xed  4847 Dec 18 15:49 partial-0003.sdf
-rw-rw-r-- 1 xed xed  4251 Dec 18 15:49 partial-0004.sdf
-rw-rw-r-- 1 xed xed  3971 Dec 18 15:49 partial-0005.sdf
-rw-rw-r-- 1 xed xed  4527 Dec 18 15:49 partial-0006.sdf
-rw-rw-r-- 1 xed xed  5211 Dec 18 15:49 partial-0007.sdf
-rw-rw-r-- 1 xed xed  4343 Dec 18 15:49 partial-0008.sdf
-rw-rw-r-- 1 xed xed  5189 Dec 18 15:49 partial-0009.sdf
-rw-rw-r-- 1 xed xed 41529 Dec 18 15:06 small.sdf
:-> [crow][/tmp/sdftest]$ cat partial-000* | wc -l
948

Note that ORS could be used.

awk 'BEGIN{RS="[$]{4}\n";ORS="$$$$\n"}{F++; print > "partial-" sprintf("%04d",F) ".sdf" }' small.sdf

But in this case the awkward regular expression element, $, makes it not worth it.

It’s A Real Programming Language

To make awk programs use something like this.

$ cat awktest
#!/bin/awk -f
BEGIN { print "Runs this block one time." }
{ print "Runs this once for every line." }

$ seq 3 | ./awktest
Runs this block one time.
Runs this once for every line.
Runs this once for every line.
Runs this once for every line.

I found that the -f was required. Without it, I got

awk: ^ syntax error

Also, only one type of block is needed but the braces are required at a minimum.

This is just an example I wrote as an exercise for performance testing.

#!/usr/bin/awk
# Chris X Edwards - 2015-05-04
# Merges files filled with sorted numeric entries, one number per
# line, into a sorted single stream. Files must each contain at least
# one number. Cf. `sort -n <fileone> <filetwo>`.
# Usage:
#     awk -f ./merge <fileone> <filetwo>
# fileone contains: 1, 3, 4, 80, 95
# filetwo contains: 2, 5, 5, 10
# output: 1,2,3,4,5,5,10,80,95

{
    getline vA <ARGV[1]
    getline vB <ARGV[2]
    while (1) {
        if (vA > vB){
            print vB
            if (! getline vB <ARGV[2]) {
                vB="x"
                break
            }
        }
        else {
            print vA
            if (! getline vA <ARGV[1]) {
                break
            }
        }
    }
    if (vB == "x"){
        print vA
        while (getline vA <ARGV[1]) {
            print vA
            }
        exit
    }
    else {
        print vB
        while (getline vB <ARGV[2]) {
            print vB
            }
        exit
        }
}