I don’t write large fancy programs in Awk (I’m not Brian Kernighan.) But I do use it a ton for simple things in shell scripts. It can do serious and fancy things and if your project calls for that, lucky you. I don’t have many notes for Awk because what I normally use it for is pretty simple and for everything else I have a copy of the ORA Sed & Awk book.

I did really like these notes, however.

I also find I understand the interesting capabilities of a thing like Awk by looking at small examples of interesting and useful tricks. Here’s another such resource.

Useful Built-In Variables

  • FS = Field Separator

  • OFS = Output Field Separator

  • RS = Record Separator

  • ORS = Output Record Separator

  • NF = Number of Fields

  • NR = Number of Record - This can effectively be used as a counter of what line number you’re on.

  • FNR = File’s Number of Records - Line number reset on each file.

  • FILENAME = Current filename being processed

  • FIELDWIDTHS = When set with a whitespace separated list of values, reads fields from those positions ignoring FS. Useful for fixed column inputs.

  • IGNORECASE = Non zero treats upper and lower alike.

  • OFMT = Output format for numbers - default "%.6g".

Note that these variables may not need $ to be resolved. In fact, something like $NF where there are 3 fields would be the same as $3. The following prints the line number with the number of fields (good for checking the integrity of a data file).

awk '{print NR,NF}'
awk '{if (NF != 25) print NR,NF}' # Check for exactly 25 fields.

Patterns

This demonstrates how to use "patterns. It will take some output that has the form of "ID_PARAM=value" and only for the parameters of interest, save the value. At the end it will compute what is needed.

mplayer -identify -vo null -ao null -frames 0 mysteryvid.mp4 \
| awk 'BEGIN{FS="="} \
       /ID_VIDEO_FPS/{rate=$2} /ID_LENGTH/{time=$2} \
       END{print rate*time}'

For complete information, see man awk and search for (/) "^ *Patterns".

Line Lengths And Character Counts

Need to know how long a line is? This is often useful for looking for missing records or absurdly huge records. Awk makes this pretty easy.

$ echo "abcdefghijklmnopqrstuvwxyz" | awk '{print length}'
26
$ echo "abcdefghijklmnopqrstuvwxyz" | awk '{print length,$0}'
26 abcdefghijklmnopqrstuvwxyz

What about a count of the occurrences of a specific character?

$ cal | awk '{print gsub("[01]","&"),$0}'
2       June 2017
0 Su Mo Tu We Th Fr Sa
1              1  2  3
2  4  5  6  7  8  9 10
8 11 12 13 14 15 16 17
4 18 19 20 21 22 23 24
1 25 26 27 28 29 30
0

It seems Awk has some cheap functions and some expensive ones. If you’re in a big hurry to count a lot of lines, note the following technique.

$ time gzip -cd d17q1.txt.gz | awk '{print gsub("|",x),$0}' > /dev/null
user    1m18.498s

$ time gzip -cd d17q1.txt.gz | awk '{print gsub("|",x)}' > /dev/null
user    1m17.642s

$ time gzip -cd d17q1.txt.gz | tr -cd '|\n' | awk '{print length}' > /dev/null
user    0m11.199s

I’m surprised by the large discrepancy but I think that dragging out any kind of regular expression handling is going to be way more expensive than dumb counting. It’s worth noting that Ruben’s custom C program did this in 0m12.175s which shows the power of Unix parallelization.

Paragraph Grep

Apparently some variants of grep can operate based on paragraphs (blocks of text separated by blank lines) with a -p option. This seems very handy. So much so that my boss (and this guy and no doubt many others) wrote custom software for this purpose. Here the GNU grep maintainers scoff at the idea of a -p option because… Again, Awk to the rescue. This trick comes in handy for Windows style configuration files which are becoming more common in the Unix world.

$ awk '/xed/' RS= 'ORS=\n\n' /etc/samba/smb.conf
[xedhome]
    comment = Home directory of Chris
    read only = no
    valid users = xed
    path = /home/xed

That was shockingly easy, wasn’t it? I will note that my boss' custom C program was about 50% faster on a big data set.

Sum

Adding a list of numbers is quite handy. This format is good for other things like averages etc.

awk '{X+=$1}END{print X}'

Percentages

A slightly tricky problem involves sending a stream of numbers and returning what percent that number was of the entire sum of all numbers sent. This will necessarily take a full pass before answers can be computed.

Here L is an array of all lines which is built while adding the total sum, S, of each line. After all the input is in, the END clause runs with a for loop iterating over the array.

$ seq 5 10 50 | awk '{L[NR]=$1;S=S+$1}END{for(i in L)print L[i],(L[i]/S)}'
5 0.04
15 0.12
25 0.2
35 0.28
45 0.36

For billions of lines, this could be a problem. In that case you might want to run two complete passes.

This will make the percentages cumulative.

awk '{L[NR]=$1;S=S+$1}END{for(i in L){T+=L[i]/S;print L[i],T}}'

I used this for pie chart making.

Pi

Speaking of pi, here’s how you can get it in awk.

$ awk '{print atan2(0,-1)}' <(echo)
3.14159

Or just use this.

3.141592653589793238462643383279502884197169399375105820974

Or this many radians per circle.

6.283185307179586476925286766559005768394338798750211641949

Pie Charts

Mean and Standard Deviation

Mean.

awk '{X+=$1}END{print X/NR}'

This seems to be correct for the population standard deviation.

awk '{X+=$1;Y+=$1^2}END{print sqrt(Y/NR-(X/NR)^2)}'

Here’s both.

awk '{X+=$1;Y+=$1^2}END{print X/NR, sqrt(Y/NR-(X/NR)^2)}'

Removing Duplicate Words

If you have a line that contains a bunch of words and you want to remove any duplicate mentions of them, this does the trick.

$ cat duptest
nothing duplicated here
another another
stupidly duplicated ok unique stupidly duplicated fine

$ awk '{ while(++i<=NF) printf (!a[$i]++) ? $i FS : ""; i=split("",a); print "" }' ./duptest
nothing duplicated here
another
stupidly duplicated ok unique fine

Note that the split is really just a way to clear the array (see delete command which may also be a way to do this). It also resets i as a bonus. A bit of a dirty trick, but that’s how awk pros roll.

A Column Rearranging Factory

I had a situation where I had hundreds of files from a messy data source that needed to be homogenized. The files had many different categories and in each category, there was a correct form. That form might have fields "A B C D E F". Some of the other files in that category would have "A B D E F" or "A B E D F". A mapping just had to be made, but once it was made, the following awk snippet worked to rearrange everything automatically.

So for the first example, "A B D E F" to "A B C D E F", the fixing rule would be like this where X is the missing field to be inserted (blank of course).

AWKARGS='$1 $2 X $3 $4 $5'

This will make the old fifth column (i.e. F) the new sixth and inserting a new empty field between two and three (i.e. C). By defining a whole list of those strings that I could use, I could then send this string to the following code.

AWKFORM=$(echo ${AWKARGS} | sed -e 's/[^,][^,]*/%s/g' -e 's/,/$/g' -e 's/^/"/' -e 's/$/\\n"/')
awk 'BEGIN{FS="$"}{printf('${AWKFORM}','${AWKARGS}')}' the_file

Yes, these awful files were separated with dollar signs. If the format is already correct, you can just set AWKARGS to $0.

Counting Occurrence Of Character Within Line

If you have a bar separated value file and you need to know how many fields there are, this will count the bars. It basically substitutes out everything that’s not the character of interest (the bar) and prints the remaining length.

awk '{gsub("[^|]",""); print length();}' file.bsv

Splitting A Big File Into Many Small Ones

The example here is an SDF (structure definition format) molecule file. Each file can contain many molecules separated by the hideous separator string of four dollar signs on a line by itself. Here awk pulls apart such a file and writes it out to many sequentially named files.

:-> [crow][/tmp/sdftest]$ ls -l
total 44
-rw-rw-r-- 1 xed xed 41529 Dec 18 15:06 small.sdf
:-> [crow][/tmp/sdftest]$ wc -l small.sdf
948 small.sdf
:-> [crow][/tmp/sdftest]$ awk 'BEGIN{RS="[$]{4}\n"}{F++; print $0 "$$$$" > "partial-" sprintf("%04d",F) ".sdf" }' small.sdf
:-> [crow][/tmp/sdftest]$ ls -l
total 112
-rw-rw-r-- 1 xed xed  4895 Dec 18 15:49 partial-0001.sdf
-rw-rw-r-- 1 xed xed  4295 Dec 18 15:49 partial-0002.sdf
-rw-rw-r-- 1 xed xed  4847 Dec 18 15:49 partial-0003.sdf
-rw-rw-r-- 1 xed xed  4251 Dec 18 15:49 partial-0004.sdf
-rw-rw-r-- 1 xed xed  3971 Dec 18 15:49 partial-0005.sdf
-rw-rw-r-- 1 xed xed  4527 Dec 18 15:49 partial-0006.sdf
-rw-rw-r-- 1 xed xed  5211 Dec 18 15:49 partial-0007.sdf
-rw-rw-r-- 1 xed xed  4343 Dec 18 15:49 partial-0008.sdf
-rw-rw-r-- 1 xed xed  5189 Dec 18 15:49 partial-0009.sdf
-rw-rw-r-- 1 xed xed 41529 Dec 18 15:06 small.sdf
:-> [crow][/tmp/sdftest]$ cat partial-000* | wc -l
948

Note that ORS could be used.

awk 'BEGIN{RS="[$]{4}\n";ORS="$$$$\n"}{F++; print > "partial-" sprintf("%04d",F) ".sdf" }' small.sdf

But in this case the awkward regular expression element, $, makes it not worth it.

It’s A Real Programming Language

To make awk programs use something like this.

$ cat awktest
#!/bin/awk -f
BEGIN { print "Runs this block one time." }
{ print "Runs this once for every line." }

$ seq 3 | ./awktest
Runs this block one time.
Runs this once for every line.
Runs this once for every line.
Runs this once for every line.

I found that the -f was required. Without it, I got

awk: ^ syntax error

Also, only one type of block is needed but the braces are required at a minimum.

This is just an example I wrote as an exercise for performance testing.

#!/usr/bin/awk
# Chris X Edwards - 2015-05-04
# Merges files filled with sorted numeric entries, one number per
# line, into a sorted single stream. Files must each contain at least
# one number. Cf. `sort -n <fileone> <filetwo>`.
# Usage:
#     awk -f ./merge <fileone> <filetwo>
# fileone contains: 1, 3, 4, 80, 95
# filetwo contains: 2, 5, 5, 10
# output: 1,2,3,4,5,5,10,80,95

{
    getline vA <ARGV[1]
    getline vB <ARGV[2]
    while (1) {
        if (vA > vB){
            print vB
            if (! getline vB <ARGV[2]) {
                vB="x"
                break
            }
        }
        else {
            print vA
            if (! getline vA <ARGV[1]) {
                break
            }
        }
    }
    if (vB == "x"){
        print vA
        while (getline vA <ARGV[1]) {
            print vA
            }
        exit
    }
    else {
        print vB
        while (getline vB <ARGV[2]) {
            print vB
            }
        exit
        }
}