What is Awk? You could do worse than listening to Brian Kernighan (the "K" in Awk) offer his own excellent explanation.
I don’t write large fancy programs in Awk (I’m not Brian Kernighan), but I do use it a ton for simple things in shell scripts. It can do serious and fancy things, and if your project calls for that, lucky you. I don’t have many notes for Awk because what I normally use it for is pretty simple, and for everything else I have a copy of the O’Reilly Sed & Awk book.
I did really like these notes, however.
I also find I understand the interesting capabilities of a thing like Awk by looking at small examples of interesting and useful tricks. Here’s another such resource.
Useful Built-In Variables
- FS = Field Separator (can be set with -F from the command line).
- OFS = Output Field Separator.
- RS = Record Separator.
- ORS = Output Record Separator.
- NF = Number of Fields.
- NR = Number of Records. This can effectively be used as a counter of what line number you’re on.
- FNR = File’s Number of Records. Line number, reset on each file.
- FILENAME = Current filename being processed.
- FIELDWIDTHS = When set with a whitespace separated list of values, reads fields from those positions, ignoring FS. Useful for fixed column inputs. (A gawk extension.)
- IGNORECASE = Non-zero treats upper and lower case alike. (A gawk extension.)
- OFMT = Output format for numbers. Default "%.6g".
Note that these variables do not need a $ to be resolved. In fact, something like $NF where there are 3 fields would be the same as $3.
The following prints the line number with the number of fields (good for
checking the integrity of a data file).
awk '{print NR,NF}'
awk '{if (NF != 25) print NR,NF}' # Report lines that do not have exactly 25 fields.
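A couple of the other variables in action; a quick sketch, with hypothetical input filenames.
awk '{print FILENAME, FNR, NR}' first.txt second.txt # Per-file and overall line numbers.
awk 'BEGIN{FS=":";OFS="\t"}{$1=$1;print}' /etc/passwd # Re-delimit colons as tabs; $1=$1 forces rebuilding $0 with OFS.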
Patterns
This demonstrates how to use "patterns". It will take some output that has the form of "ID_PARAM=value" and, only for the parameters of interest, save the value. At the end it will compute what is needed.
mplayer -identify -vo null -ao null -frames 0 mysteryvid.mp4 \
| awk 'BEGIN{FS="="} \
/ID_VIDEO_FPS/{rate=$2} /ID_LENGTH/{time=$2} \
END{print rate*time}'
For complete information, see man awk and search (/) for "^ *Patterns".
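Patterns don’t have to be regular expressions; ordinary boolean expressions and ranges also work. A couple of quick sketches, where data.txt is hypothetical.
awk '$3 > 100' data.txt # Expression pattern: lines whose third field exceeds 100.
awk '/START/,/STOP/' data.txt # Range pattern: from a START line through the next STOP line.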
Line Lengths And Character Counts
Need to know how long a line is? This is often useful for looking for missing records or absurdly huge records. Awk makes this pretty easy.
$ echo "abcdefghijklmnopqrstuvwxyz" | awk '{print length}'
26
$ echo "abcdefghijklmnopqrstuvwxyz" | awk '{print length,$0}'
26 abcdefghijklmnopqrstuvwxyz
What about a count of the occurrences of a specific character? The trick is that gsub returns the number of substitutions it made, so replacing each match with itself ("&") counts matches without changing the line.
$ cal | awk '{print gsub("[01]","&"),$0}'
2 June 2017
0 Su Mo Tu We Th Fr Sa
1 1 2 3
2 4 5 6 7 8 9 10
8 11 12 13 14 15 16 17
4 18 19 20 21 22 23 24
1 25 26 27 28 29 30
0
It seems Awk has some cheap functions and some expensive ones. If you’re in a big hurry to count a lot of lines, note the following technique.
$ time gzip -cd d17q1.txt.gz | awk '{print gsub("|",x),$0}' > /dev/null
user 1m18.498s
$ time gzip -cd d17q1.txt.gz | awk '{print gsub("|",x)}' > /dev/null
user 1m17.642s
$ time gzip -cd d17q1.txt.gz | tr -cd '|\n' | awk '{print length}' > /dev/null
user 0m11.199s
I’m surprised by the large discrepancy, but I think that dragging out any kind of regular expression handling is going to be way more expensive than dumb counting. It’s worth noting that Ruben’s custom C program did this in 0m12.175s; the tr pipeline matching it shows the power of Unix parallelization, since each stage of the pipeline runs as its own concurrent process.
Here’s another way.
awk '{print split($0,X,"$")}' demo17q2.txt | sort -n | uniq -c
This shows how to get a tally of every field count that occurs. If you have a (bad) set of data field-separated by dollar signs and you want to make sure that all lines have the same number of fields, this should output only one line, which will also show the total line count.
Paragraph Grep
Apparently some variants of grep can operate based on paragraphs (blocks of text separated by blank lines) with a -p option. This seems very handy. So much so that my boss (and this guy and no doubt many others) wrote custom software for this purpose. Here the GNU grep maintainers scoff at the idea of a -p option because… Again, Awk to the rescue.
This trick comes in handy for Windows style configuration files which are becoming more common in the Unix world.
$ awk '/xed/' RS= 'ORS=\n\n' /etc/samba/smb.conf
[xedhome]
comment = Home directory of Chris
read only = no
valid users = xed
path = /home/xed
That was shockingly easy, wasn’t it? I will note that my boss' custom C program was about 50% faster on a big data set.
Sum
Adding a list of numbers is quite handy, and the same pattern adapts easily to other things like averages.
awk '{X+=$1}END{print X}'
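A common variant is summing one column grouped by another. Here is a sketch assuming keys in field one and values in field two; note that the output order is unspecified.
awk '{S[$1]+=$2}END{for(k in S)print k,S[k]}'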
Percentages
A slightly tricky problem involves sending a stream of numbers and returning what percent that number was of the entire sum of all numbers sent. This will necessarily take a full pass before answers can be computed.
Here L is an array of all lines which is built while adding the total sum, S, of each line. After all the input is in, the END clause runs with a for loop iterating over the array.
$ seq 5 10 50 | awk '{L[NR]=$1;S=S+$1}END{for(i in L)print L[i],(L[i]/S)}'
5 0.04
15 0.12
25 0.2
35 0.28
45 0.36
For billions of lines, holding every value in memory could be a problem. In that case you might want to run two complete passes.
This will make the percentages cumulative.
awk '{L[NR]=$1;S=S+$1}END{for(i in L){T+=L[i]/S;print L[i],T}}'
I used this for pie chart making.
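One caveat: for (i in L) iterates in an unspecified order. That is harmless for the plain percentages but wrong for the cumulative version, so to be safe, index the array explicitly.
awk '{L[NR]=$1;S+=$1}END{for(i=1;i<=NR;i++){T+=L[i]/S;print L[i],T}}'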
Shell Math
In ancient times the old Bourne shell sh had pretty much no math abilities at all. These days, Bash has just enough to taunt you. But I’ve set up fancy things where I need a simple math problem done and Bash seems incapable. Things like this will fail.
ADJUSTMENT=3.3
actuate $(( ${VALUE} + ${ADJUSTMENT} ))
How then can you do floating point math? (Bash arithmetic is integer only.) Most classical sources will steer you to bc, but this is problematic if, as on many modern systems, bc is not present. Sure, you can download it, but a way that prevents any problems with your end user/final installation not being prepared is to use awk.
function shmath {
    EXP=$1
    awk "END{print ${EXP}}" /dev/null
}
Then you can do this.
actuate $(shmath "${VALUE} + ${ADJUSTMENT}")
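A variant sketch: since a BEGIN block runs before any input is read, the /dev/null isn’t even needed.
function shmath { awk "BEGIN{print $1}" ; }
shmath "2.5 * 4 + 0.1" # 10.1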
Awk saves the day!
Pi
Speaking of math, here’s how you can get pi in awk.
$ awk '{print atan2(0,-1)}' <(echo)
3.14159
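A BEGIN block needs no input at all, and printf can ask for more digits.
$ awk 'BEGIN{printf "%.15f\n",atan2(0,-1)}'
3.141592653589793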
Or just use this.
3.141592653589793238462643383279502884197169399375105820974
Or this many radians per circle.
6.283185307179586476925286766559005768394338798750211641949
Pie Charts
Here’s an example showing the proportions of vowels in the awk man page.
man awk | sed 's/\(.\)/\1 /g' | tr 'A-Z ' 'a-z\n' | grep '[aeiou]' \
| sort | uniq -c | ./pie > awkman.svg
For the short Awk pie program, see my complete blog post about using Awk to make pie charts. Yes! It works!
ASCII Bar Charts
What if you have some simple data in the unix world, perhaps at the end of a pipeline, and you want a rough idea of how it looks visually? One of the most universally available and direct methods is to have awk print a line whose length is proportional to some value contained in each input line. Here is an example from a project where I converted a pretty complex object-oriented Python program into something that could be rewritten easily in C (or Awk!). I suspected that the length was getting shorter and wanted to see that.
$ ls -lrt code-version*.py | awk '{for(c=0;c<($5/200);c++)printf "_";printf "\n"}'
___________________________________________________________________________________________________________________________________
___________________________________________________________________________________________________________________________________
_____________________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________________
____________________________________________________________________________________________________________
____________________________________________________________________________________________________________
____________________________________________________________________________________________________________
__________________________________________________________________________________________________________
__________________________________________________________________________________________________________
__________________________________________________________________________________________________________
__________________________________________________________________________________________________________
_________________________________________________________________________________________________________
_______________________________________________________________________________________________________
__________________________________________________________________________________________________________
_________________________________________________________________________________________________________________
__________________________________________________________________________________________________________
_______________________________________________________________________________________________________________
__________________________________________________________________________________________________________________
__________________________________________________________________________________________________________________
______________________________________________________________________________________________________
_________________________________________________________________________________________________________
_____________________________________________________________________________________________________
_________________________________________________________________________________________________________
_________________________________________________________________________________
______________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
_________________________________________________________________________
____________________________________________________________
__________________________________________________________
_________________________________________________________
__________________________________________________________
____________________________________________________________________
_______________________________________________________________________
___________________________________________________________________________
_____________________________________________________________________
________________________________________________________
_____________________________________________________
Mean and Standard Deviation
Mean.
awk '{X+=$1}END{print X/NR}'
This seems to be correct for the population standard deviation.
awk '{X+=$1;Y+=$1^2}END{print sqrt(Y/NR-(X/NR)^2)}'
Here’s both.
awk '{X+=$1;Y+=$1^2}END{print X/NR, sqrt(Y/NR-(X/NR)^2)}'
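If you want the sample (N-1) standard deviation instead, a minimal variant.
awk '{X+=$1;Y+=$1^2}END{print sqrt((Y-X^2/NR)/(NR-1))}'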
Absolute Value
It’s a bit strange that awk doesn’t have an abs function, but since it has that sqrt function and performance to burn, it’s not a problem.
awk '{print sqrt($1*$1)}'
Seems a bit hackish, but works fine. This even converts -0 to 0 for whatever that’s worth.
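If the square root trick feels too cute, a conditional expression is the more literal route (though unlike the sqrt version, this one will pass a literal -0 through unchanged).
awk '{print ($1<0)?-$1:$1}'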
Removing Duplicate Words
If you have a line that contains a bunch of words and you want to remove any duplicate mentions of them, this does the trick.
$ cat duptest
nothing duplicated here
another another
stupidly duplicated ok unique stupidly duplicated fine
$ awk '{ while(++i<=NF) printf (!a[$i]++) ? $i FS : ""; i=split("",a); print "" }' ./duptest
nothing duplicated here
another
stupidly duplicated ok unique fine
Note that the split is really just a way to clear the array (see the delete command which may also be a way to do this). It also resets i as a bonus. A bit of a dirty trick, but that’s how awk pros roll.
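The delete version mentioned above would look something like this (deleting a whole array without a subscript is supported by gawk and other modern awks).
awk '{ while(++i<=NF) printf (!a[$i]++) ? $i FS : ""; i=0; delete a; print "" }' ./duptest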
Make Specific Column Unique
The uniq command likes to work on entire lines. If you just need to know all the different values of the third field, Awk is a little sharper than sort.
awk '!X[$3]++' logfile
The details are explained here. In short, !X[$3]++ is true only the first time a given value of field three is seen, so each value’s first line gets printed.
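For comparison, if you only need the distinct values themselves (and don’t mind losing the original order and the rest of each line), the sort way is this.
awk '{print $3}' logfile | sort -u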
A Column Rearranging Factory
I had a situation where I had hundreds of files from a messy data source that needed to be homogenized. The files had many different categories and in each category, there was a correct form. That form might have fields "A B C D E F". Some of the other files in that category would have "A B D E F" or "A B E D F". A mapping just had to be made, but once it was made, the following awk snippet worked to rearrange everything automatically.
So for the first example, "A B D E F" to "A B C D E F", the fixing rule would be like this where X is the missing field to be inserted (blank of course).
AWKARGS='$1,$2,X,$3,$4,$5'
This makes the old fifth column (i.e. F) the new sixth and inserts a new empty field between two and three (i.e. C). By defining a whole list of such strings, one per category, I could then send each string to the following code.
AWKFORM=$(echo ${AWKARGS} | sed -e 's/[^,][^,]*/%s/g' -e 's/,/$/g' -e 's/^/"/' -e 's/$/\\n"/')
awk 'BEGIN{FS="$"}{printf('${AWKFORM}','${AWKARGS}')}' the_file
Yes, these awful files were separated with dollar signs. If the format is already correct, you can just set AWKARGS to $0.
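To see what the machinery produces for that example, here is a sketch with a hypothetical one-line input.
AWKARGS='$1,$2,X,$3,$4,$5'
AWKFORM=$(echo ${AWKARGS} | sed -e 's/[^,][^,]*/%s/g' -e 's/,/$/g' -e 's/^/"/' -e 's/$/\\n"/')
echo ${AWKFORM} # "%s$%s$%s$%s$%s$%s\n"
echo 'A$B$D$E$F' | awk 'BEGIN{FS="$"}{printf('${AWKFORM}','${AWKARGS}')}'
A$B$$D$E$F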
Counting Occurrence Of Character Within Line
If you have a bar separated value file and you need to know how many fields there are, this will count the bars (the field count is the bar count plus one). It basically substitutes out everything that’s not the character of interest (the bar) and prints the remaining length.
awk '{gsub("[^|]",""); print length();}' file.bsv
Splitting A Big File Into Many Small Ones
For simple jobs where you want to split up a big file and all the lines are similar and it doesn’t matter which files they end up in, use the unix split command. But if there is some multi-line content that you need to avoid splitting across files, Awk can solve the problem!
The example here is an SDF (structure definition format) molecule file. Each file can contain many molecules separated by the hideous separator string of four dollar signs on a line by itself. Here awk pulls apart such a file and writes it out to many sequentially named files.
:-> [crow][/tmp/sdftest]$ ls -l
total 44
-rw-rw-r-- 1 xed xed 41529 Dec 18 15:06 small.sdf
:-> [crow][/tmp/sdftest]$ wc -l small.sdf
948 small.sdf
:-> [crow][/tmp/sdftest]$ awk 'BEGIN{RS="[$]{4}\n"}{F++; print $0 "$$$$" > "partial-" sprintf("%04d",F) ".sdf" }' small.sdf
:-> [crow][/tmp/sdftest]$ ls -l
total 112
-rw-rw-r-- 1 xed xed  4895 Dec 18 15:49 partial-0001.sdf
-rw-rw-r-- 1 xed xed  4295 Dec 18 15:49 partial-0002.sdf
-rw-rw-r-- 1 xed xed  4847 Dec 18 15:49 partial-0003.sdf
-rw-rw-r-- 1 xed xed  4251 Dec 18 15:49 partial-0004.sdf
-rw-rw-r-- 1 xed xed  3971 Dec 18 15:49 partial-0005.sdf
-rw-rw-r-- 1 xed xed  4527 Dec 18 15:49 partial-0006.sdf
-rw-rw-r-- 1 xed xed  5211 Dec 18 15:49 partial-0007.sdf
-rw-rw-r-- 1 xed xed  4343 Dec 18 15:49 partial-0008.sdf
-rw-rw-r-- 1 xed xed  5189 Dec 18 15:49 partial-0009.sdf
-rw-rw-r-- 1 xed xed 41529 Dec 18 15:06 small.sdf
:-> [crow][/tmp/sdftest]$ cat partial-000* | wc -l
948
Note that ORS could be used.
awk 'BEGIN{RS="[$]{4}\n";ORS="$$$$\n"}{F++; print > "partial-" sprintf("%04d",F) ".sdf" }' small.sdf
But in this case the awkward regular expression element, $, makes it not worth it.
Slow Reader Using System
Awk has an interesting system function which can run shell commands. This can get into all kinds of mischief. One thing I did with it was to slow down some captured input to mimic the flow of its original source.
awk '{system("sleep .025");print $0}' capturefile | my_device_handler
Analyze Storage Volume Usage And Speeds
This script is a monitoring tool to continuously track and report the disk space usage and write speed of mounted devices, providing insights into disk activity and predicting when the disk might become full.
awk -v delta=$delta \
    'BEGIN { printf "%-10s %-10s %10s %10s %15s\n","device","mounted on","free (gb)","mb/sec","hours to fill" }
     NR==FNR {t[$7]=$1; u[$7]=$4}
     NR>FNR  {r=1e-3*($4-u[$7])/($1-t[$7]);
              if(r>0.001) printf "%-10s %-10s %10.3f %10.3f %10.2f\n",$2,$7,1e-6*$5,r,1e-3*$5/r/3600.0;
              fflush(stdout)}' \
    <(df | fgrep /dev/sd | prefix `date +"%s"`) \
    <(while true; do sleep $delta; df | fgrep /dev/sd | prefix `date +"%s"`; done)
It’s A Real Programming Language
To make standalone awk programs, use something like this.
$ cat awktest
#!/bin/awk -f
BEGIN { print "Runs this block one time." }
{ print "Runs this once for every line." }
$ seq 3 | ./awktest
Runs this block one time.
Runs this once for every line.
Runs this once for every line.
Runs this once for every line.
I found that the -f was required. Without it, I got
awk: ^ syntax error
Also, only one type of block is needed, but the braces are required at a minimum.
This is just an example I wrote as an exercise for performance testing.
#!/usr/bin/awk
# Chris X Edwards - 2015-05-04
# Merges files filled with sorted numeric entries, one number per
# line, into a sorted single stream. Files must each contain at least
# one number. Cf. `sort -n <fileone> <filetwo>`.
# Usage:
#   awk -f ./merge <fileone> <filetwo>
# fileone contains: 1, 3, 4, 80, 95
# filetwo contains: 2, 5, 5, 10
# output: 1,2,3,4,5,5,10,80,95
{
    getline vA <ARGV[1]
    getline vB <ARGV[2]
    while (1) {
        if (vA > vB) {
            print vB
            if (! getline vB <ARGV[2]) {
                vB="x"
                break
            }
        } else {
            print vA
            if (! getline vA <ARGV[1]) {
                break
            }
        }
    }
    if (vB == "x") {
        print vA
        while (getline vA <ARGV[1]) { print vA }
        exit
    } else {
        print vB
        while (getline vB <ARGV[2]) { print vB }
        exit
    }
}