What is Awk? You could do worse than listening to Brian Kernighan (the "K" in Awk) offer his own excellent explanation.
I don’t write large fancy programs in Awk (I’m not Brian Kernighan), but I do use it a ton for simple things in shell scripts. It can do serious and fancy things, and if your project calls for that, lucky you. I don’t have many notes for Awk because what I normally use it for is pretty simple, and for everything else I have a copy of the O’Reilly Sed & Awk book.
I did really like these notes, however.
I also find I understand the interesting capabilities of a thing like Awk by looking at small examples of interesting and useful tricks. Here’s another such resource.
Useful Built-In Variables
- FS = Field Separator (can be set with -F from the command line).
- OFS = Output Field Separator.
- RS = Record Separator.
- ORS = Output Record Separator.
- NF = Number of Fields.
- NR = Number of Records. This can effectively be used as a counter of what line number you’re on.
- FNR = File’s Number of Records. Line number, reset on each file.
- FILENAME = Current filename being processed.
- FIELDWIDTHS = When set with a whitespace separated list of values, reads fields from those positions, ignoring FS. Useful for fixed column inputs. (A gawk extension.)
- IGNORECASE = Non-zero treats upper and lower case alike. (A gawk extension.)
- OFMT = Output format for numbers. Default "%.6g".
Note that these variables do not need a $ to be resolved. In fact, something like $NF where there are 3 fields would be the same as $3.
The following prints the line number with the number of fields (good for
checking the integrity of a data file).
awk '{print NR,NF}'
awk '{if (NF != 25) print NR,NF}' # Report lines that do not have exactly 25 fields.
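A couple of the other variables in action; a quick sketch, with hypothetical input filenames.
awk '{print FILENAME, FNR, NR}' first.txt second.txt # Per-file and overall line numbers.
awk 'BEGIN{FS=":";OFS="\t"}{$1=$1;print}' /etc/passwd # Re-delimit colons as tabs; $1=$1 forces rebuilding $0 with OFS.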
Patterns
This demonstrates how to use "patterns". It will take some output that has the form of "ID_PARAM=value" and, only for the parameters of interest, save the value. At the end it will compute what is needed.
mplayer -identify -vo null -ao null -frames 0 mysteryvid.mp4 \
| awk 'BEGIN{FS="="} \
/ID_VIDEO_FPS/{rate=$2} /ID_LENGTH/{time=$2} \
END{print rate*time}'
For complete information, see man awk and search (/) for "^ *Patterns".
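Patterns don’t have to be regular expressions; ordinary boolean expressions and ranges also work. A couple of quick sketches, where data.txt is hypothetical.
awk '$3 > 100' data.txt # Expression pattern: lines whose third field exceeds 100.
awk '/START/,/STOP/' data.txt # Range pattern: from a START line through the next STOP line.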
Line Lengths And Character Counts
Need to know how long a line is? This is often useful for looking for missing records or absurdly huge records. Awk makes this pretty easy.
$ echo "abcdefghijklmnopqrstuvwxyz" | awk '{print length}'
26
$ echo "abcdefghijklmnopqrstuvwxyz" | awk '{print length,$0}'
26 abcdefghijklmnopqrstuvwxyz
What about a count of the occurrences of a specific character? The trick is that gsub returns the number of substitutions it made, so replacing each match with itself ("&") counts matches without changing the line.
$ cal | awk '{print gsub("[01]","&"),$0}'
2 June 2017
0 Su Mo Tu We Th Fr Sa
1 1 2 3
2 4 5 6 7 8 9 10
8 11 12 13 14 15 16 17
4 18 19 20 21 22 23 24
1 25 26 27 28 29 30
0
It seems Awk has some cheap functions and some expensive ones. If you’re in a big hurry to count a lot of lines, note the following technique.
$ time gzip -cd d17q1.txt.gz | awk '{print gsub("|",x),$0}' > /dev/null
user 1m18.498s
$ time gzip -cd d17q1.txt.gz | awk '{print gsub("|",x)}' > /dev/null
user 1m17.642s
$ time gzip -cd d17q1.txt.gz | tr -cd '|\n' | awk '{print length}' > /dev/null
user 0m11.199s
I’m surprised by the large discrepancy, but I think that dragging out any kind of regular expression handling is going to be way more expensive than dumb counting. It’s worth noting that Ruben’s custom C program did this in 0m12.175s; the tr pipeline matching it shows the power of Unix parallelization, since each stage of the pipeline runs as its own concurrent process.
Here’s another way.
awk '{print split($0,X,"$")}' demo17q2.txt | sort -n | uniq -c
This shows how to get a tally of every field count that occurs. If you have a (bad) set of data field-separated by dollar signs and you want to make sure that all lines have the same number of fields, this should output only one line, which will also show the total line count.
Paragraph Grep
Apparently some variants of grep can operate based on paragraphs (blocks of text separated by blank lines) with a -p option. This seems very handy. So much so that my boss (and this guy and no doubt many others) wrote custom software for this purpose. Here the GNU grep maintainers scoff at the idea of a -p option because… Again, Awk to the rescue.
This trick comes in handy for Windows style configuration files which are becoming more common in the Unix world.
$ awk '/xed/' RS= 'ORS=\n\n' /etc/samba/smb.conf
[xedhome]
comment = Home directory of Chris
read only = no
valid users = xed
path = /home/xed
That was shockingly easy, wasn’t it? I will note that my boss' custom C program was about 50% faster on a big data set.
Sum
Adding a list of numbers is quite handy, and the same pattern adapts easily to other things like averages.
awk '{X+=$1}END{print X}'
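A common variant is summing one column grouped by another. Here is a sketch assuming keys in field one and values in field two; note that the output order is unspecified.
awk '{S[$1]+=$2}END{for(k in S)print k,S[k]}'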
Percentages
A slightly tricky problem involves sending a stream of numbers and returning what percent that number was of the entire sum of all numbers sent. This will necessarily take a full pass before answers can be computed.
Here L is an array of all lines which is built while adding the total sum, S, of each line. After all the input is in, the END clause runs with a for loop iterating over the array.
$ seq 5 10 50 | awk '{L[NR]=$1;S=S+$1}END{for(i in L)print L[i],(L[i]/S)}'
5 0.04
15 0.12
25 0.2
35 0.28
45 0.36
For billions of lines, holding every value in memory could be a problem. In that case you might want to run two complete passes.
This will make the percentages cumulative.
awk '{L[NR]=$1;S=S+$1}END{for(i in L){T+=L[i]/S;print L[i],T}}'
I used this for pie chart making.
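One caveat: for (i in L) iterates in an unspecified order. That is harmless for the plain percentages but wrong for the cumulative version, so to be safe, index the array explicitly.
awk '{L[NR]=$1;S+=$1}END{for(i=1;i<=NR;i++){T+=L[i]/S;print L[i],T}}'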
Shell Math
In ancient times the old Bourne shell sh had pretty much no math abilities at all. These days, Bash has just enough to taunt you. But I’ve set up fancy things where I need a simple math problem done and Bash seems incapable. Things like this will fail.
ADJUSTMENT=3.3
actuate $(( ${VALUE} + ${ADJUSTMENT} ))
How then can you do floating point math? (Bash arithmetic is integer only.) Most classical sources will steer you to bc, but this is problematic if, as on many modern systems, bc is not present. Sure, you can download it, but a way that prevents any problems with your end user/final installation not being prepared is to use awk.
function shmath {
    EXP=$1
    awk "END{print ${EXP}}" /dev/null
}
Then you can do this.
actuate $(shmath "${VALUE} + ${ADJUSTMENT}")
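A variant sketch: since a BEGIN block runs before any input is read, the /dev/null isn’t even needed.
function shmath { awk "BEGIN{print $1}" ; }
shmath "2.5 * 4 + 0.1" # 10.1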
Awk saves the day!
Pi
Speaking of math, here’s how you can get pi in awk.
$ awk '{print atan2(0,-1)}' <(echo)
3.14159
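A BEGIN block needs no input at all, and printf can ask for more digits.
$ awk 'BEGIN{printf "%.15f\n",atan2(0,-1)}'
3.141592653589793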
Or just use this.
3.141592653589793238462643383279502884197169399375105820974
Or this many radians per circle.
6.283185307179586476925286766559005768394338798750211641949
Pie Charts
Here’s an example showing the proportions of vowels in the awk man page.
man awk | sed 's/\(.\)/\1 /g' | tr 'A-Z ' 'a-z\n' | grep '[aeiou]' \
| sort | uniq -c | ./pie > awkman.svg
For the short Awk pie program, see my complete blog post about using Awk to make pie charts. Yes! It works!
ASCII Bar Charts
What if you have some simple data in the unix world, perhaps at the end of a pipeline, and you want a rough idea of how it looks visually? One of the most universally available and direct methods is to have awk print a line whose length is proportional to some value contained in each input line. Here is an example from a project where I converted a pretty complex object-oriented Python program into something that could be rewritten easily in C (or Awk!). I suspected that the length was getting shorter and wanted to see that.
$ ls -lrt code-version*.py | awk '{for(c=0;c<($5/200);c++)printf "_";printf "\n"}'
___________________________________________________________________________________________________________________________________
___________________________________________________________________________________________________________________________________
_____________________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________________
____________________________________________________________________________________________________________
____________________________________________________________________________________________________________
____________________________________________________________________________________________________________
__________________________________________________________________________________________________________
__________________________________________________________________________________________________________
__________________________________________________________________________________________________________
__________________________________________________________________________________________________________
_________________________________________________________________________________________________________
_______________________________________________________________________________________________________
__________________________________________________________________________________________________________
_________________________________________________________________________________________________________________
__________________________________________________________________________________________________________
_______________________________________________________________________________________________________________
__________________________________________________________________________________________________________________
__________________________________________________________________________________________________________________
______________________________________________________________________________________________________
_________________________________________________________________________________________________________
_____________________________________________________________________________________________________
_________________________________________________________________________________________________________
_________________________________________________________________________________
______________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
_________________________________________________________________________
____________________________________________________________
__________________________________________________________
_________________________________________________________
__________________________________________________________
____________________________________________________________________
_______________________________________________________________________
___________________________________________________________________________
_____________________________________________________________________
________________________________________________________
_____________________________________________________
Mean and Standard Deviation
Mean.
awk '{X+=$1}END{print X/NR}'
This seems to be correct for the population standard deviation.
awk '{X+=$1;Y+=$1^2}END{print sqrt(Y/NR-(X/NR)^2)}'
Here’s both.
awk '{X+=$1;Y+=$1^2}END{print X/NR, sqrt(Y/NR-(X/NR)^2)}'
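If you want the sample (N-1) standard deviation instead, a minimal variant.
awk '{X+=$1;Y+=$1^2}END{print sqrt((Y-X^2/NR)/(NR-1))}'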
Absolute Value
It’s a bit strange that awk doesn’t have an abs function, but since it has that sqrt function and performance to burn, it’s not a problem.
awk '{print sqrt($1*$1)}'
Seems a bit hackish, but works fine. This even converts -0 to 0 for whatever that’s worth.
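If the square root trick feels too cute, a conditional expression is the more literal route (though unlike the sqrt version, this one will pass a literal -0 through unchanged).
awk '{print ($1<0)?-$1:$1}'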
Removing Duplicate Words
If you have a line that contains a bunch of words and you want to remove any duplicate mentions of them, this does the trick.
$ cat duptest
nothing duplicated here
another another
stupidly duplicated ok unique stupidly duplicated fine
$ awk '{ while(++i<=NF) printf (!a[$i]++) ? $i FS : ""; i=split("",a); print "" }' ./duptest
nothing duplicated here
another
stupidly duplicated ok unique fine
Note that the split is really just a way to clear the array (see the delete command which may also be a way to do this). It also resets i as a bonus. A bit of a dirty trick, but that’s how awk pros roll.
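The delete version mentioned above would look something like this (deleting a whole array without a subscript is supported by gawk and other modern awks).
awk '{ while(++i<=NF) printf (!a[$i]++) ? $i FS : ""; i=0; delete a; print "" }' ./duptest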
Make Specific Column Unique
The uniq command likes to work on entire lines. If you just need to know all the different values of the third field, Awk is a little sharper than sort.
awk '!X[$3]++' logfile
The details are explained here. In short, !X[$3]++ is true only the first time a given value of field three is seen, so each value’s first line gets printed.
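For comparison, if you only need the distinct values themselves (and don’t mind losing the original order and the rest of each line), the sort way is this.
awk '{print $3}' logfile | sort -u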
A Column Rearranging Factory
I had a situation where I had hundreds of files from a messy data source that needed to be homogenized. The files had many different categories and in each category, there was a correct form. That form might have fields "A B C D E F". Some of the other files in that category would have "A B D E F" or "A B E D F". A mapping just had to be made, but once it was made, the following awk snippet worked to rearrange everything automatically.
So for the first example, "A B D E F" to "A B C D E F", the fixing rule would be like this where X is the missing field to be inserted (blank of course).
AWKARGS='$1,$2,X,$3,$4,$5'
This makes the old fifth column (i.e. F) the new sixth and inserts a new empty field between two and three (i.e. C). By defining a whole list of such strings, one per category, I could then send each string to the following code.
AWKFORM=$(echo ${AWKARGS} | sed -e 's/[^,][^,]*/%s/g' -e 's/,/$/g' -e 's/^/"/' -e 's/$/\\n"/')
awk 'BEGIN{FS="$"}{printf('${AWKFORM}','${AWKARGS}')}' the_file
Yes, these awful files were separated with dollar signs. If the format is already correct, you can just set AWKARGS to $0.
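To see what the machinery produces for that example, here is a sketch with a hypothetical one-line input.
AWKARGS='$1,$2,X,$3,$4,$5'
AWKFORM=$(echo ${AWKARGS} | sed -e 's/[^,][^,]*/%s/g' -e 's/,/$/g' -e 's/^/"/' -e 's/$/\\n"/')
echo ${AWKFORM} # "%s$%s$%s$%s$%s$%s\n"
echo 'A$B$D$E$F' | awk 'BEGIN{FS="$"}{printf('${AWKFORM}','${AWKARGS}')}'
A$B$$D$E$F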
Counting Occurrence Of Character Within Line
If you have a bar separated value file and you need to know how many fields there are, this will count the bars (the field count is the bar count plus one). It basically substitutes out everything that’s not the character of interest (the bar) and prints the remaining length.
awk '{gsub("[^|]",""); print length();}' file.bsv
Splitting A Big File Into Many Small Ones
For simple jobs where you want to split up a big file and all the lines are similar and it doesn’t matter which files they end up in, use the unix split command. But if there is some multi-line content that you need to avoid splitting across files, Awk can solve the problem!
The example here is an SDF (structure definition format) molecule file. Each file can contain many molecules separated by the hideous separator string of four dollar signs on a line by itself. Here awk pulls apart such a file and writes it out to many sequentially named files.
:-> [crow][/tmp/sdftest]$ ls -l
total 44
-rw-rw-r-- 1 xed xed 41529 Dec 18 15:06 small.sdf
:-> [crow][/tmp/sdftest]$ wc -l small.sdf
948 small.sdf
:-> [crow][/tmp/sdftest]$ awk 'BEGIN{RS="[$]{4}\n"}{F++; print $0 "$$$$" > "partial-" sprintf("%04d",F) ".sdf" }' small.sdf
:-> [crow][/tmp/sdftest]$ ls -l
total 112
-rw-rw-r-- 1 xed xed  4895 Dec 18 15:49 partial-0001.sdf
-rw-rw-r-- 1 xed xed  4295 Dec 18 15:49 partial-0002.sdf
-rw-rw-r-- 1 xed xed  4847 Dec 18 15:49 partial-0003.sdf
-rw-rw-r-- 1 xed xed  4251 Dec 18 15:49 partial-0004.sdf
-rw-rw-r-- 1 xed xed  3971 Dec 18 15:49 partial-0005.sdf
-rw-rw-r-- 1 xed xed  4527 Dec 18 15:49 partial-0006.sdf
-rw-rw-r-- 1 xed xed  5211 Dec 18 15:49 partial-0007.sdf
-rw-rw-r-- 1 xed xed  4343 Dec 18 15:49 partial-0008.sdf
-rw-rw-r-- 1 xed xed  5189 Dec 18 15:49 partial-0009.sdf
-rw-rw-r-- 1 xed xed 41529 Dec 18 15:06 small.sdf
:-> [crow][/tmp/sdftest]$ cat partial-000* | wc -l
948
Note that ORS could be used.
awk 'BEGIN{RS="[$]{4}\n";ORS="$$$$\n"}{F++; print > "partial-" sprintf("%04d",F) ".sdf" }' small.sdf
But in this case the awkward regular expression element, $, makes it not worth it.
Slow Reader Using System
Awk has an interesting system function which can run shell commands. This can get into all kinds of mischief. One thing I did with it was to slow down some captured input to mimic the flow of its original source.
awk '{system("sleep .025");print $0}' capturefile | my_device_handler
Analyze Storage Volume Usage And Speeds
This script is a monitoring tool to continuously track and report the disk space usage and write speed of mounted devices, providing insights into disk activity and predicting when the disk might become full.
awk -v delta=$delta \
    'BEGIN { printf "%-10s %-10s %10s %10s %15s\n","device","mounted on","free (gb)","mb/sec","hours to fill" }
     NR==FNR {t[$7]=$1; u[$7]=$4}
     NR>FNR  {r=1e-3*($4-u[$7])/($1-t[$7]);
              if(r>0.001) printf "%-10s %-10s %10.3f %10.3f %10.2f\n",$2,$7,1e-6*$5,r,1e-3*$5/r/3600.0;
              fflush(stdout)}' \
    <(df | fgrep /dev/sd | prefix `date +"%s"`) \
    <(while true; do sleep $delta; df | fgrep /dev/sd | prefix `date +"%s"`; done)
It’s A Real Programming Language
To make standalone awk programs, use something like this.
$ cat awktest
#!/bin/awk -f
BEGIN { print "Runs this block one time." }
{ print "Runs this once for every line." }
$ seq 3 | ./awktest
Runs this block one time.
Runs this once for every line.
Runs this once for every line.
Runs this once for every line.
I found that the -f was required. Without it, I got
awk: ^ syntax error
Also, only one type of block is needed, but the braces are required at a minimum.
This is just an example I wrote as an exercise for performance testing.
#!/usr/bin/awk
# Chris X Edwards - 2015-05-04
# Merges files filled with sorted numeric entries, one number per
# line, into a sorted single stream. Files must each contain at least
# one number. Cf. `sort -n <fileone> <filetwo>`.
# Usage:
#   awk -f ./merge <fileone> <filetwo>
# fileone contains: 1, 3, 4, 80, 95
# filetwo contains: 2, 5, 5, 10
# output: 1,2,3,4,5,5,10,80,95
{
    getline vA <ARGV[1]
    getline vB <ARGV[2]
    while (1) {
        if (vA > vB) {
            print vB
            if (! getline vB <ARGV[2]) {
                vB="x"
                break
            }
        } else {
            print vA
            if (! getline vA <ARGV[1]) {
                break
            }
        }
    }
    if (vB == "x") {
        print vA
        while (getline vA <ARGV[1]) { print vA }
        exit
    } else {
        print vB
        while (getline vB <ARGV[2]) { print vB }
        exit
    }
}