sed - Stream Editor Notes

The Unix sed command is an awesome piece of work. If you know it well, it can single-handedly stand in for many other standard Unix commands.

A very, very good list of these tricks can be found here and here. The text version is nice too.

For example, the Unix nl (number lines) command can be approximated with sed. This prints each line number on a line of its own, followed by the line itself:

sed = filename
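If the number and the text should share a line, the way nl prints them, a second sed can join each pair. This is a minimal sketch; the function name number_lines is only for illustration.

```shell
# Number lines with the number on the same line as the text.
# The first sed emits each line number on its own line before the input line;
# the second appends the following line (N) and turns the embedded newline into a space.
number_lines() {
    sed = "$@" | sed 'N;s/\n/ /'
}

printf 'alpha\nbeta\n' | number_lines
# 1 alpha
# 2 beta
```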

The Unix head command can be done nicely with sed:

sed 10q filename

Multiple Sed Operations

If you need to use more than one sed command at a time, you can give each one its own -e option. This skips the first line (perhaps a table heading row) and replaces all commas with pipes.

sed -e1d -e's/,/|/g'

GNU sed also lets you put the commands in a single script separated by ; but you need to quote that script to protect it from shell interpretation.
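The semicolon form of the same two commands can be sketched like this, run on a few made-up CSV lines:

```shell
# One quoted script: delete line 1, then turn every comma into a pipe.
# The single quotes keep the shell from treating ; and | as its own operators.
printf 'name,age\nann,30\nbob,25\n' | sed '1d;s/,/|/g'
# ann|30
# bob|25
```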

One I use a lot is to combine some processing with the equivalent of a Unix head. Here I just want the date printed, so everything from the first quote to the end of the line is removed, and everything after the first line is ignored thanks to -eq, i.e. the expression q (quit).

sed -e 's:".*$::' -eq logfile
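To see the effect without a real log, the same command can be fed two made-up lines. The timestamp format and messages here are assumptions, not taken from any particular logfile.

```shell
# Two made-up log lines: a timestamp followed by a quoted message.
# The substitution strips from the first double quote to end of line;
# q then prints the result and quits, so only the first line is processed.
printf '%s\n' \
    '2024-01-01 12:00:00 "service started"' \
    '2024-01-01 12:00:05 "service ready"' |
    sed -e 's:".*$::' -eq
```

Only the first timestamp is printed (with the space that preceded the quote still attached).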

Show Lines N Lines Past Search Term

A common situation is to have some messy status output that is divided into sections, and you want one piece from each section. For example, pactl produces a big jumbled mess of what PulseAudio knows. But what if you want just the "Name" lines in each "Source" section? You can look for Source and, since the Name is two lines below it, you can do this.

pactl list | sed -n '/^Source/{n;n;p;}'

This also shows another way to put multiple commands together. Specifically, it’s skipping the found line (n) then the next one (n) and finally printing the one that is wanted.
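Here is a small sketch of the same idea on made-up input shaped roughly like pactl output (the source names are invented). It also shows a variant that does not depend on the Name line sitting at a fixed offset.

```shell
# Made-up sample in the general shape of `pactl list` output.
sample='Source #0
  State: SUSPENDED
  Name: example_monitor
  Driver: example_driver
Source #1
  State: RUNNING
  Name: example_microphone
  Driver: example_driver'

# Fixed offset: step two lines past each Source line, then print.
printf '%s\n' "$sample" | sed -n '/^Source/{n;n;p;}'

# Offset-independent variant: inside each range running from a Source line
# to the next Name: line, print only the Name: line.
printf '%s\n' "$sample" | sed -n '/^Source/,/Name:/{/Name:/p;}'
```

Both commands print the two Name lines. The range form keeps working if the number of lines between Source and Name changes.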

Specific Range Of Line Numbers

Often I need every other line or every third or every 14th, etc. The basic format for this is first~step, a GNU sed extension that matches line number first and then every step-th line after it. This prints every fifth line.

sed -n 0~5p
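A quick way to check what a given first~step address selects is to feed it numbered input from seq:

```shell
# first~step is a GNU sed extension: line `first`, then every `step`-th line after it.
seq 12 | sed -n '0~5p'    # every fifth line: prints 5 and 10
seq 10 | sed -n '1~2p'    # every other line starting at line 1: prints 1 3 5 7 9
```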

Maybe you have a huge file full of line delimited stuff and you want to look at some small section of the data. This prints lines 1000 through 2000 and then quits.

sed -n '1000,$p;2000q' data

Here’s a way to wrap it up nicely with proper shell quoting so that a script can dole out sensible chunks. This prints the millionth line (S) and the 1000 lines (L) after it. Change just those variables as needed.

$ S=1000000; L=1000; E=$(($L+$S))
$ time sed -n "${S},\$p;${E}q" data > /dev/null
real    0m0.059s

Note that this blazes through 1 million records quite quickly. If you’re seeing much slower performance, it probably isn’t sed’s contribution. Also note that you can simply specify a range with a start and an end, but for large files where you want a slice near the top, you do not want sed continuing to check past the last possible line you care about.

$ S=1000; L=1000; E=$(($L+$S))
$ time sed -n "${S},\$p;${E}q" data > /dev/null
real    0m0.016s
$ time sed -n "${S},${E}p" data > /dev/null
real    0m0.054s

Here you can see the difference in speed when looking at 1k-2k in a 1M record file.

This is equivalent to piping the results of head to tail which has similar performance but requires two processes. I don’t think there is a clearly preferable strategy so use whatever you like best.
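The two strategies can be put side by side as small functions. This is a sketch; the names slice_sed and slice_head_tail are made up for illustration, and both read standard input.

```shell
# Print lines S through E (inclusive) of standard input, two ways.
slice_sed() {            # usage: slice_sed S E
    sed -n "${1},\$p;${2}q"
}
slice_head_tail() {      # usage: slice_head_tail S E
    head -n "$2" | tail -n "+$1"
}

seq 5000 | slice_sed 1000 1003          # prints 1000 1001 1002 1003
seq 5000 | slice_head_tail 1000 1003    # prints the same four lines
```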

Inserting Stuff Into Templates

Often I like to make a template file that has changeable content. The template mostly stays the same, but the content is new every time. I do this with my web pages which have a constant header and footer but changing content. Also I use it for quickly dumping geometry into an SVG file with the correct XML wrapping.

Here’s how it is done. In this example, I’m making a template file called T that just contains 5 numbers. If I want the number 3 to be replaced with some custom content (here I use "Three") this will do it.

seq 5 > T ; cat <(sed '/3/,$d' T) <(echo Three) <(sed '0,/3/d' T)

Or more generally…

cat <(sed '/CONTENT_GOES_HERE/,$d' template_file) \
    <(make_content_program) \
    <(sed '0,/CONTENT_GOES_HERE/d' template_file)

The line in template_file containing CONTENT_GOES_HERE will be replaced with whatever make_content_program produces when run.
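For repeated use, the same three pieces can be wrapped in a function. This sketch emits them one after another instead of using cat with process substitution, which gives the same output; the name fill_template and its argument order are assumptions for illustration.

```shell
# usage: fill_template TEMPLATE MARKER < content
# Prints TEMPLATE with the first line matching MARKER replaced by standard input.
# MARKER is used as a sed regex, so it must not contain "/".
fill_template() {
    sed "/$2/,\$d" "$1"    # everything before the marker line
    cat                    # the new content, read from standard input
    sed "0,/$2/d" "$1"     # everything after it (0,/re/ needs GNU sed)
}

tmpl=$(mktemp)
seq 5 > "$tmpl"
echo Three | fill_template "$tmpl" 3    # prints 1 2 Three 4 5, one per line
rm -f "$tmpl"
```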

Break Up Files Respecting Content

Breaking up files into parts can be done with the Unix split command, but what if you can’t break the file at arbitrary places? Here I write out a section of a file starting at a known break point and running until the start of the next break point, then cut off that last line (the next start point) with the second sed command.

sed -n '/RECORD 100/,/RECORD 201/p' records.xml | sed -n '$!p' > records100-200.xml
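The behavior is easy to check on generated input. In this sketch the record layout is made up (a marker line followed by one data line), and the second command is a one-process variant that skips the end marker instead of trimming it afterwards.

```shell
# Made-up input: numbered records, each a marker line plus one data line.
make_records() {
    for i in $(seq 99 203); do
        printf '<RECORD %s>\n  data %s\n' "$i" "$i"
    done
}

# Two-stage form: print the range, then drop the final line (the next record's marker).
# The trailing sed shows just the first and last lines of the result.
make_records | sed -n '/RECORD 100/,/RECORD 201/p' | sed '$d' | sed -n '1p;$p'

# One-process variant: print lines in the range unless they match the end marker.
make_records | sed -n '/RECORD 100/,/RECORD 201/{/RECORD 201/!p;}' | sed -n '1p;$p'
```

Each command reports `<RECORD 100>` as the first line and `  data 200` as the last. Note that the patterns must be quoted because they contain spaces.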

Escaping What XML Is Sensitive To

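This can go in a Bash script exactly as written, because sed accepts newlines between commands inside the quoted script. The ampersand is escaped first so that the entities added by the later substitutions are not escaped again. The \x27 is the single quote, written that way (a GNU sed escape) so that it can sit inside the single-quoted script.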

sed 's/\&/\&amp;/g;
s/"/\&quot;/g;
s/</\&lt;/g;
s/>/\&gt;/g;
s/\x27/\&apos;/g'
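Wrapped in a function, the same substitutions can be used as a filter. This is a sketch; the name xml_escape is made up, and the \x27 escape assumes GNU sed.

```shell
# Escape the five characters XML treats specially. The & rule runs first so
# the ampersands introduced by the later rules are not escaped a second time.
xml_escape() {
    sed 's/&/\&amp;/g;
s/"/\&quot;/g;
s/</\&lt;/g;
s/>/\&gt;/g;
s/\x27/\&apos;/g'
}

printf '%s\n' "<a href=\"x\">Tom & Jerry's</a>" | xml_escape
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&apos;s&lt;/a&gt;
```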

An Example of Sed

Here is a quick script I wrote to create a concordance (a list of words with their counts) from a Wikipedia dump. As written, it reads only the first two million lines of the dump.

#!/bin/bash
#usage: $0 enwiki-latest-pages-articles.xml.bz2 concordance
bzip2 -cd "$1"                          \
| head -n 2000000 \
| sed -e '/^ *</d'                      \
      -e '/^ *|/d'                      \
      -e 's/[| ][a-z][a-z]* \?=//g'     \
      -e 's@https\?://[^ ][^ ]*[ ,]@@g' \
      -e 's@[{()}]@ @g'                 \
      -e 's@&quot;@@g'                  \
      -e 's@&[gl]t;@@g'                 \
      -e "s@[/_–']@ @g"                 \
      -e '/^=/d'                        \
      -e 's@[^a-zA-Z ]@@g'              \
| tr ' [:upper:]' '\n[:lower:]'         \
| grep .                                \
| sort                                  \
| uniq -c                               \
| sort -n                               \
| tee "$2"
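
The counting stages are easier to follow on a tiny input. This sketch keeps only the last sed expression and the word-counting tail of the pipeline; the function name count_words and the sample sentences are made up.

```shell
# Strip everything but letters and spaces, put one lowercase word per line,
# drop empty lines, then count identical words and sort by count.
count_words() {
    sed -e 's@[^a-zA-Z ]@@g' |
        tr ' [:upper:]' '\n[:lower:]' |
        grep . |
        sort |
        uniq -c |
        sort -n
}

printf '%s\n' 'The cat and the hat.' 'The end!' | count_words
```

The most frequent word ends up on the last line, here "the" with a count of 3.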