Basic Usage

wget is a web client, but unlike a normal browser, it is designed to simply get the remote files being served.

Normally wget is simply used like this:

$ wget http://www.example.com/path/xed_example.html

After running that command you will have a copy of xed_example.html in your local current directory.

The most common option I use is -O (capital oh, not zero), which changes the name of the saved file:

$ wget -O xed.htm http://www.example.com/path/xed_example.html

This gets xed_example.html from the web site specified and saves that with a new name, xed.htm.

Using wget With Proxies

These days I like to configure machines so they do not live on the real internet and get any external stuff through a well-managed proxy. wget should have no problem with this. Here’s an example.

$ wget -e use_proxy=yes -e http_proxy=avproxy.ucsd.edu:3128 http://example.edu/foo-1.0.tgz
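
If you would rather not spell out the proxy on every command line, wget also honors the standard http_proxy environment variable and the equivalent settings in ~/.wgetrc. Here is a rough sketch, reusing the proxy host from the example above:

$ export http_proxy=http://avproxy.ucsd.edu:3128/
$ wget http://example.edu/foo-1.0.tgz

Or make it stick by putting these lines in ~/.wgetrc:

use_proxy = on
http_proxy = http://avproxy.ucsd.edu:3128/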

Using wget With Other Commands

The following is useful for checking on a web site, perhaps from a cron job.

$ echo -n "xed.ch "; if wget -qO /dev/null xed.ch; then echo "is still working."; else echo "seems down"; fi
xed.ch is still working.

Note that you are downloading the entire page, but immediately throwing it away to /dev/null.
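
If you actually run this from cron, it can be tidier as a little script. Here is a sketch using the same wget -qO /dev/null trick; the script name and schedule are just placeholders:

#!/bin/bash
# sitecheck - hypothetical cron helper using the same trick as above.
# cron mails you any output a job produces, so only speak up when something is wrong.
if ! wget -qO /dev/null xed.ch; then
    echo "xed.ch seems down"
fi

A crontab entry like `*/15 * * * * /usr/local/bin/sitecheck` would then check every 15 minutes.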

Difficult Web Sites

Some web sites are more complicated to use than others. Here is a script to get several files from a site that requires a password:

Automatic Getter With Authentication
#!/bin/bash
WWW=http://www.example.com/coolstuff
function get {
    # Fetch one file from the password-protected directory.
    wget --user=xed --password=examplepw "$WWW/$1"
}
get coolfile1.jpg
get coolfile2.cc

Things can get worse. Perhaps the web site doesn’t use normal HTTP Basic Authentication. Commonly some interactive JavaScript-activated form must be filled out, perhaps with a captcha. Of course you can’t easily automate this, by definition. But if you have the credentials, you can do the log in manually and then do the downloading automatically.

The key is in what is called a session cookie. When you enter your user name and password on a web site with complex interactive functionality, the remote site usually sends you a session cookie so that on later requests it knows you have already been granted permission to use the site. It’s like going to a concert and having to show your driving license, after which they give you a wrist band saying you are allowed to buy alcohol. It’s possible (if you’re clever) to take that wrist band off and give it to someone else. The same idea is true for session cookies. Here’s how it works in action:

$ wget -O CrazyKinase.zip --no-cookies \
--header='Cookie:PHPSESSID=6d8cf0002600360034d350a57a3485c3' \
'http://www.examplechem.net/download/download.php?file=186'

In this example the --no-cookies option is used since we’re going to manually manage what the server knows about cookies. This is done with the --header option. This highly versatile option can do all kinds of fun things, but here we are using it to pretend that this cookie has already been set as shown. The cookie’s name is PHPSESSID and its value is a big hex number, 6d8c...85c3.
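
As an aside, another fun thing --header can do is impersonate a regular browser by sending a different User-Agent, which some fussy sites insist on. The URL and user agent string here are just made up for illustration:

$ wget --header='User-Agent: Mozilla/5.0 (X11; Linux x86_64)' http://www.example.com/fussy.html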

Discovering The Session ID

How do you know what your session ID cookie is? You need to use a browser that can show you. I find that Google Chrome works fine for this, but most browsers have a similar way. In Chrome go to Wrench → Tools → Developer Tools (or Shift-Ctrl-I). Click "Resources" on the top bar of the Developer Tools and then expand the entry for "Cookies". Find the web page you’re targeting and click on that. You should see a table of "Name" and "Value" pairs. Sometimes you need trial and error to figure out which ones are really needed and sometimes it is obvious. Plug one of these name/value pairs into the wget command shown above and give it a try. Note that you can have multiple --header= options so you can report multiple cookies to the server.
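
For example, if the site hands out two cookies at login, you might end up with something like this (the second cookie name and value are invented for illustration):

$ wget -O CrazyKinase.zip --no-cookies \
--header='Cookie:PHPSESSID=6d8cf0002600360034d350a57a3485c3' \
--header='Cookie:logged_in=yes' \
'http://www.examplechem.net/download/download.php?file=186'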

Alternatives To wget

Still not satisfied with the great power of wget? There are other ways to do what it does.

curl

curl is a program that pretty much does exactly what wget does. I like wget better, but curl seems to be installed natively on Macs. Also installed natively on Macs is the man page for curl, so if you don’t know how to use it, type man curl.
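
For example, something roughly equivalent to the earlier wget -O command looks like this with curl, where -o names the output file (the URL is the same made-up one from the beginning):

$ curl -o xed.htm http://www.example.com/path/xed_example.html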

Maybe you just realized that you don’t really want coolstuff.html because it is just a messy HTML file. Maybe you wanted just the human text in it. In this case you can use elinks like this:

$ elinks -dump xed.ch/coolstuff.html > cool.txt

The old fashioned browser lynx can also do this:

$ lynx -dump xed.ch

Grabbing the raw HTML source with lynx is also how primitive internet tribes got their files in the prehistoric days before wget:

$ lynx -source -dump xed.ch

Python and urllib2

You want even more control? Here’s how to pull the contents of a URL into your Python program.

import urllib2                                     # Python 2 module; Python 3 renamed this to urllib.request
response = urllib2.urlopen('http://python.org/')   # file-like object for the URL
html = response.read()                             # read the whole response body as a string
print html                                         # Python 2 print statement

This example is from the urllib2 documentation.