Basic Usage
wget is a web client but, unlike a normal browser, it is designed to simply get the served remote files. Normally wget is used like this:
$ wget http://www.example.com/path/xed_example.html
After running that command you will have a copy of xed_example.html in your local current directory.
The most common option I use is -O (capital oh, not zero) which changes the name of the saved file:
$ wget -O xed.htm http://www.example.com/path/xed_example.html
This gets xed_example.html from the web site specified and saves that with a new name, xed.htm.
Fancy Usage
Geolocation
Put this function in your .bashrc and you can use it to roughly find the real world geographic location of a host name or IP address.
function geo() { wget -qO- ipinfo.io/$(if tr -d [0-9.]<<<$1|grep .>/dev/null;then dig +short $1;else echo $1; fi);}
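If that one-liner is hard to read, here’s the same idea spelled out as a sketch (it assumes dig is installed; the logic is: if the argument contains anything besides digits and dots, treat it as a host name and resolve it first):
function geo() {
    local target=$1
    # Anything other than digits and dots means it's a host name, not an IP;
    # resolve it and take the last line of dig's output (an address).
    if [[ $target == *[!0-9.]* ]]; then
        target=$(dig +short "$target" | tail -n1)
    fi
    wget -qO- "ipinfo.io/${target}"
}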
Using wget With Proxies
These days I like to configure machines to not live on the real internet and get any external stuff through a well managed proxy. wget should have no problem with this. Here’s an example.
wget -e use_proxy=yes -e http_proxy=avproxy.ucsd.edu:3128 http://example.edu/foo-1.0.tgz
Or something like this. Note the httpS.
URL=https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
wget -e use_proxy=yes -e https_proxy=proxy.ucsd.edu:3128 ${URL}
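If everything on a machine should go through the proxy, it can also be set once instead of repeating -e on every command; a sketch, reusing the proxy host from the example above:
export http_proxy=http://proxy.ucsd.edu:3128
export https_proxy=http://proxy.ucsd.edu:3128
wget ${URL}
Or, for wget only, put the equivalent settings in ~/.wgetrc:
use_proxy = on
http_proxy = http://proxy.ucsd.edu:3128
https_proxy = http://proxy.ucsd.edu:3128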
Using wget With Other Commands
The following is useful for checking on a web site perhaps from a cron job.
$ echo -n "xed.ch "; if wget -qO /dev/null xed.ch; then echo "is still working."; else echo "seems down"; fi
xed.ch is still working.
Note that you are downloading the entire page, but immediately throwing it away to /dev/null.
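If you don’t want to transfer the page at all, wget’s --spider option just checks that the URL responds without downloading the body, so the same test can be written as:
$ echo -n "xed.ch "; if wget -q --spider xed.ch; then echo "is still working."; else echo "seems down"; fi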
User Agent Annoyances
Sometimes a web site creator doesn’t realize how the internet actually works and tries to thwart requests from wget. This is trivial to get around by specifying your own synthesized "User Agent" header.
UA="User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
wget -qO- --header="${UA}" "${THE_TARGET_URL}"
You can try the user agent in the example, but sometimes you want to blend in with the sheep. To find the User Agent header your favorite normal-person browser uses, check out my nc notes, which have a clever trick for finding that out.
Official specifications on how exactly user agent headers should be composed can be found here and here.
Sometimes you need the user agent and the cookies - see below.
Difficult Web Sites
Some web sites are more complicated to use than others. Here is a script to get several files from a site that requires a password:
#!/bin/bash
WWW=http://www.example.com/coolstuff/
function get {
    wget --user=xed --password=examplepw "${WWW}$1"
}
get coolfile1.jpg
get coolfile2.cc
Things can get worse. Perhaps the web site doesn’t use normal HTTP Basic Authentication. Commonly some interactive Javascript activated form must be filled out, perhaps with a captcha. Of course you can’t easily automate this by definition. But if you have the credentials, you can do the log in manually and then do the downloading automatically. The key is in what is called a session cookie. When you enter your user name and password on a web site with complex interactive functionality, the remote site usually sends you a session cookie to let it know that you have already been granted permission to use the site. It’s like going to a concert and having to show your driving license and then they give you a wrist band saying you are allowed to buy alcohol. It’s possible (if you’re clever) to take that wrist band off and give it to someone else. The same idea is true for session cookies. Here’s how it works in action:
$ wget -O CrazyKinase.zip --no-cookies \
--header='Cookie:PHPSESSID=6d8cf0002600360034d350a57a3485c3' \
'http://www.examplechem.net/download/download.php?file=186'
In this example the --no-cookies option is used since we’re going to manually manage what the server knows about cookies. This is done with the --header option. This highly versatile option can do all kinds of fun things, but here we are using it to pretend that this cookie has already been set as shown. The cookie’s name is PHPSESSID and its value is a big hex number, 6d8c...85c3.
Discovering The Session ID
How do you know what your session ID cookie is? You need to use a browser that can show you. I find that Google Chrome works fine for this but most browsers have a similar way. In Chrome go to Wrench → Tools → Developer Tools (or Shift-Ctrl-I). Click "Resources" on the top bar of the Developer Tools and then expand the entry for "Cookies". Find the web page you’re targeting and click on that. You should see a table of "Name" and "Value" pairs. Sometimes you need trial and error to figure out which ones are really needed and sometimes it is obvious. Plug one of these key value pairs into the wget command shown above and give it a try. Note that you can have multiple --header= options so you can report multiple cookies to the server.
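For example, if the site hands out two cookies, the command might look like this (the second cookie’s name and value are invented for illustration):
$ wget -O CrazyKinase.zip --no-cookies \
    --header='Cookie:PHPSESSID=6d8cf0002600360034d350a57a3485c3' \
    --header='Cookie:logged_in=yes' \
    'http://www.examplechem.net/download/download.php?file=186'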
Firefox really hides cookies in an absurd way. In Firefox, go to about:preferences#privacy and click "remove individual cookies". Even if you don’t want to remove any, this will show you the cookies.
Solving Serious Cookie Problems
Sometimes some asshat has really made a mess of distributing their 800GB files and really wants to insist that you use a graphical browser. If you can’t easily get a simple session cookie working with the technique described above, here is the next level of escalation.
This does require a browser plugin. My advice is to start Firefox like this.
firefox --ProfileManager --no-remote
This will open the profile manager letting you create a new "profile". This will keep the obnoxious site’s cookies completely isolated. It also allows you to install a browser plugin just for this profile so it won’t infect the rest of your life. Create the new profile and start a fresh browser. You can find the "Export Cookies" extension here.
(Deprecated, see below.) After installing this, it’s pretty unclear what to do. First, obviously, you need to go to the target website and log in and establish the session cookies. Then you need to hit "Alt-T" to pull down the (possibly hidden if you’re intelligent) "Tools" menu. There you can find "Export Cookies". This brings up a system file browser to specify the file to save. I put this example in /tmp/cookies.txt.
If that one can’t be installed because of version issues, this one might be better anyway. Simply export from the menu on its add-on icon.
Great, now you have the messy cookies in a neat file that wget is particularly fond of for some reason. Just for reference, in case you ever need to compose such a file manually, here’s what they look like.
.services.addons.mozilla.org TRUE / FALSE 0 __utmc 145953462
.services.addons.mozilla.org TRUE / FALSE 1513404819 __utmz 145953462.<blah,blah>one)
.services.addons.mozilla.org TRUE / FALSE 1497638619 __utmb 145953462.<blah,blah>6819
.addons.mozilla.org TRUE / FALSE 1560709003 __utma 164683759.9<blah,blah>2.1
.addons.mozilla.org TRUE / FALSE 1513405003 __utmz 164683759.1<blah,blah>one)
.addons.mozilla.org TRUE / FALSE 1497638803 __utmb 164683759.5<blah,blah>65612
These are some of the cookies set by the site that provided the extension. I shortened the last field, which is the cookie’s value. The separators show up as spaces here, but the format wget reads (the old Netscape cookies.txt format) separates fields with tabs. The number before the cookie name is the expiration time as a Unix timestamp (0 means a session cookie). Cookies that look like __utm? seem to be something to do with Google Analytics. The nice thing about this approach is you wrap up all the cookies that could possibly be useful and not many more.
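If you ever do have to write one by hand, the fields on each line are, in order: domain, a TRUE/FALSE flag for whether subdomains are included, path, a TRUE/FALSE secure-only flag, the expiration as a Unix timestamp, the cookie name, and the value. Something like this made-up line, with tabs between the fields:
.example.com	TRUE	/	FALSE	1600000000	PHPSESSID	6d8cf0002600360034d350a57a3485c3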
Since you’re as likely as not to be downloading a lot of files, I’ll show the technique with a quick script I wrote.
BASE_URL='http://www.dishwasher.com/downloads/fancy.cgi/'
function dl {
    wget --load-cookies=/tmp/cookies.txt --no-check-certificate ${BASE_URL}$1
}
dl "DISHWASHER-fancy_huge_file-2014.tar.gz"
dl "DISHWASHER-fancy_huge_file-2015.tar.gz"
dl "DISHWASHER-fancy_huge_file-2016.tar.gz"
dl "DISHWASHER-fancy_huge_file-2017.tar.gz"
That’s it. Unleash that in a screen session to have some remote machine obtain the files that need to be there.
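For example, if the script above were saved as getfiles.sh (a name invented here), it could be left running unattended like this:
$ screen -S downloads      # start a named screen session on the remote machine
$ bash getfiles.sh         # run the download script inside it
Then detach with Ctrl-a d and check back later with screen -r downloads.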
Alternatives To wget
Still not satisfied with the great power of wget? There are other ways to do what it does.
curl
curl is a program that pretty much does exactly what wget does. I like wget better, but curl seems to be installed natively on Macs. Also installed natively on Macs is the man page for curl, so if you don’t know how to use it, type man curl.
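For example, the rename-while-downloading trick from the top of these notes looks like this in curl (note the lowercase -o):
$ curl -o xed.htm http://www.example.com/path/xed_example.html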
lynx/elinks
Maybe you just realized that you don’t really want coolstuff.html because it is just a messy HTML file. Maybe you wanted just the human text in it. In this case you can use elinks like this:
$ elinks -dump xed.ch/coolstuff.html > cool.html
The old fashioned browser lynx can also do this:
$ lynx -dump xed.ch
It is also what primitive internet tribes used in the prehistoric days before wget:
$ lynx -source -dump xed.ch
Photon
Good god, look at this monster.
This thing crawls in a way that should scare you if you run a web site.
Python and urllib2
You want even more control? Here’s how to pull the contents of a URL into your Python program.
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
print html
This example is from the urllib2 documentation. Note that urllib2 is Python 2 only; in Python 3 the same functionality lives in urllib.request (urllib.request.urlopen).