Basic Usage
wget is a web client but, unlike a normal browser, it is designed to simply get the served remote files. Normally wget is used like this:
$ wget http://www.example.com/path/xed_example.html
After running that command you will have a copy of xed_example.html in your local current directory.
The most common option I use is -O (capital oh, not zero) which changes the name of the saved file:
$ wget -O xed.htm http://www.example.com/path/xed_example.html
This gets xed_example.html from the web site specified and saves that with a new name, xed.htm.
Fancy Usage
Geolocation
Put this function in your .bashrc and you can use it to roughly find the real world geographic location of a host name or IP address.
function geo() { wget -qO- ipinfo.io/$(if tr -d [0-9.]<<<$1|grep .>/dev/null;then dig +short $1;else echo $1; fi);}
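If that one-liner is hard to read, here’s the same idea spelled out as a sketch (it assumes dig is installed; the logic is: if the argument contains anything besides digits and dots, treat it as a host name and resolve it first):
function geo() {
    local target=$1
    # Anything other than digits and dots means it's a host name, not an IP;
    # resolve it and take the last line of dig's output (an address).
    if [[ $target == *[!0-9.]* ]]; then
        target=$(dig +short "$target" | tail -n1)
    fi
    wget -qO- "ipinfo.io/${target}"
}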
Using wget With Proxies
These days I like to configure machines to not live on the real internet and get any external stuff through a well managed proxy. wget should have no problem with this. Here’s an example.
wget -e use_proxy=yes -e http_proxy=avproxy.ucsd.edu:3128 http://example.edu/foo-1.0.tgz
Or something like this. Note the httpS.
URL=https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
wget -e use_proxy=yes -e https_proxy=proxy.ucsd.edu:3128 ${URL}
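If everything on a machine should go through the proxy, it can also be set once instead of repeating -e on every command; a sketch, reusing the proxy host from the example above:
export http_proxy=http://proxy.ucsd.edu:3128
export https_proxy=http://proxy.ucsd.edu:3128
wget ${URL}
Or, for wget only, put the equivalent settings in ~/.wgetrc:
use_proxy = on
http_proxy = http://proxy.ucsd.edu:3128
https_proxy = http://proxy.ucsd.edu:3128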
Using wget With Other Commands
The following is useful for checking on a web site perhaps from a cron job.
$ echo -n "xed.ch "; if wget -qO /dev/null xed.ch; then echo "is still working."; else echo "seems down"; fi
xed.ch is still working.
Note that you are downloading the entire page, but immediately throwing it away to /dev/null.
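If you don’t want to transfer the page at all, wget’s --spider option just checks that the URL responds without downloading the body, so the same test can be written as:
$ echo -n "xed.ch "; if wget -q --spider xed.ch; then echo "is still working."; else echo "seems down"; fi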
User Agent Annoyances
Sometimes a web site creator doesn’t realize how the internet actually works and tries to thwart requests from wget. This is trivial to get around by specifying your own synthesized "User Agent" header.
UA="User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
wget -qO- --header="${UA}" "${THE_TARGET_URL}"
You can try the user agent in the example, but sometimes you want to blend in with the sheep. To find the User Agent header your favorite normal-person browser uses, check out my nc notes, which have a clever trick for finding that out.
Official specifications on how exactly user agent headers should be composed can be found here and here.
Sometimes you need the user agent and the cookies - see below.
Difficult Web Sites
Some web sites are more complicated to use than others. Here is a script to get several files from a site that requires a password:
#!/bin/bash
WWW=http://www.example.com/coolstuff/
function get {
    wget --user=xed --password=examplepw "${WWW}$1"
}
get coolfile1.jpg
get coolfile2.cc
Things can get worse. Perhaps the web site doesn’t use normal HTTP Basic Authentication. Commonly some interactive Javascript activated form must be filled out, perhaps with a captcha. Of course you can’t easily automate this by definition. But if you have the credentials, you can do the log in manually and then do the downloading automatically. The key is in what is called a session cookie. When you enter your user name and password on a web site with complex interactive functionality, the remote site usually sends you a session cookie to let it know that you have already been granted permission to use the site. It’s like going to a concert and having to show your driving license and then they give you a wrist band saying you are allowed to buy alcohol. It’s possible (if you’re clever) to take that wrist band off and give it to someone else. The same idea is true for session cookies. Here’s how it works in action:
$ wget -O CrazyKinase.zip --no-cookies \
--header='Cookie:PHPSESSID=6d8cf0002600360034d350a57a3485c3' \
'http://www.examplechem.net/download/download.php?file=186'
In this example the --no-cookies option is used since we’re going to manually manage what the server knows about cookies. This is done with the --header option. This highly versatile option can do all kinds of fun things, but here we are using it to pretend that this cookie has already been set as shown. The cookie’s name is PHPSESSID and its value is a big hex number, 6d8c...85c3.
Discovering The Session ID
How do you know what your session ID cookie is? You need to use a browser that can show you. I find that Google Chrome works fine for this but most browsers have a similar way. In Chrome go to Wrench → Tools → Developer Tools (or Shift-Ctrl-I). Click "Resources" on the top bar of the Developer Tools and then expand the entry for "Cookies". Find the web page you’re targeting and click on that. You should see a table of "Name" and "Value" pairs. Sometimes you need trial and error to figure out which ones are really needed and sometimes it is obvious. Plug one of these key value pairs into the wget command shown above and give it a try. Note that you can have multiple --header= options so you can report multiple cookies to the server.
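For example, if the site hands out two cookies, the command might look like this (the second cookie’s name and value are invented for illustration):
$ wget -O CrazyKinase.zip --no-cookies \
    --header='Cookie:PHPSESSID=6d8cf0002600360034d350a57a3485c3' \
    --header='Cookie:logged_in=yes' \
    'http://www.examplechem.net/download/download.php?file=186'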
Firefox really hides cookies in an absurd way. In Firefox, go to about:preferences#privacy and click "remove individual cookies". Even if you don’t want to remove any, this will show you the cookies.
Solving Serious Cookie Problems
Sometimes some asshat has really made a mess of distributing their 800GB files and really wants to insist that you use a graphical browser. If you can’t easily get a simple session cookie working with the technique described above, here is the next level of escalation.
This does require a browser plugin. My advice is to start Firefox like this.
firefox --ProfileManager --no-remote
This will open the profile manager letting you create a new "profile". This will keep the obnoxious site’s cookies completely isolated. It also allows you to install a browser plugin just for this profile so it won’t infect the rest of your life. Create the new profile and start a fresh browser. You can find the "Export Cookies" extension here.
(Deprecated, see below.) After installing this, it’s pretty unclear what to do. First, obviously, you need to go to the target website and log in and establish the session cookies. Then you need to hit "Alt-T" to pull down the (possibly hidden if you’re intelligent) "Tools" menu. There you can find "Export Cookies". This brings up a system file browser to specify the file to save. I put this example in /tmp/cookies.txt.
If that one can’t be installed because of version issues, this one might be better anyway. Simply export from the menu on its add-on icon.
Great, now you have the messy cookies in a neat file that wget is particularly fond of for some reason. Just for reference, in case you ever need to compose such a file manually, here’s what they look like.
.services.addons.mozilla.org TRUE / FALSE 0 __utmc 145953462
.services.addons.mozilla.org TRUE / FALSE 1513404819 __utmz 145953462.<blah,blah>one)
.services.addons.mozilla.org TRUE / FALSE 1497638619 __utmb 145953462.<blah,blah>6819
.addons.mozilla.org TRUE / FALSE 1560709003 __utma 164683759.9<blah,blah>2.1
.addons.mozilla.org TRUE / FALSE 1513405003 __utmz 164683759.1<blah,blah>one)
.addons.mozilla.org TRUE / FALSE 1497638803 __utmb 164683759.5<blah,blah>65612
These are some of the cookies set by the site that provided the extension. I shortened the last field, which is the cookie’s value. The separators show up as spaces here, but the format wget reads (the old Netscape cookies.txt format) separates fields with tabs. The number before the cookie name is the expiration time as a Unix timestamp (0 means a session cookie). Cookies that look like __utm? seem to be something to do with Google Analytics. The nice thing about this approach is you wrap up all the cookies that could possibly be useful and not many more.
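If you ever do have to write one by hand, the fields on each line are, in order: domain, a TRUE/FALSE flag for whether subdomains are included, path, a TRUE/FALSE secure-only flag, the expiration as a Unix timestamp, the cookie name, and the value. Something like this made-up line, with tabs between the fields:
.example.com	TRUE	/	FALSE	1600000000	PHPSESSID	6d8cf0002600360034d350a57a3485c3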
Since you’re as likely as not to be downloading a lot of files, I’ll show the technique with a quick script I wrote.
BASE_URL='http://www.dishwasher.com/downloads/fancy.cgi/'
function dl {
    wget --load-cookies=/tmp/cookies.txt --no-check-certificate ${BASE_URL}$1
}
dl "DISHWASHER-fancy_huge_file-2014.tar.gz"
dl "DISHWASHER-fancy_huge_file-2015.tar.gz"
dl "DISHWASHER-fancy_huge_file-2016.tar.gz"
dl "DISHWASHER-fancy_huge_file-2017.tar.gz"
That’s it. Unleash that in a screen session to have some remote machine obtain the files that need to be there.
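For example, if the script above were saved as getfiles.sh (a name invented here), it could be left running unattended like this:
$ screen -S downloads      # start a named screen session on the remote machine
$ bash getfiles.sh         # run the download script inside it
Then detach with Ctrl-a d and check back later with screen -r downloads.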
Alternatives To wget
Still not satisfied with the great power of wget? There are other ways to do what it does.
curl
curl is a program that pretty much does exactly what wget does. I like wget better, but curl seems to be installed natively on Macs. Also installed natively on Macs is the man page for curl, so if you don’t know how to use it, type man curl.
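For example, the rename-while-downloading trick from the top of these notes looks like this in curl (note the lowercase -o):
$ curl -o xed.htm http://www.example.com/path/xed_example.html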
lynx/elinks
Maybe you just realized that you don’t really want coolstuff.html because it is just a messy HTML file. Maybe you wanted just the human text in it. In this case you can use elinks like this:
$ elinks -dump xed.ch/coolstuff.html > cool.html
The old fashioned browser lynx can also do this:
$ lynx -dump xed.ch
It is also what primitive internet tribes used in the prehistoric days before wget:
$ lynx -source -dump xed.ch
Photon
Good god, look at this monster.
This thing crawls in a way that should scare you if you run a web site.
Python and urllib2
You want even more control? Here’s how to pull the contents of a URL into your Python program.
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
print html
This example is from the urllib2 documentation. Note that urllib2 is Python 2 only; in Python 3 the same functionality lives in urllib.request (urllib.request.urlopen).