:date: 2016-11-29 18:31 :tags:
In the world of normal people, when a collection of files needs to be downloaded from the internet as a single entity the odds are good that these files will be packaged into a zip file. This is a shame.
The reason is that it lets people think that this is a good way to bundle files, and it is not. It may be an adequate way and it may work fine. Linux and Mac people certainly have an unzip command at their disposal, but the problem is more subtle than a simple ability to unpack it. The reason I know that zip is a bad idea is because I know of a better way. Of course I'm talking about the Unix way.
The problem with zip is that it packages up files and it compresses them (let's ignore bizarre options like -n that may be able to suppress this default behavior). Why this is bad can be seen with two examples.
Imagine I had a collection of mp3s that I wanted to make available for download as a set. For these files zip will pointlessly "waste its time trying to compress them", to quote from the zip man page. The problem is that mp3 files are already compressed, so compressing them again when I package them makes little to no sense.
Here's another example. Let's say I had an enormous file containing an
ASCII text SQL dump. I want to compress this because that will be very
effective but there is no need for any kind of archive container. Yet
if I zip the file, I will have to work with it as if it were a
collection of one. Why should there be any accounting about possible
other files when I know I just want to compress a single file? Have
you ever purchased a single bag of snack food at a store and the
cashier rings you up and asks if you'd like a bag? If anything like
that ever happens to you, I hope you think the same thought I always
do, "No thanks, it's already got a bag." How many freaking bags do you
need?
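As a sketch of how little ceremony the single-file case needs (using a hypothetical dump.sql as a stand-in for that giant SQL dump), the Unix way is just:
$ gzip dump.sql         # produces dump.sql.gz - no archive container, no extra bag
$ gzip -d dump.sql.gz   # restores the original file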
The Unix way tries to break down the fundamental operations into separate steps so that they can both be used if needed and not if not. Not only that, but this architecture allows one to change the parts if they are not suitable. The normal Unix way to bundle a collection of files into a single file is tar (my notes). Note that tar by default does not compress things. It can and often does, but the important functionality is aggregating files into a single manageable file. If you don't like tar for some reason, there are other options. An even more ancient archiving system is cpio (Mac people also have it ready to go by default). This may seem like pointless ancient history, but the Linux distributions (e.g. Red Hat) still seem to think that cpio archives have advantages for making initial ram disks for OS booting.
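To illustrate (with hypothetical file names), bundling a pile of already compressed mp3s without any pointless recompression might look like this, with a rough cpio equivalent for comparison.
$ tar -cvf album.tar *.mp3                     # aggregate only; no compression step
$ tar -xvf album.tar                           # unpack the bundle later
$ find . -name '*.mp3' | cpio -o > album.cpio  # roughly the same idea with cpio
$ cpio -i < album.cpio                         # and this unpacks it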
To really drive home the point that there really is more than one way
to skin this cat, Unix has yet another very common archiving tool
called ar (Macs also have this by default, see man ar). This is mostly used to package static library object files, but nothing prevents you from rounding up your mp3s into an ar archive if you want. If your files are all the same exact size, which isn't especially uncommon with many data sources, you can use a simple Unix cat to pack them and use the Unix split command to unpack.
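Here is a quick sketch of both routes (file names hypothetical, and assuming each data file is exactly 1 MiB for the split step).
$ ar rc music.a one.mp3 two.mp3      # round the files up into an ar archive
$ ar t music.a                       # list what's inside
$ ar x music.a                       # extract them again
$ cat frame_*.dat > frames.bin       # pack same-sized files with plain cat
$ split -b 1048576 frames.bin part_  # unpack by cutting at the known record size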
The huge point here is that the Unix way conceptually separates the archiving from the compression (even if they occur simultaneously).
Unix has even more ways to compress things. The classic way is gzip (ready to go on Macs). This is not merely the GNU implementation of Zip, it is gzip, quite a different beast. First, it does no multi-file aggregation. That is, quite properly, outside of its scope. With gzip you can specify exactly how you want the files compressed (--fast which is -1, or maybe --best which is -9; the default is -6). But it's a lot easier to figure out how to get what you want since there's no archiving cruft to figure out.
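For instance, with that same hypothetical dump the trade-off is completely explicit:
$ gzip -9 dump.sql       # same as --best: slowest, smallest output
$ gunzip dump.sql.gz     # undo it
$ gzip -1 dump.sql       # same as --fast: quickest, largest output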
But that's the tip of the iceberg for compression. The real power comes from being able to pick and choose which compression program you want to use. There is an ancient compression system called, plainly, compress. I don't have it on my Debian Linux system by default, but Macs do seem to naturally have it. The reason it's not so common today is that gzip can uncompress files reduced with compress, and there are now just more effective compression algorithms. Chief among them is bzip2 (Mac, installed). This works very similarly to gzip with extra aggressive compression. The Linux kernel maintainers seem to prefer using the compression program xz (Mac ready). Another one is 7z. This one seems to appeal to Windows people. I find it slightly annoying because it can, like zip, archive files (but badly - it doesn't preserve file order), but Linux and Mac today have perfectly good 7z utilities by default and it has better compression than zip.
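A quick sketch of trying the heavier hitters on the same hypothetical dump (the -k flag keeps the input file so the results can be compared):
$ bzip2 -k dump.sql      # produces dump.sql.bz2
$ xz -k dump.sql         # produces dump.sql.xz, usually smaller still
$ ls -l dump.sql*        # compare the sizes yourself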
Not content with archiving and compressing, zip can also encrypt the contents. Hopefully by now you're understanding the problem. It's better to apply a separate utility that specializes in encryption rather than take what zip threw together as an afterthought. Options for encryption include ccrypt, pgp/gpg, and mcrypt. My latest favorite way uses the universally available OpenSSL suite. All of these are documented in my crypto notes.
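As one possible sketch (the file names are hypothetical and the cipher choice is just an example, not a recommendation from the crypto notes), compressing and then encrypting as two explicit, separate steps might look like this.
$ gzip dump.sql
$ openssl enc -aes-256-cbc -salt -in dump.sql.gz -out dump.sql.gz.enc   # prompts for a passphrase
$ openssl enc -aes-256-cbc -d -in dump.sql.gz.enc -out dump.sql.gz      # decrypts it later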
I'm writing this carefully because I often have a difficult time
properly convincing people that the Unix way is correct. To further
make the point, let me share an illustrative example of someone using
the Unix way, but doing it wrong. Using zip would have been even more wrong, but take a look at this archive.
$ tar -tvzf chembl.tar.gz
0 2016-11-14 06:23 chembl/
9044737177 2016-11-14 04:09 chembl/chembl.sql
1001 2016-11-14 06:23 chembl/INSTALL
It's ok. It's a normal gzip compressed tar archive, but it could be better. The problem is that there are two files, one tiny and one huge. Both are compressed with the archive. If I want to read the tiny INSTALL file, I will have to unpack the entire archive including the 9GB file.
The correct way would have been to have it set up like this.
$ tar -tvf chembl.tar
0 2016-11-14 06:23 chembl/
2030884202 2016-11-14 04:09 chembl/chembl.sql.gz
1001 2016-11-14 06:23 chembl/INSTALL
Packaged like this, the tiny file doesn't even get compressed. Why should it be? Now when I extract the tar archive I'll wind up with a 2GB file and the INSTALL file which I can then read. The files extracted from the archive will be about the same size as the archive. This kind of fine control can be very helpful when you're pushing the limits of your hardware or piping large file trees between various processes and tunnelling them to various hosts. Obviously if you're creating an archive for linear storage as on a backup tape, tar is ideal; its name comes from "tape archive".
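To make that concrete, here's a sketch (assuming the chembl/ layout from the listing) of building such an archive and later pulling out only the small member:
$ gzip chembl/chembl.sql               # compress the huge file on its own
$ tar -cvf chembl.tar chembl/          # then aggregate with no tar-level compression
$ tar -xvf chembl.tar chembl/INSTALL   # later, extract only the tiny INSTALL file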
I also feel that zip has aesthetic flaws. Sure, normal people with their little normal files just open a magical file manager window of the zip archive and they muddle through simple things fine. On such non-explicit systems I sometimes get confused about whether it really is unpacked or it just could be unpacked or what the heck is really going on. In fact even in Unix command line mode this is annoying. You can see what a zip archive contains with unzip -l and the special zipinfo command, but I don't think that zip proper, the main command, actually has a flag to just dump the archive's contents (it's not -l which is, inconsistently, --to-crlf).
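For reference, here's how inspecting a hypothetical chembl.zip without unpacking it looks with those tools.
$ unzip -l chembl.zip               # list the contents without extracting
$ zipinfo chembl.zip                # a more detailed listing
$ unzip chembl.zip chembl/INSTALL   # extract a single member if you must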
Obviously you need to know how to handle zip files. They aren't going away. Indeed, those master turd polishers known as Java programmers have something called jar (Java archive) as their main software distribution format and it is actually exactly a zip file with some prescribed contents. But please, if you're going to package and/or compress and/or encrypt some files, please consider doing it properly.
UPDATE 2022-04-22 You know what's worse than a zip file? A zip file made on a Mac. The reason is that Macs like to litter your file system with all kinds of useless garbage. To show exactly what the problem is I will show exactly how to fix it. If you receive a zip file from a Mac person, try making a copy of it (just in case) and running the following commands (where ZF is the name of the zip file).
zip "$ZF" -d "__MACOSX/*" ; zip "$ZF" -d "*/.DS_Store"
That should remove a lot of Mac file system trash right out of the zip
archive. Unfortunately, it doesn't always get it all. To get it all
you have to hunt it down (unzip -l $ZF) or unpack it and delete the results. This seemed more effective in my tests.
mkdir junk ; cd junk ; unzip $ZF
rm -rv __MACOSX ; find . -iname "*DS_Store*" -exec rm -v '{}' \;
Obviously when deleting things, test it before you go all in.