Why Your Zip Files Irk Me

:date: 2016-11-29 18:31
:tags:

In the world of normal people, when a collection of files needs to be downloaded from the internet as a single entity, the odds are good that those files will be packaged into a zip file. This is a shame.

The reason is that it lets people go on thinking that zip is a good way to bundle files, and it is not. It may be an adequate way, and it may work fine; Linux and Mac people certainly have an unzip command at their disposal. But the problem is more subtle than the mere ability to unpack things. The reason I know that zip is a bad idea is that I know of a better way. Of course I'm talking about the Unix way.

The problem with zip is that it packages up files and it compresses them, as one inseparable operation (let's ignore obscure options like -n that may suppress the default compression for certain suffixes). Why this is bad can be seen with two examples.

Imagine I had a collection of mp3s that I wanted to make available for download as a set. For these files zip will pointlessly "waste its time trying to compress them", to quote the zip man page. The problem is that mp3 files are already compressed, so compressing them again at packaging time accomplishes essentially nothing.
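
A quick sketch of the difference (the music directory and its contents are hypothetical):

$ zip -r music.zip music/      # zip grinds away at already-compressed files for nothing
$ zip -r -0 music0.zip music/  # -0 at least says store, don't compress
$ tar -cf music.tar music/     # the Unix way: just aggregate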

Here's another example. Let's say I had an enormous file containing an ASCII text SQL dump. I want to compress this, because compression will be very effective on it, but there is no need for any kind of archive container. Yet if I zip the file, I have to work with it as if it were a collection of one. Why should there be any accounting for possible other files when I know I just want to compress a single file? Have you ever purchased a single bag of snack food at a store and had the cashier ring you up and ask if you'd like a bag? If anything like that ever happens to you, I hope you think the same thought I always do: "No thanks, it's already got a bag." How many freaking bags do you need?
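
For a single file, compression alone is all you need; no container required (dump.sql here stands in for any big text dump):

$ gzip dump.sql                  # one file in, one smaller file out: dump.sql.gz
$ gunzip -c dump.sql.gz | head   # peek at the contents without creating any unpacked copy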

The Unix way breaks the fundamental operations into separate steps so that each can be used when needed and skipped when not. Not only that, but this architecture lets you swap out the parts if they are not suitable. The normal Unix way to bundle a collection of files into a single file is tar (my notes). Note that tar by default does not compress anything. It can and often does, but its essential job is aggregating files into a single manageable file.

If you don't like tar for some reason, there are other options. An even more ancient archiving system is cpio (Mac people also have it ready to go by default). This may seem like pointless ancient history, but Linux distributions (e.g. Red Hat) still seem to think that cpio archives have advantages for building the initial ram disks used in OS booting. To really drive home the point that there is more than one way to skin this cat, Unix has yet another very common archiving tool called ar (Macs also have this by default; see man ar). It is mostly used to package library object files, but nothing prevents you from rounding up your mp3s into an ar archive if you want. And if your files are all the exact same size, which isn't especially uncommon with many data sources, you can pack them with a simple Unix cat and unpack them with the Unix split command, as sketched below.
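
A sketch of each of these doing pure aggregation, with hypothetical file names:

$ tar -cf songs.tar *.mp3             # tar: archive only, no compression
$ ls *.mp3 | cpio -o > songs.cpio     # cpio reads the file list from stdin
$ ar r songs.a *.mp3                  # even ar will round them up
$ cat part1.dat part2.dat > packed    # same-size files: cat packs them...
$ split -b 1M packed part_            # ...and split unpacks (GNU syntax, assuming 1MiB parts)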

The huge point here is that the Unix way conceptually separates the archiving from the compression (even if they occur simultaneously). Unix has even more ways to compress things. The classic way is gzip (ready to go on Macs). This is not merely a GNU implementation of zip; it is quite a different beast. First, it does no multi-file aggregation. That is, quite properly, outside its scope. With gzip you can specify exactly how you want the files compressed (--fast, which is -1, through --best, which is -9; the default is -6), and it's a lot easier to figure out how to get what you want since there's no archiving cruft to wade through.
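
Because the two jobs are separate, you can chain them explicitly and tune each stage on its own (mydir is a hypothetical directory):

$ gzip --best dump.sql                        # same as -9: slowest, smallest
$ tar -cf - mydir | gzip -1 > mydir.tar.gz    # archive, then compress fast, in one pipeline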

But that's the tip of the iceberg for compression. The real power comes from being able to pick and choose which compression program you want to use. There is an ancient compression system called, plainly, compress. I don't have it on my Debian Linux system by default, but Macs do seem to ship it naturally. The reason it's not so common today is not compatibility (gzip can still uncompress files reduced with compress); it's that there are now simply more effective compression algorithms. Chief among them is bzip2 (installed on Macs), which works very much like gzip but with extra aggressive compression. The Linux kernel maintainers seem to prefer the compression program xz (also ready on Macs). Another one is 7z, which seems to appeal to Windows people. I find it slightly annoying because, like zip, it can also archive files (badly; it doesn't preserve file order), but Linux and Mac today have perfectly good 7z utilities by default and it compresses better than zip.
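
Swapping compressors is then just a matter of changing one stage of the pipeline (mydir again hypothetical; the last line assumes the p7zip flavor of the 7z command):

$ tar -cf - mydir | bzip2 > mydir.tar.bz2
$ tar -cf - mydir | xz > mydir.tar.xz
$ tar -cf - mydir | 7z a -si mydir.tar.7z    # -si tells 7z to read from stdin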

Not content with archiving and compressing, zip can also encrypt the contents. Hopefully by now you're seeing the pattern. It's better to apply a separate utility that specializes in encryption than to take what zip threw together as an afterthought. Options for encryption include ccrypt, pgp/gpg, and mcrypt. My latest favorite way uses the universally available OpenSSL suite. All of these are documented in my crypto notes.
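
A sketch of the OpenSSL approach (the -pbkdf2 option needs OpenSSL 1.1.1 or newer; it prompts for a passphrase):

$ tar -czf - mydir | openssl enc -aes-256-cbc -pbkdf2 -out mydir.tgz.enc
$ openssl enc -d -aes-256-cbc -pbkdf2 -in mydir.tgz.enc | tar -xzf -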

I'm writing this carefully because I often have a hard time convincing people that the Unix way is correct. To further make the point, let me share an illustrative example of someone using the Unix way but doing it wrong. Using zip would have been even more wrong, but take a look at this archive.

$ tar -tvzf chembl.tar.gz
0 2016-11-14 06:23 chembl/
9044737177 2016-11-14 04:09 chembl/chembl.sql
1001 2016-11-14 06:23 chembl/INSTALL

It's ok. It's a normal gzip-compressed tar archive. But it could be better. The problem is that there are two files, one tiny and one huge, and both are compressed together as the archive. If I want to read the tiny INSTALL file, I have to decompress the entire archive, including the 9GB SQL dump.
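
To see the cost concretely, extracting just the tiny file still forces a decompression pass over everything stored before it in the stream.

$ tar -xzf chembl.tar.gz chembl/INSTALL   # gunzips through the 9GB member first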

The correct way would have been to set it up like this.

$ tar -tvf chembl.tar
0 2016-11-14 06:23 chembl/
2030884202 2016-11-14 04:09 chembl/chembl.sql.gz
1001 2016-11-14 06:23 chembl/INSTALL

Packaged like this, the tiny file doesn't even get compressed. Why should it be? Now when I extract the tar archive I wind up with a 2GB file and the INSTALL file, which I can read immediately. The files extracted from the archive will be about the same size as the archive itself. This kind of fine control can be very helpful when you're pushing the limits of your hardware, or piping large file trees between processes and tunnelling them to other hosts. Obviously, if you're creating an archive for linear storage, as on a backup tape, tar is ideal; its name comes from "tape archive".
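
A sketch of constructing it that way:

$ gzip -9 chembl/chembl.sql            # compress only the huge file, in place
$ tar -cf chembl.tar chembl            # then archive with no compression
$ tar -xf chembl.tar chembl/INSTALL    # anyone can later grab the tiny file cheaply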

I also feel that zip has aesthetic flaws. Sure, normal people with their little normal files just open a magical file manager window onto the zip archive and muddle through simple things fine. On such non-explicit systems I sometimes get confused about whether the thing really is unpacked, or merely could be unpacked, or what the heck is actually going on. Even in Unix command line mode this is annoying. You can see what a zip archive contains with unzip -l or the special zipinfo command, but I don't think that zip proper, the main command, has a flag to simply dump an archive's contents (it's not -l, which is, inconsistently, --to-crlf).

Obviously you need to know how to handle zip files. They aren't going away. Indeed, those master turd polishers known as Java programmers have something called jar (Java archive) as their main software distribution format, and it is literally a zip file with some prescribed contents. But please, if you're going to package and/or compress and/or encrypt some files, please consider doing it properly.
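
If you doubt the jar claim, unzip is happy to demonstrate (myapp.jar is hypothetical):

$ unzip -l myapp.jar    # lists META-INF/MANIFEST.MF and friends, like any zip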

UPDATE 2022-04-22 You know what's worse than a zip file? A zip file made on a Mac. The reason is that Macs like to litter your file system with all kinds of useless garbage. To show exactly what the problem is, I will show exactly how to fix it. If you receive a zip file from a Mac person, try making a copy of it (just in case) and running the following commands (where ZF holds the full path of the zip file).

zip -d "$ZF" "__MACOSX/*" ; zip -d "$ZF" "*/.DS_Store"

That should remove a lot of Mac file system trash right out of the zip archive. Unfortunately, it doesn't always get it all. To get it all you have to hunt it down (unzip -l "$ZF") or unpack the archive and delete the offenders. This second approach seemed more effective in my tests.

mkdir junk ; cd junk ; unzip "$ZF"
rm -rv __MACOSX ; find . -iname "*DS_Store*" -exec rm -v '{}' \;

Obviously when deleting things, test it before you go all in.