Sort compressed tar archives to make them smaller… 20% smaller!

How would you like for your file archives to be 20% smaller, with just the tools every Linux distribution already provides and a little ingenuity?  Read on and see how I did it!

There was a folder “NESRen” that I wanted to pack up for archival, and I knew it contained many files that share the same data, minus a few changes here and there.  When packing them up and compressing them into a tarball archive, I knew that I would achieve better compression if these largely similar files were put side-by-side in the archive so that the repeated blocks of data “compress themselves away” and take up almost no space.  Unfortunately, the GNU “tar” command in nearly every Linux distribution packs up files and folders in the order that the underlying filesystem returns them, which is usually arbitrary and not optimal for compression.

How do we make tar put files in an order which will compress better?  The answer is to use the tar -T option, which lets you feed tar a list of files to pack up.  The list is processed line-by-line, and each file is packed up in the order provided.  You can, for example, create a list of files with the “find” command, then hand-edit that list to be optimal, and pass the list to tar (you must use the --no-recursion option when creating the archive from this list since “find” already produces a recursive list, and it is safest to put that option before -T, because recent GNU tar releases apply it only to the names that follow it):

find folder_of_files/ > list.txt
vi list.txt
tar -cf - --no-recursion -T list.txt | xz > archive.tar.xz
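
If you want to convince yourself that tar really keeps the order of the list, here is a minimal sketch that runs in a throwaway directory.  The folder contents are made up for illustration, “sort -r” stands in for the hand-editing step, and I list only regular files (-type f) so the archive listing can be compared to the list line for line:

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for folder_of_files/.
mkdir -p folder_of_files/a folder_of_files/b
echo one > folder_of_files/a/x.txt
echo two > folder_of_files/b/x.txt
echo three > folder_of_files/a/y.txt

# Build the list, then reorder it (reversed here, instead of editing in vi).
find folder_of_files/ -type f | LC_ALL=C sort -r > list.txt

# --no-recursion comes before -T so it applies to the names read from the list.
tar -cf archive.tar --no-recursion -T list.txt

# List the archive members and compare them with the list.
tar -tf archive.tar > members.txt
diff list.txt members.txt && echo "archive order matches list.txt"
```

The temporary directory is left in place so you can poke at the results; delete it when you are done.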

In my case, however, the folder structure and naming conventions allowed for creative use of the “sort” command to arrange the files.  Since there is one folder “NESRen” followed by a series of categorizations, followed by the file names themselves (e.g. “NESRen/World/Pinball (JU).nes”), I can do something like this to make all files with the same name sort beside each other, regardless of the name of the category directory (“sort” with no options would group them by category directory instead):

find NESRen | sort -t / --key=3 | \
  tar -cvf - --no-recursion -T - | xz -e > NESRen.tar.xz
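
To see what that sort does to the order, here is a small sketch on a few sample paths.  Only “Pinball (JU).nes” comes from my real folder; the other names are invented for the example, and LC_ALL=C is my own addition to keep the ordering predictable across locales:

```shell
# Hypothetical sample paths; only the first one is from the real folder.
paths='NESRen/World/Pinball (JU).nes
NESRen/Europe/Zelda (E).nes
NESRen/USA/Pinball (U).nes
NESRen/USA/Zelda (U).nes'

# Plain sort compares whole lines, so files group by category directory.
plain=$(printf '%s\n' "$paths" | LC_ALL=C sort)

# Sorting from the third slash-separated field puts same-named files together.
by_name=$(printf '%s\n' "$paths" | LC_ALL=C sort -t / --key=3)

printf 'plain sort:\n%s\n\nsort -t / --key=3:\n%s\n' "$plain" "$by_name"
```

In the second listing the two Pinball files end up next to each other, followed by the two Zelda files, which is exactly the neighborhood the compressor needs to spot the shared data.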

The “-t /” tells sort to use a slash as a field delimiter, and --key=3 tells it to sort by the third field onward (NESRen is field 1, the folder is 2, the file is 3).  What kind of difference did that make for the size of my .tar.xz archive?  Take a look (-nosort archive created with “tar -cf - NESRen | xz -e > NESRen-nosort.tar.xz”):

Size of each file in bytes:

212958664 NESRen-nosort.tar.xz
170021312 NESRen.tar.xz

Size of the original folder and each file in megabytes:

705M    NESRen
204M    NESRen-nosort.tar.xz
163M    NESRen.tar.xz

By sorting the files, I saw a 20.1% drop in archive size using the exact same compression method, with a total compression ratio of 23.1% versus the unsorted 28.9%.  That’s a huge difference!  If this were 70.5GB instead of 705MB and the data exhibited identical performance, the final archive would be 4.1GB smaller.  That is nearly the entire capacity of a single-layer DVD-R in space savings, just by sorting the file names before compression.
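
If you want to check the math, here is a quick sketch using the megabyte figures from the listing above.  The variable names are just for illustration, and awk is used only because the shell itself cannot do floating-point arithmetic:

```shell
# Sizes in megabytes, taken from the listing above.
original=705
unsorted=204
sorted=163

# Each value is a percentage rounded to one decimal place.
drop=$(awk -v u="$unsorted" -v s="$sorted" 'BEGIN { printf "%.1f", (u - s) / u * 100 }')
ratio_sorted=$(awk -v o="$original" -v s="$sorted" 'BEGIN { printf "%.1f", s / o * 100 }')
ratio_unsorted=$(awk -v o="$original" -v u="$unsorted" 'BEGIN { printf "%.1f", u / o * 100 }')

# Scale the saving up 100x (705MB -> 70.5GB) and express it in GB.
saved_gb=$(awk -v u="$unsorted" -v s="$sorted" 'BEGIN { printf "%.1f", (u - s) * 100 / 1000 }')

echo "size drop from sorting: ${drop}%"
echo "sorted ratio:           ${ratio_sorted}%"
echo "unsorted ratio:         ${ratio_unsorted}%"
echo "saving at 70.5GB scale: ${saved_gb}GB"
```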

Applying a similar sort-then-compress process to the packing of the “ext” version of the Tritech Service System reduced the total size of the archive containing “ext” by 700KB.  Of course, that is a much smaller gain in relative terms, since the archive was already only 32.7MB in size (700KB is a 2.1% reduction), but it still means shorter load and boot times due to less overall data to handle.

Next time you’re packing a lot of stuff up, see if you can use these tricks to improve your compression ratio.
