Wednesday, October 1 2008, 08:00
The PAQ compression algorithm
We all know the most popular file compression formats, from the rather outdated and outperformed ZIP to the quite popular 7-Zip, with the RAR format in between. In the last few years, the 7-Zip format has gained a lot in popularity and is regarded by many as the compression format offering the best ratio, but is that really true? If I told you that there is a format able to compress the 438 MB of Office 2007 setup files down to a 7 MB archive, you might not believe me, or worry for my mental stability, but having tested it I can confirm it is true, and that there is no magic here, just a very advanced algorithm: PAQ.
Formats derived from or inspired by ZIP are all more or less based on traditional compression algorithms (such as the various Lempel-Ziv flavors), combined with a few others to achieve an efficient compression ratio. PAQ, on the other hand, relies on advanced, highly intelligent techniques (such as artificial neural networks) and does much more (or does it more thoroughly) than its aforementioned counterparts to reach a much higher compression rate. For example, PAQ can use specialized algorithms depending on the format of the source file being compressed, and it does so for each file: it will compress pictures and executable files differently in order to take the best approach for each and get the most optimized output in terms of size.
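To give a rough idea of the context-mixing approach at PAQ's core, here is a toy sketch (my own illustration, not PAQ's actual code): several simple models each predict the probability that the next bit is 1, and a neural-network-style mixer combines those predictions in the logistic domain, adjusting its weights after every bit.

```python
import math

def stretch(p):
    """Logit: map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def squash(x):
    """Inverse logit: map the real line back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

class Mixer:
    """Toy PAQ-style logistic mixer: one weight per model, online updates."""
    def __init__(self, n_models, lr=0.02):
        self.w = [0.0] * n_models       # one weight per model
        self.lr = lr                    # learning rate
        self.inputs = [0.0] * n_models  # last stretched predictions

    def predict(self, probs):
        # Mix in the logistic (stretched) domain, as PAQ variants do.
        self.inputs = [stretch(p) for p in probs]
        return squash(sum(w * x for w, x in zip(self.w, self.inputs)))

    def update(self, p, bit):
        # Gradient step: reward the models that predicted the actual bit.
        err = bit - p
        self.w = [w + self.lr * err * x for w, x in zip(self.w, self.inputs)]

# Usage: model 0 always says "probably 1", model 1 says "probably 0".
# Feed a stream of 1 bits and the mixer learns to trust model 0.
mixer = Mixer(2)
p = 0.5
for _ in range(200):
    p = mixer.predict([0.9, 0.1])
    mixer.update(p, 1)      # the observed bit is always 1
print(round(p, 2))          # the mixed prediction drifts toward 1
```

A real PAQ implementation mixes hundreds of context models this way, bit by bit, feeding the mixed probability into an arithmetic coder, which is where the heavy CPU and memory cost comes from.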
Unfortunately, as I wrote in the introduction, there are no miracles: to get 438 MB down to 7 MB, some downsides obviously have to exist, and there is a price to pay, or more precisely a virtue to have: patience, and a lot of it. One of the key particularities of PAQ is that it gets its awesome compression ratio at the expense of speed and memory usage. It isn't really that surprising, considering how excellent compression ratios are generally obtained at the expense of something else: for example, JPEG and MP3, which are great performers in picture and audio compression respectively, use lossy algorithms to get there, which means a (generally acceptable) loss of data and thus of quality.
To give another example, I compressed an MS Access database file (MDB) weighing 151 MB using PAQ (more exactly the paq8o variant) at the maximum level (8), which gave me a 3.89 MB file as output. While this seems less impressive than the Office 2007 example at first glance, consider that the same file compressed with 7-Zip at the “Ultra” settings resulted in a 14.4 MB file: almost 4 times the size of the PAQ-packed version. Keeping in mind that 7-Zip is generally showcased as the most efficient compression format in general computing, that certainly isn't bad. However, compressing this file with PAQ took no less than 8 hours on a Core 2 Duo T7700 CPU running at 2.40 GHz (not exactly a slow or prehistoric machine, though running 32-bit Windows XP) and required roughly 2 GB of RAM… I prefer not to imagine the time it would have taken on a 3-year-old computer. For reference, compressing the same file with 7-Zip's “Ultra” preset took roughly 2 minutes and 10 seconds. The question then is: is it worth waiting approximately 220 times longer to compress a file in exchange for a 4 times better compression ratio? The problem is that nowadays we deal with data sizes much larger than 150 MB, which is actually a minimum in many IT activities, and having to wait 8 hours is impractical in most uses; about the only realistic way to use it is as a nightly process.
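The ratios quoted above come straight from the measured figures; a quick check for the curious:

```python
# Sanity check of the figures above (sizes in MB, times in seconds).
paq_size, zip7_size = 3.89, 14.4
paq_time = 8 * 3600              # ~8 hours with paq8o at level 8
zip7_time = 2 * 60 + 10          # ~2 min 10 s with 7-Zip "Ultra"

print(round(zip7_size / paq_size, 1))  # size advantage: ~3.7x ("almost 4 times")
print(round(paq_time / zip7_time))     # time penalty: ~222x ("approximately 220 times")
```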
Usually, when using more traditional compression tools, I don't worry much about compression time and prefer to get the smallest file possible, especially since it doesn't take noticeably longer to unpack a highly compressed RAR or 7-Zip archive than one compressed with medium settings. Unfortunately, PAQ doesn't work the same way, and even unpacking the file takes a long time. Unpacking the database took more than 2 hours, which is way too much for the user on the receiving end, who may not have a high-end system. I initially intended to use this format to send data files faster to the UK-based headquarters of the company I work for, as I often need to transfer several gigabytes to their servers every month, but the compression time required is so high that even sending the files totally uncompressed would be much faster. The only cases where it could be useful are over very slow links with very powerful machines on both the sending and receiving ends, or for long-term archiving of data that is not going to be needed frequently.
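To see why sending the files uncompressed can actually win, here is a rough end-to-end estimate using my 151 MB test figures. The 2 Mbit/s link speed and the 7-Zip decompression time are assumptions for illustration only; plug in your own.

```python
LINK_MBIT = 2.0                     # assumed link speed, Mbit/s

def transfer_s(size_mb):
    """Seconds spent on the wire for a file of the given size in MB."""
    return size_mb * 8 / LINK_MBIT

scenarios = {
    # name: (archive size MB, compress s, decompress s)
    "uncompressed": (151.0, 0, 0),
    "7-Zip Ultra":  (14.4, 130, 20),            # decompress time is a rough guess
    "paq8o -8":     (3.89, 8 * 3600, 2 * 3600), # measured on my T7700
}
totals = {}
for name, (size, comp, decomp) in scenarios.items():
    totals[name] = comp + transfer_s(size) + decomp
    print(f"{name}: {totals[name] / 60:.0f} min end to end")
```

Even on a link this slow, the compression and decompression hours dwarf the transfer time saved, so PAQ loses to sending the raw file; it only starts paying off on far slower links or when the archive is transferred many times.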
Command-line versions implementing the original PAQ algorithms are available, among other places, in the PeaZip compression suite, and implementations of PAQ up to its seventh version are available in KGB Archiver, whose newest version is still in beta after one year. KGB Archiver is written on the .NET Framework but luckily seems to use native-code DLL modules for the compression routines. It also supports SFX (self-extracting) archives, which is probably a must for this still very rare format (whose speed shortcomings probably prevent it from growing more popular). Unfortunately, the GUI still needs a lot of polish, especially the shell extension: the option to compress files or folders isn't offered for all files, but only for those on the desktop. The file manager, while working decently, still lags behind those of the more common archivers, but we have to keep in mind that this is still a beta version. It should be noted, however, that the PAQ7 variant used by KGB Archiver at its maximum settings compressed the 151 MB file in 1 hour and 32 minutes, much better than the 8 hours it took with the paq8o command-line utility. The resulting file was somewhat larger, at 5.20 MB, once again confirming that the more time PAQ is given, the smaller the output is going to be. Decompressing the file still took an unacceptable hour or so, which makes KGB a slightly more usable PAQ implementation, but far from solving the format's main problem.
I believe the PAQ algorithms probably represent the future of general-purpose compression, as the other formats are probably near their limits, compression-ratio-wise. Adding newer algorithms to, or improving, older ones like Deflate and the other LZ-based algorithms isn't going to improve the compression rate severalfold the way other approaches can. The biggest problem of this format is that the time required for both compression and decompression is far too discouraging to win new users, which in turn does not motivate many developers to start writing software supporting the format, or to include support for it in existing tools (WinRAR, 7-Zip, etc.). If the only downside were compression time, it might have been acceptable to many, but the decompression time is almost as discouraging as the compression time, and it is even more shocking to end users, since the algorithms used in older formats allow rather fast decompression even at the maximum settings. As sacrificing time for compression ratio is one of the core properties of the PAQ algorithms, the only hope for this format is that Moore's law will compensate for PAQ's slowness by providing faster machines and processors. While it is true that the computational power of PCs has constantly increased over time, it takes a while for the majority of the population to upgrade to newer machines, and even if CPUs get fast enough to decompress PAQ-based archives as quickly as we decompress ZIP-based ones today, a lot of users would be left behind by their older machines' specs, and a compression format only gets the popularity it deserves once everyone is able to use it decently.
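Out of curiosity, a back-of-the-envelope estimate of how long Moore's law would need to erase the roughly 220x slowdown I measured, under the optimistic assumption that effective speed doubles every two years:

```python
import math

# Assumption for illustration: effective speed doubles every ~2 years.
speed_gap = 220.0
doublings = math.log2(speed_gap)   # doublings needed to close the gap
years = doublings * 2
print(round(doublings, 1), round(years))   # prints: 7.8 16
```

About eight doublings, or something like fifteen years, before a PAQ run feels like today's 7-Zip run, and that assumes single-threaded speed keeps scaling as it used to.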
People interested in the algorithms used by PAQ and wanting to learn more about them may want to read the very thorough Wikipedia page about it.