zlib

Technical Details


Maximum Expansion Factor

zlib's compression method, an LZ77 variant called deflation, emits compressed data as a sequence of blocks. Several block types are allowed; one of them, the stored block, consists simply of the raw input data plus a few header bytes. In the worst possible case, where all other block types would expand the data, deflation falls back to stored blocks. Thus the only expansion is an overhead of 5 bytes per 32 KB block (about 0.02%), plus a one-time overhead of six bytes for the entire stream. In the absolute worst case of a single-byte input stream, the overhead therefore amounts to 1100% (11 bytes of overhead, 1 byte of actual data). For larger stream sizes, the overhead approaches the limiting value of 0.02%.
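To make the arithmetic concrete, here is a small sketch (not part of zlib; the 32 KB block size and the 5- and 6-byte overheads are simply the figures quoted above) that computes the worst-case output size for a given input length:

    #include <stdio.h>

    /* Worst-case deflate output size per the description above:
     * 5 bytes of header per 32 KB stored block, plus a one-time
     * 6 bytes of zlib stream overhead.  Illustrative only. */
    static unsigned long worst_case_size(unsigned long input_len)
    {
        unsigned long blocks = (input_len + 32767) / 32768;  /* round up */
        if (blocks == 0)
            blocks = 1;            /* even an empty stream needs a block */
        return input_len + 5 * blocks + 6;
    }

    int main(void)
    {
        printf("1 byte in -> at most %lu bytes out\n", worst_case_size(1));
        printf("1 MB in   -> at most %lu bytes out\n", worst_case_size(1048576UL));
        return 0;
    }

For the single-byte case this prints 12 bytes, i.e. the 11 bytes of overhead plus 1 byte of actual data mentioned above.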


Maximum Compression Factor

Empirically, the deflate method is capable of compression factors exceeding 1000:1. (The test case was a 50 MB file filled with zeros; it compressed to roughly 49 KB.) Mark loves to calculate stuff like this and reports that the theoretical limit for the zlib format (as opposed to its implementation in the currently available sources) is 1032:1. To quote him,

The limit comes from the fact that one length/distance pair can represent at most 258 output bytes. A length requires at least one bit and a distance requires at least one bit, so two bits in can give 258 bytes out, or eight bits in give 1032 bytes out. A dynamic block has no length restriction, so you could get arbitrarily close to the limit of 1032:1.

He goes on to note that the current implementation limits its dynamic blocks to about 8 KB (corresponding to 8 MB of input data); together with a few bits of overhead, this implies an actual compression limit of about 1030.3:1. Not only that, but the compressed data stream is itself likely to be rather compressible (in this special case only), so running it through deflate again should produce further gains.
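To observe a ratio in this neighborhood yourself, a quick experiment with zlib's compress() convenience function is enough; the buffer sizes below are arbitrary choices for the sketch, not anything prescribed by the library:

    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int main(void)
    {
        uLong  src_len = 8UL * 1024 * 1024;               /* 8 MB of zeros */
        uLongf dst_len = src_len + src_len / 1000 + 64;   /* generous room */
        unsigned char *src = calloc(src_len, 1);          /* all-zero input */
        unsigned char *dst = malloc(dst_len);

        if (src == NULL || dst == NULL)
            return 1;
        if (compress(dst, &dst_len, src, src_len) != Z_OK)
            return 1;

        printf("%lu bytes in, %lu bytes out (about %lu:1)\n",
               (unsigned long)src_len, (unsigned long)dst_len,
               (unsigned long)(src_len / dst_len));

        free(src);
        free(dst);
        return 0;
    }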

By way of comparison, note that a version of run-length encoding optimized for this sort of unusual data file -- that is, one using 32-bit integers for the lengths rather than the more usual 8-bit bytes or 16-bit words -- could encode the test file in five bytes. That would be a compression factor of 10,000,000:1 (or 10.000.000:1 for you Europeans, or 10^7:1 for all of those engineers and scientists out there).
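Such an encoder is entirely hypothetical; a sketch of the idea (the five-byte layout here, a big-endian 32-bit length followed by the repeated byte value, is just an illustration, not any established format) might look like this:

    #include <stdio.h>

    /* Hypothetical run-length code: a 32-bit run length followed by
     * the byte value that repeats.  The 50,000,000-byte all-zeros
     * test file above fits in exactly five bytes. */
    static void encode_run(unsigned long length, unsigned char value,
                           unsigned char out[5])
    {
        out[0] = (unsigned char)(length >> 24);   /* length, big-endian */
        out[1] = (unsigned char)(length >> 16);
        out[2] = (unsigned char)(length >> 8);
        out[3] = (unsigned char)(length);
        out[4] = value;                           /* the repeated byte */
    }

    int main(void)
    {
        unsigned char code[5];
        encode_run(50000000UL, 0x00, code);
        printf("%02x %02x %02x %02x %02x\n",
               code[0], code[1], code[2], code[3], code[4]);
        return 0;
    }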

Finally, please note that this level of compression is extremely rare and only occurs with really trivial files (e.g., a megabyte of zeros). More typical zlib compression ratios are on the order of 2:1 to 5:1.


Memory Footprint

zlib's memory footprint can also be specified fairly precisely. It is larger for compression than for decompression, and the exact requirements depend on how the library was compiled.

The memory requirements for compression depend on two parameters, windowBits and memLevel:

deflate memory usage (bytes) = (1 << (windowBits+2)) + (1 << (memLevel+9))

For the default values of 15 and 8, respectively, this is 256 KB. Both windowBits and memLevel can be set to lower values at compile time via the MAX_WBITS and MAX_MEM_LEVEL macros, but only at a cost in compression efficiency.
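The same two parameters can also be lowered per-stream at run time through deflateInit2(); a minimal sketch, with illustrative (not recommended) values:

    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        z_stream strm;
        int windowBits = 11;   /* 2 KB window instead of the default 32 KB */
        int memLevel   = 5;    /* smaller internal state than the default 8 */

        strm.zalloc = Z_NULL;  /* use the default allocator */
        strm.zfree  = Z_NULL;
        strm.opaque = Z_NULL;

        if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                         windowBits, memLevel, Z_DEFAULT_STRATEGY) != Z_OK)
            return 1;

        /* Per the formula above: (1 << 13) + (1 << 14) = 24 KB,
         * versus 256 KB for the default values of 15 and 8. */
        printf("approx. deflate memory: %lu bytes\n",
               (1UL << (windowBits + 2)) + (1UL << (memLevel + 9)));

        deflateEnd(&strm);
        return 0;
    }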

The memory requirements for decompression depend only on windowBits, but this is, in a sense, a harsher limitation: whereas data streams compressed with a smaller window will merely be a bit larger than they would have otherwise, a reduced window size for decompression means that streams compressed with larger windows cannot be decompressed at all. Having said that:

inflate memory usage (bytes) = (1 << windowBits) + inflate_huft usage

Empirically, no more than 1041 inflate_huft structures are allocated: 166 for distance codes and 875 for literal/length codes. Each inflate_huft is the size of two pointers (i.e., four, eight or sixteen bytes, depending on whether pointers are 16-bit, 32-bit or 64-bit). Typically, therefore, inflate() requires no more than 42 KB of storage on a 32-bit machine--this includes the 32768-byte sliding window, 84 bytes of initialization overhead, 1292 bytes of inflate_blocks overhead, 8328 bytes of inflate_huft structures, and 148 bytes of memory-allocation overhead (37 allocation calls, assuming 4 bytes' overhead per call). 64 KB is almost certainly a hard upper limit.
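On the decompression side the window size is likewise a per-stream choice through inflateInit2(); a sketch with an illustrative value (and the caveat above: data compressed with a larger window will not inflate):

    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        z_stream strm;
        int windowBits = 12;      /* 4 KB window instead of the default 32 KB */

        strm.zalloc   = Z_NULL;   /* use the default allocator */
        strm.zfree    = Z_NULL;
        strm.opaque   = Z_NULL;
        strm.next_in  = Z_NULL;
        strm.avail_in = 0;

        if (inflateInit2(&strm, windowBits) != Z_OK)
            return 1;

        /* Sliding-window term of the formula above; the inflate_huft
         * tables and small objects come on top of this. */
        printf("window alone: %lu bytes\n", 1UL << windowBits);

        inflateEnd(&strm);
        return 0;
    }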


Adler-32 versus CRC-32

Both Adler-32 and CRC-32 (cyclic redundancy check) are 32-bit checks. But while the CRC can take on any 32-bit value (2^32 possibilities), Adler-32 is limited to 65521^2 possibilities. So the probability of a false positive on random errors for CRC-32 is 2.3283 x 10^-10, whereas it is very slightly higher for Adler-32 at 2.3294 x 10^-10.

The above assumes that all the values are accessible given the amount of data. That is true after only four bytes for the CRC-32, but Adler-32 requires, on the average, about 0.5 KB of data to get rolling--or 1 KB if it's ASCII data (text). So if the Adler-32 is used on significantly less than about a kilobyte, it will be noticeably weaker than a CRC-32 on the same small block.

A properly constructed CRC-n has the nice property that fewer than n bits in error are always detectable. This is not always true for Adler-32 -- Adler-32 can detect all one- or two-byte errors but can miss some three-byte errors. However, Adler-32 has been constructed to minimize the ways in which small changes in the data can produce the same check value, through the use of sums significantly larger than the bytes and by using a prime (65521) for the modulus. Some analysis in this area would be worthwhile, but it has not yet been done.
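As a reminder of the construction just described, here is a minimal, unoptimized sketch of the Adler-32 computation (the implementation shipped with zlib defers the modulo reductions for speed; this version takes them on every byte for clarity):

    #include <stddef.h>

    #define ADLER_MOD 65521UL   /* largest prime smaller than 65536 */

    /* Two running sums, each kept modulo the prime 65521,
     * packed into a single 32-bit check value. */
    static unsigned long adler32_simple(const unsigned char *buf, size_t len)
    {
        unsigned long a = 1, b = 0;
        size_t i;

        for (i = 0; i < len; i++) {
            a = (a + buf[i]) % ADLER_MOD;   /* sum of the bytes  */
            b = (b + a)      % ADLER_MOD;   /* sum of those sums */
        }
        return (b << 16) | a;               /* high half b, low half a */
    }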

This potential weakness in detecting small changes is not a major concern in the application of Adler-32 to zlib (or any other history-based compressor), since if there is an error at some point in a stream, it will be massively propagated after that. It would be of concern in an application with transmission or storage that has a borderline signal-to-noise ratio, for which small numbers of random errors are expected. For that sort of application one would certainly want to use a CRC or, better yet, Reed-Solomon error-correction coding. But even in this case, if the data being transmitted or stored uses some sort of history-dependent compression (as in zlib) and was compressible to begin with, then an Adler-32 used after decompression would be adequate since the decompressor would significantly amplify any small errors in the compressed stream. (For incompressible data, most modern compressors operate in a pass-through mode, so the original comment about using a CRC or ECC holds.)

The main reason for Adler-32 is, of course, speed in software implementations. The authors wanted a check on zlib's decompression, but not a significant speed penalty just for the check. So Mark came up with the Adler-32 as a faster but still effective alternative to the CRC-32.
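Both checks are exported by the library itself, so comparing them on your own data is straightforward; a small usage sketch:

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        const unsigned char data[] = "hello, zlib";
        uInt len = (uInt)strlen((const char *)data);

        /* Each function takes a running check value and a buffer;
         * calling it with Z_NULL and length 0 yields the initial value. */
        uLong a = adler32(0L, Z_NULL, 0);
        uLong c = crc32(0L, Z_NULL, 0);

        a = adler32(a, data, len);
        c = crc32(c, data, len);

        printf("Adler-32: %08lx\nCRC-32:   %08lx\n", a, c);
        return 0;
    }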




Last modified 20 January 1998 by newt@pobox.com, you betcha.

Copyright © 1996-1998 Greg Roelofs and Mark Adler.