[Decoded] Data Compression Part 3: Reducing Redundancy

Part 1: Information | Part 2: Data Explosion | Part 3: Reducing Redundancy

Reducing Redundancy

Many a new computer user has wondered, “What if I keep compressing the compressed file?” Zip the RAR which has been CAB’d and 7-Zipped? Won’t you eventually end up with a nice clean 1-byte file you could store on that floppy? (You young readers might not know, but way back, almost 10 years ago, many people relied on small magnetic disks which could store all of 1.44 MB.)

As explained before, this isn’t possible, since the aim of a compression algorithm is to juice out all redundancy. And the thing about redundancy is, once it’s gone, it’s gone.

Taking English text as an example, the letter ‘e’ is the most common of all letters, occurring, on average, with a probability of about 12.7%, with the runner-up ‘t’ lagging far behind at about 9.1%.

However, the thing is, how would you store an ‘e’ in less space than it currently takes? A single character requires just 1 byte of storage! This means that to save space, everywhere an ‘e’ is used we will have to store it in less than 1 byte.

In our normal encoding schemes, we represent each character as a sequence of bits, with each character having a unique sequence. Say ‘A’ is ‘01000001’, ‘B’ is ‘01000010’, and so on, giving us a code for each character. However, if every letter takes one byte, we will never be able to compress anything, since every character, no matter what its probability, takes the same amount of space.
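To see this in code, here is a tiny Python sketch (purely illustrative) that prints the fixed 8-bit code of a few characters:

  # Every character gets the same fixed-width 8-bit code, regardless of how common it is.
  for ch in "ABet":
      print(ch, format(ord(ch), "08b"))
  # A 01000001
  # B 01000010
  # e 01100101
  # t 01110100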

In order to get some compression, we need a way of generating a code for each character such that the more common letters get a shorter code, and the less common ones a longer code.

Our common ‘e’, for example could get a nice, single bit ‘0’, and ‘t’ could get a ‘1’.
Even with this simple change, we can analyze the impact on the compression:

  • Both ‘e’ and ‘t’ now use a one-bit code instead of an 8-bit code, so every occurrence of ‘e’ or ‘t’ shrinks to 1/8 (12.5%) of its original size.
  • This replacement will apply to 12.7% + 9.1% of the characters, i.e. 21.8% of all characters are either an ‘e’ or a ‘t’.
  • Therefore 21.8% of the text will be compressed to 1/8th of its size.
  • The final size will be (1/8) × 21.8% + 1 × 78.2% ≈ 81% of the original (a quick check of this arithmetic follows below).

The file will be compressed to 81% of its original size! Pretty decent for just two replacements.
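The arithmetic can be checked in a couple of lines of Python (a rough estimate, assuming the ~12.7% and ~9.1% frequencies and leaving every other character untouched):

  # Fraction of characters shrunk to 1 bit; everything else stays at 8 bits.
  short = 0.127 + 0.091                      # 'e' + 't' = 21.8% of all characters
  ratio = short * (1 / 8) + (1 - short) * 1  # weighted average of the new size
  print(f"{ratio:.1%}")                      # -> 80.9%, roughly 81%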

Of course, it is impossible to compress text this way directly. For one, how would you tell apart the ‘0’ and ‘1’ of ‘e’ and ‘t’ from the 0s and 1s which code the rest of the text? One way is to use a unique marker which ‘pads’ each character, something like the space which differentiates words. This will mean there are some restrictions on how each character can be coded.

Let us look at such an example:

In this example we use a HIGH bit (‘1’) to mark the start of each code. This means that we CANNOT use a ‘1’ anywhere else in a character’s code, as it would be mistaken for the start of the next character.

The codes for the five most frequent characters in the English language:

Letter   Compressed Code   Ext. ASCII Code
e        1                 01100101
t        10                01110100
a        100               01100001
o        1000              01101111
i        10000             01101001

As you can see, we now have a code which is variable in length, unique and identifiable, with each code beginning with a ‘1’ marker.

Let us take the code for ‘eat’:

Compressed code: 110010 -> 6 bits
Original code: 011001010110000101110100 -> 24 bits

Using this code has essentially compressed ‘eat’ to a quarter of its original size.

The decoder here will read in ‘110010’ and decode it as follows (a small code sketch of this scheme follows the walkthrough):

  1. Input first character: '1'
    • Sequence: '1'
  2. Input second character: '1'
    • Encountered beginning of new character, decode previous sequence '1' as 'e'
    • Sequence: '1'
  3. Input third character: '0'
    • Sequence: '10'
  4. Input fourth character: '0'
    • Sequence: '100'
  5. Input fifth character: '1'
    • Encountered beginning of new character, decode previous sequence '100' as 'a'
    • Sequence: '1'
  6. Input sixth character: '0'
    • Sequence: '10'
  7. End of input encountered, decode previous sequence '10' as 't'
  8. Output: 'eat'
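To make the scheme concrete, here is a minimal Python sketch of this marker-based code, using the five-letter table above (the names MARKER_CODE, encode and decode are just illustrative):

  # '1' marks the start of each character's code; the rest of each code is all zeros.
  MARKER_CODE = {"e": "1", "t": "10", "a": "100", "o": "1000", "i": "10000"}
  DECODE = {v: k for k, v in MARKER_CODE.items()}

  def encode(text):
      return "".join(MARKER_CODE[ch] for ch in text)

  def decode(bits):
      out, seq = [], ""
      for bit in bits:
          if bit == "1" and seq:      # a new '1' starts the next character
              out.append(DECODE[seq])
              seq = ""
          seq += bit
      if seq:                         # flush the final character
          out.append(DECODE[seq])
      return "".join(out)

  print(encode("eat"))     # 110010
  print(decode("110010"))  # eat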

It is clear, though, that as we go further down the frequency list, the size of each character’s code sequence will increase, until we reach ‘h’, with a frequency of ~6.1% and a code of ‘10000000’, which is already 8 bits. Beyond this point each code sequence becomes larger than the original 8 bits, and causes an expansion in size rather than a compression. So:

'e' with 12.702% frequency, will be compressed to 1/8 its size.
't' with 09.056% frequency, will be compressed to 1/4 its size.
'a' with 08.167% frequency, will be compressed to 3/8 its size.
'o' with 07.507% frequency, will be compressed to 1/2 its size.
'i' with 06.966% frequency, will be compressed to 5/8 its size.
'n' with 06.749% frequency, will be compressed to 3/4 its size.
's' with 06.327% frequency, will be compressed to 7/8 its size.

After that there will only be an expansion. A little calculation will show you that these characters combined will be compressed to around 44.6% of their original size. This will be offset by the expansion encountered in the other characters.
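That 44.6% figure is just the frequency-weighted average of the ratios listed above; a quick Python check (using the same frequencies and code lengths):

  # Weighted average size of the seven letters that still shrink under the marker code.
  freqs = {"e": 12.702, "t": 9.056, "a": 8.167, "o": 7.507,
           "i": 6.966, "n": 6.749, "s": 6.327}
  bits  = {"e": 1, "t": 2, "a": 3, "o": 4, "i": 5, "n": 6, "s": 7}
  total = sum(freqs.values())
  avg = sum(freqs[c] * bits[c] / 8 for c in freqs) / total
  print(f"{avg:.1%}")   # -> 44.6% of the original size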

This scheme is quite limiting and inefficient, and will yield very little compression in most real-world cases. Real compression mechanisms use more sophisticated ways of assigning a code to each character, such as:

Letter   Compressed Code
e        0
t        10
a        110
o        111

As you can see here, each code is unique, and no code is a prefix of another, which is what makes the sequence decodable without separators. The word ‘eat’ here will be encoded as ‘011010’, again a 6-bit code, a quarter of the original size.

This will be decoded in the following manner (a decoder sketch follows the walkthrough):

  1. Input first character: '0'
    • Sequence: '0'
      • Matches character 'e'
  2. Input second character: '1'
    • Sequence: '1'
      • No match
  3. Input third character: '1'
    • Sequence: '11'
      • No match
  4. Input fourth character: '0'
    • Sequence: '110'
      • Matches character 'a'
  5. Input fifth character: '1'
    • Sequence: '1'
      • No match
  6. Input sixth character: '0'
    • Sequence: '10'
      • Matches character 't'
  7. Output: 'eat'
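Here is a minimal Python sketch of such a prefix-code decoder, using the four-letter table above (names are illustrative); because no codeword is a prefix of another, the first match is always the right one:

  # Decode a prefix code by growing the current bit sequence until it matches a codeword.
  PREFIX_CODE = {"e": "0", "t": "10", "a": "110", "o": "111"}
  DECODE = {v: k for k, v in PREFIX_CODE.items()}

  def decode(bits):
      out, seq = [], ""
      for bit in bits:
          seq += bit
          if seq in DECODE:           # a full codeword has been read
              out.append(DECODE[seq])
              seq = ""
      return "".join(out)

  print(decode("011010"))  # eat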

If you compare the representation of ‘eat’ in ASCII and in the compressed code, you will notice that the encoded result ‘011010’ has fewer runs of repeated digits than ‘011001010110000101110100’. The redundancy has been squeezed out, and it is difficult to get any further space savings by running the same compression scheme again. With a longer input you will notice that in the compressed file the symbols occur with an almost even distribution.

All this is great; however, these character frequencies / probabilities only hold on average. An average text document might see decent compression with such a code, but ‘Gadsby’ (a novel written entirely without the letter ‘e’) won’t get much love from such a compressor!

Compressors overcome this by generating their own dictionary of frequencies from the data itself, and using those to assign codes. Even multi-character sequences which are common can be assigned a much shorter code if they occur frequently enough.
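One classic way of doing this is Huffman coding: count how often each symbol occurs, then repeatedly merge the two least frequent groups, so that the most common symbols end up with the shortest prefix-free codes. A minimal sketch (single-character symbols only; variable names are my own):

  import heapq
  from collections import Counter

  def huffman_code(text):
      # One heap entry per distinct symbol: (frequency, tie-breaker, {symbol: code-so-far}).
      heap = [(freq, i, {sym: ""})
              for i, (sym, freq) in enumerate(Counter(text).items())]
      heapq.heapify(heap)
      tie = len(heap)
      while len(heap) > 1:
          f1, _, left = heapq.heappop(heap)    # the two least frequent groups
          f2, _, right = heapq.heappop(heap)
          merged = {s: "0" + c for s, c in left.items()}   # prepend a bit to each side
          merged.update({s: "1" + c for s, c in right.items()})
          heapq.heappush(heap, (f1 + f2, tie, merged))
          tie += 1
      return heap[0][2]

  codes = huffman_code("this is an example of a huffman tree")
  print(codes)   # frequent symbols (like the space) get the shortest codes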

Each compression mechanism has its own way of finding patterns and assigning codes. A compressor which ‘understands’ the data will do better than one which doesn’t, since it knows where to look for the patterns. This is why you are bound to get better compression of an image file with the lossless PNG format than with a standard ZIP file. But no matter how sophisticated the compressor, and no matter what kind of data it deals with, it most likely operates on these simple principles.
 

Part 1: Information | Part 2: Data Explosion | Part 3: Reducing Redundancy

 

 

Kshitij Sobti
Digit.in