[Decoded] Data Compression Part 1: Information

By Kshitij Sobti | Updated on 11-Sep-2009

11-Sep-2009

Part 1: Information | Part 2: Data Explosion | Part 3: Reducing Redundancy

Information

How do we define information? How do we quantify the amount of information that any piece of data contains? In computers, we measure information in bits and bytes (or more accurately now Megabytes and gigabytes).

While this is surely a good way to ascertain the amount of space the data will take to store in a binary format, it lacks to quantify the real worth of the data. Your 100-page project document will probably take less space than a 2-minute commercial. This does nothing to validate the importance of the data.

When we deal with data outside of computers, we tend to look at it in a different way. We might consider it worthy to remember each and every detail of our project and might fail miserably, but we will be able to easily remember the commercial, no matter how little it means to you! While this might seem like an unimportant factor while considering data compression, it becomes quite important while dealing with data such as audio, video, and images.

Like all things, information can be relative, information for me, may be nothing for you. Taking a common example, news of your failing in Mathematics may be no news for you, but it might shock you that you failed in Geography. Which has more information here?

Redundancy

Consider a situation where you get a call from your employee every day that they will be late for work. The call gives you no new information; you already expect to get a call every day around the same time with the same message. However if on a particular day, you get a call saying that they will be on time, the information will suddenly have more meaning. You can see then that in case there is no NEW information, in case there is redundancy, there is a better and more efficient way exchange the same data.

Instead of calling every day with the same recycled excuses, you could construct a “dictionary” of excuses, and just cite one in a conversation. A simple call, “Excuse 42” would be enough, heck, even saying “Excuse” is redundant.

Data compression mechanisms tend to do the same thing, create a table of commonly occurring patterns, and replace the occurrences with something that takes less space. In essence, frequently occurring patterns are replaced with smaller codes while the infrequently occurring patterns might even take up more space than they initially did! However if the algorithm is properly designed, in the end, there is a net positive gain.

Packing Bytes

Try this: create a text file with a single sentence, and keep pasting it until the file is a few MB. This will take time! You will have to paste it thousands maybe hundreds of thousands of times! A better approach is to exponentially increase the amount of text you paste. Paste the sentence a couple of times, select all, and paste a couple of times again. Keep increasing the copy time and soon you will have a reasonably large file¹. If you know a programming language you could perhaps write a script. When you compress this file using any program, you will see a drastic decrease in file size, because we have a small pattern repeating continuously. In fact, no matter how large the input file, the final filesize will be in the order of a few hundred bytes!

The aim of compression algorithms is to pack in more information in the same amount of space, to reduce the redundancy of the data. Compressed files have very low redundancy, and it is because of this reason that compressing an already compressed file will not give much benefit. You will not even get any benefit from compressing a “ZIP” file as a “RAR” even though the uncompressed original file might have been better compressed with RAR. Similarly, an uncompressed BMP image or WAV audio file might get compressed quite a bit, while a JPG image or MP3 file will barely benefit from further compression.

Every compression algorithm uses some mechanism to detect patterns, and assign them codes based on how frequently they occur. It is in this selection of patterns and the choice of their codes that makes each compression algorithm unique.

Part 1: Information | Part 2: Data Explosion | Part 3: Reducing Redundancy

1. Another nice way is to run the procedure above with one character, any character, and then use the search and replace tool to replace with a sentence or paragraph of text.