Martin S. Stoller

Transcript

CRUNCHY CRUNCH!
Martin (馬田) S. Stoller
20160903
The Art of Data Compression

What we will learn today!
In this presentation we have a look at:
A quick look at how images are stored;
Run Length Encoding;
Dictionary Based Encoding;
Extra Bonus: SciFi Fractal Compression!

Why do we bother to compress?
First off, we never seem to have enough space on our disk drives / cell phones / USB
Flash Drives, etc. So if we can compress data, we can fit more of it on our storage devices.
Secondly, compressed data travels faster over distance - without compression, the
internet would be even slower, and services such as YouTube or NetFlix would not exist.

PICTURES!
Where compression really shines!

A quick note on how Images are stored...

Most images (and videos, too) are saved as a long list of numbers that represent the
colour value of each dot, also called a “pixel”.

A B&W photo is easy to encode. For each pixel we just need to save if it is BLACK or WHITE.

Next slide shows an example using a kitten.

If we blow up a small section of this image, we can see the dots that make it up a bit
clearer.
For ease of reading, we’ll use a B and W to represent Black and White pixels, but more
traditionally we’d use zeros and ones.

The text to the right is a simplified version of the first part of the image below
in B and W code.
Each B represents a BLACK dot, each W a WHITE dot, in the image above.

For colour images, we often use numbers to represent how much Red, Green and Blue each
dot has -> called the RGB format.

For instance a Red only dot would be Red 255, Green 0, Blue 0. A Yellow dot would
be Red 255, Green 255 and Blue 0. And Pink = R255, G192, B203.
Note: As you might suspect, colour image compression is a bit complex, so we’ll skip
that in this presentation.
R104G50B66,R30G102B0,R0G0B255,R0G255B0,R63G12B42,R0G0B0, R10G15B0, R0G210B5,
R12G0B123, etc etc etc

Data Compression
The underlying principle...

Basic Compression using Pattern Matching
Basic data compression processes work by identifying patterns within a set of data.
The compression engine remembersWhere and how often each pattern is found in the data.
Each pattern is saved into a new “compressed” data set only once, thus saving space,
with the added metadata on all the locations it originally was found in the source data.
This process is called “encoding”.

Encoding Run Length Compression
A very basic form of pattern matching based compression is the legendary “Run Length”
encoding.
This is a very early compression method, that was used as early as 1967 with success
to reduce the size of television signals! And in the 70’s it was used to compress
computer images into manageable sizes.
The idea is to find a “run” of the same values, and save only one copy of that value,
recording how many times it repeated - we’ll look now at how that works in practice!

Demonstration

Remember our kitten from before? Below is the first line of data representing the very
first line of pixels (dots) that make up the image.
(We’ll be showing the compressed run length encoded data here in a bit…)

As you can count, the image starts with 9 black pixels, represented by nine letter “B”
in a row, also known as a “run” of B’s.
(We’ll be showing the compressed run length encoded data here in a bit… Patience… almost
there!)

In our compression file we now can write that we have counted nine (9) letter B.

Next, we count that we have thirty (30) times a white pixel, so in our compressed file
we record 30W.

Finally, the rest of the row of pixels is made of 53 black pixels, so we note that down
in our compressed file, too.

Somehow we need to know that we reached the end of the first line. For that we’ll use
the special code double-zero (00), which is also a value carefully chosen because that
will never appear in our image (trust us, we checked).

Now we do the same encoding for each row of pixels - here we have the runs for the
second row.

And for the third row, and we’re sure you got it now.

Decoding a Run Length encoded file is now very simple - and amazingly fast, too!
We simply have to read the instructions listed in the compressed file, and draw as many
dots of the colour as each instruction block tells us to.
So if the instruction is “3B”, we draw three black dots. Simple!

DE - coding Run Length Compression

Each compressed line gives the instruction on how to put the image back together again...

For the first line of the image, we need nine (9) pixels worth of B, or black.

Then 30 pixels of W, or white...

Finally, to finish up the first line of the image, we need 53 pixels of B (yep, that’s
the code we use for black!)

And we know we are done with the very first line of the image because we have the special
code we use, double-zero (00).

All the following lines are de-coded in exactly the same manner, to make up the completed
kitten image.

And that, folks, is RUN LENGTH compression in a nutshell!

Encoding Dictionary Based Compression
As the name implies, in this method we create a “dictionary” of smaller pieces of data
within the whole data set.
For example:
We take the sentence “The brown fox jumps over fences. How many fences you ask? The fox
did not count the fences. But the fences were brown.” and pull out each word once,
creating a dictionary of words. Then, we start the actual encoding...

Demonstration
The brown fox jumps over fences. How many fences you ask? The fox did not count the
fences. But the fences were brown.

Decoding Dictionary Based Compression
To decompress we just need to create a new data set with all the patterns added back
in as many times - and in the right locations - as needed, to end up with the
original data.
Again, this is almost as simple as the Run Length method!
As we probably already know by now, this process is called “decoding”.

Some Notes...
Pattern based compression works best on larger data sets.
The more patterns that can be identified, the more that the data can be compressed.
Careful though, already compressed data may not be able to be further compressed - in
fact, some compressed files GET LARGER if you compress them twice!

Loss-Less VS Loss-Y compression
Compressing data with a “lossless” method will not destroy any of it. The first two
compression methods we looked at are lossless. GIF and BMP are also lossless image
compression methods.
Some compression methods on the other hand are considered “lossy” and do destroy some
of the data. An example is the JPEG image format. It can give incredibly small file sizes,
but does that at the cost of degrading the image quality. (This is also the cause of
JPEG “artifacts”!)

SCIENCE FICTION?!?!
The mystery of
FRACTAL COMPRESSION!

Fractal Compression of the FUTURE!
The theory is that Fractal Compression could one day give amazingly small files sizes -
think a ratio of 1% - or less!
Fractal Compression already exists since the 80’s, and is a lossy method - data is
destroyed to reach such small data sizes.
(And yes, fractal compression is also a pattern matching method…)

So why aren’t we using Fractal Compression today???
Short answer: our computers are still too slow to make this practical, as it involves
millions of complex calculations ∓ comparisons. Fractal compression is based on the
mathematical fractal chaos theory - that any smaller part of a whole is just a smaller
or rotated or otherwise transformed version of the whole.

So how does it work?
There are a few methods being researched.
In general, we again try and find patterns, and turn them into mathematical equations.
An equation uses up hardly any space compared to an image.

So how does it work?
Another method is to take a small piece of an image, and try and find where that piece
is duplicated more or less else where in the image, either larger, smaller, rotated,
inverted or reversed. I can tell you that takes some serious computing power!

Conclusion:
There are a lot of different compression methods out there - we only touched on two
very basic versions.

There are special compression methods for sound, video, spreadsheets, vector graphics,
statistical data, and so on. Some are relative simple, many are very complex. All are
amazing feats of ingenuity!

Hopefully this simple introduction has started you on a journey of discovery, to see
what else is out there!

Stay curious!

THANKS!
You all are just out of this world!
Take care, stay safe!

Credits
Special thanks to all the people who made and released this awesome slide template
resource for free!
Check them out if you need your own slide templates!

Presentation template by SlidesCarnival
Photographs by Unsplash

Cheers and thanks again for watching! - Martin S. Stoller.