CRUNCHY CRUNCH! Martin (馬田) S. Stoller 20160903 The Art of Data Compression What we will learn today! In this presentation we have a look at: A quick look at how images are stored; Run Length Encoding; Dictionary Based Encoding; Extra Bonus: SciFi Fractal Compression! Why do we bother to compress? First off, we never seem to have enough space on our disk drives / cell phones / USB Flash Drives, etc. So if we can compress data, we can fit more of it on our storage devices. Secondly, compressed data travels faster over distance - without compression, the internet would be even slower, and services such as YouTube or NetFlix would not exist. PICTURES! Where compression really shines! A quick note on how Images are stored... Most images (and videos, too) are saved as a long list of numbers that represent the colour value of each dot, also called a “pixel”. A B&W photo is easy to encode. For each pixel we just need to save if it is BLACK or WHITE. Next slide shows an example using a kitten. If we blow up a small section of this image, we can see the dots that make it up a bit clearer. For ease of reading, we’ll use a B and W to represent Black and White pixels, but more traditionally we’d use zeros and ones. The text to the right is a simplified version of the first part of the image below in B and W code. Each B represents a BLACK dot, each W a WHITE dot, in the image above. For colour images, we often use numbers to represent how much Red, Green and Blue each dot has -> called the RGB format. For instance a Red only dot would be Red 255, Green 0, Blue 0. A Yellow dot would be Red 255, Green 255 and Blue 0. And Pink = R255, G192, B203. Note: As you might suspect, colour image compression is a bit complex, so we’ll skip that in this presentation. R104G50B66,R30G102B0,R0G0B255,R0G255B0,R63G12B42,R0G0B0, R10G15B0, R0G210B5, R12G0B123, etc etc etc Data Compression The underlying principle... Basic Compression using Pattern Matching Basic data compression processes work by identifying patterns within a set of data. The compression engine remembersWhere and how often each pattern is found in the data. Each pattern is saved into a new “compressed” data set only once, thus saving space, with the added metadata on all the locations it originally was found in the source data. This process is called “encoding”. Encoding Run Length Compression A very basic form of pattern matching based compression is the legendary “Run Length” encoding. This is a very early compression method, that was used as early as 1967 with success to reduce the size of television signals! And in the 70’s it was used to compress computer images into manageable sizes. The idea is to find a “run” of the same values, and save only one copy of that value, recording how many times it repeated - we’ll look now at how that works in practice! Demonstration Remember our kitten from before? Below is the first line of data representing the very first line of pixels (dots) that make up the image. (We’ll be showing the compressed run length encoded data here in a bit…) As you can count, the image starts with 9 black pixels, represented by nine letter “B” in a row, also known as a “run” of B’s. (We’ll be showing the compressed run length encoded data here in a bit… Patience… almost there!) In our compression file we now can write that we have counted nine (9) letter B. Next, we count that we have thirty (30) times a white pixel, so in our compressed file we record 30W. Finally, the rest of the row of pixels is made of 53 black pixels, so we note that down in our compressed file, too. Somehow we need to know that we reached the end of the first line. For that we’ll use the special code double-zero (00), which is also a value carefully chosen because that will never appear in our image (trust us, we checked). Now we do the same encoding for each row of pixels - here we have the runs for the second row. And for the third row, and we’re sure you got it now. Decoding a Run Length encoded file is now very simple - and amazingly fast, too! We simply have to read the instructions listed in the compressed file, and draw as many dots of the colour as each instruction block tells us to. So if the instruction is “3B”, we draw three black dots. Simple! DE - coding Run Length Compression Each compressed line gives the instruction on how to put the image back together again... For the first line of the image, we need nine (9) pixels worth of B, or black. Then 30 pixels of W, or white... Finally, to finish up the first line of the image, we need 53 pixels of B (yep, that’s the code we use for black!) And we know we are done with the very first line of the image because we have the special code we use, double-zero (00). All the following lines are de-coded in exactly the same manner, to make up the completed kitten image. And that, folks, is RUN LENGTH compression in a nutshell! Encoding Dictionary Based Compression As the name implies, in this method we create a “dictionary” of smaller pieces of data within the whole data set. For example: We take the sentence “The brown fox jumps over fences. How many fences you ask? The fox did not count the fences. But the fences were brown.” and pull out each word once, creating a dictionary of words. Then, we start the actual encoding... Demonstration The brown fox jumps over fences. How many fences you ask? The fox did not count the fences. But the fences were brown. Decoding Dictionary Based Compression To decompress we just need to create a new data set with all the patterns added back in as many times - and in the right locations - as needed, to end up with the original data. Again, this is almost as simple as the Run Length method! As we probably already know by now, this process is called “decoding”. Some Notes... Pattern based compression works best on larger data sets. The more patterns that can be identified, the more that the data can be compressed. Careful though, already compressed data may not be able to be further compressed - in fact, some compressed files GET LARGER if you compress them twice! Loss-Less VS Loss-Y compression Compressing data with a “lossless” method will not destroy any of it. The first two compression methods we looked at are lossless. GIF and BMP are also lossless image compression methods. Some compression methods on the other hand are considered “lossy” and do destroy some of the data. An example is the JPEG image format. It can give incredibly small file sizes, but does that at the cost of degrading the image quality. (This is also the cause of JPEG “artifacts”!) SCIENCE FICTION?!?! The mystery of FRACTAL COMPRESSION! Fractal Compression of the FUTURE! The theory is that Fractal Compression could one day give amazingly small files sizes - think a ratio of 1% - or less! Fractal Compression already exists since the 80’s, and is a lossy method - data is destroyed to reach such small data sizes. (And yes, fractal compression is also a pattern matching method…) So why aren’t we using Fractal Compression today??? Short answer: our computers are still too slow to make this practical, as it involves millions of complex calculations ∓ comparisons. Fractal compression is based on the mathematical fractal chaos theory - that any smaller part of a whole is just a smaller or rotated or otherwise transformed version of the whole. So how does it work? There are a few methods being researched. In general, we again try and find patterns, and turn them into mathematical equations. An equation uses up hardly any space compared to an image. So how does it work? Another method is to take a small piece of an image, and try and find where that piece is duplicated more or less else where in the image, either larger, smaller, rotated, inverted or reversed. I can tell you that takes some serious computing power! Conclusion: There are a lot of different compression methods out there - we only touched on two very basic versions. There are special compression methods for sound, video, spreadsheets, vector graphics, statistical data, and so on. Some are relative simple, many are very complex. All are amazing feats of ingenuity! Hopefully this simple introduction has started you on a journey of discovery, to see what else is out there! Stay curious! THANKS! You all are just out of this world! Take care, stay safe! Credits Special thanks to all the people who made and released this awesome slide template resource for free! Check them out if you need your own slide templates! Presentation template by SlidesCarnival Photographs by Unsplash Cheers and thanks again for watching! - Martin S. Stoller.