A little while ago I mentioned that one of my goals was to implement the LZ77 compression algorithm. Well, my implementation in Python has arrived! You can find it here on my GitHub.

I will take this opportunity to discuss a little bit about how LZ77 works, my implementation, and improvements that could be made.

How LZ77 works

Before reading on, you should watch this Computerphile video, which provides a great basis for understanding LZ77. For a long time it was my primary source on the methodology.

LZ77 is a lossless compression algorithm. Lossless means that when you compress something and then decompress it, you get back exactly what you put in. Not all compression algorithms work this way; JPEG, for example, compresses an image into something that looks like the same image but actually contains different information. This type of compression, the opposite of lossless, is called lossy.

LZ77 can work on any input, but for the time being let’s just consider a string - a series of characters such as “abcdefg”. Initially, each letter is represented only by itself, so as more letters are added, the representation will increase at the same rate. However, by noting that in large bodies of text, such as books or newspaper articles, words and phrases are often repeated, it is possible to reduce the amount of information which is required to represent the string.

For each letter, we calculate its offset and length compared to all the letters which came before it.

  • offset - the number of letters you have to count back to find the letter
  • length - the largest number of letters in a row which you can find an earlier occurrence of
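In code, that search for the best match might look something like this. It is a simplified (and slow) sketch of the idea rather than the exact code in the repo, but for the short strings in the next section it produces the same triples:

def compress_sketch(text):
    # Encode a string as (offset, length, letter) triples.
    output = []
    i = 0
    while i < len(text):
        best_offset, best_length = 0, 0
        # try counting back each possible number of letters
        for offset in range(1, i + 1):
            length = 0
            # extend the match as far as it will go (it may run past its own start)
            while i + length < len(text) and text[i + length - offset] == text[i + length]:
                length += 1
            if length > best_length:
                best_offset, best_length = offset, length
        if best_length == 0:
            output.append((0, 1, text[i]))  # nothing seen before: spell the letter out
            i += 1
        else:
            output.append((best_offset, best_length, text[i]))
            i += best_length
    return output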

Encoding Example

In this way, we can encode a string of letters into this representation. Here are a number of simple examples using the Python code linked above:

>>> from compress import compress
>>> compress("a")
# [(offset, length, letter)]
[(0, 1, 'a')]
>>> compress("aba")
[(0, 1, 'a'), (0, 1, 'b'), (2, 1, 'a')]
>>> compress("word word")
[(0, 1, 'w'), (0, 1, 'o'), (0, 1, 'r'), (0, 1, 'd'), (0, 1, ' '), (5, 4, 'w')]
>>> compress("abababab")
[(0, 1, 'a'), (0, 1, 'b'), (2, 6, 'a')]

In the first two examples, it looks like we are creating more trouble for ourselves by adding this information! The output is much longer than the input!

But the next example makes it clear that we can save a lot of space when we see combinations of letters we have seen before! The first “word” has to be spelled out, but the second occurrence is stored in just one value, which says “go back five letters and copy the next four”.

The final example is even more intriguing. As before, the first occurrences of letters are spelled out, then it says “go back two letters and copy the next six”. This process confused me when I first learned about it. It clearly is saving space, but how does it work? To answer that, we need to think about how to decode the LZ77 representation.

Decoding Example

When decoding, we start with an empty string of letters; call this our output. We also have a cursor, which starts at the beginning of the output and moves to the end after each (offset, length, letter) set is processed. We then take the representation and interpret it in this way:

  • If the offset is 0, copy the letter provided to the end of the output, keep the cursor in front of the new letter and subtract 1 from the length (repeating until the length reaches 0)
  • If the offset is greater than zero, move the cursor back that many letters
  • For every number up to the length, copy the letter in front of the cursor to the end of the output, and move the cursor on to the next letter

Let’s work through the final example given above:

[(0, 1, 'a'), (0, 1, 'b'), (2, 6, 'a')]

When we start, there is nothing in the output:

Cursor
Output  

The first offset is 0, so we copy the letter provided - “a” - to the end of the output and keep the cursor in front of the new letter:

Cursor  
Output   a

As our length was 1 but we have copied a letter, our length is now 0 so we can move on to the next representation.

This one, (0, 1, 'b') proceeds in much the same way, so that we are left with this output:

Cursor    
Output a b  

Now we get to the interesting representation: (2, 6, 'a'). First, we move the cursor back two places:

Cursor    
Output   a b

We copy the letter in front of the cursor to the end:

Cursor      
Output   a b a

And move the cursor to the next letter:

Cursor      
Output   a b a

The length of 6 is now reduced to 5 and we repeat the process. Copy the letter in front of the cursor to the end:

Cursor        
Output   a b a b

And move the cursor to the next letter:

Cursor        
Output   a b a b

Our length is now reduced to 4. So we repeat the process 4 more times:

Cursor          
Output   a b a b a
Cursor            
Output   a b a b a b
Cursor              
Output   a b a b a b a
Cursor                
Output   a b a b a b a b

And finally we would move our cursor to the end if there were any more representations to be processed:

Cursor                
Output a b a b a b a b  

And our output is now the same as what we originally put in: “abababab”.
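In code, those rules boil down to something like this. Again, this is a sketch rather than the exact code in the repo:

def decompress_sketch(triples):
    output = []
    for offset, length, letter in triples:
        if offset == 0:
            # no earlier match: write out the stored letter
            output.extend(letter * length)
        else:
            # go back `offset` letters and copy the next `length`, one at a
            # time so the copy can run past where it started
            start = len(output) - offset
            for k in range(length):
                output.append(output[start + k])
    return "".join(output)

# decompress_sketch([(0, 1, 'a'), (0, 1, 'b'), (2, 6, 'a')]) == 'abababab'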

Moving to Bytes

Now that we have our representation, encoding and decoding sorted out, we get to the nuts and bolts of the topic: storing this data in fewer bytes than the original string.

For this purpose, we pack the length and offset values into as few bytes as possible. This might involve splitting a byte down the middle, for example using 5 bits of the byte for the offset value and the other 3 bits for the length. When the offset is bigger than 0, we don’t actually need to store the letter as it will not be used. This helps us further reduce the amount of data we are trying to store.
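As a sketch, the packing might look like this, here using the 11-bit offset and 5-bit length split that the examples further down use; the exact layout in the repo may differ:

def to_bytes_sketch(triples, offset_bits=11, length_bits=5):
    # Pack each (offset, length) pair into two bytes, with the offset in the
    # high bits, followed by the letter only when the offset is 0.
    out = bytearray()
    for offset, length, letter in triples:
        packed = (offset << length_bits) | length
        out += packed.to_bytes((offset_bits + length_bits) // 8, "big")
        if offset == 0:
            out += letter.encode("ascii")  # the letter is only needed for a literal
    return out

With this layout, for example, (2, 4, 'a') packs into just the two bytes b'\x00D'.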

Python Implementation

In my Python implementation, I have tried to separate out my concerns of encoding and decoding, and transitioning to and from bytes. I also limited myself, for now at least, to only being concerned with ASCII strings.

So I created the following functions:

  • a function to encode an input string into a list of LZ77 representations
  • a function to turn a list of LZ77 representations into a series of bytes
  • a function to turn a series of bytes into a list of valid LZ77 representations
  • a function to turn a list of LZ77 representations back into a string of letters

I also added functions to compress and decompress whole files, which are what the Results section below uses.
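In outline they are thin wrappers around the four functions above, something like this (the names and details here are illustrative, not necessarily what is in the repo):

from compress import compress, to_bytes
from decompress import decompress, from_bytes

def compress_file(in_path, out_path):
    # Read the text, compress it and write the packed bytes out.
    with open(in_path) as f:
        packed = to_bytes(compress(f.read()))
    with open(out_path, "wb") as f:
        f.write(packed)

def decompress_file(in_path, out_path):
    # Read the packed bytes back in and rebuild the original text.
    with open(in_path, "rb") as f:
        text = decompress(from_bytes(bytearray(f.read())))
    with open(out_path, "w") as f:
        f.write(text)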

Examples

Here are some examples which should be possible to work back from by hand, using 11 bits for offset and 5 bits for length:

>>> from compress import compress, to_bytes
>>> from decompress import decompress, from_bytes
>>> x1 = to_bytes(compress("aaaaaa"))
>>> x1
bytearray(b'\x00\x06a')
>>> decompress(from_bytes(x1))
'aaaaaa'
>>> x2 = to_bytes(compress("ababab"))
>>> x2
bytearray(b'\x00\x01a\x00\x01b\x00D')
>>> decompress(from_bytes(x2))
'ababab'

Testing

This is a reasonably complex thing to try and do, and when just looking at the encoded representation, or a series of bytes, it can be quite difficult to know whether the code is working properly or not.

Therefore, I added a number of tests using pytest. This was a good start and was useful for checking some of the details, but the point of the LZ77 algorithm is that it can be used for any combination of letters. It is unlikely that I could write out enough tests myself to be confident that any combination of letters would work. For this, I turned to the hypothesis package, which allows property-based testing. I would highly recommend reading more about this topic as it seems to be really useful for helping find strange edge cases you might not have considered! Once again, here is a Computerphile video which discusses this topic further.
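The key property is the round trip: whatever string goes in should come back out unchanged. With hypothesis that test can be written along these lines (the exact strategies and test names in the repo's tests may differ):

from hypothesis import given, strategies as st

from compress import compress, to_bytes
from decompress import decompress, from_bytes

# Property: compressing then decompressing any ASCII string gives back
# exactly the string we started with.
@given(st.text(alphabet=st.characters(min_codepoint=0, max_codepoint=127), min_size=1))
def test_round_trip(s):
    assert decompress(from_bytes(to_bytes(compress(s)))) == s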

Results

So, does it actually work!?

I generated a load of nonsense from Lorem Ipsum, which amounted to 150 paragraphs, 13,216 words and 89,318 bytes, and put it in lorem_ipsum.txt.

It was compressed into lorem_compressed.txt - you might find it hard to read this though - and then decompressed into lorem_decompressed.txt.

According to my file system, the original and decompressed both came in at 89,919 bytes.

The compressed file: 36,130 bytes!!!

So yes, it does actually work, giving in this case a compressed file which is 40% the size of the original.

For comparison, zipping the original file resulted in only 25,900 bytes so there is a bit further to go!

Improvements

The number of bits used for the length and offset is variable at the moment, but should be recorded in the byte array which is output. These numbers should then be used by the decompression function to interpret the rest of the data. Looking beyond what I want to do, these values could be varied during compression to provide the best possible set of values for the input data.
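One way to do this might be to prepend a short header, something like this (purely illustrative):

def add_header(packed, offset_bits=11, length_bits=5):
    # One byte each for the bit counts, written in front of the packed data
    # so the decompressor knows how to interpret what follows.
    return bytes([offset_bits, length_bits]) + packed

def read_header(data):
    # The first two bytes give the widths; the rest is the packed data.
    return data[0], data[1], data[2:]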

I would like to make the implementation more generic, so that it could compress any array of bytes. It would be interesting to see if the compressed data could then be compressed further by going through another round of compression! In principle this seems possible, until you reach some incompressible data which is not made smaller by the compression.

Finally, I would like to implement this in some other languages to see how the implementation differs and what is easier/harder. It would also be interesting to see the performance differences in terms of time taken to process a given input.

Summary

LZ77 is cool and underpins a number of other popular compression tools! Learning how it works and can actually be implemented was a really interesting exercise!

I hope I was able to give a reasonable explanation of how LZ77 works and that you might take some inspiration from my code to try it, or something else, yourself! Comments on the code very welcome via Github!

Thanks for reading!