Lecture 11

Text compression

What is compression?

When is compression used?

Basic idea

Lossy vs Lossless

Lossy compression

Lossless compression

Simple Ideas and Intuition

GTCACCCCCCCCCGTCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCTGG
= GTC1AC9GTC41TGG

Ratio: (11·8 + 3·32) / (58·8) ≈ 0.40 (11 characters at 8 bits each, plus 3 run lengths at 32 bits each, versus 58 characters at 8 bits). Around 40% of the original!

Run-Length-Encoding (RLE)
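The DNA example above can be sketched as a simple run-length encoder. This is an illustrative sketch: the function names and the token format (a character followed by its decimal run length, with short runs kept literal) are assumptions for illustration, not the slide's exact scheme.

```python
import re

def rle_encode(s, min_run=4):
    """Run-length encode s: a run of at least min_run equal characters
    becomes <char><count>; shorter runs are emitted literally."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        run = j - i
        out.append(s[i] + str(run) if run >= min_run else s[i] * run)
        i = j
    return "".join(out)

def rle_decode(s):
    # Each token is one character optionally followed by a count.
    return "".join(c * int(n or 1) for c, n in re.findall(r"(.)(\d*)", s))
```

For example, `rle_encode("GTCACCCCCCCCCTGG")` gives `"GTCAC9TGG"`, and decoding it restores the original string.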

Smarter DNA coding

Variable length codes

Prefix Codes

Codex Graph

Encoding Tree Optimisation


Huffman Coding

$$\sum_{i=1}^{n} 2^{-l_i} \le 1 \quad \text{and} \quad \sum_{i=1}^{n} w_i \cdot l_i \ \text{is minimal}$$

Algorithm Huffman(X):
    Input: string X of length n
    Output: optimal encoding tree for X
    Compute frequency f(c) of each character c of X
    PQ <- new empty priority queue
    for each character c in alphabet of X do
        T <- single-node binary tree storing c
        PQ.insert(f(c), T)
    while PQ.size() > 1 do
        (f1, T1) <- PQ.removeMin()
        (f2, T2) <- PQ.removeMin()
        T <- new binary tree with left subtree T1 and right subtree T2
        PQ.insert(f1 + f2, T)
    (f, T) <- PQ.removeMin()
    return T
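The pseudocode above translates directly into Python using a binary heap as the priority queue. `huffman_codes` is a hypothetical helper name; the tie-breaking counter is an implementation detail needed because heap entries must be comparable.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build an optimal prefix code for text, mirroring the pseudocode
    above. Returns a dict {character: codeword bitstring}."""
    freq = Counter(text)
    # Heap entries: (frequency, tie_breaker, tree). A tree is either a
    # leaf character or a (left, right) pair of subtrees.
    heap = [(f, i, c) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # two least-frequent trees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    _, _, tree = heap[0]
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node
            walk(node[0], prefix + "0")  # left branch -> 0
            walk(node[1], prefix + "1")  # right branch -> 1
        else:                            # leaf: assign codeword
            codes[node] = prefix or "0"  # single-symbol edge case
    walk(tree, "")
    return codes
```

For "abracadabra" (frequencies a:5, b:2, r:2, c:1, d:1) every Huffman tree gives a total encoded length of 23 bits, with the most frequent symbol "a" receiving a 1-bit codeword.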


Analysis of Huffman’s Algorithm

Assuming that

Real Huffman coding

In practice, we don’t care about the actual codewords in the first instance – what we really want to compute is the set of codeword lengths (how many bits to assign to each codeword).

Huffman's example


Assign codewords by the branch direction in the tree: “if we go left, append 0; otherwise append 1”. But we can do better (where better does not mean a better code, but a more beautiful, elegant one with less overhead when transmitting the codebook).


Canonical codes

Canonical codes - tree shape

Because it means we can minimize the amount of information we provide to the decoder. Suppose we pass the symbols to the decoder in lexicographical order. If we sort the codewords first by their length and then lexicographically, all we need to provide the decoder is the list of codeword lengths!

Sort them within their bit length buckets
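Given only the codeword lengths, the canonical codewords can be reconstructed deterministically. A minimal sketch of the standard (Deflate-style) assignment rule, with `canonical_codes` as a hypothetical helper name: sort symbols by (length, symbol), start with the all-zeros codeword, and for each subsequent symbol add one and left-shift whenever the length grows.

```python
def canonical_codes(lengths):
    """Given {symbol: codeword length}, assign canonical codewords.
    Symbols are processed sorted by (length, symbol); the first code is
    all zeros, each following code is the previous one plus one,
    left-shifted when moving to a longer bit-length bucket."""
    code = 0
    prev_len = 0
    out = {}
    for sym, ln in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (ln - prev_len)          # pad with zeros on the right
        out[sym] = format(code, f"0{ln}b")
        code += 1
        prev_len = ln
    return out
```

Note that the decoder only needs the length of each symbol's codeword to rebuild exactly this table, which is why canonical codes make the codebook so cheap to transmit.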


How to transmit the codebook?

Decoder

Compression and coding

  1. Building a probability model over the input
  2. Applying that probability model to the data

Lempel-Ziv Schemes

Statistical Methods

Static Models

Dynamic Models

Adaptive Models

Lempel-Ziv Compression

Basic idea: Use a dictionary to determine if you have seen something before. If so, don’t output the “thing”, just output its index in the dictionary!
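The dictionary idea can be sketched as a toy LZ77 coder. This is a deliberately naive sketch: it emits `(offset, length, next_char)` triples, uses a brute-force search for the longest earlier match, and the function names and window size are assumptions (real implementations bound the match length and search much more cleverly).

```python
def lz77_encode(s, window=255):
    """Toy LZ77: emit (offset, length, next_char) triples, where offset
    and length point back into the already-seen text."""
    i, out = 0, []
    while i < len(s):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):  # brute-force match search
            k = 0
            while i + k < len(s) - 1 and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        out.append((best_off, best_len, s[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    out = []
    for off, ln, ch in triples:
        for _ in range(ln):
            out.append(out[-off])  # byte-by-byte copy handles overlaps
        out.append(ch)
    return "".join(out)
```

Repeated substrings such as the second "abracadabra" below collapse into a single back-reference instead of being output again.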

LZ77 example

LZ77 decoding example


What about Huffman?

LZ* challenges

Variable Byte

130 would usually be:
00000000 00000000 00000000 10000010
But now we can represent it as:
10000001 00000010
The leftmost 1 says “you need to read another byte” (continuation bit).
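A minimal sketch of this scheme, assuming 7 payload bits per byte with the high bit as the continuation flag and the payload stored most-significant group first (matching the 130 example above); the function names are illustrative.

```python
def vbyte_encode(n):
    """Encode a non-negative integer as variable bytes: 7 payload bits
    per byte, MSB = 1 means "another byte follows" (continuation bit)."""
    parts = []
    while True:
        parts.append(n & 0x7F)  # take the low 7 bits
        n >>= 7
        if n == 0:
            break
    parts.reverse()             # most significant 7-bit group first
    return bytes(p | 0x80 for p in parts[:-1]) + bytes([parts[-1]])

def vbyte_decode(data):
    n = 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if not (b & 0x80):      # continuation bit clear: last byte
            break
    return n
```

Encoding 130 yields exactly the two bytes shown above, `10000001 00000010`, while small values like 5 fit in a single byte.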