Huffman Encoding

Βαγγέλης Δούρος, EY0619

Page 1: Huffman Encoding

Βαγγέλης Δούρος, EY0619

Page 2: Text Compression

On a computer: changing the representation of a file so that it takes less space to store and/or less time to transmit.
– The original file can be reconstructed exactly from the compressed representation.
– This differs from data compression in general: text compression has to be lossless.
– Compare with sound and images, where small changes and noise are tolerated.

Page 3: First Approach

Consider the word ABRACADABRA. What is the most economical way to write this string in a binary representation?

Generally speaking, if a text consists of N different characters, we need ⌈log₂ N⌉ bits to represent each one using a fixed-length encoding. Thus, it would require 3 bits for each of the 5 different letters here, or 33 bits for the 11 letters. Can we do better?
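The fixed-length arithmetic above is easy to check directly; a minimal sketch (the function name is ours, not the slides'):

```python
import math

def fixed_length_bits(text: str) -> int:
    """Bits needed to store `text` with a fixed-length code:
    ceil(log2(N)) bits per character for N distinct characters."""
    n_distinct = len(set(text))
    bits_per_char = math.ceil(math.log2(n_distinct))
    return bits_per_char * len(text)

print(fixed_length_bits("ABRACADABRA"))  # 5 distinct letters -> 3 bits each -> 33
```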

Page 4: Yes!!!!

We can do better, provided:
– Some characters are more frequent than others.
– Codewords may have different bit lengths, so that, for example, in the English alphabet the letter a may use only one or two bits, while the letter y may use several.
– We have a unique way of decoding the bit stream.

Page 5: Using Variable-Length Encoding (1)

Magic word: ABRACADABRA

Let A = 0
    B = 100
    C = 1010
    D = 1011
    R = 11

Thus, ABRACADABRA = 01001101010010110100110.

So 11 letters demand 23 bits < 33 bits, an improvement of about 30%.
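The 23-bit string above is just the concatenation of the codewords; a minimal sketch using the slide's code table:

```python
# Code table from the slide.
CODE = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}

def encode(text: str) -> str:
    # Concatenate the codeword of each symbol.
    return "".join(CODE[ch] for ch in text)

bits = encode("ABRACADABRA")
print(bits, len(bits))  # 01001101010010110100110 23
```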

Page 6: Using Variable-Length Encoding (2)

However, there is a serious danger: how do we ensure unique reconstruction? Let A = 01 and B = 0101. How should we decode 010101? As AB? BA? AAA? No problem… if we use prefix codes: no codeword is a prefix of another codeword.

Page 7: Prefix Codes (1)

Any prefix code can be represented by a full binary tree. Each leaf stores a symbol. Each internal node has two children; the left branch means 0, the right means 1. A codeword is the path from the root to a leaf, interpreting the left and right branches accordingly.

Page 8: Prefix Codes (2)

ABRACADABRA

A = 0
B = 100
C = 1010
D = 1011
R = 11

Decoding is unique and simple! Read the bit stream from left to right and, starting from the root, whenever a leaf is reached, write down its symbol and return to the root.
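The root-to-leaf walk above can be sketched without building an explicit tree: because of the prefix property, the bits read since the last emitted symbol match at most one codeword. A minimal sketch (the helper names are ours):

```python
CODE = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}
REVERSE = {v: k for k, v in CODE.items()}  # codeword -> symbol

def decode(bits: str) -> str:
    """Read bits left to right; emit a symbol whenever the buffer
    matches a codeword (i.e. a leaf is reached), then reset."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in REVERSE:          # reached a leaf
            out.append(REVERSE[buf])
            buf = ""                # return to the root
    return "".join(out)

print(decode("01001101010010110100110"))  # ABRACADABRA
```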

Page 9: Prefix Codes (3)

Let f_i be the frequency of the i-th symbol and d_i the number of bits required for the i-th symbol (= the depth of this symbol in the tree), for 1 ≤ i ≤ n. How do we find the optimal coding tree, i.e. the one that minimizes the cost of the tree

    C = Σ_{i=1}^{n} f_i · d_i ?

– Frequent characters should have short codewords.
– Rare characters should have long codewords.
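Plugging the ABRACADABRA example into the cost formula C = Σ f_i·d_i, with the symbol counts as frequencies and the codeword lengths from the earlier code table as depths:

```python
# Symbol counts in ABRACADABRA and depths (= codeword lengths) in the example tree.
freq  = {"A": 5, "B": 2, "C": 1, "D": 1, "R": 2}
depth = {"A": 1, "B": 3, "C": 4, "D": 4, "R": 2}

cost = sum(freq[s] * depth[s] for s in freq)
print(cost)  # 23, matching the 23-bit encoding of ABRACADABRA
```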

Page 10: Huffman’s Idea

From the previous definition of the cost of the tree, it is clear that the two symbols with the smallest frequencies must be at the bottom of the optimal tree, as children of the lowest internal node. This is a strong hint that we should build the optimal code in a bottom-up manner! Huffman’s idea is a greedy approach based on these observations. Repeat until all nodes are merged into one tree:
– Remove the two nodes with the lowest frequencies.
– Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their frequencies as the new frequency.

Page 11: Constructing a Huffman Code (1)

Assume that frequencies of symbols are:– A: 40 B: 20 C: 10 D: 10 R: 20

The smallest frequencies are 10 and 10 (C and D), so connect them first.

Page 12: Constructing a Huffman Code (2)

C and D have already been used, and the new node above them (call it C+D) has value 20. The smallest values are now B, C+D, and R, all of which have value 20.
– Connect any two of these.
Clearly the algorithm does not construct a unique tree; but even if we had chosen the other possible connection, the resulting code would be optimal too!

Page 13: Constructing a Huffman Code (3)

The smallest value now is R’s (20), while A and B+C+D each have value 40. Connect R to either of the others.

Page 14: Constructing a Huffman Code (4)

Connect the final two nodes, labeling each left branch with 0 and each right branch with 1.

Page 15: Algorithm

X is the set of symbols, whose frequencies are known in advance.

Q is a min-priority queue, implemented as a binary heap.
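The algorithm figure on this slide is not reproduced in the transcript. A minimal sketch of the same bottom-up procedure, using Python’s heapq module as the min-priority queue Q (this is our rendering, not the slide’s exact pseudocode):

```python
import heapq
from itertools import count

def huffman_codes(freq: dict) -> dict:
    """Build a Huffman code table from a symbol -> frequency map."""
    tick = count()  # tie-breaker so heapq never has to compare trees
    # Q holds (frequency, tie, tree); a tree is a symbol or a (left, right) pair.
    q = [(f, next(tick), sym) for sym, f in freq.items()]
    heapq.heapify(q)
    for _ in range(len(freq) - 1):          # n - 1 merges
        f1, _, t1 = heapq.heappop(q)        # two lowest-frequency nodes
        f2, _, t2 = heapq.heappop(q)
        heapq.heappush(q, (f1 + f2, next(tick), (t1, t2)))
    # Walk the final tree: left branch = 0, right branch = 1.
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"     # single-symbol edge case
        return codes
    return walk(q[0][2])

print(huffman_codes({"A": 40, "B": 20, "C": 10, "D": 10, "R": 20}))
```

Because ties may be broken either way, different runs (or implementations) can produce different code tables, but, as the slides note, all of them have the same optimal cost.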

Page 16: What about Complexity?

Each of the two extract-min operations needs O(log n), and so does the insertion of the merged node. The loop executes n − 1 times, so the loop needs O(n log n). Thus, the algorithm needs O(n log n).

Page 17: Algorithm’s Correctness

It can be proven that the greedy algorithm HUFFMAN is correct, as the problem of determining an optimal prefix code exhibits the greedy-choice and optimal-substructure properties.

Greedy choice: Let C be an alphabet in which each character c ∈ C has frequency f[c], and let x and y be two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit.

Optimal substructure: Let C be a given alphabet with frequency f[c] defined for each character c ∈ C, and let x and y be two characters in C with minimum frequency. Let C′ be the alphabet C with the characters x and y removed and a (new) character z added, so that C′ = C − {x, y} ∪ {z}; define f for C′ as for C, except that f[z] = f[x] + f[y]. Let T′ be any tree representing an optimal prefix code for the alphabet C′. Then the tree T, obtained from T′ by replacing the leaf node for z with an internal node having x and y as children, represents an optimal prefix code for the alphabet C.

Page 18: Last Remarks

• "Huffman codes" are widely used in applications that involve the compression and transmission of digital data, such as fax machines, modems, and computer networks.
• Huffman encoding is practical if:
– The encoded string is large relative to the code table (because the code table has to be included with the message, if it is not widely known).
– We agree on the code table in advance.
• For example, it is easy to find a table of letter frequencies for English (or any other alphabet-based language).

Page 19: Ευχαριστώ! (Thank you!)