Greedy Algorithms Amihood Amir Bar-Ilan University.


Greedy Algorithms

Amihood Amir

Bar-Ilan University

Idea

Simplest type of strategy:

1. Take a step that makes the problem smaller.

2. Iterate.

Difficulty: Prove that this leads to an optimal solution.

This is not always the case!

Example: Centerstring Problem

Input: k strings s1,…,sk of length ℓ over alphabet Σ, distance d.

Find: a string s* of length ℓ such that the maximum of Ham(s*,si) over i=1,…,k is ≤ d, where Ham is the Hamming distance.

Our Problem:

Given k strings of length ℓ, find a string s* whose maximum Hamming distance from the input strings is smallest.

Example:

s1: 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0

s2: 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0

s3: 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0

--------------------------------------------------

s*: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The Hamming distance of the consensus from any string: 4
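A minimal Python sketch of the objective, checked on the example above. The helper names `hamming` and `max_distance` are illustrative assumptions, not part of the lecture.

```python
def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(u, v))


def max_distance(center: str, strings: list[str]) -> int:
    """Largest Hamming distance from `center` to any input string."""
    return max(hamming(center, s) for s in strings)


strings = [
    "1111000000000000",
    "0000111100000000",
    "0000000011110000",
]
center = "0" * 16
print(max_distance(center, strings))  # 4
```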


Suggestion: greedy strategy column majority?

0 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0

0 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0

0 1 1 0 0 0 1 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 1 1 0

-------------------------------------------------------

0 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0 (majority)

Problem: Works if we want to minimize the average distance. Not if we want to minimize the maximum!


Why?

1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

-----------------------------------------

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (majority)

Hamming distance from last string: 16


But:

1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

-----------------------------------------

1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0

Hamming distance from any string: 8
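The five-string example can be checked with a short Python sketch. The names `hamming` and `column_majority` are illustrative assumptions; the alternative center is the one shown above.

```python
from collections import Counter


def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def column_majority(strings: list[str]) -> str:
    """Greedy candidate: the most common symbol of every column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))


strings = [
    "1111000000000000",
    "0000111100000000",
    "0000000011110000",
    "0000000000001111",
    "1111111111111111",
]
majority = column_majority(strings)
alternative = "1100110011001100"

for name, center in [("majority", majority), ("alternative", alternative)]:
    dists = [hamming(center, s) for s in strings]
    print(name, center, "max:", max(dists), "sum:", sum(dists))
```

The majority string has the smaller sum of distances (32 versus 40) but the larger maximum (16 versus 8), which is exactly the average versus maximum gap.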

Example (that works): Huffman code

Computer Data Encoding: How do we represent data in binary?

Historical Solution: Fixed length codes.

Encode every symbol by a unique binary string of a fixed length.

Examples: ASCII (7 bit code), EBCDIC (8 bit code), …

American Standard Code for Information Interchange

ASCII Example:

AABCAA

A       A       B       C       A       A

1000001 1000001 1000010 1000011 1000001 1000001

Total space usage in bits:

Assume an ℓ bit fixed length code.

For a file of n characters

Need nℓ bits.

Variable Length codes

Idea: In order to save space, use fewer bits for frequent characters and more bits for rare characters.

Example: Suppose an alphabet of 3 symbols: { A, B, C }, and a file of 1,000,000 characters. A fixed length code needs 2 bits per character, for a total of 2,000,000 bits.

Variable Length codes - example

Suppose the frequency distribution of the characters is:

A        B    C

999,000  500  500

Encode:

A  B   C

0  10  11

Note that the code of A is of length 1, and the codes for B and C are of length 2.

Total space usage in bits:

Fixed code: 1,000,000 x 2 = 2,000,000

Variable code: 999,000 x 1 + 500 x 2 + 500 x 2 = 1,001,000

A savings of almost 50%
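The arithmetic can be reproduced in a few lines of Python. This is only a sketch; the dictionary names `freq` and `code` are illustrative.

```python
# Frequencies and variable length code from the example.
freq = {"A": 999_000, "B": 500, "C": 500}
code = {"A": "0", "B": "10", "C": "11"}

# Fixed length code: 2 bits for each of the 1,000,000 characters.
fixed_bits = sum(freq.values()) * 2

# Variable length code: frequency times codeword length, summed over symbols.
variable_bits = sum(freq[s] * len(code[s]) for s in freq)

savings = 1 - variable_bits / fixed_bits
print(fixed_bits, variable_bits, f"{savings:.2%}")
```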

How do we decode?

In the fixed length, we know where every character starts, since they all have the same number of bits.

Example: A = 00 B = 01 C = 10

000000010110101001100100001010

A A A B B C C C B C B A A C C

How do we decode?

In the variable length code, we use an idea called Prefix code, where no code is a prefix of another.

Example: A = 0 B = 10 C = 11

None of the above codes is a prefix of another.

How do we decode?

Example: A = 0 B = 10 C = 11

So, for the string A A A B B C C C B C B A A C C the encoding is:

0 0 0 10 10 11 11 11 10 11 10 0 0 11 11

Prefix Code

Example: A = 0 B = 10 C = 11

Decode the string

0 0 0 10 10 11 11 11 10 11 10 0 0 11 11

A A A B B C C C B C B A A C C
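A sketch of how decoding a prefix code can work in Python: read bits left to right and emit a symbol as soon as the bits read so far form a codeword. The function names `encode` and `decode` are illustrative assumptions.

```python
def encode(text: str, code: dict[str, str]) -> str:
    """Concatenate the codewords of the symbols in `text`."""
    return "".join(code[ch] for ch in text)


def decode(bits: str, code: dict[str, str]) -> str:
    """Decode a bit string; correct only if `code` is a prefix code."""
    inverse = {word: sym for sym, word in code.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        # No codeword is a prefix of another, so the first match is the symbol.
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


code = {"A": "0", "B": "10", "C": "11"}
text = "AAABBCCCBCBAACC"
bits = encode(text, code)
print(bits)
print(decode(bits, code))
```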

Desiderata:

Construct a variable length code for a given file with the following properties:

1. Prefix code.

2. Using shortest possible codes.

3. Efficient.

4. As close to entropy as possible.

Idea

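Consider a binary tree, with:

0 meaning a left turn

1 meaning a right turn.

[Figure: binary code tree. The root's left child is the leaf A. Its right child has the leaf B as left child, and a right child whose children are the leaves C (left) and D (right).]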

Idea

Consider the paths from the root to each of the leaves A, B, C, D:

A : 0 B : 10 C : 110 D : 111


Observe:

1. This is a prefix code, since each of the leaves has a path ending in it, without continuation.

2. If the tree is full then we are not “wasting” bits.

3. If we make sure that the more frequent symbols are closer to the root then they will have shorter codes.


Greedy Algorithm:

1. Consider all pairs: <frequency, symbol>.

2. Choose the two lowest frequencies, and make them siblings, with their parent having the combined frequency.

3. Iterate.

Greedy Algorithm Example:

Alphabet: A, B, C, D, E, F

Frequency table:

A   B   C   D   E   F

10  20  30  40  50  60

Total File Length: 210

Algorithm Run:

Start: A 10, B 20, C 30, D 40, E 50, F 60

Step 1: Merge A 10 and B 20 into X 30. Roots: X 30, C 30, D 40, E 50, F 60

Step 2: Merge X 30 and C 30 into Y 60. Roots: Y 60, D 40, E 50, F 60

Step 3: Merge D 40 and E 50 into Z 90. Roots: Y 60, F 60, Z 90

Step 4: Merge Y 60 and F 60 into W 120. Roots: Z 90, W 120

Step 5: Merge Z 90 and W 120 into V 210, the root of the code tree.

Finally, label every left edge 0 and every right edge 1.

The Huffman encoding:

V 210
  0: Z 90
    0: D 40
    1: E 50
  1: W 120
    0: Y 60
      0: X 30
        0: A 10
        1: B 20
      1: C 30
    1: F 60

A: 1000

B: 1001

C: 101

D: 00

E: 01

F: 11

File Size: 10x4 + 20x4 + 30x3 + 40x2 + 50x2 + 60x2 = 40 + 80 + 90 + 80 + 100 + 120 = 510 bits

Note the savings:

The Huffman code: Required 510 bits for the file.

Fixed length code: Need 3 bits for 6 characters. File has 210 characters.

Total: 630 bits for the file.
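A short Python check of these numbers, which also verifies that the code above is a prefix code. The helper `is_prefix_free` is a hypothetical name introduced for illustration.

```python
import math
from itertools import permutations

freq = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}
code = {"A": "1000", "B": "1001", "C": "101", "D": "00", "E": "01", "F": "11"}


def is_prefix_free(codewords: list[str]) -> bool:
    """True if no codeword is a prefix of a different codeword."""
    return not any(v.startswith(u) for u, v in permutations(codewords, 2))


huffman_bits = sum(freq[s] * len(code[s]) for s in freq)

# A fixed length code needs ceil(log2(6)) = 3 bits per character.
fixed_length = math.ceil(math.log2(len(freq)))
fixed_bits = fixed_length * sum(freq.values())

print(is_prefix_free(list(code.values())), huffman_bits, fixed_bits)
```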

Note also:

For uniform character distribution over an alphabet whose size is a power of two: The Huffman encoding will be equal to the fixed length encoding.

Why?

Assignment.

Formally, the algorithm:

Initialize trees of a single node each.

Keep the roots of all subtrees in a priority queue.

Iterate until only one tree is left: Merge the two smallest frequency subtrees into a single subtree with two children, and insert it into the priority queue.
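The steps above can be sketched in Python with the standard `heapq` module as the priority queue. This is one possible implementation under stated assumptions, not the lecture's own code: symbols are assumed to be strings, internal nodes are represented as pairs, and a counter breaks frequency ties.

```python
import heapq
from itertools import count


def huffman_codes(freq: dict[str, int]) -> dict[str, str]:
    """Build a prefix code by repeatedly merging the two lightest subtrees."""
    if not freq:
        return {}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    root = heap[0][2]

    codes: dict[str, str] = {}

    def walk(node, prefix: str) -> None:
        if isinstance(node, tuple):  # internal node: 0 goes left, 1 goes right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(root, "")
    return codes


freq = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}
codes = huffman_codes(freq)
print(codes)
print(sum(freq[s] * len(codes[s]) for s in freq))  # 510
```

Tie-breaking and the left/right order of merged subtrees are arbitrary choices here, so the codewords may differ from those on the slides; the codeword lengths and the 510-bit total agree.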

Algorithm time:

Each priority queue operation (e.g. heap): O(log n)

In each iteration: one less subtree, using a constant number of priority queue operations.

Initially: n subtrees, so there are n - 1 iterations.

Total: O(n log n) time.

Algorithm correctness:

Need to prove two things for greedy algorithms:

Greedy Choice Property: The choice of local optimum is indeed part of a global optimum.

Optimal Substructure Property: When we recurse on the remaining problem and combine its solution with the local optimum of the greedy choice, we get a global optimum.

Centerstring Algorithm correctness:

Greedy Choice Property: The choice of majority at a column turns out not necessarily to be part of a global optimum.

Optimal Substructure Property: A global optimum means that the overall max distance, including the first greedy choice, is smallest. An optimum of the remaining columns need not combine with the first choice into a global optimum.

Example:

0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

-----------------------------------------

1

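For the optimum, the second index needs to be 0 (giving maximum distance 1). But if we ignore the first index, an optimum for the remaining 15 columns may be

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

which, combined with the greedy choice 1 for the first index, is at distance 2 from the first string.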
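A small Python sketch that checks this example. The helper names are illustrative assumptions; the candidate strings are the ones discussed above.

```python
def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def max_distance(center: str, strings: list[str]) -> int:
    """Largest Hamming distance from `center` to any input string."""
    return max(hamming(center, s) for s in strings)


strings = ["0011111111111111", "1111111111111111"]

first_choice = "1"                # greedy choice for the first column (a tie)
rest = [s[1:] for s in strings]   # subproblem on the remaining 15 columns
sub_optimum = "1" * 15            # max distance 1; 0 is impossible, the strings differ

combined = first_choice + sub_optimum
better = "10" + "1" * 14

print(max_distance(sub_optimum, rest))  # 1
print(max_distance(combined, strings))  # 2
print(max_distance(better, strings))    # 1
```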


Huffman Algorithm correctness:

Need to prove two things:

Greedy Choice Property: There exists a minimum cost prefix tree where the two smallest frequency characters are siblings at the end of a longest path from the root.

This means that the greedy choice does not hurt finding the optimum.

Algorithm correctness:

Optimal Substructure Property: Once we choose the two least frequent elements and combine them to produce a smaller problem, an optimal solution to the smaller problem yields an optimal solution to the original problem when the two elements are added back.

Algorithm correctness:

There exists a minimum cost tree where the minimum frequency elements are longest path siblings:

Assume that is not the situation. Then there are two other sibling elements at the end of a longest path.

Say a, b are the elements with smallest frequency and x, y the sibling elements at the end of a longest path.

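Algorithm correctness:

[Figure: code tree CT, with the siblings x and y at depth $d_y$, the end of a longest path, and a at depth $d_a$.]

We know about depth and frequency:

$d_a \le d_y$

$f_a \le f_y$

We also know about code tree CT: $\sum_{\sigma} f_\sigma d_\sigma$ is smallest possible.

Now exchange a and y.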

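Algorithm correctness:

[Figure: code tree CT’, obtained from CT by exchanging a and y: a is now the sibling of x at depth $d_y$, and y is at depth $d_a$.]

$d_a \le d_y$ and $f_a \le f_y$, therefore $(f_y - f_a)(d_y - d_a) \ge 0$, that is, $f_a d_a + f_y d_y \ge f_y d_a + f_a d_y$. So

$$\mathrm{cost}(CT) = \sum_{\sigma} f_\sigma d_\sigma = \sum_{\sigma \ne a,y} f_\sigma d_\sigma + f_a d_a + f_y d_y \ge \sum_{\sigma \ne a,y} f_\sigma d_\sigma + f_y d_a + f_a d_y = \mathrm{cost}(CT')$$

Since CT has minimum cost, CT’ has minimum cost as well.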

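Algorithm correctness:

[Figure: the code tree after the exchange, with the siblings x and a at depth $d_x$ and b at depth $d_b$.]

Now do the same thing for b and x.

[Figure: code tree CT”, obtained by exchanging b and x: b and a are siblings at depth $d_x$, and x is at depth $d_b$.]

And get an optimal code tree where a and b are siblings with the longest paths.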

Algorithm correctness:

Optimal substructure property: Let a, b be the symbols with the smallest frequency. Let x be a new symbol whose frequency is $f_x = f_a + f_b$. Delete characters a and b, and find the optimal code tree CT for the reduced alphabet, which includes x.

Then CT’ = CT U {a,b}, the tree obtained from CT by making a and b the children of x, is an optimal tree for the original alphabet.

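Algorithm correctness:

[Figure: code tree CT, in which x is a leaf, and code tree CT’, in which x is an internal node with children a and b.]

$f_x = f_a + f_b$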

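Algorithm correctness:

Let $d_\sigma$ be the depth of σ in CT and $d'_\sigma$ its depth in CT’. Then $d'_a = d'_b = d_x + 1$, and $d'_\sigma = d_\sigma$ for every other symbol. With the sums over $\sigma \ne a,b$ ranging over the symbols of the original alphabet:

$$\begin{aligned}
\mathrm{cost}(CT') &= \sum_{\sigma} f_\sigma d'_\sigma = \sum_{\sigma \ne a,b} f_\sigma d'_\sigma + f_a d'_a + f_b d'_b \\
&= \sum_{\sigma \ne a,b} f_\sigma d'_\sigma + f_a (d_x + 1) + f_b (d_x + 1) \\
&= \sum_{\sigma \ne a,b} f_\sigma d'_\sigma + (f_a + f_b)(d_x + 1) \\
&= \sum_{\sigma \ne a,b} f_\sigma d_\sigma + f_x d_x + f_x = \mathrm{cost}(CT) + f_x
\end{aligned}$$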

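Algorithm correctness:

[Figure: code tree CT, in which x is a leaf, and code tree CT’, in which x is an internal node with children a and b, where $f_x = f_a + f_b$.]

$\mathrm{cost}(CT) + f_x = \mathrm{cost}(CT')$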
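A small numeric check of this identity on the earlier example (frequencies 10, 20, 30, 40, 50, 60), as a Python sketch. It uses the fact that the cost ∑ f·d of a code tree equals the sum of the weights of its internal nodes, since every merge adds one to the depth of each leaf below it. The helper name `greedy_cost` is hypothetical.

```python
import heapq


def greedy_cost(freqs: list[int]) -> int:
    """Cost of the tree built by repeatedly merging the two smallest weights."""
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # every leaf under this new node gets one level deeper
        heapq.heappush(heap, merged)
    return total


original = [10, 20, 30, 40, 50, 60]  # a = 10, b = 20
f_x = 10 + 20
reduced = [f_x, 30, 40, 50, 60]      # a and b replaced by x

print(greedy_cost(reduced), greedy_cost(original))  # 480 510
```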

Algorithm correctness:

Assume CT’ is not optimal.

By the previous lemma there is a tree CT” that is optimal, and where a and b are siblings. So

cost(CT”) < cost(CT’)

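Algorithm correctness:

Consider the tree CT’’’ obtained from CT” by deleting a and b, so that their parent becomes a leaf x with $f_x = f_a + f_b$.

[Figure: code tree CT’’’, in which x is a leaf, and code tree CT”, in which x is an internal node with children a and b.]

By a similar argument: $\mathrm{cost}(CT''') + f_x = \mathrm{cost}(CT'')$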

Algorithm correctness:

We get:

cost(CT’’’) = cost(CT”) – fx < cost(CT’) – fx = cost(CT)

and this contradicts the minimality of cost(CT).

Algorithm correctness:

Entropy:

We leave this for a compression course.