Lossless Source Coding Algorithms · Biometrics (with Tanya Ignatenko) LOSSLESS SOURCE CODING...

LOSSLESS SOURCE CODING ALGORITHMS Frans M.J. Willems INTRODUCTION HUFFMAN-TUNSTALL Binary IID Sources Huffman Code Tunstall Code ENUMERATIVE CODING Lexicographical Ordering FV: Pascal-Δ Method VF: Petry Code ARITHMETIC CODING Intervals Universal Coding, Individual Redundancy CONTEXT-TREE WEIGHTING IID, unknown θ Tree Sources Context Trees Coding Prbs., Redundancy REPETITION TIMES LZ77 Repetition Times, Kac Repetition-Time Algorithm Achieving Entropy CONCLUSION Lossless Source Coding Algorithms Frans M.J. Willems Department of Electrical Engineering Eindhoven University of Technology ISIT 2013, Istanbul, Turkey

LOSSLESS SOURCE CODING ALGORITHMS

Frans M.J. Willems

INTRODUCTION

HUFFMAN-TUNSTALL

Binary IID Sources

Huffman Code

Tunstall Code

ENUMERATIVE CODING

Lexicographical Ordering

FV: Pascal-∆ Method

VF: Petry Code

ARITHMETIC CODING

Intervals

Universal Coding, Individual Redundancy

CONTEXT-TREE WEIGHTING

IID, unknown θ

Tree Sources

Context Trees

Coding Prbs., Redundancy

REPETITION TIMES

LZ77

Repetition Times, Kac

Repetition-Time Algorithm

Achieving Entropy

CONCLUSION

Lossless Source Coding Algorithms

Frans M.J. Willems

Department of Electrical Engineering, Eindhoven University of Technology

ISIT 2013, Istanbul, Turkey

Outline

1 INTRODUCTION

2 HUFFMAN and TUNSTALL
   Binary IID Sources
   Huffman Code
   Tunstall Code

3 ENUMERATIVE CODING
   Lexicographical Ordering
   FV: Pascal-∆ Method
   VF: Petry Code

4 ARITHMETIC CODING
   Intervals
   Universal Coding, Individual Redundancy

5 CONTEXT-TREE WEIGHTING
   IID, unknown θ
   Binary Tree-Sources
   Context Trees
   Coding Probabilities

6 REPETITION TIMES
   LZ77
   Repetition Times, Kac
   Repetition-Time Algorithm
   Achieving Entropy

7 CONCLUSION

Choosing a Topic

POSSIBLE TOPICS:

Multi-user Information Theory (with Edward van der Meulen (KUL), Andries Hekstra)

Lossless Source Coding (with Tjalling Tjalkens, Yuri Shtarkov (IPPI), Paul Volf)

Watermarking, Embedding, and Semantic Coding (with Martin van Dijk, Ton Kalker (Philips Research))

Biometrics (with Tanya Ignatenko)

LOSSLESS SOURCE CODING ALGORITHMS

WHY?

Not many sessions at ISIT 2012! Is lossless source coding DEAD?

Lossless Source Coding is about UNDERSTANDING data. Universal Lossless Source Coding is focussing on FINDING STRUCTURE in data. MDL principle [Rissanen].

ALGORITHMS are fun (Piet Schalkwijk).

Lecture Structure

TUTORIAL, binary case, my favorite algorithms, ...

REMARKS, open problems, ...

Binary Sources, Sequences, IID

[Figure: a block labeled "Binary Source" emitting the symbols $x_1 x_2 \cdots x_N$.]

The binary source produces a sequence $x_1^N = x_1 x_2 \cdots x_N$ with components $x_n \in \{0, 1\}$, with probability $P(x_1^N)$.

Definition (Binary IID Source)

For an independent identically distributed (i.i.d.) source with parameter $\theta$, for $0 \le \theta \le 1$,

$$P(x_1^N) = \prod_{n=1}^{N} P(x_n),$$

where $P(1) = \theta$ and $P(0) = 1 - \theta$.

A sequence $x_1^N$ containing $N - w$ zeros and $w$ ones has probability

$$P(x_1^N) = (1 - \theta)^{N-w} \theta^{w}.$$

Entropy IID Source

The ENTROPY of this source is

$$h(\theta) \triangleq (1 - \theta) \log_2 \frac{1}{1 - \theta} + \theta \log_2 \frac{1}{\theta} \quad \text{(bits)}.$$
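The two formulas above can be checked numerically. The following is a minimal sketch; the function names `sequence_probability` and `binary_entropy` are illustrative choices and do not come from the slides.

```python
from math import log2

def sequence_probability(x, theta):
    """P(x_1^N) for a binary i.i.d. source with P(1) = theta."""
    w = sum(x)                      # number of ones in the sequence
    return (1 - theta) ** (len(x) - w) * theta ** w

def binary_entropy(theta):
    """h(theta) in bits, with the usual convention h(0) = h(1) = 0."""
    if theta in (0.0, 1.0):
        return 0.0
    return (1 - theta) * log2(1 / (1 - theta)) + theta * log2(1 / theta)

if __name__ == "__main__":
    print(sequence_probability([0, 1, 0], 0.3))   # one 1 and two 0s
    print(round(binary_entropy(0.3), 3))
```

For θ = 0.3 the entropy rounds to 0.881, the value used in the Huffman example later in the lecture.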

Fixed-to-Variable (FV) Length Codes

IDEA:

Give more probable sequences shorter codewords than less probable sequences.

Definition (FV-Length Code)

A FV-length code assigns to source sequence $x_1^N$ a binary codeword $c(x_1^N)$ of length $L(x_1^N)$. The rate of a FV code is

$$R \triangleq \frac{E[L(X_1^N)]}{N} \quad \text{(code-symbols/source-symbol)}.$$

GOAL:

We would like to find decodable FV-length codes that MINIMIZE this rate.

Prefix Codes

Definition (Prefix code)

In a prefix code no codeword is the prefix of any other codeword.

We focus on prefix codes. Codewords in a prefix code can be regarded as leaves in a rooted tree. Prefix codes lead to instantaneous decodability.

Example

$x_1^N$   $c(x_1^N)$   $L(x_1^N)$
00        0            1
01        10           2
10        110          3
11        111          3

[Figure: rooted code tree with root ∅, nodes 1 and 11, and leaves 0, 10, 110, 111.]
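The example code can be exercised with a short script. This is a sketch only: the helper names are hypothetical, and the value θ = 0.3 used for the rate is an assumption made for illustration (the slide does not fix θ for this example).

```python
# Code table copied from the example: source block -> codeword.
CODE = {"00": "0", "01": "10", "10": "110", "11": "111"}

def is_prefix_free(codewords):
    """True if no codeword is a prefix of another codeword."""
    words = sorted(codewords)
    # After sorting, a prefix is immediately followed by one of its extensions.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

def encode(source, code, n=2):
    """Encode a source string block by block (block length n)."""
    return "".join(code[source[i:i + n]] for i in range(0, len(source), n))

def decode(bits, code):
    """Instantaneous decoding: emit a block as soon as a codeword is complete."""
    inverse = {c: x for x, c in code.items()}
    out, word = [], ""
    for b in bits:
        word += b
        if word in inverse:
            out.append(inverse[word])
            word = ""
    if word:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(out)

def rate(code, theta, n=2):
    """R = E[L(X_1^N)] / N for a binary i.i.d. source with P(1) = theta."""
    expected = sum(
        (1 - theta) ** x.count("0") * theta ** x.count("1") * len(c)
        for x, c in code.items()
    )
    return expected / n

if __name__ == "__main__":
    bits = encode("00101101", CODE)
    print(is_prefix_free(CODE.values()), bits, decode(bits, CODE))
    print(rate(CODE, theta=0.3))   # theta = 0.3 is an assumed value
```

Because the code is prefix-free, the decoder never needs to look ahead: each codeword is recognised at its last bit.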

Prefix-Codes (cont.)

Theorem (Kraft, 1949)

(a) The lengths of the codewords in a prefix code satisfy Kraft’s inequality

$$\sum_{x_1^N \in \mathcal{X}^N} 2^{-L(x_1^N)} \le 1.$$

(b) For codeword lengths satisfying Kraft’s inequality there exists a prefix code with these lengths.

This leads to:

Theorem (Fano, 1961)

(a) Any prefix code satisfies

$$E[L(X_1^N)] \ge H(X_1^N) = N h(\theta),$$

or equivalently $R \ge h(\theta)$. The minimum is achieved if and only if $L(x_1^N) = -\log_2 P(x_1^N)$ (ideal codeword length) for all $x_1^N \in \mathcal{X}^N$ with nonzero $P(x_1^N)$.

(b) There exist prefix codes with

$$E[L(X_1^N)] < H(X_1^N) + 1 = N h(\theta) + 1,$$

or equivalently $R < h(\theta) + 1/N$.
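One way to see part (b) is to round the ideal codeword lengths up to integers, $L(x_1^N) = \lceil -\log_2 P(x_1^N) \rceil$; these lengths satisfy Kraft's inequality and cost less than one extra bit on average. The sketch below checks this numerically. The rounding-up construction and the helper names are not taken from the slides, and θ = 0.3, N = 3 are assumed values.

```python
from itertools import product
from math import ceil, log2

def iid_probs(theta, n):
    """Probabilities of all binary sequences of length n, P(1) = theta."""
    probs = {}
    for x in product("01", repeat=n):
        w = x.count("1")
        probs["".join(x)] = (1 - theta) ** (n - w) * theta ** w
    return probs

def kraft_sum(lengths):
    """Left-hand side of Kraft's inequality for a list of codeword lengths."""
    return sum(2.0 ** -l for l in lengths)

def rounded_ideal_lengths(probs):
    """Ideal lengths -log2 P(x) rounded up to the next integer."""
    return {x: ceil(-log2(p)) for x, p in probs.items() if p > 0}

theta, n = 0.3, 3                      # assumed example values
probs = iid_probs(theta, n)
lengths = rounded_ideal_lengths(probs)
entropy = -sum(p * log2(p) for p in probs.values())
expected_length = sum(probs[x] * lengths[x] for x in lengths)

if __name__ == "__main__":
    print("Kraft sum      :", kraft_sum(lengths.values()))
    print("H(X_1^N)       :", entropy)
    print("E[L(X_1^N)]    :", expected_length)
```

The same `kraft_sum` applied to the lengths 1, 2, 3, 3 of the earlier prefix-code example gives exactly 1, i.e. that code tree has no unused leaves.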

Huffman’s Code

Definition (Optimal FV-length Code)

A code that minimizes the expected codeword-length $E[L(X_1^N)]$ (and the rate $R$) is called optimal.

Theorem (Huffman, 1952)

The Huffman construction leads to an optimal FV-length code.

CONSTRUCTION:

Consider the set of probabilities $\{P(x_1^N), x_1^N \in \mathcal{X}^N\}$.

Replace the two smallest probabilities by a probability which is their sum. Label the branches from these two smallest probabilities to their sum with code-symbols “0” and “1”.

Continue like this until only one probability (equal to 1) is left.

Since Huffman’s code is optimal, part (b) of Fano’s theorem implies that it results in $E[L(X_1^N)] < H(X_1^N) + 1 = N h(\theta) + 1$ and therefore $R < h(\theta) + 1/N$.

Huffman’s Construction

Example

Let N = 3 and θ = 0.3, then h(0.3) = 0.881.

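[Figure: Huffman tree for the eight sequences of length 3. Leaf probabilities: .343, .147, .147, .063, .147, .063, .063, .027. Internal-node probabilities created by the successive merges: .090, .126, .216, .294, .363, .637, 1.00. The two branches entering each merge are labeled “0” and “1”.]

Now

$$E[L(X_1^N)] = 4(.027 + .063 + .063 + .063) + 3(.147 + .147) + 2(.147 + .343) = 2.726.$$

Therefore $R = 2.726/3 = 0.909$.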
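The construction and the numbers in this example can be reproduced with a small priority-queue implementation. This is a sketch rather than the lecture's own code: `huffman_lengths` is a hypothetical helper that only tracks codeword lengths, not the actual 0/1 labels.

```python
import heapq
from itertools import product

def huffman_lengths(probs):
    """Return the Huffman codeword length of each entry in probs."""
    # Heap entries: (probability, tie-breaker, indices of the leaves below this node).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        # Replace the two smallest probabilities by their sum.
        p0, _, group0 = heapq.heappop(heap)
        p1, _, group1 = heapq.heappop(heap)
        for i in group0 + group1:
            lengths[i] += 1            # every leaf below the merge gains one code symbol
        heapq.heappush(heap, (p0 + p1, counter, group0 + group1))
        counter += 1
    return lengths

theta, n = 0.3, 3
sequences = list(product((0, 1), repeat=n))
probs = [(1 - theta) ** (n - sum(x)) * theta ** sum(x) for x in sequences]
lengths = huffman_lengths(probs)
expected_length = sum(p * l for p, l in zip(probs, lengths))
rate = expected_length / n

if __name__ == "__main__":
    print(sorted(lengths), round(expected_length, 3), round(rate, 3))
```

Run as a script, this should report four codewords of length 4, two of length 3 and two of length 2, with expected length 2.726 and rate 0.909, matching the example.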

Remarks: Huffman Code

Note that $R \downarrow h(\theta)$ when $N \to \infty$.

Always $E[L(X_1^N)] \ge 1$. For $\theta \approx 0$ a Huffman code has expected codeword length $E[L(X_1^N)] \approx 1$ and rate $R \approx 1/N$.

Better bounds exist for Huffman codes than $E[L(X_1^N)] < H(X_1^N) + 1$. E.g. Gallager [1978] showed that

$$E[L(X_1^N)] - H(X_1^N) \le \max_{x_1^N} P(x_1^N) + 0.086.$$

Adaptive Huffman Codes (Gallager [1978]).
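The first and third remarks can be illustrated numerically. The sketch below computes the Huffman expected length as the sum of all merged (internal-node) probabilities, which equals the expected codeword length, and compares the redundancy with the bounds above. The choice θ = 0.3, the range N = 1, ..., 8 and all function names are assumptions made for illustration.

```python
import heapq
from math import comb, log2

def binary_entropy(theta):
    return -(1 - theta) * log2(1 - theta) - theta * log2(theta)

def huffman_expected_length(probs):
    """E[L] of a Huffman code: the sum of all probabilities created by merging."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

def block_probs(theta, n):
    """Probabilities of all 2^n blocks: comb(n, w) blocks contain w ones."""
    return [
        (1 - theta) ** (n - w) * theta ** w
        for w in range(n + 1)
        for _ in range(comb(n, w))
    ]

theta = 0.3                                   # assumed example value
results = {}
for n in range(1, 9):
    probs = block_probs(theta, n)
    expected_length = huffman_expected_length(probs)
    redundancy = expected_length - n * binary_entropy(theta)
    results[n] = (expected_length / n, redundancy, max(probs) + 0.086)

if __name__ == "__main__":
    print("h(theta) =", binary_entropy(theta))
    for n, (rate, redundancy, bound) in results.items():
        print(n, rate, redundancy, bound)
```

In this experiment every rate should lie between $h(\theta)$ and $h(\theta) + 1/N$, and the redundancy should stay below Gallager's bound; the sketch does not check how the rate behaves between consecutive values of N.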

Variable-to-Fixed (VF) Length Codes

IDEA:

Parse the source output into variable-length segments of roughly the same probability. Code all these segments with codewords of fixed length.

Definition (VF-Length Code)

A VF-length code is defined by a set of variable-length source segments. Each segment $x^*$ in the set gets a unique binary codeword $c(x^*)$ of length $L$. The length of a segment $x^*$ is denoted as $N(x^*)$. The rate of a VF-code is

$$R \triangleq \frac{L}{E[N(X^*)]} \quad \text{(code-symbols/source-symbol)}.$$

GOAL:

We would like to find parsable VF-length codes that MINIMIZE this rate.

Proper-and-Complete Segment Sets

Definition (Proper-and-Complete Segment Sets)

A set of source segments is proper-and-complete if each semi-infinite source sequence has a unique prefix in this segment set.

We focus on proper-and-complete segment sets. Segments in a proper-and-complete set can be regarded as leaves in a rooted tree. Such sets guarantee instantaneous parsability.

Example

$x^*$   $N(x^*)$   $c(x^*)$
1       1          11
01      2          10
001     3          01
000     3          00

[Figure: rooted segment tree with root ∅, nodes 0 and 00, and leaves 1, 01, 001, 000.]
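The example segment set can be turned into a small VF encoder. This is a sketch under stated assumptions: the helper names are hypothetical, the proper-and-complete test relies on the binary alphabet (prefix-free plus Kraft sum equal to 1), and θ = 0.3 is an assumed value for the rate computation.

```python
# Segment set copied from the example: source segment -> fixed-length codeword.
SEGMENTS = {"1": "11", "01": "10", "001": "01", "000": "00"}

def is_proper_and_complete(segments):
    """Binary case: prefix-free (proper) and Kraft sum exactly 1 (complete)."""
    words = sorted(segments)
    proper = all(not b.startswith(a) for a, b in zip(words, words[1:]))
    complete = sum(2.0 ** -len(s) for s in words) == 1.0
    return proper and complete

def encode(source, segments):
    """Parse the source into segments; return (code bits, unfinished tail)."""
    out, seg = [], ""
    for symbol in source:
        seg += symbol
        if seg in segments:          # a leaf of the segment tree has been reached
            out.append(segments[seg])
            seg = ""
    return "".join(out), seg

def expected_segment_length(segments, theta):
    """E[N(X*)] for a binary i.i.d. source with P(1) = theta."""
    return sum(
        (1 - theta) ** s.count("0") * theta ** s.count("1") * len(s)
        for s in segments
    )

if __name__ == "__main__":
    print(is_proper_and_complete(SEGMENTS))
    print(encode("0110000011", SEGMENTS))
    mean_len = expected_segment_length(SEGMENTS, theta=0.3)   # assumed theta
    print(mean_len, 2 / mean_len)                             # E[N(X*)] and R with L = 2
```

The source string in the usage example parses as 01 | 1 | 000 | 001 | 1, so each segment is recognised as soon as its last symbol arrives.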

Proper-and-Complete Segment Sets: Leaf-Node Lemma

Assume that the source is IID with parameter $\theta$. Consider a set of segments and all their prefixes. Depict them in a tree. The segments are leaves, the prefixes nodes. Note that all the nodes and leaves have a probability, e.g. $P(10) = \theta(1 - \theta)$. Let $F(\cdot)$ be a function on nodes and leaves.

[Figure: rooted tree with branches labeled 0 and 1, whose vertices carry the values $F(\emptyset)$, $F(0)$, $F(1)$, $F(10)$, $F(11)$, $F(100)$, $F(101)$.]

Lemma (Massey, 1983)

$$\sum_{l \in \text{leaves}} P(l)\,[F(l) - F(\emptyset)] = \sum_{n \in \text{nodes}} P(n) \sum_{s \in \text{sons of } n} \frac{P(s)}{P(n)}\,[F(s) - F(n)].$$

Let $F(x^*)$ = # of edges from $x^*$ to the root, then

$$E[N(X^*)] = \sum_{x^* \in \text{nodes}} P(x^*).$$

Let $F(x^*) = -\log_2 P(x^*)$, then

$$H(X^*) = E[N(X^*)]\,h(\theta).$$
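Both consequences of the lemma can be verified numerically on the tree from the figure, whose leaves are 0, 11, 100 and 101. The sketch below evaluates the two sides of the lemma for each choice of $F$; the helper names and the value θ = 0.3 are assumptions, and the code presumes that every internal node has both sons in the tree.

```python
from math import isclose, log2

LEAVES = ["0", "11", "100", "101"]     # leaves of the tree in the figure

def prob(s, theta):
    """Probability of the path s for a binary i.i.d. source with P(1) = theta."""
    return (1 - theta) ** s.count("0") * theta ** s.count("1")

def internal_nodes(leaves):
    """All proper prefixes of the leaves, including the root ''."""
    return {leaf[:k] for leaf in leaves for k in range(len(leaf))}

def lemma_sides(leaves, theta, F):
    """Left- and right-hand side of the leaf-node lemma for a function F."""
    lhs = sum(prob(l, theta) * (F(l) - F("")) for l in leaves)
    rhs = 0.0
    for n in internal_nodes(leaves):
        for s in (n + "0", n + "1"):
            # P(n) * P(s)/P(n) simplifies to P(s).
            rhs += prob(s, theta) * (F(s) - F(n))
    return lhs, rhs

theta = 0.3                                           # assumed example value
h = -(1 - theta) * log2(1 - theta) - theta * log2(theta)

# F = depth: the left side is E[N(X*)], the right side the sum of node probabilities.
mean_len, node_sum = lemma_sides(LEAVES, theta, len)

# F = -log2 P: the left side is H(X*), the right side E[N(X*)] h(theta).
entropy, scaled = lemma_sides(LEAVES, theta, lambda s: -log2(prob(s, theta)))

if __name__ == "__main__":
    print(mean_len, node_sum)
    print(entropy, scaled, mean_len * h)
    print(isclose(entropy, mean_len * h))
```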

Proper-and-Complete Segment Sets: Result

Theorem

For any proper-and-complete segment set with no more than $2^L$ segments

$$L \ge H(X^*) = E[N(X^*)]\,h(\theta),$$

or $R = L/E[N(X^*)] \ge h(\theta)$.

More precisely, since

$$R = \frac{L}{E[N(X^*)]} = \frac{L}{H(X^*)}\,h(\theta),$$

we should make $H(X^*)$ as close as possible to $L$, hence all segments should have roughly the same probability.
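The first inequality of the theorem is the standard bound on the entropy of a random variable that takes at most $2^L$ values. A short worked version, using only the quantities defined on this slide:

```latex
H(X^*) \;\le\; \log_2 \bigl(\text{number of segments}\bigr) \;\le\; \log_2 2^L \;=\; L,
\qquad\text{so}\qquad
R \;=\; \frac{L}{E[N(X^*)]} \;=\; \frac{L}{H(X^*)}\,h(\theta) \;\ge\; h(\theta).
```

Equality in the first step requires all $2^L$ segments to be equally likely, which is why near-uniform segment probabilities are the design goal.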


Tunstall’s Code

Consider 0 < θ ≤ 1/2.

Definition (Optimal VF-length Code)

A code that maximizes the expected segment length $E[N(X^*)]$ is called optimal. Such a code minimizes the rate $R$.

Theorem (Tunstall, 1967)

The Tunstall construction leads to an optimal code.

CONSTRUCTION:

Start with the empty segment ∅, which has unit probability.

As long as the number of segments is smaller than $2^L$, replace a segment $s$ with largest probability $P(s)$ by two segments $s0$ and $s1$. The probabilities of the new segments (leaves) are $P(s0) = P(s)(1-\theta)$ and $P(s1) = P(s)\theta$.

The Tunstall construction results in $H(X^*) \ge L - \log_2(1/\theta)$ and therefore

$$R \le \frac{L}{L + \log_2 \theta}\,h(\theta)$$

(Jelinek and Schneider [1972]).
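The construction above can be sketched with a priority queue. This is an illustrative implementation under stated assumptions, not code from the slides: the function name `tunstall` is hypothetical, ties between equally likely leaves are broken by the segment string, and the expected segment length is accumulated as the sum of the probabilities of the split nodes (leaf-node lemma).

```python
import heapq

def tunstall(L, theta):
    """Binary Tunstall segment set with 2**L segments for an IID source, P(1) = theta.

    Returns (leaves, expected_length), where leaves maps segment -> probability.
    """
    # heapq is a min-heap, so probabilities are stored negated.
    heap = [(-1.0, "")]
    expected_length = 0.0
    while len(heap) < 2 ** L:
        neg_p, s = heapq.heappop(heap)      # leaf with the largest probability
        p = -neg_p
        expected_length += p                # the leaf becomes a node
        heapq.heappush(heap, (-p * (1 - theta), s + "0"))
        heapq.heappush(heap, (-p * theta, s + "1"))
    leaves = {s: -neg_p for neg_p, s in heap}
    return leaves, expected_length

leaves, e_len = tunstall(3, 0.3)
rate = 3 / e_len
print(sorted(leaves), round(e_len, 3), round(rate, 3))
```

Each split adds exactly one segment, so the loop performs $2^L - 1$ splits; a heap keeps each step logarithmic in the number of segments.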


Tunstall’s Construction

Example

Let L = 3 and θ = 0.3. Again h(0.3) = 0.881.

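[Figure: Tunstall tree for L = 3 and θ = 0.3, with edges labeled 0 and 1. Node probabilities and their sons: 1.00 → .700, .300; .700 → .490, .210; .490 → .343, .147; .343 → .240, .103; .300 → .210, .090; .240 → .168, .072; .210 → .147, .063.]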

Now E[N(X∗)] = 1.0 + .7 + .3 + .49 + .21 + .343 + .240 = 3.283 and therefore R = 3/3.283 = 0.914.


Remarks: Tunstall Code

Note that R ↓ h(θ) when L→∞.

For θ ≈ 0 a Tunstall code has expected segment length $E[N(X^*)] \approx 2^L - 1$ and rate $R \approx L/(2^L - 1)$. Better than Huffman for L = N.

In each step in the Tunstall procedure, a leaf with the largest probability is changed into a node. This leads to:

The largest increase in expected segment length (Massey LN-lemma), and $P(n) \ge P(l)$ for all nodes $n$ and leaves $l$. Therefore for any two leaves $l$ and $l'$, with $n$ the parent node of $l$, we can say that

$$P(l) \ge \theta P(n) \ge \theta P(l').$$

So leaves cannot differ too much in probability. This fact is used to lower-bound $H(X^*) \ge L - \log_2(1/\theta)$.

Optimal VF-length codes can also be found by fixing a number γ and defining a node to be internal if its probability is ≥ γ (Khodak, 1969). The size of the segment set is not completely controllable now.

Run-length Codes (Golomb [1966]).
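The entropy bound mentioned in these remarks can be spelled out in a few lines. This is a sketch assuming $0 < \theta \le 1/2$ and a Tunstall set with $2^L$ leaves. Every leaf has probability at least $\theta$ times that of its parent, and every node is at least as probable as any leaf, so any leaf $l'$ is at most $1/\theta$ times as probable as the least probable leaf. Since the $2^L$ leaf probabilities sum to one, the smallest is at most $2^{-L}$:

```latex
P(l') \;\le\; \frac{1}{\theta}\,\min_{l} P(l) \;\le\; \frac{2^{-L}}{\theta}
\quad\Longrightarrow\quad
H(X^*) \;=\; \sum_{l'} P(l') \log_2 \frac{1}{P(l')} \;\ge\; L - \log_2 \frac{1}{\theta}.
```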


Outline

1 INTRODUCTION

2 HUFFMAN and TUNSTALL: Binary IID Sources; Huffman Code; Tunstall Code

3 ENUMERATIVE CODING: Lexicographical Ordering; FV: Pascal-∆ Method; VF: Petry Code

4 ARITHMETIC CODING: Intervals; Universal Coding, Individual Redundancy

5 CONTEXT-TREE WEIGHTING: IID, unknown θ; Binary Tree-Sources; Context Trees; Coding Probabilities

6 REPETITION TIMES: LZ77; Repetition Times, Kac; Repetition-Time Algorithm; Achieving Entropy

7 CONCLUSION


Lexicographical Ordering

IDEA:

Sequences having the same weight (and probability) only need to be INDEXED. The binary representation of the index can be taken as codeword.

Definition (Lexicographical Ordering, Index)

In a lexicographical ordering (0 < 1) we say that $x_1^N < y_1^N$ if $x_n < y_n$ for the smallest index $n$ such that $x_n \ne y_n$.

Consider a subset $S$ of the set $\{0,1\}^N$. Let $i_S(x_1^N)$ be the lexicographical index of $x_1^N \in S$, i.e., the number of sequences $y_1^N \in S$ with $y_1^N < x_1^N$.

Example

Let N = 5 and $S = \{x_1^N : w(x_1^N) = 2\}$, where $w(x_1^N)$ is the weight of $x_1^N$. Then $|S| = \binom{5}{2} = 10$ and:

i_S(11000) = 9    i_S(01010) = 4

i_S(10100) = 8    i_S(01001) = 3

i_S(10010) = 7    i_S(00110) = 2

i_S(10001) = 6    i_S(00101) = 1

i_S(01100) = 5    i_S(00011) = 0
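The index table of this example can be reproduced by brute force, directly from the definition (count the members of S that precede a sequence). A minimal sketch; the function name `lex_index_table` is an illustrative choice.

```python
from itertools import product

def lex_index_table(N, weight):
    """Map each binary string of length N with the given weight to its lexicographic index."""
    members = sorted(
        "".join(bits)
        for bits in product("01", repeat=N)
        if bits.count("1") == weight
    )
    # Position in the sorted list = number of members of S that are smaller.
    return {x: i for i, x in enumerate(members)}

table = lex_index_table(5, 2)
for x in sorted(table, reverse=True):
    print(x, table[x])
```

This enumeration costs time exponential in N; it only serves as a reference against which sequential index computations can be compared.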


Sequential Enumeration

Theorem (Cover, 1973)

From the sequence $x_1^N \in S$ we can compute the index

$$i_S(x_1^N) = \sum_{n = 1, \ldots, N:\ x_n = 1} \#_S(x_1, x_2, \cdots, x_{n-1}, 0),$$

where $\#_S(x_1, x_2, \cdots, x_k)$ denotes the number of sequences in $S$ having prefix $x_1, \cdots, x_k$.

Moreover, from the index $i_S(x_1^N)$ the sequence $x_1^N$ can be computed if the numbers $\#_S(x_1, x_2, \cdots, x_{n-1}, 0)$ for $n = 1, \ldots, N$ are available.

The index of a sequence can be represented by a codeword of fixed length $\lceil \log_2 |S| \rceil$.

Example

Index $i_S(10100) = \#_S(0) + \#_S(100) = \binom{4}{2} + \binom{2}{1} = 6 + 2 = 8$; hence, since $|S| = 10$, the corresponding codeword is 1000.
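For the fixed-weight set of the example, the counts in Cover's formula are binomial coefficients, so the index can be computed in one pass over the sequence. The sketch below assumes S is the set of all binary strings with the same length and weight as the input; `cover_index` and `codeword` are hypothetical names.

```python
from math import comb

def cover_index(x):
    """Lexicographic index of x among all strings of the same length and weight."""
    N = len(x)
    ones_left = x.count("1")
    index = 0
    for n, bit in enumerate(x):
        if bit == "1":
            # #S(x_1..x_{n-1}, 0): all remaining ones must fit in the N-n-1 later positions.
            index += comb(N - n - 1, ones_left)
            ones_left -= 1
    return index

def codeword(x):
    """Fixed-length binary codeword of ceil(log2 |S|) bits for x."""
    size = comb(len(x), x.count("1"))
    width = (size - 1).bit_length()          # exact ceil(log2(size)) for size >= 1
    return format(cover_index(x), "b").zfill(width) if width else ""

print(cover_index("10100"), codeword("10100"))
```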


FV: Pascal-Triangle Method

IDEA:

Index sequences of fixed weight. Later use a Huffman code (or a fixed-length code) to describe the weights.

Example (Lynch (1966), Davisson (1966), Schalkwijk (1972))

Let N = 5 and $S = \{x_1^N : \sum x_n = 2\}$. Then $|S| = \binom{5}{2} = 10$.

[Figure: Pascal triangle for N = 5 and weight 2, with edges labeled 0 and 1. Each state holds the number of sequences in S that can still be completed from it: 10 at the start; 6 after a 0 and 4 after a 1; then 3, 3, 1; then 1, 2, 1; and 1 in the remaining states.]

Index from Sequence:

i(10100) = 6 + 2 = 8.

Sequence from Index: Index i = 8, now
a) 8 ≥ 6 hence x1 = 1,
b) i < 6 + 3 hence x2 = 0,
c) i ≥ 6 + 2 hence x3 = 1,
d) x4 = x5 = 0.
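The decoding steps a) to d) generalize to any index, length and weight. The following sketch walks through the Pascal triangle using binomial coefficients computed on the fly rather than a stored table; `sequence_from_index` is an illustrative name, and the fixed-weight set S is assumed.

```python
from math import comb

def sequence_from_index(index, N, weight):
    """Recover the weight-`weight` binary string of length N with the given lexicographic index."""
    bits = []
    ones_left = weight
    for n in range(N):
        # Number of sequences in S that share the decoded prefix and continue with a 0.
        zero_branch = comb(N - n - 1, ones_left)
        if ones_left > 0 and index >= zero_branch:
            bits.append("1")
            index -= zero_branch     # skip all sequences that continue with a 0
            ones_left -= 1
        else:
            bits.append("0")
    return "".join(bits)

print(sequence_from_index(8, 5, 2))
print([sequence_from_index(i, 5, 2) for i in range(10)])
```

In a table-driven implementation the values of `zero_branch` would be read from the stored triangle, which is what keeps the method free of a large code table.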


FV: Pascal-Triangle Method (cont.)

First note that

$$H(X_1^N) = H(X_1^N, w(X_1^N)) = H(W) + H(X_1^N \mid W).$$

If we use enumerative coding for $X_1^N$ given weight $w$, then, since all sequences with a fixed weight have equal probability,

$$E[L(X_1^N \mid W)] = \sum_{w=0}^{N} P(w) \left\lceil \log_2 \binom{N}{w} \right\rceil < \sum_{w=0}^{N} P(w) \log_2 \binom{N}{w} + 1 = H(X_1^N \mid W) + 1.$$

If $W$ is encoded using a Huffman code we obtain

$$E[L(X_1^N)] = E[L(W)] + E[L(X_1^N \mid W)] \le H(W) + 1 + H(X_1^N \mid W) + 1 = H(X_1^N) + 2.$$

Worse than Huffman, but no big code table is needed.
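The decomposition and the bound on the enumerative part can be evaluated numerically. This is a rough sketch for an IID source with P(1) = θ; the function name `two_part_terms` and the parameter choice N = 5, θ = 0.3 are assumptions made for illustration, and the Huffman code for W is not constructed here.

```python
from math import comb, log2

def two_part_terms(N, theta):
    """Return (H(X), H(W), H(X|W), E[L(X|W)]) in bits for block length N."""
    h = -theta * log2(theta) - (1 - theta) * log2(1 - theta)
    H_W = H_cond = E_len = 0.0
    for w in range(N + 1):
        size = comb(N, w)                                   # sequences of weight w
        p_w = size * theta ** w * (1 - theta) ** (N - w)    # P(W = w)
        H_W -= p_w * log2(p_w)
        H_cond += p_w * log2(size)                          # uniform given the weight
        E_len += p_w * (size - 1).bit_length()              # exact ceil(log2(size))
    return N * h, H_W, H_cond, E_len

H_X, H_W, H_cond, E_len = two_part_terms(5, 0.3)
print(H_X, H_W + H_cond, H_cond, E_len)
```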


Remarks: FV Pascal-Triangle Method

Enumeration for sequences generated by Markov sources (Cover [1973]).

Universal approach:

Davisson [1966]

If $W$ is encoded with a fixed-length codeword of $\lceil \log_2(N+1) \rceil$ bits, then entropy is achieved for every θ for N → ∞.

Lexicographical ordering is also possible for variable-length source segments.


VF: Petry Code

IDEA:

Modify the Tunstall segment sets such that the segments can be indexed.

Again let 0 < θ ≤ 1/2. It can be shown that a proper-and-complete segment set is a Tunstall set (maximal $E[N(X^*)]$ given the number of segments) if and only if for all nodes $n$ and all leaves $l$

$$P(n) \ge P(l).$$

Consequence

If the segments $x^*$ in a proper-and-complete segment set satisfy

$$P(x^{*-1}) > \gamma \ge P(x^*),$$

where $x^{*-1}$ is $x^*$ without the last symbol, this segment set is a Tunstall set. The constant γ determines the size of the set.

Since

$$P(x^*) = (1-\theta)^{n_0(x^*)}\,\theta^{n_1(x^*)},$$

where $n_0(x^*)$ is the number of zeros in $x^*$ and $n_1(x^*)$ the number of ones in $x^*$, this is equivalent to

$$A\,n_0(x^{*-1}) + B\,n_1(x^{*-1}) < C \le A\,n_0(x^*) + B\,n_1(x^*)$$

for $A = -\log_b(1-\theta)$, $B = -\log_b \theta$, $C = -\log_b \gamma$, and some log-base $b$.


VF: Petry Code (cont.)

Note that log-base $b$ has to satisfy

$$1 = (1-\theta) + \theta = b^{-A} + b^{-B}.$$

For special values of θ, $A$ and $B$ are integers. E.g. for $\theta = (1-\theta)^2$ we obtain $A = 1$ and $B = 2$ for $b = (1+\sqrt{5})/2$. Now $C$ can also be assumed to be an integer. The corresponding codes are called Petry codes.

Definition (Petry (Schalkwijk), 1982)

Fix integers $A$ and $B$. The segments $x^*$ in a proper-and-complete Petry segment set satisfy

$$A\,n_0(x^{*-1}) + B\,n_1(x^{*-1}) < C \le A\,n_0(x^*) + B\,n_1(x^*).$$

Integer $C$ can be chosen to control the size of the set.

Linear Array

Petry codes can be implemented using a linear array.


VF: Petry Code (cont.)

Example

Consider A = 1, B = 2 (the costs, or step sizes, of a 0 and of a 1). For given $C$, let $S(C)$ denote the resulting segment set and $\sigma(C)$ its cardinality. Let $S(-1) = S(0) = \{\emptyset\}$, then $S(1) = \{0, 1\}$, $S(2) = \{00, 01, 1\}$, etc. Moreover $\sigma(-1) = \sigma(0) = 1$, $\sigma(1) = 2$ and $\sigma(2) = 3$, etc. It is easy to see that

$$\sigma(C) = \sigma(C-1) + \sigma(C-2),$$

and therefore σ(3) = 5, σ(4) = 8, σ(5) = 13, σ(6) = 21, σ(7) = 34, and σ(8) = 55.

Now take C = 8. Note that 010010 ∈ S(8). We can now determine the index i(010010) using Cover's formula:

$$i(010010) = \#(00) + \#(01000) = \sigma(6) + \sigma(2) = 21 + 3 = 24.$$

[Figure: linear array holding 55, 34, 21, 13, 8, 5, 3, 2, 1, 1, i.e. σ(8) down to σ(−1), with steps between cells labeled 0 and 1.]
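The example can be replayed in code. The sketch below assumes A = 1 and B = 2 as on the slide and treats S(C) for C ≤ 0 as the set holding only the empty segment; the names `sigma`, `segments` and `petry_index` are illustrative. A recursive function with caching stands in for the linear array.

```python
from functools import lru_cache

A, B = 1, 2   # cost of a 0 and cost of a 1 (values from the example)

@lru_cache(maxsize=None)
def sigma(C):
    """Number of segments in S(C); for C <= 0 only the empty segment remains."""
    return 1 if C <= 0 else sigma(C - A) + sigma(C - B)

def segments(C, prefix=""):
    """All segments of S(C) in lexicographic order (0 before 1)."""
    if C <= 0:
        return [prefix]
    return segments(C - A, prefix + "0") + segments(C - B, prefix + "1")

def petry_index(segment, C):
    """Cover's formula: for every 1, add the number of segments that continue with a 0 instead."""
    index, remaining = 0, C
    for bit in segment:
        if bit == "1":
            index += sigma(remaining - A)
            remaining -= B
        else:
            remaining -= A
    return index

print([sigma(C) for C in range(-1, 9)])
print(petry_index("010010", 8))
```

The index computation only reads values of σ at decreasing arguments, which is the access pattern a linear array supports.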


VF: Petry Code (cont.)

Theorem (Tjalkens & W. (1987))

A Petry code with parameters $A < B$ and $C$ is a Tunstall code for parameter $q$, where $q = b^{-B}$ when $b$ is the solution of $b^{-A} + b^{-B} = 1$. For arbitrary θ the rate satisfies

$$\frac{\log_2 \sigma(C)}{E[N(X^*)]} \le \frac{C + (B-1)}{C}\,\bigl(h(\theta) + d(\theta\|q)\bigr).$$

Example

In the table q for several values of A and B:

        B = 2    B = 3    B = 4    B = 5
A = 1   0.382    0.318    0.276    0.245
A = 2            0.4302   0.382    0.346
A = 3                     0.450    0.412
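Entries of this table can be recomputed by solving $b^{-A} + b^{-B} = 1$ numerically. The sketch below uses plain bisection on the interval (1, 2), which contains the root for integers 1 ≤ A < B; the function name `petry_q` and the tolerance are assumptions of this example.

```python
def petry_q(A, B, tol=1e-12):
    """Solve b**-A + b**-B = 1 for b in (1, 2) by bisection and return q = b**-B."""
    lo, hi = 1.0, 2.0            # the left side is 2 at b = 1 and below 1 at b = 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid ** -A + mid ** -B > 1:
            lo = mid             # left side still too large: root lies to the right
        else:
            hi = mid
    b = (lo + hi) / 2
    return b ** -B

for A, B in [(1, 2), (1, 3), (2, 3), (2, 4)]:
    print(A, B, round(petry_q(A, B), 4))
```

For A = 1, B = 2 the root is the golden ratio, so q = b⁻² ≈ 0.382, the value that satisfies q = (1 − q)².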


Remarks: VF Petry Code

Note that $\log_2 \sigma(C)/E[N(X^*)] \downarrow h(\theta) + d(\theta\|q)$ when $C \to \infty$, hence a Petry code achieves entropy for θ = q.

Tjalkens and W. investigated VF-length Petry codes for Markov sources, again with a linear array for each state.

VF-length universal enumerative solutions exist (Lawrence [1977], Tjalkens and W. [1992]).

The numbers in the linear array show exponential behaviour. Also an array $\lfloor 2^{-i/M} \rceil_f$ for $i = 1, \ldots, M$ can be used, through which we make steps (Tjalkens [PhD, 1987]). This reduces the storage complexity and is similar to Rissanen [1976] multiplication-avoiding arithmetic coding (Generalized Kraft inequality).


Idea Elias

Elias:

If source sequences are ORDERED LEXICOGRAPHICALLY then codewords can be COMPUTED SEQUENTIALLY from the source sequence using conditional PROBABILITIES of the next symbol given the previous ones, and vice versa.


Source Intervals

Definition

Order the source sequences $x_1^N \in \{0,1\}^N$ lexicographically according to 0 < 1. Now, to each source sequence $x_1^N \in \{0,1\}^N$ there corresponds a source interval

$$I(x_1^N) = [Q(x_1^N),\ Q(x_1^N) + P(x_1^N))$$

with

$$Q(x_1^N) = \sum_{\tilde{x}_1^N < x_1^N} P(\tilde{x}_1^N).$$

By construction the source intervals are all disjoint. Their union is [0, 1).

Example

Consider an I.I.D. source with θ = 0.2 and N = 2.

x_1^N   P(x_1^N)   Q(x_1^N)   I(x_1^N)
00      0.64       0          [0, 0.64)
01      0.16       0.64       [0.64, 0.8)
10      0.16       0.8        [0.8, 0.96)
11      0.04       0.96       [0.96, 1)
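The source intervals of the example can be computed symbol by symbol, without summing over all smaller sequences. The following sketch uses exact rational arithmetic and assumes an IID source with P(1) = θ; `source_interval` is an illustrative name.

```python
from fractions import Fraction
from itertools import product

def source_interval(x, theta):
    """Return (Q, P): lower endpoint and width of the source interval of binary string x."""
    Q, P = Fraction(0), Fraction(1)
    for bit in x:
        if bit == "1":
            Q += P * (1 - theta)   # skip the sub-interval of continuations with a 0
            P *= theta
        else:
            P *= 1 - theta
    return Q, P

theta = Fraction(1, 5)
for bits in product("01", repeat=2):
    x = "".join(bits)
    Q, P = source_interval(x, theta)
    print(x, float(P), float(Q), (float(Q), float(Q + P)))
```

Each step only needs the probability of the next symbol, which is what makes the sequential computation attributed to Elias possible.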


Code Intervals

A codeword $c$ with length $L$ can be regarded as a binary fraction $.c$. If we concatenate this codeword with others the corresponding fraction can increase, but by no more than $2^{-L}$.

Definition

To a codeword $c(x_1^N)$ with length $L(x_1^N)$ there corresponds a code interval

$$J(x_1^N) = [.c(x_1^N),\ .c(x_1^N) + 2^{-L(x_1^N)}).$$

Note that $J(x_1^N) \subset [0, 1)$.

Arithmetic coding: Encoding and Decoding

Procedure

ENCODING: Choose $c$ such that the code interval ⊆ source interval, i.e.

$$[.c,\, .c + 2^{-L}) \subseteq [Q(x_1^N),\, Q(x_1^N) + P(x_1^N)).$$

DECODING: Possible, since there is only one source interval that contains the code interval.

Theorem

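For a sequence $x_1^N$ with source interval $I(x_1^N) = [Q(x_1^N),\, Q(x_1^N) + P(x_1^N))$, take $c(x_1^N)$ as the codeword with

$$L(x_1^N) \triangleq \left\lceil \log_2 \frac{1}{P(x_1^N)} \right\rceil + 1,$$

$$.c(x_1^N) \triangleq \left\lceil Q(x_1^N) \cdot 2^{L(x_1^N)} \right\rceil \cdot 2^{-L(x_1^N)}.$$

Then

$$J(c(x_1^N)) \subseteq I(x_1^N)$$

and

$$L(x_1^N) < \log_2 \frac{1}{P(x_1^N)} + 2,$$

i.e. less than two bits above the ideal codeword length.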
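A minimal sketch of the construction in the theorem, assuming Q and P are given as exact fractions. The function name `elias_codeword` is hypothetical; the internal assertion checks the claimed inclusion of the code interval in the source interval.

```python
from fractions import Fraction
from math import ceil

def elias_codeword(q, p):
    """Codeword (as a bit string) for the source interval [q, q + p)."""
    # k = ceil(log2(1/p)), computed exactly as the smallest k with 2**-k <= p
    k = 0
    while Fraction(1, 2 ** k) > p:
        k += 1
    length = k + 1
    c = ceil(q * 2 ** length)  # the binary fraction .c equals c * 2**-length
    lo, hi = Fraction(c, 2 ** length), Fraction(c + 1, 2 ** length)
    assert q <= lo and hi <= q + p  # code interval lies inside source interval
    return format(c, f"0{length}b")

# theta = 0.2, N = 2, sequence 01: Q = 0.64, P = 0.16
print(elias_codeword(Fraction(16, 25), Fraction(4, 25)))  # prints 1011
```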

Example

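I.I.D. source with θ = 0.2 and N = 2.

[Figure: the source intervals and the code intervals that lie inside them.]

x_1^N   source interval   codeword   code interval
00      [0, 0.64)         00         [0, 1/4)
01      [0.64, 0.80)      1011       [11/16, 3/4)
10      [0.80, 0.96)      1101       [13/16, 7/8)
11      [0.96, 1.00)      111110     [31/32, 63/64)

Source intervals are disjoint ⇒ code intervals are disjoint ⇒ prefix condition holds.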

Arithmetic Coding: Sequential Computation (Elias)

Example (Connection to Cover’s formula)

Let L = 3 and θ = 0.2.

[Figure: binary tree of all sequences of length 1, 2 and 3: 0, 1; 00, 01, 10, 11; 000, 001, ..., 111.]

Q(101) = P(0) + P(100) = 0.8 + 0.2 · 0.8 · 0.8 = 0.928.

P(101) = P(1)P(0)P(1) = 0.2 · 0.8 · 0.2 = 0.032.

Arithmetic Coding: Sequential Computation (Elias)

In general

$$Q(x_1^N) = \sum_{n = 1, \ldots, N:\; x_n = 1} P(x_1, x_2, \cdots, x_{n-1}, 0),$$

$$P(x_1^N) = \prod_{n=1}^{N} P(x_n \mid x_1, x_2, \cdots, x_{n-1}).$$

Sequential Computation

If we have access to $P(x_1, x_2, \cdots, x_n, 0)$ and $P(x_1, x_2, \cdots, x_n, 1)$ after having processed $x_1, x_2, \cdots, x_n$, for $n = 0, 1, \cdots, N-1$, we can compute $I(x_1^N)$ sequentially.
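The two formulas translate directly into a symbol-by-symbol update of the interval. The sketch below assumes a caller-supplied function giving P(next symbol = 1 | past); the names `sequential_interval` and `p_one_given_past` are illustrative.

```python
from fractions import Fraction

def sequential_interval(bits, p_one_given_past):
    """Return (Q, P) of the source interval, updated one symbol at a time."""
    low, width = Fraction(0), Fraction(1)  # interval of the empty sequence
    past = []
    for x in bits:
        p1 = p_one_given_past(past)
        p0 = 1 - p1
        if x == 1:
            low += width * p0  # adds the term P(x_1, ..., x_{n-1}, 0)
            width *= p1
        else:
            width *= p0
        past.append(x)
    return low, width

# IID source with theta = 0.2, sequence 101
q, p = sequential_interval([1, 0, 1], lambda past: Fraction(1, 5))
print(float(q), float(p))  # prints 0.928 0.032
```

For the sequence 101 this gives the same Q(101) and P(101) as the worked example.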

Universal Coding

Coding Probabilities

If the actual probabilities $P(x_1^N)$ are not known, arithmetic coding is still possible if instead of $P(x_1^N)$ we use coding probabilities $P_c(x_1^N)$ satisfying

$$P_c(x_1^N) > 0 \text{ for all } x_1^N, \quad \text{and} \quad \sum_{x_1^N} P_c(x_1^N) = 1.$$

Then

$$L(x_1^N) < \log_2 \frac{1}{P_c(x_1^N)} + 2.$$

[Figure: $x_1^N$ → Encoder → $c(x_1^N)$ → Decoder → $x_1^N$. Encoder and decoder are both supplied with $P_c(x_1 \cdots x_{n-1}, 0)$ and $P_c(x_1 \cdots x_{n-1}, 1)$ for $n = 1, \ldots, N$.]

PROBLEM: How do we choose the coding probabilities $P_c(x_1^N)$?

Individual Redundancy

Definition

The individual redundancy $\rho(x_1^N)$ of a sequence $x_1^N$ is defined as

$$\rho(x_1^N) = L(x_1^N) - \log_2 \frac{1}{P(x_1^N)},$$

i.e. codeword length minus ideal codeword length.

Bound Individual Redundancy

Arithmetic coding based on coding probabilities $\{P_c(x_1^N),\, x_1^N \in \{0,1\}^N\}$ yields

$$\rho(x_1^N) < \log_2 \frac{1}{P_c(x_1^N)} + 2 - \log_2 \frac{1}{P(x_1^N)} = \log_2 \frac{P(x_1^N)}{P_c(x_1^N)} + 2.$$

We say that the CODING redundancy < 2 bits. The coding probabilities should be as large as possible (as close as possible to the actual probabilities). Next we focus on the remaining part of the individual redundancy, $\log_2 \frac{P(x_1^N)}{P_c(x_1^N)}$.

Remarks: Arithmetic Coding

Shannon [1948] already described the relation between codewords and intervals, however with ordered probabilities. Called Shannon-Fano code.

Shannon-Fano-Elias: arbitrary ordering, but not sequential.

Finite-precision issues of arithmetic coding were solved by Pasco [1976] and Rissanen [1976].

Outline

1 INTRODUCTION

2 HUFFMAN and TUNSTALL
   Binary IID Sources
   Huffman Code
   Tunstall Code

3 ENUMERATIVE CODING
   Lexicographical Ordering
   FV: Pascal-∆ Method
   VF: Petry Code

4 ARITHMETIC CODING
   Intervals
   Universal Coding, Individual Redundancy

5 CONTEXT-TREE WEIGHTING
   IID, unknown θ
   Binary Tree-Sources
   Context Trees
   Coding Probabilities

6 REPETITION TIMES
   LZ77
   Repetition Times, Kac
   Repetition-Time Algorithm
   Achieving Entropy

7 CONCLUSION

CTW: Universal Codes

IDEA:

Find good coding probabilities for sources with UNKNOWN PARAMETERS and STRUCTURE. Use WEIGHTING!

Coding for a Binary IID Source, Unknown θ

Definition (Krichevsky-Trofimov estimator (1981))

A good coding probability $P_c(x_1^N)$ for a sequence $x_1^N$ that contains $a$ zeroes and $b = N - a$ ones is

$$P_e(a, b) = \int_{\theta=0}^{1} \frac{1}{\pi \sqrt{(1-\theta)\theta}} \cdot (1-\theta)^a \theta^b \, d\theta.$$

(Dirichlet-(1/2, 1/2) prior, “weighting”).

Theorem

Upper bound on the PARAMETER redundancy:

$$\log_2 \frac{P(x_1^N)}{P_c(x_1^N)} = \log_2 \frac{(1-\theta)^a \theta^b}{P_e(a, b)} \le \frac{1}{2} \log_2 (a + b) + 1 = \frac{1}{2} \log_2 N + 1$$

for all θ and $x_1^N$ with $a$ zeroes and $b$ ones.

Probability of a sequence with $a$ zeroes and $b$ ones followed by a zero:

$$P_e(a + 1, b) = \frac{a + 1/2}{a + b + 1} \cdot P_e(a, b),$$

hence SEQUENTIAL COMPUTATION is possible!
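The update rule gives a simple way to compute the estimator one symbol at a time. The sketch below starts from P_e(0, 0) = 1 and uses the symmetric rule P_e(a, b + 1) = (b + 1/2)/(a + b + 1) · P_e(a, b) for a one, which follows from the definition by exchanging the roles of zeroes and ones. The function name is illustrative.

```python
from fractions import Fraction

def kt_probability(bits):
    """Krichevsky-Trofimov coding probability of a binary sequence."""
    a = b = 0            # zeroes and ones seen so far
    pe = Fraction(1)     # P_e(0, 0) = 1
    for x in bits:
        if x == 0:
            pe *= Fraction(2 * a + 1, 2 * (a + b + 1))  # (a + 1/2)/(a + b + 1)
            a += 1
        else:
            pe *= Fraction(2 * b + 1, 2 * (a + b + 1))  # (b + 1/2)/(a + b + 1)
            b += 1
    return pe

print(kt_probability([0, 1, 0]))  # prints 1/16
```

The result depends only on the counts a and b, not on the order of the symbols.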

Individual Redundancy Binary IID source

The total individual redundancy

$$\rho(x_1^N) < \log_2 \frac{(1-\theta)^a \theta^b}{P_e(a, b)} + 2 \le \left( \frac{1}{2} \log_2 N + 1 \right) + 2$$

for all θ and $x_1^N$ with $a$ zeroes and $b$ ones.

Shtarkov [1988]: $\frac{1}{2} \log_2 N$ behaviour is asymptotically optimal for individual redundancy for $N \to \infty$ (NML-estimator)!

Rissanen [1984]: Also for expected redundancy, $\frac{1}{2} \log_2 N$ behaviour is asymptotically optimal.

CTW: Binary Tree-Sources

Definition

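[Figure: binary tree with nodes 0, 1, 00 and 10, giving the contexts of $x_n$ in $\cdots x_{n-2}\, x_{n-1}\, x_n$.]

(tree-) model $\mathcal{M} = \{00, 10, 1\}$

parameters: $\theta_1 = 0.1$, $\theta_{10} = 0.3$, $\theta_{00} = 0.5$

$P(X_n = 1 \mid \cdots, X_{n-1} = 1) = 0.1$

$P(X_n = 1 \mid \cdots, X_{n-2} = 1, X_{n-1} = 0) = 0.3$

$P(X_n = 1 \mid \cdots, X_{n-2} = 0, X_{n-1} = 0) = 0.5$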
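A small sketch of how this tree model assigns probabilities, assuming the past is stored as a string in time order so that a context is a suffix of the past. The names `MODEL`, `prob_of_one` and `sequence_probability` are illustrative.

```python
# Tree model from the slide: context (in time order) -> P(X_n = 1 | context)
MODEL = {"00": 0.5, "10": 0.3, "1": 0.1}

def prob_of_one(past, model=MODEL):
    """P(X_n = 1 | past), where past is the string ... x_{n-2} x_{n-1}."""
    for context, theta in model.items():
        if past.endswith(context):  # exactly one leaf matches a long enough past
            return theta
    raise ValueError("past is too short to determine the context")

def sequence_probability(past, x, model=MODEL):
    """P(x | past) as the product of the conditional symbol probabilities."""
    prob = 1.0
    for symbol in x:
        theta = prob_of_one(past, model)
        prob *= theta if symbol == "1" else 1.0 - theta
        past += symbol
    return prob

print(prob_of_one("110"))  # prints 0.3
```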

CTW: Problem, Concepts

PROBLEM: What are good coding probabilities for sequences $x_1^N$ produced by a tree-source with

an unknown tree-model,

and unknown parameters?

CONCEPTS:

CONTEXT TREE (Rissanen [1983])

WEIGHTING: If $P_1(x)$ and $P_2(x)$ are two alternative coding probabilities for sequence $x$, then the weighted probability

$$P_w(x) \triangleq \frac{P_1(x) + P_2(x)}{2} \ge \frac{1}{2} \max(P_1(x), P_2(x)),$$

thus we lose at most a factor of 2, which is one bit in redundancy.

Context Trees

Definition (Context Tree)

[Figure: complete binary context tree of depth 3 with nodes 0, 1; 00, 10, 01, 11; 000, 100, 010, 110, 001, 101, 011, 111.]

Node $s$ contains the sequence of source symbols that have occurred following context $s$. Depth is $D$.

Context-tree splits up sequences in subsequences

Example

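n       1 2 3 4 5 6 7 ⋯
x_n     0 1 1 0 1 0 0 ⋯      (past: 010)

[Figure: context tree of depth 3; each node s lists the indices n of the symbols x_n that follow context s.]

depth 0:  root: 1234567
depth 1:  0: 1257    1: 346
depth 2:  00: 2    10: 157    01: 36    11: 4
depth 3:  000: -    100: 2    010: 17    110: 5    001: 3    101: 6    011: 4    111: -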
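The splitting in this example can be checked with a few lines of Python. The sketch assumes that sequence and past are strings in time order and that the past is at least as long as the tree depth; `split_by_context` is an illustrative name.

```python
def split_by_context(past, x, depth):
    """Map each context s (length 0..depth) to the indices n with context s."""
    full = past + x
    offset = len(past)  # requires len(past) >= depth
    nodes = {}
    for n in range(1, len(x) + 1):
        pos = offset + n - 1  # position of x_n in full
        for d in range(depth + 1):
            context = full[pos - d:pos]  # the d symbols preceding x_n
            nodes.setdefault(context, []).append(n)
    return nodes

nodes = split_by_context("010", "0110100", 3)
print(nodes["0"], nodes["1"])  # prints [1, 2, 5, 7] [3, 4, 6]
```

Contexts that never occur (here 000 and 111) simply do not appear in the dictionary.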

Coding Probabilities: Leaves of the Context-Tree

[Figure: context tree of depth 3 with the tree-model $\mathcal{M} = \{00, 10, 1\}$.]

CTW: Leaves

The subsequence corresponding to a leaf $s$ of the context tree is IID. A good coding probability for this subsequence is therefore

$$P_w(s) \triangleq P_e(a_s, b_s),$$

where $a_s$ and $b_s$ are the numbers of zeroes and ones in this subsequence.

Coding Probabilities: Internal nodes of the Context-Tree

[Figure: context tree of depth 3 with the tree-model $\mathcal{M} = \{00, 10, 1\}$.]

The subsequence corresponding to a node $s$ of the context tree is

IID if the node $s$ is not an internal node of the actual tree-model,

a combination of the subsequences that correspond to nodes $0s$ and $1s$, if $s$ is an internal node of the actual tree-model.

Weighting

CTW: Internal Nodes

Weighting the coding probabilities corresponding to both alternatives yields the coding probability

$$P_w(s) \triangleq \frac{P_e(a_s, b_s) + P_w(0s) \cdot P_w(1s)}{2}$$

for the subsequence that corresponds to node $s$. Recursively we find in the root $\emptyset$ of the context-tree the coding probability $P_w(\emptyset)$ for the entire source sequence $x_1^N$.

IMPORTANT:

Coding probability $P_w(\emptyset)$ can be computed sequentially.
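The recursion can be written down directly. The sketch below is a plain, non-sequential evaluation for illustration only: it recounts the symbols for every node and does not show the sequential update mentioned above. It uses P_w(s) = P_e(a_s, b_s) at depth D, assumes the past is at least D symbols long, and all function names are illustrative.

```python
from fractions import Fraction

def kt(a, b):
    """P_e(a, b), built up with the sequential KT update rule."""
    pe = Fraction(1)
    for i in range(a):
        pe *= Fraction(2 * i + 1, 2 * (i + 1))
    for j in range(b):
        pe *= Fraction(2 * j + 1, 2 * (a + j + 1))
    return pe

def ctw_probability(past, x, depth):
    """Weighted coding probability in the root of a depth-`depth` context tree."""
    full = past + x
    start = len(past)  # requires len(past) >= depth

    def counts(s):
        a = b = 0
        for pos in range(start, len(full)):
            if full[pos - len(s):pos] == s:  # symbol at pos follows context s
                if full[pos] == "0":
                    a += 1
                else:
                    b += 1
        return a, b

    def pw(s):
        a, b = counts(s)
        if len(s) == depth:  # leaf of the context tree
            return kt(a, b)
        return (kt(a, b) + pw("0" + s) * pw("1" + s)) / 2

    return pw("")

print(ctw_probability("010", "0110100", 3))  # prints 95/32768
```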

Redundancy (tree-model M = {00, 10, 1})

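Actual probability:

$$P(x_1^N) = (1-\theta_{00})^{a_{00}} \theta_{00}^{b_{00}} \, (1-\theta_{10})^{a_{10}} \theta_{10}^{b_{10}} \, (1-\theta_1)^{a_1} \theta_1^{b_1}.$$

Lower bound coding probability:

$$\begin{aligned}
P_w(\emptyset) &\ge \frac{1}{2} P_w(0) \cdot P_w(1) \\
&\ge \frac{1}{2} \cdot \frac{1}{2} P_w(00) \cdot P_w(10) \cdot \frac{1}{2} P_e(a_1, b_1) \\
&\ge \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} P_e(a_{00}, b_{00}) \cdot \frac{1}{2} P_e(a_{10}, b_{10}) \cdot \frac{1}{2} P_e(a_1, b_1).
\end{aligned}$$

Parameter redundancy bounds for the subsequences in the leaves of tree-model $\mathcal{M} = \{00, 10, 1\}$:

$$\log_2 \frac{(1-\theta_{00})^{a_{00}} \theta_{00}^{b_{00}}}{P_e(a_{00}, b_{00})} \le \frac{1}{2} \log_2 (a_{00} + b_{00}) + 1,$$

$$\log_2 \frac{(1-\theta_{10})^{a_{10}} \theta_{10}^{b_{10}}}{P_e(a_{10}, b_{10})} \le \frac{1}{2} \log_2 (a_{10} + b_{10}) + 1,$$

$$\log_2 \frac{(1-\theta_1)^{a_1} \theta_1^{b_1}}{P_e(a_1, b_1)} \le \frac{1}{2} \log_2 (a_1 + b_1) + 1.$$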

Redundancy (General)

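$$\begin{aligned}
\rho(x_1^N) &< \log_2 \frac{P(x_1^N)}{P_w(\emptyset)} + 2 \\
&\le \log_2 32 + \frac{1}{2} \log_2 (a_{00} + b_{00}) + 1 + \frac{1}{2} \log_2 (a_{10} + b_{10}) + 1 + \frac{1}{2} \log_2 (a_1 + b_1) + 1 + 2 \\
&\le 5 + \left( \frac{3}{2} \log_2 \frac{N}{3} + 3 \right) + 2,
\end{aligned}$$

for all $x_1^N$, and all $\theta_{00}$, $\theta_{10}$, and $\theta_1$.

Theorem (W., Shtarkov, and Tjalkens (1995))

In general for a tree source with $|\mathcal{M}|$ leaves (parameters):

$$\rho(x_1^N) < (2|\mathcal{M}| - 1) + \left( \frac{|\mathcal{M}|}{2} \log_2 \frac{N}{|\mathcal{M}|} + |\mathcal{M}| \right) + 2 \text{ bits.}$$

(model, parameter, and coding redundancies)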
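To get a feeling for the size of the three terms, the bound can be evaluated numerically. This is only a direct transcription of the theorem's formula; the function name is illustrative.

```python
from math import log2

def ctw_redundancy_bound(num_leaves, n):
    """Upper bound (in bits) on the individual redundancy from the theorem."""
    model = 2 * num_leaves - 1                                   # model redundancy
    parameter = num_leaves / 2 * log2(n / num_leaves) + num_leaves  # parameter redundancy
    coding = 2                                                   # coding redundancy
    return model + parameter + coding

# Example with |M| = 3 leaves, as for the model {00, 10, 1}
for n in (3, 300, 30000):
    print(n, round(ctw_redundancy_bound(3, n), 2))
```

Only the parameter term grows with N, and it does so logarithmically.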

Simulation: Model plus Parameter Redundancies

Redundancies for the CTW method, but also for methods focussing on M = ∅, M = {0, 1}, actual model M = {00, 10, 1}, M = {0, 01, 11} and M = {00, 10, 01, 11}. The CTW method improves over the best model!

[Figure: redundancies(n) versus n, for n from 0 to 500; vertical axis from 0 to 40.]

Remarks: Context-Tree Weighting

CTW implements a “weighting” (Bayes mixture) over all tree-models with depth not exceeding $D$, i.e.

$$P_w(\emptyset) = \sum_{\mathcal{M} \le D} P(\mathcal{M}) \, P_e(x_1^N \mid \mathcal{M}),$$

with $P_e(x_1^N \mid \mathcal{M}) = \prod_{s \in \mathcal{M}} P_e(a_s, b_s)$ and $P(\mathcal{M}) = 2^{-(2|\mathcal{M}|-1)}$.

There is one tree-model of depth 0 (i.e., the IID model). If there are $\#_d$ models of depth not exceeding $d$, then $\#_{d+1} = \#_d^2 + 1$. Therefore $\#_1 = 2$, $\#_2 = 5$, $\#_3 = 26$, $\#_4 = 677$, $\#_5 = 458330$, $\#_6 = 210066388901$, $\#_7 = 4.4128 \cdot 10^{22}$, $\#_8 = 1.9473 \cdot 10^{45}$, etc.

Straightforward analysis. No model estimation, which only gives asymptotic results, as in e.g. Rissanen [1983, 1986] and Weinberger, Rissanen, and Feder [1995].

The number of computations needed to process the source sequence $x_1^N$ is linear in $N$. The same holds for the storage complexity.
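The model counts quoted in the remarks follow from the recursion: a model of depth at most d + 1 is either the single root or a pair of subtrees of depth at most d. A short sketch (the function name is illustrative):

```python
def tree_model_counts(max_depth):
    """Number of tree-models of depth not exceeding d, for d = 0..max_depth."""
    counts = [1]  # depth 0: only the IID model
    for _ in range(max_depth):
        counts.append(counts[-1] ** 2 + 1)
    return counts

print(tree_model_counts(6))
```

The doubly exponential growth is why the mixture is computed through the context tree rather than by enumerating models.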

Remarks: Context-Tree Weighting (cont.)

Optimal parameter redundancy behavior in the Rissanen [1984] sense (i.e., $\frac{1}{2} \log_2 N$ bits/parameter).

A modified version achieves entropy not only for tree sources but for all stationary ergodic sources.

More general context-algorithms (splitting rules) were proposed. The context of $x_n$ need not be $x_{n-d}, x_{n-d+1}, \cdots, x_{n-1}$.

A two-pass version (context-tree maximizing) exists that finds the best model (MDL) matching the source sequence. Now

$$P_m(s) \triangleq \frac{\max[P_e(a_s, b_s),\, P_m(0s) \cdot P_m(1s)]}{2}.$$

If a (minimal) tree source with model $\mathcal{M}$ generates the sequence $x_1^N$, the maximizing method produces a model estimate which is correct with probability one as $N \to \infty$.

Lempel-Ziv 1977 Compression

IDEA:

Let the data speak for itself.

LZ77 compression is achieved by replacing repeated segments in the data with pointers and lengths. To avoid deadlock an uncoded symbol is added to each pointer and length.

Example (LZ77)

search buffer   look-ahead buffer   output
                abracadabra         (0,-,a)
a               bracadabra          (0,-,b)
ab              racadabra           (0,-,r)
abr             acadabra            (3,1,c)
abrac           adabra              (2,1,d)
abracad         abra                (7,4, )

QUESTION:

Why does this method work? Note that the statistics of the data are unknown!
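The example can be reproduced with a minimal Python sketch. Assumptions that are not on the slide: a greedy longest match over an unbounded search buffer, ties resolved in favour of the nearest match, a length of 0 written where the slide shows "-", and an empty string as the uncoded symbol when the data ends inside a match. The function names are illustrative.

```python
def lz77_encode(data):
    """Encode a string as a list of (offset, length, next_symbol) triples."""
    out = []
    pos = 0
    while pos < len(data):
        best_offset, best_length = 0, 0
        for offset in range(1, pos + 1):  # nearest candidates first
            length = 0
            while (pos + length < len(data)
                   and data[pos + length - offset] == data[pos + length]):
                length += 1
            if length > best_length:  # strict: ties keep the smaller offset
                best_offset, best_length = offset, length
        # uncoded symbol after the match ("" if the data ends here)
        next_symbol = data[pos + best_length:pos + best_length + 1]
        out.append((best_offset, best_length, next_symbol))
        pos += best_length + 1
    return out

def lz77_decode(triples):
    """Invert lz77_encode."""
    out = []
    for offset, length, symbol in triples:
        for _ in range(length):
            out.append(out[-offset])  # copy from `offset` positions back
        if symbol:
            out.append(symbol)
    return "".join(out)

print(lz77_encode("abracadabra"))
```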

Repetition Times

Consider the discrete stationary and ergodic process

$$\cdots, X_{-3}, X_{-2}, X_{-1}, X_0, X_1, X_2, \cdots.$$

Suppose that $X_1 = x$ for symbol-value $x \in \mathcal{X}$ with $\Pr\{X_1 = x\} > 0$. We say that the repetition time of the $x$ that occurred at time $t = 1$ is $m$ if $X_{1-m} = x$ and $X_t \neq x$ for $t = 2-m, \cdots, 0$.

[Figure: example with $m = 4$: $X_{-3} = x$, $X_{-2} \neq x$, $X_{-1} \neq x$, $X_0 \neq x$, $X_1 = x$; $X_2$ is still unknown.]

Definition (Average Repetition Time)

Let $Q_x(m)$ be the conditional probability that the repetition time of the $x$ occurring at $t = 1$ is $m$, hence

$$Q_x(m) \triangleq \Pr\{X_{1-m} = x, X_{2-m} \neq x, \cdots, X_0 \neq x \mid X_1 = x\}.$$

The average repetition time for symbol-value $x$ with $\Pr\{X_1 = x\} > 0$ is now defined as

$$T(x) \triangleq \sum_{m=1,2,\cdots} m\, Q_x(m).$$
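The definition can be made concrete on a finite sequence. The following is a small Python sketch under stated assumptions: the function names are hypothetical, list positions stand in for the time index, and occurrences without an earlier copy inside the finite list are simply skipped, so the average is only an empirical stand-in for $T(x)$.

```python
def repetition_time(seq, t):
    """Smallest m >= 1 with seq[t - m] == seq[t], or None if there is no earlier copy."""
    for m in range(1, t + 1):
        if seq[t - m] == seq[t]:
            return m
    return None


def empirical_average_repetition_time(seq, x):
    """Average of the observed repetition times of symbol x (first occurrence skipped)."""
    times = [repetition_time(seq, t) for t, s in enumerate(seq) if s == x]
    times = [m for m in times if m is not None]
    return sum(times) / len(times) if times else None


# The m = 4 situation: x, three other symbols, then x again.
print(repetition_time(["x", "a", "b", "c", "x"], 4))
```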

Kac’s Result

Example

Consider an IID (binary) process and assume that $\Pr\{X_1 = 1\} = \theta > 0$. Then $Q_1(m) = \theta(1-\theta)^{m-1}$ and

$$T(1) = \sum_{m=1,2,\cdots} m\,\theta(1-\theta)^{m-1} = \frac{1}{\theta}.$$
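The closed form can be checked by differentiating the geometric series. The shorthand $q = 1 - \theta$ is introduced only for this derivation, with $0 < \theta \le 1$ so that $0 \le q < 1$:

```latex
\begin{align*}
T(1) &= \sum_{m=1}^{\infty} m\,\theta\,(1-\theta)^{m-1}
      = \theta \sum_{m=1}^{\infty} m\,q^{m-1}
      = \theta\,\frac{d}{dq}\sum_{m=0}^{\infty} q^{m} \\
     &= \theta\,\frac{d}{dq}\,\frac{1}{1-q}
      = \frac{\theta}{(1-q)^{2}}
      = \frac{\theta}{\theta^{2}}
      = \frac{1}{\theta}.
\end{align*}
```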

Theorem (Kac, 1947)

For stationary and ergodic processes, for any $x$ with $\Pr\{X_1 = x\} > 0$,

$$T(x) = \frac{1}{\Pr\{X_1 = x\}}.$$

Note that Kac’s result also holds for “sliding” $N$-blocks, hence

$$T((x_1, x_2, \cdots, x_N)) = \frac{1}{\Pr\{(X_1, X_2, \cdots, X_N) = (x_1, x_2, \cdots, x_N)\}},$$

if $\Pr\{(X_1, X_2, \cdots, X_N) = (x_1, x_2, \cdots, x_N)\} > 0$. Now the repetition time is equal to $m$ when $m$ is the smallest positive integer such that $(x_{1-m}, x_{2-m}, \cdots, x_{N-m}) = (x_1, x_2, \cdots, x_N)$.
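Kac's identity can be verified exactly on a toy process. The sketch below assumes a cyclic (periodic) sequence read from a uniformly random starting point, which is one simple stationary and ergodic example; the function names are hypothetical. For every block that occurs, the average repetition time over its occurrences is compared with the reciprocal of its relative frequency, using exact fractions.

```python
from fractions import Fraction


def cyclic_repetition_times(seq, block_len=1):
    """For each position t of the cyclic sequence, return (block at t, repetition time)."""
    n = len(seq)

    def block(t):
        return tuple(seq[(t + i) % n] for i in range(block_len))

    result = []
    for t in range(n):
        m = 1
        while block(t - m) != block(t):    # stops at m = n at the latest
            m += 1
        result.append((block(t), m))
    return result


def kac_check(seq, block_len=1):
    """Map each occurring block to (average repetition time, 1 / probability)."""
    n = len(seq)
    times = {}
    for b, m in cyclic_repetition_times(seq, block_len):
        times.setdefault(b, []).append(m)
    return {b: (Fraction(sum(ms), len(ms)), 1 / Fraction(len(ms), n))
            for b, ms in times.items()}


for b, (avg, inv_prob) in kac_check([0, 0, 1, 0, 1, 1, 0, 1, 0, 0], block_len=2).items():
    print(b, avg, inv_prob)
```

On a cycle the repetition times of a block are the gaps between its consecutive occurrences, and these gaps add up to the cycle length, which is why the two columns agree.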

Universal Source Coding Based on Repetition Times

Suppose that our source is binary, i.e. $X_t \in \{0, 1\}$ for all integers $t$.

[Figure: the buffer $x_{-7}, \cdots, x_0$ followed by the block $x_1 x_2 x_3$; the block equals the three symbols located $m = 4$ positions earlier.]

A. The encoder wants to convey a source block $x_1^N \triangleq (x_1, x_2, \cdots, x_N)$ to the decoder. Both encoder and decoder have access to buffers containing all previous source symbols $\cdots, x_{-2}, x_{-1}, x_0$.

B. Using these previous source symbols the encoder can determine the repetition time $m$ of $x_1^N$. It is the smallest positive integer $m$ that satisfies

$$x_{1-m}^{N-m} = x_1^N,$$

where $x_{1-m}^{N-m} \triangleq (x_{1-m}, x_{2-m}, \cdots, x_{N-m})$.
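Step B amounts to a backward search. A minimal Python sketch follows, assuming the buffer is a finite list whose last element is $x_0$ and returning `None` when no match exists inside it; the function name is hypothetical. Note that for $m < N$ the matching window overlaps the block itself, which the concatenation below handles.

```python
def block_repetition_time(buffer, block):
    """Smallest m >= 1 such that the N symbols starting m positions back equal `block`."""
    n_block = len(block)
    seq = list(buffer) + list(block)       # past symbols followed by x_1 .. x_N
    start = len(buffer)                    # position of x_1 inside seq
    for m in range(1, len(buffer) + 1):
        if seq[start - m:start - m + n_block] == list(block):
            return m
    return None


# Buffer x_{-7} .. x_0 and block x_1 x_2 x_3, arranged so that m = 4.
print(block_repetition_time([1, 1, 1, 1, 0, 1, 1, 1], [0, 1, 1]))
```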

Universal Source Coding Based on Repetition Times (cont.)

C. Repetition time $m$ is now encoded and sent to the decoder. The code for $m$ consists of a preamble $p(m)$ and an index $i(m)$ and has length $l(m)$.

Example

Code table for the waiting time $m$ for $N = 3$:

m      p(m)   i(m)               l(m)
1      00     -                  2+0=2
2      01     0                  2+1=3
3      01     1                  2+1=3
4      10     00                 2+2=4
5      10     01                 2+2=4
6      10     10                 2+2=4
7      10     11                 2+2=4
≥ 8    11     copy of x1 x2 x3   2+3=5

In general there are $N + 1$ groups. There are index groups with 1, 2, up to $2^{N-1}$ elements, hence the index lengths are 0, 1, up to $N - 1$. The last group is the “copy” group. A “copy” has length $N$. We use a preamble $p(m)$ of $\lceil \log_2(N + 1) \rceil$ bits to specify one of these $N + 1$ alternatives.
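The table can be generated for any $N$ with a few lines of Python. This sketch assumes that the preamble is simply the group number written in binary, which reproduces the $N = 3$ table but is otherwise an assumption, since the slide only requires that the preamble identify one of the $N + 1$ groups. The function names are hypothetical.

```python
def code_for_repetition_time(m, n_block):
    """Return (preamble, index) as bit strings for 1 <= m < 2**n_block."""
    pre_len = n_block.bit_length()         # equals ceil(log2(N + 1)) for N >= 1
    group = m.bit_length() - 1             # floor(log2 m), also the index length
    preamble = format(group, "0{}b".format(pre_len))
    index = format(m - (1 << group), "0{}b".format(group)) if group > 0 else ""
    return preamble, index


def copy_preamble(n_block):
    """Preamble of the 'copy' group, used when m >= 2**n_block."""
    return format(n_block, "0{}b".format(n_block.bit_length()))


def print_table(n_block):
    for m in range(1, 2 ** n_block):
        p, i = code_for_repetition_time(m, n_block)
        print(m, p, i or "-", len(p) + len(i))
    print(">=", 2 ** n_block, copy_preamble(n_block), "copy",
          n_block.bit_length() + n_block)


print_table(3)
```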

Universal Source Coding Based on Repetition Times (cont.)

For arbitrary N we get for the code-block length l(m)

$$l(m) = \begin{cases} \lceil \log_2(N + 1) \rceil + \lfloor \log_2 m \rfloor & \text{if } m < 2^N, \\ \lceil \log_2(N + 1) \rceil + N & \text{if } m \geq 2^N. \end{cases}$$

This results in the upper bound

$$l(m) \leq \lceil \log_2(N + 1) \rceil + \log_2 m.$$

D. After decoding $m$ the decoder can reconstruct $x_1^N$ using the previous source symbols in the buffer. With this block $x_1^N$ both the encoder and decoder can update their buffers.

E. Then the next block

$$x_{N+1}^{2N} \triangleq (x_{N+1}, x_{N+2}, \cdots, x_{2N})$$

is processed in a similar way, etc.

Note:

Buffers need only contain the previous $2^N - 1$ source symbols!
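Steps D and E and the length formula can be sketched as follows. The code assumes the same convention as the $N = 3$ table (preamble equal to the group number in binary, index equal to $m$ minus the first element of its group); for other $N$ that convention is an assumption, and all function names are hypothetical.

```python
from math import log2


def codeword_length(m, n_block):
    """l(m): preamble length plus floor(log2 m), or plus N for the copy group."""
    pre_len = n_block.bit_length()         # ceil(log2(N + 1))
    return pre_len + (m.bit_length() - 1 if m < 2 ** n_block else n_block)


def decode_block(bits, buffer, n_block):
    """Decode one codeword from the front of `bits`; return (block, bits used)."""
    pre_len = n_block.bit_length()
    group = int(bits[:pre_len], 2)
    if group == n_block:                   # copy group: the block follows verbatim
        block = [int(b) for b in bits[pre_len:pre_len + n_block]]
        return block, pre_len + n_block
    index = int(bits[pre_len:pre_len + group], 2) if group > 0 else 0
    m = (1 << group) + index
    seq = list(buffer)
    for _ in range(n_block):
        seq.append(seq[-m])                # x_i = x_{i-m}; may reuse fresh symbols
    return seq[-n_block:], pre_len + group


def update_buffer(buffer, block, n_block):
    """Append the block and keep only the last 2**N - 1 symbols."""
    return (list(buffer) + list(block))[-(2 ** n_block - 1):]


buffer = [1, 1, 1, 1, 0, 1, 1, 1]
block, used = decode_block("1000", buffer, 3)   # codeword for m = 4 when N = 3
print(block, used, update_buffer(buffer, block, 3))
```

Copying one symbol at a time is what makes repetition times $m < N$ decodable, because part of the source of the copy is the block being reconstructed.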

Analysis of the Repetition-Time Algorithm

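Assume that a certain $x_1^N$ occurred as first block. What is then the average codeword length $L(x_1^N)$ for $x_1^N$?

$$
\begin{aligned}
L(x_1^N) &= \sum_{m=1,2,\cdots} Q_{x_1^N}(m)\, l(m) \\
&\overset{(a)}{\leq} \sum_{m=1,2,\cdots} Q_{x_1^N}(m) \lceil \log_2(N + 1) \rceil + \sum_{m=1,2,\cdots} Q_{x_1^N}(m) \log_2 m \\
&\overset{(b)}{\leq} \lceil \log_2(N + 1) \rceil + \log_2 \sum_{m=1,2,\cdots} m\, Q_{x_1^N}(m) \\
&\overset{(c)}{=} \lceil \log_2(N + 1) \rceil + \log_2 \frac{1}{\Pr\{X_1^N = x_1^N\}}.
\end{aligned}
$$

Here (a) follows from the upper bound for $l(m)$, (b) from Jensen’s inequality. Furthermore (c) follows from Kac’s theorem.

The result is the ideal codeword length plus $\lceil \log_2(N + 1) \rceil$.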
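Step (b) can be spelled out. Writing $M$ for the repetition time of the block, distributed according to $Q_{x_1^N}$ (the symbol $M$ and the expectation notation are shorthand introduced for this note), the first sum in (a) collapses because the probabilities add up to one, and the second is bounded by the concavity of $\log_2$:

```latex
\begin{align*}
\sum_{m=1,2,\cdots} Q_{x_1^N}(m)\,\lceil \log_2(N+1) \rceil
  &= \lceil \log_2(N+1) \rceil, \\
\sum_{m=1,2,\cdots} Q_{x_1^N}(m)\,\log_2 m
  &= \mathbb{E}\left[\log_2 M \,\middle|\, X_1^N = x_1^N\right] \\
  &\le \log_2 \mathbb{E}\left[M \,\middle|\, X_1^N = x_1^N\right]
   = \log_2 \sum_{m=1,2,\cdots} m\,Q_{x_1^N}(m)
   = \log_2 T(x_1^N).
\end{align*}
```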

Analysis of the Repetition-Time Algorithm (cont.)

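The probability that $x_1^N$ occurs as block is $\Pr\{X_1^N = x_1^N\}$. For the average codeword length $L(X_1^N)$ we therefore get

$$
\begin{aligned}
L(X_1^N) &= \sum_{x_1^N} \Pr\{X_1^N = x_1^N\}\, L(x_1^N) \\
&\leq \sum_{x_1^N} \Pr\{X_1^N = x_1^N\} \left( \lceil \log_2(N + 1) \rceil + \log_2 \frac{1}{\Pr\{X_1^N = x_1^N\}} \right) \\
&= \lceil \log_2(N + 1) \rceil + H(X_1^N).
\end{aligned}
$$

For the rate $R_N$ we now obtain

$$R_N = \frac{L(X_1^N)}{N} \leq \frac{H(X_1^N)}{N} + \frac{\lceil \log_2(N + 1) \rceil}{N}.$$

Theorem (W., 1986, 1989)

The repetition-time algorithm achieves entropy since

$$\lim_{N \to \infty} R_N = \lim_{N \to \infty} \left( \frac{H(X_1^N)}{N} + \frac{\lceil \log_2(N + 1) \rceil}{N} \right) = H_\infty(X).$$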
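To get a feeling for how fast the overhead term vanishes, the bound can be evaluated numerically. The sketch below assumes an IID binary source with parameter `theta`, for which $H(X_1^N)/N$ equals the binary entropy function; this special case and the function names are illustrative choices, not part of the slides. It evaluates the bound only and says nothing about the buffer size such block lengths would require.

```python
from math import log2


def binary_entropy(theta):
    """Entropy in bits of a binary source with Pr{X = 1} = theta."""
    if theta in (0.0, 1.0):
        return 0.0
    return -theta * log2(theta) - (1 - theta) * log2(1 - theta)


def rate_bound(theta, n_block):
    """Upper bound H(X_1^N)/N + ceil(log2(N + 1))/N for an IID binary source."""
    overhead_bits = n_block.bit_length()   # equals ceil(log2(N + 1)) for N >= 1
    return binary_entropy(theta) + overhead_bits / n_block


for n_block in (4, 16, 64, 256, 1024):
    print(n_block, round(rate_bound(0.1, n_block), 4))
```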

Remarks: Repetition-Time Algorithm

Universal algorithm.

Assume that $\cdots, X_{-1}, X_0, X_1, X_2, \cdots$ is stationary and ergodic with entropy $H_\infty(X)$. Let the random variable $M$ be the repetition time of the source block $X_1^N$.

Theorem (Wyner & Ziv, 1989)

Fix an $\epsilon > 0$. Then

$$\lim_{N \to \infty} \Pr\left\{ M \geq 2^{N(H_\infty(X) + \epsilon)} \right\} = 0.$$

This result implies that the buffer can be much smaller than $2^N - 1$ if the entropy is known to be smaller than 1. It was also crucial in proving that the LZ77 algorithm achieves entropy (Wyner & Ziv [1994]).

Elias [1987] studied interval and recency-rank coding methods (for symbols).

Hershkovitz and Ziv [1998] studied conditional repetition times.

When better than CTW? CTW incremental redundancy for an $N$-block is $\approx NK/(2B \ln 2)$ bits for $K$ parameters. This redundancy is larger than $\log_2(N + 1)$ for $K/2^N > 2\ln(N + 1)/N$ (taking $B = 2^N$). For $N = 24$ we get $K/2^N > 0.2682$.
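The numerical threshold can be reproduced directly. This sketch assumes the reading $B = 2^N$ (a buffer of about $2^N$ symbols), under which $NK/(2B\ln 2) > \log_2(N+1)$ rearranges to $K/2^N > 2\ln(N+1)/N$; the function name is hypothetical.

```python
from math import log


def ctw_crossover_ratio(n_block):
    """Value of 2 ln(N + 1) / N that the ratio K / 2**N has to exceed."""
    return 2 * log(n_block + 1) / n_block


ratio = ctw_crossover_ratio(24)
print(round(ratio, 4))                     # threshold on K / 2**24
print(int(ratio * 2 ** 24))                # corresponding number of parameters K
```

For $N = 24$ this corresponds to roughly 4.5 million parameters.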

CONCLUSION

Recent developments:

DUDE (Weissman, Ordentlich, Seroussi, Verdú, Weinberger [2005]) resulted in the study of bi-directional contexts and splitting rules (Ordentlich, Weinberger, and Weissman [2005]).

Directed Mutual Information (Marko [1973], Massey [1990]):

$$I(X_1^N \to Y_1^N) = \sum_{n=1}^{N} \left[ H(Y_n \mid Y_1^{n-1}) - H(Y_n \mid Y_1^{n-1}, X_1^{n-1}, X_n) \right]$$

is a generalisation of Granger causality [1969]. CTW methods were used to estimate these quantities (Liao, Permuter, Kim, Zhao, Kim, and Weissman [2012]).

Questions:

LZ learns from seeing once. CTW is optimal for tree sources but seems to take more time. What are the algorithms between CTW and LZ?

Suppose that the data have left-right symmetry, hence $P(a, b) = P(b, a)$, $P(a, b, c) = P(c, b, a)$, $P(a, b, c, d) = P(d, c, b, a)$, etc. This reduces the number of parameters. Algorithm? Relevant for image compression.

CTW can handle side-information by considering it as context (e.g. Cai, Kulkarni and Verdú [2005]). But what if the side-information is not properly aligned? Relevant for reference-based genome compression (Chern et al. [2012]).

Source Coding is FUN!

(Ulm, 1997)