
Transcript of Plan of talk

Page 1: Plan of talk

Plan of talk

• Generative vs. non-generative modeling
• Boosting
• Weak-rule engineering: Alternating decision trees
• Boosting and over-fitting
• Applications
• Boosting, repeated games, and online learning
• Boosting and information geometry

Page 2: Plan of talk

Toy Example

• Computer receives telephone call
• Measures pitch of voice
• Decides gender of caller

[Figure: human voice pitch, with male and female ranges]

Page 3: Plan of talk

Generative modeling

[Figure: probability density vs. voice pitch; two fitted distributions with parameters mean1, var1 and mean2, var2]

Page 4: Plan of talk

Discriminative approach

[Figure: number of mistakes vs. voice pitch; the threshold is chosen to minimize mistakes directly]

[Vapnik 85]

Page 5: Plan of talk

Ill-behaved data

[Figure: an ill-behaved voice-pitch distribution; the probability fit with mean1, mean2 shown alongside the number-of-mistakes curve]

Page 6: Plan of talk

Traditional Statistics vs. Machine Learning

Traditional statistics: Data → [Statistics] → Estimated world state → [Decision theory] → Predictions / Actions

Machine learning: Data → Predictions / Actions

Page 7: Plan of talk

Comparison of methodologies

Model                 Generative              Discriminative
Goal                  Probability estimates   Classification rule
Performance measure   Likelihood              Misclassification rate
Mismatch problems     Outliers                Misclassifications


Page 8: Plan of talk

A weighted training set

Feature vectors x_i, binary labels y_i ∈ {−1,+1}, positive weights w_i:

(x_1, y_1, w_1), (x_2, y_2, w_2), …, (x_m, y_m, w_m)


Page 9: Plan of talk

A weak learner

A weak learner takes a weighted training set

(x_1, y_1, w_1), (x_2, y_2, w_2), …, (x_m, y_m, w_m)

and, given the instances x_1, x_2, …, x_m, outputs a weak rule h with predictions ŷ_1, ŷ_2, …, ŷ_m, where ŷ_i ∈ {−1,+1}.

The weak requirement:

( Σ_{i=1}^m y_i ŷ_i w_i ) / ( Σ_{i=1}^m w_i ) > γ > 0
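The weak requirement above can be checked directly. A minimal sketch in Python, assuming labels and predictions in {−1,+1} (the helper name is mine):

```python
def weighted_edge(y, y_hat, w):
    """Weighted correlation of predictions with labels; the weak
    requirement asks this to exceed some gamma > 0."""
    num = sum(wi * yi * yhi for yi, yhi, wi in zip(y, y_hat, w))
    return num / sum(w)

# a rule that is right on the two heavier examples has a positive edge
edge = weighted_edge([+1, -1, +1], [+1, -1, -1], [2.0, 2.0, 1.0])  # 3/5
```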

Page 10: Plan of talk

The boosting process

(x_1, y_1, 1), (x_2, y_2, 1), …, (x_n, y_n, 1) → weak learner → h_1

(x_1, y_1, w_1^1), (x_2, y_2, w_2^1), …, (x_n, y_n, w_n^1) → weak learner → h_2

(x_1, y_1, w_1^2), (x_2, y_2, w_2^2), …, (x_n, y_n, w_n^2) → weak learner → h_3

...

(x_1, y_1, w_1^{T−1}), (x_2, y_2, w_2^{T−1}), …, (x_n, y_n, w_n^{T−1}) → weak learner → h_T

Page 11: Plan of talk

Adaboost [Freund & Schapire 1997]

α_t = (1/2) ln( ( Σ_{i: h_t(x_i)=1, y_i=+1} w_i^t ) / ( Σ_{i: h_t(x_i)=1, y_i=−1} w_i^t ) )

Page 12: Plan of talk

Main property of Adaboost

If the advantages of the weak rules over random guessing are γ_1, γ_2, …, γ_T, then the training error of the final rule is at most

ε̂(f_T) ≤ exp( − Σ_{t=1}^T γ_t² )
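The bound is easy to evaluate numerically; a small sketch (the function name is mine):

```python
import math

def training_error_bound(gammas):
    """Bound from the slide: exp(-sum of squared edges gamma_t)."""
    return math.exp(-sum(g * g for g in gammas))

# 500 rounds of weak rules, each with edge 0.1, drive the bound to exp(-5)
b = training_error_bound([0.1] * 500)
```

So even tiny edges suffice, at the cost of more rounds: roughly ln(1/ε)/γ² rounds for training error ε.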

Page 13: Plan of talk

Boosting block diagram

[Diagram: the Booster sends example weights to the Weak Learner, which returns a weak rule; iterating this loop turns the pair into a Strong Learner producing an accurate rule]
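The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration under my own naming, not the tuned implementation from the talk; `weak_learn` stands for any routine satisfying the weak requirement:

```python
import math

def adaboost(X, y, weak_learn, T):
    """Minimal AdaBoost loop: call the weak learner on the current
    example weights, score the returned rule, reweight, repeat.
    weak_learn(X, y, w) must return h with h(x) in {-1, +1}."""
    m = len(X)
    w = [1.0 / m] * m
    rules = []
    for _ in range(T):
        h = weak_learn(X, y, w)
        err = sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)
        if err >= 0.5:          # no longer a weak rule; stop
            break
        err = max(err, 1e-10)   # guard against a perfect rule
        alpha = 0.5 * math.log((1.0 - err) / err)
        rules.append((alpha, h))
        # increase weight on mistakes, decrease it on correct examples
        w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in rules) >= 0 else -1
```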

Page 14: Plan of talk

Margins view

Prediction = sign(w · x)

[Figure: examples projected onto the weight vector w; margins of correct predictions are positive, mistakes negative; shown as a cumulative count of examples by margin]

Margin = y (w · x) / ( ‖w‖_p ‖x‖_q )

Page 15: Plan of talk

SVM vs Adaboost

Margin = y (w · x) / ( ‖w‖_p ‖x‖_q )

Algorithm   Norms            Optimization
SVM         p = q = 2        Quadratic optimization
Adaboost    p = 1, q = ∞     Coordinate-wise descent
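The margin formula with the (p, q) norm pair made explicit is short to write out; a small sketch (helper names are mine):

```python
def lp_norm(v, p):
    """l_p norm of a vector, with p = float('inf') for the max norm."""
    if p == float("inf"):
        return max(abs(vi) for vi in v)
    return sum(abs(vi) ** p for vi in v) ** (1.0 / p)

def margin(y, w, x, p, q):
    """Normalized margin y*(w.x) / (||w||_p * ||x||_q) for dual norms p, q."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return y * dot / (lp_norm(w, p) * lp_norm(x, q))

# SVM-style (p = q = 2) vs Adaboost-style (p = 1, q = inf) normalization
m2 = margin(1, [3.0, 4.0], [1.0, 0.0], 2, 2)             # 3/5
m1 = margin(1, [1.0, 1.0], [1.0, 1.0], 1, float("inf"))  # 2/2
```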

Page 16: Plan of talk

Minimizing error using loss functions on margin

[Figure: loss as a function of margin; mistakes at negative margin, correct predictions at positive margin. Curves shown: 0-1 loss, hinge loss, Adaboost's exponential loss, Logitboost (logistic regression), Brownboost]

Page 17: Plan of talk

What is a typical weak learner?

• Define a large set of features (x_1, x_2, …, x_n)
• Decision stumps:

  h(x) = +1 if x_i ≥ θ
         −1 otherwise

• Specialists (particularly for multi-class):

  h(x) = class j if x_i ≥ θ (or x_i < θ)
         0       otherwise
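Both weak-rule families are one-liners; a sketch, with x a feature tuple (names are mine):

```python
def stump(i, theta):
    """Decision stump: +1 if feature i is at least theta, else -1."""
    return lambda x: 1 if x[i] >= theta else -1

def specialist(i, theta, j):
    """Specialist: predicts class j when its condition fires, abstains (0) otherwise."""
    return lambda x: j if x[i] >= theta else 0

h = stump(0, 2.0)          # h((3.0,)) == 1, h((1.0,)) == -1
s = specialist(1, 0.5, 3)  # s((0.0, 1.0)) == 3, s((0.0, 0.0)) == 0
```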

Page 18: Plan of talk

Alternating Trees

Joint work with Llew Mason

Page 19: Plan of talk

Decision Trees

[Figure: tree with root test X>3 (no → −1; yes → test Y>5: no → −1, yes → +1), and the corresponding axis-aligned partition of the (X, Y) plane at X = 3 and Y = 5]

Page 20: Plan of talk

Decision tree as a sum

[Figure: the same tree rewritten as a sum of real-valued contributions: −0.2 at the root; the X>3 test contributes −0.1 (no) / +0.1 (yes); the Y>5 test contributes −0.3 (no) / +0.2 (yes); the prediction is the sign of the total]

Page 21: Plan of talk

An alternating decision tree

[Figure: an alternating decision tree: root score −0.2; predictor X>3 with scores −0.1 (no) / +0.1 (yes); predictor Y>5 with scores −0.3 (no) / +0.2 (yes); an additional predictor Y<1 with scores 0.0 (no) / +0.7 (yes); the prediction is the sign of the sum of scores along all active paths]

Page 22: Plan of talk

Example: Medical Diagnostics

• Cleve dataset from the UC Irvine database.

• Heart-disease diagnostics (+1 = healthy, −1 = sick).

• 13 features from tests (real-valued and discrete).

• 303 instances.

Page 23: Plan of talk

ADtree for the Cleveland heart-disease diagnostics problem

Page 24: Plan of talk

Cross-validated accuracy

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boost Stumps         16                 16.5%                0.8%

Page 25: Plan of talk

Boosting and over-fitting

Page 26: Plan of talk

Curious phenomenon: boosting decision trees

Using <10,000 training examples we fit >2,000,000 parameters

Page 27: Plan of talk

Explanation using margins

[Figure: 0-1 loss as a function of margin]

Page 28: Plan of talk

Explanation using margins

[Figure: 0-1 loss as a function of margin]

No examples with small margins!!

Page 29: Plan of talk

Experimental Evidence

Page 30: Plan of talk

Theorem [Schapire, Freund, Bartlett & Lee, Annals of Statistics '98]

For any convex combination of weak rules and any margin threshold θ > 0:

Pr[mistake] ≤ (fraction of training examples with margin ≤ θ) + Õ( √( d / (m θ²) ) )

where m is the size of the training sample and d is the VC dimension of the weak rules.

No dependence on number of weak rules that are combined!!!

Page 31: Plan of talk

Suggested optimization problem

[Figure: margin-based optimization objective]

Page 32: Plan of talk

Idea of Proof

Page 33: Plan of talk

Applications

Page 34: Plan of talk

Application: Detecting Faces [Viola & Jones]

• problem: find faces in photograph or movie

• weak classifiers: detect light/dark rectangles in image

• many clever tricks to make extremely fast and accurate

Page 35: Plan of talk

Viola and Jones (~1996)

Page 36: Plan of talk

Fundamental Perspectives

• game theory

• loss minimization

• an information-geometric view

Page 37: Plan of talk

Just a Game [with Freund]

• can view boosting as a game, a formal interaction between booster and weak learner

• on each round t:
  • booster chooses distribution Dt
  • weak learner responds with weak classifier ht

• game theory: studies interactions between all sorts of “players”

Page 38: Plan of talk

Games

• game defined by matrix M:

           Rock   Paper   Scissors
Rock       1/2    1       0
Paper      0      1/2     1
Scissors   1      0       1/2

• row player (“Mindy”) chooses row i

• column player (“Max”) chooses column j (simultaneously)

• Mindy’s goal: minimize her loss M(i , j)

• assume (wlog) all entries in [0, 1]

Page 39: Plan of talk

Randomized Play

• usually allow randomized play:
  • Mindy chooses distribution P over rows
  • Max chooses distribution Q over columns (simultaneously)

• Mindy’s (expected) loss

  = Σ_{i,j} P(i) M(i, j) Q(j) = P⊤MQ ≡ M(P,Q)

• i , j = “pure” strategies

• P, Q = “mixed” strategies

• m = # rows of M

• also write M(i, Q) and M(P, j) when one side plays pure and the other plays mixed

Page 40: Plan of talk

Sequential Play

• say Mindy plays before Max

• if Mindy chooses P then Max will pick Q to maximize M(P,Q) ⇒ loss will be

  L(P) ≡ max_Q M(P,Q)

• so Mindy should pick P to minimize L(P) ⇒ loss will be

  min_P L(P) = min_P max_Q M(P,Q)

• similarly, if Max plays first, loss will be

  max_Q min_P M(P,Q)

Page 41: Plan of talk

Minmax Theorem

• playing second (with knowledge of the other player’s move) cannot be worse than playing first, so:

  min_P max_Q M(P,Q)  ≥  max_Q min_P M(P,Q)
  (Mindy plays first)     (Mindy plays second)

• von Neumann’s minmax theorem:

  min_P max_Q M(P,Q) = max_Q min_P M(P,Q)

• in words: no advantage to playing second

Page 42: Plan of talk

Optimal Play

• minmax theorem:

  min_P max_Q M(P,Q) = max_Q min_P M(P,Q) = value v of the game

• optimal strategies:
  • P∗ = arg min_P max_Q M(P,Q) = minmax strategy
  • Q∗ = arg max_Q min_P M(P,Q) = maxmin strategy

• in words:
  • Mindy’s minmax strategy P∗ guarantees loss ≤ v (regardless of Max’s play)
  • optimal because Max has a maxmin strategy Q∗ that can force loss ≥ v (regardless of Mindy’s play)

• e.g.: in RPS, P∗ = Q∗ = uniform

• solving game = finding minmax/maxmin strategies

Page 43: Plan of talk

Weaknesses of Classical Theory

• seems to fully answer how to play games — just compute minmax strategy (e.g., using linear programming)

• weaknesses:
  • game M may be unknown
  • game M may be extremely large
  • opponent may not be fully adversarial

• may be possible to do better than value v
• e.g.:
  Lisa (thinks): Poor predictable Bart, always takes Rock.
  Bart (thinks): Good old Rock, nothing beats that.

Page 44: Plan of talk

Repeated Play

• if only playing once, hopeless to overcome ignorance of game M or opponent

• but if game played repeatedly, may be possible to learn to play well

• goal: play (almost) as well as if knew game and how opponent would play ahead of time

Page 45: Plan of talk

Repeated Play (cont.)

• M unknown

• for t = 1, …, T:
  • Mindy chooses P_t
  • Max chooses Q_t (possibly depending on P_t)
  • Mindy’s loss = M(P_t, Q_t)
  • Mindy observes loss M(i, Q_t) of each pure strategy i

• want:

  (1/T) Σ_{t=1}^T M(P_t, Q_t)  ≤  min_P (1/T) Σ_{t=1}^T M(P, Q_t)  +  [“small amount”]
     actual average loss              best loss (in hindsight)

Page 46: Plan of talk

Multiplicative-weights Algorithm (MW) [with Freund]

• choose η > 0

• initialize: P1 = uniform

• on round t:

  P_{t+1}(i) = P_t(i) exp( −η M(i, Q_t) ) / normalization

• idea: decrease weight of strategies suffering the most loss

• directly generalizes [Littlestone & Warmuth]

• other algorithms:
  • [Hannan ’57]
  • [Blackwell ’56]
  • [Foster & Vohra]
  • [Fudenberg & Levine]
  • ...
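One round of MW is a single reweighting step; a minimal sketch (the function name is mine, and losses[i] stands for M(i, Q_t)):

```python
import math

def mw_update(P, losses, eta):
    """Multiplicative-weights step: shrink each strategy's weight
    exponentially in the loss it just suffered, then renormalize."""
    W = [p * math.exp(-eta * l) for p, l in zip(P, losses)]
    z = sum(W)
    return [w / z for w in W]

# the lossless strategy gains probability mass
P2 = mw_update([0.5, 0.5], [1.0, 0.0], 1.0)
```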

Page 47: Plan of talk

Analysis

• Theorem: can choose η so that, for any game M with m rows, and any opponent,

  (1/T) Σ_{t=1}^T M(P_t, Q_t)  ≤  min_P (1/T) Σ_{t=1}^T M(P, Q_t)  +  ∆_T
     actual average loss             best average loss (≤ v)

  where ∆_T = O( √( (ln m) / T ) ) → 0

• regret ∆_T is:
  • logarithmic in # rows m
  • independent of # columns

• therefore, can use when working with very large games

Page 48: Plan of talk

Solving a Game [with Freund]

• suppose game M played repeatedly

• Mindy plays using MW
• on round t, Max chooses best response:

  Q_t = arg max_Q M(P_t, Q)

• let

  P̄ = (1/T) Σ_{t=1}^T P_t ,   Q̄ = (1/T) Σ_{t=1}^T Q_t

• can prove that P̄ and Q̄ are ∆_T-approximate minmax and maxmin strategies:

  max_Q M(P̄, Q) ≤ v + ∆_T   and   min_P M(P, Q̄) ≥ v − ∆_T
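This repeated-play recipe is short to code. A sketch, with the column player best-responding each round (parameter choices are mine); on the rock-paper-scissors matrix from earlier, the averaged row strategy comes out nearly unexploitable:

```python
import math

def solve_game(M, T=2000, eta=0.1):
    """Approximately solve a game (row player minimizes M[i][j]):
    the row player runs multiplicative weights, the column player
    best-responds; the averaged strategies are Delta_T-approximate
    minmax / maxmin strategies."""
    m, n = len(M), len(M[0])
    P = [1.0 / m] * m
    P_avg, Q_avg = [0.0] * m, [0.0] * n
    for _ in range(T):
        # Max's best response to P: the column maximizing expected loss
        j = max(range(n), key=lambda c: sum(P[i] * M[i][c] for i in range(m)))
        Q_avg[j] += 1.0 / T
        for i in range(m):
            P_avg[i] += P[i] / T
        # MW update of the row distribution against pure column j
        P = [P[i] * math.exp(-eta * M[i][j]) for i in range(m)]
        z = sum(P)
        P = [p / z for p in P]
    return P_avg, Q_avg

# rock-paper-scissors: value 1/2, optimal play uniform
RPS = [[0.5, 1.0, 0.0], [0.0, 0.5, 1.0], [1.0, 0.0, 0.5]]
P_bar, Q_bar = solve_game(RPS)
```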

Page 49: Plan of talk

Boosting as a Game

• Mindy (row player) ↔ booster

• Max (column player) ↔ weak learner

• matrix M:
  • row ↔ training example
  • column ↔ weak classifier
  • M(i, j) = 1 if the j-th weak classifier is correct on the i-th training example, 0 else
  • encodes which weak classifiers are correct on which examples
  • huge # of columns — one for every possible weak classifier

Page 50: Plan of talk

[Table: rows are examples x_1, x_2, …; columns are weak classifiers h_1, h_2, h_3, …; the entry for (x_i, h_j) is 1 if h_j(x_i) = y_i (correct), 0 if incorrect]

Page 51: Plan of talk

Hedge(η)

Outline

The Minimax theorem

Learning games

Page 52: Plan of talk


Matrix Games

 1   0   1   0   1
-1   0   0   1   1
 1   0  -1   1   0

• A game between the column player and the row player.
• The chosen entry defines the loss of the column player = the gain of the row player.
• If choices are made serially, the second player to choose has an advantage.

Page 53: Plan of talk


Mixed strategies

      q1   q2   q3   q4   q5
p1    1    0    1    0    1
p2    -1   0    0    1    1
p3    1    0    -1   1    0

• pure strategies: each player chooses a single action.
• mixed strategies: each player chooses a distribution over actions.
• Expected gain/loss: p M qᵀ

Page 54: Plan of talk


The Minimax theorem (John von Neumann, 1928)

max_p min_q p M qᵀ = min_q max_p p M qᵀ

• Unlike pure strategies, the order of choice of mixed strategies does not matter.
• Optimal mixed strategies: the strategies that achieve the minimax.
• Value of the game: the value of the minimax.
• Finding the minimax strategies when the matrix is known = linear programming.

Page 55: Plan of talk


A matrix corresponding to online learning

           t=1   2    3    4   ...
expert 1   1     0    1    0   ...
expert 2   -1    0    0    1   ...
expert 3   1     0    -1   1   ...

• The columns are revealed one at a time.
• Using Hedge or NormalHedge the row player chooses a mixed strategy over the rows that is almost as good as the best single row in hindsight.
• The best single row in hindsight is at least as good as any mixed strategy in hindsight.

Page 56: Plan of talk


A matrix corresponding to online learning

           t=1   2    3    4   ...
expert 1   1     0    1    0   ...
expert 2   -1    0    0    1   ...
expert 3   1     0    -1   1   ...

• If the adversary plays optimally, then the row distribution converges to a minimax optimal mixed strategy.
• But the adversary might not play optimally - minimizing regret is a stronger criterion than converging to a minimax optimal mixed strategy.

Page 57: Plan of talk


A matrix corresponding to boosting

              ex. 1   ex. 2   ex. 3   ex. 4   ...
base rule 1   1       0       1       0       ...
base rule 2   0       0       0       1       ...
base rule 3   1       0       0       1       ...

• 0 = mistake, 1 = correct.
• A weak learning algorithm can find a base rule whose weighted error is smaller than 1/2 − γ for any distribution over the examples.
• There is a distribution over the base rules such that for any example the expected error is smaller than 1/2 − γ.
• Implies that the majority vote with respect to this distribution over base rules is correct on all examples.
• Moreover, the weight of the majority is at least 1/2 + γ, and the minority is at most 1/2 − γ.

Page 58: Plan of talk

Boosting and the Minmax Theorem

• γ-weak learning assumption:
  • for every distribution on examples
  • can find weak classifier with weighted error ≤ 1/2 − γ

• equivalent to:

  (value of game M) ≥ 1/2 + γ

• by minmax theorem, implies that:
  • ∃ some weighted majority classifier that correctly classifies all training examples with margin ≥ 2γ
  • further, weights are given by maxmin strategy of game M

Page 59: Plan of talk

Idea for Boosting

• maxmin strategy of M has perfect (training) accuracy and large margins

• find approximately using earlier algorithm for solving a game
  • i.e., apply MW to M

• yields (variant of) AdaBoost

Page 60: Plan of talk

AdaBoost and Game Theory

• summarizing:
  • weak learning assumption implies maxmin strategy for M defines large-margin classifier
  • AdaBoost finds maxmin strategy by applying general algorithm for solving games through repeated play

• consequences:
  • weights on weak classifiers converge to (approximately) maxmin strategy for game M
  • (average) of distributions Dt converges to (approximately) minmax strategy
  • margins and edges connected via minmax theorem
  • explains why AdaBoost maximizes margins

• different instantiation of game-playing algorithm gives online learning algorithms (such as weighted majority algorithm)

Page 61: Plan of talk

Fundamental Perspectives

• game theory

• loss minimization

• an information-geometric view

Page 62: Plan of talk

A Dual Information-Geometric Perspective

• loss minimization focuses on the function computed by AdaBoost (i.e., weights on weak classifiers)

• dual view: instead focus on the distributions Dt (i.e., weights on examples)

• dual perspective combines geometry and information theory

• exposes underlying mathematical structure

• basis for proving convergence

Page 63: Plan of talk

An Iterative-Projection Algorithm

• say want to find point closest to x_0 in set P = { intersection of N hyperplanes }

• algorithm: [Bregman; Censor & Zenios]
  • start at x_0
  • repeat: pick a hyperplane and project onto it

[Figure: successive projections of x_0 onto the hyperplanes, converging into P]

• if P ≠ ∅, under general conditions, will converge correctly

Page 64: Plan of talk

AdaBoost is an Iterative-Projection Algorithm [Kivinen & Warmuth]

• points = distributions Dt over training examples

• distance = relative entropy:

  RE(P ‖ Q) = Σ_i P(i) ln( P(i) / Q(i) )

• reference point x_0 = uniform distribution

• hyperplanes defined by all possible weak classifiers g_j:

  Σ_i D(i) y_i g_j(x_i) = 0  ⇔  Pr_{i∼D}[ g_j(x_i) = y_i ] = 1/2

• intuition: looking for “hardest” distribution

Page 65: Plan of talk

AdaBoost as Iterative Projection (cont.)

• algorithm:
  • start at D_1 = uniform
  • for t = 1, 2, …:
    • pick hyperplane/weak classifier h_t ↔ g_j
    • D_{t+1} = (entropy) projection of D_t onto the hyperplane
              = arg min_{D : Σ_i D(i) y_i g_j(x_i) = 0} RE(D ‖ D_t)

• claim: equivalent to AdaBoost

• further: choosing h_t with minimum error ≡ choosing the farthest hyperplane

Page 66: Plan of talk

Boosting as Maximum Entropy

• corresponding optimization problem:

  min_{D∈P} RE(D ‖ uniform)  ↔  max_{D∈P} entropy(D)

• where

  P = feasible set = { D : Σ_i D(i) y_i g_j(x_i) = 0  ∀j }

• P ≠ ∅ ⇔ weak learning assumption does not hold
  • in this case, D_t → (unique) solution

• if weak learning assumption does hold then
  • P = ∅
  • D_t can never converge
  • dynamics are fascinating but unclear in this case

Page 67: Plan of talk

Reformulating AdaBoost as Iterative Projection

• points = nonnegative vectors d_t

• distance = unnormalized relative entropy:

  RE(p ‖ q) = Σ_i [ p(i) ln( p(i) / q(i) ) + q(i) − p(i) ]

• reference point x_0 = 1 (the all-1’s vector)

• hyperplanes defined by weak classifiers g_j:

  Σ_i d(i) y_i g_j(x_i) = 0

• resulting iterative-projection algorithm is again equivalent to AdaBoost

Page 68: Plan of talk

Reformulated Optimization Problem

• optimization problem:

  min_{d∈P} RE(d ‖ 1)

• where

  P = { d : Σ_i d(i) y_i g_j(x_i) = 0  ∀j }

• note: feasible set P never empty (since 0 ∈ P)

Page 69: Plan of talk

Exponential Loss as Entropy Optimization

• all vectors d_t created by AdaBoost have the form:

  d(i) = exp( −y_i Σ_j λ_j g_j(x_i) )

• let Q = { all vectors d of this form }

• can rewrite the exponential loss:

  inf_λ Σ_i exp( −y_i Σ_j λ_j g_j(x_i) )  =  inf_{d∈Q} Σ_i d(i)
                                          =  min_{d∈Q̄} Σ_i d(i)
                                          =  min_{d∈Q̄} RE(0 ‖ d)

• Q̄ = closure of Q

Page 70: Plan of talk

Conclusions

• from different perspectives, AdaBoost can be interpreted as:
  • a method for boosting the accuracy of a weak learner
  • a procedure for maximizing margins
  • an algorithm for playing repeated games
  • a numerical method for minimizing exponential loss
  • an iterative-projection algorithm based on an information-theoretic geometry

• none is entirely satisfactory by itself, but each useful in its own way

• taken together, create rich theoretical understanding
  • connect boosting to other learning problems and techniques
  • provide foundation for versatile set of methods with many extensions, variations and applications

Page 71: Plan of talk

References

Coming soon:

• Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.

Survey articles:

• Ron Meir and Gunnar Rätsch. An Introduction to Boosting and Leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003. http://www.boosting.org/papers/MeiRae03.pdf

• Robert E. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002. http://www.cs.princeton.edu/∼schapire/boost.html