Plan of talk


• Generative vs. non-generative modeling
• Boosting
• Weak-rule engineering: alternating decision trees
• Boosting and over-fitting
• Applications
• Boosting, repeated games, and online learning
• Boosting and information geometry


Toy Example

• Computer receives telephone call
• Measures pitch of voice
• Decides gender of caller

[Figure: distribution of human voice pitch, split into male and female callers]

Generative modeling

[Figure: two probability densities fitted to voice pitch, with parameters (mean1, var1) and (mean2, var2); axes: probability vs. voice pitch]

Discriminative approach

[Figure: a single threshold on voice pitch chosen to minimize the number of mistakes; axes: no. of mistakes vs. voice pitch]

[Vapnik 85]

Ill-behaved data

[Figure: ill-behaved pitch data: the fitted parameters (mean1, mean2) give a poor generative classifier, while directly minimizing the number of mistakes still finds a good threshold]

Traditional Statistics vs. Machine Learning

[Diagram: the traditional pipeline: Data → (Statistics) → Estimated world state → (Decision Theory) → Predictions / Actions. Machine learning maps data directly to predictions and actions.]

Comparison of methodologies:

Model                 Generative              Discriminative
Goal                  Probability estimates   Classification rule
Performance measure   Likelihood              Misclassification rate
Mismatch problems     Outliers                Misclassifications

A weighted training set

• Feature vectors $x_i$
• Binary labels $y_i \in \{-1,+1\}$
• Positive weights $w_i$

$$(x_1, y_1, w_1),\ (x_2, y_2, w_2),\ \ldots,\ (x_m, y_m, w_m)$$

A weak learner

A weak learner takes a weighted training set
$$(x_1, y_1, w_1),\ (x_2, y_2, w_2),\ \ldots,\ (x_m, y_m, w_m)$$
with instances $x_1, x_2, \ldots, x_m$, and returns a weak rule $h$ with predictions $\hat y_1, \hat y_2, \ldots, \hat y_m$, where $\hat y_i \in \{-1,+1\}$.

The weak requirement:

$$\frac{\sum_{i=1}^{m} y_i\, \hat y_i\, w_i}{\sum_{i=1}^{m} w_i} \;>\; \gamma \;>\; 0$$

i.e., the weighted correlation between predictions and labels exceeds γ: the weak rule must do noticeably better than random guessing.

The boosting process

Round 1: the weak learner receives $(x_1,y_1,1),\ (x_2,y_2,1),\ \ldots,\ (x_n,y_n,1)$ and returns $h_1$
Round 2: receives $(x_1,y_1,w_1^{1}),\ (x_2,y_2,w_2^{1}),\ \ldots,\ (x_n,y_n,w_n^{1})$ and returns $h_2$
Round 3: receives $(x_1,y_1,w_1^{2}),\ (x_2,y_2,w_2^{2}),\ \ldots,\ (x_n,y_n,w_n^{2})$ and returns $h_3$
...
Round T: receives $(x_1,y_1,w_1^{T-1}),\ (x_2,y_2,w_2^{T-1}),\ \ldots,\ (x_n,y_n,w_n^{T-1})$ and returns $h_T$

The weight of rule $h_t$:

$$\alpha_t \;=\; \frac{1}{2}\,\ln\!\left(\frac{\sum_{i:\,h_t(x_i)=1,\ y_i=1} w_i^t}{\sum_{i:\,h_t(x_i)=1,\ y_i=-1} w_i^t}\right)$$

AdaBoost [Freund & Schapire, 1997]
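To make the loop concrete, here is a minimal AdaBoost sketch in Python/NumPy. It is an illustration, not the original code: it uses the standard rule weight αt = ½ ln((1 − εt)/εt) for rules that predict on every example (the formula above is the variant for rules that fire on only part of the data), and the `weak_learner` interface is an assumed one.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Minimal AdaBoost sketch. Labels y are in {-1, +1}.

    `weak_learner(X, y, w)` is assumed (for illustration) to return a
    hypothesis h with h(X) in {-1, +1}, trained on the weighted sample.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                    # round-1 weights: uniform
    rules, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, w)
        pred = h(X)
        eps = w[pred != y].sum()               # weighted error of h_t
        alpha = 0.5 * np.log((1 - eps) / eps)  # rule weight alpha_t
        w = w * np.exp(-alpha * y * pred)      # up-weight the mistakes
        w /= w.sum()                           # renormalize
        rules.append(h)
        alphas.append(alpha)

    def f(X):                                  # final weighted-majority rule
        return np.sign(sum(a * h(X) for a, h in zip(alphas, rules)))
    return f
```

Any weak learner satisfying the weak requirement can be plugged in; a stump learner is sketched later in the talk.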

Main property of AdaBoost

If the advantages of the weak rules over random guessing are γ1, γ2, …, γT, then the training error of the final rule is at most

$$\hat\varepsilon(f_T) \;\le\; \exp\!\left(-\sum_{t=1}^{T} \gamma_t^2\right)$$
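As an illustration: if every weak rule has advantage γt = 0.1, the bound after T = 500 rounds is exp(−500 · 0.01) = e⁻⁵ ≈ 0.7%, so even a small constant advantage drives the training error down exponentially fast.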

Boosting block diagram

[Diagram: the Booster sends example weights to the Weak Learner; the Weak Learner returns a weak rule to the Booster; together they act as a Strong Learner producing an accurate rule.]

Margins view

Prediction = $\mathrm{sign}(w \cdot x)$

$$\text{Margin} \;=\; \frac{y\,(w \cdot x)}{\|w\|_p\,\|x\|_q}$$

[Figure: labeled examples (+/−) and a weight vector w; projecting each example onto w, mistakes get negative margin and correct predictions positive margin. A plot of cumulative # examples vs. margin shows mistakes to the left of zero and correct predictions to the right.]

SVM vs. AdaBoost

$$\text{Margin} \;=\; \frac{y\,(w \cdot x)}{\|w\|_p\,\|x\|_q}$$

Norms and algorithms:

• SVM: p = q = 2; quadratic optimization
• AdaBoost: p = 1, q = ∞; coordinate-wise descent

Minimizing error using loss functions on the margin

[Figure: loss as a function of margin. The 0-1 loss steps from 1 (mistakes) to 0 (correct). AdaBoost's exponential loss, the hinge loss, LogitBoost's logistic loss, and BrownBoost's loss are smooth surrogates upper-bounding the 0-1 loss.]

What is a typical weak learner?

• Define a large set of features $(x_1, x_2, \ldots, x_n)$
• Decision stumps:

$$h(x) = \begin{cases} +1 & \text{if } x_i \ge \theta \\ -1 & \text{otherwise} \end{cases}$$

• Specialists (particularly for multi-class):

$$h(x) = \begin{cases} \text{class } j & \text{if } x_i \ge \theta \ \ (\text{or } x_i < \theta) \\ 0 & \text{otherwise} \end{cases}$$
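A brute-force search for the best stump, as a sketch (an assumed implementation, not the talk's code); it can serve as the `weak_learner` in the AdaBoost sketch above.

```python
import numpy as np

def best_stump(X, y, w):
    """Return the decision stump with smallest weighted error:
    h(x) = sign if x[i] >= theta else -sign. Brute-force sketch; real
    implementations sort each feature once instead of re-scanning."""
    best_err, best_params = np.inf, None
    for i in range(X.shape[1]):                  # each feature
        for theta in np.unique(X[:, i]):         # each candidate threshold
            for sign in (+1, -1):                # each polarity
                pred = np.where(X[:, i] >= theta, sign, -sign)
                err = w[pred != y].sum()         # weighted error
                if err < best_err:
                    best_err, best_params = err, (i, theta, sign)
    i, theta, sign = best_params
    return lambda Z: np.where(Z[:, i] >= theta, sign, -sign)
```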

Alternating Trees

Joint work with Llew Mason

Decision Trees

[Figure: a decision tree. Root test X>3: "no" leads to leaf −1; "yes" leads to test Y>5, whose "no" branch is leaf +1 and "yes" branch is leaf −1. Alongside, the induced partition of the (X,Y) plane at X = 3 and Y = 5 into regions labeled +1, −1, −1.]

Decision tree as a sum

[Figure: the same tree written as a sum of scores: a root score of −0.2; the test X>3 contributes −0.1 (no) or +0.1 (yes); on the "yes" branch, the test Y>5 contributes +0.2 (no) or −0.3 (yes). The prediction is the sign of the total score, reproducing the leaves +1, −1, −1.]

An alternating decision tree

[Figure: an alternating decision tree. Root score −0.2; splitter X>3 with prediction nodes −0.1 (no) and +0.1 (yes); under its "yes" branch, splitter Y>5 with prediction nodes +0.2 (no) and −0.3 (yes); plus an additional splitter Y<1 with prediction nodes +0.7 (yes) and 0.0 (no). The output is the sign of the sum of all prediction nodes reached, splitting the plane into regions +1, −1, −1, +1.]
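To make the semantics concrete, here is a minimal sketch of evaluating an alternating decision tree. The rule encoding (precondition, condition, scores) is hypothetical, and the root-level placement of the Y<1 splitter is an assumption for illustration.

```python
def adt_predict(x, root_score, rules):
    """Evaluate an alternating decision tree.

    `rules` is a list of (precondition, condition, score_yes, score_no)
    tuples; a splitter contributes a score only when its precondition
    (the conjunction of tests on the path above it) holds.
    """
    total = root_score
    for pre, cond, score_yes, score_no in rules:
        if pre(x):
            total += score_yes if cond(x) else score_no
    return +1 if total >= 0 else -1

# the tree above, with x = (X, Y); the Y<1 placement is assumed:
rules = [
    (lambda x: True,     lambda x: x[0] > 3, +0.1, -0.1),  # X>3
    (lambda x: x[0] > 3, lambda x: x[1] > 5, -0.3, +0.2),  # Y>5, under X>3 "yes"
    (lambda x: True,     lambda x: x[1] < 1, +0.7,  0.0),  # Y<1
]
print(adt_predict((4, 0), -0.2, rules))  # -0.2 +0.1 +0.2 +0.7 = +0.8 -> +1
```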

Example: Medical Diagnostics

• Cleve dataset from the UC Irvine database.
• Heart disease diagnostics (+1 = healthy, −1 = sick).
• 13 features from tests (real-valued and discrete).
• 303 instances.

[Figure: ADtree learned for the Cleveland heart-disease diagnostics problem]

Cross-validated accuracy:

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boosted stumps       16                 16.5%                0.8%

Boosting and over-fitting

Curious phenomenon: boosting decision trees.

Using <10,000 training examples we fit >2,000,000 parameters

Explanation using margins

[Figure: 0-1 loss as a function of the margin, overlaid with the margin distribution of the training examples; as boosting proceeds, the distribution moves away from zero.]

No examples with small margins!!

Experimental Evidence

Theorem: For any convex combination $f$ of weak rules and any threshold $\theta > 0$,

$$\Pr\big[\text{mistake}\big] \;\le\; \Pr_{\text{train}}\big[\text{margin} \le \theta\big] \;+\; \tilde O\!\left(\sqrt{\frac{d}{m\,\theta^2}}\right)$$

where $m$ is the size of the training sample and $d$ is the VC dimension of the weak rules.

No dependence on the number of weak rules that are combined!!!

[Schapire, Freund, Bartlett & Lee, Annals of Statistics '98]

Suggested optimization problem

[Figure: the suggested optimization problem on the margin distribution]

Idea of Proof

Applications

Application: Detecting Faces [Viola & Jones]

• problem: find faces in photograph or movie

• weak classifiers: detect light/dark rectangles in image

• many clever tricks to make extremely fast and accurate

Viola and Jones (2001)

Fundamental Perspectives

• game theory

• loss minimization

• an information-geometric view

Just a Game [with Freund]

• can view boosting as a game, a formal interaction between booster and weak learner

• on each round t:
  • booster chooses distribution Dt
  • weak learner responds with weak classifier ht

• game theory: studies interactions between all sorts of “players”

Games

• game defined by matrix M:

            Rock   Paper   Scissors
Rock        1/2    1       0
Paper       0      1/2     1
Scissors    1      0       1/2

• row player (“Mindy”) chooses row i

• column player (“Max”) chooses column j (simultaneously)

• Mindy’s goal: minimize her loss M(i , j)

• assume (wlog) all entries in [0, 1]

Randomized Play

• usually allow randomized play:
  • Mindy chooses distribution P over rows
  • Max chooses distribution Q over columns (simultaneously)

• Mindy's (expected) loss:

$$\sum_{i,j} P(i)\, M(i,j)\, Q(j) \;=\; P^\top M Q \;\equiv\; M(P,Q)$$

• i , j = “pure” strategies

• P, Q = “mixed” strategies

• m = # rows of M

• also write M(i, Q) and M(P, j) when one side plays pure and the other plays mixed

Sequential Play

• say Mindy plays before Max

• if Mindy chooses P then Max will pick Q to maximize M(P,Q) ⇒ loss will be

$$L(P) \equiv \max_Q M(P,Q)$$

• so Mindy should pick P to minimize L(P) ⇒ loss will be

$$\min_P L(P) \;=\; \min_P \max_Q M(P,Q)$$

• similarly, if Max plays first, loss will be

$$\max_Q \min_P M(P,Q)$$

Minmax Theorem

• playing second (with knowledge of the other player's move) cannot be worse than playing first, so:

$$\underbrace{\min_P \max_Q M(P,Q)}_{\text{Mindy plays first}} \;\ge\; \underbrace{\max_Q \min_P M(P,Q)}_{\text{Mindy plays second}}$$

• von Neumann's minmax theorem:

$$\min_P \max_Q M(P,Q) \;=\; \max_Q \min_P M(P,Q)$$

• in words: no advantage to playing second

Optimal Play

• minmax theorem:

$$\min_P \max_Q M(P,Q) \;=\; \max_Q \min_P M(P,Q) \;=\; \text{value } v \text{ of game}$$

• optimal strategies:
  • P∗ = arg min_P max_Q M(P,Q) = minmax strategy
  • Q∗ = arg max_Q min_P M(P,Q) = maxmin strategy

• in words:
  • Mindy's minmax strategy P∗ guarantees loss ≤ v (regardless of Max's play)
  • optimal because Max has maxmin strategy Q∗ that can force loss ≥ v (regardless of Mindy's play)

• e.g.: in RPS, P∗ = Q∗ = uniform

• solving game = finding minmax/maxmin strategies

Weaknesses of Classical Theory

• seems to fully answer how to play games — just compute minmax strategy (e.g., using linear programming)

• weaknesses:

  • game M may be unknown
  • game M may be extremely large
  • opponent may not be fully adversarial
• may be possible to do better than value v
• e.g.:

Lisa (thinks): Poor predictable Bart, always takes Rock.
Bart (thinks): Good old Rock, nothing beats that.

Repeated Play

• if only playing once, hopeless to overcome ignorance of game M or opponent
• but if game played repeatedly, may be possible to learn to play well
• goal: play (almost) as well as if knew game and how opponent would play ahead of time

Repeated Play (cont.)

• M unknown

• for t = 1, …, T:
  • Mindy chooses Pt
  • Max chooses Qt (possibly depending on Pt)
  • Mindy's loss = M(Pt, Qt)
  • Mindy observes loss M(i, Qt) of each pure strategy i

• want:

$$\underbrace{\frac{1}{T}\sum_{t=1}^{T} M(P_t, Q_t)}_{\text{actual average loss}} \;\le\; \underbrace{\min_P \frac{1}{T}\sum_{t=1}^{T} M(P, Q_t)}_{\text{best loss (in hindsight)}} \;+\; [\text{“small amount”}]$$

Multiplicative-weights Algorithm (MW) [with Freund]

• choose η > 0

• initialize: P1 = uniform

• on round t:

$$P_{t+1}(i) \;=\; \frac{P_t(i)\, \exp(-\eta\, M(i, Q_t))}{\text{normalization}}$$

• idea: decrease weight of strategies suffering the most loss

• directly generalizes [Littlestone & Warmuth]

• other algorithms:
  • [Hannan '57]
  • [Blackwell '56]
  • [Foster & Vohra]
  • [Fudenberg & Levine]
  ...
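A minimal sketch of the MW update in Python/NumPy (the loss-matrix interface and the replay of the opponent's columns are assumptions for illustration, not the paper's code):

```python
import numpy as np

def mw(M, columns, eta):
    """Minimal multiplicative-weights sketch. M is Mindy's loss matrix
    (rows = her pure strategies, entries in [0, 1]); `columns` is the
    sequence of pure columns j_1, ..., j_T that Max actually played.
    Returns the distributions P_1, ..., P_T that MW would choose."""
    m = M.shape[0]
    P = np.ones(m) / m                  # P_1 = uniform
    plays = []
    for j in columns:
        plays.append(P.copy())
        loss = M[:, j]                  # loss M(i, Q_t) of each pure strategy
        P = P * np.exp(-eta * loss)     # down-weight strategies that lost
        P = P / P.sum()                 # normalization
    return plays
```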

Analysis

• Theorem: can choose η so that, for any game M with m rows, and any opponent,

$$\underbrace{\frac{1}{T}\sum_{t=1}^{T} M(P_t, Q_t)}_{\text{actual average loss}} \;\le\; \underbrace{\min_P \frac{1}{T}\sum_{t=1}^{T} M(P, Q_t)}_{\text{best average loss } (\le v)} \;+\; \Delta_T$$

where

$$\Delta_T = O\!\left(\sqrt{\frac{\ln m}{T}}\right) \;\to\; 0$$

• regret ∆T is:
  • logarithmic in # rows m
  • independent of # columns

• therefore, can use when working with very large games
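As a rough illustration: for a game with m = 10⁶ rows played for T = 10⁴ rounds, the leading term √(ln m / T) = √(13.8 / 10⁴) ≈ 0.04, so the average loss is within about 0.04 of the best mixed strategy in hindsight.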

Solving a Game [with Freund]

• suppose game M played repeatedly

• Mindy plays using MW
• on round t, Max chooses best response:

$$Q_t = \arg\max_Q M(P_t, Q)$$

• let

$$\bar P = \frac{1}{T}\sum_{t=1}^{T} P_t, \qquad \bar Q = \frac{1}{T}\sum_{t=1}^{T} Q_t$$

• can prove that $\bar P$ and $\bar Q$ are ∆T-approximate minmax and maxmin strategies:

$$\max_Q M(\bar P, Q) \;\le\; v + \Delta_T \qquad\text{and}\qquad \min_P M(P, \bar Q) \;\ge\; v - \Delta_T$$
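A sketch of this recipe in Python/NumPy; the tuning η = √(ln m / T) is an assumption consistent with the regret bound above, not a prescription from the talk.

```python
import numpy as np

def solve_game(M, T=10_000):
    """Approximately solve a zero-sum game by repeated play:
    Mindy runs MW, Max best-responds each round; the average strategies
    are Delta_T-approximate minmax/maxmin strategies (a sketch)."""
    m, n = M.shape
    eta = np.sqrt(np.log(m) / T)        # assumed tuning
    P = np.ones(m) / m
    P_bar, Q_bar = np.zeros(m), np.zeros(n)
    for _ in range(T):
        j = np.argmax(P @ M)            # Max's best pure response to P_t
        P_bar += P
        Q_bar[j] += 1
        P = P * np.exp(-eta * M[:, j])  # MW update
        P /= P.sum()
    return P_bar / T, Q_bar / T

# on the RPS loss matrix, both averages approach the uniform strategy,
# matching P* = Q* = uniform above:
M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
P_bar, Q_bar = solve_game(M)
```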

Boosting as a Game

• Mindy (row player) ↔ booster

• Max (column player) ↔ weak learner

• matrix M:

  • row ↔ training example
  • column ↔ weak classifier

$$M(i,j) = \begin{cases} 1 & \text{if } j\text{-th weak classifier is correct on the } i\text{-th training example} \\ 0 & \text{else} \end{cases}$$

• encodes which weak classifiers are correct on which examples
• huge # of columns — one for every possible weak classifier

[Figure: the matrix M laid out with examples x_i as rows and weak classifiers h_j as columns; entry 1 where h_j(x_i) = y_i (correct), 0 where incorrect.]

Hedge(η)

Outline

• The Minimax theorem
• Learning games


The Minimax theorem

Matrix Games

     1   0   1   0   1
    -1   0   0   1   1
     1   0  -1   1   0

• A game between the column player and the row player.
• The chosen entry defines the loss of the column player = gain of the row player.
• If choices are made serially, the second player to choose has an advantage.


Mixed strategies

      q1   q2   q3   q4   q5
p1     1    0    1    0    1
p2    -1    0    0    1    1
p3     1    0   -1    1    0

• Pure strategies: each player chooses a single action.
• Mixed strategies: each player chooses a distribution over actions.
• Expected gain/loss: $\vec p\, M\, \vec q^{\,\top}$


The Minimax theorem [John von Neumann, 1928]

$$\max_{\vec p} \min_{\vec q}\; \vec p\, M\, \vec q^{\,\top} \;=\; \min_{\vec q} \max_{\vec p}\; \vec p\, M\, \vec q^{\,\top}$$

• Unlike pure strategies, the order of choice of mixed strategies does not matter.
• Optimal mixed strategies: the strategies that achieve the minimax.
• Value of the game: the value of the minimax.
• Finding the minimax strategies when the matrix is known = linear programming.


Learning games

A matrix corresponding to online learning

t =        1   2   3   4   ...
expert 1   1   0   1   0   ...
expert 2  -1   0   0   1   ...
expert 3   1   0  -1   1   ...

• The columns are revealed one at a time.
• Using Hedge or NormalHedge, the row player chooses a mixed strategy over the rows that is almost as good as the best single row in hindsight.
• The best single row in hindsight is at least as good as any mixed strategy in hindsight.


• If the adversary plays optimally, then the row distribution converges to a minimax optimal mixed strategy.
• But the adversary might not play optimally: minimizing regret is a stronger criterion than converging to a minimax optimal mixed strategy.


A matrix corresponding to boosting

              ex. 1  ex. 2  ex. 3  ex. 4  ...
base rule 1     1      0      1      0    ...
base rule 2     0      0      0      1    ...
base rule 3     1      0      0      1    ...

• 0 = mistake, 1 = correct.
• A weak learning algorithm can find a base rule whose weighted error is smaller than 1/2 − γ for any distribution over the examples.
• By the minimax theorem, there is a distribution over the base rules such that for any example the expected error is smaller than 1/2 − γ.
• This implies that the majority vote w.r.t. this distribution over base rules is correct on all examples.
• Moreover, the weight of the majority is at least 1/2 + γ and the minority is at most 1/2 − γ.

Boosting and the Minmax Theorem

• γ-weak learning assumption:
  • for every distribution on examples
  • can find weak classifier with weighted error ≤ 1/2 − γ
• equivalent to:

$$(\text{value of game } M) \;\ge\; \tfrac{1}{2} + \gamma$$

• by minmax theorem, implies that:
  • ∃ some weighted majority classifier that correctly classifies all training examples with margin ≥ 2γ
  • further, weights are given by maxmin strategy of game M

Idea for Boosting

• maxmin strategy of M has perfect (training) accuracy and large margins
• find approximately using earlier algorithm for solving a game
  • i.e., apply MW to M

• yields (variant of) AdaBoost

AdaBoost and Game Theory

• summarizing:

  • weak learning assumption implies maxmin strategy for M defines large-margin classifier
  • AdaBoost finds maxmin strategy by applying general algorithm for solving games through repeated play

• consequences:
  • weights on weak classifiers converge to (approximately) maxmin strategy for game M
  • (average of) distributions Dt converges to (approximately) minmax strategy
  • margins and edges connected via minmax theorem
  • explains why AdaBoost maximizes margins
• different instantiation of game-playing algorithm gives online learning algorithms (such as weighted majority algorithm)

Fundamental Perspectives

• game theory

• loss minimization

• an information-geometric view

A Dual Information-Geometric Perspective

• loss minimization focuses on function computed by AdaBoost (i.e., weights on weak classifiers)
• dual view: instead focus on distributions Dt (i.e., weights on examples)

• dual perspective combines geometry and information theory

• exposes underlying mathematical structure

• basis for proving convergence

An Iterative-Projection Algorithm

• say want to find point closest to x0 in set P = { intersection of N hyperplanes }

• algorithm: [Bregman; Censor & Zenios]
  • start at x0
  • repeat: pick a hyperplane and project onto it

[Figure: x0 projected onto one hyperplane after another, converging to the point of P closest to x0]

• if P ≠ ∅, under general conditions, will converge correctly

AdaBoost is an Iterative-Projection Algorithm [Kivinen & Warmuth]

• points = distributions Dt over training examples

• distance = relative entropy:

$$\mathrm{RE}(P \,\|\, Q) \;=\; \sum_i P(i)\, \ln\!\left(\frac{P(i)}{Q(i)}\right)$$

• reference point x0 = uniform distribution

• hyperplanes defined by all possible weak classifiers gj:

$$\sum_i D(i)\, y_i\, g_j(x_i) = 0 \;\Leftrightarrow\; \Pr_{i \sim D}\big[g_j(x_i) = y_i\big] = \tfrac{1}{2}$$

• intuition: looking for “hardest” distribution

AdaBoost as Iterative Projection (cont.)

• algorithm:
  • start at D1 = uniform
  • for t = 1, 2, …:
    • pick hyperplane/weak classifier ht ↔ gj
    • Dt+1 = (entropy) projection of Dt onto hyperplane:

$$D_{t+1} \;=\; \arg\min_{D:\; \sum_i D(i)\, y_i\, g_j(x_i) = 0} \mathrm{RE}(D \,\|\, D_t)$$

• claim: equivalent to AdaBoost

• further: choosing ht with minimum error ≡ choosing farthest hyperplane
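For a single constraint the entropy projection has a closed form: an exponential tilt of Dt, which is exactly AdaBoost's reweighting step. A minimal sketch, assuming u(i) = yi gj(xi) ∈ {−1, +1}:

```python
import numpy as np

def entropy_project(D, u):
    """Project distribution D onto the hyperplane sum_i D(i)*u(i) = 0,
    minimizing relative entropy, for binary u(i) = y_i * g_j(x_i).
    The projection is the exponential tilt D(i) * exp(-alpha * u(i)):
    after normalization the 'correct' and 'incorrect' halves carry
    equal weight, which is AdaBoost's update of D_t."""
    a = D[u == +1].sum()            # current weight where g_j is correct
    b = D[u == -1].sum()            # current weight where g_j errs
    alpha = 0.5 * np.log(a / b)
    D_new = D * np.exp(-alpha * u)  # both halves become sqrt(a*b)
    return D_new / D_new.sum()
```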

Boosting as Maximum Entropy

• corresponding optimization problem:

$$\min_{D \in P} \mathrm{RE}(D \,\|\, \text{uniform}) \;\leftrightarrow\; \max_{D \in P} \text{entropy}(D)$$

• where

$$P \;=\; \text{feasible set} \;=\; \Big\{\, D : \sum_i D(i)\, y_i\, g_j(x_i) = 0 \ \ \forall j \,\Big\}$$

• P ≠ ∅ ⇔ weak learning assumption does not hold
  • in this case, Dt → (unique) solution
• if weak learning assumption does hold then:
  • P = ∅
  • Dt can never converge
  • dynamics are fascinating but unclear in this case

Reformulating AdaBoost as Iterative Projection

• points = nonnegative vectors dt

• distance = unnormalized relative entropy:

$$\mathrm{RE}(p \,\|\, q) \;=\; \sum_i \left[\, p(i)\, \ln\!\left(\frac{p(i)}{q(i)}\right) + q(i) - p(i) \,\right]$$

• reference point x0 = 1 (all 1’s vector)

• hyperplanes defined by weak classifiers gj:

$$\sum_i d(i)\, y_i\, g_j(x_i) = 0$$

• resulting iterative-projection algorithm is again equivalent to AdaBoost

Reformulated Optimization Problem

• optimization problem:

$$\min_{d \in P} \mathrm{RE}(d \,\|\, \mathbf{1})$$

• where

$$P \;=\; \Big\{\, d : \sum_i d(i)\, y_i\, g_j(x_i) = 0 \ \ \forall j \,\Big\}$$

• note: feasible set P never empty (since 0 ∈ P)

Exponential Loss as Entropy Optimization

• all vectors dt created by AdaBoost have form:

$$d(i) = \exp\!\Big(-y_i \sum_j \lambda_j\, g_j(x_i)\Big)$$

• let Q = { all vectors d of this form }
• can rewrite exponential loss:

$$\inf_{\lambda} \sum_i \exp\!\Big(-y_i \sum_j \lambda_j\, g_j(x_i)\Big) \;=\; \inf_{d \in Q} \sum_i d(i) \;=\; \min_{d \in \bar Q} \sum_i d(i) \;=\; \min_{d \in \bar Q} \mathrm{RE}(\mathbf{0} \,\|\, d)$$

• $\bar Q$ = closure of Q

Conclusions

• from different perspectives, AdaBoost can be interpreted as:
  • a method for boosting the accuracy of a weak learner
  • a procedure for maximizing margins
  • an algorithm for playing repeated games
  • a numerical method for minimizing exponential loss
  • an iterative-projection algorithm based on an information-theoretic geometry

• none is entirely satisfactory by itself, but each is useful in its own way

• taken together, create rich theoretical understanding
  • connect boosting to other learning problems and techniques
  • provide foundation for versatile set of methods with many extensions, variations and applications

References

Coming soon:

• Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.

Survey articles:

• Ron Meir and Gunnar Ratsch. An Introduction to Boosting and Leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003. http://www.boosting.org/papers/MeiRae03.pdf

• Robert E. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002. http://www.cs.princeton.edu/~schapire/boost.html