Plan of talk
• Generative vs. non-generative modeling
• Boosting
• Weak-rule engineering: alternating decision trees
• Boosting and over-fitting
• Applications
• Boosting, repeated games, and online learning
• Boosting and information geometry
Toy Example
• Computer receives telephone call
• Measures pitch of voice
• Decides gender of caller
[Figure: distribution of human voice pitch, with male and female ranges marked]
Generative modeling
[Figure: probability density vs. voice pitch; one Gaussian fit per class, with parameters (mean1, var1) and (mean2, var2)]
Discriminative approach [Vapnik 85]
[Figure: number of mistakes as a function of the voice-pitch decision threshold]
Ill-behaved data
[Figure: for ill-behaved data, probability density vs. voice pitch with fitted means mean1, mean2, alongside number of mistakes vs. threshold]
Traditional Statistics vs. Machine Learning
[Diagram: Data → Estimated world state (Statistics) → Predictions/Actions (Decision Theory); Machine Learning maps Data directly to Predictions/Actions]

Comparison of methodologies

  Model                 Generative              Discriminative
  Goal                  Probability estimates   Classification rule
  Performance measure   Likelihood              Misclassification rate
  Mismatch problems     Outliers                Misclassifications
A weighted training set

  (x_1, y_1, w_1), (x_2, y_2, w_2), …, (x_m, y_m, w_m)

• Feature vectors x_i
• Binary labels y_i ∈ {−1, +1}
• Positive weights w_i
A weak learner

A weak learner receives a weighted training set
(x_1, y_1, w_1), (x_2, y_2, w_2), …, (x_m, y_m, w_m),
reads the instances x_1, x_2, …, x_m, and outputs a weak rule h with predictions ŷ_1, ŷ_2, …, ŷ_m; ŷ_i ∈ {−1, +1}.

The weak requirement:

  Σ_{i=1}^m y_i ŷ_i w_i / Σ_{i=1}^m w_i > γ > 0
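As a concrete illustration (a sketch added here, not part of the original slides), the weak requirement can be checked numerically; `edge` computes the weighted correlation between labels and predictions, both taken in {−1, +1}:

```python
import numpy as np

def edge(y, y_hat, w):
    """Weighted correlation between labels y and predictions y_hat
    (both in {-1, +1}) under positive weights w; the weak requirement
    asks that this exceed some gamma > 0."""
    y, y_hat, w = (np.asarray(a, dtype=float) for a in (y, y_hat, w))
    return np.dot(w, y * y_hat) / w.sum()

y     = np.array([+1, -1, +1, -1])
y_hat = np.array([+1, -1, -1, -1])       # wrong only on the third example
w     = np.array([2.0, 2.0, 1.0, 2.0])
print(edge(y, y_hat, w))                 # (2 + 2 - 1 + 2) / 7 ~ 0.714
```

A rule that is right on the heavily weighted examples has a large positive edge; random guessing has edge near zero.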
The boosting process

  (x_1, y_1, 1), …, (x_n, y_n, 1)                 → weak learner → h_1
  (x_1, y_1, w_1^1), …, (x_n, y_n, w_n^1)         → weak learner → h_2
  (x_1, y_1, w_1^2), …, (x_n, y_n, w_n^2)         → weak learner → h_3
    ⋮
  (x_1, y_1, w_1^{T−1}), …, (x_n, y_n, w_n^{T−1}) → weak learner → h_T

  α_t = (1/2) ln( Σ_{i: h_t(x_i)=1, y_i=+1} w_i^t / Σ_{i: h_t(x_i)=1, y_i=−1} w_i^t )

AdaBoost [Freund & Schapire 1997]
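To make the process above concrete, here is a minimal AdaBoost sketch (illustrative only; it uses the common weighted-error form α_t = ½ ln((1−ε_t)/ε_t) rather than the slide's formulation, and the demo `stump_learner` is an assumption added here):

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Minimal AdaBoost loop.  weak_learner(X, y, w) must return a
    function h with h(X) in {-1, +1}."""
    n = len(y)
    w = np.ones(n) / n                        # round 1: uniform weights
    rules, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)
        pred = h(X)
        eps = w[pred != y].sum()              # weighted error of the rule
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        w = w * np.exp(-alpha * y * pred)     # up-weight the mistakes
        w /= w.sum()
        rules.append(h)
        alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, rules)))

def stump_learner(X, y, w):
    """Demo weak learner: best threshold rule on the first feature."""
    x = X[:, 0]
    best = None
    for theta in np.concatenate(([x.min() - 1.0], x)) + 0.5:
        for s in (+1, -1):
            pred = s * np.where(x >= theta, 1, -1)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, theta, s)
    _, theta, s = best
    return lambda Z: s * np.where(Z[:, 0] >= theta, 1, -1)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
f = adaboost(X, y, stump_learner, T=5)
print(f(X))   # [-1. -1.  1.  1.]
```

The reweighting line is the heart of the process: examples the current rule gets wrong gain weight, so the next weak rule must concentrate on them.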
Main property of AdaBoost

If the advantages of the weak rules over random guessing are γ_1, γ_2, …, γ_T, then the training error of the final rule is at most

  ε̂(f_T) ≤ exp( − Σ_{t=1}^T γ_t^2 )
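A quick numeric illustration of the bound (not from the slides): even a small fixed advantage drives the bound down exponentially in the number of rounds.

```python
import math

# If every weak rule has the same advantage gamma = 0.1, the bound on the
# training error after T rounds is exp(-T * gamma^2): it falls exponentially.
gamma = 0.1
for T in (100, 500, 1000):
    print(T, math.exp(-T * gamma ** 2))
# 100 rounds  -> exp(-1)  ~ 0.37
# 1000 rounds -> exp(-10) ~ 4.5e-5
```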
Boosting block diagram
[Diagram: the Booster sends example weights to the Weak Learner, which returns a weak rule; the combination acts as a Strong Learner producing an accurate rule]
Margins view

Prediction = sign(w · x),   Margin = y(w · x) / (‖w‖_p ‖x‖_q)

[Figure: positive and negative examples projected onto the weight vector w; correct predictions have positive margin, mistakes negative. Below: cumulative number of examples plotted against margin.]
SVM vs AdaBoost

  Margin = y(w · x) / (‖w‖_p ‖x‖_q)

  Algorithm   Norms            Optimization
  SVM         p = q = 2        Quadratic optimization
  AdaBoost    p = 1, q = ∞     Coordinate-wise descent
Minimizing error using loss functions on margin
[Figure: loss as a function of margin for the 0-1 loss, the hinge loss, AdaBoost's exponential loss, Logitboost (logistic regression), and Brownboost; mistakes have negative margin, correct predictions positive]
What is a typical weak learner?
• Define a large set of features (x_1, x_2, …, x_n)
• Decision stumps:

  h(x) = +1 if x_i ≥ θ, −1 otherwise

• Specialists (particularly for multi-class):

  h(x) = class j if x_i ≥ θ (or x_i < θ), 0 otherwise
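A minimal sketch of these two weak-rule templates (illustrative; the names `stump` and `specialist` are introduced here):

```python
import numpy as np

def stump(i, theta):
    """Decision stump: h(x) = +1 if x[i] >= theta, else -1."""
    return lambda x: 1 if x[i] >= theta else -1

def specialist(i, theta, j):
    """Specialist for multi-class: predicts class j when its condition
    x[i] >= theta fires, and abstains (outputs 0) otherwise."""
    return lambda x: j if x[i] >= theta else 0

x = np.array([1.0, 4.2, -3.0])     # a feature vector (x1, x2, x3)
h1 = stump(1, 3.0)                 # tests the second feature against 3.0
h2 = specialist(2, 0.0, 5)         # predicts class 5 when x3 >= 0
print(h1(x), h2(x))                # 1 0
```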
Alternating Trees
Joint work with Llew Mason
Decision Trees
[Figure: a decision tree (root test X>3: no → −1; yes → test Y>5: no → −1, yes → +1) and the corresponding partition of the (X, Y) plane at X = 3, Y = 5]
Decision tree as a sum
[Figure: the same tree rewritten as a sum of simple rules: a constant −0.2, plus (X>3 ? +0.1 : −0.1), plus (Y>5 ? +0.2 : −0.3); the prediction is the sign of the sum, giving the same ±1 partition of the (X, Y) plane]
An alternating decision tree
[Figure: an alternating decision tree: root prediction −0.2; splitter X>3 with predictions −0.1 (no) / +0.1 (yes); splitter Y>5 with predictions −0.3 (no) / +0.2 (yes); and below one branch a further splitter Y<1 with predictions 0.0 (no) / +0.7 (yes). The sign of the summed predictions gives the ±1 partition of the (X, Y) plane]
Example: Medical Diagnostics
• Cleve dataset from the UC Irvine database
• Heart disease diagnostics (+1 = healthy, −1 = sick)
• 13 features from tests (real-valued and discrete)
• 303 instances
[Figure: ADtree for the Cleveland heart-disease diagnostics problem]
Cross-validated accuracy

  Learning algorithm   Number of splits   Average test error   Test error variance
  ADtree               6                  17.0%                0.6%
  C5.0                 27                 27.2%                0.5%
  C5.0 + boosting      446                20.2%                0.5%
  Boost Stumps         16                 16.5%                0.8%
Boosting and over-fitting

Curious phenomenon: boosting decision trees, using <10,000 training examples we fit >2,000,000 parameters

Explanation using margins
[Figure: 0-1 loss vs. margin, with the empirical margin distribution overlaid. There are no examples with small margins!]
Experimental Evidence
[Figure: margin distributions on training data as boosting proceeds]

Theorem [Schapire, Freund, Bartlett & Lee, Annals of Statistics '98]
For any convex combination and any threshold,

  probability of mistake ≤ fraction of training examples with small margin
                           + a term that grows with the VC dimension of the
                             weak rules and shrinks with the size of the
                             training sample

No dependence on the number of weak rules that are combined!
Suggested optimization problem
[Figure: margin-based objective suggested by the bound]

Idea of Proof
Applications
Application: Detecting Faces [Viola & Jones]
• problem: find faces in photograph or movie
• weak classifiers: detect light/dark rectangles in image
• many clever tricks to make extremely fast and accurate
Viola and Jones (2001)
Fundamental Perspectives
• game theory
• loss minimization
• an information-geometric view
Just a Game [with Freund]
• can view boosting as a game, a formal interaction between booster and weak learner
• on each round t:
  • booster chooses distribution D_t
  • weak learner responds with weak classifier h_t
• game theory: studies interactions between all sorts of "players"
Games
• game defined by matrix M:

              Rock   Paper   Scissors
  Rock        1/2    1       0
  Paper       0      1/2     1
  Scissors    1      0       1/2

• row player ("Mindy") chooses row i
• column player ("Max") chooses column j (simultaneously)
• Mindy's goal: minimize her loss M(i, j)
• assume (wlog) all entries in [0, 1]
Randomized Play
• usually allow randomized play:
  • Mindy chooses distribution P over rows
  • Max chooses distribution Q over columns (simultaneously)
• Mindy's (expected) loss

  = Σ_{i,j} P(i) M(i, j) Q(j) = P^T M Q ≡ M(P, Q)

• i, j = "pure" strategies
• P, Q = "mixed" strategies
• m = # rows of M
• also write M(i, Q) and M(P, j) when one side plays pure and other plays mixed
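The expected-loss formula can be checked directly (an illustrative sketch using the Rock-Paper-Scissors matrix from the earlier slide):

```python
import numpy as np

# Loss matrix for Rock-Paper-Scissors from Mindy's point of view:
# 1/2 for a draw, 1 when she loses, 0 when she wins (entries in [0, 1]).
M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])

P = np.array([1.0, 0.0, 0.0])   # pure strategy: Mindy always plays Rock
Q = np.ones(3) / 3              # mixed strategy: Max plays uniformly
print(P @ M @ Q)                # M(P, Q) = sum_ij P(i) M(i,j) Q(j) = 0.5
```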
Sequential Play
• say Mindy plays before Max
• if Mindy chooses P then Max will pick Q to maximize M(P, Q) ⇒ loss will be

  L(P) ≡ max_Q M(P, Q)

• so Mindy should pick P to minimize L(P) ⇒ loss will be

  min_P L(P) = min_P max_Q M(P, Q)

• similarly, if Max plays first, loss will be

  max_Q min_P M(P, Q)
Minmax Theorem
• playing second (with knowledge of other player's move) cannot be worse than playing first, so:

  min_P max_Q M(P, Q)   ≥   max_Q min_P M(P, Q)
  (Mindy plays first)       (Mindy plays second)

• von Neumann's minmax theorem:

  min_P max_Q M(P, Q) = max_Q min_P M(P, Q)

• in words: no advantage to playing second
Optimal Play
• minmax theorem:

  min_P max_Q M(P, Q) = max_Q min_P M(P, Q) = value v of game

• optimal strategies:
  • P* = arg min_P max_Q M(P, Q) = minmax strategy
  • Q* = arg max_Q min_P M(P, Q) = maxmin strategy
• in words:
  • Mindy's minmax strategy P* guarantees loss ≤ v (regardless of Max's play)
  • optimal because Max has maxmin strategy Q* that can force loss ≥ v (regardless of Mindy's play)
• e.g.: in RPS, P* = Q* = uniform
• solving game = finding minmax/maxmin strategies
Weaknesses of Classical Theory
• seems to fully answer how to play games: just compute minmax strategy (e.g., using linear programming)
• weaknesses:
  • game M may be unknown
  • game M may be extremely large
  • opponent may not be fully adversarial
• may be possible to do better than value v
• e.g.:
  Lisa (thinks): Poor predictable Bart, always takes Rock.
  Bart (thinks): Good old Rock, nothing beats that.
Repeated Play
• if only playing once, hopeless to overcome ignorance of game M or opponent
• but if game played repeatedly, may be possible to learn to play well
• goal: play (almost) as well as if knew game and how opponent would play ahead of time
Repeated Play (cont.)
• M unknown
• for t = 1, …, T:
  • Mindy chooses P_t
  • Max chooses Q_t (possibly depending on P_t)
  • Mindy's loss = M(P_t, Q_t)
  • Mindy observes loss M(i, Q_t) of each pure strategy i
• want:

  (1/T) Σ_{t=1}^T M(P_t, Q_t)  ≤  min_P (1/T) Σ_{t=1}^T M(P, Q_t)  +  ["small amount"]
   (actual average loss)           (best loss in hindsight)
Multiplicative-weights Algorithm (MW) [with Freund]
• choose η > 0
• initialize: P_1 = uniform
• on round t:

  P_{t+1}(i) = P_t(i) exp(−η M(i, Q_t)) / normalization

• idea: decrease weight of strategies suffering the most loss
• directly generalizes [Littlestone & Warmuth]
• other algorithms: [Hannan '57], [Blackwell '56], [Foster & Vohra], [Fudenberg & Levine], …
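The MW update can be written in a few lines (an illustrative sketch; `mw_update` is a name introduced here):

```python
import numpy as np

def mw_update(P, loss, eta):
    """One Multiplicative-Weights step: exponentially down-weight the
    pure strategies in proportion to the loss they just suffered,
    then renormalize.  loss[i] plays the role of M(i, Q_t)."""
    P = P * np.exp(-eta * loss)
    return P / P.sum()

P = np.ones(3) / 3                   # P_1 = uniform
loss = np.array([1.0, 0.5, 0.0])     # this round's losses of the three rows
P = mw_update(P, loss, eta=0.5)
print(P)   # weight shifts toward the rows with low loss
```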
Analysis
• Theorem: can choose η so that, for any game M with m rows, and any opponent,

  (1/T) Σ_{t=1}^T M(P_t, Q_t)  ≤  min_P (1/T) Σ_{t=1}^T M(P, Q_t)  +  Δ_T
   (actual average loss)           (best average loss, ≤ v)

  where Δ_T = O(√(ln m / T)) → 0

• regret Δ_T is:
  • logarithmic in # rows m
  • independent of # columns
• therefore, can use when working with very large games
Solving a Game [with Freund]
• suppose game M played repeatedly
• Mindy plays using MW
• on round t, Max chooses best response:

  Q_t = arg max_Q M(P_t, Q)

• let

  P̄ = (1/T) Σ_{t=1}^T P_t,   Q̄ = (1/T) Σ_{t=1}^T Q_t

• can prove that P̄ and Q̄ are Δ_T-approximate minmax and maxmin strategies:

  max_Q M(P̄, Q) ≤ v + Δ_T   and   min_P M(P, Q̄) ≥ v − Δ_T
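The whole procedure (MW for Mindy, best responses for Max, averaged strategies) can be sketched as follows; for Rock-Paper-Scissors the average strategy should approach uniform and the guaranteed loss should approach the value 1/2 (illustrative code, not from the slides):

```python
import numpy as np

def solve_game(M, T=2000, eta=0.1):
    """Approximately solve the game: Mindy runs MW, Max best-responds.
    Returns average strategies (P_bar, Q_bar), which are Delta_T-approximate
    minmax / maxmin strategies."""
    n_rows, n_cols = M.shape
    P = np.ones(n_rows) / n_rows
    P_sum, Q_sum = np.zeros(n_rows), np.zeros(n_cols)
    for _ in range(T):
        j = int(np.argmax(P @ M))        # Max's best response (pure column)
        P_sum += P
        Q_sum[j] += 1.0
        P = P * np.exp(-eta * M[:, j])   # MW update against that column
        P /= P.sum()
    return P_sum / T, Q_sum / T

# Rock-Paper-Scissors loss matrix (value of the game is 1/2)
M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
P_bar, Q_bar = solve_game(M)
print(P_bar)              # close to the uniform minmax strategy [1/3 1/3 1/3]
print((P_bar @ M).max())  # Mindy's guaranteed loss: close to the value 0.5
```

The per-round strategies P_t cycle (chasing Max's best responses); it is their average that converges, exactly as the slide's Δ_T bound predicts.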
Boosting as a Game
• Mindy (row player) ↔ booster
• Max (column player) ↔ weak learner
• matrix M:
  • row ↔ training example
  • column ↔ weak classifier
  • M(i, j) = 1 if j-th weak classifier correct on i-th training example, 0 else
  • encodes which weak classifiers are correct on which examples
  • huge # of columns: one for every possible weak classifier
[Matrix illustration: rows x_i, columns h_j; entry 1 if h_j(x_i) = y_i (correct), 0 if incorrect]
Hedge(η)

Outline
• The Minimax theorem
• Learning games
• Hedge(η)
The Minimax theorem

Matrix Games

   1   0   1   0   1
  -1   0   0   1   1
   1   0  -1   1   0

• A game between the column player and the row player.
• The chosen entry defines the loss of the column player = gain of the row player.
• If choices are made serially, the second player to choose has an advantage.
Mixed strategies

        q1   q2   q3   q4   q5
  p1     1    0    1    0    1
  p2    -1    0    0    1    1
  p3     1    0   -1    1    0

• Pure strategies: each player chooses a single action.
• Mixed strategies: each player chooses a distribution over actions.
• Expected gain/loss: p M q^T
The Minimax theorem [John von Neumann, 1928]

  max_p min_q p M q^T = min_q max_p p M q^T

• Unlike pure strategies, the order of choice of mixed strategies does not matter.
• Optimal mixed strategies: the strategies that achieve the minimax.
• Value of the game: the value of the minimax.
• Finding the minimax strategies when the matrix is known = linear programming.
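The linear-programming route mentioned above can be sketched with `scipy.optimize.linprog` (illustrative: maximize v subject to p^T M ≥ v componentwise and Σp = 1, treating M as the row player's gains):

```python
import numpy as np
from scipy.optimize import linprog

def minimax_lp(M):
    """Row player's optimal mixed strategy by linear programming:
    maximize v s.t. p^T M >= v componentwise, sum(p) = 1, p >= 0,
    where M holds the row player's gains.  linprog minimizes, so use -v."""
    m, n = M.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                  # objective: maximize v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])     # v - p^T M[:, j] <= 0
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0                             # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]     # p >= 0, v free in sign
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# The 3x5 matrix from the Matrix Games slide; the all-zero column caps the
# row player's guaranteed gain, so the value of this game is 0.
M = np.array([[ 1, 0,  1, 0, 1],
              [-1, 0,  0, 1, 1],
              [ 1, 0, -1, 1, 0]], dtype=float)
p, v = minimax_lp(M)
print(p, v)
```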
Learning games
A matrix corresponding to online learning

  t =        1    2    3    4   ...
  expert 1   1    0    1    0   ...
  expert 2  -1    0    0    1   ...
  expert 3   1    0   -1    1   ...

• The columns are revealed one at a time.
• Using Hedge or NormalHedge the row player chooses a mixed strategy over the rows that is almost as good as the best single row in hindsight.
• The best single row in hindsight is at least as good as any mixed strategy in hindsight.
A matrix corresponding to online learning

  t =        1    2    3    4   ...
  expert 1   1    0    1    0   ...
  expert 2  -1    0    0    1   ...
  expert 3   1    0   -1    1   ...
• If the adversary plays optimally, then the row distribution converges to a minimax optimal mixed strategy.
• But the adversary might not play optimally; minimizing regret is a stronger criterion than converging to a minimax optimal mixed strategy.
A matrix corresponding to boosting

               ex. 1   ex. 2   ex. 3   ex. 4  ...
  base rule 1    1       0       1       0    ...
  base rule 2    0       0       0       1    ...
  base rule 3    1       0       0       1    ...

• 0 = mistake, 1 = correct.
• A weak learning algorithm: can find a base rule whose weighted error is smaller than 1/2 − γ for any distribution over the examples.
• There is a distribution over the base rules such that for any example the expected error is smaller than 1/2 − γ.
• Implies that the majority vote w.r.t. this distribution over base rules is correct on all examples.
• Moreover, the weight of the majority is at least 1/2 + γ, the minority is at most 1/2 − γ.
Boosting and the Minmax Theorem
• γ-weak learning assumption:
  • for every distribution on examples
  • can find weak classifier with weighted error ≤ 1/2 − γ
• equivalent to:

  (value of game M) ≥ 1/2 + γ

• by minmax theorem, implies that:
  • ∃ some weighted majority classifier that correctly classifies all training examples with margin ≥ 2γ
  • further, weights are given by maxmin strategy of game M
Idea for Boosting
• maxmin strategy of M has perfect (training) accuracy and large margins
• find approximately using earlier algorithm for solving a game
  • i.e., apply MW to M
• yields (variant of) AdaBoost
AdaBoost and Game Theory
• summarizing:
  • weak learning assumption implies maxmin strategy for M defines large-margin classifier
  • AdaBoost finds maxmin strategy by applying general algorithm for solving games through repeated play
• consequences:
  • weights on weak classifiers converge to (approximately) maxmin strategy for game M
  • (average of) distributions D_t converges to (approximately) minmax strategy
  • margins and edges connected via minmax theorem
  • explains why AdaBoost maximizes margins
• different instantiation of game-playing algorithm gives online learning algorithms (such as weighted majority algorithm)
Fundamental Perspectives
• game theory
• loss minimization
• an information-geometric view
A Dual Information-Geometric Perspective
• loss minimization focuses on function computed by AdaBoost (i.e., weights on weak classifiers)
• dual view: instead focus on distributions D_t (i.e., weights on examples)
• dual perspective combines geometry and information theory
• exposes underlying mathematical structure
• basis for proving convergence
An Iterative-Projection Algorithm
• say want to find point closest to x_0 in set P = { intersection of N hyperplanes }
• algorithm: [Bregman; Censor & Zenios]
  • start at x_0
  • repeat: pick a hyperplane and project onto it
[Figure: successive projections from x_0 onto hyperplanes, converging into P]
• if P ≠ ∅, under general conditions, will converge correctly
AdaBoost is an Iterative-Projection Algorithm [Kivinen & Warmuth]
• points = distributions D_t over training examples
• distance = relative entropy:

  RE(P ‖ Q) = Σ_i P(i) ln( P(i) / Q(i) )

• reference point x_0 = uniform distribution
• hyperplanes defined by all possible weak classifiers g_j:

  Σ_i D(i) y_i g_j(x_i) = 0  ⇔  Pr_{i∼D}[g_j(x_i) = y_i] = 1/2

• intuition: looking for "hardest" distribution
AdaBoost as Iterative Projection (cont.)
• algorithm:
  • start at D_1 = uniform
  • for t = 1, 2, …:
    • pick hyperplane/weak classifier h_t ↔ g_j
    • D_{t+1} = (entropy) projection of D_t onto hyperplane
              = arg min_{D: Σ_i D(i) y_i g_j(x_i) = 0} RE(D ‖ D_t)
• claim: equivalent to AdaBoost
• further: choosing h_t with minimum error ≡ choosing farthest hyperplane
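The claimed equivalence can be checked on a single step: the entropy projection onto the hyperplane Σ_i D(i) y_i g(x_i) = 0 has the exponential form of AdaBoost's reweighting, with α = ½ ln(W₊/W₋) (illustrative sketch; `project` is a name introduced here):

```python
import numpy as np

def project(D, u):
    """Entropy projection of distribution D onto the hyperplane
    sum_i D'(i) u(i) = 0, where u(i) = y_i g(x_i) is +1 on examples the
    weak rule gets right and -1 where it errs.  The projection has the
    exponential form D'(i) ~ D(i) exp(-alpha u(i)) with
    alpha = (1/2) ln(W_plus / W_minus): exactly AdaBoost's reweighting."""
    W_plus = D[u > 0].sum()
    W_minus = D[u < 0].sum()
    alpha = 0.5 * np.log(W_plus / W_minus)
    Dp = D * np.exp(-alpha * u)
    return Dp / Dp.sum(), alpha

D = np.array([0.4, 0.3, 0.2, 0.1])
u = np.array([+1, +1, -1, +1])   # weak rule correct on examples 1, 2, 4
Dp, alpha = project(D, u)
print(Dp @ u)   # ~0 (up to float error): the rule's edge vanishes on Dp
```

On the projected distribution the chosen weak rule has exactly error 1/2, i.e. it is useless, which is why the next round must find a different rule.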
Boosting as Maximum Entropy
• corresponding optimization problem:

  min_{D∈P} RE(D ‖ uniform)  ↔  max_{D∈P} entropy(D)

• where

  P = feasible set = { D : Σ_i D(i) y_i g_j(x_i) = 0 ∀j }

• P ≠ ∅ ⇔ weak learning assumption does not hold
  • in this case, D_t → (unique) solution
• if weak learning assumption does hold then
  • P = ∅
  • D_t can never converge
  • dynamics are fascinating but unclear in this case
Reformulating AdaBoost as Iterative Projection
• points = nonnegative vectors d_t
• distance = unnormalized relative entropy:

  RE(p ‖ q) = Σ_i [ p(i) ln( p(i) / q(i) ) + q(i) − p(i) ]

• reference point x_0 = 1 (all 1's vector)
• hyperplanes defined by weak classifiers g_j:

  Σ_i d(i) y_i g_j(x_i) = 0

• resulting iterative-projection algorithm is again equivalent to AdaBoost
Reformulated Optimization Problem
• optimization problem:

  min_{d∈P} RE(d ‖ 1)

• where

  P = { d : Σ_i d(i) y_i g_j(x_i) = 0 ∀j }

• note: feasible set P never empty (since 0 ∈ P)
Exponential Loss as Entropy Optimization
• all vectors d_t created by AdaBoost have form:

  d(i) = exp( −y_i Σ_j λ_j g_j(x_i) )

• let Q = { all vectors d of this form }
• can rewrite exponential loss:

  inf_λ Σ_i exp( −y_i Σ_j λ_j g_j(x_i) ) = inf_{d∈Q} Σ_i d(i)
                                         = min_{d∈Q̄} Σ_i d(i)
                                         = min_{d∈Q̄} RE(0 ‖ d)

• Q̄ = closure of Q
Conclusions
• from different perspectives, AdaBoost can be interpreted as:
  • a method for boosting the accuracy of a weak learner
  • a procedure for maximizing margins
  • an algorithm for playing repeated games
  • a numerical method for minimizing exponential loss
  • an iterative-projection algorithm based on an information-theoretic geometry
• none is entirely satisfactory by itself, but each useful in its own way
• taken together, create rich theoretical understanding
  • connect boosting to other learning problems and techniques
  • provide foundation for versatile set of methods with many extensions, variations and applications
References

Coming soon:
• Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.

Survey articles:
• Ron Meir and Gunnar Rätsch. An Introduction to Boosting and Leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003. http://www.boosting.org/papers/MeiRae03.pdf
• Robert E. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002. http://www.cs.princeton.edu/~schapire/boost.html