AdaBoost Fits an Additive Model

Leon Gu

CSD, CMU

Generalized Additive Model

f(x) = \sum_{m=1}^{M} \beta_m b_m(x; \gamma_m)

Many classification and regression models can be written as a linear combination of simpler models (as above), where

- x is the input data;
- {β_m, γ_m} are the model parameters;
- b_m(x; γ_m) are arbitrary functions of x.

Typically, {β_m, γ_m} are estimated by minimizing some loss function L, which measures the prediction error over the training data {x_n, y_n}:

\langle \beta_m^*, \gamma_m^* \rangle_1^M = \arg\min_{\{\beta_m, \gamma_m\}_1^M} \; \sum_{i=1}^{N} L\Big( y_i, \; \sum_{m=1}^{M} \beta_m b_m(x_i; \gamma_m) \Big)
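For concreteness, here is a minimal numpy sketch of evaluating such an additive model; the Gaussian "bump" basis functions and the coefficient values are purely illustrative, not something from the slides.

```python
import numpy as np

def additive_model(x, betas, basis_fns, gammas):
    """Evaluate f(x) = sum_m beta_m * b_m(x; gamma_m) for an array of inputs x."""
    return sum(beta * b(x, gamma) for beta, b, gamma in zip(betas, basis_fns, gammas))

def bump(x, gamma):
    """Illustrative basis function b(x; gamma) = exp(-(x - gamma)^2)."""
    return np.exp(-(x - gamma) ** 2)

x = np.linspace(-3.0, 3.0, 7)
f_x = additive_model(x, betas=[1.0, -0.5], basis_fns=[bump, bump], gammas=[-1.0, 1.0])
```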

Forward Stagewise Optimization

Directly optimizing such a loss function is often difficult. However, if optimizing over a single base function

\min_{\beta, \gamma} \sum_{i=1}^{N} L\big( y_i, \; \beta\, b(x_i; \gamma) \big)

can be done efficiently, then a simple greedy search method can be used. The basic idea is to sequentially add new base functions to the expansion f(x) without changing the parameters of the terms that have already been added.

For example, in the m-th stage a new function b_m(x; γ_m) is added to the expansion f_{m−1}(x); the new coefficients β_m, γ_m are fitted by minimizing the error between β_m b_m(x; γ_m) and the residual y − f_{m−1}(x); then the expansion is updated by f_m(x) = f_{m−1}(x) + β_m^* b_m(x; γ_m^*).

Such a strategy (searching for the global optimum by greedily solving a sequence of subproblems) is commonly used in practice.
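The loop below is a minimal numpy sketch of this stagewise idea for the squared-error case, where fitting against the residual is exactly the per-stage subproblem; fit_base is an assumed helper that fits one base function to data and returns its coefficient and the fitted function.

```python
import numpy as np

def forward_stagewise(x, y, fit_base, M):
    """Greedy forward stagewise fitting (squared-error case): at each stage,
    fit one new base function to the current residual and add it to the
    expansion, leaving all previously fitted terms untouched."""
    f = np.zeros_like(y, dtype=float)
    expansion = []
    for m in range(M):
        residual = y - f                 # what the current expansion still misses
        beta, b = fit_base(x, residual)  # assumed helper: returns (coefficient, callable)
        f = f + beta * b(x)              # earlier coefficients stay fixed
        expansion.append((beta, b))
    return expansion
```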

AdaBoost

AdaBoost fits an additive model by the forward stagewise approach, where

- the base function b_m is a binary classifier G_m(x): R^K → {−1, 1};
- the loss function is the exponential loss:

L(y, f(x)) = \exp\big( -y f(x) \big)

Many boosting variants have been developed based on this observation:

- change the base function?
- change the loss function?

First we write down the stagewise minimization of the exponential loss:

\langle \beta_m, G_m \rangle = \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\big[ -y_i \big( f_{m-1}(x_i) + \beta G(x_i) \big) \big]

Then define the weights w_i^{(m)} = exp(−y_i f_{m−1}(x_i)), and divide the data into two subsets, {y_i = G(x_i)} and {y_i ≠ G(x_i)}:

\sum_{i=1}^{N} \exp\big[ -y_i \big( f_{m-1}(x_i) + \beta G(x_i) \big) \big]
\;=\; \sum_{i=1}^{N} w_i^{(m)} \exp\big( -y_i \beta G(x_i) \big)
\;=\; e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} \;+\; e^{\beta} \sum_{y_i \neq G(x_i)} w_i^{(m)}
\;=\; e^{-\beta} \sum_{i=1}^{N} w_i^{(m)} \;+\; \big( e^{\beta} - e^{-\beta} \big) \sum_{i=1}^{N} w_i^{(m)} I\big( y_i \neq G(x_i) \big)

Then we optimize L over β and G iteratively.

For any fixed β > 0, the optimal G_m(x) is given by

G_m = \arg\min_{G} \sum_{i=1}^{N} w_i^{(m)} I\big( y_i \neq G(x_i) \big)

In other words, the optimal G_m(x) is the classifier that minimizes the weighted prediction error, where w_i^{(m)} is viewed as a weight assigned to the i-th training example.
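As a concrete (hypothetical) example of such a weighted base learner, the sketch below exhaustively searches for the decision stump with the smallest weighted error; the stump family and the brute-force search are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def best_stump(X, y, w):
    """Pick the decision stump (feature j, threshold t, sign s) minimizing the
    weighted error sum_i w_i * I(y_i != G(x_i)).  X is (N, K), y in {-1, +1}."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, s, -s)   # stump prediction G(x)
                err = np.sum(w * (pred != y))          # weighted misclassification
                if err < best_err:
                    best_err, best = err, (j, t, s)
    return best, best_err
```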

Given a fixed G_m, we substitute it into the loss function, take the derivative w.r.t. β_m, and set it to zero:

\frac{d}{d\beta_m} \Big[ e^{-\beta_m} \sum_{y_i = G_m(x_i)} w_i^{(m)} + e^{\beta_m} \sum_{y_i \neq G_m(x_i)} w_i^{(m)} \Big] = 0

that is,

\beta_m = \frac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m},
\qquad
\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i^{(m)} I\big( y_i \neq G_m(x_i) \big)}{\sum_{i=1}^{N} w_i^{(m)}}
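As a quick numeric check (the numbers are illustrative, not from the slides): a base classifier that beats chance receives a positive coefficient, while one that performs exactly at chance receives zero weight.

\mathrm{err}_m = 0.2 \;\Rightarrow\; \beta_m = \tfrac{1}{2}\log\tfrac{0.8}{0.2} = \tfrac{1}{2}\log 4 \approx 0.69,
\qquad
\mathrm{err}_m = 0.5 \;\Rightarrow\; \beta_m = \tfrac{1}{2}\log 1 = 0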

Now check the updating rule for the weights w_i^{(m)}:

w_i^{(m+1)} = \exp\big( -y_i f_m(x_i) \big)
= \exp\big[ -y_i \big( f_{m-1}(x_i) + \beta_m G_m(x_i) \big) \big]
= w_i^{(m)} \exp\big( -\beta_m y_i G_m(x_i) \big)
= w_i^{(m)} \exp\big( 2\beta_m I(y_i \neq G_m(x_i)) \big) \exp(-\beta_m)

where the last step uses −y_i G_m(x_i) = 2 I(y_i ≠ G_m(x_i)) − 1. The factor exp(−β_m) is the same for every example, so it does not change the relative weights and can be dropped; writing α_m = 2β_m gives the update used in the summary below.

Now we summarize the algorithm:

1. Initialize the weights w_i = 1/N, i = 1, . . . , N.
2. For m = 1 to M:
   2.1 Fit a classifier G_m to the training data with weights w_i.
   2.2 Update the weights by w_i ← w_i \exp\big( \alpha_m I(y_i \neq G_m(x_i)) \big), where
       \alpha_m = \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m},
       \qquad
       \mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i^{(m)} I\big( y_i \neq G_m(x_i) \big)}{\sum_{i=1}^{N} w_i^{(m)}}
3. Output the final classifier f(x) = \mathrm{sign}\Big( \sum_{m=1}^{M} \alpha_m G_m(x) \Big).

That is exactly AdaBoost.
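Putting the pieces together, here is a minimal numpy sketch of that summary. The helper fit_weak is an assumption (any routine that fits a weak classifier to weighted data and returns a function G with G(X) in {−1, +1} will do, for instance a stump fitter like the one sketched earlier); labels y are assumed to be in {−1, +1} and each weighted error to satisfy 0 < err_m < 1.

```python
import numpy as np

def adaboost(X, y, fit_weak, M):
    """Sketch of the AdaBoost summary above (not a reference implementation).
    Returns a prediction function for new inputs."""
    N = len(y)
    w = np.full(N, 1.0 / N)                    # step 1: uniform weights
    alphas, learners = [], []
    for m in range(M):                         # step 2
        G = fit_weak(X, y, w)                  # 2.1: weak classifier on weighted data
        miss = (G(X) != y)
        err = np.sum(w * miss) / np.sum(w)     # weighted error err_m
        alpha = np.log((1.0 - err) / err)      # alpha_m = log((1 - err_m)/err_m)
        w = w * np.exp(alpha * miss)           # 2.2: up-weight misclassified examples
        alphas.append(alpha)
        learners.append(G)

    def predict(X_new):                        # step 3: sign of the weighted vote
        votes = sum(a * G(X_new) for a, G in zip(alphas, learners))
        return np.sign(votes)

    return predict
```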

Other Loss Functions

The choice of loss function directly affects the computational complexity and the robustness of the algorithm.

- |y − f(x)| (called the "residual error") is used to measure the goodness of a regression fit.
- y f(x) (called the "margin") is used to measure the goodness of a classification. The loss criterion should penalize large negative margins and encourage positive margins.

Apparently the squared-error loss (y − f(x))² is not suitable for classification, because it penalizes confidently correct classifications (large positive margins) as heavily as misclassified ones.

Typical Loss Functions for Classification

- 0/1 loss: I(sign(f) ≠ y)
- exponential loss: exp(−y f(x)) (AdaBoost)
- binomial deviance: log(1 + exp(−2 y f))
- soft-margin (hinge) loss: (1 − y f) I(y f < 1) (SVM)
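A small numpy snippet (illustrative only) that evaluates each of these losses as a function of the margin y f(x), which makes their different treatment of negative and positive margins easy to compare:

```python
import numpy as np

margin = np.linspace(-2.0, 2.0, 9)                 # margin values y * f(x)
zero_one    = (margin <= 0).astype(float)          # 0/1 loss I(sign(f) != y)
exponential = np.exp(-margin)                      # exponential loss (AdaBoost)
deviance    = np.log(1.0 + np.exp(-2.0 * margin))  # binomial deviance
hinge       = np.maximum(0.0, 1.0 - margin)        # soft-margin / hinge loss (SVM)
```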

Typical Loss Functions for Regression

- squared-error loss: (y − f(x))²
- absolute loss: |y − f(x)|
- Huber loss:

L(y, f) =
\begin{cases}
(y - f(x))^2 & \text{for } |y - f(x)| \le \delta \\
2\delta\,|y - f(x)| - \delta^2 & \text{otherwise}
\end{cases}

1. quadratically increasing in the residual while |y − f(x)| ≤ δ;
2. linearly increasing while |y − f(x)| > δ;
3. differentiable: at |y − f(x)| = δ the quadratic and linear pieces meet with the same value and slope.
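A minimal numpy sketch of the Huber loss as written above, with an arbitrary default value for δ:

```python
import numpy as np

def huber_loss(y, f, delta=1.0):
    """Huber loss: quadratic for residuals up to delta, linear beyond it;
    the two pieces match in value and slope at |y - f| = delta."""
    r = np.abs(y - f)
    return np.where(r <= delta, r ** 2, 2.0 * delta * r - delta ** 2)
```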