AdaBoost Fits an Additive Model - Carnegie Mellon School...
Generalized Additive Model
f(x) = \sum_{m=1}^{M} \beta_m b_m(x; \gamma_m)
Many classification and regression models can be written as a linear combination of some simpler models (as above), where
- x is the input data;
- \{\beta_m, \gamma_m\} are model parameters;
- b_m(x; \gamma_m) are arbitrary functions of x.

Typically, \{\beta_m, \gamma_m\} are estimated by minimizing some loss function L, which measures the prediction error over the training data \{x_i, y_i\}_{i=1}^N:
\langle \beta_m^*, \gamma_m^* \rangle_1^M = \arg\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^N L\left( y_i, \sum_{m=1}^M \beta_m b_m(x_i; \gamma_m) \right)
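To make the expansion concrete, here is a minimal sketch that evaluates f(x) and its training loss for specific, purely illustrative choices: step-function bases b_m(x; \gamma_m) = sign(x - \gamma_m) and squared-error loss (neither choice is prescribed above).

```python
def b(x, gamma):
    # hypothetical basis function: a step with threshold gamma
    return 1.0 if x - gamma >= 0 else -1.0

def f(x, betas, gammas):
    # additive expansion f(x) = sum_m beta_m * b_m(x; gamma_m)
    return sum(beta * b(x, gamma) for beta, gamma in zip(betas, gammas))

def total_loss(xs, ys, betas, gammas):
    # training loss sum_i L(y_i, f(x_i)) with L = squared error
    return sum((y - f(x, betas, gammas)) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [-1.0, -1.0, 1.0, 1.0]
# a single term (M = 1) with beta_1 = 1, gamma_1 = 1.5 fits this toy data exactly
print(total_loss(xs, ys, [1.0], [1.5]))  # -> 0.0
```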
Forward Stagewise Optimization
Directly optimizing such a loss function is often difficult. However, if optimizing over one single base function,

\min_{\beta, \gamma} \sum_{i=1}^N L(y_i, \beta b(x_i; \gamma))

can be done efficiently, a simple greedy search method can be used. The basic idea is to sequentially add new base functions to the expansion f(x) without changing the parameters that have already been added.
For example, in the m-th stage, a new function b_m(x; \gamma_m) is added to the expansion f_{m-1}(x); the new coefficients \beta_m, \gamma_m are fitted by minimizing the error between \beta_m b_m(x; \gamma_m) and the residual y - f_{m-1}(x); then f is updated by f_m(x) = f_{m-1}(x) + \beta_m^* b_m(x; \gamma_m^*).

Such a strategy (approaching the global optimum by solving a sequence of subproblems in a greedy manner) is commonly used in practice.
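The stagewise scheme above can be sketched as follows, assuming squared-error loss and stump base functions b(x; \gamma) = sign(x - \gamma) (both are illustrative assumptions): each stage fits one new term to the current residuals and never revisits earlier terms.

```python
def stump(x, gamma):
    return 1.0 if x - gamma >= 0 else -1.0

def fit_stage(xs, residuals):
    # exhaustively search the threshold gamma; for squared error the best
    # coefficient has a closed form: beta = mean of residual * stump output
    best = None
    for gamma in xs:
        outs = [stump(x, gamma) for x in xs]
        beta = sum(r * o for r, o in zip(residuals, outs)) / len(xs)
        err = sum((r - beta * o) ** 2 for r, o in zip(residuals, outs))
        if best is None or err < best[0]:
            best = (err, beta, gamma)
    return best[1], best[2]

def forward_stagewise(xs, ys, M):
    betas, gammas = [], []
    residuals = list(ys)                  # residuals w.r.t. f_0 = 0
    for _ in range(M):
        beta, gamma = fit_stage(xs, residuals)
        betas.append(beta)
        gammas.append(gamma)
        # update f and the residuals; earlier terms stay fixed
        residuals = [r - beta * stump(x, gamma) for r, x in zip(residuals, xs)]
    return betas, gammas

betas, gammas = forward_stagewise([0.0, 1.0, 2.0, 3.0], [-1.0, -1.0, 1.0, 1.0], M=2)
print(betas, gammas)   # on this toy data the first stage already fits exactly
```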
AdaBoost
AdaBoost fits an additive model by the forward stagewise approach, where
- the base function b_m is a binary classifier G_m(x): \mathbb{R}^K \to \{-1, 1\};
- the objective function is the exponential loss

L(y, f(x)) = \exp(-y f(x))

Many boosting variants have been developed based on this observation:
- change the base function?
- change the loss function?
First we write down the exponential loss:
\langle \beta_m, G_m \rangle = \arg\min_{\beta, G} \sum_{i=1}^N \exp\left[ -y_i \left( f_{m-1}(x_i) + \beta G(x_i) \right) \right]

Then define weights w_i^{(m)} = \exp(-y_i f_{m-1}(x_i)), and divide the data into two subsets, \{y_i = G(x_i)\} and \{y_i \neq G(x_i)\}:

\sum_{i=1}^N \exp\left[ -y_i \left( f_{m-1}(x_i) + \beta G(x_i) \right) \right]
  = \sum_{i=1}^N w_i^{(m)} \exp(-y_i \beta G(x_i))
  = e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \sum_{y_i \neq G(x_i)} w_i^{(m)}
  = e^{-\beta} \sum_{i=1}^N w_i^{(m)} + \left( e^{\beta} - e^{-\beta} \right) \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i))
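The chain of equalities above can be verified numerically on made-up values of y_i, f_{m-1}(x_i), and G(x_i):

```python
import math

# made-up labels, previous-round scores, candidate classifier, and beta
ys = [1, -1, 1, 1, -1]
f_prev = [0.4, -0.2, -0.1, 0.8, 0.3]       # f_{m-1}(x_i)
G = [1, -1, -1, 1, 1]                      # G(x_i)
beta = 0.7

w = [math.exp(-y * f) for y, f in zip(ys, f_prev)]   # w_i^{(m)}

# original form
lhs = sum(math.exp(-y * (f + beta * g)) for y, f, g in zip(ys, f_prev, G))
# split over the two subsets
split = (math.exp(-beta) * sum(wi for wi, y, g in zip(w, ys, G) if y == g)
         + math.exp(beta) * sum(wi for wi, y, g in zip(w, ys, G) if y != g))
# final form with the weighted misclassification indicator
final = (math.exp(-beta) * sum(w)
         + (math.exp(beta) - math.exp(-beta))
           * sum(wi for wi, y, g in zip(w, ys, G) if y != g))
print(abs(lhs - split) < 1e-12, abs(lhs - final) < 1e-12)  # -> True True
```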
Then we optimize L over β and G iteratively.
When β is fixed, the optimal Gm(x) is given by
G_m = \arg\min_G \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i))

In other words, the optimal G_m(x) is the classifier which minimizes the weighted prediction error, where w_i^{(m)} is viewed as a weight assigned to the i-th training example.
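So this step reduces to weighted classification. As a sketch, assuming decision stumps as the base classifier family (an illustrative choice; the derivation leaves G abstract):

```python
def weighted_error(w, ys, preds):
    # sum_i w_i * I(y_i != G(x_i)) -- the quantity G_m minimizes
    return sum(wi for wi, y, p in zip(w, ys, preds) if y != p)

def best_stump(xs, ys, w):
    # search thresholds and both orientations for the minimum weighted error
    best = None
    for gamma in xs:
        for sign in (1, -1):
            preds = [sign if x >= gamma else -sign for x in xs]
            err = weighted_error(w, ys, preds)
            if best is None or err < best[0]:
                best = (err, gamma, sign)
    return best

xs = [0.0, 1.0, 2.0, 3.0]
ys = [-1, -1, 1, 1]
w = [0.25, 0.25, 0.25, 0.25]
print(best_stump(xs, ys, w))  # -> (0, 2.0, 1): threshold 2.0, zero weighted error
```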
Given a fixed G_m, we substitute it into the loss function, take the derivative w.r.t. \beta and set it to zero,

\frac{d}{d\beta} \left[ e^{-\beta} \sum_{y_i = G_m(x_i)} w_i^{(m)} + e^{\beta} \sum_{y_i \neq G_m(x_i)} w_i^{(m)} \right] = 0

that is,

\beta_m = \frac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m}, \qquad \mathrm{err}_m = \frac{\sum_{i=1}^N w_i^{(m)} I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i^{(m)}}
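The closed form for \beta_m can be sanity-checked against a brute-force search over \beta on toy numbers (all values below are made up):

```python
import math

# toy weights and misclassification indicators, assumed for illustration
w = [0.1, 0.4, 0.2, 0.3]
wrong = [False, True, False, False]        # I(y_i != G_m(x_i))

w_correct = sum(wi for wi, bad in zip(w, wrong) if not bad)
w_wrong = sum(wi for wi, bad in zip(w, wrong) if bad)
err = w_wrong / (w_correct + w_wrong)      # weighted error rate err_m

beta_closed = 0.5 * math.log((1 - err) / err)

def loss(beta):
    # e^{-beta} * (correctly classified weight) + e^{beta} * (misclassified weight)
    return math.exp(-beta) * w_correct + math.exp(beta) * w_wrong

# brute-force minimization over a fine grid of beta values
grid = [i / 10000.0 for i in range(-30000, 30001)]
beta_grid = min(grid, key=loss)
print(abs(beta_closed - beta_grid) < 1e-3)  # -> True
```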
Check the updating rule of the weights w_i^{(m)}:

w_i^{(m+1)} = \exp(-y_i f_m(x_i))
  = \exp\left[ -y_i \left( f_{m-1}(x_i) + \beta_m G_m(x_i) \right) \right]
  = w_i^{(m)} \exp(-\beta_m y_i G_m(x_i))
  = w_i^{(m)} \exp(2\beta_m I(y_i \neq G_m(x_i))) \exp(-\beta_m)

where the last step uses -y_i G_m(x_i) = 2 I(y_i \neq G_m(x_i)) - 1, which holds because y_i, G_m(x_i) \in \{-1, 1\}.
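The identity used in the last step is easy to confirm exhaustively over the four sign combinations:

```python
import math

beta = 0.8                                  # arbitrary beta_m
ok = True
for y in (-1, 1):
    for g in (-1, 1):                       # g stands for G_m(x_i)
        lhs = math.exp(-beta * y * g)
        rhs = math.exp(2 * beta * (1 if y != g else 0)) * math.exp(-beta)
        ok = ok and abs(lhs - rhs) < 1e-12
print(ok)  # -> True
```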
Now we summarize the algorithm:
1. Initialize the weights w_i = 1/N, i = 1, \dots, N.
2. For m = 1 to M:
   2.1 Fit a classifier G_m to the training data with weights w_i.
   2.2 Update the weights by w_i \leftarrow w_i \exp(\alpha_m I(y_i \neq G_m(x_i))), where
       \alpha_m = \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m} and \mathrm{err}_m = \frac{\sum_{i=1}^N w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i}.
3. Output the final classifier f(x) = \mathrm{sign}\left( \sum_{m=1}^M \alpha_m G_m(x) \right).

That is exactly AdaBoost. (Note \alpha_m = 2\beta_m; the constant factor 2, and the factor \exp(-\beta_m) common to all weights, change neither the chosen classifier nor the sign of the vote.)
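The three steps translate directly into code. Below is a self-contained sketch using decision stumps on one-dimensional inputs as the weak learner; the stump family and the toy data are assumptions for illustration, since the algorithm leaves the weak learner abstract.

```python
import math

def stump_predict(x, gamma, sign):
    # weak learner: sign if x >= gamma, else -sign
    return sign if x >= gamma else -sign

def fit_stump(xs, ys, w):
    # step 2.1: pick the stump minimizing the weighted error
    best = None
    for gamma in xs:
        for sign in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if stump_predict(x, gamma, sign) != y)
            if best is None or err < best[0]:
                best = (err, gamma, sign)
    return best

def adaboost(xs, ys, M):
    N = len(xs)
    w = [1.0 / N] * N                      # step 1: uniform weights
    model = []                             # list of (alpha_m, gamma_m, sign_m)
    for _ in range(M):
        werr, gamma, sign = fit_stump(xs, ys, w)
        err = werr / sum(w)
        if err == 0:                       # weak learner already perfect
            model.append((1.0, gamma, sign))
            break
        alpha = math.log((1 - err) / err)
        model.append((alpha, gamma, sign))
        # step 2.2: up-weight the misclassified points
        w = [wi * math.exp(alpha) if stump_predict(x, gamma, sign) != y else wi
             for wi, x, y in zip(w, xs, ys)]
    return model

def predict(model, x):
    # step 3: sign of the weighted vote
    vote = sum(alpha * stump_predict(x, gamma, sign)
               for alpha, gamma, sign in model)
    return 1 if vote >= 0 else -1

# toy data: not separable by any single stump, but three rounds suffice
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, M=3)
print([predict(model, x) for x in xs])  # -> [1, 1, -1, -1, 1, 1]
```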
Other Loss Functions
The choice of loss function is directly related to computation complexityand robustness of the algorithm.
I |y − f(x)| (called “residual error”) is used to represent thegoodness of regression.
I yf(x) ( called “margin” ) is used to represent the goodness ofclassification. The loss criterion should penalize large negativemargin and encourage positive margin.
Apparently square error loss (y − f(x))2 is not suitable for classification,because it penalizes correctly classified data as heavily as misclassifiedones.
Typical Loss Functions for Classification
- 0/1 loss: I(\mathrm{sign}(f) \neq y)
- exponential loss: \exp(-y f(x)) (AdaBoost)
- binomial deviance: \log(1 + \exp(-2 y f))
- soft-margin (hinge) loss: (1 - y f) \, I(y f < 1) (SVM)
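Evaluating these criteria at a few margin values m = y f makes the comparison concrete (a small illustrative script; the sample margins are arbitrary):

```python
import math

# all losses written as functions of the margin m = y * f(x)

def zero_one(m):
    return 1.0 if m <= 0 else 0.0      # I(sign(f) != y); margin 0 counted as error

def exponential(m):
    return math.exp(-m)                # AdaBoost

def binomial_deviance(m):
    return math.log(1 + math.exp(-2 * m))

def hinge(m):
    return max(0.0, 1 - m)             # soft-margin loss (SVM)

for m in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(m, zero_one(m),
          round(exponential(m), 3),
          round(binomial_deviance(m), 3),
          round(hinge(m), 3))
```

Note that all three continuous losses dominate the 0/1 loss for negative margins but keep decreasing (or stay at zero) as the margin grows.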
Typical Loss Functions for Regression
- squared-error loss: (y - f(x))^2
- absolute loss: |y - f(x)|
- Huber loss:

L(y, f) = \begin{cases} (y - f(x))^2 & \text{for } |y - f(x)| \le \delta \\ 2\delta \left( |y - f(x)| - \delta/2 \right) & \text{otherwise} \end{cases}

The Huber loss is
1. quadratically increasing in the residual while |y - f(x)| \le \delta;
2. linearly increasing while |y - f(x)| > \delta;
3. differentiable at the boundary, that is, the tangent of the quadratic part at \delta matches the slope of the linear part.
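A direct transcription of the piecewise definition (the default \delta = 1 is chosen arbitrarily); the factor 2\delta in the linear branch is what makes the two pieces meet with equal value and slope at |y - f(x)| = \delta:

```python
def huber(y, f, delta=1.0):
    r = abs(y - f)
    if r <= delta:
        return r ** 2                      # quadratic near zero
    return 2 * delta * (r - delta / 2)     # linear in the tails

# continuity at the knee: both branches give delta**2 at |r| = delta
print(huber(0.0, 1.0))   # |r| = 1 = delta -> 1.0
print(huber(0.0, 3.0))   # |r| = 3 -> 2 * 1 * (3 - 0.5) = 5.0
```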