Decision Tree and Boosting

Tong Zhang

Rutgers University

Learning Algorithm

Learning algorithm A: given training data S_n = {(X_i, Y_i)}_{i=1,...,n}, learn a prediction rule f = A(S_n) from S_n.

Goal: given an unseen X, apply the prediction f(X) to estimate Y.

Linear methods (ridge regression, linear SVM, etc.):
- linear prediction: learn a function f(x) = β^T x from the training data
- nonlinearity achieved via nonlinear features (e.g. kernel methods)

Nonlinear methods (decision trees, boosted decision trees, neural networks, etc.):
- learn a nonlinear prediction directly from the data

Decision Tree

[Figure: an example decision tree. The root tests X[1] < 2; internal nodes test X[2] < 3, X[3] < 1, and X[1] < 1; each branch is labeled true/false and each leaf predicts Y = 1 or Y = 0.]

- X = [X[1], X[2], ..., X[d]]: feature vector with attributes X[j]
- Y: predicted class label
- Each node: compare a feature value to a threshold

Decision Tree

Partition the data into segments along the paths to the leaf nodes. Follow a branch at each node through the test:
- is an attribute value < a threshold?

Constant prediction at each leaf node; for classification,
$$\text{probability score} = \frac{\text{number of in-class documents reaching the node}}{\text{number of documents reaching the node}}.$$

Decision tree learning is a two-stage process:
- Tree growing: recursively search for the (attribute, threshold) pair that reduces the error most
- Tree pruning: remove deep tree nodes to avoid overfitting
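To make the prediction rule concrete, here is a minimal Python sketch (not from the slides; the Node fields and function names are illustrative) of a tree whose internal nodes test X[feature] < threshold and whose leaves store the probability score defined above.

```python
# Minimal sketch: internal nodes test "x[feature] < threshold"; leaves return
# the in-class fraction of the training documents that reached them.
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    feature: Optional[int] = None       # index j of the attribute X[j] tested here
    threshold: Optional[float] = None   # branch on X[feature] < threshold
    true_branch: Optional["Node"] = None
    false_branch: Optional["Node"] = None
    score: float = 0.0                  # leaf only: in-class count / count reaching the node

def predict(node: Node, x: Sequence[float]) -> float:
    """Follow the branches until a leaf is reached; return its probability score."""
    while node.feature is not None:
        node = node.true_branch if x[node.feature] < node.threshold else node.false_branch
    return node.score
```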

Tree Growing Illustration

[Figure: a sequence of four trees produced during growing. Starting from a single leaf Y = 0, the root is split on X[1] < 2; further splits on X[3] < 1, X[2] < 3, and X[1] < 1 are then added one at a time, ending with the example tree shown earlier.]

Tree Growing

Training data (X_i, Y_i), i = 1, ..., n, and a loss function L(f, y) (e.g. (f − y)^2). Recursively do the following:

- At each leaf node, let S be the training data reaching it. Each node makes a constant prediction, with optimal loss
$$\min_f \sum_{i \in S} L(f, Y_i).$$
- For each potential (attribute, threshold) split (j, θ), partition S into S_1(j, θ) and S_2(j, θ); the optimal loss with this partition is
$$\min_{f_1, f_2} \sum_{i \in S_1} L(f_1, Y_i) + \sum_{i \in S_2} L(f_2, Y_i).$$
- For each leaf node, grow the tree using the (j, θ) that reduces the loss most.
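As a rough illustration of this split search (a sketch with assumed names, not the lecture's code), the following scores every (attribute, threshold) pair for one leaf under a pluggable node loss; squared error is the default, matching the example loss on the next slide.

```python
# Sketch of one step of tree growing: for a leaf holding (X, y), try every
# (attribute, threshold) pair and keep the split with the largest loss reduction.
import numpy as np

def sse(y):
    """Optimal loss of a constant prediction under squared error:
    the prediction is the mean, the loss is the sum of squared deviations."""
    return float(((y - y.mean()) ** 2).sum())

def best_split(X, y, node_loss=sse):
    """Return (j, theta, loss_reduction) for the best split of this leaf,
    or None if no split separates the data."""
    base = node_loss(y)
    best = None
    for j in range(X.shape[1]):                  # every attribute
        for theta in np.unique(X[:, j]):         # thresholds aligned with observed values
            mask = X[:, j] < theta
            if not mask.any() or mask.all():
                continue
            reduction = base - node_loss(y[mask]) - node_loss(y[~mask])
            if best is None or reduction > best[2]:
                best = (j, float(theta), reduction)
    return best
```

This brute-force scan re-evaluates the loss for every threshold; the next two slides give the least squares special case and note that sorting the feature values makes the scan much cheaper.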

Example Loss Criteria

Least squares loss: L(f, Y) = (f − Y)^2.

With the node's data S, the optimal constant prediction is
$$f = \sum_{i \in S} Y_i / |S|.$$

After splitting into S_1 and S_2:
$$f_1 = \sum_{i \in S_1} Y_i / |S_1|, \qquad f_2 = \sum_{i \in S_2} Y_i / |S_2|.$$

The objective function to minimize is
$$\sum_{i \in S_1} (Y_i - f_1)^2 + \sum_{i \in S_2} (Y_i - f_2)^2.$$
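For this criterion, the split objective for every threshold on one feature can be evaluated in a single pass after sorting, by keeping running sums of Y and Y^2. The sketch below (illustrative names, not the lecture's code) anticipates the efficiency note on the next slide.

```python
# Least squares split on a single feature x: after sorting, each prefix of the
# data is a candidate S1 and the objective is computable from running sums.
import numpy as np

def best_threshold_ls(x, y):
    """Return (theta, objective) minimizing sum_{S1}(Y - f1)^2 + sum_{S2}(Y - f2)^2
    over splits of the form "x < theta"."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    tot, tot2 = y.sum(), (y ** 2).sum()
    s = s2 = 0.0
    best = (None, np.inf)
    for k in range(1, n):                  # S1 = first k sorted points, S2 = the rest
        s += y[k - 1]
        s2 += y[k - 1] ** 2
        if x[k] == x[k - 1]:               # a threshold must separate distinct values
            continue
        left = s2 - s ** 2 / k             # = sum_{S1} (Y - f1)^2
        right = (tot2 - s2) - (tot - s) ** 2 / (n - k)
        if left + right < best[1]:
            best = (float(x[k]), left + right)
    return best
```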

Some Issues in Tree Growing

Testing all features and thresholds:
- thresholds are aligned with the observed values in the training data
- for efficiency, sort the feature values
- may not need to test all values

Stopping criteria (see the best-first sketch below):
- Depth-first: stop when a certain depth is reached.
- Best-first: each time, split the node with the best loss reduction, until a fixed number of nodes is reached.

Numerical versus categorical attributes:
- numerical: ordered
- categorical: unordered; each split can partition the values into arbitrary subsets

Missing data:
- treat "missing" as an extra value
- use a zero value
- imputation, assuming the values are missing at random
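The best-first strategy can be sketched with a priority queue keyed by loss reduction. This is an illustrative sketch only: it reuses the best_split sketch from earlier as find_split, and the helper names are assumptions, not the lecture's implementation.

```python
# Best-first tree growing: always expand the leaf whose best split reduces the
# loss the most, and stop once a fixed number of leaves is reached.
import heapq
import numpy as np

class Leaf:
    def __init__(self, idx):
        self.idx = idx                       # indices of training points reaching this node
        self.feature = self.threshold = None
        self.true_branch = self.false_branch = None
        self.score = 0.0

def grow_best_first(X, y, find_split, max_leaves=8):
    # find_split(X_sub, y_sub) -> (j, theta, loss_reduction) or None
    root = Leaf(np.arange(len(y)))
    heap, leaves, tick = [], [root], 0

    def push(leaf):
        nonlocal tick
        split = find_split(X[leaf.idx], y[leaf.idx])
        if split is not None:
            tick += 1                        # unique tie-breaker for the heap
            heapq.heappush(heap, (-split[2], tick, leaf, split))

    push(root)
    while heap and len(leaves) < max_leaves:
        _, _, leaf, (j, theta, _) = heapq.heappop(heap)
        mask = X[leaf.idx, j] < theta
        leaf.feature, leaf.threshold = j, theta
        leaf.true_branch, leaf.false_branch = Leaf(leaf.idx[mask]), Leaf(leaf.idx[~mask])
        leaves.remove(leaf)
        leaves += [leaf.true_branch, leaf.false_branch]
        push(leaf.true_branch)
        push(leaf.false_branch)
    for leaf in leaves:                      # constant (mean) prediction at each leaf
        leaf.score = float(y[leaf.idx].mean())
    return root
```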

Large Scale Implementation

Parallelization: in general, each processor handles either a subset of the features (attributes) or a subset of the training data.
- It is relatively easy to parallelize tree growing efficiently over partitions of attributes.
- It is more complex to parallelize tree growing efficiently over partitions of data.

Remarks on Decision Tree

Advantages:
- interpretable
- handles non-homogeneous features easily
- finds non-linear interactions

Disadvantage:
- usually not the most accurate classifier by itself

Improving Decision Tree Performance

Improve accuracy through tree-ensemble (forest) learning:
- Bagging:
  - generate bootstrap samples
  - train one tree per bootstrap sample
  - take an equally weighted average of the trees
- Random forest: bagging with additional randomization
- Boosting: direct forest learning

Forest: Ensemble of Trees

[Figure: a forest of decision trees T1, ..., T4 sharing a common root.]

- Forest: multiple decision trees T_1, ..., T_K with a common root.
- The output is the (weighted) sum of the tree outputs.
- Main question: how to generate the ensemble (multiple trees)?

Bagging and Random Forest

Bagging:
- build each tree g_i from a bootstrapped training sample (sample the training data with replacement)
- form the voted classifier f = (1/m) ∑_{i=1}^m g_i

Random forest:
- generate bootstrap samples
- build one tree per bootstrap sample
- increase diversity via additional randomization: randomly pick a subset of features to split on at each node
- take an equally weighted average of the trees
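A hedged sketch of both procedures (assuming scikit-learn is available for the base trees; binary labels and all names are illustrative). The random per-split feature subsets are what distinguish the random forest from plain bagging.

```python
# Bagging / random forest sketch: bootstrap samples, one tree per sample,
# equally weighted average of the per-tree probability scores.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, random_feature_subsets=True, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap: sample with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt" if random_feature_subsets else None)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # equally weighted average of per-tree scores (binary classification assumed)
    return np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
```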

Why Bagging and RF Work

- Trees are unstable, so different randomized trees can be very different.
- Each tree may be grown deep, overfitting the data on its own.
- Equally weighted averaging reduces the variance.
- Related to Bayesian model averaging over all possible trees.

Boosting

An ensemble learning algorithm. Given a learning algorithm A:
- how to generate the ensemble candidates?
- how to combine the generated ensemble candidates?

Invoke A with multiple samples (similar to bagging):
- goal: find the optimal ensemble by minimizing a loss function
- learning method: greedy, stage-wise optimization, invoking a base learner (weak learner) A, with adaptive resampling

Bias reduction:
- less stable but more expressive
- better than any single classifier

Why Boosted Trees

Build shallow trees and combine the shallow trees (weak learners) to get a strong learner.

A linear model over high-order features: automatically finds high-order interaction features g_j(·),
$$f(x) = \sum_j w_j \underbrace{g_j(x)}_{\text{nonlinear in } x}$$
- automatically handles heterogeneous features
- the high-order features are indicator functions

Alternatives:
- discretize each feature into (possibly overlapping) buckets
- direct construction of feature combinations
- nonlinear functions such as kernels or neural networks
- direct greedy learning

Weak Learning and Adaptive Resampling

A: a weak learner (e.g. a shallow tree) that does better than chance (0.5 error) on any (reweighted) training data.

Question: can we combine weak learners to obtain a strong learner?
Answer: yes, through adaptive resampling (boosting).
- idea: overweight difficult examples that are hard to classify

Compare with bagging: sampling without overweighting errors.
Compare with outlier removal: underweighting errors.
- reduces variance (but may increase bias)

The Idea of Adaptive Resampling

- Reweight the training data to overweight difficult examples.
- Use the weak learner A to obtain classifiers g_j on the reweighted samples.
- Add the new classifier to the ensemble and choose its weight w_j.
- Iterate.
- The final classifier is ∑_j w_j g_j.

AdaBoost (Adaptive Boosting)

How to reweight, and how to compute w? Assume binary classification with y ∈ {±1} and f ∈ {±1}.

AdaBoost:
- initialize sample weights {d_i} = {1/n} for {(X_i, Y_i)}
- for j = 1, ..., J:
  - call the weak learner to obtain g_j using the sample weighted by {d_i}
  - let r_j = ∑_i d_i g_j(X_i) Y_i
  - let w_j = 0.5 ln((1 + r_j)/(1 − r_j))
  - update d_i ∝ d_i exp(−w_j g_j(X_i) Y_i)
- let f_J(x) = ∑_{j=1}^J w_j g_j(x)
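The update translates almost line for line into code. A minimal sketch, assuming scikit-learn decision stumps as the weak learner (the variable names mirror the slide, not any particular library's API):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, Y, J=50):
    """Y in {-1, +1}. Returns the list of (w_j, g_j) defining f_J."""
    n = len(Y)
    d = np.full(n, 1.0 / n)                        # sample weights d_i
    ensemble = []
    for _ in range(J):
        g = DecisionTreeClassifier(max_depth=1).fit(X, Y, sample_weight=d)
        pred = g.predict(X)                        # predictions in {-1, +1}
        r = float(np.sum(d * pred * Y))            # r_j = sum_i d_i g_j(X_i) Y_i
        r = float(np.clip(r, -1 + 1e-12, 1 - 1e-12))   # guard against a perfect weak learner
        w = 0.5 * np.log((1 + r) / (1 - r))        # w_j
        ensemble.append((w, g))
        d *= np.exp(-w * pred * Y)                 # overweight misclassified examples
        d /= d.sum()                               # renormalize the weights
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(w * g.predict(X) for w, g in ensemble))   # sign of f_J(x)
```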

Some Theoretical Results about AdaBoost

Convergence: AdaBoost reduces the margin error.
- f correctly classifies X_i with margin γ if f(X_i) Y_i > γ > 0.
- If each weak learner g_j does better than 0.5 − δ_j (δ_j > 0) on the reweighted samples with respect to the classification error I(f(X_i) Y_i ≤ 0), then
$$\underbrace{\frac{1}{n} \sum_{i=1}^n I(f_J(X_i) Y_i \le \gamma)}_{\text{margin error}} \le \exp\Big(\gamma - 2 \sum_{j=1}^J \delta_j^2\Big).$$

Generalization: smaller margin error implies good generalization performance.

For linearly separable problems, AdaBoost does not usually maximize the margin: this is different from SVM.

From Adaptive Resampling to Greedy Boosting

Weak learner: picks g_j from a hypothesis space H_j to minimize a certain error criterion.

Goal: find w_j ≥ 0 and g_j ∈ H_j to minimize the loss
$$[\{w_j, g_j\}] = \arg\min_{\{w_j \ge 0,\, g_j \in H_j\}} \sum_{i=1}^n \phi\Big(\sum_j w_j g_j(X_i),\; Y_i\Big). \qquad (*)$$

Idea: greedy optimization.
- At stage j, fix (w_k, g_k) for k < j, and find (w_j, g_j) to minimize the loss (∗).

AdaBoost as Greedy Boosting

Loss: φ(f, y) = exp(−f y).

Goal: use greedy boosting to minimize
$$[\{w_j, g_j\}] = \arg\min_{\{w_j \ge 0,\, g_j \in H_j\}} \sum_{i=1}^n e^{-\sum_j w_j g_j(X_i) Y_i}.$$

Greedy optimization: at stage j, let d_i ∝ exp(−∑_{k=1}^{j−1} w_k g_k(X_i) Y_i) and solve
$$[w_j, g_j] = \arg\min_{w_j \ge 0,\, g_j \in H_j} \sum_{i=1}^n d_i\, e^{-w_j g_j(X_i) Y_i}.$$

It can be shown that the solution is exactly the AdaBoost update.
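To see why, here is the one-step calculation for w_j with g_j fixed (a sketch, assuming the weights d_i are normalized to sum to one and that g_j(X_i) Y_i ∈ {±1}, so the correctly and incorrectly classified examples carry total weight (1 + r_j)/2 and (1 − r_j)/2 with r_j = ∑_i d_i g_j(X_i) Y_i):

$$\sum_{i=1}^n d_i\, e^{-w_j g_j(X_i) Y_i} = \frac{1+r_j}{2}\, e^{-w_j} + \frac{1-r_j}{2}\, e^{w_j},$$

and setting the derivative with respect to w_j to zero gives

$$e^{2 w_j} = \frac{1+r_j}{1-r_j} \quad\Longrightarrow\quad w_j = \frac{1}{2} \ln\frac{1+r_j}{1-r_j},$$

which is exactly the weight used in the AdaBoost slide, and the updated d_i ∝ d_i exp(−w_j g_j(X_i) Y_i) is the corresponding reweighting.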

General Loss Function

Learn a prediction function f(x) by solving the learning formulation
$$f = \arg\min_{f \in H} L(f).$$

L(f) is a complex loss function of the form
$$L(f) = \frac{1}{n} \sum_{i=1}^n \phi_i\big(f(x_{i,1}), \dots, f(x_{i,m_i}),\, y_i\big).$$

Greedy algorithm (a generalization of AdaBoost):
- (s_k, g_k) = arg min_{g ∈ C, s ∈ R} L(f_k + s g)
- f_{k+1} ← f_k + s'_k g_k (the step size s'_k actually used may not equal s_k)

However, this greedy weak learner is specialized and hard to implement; can we simplify?

Boosting with a Regression Base Learner

Simplified weak learner: a nonlinear regression base learner A.
- input: X = [x_1, ..., x_k], residuals R = [r_1, ..., r_k]
- output: a nonlinear function g = A(X, R) ∈ C (e.g. a decision tree) such that
$$\sum_{j=1}^k (g(x_j) - r_j)^2 \approx \min_{g \in C} \sum_{j=1}^k (g(x_j) - r_j)^2.$$

Question: can we use A to optimize complex loss functions L(·)?
Answer: yes, via functional gradient boosting (Friedman, 2001):
- based on a functional generalization of gradient descent
- a generalization of AdaBoost

Gradient Boosting Algorithm

1: f_0(x) = 0
2: for t = 1 to T do
3:   r_t = ∂L(f, Y)/∂f |_{f = f_{t−1}(X)}
4:   g_t = A(X, r_t)   // call the base learner: g_t ≈ arg min_{g ∈ C} ||g(X) − r_t||_2^2
5:   β_t = arg min_β L(f_{t−1}(X) + β · g_t(X), Y)
6:   f_t(x) = f_{t−1}(x) + s_t · β_t g_t(x)
7: end for
8: Return f_T(x)

- s_t = s is the shrinkage parameter; convergence requires s ≈ 0.
- This is a functional generalization of gradient descent: f_t ← f_{t−1} − s_t ∂L(f_t)/∂f_t.
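As one concrete instance of this loop (a hedged sketch, not the lecture's code), the following uses the logistic loss L(f) = (1/n) ∑_i log(1 + exp(−Y_i f(X_i))) with shallow scikit-learn regression trees as the base learner A; the coarse grid line search for β_t and all names are simplifications.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, Y, T=100, s=0.1, max_depth=3):
    """Y in {-1, +1}. Gradient boosting for the logistic loss."""
    f = np.zeros(len(Y))                              # f_0(X) = 0
    trees = []
    for _ in range(T):
        r = Y / (1.0 + np.exp(Y * f))                 # negative functional gradient at f_{t-1}
        g = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)   # g_t = A(X, r_t)
        gx = g.predict(X)
        # crude grid line search for beta_t (exact minimization is a 1-d convex problem)
        betas = np.linspace(0.1, 5.0, 25)
        losses = [np.log1p(np.exp(-Y * (f + b * gx))).mean() for b in betas]
        beta = float(betas[int(np.argmin(losses))])
        f += s * beta * gx                            # f_t = f_{t-1} + s * beta_t * g_t
        trees.append((s * beta, g))
    return trees

def predict(trees, X):
    return sum(c * g.predict(X) for c, g in trees)    # f_T(x)
```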

A Simple Implementation for Least Squares Regression

1: f_0(x) = 0
2: for t = 1 to T do
3:   r_t = Y − f_{t−1}(X)
4:   g_t = A(X, r_t)
5:   f_t(x) = f_{t−1}(x) + s_t · g_t(x)
6: end for
7: Return f_T(x)

Parameters: the number of trees, the shrinkage parameter, and the tree size.
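In this least squares case the loop reduces to fitting trees to plain residuals. A short sketch (scikit-learn assumed, names illustrative) whose three knobs correspond to the parameters listed above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ls_boost(X, Y, T=100, s=0.1, max_depth=3):
    """T = number of trees, s = shrinkage, max_depth controls the tree size."""
    f, trees = np.zeros(len(Y)), []
    for _ in range(T):
        r = Y - f                                                  # residuals r_t = Y - f_{t-1}(X)
        g = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)   # g_t = A(X, r_t)
        f += s * g.predict(X)                                      # f_t = f_{t-1} + s * g_t
        trees.append(g)
    return lambda X_new: s * sum(g.predict(X_new) for g in trees)  # the fitted f_T
```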

Other Approaches to Building a Forest

[Figure: a forest of decision trees T1, ..., T4 sharing a common root.]

- Boosting is analogous to depth-first search.
- One can instead employ best-first search with regularization to achieve somewhat better results (regularized greedy forest, RGF).

Software Resources

- gbm, by Greg Ridgeway: a popular implementation of gradient boosting in R
- pGBRT, by Kilian Weinberger: http://machinelearning.wustl.edu/pmwiki.php/Main/Pgbrt
- RGF, by Rie Johnson: http://riejohnson.com/rgf_download.html

References

- L. Breiman, J. Friedman, C. J. Stone, R. A. Olshen. Classification and Regression Trees. 1984.
- J. R. Quinlan. C4.5: Programs for Machine Learning. 1993.
- L. Breiman. Bagging predictors. Machine Learning, 1996.
- Y. Freund, R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 1997.
- J. Friedman, T. Hastie, R. Tibshirani. Additive logistic regression: a statistical view of boosting. Ann. Statist., 2000.
- J. Friedman. Greedy function approximation: a gradient boosting machine. Ann. Statist., 2001.
- L. Breiman. Random forests. Machine Learning, 2001.
- R. Johnson, T. Zhang. Learning Nonlinear Functions Using Regularized Greedy Forest. 2011.