Decision Tree and Boosting

Tong Zhang

Rutgers University


Learning Algorithm

Learning algorithm A:
- given training data S_n = {(X_i, Y_i)}_{i=1,...,n}
- learn a prediction rule f = A(S_n) from S_n
- goal: given an unseen X, apply the prediction f(X) to estimate Y

Linear methods: ridge regression, linear SVM, etc.
- linear prediction: learn a function f(x) = w^T x from training data
- nonlinearity achieved via nonlinear features (e.g. kernel methods); see the sketch below

Nonlinear methods: decision trees, boosted decision trees, neural networks, etc.
- learn a nonlinear prediction directly from data

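Below is a minimal sketch (not from the slides; the helper names are illustrative) contrasting the two families: a linear rule f(x) = w^T x fit by ridge regression, and nonlinearity obtained by feeding the same linear learner nonlinear features, a crude stand-in for a kernel method.

    import numpy as np

    def fit_ridge(X, y, lam=1.0):
        # Linear method: ridge regression, w = (X^T X + lam * I)^{-1} X^T y, so f(x) = w^T x.
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def quadratic_features(X):
        # Nonlinearity via nonlinear features: append squared attributes.
        return np.hstack([X, X ** 2])

    # Usage sketch: w = fit_ridge(quadratic_features(X_train), y_train)
    #               y_hat = quadratic_features(X_new) @ w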


Decision Tree

[Figure: example decision tree]
X[1] < 2
  true:  X[2] < 3
    true:  Y = 1
    false: Y = 0
  false: X[3] < 1
    true:  Y = 1
    false: X[1] < 1
      true:  Y = 1
      false: Y = 0

X = [X[1], X[2], ..., X[d]]: feature vector with attributes X[j]
Y: predicted class label
Each node: compare a feature value to a threshold (see the sketch below)

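As a minimal sketch (not from the slides), the pictured tree reads as nested feature/threshold comparisons; indexing follows the slide's 1-based X[1], X[2], X[3]:

    def predict(x):
        # Evaluate the example tree above; x[1], x[2], x[3] are the attributes.
        if x[1] < 2:
            return 1 if x[2] < 3 else 0
        if x[3] < 1:
            return 1
        return 1 if x[1] < 1 else 0

    # Usage: predict([None, 1.5, 2.0, 0.5]) -> 1   (pad index 0 to keep 1-based access)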

Decision Tree

Partition the data into segments along the paths to leaf-nodes
Follow a branch at each node through a test: is an attribute value < a threshold?

Constant prediction at each node; for classification (a tiny sketch follows this slide),

    probability score = (number of in-class documents reaching the node) / (number of documents reaching the node)

Decision tree learning is a two-stage process:
- tree growing: recursively search for the (attribute, threshold) pair that reduces the error
- tree pruning: remove deep tree nodes to avoid overfitting

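A tiny sketch (not from the slides) of the constant classification prediction at a node, i.e. the probability score above:

    def leaf_score(labels_at_node, target_class=1):
        # probability score = (# in-class documents at the node) / (# documents at the node)
        return sum(1 for y in labels_at_node if y == target_class) / len(labels_at_node)

    # Usage: leaf_score([1, 1, 0, 1]) -> 0.75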

Tree Growing Illustration

[Figure: tree growing in five stages]
Stage 1: a single leaf, Y = 0
Stage 2: split the root on X[1] < 2 (true: Y = 0, false: Y = 1)
Stage 3: split the false branch on X[3] < 1 (true: Y = 1, false: Y = 0)
Stage 4: split the true branch of the root on X[2] < 3 (true: Y = 1, false: Y = 0)
Stage 5: split the false branch of X[3] < 1 on X[1] < 1 (true: Y = 1, false: Y = 0)


Tree Growing

Training data (X_i, Y_i), i = 1, ..., n
Given a loss function L(f, y) (e.g. (f - y)^2)
Recursively do the following (see the sketch below):
- at each leaf-node, let S be the training data reaching it
- each node has a constant prediction, with optimal loss

    min_f sum_{i in S} L(f, Y_i)

- for each potential (attribute, threshold) split (j, t): partition S into S_1(j, t) and S_2(j, t); the optimal loss with this partition is

    min_{f_1, f_2} [ sum_{i in S_1} L(f_1, Y_i) + sum_{i in S_2} L(f_2, Y_i) ]

- for each leaf node: grow the tree using the (j, t) that reduces the loss the most

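A minimal sketch of the greedy split search above (not from the slides; it assumes NumPy and uses squared error as the concrete per-node loss, but any loss with a cheap per-node minimizer fits the same loop):

    import numpy as np

    def leaf_loss(labels):
        # Optimal constant-prediction loss at a node: min_f sum_i L(f, Y_i),
        # which for squared error is the sum of squared deviations from the mean.
        return ((labels - labels.mean()) ** 2).sum()

    def best_split(X, y):
        # Search over (attribute j, threshold t); keep the split that lowers the loss most.
        parent, best = leaf_loss(y), None
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j])[1:]:          # candidate thresholds
                left, right = y[X[:, j] < t], y[X[:, j] >= t]
                loss = leaf_loss(left) + leaf_loss(right)
                if loss < parent and (best is None or loss < best[0]):
                    best = (loss, j, t)
        return best                                    # None means no improving split

Growing the tree is then just a matter of repeatedly applying this search at each leaf and splitting the leaf whose best split reduces the loss the most.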

Example Loss Criteria

Least squares loss: L(f, Y) = (f - Y)^2

With partition S, the optimal prediction is

    f = (1/|S|) sum_{i in S} Y_i

Split into S_1 and S_2:

    f_1 = (1/|S_1|) sum_{i in S_1} Y_i,   f_2 = (1/|S_2|) sum_{i in S_2} Y_i

The objective function to minimize is

    sum_{i in S_1} (Y_i - f_1)^2 + sum_{i in S_2} (Y_i - f_2)^2

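A one-line check (not on the slide) of why the per-node mean is optimal: setting the derivative of the node objective with respect to the constant prediction f to zero gives

    d/df sum_{i in S} (Y_i - f)^2 = -2 sum_{i in S} (Y_i - f) = 0   =>   f = (1/|S|) sum_{i in S} Y_i

and applying the same argument to S_1 and S_2 gives f_1 and f_2 above.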

Some Issues in Tree Growing

Test all features and thresholds (see the sorted-sweep sketch after this slide):
- thresholds aligned with observed values in the training data
- for efficiency, the feature values should be sorted
- may not need to test all values

Stopping criteria:
- depth-first: a certain depth is reached
- best-first: each time split the node with the best loss reduction, until a fixed number of nodes is reached

Numerical versus categorical attributes:
- numerical: ordered
- categorical: unordered; each split can partition the values into arbitrary subsets

Missing data:
- treat "missing" as an extra value
- use a zero value
- imputation, assuming the data are missing at random

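A minimal sketch (not from the slides; assumes NumPy and squared-error loss) of the sort-based trick for scanning one numerical feature: after sorting, running sums score every threshold in O(1), using sum((y - mean)^2) = sum(y^2) - (sum y)^2 / n, and only midpoints between distinct consecutive values need testing.

    import numpy as np

    def best_threshold_sorted(x, y):
        # Score every threshold of one feature after a single sort, via running sums.
        order = np.argsort(x)
        xs, ys = x[order], y[order]
        csum, csq = np.cumsum(ys), np.cumsum(ys ** 2)
        n, total_sum, total_sq = len(ys), csum[-1], csq[-1]
        best_loss, best_t = np.inf, None
        for k in range(1, n):                     # left = first k sorted points
            if xs[k] == xs[k - 1]:                # cannot split between equal values
                continue
            left_loss = csq[k - 1] - csum[k - 1] ** 2 / k
            right_sum = total_sum - csum[k - 1]
            right_loss = (total_sq - csq[k - 1]) - right_sum ** 2 / (n - k)
            if left_loss + right_loss < best_loss:
                best_loss, best_t = left_loss + right_loss, (xs[k - 1] + xs[k]) / 2
        return best_loss, best_t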

Large Scale Implementation

Parallelization: in general, each processor handles either a subset of features (attributes) or a subset of training data
- it is relatively easy to parallelize tree growing efficiently over partitions of attributes (see the sketch below)
- it is more complex to parallelize tree growing efficiently over partitions of data

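A minimal sketch (not from the slides) of the attribute-partitioned variant: each worker scores its own subset of features and the final reduction is a single min. A thread pool stands in for the processors purely for illustration; real systems use separate processes or machines, and feature_best is a hypothetical naive per-feature scorer (the sorted sweep above would work equally well).

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def feature_best(X, y, j):
        # Best squared-error split of feature j (naive scan over observed values).
        best = (np.inf, j, None)
        for t in np.unique(X[:, j])[1:]:
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best[0]:
                best = (loss, j, t)
        return best

    def parallel_best_split(X, y, n_workers=4):
        # Each worker handles a subset of attributes; reduce with one min over features.
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            per_feature = pool.map(lambda j: feature_best(X, y, j), range(X.shape[1]))
        return min(per_feature, key=lambda r: r[0])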

Remarks on Decision Tree

Advantages:
- interpretable
- handles non-homogeneous features easily
- finds non-linear interactions

Disadvantage:
- usually not the most accurate classifier by itself


Improving Decision Tree Performance

Improve accuracy through tree ensemble (forest) learning:
- bagging:
  - generate bootstrap samples
  - train one tree per bootstrap sample
  - take an equally weighted average of the trees
- random forest: bagging with additional randomization
- boosting: direct forest learning


Forest: Ensemble of Trees

[Figure: a forest of trees T1, T2, T3, T4 sharing a common root]

Forest: multiple decision trees T_1, ..., T_K with a common root
Output is the (weighted) sum of the tree outputs.
Main question: how to generate the ensemble (multiple trees)?


Bagging and Random Forest

Bagging:
- build each tree g_i from a bootstrapped training sample (sample the training data with replacement)
- form the voted classifier: f = (1/m) sum_{i=1}^{m} g_i

Random forest (a sketch of both follows this slide):
- generate bootstrap samples
- build one tree per bootstrap sample
- increase diversity via additional randomization: randomly pick a subset of features to split at each node
- take an equally weighted average of the trees

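A minimal sketch (not from the slides) of both procedures, assuming scikit-learn's DecisionTreeRegressor as the base tree learner; passing a max_features argument adds the per-split feature subsampling that turns plain bagging into a random forest.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor  # any learner with fit/predict works

    def bagging(X, y, n_trees=100, seed=0, **tree_args):
        # One tree per bootstrap sample; predict with the equally weighted average.
        rng = np.random.default_rng(seed)
        n, trees = len(y), []
        for _ in range(n_trees):
            idx = rng.integers(0, n, size=n)        # sample with replacement
            trees.append(DecisionTreeRegressor(**tree_args).fit(X[idx], y[idx]))
        return lambda X_new: np.mean([t.predict(X_new) for t in trees], axis=0)

    # bagging(X, y)                       -> plain bagging
    # bagging(X, y, max_features="sqrt")  -> random-forest-style feature randomization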


Why Bagging and RF work

Trees are unstable, thus different random trees are very different
May generate deep trees, overfitting the data with each