# Applied Machine Learning: Decision Trees


Siamak Ravanbakhsh

COMP 551 (Winter 2020)

## Learning objectives

Decision trees:

- the model
- the cost function and how it is optimized
- how to grow a tree and why you should prune it!

## Adaptive bases

So far we have assumed a fixed set of bases in $f(x) = \sum_d w_d \phi_d(x)$.

Several methods can be classified as learning these bases adaptively:

- decision trees
- generalized additive models
- boosting
- neural networks

These use $f(x) = \sum_d w_d \phi_d(x; v_d)$, where each basis has its own parameters $v_d$.
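As a bridge to what follows, the decision-tree case of this formula can be written with region indicators as bases; a sketch under the notation above:

```latex
% Decision trees as adaptive bases: each basis is a region indicator,
% and the parameters v_k are the splits (variables and thresholds)
% that carve out the region R_k.
f(x) = \sum_k w_k \, \phi_k(x; v_k), \qquad \phi_k(x; v_k) = \mathbb{I}(x \in R_k)
```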

## Decision trees: motivation

Pros:

- decision trees are interpretable!
- they are not very sensitive to outliers
- they do not need data normalization

Cons:

- they can easily overfit, and they are unstable (remedies: pruning, random forests)

image credit: https://mymodernmet.com/the-30second-rule-a-decision/

## Decision trees: idea

Divide the input space into regions and learn one function per region:

$f(x) = \sum_k w_k \, \mathbb{I}(x \in R_k)$

The regions are learned adaptively; more sophisticated prediction per region is also possible (e.g., one linear model per region).

Regions are split successively based on the value of a single variable; each such split is called a test.

[figure: a tree of tests on $x_1$ and $x_2$ partitioning the input space into regions with constant predictions $w_1, \ldots, w_5$]

Each region is a set of conditions, e.g., $R_2 = \{x_1 \leq t_1, x_2 \leq t_4\}$.
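A minimal sketch of this predictor in Python, assuming regions are encoded as lists of (feature, threshold, direction) conditions; all names and values here are illustrative, not from the lecture:

```python
import numpy as np

# Region-based predictor f(x) = sum_k w_k * I(x in R_k).
# Each region is a list of (feature index, threshold, direction) conditions.
regions = [
    [(0, 0.5, "<="), (1, 0.3, "<=")],  # R_1: x_0 <= 0.5 and x_1 <= 0.3
    [(0, 0.5, "<="), (1, 0.3, ">")],   # R_2: x_0 <= 0.5 and x_1 > 0.3
    [(0, 0.5, ">")],                   # R_3: x_0 > 0.5
]
w = np.array([1.0, -2.0, 0.5])         # one constant prediction per region

def in_region(x, conditions):
    return all(x[d] <= t if op == "<=" else x[d] > t
               for d, t, op in conditions)

def predict(x):
    # exactly one region contains x, since the regions partition the space
    for k, conditions in enumerate(regions):
        if in_region(x, conditions):
            return w[k]

print(predict(np.array([0.2, 0.9])))   # lands in R_2, so prints -2.0
```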

## Prediction per region

Suppose we have identified the regions $R_k$. What constant $w_k$ should we use for prediction in each region?

For regression, use the mean value of the training data points in that region:

$w_k = \mathrm{mean}(y^{(n)} \mid x^{(n)} \in R_k)$

For classification, count the frequency of classes per region, then predict the most frequent label or return a probability:

$w_k = \mathrm{mode}(y^{(n)} \mid x^{(n)} \in R_k)$

[figure: example decision tree for predicting survival on the Titanic; each node shows the most frequent label, the frequency of survival, and the percentage of training data in that region]
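A minimal sketch of computing these per-region constants, assuming region membership is given as boolean masks over the training set (the encoding and example data are assumptions, not lecture code):

```python
import numpy as np

# w_k = mean(y | x in R_k) for regression;
# w_k = mode(y | x in R_k) for classification (integer labels assumed).
def region_constants(y, masks, task="regression"):
    w = []
    for mask in masks:  # one boolean mask of training points per region
        if task == "regression":
            w.append(y[mask].mean())                 # mean of the region
        else:
            w.append(np.bincount(y[mask]).argmax())  # most frequent label
    return np.array(w)

y = np.array([0, 1, 1, 0, 1])                        # integer class labels
masks = [np.array([True, True, False, False, False]),
         np.array([False, False, True, True, True])]
print(region_constants(y, masks, task="classification"))  # [0 1]
```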

## Feature types

Given a feature, what are the possible tests?

Continuous features (e.g., age, height, GDP): all the values that appear in the dataset can be used to split, giving one set of possible splits $S_d = \{s_{d,n} = x_d^{(n)}\}$ for each feature $d$. Each split asks: $x_d > s_{d,n}$?

Ordinal features (e.g., grade, rating): $x_d \in \{1, \ldots, C\}$. We can split at any value, so $S_d = \{s_{d,1} = 1, \ldots, s_{d,C} = C\}$. Each split asks: $x_d > s_{d,c}$?

Categorical features (types, classes, and categories): a multi-way split asks $x_d = c$? for every category $c$ at once. The problem is that this can lead to sparse subsets; such data fragmentation means some splits may have few or no datapoints.

Instead of a multi-way split on $x_d \in \{1, \ldots, C\}$, we can use $C$ binary features (one-hot encoding), $x_{d,1} \in \{0, 1\}, \ldots, x_{d,C} \in \{0, 1\}$, and binary splits such as $x_{d,2} = 0$ vs. $x_{d,2} = 1$.

An alternative is binary splits that produce balanced subsets.
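A minimal sketch of enumerating candidate splits per feature type; the type labels and example data are illustrative, not from the lecture:

```python
import numpy as np

def candidate_splits(x_d, kind):
    if kind == "continuous":
        # every value appearing in the dataset is a potential threshold s_{d,n}
        return np.unique(x_d)
    if kind == "ordinal":
        # x_d in {1, ..., C}: the C values themselves are the thresholds s_{d,c}
        return np.arange(1, int(x_d.max()) + 1)
    if kind == "categorical":
        # one-hot view: each category c yields one binary test x_{d,c} = 0 or 1
        return [f"x_d == {c}?" for c in np.unique(x_d)]

print(candidate_splits(np.array([22.0, 35.0, 35.0, 58.0]), "continuous"))
print(candidate_splits(np.array([1, 3, 2, 3]), "ordinal"))
print(candidate_splits(np.array(["red", "blue", "red"]), "categorical"))
```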

## Cost function

Objective: find a decision tree minimizing the cost function.

For predicting a constant $w_k \in \mathbb{R}$ per region, the regression cost per region is the mean squared error (MSE):

$\mathrm{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} (y^{(n)} - w_k)^2$

where $N_k$ is the number of instances in region $k$.
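A minimal sketch of this per-region cost in Python; the example labels are illustrative:

```python
import numpy as np

# cost(R_k, D) = (1/N_k) * sum_{x^(n) in R_k} (y^(n) - w_k)^2
def region_cost(y_in_region):
    w_k = y_in_region.mean()   # the constant minimizing the MSE is the mean
    return np.mean((y_in_region - w_k) ** 2)

print(region_cost(np.array([3.0, 5.0, 4.0])))  # variance around the region mean
```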