Applied Machine Learning: Decision Trees
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives

decision trees:
- the model
- the cost function
- how it is optimized
- how to grow a tree and why you should prune it!
Adaptive bases

so far we have assumed a fixed set of bases in f(x) = Σ_d w_d φ_d(x)

several methods can be classified as learning these bases adaptively:
- decision trees
- generalized additive models
- boosting
- neural networks

f(x) = Σ_d w_d φ_d(x; v_d)    where each basis has its own parameters v_d
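To make the distinction concrete, here is a minimal Python sketch of the two forms above; the function names and toy bases are illustrative, not from the lecture:

```python
import numpy as np

def f_fixed(x, w, phis):
    # fixed bases: the phi_d are chosen up front, only the weights w_d are learned
    return sum(w_d * phi(x) for w_d, phi in zip(w, phis))

def f_adaptive(x, w, phis, vs):
    # adaptive bases: each phi_d also has its own learned parameters v_d
    return sum(w_d * phi(x, v_d) for w_d, phi, v_d in zip(w, phis, vs))

# usage with toy bases: a polynomial basis vs. a parameterized sigmoid basis
phis_fixed = [lambda x: 1.0, lambda x: x, lambda x: x**2]
phis_adapt = [lambda x, v: 1.0 / (1.0 + np.exp(-v * x))] * 3
y1 = f_fixed(0.5, [1.0, 2.0, 3.0], phis_fixed)
y2 = f_adaptive(0.5, [1.0, 2.0, 3.0], phis_adapt, [0.1, 1.0, 10.0])
```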
Decision trees: motivation

image credit: https://mymodernmet.com/the-30second-rule-a-decision/

pros:
- decision trees are interpretable!
- they are not very sensitive to outliers
- they do not need data normalization

cons:
- they can easily overfit (addressed by pruning)
- they are unstable (addressed by random forests)
Decision trees: idea

divide the input space into regions and learn one function per region:

f(x) = Σ_k w_k I(x ∈ R_k)

- the regions are learned adaptively
- more sophisticated prediction per region is also possible (e.g., one linear model per region)
- regions are split successively based on the value of a single variable, called a test

[figure: the input space over x_1 and x_2 is partitioned into regions with constant predictions w_1, ..., w_5]

each region is a set of conditions, e.g. R_2 = {x_1 ≤ t_1, x_2 ≤ t_4}
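A minimal Python sketch of this piecewise-constant model, with hand-picked regions and weights (both illustrative, not from the lecture):

```python
import numpy as np

# regions as axis-aligned boxes {feature index: (low, high]}; together they
# partition the 2D input space, in the spirit of the figure above
regions = [
    {0: (-np.inf, 1.0)},                     # R_1: x_0 <= t_1
    {0: (1.0, np.inf), 1: (-np.inf, 2.0)},   # R_2: x_0 > t_1 and x_1 <= t_2
    {0: (1.0, np.inf), 1: (2.0, np.inf)},    # R_3: x_0 > t_1 and x_1 > t_2
]
w = [0.5, 3.0, 1.2]  # one constant prediction per region

def in_region(x, region):
    return all(low < x[d] <= high for d, (low, high) in region.items())

def f(x):
    # f(x) = sum_k w_k * I(x in R_k); exactly one indicator fires
    return sum(w_k for w_k, R_k in zip(w, regions) if in_region(x, R_k))

print(f([0.3, 5.0]))  # falls in R_1 -> 0.5
```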
Prediction per region

what constant w_k should we use for prediction in each region?
suppose we have already identified the regions R_k

for regression: use the mean value of the training data-points in that region
w_k = mean(y^(n) | x^(n) ∈ R_k)

for classification: count the frequency of classes per region and predict the most frequent label (or return the class probabilities)
w_k = mode(y^(n) | x^(n) ∈ R_k)

example: predicting survival on the Titanic
[figure: a decision tree where each leaf shows the most frequent label, the frequency of survival, and the percentage of training data in that region]
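A minimal sketch of both rules in Python; `region_of` is a hypothetical function mapping a point to the index of its region:

```python
import numpy as np

def region_constants(X, y, region_of, K, classification=False):
    """Compute w_k for each of the K regions (a sketch; names illustrative)."""
    w = np.empty(K)
    for k in range(K):
        mask = np.array([region_of(x) == k for x in X])  # points in region k
        if classification:
            # mode: most frequent label (assumes integer class labels)
            w[k] = np.bincount(y[mask]).argmax()
        else:
            # mean target value of the training points in region k
            w[k] = y[mask].mean()
    return w
```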
Feature types

given a feature, what are the possible tests?

continuous features (e.g., age, height, GDP):
all the values that appear in the dataset can be used to split: S_d = {s_{d,n} = x_d^(n)}
there is one set of possible splits for each feature d, and each split asks: x_d > s_{d,n}?

ordinal features (e.g., grade, rating): x_d ∈ {1, …, C}
we can split at any value, so S_d = {s_{d,1} = 1, …, s_{d,C} = C}
each split asks: x_d > s_{d,c}?

categorical features (types, classes and categories):
multi-way split: one branch per category (x_d = c? for each c)
problem: it could lead to sparse subsets
data fragmentation: some splits may have few/no datapoints

binary split: instead of x_d ∈ {1, …, C}, assume C binary features (one-hot coding)
x_{d,1} ∈ {0, 1}, x_{d,2} ∈ {0, 1}, …, x_{d,C} ∈ {0, 1}
each split then asks about a single binary feature, e.g. x_{d,2} = 0 vs. x_{d,2} = 1
alternative: binary splits that produce balanced subsets
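A minimal sketch of enumerating candidate splits for continuous features and one-hot coding a categorical one (function names are illustrative):

```python
import numpy as np

def candidate_splits(X):
    # S_d = {s_{d,n} = x_d^(n)}: every observed value of feature d is a candidate
    return {d: np.unique(X[:, d]) for d in range(X.shape[1])}

def apply_split(X, d, s):
    # each split asks: x_d > s?
    right = X[:, d] > s
    return ~right, right  # boolean masks for the two child regions

def one_hot(x_d, C):
    # turn a categorical feature x_d in {0, ..., C-1} into C binary features
    return np.eye(C, dtype=int)[x_d]
```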
Cost function

objective: find a decision tree minimizing the cost function

regression cost, for predicting a constant w_k ∈ ℝ per region
(cost per region: the mean squared error, MSE):

cost(R_k, D) = (1/N_k) Σ_{x^(n) ∈ R_k} (y^(n) − w_k)²

where N_k is the number of instances in region k
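A direct translation of this per-region cost into Python (a sketch, using the fact from the prediction rule above that the mean minimizes the MSE):

```python
import numpy as np

def region_cost(y_k):
    # y_k: targets y^(n) of the N_k training points falling in region R_k
    w_k = y_k.mean()                  # the optimal constant for MSE is the mean
    return np.mean((y_k - w_k) ** 2)  # cost(R_k, D) = (1/N_k) * sum (y^(n) - w_k)^2
```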