Decision Trees - Applied Machine Learning - Siamak Ravanbakhsh - COMP 551 (winter 2020) - 1. decision trees: model & cost

Transcript
Page 1: Title

Applied Machine Learning: Decision Trees

Siamak Ravanbakhsh

COMP 551 (winter 2020)

Page 2: Learning objectives

decision trees:
- model
- cost function
- how it is optimized

how to grow a tree and why you should prune it!

Page 3: Adaptive bases

so far we assumed a fixed set of bases in f(x) = ∑_d w_d ϕ_d(x)

several methods can be classified as learning these bases adaptively:
- decision trees
- generalized additive models
- boosting
- neural networks

f(x) = ∑_d w_d ϕ_d(x; v_d), where each basis has its own parameters v_d


Page 6: Decision trees: motivation

pros:
- decision trees are interpretable!
- they are not very sensitive to outliers
- they do not need data normalization

cons:
- they can easily overfit and they are unstable (addressed by pruning and by random forests)

image credit: https://mymodernmet.com/the-30second-rule-a-decision/


Page 9: Decision trees: idea

divide the input space into regions and learn one function per region:

f(x) = ∑_k w_k I(x ∈ R_k)

the regions are learned adaptively; more sophisticated prediction per region is also possible (e.g., one linear model per region)

split the regions successively based on the value of a single variable, called a test

each region is a set of conditions, e.g. R_2 = {x_1 ≤ t_1, x_2 ≤ t_4}

[figure: the (x_1, x_2) input space partitioned into regions, with constant predictions w_1, w_3, w_5 shown in three of them]
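to make the model concrete, here is a minimal Python sketch of this piecewise-constant predictor (not from the slides: the thresholds t1, t4 and the weights are made-up values for illustration):

    # a minimal sketch; thresholds and weights are hypothetical
    t1, t4 = 0.5, 0.3
    w = {1: 10.0, 2: -2.0, 3: 4.0}

    def region(x):
        """Map a point x = (x1, x2) to its region index."""
        if x[0] <= t1:
            # R2 = {x1 <= t1, x2 <= t4}; R1 is its sibling region
            return 2 if x[1] <= t4 else 1
        return 3

    def f(x):
        # f(x) = sum_k w_k * I(x in R_k): exactly one indicator is 1
        return w[region(x)]

    print(f((0.2, 0.1)))  # lies in R2, so f(x) = -2.0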


Page 13: Prediction per region

suppose we have identified the regions R_k; what constant w_k should we use for prediction in each region?

for regression: use the mean value of the training data points in that region,
w_k = mean(y^(n) | x^(n) ∈ R_k)

for classification: count the frequency of classes per region and predict the most frequent label (or return the probability),
w_k = mode(y^(n) | x^(n) ∈ R_k)

example: predicting survival on the Titanic
[figure: each node of the tree is annotated with the most frequent label, the frequency of survival, and the percentage of training data in that region]
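a small sketch of both rules (the per-region groupings below are toy data; in practice they come from the learned regions R_k):

    from statistics import mean, mode

    # toy targets grouped by region; the grouping itself would come
    # from the learned regions R_k (data here is made up)
    y_regression = {1: [3.0, 5.0, 4.0], 2: [10.0, 12.0]}
    y_classes = {1: ["died", "died", "survived"], 2: ["survived"]}

    # regression: w_k = mean(y^(n) | x^(n) in R_k)
    w_reg = {k: mean(ys) for k, ys in y_regression.items()}
    # classification: w_k = mode(...); the class frequencies would
    # give the predicted probabilities
    w_cls = {k: mode(ys) for k, ys in y_classes.items()}

    print(w_reg)  # {1: 4.0, 2: 11.0}
    print(w_cls)  # {1: 'died', 2: 'survived'}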


Page 20: Feature types

given a feature, what are the possible tests?

continuous features - e.g., age, height, GDP
all the values that appear in the dataset can be used to split: S_d = {s_{d,n} = x_d^(n)}
one set of possible splits for each feature d; each split asks: x_d > s_{d,n}?

ordinal features - e.g., grade, rating, with x_d ∈ {1, …, C}
we can split at any value, so S_d = {s_{d,1} = 1, …, s_{d,C} = C}; each split asks: x_d > s_{d,c}?

categorical features - types, classes and categories
multi-way split (one branch per value x_d = c)
problem: it could lead to sparse subsets; data fragmentation: some splits may have few/no datapoints
binary split: assume C binary features (one-hot coding), so instead of x_d ∈ {1, …, C} we have x_{d,1} ∈ {0, 1}, x_{d,2} ∈ {0, 1}, …, x_{d,C} ∈ {0, 1}, and each split asks x_{d,c} = 0? or x_{d,c} = 1?
alternative: binary splits that produce balanced subsets
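a short sketch of the candidate splits for a continuous feature and of the one-hot recoding of a categorical one (toy data, illustrative only):

    import numpy as np

    def candidate_splits(x_d):
        # S_d = {s_{d,n} = x_d^(n)}: every observed value is a
        # candidate threshold (duplicates add nothing)
        return np.unique(x_d)

    x = np.array([3.1, 2.0, 2.0, 5.7])
    for s in candidate_splits(x):
        mask = x > s  # each split asks: x_d > s ?
        print(f"x > {s}: {mask}")

    # a categorical feature one-hot coded into C binary features,
    # each allowing the binary test x_{d,c} = 1 ?
    colors = np.array(["green", "red", "red"])
    one_hot = {c: (colors == c).astype(int) for c in np.unique(colors)}
    print(one_hot)  # green -> [1, 0, 0], red -> [0, 1, 1]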


Page 28: Cost function

objective: find a decision tree minimizing the cost function

regression cost, for predicting a constant w_k ∈ ℝ per region (mean squared error - MSE):

cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} (y^(n) − w_k)^2

where N_k is the number of instances in region k and w_k = mean(y^(n) | x^(n) ∈ R_k)

classification cost, for predicting a constant class w_k ∈ {1, …, C} per region (misclassification rate):

cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k)

where w_k = mode(y^(n) | x^(n) ∈ R_k)

the total cost in both cases is the normalized sum ∑_k (N_k / N) cost(R_k, D)

it is sometimes possible to build a tree with zero cost: build a large tree with each instance having its own region (overfitting!)

new objective: find a decision tree with K tests minimizing the cost function
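both region costs and the normalized total take only a few lines; a minimal sketch on toy data:

    import numpy as np

    def regression_cost(y):
        # MSE around the region's constant prediction w_k = mean(y)
        return float(np.mean((y - y.mean()) ** 2))

    def classification_cost(y):
        # misclassification rate of the majority-class prediction w_k
        _, counts = np.unique(y, return_counts=True)
        return 1.0 - counts.max() / len(y)

    def total_cost(regions, cost):
        # normalized sum over regions: sum_k (N_k / N) cost(R_k, D)
        N = sum(len(y) for y in regions)
        return sum(len(y) / N * cost(np.asarray(y)) for y in regions)

    print(total_cost([[1.0, 2.0], [4.0, 4.0]], regression_cost))  # 0.125
    print(total_cost([[0, 0, 1], [1, 1]], classification_cost))   # ~0.2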


Page 36: Search space

objective: find a decision tree with K tests (hence K+1 regions) minimizing the cost function; alternatively, find the smallest tree (smallest K) that classifies all examples correctly

[figure: a partition of the input space that is not produced by any decision tree]

assuming D features, how many different partitions of size K+1?

the number of full binary trees with K+1 leaves (regions R_k) is the Catalan number (1/(K+1)) (2K choose K), which is exponential in K:

1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190, 6564120420, 24466267020, 91482563640, 343059613650, 1289904147324, 4861946401452

we also have a choice of feature x_d for each of the K internal nodes (D^K combinations), and moreover, for each feature, different choices of the split s_{d,n} ∈ S_d

bottom line: finding the optimal decision tree is an NP-hard combinatorial optimization problem
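the sequence above can be checked directly (a quick sketch):

    from math import comb

    def catalan(K):
        # number of full binary trees with K internal nodes, i.e.
        # K+1 leaves: C_K = (1 / (K + 1)) * binom(2K, K)
        return comb(2 * K, K) // (K + 1)

    print([catalan(K) for K in range(10)])
    # [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862], matching the list above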


Page 39: Greedy heuristic

recursively split the regions based on a greedy choice of the next test; end the recursion if it is not worth splitting

function fit-tree(R_node, D, depth):
    if not worth-splitting(depth, R_node):
        return R_node
    else:
        R_left, R_right = greedy-test(R_node, D)
        left-set = fit-tree(R_left, D, depth+1)
        right-set = fit-tree(R_right, D, depth+1)
        return {left-set, right-set}

the final decision tree is a nested list of regions, e.g. {{R_1, R_2}, {R_3, {R_4, R_5}}}
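a runnable version of the recursion (a sketch only: worth-splitting is reduced to a depth/purity check, and a median split on the first feature stands in for greedy-test, which is spelled out on the next slide):

    import numpy as np

    def fit_tree(R, y, depth, max_depth=2):
        """R: (N, D) array of points in this node, y: integer labels."""
        # worth-splitting stub: stop at max depth or on a pure node
        if depth == max_depth or len(set(y.tolist())) <= 1:
            return int(np.bincount(y).argmax())  # leaf: predict the mode
        # stand-in for greedy-test: median split on feature 0
        d, s = 0, float(np.median(R[:, 0]))
        mask = R[:, d] >= s
        if mask.all() or not mask.any():  # degenerate split: make a leaf
            return int(np.bincount(y).argmax())
        return {"test": (d, s),
                "left": fit_tree(R[~mask], y[~mask], depth + 1, max_depth),
                "right": fit_tree(R[mask], y[mask], depth + 1, max_depth)}

    R = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])
    print(fit_tree(R, y, depth=0))
    # {'test': (0, 1.5), 'left': 0, 'right': 1}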


Page 44: Choosing tests

the split is greedy because it only looks one step ahead; this may not lead to the lowest overall cost

function greedy-test(R_node, D):
    best-cost = +inf
    for d ∈ {1, …, D} and s_{d,n} ∈ S_d:
        # creating new regions
        R_left = R_node ∪ {x_d < s_{d,n}}
        R_right = R_node ∪ {x_d ≥ s_{d,n}}
        # evaluate their cost
        split-cost = (N_left / N_node) cost(R_left, D) + (N_right / N_node) cost(R_right, D)
        if split-cost < best-cost:
            best-cost = split-cost
            R*_left, R*_right = R_left, R_right
    return R*_left, R*_right

return the split with the lowest greedy cost
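a runnable sketch of greedy-test over a data matrix, using the misclassification rate as the cost (names and data are illustrative):

    import numpy as np

    def misclass_cost(y):
        # cost(R, D) = fraction of points not matching the majority label
        _, counts = np.unique(y, return_counts=True)
        return 1.0 - counts.max() / len(y)

    def greedy_test(R, y, cost=misclass_cost):
        best_cost, best_split = np.inf, None  # minimizing: start at +inf
        N = len(y)
        for d in range(R.shape[1]):           # for each feature d
            for s in np.unique(R[:, d]):      # for each threshold s in S_d
                right = R[:, d] >= s          # R_right = {x_d >= s}
                if right.all() or not right.any():
                    continue                  # skip degenerate splits
                # split-cost = (N_l/N) cost(R_left) + (N_r/N) cost(R_right)
                split_cost = (right.sum() / N * cost(y[right])
                              + (~right).sum() / N * cost(y[~right]))
                if split_cost < best_cost:
                    best_cost, best_split = split_cost, (d, s)
        return best_split, best_cost

    R = np.array([[0.1, 5.0], [0.4, 3.0], [0.5, 4.0], [0.9, 1.0]])
    y = np.array([0, 0, 1, 1])
    print(greedy_test(R, y))  # ((0, 0.5), 0.0): the test x_0 >= 0.5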


Page 51: Stopping the recursion (the worth-splitting subroutine)

if we only stop when R_node has zero cost, we may overfit

heuristics for stopping the splitting (a sketch combining them follows below):
- reached a desired depth
- the number of examples in R_left or R_right is too small
- w_k is already a good approximation: the cost is small enough
- the reduction in cost by splitting is small:
  cost(R_node, D) − ((N_left / N_node) cost(R_left, D) + (N_right / N_node) cost(R_right, D))

image credit: https://alanjeffares.wordpress.com/tutorials/decision-tree/
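one way to put these heuristics together (all thresholds are illustrative defaults, not values from the slides):

    def worth_splitting(depth, cost_node, cost_left, cost_right,
                        n_left, n_right, max_depth=8, min_leaf=5,
                        min_cost=1e-3, min_gain=1e-3):
        """Combine the stopping heuristics above; thresholds are
        hypothetical defaults."""
        if depth >= max_depth:               # reached the desired depth
            return False
        if min(n_left, n_right) < min_leaf:  # a child would be too small
            return False
        if cost_node < min_cost:             # w_k is already good enough
            return False
        n = n_left + n_right
        gain = cost_node - (n_left / n * cost_left + n_right / n * cost_right)
        return gain >= min_gain              # is the cost reduction worth it?

    print(worth_splitting(3, 0.4, 0.1, 0.2, 50, 60))  # True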


Page 57: Revisiting the classification cost

ideally we want to optimize the 0-1 loss (misclassification rate):

cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k)

this may not be the optimal cost for each step of the greedy heuristic

example, with each node labeled (frequency of the positive label, percentage of the data):

split 1: R_node (.5, 100%) into R_left (.25, 50%) and R_right (.75, 50%)
split 2: R_node (.5, 100%) into R_left (.33, 75%) and R_right (1, 25%)

both splits have the same misclassification rate (2/8); however, the second split may be preferable because one region does not need further splitting

idea: use a measure for the homogeneity of labels in the regions


Page 62: Entropy

entropy is the expected amount of information in observing a random variable y:

H(y) = −∑_{c=1}^C p(y = c) log p(y = c)

(note that it is common to use capital letters for random variables; here, for consistency, we use lower-case)

−log p(y = c) is the amount of information in observing c:
- zero information if p(c) = 1
- less probable events are more informative: p(c) < p(c′) ⇒ −log p(c) > −log p(c′)
- information from two independent events is additive: −log(p(c) q(d)) = −log p(c) − log q(d)

a uniform distribution has the highest entropy: H(y) = −∑_{c=1}^C (1/C) log(1/C) = log C

a deterministic random variable has the lowest entropy: H(y) = −1 log(1) = 0
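a direct implementation of these two extremes (sketch, base-2 logarithm):

    import numpy as np

    def entropy(p):
        # H(y) = -sum_c p(c) log p(c); 0 log 0 is taken as 0
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    C = 4
    print(entropy([1 / C] * C))     # uniform: log2(4) = 2.0, the maximum
    print(entropy([1.0, 0, 0, 0]))  # deterministic: 0.0, the minimum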


Page 69: Mutual information

for two random variables t, y the mutual information is

I(t, y) = H(y) − H(y ∣ t)

the amount of information t conveys about y: the change in the entropy of y after observing the value of t

here H(y ∣ t) = ∑_{l=1}^L p(t = l) H(y ∣ t = l) is the conditional entropy

equivalently, I(t, y) = ∑_l ∑_c p(y = c, t = l) log [ p(y = c, t = l) / (p(y = c) p(t = l)) ]

this form is symmetric w.r.t. y and t: I(t, y) = H(t) − H(t ∣ y) = I(y, t)

it is always non-negative, and zero only if y and t are independent (try to prove these properties)
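computing I(t, y) from a joint distribution table (a sketch; the joint tables are toy examples):

    import numpy as np

    def mutual_information(joint):
        # I(t, y) = sum_{l,c} p(t=l, y=c) log [ p(t=l, y=c)
        #                                       / (p(t=l) p(y=c)) ]
        joint = np.asarray(joint, dtype=float)
        p_t = joint.sum(axis=1)  # marginal of t (rows)
        p_y = joint.sum(axis=0)  # marginal of y (columns)
        mi = 0.0
        for l in range(joint.shape[0]):
            for c in range(joint.shape[1]):
                if joint[l, c] > 0:
                    mi += joint[l, c] * np.log2(joint[l, c]
                                                / (p_t[l] * p_y[c]))
        return mi

    print(mutual_information([[0.25, 0.25],
                              [0.25, 0.25]]))  # independent: 0.0
    print(mutual_information([[0.5, 0.0],
                              [0.0, 0.5]]))    # t determines y: 1.0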


Page 78: Entropy for the classification cost

we care about the distribution of labels in each region: p_k(y = c) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) = c)

misclassification cost: cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k) = 1 − p_k(w_k), where w_k = argmax_c p_k(c) is the most probable class

entropy cost: cost(R_k, D) = H(y); choose the split with the lowest entropy

with the entropy cost, the change in the cost becomes the mutual information between the test and the labels:

cost(R_node, D) − ((N_left / N_node) cost(R_left, D) + (N_right / N_node) cost(R_right, D))
= H(y) − (p(x_d ≥ s_{d,n}) H(y ∣ x_d ≥ s_{d,n}) + p(x_d < s_{d,n}) H(y ∣ x_d < s_{d,n}))
= I(y, x_d > s_{d,n})

so we are choosing the test which is maximally informative about the labels
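this information gain is easy to compute for any boolean test (a sketch; the data and threshold are illustrative):

    import numpy as np

    def entropy_of_labels(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def information_gain(y, test):
        # H(y) - [ p(test) H(y | test) + p(not test) H(y | not test) ]
        # = I(y, test): the mutual information of the test and labels
        test = np.asarray(test, dtype=bool)
        p = test.mean()
        return (entropy_of_labels(y)
                - p * entropy_of_labels(y[test])
                - (1 - p) * entropy_of_labels(y[~test]))

    x = np.array([0.1, 0.4, 0.5, 0.9])
    y = np.array([0, 0, 1, 1])
    print(information_gain(y, x > 0.45))  # 1.0: a fully informative test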


Page 85: Entropy for the classification cost - example

same example, with each node labeled (frequency of the positive label, percentage of the data):

split 1: R_node (.5, 100%) into R_left (.25, 50%) and R_right (.75, 50%)
split 2: R_node (.5, 100%) into R_left (.33, 75%) and R_right (1, 25%)

misclassification cost: the same for both splits
split 1: (4/8)·(1/4) + (4/8)·(1/4) = 1/4
split 2: (6/8)·(1/3) + (2/8)·0 = 1/4

entropy cost (using the base-2 logarithm):
split 1: (4/8)(−(1/4) log(1/4) − (3/4) log(3/4)) + (4/8)(−(1/4) log(1/4) − (3/4) log(3/4)) ≈ .81
split 2: (6/8)(−(1/3) log(1/3) − (2/3) log(2/3)) + (2/8)·0 ≈ .69, the lower-cost split
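checking these numbers (sketch):

    import numpy as np

    def H(p):
        # binary entropy in bits; 0 log 0 is taken as 0
        q = np.array([p, 1.0 - p])
        q = q[q > 0]
        return float(-(q * np.log2(q)).sum())

    # split 1: children with positive-label rates .25 and .75, 50% each
    print(0.5 * H(0.25) + 0.5 * H(0.75))    # ~0.811
    # split 2: rates 1/3 (75% of the data) and 1.0 (25% of the data)
    print(0.75 * H(1 / 3) + 0.25 * H(1.0))  # ~0.689, the lower cost
    # the misclassification rate cannot distinguish the two splits:
    print(0.5 * 0.25 + 0.5 * 0.25,
          0.75 * (1 / 3) + 0.25 * 0.0)      # both ~0.25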


Page 90: Gini index

another cost for selecting the test in classification; so far we have:

misclassification (error) rate: cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k) = 1 − p(w_k)

entropy: cost(R_k, D) = H(y)

Gini index, the expected error rate:

cost(R_k, D) = ∑_{c=1}^C p(c)(1 − p(c))    (probability of class c times probability of error)
= ∑_{c=1}^C p(c) − ∑_{c=1}^C p(c)^2 = 1 − ∑_{c=1}^C p(c)^2

[figure: comparison of the three costs of a node as a function of p(y = 1) when we have 2 classes]
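the quantities in that comparison are easy to reproduce for the 2-class case (sketch):

    import numpy as np

    p = np.linspace(0.001, 0.999, 5)          # p(y = 1) of a node
    misclass = 1 - np.maximum(p, 1 - p)       # 1 - p(w_k)
    gini = 2 * p * (1 - p)                    # 1 - sum_c p(c)^2 for C = 2
    ent = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    for row in zip(p, misclass, gini, ent):
        print("p=%.3f  err=%.3f  gini=%.3f  H=%.3f" % row)
    # all three peak at p = 0.5 and vanish on pure nodes; entropy and
    # Gini are strictly concave, which is why they prefer the purer split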


Page 95: Example

decision tree for the Iris dataset (D = 2)

[figure: the dataset, the learned decision tree, and its decision boundaries, with the first three splits numbered 1, 2, 3]

the decision boundaries suggest overfitting, confirmed using a validation set:
training accuracy ~ 85%
(cross-)validation accuracy ~ 70%


Page 102: Decision Trees Applied Machine Learning · Applied Machine Learning Decision Trees S ia m a k R a v a n b a k h s h CO M P 5 5 1 ( w in t e r 2 0 2 0 ) 1. decision trees: model cost

Overfitting

a decision tree can fit any Boolean function (binary classification with binary features)

example: decision tree representation of a Boolean function (D = 3)
image credit: https://www.wikiwand.com/en/Binary_decision_diagram

there are $2^{2^D}$ such functions. why? each of the $2^D$ possible binary inputs can be labeled 0 or 1 independently

large decision trees have high variance and low bias (low training error, high test error)
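As a sanity check of the counting argument, a small sketch (my own, not from the slides): it enumerates all $2^{2^D}$ Boolean functions for D = 2 and verifies that an unrestricted scikit-learn tree fits each one with zero training error.

```python
from itertools import product

from sklearn.tree import DecisionTreeClassifier

D = 2
inputs = list(product([0, 1], repeat=D))               # the 2**D possible inputs
labelings = list(product([0, 1], repeat=len(inputs)))  # the 2**(2**D) functions

for labels in labelings:
    tree = DecisionTreeClassifier().fit(inputs, labels)
    assert list(tree.predict(inputs)) == list(labels)  # zero training error
print(f"a tree fit all {len(labelings)} Boolean functions over D = {D} features")
```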

idea 1. grow a small tree

problem: a substantial reduction in cost may only happen after a few more splits, and by stopping early we cannot know this

example: the cost drops only after the second node (see the sketch below)
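A concrete instance of this failure mode, sketched with scikit-learn (XOR is my choice of example; the slides do not name the function): no single split improves on chance, so a tree stopped at depth 1 looks useless, while one more level of splits fits the data exactly.

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR of the two features

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
full = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(stump.score(X, y))  # 0.5: the first split alone buys nothing
print(full.score(X, y))   # 1.0: the cost drops only after the second level
```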


Pruning

idea 2. grow a large tree and then prune it

greedily turn an internal node into a leaf node
the choice is based on the lowest increase in the cost
repeat this until only the root node is left

this produces a sequence of trees; pick the best among them using a validation set (cross-validation is used to pick the best size)

example
[figure: the same tree before pruning and after pruning]

idea 3. random forests (later!)
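A rough sketch of the grow-then-prune recipe in scikit-learn. One assumption on my part: sklearn's minimal cost-complexity pruning, a close relative of the greedy rule above, stands in for it. We grow a full tree, enumerate its pruned subtrees via cost_complexity_pruning_path, and let a held-out validation set pick the size.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# grow a large tree, then get the sequence of prunings (one per alpha)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# refit one tree per pruning level and keep the one with the best validation score
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))
```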


Summary

model: divide the input into axis-aligned regions
cost: one cost function for regression and one for classification

optimization:
exact optimization is NP-hard
use a greedy heuristic
adjust the cost for the heuristic:
using entropy (related to mutual information maximization)
using the Gini index

decision trees are unstable (have high variance); use pruning to avoid overfitting

there are variations on decision tree heuristics; what we discussed is called Classification and Regression Trees (CART)
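For reference, a minimal sketch (mine, not from the slides) of the two impurity measures named above, for a node with class proportions p. Greedy growing evaluates a candidate split by the size-weighted impurity of the children and picks the split with the largest decrease.

```python
import numpy as np

def entropy(p):
    """Entropy H(p) = -sum_k p_k log2 p_k of the class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention 0 * log 0 = 0
    return float(np.sum(p * np.log2(1.0 / p)))

def gini(p):
    """Gini index G(p) = 1 - sum_k p_k^2 of the class proportions p."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 and 0.5: most impure 2-class node
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))  # 0.0 and 0.0: a pure node
```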