
Applied Machine Learning: Decision Trees

Siamak Ravanbakhsh

COMP 551 (Winter 2020)

Learning objectives

decision trees:
- model
- cost function
- how it is optimized

how to grow a tree and why you should prune it!

Adaptive bases

so far we have assumed a fixed set of bases in f(x) = ∑_d w_d ϕ_d(x)

several methods can be viewed as learning these bases adaptively:
decision trees, generalized additive models, boosting, neural networks

f(x) = ∑_d w_d ϕ_d(x; v_d)

each basis has its own parameters v_d

Decision trees: motivation

image credit: https://mymodernmet.com/the-30second-rule-a-decision/

pros:
- decision trees are interpretable!
- they are not very sensitive to outliers
- they do not need data normalization

cons:
- they can easily overfit and they are unstable
- remedies: pruning, random forests

Decision trees: idea

divide the input space into regions and learn one function per region

f(x) = ∑_k w_k I(x ∈ R_k)

the regions are learned adaptively; more sophisticated prediction per region is also possible (e.g., one linear model per region)

split the regions successively based on the value of a single feature, called a test

each region is a set of conditions, e.g., R_2 = {x_1 ≤ t_1, x_2 ≤ t_4}

(figure: the (x_1, x_2) input space partitioned by axis-aligned splits into regions with constant predictions w_1, w_3, w_5)

Prediction per region

what constant w_k should we use for prediction in each region?

suppose we have identified the regions R_k

for regression:
use the mean value of the training data-points in that region
w_k = mean(y^(n) | x^(n) ∈ R_k)

for classification:
count the frequency of classes per region and predict the most frequent label (or return a probability)
w_k = mode(y^(n) | x^(n) ∈ R_k)

example: predicting survival on the Titanic
(figure: a decision tree where each node shows the most frequent label, the frequency of survival, and the percentage of training data in that region)
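a small Python sketch (not from the slides) of these constant leaf predictions:

```python
import numpy as np

def leaf_value_regression(y_region):
    """Mean of the targets in the region: w_k = mean(y | x in R_k)."""
    return np.mean(y_region)

def leaf_value_classification(y_region, return_proba=False):
    """Most frequent class in the region (or the class frequencies)."""
    classes, counts = np.unique(y_region, return_counts=True)
    if return_proba:
        return classes, counts / counts.sum()   # class probabilities per region
    return classes[np.argmax(counts)]           # w_k = mode(y | x in R_k)

# usage
y = np.array([0, 1, 1, 1, 0])
print(leaf_value_regression(y))       # 0.6
print(leaf_value_classification(y))   # 1
```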

Feature types

given a feature, what are the possible tests?

continuous features (e.g., age, height, GDP):
all the values that appear in the dataset can be used as split points, S_d = {s_{d,n} = x_d^(n)}
one set of possible splits for each feature d; each split asks x_d > s_{d,n}?

ordinal features (e.g., grade, rating), x_d ∈ {1, …, C}:
we can split at any value, so S_d = {s_{d,1} = 1, …, s_{d,C} = C}; each split asks x_d > s_{d,c}?

categorical features (types, classes and categories):
- multi-way split (x_d = ?, one branch per category)
  problem: it could lead to sparse subsets; data fragmentation: some splits may have few or no datapoints
- binary split: assume C binary features (one-hot coding), so instead of x_d ∈ {1, …, C} we have x_{d,1} ∈ {0, 1}, x_{d,2} ∈ {0, 1}, …, x_{d,C} ∈ {0, 1}, and each split asks, e.g., x_{d,2} = 0 or x_{d,2} = 1?
- alternative: binary splits that produce balanced subsets
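a rough sketch of enumerating candidate tests for each feature type (the helper names and split conventions here are illustrative, not from the slides):

```python
import numpy as np

def candidate_splits_continuous(x_col):
    """Every value that appears in the data is a candidate threshold s_{d,n}."""
    return np.unique(x_col)

def candidate_splits_ordinal(num_levels):
    """Ordinal feature with levels 1..C: candidate thresholds 1..C."""
    return np.arange(1, num_levels + 1)

def one_hot(x_col, num_categories):
    """Categorical feature -> C binary features, so each test is x_{d,c} = 0 or 1."""
    out = np.zeros((len(x_col), num_categories), dtype=int)
    out[np.arange(len(x_col)), x_col] = 1
    return out

# usage
age = np.array([22.0, 35.0, 22.0, 41.0])
print(candidate_splits_continuous(age))   # [22. 35. 41.]
color = np.array([0, 2, 1, 2])            # a categorical feature coded 0..2
print(one_hot(color, 3))
```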

Cost function

objective: find a decision tree minimizing the cost function

regression cost (for predicting a constant w_k ∈ R):
cost per region (mean squared error, MSE)
cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} (y^(n) − w_k)^2
where N_k is the number of instances in region k and w_k = mean(y^(n) | x^(n) ∈ R_k)

classification cost (for predicting a constant class w_k ∈ {1, …, C}):
cost per region (misclassification rate)
cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k)
where w_k = mode(y^(n) | x^(n) ∈ R_k)

total cost in both cases is the normalized sum ∑_k (N_k/N) cost(R_k, D)

it is sometimes possible to build a tree with zero cost: build a large tree in which each instance has its own region (overfitting!)

new objective: find a decision tree with K tests minimizing the cost function
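a minimal sketch of these per-region costs and the normalized total (helper names are my own):

```python
import numpy as np

def region_cost_regression(y_region):
    w_k = np.mean(y_region)                        # constant prediction in the region
    return np.mean((y_region - w_k) ** 2)          # MSE in the region

def region_cost_classification(y_region):
    _, counts = np.unique(y_region, return_counts=True)
    return 1.0 - counts.max() / len(y_region)      # misclassification rate

def total_cost(regions, region_cost):
    """Normalized sum  sum_k (N_k / N) * cost(R_k)  over a list of label arrays."""
    N = sum(len(y) for y in regions)
    return sum(len(y) / N * region_cost(y) for y in regions)

# usage
print(region_cost_regression(np.array([1.0, 2.0, 3.0])))        # ~0.667
print(region_cost_classification(np.array([0, 0, 1, 1, 1])))    # 0.4
print(total_cost([np.array([0, 0]), np.array([0, 1, 1])],
                 region_cost_classification))                   # 0.2
```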

Search space

objective: find a decision tree with K tests minimizing the cost function (K tests give K+1 regions)

alternatively, find the smallest tree (smallest K) that classifies all examples correctly

(figure: an example partition of the input space that cannot be produced by a decision tree)

assuming D features, how many different partitions of size K+1?

the number of full binary trees with K+1 leaves (regions R_k) is the Catalan number (1/(K+1)) (2K choose K), which is exponential in K:
1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190, 6564120420, 24466267020, 91482563640, 343059613650, 1289904147324, 4861946401452

we also have a choice of feature x_d for each of the K internal nodes: D^K possibilities

moreover, for each feature there are different choices of split point s_{d,n} ∈ S_d

bottom line: finding the optimal decision tree is an NP-hard combinatorial optimization problem
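a quick check of the Catalan-number count above:

```python
from math import comb

def catalan(K: int) -> int:
    # number of full binary trees with K+1 leaves: C_K = (1/(K+1)) * binom(2K, K)
    return comb(2 * K, K) // (K + 1)

print([catalan(K) for K in range(10)])
# [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]  -- grows exponentially with K
```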

Greedy heuristic

recursively split the regions based on a greedy choice of the next test

end the recursion if the split is not worth it (worth-splitting)

function fit-tree(R_node, D, depth)
    if not worth-splitting(depth, R_node)
        return R_node
    else
        R_left, R_right = greedy-test(R_node, D)
        left-set  = fit-tree(R_left, D, depth+1)
        right-set = fit-tree(R_right, D, depth+1)
        return {left-set, right-set}

the final decision tree is in the form of a nested list of regions, e.g., {{R_1, R_2}, {R_3, {R_4, R_5}}}

Choosing tests

the split is greedy because it looks only one step ahead; this may not lead to the lowest overall cost

function greedy-test(R_node, D)
    best-cost = +inf
    for d ∈ {1, …, D}, s_{d,n} ∈ S_d
        R_left  = R_node ∪ {x_d < s_{d,n}}      (creating new regions)
        R_right = R_node ∪ {x_d ≥ s_{d,n}}
        split-cost = (N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D)      (evaluate their cost)
        if split-cost < best-cost:
            best-cost = split-cost
            R_left*  = R_left
            R_right* = R_right
    return R_left*, R_right*      (return the split with the lowest greedy cost)
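a self-contained Python sketch of this greedy procedure for classification with the misclassification cost (function names and stopping thresholds are mine, not the course's):

```python
import numpy as np

def misclass_cost(y):
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / len(y)

def greedy_test(X, y):
    """Try every feature d and every observed threshold s; return the split
    minimizing (N_left/N) cost(left) + (N_right/N) cost(right)."""
    N, D = X.shape
    best = (np.inf, None, None)                      # best-cost starts at +inf
    for d in range(D):
        for s in np.unique(X[:, d]):
            left = X[:, d] < s
            if left.all() or (~left).all():
                continue                             # skip degenerate splits
            cost = (left.sum() / N) * misclass_cost(y[left]) \
                 + ((~left).sum() / N) * misclass_cost(y[~left])
            if cost < best[0]:
                best = (cost, d, s)
    return best                                      # (split-cost, feature, threshold)

def fit_tree(X, y, depth=0, max_depth=3, min_size=2):
    """Recursive growing; make a leaf when splitting is not worth it."""
    classes, counts = np.unique(y, return_counts=True)
    leaf = classes[np.argmax(counts)]                # w_k = mode(y)
    if depth >= max_depth or len(y) < min_size or misclass_cost(y) == 0.0:
        return leaf
    cost, d, s = greedy_test(X, y)
    if d is None:
        return leaf
    left = X[:, d] < s
    return {'feature': d, 'threshold': s,
            'left':  fit_tree(X[left], y[left], depth + 1, max_depth, min_size),
            'right': fit_tree(X[~left], y[~left], depth + 1, max_depth, min_size)}

# usage on a toy dataset
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
print(fit_tree(X, y))   # splits on feature 0 at threshold 3.0
```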

Stopping the recursion (the worth-splitting subroutine)

if we only stop when R_node has zero cost, we may overfit

heuristics for stopping the splitting:
- we reached a desired depth
- the number of examples in R_left or R_right is too small
- w_k is a good approximation, i.e., the cost is small enough
- the reduction in cost from splitting is small:
  cost(R_node, D) − ((N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D))

image credit: https://alanjeffares.wordpress.com/tutorials/decision-tree/
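one possible worth-splitting check combining these heuristics (thresholds and names are illustrative):

```python
def worth_splitting(depth, cost_node, cost_left, cost_right, n_left, n_right,
                    max_depth=5, min_leaf=5, min_cost=1e-3, min_gain=1e-3):
    n = n_left + n_right
    if depth >= max_depth:                 # reached a desired depth
        return False
    if min(n_left, n_right) < min_leaf:    # a child region would be too small
        return False
    if cost_node <= min_cost:              # w_k is already a good approximation
        return False
    # reduction in cost achieved by the proposed split
    gain = cost_node - (n_left / n * cost_left + n_right / n * cost_right)
    return gain > min_gain

print(worth_splitting(depth=2, cost_node=0.4, cost_left=0.1, cost_right=0.0,
                      n_left=30, n_right=20))   # True
```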

Revisiting the classification cost

ideally we want to optimize the 0-1 loss (misclassification rate)

cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k)

this may not be the optimal cost for each step of the greedy heuristic

example: a node R_node with label distribution (.5, 100%), where each pair shows the fraction of positive labels and the percentage of training data in the region, can be split in two ways:

split 1: R_left (.25, 50%) and R_right (.75, 50%)
split 2: R_left (.33, 75%) and R_right (1, 25%)

both splits have the same misclassification rate (2/8)

however, the second split may be preferable because one region does not need further splitting

idea: use a measure of homogeneity of the labels in the regions

Entropy

entropy is the expected amount of information in observing a random variable y

H(y) = −∑_{c=1}^C p(y = c) log p(y = c)

note: it is common to use capital letters for random variables (here, for consistency, we use lower-case)

−log p(y = c) is the amount of information in observing c:
- zero information if p(c) = 1
- less probable events are more informative: p(c) < p(c′) ⇒ −log p(c) > −log p(c′)
- information from two independent events is additive: −log(p(c)q(d)) = −log p(c) − log q(d)

a uniform distribution has the highest entropy: H(y) = −∑_{c=1}^C (1/C) log(1/C) = log C

a deterministic random variable has the lowest entropy: H(y) = −1 · log(1) = 0
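a short sketch of computing the entropy of a discrete distribution (base-2 logs, as in the later example):

```python
import numpy as np

def entropy(p):
    """H = -sum_c p(c) log2 p(c); zero-probability entries contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))      # 1.0   (uniform over 2 classes: log2 C)
print(entropy([1.0, 0.0]))      # 0.0   (deterministic)
print(entropy([0.25, 0.75]))    # ~0.811
```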

Mutual information

for two random variables t, y, the mutual information is the amount of information t conveys about y: the change in the entropy of y after observing the value of t

I(t, y) = H(y) − H(y|t)

where the conditional entropy is H(y|t) = ∑_{l=1}^L p(t = l) H(y|t = l)

equivalently, I(t, y) = ∑_l ∑_c p(y = c, t = l) log [ p(y = c, t = l) / (p(y = c) p(t = l)) ]

this is symmetric w.r.t. y and t: I(t, y) = H(t) − H(t|y) = I(y, t)

it is always non-negative, and zero only if y and t are independent (try to prove these properties)

Entropy for classification cost

we care about the distribution of labels in each region: p_k(y = c) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) = c)

misclassification cost: cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k) = 1 − p_k(w_k),
where w_k = argmax_c p_k(c) is the most probable class

entropy cost: cost(R_k, D) = H(p_k), the entropy of the label distribution in R_k; choose the split with the lowest entropy

with the entropy cost, the reduction in cost becomes the mutual information between the test and the labels:

cost(R_node, D) − ((N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D))
= H(y) − ( p(x_d ≥ s_{d,n}) H(y | x_d ≥ s_{d,n}) + p(x_d < s_{d,n}) H(y | x_d < s_{d,n}) )
= I(y, x_d > s_{d,n})

so we are choosing the test which is maximally informative about the labels
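a sketch of this information gain for a binary test, on the 8-point example from the slides (helper names are mine):

```python
import numpy as np

def entropy_of_labels(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, test):
    """H(y) - [ p(test) H(y|test) + p(not test) H(y|not test) ] = I(y, test)."""
    p_true = test.mean()
    cond = p_true * entropy_of_labels(y[test]) + (1 - p_true) * entropy_of_labels(y[~test])
    return entropy_of_labels(y) - cond

# 8 points, 4 per class, split two ways as in the example
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
split1 = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=bool)   # children (.25, 50%) and (.75, 50%)
split2 = np.array([1, 1, 1, 1, 1, 1, 0, 0], dtype=bool)   # children (.33, 75%) and (1, 25%)
print(information_gain(y, split1))   # ~0.19
print(information_gain(y, split2))   # ~0.31  -- the second test is more informative
```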

Entropy for classification cost: example

(figure: a node R_node with label distribution (.5, 100%), split either into (.25, 50%) and (.75, 50%), or into (.33, 75%) and (1, 25%))

misclassification cost:
split 1: (4/8)·(1/4) + (4/8)·(1/4) = 1/4
split 2: (6/8)·(1/3) + (2/8)·0 = 1/4
the same cost for both splits

entropy cost (using base-2 logarithm):
split 1: (4/8)(−(1/4) log(1/4) − (3/4) log(3/4)) + (4/8)(−(1/4) log(1/4) − (3/4) log(3/4)) ≈ 0.81
split 2: (6/8)(−(1/3) log(1/3) − (2/3) log(2/3)) + (2/8)·0 ≈ 0.69
the second split has the lower cost

Gini index

another cost for selecting the test in classification; so far we have seen:

misclassification (error) rate: cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k) = 1 − p_k(w_k)

entropy: cost(R_k, D) = H(p_k)

Gini index: the expected error rate

cost(R_k, D) = ∑_{c=1}^C p_k(c)(1 − p_k(c))
where p_k(c) is the probability of class c and 1 − p_k(c) is the probability of error
= ∑_{c=1}^C p_k(c) − ∑_{c=1}^C p_k(c)^2 = 1 − ∑_{c=1}^C p_k(c)^2

(figure: comparison of the three costs for a node with two classes, as a function of p(y = 1))
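a quick sketch reproducing that two-class comparison numerically:

```python
import numpy as np

def misclassification(p):
    return 1.0 - np.maximum(p, 1 - p)

def gini(p):
    return 2 * p * (1 - p)              # equals 1 - p^2 - (1-p)^2

def entropy2(p):
    q = np.clip(p, 1e-12, 1 - 1e-12)    # avoid log(0)
    return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

for p in [0.0, 0.1, 0.25, 0.5]:
    print(f"p={p:.2f}  error={misclassification(p):.3f}  "
          f"gini={gini(p):.3f}  entropy={entropy2(p):.3f}")
# all three costs peak at p = 0.5 and vanish for a pure node
```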

Example

decision tree for the Iris dataset (D = 2 features)

(figure: the dataset, the learned decision tree, and the decision boundaries after splits 1, 2 and 3)

the decision boundaries suggest overfitting; this is confirmed using a validation set:
training accuracy ~ 85%
(cross-)validation accuracy ~ 70%
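a rough reproduction of this kind of experiment with scikit-learn (two Iris features; exact accuracies will differ from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, 2:]                                  # keep D = 2 features (petal length, width)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier()               # grown until the leaves are pure
tree.fit(X_tr, y_tr)
print("training accuracy:  ", tree.score(X_tr, y_tr))   # typically near 1.0
print("validation accuracy:", tree.score(X_va, y_va))   # noticeably lower -> overfitting
```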

Overfitting

a decision tree can fit any Boolean function (binary classification with binary features)

image credit: https://www.wikiwand.com/en/Binary_decision_diagram

(figure: example of a decision tree representation of a Boolean function with D = 3)

there are 2^(2^D) such functions; why? (each of the 2^D input configurations can be labelled 0 or 1)

large decision trees have high variance and low bias (low training error, high test error)

idea 1. grow a small tree
problem: a substantial reduction in cost may happen only after a few more steps, and by stopping early we cannot know this
example: the cost drops only after the second node

Pruning

idea 2. grow a large tree and then prune it

greedily turn an internal node into a leaf node; the choice is based on the lowest increase in the cost; repeat this until only the root node is left

pick the best among the resulting models using a validation set (cross-validation is used to pick the best size)

example
(figure: the tree before pruning and after pruning)

idea 3. random forests (later!)
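a sketch of pruning in practice with scikit-learn's cost-complexity pruning (a different pruning criterion than the greedy procedure above, with the pruning strength picked on a validation set):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# candidate alphas correspond to progressively smaller (more pruned) trees
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_acc = 0.0, -np.inf
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    acc = tree.score(X_va, y_va)              # validation accuracy of the pruned tree
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print(f"best alpha = {best_alpha:.4f}, validation accuracy = {best_acc:.3f}")
```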

Summary

model: divide the input into axis-aligned regions
cost: MSE for regression, misclassification rate for classification
optimization: NP-hard, so use a greedy heuristic
adjust the cost for the greedy heuristic: using entropy (related to mutual information maximization) or using the Gini index
decision trees are unstable (they have high variance); use pruning to avoid overfitting
there are variations on decision tree heuristics; what we discussed is called Classification and Regression Trees (CART)