
Applied Machine Learning: Decision Trees

Siamak Ravanbakhsh

COMP 551 (Winter 2020)

Learning objectives

decision trees:
- model
- cost function
- how it is optimized

how to grow a tree and why you should prune it!

Adaptive bases

so far we have assumed a fixed set of bases in f(x) = ∑_d w_d ϕ_d(x)

several methods can be viewed as learning these bases adaptively:
decision trees, generalized additive models, boosting, neural networks

f(x) = ∑_d w_d ϕ_d(x; v_d)

each basis has its own parameters v_d

Decision trees: motivation

image credit: https://mymodernmet.com/the-30second-rule-a-decision/

pros:
- decision trees are interpretable!
- they are not very sensitive to outliers
- they do not need data normalization

cons:
- they can easily overfit and they are unstable
- remedies: pruning, random forests

Decision trees: idea

divide the input space into regions and learn one function per region

f(x) = ∑_k w_k I(x ∈ R_k)

the regions are learned adaptively; more sophisticated prediction per region is also possible (e.g., one linear model per region)

split the regions successively based on the value of a single feature, called a test

each region is a set of conditions, e.g., R_2 = {x_1 ≤ t_1, x_2 ≤ t_4}

(figure: the (x_1, x_2) input space partitioned by axis-aligned splits into regions with constant predictions w_1, w_3, w_5)

Prediction per region

what constant w_k should we use for prediction in each region?

suppose we have identified the regions R_k

for regression:
use the mean value of the training data-points in that region
w_k = mean(y^(n) | x^(n) ∈ R_k)

for classification:
count the frequency of classes per region and predict the most frequent label (or return a probability)
w_k = mode(y^(n) | x^(n) ∈ R_k)

example: predicting survival on the Titanic
(figure: a decision tree where each node shows the most frequent label, the frequency of survival, and the percentage of training data in that region)
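a small Python sketch (not from the slides) of these constant leaf predictions:

```python
import numpy as np

def leaf_value_regression(y_region):
    """Mean of the targets in the region: w_k = mean(y | x in R_k)."""
    return np.mean(y_region)

def leaf_value_classification(y_region, return_proba=False):
    """Most frequent class in the region (or the class frequencies)."""
    classes, counts = np.unique(y_region, return_counts=True)
    if return_proba:
        return classes, counts / counts.sum()   # class probabilities per region
    return classes[np.argmax(counts)]           # w_k = mode(y | x in R_k)

# usage
y = np.array([0, 1, 1, 1, 0])
print(leaf_value_regression(y))       # 0.6
print(leaf_value_classification(y))   # 1
```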

Feature types

given a feature, what are the possible tests?

continuous features (e.g., age, height, GDP):
all the values that appear in the dataset can be used as split points, S_d = {s_{d,n} = x_d^(n)}
one set of possible splits for each feature d; each split asks x_d > s_{d,n}?

ordinal features (e.g., grade, rating), x_d ∈ {1, …, C}:
we can split at any value, so S_d = {s_{d,1} = 1, …, s_{d,C} = C}; each split asks x_d > s_{d,c}?

categorical features (types, classes and categories):
- multi-way split (x_d = ?, one branch per category)
  problem: it could lead to sparse subsets; data fragmentation: some splits may have few or no datapoints
- binary split: assume C binary features (one-hot coding), so instead of x_d ∈ {1, …, C} we have x_{d,1} ∈ {0, 1}, x_{d,2} ∈ {0, 1}, …, x_{d,C} ∈ {0, 1}, and each split asks, e.g., x_{d,2} = 0 or x_{d,2} = 1?
- alternative: binary splits that produce balanced subsets
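a rough sketch of enumerating candidate tests for each feature type (the helper names and split conventions here are illustrative, not from the slides):

```python
import numpy as np

def candidate_splits_continuous(x_col):
    """Every value that appears in the data is a candidate threshold s_{d,n}."""
    return np.unique(x_col)

def candidate_splits_ordinal(num_levels):
    """Ordinal feature with levels 1..C: candidate thresholds 1..C."""
    return np.arange(1, num_levels + 1)

def one_hot(x_col, num_categories):
    """Categorical feature -> C binary features, so each test is x_{d,c} = 0 or 1."""
    out = np.zeros((len(x_col), num_categories), dtype=int)
    out[np.arange(len(x_col)), x_col] = 1
    return out

# usage
age = np.array([22.0, 35.0, 22.0, 41.0])
print(candidate_splits_continuous(age))   # [22. 35. 41.]
color = np.array([0, 2, 1, 2])            # a categorical feature coded 0..2
print(one_hot(color, 3))
```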

Cost function

objective: find a decision tree minimizing the cost function

regression cost (for predicting a constant w_k ∈ R):
cost per region (mean squared error, MSE)
cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} (y^(n) − w_k)^2
where N_k is the number of instances in region k and w_k = mean(y^(n) | x^(n) ∈ R_k)

classification cost (for predicting a constant class w_k ∈ {1, …, C}):
cost per region (misclassification rate)
cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k)
where w_k = mode(y^(n) | x^(n) ∈ R_k)

total cost in both cases is the normalized sum ∑_k (N_k/N) cost(R_k, D)

it is sometimes possible to build a tree with zero cost: build a large tree in which each instance has its own region (overfitting!)

new objective: find a decision tree with K tests minimizing the cost function
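a minimal sketch of these per-region costs and the normalized total (helper names are my own):

```python
import numpy as np

def region_cost_regression(y_region):
    w_k = np.mean(y_region)                        # constant prediction in the region
    return np.mean((y_region - w_k) ** 2)          # MSE in the region

def region_cost_classification(y_region):
    _, counts = np.unique(y_region, return_counts=True)
    return 1.0 - counts.max() / len(y_region)      # misclassification rate

def total_cost(regions, region_cost):
    """Normalized sum  sum_k (N_k / N) * cost(R_k)  over a list of label arrays."""
    N = sum(len(y) for y in regions)
    return sum(len(y) / N * region_cost(y) for y in regions)

# usage
print(region_cost_regression(np.array([1.0, 2.0, 3.0])))        # ~0.667
print(region_cost_classification(np.array([0, 0, 1, 1, 1])))    # 0.4
print(total_cost([np.array([0, 0]), np.array([0, 1, 1])],
                 region_cost_classification))                   # 0.2
```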

Search space

objective: find a decision tree with K tests minimizing the cost function (K tests give K+1 regions)

alternatively, find the smallest tree (smallest K) that classifies all examples correctly

(figure: an example partition of the input space that cannot be produced by a decision tree)

assuming D features, how many different partitions of size K+1?

the number of full binary trees with K+1 leaves (regions R_k) is the Catalan number (1/(K+1)) (2K choose K), which is exponential in K:
1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190, 6564120420, 24466267020, 91482563640, 343059613650, 1289904147324, 4861946401452

we also have a choice of feature x_d for each of the K internal nodes: D^K possibilities

moreover, for each feature there are different choices of split point s_{d,n} ∈ S_d

bottom line: finding the optimal decision tree is an NP-hard combinatorial optimization problem
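a quick check of the Catalan-number count above:

```python
from math import comb

def catalan(K: int) -> int:
    # number of full binary trees with K+1 leaves: C_K = (1/(K+1)) * binom(2K, K)
    return comb(2 * K, K) // (K + 1)

print([catalan(K) for K in range(10)])
# [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]  -- grows exponentially with K
```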

Greedy heuristic

recursively split the regions based on a greedy choice of the next test

end the recursion if the split is not worth it (worth-splitting)

function fit-tree(R_node, D, depth)
    if not worth-splitting(depth, R_node)
        return R_node
    else
        R_left, R_right = greedy-test(R_node, D)
        left-set  = fit-tree(R_left, D, depth+1)
        right-set = fit-tree(R_right, D, depth+1)
        return {left-set, right-set}

the final decision tree is in the form of a nested list of regions, e.g., {{R_1, R_2}, {R_3, {R_4, R_5}}}

Choosing tests

the split is greedy because it looks only one step ahead; this may not lead to the lowest overall cost

function greedy-test(R_node, D)
    best-cost = +inf
    for d ∈ {1, …, D}, s_{d,n} ∈ S_d
        R_left  = R_node ∪ {x_d < s_{d,n}}      (creating new regions)
        R_right = R_node ∪ {x_d ≥ s_{d,n}}
        split-cost = (N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D)      (evaluate their cost)
        if split-cost < best-cost:
            best-cost = split-cost
            R_left*  = R_left
            R_right* = R_right
    return R_left*, R_right*      (return the split with the lowest greedy cost)
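a self-contained Python sketch of this greedy procedure for classification with the misclassification cost (function names and stopping thresholds are mine, not the course's):

```python
import numpy as np

def misclass_cost(y):
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / len(y)

def greedy_test(X, y):
    """Try every feature d and every observed threshold s; return the split
    minimizing (N_left/N) cost(left) + (N_right/N) cost(right)."""
    N, D = X.shape
    best = (np.inf, None, None)                      # best-cost starts at +inf
    for d in range(D):
        for s in np.unique(X[:, d]):
            left = X[:, d] < s
            if left.all() or (~left).all():
                continue                             # skip degenerate splits
            cost = (left.sum() / N) * misclass_cost(y[left]) \
                 + ((~left).sum() / N) * misclass_cost(y[~left])
            if cost < best[0]:
                best = (cost, d, s)
    return best                                      # (split-cost, feature, threshold)

def fit_tree(X, y, depth=0, max_depth=3, min_size=2):
    """Recursive growing; make a leaf when splitting is not worth it."""
    classes, counts = np.unique(y, return_counts=True)
    leaf = classes[np.argmax(counts)]                # w_k = mode(y)
    if depth >= max_depth or len(y) < min_size or misclass_cost(y) == 0.0:
        return leaf
    cost, d, s = greedy_test(X, y)
    if d is None:
        return leaf
    left = X[:, d] < s
    return {'feature': d, 'threshold': s,
            'left':  fit_tree(X[left], y[left], depth + 1, max_depth, min_size),
            'right': fit_tree(X[~left], y[~left], depth + 1, max_depth, min_size)}

# usage on a toy dataset
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
print(fit_tree(X, y))   # splits on feature 0 at threshold 3.0
```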

Stopping the recursion (the worth-splitting subroutine)

if we only stop when R_node has zero cost, we may overfit

heuristics for stopping the splitting:
- we reached a desired depth
- the number of examples in R_left or R_right is too small
- w_k is a good approximation, i.e., the cost is small enough
- the reduction in cost from splitting is small:
  cost(R_node, D) − ((N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D))

image credit: https://alanjeffares.wordpress.com/tutorials/decision-tree/
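one possible worth-splitting check combining these heuristics (thresholds and names are illustrative):

```python
def worth_splitting(depth, cost_node, cost_left, cost_right, n_left, n_right,
                    max_depth=5, min_leaf=5, min_cost=1e-3, min_gain=1e-3):
    n = n_left + n_right
    if depth >= max_depth:                 # reached a desired depth
        return False
    if min(n_left, n_right) < min_leaf:    # a child region would be too small
        return False
    if cost_node <= min_cost:              # w_k is already a good approximation
        return False
    # reduction in cost achieved by the proposed split
    gain = cost_node - (n_left / n * cost_left + n_right / n * cost_right)
    return gain > min_gain

print(worth_splitting(depth=2, cost_node=0.4, cost_left=0.1, cost_right=0.0,
                      n_left=30, n_right=20))   # True
```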

Revisiting the classification cost

ideally we want to optimize the 0-1 loss (misclassification rate)

cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k)

this may not be the optimal cost for each step of the greedy heuristic

example: a node R_node with label distribution (.5, 100%), where each pair shows the fraction of positive labels and the percentage of training data in the region, can be split in two ways:

split 1: R_left (.25, 50%) and R_right (.75, 50%)
split 2: R_left (.33, 75%) and R_right (1, 25%)

both splits have the same misclassification rate (2/8)

however, the second split may be preferable because one region does not need further splitting

idea: use a measure of homogeneity of the labels in the regions

Entropy

entropy is the expected amount of information in observing a random variable y

H(y) = −∑_{c=1}^C p(y = c) log p(y = c)

note: it is common to use capital letters for random variables (here, for consistency, we use lower-case)

−log p(y = c) is the amount of information in observing c:
- zero information if p(c) = 1
- less probable events are more informative: p(c) < p(c′) ⇒ −log p(c) > −log p(c′)
- information from two independent events is additive: −log(p(c)q(d)) = −log p(c) − log q(d)

a uniform distribution has the highest entropy: H(y) = −∑_{c=1}^C (1/C) log(1/C) = log C

a deterministic random variable has the lowest entropy: H(y) = −1 · log(1) = 0
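a short sketch of computing the entropy of a discrete distribution (base-2 logs, as in the later example):

```python
import numpy as np

def entropy(p):
    """H = -sum_c p(c) log2 p(c); zero-probability entries contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))      # 1.0   (uniform over 2 classes: log2 C)
print(entropy([1.0, 0.0]))      # 0.0   (deterministic)
print(entropy([0.25, 0.75]))    # ~0.811
```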

Mutual information

for two random variables t, y, the mutual information is the amount of information t conveys about y: the change in the entropy of y after observing the value of t

I(t, y) = H(y) − H(y|t)

where the conditional entropy is H(y|t) = ∑_{l=1}^L p(t = l) H(y|t = l)

equivalently, I(t, y) = ∑_l ∑_c p(y = c, t = l) log [ p(y = c, t = l) / (p(y = c) p(t = l)) ]

this is symmetric w.r.t. y and t: I(t, y) = H(t) − H(t|y) = I(y, t)

it is always non-negative, and zero only if y and t are independent (try to prove these properties)

Entropy for classification cost

we care about the distribution of labels in each region: p_k(y = c) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) = c)

misclassification cost: cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k) = 1 − p_k(w_k),
where w_k = argmax_c p_k(c) is the most probable class

entropy cost: cost(R_k, D) = H(p_k), the entropy of the label distribution in R_k; choose the split with the lowest entropy

with the entropy cost, the reduction in cost becomes the mutual information between the test and the labels:

cost(R_node, D) − ((N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D))
= H(y) − ( p(x_d ≥ s_{d,n}) H(y | x_d ≥ s_{d,n}) + p(x_d < s_{d,n}) H(y | x_d < s_{d,n}) )
= I(y, x_d > s_{d,n})

so we are choosing the test which is maximally informative about the labels
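a sketch of this information gain for a binary test, on the 8-point example from the slides (helper names are mine):

```python
import numpy as np

def entropy_of_labels(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, test):
    """H(y) - [ p(test) H(y|test) + p(not test) H(y|not test) ] = I(y, test)."""
    p_true = test.mean()
    cond = p_true * entropy_of_labels(y[test]) + (1 - p_true) * entropy_of_labels(y[~test])
    return entropy_of_labels(y) - cond

# 8 points, 4 per class, split two ways as in the example
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
split1 = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=bool)   # children (.25, 50%) and (.75, 50%)
split2 = np.array([1, 1, 1, 1, 1, 1, 0, 0], dtype=bool)   # children (.33, 75%) and (1, 25%)
print(information_gain(y, split1))   # ~0.19
print(information_gain(y, split2))   # ~0.31  -- the second test is more informative
```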

Entropy for classification cost: example

(figure: a node R_node with label distribution (.5, 100%), split either into (.25, 50%) and (.75, 50%), or into (.33, 75%) and (1, 25%))

misclassification cost:
split 1: (4/8)·(1/4) + (4/8)·(1/4) = 1/4
split 2: (6/8)·(1/3) + (2/8)·0 = 1/4
the same cost for both splits

entropy cost (using base-2 logarithm):
split 1: (4/8)(−(1/4) log(1/4) − (3/4) log(3/4)) + (4/8)(−(1/4) log(1/4) − (3/4) log(3/4)) ≈ 0.81
split 2: (6/8)(−(1/3) log(1/3) − (2/3) log(2/3)) + (2/8)·0 ≈ 0.69
the second split has the lower cost

Gini index

another cost for selecting the test in classification; so far we have seen:

misclassification (error) rate: cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} I(y^(n) ≠ w_k) = 1 − p_k(w_k)

entropy: cost(R_k, D) = H(p_k)

Gini index: the expected error rate

cost(R_k, D) = ∑_{c=1}^C p_k(c)(1 − p_k(c))
where p_k(c) is the probability of class c and 1 − p_k(c) is the probability of error
= ∑_{c=1}^C p_k(c) − ∑_{c=1}^C p_k(c)^2 = 1 − ∑_{c=1}^C p_k(c)^2

(figure: comparison of the three costs for a node with two classes, as a function of p(y = 1))
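a quick sketch reproducing that two-class comparison numerically:

```python
import numpy as np

def misclassification(p):
    return 1.0 - np.maximum(p, 1 - p)

def gini(p):
    return 2 * p * (1 - p)              # equals 1 - p^2 - (1-p)^2

def entropy2(p):
    q = np.clip(p, 1e-12, 1 - 1e-12)    # avoid log(0)
    return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

for p in [0.0, 0.1, 0.25, 0.5]:
    print(f"p={p:.2f}  error={misclassification(p):.3f}  "
          f"gini={gini(p):.3f}  entropy={entropy2(p):.3f}")
# all three costs peak at p = 0.5 and vanish for a pure node
```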

Example

decision tree for the Iris dataset (D = 2 features)

(figure: the dataset, the learned decision tree, and the decision boundaries after splits 1, 2 and 3)

the decision boundaries suggest overfitting; this is confirmed using a validation set:
training accuracy ~ 85%
(cross-)validation accuracy ~ 70%
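a rough reproduction of this kind of experiment with scikit-learn (two Iris features; exact accuracies will differ from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, 2:]                                  # keep D = 2 features (petal length, width)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier()               # grown until the leaves are pure
tree.fit(X_tr, y_tr)
print("training accuracy:  ", tree.score(X_tr, y_tr))   # typically near 1.0
print("validation accuracy:", tree.score(X_va, y_va))   # noticeably lower -> overfitting
```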

Overfitting

a decision tree can fit any Boolean function (binary classification with binary features)

image credit: https://www.wikiwand.com/en/Binary_decision_diagram

(figure: example of a decision tree representation of a Boolean function with D = 3)

there are 2^(2^D) such functions; why? (each of the 2^D input configurations can be labelled 0 or 1)

large decision trees have high variance and low bias (low training error, high test error)

idea 1. grow a small tree
problem: a substantial reduction in cost may happen only after a few more steps, and by stopping early we cannot know this
example: the cost drops only after the second node

Pruning

idea 2. grow a large tree and then prune it

greedily turn an internal node into a leaf node; the choice is based on the lowest increase in the cost; repeat this until only the root node is left

pick the best among the resulting models using a validation set (cross-validation is used to pick the best size)

example
(figure: the tree before pruning and after pruning)

idea 3. random forests (later!)
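a sketch of pruning in practice with scikit-learn's cost-complexity pruning (a different pruning criterion than the greedy procedure above, with the pruning strength picked on a validation set):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# candidate alphas correspond to progressively smaller (more pruned) trees
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_acc = 0.0, -np.inf
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    acc = tree.score(X_va, y_va)              # validation accuracy of the pruned tree
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print(f"best alpha = {best_alpha:.4f}, validation accuracy = {best_acc:.3f}")
```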

Summary

model: divide the input into axis-aligned regions
cost: MSE for regression, misclassification rate for classification
optimization: NP-hard, so use a greedy heuristic
adjust the cost for the greedy heuristic: using entropy (related to mutual information maximization) or using the Gini index
decision trees are unstable (they have high variance); use pruning to avoid overfitting
there are variations on decision tree heuristics; what we discussed is called Classification and Regression Trees (CART)