
  • Applied Machine Learning: Decision Trees

    Siamak Ravanbakhsh

    COMP 551 (Winter 2020)

  • Learning objectives

    decision trees:
    the model, the cost function, and how it is optimized
    how to grow a tree and why you should prune it!

  • Adaptive bases

    so far we assumed a fixed set of bases in f(x) = ∑_d w_d ϕ_d(x)

    several methods can be classified as learning these bases adaptively:
    decision trees, generalized additive models, boosting, neural networks

    f(x) = ∑_d w_d ϕ_d(x; v_d)    where each basis has its own parameters v_d
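To make the contrast concrete, here is a minimal sketch assuming hypothetical one-dimensional inputs: with fixed bases only the weights w_d are learned, while with adaptive bases each ϕ_d also carries its own parameter v_d (here an RBF center, chosen purely for illustration; it is not the basis family the slides use).

```python
import numpy as np

# fixed bases: the features ϕ_d(x) = x^d are chosen up front; only w is learned
def f_fixed(x, w):
    phi = np.stack([x ** d for d in range(len(w))], axis=-1)
    return phi @ w

# adaptive bases: each ϕ_d(x; v_d) = exp(-(x - v_d)^2) has its own learnable
# parameter v_d (hypothetical RBF centers, for illustration only)
def f_adaptive(x, w, v):
    phi = np.exp(-(x[:, None] - v[None, :]) ** 2)
    return phi @ w

x = np.linspace(0.0, 1.0, 5)
print(f_fixed(x, w=np.array([1.0, 2.0])))                    # linear model 1 + 2x
print(f_adaptive(x, w=np.ones(2), v=np.array([0.2, 0.8])))   # two learned bumps
```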


  • Decision trees: motivation

    pros: decision trees are interpretable!
    they are not very sensitive to outliers
    they do not need data normalization

    cons: they can easily overfit and they are unstable
    (remedies: pruning and random forests)

    image credit: https://mymodernmet.com/the-30second-rule-a-decision/


  • Decision trees: idea

    divide the input space into regions and learn one function per region:

    f(x) = ∑_k w_k I(x ∈ R_k)

    split regions successively based on the value of a single variable, called a test
    the regions are learned adaptively
    more sophisticated prediction per region is also possible (e.g., one linear model per region)

    [figure: the (x_1, x_2) input space partitioned by axis-aligned tests into regions with constant predictions w_1, w_3, w_5, ...]

    each region is a set of conditions, e.g., R_2 = {x_1 ≤ t_1, x_2 ≤ t_4}
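As a sketch of this idea, the snippet below hard-codes a few hypothetical axis-aligned regions and evaluates f(x) = ∑_k w_k I(x ∈ R_k); a real tree would learn the regions and the constants from data.

```python
import numpy as np

# hypothetical regions (a real tree learns these); together they partition the space
regions = [
    (lambda x: x[0] <= 0.5 and x[1] <= 0.3, 1.0),   # R_1 with constant w_1
    (lambda x: x[0] <= 0.5 and x[1] > 0.3,  2.0),   # R_2 with constant w_2
    (lambda x: x[0] > 0.5,                  3.0),   # R_3 with constant w_3
]

def f(x):
    # exactly one indicator I(x in R_k) fires, so the sum picks that region's w_k
    return sum(w for cond, w in regions if cond(x))

print(f(np.array([0.2, 0.7])))   # lands in R_2 -> 2.0
```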


  • Prediction per region

    suppose we have identified the regions R_k
    what constant w_k should we use for prediction in each region?

    for regression:
    use the mean value of the training data-points in that region
    w_k = mean(y^(n) | x^(n) ∈ R_k)

    for classification:
    count the frequency of classes per region
    predict the most frequent label or return a probability
    w_k = mode(y^(n) | x^(n) ∈ R_k)

    example: predicting survival on the Titanic
    [figure: a decision tree where each leaf shows the most frequent label, the frequency of survival, and the percentage of training data in that region]
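A minimal sketch of both rules, assuming the region index of each training point has already been computed:

```python
import numpy as np

def region_constants(y, region_idx, task="regression"):
    """Per-region constants: w_k = mean(...) for regression, mode(...) for classification."""
    ws = {}
    for k in np.unique(region_idx):
        y_k = y[region_idx == k]                 # targets y^(n) with x^(n) in R_k
        if task == "regression":
            ws[k] = y_k.mean()                   # w_k = mean(y^(n) | x^(n) in R_k)
        else:
            labels, counts = np.unique(y_k, return_counts=True)
            ws[k] = labels[counts.argmax()]      # w_k = mode(y^(n) | x^(n) in R_k)
    return ws

y = np.array([0, 1, 1, 0, 1])
region_idx = np.array([0, 0, 1, 1, 1])
print(region_constants(y, region_idx, task="classification"))   # -> {0: 0, 1: 1}
```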


  • Feature types

    given a feature, what are the possible tests?

    continuous features - e.g., age, height, GDP
    all the values that appear in the dataset can be used to split: S_d = {s_{d,n} = x_d^(n)}
    one set of possible splits for each feature d
    each split asks: x_d > s_{d,n}?

    ordinal features - e.g., grade, rating
    x_d ∈ {1, …, C}; we can split at any value, so S_d = {s_{d,1} = 1, …, s_{d,C} = C}
    each split asks: x_d > s_{d,c}?

    categorical features - e.g., types, classes, and categories
    problem with a multi-way split (one branch per value x_d = c): data fragmentation, i.e., it could lead to sparse subsets where some branches have few or no datapoints

    binary split: instead of x_d ∈ {1, …, C}, assume C binary features (one-hot coding)
    x_{d,1} ∈ {0, 1}, x_{d,2} ∈ {0, 1}, …, x_{d,C} ∈ {0, 1}
    [figure: a node testing x_{d,2}, with one branch for 0 and one for 1]

    alternative: binary splits that produce balanced subsets
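A minimal sketch of enumerating these candidate splits per feature type, assuming a dense feature matrix X and a hypothetical feature_types list (names and split encoding are illustrative, not from the slides):

```python
import numpy as np

def candidate_splits(X, feature_types):
    """feature_types[d] is one of 'continuous', 'ordinal', 'categorical'."""
    splits = {}
    for d, ftype in enumerate(feature_types):
        values = np.unique(X[:, d])
        if ftype in ("continuous", "ordinal"):
            # threshold tests x_d > s_{d,n}: every observed value is a candidate
            splits[d] = [("gt", s) for s in values]
        else:
            # categorical via one-hot-style binary tests, i.e. x_d == c?
            splits[d] = [("eq", c) for c in values]
    return splits

X = np.array([[1.7, 2], [1.6, 0], [1.8, 1]])   # height (continuous), color id (categorical)
print(candidate_splits(X, ["continuous", "categorical"]))
```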


  • Cost function

    objective: find a decision tree minimizing the cost function

    regression cost, for predicting a constant w_k ∈ ℝ per region (mean squared error, MSE):

    cost(R_k, D) = (1/N_k) ∑_{x^(n) ∈ R_k} (y^(n) − w_k)²

    where N_k is the number of instances in region k

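A minimal sketch of this per-region cost:

```python
import numpy as np

def region_cost(y_in_region):
    """cost(R_k, D) = (1/N_k) * sum over x^(n) in R_k of (y^(n) - w_k)^2."""
    w_k = y_in_region.mean()                  # the mean is the best constant under MSE
    return np.mean((y_in_region - w_k) ** 2)

print(region_cost(np.array([1.0, 2.0, 3.0])))   # -> 0.666...
```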