Statistical Learning: Bayesian and ML
COMP155, Sections 20.1-20.2, May 2, 2007

Page 1:

Statistical Learning: Bayesian and ML

COMP155
Sections 20.1-20.2
May 2, 2007

Page 2:

Definitions
• a posteriori: derived from observed facts
• a priori: based on hypothesis or theory rather than experiment

Page 3:

Bayesian Learning
• Make predictions using all hypotheses, weighted by their probabilities
• Bayes’ rule: P(a | b) = α P(b | a) P(a)
• For each hypothesis hi and observed data d:
  P(hi | d) = α P(d | hi) P(hi)
• P(d | hi) is the likelihood of d under hypothesis hi
• P(hi) is the hypothesis prior
• α is a normalization constant: α = 1 / ∑i P(d | hi) P(hi)
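As a minimal sketch, this update can be written directly in Python (the function and argument names are ours, not from the slides):

    # Posterior over hypotheses: P(hi | d) = α P(d | hi) P(hi).
    # `priors` and `likelihoods` are parallel lists, one entry per hypothesis hi.
    def posterior(priors, likelihoods):
        unnormalized = [lk * p for lk, p in zip(likelihoods, priors)]
        alpha = 1.0 / sum(unnormalized)   # α = 1 / ∑i P(d | hi) P(hi)
        return [alpha * u for u in unnormalized]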

Page 4:

Bayesian Learning
• We want to predict some quantity X:
  P(X | d) = ∑i P(X | d, hi) P(hi | d) = ∑i P(X | hi) P(hi | d)
  (the second equality holds because each hypothesis determines the prediction by itself: given hi, X is independent of d)
• The predictions are weighted averages over the predictions of the individual hypotheses
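In code the weighted average is one line (a sketch with illustrative names; pred_given_h[i] stands for P(X | hi)):

    # Bayesian prediction: P(X | d) = ∑i P(X | hi) P(hi | d),
    # the per-hypothesis predictions averaged under the posterior.
    def predict(pred_given_h, posterior_given_d):
        return sum(px * ph for px, ph in zip(pred_given_h, posterior_given_d))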

Page 5:

Example
• Suppose we know that there are 5 kinds of bags of candy:

            cherry    lime    % of all bags
  Type 1     100%       0%        10%
  Type 2      75%      25%        20%
  Type 3      50%      50%        40%
  Type 4      25%      75%        20%
  Type 5       0%     100%        10%
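For the code sketches that follow, the table can be encoded as two parallel lists (the variable names are ours):

    # Candy-bag model: P(lime | hi) and the prior P(hi) for types 1..5.
    # P(cherry | hi) is simply 1 - P(lime | hi).
    LIME  = [0.0, 0.25, 0.5, 0.75, 1.0]
    PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]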

Page 6:

Example: priors
• Given a new bag of candy, predict the type of the bag
• Five hypotheses:
  • h1: bag is type 1, P(h1) = 0.1
  • h2: bag is type 2, P(h2) = 0.2
  • h3: bag is type 3, P(h3) = 0.4
  • h4: bag is type 4, P(h4) = 0.2
  • h5: bag is type 5, P(h5) = 0.1
• With no evidence, we use the hypothesis priors

Page 7:

Example: one lime candy
• Suppose we unwrap one candy and determine that it is lime.
• Here ∑i P(onelime | hi) P(hi) = 0 + 0.05 + 0.2 + 0.15 + 0.1 = 0.5, so α = 1/0.5 = 2
• P(h1 | onelime) = α P(onelime | h1) P(h1) = 2 * (0 * 0.1) = 0
• P(h2 | onelime) = α P(onelime | h2) P(h2) = 2 * (0.25 * 0.2) = 0.1
• P(h3 | onelime) = α P(onelime | h3) P(h3) = 2 * (0.5 * 0.4) = 0.4
• P(h4 | onelime) = α P(onelime | h4) P(h4) = 2 * (0.75 * 0.2) = 0.3
• P(h5 | onelime) = α P(onelime | h5) P(h5) = 2 * (1.0 * 0.1) = 0.2
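These numbers are easy to check with the lists defined earlier (a sketch; the names are ours):

    # Posterior after observing one lime candy.
    LIME  = [0.0, 0.25, 0.5, 0.75, 1.0]
    PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]
    unnorm = [l * p for l, p in zip(LIME, PRIOR)]   # P(onelime | hi) P(hi)
    alpha = 1.0 / sum(unnorm)                       # sum is 0.5, so α = 2
    print([alpha * u for u in unnorm])              # ≈ [0, 0.1, 0.4, 0.3, 0.2]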

Page 8:

Example: two lime candies
• Suppose we unwrap another candy and it is also lime. Now P(twolime | hi) = P(lime | hi)^2, and ∑i P(twolime | hi) P(hi) = 0.325, so α = 1/0.325 ≈ 3.08
• P(h1 | twolime) = α P(twolime | h1) P(h1) = 3.08 * (0 * 0.1) = 0
• P(h2 | twolime) = α P(twolime | h2) P(h2) = 3.08 * (0.0625 * 0.2) ≈ 0.038
• P(h3 | twolime) = α P(twolime | h3) P(h3) = 3.08 * (0.25 * 0.4) ≈ 0.308
• P(h4 | twolime) = α P(twolime | h4) P(h4) = 3.08 * (0.5625 * 0.2) ≈ 0.346
• P(h5 | twolime) = α P(twolime | h5) P(h5) = 3.08 * (1.0 * 0.1) ≈ 0.308

Page 9:

Example: n lime candies
• Suppose we unwrap n candies and they are all lime. With αn = 1 / ∑i P(lime | hi)^n P(hi):
• P(h1 | nlime) = αn (0^n * 0.1)
• P(h2 | nlime) = αn (0.25^n * 0.2)
• P(h3 | nlime) = αn (0.5^n * 0.4)
• P(h4 | nlime) = αn (0.75^n * 0.2)
• P(h5 | nlime) = αn (1^n * 0.1)
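A sketch of the general case (our own function name; with n = 2 it reproduces the numbers on the previous slide):

    # Posterior over bag types after n lime candies in a row.
    LIME  = [0.0, 0.25, 0.5, 0.75, 1.0]
    PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]

    def posterior_after_n_limes(n):
        unnorm = [l ** n * p for l, p in zip(LIME, PRIOR)]  # P(lime | hi)^n P(hi)
        alpha = 1.0 / sum(unnorm)                           # αn
        return [alpha * u for u in unnorm]

    print(posterior_after_n_limes(2))   # ≈ [0, 0.038, 0.308, 0.346, 0.308]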

Page 10: (figure)

Page 11:

Prediction: what candy is next?
• P(nextlime | nlime) = ∑i P(nextlime | hi) P(hi | nlime)
  = P(nextlime | h1) P(h1 | nlime) + P(nextlime | h2) P(h2 | nlime) + P(nextlime | h3) P(h3 | nlime) + P(nextlime | h4) P(h4 | nlime) + P(nextlime | h5) P(h5 | nlime)
  = 0 * αn (0^n * 0.1) + 0.25 * αn (0.25^n * 0.2) + 0.5 * αn (0.5^n * 0.4) + 0.75 * αn (0.75^n * 0.2) + 1 * αn (1^n * 0.1)
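The same sum in code (a sketch reusing our lists from above):

    # P(next candy is lime | n limes so far) = ∑i P(lime | hi) P(hi | nlime).
    LIME  = [0.0, 0.25, 0.5, 0.75, 1.0]
    PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]

    def p_next_lime(n):
        unnorm = [l ** n * p for l, p in zip(LIME, PRIOR)]
        alpha = 1.0 / sum(unnorm)
        return sum(l * alpha * u for l, u in zip(LIME, unnorm))

    print(p_next_lime(2))   # ≈ 0.73; the value climbs toward 1.0 as n grows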

Page 12: (figure: P(nextlime | nlime) climbing toward 1 as n grows; the plotted value reaches about 0.97)

Page 13:

Analysis: Bayesian Prediction
• The true hypothesis eventually dominates
• The posterior probability of any false hypothesis eventually vanishes
• The probability of uncharacteristic data becomes vanishingly small
• Bayesian prediction is optimal
• Bayesian prediction is expensive: the hypothesis space may be very large (or infinite)

Page 14:

MAP Approximation
• To avoid the expense of Bayesian learning, one approach is to simply choose the most probable hypothesis and assume it is correct
• MAP = maximum a posteriori
• hmap = the hi with the highest value of P(hi | d)
• In the candy example, after 3 limes have been selected, a MAP learner will always predict that the next candy is lime with 100% probability (h5 has become the most probable hypothesis)
• Less accurate, but much cheaper
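A sketch of MAP prediction for the candy model (function and variable names are ours):

    # MAP: commit to the single most probable hypothesis instead of averaging.
    LIME  = [0.0, 0.25, 0.5, 0.75, 1.0]
    PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]

    def map_p_next_lime(n):
        unnorm = [l ** n * p for l, p in zip(LIME, PRIOR)]
        i_map = max(range(len(unnorm)), key=lambda i: unnorm[i])  # argmax P(hi | d)
        return LIME[i_map]                # predict using h_map alone

    print(map_p_next_lime(3))   # 1.0: after 3 limes, h5 is the MAP hypothesis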

Page 15: (figure)

Page 16:

Avoiding Complexity
• As we’ve seen earlier, allowing overly complex hypotheses can lead to overfitting
• Bayesian and MAP learning use the hypothesis prior to penalize complex hypotheses
• Complex hypotheses typically get lower priors, since there are typically many more complex hypotheses than simple ones
• We get the simplest hypothesis consistent with the data (as per Ockham’s razor)

Page 17:

ML Approximation
• For large data sets the priors become irrelevant; in this case we may use maximum likelihood (ML) learning
• Choose the hml that maximizes P(d | hi)
• That is, choose the hypothesis that has the highest probability of generating the observed data
• Identical to MAP for uniform priors
• ML is the standard (non-Bayesian) statistical learning method
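The ML version simply drops the priors (a sketch; the names are ours):

    # ML: pick the hypothesis with the highest likelihood P(d | hi),
    # ignoring the hypothesis priors entirely.
    LIME = [0.0, 0.25, 0.5, 0.75, 1.0]

    def ml_p_next_lime(n):
        likelihood = [l ** n for l in LIME]   # P(n limes | hi)
        i_ml = max(range(len(LIME)), key=lambda i: likelihood[i])
        return LIME[i_ml]

    print(ml_p_next_lime(1))   # 1.0: even one lime makes h5 the ML hypothesis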

Page 18: (figure)

Page 19: (figure)

Page 20:

Exercise
• Suppose we are pulling candy from a 50/50 bag (type 3) or a 25/75 bag (type 4)
• With full Bayesian learning, what would the posterior probability and prediction plots look like after 100 candies?
• What would the prediction plots look like for MAP and ML learning after 1000 candies?
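One way to explore this is by simulation (a sketch; it uses random draws, so the exact curves vary from run to run):

    import random

    # Draw candies from a true bag and track the full Bayesian posterior.
    LIME  = [0.0, 0.25, 0.5, 0.75, 1.0]
    PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]

    def simulate(true_lime_frac, n_draws):
        post = PRIOR[:]
        history = [post[:]]          # posteriors, starting with the prior
        for _ in range(n_draws):
            is_lime = random.random() < true_lime_frac
            like = [l if is_lime else 1.0 - l for l in LIME]
            unnorm = [lk * p for lk, p in zip(like, post)]
            alpha = 1.0 / sum(unnorm)
            post = [alpha * u for u in unnorm]
            history.append(post[:])
        return history

    # Type 3 bag (50/50): the posterior should concentrate on h3.
    print(simulate(0.5, 100)[-1])

Plotting each column of history against the draw number gives the posterior plots; applying the Bayesian, MAP, and ML prediction rules above to each posterior gives the corresponding prediction plots.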

Page 21:

Bayesian 50/50 bag (figure)

Page 22:

Bayesian 50/50 bag (figure)

Page 23:

Bayesian 25/75 bag (figure)

Page 24:

Bayesian 25/75 bag (figure)

Page 25:

MAP 50/50 bag (figure)

Page 26:

ML 50/50 bag (figure)

Page 27:

MAP 25/75 bag (figure)

Page 28:

ML 25/75 bag (figure)

Page 29:

Exercise

Page 30:

Answer