Lecture Slides for Introduction to Machine Learning 2e

ETHEM ALPAYDIN © The MIT Press, 2010
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Parametric Estimation

X = {x^t}_t where x^t ~ p(x)

Parametric estimation:
Assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = {μ, σ²}


Maximum Likelihood Estimation

Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)

Log likelihood:
L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)

Maximum likelihood estimator (MLE):
θ* = argmax_θ L(θ|X)
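As a numeric sketch of θ* = argmax_θ L(θ|X) (not from the slides; the data and the grid are made up), a brute-force grid search for a Bernoulli parameter:

```python
import numpy as np

# Minimal sketch: recover the MLE of a Bernoulli parameter p0 by
# maximizing the log likelihood L(p0|X) over a grid of candidates.
rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=100)       # made-up sample, true p0 = 0.7

grid = np.linspace(0.001, 0.999, 999)    # candidate values of p0
# L(p0|X) = sum_t [x^t log p0 + (1 - x^t) log(1 - p0)]
logL = X.sum() * np.log(grid) + (len(X) - X.sum()) * np.log(1 - grid)

p0_star = grid[np.argmax(logL)]          # theta* = argmax_theta L(theta|X)
print(p0_star, X.mean())                 # the argmax matches the sample mean
```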


Examples: Bernoulli/Multinomial

Bernoulli: Two states, failure/success, x ∈ {0, 1}

P(x) = p₀^x (1 − p₀)^(1−x)

L(p₀|X) = log ∏_t p₀^(x^t) (1 − p₀)^(1−x^t)

MLE: p̂₀ = ∑_t x^t / N

Multinomial: K > 2 states, x_i ∈ {0, 1}

P(x₁, x₂, ..., x_K) = ∏_i p_i^(x_i)

L(p₁, p₂, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)

MLE: p̂_i = ∑_t x_i^t / N
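A sketch of these closed-form MLEs on made-up data (the one-hot encoding of the multinomial sample is an assumption of the example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli MLE: p0_hat = sum_t x^t / N
x = rng.binomial(1, 0.3, size=200)
p0_hat = x.sum() / len(x)

# Multinomial MLE: with each x^t a 0/1 indicator vector over K states,
# p_i_hat = sum_t x_i^t / N, i.e. the relative frequency of state i.
K, N = 4, 500
states = rng.choice(K, size=N, p=[0.1, 0.2, 0.3, 0.4])
X = np.eye(K)[states]            # N x K matrix of indicators x_i^t
p_hat = X.sum(axis=0) / N
print(p0_hat, p_hat)
```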


Gaussian (Normal) Distribution

p(x) = N(μ, σ²):

p(x) = (1 / (√(2π) σ)) exp[−(x − μ)² / (2σ²)]

MLE for μ and σ²:

m = ∑_t x^t / N
s² = ∑_t (x^t − m)² / N
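The two estimates in code, as a sketch on a made-up sample:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=1000)   # made-up sample

N = len(x)
m = x.sum() / N                       # m = sum_t x^t / N
s2 = ((x - m) ** 2).sum() / N         # s^2 = sum_t (x^t - m)^2 / N
print(m, s2)                          # note s^2 is the biased MLE (divides by N)
```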


Bias and Variance

Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i

Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]

Mean square error:
r(d, θ) = E[(d − θ)²]
        = (E[d] − θ)² + E[(d − E[d])²]
        = Bias² + Variance
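A simulation sketch of these quantities (made-up Gaussian data; the estimator d is the MLE variance with N = 10, which is biased):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 4.0                                    # true sigma^2
# d_i = d(X_i): MLE variance (np.var divides by N) on each sample X_i
d = np.array([np.var(rng.normal(0.0, 2.0, size=10)) for _ in range(20_000)])

bias = d.mean() - theta                        # b_theta(d) = E[d] - theta
variance = ((d - d.mean()) ** 2).mean()        # E[(d - E[d])^2]
mse = ((d - theta) ** 2).mean()                # r(d, theta) = E[(d - theta)^2]
print(bias, variance, mse, bias**2 + variance) # MSE = Bias^2 + Variance
```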


Bayes’ Estimator

Treat θ as a random variable with prior p(θ)

Bayes’ rule: p(θ|X) = p(X|θ) p(θ) / p(X)

Full Bayes: p(x|X) = ∫ p(x|θ) p(θ|X) dθ

Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)

Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)

Bayes’ estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ


Bayes’ Estimator: Example

x^t ~ N(θ, σ₀²) and θ ~ N(μ, σ²)

θ_ML = m

θ_MAP = θ_Bayes = E[θ|X]
      = (N/σ₀²) / (N/σ₀² + 1/σ²) · m + (1/σ²) / (N/σ₀² + 1/σ²) · μ
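A sketch of this posterior mean on made-up data (the prior N(μ, σ²) and the known σ₀² are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2 = 0.0, 1.0                 # prior: theta ~ N(mu, sigma^2)
sigma0_2 = 4.0                        # known sampling variance sigma_0^2
x = rng.normal(2.0, np.sqrt(sigma0_2), size=20)

N, m = len(x), x.mean()
w = (N / sigma0_2) / (N / sigma0_2 + 1.0 / sigma2)  # weight on the sample mean
theta_bayes = w * m + (1.0 - w) * mu                # E[theta|X]
print(m, theta_bayes)   # the posterior mean shrinks m toward the prior mean mu
```

As N grows, the weight on m approaches 1 and θ_Bayes approaches θ_ML.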


Parametric Classification

g_i(x) = p(x|C_i) P(C_i)

or equivalently

g_i(x) = log p(x|C_i) + log P(C_i)

With Gaussian class likelihoods

p(x|C_i) = (1 / (√(2π) σ_i)) exp[−(x − μ_i)² / (2σ_i²)]

the discriminant is

g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)


Given the sample X = {x^t, r^t}_{t=1}^N, where

r_i^t = 1 if x^t ∈ C_i
r_i^t = 0 if x^t ∈ C_j, j ≠ i

the ML estimates are

P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t x^t r_i^t / ∑_t r_i^t
s_i² = ∑_t (x^t − m_i)² r_i^t / ∑_t r_i^t

and the discriminant becomes

g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)
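A compact sketch of this fit-and-classify recipe on made-up two-class data (function names are illustrative, not from the book):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-1.0, 1.0, 60), rng.normal(2.0, 1.5, 40)])
r = np.zeros((100, 2)); r[:60, 0] = 1; r[60:, 1] = 1   # one-hot labels r_i^t

def fit(x, r):
    n_i = r.sum(axis=0)                                   # sum_t r_i^t
    P = n_i / len(x)                                      # P_hat(C_i)
    m = (x[:, None] * r).sum(axis=0) / n_i                # m_i
    s2 = ((x[:, None] - m) ** 2 * r).sum(axis=0) / n_i    # s_i^2
    return P, m, s2

def g(x, P, m, s2):   # discriminant (constant dropped); -log s_i = -0.5 log s_i^2
    return -0.5 * np.log(s2) - (x[:, None] - m) ** 2 / (2 * s2) + np.log(P)

P, m, s2 = fit(x, r)
pred = np.argmax(g(x, P, m, s2), axis=1)   # choose the class with largest g_i
print(P, m, s2, (pred == r.argmax(axis=1)).mean())
```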


Equal variances

Single boundary at halfway between means


Variances are different

Two boundaries


Regression

r = f(x) + ε, where ε ~ N(0, σ²)

Estimator: g(x|θ)

p(r|x) ~ N(g(x|θ), σ²)

L(θ|X) = log ∏_{t=1}^N p(x^t, r^t)
       = log ∏_{t=1}^N p(r^t|x^t) + log ∏_{t=1}^N p(x^t)


Regression: From LogL to Error

L(θ|X) = log ∏_{t=1}^N (1/(√(2π) σ)) exp[−(r^t − g(x^t|θ))² / (2σ²)]
       = −N log(√(2π) σ) − (1/(2σ²)) ∑_{t=1}^N [r^t − g(x^t|θ)]²

Since the first term and the factor 1/σ² do not depend on g, maximizing L(θ|X) is equivalent to minimizing the error

E(θ|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|θ)]²


Linear Regression

g(x^t|w₁, w₀) = w₁ x^t + w₀

Setting the derivatives of E(w₁, w₀|X) to zero gives two equations in two unknowns:

∑_t r^t = N w₀ + w₁ ∑_t x^t
∑_t r^t x^t = w₀ ∑_t x^t + w₁ ∑_t (x^t)²

In matrix form A w = y:

A = | N         ∑_t x^t    |
    | ∑_t x^t   ∑_t (x^t)² |

w = | w₀ |      y = | ∑_t r^t     |
    | w₁ |          | ∑_t r^t x^t |

w = A⁻¹ y
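A sketch of solving A w = y on made-up data (np.linalg.solve is used rather than forming A⁻¹ explicitly):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 10.0, 50)
r = 3.0 * x + 1.5 + rng.normal(0.0, 1.0, 50)   # made-up line plus noise

N = len(x)
A = np.array([[N,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, y)                 # w = A^{-1} y
print(w0, w1)                                  # close to 1.5 and 3.0
```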


Polynomial Regression

g(x^t|w_k, ..., w₂, w₁, w₀) = w_k (x^t)^k + ... + w₂ (x^t)² + w₁ x^t + w₀

D = | 1   x¹    (x¹)²   ...  (x¹)^k  |
    | 1   x²    (x²)²   ...  (x²)^k  |
    | ⋮                              |
    | 1   x^N   (x^N)²  ...  (x^N)^k |

r = | r¹  |
    | r²  |
    | ⋮   |
    | r^N |

w = (Dᵀ D)⁻¹ Dᵀ r
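A sketch of the pseudo-inverse fit on made-up cubic data; np.linalg.lstsq computes the same least-squares solution as (DᵀD)⁻¹Dᵀr, more stably:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-2.0, 2.0, 80)
r = 0.5 * x**3 - x + rng.normal(0.0, 0.3, 80)   # made-up cubic plus noise

k = 3
D = np.vander(x, k + 1, increasing=True)        # rows [1, x^t, (x^t)^2, (x^t)^3]
w, *_ = np.linalg.lstsq(D, r, rcond=None)       # least-squares w
print(w)                                        # close to [0, -1, 0, 0.5]
```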


Other Error Measures

Square error:
E(θ|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|θ)]²

Relative square error:
E(θ|X) = ∑_{t=1}^N [r^t − g(x^t|θ)]² / ∑_{t=1}^N (r^t − r̄)²

Absolute error:
E(θ|X) = ∑_t |r^t − g(x^t|θ)|

ε-sensitive error:
E(θ|X) = ∑_t 1(|r^t − g(x^t|θ)| > ε) · (|r^t − g(x^t|θ)| − ε)
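The four measures as plain functions of the targets r^t and predictions g(x^t|θ), as a sketch (names are illustrative):

```python
import numpy as np

def square_error(r, g):                 # (1/2) sum_t [r^t - g^t]^2
    return 0.5 * np.sum((r - g) ** 2)

def relative_square_error(r, g):        # normalized by sum_t (r^t - r_bar)^2
    return np.sum((r - g) ** 2) / np.sum((r - r.mean()) ** 2)

def absolute_error(r, g):               # sum_t |r^t - g^t|
    return np.sum(np.abs(r - g))

def eps_sensitive_error(r, g, eps):     # zero inside the eps-wide tube
    e = np.abs(r - g)
    return np.sum(np.where(e > eps, e - eps, 0.0))
```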


Bias and Variance

Expected square error at x:

E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                     (noise)                (squared error)

Taking the expectation over samples X:

E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
                            (bias²)                 (variance)


Estimating Bias and Variance

M samples X_i = {x_i^t, r_i^t}, i = 1, ..., M,
are used to fit g_i(x), i = 1, ..., M:

ḡ(x) = (1/M) ∑_i g_i(x)

Bias²(g) = (1/N) ∑_t [ḡ(x^t) − f(x^t)]²

Variance(g) = (1/(N M)) ∑_t ∑_i [g_i(x^t) − ḡ(x^t)]²
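A sketch of this procedure for a made-up target f(x) = sin x fitted by order-1 polynomials (M, N, and the noise level are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(8)
f = np.sin
x_eval = np.linspace(0.0, np.pi, 25)          # the N points x^t where g is evaluated
M, n_train = 100, 20

g = np.empty((M, len(x_eval)))
for i in range(M):                            # fit g_i on sample X_i
    xt = rng.uniform(0.0, np.pi, n_train)
    rt = f(xt) + rng.normal(0.0, 0.3, n_train)
    g[i] = np.polyval(np.polyfit(xt, rt, 1), x_eval)

g_bar = g.mean(axis=0)                        # g_bar(x) = (1/M) sum_i g_i(x)
bias2 = np.mean((g_bar - f(x_eval)) ** 2)     # (1/N) sum_t [g_bar(x^t) - f(x^t)]^2
variance = np.mean((g - g_bar) ** 2)          # (1/(N M)) sum_t sum_i [g_i - g_bar]^2
print(bias2, variance)
```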


Bias/Variance Dilemma

Example: g_i(x) = 2 has no variance and high bias;
g_i(x) = ∑_t r_i^t / N has lower bias, but with variance.

As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data).

Bias/variance dilemma (Geman et al., 1992)


[Figure: the true function f, the individual fits g_i and their average ḡ, illustrating bias (distance of ḡ from f) and variance (spread of the g_i around ḡ)]


Polynomial Regression

Best fit: "min error"


Best fit, "elbow"


Model Selection

Cross-validation: Measure generalization accuracy by testing on data unused during training

Regularization: Penalize complex models
E′ = error on data + λ · model complexity

Akaike's information criterion (AIC), Bayesian information criterion (BIC)

Minimum description length (MDL): Kolmogorov complexity, shortest description of the data

Structural risk minimization (SRM)
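A sketch of cross-validation for choosing polynomial order (a single train/validation split on made-up data, rather than k folds):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(-1.0, 1.0, 100)
r = np.sin(3.0 * x) + rng.normal(0.0, 0.2, 100)

idx = rng.permutation(100)
tr, va = idx[:70], idx[70:]            # data unused during training -> validation
for k in range(1, 8):
    w = np.polyfit(x[tr], r[tr], k)    # train on the training split only
    err = np.mean((r[va] - np.polyval(w, x[va])) ** 2)
    print(k, err)                      # pick the order with lowest validation error
```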


Bayesian Model Selection

p(model|data) = p(data|model) p(model) / p(data)

Prior on models, p(model)

Regularization, when the prior favors simpler models

Bayes: MAP of the posterior, p(model|data)

Average over a number of models with high posterior (voting, ensembles: Chapter 17)


Regression example

Coefficients increase in magnitude as order increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]

Regularization penalizes large coefficients:
E′(w|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|w)]² + λ ∑_i w_i²
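A sketch of minimizing the regularized error E′(w|X) in closed form on made-up data; with the 1/2 convention above, the minimizer is w = (DᵀD + 2λI)⁻¹ Dᵀ r (λ and the data are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(-1.0, 1.0, 30)
r = np.sin(3.0 * x) + rng.normal(0.0, 0.2, 30)

k, lam = 7, 1e-2
D = np.vander(x, k + 1, increasing=True)
w_ridge = np.linalg.solve(D.T @ D + 2.0 * lam * np.eye(k + 1), D.T @ r)
w_ols = np.linalg.lstsq(D, r, rcond=None)[0]
print(np.abs(w_ols).max(), np.abs(w_ridge).max())   # penalty shrinks the |w_i|
```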