Lecture Slides for Introduction to Machine Learning 2e

ETHEM ALPAYDIN © The MIT Press, 2010
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Parametric Estimation

X = {x^t}_t where x^t ~ p(x)

Parametric estimation:
Assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = {μ, σ²}


Maximum Likelihood Estimation

Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)

Log likelihood:
L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)

Maximum likelihood estimator (MLE):
θ* = argmax_θ L(θ|X)
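As a numeric sketch of θ* = argmax_θ L(θ|X) (not from the slides; the data and the grid are made up), a brute-force grid search for a Bernoulli parameter:

```python
import numpy as np

# Minimal sketch: recover the MLE of a Bernoulli parameter p0 by
# maximizing the log likelihood L(p0|X) over a grid of candidates.
rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=100)       # made-up sample, true p0 = 0.7

grid = np.linspace(0.001, 0.999, 999)    # candidate values of p0
# L(p0|X) = sum_t [x^t log p0 + (1 - x^t) log(1 - p0)]
logL = X.sum() * np.log(grid) + (len(X) - X.sum()) * np.log(1 - grid)

p0_star = grid[np.argmax(logL)]          # theta* = argmax_theta L(theta|X)
print(p0_star, X.mean())                 # the argmax matches the sample mean
```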


Examples: Bernoulli/Multinomial

Bernoulli: Two states, failure/success, x ∈ {0, 1}

P(x) = p₀^x (1 − p₀)^(1−x)

L(p₀|X) = log ∏_t p₀^(x^t) (1 − p₀)^(1−x^t)

MLE: p̂₀ = ∑_t x^t / N

Multinomial: K > 2 states, x_i ∈ {0, 1}

P(x₁, x₂, ..., x_K) = ∏_i p_i^(x_i)

L(p₁, p₂, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)

MLE: p̂_i = ∑_t x_i^t / N
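A sketch of these closed-form MLEs on made-up data (the one-hot encoding of the multinomial sample is an assumption of the example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli MLE: p0_hat = sum_t x^t / N
x = rng.binomial(1, 0.3, size=200)
p0_hat = x.sum() / len(x)

# Multinomial MLE: with each x^t a 0/1 indicator vector over K states,
# p_i_hat = sum_t x_i^t / N, i.e. the relative frequency of state i.
K, N = 4, 500
states = rng.choice(K, size=N, p=[0.1, 0.2, 0.3, 0.4])
X = np.eye(K)[states]            # N x K matrix of indicators x_i^t
p_hat = X.sum(axis=0) / N
print(p0_hat, p_hat)
```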


Gaussian (Normal) Distribution

p(x) = N(μ, σ²):

p(x) = (1 / (√(2π) σ)) exp[−(x − μ)² / (2σ²)]

MLE for μ and σ²:

m = ∑_t x^t / N
s² = ∑_t (x^t − m)² / N
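The two estimates in code, as a sketch on a made-up sample:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=1000)   # made-up sample

N = len(x)
m = x.sum() / N                       # m = sum_t x^t / N
s2 = ((x - m) ** 2).sum() / N         # s^2 = sum_t (x^t - m)^2 / N
print(m, s2)                          # note s^2 is the biased MLE (divides by N)
```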


Bias and Variance

Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i

Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]

Mean square error:
r(d, θ) = E[(d − θ)²]
        = (E[d] − θ)² + E[(d − E[d])²]
        = Bias² + Variance
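A simulation sketch of these quantities (made-up Gaussian data; the estimator d is the MLE variance with N = 10, which is biased):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 4.0                                    # true sigma^2
# d_i = d(X_i): MLE variance (np.var divides by N) on each sample X_i
d = np.array([np.var(rng.normal(0.0, 2.0, size=10)) for _ in range(20_000)])

bias = d.mean() - theta                        # b_theta(d) = E[d] - theta
variance = ((d - d.mean()) ** 2).mean()        # E[(d - E[d])^2]
mse = ((d - theta) ** 2).mean()                # r(d, theta) = E[(d - theta)^2]
print(bias, variance, mse, bias**2 + variance) # MSE = Bias^2 + Variance
```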


Bayes’ Estimator

Treat θ as a random variable with prior p(θ)

Bayes’ rule: p(θ|X) = p(X|θ) p(θ) / p(X)

Full Bayes: p(x|X) = ∫ p(x|θ) p(θ|X) dθ

Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)

Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)

Bayes’ estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ


Bayes’ Estimator: Example

x^t ~ N(θ, σ₀²) and θ ~ N(μ, σ²)

θ_ML = m

θ_MAP = θ_Bayes = E[θ|X]
      = (N/σ₀²) / (N/σ₀² + 1/σ²) · m + (1/σ²) / (N/σ₀² + 1/σ²) · μ
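A sketch of this posterior mean on made-up data (the prior N(μ, σ²) and the known σ₀² are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2 = 0.0, 1.0                 # prior: theta ~ N(mu, sigma^2)
sigma0_2 = 4.0                        # known sampling variance sigma_0^2
x = rng.normal(2.0, np.sqrt(sigma0_2), size=20)

N, m = len(x), x.mean()
w = (N / sigma0_2) / (N / sigma0_2 + 1.0 / sigma2)  # weight on the sample mean
theta_bayes = w * m + (1.0 - w) * mu                # E[theta|X]
print(m, theta_bayes)   # the posterior mean shrinks m toward the prior mean mu
```

As N grows, the weight on m approaches 1 and θ_Bayes approaches θ_ML.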


Parametric Classification

g_i(x) = p(x|C_i) P(C_i)

or equivalently

g_i(x) = log p(x|C_i) + log P(C_i)

With Gaussian class likelihoods

p(x|C_i) = (1 / (√(2π) σ_i)) exp[−(x − μ_i)² / (2σ_i²)]

the discriminant is

g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)


Given the sample X = {x^t, r^t}_{t=1}^N, where

r_i^t = 1 if x^t ∈ C_i
r_i^t = 0 if x^t ∈ C_j, j ≠ i

the ML estimates are

P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t x^t r_i^t / ∑_t r_i^t
s_i² = ∑_t (x^t − m_i)² r_i^t / ∑_t r_i^t

and the discriminant becomes

g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)
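A compact sketch of this fit-and-classify recipe on made-up two-class data (function names are illustrative, not from the book):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-1.0, 1.0, 60), rng.normal(2.0, 1.5, 40)])
r = np.zeros((100, 2)); r[:60, 0] = 1; r[60:, 1] = 1   # one-hot labels r_i^t

def fit(x, r):
    n_i = r.sum(axis=0)                                   # sum_t r_i^t
    P = n_i / len(x)                                      # P_hat(C_i)
    m = (x[:, None] * r).sum(axis=0) / n_i                # m_i
    s2 = ((x[:, None] - m) ** 2 * r).sum(axis=0) / n_i    # s_i^2
    return P, m, s2

def g(x, P, m, s2):   # discriminant (constant dropped); -log s_i = -0.5 log s_i^2
    return -0.5 * np.log(s2) - (x[:, None] - m) ** 2 / (2 * s2) + np.log(P)

P, m, s2 = fit(x, r)
pred = np.argmax(g(x, P, m, s2), axis=1)   # choose the class with largest g_i
print(P, m, s2, (pred == r.argmax(axis=1)).mean())
```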


Equal variances

Single boundary at halfway between means


Variances are different

Two boundaries


Regression

r = f(x) + ε, where ε ~ N(0, σ²)

Estimator: g(x|θ)

p(r|x) ~ N(g(x|θ), σ²)

L(θ|X) = log ∏_{t=1}^N p(x^t, r^t)
       = log ∏_{t=1}^N p(r^t|x^t) + log ∏_{t=1}^N p(x^t)


Regression: From LogL to Error

L(θ|X) = log ∏_{t=1}^N (1/(√(2π) σ)) exp[−(r^t − g(x^t|θ))² / (2σ²)]
       = −N log(√(2π) σ) − (1/(2σ²)) ∑_{t=1}^N [r^t − g(x^t|θ)]²

Since the first term and the factor 1/σ² do not depend on g, maximizing L(θ|X) is equivalent to minimizing the error

E(θ|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|θ)]²


Linear Regression

g(x^t|w₁, w₀) = w₁ x^t + w₀

Setting the derivatives of E(w₁, w₀|X) to zero gives two equations in two unknowns:

∑_t r^t = N w₀ + w₁ ∑_t x^t
∑_t r^t x^t = w₀ ∑_t x^t + w₁ ∑_t (x^t)²

In matrix form A w = y:

A = | N         ∑_t x^t    |
    | ∑_t x^t   ∑_t (x^t)² |

w = | w₀ |      y = | ∑_t r^t     |
    | w₁ |          | ∑_t r^t x^t |

w = A⁻¹ y
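A sketch of solving A w = y on made-up data (np.linalg.solve is used rather than forming A⁻¹ explicitly):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 10.0, 50)
r = 3.0 * x + 1.5 + rng.normal(0.0, 1.0, 50)   # made-up line plus noise

N = len(x)
A = np.array([[N,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, y)                 # w = A^{-1} y
print(w0, w1)                                  # close to 1.5 and 3.0
```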


Polynomial Regression

g(x^t|w_k, ..., w₂, w₁, w₀) = w_k (x^t)^k + ... + w₂ (x^t)² + w₁ x^t + w₀

D = | 1   x¹    (x¹)²   ...  (x¹)^k  |
    | 1   x²    (x²)²   ...  (x²)^k  |
    | ⋮                              |
    | 1   x^N   (x^N)²  ...  (x^N)^k |

r = | r¹  |
    | r²  |
    | ⋮   |
    | r^N |

w = (Dᵀ D)⁻¹ Dᵀ r
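A sketch of the pseudo-inverse fit on made-up cubic data; np.linalg.lstsq computes the same least-squares solution as (DᵀD)⁻¹Dᵀr, more stably:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-2.0, 2.0, 80)
r = 0.5 * x**3 - x + rng.normal(0.0, 0.3, 80)   # made-up cubic plus noise

k = 3
D = np.vander(x, k + 1, increasing=True)        # rows [1, x^t, (x^t)^2, (x^t)^3]
w, *_ = np.linalg.lstsq(D, r, rcond=None)       # least-squares w
print(w)                                        # close to [0, -1, 0, 0.5]
```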


Other Error Measures

Square error:
E(θ|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|θ)]²

Relative square error:
E(θ|X) = ∑_{t=1}^N [r^t − g(x^t|θ)]² / ∑_{t=1}^N (r^t − r̄)²

Absolute error:
E(θ|X) = ∑_t |r^t − g(x^t|θ)|

ε-sensitive error:
E(θ|X) = ∑_t 1(|r^t − g(x^t|θ)| > ε) · (|r^t − g(x^t|θ)| − ε)
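The four measures as plain functions of the targets r^t and predictions g(x^t|θ), as a sketch (names are illustrative):

```python
import numpy as np

def square_error(r, g):                 # (1/2) sum_t [r^t - g^t]^2
    return 0.5 * np.sum((r - g) ** 2)

def relative_square_error(r, g):        # normalized by sum_t (r^t - r_bar)^2
    return np.sum((r - g) ** 2) / np.sum((r - r.mean()) ** 2)

def absolute_error(r, g):               # sum_t |r^t - g^t|
    return np.sum(np.abs(r - g))

def eps_sensitive_error(r, g, eps):     # zero inside the eps-wide tube
    e = np.abs(r - g)
    return np.sum(np.where(e > eps, e - eps, 0.0))
```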


Bias and Variance

Expected square error at x:

E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                     (noise)                (squared error)

Taking the expectation over samples X:

E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
                            (bias²)                 (variance)


Estimating Bias and Variance

M samples X_i = {x_i^t, r_i^t}, i = 1, ..., M,
are used to fit g_i(x), i = 1, ..., M:

ḡ(x) = (1/M) ∑_i g_i(x)

Bias²(g) = (1/N) ∑_t [ḡ(x^t) − f(x^t)]²

Variance(g) = (1/(N M)) ∑_t ∑_i [g_i(x^t) − ḡ(x^t)]²
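A sketch of this procedure for a made-up target f(x) = sin x fitted by order-1 polynomials (M, N, and the noise level are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(8)
f = np.sin
x_eval = np.linspace(0.0, np.pi, 25)          # the N points x^t where g is evaluated
M, n_train = 100, 20

g = np.empty((M, len(x_eval)))
for i in range(M):                            # fit g_i on sample X_i
    xt = rng.uniform(0.0, np.pi, n_train)
    rt = f(xt) + rng.normal(0.0, 0.3, n_train)
    g[i] = np.polyval(np.polyfit(xt, rt, 1), x_eval)

g_bar = g.mean(axis=0)                        # g_bar(x) = (1/M) sum_i g_i(x)
bias2 = np.mean((g_bar - f(x_eval)) ** 2)     # (1/N) sum_t [g_bar(x^t) - f(x^t)]^2
variance = np.mean((g - g_bar) ** 2)          # (1/(N M)) sum_t sum_i [g_i - g_bar]^2
print(bias2, variance)
```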


Bias/Variance Dilemma

Example: g_i(x) = 2 has no variance and high bias;
g_i(x) = ∑_t r_i^t / N has lower bias, but with variance.

As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data).

Bias/variance dilemma (Geman et al., 1992)


[Figure: the true function f, the individual fits g_i and their average ḡ, illustrating bias (distance of ḡ from f) and variance (spread of the g_i around ḡ)]


Polynomial Regression

Best fit: "min error"


Best fit, "elbow"


Model Selection

Cross-validation: Measure generalization accuracy by testing on data unused during training

Regularization: Penalize complex models
E′ = error on data + λ · model complexity

Akaike's information criterion (AIC), Bayesian information criterion (BIC)

Minimum description length (MDL): Kolmogorov complexity, shortest description of the data

Structural risk minimization (SRM)
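A sketch of cross-validation for choosing polynomial order (a single train/validation split on made-up data, rather than k folds):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(-1.0, 1.0, 100)
r = np.sin(3.0 * x) + rng.normal(0.0, 0.2, 100)

idx = rng.permutation(100)
tr, va = idx[:70], idx[70:]            # data unused during training -> validation
for k in range(1, 8):
    w = np.polyfit(x[tr], r[tr], k)    # train on the training split only
    err = np.mean((r[va] - np.polyval(w, x[va])) ** 2)
    print(k, err)                      # pick the order with lowest validation error
```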


Bayesian Model Selection

p(model|data) = p(data|model) p(model) / p(data)

Prior on models, p(model)

Regularization, when the prior favors simpler models

Bayes: MAP of the posterior, p(model|data)

Average over a number of models with high posterior (voting, ensembles: Chapter 17)


Regression example

Coefficients increase in magnitude as order increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]

Regularization penalizes large coefficients:
E′(w|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|w)]² + λ ∑_i w_i²
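A sketch of minimizing the regularized error E′(w|X) in closed form on made-up data; with the 1/2 convention above, the minimizer is w = (DᵀD + 2λI)⁻¹ Dᵀ r (λ and the data are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(-1.0, 1.0, 30)
r = np.sin(3.0 * x) + rng.normal(0.0, 0.2, 30)

k, lam = 7, 1e-2
D = np.vander(x, k + 1, increasing=True)
w_ridge = np.linalg.solve(D.T @ D + 2.0 * lam * np.eye(k + 1), D.T @ r)
w_ols = np.linalg.lstsq(D, r, rcond=None)[0]
print(np.abs(w_ols).max(), np.abs(w_ridge).max())   # penalty shrinks the |w_i|
```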