### Transcript of Introduction to Machine Learning - Kangwon leeck/AI2/04-1.pdf

Date posted: 27-Feb-2021

• Parametric Estimation

 Sample: X = { xᵗ }ₜ where xᵗ ~ p(x)

 Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X

 e.g., N(μ, σ²) where θ = {μ, σ²}

 Sufficient statistic (Wikipedia): a statistic which has the property of sufficiency with respect to a statistical model and its associated unknown parameter

 e.g., Bernoulli distribution with p(x) = p^x (1 − p)^(1−x): for X₁, ..., Xₙ, T(X) = X₁ + ... + Xₙ is a sufficient statistic for p.

• Maximum Likelihood Estimation

 Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏ₜ p(xᵗ|θ)

 Log likelihood: L(θ|X) = log l(θ|X) = ∑ₜ log p(xᵗ|θ)

 Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X)

• Examples: Bernoulli

 Bernoulli: two states, failure/success, x ∈ {0,1}

 P(x) = p₀^x (1 − p₀)^(1−x)

 L(p₀|X) = log ∏ₜ p₀^(xᵗ) (1 − p₀)^(1−xᵗ)

 MLE (set dL/dp₀ = 0): p₀ = ∑ₜ xᵗ / N
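The Bernoulli MLE above is just the fraction of successes in the sample. A minimal sketch, using a made-up sequence of coin flips:

```python
import numpy as np

# Made-up Bernoulli sample (coin flips); the MLE is p0 = (sum_t x^t) / N.
X = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
p0_hat = X.sum() / len(X)
print(p0_hat)  # 0.7
```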

• Examples: Multinomial

 Multinomial: K > 2 states, xᵢ ∈ {0,1}

 P(x₁, x₂, ..., x_K) = ∏ᵢ pᵢ^(xᵢ)

 L(p₁, p₂, ..., p_K|X) = log ∏ₜ ∏ᵢ pᵢ^(xᵢᵗ)

 MLE: pᵢ = ∑ₜ xᵢᵗ / N

 Finding the local maxima and minima of a function subject to constraints (constrained optimization):

  ∑ᵢ pᵢ = 1

  Lagrange multiplier, Karush–Kuhn–Tucker (KKT) conditions
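Despite the constraint ∑ᵢ pᵢ = 1, the multinomial MLE has the same closed form as the Bernoulli case: relative frequencies. A small sketch with made-up one-hot data:

```python
import numpy as np

# One-hot coded multinomial sample (each row is one draw over K = 3 states;
# values are made up). The constrained MLE, with sum_i p_i = 1 enforced via
# a Lagrange multiplier, reduces to p_i = (sum_t x_i^t) / N.
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1]])
p_hat = X.sum(axis=0) / X.shape[0]
print(p_hat)  # p_hat == [0.25, 0.5, 0.25]
```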

• Lagrange multiplier (Wikipedia)

 Maximize f(x, y) subject to g(x, y) = c: form the Lagrangian Λ(x, y, λ) = f(x, y) + λ (g(x, y) − c) and solve ∇Λ = 0

• Karush–Kuhn–Tucker conditions (Wikipedia)

 KKT multipliers: Lₚ = f(x) + ∑ᵢ₌₁ᵐ μᵢ gᵢ(x) + ∑ⱼ₌₁ˡ λⱼ hⱼ(x)

• Gaussian (Normal) Distribution

 p(x) = N(μ, σ²):

 p(x) = (1 / (√(2π) σ)) exp[ −(x − μ)² / (2σ²) ]

 MLE for μ and σ²:

 m = ∑ₜ xᵗ / N    s² = ∑ₜ (xᵗ − m)² / N
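The Gaussian MLE formulas above are the sample mean and the average squared deviation. A minimal sketch with illustrative values:

```python
import numpy as np

# MLE for a Gaussian: m is the sample mean, s^2 the average squared
# deviation (1/N normalizer, not 1/(N-1)). Sample values are made up.
x = np.array([2.1, 1.9, 2.4, 2.0, 1.6])
m = x.mean()
s2 = ((x - m) ** 2).mean()   # equivalently x.var(ddof=0)
print(m, s2)
```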

• Bias and Variance

 Unknown parameter θ; estimator dᵢ = d(Xᵢ) on sample Xᵢ

 Bias: b_θ(d) = E[d] − θ

 Variance: E[(d − E[d])²]

 Mean square error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance
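As a concrete case of bias, the 1/N variance estimator s² from the previous slide satisfies E[s²] = ((N − 1)/N) σ², so it is biased low. A quick simulation sketch (seed, N, and trial count are arbitrary choices):

```python
import numpy as np

# Empirical check that the 1/N variance estimator is biased low:
# for x ~ N(0, 1) and N = 5, E[s^2] = (N - 1)/N = 0.8, not 1.0.
rng = np.random.default_rng(0)
N, trials = 5, 200_000
samples = rng.standard_normal((trials, N))
s2 = samples.var(axis=1, ddof=0)   # MLE estimator on each sample
print(s2.mean())                    # close to 0.8
```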

• Bayes’ Estimator

 Treat θ as a random variable with prior p(θ)

 Bayes’ rule: p(θ|X) = p(X|θ) p(θ) / p(X)

 Full: p(x|X) = ∫ p(x|θ) p(θ|X) dθ

 For easier calculation, reduce it to a single point:

  Maximum a Posteriori (MAP): θ_MAP = argmax_θ p(θ|X)

  Maximum Likelihood (ML): prior density is flat; θ_ML = argmax_θ p(X|θ)

  Bayes’: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ; the best estimate of a random variable is its mean!

• Bayes’ Estimator: Example

 xᵗ ~ N(θ, σ²) and θ ~ N(μ₀, σ₀²), with p(xᵗ|θ) = (1 / (√(2π) σ)) exp[ −(xᵗ − θ)² / (2σ²) ]

 θ_ML = m = ∑ₜ xᵗ / N

 θ_MAP = θ_Bayes = E[θ|X] = [ (N/σ²) / (N/σ² + 1/σ₀²) ] m + [ (1/σ₀²) / (N/σ² + 1/σ₀²) ] μ₀
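The posterior mean above is a precision-weighted average of the sample mean and the prior mean. A sketch with illustrative sample values and hyperparameters:

```python
import numpy as np

# Posterior mean of theta for x^t ~ N(theta, sigma^2), theta ~ N(mu0, sigma0^2):
# E[theta|X] = [(N/sigma^2) m + (1/sigma0^2) mu0] / (N/sigma^2 + 1/sigma0^2).
# Sample and hyperparameters below are made up.
x = np.array([4.8, 5.2, 5.1, 4.9])
sigma2, mu0, sigma0_2 = 1.0, 0.0, 10.0
N, m = len(x), x.mean()
prec_lik, prec_prior = N / sigma2, 1.0 / sigma0_2
theta_bayes = (prec_lik * m + prec_prior * mu0) / (prec_lik + prec_prior)
print(theta_bayes)  # between mu0 and m, pulled almost all the way to m
```

With a broad prior (large σ₀²) the estimate stays close to the ML answer m; a sharp prior would pull it toward μ₀.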

• SVM vs. MED

• Support Vector Machine (SVM)

 min_{w,b,ξ} (1/2)‖w‖² + C ∑ᵢ₌₁ⁿ ξᵢ
 s.t. yᵢ (wᵀxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0, ∀i

• Maximum Entropy Discrimination (MED)

 Combines max-margin learning with Bayesian-style learning

 min_{p(w),ξ} KL( p(w) || p₀(w) ) + C ∑ᵢ₌₁ᴺ ξᵢ
 s.t. yᵢ ∫ p(w) wᵀf(xᵢ) dw ≥ 1 − ξᵢ, ∀i

 SVM is a special case of MED: p₀(w) ~ N(0, I)
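A minimal sketch of the SVM primal, minimized by plain subgradient descent on a made-up linearly separable 2-D toy set (learning rate and iteration count are arbitrary; a real solver would use the dual or a library):

```python
import numpy as np

# SVM primal: min_{w,b} 0.5*||w||^2 + C * sum_i hinge(y_i * (w.x_i + b)),
# optimized with subgradient steps on a toy dataset.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(2000):
    viol = y * (X @ w + b) < 1                  # margin violations (hinge active)
    w -= lr * (w - C * (y[viol, None] * X[viol]).sum(axis=0))
    b -= lr * (-C * y[viol].sum())

print(np.sign(X @ w + b))  # recovers the labels y
```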

• Margin-based Learning paradigm (ICML)

• Parametric Classification

 Discriminant: gᵢ(x) = p(x|Cᵢ) P(Cᵢ)  or  gᵢ(x) = log p(x|Cᵢ) + log P(Cᵢ)

 With Gaussian class-conditional densities

 p(x|Cᵢ) = (1 / (√(2π) σᵢ)) exp[ −(x − μᵢ)² / (2σᵢ²) ]

 the discriminant becomes

 gᵢ(x) = −(1/2) log 2π − log σᵢ − (x − μᵢ)² / (2σᵢ²) + log P(Cᵢ)

• Parametric Classification (cont.)

 Given the sample X = { xᵗ, rᵗ }ₜ₌₁ᴺ where rᵢᵗ = 1 if xᵗ ∈ Cᵢ and rᵢᵗ = 0 if xᵗ ∈ Cⱼ, j ≠ i

 ML estimates are:

 P̂(Cᵢ) = ∑ₜ rᵢᵗ / N    mᵢ = ∑ₜ xᵗ rᵢᵗ / ∑ₜ rᵢᵗ    sᵢ² = ∑ₜ (xᵗ − mᵢ)² rᵢᵗ / ∑ₜ rᵢᵗ

 Discriminant becomes:

 gᵢ(x) = −(1/2) log 2π − log sᵢ − (x − mᵢ)² / (2sᵢ²) + log P̂(Cᵢ)
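The ML estimates above can be computed directly from a labeled sample. A sketch with toy 1-D values and two classes:

```python
import numpy as np

# Per-class ML estimates from indicator labels r^t:
# P(C_i) = mean of r_i^t, m_i = class mean, s_i^2 = class variance (1/N_i).
x = np.array([1.0, 1.2, 0.8, 4.0, 4.4, 3.6])
r = np.array([0, 0, 0, 1, 1, 1])        # class index of each example
priors, means, variances = [], [], []
for i in (0, 1):
    xi = x[r == i]
    priors.append(len(xi) / len(x))
    means.append(xi.mean())
    variances.append(((xi - xi.mean()) ** 2).mean())
print(priors, means, variances)
```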

• Equal variances: single boundary at halfway between the means

• Variances are different: two boundaries
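For the equal-variance, equal-prior case, the shared terms of the discriminants cancel and the single boundary sits at the midpoint of the means. A quick numeric check (means and variance are made up):

```python
import numpy as np

# Discriminants with equal variances and equal priors; common terms cancel,
# so g_i(x) reduces to -(x - m_i)^2 / (2 s^2). They cross at (m1 + m2) / 2.
m1, m2, s2 = 1.0, 4.0, 1.0
xs = np.linspace(m1, m2, 100001)
g1 = -((xs - m1) ** 2) / (2 * s2)
g2 = -((xs - m2) ** 2) / (2 * s2)
boundary = xs[np.argmin(np.abs(g1 - g2))]
print(boundary)  # 2.5, halfway between the means
```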