
EE269 Signal Processing for Machine Learning

Lecture 8

Instructor: Mert Pilanci

Stanford University

February 4, 2019


Recap: Linear and Quadratic Discriminant Analysis

- Suppose $x[n] = [x_1, \dots, x_N] \sim \mathcal{N}(\mu_k, \Sigma)$ when $y = k$:

  $$g_k(x) = p_{x \mid y=k} = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)}$$

- $K$ classes:

  $$f(x) = \arg\max_{k=1,\dots,K}\; \pi_k\, g_k(x)$$
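The MAP rule above is easy to translate into code. Below is a minimal NumPy sketch (the function names and data layout are our own assumptions, not from the slides) that evaluates $\log \pi_k + \log g_k(x)$ for each class and returns the argmax; working in the log domain avoids numerical underflow for large $N$.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, Sigma):
    """Log of the multivariate normal density N(mu, Sigma) evaluated at x."""
    N = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

def classify(x, mus, Sigmas, priors):
    """f(x) = argmax_k of log(pi_k) + log g_k(x)."""
    scores = [np.log(pi) + gaussian_log_likelihood(x, mu, Sigma)
              for mu, Sigma, pi in zip(mus, Sigmas, priors)]
    return int(np.argmax(scores))
```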


Scaled identity covariances: $\Sigma_k = \sigma^2 I$

- Suppose $x[n] = [x_1, \dots, x_N] \sim \mathcal{N}(\mu_k, \Sigma_k)$ when $y = k$:

  $$g_k(x) = p_{x \mid y=k} = \frac{1}{(2\pi)^{N/2}\,|\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k)}$$

- $K$ classes
- Decision boundary: hyperplane

  $$w^T(x - x_0) = 0, \qquad w = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\,\log\frac{\pi_i}{\pi_j}\,(\mu_i - \mu_j)$$

- The hyperplane passes through the point $x_0$ and is orthogonal to $w$.
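As a quick sanity check of these formulas, the sketch below plugs in assumed toy values for $\sigma^2$, the means, and the priors (illustrative only, not from the slides) and verifies that the two class discriminants agree at $x_0$, i.e., that $x_0$ lies on the decision boundary.

```python
import numpy as np

sigma2 = 0.5                                               # assumed shared variance sigma^2
mu_i, mu_j = np.array([1.0, 2.0]), np.array([-1.0, 0.5])   # assumed class means
pi_i, pi_j = 0.7, 0.3                                      # assumed priors

w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / np.sum(w**2) * np.log(pi_i / pi_j) * w

# Discriminant for Sigma_k = sigma^2 I, dropping terms common to all classes.
def h(x, mu, pi):
    return np.log(pi) - np.sum((x - mu)**2) / (2 * sigma2)

print(np.isclose(h(x0, mu_i, pi_i), h(x0, mu_j, pi_j)))    # True: x0 is on the boundary
```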



Identical covariances: $\Sigma_k = \Sigma$

- Suppose $x[n] = [x_1, \dots, x_N] \sim \mathcal{N}(\mu_k, \Sigma)$ when $y = k$:

  $$g_k(x) = p_{x \mid y=k} = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)}$$

- Decision boundary: hyperplane

  $$w^T(x - x_0) = 0, \qquad w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\log(\pi_i/\pi_j)}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)}\,(\mu_i - \mu_j)$$

- The hyperplane passes through $x_0$ but is not necessarily orthogonal to the line between the means.
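The same check works with a shared full covariance, now with $w = \Sigma^{-1}(\mu_i - \mu_j)$; the parameter values below are again assumed for illustration.

```python
import numpy as np

Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])                  # assumed shared covariance
mu_i, mu_j = np.array([1.0, 2.0]), np.array([-1.0, 0.5])    # assumed class means
pi_i, pi_j = 0.6, 0.4                                       # assumed priors

d = mu_i - mu_j
w = np.linalg.solve(Sigma, d)                               # w = Sigma^{-1} (mu_i - mu_j)
x0 = 0.5 * (mu_i + mu_j) - np.log(pi_i / pi_j) / (d @ w) * d

# Linear discriminant for a shared covariance, dropping class-independent terms.
def delta(x, mu, pi):
    v = np.linalg.solve(Sigma, mu)
    return x @ v - 0.5 * mu @ v + np.log(pi)

print(np.isclose(delta(x0, mu_i, pi_i), delta(x0, mu_j, pi_j)))  # True: x0 is on the boundary
# Note that w is generally not parallel to mu_i - mu_j, so the hyperplane
# need not be orthogonal to the segment joining the means.
```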



Quadratic Discriminant Analysis: $\Sigma_k$ arbitrary

- Suppose $x[n] = [x_1, \dots, x_N] \sim \mathcal{N}(\mu_k, \Sigma_k)$ when $y = k$:

  $$g_k(x) = p_{x \mid y=k} = \frac{1}{(2\pi)^{N/2}\,|\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k)}$$

- Discriminant: $h_k(x) = x^T W_k x + w_k^T x + w_{k0}$
- Classify as class $k$ if $h_k(x) > h_{k'}(x)$ for all $k' \neq k$, where

  $$W_k = -\tfrac{1}{2}\Sigma_k^{-1}, \qquad w_k = \Sigma_k^{-1}\mu_k, \qquad w_{k0} = -\tfrac{1}{2}\mu_k^T \Sigma_k^{-1}\mu_k - \tfrac{1}{2}\log|\Sigma_k| + \log \pi_k$$
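A direct NumPy transcription of $h_k(x)$ (a sketch with our own function name; it assumes each $\Sigma_k$ is positive definite):

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, pi):
    """h_k(x) = x^T W_k x + w_k^T x + w_k0 with W_k, w_k, w_k0 as defined above."""
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.linalg.slogdet(Sigma)[1] + np.log(pi)
    return x @ W @ x + w @ x + w0
```

Classification then evaluates $h_k$ for every class and picks the largest, exactly as in the rule above.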


Quadratic decision regions: hyperquadrics


Estimating parameters: univariate Gaussian

- Suppose $x_1, x_2, \dots, x_n$ are i.i.d. $\sim \mathcal{N}(\mu, \sigma^2)$
- Estimating the mean:

  $$\mu_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

- Estimating the variance:

  $$\sigma^2_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_{\mathrm{ML}})^2$$
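For reference, the mean estimate follows by setting the derivative of the log-likelihood to zero (a standard derivation that the slide leaves implicit):

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2, \qquad \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \;\Longrightarrow\; \mu_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Setting $\partial \ell / \partial \sigma^2 = 0$ with $\mu = \mu_{\mathrm{ML}}$ then gives the variance estimate above.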


Estimating parameters: multivariate Gaussian

- Suppose $x_1, x_2, \dots, x_n$ are i.i.d. $\sim \mathcal{N}(\mu, \Sigma)$
- Estimating the mean:

  $$\mu_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

- Estimating the covariance:

  $$\Sigma_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_{\mathrm{ML}})(x_i - \mu_{\mathrm{ML}})^T$$
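In NumPy these estimates are a few lines (a sketch; note the $1/n$ normalization of the ML covariance, as opposed to the unbiased $1/(n-1)$ that np.cov uses by default):

```python
import numpy as np

def gaussian_ml_estimates(X):
    """ML mean and covariance for the rows of X (shape n x N)."""
    n = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / n          # 1/n, the ML (not unbiased) estimate
    return mu, Sigma

# Example on synthetic data drawn with an assumed mean and covariance.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.2], [0.2, 0.5]], size=1000)
mu_hat, Sigma_hat = gaussian_ml_estimates(X)
```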



Linear vs Quadratic Discriminant Analysis

- LDA: estimate $\mu_k$ for $k = 1, \dots, K$ and a shared $\Sigma$

  $$Kn + \binom{n}{2} + n \ \text{parameters}$$

- QDA: estimate $\mu_k$ and $\Sigma_k$ for $k = 1, \dots, K$

  $$Kn + K\left(\binom{n}{2} + n\right) \ \text{parameters}$$
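For concreteness, with $n = 10$ features and $K = 3$ classes (numbers chosen purely for illustration):

$$\text{LDA: } Kn + \binom{n}{2} + n = 30 + 45 + 10 = 85, \qquad \text{QDA: } Kn + K\Big(\binom{n}{2} + n\Big) = 30 + 3 \cdot 55 = 195.$$

The covariance parameters in QDA scale with $K$, which is why QDA typically needs substantially more training data per class.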



Regularized Linear Discriminant Analysis

- Maximum-likelihood covariance estimate:

  $$\Sigma_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_{\mathrm{ML}})(x_i - \mu_{\mathrm{ML}})^T$$

- Regularized estimate:

  $$\hat{\Sigma} = (1-\alpha)\,\mathrm{diag}(\Sigma_{\mathrm{ML}}) + \alpha\,\Sigma_{\mathrm{ML}}$$

- Diagonal Linear Discriminant Analysis ($\alpha = 0$):

  $$\hat{\Sigma} = \mathrm{diag}(\Sigma_{\mathrm{ML}})$$
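The shrinkage estimate is a one-liner (NumPy sketch; $\mathrm{diag}(\Sigma)$ here denotes the diagonal matrix formed from the diagonal of $\Sigma$):

```python
import numpy as np

def regularized_covariance(Sigma_ml, alpha):
    """(1 - alpha) * diag(Sigma_ml) + alpha * Sigma_ml.
    alpha = 1 recovers the ML estimate; alpha = 0 gives diagonal LDA."""
    return (1.0 - alpha) * np.diag(np.diag(Sigma_ml)) + alpha * Sigma_ml
```

Values of $\alpha$ between 0 and 1 interpolate between diagonal LDA and ordinary LDA; in practice $\alpha$ would be chosen on validation data.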


[Figure] (slide credit: T. Hastie et al.)


Optimal basis change and dimension reduction

- Decision boundary: $w^T(x - x_0) = 0$; e.g., in LDA with equal covariances, $w = \Sigma^{-1}(\mu_i - \mu_j)$
- Classification is based on the scalar $w^T x \in \mathbb{R}$, i.e., on a projection of $x$ onto a one-dimensional subspace


Mean of the projected data

- Project onto a direction $a$: $y = a^T x$
- Class means of the data:

  $$\mu_1 = \frac{1}{N_1}\sum_{i \in \text{class } 1} x_i, \qquad \mu_2 = \frac{1}{N_2}\sum_{i \in \text{class } 2} x_i$$

- The means of the projected data are then $a^T \mu_1$ and $a^T \mu_2$.


Fisher’s LDA

- $\mu_k = E[x \mid x \text{ comes from class } k]$
- $\Sigma_k = E[(x - \mu_k)(x - \mu_k)^T \mid x \text{ comes from class } k]$
- Classify using a scalar feature $y = a^T x$, with

  $$\beta_k = E[y \mid x \text{ comes from class } k], \qquad \sigma_k^2 = E[(y - \beta_k)^2 \mid x \text{ comes from class } k]$$

- Choose $a$ to maximize

  $$\max_a \frac{(\beta_1 - \beta_2)^2}{\sigma_1^2 + \sigma_2^2}$$



Fisher’s LDA

$$\beta_k = E[y \mid x \text{ comes from class } k] = a^T \mu_k$$

$$\sigma_k^2 = E[(y - \beta_k)^2 \mid x \text{ comes from class } k] = E[(a^T(x - \mu_k))^2] = E[a^T(x - \mu_k)(x - \mu_k)^T a] = a^T \Sigma_k a$$

$$\max_a \frac{(\beta_1 - \beta_2)^2}{\sigma_1^2 + \sigma_2^2} = \max_a \frac{\left(a^T(\mu_1 - \mu_2)\right)^2}{a^T(\Sigma_1 + \Sigma_2)a} = \max_a \frac{a^T Q a}{a^T P a}$$

where $Q = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ and $P = \Sigma_1 + \Sigma_2$.
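A direct NumPy translation of the criterion and of its $Q, P$ form (the function names are ours; the slides define only the math):

```python
import numpy as np

def fisher_criterion(a, mu1, mu2, Sigma1, Sigma2):
    """(beta_1 - beta_2)^2 / (sigma_1^2 + sigma_2^2) for the projection y = a^T x."""
    num = (a @ (mu1 - mu2)) ** 2
    den = a @ (Sigma1 + Sigma2) @ a
    return num / den

def fisher_criterion_qp(a, Q, P):
    """Equivalent form a^T Q a / (a^T P a)."""
    return (a @ Q @ a) / (a @ P @ a)

# The two agree when Q = np.outer(mu1 - mu2, mu1 - mu2) and P = Sigma1 + Sigma2.
```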



Maximizing quadratic forms

$$\max_a \frac{a^T Q a}{a^T a}$$

- Eigenvalue decomposition: $Q = U \Lambda U^T$
- Change of basis: $b = U^T a$, i.e., $a = U b$

  $$\max_a \frac{a^T U \Lambda U^T a}{a^T a} = \max_b \frac{b^T \Lambda b}{b^T U^T U b} = \max_b \frac{b^T \Lambda b}{b^T b}$$

- The optimum is attained at $b = \delta[n - k^*]$, where $k^* = \arg\max_k \Lambda_{kk}$; with the eigenvalues sorted in decreasing order, $k^* = 1$
- Solution: $a = u_1$, the maximal eigenvector, i.e., $Q u_1 = \lambda_1 u_1$
- Optimal value: $\lambda_1$
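A quick numerical check that the Rayleigh quotient is maximized by the top eigenvector; the matrix below is random and purely illustrative. Note that np.linalg.eigh returns eigenvalues in ascending order, so the maximal eigenvector is the last column.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Q = A @ A.T                                     # a random symmetric (PSD) test matrix

evals, U = np.linalg.eigh(Q)                    # ascending eigenvalues
a_star, lam_max = U[:, -1], evals[-1]           # u_1 and lambda_1 in the slide's notation

rayleigh = lambda a: (a @ Q @ a) / (a @ a)
print(np.isclose(rayleigh(a_star), lam_max))                      # optimum attains lambda_1
print(all(rayleigh(rng.standard_normal(4)) <= lam_max + 1e-9
          for _ in range(1000)))                                  # no random direction beats it
```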



Maximizing quadratic forms: two quadratics

$$\max_a \frac{a^T Q a}{a^T P a}$$

- Theorem (Simultaneous Diagonalization). Let $P, Q \in \mathbb{R}^{n \times n}$ be real symmetric matrices with $P$ positive definite. Then there exists a matrix $V$ such that

  $$V^T P V = I, \qquad V^T Q V = \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n),$$

  where $V$ and $\Lambda$ satisfy the generalized eigenvalue equation $Q v_i = \lambda_i P v_i$.



Maximizing quadratic forms: two quadratics

- Theorem (Simultaneous Diagonalization). Let $P, Q \in \mathbb{R}^{n \times n}$ be real symmetric matrices with $P$ positive definite. Then there exists a matrix $V$ such that

  $$V^T P V = I, \qquad V^T Q V = \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n),$$

  where $V$ and $\Lambda$ satisfy the generalized eigenvalue equation $Q v_i = \lambda_i P v_i$.

- Proof: Let $P = U_P \Lambda_P U_P^T$ be its eigenvalue decomposition.
  $V' = U_P \Lambda_P^{-1/2}$ diagonalizes only $P$ (it gives $V'^T P V' = I$, but $V'^T Q V'$ is in general not diagonal).
  Let $V'^T Q V' = U' \Lambda' U'^T$ be its eigenvalue decomposition and set $V = V' U'$.
  Then $V^T P V = U'^T U' = I$ and $V^T Q V = \Lambda'$.
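The proof is constructive, so it translates directly into code. The sketch below follows it step by step; the test matrices are random and assumed only for the check.

```python
import numpy as np

def simultaneous_diagonalization(P, Q):
    """Return V, lam with V^T P V = I and V^T Q V = diag(lam), following the proof:
    whiten P via V' = U_P Lambda_P^{-1/2}, then eigendecompose V'^T Q V'."""
    lam_P, U_P = np.linalg.eigh(P)                    # P = U_P diag(lam_P) U_P^T, P assumed PD
    V_prime = U_P / np.sqrt(lam_P)                    # scales column j by lam_P[j]^{-1/2}
    lam, U = np.linalg.eigh(V_prime.T @ Q @ V_prime)
    return V_prime @ U, lam

rng = np.random.default_rng(2)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
P, Q = A @ A.T + 3 * np.eye(3), B + B.T               # P positive definite, Q symmetric
V, lam = simultaneous_diagonalization(P, Q)
print(np.allclose(V.T @ P @ V, np.eye(3)))            # V^T P V = I
print(np.allclose(V.T @ Q @ V, np.diag(lam)))         # V^T Q V = Lambda
print(np.allclose(Q @ V, P @ V @ np.diag(lam)))       # Q v_i = lambda_i P v_i
```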


Maximizing quadratic forms: two quadratics

$$\max_a \frac{a^T Q a}{a^T P a}$$

- Let $V$ and $\Lambda$ satisfy the generalized eigenvalue equation $Q v_i = \lambda_i P v_i$
- Basis change $a = V b$, i.e., $b = V^{-1} a$ (note $V$ is invertible since $V^T P V = I$):

  $$\max_b \frac{b^T V^T Q V b}{b^T V^T P V b} = \max_b \frac{b^T \Lambda b}{b^T b}$$

- Solution: $a = v_1$, the maximal generalized eigenvector
- Optimal value: $\lambda_1$, the maximum generalized eigenvalue
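In practice the pair $(V, \Lambda)$ can also be obtained with scipy.linalg.eigh in its generalized mode, which solves $Qv = \lambda P v$ for symmetric $Q$ and positive definite $P$. The check below, on assumed random matrices, confirms that the top generalized eigenvector attains $\lambda_1$ and that no random direction exceeds it.

```python
import numpy as np
from scipy.linalg import eigh                     # generalized symmetric-definite eigensolver

rng = np.random.default_rng(3)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
Q, P = B + B.T, A @ A.T + 4 * np.eye(4)           # Q symmetric, P positive definite (illustrative)

lam, V = eigh(Q, P)                               # solves Q v = lam P v, eigenvalues ascending
a_star, lam_max = V[:, -1], lam[-1]               # v_1 and lambda_1 in the slide's notation

ratio = lambda a: (a @ Q @ a) / (a @ P @ a)
print(np.isclose(ratio(a_star), lam_max))                                     # optimum = lambda_1
print(all(ratio(rng.standard_normal(4)) <= lam_max + 1e-9 for _ in range(1000)))
```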



Fisher’s LDA

$$\max_a \frac{a^T Q a}{a^T P a}, \qquad \text{where } Q = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \text{ and } P = \Sigma_1 + \Sigma_2$$

- Solution: $Q a = \lambda P a$, therefore $P^{-1} Q a = \lambda a$:

  $$P^{-1}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T a = \lambda a \;\Longrightarrow\; a = \text{constant} \times P^{-1}(\mu_1 - \mu_2)$$

- It can be normalized as $a := \dfrac{P^{-1}(\mu_1 - \mu_2)}{\|P^{-1}(\mu_1 - \mu_2)\|_2}$
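Putting it together, Fisher's direction needs only one linear solve. The sketch below uses assumed toy means and covariances, computes $a \propto P^{-1}(\mu_1 - \mu_2)$, and checks numerically that no random direction beats it on the criterion.

```python
import numpy as np

def fisher_direction(mu1, mu2, Sigma1, Sigma2):
    """a proportional to (Sigma1 + Sigma2)^{-1} (mu1 - mu2), normalized to unit norm."""
    a = np.linalg.solve(Sigma1 + Sigma2, mu1 - mu2)
    return a / np.linalg.norm(a)

# Assumed toy parameters (illustrative only).
mu1, mu2 = np.array([2.0, 0.0]), np.array([0.0, 1.0])
S1 = np.array([[1.0, 0.2], [0.2, 0.5]])
S2 = np.array([[0.8, -0.1], [-0.1, 1.2]])

a = fisher_direction(mu1, mu2, S1, S2)
J = lambda v: (v @ (mu1 - mu2)) ** 2 / (v @ (S1 + S2) @ v)

rng = np.random.default_rng(4)
print(all(J(rng.standard_normal(2)) <= J(a) + 1e-9 for _ in range(1000)))   # a maximizes J
```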
