Multivariate Gaussian Distribution
Leon Gu
CSD, CMU
Multivariate Gaussian
p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right\}

- Moment Parameterization: \mu = E(X), \Sigma = \mathrm{Cov}(X) = E[(X-\mu)(X-\mu)^T] (a symmetric, positive semi-definite matrix).
- Mahalanobis distance: \Delta^2 = (x-\mu)^T \Sigma^{-1} (x-\mu).
- Canonical Parameterization:
  p(x \mid \eta, \Lambda) = \exp\left\{ a + \eta^T x - \frac{1}{2} x^T \Lambda x \right\}
  where \Lambda = \Sigma^{-1}, \eta = \Sigma^{-1}\mu, and a = -\frac{1}{2}\left( n \log 2\pi - \log|\Lambda| + \eta^T \Lambda^{-1} \eta \right).
- Tons of applications (MoG, FA, PPCA, Kalman Filter, ...)
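As a quick numerical check of the two parameterizations, the sketch below (with arbitrary example values for \mu and \Sigma) verifies that the moment form and the canonical form give the same log-density at a test point:

```python
import numpy as np

# Assumed example parameters of a 2-D Gaussian (illustrative values).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
n = mu.size

# Canonical parameters: Lambda = Sigma^{-1}, eta = Sigma^{-1} mu.
Lambda = np.linalg.inv(Sigma)
eta = Lambda @ mu
sign, logdet = np.linalg.slogdet(Lambda)          # log|Lambda|, stable
# a = -1/2 (n log 2pi - log|Lambda| + eta^T Lambda^{-1} eta);
# note eta^T Lambda^{-1} eta = eta^T Sigma eta.
a = -0.5 * (n * np.log(2 * np.pi) - logdet + eta @ Sigma @ eta)

x = np.array([0.3, 0.7])
d = x - mu
# Moment form: -1/2 (n log 2pi - log|Lambda| + (x-mu)^T Lambda (x-mu))
logp_moment = -0.5 * (n * np.log(2 * np.pi) - logdet + d @ Lambda @ d)
# Canonical form: a + eta^T x - 1/2 x^T Lambda x
logp_canon = a + eta @ x - 0.5 * x @ Lambda @ x
print(np.isclose(logp_moment, logp_canon))        # True: the forms agree
```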
Multivariate Gaussian P(X_1, X_2)

P(X_1, X_2) (Joint Gaussian):
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}

P(X_2) (Marginal Gaussian):
\mu^m_2 = \mu_2, \quad \Sigma^m_2 = \Sigma_{22}

P(X_1 \mid X_2 = x_2) (Conditional Gaussian):
\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)
\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
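These block formulas translate directly into code. A minimal sketch, using assumed one-dimensional blocks for readability:

```python
import numpy as np

# Assumed example blocks of a joint Gaussian over (X1, X2).
mu1, mu2 = np.array([0.0]), np.array([1.0])
S11 = np.array([[2.0]]); S12 = np.array([[0.8]])
S21 = S12.T;             S22 = np.array([[1.0]])

# Marginal P(X2): simply read off the corresponding block.
mu_m2, Sigma_m2 = mu2, S22

# Conditional P(X1 | X2 = x2): shift the mean by the observed
# deviation, shrink the covariance by the explained part.
x2 = np.array([2.0])
S22inv = np.linalg.inv(S22)
mu_1g2 = mu1 + S12 @ S22inv @ (x2 - mu2)    # mu_{1|2}
Sigma_1g2 = S11 - S12 @ S22inv @ S21        # Sigma_{1|2}
print(mu_1g2, Sigma_1g2)                    # approx. [0.8], [[1.36]]
```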
Operations on Gaussian R.V.

The linear transform of a Gaussian r.v. is a Gaussian. Remember that no matter how X is distributed,

E(AX + b) = A\,E(X) + b
\mathrm{Cov}(AX + b) = A\,\mathrm{Cov}(X)\,A^T

This means that for Gaussian distributed quantities:

X \sim N(\mu, \Sigma) \;\Rightarrow\; AX + b \sim N(A\mu + b, \; A \Sigma A^T).
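A Monte-Carlo sanity check of the closed form, with assumed example values for \mu, \Sigma, A, and b:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[2.0, 0.0],
              [1.0, -1.0]])
b = np.array([0.5, 0.5])

# Closed-form parameters of Y = AX + b.
mu_y = A @ mu + b
Sigma_y = A @ Sigma @ A.T

# Compare against empirical moments of transformed samples.
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b
print(np.allclose(Y.mean(axis=0), mu_y, atol=0.05))
print(np.allclose(np.cov(Y.T), Sigma_y, atol=0.15))
```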
The sum of two independent Gaussian r.v.'s is a Gaussian.

Y = X_1 + X_2, \; X_1 \perp X_2 \;\Rightarrow\; \mu_Y = \mu_1 + \mu_2, \; \Sigma_Y = \Sigma_1 + \Sigma_2
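The same kind of sampling check works for the sum rule (example parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, S1 = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 1.0]])
mu2, S2 = np.array([2.0, -1.0]), np.array([[0.5, 0.0], [0.0, 0.5]])

# Independent draws, summed elementwise.
X1 = rng.multivariate_normal(mu1, S1, size=200_000)
X2 = rng.multivariate_normal(mu2, S2, size=200_000)
Y = X1 + X2

# Empirical moments should match mu1 + mu2 and S1 + S2.
print(np.allclose(Y.mean(axis=0), mu1 + mu2, atol=0.05))
print(np.allclose(np.cov(Y.T), S1 + S2, atol=0.15))
```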
The product of two Gaussian density functions is another Gaussian function (although no longer normalized).

N(a, A)\, N(b, B) \propto N(c, C),

where C = (A^{-1} + B^{-1})^{-1}, \; c = C(A^{-1}a + B^{-1}b).
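A sketch of this product rule, with assumed example values and an illustrative `logN` helper; the check confirms that the product differs from N(c, C) only by a constant factor (i.e. it is unnormalized but has the right shape):

```python
import numpy as np

def logN(x, m, S):
    """Log-density of N(x; m, S) (illustrative helper)."""
    d = x - m
    k = len(m)
    return -0.5 * (k * np.log(2 * np.pi)
                   + np.log(np.linalg.det(S))
                   + d @ np.linalg.inv(S) @ d)

# Assumed example factors N(a, A) and N(b, B).
a, A = np.array([0.0]), np.array([[2.0]])
b, B = np.array([1.0]), np.array([[1.0]])

Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)
C = np.linalg.inv(Ainv + Binv)         # C = (A^{-1} + B^{-1})^{-1}
c = C @ (Ainv @ a + Binv @ b)          # c = C(A^{-1} a + B^{-1} b)

# log[N(a,A) N(b,B)] - log N(c,C) should be the same constant at any x.
pts = (np.array([-1.0]), np.array([3.0]))
r = [logN(x, a, A) + logN(x, b, B) - logN(x, c, C) for x in pts]
print(np.isclose(r[0], r[1]))          # True: constant ratio
```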
Maximum Likelihood Estimate of \mu and \Sigma

Given a set of i.i.d. data X = \{x_1, \ldots, x_N\} drawn from N(x; \mu, \Sigma), we want to estimate (\mu, \Sigma) by MLE. The log-likelihood function is

\ln p(X \mid \mu, \Sigma) = -\frac{N}{2} \ln|\Sigma| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) + \text{const}

Taking its derivative w.r.t. \mu and setting it to zero, we have

\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n
Rewriting the log-likelihood using the "trace trick" (x^T A x = \mathrm{Tr}[A\,xx^T], together with linearity of the trace),

\ln p(X \mid \mu, \Sigma) = -\frac{N}{2} \ln|\Sigma| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) + \text{const}
= -\frac{N}{2} \ln|\Sigma| - \frac{1}{2} \sum_{n=1}^{N} \mathrm{Tr}\left[ \Sigma^{-1} (x_n - \mu)(x_n - \mu)^T \right] + \text{const}
= -\frac{N}{2} \ln|\Sigma| - \frac{1}{2} \mathrm{Tr}\left[ \Sigma^{-1} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T \right] + \text{const}
Taking the derivative w.r.t. \Sigma^{-1}, and using 1) \frac{\partial}{\partial A} \log|A| = A^{-T}; 2) \frac{\partial}{\partial A} \mathrm{Tr}[AB] = \frac{\partial}{\partial A} \mathrm{Tr}[BA] = B^T, we obtain

\hat{\Sigma} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu}) (x_n - \hat{\mu})^T.
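The two closed-form estimators above are just the sample mean and the (biased, 1/N-normalized) sample covariance. A minimal sketch with assumed ground-truth parameters, checking against NumPy's built-in estimator:

```python
import numpy as np

# Assumed ground-truth parameters; draw i.i.d. samples from N(mu, Sigma).
rng = np.random.default_rng(2)
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)

N = X.shape[0]
mu_hat = X.mean(axis=0)            # mu_hat = (1/N) sum_n x_n
D = X - mu_hat
Sigma_hat = D.T @ D / N            # Sigma_hat = (1/N) sum_n (x_n - mu_hat)(x_n - mu_hat)^T

# The MLE covariance uses 1/N, i.e. np.cov with ddof=0 (not the
# unbiased 1/(N-1) default).
print(np.allclose(Sigma_hat, np.cov(X.T, ddof=0)))   # True
```

With growing N, both estimators converge to the true parameters, though \hat{\Sigma} is biased for finite N.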