CS839: Probabilistic Graphical Models
Lecture 7: Learning Fully Observed BNs
Theo Rekatsinas
Exponential family: a basic building block
• For a numeric random variable X,
  p(x | \eta) = h(x) \exp\left( \eta^\top T(x) - A(\eta) \right) = \frac{1}{Z(\eta)} h(x) \exp\left( \eta^\top T(x) \right)
  is an exponential family distribution with natural (canonical) parameter η.
• The function T(x) is a sufficient statistic.
• The function A(η) = log Z(η) is the log normalizer.
• Examples: Bernoulli, multinomial, Gaussian, Poisson, Gamma, categorical
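As a standard worked example (not spelled out on the original slide), the Bernoulli distribution with mean π fits this template:
  p(x | \pi) = \pi^x (1 - \pi)^{1-x} = \exp\left( x \log\frac{\pi}{1-\pi} + \log(1 - \pi) \right)
so \eta = \log\frac{\pi}{1-\pi}, T(x) = x, A(\eta) = \log(1 + e^{\eta}), and h(x) = 1.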
Why exponential family?
• Moment generating property:
  We can easily compute the moments of any exponential family distribution by taking derivatives of the log normalizer A(η).
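Concretely (a standard identity, stated here for completeness), the first two derivatives give the mean and variance of the sufficient statistic:
  \frac{dA(\eta)}{d\eta} = E_{p(x|\eta)}[T(x)], \qquad \frac{d^2 A(\eta)}{d\eta^2} = \mathrm{Var}_{p(x|\eta)}[T(x)]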
MLE for Exponential Family
• For iid data the log-likelihood is
  \ell(\eta; D) = \sum_n \log h(x_n) + \eta^\top \left( \sum_n T(x_n) \right) - N A(\eta)
• We take the derivative and set it to zero:
  \frac{\partial \ell}{\partial \eta} = \sum_n T(x_n) - N \frac{\partial A(\eta)}{\partial \eta} = 0
• We perform moment matching:
  \frac{\partial A(\hat{\eta})}{\partial \eta} = \hat{\mu}_{MLE} = \frac{1}{N} \sum_n T(x_n)
• We can infer the canonical parameters by inverting the moment mapping: \hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE})
Generalized Linear Models
• The graphical model:
• Linear regression
• Discriminative linear classification
• Generalized Linear Model:
  • The observed input x is assumed to enter the model via a linear combination of its elements, \xi = \theta^\top x.
  • The conditional mean μ is represented as a function f(ξ) of ξ, where f is known as the response function.
  • The observed output y is assumed to follow an exponential family distribution with conditional mean
    E_p(T) = \mu = f(\theta^\top x)
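Two familiar instances (standard examples, added here for concreteness):
  \text{Linear regression: } y \sim \mathcal{N}(\mu, \sigma^2), \quad f(\xi) = \xi, \quad \mu = \theta^\top x
  \text{Logistic regression: } y \sim \mathrm{Bernoulli}(\mu), \quad f(\xi) = \frac{1}{1 + e^{-\xi}}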
Learning in Graphical Models
• Goal: Given a set of independent samples (assignments to random variables), find the best Bayesian network (both the DAG and the CPDs), e.g.,
  (B,E,A,C,R) = (T,F,F,T,F)
  (B,E,A,C,R) = (T,F,T,T,F)
  …
  (B,E,A,C,R) = (F,T,T,T,F)
• Structure learning: find the DAG (covered later in the class)
• Parameter learning: estimate the CPDs for a given DAG
Parameter Estimation for Fully Observed GMs
• The data: D = (x_1, x_2, x_3, …, x_N)
• Assume the graph G is known and fixed
  • From expert design or structure learning
• Goal: estimate the parameters from a dataset of N independent, identically distributed (iid) training examples D
  • Each training example corresponds to a vector of M values, one per node random variable
  • The model must be completely observable: no missing values, no hidden variables
Density estimation
• The construction of an estimate, based on observed data, of an unobservable underlying probability density function
• Can be viewed as a single-node graphical model
• Instances of the exponential family of distributions
• Building blocks of general GMs
• MLE and Bayesian estimation
Discrete Distributions
• Bernoulli distribution:
  P(x) = p^x (1 - p)^{1-x}
• Multinomial distribution: Mult(1, θ), e.g., a single roll of a six-sided die:
  X = [X_1, X_2, X_3, X_4, X_5, X_6], \quad X_j \in \{0, 1\}, \quad \sum_{j \in [1,\dots,6]} X_j = 1
  X_j = 1 \text{ with probability } \theta_j, \quad \sum_{j \in [1,\dots,6]} \theta_j = 1, \quad P(X_j = 1) = \theta_j
Discrete Distributions
• Multinomial distribution: Mult(n, θ)
  n = [n_1, n_2, \dots, n_K] \text{ where } \sum_j n_j = N
  p(n) = \frac{N!}{n_1! \, n_2! \cdots n_K!} \, \theta_1^{n_1} \theta_2^{n_2} \cdots \theta_K^{n_K}
Example: multinomial model
• Data: We observed N iid die rolls (K-sided): D = {5, 1, K, …, 3}
• Model:
  x_n = [x_{n,1}, x_{n,2}, \dots, x_{n,K}] \text{ where } x_{n,k} \in \{0, 1\} \text{ and } \sum_{k=1}^{K} x_{n,k} = 1
  X_{n,k} = 1 \text{ with probability } \theta_k, \text{ and } \sum_{k \in \{1,\dots,K\}} \theta_k = 1
• Likelihood of an observation:
  P(x_n) = P(\{x_{n,k} = 1, \text{ where } k \text{ is the index of the } n\text{-th roll}\}) = \theta_k = \theta_1^{x_{n,1}} \theta_2^{x_{n,2}} \cdots \theta_K^{x_{n,K}} = \prod_{k=1}^{K} \theta_k^{x_{n,k}}
• Likelihood of D:
  P(x_1, x_2, \dots, x_N | \theta) = \prod_{n=1}^{N} P(x_n | \theta) = \prod_k \theta_k^{n_k}, \text{ where } n_k = \sum_n x_{n,k}
MLE: constrained optimization
• Objective function:
  \ell(\theta; D) = \log P(D | \theta) = \log \prod_k \theta_k^{n_k} = \sum_k n_k \log \theta_k
• We need to maximize this subject to the constraint \sum_k \theta_k = 1
• Lagrange multipliers:
  \bar{\ell}(\theta; D) = \sum_k n_k \log \theta_k + \lambda \left( 1 - \sum_k \theta_k \right)
• Derivatives:
  \frac{\partial \bar{\ell}}{\partial \theta_k} = \frac{n_k}{\theta_k} - \lambda = 0
  n_k = \lambda \theta_k \;\Rightarrow\; \sum_k n_k = \lambda \sum_k \theta_k \;\Rightarrow\; N = \lambda
  \hat{\theta}_{k,MLE} = \frac{1}{N} \sum_n x_{n,k} = \frac{n_k}{N}
• Sufficient statistics? The counts n_k = \sum_n x_{n,k}.
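A minimal numerical sketch of this result (the function name is illustrative; observations are assumed one-hot encoded):

```python
import numpy as np

def multinomial_mle(X):
    """MLE for Mult(1, theta) from one-hot iid samples.

    X: (N, K) array where each row x_n is one-hot (a single die roll).
    Returns theta_hat with theta_hat[k] = n_k / N.
    """
    n_k = X.sum(axis=0)      # sufficient statistics: per-face counts
    return n_k / X.shape[0]  # normalize so the estimates sum to 1

# Example: N = 5 rolls of a 3-sided die
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
print(multinomial_mle(X))  # -> [0.2 0.6 0.2]
```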
Bayesian estimation
• We need a prior over the parameters θ
• Dirichlet distribution:
  P(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1} = C(\alpha) \prod_k \theta_k^{\alpha_k - 1}
• Posterior of θ:
  P(\theta | x_1, \dots, x_N) = \frac{p(x_1, \dots, x_N | \theta) \, p(\theta)}{p(x_1, \dots, x_N)} \propto \prod_k \theta_k^{n_k} \prod_k \theta_k^{\alpha_k - 1} = \prod_k \theta_k^{\alpha_k + n_k - 1}
• The posterior is isomorphic to the prior (a conjugate prior)
• Posterior mean estimation:
  \bar{\theta}_k = \int \theta_k \, p(\theta | D) \, d\theta = C \int \theta_k \prod_k \theta_k^{\alpha_k + n_k - 1} \, d\theta = \frac{n_k + \alpha_k}{N + |\alpha|}
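A minimal sketch of the posterior-mean estimator (illustrative function name; the hyperparameters α are assumed given):

```python
import numpy as np

def dirichlet_posterior_mean(X, alpha):
    """Posterior mean of theta under a Dirichlet(alpha) prior.

    X: (N, K) one-hot samples; alpha: (K,) pseudocount vector.
    Returns (n_k + alpha_k) / (N + sum(alpha)).
    """
    n_k = X.sum(axis=0)
    return (n_k + alpha) / (X.shape[0] + alpha.sum())

# Same 5 rolls as before, with a uniform Dirichlet(1, 1, 1) prior
X = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
print(dirichlet_posterior_mean(X, np.ones(3)))  # -> [0.25 0.5 0.25]
```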
Continuous Distributions
• Uniform
• Gaussian
• Multivariate Gaussian
MLE for a multivariate Gaussian
• You can show that the MLE for μ and Σ is
  \hat{\mu}_{MLE} = \frac{1}{N} \sum_n x_n, \qquad \hat{\Sigma}_{MLE} = \frac{1}{N} \sum_n (x_n - \hat{\mu})(x_n - \hat{\mu})^\top
• What are the sufficient statistics? Rewrite:
  \hat{\Sigma}_{MLE} = \frac{1}{N} \left( \sum_n x_n x_n^\top \right) - \hat{\mu} \hat{\mu}^\top
• The sufficient statistics are \sum_n x_n and \sum_n x_n x_n^\top
• Similar for Bayesian estimation: use a Normal prior on μ
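A minimal sketch computing both estimates from only the two sufficient statistics (the function name is illustrative):

```python
import numpy as np

def gaussian_mle(X):
    """MLE for a multivariate Gaussian from iid rows of X (N, M).

    Uses only the sufficient statistics sum_n x_n and sum_n x_n x_n^T.
    """
    N = X.shape[0]
    mu = X.sum(axis=0) / N        # first sufficient statistic / N
    S = X.T @ X / N               # scatter matrix (second statistic / N)
    sigma = S - np.outer(mu, mu)  # Sigma_hat = E[x x^T] - mu mu^T
    return mu, sigma

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.3], [0.3, 2.0]], size=10000)
mu_hat, sigma_hat = gaussian_mle(X)
print(mu_hat)     # close to [0, 1]
print(sigma_hat)  # close to [[1, 0.3], [0.3, 2]]
```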
MLE for general BNs
• If we assume the parameters of each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node
• MLE-based parameter estimation of a GM thus reduces to local estimation of each GLIM
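In symbols, with \pi_i denoting the parents of node i (a standard statement of this decomposition):
  \ell(\theta; D) = \log \prod_{n=1}^{N} \prod_{i=1}^{M} p\left(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\right) = \sum_{i=1}^{M} \left( \sum_{n=1}^{N} \log p\left(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\right) \right)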
Decomposable likelihood of a BN
• Consider the GM:
• This is the same as learning four separate smaller BNs, each of which consists of a node and its parents.
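The slide's figure is not reproduced here; assuming, for illustration, a four-node network with edges x_1 \to x_2, x_1 \to x_3, x_2 \to x_4, x_3 \to x_4, the log-likelihood factors as
  \log p(x \mid \theta) = \log p(x_1 \mid \theta_1) + \log p(x_2 \mid x_1, \theta_2) + \log p(x_3 \mid x_1, \theta_3) + \log p(x_4 \mid x_2, x_3, \theta_4)
and each term involves only one node and its parents, so it can be maximized independently.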
MLE for BNs with tabular CPDs
• Each CPD is represented as a table (multinomial) with parameters \theta_{ijk} = P(X_i = j \mid X_{\pi_i} = k)
• In case of multiple parents, the CPD is a high-dimensional table
• The sufficient statistics are counts of variable configurations: n_{ijk} = \sum_n x_{n,i}^j x_{n,\pi_i}^k
• The log-likelihood is
  \ell(\theta; D) = \log \prod_{ijk} \theta_{ijk}^{n_{ijk}} = \sum_{ijk} n_{ijk} \log \theta_{ijk}
• Using a Lagrange multiplier to enforce that the conditionals sum to 1, we have:
  \hat{\theta}_{ijk,MLE} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}
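A minimal sketch of this counting-and-normalizing procedure (illustrative function and variable names; the toy samples reuse the lecture's Burglary/Earthquake/Alarm variables):

```python
from collections import Counter

def estimate_cpd(data, child, parents):
    """MLE of a tabular CPD P(child | parents) from fully observed samples.

    data: list of dicts mapping variable name -> discrete value.
    Returns {(parent_config, child_value): probability}.
    """
    counts = Counter()
    for sample in data:
        pa = tuple(sample[p] for p in parents)  # parent configuration k
        counts[(pa, sample[child])] += 1        # sufficient statistic n_ijk
    totals = Counter()
    for (pa, _), c in counts.items():
        totals[pa] += c
    # Normalize within each parent configuration so conditionals sum to 1
    return {key: c / totals[key[0]] for key, c in counts.items()}

# Toy fully observed samples over (Burglary, Earthquake, Alarm)
data = [{"B": "T", "E": "F", "A": "T"},
        {"B": "T", "E": "F", "A": "F"},
        {"B": "F", "E": "F", "A": "F"},
        {"B": "T", "E": "F", "A": "T"}]
print(estimate_cpd(data, child="A", parents=["B", "E"]))
# {(('T','F'),'T'): 0.667, (('T','F'),'F'): 0.333, (('F','F'),'F'): 1.0}
```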
What about parameter priors?
• In a BN we have a collection of local distributions
• How can we define priors over the whole BN?
• We could write P(x_1, x_2, \dots, x_N; G, \theta) \, P(\theta | \alpha)
• Symbolically the same as before, but θ is defined over a vector of random variables that follow different distributions.
• We need θ to decompose in order to use local rules; otherwise we cannot decompose the likelihood anymore.
• We need certain rules on θ:
  • Complete Model Equivalence
  • Global Parameter Independence
  • Local Parameter Independence
  • Likelihood and Prior Modularity
Global and Local Parameter Independence
• Global Parameter Independence: for every DAG model, the parameters of the different CPDs are a priori independent
• Local Parameter Independence: for every node, the parameters associated with different parent configurations are a priori independent
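In symbols (the standard statements of these assumptions, with \mathbf{pa}_i^j denoting the j-th configuration of X_i's parents):
  \text{Global: } p(\theta | G) = \prod_{i=1}^{M} p(\theta_i | G) \qquad \text{Local: } p(\theta_i | G) = \prod_{j} p\left(\theta_{x_i | \mathbf{pa}_i^j} \mid G\right)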
Which PDFs satisfy these assumptions?
• Discrete DAG models
• Gaussian DAG models
Parameter sharing
• Transition probabilities P(X_t | X_{t-1}) can be different for each t. What is the parameterization cost?
  • We would need a different local conditional distribution for every time step.
  • How can we learn these transition probabilities? We need iid data, so our training examples would have to correspond to multiple rolls of the dice; we would need to do alignment, etc.
• What do we do?
Parameter sharing
• We make the assumption that every transition follows the same conditional: consider a time-invariant (stationary) 1st-order Markov model with
  • an initial state probability vector π, with \pi_i = P(X_1 = i)
  • a state transition probability matrix A, with A_{ij} = P(X_t = j | X_{t-1} = i)
• Now: optimize separately
  • π (multinomial)
  • What about A?
Learning a Markov chain transition matrix
• A is a stochastic matrix with \sum_j A_{ij} = 1
• Each row of A is a multinomial distribution
• The MLE of A_{ij} is the fraction of transitions from i to j:
  \hat{A}_{ij,MLE} = \frac{\#(i \to j)}{\#(i \to \cdot)}
• Sparse data problem: if the transition from i to j did not occur in the data, then A_{ij} = 0, and any future sequence containing the pair i → j will have zero probability
• Solution: backoff smoothing
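A minimal sketch of the estimator (illustrative function name; additive pseudocount smoothing is used here as a simple stand-in for full backoff smoothing):

```python
import numpy as np

def transition_mle(sequences, K, pseudocount=0.0):
    """MLE (optionally smoothed) of a K x K transition matrix.

    sequences: iterable of state sequences with states in {0, ..., K-1}.
    pseudocount > 0 avoids zero-probability transitions (sparse data).
    """
    counts = np.full((K, K), pseudocount)
    for seq in sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            counts[i, j] += 1  # count transition i -> j
    return counts / counts.sum(axis=1, keepdims=True)  # rows sum to 1

A = transition_mle([[0, 1, 1, 2, 0], [1, 2, 2, 0]], K=3, pseudocount=0.5)
print(A)              # each row is a multinomial distribution
print(A.sum(axis=1))  # -> [1. 1. 1.]
```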
Example: HMM supervised ML estimation
• Given x = x_1, x_2, \dots, x_N for which the true state path y = y_1, y_2, \dots, y_N is known:
• Define
  A_{kl} = \text{number of } k \to l \text{ transitions in } y, \qquad B_{kc} = \text{number of times state } k \text{ in } y \text{ emits symbol } c \text{ in } x
• The maximum likelihood parameters θ are:
  \hat{a}_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad \hat{b}_{kc} = \frac{B_{kc}}{\sum_{c'} B_{kc'}}
• If x is continuous, we can apply the learning rules for a Gaussian
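A minimal sketch of the supervised counts-and-normalize procedure (illustrative function name; states and symbols assumed integer-coded):

```python
import numpy as np

def hmm_supervised_mle(x, y, K, C):
    """Supervised MLE of HMM parameters from an observation sequence x
    with known state path y (states in {0..K-1}, symbols in {0..C-1})."""
    A = np.zeros((K, K))  # A[k, l]: number of k -> l transitions in y
    B = np.zeros((K, C))  # B[k, c]: number of times state k emits symbol c
    for t in range(len(y) - 1):
        A[y[t], y[t + 1]] += 1
    for t in range(len(y)):
        B[y[t], x[t]] += 1
    a = A / A.sum(axis=1, keepdims=True)  # transition probabilities
    b = B / B.sum(axis=1, keepdims=True)  # emission probabilities
    return a, b

# Toy example: 2 hidden states, 3 observation symbols
x = [0, 1, 2, 2, 0, 1]
y = [0, 0, 1, 1, 0, 0]
a, b = hmm_supervised_mle(x, y, K=2, C=3)
print(a)
print(b)  # note the zero entries: exactly the overfitting risk discussed next
```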
Supervised ML estimation
• Intuition: when we know the underlying states, the best estimate of θ is the average frequency of transitions and emissions that occur in the training data
• Drawback: given little data, we may overfit (remember the zero probabilities)
• Example: given 10 rolls, we tally the observed transition and emission counts and then normalize them to obtain the estimates
Pseudocounts
• Solution for small training sets: add pseudocounts
• The pseudocounts represent our prior belief
• Large total pseudocounts => strong prior belief
• Small total pseudocounts => smoothing to avoid 0 probabilities
• Equivalent to Bayesian estimation under a uniform prior with parameter strength equal to the pseudocounts
Summary
• For a fully observed BN, the log-likelihood function decomposes into a sum of local terms, one per node; thus learning is also factored
• Learning single-node GMs – density estimation: the exponential family
  • Typical discrete distributions
  • Typical continuous distributions
  • Conjugate priors
• Learning BNs with more nodes:
  • Local operations