ABC and empirical likelihood

Approximate Bayesian Computation (ABC) and empirical likelihood. Christian P. Robert, Structure and uncertainty, Bristol, Sept. 25, 2012. Université Paris-Dauphine, IUF, & CREST. Joint work with Kerrie L. Mengersen and P. Pudlo.

Description: talk at SuSTain, Structure and uncertainty: modelling, inference and computation in complex stochastic systems, Bristol, 24-27 Sept. 2012.

Transcript of ABC and empirical likelihood

  • 1. Approximate Bayesian Computation (ABC) and empirical likelihood. Christian P. Robert, Structure and uncertainty, Bristol, Sept. 25, 2012. Université Paris-Dauphine, IUF, & CREST. Joint work with Kerrie L. Mengersen and P. Pudlo.

2. Outline: Introduction; ABC; ABC as an inference machine; ABCel.

3-4. Intractable likelihood: Case of a well-defined statistical model where the likelihood function ℓ(θ|y) = f(y1, ..., yn|θ) is (really!) not available in closed form, can (easily!) be neither completed nor demarginalised, and cannot be estimated by an unbiased estimator. Consequently, this prohibits direct implementation of a generic MCMC algorithm like Metropolis-Hastings.

5-7. Different perspectives on abc: What is the (most) fundamental issue? A mere computational issue (that will eventually end up being solved by more powerful computers, &tc, even if too costly in the short term); an inferential issue (opening opportunities for a new inference machine, with a different legitimacy than the classical B approach); a Bayesian conundrum (while inferential methods are available, how closely are they related to the B approach?).

8-9. Econom'ections: Similar exploration of simulation-based and approximation techniques in Econometrics: simulated method of moments; method of simulated moments; simulated pseudo-maximum-likelihood; indirect inference [Gouriéroux & Monfort, 1996], even though the motivation there is partly-defined models rather than complex likelihoods.

10. Indirect inference: Minimise [in θ] a distance between estimators β̂ based on a pseudo-model, computed for the genuine observations and for observations simulated under the true model and the parameter θ. [Gouriéroux, Monfort, & Renault, 1993; Smith, 1993; Gallant & Tauchen, 1996]

11. Indirect inference (PML vs. PSE): Example of the pseudo-maximum-likelihood (PML): β̂(y) = arg max_β Σ_t log f(y_t | β, y_{1:(t-1)}), leading to arg min_θ ||β̂(y^o) - β̂(y_1(θ), ..., y_S(θ))||² when y_s(θ) ~ f(y|θ), s = 1, ..., S.

12. Indirect inference (PML vs. PSE): Example of the pseudo-score-estimator (PSE): β̂(y) = arg min_β Σ_t [∂ log f/∂β (y_t | β, y_{1:(t-1)})]², leading to arg min_θ ||β̂(y^o) - β̂(y_1(θ), ..., y_S(θ))||² when y_s(θ) ~ f(y|θ), s = 1, ..., S.
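To make the indirect-inference recipe above concrete, here is a minimal Python sketch (not part of the slides) for an MA(1) model, using the lag-one autocorrelation as the auxiliary estimator β̂ and common random numbers across the S simulated paths; the model, sample sizes and optimiser are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

def simulate_ma1(theta, n, eps=None):
    # MA(1): y_t = e_t + theta * e_{t-1}
    if eps is None:
        eps = rng.standard_normal(n + 1)
    return eps[1:] + theta * eps[:-1]

def auxiliary(y):
    # auxiliary (pseudo-model) estimator: lag-one autocorrelation
    y = y - y.mean()
    return np.sum(y[1:] * y[:-1]) / np.sum(y * y)

theta_true = 0.6
y_obs = simulate_ma1(theta_true, 500)
beta_obs = auxiliary(y_obs)

# common random numbers for the S simulated paths, as usual in indirect inference
S, n = 20, 500
eps_sim = [rng.standard_normal(n + 1) for _ in range(S)]

def objective(theta):
    # distance between the observed and simulated auxiliary estimators
    beta_sim = np.mean([auxiliary(simulate_ma1(theta, n, e)) for e in eps_sim])
    return (beta_obs - beta_sim) ** 2

fit = minimize_scalar(objective, bounds=(-0.99, 0.99), method="bounded")
print("indirect inference estimate:", fit.x)
```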
13-14. Consistent indirect inference: "...in order to get a unique solution the dimension of the auxiliary parameter β must be larger than or equal to the dimension of the initial parameter θ. If the problem is just identified the different methods become easier..." Consistency depends on the criterion and on the asymptotic identifiability of θ. [Gouriéroux & Monfort, 1996, p. 66]

15. Choice of pseudo-model: Arbitrariness of the pseudo-model. Pick a model such that 1. β̂(θ) is not flat (i.e. sensitive to changes in θ) and 2. β̂(θ) is not dispersed (i.e. robust against changes in y_s(θ)). [Frigessi & Heggland, 2004]

16. Approximate Bayesian computation: Introduction; ABC (Genesis of ABC, ABC basics, Advances and interpretations, ABC as knn); ABC as an inference machine; ABCel.

17. Genetic background of ABC [skip genetics]: ABC is a recent computational technique that only requires being able to sample from the likelihood f(·|θ). This technique stemmed from population genetics models, about 15 years ago, and population geneticists still contribute significantly to methodological developments of ABC. [Griffiths & al., 1997; Tavaré & al., 1999]

18-19. Demo-genetic inference: Each model is characterized by a set of parameters θ that cover historical (time of divergence, admixture time, ...), demographic (population sizes, admixture rates, migration rates, ...) and genetic (mutation rate, ...) factors. The goal is to estimate these parameters from a dataset of polymorphism (DNA sample) y observed at the present time. Problem: most of the time, we cannot calculate the likelihood of the polymorphism data f(y|θ)...

20-22. Neutral model at a given microsatellite locus, in a closed panmictic population at equilibrium: Sample of 8 genes. Kingman's genealogy: when the time axis is normalized, T(k) ~ Exp(k(k-1)/2). Mutations according to the Simple stepwise Mutation Model (SMM): dates of the mutations follow a Poisson process with intensity θ/2 over the branches; MRCA = 100; independent mutations: ±1 with probability 1/2.
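Since ABC only requires the ability to simulate data, a rough Python sketch of the neutral model just described may help: it draws a Kingman genealogy with inter-coalescence times T(k) ~ Exp(k(k-1)/2), drops Poisson(θ/2 × branch length) stepwise mutations of ±1 on each branch, and reads the allele values at the leaves. The sample size, θ value and MRCA allele mirror the slide's toy setting; everything else (function names, recursion scheme) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_smm_sample(n=8, theta=2.0, mrca=100):
    # Kingman coalescent: start from n leaves, merge random pairs backwards in
    # time with waiting times T(k) ~ Exp(rate = k(k-1)/2).
    nodes = list(range(n))               # active lineages
    times = {i: 0.0 for i in range(n)}   # leaves sit at time 0
    children = {}
    t, next_id = 0.0, n
    while len(nodes) > 1:
        k = len(nodes)
        t += rng.exponential(2.0 / (k * (k - 1)))
        i, j = rng.choice(k, size=2, replace=False)
        a, b = nodes[i], nodes[j]
        children[next_id] = (a, b)
        times[next_id] = t
        nodes = [x for idx, x in enumerate(nodes) if idx not in (i, j)] + [next_id]
        next_id += 1
    root = nodes[0]
    # Stepwise mutation model: along each branch of length L, Poisson(theta/2 * L)
    # mutations, each moving the allele by +1 or -1 with probability 1/2.
    alleles = {}
    def descend(node, value):
        if node not in children:         # leaf: record its allele value
            alleles[node] = value
            return
        for child in children[node]:
            branch = times[node] - times[child]
            nmut = rng.poisson(theta / 2 * branch)
            steps = rng.choice([-1, 1], size=nmut).sum() if nmut else 0
            descend(child, value + steps)
    descend(root, mrca)
    return np.array([alleles[i] for i in range(n)])

print(simulate_smm_sample())
```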
23. Neutral model at a given microsatellite locus, in a closed panmictic population at equilibrium (continued): the observations are the leaves of the tree; MRCA = 100; θ̂ = ?; independent mutations: ±1 with probability 1/2.

24. Much more interesting models...: several independent loci (independent gene genealogies and mutations); different populations linked by an evolutionary scenario made of divergences, admixtures, migrations between populations, etc.; larger sample size, usually between 50 and 100 genes. [Figure: a typical evolutionary scenario, with MRCA and divergence times τ2, τ1 above populations POP 0, POP 1, POP 2]

25-26. Intractable likelihood: Missing (too missing!) data structure: f(y|θ) = ∫_G f(y|G, θ) f(G|θ) dG cannot be computed in a manageable way... The genealogies are considered as nuisance parameters. This modelling clearly differs from the phylogenetic perspective, where the tree is the parameter of interest.

27. not-so-obvious ancestry... "You went to school to learn, girl (...) Why 2 plus 2 makes four. Now, now, now, I'm gonna teach you (...) All you gotta do is repeat after me! A, B, C! It's easy as 1, 2, 3! Or simple as Do, Re, Mi! (...)"

28-30. A? B? C? A stands for approximate [wrong likelihood / picture]; B stands for Bayesian; C stands for computation [producing a parameter sample]. [Figure: posterior density estimates over repeated ABC runs, with effective sample sizes (ESS) ranging from roughly 73 to 134]

31. How Bayesian is aBc? Could we turn the resolution into a Bayesian answer? Ideally so (not meaningful: requires an infinitely powerful computer); asymptotically so (when the sample size goes to infinity: meaningful?); approximation error unknown (w/o costly simulation); true Bayes for the wrong model (formal and artificial); true Bayes for an estimated likelihood (back to econometrics?).

32-33. Untractable likelihood: Back to stage zero: what can we do when a likelihood function f(y|θ) is well-defined but impossible / too costly to compute...? MCMC cannot be implemented! Shall we give up Bayesian inference altogether?! Or settle for an almost Bayesian inference/picture...?
34-36. ABC methodology: Bayesian setting: the target is π(θ)f(x|θ). When the likelihood f(x|θ) is not in closed form, likelihood-free rejection technique. Foundation: For an observation y ~ f(y|θ), under the prior π(θ), if one keeps jointly simulating θ' ~ π(θ), z ~ f(z|θ'), until the auxiliary variable z is equal to the observed value, z = y, then the selected θ' ~ π(θ|y). [Rubin, 1984; Diggle & Gratton, 1984; Tavaré et al., 1997]

37-38. A as A...pproximative: When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone ρ(y, z) ≤ ε, where ρ is a distance. The output is distributed from π(θ) P_θ{ρ(y, z) < ε} ∝ π(θ | ρ(y, z) < ε). [Pritchard et al., 1999]

39. ABC algorithm: In most implementations, a further degree of A...pproximation. Algorithm 1, likelihood-free rejection sampler: for i = 1 to N do: repeat: generate θ' from the prior distribution π(·); generate z from the likelihood f(·|θ'); until ρ{η(z), η(y)} ≤ ε; set θ_i = θ'; end for; where η(y) defines a (not necessarily sufficient) statistic.

40-43. Output: The likelihood-free algorithm samples from the marginal in z of π_ε(θ, z|y) = π(θ) f(z|θ) I_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ, where A_{ε,y} = {z ∈ D | ρ(η(z), η(y)) < ε}. The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution: π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|y). ...does it?! More precisely, it approximates the restricted posterior distribution: π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|η(y)).
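A minimal sketch of Algorithm 1, the likelihood-free rejection sampler, on a toy normal-mean model where the summary statistic η is the sample mean; the prior, tolerance and model are illustrative assumptions, not the slides' examples.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setting: y ~ N(theta, 1), prior theta ~ N(0, 10), summary eta = sample mean
y_obs = rng.normal(1.5, 1.0, size=50)
eta_obs = y_obs.mean()

def abc_rejection(n_keep=1000, eps=0.05):
    accepted = []
    while len(accepted) < n_keep:
        theta = rng.normal(0.0, np.sqrt(10.0))        # theta' ~ prior
        z = rng.normal(theta, 1.0, size=y_obs.size)   # z ~ f(.|theta')
        if abs(z.mean() - eta_obs) <= eps:            # rho(eta(z), eta(y)) <= eps
            accepted.append(theta)
    return np.array(accepted)

sample = abc_rejection()
print("ABC posterior mean / sd:", sample.mean(), sample.std())
```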
Not so good..! [skip convergence details!]

44-45. Convergence of ABC: What happens when ε → 0? For B ⊂ Θ, we have ∫_B ∫_{A_{ε,y}} f(z|θ) dz π(θ) dθ / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ = ∫_{A_{ε,y}} [∫_B f(z|θ) π(θ) dθ / m(z)] [m(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ] dz = ∫_{A_{ε,y}} π(B|z) [m(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ] dz, which indicates convergence for a continuous π(B|z).

46-49. Convergence (do not attempt!): ...and the above does not apply to insufficient statistics: if η(y) is not a sufficient statistic, the best one can hope for is π(θ|η(y)), not π(θ|y); if η(y) is an ancillary statistic, the whole information contained in y is lost! and the best one can hope for is π(θ|η(y)) = π(θ). Bummer!!!

50-51. MA example: Inference on the parameters of an MA(q) model, x_t = ε_t + Σ_{i=1}^q ϑ_i ε_{t-i}, with ε_t i.i.d. white noise. [bypass MA illustration] Simple prior: uniform over the inverse [real and complex] roots of Q(u) = 1 - Σ_{i=1}^q ϑ_i u^i under the identifiability conditions, i.e. a uniform prior over the identifiability zone of the parameter space, the triangle for MA(2).

52. MA example (2): ABC algorithm thus made of 1. picking a new value (ϑ1, ϑ2) in the triangle, 2. generating an iid sequence (ε_t), (...) θ's replaced with locally regressed transforms θ* = θ - {η(z) - η(y)}^T β̂ [Csilléry et al., TEE, 2010], where β̂ is obtained by [NP] weighted least squares regression of θ on (η(z) - η(y)) with weights K_δ{ρ(η(z), η(y))}. [Beaumont et al., 2002, Genetics]

63-64. ABC-NP (regression): Also found in the subsequent literature, e.g. in Fearnhead-Prangle (2012): weight the simulations directly by K_δ{ρ(η(z(θ)), η(y))} or (1/S) Σ_{s=1}^S K_δ{ρ(η(z_s(θ)), η(y))} [a consistent estimate of f(η|θ)]. Curse of dimensionality: poor estimate when d = dim(η) is large...

65-66. ABC-NP (density estimation): Use of the kernel weights K_δ{ρ(η(z(θ)), η(y))} leads to the NP estimate of the posterior expectation, Σ_i θ_i K_δ{ρ(η(z(θ_i)), η(y))} / Σ_i K_δ{ρ(η(z(θ_i)), η(y))}, and to the NP estimate of the posterior conditional density, Σ_i K̃_b(θ - θ_i) K_δ{ρ(η(z(θ_i)), η(y))} / Σ_i K_δ{ρ(η(z(θ_i)), η(y))}. [Blum, JASA, 2010]
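The local regression adjustment of Beaumont et al. (2002) sketched above can be illustrated on the same toy normal-mean setting; in this hypothetical sketch the tolerance δ is taken as a quantile of the simulated distances, the kernel is Epanechnikov, and a weighted least squares fit supplies the β̂ used in θ* = θ - {η(z) - η(y)}^T β̂.

```python
import numpy as np

rng = np.random.default_rng(3)

# Reference table of (theta_i, eta(z_i)) pairs under a toy normal-mean model
y_obs = rng.normal(1.5, 1.0, size=50)
eta_obs = y_obs.mean()
N = 20000
theta = rng.normal(0.0, np.sqrt(10.0), size=N)
eta = np.array([rng.normal(t, 1.0, size=y_obs.size).mean() for t in theta])

# Epanechnikov kernel weights on the distances rho(eta(z), eta(y))
d = np.abs(eta - eta_obs)
delta = np.quantile(d, 0.05)                 # tolerance as a 5% quantile
w = np.where(d <= delta, 1 - (d / delta) ** 2, 0.0)
keep = w > 0

# Weighted local linear regression of theta on eta(z) - eta(y), then the
# adjustment theta* = theta - (eta(z) - eta(y)) * beta_hat
X = np.column_stack([np.ones(keep.sum()), eta[keep] - eta_obs])
W = np.diag(w[keep])
coef = np.linalg.solve(X.T @ W @ X, X.T @ W @ theta[keep])
theta_adj = theta[keep] - (eta[keep] - eta_obs) * coef[1]

print("adjusted ABC posterior mean:", np.average(theta_adj, weights=w[keep]))
```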
67-69. ABC-NP (density estimations): Other versions incorporating regression adjustments, Σ_i K̃_b(θ*_i - θ) K_δ{ρ(η(z(θ_i)), η(y))} / Σ_i K_δ{ρ(η(z(θ_i)), η(y))}. In all cases, the error satisfies E[ĝ(θ|y)] - g(θ|y) = c b² + c δ² + O_P(b² + δ²) + O_P(1/n δ^d) and var(ĝ(θ|y)) = c/(n b δ^d) (1 + o_P(1)). [Blum, JASA, 2010; standard NP calculations]

70-71. ABC-NCH: Incorporating non-linearities and heteroscedasticities: θ* = m̂(η(y)) + [θ - m̂(η(z))] σ̂(η(y)) / σ̂(η(z)), where m̂(·) is estimated by non-linear regression (e.g., a neural network) and σ̂(·) is estimated by non-linear regression on the residuals, log{θ_i - m̂(η_i)}² = log σ²(η_i) + ξ_i. [Blum & François, 2009]

72-74. ABC as knn: Practice of ABC: determine the tolerance ε as a quantile of the observed distances, say the 10% or 1% quantile, ε = ε_N = q_α(d_1, ..., d_N). The interpretation of ε as a non-parametric bandwidth is only an approximation of the actual practice [Blum & François, 2010]. ABC is a k-nearest neighbour (knn) method with k_N = N ε_N. [Loftsgaarden & Quesenberry, 1965; Biau et al., 2012, arxiv:1207.6461]

75-76. ABC consistency: Provided k_N / log log N → ∞ and k_N / N → 0 as N → ∞, for almost all s0 (with respect to the distribution of S), with probability 1, (1/k_N) Σ_{j=1}^{k_N} φ(θ_j) → E[φ(θ_j) | S = s0]. [Devroye, 1982] Biau et al. (2012) also recall pointwise and integrated mean square error consistency results on the corresponding kernel estimate of the conditional posterior distribution, under the constraints k_N → ∞, k_N / N → 0, h_N → 0 and h_N^p k_N → ∞.
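The knn reading of ABC, with the tolerance ε_N chosen as a quantile of the observed distances, or equivalently keeping the k_N nearest simulations, amounts to a few lines; the toy model and the 1% quantile below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Distances from a reference table of N simulations (toy normal-mean model)
y_obs = rng.normal(1.5, 1.0, size=50)
eta_obs = y_obs.mean()
N = 50000
theta = rng.normal(0.0, np.sqrt(10.0), size=N)
eta = np.array([rng.normal(t, 1.0, size=y_obs.size).mean() for t in theta])
d = np.abs(eta - eta_obs)

# Tolerance as a quantile of the observed distances, eps_N = q_alpha(d_1,...,d_N),
# i.e. keep the k_N = alpha * N nearest neighbours
alpha = 0.01
eps_N = np.quantile(d, alpha)
k_N = int(np.ceil(alpha * N))
nearest = np.argsort(d)[:k_N]

print(f"eps_N = {eps_N:.4f}, k_N = {k_N}")
print("knn-ABC posterior mean:", theta[nearest].mean())
```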
77-78. Rates of convergence: Further assumptions (on target and kernel) allow for precise (integrated mean square) convergence rates (as a power of the sample size N), derived from classical k-nearest neighbour regression, like: when m = 1, 2, 3, k_N ≈ N^{(p+4)/(p+8)} and rate N^{-4/(p+8)}; when m = 4, k_N ≈ N^{(p+4)/(p+8)} and rate N^{-4/(p+8)} log N; when m > 4, k_N ≈ N^{(p+4)/(m+p+4)} and rate N^{-4/(m+p+4)}. [Biau et al., 2012, arxiv:1207.6461] Only applies to sufficient summary statistics.

79. ABC inference machine: Introduction; ABC; ABC as an inference machine (Error inc., Exact BC and approximate targets, summary statistic); ABCel.

80-83. How much Bayesian aBc is..? Maybe a convergent method of inference (meaningful? sufficient? foreign?); approximation error unknown (w/o simulation); pragmatic Bayes (there is no other solution!); many calibration issues (tolerance, distance, statistics). ...should Bayesians care?! Yes they should!!! [to ABCel]

84-86. ABCμ: Idea: infer about the error as well as about the parameter. Use of a joint density f(θ, ε|y) ∝ ξ(ε|y, θ) × π_θ(θ) × π_ε(ε), where y is the data and ξ(ε|y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and y when z ~ f(z|θ). Warning! Replacement of ξ(ε|y, θ) with a non-parametric kernel approximation. [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]
87-88. ABCμ details: Multidimensional distances ρ_k (k = 1, ..., K) and errors ε_k = ρ_k(η_k(z), η_k(y)), with ε_k ~ ξ_k(ε|y, θ) ≈ ξ̂_k(ε|y, θ) = (1/(B h_k)) Σ_b K[{ε_k - ρ_k(η_k(z_b), η_k(y))}/h_k], then used in replacing ξ(ε|y, θ) with min_k ξ̂_k(ε|y, θ). ABCμ involves the acceptance probability π(θ', ε') q(θ', θ) q(ε', ε) min_k ξ̂_k(ε'|y, θ') / [π(θ, ε) q(θ, θ') q(ε, ε') min_k ξ̂_k(ε|y, θ)].

89. ABCμ multiple errors: [figure, © Ratmann et al., PNAS, 2009]

90. ABCμ for model choice: [figure, © Ratmann et al., PNAS, 2009]

91-92. Wilkinson's exact BC (not exactly!): The ABC approximation error (i.e. non-zero tolerance) is replaced with exact simulation from a controlled approximation to the target, the convolution of the true posterior with a kernel function, π_ε(θ, z|y) = π(θ) f(z|θ) K_ε(y - z) / ∫ π(θ) f(z|θ) K_ε(y - z) dz dθ, with K_ε a kernel parameterised by the bandwidth ε. [Wilkinson, 2008] Theorem: The ABC algorithm based on the assumption of a randomised observation y = ỹ + ξ, ξ ~ K_ε, and an acceptance probability of K_ε(y - z)/M gives draws from the posterior distribution π(θ|y).

93-94. How exact a BC? "Using ε to represent measurement error is straightforward, whereas using ε to model the model discrepancy is harder to conceptualize and not as commonly used." [Richard Wilkinson, 2008] Pros: pseudo-data from the true model and observed data from the noisy model; interesting perspective in that the outcome is completely controlled; link with ABCμ and with assuming y is observed with a measurement error of density K_ε; relates to the theory of model approximation [Kennedy & O'Hagan, 2001]. Cons: requires K_ε to be bounded by M; true approximation error never assessed; requires a modification of the standard ABC algorithm.

95-96. ABC for HMMs: Specific case of a hidden Markov model, X_{t+1} ~ Q_θ(X_t, ·), Y_{t+1} ~ g_θ(·|x_t), where only y⁰_{1:n} is observed. [Dean, Singh, Jasra, & Peters, 2011] Use of specific constraints, adapted to the Markov structure: y_1 ∈ B(y⁰_1, ε), ..., y_n ∈ B(y⁰_n, ε).

97-98. ABC-MLE for HMMs: ABC-MLE defined by θ̂^ε_n = arg max_θ P_θ(Y_1 ∈ B(y⁰_1, ε), ..., Y_n ∈ B(y⁰_n, ε)). Exact MLE for the likelihood (same basis as Wilkinson!) p^ε_θ(y⁰_1, ..., y⁰_n) corresponding to the perturbed process (x_t, y_t + ε z_t), 1 ≤ t ≤ n, with z_t ~ U(B(0, 1)). [Dean, Singh, Jasra, & Peters, 2011]

99-100. ABC-MLE is biased: ABC-MLE is asymptotically (in n) biased with target l^ε(θ) = E[log p^ε_θ(Y_1|Y_{-∞:0})], but ABC-MLE converges to the true value in the sense l^{ε_n}(θ_n) → l(θ) for all sequences (θ_n) converging to θ and (ε_n) going to zero as n → ∞.
101. Noisy ABC-MLE: Idea: modify instead the data from the start, (y⁰_1 + ε ζ_1, ..., y⁰_n + ε ζ_n) [see Fearnhead-Prangle]. The noisy ABC-MLE estimate is arg max_θ P_θ(Y_1 ∈ B(y⁰_1 + ε ζ_1, ε), ..., Y_n ∈ B(y⁰_n + ε ζ_n, ε)). [Dean, Singh, Jasra, & Peters, 2011]

102. Consistent noisy ABC-MLE: Degrading the data improves the estimation performance: noisy ABC-MLE is asymptotically (in n) consistent; under further assumptions, the noisy ABC-MLE is asymptotically normal; increase in variance of order ε^{-2}; likely degradation in precision or computing time due to the lack of summary statistic [curse of dimensionality].

103. SMC for ABC likelihood: Algorithm 2, SMC ABC for HMMs: Given θ, for k = 1, ..., n do: generate proposals (x_k^1, y_k^1), ..., (x_k^N, y_k^N) from the model; weigh each proposal with ω_k^l = I_{B(y⁰_k + ε ζ_k, ε)}(y_k^l); renormalise the weights and sample the x_k^l's accordingly; end for; approximate the likelihood by Π_{k=1}^n (Σ_{l=1}^N ω_k^l / N). [Jasra, Singh, Martin, & McCoy, 2010]

104-106. Which summary? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic. Starting from a large collection of available summary statistics, Joyce and Marjoram (2008) consider their sequential inclusion into the ABC target, with a stopping rule based on a likelihood ratio test. This does not take into account the sequential nature of the tests, it depends on the parameterisation, and the order of inclusion matters; likelihood ratio test?!

107-108. Which summary for model choice? Depending on the choice of η(·), the Bayes factor based on this insufficient statistic, B^η_12(y) = ∫ π_1(θ_1) f^η_1(η(y)|θ_1) dθ_1 / ∫ π_2(θ_2) f^η_2(η(y)|θ_2) dθ_2, is consistent or not. [X, Cornuet, Marin, & Pillai, 2012] Consistency only depends on the range of E_i[η(y)] under both models. [Marin, Pillai, X, & Rousseau, 2012]

109. Semi-automatic ABC: Fearnhead and Prangle (2010) study ABC and the selection of the summary statistic in close proximity to Wilkinson's proposal: ABC considered as an inferential method and calibrated as such; randomised (or noisy) version of the summary statistics, η̃(y) = η(y) + τ ξ; derivation of a well-calibrated version of ABC, i.e. an algorithm that gives proper predictions for the distribution associated with this randomised summary statistic.
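A rough sketch of the SMC approximation of the ABC likelihood for HMMs (Algorithm 2 above), on an assumed toy linear-Gaussian HMM and without the noise terms ζ_k; the transition and observation densities, the tolerance ε and the particle number are illustrative choices, not those of Jasra et al.

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed toy HMM: X_{t+1} ~ N(phi * X_t, 1),  Y_t = X_t + N(0, 0.5^2)
def simulate_hmm(phi, n):
    x = np.zeros(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = rng.normal(phi * x[t - 1], 1.0)
    return x + rng.normal(0.0, 0.5, size=n)

y_obs = simulate_hmm(0.8, 200)

def abc_smc_loglik(phi, y_obs, eps=0.3, N=500):
    # Propagate N particles; weight each by the indicator that its simulated
    # observation falls in the ball B(y_k, eps); resample; multiply the average
    # weights across time steps (on the log scale).
    x = rng.normal(size=N)
    loglik = 0.0
    for k in range(len(y_obs)):
        y_sim = x + rng.normal(0.0, 0.5, size=N)
        w = (np.abs(y_sim - y_obs[k]) <= eps).astype(float)
        if w.sum() == 0:
            return -np.inf                        # all particles rejected
        loglik += np.log(w.mean())
        idx = rng.choice(N, size=N, p=w / w.sum())    # resample
        x = rng.normal(phi * x[idx], 1.0)             # propagate
    return loglik

for phi in (0.5, 0.8, 0.95):
    print(phi, abc_smc_loglik(phi, y_obs))
```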
110-111. Summary [of F&P/statistics]: optimality of the posterior expectation E[θ|y] of the parameter of interest as summary statistic η(y)!; use of the standard quadratic loss function (θ - θ_0)^T A (θ - θ_0); recent extension to model choice, optimality of the Bayes factor B_12(y). [F&P, ISBA 2012 talk]

112-116. Conclusion: Choice of summary statistics is paramount for ABC validation/performance. At best, ABC approximates π(· | η(y)). Model selection is feasible with ABC [with caution!]. For estimation, consistency holds if {θ; μ(θ) = μ_0} = θ_0. For testing, consistency holds if {μ_1(θ_1), θ_1 ∈ Θ_1} ∩ {μ_2(θ_2), θ_2 ∈ Θ_2} = ∅. [Marin et al., 2011]

117. Empirical likelihood (EL): Introduction; ABC; ABC as an inference machine; ABCel (ABC and EL, Composite likelihood, Illustrations).

118-119. Empirical likelihood (EL): Dataset x made of n independent replicates x = (x_1, ..., x_n) of some X ~ F. Generalized moment condition model: E_F[h(X, φ)] = 0, where h is a known function and φ an unknown parameter. Corresponding empirical likelihood: L_el(φ|x) = max_p Π_{i=1}^n p_i over all p such that 0 ≤ p_i ≤ 1, Σ_i p_i = 1, Σ_i p_i h(x_i, φ) = 0. [Owen, 1988, Bka; Owen, 2001]

120. Convergence of EL [3.4]: Theorem 3.4. Let X, Y_1, ..., Y_n be independent rvs with common distribution F_0. For θ ∈ Θ and the function h(X, θ) ∈ R^s, let θ_0 ∈ Θ be such that Var(h(Y_i, θ_0)) is finite and has rank q > 0. If θ_0 satisfies E(h(X, θ_0)) = 0, then -2 log{L_el(θ_0|Y_1, ..., Y_n) / n^{-n}} → χ²(q) in distribution when n → ∞. [Owen, 2001]

121. Convergence of EL [3.4]: "...The interesting thing about Theorem 3.4 is what is not there. It includes no conditions to make θ̂ a good estimate of θ_0, nor even conditions to ensure a unique value for θ_0, nor even that any solution θ_0 exists. Theorem 3.4 applies in the just determined, over-determined, and under-determined cases. When we can prove that our estimating equations uniquely define θ_0, and provide a consistent estimator θ̂ of it, then confidence regions and tests follow almost automatically through Theorem 3.4." [Owen, 2001]
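For a single moment condition h(x, φ) = x - φ (the mean), the empirical likelihood above can be computed by profiling out the Lagrange multiplier; the sketch below is an illustrative assumption, not the authors' code, and it also evaluates the ratio statistic -2 log{L_el(θ_0)/n^{-n}} appearing in Owen's Theorem 3.4.

```python
import numpy as np
from scipy.optimize import brentq

def empirical_log_likelihood(x, mu):
    # Maximise sum(log p_i) s.t. p_i >= 0, sum p_i = 1, sum p_i (x_i - mu) = 0.
    # The solution is p_i = 1 / (n (1 + lam (x_i - mu))) with lam solving
    # sum (x_i - mu) / (1 + lam (x_i - mu)) = 0.
    d = x - mu
    n = len(x)
    if d.min() >= 0 or d.max() <= 0:
        return -np.inf                 # mu outside the convex hull: EL is zero
    lo = -1.0 / d.max() + 1e-10        # keep 1 + lam * d_i > 0 for all i
    hi = -1.0 / d.min() - 1e-10
    lam = brentq(lambda l: np.sum(d / (1.0 + l * d)), lo, hi)
    p = 1.0 / (n * (1.0 + lam * d))
    return np.sum(np.log(p))

rng = np.random.default_rng(6)
x = rng.normal(2.0, 1.0, size=100)
for mu in (1.8, 2.0, 2.2):
    # -2 log{ L_el(mu) / n^{-n} }, the EL ratio statistic of Theorem 3.4
    llr = -2 * (empirical_log_likelihood(x, mu) - len(x) * np.log(1.0 / len(x)))
    print(mu, llr)
```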
122-124. Raw ABCel sampler: We act as if EL was an exact likelihood: for i = 1 → N do: generate φ_i from the prior distribution π(·); set the weight ω_i = L_el(φ_i|x_obs); end for; return (φ_i, ω_i), i = 1, ..., N. The output is a sample of parameters of size N with associated weights. Performance of the output is evaluated through the effective sample size, ESS = 1 / Σ_{i=1}^N {ω_i / Σ_{j=1}^N ω_j}². Other classical sampling algorithms might be adapted to use EL; we resorted to the adaptive multiple importance sampling (AMIS) of Cornuet et al. to speed up computations. [Cornuet et al., 2012]

125-126. Moment condition in population genetics? EL does not require a fully defined and often complex (hence debatable) parametric model. Main difficulty: derive a constraint E_F[h(X, φ)] = 0 on the parameters of interest φ when X is made of the genotypes of the sample of individuals at a given locus. E.g., in phylogeography, φ is composed of dates of divergence between populations, ratios of population sizes, mutation rates, etc.; none of them are moments of the distribution of the allelic states of the sample. Hence, take h made of pairwise composite scores, whose zero is the pairwise maximum likelihood estimator.

127. Pairwise composite likelihood: The intra-locus pairwise likelihood ℓ_2(x^k|φ) = Π_{i<j} ℓ_2(x_i^k, x_j^k|φ) (...)

(...) z_1, ..., ending with z_n = min{x_ij; x_ij > z_{n-1}}. [Cox & Kartsonaki, Bka, 2012]

136-137. Example: superposition of gamma processes (ABC): An interesting testing ground for ABCel since the data (z_t) are neither iid nor Markov. Recovery of an iid structure by 1. simulating a pseudo-dataset (z*_1, ..., z*_n) as in regular ABC; 2. deriving the sequence of indicators (ν_1, ..., ν_n) associated with (z*_1, z*_2, ...); 3. exploiting that those indicators are distributed from the prior distribution on the ν_t's, leading to an iid sample of G(α, β) variables. [Figure: comparison of ABC and ABCel posteriors; top row ABCel, bottom row regular ABC]
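The raw ABCel sampler above reduces to prior simulation plus EL weighting. Here is a self-contained toy sketch for a scalar mean parameter with a flat prior, reporting the effective sample size ESS = 1/Σ_i{ω_i/Σ_j ω_j}²; the moment condition, prior and data are illustrative assumptions, and the EL weight is computed as in the earlier sketch.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(7)

def el_log_weight(x, mu):
    # empirical likelihood weight for the moment condition E[X - mu] = 0
    d = x - mu
    n = len(x)
    if d.min() >= 0 or d.max() <= 0:
        return -np.inf
    lam = brentq(lambda l: np.sum(d / (1 + l * d)),
                 -1 / d.max() + 1e-10, -1 / d.min() - 1e-10)
    return -np.sum(np.log(n * (1 + lam * d)))

# Raw ABCel sampler: draw phi_i from the prior, weight it by L_el(phi_i | x_obs)
x_obs = rng.normal(2.0, 1.0, size=100)
N = 5000
phi = rng.uniform(0.0, 4.0, size=N)               # flat prior on the mean
logw = np.array([el_log_weight(x_obs, m) for m in phi])
w = np.exp(logw - logw.max())
w /= w.sum()

ess = 1.0 / np.sum(w ** 2)                        # effective sample size
print("ESS =", ess)
print("posterior mean of phi:", np.sum(w * phi))
```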
138-139. Popgen: a first experiment. Evolutionary scenario: MRCA, divergence at time τ into POP 0 and POP 1. Dataset: 50 genes per population, 100 microsatellite loci. Assumptions: Ne identical over all populations; φ = (log10 θ, log10 τ); uniform prior over (-1., 1.5) × (-1., 1.). [Figure: comparison of the original ABC with ABCel, ESS = 7034; histogram = ABCel, curve = original ABC, vertical line = true parameter, for log(theta) and log(tau1)]

140. ABC vs. ABCel on 100 replicates of the 1st experiment. Accuracy:

                    log10 θ              log10 τ
                 ABC      ABCel       ABC      ABCel
  (1)           0.097     0.094      0.315     0.117
  (2)           0.071     0.059      0.272     0.077
  (3)           0.68      0.81       1.0       0.80

(1) root mean square error of the posterior mean; (2) median absolute deviation of the posterior median; (3) coverage of the credibility interval of probability 0.8. Computation time, on a recent 6-core computer (C++/OpenMP): ABC about 4 hours, ABCel about 2 minutes.

141-144. Popgen: second experiment. Evolutionary scenario: MRCA, divergences at times τ2 and τ1 into POP 0, POP 1, POP 2. Dataset: 50 genes per population, 100 microsatellite loci. Assumptions: Ne identical over all populations; φ = (log10 θ, log10 τ1, log10 τ2); non-informative uniform prior. [Figure: comparison of the original ABC with ABCel; histogram = ABCel, curve = original ABC, vertical line = true parameter]

145. ABC vs. ABCel on 100 replicates of the 2nd experiment. Accuracy:

                    log10 θ             log10 τ1             log10 τ2
                 ABC      ABCel       ABC      ABCel       ABC      ABCel
  (1)           0.0059    0.0794     0.472     0.483      29.3      4.76
  (2)           0.048     0.053      0.32      0.28        4.13     3.36
  (3)           0.79      0.76       0.88      0.76        0.89     0.79

(1) root mean square error of the posterior mean; (2) median absolute deviation of the posterior median; (3) coverage of the credibility interval of probability 0.8. Computation time, on a recent 6-core computer (C++/OpenMP): ABC about 6 hours, ABCel about 8 minutes.
146. Why? On large datasets, ABCel gives more accurate results than ABC. ABC simplifies the dataset through summary statistics: due to the large dimension of x, the original ABC algorithm estimates π(θ | η(x_obs)), where η(x_obs) is some (non-linear) projection of the observed dataset onto a space of smaller dimension, so some information is lost. ABCel instead simplifies the model through a generalized moment condition model; here, the moment condition model is based on the pairwise composite likelihood.

147. To be continued...