
Introduction Methods Real Data Future Work References

Bayesian large-scale multiple regression with summary statistics from genome-wide association studies

Xiang Zhu, University of Chicago

JSM 2016, July 31



Statistical Models
Multiple linear regression (M1): $y = X\beta + \varepsilon$
Simple linear regression (M2): $y = X_j\alpha_j + \delta_j$
$R$: correlation between the $X_j$

Genetic Data
$y$: phenotype (e.g. height)
$X$: (centred) genotype
$X_j$: genotype of SNP $j$
$R$: linkage disequilibrium (LD)
M1: multiple-SNP model
M2: single-SNP model

Research Question
Statistics: estimated $\alpha_j$ + estimated $R$ → inference of $\beta$
Genetics: single-SNP summary statistics + LD → multiple-SNP analyses
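
To illustrate why the single-SNP effects $\alpha_j$ can differ from the multiple-SNP effects $\beta$ when SNPs are in LD, here is a minimal Python sketch. It works at the population level (no sampling noise) and assumes $\mathrm{Var}(x) = \mathrm{diag}(\sigma_x) R\, \mathrm{diag}(\sigma_x)$, so that $\alpha_j = \mathrm{Cov}(x_j, y) / \mathrm{Var}(x_j)$. The function name and the toy numbers are hypothetical and not taken from the talk.

```python
import numpy as np

def marginal_effects(beta, R, sigma_x):
    """Population single-SNP effects implied by multiple-SNP effects beta.

    Computes diag(1/sigma_x) @ R @ diag(sigma_x) @ beta, i.e. Cov(x_j, y) / Var(x_j).
    """
    d = np.asarray(sigma_x, dtype=float)
    return (R @ (d * beta)) / d

# Toy example (assumed values): two SNPs with LD correlation 0.5,
# only the first SNP has a nonzero multiple-SNP effect.
R = np.array([[1.0, 0.5],
              [0.5, 1.0]])
beta = np.array([1.0, 0.0])
sigma_x = np.array([1.0, 1.0])

alpha = marginal_effects(beta, R, sigma_x)
# SNP 2 has a marginal effect of 0.5 even though beta_2 = 0.
```

In this toy case the second SNP looks associated in a single-SNP analysis only because it is correlated with the first, which is the situation a multiple-SNP model is meant to untangle.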









Why do we consider the multiple-SNP model?

Single-SNP analyses are routine in GWAS

Benefits of multiple-SNP analyses:
Allow for multiple causal variants in LD
Increase the power to detect associations
Improve the estimation of heritability

Recent surveys: Chapter 9 (Sabatti 2013); Chapter 11 (Guan and Wang 2013)

Few GWAS are analyzed with a multiple-SNP model:
Computationally challenging for large datasets
Requires individual-level data that can be hard to obtain



Why do we consider single-SNP summary data?

Single-SNP GWAS summary statistics $\{\hat\beta_j, \hat\sigma_j^2\}$ are widely available:

$\hat\beta_j = (X_j^\top X_j)^{-1} X_j^\top y$
$\hat\sigma_j^2 = (n X_j^\top X_j)^{-1} (y - X_j\hat\beta_j)^\top (y - X_j\hat\beta_j)$

Survey of GWAS summary statistics: pages 4-12 of Alkes Price's slides [link] at ASHG 2015
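
As a rough illustration of how the two single-SNP summary statistics are obtained from individual-level data, the following Python sketch applies the formulas for the estimated effect and its squared standard error to every SNP at once. The function name `single_snp_summary` and the simulated data are hypothetical; `X` is assumed column-centred and `y` centred.

```python
import numpy as np

def single_snp_summary(X, y):
    """Per-SNP effect estimates and squared standard errors.

    X: (n, p) centred genotype matrix; y: (n,) centred phenotype.
    """
    n = X.shape[0]
    xtx = np.einsum("ij,ij->j", X, X)       # X_j^T X_j for each SNP j
    beta_hat = (X.T @ y) / xtx              # (X_j^T X_j)^{-1} X_j^T y
    resid = y[:, None] - X * beta_hat       # column j holds y - X_j * beta_hat_j
    rss = np.einsum("ij,ij->j", resid, resid)
    sigma2_hat = rss / (n * xtx)            # (n X_j^T X_j)^{-1} * residual sum of squares
    return beta_hat, sigma2_hat

# Simulated toy data (assumed, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X -= X.mean(axis=0)
y = X @ np.array([0.3, 0.0, 0.0, -0.2, 0.0]) + rng.standard_normal(200)
y -= y.mean()

beta_hat, sigma2_hat = single_snp_summary(X, y)
```

Each SNP is regressed on its own, so the cost is linear in the number of SNPs and no information about LD enters these statistics.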



How do we perform multiple-SNP analyses using single-SNP summary data?

Bayesian inference for the multiple regression coefficients $\beta$:

$p(\beta \mid \text{Individual Data}) \propto p(\text{Individual Data} \mid \beta) \cdot p(\beta)$

$\underbrace{p(\beta \mid \text{Summary Data})}_{\text{Posterior}} \propto \underbrace{p(\text{Summary Data} \mid \beta)}_{\text{Likelihood}} \cdot \underbrace{p(\beta)}_{\text{Prior}}$

Our proposed solution

1. Develop a new likelihood of $\beta$ based on summary data

2. Borrow an old prior of $\beta$ from previous work

3. Combine the new likelihood with the old prior via Bayes' Law



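Individual-level data $\{X, y\}$ are generated as follows:

$x_i$ ($i$th row of $X$) $\overset{\text{iid}}{\sim} x$, $E(x) = 0$, $\mathrm{Var}(x) = \mathrm{diag}(\sigma_x) \cdot R \cdot \mathrm{diag}(\sigma_x)$
$y_i = x_i\beta + \varepsilon_i$, $\varepsilon_i \overset{\text{iid}}{\sim} \varepsilon$, $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \tau^{-1}$, $\mathrm{Var}(y_i) = \sigma_y^2$

Regression with Summary Statistics (RSS) Likelihood

$L_{\mathrm{rss}}(\beta; \hat\beta, \hat S, \hat R) = N(\hat\beta; \hat S \hat R \hat S^{-1}\beta, \hat S \hat R \hat S)$

multiple-SNP parameter: $\beta = (\beta_1, \ldots, \beta_p)^\top$
single-SNP summary data: $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)^\top$
$\hat S = \mathrm{diag}(\hat s)$, $\hat s = (\hat s_1, \ldots, \hat s_p)^\top$, $\hat s_j = \sqrt{(n X_j^\top X_j)^{-1} y^\top y} = \sqrt{\hat\sigma_j^2 + n^{-1}\hat\beta_j^2}$
$\hat R$: the shrinkage estimate of the LD matrix (Wen and Stephens 2010)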
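
A minimal sketch of evaluating the RSS log-likelihood for a candidate $\beta$ is given here in Python. It assumes the LD estimate is positive definite so that a Cholesky factorization exists. The function name `rss_loglik` and the toy inputs are hypothetical; this is not the interface of the authors' software.

```python
import numpy as np

def rss_loglik(beta, beta_hat, s_hat, R_hat):
    """Log density of N(beta_hat; S R S^{-1} beta, S R S), with S = diag(s_hat)."""
    p = beta_hat.size
    mean = s_hat * (R_hat @ (beta / s_hat))          # S R S^{-1} beta
    cov = s_hat[:, None] * R_hat * s_hat[None, :]    # S R S
    diff = beta_hat - mean
    L = np.linalg.cholesky(cov)                      # needs a positive-definite R_hat
    z = np.linalg.solve(L, diff)                     # whitened residual
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + z @ z)

# Toy inputs (assumed values).
R_hat = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
s_hat = np.array([0.1, 0.2])
beta_hat = np.array([0.05, -0.02])

ll_null = rss_loglik(np.zeros(2), beta_hat, s_hat, R_hat)
```

Only the summary statistics and the LD estimate enter the function; the individual-level genotypes and phenotypes are never needed.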



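Obtain the asymptotic distribution of $\hat\beta$

Let $F = \sigma_y \mathrm{diag}^{-1}(\sigma_x)$ and $\Sigma = F(R + \Delta(c))F$. Then

$\sqrt{n}\,(\hat\beta - F R F^{-1}\beta) \xrightarrow{d} N(0, \Sigma)$,

where $c = R^{-1}F\beta$, $\Delta(c)$ is continuous and $\Delta(c) = O(\max_j c_j^2)$.

Ignore the complicated term $\Delta(c)$

Let $S = n^{-1/2}F$. For each $\beta \in \mathbb{R}^p$,

$\log N(\hat\beta; S R S^{-1}\beta, S R S) - \log N(\hat\beta; F R F^{-1}\beta, n^{-1}\Sigma) = O_p(\max_j c_j^2)$.

Plug in the estimates for $\{S, R\}$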
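
Once the estimates are plugged in, the likelihood can be combined with a prior. As one illustration, suppose (purely as an assumption for this sketch, not the prior used in the talk) that $\beta \sim N(0, \sigma_\beta^2 I)$. The RSS likelihood is then conjugate: the posterior precision is $\hat S^{-1}\hat R\hat S^{-1} + \sigma_\beta^{-2} I$ and the posterior mean is that precision's inverse applied to $\hat S^{-2}\hat\beta$. The function name and toy values below are hypothetical.

```python
import numpy as np

def rss_normal_posterior(beta_hat, s_hat, R_hat, prior_var):
    """Posterior mean and covariance of beta under the RSS likelihood and an
    illustrative prior beta ~ N(0, prior_var * I)."""
    p = beta_hat.size
    # Likelihood precision S^{-1} R S^{-1} plus prior precision.
    prec = R_hat / np.outer(s_hat, s_hat) + np.eye(p) / prior_var
    cov = np.linalg.inv(prec)
    # The linear term of the likelihood simplifies to S^{-2} beta_hat.
    mean = cov @ (beta_hat / s_hat**2)
    return mean, cov

# Toy inputs (assumed values).
R_hat = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
s_hat = np.array([0.1, 0.2])
beta_hat = np.array([0.05, -0.02])

post_mean, post_cov = rss_normal_posterior(beta_hat, s_hat, R_hat, prior_var=0.5)
```

With sparse priors such as those borrowed from earlier work the posterior is no longer available in closed form, so this normal case is best read as a sanity check rather than as the method itself.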




RSS performs comparably to methods that require individual-level data

[Figure: (a) Estimating heritability; (b) Detecting association]

Simulation: real genotypes of 13K SNPs from two control groups in the Wellcome Trust Case Control Consortium (2007)

Methods: BVSR (Guan and Stephens 2011); BSLMM (Zhou et al. 2013)



We use RSS to estimate SNP heritability of adult height (Wood et al. 2014)

RSS on the summary data (# of SNPs: 1.1M; sample size: 253K): 52.1% [50.3%, 53.9%]

LMM on the full data (# of SNPs: 1.1M; sample size: 6K): 49.8% [41.2%, 58.4%]



We use RSS to detect multiple-SNP associations

We assess the genetic associations at the level of region (locus):

$\mathrm{ENS}(\text{region}) = \sum_{j \in \text{region}} \Pr(\beta_j \neq 0 \mid \hat\beta, \hat S, \hat R)$

Replicate previous GWAS hits
Estimate ENS of the region of ±40 kb around each SNP
Total hits: 531/697 with ENS ≥ 1
Analyzed hits: 379/384 with ENS ≥ 1

Identify putatively novel loci
Estimate ENS for ±40-kb windows across the whole genome
5,194 regions with ENS ≥ 1
2,138 of them are at least 1 Mb away from any of the 697 hits
Example genes: WWOX and ALX1
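
The region-level statistic ENS is a sum of per-SNP posterior probabilities, which can be sketched in Python as follows. The sketch assumes those probabilities have already been computed by some posterior inference step. The function name, the positions and the probabilities are all hypothetical.

```python
import numpy as np

def ens_by_window(pip, pos, center, half_width=40_000):
    """Sum Pr(beta_j != 0 | data) over SNPs within +/- half_width bp of center."""
    in_region = np.abs(pos - center) <= half_width
    return float(pip[in_region].sum())

# Hypothetical per-SNP posterior probabilities and base-pair positions.
pip = np.array([0.02, 0.60, 0.55, 0.01, 0.30])
pos = np.array([10_000, 95_000, 110_000, 139_000, 400_000])

ens = ens_by_window(pip, pos, center=100_000)
# Three SNPs fall in the window, giving 0.60 + 0.55 + 0.01 = 1.16,
# so this window would pass the ENS >= 1 threshold.
```

Summing over a window avoids having to decide which of several SNPs in LD is the one driving the signal.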



RSS opens the door to various applications

RSS Likelihood + Old/New Prior → New Inference

Example 1 (Old Prior): gene set enrichment analysis

Let $a_j = 1\{\text{SNP } j \text{ is in the gene set}\}$

$\beta_j \sim (1 - \pi_j)\delta_0 + \pi_j N(0, \sigma^2)$, $\mathrm{logit}(\pi_j) = \theta_0 + \theta a_j$

This extends Carbonetto and Stephens (2013) to analyze GWAS summary data. More details will be presented at ASHG 2016 (Abstract 1601200613)
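
To make the enrichment prior concrete, the following Python sketch maps gene set membership to prior inclusion probabilities through the logit link. The values of $\theta_0$ and $\theta$ and the membership vector are assumed for illustration only.

```python
import numpy as np

def inclusion_prob(a, theta0, theta):
    """pi_j from logit(pi_j) = theta0 + theta * a_j."""
    eta = theta0 + theta * np.asarray(a, dtype=float)
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical membership indicators and enrichment parameters.
a = np.array([0, 1, 1, 0])
pi = inclusion_prob(a, theta0=-4.0, theta=2.0)

# Prior odds of inclusion for SNPs inside vs. outside the gene set.
odds = pi / (1.0 - pi)
odds_ratio = odds[1] / odds[0]   # equals exp(theta)
```

A positive $\theta$ raises the prior odds of a nonzero effect for SNPs in the gene set by a factor of $e^{\theta}$, which is why $\theta$ can be read as an enrichment parameter.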

Example 2 (New Prior): partition heritability by annotations

Let $f_{jg}$ be the annotation of SNP $j$ w.r.t. the category $g$

$\beta_j \sim N(0, \sigma_j^2)$, $\log(\sigma_j^2) = w_0 + \sum_{g=1}^{G} w_g f_{jg}$

This is an ongoing project in collaboration with David Golan at Jonathan Pritchard Lab
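
The log-linear variance model can be sketched in a few lines of Python. The annotation matrix and weights below are hypothetical, chosen only to show how annotations scale the per-SNP prior variance.

```python
import numpy as np

def prior_variances(f, w0, w):
    """sigma_j^2 from log(sigma_j^2) = w0 + sum_g w_g * f_jg; f has shape (p, G)."""
    return np.exp(w0 + f @ w)

# Hypothetical binary annotations for 3 SNPs and G = 2 categories.
f = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0]])
w = np.array([1.0, 0.5])

sigma2 = prior_variances(f, w0=-9.0, w=w)
# A SNP in both categories has prior variance exp(1.0 + 0.5) times the baseline.
```

Because the model is linear on the log scale, the effects of overlapping annotations multiply on the variance scale.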



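RSS can be misspecified when summary data are generated from different individuals

$\hat\beta_j^* = \left(\sum_{i \in I_j} x_{ij}^2\right)^{-1} \left(\sum_{i \in I_j} x_{ij} y_i\right)$, $(\hat s_j^*)^2 = \left(|I_j| \cdot \sum_{i \in I_j} x_{ij}^2\right)^{-1} \left(\sum_{i \in I_j} y_i^2\right)$, $I_j \subseteq [n]$

$\hat R$ should be adjusted by the sample overlap

Let $H = (H_{ij})$, $H_{ij} = (|I_i| \cdot |I_j|)^{-1} (n \cdot |I_i \cap I_j|)$. If $H$ does not depend on $n$,

$\sqrt{n}\,(\hat\beta^* - F R F^{-1}\beta) \xrightarrow{d} N(0, H \circ \Sigma)$,

where $H \circ \Sigma$ is the Hadamard product of $H$ and $\Sigma$: $(H \circ \Sigma)_{ij} = H_{ij}\Sigma_{ij}$

$\hat S$ should be adjusted by the difference in sample sizes

If $n^{-1}|I_j|$ does not depend on $n$ for all $j \in [p]$,

$\sqrt{n}\,\left(\hat s_j - \sqrt{n^{-1}|I_j|} \cdot \hat s_j^*\right) \xrightarrow{p} 0$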
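
Both adjustments can be sketched in Python given the set of individuals used for each SNP. In practice only sample sizes and overlaps may be known; the explicit index sets, the function name and the standard errors below are hypothetical.

```python
import numpy as np

def overlap_matrix(index_sets, n):
    """H_ij = n * |I_i & I_j| / (|I_i| * |I_j|) for per-SNP sample index sets."""
    sets = [set(s) for s in index_sets]
    p = len(sets)
    H = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            H[i, j] = n * len(sets[i] & sets[j]) / (len(sets[i]) * len(sets[j]))
    return H

# Toy setting: n = 8 individuals; SNP 1 typed in all, SNPs 2 and 3 in disjoint halves.
n = 8
index_sets = [range(0, 8), range(0, 4), range(4, 8)]
H = overlap_matrix(index_sets, n)

# Rescale the reported standard errors: sqrt(|I_j| / n) * s*_j.
s_star = np.array([0.10, 0.14, 0.14])           # hypothetical reported values
sizes = np.array([len(s) for s in index_sets])
s_adj = np.sqrt(sizes / n) * s_star
```

The Hadamard product with a covariance matrix `Sigma` is then simply `H * Sigma` in NumPy. In this toy case the two SNPs typed in disjoint halves get a zero off-diagonal entry, and their diagonal entries are 2 because each uses only half of the sample.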



References

P. Carbonetto and M. Stephens. Integrated enrichment analysis of variants and pathways in genome-wide association studies indicates central role for IL-2 signaling genes in type 1 diabetes and cytokine signaling genes in Crohn's disease. PLoS Genetics, 9(10):e1003770, 2013.

Y. Guan and M. Stephens. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics, 5(3):1780-1815, 2011.

Y. Guan and K. Wang. Whole-genome multi-SNP-phenotype association analysis. In K.-A. Do, Z. S. Qin, and M. Vannucci, editors, Advances in Statistical Bioinformatics, pages 224-243. Cambridge University Press, 2013. ISBN 9781139226448.

C. Sabatti. Multivariate linear models for GWAS. In K.-A. Do, Z. S. Qin, and M. Vannucci, editors, Advances in Statistical Bioinformatics, pages 188-207. Cambridge University Press, 2013. ISBN 9781139226448.

Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661-678, 2007.

X. Wen and M. Stephens. Using linear predictors to impute allele frequencies from summary or pooled genotype data. The Annals of Applied Statistics, 4(3):1158-1182, 2010.

A. R. Wood, T. Esko, J. Yang, S. Vedantam, T. H. Pers, S. Gustafsson, A. Y. Chu, K. Estrada, J. Luan, Z. Kutalik, et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nature Genetics, 46(11):1173-1186, 2014.

X. Zhou, P. Carbonetto, and M. Stephens. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics, 9(2):e1003264, 2013.



Acknowledgements

Joint work with Matthew Stephens

Wellcome Trust Case Control Consortium

Genetic Investigation of Anthropometric Traits (GIANT) Consortium

Thank you

Preprint: https://dx.doi.org/10.1101/042457

Software: https://github.com/stephenslab/rss

Contact: xiangzhu[at]uchicago[dot]edu

