Modelling Genetic Variation in the Brain - Warwick · 2011. 12. 2.

Modelling Genetic Variation in the Brain Thomas Nichols, PhD Department of Statistics, Warwick Manufacturing Group University of Warwick joint with Becky Inkster Institute of Psychiatry King’s College London (GSK3β & WNT pathway VBM) Maria Vounou, Giovanni Montana Statistics Section, Dept. of Mathematics Imperial College (Sparse Reduced Rank Regression)



  • Modelling Genetic Variation in the Brain

    Thomas Nichols, PhD Department of Statistics,

    Warwick Manufacturing Group University of Warwick

    joint with

    Becky Inkster Institute of Psychiatry King’s College London (GSK3β & WNT pathway VBM)

    Maria Vounou, Giovanni Montana Statistics Section, Dept. of Mathematics Imperial College (Sparse Reduced Rank Regression)

  • Outline

    •  Background
      –  Structural brain imaging & VBM
      –  Genetics
      –  “Imaging Genetics”
    •  Candidate SNP VBM
    •  Multivariate SNP analyses

  • Neuroimaging Background: Structural Brain Image Data

    •  Morphometry
      –  Quantification of shape/volume of brain structures
    •  Traditional Morphometric Analysis
      –  Laborious hand-tracing of structures
      –  Accurate, but imperfect inter-rater reliability
    •  Voxel Based Morphometry
      –  Automated morphometry method

  • Voxel Based Morphometry

    Original MRI → Segment → Warp to Atlas Space → Modulate → Smooth
    [Figure panels, both in subject space: T1-weighted MRI; Gray Matter]

  • Voxel Based Morphometry

    Original MRI → Segment → Warp to Atlas Space → Modulate → Smooth
    [Figure panels, all in atlas space: T1-weighted MRI; Gray Matter; Modulated Gray Matter]
    •  Modulation
      –  Gives units of subject GM volume in atlas space
      –  Allows analysis in common space while retaining individual differences

  • Smoothing
    Voxel Based Morphometry: Original MRI → Segment → Warp to Atlas Space → Modulate → Smooth
    •  Accounts for imperfect registration of individuals to atlas
      –  Even identical twins have different cortical foldings
      –  Exact match impossible
    •  Discards fine spatial details in exchange for reduced noise
      –  Generally searching for moderate scale differences
    Done!
    •  3D image is n=1
      –  A single (100,000-dimensional) phenotypic measurement on 1 individual
    [Figure panels, both in atlas space: Modulated Gray Matter; Smoothed, Modulated GM]

  • Genetics Background

    •  Genotype
      –  The genetic constitution of an organism or cell
      –  46 chromosomes in humans
      –  23 pairs of homologous chromosomes
        •  One from each parent
    •  Gene
      –  A series of basepairs (DNA bits) which code for a trait
      –  Four different possible basepairs, the nucleotides
        •  Adenine, Thymine, Cytosine, & Guanine

  • Genetics Background
    •  Single Nucleotide Polymorphisms (SNP)
      –  Locations where single base-pair differences have been found in the population
    •  SNP Example
      –  If some of the population has sequence…
           AATGTGATAGCTT
      –  And if remaining has…
           AATGTGACAGCTT
      –  We have found a SNP! (the 8th base, T vs. C)
    •  SNP data
      –  Homologous chromosomes
      –  For each SNP, for each individual: 0, 1 or 2 count

  • Genetics Background
    •  Millions of SNPs
    •  Thanks to correlation (linkage disequilibrium), only need ≈500k to “tag” all variation
    [Figure: three bar charts by chromosome (1-22, X, Y)]
      –  Number of basepairs per chromosome: 3,079,843,747 base pairs †
      –  Number of genes per chromosome: 32,185 genes †
      –  Number of SNPs per chromosome: 20,296,765 SNPs *
    * From Entrez SNP database † From Wikipedia

  • Genetics Background
    •  SNPs vs. genes
      –  Each gene often has several variants
      –  1 or more (but not many) SNPs typically needed to identify a gene
      –  SNPs may not lie directly on coding portion of gene
        •  Due to linkage disequilibrium (correlation), close is good enough
        •  Non-coding, regulatory region may be causal
    [Figure: SNPs and exons by location on chromosome]

  • Imaging Genetics

    •  Motivation
      –  Brain structure heritable
      –  Objective, reproducible phenotype
        •  Important in psychiatry
        •  Current best measures are coarse, with weak reproducibility
          »  e.g. HAM-D (depression), MMSE (cognition, AD)
      –  Sensitive
        •  Brain anatomy/function closer to disease process than other measures
      –  Use to corroborate other findings
        •  E.g. large WGA finds modest significance; use brain imaging to build confidence in finding

    Brain Phenotype               h2
    Whole brain volume            0.78
    Total gray matter volume      0.88
    Total white matter volume     0.85
    Glahn, Thompson, Blangero. Hum Brain Mapp 28:488-501, 2007

    [Figure: Thickness of Cortical GM (r2); Heritability of GM Thickness (h2 & corrected P-value)]
    Thompson et al, Nature Neuro, 4(12):1253-1258, 2001
    Thompson & Toga, Annals of Medicine 34(7-8):523-36, 2002

  • Imaging Genetics Menu
    (Jason Stein/Andy Saykin/Bertrand Thirion)
    Imaging (columns): Candidate ROI, Many ROI, Voxelwise
    Genetics (rows): Candidate SNP, Candidate Gene, Genome-wide SNP, Genome-wide Gene
    Example studies:
    •  [Joyner et al. 2009] 4 ROIs, 11 SNPs
    •  [Potkin et al. 2009] 1 BOLD ROI, 317,503 SNPs
    •  [Filippini et al. 2009] 29,812 voxels, 1 SNP
    •  [Stein et al. 2010] 31,622 voxels, 448,293 SNPs
    •  [Hibar et al. 2011] 31,622 voxels, 18,044 SNPs

  • Outline

    •  Background
      –  Structural brain imaging & VBM
      –  Genetics
      –  “Imaging Genetics”
    •  Candidate SNP VBM
    •  Multivariate SNP analyses

  • GSK3β Background
    •  High heritability of depression (Kendler et al., 2006; Sullivan et al., 2000).
    •  Meta-analytical evidence from MRI studies for a role of hippocampal integrity in depression (Campbell et al., 2004).
    •  There is strong genetic regulation of neurodevelopment (reviewed by Wilson and Rubenstein, 2000; O’Leary et al., 2002).
    •  The Wnt signaling pathway is one network of proteins that play a role in embryogenesis
    •  GSK3β plays a key role in Wnt pathway

  • Wnt Signaling Pathways

    regulates the development of the hippocampus

  • GSK HiTDIP Study
    •  Major Depressive Disorder (MDD) Association Study
      –  “High Throughput Human Disease Specific Targets”
      –  7,000 SNPs covering 2,000 genes with tractable targets
      –  1000 cases, 1000 controls
    •  Imaging Subset
      –  200 cases, 200 controls (of 1000 & 1000) scanned with anatomical MRI protocol
      –  ‘Optimized VBM’ with SPM5’s segmentation tool
      –  324 images passed QC
        •  366 subjects’ data delivered
        •  42 subjects set aside (clinical exclusion, pathologies or failed segmentation)
    •  Glycogen synthase kinase 3β (GSK3β)
      –  Plays key role in WNT pathway, influential in development

  • Modelling Candidate SNPs

    •  Mass Univariate Modelling
      –  Fit same univariate linear model at each voxel
    •  Quantitative Trait Multiple Regression
      –  Linear model fit at each voxel
    •  Regressors
      –  Genetic
      –  Group (Case/Control)
      –  Demographic / nuisance variables
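    The mass-univariate idea can be sketched in a few lines. This is only an illustration on simulated numbers: the array layout (one image per row), the regressor names and the effect size are assumptions, not details from the study.

```python
import numpy as np

def mass_univariate_fit(Y, X):
    """Fit the same linear model Y[:, v] = X @ beta_v + error at every voxel v.

    Y : (N, NV) array, one image per row (hypothetical layout)
    X : (N, P) design matrix shared by all voxels
    Returns coefficients (P, NV) and t-statistics (P, NV).
    """
    N, P = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ Y                        # all voxels in one product
    resid = Y - X @ beta
    sigma2 = (resid ** 2).sum(axis=0) / (N - P)     # per-voxel error variance
    se = np.sqrt(np.outer(np.diag(XtX_inv), sigma2))
    return beta, beta / se

rng = np.random.default_rng(0)
N, NV = 120, 50
snp = rng.binomial(2, 0.3, size=N).astype(float)    # genetic regressor (allele count)
group = rng.integers(0, 2, size=N).astype(float)    # case/control
age = rng.normal(40, 10, size=N)                    # nuisance variable
X = np.column_stack([np.ones(N), snp, group, age])
Y = rng.normal(size=(N, NV))                        # stand-in for GM data
Y[:, 0] += 0.8 * snp                                # planted effect in voxel 0
beta, t = mass_univariate_fit(Y, X)
```

    Row 1 of `t` holds the SNP t-statistic at every voxel, i.e. the statistic image on which inference is then carried out.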

  • SNP Models for Gray Matter Data

    •  Recessive
    •  Dominant
    •  Additive
    •  Genotypic
    [Figure: four plots, one per model, of Gray Matter Volume (Y) against SNP Count 0, 1, 2 (Xj)]
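    The slide shows these four models only as plots. One way to turn an allele count into regressors, assuming the textbook definitions with respect to the counted allele (an assumption, since the slide gives no formulas), is:

```python
def code_genotype(count, model):
    """Map an allele count (0, 1 or 2) to a list of regressor values."""
    if count not in (0, 1, 2):
        raise ValueError("allele count must be 0, 1 or 2")
    if model == "additive":       # effect linear in the allele count
        return [float(count)]
    if model == "dominant":       # one copy is enough for the full effect
        return [1.0 if count >= 1 else 0.0]
    if model == "recessive":      # two copies needed for any effect
        return [1.0 if count == 2 else 0.0]
    if model == "genotypic":      # 2 DF: separate mean for each genotype
        return [1.0 if count == 1 else 0.0, 1.0 if count == 2 else 0.0]
    raise ValueError(f"unknown model: {model}")

for model in ("recessive", "dominant", "additive", "genotypic"):
    print(model, [code_genotype(c, model) for c in (0, 1, 2)])
```

    The first three models spend one degree of freedom; the genotypic model spends two and makes no assumption about how the heterozygote mean sits between the homozygote means.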

  • Mass Univariate Modelling Genetic Effects

    •  Concerns about leverage/influence
      –  100’s not 1000’s of subjects
        •  100 subjects + 10% MAF → 1 subject with rare genotype expected!
      –  Rare SNP can make a few subjects very influential
    •  An ever-greater problem as sample size shrinks
    [Figure: Gray Matter Volume (Y) against Allele Count 0, 1, 2 (Xj)]
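    The "1 subject expected" figure is consistent with Hardy-Weinberg proportions, under which the rare homozygote has frequency MAF². The slide does not name that assumption, so the sketch below should be read with it in mind:

```python
def expected_rare_homozygotes(n_subjects, maf):
    """Expected number of subjects with two copies of the minor allele,
    assuming Hardy-Weinberg equilibrium (genotype frequency = maf ** 2)."""
    return n_subjects * maf ** 2

for n in (100, 400, 1000):
    print(n, "subjects ->", round(expected_rare_homozygotes(n, 0.10), 2), "expected")
```

    With a few hundred subjects the rare-genotype group is a handful of people, so one unusual brain in that group can dominate the fitted genetic effect.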

  • Mass Univariate Modelling Genetic Effects

    •  Ad hoc solution
      –  If MAF ≥ 0.31 (= √0.1, i.e. expected rare genotype frequency ≥ 0.1)
        •  2DF Genotypic model
        •  Additive + Nonadditive Parameterization
          –  Additive [ -1 0 +1 ] tested
          –  Nonadditive [ -1/2 +1 -1/2 ] not tested (orthogonalized w.r.t. additive regressor)
      –  If MAF < 0.31
        •  Use dominant/recessive model
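    A minimal sketch of this rule and of the additive + nonadditive parameterization follows. Two details are assumptions rather than anything stated on the slide: the cutoff is applied as "at or above √0.1", and the nonadditive regressor is orthogonalized against the intercept as well as the additive regressor.

```python
import math
import numpy as np

MAF_CUTOFF = math.sqrt(0.1)   # about 0.316; the slide rounds this to 0.31

def choose_model(maf):
    """Pick the genetic model from the minor allele frequency."""
    return "genotypic (2 DF)" if maf >= MAF_CUTOFF else "dominant/recessive"

def genotypic_regressors(counts):
    """Additive [-1, 0, +1] and nonadditive [-1/2, +1, -1/2] regressors.

    The nonadditive regressor is residualized on [intercept, additive],
    so it is orthogonal to both.
    """
    counts = np.asarray(counts)
    additive = np.array([-1.0, 0.0, 1.0])[counts]
    nonadd = np.array([-0.5, 1.0, -0.5])[counts]
    Z = np.column_stack([np.ones(len(counts)), additive])
    coef, *_ = np.linalg.lstsq(Z, nonadd, rcond=None)
    return additive, nonadd - Z @ coef

counts = [0, 0, 0, 1, 1, 2, 0, 1, 2, 0]
additive, nonadd_orth = genotypic_regressors(counts)
```

    After orthogonalization the additive coefficient has the same estimate whether or not the nonadditive column is in the model, which is what allows testing the additive term alone while still absorbing nonadditive variance.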

  • Mass Univariate Modelling Nuisance Effects

    •  Age & Gender
      –  Substantial normal variation in GM w/ Age
    •  Total Gray matter
      –  Accounts for differences in head size
      –  Discounts global changes to find localized changes
    •  Scanner (Pre/Post Upgrade)
      –  Upgrade 2/3-through study altered image contrast
    •  Medication (Yes/No, for cases only)
      –  Neurotrophic effects reported for some Rx

  • Model Diagnosis for Imaging
    •  Why bother?
      –  Largish n, continuous data, Central Limit Theorem should carry us
      –  Type I Error generally OK due to robustness of t-test/ANOVA-like models
    •  Sensitivity!
      –  Decreased sensitivity due to inflated error variance σ2
      –  Suboptimal sensitivity due to non-normality
    •  How!?
      –  100,000 voxels, 400 subjects
      –  100,000 QQ plots to look at all 40 million data points?
    [Figure: Failed GM segmentation due to data formatting error]
    [Figure: Warping artefacts seen in modulated GM]

  • Model Diagnosis for Imaging
    •  Model summaries
      –  Images of diagnostic stats
    •  Scan summaries
      –  Vectors of ad hoc measures
    •  Dynamic graphical tool
      –  Explore many summaries simultaneously
      –  Easily jump from summary image to plots, from plots to residual images
    •  End Result
      –  Swiftly localize and understand problems

    Model Summaries
    Statistic         Assesses          Null Distn
    Cook-Weisberg     Var(εi) = σ2      Chi-Squared
    Shapiro-Wilk      ε ~ Normal        (tabulated)
    Outlier Count     Artifacts         Binomial
    Std. Deviation    Artifacts

    Scan Summaries
    Summary                                          Interpretation
    Global intensity                                 Whole-brain signals or artifacts
    Outlier Count                                    Artifacts
    Any preprocessing parameters, e.g. head size     Suggests cause of artifacts
    Experimental predictors                          For investigating mismodelled signal in residuals

    Luo & Nichols NeuroImage 19:1014–1032, 2003 http://go.warwick.ac.uk/tenichols/software

  • Model Diagnosis w/ SPMd
    [Figure panels: Scan Summaries; Model Summaries; Model Detail; Scan Detail]

  • Outline / Motivation
    •  Data
      –  Intro to Voxel Based Morphometry data
    •  Model
      –  Quantitative trait regression w/ Mass Univariate Model
    •  Diagnosis
      –  100,000 Q-Q plots anyone?
    •  Inference
      –  Cluster size under nonstationarity
      –  Candidate screening procedure
    •  Results
      –  GSK3β in MDD
    •  Future Directions

  • Inference On Images: Voxel-wise vs. Cluster-wise

    •  Voxel-wise
      –  Reject Ho, point-by-point, by statistic magnitude
    •  Cluster-wise
      –  Define contiguous blobs with arbitrary threshold uclus
      –  Reject Ho for each cluster larger than kα
    [Figure: statistic image plotted over space with threshold uclus; a cluster smaller than kα is not significant, a cluster larger than kα is significant]
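    The cluster-wise rule can be illustrated on a one-dimensional statistic "image". The values below are made up, and kα is simply given; in practice it would come from random field theory or permutation, which this sketch does not attempt.

```python
def significant_clusters(stat, u_clus, k_alpha):
    """Return (start, end, size) for each contiguous run with stat > u_clus
    whose size exceeds k_alpha. The end index is exclusive."""
    clusters, start = [], None
    # A trailing sentinel below any threshold closes a run that reaches the edge.
    for i, value in enumerate(list(stat) + [float("-inf")]):
        if value > u_clus and start is None:
            start = i
        elif value <= u_clus and start is not None:
            if i - start > k_alpha:
                clusters.append((start, i, i - start))
            start = None
    return clusters

stat = [0.1, 2.6, 2.9, 0.3, 3.1, 3.4, 2.8, 3.0, 0.2]
print(significant_clusters(stat, u_clus=2.3, k_alpha=3))   # [(4, 8, 4)]
```

    The two-voxel blob at positions 1-2 exceeds the threshold but is too small, so only the four-voxel blob is declared significant.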

  • Cluster Inference & Stationarity
    •  Cluster-wise preferred over voxel-wise
      –  Generally more sensitive (Friston et al, NeuroImage 4:223-235, 1996)
      –  Spatially-extended signals typical
    •  Problem w/ VBM
      –  Standard cluster methods assume stationarity, constant smoothness
      –  Assuming stationarity, false positive clusters will be found in extra-smooth regions
      –  VBM noise very non-stationary
    •  Nonstationary cluster inference
      –  Must un-warp nonstationarity
      –  Reported but not implemented
        •  Hayasaka et al, NeuroImage 22:676–687, 2004
      –  Now available as SPM toolbox
        •  http://fmri.wfubmc.edu/cms/software#NS
    [Figure: VBM image of FWHM noise smoothness; nonstationary noise… …warped to stationarity]

  • Inference in Imaging Genetics: Creeping Multiple Testing Problem

    •  Even just with candidate analyses, can end up searching over…
      –  Genes
      –  SNPs within a gene
      –  Space (voxels or clusters)
      –  Different contrasts on GLM
        •  Main effect? By clinical subgroup? Interactions?
    •  Can quickly lose confidence in results
      –  E.g. 0.005 FWE-corrected is great… …unless it’s the 25th statistic image you’ve seen

  • Inference in Imaging Genetics Multiple Testing Strategy

    •  Define strict primary outcome
      –  For given gene, use single SNP
        •  Best (large) association study significance, otw
        •  Best nonsynonymous exonic available, otw
        •  Best 5’ intronic available
      –  For each SNP, only consider main effect of gene
        •  If fitting gene x group interaction, test for average effect
          –  Any association is more likely than a disease-specific association
          –  Even if disease-specific association, opposing sign of effect unlikely w/ VBM
      –  1-number summary per gene
        •  Minimum nonstationary cluster FWE-corrected P-value for association (1 DF F-stat)
      –  Bonferroni correction for number of genes
    •  Primary outcomes then have strong FWE control
      –  Over brain, over genes
      –  (1-α)100% confidence of no false positives anywhere
    •  Secondary outcomes
      –  Interactions, sub-group results
      –  Use same FWE-inferences, but mark as post-hoc

    Need Becky to check correctness/terminology here!
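    The gene-level step of this strategy is small enough to write out. The sketch assumes each gene has already been reduced to its single summary number (the minimum FWE-corrected cluster P-value over the brain); the gene names and P-values are placeholders for illustration, not results from the study.

```python
def bonferroni_over_genes(min_cluster_p, alpha=0.05):
    """min_cluster_p maps gene -> smallest FWE-corrected cluster P-value
    (already corrected over the brain).
    Returns gene -> (P-value adjusted for the number of genes, reject?)."""
    n_genes = len(min_cluster_p)
    result = {}
    for gene, p in min_cluster_p.items():
        p_adj = min(1.0, p * n_genes)       # Bonferroni over genes
        result[gene] = (p_adj, p_adj <= alpha)
    return result

# Hypothetical inputs, for illustration only
example = {"GENE_A": 0.004, "GENE_B": 0.030, "GENE_C": 0.600}
decisions = bonferroni_over_genes(example)
for gene, (p_adj, reject) in decisions.items():
    print(gene, round(p_adj, 3), "reject" if reject else "retain")
```

    Because the inputs are already FWE-corrected over space, the adjusted values control the chance of any false positive over both the brain and the set of genes.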

  • Results: Model Diagnosis
    Outlier Detection with Shapiro-Wilk
    [Figure: -log10 P Shapiro-Wilk image, with two outliers marked; Mean Smoothed Mod. GM image]

  • Results: Model Diagnosis
    Characterising Outliers with Standardized Residual Images
    [Figure: standardized residual images for Subject 193, Subject 194 (outlier) and Subject 195]
    Note: Compare standardized residuals to +/-6.128 (Bonferroni for 324 images, each with 173,823 voxels, each a 2-sided test)
    [Figure: Subject 194 raw T1, showing severe enlargement of inferior horn of lateral ventricle]
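    The ±6.128 cutoff can be approximately reproduced if one assumes a standard normal reference distribution and a family-wise level of α = 0.05. Neither is stated on the slide, so both are assumptions of this sketch:

```python
from statistics import NormalDist

def bonferroni_z_threshold(n_images, n_voxels, alpha=0.05):
    """Two-sided Bonferroni cutoff for standardized residuals,
    assuming each residual is standard normal under the null."""
    n_tests = n_images * n_voxels
    tail = alpha / (2 * n_tests)            # split alpha over both tails
    return -NormalDist().inv_cdf(tail)

z = bonferroni_z_threshold(324, 173_823)
print(f"{324 * 173_823:,} tests -> |z| cutoff {z:.3f}")
```

    With more than 56 million residuals, a cutoff this extreme is what it takes to keep the chance of any false alarm across all images and voxels at the nominal level.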

  • Results: Outlier Exploration
    [Figure: Subject 194 (outlier) vs. randomly selected control]
    Inferior Horn of Lateral Ventricle
    In most of us, this is a pencil-lead-thick fluid-filled space
    In this subject it was pencil-thick
    Clinical collaborator verified it as abnormal & subject was removed

  • GSK3β and Structural Differences
    2 SNPs in strong linkage disequilibrium showed significant associations with GM differences in MDD patients:
      –  rs6438552
      –  rs12630592
    Brain regions where SNP clusters show co-localization.
    GSK3β-Gray Matter association in bilateral superior temporal gyrus (STG) and right hippocampus

    Inkster, B., Nichols, T. E., Saemann, P. G., Auer, D. P., Holsboer, F., Muglia, P., & Matthews, P. M. (2009). Association of GSK3 Polymorphisms With Brain Structural Changes in Major Depressive Disorder. Archives of General Psychiatry, 66(7), 721-728.

  • ‘AA genotype group’ associated with decreased GM concentration in right STG

    rs6438552 is a putative functional SNP

    i.e. it regulates the selection of splice acceptor sites in vitro.

    P = 0.0004 (corrected for whole brain search and multiple SNP testing)

    Inkster, B., Nichols, T. E., Saemann, P. G., Auer, D. P., Holsboer, F., Muglia, P., & Matthews, P. M. (2009). Association of GSK3 Polymorphisms With Brain Structural Changes in Major Depressive Disorder. Archives of General Psychiatry, 66(7), 721-728.

  • Wnt Signaling Pathways

    regulates the development of the hippocampus

    WNT3A

    FZD3

    KRM1

    DVL2

    CTNNB1

    AXIN2

    TCF4

    LEF1

    SMAD1

    PPARgC1a

    EMX2

    ZEB2

  • WNT pathway genes

    [Figure: brain maps, right side marked R, for ZEB2, FZD3, DVL2, AXIN2, GSK3β, SMAD1, PPARGC1A, EMX2]

    Inkster, B., Nichols, T. E., Saemann, P. G., Auer, D. P., Holsboer, F., Muglia, P., & Matthews, P. M. (2010). Pathway-based approaches to imaging genetics association studies: WNT signaling, GSK3beta substrates and major depression. NeuroImage, 53(3), 908-917.

  • Outline

    •  Background
      –  Structural brain imaging & VBM
      –  Genetics
      –  “Imaging Genetics”
    •  Candidate SNP VBM
    •  Multivariate SNP analyses

  • Possible Mass-Univariate Analyses
    •  Full cross analysis
      –  Massive multiple testing problem!
      –  500,000 SNPs × 100,000 voxels ≈ 10^10 tests!
    •  Candidate SNP
      –  Full image result
      –  Must have right SNP
      –  1 SNP × 100,000 voxels ≈ 10^5 tests
    •  Voxel/Region QTL
      –  Whole genome association
      –  Must have right ROI
      –  500,000 SNPs × 1 voxel/region ≈ 10^6 tests
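    The three test counts are plain products of the number of SNPs and the number of imaging measurements. The sketch below recomputes them and, as an added illustration that is not on the slide, shows the per-test Bonferroni level each would imply at 0.05:

```python
n_snps, n_voxels = 500_000, 100_000

tests = {
    "full cross": n_snps * n_voxels,      # every SNP at every voxel
    "candidate SNP": 1 * n_voxels,        # one SNP at every voxel
    "voxel/region QTL": n_snps * 1,       # every SNP at one voxel or ROI
}
for name, count in tests.items():
    print(f"{name}: {count:.0e} tests, Bonferroni level {0.05 / count:.0e}")
```

    The slide's figures are orders of magnitude; the exact products are 5 × 10^10, 10^5 and 5 × 10^5.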

  • Multivariate Regression
    $$ \underset{N \times N_V}{Y} \;=\; \underset{N \times N_G}{X}\;\underset{N_G \times N_V}{C} \;+\; \underset{N \times N_V}{E} $$
    Y: Images; X: Genotypes; C: Regression Coefficients; E: Error
    •  Silly…
      –  If N > NG, fit equivalent to NV univariate models fit independently
      –  Much redundancy in C
    •  rank{C} ≤ min(NV, NG) ≪ NV ∙ NG
    N # subjects, NV # voxels/ROIs, NG # genes/SNPs
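    The "silly" point can be checked numerically: fitting all columns of Y at once by least squares gives exactly the coefficients obtained by fitting each column separately. The data are random stand-ins, and the sizes are arbitrary small numbers chosen so that N > NG.

```python
import numpy as np

rng = np.random.default_rng(1)
N, NG, NV = 60, 5, 8
X = rng.binomial(2, 0.3, size=(N, NG)).astype(float)   # genotypes (allele counts)
Y = rng.normal(size=(N, NV))                            # stand-in for image data

# One multivariate least-squares fit: C is NG x NV
C_multi, *_ = np.linalg.lstsq(X, Y, rcond=None)

# NV separate univariate-response fits, one per voxel/ROI
C_cols = np.column_stack(
    [np.linalg.lstsq(X, Y[:, v], rcond=None)[0] for v in range(NV)]
)

same = np.allclose(C_multi, C_cols)
print("multivariate fit equals column-wise fits:", same)
```

    Nothing is gained from the joint fit unless C is constrained, which is the motivation for restricting its rank.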

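  • Reduced Rank Regression
    $$ \underset{N \times N_V}{Y} \;=\; \underset{N \times N_G}{X}\;\underset{N_G \times r}{B}\;\underset{r \times N_V}{A} \;+\; \underset{N \times N_V}{E} $$
    Y: Images; X: Genotypes; B: Genotype Coefficients; A: Image Coefficients; E: Error
    •  Fix rank r
    •  Approximate C ≈ B A, with B & A each rank r
    N # subjects, NV # voxels/ROIs, NG # genes/SNPs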
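    Under an ordinary (unweighted) least-squares criterion, a rank-r fit has a closed form: fit the unconstrained coefficients, then project them onto the leading r right singular vectors of the fitted values. The sketch below uses that classical construction on simulated data and assumes X has full column rank; it is an illustration, not the authors' code.

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Rank-r least-squares fit of Y = X B A + E (unweighted criterion).
    Returns B (NG x r, genotype coefficients) and A (r x NV, image coefficients)."""
    C_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)          # unconstrained NG x NV fit
    _, _, Vt = np.linalg.svd(X @ C_ols, full_matrices=False)
    A = Vt[:r]                  # leading right singular vectors of the fitted values
    B = C_ols @ A.T             # project the OLS coefficients onto them
    return B, A

rng = np.random.default_rng(2)
N, NG, NV, r = 200, 6, 10, 2
X = rng.normal(size=(N, NG))
C_true = rng.normal(size=(NG, r)) @ rng.normal(size=(r, NV))   # true rank-2 coefficients
Y = X @ C_true + 0.1 * rng.normal(size=(N, NV))
B, A = reduced_rank_regression(X, Y, r)
print("rank of B A:", np.linalg.matrix_rank(B @ A))
```

    Instead of NG × NV free coefficients, the model now has r × (NG + NV), which is where the reduction in redundancy comes from.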

  • Sparse Reduced Rank Regression
    $$ \underset{N \times N_V}{Y} \;=\; \underset{N \times N_G}{X}\;\underset{N_G \times r}{B}\;\underset{r \times N_V}{A} \;+\; \underset{N \times N_V}{E} $$
    Y: Images; X: Genotypes; B: Sparse Genotype Coefficients; A: Sparse Image Coefficients; E: Error
    •  Fix rank r
    •  Approximate C ≈ B A, with B & A each rank r
    •  Enforce sparsity
    N # subjects, NV # voxels/ROIs, NG # genes/SNPs

    Vounou, M., Nichols, T. E., & Montana, G. (2010). Discovering genetic associations with high-dimensional neuroimaging phenotypes: A sparse reduced-rank regression approach. NeuroImage, 53(3), 1147-59.

  • Sparse Reduced Rank Regression - Estimation
    •  RRR
      –  Y = X B A + E
      –  For fixed rank r, find A & B that minimize
           $$ M = \operatorname{tr}\{ (Y - XBA)\,\Gamma\,(Y - XBA)' \} $$
         for some NV × NV matrix Γ, e.g. Γ = I
    •  SRRR
      –  For rank 1, find a & b that minimize
           $$ M = \operatorname{tr}\{ (Y - Xba')\,\Gamma\,(Y - Xba')' \} + \lambda_a \|a\|_1 + \lambda_b \|b\|_1 $$
      –  Then subtract Xba’ from the data, and repeat
      –  Need to specify final rank r, λa & λb
        •  Can set λa & λb in terms of #|a|>0 & #|b|>0
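    A rough sketch of a sparse rank-1 fit by alternating soft-thresholding is given below. It rests on simplifying assumptions that are not taken from the slides: Γ = I, standardized genotype columns, and X'X/N treated as the identity so that each update has a closed form. The simulated data and penalty values are likewise arbitrary.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t; entries smaller than t become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_rank1(X, Y, lam_a, lam_b, n_iter=50):
    """Sparse rank-1 fit Y ~ X b a' by alternating soft-thresholding.
    ASSUMES Gamma = I and X'X / N close to the identity (illustrative only).
    Returns a (NV image coefficients) and unit-norm b (NG genotype coefficients)."""
    N = X.shape[0]
    K = X.T @ Y / N                                        # NG x NV cross-covariance
    a_unit = np.linalg.svd(K, full_matrices=False)[2][0]   # leading right singular vector
    a, b = np.zeros(Y.shape[1]), np.zeros(X.shape[1])
    for _ in range(n_iter):
        b_raw = soft_threshold(K @ a_unit, lam_b / 2)
        if not b_raw.any():                                # penalty removed every SNP
            return np.zeros_like(a), b_raw
        b = b_raw / np.linalg.norm(b_raw)
        a = soft_threshold(K.T @ b, lam_a / 2)
        if not a.any():                                    # penalty removed every ROI
            return a, np.zeros_like(b)
        a_unit = a / np.linalg.norm(a)
    return a, b

rng = np.random.default_rng(3)
N, NG, NV = 2000, 30, 20
G = rng.binomial(2, 0.3, size=(N, NG)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)                   # standardized genotypes
Y = 0.5 * rng.standard_normal((N, NV))
Y[:, :4] += X[:, :3].sum(axis=1, keepdims=True)            # SNPs 0-2 affect ROIs 0-3
a, b = sparse_rank1(X, Y, lam_a=1.0, lam_b=1.0)
print("selected SNPs:", np.flatnonzero(b), "selected ROIs:", np.flatnonzero(a))
```

    Higher ranks would follow the deflation step on the slide: subtract `X @ np.outer(b, a)` from Y and run the same routine again.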

  • Simulation: Phenotype & SNPs
    •  Simulated MRI data
      –  ADNI T1 images through SPM5 VBM pipeline
      –  NV = 111 ROIs, placed on VBM data from 189 MCI ADNI subjects
        •  GSK CIC Atlas, based on Harvard-Oxford atlas
      –  Estimate covariance Σ after adjusting for age & gender
      –  Simulate ROI data (for arbitrary N) with covariance Σ
    •  Evaluate with realistic genetic population w/ FREGENE
      –  Simulates sequence-level data in large population
      –  Provides 10K individuals, 20Mb chromosome (~180K SNPs)
        •  Chadeau-Hyam, et al. BMC Bioinformatics, 9:364, 2009
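    Drawing ROI data "with covariance Σ" for any N can be done through a Cholesky factor of Σ. The sketch assumes Gaussian data, which the slide does not state, and uses a small made-up Σ in place of the 111 × 111 matrix estimated from the ADNI subjects.

```python
import numpy as np

def simulate_roi_data(Sigma, N, rng):
    """Draw N zero-mean ROI vectors with covariance Sigma (Gaussian assumption)."""
    L = np.linalg.cholesky(Sigma)                  # Sigma = L @ L.T
    return rng.standard_normal((N, Sigma.shape[0])) @ L.T

rng = np.random.default_rng(4)
NV = 5                                             # the slide uses 111 ROIs
# Hypothetical stand-in for the estimated covariance: unit variances, correlation 0.5
Sigma = np.full((NV, NV), 0.5) + 0.5 * np.eye(NV)
Y = simulate_roi_data(Sigma, N=20_000, rng=rng)
max_error = np.abs(np.cov(Y, rowvar=False) - Sigma).max()
print("largest covariance error:", round(float(max_error), 3))
```

    Because N is a free argument, the same Σ can generate cohorts of any size for the power simulations.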

  • Simulation: Phenotype & SNPs
    •  FREGENE SNP simulation
      –  Population of 10,000 evolved over 200,000 generations
      –  20Mb simulated
      –  37,748 SNPs with MAF>0.05
      –  Select k=10 causative SNPs
        •  From all possible having MAF=0.2
        •  Used to induce phenotypic effect
          –  But then dropped from consideration
        •  Represents realistic setting, where causative SNP is not seen, but effect captured through local LD
      –  From population of 10,000, repeatedly sample cohorts of size N
    •  Simulated association in MRI data
      –  Add genetic effect to Frontal and Temporal ROIs with causative SNPs
        •  γ = 0.06, 0.08, or 0.1 reduction in mean GM in affected ROI
        •  Calibrated to Filippini et al. (2009)
          –  10% reduction in GM in ApoE ε4/ε4 subjects relative to subjects with no ε4 alleles

  • FREGENE: Evolutionary model of world population
    [Figure stages: founding population in Africa; expansion; Out of Africa (OoA) split & bottleneck; expansion; Asian & European split; expansion]
    Chadeau-Hyam, et al. BMC Bioinformatics, 9:364, 2009

  • Why try so hard? Why not rand{0,1,2}^500,000 ?
    •  Linkage disequilibrium (LD)
      –  SNPs not independent
      –  Highly structured, heterogeneous dependence
    •  Population sub-structure
      –  Ethnic differences & migration patterns induce systematic variation
    •  Multivariate analysis
      –  Want realistic multivariate structure in our simulations
    [Figure from The Wellcome Trust Case Control Consortium, Nature 447, 661-678, 2007.]

  • Realistic Phenotype

    •  All pairwise GM correlations among NV = 111 ROIs

  • Realistic Genotypes

    •  Correlation of first 1000 simulated SNPs

  • Simulation Setting: Horse shoes & Imaging Genetics
    •  “True positive” with missing causative SNP
      –  Declare true positive if LD coefficient close enough
    •  LD-linked SNPs
      –  Of 1990 SNPs
      –  51 linked (r>0.8) to one or more of the 10 causative SNPs
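    The "close enough" rule can be sketched as a correlation check between observed SNPs and the unobserved causative ones. The function name, the genotype layout (individuals in rows, allele counts in columns) and the toy genotypes are assumptions for illustration; only the r > 0.8 cutoff comes from the slide.

```python
import numpy as np

def ld_linked(G_observed, G_causal, r_min=0.8):
    """Indices of observed SNPs whose genotype correlation with at least one
    causative SNP exceeds r_min. Rows are individuals, columns are SNPs."""
    n_obs = G_observed.shape[1]
    R = np.corrcoef(G_observed, G_causal, rowvar=False)[:n_obs, n_obs:]
    return np.flatnonzero((R > r_min).any(axis=1))

# Toy genotypes: observed SNP 0 is a near copy of the causative SNP, SNP 1 is not
causal = np.array([0, 1, 2, 0, 1, 0, 0, 1, 0, 2] * 10)
near_copy = causal.copy()
near_copy[0] = 1                                   # a single discordant genotype
unrelated = np.array([0, 0, 1, 1, 2, 0, 1, 0, 2, 0] * 10)
G_obs = np.column_stack([near_copy, unrelated])
G_causal = causal[:, None]

linked = ld_linked(G_obs, G_causal)
print("LD-linked observed SNPs:", linked.tolist())   # [0]
```

    A selected SNP that appears in this list would count as a true positive even though the causative SNP itself was withheld from the analysis.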

  • SRRR Simulation Results
    •  Power to detect 1 or more SNPs (NG=1990)
    •  For ranks r = 1,2,3, SRRR dominates Mass Uni.
      –  Better for higher r

  • SRRR Simulation Results
    •  Power to detect 1 or more SNPs (NG=1990)
    •  For ranks r = 1,2,3, SRRR dominates Mass Uni.
      –  Better for higher r; here r = 3, high eff. size.

  • SRRR Simulation Results

    •  Power to detect 1 or more ROIs
    •  Less difference
      –  Power can be manipulated by varying λ by rank

  • SRRR: Multivariate vs. Mass-Univariate

    •  Does this NG=1990 result generalize?
    •  For up to 40k SNPs
      –  r = 3, med. effect size, N=1000
      –  Power 2-5 times greater
      –  Absolute power still tiny

  • SRRR Simulation Results
    •  Power to detect 1 or more SNPs (NG=1990)
    •  For ranks r = 1,2,3, SRRR dominates Mass Uni.
      –  Better for higher r; here r = 3

  • Sparse Reduced Rank Regression for SNP – MRI Association
    •  Detailed simulation of imaging & genetic correlation structure
      –  Suggests multivariate approach will outperform mass-univariate
      –  Power tiny, in any event
    •  Much work to do
      –  Haven’t addressed how to optimize phenotype
      –  Haven’t tried to estimate penalty parameters λa, λb or r
    •  Currently investigating stability selection
      –  See #316 Le Floch et al

  • Conclusions
    •  VBM
      –  Powerful, “automated” anatomical analysis
      –  Need careful raw data, preprocessing & model QC
    •  Imaging Genetics
      –  Mash-up of two large data, massive multiple testing problems
    •  Candidate SNP VBM
      –  Given a SNP, just like a traditional imaging analysis
      –  Multiple SNPs possible too, but need combining methods
    •  Multivariate Sparse Reduced Rank Regression
      –  Promising, but little power unless have 1,000’s of subjects