Association analysis Genetics for Computer Scientists 15.3.-19.3.2004 Biomedicum & Department of Computer Science, Helsinki Päivi Onkamo


Association analysis

Genetics for Computer Scientists 15.3.-19.3.2004

Biomedicum & Department of Computer Science, Helsinki

Päivi Onkamo


Lecture outline

• Genetic association analysis
• Allelic association
• χ2 test
• Linkage disequilibrium (LD) process
• Formulation of the computational problem for LD mapping
• Limitations of the LD mapping
• Approaches. For example: HPM


Genetic association analysis

• Search for significant correlations between gene variants and phenotype

• For example: Locus A for SLE: 100 cases and 100 controls genotyped

            Affected   Unaffected
Allele 1    79         46
Allele 2    21         54


Allelic association = an allele is associated with a trait

• Allele 1 seems to be associated, based on sheer numbers, but how sure can one be about it?


            Affected   Healthy   Total
Allele 1    79         46        125
Allele 2    21         54        75
Total       100        100       200


• The idea is to compare the observed frequencies to the frequencies expected under the hypothesis of no association between alleles and the occurrence of the disease (independence between the variables)

• Test statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$

where
• o_i is the observed class frequency for class i, and e_i the frequency expected under H0 of no association
• k is the number of classes in the table
• Degrees of freedom for the test: df = (r-1)(s-1), where r and s are the numbers of rows and columns in the table


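$$\chi^2 = \sum_{i,j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = \frac{(79 - 62.5)^2}{62.5} + \frac{(46 - 62.5)^2}{62.5} + \frac{(21 - 37.5)^2}{37.5} + \frac{(54 - 37.5)^2}{37.5} = 23.23$$

df = 1, p << 0.001

Expected frequencies (observed in parentheses):

            Affected    Healthy     Total
Allele 1    62.5 (79)   62.5 (46)   125
Allele 2    37.5 (21)   37.5 (54)   75
Total       100         100         200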
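The calculation on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming the table is given as a list of rows; the function name `chi_square` is made up for illustration and is not part of the lecture material.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for an r x s table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under H0 (no association): row total * column total / n
            e = row_totals[i] * col_totals[j] / n
            stat += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Rows: allele 1, allele 2; columns: affected, healthy
observed = [[79, 46], [21, 54]]
stat, df = chi_square(observed)
print(round(stat, 2), df)  # 23.23 1
```

The expected counts are derived from the margins, so the same function works for larger tables, with df = (r-1)(s-1).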


Critical values of the χ2 distribution (rows: degrees of freedom; columns: upper-tail probability)

df 0,995 0,950 0,100 0,050 0,025 0,010 0,005

1 0,000 0,004 2,706 3,842 5,024 6,635 7,879

2 0,010 0,103 4,605 5,992 7,378 9,210 10,597

3 0,072 0,352 6,251 7,815 9,348 11,345 12,838

4 0,207 0,711 7,779 9,488 11,143 13,277 14,860

5 0,412 1,146 9,236 11,071 12,833 15,086 16,750

6 0,676 1,635 10,645 12,592 14,449 16,812 18,548

7 0,989 2,167 12,017 14,067 16,013 18,475 20,278

8 1,344 2,733 13,362 15,507 17,535 20,090 21,955

9 1,735 3,325 14,684 16,919 19,023 21,666 23,589

10 2,156 3,940 15,987 18,307 20,483 23,209 25,188

11 2,603 4,575 17,275 19,675 21,920 24,725 26,757

12 3,074 5,226 18,549 21,026 23,337 26,217 28,300

13 3,565 5,892 19,812 22,362 24,736 27,688 29,819

14 4,075 6,571 21,064 23,685 26,119 29,141 31,319

15 4,601 7,261 22,307 24,996 27,488 30,578 32,801

16 5,142 7,962 23,542 26,296 28,845 32,000 34,267

17 5,697 8,672 24,769 27,587 30,191 33,409 35,718

18 6,265 9,390 25,989 28,869 31,526 34,805 37,156

19 6,844 10,117 27,204 30,144 32,852 36,191 38,582

20 7,434 10,851 28,412 31,410 34,170 37,566 39,997

21 8,034 11,591 29,615 32,671 35,479 38,932 41,401

22 8,643 12,338 30,813 33,924 36,781 40,289 42,796

23 9,260 13,091 32,007 35,172 38,076 41,638 44,181

24 9,886 13,848 33,196 36,415 39,364 42,980 45,558

25 10,520 14,611 34,382 37,652 40,646 44,314 46,928

26 11,160 15,379 35,563 38,885 41,923 45,642 48,290

27 11,808 16,151 36,741 40,113 43,195 46,963 49,645

28 12,461 16,928 37,916 41,337 44,461 48,278 50,994

29 13,121 17,708 39,087 42,557 45,722 49,588 52,335

30 13,787 18,493 40,256 43,773 46,979 50,892 53,672

40 20,707 26,509 51,805 55,758 59,342 63,691 66,766

50 27,991 34,764 63,167 67,505 71,420 76,154 79,490

60 35,534 43,188 74,397 79,082 83,298 88,379 91,952

70 43,28 51,74 85,53 90,53 95,02 100,43 104,21

80 51,17 60,39 96,58 101,88 106,63 112,33 116,32

90 59,20 69,13 107,57 113,15 118,14 124,12 128,30

100 67,33 77,93 118,50 124,34 129,56 135,81 140,17


Interpretation of the test results

• The p-value is low enough that H0 can be rejected: if there were no association, the probability that the observed frequencies would differ this much (or even more) from the expected ones by chance alone is < 0.001

• χ2 tables (Appendix), internet resources, etc.
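Instead of a printed table, the p-value for df = 1 can be computed directly. The sketch below relies on the fact that a χ2 variable with one degree of freedom is the square of a standard normal variable, so its upper-tail probability can be written with the complementary error function. The function name is an illustrative assumption, and the shortcut applies only to df = 1.

```python
import math


def chi2_pvalue_df1(x):
    """Upper-tail probability P(X >= x) for a chi-square variable with 1 df."""
    # X = Z^2 with Z standard normal, so P(X >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(x / 2.0))


p = chi2_pvalue_df1(23.23)
print(p < 0.001)  # True
```

As a sanity check against the table, the df = 1 critical value 3,842 gives a probability very close to 0.05.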


• Genetic association is a population-level correlation between some known genetic variant and a trait: an allele is over-represented in affected individuals

• From a genetic point of view, an association does not imply a causal relationship

• Often, a gene is not a direct cause of the disease, but is in LD with a causative gene


Linkage disequilibrium (LD)

• Closely located genes often exhibit linkage disequilibrium with each other: consider locus 1 with alleles A and a, and locus 2 with alleles B and b, at a distance of a few centiMorgans from each other

• At equilibrium, the frequency of the AB haplotype should equal the product of the allele frequencies of A and B: $p_{AB} = p_A p_B$. If this holds, then $p_{Ab} = p_A p_b$, $p_{aB} = p_a p_B$ and $p_{ab} = p_a p_b$ as well. Any deviation from these values implies LD.
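The deviation from equilibrium is commonly summarised by the coefficient D, the difference between the observed AB haplotype frequency and the product of the allele frequencies. The sketch below uses that standard definition; the symbol D, the function names and the example frequencies are illustrative assumptions, not values from the lecture.

```python
def ld_coefficient(p_AB, p_A, p_B):
    """D = observed AB haplotype frequency minus its equilibrium expectation."""
    return p_AB - p_A * p_B


def haplotype_frequencies(p_A, p_B, D):
    """All four haplotype frequencies implied by the allele frequencies and D."""
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    return {
        "AB": p_A * p_B + D,
        "Ab": p_A * p_b - D,
        "aB": p_a * p_B - D,
        "ab": p_a * p_b + D,
    }


# Hypothetical frequencies: allele A 0.6, allele B 0.3, haplotype AB 0.25
D = ld_coefficient(0.25, 0.6, 0.3)
freqs = haplotype_frequencies(0.6, 0.3, D)
print(round(D, 2))  # 0.07
```

With D = 0 all four haplotype frequencies reduce to the products of the allele frequencies, which is the equilibrium case described above.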


Linkage disequilibrium (LD)

• LD follows from the fact that closely located genes are transmitted as a "block" which only rarely breaks up in meioses

• An example:
– Locus 1: marker gene
– Locus 2: disease locus, with allele b as a dominant susceptibility allele with 100% penetrance


An example


• Association evaluated → Locus 1 also seems associated, even though it has nothing to do with the disease; the association is observed just due to LD

LD mapping: utilizing founder effect

• A new disease mutation was born n generations ago in a relatively small, isolated population
• The original ancestral haplotype slowly decays as a function of generations
• In the last generation, only small stretches of the founder haplotype can be observed in the disease-associated chromosomes
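The decay of the founder haplotype can be illustrated with the textbook approximation that LD between two loci shrinks by a factor (1 - θ) per generation, where θ is the recombination fraction between them. This formula is a standard population-genetics result brought in for illustration, not something stated on the slides; the function name and the example values of θ and of the number of generations are assumptions.

```python
def ld_after_generations(d0, theta, generations):
    """Expected LD after a number of generations: D_t = D_0 * (1 - theta)^t."""
    return d0 * (1.0 - theta) ** generations


# Hypothetical setting: complete LD at the founding event, 20 generations ago
for theta in (0.001, 0.01, 0.05):
    remaining = ld_after_generations(1.0, theta, 20)
    print(f"theta={theta}: {remaining:.3f} of the initial LD remains")
```

Markers close to the mutation (small θ) keep most of their association, while more distant markers lose it, which is why only short stretches of the founder haplotype survive.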


LD mapping: Utilizing founder effect


Data: Searching for a needle in a haystack

[Figure: table of genotyped chromosomes with a disease status column (values a and c) followed by marker columns SNP1, S2, ...; alleles are coded 1 and 2, "?" marks a missing value, and the label "Disease gene" points to a position among the markers]


• The task is to find either an allele or an allele string (haplotype) which is overrepresented in disease-associated chromosomes
– markers may vary: SNPs, microsatellites
– populations vary: the strength of marker-to-marker LD

• Many approaches:
– "old-fashioned" allele association with some simple test (problem: multiple testing)
– TDT; modelling of LD process: Bayesian, EM algorithm, integrated linkage & LD


Limitations of the LD mapping

• The relationship between the distance of the markers vs. the strength of LD: theoretical curve


Linkage disequilibrium (D’) for the African American (red) and European (blue) populations binned in 5 kb classes after removing all SNPs with minor allele frequencies less than 20%. 3429 SNPs were included (Source http://www.fhcrc.org/labs/kruglyak/PGA/pga.html)


Limitations: LD is a random process

• LD is a continuous process, which is created and decreased by several factors:
– genetic drift
– population structure
– natural selection
– new mutations
– founder effect

→ limits the accuracy of association mapping


Research challenges …

• Haplotyping methods needed as a prerequisite for association/LD methods
• …or, searching for association directly from genotype data (without the haplotyping stage)
• Better methods for measurement of the association (and/or the effects of the genes)
• Taking disease models into consideration


A methodological project: Haplotype Pattern Mining (HPM)

AJHG 67:133-145, 2000

• Search the haplotype data for recurrent patterns with no pre-specified sequence

• Patterns may contain gaps, taking into consideration missing and erroneous data

• The patterns are evaluated for their strength of association

• Markerwise ‘score’ of association is calculated
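The idea of a gapped pattern and its strength of association can be sketched as follows. This is a minimal illustration under stated assumptions: haplotypes and patterns are strings of allele codes, `*` marks a gap that matches anything, missing data are not handled, and association is measured with the ordinary 2x2 χ2 statistic (pattern present or absent versus case or control). The function names are hypothetical and the actual HPM implementation may differ in all of these respects.

```python
def matches(pattern, haplotype):
    """True if every non-gap position of the pattern equals the haplotype allele."""
    return all(p == "*" or p == h for p, h in zip(pattern, haplotype))


def pattern_chi2(pattern, haplotypes, is_case):
    """2x2 chi-square for pattern carrier status versus disease status."""
    a = b = c = d = 0
    for hap, case in zip(haplotypes, is_case):
        carrier = matches(pattern, hap)
        if carrier and case:
            a += 1  # carrier, case
        elif carrier:
            b += 1  # carrier, control
        elif case:
            c += 1  # non-carrier, case
        else:
            d += 1  # non-carrier, control
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty margin carries no information about association
    return n * (a * d - b * c) ** 2 / denom


print(matches("21*2*", "21122"))  # True: the gaps accept any allele

# Single-marker data reproducing the allele table from the χ2 example
haplotypes = ["1"] * 79 + ["2"] * 21 + ["1"] * 46 + ["2"] * 54
is_case = [True] * 100 + [False] * 100
print(round(pattern_chi2("1", haplotypes, is_case), 2))  # 23.23
```

A one-marker pattern reduces to the plain allelic association test, so the second call reproduces the value 23.23 obtained earlier for allele 1.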


Algorithm

1. Find a set of associated haplotype patterns
– number of gaps allowed (2)
– maximum gap length (1 marker)
– maximum pattern length (7 markers)
– association threshold (χ2 = 9)

2. Score loci based on the patterns

• Evaluate significance by permutation tests
• Extendable to quantitative traits
• Extendable to multiple genes


Example: a set of associated patterns

Marker  01  02  03  04  05  06  07  08   χ2

P1       2   1   2   2   2   *   *   *   9.6
P2       2   1   2   2   2   1   *   *   9.2
P3       2   1   2   2   *   1   1   *   8.9
P4       2   1   *   2   1   *   *   *   8.1
P5       1   *   1   2   2   *   *   *   7.4
P6       *   *   1   2   2   1   2   *   7.1
P7       *   2   1   2   *   *   *   *   7.1
P8       2   1   1   2   *   *   *   *   6.9
P9       2   1   1   *   *   *   *   *   6.8

Score    5   6   7   7   6   3   2   0
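The score row can be reproduced by counting, for each marker, the patterns that span it. The sketch below makes two assumptions that are inferred from the numbers rather than stated on the slide: only patterns with χ2 of at least 7 are counted (which excludes P8 and P9), and a pattern spans every marker from its first to its last fixed allele, gaps included. The function name `marker_scores` is hypothetical.

```python
# Patterns from the example table; "*" means the marker is not part of the pattern
patterns = [
    ("21222***", 9.6),  # P1
    ("212221**", 9.2),  # P2
    ("2122*11*", 8.9),  # P3
    ("21*21***", 8.1),  # P4
    ("1*122***", 7.4),  # P5
    ("**12212*", 7.1),  # P6
    ("*212****", 7.1),  # P7
    ("2112****", 6.9),  # P8
    ("211*****", 6.8),  # P9
]


def marker_scores(patterns, threshold, n_markers):
    """Count, per marker, the patterns at or above the threshold that span it."""
    scores = [0] * n_markers
    for pattern, chi2 in patterns:
        if chi2 < threshold:
            continue
        fixed = [i for i, allele in enumerate(pattern) if allele != "*"]
        if not fixed:
            continue
        # A pattern covers everything between its first and last fixed allele
        for i in range(fixed[0], fixed[-1] + 1):
            scores[i] += 1
    return scores


print(marker_scores(patterns, 7.0, 8))  # [5, 6, 7, 7, 6, 3, 2, 0]
```

Under these assumptions the output matches the score row of the table, with the highest scores at markers 03 and 04.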


Pattern selection

• The set of potential patterns is large.

• Depth-first search for all potential patterns

• Search parameters limit search space:
– number of gaps
– maximum gap length
– maximum pattern length
– association threshold


Score and localization: an example


Permutation tests

• random permutation of the status fields of the chromosomes

• 10,000 permutations

• HPM and marker scores recalculated for each permuted data set

• proportion of permuted data sets in which score > true score → empirical p-value.
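The permutation procedure can be sketched generically. The code below assumes the statuses are a list (1 = case, 0 = control) and that a caller-supplied function computes the score from a status assignment; the function names, the toy score and the toy data are illustrative assumptions, not the HPM marker score itself.

```python
import random


def empirical_p_value(statuses, score_fn, n_permutations=10000, seed=0):
    """Proportion of status permutations whose score exceeds the observed score."""
    rng = random.Random(seed)
    observed = score_fn(statuses)
    shuffled = list(statuses)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # permute the status fields, keep the genotypes fixed
        if score_fn(shuffled) > observed:
            exceed += 1
    return exceed / n_permutations


# Toy data: 40 chromosomes, 20 cases; the first 10 chromosomes carry some pattern
carriers = range(10)
statuses = [1] * 9 + [0] + [1] * 11 + [0] * 19


def cases_among_carriers(s):
    """Toy score: number of pattern carriers that are cases."""
    return sum(s[i] for i in carriers)


p = empirical_p_value(statuses, cases_among_carriers, n_permutations=2000)
print(p)
```

Because only the status labels are shuffled, the LD structure between markers is preserved in every permuted data set, which is what makes the comparison with the true score meaningful.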


Permutation surface (A=7.5 %). The solid line is the observed frequency.


Localization power with simulated SNP data (density 3 SNPs per 1 cM). An isolated population with a 500-year history was simulated. The disease model was monogenic, with the disease allele frequency among the affected varying from 2.5 % to 10 %. 12.5 % of the data was missing. Sample size: 100 cases and 100 controls.


Benefits & drawbacks

• Non-parametric, yet efficient approach; no disease model specification is needed +

• Powerful even with weak genetic effects and small data sets +

• Robust to genotyping errors, mutations, missing data +

• Allows for gaps in haplotypes +


• Flexible: easily extended to different types of markers, environmental covariates, and quantitative measurements +

• Optimal pattern search parameters may need to be specified case-wise -

• No rigorous statistical theory in the background -

• Requires a dense enough marker map to find the area where the DS gene is in LD with nearby markers -


• Search of the susceptibility gene:
1. With good luck, and information from gene banks, pick up the correct candidate gene
2. A genetic region with a positive linkage signal is saturated with markers, and this data is then searched for a secondary correlation: correlation of marker allele(s) with the actual disease mutation (LD)


Improved statistical methods to detect LD
– Terwilliger (1995)
– Devlin, Risch, Roeder (1996)
– McPeek and Strahs (1999)
– Service, Lang et al. (1999)

Statistical power of association test statistics
– Long, Langley (1999)

Review on statistical approaches to gene mapping
– Ott, Hoh (2000)