Measuring Genetic Diversity - UFSCarevolucao/TGE/Lect03.pdfmismatch distribution for each...

Demographic events

Measuring Genetic Diversity

• Theta = θ = 4Nµ (mutation), 4Nm (migration) or 4N(µ+m) (both)

• For haploid markers θ = 2Nµ, 2Nm or 2N(µ+m)

• The all-important population genetic parameter.

• It is based on the number of alleles or the number of different nucleotides in a given sample.

• It quantifies the genetic diversity of a given population.

Theta (θ) Hom

• The expected homozygosity (Zouros, 1979; Chakraborty and Weiss, 1991) in a population at equilibrium between drift and mutation.

• Sensitive to small sample and allele sizes

• For microsatellite data

Theta (θ) S

• Estimated from the infinite-site equilibrium relationship (Watterson, 1975) between the number of segregating sites (S), the sample size (n) and θ for a sample of non-recombining DNA.
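As a minimal sketch of this estimator, Watterson's relationship is usually written θ_S = S / a_n, with a_n = 1 + 1/2 + ... + 1/(n-1). The function and argument names below are illustrative and do not come from the slides.

```python
def watterson_theta(num_segregating_sites, sample_size):
    """Watterson's estimator: theta_S = S / a_n, with a_n = sum_{i=1}^{n-1} 1/i."""
    if sample_size < 2:
        raise ValueError("need at least two sequences")
    a_n = sum(1.0 / i for i in range(1, sample_size))
    return num_segregating_sites / a_n
```

For example, `watterson_theta(11, 4)` divides S = 11 by a_4 = 1 + 1/2 + 1/3.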

Theta (θ) k

• Estimated from the infinite-allele equilibrium relationship (Ewens, 1972) between the expected number of alleles (k), the sample size (n) and θ.

• 95% confidence limits are calculated as

Stirling number (expansion factor of a factorial)

Falling factorial
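The sketch below assumes the standard Ewens (1972) results: the expected number of alleles is E(k) = Σ θ/(θ+i) for i = 0..n-1, and the probability of observing exactly k alleles is |s(n,k)| θ^k / [θ(θ+1)...(θ+n-1)], where |s(n,k)| is an unsigned Stirling number of the first kind. Taking the denominator product to be what the factorial label above refers to is an assumption, and all function names are illustrative. θ_k is found by bisection, and confidence limits could be read off the probability function by scanning θ.

```python
def expected_num_alleles(theta, sample_size):
    """E(k) under the infinite-allele model (Ewens, 1972)."""
    return sum(theta / (theta + i) for i in range(sample_size))

def theta_k(num_alleles, sample_size, tol=1e-10):
    """Solve E(k) = observed k for theta by bisection (E(k) increases with theta)."""
    if not 1 < num_alleles < sample_size:
        raise ValueError("need 1 < k < n for a finite positive estimate")
    lo, hi = 1e-12, 1.0
    while expected_num_alleles(hi, sample_size) < num_alleles:
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if expected_num_alleles(mid, sample_size) < num_alleles:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def unsigned_stirling_first(n, k):
    """|s(n, k)| via the recurrence c(m, j) = c(m-1, j-1) + (m-1) * c(m-1, j)."""
    row = [1]  # row for m = 0
    for m in range(1, n + 1):
        new_row = [0] * (m + 1)
        for j in range(m + 1):
            left = row[j - 1] if j >= 1 else 0
            same = row[j] if j < len(row) else 0
            new_row[j] = left + (m - 1) * same
        row = new_row
    return row[k] if 0 <= k <= n else 0

def prob_num_alleles(num_alleles, theta, sample_size):
    """P(K = k | theta, n) = |s(n, k)| * theta**k / (theta (theta+1) ... (theta+n-1))."""
    denominator = 1.0
    for i in range(sample_size):
        denominator *= theta + i
    return unsigned_stirling_first(sample_size, num_alleles) * theta**num_alleles / denominator
```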

Theta (θ) π̂

• Estimated from the infinite-site equilibrium relationship (Tajima, 1983) between the mean number of pair-wise differences (π̂) and theta (θ).
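A minimal sketch of the quantity involved: the mean number of pair-wise differences is the average Hamming distance over all pairs of aligned sequences. Equal-length, gap-free sequences are assumed, and the function names are illustrative.

```python
from itertools import combinations

def pairwise_differences(seq_a, seq_b):
    """Number of sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b))

def mean_pairwise_differences(sequences):
    """Average number of differences over all n(n-1)/2 pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)
```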

Why so many θ measures

• Not all methods are suitable for all types of data.

• Ultimately all methods should result in the same estimates of theta.

• Differences in estimates can be interpreted as violations of assumptions, and each method is sensitive to different assumptions.

Tajima’s D

• Tajima’s (1989) D test quantifies the discordance between the estimates of theta from the number of segregating sites and from the average pair-wise sequence divergence.

• Negative values are interpreted as a signal of purifying selection or, alternatively, of demographic expansion.
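The statistic can be sketched as follows, using the normalizing constants usually given for Tajima (1989): D is the difference between π̂ and S/a1, divided by the square root of its estimated variance. The function name and arguments are illustrative.

```python
import math

def tajimas_d(mean_pairwise_diff, num_segregating_sites, sample_size):
    """Tajima's D = (pi - S/a1) / sqrt(e1*S + e2*S*(S-1))."""
    n, s = sample_size, num_segregating_sites
    if n < 4 or s < 1:
        raise ValueError("need at least 4 sequences and 1 segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (mean_pairwise_diff - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

When π̂ equals S/a1 the two estimates of θ agree and D is zero; an excess of rare variants lowers π̂ relative to S/a1 and makes D negative.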

Fu’s Fs

• Fu’s (1997) Fs measures the probability of observing a certain number of haplotypes given a particular value of θ – the test looks at discordance in values of θ derived from the number of haplotypes and the average pair-wise sequence divergence.

• Negative values are interpreted as a signal of purifying selection or, alternatively, of demographic expansion.

Strobeck’s S

• Strobeck’s (1987) S measures the level of discordance in values of θ derived from the number of haplotypes and the average pair-wise sequence divergence.

• The expected number of alleles in a sample is an increasing function of the migration rates, whereas the expected average number of nucleotide differences is shown to be independent of the migration rates and equal to 4Nµ.

• Negative values are interpreted as a signal of population structure.

Differences in θ measures

• Have selective interpretations.

• Have demographic interpretations.

Inferring demographic change

• Demographic changes are changes in effective population size over time.

• Differences in θ summary statistics based on different population genetic measures can detect demographic changes.

• Distributions of allelic frequencies carry the signatures of demographic changes.

• Coalescent analysis of data will recover signals of demographic changes.

Mismatch distribution

• A frequency graph of pair-wise differences between alleles.

• It is usually multimodal in samples drawn from populations at demographic equilibrium (it reflects the highly stochastic shape of gene trees).

• It is usually unimodal in populations having passed through a recent demographic expansion (Rogers and Harpending, 1992; Hudson and Slatkin, 1991) or through a range expansion with high levels of migration between neighboring demes (Ray et al. 2003, Excoffier 2004).
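The observed mismatch distribution can be tabulated as sketched below: count the differences for every pair of aligned sequences and convert the counts to relative frequencies. Equal-length sequences are assumed and the function name is illustrative.

```python
from collections import Counter
from itertools import combinations

def mismatch_distribution(sequences):
    """Relative frequency of pairs differing at 0, 1, 2, ... sites."""
    counts = Counter(
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(sequences, 2)
    )
    if not counts:
        raise ValueError("need at least two sequences")
    num_pairs = sum(counts.values())
    return [counts.get(d, 0) / num_pairs for d in range(max(counts) + 1)]
```

The returned list is indexed by the number of differences, so it can be passed directly to a bar-plot routine to inspect whether the shape is unimodal or multimodal.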

[Figure: example of a multimodal mismatch distribution]

[Figure: example of a unimodal mismatch distribution]

Mismatch distribution

• The mismatch distribution is a graphic way of visualizing the signature of an expansion.

• If there is population expansion, then theoretically we can calculate the population size before expansion, the population size after expansion, and the time that the expansion happened.

Pure demographic expansion

• We assume that a population at equilibrium suddenly changed, τ generations ago, from a population size of N0 to a population size of N1.

• The probability of observing S segregating sites between two randomly chosen non-recombining haplotypes is given by
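The sketch below uses the form of this probability commonly quoted for the sudden-expansion model (for example in the Arlequin documentation): F_S(τ, θ0, θ1) = F_S(θ1) + exp(-τ(θ1+1)/θ1) Σ_{j=0..S} (τ^j / j!) [F_{S-j}(θ0) - F_{S-j}(θ1)], with the stationary term F_S(θ) = θ^S / (θ+1)^(S+1). That this is the equation the slide showed is an assumption, as is the parametrization in mutational units (θ0 = 2uN0, θ1 = 2uN1, τ = 2ut); the function names are illustrative.

```python
import math

def stationary_mismatch(s, theta):
    """P(two haplotypes differ at s sites) in a stationary population."""
    return (theta / (theta + 1.0)) ** s / (theta + 1.0)

def expansion_mismatch(s, theta0, theta1, tau):
    """Expected mismatch probability after a sudden change from theta0 to theta1."""
    decay = math.exp(-tau * (theta1 + 1.0) / theta1)
    correction = 0.0
    poisson_term = 1.0  # tau**j / j!, built up iteratively to avoid overflow
    for j in range(s + 1):
        if j > 0:
            poisson_term *= tau / j
        correction += poisson_term * (
            stationary_mismatch(s - j, theta0) - stationary_mismatch(s - j, theta1)
        )
    return stationary_mismatch(s, theta1) + decay * correction
```

Two quick sanity checks on the formula: with τ = 0 it reduces to the stationary distribution for θ0, and the probabilities sum to 1 over S.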

Pure demographic expansion

• In a simplified analysis θ1 is assumed to be ∞, i.e. no coalescent event since expansion.

• In this case:

• Where m and v are the mean and variance of the mismatch distribution.

• This simplifying assumption tends to underestimate the time to expansion, but is a fast solution.

• This model is implemented in DnaSP.
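A sketch of the simplified solution, assuming the moment estimators usually attributed to Rogers (1995): with θ1 = ∞ the mismatch distribution has mean m = θ0 + τ and variance v = θ0 + θ0² + τ, so θ0 = sqrt(v - m) and τ = m - θ0. Whether the slide showed exactly this pair of equations is an assumption, and the function name is illustrative.

```python
import math

def rogers_moment_estimates(mean_mismatch, var_mismatch):
    """Method-of-moments estimates (theta0, tau) assuming theta1 is infinite."""
    if var_mismatch < mean_mismatch:
        raise ValueError("variance below mean: the moment estimator is undefined")
    theta0 = math.sqrt(var_mismatch - mean_mismatch)
    tau = mean_mismatch - theta0
    return theta0, tau
```

For instance, m = 5 and v = 9 give θ0 = 2 and τ = 3 in mutational units.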

Pure demographic expansion

• Alternatively, all three variables θ1, θ0 and τ can be solved for simultaneously using a generalized non-linear least-squares approach.

• The objective is to simultaneously change all three variables such that the sum of squared deviations between the observed mismatch distribution and the expected Fs is minimized.

• This approach is more exact, but it is computationally intensive.

• This model is implemented in Arlequin.

Pure demographic expansion

• Using the model implemented in Arlequin, we can also estimate the confidence intervals around all three variables.

• This is done by a parametric bootstrap.

Parametric bootstrap

• Assume some model plus some set of parameters, and simulate a new dataset.

• Simulate N datasets.

• For each new simulated dataset, calculate a new set of values.

• Plot a frequency distribution of newly calculated values, and see where actual values or parameters are placed relative to values of parameters derived from simulated data.

• In this case we know the values of three variables θ1, θ0 and τ and using these we generate new datasets under the assumption of population expansion.
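The steps above can be sketched generically as follows. The simulator here is a deliberately simple toy (exponential draws with a known mean) standing in for a coalescent simulation under the expansion model; it and all function names are hypothetical placeholders, not part of any program mentioned in the slides.

```python
import random

def parametric_bootstrap(simulate, statistic, fitted_params, n_replicates, seed=0):
    """Simulate n_replicates datasets under the fitted model and return their statistics."""
    rng = random.Random(seed)
    return [statistic(simulate(fitted_params, rng)) for _ in range(n_replicates)]

def percentile_interval(values, level=0.95):
    """Percentile confidence interval from a list of bootstrap values."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * (len(ordered) - 1))]
    upper = ordered[int(round((1.0 - tail) * (len(ordered) - 1)))]
    return lower, upper

# Toy stand-in for a coalescent simulator: exponential draws with a known mean.
def simulate_toy(params, rng, sample_size=50):
    return [rng.expovariate(1.0 / params["mean"]) for _ in range(sample_size)]

def sample_mean(data):
    return sum(data) / len(data)

replicates = parametric_bootstrap(simulate_toy, sample_mean, {"mean": 2.0}, n_replicates=200)
low, high = percentile_interval(replicates)
```

In the mismatch setting, `simulate` would generate sequences under the fitted θ1, θ0 and τ, and `statistic` would re-estimate those parameters from each simulated dataset.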

Test of demographic expansion

• Using the parametric bootstrap, we can place confidence intervals on the three variables θ1, θ0 and τ.

• The model used in the parametric bootstrap assumes population expansion, and we have not yet tested if the data have a signature of population expansion.

• We use the parametric bootstrap.

• Statistic 1 - the sum of squared deviations (SSD) between the observed and the expected mismatch.

• Statistic 2 - the raggedness index of the observed distribution.

SSD statistic

• Using the parametric bootstrap, we simulate a new dataset under the assumption of a demographic expansion, with the values of θ1, θ0 and τ derived from the original data, and obtain a new mismatch distribution.

• We calculate the sum of squared deviations of the mismatch distribution for each bootstrapped dataset, and compare it to the sum of squared deviations of the actual dataset.

• The test statistic is
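A minimal sketch of the comparison, assuming the P value is taken as the proportion of simulated SSD values that are at least as large as the observed SSD (the convention I understand Arlequin to follow; treat it as an assumption). Function names are illustrative.

```python
def sum_squared_deviations(observed, expected):
    """SSD between two mismatch distributions, padding the shorter one with zeros."""
    length = max(len(observed), len(expected))
    obs = list(observed) + [0.0] * (length - len(observed))
    exp = list(expected) + [0.0] * (length - len(expected))
    return sum((o - e) ** 2 for o, e in zip(obs, exp))

def bootstrap_p_value(observed_stat, simulated_stats):
    """Proportion of simulated statistics greater than or equal to the observed one."""
    if not simulated_stats:
        raise ValueError("need at least one simulated statistic")
    return sum(s >= observed_stat for s in simulated_stats) / len(simulated_stats)
```

Under this convention a small P means the observed data fit the expansion model worse than most datasets simulated under it.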

Raggedness index statistic

• Using the parametric bootstrap, we simulate a new dataset under the assumption of a demographic expansion, with the values of θ1, θ0 and τ derived from the original data, and obtain a new mismatch distribution.

• We calculate a summary statistic r based on the maximum number of mutational differences (d) and the frequency of the allelic classes (x).

• The test statistic is the same as for SSD.
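The raggedness index is usually written, following Harpending (1994), as r = Σ_{i=1..d+1} (x_i - x_{i-1})², where x_i is the relative frequency of pairs with i differences and x_{d+1} = 0. The sketch below assumes that definition; the function name is illustrative.

```python
def raggedness_index(mismatch_freqs):
    """Harpending's r: sum of squared differences between adjacent mismatch classes."""
    padded = list(mismatch_freqs) + [0.0]  # x_{d+1} = 0 closes the distribution
    return sum((padded[i] - padded[i - 1]) ** 2 for i in range(1, len(padded)))

smooth = raggedness_index([0.1, 0.2, 0.4, 0.2, 0.1])
ragged = raggedness_index([0.4, 0.0, 0.3, 0.0, 0.3])
```

A smooth, unimodal distribution gives a lower r than a jagged, multimodal one with the same number of classes.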

Test of demographic expansion

• Populations that have undergone demographic expansions are expected to have a smaller sum of squared deviations and a smaller raggedness index in their mismatch distributions than non-expanded populations.

• Therefore, based on the test statistics, what values of P would be considered significant (i.e. support the hypothesis of a demographic expansion)?

Test of demographic expansion

• Maximum likelihood approaches.

– We assume some model of molecular evolution, and calculate the likelihood of our data under that model of evolution

– We estimate relevant parameters under this model

– Model of molecular evolution has to be known a priori

• Bayesian inference.

– We do not assume a particular model

– We divide our sampling throughout the duration of the coalescent and estimate relevant parameters

– Based on distribution of parameter estimates, we infer a process (model).

Test of demographic expansion

• Both maximum likelihood and Bayesian inference approaches allow the calculation of confidence intervals.

• Likelihoods and posterior probabilities are estimated with the Markov chain Monte Carlo (MCMC) algorithm – this is a kind of resampling algorithm.

• Likelihoods obtained under different models can be compared using standard model selection criteria – these include the hierarchical Likelihood Ratio Test (hLRT), the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).
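The criteria named above have simple closed forms, sketched here from their standard definitions: AIC = 2k - 2 ln L, BIC = k ln(n) - 2 ln L, and the likelihood ratio statistic 2(ln L_alt - ln L_null) for nested models. Argument names are illustrative.

```python
import math

def aic(log_likelihood, num_params):
    """Akaike Information Criterion; lower values indicate the preferred model."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_observations):
    """Bayesian Information Criterion; penalizes parameters more as data grow."""
    return num_params * math.log(num_observations) - 2 * log_likelihood

def likelihood_ratio_statistic(log_likelihood_null, log_likelihood_alt):
    """LRT statistic for nested models, compared against a chi-square distribution."""
    return 2 * (log_likelihood_alt - log_likelihood_null)
```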

Maximum likelihood methods

• ML assumes a model of sequence evolution.

• ML attempts to answer the question: What is the probability that I would observe these data (a multiple sequence alignment), given a particular model of evolution (a tree and a process)?

• ML uses a ‘model’. This is justifiable, since molecular sequence data can be shown to have arisen according to an evolutionary process.

Maximum Likelihood - goal

• To estimate the probability that we would observe a particular dataset, given a phylogenetic tree and some notion of how the evolutionary process worked over time.

Probability of the data given a tree, a substitution rate matrix and base frequencies:

$$
\begin{pmatrix} a & b & c & d \\ b & a & e & f \\ c & e & a & g \\ d & c & f & a \end{pmatrix}, \qquad \pi = [a, c, g, t]
$$

Bayesian inference methods

Demographic events

• Just as a population can undergo a demographic expansion, it can also undergo a demographic contraction.

• Severe demographic contractions are called bottlenecks.

Demographic declines

• Demographic declines should result in patterns opposite to demographic expansions.

• Three popular methodologies exist

– Heterozygosity method

– Allele number vs. allele range method

– Coalescent method

Demographic declines

• The heterozygosity method takes note of the fact that recently declined populations will show an excess of heterozygosity relative to the heterozygosity expected at mutation-drift equilibrium for the observed number of alleles.

• Why would this be true?

• Implemented in the program Bottleneck.

Demographic declines

• The allele number vs. allele range method takes note of the fact that recently declined populations will have fewer alleles than expected relative to allele range.

• Why would this be true?

• Implemented in the program M value.

Demographic declines

• M = k/(R+1)
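A minimal sketch of this ratio for a single locus, assuming k is the number of distinct alleles observed and R is the range of allele sizes measured in repeat units (the usual Garza and Williamson definitions; the slide does not spell them out). The function name is illustrative.

```python
def m_ratio(allele_sizes_in_repeats):
    """M = k / (R + 1) for one microsatellite locus."""
    alleles = set(allele_sizes_in_repeats)
    if not alleles:
        raise ValueError("need at least one allele")
    num_alleles = len(alleles)                   # k
    allele_range = max(alleles) - min(alleles)   # R, in repeat units
    return num_alleles / (allele_range + 1)
```

A locus with alleles of 10, 11, 12 and 13 repeats has M = 1, while losing the intermediate alleles (10, 11 and 15 repeats) lowers M to 0.5.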

Demographic declines

• The heterozygosity and allele number / allele range methods work only with microsatellite data.

• They model expected heterozygosity under three different models of microsatellite evolution.

– IAM – Infinite alleles model

– SMM – Stepwise mutation model

– TPM – Two phase model

Demographic declines

• IAM – Infinite alleles model – it assumes that every new mutation results in a new allele, and that there is no relationship between the newly generated allele and the parental allele.

Demographic declines

• SMM – Stepwise mutation model – it assumes that every new mutation results in a new allele, and that this new allele is either one step larger or one step smaller than the parental allele.

Demographic declines

• TPM – Two phase model – this model is a mix of the SMM and IAM models. Some percentage of the mutations are allowed to form according to the SMM model (~80%), the rest according to IAM. Some programs allow for the input of an average allele jump size.
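The three mutation models can be sketched as single mutation events on allele sizes counted in repeat units. This follows the description above, where the non-SMM fraction of the TPM mutates under IAM; some implementations instead draw multi-step changes for that fraction. The way a novel IAM allele is chosen here (a random unused size below a hypothetical `max_size`) is purely illustrative.

```python
import random

def mutate_smm(allele, rng):
    """SMM: the new allele is one repeat unit larger or smaller than its parent."""
    return allele + rng.choice((-1, 1))

def mutate_iam(existing_alleles, rng, max_size=10_000):
    """IAM: the new allele is a state not seen before, unrelated to the parent."""
    while True:
        candidate = rng.randrange(1, max_size)
        if candidate not in existing_alleles:
            return candidate

def mutate_tpm(allele, existing_alleles, rng, p_smm=0.8):
    """TPM as described above: SMM with probability p_smm, otherwise IAM."""
    if rng.random() < p_smm:
        return mutate_smm(allele, rng)
    return mutate_iam(existing_alleles, rng)

rng = random.Random(1)
alleles = {10, 11, 12}
```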

Demographic declines

• The method implemented in M value also assumes a priori knowledge of the θ of the population prior to the population decline.

• Calculating θ from the current genetic diversity would result in a conservative estimate of population decline.

Demographic declines

• Harpia harpyja example:

• The average value of M for the 24 microsatellite loci was 0.84, a value significantly lower than that obtained under simulation of a pre-bottleneck population size (p = 0.026 using the genetic parameter θ of 2.24).

• We derived θ from estimated census sizes of 10^4 to 10^5 harpy eagle individuals assuming that the effective number of individuals is equivalent to 1/10 the census size, and that microsatellite mutation rate (µ) estimates range from 2.5 x 10^-3 to 5.6 x 10^-4.
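The arithmetic behind this derivation can be checked with θ = 4Neµ and Ne taken as one tenth of the census size. The θ of 2.24 quoted above appears to correspond to the lower ends of both ranges (census size 10^4, µ = 5.6 x 10^-4); that pairing is an inference from the numbers, and the function name is illustrative.

```python
def theta_from_census(census_size, ne_to_census_ratio, mutation_rate, scale=4):
    """theta = scale * Ne * mu, with Ne = ne_to_census_ratio * census_size."""
    effective_size = ne_to_census_ratio * census_size
    return scale * effective_size * mutation_rate

theta_low = theta_from_census(1e4, 0.1, 5.6e-4)   # lower ends of both ranges
theta_high = theta_from_census(1e5, 0.1, 2.5e-3)  # upper ends of both ranges
```

Here `theta_low` comes out at about 2.24 and `theta_high` at about 100, so the value used in the test is the most conservative end of the range.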

Demographic declines

• Harpia harpyja example:

• When the parameter θ was estimated directly from the microsatellite data (θ = 1.50), the M value was not significant (p = 0.101).

• However, the θ calculated from the data itself is necessarily a lower bound estimate if H. harpyja shows any population structure, or if the θ does not represent the original population prior to reduction.

Demographic declines

• Different methods have different power to detect a bottleneck, and register a bottleneck event for different lengths of time.

• Heterozygosity methods are often more immediately sensitive, but they do not register a bottleneck for a very long time.

• Allele number / allele range methods tend to register bottlenecks that are equally or slightly less severe, but for a longer time into the past.

Demographic declines

• The coalescent method estimates parameters for the current and ancestral population sizes, and the time when the change in population size occurred.

• Implemented in the program MSvar.

Demographic declines

[Figure: two panels plotting Ne against generations (0 to 25000), with the time axis labelled present and past]

• Both populations experienced decreases in effective population sizes.

• Populations started decreasing 100-150 years ago.

• Reductions range from 25% to 60%.

Population size

• Effective population size – the size of an idealized population that would give the same value of some summary statistic as the population being studied.

Population size

• Inbreeding effective population size – The ideal number of individuals that are contributing to the reproductive population – can be calculated from pedigree information and, more commonly, from the coalescent properties of the observed gene tree.

Population size

• Variance effective population size – The ideal number of individuals that represents the sampling variance of the population – this can be based either on sampling across generations, or on the variance in gene frequencies observed in the data (i.e. on θ estimates).

Population size

• Different concepts of effective population size will result in different estimates of effective population size.

• The inbreeding effective population size is a backward-looking statistic, whereas the variance effective population size reflects recent demographic/population genetic processes influencing genetic systems.

Population size

• Large inbreeding effective population sizes and small variance effective population sizes are indicative of recently reduced genetic variation due to decreases in population size or habitat fragmentation (Gerber & Templeton, 1996).

• In contrast, with a rapid increase in population size, theory predicts a small inbreeding effective population size and large variance and eigenvalue effective sizes (Templeton, 1980).

Population size

• There is therefore no simple relationship between effective population sizes and census sizes, although it is often claimed that in stable populations at equilibrium there is a 1:10 relationship between inbreeding effective and census population sizes.