Measuring Genetic Diversity - UFSCarevolucao/TGE/Lect03.pdfmismatch distribution for each...

Demographic events

Measuring Genetic Diversity

• Theta = θ = 4Nµ = 4Nm = 4N(µ+m)

• For haploid markers θ = 2Nµ = 2Nm = 2N(µ+m)

• The all-important population genetic parameter.

• It is based on the number of alleles or the number of different nucleotides in a given sample.

• It quantifies the genetic diversity of a given population.

Theta (θ) Hom

• Estimated from the expected homozygosity (Zouros, 1979; Chakraborty and Weiss, 1991) in a population at equilibrium between drift and mutation.

• Sensitive to small sample and allele sizes

• For microsatellite data
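A minimal sketch of how θ can be backed out of an expected homozygosity. It assumes two textbook equilibrium relationships that the slide does not spell out: Hom = 1/(1+θ) under the infinite alleles model, and Hom = 1/√(1+2θ) under the stepwise mutation model often used for microsatellites. The function names are hypothetical, and software packages may apply additional bias corrections.

```python
def theta_hom_iam(hom):
    """theta from homozygosity, assuming Hom = 1 / (1 + theta) (infinite alleles)."""
    if not 0.0 < hom <= 1.0:
        raise ValueError("homozygosity must be in (0, 1]")
    return 1.0 / hom - 1.0


def theta_hom_smm(hom):
    """theta from homozygosity, assuming Hom = 1 / sqrt(1 + 2*theta) (stepwise)."""
    if not 0.0 < hom <= 1.0:
        raise ValueError("homozygosity must be in (0, 1]")
    return (1.0 / hom ** 2 - 1.0) / 2.0
```

For the same homozygosity the two models can give quite different θ values, which is one reason the choice of mutation model matters for microsatellite data.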

Theta (θ) S

• Estimated from the infinite-site equilibrium relationship (Watterson, 1975) between the number of segregating sites (S), the sample size (n) and θ for a sample of non-recombining DNA.
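The Watterson relationship can be sketched in a few lines. The estimator is θ_S = S / a_n with a_n = 1 + 1/2 + ... + 1/(n−1); the function name is an assumption made for this illustration.

```python
def watterson_theta(num_segregating_sites, sample_size):
    """Watterson's estimator: theta_S = S / sum_{i=1}^{n-1} 1/i."""
    if sample_size < 2:
        raise ValueError("need at least two sequences")
    a_n = sum(1.0 / i for i in range(1, sample_size))
    return num_segregating_sites / a_n
```

The value is per sequence (per locus); dividing by the sequence length gives a per-site estimate.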

Theta (θ) k

• Estimated from the infinite-allele equilibrium relationship (Ewens, 1972) between the expected number of alleles (k), the sample size (n) and θ.

• 95% confidence limits are calculated from an expression (not preserved in this transcript) that involves a Stirling number (the expansion coefficient of a factorial) and a falling factorial.
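A sketch of the θk point estimate, assuming the Ewens (1972) expectation E(k) = Σ θ/(θ+i) for i = 0 ... n−1. Since E(k) increases with θ, the observed number of alleles can be matched by bisection. The function names and the search bounds are assumptions of this sketch, and the confidence limits are not attempted here.

```python
def expected_alleles(theta, n):
    """Expected number of alleles in a sample of n genes (infinite-allele model)."""
    return sum(theta / (theta + i) for i in range(n))


def theta_k(k_observed, n, lo=1e-8, hi=1e6, tol=1e-10):
    """Find theta such that expected_alleles(theta, n) equals k_observed."""
    if not 1 < k_observed < n:
        raise ValueError("need 1 < k < n for a finite, positive estimate")
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if expected_alleles(mid, n) < k_observed:
            lo = mid  # too few alleles expected: theta must be larger
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2.0
```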

Theta (θ) π̂

• Estimated from the infinite-site equilibrium relationship (Tajima, 1983) between the mean number of pair-wise differences (π̂) and theta (θ).
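A small sketch of the quantity behind this estimator: the mean number of pair-wise differences across all pairs of aligned sequences. Under the infinite-site model this mean is itself the estimate of θ. The function name is hypothetical and the sequences are assumed to be aligned and of equal length.

```python
from itertools import combinations


def mean_pairwise_differences(sequences):
    """Average number of differing sites over all pairs of aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned (equal length)")
    pairs = list(combinations(sequences, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)
```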

Why so many θ measures

• Not all methods are suitable for all types of data.

• Ultimately all methods should result in the same estimates of theta.

• Differences in estimates can be interpreted as violations of assumptions, and each method is sensitive to different assumptions.

Tajima’s D

• Tajima’s (1989) D test quantifies the discordance between the estimate of theta from number of segregating sites and from average pair-wise sequence divergence.

• Negative values are interpreted as a signal of purifying selection or, alternatively, of demographic expansion.
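A sketch of the D statistic using the standard constants from Tajima (1989): the difference between π̂ and S/a1, scaled by its estimated standard deviation. The function and argument names are assumptions made for this illustration.

```python
import math


def tajimas_d(pi_hat, num_segregating_sites, n):
    """Tajima's D from mean pair-wise differences, segregating sites and sample size."""
    s = num_segregating_sites
    if n < 4 or s < 1:
        raise ValueError("need n >= 4 and at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Numerator: theta from pair-wise differences minus Watterson's theta.
    return (pi_hat - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

D is negative when there are more segregating sites than the pair-wise diversity would predict, that is, when rare variants are in excess.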

Fu’s Fs

• Fu’s (1997) Fs measures the probability of observing a certain number of haplotypes given a particular value of θ – the test looks at discordance in values of θ derived from number of haplotypes and average pair-wise sequence divergence.

• Negative values interpreted as a signal of purifying selection or alternately as demographic expansion.

Strobeck’s S

• Strobeck’s (1987) S measures the level of discordance in values of θ derived from number of haplotypes and average pair-wise sequence divergence.

• The expected number of alleles in a sample is an increasing function of the migration rates, whereas the expected average number of nucleotide differences is shown to be independent of the migration rates and equal to 4Nµ.

• Negative values are interpreted as a signal of population structure.

Differences in θ measures

• Have selective interpretations.

• Have demographic interpretations.

Inferring demographic change

• Demographic changes are changes in effective population size over time.

• Differences in θ summary statistics based on different population genetic measures will detect demographic changes.

• Distributions of allelic frequencies are the signatures of demographic changes.

• Coalescent analysis of data will recover signals of demographic changes.

Mismatch distribution

• A frequency graph of pair-wise differences between alleles.

• It is usually multimodal in samples drawn from populations at demographic equilibrium (it reflects the highly stochastic shape of gene trees)

• It is usually unimodal in populations having passed through a recent demographic expansion (Rogers and Harpending, 1992; Hudson and Slatkin, 1991) or through a range expansion with high levels of migration between neighboring demes (Ray et al. 2003, Excoffier 2004).


[Figure: example mismatch distributions, one multimodal and one unimodal.]


• The mismatch distribution is a graphic way of visualizing the signature of an expansion.

• If there is population expansion, then theoretically we can calculate the population size before expansion, the population size after expansion, and the time that the expansion happened.
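The frequency graph described above can be tabulated directly from an alignment. This is a minimal sketch with a hypothetical function name; it returns the relative frequency of each number of pair-wise differences, from 0 up to the largest value observed.

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(sequences):
    """Relative frequency of pairs differing at 0, 1, 2, ... sites."""
    counts = Counter(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(sequences, 2)
    )
    total = sum(counts.values())
    return [counts.get(d, 0) / total for d in range(max(counts) + 1)]
```

Plotting the returned list against its index gives the mismatch graph; one smooth peak suggests expansion, several peaks suggest a population at equilibrium.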

Pure demographic expansion

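• We assume that a population at equilibrium expanded suddenly from a population size of N0 to N1, τ units of mutational time ago (τ = 2ut, where t is the number of generations since the expansion and u is the mutation rate of the whole sequence).

• The probability of observing S differences (segregating sites) between two randomly chosen non-recombining haplotypes is given by

\[
F_S(\tau, \theta_0, \theta_1) = F_S(\theta_1) + \exp\!\left(-\tau\,\frac{\theta_1 + 1}{\theta_1}\right) \sum_{j=0}^{S} \frac{\tau^{j}}{j!} \left[ F_{S-j}(\theta_0) - F_{S-j}(\theta_1) \right]
\]

where \( F_S(\theta) = \theta^{S} / (\theta + 1)^{S+1} \) is the corresponding probability in a stationary population, θ0 = 2uN0 and θ1 = 2uN1.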
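The expected mismatch distribution under a sudden expansion can be evaluated numerically. The sketch below assumes the Rogers and Harpending (1992) form, F_S(τ, θ0, θ1) = F_S(θ1) + exp(−τ(θ1+1)/θ1) Σ_j (τ^j / j!) [F_{S−j}(θ0) − F_{S−j}(θ1)], with F_S(θ) = θ^S/(θ+1)^(S+1) for a stationary population. The function names are hypothetical.

```python
import math


def stationary_mismatch(s, theta):
    """P(two haplotypes differ at s sites) in a stationary population."""
    return (theta / (theta + 1.0)) ** s / (theta + 1.0)


def expansion_mismatch(s, tau, theta0, theta1):
    """P(s differences) tau mutational time units after a sudden size change."""
    decay = math.exp(-tau * (theta1 + 1.0) / theta1)
    correction = 0.0
    poisson_term = 1.0  # tau**j / j!, built up iteratively to avoid overflow
    for j in range(s + 1):
        correction += poisson_term * (
            stationary_mismatch(s - j, theta0) - stationary_mismatch(s - j, theta1)
        )
        poisson_term *= tau / (j + 1)
    return stationary_mismatch(s, theta1) + decay * correction
```

With τ = 0 the function returns the pre-expansion stationary distribution, and as τ grows it approaches the stationary distribution for θ1.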

Pure demographic expansion

• In a simplified analysis θ1 is assumed to be ∞, i.e. no coalescent event since the expansion.

• In this case:

\[
\hat{\theta}_0 = \sqrt{v - m}, \qquad \hat{\tau} = m - \hat{\theta}_0
\]

• Where m and v are the mean and variance of the mismatch distribution.

• This simplifying assumption tends to underestimate the time to expansion, but is a fast solution.

• This model is implemented in DnaSP.
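A sketch of the simplified (θ1 = ∞) solution, assuming the method-of-moments estimators θ0 = √(v − m) and τ = m − θ0, where m and v are the mean and variance of the mismatch distribution. The function name is hypothetical, and treating v as the frequency-weighted variance is an assumption of this sketch rather than a statement about how DnaSP computes it.

```python
import math


def moment_estimates(mismatch_freqs):
    """Return (theta0, tau) from relative frequencies of 0, 1, 2, ... differences."""
    m = sum(d * f for d, f in enumerate(mismatch_freqs))
    v = sum((d - m) ** 2 * f for d, f in enumerate(mismatch_freqs))
    if v < m:
        # The moment estimator has no real solution in this case.
        raise ValueError("variance smaller than mean: estimator undefined")
    theta0 = math.sqrt(v - m)
    tau = m - theta0
    return theta0, tau
```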


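• Alternatively, all three variables θ1, θ0 and τ can be solved for simultaneously using a generalized non-linear least-squares approach.

• The objective is to change all three variables simultaneously so that the expected distribution Fs fits the observed mismatch distribution as closely as possible, i.e. the sum of squared deviations between them is minimized.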

• This approach is more exact, but it is computationally intensive.

• This model is implemented in Arlequin.


• Using the model implemented in Arlequin, we can also estimate the confidence intervals around all three variables.

• This is done by a parametric bootstrap.

Parametric bootstrap

• Assume some model plus some set of parameters, and simulate a new dataset.

• Simulate N datasets.

• For each new simulated dataset, calculate a new set of values.

• Plot a frequency distribution of the newly calculated values, and see where the actual values or parameters are placed relative to the values of parameters derived from simulated data.

• In this case we know the values of three variables θ1, θ0 and τ and using these we generate new datasets under the assumption of population expansion.
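The bootstrap steps above can be sketched with a deliberately simple stand-in model. Here the "model" is an exponential distribution with one rate parameter, not the coalescent expansion model; the point is only the loop of fit, simulate, refit, and read off percentiles. All names and the toy model are assumptions of this sketch.

```python
import random


def fit_rate(data):
    """Maximum likelihood estimate of an exponential rate: n / sum(x)."""
    return len(data) / sum(data)


def parametric_bootstrap(data, num_replicates=1000, seed=1):
    """Return (estimate, lower, upper) with a 95% percentile interval."""
    rng = random.Random(seed)
    estimate = fit_rate(data)
    replicates = []
    for _ in range(num_replicates):
        # Simulate a dataset of the same size under the fitted model.
        simulated = [rng.expovariate(estimate) for _ in data]
        replicates.append(fit_rate(simulated))
    replicates.sort()
    lower = replicates[int(0.025 * num_replicates)]
    upper = replicates[int(0.975 * num_replicates) - 1]
    return estimate, lower, upper
```

In the mismatch setting the simulation step would instead generate sequence data under the expansion model with the estimated θ1, θ0 and τ, and the refit step would re-estimate those three values.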

Test of demographic expansion

• Using the parametric bootstrap, we can place confidence intervals on the three variables θ1, θ0 and τ.

• The model used in the parametric bootstrap assumes population expansion, and we have not yet tested if the data have a signature of population expansion.

• We use the parametric bootstrap.

• Statistic 1 - the sum of square deviations (SSD) between the observed and the expected mismatch.

• Statistic 2 - the raggedness index of the observed distribution.

SSD statistic

• Using the parametric bootstrap (we simulate a new dataset under the assumption of a demographic expansion, and some values of θ1, θ0 and τ derived from the original data), we get a new mismatch distribution.

• We calculate a sum of square deviations of the mismatch distribution for each bootstrapped dataset, and compare it to the sum of squared deviations of the actual dataset.

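• The test statistic is P = (number of simulated SSD values ≥ observed SSD) / B, where B is the number of simulated datasets.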
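A sketch of the two calculations involved. It assumes that the P value is the proportion of simulated datasets whose SSD is at least as large as the observed SSD; the function names are hypothetical, and shorter distributions are padded with zeros so that both have the same number of classes.

```python
def sum_squared_deviations(observed, expected):
    """SSD between two mismatch distributions given as frequency lists."""
    length = max(len(observed), len(expected))
    obs = list(observed) + [0.0] * (length - len(observed))
    exp = list(expected) + [0.0] * (length - len(expected))
    return sum((o - e) ** 2 for o, e in zip(obs, exp))


def bootstrap_p_value(observed_stat, simulated_stats):
    """Proportion of simulated statistics >= the observed statistic."""
    if not simulated_stats:
        raise ValueError("need at least one simulated statistic")
    return sum(s >= observed_stat for s in simulated_stats) / len(simulated_stats)
```

Under this definition a small P means the observed data fit the expansion model worse than almost all datasets simulated under that model.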

Raggedness index statistic

• Using the parametric bootstrap (we simulate a new dataset under the assumption of a demographic expansion, and some values of θ1, θ0 and τ derived from the original data), we get a new mismatch distribution.

• We calculate a summary statistic r based on the maximum number of mutational differences (d) and the frequency of the allelic classes (x).

• The test statistic is the same as for SSD.
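A sketch of the raggedness index, assuming Harpending's definition r = Σ (x_i − x_{i−1})² for i = 1 ... d+1, where x_i is the relative frequency of pairs with i differences and x_{d+1} = 0. The function name is hypothetical.

```python
def raggedness_index(mismatch_freqs):
    """Sum of squared differences between neighbouring mismatch classes."""
    padded = list(mismatch_freqs) + [0.0]  # x_{d+1} = 0 closes the distribution
    return sum((padded[i] - padded[i - 1]) ** 2 for i in range(1, len(padded)))
```

A smooth, single-peaked distribution gives a small r, while a jagged, multi-peaked one gives a large r.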


• Populations that have undergone demographic expansions are expected to have a smaller sum of squared deviations and a smaller raggedness index in their mismatch distributions than non-expanded populations.

• Therefore, based on the test statistics, what values of P would be considered significant (i.e. support the hypothesis of a demographic expansion)?


[Figure: example mismatch distributions, one multimodal and one unimodal.]

Test of demographic expansion

• Maximum likelihood approaches.

– We assume some model of molecular evolution, and calculate the likelihood of our data under that model of evolution

– We estimate relevant parameters under this model

– Model of molecular evolution has to be known a priori

• Bayesian inference.

– We do not assume a particular model

– We divide our sampling throughout the duration of the coalescent and estimate relevant parameters

– Based on the distribution of parameter estimates, we infer a process (model).


• Both maximum likelihood and Bayesian inference approaches allow the calculation of confidence intervals.

• Likelihoods and posterior probabilities are solved through the Markov Chain Monte Carlo (MCMC) algorithm – this is a kind of resampling algorithm.

• Likelihoods obtained under different models can be compared using standard model selection criteria – these include the hierarchical Likelihood Ratio Test (hLRT), the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).
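The model selection criteria named above reduce to short formulas once each model's maximized log-likelihood is known. This sketch uses the standard definitions AIC = 2k − 2 lnL, BIC = k ln(n) − 2 lnL and the likelihood ratio statistic 2(lnL_alt − lnL_null); the function names are assumptions made for this illustration.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike Information Criterion (lower is better)."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood, num_params, num_observations):
    """Bayesian Information Criterion (lower is better)."""
    return num_params * math.log(num_observations) - 2 * log_likelihood


def lrt_statistic(log_likelihood_null, log_likelihood_alt):
    """Likelihood ratio statistic for nested models."""
    return 2 * (log_likelihood_alt - log_likelihood_null)
```

For nested models the likelihood ratio statistic is usually compared with a chi-square distribution whose degrees of freedom equal the difference in the number of parameters.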

Maximum likelihood methods

• ML assumes a model of sequence evolution.

• ML attempts to answer the question: What is the probability that I would observe these data (a multiple sequence alignment), given a particular model of evolution (a tree and a process)?

• ML uses a ‘model’. This is justifiable, since molecular sequence data can be shown to have arisen according to an evolutionary process.

Maximum Likelihood - goal

• To estimate the probability that we would observe a particular dataset, given a phylogenetic tree and some notion of how the evolutionary process worked over time.

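Probability of the data (image not preserved) given the model:

\[
\left\{ \begin{matrix} a & b & c & d \\ b & a & e & f \\ c & e & a & g \\ d & c & f & a \end{matrix} \right\}, \qquad \pi = [a, c, g, t]
\]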

Bayesian inference methods

Demographic events

• Just as a population can undergo a demographic expansion, it can also undergo a demographic contraction.

• Severe demographic contractions are called bottlenecks.

Demographic declines

• Demographic declines should result in patterns opposite to demographic expansions.

• Three popular methodologies exist

– Heterozygosity method

– Allele number vs. allele range method

– Coalescent method


• The heterozygosity method takes note of the fact that recently declined populations will have an excess of observed heterozygosity relative to the heterozygosity expected at mutation-drift equilibrium for the observed number of alleles.

• Why would this be true?

• Implemented in the program Bottleneck.


• The allele number vs. allele range method takes note of the fact that recently declined populations will have fewer alleles than expected relative to allele range.

• Why would this be true?

• Implemented in the program M value.


• M = k/(R+1), where k is the number of alleles and R is the range in allele size, measured in repeat units.
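A sketch of the M ratio for a single microsatellite locus, taking k as the number of distinct alleles and R as the allele size range in repeat units. The function name, the input format (allele sizes in base pairs) and the default repeat length of 2 are assumptions of this sketch.

```python
def garza_williamson_m(allele_sizes, repeat_length=2):
    """M = k / (R + 1) for one locus, from observed allele sizes in base pairs."""
    distinct = sorted(set(allele_sizes))
    k = len(distinct)
    # Range of allele sizes expressed in number of repeat units.
    size_range = (distinct[-1] - distinct[0]) // repeat_length
    return k / (size_range + 1)
```

When every size class between the smallest and largest allele is occupied, M equals 1. Gaps in the size distribution, which are expected after alleles are lost in a decline, pull M below 1.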


• The heterozygosity and allele number / allele range methods work only with microsatellite data.

• They model expected heterozygosity under three different models of microsatellite evolution.

– IAM – Infinite alleles model

– SMM – Stepwise mutation model

– TPM – Two phase model


• IAM – Infinite alleles model – it assumes that every new mutation results in a new allele, and that there is no relationship between the newly generated allele and the parental allele.


• SMM – Stepwise mutation model – it assumes that every new mutation results in a new allele, and that this new allele is either one step larger or one step smaller than the parental allele.


• TPM – Two phase model – this model is a mix of the SMM and IAM models. Some percentage of the mutations are allowed to form according to the SMM model (~80%), the rest according to IAM. Some programs allow for the input of an average allele jump size.


• The method implemented in M value also assumes a priori knowledge of the θ of the population before the population decline.

• Calculating θ from the current genetic diversity would result in a conservative estimate of population decline.

Demographic declines

• Harpia harpyja example:

• The average value of M for the 24 microsatellite loci was 0.84, a value significantly lower than that obtained under simulation of a pre-bottleneck population size (p = 0.026 using the genetic parameter θ of 2.24).

• We derived θ from estimated census sizes of 10⁴ to 10⁵ harpy eagle individuals assuming that the effective number of individuals is equivalent to 1/10 the census size, and that microsatellite mutation rate (µ) estimates range from 2.5 × 10⁻³ to 5.6 × 10⁻⁴.
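The arithmetic behind the θ of 2.24 can be checked with θ = 4Neµ. The pairing used in the test is my reading of the numbers, not something stated in the slides: a census size of 10⁴ (so Ne = 10³ at the 1/10 ratio) with µ = 5.6 × 10⁻⁴ gives 4 × 1000 × 0.00056 = 2.24. The function name and default arguments are hypothetical.

```python
def theta_from_census(census_size, mutation_rate, ne_to_census_ratio=0.1):
    """theta = 4 * Ne * mu for a diploid marker, with Ne taken as a fraction of N."""
    effective_size = census_size * ne_to_census_ratio
    return 4 * effective_size * mutation_rate
```

Using the larger census size or the larger mutation rate would give a larger pre-bottleneck θ than 2.24.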


• Harpia harpyja example:

• When the parameter θ was estimated directly from the microsatellite data (θ = 1.50), the M value was not significant (p = 0.101).

• However, the θ calculated from the data itself is necessarily a lower bound estimate if H. harpyja shows any population structure, or if the θ does not represent original population prior to reduction.


• Different methods have different power to detect a bottleneck, and they register a bottleneck event for different amounts of time.

• Heterozygosity methods are often more immediately sensitive, but they do not register a bottleneck for a very long time.

• Allele number / allele range methods tend to register bottlenecks of equal or slightly lesser severity, but for a longer time into the past.


• The coalescent method estimates parameters for current and ancestral population size, and the time when the change in population size occurred.

• Implemented in the program MSvar.


[Figure: two panels plotting Ne (vertical axis, 0 to 8) against generations (horizontal axis, 0 to 25000), with the time axis labelled "present" and "past".]

• Both populations experienced decreases in effective population sizes.

• Populations started decreasing 100-150 years ago.

• Reductions range from 25% to 60%.

Population size

• Effective population size – the size of an idealized population that would show the same value of a given population genetic summary statistic as the population under study.

Population size

• Inbreeding effective population size – The ideal number of individuals that are contributing to the reproductive population – can be calculated from pedigree information and more commonly from the coalescent properties of the observed gene tree.

Population size

• Variance effective population size – The ideal number of individuals that represents the sampling variance of the population – this can either be sampling across generations, or based on the variance in gene frequencies observed in the data (that is, based on θ estimates).

Population size

• Different concepts of effective population sizes will result in different estimates of effective population sizes.

• The inbreeding effective population size is a backward looking statistic whereas the variance effective population sizes reflect recent demographic/population genetic processes influencing genetic systems.

Population size

• Large inbreeding effective population sizes and small variance effective population sizes are indicative of recently reduced genetic variation due to decreases in population size or habitat fragmentation (Gerber & Templeton, 1996).

• In contrast, with a rapid increase in population size, theory predicts a small inbreeding effective population size and large variance and eigenvalue effective sizes (Templeton, 1980).

Population size

• There is therefore no simple relationship between effective population sizes and census sizes, although it is often claimed that in stable populations at equilibrium there is a 1:10 relationship between inbreeding and census population sizes.
