Estimating fitness landscapes John Pinney j.pinney@imperial.ac.uk.

Post on 12-Jan-2016


Estimating fitness landscapes

John Pinney, j.pinney@imperial.ac.uk

Genotype network

0 = ‘Wild type’

Single mutants: Δ1, Δ2, Δ3, Δ4, Δ5

Successive mutants: Δ1, Δ1Δ2, Δ1Δ2Δ3, Δ1Δ2Δ3Δ4, Δ1Δ2Δ3Δ4Δ5

Genotype network

+ Fitness values at every node

= Fitness landscape
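As a minimal sketch of this idea, the genotype network over the five mutations can be written as the set of all subsets of mutations, with an edge between genotypes that differ by one mutation. The function names and the fitness values below are illustrative assumptions, not data from the talk.

```python
from itertools import combinations

MUTATIONS = ("Δ1", "Δ2", "Δ3", "Δ4", "Δ5")

def all_genotypes(mutations):
    """Every subset of the mutations; the empty set is the wild type (0)."""
    return [frozenset(combo)
            for k in range(len(mutations) + 1)
            for combo in combinations(mutations, k)]

def neighbours(genotype, mutations):
    """Genotypes that differ from `genotype` by gaining or losing one mutation."""
    return [genotype ^ {m} for m in mutations]

wild_type = frozenset()
genotypes = all_genotypes(MUTATIONS)  # 2**5 = 32 nodes

# Placeholder fitness values (an assumption, not data): each mutation adds 0.1.
fitness = {g: 1.0 + 0.1 * len(g) for g in genotypes}
```

A dictionary from genotype to fitness is the simplest representation of a landscape; with n mutations it has 2**n entries, which is why most values will be unobserved in practice.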

With an accurate fitness landscape we could predict:

mutational trajectories e.g. under drug treatment.

rates of emergence of drug resistance.

optimal drug combinations to prevent emergence of drug resistance.
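One simple way to read "mutational trajectories" off a landscape is to list the paths that gain one mutation per step and increase fitness at every step. The sketch below assumes that definition of an accessible path; the two-mutation landscape and its fitness values are hypothetical.

```python
def accessible_paths(fitness, start, end):
    """All paths from start to end that add one mutation per step
    and strictly increase fitness at every step."""
    if start == end:
        return [[start]]
    paths = []
    for m in end - start:                      # mutations still to acquire
        nxt = start | {m}
        if fitness[nxt] > fitness[start]:      # only uphill steps are allowed
            for tail in accessible_paths(fitness, nxt, end):
                paths.append([start] + tail)
    return paths

wt = frozenset()
a, b, ab = frozenset({"Δ1"}), frozenset({"Δ2"}), frozenset({"Δ1", "Δ2"})

# Hypothetical fitness values under drug treatment (illustration only).
fitness_under_drug = {wt: 1.0, a: 1.2, b: 0.8, ab: 1.5}

paths = accessible_paths(fitness_under_drug, wt, ab)
```

Here only the path through Δ1 is accessible, because acquiring Δ2 first lowers fitness below the wild type.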

At best, fitness data for only relatively few genotypes will be available.

How can we estimate unobserved values?

How can we tell if these estimates are good enough for real applications of fitness landscapes?

How can we estimate unobserved values?

Specific mutations are expected to contribute to fitness in different ways

=> Machine learning based on mutations as features.

HIV-1 drug resistance database

http://hivdb.stanford.edu/

A great resource for exploring genotype-phenotype relationships.

Includes a large amount of sequence data from clinical and lab studies from the early 1990s onwards.

In vitro data

Viruses with known sequence are assayed to assess their ability to reproduce in vitro in the presence of various drugs.

Most of these isolates were obtained from patients who may have been untreated or on any number of drug regimens.

=> some biases in sequence coverage

Genotypes are described using mutations relative to a particular consensus sequence (e.g. subtype B)

Summary of PhenoSense results for a variety of protease inhibitors (PIs).

Machine learning from in vitro data

Using mutations relative to the consensus sequence as indicator variables, we can apply standard machine learning techniques to predict fitness under a given condition from the sequence.
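The indicator-variable encoding can be sketched as follows. The mutation names are common protease mutations used only as examples, and the `encode` helper and the isolate list are assumptions made for illustration.

```python
def encode(genotypes):
    """Turn lists of mutations into a 0/1 indicator matrix.

    Returns the sorted feature names and one row per genotype, where
    row[j] is 1 if the genotype carries features[j] and 0 otherwise.
    """
    features = sorted({m for g in genotypes for m in g})
    rows = [[1 if m in g else 0 for m in features] for g in genotypes]
    return features, rows

# Hypothetical isolates, each described by its mutations relative to consensus.
isolates = [["M46I", "L90M"], ["I84V"], [], ["M46I", "I84V", "L90M"]]

features, X = encode(isolates)
```

The empty list is the consensus sequence itself, which encodes as a row of zeros.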

Given the large number of uninformative features, LASSO and other techniques that include feature selection tend to do well.
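To show why LASSO acts as feature selection, here is a minimal coordinate-descent sketch in plain Python. It assumes the objective (1/2n)·||y − b0 − Xb||² + λ·||b||₁ with an unpenalised intercept b0; the data are made up so that only the first mutation affects the response. A real analysis would use an established library rather than this toy solver.

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for LASSO with an unpenalised intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = z = 0.0
            for xi, yi in zip(X, y):
                # Residual with feature j's own contribution added back in.
                resid = yi - intercept - dot(beta, xi) + beta[j] * xi[j]
                rho += xi[j] * resid
                z += xi[j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n) if z else 0.0
        intercept = sum(yi - dot(beta, xi) for xi, yi in zip(X, y)) / n
    return intercept, beta

# Hypothetical data: column 0 is informative, column 1 is not.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 2
y = [0.0, 0.0, 2.0, 2.0] * 2

intercept, beta = lasso(X, y, lam=0.1)
```

On this toy data the coefficient of the uninformative column is driven exactly to zero, while the informative coefficient is shrunk from 2 towards zero by the penalty.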

from Rhee et al. (2006)

using least-squares regression to obtain coefficients for the contribution of each mutation to resistance against a selection of PI drugs.

from Hinkley et al. (2011)

using generalised kernel ridge regression.

tested model using only main effects (ME) against model incorporating epistasis: inter-genic, intra-genic or both (MEEP)
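The difference between a main-effects model and one with epistasis can be illustrated by adding pairwise interaction columns to the indicator matrix. This is a sketch of the feature idea only, under the assumption of explicit product terms; it is not the kernel formulation used by Hinkley et al., and the helper name is hypothetical.

```python
from itertools import combinations

def add_pairwise_terms(features, rows):
    """Append one column per pair of mutations: 1 only if both are present."""
    pairs = list(combinations(range(len(features)), 2))
    names = features + [f"{features[i]}:{features[j]}" for i, j in pairs]
    expanded = [row + [row[i] * row[j] for i, j in pairs] for row in rows]
    return names, expanded

# Example mutation names and two hypothetical genotypes (main effects only).
features = ["I84V", "L90M", "M46I"]
rows = [[1, 1, 0], [1, 1, 1]]

names, expanded = add_pairwise_terms(features, rows)
```

With p mutations this adds p(p−1)/2 columns, so the feature count grows quickly and regularisation becomes more important.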

from Hinkley et al. (2011)

These authors found ~18% improvement in predictive power by including epistasis between mutations within the same gene – e.g. the HIV protease shown.

In vivo data

A drug resistance fitness landscape in vitro may not be the same as that experienced by the virus when exposed to the patient’s immune system.

Another approach is to learn fitness landscapes by comparing the sequences of drug-naïve viruses against those obtained from patients on a specific drug regimen.
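The simplest version of this comparison is to ask which mutations are more frequent in treated than in drug-naïve sequences. The sketch below does only that counting step; it is not the Bayesian network approach, and the sequences and helper names are hypothetical.

```python
from collections import Counter

def mutation_frequencies(sequences):
    """Fraction of sequences carrying each mutation."""
    counts = Counter(m for seq in sequences for m in set(seq))
    return {m: c / len(sequences) for m, c in counts.items()}

def enrichment(naive, treated):
    """Frequency in treated minus frequency in naive, per mutation."""
    f_naive = mutation_frequencies(naive)
    f_treated = mutation_frequencies(treated)
    return {m: f_treated.get(m, 0.0) - f_naive.get(m, 0.0)
            for m in set(f_naive) | set(f_treated)}

# Hypothetical sequences, each listed as mutations relative to consensus.
naive = [[], ["L63P"], [], ["L63P"]]
treated = [["L90M", "L63P"], ["L90M"], ["M46I", "L90M"], []]

diff = enrichment(naive, treated)
```

A positive value suggests selection for the mutation under treatment, but raw frequencies ignore shared ancestry between sequences and dependencies between mutations.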

Machine learning from in vivo data

Deforche et al. (2008) apply a Bayesian Network

Probability of a set of mutations (A1,A2,...,An)

Fitness of a set of mutations (A1,A2,...,An)

A phylogenetic guide tree is used to take sequence sampling bias into account

Predicting and validating mutational trajectories

Where next?