SIV: A synergistic approach to the analysis of high-throughput screening data

Andrew Leach, Francis Atkinson, Gianpaolo Bravi, Darren Green, Mike Hann, Gavin Harper, Andy Whittington

GlaxoSmithKline Research and Development, Stevenage

Typical HTS flowchart

[Flowchart: Library design, compound acquisition etc. → HTS (single-point percentage inhibition) on liquid samples → select a subset of representatives → IC50 on liquid sample → IC50 and QC on solid sample → H2L Chemistry etc.; SIV is applied at two points in this flow]

Assay Data Analysis

❚ Historically a cutoff was used to define an “active” sample
❙ often chosen according to the capacity of the next screening iteration
❚ Require a measure from the screeners of what is “active”
❙ “active” = significantly different from the negative control values (n=16 for 384-well plates) — see the sketch below
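As a hedged illustration of that criterion (not from the original slides), the sketch below flags a well as “active” when its single-point % inhibition sits well above the plate's n=16 negative-control wells; the three-sigma threshold and the function names are assumptions.

```python
import numpy as np

def is_active(percent_inhibition, neg_control_values, n_sigma=3.0):
    """Flag a well as 'active' when its % inhibition lies more than n_sigma
    standard deviations above the plate's negative controls (n=16 wells on a
    384-well plate).  The 3-sigma default is an illustrative assumption."""
    neg = np.asarray(neg_control_values, dtype=float)
    threshold = neg.mean() + n_sigma * neg.std(ddof=1)
    return percent_inhibition > threshold

# Toy example: 16 negative-control readings and one test well
controls = np.random.normal(loc=0.0, scale=5.0, size=16)
print(is_active(28.0, controls))
```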

HTS data analysis: objective

❚ Mine HTS data to discover less potent but more attractive start points for medicinal chemistry
❙ Include “grey data” and “inactives”
❚ Discover low potency series where application of a normal cutoff value would find none
❚ Suggest the best samples to progress when there are too many primary positives
❚ Develop models for:
❙ Library design
❙ Compound acquisition & selection - iterative screening

HTS data analysis: schematic

[Schematic plot of Activity versus Structural Descriptor, showing a chemically tractable series, a singleton, a chemically intractable series, and the traditional “hit cutoff” line]

Statistical methods

❚ Many, validated, methods for Quantitative Structure Activity Relationships (QSAR)
❚ Work best with:
❙ small data sets (10s or 100s of points)
❙ congeneric series
❙ high quality assay data (IC50s)
❙ chemists who know the SAR
❚ Typical HTS data is characterised by:
❙ high volume
❙ poor quality (noise, false positives, false negatives)
❙ diverse structural classes
❙ multiple binding modes

Why SIV? Selection by Interactive Visualisation

❚ Requirements for multiple lead series
❚ Many high potency “hits” have undesirable chemical features
❚ “Black box” computational methods are not able to unambiguously assess chemical tractability
➔ Expert medicinal chemist knowledge enabled by informatics tools
➔ Incorporate target-specific knowledge

What is SIV?

❚ A combination of computational methods, with the combined results visualised to aid sample selection

❚ The data is normally filtered for reactive/undesirable species before analysis

❚ Visualisation is usually through Spotfire
❚ Our experience is that no single method works all of the time, therefore it is normal to select several
❙ e.g. clustering (various flavours), 3D pharmacophores, docking, Kernel, SCAM, GaP, similarity analysis etc.

❚ Some techniques just look at the “actives” (though there may be many thousands of these!); others use all of the data

Different Methods Find Different Things

[Venn-style comparison of the compound sets selected by Kernel Discrimination and SCAM: 387/6786 (5.7%), 250/6786 (3.7%), 435/3214 (13.5%), 1050/79651 (1.3%)]

Some SIV techniques

[Diagram grouping SIV techniques under CLUSTERING, PHYSICAL PROPERTIES, PREDICTIVE MODELS and OTHER “GOODNESS” CRITERIA: Support, QuikClus, Similarity, clogp, cmr, MW, Bits_On, Flex_Index, N_Donors, Andrews BE, SCAM, Typicality, Novelty, Acceptance, Fierce Filters, Brian’s Baddies, Neural Networks]

Typical SIV application

[Flowchart: HTS data (“actives”, ~20K) → analysed by multiple methods (Support, similarity/clustering, pharmacophores, 3D pharmacophores, filters, docking, sub-structures) → chemist visualisation → ~5K samples for the next screening iteration]

Inspecting every molecule in isolation is not feasible

[Pictures courtesy of cnn.com and cbsnet.com: Florida election officials hold ballots up to the light during a manual recount; Volusia County Judge Michael McDermott, right, takes a look at a ballot with an election observer]

Filters

❚ “Core set” of 2D filters
❙ reactive/undesirable moieties
❚ Known problematic compounds
❙ fluorophores etc.
❚ Properties/counts
❙ c.f. Lipinski
❙ use properties meaningful to a chemist
❙ supplement generic filters with those that are specific to the project (an illustrative sketch follows below)
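A minimal sketch of how such 2D filters might be implemented with RDKit; the SMARTS patterns and the MW/clogP limits are illustrative assumptions, not the actual “core set” used at GSK.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Illustrative (not GSK's) reactive/undesirable moieties expressed as SMARTS
REACTIVE_SMARTS = {
    "acyl_halide": "[CX3](=O)[F,Cl,Br,I]",
    "aldehyde": "[CX3H1](=O)[#6]",
    "michael_acceptor": "[CX3]=[CX3][CX3]=[OX1]",
}
REACTIVE_QUERIES = {name: Chem.MolFromSmarts(s) for name, s in REACTIVE_SMARTS.items()}

def passes_filters(smiles, max_mw=500.0, max_clogp=5.0):
    """Return (True, []) if the molecule passes, otherwise (False, reasons)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False, ["unparseable SMILES"]
    reasons = [name for name, q in REACTIVE_QUERIES.items() if mol.HasSubstructMatch(q)]
    # Simple property limits in the spirit of Lipinski-style counts
    if Descriptors.MolWt(mol) > max_mw:
        reasons.append("MW too high")
    if Descriptors.MolLogP(mol) > max_clogp:
        reasons.append("clogP too high")
    return len(reasons) == 0, reasons

print(passes_filters("CC(=O)Cl"))       # acyl chloride -> rejected
print(passes_filters("OCCc1ccccc1"))    # phenethyl alcohol -> passes
```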

“Support”

❚ Gain confidence that high activity is not due to chance
❚ Provides a “representative” molecule (a simple form of clustering)
❚ Identify clear singletons
❚ Based on nearest-neighbour similarity (see the sketch below)
❙ simplest is the 2D Daylight Tanimoto, but there are others
❚ Similarity to compounds that have been accepted in the past
❙ continue to enhance the overall productivity of the procedure
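A minimal sketch of the nearest-neighbour “support” calculation, assuming RDKit; Morgan fingerprints stand in for the Daylight fingerprints mentioned above, and the example SMILES are invented.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def support_scores(smiles_list, radius=2, n_bits=2048):
    """For each molecule, the Tanimoto similarity to its nearest neighbour in
    the same list ('support').  Low values flag likely singletons."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits) for m in mols]
    scores = []
    for i, fp in enumerate(fps):
        others = fps[:i] + fps[i + 1:]
        sims = DataStructs.BulkTanimotoSimilarity(fp, others)
        scores.append(max(sims) if sims else 0.0)
    return scores

actives = ["c1ccccc1O", "c1ccc(O)cc1C", "CCN(CC)CC", "O=C(O)c1ccccc1"]
for smi, s in zip(actives, support_scores(actives)):
    print(f"{smi}\tnearest-neighbour Tanimoto = {s:.2f}")
```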

Support example: singletons

❚ Extract from various databases all compounds with activity keywords related to anti-inflammation

❚ Which compounds have the lowest near-neighbour similarity?

[Structures of the lowest near-neighbour similarity compounds: largely organometallic and inorganic species (Pt, Mn, Sn and Ti complexes) and other structural singletons]

Cluster analysis with activity and variance

[Cluster plot annotations: “High activity, low s.d.” and “High activity, high s.d. → SAR?”]

Processing the results from 3D pharmacophore analysis

❚ May generate large hitlists from 3D searches, especially those that are “generic”/“promiscuous” in character
❚ Want to group together molecules based on the key pharmacophoric features/substructure responsible
❙ Identify the features within each hit molecule that match the pharmacophore points
❙ Clip away the remainder of the structure to give a core (a clipping sketch follows the example structures below)
❘ keep rings
❙ Order/cluster compounds according to core structure and visualise

[Example structures: p38/SK203580 and p38/SK220025]
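A minimal sketch of the clipping step, assuming RDKit and that the pharmacophore search reports which atom indices matched; the “keep rings” rule is approximated by expanding the kept set to any ring that intersects it. The molecule and matched indices are invented for illustration.

```python
from rdkit import Chem

def clip_to_core(mol, matched_atom_idx):
    """Keep the pharmacophore-matched atoms plus any ring containing one of
    them ('keep rings'), and return the clipped core as fragment SMILES."""
    keep = set(matched_atom_idx)
    for ring in mol.GetRingInfo().AtomRings():
        if keep.intersection(ring):
            keep.update(ring)
    return Chem.MolFragmentToSmiles(mol, atomsToUse=sorted(keep))

# Invented example: pretend atoms 2, 6, 7 and 8 (ring carbon, carbonyl C,
# carbonyl O, amide N) matched the pharmacophore
mol = Chem.MolFromSmiles("c1ccccc1C(=O)Nc1ccc(CCO)cc1")
print(clip_to_core(mol, matched_atom_idx=[2, 6, 7, 8]))
```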

Introducing “chemical restrictions” into the clipping algorithm

❚ Clipping algorithm can sometimes be too fierce - may want to retain certain chemical details (e.g. synthon points)
❚ Extended algorithm uses rules to determine which functionality can be clipped and which cannot:

[Example structures with X and Y attachment points; the highlighted NH and O atoms represent the pharmacophore-matching regions]

HTS Analysis techniques that use all the data

❚ SCAM
❙ recursive partitioning
❚ GaP
❙ pharmacophore-based methods that compare active vs. inactive
❚ Kernel Discrimination

Caution may be required when developing quantitative models from HTS data

[Scatter plot: pIC50 (4-10) versus single-point % inhibition (-40 to +130)]

HTS analysis using binary kernel discrimination

❚ Kernel density estimate of parent distribution at x is:

$$\hat{p}(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} K_\lambda(\mathbf{x}, \mathbf{x}_i)$$

❙ K_λ is the kernel (density) function; λ is a smoothing parameter that affects the range of influence of each point x_i in the sample

❚ Aitchison and Aitkin form of K_λ for binary data:

$$K_\lambda(\mathbf{x}, \mathbf{x}_i) = \lambda^{\,M-d}\,(1-\lambda)^{\,d}$$

❙ x, x_i are vectors of length M differing at d positions
❙ λ = 0.5 ⇒ uniform density; λ → 1 ⇒ less smooth

[Illustrative plots of density versus descriptor for different values of λ]
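A small NumPy sketch of the two formulas above; the 8-bit fingerprints are toy data.

```python
import numpy as np

def binary_kernel(x, xi, lam):
    """Aitchison & Aitkin kernel for binary vectors of length M:
    K_lambda(x, x_i) = lam**(M - d) * (1 - lam)**d, where d is the number of
    positions at which x and x_i differ."""
    x, xi = np.asarray(x), np.asarray(xi)
    M = x.size
    d = int(np.sum(x != xi))
    return lam ** (M - d) * (1.0 - lam) ** d

def density_estimate(x, sample, lam):
    """Kernel density estimate: p_hat(x) = (1/n) * sum_i K_lambda(x, x_i)."""
    return float(np.mean([binary_kernel(x, xi, lam) for xi in sample]))

# Toy 8-bit fingerprints
sample = np.array([[1, 0, 1, 1, 0, 0, 1, 0],
                   [1, 0, 1, 0, 0, 0, 1, 0],
                   [0, 1, 0, 1, 1, 0, 0, 1]])
query = np.array([1, 0, 1, 1, 0, 0, 1, 1])
print(density_estimate(query, sample, lam=0.75))
```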

Choosing the smoothing parameter

❚ If π_A, π_I are the probabilities that a randomly selected molecule is active or inactive, and p(x|A), p(x|I) are the probabilities that a molecule with descriptor vector x occurs when selected at random from the active or inactive populations
❚ Then the estimated probability that a compound with descriptor x is active is:

$$\hat{p}(A\mid\mathbf{x}) = \frac{\pi_A\,\frac{1}{n_A}\sum_{i\in A} K_\lambda(\mathbf{x},\mathbf{x}_i)}{\pi_A\,\frac{1}{n_A}\sum_{i\in A} K_\lambda(\mathbf{x},\mathbf{x}_i) \;+\; \pi_I\,\frac{1}{n_I}\sum_{i\in I} K_\lambda(\mathbf{x},\mathbf{x}_i)}$$

❙ Equivalent to ranking by:

$$L_A(\mathbf{x}) = \frac{\frac{1}{n_A}\sum_{i\in A} K_\lambda(\mathbf{x},\mathbf{x}_i)}{\frac{1}{n_I}\sum_{i\in I} K_\lambda(\mathbf{x},\mathbf{x}_i)}$$

❚ Select λ by systematic variation to minimise the sum of ranks of true actives: a leave-one-out approach
❙ we use the same value for both active and inactive molecules
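A vectorised NumPy sketch of the ranking criterion L_A(x) and the leave-one-out sum-of-ranks selection of λ described above; the fingerprints, activity labels and λ grid are synthetic placeholders.

```python
import numpy as np

def kernel_matrix(X, lam):
    """K_lambda between all pairs of binary fingerprints in X (n x M)."""
    M = X.shape[1]
    d = (X[:, None, :] != X[None, :, :]).sum(axis=2)  # pairwise Hamming distances
    return lam ** (M - d) * (1.0 - lam) ** d

def loo_sum_of_ranks(X, is_active, lam):
    """Leave-one-out L_A(x) for every training compound, then the sum of the
    ranks of the true actives (smaller = better discrimination)."""
    K = kernel_matrix(X, lam)
    np.fill_diagonal(K, 0.0)                       # leave-one-out: drop self term
    act = np.asarray(is_active, dtype=bool)
    n_A = act.sum() - act                          # actives available to each compound
    n_I = (~act).sum() - (~act)
    score_A = K[:, act].sum(axis=1) / np.maximum(n_A, 1)
    score_I = K[:, ~act].sum(axis=1) / np.maximum(n_I, 1)
    L = score_A / np.maximum(score_I, 1e-300)
    ranks = np.argsort(np.argsort(-L)) + 1         # rank 1 = highest L_A
    return int(ranks[act].sum())

# Systematic variation of lambda; keep the value giving the smallest sum of ranks
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 32))
y = rng.integers(0, 2, size=60).astype(bool)
best = min((loo_sum_of_ranks(X, y, lam), lam) for lam in np.arange(0.55, 1.0, 0.05))
print("best lambda:", round(best[1], 2))
```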

Evaluation of kernel discrimination: iterative screening experiments

[Enrichment plots for “5000 training points” and “5000 training points (noisy)”: number of actives found versus number of compounds selected (up to 10,000), comparing kernel(aptt), mergedss(aptt), kernel(daylight), mergedss(daylight) and the expected random curve]

❚ Applicable to many different binary descriptors
❙ structural keys, fingerprints, atom-pairs, topological torsions, pharmacophore keys
❚ Compare to similarity as the “baseline” rather than random selection
❚ Noisy data: deliberately misclassify inactives as active

Additional application of kernel: Typicality scores

❚ Classify the data (active/inactive)
❚ Describe the molecules with 2D fingerprints
❚ Use a leave-one-out algorithm to predict the activity of each compound from all the others (see the sketch below)
❙ molecules have experimental activity that is “typical” or “atypical”
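A short NumPy sketch of the leave-one-out typicality idea, reusing the binary kernel from the earlier slides; the fingerprints, labels and λ value are arbitrary placeholders.

```python
import numpy as np

def typicality(X, is_active, lam=0.75):
    """Leave-one-out typicality: the binary-kernel ratio L_A(x) for each
    compound, computed from all the *other* compounds in the screen."""
    M = X.shape[1]
    d = (X[:, None, :] != X[None, :, :]).sum(axis=2)  # pairwise Hamming distances
    K = lam ** (M - d) * (1.0 - lam) ** d
    np.fill_diagonal(K, 0.0)                           # exclude the compound itself
    act = np.asarray(is_active, dtype=bool)
    score_A = K[:, act].sum(axis=1) / np.maximum(act.sum() - act, 1)
    score_I = K[:, ~act].sum(axis=1) / np.maximum((~act).sum() - (~act), 1)
    return score_A / np.maximum(score_I, 1e-300)

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(50, 32))
y = rng.integers(0, 2, size=50).astype(bool)
t = typicality(X, y)
# "Atypical" compounds: actives that look like the inactives (possible false
# positives) and inactives that look like the actives (worth a second look)
print("atypical actives:  ", np.where(y & (t < 1))[0])
print("atypical inactives:", np.where(~y & (t > 1))[0])
```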

Typicality plots

[Plot annotations:]
❚ Compounds that are typical of “active” compounds but are inactive in the assay
❚ Compounds that are consistent with the high activity in the assay
❚ Compounds consistent with inactivity, but these may represent an interesting lead series!
❚ May be “false positives”, or perhaps we should find some additional similar compounds to test

Summary

❚ We are not trying to find as many “hits” as we can
❚ Rather, we want to search for a diversity of progressable hits (leads)

❚ Most quantitative methods are designed to maximize the expected number of hits - but they may all be from the same series

❚ Overall success depends on gaining information about the space of compounds, as well as finding hits.

❚ No one technique is superior, so use multiple methods and incorporate chemist expertise

Acknowledgements - screening group

❚ Florence Martin
❚ Steve Hiscox
❚ Chris Molloy
❚ Andy Vines
❚ Graham Baker
❚ Steve Rees
❚ Mike Snowden