SIV: A synergistic approach to the analysis of high-throughput screening
data
Andrew Leach, Francis Atkinson, Gianpaolo Bravi, Darren Green, Mike Hann, Gavin Harper, Andy Whittington
GlaxoSmithKline Research and Development, Stevenage
Typical HTS flowchart
[Figure: flowchart. Boxes: HTS (single-point percentage inhibition) on liquid samples; SIV; IC50 on liquid sample; Select a subset of representatives; SIV; IC50 and QC on solid sample; H2L Chemistry etc.; Library design, compound acquisition etc.]
Assay Data Analysis
❚ Historically a cutoff was used to define an “active” sample
❙ often chosen according to the capacity of the next screen iteration
❚ Require a measure from the screeners of what is “active”
❙ “active” = significantly different from the negative control values (n=16 for 384-well plates)
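A minimal sketch of a control-based definition of “active”, for illustration only. The slides do not say which significance test is used; the mean + 3 standard deviations rule, the function names and the control values below are all assumptions.

```python
from statistics import mean, stdev

def active_threshold(negative_controls, n_sd=3.0):
    """% inhibition above which a well is called active.

    Assumption: "significantly different" is taken to mean more than
    n_sd sample standard deviations above the negative-control mean.
    """
    return mean(negative_controls) + n_sd * stdev(negative_controls)

def call_actives(sample_values, negative_controls, n_sd=3.0):
    """Flag each sample well as active (True) or not (False)."""
    threshold = active_threshold(negative_controls, n_sd)
    return [value > threshold for value in sample_values]

# 16 hypothetical negative-control wells (% inhibition) from one 384-well plate
controls = [-4.0, 2.0, 1.0, -1.0, 3.0, -2.0, 0.0, 1.0,
            -3.0, 2.0, -1.0, 0.0, 4.0, -2.0, 1.0, -1.0]

flags = call_actives([5.0, 7.0, 40.0], controls)
```

Because the threshold is derived per plate from its own controls, it moves with assay noise rather than with the capacity of the next screen.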
HTS data analysis: objective
❚ Mine HTS data to discover less potent but more attractive start points for medicinal chemistry
❙ Include “grey data” and “inactives”
❚ Discover low potency series where application of a normal cutoff value would find none
❚ Suggest the best samples to progress when there are too many primary positives
❚ Develop models for:
❙ Library design
❙ Compound acquisition & selection - iterative screening
HTS data analysis: schematic
[Figure: activity plotted against structural descriptor, with a traditional “hit cutoff” line and labelled regions: chemically tractable series, chemically intractable series, singleton.]
Statistical methods
❚ Many validated methods exist for Quantitative Structure Activity Relationships (QSAR)
❚ They work best with
❙ small data sets (10s or 100s of points)
❙ congeneric series
❙ high quality assay data (IC50s)
❙ chemists who know the SAR
❚ Typical HTS data is characterised by:
❙ High volume
❙ Poor quality (noise, false positives, false negatives)
❙ Diverse structural classes
❙ Multiple binding modes
Why SIV? Selection by Interactive Visualisation
❚ Requirements for multiple lead series
❚ Many high potency “hits” have undesirable chemical features
❚ “Black box” computational methods are not able to unambiguously assess chemical tractability
➔ Expert medicinal chemist knowledge enabled by informatics tools
➔ Incorporate target-specific knowledge
What is SIV?
❚ A combination of computational methods, with the combined results visualised to aid sample selection
❚ The data is normally filtered for reactive/undesirable species before analysis
❚ Visualisation is usually through Spotfire
❚ Our experience is that no single method works all of the time, therefore it is normal to select several
❙ e.g. clustering (various flavours), 3D pharmacophores, Docking, Kernel, SCAM, GaP, similarity analysis etc.
❚ Some techniques just look at the “actives” (though there may be many thousands of these!); others use all of the data
Different Methods Find Different Things
[Figure: diagram comparing the compounds selected by Kernel Discrimination and by SCAM. Counts shown: 387 / 6786 (5.7%); 250 / 6786 (3.7%); 435 / 3214 (13.5%); 1050 / 79651 (1.3%).]
Some SIV techniques
[Figure: techniques grouped under four headings: CLUSTERING; PHYSICAL PROPERTIES; PREDICTIVE MODELS; OTHER “GOODNESS” CRITERIA. Labels shown: Support, QuikClus, clogp, cmr, MW, Bits_On, Flex_Index, N_Donors, Andrews BE, SCAM, Typicality, Novelty, Acceptance Similarity, Fierce Filters, Brian’s Baddies, Neural Networks.]
Typical SIV application
[Figure: workflow. HTS data (“actives”), 20K; analysed with Filters, Support (similarity/clustering), Pharmacophore, 3D pharmacophores, Docking and Sub-structures; Chemist visualisation; Samples for next screening iteration, ~5K.]
Inspecting every molecule in isolation is not feasible
Florida election officials hold ballots up to the light as they conduct a manual recount Saturday
Picture courtesy of cnn.com
Volusia County Judge Michael McDermott, right, takes a look at a ballot with an election observer.
Picture courtesy of cbsnet.com
Filters
❚ “Core set” of 2D filters
❙ reactive/undesirable moieties
❚ Known problematic compounds
❙ fluorophores etc.
❚ Properties/counts
❙ c.f. Lipinski
❙ Use properties meaningful to a chemist
❙ Supplement generic filters with those that are specific to the project
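As an illustration of the properties/counts filters, here is a small sketch that checks precomputed properties against Lipinski-style limits and lets a project override them. The `Sample` class, the property names, the example values and the tolerance of one violation are assumptions made for this example; they are not the filters used in SIV.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """Hypothetical record of precomputed properties for one compound."""
    name: str
    mol_weight: float
    clogp: float
    n_donors: int
    n_acceptors: int

# Generic upper limits in the spirit of Lipinski's rule of five.
GENERIC_LIMITS = {
    "mol_weight": 500.0,
    "clogp": 5.0,
    "n_donors": 5,
    "n_acceptors": 10,
}

def violations(sample, limits):
    """Names of the properties that exceed their limit."""
    return [prop for prop, limit in limits.items()
            if getattr(sample, prop) > limit]

def passes(sample, limits, max_violations=1):
    """Accept a sample with at most max_violations exceeded limits."""
    return len(violations(sample, limits)) <= max_violations

# A project can supplement the generic limits with its own (assumed) values.
project_limits = {**GENERIC_LIMITS, "mol_weight": 400.0}

small = Sample("small", 350.0, 2.1, 2, 5)
greasy = Sample("greasy", 650.0, 6.2, 1, 4)
```

Returning the names of the violated properties, rather than a bare pass/fail, keeps the result meaningful to a chemist reviewing it in a visualisation.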
“Support”
❚ Gain confidence that high activity is not due to chance
❚ Provides “representative” molecule (a simple form of clustering)
❚ Identify clear singletons
❚ Based on nearest-neighbour similarity
❙ simplest is 2D Daylight Tanimoto but there are others
❚ Similarity to compounds that have been accepted in the past
❙ continue to enhance the overall productivity of the procedure
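The nearest-neighbour idea behind “Support” can be sketched as follows. Fingerprints are represented here as plain sets of on-bit indices rather than Daylight fingerprints, and the 0.5 singleton threshold and the toy data are assumptions for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def nearest_neighbour_similarity(fingerprints):
    """Highest Tanimoto similarity of each compound to any other compound."""
    result = {}
    for name, fp in fingerprints.items():
        others = [tanimoto(fp, other_fp)
                  for other_name, other_fp in fingerprints.items()
                  if other_name != name]
        result[name] = max(others, default=0.0)
    return result

def singletons(fingerprints, threshold=0.5):
    """Compounds whose nearest neighbour falls below the threshold."""
    nn = nearest_neighbour_similarity(fingerprints)
    return sorted(name for name, sim in nn.items() if sim < threshold)

# Toy fingerprints: "a" and "b" share three of five distinct bits.
fps = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 5}, "c": {10, 11, 12}}
lonely = singletons(fps)
```

An active with several similar actives nearby has support; one with no close neighbour at all is a clear singleton and may deserve a second look before it is progressed.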
Support example: singletons
❚ Extract from various databases all compounds with activity keywords related to anti-inflammation
❚ Which compounds have the lowest near-neighbour similarity?
[Figure: structures of the compounds with the lowest near-neighbour similarity, including species containing Nb, Pt2+, Mn2+, Sn+ and Ti4+.]
Processing the results from 3D pharmacophore analysis
❚ May generate large hitlists from 3D searches, especially those that are “generic”/“promiscuous” in character
❚ Want to group together molecules based on the key pharmacophoric features/substructure responsible
❙ Identify the features within each hit molecule that match the pharmacophore points
❙ Clip away remainder of the structure to give core
❘ keep rings
❙ Order/cluster compounds according to core structure and visualise
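One possible reading of the clipping step, sketched on a toy molecular graph: terminal atoms that do not match a pharmacophore point are stripped repeatedly, which leaves the matched atoms, the linkers between them and every ring. This is an assumption about how clipping might work, not the algorithm actually used; the graph representation and function names are hypothetical.

```python
def build_graph(edges):
    """Adjacency sets for an undirected graph given as (atom, atom) bonds."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def clip_to_core(bonds, matched):
    """Iteratively remove terminal atoms that match no pharmacophore point.

    Ring atoms always have at least two neighbours within the ring,
    so rings are never clipped.
    """
    graph = {atom: set(nbrs) for atom, nbrs in bonds.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for atom in list(graph):
            if atom not in matched and len(graph[atom]) <= 1:
                for nbr in graph[atom]:
                    graph[nbr].discard(atom)
                del graph[atom]
                changed = True
    return graph

# Toy molecule: six-membered ring (atoms 0-5), matched atom 6 on atom 0,
# an unmatched substituent 9 on atom 6, and an unmatched chain 7-8 on atom 3.
molecule = build_graph([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
                        (0, 6), (6, 9), (3, 7), (7, 8)])
core = clip_to_core(molecule, matched={6})
```

Hits that reduce to the same core can then be grouped and ordered together for visualisation.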
Introducing “chemical restrictions” into the clipping algorithm
❚ Clipping algorithm can sometimes be too fierce - may want to retain certain chemical details (e.g. synthon points)
❚ Extended algorithm uses rules to determine which functionality can be clipped and which cannot:
[Figure: example structures containing NH, O, X and Y groups; the marked groups represent the pharmacophore-matching regions.]
HTS Analysis techniques that use all the data
❚ SCAM
❙ recursive partitioning
❚ GaP
❙ Pharmacophore-based methods that compare active vs. inactive
❚ Kernel Discrimination
Caution may be required when developing quantitative models from HTS data
[Figure: scatter plot of pIC50 (axis from 4 to 10) against % inhibition (axis from -40 to 130).]
HTS analysis using binary kernel discrimination
❚ Kernel density estimate of parent distribution at x is:
$$\hat{p}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} K_\lambda(\mathbf{x}, \mathbf{x}_i)$$
❙ Kλ is the density function; λ is a smoothing parameter that affects the range of influence of each point xi in the sample
❚ Aitchison and Aitken form of Kλ for binary data:
$$K_\lambda(\mathbf{x}, \mathbf{x}_i) = \lambda^{M-d} (1-\lambda)^{d}$$
❙ x, xi are vectors of length M differing at d positions
❙ λ=0.5 ⇒ uniform density; λ→1 ⇒ less smooth
[Figure: two sketches of density against descriptor.]
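The kernel and density estimate defined on this slide can be written directly in a few lines. This is a sketch with descriptors represented as tuples of 0/1; the function names and toy sample are chosen for this example.

```python
def hamming(x, y):
    """Number of positions d at which two binary vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def kernel(x, xi, lam):
    """Aitchison and Aitken kernel: lam**(M - d) * (1 - lam)**d."""
    m = len(x)
    d = hamming(x, xi)
    return lam ** (m - d) * (1.0 - lam) ** d

def density(x, sample, lam):
    """Kernel density estimate at x: the mean kernel over the sample."""
    return sum(kernel(x, xi, lam) for xi in sample) / len(sample)

# Toy sample of three 4-bit descriptors.
sample = [(1, 1, 0, 0), (1, 1, 1, 0), (0, 0, 1, 1)]
uniform = density((1, 0, 1, 0), sample, 0.5)  # every kernel term is 0.5**4
peaked = density((1, 1, 0, 0), sample, 0.9)   # dominated by the exact match
```

With λ = 0.5 every point contributes the same amount whatever its distance, which is the uniform case noted above; as λ approaches 1 only near-identical descriptors contribute.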
Choosing the smoothing parameter
❚ If πA, πI are the probabilities that a randomly selected molecule is active or inactive, and p(x|A), p(x|I) are the probabilities that a molecule with descriptor vector x occurs when selected at random from the active or inactive populations
❚ Then the estimated probability that a compound with descriptor vector x is active is:
$$\hat{p}(A \mid \mathbf{x}) = \frac{\pi_A \, n_A^{-1} \sum_{i \in A} K_\lambda(\mathbf{x}, \mathbf{x}_i)}{\pi_A \, n_A^{-1} \sum_{i \in A} K_\lambda(\mathbf{x}, \mathbf{x}_i) + \pi_I \, n_I^{-1} \sum_{i \in I} K_\lambda(\mathbf{x}, \mathbf{x}_i)}$$
❙ Equivalent to ranking by:
$$L_A(\mathbf{x}) = \frac{n_A^{-1} \sum_{i \in A} K_\lambda(\mathbf{x}, \mathbf{x}_i)}{n_I^{-1} \sum_{i \in I} K_\lambda(\mathbf{x}, \mathbf{x}_i)}$$
❚ Select λ by systematic variation to minimise sum of ranks of true actives: a leave-one-out approach
❙ we use the same value for both active and inactive molecules
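A sketch of the leave-one-out choice of λ: each training molecule is scored by the active/inactive likelihood ratio computed without that molecule, all molecules are ranked by score, and the λ giving the smallest sum of ranks for the true actives is kept. The function names, the candidate grid and the toy data are assumptions; candidates are kept strictly between 0 and 1 so that the ratio is always defined.

```python
def kernel(x, xi, lam):
    """Aitchison and Aitken kernel for binary vectors of equal length."""
    d = sum(a != b for a, b in zip(x, xi))
    return lam ** (len(x) - d) * (1.0 - lam) ** d

def likelihood_ratio(x, actives, inactives, lam):
    """Mean kernel over the actives divided by mean kernel over the inactives."""
    num = sum(kernel(x, xi, lam) for xi in actives) / len(actives)
    den = sum(kernel(x, xi, lam) for xi in inactives) / len(inactives)
    return num / den

def loo_scores(actives, inactives, lam):
    """(score, is_active) pairs, each scored with itself left out of its class."""
    scores = []
    for j, x in enumerate(actives):
        rest = actives[:j] + actives[j + 1:]
        scores.append((likelihood_ratio(x, rest, inactives, lam), True))
    for j, x in enumerate(inactives):
        rest = inactives[:j] + inactives[j + 1:]
        scores.append((likelihood_ratio(x, actives, rest, lam), False))
    return scores

def sum_of_active_ranks(actives, inactives, lam):
    """Sum of the ranks (1 = best) of the true actives; lower is better."""
    ranked = sorted(loo_scores(actives, inactives, lam),
                    key=lambda pair: pair[0], reverse=True)
    return sum(rank for rank, (_, is_active) in enumerate(ranked, start=1)
               if is_active)

def choose_lambda(actives, inactives, candidates):
    """Candidate lambda with the smallest leave-one-out sum of active ranks."""
    return min(candidates,
               key=lambda lam: sum_of_active_ranks(actives, inactives, lam))

# Toy training set (needs at least two molecules in each class).
actives = [(1, 1, 0, 0), (1, 1, 1, 0), (1, 1, 0, 1)]
inactives = [(0, 0, 1, 1), (0, 0, 0, 1), (0, 0, 1, 0), (0, 1, 1, 1)]
best = choose_lambda(actives, inactives, [0.6, 0.7, 0.8, 0.9])
```

A single λ is shared by the active and inactive densities, as on the slide.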
Evaluation of kernel discrimination: iterative screening experiments
[Figure: two plots of Number of Actives Found against Number of Compounds Selected (0 to 10000). Panels: “5000 training points” and “5000 training points (noisy)”. Series: kernel(aptt), mergedss(aptt), kernel(daylight), mergedss(daylight), expected.]
❚ Applicable to many different binary descriptors
❙ structural keys, fingerprints, atom-pairs, topological torsions, pharmacophore keys
❚ Compare to similarity as the “baseline” rather than random selection
❚ Noisy data: deliberately misclassify inactives as active
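Curves of the kind used in this evaluation (actives found against compounds selected) can be computed from any ranking method as sketched below. The scores and labels are toy values, and the “expected” curve is assumed here to mean selection at the overall hit rate.

```python
def actives_found_curve(scores, is_active):
    """Cumulative number of actives found when selecting in score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    found = 0
    curve = []
    for i in order:
        found += int(is_active[i])
        curve.append(found)
    return curve

def expected_curve(n_total, n_actives):
    """Actives expected after k selections at the overall hit rate."""
    return [k * n_actives / n_total for k in range(1, n_total + 1)]

# Toy screen of five compounds, two of them active.
scores = [0.9, 0.1, 0.8, 0.3, 0.7]
labels = [True, False, False, False, True]
curve = actives_found_curve(scores, labels)
baseline = expected_curve(len(scores), sum(labels))
```

Plotting the curve from a similarity search alongside the kernel curve gives the stricter baseline the slide recommends.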
Additional application of kernel: Typicality scores
❚ Classify the data (active/inactive)
❚ Describe the molecules with 2D fingerprints
❚ Use a leave one out algorithm to predict activity of each compound from all the others
❙ molecules have experimental activity that is “typical” or “atypical”
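A sketch of how a typicality label might be derived once leave-one-out class densities are available for a compound. The posterior combines the two densities with the class priors; the 0.5 cutoff, the function names and the label strings are assumptions for illustration.

```python
def posterior_active(density_active, density_inactive, prior_active):
    """Estimated probability of activity from class densities and priors."""
    weighted_active = prior_active * density_active
    weighted_inactive = (1.0 - prior_active) * density_inactive
    return weighted_active / (weighted_active + weighted_inactive)

def typicality(experimentally_active, probability_active, cutoff=0.5):
    """Compare the measured class with the leave-one-out prediction."""
    predicted_active = probability_active >= cutoff
    kind = "typical" if predicted_active == experimentally_active else "atypical"
    observed = "active" if experimentally_active else "inactive"
    return f"{kind} {observed}"

# Hypothetical leave-one-out densities for one compound.
p = posterior_active(density_active=0.02, density_inactive=0.01, prior_active=0.5)
label_if_active = typicality(True, p)
label_if_inactive = typicality(False, p)
```

The atypical compounds are the interesting ones to inspect: their measured activity disagrees with what their neighbours in fingerprint space suggest.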
Typicality plots
[Figure: typicality plots, annotated as follows]
❚ Compounds that are typical of “active” compounds but are inactive in the assay
❚ Compounds that are consistent with the high activity in the assay
❚ Compounds consistent with inactivity. But these may represent an interesting lead series!
❚ May be “false positives”. Or perhaps we should find some additional similar compounds to test
Summary
❚ We are not trying to find as many “hits” as we can
❚ Rather, we want to search for a diversity of progressable hits (leads)
❚ Most quantitative methods are designed to maximize the expected number of hits - but they may all be from the same series
❚ Overall success depends on gaining information about the space of compounds, as well as finding hits.
❚ No one technique is superior, so use multiple methods and incorporate chemist expertise