QSAR: Brief and Incomplete Introduction · 2020. 2. 4. · Dawn of QSAR •Crum-Brown and Fraser:...
QSAR: Brief and Incomplete Introduction
Dr Olena Mokshyna
Shift of Paradigm
A. Agrawal and A. Choudhary, APL Mater. 4, 053208 (2016); https://www.bigmax.mpg.de
Why do we need modeling?
• Cut costs
• Increase speed
• Decrease human efforts
• Reduce animal testing
• Find something new!
Dawn of QSAR
• Crum-Brown and Fraser: Φ = f(C)
• Meyer and Overton: suggested that the narcotic (depressant) action of a group of organic compounds paralleled their olive oil/water partition coefficients
• Hammett: “σ−ρ” culture
• Taft: separating polar, steric, and resonance effects
• 1962: Hansch et al. published their study on the structure-activity relationships of plant growth regulators and their dependency on Hammett constants and octanol-water partition coefficient
• Hansch and Fujita: QSAR paradigm
• Free-Wilson approach
Gramatica, Paola. (2011). A Short History of QSAR Evolution.
A. Cherkasov et al. (J Med Chem, 2014) QSAR modeling: where have you been? Where are you going to? doi: 10.1021/jm4004285
QSAR = Quantitative Structure-Activity Relationship
QSPR = Quantitative Structure-Property Relationship
QSAR step-by-step: general workflow
Structures → Descriptors
Descriptors + Properties → Machine Learning algorithm → Model ready!
• Step 1: Describe the molecules
Descriptors
• Calculated or empirical (e.g. lipophilicity, shifts in NMR)
• Integral (MW, surface area) or fragment (fingerprints, simplexes)
• 1D, 2D, 3D or even 4D
• Also pharmacophore, physico-chemical, quantum and many, many more
Requirements Towards Descriptors
• Invariance to molecular numbering or labeling
• Invariance of a descriptor value to any translation or rotation of the molecule (for 3D)
• Clear structural interpretation
• Not based on experimental properties
• Able to discriminate among isomers
• Simple
• Smooth structure-property landscape
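The first requirement can be checked directly. The sketch below is a toy illustration, not a real descriptor package: the molecule is a hand-written atom list plus bond list (ethanol with explicit hydrogens), and `molecular_weight`, `bond_type_counts` and `renumber` are hypothetical helper names standing in for an integral descriptor, a fragment descriptor and an atom relabeling. Renumbering the atoms should leave both descriptor values unchanged.

```python
from collections import Counter

# Approximate atomic masses (g/mol) for the few elements used here.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(atoms):
    """Integral descriptor: sum of atomic masses (independent of atom order)."""
    return sum(ATOMIC_MASS[a] for a in atoms)


def bond_type_counts(atoms, bonds):
    """Fragment descriptor: counts of bonded element pairs, e.g. 'C-O'."""
    counts = Counter()
    for i, j in bonds:
        pair = "-".join(sorted((atoms[i], atoms[j])))
        counts[pair] += 1
    return counts


def renumber(atoms, bonds, permutation):
    """Relabel atoms: old index i becomes permutation[i]."""
    new_atoms = [None] * len(atoms)
    for old, new in enumerate(permutation):
        new_atoms[new] = atoms[old]
    new_bonds = [(permutation[i], permutation[j]) for i, j in bonds]
    return new_atoms, new_bonds


# Ethanol with explicit hydrogens: CH3-CH2-OH
atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]

permutation = [8, 3, 0, 1, 2, 4, 5, 6, 7]  # arbitrary relabeling of the atoms
atoms2, bonds2 = renumber(atoms, bonds, permutation)

mw_a, mw_b = molecular_weight(atoms), molecular_weight(atoms2)
frag_a, frag_b = bond_type_counts(atoms, bonds), bond_type_counts(atoms2, bonds2)
```

Both descriptors depend only on which elements are present and how they are bonded, so the relabeled copy gives the same values. A descriptor built from raw atom indices would fail this check.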
• Step 1: Describe the molecules
• Step 2: Choose machine learning algorithm
Exciting World of Machine Learning
No labels (unsupervised learning) / With labels (supervised learning)
Classification / Regression
• Classification: Class 0 / Class 1
• Regression: y = [0.1, 3.4, 0.11,…]
https://medium.com/free-code-camp/using-machine-learning-to-predict-the-quality-of-wines-9e2e13d7480d
Training a Regression Model
Simple case: linear regression with one variable
$y = h(x)$
$h(x) = \theta_0 + \theta_1 x$
[Figure: learning algorithm with inputs X and y]
Training a Regression Model
Cost (or loss) function:
Task: optimize the coefficients to minimize the cost function
Optimization: use gradient descent
The size of each step is determined by the parameter α, which is called the learning rate
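A minimal sketch of this training loop for the one-variable model $h(x) = \theta_0 + \theta_1 x$. It assumes the usual mean-squared-error cost $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(h(x_i) - y_i)^2$, which is a common choice rather than something confirmed by the slide. The function names, the toy data and the values of `alpha` and `n_steps` are illustrative assumptions.

```python
def cost(theta0, theta1, xs, ys):
    """Assumed cost: mean squared error divided by 2."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, n_steps=5000):
    """Repeatedly step against the gradient of the cost; alpha is the learning rate."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(n_steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/d(theta0)
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/d(theta1)
        # Update both coefficients simultaneously.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x, so the fit should recover (1, 2).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
final_cost = cost(theta0, theta1, xs, ys)
```

If α is too large the updates overshoot and the cost grows instead of shrinking. If it is too small, many more steps are needed to converge.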
Training a Classification Model
$y \in \{0, 1\}$
Algorithm: logistic regression
Sigmoid function:
$z = \theta_0 + \theta_1 x$
$h_\theta(x) = g(z(x))$
$g(z) = \frac{1}{1 + e^{-z}}$
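The formulas above translate almost line by line into code. In this sketch the coefficient values and the 0.5 decision threshold are illustrative assumptions. In practice the coefficients would be learned by minimizing a cost function, as in the regression case.

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, theta0, theta1):
    """h_theta(x) = g(theta0 + theta1 * x), read as the probability of class 1."""
    return sigmoid(theta0 + theta1 * x)


def predict_class(x, theta0, theta1, threshold=0.5):
    """Turn the probability into a label in {0, 1}."""
    return 1 if predict_proba(x, theta0, theta1) >= threshold else 0


# Hypothetical coefficients: the decision boundary z = 0 lies at x = 2.
theta0, theta1 = -4.0, 2.0
p_low = predict_proba(1.0, theta0, theta1)    # z = -2, probability below 0.5
p_high = predict_proba(3.0, theta0, theta1)   # z = +2, probability above 0.5
label_low = predict_class(1.0, theta0, theta1)
label_high = predict_class(3.0, theta0, theta1)
```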
Support Vector Machines (SVM)
Random Forest (RF)
Gradient Boosting Machines (GBM)
SVM Classification
Optimal separation?
The largest possible margin = a greater chance of new data being classified correctly
SVM: Kernel trick
w: weights
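The idea behind the kernel trick is that a kernel function returns the dot product in a higher-dimensional feature space without ever building that space. The sketch below checks this for one concrete case, the degree-2 polynomial kernel $k(x, z) = (x \cdot z)^2$ on 2D inputs. The choice of kernel is an assumption made for illustration, since the slide does not name one.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly2_kernel(x, z):
    """Implicit route: k(x, z) = (x . z)^2, computed in the original 2D space."""
    return dot(x, z) ** 2


def poly2_features(x):
    """Explicit route: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), a 3D feature space."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x = (1.0, 2.0)
z = (3.0, -1.0)

k_implicit = poly2_kernel(x, z)
k_explicit = dot(poly2_features(x), poly2_features(z))
```

Both routes give the same number, so an SVM can work with `poly2_kernel` alone and still separate data that is only linearly separable in the expanded space.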
RF
First let’s grow a tree
Breiman L (2001). "Random Forests". Machine Learning. 45 (1): 5–32.
RF
• Step 1: Select random samples from a dataset.
• Step 2: Construct a decision tree for every sample. Get the prediction from every decision tree.
• Step 3: Trees vote for every predicted result.
• Step 4: Select the most voted prediction.
https://vas3k.com/blog/machine_learning/?ref=hn
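The four RF steps can be sketched in plain Python. This is a deliberately reduced illustration under stated assumptions: each "tree" is a one-split decision stump on a single feature, the data are a toy 1D set, and the random feature selection used by real random forests is left out. All function names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """One-split 'tree': returns (threshold, left_label, right_label)."""
    majority = Counter(ys).most_common(1)[0][0]
    best = (float("inf"), majority, majority)  # fallback: constant prediction
    best_errors = sum(y != majority for y in ys)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if errors < best_errors:
            best, best_errors = (threshold, left_label, right_label), errors
    return best


def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


def fit_forest(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        # Step 1: random sample with replacement (bootstrap).
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: one tree per sample.
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    votes = Counter(predict_stump(s, x) for s in forest)  # Step 3: trees vote
    return votes.most_common(1)[0][0]                     # Step 4: most voted wins


# Toy data: class 0 for x in [0, 4.5], class 1 for x in [6, 10.5].
xs = [i * 0.5 for i in range(10)] + [6.0 + i * 0.5 for i in range(10)]
ys = [0] * 10 + [1] * 10

forest = fit_forest(xs, ys)
pred_low = predict_forest(forest, 1.0)
pred_high = predict_forest(forest, 9.0)
```

Because every tree sees a different bootstrap sample, the individual thresholds differ, and the vote averages out their individual mistakes.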
GBM
• Add new models to the ensemble sequentially
• At each particular iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far
GBM vs Deep Learning: https://sefiks.com/2019/02/24/machine-learning-wars-deep-learning-vs-gbm/
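A minimal sketch of the sequential idea, assuming a squared-error loss, so that "the error of the whole ensemble" is simply the current residual, and assuming one-split regression stumps as the weak learners. The function names, toy data, number of rounds and learning rate are all illustrative choices.

```python
def fit_regression_stump(xs, targets):
    """Best single split by squared error: (threshold, left_mean, right_mean)."""
    values = sorted(set(xs))
    best, best_sse = None, float("inf")
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if sse < best_sse:
            best, best_sse = (threshold, left_mean, right_mean), sse
    return best


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Error of the ensemble learnt so far.
        residuals = [y - p for y, p in zip(ys, preds)]
        # The new weak learner is trained on that error.
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for x, p in zip(xs, preds)]
    return base, stumps, learning_rate


def predict_gbm(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]             # toy nonlinear target

model = fit_gbm(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict_gbm(model, x) for x in xs])
```

Unlike the random forest, where trees are built independently, each stump here depends on all the previous ones.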
https://vas3k.com/blog/machine_learning/?ref=hn
Hyperparameter Tuning
• Model parameters: learned during model training (e.g. weights in linear regression).
• Hyperparameters: all the parameters set by the user before training starts (e.g. number of estimators).
n_estimators=100, max_depth=10
n_estimators=100, max_depth=5
n_estimators=200, max_depth=50
Hyperparameter Tuning: Search
• Manual Search
• Grid Search
• Random Search
• Automated Hyperparameter Tuning (Bayesian Optimization, Genetic Algorithms)
• Artificial Neural Networks (ANNs) Tuning
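Grid search and random search differ only in how candidate settings are generated. The sketch below shows both. `validation_score` is a hypothetical stand-in for "train a model with these hyperparameters and score it on validation data", and its formula is made up so that the example runs without a dataset. The grid values and ranges are likewise assumptions.

```python
import itertools
import random


def validation_score(n_estimators, max_depth):
    """Hypothetical stand-in for training and validating a model (higher is better)."""
    return -((n_estimators - 200) / 100) ** 2 - ((max_depth - 10) / 5) ** 2


def grid_search(grid, score_fn):
    """Try every combination of the listed values."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(ranges, score_fn, n_trials=20, seed=0):
    """Sample settings at random from integer ranges (inclusive bounds)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.randint(lo, hi) for name, (lo, hi) in ranges.items()}
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


grid = {"n_estimators": [100, 200], "max_depth": [5, 10, 50]}
ranges = {"n_estimators": (50, 300), "max_depth": (2, 50)}

gs_params, gs_score = grid_search(grid, validation_score)
rs_params, rs_score = random_search(ranges, validation_score)
```

The grid evaluates 2 × 3 = 6 settings here and its cost grows multiplicatively with every added hyperparameter, whereas random search keeps a fixed budget of `n_trials` evaluations.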
Model is trained! Profit? Not so fast
• Step 1: Describe the molecules
• Step 2: Choose machine learning algorithm
• Step 3: Validate your model
Model Validation
Train / Cross-validation / Test Sets
Data-driven design of metal–organic frameworks for wet flue gas CO2 capture. Boyd et al. Nature, 2019. doi:10.1038/s41586-019-1798-7
Training/test ~ 70%/30%
K-fold cross-validation
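A minimal sketch of both splitting schemes, working on row indices only. The helper names, the dataset size of 20, the seed and k = 5 are illustrative assumptions.

```python
import random


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle indices 0..n-1 and hold out a fraction as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]   # (train, test)


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists; each fold is validated on once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


# 20 samples: roughly 70% for training, 30% kept aside for the final test.
train_idx, test_idx = train_test_split_indices(20)

# Cross-validation uses the training part only; the test set stays untouched.
folds = list(k_fold_indices(train_idx, k=5))
```

Keeping the test indices out of the cross-validation loop is the point of the three-way split: hyperparameters are chosen on the validation folds, and the test set is used once at the end.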
High Bias vs High Variance
Question: Which case is underfitting, which case is overfitting?
High Bias vs High Variance: Cure?
High variance (overfitting):
• Get more (diverse!) training examples
• Try fewer features
• Play with hyperparameters (e.g. decrease the depth of the trees in RF)
High bias (underfitting):
• Try more new features
• Play with feature engineering using old features
• Play with hyperparameters (e.g. increase the depth of the trees in RF)
Model Metrics: Regression
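Commonly used regression metrics include the root-mean-square error (RMSE), the mean absolute error (MAE) and the coefficient of determination (R²). The sketch below assumes these three are the ones of interest. The function name and the toy values are illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R^2 for paired observed and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "r2": r2}


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
metrics = regression_metrics(y_true, y_pred)
```

RMSE and MAE are in the units of the modeled property, while R² is unitless and compares the model against always predicting the mean.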
Model Metrics: Classification
$\text{Accuracy} = \frac{TP + TN}{N}$
$\text{Specificity} = \frac{TN}{TN + FP}$
$\text{Sensitivity} = \frac{TP}{TP + FN}$
$\text{Balanced accuracy} = \frac{\text{Specificity} + \text{Sensitivity}}{2}$
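These four metrics follow directly from the confusion-matrix counts. The sketch below uses hypothetical helper names and a made-up imbalanced example (5 actives among 100 compounds, with a model that predicts "inactive" for everything). It assumes both classes are present, otherwise the denominators would be zero.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)      # fraction of true class-1 samples found
    specificity = tn / (tn + fp)      # fraction of true class-0 samples found
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Imbalanced toy set: 5 actives, 95 inactives; the model always predicts 0.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

metrics = classification_metrics(*confusion_counts(y_true, y_pred))
```

Plain accuracy looks excellent here even though no active compound is found, which is why balanced accuracy is the more informative number for imbalanced data.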
• Step 0: Check and curate your data
• Step 1: Describe the molecules
• Step 2: Choose machine learning algorithm
• Step 3: Validate your model
Databases
Data Curation
• Errors in structures
• Errors in values
• Errors in units (e.g. mg/kg instead of mol/kg)
Fourches et al. Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J. Chem. Inf. Model., 2010. doi: 10.1021/ci100176x
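The unit problem is easy to fix once it is noticed, because mass-based and molar values differ by the molar mass of each compound. A minimal sketch with a hypothetical function name and made-up numbers:

```python
def mg_per_kg_to_mol_per_kg(value_mg_per_kg, molar_mass_g_per_mol):
    """Convert a mass-based dose (mg/kg) to a molar dose (mol/kg)."""
    if molar_mass_g_per_mol <= 0:
        raise ValueError("molar mass must be positive")
    grams_per_kg = value_mg_per_kg / 1000.0        # mg -> g
    return grams_per_kg / molar_mass_g_per_mol     # g -> mol


# Hypothetical compound: 500 mg/kg at a molar mass of 250 g/mol.
dose_mol_per_kg = mg_per_kg_to_mol_per_kg(500.0, 250.0)
```

Two compounds with the same value in mg/kg can differ several-fold in mol/kg, so mixing the two units in one dataset distorts the relationship the model is supposed to learn.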
Preparation of Features
Lab Course
Scikit-learn
• Module for Python to perform machine learning
https://www.scikit-learn.org/
RDKit
• Module for Python (C++, Java) to do a variety of cheminformatics tasks
https://www.rdkit.org/
Jupyter notebook
Python
Further Reading & Learning
Textbooks on QSAR:
• A.R. Leach, V.J. Gillet: An Introduction to Chemoinformatics. Springer, 2003
• Gasteiger J. (Editor), Engel T. (Editor): Chemoinformatics: A Textbook. John Wiley & Sons, 2004
Machine learning:
• Machine Learning on Coursera: https://www.coursera.org/learn/machine-learning/
• Machine Learning Crash Course: https://developers.google.com/machine-learning/crash-course
Thank you for your attention!
Any questions?
Regularization
Regularization is a way to avoid overfitting by penalizing high-valued regression coefficients.
L2 norm:
L1 norm:
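The effect of the two penalties is easiest to see for a one-coefficient model $y = \theta x$ without intercept. The sketch below assumes the costs $\sum_i (y_i - \theta x_i)^2 + \lambda \theta^2$ for the L2 norm (ridge) and $\sum_i (y_i - \theta x_i)^2 + \lambda |\theta|$ for the L1 norm (lasso). Scaling conventions vary between textbooks and libraries, so the exact form is an assumption, and the function names are hypothetical. Both minimizers have closed forms in this simple case.

```python
def ridge_coefficient(xs, ys, lam):
    """Minimizer of sum((y - theta*x)^2) + lam * theta^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_coefficient(xs, ys, lam):
    """Minimizer of sum((y - theta*x)^2) + lam * |theta| (soft thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return (1.0 if sxy >= 0 else -1.0) * shrunk / sxx


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                                # exact relation y = 2x

theta_ols = ridge_coefficient(xs, ys, lam=0.0)      # no penalty
theta_ridge = ridge_coefficient(xs, ys, lam=14.0)   # shrunk towards zero
theta_lasso = lasso_coefficient(xs, ys, lam=28.0)   # shrunk towards zero
theta_lasso_strong = lasso_coefficient(xs, ys, lam=56.0)  # set exactly to zero
```

The L2 penalty shrinks the coefficient but never makes it exactly zero, while a strong enough L1 penalty removes it entirely, which is why L1 regularization also acts as feature selection.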