QSAR: Brief and Incomplete Introduction · 2020. 2. 4. · Dawn of QSAR •Crum-Brown and Fraser:...

QSAR: Brief and Incomplete Introduction

Dr Olena Mokshyna

Shift of Paradigm

A. Agrawal and A. Choudhary, APL Mater. 4, 053208 (2016); https://www.bigmax.mpg.de

Why do we need modeling?

• Cut costs

• Increase speed

• Decrease human efforts

• Reduce animal testing

• Find something new!


Dawn of QSAR

• Crum-Brown and Fraser: Φ = f(C)
• Meyer and Overton: suggested that the narcotic (depressant) action of a group of organic compounds paralleled their olive oil/water partition coefficients
• Hammett: “σ−ρ” culture
• Taft: separating polar, steric, and resonance effects
• 1962: Hansch et al. published their study on the structure-activity relationships of plant growth regulators and their dependency on Hammett constants and octanol-water partition coefficient
• Hansch and Fujita: QSAR paradigm
• Free-Wilson approach

Gramatica, Paola. (2011). A SHORT HISTORY OF QSAR EVOLUTION
A. Cherkasov et al. (J Med Chem, 2014) QSAR modeling: where have you been? Where are you going to? doi: 10.1021/jm4004285

QSAR = Quantitative Structure-Activity Relationship
QSPR = Quantitative Structure-Property Relationship

QSAR step-by-step: general workflow

Structures → Descriptors; Descriptors + Properties → Machine Learning algorithm → Model ready!

• Step 1: Describe the molecules


Descriptors

• Calculated or empirical (e.g. lipophilicity, shifts in NMR)
• Integral (MW, surface area) or fragment (fingerprints, simplexes)
• 1D, 2D, 3D or even 4D
• Also pharmacophore, physico-chemical, quantum and many, many more

Requirements Towards Descriptors

• Invariance to molecular numbering or labeling
• Invariance of a descriptor value to any translation or rotation of the molecule (for 3D)
• Clear structural interpretation
• Not based on experimental properties
• Able to discriminate among isomers
• Simple
• Smooth structure-property landscape
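The first requirement (invariance to atom numbering) and the isomer requirement can be checked on a toy descriptor. The sketch below uses the Wiener index (sum of topological distances over all atom pairs) purely as an illustration; the descriptor choice, the function names and the hand-written carbon skeletons are assumptions, not something taken from the slides.

```python
from collections import deque
from itertools import permutations

def wiener_index(adjacency):
    """Sum of shortest-path lengths (in bonds) over all atom pairs."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives topological distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in dist:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    return total // 2  # every pair was counted twice

def relabel(adjacency, mapping):
    """Return the same graph with atoms renumbered according to `mapping`."""
    return {mapping[a]: [mapping[b] for b in nbrs] for a, nbrs in adjacency.items()}

# Hydrogen-suppressed carbon skeletons (hypothetical hand-written inputs).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

# Invariance: collect the descriptor value over every possible renumbering.
values = {wiener_index(relabel(n_butane, dict(zip(range(4), p))))
          for p in permutations(range(4))}
```

Here `values` contains a single number, so the descriptor does not depend on how the atoms are numbered, while `wiener_index(isobutane)` differs from it, so the two isomers are told apart.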


• Step 1: Describe the molecules
• Step 2: Choose machine learning algorithm

Exciting World of Machine Learning


No labels (unsupervised learning) vs. with labels (supervised learning)

Classification / Regression

• Classification: Class 0 / Class 1
• Regression: y = [0.1, 3.4, 0.11,…]

https://medium.com/free-code-camp/using-machine-learning-to-predict-the-quality-of-wines-9e2e13d7480d

Training a Regression Model

Simple case: linear regression with one variable

$y = h(x)$

$h(x) = \theta_0 + \theta_1 x$

[Diagram: training data (X, y) → learning algorithm → h]

Training a Regression Model

Cost (or loss) function: measures how far the predictions h(x) are from the observed values y

Task: optimize coefficients to minimize cost function

Optimization: use gradient descent

The size of each step is determined by the parameter α, which is called the learning rate
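A minimal sketch of gradient descent for the one-variable model h(x) = θ0 + θ1·x. It assumes the usual mean-squared-error cost J(θ0, θ1) = 1/(2m) · Σ (h(x_i) − y_i)²; the exact cost on the slide is not recoverable, so this form, the function names and the toy data are assumptions.

```python
def cost(theta0, theta1, xs, ys):
    """Assumed cost: J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent(xs, ys, alpha=0.05, n_steps=5000):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j for both coefficients."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(n_steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dtheta1
        # Simultaneous update; alpha (the learning rate) sets the step size.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data generated from y = 1 + 2x (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
```

On this data the coefficients approach θ0 = 1 and θ1 = 2. A learning rate that is too large makes the updates diverge, and one that is too small makes convergence slow.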


Training a Classification Model

$y \in \{0, 1\}$

Algorithm: logistic regression

$z = \theta_0 + \theta_1 x$

$h_\theta(x) = g(z(x))$

Sigmoid function:

$g(z) = \frac{1}{1 + e^{-z}}$
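The logistic-regression hypothesis can be sketched in a few lines. The coefficients passed in are assumed to be already trained; the function names and the 0.5 decision threshold are illustrative assumptions.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(x, theta0, theta1):
    """h(x) = g(theta0 + theta1 * x), read as the probability of class 1."""
    return sigmoid(theta0 + theta1 * x)

def predict_class(x, theta0, theta1, threshold=0.5):
    """Turn the probability into a label y in {0, 1}."""
    return 1 if predict_proba(x, theta0, theta1) >= threshold else 0
```

With the hypothetical coefficients θ0 = −1 and θ1 = 1, the decision boundary sits at x = 1: `predict_class(2.0, -1.0, 1.0)` gives class 1 and `predict_class(0.0, -1.0, 1.0)` gives class 0.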

Support Vector Machines (SVM)

Random Forest (RF)

Gradient Boosting Machines (GBM)

SVM Classification

Optimal separation?

The largest possible margin = a greater chance of new data being classified correctly

SVM


SVM: Kernel trick

w: weights
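The idea behind the kernel trick is that a kernel returns the inner product in a higher-dimensional feature space without ever building that space. A small sketch, assuming a degree-2 polynomial kernel on 2D inputs (the kernel choice and all names are illustrative, not taken from the slides):

```python
import math

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def poly2_features(x):
    """Explicit 3D feature map for 2D inputs that corresponds to poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x = (1.0, 2.0)
z = (3.0, -1.0)
k_implicit = poly2_kernel(x, z)                          # computed in 2D
k_explicit = dot(poly2_features(x), poly2_features(z))   # computed in 3D
```

Both routes give the same number, so an SVM can look for a linear separation in the 3D space while only evaluating the kernel on the original 2D points.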

RF: first let’s grow a tree

Breiman L (2001). "Random Forests". Machine Learning. 45 (1): 5–32.

RF

• Step 1: Select random samples from a dataset.
• Step 2: Construct a decision tree for every sample. Get the prediction from every decision tree.
• Step 3: Trees vote for every predicted result.
• Step 4: Select the most voted prediction.
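The four steps above can be sketched with the standard library alone. This is a deliberately simplified illustration: each "tree" is a depth-1 stump, samples are drawn with replacement, and the per-split feature subsampling of a full Random Forest is left out. All names and the toy data are assumptions.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Depth-1 'tree': the feature/threshold split with the fewest training errors."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    if best is None:  # no usable split in this sample: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Step 1: random sample of the dataset (drawn with replacement).
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Step 2: one tree per sample.
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest

def predict_forest(forest, row):
    # Steps 3 and 4: every tree votes, the most voted class is returned.
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Toy two-descriptor dataset (hypothetical values).
rows = [[0.1, 1.0], [0.2, 1.5], [0.3, 1.2], [0.4, 1.1],
        [1.6, 3.0], [1.7, 3.4], [1.8, 3.1], [1.9, 3.3]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
```

Calling `predict_forest(forest, [0.0, 0.0])` and `predict_forest(forest, [2.0, 4.0])` returns the majority vote for a point on each side of the gap between the classes.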


https://vas3k.com/blog/machine_learning/?ref=hn


GBM

GBM vs Deep Learning: https://sefiks.com/2019/02/24/machine-learning-wars-deep-learning-vs-gbm/

• Add new models to the ensemble sequentially
• At each particular iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far
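A minimal sketch of this sequential scheme for regression with squared error, where "the error of the whole ensemble" is simply the residual y − prediction. The weak learner is a one-threshold stump; the function names, the learning rate and the toy data are assumptions, and the code expects at least two distinct x values.

```python
def fit_regression_stump(xs, residuals):
    """Weak learner: one threshold on x, a constant prediction on each side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest x would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_gbm(xs, ys, n_estimators=100, learning_rate=0.1):
    baseline = sum(ys) / len(ys)  # the ensemble starts as a constant
    predictions = [baseline] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        # Error of the whole ensemble learnt so far.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        # Add the new weak model, shrunk by the learning rate.
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return baseline, stumps, learning_rate

def gbm_predict(model, x):
    baseline, stumps, learning_rate = model
    return baseline + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((gbm_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy non-linear data (hypothetical values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.0, 0.8, 0.9, 0.1, -0.8, -1.0, -0.3, 0.7]
small = fit_gbm(xs, ys, n_estimators=5)
large = fit_gbm(xs, ys, n_estimators=200)
```

Comparing `mse(small, xs, ys)` with `mse(large, xs, ys)` shows the training error shrinking as more weak learners are added, which is also why the number of estimators needs validation rather than blind maximization.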


Hyperparameter Tuning

• Model parameters: learned during model training (e.g. weights in linear regression).
• Hyperparameters: all the parameters set by the user before training starts (e.g. number of estimators).

n_estimators=100
max_depth=10

Hyperparameter Tuning: Search

• Manual Search
• Grid Search
• Random Search
• Automated Hyperparameter Tuning (Bayesian Optimization, Genetic Algorithms)
• Artificial Neural Networks (ANNs) Tuning
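Grid search and random search from the list above differ only in how candidate settings are generated. In the sketch below, `validation_error` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation error"; the search ranges are likewise assumptions.

```python
import itertools
import random

def validation_error(n_estimators, max_depth):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (n_estimators - 300) ** 2 / 1e4 + (max_depth - 6) ** 2

def grid_search(grid):
    """Try every combination in the grid; return (best_error, best_params)."""
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = validation_error(**params)
        if best is None or error < best[0]:
            best = (error, params)
    return best

def random_search(ranges, n_trials=50, seed=0):
    """Sample integer hyperparameters uniformly from inclusive (low, high) ranges."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {name: rng.randint(low, high) for name, (low, high) in ranges.items()}
        error = validation_error(**params)
        if best is None or error < best[0]:
            best = (error, params)
    return best

grid = {"n_estimators": [100, 300, 500], "max_depth": [2, 6, 10]}
grid_best = grid_search(grid)
random_best = random_search({"n_estimators": (50, 500), "max_depth": (2, 12)})
```

Grid search costs the product of all list lengths (9 evaluations here), while random search has a fixed budget of `n_trials` regardless of how many hyperparameters are tuned.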


Model is trained! Profit? Not so fast

• Step 1: Describe the molecules
• Step 2: Choose machine learning algorithm
• Step 3: Validate your model

Model Validation


Train / Cross-validation / Test Sets

Data-driven design of metal–organic frameworks for wet flue gas CO2 capture. Boyd et al. Nature, 2019. doi: 10.1038/s41586-019-1798-7

• Training/test ~ 70%/30%
• K-fold cross-validation
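A minimal sketch of both splitting schemes on sample indices: a shuffled 70%/30% split, then K-fold cross-validation inside the training part. The function names, the seed and k = 5 are assumptions.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle the indices once and cut off the test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

train_idx, test_idx = train_test_split_indices(100)
folds = list(k_fold_indices(train_idx, k=5))
```

The test indices are set aside before cross-validation starts, so they play no part in model selection and are used only for the final estimate.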


High Bias vs High Variance

Question: Which case is underfitting, which case is overfitting?


High Bias vs High Variance: Cure?

High variance (overfitting):
• Get more (diverse!) training examples
• Try fewer features
• Play with hyperparameters (e.g. decrease the depth of the trees in RF)

High bias (underfitting):
• Try more new features
• Play with feature engineering using old features
• Play with hyperparameters (e.g. increase the depth of the trees in RF)

Model Metrics: Regression
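Commonly used regression metrics are the coefficient of determination (R²), the root-mean-square error (RMSE) and the mean absolute error (MAE). Whether these are exactly the metrics shown on this slide is an assumption; the sketch below only illustrates their standard definitions on hypothetical values.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as y."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as y."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
```

R² is unitless and compares the model with always predicting the mean, whereas RMSE and MAE report the typical error in the units of the modelled property.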


Model Metrics: Classification

$\text{Accuracy} = \frac{TP + TN}{N}$

$\text{Specificity} = \frac{TN}{TN + FP}$

$\text{Sensitivity} = \frac{TP}{TP + FN}$

$\text{Balanced accuracy} = \frac{\text{Specificity} + \text{Sensitivity}}{2}$
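These four metrics follow directly from the confusion-matrix counts. A minimal sketch for binary labels in {0, 1}, assuming both classes occur in `y_true` (otherwise a denominator is zero); the names and toy labels are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)  # share of actual positives that were found
    specificity = tn / (tn + fp)  # share of actual negatives that were found
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical labels: 4 actives (1) and 6 inactives (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

Balanced accuracy is the more informative number when the classes are imbalanced, because plain accuracy can be high for a model that always predicts the majority class.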

• Step 0: Check and curate your data
• Step 1: Describe the molecules
• Step 2: Choose machine learning algorithm
• Step 3: Validate your model

Databases


Data Curation

• Errors in structures
• Errors in values
• Errors in units, e.g. mg/kg instead of mol/kg
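The unit problem matters because equal mass doses correspond to different molar doses for compounds of different molar mass. A minimal conversion sketch; the function name and the example values are hypothetical.

```python
def mg_per_kg_to_mol_per_kg(dose_mg_per_kg, molar_mass_g_per_mol):
    """Convert a mass-based dose (mg/kg) to a molar dose (mol/kg)."""
    if molar_mass_g_per_mol <= 0:
        raise ValueError("molar mass must be positive")
    grams_per_kg = dose_mg_per_kg / 1000.0      # mg -> g
    return grams_per_kg / molar_mass_g_per_mol  # g -> mol

# Two hypothetical compounds with the same dose in mg/kg.
light = mg_per_kg_to_mol_per_kg(500.0, 200.0)  # molar mass 200 g/mol
heavy = mg_per_kg_to_mol_per_kg(500.0, 400.0)  # molar mass 400 g/mol
```

The same 500 mg/kg corresponds to twice as many moles for the lighter compound, so mixing mass-based and molar values in one dataset distorts the modelled property.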

Fourches et al. Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J. Chem. Inf. Model., 2010. doi: 10.1021/ci100176x

Preparation of Features


Lab Course

Python

Jupyter notebook

Scikit-learn
• Module for Python to perform machine learning
https://www.scikit-learn.org/

RDKit
• Module for Python (C++, Java) to do a variety of cheminformatics tasks
https://www.rdkit.org/

Further Reading & Learning

Textbooks on QSAR:
• A.R. Leach, V.J. Gillet: An Introduction to Chemoinformatics. Springer, 2003
• J. Gasteiger (Editor), T. Engel (Editor): Chemoinformatics: A Textbook. John Wiley & Sons, 2004

Machine learning:
• Machine Learning on Coursera: https://www.coursera.org/learn/machine-learning/
• Machine Learning Crash Course: https://developers.google.com/machine-learning/crash-course

Thank you for your attention!

Any questions?

Regularization

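Regularization is a way to avoid overfitting by penalizing high-valued regression coefficients.

L2 norm: the penalty is the sum of squared coefficients

L1 norm: the penalty is the sum of absolute values of the coefficients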
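Written out for linear regression, the two penalties can be added to the squared-error cost as follows. This is a sketch under assumptions: the 1/(2m) scaling, the symbol λ for the regularization strength and the exclusion of the intercept θ0 from the penalty are common conventions, not something recoverable from the slide.

```latex
% L2 norm (ridge): penalize the sum of squared coefficients
J_{L2}(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \bigl( h_\theta(x_i) - y_i \bigr)^2
                 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]

% L1 norm (lasso): penalize the sum of absolute values of the coefficients
J_{L1}(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \bigl( h_\theta(x_i) - y_i \bigr)^2
                 + \lambda \sum_{j=1}^{n} \lvert \theta_j \rvert \right]
```

A larger λ shrinks the coefficients more strongly. The L1 penalty tends to set some coefficients exactly to zero, which acts as a form of descriptor selection.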

