QSAR: Brief and Incomplete Introduction · 2020. 2. 4. · Dawn of QSAR •Crum-Brown and Fraser:...

QSAR: Brief and Incomplete Introduction Dr Olena Mokshyna


QSAR: Brief and Incomplete Introduction

Dr Olena Mokshyna


Shift of Paradigm

A. Agrawal and A. Choudhary, APL Mater. 4, 053208 (2016); https://www.bigmax.mpg.de


Why do we need modeling?

• Cut costs

• Increase speed

• Reduce human effort

• Reduce animal testing

• Find something new!


Dawn of QSAR

• Crum-Brown and Fraser: Φ = f(C) (physiological action Φ as a function of chemical constitution C)
• Meyer and Overton: suggested that the narcotic (depressant) action of a group of organic compounds paralleled their olive oil/water partition coefficients
• Hammett: “σ−ρ” culture
• Taft: separating polar, steric, and resonance effects
• 1962: Hansch et al. published their study on the structure-activity relationships of plant growth regulators and their dependency on Hammett constants and octanol-water partition coefficient
• Hansch and Fujita: QSAR paradigm
• Free-Wilson approach

Gramatica, Paola (2011). A Short History of QSAR Evolution.
A. Cherkasov et al. (J Med Chem, 2014). QSAR modeling: where have you been? Where are you going to? doi: 10.1021/jm4004285

QSAR = Quantitative Structure-Activity Relationship
QSPR = Quantitative Structure-Property Relationship


QSAR step-by-step: general workflow

Structures → Descriptors

Descriptors + Properties → Machine Learning algorithm → Model ready!


• Step 1: Describe the molecules


Descriptors

• Calculated or empirical (e.g. lipophilicity, shifts in NMR)
• Integral (MW, surface area) or fragment (fingerprints, simplexes)
• 1D, 2D, 3D or even 4D
• Also pharmacophore, physico-chemical, quantum and many, many more
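To make the integral/fragment distinction concrete, here is a minimal pure-Python sketch. The atom-list representation of ethanol, the rounded atomic masses, and the helper names `molecular_weight` and `bond_type_counts` are assumptions made for illustration; in practice a cheminformatics toolkit would compute descriptors from a proper structure file.

```python
from collections import Counter

# Rounded standard atomic masses (g/mol); illustrative subset only.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

# Hypothetical toy representation of ethanol (CH3-CH2-OH):
# element symbols per atom, and bonds as pairs of atom indices.
ethanol_atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
ethanol_bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]


def molecular_weight(atoms):
    """Integral descriptor: one number describing the whole molecule."""
    return sum(ATOMIC_MASS[a] for a in atoms)


def bond_type_counts(atoms, bonds):
    """Fragment descriptor: counts of two-atom fragments such as C-H or C-O."""
    return Counter("-".join(sorted((atoms[i], atoms[j]))) for i, j in bonds)


mw = molecular_weight(ethanol_atoms)
fragments = bond_type_counts(ethanol_atoms, ethanol_bonds)
```

The integral descriptor collapses the molecule into a single value, while the fragment descriptor yields a vector of counts that can serve directly as model features.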


Requirements Towards Descriptors

• Invariance to molecular numbering or labeling
• Invariance of a descriptor value to any translation or rotation of the molecule (for 3D)
• Clear structural interpretation
• Not based on experimental properties
• Able to discriminate among isomers
• Simple
• Smooth structure-property landscape
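The first two requirements can be checked numerically. The sketch below uses a toy 3D descriptor, the sorted list of interatomic distances, which is an assumption chosen for illustration and not a descriptor named on the slides; the coordinates are hypothetical.

```python
import math
from itertools import combinations


def sorted_distances(coords):
    """Toy 3D descriptor: the sorted list of all interatomic distances."""
    return sorted(math.dist(a, b) for a, b in combinations(coords, 2))


def rotate_z(point, angle):
    """Rotate a point about the z axis by the given angle (radians)."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


# Hypothetical coordinates (in angstroms) for a bent three-atom molecule.
molecule = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

renumbered = [molecule[2], molecule[0], molecule[1]]
translated = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in molecule]
rotated = [rotate_z(p, math.radians(40)) for p in molecule]

reference = sorted_distances(molecule)
for variant in (renumbered, translated, rotated):
    # The descriptor is unchanged by renumbering, translation and rotation.
    assert all(math.isclose(d, r, abs_tol=1e-9)
               for d, r in zip(sorted_distances(variant), reference))
```

Note the trade-off with the isomer requirement: distances are also unchanged by reflection, so this toy descriptor cannot tell mirror-image isomers apart.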


• Step 1: Describe the molecules
• Step 2: Choose machine learning algorithm


Exciting World of Machine Learning


Unsupervised learning: no labels
Supervised learning: with labels


Classification / Regression

Classification: Class 0 / Class 1
Regression: y = [0.1, 3.4, 0.11,…]

https://medium.com/free-code-camp/using-machine-learning-to-predict-the-quality-of-wines-9e2e13d7480d


Training a Regression Model

Simple case: linear regression with one variable

$y = h(x)$

$h(x) = \theta_0 + \theta_1 x$

Figure: the training data X and y are passed to the learning algorithm.


Training a Regression Model

Cost (or loss) function:

Task: optimize coefficients to minimize cost function

Use gradient descent

Optimization:

The size of each step is determined by the parameter α, which is called the learning rate
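As a minimal sketch of this optimization, the code below fits h(x) = θ0 + θ1·x by batch gradient descent. It assumes the usual mean-squared-error cost J = 1/(2m) · Σ (h(xᵢ) − yᵢ)², which may differ from the exact form on the slide; the toy data, the function names, and the default values of `alpha` and `n_steps` are illustrative choices.

```python
def cost(theta0, theta1, xs, ys):
    """Assumed cost: J = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, n_steps=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(n_steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dtheta1
        # Simultaneous update; alpha (the learning rate) sets the step size.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x, so the fit should recover (1, 2).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
```

If α is too large the updates overshoot and the cost grows; if it is too small, convergence takes many more steps.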


Training a Classification Model

$y \in \{0, 1\}$

Algorithm: logistic regression

$z = \theta_0 + \theta_1 x$

$h_\theta(x) = g(z(x))$

Sigmoid function:

$g(z) = \frac{1}{1 + e^{-z}}$
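A short sketch of the logistic regression hypothesis follows. The coefficient values and the 0.5 decision threshold are assumptions for illustration; in a real model the coefficients would be learned from data.

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, theta0, theta1):
    """h_theta(x) = g(theta0 + theta1 * x), read as P(y = 1 | x)."""
    return sigmoid(theta0 + theta1 * x)


def predict_class(x, theta0, theta1, threshold=0.5):
    """Turn the probability into a class label from {0, 1}."""
    return 1 if predict_proba(x, theta0, theta1) >= threshold else 0


# Hypothetical coefficients: the decision boundary lies at x = 2 (where z = 0).
theta0, theta1 = -4.0, 2.0
```

With these coefficients, inputs above x = 2 are assigned to class 1 and inputs below it to class 0.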


Support Vector Machines (SVM)

Random Forest (RF)

Gradient Boosting Machines (GBM)


SVM Classification

Optimal separation?

The largest possible margin = a greater chance of new data being classified correctly
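To illustrate what "largest margin" means, this sketch compares two separating lines on toy 2D data by their geometric margin, the distance from the line to the closest point. The data and the two candidate lines are hypothetical; an actual SVM solves an optimization problem to find the maximum-margin separator instead of comparing candidates.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance y * (w.x + b) / ||w|| over all points.

    The value is positive only if the line w.x + b = 0 separates the classes.
    """
    norm = math.hypot(*w)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm
               for x, y in zip(points, labels))


# Toy 2D data: class -1 on the left, class +1 on the right.
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 1.0), (5.0, 3.0)]
labels = [-1, -1, 1, 1]

# Two hypothetical separating lines: x = 3 (centered) and x = 1.5 (off-center).
centered = geometric_margin((1.0, 0.0), -3.0, points, labels)
off_center = geometric_margin((1.0, 0.0), -1.5, points, labels)
```

Both lines classify the training points correctly, but the centered one leaves more room on each side, so a new point near either cluster is less likely to fall on the wrong side.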


SVM


SVM: Kernel trick

w: weights
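The idea behind the kernel trick can be verified on a small case: a kernel evaluated in the original space gives the same number as a dot product in a higher-dimensional feature space, without ever building that space. The degree-2 polynomial kernel and the example vectors below are assumptions chosen for illustration.

```python
import math


def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2 for 2D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit 3D feature space whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


a, b = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly_kernel(a, b)                      # computed in 2D
explicit_value = dot(feature_map(a), feature_map(b))  # computed in 3D
```

Because the SVM only needs dot products between samples, replacing them with a kernel lets a linear separator in the feature space act as a non-linear one in the original space.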


RF

First let’s grow a tree

Breiman L (2001). "Random Forests". Machine Learning. 45 (1): 5–32.


RF

• Step 1: Select random samples from a dataset.
• Step 2: Construct a decision tree for every sample. Get the prediction from every decision tree.
• Step 3: Trees vote for every predicted result.
• Step 4: Select the most voted prediction.
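The four steps can be sketched in pure Python as below. This is a deliberately simplified illustration: each "tree" is a one-split stump on a single feature, and the random feature subsetting used by real random forests is omitted because there is only one feature. All names and the toy data are assumptions.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """One-split tree on a single feature: choose the threshold (and the
    left/right labels) that misclassifies the fewest training points."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def fit_forest(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample (draw n points with replacement).
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: build one tree per sample.
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict(trees, x):
    # Steps 3 and 4: every tree votes; the most common label wins.
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


xs = [0.5, 1.0, 1.5, 2.0, 6.0, 6.5, 7.0, 7.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(xs, ys)
```

Individual stumps trained on unlucky bootstrap samples can be wrong for some inputs; the vote across many trees is what makes the ensemble prediction stable.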


https://vas3k.com/blog/machine_learning/?ref=hn


GBM

• Add new models to the ensemble sequentially
• At each particular iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far

GBM vs Deep Learning: https://sefiks.com/2019/02/24/machine-learning-wars-deep-learning-vs-gbm/
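A minimal sketch of this sequential scheme for regression is shown below, assuming squared-error loss (so the "error of the ensemble" is just the residual) and one-split regression stumps as weak learners. The learning rate, number of rounds, function names, and toy data are illustrative assumptions.

```python
def fit_stump(xs, targets):
    """Regression stump: split at the threshold minimizing squared error,
    predicting the mean target on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.3):
    base = sum(ys) / len(ys)  # initial model: predict the mean
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
mean_only = fit_gbm(xs, ys, n_rounds=0)
boosted = fit_gbm(xs, ys, n_rounds=50)
```

Each stump on its own is a poor model; the ensemble improves because every new stump is fitted to whatever the previous ones still get wrong.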


https://vas3k.com/blog/machine_learning/?ref=hn


Hyperparameter Tuning

• Model parameters are learned during model training (e.g. weights in linear regression).

• Hyperparameters are all the parameters set by the user before training starts (e.g. number of estimators).

n_estimators=100, max_depth=10

n_estimators=100, max_depth=5

n_estimators=200, max_depth=50


Hyperparameter Tuning: Search

• Manual Search
• Grid Search
• Random Search
• Automated Hyperparameter Tuning (Bayesian Optimization, Genetic Algorithms)
• Artificial Neural Networks (ANNs) Tuning
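Grid search and random search differ only in how candidate settings are generated, as the sketch below shows. The `validation_score` function is a hypothetical stand-in for "train a model with these hyperparameters and score it on validation data"; the grid values and sampling ranges are likewise assumptions.

```python
import itertools
import random


def validation_score(n_estimators, max_depth):
    """Hypothetical stand-in for training and validating a model.
    By construction it peaks at n_estimators=200, max_depth=10."""
    return 1.0 - abs(n_estimators - 200) / 1000 - abs(max_depth - 10) / 100


grid = {"n_estimators": [100, 200, 400], "max_depth": [5, 10, 50]}


def grid_search(grid):
    """Try every combination in the grid and keep the best one."""
    names = list(grid)
    candidates = (dict(zip(names, values))
                  for values in itertools.product(*grid.values()))
    return max(candidates, key=lambda params: validation_score(**params))


def random_search(n_iter=20, seed=0):
    """Sample settings at random from ranges instead of a fixed grid."""
    rng = random.Random(seed)
    candidates = [{"n_estimators": rng.randint(50, 500),
                   "max_depth": rng.randint(2, 50)} for _ in range(n_iter)]
    return max(candidates, key=lambda params: validation_score(**params))


best_grid = grid_search(grid)
best_random = random_search()
```

Grid search cost grows multiplicatively with every added hyperparameter, whereas random search keeps a fixed budget of `n_iter` evaluations regardless of how many hyperparameters are tuned.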


Model is trained! Profit? Not so fast.


• Step 1: Describe the molecules
• Step 2: Choose machine learning algorithm
• Step 3: Validate your model


Model Validation


Train / Cross-validation / Test Sets

Data-driven design of metal–organic frameworks for wet flue gas CO2 capture. Boyd et al. Nature, 2019. doi:10.1038/s41586-019-1798-7

Training/test ~ 70%/30%

K-fold cross-validation
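The splitting scheme can be sketched on sample indices alone. The function names, the choice of k = 5, and the fixed random seed are assumptions for illustration; libraries such as scikit-learn provide their own utilities for the same purpose.

```python
import random


def train_test_split(n_samples, test_fraction=0.3, seed=0):
    """Shuffle sample indices and hold out a fraction as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each index is validated once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 20 samples: 70% for training (with 5-fold cross-validation), 30% for testing.
train_idx, test_idx = train_test_split(20)
splits = list(k_fold(train_idx, k=5))
```

The test set is set aside before cross-validation starts, so it plays no part in model selection and gives an unbiased final check.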


High Bias vs High Variance

Question: Which case is underfitting, which case is overfitting?


High Bias vs High Variance: Cure?

High variance (overfitting):
• Get more (diverse!) training examples
• Try fewer features
• Play with hyperparameters (e.g., decrease the depth of the trees in RF)

High bias (underfitting):
• Try more new features
• Play with feature engineering using old features
• Play with hyperparameters (e.g., increase the depth of the trees in RF)


Model Metrics: Regression
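Commonly used regression metrics include the root mean squared error (RMSE), the mean absolute error (MAE), and the coefficient of determination (R²). The sketch below assumes these three; the function names and toy values are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, less sensitive to large single errors than RMSE."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
```

For these values the residual sum of squares is 0.5 and the total sum of squares is 5.0, so R² is 0.9.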


Model Metrics: Classification

$\text{Accuracy} = \frac{TP + TN}{N}$

$\text{Specificity} = \frac{TN}{TN + FP}$

$\text{Sensitivity} = \frac{TP}{TP + FN}$

$\text{Balanced accuracy} = \frac{\text{Specificity} + \text{Sensitivity}}{2}$
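These four metrics can be computed directly from the confusion-matrix counts. The sketch below assumes binary labels coded as 0 and 1 with at least one sample of each class; the imbalanced toy data are chosen to show why balanced accuracy can differ from plain accuracy.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    # Assumes both classes occur in y_true; otherwise a division by zero occurs.
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Imbalanced toy data: 2 actives (class 1) among 10 compounds.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Here accuracy is 0.8 while balanced accuracy is only 0.6875, because half of the rare class is missed.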


• Step 0: Check and curate your data
• Step 1: Describe the molecules
• Step 2: Choose machine learning algorithm
• Step 3: Validate your model


Databases


Data Curation

• Errors in structures
• Errors in values
• Errors in units, e.g. mg/kg instead of mol/kg
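The unit issue matters because mass-based doses are not comparable between compounds of different molecular weight. A minimal conversion sketch follows; the function name and the two example compounds are hypothetical.

```python
def mg_per_kg_to_mol_per_kg(dose_mg_per_kg, molar_mass_g_per_mol):
    """Convert a mass-based dose (mg/kg) into a molar dose (mol/kg)."""
    grams_per_kg = dose_mg_per_kg / 1000.0
    return grams_per_kg / molar_mass_g_per_mol


# Two hypothetical compounds with the same dose in mg/kg
# but molar masses of 100 and 500 g/mol.
light = mg_per_kg_to_mol_per_kg(100.0, 100.0)
heavy = mg_per_kg_to_mol_per_kg(100.0, 500.0)
```

The same 100 mg/kg corresponds to five times fewer molecules of the heavier compound, so mixing the two unit systems in one dataset distorts the modeled property.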

Fourches et al. Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J. Chem. Inf. Model., 2010. doi: 10.1021/ci100176x


Preparation of Features


Lab Course

Scikit-learn
• Module for Python to perform machine learning
https://www.scikit-learn.org/

RDKit
• Module for Python (C++, Java) to do a variety of cheminformatics tasks
https://www.rdkit.org/

Jupyter notebook

Python


Further Reading & Learning

Textbooks on QSAR:
• A.R. Leach, V.J. Gillet: An Introduction to Chemoinformatics. Springer, 2003
• Gasteiger J. (Editor), Engel T. (Editor): Chemoinformatics: A Textbook. John Wiley & Sons, 2004

Machine learning:
• Machine Learning on Coursera: https://www.coursera.org/learn/machine-learning/
• Machine Learning Crash Course: https://developers.google.com/machine-learning/crash-course


Thank you for your attention!


Any questions?


Regularization

Regularization is a way to avoid overfitting by penalizing high-valued regression coefficients.

L2 norm:

L1 norm:
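As a hedged sketch of how the two penalties enter the cost function, the code below adds λ times the L2 penalty (sum of squared coefficients, as in ridge regression) or the L1 penalty (sum of absolute coefficients, as in lasso) to a mean-squared-error cost. The 1/(2m) scaling, the choice not to penalize the intercept θ0, and all names and values are assumptions; conventions vary between textbooks and libraries.

```python
def predict(theta, row):
    """Linear model: theta[0] is the intercept, theta[1:] are the weights."""
    return theta[0] + sum(t * x for t, x in zip(theta[1:], row))


def mse_cost(theta, X, y):
    """Unregularized cost: 1/(2m) * sum of squared errors."""
    m = len(y)
    return sum((predict(theta, row) - t) ** 2 for row, t in zip(X, y)) / (2 * m)


def l2_penalty(theta):
    """Sum of squared weights (intercept excluded by assumption)."""
    return sum(t ** 2 for t in theta[1:])


def l1_penalty(theta):
    """Sum of absolute weights (intercept excluded by assumption)."""
    return sum(abs(t) for t in theta[1:])


def regularized_cost(theta, X, y, lam, penalty):
    """Cost plus lam times the chosen penalty on the coefficients."""
    return mse_cost(theta, X, y) + lam * penalty(theta)


# Toy example: these coefficients fit the data exactly but are large.
theta = [1.0, 3.0, -4.0]
X = [[1.0, 0.0], [0.0, 1.0]]
y = [4.0, -3.0]
ridge_cost = regularized_cost(theta, X, y, lam=0.1, penalty=l2_penalty)
lasso_cost = regularized_cost(theta, X, y, lam=0.1, penalty=l1_penalty)
```

Even with zero fitting error the regularized cost is non-zero, so the optimizer is pushed toward smaller coefficients; the L1 penalty additionally tends to drive some coefficients exactly to zero.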
