OPSIN: Taming the jungle of IUPAC chemical nomenclature

Daniel Lowe, Peter Murray-Rust, Robert C Glen. 8th September 2013, Indianapolis, ACS.

4-[(19S,21R,26R,27S)-19,21-dihydroxy-27-methoxy-26-methylnonacosyl]phenyl 3,6-di-O-methyl-α-D-glucopyranosyl-(1→4)-2,3-di-O-methyl-α-L-rhamnopyranosyl-(1→2)-3-O-methyl-α-L-rhamnopyranoside

description

OPSIN (Open Parser for Systematic IUPAC Nomenclature) is an open-source, freely available program for converting chemical names, especially those that are systematic in nature, to chemical structures. The software is available as a Java library, a command-line interface and a web service (opsin.ch.cam.ac.uk). OPSIN accepts names that conform to either IUPAC or CAS nomenclature and can convert them to SMILES, InChI and CML (Chemical Markup Language). OPSIN has grown from covering only simple general organic chemical nomenclature to the point of having competent coverage of all areas of organic chemical nomenclature. One of the most recent additions is comprehensive support for the nomenclature of carbohydrates. This brings support for dialdoses, diketoses, ketoaldoses, alditols, aldonic acids, uronic acids, aldaric acids, glycosides and oligosaccharides, in both the open-chain and cyclic forms, named systematically or from trivial sugar stems, with support for modification terms such as anhydro or deoxy. OPSIN's support for specialised and general organic nomenclature will be demonstrated through illustrative examples and accompanying performance metrics. We focus in particular on areas of nomenclature for which support was recently added and on those that are complex to implement, such as fused ring nomenclature.

Transcript of OPSIN: Taming the jungle of IUPAC chemical nomenclature

Page 1: OPSIN: Taming the jungle of IUPAC chemical nomenclature

1

OPSIN: Taming the jungle of IUPAC chemical nomenclature

Daniel Lowe, Peter Murray-Rust, Robert C Glen

8th September 2013, Indianapolis, ACS

4-[(19S,21R,26R,27S)-19,21-dihydroxy-27-methoxy-26-methylnonacosyl]phenyl 3,6-di-O-methyl-α-D-glucopyranosyl-(1→4)-2,3-di-O-methyl-α-L-rhamnopyranosyl-(1→2)-3-O-methyl-α-L-rhamnopyranoside

Page 2: OPSIN: Taming the jungle of IUPAC chemical nomenclature

2

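What is chemical name to structure?

(2S)-2-Aminobutan-1-ol

• (2S)- : stereochemistry

• 2- : locant

• Amino : substituent

• but : alk

• an : unsaturation

• -1- : locant

• ol : suffix

[Figure: the corresponding structure, with the chain atoms numbered 1 to 4 and an NH2 group]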

Page 3: OPSIN: Taming the jungle of IUPAC chemical nomenclature

3

Uses of chemical name to structure

• Identify documents by their chemical structures

• Assist with structure viewing

• Identify incorrect chemical names

• Extract reagent structures, hence allowing reactions to be reconstructed from text

Page 4: OPSIN: Taming the jungle of IUPAC chemical nomenclature

4

Page 5: OPSIN: Taming the jungle of IUPAC chemical nomenclature

5

Parsing

• Over 4000 discrete morphemes form the program’s vocabulary

(a morpheme is the smallest section of a word with meaning)

• These are grouped into 140 classes e.g.

• unsaturator (‘ene’)

• aminoAcidEndsInIne (‘tyros’)

• simpleSubstituent (‘amino’)
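
As a rough illustration of the idea above, the sketch below splits a name into morphemes and reports the class of each one. It uses greedy longest-match against a five-entry toy vocabulary, which is a simplification chosen for brevity and not a description of how OPSIN itself parses. The class name `MorphemeSketch` and the labels `alkaneStem` and `suffix` are assumptions made for this example; only `unsaturator`, `aminoAcidEndsInIne` and `simpleSubstituent` come from the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MorphemeSketch {
    // Toy vocabulary mapping each morpheme to its class.
    static final Map<String, String> VOCAB = new LinkedHashMap<>();
    static {
        VOCAB.put("amino", "simpleSubstituent");
        VOCAB.put("ene", "unsaturator");
        VOCAB.put("tyros", "aminoAcidEndsInIne");
        VOCAB.put("but", "alkaneStem"); // hypothetical class label
        VOCAB.put("ine", "suffix");     // hypothetical class label
    }

    /** Splits a lower-case name into "morpheme:class" tokens by greedy longest match. */
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < name.length()) {
            String best = null;
            for (String morpheme : VOCAB.keySet()) {
                boolean matches = name.startsWith(morpheme, pos);
                if (matches && (best == null || morpheme.length() > best.length())) {
                    best = morpheme;
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("No morpheme matches at index " + pos);
            }
            tokens.add(best + ":" + VOCAB.get(best));
            pos += best.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("aminobutene"));
        System.out.println(tokenize("tyrosine"));
    }
}
```

A greedy split can commit to the wrong morpheme when one vocabulary entry is a prefix of another, so a real parser has to consider alternative splits; the sketch ignores that.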

Page 6: OPSIN: Taming the jungle of IUPAC chemical nomenclature

6

Word Rule                 | Example
acetal                    | Propanal dimethyl acetal
additionCompound          | Carbon tetrachloride
acidHalideOrPseudoHalide  | Cyanic chloride
amide                     | Nitrous amide
anhydride                 | Acetic anhydride
biochemicalEster          | Adenosine 5'-triphosphate
carbonylDerivative        | Propanone oxime
divalentFunctionalGroup   | Diethyl ether
ester                     | Ethyl ethanoate
functionalClassEster      | Acetic acid ethyl ester
functionGroupAsGroup      | Cyanide
glycol                    | Ethylene glycol
glycolEther               | Ethylene glycol monomethyl ether
hydrazide                 | Phosphoric hydrazide
monovalentFunctionalGroup | Ethyl alcohol
multiEster                | Ethyl propyl methylphosphonate
oxide                     | Thiophene 1,1-dioxide
polymer                   | Poly(ethylene)
simple                    | Ethylbenzene
substituent               | Chloro

Page 7: OPSIN: Taming the jungle of IUPAC chemical nomenclature

7

Supported chain nomenclature

Alkanes

dodectetractkiliane

Heteroatom hydrides

pentaphosphane

Heterogeneous heteroatom hydrides

disilazane

Trivial acids

butyric acid

Page 8: OPSIN: Taming the jungle of IUPAC chemical nomenclature

8

Supported ring nomenclature

Monocyclic spiro

dispiro[4.2.4.2]tetradecane

Hantzsch-Widman

1,3,5-triazine

Fused ring

furo[3,2-b]thieno[2,3-e]pyridine

Ring assembly

2,2':6',2''-terpyridyl

Von Baeyer

tricyclo[2.2.1.1(2,5)]octane

Polycyclic spiro

spiro[piperidine-4,9'-xanthene]

Page 9: OPSIN: Taming the jungle of IUPAC chemical nomenclature

9

Structural assembly nomenclature

Conjunctive nomenclature

benzeneethanol

Substitutive nomenclature

2,4,6-trinitrotoluene

Additive nomenclature

methylsulfonyl

Multiplicative nomenclature

4,4'-methylenedioxydibenzoic acid

Functional class nomenclature

ethyl alcohol

Page 10: OPSIN: Taming the jungle of IUPAC chemical nomenclature

10

Structural modifications

Heteroatom replacement

1-thia-4-aza-2,6-disilacyclohexane

Unsaturation

hexa-1,3-dien-5-yne

Hydro, dehydro, indicated hydrogen and added hydrogen

2,7-dihydro-1H-azepine

Functional replacement

1-chloro-2,4-diimidotricarbonic acid

Suffixes including infixed suffixes

methanedithioic acid

Lambda convention

2λ⁶-trisulfane

Page 11: OPSIN: Taming the jungle of IUPAC chemical nomenclature

11

Bridges and stereochemistry

Bridges

4a,8a-propanoquinoline

E/Z stereochemistry

(Z)-2-chloro-but-2-ene

Relative cis/trans stereochemistry

trans-2,6-dimethyl-2,6-dihydronaphthalene

R/S stereochemistry

(1R,3S)-3-amino-3-methylcyclohexanol

Page 12: OPSIN: Taming the jungle of IUPAC chemical nomenclature

12

Miscellaneous nomenclature

Groups with indeterminately positioned structural features

1,3-xylene

Charge and oxidation numbers

methylmercury(1+) or methylmercury(II)

“per-nomenclature”

perhydroanthracene

perchlorobenzene

Subtractive nomenclature

2-deoxy-ᴅ-ribose

Page 13: OPSIN: Taming the jungle of IUPAC chemical nomenclature

13

Polymer nomenclature

Structure-based polymer nomenclature

poly[(benzo[1,2-d:4,5-d']bis[1,3]thiazole-2,6-diyl)-1,4-phenyleneoxy-1,3-phenylene(1,3,5,7-tetraoxo-1,2,3,5,6,7-hexahydrobenzo[1,2-c:4,5-c']dipyrrole-2,6-diyl)-1,3-phenyleneoxy-1,4-phenylene]

Page 14: OPSIN: Taming the jungle of IUPAC chemical nomenclature

14

Domain specific nomenclature

Steroid nomenclature

17β-Hydroxy-8α,9β,10α-androst-4-en-3-one

Amino acid

ʟ-leucinamide

Cyclic peptide

cyclo(ᴅ-alanyl-ʟ-phenylalanyl)

Oligopeptide

ʟ-arginyl-O-phosphono-ʟ-seryl-ʟ-alanyl-ʟ-proline

Nucleotide nomenclature

guanylyl(3'-5')uridine 3'-monophosphate

Page 15: OPSIN: Taming the jungle of IUPAC chemical nomenclature

15

Carbohydrate nomenclature (acyclic)

ᴅ-gluco-hexose or ᴅ-glucose (preferred)

ʟ-ribo-ᴅ-manno-nonose

• Carbohydrates are defined using configurational prefixes that each specify the stereochemistry for between 1 and 4 stereocentres
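
To make the bullet above concrete, here is a small sketch that adds up how many stereocentres the configurational prefixes in a name define. The prefix table follows standard carbohydrate nomenclature (glycero defines 1 centre; erythro and threo 2; arabino, lyxo, ribo and xylo 3; allo, altro, galacto, gluco, gulo, ido, manno and talo 4). The class and method names are assumptions made for this example and are not OPSIN's API. ASCII D and L are used in place of the small capitals.

```java
import java.util.HashMap;
import java.util.Map;

final class ConfigPrefixSketch {
    // Number of stereocentres defined by each configurational prefix.
    static final Map<String, Integer> CENTRES = new HashMap<>();
    static {
        CENTRES.put("glycero", 1);
        for (String p : new String[] {"erythro", "threo"}) CENTRES.put(p, 2);
        for (String p : new String[] {"arabino", "lyxo", "ribo", "xylo"}) CENTRES.put(p, 3);
        for (String p : new String[] {"allo", "altro", "galacto", "gluco",
                                      "gulo", "ido", "manno", "talo"}) CENTRES.put(p, 4);
    }

    /** Sums the stereocentres defined by every configurational prefix in the name. */
    static int definedStereocentres(String name) {
        int total = 0;
        for (String part : name.split("-")) {
            Integer n = CENTRES.get(part);
            if (n != null) total += n; // parts such as "D", "L" or the stem are skipped
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(definedStereocentres("D-gluco-hexose"));
        System.out.println(definedStereocentres("L-ribo-D-manno-nonose"));
    }
}
```

An open-chain aldose with n carbons has n - 2 stereocentres, so a nonose needs 7; that is why the second example combines a three-centre prefix (ribo) with a four-centre prefix (manno).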

Page 16: OPSIN: Taming the jungle of IUPAC chemical nomenclature

16

Carbohydrate derivatives

• These carbohydrate chains can then be algorithmically modified by suffixes

ᴅ-glucose

ᴅ-glucitol

ᴅ-glucaric acid

ᴅ-gluconic acid
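
The suffix-driven modification above can be sketched as a lookup from suffix to the groups at the two ends of the open chain. The end groups follow standard carbohydrate nomenclature: an aldose has CHO at C1 and CH2OH at the far end, an alditol has CH2OH at both ends, an aldonic acid has COOH at C1, a uronic acid has COOH at the far end, and an aldaric acid has COOH at both ends. The class `SugarSuffixSketch` and its output format are assumptions made for this example, not OPSIN's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SugarSuffixSketch {
    // Suffix -> {group at C1, group at the last carbon}.
    // "uronic acid" is listed before "onic acid" because the latter is its tail.
    static final Map<String, String[]> ENDS = new LinkedHashMap<>();
    static {
        ENDS.put("uronic acid", new String[] {"CHO", "COOH"});
        ENDS.put("onic acid", new String[] {"COOH", "CH2OH"});
        ENDS.put("aric acid", new String[] {"COOH", "COOH"});
        ENDS.put("itol", new String[] {"CH2OH", "CH2OH"});
        ENDS.put("ose", new String[] {"CHO", "CH2OH"});
    }

    /** Describes the two chain ends implied by the suffix of an open-chain sugar name. */
    static String chainEnds(String name) {
        for (Map.Entry<String, String[]> entry : ENDS.entrySet()) {
            if (name.endsWith(entry.getKey())) {
                return "C1=" + entry.getValue()[0] + ", last=" + entry.getValue()[1];
            }
        }
        throw new IllegalArgumentException("Unrecognised suffix in: " + name);
    }

    public static void main(String[] args) {
        for (String n : new String[] {"D-glucose", "D-glucitol",
                                      "D-gluconic acid", "D-glucaric acid"}) {
            System.out.println(n + " -> " + chainEnds(n));
        }
    }
}
```

The stereocentres in between are untouched by the suffix, which is what allows the same configurational prefix to be reused across the whole family of derivatives.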

Page 17: OPSIN: Taming the jungle of IUPAC chemical nomenclature

17

Carbohydrate nomenclature (cyclic)

α-ᴅ-glucopyranose

2,7-anhydro-D-glycero-β-D-galacto-oct-2-ulopyranosonic acid

ᴅ-glucose

Page 18: OPSIN: Taming the jungle of IUPAC chemical nomenclature

18

Carbohydrate nomenclature (oligosaccharides)

β-ᴅ-Fructofuranosyl α-ᴅ-glucopyranoside

β-ᴅ-glucopyranosyl-(1→3)-β-ᴅ-glucopyranosyl-(1→3)-ᴅ-glucopyranose

Page 19: OPSIN: Taming the jungle of IUPAC chemical nomenclature

19

Fused ring nomenclature

• All fused ring nomenclature is processed algorithmically: even benzofuran is constructed from benzene and furan rather than being treated as a trivial name

• For example: benzo[b]cycloocta[jk]fluorene

[Figure: ring diagram of benzo[b]cycloocta[jk]fluorene, with rings labelled by size: 8, 6, 6, 6 and 5]

Page 20: OPSIN: Taming the jungle of IUPAC chemical nomenclature

20

Fused ring nomenclature (numbering)

• Transform to an idealised grid aligned along the longest row of rings

• Apply quadrant rules e.g. favour most rings in upper right quadrant

[Figure: candidate grid layouts of the same ring system, with rings labelled by size (8, 6, 6, 6 and 5)]
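
The quadrant preference above can be sketched as a comparison between mirrored copies of one grid layout. This is a minimal sketch under stated assumptions: ring centres are given as integer grid coordinates with the origin at the middle of the longest row, only the four mirror images are tried, and a ring counts only if it lies strictly inside the upper right quadrant. The class `QuadrantRuleSketch` and the example layout are hypothetical and do not reproduce OPSIN's code or the molecule on the slide.

```java
import java.util.ArrayList;
import java.util.List;

final class QuadrantRuleSketch {
    /** Centre of one ring on the idealised grid; (0, 0) is the middle of the longest row. */
    static final class Ring {
        final int x;
        final int y;
        Ring(int x, int y) { this.x = x; this.y = y; }
    }

    /** Counts ring centres lying strictly inside the upper right quadrant. */
    static int upperRightCount(List<Ring> rings) {
        int count = 0;
        for (Ring r : rings) {
            if (r.x > 0 && r.y > 0) count++;
        }
        return count;
    }

    /** Mirrors the layout: sx = -1 flips left/right, sy = -1 flips up/down. */
    static List<Ring> mirror(List<Ring> rings, int sx, int sy) {
        List<Ring> out = new ArrayList<>();
        for (Ring r : rings) out.add(new Ring(sx * r.x, sy * r.y));
        return out;
    }

    /** Returns the mirror image with the most rings in the upper right quadrant. */
    static List<Ring> bestOrientation(List<Ring> rings) {
        int[][] flips = {{1, 1}, {-1, 1}, {1, -1}, {-1, -1}};
        List<Ring> best = null;
        for (int[] f : flips) {
            List<Ring> candidate = mirror(rings, f[0], f[1]);
            if (best == null || upperRightCount(candidate) > upperRightCount(best)) {
                best = candidate;
            }
        }
        return best;
    }

    /** Hypothetical layout: three rings in a row plus one ring below and to the left. */
    static List<Ring> exampleLayout() {
        List<Ring> layout = new ArrayList<>();
        layout.add(new Ring(-2, 0));
        layout.add(new Ring(0, 0));
        layout.add(new Ring(2, 0));
        layout.add(new Ring(-1, -1));
        return layout;
    }

    public static void main(String[] args) {
        List<Ring> best = bestOrientation(exampleLayout());
        System.out.println("Rings in upper right quadrant: " + upperRightCount(best));
    }
}
```

Mirroring keeps the longest row horizontal, so the first criterion is preserved while the second is optimised. Ties between layouts are left unresolved here.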

Page 21: OPSIN: Taming the jungle of IUPAC chemical nomenclature

21

Fused ring nomenclature (numbering)

• Atoms numbered in ascending order from upper rightmost ring

• Peripheral numbering rules are used to choose the grid layout that gives the best numbering

[Figure: the chosen grid layout, with rings labelled by size (6, 6, 6, 5 and 8)]

Page 22: OPSIN: Taming the jungle of IUPAC chemical nomenclature

22

Beyond IUPAC: CAS index name un-inversion

CAS Index Name                            | IUPAC name
benzene, ethyl-                           | ethylbenzene
Disulfide, bis(2-chloroethyl)             | Bis(2-chloroethyl) disulfide
Benzoic acid, 4,4'-methylenebis[2-chloro- | 4,4'-Methylenebis[2-chlorobenzoic acid]
Phosphoric acid, ethyl dimethyl ester     | ethyl dimethyl phosphate
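
The first three rows above follow a pattern simple enough to sketch: split at the first comma followed by a space, move the inverted part to the front, join it directly to the parent if it ends in a hyphen (otherwise with a space), and close any brackets left open. This is an illustrative simplification under those assumptions, not OPSIN's algorithm. The class `CasUninversionSketch` is hypothetical, and the fourth row, where an acid with an ester modifier becomes "phosphate", is deliberately not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class CasUninversionSketch {
    /** Un-inverts a simple "parent, modifier" CAS index name. */
    static String uninvert(String casName) {
        int split = casName.indexOf(", ");
        if (split <= 0) return casName; // nothing to un-invert
        String parent = casName.substring(0, split);
        parent = Character.toLowerCase(parent.charAt(0)) + parent.substring(1);
        String modifier = casName.substring(split + 2).trim();
        String joined;
        if (modifier.endsWith("-")) {
            // A trailing hyphen marks a substituent prefix: attach it directly.
            joined = modifier.substring(0, modifier.length() - 1) + parent;
        } else {
            // Otherwise treat it as a separate word, as in functional class names.
            joined = modifier + " " + parent;
        }
        return closeBrackets(joined);
    }

    /** Appends closing brackets for any '(' or '[' that was left open. */
    static String closeBrackets(String s) {
        Deque<Character> open = new ArrayDeque<>();
        for (char c : s.toCharArray()) {
            if (c == '(' || c == '[') {
                open.push(c);
            } else if ((c == ')' || c == ']') && !open.isEmpty()) {
                open.pop();
            }
        }
        StringBuilder out = new StringBuilder(s);
        while (!open.isEmpty()) {
            out.append(open.pop() == '(' ? ')' : ']');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(uninvert("benzene, ethyl-"));
        System.out.println(uninvert("Disulfide, bis(2-chloroethyl)"));
        System.out.println(uninvert("Benzoic acid, 4,4'-methylenebis[2-chloro-"));
    }
}
```

Splitting on a comma followed by a space is what keeps locant lists such as 4,4' intact, since their commas have no space after them.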

Page 23: OPSIN: Taming the jungle of IUPAC chemical nomenclature

23

Beyond IUPAC: Correcting missing spaces

tert-butylacetate → tert-butyl acetate

tert-butyl-4-vinylperbenzoate

No locant, and perbenzoate has more than one non-degenerate hydrogen

diethylcarbonate

Has no substitutable hydrogen

Ethylacetate

non-ester would be butanoate or butyrate!

Page 24: OPSIN: Taming the jungle of IUPAC chemical nomenclature

24

Performance on machine-generated names

[Figure: chart with a 0% to 100% axis showing OPSIN's results on names generated by ACD/Name 12.02, ChemBioDraw 13, Lexichem 2.1 and Marvin 6.0.2. Result categories: No Result, Constitutional Discrepancy, Stereochemical Discrepancy, Correctly Interpreted]

30,000 structures randomly selected from PubChem were used as input to machine-generate names

Page 25: OPSIN: Taming the jungle of IUPAC chemical nomenclature

25

Performance on unique names from US patent headings

Page 26: OPSIN: Taming the jungle of IUPAC chemical nomenclature

26

What’s not supported

• Parsing of generic chemical names

• E.g. 2- or 3-alkyl-substituted benzofurans

• Advanced inorganic nomenclature e.g. coordinate bonding

• Some natural product nomenclature

• Advanced stereochemistry e.g. pseudoasymmetric stereocentres, axial stereochemistry etc.

Page 27: OPSIN: Taming the jungle of IUPAC chemical nomenclature

27

Usage

Java API

    import uk.ac.cam.ch.wwmm.opsin.NameToStructure;

    // Obtain the parser instance, then convert a name to SMILES
    NameToStructure nts = NameToStructure.getInstance();
    String chemicalName = "acetonitrile";
    String smiles = nts.parseToSmiles(chemicalName);

Batch conversion on the command line

    java -jar opsin-1.5.0-jar-with-dependencies.jar -osmi input.txt output.smi

RESTful web service

(opsin.ch.cam.ac.uk)

Page 28: OPSIN: Taming the jungle of IUPAC chemical nomenclature

28

Who is using OPSIN?

• Commercial software

• Cinfony (interface to Python)

• Many text mining efforts

• Workflows

• Web services

Page 29: OPSIN: Taming the jungle of IUPAC chemical nomenclature

29

Conclusions

• OPSIN combines high recall, precision and speed of execution

• Recent work has significantly improved coverage of biochemical nomenclature

Visit opsin.ch.cam.ac.uk to try it out and download!

Page 30: OPSIN: Taming the jungle of IUPAC chemical nomenclature

30

OPSIN: Taming the jungle of IUPAC chemical nomenclature

For more information see:

Chemical Name to Structure: OPSIN, an Open Source Solution, J. Chem. Inf. Model., 2011, 51 (3), pp 739–753

Extraction of chemical structures and reactions from the literature (https://www.repository.cam.ac.uk/handle/1810/244727)

Acknowledgements

Albina Asadulina

Rich Apodaca

Peter Corbett

Roger Sayle

Funding