Topics in bioinformatics CS697

Topics in bioinformaticsCS697

Spring 2011Class 12 – Mar-22-2011

Molecular distance measurementsMolecular transformations

Rotation in 3D - matrices

Rotations in 3D – Euler angles

* Rotate the XYZ-system about the Z-axis by α. The X-axis now lies on the line of nodes.

* Rotate the XYZ-system again about the now rotated X-axis by β. The Z-axis is now in its final orientation, and the x-axis remains on the line of nodes.

* Rotate the XYZ-system a third time about the new Z-axis by γ.

axis-angle

Quaternions

Extension of complex numbers.

h=a+bi+cj+dk, a,b,c,d real numbers.

i,j,k : imaginary components s.t.:

i2=j2=k2=-1

ij=k, jk=i, ki=j

ij = -ji, jk = -kj, ki = -ik

Unit quaternions

• Unit quaternions :

• Group under quaternion multiplication.

• can be mapped to:

Quaternions as axis-angle rotation

Some resources

http://mathworld.wolfram.com/RotationMatrix.html

http://mathworld.wolfram.com/EulerAngles.html

http://mathworld.wolfram.com/Quaternion.html

http://mathworld.wolfram.com/RotationMatrix.html

http://mathworld.wolfram.com/EulerAngles.html

http://mathworld.wolfram.com/Quaternion.html

Molecular distance measurements

Measuring protein structure similarity

• Visual comparison• Dihedral angle comparison• Distance matrix• RMSD (root mean square distance)

Given two “shapes” or structures A and B, we are interested in defining a distance, or similarity measure between A and B.

Is the resulting distance (similarity measure) D a metric?

D(A,B) ≤ D(A,C) + D(C,B)

Comparing dihedral angles

Torsion angles () are:- local by nature- invariant upon rotation and translation of the molecule- compact (O(n) angles for a protein of n residues)

Add 1 degreeTo all

But…

Internal distance matrix

1

2

3

4

6.0

8.1

5.91 2 3 4

1 0 3.8 6.0 8.1

2 3.8 0 3.8 5.9

3 6.0 3.8 0 3.8

4 8.1 5.9 3.8 0

Internal distance matrix (2)

• Advantages- invariant with respect to rotation and translation- can be used to compare proteins

• Disadvantages- the distance matrix is O(n2) for a protein with n

residues- comparing distance matrix is a hard problem- insensitive to chirality

Quality Assessment through RMSD

▆ RMSD: Root Mean Squared Deviation

• one of the simplest measures to quantify how different two protein conformations really are

• advantage: simple to compute by representing conformations as 3N vectors (N atoms)

• limitations: 1) limited to conformations of the same protein chain

2) atom-atom correspondence needed on different-length chains

3) not very descriptive if changes are localized

▆ RMSD: Average atomic distance

• given two conformations of a chain of N atoms

• represent the conformations as two 3N vectors x and y

• RMSD(x,y) is the euclidean distance between x and y, averaged over the N atoms

▆ lRMSD: least RMSD

• same conformation rigidly transformed in space (translated or rotated) should give an RMSD of 0

• before computing RMSD(x,y) one needs to remove changes due to rigid-body transformations

Protein Structure Superposition

A rigid-body transformation T is a combination of a translation t and a rotation R: T(x) = Rx+t

The quantity to be minimized is:

Where a and b are the two point sets.

∑i=1

N

∥Rai−bi +t∥2

The translation part

∂ ε∂t

=2∑i= 1

N

Rai−bi +t =0E is minimum with respectto t when:

Then: t=−R ∑i=1

N

a i ∑i=1

N

bi

If both data sets A and B have been centered on 0, then t = 0 !

Step 1: Translate point sets A and B such that their centroids coincide at the origin of the framework

The rotation part

Let A and B be the centroids of A' and B', and A and B the matrices containing the coordinates of the points of A and B centered on 0:

μA=1N

∑i=1

N

ai

μ B=1N

∑i=1

N

b i

A= [ a1− μ A a2− μ A . . . aN− μ A ]B= [b1−μ B b2− μ B .. . b N− μ B ]

Build covariance matrix: C=ABT

x = 3x3

3xNNx3

The rotation part

U and V are orthogonal matrices, and D is a diagonal matrix containing the singular values.U, V and D are 3x3 matrices

C=UDVT

Compute SVD (Singular Value Decomposition) of C:

Define S by:

S= { I if det C 0diag {1,1,−1 } otherwise }

Then

R=USV T

2. Build covariance matrix:

The algorithm

C=ABT

C=UDVT

S= { I if det C 0diag {1,1,−1 } otherwise }

1. Center the two point sets A and B

3. Compute SVD (Singular ValueDecomposition) of C:

5. Compute rotation matrix

4. Define S:

R=USV T

O(N) in time!

6. Compute lRMSD:

lRMSD= ∑i=1

N

a' i2∑

i=1

N

b' i2−2∑

i=1

3

d i s i

N

Some reading material

http://cnx.org/content/m11608/latest/#RMSD

http://cnx.org/content/m11608/latest/#RMSD

lRMSD has Shortcomings

▆ lRMSD cannot capture localized changes: if a small perturbation occurs in a part of the structure, e.g. rotation of a hinge connecting two domains, lRMSD will report a large value

▆Main reason: lRMSD does not know how to attribute changes to specific atoms of the chain

▆ lRMSD distributes change equally (through the averaging) to all atoms in a protein chain

▆Measuring conformational similarity is an active research area

Other Quality Assessment: Shape Similarity

▆ Sometimes assessment of cavities on the surface of a protein is more important than description of the rest of the structure, especially when the goal is prediction of a binding site rather than of the entire structure (which can be thought of as a scaffold)

▆ Methods that assess surface area, solvent accessible surface area, that compute volumes, and detect cavities on proteins are very important in the context of binding and docking

Model each atom as a vdw sphere, the union of which gives the molecular surface

Not all molecular surface is accessible to solvent. Rolling a solvent ball over the vdw spheres traces out the solvent accessible surface area (SASA) . SASA is important to quantitatively determine interactions of the protein

Solvent-accessible Solvent Area (SASA)▆ Computational geometry methods that use Delaunay triangulations and alpha shapes assess SASA and

other geometric descriptors of molecular surfaces, volumes, and cavities

▆ We will come back to this topic in the context of molecular docking – further reading about shape computing at http://cnx.org/content/m11616/latest/

SASA for a 1.4 Å ball SASA for a 1.5 Å ball. Increasing the radius reduces the SASA due to more cavities that a bulkier ball cannot penetrate

Ultrafast Shape Recognition (USR)

Drug design – Screening a number of potential compounds.

Find a set of molecules which closely resemble a lead molecule from a HUGE database.

Shape similarity may indicate similar binding properties and similar activity.

Overview

Efficient global comparison of molecular shapes.

The molecules are represented as feature vectors, representing the relative positions of the atoms.

Does not require alignment of the molecules.

Suitable for large database search.

Feature vector representation

The shape of a molecule is uniquely determined by the relative positions of the atoms.

… Which are determined by the inter-atomic distances.

The set of distances can be constrained due to forces that hold the atoms together.

Strategic feature points

The molecule is described as 4 sets of atomic distance distributions from feature points:Center of mass - ctd

Point closest to ctd - cst

Point farthest from ctd – fct

Point farthest from fct – ftf

The moments of the distributions are calculated and stored as a feature vector.

Estimate of the size, compactness and symmetry of the molecule.

Feature vector for a molecule

Similarity score

Advantages and disadvantages

Extremely fast due to calculation of only 4N distances and distributions.

Very sensitive to small changes in the molecule shape.

Does not directly account for chemical interactions and atom types.

Quality Assessment through LGA

▆ Local-Global-Alignment (LGA) introduced by Adam Zemla in 2003 is being used as a more accurate similarity assessment than lRMSD in CASP

▆ LGA generates many different local superpositions to find regions where two conformations are similar: combines longest continuous segment (LCS) and global distance test (GDT) to find local and global similarities

▆ LCS superimposes the longest segments that fit under a selected RMSD cutoff

▆ GDT complements evaluations made with LCS searching for the largest (not necessary continuous) set of ‘equivalent’ residues that deviate by no more than a specified distance cutoff

▆ Further reading: Zemla A., "LGA - a Method for Finding 3D Similarities in Protein Structures", Nucleic Acids Research, 2003, Vol. 31, No. 13, pp. 3370-3374 http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=12824330

Topics in bioinformatics CS697

Documents

Transcript of Topics in bioinformatics CS697