SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

16
SHEEP: A TOOL FOR DESCRIPTION OF ¯-SHEETS IN PROTEIN 3D STRUCTURES EVGENIY AKSIANOV * ,§ and ANDREI ALEXEEVSKI ,,* Belozersky Institute of Physico-Chemical Biology Moscow State University Leninskie Gory 1-40, Moscow 119991, Russia Department of Bioengineering and Bioinformatics & Belozersky Institute of Physico-Chemical Biology, Moscow State University Leninskie Gory 1-37 & 1-40, Moscow 119991, Russia Scientific Research Institute for System Studies (NIISI RAN), Moscow, Russia § [email protected] [email protected] Received 29 September 2011 Revised 1 December 2011 Accepted 8 January 2012 Published 28 February 2012 The description of a protein fold is a hard problem due to significant variability of main structural units, -sheets and -helixes, and their mutual arrangements. An adequate description of the structural units is an important step in objective protein structure classifi- cation, which to date is based on expert judgment in a number of cases. Explicit determination and description of structural units is more complicated for -sheets than for -helixes due to -sheets variability both in composition and geometry. We have developed an algorithm that can significantly modify -sheets detected by commonly used DSSP and Stride algorithms and represent the result as a \-sheet map," a table describing certain -sheet features. In our approach, -sheets (rather than -strands) are considered as holistic objects. Both hydrogen bonds and geometrical restrains are explored for the determination of -sheets. The algorithm is implemented in SheeP program. It was tested for prediction architectures of domains from 93 well-defined all- and = SCOP protein domain families, and showed 93% of correct results. The Web-service http://mouse.belozersky.msu.ru/sheep allows to detect -sheets in a given protein structure, visualize -sheet maps, as well as input three-dimensional structures with highlighted -sheets and their structural features. Keywords: Protein secondary structure; beta-sheet; alpha-helix; protein architecture. 1. Introduction The core of the majority of solved protein structures is formed by secondary structure elements, -helixes and -sheets, with the latter being more complicated structural units than the former. Indeed, a -sheet consists of several hydrogen-bonded Journal of Bioinformatics and Computational Biology Vol. 10, No. 2 (2012) 1241003 (16 pages) # . c Imperial College Press DOI: 10.1142/S021972001241003X 1241003-1 J. Bioinform. Comput. Biol. 2012.10. Downloaded from www.worldscientific.com by MCGILL UNIVERSITY on 11/27/14. For personal use only.

Transcript of SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

Page 1: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

SHEEP: A TOOL FOR DESCRIPTION OF

¯-SHEETS IN PROTEIN 3D STRUCTURES

EVGENIY AKSIANOV*,§ and ANDREI ALEXEEVSKI†,‡,¶

*Belozersky Institute of Physico-Chemical Biology

Moscow State University

Leninskie Gory 1-40, Moscow 119991, Russia

†Department of Bioengineering and

Bioinformatics & Belozersky Institute of Physico-Chemical

Biology, Moscow State University

Leninskie Gory 1-37 & 1-40, Moscow 119991, Russia

‡Scientific Research Institute

for System Studies (NIISI RAN), Moscow, Russia§[email protected][email protected]

Received 29 September 2011

Revised 1 December 2011

Accepted 8 January 2012

Published 28 February 2012

The description of a protein fold is a hard problem due to significant variability of main

structural units, �-sheets and �-helixes, and their mutual arrangements. An adequate

description of the structural units is an important step in objective protein structure classifi-

cation, which to date is based on expert judgment in a number of cases. Explicit determination

and description of structural units is more complicated for �-sheets than for �-helixes due to

�-sheets variability both in composition and geometry. We have developed an algorithm that

can significantly modify �-sheets detected by commonly used DSSP and Stride algorithms and

represent the result as a \�-sheet map," a table describing certain �-sheet features. In our

approach, �-sheets (rather than �-strands) are considered as holistic objects. Both hydrogen

bonds and geometrical restrains are explored for the determination of �-sheets. The algorithm

is implemented in SheeP program. It was tested for prediction architectures of domains from 93

well-defined all-� and �=� SCOP protein domain families, and showed 93% of correct results.

The Web-service http://mouse.belozersky.msu.ru/sheep allows to detect �-sheets in a given

protein structure, visualize �-sheet maps, as well as input three-dimensional structures with

highlighted �-sheets and their structural features.

Keywords: Protein secondary structure; beta-sheet; alpha-helix; protein architecture.

1. Introduction

The core of themajority of solved protein structures is formed by secondary structure

elements, �-helixes and �-sheets, with the latter being more complicated structural

units than the former. Indeed, a �-sheet consists of several hydrogen-bonded

Journal of Bioinformatics and Computational Biology

Vol. 10, No. 2 (2012) 1241003 (16 pages)

#.c Imperial College Press

DOI: 10.1142/S021972001241003X

1241003-1

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 2: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

�-strands; geometrically, it can be approximated by a curved surface with variable

curvature1; side chains of residues are located either on one side of a surface or on the

another side; �-sheets can form various well-defined geometrical shapes like �-bar-

rels, �-prisms, �-helixes, �-propellers etc.2 The adequate determination of �-sheets

and their representation both in computer-readable and human-readable formats is

an essential step for several purposes. Programs representing �-sheet as a flat scheme

were developed earlier.3 Data for creating such \map of �-sheet" are outputs of

secondary structure detectors like DSSP4 and Stride.5

We have developed an algorithm, program and web-service SheeP, which modify

the outputs of secondary structure detectors (including homemade) using additional

geometrical criteria. The modifications are aimed to follow expert definitions of

�-sheets, particularly those implemented in SCOP protein domain structural classi-

fication.6 Splitting and joining �-sheets are allowed at the step of the modifications.

The result of SheeP program is a list of �-sheets in input protein structure. Each

�-sheet is stored and displayed in a special format called a map of �-sheet, keeping

holistic information on the analyzed object. This format allows showing irregularities

often observed in protein structures,7 which indicate the sides of a surface-approxi-

mating �-sheet. The map is acceptable for showing �-barrels and weak contacts

between two �-sheets. Information on �-helixes is also stored and displayed.

SheeP program is a part of the automatic protein architecture detector, results of

which will be published elsewhere. The correct determination of �-sheets in input

protein structure by SheeP is more important for this purpose than the percentage

of correct individual residue assignments in comparison with other protein sec-

ondary structure detectors. We would like to avoid such mistakes as joining two

�-sheets of �-sandwiches into one, splitting evident for the expert �-barrel into two

�-sheets, etc. This is why, besides testing on the set of selected structures, we have

applied SheeP to protein domain structures from several SCOP families. We had

chosen those families for which the number of �-sheets and their arrangement were

clearly described in SCOP family or fold annotations. Nevertheless, we were forced

to analyze manually all tested structures to confirm or correct family annotations

for particular family members.

SheeP results for 93 SCOP families showed 7% of essential mistakes in com-

parison with human judgment.

2. Definitions of ¯-sheet Features

The �-sheet is \an approximately planar array of two or more adjacent �-strands

such that hydrogen bonds may be formed between C¼O groups of one �-strand and

NH groups of another".8 This definition should be refined, including main features of

�-sheets. Hydrogen bonds define pairing of residues from adjacent strands as shown

in Fig. 1(a). Residue pairing in a given three-dimensional (3D) protein structure can

also be established by geometrical criteria.9,10 The �-sheet structure implies that

any �-sheet residue may have two paired residues at most. Pairing relation is

E. Aksianov & A. Alexeevski

1241003-2

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 3: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

symmetric i.e. if ith residue is paired to jth residue, then jth residue is paired to ith

residue.

A �-sheet can be represented by a graph in which the vertices are �-sheet resi-

dues and edges are of two types: edges connecting consecutive residues of a strand

and edges connecting paired residues. In figures, we draw the former edges in rows

and call them horizontal and the latter in columns and call them vertical. Horizontal

edges are oriented due to N- to C-terminus orientation of a polypeptide chain.

The connected components of a sheet graph with respect to vertical edges only

are broken lines; they are called crests, see Fig. 1(b) for a term motivation.

Four residues form a �-quadruplet 9 if they are vertices of a quadrilateral in a

�-sheet graph; two opposite edges of the quadrilateral are horizontal edges and

another two are vertical. Quadruplet is parallel if its horizontal edges are co-oriented

and antiparallel if they are inversely oriented.

Orientation of a �-quadruplet is defined by a choice of a normal vector to

the plane approximating four C� atoms of the �-quadruplet. Another way to

Fig. 1. �-sheets and their maps. (a)�(c) show a part of a �-sheet from PDB entry 1NYT. (a) Wireframe

model of backbone atoms. C�-atoms are shown by balls, hydrogen bonds are dotted lines. N- to C-

direction of each strand is shown by arrows. Residue pairing is defined by hydrogen bonds. Paired residues

are neighbors in a column. (b) Backbone model. C�-atoms are at vertices of broken lines. One crest of the

�-sheet is within the oval. (c) Map of the �-sheet. Cells are shaded differently according the location of the

residue side chain on one or another side of the �-sheet surface. Cells in Figs. 1(a) and 1(c) correspond to

each other. (d) Map of a �-barrel from PDB entry 1LUC. Pairings between residues from upper and lower

strands are indicated by residue numbers of a partner. Ends of �-strands and crests are shown by thick

border lines of cells. Empty cells (gaps) are used due �-bulge Ala75. Thus, Asp37-Glu43, Arg100-Cys106,

and Pro169-Val173 are contiguous strands. Residues Thr73 and Phe103 (Ala74 and Gly104, resp.) are not

paired, it is indicated by thick border line.

A Tool for Description of �-Sheets in Protein 3D Structures

1241003-3

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 4: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

define quadruplet orientation is to choose the direction of quadruplet’s bypass.

Equivalence of these definitions is established by screwdriver rule.

Orientations of two adjacent quadruplets are consistent if bypass directions

induced on their common edge from each quadruplet are opposite.

Practically, all sheet quadruplets may by oriented consistently, thus, defining a

�-sheet orientation. Theoretically possible non-orientable �-sheets (like M€obius

band) actually are not found in 3D protein structures; the only exclusive example

described in literature11 can be placed in doubt. Orientation allows distinguishing

sides of a �-sheet surface.

To simplify the �-sheet representation, we map �-sheet graph into a table, called

�-sheet map (Figs. 1(c) and 1(d)). A table cell either contains �-sheet residue or is

empty. Each strand is mapped into a raw, each crest into a column. Several strands

in one raw and several crests in one column are allowed. Strand ends as well as crest

ends are marked by appropriate thick border lines of the cell (Figs. 1(c) and 1(d)).

Therefore, if there is no thick border line between residues in one row and there are

one or more empty cells between them, then they are adjacent residues of the same

strand. Empty cell insertion is necessary to represent �-sheet irregularities, for

example, �-bulges7 (Fig. 1(d), Ala75). Similar agreements are accepted for crests

and columns.

To map �-sheets of �-barrels (and other complicated �-sheets), it is allowed to

indicate crest (resp., strand) continuation over the border of a map by specifying

next crest residue ID (Fig. 1(d)). Such a residue may belong to the same sheet map

or to another. It is allowed that one �-sheets is presented by two or more separate

tables, with specified pairings between them.

A �-sheet map can be constructed for every �-sheet graph although there are

several possible ways of construction. For example, to construct sheet map of a

�-barrel, one can cut �-barrel between any pair of adjacent strands. Nevertheless,

�-sheet graph can be reconstructed from any �-sheet map.

3. Algorithm

The SheeP algorithm takes the input 3D structure of a protein or a complex of

proteins and returns maps of all detected �-sheets. It is allowed that one �-sheet

contains strands from two or more proteins because there are examples of �-sheets

formed by two or more proteins in a protein complex. It is allowed also that the map

of one �-sheet is represented by two (or more) separated parts with explicitly

indicated pairings and strand continuations between them.

The SheeP algorithm includes the following procedures.

3.1. Making ¯-sheet graph

We explore DSSP-like format to describe �-sheet graph of a protein structure. In

this format �-strand ID, �-sheet ID, and (not more than two) identifiers of paired

E. Aksianov & A. Alexeevski

1241003-4

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 5: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

residues are assigned to each residue of each �-sheet. Residue pairing must be a

symmetric relation. Paired residues correspond to vertical edges of the graph, pairs

of consecutive residues that belongs to the same �-strand correspond to horizontal

edges.

We created several procedures that construct �-sheet graphs.

DSSP to �-sheet graph. The DSSP algorithm4 determines secondary structure of

a protein taking 3D protein structure on input. Hydrogen bonds between backbone

nitrogens and oxygens are detected and only this information is used. DSSP output

provides secondary structure assignment per each residue. Additionally, residues

marked as \�-structured" (either E ��� \extended" or B ��� \bridge" in DSSP

terms) refer to at most two paired residues and each �-structured residue is assigned

to a specified �-sheet. In DSSP program output file symmetry of pairing relations is

no guaranteed (see, for example, DSSP output file for PDB entry 1GA6, chain A:

Asp265 is paired to Thr204, but Thr204 is not paired to Asp265). Although non-

symmetric pairing is rare, we were forced to create a procedure that solves conflicts

in DSSP output (if any) and reformat it into our �-sheet graph format.

Stride to �-sheet graph. Stride algorithm5 solves the same task as DSSP. Stride

takes in account both hydrogen bonds between backbone atoms and torsion angles ’

and . Output describes �-helixes and �-strands as lists of residues. Explicit

information on pairings and �-sheets is not presented. We created a procedure

recovering pairing between residues of �-strands and creating �-sheet graph. The

procedure explores hydrogen bonds indicated in Stride output. �-sheets are detected

as connected components of the graph.

Pairing detector. We have developed a detector of residue pairing, which

explores geometrical criteria only.

First, let us define �-like quadrilaterals. Denote by Q a space quadrilateral of C�-

atoms of residues i, iþ 1, j, jþ 1 (tested as antiparallel �-quadruplet) or i, iþ 1, j,

j� 1 (tested as parallel �-quadruplet). In both cases, denote four vertices of Q by A,

B (upper horizontal edge) and C, D (lower horizontal edge), see Fig. 2(a). In the

case of antiparallel quadrilateral, the order of vertices is chosen in such a manner

that the length of AD is less or equal to the length of BC. In addition to lengths AD

and BC, a value �, which is minus common logarithm of the probability that Q is

formed by C�-atoms of a �-quadruplet, is computed. The definition of � is given in

the Sec. 6.1. � is a function of four angles A, B, C, D of the quadrilateral Q and four

torsion angles tor(AB), tor(CD), tor(BC), tor(DA) corresponding to edges of Q.

Q is called �-like qadrilateral if (i) lengths of edges BC and AD are less than

6.0�A; (ii) � < 8. Thresholds for parameters of a �-like quadrilateral were approved

by statistical data (see Sec. 6.1 and Fig. 3(d)).

Finally, two residues, ith and jth are detected as paired if

(i) they form vertical edge in some �-like quadrilateral Q;

(ii) among all �-like quadrilaterals containing ith residue and one of residues

j� 1, j, jþ 1 quadrilateral Q has minimal value of the parameter �;

A Tool for Description of �-Sheets in Protein 3D Structures

1241003-5

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 6: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

(iii) among all �-like quadrilaterals containing jth residue and one of residues

i� 1, i, iþ 1 quadrilateral Q has minimal value of the parameter �.

A module for the detection of �-helixes was also created but is not discussed here.

MakeSSP.We have developed an algorithm which detects �-helixes and �-sheets

in input 3D protein structure using geometrical criteria. Output contains a list of

�-helixes and a �-sheet graph.

To create sheet graph, pairing of residues is defined by exhausting all pairs of

residues and applying paring detector. Clearly, pairing relation is symmetric.

Practically, one residue may have more than two candidates for pairing. To over-

come such conflict, only two pairings that come from �-like quadrilaterals with

minimal value of the parameter � are retained. All residues that are paired with at

least one other residue are vertices of the sheet graph; pairing defines vertical edges

and pairs of consecutive residues define horizontal edges.

3.2. Mapping ¯-sheet graph

Let � be a �-sheet-like graph i.e. a connected graph with two sorts of edges and the

same local constrains as in �-sheet graph. An algorithm for mapping � into a plain

table ��� a �-sheet map ��� should satisfy the following conditions: (1) � can be

reconstructed from the map; (2) a map is user friendly, which mainly means:

minimal number of gaps (empty cells) are inserted into the map. For an arbitrary

�-sheet-like graph, the problem of developing such an algorithm seems be compli-

cated. Fortunately, complicated graphs of �-sheets in protein structures are rare.

Nevertheless, to also adequately map complicated graphs, we allow the represen-

tation of �-sheet map by two or more separate parts (for example, see PDB code

1HZT, chain A).

In our algorithm, first, the longest strand is considered as one row table. Second,

a strand paired to the first one is added as a new (second) row in the table. Paired

residues are placed one under another; in the case of bulges or other irregularities

gaps (empty cells) are inserted into appropriate positions. Direction of the second

strand (left-right or right-left) is uniquely defined by this procedure. Next, a strand

that has residues paired to already-constructed map is selected. It is placed into

appropriate rows according residue pairing. Conflicts of several types may appear at

this step. The conflicts are solved by either inserting a number of gaps into the map

or creating a new table in which pairing with the previously constructed part of a

map is specified.

3.2.1. �-sheets modifier

Initially constructed �-sheet map and, respectively, �-sheet graph, undergo a

number of modifications, listed below.

�-sheet graph recalculation. All residue pairings are dropped and recalculated

using pairing detector. This procedure is applied, for example, after merging of two

sheets into one for recalculating new map of the new sheet.

E. Aksianov & A. Alexeevski

1241003-6

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 7: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

A vertex of graph is removed from a sheet graph if it is incident to only one edge

and this edge is horizontal. Three or more consecutive vertexes of a strand are

removed if each of them is not incident to any vertical edge. (Thus, loops of three or

more residues within a strand are prohibited.)

A strand (resp., crest) is extended if there is a C�-atom in the spatial area defined

by the interpolation of local vicinity of the strand (resp., crest) terminal C�-atom.

All strands and crests are extended by this procedure and pairing is recalculated for

the obtained new set of �-sheet residues.

Joining two sheets into one. Two sheets are joined after strand and crest

extension if they share two or more common residues.

Splitting a strand into two strands. If the angle between vectors C�i ! C�iþ2 and

C�iþ3 ! C�iþ5 is more than 75� than bend between residues iþ 2 and iþ 3 is

detected and the strand is split into two strands.

Joining two strands into one. Two strands of a sheet are joined into one if they are

separated by less than three residues; these residues are included into the strand.

Splitting one sheet into two sheets. Two vertical edges incident to consecutive

residues of a strand are deleted if their deletion leads to graph decomposition into

two connected components and each component contains more than five vertices.

In the algorithm, the loop of modifications consists of consequently applied steps:

extending strands and crests; joining strands and sheets; recalculating the graph;

and splitting strands and sheets. The �-sheet graph obtained after all these steps is

stored. The loop is repeated until the obtained �-sheet graph coincides with any of

stores ones. Finally, unpaired vertexes at the edges of the strands are removed.

3.3. Outline of the algorithm

The algorithm consists of four steps. First, �-sheet graph is constructed using

DSSP, Stride, or MakeSSP. Second, �-sheet map of the �-sheet graph is con-

structed. Third, the modifications of the �-sheet graph and the respective �-sheet

map are made. The modification step may be switched off; this possibility was

explored to compare the results of DSSP and Stride with and without further

modifications. Fourth, output data are computed and saved in files. They include

graphical representations of maps of all detected �-sheets and scripts in RasMol12/

JMol13 format acceptable for 3D structure visualization with distinguished �-sheets,

secondary structure elements, pairing, and hydrogen bonds.

4. Program Implementation

SheeP program is written in Free Pascal. Both source codes and binaries are

available for download (under Linux and Windows). Web service is available at

http://mouse.belozersky.msu.ru/sheep. Several variants of the algorithm depend-

ing on initial secondary structure detector and modifications are suggested. For a

particular structure, it might be useful to test several variants. DSSP with further

modifications is recommended for data mining.

A Tool for Description of �-Sheets in Protein 3D Structures

1241003-7

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 8: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

5. Materials

SheeP program was tested on domains from all-�, �/� classes according to SCOP

1.75 classification.6 A list of domains and the detailed results are available at service

http://mouse.belozersky.msu.ru/sheep jbcb.

5.1. Training and control sets

Algorithm parameters were adjusted on the set of 302 domains from structures

solved by X-ray, resolution below 1.5�A. They were chosen from 296 SCOP protein

domain families. All major architectures (�-sandwiches and �=�-sandwiches,

�-barrels and ��-barrels, �-helixes, �-propellers, �-prisms, Rossmann folds) are

represented in this data set. �-sheets in all domains were confirmed by manual

revision.

This set was subdivided into two groups. The first group of 88 domains was

considered as training set for algorithm parameters adjustment, and the rest 214

domains were used for the control.

5.2. Testing set

All 99 PDB entries satisfying the following conditions were selected: (1) protein

structures are X-ray solved with resolution better than 1.5�A; (2) they are from 50%

non-redundant set according PDB; (3) each entry contains at least one protein

chain of 80 or more residues with at least one �-sheet; (4) BLAST search of selected

chains against all sequences of SCOP 1.75 domains shows no hits with more than

20% sequence identity. The last condition guarantees that the testing set does not

intersect with the training and control ones. All 99 PDB entries were manually

examined. If two or more chains within one PDB file coincided (for example, they

are two chains of a homodimer), then only one of them was retained. In all selected

chains, the structural domains were determined manually. Finally, 157 �-sheet

containing domains were included in the testing set.

5.3. Set of quadrilaterals for pairing detector parameters adjustment

All 2344 DSSP detected �-quadruplets from 88 domains of the training set were

selected for the statistical investigation, 1642 of them are antiparallel and 702

parallel. Four C�-atoms of each �-quadruplet form a space quadrilateral in which

vertices were marked by A, B, C, D as shown on Figs. 2(a) and 2(b). In the case of

antiparallel �-quadruplet, A and D were assigned to the paired residues with the

shorter edge between corresponding C�-atoms. Only 6 of totally 2344 �-quadruplets

have lengths of one of edges BC or DA more than 6�A. In the case of parallel

�-quadruplets vertices A and B were assigned to residues with smallest residue

numbers.

The set of non-�-quadruplets for negative control consists of all 1549 residue

quadruplets [i, j, k, l] from 88 domains such that (1) i and j, k, and l are neighbors in

E. Aksianov & A. Alexeevski

1241003-8

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 9: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

the chain: ji� jj ¼ 1, jk� lj ¼ 1; (2) there is no other pairs of neighbor residues

among [i, j, k, l]; (3) none of residues [i, j, k, l] was detected as �-structured by

DSSP algorithm; (4) lengths of edges BC and DA of a space quadrilateral of C�-

atoms defined as described in previous paragraph are less or equal to 6�A. Of the

1549 non-�-quadruplets, 788 are antiparallel and 761 parallel.

5.4. SCOP domain families

Several SCOP families were selected for examination of �-sheets definitions by

SheeP program (Table 2). Selected families are those for which core �-sheets are

explicitly described in SCOP definitions.

6. Results and Discussion

6.1. Statistics of quadruplets’ parameters for pairing detector

Our goal was to reproduce expert judgment on �-sheets in 3D protein structure. An

expert can detect �-sheets either by visualization of hydrogen bonds between

backbone nitrogen and oxygen atoms or just by observing the backbone represen-

tation of the structure (Fig. 1(b)). Several geometrical restrains on C�-atoms

location within pairs of consecutive �-quadruplets were investigated earlier.9,10

Here we analyzed statistics of geometrical parameters for C�-atoms from

�-quadruplets. These four C�-atoms form space quadrilateral [A, B, C, D] which is

approximately planar and has right angles. Lengths of edges AB and DC are fixed

(3:8� 0:03�A) because A and B (resp., D and C) are consecutive C�-atoms. Thus,

generally six independent parameters of a space quadrilateral are reduced to four for

�-quadrilaterals. We analyzed lengths of edges BC and DA, the internal angles of

the quadrilateral, the torsion angles corresponding to the edges (Fig. 2(a)).

(a) (b) (c)

Fig. 2. Schemes of quadrilaterals of C�-atoms. C�-atoms are shown by black balls. (a) Antiparallel

quadrilateral Q. Vertices A and B (resp., C and D) belong to consecutive residues. Angles A (the angle of

the triangle DAB) and D (the angle of the triangle CDA) are shown as well as torsion angle tor(DA) which

is an angle between planes ADC and ADB. (b) Scheme of two quadrilaterals from consecutive antiparallel

�-quadruplets. Vertical double-line connects C�-atoms of two residues linked by two hydrogen bonds.

Letters A and D (A’ and D’ for the second quadrilateral) are assigned to the shorter vertical edge. (c)

Scheme of two parallel quadruplets from consecutive parallel �-quadruplets.

A Tool for Description of �-Sheets in Protein 3D Structures

1241003-9

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 10: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

In the case of antiparallel �-quadruplets, the distribution of distances between

C�-atoms of paired residues is bimodal.9 Actually, this distance is a bit larger for

paired residues that have two hydrogen bonds than for paired residues having

no hydrogen bonds. We take this difference in account for statistical analysis by

appropriate identification of quadrilateral vertices (edge AD is assumed to be

shorter than BC in all antiparallel quadrilaterals, see Fig. 2(b))

The distributions of all these parameters are shown in Figs. 3(a)�3(c).

Let Q be a quadrilateral of C�-atoms, either antiparallel or parallel (see Fig. 2).

For each angle A, B, C, D of Q we compute p-value using the normal approximation

of the distribution of corresponding angles in �-quadruplets. A torsion angle cor-

responds to each edge of Q because Q is space (non-planar) quadrilateral (Fig. 2(a)).

We denote torsion angles by tor(AB), tor(BC), tor(CD), and tor(DA). Torsion

angles tor(AB) and tor(CD) (resp., tor(BC) and tor(DA)) of �-quadruplets are

highly correlated (data not shown). Due to this observation, we computed p-values

for average torsion angles (tor(AB) þ tor(CD))/2 and (tor(BC) þ tor(DA))/2 also

using normal approximation. Denote by � ¼ �(Q) minus common logarithm of the

product of all six probabilities. �(Q) could be considered as the distance of Q to ideal

�-quadruplet.

(a) (b)

(c) (d)

Fig. 3. Distributions of quadrilateral parameters. (a) Distances between C�-atoms from paired residues

of �-quadrilaterals. (b) Angles of �-quadrilaterals. (c) Torsion angles of �-quadrilaterals. (d) Parameter �

for �-quadrilaterals and for non-�-quadrilaterals.

E. Aksianov & A. Alexeevski

1241003-10

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 11: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

In Fig. 3(d), the distributions of the parameter � are shown for two sets of data,

quadrilaterals of �-quadruplets and quadrilaterals of non-�-quadruplets.

6.2. Comparison of ¯-sheets detected by various algorithms

Several variants of the algorithm were implemented in the SheeP program. Variants

differ in the secondary structure detector used at the initial step and the presence/

absence of the modification step. SheeP parameters were adjusted on the set of

302 protein domain structures representing the main �-sheet�based architectures

(see Sec. 5). All 302 structures were manually inspected and �-sheets in each of them

were defined.

Human judgment on �-sheets is often used to confirm the results of the secondary

structure detectors8 due to the absence of appropriately large data sets of structures

that can be considered as a gold standard definitions of �-sheets. For 42 structures of

the training and control sets, human decision was uncertain. For example, it was

not clear whether domain structure from PDB entry 1IKP, chain A, residues

395�606, is open barrel or sandwich. We know no way to approve any of possible

decision in these cases by verifiable arguments ��� except trusting to selected sec-

ondary structure detector, which does not seem to be correct. This obstacle forced

us to adjust algorithm is iterative manner, refining parameters using the training set

(88 domains) and check the results on the control set (214 domains). Thus, the

results of SheeP program on the control set of 214 structures cannot be considered as

an independent verification of the SheeP program.

For independent control, we used testing set of 157 �-sheet domains from 99

PDB entries (see Sec. 5.2). This set was selected after completing the adjustment of

program parameters. All structures were manually examined to detect �-sheets. To

minimize the human factor, we did not used SheeP algorithms during the analysis;

domains were visualized with RasMol using secondary structure definitions from

PDB entry headers. In 18 structures of 157, human judgment on �-sheets was

uncertain.

Mistakes during automatic �-sheet detection were identified for those structures

for which human decision on �-sheets was undoubted (260 domains in the training

and control sets, 139 in the testing set). We detected a mistake if (1) detected

number and compound of �-sheets contradicts human decision; (2) a �-sheet was

considered as a �-barrel (closed or open) by an expert, but automatically detected

�-sheet was not sufficient for detecting a �-barrel by a special program.

The results of several variants of the algorithm on the training, control, and

testing sets are shown in Table 1.

As shown in Table 1, mistakes of any algorithm variant were found in less than

15% of the analyzed structures. Note that only essential mistakes i.e. mistakes

preventing correct determination of the protein architecture, were considered.

Differences in a few residues of different �-sheet determinations were not taken into

account.

A Tool for Description of �-Sheets in Protein 3D Structures

1241003-11

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 12: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

SheeP modifier reduces the number of mistakes. Revision of cases showed that

the modifier actually improves �-sheet definitions in several cases. Examples are

shown in Fig. 4.

6.3. SheeP results for selected SCOP categories

We used SheeP program for automatic detection of �-sheets in a number of SCOP

categories (folds, superfamilies, families). From each analyzed SCOP category, 90%

non-redundant set of domain structures was selected.

Expert decisions on �-sheet compound and arrangements are not available

for each SCOP domain. Thus, we were not able to compare SheeP annotations

for all protein structures with independent expert judgments. This is why structures

were taken from those folds, for which �-sheets were defined in SCOP annota-

tions explicitly. Nevertheless, due to the variability of domain structures within

a fold, �-sheet definitions in all structures were confirmed and corrected manually,

if necessary. Only mistakes in core �-sheet definitions were taken into account

(Table 2).

As it is shown in Tables 1 and 2, SheeP modifier, in comparison with DSSP and

Stride detectors without the �-sheet modification step, reduces the number of

Fig. 4. Comparison of two algorithms, SheeP based on DSSP without ((a) and (c)) and with ((b) and

(d)) modifications. Different �-sheets are differently shaded. (a) and (b) SCOP annotated �-sandwich

from PDB entry 1CZT, chain A. (c) and (d) SCOP annotated �-barrel from PDB entry 1H3G, chain A. In

both structures SheeP modifier corrected DSSP otput.

Table 1. Comparison of automatic �-sheet detection with human judgment. Domains with uncertain

human judgment on �-sheets are not included.

Number of structures with mistakes

Number of

structures DSSP Stride

SheeP based

on DSSP

(default)

SheeP based

on Stride

SheeP based

on MakeSSP

Training set 71 12 12 8 8 12

Control set 189 13 15 6 9 11

TrainingþControl

260 25 (10%) 27 (10%) 14(5%) 17 (7%) 23 (9%)

Testing set 139 21 (15%) 20 (14%) 4 (3%) 7 (5%) 10 (7%)

E. Aksianov & A. Alexeevski

1241003-12

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 13: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

Table2.DSSP,Stride,

andSheePresultsforselected

foldsandfamilies.90%

non-redundantsetofstructureswereselected

from

each

studiedSCOPcategory.

Number

ofstructureswithmistakes

SCOPclassificationlevel

andID

Architecture

No.of

structures

DSSP

Stride

SheePbased

onDSSP

(default)

SheePbased

onStride

SheePbasedon

Mak

eSSP

Familyc.2.1.2

Rossmannfold

110

22

10

1

Familyc.1.8.1

TIM

barrels

53

22

21

63

16

Foldsb.80,b.81(30families)

�-helix

54

29

32

14

16

10

Foldsb.77,b.78(5

families)

�-prism

21

12

01

2

Superfamiliesb.1.18,b.18.1

(56families)

�-sandwich

174

20

32

64

17

Tota

l412

74(1

8%)

89(2

2%)

27(7

%)

24(6

%)

46(1

1%)

Table

3.DSSP,Stride,

andSheePresultsfordomainswithuncertain

humanjudgmenton�-sheets.

Number

ofstructureswithmistakes

Set

Architecture

No.of

structures

DSSP

Stride

SheePbased

onDSSP

(default)

SheePbased

onStride

SheePbasedon

Mak

eSSP

TrainingþControl

Different

42

21

45

5

Testingset

Different

18

00

11

0

Familyc.2.1.2

Rossmannfold

40

00

00

Familyc.1.8.1

TIM

barrels

00

00

00

Foldsb.80,b.81(30families)

�-helix

73

30

11

Foldsb.77,b.78(5

families)

�-prism

10

00

00

Superfamiliesb.1.18,b.18.1

(56families)

�-sandwich

70

00

12

Tota

l79

5(6

%)

4(5

%)

5(6

%)

8(1

0%)

8(1

0%)

A Tool for Description of �-Sheets in Protein 3D Structures

1241003-13

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 14: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

mistakes significantly. Based on these results, we consider \DSSP þ modifier"

variant of the algorithm as the most appropriate for the purpose of automatic

architecture detector.

6.4. SheeP results in cases of uncertain human decisions

Human judgment on �-sheets was uncertain for a number of structures. For these

structures, we fixed two or more admissible decisions and consider automatic

detection of �-sheets being correct if it met one of admissible decisions.

The problem of choosing decision on �-sheet compound could be demonstrated in

a domain from PDB entry 1JMX, residues 364�494 of chain A. This domain is

classified as immunoglobulin-like �-sandwich (seven strands, two sheets) in SCOP

database and as �-sandwich, immunoglobulin-like architecture, in CATH. Human

judgment revealed three clearly defined \sub-sheets." Sub-sheets 1�3 and 2�3 are

connected by hydrogen bonds between backbone oxygens and nitrogens. Two

hydrogen bonds between sub-sheets 2 and 3 do not form a regular pattern. The

sandwich upper layer is formed by sub-sheets 1 and 2 and sub-sheet 3 coincides with

the bottom layer.

Possible decisions, based on visual estimations, could be as follows. (1) There is

one �-sheet which includes all three sub-sheets. (2) There are two �-sheets, one of

which consists of sub-sheets 1 and 3 and another coincides with the sub-sheet 2. (3)

There are three �-sheets coinciding with the sub-sheets. The first decision contra-

dicts both the sandwich architecture and regularity of hydrogen bond patterns, the

second contradicts the sandwich architecture, and the third is in better accordance

with the sandwich architecture (which is common for immunoglobulin-like fold),

but rather good hydrogen bonds pattern between sub-sheets 1 and 3 is not taken

into account. We considered both decisions (2) and (3) as admissible.

Results of SheeP for domains with uncertain human decision are shown in

Table 3.

Table 3 demonstrates that in the majority of case outputs of SheeP algorithm

match one of human decisions on �-sheets.

It should be noted that human judgment on �-sheet number and compound was

uncertain in approximately 1/10 of cases (79 of 890, see Tables 1�3), showing the

hardness of the problem. We believe that considering �-sheets as structural units of

domain architecture within an automatic architecture detector would reduce the

uncertainty.

7. Conclusions

An algorithm for the detection of �-sheets in input 3D protein structure and rep-

resentation of each �-sheets by �-sheet map was developed. The �-sheet map con-

tains information about strands, pairing of residues and crests, irregularities and

sides of the �-sheet. It also allows describing adequately �-barrels, and other

E. Aksianov & A. Alexeevski

1241003-14

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 15: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

complicated �-sheets. The algorithm explores hydrogen bond patterns (on the first

step) as well as constrain on relative location of C�-atoms (on the step of �-sheet

modifications).

The algorithm is implemented in SheeP program and web service, which provide

additional visualization possibilities.

SheeP showed 93% of correct �-sheet in comparison with human judgment and

essential mistakes in 7% of structures. We believe that SheeP can be used for an

automatic architecture detection. Currently, the classification of protein structures

involves manual steps14 even in popular databases SCOP6 and CATH.15

Holistic view on �-sheets, implemented in �-sheet maps, simplifies the automatic

comparison of �-sheets from structures of relative proteins, which may be con-

sidered as �-sheet alignments.

Acknowledgments

This work was partially supported by RFBR grant 10-07-00685-a.

References

1. Znamenskiy D, Le Tuan K, Poupon A, Chomilier J, Mornon JP, Beta-sheet modeling byhelical surfaces, Protein Eng 13:407�412, 2000.

2. Chothia C, Hubard T, Brenner S, Barns H, Murzin A, Protein folds in the all-� and all-�classes, Annu Rev Biophys Biomol Struct 26:597�627, 1997.

3. Hutchinson EG, Thornton JM, HERA ��� A program to draw schematic diagrams ofprotein secondary structures, Proteins 8:203�212, 1990.

4. Kabsch W, Sander C, Dictionary of protein secondary structure: Pattern recognition ofhydrogen-bonded and geometrical features. Biopolymers 22:2577�2637, 1983

5. Frishman D, Argos P, Knowledge-based protein secondary structure assignment, Pro-teins 23:566�579, 1995.

6. Murzin AG, Brenner SE, Hubbard T, Chothia C, SCOP: A structural classification ofproteins database for the investigation of sequences and structures, J Mol Biol247:536�540, 1995.

7. Chan AWE, Hutchinson EG, Harris D, Thornton JM, Identification, classification, andanalysis of �-bulges in proteins, Protein Sci 2:1574�1590, 1993.

8. Cammack R, Attwood T, Campbell P, Parish H, Smith A, Vella F, Stirling J. (eds.)Oxford Dictionary of Biochemistry and Molecular Biology, Oxford University Press,USA, 2006.

9. Majumdar I, Krishna SS, Grishin NV, PALSSE: A program to delineate linear sec-ondary structural elements from protein structures, BMC Bioinformatics 6:202, 2005.

10. Martin J, Letellier G, Marin A, Taly J-F, de Brevern AG, Gibrat J-F, Protein secondarystructure assignment revisited: A detailed analysis of different assignment methods,BMC Struct Biol 5:17, 2005.

11. LiuW-M, Is there a M€obius band in closed protein �-sheets? Protein Eng 10:1373�1377,1997.

12. Sayle RA, Milner-White EJ, RasMol: Biomolecular graphics for all, Trends Biochem Sci20:374�376, 1995.

13. Jmol: An open-source Java viewer for chemical structures in 3D. Available at: http://www.jmol.org/.

A Tool for Description of �-Sheets in Protein 3D Structures

1241003-15

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.

Page 16: SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES

14. Sam V, Tai CH, Garnier J, Gibrat JF, Lee B, Munson PJ., Towards an automaticclassification of protein structural domains based on structural similarity, BMC Bioin-formatics 9:74, 2008.

15. Orengo CA, Michie AD, Jones DT, Swindells MB, Thornton JM, CATH: A hierarchicclassification of protein domain structures, Structure 5:1093�1108, 1997.

Andrei V. Alexeevski received his Diploma and Doctoral

degrees, both in Mathematics, from Moscow State University

(MSU), Moscow, Russia, in 1970 and 1983, respectively. From

1989 up to now, he has been working at the Laboratory of

Mathematical Methods in Biology, A.N. Belozersky Institute

of Physico-Chemical Biology, MSU, where he is currently a Chief

of the laboratory. His research focus is in the field of bioinfor-

matics, computational biology, and theoretical mathematics.

Evgeniy Aksianov received his Diploma degree in Virology

from Moscow State University (MSU), Moscow, Russia, in 2003.

From 2004 up to now, he is at the Laboratory of Mathematical

Methods in Biology, A.N. Belozersky Institute of Physico-

Chemical Biology, MSU.

E. Aksianov & A. Alexeevski

1241003-16

J. B

ioin

form

. Com

put.

Bio

l. 20

12.1

0. D

ownl

oade

d fr

om w

ww

.wor

ldsc

ient

ific

.com

by M

CG

ILL

UN

IVE

RSI

TY

on

11/2

7/14

. For

per

sona

l use

onl

y.