SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES
Transcript of SHEEP: A TOOL FOR DESCRIPTION OF β-SHEETS IN PROTEIN 3D STRUCTURES
SHEEP: A TOOL FOR DESCRIPTION OF
¯-SHEETS IN PROTEIN 3D STRUCTURES
EVGENIY AKSIANOV*,§ and ANDREI ALEXEEVSKI†,‡,¶
*Belozersky Institute of Physico-Chemical Biology
Moscow State University
Leninskie Gory 1-40, Moscow 119991, Russia
†Department of Bioengineering and
Bioinformatics & Belozersky Institute of Physico-Chemical
Biology, Moscow State University
Leninskie Gory 1-37 & 1-40, Moscow 119991, Russia
‡Scientific Research Institute
for System Studies (NIISI RAN), Moscow, Russia§[email protected]¶[email protected]
Received 29 September 2011
Revised 1 December 2011
Accepted 8 January 2012
Published 28 February 2012
The description of a protein fold is a hard problem due to significant variability of main
structural units, �-sheets and �-helixes, and their mutual arrangements. An adequate
description of the structural units is an important step in objective protein structure classifi-
cation, which to date is based on expert judgment in a number of cases. Explicit determination
and description of structural units is more complicated for �-sheets than for �-helixes due to
�-sheets variability both in composition and geometry. We have developed an algorithm that
can significantly modify �-sheets detected by commonly used DSSP and Stride algorithms and
represent the result as a \�-sheet map," a table describing certain �-sheet features. In our
approach, �-sheets (rather than �-strands) are considered as holistic objects. Both hydrogen
bonds and geometrical restrains are explored for the determination of �-sheets. The algorithm
is implemented in SheeP program. It was tested for prediction architectures of domains from 93
well-defined all-� and �=� SCOP protein domain families, and showed 93% of correct results.
The Web-service http://mouse.belozersky.msu.ru/sheep allows to detect �-sheets in a given
protein structure, visualize �-sheet maps, as well as input three-dimensional structures with
highlighted �-sheets and their structural features.
Keywords: Protein secondary structure; beta-sheet; alpha-helix; protein architecture.
1. Introduction
The core of themajority of solved protein structures is formed by secondary structure
elements, �-helixes and �-sheets, with the latter being more complicated structural
units than the former. Indeed, a �-sheet consists of several hydrogen-bonded
Journal of Bioinformatics and Computational Biology
Vol. 10, No. 2 (2012) 1241003 (16 pages)
#.c Imperial College Press
DOI: 10.1142/S021972001241003X
1241003-1
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
�-strands; geometrically, it can be approximated by a curved surface with variable
curvature1; side chains of residues are located either on one side of a surface or on the
another side; �-sheets can form various well-defined geometrical shapes like �-bar-
rels, �-prisms, �-helixes, �-propellers etc.2 The adequate determination of �-sheets
and their representation both in computer-readable and human-readable formats is
an essential step for several purposes. Programs representing �-sheet as a flat scheme
were developed earlier.3 Data for creating such \map of �-sheet" are outputs of
secondary structure detectors like DSSP4 and Stride.5
We have developed an algorithm, program and web-service SheeP, which modify
the outputs of secondary structure detectors (including homemade) using additional
geometrical criteria. The modifications are aimed to follow expert definitions of
�-sheets, particularly those implemented in SCOP protein domain structural classi-
fication.6 Splitting and joining �-sheets are allowed at the step of the modifications.
The result of SheeP program is a list of �-sheets in input protein structure. Each
�-sheet is stored and displayed in a special format called a map of �-sheet, keeping
holistic information on the analyzed object. This format allows showing irregularities
often observed in protein structures,7 which indicate the sides of a surface-approxi-
mating �-sheet. The map is acceptable for showing �-barrels and weak contacts
between two �-sheets. Information on �-helixes is also stored and displayed.
SheeP program is a part of the automatic protein architecture detector, results of
which will be published elsewhere. The correct determination of �-sheets in input
protein structure by SheeP is more important for this purpose than the percentage
of correct individual residue assignments in comparison with other protein sec-
ondary structure detectors. We would like to avoid such mistakes as joining two
�-sheets of �-sandwiches into one, splitting evident for the expert �-barrel into two
�-sheets, etc. This is why, besides testing on the set of selected structures, we have
applied SheeP to protein domain structures from several SCOP families. We had
chosen those families for which the number of �-sheets and their arrangement were
clearly described in SCOP family or fold annotations. Nevertheless, we were forced
to analyze manually all tested structures to confirm or correct family annotations
for particular family members.
SheeP results for 93 SCOP families showed 7% of essential mistakes in com-
parison with human judgment.
2. Definitions of ¯-sheet Features
The �-sheet is \an approximately planar array of two or more adjacent �-strands
such that hydrogen bonds may be formed between C¼O groups of one �-strand and
NH groups of another".8 This definition should be refined, including main features of
�-sheets. Hydrogen bonds define pairing of residues from adjacent strands as shown
in Fig. 1(a). Residue pairing in a given three-dimensional (3D) protein structure can
also be established by geometrical criteria.9,10 The �-sheet structure implies that
any �-sheet residue may have two paired residues at most. Pairing relation is
E. Aksianov & A. Alexeevski
1241003-2
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
symmetric i.e. if ith residue is paired to jth residue, then jth residue is paired to ith
residue.
A �-sheet can be represented by a graph in which the vertices are �-sheet resi-
dues and edges are of two types: edges connecting consecutive residues of a strand
and edges connecting paired residues. In figures, we draw the former edges in rows
and call them horizontal and the latter in columns and call them vertical. Horizontal
edges are oriented due to N- to C-terminus orientation of a polypeptide chain.
The connected components of a sheet graph with respect to vertical edges only
are broken lines; they are called crests, see Fig. 1(b) for a term motivation.
Four residues form a �-quadruplet 9 if they are vertices of a quadrilateral in a
�-sheet graph; two opposite edges of the quadrilateral are horizontal edges and
another two are vertical. Quadruplet is parallel if its horizontal edges are co-oriented
and antiparallel if they are inversely oriented.
Orientation of a �-quadruplet is defined by a choice of a normal vector to
the plane approximating four C� atoms of the �-quadruplet. Another way to
Fig. 1. �-sheets and their maps. (a)�(c) show a part of a �-sheet from PDB entry 1NYT. (a) Wireframe
model of backbone atoms. C�-atoms are shown by balls, hydrogen bonds are dotted lines. N- to C-
direction of each strand is shown by arrows. Residue pairing is defined by hydrogen bonds. Paired residues
are neighbors in a column. (b) Backbone model. C�-atoms are at vertices of broken lines. One crest of the
�-sheet is within the oval. (c) Map of the �-sheet. Cells are shaded differently according the location of the
residue side chain on one or another side of the �-sheet surface. Cells in Figs. 1(a) and 1(c) correspond to
each other. (d) Map of a �-barrel from PDB entry 1LUC. Pairings between residues from upper and lower
strands are indicated by residue numbers of a partner. Ends of �-strands and crests are shown by thick
border lines of cells. Empty cells (gaps) are used due �-bulge Ala75. Thus, Asp37-Glu43, Arg100-Cys106,
and Pro169-Val173 are contiguous strands. Residues Thr73 and Phe103 (Ala74 and Gly104, resp.) are not
paired, it is indicated by thick border line.
A Tool for Description of �-Sheets in Protein 3D Structures
1241003-3
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
define quadruplet orientation is to choose the direction of quadruplet’s bypass.
Equivalence of these definitions is established by screwdriver rule.
Orientations of two adjacent quadruplets are consistent if bypass directions
induced on their common edge from each quadruplet are opposite.
Practically, all sheet quadruplets may by oriented consistently, thus, defining a
�-sheet orientation. Theoretically possible non-orientable �-sheets (like M€obius
band) actually are not found in 3D protein structures; the only exclusive example
described in literature11 can be placed in doubt. Orientation allows distinguishing
sides of a �-sheet surface.
To simplify the �-sheet representation, we map �-sheet graph into a table, called
�-sheet map (Figs. 1(c) and 1(d)). A table cell either contains �-sheet residue or is
empty. Each strand is mapped into a raw, each crest into a column. Several strands
in one raw and several crests in one column are allowed. Strand ends as well as crest
ends are marked by appropriate thick border lines of the cell (Figs. 1(c) and 1(d)).
Therefore, if there is no thick border line between residues in one row and there are
one or more empty cells between them, then they are adjacent residues of the same
strand. Empty cell insertion is necessary to represent �-sheet irregularities, for
example, �-bulges7 (Fig. 1(d), Ala75). Similar agreements are accepted for crests
and columns.
To map �-sheets of �-barrels (and other complicated �-sheets), it is allowed to
indicate crest (resp., strand) continuation over the border of a map by specifying
next crest residue ID (Fig. 1(d)). Such a residue may belong to the same sheet map
or to another. It is allowed that one �-sheets is presented by two or more separate
tables, with specified pairings between them.
A �-sheet map can be constructed for every �-sheet graph although there are
several possible ways of construction. For example, to construct sheet map of a
�-barrel, one can cut �-barrel between any pair of adjacent strands. Nevertheless,
�-sheet graph can be reconstructed from any �-sheet map.
3. Algorithm
The SheeP algorithm takes the input 3D structure of a protein or a complex of
proteins and returns maps of all detected �-sheets. It is allowed that one �-sheet
contains strands from two or more proteins because there are examples of �-sheets
formed by two or more proteins in a protein complex. It is allowed also that the map
of one �-sheet is represented by two (or more) separated parts with explicitly
indicated pairings and strand continuations between them.
The SheeP algorithm includes the following procedures.
3.1. Making ¯-sheet graph
We explore DSSP-like format to describe �-sheet graph of a protein structure. In
this format �-strand ID, �-sheet ID, and (not more than two) identifiers of paired
E. Aksianov & A. Alexeevski
1241003-4
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
residues are assigned to each residue of each �-sheet. Residue pairing must be a
symmetric relation. Paired residues correspond to vertical edges of the graph, pairs
of consecutive residues that belongs to the same �-strand correspond to horizontal
edges.
We created several procedures that construct �-sheet graphs.
DSSP to �-sheet graph. The DSSP algorithm4 determines secondary structure of
a protein taking 3D protein structure on input. Hydrogen bonds between backbone
nitrogens and oxygens are detected and only this information is used. DSSP output
provides secondary structure assignment per each residue. Additionally, residues
marked as \�-structured" (either E ��� \extended" or B ��� \bridge" in DSSP
terms) refer to at most two paired residues and each �-structured residue is assigned
to a specified �-sheet. In DSSP program output file symmetry of pairing relations is
no guaranteed (see, for example, DSSP output file for PDB entry 1GA6, chain A:
Asp265 is paired to Thr204, but Thr204 is not paired to Asp265). Although non-
symmetric pairing is rare, we were forced to create a procedure that solves conflicts
in DSSP output (if any) and reformat it into our �-sheet graph format.
Stride to �-sheet graph. Stride algorithm5 solves the same task as DSSP. Stride
takes in account both hydrogen bonds between backbone atoms and torsion angles ’
and . Output describes �-helixes and �-strands as lists of residues. Explicit
information on pairings and �-sheets is not presented. We created a procedure
recovering pairing between residues of �-strands and creating �-sheet graph. The
procedure explores hydrogen bonds indicated in Stride output. �-sheets are detected
as connected components of the graph.
Pairing detector. We have developed a detector of residue pairing, which
explores geometrical criteria only.
First, let us define �-like quadrilaterals. Denote by Q a space quadrilateral of C�-
atoms of residues i, iþ 1, j, jþ 1 (tested as antiparallel �-quadruplet) or i, iþ 1, j,
j� 1 (tested as parallel �-quadruplet). In both cases, denote four vertices of Q by A,
B (upper horizontal edge) and C, D (lower horizontal edge), see Fig. 2(a). In the
case of antiparallel quadrilateral, the order of vertices is chosen in such a manner
that the length of AD is less or equal to the length of BC. In addition to lengths AD
and BC, a value �, which is minus common logarithm of the probability that Q is
formed by C�-atoms of a �-quadruplet, is computed. The definition of � is given in
the Sec. 6.1. � is a function of four angles A, B, C, D of the quadrilateral Q and four
torsion angles tor(AB), tor(CD), tor(BC), tor(DA) corresponding to edges of Q.
Q is called �-like qadrilateral if (i) lengths of edges BC and AD are less than
6.0�A; (ii) � < 8. Thresholds for parameters of a �-like quadrilateral were approved
by statistical data (see Sec. 6.1 and Fig. 3(d)).
Finally, two residues, ith and jth are detected as paired if
(i) they form vertical edge in some �-like quadrilateral Q;
(ii) among all �-like quadrilaterals containing ith residue and one of residues
j� 1, j, jþ 1 quadrilateral Q has minimal value of the parameter �;
A Tool for Description of �-Sheets in Protein 3D Structures
1241003-5
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
(iii) among all �-like quadrilaterals containing jth residue and one of residues
i� 1, i, iþ 1 quadrilateral Q has minimal value of the parameter �.
A module for the detection of �-helixes was also created but is not discussed here.
MakeSSP.We have developed an algorithm which detects �-helixes and �-sheets
in input 3D protein structure using geometrical criteria. Output contains a list of
�-helixes and a �-sheet graph.
To create sheet graph, pairing of residues is defined by exhausting all pairs of
residues and applying paring detector. Clearly, pairing relation is symmetric.
Practically, one residue may have more than two candidates for pairing. To over-
come such conflict, only two pairings that come from �-like quadrilaterals with
minimal value of the parameter � are retained. All residues that are paired with at
least one other residue are vertices of the sheet graph; pairing defines vertical edges
and pairs of consecutive residues define horizontal edges.
3.2. Mapping ¯-sheet graph
Let � be a �-sheet-like graph i.e. a connected graph with two sorts of edges and the
same local constrains as in �-sheet graph. An algorithm for mapping � into a plain
table ��� a �-sheet map ��� should satisfy the following conditions: (1) � can be
reconstructed from the map; (2) a map is user friendly, which mainly means:
minimal number of gaps (empty cells) are inserted into the map. For an arbitrary
�-sheet-like graph, the problem of developing such an algorithm seems be compli-
cated. Fortunately, complicated graphs of �-sheets in protein structures are rare.
Nevertheless, to also adequately map complicated graphs, we allow the represen-
tation of �-sheet map by two or more separate parts (for example, see PDB code
1HZT, chain A).
In our algorithm, first, the longest strand is considered as one row table. Second,
a strand paired to the first one is added as a new (second) row in the table. Paired
residues are placed one under another; in the case of bulges or other irregularities
gaps (empty cells) are inserted into appropriate positions. Direction of the second
strand (left-right or right-left) is uniquely defined by this procedure. Next, a strand
that has residues paired to already-constructed map is selected. It is placed into
appropriate rows according residue pairing. Conflicts of several types may appear at
this step. The conflicts are solved by either inserting a number of gaps into the map
or creating a new table in which pairing with the previously constructed part of a
map is specified.
3.2.1. �-sheets modifier
Initially constructed �-sheet map and, respectively, �-sheet graph, undergo a
number of modifications, listed below.
�-sheet graph recalculation. All residue pairings are dropped and recalculated
using pairing detector. This procedure is applied, for example, after merging of two
sheets into one for recalculating new map of the new sheet.
E. Aksianov & A. Alexeevski
1241003-6
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
A vertex of graph is removed from a sheet graph if it is incident to only one edge
and this edge is horizontal. Three or more consecutive vertexes of a strand are
removed if each of them is not incident to any vertical edge. (Thus, loops of three or
more residues within a strand are prohibited.)
A strand (resp., crest) is extended if there is a C�-atom in the spatial area defined
by the interpolation of local vicinity of the strand (resp., crest) terminal C�-atom.
All strands and crests are extended by this procedure and pairing is recalculated for
the obtained new set of �-sheet residues.
Joining two sheets into one. Two sheets are joined after strand and crest
extension if they share two or more common residues.
Splitting a strand into two strands. If the angle between vectors C�i ! C�iþ2 and
C�iþ3 ! C�iþ5 is more than 75� than bend between residues iþ 2 and iþ 3 is
detected and the strand is split into two strands.
Joining two strands into one. Two strands of a sheet are joined into one if they are
separated by less than three residues; these residues are included into the strand.
Splitting one sheet into two sheets. Two vertical edges incident to consecutive
residues of a strand are deleted if their deletion leads to graph decomposition into
two connected components and each component contains more than five vertices.
In the algorithm, the loop of modifications consists of consequently applied steps:
extending strands and crests; joining strands and sheets; recalculating the graph;
and splitting strands and sheets. The �-sheet graph obtained after all these steps is
stored. The loop is repeated until the obtained �-sheet graph coincides with any of
stores ones. Finally, unpaired vertexes at the edges of the strands are removed.
3.3. Outline of the algorithm
The algorithm consists of four steps. First, �-sheet graph is constructed using
DSSP, Stride, or MakeSSP. Second, �-sheet map of the �-sheet graph is con-
structed. Third, the modifications of the �-sheet graph and the respective �-sheet
map are made. The modification step may be switched off; this possibility was
explored to compare the results of DSSP and Stride with and without further
modifications. Fourth, output data are computed and saved in files. They include
graphical representations of maps of all detected �-sheets and scripts in RasMol12/
JMol13 format acceptable for 3D structure visualization with distinguished �-sheets,
secondary structure elements, pairing, and hydrogen bonds.
4. Program Implementation
SheeP program is written in Free Pascal. Both source codes and binaries are
available for download (under Linux and Windows). Web service is available at
http://mouse.belozersky.msu.ru/sheep. Several variants of the algorithm depend-
ing on initial secondary structure detector and modifications are suggested. For a
particular structure, it might be useful to test several variants. DSSP with further
modifications is recommended for data mining.
A Tool for Description of �-Sheets in Protein 3D Structures
1241003-7
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
5. Materials
SheeP program was tested on domains from all-�, �/� classes according to SCOP
1.75 classification.6 A list of domains and the detailed results are available at service
http://mouse.belozersky.msu.ru/sheep jbcb.
5.1. Training and control sets
Algorithm parameters were adjusted on the set of 302 domains from structures
solved by X-ray, resolution below 1.5�A. They were chosen from 296 SCOP protein
domain families. All major architectures (�-sandwiches and �=�-sandwiches,
�-barrels and ��-barrels, �-helixes, �-propellers, �-prisms, Rossmann folds) are
represented in this data set. �-sheets in all domains were confirmed by manual
revision.
This set was subdivided into two groups. The first group of 88 domains was
considered as training set for algorithm parameters adjustment, and the rest 214
domains were used for the control.
5.2. Testing set
All 99 PDB entries satisfying the following conditions were selected: (1) protein
structures are X-ray solved with resolution better than 1.5�A; (2) they are from 50%
non-redundant set according PDB; (3) each entry contains at least one protein
chain of 80 or more residues with at least one �-sheet; (4) BLAST search of selected
chains against all sequences of SCOP 1.75 domains shows no hits with more than
20% sequence identity. The last condition guarantees that the testing set does not
intersect with the training and control ones. All 99 PDB entries were manually
examined. If two or more chains within one PDB file coincided (for example, they
are two chains of a homodimer), then only one of them was retained. In all selected
chains, the structural domains were determined manually. Finally, 157 �-sheet
containing domains were included in the testing set.
5.3. Set of quadrilaterals for pairing detector parameters adjustment
All 2344 DSSP detected �-quadruplets from 88 domains of the training set were
selected for the statistical investigation, 1642 of them are antiparallel and 702
parallel. Four C�-atoms of each �-quadruplet form a space quadrilateral in which
vertices were marked by A, B, C, D as shown on Figs. 2(a) and 2(b). In the case of
antiparallel �-quadruplet, A and D were assigned to the paired residues with the
shorter edge between corresponding C�-atoms. Only 6 of totally 2344 �-quadruplets
have lengths of one of edges BC or DA more than 6�A. In the case of parallel
�-quadruplets vertices A and B were assigned to residues with smallest residue
numbers.
The set of non-�-quadruplets for negative control consists of all 1549 residue
quadruplets [i, j, k, l] from 88 domains such that (1) i and j, k, and l are neighbors in
E. Aksianov & A. Alexeevski
1241003-8
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
the chain: ji� jj ¼ 1, jk� lj ¼ 1; (2) there is no other pairs of neighbor residues
among [i, j, k, l]; (3) none of residues [i, j, k, l] was detected as �-structured by
DSSP algorithm; (4) lengths of edges BC and DA of a space quadrilateral of C�-
atoms defined as described in previous paragraph are less or equal to 6�A. Of the
1549 non-�-quadruplets, 788 are antiparallel and 761 parallel.
5.4. SCOP domain families
Several SCOP families were selected for examination of �-sheets definitions by
SheeP program (Table 2). Selected families are those for which core �-sheets are
explicitly described in SCOP definitions.
6. Results and Discussion
6.1. Statistics of quadruplets’ parameters for pairing detector
Our goal was to reproduce expert judgment on �-sheets in 3D protein structure. An
expert can detect �-sheets either by visualization of hydrogen bonds between
backbone nitrogen and oxygen atoms or just by observing the backbone represen-
tation of the structure (Fig. 1(b)). Several geometrical restrains on C�-atoms
location within pairs of consecutive �-quadruplets were investigated earlier.9,10
Here we analyzed statistics of geometrical parameters for C�-atoms from
�-quadruplets. These four C�-atoms form space quadrilateral [A, B, C, D] which is
approximately planar and has right angles. Lengths of edges AB and DC are fixed
(3:8� 0:03�A) because A and B (resp., D and C) are consecutive C�-atoms. Thus,
generally six independent parameters of a space quadrilateral are reduced to four for
�-quadrilaterals. We analyzed lengths of edges BC and DA, the internal angles of
the quadrilateral, the torsion angles corresponding to the edges (Fig. 2(a)).
(a) (b) (c)
Fig. 2. Schemes of quadrilaterals of C�-atoms. C�-atoms are shown by black balls. (a) Antiparallel
quadrilateral Q. Vertices A and B (resp., C and D) belong to consecutive residues. Angles A (the angle of
the triangle DAB) and D (the angle of the triangle CDA) are shown as well as torsion angle tor(DA) which
is an angle between planes ADC and ADB. (b) Scheme of two quadrilaterals from consecutive antiparallel
�-quadruplets. Vertical double-line connects C�-atoms of two residues linked by two hydrogen bonds.
Letters A and D (A’ and D’ for the second quadrilateral) are assigned to the shorter vertical edge. (c)
Scheme of two parallel quadruplets from consecutive parallel �-quadruplets.
A Tool for Description of �-Sheets in Protein 3D Structures
1241003-9
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
In the case of antiparallel �-quadruplets, the distribution of distances between
C�-atoms of paired residues is bimodal.9 Actually, this distance is a bit larger for
paired residues that have two hydrogen bonds than for paired residues having
no hydrogen bonds. We take this difference in account for statistical analysis by
appropriate identification of quadrilateral vertices (edge AD is assumed to be
shorter than BC in all antiparallel quadrilaterals, see Fig. 2(b))
The distributions of all these parameters are shown in Figs. 3(a)�3(c).
Let Q be a quadrilateral of C�-atoms, either antiparallel or parallel (see Fig. 2).
For each angle A, B, C, D of Q we compute p-value using the normal approximation
of the distribution of corresponding angles in �-quadruplets. A torsion angle cor-
responds to each edge of Q because Q is space (non-planar) quadrilateral (Fig. 2(a)).
We denote torsion angles by tor(AB), tor(BC), tor(CD), and tor(DA). Torsion
angles tor(AB) and tor(CD) (resp., tor(BC) and tor(DA)) of �-quadruplets are
highly correlated (data not shown). Due to this observation, we computed p-values
for average torsion angles (tor(AB) þ tor(CD))/2 and (tor(BC) þ tor(DA))/2 also
using normal approximation. Denote by � ¼ �(Q) minus common logarithm of the
product of all six probabilities. �(Q) could be considered as the distance of Q to ideal
�-quadruplet.
(a) (b)
(c) (d)
Fig. 3. Distributions of quadrilateral parameters. (a) Distances between C�-atoms from paired residues
of �-quadrilaterals. (b) Angles of �-quadrilaterals. (c) Torsion angles of �-quadrilaterals. (d) Parameter �
for �-quadrilaterals and for non-�-quadrilaterals.
E. Aksianov & A. Alexeevski
1241003-10
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
In Fig. 3(d), the distributions of the parameter � are shown for two sets of data,
quadrilaterals of �-quadruplets and quadrilaterals of non-�-quadruplets.
6.2. Comparison of ¯-sheets detected by various algorithms
Several variants of the algorithm were implemented in the SheeP program. Variants
differ in the secondary structure detector used at the initial step and the presence/
absence of the modification step. SheeP parameters were adjusted on the set of
302 protein domain structures representing the main �-sheet�based architectures
(see Sec. 5). All 302 structures were manually inspected and �-sheets in each of them
were defined.
Human judgment on �-sheets is often used to confirm the results of the secondary
structure detectors8 due to the absence of appropriately large data sets of structures
that can be considered as a gold standard definitions of �-sheets. For 42 structures of
the training and control sets, human decision was uncertain. For example, it was
not clear whether domain structure from PDB entry 1IKP, chain A, residues
395�606, is open barrel or sandwich. We know no way to approve any of possible
decision in these cases by verifiable arguments ��� except trusting to selected sec-
ondary structure detector, which does not seem to be correct. This obstacle forced
us to adjust algorithm is iterative manner, refining parameters using the training set
(88 domains) and check the results on the control set (214 domains). Thus, the
results of SheeP program on the control set of 214 structures cannot be considered as
an independent verification of the SheeP program.
For independent control, we used testing set of 157 �-sheet domains from 99
PDB entries (see Sec. 5.2). This set was selected after completing the adjustment of
program parameters. All structures were manually examined to detect �-sheets. To
minimize the human factor, we did not used SheeP algorithms during the analysis;
domains were visualized with RasMol using secondary structure definitions from
PDB entry headers. In 18 structures of 157, human judgment on �-sheets was
uncertain.
Mistakes during automatic �-sheet detection were identified for those structures
for which human decision on �-sheets was undoubted (260 domains in the training
and control sets, 139 in the testing set). We detected a mistake if (1) detected
number and compound of �-sheets contradicts human decision; (2) a �-sheet was
considered as a �-barrel (closed or open) by an expert, but automatically detected
�-sheet was not sufficient for detecting a �-barrel by a special program.
The results of several variants of the algorithm on the training, control, and
testing sets are shown in Table 1.
As shown in Table 1, mistakes of any algorithm variant were found in less than
15% of the analyzed structures. Note that only essential mistakes i.e. mistakes
preventing correct determination of the protein architecture, were considered.
Differences in a few residues of different �-sheet determinations were not taken into
account.
A Tool for Description of �-Sheets in Protein 3D Structures
1241003-11
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
SheeP modifier reduces the number of mistakes. Revision of cases showed that
the modifier actually improves �-sheet definitions in several cases. Examples are
shown in Fig. 4.
6.3. SheeP results for selected SCOP categories
We used SheeP program for automatic detection of �-sheets in a number of SCOP
categories (folds, superfamilies, families). From each analyzed SCOP category, 90%
non-redundant set of domain structures was selected.
Expert decisions on �-sheet compound and arrangements are not available
for each SCOP domain. Thus, we were not able to compare SheeP annotations
for all protein structures with independent expert judgments. This is why structures
were taken from those folds, for which �-sheets were defined in SCOP annota-
tions explicitly. Nevertheless, due to the variability of domain structures within
a fold, �-sheet definitions in all structures were confirmed and corrected manually,
if necessary. Only mistakes in core �-sheet definitions were taken into account
(Table 2).
As it is shown in Tables 1 and 2, SheeP modifier, in comparison with DSSP and
Stride detectors without the �-sheet modification step, reduces the number of
Fig. 4. Comparison of two algorithms, SheeP based on DSSP without ((a) and (c)) and with ((b) and
(d)) modifications. Different �-sheets are differently shaded. (a) and (b) SCOP annotated �-sandwich
from PDB entry 1CZT, chain A. (c) and (d) SCOP annotated �-barrel from PDB entry 1H3G, chain A. In
both structures SheeP modifier corrected DSSP otput.
Table 1. Comparison of automatic �-sheet detection with human judgment. Domains with uncertain
human judgment on �-sheets are not included.
Number of structures with mistakes
Number of
structures DSSP Stride
SheeP based
on DSSP
(default)
SheeP based
on Stride
SheeP based
on MakeSSP
Training set 71 12 12 8 8 12
Control set 189 13 15 6 9 11
TrainingþControl
260 25 (10%) 27 (10%) 14(5%) 17 (7%) 23 (9%)
Testing set 139 21 (15%) 20 (14%) 4 (3%) 7 (5%) 10 (7%)
E. Aksianov & A. Alexeevski
1241003-12
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
Table2.DSSP,Stride,
andSheePresultsforselected
foldsandfamilies.90%
non-redundantsetofstructureswereselected
from
each
studiedSCOPcategory.
Number
ofstructureswithmistakes
SCOPclassificationlevel
andID
Architecture
No.of
structures
DSSP
Stride
SheePbased
onDSSP
(default)
SheePbased
onStride
SheePbasedon
Mak
eSSP
Familyc.2.1.2
Rossmannfold
110
22
10
1
Familyc.1.8.1
TIM
barrels
53
22
21
63
16
Foldsb.80,b.81(30families)
�-helix
54
29
32
14
16
10
Foldsb.77,b.78(5
families)
�-prism
21
12
01
2
Superfamiliesb.1.18,b.18.1
(56families)
�-sandwich
174
20
32
64
17
Tota
l412
74(1
8%)
89(2
2%)
27(7
%)
24(6
%)
46(1
1%)
Table
3.DSSP,Stride,
andSheePresultsfordomainswithuncertain
humanjudgmenton�-sheets.
Number
ofstructureswithmistakes
Set
Architecture
No.of
structures
DSSP
Stride
SheePbased
onDSSP
(default)
SheePbased
onStride
SheePbasedon
Mak
eSSP
TrainingþControl
Different
42
21
45
5
Testingset
Different
18
00
11
0
Familyc.2.1.2
Rossmannfold
40
00
00
Familyc.1.8.1
TIM
barrels
00
00
00
Foldsb.80,b.81(30families)
�-helix
73
30
11
Foldsb.77,b.78(5
families)
�-prism
10
00
00
Superfamiliesb.1.18,b.18.1
(56families)
�-sandwich
70
00
12
Tota
l79
5(6
%)
4(5
%)
5(6
%)
8(1
0%)
8(1
0%)
A Tool for Description of �-Sheets in Protein 3D Structures
1241003-13
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
mistakes significantly. Based on these results, we consider \DSSP þ modifier"
variant of the algorithm as the most appropriate for the purpose of automatic
architecture detector.
6.4. SheeP results in cases of uncertain human decisions
Human judgment on �-sheets was uncertain for a number of structures. For these
structures, we fixed two or more admissible decisions and consider automatic
detection of �-sheets being correct if it met one of admissible decisions.
The problem of choosing decision on �-sheet compound could be demonstrated in
a domain from PDB entry 1JMX, residues 364�494 of chain A. This domain is
classified as immunoglobulin-like �-sandwich (seven strands, two sheets) in SCOP
database and as �-sandwich, immunoglobulin-like architecture, in CATH. Human
judgment revealed three clearly defined \sub-sheets." Sub-sheets 1�3 and 2�3 are
connected by hydrogen bonds between backbone oxygens and nitrogens. Two
hydrogen bonds between sub-sheets 2 and 3 do not form a regular pattern. The
sandwich upper layer is formed by sub-sheets 1 and 2 and sub-sheet 3 coincides with
the bottom layer.
Possible decisions, based on visual estimations, could be as follows. (1) There is
one �-sheet which includes all three sub-sheets. (2) There are two �-sheets, one of
which consists of sub-sheets 1 and 3 and another coincides with the sub-sheet 2. (3)
There are three �-sheets coinciding with the sub-sheets. The first decision contra-
dicts both the sandwich architecture and regularity of hydrogen bond patterns, the
second contradicts the sandwich architecture, and the third is in better accordance
with the sandwich architecture (which is common for immunoglobulin-like fold),
but rather good hydrogen bonds pattern between sub-sheets 1 and 3 is not taken
into account. We considered both decisions (2) and (3) as admissible.
Results of SheeP for domains with uncertain human decision are shown in
Table 3.
Table 3 demonstrates that in the majority of case outputs of SheeP algorithm
match one of human decisions on �-sheets.
It should be noted that human judgment on �-sheet number and compound was
uncertain in approximately 1/10 of cases (79 of 890, see Tables 1�3), showing the
hardness of the problem. We believe that considering �-sheets as structural units of
domain architecture within an automatic architecture detector would reduce the
uncertainty.
7. Conclusions
An algorithm for the detection of �-sheets in input 3D protein structure and rep-
resentation of each �-sheets by �-sheet map was developed. The �-sheet map con-
tains information about strands, pairing of residues and crests, irregularities and
sides of the �-sheet. It also allows describing adequately �-barrels, and other
E. Aksianov & A. Alexeevski
1241003-14
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
complicated �-sheets. The algorithm explores hydrogen bond patterns (on the first
step) as well as constrain on relative location of C�-atoms (on the step of �-sheet
modifications).
The algorithm is implemented in SheeP program and web service, which provide
additional visualization possibilities.
SheeP showed 93% of correct �-sheet in comparison with human judgment and
essential mistakes in 7% of structures. We believe that SheeP can be used for an
automatic architecture detection. Currently, the classification of protein structures
involves manual steps14 even in popular databases SCOP6 and CATH.15
Holistic view on �-sheets, implemented in �-sheet maps, simplifies the automatic
comparison of �-sheets from structures of relative proteins, which may be con-
sidered as �-sheet alignments.
Acknowledgments
This work was partially supported by RFBR grant 10-07-00685-a.
References
1. Znamenskiy D, Le Tuan K, Poupon A, Chomilier J, Mornon JP, Beta-sheet modeling byhelical surfaces, Protein Eng 13:407�412, 2000.
2. Chothia C, Hubard T, Brenner S, Barns H, Murzin A, Protein folds in the all-� and all-�classes, Annu Rev Biophys Biomol Struct 26:597�627, 1997.
3. Hutchinson EG, Thornton JM, HERA ��� A program to draw schematic diagrams ofprotein secondary structures, Proteins 8:203�212, 1990.
4. Kabsch W, Sander C, Dictionary of protein secondary structure: Pattern recognition ofhydrogen-bonded and geometrical features. Biopolymers 22:2577�2637, 1983
5. Frishman D, Argos P, Knowledge-based protein secondary structure assignment, Pro-teins 23:566�579, 1995.
6. Murzin AG, Brenner SE, Hubbard T, Chothia C, SCOP: A structural classification ofproteins database for the investigation of sequences and structures, J Mol Biol247:536�540, 1995.
7. Chan AWE, Hutchinson EG, Harris D, Thornton JM, Identification, classification, andanalysis of �-bulges in proteins, Protein Sci 2:1574�1590, 1993.
8. Cammack R, Attwood T, Campbell P, Parish H, Smith A, Vella F, Stirling J. (eds.)Oxford Dictionary of Biochemistry and Molecular Biology, Oxford University Press,USA, 2006.
9. Majumdar I, Krishna SS, Grishin NV, PALSSE: A program to delineate linear sec-ondary structural elements from protein structures, BMC Bioinformatics 6:202, 2005.
10. Martin J, Letellier G, Marin A, Taly J-F, de Brevern AG, Gibrat J-F, Protein secondarystructure assignment revisited: A detailed analysis of different assignment methods,BMC Struct Biol 5:17, 2005.
11. LiuW-M, Is there a M€obius band in closed protein �-sheets? Protein Eng 10:1373�1377,1997.
12. Sayle RA, Milner-White EJ, RasMol: Biomolecular graphics for all, Trends Biochem Sci20:374�376, 1995.
13. Jmol: An open-source Java viewer for chemical structures in 3D. Available at: http://www.jmol.org/.
A Tool for Description of �-Sheets in Protein 3D Structures
1241003-15
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.
14. Sam V, Tai CH, Garnier J, Gibrat JF, Lee B, Munson PJ., Towards an automaticclassification of protein structural domains based on structural similarity, BMC Bioin-formatics 9:74, 2008.
15. Orengo CA, Michie AD, Jones DT, Swindells MB, Thornton JM, CATH: A hierarchicclassification of protein domain structures, Structure 5:1093�1108, 1997.
Andrei V. Alexeevski received his Diploma and Doctoral
degrees, both in Mathematics, from Moscow State University
(MSU), Moscow, Russia, in 1970 and 1983, respectively. From
1989 up to now, he has been working at the Laboratory of
Mathematical Methods in Biology, A.N. Belozersky Institute
of Physico-Chemical Biology, MSU, where he is currently a Chief
of the laboratory. His research focus is in the field of bioinfor-
matics, computational biology, and theoretical mathematics.
Evgeniy Aksianov received his Diploma degree in Virology
from Moscow State University (MSU), Moscow, Russia, in 2003.
From 2004 up to now, he is at the Laboratory of Mathematical
Methods in Biology, A.N. Belozersky Institute of Physico-
Chemical Biology, MSU.
E. Aksianov & A. Alexeevski
1241003-16
J. B
ioin
form
. Com
put.
Bio
l. 20
12.1
0. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by M
CG
ILL
UN
IVE
RSI
TY
on
11/2
7/14
. For
per
sona
l use
onl
y.