[Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary...

28
3 Protein Structure Introduction Chapter 3 Notation SYMBOL DEFINITION C α α-Carbon τ dihedral angle φ {N–C α } rotation about peptide bond χ 1 χ 4 rotamer dihedral angles in amino acid sidechains ψ {C α –C}=O rotation about peptide bond ω C α 1 {C–N}–C α 2 rotation Life is the mode of existence of proteins, and this mode of exis- tence essentially consists in the constant self-renewal of the chemical constituents of these substances. Friedrich Engels, 1878 (1820–1895). 3.1 The Machinery of Life 3.1.1 From Tissues to Hormones The term “protein” originates from the Greek word proteios, meaning “pri- mary” or “of first rank”. The name was adapted by J¨ ons Berzelius in 1838 to emphasize the importance of this class of molecules. Indeed, proteins play T. Schlick, Molecular Modeling and Simulation: An Interdisciplinary Guide, 77 Interdisciplinary Applied Mathematics 21, DOI 10.1007/978-1-4419-6351-2 3, c Springer Science+Business Media, LLC 2010

Transcript of [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary...

Page 1: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3Protein Structure Introduction

Chapter 3 Notation

SYMBOL DEFINITION

Cα α-Carbonτ dihedral angleφ {N–Cα} rotation about peptide bondχ1–χ4 rotamer dihedral angles in amino acid sidechainsψ {Cα–C}=O rotation about peptide bondω Cα

1 –{C–N}–Cα2 rotation

Life is the mode of existence of proteins, and this mode of exis-tence essentially consists in the constant self-renewal of the chemicalconstituents of these substances.

Friedrich Engels, 1878 (1820–1895).

3.1 The Machinery of Life

3.1.1 From Tissues to Hormones

The term “protein” originates from the Greek word proteios, meaning “pri-mary” or “of first rank”. The name was adapted by Jons Berzelius in 1838to emphasize the importance of this class of molecules. Indeed, proteins play

T. Schlick, Molecular Modeling and Simulation: An Interdisciplinary Guide, 77Interdisciplinary Applied Mathematics 21, DOI 10.1007/978-1-4419-6351-2 3,c© Springer Science+Business Media, LLC 2010

Page 2: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

78 3. Protein Structure Introduction

crucial, life-sustaining biological roles, both as constituent molecules and astriggers of physiological processes for all living things. For example, proteins pro-vide the architectural support in muscle tissues, ligaments, tendons, bones, skin,hair, organs, and glands. Their environment-tailored structures make possible thecoordinated function (motion, regulation, etc.) in some of these assemblies.

Proteins also provide the fundamental services of transport and storage, suchas of oxygen and iron in muscle and blood cells. The first pair of solved proteinstructures hemoglobin and myoglobin, serve as the crucial oxygen carriers invertebrates. Hemoglobin is found in red blood cells and is the chief oxygen carrierin the blood (it also transports carbon dioxide and hydrogen ions). Myoglobin isfound in muscle cells, where it stores oxygen and facilitates oxygen movementin muscle tissue. The sperm whale depends on myoglobin in its muscle cells forlarge amounts of oxygen supplies during long underwater journeys.

Proteins further play crucial regulatory roles in many basic processes funda-mental to life, such as reaction catalysis (e.g., digestion); immunological andhormonal functions; and the coordination of neuronal activity, cell and bonegrowth, and cell differentiation.

Given this enormous repertoire, Berzelius could not have coined a better name!

3.1.2 Size and Function Variability

Protein molecules come in a wide range of sizes and have evolved many functions.The major classes of proteins include globular, fibrous, and membrane proteins.Globular proteins are among the most commonly studied group. Newly foundribosomal proteins form a characteristic class of proteins that can be ordered asglobular proteins, with disordered extensions.

To suit their environment and function, fibrous proteins (e.g., the collagen mol-ecule in skin and bones), which are generally insoluble in aqueous environments,are extended in shape, whereas globular proteins tend to be compact. Collagenis a left-handed helix with a quaternary structure made of collagen fibrils aggre-gated in a parallel superhelical arrangement. See [680] for the crystal structureof a collagen-like peptide with a biologically relevant sequence (also shown inFigure 3.9) and summary of collagen structures elucidated to date. The globu-lar protein myoglobin (see Figure 3.12) is highly compact, organized as 75%α-helices. Similarly, hemoglobin is a tetramer composed of four polypeptidechains held by noncovalent interactions; each subunit of hemoglobin in humans isvery similar to myoglobin. Both proteins bind oxygen molecules through a centralheme group.

There certainly are some very large proteins such as the muscle protein titin ofabout 27000 amino acids (and mass of 3000 kDa), but the average protein con-tains several hundred residues. The size of polypeptides can be determined fromgel electrophoresis experiments: the rate of migration of the molecule is inverselyproportional to the logarithm of its length. The mass of a polypeptide or proteincan be estimated from mobility-to-mass relationships established for referenceproteins and by mass spectrometry measurements. Equilibrium ultracentrifuga-

Page 3: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.1. The Machinery of Life 79

R

H

NH3+ COO−

HR

NH3+ COO-

Figure 3.1. (left) The general formula for an amino acid, and (right) the spatial tetrahedralarrangement of an L-amino acid. The mirror image, a D-amino acid, is rare in proteins inNature, if it exists.

tion [197] is another favored technique for determining various macromolecularfeatures, including molecular weight, on the basis of transport properties.

3.1.3 Chapter Overview

This chapter introduces the bare basics in protein structure: the amino acidbuilding blocks, primary sequence variations, and the framework for describingconformational flexibility in polypeptides. Included also is an introduction to themore advanced topic of sequence similarity and relation to structure, in the sec-tion on variations in protein sequences. Students are encouraged to return to thissubsection after reading Chapters 3 and 4. Chapter 4 continues to describe sec-ondary, supersecondary, and tertiary structural elements in proteins, as well asprotein classification.

The protein treatment in these chapters is brief in comparison to the minituto-rial on nucleic acids. Readers should consult the many excellent texts on proteinstructure (see books listed in Appendix C), like that by Branden and Tooze, andreview introductory chapters in biochemistry texts like Stryer’s. The 1999 text byFersht [394], in particular, is a comprehensive description of the state-of-the-art inprotein structure, and also reviews recent advances and insights from theoreticalapproaches.

Box 3.1: Water Structure

Water — that deceptively simple molecule composed of two hydrogens and one oxygen —displays highly unusual and complex properties that are far from fully understood [788].Perhaps because of those properties — like the contraction of ice when it melts, largeheat of vaporization, and large specific heat — water is the best of all solvents and afundamental substance to sustain life. Solvent organization and reorganization are crucialto the stability of proteins, nucleic acids, saccharides, and other molecular systems and

Page 4: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

80 3. Protein Structure Introduction

affects functional motions profoundly [384]. The energetic and kinetic aspects of waterstructure are difficult to pinpoint by experiment and simulation because of the range oftimescales associated with water motions, from the fast perturbations of order 0.1 ps to theslow proton exchanges of millisecond order.

Important to the understanding of solvation structure and dynamics in the vicinity ofmacromolecules is the tendency of water to form hydrogen bonds [1018] (see also Box 3.2for a definition of a hydrogen bond). In ice, the ordered crystal structure of water mole-cules, each oxygen is surrounded by a tetrahedron of four other oxygen atoms, with onehydrogen between each oxygen pair. In liquid water, many water molecules are engagedin such a hydrogen-bonded network, but the network is highly dynamic, with hydrogen-bonded partners changing rapidly.

This local organization of liquid water can easily be observed from experimental andcomputed radial distribution functions (e.g., O–O and O–H distances), which reflect thedegree of occupancy of neighbors from a central oxygen or hydrogen molecule. Thus, forexample, the highest peak in the O–O radial distribution function at a distance of about2.9 A corresponds to the first solvation shell, in which the four near-neighbor oxygens ofthe central oxygen molecule can be found at room temperature.

Figure 3.2 illustrates the structure of water clusters as computed by minimizing the poten-tial energy composed of bond length, bond angle, and intermolecular Coulomb and van derWaals terms (see Chapter 9 for energy terms discussion). Such hydrogen bonds form ubiq-uitously in the environment of biomolecules. Water molecules penetrate into the grooves ofnucleic acid helices, aggregate around hydrophilic, or water-soluble, segments of proteins(which cluster at the protein surface) and stabilize solute conformations through varioushydrogen bonds and bridging arrangements. The dynamic nature of both water structureand biomolecules gives rise to the concept of hydration shells; see Chapter 6 in the con-text of DNA. That is, the solvent structure around the solute is multilayered, with the firsthydration shell associated with water molecules in direct contact with the solute and theoutermost layer as the bulk solvent.

Box 3.2: Hydrogen Bonds

A hydrogen bond is an attractive, weak electrostatic (noncovalent) bond [1018]. It formswhen a hydrogen atom covalently binds to an electronegative atom and is electrostaticallyattracted to another (typically electronegative) atom. The atom to which the hydrogen atom(H) is covalently bound is considered the hydrogen donor (D), and the other atom is thehydrogen acceptor (A). Thus, the hydrogen bond is stabilized by the Coulombic attractionbetween the partial negative charge of the A atom and the partial positive charge of H. SeeFigure 3.2 for examples in water clusters.

In biological polymers, the donor and acceptor atoms are either nitrogens or oxygens. Forexample, in protein helices and sheets, the D–H · · · A sequence is N–H · · · O=C. In thenucleic-acid base pair of adenine–thymine, the two D–H · · · A sequences are N–H · · · Oand N–H · · · (see Nucleic Acid chapters for details). Non-classical, weaker hydrogen

Page 5: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.1. The Machinery of Life 81

Figure 3.2. Structures of water clusters of 2, 4, 8, 16, and 125 molecules as minimized withthe CHARMM force field from initial coordinates computed in [1120]. Two geometries areshown for both the 8 and 16-molecule clusters; the lower energy structures (by roughly 4and 6%, respectively) are associated with the more compact, cube-like shapes (left sidein both cases). The tetrahedral structure of water is apparent in the inner molecules of thelarger systems, where each molecule is hydrogen bonded to four others.

Page 6: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

82 3. Protein Structure Introduction

bonds have been noted in biological systems (e.g., protein/DNA complexes), involving acarbon instead of one electronegative atom: C–H · · · O [820].

The strength of a hydrogen bond can be characterized by two geometric quantities whichgovern the hydrogen bond energy: colinearity of the D–H · · · A atoms, and optimal H · · · A(or D · · · A) distance. The ideal, strongest hydrogen bond often has its three atoms colin-ear. The strength of a hydrogen bond is several kilocalories per mole, compared to about0.6 kcal/mol for thermal energy at room temperature, but the exact value remains uncertain(e.g., [319]). However, the formation of a network of hydrogen bonds in macromolecularsystems leads to a cooperative effect that enhances stability considerably [1018].

3.2 The Amino Acid Building Blocks

Proteins and polypeptides are composed of linked amino acids. That amino acidcomposition of the polymer is known as the primary structure or sequence forshort.

3.2.1 Basic Cα Unit

Each amino acid consists of a central tetrahedral carbon known as the alpha (α)carbon (Cα) attached to four units: a hydrogen atom, a protonated amino group(NH+

3 ), a dissociated carboxyl group (COO−), and a distinguishing sidechain, orR group (see Figure 3.1).

This dipolar or zwitterionic form of the amino acid (COO− and NH+3 ) is typi-

cal for neutral pH (pH of 7). The un-ionized form of an amino acid correspondsto COOH and NH2 end groups. Different combinations involving ionized/un-ionized forms for each of the side groups can occur for the amino acid dependingon the pH of the solution.

C1α

C1α C2

α

R1

R1 R2

NH3+

NH3+

H

COO − COO −

COO −

C2α

R2

NH3+

H

+

H2O +

H

C N

H

H

O

Figure 3.3. Formation of a dipeptide by joining two amino acids.

Page 7: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.2. The Amino Acid Building Blocks 83

The tetrahedral arrangement about Cα makes possible two mirror images forthe molecule. Only the L-isomer (“left-handed” from the Latin word levo) is aconstituent of proteins on earth (see Figure 3.1). This asymmetry is not presentlyunderstood, but one explanation is that this imbalance is related to an asymmetryin elementary particles.

3.2.2 Essential and Nonessential Amino Acids

There are 20 naturally-occurring amino acids.1 Among them, humans can syn-thesize about a dozen. The remaining 9 amino acids must be ingested throughour diet; these are termed essential amino acids. They are histidine, isoleucine,leucine, lysine, methionine, phenylalanine, threonine, tryptophan, and valine.

Meat eaters need not be too concerned about a balanced diet of those nutri-ents, since animal flesh is a complete source of the essential amino acids. Incontrast, vegetarians, particularly vegans — who omit all animal products likeeggs and dairy in addition to meat, poultry, and fish — must perform a deli-cate balancing act to ensure that their bodies can synthesize all the basic proteinsessential to good health. See Box 3.3 for a discussion of essential amino acids andbalanced diets.

Box 3.3: Protein Chemistry and Vegetarian Diets

Vegetarians must take care to combine foods from three basic groups which are com-plementary with respect to their supply of these amino acids: (a) rice and grains likeoats, wheat, corn, cereals, and breads; (b) legumes and soy-products; and (c) nut prod-ucts (cashews, almonds, and various nut butters). Peanuts are technically members of thelegume family rather than nuts.

Notable vegetarian combinations are: rice or grains (low in or lacking isoleucine, lysine,and threonine) with beans (rich in isoleucine and leucine and, in the case of lima beans,also lysine); cereals with leafy vegetables; corn, wheat, or rye (low in or lackingtryptophan and lysine) with soy protein/soybeans (rich in isoleucine, tryptophan, lysine,methionine, and valine); corn with nuts or seeds (rich in methionine, isoleucine, andleucine); bread/wheat with peanut butter (rich in valine and tryptophan); and potatoes(limited methionine and leucine) with onions, garlic, lentils and egg or fish (if permit-ted), all of which are rich in methionine.

A classic Native-American dish of acorn squash stuffed with wild rice, quinoa, and blackbeans is a superior mixture of nutrients. The plant quinoa, prepared like a grain, is actuallya fruit and moreover a complete protein (rich in lysine and other amino acids), as well asrich in vitamins E and B, fiber, and the minerals calcium, phosphorus, and iron. Nuts alsocontain several vitamins and minerals that protect against heart disease, like folate, andcalcium, magnesium, and potassium, which also protect against high blood pressure.

1At least two nonstandard by naturally occurring amino acids in certain Archaea and eubacteriaare known, pyrrolysine and selenocysteine [68, 511].

Page 8: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

84 3. Protein Structure Introduction

Contrary to an existing myth, such food group combinations need not be eaten at the samemeal to guarantee a complete source of the essential amino acids; a daily approach suffices.Given all these food sources, vegetarians — even vegans who carefully comply — will notbe deficient in protein. However, nutrients that present a challenge to vegans and lacto-ovovegetarians are vitamin B12 (deficiencies of which can cause nerve damage), found infortified cereals, and zinc (needed for protein synthesis, healing wounds, and immunity),available in fortified cereals, soy-based foods, and dairy products.

Given the intimate relationship between protein chemistry and good nutrition, readers ofthis text should be healthy as well as smart!

Interestingly, the requirements for vegetarian diets are now of crucial inter-est to NASA researchers: future astronauts who will spend extended periods oftime in space stations (on Mars, Jupiter, or the Moon) will have to depend onhydroponically-grown plant crops for nearly all their protein requirements, as wellas vitamins, minerals, and fiber. Research is now in progress on how best to se-lect a limited set of plants that can adapt to growing in nutrient-enriched water(rather than soil) and in small spaces. At the same time, this selection must meetthe basic dietary requirements of space-station scientists, as well as satisfy theirculinary taste and demand for variety [173].

H N

H

Ciα

H

C

O

OH

Ri n

Figure 3.4. The repeating formula for a polypeptide.

NCH

H2C

CCH

+NH3

CH2

C

O_

O

O

H

COO

CH3

Aspartame

OC N

H

Figure 3.5. The dipeptide aspartame.

Page 9: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.2. The Amino Acid Building Blocks 85

3.2.3 Linking Amino Acids

A polypeptide is formed when amino acids join together. Namely, the carboxylcarbon of one amino acid joins the amino nitrogen of another amino acid to formthe peptide (C–N) bond with the release of one water molecule (Figure 3.3).The general repeating formula for a polypeptide is shown in Figure 3.4. Whenthe amino acid residue is proline, its Cα is linked to the nitrogen of the peptidebackbone through the proline ring.

A model of aspartame, a dipeptide of aspartic acid and phenylalanine, isshown in Figure 3.5. It was discovered accidentally in 1965 by a ‘careless’ chemistwho licked his fingers during his laboratory work. To his surprise, a substance100–200 times sweeter than sucrose was discovered. Because it is a kind ofprotein, aspartame is metabolized in the body like proteins and is a source ofamino acids. (This should not, however, be taken as an endorsement for diet softdrinks as a source of nutrients!)

The synthesis of polypeptides and proteins occurs in a cellular structure, theribosome, in vivo. Synthesis in vitro is facile for 100–150 residues but much moreinvolved for longer chains.

3.2.4 The Amino Acid Repertoire: From Flexible Glycineto Rigid Proline

The chemical formulas of the twenty L-amino acids are shown in Figure 3.6,with the corresponding space-filling models shown in Figure 3.7. The commonlyused three-letter abbreviation for each acid is illustrated, as well as a grouping intoamino acid subfamilies. A one-letter mnemonic is also used to identify sequencesof amino acids, as shown in Table 3.1.

A broader classification than indicated in the figures consists of the followingthree groups:

• NPo: amino acids with strictly nonpolar (hydrophobic, or water insoluble)side chains:Ala, Val, Leu, Ile, Phe, Pro, Met, Gly, Trp, Tyr;

• CPo: amino acids with charged polar residues:Asp, Glu, His, Lys, Arg; and

• UPo: amino acids with uncharged polar side chains:Ser, Thr, Cys, Asn, Gln.

Each amino acid has a unique combination of properties — size, polarity, cyclicconstituents, sulfur constituents, etc. — that critically affects the noncovalent andcovalent (i.e., disulfide bonds) interactions that form and stabilize protein three-dimensional (3D) architecture. These interactions originate from electrostatic,van der Waals, hydrophobic, and hydrogen bonding forces. These properties aredescribed in turn for these amino acid classes.

Page 10: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

86 3. Protein Structure Introduction

ALIPHATIC SIDE CHAINS (NPo)

ALIPHATIC HYDROXYL SIDE CHAINS (UPo) SECONDARY AMINO GROUP (NPo)

ACIDIC SIDE CHAINS AND THEIR AMIDE DERIVITIVES (CPo - Asp, Glu; UPo - Asn, Gln)

SULFUR-CONTAINING SIDE CHAINS (NPo)

BASIC SIDE CHAINS (CPo)

AROMATIC SIDE CHAINS (NPo but potential for polarity)

Glycine (Gly) Valine (Val)Alanine (Ala)

Isoleucine (Ile)Leucine (Leu)

Aspartic Acid (Asp)

Proline (Pro)Threonine (Thr)Serine (Ser)

Glutamic Acid (Glu)

Methionine (Met)

Asparagine (Asn) Glutamine (Gln)

Cysteine (Cys)

Lysine (Lys)Arginine (Arg)

Histidine (His)

Phenylalanine (Phe) Tyrosine (Tyr) Tryptophan (Trp)

+H3N C

H

H

COO

+H3N

H3N

C CH3

CH3

CH3CH2

CH3

CH3

CH

H

COO

COO

+H3N C

H

COO

CH2+ C

H

+H3N C

H

COO

CHCH3

CH3

+H3N +H3N

+H3N

+H2N

COO

COO COO

COO

COO COO

CH2

CH2

+H3N

+H3N

+H3N

+H3N

+H3N

+H3N

+H3N

+H3NNH3

+H3N

+H3N

+H3N

COO

COO

COO

COO

COO

COOCOO

COO

COO

CH2

CH2

CH2

CH2

CH2 CH2 CH2

CH2

CH2

CH2

CH2 CH2

CH2

CH2 CH2

CH2 CH2

CH2 CH2

CH2 CH2

H2

CH3

C

H

C

H

C

H

OH

C

H

C

H

C

H

C

H

C

C

O

O

O

NH2 NH2

C

C

O

O

O

C

H

OH

C

H

C

H

C

H

C

H

C

H

C

H

C

H

C

H

SH

+

NH C

C

C CC

NH+

CH

NH

HC

NH2

NH2

CH

CH

CH

HC

HC

CH

CH

C

HC

HC

HCNH

C

C

CH

CH

CH

HC

S

OH

C

C H

CH3

Figure 3.6. The chemical formulas of the 20 natural amino acids as found in neutral pH(pH of 7). The acronyms NPo, UPo, CPo denote, respectively, nonpolar, uncharged polar,and charged polar amino acids.

Page 11: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.2. The Amino Acid Building Blocks 87

Figure 3.7. Space-filling models of the 20 amino acids.

Aliphatic R: Gly, Ala, Val, Leu, Ile

Glycine, alanine, valine, leucine, and isoleucine can be classified as nonpo-lar. Glycine, the simplest of the amino acids, is first in the aliphatic-sidechainsubgroup. Each member in this family has a sidechain (R) which increases in

Page 12: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

88 3. Protein Structure Introduction

bulk and branching design. Glycine is most flexible and hence an important con-stituent of proteins. For example, glycine is a major component of the α-helix ofthe protein α-keratin, which makes up hair and wool, as well as the β-sheet of thepolypeptideβ-keratin, which is silk (see Figure 3.9). Since the increasing aliphaticsubstitutions in this family increases the bulkiness of the amino acid, the overallconformational flexibility correspondingly decreases within a polypeptide. How-ever, the conformational variability of each of these amino acids increases due tothe rotameric variations of the amino acid (roughly, different 3D arrangementsabout central bonds within the sidechain — see Section 3.4).

Rigid Proline

Proline is a nonpolar amino acid as well. In contrast to glycine residues, which al-low a great deal of conformational flexibility about the backbone (i.e., wide rangeof sterically-permissible rotations φ and ψ about the peptide bond — see Sec-tion 3.4), flexibility in proline residues is largely limited, due to the cyclic natureof its sidechain.

Aliphatic Hydroxyl R: Ser, Thr

Serine and threonine contain aliphatic hydroxyl groups and are considereduncharged polar, capable of forming hydrogen bonds (see Box 3.2).

Acidic R and Amide Derivatives: Asn, Gln, Asp, Glu

Similarly, asparagine and glutamine possess amide groups and are also consid-ered uncharged polar with potential for hydrogen bond formation. Their acidicanalogs, aspartic acid and glutamic acid, are negatively charged (intrinsic pH ofaround 4) and thus considered charged polar, but the polar end of their sidechainsis separated from Cα by hydrophobic CH2 groups.

Basic R: Lys, Arg, His

Lysine, arginine, and histidine have basic sidechains and are thus in the chargedpolar category of amino acids. Lysine and arginine — the longest amino acids —are positively charged at physiological concentrations (that is, sidechain pH of10–12), whereas histidine’s charge can be both positive or negative depending onits environment. This duality in histidine stems from its imidazole ring, which isin the physiological range of pH. For this reason, histidine residues serve as goodmetal binders and are often found in the active sites of proteins.

Aromatic R: Phe, Tyr, Trp

The amino acids with aromatic sidechains — phenylalanine, tyrosine, and trypto-phan — have significant potential for electrostatic interactions due to an electrondeficit in the ring hydrogen atoms. Phenylalanine is highly hydrophobic while the

Page 13: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.3. Sequence Variations in Proteins 89

other two can be considered mildly hydrophobic, since their aromaticity is juxta-posed with polar properties (hydroxyl group of tyrosine and indole-ring nitrogenof tryptophan). The aromatic rings of this amino acid family also have poten-tial for electron transfer. They can all be classified as nonpolar, though the mildhydrophobicity of tyrosine often warrants its classification as an uncharged polaramino acid.

Sulfur-Containing R: Met, Cys

Finally, nonpolar cysteine and methionine contain sulfur in their sidechains andare thus hydrophobic. Cysteine, in particular, is very reactive and binds to heavymetals. It has an important role in protein conformations through its unique abilityto form disulfide bonds between two cysteine residues. Disulfide bonds are cova-lent but reversible and are thought to be important in many cases by directing aprotein to its native structure and maintaining this functionally-important state.

L A G V I P F Y M W S T N Q C E R K D H0

2

4

6

8

10

Residue

Fre

quen

cy in

pro

tein

s (%

)

Nonpolar Uncharged

Polar

Charged

Polar

Figure 3.8. Amino acid frequencies as computed from 45,137 proteins collected from 15taxa representing the three kingdoms of life (Bacteria, Archaea, and Eukaryota) [615]. SeeTable 3.1 for the frequencies and key to the one-letter amino acid abbreviations.

3.3 Sequence Variations in Proteins

From the constituent library of twenty natural amino acids, 20N sequence com-binations for an N -residue peptide are possible, an enormous number when N isseveral hundred. However, natural evolution has favored certain sequences morethan others. Sequence similarity is an important factor in indicating common evo-lutionary ancestry of proteins, as discussed below. It is therefore widely used asa tool for classifying proteins into families, as well as for relating sequence tostructure and predicting structure from sequence (homology modeling [16,79], asintroduced in Chapter 2).

Page 14: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

90 3. Protein Structure Introduction

3.3.1 Globular Proteins

In most proteins, the twenty amino acids occur at roughly similar frequencies.Notable exceptions occur for certain amino acids like methionine, which is fre-quently found at the N-terminus of the peptide since it serves as the amino acidinitiator of synthesis, or special groups of proteins, such as membrane or fibrousproteins.

Table 3.1 shows the frequency of occurrence of amino acid residues in thePDB40 dataset of 971 domains of unrelated proteins with a sequence identityof 40% or less [959], and Figure 3.8 displays the data as histograms. We see thatnonpolar Ala and Leu (boldfaced entries in the table) have the highest percentages(above 8%) within the representative protein database. The lowest frequencies(4% and below) occur for Trp (aromatic sidechain), Cys (sulfur-containing side-chain), His and Met (the other sulfur-containing sidechain), Tyr and Phe (aromaticsidechains also), and Gln.

Table 3.1. Amino acid frequencies in proteins based on the data of [615] which analyzed45,137 proteins from 15 taxa. Bold and italics types are used, respectively, for the highest(>8%) and lowest (≤ 2.5%) frequencies.

Amino Acid Freq. [%]

Alanine (Ala, A) 8.1Arginine (Arg, R) 5.1Asparagine (Asp, D) 5.2Aspartic acid (Asn, N) 4.0Cysteine (Cys, C) 1.2Glutamine (Gln, Q) 3.8Glutamic acid (Glu, E) 6.5Glycine (Gly, G) 7.2Histidine (His, H) 2.2Isoleucine (Ile, I) 6.8Leucine (Leu, L) 10.3Lysine (Lys, K) 5.9Methionine (Met, M) 2.5Phenylalanine (Phe, F) 4.2Proline (Pro, P) 4.3Serine (Ser, S) 6.2Threonine (Thr, T) 5.1Tryptophan (Trp, W) 1.1Tyrosine (Tyr, Y) 3.2Valine (Val, V) 6.9

3.3.2 Membrane and Fibrous Proteins

Membrane proteins are embedded in a dynamic lipid bilayer environment, wheremobility is more restricted. They therefore have more hydrophobic residues than

Page 15: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.3. Sequence Variations in Proteins 91

globular proteins, which favor polar groups on the exterior surface. Since mem-brane proteins are particularly difficult to crystallize, simulation work is especiallyimportant in this area to understand their detailed function.

Fibrous, or structural, proteins tend to have repetitive sequences. The triple-stranded collagen helix, for example, is composed of repeating triplets whichinclude glycine as the first residue and often proline as one or both of the otherresidues of the triplet. A model of collagen is shown in Figure 3.9.

Since collagen is needed to rebuild joint cartilage, there are important prac-tical applications to skin and bone ailments. For example, a gelatin-containing(glucosamine and calcium-enriched) powdered drink mix called Knox NutraJointis being touted as a dietary supplement that helps maintain healthy joints andbones. (‘Juice Your Joints’ touts an ad featuring an athletic sexagenarian waterskier). Gelatin is rich in two amino acids, glycine and proline, that make upcollagen.) Even though our bodies make these two amino acids, manufacturersclaim that this gelatin-containing supplement may be helpful in decreasing theprogression of osteoarthritis, a condition caused by cartilage deterioration.

Collagen model (1CLG) (Gly−Pro−Pro) units12

3Silk fibroin model (2SLK) (Gly−Ala) units

Triple−helical collagen−like peptide (1BKV)Ile−Thr−Gly−Ala−Arg−Gly−Leu−Ala−Gly−Pro−Hyp−Gly[capped by (Pro−Hyp−Gly) on both ends]3

Figure 3.9. Models of the fibrous proteins collagen (triple helix) and silk, along with acrystallographically-determined collagen-like peptide (Hyp denotes hydroxyproline).

Another use of collagen is in a skin product used to heal wounds such as fromvenous skin ulcers, burns, and skin surgery. In May 1998 the FDA approvedApilgraf, a product for treating venous skin ulcers made of human skin cellsmixed with collagen cells from cattle.

Page 16: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

92 3. Protein Structure Introduction

Silk is another example of a fibrous protein with a repetitive sequence. Theproduct of many insects and spiders, silk is the polypeptide β-keratin composedlargely of glycine, alanine, and serine residues, with smaller amounts of otheramino acids such as glutamine, tyrosine, leucine, valine, and proline. The softness,flexibility, and high tensile strength of silk stems from its unique arrangement ofloose hydrogen bonding networks in the form of β-sheets connected by β-turns, amixture of both highly-ordered and less densely-packed regions. Figure 3.9 showsa model of the repetitive β-sheet network of silk (without connecting regions).

3.3.3 Emerging Patterns from Genome Databases

As genome sequencing projects are completed, interesting findings on enzymesequences also emerge. For example, the genome of the tuberculosis bacterium(completed in 1998 by the Wellcome Trust Genome Campus of the SangerInstitute in collaboration with the Institut Pasteur in Paris) revealed surprisinglythat, unlike other bacteria, repetitive gene families of glycine-rich proteins existin M. tuberculosis; these approximately 10% of the enzyme-coding sequences areassociated with gene families involved in anaerobic respiratory functions.

3.3.4 Sequence Similarity

Sequence Similarity Generally Implies Structure Similarity

As mentioned above, sequence similarity generally implies structural, func-tional, and evolutionary commonality. Thus, for example, if we were to scanthe Protein Databank (PDB) randomly and find two proteins with low se-quence identity (say less than 20%), we could reasonably propose that they alsohave little structural similarity. Such an example is shown in Figure 3.12 for thecytochrome/barstar pair. Similarly, large sequence similarity generally impliesstructural similarity (see introduction in 2.1.2 of Chapter 2).

In general, small mutations (e.g., single amino acid substitutions) are welltolerated by the native structure, even when they occur at critical regions of sec-ondary structure. The small protein Rop (Repressor of primer), which controlsthe mechanism of plasmid replication, provides an interesting subject to both thissequence-implies-structure paradigm, and to exceptions to this rule (discussedbelow).

Rop is a dimer, with each monomer consisting of two antiparallel α-helicesconnected by a short turn; it dimerizes to form a four-helix bundle as activeform, as shown in Figure 3.10. (Fold details and motifs are discussed in the nextchapter). Recall that Rop was used as the basis for solving Paracelsus challenge(Chapter 2) because the α-helix motif was thought to be quite stable.

The high stability of Rop emerged surprisingly from experiments of Castagnoliet al. [205]. When these researchers deleted just a few residues in a key turn regionthat produces the overall bundle fold in the native Rop structure, they expected onelong contiguous helix to form. Instead, their tinkering produced a small variation

Page 17: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.3. Sequence Variations in Proteins 93

N

NC

C

C N

CN

Ala−Asp31−Ala

Ala−Asp31−Ala

C

C

N

NGly1

Gly30

Gly30Gly1

C

C

N

N Pro31

Pro31

Single mutation:entire topology altered

Minor mutation(overall structurepreserved)

Minor insertion

(overall structurepreserved)

ROP wild−type(1RPR)(left−handed anti−parallel bundle)

ROP mutant (1RPO) (Asp31 Ala−Asp31−Ala)

ROP mutant (1GTO) ROP mutant (1B6Q) (Met1 Gly, Asp30 Gly) (Ala31 Pro)

(right−handed mixed bundle)

a b

c d

Figure 3.10. The protein Repressor of Primer (63 residues per monomer) provides in-teresting examples of the paradigm of structure inference by sequence similarity: thefour-helix bundle motif of the wildtype (a) can be both structurally stable, i.e., resistantto mutations — as shown by the two variants in (b) and (c) — or structurally fragile andhighly sensitive to mutations, caused by proline substitution at the turn region — as shownin (d), a mutant with an entirely different topology [462]. In (b), two Ala residues wereinserted at both sides of the amino acid Asp in the loop region. In (c), the Asp residue con-necting the two α-helices of each monomer was mutated to Gly, and Met1 was changed toGly [205].

Page 18: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

94 3. Protein Structure Introduction

of the original bundle motif. Apparently, the four-helix bundle motif is so stablethat a new turn was formed from residues that used to be part of the α-helix back-bone! Thus, the original bundle motif, though slightly smaller, was maintainedin the mutants. This is seen in Figure 3.10, which displays the wildtype enzymestructure (a) and those of two mutants in the above cited study (b and c).

Though this experiment supports the general notion that protein structuresare remarkably stable to tinkering (mutations), we emphasize that functionalproperties of proteins are fragile and quite sensitive to sequence changes.

Exceptions Exist

There are many exceptions, however, to this simple sequence/structure/functionrelationship.

Namely, examples exist where despite large sequence similarity there is smallstructural and functional similarity. A classic example of this relationship is thedisease sickle-cell anemia, where a minute substitution in sequence leads to al-tered function with devastating consequences. This abnormality results from thereplacement of the highly-polar glutamate residue in hemoglobin by the nonpo-lar amino acid valine. This key substitution at the surface of the protein leads toan entirely different quaternary structure for this multidomain red-blood pigmentprotein. This is because the markedly altered structure affects the solubility ofoxygenated hemoglobin and leads to a clumping of the deoxygenated form of themolecule (HbS instead of HbA).

Conversely, examples exist where despite small sequence similarity there islarge structural, and even functional and evolutionary similarity. A classic ex-ample for this relationship is the myoglobin/hemoglobin pair of proteins (seeFigure 3.12). These proteins only share 20% of the sequence. However, asoxygen-carrying molecules, they share structural, functional, and evolutionarysimilarity. Proteins in the calmodulin family are also known to display a greatdeal of structural variability for similar sequences [664] (see Figure 3.11).

More generally, changes in 3D architecture (despite a nontrivial degree ofsequence similarity) can result from a variety of factors, as follows.

• Mutations in critical regions of the proteins, such as active sites and ligandbinding sites, can change 3D structures dramatically. Such an example isshown for the pair of immunoglobulins in Figure 3.12.

• Mutations in regions that connect two secondary-structural elements canalso be responsible for structural divergence, as in the helix-loop-helixmotif of the EF-hand family, and the connecting loops in helix bundles.

Figure 3.11 illustrates this principle for the two EF-hand calcium-bindingproteins calmodulin and sarcoplasmic calcium-binding protein: one isoverall extended in shape while the other is more compact [979].

Helix bundles are sensitive to mutations in loop or turn regions that connectdifferent helices, to the extent that a single amino acid substitution (alanine

Page 19: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.3. Sequence Variations in Proteins 95

to proline) can change the topology of a homodimeric 4–helical bundle pro-tein from the canonical left-handed all-antiparallel form to a right-handedmixed parallel and antiparallel bundle [462]. Figure 3.10(d) shows this dif-ferent resulting topology of the Rop four-helix bundle subject to the singlemutation Ala31→Pro at the turn region.

• Structural variations can be observed in the same system determined atdifferent environmental conditions such as solvent or crystal packing. Thesame T4-lysozyme mutants in Figure 3.12 (100% sequence similarity) dis-play intriguing mobility, adopting 5 different crystal conformations [374]due to a hinge bending motion.

• Multidomain proteins can adapt quaternary structures that depend sensi-tively on the number of subunits and/or on the sequence.

D

B

C

A

EF hands

EF hands

A B

C D

111 118 145 148

1 6 38 45

77

82

C

D

BA

Calmodulin (3cln),elongated form

Sarcoplasmic calcium-bindingprotein (2scp)

EF hands

EF hands

1 3 38 43 82

89 122 130 159 174 A B

C D

Figure 3.11. Structural variability despite large similarity in the protein secondary-struc-tural elements is illustrated for two calcium-binding proteins — calmodulin (148 residues)and sarcoplasmic calcium binding protein (174 residues) [979] — due to different over-all 3D arrangement of the shared motifs. Though sharing only 30% of the sequence, bothproteins are made of two repeating units, each consisting of two EF hand motifs. Eachhand motif contains two helical regions surrounding a calcium binding loop (crystal-boundcalcium atoms are rendered as large spheres; only three are bound to 2scp).

Page 20: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

96 3. Protein Structure Introduction

Leu 79Leu 13

A’

A

B and B’

Immunoglobulins8FAB:A, strands in red1DCL:B, strands in blueAll β sandwich, Greek key

Myogloblin (5MBA, yellow)Hemoglobin (1ASH)All−α, globin−like, folded leaf

Cytochrome C6 (1CTJ) all−α, folded leafBarstar (1BTA) α/β, 3 parallel β strands

Mutant T4 lysozymes (150L) Wild−type likeHinge angle 32�, yellowMainly α, lysozyme−like

Seq sim 100%, RMS 2 A Seq sim 75%, RMS 9 Aoo

o Seq sim 3%, RMS 7 A Seq sim 12%, RMS 1.8 A

oa b

c d

Figure 3.12. Various examples of sequence/structure relationships in proteins: (a) Lowsequence similarity (3% for alignment of 72% of the residues) generally implies low struc-ture similarity (cytochrome C6 versus barstar). Still, exceptions are found. For example,in (b), despite low (12%) sequence similarity, there is large structure and function similar-ity (hemoglobin and myoglobin); conversely, despite high sequence similarity, there canbe structural diversity, due to (c) hinge bending in two lysozyme mutants (Met → Ile inresidue 6) or (d) different orientation of one of the two subunits in two immunoglobulins.The lysozyme mutant displays 5 different crystal conformations, one similar to the wild-type (shown in blue) and others overall very similar except for a different hinge-bendingangle (see defining arrows); the form with largest bend (32o) is shown in yellow. The twoimmunoglobulins differ markedly in tertiary organization due mainly to differences in thelinker domain between the A and B subunits of each protein.

Page 21: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.4. Protein Conformation Framework 97

gauche− gauche+

τ = 0o

60

120−120

−60

180 gauche− gauche+trans

For n−butane (CH3 − CH2 − CH2 − CH3)

CH3

CH3

CH3

CH3

CH3

CH3

CH3

CH3

trans

τ

Figure 3.13. Gauche (g) and trans (t) dihedral-angle orientations for n-butane: (left) clas-sification wheel; (middle) simple Newman projections that illustrate the three favoredorientations of the two end methyl groups about the central C–C bond (perpendicular to theplane of the paper); and (right) the trans conformation, which has the least steric clashes.

3.4 Protein Conformation Framework

3.4.1 The Flexible φ and ψ and Rigid ω Dihedral Angles

Polypeptides can have a wide variety of conformations, i.e., 3D structures differ-ing only in rotational orientations about covalent bonds.2 This type of rotationalflexibility is characterized by a dihedral angle, which measures the relative ori-entation of four linked atoms in a molecule, i − j − k − l. A dihedral angle fora 4-atom sequence that is not necessarily covalently bonded can also be used forspecial terms in the potential energy function; see Chapter 9 for examples. SeeBox 3.4, Figure 3.14, and the equations in Appendix D, under the Addendum toAssignment 8 for a definition of a dihedral angle.

Box 3.4: Dihedral Angle

The dihedral angle τijkl defined for a sequence of linked atoms i–j–k–l (Figure 3.14) isthe angle between the normal to the plane of atoms i–j–k and the normal to the plane ofatoms j–k–l. The sign of τijkl is determined by the triple product (a × b) · c, where a, band c are the interatomic distance vectors for atoms i→ j, j → k and k → l, respectively.

Strictly defined, the related torsion angle, τ is the angle between the two planes definedby i–j–k and j–k–l. Thus, τ + τ = π (180o). However, the terms torsion and dihedralangle are often used interchangeably. We will often use dihedral angle to refer to thenumerical value of the angle, and torsion angle or torsional potential when we discussgeneral properties of these rotations.

2see Chapter 8, Subsection 8.4.1, for the related definition of configuration.

Page 22: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

98 3. Protein Structure Introduction

When the dihedral angle is 0o, the four atoms i–j–k–l are coplanar and atoms i and l coin-cide in their projections onto the plane normal to the j–k bond; this orientation is definedas cis or syn. When the dihedral angle is 180o, the atoms are coplanar but atoms i and llie opposite one another in the projection onto the plane normal to the j–k bond; such anorientation is defined as trans or anti. More generally, angular regions convenient to de-scribe protein and nucleic acid conformations are the following: cis (≈0o), trans (≈180o)and ± gauche (≈±60o). Another common terminology is: syn (≈0o), anti (≈180o), ±synclinal (≈±60o), and ± anticlinal (≈±120o). See Figure 3.13 for a simple illustrationfor n-butane.

While the peptide group (Figure 3.3) is relatively rigid — it has 40% double-bondcharacter — there is a great deal of flexibility about each of the single bondsalong the backbone, {N–Cα} and {Cα–C}=O. The two dihedral angles φ and ψare used to define rotations about the bond between the nitrogen and Cα of themainchain and between Cα and the carbonyl carbon, respectively (Figure 3.15).

The dihedral angle ω defines the rotation about the peptide bond, namely forthe atomic sequence Cα

1 –{C–N}–Cα2 , where C1 and C2 are the α-carbons of two

adjacent amino acids. Because of the partial double-bond character of the peptidebond and the steric interactions between adjacent sidechains, ω is typically in thetrans configuration: ω = 180o.3 In this orientation, all four atoms lie in the sameplane, with the distance between Cα

1 and Cα2 as large as possible (see Figure 3.13

for a definition of various dihedral angle orientations).

Figure 3.14. Definition of a dihedral angle τijkl = cos−1(nab ·nbc), the angle between thetwo normals spanned by atoms i, j, k and j, k, l.

3Non-trivial deviations from planar peptide bonds can be shown by theory and experiment (e.g., asreviewed in [348]). A statistical survey of peptide and protein databases verified that the distributionof rotation angles (or energies associated with peptide bond rotations) follows Boltzmann statistics[799].

Page 23: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.4. Protein Conformation Framework 99

O

C

NH

C

O

HN

Ci−1α

Ciα

Ri−1 H

H R i

H Ri+1

ψi−1 ω φi ψi ω φi−1

i−1 Residue i i+1

Ci+1α

Figure 3.15. Rotational flexibility in polypeptides: definition of the φ, ψ and ω dihedralangles.

3.4.2 Rotameric Structures

Besides the {φ , ψ} flexibility associated with the two backbone bonds involvingCα, multiple conformations are possible for 18 of the 20 amino acids when thesidechain geometries differ (excluded are glycine and alanine). Rotameric struc-tures of amino acids (and hence proteins) are those that have the same {φ , ψ}angles but differ in the sidechain conformations. The dihedral angles used to de-fine sidechain rotations are denoted by χ, with subscripts used as needed (seeFigure 3.17).

C1

C2

C3

C4Cα

χ1

χ2

χ3

χ4 N

HH

H HH H

HH

HH

H

H

Figure 3.16. Lysine’s four rotamers defined by torsional variables χ1 through χ4.

For example, in lysine, whose sidechain has four carbons (see Figure 3.16),dihedral angles χ1 through χ4 denote the rotations about bonds Cα–C1, C1–C2,C2–C3, and C3–C4, respectively (see also Figure 3.17 for other amino acids).Rotameric structures for polypeptides and proteins depend on the environment ofthe polymer and on the secondary and tertiary structures.

3.4.3 Ramachandran Plots

The feasible combinations of the φ and ψ angles are limited due to steric hin-drance. That is, only certain combinations are typically observed, with somedependence on residue size and shape. Glycine is unique in its flexibility — it istherefore a good agent for turns in polypeptides and proteins — but other residuesexhibit a highly limited range of sterically-permissible φ and ψ combinations. Infact, only roughly one tenth the area of the {φ , ψ} space is generally observedfor polypeptides and proteins. Among the first to note this limitation were

Page 24: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

100 3. Protein Structure Introduction

ALIPHATIC SIDE CHAINS

ALIPHATIC HYDROXYL SIDE CHAINS SECONDARY AMINO GROUP

ACIDIC SIDE CHAINS AND THEIR AMIDE DERIVITIVES

SULFUR-CONTAINING SIDE CHAINS

BASIC SIDE CHAINS

AROMATIC SIDE CHAINS

Val (V)

H2C

+H2N C Hα

CH2

H2C

Cα CH2 CH2 C

Cα CH2 CH2 C NH2

O

CH2 C

CH2 C NH2

O

CH2 OHCα

CH

Cα Cα

CH2

CH3

CH3

CH3

CH3

CH3

CH2 CH3

Cα C

CH3

OH

H

C HCH

Cα CH2 Cα CH2

Cα CH2

Cα CH2

Cα CH2 Cα CH2

SHCH2 S CH3

CH2 CH2 CH2 NH3+

CH2 CH2HN C NH2

NH2+

C

+HNCH

NH

CHCH2Cα

C

HCNH

OH

CH2Cα

χ1χ2χ3

χ4

O

O

O

O

_ _

Leu (L)

Ser (S) Thr (T) Pro (P)

Asp (D)

Asn (N)

Glu (E)

Gln (Q)

Met (M) Cys (C)

Lys (K) Arg (R)

His (H)

Phe (F) )W(prT)Y(ryT

Ile (I)

Figure 3.17. Rotameric notation used for 18 of the 20 amino acids is illustrated usingdifferent colors for χ1–χ4, as shown in the top right key.

Page 25: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.4. Protein Conformation Framework 101

−90 0 90 180

−90

0

90

180

−180

φ (degrees)

Ψ (

de

gre

es)

Figure 3.18. Ramachandran plots, obtained from a subset of the PDB40 dataset [959],corresponding to X-ray protein structures with resolution of 2.5 A or better (470 proteins,95778 total residues plotted, with proline and glycine excluded).

G.N. Ramachandran4 and coworkers in 1963, after which Ramachandran plotsare called. Around the same time, John Schellman and coworkers were workingindependently along the same lines of mapping the energetically favorable andexcluded regions for protein conformations [1098].

These diagrams in the {φ , ψ} space, as shown in Figure 3.18, are usedto describe this {φ , ψ} flexibility (actually inflexibility) in polypeptides andproteins. See also Figure 3.19 for a comparative view of Ramachandran plots de-rived from the moderate-resolution X-ray structures shown in Figure 3.18 versushigh-resolution X-ray as well as NMR-derived structures.

Often, Ramachandran diagrams are presented by plotting the backbone dihedralangles of all nonterminal residues in a protein for a large group of known protein

4For lovers of scientific history, the 1998 biography of this renowned Indian molecular biophysi-cist (1922–2001) is recommended [1091]. His peppered poetry is an added bonus. (My favorite poemis number 9, on Superhelical Twisting and Replication of D N A [1091, p. 159].)

Page 26: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

102 3. Protein Structure Introduction

180

0

−180

180

0

−180

180

0

−1801800 0 1000500

100

50

0

50

25

0

100

50

0

Ψ (d

egre

es)

Cou

nts

Length (residues)φ (degrees)

X−ray, mod. res.

X−ray, higher res.

NMR

Figure 3.19. Three sets of Ramachandran plots based on the PDB40 dataset [959], corre-sponding to: (top) X-ray protein structures with resolution of 2.5 A or better (470 proteins,95,778 total residues plotted, with proline and glycine excluded); (middle) X-ray proteinstructures with resolution of 1.8 A or better (183 proteins, 29,758 total residues plotted,proline and glycine excluded); and (bottom) NMR-derived structures (113 proteins, 84,719total residues plotted, proline and glycine excluded). For each subset of structures, thelength distribution is also shown.

structures. This superimposed view, averaged over many residues, approximatesprotein conformational tendencies. The favorable regions correspond to commonsecondary-structure elements, such as helices and sheets, with finer motifs alsonoted.

In addition to favorable combinations of φ and ψ in polypeptides, the side-chain dihedral angle χ1 has been found to cluster around one of three conformers

Page 27: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

3.4. Protein Conformation Framework 103

known as gauche+ (or g+, χ1 = +60o), gauche− (or g−, χ1 = −60o), andtrans (or t, χ1 = 180o). These are the favored orientations about tetrahedral atoms(Figure 3.13). Some dependence of χ1 on the residue’s φ and ψ values has alsobeen noted.

3.4.4 Conformational Hierarchy

Most natural proteins adopt specific 3D structures that are associated with their bi-ological activity. Of course, proteins are dynamic, but typical thermal fluctuationsand local configurational arrangements revolve around a specific globally-foldedstructure. The majority of proteins is believed to be unknotted in a topolog-ical sense, though the polypeptide chain is frequently covalently bonded viadisulphide links and noncovalently held together by hydrogen bonds [1254].

One of the hallmarks of biomolecular structure is that the amino acid sequencedetermines the 3D structure of a protein. This was first shown by Christian B.Anfinsen and his colleagues in the early 1960s [51]. Anfinsen shared the NobelPrize in Chemistry in 1972 for his work on ribonuclease — connecting the aminoacid sequence to the biologically active conformation — with Stanford Mooreand William H. Stein — who connected ribonuclease’s chemical structure to itscatalytic activity.5 In Anfinsen’s work, the protein ribonuclease was denatured bydestroying its hydrogen bonding network as well as intrinsic disulfide bonds. Theresearchers observed that the protein spontaneously refolded into its native statein a short time, regaining all its enzymatic activity. Of course, we recognize nowthat accessory chaperone molecules may be necessary to assist in the folding ofmany large proteins in vivo, as discussed in Chapter 2.

Four basic levels are used to describe protein structure:

• primary structure — the sequence of amino acids;

• secondary structure — regular local structural patterns such as α-helicesand β-sheets, or combination motifs thereof (supersecondary structure);

• tertiary structure — the 3D arrangement of all atoms in the polypeptidechain in space; and

• quaternary structure (used for large proteins with independent subunits) —the complete 3D interaction network among the different subunits.

The next chapter describes in turn the secondary and supersecondary, tertiary,and quaternary structure of proteins.

5Readers are invited to browse the electronic museum of Profiles in Science atwww.profiles.nlm.nih.gov/ for a glimpse not only of Anfinsen’s scientific activities but also of hisother hobbies and interests.

Page 28: [Interdisciplinary Applied Mathematics] Molecular Modeling and Simulation: An Interdisciplinary Guide Volume 21 || Protein Structure Introduction

104 3. Protein Structure Introduction