Chap. 11 Protein Structures
  • date post

    03-Jan-2016

  • Chap. 11 Protein Structures

  • Amino Acid
    R: large white and gray; C: black; Nitrogen: blue; Oxygen: red; Hydrogen: white

    General structure of amino acids
    an amino group
    a carboxyl group
    an α-carbon bonded to a hydrogen and a side-chain group, R
    The side chain R determines the identity of a particular amino acid

  • Protein
    Protein: a polymer consisting of amino acids (AAs) linked by peptide bonds
    An AA in a polymer is called a residue
    Folded into 3D structures
    The structure of a protein determines its function
    Primary structure: linear arrangement of AAs
    The AA sequence (primary structure) determines the 3D structure of a protein, which in turn determines its properties
    N- and C-terminal
    Secondary structure: local conformation of short stretches of AAs
    Tertiary structure: overall 3D structure

  • Protein Structures

  • Secondary structure
    Secondary structures have repetitive interactions resulting from hydrogen bonding between N-H and carbonyl (C=O) groups of the peptide backbone
    Conformations of the side chains of AAs are not part of the secondary structure
    α-helix

  • Secondary structure: β-pleated sheet
    Parallel/antiparallel

    3D form of antiparallel

  • Secondary structure: domain
    βαβ unit
    αα unit (helix-turn-helix)
    β-meander
    Greek key
    Part of a chain folds independently of the folding of other parts
    Such an independently folded portion of a protein is called a domain (super-secondary structure)

  • Domain
    Larger proteins are modular
    Their structural units, domains or folds, can be covalently linked to generate multi-domain proteins
    Domains are not only structurally, but also functionally, discrete units
    Domain family members are structurally and functionally conserved and recombined in complex ways during evolution
    Domains can be seen as the units of evolution
    Novelty in protein function often arises as a result of gain or loss of domains, or by re-shuffling existing domains along the sequence
    For pairs of protein domains with the same 3D fold, precise function is conserved down to ~40% sequence identity (broad functional class is conserved down to ~20%)
    DNA binding domains: http://en.wikipedia.org/wiki/DNA-binding_domain

  • Motif
    A short, conserved region (frequently the most conserved region of a domain)
    Critical for the domain to function
    Domain vs. Motif
    Motifs are structural characteristics
    Domains are functional regions, usually consisting of a few motifs

  • Motif Representation
    Motif
    In multiple alignments of distantly related sequences, highly conserved regions are called motifs, features, signatures or blocks
    These tend to correspond to core structural and functional elements of the proteins

  • Motif
    Complement control protein module
    Immunoglobulin module
    Fibronectin type I module
    Growth factor module
    Kringle module
    The Greek key motif is often found in β-barrel tertiary structure

  • Linked series of β-meanders
    Greek key pattern
    Alternating βα units
    Top and side views (the α-helical section is outside)

  • Secondary structure: conformation
    Schematic diagrams of fibrous and globular proteins
    Computer-generated model of a globular protein
    Two types of protein conformations
    Fibrous
    Globular: folds back onto itself to create a spherical shape

  • Secondary Structure Prediction
    Ab initio prediction (from AA sequence)
    Still an open problem
    1974: Peter Chou and Gerald Fasman
    Use known structures to determine which AAs contribute to each secondary structure
    Propensity values: likelihood that an AA appears in a particular structure
    P(a), P(b) and P(turn) > 1 indicate a greater than average chance (log-odd ratios); the table lists these values multiplied by 100
    Frequency values: frequency of an AA being found in a hairpin
    Four positions in a hairpin beta-turn
    Accuracy is around 50-60%, but the method remains popular as the foundation for later prediction programs

  • AA              P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
    Alanine         142    83     66    0.060  0.076   0.035   0.058
    Arginine         98    93     95    0.070  0.106   0.099   0.085
    Asparagine       67    89     95    0.161  0.083   0.191   0.091
    Aspartic acid   101    54    146    0.147  0.110   0.179   0.081
    Cysteine         70   119    119    0.149  0.050   0.117   0.128
    Glutamic acid   151    37     74    0.056  0.060   0.077   0.064
    Glutamine       111   110     98    0.074  0.098   0.037   0.098
    Glycine          57    75    156    0.102  0.085   0.190   0.152
    Histidine       100    87     95    0.140  0.047   0.093   0.054
    Isoleucine      108   160     47    0.043  0.034   0.013   0.054
    Leucine         121   130     59    0.061  0.025   0.036   0.070
    Lysine          114    74    101    0.055  0.115   0.072   0.095
    Methionine      145   105     60    0.068  0.082   0.014   0.055
    Phenylalanine   113   138     60    0.059  0.041   0.065   0.065
    Proline          57    55    152    0.102  0.301   0.034   0.068
    Serine           77    75    143    0.120  0.139   0.125   0.106
    Threonine        83   119     96    0.086  0.108   0.065   0.079
    Tryptophan      108   137     96    0.077  0.013   0.064   0.167
    Tyrosine         69   147    114    0.082  0.065   0.114   0.125
    Valine          104   170     50    0.062  0.048   0.028   0.053

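  • Chou-Fasman Algorithm
    Step 1: identify alpha-helices
    Find a region of six contiguous residues where at least four have P(a) > 103
    Extend the region in both directions until a set of four contiguous residues with low P(a) is reached (the classic rule stops at an average P(a) < 100)
    If the region's average P(a) > 103, its length is > 5, and P(a) > P(b), call it alpha
    Step 2: beta strands
    Find a region of five contiguous residues with at least three with P(b) > 105
    Extend the region until a set of four contiguous residues with low P(b) is reached (classically, an average P(b) < 100)
    If the region's average P(b) > 105 and P(b) > P(a), call it beta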
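
The nucleation part of Step 1 can be sketched in a few lines of Python. This is only an illustration of the "at least four of six residues with P(a) > 103" rule, not a full Chou-Fasman implementation: the extension and overlap steps are left out. The function name `helix_nuclei` and its parameters are made up for this sketch; the P(a) values are copied from the table in this chapter.

```python
# P(a) helix propensities (x100), copied from the chapter's table,
# keyed by one-letter amino acid code.
P_ALPHA = {
    "A": 142, "R": 98, "N": 67, "D": 101, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 104,
}


def helix_nuclei(seq, window=6, min_hits=4, cutoff=103):
    """Return 0-based start indices of windows that satisfy the nucleation rule.

    A window qualifies when at least `min_hits` of its `window` residues
    have P(a) strictly greater than `cutoff`.
    """
    starts = []
    for i in range(len(seq) - window + 1):
        hits = sum(1 for aa in seq[i:i + window] if P_ALPHA[aa] > cutoff)
        if hits >= min_hits:
            starts.append(i)
    return starts


if __name__ == "__main__":
    print(helix_nuclei("GAEMLKG"))  # windows rich in A, E, M, L, K qualify
    print(helix_nuclei("GPGSNY"))   # no strong helix formers
```

Each qualifying window would then be extended in both directions and checked against P(b), as described in the steps above.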

  • Chou-Fasman Algorithm
    Step 3: beta turns
    For each residue position j, determine the turn propensity P(t) as
    $P(t)_j = f(i)_j \times f(i+1)_{j+1} \times f(i+2)_{j+2} \times f(i+3)_{j+3}$
    A turn is predicted at position j if P(t) > 0.000075, the average P(turn) from j to j+3 is > 100, and P(a) < P(turn) > P(b)
    Step 4: overlaps
    If an alpha region overlaps with a beta region, the regions' P(a) and P(b) determine the most likely structure in the overlapped region
    If P(a) > P(b) for the overlapping region, alpha
    If P(a) < P(b) for the overlapping region, beta
    If P(a) = P(b), no valid call
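
A minimal sketch of the Step 3 turn test, using the parameter table from this chapter. Two things here are assumptions rather than statements from the slides: P(a) and P(b) are averaged over the same four residues as P(turn), and the names `CF` and `is_turn` are invented for the example.

```python
# (P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3)) per residue,
# copied from the chapter's table and keyed by one-letter code.
CF = {
    "A": (142, 83, 66, 0.060, 0.076, 0.035, 0.058),
    "R": (98, 93, 95, 0.070, 0.106, 0.099, 0.085),
    "N": (67, 89, 95, 0.161, 0.083, 0.191, 0.091),
    "D": (101, 54, 146, 0.147, 0.110, 0.179, 0.081),
    "C": (70, 119, 119, 0.149, 0.050, 0.117, 0.128),
    "E": (151, 37, 74, 0.056, 0.060, 0.077, 0.064),
    "Q": (111, 110, 98, 0.074, 0.098, 0.037, 0.098),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "H": (100, 87, 95, 0.140, 0.047, 0.093, 0.054),
    "I": (108, 160, 47, 0.043, 0.034, 0.013, 0.054),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.070),
    "K": (114, 74, 101, 0.055, 0.115, 0.072, 0.095),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
    "F": (113, 138, 60, 0.059, 0.041, 0.065, 0.065),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "S": (77, 75, 143, 0.120, 0.139, 0.125, 0.106),
    "T": (83, 119, 96, 0.086, 0.108, 0.065, 0.079),
    "W": (108, 137, 96, 0.077, 0.013, 0.064, 0.167),
    "Y": (69, 147, 114, 0.082, 0.065, 0.114, 0.125),
    "V": (104, 170, 50, 0.062, 0.048, 0.028, 0.053),
}


def is_turn(seq, j, threshold=0.000075):
    """Apply the three turn conditions to the four residues starting at 0-based index j."""
    quad = seq[j:j + 4]
    if len(quad) < 4:
        return False  # not enough residues left for a four-residue turn

    # P(t) = f(i) of residue j * f(i+1) of residue j+1 * ... * f(i+3) of residue j+3
    p_t = 1.0
    for offset, aa in enumerate(quad):
        p_t *= CF[aa][3 + offset]

    def mean(column):
        return sum(CF[aa][column] for aa in quad) / 4

    p_a, p_b, p_turn = mean(0), mean(1), mean(2)
    return p_t > threshold and p_turn > 100 and p_turn > p_a and p_turn > p_b


if __name__ == "__main__":
    print(is_turn("NPDG", 0))    # strong turn formers in every position
    print(is_turn("AEMLKA", 0))  # helix formers, product of f values is tiny
```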

  • Secondary structure prediction
    Chou and Fasman (1974): based on the frequencies of amino acids found in α-helices, β-sheets, and turns
    Proline: occurs at turns, but not in α-helices
    GOR (Garnier, Osguthorpe, Robson): related algorithm
    Modern algorithms: use multiple sequence alignments and achieve a higher success rate (about 70-75%)
    Page 427

  • Secondary structure prediction
    Web servers:

    GOR4
    Jpred
    NNPREDICT
    PHD
    Predator
    PredictProtein
    PSIPRED
    SAM-T99sec
    Table 11-3, Page 429

  • Secondary Structure Prediction by PSIPRED
    Prediction of regions of the protein that form alpha-helix, beta-sheet, or random coil
    http://bioinf.cs.ucl.ac.uk/psipred/
    Based on neural networks
    Uses a Chou-Fasman-like algorithm, but first does a PSI-BLAST search to get a collection of sequences related to the input (searching for orthologous sequences)
    Univ. College London, 1999

  • PSI-BLAST is performed in five steps
    [1] Select a query and search it against a protein database

    [2] PSI-BLAST constructs a multiple sequence alignment, then creates a profile or specialized position-specific scoring matrix (PSSM)
    Page 146

  • Inspect the blastp output to identify empirical rules regarding amino acids tolerated at each position
    Figure labels: R,I,K; C; D,E,T; K,R,T; N,L,Y,G

  •       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
     1 M -1 -2 -2 -3 -2 -1 -2 -3 -2  1  2 -2  6  0 -3 -2 -1 -2 -1  1
     2 K -1  1  0  1 -4  2  4 -2  0 -3 -3  3 -2 -4 -1  0 -1 -3 -2 -3
     3 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2  1 -4 -3 -3 12  2 -3
     4 V  0 -3 -3 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -1  4
     5 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2  1 -4 -3 -3 12  2 -3
     6 A  5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
     7 L -2 -2 -4 -4 -1 -2 -3 -4 -3  2  4 -3  2  0 -3 -3 -1 -2 -1  1
     8 L -1 -3 -3 -4 -1 -3 -3 -4 -3  2  2 -3  1  3 -3 -2 -1 -2  0  3
     9 L -1 -3 -4 -4 -1 -2 -3 -4 -3  2  4 -3  2  0 -3 -3 -1 -2 -1  2
    10 L -2 -2 -4 -4 -1 -2 -3 -4 -3  2  4 -3  2  0 -3 -3 -1 -2 -1  1
    11 A  5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
    12 A  5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
    13 W -2 -3 -4 -4 -2 -2 -3 -4 -3  1  4 -3  2  1 -3 -3 -2  7  0  0
    14 A  3 -2 -1 -2 -1 -1 -2  4 -2 -2 -2 -1 -2 -3 -1  1 -1 -3 -3 -1
    15 A  2 -1  0 -1 -2  2  0  2 -1 -3 -3  0 -2 -3 -1  3  0 -3 -2 -2
    16 A  4 -2 -1 -2 -1 -1 -1  3 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2 -1
    ...
    37 S  2 -1  0 -1 -1  0  0  0 -1 -2 -3  0 -2 -3 -1  4  1 -3 -2 -2
    38 G  0 -3 -1 -2 -3 -2 -2  6 -2 -4 -4 -2 -3 -4 -2  0 -2 -3 -3 -4
    39 T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -3 -2  0
    40 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2  1 -4 -3 -3 12  2 -3
    41 Y -2 -2 -2 -3 -3 -2 -2 -3  2 -2 -1 -2 -1  3 -3 -2 -2  2  7 -1
    42 A  4 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0

    Columns: the 20 amino acids
    Rows: all the amino acids from position 1 to the end of your PSI-BLAST query protein

  • PSSM (same matrix as above)

    Note that a given amino acid (such as alanine) in your query protein can receive different scores for matching alanine, depending on the position in the protein

  • PSSM (same matrix as above)

    Note that a given amino acid (such as tryptophan) in your query protein can receive different scores for matching tryptophan, depending on the position in the protein
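
To make the position dependence concrete, here is a small sketch that looks up scores in a few rows of the PSSM shown above (positions 1 to 4 and 13). The helper names `score_at` and `score_segment` are hypothetical, and summing per-position scores over an ungapped segment is a simplification of what PSI-BLAST does.

```python
# Column order of the PSSM, as printed in the matrix header.
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# A few rows copied from the PSSM above, keyed by query position (1-based).
PSSM_ROWS = {
    1: [-1, -2, -2, -3, -2, -1, -2, -3, -2, 1, 2, -2, 6, 0, -3, -2, -1, -2, -1, 1],
    2: [-1, 1, 0, 1, -4, 2, 4, -2, 0, -3, -3, 3, -2, -4, -1, 0, -1, -3, -2, -3],
    3: [-3, -3, -4, -5, -3, -2, -3, -3, -3, -3, -2, -3, -2, 1, -4, -3, -3, 12, 2, -3],
    4: [0, -3, -3, -4, -1, -3, -3, -4, -4, 3, 1, -3, 1, -1, -3, -2, 0, -3, -1, 4],
    13: [-2, -3, -4, -4, -2, -2, -3, -4, -3, 1, 4, -3, 2, 1, -3, -3, -2, 7, 0, 0],
}


def score_at(position, residue):
    """Score for aligning `residue` to the given query position."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]


def score_segment(segment, start=1):
    """Sum of per-position scores for an ungapped segment aligned from `start`."""
    return sum(score_at(start + k, aa) for k, aa in enumerate(segment))


if __name__ == "__main__":
    # Tryptophan matching tryptophan scores differently at positions 3 and 13.
    print(score_at(3, "W"), score_at(13, "W"))
    print(score_segment("MKWV"))  # the query's own first four residues
    print(score_segment("AAAA"))  # a poor match to the same positions
```

A fixed substitution matrix would give the same W-to-W score at both positions; the PSSM does not.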

  • PSI-BLAST is performed in five steps
    [1] Select a query and search it against a protein database

    [2] PSI-BLAST constructs a multiple sequence alignment, then creates a profile or specialized position-specific scoring matrix (PSSM)

    [3] The PSSM is used as a query against the database

    [4] PSI-BLAST estimates statistical significance (E values)

    [5] Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query
    Page 146

  • SRC protein
    Tyrosine kinase
    Enzyme that puts a phosphate group on a tyrosine residue (phosphorylation)
    Activates an inactive protein, which eventually activates cell-division proteins
    NP_005408

    >gi|4885609|ref|NP_005408.1| proto-oncogene tyrosine-protein kinase Src [Homo sapiens]
    MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL

  • Examining Crystal Structure
    Cn3D: NCBI structure viewer and modeling tool
    DeepView: SWISSPROT
    Jmol

    NCBI Structure database
    Links to NCBI MMDB (Molecular Modeling Database)
    MMDB contains experimentally verified protein structures

    SRC: MMDB ID 56157, PDB ID 1FMK

    View Structure from the NCBI Structure database
    Opens up a Cn3D window
    Click to rotate; Ctrl+click to zoom; Shift+click to move
    Rendering and coloring menus

  • Tertiary structure
    3D arrangement of all atoms in the molecule
    Considers the arrangement of helical and sheet sections, conformations of side chains, arrangement of atoms of side chains, etc.

    Experimentally determined by
    X-ray crystallography: measures diffraction patterns of atoms

    NMR (Nuclear Magnetic Resonance) spectroscopy: uses protein samples in aqueous solution

  • Tertiary structure of α-lactalbumin and myoglobin

  • Protein families
    Groups of genes of identical or similar sequence are common
    Sometimes, repetition of identical sequences is correlated with the synthesis of increased quantities of a gene product
    e.g., a genome contains multiple copies of ribosomal RNA genes
    Human chromosome 1 has 2000 genes for 5S rRNA (S: sedimentation coefficient), and chr 13, 14, 15, 21 and 22 have 280 copies of a repeat unit made up of 28S, 5.8S and 18S
    Amplification of rRNA genes evolved because of heavy demand for rRNA synthesis during cell division
    These rRNA genes are examples of gene families having identical or near-identical sequences
    Sequence similarities indicate a common evolutionary origin
    α- and β-globin families have distinct sequence similarities and evolved from a single ancestral globin gene

  • Protein families and superfamilies
    Dayhoff classification, 1978
    Protein families: at least 50% AA sequence similarity (based on physico-chemical AA features)
    Related proteins with less similarity (35%) belong to a superfamily, and may have quite diverse functions
    α- and β-globins are classified as two separate families, and together with myoglobins form the globin superfamily
    These families have distinct sequence similarities and evolved from a single ancestral globin gene

  • Protein family database
    Pattern or secondary database derived from sequences
    A pattern may be the most conserved aspect of a sequence family
    The most conserved part may vary between species
    Use a scoring system to account for some variability
    Position-specific scoring matrix (PSSM) or profile
    In contrast, a pairwise alignment applies the same weights regardless of position
    Protein family databases are derived by different analytical techniques
    But all try to find motifs: conserved regions considered to reflect shared structural or functional characteristics
    Three groups: single motifs, multiple motifs, or full domain alignments

  • Protein family databases
    Pattern or secondary database derived from sequences

    Database   Data source             Stored info
    PROSITE    Swiss-Prot              Regular expressions (patterns) of single most conserved motif
    Profiles   Swiss-Prot              Weighted matrices (profiles) of position-sensitive weights
    PRINTS     Swiss-Prot and TrEMBL   Aligned motifs (fingerprints)
    Pfam       Swiss-Prot and TrEMBL   Multiple sequence alignment of a protein domain or conserved region
    Blocks     InterPro/PRINTS         Aligned motifs (blocks)
    eMOTIF     Blocks/PRINTS           Permissive regular expressions

  • Single Motif Method
    Regular expression
    PROSITE
    PDB 1ivy
    Carboxypept_Ser_His (PS00560)
    [LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]
    [ ]: any of the enclosed symbols
    x: any residue
    (3): number of repeats
    Fuzzy regular expression
    Build regular expressions with info on shared biochemical properties of AAs
    Provide flexibility according to AA group clustering
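
A PROSITE-style pattern maps almost directly onto an ordinary regular expression. The sketch below is a hedged illustration: `prosite_to_regex` is a made-up helper, it handles only the elements listed on the slide (plus `{...}` exclusions and `x(n,m)` ranges) and ignores the `<` and `>` anchors, and the pattern string is the one from the slide, normalized to `x(2)` and `-x-`.

```python
import re

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repeat count (n) or range (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


SLIDE_PATTERN = (
    "[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]"
)

if __name__ == "__main__":
    regex = prosite_to_regex(SLIDE_PATTERN)
    print(regex)
    # A toy sequence built to satisfy the pattern (not a real protein).
    print(bool(re.search(regex, "MMLAALAIGSSHAIPAAAPKK")))
```

Fuzzy regular expressions relax this further by widening each bracket to a whole biochemical group of amino acids, at the cost of more false positives.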

  • Multiple motif methods
    PRINTS
    Encodes multiple motifs (called fingerprints) in ungapped, unweighted local alignments
    BLOCKS
    Derived from PROSITE and PRINTS
    Uses the most highly conserved regions in protein families in PROSITE
    Uses a motif-finding algorithm to generate a large number of candidate blocks
    Initially, three conserved AA positions anywhere in the alignment are identified and used as anchors
    Blocks are iteratively extended and ultimately encoded as ungapped local alignments
    Graph theory is used to assemble a best set of blocks for a given family
    Uses a position-specific scoring matrix (PSSM), similar to a profile

  • Full domain alignment
    Profiles
    Use a family-based scoring matrix via dynamic programming
    Have position-specific info on insertions and deletions in the sequence family
    Hidden Markov Model (HMM)
    PFAM, SMART, TIGRFAM represent full domain alignments as HMMs
    PFAM
    Represents each family as a seed alignment, a full alignment, and an HMM
    The seed contains representative members of the family
    The full alignment contains all members of the family as detected with an HMM constructed from the seed alignment

  • Structure-based Sequence Alignment
    It is well known that alignments based on sequence similarity alone are not always correct, and that proteins can have similar structures without detectable sequence similarity
    Sequence alignment is augmented by structural alignments
    COMPASS, HOMSTRAD, PALI, ..

  • Protein Structure Comparison/Classification

  • Protein structures
    Domain
    The polypeptide chain in a protein folds into a tertiary structure
    One or more compact globular regions called domains
    The tertiary structure associated with a domain region is also described as a protein fold
    Multi-domain
    Proteins whose polypeptide chains fold into several domains
    Nearly half the known globular structures are multidomain, more than half of these with two domains
    Automatic structure comparison methods were introduced in the 1970s, shortly after the first crystal structures were stored in PDB

  • Structure comparison algorithms
    Two main components in structure comparison algorithms
    Scoring similarities in structural features
    Optimization strategy maximizing the similarities measured
    Most are based on geometric properties from 3D coordinates
    Intermolecular method
    Superpose structures by minimizing the distance between superposed positions
    Intramolecular method
    Compare sets of internal distances between positions to identify an alignment maximizing the number of equivalent positions
    Distance is described by RMSD (Root Mean Square Deviation), the square root of the average squared distance between equivalent atoms
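
As a worked illustration of the RMSD definition, the sketch below computes it for two equal-length lists of (x, y, z) coordinates. It assumes the structures are already superposed and that atom k in one list is equivalent to atom k in the other; finding the optimal superposition is a separate step not shown here. The function name `rmsd` is chosen for the example.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two lists of equivalent (x, y, z) atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared Euclidean distance between the pair of equivalent atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(rmsd(a, a))  # identical structures
    print(rmsd(a, b))  # one atom displaced by 2 units
```

Lower values mean closer structures; an RMSD of 0 means the equivalent atoms coincide exactly.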

  • Inter vs. Intra

  • RMSD

  • Distant homolog
    Structure is more conserved than sequence during evolution
    Structural similarity between distant homologs can be found
    Pairwise sequence similarity, with SSAP structural similarity score in parentheses (0-100)

  • Distant homolog

  • Structural variations in protein families

  • Structure comparison algorithms
    SSAP, 1989: residue level, intra, dynamic programming
    DALI, 1993: residue fragment level, intra, Monte Carlo optimization
    COMPARER, 1990: multiple element level, both, dynamic programming

  • Structure classification hierarchy
    Class level: proteins are grouped according to their structural class (composition of residues in α-helical and β-strand conformations)
    Mainly-α, mainly-β, alternating α-β, α plus β (mainly-α and mainly-β regions are segregated)
    Architecture: the manner by which secondary structure elements are packed together (arrangement of secondary structures in 3D space)
    Fold group (topology): orientation of secondary structures and the connectivity between them
    Superfamily
    Family

  • Hierarchy example

  • Protein Structure databases
    PDB
    Over 20,000 entries deduced from X-ray diffraction, NMR or modeling
    Massively redundant
    1FMK, 1BK5, 2F9C, ..

  • Protein Structure databases
    SCOP (Structural Classification of Proteins)
    A multi-domain protein is split into its constituent domains
    Known structures are classified according to evolutionary and structural relationships
    Domains in SCOP are grouped by species and hierarchically classified into families, superfamilies, folds and classes
    Family level: groups together domains with clear sequence similarities
    Superfamily: group of domains with structural and functional evidence for their descent from a common evolutionary ancestor
    Fold: group of domains with the same major secondary structures and the same chain topology
    Domains are identified manually by visually inspecting structures
    Proteins in the same superfamily often have the same function

  • Protein Structure databases
    CATH (Class, Architecture, Topology, Homology)
    Homology: clustered domains with 35% sequence identity and shared common ancestry
    800 fold families, 10 of which are super-folds
    2009 www.cs.uml.edu/~kim/580/08_cath.pdf

  • Structure classification
    Most structure classifications are established at the domain level
    The domain is thought to be an important evolutionary unit, and it is easier to determine domain boundaries from structural data than from sequence data
    Criteria for assessing domain regions within a structure
    The domain possesses a compact globular structure
    Residues within a domain make more internal contacts than contacts to residues in the rest of the polypeptide
    Secondary structure elements are usually not shared with other regions of the polypeptide
    There is evidence for the existence of this region as an evolutionary unit

  • CATH classifications

  • Multi-domain structures

  • Protein Function/Structure Prediction

  • Protein Function Prediction
    In the absence of experimental data, the function of a protein is usually inferred from its sequence similarity to a protein of known function
    The more similar the sequence, the more similar the function is likely to be
    Not always true
    Can clues to function be derived directly from 3D structure?

    Definition of function
    Function can be described at many levels: biochemical, biological processes, pathways, organ level
    Proteins are annotated at different degrees of functional specificity: ubiquitin-like domain, signaling protein, ..
    GO (Gene Ontology) scheme

  • Protein Function Prediction
    Sequence-based: largely unreliable
    Profile-based
    Profiles are constructed from sequences of whole protein families, where families are grouped by 3D structure or function (as in Pfam)
    Start with sequences matched by an initial search, then iteratively pull in more remote homologues
    More sensitive than simple sequence comparison because profiles implicitly contain information on which residues within the family are well conserved and which sites are more variable
    Structure-based
    Fold-based
    Proteins sharing similar functions often have similar folds, resulting from descent from a common ancestral protein
    Sometimes, the function of proteins alters during evolution while the fold stays unchanged
    Thus, a fold match is not always reliable
    Surface clefts and binding pockets

  • Chap. 12 RNA Structures

  • RNA structure: stem-loop structure

  • RNA structure: a loop structure
    A loop between i and j when the base at i pairs with the base at j
    The base at i+1 pairs with the base at j
    Or the base at i pairs with the base at j-1
    Or a multiple loop

  • RNA secondary structure: search for minimum free energy
    Gibbs free energy at 37 °C
    Free energy increments of base pairs are counted as stacks of adjacent pairs
    Successive CGs: -3.3 kcal/mol
    Unfavorable loop initiation energy to constrain bases in a loop

  • RNA structure prediction
    Ad-hoc approach
    Simply look at a strand and find areas where base pairing can occur
    It is possible to find many locations where folds can occur
    A prediction should be able to determine the most likely one
    What should the criteria be?
    1980, Nussinov-Jacobson Algorithm
    The more stable structure is the most likely one
    Find the fold that forms the greatest number of base pairs (base pairing lowers the overall energy of the strand, making it more stable)
    Checking all possible folds is infeasible, so use dynamic programming

  • Nussinov-Jacobson Algorithm
    Create an n x n matrix for a sequence with n bases
    Initialize the diagonal to 0
    Fill the matrix with the largest number of base pairs, S

    w(i,j) = 1 if base i can be paired with base j (otherwise 0)

    $$S(i,j) = \max \begin{cases} S(i+1,\, j-1) + w(i,j) \\ S(i+1,\, j) \\ S(i,\, j-1) \\ \max_{i<k<j} \left[ S(i,k) + S(k+1,j) \right] \end{cases}$$
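
The recurrence can be filled in by increasing subsequence length, so that every entry it depends on is already computed. The sketch below is a minimal version under stated assumptions: only Watson-Crick pairs (A-U, G-C) count, no minimum hairpin loop length is enforced (realistic folders require one), and only the maximum number of pairs is returned, not the structure itself. The name `nussinov` is chosen for the example.

```python
# Watson-Crick pairs only; G-U wobble pairs are deliberately left out here.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def nussinov(seq):
    """Return the maximum number of nested base pairs for an RNA sequence."""
    n = len(seq)
    if n == 0:
        return 0
    # S[i][j] = best pair count for the subsequence i..j; the diagonal stays 0.
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):            # distance j - i, shortest spans first
        for i in range(n - span):
            j = i + span
            w = 1 if (seq[i], seq[j]) in PAIRS else 0
            inner = S[i + 1][j - 1] if i + 1 <= j - 1 else 0
            best = max(S[i + 1][j], S[i][j - 1], inner + w)
            for k in range(i + 1, j):   # bifurcation into two sub-structures
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]


if __name__ == "__main__":
    print(nussinov("GGGAAACCC"))  # three nested G-C pairs around an AAA loop
    print(nussinov("AAAA"))       # nothing can pair
```

The three nested loops give O(n^3) time and O(n^2) memory, which is what makes the search tractable compared with enumerating every possible fold.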
