PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information
Jiang Qian, Brad Stenger, Cyrus A. Wilson, Jimmy Lin, Ronald Jansen, Sarah A. Teichmann1, Jong Park2, Werner Krebs, Haiyuan Yu,
Vadim Alexandrov, Nathaniel Echols, Mark Gerstein*
Department of Molecular Biophysics and Biochemistry
PO Box 208114, New Haven, CT 06520, USA
1Department Biochemistry & Molecular Biology, University College London, Darwin Bldg, Gower St, London WC1E 6BT, UK and 2European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
*To whom correspondence should be addressed. Tel: +1 203 432 6105; Fax: +1 360 838 7861; Email: Mark.Gerstein@yale.edu
Revised version sent to Nuc. Acids. Res. 23 Feb. 2001.
As the number of protein folds is quite limited, a mode of analysis that will be increasingly common in the future, especially with the advent of structural genomics, is to survey and re-survey the finite parts list of folds from an expanding number of perspectives. We have developed a new resource, called PartsList, that lets one dynamically perform these comparative fold surveys. It is available on the web at bioinfo.mbb.yale.edu/partslist and www.partslist.org. The system is based on the existing fold classifications and functions as a form of companion annotation for them, providing “global views” of many already completed fold surveys. The central idea in the system is that of comparison through ranking; PartsList will rank the ~420 folds based on more than 180 attributes. These include: (i) occurrence in a number of completely sequenced genomes (e.g. it will show the most common folds in the worm vs. yeast); (ii) occurrence in the structure databank (e.g. most common folds in the PDB); (iii) both absolute and relative gene expression information (e.g. most changing folds in expression over the cell cycle); (iv) protein-protein interactions, based on experimental data in yeast and comprehensive PDB surveys (e.g. most interacting fold); (v) sensitivity to inserted transposons; (vi) the number of functions associated with the fold (e.g. most multi-functional folds); (vii) amino acid composition (e.g. most Cys-rich folds); (viii) protein motions (e.g. most mobile folds); and (ix) the level of similarity based on a comprehensive set of structural alignments (e.g. most structurally variable folds). The integration of whole-genome expression and protein-protein interaction data with structural information is a particularly novel feature of our system. 
We provide three ways of visualizing the rankings: a profiler emphasizing the progression of high and low ranks across many pre-selected attributes, a dynamic comparer for custom comparisons, and a numerical rankings correlator. These allow one to directly compare very different attributes of a fold (e.g. expression level, genome occurrence, and maximum motion) in the uniform numerical format of ranks. This uniform framework, in turn, highlights the way that the frequency of many of the attributes falls off with approximate power-law behavior (i.e. according to $V^{-b}$, for attribute value $V$ and constant exponent $b$), with a few folds having large values and most having small values.
Protein folds can be considered the most basic molecular parts. There are a very limited number of them in biology. Currently, about 500 are known, and it is believed that there may be no more than a few thousand in total (1-3). This number is considerably smaller than the number of genes in complex, multicellular organisms (>10,000 (4)). Consequently, folds provide a valuable way of simplifying complex genomic information and making it manageable. In addition, folds are useful for studying the relationships between evolutionarily distant organisms since, in making comparisons, structure is more conserved than sequence or function.
In a general sense, how should one approach the analysis of molecular parts? A simple analogy to mechanical parts may be useful in this regard. Given the “parts” from a number of devices (e.g. a car, a bicycle, and a plane) one might like to know which ones are shared by all and which are unique (say, wings for a plane). Furthermore, one might want to know which are common, generic parts and which are more specialized. Finally, one might like to organize the parts by a number of standardized attributes (e.g. the most flexible parts, the parts with the most functions, and the biggest parts). PartsList aims to provide answers to simple questions such as these for the domain of protein folds.
Properties related to protein folds can be divided into those that are “intrinsic” versus “extrinsic”. Intrinsic information concerns an individual fold itself -- e.g. its sequence, 3D structure, and function -- while “extrinsic” information relates to a fold in the context of all other folds -- e.g. its occurrence in many genomes and expression level in relation to that for other folds. Web-based search tools already provide intrinsic information about protein structures in the form of reports about individual structures. Valuable examples include the PDB Structure Explorer (5), PDBsum (6), and the MMDB (7). However, current resources lack the ability to fully present extrinsic information.
Likewise, while there are many databases storing information related to individual organisms (e.g. SGD, MIPS and FlyBase (8-10)), comparative genomics (PEDANT and COGs (9,11)), gene expression (GEO, the Gene Expression Omnibus at the NCBI, and ExpressDB (12)), and protein-protein interactions (DIP and BIND (13,14)), none of these integrates gene sequences, protein interactions, expression levels and other attributes with structure. (However, it should be mentioned that the Sacc3D module of SGD and PEDANT do tabulate the occurrence of folds in genomes.)
PartsList is arranged somewhat differently from most other biological resources. In a usual database (e.g. GenBank(15)) the number of entries increases as the database develops, while each entry has a fairly fixed number of attributes to describe it. In contrast, PartsList is envisioned to have a relatively stable number of entries, i.e. the finite list of protein folds, while the attributes that describe each entry are expected to increase considerably. In the current version of PartsList the properties for a protein fold include: amino acid composition, alignment information, fold occurrences in various genomes, statistics related to motions, absolute expression levels of yeast in different experiments, relative expression ratios for yeast, worm, and E. coli in various conditions, information on protein-protein interactions (based on whole genome yeast interaction data and databank surveys), and sensitivity of the genes associated with the fold to inserted transposons.
One reason to build the database is to compare protein folds in a rich context and in a unified way. This is achieved through ranking, which allows users to directly compare very different attributes of a fold in a uniform numerical format. The rankings can be visualized in three ways: a profiler emphasizing the progression of high and low ranks across many pre-selected attributes, a rankings comparer for custom comparisons, and a numerical rankings correlator. These tools can help users gain insight into the functions of protein folds in the context of the whole genome. Our system makes it very easy to answer questions like: “What is the most common fold in the worm as compared to E. coli?”, “What is the most highly expressed fold in yeast and how does this compare to the fold that changes most in expression level during the cell cycle?”, and “Which fold has the most protein-protein interactions in the PDB and is it highly ranked in terms of protein motions?”
One of the strengths of the uniform numerical system of ranks in PartsList is that it puts everything into a common framework so that one can see hidden similarities in the occurrence of parts ordered according to many different attributes. In particular, as we describe below, we found that the frequency of many of the attributes falls off according to a power-law distribution (i.e. according to $V^{-b}$, for attribute value $V$ and a constant $b$), with a few folds having large attribute values and most having small values. For instance, there are only a few folds that occur many times in the yeast genome, and most only occur once or twice. Likewise, most folds only have a few functions associated with them, but there are a few "Swiss-army-knife" folds that are associated with many distinct functions. Similar power-law-like expressions have been found to apply in a variety of other situations relating to proteins -- for instance, in the occurrence of oligo-peptide words (16-18), in the frequency of transmembrane helices (19) and sequence families with given size (20), and in the structure of biological networks, with a few nodes having many connections and most having only a few (21,22).
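To make the power-law statement concrete, the following is a minimal sketch of how an exponent b could be estimated from a histogram of attribute values by a straight-line fit in log-log space. The function name and the synthetic data are illustrative assumptions, not part of PartsList, and a least-squares fit on logs is only one of several possible estimators.

```python
import math

def fit_power_law_exponent(value_frequency_pairs):
    """Estimate b in frequency ~ V**(-b) by least squares on log-log points.

    value_frequency_pairs: iterable of (V, frequency) with both > 0.
    Hypothetical helper for illustration only.
    """
    pts = [(math.log(v), math.log(f))
           for v, f in value_frequency_pairs if v > 0 and f > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    # The slope of log(frequency) against log(V) is -b.
    return -sxy / sxx

# Synthetic (made-up) histogram that follows frequency = 1000 * V**-2 exactly.
synthetic = [(v, 1000.0 * v ** -2.0) for v in (1, 2, 4, 8, 16)]
b_estimate = fit_power_law_exponent(synthetic)
```

On real attribute histograms the points scatter around the line, so the fitted b should be read as approximate.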
PartsList is built on top of the Structural Classification of Proteins (SCOP) (23) fold classification and acts as an accompanying annotation to this system. SCOP is divided into a hierarchy of five levels: class, fold, superfamily, family and protein. The "parts" in our system can be either SCOP folds or superfamilies. However, sometimes for ease of expression we will just refer to “folds” when we really mean “folds and/or superfamilies.” We currently use 420 folds and 610 superfamilies.
While we chose to use the SCOP classification, we could equally well have based the system on the other existing fold classifications, e.g. CATH (24), FSSP (25), or VAST (26,27). Moreover, for most attributes, we could also have developed our system around non-structural classifications of protein parts -- e.g. Pfam (28), Blocks (29), or SMART (30). However, basing it around actual structural folds has the advantage that each part is more precisely and physically defined.
Attributes that can be ranked: Information in the system
Currently the attributes for each entry (i.e. protein fold) can be separated into several main categories: statistical information from a comprehensive set of structural alignments, amino-acid composition information, fold occurrences in various genomes, expression levels in different experiments, protein interactions, macromolecular motion, transposon sensitivity and miscellaneous.
We have developed a formalism for expressing each of the attributes, which is described in Table 1. In the table the term PART refers to either fold or superfamily, depending on which of these is being ranked. Essentially, we have a database of attributes where each attribute is given a standardized description and associated with a precise reference. In the following, we describe some main categories of attributes.
The data in this category reveal fold occurrences in 20 different genomes, including 4 archaea, 2 eukaryotes, and 14 bacteria (additional details online).
The data were obtained in the following fashion: Once a library of folds has been constructed, representative sequences can be extracted (50). Then one can use these to search genomes by comparing each representative sequence against the genomes using the standard pairwise comparison programs, FASTA (55) and BLAST (56) and well-established thresholds (57).
Alternatively, one can build up profiles by running each representative sequence against the PDB with PSI-BLAST and then comparing these profiles against each of the genomes. This latter procedure is more sensitive than pairwise comparison and relatively efficient once the profiles have been built. However, in doing large-scale surveys one has to be conscious of the potential bias introduced because profiles are more sensitive for larger families, which often results in the big families getting even bigger.
After the structure assignment, it becomes easy to enumerate how often a fold or structure feature occurs in a given genome or organism. Detailed information can be found in (19,31,32,58). This pools assignments from previous work (59,60).
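Once structure assignments exist, the enumeration step is a simple tally. The sketch below assumes the assignments are available as (genome, ORF, fold) triples; that record layout and all identifiers in the example are hypothetical and chosen only for illustration.

```python
from collections import Counter

def fold_occurrence(assignments):
    """Tally how often each fold is assigned in each genome.

    assignments: iterable of (genome, orf_id, fold_id) triples
    (an assumed, illustrative record layout).
    Returns {genome: Counter({fold_id: count})}.
    """
    counts = {}
    for genome, _orf_id, fold_id in assignments:
        counts.setdefault(genome, Counter())[fold_id] += 1
    return counts

# Made-up assignments, not real data.
example_assignments = [
    ("Scer", "orf_1", "fold_1"),
    ("Scer", "orf_2", "fold_1"),
    ("Scer", "orf_3", "fold_2"),
    ("Ecol", "orf_9", "fold_2"),
]
occurrence = fold_occurrence(example_assignments)
most_common_in_scer = occurrence["Scer"].most_common(1)
```

Sorting each genome's counter by count gives the "most common folds" ranking for that genome.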
Number of Structures. We did a comprehensive set of structural alignments of structures in the PDB structure databank (35,61,62). The number of structures and aligned pairs used in these comparisons, which are based around Astral (50), give approximate measures of the occurrence of folds in the PDB. Comparison of these values to those for genome occurrence provides a measure of how biased the composition of the PDB is (63).
Sequence Diversity. The scores from the alignments indicate the sequence diversity between the related structures within folds or superfamilies, in terms of percent sequence identity and a sequence-based P-value. P-values are useful measures of the statistical significance of a similarity score. A P-value is the probability of obtaining the same or a better alignment score from a randomly composed alignment. A similarity with a smaller P-value is therefore less likely to have arisen by chance than one with a larger P-value. P-values close to 1.0 indicate that the similarity is indistinguishable from random and thus insignificant.
Structural Diversity. We also give analogous measures of the diversity of the structures with a given fold, allowing one to rank folds by their degree of variability. We tabulate untrimmed and trimmed RMS, along with the structural P-value. RMS, root-mean-squared deviation in alpha carbon positions, has been the traditional statistic that gauges the divergence between two related structures. Smaller RMS scores indicate more closely related structures. However, sometimes a few ill-fitting atoms may significantly increase the RMS of structures known to be similar. To compensate for this we also report a "trimmed" RMS for a conserved core structure, which is based on the better fitting half of the aligned alpha-carbons, and structural P-value, which compensates for other effects such as structure size. For details, see Wilson et al. (35).
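The difference between the untrimmed and trimmed RMS can be illustrated with a short sketch. It assumes the two structures are already superimposed and that equivalent alpha carbons are supplied in matching order; it simply keeps the better-fitting half of the pairs and does not re-fit the superposition, which may differ from the procedure in Wilson et al. (35).

```python
import math

def _squared_distances(coords_a, coords_b):
    """Squared distance between each pair of equivalent atoms."""
    return [sum((p - q) ** 2 for p, q in zip(a, b))
            for a, b in zip(coords_a, coords_b)]

def rms_deviation(coords_a, coords_b):
    """Untrimmed RMS over all aligned alpha-carbon pairs."""
    sq = _squared_distances(coords_a, coords_b)
    return math.sqrt(sum(sq) / len(sq))

def trimmed_rms(coords_a, coords_b):
    """RMS over the better-fitting half of the aligned pairs (no re-fitting)."""
    sq = sorted(_squared_distances(coords_a, coords_b))
    kept = sq[:max(1, len(sq) // 2)]
    return math.sqrt(sum(kept) / len(kept))

# Made-up coordinates: two atoms fit perfectly, two fit badly.
structure_a = [(0.0, 0.0, 0.0)] * 4
structure_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
untrimmed = rms_deviation(structure_a, structure_b)
trimmed = trimmed_rms(structure_a, structure_b)
```

In this toy case the single atom displaced by 3 Å dominates the untrimmed value, while the trimmed value reflects only the well-fitting core.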
This allows us to see which folds are most biased in composition of particular amino acids. We use various levels of the Astral clustering of the SCOP sequences to arrive at the composition (50).
Three techniques are frequently used to obtain genome-wide gene expression data. They are Affymetrix oligonucleotide gene chips, SAGE (Serial Analysis of Gene Expression), and cDNA microarrays (43,64,65). SAGE and, to some degree, gene chips measure the absolute expression levels (in units of mRNA transcripts per cell), while microarrays are used to obtain the expression level changes of a given ORF as the ratio to a reference state.
A main motivation for expression experiments is often to study protein function and to characterize the functions of unannotated genes. However, this does not preclude relating other attributes of proteins, such as their structure, to expression data. For instance, it may be that highly expressed protein folds share a number of characteristics, such as a particularly stable architecture or a composition biased in a certain way. Relating expression and structure involves matching the PDB structure database against the genome and then summing the expression levels of all ORFs containing the same fold. However, if one is trying to find genes expressed in a particular metabolic state, PartsList is not the right place to look.
Absolute. The absolute expression level data gives a good representation of highly expressed genes. All the experiments currently indexed by PartsList are for yeast. For each experiment, in addition to ranking based on the average expression level for a fold, we also consider the composition in the transcriptome and the enrichment of this value relative to its composition in the genome. Transcriptome composition is the fractional composition of a fold (relative to that for other folds) in the mRNA population. In other words, it is the composition of a fold in the genome weighted by the expression levels of each of the genes. The enrichment is the relative change between the composition of a fold in the genome and the transcriptome. For more details, see (33,66). We report values for experiments from a number of different labs (41-44) and a single reference set that merges and scales all the expression sets together.
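The relationship between genome composition, transcriptome composition, and enrichment can be sketched as follows. The input layout (one fold label and one expression level per gene) and the exact form of the enrichment, taken here as (transcriptome - genome) / genome, are assumptions made for illustration; see (33,66) for the definitions actually used.

```python
def _normalize(weights):
    """Convert per-fold weights into fractions that sum to 1."""
    total = sum(weights.values())
    return {fold: w / total for fold, w in weights.items()}

def fold_enrichment(genes):
    """genes: list of (fold_id, expression_level) pairs, one per gene.

    Returns (genome_composition, transcriptome_composition, enrichment).
    Enrichment is assumed to be the relative change (T - G) / G.
    """
    gene_counts, expression_sums = {}, {}
    for fold, level in genes:
        gene_counts[fold] = gene_counts.get(fold, 0) + 1
        expression_sums[fold] = expression_sums.get(fold, 0.0) + level
    genome = _normalize(gene_counts)
    transcriptome = _normalize(expression_sums)
    enrichment = {f: (transcriptome[f] - genome[f]) / genome[f] for f in genome}
    return genome, transcriptome, enrichment

# Made-up example: one gene per fold, very different expression levels.
genome_comp, transcriptome_comp, enrich = fold_enrichment(
    [("fold_A", 9.0), ("fold_B", 1.0)]
)
```

Here both folds make up half of the genome, but fold_A accounts for 90% of the transcripts, so it is enriched and fold_B is depleted.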
Ratio. The expression ratio data shows the most actively changing genes over a period of time (e.g. cell cycle) or based on a change in states (e.g. healthy vs. diseased). Source data for expression ratios are the fluctuations in expression of a certain fold over a period of time (e.g. the cell cycle). These are measured in terms of standard deviations for a particular fold, which is calculated from the average of the expression ratio standard deviations for each gene that matches the fold structure.
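A fold's fluctuation score, as described above, is the average of the per-gene standard deviations. The sketch below assumes each gene has a list of expression ratios over the time course and uses the population standard deviation; whether the ratios are log-transformed first and which standard-deviation estimator is used are not stated in the text, so both are assumptions.

```python
import statistics

def fold_fluctuation(ratio_series_by_gene, genes_in_fold):
    """Average the per-gene standard deviations of expression ratios.

    ratio_series_by_gene: {gene_id: [ratio at each time point]}
    genes_in_fold: gene ids assigned to the fold (hypothetical inputs).
    """
    per_gene_sd = [statistics.pstdev(ratio_series_by_gene[g])
                   for g in genes_in_fold]
    return statistics.mean(per_gene_sd)

# Made-up time courses: one flat gene, one strongly changing gene.
series = {"gene_1": [1.0, 1.0, 1.0], "gene_2": [0.0, 2.0]}
fluctuation = fold_fluctuation(series, ["gene_1", "gene_2"])
```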
Information on protein-protein interactions is derived from surveys of the contacts in the PDB and the experiments in yeast.
PDB. To determine which domains interact with one another in the PDB entries indexed by SCOP (9,580 at the time of the analysis), the coordinates of each domain were parsed to check whether there are five or more contacts within 5 Å to another domain, as described in (67). The distance of 5 Å was chosen, as this is a conservative threshold for interaction between two atoms, where the atoms are either Cα atoms or atoms in side-chains. The 5-contact threshold was chosen to make sure the contact between the domains was reasonably extensive. (In fact, the number of domains identified as contacting each other hardly changed for thresholds between 1 and 10 contacts and 3 to 6 Å distances.)
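The contact criterion can be sketched as a brute-force distance check. This sketch assumes a "contact" is any pair of atoms, one from each domain, separated by at most the cutoff, and that each domain is given as a list of (x, y, z) coordinates in Å; the precise atom selection and counting rules are those of (67), not of this example.

```python
import math

def count_contacts(atoms_a, atoms_b, cutoff=5.0):
    """Count atom pairs (one from each domain) within `cutoff` angstroms."""
    return sum(1 for a in atoms_a for b in atoms_b
               if math.dist(a, b) <= cutoff)

def domains_interact(atoms_a, atoms_b, cutoff=5.0, min_contacts=5):
    """Apply the five-contacts-within-5-angstroms rule described above."""
    return count_contacts(atoms_a, atoms_b, cutoff) >= min_contacts

# Made-up coordinates for three small "domains".
domain_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
domain_2 = [(0.0, 4.0, 0.0), (1.0, 4.0, 0.0)]
domain_far = [(0.0, 50.0, 0.0)]
near_contacts = count_contacts(domain_1, domain_2)
```

Both thresholds are parameters, which makes it easy to repeat the robustness check mentioned in the text (1 to 10 contacts, 3 to 6 Å).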
Yeast. The interactions between structural domains in the yeast genome were obtained by assigning protein structures to the yeast proteins using PSI-BLAST and PDB-ISL as described in Teichmann et al. (39,68). Assigned structural domains contained within the same ORF that were adjacent within 30 amino acids were assumed to interact. (This is generally true of the domains in the PDB, with a few exceptions, such as domains in transcription factors like adjacent zinc fingers, or variable and constant immunoglobulin domains.) To derive intermolecular interactions in the yeast genome we combined three sets of protein-protein interactions: (i) the MIPS web pages on complexes and pairwise interactions (February 2000) (9), (ii) the global yeast two-hybrid experiments by Uetz et al. (51) and (iii) large-scale yeast two-hybrid experiments by Ito et al. (52). Out of all these pairwise interactions known for yeast ORFs, there is a limited set in which both partners are completely covered by one structural domain (to within 100 residues). This set of protein pairs was used to derive a further set of domain contacts in the yeast genome as described in (67).
Information on motions is from the Macromolecular Motions Database (36,37). We consider a set of approximately 4400 motions automatically identified by examining the PDB and a smaller, manually curated set of motions. For each fold we determine the number of entries in the motions database that are associated with it. Then over this set of motions we either average or take the maximum value of a number of relevant statistics describing the motion, i.e. the maximum Cα displacement in the motion, the overall rotation of the motion, and the energy difference between the start and endpoints of structures involved in the motion.
Ross-MacDonald et al. (40) developed a procedure for randomly inserting transposons throughout the yeast genome. They investigated the phenotypes resulting from each insertion in 20 different growth conditions in comparison to wild-type growth. The experiment for each insertion in each condition was repeated several times. If the observed phenotype of the mutant deviates from the average wild-type phenotype, this could be either because of a real effect of the mutation on the cell or it could just be a typical variation of the phenotype of wild-type cells. We developed a P-value score that measures the degree of confidence that the observed phenotype results from random variation among wild-type cells. The negative logarithm of this P-value rises with the significance of the phenotype measurements and can be understood as the sensitivity of the cell to mutations in a particular gene. We calculated a value for the transposon sensitivity for protein folds by geometrically averaging the P-values of the associated genes.
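The fold-level sensitivity can be sketched as a geometric mean of gene-level P-values, reported on a negative-log scale. The base of the logarithm (10 here) and the function name are assumptions; the text does not specify them.

```python
import math

def fold_transposon_sensitivity(p_values):
    """Geometric mean of gene P-values and its negative log10.

    Returns (geometric_mean_p, sensitivity). Averaging log10(p) and then
    exponentiating is equivalent to the geometric mean of the P-values.
    Base 10 is an illustrative assumption.
    """
    mean_log = sum(math.log10(p) for p in p_values) / len(p_values)
    return 10 ** mean_log, -mean_log

# Made-up P-values for two genes sharing a fold.
geo_mean_p, sensitivity = fold_transposon_sensitivity([0.1, 0.001])
```

Because the negative log of a geometric mean equals the arithmetic mean of the negative logs, a fold's sensitivity is simply the average gene sensitivity.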
The miscellaneous section includes any information that does not fit into a major category. It includes: number of pseudogenes in worm associated with a fold (53), total number of functions and number of enzymatic functions associated with a fold (54), the average length of the sequence, and the year the domain structure was originally determined.
The above data, of course, have systematic and statistical errors. For some attributes we expect considerably smaller errors than others. For instance, we expect the numbers related to the sequence composition of different folds (e.g. the Ala composition) to be particularly accurate, since the only factors affecting these are errors in the underlying sequence of the protein and in the SCOP fold classification itself. In contrast, there is a considerable known rate of false positives associated with the global protein interaction experiments using the two-hybrid method (51,69), and this suggests statistics based on yeast interactions may be somewhat less accurate. Furthermore, the precise values for the rankings in PartsList are also contingent on the evolving contents of various databanks. Thus, over time as more structures are determined, one should expect statistics such as the most common folds in a particular genome to change somewhat. A very detailed discussion of the expected errors in the various quantities in PartsList is available on the web from the help section.
Ranking all the folds based on extrinsic information
The PartsList resource facilitates exploring extrinsic information by dynamically ranking protein folds in different contexts, such as genome and expression levels. We provide three tools for visualizing the rankings: Comparer, Correlator, and Profiler. The overall structure of PartsList is schematically shown in Fig. 1.
The motivation behind Comparer is to allow one to rank folds according to a given attribute and then see the ranks associated with other attributes. The ranking attribute and the additional attributes are selected by the user. Figure 2(a) shows an example. The most common folds in E. coli are shown alongside three other attributes: fold occurrence in yeast, fluctuation in expression level during the yeast cell cycle, and fluctuation in expression level in E. coli during heat shock. Which displayed attribute is used to rank the folds can be easily changed; in the example in Figure 2(a) the report can be re-sorted based on the other three attributes by clicking on arrows.
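The idea behind Comparer can be illustrated with a small ranking routine: sort the folds by one attribute and list each fold's rank under the other attributes alongside. This is a sketch of the concept only, not the PartsList implementation; the table layout, attribute names, and tie handling (ties broken by input order) are assumptions.

```python
def rank_by(values, descending=True):
    """Map each fold to its rank (1 = top) for one attribute."""
    ordered = sorted(values, key=values.get, reverse=descending)
    return {fold: position + 1 for position, fold in enumerate(ordered)}

def compare(table, sort_attribute, other_attributes):
    """Rows of (fold, [rank under each attribute]), sorted by sort_attribute.

    table: {fold_id: {attribute_name: value}} (assumed layout).
    """
    attributes = [sort_attribute] + list(other_attributes)
    ranks = {a: rank_by({f: row[a] for f, row in table.items()})
             for a in attributes}
    ordered = sorted(table, key=lambda f: ranks[sort_attribute][f])
    return [(f, [ranks[a][f] for a in attributes]) for f in ordered]

# Made-up occurrence counts for three folds in two genomes.
toy_table = {
    "fold_A": {"ecoli": 30, "yeast": 5},
    "fold_B": {"ecoli": 10, "yeast": 50},
    "fold_C": {"ecoli": 20, "yeast": 20},
}
report = compare(toy_table, "ecoli", ["yeast"])
```

Re-sorting the report on another column corresponds to calling `compare` with a different `sort_attribute`.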
In principle, Profiler presents the same information as Comparer. However, it shows the progression of ranks for several pre-selected categories and is intended to give users an easy-to-use interface with some simple views of the data. Figure 2(b) shows an example that highlights the phylogenetic pattern of fold occurrence in 20 genomes.
Correlator uses linear and rank correlation coefficients to measure the association between two selected attributes. The difference between these two types of correlation coefficients is that the former relates to the actual values while the latter relates to the ranks among the samples. The interpretation of the linear correlation coefficient can be completely meaningless if the joint probability distribution of the variables is too different from a binormal distribution. This is the reason for introducing the rank correlation coefficient. Correlator provides both coefficients for the selected quantities. In most cases, they are close. For example, the linear correlation coefficient and rank correlation coefficient for fold occurrence in genomes A. fulgidus and M. jannaschii (Aful and Mjan) are 0.88 and 0.77, respectively, while the corresponding coefficients for fold occurrence in A. fulgidus and S. cerevisiae (Scer) are 0.52 and 0.48, respectively. This is not surprising, as the first two genomes are both Archaeal, while in the second comparison one genome belongs to Archaea (Aful) and another to Eucarya (Scer). As one would expect, the fold occurrences for the more closely related genomes have a higher correlation.
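The two coefficients can be sketched in a few lines: the linear (Pearson) coefficient is computed on the values themselves, and the rank (Spearman) coefficient is the same formula applied to ranks, with tied values sharing their average rank. This is a generic illustration with made-up numbers, not the Correlator code.

```python
import math

def pearson(xs, ys):
    """Linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Rank correlation: Pearson coefficient of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up values: monotonic but far from linear because of one outlier.
attribute_1 = [1, 2, 3, 4]
attribute_2 = [1, 4, 9, 100]
linear_r = pearson(attribute_1, attribute_2)
rank_r = spearman(attribute_1, attribute_2)
```

In this toy case the rank coefficient is 1 because the ordering is identical, while the linear coefficient is pulled below 1 by the outlier, which is the kind of distribution effect the rank coefficient is meant to guard against.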
In addition to the coefficients, Correlator displays a scatter plot to aid in visualizing the correlation between the selected fold attributes. Figure 2(c) shows the scatter plot for the second example above: the correlation between occurrences in the A. fulgidus and S. cerevisiae genomes. One can easily observe that some folds appear frequently in Scer but seldom or never in A. fulgidus. By clicking on a point on the plot, one obtains detailed information about the corresponding fold. This kind of plot can reveal interesting folds with certain relationships between attributes even though in some cases the overall correlation coefficients between the two attributes are almost zero (i.e. no correlation).