Principal Investigator/Program Director (Last, First, Middle): Jacobson, Matthew P.
Refinement of Comparative Protein Models
A. Specific Aims
We aim to create and test computational methods capable of refining comparative protein structure models to an accuracy comparable to that of moderate to high resolution experimental structures. Our overall strategy is to dramatically improve the efficiency of sampling, which has limited the success of prior efforts at comparative model refinement, by a combination of (a) identifying which degrees of freedom are critical to sample, (b) developing new algorithms for making large moves along these degrees of freedom, and (c) using experimental data, if available, to help constrain the search space. We will
Aims 1–3 address both Comparative Modeling Goals identified by RFA-GM-05-008, “High Accuracy Protein Structure Modeling”. Our approach integrates methods grounded in bioinformatics (Sali), physics (Jacobson and Dill), and applied mathematics (Coutsias). Close collaboration among these researchers, as well as with Dr. Shoichet (Aim 4), is facilitated by most of the researchers being located together at UCSF; our budget also allocates funds for Dr. Coutsias (U. New Mexico, Dept. of Mathematics and Statistics) to spend several months a year at UCSF. The tangible outcome of this research will be a set of freely available, modular source code and executable programs that implement the methods in Aims 1–3.
B. Background and Significance
Comparative structure prediction, in conjunction with the Protein Structure Initiative (PSI), has the potential to bridge the gap between the number of available protein sequences (>1 million) and structures (>20,000). The experience of the Sali group in constructing ModBase, a database currently containing over one million protein comparative models, highlights this potential. The fraction of sequences with comparative models for at least one domain is currently 57%.1 This number will continue to grow as the PSI enters its production phase. The New York Structural Genomix Research Consortium documented the number and quality of the comparative models that could be built based on their new structures. On average, about 100 protein sequences without any prior structural characterization could be modeled for each new structure.1 The accuracy of these models, however, varies significantly, with many accurately representing the overall tertiary structure, but relatively few (<10%, i.e., those with >50% sequence identity) expected to be as accurate as moderate resolution experimental structures (1–2 Å RMSD). While the CASP and CAFASP competitions have shown some measurable progress over time in the accuracy of comparative models2,3, they indicated little ability to refine comparative models to an accuracy better than the template protein. The research we propose addresses this critical bottleneck to high accuracy protein comparative modeling.
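For readers less familiar with the accuracy measure quoted above, the following is a minimal sketch of how a root-mean-square deviation (RMSD) between a model and a reference structure can be computed. It assumes the two coordinate sets are already superimposed and list the same atoms in the same order; the function name `rmsd` and the toy coordinates are illustrative only and are not taken from any software described in this proposal.

```python
import math


def rmsd(coords_a, coords_b):
    """Return the RMSD between two equally sized lists of (x, y, z) tuples.

    Assumption: the structures are already superimposed and the atoms are
    listed in matching order. The result is in the units of the input
    coordinates (e.g. angstroms).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy example: two atoms, one of which is displaced by 2 units along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(round(rmsd(model, reference), 3))
```

In practice an optimal rigid-body superposition is performed before the deviation is measured; that step is omitted here to keep the sketch short.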
The three major sources of inaccuracy in comparative protein models are 1) incorrect choice of template protein, 2) inaccuracy in aligning the target sequence to the template, and 3) inability to routinely refine comparative models, i.e., to predict conformations of residues that do not align to the template, structural differences between the target and template proteins in aligned regions, and critical details such as side chain conformations. Improved methods of sequence alignment, fueled by the ever-growing databases of protein sequences4 and structures as well as algorithmic improvements, will contribute to the first two of these challenges. The research we propose focuses on the third of these challenges, model refinement, but will also contribute to identifying correct templates and alignments. Specifically, we propose methods for conformational sampling and scoring of comparative protein models, generated by any alignment and model building protocols. The new algorithms we develop will be capable of refining individual models to improve the accuracy, and choosing the most accurate model among several generated from different templates and/or different alignments.
In short, we aim to improve the accuracy of comparative models by identifying the global free energy minimum of the protein sequence. This is a very challenging undertaking, despite the fact that the initial model(s) should be “close” to the global minimum, in the sense that at least the tertiary fold is correct, as long as the correct template is chosen. Success requires both adequate sampling and accurate scoring, and these two imperatives work against each other: more accurate scoring functions generally entail greater computational expense, reducing the amount of sampling that can be accomplished with fixed computer time. Our strategy is to develop methods that dramatically improve the efficiency of sampling, by a combination of 1) identifying which degrees of freedom are critical to sample, 2) developing new algorithms for making large moves along these degrees of freedom, and 3) using experimental data, if available, to help constrain the search space.
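The tension between scoring accuracy and sampling can be made concrete with a small arithmetic sketch. The per-evaluation costs below are hypothetical placeholders chosen only for illustration, not measurements of any particular energy function, and the function name `conformations_sampled` is likewise an assumption.

```python
def conformations_sampled(budget_ms, cost_per_evaluation_ms):
    """Number of conformations that can be scored within a fixed time budget.

    Both arguments are integer milliseconds, so the result is exact.
    """
    if budget_ms < 0 or cost_per_evaluation_ms <= 0:
        raise ValueError("budget must be non-negative and cost must be positive")
    return budget_ms // cost_per_evaluation_ms


one_hour_ms = 60 * 60 * 1000

# Hypothetical costs: a cheap scoring function versus one 100x more expensive.
cheap = conformations_sampled(one_hour_ms, 10)
expensive = conformations_sampled(one_hour_ms, 1000)
print(cheap, expensive)
```

Under these assumed costs, a scoring function that is 100 times more expensive reduces the number of conformations that can be evaluated in the same time by a factor of 100, which is why the gains in sampling efficiency proposed here matter most when all-atom scoring is used.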
In Section B.1, we argue that existing energy models, particularly those that treat the protein at an atomic level of detail, are accurate enough to be useful in refining comparative models. Then, in Section B.2, we review available methods for protein sampling and outline our approach to improving sampling efficiency for comparative model refinement. In Section B.3, we discuss the role that experimental data can play in aiding comparative model refinement, and identify low-resolution data from X-ray crystallography as a neglected but potentially very useful source of data for this purpose. Finally, in Section B.4, we return to the potential impact of new methods for high-accuracy comparative modeling, highlighting the role that comparative models can play in structure-based inhibitor discovery and the challenges that confront this goal.
B.1 Refinement of Comparative Models: Scoring
Several lines of evidence suggest that currently available all-atom energy functions, although they have limitations5, are capable of the accuracy required to achieve our goal (i.e., refining comparative models to 1–2 Å RMSD). We focus attention on scoring functions composed of all-atom force fields and implicit solvent models. These are used as the primary scoring functions in Aims 1 and 2 due to the attractive balance they provide between accuracy and computational efficiency. However, the sampling methods that we develop can be used with virtually any all-atom scoring function, whether physics- or knowledge-based. Studies that provide grounds for optimism that all-atom energy functions can provide the necessary accuracy include:
Figure 1. Loop prediction results from Jacobson et al.20. Left: Statistics of loop prediction on 4, 6, 8, and 10 residue loops. Right: Example of local energy minima identified for a nine residue loop in 3pte (residues 78–86), during the three-stage prediction algorithm. These results and others suggest that, in most cases, the OPLS/GB energy function is capable of identifying near-native conformations, and dihedral-angle based sampling is capable of generating near-native conformations.
Given that it is possible to identify accurate structures among decoys and reconstruct portions of proteins with high accuracy with currently available energy functions, why is it not possible to routinely refine homology models to comparable levels of accuracy? One critical bottleneck is sampling. A typical homology model requires refinement of several loops simultaneously, and may also exhibit distortions or incorrect lengths of secondary structure elements. Sampling of all of these degrees of freedom simultaneously represents a major challenge, but we argue in the next section that this challenge is not insurmountable.