Bioinformatics – DRAFT#1
(Book chapter for American Society for Information Science & Technology, vol. 40, 2005)
The application of computers in molecular biology helps manage the already vast and increasing amount of biological data and plays an integral role in the discovery of new biological relationships. Though the term “bioinformatics” is popular today, work within the field continues and integrates various earlier threads of computer analysis of sequence data: sequence-similarity searching (e.g., FASTA), Clustal multiple-sequence alignment, and phylogenetic analysis (Misener & Krawetz, 2001). From this foundation grew specific software applications for sequence analysis and presentation, such as Genotator, which integrates the output of multiple analyses into formats appropriate for professional communication. Recent technologies such as the Internet have, naturally, influenced access not only to molecular databases but also to the background literature of biology. For instance, software and techniques such as primary sequence analysis methods, transcriptional control region identification using MatInspector, and oligonucleotide and PCR-primer design have become part of the biologist’s work kit. The published science related to this work, too, is evolving. As Paris (2003) describes it: “It has been assumed that the supporting data are available to all legitimate researchers who ask for it – i.e., that this is truly public data. The papers themselves have served as the primary hook by which the research is recognized, attributed and credited, as well as the primary location in which the authors provide interpretation and detailed annotation. But as the volume of data has increased, these publications have become more and more abstract minimalist representations of the larger and larger bolus of genomic data: bookmarks or placeholders with hundreds of co-authors, and fewer and fewer interpretations and insights – not just because of the quantity of data, but also because of growing proprietary IP (intellectual property) concerns.”
The large-scale characterization of genomes, mainly by DNA sequencing, was termed “genomics”. However, the field expanded quickly to cover characterization of genomes on the sequence level, as well as the comprehensive analysis of gene expression and the protein complement of organisms. It is difficult for any biologist to remain current as the volume and variety of work expand. Yet a consistent theme appears in the growing literature of genomics, molecular biology, biocomputing, and computational molecular biology – that the umbrella term “bioinformatics” represents a “new, growing area of science that uses computational approaches to answer biological questions. Answering these questions requires that investigators take advantage of large, complex data sets (both public and private) in a rigorous fashion to reach valid, biological conclusions. The potential of such an approach is beginning to change the fundamental way in which basic science is done, helping to more efficiently guide experimental design in the laboratory” (Baxevanis, 2001, p. 1). To biologists, then, bioinformatics is intended to serve their scientific goals of analysis, sequencing, and application of genomic data. As a corollary, bioinformatics integrates powerful computer systems for modeling genomic and protein data, displaying the results on screen as 2D and 3D visualizations to help the scientist interpret the phenomenon being studied, and suggesting underlying mathematical models that explain the behavior of biological processes. For the information and computer scientist, and increasingly for the biologist, bioinformatics is turning to methods that integrate the heterogeneous representations of the literature around genomic processes into a single information retrieval and display system.
Indeed, many concerns of bioinformatics, such as “gene discovery, site-directed mutagenesis, and experiments to expose previously unknown relationships with respect to the structure and function of genes and proteins” (Baxevanis, 2001, p. 1), can be expressed as data mining and information retrieval questions. Therefore, the opportunity for cooperation among molecular biologists, information scientists, and computer scientists has never been greater. This chapter is dedicated to helping information scientists approach the world of molecular biology from their own perspective by explaining the tools of bioinformatics, its relationship to other fields, and the accomplishments of and challenges facing bioinformatics.
Tools of Bioinformatics
The tools of bioinformatics as defined above focus primarily (though not exclusively) on DNA sequences, the use of very large databases, and the sharing of research results, or professional communication.
“DNA”, or deoxyribonucleic acid, is composed of four chemical bases: purines (adenine, abbreviated “A”, and guanine, “G”) and pyrimidines (cytosine (C) and thymine (T)). “Sequencing” is determining the exact order of these building blocks in the DNA. Each base differs from the others in its combination of oxygen, carbon, nitrogen, and hydrogen. Every base is attached to a “deoxyribose”, or sugar molecule, and to a phosphate molecule, creating a nucleic acid unit, or nucleotide. The nucleotides are linked in a certain order, or sequence, through the phosphate groups. The precise order and linking of bases determines what a gene produces. For humans, this means the “exact order of 3 billion blocks (called bases and abbreviated A, T, C, and G) making up the DNA of the 24 different human chromosomes….” The most famous sequencing project is the Human Genome Project, whose goal is to “reveal the estimated 100,000 human genes within our DNA as well as the regions controlling them.” In addition, one of the purposes of the Human Genome Project is to identify the small regions of DNA that vary between individuals. These differences may underlie disease susceptibility and drug responsiveness, particularly the most common variations, which are called SNPs (single nucleotide polymorphisms) (HGPI, 2002). The sequencing process includes:
First, chromosomes (of up to 250 million bases) are divided into smaller pieces, or “subcloned.” A template is created from each shorter piece to generate fragments, each differing in length by a single base. That terminal base serves as an identifier during template preparation and the sequencing reaction. Using fluorescent dyes, the fragments can be identified by color when they are separated by a process called “gel electrophoresis.” The base at the end of each fragment is then identified (“base-calling”) to help recreate the original sequence of A, T, C, and G for each subcloned piece. A four-color histogram (a “chromatogram”) is created to show the presence and location of each base. Finally, the short sequences, in blocks of about 500 bases (called the “read length”), are assembled by computer into long, continuous stretches for analysis of errors, gene-coding regions, and other distinctions.
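The final assembly step can be illustrated with a toy sketch. Real assemblers must cope with sequencing errors, repeats, and millions of reads; the greedy merge below assumes clean, error-free reads supplied in order, and the read strings and function names are invented for illustration only.

```python
# Toy sketch of read assembly: merge short overlapping reads into one
# continuous stretch by joining each read at its longest suffix/prefix
# overlap with the growing contig.

def merge(left: str, right: str, min_overlap: int = 3) -> str:
    """Join two reads at the longest overlap of at least min_overlap bases."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return left + right  # no usable overlap found

def assemble(reads: list[str]) -> str:
    """Greedily merge reads left to right (assumes they are in order)."""
    contig = reads[0]
    for read in reads[1:]:
        contig = merge(contig, read)
    return contig

reads = ["ATGCGTAC", "GTACCATT", "CATTGGA"]
print(assemble(reads))  # -> ATGCGTACCATTGGA
```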
The results of the sequencing can now be published. Sequence results are submitted to major public sequence databases, such as GenBank of the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM) and the National Institutes of Health (NIH). NCBI hosts molecular sequence databases (e.g., GenBank) and literature databases (e.g., PubMed and Online Mendelian Inheritance in Man (OMIM)), which are detailed below. The data generated by DNA sequencing are stored in very large, multidimensional databases and manipulated with packages such as GCG’s Wisconsin Package, while literature resources intended to teach bioinformatics and keep the scientist up to date include Medline, SCI, and journals, especially Science, the Journal of Biological Chemistry, Cell, Molecular Medicine Today, the New England Journal of Medicine, and special-purpose serials such as Briefings in Bioinformatics.
Very Large Databases
There is an amazing variety of software packages available to molecular biologists to support research by searching very large databases. Most of these systems manipulate locally collected data or data extracted from public or private databases. Many of these databases are detailed below.
The packages support particular functions of gene sequencing. For example, there are at least 150 free software applications (Gilbert, 2000, pp. 157-184) addressing all aspects of sequencing. One popular package is the Genetics Computer Group’s “Wisconsin Package” (http://www.gcg.com/). An integrated suite of over 130 program tools for manipulating, analyzing, and comparing nucleotide and protein sequences, the package also offers a GUI, SeqLab, for interacting with color-coded graphic sequences. The package includes sequence-comparison statistics (alignment of two sequences to indicate gaps, best fit, and x/y plotting of sequence similarity); database searching tools (LookUp and StringSearch for biological literature; BLAST, NetBLAST, FASTA, and others for sequence strings; PAUPSearch, GrowTree, and Diverge for phylogenetic relations); fragment assembly; gene finding and pattern recognition tools; protein analysis (e.g., PeptideMap); and ChopUp, Reformat, and others for manipulating text files. Similarly, DNASTAR’s Lasergene Sequence Analysis Software is a suite of eight applications to trim and assemble sequence data; discover and annotate gene patterns; predict protein secondary structure; create Boolean queries from sequence similarity, consensus sequences, and text terms; support sequencing, hybridization, and transcription; create maps; and import data from other sources (Burland, 2001, p. 71). It isn’t possible to review all applications and their capabilities, but the reader can see that software applications in molecular biology have moved from stand-alone, mini- and mainframe applications to desktop, Internet-, and LAN-based ones, with an emphasis on the graphic display of data, pattern detection and sequence prediction, and integration of the literature.
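The pairwise comparison tools mentioned above rest on classic dynamic programming. The sketch below is a minimal global-alignment scorer in the style of Needleman-Wunsch, not the implementation of any package named here; the scoring values (match = 1, mismatch = -1, gap = -2) are arbitrary illustration choices.

```python
# Minimal global-alignment scoring by dynamic programming: dp[i][j]
# holds the best score for aligning the first i bases of a against
# the first j bases of b, allowing gaps in either sequence.

def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -2) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # aligning a[:i] against nothing
        dp[i][0] = i * gap
    for j in range(1, cols):          # aligning nothing against b[:j]
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            dp[i][j] = max(diag,           # substitute or match
                           dp[i-1][j] + gap,   # gap in b
                           dp[i][j-1] + gap)   # gap in a
    return dp[-1][-1]

print(global_alignment_score("GATTACA", "GATCA"))
```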
Relationship to Other Fields
Today’s bioinformatics has its parentage in several fields – statistics, computer science, and medicine – and there remains some confusion about the boundaries of the field. Biology turned, as many fields do, to statistics to help model and explain patterns and deviations from patterns. With the introduction of computer technology, a greater variety of patterns could be examined more quickly and without the likely introduction of human error. Applied to biomedical processes, a picture of the invisible world of the molecule became possible. While expanded modeling of these processes made biology more approachable, some researchers diverged into narrower techniques. For example, computational molecular biologists (e.g., Laszczynski, 1999) choose numerical simulation as a complement to traditional theoretical and experimental approaches, and as a field can probe more deeply, testing theories that cannot otherwise be examined, such as phenomena at the atomic level. Pursuing this path would lead to quantum mechanics and the various simplifications (e.g., Born-Oppenheimer) employed. However, it does suggest that computational molecular biology focuses on computational and mathematical answers to biological questions.
As bioinformatics is defining itself, so is the related field of medical informatics, whose definitions are at times a bit circular. One medical informatics program describes its work as “devoted to basic investigation and training in both clinical informatics and bioinformatics” in order to “… bring together scientists who create and validate models of how knowledge and data are used within biomedicine” (Musen, 1999). Jenders, Sideli & Hripcsak (1998) offer several definitions: “medical informatics = study and use of computers and information in health care; definition by MF Collen (MEDINFO ‘80, Tokyo, later extended): ‘Medical informatics is the application of computers, communications and information technology and systems to all fields of medicine - medical care, medical education and medical research’ [and] ‘definition by Asso. of American Medical Colleges (AAMC) 1986 ‘Medical informatics is a developing body of knowledge and a set of techniques concerning the organizational management of information in support of medical research, education, and patient care.... Medical informatics combines medical science with several technologies and disciplines in the information and computer sciences and provides methodologies by which these can contribute to better use of the medical knowledge base and ultimately to better medical care.’” It seems, then, that the two kinds of biomedical informaticians – medical and biological – are conceptually related in that both aim to improve the health of people and share educational resources; the practice of each varies, however, the one focusing on clinical practice, the other on basic biological science.
However, the boundaries between genomics and bioinformatics are porous. The literature seems to suggest that genomics, like computational molecular biology, focuses on a smaller set of biological processes, while bioinformatics emphasizes storage and retrieval of biological literature and a broader perspective on genomic data. For example, genomics appears to emphasize research into the microbiology and genetics of specific organisms, such as the E. coli genome or the fruitfly Drosophila melanogaster, although DNA microarray research appears to belong to both (e.g., Tessier et al.).
Accomplishments of Bioinformatics
Bioinformatics encompasses all aspects of molecular biology research and has made amazing advances in understanding and sharing information about molecular processes, not least the structure of protein molecules themselves and the biological sequences in which they are found. Applying statistical, mathematical, and computer techniques has pushed bioinformatics into fuller explanations of the invisible and unanticipated, such as energy equations to model the dynamic behavior of molecules, linear and 3D protein functions, and probabilistic models of sequences, especially Hidden Markov Models (Koski, 2001). The computer flat files that contain these data are now used to visualize known biological structures (e.g., using Cn3D) and to help predict macromolecular structure in 3D. Furthermore, and of particular interest to pharmaceutical companies, the application of genomic data to the study of variation in host and pathogen DNA and disease helps design drug treatments (Lengauer, 2002). Of significance to information science are the databases and software used to store, retrieve, and display these data.
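The Hidden Markov Model idea mentioned above can be made concrete with a toy example: a two-state model of a DNA sequence (an “AT-rich” versus a “GC-rich” region) scored with the forward algorithm. All probabilities here are invented for illustration; real sequence HMMs are trained from data and have many more states.

```python
# Toy two-state HMM over DNA bases, scored with the forward algorithm.
# States model AT-rich vs GC-rich stretches; transitions favor staying
# in the current state. Probabilities are illustrative, not fitted.

STATES = ("AT", "GC")
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1},
         "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {"AT": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
        "GC": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35}}

def forward(seq: str) -> float:
    """Total probability of the sequence under the toy HMM."""
    # alpha[s] = probability of the prefix seen so far, ending in state s
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for base in seq[1:]:
        alpha = {s: sum(alpha[p] * TRANS[p][s] for p in STATES) * EMIT[s][base]
                 for s in STATES}
    return sum(alpha.values())

print(forward("GGCG"))  # a GC-rich string is more probable than an AT-rich model alone would make it
```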
The vast amount of data generated in sequencing and in the supporting literature of molecular biology has resulted in a tremendous number of databases and the attendant difficulties of querying across a heterogeneous environment. As a result, the Sequence Retrieval System (SRS) (Etzold et al., 1996) and the “NCBI Data Model” are used. These are detailed below, after a brief exposition of the main information resources employed in bioinformatics. The primary databases may be divided by theme: bibliographic; taxonomic; nucleic acid; genomic; protein and specialized protein databases; protein families, domains, and functional sites; proteomics initiatives; and enzyme/metabolic pathways (Zdobnov et al., 2002). There are far too many databases to be reviewed here, but a comprehensive list is available at http://www.expasy.ch/alinks.html.
Of these, the most commonly used, and the only one that is publicly available, is Medline (or PubMed, http://www.ncbi.nlm.nih.gov/PubMed/). Commercial databases include Embase (biomedical and pharmacological abstracts), Agricola, and Biosis (the former Biological Abstracts). Interestingly, taxonomic databases reflect an old issue in librarianship: controlled vocabularies (ontologies) reflecting the knowledge and modes of expression of a given field. NCBI maintains the most important taxonomic databases, whose hierarchical taxonomy is used by the Nucleotide Sequence Databases, SWISS-PROT, and TrEMBL (along with derivatives such as NiceProt).
Nucleotide sequence databases represent an international effort. GenBank (of the National Center for Biotechnology Information (NCBI)), the European Bioinformatics Institute (EBI, http://www.ebi.ac.uk/), and the DNA Data Bank of Japan have joined to create the International Nucleotide Sequence Database Collaboration. The quality and currency of the data vary between databases. For instance, the quality of the data in the nucleotide sequence databases is the responsibility of the authors or submitters (the scientists themselves; there is no professional enforcement of standards). With more than 10 billion nucleotides in more than 10 million individual entries, one can imagine the potential error rate (http://www3.ebi.ac.uk/services/DBStats/). See Rodriguez-Tomé (2001) for a description of EMBL and examples of interfaces for submitting to and searching the databases, and the Genome Monitoring Table for updates on the progress of genome sequencing projects (http://www.ebi.ac.uk/~sterk/genome-MOT/MOTgraph.html).
Other projects address error potential through clustering and specialization. Reminiscent of latent semantic analysis, clustering of data to remove redundant records is performed by UniGene (http://www.ncbi.nlm.nih.gov/UniGene) and STACK (Sequence Tag Alignment and Consensus Knowledgebase, http://www.sanbi.ac.za/Dbases.html). The Ribosomal Database Project (http://rdp.life.uiuc.edu/index.html), the HIV Sequence Database (http://hiv-web.lanl.gov/), the IMGT database (http://imgt.cnusc.fr:8104/texts/info.html), Transfac (transcription factors and transcription factor binding sites, http://transfac.gbf.de/TRANSFAC/index.html), EPD (the Eukaryotic Promoter Database, ftp://ftp.ebi.ac.uk/pub/databases/epd), REBASE (http://rebase.neb.com/rebase), and GoBase (http://megasun.bch.umontreal.ca/gobase/gobase.html) are all examples of specialty resources.
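The redundancy-removal step these projects perform can be sketched as a simple clustering pass in which records with identical sequences collapse to a single representative. Systems like UniGene of course use far more sophisticated similarity measures; the record identifiers and sequences below are invented.

```python
# Simplistic redundancy removal: cluster records by exact sequence
# identity and keep the first record seen as each cluster's
# representative. Illustrative only; real clustering is similarity-based.

def deduplicate(records: dict[str, str]) -> dict[str, str]:
    """Map of record id -> sequence, reduced to one id per unique sequence."""
    seen: dict[str, str] = {}          # sequence -> representative id
    for rec_id, seq in records.items():
        seen.setdefault(seq, rec_id)   # first id seen becomes representative
    return {rec_id: seq for seq, rec_id in seen.items()}

records = {"EST1": "ATGGCA", "EST2": "ATGGCA", "EST3": "TTGACC"}
print(deduplicate(records))  # EST2 collapses into EST1's cluster
```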
The popular knowledge of the Human Genome Project has introduced to the public three significant databases for human genes. The primary human genome database is the Genome Database, or GDB (http://www.gdb.org). Related to this is Online Mendelian Inheritance in Man (http://www3.ncbi.nlm.nih.gov/Omim/), which catalogues all human genes and genetic disorders. The Sequence Variation Database, like OMIM and GDB, maps genetic variation, but has links to many sequence variance databases (http://www.ebi.ac.uk/mutations/index.html) and, via the Sequence Retrieval System (SRS) interface, to other human mutation databases. Increasingly, portals are being instituted to harmonize searches, such as GeneCards (http://bioinfo.weizmann.ac.il/cards/).
Perhaps the best known are the protein sequence databases. Like the nucleotide databases, the protein sequence databases fall into two groups: those covering all species and those covering specific organisms. Of interest to information science is the further division of these databases into “sequence data” and “annotated sequence data”. SWISS-PROT (Bairoch & Apweiler, 2000) is an annotated universal protein sequence database (http://www.expasy.ch) that strives for quality in its annotations and for integration with other biomolecular databases. Each entry is analyzed by biologists: as of May 2000, there were more than 85,000 annotated sequence entries from more than 6,000 different species. A sister product, TrEMBL (Translation of EMBL nucleotide sequence database), was created from Swiss-Prot to speed new sequence information to the public. SP-TrEMBL focuses on entries that will later be incorporated into Swiss-Prot. REM-TrEMBL contains other data that will not be integrated because they may be redundant, truncated, or not real proteins. SPTR (SWALL) is another protein sequence database that provides non-redundant sequence data by focusing on data currency in SWISS-PROT, ignoring REM-TrEMBL, and performing sequence comparisons against a database of all known isoforms.
The specialized protein sequence databases perform different functions with the data, such as pre-clustering of Swiss-Prot records (CluSTr, http://www.ebi.ac.uk/clustr) and catalogues and structure-based classification of peptidases (MEROPS, http://www.merops.co.uk, with “PepCards” providing classification, nomenclature, and hyperlinks for each peptidase, plus Fam[ily]Cards and ClanCards). The Yeast Protein Database (YPD, for Saccharomyces cerevisiae, http://www.proteome.com/databases/) details about 6,000 yeast proteins. Its protein classification schemas define the cellular role, function, pathway, and other functional data in the “YPD Protein Reports.”
Finding relationships when an unknown protein cannot be matched to other known structures calls for examining “sequence signatures.” PROSITE, PRINTS, Pfam, ProDom (http://www.toulouse.inra.fr/prodom.html), and especially InterPro attempt in one form or another to derive patterns from sequence databases, using various clustering algorithms. InterPro (Integrated Resource of Protein Families, Domains and Functional Sites) is an integrated documentation resource for PROSITE, PRINTS, and Pfam; it helps address the question of ambiguous biological relevance when a pattern is detected (e.g., by ignoring family discriminators) by linking to known protein sequences in Swiss-Prot and TrEMBL. InterPro entries are available as XML-formatted files (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
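The “sequence signature” idea can be illustrated concretely: a PROSITE-style pattern is essentially a restricted regular expression over amino acid letters, so a simplified pattern can be translated into a Python regex and run against protein sequences. The conversion below handles only the basic notation (residue letters, `x` wildcards, `x(n)` repeats, `[...]` alternatives); it ignores the rest of the real PROSITE syntax, and the pattern and sequence shown are invented examples, not actual PROSITE entries.

```python
# Translate a simplified PROSITE-style pattern into a regular
# expression and scan a protein sequence for the signature.
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert e.g. 'C-x(2)-[DE]-H' into the regex 'C.{2}[DE]H'."""
    regex = pattern.replace("-", "")                 # drop position separators
    regex = re.sub(r"x\((\d+)\)", r".{\1}", regex)   # x(n) -> .{n}
    regex = regex.replace("x", ".")                  # bare x -> any residue
    return regex

sig = prosite_to_regex("C-x(2)-[DE]-H")  # hypothetical signature
print(re.search(sig, "MKCVAEHLL") is not None)  # -> True ("CVAEH" matches)
```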
Some work focuses on learning more about organisms at various levels, even when the protein sequences (the proteome) have no known role. For instance, the Kyoto Encyclopedia of Genes and Genomes (“KEGG”, http://www.genome.ad.jp/kegg/) and the Proteome Analysis Initiative (http://www.ebi.ac.uk/swissprot/hbi/hpi.html) provide information at the gene, transcript, protein, and function levels. In addition, all completely sequenced organisms in Swiss-Prot and TrEMBL have their proteome-set information available through InterPro and CluSTr. This includes the amino acid composition and links to homology data (HSSP, Homology-derived Secondary Structure of Proteins, http://www.sander.ebi.ac.uk/hssp). As evidence of the need to understand better the function of proteins and genes (Ashburner et al., 2000) and how to associate the literature more successfully, Paris (2003) notes: “Recognizing the existing problems in classifying and organizing information about cell and molecular biology, especially in this era of exponentially exploding data from genomics and proteomics experiments, a consortium was proposed by Michael Ashburner in 1998 (ISMB), and eventually established in 1999, to create and promote a consistent, scientifically-sound, useful ‘gene ontology’. The result (http://www.geneontology.org/) is a troika of tree-schemes based on molecular function, biological processes, and cellular component; genes and gene products can map to multiple locations in multiple trees, reflecting biological diversity and (to some extent) ambiguity of knowledge. If used to support the annotation process, this is one approach that will help eliminate many problems ….”
NCBI Data Model
Of particular interest to information scientists is the NCBI Data Model. “This new and more powerful model made possible the rapid development of software and the integration of databases that underlie the popular Entrez retrieval system and on which the GenBank data is now built. The advances of the model (e.g., the ability to move effortlessly from the published literature to DNA sequences to the proteins they encode, to chromosome maps of the genes, and to the three-dimensional structures of the proteins) have been apparent for years to biologists using Entrez …” (Baxevanis & Ouellette, 2001, p. 20).
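The Entrez system can also be reached programmatically through NCBI’s E-utilities CGI interface. As a hedged sketch, the snippet below builds (but does not send) an esearch query URL; the `db`, `term`, and `retmax` parameter names follow the public esearch interface, while the search term itself is just an example.

```python
# Build an Entrez E-utilities "esearch" query URL. No network request
# is made here; the URL could be fetched with any HTTP client.
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an esearch URL querying the given Entrez database."""
    return EUTILS + "?" + urlencode({"db": db, "term": term, "retmax": retmax})

print(esearch_url("pubmed", "bioinformatics"))
```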
The NCBI data model is encoded in Abstract Syntax Notation 1 (ASN.1), an ISO standard for the reliable encoding of data, in flat files that are data-centered (here, on the DNA), human-interpretable, and computer-readable. NCBI’s website (http://www.ncbi.nlm.nih.gov/) is a portal offering PubMed, Entrez, BLAST, OMIM, and other services. Searches are permitted by PubMed (author and journal), protein, nucleotide, structure, genome, PMC, LocusLink, PopSet, OMIM, taxonomy, book, ProbeSet, 3D domain, UniSTS, domain, SNP, journal, and UniGene. The definitions of each, from the NCBI homepage, follow (http://www.ncbi.nlm.nih.gov:80/entrez/query/static/help/helpdoc.html):