300 The Fenway
Boston, MA 02115-5898
The fusion of biology and information technology into an interdisciplinary field called bioinformatics is a natural one. As Altman and Koza (1996) describe it, biology and the information sciences are similar because an “organism must use energy to maintain [the distinction between inside and outside], and must develop strategies for gathering, storing, and using the energy for that purpose … and that the basic tools available to organisms are the ability to store information in DNA and the ability to effect function with protein…”. Likewise Staben (see Appendix 1) draws parallels between the information flows in cellular, experimental, and evolutionary situations and the biologists’ own information flows. The National Center for Biotechnology Information (NCBI) emphasizes the association when describing its databases, the portals to those databases, and the underlying scientific endeavors of biology. For example, NCBI stresses evolutionary biology, protein modeling, and genome mapping in its definition of bioinformatics, and how “In light of these advances, a researcher’s burden has shifted from mapping a genome or genomic region of interest, to navigating a vast number of Web sites and databases” (NCBI, 2003, ¶13).
As the above suggests, a review of the field must include a discussion of the biological processes being studied as well as of the databases resulting from this research. The task is made difficult because the thrust of molecular biology is shifting from gene sequences to expression data, and because the goals of bioinformatics are debated in the literature. Moreover, the sheer volume, heterogeneity, distribution and use of biological databases make simple descriptions useless. This review sees “bioinformatics” as consisting of two symbiotic branches. One emphasizes the generation of macromolecular structure, genome sequence and gene expression data, although the specifics fall outside the scope of this review.
The other branch, and the focus of this paper, is the application of large computer-based sources of data and scientific papers, and of the computational techniques involved in database design and data mining, information storage and retrieval, visualization of sequence and structural alignment, macromolecular geometry, phylogenetic tree construction, prediction of protein structure and function, gene finding, and expression data clustering. Today’s molecular biology calls for biologists to learn enough information technology to help themselves, or for computer and information scientists to learn enough biology to communicate successfully (Altman and Koza, 1996, 73) and find mutually beneficial roles (Denn and MacMullen, 2002).
The review first considers 1) the difficulty of defining the field, then 2) the basics of biology and the data produced through research, 3) examples of data models and sources and their uses, and 4) opportunities for collaboration.
There are many definitions of the field. Ellis (2003a) notes 40 published operational definitions between 2000 and 2001 and another 37 (2003b) in 2003. The definitions vary by audience. It appears that several specialties were working out for themselves their roles in molecular biology and the legacy of computational biology’s influence on their work. For example, bioinformatics appears as basic scientific work (Allen 2001), education (Brass, 2000; Ouzounis 2000; Sander, 2001; Pearson 2001), employment and retooling (Henry 2001; Zauhar 2001; Gardiner 2001; Bass, 2000), computing (NIH 2000; Watkins, 2001; Sansom & Smith 2000; Gibas 2001; Russon, 2000; Steinberg 2000; Emmett, 2001; Attwood & Miller 2001), genomics and proteomics (Smith 2000) and sub-domains, e.g., molecular therapeutics (Swindells 2000), as well as overviews of the field (McDonald 2001; Bernstein 2001; BioTech 1998; Adler & Conklin, 2000; Mohan-Ram 2000; Taylor, 2000; Cottle 2001). Some of the reviews are biologists speaking across domain barriers to computer & information science, wondering aloud how to manage the data (Watkins 2001; Russon 2000; Emmett 2001; Butte 2001; Roos 2001; Attwood 2000).
A second reason for the many definitions is that gene research is moving away from gene sequencing toward gene expression and protein sequences (the proteome), raising different research questions and new roles for other fields, specifically computer science, information science, statistics, and mathematics. There are also more questions about how to make the results of retrieval more intelligible to biologists (information retrieval, visualization, and data mining). The literature reflects the maturing practice of bioinformatics as, more evidently, two distinct work roles (biology + information technology) collaborating on specific biological questions: e.g., drug discovery (Gatto, 2003), pharmacogenomics (Jain 2003), neurosurgery (Taylor, Mainprize, & Rutka, 2003), medical practice (Grant, Moshyk, Kushniruk, & Moehr, 2003; Breski 2002), and glycobiology (Marchal, Golfier, Dugas & Majed 2003).
Turning the tables, there are also significant bioinformatics reviews of biology. Naturally, there remain the eternal questions of employment and grants (Van Haren 2002; Kolatkar 2002; Jenson 2002; Basi, Clum, & Modi, 2003; Henry 2002; Calandra 2002; Schachter 2002). Yet the field struggles to define itself (Altman & Dugan 2003; Bayat 2003; NCBI 2003; Ouzounis 2002; Fuchs 2002) and its relationships with medical informatics (Altman 2000; Altman 2003), health informatics (Grant, Moshyk, Kushniruk & Moehr, 2003), mathematics (Andersson, Larsson, Larsson & Jacob 1999) and traditional genomics (Rost, Honig & Valencia 2002; Valencia 2002; Altman 2003; Chicurel 2002). Most intriguing, perhaps, for readers of ARIST are the information metaphors in biology (Nishikawa 2002). The critical role of “information” (in the LIS sense), however, remains (Gwynne 2002; Denn & MacMullen 2002; Paris 2003; Luscombe, Greenbaum & Gerstein 2001), as does the search for solutions to data-modeling problems, such as applying XML.
Historically, some of the work today associated with “bioinformatics” was viewed as computational biology and genomics. Faced with more data than could be efficiently processed and with new research questions, biology turned, as many fields do, to statistics and technology to help model phenomena and to expose interesting patterns and deviations from patterns. With the introduction of computer technology, a greater variety of patterns could be examined more quickly and without the introduction of human error. As applied to biomedical processes, a picture of the invisible world of the molecule became possible. Expanded modeling of these processes made biology more approachable and enabled research both on a broader descriptive level and on highly focused questions. For instance, some computational molecular biologists (e.g., Laszczynski 1999) moved to numerical simulation as a complement to traditional theoretical and experimental approaches, probing in silico to test theories that cannot otherwise be examined, such as phenomena at the atomic level. Pursuing this path leads to quantum mechanics and the various simplifications employed (e.g., Born-Oppenheimer), which lie outside this review. It does suggest, however, that computational molecular biology focuses on computerized and mathematical answers to biological questions.
Genomics, it may be argued, is the precursor to bioinformatics. Genomics “is operationally defined as investigations into the structure and function of very large numbers of genes undertaken in a simultaneous fashion” (Univ. of California-Davis, 2003). The primary work effort is comparative genomics, although functional genomics is the “post-genome era” focus. Functional genomics infers the function of gene expression, typically based on eukaryotic homologues or other model organisms, and is not usually tested in vivo. Functional genomics also includes mutagenesis, the production of changes in DNA sequences that affect gene products; genotypes, the specific changes in DNA sequences in a mutant; and phenotypes, the biological consequences of a mutagen’s presence. Functional genomic testing of phenotypes relies heavily upon technology, such as analytical chemistry, imagery, robotics and process automation, for analysis, and generates tremendous amounts of data. The boundaries between genomics and bioinformatics are porous, the literature suggesting that genomics, like computational molecular biology, focuses on physical biological processes. Bioinformatics emphasizes the storage and retrieval of biological data and the research literature: organizing very large heterogeneous structures and determining algorithms for clustering, retrieving and displaying subsets in meaningful ways, relying heavily upon information visualization and statistics, and integrating appropriate supporting bibliographies. For example, genomics may emphasize the microbiology and genetics of a specific organism, such as the E. coli genome or the fruit fly Drosophila melanogaster; bioinformatics, the manipulation of the generated data, although some work, such as DNA microarrays, may belong to both (e.g., Tessier, Benoît, Rigby, Hogues, van het Hoog, Thomas & Brousseau 2000).
Before settling on a definition of bioinformatics, it is useful to get a feeling for the bioinformatics literature and for how its concepts of information are closely related to those of information science. Nishikawa (2002) proposes an “Island Model” of biology: how, given a set of inputs, proteins will cluster based on similarity (or homology) in their amino acid sequences. His description of clustering uses identifiable fields of molecular processes (such as amino acid sequences) that map directly to concepts in the information retrieval and clustering literature. The description of the behavior of polypeptides under physiological conditions parallels the behavior of a query under different user cognitive conditions. Bayat (2002), in an article with which information scientists will sympathize, emphasizes the interdisciplinary nature of the field: biology, medicine, computer science, and maths/physics. “The main tools of a bioinformatician are computer software programs and the internet. A fundamental activity is sequence analysis of DNA and proteins using various programs and databases available on the world wide web” (1019). Finally, Luscombe, Greenbaum & Gerstein (2001, 347) propose a reasoned definition which they submitted to the Oxford English Dictionary and which is adopted by this review, with a small addition: “bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying ‘informatics techniques’ (derived from disciplines such as applied maths, computer [and information] science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications” [emphasis in original].
As these definitions demonstrate, the emphasis in bioinformatics is on the data sources and their efficient manipulation, in order to improve understanding and aid discovery. The challenge for computer and information science is to help biologists “organize data in a way that allows researchers to access existing information and to submit new entries as they are produced …, to develop tools and resources that aid in the analysis of data …, and to analyse the data and interpret the results in a biologically meaningful manner” (Luscombe, Greenbaum & Gerstein, 2001, 385). While the rest of this review focuses on data sources, it follows Altman and Koza’s (1996) suggestion to provide “just enough biology” to situate the data. Refer to Westhead, Paris & Twyman (2002) and Dwyer (2003) for excellent introductions to the field of bioinformatics and to Winter, Hickey & Fletcher (2002) for a review of genetics.
The basics of biology and the data produced through research
This section can only scan the work of molecular biology, in order to give computer and information science a feeling for the complexity of molecular biology and the types of data generated by its research. The purpose is to explain why the data are modeled as they are and how the research literature is being integrated into the biologists’ toolkit, and to suggest opportunities for CIS to further the information and processing needs of the field. Here some themes of molecular biology are introduced and then the focus shifts to the resources.
Molecular biology generates tremendous amounts of data from DNA or protein sequences, macromolecular structures and the results of functional genomics, data that need to be made useful (scientifically sound and interpretable by subject specialists) in a computationally efficient manner. Early research efforts resulted in data that were stored in flat files and queried using tools like FASTA and PSI-BLAST (Altschul 1997) for comparing protein sequences.
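The flat-file era can be illustrated with the FASTA format itself, the plain-text convention that tools of this generation consume: a header line beginning with “>” followed by lines of sequence. A minimal parsing sketch in Python (the record names and sequences below are invented for illustration):

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence} pairs.

    A record begins with a '>' header line; the following lines hold
    the sequence until the next header or the end of the file.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines into one string per record.
    return {h: "".join(parts) for h, parts in records.items()}

example = """>seq1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>seq2 sample DNA
ATGCGTACGT
"""
print(parse_fasta(example))
```

Real tools, of course, add indexing on top of such files so that a query sequence can be compared against millions of records without a linear scan.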
“DNA”, or deoxyribonucleic acid, is composed of four chemical bases: purines (adenine, abbreviated “A”, and guanine, “G”) and pyrimidines (cytosine, “C”, and thymine, “T”). “Sequencing” is determining the exact order of these building blocks in the DNA. Each base differs from the others in its combination of oxygen, carbon, nitrogen and hydrogen. Every base is attached to a “deoxyribose”, or sugar molecule, and to a phosphate molecule, to create a nucleic acid, or nucleotide. The nucleotides are linked in a certain order, or sequence, through the phosphate groups. The precise order and linking of bases determines what a gene produces. For humans, this means the “exact order of 3 billion blocks (called bases and abbreviated A, T, C, and G) making up the DNA of the 24 different human chromosomes….” The most famous sequencing project is the Human Genome Project, whose goal is to “reveal the estimated 100,000 human genes within our DNA as well as the regions controlling them.” In addition, one of the purposes of the Human Genome Project is to identify the small regions of DNA that vary between individuals. These differences may underlie disease susceptibility and drug responsiveness, particularly the most common variations, called SNPs (single nucleotide polymorphisms) (HGPI, 2002). The sequencing process includes:
First, chromosomes (of up to 250 million bases) are divided into smaller pieces, or “subcloned.” A template is created from each shorter piece to generate fragments, each differing by only a single base. That single base is used as an identifier during template preparation and the sequencing reaction. Using fluorescent dyes, the fragments can be identified by color when they are separated by a process called “gel electrophoresis.” The base at the end of each fragment is then identified (“base-calling”) to help recreate the original sequence of A, T, C, and G for each subcloned piece. A four-color histogram (a “chromatogram”) is created to show the presence and location of each base. Finally, the short sequences, in blocks of about 500 bases (the “read length”), are assembled by computer into long, continuous stretches that are analyzed for errors, gene-coding regions, and other distinctions.
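The final step above, assembling short reads into long continuous stretches, can be sketched with a toy greedy algorithm that repeatedly merges the pair of reads with the longest suffix–prefix overlap. Real assemblers are far more sophisticated (they must cope with sequencing errors, repeats, and millions of reads); the reads below are invented and error-free:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`,
    requiring at least `min_len` matching bases (0 if none)."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def assemble(reads, min_len=3):
    """Greedy assembly: merge the best-overlapping pair until
    no pair overlaps by at least `min_len` bases."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(assemble(["ATTAGACC", "GACCTGCC", "CTGCCGGA"]))  # ['ATTAGACCTGCCGGA']
```

The greedy strategy illustrates why read length matters: the shorter the reads relative to repeated regions, the more ambiguous the overlaps become.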
The results of sequencing can then be published in the major public sequence databases, such as GenBank at the National Center for Biotechnology Information (NCBI), which holds over 12 billion bases in 11.5 million entries (Benson 2000). NCBI is part of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), and hosts molecular sequence databases such as GenBank as well as literature databases such as PubMed and Online Mendelian Inheritance in Man (OMIM), detailed below. The data generated by DNA sequencing are stored in very large, multidimensional databases, such as GCG’s Wisconsin Package, alongside literature intended to teach bioinformatics and keep the scientist up to date, such as Medline, SCI, and journals, especially Science, Journal of Biological Chemistry, Cell, Molecular Medicine Today, the New England Journal of Medicine and special-topic serials such as Bioinformatics and Briefings in Bioinformatics.
Beyond DNA sequencing
Protein sequences, strings built from 20 amino acids (of which some 400,000 sequences are known), generate more complex forms of macromolecular structure data. The data gathered by x-ray crystallography and NMR take the form of 3D (x-y-z) coordinates, stored in the Protein Data Bank (PDB) (Berman, 2000; Bernstein 1977). The most publicly recognized project is that of the human genome, built from almost 3 billion bases (Lander 2001; Venter 2001). Recently much work focuses on yeast and human gene expression, that is, measuring the mRNA produced in cells under different conditions (Eisen & Brown 1999; Cheung 1999; Duggan 1999; Lipshutz 1999; Velculescu 1999). Others concentrate on larger systems, such as metabolic pathways, regulatory networks, and protein-protein interaction data from 2-hybrid experiments. Successfully describing, retrieving, and presenting these data is one part of bioinformatics; another is the manipulation of these data in biologically meaningful ways.
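The x-y-z coordinate data deposited in the PDB support basic geometric calculations such as inter-atomic distances, the building block of structure comparison. A toy sketch (the two atom records are hypothetical, and real PDB files use a fixed-column text layout rather than the simplified structures shown here):

```python
import math

# Two hypothetical alpha-carbon records (simplified; a real PDB
# ATOM record is a fixed-column line, not a Python dict).
atoms = [
    {"name": "CA", "xyz": (38.0, 12.5, 7.25)},
    {"name": "CA", "xyz": (41.0, 16.5, 7.25)},
]

def distance(a, b):
    """Euclidean distance between two (x, y, z) coordinates, in angstroms."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

d = distance(atoms[0]["xyz"], atoms[1]["xyz"])
print(round(d, 2))  # 5.0
```

Pairwise distances like this underlie contact maps, structural superposition, and the similar-fold comparisons discussed below.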
In the latter case, biologists look for functional clustering (e.g., based on metabolic pathways or sequence segments) or for relationships among proteins, such as homologous (structurally and sequentially similar) or analogous (similar-fold) proteins. In the former, the volume and heterogeneity of the data require algorithms for basic analysis, such as protein sequence analysis (Miller, Gurd, & Brass 1999) and understanding intron and exon promoter regions (Zhang 1999; Boguski 1999). Beyond the basic level of file indexing and searching, the structured data must be manipulated as a data mining project to predict biological information (Orengo & Taylor 1996; Orengo 1999). Each of these data-use functions faces the question of data integration (Gerstein 2000). Molecular biology’s information needs can, then, be classified into three groups: raw data; data integrated with bibliographic systems and other data reflecting biological processes; and data used predictively (Wilson, Kreychman, & Gerstein 2000) to solve specific questions, such as predicting secondary and tertiary protein structures (Russell & Sternberg 1995) and calculating the energetics of macromolecular structures. The problem facing scientists in both fields is how to address data redundancy and multiplicity in these very large databases and how to integrate multiple sources of data where differing nomenclatures and file formats are common.
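The idea of grouping sequences by similarity can be sketched with a toy percent-identity measure and greedy single-linkage clustering. Production tools score true alignments (with gaps) rather than position-by-position identity, but the principle is the same; the sequences below are invented:

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def cluster(seqs, threshold=0.8):
    """Greedy single-linkage grouping: a sequence joins the first
    cluster containing any member above the identity threshold,
    otherwise it starts a cluster of its own."""
    clusters = []
    for s in seqs:
        for c in clusters:
            if any(identity(s, m) >= threshold for m in c):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

seqs = ["ATGCATGC", "ATGCATGG", "TTTTCCCC", "TTTTCCCG"]
print(cluster(seqs))  # two clusters of two near-identical sequences each
```

Even this crude measure separates the two families; the real difficulty, as the surrounding discussion notes, lies in scaling such comparisons to millions of heterogeneous, redundant records.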
Very Large Databases
There is an amazing variety of software packages available to molecular biologists to support research by searching very large databases. Most of these systems manipulate locally collected data or data extracted from public or private databases. For instance, some databases (e.g., PDBsum, NDB, CATH, SCOP and Swiss-Prot) provide external links to other databases. The need for data integration led to the Sequence Retrieval System (SRS) (Zdobnov, Lopez, Apweiler & Etzold, 2002; Etzold, Ulyanov & Argos 1996), described below, as were several data retrieval systems created for purely biological data (e.g., Entrez (Schuler 1996)) and for integrating the research literature (e.g., PubMed (Wade 2000)). Given the large number of these information storage and retrieval systems, only a representative sample is offered.
Some of the packages support particular functions of gene sequencing. For example, there are at least 150 free software applications (Gilbert, 2000, pp. 157-184) addressing all aspects of sequencing. One popular package is the Genetics Computer Group’s “Wisconsin Package” (http://www.gcg.com/). An integrated suite of over 130 program tools for manipulating, analyzing, and comparing nucleotide and protein sequences, the package also offers a GUI, SeqLab, for interacting with color-coded graphic sequences. The package includes sequence comparison statistics (alignment of two sequences to indicate gaps, best fit, and x/y plotting of sequence similarity); database searching tools (LookUp and StringSearch for the biological literature; BLAST, NetBLAST, FASTA and others for sequence strings; PAUPSearch, GrowTree, and Diverge for phylogenetic relations); fragment assembly, gene finding and pattern recognition tools; protein analysis (e.g., PeptideMap); and ChopUp, Reformat and others for manipulating text files. Similarly, DNASTAR’s Lasergene Sequence Analysis Software is a suite of eight applications to trim and assemble sequence data; discover and annotate gene patterns; predict protein secondary structure; create Boolean queries from sequence similarity, consensus sequences and text terms; support sequencing, hybridization, and transcription; create maps; and import data from other sources (Burland, 2001, p. 71). It isn’t possible to review all applications and their capabilities, but the reader can see that software applications in molecular biology have moved from stand-alone, mini- and mainframe applications to desktop, Internet- and LAN-based ones, with an emphasis on the graphic display of data, pattern detection and sequence prediction, and integration of the literature.
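The “x/y plotting of sequence similarity” offered by such packages is the classic dot plot: a matrix marking every position where two sequences share a residue, so that similar regions stand out as diagonal runs. A minimal text-mode sketch (real tools add windowing and thresholds to suppress noise):

```python
def dot_plot(a, b):
    """Text dot plot: row i, column j holds '*' when a[i] == b[j],
    '.' otherwise. Diagonal runs of '*' mark similar regions."""
    return ["".join("*" if x == y else "." for y in b) for x in a]

for row in dot_plot("GATTACA", "GATTC"):
    print(row)
```

The first four rows form a diagonal of matches, reflecting the shared GATT prefix of the two sequences.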
Of Bioinformatics – draft #1 (book chapter for the American Society for Information Science & Technology, vol. 40, 2005). Gerald Benoît.