The Nucleotide database contains sequence data from GenBank, EMBL, and DDBJ, the members of the tripartite international collaboration of sequence databases. EMBL is the European Molecular Biology Laboratory at Hinxton Hall, UK; DDBJ is the DNA Database of Japan in Mishima, Japan. Sequences are also incorporated from the Genome Sequence Data Base (GSDB), Santa Fe, NM. Patent sequences are incorporated through arrangements with the U.S. Patent and Trademark Office (USPTO) and, via the collaborating international databases, from other international patent offices.
The Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL and DDBJ as well as protein sequences submitted to Protein Information Resource (PIR), SWISS-PROT, Protein Research Foundation (PRF), and Protein Data Bank (PDB) (sequences from solved structures).
The Genome database provides views for a variety of genomes, complete chromosomes, contiged sequence maps, and integrated genetic and physical maps.
The Structure database or Molecular Modeling DataBase (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB). The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy.
Use Cn3D, the NCBI 3D structure viewer, for easy interactive visualization of molecular structures from Entrez.
The PopSet database contains aligned sequences submitted as a set resulting from a population, a phylogenetic, or mutation study. These alignments describe such events as evolution and population variation. The PopSet database contains both nucleotide and protein sequence data.
The OMIM (Online Mendelian Inheritance in Man) database is a catalog of human genes and genetic disorders. See OMIM and its Help document, OMIM Help.
The Taxonomy database contains the names of all organisms that are represented in the NCBI genetic database by at least one nucleotide or protein sequence. For the context of the Taxonomy database see Taxonomy and Taxonomy FAQ.
The Bookshelf has a collection of biomedical books that are linked in Entrez and can also be searched separately at Bookshelf. See the Books FAQ.
The ProbeSet database is an Entrez view of NCBI's GEO (Gene Expression Omnibus). GEO is a gene expression and hybridization array repository. See the Search Tips and FAQ.
3D Domains contains protein domains from the NCBI Conserved Domain Database. See CDD.
UniSTS is a unified, non-redundant view of sequence tagged sites (STSs). UniSTS integrates marker and mapping data from a variety of public resources. Data sources include dbSTS, RHdb, GDB, various human maps (Genethon genetic map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map, Stanford RH map, NHGRI chr 7 physical map, WashU chrX physical map), and various mouse maps (Whitehead RH map, Whitehead YAC map, Jackson Laboratory's MGD map). See UniSTS.
The SNP database (dbSNP) is a central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms. For the search page, available search fields, and search examples, see Entrez SNP.
UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and its map location. For the search page and available query tips and FAQs see UniGene.
PubMed Central (PMC) is a digital archive that provides access to the full text of articles from a set of life science journals. This set includes only those journals that voluntarily participate in the PubMed Central archive. For the search page, available FAQs, and the current list of participating journals see PubMed Central.
Below is the result of an Entrez search for a human gene (http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=6849043&form=6&db=n&Dopt=g). The reader will immediately recognize some of the “access points” in the NCBI Sequence Viewer (Figure 1) and will notice the sequence of bases. Hyperlinks in the record (in this case organism, MedLine, PubMed, and exon) link the biological data to the supporting literature. Notice the traditional access points (authors, title, journal), the links to other literature (the Medline unique identifier (muid) and PubMed identifier (pmid) entries), and, among others, the bases themselves. This record also includes sequence identifiers (Seq-id) because NCBI integrates sequence data from multiple sources.
LOCUS HSDDT1 166 bp DNA linear PRI 01-FEB-2000
DEFINITION Homo sapiens D-dopachrome tautomerase (DDT) gene, exon 1.
VERSION AF012432.1 GI:2352911
SEGMENT 1 of 3
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 166)
AUTHORS Esumi,N., Budarf,M., Ciccarelli,L., Sellinger,B., Kozak,C.A. and
TITLE Conserved gene structure and genomic linkage for D-dopachrome
tautomerase (DDT) and MIF
JOURNAL Mamm. Genome 9 (9), 753-757 (1998)
REFERENCE 2 (bases 1 to 166)
AUTHORS Esumi,N. and Wistow,G.
TITLE Direct Submission
JOURNAL Submitted (07-JUL-1997) Molecular Structure and Function, NEI,
Building 6, Rm. 331, NIH, Bethesda, MD 20892, USA
BASE COUNT 24 a 61 c 50 g 31 t
1 cttcttccgc cagagctgtt tccgttcctc tgcccgccat gccgttcctg gagctggaca
61 cgaatttgcc cgccaaccga gtgcccgcgg ggctggagaa acgactctgc gccgccgctg
121 cctccatcct gggcaaacct gcggacgtaa gcgtgggccg ggcagc
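As a rough illustration of how a full-text record like the one above can be parsed, the following Python sketch splits the record into labelled fields and checks the declared length and BASE COUNT against the sequence itself. It assumes the collapsed whitespace shown in this text rather than the column layout of an original GenBank flat file, and the function name `parse_record` is hypothetical, not part of any NCBI tool.

```python
import re
from collections import Counter

# Excerpt of the record shown above, with whitespace as it appears in this text.
RECORD = """\
LOCUS HSDDT1 166 bp DNA linear PRI 01-FEB-2000
DEFINITION Homo sapiens D-dopachrome tautomerase (DDT) gene, exon 1.
VERSION AF012432.1 GI:2352911
BASE COUNT 24 a 61 c 50 g 31 t
1 cttcttccgc cagagctgtt tccgttcctc tgcccgccat gccgttcctg gagctggaca
61 cgaatttgcc cgccaaccga gtgcccgcgg ggctggagaa acgactctgc gccgccgctg
121 cctccatcct gggcaaacct gcggacgtaa gcgtgggccg ggcagc
"""


def parse_record(text):
    """Split a flat-file record into header fields and the base sequence.

    Hypothetical helper for illustration only; it handles just the
    line types present in RECORD.
    """
    fields = {}
    sequence = []
    for line in text.splitlines():
        # Sequence lines: a position number followed by groups of bases.
        match = re.match(r"^\s*\d+\s+([acgtn ]+)$", line)
        if match:
            sequence.append(match.group(1).replace(" ", ""))
            continue
        if line.startswith("BASE COUNT"):
            # "24 a 61 c 50 g 31 t" becomes {"a": 24, "c": 61, ...}
            pairs = re.findall(r"(\d+)\s+([acgt])", line)
            fields["BASE COUNT"] = {base: int(n) for n, base in pairs}
            continue
        # Every other line: first word is the field name, the rest its value.
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    return fields, "".join(sequence)


fields, seq = parse_record(RECORD)
counts = Counter(seq)
declared_length = int(fields["LOCUS"].split()[1])

print(fields["DEFINITION"])
print("length matches LOCUS:", len(seq) == declared_length)
print("composition matches BASE COUNT:", dict(counts) == fields["BASE COUNT"])
```

The same pattern of recognizing a line type, extracting its value, and cross-checking redundant fields is what makes these records amenable to ordinary full-text retrieval techniques.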
Challenges to Bioinformatics
The challenges to molecular biology and biology in general fall outside the scope of this chapter. Nevertheless, the pursuit of answers to the scientific challenge has led biologists into the domain of computer and information science. For instance, Beginning Perl for Bioinformatics (Tisdall, 2001) and Developing Bioinformatics Computer Skills (Gibas & Jambeck, 2001) were recently published as manuals to instruct biologists in how to parse full-text records. In addition, several free JavaBeans tools are available for manipulating BLAST and GSDB data (e.g., BlastView and AnnotView, http://www.cbil.upenn.edu/bioWidgets/). Anyone trained in information storage and retrieval (IS&R) will immediately see the similarities between the methods employed in full-text retrieval and the potential application of parsing, matching, clustering, similarity measurement, and display from IS&R to bioinformatics records. That biologists are learning these computing skills themselves suggests opportunities for information and computer science to assist.
Integration with clinical informatics
Altman (2000) describes the Stanford Medical Informatics program as the next step in a “post-genome age, [where] the interplay between basic biological data (sequences, structures, pathways, and genetic networks) and clinical information systems is, clearly, critical” (p. 442). The primary concerns for the future (http://bits.stanford.edu/) emphasize applying robust computing to issues of “information acquisition, storage, retrieval, and management” (p. 442). By outlining six “affinity groups”, Altman suggests greater integration (and hence opportunity for computer and information science) in
Furthermore, Altman (1998, p. 53) suggests that “DNA sequence information and sequence annotations will appear in the medical chart with increasing frequency”, which raises both the ethical issue of making such data publicly accessible and the computing issue of what data are stored and how to integrate an expanded data model. Certainly the advances in controlled vocabularies in clinical informatics can be applied to representing bioinformatic data.
The heterogeneity of data outside the various sequence databases and resources described above calls for greater cross-discipline mapping (similar to the UMLS) or semantically independent modeling schemas, such as XML.
There are already several XML schemas for biology, such as the Bioinformatic Sequence Markup Language (BSML) and Biopolymer Markup Language (BIOML) (http://bioperl.org/Projects/XML/) and bioXML (http://stateslab.bioinformatics.med.umich.edu/). Others (accessible via http://www.xml.com/pub/rg/Bioinformatics) include Neuron Markup Language (NeuroML), Anatomical Markup Language (AnatML), Apple/Genentech BLAST, Architecture for Genomic Annotation, Visualization and Exchange (AGAVE), Biomolecular Interaction Network Database (BIND), Fasta2XML (sequence data file format), Genome Annotation Markup Elements (GAME DTD), GenBank to XML conversion (gb2xml), Genome Annotation Markup Elements (GAME), InterPro, Integrated Taxonomic Information System (ITIS), Microarray Markup Language (MAML), Molecular Dynamics Markup Language (MODL), Multiple Sequence Alignments (MSAML), phylogenetic tree charts (phyloML), Ribonucleic Acid Markup Language (RiboML), Systems Biology Markup Language (SBML), Taxonomic Markup Language (TML), Gene Expression Markup Language (GEML), and even an XML-based Ontology Exchange Language (XOL). Visualization of data is critical in bioinformatics, and several products are available that combine XML records and display techniques (http://industry.ebi.ac.uk/~alan/VisSupp/VisAware/index.html).
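To make the idea of an XML representation of a sequence record concrete, here is a minimal Python sketch that serializes a few fields and reads them back. The element and attribute names (`sequence_record`, `definition`, `organism`, `sequence`, `locus`, `length`) are invented for illustration and do not follow BSML, BIOML, or any of the schemas listed above; the field values come from the HSDDT1 record shown earlier, with only the first ten bases included.

```python
import xml.etree.ElementTree as ET


def record_to_xml(locus, definition, organism, sequence):
    """Serialize a few record fields as XML (hypothetical element names)."""
    root = ET.Element("sequence_record", attrib={"locus": locus})
    ET.SubElement(root, "definition").text = definition
    ET.SubElement(root, "organism").text = organism
    seq = ET.SubElement(root, "sequence", attrib={"length": str(len(sequence))})
    seq.text = sequence
    return ET.tostring(root, encoding="unicode")


xml_text = record_to_xml(
    locus="HSDDT1",
    definition="Homo sapiens D-dopachrome tautomerase (DDT) gene, exon 1.",
    organism="Homo sapiens",
    sequence="cttcttccgc",  # first ten bases only, to keep the example short
)

# Any XML-aware consumer can recover the fields without knowing the
# flat-file layout they originally came from.
parsed = ET.fromstring(xml_text)
print(parsed.get("locus"), "|", parsed.find("organism").text)
print(parsed.find("sequence").get("length"), parsed.find("sequence").text)
```

Because each value is labelled explicitly, a receiving system does not need to know column positions or keyword order, which is the property that makes a markup-based exchange format attractive for heterogeneous data.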
Data mining and visualization
Some activities in molecular biology focus on predicting sequences where there are missing values or on establishing patterns that would otherwise be impossible to detect (Benoît, 2000). Data mining, defined as an “exploration and analysis by automatic and semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules”, when focused on biological processes turns bioinformatics into a data mining activity. See Benoît (2000) for a detailed overview of data mining or Kantardzic (2003) for a clear discussion of the statistical foundations. EBI (European Bioinformatics Institute, 1999) recently hosted a conference exploring the intersection of bioinformatics, as the field matures, and data mining: “During the last few years bioinformatics has been overwhelmed with increasing floods of data, both in terms of volume and in terms of new databases and new types of data. We are now entering the post-genomic age, where, in addition to complete genome sequences, we are learning about gene expression patterns and protein interactions on genomic scales. This poses new challenges. Old ways of dealing with data item by item are no longer sustainable and it is necessary to create new opportunities for discovering biological knowledge ‘in silico’ by data mining.”
The porous borders of clinical information systems, medical informatics, and bioinformatics, especially in light of Altman’s prediction of tighter integration of these fields, imply great opportunities for data mining. Medical research and practice have generated tremendous amounts of data, beyond that created by pharmaceutical and biomedical research. For instance, electronic patient records and integrated medical-information systems provide a great warehouse of clinical data online. By mining these data, bio- and other informaticians can detect trends and surprising events, support informed decision making by clinicians (e.g., evidence-based medicine), and even create “intelligent” systems that respond to the data (evidence-based adaptive medicine) to improve health care. Certainly data mining has been applied successfully in biomedical research to develop new pharmaceuticals and disease-specific treatments, such as advances in the treatment of cancer and degenerative disorders.
In addition to mining databases of biological processes, research is also turning to mining the literature of molecular biology to expose and visualize unanticipated relationships among the records. In “biobibliometrics”, Stapley and Benoît (2000) describe a visualization technique and retrieval system based on co-occurrences of gene names in Medline abstracts.
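The core idea of counting co-occurrences can be sketched in a few lines of Python. This is only a simple pair-counting illustration, not the method of Stapley and Benoît (2000). The abstracts are invented toy strings rather than real Medline text, `GENEX` is a placeholder name, and the function `cooccurrences` is hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Invented toy strings standing in for abstracts; NOT real Medline records.
TOY_ABSTRACTS = [
    "DDT and MIF share a conserved gene structure.",
    "MIF expression was measured alongside GENEX.",
    "DDT maps close to MIF.",
]

# DDT and MIF appear in the record shown earlier; GENEX is a placeholder.
GENE_NAMES = ["DDT", "MIF", "GENEX"]


def cooccurrences(abstracts, gene_names):
    """Count how many abstracts mention each unordered pair of gene names."""
    counts = Counter()
    for text in abstracts:
        # Exact, case-sensitive token matching; real gene-name recognition
        # would also need to handle synonyms and ambiguous symbols.
        tokens = set(re.findall(r"[A-Za-z0-9_]+", text))
        present = sorted(name for name in gene_names if name in tokens)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


pair_counts = cooccurrences(TOY_ABSTRACTS, GENE_NAMES)
for (gene_a, gene_b), n in pair_counts.most_common():
    print(gene_a, gene_b, n)
```

Pair counts of this kind can serve as edge weights in a graph of genes, which is one way such literature relationships can then be visualized.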
Conclusions and Summary
Altman, R. B. (1998). Bioinformatics in support of molecular medicine. In C. G. Chute, (Ed.), AMIA Annual Symposium, pp. 53-61.
Altman, R. B. (2000, Sept/Oct.). The Interactions between clinical informatics and bioinformatics: a case study. Journal of the American Medical Informatics Association, 7 (5), 439-443.
Andersson, Sten; Larsson, Kåre; Larsson, Marcus; & Jacob, Michael. (1999). Biomathematics: mathematics of biostructures and biodynamics. Amsterdam: Elsevier.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 25, 25-29.
Baxevanis, Andreas D., & Ouellette, B. F. Francis. (2001). Bioinformatics: a practical guide to the analysis of genes and proteins. (2nd ed.). Methods of biochemical analysis, vol. 43. New York: Wiley-Interscience. [QD271 M45 v. 43]
Benoît, G. (2000). Data mining. In Blaise Cronin (Ed.), Annual Review of Information Science and Technology, vol. 39. Medford: Information Today.
Briefings in Bioinformatics. [QH441.2 B75]
Burland, Timothy G. (2001). DNASTAR’s Lasergene Sequence Analysis Software. In Stephen Misener & Stephen A. Krawetz, (Eds.). Bioinformatics: methods and protocols. Totowa, NJ: Humana, pp. 71-91.
European Bioinformatics Institute. (1999). Data Mining for Bioinformatics – towards in silico biology. Available: [on-line] http://industry.ebi.ac.uk/datamining99/
Gibas, Cynthia, & Jambeck, Per. (2001). Developing Bioinformatics Computer Skills. Sebastopol: O’Reilly.
Gilbert, Don. (2001). Free Software in Molecular Biology for Macintosh and MS Windows Computers. In Stephen Misener & Stephen A. Krawetz, (Eds.). Bioinformatics: methods and protocols. Totowa, NJ: Humana, pp. 149-184.
Goodfellow, Julia M. (Ed.). (1995). Computer Modelling in Molecular Biology. Weinheim: VCH. [QH506 C654 1995]
Human Genome Project Information. http://www.ornl.gov/hgmis/faq/seqfacts.html
Jenders, Robert; Sideli, Robert; & Hripcsak, George. Introduction to Medical Informatics. Available: [On-line] http://www.cpmc.columbia.edu/edu/textbook
Kantardzic, M. (2003). Data Mining: concepts, models, methods, and algorithms. Piscataway, NJ: IEEE Press/Wiley-Interscience.
Koski, Timo. (2001). Hidden Markov Models for Bioinformatics. Computational biology series, vol. 2. Dordrecht: Kluwer Academic.
Lengauer, Thomas (Ed.). (2002). Bioinformatics – from genomes to drugs. Weinheim: Wiley-VCH. [QH506 B56 2002]
Lesk, Arthur M. (Ed.). (1988). Computational Molecular Biology: sources and methods for sequence analysis. Oxford: Oxford Univ. Press.
Leszczynski, Jerzy. (Ed.). (1999). Computational Molecular Biology. Theoretical and computational chemistry, vol. 8. Amsterdam: Elsevier. [QH506 C642 1999]
Misener, Stephen & Krawetz, Stephen A., (Eds.). (2000). Bioinformatics: methods and protocols. Methods in molecular biology, vol. 132. Totowa, NJ: Humana. [QH506 M45 v.132]
Musen, Mark A. (1999, Jan/Feb). Stanford Medical Informatics: uncommon research, common goals. MD Computing, pp. 47-49.
Rodriguez-Tomé, Patricia. (2001). Resources at EBI. In Stephen Misener & Stephen A. Krawetz, (Eds.). Bioinformatics: methods and protocols. Totowa, NJ: Humana, pp. 313-335.
Sensen, Christoph W. (Ed.). (2002). Essentials of Genomics and Bioinformatics. Weinheim: Wiley-VCH. [QH447 G467 2002]
Smith, Douglas W. (Ed.). (1993). Biocomputing: informatics and genome projects. San Diego: Academic. [QH447 B45 1993]
Stapley, B. J., & Benoît, G. (2000). Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pacific Symposium on Biocomputing 5, 538-549. Available [online] http://www.smi.stanford.edu/projects/helix/psb00/stapley.pdf
Tessier, D. C., Benoît, F., Rigby, T., Hogues, H., van het Hoog, M., Thomas, D. T., & Brousseau, R. (2000). A DNA Microarray fabrication strategy for research laboratories. In C. Sensen (Ed.), Essentials of Genomics and Bioinformatics.
Tisdall, James. (2001). Beginning Perl for Bioinformatics. Sebastopol: O’Reilly.
Zdobnov, E. M., Lopez, R., Apweiler, R., & Etzold, T. (2002). Using the molecular biology data. In C. Sensen (Ed.), Essentials of Genomics and Bioinformatics, pp. 265-284.