BOBBY E. MCKNIGHT
(Under the Direction of Ismailcem Budak Arpinar)
The association of experimental data with domain knowledge expressed in ontologies facilitates information aggregation, meaningful querying and knowledge discovery, aiding the analysis of the extensive amount of interconnected data available for genome projects. TcruziKB is an ontology-based problem solving system that describes and provides access to the data available in a traditional genome database for the parasite Trypanosoma cruzi. The problem solving environment offers many advanced search and information presentation features that enable complex queries that would be difficult, if not impossible, to execute without semantic enhancements. The problem solving features not only improve the quality of the information retrieved but also reduce the strain on the user by improving usability over the standard system.
INDEX WORDS: Semantic Web, SPARQL, Query, Ontologies, Bioinformatics, Genomics
FROM A GENOME DATABASE TO A SEMANTIC KNOWLEDGE BASE
BOBBY E. MCKNIGHT
B.S., The University of Georgia, 2006
A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment of the Requirements for the Degree
MASTER OF COMPUTER SCIENCE
Bobby E. McKnight
All Rights Reserved
Major Professor: Ismailcem Budak Arpinar
Committee: John A. Miller
Electronic Version Approved:
Dean of the Graduate School
The University of Georgia
Thanks to Maciej Janik and Matthew Eavenson (Cuadro project), Sena Arpinar and Ying Xu (collaborators at IOB), members of the J.C.K. Laboratory and the TcruziDB team (for facilitating access to data and evaluation subjects and providing valuable advice).
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
2 DATA INVENTORY AND KNOWLEDGE ENGINEERING
3 VISUAL QUERY BUILDER
3.1 Query Structure
3.2 Enhancing Queries and Search Results
3.3 Natural Language Query Building
4 MULTI-PERSPECTIVE DATA EXPLORATION
4.1 Tabular Explorer
4.2 Statistical Explorer
4.3 Graph Explorer
4.4 Literature Explorer
5 EVALUATION
6 RELATED WORK
6.1 Keyword Search
6.2 Formal Language
6.3 Query Building
6.4 Natural Language Query
6.5 Hybrid Methods
7 CONCLUSION
A SCHEMAS AND DATASETS
B TCRUZIKB – WEB APPLICATION
C SUS EVALUATION RESULTS
D EMPIRICAL EVALUATION RESULTS
LIST OF TABLES
Table 1: SUS scores broken down by area of expertise
Table 2: A Breakdown of the Features Provided by Semantic Search Engines.
Table 3: A Breakdown of the Features Offered by Query Building Systems.
Table 4: A Comparison of Natural Language Query Systems.
LIST OF FIGURES
Figure 1: Diagrammatic description of the ontology schema
Figure 2: SPARQL query created by the Visual Query Builder
Figure 3: Sample of the Interactive Natural Language Query Interface.
Figure 4: The parse tree generated from the Stanford Parser for the sentence “What is the life cycle stage of GeneX?”
Figure 5: Our interpretation of the parse tree from Figure 4.
Figure 6: The Statistical Explorer showing the percentage of expression results for the property “Life Cycle Stage”
Figure 7: The results in graphical format.
Figure 8: Expanded Graphical Explorer
Figure 9: Formula for Gain and Entropy
Figure 10: Formula for document scores
Figure 11: Sample statement from SUS
Figure 12: The iSPARQL Interface
Figure 13: The GINSENG Interface in Action
INTRODUCTION
The contemporary Bioinformatics researcher, when formulating a hypothesis or looking for evidence to validate one, commonly performs intensive querying of genome databases, i.e. uses a Web interface to pose questions about a collection of information on one organism or a set of organisms. However, current techniques invariably require high human involvement to manually browse through an extensive mass of data, observe evidence, analyze interconnections and draw conclusions. The size, diversity and complexity of data and tools make this a time-consuming and difficult task.
The scientific analysis of the parasite Trypanosoma cruzi (T. cruzi), the principal causative agent of human Chagas disease, is our driving biological application. Approximately 18 million people, predominantly in Latin America, are infected with the T. cruzi parasite. Research on T. cruzi is thus an important human disease related effort, which has reached a critical juncture with the quantities of experimental data being generated by labs around the world, in large part because of the publication of the T. cruzi genome in 2005. Although this research has the potential to improve human health significantly, the data being generated exist in independent heterogeneous databases with poor integration and accessibility. Our goal is to integrate these data and create an infrastructure to facilitate their analysis and mining.
In contrast with the use of downloaded raw data and custom processing scripts for information integration, the association of experimental data with domain knowledge expressed in ontologies facilitates richer integration, meaningful querying and knowledge discovery, aiding the analysis of the extensive amount of interconnected data available for genome projects. The use of ontologies in this work, unlike the common understanding of ontologies in the Bioinformatics field, goes beyond the reference to a standardized vocabulary and explores the representation of conceptual relationships between entities to guide the researcher in “connecting the dots” from hypotheses to evidence, in a Relationship Web.
As part of this project, we engineered an ontology to describe domain knowledge and the data available for the project TcruziDB, a genome database for the parasitic agent Trypanosoma cruzi. In comparison with traditional genome databases the use of semantic web technologies in this context offers advantages such as:
- Unlimited flexibility on the query perspective: TcruziDB.org offers 5 standpoints, where the user can search for Genes, ESTs, Contigs, Scaffolds and ORFs. Those queries reflect the possible uses of the system as predicted by the development team and/or the community of users involved in the development stages. In such a system, the user is limited to the available queries, and in the event of a request for new queries, human involvement is required to implement the necessary SQL statements and visualization interfaces. Through the use of the ontologies’ schemas, i.e. the definition of the possible types of data and interconnections available in the knowledge base, we offer a high-level querying system in which the user is guided by the system throughout the process of posing a question in an intuitive way, e.g. looking for: “Gene -> codes for -> Protein -> expressed in -> Epimastigote (Life Cycle Stage)”.
- Complex query handling: a key component of ontologies as envisioned by our project is the concept of a relationship. Through the use of ontologies a user will be able to ask questions not only about entities (Genes, ESTs, ORFs), but also about how those entities are related. For instance, someone might be interested in giving 2 accession numbers for genes from different organisms and retrieving all the known relationships between those genes. Such a query might return, for the sake of argument, that they both have expression results in a life cycle stage where the organism is resident in water. This type of query is viable through the use of ontologies to reveal semantic associations, and is very difficult otherwise (e.g. Gene -> has expression -> Protein Expression -> in life cycle -> Life Cycle Stage -> environment -> Water).
- Natural language querying: TcruziKB supports not only guided form-based query formulation but also a query mechanism familiar to everyone, natural language. Using this feature a user can ask questions in unrestricted English such as “Find genes that code for proteins that are in life cycle stages present in the human body”. While using the natural language query interface, the user receives help from the system in the form of keyword suggestions from the knowledge base to help them properly construct a query.
- Loosely-coupled web integration of distributed data sources: most genome databases integrate data from different sources at some level, usually by downloading, parsing and storing external data in a relational database. In our system we are able to integrate our data with external sources on the server side, but also provide loosely-coupled dynamic integration on the client side. Through the use of Ajax, Linked Data and SPARQL endpoints, our system is able to dynamically query multiple sources and “mash up” the results in one page for visualization.
In addition to providing data integration and query capabilities, TcruziKB aims at helping the user with the difficult task of making sense of the information at hand. We implemented multiple interfaces for results exploration so that the user can analyze query results from different perspectives:
- The tabular explorer lists the results in a spreadsheet format, with a row per item and a column per attribute, while cells contain values. This perspective provides prompt access to all attributes of a group of items, allowing for sorting and filtering of data in a well known and widely used interface style for biomedical researchers.
- The graph explorer, on the other hand, focuses on relationships, drawing each item and value as nodes, and the attributes as edges. This perspective brings connectivity to the foreground, allowing the researcher to unveil hidden relationships between data.
- The statistical explorer offers a higher level summarization of data at a first glance. It is often very important for the researcher to first understand the general characteristics of the dataset before more specific questions can be posed.
- The enhanced literature search suggests papers that might be interesting to help the researcher to understand the result set being displayed. We calculate keyword weight based on the ontology and submit a query to the NCBI e-Utils web services, before ranking and displaying the abstracts to the user.
We expect the above mentioned contributions to compose a valuable toolkit for data sharing and analysis on the Web that can be reused and extended for virtually any genome project, and even any domain of knowledge. In the following sections we describe the knowledge engineering and data acquisition for TcruziDB, followed by the query interface and the visualization perspectives. Both subjective and objective evaluation strategies are used to rate the usability of the system compared to the usability of the TcruziDB system without semantic enhancements. Final considerations and future work are presented in the final chapters.
DATA INVENTORY AND KNOWLEDGE ENGINEERING
In the field of Genomics, data comes from different sources and in heterogeneous representation formats. From simple char-delimited files (flat files) to complex relational database schemas, gigabytes of annotations are available for use. We engineer an ontology to represent the knowledge in this domain and to serve as an umbrella for integration of the multiple sources.
The ontology engineering process comprised both a top-down (deductive) and a bottom-up (inductive) stage. Since the TcruziDB database was already available with valuable information, we started the modeling process by observing examples of data and building the definitional component of the ontology (a.k.a. ontology schema) in an inductive process. Next, we consulted the literature for precise definitions of the identified classes, and further explored possible dimensions deductively in the light of the identified use cases. For every class definition throughout the modeling process we searched for existing ontologies in order to reuse or extend their contents. Ontology reuse is highly desirable, since it promotes both the efficiency of the modeling process itself and the interoperability level of the resulting system.
Figure 1: Diagrammatic description of the ontology schema
Through the ontology engineering process we identified a manageable subset of the domain to test the system and its underlying concepts. As depicted in Figure 1, our ontology schema is able to represent genes, as well as the organisms they belong to and the proteins that they encode. Proteins may present enzymatic function, which in turn may be part of a process represented in a biological pathway. Proteomic expression is also captured, including the information about the life cycle stage in which the protein was expressed, as well as quantitative measures of that expression. GO, SO, Taxonomy and EnzyO are reused in this project.
While the ontology schema encompasses the description of classes and properties, as well as mappings and extensions to pre-existing ontologies, the assertional component of the ontology (also called the knowledge base) associates data with definitions from the domain model. We obtain data from several sources, including Pfam flat files, Interpro XML and relational data stored in the Genome Unified Schema (GUS) for TcruziDB. We automatically mapped the GUS Schema to an ontology using the D2RQ mapping framework. Some of the ~400 tables mapped were manually verified for enhancement and reuse of existing ontologies. The subset of the TcruziDB dataset used in this project includes: 19,613 automated gene predictions (protein coding); 139,147 protein expression results from metacyclic trypomastigotes (CL strain) and amastigotes, trypomastigotes and epimastigotes (Brazil strain) of T. cruzi. The dataset also features links to the sequence ontology, gene ontology and enzyme commission numbers. Some external data was also downloaded and imported from flat files to the ontology, containing information such as: 31,630 protein domain (Pfam) annotations; 8,065 ortholog groups predicted by the OrthoMCL algorithm.
As part of the knowledge base creation process, every biological sequence (nucleotides or amino acids), as well as every annotation associated with those sequences, is identified in our system by a URI. We chose to use the original URL that gives access to the item in its original web interface as its identifier. We also added this URL to the rdfs:seeAlso annotation property, so that we are able to take the user to the original web interface with a click within our interface. If the original URL changes over time, the URI will still be a valid identifier, and we can update the rdfs:seeAlso property to reflect the most up-to-date URL for the item. However, if we wish to import more data into the knowledge base, the URL change could cause inconsistencies if not treated appropriately. This problem can be overcome by contacting the data provider and having them commit to a naming scheme (e.g. a basic namespace) independent of the resource location (URL). We are in the process of establishing those contacts.
The ontology schema produced as a result of this work is domain focused, instead of application specific. This means that it can be reused by other applications in the same domain or in related domains of knowledge. Additionally, any project that commits to the use of these ontologies enables seamless interoperation with our system, enabling our reuse of their data, or their utilization of ours.
VISUAL QUERY BUILDER
The key enabler of the TcruziKB visual query builder is the ontology schema, which represents all possible types of data residing in the knowledge base and how they can be interconnected. Through the use of the RDFS domain and range meta-properties, we are able to describe a property in terms of the class that it applies to and the range of possible values that it can assume. For example, the property “translated_to” applies to a “Gene” and its value has to be a “Protein”. It is through these property descriptions that our system is able to guide the user in building a query.
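As a minimal sketch of how domain and range descriptions can drive this guidance, the following Python fragment looks up the properties that apply to a selected class and the classes reachable through them. Only “translated_to”, “Gene” and “Protein” come from the text; the other schema entries and the function names are hypothetical, and the real system reads this information from the ontology schema rather than from an in-memory list.

```python
from typing import NamedTuple


class PropertyDef(NamedTuple):
    name: str
    domain: str     # class the property applies to (rdfs:domain)
    range_cls: str  # class of the property's values (rdfs:range)


# Hypothetical schema fragment; only the first entry is taken from the text.
SCHEMA = [
    PropertyDef("translated_to", "Gene", "Protein"),
    PropertyDef("has_expression", "Protein", "ProteinExpression"),
    PropertyDef("in_life_cycle", "ProteinExpression", "LifeCycleStage"),
]


def properties_for(cls, schema=SCHEMA):
    """Return the properties whose domain is the selected class."""
    return [p for p in schema if p.domain == cls]


def next_classes(cls, schema=SCHEMA):
    """Return the classes reachable from cls in one step, sorted by name."""
    return sorted({p.range_cls for p in properties_for(cls, schema)})
```

Calling `properties_for("Gene")` yields the single “translated_to” entry, which is what a user interface of this kind would list once “Gene” is selected as the root of the query.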
The system starts the query with a standard information retrieval (IR) task, in which the user performs a simple keyword query for a term (class or instance) from which to start building a more complex query. This initial search is performed over the whole set of ontologies loaded by the system. Its performance is enhanced by “indexing” the data in advance (as it arrives): an appropriate vector is built for each item and stored in a vector-space database (the Lucene text search engine is used for this purpose). The user can directly select an instance of interest on which to root the query, or select a class and accept “any” instance of this class as a result. Then, by reading the ontology schema, the system retrieves all possible properties that apply to the selected “root” term, and presents them in a list for the user to choose from. When a property is chosen, another background query to the schema retrieves the possible classes in the range of that property, and the process continues iteratively until the intended query is achieved. After the user has built the query through the visual interface, the system encodes and submits a SPARQL query to the server in the background (via Ajax calls).
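The final encoding step can be illustrated with a short Python sketch that turns a root class and the sequence of chosen (property, range class) pairs into a SPARQL SELECT query. The namespace URL, the `kb:` prefix, the variable naming and the property names in the example are assumptions made for illustration; they are not the identifiers used by TcruziKB.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def build_sparql(root_class, steps, namespace="http://example.org/tcruzikb#"):
    """Encode a guided path as a SPARQL SELECT query string.

    root_class: class chosen as the root of the query, e.g. "Gene"
    steps: (property, range_class) pairs chosen one at a time by the user
    namespace: hypothetical namespace for the knowledge base terms
    """
    variables = ["?v0"]
    patterns = [f"?v0 rdf:type kb:{root_class} ."]
    for i, (prop, range_cls) in enumerate(steps, start=1):
        # Link the previous variable to a new one, then constrain its type.
        patterns.append(f"?v{i - 1} kb:{prop} ?v{i} .")
        patterns.append(f"?v{i} rdf:type kb:{range_cls} .")
        variables.append(f"?v{i}")
    body = "\n".join("  " + p for p in patterns)
    return (
        f"PREFIX rdf: <{RDF_NS}>\n"
        f"PREFIX kb: <{namespace}>\n"
        f"SELECT {' '.join(variables)}\n"
        f"WHERE {{\n{body}\n}}"
    )


query = build_sparql(
    "Gene", [("codes_for", "Protein"), ("expressed_in", "LifeCycleStage")]
)
print(query)
```

Each step adds one linking triple and one type constraint, so the query grows in the same iterative fashion as the interaction described above.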
Section 3.1 explains in detail the structure of the queries built. Section 3.2 explains the automatic extensions implemented by the visual query builder, how the system enhances the content of the result set, and the support for queries to multiple servers. Section 3.3 describes natural language query processing. The results for the queries are obtained in XML and can be displayed through several user-friendly perspectives. The multi-perspective exploration of the results is presented in detail in the next chapter.
3.1 Query Structure
The queries composed by the visual query builder are directed graph patterns to be searched in a knowledge base. The graphs can be decomposed into paths, the paths into triples, and the triples into basic elements.
The basic elements of a query are: class, instance, property and variable. Triples combine the basic elements in the structure “subject (S), predicate (P), object (O)”, where the subject and object can each be a class, an instance or a variable, and the predicate can be either a property or a variable. For example, consider the triple “GeneX, codes_for, ?protein”, where the question mark preceding an element denotes a variable. A query using this triple indicates that this pattern is to be searched in the knowledge base, allowing the variables in the triple to be substituted by any actual value that matches the remainder of the triple. In this example, the query will return any proteins that GeneX codes for, as long as this is explicitly stated in the knowledge base through the property “codes_for”.
Triples can be connected to one another to form a path, such as “(genes, codes_for, ?protein) AND (?protein, ?related_to, Amastigote)”. In this example, we composed a path by using the logical connector “AND” to connect two triples. The supported logical connectors are “AND” and “OR”. The expected result for this query would be any proteins that have any relationship with the Amastigote life cycle stage of T. cruzi, along with the relationships found.
The addition of filters to constrain the matched results is also possible. In the visual query builder we support filters that search for elements matching a certain regular expression, such as “all proteins whose names start with 'Muc'”. Other advanced elements are envisioned, such as searching for any paths connecting two instances, with the possibility of expressing constraints on the searched paths, as described in previous work by our research group. Such advanced features are under development and are not yet supported by the system.
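To make the matching semantics concrete, the sketch below evaluates triple patterns joined by “AND” against a tiny in-memory knowledge base and then applies a regular expression filter. It illustrates the semantics only and is not how a SPARQL engine is implemented; the facts in `KB`, the property “expressed_in” and the function names are invented for the example.

```python
import re

# Hypothetical facts; the real knowledge base is an RDF store.
KB = {
    ("GeneX", "codes_for", "Mucin1"),
    ("GeneX", "codes_for", "KinaseB"),
    ("Mucin1", "expressed_in", "Amastigote"),
    ("KinaseB", "expressed_in", "Epimastigote"),
}


def is_var(term):
    """A leading question mark denotes a variable."""
    return term.startswith("?")


def match_triple(pattern, triple, binding):
    """Extend binding so that pattern equals triple, or return None."""
    new = dict(binding)
    for p, value in zip(pattern, triple):
        if is_var(p):
            if new.setdefault(p, value) != value:
                return None  # variable already bound to something else
        elif p != value:
            return None      # constant does not match
    return new


def match_path(patterns, kb=KB):
    """Evaluate triple patterns joined by AND; return all variable bindings."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            b2
            for b in bindings
            for t in kb
            if (b2 := match_triple(pattern, t, b)) is not None
        ]
    return bindings


def filter_regex(bindings, var, regex):
    """Keep only the bindings whose value for var matches the regex."""
    return [b for b in bindings if re.search(regex, b[var])]
```

With these facts, `match_path([("GeneX", "codes_for", "?protein"), ("?protein", "?related_to", "Amastigote")])` binds `?protein` to “Mucin1” and `?related_to` to “expressed_in”, so the relationship itself is returned as part of the result.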
3.2 Enhancing Queries and Result Sets
An important feature of the query builder is its ability to guide the user through a directed graph pattern from any standpoint, in any direction desired. For example, a user should be able to start at “Protein” and find any “(?protein, has_expression, ?proteinExpression)”, as well as start at “ProteinExpression” and find any “(?proteinExpression, is_expression_of, ?protein)”. We anticipate that not all data sources will explicitly state the inverse of every property, so the query builder is able to create a virtual “inverseOf” relationship for the user interface. The virtual relationship is realized through its concrete inverse by flipping the subject and object. In fact, we anticipate that some data sources will not present an ontology schema of any kind. In that case, the navigability of the visual query builder would be seriously compromised, since it would not know which property applies to which class. However, for cases where the schema is not present but the metadata is, i.e. there are no domain and range descriptions but the types of the instances are known, we can build a virtual schema by inspecting all properties and the types of their subjects and objects. This feature is supported, but not executed automatically due to its computational cost.
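Both mechanisms can be sketched in a few lines of Python, under the assumption that triples are plain tuples and that instance types are available in a dictionary. The function names and the sample data are hypothetical.

```python
def resolve_inverse(pattern, inverses):
    """Rewrite a pattern that uses a virtual inverse property.

    inverses maps a virtual property name to the concrete property stored
    in the data; the rewrite flips subject and object.
    """
    subject, predicate, obj = pattern
    if predicate in inverses:
        return (obj, inverses[predicate], subject)
    return pattern


def infer_schema(triples, types):
    """Build a virtual schema of (property, domain, range) entries.

    types maps an instance to its class. Triples whose subject or object
    has no known type are skipped. Every triple is inspected, which is why
    a pass like this is costly on a large knowledge base.
    """
    schema = set()
    for subject, predicate, obj in triples:
        if subject in types and obj in types:
            schema.add((predicate, types[subject], types[obj]))
    return schema
```

For example, `resolve_inverse(("?pe", "is_expression_of", "?protein"), {"is_expression_of": "has_expression"})` returns `("?protein", "has_expression", "?pe")`, which can be evaluated against data that only states “has_expression”.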
After the user has built the desired graph pattern to be searched, the visual query builder proactively enhances the query by adding triples that retrieve extra information about the results, when available. So, in addition to what the user explicitly requested, the system retrieves the label, type and original web page for each resource. These additions are valuable in analytical interfaces since they facilitate the understanding of the information presented. Please refer to Figure 2 to see an example of a SPARQL query created by the visual query builder.
Figure 2: SPARQL query created by the Visual Query Builder
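One plausible way to express this enhancement, shown here only as a sketch, is to append OPTIONAL patterns for rdfs:label, rdf:type and rdfs:seeAlso to each result variable, so that a resource is still returned when the extra information is missing. The use of OPTIONAL, the generated variable names and the example IRIs are assumptions; the query in Figure 2 may be structured differently.

```python
PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
)


def enhance(variables, patterns):
    """Add label, type and web page lookups for every result variable.

    variables: result variables such as ["?protein"]
    patterns: triple patterns already built by the user, as SPARQL strings
    """
    select = list(variables)
    extras = []
    for var in variables:
        name = var.lstrip("?")
        # OPTIONAL keeps a result even when the annotation is absent.
        extras.append(f"OPTIONAL {{ {var} rdfs:label ?{name}_label }}")
        extras.append(f"OPTIONAL {{ {var} rdf:type ?{name}_type }}")
        extras.append(f"OPTIONAL {{ {var} rdfs:seeAlso ?{name}_page }}")
        select += [f"?{name}_label", f"?{name}_type", f"?{name}_page"]
    body = "\n".join("  " + line for line in list(patterns) + extras)
    return f"{PREFIXES}SELECT {' '.join(select)}\nWHERE {{\n{body}\n}}"


enhanced = enhance(
    ["?protein"],
    ["?protein <http://example.org/kb#expressed_in> <http://example.org/kb#Amastigote> ."],
)
print(enhanced)
```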
The interaction between the client (TcruziKB Query Builder) and the server is defined by the SPARQL Protocol for RDF. Servers implementing that protocol are often called SPARQL endpoints. We support queries to multiple SPARQL endpoints by storing a list of servers and performing calls to all of them each time a query is executed. The results are received asynchronously from the SPARQL endpoints and aggregated into a single result set for presentation to the user. Because all endpoints return results expressed in RDF terms and identify resources by URI, results from different sources can be merged directly. The addition and configuration of new SPARQL endpoints is supported through our user interface. As a consequence, researchers using our system can integrate and use new data sources without any development intervention.
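The fan-out and aggregation pattern can be sketched as follows. TcruziKB performs these calls from the browser via Ajax; this Python version uses a thread pool instead and represents each endpoint as a plain callable so that it runs without network access. The stub endpoints, their rows and the duplicate-removal rule are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def query_all(endpoints, query):
    """Send the same query to every endpoint and merge the result rows.

    endpoints: callables that take a query string and return a list of rows,
    each row being a dict that maps a variable name to a value.
    """
    merged, seen = [], set()
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        futures = [pool.submit(endpoint, query) for endpoint in endpoints]
        # Rows are merged in whatever order the endpoints respond.
        for future in as_completed(futures):
            for row in future.result():
                key = tuple(sorted(row.items()))
                if key not in seen:  # drop rows already seen elsewhere
                    seen.add(key)
                    merged.append(row)
    return merged


# Stub endpoints standing in for real SPARQL servers.
def endpoint_a(query):
    return [{"gene": "G1"}]


def endpoint_b(query):
    return [{"gene": "G1"}, {"gene": "G2"}]


rows = query_all([endpoint_a, endpoint_b], "SELECT ?gene WHERE { ... }")
```

Because the order of `rows` depends on which endpoint answers first, callers that need a stable order should sort the merged rows.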
We extended the SPARQL Protocol for RDF to support the automatic configuration of a SPARQL endpoint in our system. The extension is backwards compatible, so if a specific endpoint does not respond to the implemented extensions, it will still be added to the system. The extension basically consists of the implementation and retrieval of an ontology-based description of the namespaces cited in a SPARQL endpoint.
3.3 Natural Language Query Processing
A typical problem in bioinformatics is that the user of a particular program may not have a great deal of background in computer science. Requiring that queries to the system be posed in a formal query language is therefore unreasonable. It is partially to overcome this limitation that research in natural language querying exists. Ontology-assisted natural language processing has received much attention recently but still has many shortcomings when applied to real datasets. TcruziKB builds on much of the existing research by providing an interface that allows biological researchers to ask questions in natural English, and it also utilizes algorithms to compensate for the shortcomings of existing approaches.
When a user opts to enter a query in natural English, they are presented with a simple text box. Because the initial phase of forming a query in this manner can be overwhelming, suggestions are provided, as in the visual query builder, that allow the user to select a starting point for their query. Here, however, suggestions do not come solely from the ontology; they also come from a set of predefined English rule words such as “Who”, “What”, “Find”, and so forth. After the user enters some initial words, they are presented with further suggestions relating to what they have previously entered. For example, if the user has entered “Gene” in their query, they are presented with suggestions corresponding to the properties of the Gene class from the ontology as well as English rule words. In Figure 3 below, the user has entered a partial sentence and is presented with suggestions most relevant to the word they are currently typing, as well as words relating to ontology terms they have previously typed. In this case the user has previously entered the word “gene” and is now being presented with suggestions corresponding to properties of the Gene class in the ontology.
Figure 3: Sample of the Interactive Natural Language Query Interface.
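A simplified sketch of this suggestion behaviour is given below: candidates come from a fixed list of English rule words, from ontology terms, and from the properties of any class already typed, and they are filtered by the prefix of the word currently being typed. The rule words are those named in the text; the ontology terms, the property labels and the prefix-matching rule are assumptions, and the actual interface may rank suggestions differently.

```python
RULE_WORDS = ["Who", "What", "Find"]

# Hypothetical ontology content used only for this example.
ONTOLOGY_TERMS = ["Gene", "Protein", "Life Cycle Stage"]
CLASS_PROPERTIES = {"gene": ["codes for", "belongs to organism"]}


def suggest(text):
    """Return suggestions for the word being typed at the end of text."""
    words = text.split()
    typing_new_word = text == "" or text.endswith(" ")
    current = "" if typing_new_word else words[-1]
    previous = words if typing_new_word else words[:-1]

    candidates = RULE_WORDS + ONTOLOGY_TERMS
    for word in previous:
        # Words already typed that name a class contribute its properties.
        candidates = candidates + CLASS_PROPERTIES.get(word.lower(), [])

    prefix = current.lower()
    return [c for c in candidates if c.lower().startswith(prefix)]
```

For instance, `suggest("Fi")` returns `["Find"]`, while `suggest("Find gene co")` returns `["codes for"]` because “gene” was typed earlier in the sentence.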
Given an English sentence, the Stanford parser builds a parse tree in which each node denotes a part of the sentence. For example, “What is the life cycle stage of GeneX” gives the parse tree in Figure 4, and its interpretation can be seen in Figure 5.
(WHNP (WP What))
(SQ (VBZ is)
(NP (DT the) (NN life cycle stage))
(PP (IN of)
(NP (CD GeneX)))))
Figure 4: The parse tree generated from the Stanford Parser for the sentence “What is the life cycle stage of GeneX?”