Ontology-Based Information Extraction: Current Approaches

Marcin Białek

The goal of Information Extraction (IE) is to automatically extract structured information from unstructured or semi-structured machine-readable documents, generally human-language texts, by means of natural language processing (NLP). For example, one might want to extract essential information from real estate web pages while ignoring other types of information. Recently a new player has emerged among IE engines: Ontology-Based IE (OBIE), which is steadily gaining supporters. Here an ontology, a formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts, plays a crucial role in the IE process. Using an ontology as the IE tool makes OBIE a very convenient means of gathering information that can later be used in building the Semantic Web. In this paper I try to explain the idea of OBIE, with its flaws and advantages. I try not only to give a theoretical account but also to review current trends in the field, to point out the architecture common to currently used systems and, finally, to classify those systems by factors that bear on their usability in real-life applications. As a conclusion, an attempt is made to identify possible trends and directions in this field.

Keywords and phrases: Ontology-Based Information Extraction, Information Extraction, Semantic Web, Ontology, Knowledge Representation.


"...Search today is still kind of a hunt, where you get all these links, and as we teach software to understand the documents, really read them in the sense a human does, you'll get answers more directly..." - Bill Gates.

The human ability to understand and use language remains one of the unsolved mysteries of modern science. The beginnings of Information Extraction date back to the late 1970s [1], when a Reuters research group led by Peggy M. Andersen came up with the idea of automatically extracting facts from press releases in order to generate news stories.

The goal of IE is to transform machine-readable text into a structured format, thereby reducing the information in a document to a tabular structure [2]. Specified information can be extracted from several different documents and later merged into a uniform representation. Once the data share a uniform representation, automated analysis can be performed, for example with data mining techniques, which aim to discover patterns and describe the relations between them.

Opinions on how to classify IE differ among researchers, but most bind it tightly to natural language processing. Riloff, for example, states that information extraction is a form of natural language processing in which certain types of information must be recognized and extracted from text [3]. As a rule, IE systems do not try to understand the input data. They only analyze the portions of documents that contain relevant information. Relevance is determined by predefined domain guidelines, which specify what type of information the system is expected to find.

As an example of an IE system, consider one that processes real estate web pages and extracts the price of a property, its location, standard, number of rooms and so on. Of course, to obtain such information, some kind of model or algorithm is needed to guide the process.
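The real estate example can be sketched with a few regular expressions that turn free text into a tabular record. This is only an illustration under stated assumptions: the field names, the patterns and the sample listing are hypothetical, and a real system would need far more robust patterns.

```python
import re

# Hypothetical patterns for three fields; real listings need many more variants.
PATTERNS = {
    "price": re.compile(r"(?:price|cost)\s*:?\s*\$?\s*([\d,]+)", re.IGNORECASE),
    "rooms": re.compile(r"(\d+)\s*(?:rooms?|bedrooms?)", re.IGNORECASE),
    "location": re.compile(
        r"(?:located in|location\s*:)\s*([A-Z][\w-]*(?: [A-Z][\w-]*)*)"
    ),
}


def extract_listing(text):
    """Return one flat record; fields that are not found are set to None."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    return record


listing = "Sunny flat located in Warsaw Mokotow, 3 rooms, balcony. Price: $250,000."
record = extract_listing(listing)
print(record)
```

Everything outside the three patterns (here "sunny" and "balcony") is simply ignored, which matches the idea that an IE system analyzes only the relevant portions of a document.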

Nowadays the view of IE is changing: more and more people see it not only as a process of retrieving disconnected text tokens but as a way of obtaining meaningful semantic data. Russell and Norvig [4] propose in their book that IE can be classified as a middle ground between Information Retrieval (IR) systems, which merely find documents relevant to user requirements, and text parsers, which try to extract text along with its specific semantic context. While there are many examples of successful IR implementations (probably the most visible are the common web search engines, e.g. Google), the area of text parsing and semantic data mining has had no such spectacular successes. There are semantic search engines such as YAGO, which can answer a question like "Who has won the Nobel prize after Albert Einstein", but their usability is still very limited.

Recently a new branch of IE has emerged, called Ontology-Based IE (OBIE), which is steadily gaining supporters. At the foundation of OBIE there has to be a defined ontology according to which the system processes the given text. In theory an ontology is usually defined as an "explicit specification of a conceptualization" [6], which in practice means that an ontology provides a shared vocabulary that can be used to model a domain, that is, the types of objects and/or concepts that exist, including their properties and relations. In general every ontology is designed for a specific domain. For example, an ontology for the real estate market would contain concepts like property, city, country, district, block of flats and standalone house. A number of ready-made ontologies are available at http://semanticweb.org/wiki/Ontology.

In this paper I try to review the field of OBIE: on the one hand to provide a clear definition of what is and what is not an OBIE system, and on the other to analyze the architectures of different OBIE systems in order to point out common solutions and approaches to the problem.

  1. Ontology-Based Information Extraction definition

In order to define an OBIE system I will start by pointing out its responsibilities and the expectations placed on it:

  • Natural language processing: as mentioned before, OBIE is a subfield of IE and as such must deal with natural language processing. In IE generally we do not narrow the domain that much, but OBIE would make no sense without semantics. As input an OBIE system usually takes unstructured text files or semi-structured XML web page files. There are also solutions that can take even a PDF file as input, for example GATE (General Architecture for Text Engineering) [7], but in this case the algorithm relies more on pattern recognition than on actual IE.

  • Ontology-based output: one of the main reasons to use an OBIE system is to generate semantically connected output data. Usually the ontology used for processing the text is also used for presenting the results, but there are also systems that produce an ontology during execution. Such an approach can be seen in the work of Alexander Maedche and Steffen Staab, who combine dictionary parsing mechanisms for acquiring a domain-specific concept taxonomy with a discovery mechanism for acquiring non-taxonomic conceptual relations [8].

  • Using an ontology during the IE process: in OBIE the goal is not to implement new extraction methods but to use existing methods to extract the data needed to identify the components of the ontology.

So the simplest definition of OBIE would be the one given by Daya C. Wimalasuriya and Dejing Dou [8]:

"An Ontology-Based Information Extraction System: A system that processes unstructured or semi-structured natural language text through a mechanism guided by ontologies to extract certain types of information and presents the output using ontologies."

  2. Usability of Ontology-Based Information Extraction Systems

OBIE is a relatively new concept, and as usual with promising new concepts it has come with great expectations, some of which have proved false. A good example was the panic on the stock exchange in 2006, when authors such as Marc Fawzi spread the belief that in the near future OBIE-based semantic web systems would put an end to the Google search engine [6]. He stated that the Semantic Web (or Web 3.0) promises to “organize the world’s information” in a dramatically more logical way than Google can ever achieve with its current engine design. As we all know, nothing like this happened, and OBIE systems are no longer treated as a golden hammer. Nevertheless, there are some areas in which they are proving themselves:

  • Automated natural language processing: the concept of Web 3.0 [10] is still in the phase of planning and lofty ideas, and nearly 80% of the content of the WWW is in the form of human-readable text. OBIE and basic IE systems are necessary for changing those texts from human-readable into machine-readable form. The need for this change is especially visible in artificial intelligence systems, such as the one described by Chavez and Maes, where intelligent agents conduct automated negotiations in a dynamically changing environment [11].

  • Creating semantic content for Web 3.0: as stated before, Tim Berners-Lee's Semantic Web may be a revolutionary idea, but right now the WWW is still in its 2.0 phase and it is very hard to find semantic content in it. As Popov [12] stated, it is impossible to imagine that one day we will wake up and all the content of the WWW will suddenly have migrated from human-readable 2.0 to semantic 3.0. The process of transforming the web will be slow and painful, and until then OBIE will play a crucial role in delivering semantic content to cutting-edge systems.

  • Improving the quality of ontologies: this can happen in two ways. First, as already mentioned, an ontology can be generated "on the fly" during document parsing. Right now this process is very unstructured and inaccurate; the bottom line is of course professional judgement, and at present only ontologies with basic functionality can be produced automatically. Second, as interest in OBIE systems increases, more and more domain-specific ontologies are being built and then, as part of their natural life cycle, updated. One example is the Dublin Core ontology, which provides a basic vocabulary for describing elements.

It is worth underlining that OBIE systems were introduced for the purpose of generating the data needed to perform semantic queries. Nowadays the term "semantic query" is very often misused. The easiest way to describe a semantic query is as a search that discovers the meaning of words, unlike the typical search engine method of looking only for occurrences of a keyword. This definition is simple, but it is often misunderstood. To give a clear view, I will point out some common mistakes in this area:

  • Structured data: semantic technology has nothing to do with structured data. Some people think that search engines like Google use semantic technology merely because they return structured results. The trick of harvesting information has been known since ancient Egyptian times, and it has nothing to do with semantic technology.

  • Morphology: changing the query "top 10" into the equivalent "top ten" (merely replacing 10 with ten) is not a semantic capability either, only a simple substitution.

  • Syntax: a system based only on the syntax of the input text cannot be called semantic. If it could, that would be like expecting an 8-year-old child with perfect reading skills to fully understand Shakespeare's sonnets.

  • Statistics: an infinite number of monkeys typing randomly on keyboards would eventually come up with the complete text of an encyclopedia. But for the result to be semantically relevant the monkeys would have to finish their job, whereas a semantic system does not operate on patterns. A simple test: the human brain is capable of grasping the semantics of a sentence even if it has never encountered it before. Try: "Human alligators live on the north pool". If semantics were built on statistics, computers and algorithms would not understand this and billions of other sentences.

  3. General functionality and common architecture

Although the details of an OBIE implementation can be project-specific, the general functionality and the main parts of the system are common across all projects (Figure 1). The central place in every OBIE system belongs to the ontology. The ontology has to be specific to the project domain, and it later drives the semantic connections between pieces of information. An ontology can be represented in several languages, from RDF (Resource Description Framework), which has very limited usage and was originally designed as a metadata data model, to the Web Ontology Language (OWL), which has been recommended as a standard by the World Wide Web Consortium (W3C).

The data described in OWL comprise classes and properties together with the set of relations that bind them to each other. An ontology consists of a set of axioms that place constraints on individuals (classes), stating what types of relationship are allowed. These relations allow automatic systems to derive additional data from the information originally provided. Currently the cutting edge in ontology description languages is OWL 2, which, as Bernardo Cuenca Grau said, is a promising extension to and revision of OWL that is currently being developed within the W3C OWL Working Group [13]. The simplest OWL 2 ontology is given in Listing 1; it represents an ontology for real estate containing only a Property class.

Listing 1. Simple ontology sample.
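As an illustration of how small such an ontology can be, the sketch below emits an OWL 2 document in functional-style syntax that declares a single Property class. The helper name minimal_owl_ontology and the namespace http://example.org/realestate are assumptions made for this example and are not taken from the paper.

```python
def minimal_owl_ontology(ontology_iri, class_names):
    """Build an OWL 2 functional-style syntax document that only declares classes.

    ontology_iri is a placeholder namespace chosen by the caller.
    """
    lines = [
        f"Prefix(:=<{ontology_iri}#>)",
        f"Ontology(<{ontology_iri}>",
    ]
    for name in class_names:
        lines.append(f"  Declaration(Class(:{name}))")
    lines.append(")")
    return "\n".join(lines)


owl_text = minimal_owl_ontology("http://example.org/realestate", ["Property"])
print(owl_text)
```

Further concepts from the real estate domain (city, district and so on) would be added as more Declaration axioms, and relations between them as object property axioms.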

As mentioned before, the most crucial role in OBIE is played by the ontology, and to have one we must either write it in an ontology editor or generate it with an ontology generator. There is a plethora of ontology editors on the market, but probably the most commonly used right now is the Protégé ontology editor developed at Stanford University. As Knublauch and Musen state in their work, the Protégé editor provides a convenient abstraction above language-specific ontology design [14]. Protégé has many plug-ins that can express an ontology in almost any language. An example of an approach that works with a given ontology can be found in the work of Michal Laclavik, Martin Seleng and Marian Babik [15], who propose a tool that analyzes text using regular expression patterns and detects equivalent semantic elements according to the defined domain ontology.
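The idea of tying regular expression patterns to the classes of a given domain ontology can be sketched as follows. This is not the implementation from [15]; the toy class hierarchy, the patterns and the triple format are assumptions chosen only to show the mechanism.

```python
import re

# Toy ontology: class name -> parent class (None for root classes). Hypothetical.
ONTOLOGY = {
    "Property": None,
    "BlockOfFlats": "Property",
    "StandaloneHouse": "Property",
    "City": None,
}

# Each pattern is evidence for exactly one ontology class.
CLASS_PATTERNS = [
    ("BlockOfFlats", re.compile(r"\bblock of flats\b", re.IGNORECASE)),
    ("StandaloneHouse", re.compile(r"\b(?:detached|standalone) house\b", re.IGNORECASE)),
    ("City", re.compile(r"\bin ([A-Z][a-z]+)\b")),
]


def superclasses(cls):
    """Walk up the hierarchy and return all ancestors of cls."""
    chain = []
    parent = ONTOLOGY.get(cls)
    while parent is not None:
        chain.append(parent)
        parent = ONTOLOGY.get(parent)
    return chain


def annotate(text):
    """Return (mention, 'rdf:type', class) triples for every pattern match."""
    triples = []
    for cls, pattern in CLASS_PATTERNS:
        for match in pattern.finditer(text):
            # Use the captured group when the pattern has one, else the whole match.
            mention = match.group(1) if match.groups() else match.group(0)
            triples.append((mention, "rdf:type", cls))
            for parent in superclasses(cls):
                triples.append((mention, "rdf:type", parent))
    return triples


triples = annotate("A standalone house in Krakow is cheaper than a block of flats in Warsaw.")
for triple in triples:
    print(triple)
```

Because the output is expressed in the vocabulary of the ontology, a match for "standalone house" is also reported as a Property, which a plain keyword extractor would not do.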

There is also another way of supplying an ontology to an OBIE system: the ontology can be generated on the fly from the given text. This method has many drawbacks. First of all, a lexical database is needed in order to tokenize the text. Automatically generated ontologies will never be of the same quality as ones written by a domain expert, and we must not forget that we need the ontology to search for semantic relations in the data, not the other way around. To generate a usable ontology we need a lot of data from the modelled domain; the more specific the data, the more accurate the generated ontology. Despite these drawbacks, the area of automatic ontology generation looks quite promising. Examples of this approach can be found in [16] and [17]. Especially worth noting is the work of Paul Buitelaar, Daniel Olejnik and Michael Sintek, who are developing a plug-in for the Protégé editor that extracts ontologies from a given text, which an expert in the domain can then edit. Both works underline that ontologies generated this way are not 100% accurate, but having one built by a domain expert is usually too costly.

A semantic lexicon is often used to aid the IE module and ontology generation. Semantic lexicons provide a database for a given language. A semantic lexicon groups words into sets of synonyms and usually provides short, general descriptions, but above all it records various semantic relations between the synonym sets. For English the most recognized semantic lexicon is WordNet, which has been incorporated into a large number of ontology-based projects.
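The structure of a semantic lexicon (synonym sets, short descriptions and relations between the sets) can be illustrated with a tiny hand-written lexicon. The entries below are invented for this sketch and are not WordNet data, and the function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    name: str
    words: frozenset            # the synonyms grouped in this set
    gloss: str                  # short general description
    hypernyms: list = field(default_factory=list)  # names of more general synsets


# Hand-written toy lexicon; a real system would load a resource such as WordNet.
LEXICON = {
    "dwelling": Synset("dwelling", frozenset({"dwelling", "home", "abode"}),
                       "housing that someone is living in"),
    "house": Synset("house", frozenset({"house"}),
                    "a dwelling for one or more families", ["dwelling"]),
    "flat": Synset("flat", frozenset({"flat", "apartment"}),
                   "a suite of rooms within a larger building", ["dwelling"]),
}


def synsets_for(word):
    """All synonym sets that contain the given word."""
    return [s for s in LEXICON.values() if word.lower() in s.words]


def are_synonyms(a, b):
    """True when both words occur in at least one common synonym set."""
    return any(b.lower() in s.words for s in synsets_for(a))


def shares_hypernym(a, b):
    """True when the words have a common directly more general synset."""
    hyper_a = {h for s in synsets_for(a) for h in s.hypernyms}
    hyper_b = {h for s in synsets_for(b) for h in s.hypernyms}
    return bool(hyper_a & hyper_b)


print(are_synonyms("flat", "apartment"), shares_hypernym("house", "apartment"))
```

An extraction module can use such lookups to treat "flat" and "apartment" as the same concept, and an ontology generator can use the hypernym links as candidates for class hierarchy relations.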

The preprocessing and extraction modules are described in detail in the following section; at this point I only note that the preprocessor consists of input-specific modules that transform text into a form the extraction module can process. Preprocessing consists mainly of stripping whitespace, HTML tags, unreadable characters and so on. The extraction module is where the actual IE takes place: here the input data are analyzed, turned into tokens understandable by the ontology and finally bound together by semantic relationships.
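A minimal sketch of such a preprocessing step, using only the Python standard library, is shown below. It drops HTML tags and the content of script and style elements, removes non-printable characters and collapses whitespace; the function name preprocess and the sample page are assumptions for this example.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def preprocess(html_text):
    collector = _TextCollector()
    collector.feed(html_text)
    collector.close()
    text = " ".join(collector.parts)
    # Remove control and other non-printable characters (Unicode categories C*),
    # but keep whitespace so that words are not glued together.
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    # Collapse every run of whitespace (including non-breaking spaces) to one space.
    return re.sub(r"\s+", " ", text).strip()


raw = ("<html><body><h1>Flat  for sale</h1>\n"
       "<p>3&nbsp;rooms,\x07 Warsaw</p><script>track()</script></body></html>")
clean = preprocess(raw)
print(clean)
```

The cleaned string is what would be handed to the extraction module; other input types (plain text, XML, PDF) would each need their own input-specific module.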

The data produced by the extraction module then need to be transformed into a specific description logic language (currently this is usually OWL) in order to be saved in the knowledge database.
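One way to picture this transformation is a small function that renders an extracted record as OWL 2 functional-style syntax axioms. The individual name flat1 and the properties hasPrice and hasCity are hypothetical, and the sketch handles only integer and string values.

```python
def to_owl_assertions(individual, cls, data_values):
    """Render one extracted individual as OWL 2 functional-style syntax axioms.

    Names are emitted with the default ':' prefix; only int and str values
    are handled in this sketch.
    """
    axioms = [
        f"Declaration(NamedIndividual(:{individual}))",
        f"ClassAssertion(:{cls} :{individual})",
    ]
    for prop, value in data_values.items():
        if isinstance(value, int) and not isinstance(value, bool):
            literal = f'"{value}"^^xsd:integer'
        else:
            # Escape backslashes and double quotes inside string literals.
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            literal = f'"{escaped}"^^xsd:string'
        axioms.append(f"DataPropertyAssertion(:{prop} :{individual} {literal})")
    return axioms


axioms = to_owl_assertions("flat1", "Property", {"hasPrice": 250000, "hasCity": "Warsaw"})
print("\n".join(axioms))
```

The resulting axioms would be placed inside an Ontology(...) block that declares the ':' and 'xsd:' prefixes before being loaded into the knowledge database.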

As the result of the OBIE process we get a knowledge database filled with data according to the specified ontology. When the process is finished there is usually a front-end search engine that allows the user to query the database. Search engines built on semantic knowledge bases are common on the Internet. The best-known example is www.hakia.com. Founded in 2004, it was supposed to be the next Google, with a semantic answer to indexing files. The founders boasted that for the first time ontological semantics would be used to enable a search engine to perceive concepts beyond words and retrieve results with meaningful equivalents [18]. Right now semantic search engines still pose no threat to Google, but as Randall Stross stated in his article [19]: "A growing number of entrepreneurs are placing their bets, however, on a hybrid system that puts humans back into the search equation."
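How a front end might query such a knowledge database can be sketched with a list of triples and a query that follows subclass relations. The triples, the predicate names and both functions are assumptions for this illustration and do not describe any particular search engine.

```python
# Hypothetical knowledge base filled by an OBIE run, stored as plain triples.
KNOWLEDGE_BASE = [
    ("flat1", "rdf:type", "BlockOfFlats"),
    ("house7", "rdf:type", "StandaloneHouse"),
    ("BlockOfFlats", "rdfs:subClassOf", "Property"),
    ("StandaloneHouse", "rdfs:subClassOf", "Property"),
    ("flat1", "locatedIn", "Warsaw"),
    ("house7", "locatedIn", "Krakow"),
]


def match(kb, s=None, p=None, o=None):
    """Return all triples that agree with the given pattern (None = wildcard)."""
    return [
        t for t in kb
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def instances_of(kb, cls):
    """Individuals typed with cls or with any of its (transitive) subclasses."""
    classes = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for sub, _, _ in match(kb, p="rdfs:subClassOf", o=current):
            if sub not in classes:
                classes.add(sub)
                frontier.append(sub)
    return sorted({s for s, _, c in match(kb, p="rdf:type") if c in classes})


print(instances_of(KNOWLEDGE_BASE, "Property"))
print(match(KNOWLEDGE_BASE, s="flat1", p="locatedIn"))
```

No individual is typed directly as Property, so a keyword lookup for "Property" would return nothing, while the query that uses the ontology finds both individuals.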

Finally, it is worth showing all the elements in a working system. For this purpose I chose Wu and Weld's Kylin system [20], in which the authors try to obtain structured data from Wikipedia. Kylin does not require a user-specified ontology; instead it uses a generator that searches for infoboxes with similar attributes across the whole of Wikipedia and on that basis tries to guess individuals and the relations between them. To generate the ontology, the Kylin Ontology Generator uses WordNet as a lexical database. The aim of this project is not to build a new semantic search engine but to produce an ontology that will be used in developing a communal correction system for Wikipedia and that Wikipedia users could later correct. Such an ontology could later be used as a model representation of the real world.
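The notion of "infoboxes with similar attributes" can be made concrete with a simple set-overlap measure. The sketch below uses Jaccard similarity over attribute names; this is a stand-in chosen for illustration, not the method used by Kylin, and the infobox names, attributes and threshold are invented.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical infobox templates and their attribute names.
INFOBOXES = {
    "Infobox city": {"name", "country", "population", "area"},
    "Infobox town": {"name", "country", "population", "mayor"},
    "Infobox person": {"name", "birth_date", "occupation"},
}


def similar_pairs(infoboxes, threshold=0.5):
    """Pairs of templates whose attribute overlap reaches the threshold."""
    names = sorted(infoboxes)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard(infoboxes[first], infoboxes[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


pairs = similar_pairs(INFOBOXES)
print(pairs)
```

Templates that end up in the same group are candidates for a common class or a subclass relation in the generated ontology, which a lexical database can then help to name and order.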

  4. Preprocessing (structuring) methods

A document is an abstract entity that has a variety of possible actual representations. Informally, the task of the document structuring process is to take the most “raw” representation and convert it into the representation through which the essence (i.e., the meaning) of the document surfaces. A review of tokenization methods has been published by Feldman and Sanger in [21]:

  • Tokenization: the most commonly used preprocessing method is tokenization, in which, prior to more sophisticated processing, text is broken up into meaningful parts. Documents can be broken up at several levels, i.e. chapters, sections, sentences, words or even characters. In OBIE systems tokenization usually means breaking text into sentences and then into words, which are called tokens. One of the problems in such tokenization is identifying sentence boundaries, that is, distinguishing between a period that signals the end of a sentence and a period that is part of a preceding token like Mr. or Dr.
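The sentence boundary problem can be illustrated with a small tokenizer that consults a list of known abbreviations before treating a period as the end of a sentence. The abbreviation list is a hypothetical, deliberately short example, and the approach is a simple heuristic rather than a complete solution.

```python
import re

# Hypothetical, deliberately short list; real systems use much larger ones.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e."}


def split_sentences(text):
    """Split on ., ! or ? unless the period belongs to a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Break a sentence into word and punctuation tokens, keeping abbreviations whole."""
    tokens = []
    for chunk in sentence.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


sentences = split_sentences("Dr. Smith sold the house. Mr. Jones bought it!")
tokens = tokenize(sentences[0])
print(sentences)
print(tokens)
```

The heuristic still fails when an abbreviation happens to end a sentence, which is one reason why sentence boundary detection is treated as a problem in its own right.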
