Ontology-Based Information Extraction: Current Approaches
The goal of Information Extraction (IE) is to automatically extract structured information from unstructured or semi-structured machine-readable documents, generally human-language texts, by means of natural language processing (NLP). For example, one might want to extract essential information from real estate web pages while ignoring other types of information. A recent arrival among IE approaches is Ontology-Based IE (OBIE), which is steadily gaining supporters. Here an ontology, a formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts, plays a crucial role in the IE process. Using an ontology as the IE tool makes OBIE a very convenient way of gathering information that can later be used in building the Semantic Web. In this paper I explain the idea of OBIE, with its flaws and advantages. I aim not only to give a theoretical account, but also to review current trends in the field, to point out the common architecture of systems in current use, and finally to classify them by several factors according to their usability in real-life applications. The paper concludes with an attempt to identify possible trends and directions in this field.
Keywords and phrases: Ontology-Based Information Extraction, Information Extraction, Semantic Web, Ontology, Knowledge Representation.
"...Search today is still kind of a hunt, where you get all these links, and as we teach software to understand the documents, really read them in the sense a human does, you'll get answers more directly..." - Bill Gates.
The human ability to understand and use language remains one of the unsolved mysteries of modern science. The beginnings of Information Extraction date back to the late 1970s, when a Reuters research group led by Peggy M. Andersen came up with the idea of automatically extracting facts from press releases in order to generate news stories.
The goal of IE is to transform machine-readable text into a structured format, thereby reducing the information in a document to a tabular structure. Specified information can be extracted from several different documents and later merged into a uniform representation. Once the data are in a uniform representation, automated analysis can be performed, for example by data mining techniques, which aim to discover patterns and describe the relations between them.
Researchers differ in how they classify IE, but most tie it closely to Natural Language Processing. Riloff, for example, states that information extraction is a form of natural language processing in which certain types of information must be recognized and extracted from text. As a rule, IE systems do not try to understand the input data. They only analyze the portions of documents that contain relevant information. Relevance is determined by predefined domain guidelines, which specify what type of information the system is expected to find.
As an example of an IE system, consider one that processes real estate web pages, extracting the price of a property, its location, standard, number of rooms and so on. To obtain such information, some kind of model or algorithm is needed to guide the process.
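The real estate example can be sketched as a small pattern-based extractor. This is only an illustration: the field labels ("Price:", "Location:"), the currency codes and the function name `extract_listing` are assumptions made for the sketch, not part of any system discussed here, and real pages would need far richer rules.

```python
import re

# Hypothetical patterns for an English-language listing; the labels and
# currency codes are assumptions chosen for this sketch.
PATTERNS = {
    "price": re.compile(r"Price:\s*([\d,]+)\s*(USD|EUR|PLN)"),
    "rooms": re.compile(r"(\d+)\s+rooms?\b", re.IGNORECASE),
    "location": re.compile(r"Location:\s*([^.\n]+)"),
}


def extract_listing(text):
    """Return a flat record with the fields found in the text."""
    record = {}
    match = PATTERNS["price"].search(text)
    if match:
        record["price"] = int(match.group(1).replace(",", ""))
        record["currency"] = match.group(2)
    match = PATTERNS["rooms"].search(text)
    if match:
        record["rooms"] = int(match.group(1))
    match = PATTERNS["location"].search(text)
    if match:
        record["location"] = match.group(1).strip()
    return record


sample = (
    "Sunny flat. Location: Krakow, Old Town. 3 rooms, balcony. "
    "Price: 250,000 EUR. Call now!"
)
print(extract_listing(sample))
```

Text that matches no pattern (here "Sunny flat" and "Call now!") is simply ignored, which mirrors the idea that an IE system analyzes only the relevant portions of a document.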
The view of IE is now changing: more and more people see it not only as a process of retrieving disconnected text tokens, but as a way of obtaining meaningful semantic data. Russell and Norvig propose in their book that IE can be classified as a middle ground between Information Retrieval (IR) systems, which merely find documents relevant to the user's requirements, and text parsers, which try to extract text along with its specific semantic context. While there are many examples of successful IR implementations (probably the most visible being common web search engines such as Google), the area of text parsing and semantic data mining has no such spectacular successes. There are semantic search engines such as YAGO, which can answer a question like "Who has won the Nobel prize after Albert Einstein", but their usability is still very limited.
Recently a new branch of IE has emerged, called Ontology-Based IE (OBIE), which is steadily gaining supporters. At the foundation of OBIE there has to be a defined ontology, according to which the system will process the given text. In theory an ontology is usually defined as an "explicit specification of a conceptualization", which in practice means that an ontology provides a shared vocabulary that can be used to model a domain: the types of objects and/or concepts that exist, including their properties and relations. In general, every ontology is designed for a specific domain. For example, an ontology for the real estate market would contain concepts like property, city, country, district, block of flats and standalone house. A number of ready-made ontologies are available at http://semanticweb.org/wiki/Ontology.
In this paper I review the field of OBIE, on the one hand to provide a clear definition of what is and what is not an OBIE system, and on the other to analyze the architectures of different OBIE systems in order to point out common solutions and approaches to the problem.
In order to define an OBIE system, I will start by pointing out its responsibilities and expectations:
The simplest definition of OBIE is therefore the one given by Daya C. Wimalasuriya and Dejing Dou:
"An Ontology-Based Information Extraction System: A system that processes unstructured or semi-structured natural language text through a mechanism guided by ontologies to extract certain types of information and presents the output using ontologies."
OBIE is a relatively new concept, and as usual with promising new concepts it arrived with great expectations, some of which proved false. A good example was the panic on the stock exchange in 2006, when authors such as Marc Fawzi spread the belief that in the near future an OBIE-based semantic web system would put an end to the Google search engine. He stated that the Semantic Web (or Web 3.0) promises to “organize the world’s information” in a dramatically more logical way than Google can ever achieve with their current engine design. As we all know, nothing of the kind happened; moreover, OBIE systems are no longer treated as a golden hammer. Nevertheless, there are some areas in which they are proving themselves:
It is worth underlining that OBIE systems were introduced for the purpose of generating the data needed to perform semantic queries. Nowadays the term "semantic query" is very often misused. The easiest way to describe a semantic query is as a search that discovers the meaning of words, unlike the typical search engine method of looking only for occurrences of a keyword. This definition is simple, but it is often misunderstood. To give a clear view, I will point out some common mistakes in this area:
Although the detailed implementation of an OBIE system can be specific to a project, the general functionality and the main parts of the system are common across all projects (Figure 1). The central place in every OBIE system belongs to the ontology. The ontology has to be specific to the project domain, and it later drives the semantic connections between pieces of information. An ontology can be represented in several languages, ranging from RDF (Resource Description Framework), which has very limited usage and was originally designed as a metadata data model, to the Web Ontology Language (OWL), which is the standard recommended by the World Wide Web Consortium (W3C).
The data described by OWL consist of classes and properties and a set of relations between them that bind them to each other. An ontology consists of a set of axioms which place constraints on individuals (classes), stating what types of relationship are allowed. These types of connection allow automatic systems to derive additional data from the information originally provided. Currently the cutting edge in ontology description languages is OWL 2, which, as Bernardo Cuenca Grau said, is a promising extension to and revision of OWL that is currently being developed within the W3C OWL Working Group. The simplest OWL 2 ontology listing can be found in Listing 1; it represents an ontology for real estate containing only a Property class.
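How axioms let a system derive additional data can be illustrated with a toy model. The sketch below is not an OWL reasoner: the class names are taken from the real estate example, while the `locatedIn` property, the dictionaries and the function names are hypothetical. It treats domain and range as checks on allowed relations, which is a simplification; OWL reasoners use such axioms to infer class membership instead.

```python
# Toy subclass axioms (child -> parent); the hierarchy is an assumption.
SUBCLASS_OF = {
    "BlockOfFlats": "Property",
    "StandaloneHouse": "Property",
    "Property": "Thing",
}

# Hypothetical object property with its (domain, range) classes.
ALLOWED = {"locatedIn": ("Property", "City")}


def infer_types(asserted_class):
    """Return the asserted class followed by every inferred superclass."""
    types = [asserted_class]
    current = asserted_class
    while current in SUBCLASS_OF:
        current = SUBCLASS_OF[current]
        types.append(current)
    return types


def relation_allowed(prop, subject_class, object_class):
    """Check a relation against the property's domain and range."""
    domain, range_ = ALLOWED[prop]
    return domain in infer_types(subject_class) and range_ in infer_types(object_class)


print(infer_types("StandaloneHouse"))
print(relation_allowed("locatedIn", "StandaloneHouse", "City"))
```

From the single assertion that something is a `StandaloneHouse`, the sketch derives that it is also a `Property`, information that was never stated explicitly.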
Listing 1. Simple ontology sample.
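The original listing is not reproduced here. As a stand-in, the following sketch emits a minimal ontology in OWL 2 functional-style syntax that declares a single Property class. The base IRI `http://example.org/realestate` and the helper name `minimal_ontology` are assumptions made for illustration, and the output should not be read as the author's own listing.

```python
def minimal_ontology(base_iri, class_names):
    """Build an OWL 2 functional-syntax document declaring the given classes."""
    lines = [
        f"Prefix(:=<{base_iri}#>)",
        f"Ontology(<{base_iri}>",
    ]
    # One class declaration axiom per requested class.
    lines += [f"  Declaration(Class(:{name}))" for name in class_names]
    lines.append(")")
    return "\n".join(lines)


# Hypothetical base IRI; only the Property class is declared.
print(minimal_ontology("http://example.org/realestate", ["Property"]))
```

Such a document can be opened in an ontology editor and extended with further classes and properties.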
As mentioned before, the most crucial role in OBIE is played by the ontology, but in order to have one we must either write it in an ontology editor or generate it with an ontology generator. There is a plethora of ontology editors on the market, but probably the most commonly used at present is the Protégé ontology editor, developed at Stanford University. As Knublauch and Musen state in their work, the Protégé editor provides a convenient abstraction above language-specific ontology design. Protégé has many plugins that can express an ontology in almost any language. An example of the approach with a given ontology can be found in the work of Michal Laclavik, Martin Seleng and Marian Babik, who propose a tool that analyzes text using regular expression patterns and detects equivalent semantic elements according to the defined domain ontology.
There is also another way to supply an ontology to an OBIE system: it can be generated on the fly from the given text. This method has many drawbacks. First of all, a lexical database is needed in order to tokenize the text. Automatically generated ontologies will never be of the same quality as those written by a domain expert, and we must not forget that we need the ontology to search for semantic relations between data, not the other way around. To generate a usable ontology we would need a lot of data from the given model; the more specific the data, the more accurate the generated ontology. Despite these drawbacks, the area of automatic ontology generation looks quite promising. Two examples of this approach can be found in the literature; especially worth noting is the work of Paul Buitelaar, Daniel Olejnik and Michael Sintek, who are developing a plug-in for the Protégé editor which extracts ontologies from a given text so that an expert in the domain can then edit them. Both works underline that ontologies generated this way are not 100% accurate, but having one built by a domain expert is usually too costly.
A semantic lexicon is often used to aid the IE module and ontology generation. Semantic lexicons provide a lexical database for a given language. A semantic lexicon groups words into sets of synonyms, usually provides short general definitions, and above all records various semantic relations between the synonym sets. For English, the most recognized semantic lexicon is WordNet, which has been incorporated into a large number of ontology-based projects.
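The structure of a semantic lexicon can be sketched with a few hand-written entries. The synset identifiers, glosses and relations below are toy data invented for this illustration; they are neither WordNet's actual content nor its API.

```python
# Toy lexicon: each synonym set has lemmas, a short gloss and a hypernym link.
SYNSETS = {
    "building.n": {
        "lemmas": {"building", "edifice"},
        "gloss": "a structure with a roof and walls",
        "hypernym": None,
    },
    "dwelling.n": {
        "lemmas": {"house", "home", "dwelling"},
        "gloss": "a building in which people live",
        "hypernym": "building.n",
    },
    "apartment.n": {
        "lemmas": {"flat", "apartment"},
        "gloss": "a set of rooms forming one residence",
        "hypernym": "dwelling.n",
    },
}


def synsets_for(word):
    """Return the ids of all synonym sets that contain the word."""
    return sorted(sid for sid, s in SYNSETS.items() if word.lower() in s["lemmas"])


def are_synonyms(first, second):
    """Two words are synonyms here if they share at least one synset."""
    return bool(set(synsets_for(first)) & set(synsets_for(second)))


def hypernym_chain(synset_id):
    """Follow 'is a kind of' links from a synset up to the root."""
    chain = []
    current = SYNSETS[synset_id]["hypernym"]
    while current is not None:
        chain.append(current)
        current = SYNSETS[current]["hypernym"]
    return chain


print(are_synonyms("flat", "apartment"))
print(hypernym_chain("apartment.n"))
```

An extraction module could use such lookups to map the words "flat" and "apartment" found in text onto the same ontology concept.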
The preprocessing and extraction modules will be described in detail in the following section; at this point I will only note that the preprocessor consists of input-specific modules which transform text into a form that can be processed by the extraction module. Preprocessing consists mainly of stripping whitespace, HTML tags, unreadable characters and so on. The extraction module is where the actual IE takes place: here the input data is analyzed, converted to tokens understandable by the ontology, and finally bound with semantic relationships.
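A minimal sketch of such a preprocessing step, using only the Python standard library, is given below. The function names `preprocess` and `tokenize` are hypothetical, and the cleaning rules (skip script and style content, drop control characters, collapse whitespace) are assumptions about what a typical preprocessor would do rather than a description of any particular system.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the content of script and style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def preprocess(html_text):
    """Strip tags, unreadable characters and redundant whitespace."""
    collector = _TextCollector()
    collector.feed(html_text)
    collector.close()
    text = " ".join(collector.parts)
    # Drop control and other non-printable characters, keep whitespace.
    text = "".join(
        ch for ch in text if ch.isspace() or unicodedata.category(ch)[0] != "C"
    )
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


raw = "<p>Price:&nbsp;<b>250,000</b>\x07 EUR</p><script>x=1</script>"
print(preprocess(raw))
```

The cleaned string is what an extraction module would receive; `tokenize` shows the simplest possible next step before tokens are matched against ontology concepts.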
The data produced by the extraction module then need to be transformed into a specific description logic language (currently usually OWL) in order to be saved in the knowledge base.
The result of the OBIE process is a knowledge base populated with data according to the specified ontology. When the process is complete, there is usually a front-end search engine which allows the user to query the database. Search engines that use semantic knowledge bases are common on the internet. The best-known example is www.hakia.com. Founded in 2004, it was supposed to be the next Google, with a semantic answer to indexing files. The founders boasted that for the first time ontological semantics would be used to enable a search engine to perceive concepts beyond words and retrieve results with meaningful equivalents. So far semantic search engines do not pose a threat to Google, but as Randall Stross stated in his article: "A growing number of entrepreneurs are placing their bets, however, on a hybrid system that puts humans back into the search equation."
Finally, it is worth showing all these elements in a working system. For this purpose I chose Wu and Weld's Kylin system, whose authors try to obtain structured data from Wikipedia. Kylin does not require a user-specified ontology; instead it uses a generator which searches for infoboxes with similar attributes across the whole of Wikipedia and on this basis tries to guess individuals and the relations between them. To generate the ontology, the Kylin Ontology Generator uses WordNet as a lexical database. The aim of the project is not to build a new semantic search engine, but to produce an ontology that will be used in developing a communal correction system for Wikipedia and could later be corrected by Wikipedia users. Such an ontology could later serve as a model representation of the real world.
A document is an abstract entity that has a variety of possible actual representations. Informally, the task of the document structuring process is to take the most “raw” representation and convert it to the representation through which the essence (i.e., the meaning) of the document surfaces. A review of tokenization methods has been published by Feldman and Sanger: