Avoiding Information Overload: Knowledge Management on the Internet

Author: Dr Adam Bostock of Acro Logic June 2002

TSW 02-02



1 Executive Summary

2 The Technologies

3 Technology Watch Issue

4 Technical Overview

4.1 Search Engines

4.2 Web Browsers: Searching and Saving Individual Pages

4.3 Knowledge Management Systems and Agents

4.4 Knowledge Representation

4.5 XML

5 Developments

5.1 XML Search Engines

5.2 Web Browsers

5.3 Knowledge Management Systems and Agents

5.4 Metadata

5.5 XML Extensions and Applications

5.6 Knowledge Application Tools

6 Assessment

7 References

7.1 Agents

7.2 Applications and Projects (Miscellaneous)

7.3 Knowledge Management Technologies and Products

7.4 Metadata

7.5 Metadata (Dublin Core)

7.6 Metadata (Miscellaneous)

7.7 Metadata (RDF - Resource Description Framework)

7.8 Ontology

7.9 Search Engines and Directories

7.10 Semantic Web

7.11 Web Services

7.12 XML

8 Glossary

9 Appendix A - Example of XML

10 Appendix B - Developing Your Web Site to Generate XML

10.1 Converting an Existing Database Driven System

10.2 Starting from Scratch

10.3 Build Your Own XML Portal

10.4 Benefits

11 Appendix C - Example Use of Namespaces

12 Appendix D - Information Presentation Issues

13 Appendix E - Metadata and Information Extraction

14 Appendix F - Internet Directories

15 Appendix G - Search Engines

  1. Executive Summary

Keywords: search, knowledge management, XML, metadata, RDF, ontology, agent

It is estimated that there are over two billion Web pages, and thousands of newsgroups and forums, on the Internet - covering virtually every topic imaginable. However, many users find that searching the Internet can be a time consuming and tedious process. Even experienced searchers sometimes run into difficulties. To fully benefit from the potential opportunities of the Internet, both Web site developers and users need to be aware of the tools and techniques for managing and retrieving online knowledge.

This has driven the development of improved search and information retrieval systems. However, we now need sophisticated information extraction (and/or summary) capabilities to present the user only with the information they need, rather than a large set of relevant documents to read.

Search service providers, Web portals, and amalgamations of community Web sites could all help their users to benefit today, just by adopting the current generation of knowledge management systems, particularly those with effective information extraction capabilities.

Metadata has a very useful role to play, but it has limitations with regard to information extraction.

One of the key opportunities of the XML initiative is to allow structure and (indirectly) "meaning" to be embedded into the content of the resource itself. XML provides the much needed data structure for computer-to-computer interaction. The availability of good user-friendly, and "intelligent", tools will be critical in persuading the wider community to adopt XML as an alternative to HTML.

It is probably reasonable to state that the current generation of knowledge management systems is an interim measure, to be superseded by AI systems in the long-term. Such systems will probably be able to process natural language and XML encoded content.

The success of Internet based knowledge management, and the Semantic Web, will require the development and integration of various data standards, ontology definitions, and knowledge management and agent technologies. It will take a concerted and significant effort to get there. The likely longer-term benefits are much more effective Internet searches and smart information extraction services, which present the user with concise relevant extracts.

In the meantime, perhaps we should also think about how authors represent knowledge and present information, and how users apply knowledge, in a more structured and meaningful way.

This report includes a glossary, reference section and appendices.

A concise, interactive, XML version of this report, plus extra features, can be found here:
Acro-Report (www.acrologic.co.uk/cgi-bin/src.pl?r=ikm-16.446)

  2. The Technologies

In the broadest sense, knowledge management encompasses every aspect of creating, maintaining, organising, classifying, representing, storing, querying, retrieving, analysing and presenting knowledge. It encompasses people, procedures, processes, policies and technologies, and an organisation's level of success in knowledge management is limited by the weakest link in that chain. In this report we focus primarily on the technology.

This report provides an overview of the issues, and of the tools and techniques available for managing and finding knowledge on the Internet. The aim is to avoid adding to the stress and frustration felt by Internet users, and to enable them to acquire knowledge from the Internet quickly. The resulting benefits apply to all types of Internet user. The report was written for a diverse audience, so jargon, purist definitions and detailed technical aspects have been kept to a minimum (though the references contain more detail).

A popular method for finding Internet resources is through search engines and directories. Directories allow a user to manually browse a hierarchy of categories to find appropriate Web sites. Search engines take a user's query and automatically search their database to return matching results.

Knowledge management (KM) systems and agents are two distinct topics but increasingly within this context they can be observed working together. These systems have more functionality than search engines, and some systems can match concepts derived from unstructured data.

For computer systems to communicate directly with each other, standard data formats are required. XML allows standard languages and data formats to be developed. Coupled with an appropriate ontology, systems can perform useful functions on the data they exchange with each other. KM systems can be enhanced to utilise XML to provide improved capabilities.

  3. Technology Watch Issue

The key issues now for knowledge management and searching on the Internet are:

  1. The volume of information on the Internet is such that it is not feasible to manually search for and retrieve all relevant sources of quality knowledge on a given topic

  2. Users would probably not have enough time to read all the relevant documents on the Internet

  3. Point 2 therefore calls for information extraction, so that users are presented with only the relevant parts of a document

  4. Knowledge management systems and search agents are needed to assist with the above

  5. HTML Web pages contain unstructured data, with no computer understandable "meaning"

  6. XML and ontologies provide an opportunity to address issue 5

  7. A critical mass of participants within each community needs to co-operate to develop acceptable and workable standards for that community. Work is needed not just on metadata but also on representing document content, and the corresponding knowledge, much more effectively.

  4. Technical Overview

    4.1 Search Engines

It is assumed readers are familiar with directories and search engines (if not, see Appendices F and G). Useful though they are, current search engines have limitations. The classification and searching algorithms are automated and lack intelligence. In particular, search engines have been vulnerable to abuse from "spammers", who claim to offer popular products or services in order to attract visitors to their site under false pretences. In response, the algorithms and rules of search engines have been changed frequently to try to ensure relevant (non-spam) results for searchers. This means that Search Engine Optimisation (SEO) practices have also changed quickly, and what used to be good or acceptable practice can, in some cases, now be detrimental to your Web site's rating on a search engine. [References]

This issue has also resulted in the role of metadata in Web pages being downgraded. Metadata may be used to describe a Web page, e.g. through a description and keywords. A few years ago metadata, particularly keywords, was used to determine how well a page matched a query. However, the spammers abused this, and so search algorithms were modified to play down the role of metadata.
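
As an illustration of the description and keywords metadata mentioned above, the following minimal Python sketch reads the meta tags from a page using only the standard library. The sample page and the class name MetaTagParser are invented for this example and do not come from any particular search engine.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Hypothetical page, invented for illustration.
SAMPLE_PAGE = """<html><head>
<title>Green Spheres Ltd</title>
<meta name="description" content="Suppliers of green spheres.">
<meta name="keywords" content="green, sphere, ball">
</head><body><p>Welcome.</p></body></html>"""

parser = MetaTagParser()
parser.feed(SAMPLE_PAGE)
page_meta = parser.meta
keywords = [k.strip() for k in page_meta["keywords"].split(",")]
```

Because the author of a page controls these values entirely, nothing prevents the keywords from misdescribing the page, which is why they are easy to abuse.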

      4.1.1 The Invisible Web

Even a determined search engine that attempts to index every Web page will not actually be able to find, or extract, all of the content potentially available to Web users. The reasons for this include:

  • Web pages that have no links to them, and are not explicitly submitted to search engines

  • Web pages that reside in password protected areas of a Web site

  • Content stored in an unsupported format (e.g. images, animations)

  • Dynamic Web pages generated from a database in response to a user's actions or queries.

The last scenario is the one to which people often refer when they talk about the "invisible Web". Examples of dynamically generated pages are the results returned by a search engine, whereby the content is tailored to the specific query of a user. There are many sites that use this technique, and consequently a significant volume of knowledge may reside in the invisible Web.

There are relatively simple ways of making some of this hidden knowledge available to search engines:

1) A representative set of static pages can be extracted from the database and linked to in the normal way. This approach may be used to demonstrate a sample of what is available.

2) One or more Web pages containing index listings of the data can be derived from a selected set of database records. The index listing could contain links to dynamically generated pages. Search engines that index dynamic pages then have a link to follow, which reveals some of the underlying data in the database. For complex systems this approach may not be feasible.

3) Generate static Web indexes that contain metadata on a selected set of records. Caution: remember the spammers and how they abuse metadata? Well, because this technique can be used to deceive search engines you may be automatically penalised (see: Cloaking).
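
Approach 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the script address /cgi-bin/item.pl and the function name build_index_page are hypothetical, and a real site would read the records from its own database.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical records standing in for rows selected from the site's database.
RECORDS = [
    {"id": 101, "title": "Green sphere, 10 cm"},
    {"id": 102, "title": "Red cube & stand"},
]

# Hypothetical address of the script that generates each dynamic page.
DYNAMIC_PAGE = "/cgi-bin/item.pl"


def build_index_page(records, base_url=DYNAMIC_PAGE):
    """Return a static HTML page that links to one dynamic page per record."""
    lines = ["<html><head><title>Catalogue index</title></head><body>", "<ul>"]
    for record in records:
        href = base_url + "?" + urlencode({"id": record["id"]})
        lines.append(
            '<li><a href="{}">{}</a></li>'.format(escape(href), escape(record["title"]))
        )
    lines += ["</ul>", "</body></html>"]
    return "\n".join(lines)


index_html = build_index_page(RECORDS)
```

The resulting page can be saved as an ordinary static file and linked to from the rest of the site, giving a search engine a path into the database-driven pages.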

      4.1.2 On-site Searching

Some Web sites are very large, or contain hidden data, and so they provide an on-site search facility either via third party facilities, public search engines (with the search restricted to their site), or their own search system. (Many have an irritatingly small field in which to type your query.)

However, you may want to bear in mind that search systems (and their users) are not perfect. If a user searches for something provided on your site but they mistype it or use a different set of words to describe the object (e.g. "green ball" instead of "green sphere") the search will probably fail. The user may then assume that your site does not have any information on the object and leave the site. Some commercial Web site owners have reported sales going up when they have removed their on-site search feature! This may be because the failed search scenario has been removed, plus the user has to browse the site to find what they want. Whilst browsing they may find what they want, something similar, or something unexpected yet desirable.
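
One way to reduce the failed search scenario is to normalise the query before matching it. The Python sketch below is an assumption-laden illustration rather than a description of any real on-site search product: the vocabulary, synonym table and pages are invented, and fuzzy matching uses get_close_matches from the standard difflib module.

```python
import difflib

# Hypothetical vocabulary and synonym table for a small product site.
SITE_TERMS = ["green", "sphere", "red", "cube"]
SYNONYMS = {"ball": "sphere", "globe": "sphere", "box": "cube"}

# Hypothetical pages, each described by a few words.
PAGES = {
    "/products/green-sphere": "green sphere",
    "/products/red-cube": "red cube",
}


def normalise_query(query):
    """Map each query word to a site term via synonyms, then fuzzy matching."""
    terms = []
    for word in query.lower().split():
        word = SYNONYMS.get(word, word)
        if word not in SITE_TERMS:
            # Tolerate small typing errors by picking the closest known term.
            close = difflib.get_close_matches(word, SITE_TERMS, n=1, cutoff=0.75)
            if close:
                word = close[0]
        terms.append(word)
    return terms


def search(query):
    """Return pages whose description contains every normalised query term."""
    terms = normalise_query(query)
    return [url for url, text in PAGES.items() if all(t in text.split() for t in terms)]
```

With these tables, a search for "green ball" or the mistyped "gren sphre" both lead to the green sphere page, whereas a plain word match would return nothing.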

      4.1.3 Meta Search Engines

Until relatively recently no search engine came close to indexing all the pages on the Web, and each search engine had a different set of indexed pages (with some degree of overlap of course). Therefore, in order to conduct a comprehensive search of the Web a user would need to perform similar searches on a number of major search engines. Clearly this process can be cumbersome. This is where meta search engines come in. They take one query from the user and automatically pass that on to several search engines. The meta search engine then presents the combined results to the user.
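
The process described above can be sketched as follows. The two engine functions are hypothetical stand-ins that return fixed results, and combining results by summing reciprocal ranks is just one simple choice made for this example, not necessarily the method any real meta search engine uses.

```python
from collections import defaultdict


# Hypothetical stand-ins for real search engines: each takes a query and
# returns a ranked list of URLs. A real system would call each engine over
# the network instead.
def engine_a(query):
    return ["http://example.org/a", "http://example.org/b", "http://example.org/c"]


def engine_b(query):
    return ["http://example.org/b", "http://example.org/d"]


def meta_search(query, engines):
    """Send one query to every engine and merge the ranked results."""
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            # A URL earns more credit the higher an engine ranks it, and
            # credit accumulates when several engines return the same URL.
            scores[url] += 1.0 / rank
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


combined = meta_search("knowledge management", [engine_a, engine_b])
```

Each URL appears once in the combined list, and a page returned by both engines moves ahead of pages returned by only one.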

      4.1.4 Integrated Search and Directory Features

Users may be confused as to whether a particular site is a search engine or a directory because it may appear to offer both features. A directory will probably provide the option to browse by category and the option to search. The search feature looks for a match in the list of descriptions for each Web site in the directory. Conversely, a search engine may also offer a directory hierarchy for the user to browse through. The situation gets even more blurred today because some directories now extend their search features by calling upon the capabilities of a third party's search engine. For example, a search on the directory Yahoo! may return results generated by the Google search engine.

      4.1.5 Other Types of Internet Resource

It is worth remembering that the Internet contains more than (HTML) Web pages. The Google search engine has been particularly proactive in recognising this and provides search results that refer to Newsgroups, Adobe Acrobat (PDF) files, and Office documents (Word and PowerPoint).

      4.1.6 Other Methods of Finding Information

Other methods exist for finding information: asking someone (in the real world, or in one of the many online forums and newsgroups); reading Web logs (blogs), which are Web sites where the author maintains a log of events relating to their own interests or expertise; and using the technologies described later in this report.

    4.2 Web Browsers: Searching and Saving Individual Pages

      4.2.1 Searching within a Page

Once a Web browser has loaded and displayed a page, the user may want to conduct a search of the text on that page, particularly if the page is very long. Most Web browsers provide some kind of search or find capability. However, a well designed and structured page should reduce the need for such actions, e.g. start with a contents list. Information presentation and Web design are key aspects of effective information retrieval, but are beyond the scope of this report.

      4.2.2 Saving Pages

A Web browser will usually allow you to save a Web page to your computer. This saves one file for the HTML code, and additional files for the images. However, it is also useful to record where the Web page came from, e.g. by inserting a line that displays the source URI at the top of the saved page. Alternatively, you may want to use a database to record metadata on the page you have saved, including the saved file path and name, and the corresponding URI of the source. Most materials are protected by copyright, so your metadata database could record a link to the original page.
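
A small metadata database of this kind could look like the Python sketch below, which uses the standard sqlite3 module. The table layout, function names, URI and file path are all assumptions made for illustration.

```python
import sqlite3
from datetime import date

# An in-memory database keeps the example self-contained; pass a file
# path instead to keep the records between sessions.
connection = sqlite3.connect(":memory:")
connection.execute(
    """CREATE TABLE saved_pages (
           source_uri TEXT PRIMARY KEY,
           file_path  TEXT NOT NULL,
           title      TEXT,
           saved_on   TEXT NOT NULL
       )"""
)


def record_saved_page(conn, source_uri, file_path, title, saved_on=None):
    """Store where a saved page came from and where the local copy lives."""
    saved_on = saved_on or date.today().isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO saved_pages VALUES (?, ?, ?, ?)",
        (source_uri, file_path, title, saved_on),
    )
    conn.commit()


def find_source(conn, file_path):
    """Return the source URI for a local file, or None if it is not recorded."""
    row = conn.execute(
        "SELECT source_uri FROM saved_pages WHERE file_path = ?", (file_path,)
    ).fetchone()
    return row[0] if row else None


# Hypothetical URI and path, for illustration only.
record_saved_page(connection, "http://www.example.org/report.html",
                  "saved/report.html", "Example report", saved_on="2002-06-01")
```

Recording the date alongside the source URI also makes it easier to judge whether a local copy may have become outdated.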


Another method for "saving" a Web page is to bookmark it, or add it to your favourites list. This simply records the URL of the Internet resource, rather than copying the content itself. Most browsers support this function. For those who have relatively fast Internet connections this may be a better option than saving, as it avoids both copyright issues and the risk of relying on a local copy that has become outdated. However, the potential downside is that many Web pages eventually move to a new URL, or are deleted. (It would be nice if more Web site owners provided a link to the new address of the resource or some type of help, rather than just reporting Error 404 - page not found.)

    4.3 Knowledge Management Systems and Agents

      4.3.1 Introduction to Knowledge Management Systems

The technology underlying knowledge management (KM) systems should ideally provide support for all aspects of knowledge management (see section 2, The Technologies). KM systems can automatically search for, retrieve, organise and classify information. Some have the ability to:

  • Extract relevant content from a document or page

  • Summarise a document

  • Automatically classify, cluster and match documents by concept.

The system administrator typically has control over which resources the system has access to (Intranet, Internet, documents, databases, etc.). Because the system acquires knowledge autonomously, its demand on busy resources (e.g. bandwidth) may become excessive. However, the administrator should have the option to control its level of activity and/or schedule activities for a more suitable time of day.

The system can retrieve documents relevant to a user's explicit search request, or those matching concepts in a selected document, or those matching a user's interests. However, given lots of potential matches from the billions of pages on the Web we need even more help. This is where information extraction can help, by presenting users with only the relevant extracts from documents. Extraction technology pulls information out of a variety of document formats. It may aggregate content into a single location, and translate the text in documents into a structured format such as XML and/or database records. Document summarisation also represents a potentially useful feature, depending on its accuracy.
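
To make the idea of extraction into a structured format concrete, here is a deliberately simple Python sketch. It keeps only the sentences that mention a query term and wraps them in XML. Plain keyword matching is used here as a stand-in for the far more capable concept matching of commercial KM systems, and the element names, sample text and source URI are invented for the example.

```python
import re
import xml.etree.ElementTree as ET


def extract_relevant(text, query_terms, source):
    """Keep only sentences that mention a query term and wrap them in XML."""
    terms = {t.lower() for t in query_terms}
    root = ET.Element("extracts", {"source": source})
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if terms & words:
            ET.SubElement(root, "extract").text = sentence
    return root


# Hypothetical document text and source URI.
DOCUMENT = (
    "XML allows standard data formats to be developed. "
    "The weather was fine. "
    "An ontology gives shared meaning to XML data."
)
result = extract_relevant(DOCUMENT, ["xml"], "http://www.example.org/doc")
xml_text = ET.tostring(result, encoding="unicode")
```

The user is shown two short extracts instead of the whole document, and because the output is XML another program can process the extracts further.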

There are several knowledge management products available. These products may offer one or more of the following features: