As of January, 2012, this site is no longer being updated, due to work and health issues

SearchTools.com: Background Topics

Meta Data and Search


Metadata is information about information: more precisely, it is structured information about resources. It can be as simple as an author's name or as complex as a geographic code or a controlled-vocabulary subject heading. Library catalogs are remote metadata, as are book reviews, summaries, and indexes to art collections. Some formats allow metadata to be embedded in the documents or records themselves: examples include HTML <meta> tags and Dublin Core tags, MP3 ID3 fields, Microsoft Office Properties, Adobe XMP (formerly XAP) data, and database keyword fields.
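As an illustration of embedded metadata, the following is a minimal sketch of how an indexer might read HTML <meta> tags using only the Python standard library. The sample page, the field values, and the names MetaTagParser and extract_metadata are invented for this example; they are not taken from any particular search engine.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects name/content pairs from <meta> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Keep every value: a field such as DC.Subject may repeat.
            self.metadata.setdefault(name.lower(), []).append(content)


def extract_metadata(html_text):
    """Return a dict mapping lowercased meta names to lists of values."""
    parser = MetaTagParser()
    parser.feed(html_text)
    return parser.metadata


# Hypothetical page, used only to exercise the parser.
SAMPLE_PAGE = """<html><head>
<title>Example page</title>
<meta name="author" content="A. Writer">
<meta name="keywords" content="metadata, search">
<meta name="DC.Date" content="2001-10-01">
</head><body><p>Body text.</p></body></html>"""

page_metadata = extract_metadata(SAMPLE_PAGE)
print(page_metadata)
```

Field names are lowercased here so that "DC.Date" and "dc.date" are treated as the same field; a real indexer might make a different choice.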

Metadata generally uses a more controlled vocabulary than body text and records the context in which a word is used, so it gives searchers more scope for locating useful information with good recall and precision. For example, metadata can indicate whether an article containing the name "Tim Berners-Lee" is by him or about him, which can often be valuable to searchers. Standard date formats allow for accurate date range searching and sorting. Metadata can also act as an interface between internal codes and human-readable names, letting people find information the way they want it, rather than the way applications manage it.
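The "by him or about him" distinction and date range searching can be sketched with a few records held in memory. This is only an illustration: the records, titles, and dates below are invented, and the field names creator, subject, and date are assumptions loosely modeled on Dublin Core.

```python
from datetime import date

# Invented sample records; none of these describe real documents.
RECORDS = [
    {"title": "Notes on web metadata", "creator": "Tim Berners-Lee",
     "subject": ["metadata"], "date": "1997-06-15"},
    {"title": "Profile of a web pioneer", "creator": "A. Reporter",
     "subject": ["Tim Berners-Lee"], "date": "2001-03-02"},
    {"title": "Interview on the early web", "creator": "B. Editor",
     "subject": ["Tim Berners-Lee", "history"], "date": "1999-11-20"},
]


def by_creator(records, name):
    """Documents written by the named person."""
    return [r for r in records if r["creator"] == name]


def about(records, name):
    """Documents whose subject field names the person."""
    return [r for r in records if name in r["subject"]]


def in_date_range(records, start, end):
    """Documents dated between start and end (inclusive), oldest first.

    Dates are ISO 8601 strings (YYYY-MM-DD), so they parse unambiguously.
    """
    start_day = date.fromisoformat(start)
    end_day = date.fromisoformat(end)
    hits = [r for r in records
            if start_day <= date.fromisoformat(r["date"]) <= end_day]
    return sorted(hits, key=lambda r: date.fromisoformat(r["date"]))


written_by = [r["title"] for r in by_creator(RECORDS, "Tim Berners-Lee")]
written_about = [r["title"] for r in about(RECORDS, "Tim Berners-Lee")]
recent = [r["title"] for r in in_date_range(RECORDS, "1999-01-01", "2001-12-31")]
print(written_by, written_about, recent)
```

A plain full-text search for the name would return all three records; the separate fields are what make the two narrower queries possible.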

Many content management and publishing systems provide metadata tools, which allow authors, editors, and librarians to add appropriate entries more easily, and use standard vocabulary and formatting. However, this is not yet a standard part of web publishing.
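One thing such a tool might do is map whatever an author types onto a preferred term and flag anything it does not recognize. The sketch below assumes a tiny hand-built vocabulary; the terms, the variants, and the function name normalize_keywords are all hypothetical.

```python
# Hypothetical controlled vocabulary: variant spelling -> preferred term.
CONTROLLED_VOCABULARY = {
    "metadata": "Metadata",
    "meta data": "Metadata",
    "meta-data": "Metadata",
    "search engine": "Search engines",
    "search engines": "Search engines",
}


def normalize_keywords(raw_keywords):
    """Split author-entered keywords into accepted terms and rejects.

    Accepted terms are returned in their preferred form, without
    duplicates; anything not in the vocabulary is returned unchanged
    in the second list so an editor can review it.
    """
    accepted, rejected = [], []
    for keyword in raw_keywords:
        # Ignore case and collapse repeated whitespace before lookup.
        key = " ".join(keyword.lower().split())
        preferred = CONTROLLED_VOCABULARY.get(key)
        if preferred is None:
            rejected.append(keyword)
        elif preferred not in accepted:
            accepted.append(preferred)
    return accepted, rejected


accepted_terms, rejected_terms = normalize_keywords(
    ["Meta Data", "metadata", "serach engine", "Search  Engines"]
)
print(accepted_terms, rejected_terms)
```

The misspelled entry is rejected rather than guessed at, which leaves the correction to a person.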

Metadata on most sites has significant limits: many documents lack metadata, few search engines recognize additional metadata fields or tags, there is no way to tell if the metadata is accurate, misspellings and typos are common, word meanings change over time, and choosing accurate keywords and categories is difficult. Editing and managing metadata requires a significant investment of resources.

All this makes metadata or keyword search engines, which search only the assigned keywords rather than the text of a document, much less useful than full-text search engines. The best of both worlds is a full-text search engine that also indexes metadata content and uses it to improve search rankings.
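A rough sketch of that combined approach follows: every field is searched, but matches in metadata fields count for more than matches in the body. The field names, the weights, and the sample documents are assumptions chosen for illustration, not the ranking formula of any actual search engine.

```python
import re

# Assumed weights: a match in a metadata field counts more than one in the body.
FIELD_WEIGHTS = {"body": 1.0, "keywords": 3.0, "title": 5.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(document, query):
    """Sum weighted term counts over the body and the metadata fields."""
    query_terms = tokenize(query)
    total = 0.0
    for field_name, weight in FIELD_WEIGHTS.items():
        tokens = tokenize(document.get(field_name, ""))
        for term in query_terms:
            total += weight * tokens.count(term)
    return total


def search(documents, query):
    """Return ids of matching documents, best score first."""
    scored = [(score(doc, query), doc["id"]) for doc in documents]
    hits = [(s, doc_id) for s, doc_id in scored if s > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in hits]


# Invented sample documents.
DOCUMENTS = [
    {"id": "a", "title": "Site news", "keywords": "",
     "body": "We mention metadata once in passing."},
    {"id": "b", "title": "Metadata guide", "keywords": "metadata, dublin core",
     "body": "How to add metadata to pages."},
    {"id": "c", "title": "Contact us", "keywords": "",
     "body": "Phone and email."},
]

results = search(DOCUMENTS, "metadata")
print(results)
```

Document "a" is still found because its body is indexed, but document "b" ranks first because its title and keywords also match.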


Resources

Faceted Metadata Search and Browse SearchTools Report
A new approach to search and browse using extensive metadata.
 
Controlled Vocabulary.com
Makes a strong case for using a standard listing of subject terms for best retrieval. Includes listings of books and resources.
 
Digital Libraries: Metadata Resources
Beautiful and complete listing of resources, including SOIF and RDM.
 
UKOLN Metadata
Good links and information about Metadata in general and more from the UK Office for Library and Information Networking.
 
Meta Matters
A service of the National Library of Australia, providing information on metadata and Dublin Core.
 
XML.com Metadata
Metadata from an XML point of view, including RDF.
 
Introduction to Metadata Getty Research Institute, second edition, July 2001
Articles on topics related to metadata cataloging and retrieval on the web.
 
Adobe XMP
This eXtensible Metadata Platform (previously called XAP) is an XML framework for Adobe products that lets them store metadata about a document inside the binary file itself. The SDK, released under an open-source license, allows other applications to access the metadata without having to parse the file contents. The metadata, based on RDF, describes the contents of the file using standards including Dublin Core tags, asset management, rights management, and custom schemas. As of October 2001, no search engines can read this information, but we expect this to change in the near future.

Articles & Books

Faceted Classification of Information The Knowledge Management Connection, May 2004
A nice look at the basics of faceted classification, which results in metadata. This article contrasts traditional library systems based on Ranganathan with business and organizational knowledge classification, for an enterprise intranet or KM system.
 
Metadata: What is it and Why is it Important Biosafety Clearing-House, January 30 2002
A good general introduction, describing many benefits of metadata. Makes the points that it preserves the internal investment in the data, promotes data sharing, and improves interoperability. Provides an example of the use of Dublin Core and RDF in biology.

Why metadata is important
New Thinking Newsletter, October 1, 2001 by Gerry McGovern
A plea for implementing metadata, mainly to help people find content in the "torturous mazes" of intranets.
 
Metadata in a nutshell Information Europe, Summer 2001 by Michael Day
Helpful short introduction.
 
Using Metadata to Improve Local Searching Exploit Interactive; April 2000 by Brian Kelly
Using Dublin Core metadata to search by issue number, type of article and funding agencies.
 
Metadata: Cataloging by Any Other Name Online Magazine; January 1999 by Jessica Milstead and Susan Feldman
A librarian-oriented article on the value of metadata as consistent descriptions of information objects. Discusses text and multimedia objects, formats of metadata, standards and software. Describes how search engines can use appropriate metadata to improve relevance results.
 
A new dawn New Scientist, 30 May 1998 by Glyn Moody (subscription required)
Describes the potential for XML, searching, XSL style sheets, XLink link extensions, RDF metadata, Dublin Core, and more.
 
Metadata: An Overview Paper given by Dr. Warwick Cathro, Assistant Director-General, Services to Libraries Division at the Standards Australia Seminar, "Matching Discovery and Recovery" August 1997.
Describes the concepts and background of metadata as a tool to improve the retrieval of information, especially for web-wide search tools. Covers in some detail the Warwick Framework and the Dublin Core efforts to provide standard tags and other metadata elements.
 
Who Will Master Metadata? Network Computing Online, by Christine Hudgins-Bonafield.
Business-oriented analysis of how software developers such as Infoseek, Bunyip, IBM and Microsoft are working with librarians to set metadata standards.
 
Web architecture: Metadata Tim Berners-Lee, 1997
The inventor of the World Wide Web was thinking about the value of metadata in finding resources, providing self-describing information.

And, for another point of view:

Metacrap: Putting the torch to seven straw-men of the meta-utopia May - August 2001, by Cory Doctorow
Considers the problems of creating consistent, high-quality metadata to be insurmountable. These are defined as: lies, laziness, stupidity, lack of self-knowledge, schema bias, metrics bias, and language issues.

Topic Maps and Semantic Webs

Rather than viewing information as a chaotic mass, researchers are looking for systems to provide meaningful organization. Topic maps started as a way to provide back-of-the-book style indexing for complex documentation. They allow people to analyze the data in new ways, to discover previously hidden relationships, and even to "wander at leisure through a multidimensional data space".

TopicMaps
Topic Maps are collections of topics, associations and optional scope defining concept relationships in hierarchical or non-hierarchical layouts. They describe knowledge structures and associate them with information resources such as documents or database records. This site aims to provide a starting point for topic map information.
 
The TAO of Topic Maps XML Europe 2000 by Steve Pepper
Explains the value of topic maps, compared to indexes and glossaries, as tools for codifying knowledge. Describes topic subjects, types, and names; occurrences (information resources such as documents or records); and associations (relationships between topics).
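The three building blocks described in these entries (topics, occurrences, and associations) can be sketched as a small in-memory structure. This is a simplified illustration, not an implementation of the topic map standard or its XML syntax; the class names, the sample topics, and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    name: str
    topic_type: str
    # Occurrences: information resources (documents, records) about the topic.
    occurrences: list = field(default_factory=list)


@dataclass(frozen=True)
class Association:
    association_type: str
    members: tuple  # names of the topics that take part in the relationship


class TopicMap:
    """A toy topic map: topics plus typed associations between them."""

    def __init__(self):
        self.topics = {}
        self.associations = []

    def add_topic(self, name, topic_type, occurrences=()):
        self.topics[name] = Topic(name, topic_type, list(occurrences))

    def associate(self, association_type, *names):
        missing = [n for n in names if n not in self.topics]
        if missing:
            raise KeyError("unknown topics: %s" % ", ".join(missing))
        self.associations.append(Association(association_type, tuple(names)))

    def related(self, name):
        """Names of all topics that share an association with the given one."""
        found = set()
        for association in self.associations:
            if name in association.members:
                found.update(m for m in association.members if m != name)
        return sorted(found)


# Hypothetical content, for illustration only.
topic_map = TopicMap()
topic_map.add_topic("Metadata", "concept", ["intro.html"])
topic_map.add_topic("Dublin Core", "standard", ["dc-guide.html", "dc-faq.html"])
topic_map.add_topic("RDF", "standard", ["rdf-primer.html"])
topic_map.associate("is-a-kind-of", "Dublin Core", "Metadata")
topic_map.associate("can-be-expressed-in", "Dublin Core", "RDF")

print(topic_map.related("Dublin Core"))
print(topic_map.topics["Dublin Core"].occurrences)
```

Because the associations live outside the documents, the same set of files can be browsed by following relationships as well as by searching their text.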


Dublin Core and other Web Metadata Standards

Dublin Core is a set of 15 generic metadata elements, often written as XML tags or HTML <meta> tags, for tracking and cataloging web pages. These are: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier (URL), Source, Language, Relation, Coverage and Rights (copyright information).

As of October 2001, the element set has been approved by the National Information Standards Organization (NISO) and the American National Standards Institute (ANSI) as standard Z39.85.
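As a sketch of how these elements can appear in a web page, the function below writes HTML <meta> tags using the common "DC." name prefix. The function name and the sample values are assumptions made for this example, and only a simple, unqualified form of each element is shown.

```python
from html import escape

# The 15 element names listed above.
DUBLIN_CORE_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher", "Contributor",
    "Date", "Type", "Format", "Identifier", "Source", "Language",
    "Relation", "Coverage", "Rights",
)


def dublin_core_meta_tags(values):
    """Build <meta> tag strings from a dict of element name -> value(s).

    A value may be a single string or a list of strings, since elements
    such as Subject are often repeated.
    """
    unknown = sorted(set(values) - set(DUBLIN_CORE_ELEMENTS))
    if unknown:
        raise ValueError("not Dublin Core elements: %s" % ", ".join(unknown))

    lines = []
    for element in DUBLIN_CORE_ELEMENTS:
        if element not in values:
            continue
        entries = values[element]
        if isinstance(entries, str):
            entries = [entries]
        for entry in entries:
            # Escape quotes and angle brackets so the attribute stays valid.
            lines.append('<meta name="DC.%s" content="%s">'
                         % (element, escape(entry, quote=True)))
    return lines


# Sample values, chosen for illustration.
tags = dublin_core_meta_tags({
    "Title": "Meta Data and Search",
    "Subject": ["metadata", "search"],
    "Date": "2003-05-29",
    "Language": "en",
})
print("\n".join(tags))
```

Tags are emitted in the order of the element list rather than the order of the input, which keeps the output consistent from page to page.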

Dublin Core
Basic information on the Dublin Core Metadata Initiative.
 
Dublin Core Recommended Qualifiers
A standard for additional attributes of the main tags, providing consistency and refinement. For example, it standardizes date formats and document languages, and allows alternate titles, standard subject schemes such as the Library of Congress and Dewey Decimal systems, and more.
 
Using Dublin Core
A helpful guide to creating DC tags, with examples and advice.

General Adoption of Dublin Core Metadata


RDF (Resource Description Framework)

A W3C standard that provides an infrastructure for metadata, including content rating, security, authentication, "push", EDI, Dublin Core and more.
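RDF describes resources as statements of the form subject, predicate, object. The sketch below models a few such statements as plain Python tuples rather than using an RDF library or the RDF/XML syntax. The page URL is a made-up example.org address and the match helper is hypothetical; the predicate URIs use the Dublin Core element namespace.

```python
# Dublin Core element namespace, used here to name the predicates.
DC = "http://purl.org/dc/elements/1.1/"

# Made-up resource URL, for illustration only.
PAGE = "http://example.org/metadata.html"

# Each statement is a (subject, predicate, object) triple.
TRIPLES = [
    (PAGE, DC + "title", "Meta Data and Search"),
    (PAGE, DC + "language", "en"),
    (PAGE, DC + "subject", "metadata"),
    (PAGE, DC + "subject", "search"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that is not None."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


subjects = [t[2] for t in match(TRIPLES, subject=PAGE, predicate=DC + "subject")]
pages_in_english = [t[0] for t in match(TRIPLES, predicate=DC + "language", obj="en")]
print(subjects, pages_in_english)
```

Because every statement has the same shape, metadata from different vocabularies can sit in one collection and be queried the same way.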


Metadata Helper Applications: Tagging and Entity Extraction

ClearForest
Entity identification and extraction using statistical and structural analysis, semantics, and linguistic patterns based on domain-specific "rulebooks" for fields such as health, law, or financial services.
Report: Unstructured Information Management Report Infosphere, March, 2003 by Magnus Stensmo and Mikael Thorson, $325/€295 for a single PDF license  
In the context of UIM, ClearForest has significant value, providing excellent information extraction and visualization. However, it is limited to a language-dependent set of pattern matches.
 
Report: Unstructured Data Management: the elephant in the corner (guest or customer access required) the451 Report, November 2002 by Nick Patience and Rachel Chalmers
Discusses the business aspects of the company, competition, markets (government and pharmaceuticals), and the technology. Finds the market solid but recommends improving integration with portals.
Report: Searching for Value in Search Technology (subscribers only) Gilbane Report Vol 10, Num 7, September 2002 by Sebastian Holt
Highlights innovative search engine companies, including ClearForest for its entity extraction and interactive graphical interface.
 
Entrieva (formerly Semio)
Data mining software that uses linguistic analysis and rules to extract concepts from textual information, and displays the concepts and relationships in a 3D map. Creates taxonomies based on fitting content into existing categories in the fields of defense, drugs, health care and technology.
Report: Unstructured Data Management: the elephant in the corner (guest or customer access required) the451 Report, November 2002 by Nick Patience and Rachel Chalmers
Discusses the business aspects of the company, pricing, markets, competition and the technology. The recent acquisition is a cause for concern, but also an opportunity to create more standard APIs and support Web Services.

Interwoven MetaTagger
Works with the Interwoven content management system or independently to add structure to free-text content, according to categories set up within corporate taxonomies.
 
Inxight Categorizer (link to SearchTools Listing Report)
Using linguistic and statistical technology from parent company Xerox PARC, Inxight's categorizer automatically classifies content and organizes it by subject. Identifies entities such as people, places, companies and products. Can integrate with personalization tools to build individual categories. Can display results in a Star Tree visualization or a more traditional text list. A taxonomy manager application provides interactive control for editors; training sets and specific rules also apply.
Report: Unstructured Data Management: the elephant in the corner (guest or customer access required) the451 Report, November 2002 by Nick Patience and Rachel Chalmers
Discusses the business aspects of the product, sales channels, markets (mainly enterprise, publishers, government, OEMs), competition and the technology. Considers this to be the best of the categorization and visualization tools.

Metabot from Tetranet
Allows editors to view, generate and control metadata for hundreds of HTML documents in a site.
 
MetaEdit from DSTC
Graphic user interface for adding and editing metadata in a HotMeta repository.
 
Metadata Creator from Applied Systems
Extracts key words and phrases according to the enterprise controlled vocabulary.
 
Mcatis
Meta tag generator that includes Dublin Core tags. Works on Windows 9x, NT, and so on.
 
TagGen from HiSoftware
Supports Dublin Core extensively, with a graphic user interface. Spellchecks and provides search engine optimization advice.
 
Voquette Content Enhancement Engine
Automatically extracts descriptive text and tags it as metadata, and also assigns metadata based on a knowledge base. Standardizes content from disparate sources, both structured and unstructured text. The company has a patent for a semantic web system of tagging and searching.
 
Watchfire Metadata Management System for Enterprise
Gathers documents using disk scanning and spidering, then analyzes them and generates keyword tags. Can insert these metatags into large collections of files automatically, using standard formats and controlled vocabulary.
Page Updated 2003-05-29