As of January 2012, this site is no longer being updated, due to work and health issues.

Guide to Search Tools

File Format Parsing:

Accessing, Extracting and Converting Binary Content for Indexing

What is File Format Parsing?

Search engines need text to index. This may seem obvious, but the devil is in the details. Extracting text is easy when working with plain text, HTML or XML files, but much more difficult for binary files, including MS Office and archive formats. So search indexers need to use file format parsers, also called "filters". These read the binary file formats, extract the text and keep track of whatever structure is there. Some file parsers are better than others, and all of them may need updating: when Microsoft switched from its proprietary binary formats to XML-based ones (.doc -> .docx), search indexers needed updated filters to read the new formats.
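
To make the idea concrete, here is a minimal sketch of what a filter for the newer XML-based Word format might do, using only the Python standard library. It assumes that a .docx file is a zip package with the main text in word/document.xml, and it ignores headers, footnotes, tables of properties and everything else a real filter would handle. The function names docx_paragraphs and make_sample_docx are made up for this illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace used by WordprocessingML elements inside a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in the main document part.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        xml_bytes = package.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> run elements.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


def make_sample_docx(texts):
    """Build a tiny in-memory package holding only the document part (demo only)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in texts
    )
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    sample = make_sample_docx(["Search engines need text.", "Filters extract it."])
    for line in docx_paragraphs(sample):
        print(line)
```

The older binary .doc format has no such shortcut, which is why indexers rely on dedicated parsers for it.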

For historical notes, technical details, and some interesting context, see the article Where Have All the Filters Gone? by Mark Bennett (June 2007).

Commercial File Parsers

On the commercial side of things, there are two main packages which are included with almost every enterprise search system, and Microsoft also has one that it uses on Windows. I believe that these packages have active development and support. They have both input and output APIs, so customers can create additional custom file format parsers. I don't expect much change in these packages.

Tika: Open Source File Parser Framework

As of March, 2009, there's a new open source filter framework: Tika, the Lucene open source project for calling format parsers and returning the result as XHTML. It's a well-designed standardized interface, making use of existing open source file parsers, including Apache POI for Microsoft Office documents (old and new), and PDFBox for Adobe Acrobat-type files. There are a dozen other formats already supported, and the API makes it easy to add a custom file parser without having to write any special code in the indexer.

But Tika is not limited to Lucene and related projects. Because it's open source, any search engine indexer can use it to access file parsers (within the limits of the Apache license). This simplifies everyone's life considerably, and creates a framework for open-source file parsers that is stable and documented. People can work on improving the code of the file parsers, or write their own and know that it will be compatible. There are other open-source file parsers, but the Tika framework and toolkit are likely to be dominant as long as they keep working.
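
The general shape of such a framework can be sketched in a few lines of Python. This is not Tika's actual API (Tika is a Java library); it is a hypothetical illustration of the idea described above: parsers register themselves by file type, the indexer calls a single entry point, and the result always comes back as XHTML. All names here (PARSERS, register, to_xhtml) are invented for the example.

```python
import os
from html import escape

# Hypothetical registry: file extension -> function(bytes) -> list of paragraphs.
PARSERS = {}


def register(*extensions):
    """Decorator that registers a parser function for one or more extensions."""
    def decorator(func):
        for ext in extensions:
            PARSERS[ext.lower()] = func
        return func
    return decorator


@register(".txt")
def parse_plain_text(data):
    """Split UTF-8 text into paragraphs on blank lines."""
    text = data.decode("utf-8", errors="replace")
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def to_xhtml(filename, data):
    """Single entry point for the indexer: choose a parser, return XHTML."""
    ext = os.path.splitext(filename)[1].lower()
    parser = PARSERS.get(ext)
    if parser is None:
        raise ValueError(f"no parser registered for {ext!r}")
    body = "".join(f"<p>{escape(p)}</p>" for p in parser(data))
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml">'
        f"<head><title>{escape(filename)}</title></head>"
        f"<body>{body}</body></html>"
    )


if __name__ == "__main__":
    print(to_xhtml("notes.txt", b"First paragraph.\n\nSecond & last."))
```

Adding support for a new format means writing one more function under @register; the indexer code that calls to_xhtml does not change, which is the point of having a standard interface.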

Other Open Source File Parsers

Charlie Hull directed me to the file converters listed in the Omega Search Engine overview, part of the Xapian Search project. Presumably, this lists packages that they've tested for quality. I'm copying the list and adding links as of March 2009.

  • PDF (Acrobat) - pdftotext (comes with xpdf)
  • PostScript - ps2pdf (comes with Ghostscript), then pdftotext (comes with xpdf)
  • OpenOffice/StarOffice documents - compressed XML - (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm, .sxw, .sxg, .stw) using unzip
  • OpenDocument format documents - compressed XML - (.odt, .ods, .odp, .odg, .odc, .odf, .odb, .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) using unzip
  • Microsoft Word documents (.doc, .dot) using antiword or catdoc
  • MS Excel documents (.xls, .xlb, .xlt) using xls2csv (comes with catdoc)
  • MS Powerpoint documents (.ppt, .pps) using catppt (comes with catdoc)
  • Wordperfect documents (.wpd) using wpd2text (comes with libwpd)
  • MS Works documents (.wps, .wpt) using wps2text (comes with libwps)
  • AbiWord documents (.abw) - xml reader?
  • Compressed AbiWord documents (.zabw) using gzip
  • Rich Text Format documents (.rtf) using unrtf
  • Perl POD documentation (.pl, .pm, .pod) using pod2text
  • TeX DVI files (.dvi) using catdvi
  • DjVu files (.djv, .djvu) using djvutxt

Are there any others? Are any of these obsolete? Please comment on my blog entry.

Date Created: 2009-03-06