As of January, 2012, this site is no longer being updated, due to work and health issues
What is File Format Parsing?
Search engines need text to index. This may seem obvious, but the devil is in the details. Extracting text is easy when working with txt, html or xml files, but much more difficult for binary files, including MS Office and archive formats. So search indexers need to use file format parsers, also called "filters". These read the binary file formats, extract the text, and keep track of whatever structure is there. Some file parsers are better than others, and all of them may need updating: when Microsoft switched from its proprietary binary formats to XML-based ones (doc -> docx), search indexers needed updated filters to read the new formats.
A good article with historical notes, technical details, and some interesting context: Where Have All the Filters Gone? by Mark Bennett (June 2007).
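To make the doc -> docx point concrete, here is a minimal sketch of what a filter for the newer format has to do. It assumes the usual .docx layout (a zip archive whose main text lives in word/document.xml) and uses only the Python standard library. The function name docx_to_text is made up for this illustration, and the sketch ignores headers, footnotes, and paragraphs nested inside text boxes, which a real filter would need to handle.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main body XML of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` is a file path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

The older binary .doc format has no such shortcut, which is why it needs a dedicated parser.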
Commercial File Parsers
On the commercial side of things, two main packages are included with almost every enterprise search system: Outside In and Keyview. Microsoft also has one, IFilter, which it uses on Windows. I believe that these packages have active development and support. They have both input and output APIs, so customers can create additional custom file format parsers. I don't expect much change in these packages.
- Outside In (previously at INSO, now owned by Oracle) - parses over 400 file formats to extract both text and metadata, and exports to HTML, Image, PDF, XML, or one of four text, HTML or XML search-ready formats.
- Keyview (now owned by Autonomy) - has C and Java APIs, parses over 1,000 file formats (including multimedia), and reads from 400 content repositories including RDBMS, CRM, ERP, ECM Portal, and Business Intelligence systems. It can extract text and metadata, convert to HTML while retaining the original layout, and be configured to recognize object rules and metadata relationships. (The Oracle quote on that page is from 2004.)
- Microsoft IFilter (instructions for MS Search Server 2008) - implemented as DLLs using Windows COM, and can work with .NET and C# with a little tweaking. They are also used by Microsoft's desktop search and toolbar. About 250 file name extensions are officially supported, including image, sound and video files, and there's a standard API for adding more.
- Davisor Offisor - a Java library which converts both new and old Microsoft Office documents (doc, docx, rtf, ppt, pptx, xls, xlsx) to clean XML. It doesn't just do text; it also includes style information, structure and metadata. I've heard good things about it -- worth looking at.
- Isys File Readers Filters - works with more than 200 formats, including ???. Windows only, but has API interfaces in C, C++, Java, and VB.NET.
Tika: Open Source File Parser Framework
As of March, 2009, there's a new open source filter framework: Tika, the Lucene open source project for calling format parsers and returning the result as XHTML. It's a well-designed standardized interface, making use of existing open source file parsers, including Apache POI for Microsoft Office documents (old and new), and PDFBox for Adobe Acrobat-type files. There are a dozen other formats already supported, and the API makes it easy to add a custom file parser without having to write any special code in the indexer.
But Tika is not limited to Lucene and related projects. Because it's open source, any search engine indexer can use it to access file parsers (within the limits of the Apache license). This simplifies everyone's life considerably, and creates a framework for open-source file parsers that is stable and documented. People can work on improving the code of the file parsers, or write their own and know that it will be compatible. There are other open-source file parsers, but the Tika framework and toolkit are likely to be dominant as long as they keep working.
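The plug-in idea behind such a framework can be sketched in a few lines. This is not Tika's API (Tika is a Java library). It is a toy illustration in Python in which every name (PARSERS, register, to_xhtml) is hypothetical. It shows the one property that matters: parsers register themselves against a file type, and the indexer calls a single entry point that always returns XHTML, so adding a format requires no change to the indexer.

```python
from html import escape
from pathlib import Path

# Hypothetical registry: lowercase file extension -> parser function.
PARSERS = {}


def register(*extensions):
    """Decorator that registers a parser for one or more extensions."""
    def decorator(func):
        for ext in extensions:
            PARSERS[ext.lower()] = func
        return func
    return decorator


@register(".txt")
def parse_plain_text(data):
    """Decode bytes and return a list of non-empty paragraphs."""
    text = data.decode("utf-8", errors="replace")
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def to_xhtml(filename, data):
    """Pick a parser by extension and wrap its paragraphs in XHTML."""
    ext = Path(filename).suffix.lower()
    parser = PARSERS.get(ext)
    if parser is None:
        raise ValueError(f"no parser registered for {ext!r}")
    body = "".join(f"<p>{escape(p)}</p>" for p in parser(data))
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml">'
        f"<head><title>{escape(Path(filename).name)}</title></head>"
        f"<body>{body}</body></html>"
    )
```

A custom format is added by writing one more function decorated with @register; the indexer keeps calling to_xhtml and never needs to know which parser did the work.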
Other Open Source File Parsers
Charlie Hull directed me to the file converters listed in the Omega Search Engine overview, part of the Xapian Search project. Presumably, this lists packages that they've tested for quality. I'm copying the list and adding links as of March 2009.
- PDF (Acrobat) - using pdftotext (comes with xpdf)
- PostScript - using ps2pdf (comes with Ghostscript), then pdftotext (comes with xpdf)
- OpenOffice/StarOffice documents - compressed XML - (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm, .sxw, .sxg, .stw) using unzip
- OpenDocument format documents - compressed XML - (.odt, .ods, .odp, .odg, .odc, .odf, .odb, .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) using unzip
- Microsoft Word documents (.doc, .dot) using antiword or catdoc
- MS Excel documents (.xls, .xlb, .xlt) using xls2csv (comes with catdoc)
- MS Powerpoint documents (.ppt, .pps) using catppt (comes with catdoc)
- Wordperfect documents (.wpd) using wpd2text (comes with libwpd)
- MS Works documents (.wps, .wpt) using wps2text (comes with libwps)
- AbiWord documents (.abw) - xml reader?
- Compressed AbiWord documents (.zabw) using gzip
- Rich Text Format documents (.rtf) using unrtf
- Perl POD documentation (.pl, .pm, .pod) using pod2text
- TeX DVI files (.dvi) using catdvi
- DjVu files (.djv, .djvu) using djvutxt
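An indexer can drive command-line converters like these from a simple lookup table keyed on file extension. The sketch below is one way to do that in Python, not how Omega itself is implemented. The function names are made up, only some of the listed formats are included, and the argument lists (for example the trailing "-" that asks pdftotext to write to standard output, and the --text option for unrtf) are my understanding of these tools and should be checked against the versions you have installed.

```python
import shutil
import subprocess
from pathlib import Path

PATH = "{path}"  # placeholder replaced by the real file name

# Extension -> command template. Assumed invocations; verify locally.
CONVERTERS = {
    ".pdf": ["pdftotext", PATH, "-"],
    ".doc": ["antiword", PATH],
    ".xls": ["xls2csv", PATH],
    ".ppt": ["catppt", PATH],
    ".wpd": ["wpd2text", PATH],
    ".wps": ["wps2text", PATH],
    ".rtf": ["unrtf", "--text", PATH],
    ".pod": ["pod2text", PATH],
    ".dvi": ["catdvi", PATH],
    ".djvu": ["djvutxt", PATH],
}


def command_for(path):
    """Return the argument list for converting `path`, or None."""
    template = CONVERTERS.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [str(path) if part == PATH else part for part in template]


def extract_text(path):
    """Run the converter and return its output, or None if unavailable."""
    cmd = command_for(path)
    if cmd is None or shutil.which(cmd[0]) is None:
        return None  # unknown format, or the tool is not installed
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces or shell metacharacters.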
Are there any others? Are any of these obsolete? Please comment on my blog entry.