As of January 2012, this site is no longer being updated, due to work and health issues.

Search Tools News 2008

December 16, 2008

CiteSeerX and SeerSuite: Adding to the Semantic Web

CiteSeer is a nifty free service that indexes and searches academic papers (mostly in CS and various Information Sciences). It's been doing Automated Citation Indexing for a decade, linking cited and citing papers, and those links turn out to be extremely valuable for research and area studies.

Now, Professor Lee Giles and his students at Pennsylvania State University have rebuilt the system from scratch and are sharing it as SeerSuite (currently beta 0.1) under an open source Apache license. Even a smallish digital library can take advantage of the automated metadata extraction and citation linking, with the reliable Lucene search engine underneath.

Now, they're combining technologies (from OCR to machine learning) and reverse-engineering data from PDF documents. This includes extracting captions and numbers out of tables, chemical formulae and molecular structures, mathematical equations, and 2D graphs, and storing them in various standard markup formats. All this information is not metadata: it's source data, and incredibly valuable for avoiding duplication, allowing reproduction of experiments, and taking that data in directions the original researchers did not expect. It brings the research into the Semantic Web, where there are tools just waiting for data like this.

I wrote a bit more about CiteSeerX and SeerSuite in InfoToday, and there's more information at the CiteSeerX site. Please comment, if you care to, on the blog entry.

October 24, 2008

6: MediaWiki Highlights the Wrong Terms in Search Results

Even worse than the intrusive markup symbols, the MediaWiki Search highlights substrings even though the search only retrieves on whole words. It also highlights words that are not indexed and are never part of the search query: those of fewer than four characters, and words on the very large stopword list. This is just wrong, and the main Wikipedia has fixed it (recently). However, this wrong default behavior can show up on sites running even the most recent MediaWiki version (1.14a). Read more or leave a comment
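To make the complaint concrete, here is a minimal sketch of highlighting that follows the same rules as retrieval: whole words only, and only terms the engine actually searches on. The function name, the `[[ ]]` markers, and the one-word stopword set are illustrative assumptions, not MediaWiki code.

```python
import re

def highlight(text, query_terms, min_length=4, stopwords=frozenset()):
    """Wrap whole-word matches of searchable query terms in [[ ]] markers."""
    # Keep only the terms the engine would really search on (assumed rules:
    # a minimum length and a stopword list, as described in the entry).
    searchable = [
        term for term in query_terms
        if len(term) >= min_length and term.lower() not in stopwords
    ]
    if not searchable:
        return text
    # \b anchors limit matches to whole words, so "bank" is not
    # highlighted inside "embankment".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in searchable) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: "[[" + m.group(0) + "]]", text)

snippet = "The bank stood on the embankment of the river."
print(highlight(snippet, ["the", "bank"], stopwords={"the"}))
# The [[bank]] stood on the embankment of the river.
```

The point of the design is that the highlighter and the retrieval code share one definition of a searchable term, so the results page never marks text that could not have caused the match.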

October 21, 2008

5: MediaWiki Shows Markup Symbols in Search Results

MediaWiki's site search does a good thing with search results: for each article, it shows not just the title, date and size, it also displays the search terms matched in the article, with some extracted text from the area around the term, so searchers can understand the context of the match. This is particularly useful for words with multiple meanings, such as rose, pound or bank. (For more information, see Matching Search Terms In Context.)

Most search engines remove the page markup (HTML or other) before saving the page, or at least before displaying the results with the match terms in context. MediaWiki Search does not do this, so results will show not just the text from the page, but also hidden text (such as in graphic file names) and markup symbols. Read more or leave a comment
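As a rough illustration of stripping markup before showing a match in context, the sketch below uses Python's standard `html.parser` on an HTML page. MediaWiki stores wiki markup rather than HTML, so treat the input format, the class name, and the `width` parameter as assumptions for the example.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only text nodes; tags and attribute values are dropped."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def visible_text(html):
    # Note: this simple version does not skip <script> or <style> content.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()

def snippet(html, term, width=30):
    """Return the matched term with up to `width` characters on each side."""
    text = visible_text(html)
    match = re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    return text[start:end]

page = '<p>The <b>rose</b> garden <img src="rose_photo_01.jpg" alt=""> opens in May.</p>'
print(snippet(page, "rose"))
# The rose garden opens in May.
```

Because the graphic file name lives in an attribute, it never reaches the snippet, which is the behavior the entry says most search engines provide.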

October 17, 2008

4: MediaWiki Search Results Header is Misleading

The MediaWiki search results page does not actually say how many articles match the search terms. It has confusing and unlabeled number links for results page navigation, and results-per-page setting, right next to each other, but no total for results. Read more or leave a comment

October 16, 2008

3: MediaWiki Search has Limited Syntax and Functionality

The default behavior of the MediaWiki search engine is to find only pages which match every word in the search query (find all). When there are quotes around terms in the query, it will only find pages with those terms as a phrase, which is nice. Unlike other search engines, however, there's no way to override this behavior. No options are available to exclude terms from search results, search for several synonyms, automatically use plural word matching, or find substrings using truncation or wildcards. More... or leave a comment
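The two behaviors described above (match every word, and treat quoted terms as a phrase) can be sketched in a few lines. This is a simplified illustration with assumed function names, not the MediaWiki implementation.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def parse_query(query):
    """Split a query into quoted phrases and loose terms."""
    phrases = re.findall(r'"([^"]+)"', query)
    remainder = re.sub(r'"[^"]*"', " ", query)
    return phrases, tokenize(remainder)

def matches(text, query):
    """True if the text has every loose term and every quoted phrase."""
    phrases, terms = parse_query(query)
    words = tokenize(text)
    # Pad with spaces so a phrase only matches on word boundaries.
    padded = " " + " ".join(words) + " "
    for phrase in phrases:
        if " " + " ".join(tokenize(phrase)) + " " not in padded:
            return False
    return all(term in words for term in terms)

print(matches("The open source search engine", '"search engine" open'))  # True
print(matches("The engine for open search", '"search engine" open'))     # False
```

Exclusion, synonyms, and wildcards would each need their own query operator and matching rule on top of this, which is exactly the functionality the entry finds missing.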

October 14, 2008

2: MediaWiki Search, Stopword Hell

Stopwords are supposedly words that don't make sense in searches: words that appear on many pages and are therefore "noise" in search results. They are generally short, such as a, an, the. However, the MediaWiki search excludes 547 words as stopwords. Many of them are perfectly good words, and by ignoring them, the search engine fails in many cases where it should find results. See more details or leave a comment
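A small sketch shows why an oversized stopword list hurts. The sample list here is hypothetical and far shorter than the 547 words mentioned above; it is not the actual MediaWiki list.

```python
# Hypothetical sample for illustration only, not the real stopword list.
SAMPLE_STOPWORDS = frozenset({"a", "an", "the", "first", "second", "zero"})

def searchable_terms(query, stopwords=SAMPLE_STOPWORDS):
    """Return the query words that survive stopword removal."""
    return [word for word in query.lower().split() if word not in stopwords]

print(searchable_terms("the first amendment"))  # ['amendment']
print(searchable_terms("zero"))                 # []
```

When every word of a query is on the list, nothing is left to search on, so the engine returns no results even for pages that contain the exact words.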

October 13, 2008

Why the MediaWiki Site Search Stinks

Wikipedia, and particularly the related sites running the software released as MediaWiki, have some of the worst site search I have ever seen. The default installation's query processing is absurdly limited, the retrieval is crippled by bad settings, the relevance is unclear, and the results page is not just ugly but contradictory and confusing. I will be posting more detailed analyses supporting each of these statements.


Default versions of the MediaWiki search engine are very nearly unusable. If you run a MediaWiki site, check the page Special:Version. If there is no mention of a search plugin, then run, do not walk, to replace the site search module. At least use the MWSearch (Lucene) extension, a version of which is used on the main Wikipedia, or, better, the Sphinx search extension (which powers the New World Encyclopedia search). Your wiki readers will thank you.

see more details or leave a comment

First Reason: Does Not Index Short Words

The MediaWiki search engine defaults to a four-letter minimum word length. Seriously. Not only will it not search for one-, two-, or three-letter words, they're not even in the index: they are completely unfindable. There is no way to search for perfectly reasonable words like fan, lab, qi, or pH. The wiki can't find rum, while Google finds 731 matches on that same site (though some are duplicates). See more details or leave a comment
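The effect of a minimum word length at indexing time can be sketched as follows. The `index_terms` function and the sample sentence are assumptions for illustration; only the four-letter default comes from the entry.

```python
import re

def index_terms(text, min_length=4):
    """Return the set of lowercased words that would enter the index."""
    words = re.findall(r"[A-Za-z]+", text)
    return {word.lower() for word in words if len(word) >= min_length}

doc = "Rum punch: mix rum with lime, then check the pH in the lab."

terms = index_terms(doc)
print("rum" in terms)                            # False
print("rum" in index_terms(doc, min_length=2))   # True
```

Since the short words are discarded before the index is written, no query syntax can bring them back; the only fix is to lower the limit and rebuild the index.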

September 17, 2008

Enterprise Search Summit, 22-24 September, 2008

Focused on real-world issues of implementing and enhancing search for intranets, portals and large web sites. It's been a wonderful conference every time, because it is just about search and related issues. It's a great mix of case studies and specialist presentations, and even the vendor talks are good.

Avi Rappoport will be presenting a pre-conference workshop, Enterprise Search 101, and a talk, Inside the Black Box of the Search Index, which will start with the basic inverted index and go on to some of the more interesting aspects, including tokenizing and document caching.

Online registration is available. There's also a vendor exhibit hall, shared with the KMworld and Intranets conferences. To just see the exhibits, fill in the registration and scroll down to the "Exhibits only" button, which is free. But you'll be missing a lot of fascinating talks.

Search Solutions in London, 23 September, 2008

Sponsored by the British Computer Society, Information Retrieval Specialist Group, this is an interactive and collegial meeting, focusing on innovations in information search and retrieval.

September 16, 2008

Overview Article on Inverted Indexing for Search

This is an expansive and detailed overview, including both practical and theoretical information. The authors report on their own and others' test results, finding that inverted indexes are both significantly faster to search and easier to maintain than relational database management systems, signature files and suffix arrays. The article also has a thorough annotated bibliography.

Inverted files for text search engines by Justin Zobel and Alistair Moffat in ACM Computing Surveys. 2006;38(2) (56 pages). Available in PDF format: it costs $10 if you don't have an ACM account, and you have to register on the site. There also seems to be a copy of the PDF file on CiteSeer.
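For readers new to the idea, a toy inverted index fits in a few lines: each term maps to the list of documents that contain it, and a multi-word query intersects those lists. This is a minimal in-memory sketch with assumed names and sample documents, not the structures evaluated in the article.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all(index, query):
    """Return ids of documents containing every query term."""
    term_ids = [set(index.get(term, ())) for term in re.findall(r"\w+", query.lower())]
    if not term_ids:
        return []
    return sorted(set.intersection(*term_ids))

docs = {
    1: "Inverted files for text search engines",
    2: "Suffix arrays for text",
    3: "Search engines and signature files",
}
index = build_index(docs)
print(index["text"])                        # [1, 2]
print(search_all(index, "search engines"))  # [1, 3]
```

A production index would also store term positions and frequencies and compress the postings lists, which is much of what the survey covers.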

July 11, 2008

Sphinx search: New SearchTools Report

Sphinx is an open source free search engine, written in C, using both SQL and custom index files to provide a very fast text search. The architecture scales to over a billion records by distributing the index and querying among multiple virtual and real processors.

While it does a full text search, Sphinx is designed to work with structured content (music lyrics, products), and semi-structured content (RSS feeds, blog posts, magazine articles). Sphinx is much faster and more flexible than the internal SQL functions such as where, order by, and group by. This structure allows it to display results with faceted metadata, for example showing graphical facets in the results page, including country, source, theme and date.

Sphinx does not have a robot crawler, although it can accept input in XML, which can be generated by a crawler. It connects directly to MySQL and PostgreSQL, and has web scripts for external sources. APIs are available in PHP, Python, Java, Perl and Ruby. Read more and/or tell me about your Sphinx experience here.

July 10, 2008


The final new element in the recent agreement on the Robots Exclusion Protocol is the "x-robots-tag". This is an addition to the HTTP header sent in response to a URL request. This header tag can enclose the same values as the Robots META tag: noindex, nofollow, noarchive, nosnippet and noodp. But unlike the meta tag, it's not limited to HTML: it can be applied to non-HTML items, such as PDF, text, office and CAD documents, which may not have useful Properties metadata. More...

July 2, 2008

Crawling and indexing Flash files and HTML forms

Adobe has a special Flash client for search engine indexing of SWF (Flash) files, beyond the static text. Google is implementing it now; Yahoo and MSN may follow. It's not clear what value the text in Flash files has, how the robots will extract it, or what's going on with JavaScript and external XML files. This reminded me that Google's been auto-filling forms for a while now. More in my new Indexing Interactive Content page or comment on the blog entry.

June 27, 2008

Search usability research

Whitney Quesenbery and her colleagues convey the findings of a long study about how search is used at the UK's Open University. Whitney gave a talk at the Enterprise Search Summit, and presented more formally at the Usability Professionals' Association conference in June 2008.

The study included search log analysis, heuristic reviews, and remote and local usability testing on the search user experience, over the course of several years. The reports are linked from Whitney's valuable Search Usability page.

It's great to see more research done over time and with a large amount of data. I'm keeping a listing of what I've found at CiteULike with the tag search-interface, and planning to update my Search Usability page.

June 26, 2008

The Long Tail, The Short Head, and Search

I've just posted an article on the Long Tail, Short Head and Search. Every site, intranet and enterprise search log I've analyzed fits the model of the Long Tail, with a very few very popular search terms, then tailing off very quickly to unique queries (the Long Tail), creating a Zipf curve.

The Short Head -- the few most frequently used search terms -- is the best place to start in analyzing search engine usage. My article also gives some suggestions for taking the information and using it to improve a search engine.
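As a starting point for this kind of analysis, the sketch below counts queries and returns the Short Head: the most frequent queries that together account for a chosen share of all searches. The function name, the `share` parameter, and the sample log are assumptions for illustration.

```python
from collections import Counter

def short_head(queries, share=0.5):
    """Return the top queries that together cover `share` of all searches."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    total = sum(counts.values())
    head, covered = [], 0
    for query, count in counts.most_common():
        if covered >= share * total:
            break
        head.append((query, count))
        covered += count
    return head

# Invented sample log: two popular queries, then a tail of one-off queries.
log = ["jobs"] * 6 + ["library"] * 3 + ["parking", "phd thesis format", "room 214 booking"]
print(short_head(log))              # [('jobs', 6)]
print(short_head(log, share=0.75))  # [('jobs', 6), ('library', 3)]
```

Normalizing case and whitespace before counting matters, because otherwise "Jobs" and "jobs " would be split across separate entries and the head would look shorter than it is.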

June 23, 2008

HCIR 2008: Workshop on Human-Computer Interaction and Information Retrieval

Making the connection between interface and search, this workshop is focused this year on complex search tasks. The 2007 Workshop presentations ranged from visual text analysis to online consumer choice. This year's workshop will be 23 October, 2008, in Redmond, Washington, USA.

June 15, 2008

My article is up on InfoToday: New Robots Exclusion Protocol Agreement Among Yahoo!, Google, and Microsoft Live Search. Nothing earthshaking, just a summary from a library point of view, and a quote from Danny Sullivan saying that this is an important first step.

June 13, 2008

More Information on the new Robots Exclusion Protocol

Search indexing robot writers and web publishers should definitely look at the new extensions to the REP, as there are useful additions to both robots.txt directives and Robots META tags. Most of these features have been supported by the big three search engines (Google, Yahoo, MSN Live), but it's nice to have that formalized, and other search robots can take advantage of the new functionality.

The new X-Robots-Tag (added to the HTTP header for non-HTML files) is a good way to send the meta information, but requires automated extensions to the servers. For example, if content is available in both HTML and PDF formats, it's easy to send NOINDEX values for all PDF files, directing search engines away from the printable format and towards the browser-readable format.
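The PDF example might look like this on the server side. This is only a sketch of the decision logic: the function name and the extension policy are assumptions, and a real site would hook this into its web server or application framework.

```python
import posixpath

# Assumed policy for illustration: keep printable duplicates out of the index.
NOINDEX_EXTENSIONS = {".pdf"}

def robots_headers(url_path):
    """Return extra HTTP response headers for the resource at `url_path`."""
    extension = posixpath.splitext(url_path)[1].lower()
    if extension in NOINDEX_EXTENSIONS:
        return {"X-Robots-Tag": "noindex"}
    return {}

print(robots_headers("/reports/annual-2008.pdf"))   # {'X-Robots-Tag': 'noindex'}
print(robots_headers("/reports/annual-2008.html"))  # {}
```

The HTML version of the same report gets no extra header, so it stays indexable while the PDF copy is excluded.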

It turns out that NOODP comes in handy when a page is linked from the ODP (Open Directory Project), and the title or text in that entry is not accurate, which happens sometimes. Using the NOODP robots meta tag value tells the search engines not to use the ODP entry, but rather the title and text from the page. NOYDIR does the same for the Yahoo Directory, but is only officially supported by Yahoo and its Slurp robot.

For pages with frequent changes, NOARCHIVE makes some sense: the old content may be in the searchable index, but at least the search engines will not display the old version of the page itself.

However, I have yet to figure out when someone would use NOSNIPPET (which also disables archive display). Limiting a listing in the search results page to the title and URL seems like such a bad idea. Why would anyone do this? Please, explain it to me in a comment.

June 3, 2008

New Robot Exclusion Protocol (REP)

Supported by webwide search engines Yahoo, Google and Microsoft, this adds directives to robots.txt:

There are also HTML meta/ properties tag directives for:

Yahoo has a nice long blog entry on this, as does Google (and now: Microsoft Live Search has a blog entry too). Great news for web developers, who've been waiting for this for a very long time.

But there's nothing from the robots mailing list or the which is a shame.

This is also a test for all site and intranet search crawlers -- any abandoned software will not recognize these new directives.

I'll dig further into this in the next week and provide more analysis and details.


May 7, 2008

A First Taxonomy of Search Log Junk

Search logs contain a lot of weird things, and some of them can have a significant effect on search log analysis. Having looked at tens of thousands of lines of search log entries, I offer this first attempt at defining some of the weirdest and least useful kinds of log entry, which I call "Search Log Junk". Here are the types of junk that I've seen most frequently: empty queries, repeat queries, robot crawlers, server hacks, search field/guestbook spam, and internal test queries. More...
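The taxonomy above can be turned into a rough first-pass filter. Every pattern, hint string, and test-query value below is an assumption chosen for illustration; real logs need rules tuned to the site.

```python
import re

# All of these heuristics are illustrative assumptions, not tested rules.
ROBOT_HINTS = ("bot", "crawler", "spider", "slurp")
HACK_PATTERN = re.compile(r"<script|\.\./|union\s+select", re.IGNORECASE)
SPAM_PATTERN = re.compile(r"https?://|\[url=", re.IGNORECASE)
TEST_QUERIES = {"test", "asdf"}

def classify(query, user_agent="", previous_query=None):
    """Label one log entry with a junk category, or 'ok' if none applies."""
    q = query.strip()
    if not q:
        return "empty query"
    if any(hint in user_agent.lower() for hint in ROBOT_HINTS):
        return "robot crawler"
    if HACK_PATTERN.search(q):
        return "server hack"
    if SPAM_PATTERN.search(q):
        return "search field/guestbook spam"
    if q.lower() in TEST_QUERIES:
        return "internal test query"
    if previous_query is not None and q.lower() == previous_query.strip().lower():
        return "repeat query"
    return "ok"

print(classify("   "))                              # empty query
print(classify("rose", user_agent="Googlebot/2.1")) # robot crawler
print(classify("rose", previous_query="Rose "))     # repeat query
```

Labeling entries rather than deleting them keeps the junk available for its own analysis, such as spotting a new spam campaign.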

January - April, 2008: updates to the site suspended due to injury


SearchTools News, like a Blog, since 1998. / RSS feed


For earlier news, see the 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999 and 1998 news archive pages

Last Update: 2009-01-30


Search Tools Consulting's principal analyst, Avi Rappoport, may be available to help you with selection, analysis, user experience, and functional search engine work. Please contact us with your questions, comments, or possible consulting discussions.

Creative Commons - Copyright © 2008-2009 Search Tools Consulting.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.