|
Current Search Tools News
Wikipedia, and particularly the related sites running the software released as MediaWiki, have some of the worst site search I have ever seen. The default installation's query processing is absurdly limited, the retrieval is crippled by bad settings, the relevance is unclear, and the results page is not just ugly but contradictory and confusing. I will be posting more detailed analyses supporting each of these statements. Read more about this analysis or leave a comment - October 2008
Even worse than the intrusive markup symbols, the MediaWiki Search highlights substrings even though the search only retrieves on whole words. It also highlights words that are not indexed and are never part of the search query: those of fewer than four characters, and words on the very large stopword list. This is just wrong, and the main Wikipedia has fixed it (recently). However, this wrong default behavior can show up on sites running even the most recent MediaWiki version (1.14a). Read more or leave a comment - Oct. 24, 2008.
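To make the point concrete, here is a minimal sketch of highlighting that stays consistent with retrieval. It is an illustration only, not MediaWiki's code: the function name, the `**` markers, and the `min_len` and `stopwords` parameters are assumptions made for this example.

```python
import re

def highlight_whole_words(text, terms, min_len=4, stopwords=frozenset()):
    """Wrap only whole-word matches of searchable terms in ** markers."""
    # Keep only the terms the engine could actually have searched on:
    # long enough to be indexed and not on the stopword list.
    searchable = [t for t in terms
                  if len(t) >= min_len and t.lower() not in stopwords]
    if not searchable:
        return text
    # The \b anchors restrict matches to whole words, never substrings.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in searchable) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: "**" + m.group(1) + "**", text)
```

With this approach, a term that was dropped from the query is never highlighted, and a match inside a longer word is left alone.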
MediaWiki's site search does display the search terms matched in the article, with some extracted text from the area around the term, so searchers can understand the context of the match. (For more information, see Matching Search Terms In Context.)
But most search engines remove the page markup (HTML or other) before saving the page, or at least before displaying the results with the match terms in context. MediaWiki Search does not do this, so results will show not just the text from the page, but also hidden text (such as in graphic file names) and markup symbols. Read more or leave a comment - Oct. 21, 2008.
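As a rough sketch of what removing markup before display could look like, the following strips a few common wiki constructs and then cuts a snippet around the match. It is a simplification under stated assumptions: the function names are hypothetical, the regular expressions cover only simple cases (no templates, tables, or nested links), and a real engine would use a proper parser.

```python
import re

def strip_wiki_markup(text):
    """Remove a few common wiki markup constructs (simplified)."""
    # Drop embedded file/image links entirely, including the file names.
    text = re.sub(r"\[\[(?:File|Image):[^\]]*\]\]", "", text, flags=re.IGNORECASE)
    # [[target|label]] becomes label.
    text = re.sub(r"\[\[[^\]|]*\|([^\]]*)\]\]", r"\1", text)
    # [[target]] becomes target.
    text = re.sub(r"\[\[([^\]]*)\]\]", r"\1", text)
    # Remove bold/italic quote runs and heading equals signs.
    text = re.sub(r"'{2,}", "", text)
    text = re.sub(r"={2,}", "", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

def snippet(text, term, width=30):
    """Return plain text around the first whole-word match, or None."""
    plain = strip_wiki_markup(text)
    match = re.search(r"\b" + re.escape(term) + r"\b", plain, re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(plain), match.end() + width)
    return plain[start:end]
```

Because the snippet is cut from the stripped text, graphic file names and bracket symbols cannot leak into the results page.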
The MediaWiki search results page does not actually say how many articles match the search terms. It has confusing and unlabeled number links for results page navigation, and results-per-page setting, right next to each other, but no total for results. Read more or leave a comment - Oct. 17, 2008.
The default behavior of the MediaWiki search engine is to find only pages which match every word in the search query (find all). When there are quotes around terms in the query, it will only find pages with those terms as a phrase, which is nice. Unlike other search engines, however, there's no way to override this behavior. No options are available to exclude terms from search results, search for several synonyms, automatically use plural word matching, or find substrings using truncation or wildcards. More... or leave a comment - Oct. 16, 2008.
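For comparison, here is a small sketch of the kind of query processing described above: find-all by default, quoted phrases, and a leading minus sign to exclude a term or phrase. The syntax and the function names are assumptions chosen for illustration, not an existing MediaWiki feature; synonyms, plural matching, and wildcards are left out.

```python
import re

def _normalize(s):
    # Lowercase and reduce to words separated by single spaces.
    return " ".join(re.findall(r"\w+", s.lower()))

def parse_query(query):
    """Split a query into required terms, required phrases, excluded items."""
    required, phrases, excluded = [], [], []
    # Each token is a "quoted phrase" or a run of non-space characters,
    # optionally prefixed with '-' to exclude it.
    for minus, phrase, word in re.findall(r'(-?)(?:"([^"]+)"|(\S+))', query):
        item = _normalize(phrase or word)
        if not item:
            continue
        if minus:
            excluded.append(item)
        elif phrase:
            phrases.append(item)
        else:
            required.append(item)
    return required, phrases, excluded

def matches(text, query):
    """True if text has every required term and phrase and no excluded item."""
    padded = " " + _normalize(text) + " "
    required, phrases, excluded = parse_query(query)

    def has(item):
        # Whole-word (or whole-phrase) containment test.
        return " " + item + " " in padded

    return (all(has(t) for t in required + phrases)
            and not any(has(x) for x in excluded))
```

For example, `matches("Dark rum punch", 'rum -"dark rum"')` is false, because the excluded phrase is present even though the required term matches.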
Stopwords are supposedly words that don't make sense in searches, words that are on many pages and therefore are "noise" in search results. They are generally short, such as a, an, the. However, the MediaWiki search excludes 547 words as stopwords. But these are perfectly good words, and by ignoring them, the search engine fails in many cases where it should find results. See more details or leave a comment - Oct. 14, 2008.
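The effect of an oversized stopword list can be shown with a toy index. This is a sketch under assumptions: `SAMPLE_STOPWORDS` is a hypothetical sample made up for the example (the actual 547-word list is not reproduced here), and the function names are illustrative.

```python
import re

# Hypothetical sample only; NOT the actual MediaWiki stopword list.
SAMPLE_STOPWORDS = {"a", "an", "the", "first", "second", "value"}

def build_index(docs, stopwords):
    """Map each non-stopword to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            if word not in stopwords:
                index.setdefault(word, set()).add(doc_id)
    return index

def search_all(index, query, stopwords=frozenset()):
    """Find-all search; query words on the stopword list are dropped."""
    terms = [w for w in re.findall(r"\w+", query.lower())
             if w not in stopwords]
    if not terms:
        # Every query word was a stopword, so nothing can be found.
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))
```

With a short list such as a, an, the, a search for "first" finds the page about first aid. With the larger sample list, the same search returns nothing, even though the word is on the page.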
The MediaWiki search engine defaults to a four-letter minimum word length. Seriously. Not only will it not search for one-, two-, or three-letter words, they're not even in the index: they are completely unfindable. There is no way to search for perfectly reasonable words like fan, lab, qi, or pH. The bartendersdatabase.com wiki can't find rum while Google finds 731 on that same site (though some are duplicates). See more details or leave a comment - Oct. 13, 2008.
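A few lines are enough to see which words a minimum-length rule throws away. This is only a sketch of the filtering idea; the function names and the `min_len` parameter are assumptions for the example, not MediaWiki settings.

```python
import re

def indexable_words(text, min_len=4):
    """Words that survive a minimum-length filter, in document order."""
    return [w for w in re.findall(r"\w+", text.lower()) if len(w) >= min_len]

def unfindable_words(text, min_len=4):
    """Distinct words that would never reach the index, sorted."""
    words = re.findall(r"\w+", text.lower())
    return sorted({w for w in words if len(w) < min_len})
```

Running `unfindable_words` over a page makes it easy to audit what a given minimum length would hide before choosing a setting.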
Overview Article on Inverted Indexing for Search
This is an expansive and detailed overview, including both practical and theoretical information. The authors report on their own and others' test results, finding that inverted indexes are both significantly faster to search and easier to maintain than relational database management systems, signature files and suffix arrays. The article also has a thorough annotated bibliography.
Inverted files for text search engines by Justin Zobel and Alistair Moffat
in ACM Computing Surveys. 2006;38(2) (56 pages).
Available in PDF format: http://doi.acm.org/10.1145/1132956.1132959 It's $10 if you don't have an ACM account, and you have to register on the site. There also seems to be a copy of the PDF file on CiteSeer. - Sept. 17, 2008.
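For readers new to the idea, here is a toy positional inverted index with phrase lookup. It is a teaching sketch, not code from the Zobel and Moffat article: the function names and the in-memory dictionary layout are assumptions, and real engines compress their postings and keep them on disk.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Build term -> {doc_id: [positions]} from a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(re.findall(r"\w+", text.lower())):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents containing the words of phrase in sequence."""
    terms = re.findall(r"\w+", phrase.lower())
    if not terms or any(t not in index for t in terms):
        return set()
    # Candidate documents are those containing every term.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        # The phrase matches when the terms sit at consecutive positions.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id]
                   for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits
```

The lookup touches only the postings for the query terms, which is the basic reason an inverted index can answer queries without scanning every document.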
Sphinx is a free, open source search engine, written in C++, using both SQL and custom index files to provide a very fast text search. The architecture scales to over a billion records by distributing the index and querying among multiple virtual and real processors. Read more and tell me about your Sphinx experience here. - July 11, 2008
For more news about search engines, see the News page.
|
|
Good
question! If you have serious content, a site or intranet search
engine will allow your visitors to jump directly to the topic
they want. More...
|
|
The
guide will help you learn more
about site searching. Or try the
remote search services on this site.
|
|
Includes
articles and books providing general
site search information, and those with more specific product reviews.
|
|
Alphabetical
List or divided by platform: Java,
Mac, Perl,
Unix, and Windows;
also Remote Search Hosting Services,
Code Libraries, and Open
Source Search Engines |
|
Multimedia
Search, Faceted Metadata
Search, PDF and Web Site Search,
Intranets and EIPs, XML and Search, Information Architecture,
Web Indexing Robot Spiders, and
more. |
|
|
This
site provides information, news and advice about web site searching
technology. It is maintained by Avi Rappoport, associate AJ Summers
and various Search Tools Consulting
interns as a service to the Web community. We welcome your comments
and suggestions: just contact us.
We are also available for search tools needs analysis, competitive
analysis, search tools installation and more: for information see
the Consulting page or use our
contact form.
Disclosure:
Search Tools Consulting presents the SearchTools.com site as a free
service to the web development community and is not sponsored by
any advertisers. Search Tools Consulting also provides analysis
and information to sites and institutions installing search engines,
and to some search engine developers. We do not give them site visitor
or survey personal information or allow our relationships with any
vendors to change any product review or analysis. |
|
The
SearchTools
blog on LiveJournal provides an opportunity for you to tell
me what you think about enterprise search tools for web sites and
intranets, and about the SearchTools.com web site. You do not have
to have an account to post; you can reply anonymously. All comments
are screened, so there will be no blog spam.
leave
a comment
|
|
|