Enterprise and Site Search Notes
August 9, 2011
- Pingar (new semantic search engine)
- Semantic search running on top of Solr, offering an API including autocomplete, related terms, co-occurring locations and names, document summaries. Currently integrating with SharePoint, software as a service option. Has a Chinese version, more languages to come.
- Scholar Citations: Google Moves into the Domain of Web of Science and Scopus
- "Author profiling, rising from the need to better disambiguate researchers and to better find and connect relevant researchers, has become an increasingly hot product area in the past 2 years" -- the article describes the Google Scholar Citations system of automating profiles and allowing scholars to edit them. It also talks about Microsoft Academic Search, arXiv, ACM, RefWorks, DBLP, INSPIRE, OpenID.net, ORCID, Publish or Perish, RePEc, ResearcherID, VIVO, and Scopus.
- Interview With Louis Rosenfeld, Author of Search Analytics
- Shari Thurow interviews Lou Rosenfeld about what his Site Search Analytics book can teach people concerned with SEO and search engine marketing, such as missing content, site navigation problems, keyword phrases in context.
- Diego Basch on The Need for Speed [Search Engine Usability]
- It never occurred to me to describe interaction design and search results usability as "speed" and contrast them with traditional information retrieval relevance judgments, but, sure, why not.
- IndexFile - PHP Lucene indexing
- IFile is a PHP-based framework for indexing documents (PDF, DOC, HTML, etc.) using Lucene. Note that the documentation is in Italian.
- The CIO Vs. The Information Access Mafia -- InformationWeek
- Management's goal should be not to limit access to data, but to figure out a way to facilitate access. They haven't and they won't. They want to brag about the benefits when the company comes up with something great, something really innovative, but they want deep cover to protect their rears when and if we have given the wrong level of access to the wrong people.
- Don't Use Java 7, For Anything >> Lucid Imagination
- July 28, 2011 -- Posted by hossman -- Java 7 GA was released today, but as noted by Uwe Schindler, there are some very frightening bugs in HotSpot Loop optimizations that are enabled by default. In the best case scenario, these bugs cause the JVM to crash. In the worst case scenario, they cause incorrect execution of loops. Bottom Line: Don't use Java 7 for anything (unless maybe you know you don't have any loops in your java code)
- Giving search users what they want, not what they're asking for - FierceContentManagement
- Nice overview of modern site search features such as facets, best bets, taxonomy, alternate vocabulary, and related topics
- Make Your Business 10X Faster with Box's New Search (company switched to Solr)
- The cloud storage service Box.net switched to Solr/Lucene and has seen significantly faster indexing and searching. They are also using Apache Tika to read a wide variety of file formats.
- Locayta ESP - Website Search Engine Solutions
- A new ecommerce site search engine
- Lucene Faceting module added to the trunk
- While Solr had faceting, the core Lucene library did not have any formal support for the powerful faceted metadata approach to search results presentation. Lucene Jira ticket 3079 contains new code to do this, at very large scale (millions).
- Collaborative relevance in Enterprise Search (using search analytics)
- A fresh approach to the discussion of long-term search relevance, setting up a process for responding to user behavior and using it to improve results. This article is a little case study of a Findwise installation in a Swedish power company, apparently using clickthrough tracking and log analytics.
- Google+ Social = World Domination?
- My article for InfoToday's NewsBreaks, with a bit about how hard it is to do near-real-time relevant searches with privacy protection.
- Sorting Out Microsoft's Mixed-Up Enterprise Search Strategy -- Redmondmag.com
- There are many flavors of Search Server and FAST Search, and it's still not at all clear where Microsoft will go with them.
- A digital real estate firm digs into site search data to improve navigation - Internet Retailer
- Tangible example of using site search analytics to make changes in content. "If we see that people are looking on the site for something like 'domain authority' we can address that. It gives us a reason to develop that information on the site," says [Mary Smucker-Priest]. Site search data is unique because it shows a retailer how customers are using a site, she says. "It's reliable, informative data."
- Don't Build Data Castles Made of Sand (ReadWriteWeb)
- Strongly for enterprise search and unified information access: "Leaving the data where it resides and providing unified access to it not only helps make the most of existing IT investments, it provides a full view of the information that matters most to you, which helps you make better decisions."
- Designing Faceted Search: Getting the basics right (part 2) Information Interaction
- Another great explanation of facet issues, the best so far. Detangling the issues of hyperlink vs. checkbox from those of single-select vs multi-select is particularly important.
- Google Custom Search: More Layout Options (and no more iframe)
- Introduction to Search with Sphinx - O'Reilly Media
- Definitive Sphinx book by the author of the program, with the benefit of O'Reilly editing.
- Coreseek - Chinese version of Sphinx search engine
- The page is mostly in Chinese so I can't really tell what's up, but it's clearly a version of Sphinx, and it says it's made by Beijing Choice Software Technology Inc. and has been around since 2006.
- Flying Sphinx - full text search engine as a Heroku addon
- Flying Sphinx is a service providing a Thinking Sphinx search engine deployed on Heroku (Ruby on Rails) cloud services. Rates range from $12 per month to $300, which includes 15GB of indexed data and near-real-time index updates. I think it's run by a very active Sphinx/Thinking Sphinx contributor. This is a link to the Heroku page; there's also a site: www.flying-sphinx.com
- smartOCI cross catalog search engine with OCI 4.0 for SAP SRM application | NetSol Technologies
- This seems to be a federated search on web-based supplier catalogs (b2b) using something called the Open Catalog Interface.
- Search Analytics for Your Site: Conversations with Your Customers :: UXmatters
- Chapter 8 of Lou Rosenfeld's book on site search analytics, full of practical advice. It contains lots that's useful for intranet and enterprise search as well.
- Yahoo! Search BOSS group
- [Yahoo Build your Own Search Service] BOSS V2 Updates
- New features include daily limit controls, HTTPS, SQL and YQL support, more news content, and better documentation. Note that BOSS v.1 will go away on July 20, 2011.
- Sometimes I Wish NLP Systems Occasionally Blew Up - LingPipe Blog
- Like search, badly performing Natural Language Processing systems tend to be quietly bad, rather than flaming out. Great quotes: "NLP Systems are Easier to Sell than Build", "Customers get the potential value of advanced NLP/Text Analytics/etc in the same way that people get the potential value of space flight." and "Be aware that you are selling to the best NLP systems out there: Humans", "It is unrealistic to expect a named entity model to perform well on Twitter if its training data is the 1994 Wall Street Journal."
- How to Make a Great Search Results Page (short ecommerce overview)
- Intro to search on online stores. It says that analytics should show at least 55% of visitors using site search (that seems a little high to me). There's a little bit about size and placement of search fields and general suggestions on relevance and elements of search results.
- Open Source Web Mining Toolkit | Bixo
- Includes solid crawler code, Hadoop to store the URL database, Cascading pipeline and scheduler, and Tika to read the pages.
- A Workshop on Restructuring Adjectives in WordNet
- WordNet is the most widely-used English-language lexical database. This US government grant is to add coding values for dimensional (amount) adjectives. Apparently there's a recently-developed AdjScales system? Anyway, anything that adds to WordNet is a good thing by me. (via Ross Stapleton-Gray)
- crawler-commons - Shared Java components for web crawlers
- Open source robot spider crawling modules, now includes code to handle robots.txt properly
- Read documentation
- Drupal CMS module to integrate Solr search with faceted metadata
- Comparing User Research Methods for Information Architecture :: UXmatters
- General evaluation of many IA tools including card sorting, tree testing, and prototyping. Listing the benefits and limitations of each is really helpful.
- Looks like a good way to prioritize all kinds of projects
- Yahoo BOSS OAuth Coding Examples - YDN
- Yahoo's web search API, BOSS v.2, now requires OAuth to avoid misuse. This page has example code in several languages.
- Result Diversification in Search - Talk at LinkedIn, June 14 2011
- Looks fascinating: A Framework for Result Diversification in Search: a LinkedIn Tech Talk by Sreenivas Gollapudi (Microsoft Research)
- "401 unauthorized access" when server-side code or browser tries to call itself with URL - imason blog
- Using a load balancer with two SharePoint servers sometimes returned a 401 error because the real hosts were authorized but the load host was not. I always wonder how these things show up; here's one cause.
- Search Is Not Just an IT Problem
- Nice overview of the many issues of search strategy
- [WordPress] Comment-Page-1 Pagination Creates 23 Million Duplicate URLs
- The default setting of WordPress has two URLs for each post: the name and the name followed by /comment-page-1/. This article talks about how to tell search engines not to index this on the WordPress side. Search engines should just exclude URLs with that string.
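A minimal sketch of that exclusion on the crawler side, assuming the crawler sees candidate URLs as plain strings. The function names and example.com URLs are hypothetical, not from the article. It matches only the exact path segment, so later comment pages (comment-page-2 and up), which may hold distinct content, are kept.

```python
from urllib.parse import urlsplit

# The path segment that WordPress adds to the duplicate URL.
DUPLICATE_SEGMENT = "comment-page-1"

def is_comment_page_duplicate(url):
    """True if any path segment of the URL is exactly comment-page-1."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return DUPLICATE_SEGMENT in segments

def drop_comment_page_duplicates(urls):
    """Return the URLs worth indexing, in their original order."""
    return [u for u in urls if not is_comment_page_duplicate(u)]
```

Matching on a whole path segment rather than a raw substring avoids dropping a post whose own slug happens to contain that text.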
- Search Solutions is a special one-day event dedicated to the latest innovations in web & enterprise search. In contrast to other major industry events, Search Solutions aims to be highly interactive and collegial, with attendance limited to 60 delegates.
- The Changing Face of SERPs: Organic Click Through Rate Curve
- Search engine marketing-oriented study by Optify of clickthrough rates depending on position of the item in search results. Fully 37% of all clicks are on the first result, 60% of clicks are on the first three or so, and very few on second and subsequent pages. They contrast this with sponsored results Cost Per Click and the ultimate business results.
- Q-Senesi Enterprise Search Engine
- A new search engine for intranets, web sites and analytics. It has been running for a while as a bookfinder and web data search engine and now is being developed as a standalone product. However, there seems to be no robot crawler or spider and the structured data configuration is quite tedious. I also couldn't find a price anywhere.
- Search as middle-ware at att.com, with Shantanu Deo - Coté's People Over Process
- Informal conversation starts with AT&T's CMS and goes on to talk about using Solr for catalog search on the site. It started as a bit of a skunkworks, and takes very little maintenance and few resources.
- Solr Spellchecker internals (now with tests!) AB emmaespina
- Excellent introduction to the default Solr spelling checker processes. The test information is particularly useful, including data sets with natural human-made spelling errors. There's a follow-up that's also very useful.
- Using Google [Connectors] for Lucene < Real Story Group Blog
- Lucid Imagination is using the open source enterprise connector framework with their LucidWorks Enterprise distribution of Solr/Lucene.
- Classifying Searchers - What Really Counts? - Enterprise Search Blog
- Thinking about relationships with search engine vendors, market niches, diverse information requirements, and many more issues in managing search.
- Novices Orienteer, Experts Teleport - Boxes and Arrows: The design behind the design
- Nice overview by Tyler Tate of different search strategies based on domain and technical expertise. Concrete UI and UX recommendations with screenshots.
- Diversity in Document Retrieval (International workshop DDR 2011)
- Watson and healthcare (using Lucene)
- IBM's Watson, winner at Jeopardy, uses Lucene open-source search engine for indexing and searching large document databases (Also Apache UIMA). This particular article is about implications for health-care including mining medical records and patient reported data.
- Enterprise Super Search - Enterprise Super Search - Agent Support Portal
- Exploring synonyms within large commercial site search engine queries [HP Research Report]
- Straightforward study of hp.com's queries finds that reformulations and clickthrough data are not good ways to find synonyms, but that starting from lists of product names is promising. Spellchecking for typos and spacing, and pluralization reduce the need for synonym listings. It also recommends a larger data set than their 190,000 query log. (via Martin White, @IntranetFocus)
- The REAL cost of SharePoint Search - SharePointEduTech
- SurfRay (vendor) overview of common problems with searching, questions to consider, and new technologies to evaluate.
- ZettaSearch - combined search and analytics (BI)
- Starts with Solr/Lucene and adds autosuggest for faceted metadata, geospatial filtering, spellchecking, search sessions, integration with external resources and databases, related documents, simple bar, pie and column charts, integration with Quantum4D for visualization.
April 4, 2011
AEW 2011 - ASIST European Workshop, 1-2 June 2011
"Organized by the European Chapter of ASIS&T, it is an ideal cross-disciplinary forum to present and encounter work by research students in the fields of information, library and computer science. The workshop will offer a unique networking opportunity where ideas and research can be discussed with fellow future luminaries of the field. "
Webinar: ‘Mind the Enterprise Search Gap’ (NASA case study)
Online presentation about search in the Johnson Space Center (JSC) using taxonomies, ontologies, classification, semantics.
10 Reasons To Resolve To Create A Taxonomy For Your Business In 2011
Cheerful introduction to the concepts of enterprise taxonomies.
Search Technologies post: Search for Government Documents
Why the company loves government documents, including the content, but finds challenges for search: continuity, semi-structured data and organization, advanced search needs.
CIKM 2011 (Conference on Information and Knowledge Management) - October 24-28, in Glasgow
CIKM always has interesting topics and workshops.
Exclusive Interview: Brian Pinkerton (of LucidImagination)
Steve Arnold interviews Brian Pinkerton, one of the founders of LucidImagination. It's mainly about open-source search in the enterprise marketplace.
Selecting a Content Management System: To Score or Not to Score?
This process applies to any enterprise software selection: past the basic functionality, each member of the selection committee should score the most important features for each vendor. Then compare scores and ask why.
The Secrets Behind Blekko's Search Technology
The Blekko web search engine is running on distributed peer-to-peer nodes; there's no single master list of everything. They use MapReduce but adjust the Reduce step to be less demanding. And it's all in Perl. Interesting.
Movement and Change in User Interfaces - Webtorque Blog
Results of a design change which put a "pinned" banner on top of search results. This included a special offer and a sort menu: in the test, no one noticed them.
TemaTres Vocabulary Server | The way to manage formal representations of knowledge
"TemaTres is an open source vocabularyserver, web application to manage and exploit vocabularies, thesauri, taxonomies and formal representations of knowledge."
Web Search: What do we know from a single search query?
Prof. Jim Jansen summarizes how much web users reveal about themselves with each query, with a link to a conference paper on the topic. Spoiler: the example query is "ASIS&T annual meeting 2009" -- more specific than "world cup" or "twitter".
Real-world intranets in 2010: SWOT analysis — Business Information Review
Nice overview of practical issues in enterprise / intranet information systems. It includes a useful section on Search. (SWOT is Strengths, Weaknesses, Opportunities, Threats).
Lucene and Solr: 2010 in Review « Sematext Blog
A nice summary of the Lucene/Solr merge in 2010 including the actual codebases, developer mailing lists, and coordinated release versions. There's a new sub-project, ManifoldCF, that manages connectors for datasources and has access control support. Mahout, Nutch, and Tika are now top-level Apache projects. Etc.
Transaction-like Document Processing in AIE
Technical discussion of how to update index items in groups and thus with limited blocking of other processing. This makes access control and content changes much faster and more efficient.
Internet Archive content, VUFind (Solr), and text mining « CRRA Blog
notes on creating metadata-rich searchable portals
Google Research Director Peter Norvig on Being Wrong
[question about pagerank as a stand-in for credibility] Yeah, that's always a problem. One way we try to counter that is diversity. We haven't figured out any way to get around majority rules, so we want to show the most popular result first, but then after that, for the second one, you don't want something that's almost the same as the first. You prefer some diversity, so there's where minority views start coming in
A New Kind of Search Experience (enterprisesearchblog)
Thinking about the Qwiki information user experience.
Solr and LucidWorks Enterprise: When to use each | Lucid Imagination
Lucid Imagination is a vendor of service interfaces for the open-source Lucene/Solr packages, as well as spelling, faceting, file format translation, and so on. This is a fairly clear description of their offerings, including an interactive browser admin interface and a REST API.
Sharepoint Field Notes: FAST & SharePoint search
Describes the FullTextSQL of SharePoint Search vs. the FQL of FAST search, and other practical issues.
Processing Tweets with LingPipe #3: Near duplicate detection and evaluation « LingPipe Blog
Excellent description of the tricky issues with near-duplicate detection, in some ways harder because of the 140 character limit on Twitter. Describes tokenization approaches and the Jaccard Distance algorithm for similarity, and what they call "entropy": how some collections are more uniform than others.
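For reference, a minimal sketch of Jaccard distance over token sets. This is plain Python, not LingPipe's code. The tokenizer (lowercase, split on non-alphanumerics) and the 0.5 threshold are my own assumptions for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| over the two token sets."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0  # two empty texts are treated as identical
    return 1.0 - len(ta & tb) / len(ta | tb)

def is_near_duplicate(a, b, threshold=0.5):
    """Hypothetical cutoff: a distance at or below the threshold counts."""
    return jaccard_distance(a, b) <= threshold
```

With texts this short, one or two changed tokens move the distance a lot, which is part of why the 140-character limit makes the problem harder.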
New BBC Site Search (BBC blog)
Summary of changes made in scope searching, driven by analytics: "I remember analysing the logs for the Gardening website search and discovering that the top result was for [eastenders] and returned no results."
Meta Keywords Tag: Internet Versus Intranet | ZDNet
Short article explains how meta keywords tags are mostly ignored by search engines such as Google, but can be very useful for intranet search.
- How Google Instant’s Autocomplete Suggestions Work - SearchEngineLand
- Danny Sullivan's detailed look at how Google's autocomplete and instant search work. The suggestions seem to be populated by a combination of many recent searches and long-term frequent searches. Google has an inconsistent policy on removing negative terms from the suggested list. He thinks they should just remove them without fuss.
These examples emphasize how enterprises and web sites should keep very strict control over their autocomplete (and spelling) suggestions, rather than rely on technology.
- Error-Tolerance and The Long Tail of Search in Ecommerce « Exorbyte Blog
- In e-commerce, it's much better to find something than nothing. Spellchecking and other tools for inexact matching can improve results for the long tail of unique and near-unique queries. Exorbyte reports increases of conversion from 5% to 20% - at least somewhat related to their error tolerance.
- A Vision for Unifying Access to Data and Documents - Forbes [Attivio]
- Describes Attivio's combination of unstructured and structured data for a new kind of BI. ... entity data and the sentiment analysis and other analysis techniques become part of the meta data for the unstructured data. This allows the unstructured data to be presented as structured data. You can do relational joins between the structured and unstructured data to answer questions like: “Find all negative product reviews that mention both the iPad and our top-20 best sellers.”
- MagnetStreet attracts sales with improved site search - Internet Retailer (SLI systems)
- Consumers using site search browse about 11 more pages per visit than shoppers who don’t, and remain on the e-commerce site nearly 13 minutes longer.
- Autonomy IDOL Universal Search product page
- "IDOL Universal Search provides users with a simple, personalized search experience tailored specifically to their unique requirements. In many typical environments, content may be spread across disparate internal data repositories and systems as well as external content sources and engines. Users are forced to conduct multiple, repetitive searches or risk missing crucial information. With Universal Search, a single query quickly federates unified results across systems in an intuitive, easy-to-navigate Web interface. Users get the big picture quickly through advanced visualization and dynamic categorization for a faster, more thorough exploration of results."
- Incompetent Research Skills Curb Users' Problem Solving (Jakob Nielsen's Alertbox)
- Results from a usability test - only 1% of the time did users change search strategy. So search engines need to concentrate on improving reliability of simple search results.
- Interaction Models for Faceted Search « Information Interaction
- Tony Russell-Rose does a great job identifying various models of faceted search interactions, including the relations between the facets, Smart Dead Ends, the relative values of "instant update" and "two-stage" matching and user choice, multi-select AND vs. OR, and interstitial pages. Very valuable analysis and naming.
- video - Compound Term Processing - Concept Searching's Ground Breaking Technology
- YouTube video: Concept Searching uses lemmatized views of words to offer valuable autocomplete suggestions.
- Google's autocomplete function libels man by linking his name to 'fraud' | Metro.co.uk
- A Milan court has found Google guilty of libel after its autocomplete search engine function linked a man's name to terms including 'conman'
- How to find intranet screenshots » Step Two Designs, James Robertson
- Not many sources for intranet design examples: research reports, some groups and conferences
- ElasticSearch + Cassandra / Lucandra, Solandra
- Interesting discussion with a tilt towards ElasticSearch
- Google Rich Snipping of Microdata for SEO · NavigationArts
- Nice introduction to semantic markup, labels, and Google's Rich Snippet microdata formats.
- Library "Discovery" systems (unified search)
- Bibliography and link list to library-oriented wide-ranging document search services, especially WorldCat Local, Summon, Ebsco Discovery Service and Primo Central
- Delivering information where it’s needed « The Findability blog
- David Ronnqvist describes his thesis project on making searches location-aware. This did add some relevancy, and more important, removed irrelevant results.
- Applied Relevance - Enterprise Taxonomy Experts
- Tools for Classification, Taxonomy management and ontologies, Tagging, and Faceted Search. Works with DataFacet taxonomies from Wand. Somewhat SharePoint-oriented, but with Java versions as well.
- LukeW | More on Designing in Keynote
- YUI 3: AutoComplete [beta]
- [Commerce] Site Search - Google updates its site search tool for e-retailers - Internet Retailer
- Describes the new features for Google Commerce Search, including Instant Search, merchandising tools, and inventory. The price is high, starting at $25,000 per year, and a services provider says that integration takes at least a few weeks.
- Essential SharePoint: Metadata that checks in but won’t check out
- 9 site columns that you should not add to your SharePoint document libraries – unless you want to keep them forever
- Autocompleter - Mountain View - US jobs - Google
- 34,000 wpm required... (April Fool's joke)
- reallysimplehistory - Really Simple History (RSH): Ajax history and bookmarking library - Google Project Hosting
- Frequently Asked Questions - Google APIs Console - Google Code
- Google Custom Search API - pricing for higher traffic
- Any usage beyond the free usage quota [100 queries per day] will fail if you are not signed up for billing. Once you have enabled billing, you will be billed for all requests at the rate of $5 per 1000 queries, for up to 10,000 queries per day. If you need additional quota, please request additional quota from the console.
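A quick worked version of that pricing, as a sketch. The numbers (100 free queries per day, $5 per 1000, 10,000 per day ceiling) come from the FAQ text above. The assumption that only queries beyond the free quota are billed is my reading of it and may be wrong. The function name is hypothetical.

```python
FREE_QUERIES_PER_DAY = 100      # free usage quota from the FAQ
PRICE_PER_1000_USD = 5.0        # billed rate from the FAQ
MAX_QUERIES_PER_DAY = 10_000    # ceiling before extra quota must be requested

def daily_cost_usd(queries):
    """Estimated cost of one day's queries with billing enabled.

    Assumption: the first 100 queries stay free and only the rest are billed.
    """
    if queries < 0:
        raise ValueError("queries must not be negative")
    if queries > MAX_QUERIES_PER_DAY:
        raise ValueError("above the daily ceiling; additional quota is needed")
    billable = max(0, queries - FREE_QUERIES_PER_DAY)
    return billable * PRICE_PER_1000_USD / 1000
```

Under that reading, a site running at the full 10,000 queries per day would pay for 9,900 of them.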
- How do I create a CSE that searches the entire web? - Custom Search Help
- Google will let you search the whole web, though the results will be different from those of google.com, and you can only get the first 100 results. Usage is free up to 100 queries per day, with Google-supplied advertising. After that, it's available at approximately $5 per 1,000.
- Who judges what "good" search results are?
- Mary Ellen Bates and Chris Sherman show how info pros and researchers differ from search engine optimizers in evaluating search results. For research, recall is better and two good matches on the first page is a good result. For SEOs, 80% of the first page should be useful: more emphasis on precision. Good reminder about variations in search requirements.
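The two yardsticks can be written down as a small sketch. The thresholds (two good matches, 80% of the first page) are from the summary above. The function names and the list-of-booleans representation of a results page are my own assumptions.

```python
def first_page_precision(judgments):
    """Fraction of first-page results judged useful (True)."""
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)

def good_for_researcher(judgments, needed=2):
    """Researcher view: a couple of good matches on the page is enough."""
    return sum(judgments) >= needed

def good_for_seo(judgments, min_precision=0.8):
    """SEO view: most of the first page should be useful."""
    return first_page_precision(judgments) >= min_precision
```

The same page of ten results with two useful hits passes the first test and fails the second, which is the variation in requirements the article is pointing at.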
- PHP/ir (information retrieval for PHP programmers, Ian Barber)
- Generally interesting IR site, includes links to recent presentations.
- DN.no migrating to Solr (slides)
- NHST, a Norwegian publisher, shows how they migrated their search from FAST to Solr. FAST had some lemmatization and entity extraction features that Solr doesn't have, but the Solr multi-core architecture is more flexible, and it's easy to tune and to run development versions. They're happy with Solr. The presentation is in Norwegian but there's an English translation below. [presented by Cominvent integration company]
- New Restricted-Use Secure Enterprise Search Licensing for Oracle Content Management Products (Oracle ECM Alerts - Product News & Information)
- Looks like Oracle is bundling text search with its content management systems, but only for the content in that repository. Customers have to buy a full license to index other sources. This also applies to Stellent CMS customers, or they can upgrade to 11g.
- [vendor-created] Best Full Text Search Software. Compare, reviews & ratings.
- Tables comparing language, OS, some features of search engines. Created by SearchBlox.
- List of resources: Article text extraction from HTML documents (via BeyondSearch)
- Getting meaningful text (rather than navigation and ads) is a challenge. Tomaz Kovacic rounds up information and tools on the topic, including journal articles, software, online services and demos
- RAMP Search Service
- RAMP site search indexes text, images, audio, and video with its own transcript-creator and MetaPlayer interface. It provides faceted search results, as well as federated and "blended" search, and has contextual content recommendations and interface widgets, search suggestions, spell correction, and keyword merchandising. The search is designed to work with the company's Publishing, Workflow, Video and Advertising modules, but may not depend on them. It's SaaS and the pricing model is mysterious but probably fairly high.
- The Open Vocabularies Service - SKOS Editor, repository for Controlled Vocabularies
- A collection of controlled vocabularies, a Visual Vocabulary Editor, to visualize relationships between concepts in complex classification systems. Import/export formats include SKOS, HTML, Excel
- Solr Powered ISFDB – Part #7: Simple UI
- Low-key post blogging changes to the default Solr user interface with.
- Lucene Revolution conference sponsored by Lucid Imagination
- The talks look really interesting
- How to Make Robots Cry With Faceted Navigation | ClickZ
- add to test
- Secure Search in Enterprise Webs: Tradeoffs in Efficient Implementation for Document Level Security
- Useful research - dependence of query processing time on result set size and visibility density for different classes of user. Scaled up to collections of tens of thousands of documents, our results suggest that query times will be unacceptable if exact counts of matching documents are required and also for users who can view only a small proportion of documents. We show that the time to conduct access checks is dramatically increased if requests must be sent off-server, even on a local network, and discuss methods for reducing the cost of security checks. We conclude that enterprises can effectively reduce DLS overheads by organizing documents in such a way that most access checking can be at collection rather than document level, by forgoing accurate match counts, by using caching, batching or hierarchical methods to cut costs of DLS checking and, if applicable, by using a single portal both to access and search documents.
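A toy sketch of two of those cost-cutting ideas: checking access at collection level and caching the answers. The data, names, and the in-memory "access check" are all hypothetical stand-ins, not anything from the paper.

```python
from functools import lru_cache

# Hypothetical data: which collection each document lives in,
# and which users may read each collection.
DOC_COLLECTION = {"doc1": "hr", "doc2": "eng", "doc3": "eng"}
COLLECTION_READERS = {"hr": {"alice"}, "eng": {"alice", "bob"}}

@lru_cache(maxsize=None)
def can_read_collection(user, collection):
    """Stand-in for an expensive (possibly off-server) access check."""
    return user in COLLECTION_READERS.get(collection, set())

def visible_results(user, doc_ids):
    """Keep only the hits whose collection the user may read."""
    return [d for d in doc_ids
            if can_read_collection(user, DOC_COLLECTION[d])]
```

Starting from an empty cache, filtering the three documents for one user costs two underlying checks rather than three, because doc2 and doc3 share a collection. A real system would also need to expire cached answers when permissions change.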
- Developing a SharePoint 2010 Strategy. . . or How Setting It Up and "Getting It Out There" Is Not a Strategy
- A synthesis of many cases where SharePoint has been implemented so haphazardly that the results are complex structures and frustrating user experiences. Jeff Carr identifies key components for success with SharePoint: purpose, governance, people & objectives, requirements, IA, technology, and maintenance.
- elasticsearch - blog - New Search Types
- Changing Bits: Lucene's FuzzyQuery is 100 times faster in 4.0
- Searchers Punt Early [netflix movie searching]
- Walter Underwood of Netflix posts about the minimal text that people use to search for movie titles. Examples include frank g[ehry], baron mu[nchausen], and apoc[alypto]. Adding auto-complete to the search field improved usability considerably.
- Typeahead Search With CouchDB | Couchbase Blog
- Algorithm and code for implementing autocomplete with CouchDB, based on frequency of words in the index.
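The general idea, sketched in plain Python as an in-memory stand-in (this is not the CouchDB code from the post, and the function names are hypothetical): count how often each word occurs in the indexed text, then rank prefix matches by that count.

```python
import re
from collections import Counter

def build_term_counts(documents):
    """Count every lowercase alphanumeric token across the documents."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z0-9]+", doc.lower()))
    return counts

def suggest(prefix, counts, limit=5):
    """Terms starting with the prefix, most frequent first, ties alphabetical."""
    prefix = prefix.lower()
    matches = [(term, n) for term, n in counts.items()
               if term.startswith(prefix)]
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in matches[:limit]]
```

Scanning every term per keystroke only works for small vocabularies; a sorted index or a trie would be the usual next step.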
- Organizing query completions for web search
- Creating autocomplete suggestions based on query logs, click-through patterns and reformulation, and clustering them with an appropriate label. The result is compared to web search logs, but not tested with users or click-through analysis. (PDF available from author's page)
- US patents on "autocomplete"
- A scary number of software patent applications on autocomplete.
- Advancing search query autocompletion services with more and better suggestions
- An academic approach to efficiency in generating autocompletion suggestions, and evaluating the results using only query logs as input. I think that more traditional relevance testing with user assessments and click-tracking is much more likely to be meaningful.
(PDF is available at the authors' publications pages)
- Comparative Evaluation of Reliabilities on Semantic Search Functions: Auto-complete and Entity-Centric Unified Search.
- A slightly disingenuous comparison: the semantic entity-centric search extension wins in the "maturity" section. But at least they tested autocomplete; there's not much else out there.
- Netflix REST API Reference - Autocomplete
- One of the earliest implementations of autocomplete in search, implemented as in this API call. It's a strikingly successful usability and relevance improvement for known-item searches.
- Autocomplete Me - Epic Fail
- A section of failblog showing ridiculous autocomplete examples: "google is: evil, isp, israel, god, gay, psychotic bacon out to murder my mother..."
- HCIR person - Gene Golovchinsky, Ph.D.
- My other area of interest relates to Human-Computer Information Retrieval (HCIR). I have designed, implemented, and evaluated a variety of interfaces for browsing and exploring document collections. I have focused on novel interaction techniques including dynamic hypertext, query-mediated browsing, implicit queries based on annotations, and lately (working with Jeremy Pickens), on Collaborative Exploratory Search. I have published extensively in information seeking and collaborative exploratory search, and have co-organized several workshops on information exploration (at CHI98 and SIGIR98) and collaborative search (at CSCW98, JCDL 2008, and coming up at CSCW 2010).
- Autocomplete for semantic content - live example
- Semantic autocompletion
"Autocompletion can help user with quickly finding terms from a data source. To effectively support autocompletion on large sources the algorithm and result presentation should be optimised for the specific task."
Michiel Hildebrand, Alia Amin and Jacco van Ossenbruggen investigated the design space of autocompletion and methods to support configuration of autocompletion widgets for RDF.
- AOE / TYPO3 Solr Enterprise Search
- Autocomplete design pattern
- Nice examples of auto-complete on this UI pattern site.
- Trendspotting: Rich Autocomplete in Site Search « Get Elastic Ecommerce Blog
- Nice overview of auto-complete interfaces that go beyond text to include images, prices, categories, and special offers. A comment points out that by pushing suggestions, the site might lose some user vocabulary and demand for other products.
- The TASTIER Project on Efficient Auto-Completion, Type-Ahead Search
- TASTIER is a joint research project between Tsinghua University and UC Irvine. It focuses on efficient autocompletion, type-ahead search on large data sets of various types, such as relational data, documents, semi-structured data. "TASTIER" stands for type-ahead search techniques in large data sets.
- Data Science Toolkit - free open source
- A bunch of useful extractors and connectors: binary file formats to text, geo-coordinates for street addresses and political areas, a gazetteer Geodict, IP addresses, HTML to text, to story, Boilerplate remover, entity extraction for people.
Web service or self-contained pre-installed VM.
- Nstein Exchange | OTCA 5.1 brings full semantic REST API and continues to improve its semantic coverage!
- Semantic annotation API, looks a lot like OpenCalais, by Nstein/OpenText.
- Manning: ManifoldCF in Action (Early Access edition)
- ManifoldCF is a framework for managing content access connectors. It's the Apache open-source project that sets up a standard API for modules to get data from databases, data management systems, file systems, web sites, RSS, etc. It supports authorization and authentication. The book is being written right now; buy early access to see chapters as they are finished.
- Enterprise Search Summit Fall 2011 - Nov. 1 to 3, Washington DC
- Avi Rappoport is the new program chair for this summit, and will be putting up a call for presentations shortly.
- overview of ManifoldCF
- Why Google Can't Count Results Properly (on web search, site search, or CSE)
- Danny Sullivan discusses the unreliability of Google search results counts, and quotes Matt Cutts of Google: [Adding more search terms] "causes us to go deeper through our posting lists looking for matches, which can lead to more accurate (and larger) results estimates. Other things can cause us to go deeper in finding matches, such as clicking deeper in search results. Results estimates can also vary based on which data centers or indices your query hits, as well as what language you're searching in."
Avi says: This is not a really helpful answer, especially for Google Custom Search Engine (free search service) and the paid Site Search version. It's terribly confusing in those cases.
- Webinator Search Engine v. 6 changes (Thunderstone)
- This version adds an internal federation option, XML, SOAP interface, authorization controls, customizable thesaurus, group by site and other new features. Webinator is free for development and small sites, software price is based on the number of queries, documents and features. It's also available as a hardware-software appliance and hosted SaaS. (h/t to arnoldit for the link)
- Mendeley challenge - mashup content for better open science
- [Mendeley has a] crowdsourced research database, with 70 million documents, usage statistics and reader demographics, social tags, and related research recommendations, all available under a Creative Commons license. We want to see a world in which science is mashed up… with anything. So, we are really excited to announce the Mendeley API Binary Battle. For you, this means: Build an application with this data, make science more open, win $10,001!
- Go Rogue With Enterprise Search -- InformationWeek (March 2011)
- Overview of the pitfalls of enterprise search implementations and some effective ways to avoid them. Going "rogue" is a skunkworks approach, finding people with information access problems and solving them first, then slowly scaling out. More practical examples would have been nice.
- Amazon.com: Dynamic Taxonomies and Faceted Search: Theory, Practice, and Experience (The Information Retrieval Series) (9783642023583): Giovanni Maria Sacco, Yannis Tzitzikas: Books
- Open Source Search Conference: Lucene and Solr in Government 2011
- How Far Does Semantic Software Really Go?
- Lynda Moulton describes the challenging configuration and management of integrating semantics into search engines.
"Semantic search operates best when it focuses on a topical domain of knowledge. The language that defines that domain may range from simple to complex, broad or narrow, deep or shallow. The language may be applied to the task of semantic search from a taxonomy (usually shallow and simple), a set of language rules (numbering thousands to millions) or from an ontology of concepts to a semantic net with millions of terms and relationships among concepts."
- Improving Findability Inside the Firewall
- When is personalization too personal? [Endeca e-commerce]
- Endeca's e-commerce customers say that too much personalization seems creepy. They recommend defining a small number of customer segments and tuning each segment's user experience to be the most satisfying. They can use individual information to get their visitors into the most appropriate segments.
- Semantic web user interfaces - Do they have to be ugly?
- An excellent presentation that identifies the failures of semantic information user experiences. It proposes two solutions: projects that do one thing well rather than everything badly, and shared frameworks for implementing decent UIs for semantic data.
- Architecture astronauts take over - Joel on Software
- The hallmark of an architecture astronaut is that they don't solve an actual problem... they solve something that appears to be the template of a lot of problems. Or at least, they try.
- Streams, Walls, and Feeds [free NNGroup report on RSS & Social Media]
- Thorough research on RSS feeds and social networks including Facebook, MySpace and Twitter. PDF report is free.
- Specifications/OpenSearch/1.1/Draft 4 - OpenSearch
- Why you can't just 'Google' for Enterprise Knowledge (2009)
- Google works hard to get search right, enterprises must be willing to do the same.
- Searchperience - new cloud hosted search service - built on Solr
- New SaaS for enterprise search is built on Lucene/Solr, with their own crawler and indexer. It's designed for high volume of both content and queries, as well as stemming, fuzzy search, autocomplete and facets. It seems to be developed by aoemedia, who also do Typo3 open-source web CMS.
- Guidelines for Web-based naming
- Term-based thesauri ... Change over time (efoundations blog)
- Without requiring too much knowledge of the intricacies of RDF, this post talks about how to handle changes in concepts and related changes in the labels that explain them. It's stuff like this that is slowly making semantic web concepts practical and useful. Link via @bradleypallen
- [#LUCENE-2573] Tiered flushing of DWPTs by RAM with low/high water marks - ASF JIRA
- Lucene developers are adding code to index more efficiently by flushing a memory (RAM) cache when it gets too full. This should apply to Solr too, but it's still in a Lucene branch, not the main trunk.
- Using internal rank metrics in external search engines - Comperio Search Nuggets
- Simply put, you have now leveraged a web site’s internal data in your external search engine. Consequently adding to the precision of the results using a previously unreachable metric.
- Hash URIs | Jeni's Musings
- More on dynamic URIs with # (hash) and #! (hash bang), and some basic principles of accessibility and robustness.
- Wallaby | Adobe Flash FLA to HTML - Adobe Labs
- "Wallaby" is the codename for an experimental technology that converts the artwork and animation contained in Adobe® Flash® Professional (FLA) files into HTML. This allows you to reuse and extend the reach of your content to devices that do not support the Flash runtimes. Once these files are converted to HTML, you can edit them with an HTML editing tool...
- February Redux - Enterprise Search London (London, England) - Meetup
- boilerpipe - removes clutter around web page content (java code library)
- The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings. Extracting content is very fast (milliseconds), just needs the input document (no global or site-level information required) and is usually quite accurate. Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0. The algorithms used by the library are based on (and extending) some concepts of the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010 -- The Third ACM International Conference on Web Search and Data Mining New York City, NY USA. Click here to read the paper and the presentation slides. A video of the presentation is freely available on Videolectures.net (turn speaker balance to the left to improve audio quality). Commercial support is available through Kohlschütter Search Intelligence.
- European HCIR workshop - 4th July 2011
- Human-computer Information Retrieval meeting, in Newcastle, England - inspired by HCIR in the US.
- Intranet Focus » Intranet Roadmap Workshops start in the UK
- Open Bibliographic Data Guide
- What makes relevance such a challenge in the enterprise? (sharepoint & fast search blog)
- Nice overview of why internal search is often worse than web search: mainly that there's little meaningful linking within an intranet, little incentive to make a site easily searchable, and security issues with access control. The post recommends realistic expectations, not indexing low-value content, looking at third-party relevance tools, offering scope or zoned search, and tagging content.
- HTML5 specification, w3
- Complex and difficult to read, though I can tell they're trying to make it easier.
- HTML5 - A Step Forward Towards Semantic Web
- Nice introduction to the new structural tags in HTML5: section, article, aside, header, hgroup, footer, and nav, and new content tags: figure, video, audio, canvas.
- UIA in the Cloud: A Microsoft Azure Odyssey (Attivio)
- Helpful blog post about the steps needed to create a development and deployment Java server system within the Microsoft Azure cloud.
- In the AJAX Element, Custom Search Engine only, passing the parameter "google.search.Search.FILTERED_CSE_RESULTSET" tells it to get the MAXIMUM number of results: ten on each page, up to ten pages, 100 results.
For other API calls, the number can be 1 to 8, which will return up to 8 results pages for a MAXIMUM 64 RESULTS.
- BOSS API Guide - Yahoo Development Network
- YDN's BOSS (Build Your Own Search Service) version 2 developer guide. This sends queries to the Yahoo/Bing search engine for web, images, and news results, with very flexible results. Some free queries for development, then price per result.
- Palantir Technologies » Blog Archive » Palantir: search with a twist (part two: realtime indexing and security)
- General discussion of access control and relevance ranking, including not even hinting at the existence of documents that are invisible to the user. Inverse document frequency can give hints, so they compute it on metadata but not on body text; they say it doesn't hurt relevance.
- Palantir Blog: search with a twist (part one: memory efficiency)
- Palantir Business Intelligence uses the Lucene core search engine, but not for simple web-style search. They say "We want to leverage the inverted index capabilities of Lucene, but our data access patterns are a bit different than the typical use case: we need things like pervasive range-querying, different types of relevance, and dynamic views of the data based on security constraints." But for queries with binary answers, a document is in the result set or is not, they found a more efficient way than Lucene 2005's memory usage approach.
- Is Google Custom Search Influencing Google Web Search?
- INFOdocket - new service from the founders of ResourceShelf and DocuTicker
- Information Industry News + New Web Sites and Tools From Gary Price and Shirl Kennedy - definitely a site worth tracking.
- Kris Collins | University of Bedfordshire: Google Blue Box Vs Funnel Back search
- Very short comparison of the Google Mini enterprise search engine appliance and the Funnelback appliance, which wins on price and functionality.
- Jeff's Search Engine Caffè: Google "Recipe View" Search Disappointing and Dangerous
- Jeff finds the Google Recipe View data structure lacking, as it mainly filters on ingredients rather than facets such as expertise, health, cuisine, technique, etc. It's also difficult for small sites and blogs to generate the rich snippets metadata. He thinks the system is "under cooked and lacks seasoning".
- Companies Want Better Analytics, Just Not Right Now
- Describes a Ventana Research report: companies see analytics as a way to make more money, but only 24% are planning analytics-based changes. Budget, infrastructure, and inertia seem to block change.
- Information Filtering and Information Retrieval: Two Sides of the Same Coin (Citeseer X)
- Classic 1992 article talks about filtering unstructured data and relates it to the then-current understanding of information retrieval.
- Remedies for Search Bias (Ben Edelman re Google manual biases)
- Uses examples of airline reservation systems and Windows browsers to show how regulations could curb Google's tendency to promote its own services. It seems pretty reasonable to me, no techno-paranoia or obvious cluelessness.
- Elastic Search: Distributed, Lucene-based Search Engine « Sematext Blog
- Otis Gospodnetic interviews Shay Banon of ElasticSearch, a large-scale data-grid level distributed search solution using the most modern architecture approaches. Some comparisons to Solr, but this is a completely different codebase.
- Tempo :: The tiny JSON rendering engine by TwigKit
- Greplin Lets You Find Your Stuff in the Cloud
- My short article for InfoToday NewsBreaks discusses the functionality, interface and uses for the Greplin personal cloud search engine. It's built on Lucene and other scalable server software, and the OAuth protocol for accessing accounts. Useful service, a cute company story, but they may find fierce competition in a few months.
- Moving from Oracle Text to Solr/Lucene @ Digital Collections Blog 2009
- Problems faced by the DC-x publishing/archiving/text analytics company with Oracle Text in 2009 included stability and scale, missing facets, support issues, database load, query syntax, and expense. The transition to Solr seems to have been reasonably smooth.
- OAuth Will Murder Your Children
- OAuth clients are asking for write access when they don't need to be: this post suggests adding checkboxes to clarify choices.
- How Smart Do We Want Search To Get? (Gord Hotchkiss)
- Extrapolating about search getting smart to the point of precognition brings up important issues of control and privacy. Gord Hotchkiss, who has done a lot of behavioral research, thinks that humans would rather be guided by search than have our choices taken away. I suspect that he's right, as he points out that even B2B procurement automation never quite took off.
- OverDrive, Bluefire, and the EPUBlic Library
- A nice long discussion, Mac and iOS oriented
- Auto Complete with Redis
- Example implementation of autocomplete lookup using the Redis data structure framework, often used in Ruby on Rails.
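A common way to do prefix lookup on a sorted set is to store every prefix of every word, plus the full word with a terminator, and then scan forward from the prefix's position. The sketch below assumes that approach and may differ from the linked post. It uses no Redis calls at all: a sorted Python list with `bisect` stands in for the sorted set, and `build_index` and `complete` are hypothetical names.

```python
import bisect


def build_index(words):
    """Return a sorted list holding every prefix of each word, plus 'word*'."""
    entries = set()
    for word in words:
        word = word.lower()
        for i in range(1, len(word) + 1):
            entries.add(word[:i])
        entries.add(word + "*")  # '*' marks a complete word rather than a prefix
    return sorted(entries)


def complete(index, prefix, limit=10):
    """Scan forward from the prefix's position, collecting '*'-terminated entries."""
    prefix = prefix.lower()
    start = bisect.bisect_left(index, prefix)  # stand-in for a rank lookup
    results = []
    for entry in index[start:]:
        if not entry.startswith(prefix):
            break  # walked past the block of entries sharing this prefix
        if entry.endswith("*"):
            results.append(entry[:-1])
            if len(results) == limit:
                break
    return results


index = build_index(["bar", "baron", "barn", "foo"])
matches = complete(index, "bar")
```

The trade-off is index size for lookup speed: each word costs one entry per character, but a lookup touches only the contiguous run of entries that share the prefix.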
- What is the best open source solution for implementing fast auto-complete? - Quora
- Excellent information on autocomplete by people who have implemented it on Quora, Facebook, etc.
- Piwik - Web analytics - Open source
- A free open-source alternative to Google Analytics, available in 40 languages.
- The ROI of User Experience with Dr. Susan Weinschenk (video)
- A spectacular drawing animation/video about how to see the User Experience as a way to avoid expensive errors and duplication. It would be great to show a doubter.
- Taxonomy Fairy Tales (video)
- Patrick Lambe and Matt Moore discuss why internal and enterprise search engines don't work as well as web search, due to a lack of meaningful hypertext links and different expectations (I agree). They mention the value of taxonomies for improving search, and the fact that there is no Taxonomy Fairy or magic automated system to organize things. And they recommend that most of the taxonomy come from the bottom up, by categorizing stuff and seeing how it fits together. Very engaging and I'm glad that I agree with them.
- Google Search Appliance - Open Source Projects
- 6.2 Administering Crawl for Web and File Share Content: Introduction - Google Search Appliance
- For both scheduled crawls and continuous crawls, documents usually appear in search results approximately 30 minutes after they are crawled. This period can increase if the system is under a heavy load, or if there are many non-HTML documents.
- Comparing the Sensitivity of Information Retrieval Metrics - Microsoft Research
- "Information retrieval effectiveness is usually evaluated using measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP) and Precision at some cutoff (Precision@k) on a set of judged queries. Recent research has suggested an alternative, evaluating information retrieval systems based on user behavior. Particularly promising are experiments that interleave two rankings and track user clicks. According to a recent study, interleaving experiments can identify large differences in retrieval effectiveness with much better reliability than other click-based methods. We study interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity and agreement. To detect very small differences in retrieval effectiveness, a reliable outcome with standard metrics requires about 5,000 judged queries, and this is about as reliable as interleaving with 50,000 user impressions. Amongst the traditional measures, NDCG has the strongest correlation with interleaving. Finally, we present some new forms of analysis, including an approach to enhance interleaving sensitivity."
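For readers who haven't met the judged-query metrics named in that abstract, here is a minimal sketch of Precision@k and NDCG for a single query. It assumes graded judgments as small non-negative integers and the common exponential gain `2**grade - 1`, and it derives the ideal ordering from the judged list itself. The paper's exact setup may differ, and the function names are mine.

```python
import math


def precision_at_k(grades, k):
    """Fraction of the top-k results with a relevance grade above zero."""
    return sum(1 for g in grades[:k] if g > 0) / k


def dcg(grades, k):
    """Discounted cumulative gain: exponential gain, log2 rank discount."""
    return sum(
        (2 ** g - 1) / math.log2(rank + 1)
        for rank, g in enumerate(grades[:k], start=1)
    )


def ndcg(grades, k):
    """DCG normalized by the DCG of the same grades in ideal (descending) order."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0


# Grades of the results in ranked order for one query (illustrative data only).
ranking = [0, 1]
score = ndcg(ranking, k=2)
```

Averaging these per-query scores over the judged query set gives the system-level numbers that the abstract compares with interleaving.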
- Reflections on James Kalbach's course on Faceted Search & Beyond
- Tony Russell-Rose describes James Kalbach's course on Faceted Search & Beyond, a day-long workshop on the design and content considerations.
- What does the user see? [filter UI] Neven Mrgan's tumblr
- Vivid example of what *not* to do when choosing facets and values. User vocabularies are the way to go!
- Greplin Is A Personal Search Engine For Your Online Life (Tech Crunch)
- Greplin is like an extension of desktop search to all one's online accounts. Give it an account name and password, and it will use OAuth to index all the data you have in Gmail, Twitter, Facebook, and LinkedIn. The article is from last year, the app is now public. It's cloud hosted and index updates are from every 20 minutes to a whole day.
- InfoCamp Berkeley '11 - March 5-6
- An unconference combining ui, ux, information management, search, libraries, and related fields. I went last year, it was good being around so many enthusiastic people.
- The Curse of Mental Accounting | Wired Science | Wired.com
- 10 Things You Should Have Learnt from the JC Penney SEO Fiasco
- Adding Google to SharePoint? < Real Story Group Blog
- Mentions the cost of both the Google Appliance and FAST search engines, and suggests being open to other options.
- Faceted search: choosing good facet suggestions « FT Tech Blog
- Describes many questions encountered when designing faceted search including whether to show dead-ends, options to broaden the search, hierarchies and sub-categories, ranges vs. discrete choices, etc. The screenshot examples are particularly useful.
- Deciphering Discovery by Greg Notess (Online Magazine, Jan. 2011)
- Greg Notess, a librarian and search engine expert, surveys the new research discovery services (not litigation discovery) from ProQuest, Ex Libris, OCLC, and EBSCO. These are aggregating indexes that attempt to provide Google-style access across library collections and licensed databases. His personal experiences with ProQuest Summon were mixed at best.
- Interaction 11 conference report: day 3
- Interaction design / user experience / usability / interface / information architecture: it's all good.
- Documentation/Developer how to guide - OpenSearch
- Documentation/Frequently asked questions - OpenSearch
- Google Code Playground
- Programmatically Creating Custom Search Engines - Google Custom Search APIs and Tools - Google Code
- 2011 GCSE (Labs) Getting Started with the API - JSON/Atom Custom Search API - Google Code
- 2011 Using REST to Invoke the API - JSON/Atom Custom Search API - Google Code
- Google changed the search API fairly radically, so this is an interface between the old AJAX version and the new more RESTful version. I'm not sure if it can call the web search engine, now that the API is deprecated. (FWIW, I found the old interface opaque and baffling, will see how the new one works out).
- Any GCSE
- EnterPrise Search in Sharepoint 2010 « sharepointstories
- Spire - new BigData & fulltext search engine
- Provides automated distributed storage of huge amounts of data and near-real-time search access. It's built on Hadoop and HBase (with some MapReduce), and uses both SQL and Lucene-style query languages, returning JSON objects. It can work in public or private clouds.
- Thinking Sphinx - Ruby connector to Sphinx
- Open-source interface connects Ruby ActiveRecord objects and the Sphinx search engine.
- About | Sphinx (open source search engine)
- Sphinx is an open source (GPLv2) search engine that works across platforms and indexes content in SQL databases, NoSQL, and files. It does not have a web crawler or robot spider. It's just a code library, so has no user interface, just API calls and SQL queries. It can scale up dramatically, indexing billions of documents, with search distributed among many machines, easily handling 3,000 queries per second, and it's the search for Craigslist. Developers offer support and implementation services.
- Search and Business Intelligence: The Humble Inverted Index Wins Again
- My article on extending full-text search into agile business intelligence tools, with examples and notes on the search vendors moving into this space.
- Make your search engine seem psychic - Enterprise Search
- Miles Kehoe points out that Google, Factiva and LexisNexis work hard every day at improving their results by recognizing patterns and tuning results, and enterprise search admins should do so as well.
- An Intro To The Semantic Web: Why You Need To Know About It Sooner Than Later | Web Central Station
- A friendly and colorful introduction to the semantic web, where automated tools can search and group chunks of data. Examples include disseminating your new address to all your friends and finding preferred restaurants away from home. Includes links to good resources.
- Enterprise Search | Coveo Releases Version 6.5 of Its Enterprise Search Platform
- Coveo's v6.5 seems to be aiming at Business Intelligence, and is praised by Jim Coleman at Netezza. Use cases are Customer Support and Business Process Management for sales and marketing.
- Product Out-of-Stock Checklist « Get Elastic Ecommerce Blog
- User interface and SEO advice for dealing with out-of-stock items, whether temporary or permanent.
- Greg Nudelman's Search Matters articles on UX Matters
- Immersive Mobile E-Commerce Search Using Drop-Down Menus | UX Magazine
- Important article starts from first principles of good UI design for cell phones and mobile devices and goes on to show examples, working from designs with lots of navigation chrome to minimalist layouts with a single menu for faceting and filtering.
- the all-thing
- Whistlepig: minimalist real-time full-text search
- "Whistlepig is a minimalist realtime full-text search index. Its goal is to be as small and feature-free as possible, while still remaining useful, performant and scalable to large corpora. "
Whistlepig does fairly complex queries, including parentheses. But it has no relevance ranking, it shows the results with the most-recently-added items first.
- LinkedIn Quick Tip: How to Navigate the New Polls CIO.com
- Breaking the Web with hash-bangs
- Use cases of faceted search for Apache Solr
- A good practical discussion of how faceted metadata search works, how to design facets and set values in Solr, and how to use the same approach for autocomplete, trending keywords, and personalized alerts via RSS.
- Solandra - Solr + Cassandra
- Integrates Solr and Cassandra. From the FAQ, when to use Solandra?
"– You find you struggle with managing Solr on more than one box. – You already use Cassandra and want to add free-text search to your application. – You need to maintain a huge number of distinct indexes. – You need to scale a massive index (many millions of documents). – You want better real-time search semantics."
- Organize and display refinements to streamline the site search experience (SLI)
- Very helpful overview of user experience issues in search results faceting. It's a vendor blog post, and they use "refinement" instead of "facet" (an example of a site-specific synonym), but the meaning is the same.
- 11 Trends in Enterprise Search for 2011 (coveo blog)
- Vendor-sponsored webinar foresees Enterprise Search engines indexing all data across repositories.
- Recommind Introduces Decisiv Search QuickStart Edition
- Enterprise search for law firms now comes pre-configured to work with document management systems, email, and time-and-billing systems.
- URL Design — Warpspire
- OpenSearchServer (on Lucene) - search vendor and service provider
- A full-featured search engine built on Lucene, this is free open source software. The extensive web admin interface covers the settings in the Lucene config files, from crawl processes to spellcheck settings. The group creating the software is available for technical support and consulting, they also offer a hosted option.
- Expected Reciprocal Rank for Graded Relevance | Yahoo! Research
- Goes beyond Discounted Cumulative Gain with a new metric, Expected Reciprocal Rank (ERR), which accounts for very relevant documents ranked below others. The study compared ERR with other metrics to see which best correlates with clickthrough events.
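As I understand the metric, ERR models a user reading down the list and stopping at the first result that satisfies them, so a highly relevant document counts for less when strong results already sit above it. A minimal sketch under that reading follows. The 0 to 4 grade scale and the function name are my assumptions, not taken from the page.

```python
def expected_reciprocal_rank(grades, max_grade=4):
    """ERR for one ranked list of integer relevance grades in 0..max_grade."""
    err = 0.0
    p_reach = 1.0  # probability the user gets as far as the current rank
    for rank, grade in enumerate(grades, start=1):
        # Map the grade to a probability that this result satisfies the user.
        p_satisfied = (2 ** grade - 1) / (2 ** max_grade)
        err += p_reach * p_satisfied / rank
        p_reach *= 1 - p_satisfied
    return err


# A perfect result at rank 2 is worth more under an irrelevant result
# than under another perfect one (illustrative grades only).
under_irrelevant = expected_reciprocal_rank([0, 4])
under_perfect = expected_reciprocal_rank([4, 4])
```

With a DCG-style metric the second result would add the same amount in both cases; here its contribution depends on what is ranked above it.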
- [Academic] Publications | Yahoo! Labs
- Researchers published articles mainly in journals and proceedings of conferences, covering topics from clickthrough ranking to contextual advertising through faceted browsing with user tags
- [Academic] Publications by Google Researchers
- Most of these papers are in academic journals and proceedings of conferences, and they range across the research spectrum, from IR algorithms to education to cryptography.
- Webscope from Yahoo! Labs (curated collection of testing data sets)
- The Yahoo! Webscope™ Program is a reference library of interesting and scientifically useful datasets for non-commercial use by academics and other scientists
- Seekquarry / Yioop :: Open Source Search Engine Software
- Alpha version of a PHP-based crawler and search engine, architected to be distributed on multiple nodes. It will store the crawled URLs and can also use ARC, ODP RDF, Wikipedia XML. Mainly administered via a web UI, though there are config files. Example web search engine at yioop.com. The license is GPLv3.
- lucene-stanford-lemmatizer - GitHub
- This code library goes beyond stemming to tag each word or phrase with the part of speech (English only). This allows the system to choose the correct word stem for matching, so the gerund "marking" is not treated the same as the name "Malkrk". Knowing the part of speech allows much smarter stopword functionality: it will treat at least part of "The The" as a noun. Currently in Java, with English rules, under the GPL.
- dtSearch With Native .NET 4 / 64-Bit
- dtSearch is a workhorse, it's been around forever and it's still fast and cheap. This release is 64-bit native, with support for faceted search involving millions of items. The new .NET version 4 SDK has access control, and a sample Microsoft Azure cloud implementation. There are web, intranet, desktop and CD/DVD search products.
- SharePoint Partners Can Plug Taxonomy Gaps - [paid report from] Forrester Research
- Just knowing that MOSS is bad at metadata & taxonomy management is useful.
Summary: "Microsoft Office SharePoint Server 2007 (MOSS 2007) has lousy metadata capabilities. If tagging content and managing taxonomies in MOSS 2007 are important to you— and they should be — consider buying a third-party solution to meet key requirements. SharePoint Server 2010 ships with significant metadata enhancements that could spur enterprises to upgrade. Yet even with these improvements, sought after semantic capabilities like autoclassification and entity extraction remain gaps in the SharePoint platform. Use this report as a guide to partners that can plug these important SharePoint 2007 and SharePoint 2010 gaps." $499 study
- InfoFinder Enterprise Search
- Enterprise search vendor with connectors to file servers, email, groupware, databases, CRM and ERP applications. This search engine is optimized for Windows
- Seven ways to think like the web « Jon Udell
- Project SIKULI
- Sikuli is a visual technology to automate and test graphical user interfaces (GUI) using images (screenshots). Sikuli includes Sikuli Script, a visual scripting API for Jython, and Sikuli IDE, an integrated development environment for writing visual scripts with screenshots easily. Sikuli Script automates anything you see on the screen without internal API's support. You can programmatically control a web page, a Windows/Linux/Mac OS X desktop application, or even an iphone or android application running in a simulator or via VNC.
- Pannous (search engine consulting)
- Provides customized enterprise search using Lucene, Solr, ElasticSearch and Fast.
- Searching SAP Data with Duet Enterprise 1.0 and SharePoint Server 2010 - Microsoft SharePoint Team Blog
- Duet Enterprise connects SharePoint BCS to the SAP data source. It should also work for FAST.
- Why Jetwick moved from Solr to ElasticSearch
- Peter Karich talks about migrating his near-real-time Twitter search engine Jetwick to ElasticSearch for scaling up. He considered Solandra, but it didn't solve his indexing performance problems. Very practical and useful.
- authentication architecture in a federated enterprise
- Security access control and authentication discussion, specifically about federated search, from a Canadian government science library. I find this particularly useful because it's dealing with a real case, with concrete examples.
- CHI Conversations | BayCHI | Jef Raskin
- SolrHQ | SolrHQ make using Solr search easy
- A Solr search hosting service.
- Web Analytics: Frequently Asked Questions And Direct Answers | Occam's Razor by Avinash Kaushik
- This is specifically for e-commerce and branding campaigns, but it's got lots of good advice for other situations.
- Top 5 Metadata Resources « Digital Asset Management
- Helpful annotated list of sites: controlledvocabulary.com, iptc.org, dpbestflow.org, photometadata.org, and adobe.com/products/xm
- Google Patents, Updated (seobythesea.com)
- Bill Slawski at SEO By The Sea has good analysis of the technical aspects of web search. In this listing, he groups links to Google's known patents by category with annotations and partners. Enlightening.
- Morgan & Claypool Publishers - Synthesis Lectures on Information Concepts, Retrieval, and Services - 1(1):1 - Abstract
- Foundations and Trends in Information Retrieval
- Asking Questions About Internet Behavior :: UXmatters
- Lou Rosenfeld analyzes one element of UX interviews, about Internet familiarity. The methods used in this case are appropriate for all kinds of usability testing.
- Endeca Resource Center | Whitepapers, Case Studies, Analyst Reports and more
- A zillion PDFs from this leading faceted metadata search engine vendor.
- FBI set to roll out advanced security search engine
- US government's N-DEx search engine federates access to disparate sources including incident, arrest, conviction, probation and parole records, aliases, tattoos, and other records, mainly via the National Crime Information Center, Interstate Identification Index and OneDOJ services. It does not include intelligence data. It is designed to provide Law Enforcement Agencies with search visualization and mapping tools, data analytics, trends and hotspots, alerting subscriptions, etc. Security is designed with basic access controls, and Green, Yellow, and Red record flags. There are about 200 million records planned for the system and almost as many acronyms. It's being developed by Raytheon.
- Providing Structured Data - Google Custom Search APIs and Tools - Google Code
- Add DataObject metatag attribute name=pubdate to testing
- Making AJAX Applications Crawlable [hash fragment URLs] - Google Code
- Google's new system for making AJAX-created pages crawlable and indexable uses the #! symbols (hash and exclamation point) to indicate a "pretty URL". The robot crawler converts that to _escaped_fragment_ and sends it to the site as a request; the site replies with a simple flat HTML file based on the parameters. In search results, the symbols go back in and the browser requests the AJAXified version with the parameters. This is why we sometimes see #! in Twitter URLs...
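The #! to _escaped_fragment_ rewrite can be sketched in a few lines of Python. This is only an illustration of the mapping: the function names are made up, and the set of bytes that get percent-encoded follows my reading of Google's published scheme, so treat it as an assumption and check the specification before relying on it.

```python
from urllib.parse import urlsplit, urlunsplit


def _escape_fragment(value: str) -> str:
    """Percent-encode the bytes a crawler is assumed to escape in the fragment."""
    out = []
    for b in value.encode("utf-8"):
        # Assumed escape set: control chars and space, '#', '%', '&', '+',
        # and everything from 0x7F upward.
        if b <= 0x20 or b in (0x23, 0x25, 0x26, 0x2B) or b >= 0x7F:
            out.append("%{:02X}".format(b))
        else:
            out.append(chr(b))
    return "".join(out)


def to_escaped_fragment(pretty_url: str) -> str:
    """Map a '#!' URL to the '_escaped_fragment_' URL the crawler would request."""
    parts = urlsplit(pretty_url)
    if not parts.fragment.startswith("!"):
        return pretty_url  # no hash-bang, nothing to rewrite
    extra = "_escaped_fragment_=" + _escape_fragment(parts.fragment[1:])
    # Append to an existing query string if there is one.
    query = parts.query + "&" + extra if parts.query else extra
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(to_escaped_fragment("http://example.com/page#!key=value"))
```

The site then has to answer the rewritten URL with a static HTML snapshot of whatever the AJAX state would have rendered.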
- Intranet Focus » Search-based Applications (book review)
- Positive review of the book "Search-based Applications" which covers searching databases with unstructured text tools.
- Findability Inside The Firewall – Still Trying To Find The Information We Need - Highly Competitive - software industry insights
- ElasticSearch at berlinbuzzwords 2010
- Berlin Buzzwords 2011 [conference] - Search, Store, Scale
- Open source software meeting focusing on scalable search, data mining in the cloud and NoSQL. June 6 and 7, Berlin, CFP due March 1.
- Lucene Revolution 2011 User Conference | San Francisco May 23-26 2011 | Lucid Imagination
- For users and developers of Lucene/Solr and related open-source software projects. Sponsored by the search services vendor Lucid Imagination. CFP is open until March 2.
- Ajax Incremental Search Revisited [Exorbyte]
- "How Structured Data Matching Engines Enable a Revolutionary Incremental Search Interface | White Paper" Vendor-focused but detailed discussion of implementing search auto-complete.
- All An Autocomplete Can Be and More « Exorbyte Blog
- Vendor Exorbyte talks about autocomplete issues (and touts its own solutions).
- Optimizing Solr performance and User Experience for eCommerce
- Scrapy at a glance (scrapy.org)
- "Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival."
It's an open-source Python library that runs as a robot spider, uses XPath to extract content from HTML, and runs on Unix, Linux, Windows, and Mac OS.
- Open source intranet search over millions of documents with full security
- Case study (summary) of a search installation using Xapian open-source search, storing file permissions and ACLs for security.
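One common way to use stored ACLs is to keep the permitted users and groups next to each indexed document and drop anything the searcher cannot read at query time. The sketch below shows that idea in plain Python. It does not use Xapian and is not taken from the case study; the IndexedDoc type, the field names and the substring matching are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    path: str
    text: str
    allowed: frozenset  # users or groups permitted to read this file (hypothetical field)


def secure_search(index, query, principals):
    """Return paths of docs that match every query term AND are readable by the searcher."""
    terms = query.lower().split()
    who = set(principals)
    hits = []
    for doc in index:
        if not (who & doc.allowed):
            continue  # searcher has no matching user or group: hide the doc
        body = doc.text.lower()
        if all(term in body for term in terms):
            hits.append(doc.path)
    return hits


if __name__ == "__main__":
    index = [
        IndexedDoc("/hr/salaries.xls", "salary review 2010", frozenset({"hr"})),
        IndexedDoc("/pub/handbook.pdf", "staff handbook salary bands", frozenset({"hr", "staff"})),
    ]
    print(secure_search(index, "salary", ["staff"]))
```

Filtering inside the query, rather than after ranking, keeps result counts and paging consistent for each user.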
- Stack Overflow Search — Now 81% Less Crappy [SO Blog]
- Talks about switching the search engine at StackOverflow sites (question-answering communities) from SQL Server to Lucene.NET - the C# version of Lucene. The reasons are to take advantage of their web farm, reduce database load, better control search results, avoid external dependencies.
- Lucene and Solr Search Resource Center | DZone
- DeveloperZone sponsored by Lucid Imagination
- Apache Solr: Get Started, Get Excited! | Javalobby
- Another good overview of what Lucene/Solr can do, with examples. Sponsored by Lucid Imagination.
- Bing Heads up on tag optimization (SEM 101) - Webmaster Center blog - Site Blogs - Bing Community
- Bing's extremely wordy take on page meta tags includes xhtml, doctype, title, meta description, meta keywords (spelling variations OK), equiv=content-language, equiv=content-type, robots, and also canonical links.
- Meta tags - Google Webmaster Tools Help
- Google's meta tags: title, description, robots, google / notranslate, google-site-verification, equiv=content-type and equiv=refresh. Note the absence of "keywords" -- this indicates that Google will not index them, but it probably checks them for spam.
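To check which of these tags a page actually carries, a small audit script is enough. The sketch below uses only Python's standard html.parser module; the class and function names are my own, and it simply collects the title plus any meta tags keyed by name or http-equiv, without judging how a search engine would treat them.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the <title> text and <meta> name / http-equiv content pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("http-equiv")
            content = attrs.get("content")
            if key and content is not None:
                self.meta[key.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def audit_meta(html):
    """Return (title, {meta key: content}) for one HTML document."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.title.strip(), parser.meta


if __name__ == "__main__":
    page = ('<html><head><title>Example</title>'
            '<meta name="description" content="A sample page">'
            '<meta name="robots" content="noindex"></head></html>')
    print(audit_meta(page))
```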
- SearchBlox on AWS Elastic Beanstalk
- SearchBlox offers a remote cloud-hosted search service, compatible with Amazon Web Services Elastic Beanstalk.
- SharePoint Technology Conference | San Francisco | Feb 7-9, 2011
- Several sessions on various flavors of SharePoint search
- WCC Smart Search and Match - ELISE software
- Primarily identity and employment matching, but also marketed for e-commerce and enterprise content management search.
- Enterprise Search Grows Up (Coveo Blog)
- Search vendor blog posits Enterprise Search 2.0 as a single aggregated index of all enterprise content, and a single point of query.
- Sinequa Enterprise Search Solutions | Enterprise Features
- Business Search within your Business Context | Company | Sinequa.com
- flow support
- open data search analysis for Australian university
- Adobe Omniture Search&Promote
- I think, though I'm not positive, that this is the current iteration of the Atomz hosted site search engine. Very e-commerce oriented.
- Zend Framework: Documentation: Overview
- Zend_Search_Lucene is a general purpose text search engine written entirely in PHP 5, supporting Lucene versions 1.4 - 2.3.
- Liip Blog // Why a project switched from Google Search Appliance to Zend_Lucene
- Interesting discussion on the limits of the Google Search Appliance and a switch to Zend_Lucene. They would probably have used Solr if it had been available.
- Web Analytics and User Experience: An Interview with Louis Rosenfeld
- Lou Rosenfeld (co-author of Information Architecture for the World Wide Web) talks about the value of search analytics in particular and going beyond canned reports in general.
- ldspider - crawling framework for the "linked data" web
- A Java library for following links to servers with specially formatted metadata: RDF/XML, N-TRIPLES and N-QUADS. Output can also be written in SPARQL.
- Web Search: How much are real time search queries worth?
- Estimating value by analyzing real-time queries and collating them with advertising.
- Enterprise Trends: Contrarians and Other Wise Forecasters - Enterprise Search Blog
- Lynda Moulton talks about using enterprise search in information management, as it can tie together many silos and incompatible systems.