As of January, 2012, this site is no longer being updated, due to work and health issues
Lucene/Solr Meetup, June 2009
Notes by Avi Rappoport, Search Tools Consulting
Meetup info
These are my sketchy notes; there was so much good stuff that I did not get it all.
-- Avi, June 5, 2009
Changes in upcoming Solr 1.4 (Grant Ingersoll)
- a new logo (see right)
- new character filters (from Lucene 2.4)
- faster faceting methods - FieldCache (from Lucene)
- improved numeric range calculations (see TrieRange below)
- Java-based replication with Solr request handlers (see Lucid blog post)
- StatsComponent - returns XML for each field
- Term vector component - for proximity and other interesting stuff (see Lucid blog post)
- Duplicate detection during indexing - RemoveDuplicates Token
- Better Arabic handling (from Lucene)
- CharFilter - normalize chars before tokenizing, like a lightweight pipeline
- Solr Cell (a.k.a. Content Extracting Library, aka ExtractingRequestHandler) - wrapper around Tika
- Clustering - grouping similar docs - JIRA framework for plugging in modules
- first implementation using carrot2: concerned with clustering short extracts of text from search results
- Configure deletion policy (possibly from Lucene)
- SolrJS - jQuery parsing
- VelocityResponseWriter - hooks to Velocity templates for interface without middleware or app server
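The CharFilter idea in the list above (normalize characters before the tokenizer sees them) can be sketched outside Solr. This is a minimal Python illustration, not Solr's API: the mapping table and the names `char_filter`, `tokenize` and `analyze` are made up for the example.

```python
import re

# Hypothetical mapping table; a real deployment would configure its own.
CHAR_MAP = {
    "\u00e9": "e",   # e with acute accent
    "\u00fc": "u",   # u with diaeresis
    "\u00df": "ss",  # German sharp s
}


def char_filter(text, mapping=CHAR_MAP):
    """Replace each mapped character before any tokenizing happens."""
    return "".join(mapping.get(ch, ch) for ch in text)


def tokenize(text):
    """Very simple tokenizer: lowercase, then split on non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def analyze(text):
    """Lightweight pipeline: char filter first, tokenizer second."""
    return tokenize(char_filter(text))
```

Because the filter runs first, the tokenizer never has to know about accented forms; it only sees the normalized text.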
Near-real-time indexing - Jason Rutherglen & Jake Mannix (LinkedIn)
- Historical model was batch indexing
- For faster indexing (several inserts/deletes per second), I/O efficiency becomes a concern
- Relevance matters (unlike, say, in Twitter Search)
- Have contributed Lucene patches, especially for deletes and flushing
- JIRA IDs 1313, 1483, 1231, 1292 were all that I caught
- NearRealTimeSearch - Lucene
- RealTimeSearch - Solr
- Other LinkedIn open source projects: Zoie (extensions to Lucene), Bobo (faceted/parametric search tuned for high performance), Voldemort (high performance distributed key-value storage)
Payload Efficiency - Michael Busch (IBM)
- Current inverted index contains:
- dictionary (list of words)
- posting list (documents associated with each word)
- Payloads - optional additional metadata, stored as byte arrays, attached to individual term positions in the posting lists
- Current Lucene store (document data) is slow and sequential
- Payloads and column-stride fields could be much more efficient
- other stuff I didn't take good notes on (sorry)
- Check out the blog post from the ApacheCon about payloads
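To make the dictionary / posting list / payload vocabulary above concrete, here is a minimal Python sketch of an in-memory inverted index. It is not Lucene's file format or API. The assumption, labeled in the code, is that each term occurrence carries a payload holding a boost packed as a 4-byte float; the boost rule (capitalized tokens count double) is invented purely for illustration.

```python
from collections import defaultdict
import struct


def build_index(docs):
    """Build term -> list of (doc_id, position, payload) postings.

    The payload is a hypothetical per-occurrence boost packed as a
    4-byte float: 2.0 for capitalized tokens, 1.0 otherwise.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, token in enumerate(text.split()):
            boost = 2.0 if token[0].isupper() else 1.0
            payload = struct.pack(">f", boost)  # payloads are just bytes
            index[token.lower()].append((doc_id, position, payload))
    return index


def score(index, term):
    """Sum the decoded payload boosts per document for one term."""
    totals = defaultdict(float)
    for doc_id, _position, payload in index.get(term, []):
        (boost,) = struct.unpack(">f", payload)
        totals[doc_id] += boost
    return dict(totals)


docs = ["Lucene stores payloads", "payloads in lucene and Lucene again"]
index = build_index(docs)
```

Here `sorted(index)` plays the role of the dictionary, each list in `index` is a posting list, and `score(index, "lucene")` shows one way a payload could feed into ranking.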
TrieRange
- Supplements Lucene's basic string field type, orders of magnitude faster for range searching
- came out of geo-searching
- Field mapped to a numeric type: int, long, double, float
- Code stores values at several levels of precision
- example: year, month, day
- Naive sorts are also much easier.
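The "several layers of precision" idea can be sketched as follows. This is a simplified Python illustration and not Lucene's actual TrieRange code: the precision layers in `SHIFTS`, the greedy covering strategy, and all function names are assumptions for the example, and it only handles non-negative integers. Each value is indexed once per layer with its low-order bits dropped, so a range query can use a few coarse terms for the middle of the range (like whole years) and fine terms only at the edges (like single days).

```python
from collections import defaultdict

# Assumed precision layers: each layer drops this many low-order bits.
SHIFTS = (0, 4, 8, 12)


def index_values(values):
    """Map (shift, prefix) terms to the set of doc ids carrying them."""
    postings = defaultdict(set)
    for doc_id, value in enumerate(values):
        for shift in SHIFTS:
            postings[(shift, value >> shift)].add(doc_id)
    return postings


def cover_range(lo, hi):
    """Cover [lo, hi] (non-negative ints) with aligned blocks, coarsest first."""
    terms = []
    x = lo
    while x <= hi:
        # Pick the coarsest layer whose block starts at x and fits in the range.
        # Shift 0 always qualifies, so the loop always makes progress.
        shift = max(
            s for s in SHIFTS
            if x % (1 << s) == 0 and x + (1 << s) - 1 <= hi
        )
        terms.append((shift, x >> shift))
        x += 1 << shift
    return terms


def range_search(postings, lo, hi):
    """Union the postings of the covering terms."""
    hits = set()
    for term in cover_range(lo, hi):
        hits |= postings.get(term, set())
    return hits
```

In this sketch the range 0 to 4095 is answered by a single coarse term instead of 4096 exact-value terms, which is the source of the speedup for range searching.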
Query Parser Framework - also IBM
- Current process is hard to maintain and extend
- Need to separate syntax and semantics
- Can generate multiple query languages
- Looks like a pipeline (I may be missing something)
- Text Parser - converts incoming string to a Query Node Tree
- Iterates through the nodes before going to next parser
- Can add new parsers in any order, e.g. validation, tokenization
- Query Builder
- Iterates through the nodes
- Outputs Lucene or other query language
- see developer discussion & JIRA ID 1567 (current status is patch)
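As far as I understood the pipeline described above, it could be sketched like this in Python. Everything here is hypothetical (the node classes, the tiny `field:term` syntax, the processor and builder names); it only illustrates the separation of syntax (the text parser), semantics (processors that walk the node tree), and output (builders for different query languages), not the code in the actual patch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TermNode:
    field_name: str
    text: str


@dataclass
class AndNode:
    children: List[TermNode] = field(default_factory=list)


def parse(query, default_field="body"):
    """Text parser: syntax only. Turns 'title:Solr fast' into a node tree."""
    root = AndNode()
    for chunk in query.split():
        name, sep, text = chunk.partition(":")
        if sep:
            root.children.append(TermNode(name, text))
        else:
            root.children.append(TermNode(default_field, chunk))
    return root


def lowercase_processor(tree):
    """Processor: walks the nodes and normalizes term text."""
    return AndNode([TermNode(n.field_name, n.text.lower()) for n in tree.children])


def drop_empty_processor(tree):
    """Processor: a simple validation step that removes empty terms."""
    return AndNode([n for n in tree.children if n.text])


def build_lucene(tree):
    """Builder: emits Lucene-style query syntax."""
    return " AND ".join(f"{n.field_name}:{n.text}" for n in tree.children)


def build_sql_like(tree):
    """Builder: emits a made-up SQL-ish WHERE clause from the same tree."""
    return " AND ".join(f"{n.field_name} = '{n.text}'" for n in tree.children)


def run(query, processors, builder):
    """Parse once, apply each processor in order, then hand the tree to a builder."""
    tree = parse(query)
    for processor in processors:
        tree = processor(tree)
    return builder(tree)
```

Swapping `build_lucene` for `build_sql_like` changes the output language without touching the parser or the processors, which is the maintainability argument made in the talk.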
Zvents.com - fun with deduplication and dynamic re-ranking
Xoopit.com - indexing service in their own cloud, just starting, will provide simple config for Lucene
Lucid search - Erik Hatcher
- index sources: lucidimagination.com site, lucene apache site, email lists
- facet by source, project, issue (JIRA), author (email)
- multi-select facets (checkboxes)
- displays facets with 0 hits
- About 1/2 of attendees no longer use stopwords
- 1/4 do use them
- 1/4 do both or don't know
- No one has any evidence that they are useful, but they can screw up phrase queries
- They do make indexes bigger
- Indexing them leads to search transparency - no question that what you search is what you get
- Maybe Lucene and Solr should disable stopwords by default
- Need some kind of standard corpus / queries / judgments for testing
- Still in infant stages of development
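The point about stopwords and phrase queries can be shown with a toy example. This Python sketch is an illustration only: the stopword list is made up, and it simply drops stopwords without recording position gaps, which is a simplifying assumption rather than a description of how any particular Lucene analyzer behaves.

```python
STOPWORDS = {"the", "to", "be", "or", "not", "a", "of"}  # illustrative list


def analyze(text, remove_stopwords):
    """Lowercase, split on whitespace, optionally drop stopwords."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def phrase_match(doc, phrase, remove_stopwords):
    """True if the analyzed phrase occurs as a contiguous run in the analyzed doc."""
    doc_tokens = analyze(doc, remove_stopwords)
    phrase_tokens = analyze(phrase, remove_stopwords)
    if not phrase_tokens:
        return False  # every query word was a stopword; nothing left to search for
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))
```

With stopword removal on, the phrase "to be or not to be" analyzes to nothing and cannot match any document, while the phrase "the who" degrades to the single word "who" and matches documents that never mention the band. With stopwords indexed, both phrases behave as typed, which is the "what you search is what you get" argument.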
All in all, a great meeting full of very positive energy.