An Overview of Enterprise Search
Avi Rappoport
Search Tools Consulting
(SLIS graduate)
About Avi Rappoport
- Medieval Studies Major
- BMUG (Berkeley Mac User Group) 1986-92
- MLIS (Master of Library and Information Studies), 1987-88
- User vocabulary & thesaurus
- Worked in software development
- EndNote
- Metrowerks CodeWarrior
- StarNine Mail & WebSTAR web server
- Search Tools Consulting, since 1998
- Web site and intranet search engines
- SearchTools.com
- Research, analysis, dissemination
- Real-world consulting
Defining Enterprise Search
- Web site search
- Corporate / Academic Site
- Online Store
- Information Site
- Directory, Yellow Pages
- Intranet search
- Internal networks
- Crossing departmental lines
- HR and Facilities info
- Research
- Data Silos
Similarities to Webwide Search
- Techniques and algorithms
- Robot crawlers, HTTP transport
- Very large index files
- Distributed indexing and search processing
- Full-text indexing of unstructured text
- Simple query language
- Relevance Ranking of Results
- Standard interface conventions
Differences from Web Search
- Limited Scope
- More control over content
- Somewhat less variety in formatting
- Few meaningful hyperlinks
- So PageRank is less useful
- Alternative indexing methods
- File Systems
- SOAP & Web Services
- More tunable
- Index update scheduling
- Relative value of content for relevance ranking
- No search spam!
Search and Information Architecture
- Defining IA
- The art and science of organizing information
- Information Architecture
- Identifies duplicate and missing coverage
- Provides standard vocabulary
- Adds labels for describing documents
- Search can supplement IA
- Provide ad-hoc access
- Cross categories
- Supports user vocabularies
Knowledge Management
- “The process through which organizations generate value from their intellectual and knowledge-based assets” (CIO Magazine)
- Organizes information, processes and people
- Attempts to regularize implicit knowledge
- Provides collaboration tools
Simple Search vs. Research
- Text search for most things
- 80% of information needs?
- Short queries
- Good-enough answers
- Research
- Scientific and academic
- Legal discovery
- Medical diagnosis
- Patents and trademarks
- Business and sales information
- Image and multimedia search
Finding and Gathering
- Use robot crawler on intranet
- List of "root" URLs
- Special cases for special servers
- Other Data Sources
- File systems & shared servers
- Mailing list archives (index threads), blogs, wikis
- Content Management Systems
- Databases
- Legacy programs
- Feeds
- Access options
- API (Application Programming Interface)
- Web Services
- HTML front end
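The crawler step above can be sketched roughly as follows. This is a minimal illustration, not any particular product's crawler: the `crawl` function, the `fetch_links` callback, and the `intranet.example` pages are all hypothetical, and the in-memory `PAGES` table stands in for real HTTP fetching and HTML parsing.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(root_urls, fetch_links, allowed_hosts, max_pages=1000):
    """Breadth-first crawl from a list of root URLs.

    fetch_links(url) is a caller-supplied function (an assumption here)
    that returns the href values found on a page.
    """
    queue = deque(root_urls)
    seen = set(root_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments.
            link = urljoin(url, href).split("#", 1)[0]
            # Limited scope: never leave the allowed intranet hosts.
            if urlparse(link).hostname not in allowed_hosts:
                continue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical intranet used only for illustration.
PAGES = {
    "http://intranet.example/": ["/hr/", "/facilities/", "http://www.example.org/"],
    "http://intranet.example/hr/": ["benefits.html", "/"],
    "http://intranet.example/facilities/": [],
    "http://intranet.example/hr/benefits.html": ["#top"],
}


def fake_fetch_links(url):
    return PAGES.get(url, [])


found = crawl(["http://intranet.example/"], fake_fetch_links, {"intranet.example"})
```

A real crawler would also need robots.txt handling, politeness delays, and the special cases for individual servers mentioned above.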
File Format Issues
- Primary
- More complex
- PDF
- Word processing
- Flash
- Spreadsheets
- CAD and project files
- File formats may change
- XML
- Index text with minimal context
- Index attributes
- Index structured hierarchies
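As a rough illustration of the three XML options above, the sketch below walks a document and records text, attributes, and the element path each value came from. The `index_xml` function and the sample memo are assumptions made up for this example; only the standard library `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET


def index_xml(xml_text):
    """Return (path, value) pairs for element text and attributes.

    The path records the element hierarchy, so a search engine could
    restrict a query to, say, /memo/title. Text that follows a child
    element (tail text) is ignored to keep the sketch short.
    """
    entries = []

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        for name, value in elem.attrib.items():
            entries.append((f"{path}/@{name}", value))
        text = (elem.text or "").strip()
        if text:
            entries.append((path, text))
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return entries


# Hypothetical document for illustration.
SAMPLE = (
    '<memo dept="HR"><title>Holiday schedule</title>'
    '<body><para lang="en">Offices close Friday.</para></body></memo>'
)
entries = index_xml(SAMPLE)
```

Dropping the paths and keeping only the values gives the "text with minimal context" option; keeping them gives the structured one.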
Security and Access Control
- Basic authentication, single sign-on
- Indexer logs in as a user
- Stores document permissions
- Can use search as a teaser for subscriptions
- Show search results
- Mark clearly which are free and which require payment
- Collection-level control
- Don't let users search collections they are not allowed to access
- Hit-level authentication
- Check access status for each result before displaying
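Hit-level authentication can be pictured as a filter applied to the result list just before display. The sketch below assumes the indexer stored a set of permitted groups per document; the `filter_hits` function, the `ACL` table, and the group names are hypothetical.

```python
def filter_hits(hits, user_groups, acl):
    """Keep only hits the user may see.

    acl maps document id -> set of groups allowed to read it.
    Documents with no stored permissions are hidden (default deny).
    """
    return [doc for doc in hits if acl.get(doc, set()) & user_groups]


# Hypothetical permissions stored at index time.
ACL = {
    "hr/salaries.doc": {"hr"},
    "facilities/map.pdf": {"everyone"},
    "research/plan.doc": {"research", "exec"},
}

raw_hits = ["hr/salaries.doc", "facilities/map.pdf", "research/plan.doc", "unknown.txt"]
visible = filter_hits(raw_hits, {"everyone", "research"}, ACL)
```

Stored permissions can go stale between index updates, which is one reason some systems re-check access with the source system for each result instead.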
Search Forms UI
- Search field on every page
- Longer field = longer queries (a good thing)
- Simple search form
- A few more options
- Zones, date ranges, category
- Advanced search
- For power users and librarians
- Expose everything
Query Processing
- Generally like webwide search
- Don't require complex operators
- Search full text by default
- Multi-word query handling
- Webwide search matches every word
- Commerce search should match any word
- Smaller document set
- Don't want to say "no"
- Stemming
- Capitalization
- Stopwords
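The query-handling choices above (capitalization, stopwords, stemming, and matching every word versus any word) can be sketched in a few lines. Everything here is an assumption for illustration: the stopword list is tiny, the `stem` function is a deliberately naive suffix stripper rather than a real stemming algorithm, and the catalog entries are invented.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "for", "to", "in"}


def stem(word):
    """Naive stemmer: strip 'ing' or a plural 's' from longer words."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, split into words, drop stopwords, stem the rest."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]


def search(query, docs, match_all=True):
    """Return ids of docs matching every query term or any query term."""
    terms = set(normalize(query))
    if not terms:
        return []
    hits = []
    for doc_id, text in docs.items():
        matched = terms & set(normalize(text))
        if matched == terms or (matched and not match_all):
            hits.append(doc_id)
    return hits


# Hypothetical product catalog.
CATALOG = {
    "p1": "Blue running shoes",
    "p2": "Red shoe laces",
    "p3": "The Blue Jacket",
}

strict = search("Blue Shoes", CATALOG, match_all=True)
loose = search("Blue Shoes", CATALOG, match_all=False)
```

With a small catalog the strict mode returns a single product while the loose mode returns all three, which is the "don't want to say no" argument for matching any word in commerce search.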
Relevance
- Basically TF-IDF
- Link analysis less useful than on the web
- PageRank etc. rely on meaningful links
- Intranet links tend to be navigational
- Some content is more valuable
- Product info vs. support notes
- Current vs. archived
- Local vs. global
- Weighting rules can reflect that
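A minimal sketch of TF-IDF scoring with a per-document weighting rule follows. It is one common textbook variant, not the formula of any specific engine: the smoothed IDF `log(1 + N/df)`, the `tf_idf_scores` function, the `boosts` parameter, and the three sample documents are all assumptions.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs, boosts=None):
    """Score each document for the query terms.

    tf  = term count / document length
    idf = log(1 + N / df), smoothed so it is never zero
    boosts maps doc id -> multiplier expressing the relative value
    of that content (for example, product info over archives).
    """
    boosts = boosts or {}
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tokens and counts[term]:
                tf = counts[term] / len(tokens)
                idf = math.log(1 + n_docs / doc_freq[term])
                score += tf * idf
        scores[doc_id] = score * boosts.get(doc_id, 1.0)
    return scores


# Hypothetical documents.
DOCS = {
    "product": "widget pricing and widget specs",
    "support": "widget reset steps",
    "archive": "old newsletter",
}

plain = tf_idf_scores(["widget"], DOCS)
boosted = tf_idf_scores(["widget"], DOCS, boosts={"support": 2.0})
```

In this toy data the product page outranks the support note on text statistics alone, and a boost on the support note reverses the order, which is the kind of tuning a webwide engine does not offer.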
Results Pages
- Similar to web search results
- With enterprise site colors, graphics, page design
- But similar search header / results list / footer
- Search Suggestions (Best Bets)
- For frequent searches, give useful directions
- Transparency
- Mark match terms in context
- Vital for documents with no title, confusing URL
- Integrate taxonomy
- Show topics or categories
- Faceted Metadata results!
- Show fruitful paths, valuable options
- Best of structured and unstructured search
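Two of the results-page ideas above, marking match terms in context and showing faceted metadata counts, are simple enough to sketch. The `snippet` and `facet_counts` functions, the bracket markers, and the sample hits are hypothetical; a real results page would emit HTML markup and trim the window at word boundaries.

```python
import re
from collections import Counter


def snippet(text, terms, width=40):
    """Return text around the first match, with matched terms in [brackets]."""
    if not terms:
        return text[: 2 * width]
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    first = pattern.search(text)
    if first is None:
        return text[: 2 * width]
    start = max(0, first.start() - width)
    end = min(len(text), first.end() + width)
    return pattern.sub(lambda m: "[" + m.group(0) + "]", text[start:end])


def facet_counts(hits, field):
    """Count how many hits carry each value of a metadata field."""
    return Counter(hit[field] for hit in hits if field in hit)


# Hypothetical result set with metadata.
HITS = [
    {"title": "Vacation policy", "dept": "HR"},
    {"title": "Vacation request form", "dept": "HR"},
    {"title": "Building closure dates", "dept": "Facilities"},
]

marked = snippet("Submit the vacation request form to HR.", ["vacation", "form"])
by_dept = facet_counts(HITS, "dept")
```

The counts are what lets the page show only fruitful paths: a facet value with zero hits is simply never offered.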
Choose, Implement, Maintain
- Buy or download, don't build
- Quality search and other features are non-trivial
- Homegrown systems rarely satisfy
- Effort to implement varies wildly
- Number of documents, complexity, interfaces
- Resources, servers, network
- Enterprise information needs
- Maintenance
- Keep system working, index current, scale up
- Add new data sources
- Change as new needs appear
- Log analysis
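Log analysis usually starts with two lists: the most frequent queries (candidates for Best Bets) and the queries that returned nothing (candidates for synonyms or new content). The sketch below assumes the search log has already been parsed into `(query, result_count)` pairs; that format, the `analyze_log` function, and the sample entries are made up for illustration.

```python
from collections import Counter


def analyze_log(entries):
    """Tally query frequency and zero-result queries.

    entries: iterable of (query, result_count) pairs.
    Queries are lowercased and whitespace-normalized before counting.
    """
    frequency = Counter()
    no_results = Counter()
    for query, result_count in entries:
        normalized = " ".join(query.lower().split())
        frequency[normalized] += 1
        if result_count == 0:
            no_results[normalized] += 1
    return frequency, no_results


# Hypothetical parsed log entries.
LOG = [
    ("vacation policy", 12),
    ("Vacation  Policy", 12),
    ("401k", 0),
    ("parking map", 3),
    ("401K", 0),
    ("vacation policy", 12),
]

frequency, no_results = analyze_log(LOG)
top_query = frequency.most_common(1)[0]
```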
Beyond Simple Search: Research
- Academic, legal, medical, scientific
- Willing to work harder
- Refine and improve queries
- Use fields and options
- Save searches and export results
- High recall (finding all relevant documents)
- Not just a "good enough" result
- Understand topical landscape
- Related concepts
- Visualization of results
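The difference between a "good enough" result and high recall can be made concrete with the standard precision and recall measures. The `precision_recall` function and the document ids below are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Precision = share of retrieved docs that are relevant.
    Recall    = share of relevant docs that were retrieved.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    found = retrieved & relevant
    precision = len(found) / len(retrieved) if retrieved else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical case: four documents are truly relevant.
RELEVANT = ["d1", "d2", "d3", "d4"]

simple_search = precision_recall(["d1", "d2"], RELEVANT)
research_search = precision_recall(["d1", "d2", "d3", "d4", "d5"], RELEVANT)
```

The simple search is perfectly precise but finds only half of what exists; the research search accepts one irrelevant hit in exchange for finding everything, which is the trade a patent or legal searcher is willing to make.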
Conclusion - Enterprise Search