Lucene Boot Camp
Intro
My Background
Your Background
Brief History of Lucene
Goals for Tutorial
Understand Lucene core capabilities
Real examples, real code, real data
Ask Questions!!!!!
Schedule
1. 10-10:10 Introducing Lucene and Search
2. 10:10-12 Indexing, Analysis, Searching, Performance
3. 12-12:05 Break
Lucene is
NOT a crawler
See Nutch
NOT an application
See PoweredBy on the Wiki
NOT a library for doing Google PageRank or other link analysis algorithms
See Nutch
Search Basics
Goal: Identify documents that are similar to an input query
Lucene uses a modified Vector Space Model (VSM): Boolean + VSM (TF-IDF)
The words in the document and the query each define a vector in an n-dimensional space
sim(q1, d1) = cos(θ) = (V(q1) · V(d1)) / (|V(q1)| |V(d1)|)
In Lucene, the boolean approach restricts which documents get scored
(Diagram: query vector q1 and document vector d1 in term space; the smaller the angle between them, the more similar they are)
Indexing
Process of preparing and adding text to Lucene
Optimized for searching
It's our job to convert whatever file format we have into something Lucene can use
Indexing Classes
Analyzer
Creates tokens using a Tokenizer and filters them through zero or more TokenFilters
IndexWriter
Responsible for converting text into internal Lucene format
Indexing Classes
Directory
Where the index is stored
RAMDirectory, FSDirectory, others
Document
A collection of Fields
Can be boosted
Field
Free text, keywords, dates, etc.
Defines attributes for storing, indexing
Can be boosted
Field constructors and parameters
Open up Fieldable and Field in your IDE
How to Index
Create an IndexWriter
Create a Document and add Fields to it
Add the Document with IndexWriter.addDocument()
Optimize (optional)
Close the IndexWriter
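A minimal sketch of these steps against the Lucene 2.x API (directory, titleText and bodyText are placeholders):

  // create the writer (true == create a new index)
  IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
  // build a Document from Fields
  Document doc = new Document();
  doc.add(new Field("title", titleText, Field.Store.YES, Field.Index.TOKENIZED));
  doc.add(new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED));
  writer.addDocument(doc);
  writer.optimize(); // optional
  writer.close();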
Task 1.a
From the Boot Camp files, use the basic.ReutersIndexer skeleton as a starting point
Index the small Reuters collection using an IndexWriter, a Directory and a StandardAnalyzer
Boost every 10th document by 3
Questions to Answer:
What Fields should I define?
What attributes should each Field have?
What Fields should OMIT_NORMS?
Pick a field to boost and give a reason why you think it should be boosted
Searching
Key Classes:
Searcher
Provides methods for searching
Take a moment to look at the Searcher class declaration
IndexSearcher, MultiSearcher, ParallelMultiSearcher
IndexReader
Loads a snapshot of the index into memory for searching
Hits
Storage/caching of results from searching
QueryParser
JavaCC grammar for creating Lucene Queries
http://lucene.apache.org/java/docs/queryparsersyntax.html
Query
Logical representation of the program's information need
Query Parsing
Basic syntax:
title:hockey +(body:stanley AND body:cup)
OR/AND must be uppercase
Default operator is OR (can be changed)
Supports fairly advanced syntax; see the website:
http://lucene.apache.org/java/docs/queryparsersyntax.html
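Putting the pieces together, a minimal search sketch in the Lucene 2.x API (directory is assumed to hold the index built earlier):

  IndexSearcher searcher = new IndexSearcher(directory);
  // default field is "body"; the Analyzer should match the one used at index time
  QueryParser parser = new QueryParser("body", new StandardAnalyzer());
  Query query = parser.parse("title:hockey +(body:stanley AND body:cup)");
  Hits hits = searcher.search(query);
  for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);   // the matching Document
    float score = hits.score(i);  // its normalized score
  }
  searcher.close();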
Task 1.b
Using the ReutersIndexerTest.java skeleton in the Boot Camp files:
Search your newly created index using queries you develop
Delete a Document by its doc id
Hints:
Use an IndexSearcher
Questions:
What is the default field for the QueryParser?
Which Analyzer should you use?
Task 1 Results
Locks
Lucene maintains locks on files to prevent index corruption
Located in the same directory as the index
Lucene 2.3 has some transactional semantics for indexing, but is not a DB
Updates are always a delete and an add
Updates are always a delete and an add
Yes, that is a repeat! Nature of data structures used in search
Analysis
Analysis is the process of creating Tokens to be indexed
Analysis is usually done to improve results overall, but it comes with a price
Lucene comes with many different Analyzers, Tokenizers and TokenFilters, each with their own goals
See contrib/analyzers
StandardAnalyzer is included with the core JAR and does a good job for most English and Latin-based tasks
Oftentimes you want the same content analyzed in different ways
Consider a catch-all Field in addition to other Fields
Indexing in a Nutshell
For each Document
For each Field to be tokenized
Create the tokens using the specified Tokenizer
Tokens consist of a String, position, type and offset information
Pass the tokens through the chained TokenFilters, where they can be changed or removed
Add the end result to the inverted index
Inverted Index
(Diagram: an inverted index. The sorted term dictionary (aardvark, hood, little, red, riding, robin, women, zoo) points each term to the ids of the documents containing it, e.g. Robin Hood and Little Women)
Tokenization
Split words into Tokens to be processed
Tokenization is fairly straightforward for most languages that use a space for word segmentation
More difficult for some East Asian languages
See the CJK Analyzer
Modifying Tokens
TokenFilters are used to alter the token stream to be indexed
Common tasks:
Remove stopwords
Lowercase
Stem/normalize (e.g., Wi-Fi -> Wi Fi)
Add synonyms
Custom Analyzers
Solution: write your own Analyzer
Better solution: write a configurable Analyzer so you only need one Analyzer that you can easily change for your projects
See Solr
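One possible shape for such a configurable Analyzer, sketched against the Lucene 2.x analysis API (the class name and constructor arguments are illustrative, not from the Boot Camp files):

  public class ConfigurableAnalyzer extends Analyzer {
    private final String[] stopWords; // null == no stopword removal
    private final boolean lowerCase;

    public ConfigurableAnalyzer(String[] stopWords, boolean lowerCase) {
      this.stopWords = stopWords;
      this.lowerCase = lowerCase;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream result = new StandardTokenizer(reader);
      result = new StandardFilter(result);
      if (lowerCase) result = new LowerCaseFilter(result);
      if (stopWords != null) result = new StopFilter(result, stopWords);
      return result;
    }
  }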
Special Cases
Dates and numbers need special treatment to be searchable
o.a.l.document.DateTools
org.apache.solr.util.NumberUtils
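For example, DateTools can render a date as an index-friendly, lexicographically sortable term (docDate is a placeholder):

  // Resolution.DAY produces "yyyyMMdd", e.g. "19870226"
  String dateTerm = DateTools.dateToString(docDate, DateTools.Resolution.DAY);
  doc.add(new Field("date", dateTerm, Field.Store.YES, Field.Index.UN_TOKENIZED));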
5 minute Break
Indexing Performance
Behind the Scenes
Lucene indexes Documents into memory
At certain trigger points, the in-memory segments are flushed to the Directory
mergeFactor
How often segments are merged
Smaller == less RAM, better for incremental updates
Larger == faster, better for batch indexing
maxFieldLength
Limits the number of terms indexed per Field in a Document
autoCommit = false (new in 2.3)
Takes storage and term vectors out of the merge process
Turn off auto-commit if there are stored fields and term vectors
Provides a significant performance increase
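These knobs are all set on the writer; a minimal sketch with illustrative values (the numbers are placeholders, not recommendations):

  IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
  writer.setMergeFactor(100);       // larger favors batch indexing
  writer.setMaxBufferedDocs(1000);  // docs buffered in RAM before a flush
  writer.setMaxFieldLength(50000);  // cap on terms indexed per field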
Index Threading
IndexWriter and IndexReader are threadsafe and can be shared between threads without external synchronization
One open IndexWriter per Directory
Parallel indexing:
Index to separate Directory instances
Merge using IndexWriter.addIndexes (see the sketch below)
Could also distribute and collect
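A sketch of the merge step, assuming each thread has filled its own Directory (mainDir, dir1 and dir2 are placeholders):

  IndexWriter writer = new IndexWriter(mainDir, new StandardAnalyzer(), true);
  writer.addIndexes(new Directory[] { dir1, dir2 }); // merges the parallel indexes
  writer.close();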
Benchmarking Indexing
contrib/benchmark
Try out different algorithms between Lucene 2.2 and trunk (2.3)
contrib/benchmark/conf:
indexing.alg
indexing-multithreaded.alg
Info:
Mac Pro, 2 x 2GHz Dual-Core Xeon, 4 GB RAM
ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M
Benchmarking Results
              Records/Sec   Avg. Total Mem
2.2           421           39M
Trunk         2,122         52M
Trunk-mt (4)  3,680         57M
Searching
Earlier we touched on the basics of search using the QueryParser
Now look at:
Searcher/IndexReader lifecycle
Query classes
More details on the QueryParser
Filters
Sorting
Lifecycle
Recall that the IndexReader loads a snapshot of the index into memory
This means updates made since loading the index will not be seen
Business rules are needed to define how often to reload the index, if at all
IndexReader.isCurrent() can help
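A sketch of one such business rule, reloading only when the index has changed (reader and searcher are assumed to be the cached instances):

  if (!reader.isCurrent()) {
    reader.close();
    reader = IndexReader.open(directory);
    searcher = new IndexSearcher(reader);
  }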
Query Classes
TermQuery is the basis for all non-span queries
BooleanQuery combines multiple Query instances as clauses
Clauses may be SHOULD (optional), MUST (required) or MUST_NOT (prohibited)
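A small sketch combining clauses (field names follow the earlier examples):

  BooleanQuery query = new BooleanQuery();
  query.add(new TermQuery(new Term("body", "stanley")), BooleanClause.Occur.MUST);
  query.add(new TermQuery(new Term("body", "cup")), BooleanClause.Occur.MUST);
  query.add(new TermQuery(new Term("title", "hockey")), BooleanClause.Occur.SHOULD);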
Spans
Spans provide information about where matches took place
Not supported by the QueryParser
Can be used in BooleanQuery clauses
Take 2-3 minutes to explore the SpanQuery classes
SpanNearQuery is useful for doing phrase matching (see the sketch below)
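For example, an exact-phrase match for "stanley cup":

  SpanQuery[] clauses = new SpanQuery[] {
      new SpanTermQuery(new Term("body", "stanley")),
      new SpanTermQuery(new Term("body", "cup")) };
  // slop 0 and inOrder == true makes this behave like a phrase query
  SpanNearQuery phrase = new SpanNearQuery(clauses, 0, true);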
QueryParser
MultiFieldQueryParser
Boolean operators cause confusion
Better to think in terms of required (+ operator) and not allowed (- operator)
Most applications either modify the QP, create their own, or restrict to a subset of the syntax
Your users may not need all the flexibility of the QP
Sorting
Lucene's default sort is by score
Searcher has several methods that take in a Sort object
Sorting should be addressed during indexing
Sorting is done on Fields containing a single term that can be used for comparison
The SortField defines the different sort types available:
AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC
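A minimal sorted search, assuming an untokenized stitle Field exists for sorting (see Task 4):

  Sort sort = new Sort(new SortField("stitle", SortField.STRING));
  Hits hits = searcher.search(query, sort);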
Sorting II
Look at Searcher, Sort and SortField
Custom sorting is done with a SortComparatorSource
SortFilterTest.java example
Filters
Filters restrict the search space to a subset of Documents
Use cases:
Search within a search
Restrict by date
Rating
Security
Author
Filter Classes
QueryWrapperFilter (QueryFilter)
Restrict to subset of Documents that match a Query
RangeFilter
Restrict to Documents that fall within a range
A better alternative to RangeQuery
CachingWrapperFilter
Wrap another Filter and provide caching
SortFilterTest.java example
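A sketch of a cached date Filter, assuming the date Field holds DateTools-style "yyyyMMdd" terms:

  Filter dateFilter = new RangeFilter("date", "19870226", "19870226", true, true);
  Filter cached = new CachingWrapperFilter(dateFilter); // reuse across searches
  Hits hits = searcher.search(query, cached);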
Expert Results
Searcher has several expert methods
Hits is not always what you need, due to:
Caching
Normalized scores
Re-executing the Query repeatedly as results are accessed
HitCollector allows low-level access to all Documents as they are scored
TopDocs represents the top n docs that match
TopDocsTest in examples
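A TopDocs sketch fetching the top 10 matches without the overhead of Hits:

  TopDocs topDocs = searcher.search(query, null, 10); // null == no Filter
  for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    ScoreDoc scoreDoc = topDocs.scoreDocs[i];
    Document doc = searcher.doc(scoreDoc.doc); // load only what you need
  }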
Searchers
MultiSearcher
Search over multiple Searchables, including remote
MultiReader
Not a Searcher, but can be used with IndexSearcher to achieve same results for local indexes
ParallelMultiSearcher
Like MultiSearcher, but threaded
RemoteSearchable
RMI based remote searching
Search Performance
Search speed is based on a number of factors:
Query type(s)
Query size
Analysis
Occurrences of query terms
Optimize
Index size
Index type (RAMDirectory, other)
The usual suspects:
CPU
Memory
I/O
Business needs
Query Types
Be careful with WildcardQuery, as it rewrites to a BooleanQuery containing all the terms that match the wildcards
Avoid starting a WildcardQuery with a wildcard
Use ConstantScoreRangeQuery instead of RangeQuery
Be careful with range queries and dates
The user mailing list and Wiki have useful tips for optimizing date handling
Query Size
Stopword removal
Search an "all" field instead of many fields with the same terms
Disambiguation
May be useful when doing synonym expansion
Usual Suspects
CPU
Profile your application
Memory
Examine your heap size, garbage collection approach
I/O
Cache your Searcher
Define business logic for refreshing based on indexing needs
Business Needs
Do you really need to support Wildcards?
What about date range queries down to the millisecond?
Explanations
The explain(Query, int) method is useful for understanding why a Document scored the way it did
ExplainsTest in the sample code
Open Luke and try some queries and then use the explain button
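A minimal use of explain, assuming docId is the internal id of a hit:

  Explanation explanation = searcher.explain(query, docId);
  System.out.println(explanation.toString()); // shows tf, idf, boosts, norms...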
FieldSelector
Prior to version 2.1, Lucene always loaded all Fields in a Document
The FieldSelector API addition allows Lucene to skip large Fields
Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break
Makes storage of original content more viable without the large cost of loading it when not used
FieldSelectorTest in the example code
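A sketch of a hand-rolled FieldSelector that loads only the title (MapFieldSelector and SetBasedFieldSelector offer ready-made versions):

  FieldSelector selector = new FieldSelector() {
    public FieldSelectorResult accept(String fieldName) {
      return "title".equals(fieldName)
          ? FieldSelectorResult.LOAD
          : FieldSelectorResult.NO_LOAD;
    }
  };
  Document doc = reader.document(docId, selector); // other fields are skipped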
Affecting Relevance
FunctionQuery from Solr (variation in Lucene)
Override Similarity
Implement your own Query and related classes
Payloads
HitCollector
Lunch
1-2:30
Recap
Indexing
Searching
Performance
Odds and ends:
Explains
FieldSelector
Relevance
Next Up
Dealing with Content
File Formats
Extraction
File Formats
Several open source libraries and projects exist for extracting content to use in Lucene
PDF: PDFBox
http://www.pdfbox.org/
Tika
http://incubator.apache.org/tika/
Aperture
http://aperture.sourceforge.net
Aperture Basics
Crawlers
Data Connectors
Extraction Wrappers
POI, PDFBox, HTML, XML, etc.
http://aperture.wiki.sourceforge.net/Extractors will give you info on what comes back from Aperture
Large Task
Using the skeleton files in the com.lucenebootcamp.training.full package:
Get some content:
Web, file system
Different file formats
Index it
Plan out your fields, boosts, field properties
Support updates and deletes
Optional:
How fast can you make it go?
Divide and conquer? Multithreaded?
Large Task
Search Content
Allow for arbitrary user queries across multiple Fields via command line or simple web interface
How fast can you make it?
Support:
Sort Filter Explains
How much slower is it to retrieve an explanation?
Large Task
Document Retrieval
Display/write out one or more documents
Support FieldSelector
Large Task
Optional Tasks
Hit highlighting using contrib/Highlighter
Multithreaded indexing and search
Explore other Field construction options:
Binary fields, term vectors
Use the Lucene trunk version and try out some of the changes in indexing
Try out Solr or Nutch at http://lucene.apache.org/
What do they offer that Lucene Java doesn't that you might need?
Term Information
TermEnum gives access to terms and how many Documents they occur in
IndexReader.terms()
IndexReader.termPositions()
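A sketch of walking the term dictionary with TermEnum:

  TermEnum termEnum = reader.terms();
  while (termEnum.next()) {
    Term term = termEnum.term();
    int docFreq = termEnum.docFreq(); // number of Documents containing the term
  }
  termEnum.close();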
Lucene Contributions
Many people have generously contributed code to help solve common problems
These are in the contrib directory of the source
Popular:
Analyzers
Highlighter
Queries and MoreLikeThis
Snowball stemmers
Spellchecker
Open Discussion
Multilingual Best Practices
Unicode
One index versus many
Resources
http://lucene.apache.org/
http://en.wikipedia.org/wiki/Vector_space_model
[email protected]
Discussions on how to develop Lucene
Issue Tracking
https://issues.apache.org/jira/secure/Dashboard.jspa
Resources
[email protected]
Finally
Please take the time to fill out a survey to help me improve this training
Located in the base directory of the source
Email it to me at [email protected]
Extras
Task 2
Take 10-15 minutes, pair up, and write an Analyzer and Unit Test
Examine results in Luke
Run some searches
Ideas:
Combine existing Tokenizers and TokenFilters
Normalize abbreviations
Filter out all words beginning with the letter A
Identify/mark sentences
Questions:
What would help improve search results?
Task 2 Results
Share what you did and why
Improving results (in most cases):
Stemming
Ignoring case
Stopword removal
Synonyms
Pay attention to business needs
Grab Bag
Accessing Term Information
TermEnum
TermDocs
Term Vectors
Task 6
Count and print all the unique terms in the index and their frequencies
Notes:
Half of the class: write it using TermEnum and TermDocs
Other half: write it using term vectors (see the sketch below)
Time your task
Only count the title and body content
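For the term vector half, the per-document counts come back like this (assumes term vectors were enabled on the Field at index time):

  TermFreqVector vector = reader.getTermFreqVector(docId, "body");
  if (vector != null) { // null if no term vector was stored for this field
    String[] terms = vector.getTerms();
    int[] freqs = vector.getTermFrequencies(); // parallel to terms
  }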
Task 6 Results
The term vector approach is faster on smaller collections
The TermEnum approach is faster on larger collections
Task 4
Re-index your collection
Add in a rating field that randomly assigns a number between 0 and 9
Questions
How do you sort on the title?
How do you sort on multiple Fields?
Task 4 Results
Add an untokenized stitle Field to use for sorting the title
Task 5
Create and search using Filters to:
Restrict to all docs written on Feb. 26, 1987
Restrict to all docs with the word computer in the title
Also:
Create a Filter where the length of the body + title is greater than X
Task 5 Results
Solr has more advanced Filter mechanisms that may be worth using
Cache Filters
Task 7
Pair up if you like and take 30-40 minutes to:
Pick two file formats to work on
Identify content in that format
Can you index contents on your hard drive?
Project Gutenberg, Creative Commons, Wikipedia
Combine with the Reuters collection
Extract the content and index it using the appropriate library
Store the content as a Field
Search the content
Load Documents with and without a FieldSelector and measure performance
Task 7 (cont.)
Include score and explanation in results
Dump results to XML or HTML
Be prepared to share with the class what you did:
What libraries did you use?
What content did you use?
What is your Document structure?
What issues did you have?
20 Minute Break
Task 7 Results
Explain what your group did
Build a Content Handler Framework
Or help out with Tika
Task 8
Building on Task 7
Incorporate one or more contrib packages into your solution