Lucene Boot Camp

This document provides an overview and schedule for a Lucene boot camp tutorial. The boot camp will cover Lucene core capabilities like indexing, analysis, searching and performance. It will include real examples, code and data. The schedule includes sessions on indexing, analysis, searching, class examples, contributions and resources. It also provides background that Lucene is for text search and is not a crawler, application or library for other tasks. The document outlines the basics of indexing, searching, analysis and performance considerations in Lucene.

Uploaded by mfahci
© Attribution Non-Commercial (BY-NC)

Lucene Boot Camp

Grant Ingersoll Lucid Imagination Nov. 12, 2007 Atlanta, Georgia

Intro
My Background
Your Background
Brief History of Lucene
Goals for Tutorial:
- Understand Lucene core capabilities
- Real examples, real code, real data

Ask Questions!!!!!

Schedule
1. 10-10:10 Introducing Lucene and Search
2. 10:10-12 Indexing, Analysis, Searching, Performance
3. 12-12:05 Break
4. 12-1 More on Indexing, Analysis, Searching, Performance
5. 1-2:30 Lunch
6. 2:30-2:40 Recap, Questions, Content
7. 2:40-4:40 Class Example
8. 4-4:20 Break
9. 4:20-5 Class Example
10. 5-5:20 Lucene Contributions (time permitting)
11. 5:20-5:25 Open Discussion (time permitting)
12. 5:25-5:30 Resources/Wrap Up

Lucene is
- NOT a crawler (see Nutch)
- NOT an application (see PoweredBy on the Wiki)
- NOT a library for doing Google PageRank or other link analysis algorithms (see Nutch)
- A library for enabling text based search

A Few Words about Solr
- HTTP-based Search Server
- XML Configuration
- XML, JSON, Ruby, PHP, Java support
- Caching, Replication
- Many, many nice features that Lucene users need
- http://lucene.apache.org/solr

Search Basics
Goal: Identify documents that are similar to the input query
Lucene uses a modified Vector Space Model (VSM): Boolean + VSM with TF-IDF weighting
- The words in the document and the query each define a vector in an n-dimensional space
- sim(q, d_j) = cos θ, the cosine of the angle between the query vector and the document vector
- d_j = <w_1,j, w_2,j, ..., w_n,j>
- q = <w_1,q, w_2,q, ..., w_n,q>
- w = weight assigned to a term
- In Lucene, the Boolean approach restricts which documents get scored
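The cosine similarity above can be sketched in plain Java. This is a conceptual illustration only, not Lucene's actual Similarity implementation (which adds TF-IDF weighting, length norms and boosts); the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {
    // Cosine of the angle between two sparse term-weight vectors.
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            Double w = d.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // only shared terms contribute
        }
        return dot / (norm(q) * norm(d));
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) sum += w * w;
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Double> q = new HashMap<>();
        q.put("stanley", 1.0); q.put("cup", 1.0);
        Map<String, Double> d = new HashMap<>();
        d.put("stanley", 1.0); d.put("cup", 1.0); d.put("hockey", 1.0);
        System.out.println(cosine(q, d)); // shares both query terms -> high similarity
    }
}
```

Vectors pointing in the same direction score 1.0 regardless of their length, which is why cosine similarity is insensitive to raw document size.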

Indexing
Process of preparing and adding text to Lucene
Optimized for searching

Key Point: Lucene only indexes Strings


What does this mean?
- Lucene doesn't care about XML, Word, PDF, etc.
- There are many good open source extractors available
- It's our job to convert whatever file format we have into something Lucene can use

Indexing Classes
Analyzer
Creates tokens using a Tokenizer and filters them through zero or more TokenFilters

IndexWriter
Responsible for converting text into internal Lucene format

Indexing Classes
Directory
Where the Index is stored RAMDirectory, FSDirectory, others

Document
A collection of Fields Can be boosted

Field
Free text, keywords, dates, etc. Defines attributes for storing, indexing Can be boosted Field Constructors and parameters
Open up Fieldable and Field in IDE

How to Index
1. Create an IndexWriter
2. For each input:
   - Create a Document
   - Add Fields to the Document
   - Add the Document to the IndexWriter
3. Close the IndexWriter
4. Optimize (optional)
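The steps above can be sketched with the Lucene 2.x API that was current when this tutorial was given (Field.Index.TOKENIZED was later renamed ANALYZED; the field names and text here are made up for illustration, and lucene-core must be on the classpath):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class IndexerSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();     // in-memory index for illustration
        // true == create a new index rather than appending to an existing one
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        doc.add(new Field("title", "Stanley Cup", Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("body", "Hockey news", Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.optimize();   // optional: merge segments for faster searching
        writer.close();
    }
}
```

A real indexer would loop over inputs, creating one Document per input, before closing the writer.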

Task 1.a
From the Boot Camp Files, use the basic.ReutersIndexer skeleton to start Index the small Reuters Collection using the IndexWriter, a Directory and StandardAnalyzer
Boost every 10 documents by 3

Questions to Answer:
What Fields should I define? What attributes should each Field have?
What Fields should OMIT_NORMS?

Pick a field to boost and give a reason why you think it should be boosted

Use Luke to check your index

Searching
Key Classes:
Searcher
Provides methods for searching Take a moment to look at the Searcher class declaration IndexSearcher, MultiSearcher, ParallelMultiSearcher

IndexReader
Loads a snapshot of the index into memory for searching

Hits
Storage/caching of results from searching

QueryParser
JavaCC grammar for creating Lucene Queries http://lucene.apache.org/java/docs/queryparsersyntax.html

Query
Logical representation of the program's information need

Query Parsing
Basic syntax:
title:hockey +(body:stanley AND body:cup)

OR/AND must be uppercase Default operator is OR (can be changed) Supports fairly advanced syntax, see the website
http://lucene.apache.org/java/docs/queryparsersyntax.html

Doesn't always play nice, so beware


Many applications construct queries programmatically or restrict syntax
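Constructing queries programmatically, as mentioned above, might look like this sketch of the earlier example query (assumes the Lucene 2.x API and lucene-core on the classpath; field names are from the example syntax above):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class ProgrammaticQuery {
    public static void main(String[] args) {
        // Programmatic equivalent of: title:hockey +body:stanley +body:cup
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("body", "stanley")), BooleanClause.Occur.MUST);
        q.add(new TermQuery(new Term("body", "cup")), BooleanClause.Occur.MUST);
        q.add(new TermQuery(new Term("title", "hockey")), BooleanClause.Occur.SHOULD);
        System.out.println(q);   // prints the query in QueryParser-like syntax
    }
}
```

Building queries this way sidesteps QueryParser's syntax quirks entirely, which is why many applications prefer it for machine-generated queries.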

Task 1.b
Using the ReutersIndexerTest.java skeleton in the boot camp files
Search your newly created index using queries you develop Delete a Document by the doc id

Hints:
Use an IndexSearcher

Create a Query using the QueryParser


Display the results from the Hits

Questions:
What is the default field for the QueryParser? What Analyzer to use?

Task 1 Results
Locks
Lucene maintains locks on files to prevent index corruption Located in same directory as index

Scores from Hits are normalized


Scores across queries are NOT comparable

Lucene 2.3 has some transactional semantics for indexing, but it is not a DB

Deletion and Updates


Deletions can be a bit confusing
Both IndexReader and IndexWriter have delete methods

Updates are always a delete and an add
Updates are always a delete and an add
- Yes, that is a repeat! It is the nature of the data structures used in search

Analysis
Analysis is the process of creating Tokens to be indexed Analysis is usually done to improve results overall, but it comes with a price Lucene comes with many different Analyzers, Tokenizers and TokenFilters, each with their own goals
See contrib/analyzers

StandardAnalyzer is included with the core JAR and does a good job for most English and Latin-based tasks. Often you want the same content analyzed in different ways; consider a catch-all Field in addition to other Fields

Commonly Used Analyzers


StandardAnalyzer WhitespaceAnalyzer PerFieldAnalyzerWrapper SimpleAnalyzer

Indexing in a Nutshell
For each Document, for each Field to be tokenized:
- Create the tokens using the specified Tokenizer
  - Tokens consist of a String, position, type and offset information
- Pass the tokens through the chained TokenFilters, where they can be changed or removed
- Add the end result to the inverted index
Position information can be altered
- Useful when removing words or to prevent phrases from matching
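The tokenize-then-filter chain can be sketched without Lucene. This hypothetical pipeline mimics a whitespace Tokenizer followed by a lowercase filter and a stop filter; real TokenFilters also carry position, type and offset information:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnalysisSketch {
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    // Tokenizer: split on whitespace; TokenFilters: lowercase, then drop stopwords.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            String lower = t.toLowerCase();       // analogue of LowerCaseFilter
            if (!STOPWORDS.contains(lower)) {     // analogue of StopFilter
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Stanley Cup of Hockey"));
        // -> [stanley, cup, hockey]
    }
}
```

Chaining filters this way is the key design idea: each filter does one small transformation, and analyzers are assembled by composing them.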

Inverted Index
Documents:
  0: Little Red Riding Hood
  1: Robin Hood
  2: Little Women

Term     -> Documents
aardvark ->
hood     -> 0, 1
little   -> 0, 2
red      -> 0
riding   -> 0
robin    -> 1
women    -> 2
zoo      ->
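Building the inverted index above can be sketched in plain Java (a conceptual toy, not Lucene's actual on-disk format, which also stores frequencies, positions and offsets):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> sorted set of ids of documents containing it
    static Map<String, TreeSet<Integer>> build(List<String> docs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, TreeSet<Integer>> index = build(java.util.Arrays.asList(
                "Little Red Riding Hood", "Robin Hood", "Little Women"));
        System.out.println(index);
        // {hood=[0, 1], little=[0, 2], red=[0], riding=[0], robin=[1], women=[2]}
    }
}
```

Looking up a query term is now a single map access, which is why search over an inverted index is fast: the work is done once, at index time.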

Tokenization
Split words into Tokens to be processed

Tokenization is fairly straightforward for most languages that use a space for word segmentation
More difficult for some East Asian languages See the CJK Analyzer

Modifying Tokens
TokenFilters are used to alter the token stream to be indexed
Common tasks:
- Remove stopwords
- Lower case
- Stem/Normalize (e.g., Wi-Fi -> Wi Fi)
- Add Synonyms

StandardAnalyzer does things that you may not want

Custom Analyzers
Solution: write your own Analyzer

Better solution: write a configurable Analyzer so you only need one Analyzer that you can easily change for your projects
See Solr

Tokenizers and TokenFilters must be newly constructed for each input

Special Cases
Dates and numbers need special treatment to be searchable
- o.a.l.document.DateTools
- org.apache.solr.util.NumberUtils
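The core trick behind DateTools is rendering dates as strings whose lexicographic order matches chronological order, so they can be indexed and range-searched as plain terms. A plain-Java sketch of the idea (this is not the DateTools API itself; the method name is hypothetical):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class DateFieldSketch {
    // Fixed-width, zero-padded, most-significant-unit-first: string order == date order.
    static String toSortableString(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        System.out.println(toSortableString(new Date(0L))); // epoch -> "19700101000000"
    }
}
```

Truncating the pattern (e.g. to "yyyyMMdd") reduces term count dramatically, which is the standard tip for keeping date range queries cheap.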

Altering Position Information


Increase Position Gap between sentences to prevent phrases from crossing sentence boundaries Index synonyms at the same position so query can match regardless of synonym used

5 minute Break

Indexing Performance
Behind the Scenes
Lucene indexes Documents into memory
At certain trigger points, in-memory segments are flushed to the Directory

Segments are periodically merged

Lucene 2.3 has significant performance improvements

IndexWriter Performance Factors


maxBufferedDocs
- Minimum # of docs before a merge occurs and a new segment is created
- Usually larger == faster, but more RAM
mergeFactor
- How often segments are merged
- Smaller == less RAM, better for incremental updates
- Larger == faster, better for batch indexing
maxFieldLength
- Limits the number of terms indexed in a Document

Lucene 2.3 IndexWriter Changes


setRAMBufferSizeMB New model for automagically controlling indexing factors based on the amount of memory in use Obsoletes setMaxBufferedDocs and setMergeFactor

Takes storage and term vectors out of the merge process Turn off auto-commit if there are stored fields and term vectors Provides significant performance increase

Index Threading
IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization
One open IndexWriter per Directory
Parallel Indexing:
- Index to separate Directory instances
- Merge using IndexWriter.addIndexes
- Could also distribute and collect

Benchmarking Indexing
contrib/benchmark Try out different algorithms between Lucene 2.2 and trunk (2.3)
contrib/benchmark/conf:
indexing.alg indexing-multithreaded.alg

Info:
Mac Pro 2 x 2GHz Dual-Core Xeon 4 GB RAM
ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M

Benchmarking Results
              Records/Sec   Avg. Mem
2.2           421           39M
Trunk         2,122         52M
Trunk-mt (4)  3,680         57M

Your results will depend on analysis, etc.

Searching
Earlier we touched on basics of search using the QueryParser Now look at:
Searcher/IndexReader Lifecycle Query classes More details on the QueryParser Filters Sorting

Lifecycle
Recall that the IndexReader loads a snapshot of index into memory
This means updates made since loading the index will not be seen

Business rules are needed to define how often to reload the index, if at all
IndexReader.isCurrent() can help

Loading an index is an expensive operation


Do not open a Searcher/IndexReader for every search

Query Classes
TermQuery is basis for all non-span queries BooleanQuery combines multiple Query instances as clauses
should required

PhraseQuery finds terms occurring near each other, position-wise


slop is the edit distance between two terms

Take 2-3 minutes to explore Query implementations

Spans
Spans provide information about where matches took place Not supported by the QueryParser Can be used in BooleanQuery clauses Take 2-3 minutes to explore SpanQuery classes
SpanNearQuery useful for doing phrase matching

QueryParser
MultiFieldQueryParser
Boolean operators cause confusion
Better to think in terms of required (+ operator) and not allowed (- operator)

Check JIRA for QueryParser issues


http://www.gossamer-threads.com/lists/lucene/java-user/40945

Most applications either modify QP, create their own, or restrict to a subset of the syntax Your users may not need all the flexibility of the QP

Sorting
Lucene default sort is by score Searcher has several methods that take in a Sort object

Sorting should be addressed during indexing Sorting is done on Fields containing a single term that can be used for comparison The SortField defines the different sort types available
AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC

Sorting II
Look at Searcher, Sort and SortField Custom sorting is done with a SortComparatorSource

Sorting can be very expensive


Terms are cached in the FieldCache

SortFilterTest.java example

Filters
Filters restrict the search space to a subset of Documents Use Cases
Search within a Search Restrict by date Rating Security Author

Filter Classes
QueryWrapperFilter (QueryFilter)
Restrict to subset of Documents that match a Query

RangeFilter
Restrict to Documents that fall within a range Better alternative to RangeQuery

CachingWrapperFilter
Wrap another Filter and provide caching

SortFilterTest.java example

Expert Results
Searcher has several expert methods
Hits is not always what you need due to:
- Caching
- Normalized scores
- Re-executing the Query repeatedly as results are accessed

HitCollector allows low-level access to all Documents as they are scored TopDocs represents top n docs that match
TopDocsTest in examples

Searchers
MultiSearcher
Search over multiple Searchables, including remote

MultiReader
Not a Searcher, but can be used with IndexSearcher to achieve same results for local indexes

ParallelMultiSearcher
Like MultiSearcher, but threaded

RemoteSearchable
RMI based remote searching

Look at MultiSearcherTest in example code

Search Performance
Search speed is based on a number of factors:
Query Type(s) Query Size Analysis Occurrences of Query Terms Optimize Index Size Index type (RAMDirectory, other) Usual Suspects
CPU Memory I/O Business Needs

Query Types
Be careful with WildcardQuery as it rewrites to a BooleanQuery containing all the terms that match the wildcards
Avoid starting a WildcardQuery with a wildcard
Use ConstantScoreRangeQuery instead of RangeQuery
Be careful with range queries and dates
User mailing list and Wiki have useful tips for optimizing date handling

Query Size
Stopword removal
Search an "all" field instead of many fields with the same terms
Disambiguation
May be useful when doing synonym expansion

Difficult to automate and may be slower


Some applications may allow the user to disambiguate

Relevance Feedback/More Like This


Use most important words Important can be defined in a number of ways

Usual Suspects
CPU
Profile your application

Memory
Examine your heap size, garbage collection approach

I/O
Cache your Searcher
Define business logic for refreshing based on indexing needs

Warm your Searcher before going live -- See Solr

Business Needs
Do you really need to support Wildcards?
What about date range queries down to the millisecond?

Explanations
explain(Query, int) method is useful for understanding why a Document scored the way it did ExplainsTest in sample code

Open Luke and try some queries and then use the explain button

FieldSelector
Prior to version 2.1, Lucene always loaded all Fields in a Document FieldSelector API addition allows Lucene to skip large Fields
Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break

Makes storage of original content more viable without large cost of loading it when not used FieldSelectorTest in example code

Scoring and Similarity


Lucene has a sophisticated scoring mechanism designed to meet most needs
Has hooks for modifying scores
Scoring is handled by the Query, Weight and Scorer classes

Affecting Relevance
FunctionQuery from Solr (variation in Lucene) Override Similarity Implement own Query and related classes Payloads HitCollector

Take 5 to examine these

Lunch

1-2:30

Recap
Indexing Searching Performance Odds and Ends
Explains FieldSelector Relevance

Next Up
Dealing with Content
File Formats Extraction

Large Task Miscellaneous Wrapping Up

File Formats
Several open source libraries, projects for extracting content to use in Lucene
PDF: PDFBox
http://www.pdfbox.org/

Word: POI, Open Office, TextMining


http://www.textmining.org/textmining.zip

XML: SAX or Pull parser HTML: Neko, Jtidy


http://people.apache.org/~andyc/neko/doc/html/ http://jtidy.sourceforge.net/

Tika
http://incubator.apache.org/tika/

Aperture
http://aperture.sourceforge.net

Aperture Basics
Crawlers Data Connectors Extraction Wrappers
POI, PDFBox, HTML, XML, etc.
http://aperture.wiki.sourceforge.net/Extractors gives info on what comes back from Aperture

LuceneApertureCallbackHandler in example code

Large Task
Using the skeleton files in the com.lucenebootcamp.training.full package:
Get some content:
Web, file system Different file formats

Index it
Plan out your fields, boosts, field properties Support updates and deletes Optional:
How fast can you make it go? Divide and conquer? Multithreaded?

Large Task
Search Content
Allow for arbitrary user queries across multiple Fields via command line or simple web interface
How fast can you make it?

Support:
Sort Filter Explains
How much slower is it to retrieve an explanation?

Large Task
Document Retrieval
Display/write out one or more documents
Support FieldSelector

Large Task
Optional Tasks
Hit Highlighting using contrib/Highlighter Multithreaded indexing and Search Explore other Field construction options
Binary fields, term vectors

Use Lucene trunk version and try out some of the changes in indexing Try out Solr or Nutch at http://lucene.apache.org/
What do they offer that Lucene Java doesn't that you might need?

Large Task Metadata


Pair up if you want Ask questions 2 hours Use Luke to check your index! Explore other parts of Lucene that you are interested in Be prepared to discuss/share with the class

Large Task Post-Mortem


Volunteers to share?

Term Information
TermEnum gives access to terms and how many Documents they occur in
IndexReader.terms() IndexReader.termPositions()

TermDocs gives access to the frequency of a term in a Document


IndexReader.termDocs()

Term Vectors give access to term frequency information in a given Document


IndexReader.getTermFreqVector

TermsTest in sample code

Lucene Contributions
Many people have generously contributed code to help solve common problems These are in contrib directory of the source Popular:
Analyzers Highlighter Queries and MoreLikeThis Snowball Stemmers Spellchecker

Open Discussion
Multilingual Best Practices
UNICODE One Index versus many

Advanced Analysis Distributed Lucene Crawling Hadoop Nutch Solr

Resources
http://lucene.apache.org/ http://en.wikipedia.org/wiki/Vector_space_model

Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto


Lucene in Action by Hatcher and Gospodnetić
Wiki
Mailing Lists
[email protected]
Discussions on how to use Lucene

[email protected]
Discussions on how to develop Lucene

Issue Tracking
https://issues.apache.org/jira/secure/Dashboard.jspa

We always welcome patches


Ask on the mailing list before reporting a bug

Resources
[email protected]

Finally
Please take the time to fill out a survey to help me improve this training
Located in base directory of source Email it to me at [email protected]

There are several Lucene related talks on Friday

Extras

Task 2
Take 10-15 minutes, pair up, and write an Analyzer and Unit Test
Examine results in Luke Run some searches

Ideas:
Combine existing Tokenizers and TokenFilters Normalize abbreviations Filter out all words beginning with the letter A Identify/Mark sentences

Questions:
What would help improve search results?

Task 2 Results
Share what you did and why Improving Results (in most cases)
Stemming Ignore Case Stopword Removal Synonyms Pay attention to business needs

Grab Bag
Accessing Term Information
TermEnum TermDocs
Term Vectors

FieldSelector Scoring and Similarity File Formats

Task 6
Count and print all the unique terms in the index and their frequencies
Notes:
Half of the class write it using TermEnum and TermDocs Other Half write it using Term Vectors Time your Task Only count the title and body content

Task 6 Results
Term Vector approach is faster on smaller collections TermEnum approach is faster on larger collections

Task 4
Re-index your collection
Add in a rating field that randomly assigns a number between 0 and 9

Write searches to sort by


Date Title Rating, Date, Doc Id A Custom Sort

Questions
How to sort the title? How to sort multiple Fields?

Task 4 Results
Add stitle to use for sorting the title

Task 5
Create and search using Filters to:
Restrict to all docs written on Feb. 26, 1987 Restrict to all docs with the word computer in title

Also:
Create a Filter where the length of the body + title is greater than X

Task 5 Results
Solr has more advanced Filter mechanisms that may be worth using Cache filters

Task 7
Pair up if you like and take 30-40 minutes to:
Pick two file formats to work on Identify content in that format
Can you index contents on your hard drive? Project Gutenberg, Creative Commons, Wikipedia Combine w/ Reuters collection

Extract the content and index it using the appropriate library Store the content as a Field Search the content Load Documents with and without FieldSelector and measure performance

Task 7 (cont.)
Include score and explanation in results Dump results to XML or HTML Be prepared to share with class what you did
What libraries did you use? What content did you use? What is your Document structure? What issues did you have?

20 Minute Break

Task 7 Results
Explain what your group did Build a Content Handler Framework
Or help out with Tika

Task 8
Building on Task 7
Incorporate one or more contrib packages into your solution
