Lucene Learning
Cisco Confidential
Lucene
word
Java
1. 2. 3.
1.
(Posting List)
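To make the posting list idea concrete, here is a minimal sketch of an inverted index: a sorted term dictionary in which each term points to the ascending list of document IDs that contain it. The class and method names (PostingListSketch, addDocument, postingList) are hypothetical and the whitespace tokenizer is an assumption for illustration; this is not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index (hypothetical, not Lucene API): term -> posting list.
final class PostingListSketch {
    // Sorted dictionary of terms; each value is a list of ascending doc IDs.
    private final Map<String, List<Integer>> postings =
            new TreeMap<String, List<Integer>>();
    private int nextDocId = 0;

    // Splits on whitespace, lowercases, and records the doc ID once per term.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.get(token);
            if (list == null) {
                list = new ArrayList<Integer>();
                postings.put(token, list);
            }
            // Doc IDs only grow, so a repeat can only be the last entry.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
        return docId;
    }

    // Returns the posting list for a term, or an empty list if unseen.
    List<Integer> postingList(String term) {
        List<Integer> list = postings.get(term.toLowerCase());
        return list == null ? new ArrayList<Integer>() : list;
    }
}
```

Looking up a term is then a dictionary lookup followed by a walk over its posting list, which is the access pattern the index files described later are laid out for.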
2008 Cisco Systems, Inc. All rights reserved. Cisco Confidential
Lucene
Lucene index hierarchy: index, segments, documents, fields, terms (each level contains the next)
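Basic types
VInt: a variable-length integer; each Byte carries 7 bits of the value
String: a VInt length followed by the characters as UTF-8 Bytes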
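The VInt layout can be sketched as below: the value is written 7 bits at a time, low-order bits first, and the high bit of each byte signals that another byte follows. The class name VIntSketch and its methods are hypothetical; the sketch handles non-negative int values only and is meant as an illustration of the encoding, not as Lucene's own code.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper illustrating the 7-bits-per-byte VInt layout.
final class VIntSketch {
    // Encodes a non-negative int, low-order 7 bits first.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            // More bits remain: emit 7 bits and set the continuation bit.
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value); // last byte, continuation bit clear
        return out.toByteArray();
    }

    // Decodes one VInt from the start of the array.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // continuation bit clear: this was the last byte
            }
            shift += 7;
        }
        return value;
    }
}
```

Small values, which dominate in deltas and lengths, therefore cost a single byte: 127 fits in one byte, while 128 needs two.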
(Index)
Lucene Lucene
(Segment)
(segment) segments_N
(metadata)
segments.gen
SegmentInfos.FORMAT_LOCKLESS = -2 : file names are never reused (write once), each file is written to the next generation. gen0 gen1 segments_N segments_N N M
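Because file names are never reused, each commit gets a new segments_N name derived from its generation. The sketch below assumes that N is the generation written in base 36 (Character.MAX_RADIX) and that generation 0 maps to the bare name "segments"; both details are recalled from Lucene's IndexFileNames rather than stated in these slides, and the class name SegmentsFileNameSketch is hypothetical.

```java
// Hypothetical helper: maps a commit generation to a segments_N file name.
final class SegmentsFileNameSketch {
    static String segmentsFileName(long generation) {
        if (generation < 0) {
            throw new IllegalArgumentException("generation must be >= 0");
        }
        if (generation == 0) {
            // Assumption: generation 0 uses the bare name with no suffix.
            return "segments";
        }
        // Assumption: the suffix is the generation in base 36.
        return "segments_" + Long.toString(generation, Character.MAX_RADIX);
    }
}
```

A reader can then pick the current commit by finding the segments_N file with the largest N, with segments.gen serving as a second source for the generation.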
segments_N
(Document)
(Field)
(Field)
(.fnm)
(Field)
(.fdt, .fdx)
(Term)
(tis)
(tii)
(frq)
(prx)
(nrm)
(del)
Indexing
// Create the IndexWriter (true = create a new index)
IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
    new StandardAnalyzer(Version.LUCENE_CURRENT), true,
    IndexWriter.MaxFieldLength.LIMITED);
// Tuning: merge factor and number of buffered documents
writer.setMergeFactor(mergeFactor);
writer.setMaxBufferedDocs(1000);
// Create a Document
Document doc = new Document();
// Add Fields (Field.Index.TOKENIZED is the deprecated older name of Field.Index.ANALYZED)
doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", content, Field.Store.YES, Field.Index.TOKENIZED));
// Add the Document to the index
writer.addDocument(doc);
// Optimize the index
writer.optimize();
// Close the writer
writer.close();
Document
document
10
IndexWriter
directory.clearLock(WRITE_LOCK_NAME); directory.makeLock(WRITE_LOCK_NAME);
return lockFile.createNewFile();
SegmentsInfos
segmentInfos = new SegmentInfos();
segmentInfos.commit(directory); // write(dir)
setRollbackSegmentInfos(segmentInfos);
Document
Document
// Fields are held in an ArrayList
List<Fieldable> fields = new ArrayList<Fieldable>();
private float boost = 1.0f;
// Constructor; boost defaults to 1
public Document() {}
(Field)
Field
public
IndexWriter
writer.addDocument(doc);
doFlush = docWriter.addDocument(doc, analyzer); DocumentsWriter.updateDocument(Document, Analyzer,Term)
DocumentsWriter.updateDocument
DocumentsWriter
(1) DocumentsWriterThreadState state = getThreadState(doc, delTerm); // nextDocId++
(2) DocWriter perDoc = state.consumer.processDocument();
(3) finishDocument(state, perDoc);
state.consumer.processDocument();
consumer.startDocument();
fieldsWriter.startDocument();
...
for (int i = 0; i < numDocFields; i++) {
  Fieldable field = docFields.get(i);
  ...
  fieldInfos.add()
  ...
}
//
fnm
fdx fdt
fdt
fieldsStream
fieldsStream.writeVInt(fi.number); // field number
fieldsStream.writeByte(bits); // flag bits
if (field.isCompressed()) { // compressed field: length, then the bytes
  fieldsStream.writeVInt(len);
  fieldsStream.writeBytes(data, offset, len);
} else { // plain string field
  fieldsStream.writeString(field.stringValue());
}
...
// Sort the fields, then hand each one to its consumer
quickSort(fields, 0, fieldCount-1);
fields[i].consumer.processFields(fields[i].fields, fields[i].fieldCount);
...
DocFieldProcessor.addThreads()
DocInverterPerThread
  TermsHashPerThread
    TermsHashConsumerPerThread consumer: FreqProxTermsWriterPerThread, from FreqProxTermsWriter.addThreads; writes freq and prox
    TermsHashPerThread nextPerThread
      TermsHashConsumerPerThread consumer: TermVectorsTermsWriterPerThread, from TermVectorsTermsWriter; writes tvx, tvd, tvf
  InvertedDocEndConsumerPerThread endConsumer: NormsWriterPerThread, from NormsWriter.addThreads; writes nrm
StoredFieldsWriterPerThread fieldsWriter: from StoredFieldsWriter.addThreads; stored fields go to fdx and fdt
FieldInfos fieldInfos; field metadata goes to fnm
DocInverterPerField.processFields(Fieldable[], int)
1. boolean doInvert = consumer.start(fields, count); --> TermsHashPerField.start(Fieldable[], int)
2. consumer.add();
CharBlockPool charPool; holds the Token text; its blocks are recycled through DocumentsWriter freeCharBlocks
ByteBlockPool bytePool; holds the freq and prox data; its blocks are recycled through DocumentsWriter freeByteBlocks
1. 2. 3. Term Term postingsHash token
Add
read
write
fdt, fdx
write addTerm
Searching
norm(t, d)
lengthNorm(field) = (1.0 / Math.sqrt(numTerms)) Term
Boost
queryNorm(q) query
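The normalization factors above can be sketched in plain Java. lengthNorm follows the formula on this slide; queryNorm as 1/sqrt(sumOfSquaredWeights) and norm(t, d) as document boost times field boost times lengthNorm are how I understand Lucene's default similarity, so treat them as assumptions. The class name SimilaritySketch is hypothetical and does not use Lucene's Similarity class.

```java
// Hypothetical stand-in for the default similarity's normalization factors.
final class SimilaritySketch {
    // Shorter fields score higher: 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // norm(t, d): index-time boosts combined with the length norm.
    // Assumption: document boost * field boost * lengthNorm.
    static float norm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }

    // queryNorm(q): makes scores of different queries comparable.
    // Assumption: 1 / sqrt(sum of squared term weights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }
}
```

For example, a field with 4 terms gets a length norm of 0.5, and a document boost of 2 with a field boost of 1.5 raises norm(t, d) for that field to 1.5.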
Query Tree
IndexReader
IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
    new StandardAnalyzer(Version.LUCENE_CURRENT));
Query query = parser.parse("+(+apple* -boy) (cat* dog) -(eat~ foods)");
BooleanQuery
1.
TermQuery FuzzyQuery
2. PrefixQuery MultiTermQuery
Inner Class
Weight
Term Weight
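BooleanQuery(Query).weight(Searcher)
public Weight weight(Searcher searcher) throws IOException {
  // Rewrite the Query into primitive Term queries: a PrefixQuery or FuzzyQuery
  // (MultiTermQuery) expands into its matching Terms, combined in a BooleanQuery as OR (SHOULD) clauses
  Query query = searcher.rewrite(this);
  // Create the Weight tree, one Weight per Term
  Weight weight = query.createWeight(searcher);
  // Compute the query norm and push it down the tree
  float sum = weight.sumOfSquaredWeights();
  float norm = getSimilarity(searcher).queryNorm(sum);
  weight.normalize(norm);
  return weight;
}
// BooleanWeight: fold in this query's boost, then normalize every sub-weight
public void normalize(float norm) {
  norm *= getBoost();
  for (Weight w : weights) {
    w.normalize(norm);
  }
}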
Scorer SumScorer
// Scorers for the MUST clauses
List<Scorer> required = new ArrayList<Scorer>();
// Scorers for the MUST_NOT clauses
List<Scorer> prohibited = new ArrayList<Scorer>();
// Scorers for the SHOULD (optional) clauses
List<Scorer> optional = new ArrayList<Scorer>();
Iterator<BooleanClause> cIter = clauses.iterator();
...
return new BooleanScorer2(similarity, minNrShouldMatch, required, prohibited, optional);
countingSumScorer = makeCountingSumScorer();
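How the three scorer lists combine can be sketched with plain sets of document IDs: a document matches when it appears in every required set, in no prohibited set, and in at least minShouldMatch optional sets. This is a simplified model under stated assumptions, not BooleanScorer2 itself, which iterates scorers lazily and also computes scores; the class name BooleanMatchSketch and its match method are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical set-based model of MUST / MUST_NOT / SHOULD matching.
final class BooleanMatchSketch {
    static TreeSet<Integer> match(List<Set<Integer>> required,
                                  List<Set<Integer>> prohibited,
                                  List<Set<Integer>> optional,
                                  int minShouldMatch) {
        TreeSet<Integer> candidates = new TreeSet<Integer>();
        if (!required.isEmpty()) {
            // MUST: intersect all required sets.
            candidates.addAll(required.get(0));
            for (Set<Integer> r : required) {
                candidates.retainAll(r);
            }
        } else {
            // No MUST clauses: a hit needs at least one SHOULD clause.
            for (Set<Integer> o : optional) {
                candidates.addAll(o);
            }
        }
        // MUST_NOT: drop every prohibited document.
        for (Set<Integer> p : prohibited) {
            candidates.removeAll(p);
        }
        // SHOULD: enforce the minimum number of optional matches.
        TreeSet<Integer> result = new TreeSet<Integer>();
        for (Integer doc : candidates) {
            int matchedOptional = 0;
            for (Set<Integer> o : optional) {
                if (o.contains(doc)) {
                    matchedOptional++;
                }
            }
            if (matchedOptional >= minShouldMatch) {
                result.add(doc);
            }
        }
        return result;
    }
}
```

The number of clauses a document matched is also what a coord factor would be computed from, although this sketch only returns the matching IDs.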
Query
//
this.norms //
coord N , collector.topDocs() N
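Collecting only the top N hits is typically done with a bounded priority queue that keeps the weakest retained hit on top, so each new hit is compared against it and replaces it only when better. The sketch below illustrates that idea under the assumption that ties on score are broken in favor of the lower document ID; the names TopNSketch, Hit, and topN are hypothetical and this is not Lucene's collector code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical top-N collector built on a bounded min-heap.
final class TopNSketch {
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Orders hits worst first: lower score, then higher doc ID on ties.
    private static final Comparator<Hit> WORST_FIRST = new Comparator<Hit>() {
        public int compare(Hit a, Hit b) {
            if (a.score != b.score) {
                return Float.compare(a.score, b.score);
            }
            return Integer.compare(b.doc, a.doc);
        }
    };

    // Returns at most n hits, best first.
    static List<Hit> topN(int[] docs, float[] scores, int n) {
        PriorityQueue<Hit> queue =
                new PriorityQueue<Hit>(Math.max(1, n), WORST_FIRST);
        for (int i = 0; i < docs.length; i++) {
            Hit hit = new Hit(docs[i], scores[i]);
            if (queue.size() < n) {
                queue.add(hit);
            } else if (n > 0 && WORST_FIRST.compare(hit, queue.peek()) > 0) {
                // Better than the weakest retained hit: swap it in.
                queue.poll();
                queue.add(hit);
            }
        }
        List<Hit> result = new ArrayList<Hit>(queue);
        Collections.sort(result, Collections.reverseOrder(WORST_FIRST));
        return result;
    }
}
```

Memory stays proportional to N rather than to the number of matching documents, which is why only the requested N results need to be held at any time.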
Thanks