
Lucene Learning


Uploaded by Hans Tan
Copyright: © Attribution Non-Commercial (BY-NC)

Indexing and Search with Lucene

[email protected]

2008 Cisco Systems, Inc. All rights reserved.

Cisco Confidential

Lucene

Lucene is a full-text search library written in Java.

Serial scanning (Serial Scanning) vs. full-text search (Full-text Search)


1. Index (Index)
2. Indexing (Indexing)
3. Search (Search)


1. Index

Posting list (Posting List)
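A posting list is the list of documents in which a term occurs. The following is a minimal sketch of that idea only, not Lucene's implementation: the class name `InvertedIndexSketch`, its methods, and the whitespace tokenizer are all hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {
    // term -> posting list (document IDs in increasing order)
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Adds a document and returns its ID. Tokenization is a naive whitespace split (an assumption). */
    public int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Document IDs only grow, so appending keeps the list sorted;
            // the check avoids listing a document twice for a repeated term.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
        return docId;
    }

    /** Returns the posting list for a term, or an empty list if the term is unknown. */
    public List<Integer> postingList(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("lucene is a search library");
        index.addDocument("full text search with lucene");
        index.addDocument("serial scanning");
        System.out.println("lucene -> " + index.postingList("lucene"));
        System.out.println("search -> " + index.postingList("search"));
    }
}
```

Looking up a term costs one map access regardless of how many documents exist, which is the contrast with serial scanning.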
 

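Lucene

A Lucene index is made up of segments, a segment of documents, a document of fields, and a field of terms.

The units, from smallest to largest: Term, Field, Document, Segment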



---

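Byte: 8 bits (bit)
UInt32: 4 Bytes
UInt64: 8 Bytes
VInt: a variable-length integer. Each Byte carries 7 bits of the value, and the high bit of a Byte is 1 when another Byte follows. For example, 128 is stored as the two Bytes 10000000 00000001.

Chars: a sequence of UTF-8 encoded Bytes
String: a VInt length followed by Chars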
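The VInt layout can be sketched in a few lines of Java. This is an illustrative encoder and decoder written for these notes, not Lucene's own code; the class name `VIntSketch` and its method names are hypothetical. Low-order bits are written first and the high bit of each Byte marks a continuation, which reproduces the 128 example (10000000 00000001).

```java
import java.io.ByteArrayOutputStream;

public class VIntSketch {
    /** Encodes a value 7 bits at a time, low-order bits first. */
    public static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            // More than 7 bits remain: emit the low 7 bits with the high bit set.
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value); // last Byte, high bit clear
        return out.toByteArray();
    }

    /** Decodes a single VInt from the start of the array. */
    public static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last Byte
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        for (byte b : encode(128)) {
            String bits = Integer.toBinaryString(b & 0xFF);
            System.out.println(String.format("%8s", bits).replace(' ', '0'));
        }
    }
}
```

Small values such as document-ID deltas fit in a single Byte this way, while a fixed UInt32 always costs 4 Bytes.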


(Index)
Lucene Lucene


Segment

An index is made up of one or more segments (segment). The metadata (metadata) describing them is stored in segments_N.

segments.gen

segments.gen contains an Int32 version header (SegmentInfos.FORMAT_LOCKLESS = -2), followed by the generation recorded as Int64, written twice.

SegmentInfos.FORMAT_LOCKLESS = -2: file names are never reused (write once); each file is written to the next generation. The two copies of the generation are read back as gen0 and gen1, and the generation identifies the current segments_N file.
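The segments.gen layout described above (Int32 version, then the Int64 generation written twice) can be sketched as follows. The class `SegmentsGenSketch` and its methods are hypothetical. Two points are assumptions rather than statements from the slides: that the integers are big-endian, and that the generation is rendered in base 36 when forming the segments_N file name.

```java
import java.nio.ByteBuffer;

public class SegmentsGenSketch {
    static final int FORMAT_LOCKLESS = -2;

    /** Builds the 20-byte file image: version, generation, generation. */
    static byte[] write(long generation) {
        return ByteBuffer.allocate(4 + 8 + 8) // ByteBuffer is big-endian by default
                .putInt(FORMAT_LOCKLESS)
                .putLong(generation)
                .putLong(generation)
                .array();
    }

    /** Returns the generation, or -1 if the header is unexpected or the two copies disagree. */
    static long read(byte[] contents) {
        if (contents.length < 20) {
            return -1L;
        }
        ByteBuffer in = ByteBuffer.wrap(contents);
        int version = in.getInt();
        long gen0 = in.getLong();
        long gen1 = in.getLong();
        return (version == FORMAT_LOCKLESS && gen0 == gen1) ? gen0 : -1L;
    }

    /** Assumed naming scheme: generation in base 36 appended to "segments_". */
    static String segmentsFileName(long generation) {
        return "segments_" + Long.toString(generation, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        long gen = read(write(37L));
        System.out.println(gen + " -> " + segmentsFileName(gen));
    }
}
```

Writing the generation twice lets a reader detect a half-written file: if the copies differ, the value is not trusted.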


segments_N

Format
Version
NameCount
SegCount
Seg1-Segx: .. = .xxx


Document

Field

Field infos (.fnm)

Field

Stored field data (.fdt) and stored field index (.fdx)

Term

Term dictionary (.tis) and term dictionary index (.tii)

Term frequencies (.frq)

Term positions (.prx)

Norms (.nrm)

Deleted documents (.del)

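Indexing

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Create the IndexWriter (true = create a new index)
    IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
        new StandardAnalyzer(Version.LUCENE_CURRENT), true,
        IndexWriter.MaxFieldLength.LIMITED);
    // Optional tuning:
    // writer.setMergeFactor(mergeFactor);
    // writer.setMaxBufferedDocs(1000);

    // Create a Document
    Document doc = new Document();
    // Add Fields: stored and analyzed
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));

    writer.addDocument(doc); // add the Document to the index
    writer.optimize();       // merge segments
    writer.close();          // commit and release the write lock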

document

10


IndexWriter
 
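    // Clear any stale lock, then create the write lock
    directory.clearLock(WRITE_LOCK_NAME);
    directory.makeLock(WRITE_LOCK_NAME);

    // Obtaining the lock comes down to creating the lock file
    return lockFile.createNewFile();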
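The reason `createNewFile()` works as a lock is that it creates the file atomically and returns false when the file already exists. A minimal sketch of that mechanism, assuming a hypothetical class `WriteLockSketch` (this is not Lucene's lock class, and the temporary directory is only for the demonstration):

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WriteLockSketch {
    private final File lockFile;

    public WriteLockSketch(File directory, String lockName) {
        this.lockFile = new File(directory, lockName);
    }

    /** Returns true only for the caller that actually created the lock file. */
    public boolean obtain() {
        try {
            return lockFile.createNewFile(); // atomic: false if the file already exists
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the lock by deleting the lock file. */
    public void release() {
        lockFile.delete();
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"), "lock-sketch-" + System.nanoTime());
        dir.mkdirs();
        WriteLockSketch first = new WriteLockSketch(dir, "write.lock");
        WriteLockSketch second = new WriteLockSketch(dir, "write.lock");
        System.out.println("first obtains:  " + first.obtain());
        System.out.println("second obtains: " + second.obtain());
        first.release();
        dir.delete();
    }
}
```

A lock file left behind by a crashed process would block every later writer, which is why the slide's code clears the lock before making a new one.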

SegmentInfos
    segmentInfos = new SegmentInfos();
    segmentInfos.commit(directory); // write(dir)
    setRollbackSegmentInfos(segmentInfos);

    deleter = new IndexFileDeleter()

    Similarity similarity = Similarity.getDefault();

    MergePolicy mergePolicy = new LogByteSizeMergePolicy(this);
    MergeScheduler mergeScheduler = new ConcurrentMergeScheduler();

    DocumentsWriter docWriter = new DocumentsWriter(directory, this, indexingChain);

Document

    // Fields are kept in an ArrayList
    List<Fieldable> fields = new ArrayList<Fieldable>();
    private float boost = 1.0f; // boost defaults to 1
    public Document() {}

Field (Field)

Files involved: fnm, fdx, fdt. Field.Store decides whether to store the field value in the index:

    public static enum Store {
      YES {
        @Override
        public boolean isStored() { return true; }
      },
      NO {
        @Override
        public boolean isStored() { return false; }
      };
      public abstract boolean isStored();
    }

IndexWriter

    writer.addDocument(doc);

IndexWriter.addDocument hands the Document to the DocumentsWriter:

    doFlush = docWriter.addDocument(doc, analyzer);

which calls DocumentsWriter.updateDocument(Document, Analyzer, Term).

DocumentsWriter.updateDocument:

    (1) DocumentsWriterThreadState state = getThreadState(doc, delTerm); // nextDocId++
    (2) DocWriter perDoc = state.consumer.processDocument();
    (3) finishDocument(state, perDoc);

state.consumer.processDocument():

    consumer.startDocument();
    fieldsWriter.startDocument(); // fdx, fdt
    ..
    for (int i = 0; i < numDocFields; i++) {
      Fieldable field = docFields.get(i);
      .
      fieldInfos.add() // fnm
      .

fdt

fieldsStream

    fieldsStream.writeVInt(fi.number);  // field number
    fieldsStream.writeByte(bits);       // flag bits
    if (field.isCompressed()) {
      fieldsStream.writeVInt(len);                 // length of the value
      fieldsStream.writeBytes(data, offset, len);  // the value bytes
    } else {
      fieldsStream.writeString(field.stringValue()); // the value as a String
    }

    // Sort the fields, then process each one
    quickSort(fields, 0, fieldCount-1);
    fields[i].consumer.processFields(fields[i].fields, fields[i].fieldCount);
    .

DocFieldProcessorPerThread, created by DocFieldProcessor.addThreads():

    DocFieldProcessorPerThread
      DocFieldConsumerPerThread consumer = DocInverterPerThread (DocInverter.addThreads)
        InvertedDocConsumerPerThread consumer = TermsHashPerThread (TermsHash.addThreads)
          TermsHashConsumerPerThread consumer = FreqProxTermsWriterPerThread (FreqProxTermsWriter.addThreads): freq, prox
          TermsHashPerThread nextPerThread
            TermsHashConsumerPerThread consumer = TermVectorsTermsWriterPerThread (TermVectorsTermsWriter): tvx, tvd, tvf
        InvertedDocEndConsumerPerThread endConsumer = NormsWriterPerThread (NormsWriter.addThreads): nrm
      StoredFieldsWriterPerThread fieldsWriter (StoredFieldsWriter.addThreads): fdx, fdt
      FieldInfos fieldInfos: fnm

DocInverterPerField.processFields(Fieldable[], int)

    boolean doInvert = consumer.start(fields, count); // --> TermsHashPerField.start(Fieldable[], int)

1. Obtain the TokenStream of the field and read its tokens:

    final TokenStream streamValue = field.tokenStreamValue();

    boolean hasMoreTokens = stream.incrementToken();
    OffsetAttribute offsetAttribute = fieldState.attributeSource.addAttribute(OffsetAttribute.class);
    PositionIncrementAttribute posIncrAttribute = fieldState.attributeSource.addAttribute(PositionIncrementAttribute.class);

PayloadAttribute, Token

2. Add each token to the postings:

    consumer.add();

The postings are held in three block pools:

CharBlockPool charPool: holds the text of each Token; free blocks are kept by DocumentsWriter in freeCharBlocks.
ByteBlockPool bytePool: holds the freq and prox data; free blocks are kept by DocumentsWriter in freeByteBlocks.
IntBlockPool intPool: holds, for each Token, offsets into bytePool; free blocks are kept by DocumentsWriter in freeIntBlocks.

Each Term is looked up in postingsHash by its token text; its data lives in the CharBlockPool, ByteBlockPool, and IntBlockPool.

 read    read  read  



Add

read

write

fnm fieldInfos, write reader read add , , read add

fdt, fdx

write addTerm


Searching



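coord(q,d): a score factor based on how many of the query's Terms occur in document d
tf(t in d): the frequency of Term t in document d
idf(t): the inverse document frequency of Term t

norm(t, d): combines the length norm with the index-time boosts
    lengthNorm(field) = (1.0 / Math.sqrt(numTerms)), where numTerms is the number of Terms in the field

Boost

t.getBoost() (query Term), d.getBoost() (Document), f.getBoost() (Field)

queryNorm(q): a normalization factor for the query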
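How these factors combine can be sketched with plain arithmetic. Only the lengthNorm formula comes from the slide; the formulas for tf, idf, coord, and queryNorm below are assumed to match Lucene's default similarity of this period and should be checked against the version in use. The class `ScoringSketch` and its method names are hypothetical, and the sketch ignores the lossy one-byte encoding of stored norms.

```java
public class ScoringSketch {
    /** Assumed default: square root of the term frequency. */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Assumed default: log(numDocs / (docFreq + 1)) + 1. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** From the slide: 1.0 / Math.sqrt(numTerms). */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Assumed default: 1 / sqrt(sumOfSquaredWeights). */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    /** Assumed default: fraction of query terms that matched. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Score of a one-term query; d.getBoost() and f.getBoost() are taken to be 1. */
    static float scoreSingleTerm(float freq, int docFreq, int numDocs, int numTerms, float termBoost) {
        float idf = idf(docFreq, numDocs);
        float queryWeight = idf * termBoost;
        float qNorm = queryNorm(queryWeight * queryWeight); // sum of squared weights, one term
        float norm = lengthNorm(numTerms);
        return coord(1, 1) * qNorm * tf(freq) * idf * idf * termBoost * norm;
    }

    public static void main(String[] args) {
        // Term occurs 4 times in a 100-term field; 10 of 1000 documents contain it.
        System.out.println(scoreSingleTerm(4f, 10, 1000, 100, 1.0f));
    }
}
```

Because lengthNorm shrinks as numTerms grows, a match in a short field counts for more than the same match in a long one.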


Query Tree

IndexReader

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));

IndexSearcher

    IndexSearcher searcher = new IndexSearcher(reader);

QueryParser

    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
        new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse("+(+apple* -boy) (cat* dog) -(eat~ foods)");

BooleanQuery

1.

TermQuery FuzzyQuery

2. PrefixQuery MultiTermQuery

Inner Class


Weight

Term Weight

BooleanQuery(Query).weight(Searcher)

    public Weight weight(Searcher searcher) throws IOException {
      // Rewrite the Query into primitive Terms: a PrefixQuery or FuzzyQuery
      // (MultiTermQuery) expands into several Terms, combined in a
      // BooleanQuery with OR (should)
      Query query = searcher.rewrite(this);

      // Create the Weight: computes idf and prepares queryNorm
      Weight weight = query.createWeight(searcher);

      // Normalize the Term Weights
      float sum = weight.sumOfSquaredWeights();
      float norm = getSimilarity(searcher).queryNorm(sum);
      weight.normalize(norm);
      return weight;
    }

    // BooleanWeight passes the norm down to each sub-Weight
    public void normalize(float norm) {
      norm *= getBoost();
      for (Weight w : weights) {
        w.normalize(norm);
      }
    }

Scorer, SumScorer

    // Scorers for MUST clauses
    List<Scorer> required = new ArrayList<Scorer>();
    // Scorers for MUST_NOT clauses
    List<Scorer> prohibited = new ArrayList<Scorer>();
    // Scorers for SHOULD clauses (optional)
    List<Scorer> optional = new ArrayList<Scorer>();

    Iterator<BooleanClause> cIter = clauses.iterator();
    .
    return new BooleanScorer2(similarity, minNrShouldMatch, required, prohibited, optional);

    countingSumScorer = makeCountingSumScorer();

 Scorer = norms;  SumScorer coord

Query

//

this.norms //


  

ConjunctionScorer (+A +B)
DisjunctionSumScorer (A OR B)
ReqExclScorer (+A -B)
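Each of the three scorers corresponds to a merge of sorted posting lists: intersection for +A +B, union for A OR B, and difference for +A -B. The sketch below shows those merges on plain sorted arrays of document IDs. It illustrates the set logic only and is not how Lucene's scorers are implemented; the class `BooleanMergeSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BooleanMergeSketch {
    /** +A +B: documents present in both sorted lists. */
    static int[] and(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) { i++; }
            else { j++; }
        }
        return toArray(out);
    }

    /** A OR B: documents present in either sorted list, without duplicates. */
    static int[] or(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length || j < b.length) {
            if (j >= b.length || (i < a.length && a[i] < b[j])) { out.add(a[i++]); }
            else if (i >= a.length || b[j] < a[i]) { out.add(b[j++]); }
            else { out.add(a[i]); i++; j++; } // same document in both lists
        }
        return toArray(out);
    }

    /** +A -B: documents in a that do not appear in b. */
    static int[] andNot(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int j = 0;
        for (int doc : a) {
            while (j < b.length && b[j] < doc) { j++; } // skip excluded docs below doc
            if (j >= b.length || b[j] != doc) { out.add(doc); }
        }
        return toArray(out);
    }

    private static int[] toArray(List<Integer> list) {
        return list.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7};
        int[] b = {3, 4, 7, 9};
        System.out.println("+A +B:  " + Arrays.toString(and(a, b)));
        System.out.println("A OR B: " + Arrays.toString(or(a, b)));
        System.out.println("+A -B:  " + Arrays.toString(andNot(a, b)));
    }
}
```

All three run in a single pass over the inputs because posting lists are kept sorted by document ID.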


1. Create the collector:

    TopScoreDocCollector collector = TopScoreDocCollector.create(nDocs, !weight.scoresDocsOutOfOrder());

2. IndexSearcher.search(Weight, Filter, Collector) calls scorer.score(collector).

3. The summed score of each document is multiplied by coord:

    return sum * coordinator.coordFactors[coordinator.nrMatchers];

4. collector.topDocs() returns the top N documents:

    // pq.pop() returns the least element, so discard entries
    // until only the requested range remains
    for (int i = pq.size() - start - howMany; i > 0; i--) { pq.pop(); }
    populateResults(results, howMany);
    return newTopDocs(results, start);
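Keeping only the best N hits while scoring is a classic bounded min-heap: the least competitive hit sits on top and is evicted whenever the queue grows past N. The sketch below uses `java.util.PriorityQueue` rather than Lucene's own queue class, and the names `TopDocsSketch`, `ScoreDoc`, and `topN` are hypothetical. Breaking ties in favor of the lower document ID is an assumption about the intended ordering.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopDocsSketch {
    static final class ScoreDoc {
        final int doc;
        final float score;

        ScoreDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Returns the n best hits, best first. docs and scores are parallel arrays. */
    static List<ScoreDoc> topN(int[] docs, float[] scores, int n) {
        // Least competitive first: lower score, or on a tie the higher doc ID.
        Comparator<ScoreDoc> leastFirst = (x, y) ->
                x.score != y.score ? Float.compare(x.score, y.score)
                                   : Integer.compare(y.doc, x.doc);
        PriorityQueue<ScoreDoc> pq = new PriorityQueue<>(leastFirst);
        for (int i = 0; i < docs.length; i++) {
            pq.add(new ScoreDoc(docs[i], scores[i]));
            if (pq.size() > n) {
                pq.poll(); // evict the current worst hit
            }
        }
        // The queue yields worst first, so fill the result array from the back.
        ScoreDoc[] results = new ScoreDoc[pq.size()];
        for (int i = results.length - 1; i >= 0; i--) {
            results[i] = pq.poll();
        }
        return Arrays.asList(results);
    }

    public static void main(String[] args) {
        int[] docs = {0, 1, 2, 3, 4};
        float[] scores = {0.2f, 0.9f, 0.5f, 0.9f, 0.1f};
        for (ScoreDoc hit : topN(docs, scores, 3)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

The queue never holds more than N + 1 entries, so memory stays bounded no matter how many documents match.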

Thanks
