5 Indexing and Searching Big Data

This document discusses Lucene, an open-source information retrieval library written in Java that allows applications to add indexing and search capabilities. It describes how Lucene works by converting text into an inverted index format that enables fast searching. The key components for indexing include IndexWriter, Document, Analyzer, Field, and Directory classes. For searching, important classes are IndexSearcher, Term, Query, TermQuery, and TopDocs. Various analyzers like WhitespaceAnalyzer and StandardAnalyzer are available in Lucene to preprocess text for indexing.


Searching and Indexing Big Data

By Dinesh Amatya
Lucene

• Lucene is a high-performance, scalable Information Retrieval (IR) library.
• It lets you add indexing and searching capabilities to your application.
• It can index and make searchable any data that can be converted to a textual format.
• It is a mature, free, open-source project implemented in Java.

Lucene
Basic Concept: Indexing

• To search large amounts of text quickly, one must first index that text and convert it into a format that will let one search it rapidly, eliminating the slow sequential scanning process.
• This conversion process is called indexing, and its output is called an index.

Basic Concept: Inverted Index
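
An inverted index maps each term to the documents that contain it, so a lookup replaces a scan over the text. The following is a minimal sketch in plain Java, not Lucene's actual data structure: the class name InvertedIndexSketch, the integer document ids, and the simple lowercase-and-split tokenization are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical teaching class; Lucene's real index format is far more elaborate.
public class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Indexing: split the text into lowercase terms and record the document id under each term. */
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Searching: a single map lookup, with no scan over the original text. */
    public SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is an IR library");
        index.add(2, "An index makes search fast");
        index.add(3, "Lucene builds an inverted index");

        System.out.println(index.search("lucene")); // [1, 3]
        System.out.println(index.search("index"));  // [2, 3]
        System.out.println(index.search("hadoop")); // []
    }
}
```

The cost of a lookup depends on the number of distinct terms and matching documents, not on the total amount of text, which is why indexing pays off for large collections.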
Basic Concept: Searching

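• Searching is the process of looking up words in an index to find the documents in which they appear.
• The quality of a search is described by:
– Recall: the fraction of relevant documents that the search returns
– Precision: the fraction of returned documents that are relevant
• A search consults the index instead of the original text.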
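
Recall and precision can be computed once the set of returned documents and the set of truly relevant documents are known. The sketch below is a plain Java illustration under assumed inputs: the class name SearchQuality, the document ids, and the choice to return 0.0 for an empty set are conventions of this example, not part of Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not a Lucene class.
public class SearchQuality {
    /** Precision: fraction of the returned documents that are relevant. */
    static double precision(Set<Integer> returned, Set<Integer> relevant) {
        if (returned.isEmpty()) return 0.0; // convention chosen for this sketch
        return (double) overlap(returned, relevant) / returned.size();
    }

    /** Recall: fraction of the relevant documents that were returned. */
    static double recall(Set<Integer> returned, Set<Integer> relevant) {
        if (relevant.isEmpty()) return 0.0; // convention chosen for this sketch
        return (double) overlap(returned, relevant) / relevant.size();
    }

    /** Number of documents that are both returned and relevant. */
    private static int overlap(Set<Integer> returned, Set<Integer> relevant) {
        Set<Integer> both = new HashSet<>(returned);
        both.retainAll(relevant);
        return both.size();
    }

    public static void main(String[] args) {
        Set<Integer> returned = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> relevant = new HashSet<>(Arrays.asList(2, 4, 5, 6, 7, 8, 9, 10));

        // Two of the four returned documents are relevant.
        System.out.println(precision(returned, relevant)); // 0.5
        // Two of the eight relevant documents were found.
        System.out.println(recall(returned, relevant));    // 0.25
    }
}
```

The two measures usually pull against each other: returning more documents tends to raise recall and lower precision.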


Typical Components of a Search Application
Core Indexing Classes

IndexWriter
Document
Analyzer
Field
Directory
Primary Analyzers Available in Lucene
WhitespaceAnalyzer
SimpleAnalyzer
StopAnalyzer
KeywordAnalyzer
StandardAnalyzer
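
To make the differences between these analyzers concrete, the sketch below approximates four of them in plain Java. It does not call Lucene: the class AnalyzerSketch, its method names, and the tiny stop-word list are assumptions made for illustration, and the real analyzers handle many more cases. StandardAnalyzer is left out because its grammar-based tokenization is not easy to imitate in a few lines.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical approximations for illustration; not the Lucene implementations.
public class AnalyzerSketch {
    // Tiny illustrative stop list; an assumption of this sketch, not Lucene's list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "of"));

    /** Like WhitespaceAnalyzer: split on whitespace only, keep case and punctuation. */
    static List<String> whitespace(String text) {
        return nonEmpty(text.trim().split("\\s+"));
    }

    /** Like SimpleAnalyzer: split at non-letters and lowercase. */
    static List<String> simple(String text) {
        return nonEmpty(text.toLowerCase().split("[^\\p{L}]+"));
    }

    /** Like StopAnalyzer: the simple behavior plus stop-word removal. */
    static List<String> stop(String text) {
        return simple(text).stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    /** Like KeywordAnalyzer: the whole input becomes a single token. */
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    private static List<String> nonEmpty(String[] parts) {
        return Arrays.stream(parts).filter(p -> !p.isEmpty()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "The quick-brown Fox is 2 fast";
        System.out.println(whitespace(text)); // [The, quick-brown, Fox, is, 2, fast]
        System.out.println(simple(text));     // [the, quick, brown, fox, is, fast]
        System.out.println(stop(text));       // [quick, brown, fox, fast]
        System.out.println(keyword(text));    // [The quick-brown Fox is 2 fast]
    }
}
```

The choice of analyzer decides which terms end up in the index, so the same analyzer is normally applied to the query text as well.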
Core Searching Classes

IndexSearcher
Term
Query
TermQuery
TopDocs
References

• https://en.wikipedia.org/wiki/Full_text_search
• Lucene in Action
• http://www.javabeat.net/using-the-built-in-analyzers-in-lucene/
