Apache Solr Search Patterns - Sample Chapter
Leverage the power of Apache Solr to power up your business by
navigating your users to their data quickly and efficiently
Jayant Kumar
Chapter 8, AJAX Solr, focuses on AJAX Solr, a framework that helps reduce the dependency on the application. This chapter also covers an in-depth understanding of AJAX Solr as a framework and its implementation.
Chapter 9, SolrCloud, provides the complete procedure to implement SolrCloud and
examines the benefits of using a distributed search with SolrCloud.
Chapter 10, Text Tagging with Lucene FST, focuses on the basic understanding of an
FST and its implementation and guides us in designing an algorithm for text tagging,
which can be implemented using FSTs and further integrated with Solr.
Let us study the Solr index in depth. A Solr index consists of documents, fields, and
terms, and a document consists of strings or phrases known as terms. Terms that refer
to the same context can be grouped together in a field. For example, consider a product on
any e-commerce site. Product information can be broadly divided into multiple fields
such as product name, product description, product category, and product price.
Fields can be either stored or indexed or both. A stored field contains the unanalyzed,
original text related to the field. The text in indexed fields can be broken down into
terms. The process of breaking text into terms is known as tokenization. The terms
created after tokenization are called tokens, which are then used for creating the
inverted index. After tokenization, a chain of token filters processes the tokens to handle
various aspects of text analysis. For example, the tokenizer breaks a
sentence into words, and a filter converts all of those words to lowercase.
There is a huge list of analyzers and tokenizers that can be used as required.
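As a minimal sketch (the field names and the text_ws type shown here are illustrative and not part of the default example schema), the schema.xml file is where we declare whether a field is stored, indexed, or both, and which analysis chain applies to it:

  <!-- hypothetical product fields; indexed and stored control how each field is handled -->
  <field name="product_name" type="text_ws" indexed="true" stored="true"/>
  <field name="product_description" type="text_ws" indexed="true" stored="false"/>

  <!-- a simple analysis chain: split on whitespace, then lowercase every token -->
  <fieldType name="text_ws" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Here product_name can be both searched and returned in results, while product_description is searchable but not returned.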
Let us look at a working example of the indexing process with two documents
having only a single field. The following are the documents:
Suppose we tell Solr that the tokenization or breaking of terms should happen
on whitespace. Whitespace is defined as one or more spaces or tabs. The tokens
formed after the tokenization of the preceding documents are as follows:
The inverted index thus formed will contain the following terms and associations:
Inverted index
In the index, we can see that the token Harry appears in both documents. If we
search for Harry in the index we have created, the result will contain documents 1
and 2. On the other hand, the token Prince has only document 1 associated with it
in the index. A search for Prince will return only document 1.
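Restricting ourselves to the two terms discussed above, that fragment of the inverted index can be pictured as follows (the remaining terms of both documents are omitted):

  Term       Documents
  Harry      1, 2
  Prince     1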
Let us look at how an index is stored in the filesystem. Refer to the following image:
For the default installation of Solr, the index is located in the <Solr_directory>/
example/solr/collection1/data directory. We can see that the index consists of files starting
with _0 and _1. There are two segments* files and a write.lock file. An index is
built up of sub-indexes known as segments. The segments* file contains information
about the segments. In the present case, we have two segments namely _0.* and _1.*.
Whenever new documents are added to the index, new segments are created
or multiple segments are merged in the index. Any search on the index involves all
the segments inside it. Each segment is a fully independent index
and can be searched separately.
Lucene keeps on merging these segments into one to reduce the number of segments
it has to go through during a search. Merging is governed by the mergeFactor and
mergePolicy settings. The mergeFactor setting controls how many segments a Lucene index
is allowed to have before they are coalesced into one segment. When an update is made
to an index, it is added to the most recently opened segment. When a segment fills
up, more segments are created. If creating a new segment would cause the number
of lowest-level segments to exceed the mergeFactor value, then all those segments
are merged to form a single large segment. Choosing a mergeFactor value involves
a trade-off between indexing and search. A low mergeFactor value means a small
number of segments and a fast search. However, indexing is slow, as more and more
merges happen during indexing. On the other hand, a high value of mergeFactor
speeds up indexing but slows down the search, since the
number of segments to search increases. Nevertheless, documents can be pushed to
newer segments on disk with fewer merges. The default value of mergeFactor is
10. The mergePolicy setting defines how segments are merged together. The default
policy is TieredMergePolicy, which merges segments of approximately equal
sizes, subject to an allowed number of segments per tier.
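These settings live in the <indexConfig> section of solrconfig.xml. A hedged sketch for Solr 4.x, using the default values discussed above, looks like this:

  <indexConfig>
    <!-- merge once 10 lowest-level segments have accumulated -->
    <mergeFactor>10</mergeFactor>
    <!-- TieredMergePolicy is the default merge policy -->
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"/>
  </indexConfig>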
Let us look at the file extensions inside the index and understand their importance.
We are working with Solr Version 4.8.1, which uses Lucene 4.8.1 at its core.
The segment file names have Lucene41 in them; this refers to the version of the
underlying index format (Lucene 4.1) rather than the exact version of Lucene being used.
The index structure is almost the same for Lucene 4.2 and later.
segments_N and segments.gen: The segments_N file stores information about the segments
in the index, as well as a generation number. The file with the largest generation number is
considered to be active. The segments.gen file contains the current generation
of the index.
.si: The segment information file stores metadata about the segments.
.fnm: In our example, we can see the _0.fnm and _1.fnm files. These files
contain information about the fields for a particular segment of the index.
The information stored here is represented by FieldsCount, FieldName,
FieldNumber, and FieldBits. FieldsCount is the number of fields in the segment.
FieldNumber is the ordered number assigned to each field; if there are two fields
in a document, FieldNumber will be 0 for the first field and 1 for the second field.
FieldName is a string specifying the name as we have specified in our configuration.
FieldBits are used to store information about the field, such as whether the
field is indexed or not, or whether term vectors, term positions, and term
offsets are stored. We will study these concepts in depth later in this chapter.
.fdx: This file contains pointers that point a document to its field data. It
is used for stored fields to find field-related data for a particular document
from within the field data file (identified by the .fdt extension).
.fdt: The field data file is used to store field-related data for each document.
If you have a huge index with lots of stored fields, this will be the biggest file
in the index. The fdt and fdx files are respectively used to store and retrieve
fields for a particular document from the index.
.tim: The term dictionary file contains information about all the terms in
an index. For each term, it contains per-term statistics, such as document
frequency, as well as pointers to the term's frequencies and skip data (the .doc file),
positions (the .pos file), and payloads (the .pay file).
.tip: The term index file contains indexes into the term dictionary file.
The .tip file is designed to be read entirely into memory to provide fast
random access to the term dictionary file.
.doc: The frequencies and skip data file consists of the list of documents that
contain each term, along with the frequencies of the term in that document. If
the length of the document list is greater than the allowed block size, the skip
data to the beginning of the next block is also stored here.
.pos: The positions file contains the list of positions at which each term
occurs within documents. In addition to terms and their positions, the file
may also contain part of the payloads and offsets for speedy retrieval.
.pay: The payload file contains payloads and offsets associated with certain
term-document positions.
.nvd and .nvm: The normalization files contain lengths and boost factors
for documents and fields. These store the boost values that are multiplied
into the score for hits on a field.
.dvd and .dvm: The per-document value files store additional scoring
factors or other per-document information.
.tvx: The term vector index file contains pointers and offsets into the .tvd file.
.tvd: The term vector data file contains information about each document
that has term vectors. It contains the terms, frequencies, positions, offsets, and
payloads for every document.
.del: This file will be created only if some documents are deleted from the
index. It contains information about which documents were deleted from the index.
.cfs and .cfe: These files are used to create a compound index in which all the files
belonging to a segment of the index are merged into a single .cfs file, with
a corresponding .cfe file indexing its subfiles. Compound indexes are used
when there is a limitation on the number of file descriptors the
system can open during indexing. Since a compound file collapses
all segment files into a single file, the number of file descriptors needed
for indexing is small. However, this has a performance impact, as additional
processing is required to access each file within the compound file.
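Whether Solr writes new segments in this compound format is controlled from the same <indexConfig> section of solrconfig.xml; a hedged sketch:

  <indexConfig>
    <!-- write each new segment as a single .cfs/.cfe pair instead of many per-segment files -->
    <useCompoundFile>true</useCompoundFile>
  </indexConfig>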
In addition to tokenizers and filters, an analyzer can contain a char filter. A char
filter is another component that pre-processes input characters, namely adding,
changing, or removing characters from the character stream. It consumes and
produces a character stream and can thus be chained or pipelined.
Let us look at an example from the schema.xml file shipped with the default
Solr installation:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
The field type specified here is named text_general and it is of type solr.TextField.
We have specified a position increment gap of 100. That is, in a
multivalued field, there would be a difference of 100 positions between the last token of one
value and the first token of the next value. A multivalued field has multiple values for
the same field in a document. An example of a multivalued field is the tags associated
with a document. A document can have multiple tags, and each tag is a value
associated with the document. A search for any tag should return the documents
associated with it. Let us see an example.
Here each document has three tags. Suppose that the tags associated with a
document are tokenized on commas. The tags will be multiple values within the index
of each document. In this case, if the position increment gap is specified as 0 or not
specified, a search for series book will return the first document. This is because the
tokens series and book occur next to each other in the index. On the other hand, if a
positionIncrementGap value of 100 is specified, there will be a difference of 100
positions between series and book, and none of the documents will be returned in
the result.
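A hedged sketch of such a multivalued field (the field name tags is taken from the example above; the query uses standard Lucene phrase syntax):

  <field name="tags" type="text_general" indexed="true" stored="true" multiValued="true"/>

A phrase query such as q=tags:"series book" matches only when the tokens series and book are adjacent. With positionIncrementGap="100", tokens from two different tag values are 100 positions apart, so the phrase cannot match across values unless a very large phrase slop is used.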
In this example, we have multiple analyzers, one for indexing and another for
search. The analyzer used for indexing consists of a StandardTokenizer
and two filters, stop and lowercase. The analyzer used for search (query)
consists of the same tokenizer and three filters: stop, synonym, and lowercase.
The standard tokenizer splits the input text into tokens, treating whitespace and
punctuation as delimiters that are discarded. Dots not followed by whitespace are
retained as part of the token, which in turn helps in retaining domain names. Words
are split at hyphens (-) unless there is a number in the word, in which case the token
is preserved with the hyphen. The @ character is also treated as a delimiter, so e-mail
addresses are not preserved.
The output of the standard tokenizer is a list of tokens that is passed to the stop filter
and the lowercase filter during indexing. The stop filter discards from the token stream
any token that appears in its list of stop words. The lowercase filter converts
all tokens to lowercase. During a search, an additional filter,
the synonym filter, is applied. This filter replaces or expands a token with its synonyms,
as listed in the synonyms.txt file specified as an attribute of the filter.
Let us make some modifications to the stopwords.txt and synonyms.txt files
in our Solr configuration and see how the input text is analyzed.
Add the following two words, each in a new line in the stopwords.txt file:
and
the
Also add a mapping from King to Prince in the synonyms.txt file. We have now told
Solr to treat and and the as stop words, so during analysis they will be dropped.
During the search phase, King will be mapped to Prince, so a search
for king will be replaced by a search for prince.
In order to view the results, perform the following steps:
Open up your Solr interface, select a core (say collection1), and click on the
Analysis link on the left-hand side.
Enter the text of the first document in the text box marked Field Value (Index).
We can see the complete analysis phase during indexing. First, the standard tokenizer
is applied, breaking the input text into tokens. Note that here Half-Blood is
broken into Half and Blood. Next, the stop filter removes the stop words
we mentioned previously: the words And and The are discarded from the token
stream. Finally, the lowercase filter converts all tokens to lowercase.
During the search, suppose the query entered is Half-Blood and King. To check
how it is analyzed, enter this value in Field Value (Query), select text
in the FieldName / FieldType dropdown, and click on Analyze values.
We can see that during the search, as before, Half-Blood is tokenized as Half and
Blood, and the word and is dropped in the stop filter phase. King is replaced with prince
during the synonym filter phase. Finally, the lowercase filter converts all tokens
to lowercase.
An important point to note here is that the lowercase filter appears as the
last filter. This is to prevent any mismatch between the text in the index and that
in the search due to either of them having a capital letter in a token.
The Solr analysis feature can be used to check whether the analyzer we
have created produces output in the desired format during indexing and search. It can
also be used to debug cases where the results are not as expected.
What is the use of such complex analysis of text? Let us look at an example to
understand a scenario where a result is expected from a search but none is found.
The following two documents are indexed in Solr with the custom analyzer we
just discussed:
After indexing, the index will have the following terms associated with the
respective document ids:
A search for project will return both documents 1 and 2. However, a search for
manager will return only document 2. Ideally, manager is equivalent to management.
Therefore, a search for manager should also return both documents. This intelligence
has to be built into Solr with the help of analyzers, tokenizers, and filters. In this case,
a synonym filter listing manager, management, and manages as synonyms should
do the trick. Another way to handle the same scenario is to use stemmers. Stemmers
reduce words to their stem, base, or root form. In this case, the stem of all the
preceding words will be manage. There is a huge list of analyzers, tokenizers, and
filters available with Solr by default that should be able to satisfy any scenario we
can think of.
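A hedged sketch of the two options: the synonyms.txt entry below treats the three forms as equivalent, while the alternative filter line applies a stemmer (PorterStemFilterFactory is one of the stemmers shipped with Solr) at the end of the analyzer chain:

  # entry in synonyms.txt
  manager, management, manages

  <!-- alternatively, stem tokens to their root form during indexing and search -->
  <filter class="solr.PorterStemFilterFactory"/>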
For more information on analyzers, tokenizers, and filters, refer to http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.
AND and OR queries are handled by respectively performing an intersection or
union of documents returned from a search on all the terms of the query. Once the
documents or hits are returned, a scorer calculates the relevance of each document
in the result set on the basis of the inbuilt Term Frequency-Inverse Document
Frequency (TF-IDF) scoring formula and returns the ranked results. Thus, a search
for Project AND Manager will return only document 2 after the intersection
of the results obtained by searching for both terms in the index.
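As a hedged illustration (the field name text is an assumption), such a query can be issued against the standard query parser as follows:

  q=text:(project AND manager)

Only documents containing both terms, which in our example is document 2 alone, are returned, ranked by their TF-IDF scores.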
It is important to remember that text processing during indexing and search affects
the quality of results. Better results can be obtained with high-quality, well-thought-out
text processing during indexing and search.
TF-IDF is a formula used to calculate the relevancy of search terms in
a document against terms in existing documents. In a simple form, a document
with a high TF-IDF score contains the search term with high frequency, while the
term itself does not appear much in other documents.
More details on TF-IDF will be explained in Chapter 2, Customizing a
Solr Scoring Algorithm.
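In its most common simplified form (Lucene's actual scoring formula adds further normalization and boost factors, which Chapter 2 covers), the TF-IDF weight of a term t in a document d is:

$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \log\frac{N}{\mathrm{df}(t)}$$

Here tf(t, d) is the frequency of t in d, df(t) is the number of documents containing t, and N is the total number of documents in the index.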
The question here is whether the words world's and Sony's should be indexed. If
yes, then how? Should a search for Sony return this document in the result? What
would be the stop words here, that is, the words that do not need to be indexed? Ideally,
we would like to ignore stop words such as the, on, of, is, all, or your. How
should the document be indexed so that Xperia Z Ultra matches this document?
First, we need to ensure that Z is not a stop word. The search should contain the
term xperia z ultra. This would break into +xperia OR z OR ultra. Here
xperia is the only mandatory term. The results would be sorted in such a fashion
that the document (our document) that contains all three terms will be at the top.
Also, ideally we would like the search for world or sony to return this document in
the result. In this case, we can use the LetterTokenizerFactory class, which will
separate the words as follows:
World's => World, s
Sony's => Sony, s
Then, we need to pass the tokens through a stop filter to remove stop words. The
output from the stop filter passes through a lowercase filter to convert all tokens
to lowercase. During the search, we can use a whitespace tokenizer and a
lowercase filter to tokenize and process our input text, as sketched below.
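A hedged sketch of such a field type (the name text_products is illustrative; the tokenizer and filter factories are standard Solr classes):

  <fieldType name="text_products" class="solr.TextField">
    <analyzer type="index">
      <!-- LetterTokenizer splits on non-letters, so World's becomes World and s -->
      <tokenizer class="solr.LetterTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>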
In a real-life situation, it is advisable to take multiple examples with different use
cases and work through those scenarios to arrive at the desired solutions. Given that
the number of examples is large, the derived solution should satisfy most of the cases.
If we translate the same sentence into German, here is how it will look:
German
Solr comes with an inbuilt field type for German, text_de, which has a
StandardTokenizer followed by a LowerCaseFilter and a StopFilter
with German stop words. In addition, the analyzer has two German-specific filters,
GermanNormalizationFilter and GermanLightStemFilter. Though this text
analyzer does a pretty good job, there may be cases where it will need improvement.
Let's translate the same sentence into Arabic and see how it looks:
Arabic
Note that Arabic is written from right to left. The default analyzer in the Solr schema
configuration is text_ar. Again, tokenization is carried out by the StandardTokenizer,
followed by a LowerCaseFilter (used for non-Arabic words embedded inside
the Arabic text) and the Arabic StopFilter. This is followed by the Arabic
normalization filter and the Arabic stemmer. Another aspect of Arabic is the diacritic.
A diacritic is a mark (also known as a glyph) added to a letter to change the
sound value of the letter. Diacritics generally appear above or below a letter or,
in some cases, between two letters or within a letter. Diacritics such as ' in English
do not modify the meaning of the word. In contrast, in other languages the addition
of a diacritic modifies the meaning of the word. Arabic is such a language. Thus, it is
important to decide whether or not to normalize diacritics.
Let us translate the same sentence into Japanese and see what we get:
Japanese
Since the complete sentence does not have any whitespace to separate the
words, how do we identify words or tokens and index them? The Japanese analyzer
available in our Solr schema configuration is text_ja. This analyzer identifies the
words in the sentence and creates tokens. A few of the tokens identified are as follows:
Japanese tokens
It also identifies some of the stop words and removes them from the sentence.
As in English, there are other languages where a word is modified by adding a
suffix or prefix to change the tense, grammatical mood, voice, aspect, person, number,
or gender of the word. This concept is called inflection and is handled by stemmers
during indexing. The purpose of a stemmer is to change words such as indexing,
indexed, or indexes into their base form, namely index. The stemmer has to be
applied during both indexing and search so that it is the stems or roots that are
compared in both cases.
The point to note is that each language is unique and presents different challenges to
the search engine. In order to create a language-aware search, the steps that need to
be taken are as follows:
Tokenization: Decide the way tokens should be formed from the language.
Use one Solr field for each language: This is a simple approach that
guarantees that the text is processed the same way as it was indexed. As
different fields can have separate analyzers, it is easy to handle multiple
languages. However, this increases the complexity at query time, as the input
query language needs to be identified and the related language field needs
to be queried. If all fields are queried, the query execution speed goes down.
Also, this may require creating multiple copies of the same text across
fields for different languages (see the sketch after this list).
Use one Solr core per language: Each core has the same field with
different analyzers, tokenizers, and filters specific to the language on that
core. This does not have much query time performance overhead. However,
there is significant complexity involved in managing multiple cores. This
approach would prove complex in supporting multilingual documents
across different cores.
All languages in one field: Indexing and search are much easier as there is
only a single field handling multiple languages. However, in this case, the
analyzer, tokenizer, and filter have to be custom built to support the languages
that are expected in the input text. The queries may not be processed in the
same fashion as the index. Also, there might be confusion in the scoring
calculation. There are cases where particular characters or words may be
stop words in one language and meaningful in another language.
Custom analyzers are built as Solr plugins. The following link gives
more details regarding the same: https://wiki.apache.org/solr/SolrPlugins#Analyzer.
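A hedged sketch of the field-per-language approach (the field names are illustrative; text_de and text_ar are the language-specific types discussed earlier, and text_en is the analogous English type in the example schema):

  <field name="title_en" type="text_en" indexed="true" stored="true"/>
  <field name="title_de" type="text_de" indexed="true" stored="true"/>
  <field name="title_ar" type="text_ar" indexed="true" stored="true"/>

At query time, the detected language of the input decides which of these fields is searched.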
The final aim of a multilingual search should be to provide better search results to
the end users by proper processing of text both during indexing and at query time.
Precision equation:

$$\text{Precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$$

Recall equation:

$$\text{Recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$$
Another way to define precision and recall is by classifying the documents into four
classes according to relevancy and retrieval, as follows:

                    Relevant      Irrelevant
  Retrieved         A             B
  Not retrieved     C             D

With this classification, precision is A / (A + B) and recall is A / (A + C).
We can see that as the number of irrelevant documents or B increases in the result
set, the precision goes down. If all documents are retrieved, then the recall is perfect
but the precision would not be good. On the other hand, if the document set contains
only a single relevant document and that relevant document is retrieved in the search,
then the precision is perfect but again the result set is not good. This is a trade-off
between precision and recall as they are inversely related. As precision increases, recall
decreases and vice versa. We can increase recall by retrieving more documents, but
this will decrease the precision of the result set. A good result set has to be a balance
between precision and recall.
We should optimize our results for precision if the hits are plentiful and several results
can meet the search criteria. Since we have a huge collection of documents, it makes
sense to provide a few relevant and good hits as opposed to adding irrelevant results
in the result set. An example scenario where optimization for precision makes sense is
web search where the available number of documents is huge.
On the other hand, we should optimize for recall if we do not want to miss out any
relevant document. This happens when the collection of documents is comparatively
small. It makes sense to return all relevant documents and not care about the irrelevant
documents added to the result set. An example scenario where recall makes sense is
patent search.
The traditional measure of the accuracy of a result set, more commonly known as the F1 score or F-measure, is defined by the following formula:
Accuracy = 2 * ((precision * recall) / (precision + recall))
This combines precision and recall and is the harmonic mean of the two.
The harmonic mean is a type of average used to find the average of rates or fractions. This
is an ideal formula for accuracy and can be used as a reference point while figuring out
the combination of precision and recall that your result set will provide.
Let us look at some practical problems faced while searching in different
business scenarios.
The takeaway from this is that categorization and feature listing of products should
be taken care of. Misrepresentation of features can lead to incorrect search results.
Another takeaway is that we need to provide multiple facets in the search results.
For example, while displaying the list of all mobiles, we need to provide facets for a
brand. Once a brand is selected, another set of facets for operating systems, network,
and mobile phone features has to be provided. As more and more facets are selected,
we still need to show facets within the remaining products.
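A hedged sketch of such a faceted request (the field names brand and operating_system are assumptions about the product schema):

  /select?q=mobiles&facet=true&facet.field=brand&facet.field=operating_system

The response then carries, along with the matching products, the count of matches under each brand and operating system, which the site can render as clickable facets.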
Another problem is that we do not know what product the customer is searching
for. A site that displays a huge list of products from different categories, such as
electronics, mobiles, clothes, or books, needs to be able to identify what the customer
is searching for. A customer can be searching for samsung, which can be in mobiles,
tablets, electronics, or computers. The site should be able to identify whether the
customer has input the author name or the book name. Identifying the input would
help in increasing the relevance of the result set by increasing the precision of the
search results. Most e-commerce sites provide search suggestions that include the
category to help customers target the right category during their search.
Amazon, for example, provides search suggestions that include both latest
searched terms and products along with category-wise suggestions:
It is also important that products are added to the index as soon as they are available.
It is even more important that they are removed from the index or marked as sold
out as soon as their stock is exhausted. For this, modifications to the index should
be immediately visible in the search. This is facilitated by a concept in Solr known
as Near Real Time Indexing and Search (NRT). More details on using Near Real
Time Search will be explained later in this chapter.
On the recruiter side, the search provided over the candidate database is required
to have a huge set of fields so that a search can be run on every data point that the
candidate has entered. Recruiters are very selective when it comes to searching for candidates
for specific jobs. Educational qualification, industry, function, key skills, designation,
location, and experience are some of the fields provided to the recruiter during
a search. In such cases, the precision has to be high. The recruiter would like a
certain candidate and may be interested in more candidates similar to the selected
candidate. The more like this search in Solr can be used to provide a search for
candidates similar to a selected candidate.
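A hedged sketch of such a request, assuming the MoreLikeThis handler is registered at /mlt and the field names are illustrative:

  /mlt?q=id:candidate123&mlt.fl=key_skills,designation&mlt.mintf=1&mlt.mindf=1

This returns candidates whose key_skills and designation fields look similar to those of the selected candidate.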
NRT is important as the site should be able to provide a job or a candidate for a
search as soon as any one of them is added to the database by either the recruiter
or the candidate. The promptness of the site is an important factor in keeping users
engaged on the site.
Here is the reference to the API for the HttpSolrServer program: http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrServer.html.
Add all files from the <solr_directory>/dist folder to the classpath
for compiling and running the HttpSolrServer program.
Check the hard and soft limits of file descriptors using the ulimit
command:
ulimit -Hn
ulimit -Sn
To increase the number of file descriptors system wide, edit the /etc/sysctl.conf file and add the following line:
fs.file-max = 100000
Distributed indexing
When dealing with large amounts of data to be indexed, in addition to speeding up
the indexing process, we can work on distributed indexing. Distributed indexing
can be done by creating multiple indexes on different machines and finally merging
them into a single, large index. Even better would be to create the separate indexes
on different Solr machines and use Solr sharding to query the indexes across
multiple shards.
For example, an index of 10 million products can be broken into smaller chunks
based on the product ID and can be indexed over 10 machines, with each indexing
a million products. While searching, we can add these 10 Solr servers as shards and
distribute our search queries over these machines.
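A hedged sketch of such a distributed query (the host names and core names are illustrative):

  /select?q=samsung&shards=server1:8983/solr/products,server2:8983/solr/products

The node receiving the request fans the query out to every shard listed in the shards parameter and merges the results before returning them.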
Collection: A logical index that spans across multiple Solr cores is called a
collection. Thus, if we have a two-core Solr index on a single Solr server, it
will create two collections with multiple cores in each collection. The cores
can reside on multiple Solr servers.
Leader: One of the cores within a shard will act as a leader. The leader is
responsible for making sure that all the replicas within a shard are up to date.
SolrCloud has a central configuration that can be replicated automatically across all
the nodes that are part of the SolrCloud cluster. The central configuration is maintained
using a configuration management and coordination system known as Zookeeper.
Zookeeper provides reliable coordination across a huge cluster of distributed systems.
Solr does not have a master node. It uses Zookeeper to maintain node, shard, and
replica information based on configuration files and schemas. Documents can be
sent to any server, and Zookeeper will be able to figure out where to index them. If
a leader for a shard goes down, another replica is automatically elected as the new
leader using Zookeeper.
If a document is sent to a replica during indexing, it is forwarded to the leader.
On receiving the document, the leader node determines whether the
document belongs to another shard and, if so, forwards it to the leader of that shard.
The leader indexes the document and forwards the index notification to its replicas.
SolrCloud provides automatic failover. If a node goes down, indexing and search can
happen over another node. Also, search queries are load balanced across multiple
shards in the Solr cluster. Near Real Time Indexing is a feature where, as soon as
a document is added to the index, the same is available for search. The latest Solr
server contains commands for soft commit, which makes documents added to the
index available for search immediately without going through the traditional commit
process. We would still need to make a hard commit to persist the changes to a stable
data store. A soft commit can be carried out within a few seconds, while a hard
commit takes a few minutes. SolrCloud exploits this feature to provide near real time
search across the complete cluster of Solr servers.
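A hedged sketch of how the two commit types are typically configured in solrconfig.xml (the time values are illustrative):

  <autoCommit>
    <!-- hard commit: flush to stable storage, but do not open a new searcher -->
    <maxTime>600000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <!-- soft commit: make new documents searchable within a few seconds -->
    <maxTime>5000</maxTime>
  </autoSoftCommit>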
It can be difficult to determine the number of shards in a Solr collection in the first
go. Moreover, creating more shards or splitting a shard into two can be a tedious task
if done manually. Solr provides inbuilt commands for splitting a shard, as in the sketch
that follows. The original shard is maintained and can be deleted at a later date.
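A hedged sketch of the Collections API call for splitting a shard (the collection and shard names are illustrative):

  /admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1

The two new sub-shards take over indexing and search, while the original shard remains until it is explicitly deleted.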
SolrCloud also provides the ability to search the complete collection or one or more
particular shards if needed.
SolrCloud removes all the hassles of maintaining a cluster of Solr servers manually
and provides an easy interface to handle distributed search and indexing over a
cluster of Solr servers with automatic failover. We will be discussing SolrCloud in
Chapter 9, SolrCloud.
Summary
In this chapter, we went through the basics of indexing in Solr. We saw the structure
of the Solr index and how analyzers, tokenizers, and filters work in the conversion
of text into searchable tokens. We went through the complexities involved in
multilingual search and also discussed the strategies that can be used to handle the
complexities. We discussed the formula for measuring the quality of search results
and understood the meaning of precision and recall. We saw in brief the problems
faced by e-commerce and job websites during indexing and search. We discussed
the challenges faced while indexing a large number of documents. We saw some tips
on improving the speed of indexing. Finally, we discussed distributed indexing and
search and how SolrCloud provides a solution for implementing the same.