Apache 2.5docx
Another way to think of this is that, instead of specifying the field in documents as they are indexed, Solr finds values for this field in the external file.
External fields are not searchable. They can be used only for function queries or display. For more information on function queries, see the section on Function Queries.
The ExternalFileField type is handy for cases where you want to update a particular field in many documents more often than you want to update the rest of the documents. For example, suppose you have implemented a document rank based on the number of views. You might want to update the rank of all the documents daily or hourly, while the rest of the contents of the documents might be updated much less frequently. Without ExternalFileField, you would need to update each document just to change the rank. Using ExternalFileField is much more efficient because all document values for a particular field are stored in an external file that can be updated as frequently as you wish.
In schema.xml, the definition of this field type might look like this:
<fieldType name="entryRankFile" keyField="pkId" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField"/>
The keyField attribute names the field whose values are used as keys in the external file. It is usually the unique key for the index, but it doesn’t need to be, as long as the keyField can be used to identify documents in the index. The defVal attribute defines a default value that will be used if there is no entry in the external file for a particular document.
Format of the External File
The file itself is located in Solr’s index directory, which by default is $SOLR_HOME/data. The name of the file should be external_fieldname or external_fieldname.*. For the example above, then, the file could be named external_entryRankFile or external_entryRankFile.txt.
If any files using the name pattern .* (such as .txt) appear, the last (after being sorted by name) will be used and previous versions will be deleted. This behavior supports implementations on systems where one may not be able to overwrite a file (for example, on Windows, if the file is in use).
The file contains entries that map a key field, on the left of the equals sign, to a value,
on the right. Here are
a few example entries:
Page 166 of 1426 Apache Solr Reference Guide 7.7
Guide Version 7.7 - Published: 2019-03-04 c 2019, Apache Software Foundation
doc33=1.414
doc34=3.14159
doc40=42
The keys listed in this file do not need to be unique. The file does not need to be sorted,
but Solr will be able
to perform the lookup faster if it is.
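As a rough illustration of the file format described above, the following Python sketch renders key=value entries sorted by key and reads them back with a fallback that plays the role of defVal. The function names are hypothetical, the sort is a plain string sort (an assumption about which order benefits Solr), and letting the last duplicate key win is a choice of this sketch, not documented Solr behavior.

```python
def format_external_entries(values):
    """Render a {key: value} mapping as key=value lines, sorted by key."""
    lines = [f"{key}={value}" for key, value in sorted(values.items())]
    return "\n".join(lines) + "\n"


def parse_external_entries(text, def_val=0.0):
    """Parse key=value lines and return a lookup function.

    The lookup returns def_val for keys with no entry, mirroring the role
    of the defVal attribute. If a key appears more than once, the last
    entry wins (a choice made by this sketch only).
    """
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition("=")
        entries[key] = float(value)
    return lambda key: entries.get(key, def_val)


if __name__ == "__main__":
    content = format_external_entries({"doc40": 42, "doc33": 1.414, "doc34": 3.14159})
    print(content, end="")
    lookup = parse_external_entries(content, def_val=0.0)
    print(lookup("doc34"), lookup("doc99"))
```

The rendered text could then be written to a file such as external_entryRankFile.txt in the index directory.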
Reloading an External File
It’s possible to define an event listener to reload an external file when either a searcher
is reloaded or when
a new searcher is started. See the section Query-Related Listeners for more information, but
a sample
definition in solrconfig.xml might look like this:
<listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
The PreAnalyzedField Type
The PreAnalyzedField type provides a way to send serialized token streams to Solr, optionally with independent stored values of a field, and have this information stored and indexed without any additional text processing applied in Solr. This is useful if a user wants to submit field content that was already processed by some existing external text processing pipeline (e.g., it has been tokenized, annotated, stemmed, synonyms inserted, etc.), while using all the rich attributes that Lucene’s TokenStream provides (per-token attributes).
The serialization format is pluggable using implementations of the PreAnalyzedParser interface. There are two out-of-the-box implementations:
• JsonPreAnalyzedParser: as the name suggests, it parses content that uses JSON to represent a field’s content. This is the default parser to use if the field type is not configured otherwise.
• SimplePreAnalyzedParser: uses a simple strict plain text format, which in some situations may be easier to create than JSON.
There is only one configuration parameter, parserImpl. The value of this parameter should be the fully qualified class name of a class that implements the PreAnalyzedParser interface. The default value of this parameter is org.apache.solr.schema.JsonPreAnalyzedParser.
By default, the query-time analyzer for fields of this type will be the same as the index-time analyzer, which expects serialized pre-analyzed text. You must add a query type analyzer to your fieldType in order to perform analysis on non-pre-analyzed queries. In the example below, the index-time analyzer expects the default JSON serialization format, and the query-time analyzer will employ StandardTokenizer/LowerCaseFilter:
<fieldType name="pre_with_query_analyzer" class="solr.PreAnalyzedField">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
JsonPreAnalyzedParser
This is the default serialization format used by the PreAnalyzedField type. It uses a top-level JSON map with the following keys:
Key | Description | Required
v | Version key. Currently the supported version is 1. | required
str | Stored string value of a field. You can use at most one of str or bin. | optional
bin | Stored binary value of a field. The binary value has to be Base64 encoded. | optional
tokens | Serialized token stream. This is a JSON list. | optional
Any other top-level key is silently ignored.
Token Stream Serialization
The token stream is expressed as a JSON list of JSON maps. The map for each token consists of
the following
keys and values:
Key | Description | Lucene Attribute | Value | Required?
t | token | CharTermAttribute | UTF-8 string representing the current token | required
s | start offset | OffsetAttribute | Non-negative integer | optional
e | end offset | OffsetAttribute | Non-negative integer | optional
i | position increment | PositionIncrementAttribute | Non-negative integer; default is 1 | optional
p | payload | PayloadAttribute | Base64 encoded payload | optional
y | lexical type | TypeAttribute | UTF-8 string | optional
f | flags | FlagsAttribute | String representing an integer value in hexadecimal format | optional
Any other key is silently ignored.
JsonPreAnalyzedParser Example
{
"v":"1",
"str":"test ąćęłńośźż",
"tokens": [
{"t":"one","s":123,"e":128,"i":22,"p":"DQ4KDQsODg8=","y":"word"},
{"t":"two","s":5,"e":8,"i":1,"y":"word"},
{"t":"three","s":20,"e":22,"i":1,"y":"foobar"}
]
}
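A value in this format could be assembled programmatically. The Python sketch below is one possible way to do it, using only the standard json and base64 modules; the helper names token and pre_analyzed_json are hypothetical and are not part of Solr. It emits only the keys described in the tables above.

```python
import base64
import json


def token(term, start=None, end=None, inc=None, payload=None, lexical_type=None):
    """Build the JSON map for one token; only 't' is required."""
    tok = {"t": term}
    if start is not None:
        tok["s"] = start
    if end is not None:
        tok["e"] = end
    if inc is not None:
        tok["i"] = inc
    if payload is not None:
        # Payloads are raw bytes and must be Base64 encoded.
        tok["p"] = base64.b64encode(payload).decode("ascii")
    if lexical_type is not None:
        tok["y"] = lexical_type
    return tok


def pre_analyzed_json(tokens, stored=None):
    """Serialize a token list and an optional stored string value."""
    doc = {"v": "1"}
    if stored is not None:
        doc["str"] = stored
    if tokens:
        doc["tokens"] = tokens
    # ensure_ascii=False keeps non-ASCII stored text readable in the output.
    return json.dumps(doc, ensure_ascii=False)


if __name__ == "__main__":
    tokens = [
        token("one", 123, 128, 22, bytes([13, 14, 10, 13, 11, 14, 14, 15]), "word"),
        token("two", 5, 8, 1, lexical_type="word"),
    ]
    print(pre_analyzed_json(tokens, stored="test ąćęłńośźż"))
```

The resulting string would be sent as the value of a field whose type uses the default JsonPreAnalyzedParser.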
SimplePreAnalyzedParser
The fully qualified class name to use when specifying this format via the parserImpl
configuration
parameter is org.apache.solr.schema.SimplePreAnalyzedParser.
SimplePreAnalyzedParser Syntax
The serialization format supported by this parser is as follows:
Serialization format
content ::= version (stored)? tokens
version ::= digit+ " "
; stored field value - any "=" inside must be escaped!
stored ::= "=" text "="
tokens ::= (token ((" ") + token)*)*
token ::= text ("," attrib)*
attrib ::= name '=' value
name ::= text
value ::= text
Special characters in "text" values can be escaped using the escape character \. The
following escape
sequences are recognized:
Escape sequence | Description
\ (backslash followed by a space) | literal space character
\, | literal , character
\= | literal = character
\\ | literal \ character
\n | newline
\r | carriage return
\t | horizontal tab
Please note that Unicode sequences (e.g., \u0001) are not supported.
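The escape sequences listed above can be applied mechanically when generating input for this parser. The following Python sketch is a hypothetical helper (escape_text is not a Solr function) that escapes every character in the list, which may be more than is strictly required in a given position; all other characters, including non-ASCII ones, pass through unchanged.

```python
# Mapping from a raw character to its escaped form, per the list above.
_ESCAPES = {
    "\\": "\\\\",
    " ": "\\ ",
    ",": "\\,",
    "=": "\\=",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def escape_text(text):
    """Escape one 'text' value (a term, attribute value, or stored value)."""
    # Each character is mapped exactly once, so backslashes that the
    # escaping itself introduces are never escaped a second time.
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


if __name__ == "__main__":
    stored = escape_text("price = 10, approx.")
    print(f"1 ={stored}=one two three")
```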
Supported Attributes
The following token attributes are supported, and identified with short symbolic names:
Name | Description | Lucene attribute | Value format
i | position increment | PositionIncrementAttribute | integer
s | start offset | OffsetAttribute | integer
e | end offset | OffsetAttribute | integer
y | lexical type | TypeAttribute | string
f | flags | FlagsAttribute | hexadecimal integer
p | payload | PayloadAttribute | bytes in hexadecimal format; whitespace is ignored
Token positions are tracked and implicitly added to the token stream. The start and end offsets consider only the term text and whitespace, and exclude the space taken by token attributes.
Example Token Streams
1 one two three
• version: 1
• stored: null
• token: (term=one,startOffset=0,endOffset=3)
• token: (term=two,startOffset=4,endOffset=7)
• token: (term=three,startOffset=8,endOffset=13)
1 one  two   three
• version: 1
• stored: null
• token: (term=one,startOffset=0,endOffset=3)
• token: (term=two,startOffset=5,endOffset=8)
• token: (term=three,startOffset=11,endOffset=16)
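The implicit offsets in simple cases like these can be reproduced with a few lines of Python. This is only an illustration under narrow assumptions: the input is the token portion after the version prefix, the terms contain no escape sequences, and no attributes are attached. The function name implicit_offsets is hypothetical.

```python
import re


def implicit_offsets(body):
    """Return (term, startOffset, endOffset) for space-separated plain terms."""
    # Offsets count positions in the token portion of the input, so extra
    # spaces between terms shift the offsets of the terms that follow.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"[^ ]+", body)]


if __name__ == "__main__":
    for term, start, end in implicit_offsets("one two three"):
        print(f"term={term},startOffset={start},endOffset={end}")
```

With two spaces before the second term and three before the third, the same function yields offsets 0 to 3, 5 to 8, and 11 to 16.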
1 one,s=123,e=128,i=22 two three,s=20,e=22
• version: 1
• stored: null
• token: (term=one,positionIncrement=22,startOffset=123,endOffset=128)
• token: (term=two,positionIncrement=1,startOffset=5,endOffset=8)
• token: (term=three,positionIncrement=1,startOffset=20,endOffset=22)
1 \ one\ \,,i=22,a=\, two\=
\n,\ =\ \
• version: 1
• stored: null
• token: (term=one ,,positionIncrement=22,startOffset=0,endOffset=6)
• token: (term=two= ,positionIncrement=1,startOffset=7,endOffset=15)
• token: (term=\,positionIncrement=1,startOffset=17,endOffset=18)
Note that unknown attributes and their values are ignored, so in this example, the “a”
attribute on the first
token and the " " (escaped space) attribute on the second token are ignored, along with their
values,
because they are not among the supported attribute names.
1 ,i=22 ,i=33,s=2,e=20 ,
• version: 1
• stored: null
• token: (term=,positionIncrement=22,startOffset=0,endOffset=0)
• token: (term=,positionIncrement=33,startOffset=2,endOffset=20)
• token: (term=,positionIncrement=1,startOffset=2,endOffset=2)
1 =This is the stored part with \=
\n \t escapes.=one two three
• version: 1
• stored: This is the stored part with = \t escapes.
• token: (term=one,startOffset=0,endOffset=3)
• token: (term=two,startOffset=4,endOffset=7)
• token: (term=three,startOffset=8,endOffset=13)
Note that the \t in the above stored value is not literal; it’s shown that way to visually
indicate the actual tab
char that is in the stored value.
1 ==
• version: 1
• stored: ""
• (no tokens)
1 =this is a test.=
• version: 1
• stored: this is a test.
• (no tokens)
Field Properties by Use Case
Here is a summary of common use cases, and the attributes the fields or field types should
have to support
the case. An entry of true or false in the table indicates that the option must be set to the
given value for the
use case to function correctly. If no entry is provided, the setting of that attribute has no
impact on the case.
Use Case | indexed | stored | multiValued | omitNorms | termVectors | termPositions | docValues
search within field | true | | | | | |
retrieve contents | | true [8] | | | | | true [8]
use as unique key | true | | false | | | |
sort on field | true [7] | | false [9] | true [1] | | | true [7]
highlighting | true [4] | true | | | true [2] | true [3] |
faceting [5] | true [7] | | | | | | true [7]
add multiple values, maintaining order | | | true | | | |
field length affects doc score | | | | false | | |
MoreLikeThis [5] | | | | | true [6] | |
Notes:
1. Recommended but not necessary.
2. Will be used if present, but not necessary.
3. (if termVectors=true)
4. A tokenizer must be defined for the field, but it doesn’t need to be indexed.
5. Described in Understanding Analyzers, Tokenizers, and Filters.
6. Term vectors are not mandatory here. If not true, then a stored field is analyzed. So term vectors are recommended, but only required if stored=false.
7. For most field types, either indexed or docValues must be true, but both are not required. DocValues can be more efficient in many cases. For [Int/Long/Float/Double/Date]PointFields, docValues=true is required.
8. Stored content will be used by default, but docValues can alternatively be used. See DocValues.
9. Multi-valued sorting may be performed on docValues-enabled fields using the two-argument field() function, e.g., field(myfield,min); see the field() function in Function Queries.
Defining Fields
Fields are defined in the fields element of schema.xml. Once you have the field types set
up, defining the
fields themselves is simple.
Example Field Definition
The following example defines a field named price with a type named float and a default
value of 0.0; the
indexed and stored properties are explicitly set to true, while any other properties
specified on the float
field type are inherited.
<field name="price" type="float" default="0.0" indexed="true" stored="true"/>
Field Properties
Field definitions can have the following properties:
name
The name of the field. Field names should consist of alphanumeric or underscore characters only and should not start with a digit. This is not currently strictly enforced, but other field names will not have first-class support from all components, and backward compatibility is not guaranteed. Names with both leading and trailing underscores (e.g., _version_) are reserved. Every field must have a name.
type
The name of the fieldType for this field. This will be found in the name attribute on the
fieldType
definition. Every field must have a type.
default
A default value that will be added automatically to any document that does not have a value
in this field
when it is indexed. If this property is not specified, there is no default.
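The naming guidance for the name property can be checked before a schema is deployed. The Python sketch below is a hypothetical pre-flight check, not something Solr provides, and it assumes that "alphanumeric" means ASCII letters and digits.

```python
import re

# Letters, digits, and underscores only, and no leading digit.
# Restricting to ASCII is an assumption of this sketch.
_RECOMMENDED = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def check_field_name(name):
    """Classify a field name as 'ok', 'reserved', or 'not recommended'."""
    if not _RECOMMENDED.fullmatch(name):
        return "not recommended"
    # Names with both a leading and a trailing underscore are reserved.
    if len(name) > 1 and name.startswith("_") and name.endswith("_"):
        return "reserved"
    return "ok"


if __name__ == "__main__":
    for candidate in ("price", "_version_", "2fast", "my-field"):
        print(candidate, "->", check_field_name(candidate))
```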
Optional Field Type Override Properties
Fields can have many of the same properties as field types. Properties from the table below that are specified on an individual field will override any explicit value for that property specified on the fieldType of the field, or any implicit default property value provided by the underlying fieldType implementation. The table below is reproduced from Field Type Definitions and Properties, which has more details:
Property | Description | Values | Implicit Default
indexed | If true, the value of the field can be used in queries to retrieve matching documents. | true or false | true
stored | If true, the actual value of the field can be retrieved by queries. | true or false | true
docValues | If true, the value of the field will be put in a column-oriented DocValues structure. | true or false | false
sortMissingFirst, sortMissingLast | Control the placement of documents when a sort field is not present. | true or false | false
multiValued | If true, indicates that a single document might contain multiple values for this field type. | true or false | false
uninvertible | If true, indicates that an indexed="true" docValues="false" field can be "uninverted" at query time to build up a large in-memory data structure to serve in place of DocValues. Defaults to true for historical reasons, but users are strongly encouraged to set this to false for stability and use docValues="true" as needed. | true or false | true
omitNorms | If true, omits the norms associated with this field (this disables length normalization for the field, and saves some memory). Defaults to true for all primitive (non-analyzed) field types, such as int, float, date, bool, and string. Only full-text fields need norms. | true or false | *
omitTermFreqAndPositions | If true, omits term frequency, positions, and payloads from postings for this field. This can be a performance boost for fields that don’t require that information. It also reduces the storage space required for the index. Queries that rely on position that are issued on a field with this option will silently fail to find documents. This property defaults to true for all field types that are not text fields. | true or false | *
omitPositions | Similar to omitTermFreqAndPositions but preserves term frequency information. | true or false | *
termVectors, termPositions, termOffsets, termPayloads | These options instruct Solr to maintain full term vectors for each document, optionally including position, offset and payload information for each term occurrence in those vectors. These can be used to accelerate highlighting and other ancillary functionality, but impose a substantial cost in terms of index size. They are not necessary for typical uses of Solr. | true or false | false
required | Instructs Solr to reject any attempts to add a document which does not have a value for this field. This property defaults to false. | true or false | false
useDocValuesAsStored | If the field has docValues enabled, setting this to true would allow the field to be returned as if it were a stored field (even if it has stored=false) when matching “*” in an fl parameter. | true or false | true
large | Large fields are always lazy loaded and will only take up space in the document cache if the actual value is < 512KB. This option requires stored="true" and multiValued="false". It’s intended for fields that might have very large values so that they don’t get cached in memory. | true or false | false
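The override order described in this section (a value on the field, then an explicit value on the fieldType, then the implicit default) can be pictured as a layered lookup. The Python sketch below models it with collections.ChainMap. It is only an illustration: the implicit defaults shown are a subset taken from the table, the real defaults depend on the underlying fieldType implementation, and effective_properties is a hypothetical name.

```python
from collections import ChainMap

# A subset of the implicit defaults from the table, for illustration only.
IMPLICIT_DEFAULTS = {
    "indexed": True,
    "stored": True,
    "docValues": False,
    "multiValued": False,
    "required": False,
}


def effective_properties(field_props, field_type_props, implicit_defaults=IMPLICIT_DEFAULTS):
    """Resolve properties: field first, then fieldType, then implicit defaults."""
    # ChainMap returns the value from the first mapping that has the key.
    return dict(ChainMap(field_props, field_type_props, implicit_defaults))


if __name__ == "__main__":
    field_type = {"indexed": False, "docValues": True}
    field = {"stored": False}
    print(effective_properties(field, field_type))
```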
Copying Fields
You might want to interpret some document fields in more than one way. Solr has a mechanism
for making
copies of fields so that you can apply several distinct field types to a single piece of
incoming information.
The name of the field you want to copy is the source, and the name of the copy is the
destination. In
schema.xml, it’s very simple to make copies of fields:
<copyField source="cat" dest="text" maxChars="30000" />
In this example, we want Solr to copy the cat field to a field named text. Fields are copied
before analysis is
done, meaning you can have two fields with identical original content, but which use
different analysis
chains and are stored in the index differently.
In the example above, if the text destination field has data of its own in the input documents, the contents of the cat field will be added as additional values, just as if all of the values had originally been specified by the client. Remember to configure your fields as multiValued="true" if they will ultimately get multiple values (either from a multivalued source or from multiple copyField directives).
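The effect of copying on a destination field that already has values can be simulated outside Solr. The Python sketch below is a simplified model, not Solr's implementation: documents are plain dictionaries mapping field names to lists of values, apply_copy_fields is a hypothetical name, and the maxChars attribute is ignored.

```python
def apply_copy_fields(doc, copy_fields):
    """Return a copy of doc with source values appended to each destination.

    doc maps field names to lists of values; copy_fields is a list of
    (source, dest) pairs.
    """
    result = {name: list(values) for name, values in doc.items()}
    for source, dest in copy_fields:
        if source in doc:
            # Values are read from the incoming document and appended after
            # any values the destination already has.
            result.setdefault(dest, []).extend(doc[source])
    return result


if __name__ == "__main__":
    incoming = {"cat": ["electronics"], "text": ["supplied by the client"]}
    print(apply_copy_fields(incoming, [("cat", "text")]))
```

In this model the text field ends up with two values, which is why the destination would need to be multiValued.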
A common usage for this functionality is to create a single "search" field that will serve as
the default query