OpenText Content Server
Advanced Indexing and Searching Guide
LLESSRC210400-GGD-EN-01
Rev.: 2021-Aug-27
This documentation has been created for OpenText Content Server CE 21.4.
It is also valid for subsequent software releases unless OpenText has made newer documentation available with the product,
on an OpenText website, or by any other means.
Tel: +1-519-888-7111
Toll Free Canada/USA: 1-800-499-6544 International: +800-4996-5440
Fax: +1-519-888-0677
Support: https://support.opentext.com
For more information, visit https://www.opentext.com
One or more patents may cover this product. For more information, please visit https://www.opentext.com/patents.
Disclaimer
Every effort has been made to ensure the accuracy of the features and techniques presented in this publication. However,
Open Text Corporation and its affiliates accept no responsibility and offer no warranty, whether expressed or implied, for the
accuracy of this publication.
Table of Contents
1 Overview ..................................................................................... 5
1.1 Understanding the Search Infrastructure ............................................. 5
1.2 Understanding the Search Grid .......................................................... 8
In order to use the data stored in Content Server, you must be able to find it quickly
and easily. For this reason, creating indexes and maintaining their integrity are two
of the most important tasks that Content Server Administrators perform.
Content Server Administrators create indexes by designing data flows that extract
and process the data they want to index. As the size of a Content Server repository
increases, Administrators must be able to optimize Content Server's indexing and
search functionality to accommodate the increasing demands being made on the
system. To do so requires a thorough understanding of the architecture of Content
Server's indexing and searching systems.
The Content Server indexing and searching system was rearchitected to provide
better search engine performance, the ability to scale to large datasets by sharing
indexing tasks over multiple indexing processes, and more flexible and configurable
search result ranking.
Notes
• This guide assumes that you are a high-level Content Server Administrator
who has read the Content Server Administration - Admin Online Help and
performed advanced administrative tasks at a Content Server site.
• OpenText recommends that you do not use a production environment to
experiment with the index and search operations described in this guide.
Instead, experiment with a test Content Server system and roll out the
changes to a production system when appropriate.
All of the processes in the search infrastructure (that is, the data flow processes,
Update Distributor, Index Engines, Search Federators, and Search Engines) are
managed by a Content Server Admin server. Although OpenText recommends that
you run the Index Engine and Search Engine processes associated with a partition
on the same computer, other indexing and searching processes can (and in large
installations, should) run on separate computers. Index Engines and Search Engines
communicate with each other through shared files stored in the Partition's index
directory.
1.1.2 DCS
The Document Conversion Service (DCS) is a process that converts documents from
their native formats to text so that they can be indexed. Content Server Admin
servers manage DCSs in Content Server.
The DCS infrastructure has an API which allows various filter packs to be installed
and multiple filter packs to work together. By default, Content Server ships with
support for the OpenText Document Filters (OTDF). OpenText Document Filters can
convert many file formats to text, which allows those files to be indexed and summarized.
For more information about the DCS, see “Understanding Document Conversion”
on page 26.
• “Partitions” on page 9
• “Update Distributor” on page 9
• “Index Engine” on page 10
• “Search Federator” on page 12
• “Search Engine” on page 12
1.2.1 Partitions
A Partition is a logical portion of an entire index of data. Large datasets can be
indexed by more than one process and the processes can reside on different
computers. The distribution of indexing is achieved by partitioning or dividing your
dataset across all of the processes responsible for building indexes. Each Partition
contains a distinct portion of the complete dataset.
For more information about partition maps, see OpenText Content Server Admin
Online Help - Search Administration (LLESWBS-H-AGD).
Adding Partitions
Because the architecture of the search grid allows indexing tasks to be shared by
multiple indexing processes across Partitions, Content Server Administrators have
more flexibility when scaling for extremely large data sets. The search grid allows
large datasets to be indexed by more than one Index Engine process and the Index
Engine processes may reside on different machines. In this way, the search grid
architecture supports massive parallelism during indexing, and allows the index to
grow larger by adding more processes or machines to do the work.
For specific information about how to add Partitions, see OpenText Content Server
Admin Online Help - Search Administration (LLESWBS-H-AGD).
Index updates occur when items are added to the data source and when existing
items are modified or deleted. Although each indexed item is owned by only one of
the Partitions that comprise the index, the Update Distributor passes user
information about the item to all Partitions to quickly update all references to the
user.
When an update needs to be made to an item, the Update Distributor sends the user
information and OTObject portion of the update to the Index Engines. If any Index
Engine contains the item, it responds to the Update Distributor and receives the rest
of the data associated with the update. If no Index Engine contains the item, the
Update Distributor chooses an Index Engine to which it will send the remaining
data.
Note: Search Engines may not use the most recent updates immediately. This
is because updates are asynchronous due to the number of processing steps in
a data flow.
• A metadata index
• A content index
• A content accumulator
Metadata Index
The metadata index is an index of the metadata associated with the items in the
partition. It usually operates in RAM, ideally without being swapped to disk, and is
associated with the metaLog file and the checkpoint file in the index directory (that
is, Content Server home/index/data_source/index). The metaLog file is a chronological
listing of the changes made to the metadata index since the last checkpoint. The
checkpoint file is an on-disk snapshot of the metadata index at some point in time.
When an update is made to the metadata index, it is appended to the metaLog file,
which is later read by the dependent Search Engines so that they can update their
own memory-based copy of the metadata index. At certain points (called
checkpoints), the entire metadata index is committed to disk and captured in the
checkpoint file. At the same time, a new metaLog file is created. Each time the Index
Engine and dependent Search Engines start, they load the data from the checkpoint
file and associated metaLog file to produce the most recent metadata for indexing
and searching purposes.
Notes
A new storage mechanism for text regions has been implemented, the Retrieve-Only
mode. In this mode, the metadata region values are stored on disk, and the in-
memory index is not created; the region data is not searchable, but it may be
retrieved. This option allows further tuning of the Search Engine to reduce the
memory requirements.
Within Content Server, by choosing Retrieve-Only for Hot Phrases and Summaries
you may save an additional 30% of memory. These two data types are derived from
content already indexed, so there is no loss of searchable data.
The Retrieve-Only mode is supported through the Partition Map page, which
supports conversions between RAM and DISK modes for text fields. For details, see
OpenText Content Server Admin Online Help - Search Administration (LLESWBS-H-
AGD).
Note: Efficient conversion between all three field modes is supported. As with
conversions between RAM <-> DISK field modes, conversion to and from
DISK_RET mode requires a restart of the Index and Search Engines.
Content Index
The content index is a collection of index fragments, each stored on disk in its own
subdirectory of the index directory. These fragments are produced either by the
merging of other index fragments or by the in-memory content accumulator when it
has reached the limit set by the Accumulator Memory Size setting on the Specific
Properties page of a partition.
Note: The total size of the content index is limited by the Maximum Content
Disk Size setting on the Specific Properties page of a partition.
Content Accumulator
The content accumulator controls the number of index fragments that are produced,
thereby influencing the number of merges required. Its operation is governed by the
Accumulator Memory Size setting on the Specific Properties page of a partition.
When the value of this setting is reached, the content accumulator dumps itself to
produce another on-disk index fragment. The content accumulator operates in RAM
without being swapped to disk, and it is associated with the accumLog file in the
index directory (that is, Content Server home/index/data_source/index). As with the
metadata index, updates are appended to the accumLog file that is monitored by the
dependent Search Engines. At certain points, the accumulated content is committed
to disk as a new index fragment and a new accumLog file is created.
Note: When either of the limits specified by the Maximum Metadata Memory
Size or Maximum Content Disk Size settings is reached, the Update
Distributor stops adding new objects to that Partition. If all Partitions reach
their size limits, the Update Distributor stops and reports a process error
indicating that all partitions are full. In addition, a default control rule
automatically sends the Content Server administrators an e-mail message
when the partitions are approaching their size limits.
You can then create a new partition manually or set up a control rule that
creates one for you when partitions have reached a certain capacity.
If an object generates too much information to fit in the specified accumulator size,
an error code is set in the OTContentStatus field. This normally would not happen,
however, because content is truncated to 10 megabytes by default (as controlled by
the ContentTruncSizeInMBytes setting).
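As a hedged illustration only, the default truncation limit corresponds to a configuration entry like the following. The enclosing section of the configuration file is not shown because it varies by installation; consult the Admin Online Help for the exact location of this setting.

ContentTruncSizeInMBytes=10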
Each Search Federator has one Search Engine per Partition. For more information
about how to add Search Federators to a search grid, see OpenText Content Server
Admin Online Help - Search Administration (LLESWBS-H-AGD).
Each Search Engine knows the file system location of the index built by a particular
Index Engine process. The Search Engine uses the files created by the Index Engine
process to maintain its own searchable index consisting of an in-memory metadata
component, an on-disk content component, and an in-memory content accumulator
component. This three-part model is analogous to the Index Engine's three-part
model; however, the Search Engine has searching (rather than updating)
responsibilities. For information about the three-part index model maintained by the
Index Engine, see “Index Engine” on page 10.
Before you can search, you must index the data that you want to be searchable.
Content Server's indexing system allows you to index data from a variety of sources
including the internal Content Server repository, directories on a file system, and
external Web sites.
Only one process is normally required to extract data from the Content Server
database, so OpenText recommends that each Content Server system have only one
Enterprise Extractor process, unless OpenText Global Services or Customer Support
has advised adding multiple Enterprise Extractors as part of a strategy of high-
volume indexing. For more information about high-volume indexing, or adding or
configuring an Enterprise Extractor process, see OpenText Content Server Admin
Online Help - Search Administration (LLESWBS-H-AGD).
Content Server Extractor processes become aware of changes (adds, updates, and
deletes) by monitoring the DTreeNotify table in the Content Server database. Each
time an item is updated, an entry containing the following information is added to
the DTreeNotify table:
A Content Server Extractor process selects a line from this table and stores the
corresponding information in memory while it compiles the information it writes to
the data flow's iPool. Depending on how you have configured your Content Server
site, the Extractor process selects a line from either the top (oldest updates) or
bottom (newest updates) of the DTreeNotify table. If the
wantDescendingExtractor setting in the [LivelinkExtractor] section of the
opentext.ini file is set to TRUE (the default setting), the Extractor processes the
most recent updates first. For more information about the
wantDescendingExtractor setting, see the [LivelinkExtractor] section in the
OpenText Content Server Admin Online Help - Content Server Administration
(LLESWBA-H-AGD).
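For example, the default behavior described above corresponds to the following entry in the opentext.ini file; to process the oldest updates first instead, set the value to FALSE:

[LivelinkExtractor]
wantDescendingExtractor=TRUE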
When the Content Server Extractor process has processed the update information
and written a message to the data flow's iPool, the entry is deleted from the
DTreeNotify table. When multiple Extractor processes exist, each process writes its
messages to a unique iPool. However, since the Update Distributor can only read
from a single iPool, the extracted data must be merged before it reaches the Update
Distributor process. The following images illustrate possible designs for data flows
containing multiple Content Server Extractor processes.
Important
You may experience a significant decrease in performance if you implement a
configuration that was not specifically designed for your system.
For efficiency, there are certain rules about what an Extractor process will write to a
single iPool message. For example, if an Extractor is writing an update to an item as
well as a delete for the same item in the same iPool message, it writes the delete
only.
If you have configured your Content Server site to use external file storage (EFS),
you can specify how a Content Server Extractor process extracts document content
from the EFS: either by extracting the complete document content from the EFS or by
referencing the location of the document in the EFS. This behavior is controlled by
the UseContentReference setting in the [LivelinkExtractor] section of the
opentext.ini file. By default, this setting is not included in the opentext.ini file,
which is the equivalent of setting it to false. In other words, the complete
document content is extracted. However, configuring an Enterprise Extractor to
extract document content by referencing the location of the document in the EFS can
reduce the load on the iPools in the Enterprise data flow and improve overall
indexing speed. In this case, the Document Conversion Service (DCS) must be able
to access the document content using the exact reference. For more information
about the Document Conversion Service, see “Understanding Document
Conversion” on page 26.
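A minimal sketch of the opentext.ini entry that enables extraction by reference follows. Because the setting is absent from the file by default, you must add it yourself; the value shown assumes the setting accepts true/false values, as described above.

[LivelinkExtractor]
UseContentReference=true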
The Directory Walker process is the producer process for the User Help and the
Admin Help data flows in Content Server. It crawls the directories where the
Content Server User or Admin Online Help files are stored and deposits the content
of those files (encoded in iPool messages) into an iPool. Content Server
Administrators can also use Directory Walker processes in custom data flows. For
more information about administering Directory Walker processes and creating
Directory Walker data flows, see OpenText Content Server Admin Online Help - Search
Administration (LLESWBS-H-AGD).
When a Directory Walker process walks a set of directories for the first time, it
records the files that match its criteria in a crawl history database file. Content Server
administrators can specify the name and location of the crawl history database file
on the Specific Properties page of a Directory Walker process. This information is
also contained in the configuration file for the Directory Walker process. The
DBLocator parameter in the configuration file specifies the directory in which the
files are stored. The DBLocatorName parameter in the configuration file specifies the
file name.
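For example, the two parameters might appear in the Directory Walker configuration file as follows. This is a sketch that assumes the same key=value syntax used elsewhere in Content Server configuration files, and it uses the example values discussed in the next paragraph.

DBLocator=Content Server_home/myDirWalk
DBLocatorName=myDirWalker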
Content Server adds the .new extension to its current crawl history database file. For
example, if the DBLocator parameter is Content Server_home/myDirWalk and the
DBLocatorName parameter is myDirWalker, the Directory Walker process creates the
Content Server_home/myDirWalk/myDirWalker.new file when it walks its directories
for the first time. This crawl history database file contains a list of all the files that the
Directory Walker process has encoded in iPool messages and sent to the iPool.
When the Directory Walker process runs again, it renames the original crawl history
database file (history.new), giving it the .old file extension (history.old). It then
recrawls the directories and creates a new crawl history database file (history.new).
The Directory Walker process compares the new crawl history database file with the
old crawl history database file and sends any appropriate iPool messages to the
iPool. In this way, the Directory Walker process keeps track of new, modified, and
deleted files. This process repeats each time that the Directory Walker process runs.
The next time that the Directory Walker process runs, it renames the old crawl
history file (.old), giving it the .junk file extension (history.junk). At the same time, it
renames the current crawl history database file (history.new), giving it the .old file
extension (history.old), and creates a new crawl database history file (history.new)
that contains the most recent information.
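To illustrate the rotation using the myDirWalker example above, after the third and subsequent runs the crawl history directory would contain files named as follows. This is a sketch of the naming only, and assumes that the extension is replaced on each rename.

Content Server_home/myDirWalk/myDirWalker.junk   (crawl history from two runs ago)
Content Server_home/myDirWalk/myDirWalker.old    (crawl history from the previous run)
Content Server_home/myDirWalk/myDirWalker.new    (crawl history from the most recent run)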
The XML files produced by your third-party applications must meet certain
requirements to work properly with XML Activator processes.
Placement
The XML files that a third-party application places in directories for an XML
Activator process to read must be fully closed. The third-party application can fulfill
this requirement by writing files to a local directory and then moving the files to the
XML Activator process's incoming directory.
Naming
Format
Along with their content (for example, binary data or text), the XML files that are
generated by third-party applications must include XML data that maps to data
interchange pool (iPool) messages. This XML data tells the XML Activator process
what to do with the corresponding content. The following table describes iPool key-
value pairs, which you include in XML files as tagged elements and which together
constitute iPool messages. For more information about configuring an XML Activator process, see
OpenText Content Server Admin Online Help - Search Administration (LLESWBS-H-
AGD).
The following table describes the metadata fields that OpenText recommends you
include in each XML file. Each field corresponds to a Content Server region. If you
do not include these fields, Content Server users will not be able to search for the
corresponding data as effectively in Content Server.
The tag names that you use must match the metadata field names, unless you
specify alternative tag mappings in Content Server under the Metadata List field for
the XML Activator process, which is located on the process's Specific Info page. You
can also include as many metadata fields as necessary by wrapping information in
tags whose names match Content Server regions, or whose names are mapped to
regions in the Metadata List field for the XML Activator process. For more
information about mapping metadata tags when adding or configuring an XML
Activator process, see the Content Server Admin Online Help.
Field Description
OTName The name of the data object
OTOwnerID A unique identifier for the owner of the data object
OTLocation The original location of the data object
OTCreateDate The date on which the data object was created
OTCreateTime The time at which the data object was created
OTModifyDate The date on which the data object was last modified
OTModifyTime The time at which the data object was last modified
OTCreatedBy The node number of the creator of the data object
OTCreatedByName The login name of the creator of the data object
OTCreatedByFullName The full name of the creator of the data object
Operations
• If there is no existing object with the external object id specified in the OTURN
value, then an Add of an object with the specified metadata is performed.
Note: AddOrModify only works with metadata, not content; if you need to
add or modify content, please use the AddOrReplace operation.
• If there is an existing object with the external object id specified in the OTURN
value, then a Modify of that object is performed as follows:
– the newly specified metadata regions replace any previous metadata regions
of the same region names
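Because AddOrModify applies only to metadata, an AddOrModify message can be structured like the AddOrReplace example that follows, but with the Operation element set to AddOrModify and no Content element. The following is a sketch only, using a subset of the recommended metadata fields described above.

<?xml version="1.0"?>
<Body>
<OTURN>OTURN</OTURN>
<Operation>AddOrModify</Operation>
<Metadata>
<OTName>Data object name</OTName>
<OTModifyDate>Date last modified</OTModifyDate>
<OTModifyTime>Time last modified</OTModifyTime>
</Metadata>
</Body>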
For the AddOrReplace operation, structure your XML file as follows, including the OTURN, the Operation, the metadata, and the content:
<?xml version="1.0"?>
<Body>
<OTURN>OTURN</OTURN>
<Operation>AddOrReplace</Operation>
<Metadata>
<OTName>Data object name</OTName>
<OTOwnerID>Owner ID number</OTOwnerID>
<OTLocation>Original location</OTLocation>
<OTCreateDate>Creation date</OTCreateDate>
<OTCreateTime>Creation time</OTCreateTime>
<OTModifyDate>Date last modified</OTModifyDate>
<OTModifyTime>Time last modified</OTModifyTime>
<OTCreatedBy>
Node number of creator
</OTCreatedBy>
<OTCreatedByName>
Login name of creator
</OTCreatedByName>
<OTCreatedByFullName>
Full name of creator
</OTCreatedByFullName>
</Metadata>
<Content encoding='Base64'>Content data</Content>
</Body>
• If there is no existing object with the external object ID specified in the OTURN
value, then an Add of an object with the specified metadata and content is
performed.
• If there is an existing object with the external object id specified in the OTURN
value, then a Replace of that object is performed as follows:
– all previous metadata of the object is discarded (including any regions not
mentioned in the new metadata) and replaced with the newly specified
metadata regions
For the Delete operation, structure your XML file as follows, including only the
OTURN and the Operation. Do not include any content or metadata because this
operation deletes data that already exists in the index and is identified by its OTURN.
<?xml version="1.0"?>
<Body>
<OTURN>OTURN</OTURN>
<Operation>Delete</Operation>
</Body>
• If there is an existing object with the external object id specified in the OTURN
value, then the object is deleted.
• Otherwise, no objects are affected.
Note: Beginning with Content Server 20.3, the option to batch ModifyByQuery
and DeleteByQuery operations has been removed. Instead, they are always
batched.
For example, Content Server cannot delete renditions from the index. Another
example is when there are versions of a document in different partitions.
DeleteByQuery will allow these cases to be handled. The update distributor sends
these requests to all of the index engines. Each index engine logs the number of
objects it deleted. The DeleteByQuery feature complements the “Modifying Objects
By Query” on page 25 feature.
Caution
Using DeleteByQuery is a potentially dangerous operation and must be
used carefully since it can result in data loss.
The query should not match more objects than you intend to delete. For
example, a query for "TempSandbox" will also match "Second
TempSandbox", which may not be what you intended. OpenText
recommends that you restrict the use of the DeleteByQuery operation to
fields which just use one-word keys.
Note: Beginning with Content Server 20.3, the option to batch ModifyByQuery
and DeleteByQuery operations has been removed. Instead, they are always
batched.
Unlike AddOrModify, the ModifyByQuery operation will NOT create a new entry
(object). It will only change existing entries. The operation is broadcast to all Index
Engines by the Update Distributor, and each Index Engine logs the number of
objects it modified.
Like DeleteByQuery, this operation is performed using iPools, not using the Live
Query Language. The query string is used in place of the object ID in the iPool.
Caution
Using ModifyByQuery is a potentially dangerous operation and must be
used carefully since it can result in data loss.
The query should not match more objects than you intend to modify. For
example, a query for "TempSandbox" will also match "Second
TempSandbox", which may not be what you intended. OpenText
recommends that you restrict use of the ModifyByQuery operation to fields
which just use one-word keys.
For more information about XML Activator, see OpenText Content Server Admin
Online Help - Search Administration (LLESWBS-H-AGD).
OpenText Document Filters converts items from their native file formats to a simple
text format for viewing or indexing in Content Server, and is used to display
Content Server items. For summary hit highlighting, find similar, recommender
synopsis generation, and classification profile generation, the DCS server uses
Quality Document Filters to convert Content Server items.
The OpenText Document Filters detect and convert items of the following formats:
• Microsoft Word
• Microsoft Excel
• Microsoft PowerPoint
• Microsoft Outlook
• Standard Mail (RFC822)
• Adobe PDF
• HTML
• ZIP
• TAR
• LZH
If you want to extend the DCS's MIME type detection, text-extraction, and
document-conversion capabilities, you can create a custom filter pack that your
Content Server Administrator can install. For more information, see the OpenText
Content Server Filter Pack - Creating Filter Packs (LLESCF-CFP) guide.
The DCS architecture provides a framework for MIME type detection and document
conversion that is easily extensible. The architecture is composed of one or more
DCS servers that work with DCS workers and filter packs. DCS workers are
processes that identify and load the filters required to extract text or convert a
document of a particular file format to a simple text format such as HTML. DCS
filter packs are installable sets of DCS components and associated files that extend
Content Server's document-conversion capability.
When items are added to Content Server, the default MIME type detection relies on
the following sequence:
• browser identification
• item file extension
• DCS process.
• DCS process
• browser identification
• item file extension.
Note: When represented in indexing data flows, the DCS server is called the
Document Conversion Process and data interchange pools are called iPools.
For more information, see OpenText Content Server Admin Online Help - Search
Administration (LLESWBS-H-AGD).
DCS clients are processes that make requests directly to a DCS server (dcs). For
example, if a user tries to view a document of a particular MIME type, a DCS client
named LLView opens a socket connection with an available DCS server (dcs) and
passes the document to that DCS server for MIME type verification and conversion.
In this case, the document's MIME type must be listed in the [filters] section of
the opentext.ini file; otherwise, the user will be prompted to Open or Download the
document. Similarly, if a user tries to hit highlight a search result item, find a similar
item, generate a recommender synopsis, or generate a classification profile, a DCS
client named wfwconv opens a socket connection with an available DCS server, and
passes the appropriate item to that DCS server for MIME type verification and
conversion. Figure 2-4 illustrates the sequence of operations performed when a user
makes one of these types of requests in Content Server.
In Figure 2-4, OTDF is OpenText Document Filters, and QDF is OpenText Quality
Document Filters.
In Figure 2-5, OTDF is OpenText Document Filters, and QDF is OpenText Quality
Document Filters.
For information about how DCS servers interact with Content Server Admin servers,
see the OpenText Content Server Admin Online Help - Search Administration (LLESWBS-
H-AGD).
DCS workers are processes that identify and load the filters required to extract text
or convert a document of a particular MIME type to a simple text format. For
example, for a user to view a Microsoft Word document, the DCS server generates a
worker process that loads a third-party set of filters, which is used to convert the
Microsoft Word document to HTML format for viewing in Content Server.
DCS workers are persistent processes. This means that once a particular worker
process is loaded by the DCS, it remains available for future conversion operations
as long as the DCS server is active.
For new objects, the Update Distributor looks in the search.ini file and balances its
distribution of data among the available Index Engines listed there. The Update
Distributor continues making updates in this manner until all partitions are full, at
which point it stops all indexing processes and returns an error indicating that the
partitions are full.
Note: Each Index Engine has a dedicated section in the search.ini file
(delineated with the [IndexEngine_ prefix). This section contains information
about the data flow, partition, ports, and log files associated with the specified
Index Engine process.
When a new object is indexed, the object identifier and the user information in the
object are passed to each Partition in the system via the Index Engines. The Index
Engines then respond indicating that none of them presently contain the object, so
one Index Engine is selected and an Add operation is performed. The object
identifier is a value that uniquely identifies the object in Content Server. For objects
in an Enterprise data source, this is an objID number and version number, and for
objects in a Directory Walker data source, it is an absolute path.
When an update is made to an object that has already been indexed, a similar
process occurs. The object identifier and the user information in the object are passed
to each partition in the system via the Index Engines. The Index Engines then
respond indicating which one of them contains the object, and an Update operation
is performed.
To remove deleted data from the index (compacting it) and reduce the number of
index fragments in the system, the Index Engine periodically merges one, two, or
three index fragments. The new index fragment is similar in size to the sum of all the
input index fragments. Once the merge operation is complete, the new index
fragment and the original input index fragments are stored in the index directory,
which means that at least two times the original disk space requirement must be
available. Later, a cleanup thread deletes the original input index fragments.
The frequency with which merges occur is determined by Content Server's merge
policy. For example, the Index Engine does not perform a merge if the available disk
space (based on the value of the Maximum Content Disk Size setting on the
Specific Properties page of a partition) is smaller than its estimation of required disk
space. The merge policy is composed of a set of user-defined settings and a set of
system-defined rules. Together, they control when and how a merge process is
launched.
User-Defined Settings
The following table describes the settings that you can set to influence the merge and
compaction policy in Content Server:
Setting Description
Maximum Content Disk Size Specifies the total disk space available for use
where the partition's content index is stored.
This setting is also used indirectly by the Update
Distributor when determining to which partition
to send an update. The Update Distributor shuts
down when all the partitions are full.
Merges Enables or disables merge capability in the
system. You can change the value of the Merges
setting to True or False on the Partition Map's
Specific Properties page.
TailMergeMinimumNumberOfSubIndexes Specifies the number of index fragments that
must exist before the index engine runs a
secondary merge thread.
System-Defined Rules
An index engine runs two merge threads: a primary thread and a secondary thread.
The threads wake according to the value of the Merge Attempt Interval setting on
the Partition Map's Specific Properties page (when a merge is not already in
progress), and then apply a two-phase policy of system-defined rules. The first
phase nominates sets of fragments as candidates for the merge. The minimum
number of fragments to merge is one (in which case, compaction occurs) and the
maximum is three. The second phase determines whether or not the merge is
possible and desirable given the current state of the system. The first set of index
fragments that passes both phases is merged. If there are no feasible candidate sets,
the thread goes back to sleep.
The index engine runs the secondary thread if the following conditions are true:
Phase Description
I Candidates are nominated in the following order:
• If the number of index fragments is greater than the value of the
Target Index Number setting (by default, five), the smallest
consecutive grouping of three is nominated, then the smallest
consecutive grouping of two is nominated (according to the default
value of the Index Ratio setting). If both fail, a compaction is
nominated with the next step. For more information about the
Target Index Number and Index Ratio settings, see “User-Defined
Settings” on page 31.
• If an index fragment has not been included in a merge for more
than the number of days specified by the Oldest Index Date
setting (by default, 30), it is nominated. If this candidate fails
during Phase II, a warning message is logged indicating that the
index fragment is too large to compact.
• If a fragment is present from an index migration, it is nominated. If
this candidate fails during Phase II, a warning message is logged,
indicating that the index fragment is too large to compact.
• All consecutive sets of three and two index fragments are
examined and sets are nominated in descending order of combined
fragment directory size – provided that neighboring fragments are
at least a certain percentage of the size of the remaining fragments.
This percentage is determined by the Index Ratio setting (for
example, if this setting is 3, the neighboring fragments must be at
least one-third of the size of the remaining fragments).
If the merge fails, the combined file size of the nominated candidates becomes the
current lockout. If the merge succeeds, the number of successful merges since setting
the lockout is incremented. If five successful merges have occurred since the last
lockout, the lockout is removed. If the lockout remains for more than 24 hours and
there have been no successful merges, the lockout is removed. If the Index Engine
process is restarted, the lockout is removed.
The DefragmentDailyTimes setting allows the administrator to specify the times (in
the system's default time zone at startup) at which to start the defragmentation. The
default setting is DefragmentDailyTimes=2:30 which is 2:30 am local time. Only
the hour (ranging from 0 to 23) and minute (ranging from 0 to 59) may be specified.
Multiple start times can be expressed as a comma-delimited list (no spaces allowed),
for example, DefragmentDailyTimes=0:00,23:59.
Note: If the system's time zone changes, for example from daylight savings
time, the scheduler continues to run based on the time zone at startup. The
next time the Index Engine or Search Engine is restarted, it will use the new
system time zone.
For the Search Engine, these settings can be reloaded through the Admin server, so
they can be rescheduled without having to stop and restart the Search Engine. This
option is currently not available for the Index Engine.
If the reloaded ini settings specify daily defragmentation times, a new daily
defragmentation scheduler daemon is started using the latest set of times.
• When a partition is full, it stops accepting new objects and will only accept
updates to existing objects.
• If the updates exceed the estimated reserved space, then the index will do some
rebalancing to bring the memory usage below 80%, typically down to about 77%.
Note: Rebalancing of partitions that are in No-Add (update only) mode can be
enabled via the AllowRebalancingOfNoAddPartitions setting in the update
distributor section of the search.ini file (this setting is false by default). This
setting should only be enabled after careful consideration as it is applied to all
partitions of the No-Add type and may not be appropriate for some email
archiving installations.
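The following sketch shows how the setting might look in the update distributor section of the search.ini file. The section header shown is an assumption; check your search.ini file for the exact name of that section.

[UpdateDistributor]
AllowRebalancingOfNoAddPartitions=true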
The partition percent full model is described through the following settings which
apply per partition (each partition can have its own values for these settings). The
global defaults for all the partitions are shown below:
• StartRebalancingAtMetadataPercentFull=80
• StopRebalancingAtMetadataPercentFull=77
• StopAddAtMetadataPercentFull=70 (85 is recommended for new CM apps)
• MetadataPercentFullWarnThreshold=65 (80 is recommended for new CM
apps)
• WarnAboutAddPercentFull=true (will show up at the Info Level of logging)
Note: The default settings for percent full and partition rebalancing are
lower than in earlier versions of Content Server, so if an existing partition is already above the new lower limits, that
partition may stop accepting new objects and may start rebalancing itself.
This is a normal part of the maintenance process, but you may need to add
more partitions and change automatic partition creation rules, if they are
defined, to match these new default settings.
You can disable this behavior using an Admin server management page to
manually set these settings higher. In earlier versions of Content Server
there was no Admin server interface for these settings, so to override the
new lower defaults, you must use a search.ini_override file. For more
information, see “Overriding the search.ini file” on page 97.
Based on the default values listed above, when a partition reaches 65% metadata full
(the default for the MetadataPercentFullWarnThreshold threshold), warning
messages will appear by default in the Index Engine logs at the Info Level. To
disable this default logging, set WarnAboutAddPercentFull=false. A less verbose
level of logging, such as Warning Level, will also result in no messages being logged
in the Index Engine logs.
When the metadata percent use of a partition reaches 70% percent full (the default
for the StopAddAtMetadataPercentFull threshold), the addition of new objects to
this specific partition is halted. Updates to objects currently residing in this partition
are still allowed. Warning messages appear in the Index Engine logs at the Warning
Level specifying that the partition is no longer accepting new objects.
If the metadata percent use of a partition reaches 80% full (the default for the
StartRebalancingAtMetadataPercentFull threshold), an update coming in for an
object residing in that partition causes the partition to rebalance. The rebalance
moves the specific object from this partition to a partition that is accepting new
objects to decrease the memory use of that partition. Warning messages appear in
the Index Engine logs at the Guaranteed Level, stating that the partition has entered
a rebalance mode. The partition stays in rebalance mode until the metadata percent
full decreases to the 77% level (the default value for the
StopRebalancingAtMetadataPercentFull threshold). Once the metadata percent
full use decreases to this 77% level, updates coming in to this partition are allowed,
and no longer cause a rebalance. If the metadata percent full once again increases to
80%, a new partition rebalance will be triggered.
Note: New objects are still not accepted into this partition unless the metadata
percent use decreases below the 70% use threshold for
StopAddAtMetadataPercentFull. If no partitions are available to accept new
objects, the Update Distributor will go down with process error 174 or 175.
The administrator should use the log warnings and/or the automatic partition
creation rules of Content Server to ensure that new partitions are created as
required to prevent this.
One of the advantages of this model is that it will effectively place a partition
automatically into a soft update-only mode (no new objects are accepted) as it begins
to fill. Previously, administrators had to closely watch partition sizes and make
explicit decisions about when a partition mode should be changed. This system will
also gracefully and automatically put that partition back into Read-Write mode if
conditions change, for example, large numbers of objects are deleted.
The amount of space to set aside for updates depends on the expected usage. A pure
archiving solution with few anticipated updates might operate best on the provided
defaults. Other systems might require a larger “buffer” for rebalancing the
partitions.
Note: Earlier versions of Content Server did not provide an Admin page to
change these default settings. To change any of the default settings for a
partition, the administrator must use the search.ini_override file. For more
information, see “Overriding the search.ini file” on page 97.
This example of the search.ini_override file overrides the default
settings only for the first partition in the system, which is called firstPartition.
[Partition_firstPartition]
StartRebalancingAtMetadataPercentFull=70
StopRebalancingAtMetadataPercentFull=85
WarnAboutAddPercentFull=true
StopAddAtMetadataPercentFull=75
MetadataPercentFullWarnThreshold=70
In the case of non-stop updating and searching, the Search Engine still serializes the
updates in transaction order and still enforces at least a 10 millisecond (ms) delay
between updates. If MaxSearchesWhileUpdateBlocked=x (for x > 0), the following
occurs:
• an update runs
Technically, the internal ReadWriteGate controls not just searches and updates, but
also other background threads. For example, the merge thread performs both read and write
operations, which count like a search or an update.
Note: You can use the opentext.ini file to ignore the content of specified
MIME types. For example, you may want to exclude audio files by adding the
audio/mpeg=TRUE parameter. Only the metadata (name, creation date, and so
on) is extracted and indexed. For information about the opentext.ini file, see
the OpenText Content Server Admin Online Help - Content Server Administration
(LLESWBA-H-AGD).
The bad object heuristic is based on the fact that bad documents contain a lot of
unique words. If the Index Engines allowed bad objects to be indexed, the index's
word list would grow to enormous proportions and search performance would
rapidly deteriorate.
The bad object heuristic assumes that users are primarily interested in searching
natural language documents. Natural language documents have certain predictable
characteristics and the heuristic prunes objects that do not match those
characteristics. In theory, this may include part lists or directories that could
potentially be valuable; however, in practice, most of the bad objects being detected
include video, audio, binary files, or spreadsheets containing tables of floating point
numbers which pollute the index's word list with many unique words that users
would never use when searching. In fact, statistics indicate that roughly 1% of the
bad objects account for over 60% of the unique words.
The heuristic is based on the observation that natural language documents include
function words (for example, articles, conjunctions, and prepositions) that make up a
significant fraction of all word occurrences. In a typical English document, the 100
most frequent words account for over 50% of all word occurrences. The distribution
is such that a few words occur often and others occur rarely, where the majority of
occurrences are repeats. Using a value of 0.25, the heuristic discards any object in
which the number of unique words is more than 25% of the total number
of word occurrences. For example, a 2000-word object would have to have more than
500 unique words, which is highly unlikely to occur in a natural language
document.
The heuristic also assumes that natural language documents consist mainly of short
words. This is due to the presence of function words, which tend to be short in most
alphabetic languages. For example, the accepted statistic for English is that the
average word size is about 5 characters; for German, the average is 6 characters.
• The unique ratio (that is, the number of unique words in the document versus
the total number of words in the document) is greater than the value of the
Document Word Ratio setting, and the average word length is greater than the
value of the Maximum Average Word Length setting. Using the default values of
these settings, this means that if 10% of the object content consists of unique
words and the average word length is greater than 10 characters, it is considered
a bad object.
• The unique ratio (that is, the number of unique words in the document versus
the total number of words in the document) is greater than or equal to the value
of the Restrictive Document Word Ratio setting. The default value of this setting
is higher than its counterpart, the Document Word Ratio setting.
• The average word length is greater than or equal to the value of the Restrictive
Maximum Average Word Length setting. The default value of this setting is
higher than its counterpart, the Maximum Average Word Length parameter.
Tip: You can disable the bad object heuristic by adjusting the values of the
associated settings (for example, change the value of the Restrictive Document
Word Ratio setting from 0.5 to 1.0). Turning off the heuristic means that all
objects will be indexed, except for objects with excluded MIME types or object
types that are not extracted.
The previous behavior was that an exception would occur in the index engine and as
a result, both the index engine and the update distributor would stop. Now an object
with an invalid field value can be indexed, a note about the invalid field is recorded
in the Index Engine log, and the error count in the OTIndexError field is
incremented.
The integer field OTIndexError is populated with a count of these errors. The field
holds a count of the number of errors that occurred for a given object. This field can
be queried to determine how many objects were indexed with problems in the
metadata.
Modification of the metadata does not reset this value, although re-indexing of the
object will. This functionality has been extended to accept seed values of
OTIndexError coming in through an iPool as well.
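For example, using the same where-clause syntax shown later in this guide for the OTContentStatus field, the following query sketch retrieves every object that was indexed with at least one metadata problem. Because the field holds a count, you could also query for a specific number of errors.

where [region "OTIndexError"] >= "1"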
This feature allows searching for all documents whose content is empty or did not
have their content indexed. Such objects have a content check sum value of 0.
sContentProcessedAggregated_Pass = 200;
sContentProcessedUnknownMIMEType_Pass = 201;
sContentProcessedAlmostBadObject_Pass = 202;
sMostProcessedSomeNonUTF8_Pass = 203;
sMostProcessedSomeUnsupportedCodePage_Pass = 204;
sMostProcessedSomeUnsupportedLang_Pass = 205;
sContentIndexedTruncatedSize_Suspect = 300;
sContentIndexedSomeUnreadableSections_Suspect = 301;
sNotIndexedEncrypted_Bad = 400;
sNotIndexedUnsupportedMIMEType_Bad = 401;
sNotIndexedBadObjectHeuristic_Bad = 402;
sNotIndexedAccumSizeExceeded_Bad = 403;
sNotIndexedInvalidCorrupt_Bad = 404;
sNotIndexedUnknown_Bad = 405;
Codes have been reserved to indicate that an object’s content is OK (codes in the 100
group). However, these codes are not used in the current release of Content Server
to indicate that an object is OK. Rather, a non-empty code is assigned only to objects
that have undergone an event (the 200+ group). Therefore, an example where clause
to retrieve objects that have undergone an event is:
where [region “OTContentStatus”] >= “200”
• Subdirectories
• accumLog file
• checkpoint file
• livelink.# file
• livelink.ctl file
• metaLog file
2.5.1 Subdirectories
A top level index directory contains one subdirectory for each index fragment
associated with the index. An index fragment's subdirectory provides a word list
(with ancillary multi-level word lists), object level postings, offset level postings, and
a skip file. These files are separated into the following components:
• Core files, which contain the index structures for pure alpha words that do not
contain the numbers 0-9, an ampersand (&), a hyphen (-), or a period (.)
• Other files, which contain the index structures for non-alpha words that contain
the numbers 0-9, an ampersand (&), a hyphen (-), or a period (.)
• Region files, which contain the index structures for XML regions
In Figure 2-8, the coreidx1.idx file is the complete UTF-8 word list for the alpha
words. The *.idx{2,3,..} files contain partitions of the coreidx1.idx word list
used for word lookup. The coreobj.dat file contains the object-level postings. Object-
level postings are the IDs of the objects in which the word occurs. The coreoff.dat
file contains the offset-level postings. Offset-level postings are the offsets of
occurrences of the word within an object. The coreskip.idx file contains object-
level skipping information. The region files are indexed in a similar way and the
map file contains a version number, some pointers, and a list of checksums for every
file in the index fragment.
accumLog File
The accumLog file (content accumulator log file) contains the updates that have been
made to the in-memory content accumulator since the last index fragment was
created from an accumulator dump. The contents of the in-memory accumulator can
be recreated by replaying the updates contained in this file. For more information
about the in-memory content accumulator, see “Content Accumulator” on page 11.
checkpoint File
The checkpoint file is an on-disk snapshot of the metadata index. The metadata
index can be recreated by reading the checkpoint file, and then replaying the
updates contained in the metaLog file. For more information about the metaLog file,
see “metaLog File” on page 47.
When the search grid needs to create a checkpoint, all partitions simultaneously
write their data. In some customer scenarios, where there are many partitions, this
can overwhelm the file system capacity and cause thrashing, resulting in
unexpectedly slow checkpoint creation.
This feature is controlled via a setting in the dataflow section of the search.ini file:
ParallelCommit=true or false
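For example, to change the behavior you would add or edit the following line in the dataflow section of the search.ini file (shown here without its enclosing section header, which depends on your data source). Based on the description above, setting the value to false presumably staggers the checkpoint writes instead of committing all partitions in parallel.

ParallelCommit=false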
livelink.# File
The livelink.# file is a configuration file that contains information about the
current state of the index. The numeric extension for the configuration file increases
incrementally each time the information changes inside the file. The configuration
file may contain the following information:
livelink.ctl File
The livelink.ctl file (control file) contains the name of the configuration file that
describes the current state of the index. Each time the index's state is modified, the
numeric extension is increased by one. In the example directory structure shown in
Figure 2-7, the livelink.ctl file contains the string livelink.13.
metaLog File
The metaLog file (metadata log file) contains a chronological list of the updates that
have been made to the metadata index since the last checkpoint, where a snapshot of
the metadata index was captured in the checkpoint file. For information about
checkpoint files, see “checkpoint File” on page 45.
A log file is deemed to be full when its size exceeds a defined limit. When the active
log file is full, it is renamed to include a time/date stamp and given a file extension
that indicates it can be archived, for example,
my_partition_name_20100315199w949.log-2010.03.29-11.57.15.archive.log.
The number of log files that should be retained can also be configured. The .archive.log
files exceeding the maximum number will be deleted based upon age.
Special handling is implemented for the startup log files. The first log file often has
valuable diagnostic information, so it is excluded from the deletion schedule for the
archive log files. There is a separate count of the number of startup log files that
should be retained. Startup logs will be renamed, for example:
my_partition_name_20100315199w949.log_startup-2010.03.29-11.57.15.archive.log
To enable rolling log files, for each process in the search.ini file, set:
CreationStatus=4
LogSizeLimitInMbytes=xxx
MaxLogFiles=yyy
MaxStartupLogFiles=zzz
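For example, the following illustrative values (not defaults) cap each log file at 100 MB, retain the ten most recent archived log files, and retain the two most recent startup log files:

CreationStatus=4
LogSizeLimitInMbytes=100
MaxLogFiles=10
MaxStartupLogFiles=2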
Components of the search grid communicate with each other through data flow
processes, which pass information to one another using iPools.
Understanding the role that iPools play in data flows can help you to more
accurately administer the indexing processes at your Content Server site.
The information in the Interchange Pools section of a Data Flow Manager page
allows you to monitor iPool activity and debug data flow process errors. For
example, if a data flow process shuts down unexpectedly, the Interchange Pools
section allows you to determine the number of iPool messages that were successfully
processed before it shut down and the number of iPool messages that remain in the
queue.
The data flow directory (sometimes called the iPool base) contains a tree of
subdirectories for each iPool in the corresponding data flow. By default, the iPool
subdirectories are named using a sequence of numbers that indicate their position in
the data flow. The first iPool subdirectory name contains _0, the second iPool
subdirectory name contains _1, and so on. For example, the first iPool tree may be
named 2154_0.0, 2154_0.1, or 2154_0.2. The second iPool tree may be named
2154_1.0, 2154_1.1, or 2154_1.2. At their deepest branches, the iPool subdirectory
trees contain the iPool messages that pass through the data flow.
Figure 3-2: Windows Directory Structure of the Admin Online Help Index
The extension on an iPool subdirectory name is an integer value. If you add 2 to this
integer value, the resulting number indicates the level at which the iPool message
files reside in the directory tree. For example, an iPool subdirectory tree named
2154_0.2 indicates that the iPool message files reside four levels deep. The iPool
mechanism automatically appends the last digit to an iPool subdirectory name.
Although you can change the iPool subdirectory names by modifying the command
line for a data flow process, Content Server may not reference the new name
consistently and errors may occur. For this reason, OpenText does not recommend
changing iPool subdirectory names in this way.
Warning
The iPool system fails if any of the directories in the iPool tree are locked. If an
application accesses a directory in a Windows environment, the operating
system automatically locks the directory on the application's behalf. You must
ensure that no other application (for example, Windows Explorer or a DOS
Command Prompt) accesses the directories in the iPool tree when the data flow
runs.
Transaction Directories
All communication between data flow processes and iPools occurs inside
transactions. Transactions are units of interaction with an iPool. For example, data
flow processes execute operations such as reading and writing data inside
transactions. A temporary directory called a transaction directory is created for each
active transaction in a data flow.
Transaction directories reside in an index's data flow directory, at the same level as
the iPool subdirectory trees. Transaction directory names have the following
extensions that indicate the current state of the transaction.
Status Directories
An index's data flow directory also contains a directory named status.prv. The
status.prv directory records the number of iPool message files that have entered
the iPool as well as the number of iPool message files that have left the iPool. It
contains subdirectories that have the same name as the iPool subdirectories in the
data flow directory. Two files reside inside these subdirectories, one with an .in
extension, and one with an .out extension:
• .in – specifies the total number of iPool messages that were written to the
corresponding iPool. This number is always greater than or equal to the number
specified in the .out file, depending on the current state of the data flow.
• .out – specifies the total number of iPool messages that were read from the
corresponding iPool. This number is always less than or equal to the number
specified in the .in file, depending on the current state of the data flow.
The difference between the two numbers specified in the .in and .out files is the
total number of messages that remain in the corresponding iPool, as specified in the
Interchange Pools section on the data flow page.
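For example, assuming purely illustrative counts, if the .in file for the 2154_0 subdirectory of status.prv records 137 messages written and the corresponding .out file records 129 messages read, then 137 - 129 = 8 messages remain in that iPool, and 8 is the value that appears for it in the Interchange Pools section.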
• The iPool library code locates each unlocked transaction log file (.log) and locks
it, preventing other new connections from interfering with this connection.
Transaction log files are not typical log files; they exist specifically as a means to
locate incomplete transactions. A transaction log file is unlocked if it is associated
with an orphaned or outstanding transaction.
• If the iPool library code locates an unlocked transaction log file that is associated
with a transaction that has not finished rolling back, it completes the roll-back
operation. For more information about roll-back operations, see “Rolling Back
Transactions” on page 61 .
• If the iPool library code locates an unlocked transaction log file that is associated
with a transaction that has not finished its commit operation, it completes the
commit operation. For more information about commit operations, see
“Committing Transactions” on page 56 .
• If the iPool library code locates an unlocked transaction log file that is associated
with a transaction that was active when the corresponding data flow process
shut down (.trn extension), it marks the transaction with a roll-back extension
(.rlb), and then completes the roll-back operation.
Note: If a connection fails to complete for some reason, the next connection
operation automatically completes all outstanding tasks.
Once a connection is made, all communication between data flow processes and
iPools occurs inside transactions. Transactions are units of interaction with an iPool
that provide recovery capabilities to a series of read and write operations. The iPool
library processes a single iPool transaction in a coherent and reliable manner,
independent of all other transactions.
The recovery capabilities provided by the transaction mechanism are subject only to
the integrity of the underlying operating system calls. This means that the kernel
calls that the iPool library relies on, the calls that rename directories and the calls
that delete files and directories, must be atomic (that is, they must complete
successfully or not at all).
Starting Transactions
When a data flow process starts a transaction, a temporary transaction directory is
created in the data flow directory. The transaction directory has an arbitrary and
unique name with the .trn extension (for example, data_flow/transaction.trn).
A log file of the same name is also created in the data flow directory (for example,
data_flow/transaction.log). The log file acts as a locking mechanism for the
transaction, ensuring that no two processes access the same transaction at the same
time. It is not used to record the activities that occur in the iPool. The log file is
locked when the transaction starts.
If the process that started the transaction shuts down unexpectedly, the next
connection operation automatically completes all outstanding tasks.
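As an illustration only (transaction directory names are arbitrary; XYNDHFGY is the sample name used later in this chapter), a data flow directory with one active transaction might contain entries such as the following:
data_flow/2154_0.0/ (first iPool subdirectory tree)
data_flow/2154_1.0/ (second iPool subdirectory tree)
data_flow/status.prv/ (status directory)
data_flow/XYNDHFGY.trn/ (active transaction directory)
data_flow/XYNDHFGY.log (lock file for the transaction)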
Writing Transactions
In most data flows, a producer process extracts or assembles data, and then writes
the data to an iPool, making it available to the next process in the data flow. The next
data flow process then reads the data from the iPool, processes it, and writes the
new data to another iPool. This processing chain continues until the data flow
terminates, usually resulting in an index of the data that the producer process
originally extracted or assembled.
The data that a process writes to an iPool is encoded in iPool messages, which enter
the iPool and are processed in a queue. To add iPool messages to an iPool, a data
flow process performs a series of write operations within an active transaction.
All of the write operations that occur within a single transaction are collected inside
of a write directory that resides in the corresponding transaction.trn directory (for
example, data_flow/transaction.trn/write). Each write directory contains a
subdirectory that has the same name as the iPool subdirectory to which the data is
being written. The data flow process opens and writes data inside this subdirectory
(for example, data_flow/transaction.trn/write/2154_1). Because a transaction
can contain write operations for many different iPool subdirectories, its write
directory may contain multiple subdirectories, each representing the corresponding
iPool subdirectory.
Although a single transaction can contain multiple write operations, each of which
may write data to a different iPool subdirectory, a single write operation cannot
write data to multiple iPool subdirectories.
The changes that a write operation makes are not finalized or implemented until the
transaction is committed. For more information, see “Committing Transactions”
on page 56 .
Reading Transactions
After a data flow process writes data to an iPool, the next data flow process in the
chain reads the data from that iPool so that it can process it. This data is stored as
iPool message files in the iPool subdirectory. Data flow processes read these iPool
message files by performing a series of read operations within an active transaction.
All of the read operations that occur within a single transaction are collected inside
of a read directory that resides in the corresponding transaction.trn directory (for
example, data_flow/transaction.trn/read). Each read directory contains a
subdirectory that has the same name as the iPool subdirectory from which the data
is being read. The data flow process opens and reads data inside this subdirectory
(for example, data_flow/transaction.trn/read/2154_0). Because transactions
can contain read operations for many different iPool subdirectories, their read
directories may contain multiple subdirectories, each representing the
corresponding iPool subdirectory.
A read operation moves iPool messages from their location in the iPool subdirectory
to the appropriate location in the read directory. This ensures that the iPool
messages are not available to other processes that may be reading from the same
iPool subdirectory. Data flow processes read iPool message files one at a time. Each
time that the data flow process returns to the iPool subdirectory to read another file,
it deletes any empty directories that exist in the iPool as a result of the previous read
operation.
Although a single transaction can contain multiple read operations, each of which
may read data from a different iPool subdirectory, a single read operation cannot
read data from multiple iPool subdirectories.
Read operations are not finalized or implemented until the transaction is committed.
Files that a data flow process reads are not available to any other processes before a
commit operation occurs. For more information, see “Committing Transactions”
on page 56.
iPool messages are read recursively, starting at the root of the iPool subdirectory.
The state of the iPool determines which file is next in the queue for read operations.
For example, if two files have already been read, the next read operation will
automatically read the third file.
Committing Transactions
A data flow process commits a transaction to finalize the write or read operations
that it has performed within the transaction. A commit operation closes the active
transaction.
During a commit operation for a write transaction, the subdirectory of the write
directory is renamed. For example, the data_flow/transaction.cmt/write/2154_0
directory is renamed to data_flow/2154_0.0/0000000004. This moves all of the
iPool items that were written in the data_flow/transaction.cmt/write/2154_0
directory to the new directory. After the data_flow/transaction.cmt/write/2154_0
directory is renamed, its contents are available to the read operations that are issued
within other transactions.
During a commit operation of a read transaction, all of the iPool items in the
data_flow/transaction.cmt/read/2154_0 directory are deleted. These iPool items are
no longer required because they have already been used for the read operation that
was just finalized.
Both the rename operation and the delete operation are atomic. Renaming the
data_flow/transaction.cmt/write/2154_0 directory in this way ensures that all
of the iPool items in the directory are available at the same time or that none are
available at all. Because a single read operation cannot read data from multiple iPool
subdirectories, separate rename calls are issued for each iPool subdirectory.
The ten-digit directories that reside inside the iPool subdirectory are created when
transactions that write files to this subdirectory are committed. The files that they
contain are the iPool message files that were written to the iPool subdirectory when
the transaction was committed. By default, an iPool subdirectory can contain a
maximum of 50 ten-digit directories that are named incrementally from 0000000000
to 0000000049. Each of these ten-digit directories can contain an unlimited number
of iPool message files.
Note: If a process tries to read data from the iPool subdirectory in the middle
of this procedure, it always looks for the iPool subdirectory that has the highest
numeric value (in this case 2154_1.1, not 2154_1.0). If that subdirectory is
empty (if the original iPool subdirectory has not been renamed yet), the
process can retry later.
Rolling Back Transactions
During a roll back of a read transaction, the subdirectory of the read directory is
renamed. For example, the data_flow/transaction.rlb/read/2154_0 directory is
renamed to data_flow/2154_0.0/XXXXXXXXXX (where XXXXXXXXXX is a ten-digit
directory in the iPool subdirectory). This returns all of the iPool items in the
data_flow/transaction.rlb/read/2154_0 directory to the iPool subdirectory from
which they were originally read. The subdirectory of the read directory is renamed
and returned to the front of the transaction queue in the iPool subdirectory.
During a roll back of a write operation, all of the iPool items in the
data_flow/transaction.rlb/write/2154_0 directory are deleted (if any exist). These
iPool items are no longer required because they may contain modifications that
should not be retained.
Using the scenario illustrated by Figure 3-9, suppose that a data flow process reads
two iPool message files from the 0000000008 directory in the iPool subdirectory but,
for some reason, cannot read the third file. Because the read operation cannot
complete successfully, the entire operation rolls back. This means that the
subdirectory of the read directory is renamed to 0000000007, so that it can be placed
at the front of the queue. The next time that the data flow process tries to read files
from the iPool, it reads the files in the 0000000007 directory first, because it is now at
the front of the queue.
Now suppose that the same problem occurs when reading from the 0000000000
directory in the iPool. The data flow process reads the first two iPool message files
successfully, but cannot read the third file. Again, the subdirectory of the read
directory must be renamed and returned to the front of the queue. But because the
0000000000 directory already exists in this case, the iPool library assigns a very large
number to the directory. In a sense, the numbering system wraps around (think of a
clock, where 12 is earlier than 1), allowing roll back operations to continue
successfully.
Note: Gaps cannot exist between directory names in the iPool subdirectory.
For example, the 0000000000 directory cannot be followed by the 0000000003
directory. The only permissible gap is the one that occurs between the
0000000000 directory and the directory name that is given when the numbering
system wraps around.
Before it closes its iPool connection, the process either commits the transaction or
rolls back the transaction. To finalize its changes, the data flow process commits the
transaction. This renames the data_flow/XYNDHFGY.trn directory to
XYNDHFGY.cmt. Because the rename operation is atomic, it guarantees that the
commit operation completes successfully before the intermediate process can start a
new transaction. During the commit operation, the
data_flow/XYNDHFGY.trn/write/2504_1 directory moves to the back of the 2504_1
queue (in Figure 3-10 this is the data_flow/2504_1.0/0000000002 position in the
queue). If the intermediate
process or the entire system shuts down unexpectedly at this point, the commit
operation completes the next time that a data flow process connects to the iPool.
To cancel its operations, the data flow process rolls back the transaction. This
renames the XYNDHFGY.trn directory to XYNDHFGY.rlb. Because the rename
operation is atomic, it guarantees that the roll back operation completes successfully
before the intermediate process can start a new transaction. If the intermediate
process or the entire system shuts down unexpectedly at this point, the roll back
operation completes the next time that a data flow process connects to the iPool.
You search for indexed information by constructing Queries and submitting them to
a target index. You construct Queries using a query language called the Live Query
Language (LQL); however, Content Server then converts the LQL Queries to
OTSTARTS and OTSQL query languages for internal processing.
Along with the appropriate search result items themselves, Content Server displays
a score for each result. The score indicates the relevance of a particular search result
item in relation to all other items in the result set.
When Content Server converts an LQL query to the OTSTARTS query language, it
wraps the OTSTARTS expressions with the appropriate OTSQL expressions. The
OTSTARTS expressions define the query criteria and the OTSQL expressions define
additional processing information about the query.
The Live Query Language (LQL) consists of the keywords listed in the following
table. For the purposes of this table:
• in lexicographic ordering, 100 is less than 20, because the terms are ordered as
Unicode strings.
• in numeric ordering, 100 is greater than 20, because the terms are converted to
numbers before comparing.
LQL Conventions
The following LQL conventions help define the syntax of a complex Query.
Using Phrases
• A whitespace character (for example, a tab or space) denotes the end of a word.
• Content Server interprets strings that include punctuation but no spaces as one
phrase (for example, the phrase “www.opentext.com”).
• You cannot omit the double quotation marks around reserved keywords (for
example, qlnear) when you want to treat the keywords as search criteria. For
example, to search the Content Server Online Help system for information about
the qlnear keyword, you must specify "qlnear" as your Query and choose the
appropriate Help slice.
Using Wildcards
You can use an asterisk (*) as a placeholder for characters in Queries. For example,
to use an asterisk to represent the end of a word that has many different endings,
type account* to find Content Server items that contain the words account,
accounts, accountant, accounting, and so on.
You can use escape characters to search for certain special characters in your query
syntax. The single backslash (\) is treated as the beginning of an escape sequence for
certain special cases.
Common and important special characters you can search for include:
Other non-special characters are not affected by the backslash, so a query like
\wheat is still mapped to \wheat.
A single backslash at the end of a query term is invalid syntax and will return an
invalid syntax error message from the Search Engine.
By default, Content Server processes Query terms from left to right. Use parentheses
() to override the default order, and have Content Server evaluate the expressions
within parentheses first. For example, type (grow OR farm) AND (wheat OR grain)
to find Content Server items that contain at least one instance of the words grow or
farm and at least one instance of the words wheat or grain.
A regular expression is a set of strings that match certain patterns. To specify these
patterns, the special characters $, ^, ., *, +, ?, [, ], (, ), \, function as operators on all
remaining ordinary characters. An ordinary character is a simple regular expression
that matches a character and nothing else (for example, h matches h and nothing
else). To represent a special character as an ordinary character, precede it by a
backslash (\). For example, \$ matches $.
The following guidelines describe how to use regular expressions most effectively:
• When used together with the QLREGEX keyword in a Content Server search, the
specified regular expression is found anywhere within a word in a Content
Server item. For example, QLREGEX "cat" finds items that contain at least one
word with cat in it, such as cat, category, or concatenate.
• Constraining and narrowing searches saves time, especially in prefix searches.
For example, to find 2014 part numbers that begin with d, a search using QLREGEX
"^d" takes much longer than a search using QLREGEX "^d2014".
• Regular expression searches on prefixes are the most efficient. For any other
regular expression search, every word in the index must be examined to see if it
matches.
• For every word that matches the regular expression, a query is performed, so it is
best to constrain a regular expression search as much as possible.
You can use the following list of operators within regular expressions to describe
sets of strings.
Operator Description
. Matches any single character.
For example, a.2 matches any word containing a three-character string that
begins with a and ends in 2 (such as, ab2z, aa2, or aaa2zzz).
[] Encloses a character set or range. The following rules apply:
• You can intermix ranges and single characters.
• In a character set, ], -, and ^ have special meaning; all other characters
represent themselves only.
• The minus sign (-) is a range operator between two characters.
• The caret (^) can be used only in the first position in a character set.
For example,
[abc] matches a, b, or c.
[a-z] matches any lower case character.
[-$a0-9] matches -, $, a, or any single digit.
[^] Begins a character set that complements the specified character set (it matches
any character except those that are specified). If - or ] follows [^, - or ] is
treated as the first character.
For example,
[^a-z] matches any character, except the letters of the alphabet.
[^]^a-z0-9] matches any character, except ], ^, or alphanumeric characters.
^ Matches the beginning of a word.
For example,
^sp matches any instance of sp at the beginning of a word only (such as special,
but not especially).
* Matches the smallest preceding regular expression zero or more times.
However, the * operator following a regular expression that has ^ as the
beginning of a word is interpreted as the expression with any ending (like a
wildcard (*) in queries).
For example,
ad* matches a, ad, add, and so on.
+ Matches the smallest preceding regular expression when the preceding regular
expression occurs at least once.
For example,
tr[ei]+ matches tre, tri, tree, trie, triie, and so on. It does not match tr.
? Matches the smallest preceding regular expression when the preceding regular
expression occurs zero or one time.
For example,
se[ea]? matches only se, sea, and see.
$ Matches the end of a word.
For example,
the$ matches the characters the when they appear at the end of a word.
| Separates two alternatives. If x and y are regular expressions, then x|y
matches anything that either x or y matches.
For example,
sea|lake matches words containing sea and lake only.
[abc] can also be written as a|b|c
() Groups items (such as alternatives or complex regular expressions) so that you
can combine them with other regular expressions and operators.
For example,
(ro)?(co)+ matches any non-zero number of co strings that is preceded by
nothing or ro (such as, co, coco, rococo, and so on).
Note: The OTSQL keywords (SELECT, FROM, WHERE, ORDEREDBY) are not case-
sensitive.
Most OTSQL queries use only the SELECT and WHERE clauses; however, more
complex queries can use all clauses and embed OTSTARTS expressions as necessary.
SELECT
The SELECT clause contains a comma-separated list of quoted strings which
represent the regions to retrieve from the search result item (for example, the OTMeta
region and OTScore region). It is a required clause.
For more information about regions, see "Configure Index Regions" in the OpenText
Content Server Admin Online Help - Search Administration (LLESWBS-H-AGD).
Syntax
FROM
The FROM clause is an optional OTSQL clause that is not used in Content Server. It
can contain the names of the Content Server slices to query; however, because
Content Server submits OTSQL queries directly to a Search Federator (which
corresponds to a particular index or slice), it is not necessary to specify the slice
name explicitly.
The OR operator is the only valid OTSTARTS operator used in the FROM clause. For
more information about OTSTARTS operators, see “Combined Queries” on page 86 .
Syntax
WHERE
The WHERE clause contains the search term(s) or phrase(s) that Content Server users
specify on the search bar or the Content Server Search page. It is a required clause.
All OTSTARTS query language components can be embedded in the WHERE clause. For
more information about OTSTARTS, see “OTSTARTS Query Language” on page 80.
Syntax
WHERE "search_expression"
ORDEREDBY
The ORDEREDBY clause is an optional OTSQL clause that contains result ranking
criteria that the Search Engine uses when evaluating queries. Content Server always
uses the following syntax:
ORDEREDBY RankingExpression ("search_expression")
Although the following values represent valid ordering criteria for the ORDEREDBY
clause, Content Server users cannot specify these values through the Content Server
interface.
Expression Description
DEFAULT Ranks results according to a default ranking criteria
NOTHING Does not rank results
RELEVANCY Ranks results according to the relevancy ranking that the Content
Server Search Engine supports. This is the default ordering criteria.
RANKINGEXPRESSION (rankingexpression) Ranks results according to the results of
the OTSTARTS ranking expression that you specify
EXISTENCE Ranks results based on the number of distinct query terms that appear
in a document (not the instances of the same query term)
RAWCOUNT Ranks results based on the instances of the query terms that appear in
a document (not the number of distinct query terms)
REGION fieldname Ranks results based on the contents of the region or field named
fieldname. This is only supported for integer, time, and date regions.
Syntax
ORDEREDBY ordering_criteria
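For illustration only, the following assembled query retrieves the OTName and OTScore regions for items that contain both cats and dogs, ranked by relevancy. The region names are examples, and the exact OTSQL that Content Server generates internally may differ:
SELECT "OTName","OTScore" WHERE "cats" and "dogs" ORDEREDBY RELEVANCY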
• Phonetic
• Stem
• Thesaurus
• Right-truncation
• Left-truncation
• Regex
• Range
Phonetic
The Phonetic modifier expands a simple word query to include words that sound
like the specified word. For example, the following query produces documents
containing at least one instance of sail or sale.
phonetic "sail"
Stem
The Stem modifier expands a simple word query to include words that are likely
noun plural or singular forms of the specified word. For example, the following
query produces documents containing at least one instance of tax or taxes.
stem "tax"
The languages supported for the Stem modifier are English, French, German, Italian
and Spanish.
Thesaurus
The Thesaurus modifier expands a simple word query to include words that are
derived from the thesaurus expansion of the specified word. For example, the
following query produces documents containing terms such as stone, rubble, rock, or
flintstone.
thesaurus "stone"
Right-truncation
The Right-truncation modifier expands a simple word query to include terms whose
prefixes match the specified query expression. For example, the following query
produces documents containing terms such as sailing, sails, sailed, sailboat, and sailor.
right-truncation "sail"
Left-truncation
The Left-truncation modifier expands a simple word query to include terms whose
suffixes match the specified query expression. For example, the following query
produces documents containing terms such as sailing, remaining, boating, or banding.
left-truncation "ing"
Regex
The Regex modifier expands a simple word query to include words that are derived
from the regular expression expansion of the specified word. For example, the
following query produces documents containing at least one instance of the, their,
them, then, there, therefore, these, or they.
regex "^[Tt]he"
Range
The Range modifier expands a simple word query to include words in the
lexicographic or numeric range of the specified words or numbers. (If the specified
terms are integers or real numbers, it uses the numeric range. Otherwise, it uses the
lexicographic range.) The ends of a range specification are separated by the tilde
character (~). In most cases, the Range modifier is used to perform date range
queries. For example, the following query produces documents containing numbers
in the numeric range of 20140101 to 20160101.
range "20140101~20160101"
The Range modifier can also be used in other contexts. For example, the following
query produces documents containing words in the lexicographic range between
twelve and twenty.
range "twelve~twenty"
Comparator
Comparators are used to perform queries over lexicographic or numeric ranges. (If
the specified term is an integer or real number, it uses the numeric range. Otherwise,
it uses the lexicographic range.) In most cases, comparators are used to perform date
range queries; however, they can also be used in other contexts.
Comparators can be used with simple word queries, or with compound word
queries, though some comparators only use the first word if given a phrase.
• <
• <=
• >
• >=
• !=
• =
Note: The behavior of comparators depends upon the type definition of the
region. Text string comparisons use a text sort, so that 2000 > 1000000 for
values stored in a text region.
<
The Less Than comparator will match all values which exist, and are less than the
specified term. If a phrase is provided, only the first term in the phrase is used.
For example, the following query produces all documents containing numbers that
are numerically less than 20140101. This is especially useful when searching for
documents that were published before a specific date.
< "20140101"
<=
The Less Than or Equal To comparator will match all values which exist, and are less
than or equal to the specified term. If a phrase is provided, only the first term in the
phrase is used.
>
The Greater Than comparator will match all values which exist, and are greater than
the specified term. If a phrase is provided, only the first term in the phrase is used.
For example, the following query produces all documents containing numbers that
are numerically greater than 20140101. This is especially useful when searching for
documents that were published after a specific date.
> "20140101"
>=
The Greater Than or Equal To comparator will match all values which exist, and are
greater than or equal to the specified term. If a phrase is provided, only the first term
in the phrase is used.
!=
The Not Equal To comparator restricts the result to exclude the specified word or
phrase. Although this comparator is useful for performing date queries, it can also
be used in other word or phrase queries. For example, the following query produces
all documents that do not contain the word Fred.
!= "Fred"
=
The Equals comparator indicates that the specified word or phrase must be included
in the query, as follows:
• For metadata text regions, the specified word or phrase must match the entire
region, or an entire component value of a multivalued region, to be considered a
match.
For example, [region "CityName"] ="York" would match “York” in the
CityName region, but would not match “New York”.
• For content searches, the specified word or phrase may appear anywhere in the
object's content.
For example, ="York" would match an object with City of York in its content.
• Searches are not case-sensitive.
For example, ="York" would match YORK.
• Searches are not sensitive to extraneous whitespace or punctuation, as long as
they do not break the individual search words.
For example, [region "CityName"] ="York" would match a CityName of : York
:.
• Searches of multivalued metadata fields only require one component of the
multivalue to match.
For example, if the input XML of an object included
<CityName>London</CityName><CityName>New York</CityName>, then a search
for [region "CityName"] ="New York" will match the object because “New York”
matches an entire component of the multivalued CityName region, and a search for
[region "CityName"] ="York" would not match.
4.1.4 Region
Region queries are those in which the specified query must exist in the specified
region in order to be considered a match. For example the following query produces
all of the documents containing the phrase Reproductive Habits of the Australian
Cane Toad in the title region.
title "Reproductive Habits of the Australian Cane Toad"
The following query produces all of the documents for which the last-modified date
does not include values in the numeric range from 20140101 to 20150101.
date-last-modified != range "20140101~20150101"
Regions can be nested, which means that the following query can be used to
produce all of the documents containing chapters, that contain sections, that have
headings, which include the phrase Frog (assuming that the database contains those
regions).
[ region "OTDoc" : "Chapter" : "Section" : "Heading" ] "Frog"
The following table describes the valid STARTS fields and the equivalent fields in
OTSTARTS:
Combined Queries
Operators are used to combine queries. There are two kinds of operators supported
in OTSTARTS: Boolean and proximity. There are multiple Boolean operators and one
proximity operator. The proximity operator (prox) has a slightly different syntax
than the Boolean operators.
and
The and operator denotes the case in which both the left and right expressions must
exist in the result. For example the following query produces all documents
containing the phrase cats and the phrase dogs.
"cats" and "dogs"
or
The or operator denotes the case in which the left or right expression must exist in
the result. For example, the following query produces all documents containing the
phrase cats or the phrase dogs and all documents that contain both cats and dogs.
"cats" or "dogs"
and-not
The and-not operator denotes the case in which the left expression must exist but
the right expression must not exist in the result. For example, the following query
produces all of the documents containing the phrase cats and not containing the
phrase dogs.
"cats" and-not "dogs"
xor
The xor operator (also called the exclusive-or operator) denotes the case in which
the left expression or the right expression (but not both expressions) must exist in
the result. For example, the following query produces all of the documents
containing the phrase cats or the phrase dogs but not containing both cats and dogs.
"cats" xor "dogs"
sor
The sor operator (also called the synonym-or operator) denotes the case in which
the left or right expression exist in the result and are synonyms of each other.
Functionally, the or and sor operators generate the same number of matching
documents; however, the relevance ranking of the documents may differ slightly.
For example, the following query produces all of the documents containing the
phrase U.K. or the phrase United Kingdom, or both phrases.
"U.K." sor "United Kingdom"
When presented in a summary list, these results may be ranked differently than the
results of the same query using the or operator.
prox
The prox operator must take at least one parameter, the maximum allowable
distance between the expressions.
Parameters
distance Denotes the maximum number of words that can exist between two phrases if
they are considered in proximity to each other. The distance parameter must
be a positive integer.
order Denotes whether the order of the phrases is important. The order parameter
can have values of T or t to denote that order is important and F or f to denote
that order is not important. If the order parameter is omitted, order is not
considered important.
For example, the following query produces all of the documents in which the phrase
dogs precedes or follows the phrase cats by no more than ten words.
"cats" prox[10] "dogs"
The following query produces all of the documents in which the phrase dogs follows
the phrase cats by no more than ten words.
"cats" prox[10,T] "dogs"
Precedence can also be enforced using parentheses. For example, the following
query produces all of the documents containing the phrase cats followed (by no
more than ten words) by the phrase dogs and the phrase raining.
"raining" and ( "cats" prox[10,T] "dogs" )
In Content Server, a Query has a weight and each Advanced Criteria has a weight.
These weights determine how important each component of the overall score is
during the score calculation. The Query weight is the weight given to the query
score in the calculation of the overall search result score. It can be specified in the
Query Weight field on the Ranking Properties page of a Search Manager. For more
information about how Advanced criteria are weighted, see “Calculating Advanced
Criteria Scores” on page 90.
The calculation of an item's search result score is based on the weighted average of its
query score and its advanced criteria scores.
Internally, search result scores have a range from 0 to 1. For display on the Search
Results page, the internal scores are multiplied by 100 and listed as percentage
values.
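As a purely illustrative sketch of this weighted average (the exact combination performed by the Search Engine is not detailed here), suppose that the Query weight is 100 with a query score of 0.8, and that a single advanced criterion has a weight of 50 and a score of 0.5. The combined internal score would be (100 x 0.8 + 50 x 0.5) / (100 + 50) = 0.70, which is displayed on the Search Results page as 70%.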
Query scores are calculated by the Content Server Search Engines, based on the
following main criteria:
Criteria Description
The number of times that a search term occurs in an item – An item with three
occurrences of a search term is considered more relevant than an item with two
occurrences of the same term.
The length of an item – An item that is shorter in length is considered more relevant
than a longer item. Document length only affects the contribution from the content
field, not metadata fields.
The number of times that the search term(s) appear throughout the entire Content
Server database – A search term that occurs in only 10% of all documents in the
database is considered more relevant than a search term that occurs in every
document. This means that if two items have the same length and the same number
of matches for two different search terms (that is, term 1 occurs N times in document
1 and term 2 occurs N times in document 2), their scores differ if term 1 is much
more common than term 2 throughout the entire database.
The co-occurrence of search terms – If a Query contains two (or more) search terms,
items containing both terms are considered more relevant than items containing
only one term. Suppose that you run a Query for dog and cat and Content Server
returns two results: document A, which contains 10 instances of dog and document
B, which contains 5 instances of dog and 5 instances of cat. Assuming that the
documents are equal in length, and dog and cat occur with similar frequency
throughout the database, document B will receive a higher score than document A.
Although document A contains more instances of dog, document B contains
instances of both dog and cat in the same document.
If the search criteria consists of multiple phrases (with OR operators), Content Server
calculates scores for each phrase separately and then combines the scores to
determine the final display value. This increases the scores of those Content Server
items that contain several of the specified phrases.
• Field Rank
• Date Rank
• Type Rank
• Object Rank
Advanced Rankings
The data flow section of the search.ini configuration file contains Advanced
Ranking settings used in search. To optimize search, new settings have been added,
which are now the default settings for both Enterprise and non-Enterprise data
sources. These settings are listed below:
• ExpressionWeight=100
• ObjectRankRanker="OTObjectScore",50
• ExtraWeightFieldRankers="OTName",200;"OTDComment",50
• DateFieldRankers="OTModifyDate",45,2
• TypeFieldRankers="OTFilterMIMEType",2:"application/pdf",100:"application/msword",100:"application/vnd.ms-excel",30:"text/plain",75:"text/html",75;"OTSubType",2:"144",100:"0",100:"202",100:"140",200
Note: These new settings are present in the search.ini file. If you are
migrating from an older version of Content Server, the older settings will be
preserved, but a conversion button will be available on the Search Manager
Ranking tab for administrators who wish to implement the new recommended
settings.
Field Rank
The Field Rank criteria is based on an item's metadata regions. Metadata regions
store information (metadata) about indexed items (for example, the OTSummary
region stores an item's summary, the OTLocation region stores the item's location,
and the OTCreatedby region stores the name of the user who originally created the
item). When you configure the Field Rank, you specify which metadata regions are
most relevant. If a term in a Query matches the metadata in a region that you
specify, the corresponding item is given a higher advanced criteria score. For
example, you can use the Field Rank criteria to give emphasis to items that match
the specified query in the OTName or OTSummary regions. For more information
about index regions, see “Configuring Index Regions” in the OpenText Content Server
Admin Online Help - Search Administration (LLESWBS-H-AGD).
The score for the Field Rank criteria is determined by calculating a query score as if
the search terms were being sought in the region name or names that you specify.
This score is then combined with the other scores in the weighted average to come
up with an overall search result score.
You specify field rank criteria (that is, region names and their corresponding
weights) with the Field Rank setting on a Search Manager's Ranking Properties
page. For more information about setting Field Rank Criteria, see “Configuring
Advanced Ranking” in the OpenText Content Server Admin Online Help - Search
Administration (LLESWBS-H-AGD).
Date Rank
The Date Rank criteria is based on an item's dated metadata regions. Dated metadata
regions store dates associated with indexed items (for example, the OTModifyDate
region stores the date on which the item was last modified). When you configure the
Date Rank, you specify which dated metadata regions are most relevant. If a date in
a Query matches the date in a dated region that you specify, the corresponding item
is given a higher advanced criteria score.
The score for the Date Rank criteria is determined by looking up the date stored in
the specified dated region, subtracting today's date to find the difference in days,
and then applying a distribution to the difference. This score is then combined with
the other scores in the weighted average to come up with an overall search result
score.
You specify date rank criteria (that is, region names, the days of interest, and
weights) with the Date Rank setting on a Search Manager's Ranking Properties
page. For more information about setting Date Rank Criteria, see “Configuring
Advanced Ranking” in the OpenText Content Server Admin Online Help - Search
Administration (LLESWBS-H-AGD).
Suppose that the Date Rank field was set to the following value:
"OTModifyDate",45,2
Then D=45. If an item was modified five days ago, the score would be 0.9 (45/(5+45)).
If an item was modified 45 days ago, the score would be 0.5 (45/(45+45)). If an item
was modified 135 days ago, the score would be 0.25 (45/(135+45)).
Note: The 2 in the example is the relative weight compared to the other
advanced criteria.
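The sample scores above follow the pattern D / (days + D), where D is the days-of-interest value (45 in this example) and days is the difference, in days, between today's date and the date stored in the specified region. This formula is inferred from the sample values shown here rather than stated as a general guarantee.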
Type Rank
The Type Rank criteria is used to give emphasis to certain subtypes or MIME types.
When you configure the Type Rank, you specify which metadata regions of type
Enum are most relevant. For example, you can use the Type Rank criteria to boost
the scores of any Documents or Discussions if you consider those items to be more
relevant to most searches. You can then boost the scores of Documents in PDF
format further.
The score for the Type Rank criteria is determined independently of a Query by
looking up the value of an item's Enum metadata region, and comparing it to the list
of corresponding values specified in the Type Rank criteria. If an item's value
matches any of the values specified in the criteria, it is assigned the corresponding
Type Rank score. If there is no match, the item is assigned a score of zero. This score
is then combined with the other scores in the weighted average to come up with an
overall search result score.
Tip: The Type Rank criteria applies to any field of type Enum (see the
LLFieldDefinitions.txt file in the config directory of your Content Server
installation for some examples).
Suppose that the Type Rank setting contains the following value:
"OTFilterMIMEType",2:"application/msword",50:"application/msexcel",30
Note: The 2 in the example is the relative weight compared to the other
advanced criteria.
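With this hypothetical setting, an item whose OTFilterMIMEType region contains application/msword is assigned the Type Rank score that corresponds to the value 50, an application/msexcel item is assigned the score that corresponds to 30, and an item with any other MIME type receives a Type Rank score of zero before the weighted average is computed.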
Object Rank
The Object Rank criteria is based on the relative value of items to Content Server
users. Relative value is measured by tracking the usage of items at the Content
Server site and makes the following assumptions:
• Items that are downloaded more frequently than others are considered more
valuable.
• Items that are modified more frequently than others are considered more
valuable.
• Items that have many aliases are considered more valuable.
• Items that appear on the Favorites page of many users are considered more
valuable.
• Items that reside in a highly valued parent container are considered more
valuable.
For more information about these parameters, see the “opentext.ini File Reference”
in the OpenText Content Server Admin Online Help - Content Server Administration
(LLESWBA-H-AGD).
Because this advanced criteria score is based on an item's usage in Content Server,
the value is changing constantly. Although it is not efficient to send updates of the
corresponding score to the Index Engine each time an item is downloaded or
modified, the Object Rank scores become meaningless if they get stale. To
accommodate this tradeoff, Content Server allows you to schedule the frequency of
Object Rank calculation. Scheduling the Object Rank calculation allows only large
changes in the calculation to trigger an index update, and minimizes the amount of
information sent to the index with each update. You enable and schedule the Object
Rank calculation on the Configure Search Options page in Content Server. For more
information, see “Configuring Search Options” in the OpenText Content Server Admin
Online Help - Search Administration (LLESWBS-H-AGD).
The object rank score is calculated in the Content Server server by the relevance
thread. There must be only one relevance thread running at each Content Server site.
This means that in a clustered environment, only one Content Server instance
should run the relevance thread. The relevance thread is responsible for:
• Calculating the object rank scores for all Content Server items
• Comparing the scores to the previous scores (if this is not the first time the thread
has run)
• Updating the DTreeNotifyTable with the Node ID of the items whose scores
have changed. These updates are handled by the Content Server Extractor
process in the same way that any other update is handled. For more information,
see “Updating an Index” on page 30 .
Note: When the relevance thread runs for the first time, all of the Favorites lists
are scanned. Subsequently, changes to the Favorites lists are tracked as they are
made.
For large Content Server repositories, the object rank calculation may be quite
taxing. To mitigate this, you can configure the relevance thread to run on a separate
Content Server instance within a cluster. Although this removes the load from the
front-end Content Server instance, system performance may still be impacted (due
to competition for resources on the RDB machine). If the system performance hit is
too severe, OpenText recommends disabling the object rank criteria entirely. Then,
object rank scores will not be included in the calculation of the overall search result
score.
You access the region search options via the Regions tab of a Search Manager. On
this page, all of the indexed regions are displayed and check boxes are provided to
allow you to make each region queryable, displayable and searchable by default,
sortable, and used with Search Filters. A standard set of regions are selected for you.
The Search by Default check box allows you to select exactly which regions are to
be searched. Doing so allows for more precise searching and for greater relevance of
search results, since hits in perhaps unforeseen metadata regions should no longer
occur.
The Filter check box allows your users to use Search Filters with regions on the
Search Results page. For more information, see OpenText Content Server Admin
Online Help - Search Administration (LLESWBS-H-AGD).
MaximumNumberOfValuesPerFacet parameter
• Description:
The maximum number of facet values that are counted for the region. This
parameter determines the threshold at which an indicator is displayed in the
Content Server user interface, on the search filter region title bar, to show that the
facet counts for the Search Results set are incomplete.
• Values:
A positive integer. The default value is 32767.
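For example, assuming that the parameter is set in the same name=value form as other search.ini settings, an entry that lowers the threshold to 1000 facet values (a purely illustrative value) would look like this:
MaximumNumberOfValuesPerFacet=1000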
Note: Although some of these settings apply to both the Index Engines and
Search Engines, they are only reloadable without engine restart for the Search
Engines.
• HitLocationRestrictionFields
• SystemDefaultSortLanguage
• DefaultMetadataAttributeFieldNames="OTName","OTDComment"
• DefaultMetadataAttributeFieldNames_OTDComment="lang","gr";"orig","unknown";"color","pink"
• DefaultMetadataAttributeFieldNames_OTName="lang","gr";"orig","unknown";"color","pink"
• DefragmentMemoryOptions
• DefragmentSpaceInMBytes
• DefragmentDailyTimes
This configuration file is a shadow version of the search.ini file. If the
search.ini_override file exists, then any values in the search.ini_override file take
precedence over their corresponding values in the main search.ini configuration
file.
OpenText recommends that you create only a small number of entries in the
search.ini_override file.
The shadow file must be named search.ini_override and it must reside in the
same directory as the search.ini file. The search.ini_override file also has the
same format, with sections and name/value pairs as the search.ini file. Values in
the override file take precedence over the ini file.
The ini file reading functionality restricts how white space is handled, so Name/Value
entries should appear as in the example below, with no extra white space
anywhere:
name=value<CR>
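For example, a minimal search.ini_override file might contain a single entry such as the following, where the section name is a placeholder that must match the corresponding section in your search.ini file and the value is purely illustrative:
[YourSearchIniSection]
DefragmentSpaceInMBytes=512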
Note: Content Server ensures that standard configuration files are replicated
among partitions as necessary. However, if administrators use shadow
configuration files, then they are responsible for replicating these files among
partitions.
Using a shadow configuration file should not be required in most Content Server
installations in normal operation. This feature might be used to convert regions to
the Retrieve-Only mode.
shadow version of 97
SELECT, OTSQL query language 78
servers 27
settings, reloadable 96
severity codes 43
shadow version of search.ini file 97
simple phrase with modifiers
  (see also modifier)
sor operator 87
Soundex algorithm 73
space consumed by log files 47
special characters 76
status directory 52
Stem modifier 81
storage
  external 18
  Retrieve-Only mode 11
  temporary areas 49
syntax, invalid 75
T
table, DTreeNotify 16, 93
temporary
  file storage 49
  instance of DCS server 27
  transaction directory 52
Thesaurus modifier 81
transaction directory 52
transaction mechanism
  committing 56
  reading 55
  starting 54
  writing 54
transaction.cmt 56
transaction.log 54
transaction.rlb 61
transaction.trn 54
type rank 92
U
undo transactions 61
Universal Naming Conventions (UNC) 50
update blocking 39
Update Distributor 7, 9
updating
  creating partition prematurely 35
  index 30
  metadata 35
V
View As Web Page command 29
W
WHERE, OTSQL query language 79
wildcard character 75
Windows Directory Structure of the Admin Online Help Index 50
workers 29
write operations
  committing 56
  rolling back 61
writing transactions 54
X
xor operator 87