OpenText Content Server

Advanced Indexing and Searching Guide

This guide is intended for administrators seeking to optimize the effectiveness of
their Content Server searching and indexing system. This guide assumes that you
are a high-level Administrator who has read the Content Server Admin Online Help
and performed advanced administrative tasks at a Content Server site.

OpenText Content Server
Advanced Indexing and Searching Guide
LLESSRC210400-GGD-EN-01
Rev.: 2021-Aug-27
This documentation has been created for OpenText Content Server CE 21.4.
It is also valid for subsequent software releases unless OpenText has made newer documentation available with the product,
on an OpenText website, or by any other means.

Open Text Corporation

275 Frank Tompa Drive, Waterloo, Ontario, Canada, N2L 0A1

Tel: +1-519-888-7111
Toll Free Canada/USA: 1-800-499-6544 International: +800-4996-5440
Fax: +1-519-888-0677
Support: https://support.opentext.com
For more information, visit https://www.opentext.com

Copyright © 2021 Open Text. All Rights Reserved.


Trademarks owned by Open Text.

One or more patents may cover this product. For more information, please visit https://www.opentext.com/patents.

Disclaimer

No Warranties and Limitation of Liability

Every effort has been made to ensure the accuracy of the features and techniques presented in this publication. However,
Open Text Corporation and its affiliates accept no responsibility and offer no warranty whether expressed or implied, for the
accuracy of this publication.
Table of Contents
1 Overview ..................................................................... 5
1.1 Understanding the Search Infrastructure ............................................. 5
1.2 Understanding the Search Grid .......................................................... 8

2 Administering Indexing ................................................... 15
2.1 Working with Data Flow Producer Processes .................................... 15
2.2 Understanding Document Conversion .............................................. 26
2.3 Updating an Index ........................................................................... 30
2.4 Avoiding Index Pollution .................................................................. 40
2.5 Index Directory Structure ................................................................. 43
2.6 Managing Rotating Log Files ............................................................ 47

3 Understanding How Data Flow Processes and Search Grid Processes
Communicate ..................................................................... 49
3.1 Working With Data Interchange Pools (iPools) .................................. 49

4 Administering Searching ................................................. 67
4.1 Understanding Query Languages ..................................................... 67
4.2 Understanding Relevance Ranking .................................................. 88
4.3 Working with Regions ...................................................................... 95
4.4 Reloading Settings .......................................................................... 96
4.5 Overriding the search.ini file ............................................................. 97

Index ............................................................................. 99



Chapter 1
Overview

In order to use the data stored in Content Server, you must be able to find it quickly
and easily. For this reason, creating indexes and maintaining their integrity are two
of the most important tasks that Content Server Administrators perform.

Content Server Administrators create indexes by designing data flows that extract
and process the data they want to index. As the size of a Content Server repository
increases, Administrators must be able to optimize Content Server's indexing and
search functionality to accommodate the increasing demands being made on the
system. To do so requires a thorough understanding of the architecture of Content
Server's indexing and searching systems.

The Content Server indexing and searching system was rearchitected to provide
better search engine performance, the ability to scale to large datasets by sharing
indexing tasks over multiple indexing processes, and more flexible and configurable
search result ranking.

Notes

• This guide assumes that you are a high-level Content Server Administrator
who has read the Content Server Administration - Admin Online Help and
performed advanced administrative tasks at a Content Server site.
• OpenText recommends that you do not use a production environment to
experiment with the index and search operations described in this guide.
Instead, experiment with a test Content Server system and roll out the
changes to a production system when appropriate.

This chapter covers the following:

• “Understanding the Search Infrastructure” on page 5
• “Understanding the Search Grid” on page 8

1.1 Understanding the Search Infrastructure


When all of the indexing and searching processes and components for a given data
source are considered together, they can be visually represented as a search
infrastructure. Figure 1-1 is an example of a basic search infrastructure. It also
illustrates a one-partition search grid that contains an Update Distributor, an Index
Engine, an index, a Search Federator, and a Search Engine.


Figure 1-1: A Search Infrastructure

All of the processes in the search infrastructure (that is, the data flow processes,
Update Distributor, Index Engines, Search Federators, and Search Engines) are
managed by a Content Server Admin server. Although OpenText recommends that
you run the Index Engine and Search Engine processes associated with a partition
on the same computer, other indexing and searching processes can (and in large
installations, should) run on separate computers. Index Engines and Search Engines
communicate with each other through shared files stored in the Partition's index
directory.

This section describes the following search infrastructure components:

• “Producer Process” on page 7
• “DCS” on page 7
• “Update Distributor” on page 7
• “Search Managers” on page 7


1.1.1 Producer Process


Producer processes are data flow processes that locate or extract data (for example,
the Enterprise Extractor, Directory Walker, or XML Activator process). Producer
processes are usually followed by the Document Conversion Process, which
converts data from its native format to text, and then by the Update Distributor,
which passes data to the Index Engines for indexing.

1.1.2 DCS
The Document Conversion Service (DCS) is a process that converts documents from
their native formats to text so that they can be indexed. Content Server Admin
servers manage DCSs in Content Server.

The DCS infrastructure has an API that allows various filter packs to be installed
and multiple filter packs to work together. By default, Content Server ships with
support for the OpenText Document Filters (OTDF). OpenText Document Filters can
convert many file formats to text, which allows them to be indexed and summarized.
For more information about the DCS, see “Understanding Document Conversion”
on page 26.

1.1.3 Update Distributor


An Update Distributor process reads data from an iPool that was output by the
Document Conversion Service, and then distributes the data among Index Engines
in their respective partitions. This balances the load of data to be indexed among the
partitions. For more information about the Update Distributor, see the Update
Distributor section in “Understanding the Search Grid” on page 8.

1.1.4 Search Managers


A Search Manager is a container for Search Federator processes. It receives a user's
search request (in the form of a Query) and then gives the request to a Search
Federator to handle. A Search Manager also receives the final result set from a
Search Federator and passes it to Content Server to be displayed to users on Search
Results pages. Each data source that you create has a single Search Manager. Unlike
the Search Federators and Search Engines, the Search Manager is a simple Content
Server object, not a Java process; therefore, it is not managed by an Admin server.


1.2 Understanding the Search Grid


The search grid is a system within the search infrastructure that allows scalability in
two ways: by adding partitions or by adding Search Federators. Adding partitions
allows for scaling to accommodate increased database size. Adding Search
Federators allows for supporting non-stop search capability without a single point of
failure (search process and machine redundancy, assuming disk redundancy for the
on-disk index).

This section includes information on the following search grid components:

• “Partitions” on page 9
• “Update Distributor” on page 9
• “Index Engine” on page 10
• “Search Federator” on page 12
• “Search Engine” on page 12

Figure 1-2: Search Grid


1.2.1 Partitions
A Partition is a logical portion of an entire index of data. Large datasets can be
indexed by more than one process and the processes can reside on different
computers. The distribution of indexing is achieved by partitioning or dividing your
dataset across all of the processes responsible for building indexes. Each Partition
contains a distinct portion of the complete dataset.

A Partition Map is a representation of a data source's search grid. Viewing a
Partition Map allows you to see all of the components in a search grid (for example,
Partitions, Index Engines, and Search Engines), therefore making it an effective way
to administer the system. For example, you can monitor the status of each
component simultaneously, which helps you to maintain the performance of
indexing and searching at your site.

For more information about partition maps, see OpenText Content Server Admin
Online Help - Search Administration (LLESWBS-H-AGD).

Adding Partitions
Because the architecture of the search grid allows indexing tasks to be shared by
multiple indexing processes across Partitions, Content Server Administrators have
more flexibility when scaling for extremely large data sets. The search grid allows
large datasets to be indexed by more than one Index Engine process and the Index
Engine processes may reside on different machines. In this way, the search grid
architecture supports massive parallelism during indexing, and allows the index to
grow larger by adding more processes or machines to do the work.

For specific information about how to add Partitions, see OpenText Content Server
Admin Online Help - Search Administration (LLESWBS-H-AGD).

1.2.2 Update Distributor


An Update Distributor is a process that balances out index updates across Partitions.
This is done by spreading the updates in parallel across all updateable Partitions
that are not full.

Index updates occur when items are added to the data source and when existing
items are modified or deleted. Although each indexed item is owned by only one of
the Partitions that comprise the index, the Update Distributor passes user
information about the item to all Partitions to quickly update all references to the
user.

When an update needs to be made to an item, the Update Distributor sends the user
information and OTObject portion of the update to the Index Engines. If any Index
Engine contains the item, it responds to the Update Distributor and receives the rest
of the data associated with the update. If no Index Engine contains the item, the
Update Distributor chooses an Index Engine to which it will send the remaining
data.


Note: Search Engines may not use the most recent updates immediately. This
is because updates are asynchronous due to the number of processing steps in
a data flow.

1.2.3 Index Engine


An Index Engine is a process that indexes data for the Partition to which it belongs
so that Search Engines can search the data. Each Partition can have only one Index
Engine.

A partition's index has three components:

• A metadata index
• A content index
• A content accumulator

Metadata Index
The metadata index is an index of the metadata associated with the items in the
partition. It usually operates in RAM, ideally without being swapped to disk, and is
associated with the metaLog file and the checkpoint file in the index directory (that
is, Content Server home/index/data_source/index). The metaLog file is a chronological
listing of the changes made to the metadata index since the last checkpoint. The
checkpoint file is an on-disk snapshot of the metadata index at some point in time.

When an update is made to the metadata index, it is appended to the metaLog file,
which is later read by the dependent Search Engines so that they can update their
own memory-based copy of the metadata index. At certain points (called
checkpoints), the entire metadata index is committed to disk and captured in the
checkpoint file. At the same time, a new metaLog file is created. Each time the Index
Engine and dependent Search Engines start, they load the data from the checkpoint
file and associated metaLog file to produce the most recent metadata for indexing
and searching purposes.

Notes

• The size of the metadata index is limited by the Maximum Metadata Memory
Size setting on the Specific Properties page of a partition. It is highly
recommended that you do not modify this value without first consulting
OpenText Customer Support to best follow recommended sizing guidelines.
• The amount of memory used by metadata can be reduced by moving metadata
region values to disk. An administrator can actively choose which regions are
set to disk according to the system's needs. By aggressively tuning the metadata
regions, a memory savings of up to 30% can be realized. This new functionality
is enabled via the Metadata Memory Settings tab of the Partition Map of the
Enterprise Data Source Folder.


A new storage mechanism for text regions has been implemented, the Retrieve-Only
mode. In this mode, the metadata region values are stored on disk, and the in-
memory index is not created; the region data is not searchable, but it may be
retrieved. This option allows further tuning of the Search Engine to reduce the
memory requirements.

Within Content Server, by choosing Retrieve-Only for Hot Phrases and Summaries
you may save an additional 30% of memory. These two data types are derived from
content already indexed, so there is no loss of searchable data.

The Retrieve-Only mode is supported through the Partition Map page, which
supports conversions between RAM and DISK modes for text fields. For details, see
OpenText Content Server Admin Online Help - Search Administration (LLESWBS-H-
AGD).

Note: Efficient conversion between all three field modes is supported. As with
conversions between RAM <-> DISK field modes, conversion to and from
DISK_RET mode requires a restart of the Index and Search Engines.

Content Index
The content index is a collection of index fragments, each stored on disk in its own
subdirectory of the index directory. These fragments are produced either by the
merging of other index fragments or by the in-memory content accumulator when it
has reached the limit set by the Accumulator Memory Size setting on the Specific
Properties page of a partition.

Note: The total size of the content index is limited by the Maximum Content
Disk Size setting on the Specific Properties page of a partition.

Content Accumulator
The content accumulator controls the number of index fragments that are produced,
thereby influencing the number of merges required. Its operation is governed by the
Accumulator Memory Size setting on the Specific Properties page of a partition.
When the value of this setting is reached, the content accumulator dumps itself to
produce another on-disk index fragment. The content accumulator operates in RAM
without being swapped to disk, and it is associated with the accumLog file in the
index directory (that is, Content Server home/index/data_source/index). As with the
metadata index, updates are appended to the accumLog file that is monitored by the
dependent Search Engines. At certain points, the accumulated content is committed
to disk as a new index fragment and a new accumLog file is created.
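
As a rough sketch, the files described above live together in the partition's index
directory. Only the checkpoint, metaLog, and accumLog names are given in this
guide; the fragment directory names below are hypothetical:

    Content Server home/index/data_source/index/
        checkpoint        - on-disk snapshot of the metadata index
        metaLog           - metadata changes since the last checkpoint
        accumLog          - accumulated content updates read by Search Engines
        fragment_0001/    - on-disk content index fragment (hypothetical name)
        fragment_0002/    - produced by an accumulator dump or a merge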

Note: When either of the limits specified by the Maximum Metadata Memory
Size or Maximum Content Disk Size settings are reached, the Update
Distributor stops adding new objects to that Partition. If all Partitions reach
their size limits, the Update Distributor stops and reports that all partitions are
full by reporting a process error. In addition to the process error, a default
control rule will automatically send the Content Server administrators an e-
mail message when the partitions are approaching their size limits.


You can then create a new partition manually or set up a control rule that
creates one for you when partitions have reached a certain capacity.

If an object generates too much information to fit in the specified accumulator size,
an error code is set in the OTContentStatus field. This normally would not happen,
however, because content is truncated to 10 megabytes by default (as controlled by
the ContentTruncSizeInMBytes setting).
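
As a minimal sketch, raising this truncation limit might look like the following. The
placement is an assumption; this guide does not state which configuration file holds
the setting, although the other index tuning parameters described later live in the
search.ini file:

    ; Hypothetical placement - consult OpenText before changing this value.
    ; Truncate extracted content at 20 MB instead of the 10 MB default.
    ContentTruncSizeInMBytes=20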

1.2.4 Search Federator


A Search Federator is a process that manages one or more Search Engines and
distributes search requests to each of them. When the Search Engines return their
results, a Search Federator is also responsible for merging the results and returning
the final result set to the Search Manager. Each data source that you create can have
many Search Federators, all managed by a single Search Manager. Also, each Search
Federator must have exactly one Search Engine for each Partition, and Search
Federators cannot share Search Engines.

Adding Search Federators


A complete index is the union of all the partitions. To ensure that all of the data in an
index is being searched, Content Server must run every Query on each Partition in
the search grid. To do this, the Search Manager sends the Query to a Search
Federator process. The Search Federator sends the Query to a Search Engine for each
Partition, collects the results, and then sends them back to the Search Manager. You
add Search Federators and their associated Search Engines to a search grid for
redundancy (to provide an alternative Search Federator in case of failure). If you
have multiple Search Federators, the Search Manager chooses one at random when
issuing a Query. This approach has proven to give fairly equal distribution over time
and is less costly than more complex attempts at determining performance and
availability.

Each Search Federator has one Search Engine per Partition. For more information
about how to add Search Federators to a search grid, see OpenText Content Server
Admin Online Help - Search Administration (LLESWBS-H-AGD).

1.2.5 Search Engine


A Search Engine is a process that searches a Partition's index. It produces search
results based on the data it locates, and then returns the results to the Search
Federator. Each Partition must have at least one Search Engine associated with it for
each Search Federator.

Each Search Engine knows the file system location of the index built by a particular
Index Engine process. The Search Engine uses the files created by the Index Engine
process to maintain its own searchable index consisting of an in-memory metadata
component, an on-disk content component, and an in-memory content accumulator
component. This three-part model is analogous to the Index Engine's three part
model; however, the Search Engine has searching (rather than updating)
responsibilities. For information about the three part index model maintained by the
Index Engine, see “Index Engine” on page 10.



Chapter 2
Administering Indexing

Before you can search, you must index the data that you want to be searchable.
Content Server's indexing system allows you to index data from a variety of sources
including the internal Content Server repository, directories on a file system, and
external Web sites.

This chapter covers the following topics:

• “Working with Data Flow Producer Processes” on page 15
• “Understanding Document Conversion” on page 26
• “Updating an Index” on page 30
• “Avoiding Index Pollution” on page 40
• “Index Directory Structure” on page 43

2.1 Working with Data Flow Producer Processes


A producer process is a data flow process that is responsible for extracting the data
that you want to index at a particular data source. Different producer processes are
used to extract data at different types of data sources. For example, the Content
Server Extractor process extracts data within a Content Server repository, a
Directory Walker process extracts data stored on a file system, and an XML
Activator producer process extracts XML data stored in a specified directory on a
file system. Like all other data flow processes, producer processes are managed by a
Content Server Admin server.

This section covers the following topics:

• “Content Server Extractor” on page 16
• “Directory Walker” on page 19
• “XML Activator” on page 20


2.1.1 Content Server Extractor


The most critical data flow producer process is the Content Server Extractor process.
This process is responsible for locating the data that will be indexed at a Content
Server site (that is, the Enterprise data source). Without at least one Content Server
Extractor process operating in an Enterprise data flow, the items in your Content
Server repository will not be indexed and, therefore, will not be searchable.

Note: The Content Server Extractor process corresponds to the INDEXUPDATE
binary file that is distributed with Content Server.

Only one process is normally required to extract data from the Content Server
database, so OpenText recommends that each Content Server system have only one
Enterprise Extractor process, unless OpenText Global Services or Customer Support
has advised adding multiple Enterprise Extractors as part of a strategy of high-
volume indexing. For more information about high-volume indexing, or adding or
configuring an Enterprise Extractor process, see OpenText Content Server Admin
Online Help - Search Administration (LLESWBS-H-AGD).

Content Server Extractor processes become aware of updates (adds, updates, and
deletes) by monitoring the DTreeNotify table in the Content Server database. Each
time an item is updated, an entry containing the following information is added to
the DTreeNotify table:

• A unique ID within the table
• The dataID of the object being updated
• The type of update being performed (for example, an add, update, or delete)
• The version of the object being updated

A Content Server Extractor process selects a line from this table and stores the
corresponding information in memory while it compiles the information it writes to
the data flow's iPool. Depending on how you have configured your Content Server
site, the Extractor process selects a line from either the top (oldest updates) or
bottom (newest updates) of the DTreeNotify table. If the
wantDescendingExtractor setting in the [LivelinkExtractor] section of the
opentext.ini file is set to TRUE (the default setting), the Extractor processes the
most recent updates first. For more information about the
wantDescendingExtractor setting, see the [LivelinkExtractor] section in the
OpenText Content Server Admin Online Help - Content Server Administration
(LLESWBA-H-AGD).

Note: If an Enterprise data flow contains multiple Content Server Extractor
processes, the updates are divided equally among the processes (using modulo
arithmetic on the DataID) to ensure that no duplication occurs. For example, with
two Extractor processes, one handles objects with even DataIDs and the other
handles objects with odd DataIDs.

When the Content Server Extractor process has processed the update information
and written a message to the data flow's iPool, the entry is deleted from the
DTreeNotify table. When multiple Extractor processes exist, each process writes its
messages to a unique iPool. However, since the Update Distributor can only read
from a single iPool, the extracted data must be merged before it reaches the Update
Distributor process. The following images illustrate possible designs for data flows
containing multiple Content Server Extractor processes.

Important
You may experience a significant decrease in performance if you implement a
configuration that was not specifically designed for your system.

These examples are provided only to illustrate several possible configurations
and should not be adopted without previously creating a system concept and
running an analysis of your system. It might not be possible to implement
these examples on all systems.

OpenText strongly recommends you request assistance from Global Services to
check your system's performance and to suggest improvements before you
make any modifications to your Extractor processes.

Figure 2-1: Two Extractor Processes

The configuration in Figure 2-1 is an example of Extractor processes used in a small
ECM system.

Figure 2-2: Two Extractor Processes and Document Conversion Processes

The configuration in Figure 2-2 is an example of Extractor processes used in a
mid-sized Email Archiving system.


Figure 2-3: Four Extractor Processes and Two Document Conversion Processes

The configuration in Figure 2-3 is an example of Extractor processes used in an
extremely large Email Journaling system.

For efficiency, there are certain rules about what an Extractor process will write to a
single iPool message. For example, if an Extractor is writing an update to an item as
well as a delete for the same item in the same iPool message, it writes the delete
only.

If you have configured your Content Server site to use external file storage (EFS),
you can specify how a Content Server Extractor process extracts document content
from the EFS: by extracting the complete document content from the EFS or by
referencing the location of the document in the EFS. This behavior is controlled by
the UseContentReference setting in the [LivelinkExtractor] section of the
opentext.ini file. By default, this setting is not included in the opentext.ini file,
which is the equivalent of setting it to false. In other words, the complete
document content is extracted. However, configuring an Enterprise Extractor to
extract document content by referencing the location of the document in the EFS can
reduce the load on the iPools in the Enterprise data flow and improve overall
indexing speed. In this case, the Document Conversion Service (DCS) must be able
to access the document content using the exact reference. For more information
about the Document Conversion Service, see “Understanding Document
Conversion” on page 26.
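
As a sketch, enabling reference-based extraction would add the following to the
opentext.ini file (the setting is absent by default, which is equivalent to false):

    [LivelinkExtractor]
    ; Extract a reference to the document's location in the EFS instead of
    ; the full content. The DCS must be able to resolve the exact reference.
    UseContentReference=true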


2.1.2 Directory Walker


The Directory Walker process is one of several producer processes that can be used
in Content Server data flows. It walks particular directories, locates files that match
the specified criteria, and sends those files to an iPool so the DCS can access and
then convert them for indexing.

The Directory Walker process is the producer process for the User Help and the
Admin Help data flows in Content Server. It crawls the directories where the
Content Server User or Admin Online Help files are stored and deposits the content
of those files (encoded in iPool messages) into an iPool. Content Server
Administrators can also use Directory Walker processes in custom data flows. For
more information about administering Directory Walker processes and creating
Directory Walker data flows, see OpenText Content Server Admin Online Help - Search
Administration (LLESWBS-H-AGD).

By default, a Directory Walker process is configured to run only once. In order to
update the User or Admin Help indexes (for example, after installing a new module
at your Content Server site), you must manually run the appropriate Directory
Walker processes so that they can detect any new files to add to the help indexes.

When a Directory Walker process walks a set of directories for the first time, it
records the files that match its criteria in a crawl history database file. Content Server
administrators can specify the name and location of the crawl history database file
on the Specific Properties page of a Directory Walker process. This information is
also contained in the configuration file for the Directory Walker process. The
DBLocator parameter in the configuration file specifies the directory in which the
files are stored. The DBLocatorName parameter in the configuration file specifies the
file name.

Content Server adds the .new extension to its current crawl history database file. For
example, if the DBLocator parameter is Content Server_home/myDirWalk and the
DBLocatorName parameter is myDirWalker, the Directory Walker process creates the
Content Server_home/myDirWalk/myDirWalker.new file when it walks its directories
for the first time. This crawl history database file contains a list of all the files that the
Directory Walker process has encoded in iPool messages and sent to the iPool.

When the Directory Walker process runs again, it renames the original crawl history
database file (history.new), giving it the .old file extension (history.old). It then
recrawls the directories and creates a new crawl history database file (history.new).
The Directory Walker process compares the new crawl history database file with the
old crawl history database file and sends any appropriate iPool messages to the
iPool. In this way, the Directory Walker process keeps track of new, modified, and
deleted files. This process repeats each time that the Directory Walker process runs.
The next time that the Directory Walker process runs, it renames the old crawl
history file (.old), giving it the .junk file extension (history.junk). At the same time, it
renames the current crawl history database file (history.new), giving it the .old file
extension (history.old), and creates a new crawl database history file (history.new)
that contains the most recent information.
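
Using the hypothetical myDirWalker example above, the crawl history files rotate
as follows across runs:

    After run 1:  myDirWalker.new
    After run 2:  myDirWalker.old   (the previous .new)
                  myDirWalker.new   (fresh crawl)
    After run 3:  myDirWalker.junk  (the previous .old)
                  myDirWalker.old   (the previous .new)
                  myDirWalker.new   (fresh crawl)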


Note: If the DBLocator parameter in the Directory Walker process's
configuration file is empty, the Directory Walker process's recrawl mechanism
is disabled. In this case, the process does not maintain a record of the files that
it crawls and always sends all of the files that meet its criteria to the iPool,
without detecting deleted files.

2.1.3 XML Activator


The Content Server XML Activator allows you to connect third-party applications to
Content Server data flows. These third-party applications do not require proprietary
tools, libraries, or coding languages, and interface with Content Server through XML
files that meet certain requirements. An XML Activator Producer process reads data
(as XML files) from a specified directory in the file system, where a third-party
application has written its information. The XML Activator Producer process then
sends the data to an iPool for the next process in the data flow.

The XML files produced by your third-party applications must meet certain
requirements to work properly with XML Activator processes.

Placement
The XML files that a third-party application places in directories for an XML
Activator process to read must be fully closed. The third-party application can fulfill
this requirement by writing files to a local directory and then moving the files to the
XML Activator process's incoming directory.

Naming

XML Activator processes read files in ascending, last-modified order.


Format

Along with their content (for example, binary data or text), the XML files that are
generated by third-party applications must include XML data that maps to data
interchange pool (iPool) messages. This XML data tells the XML Activator process
what to do with the corresponding content. The following table describes the iPool
key-value pairs, which you include in XML files as tagged elements and which
together constitute iPool messages. For more information about configuring an
XML Activator process, see
OpenText Content Server Admin Online Help - Search Administration (LLESWBS-H-
AGD).

Table 2-1: iPool Key-Value Pairs

OTURN (<OTURN> tag)
The value is a persistent and unique string that identifies the data object (the
content included in a particular XML file).

Operation (<Operation> tag)
The value is the action that Content Server performs with the data object. This
value is set to AddOrModify, AddOrReplace, Delete, DeleteByQuery or
ModifyByQuery.

Note: Beginning with Content Server 20.3, the option to batch ModifyByQuery
and DeleteByQuery operations has been removed. Instead, they are always
batched.

Content (<Content> tag)
The value is the data object. The Content tag can contain either XML data or
binary data formatted in Base64. If the tag contains binary data formatted in
Base64, the tag must take the form <Content encoding='Base64'>. There can
only be one Content tag.

If the Content tag is specified to contain Base64 characters, the data is decoded
and placed in the iPool's Content region. The Document Conversion process
then converts the binary data to text.

Metadata (<Metadata> tag)
The value is a series of fields, in the form of tagged text, that describe the data
object. Metadata fields act as queryable attributes for users by mapping data to
either existing Content Server regions or to regions defined in the XML
Activator process's .tok file. See the following table for typical metadata fields,
each of which maps to a Content Server region. For more information about
defining regions in .tok files, see the Content Server Developer's Documentation.

The following table describes the metadata fields that OpenText recommends you
include in each XML file. Each field corresponds to a Content Server region. If you
do not include these fields, Content Server users will not be able to search the
indexed data as effectively.

The tag names that you use must match the metadata field names, unless you
specify alternative tag mappings in Content Server under the Metadata List field for
the XML Activator process, which is located on the process's Specific Info page. You
can also include as many metadata fields as necessary by wrapping information in
tags whose names match Content Server regions, or whose names are mapped to
regions in the Metadata List field for the XML Activator process. For more
information about mapping metadata tags when adding or configuring an XML
Activator process, see the Content Server Admin Online Help.

The XML files must also be valid XML.

Table 2-2: Required Metadata Fields

Field Description
OTName The name of the data object
OTOwnerID A unique identifier for the owner of the data object
OTLocation The original location of the data object
OTCreateDate The date on which the data object was created
OTCreateTime The time at which the data object was created
OTModifyDate The date on which the data object was last modified
OTModifyTime The time at which the data object was last modified
OTCreatedBy The node number of the creator of the data object
OTCreatedByName The login name of the creator of the data object
OTCreatedByFullName The full name of the creator of the data object

Operations

For the AddOrModify operation, structure your XML file as follows:


<?xml version="1.0"?>
<Body>
  <OTURN>OTURN</OTURN>
  <Operation>AddOrModify</Operation>
  <Metadata>
    <ModField>over-write this field</ModField>
    <TotallyNew>New Stuff</TotallyNew>
  </Metadata>
</Body>

The AddOrModify operation behaves as follows:

• If there is no existing object with the external object id specified in the OTURN
value, then an Add of an object with the specified metadata is performed.

Note: AddOrModify only works with metadata, not content; if you need to
add or modify content, please use the AddOrReplace operation.
• If there is an existing object with the external object id specified in the OTURN
value, then a Modify of that object is performed as follows:


– the newly specified metadata regions replace any previous metadata regions
of the same region names

– other previously existing metadata regions are unchanged.

For the AddOrReplace operation, structure your XML file as follows:


<?xml version="1.0"?>
<Body>
  <OTURN>OTURN</OTURN>
  <Operation>AddOrReplace</Operation>
  <Metadata>
    <OTName>Data object name</OTName>
    <OTOwnerID>Owner ID number</OTOwnerID>
    <OTLocation>Original location</OTLocation>
    <OTCreateDate>Creation date</OTCreateDate>
    <OTCreateTime>Creation time</OTCreateTime>
    <OTModifyDate>Date last modified</OTModifyDate>
    <OTModifyTime>Time last modified</OTModifyTime>
    <OTCreatedBy>Node number of creator</OTCreatedBy>
    <OTCreatedByName>Login name of creator</OTCreatedByName>
    <OTCreatedByFullName>Full name of creator</OTCreatedByFullName>
  </Metadata>
  <Content encoding='Base64'>Content data</Content>
</Body>

The AddOrReplace operation behaves as follows:

• If there is no existing object with the external object ID specified in the OTURN
value, then an Add of an object with the specified metadata and content is
performed.

• If there is an existing object with the external object id specified in the OTURN
value, then a Replace of that object is performed as follows:

– all previous metadata of the object is discarded (including any regions not
mentioned in the new metadata) and replaced with the newly specified
metadata regions

– if new content is specified, it replaces the previous content; if no new content
is specified, the previous content is unchanged.

For the Delete operation, structure your XML file as follows, including only the
OTURN and the Operation. Do not include any content or metadata because this
operation deletes data that already exists in the index and is identified by its OTURN.

<?xml version="1.0"?>
<Body>
  <OTURN>OTURN</OTURN>
  <Operation>Delete</Operation>
</Body>

The Delete operation behaves as follows:


• If there is an existing object with the external object id specified in the OTURN
value, then the object is deleted.
• Otherwise, no objects are affected.

Deleting Objects by Query


The DeleteByQuery feature allows objects which match a search query to be deleted
by submitting a request using the iPool interface. The DeleteByQuery feature is
efficient in situations where many similar objects need to be removed, and in cases
where Content Server objects become orphaned.

Note: Beginning with Content Server 20.3, the option to batch ModifyByQuery
and DeleteByQuery operations has been removed. Instead, they are always
batched.

For example, Content Server cannot delete renditions from the index. Another
example is when there are versions of a document in different partitions.
DeleteByQuery allows these cases to be handled. The Update Distributor sends
these requests to all of the Index Engines, and each Index Engine logs the number of
objects it deleted. The DeleteByQuery feature complements the “Modifying Objects
By Query” on page 25 feature.

Caution
Using DeleteByQuery is a potentially dangerous operation and must be
used carefully since it can result in data loss.

The query should not match more objects than you intend to delete. For
example, a query for "TempSandbox" will also match "Second
TempSandbox", which may not be what you intended. OpenText
recommends that you restrict the use of the DeleteByQuery operation to
fields which just use one-word keys.

Sample XML Syntax for a DeleteByQuery operation:


<?xml version="1.0"?>
<Body>
  <OTURN>[region "ProjectCode"] "TempSandbox"</OTURN>
  <Operation>DeleteByQuery</Operation>
</Body>

The DeleteByQuery operation behaves as follows: Any objects matching the
OTSTARTS query in the OTURN value are deleted.

Note: The OTSTARTS query syntax is documented in “OTSTARTS Query
Language” on page 80. If invalid syntax is used, no objects will be deleted,
and a syntax error will be reported in the Index Engine log.


Modifying Objects By Query


The ModifyByQuery feature allows metadata to be modified for a collection of
objects using a Query issued through the iPool. This feature allows applications to
collapse multiple indexing transactions into a single action. For example, you can set
all prior versions of an object to not current, or rename a folder containing many
objects in Content Server. For Web Content Management, this may allow all objects
belonging to a project to have their status changed from pending to released
simultaneously. Multiple metadata regions can be modified in a single operation.
The ModifyByQuery feature complements the “Deleting Objects by Query”
on page 24 feature.

Note: Beginning with Content Server 20.3, the option to batch ModifyByQuery
and DeleteByQuery operations has been removed. Instead, they are always
batched.

Unlike AddOrModify, the ModifyByQuery operation will NOT create a new entry
(object). It will only change existing entries. The operation is broadcast to all Index
Engines by the Update Distributor, and each Index Engine logs the number of
objects it modified.

Like DeleteByQuery, this operation is performed using iPools, not using the Live
Query Language. The query string is used in place of the object ID in the iPool.

Caution
Using ModifyByQuery is a potentially dangerous operation and must be
used carefully since it can result in data loss.

The query should not match more objects than you intend to modify. For
example, a query for "TempSandbox" will also match "Second
TempSandbox", which may not be what you intended. OpenText
recommends that you restrict use of the ModifyByQuery operation to fields
which just use one-word keys.

Sample XML syntax that can be passed to the XML Activator:


<?xml version="1.0"?>
<Body>
  <OTURN>[region "ProjectCode"] "Sandbox1"</OTURN>
  <Operation>ModifyByQuery</Operation>
  <Metadata>
    <ModField>over-write this field</ModField>
    <TotallyNew>New Stuff</TotallyNew>
  </Metadata>
</Body>

For more information about XML Activator, see OpenText Content Server Admin
Online Help - Search Administration (LLESWBS-H-AGD).


2.2 Understanding Document Conversion


Content Server's Document Conversion Service (DCS) uses filter packs to convert
items from their native file formats (for example, Microsoft Word, Microsoft Excel,
or Adobe PDF) to a simple text format for viewing or indexing in Content Server.
The filter pack used depends on whether Content Server is indexing content or a
user is requesting to view a Content Server item.

OpenText Document Filters converts items from their native file formats to a simple
text format for viewing or indexing in Content Server, and is used to display
Content Server items. For summary hit highlighting, find similar, recommender
synopsis generation, and classification profile generation, the DCS server uses
Quality Document Filters to convert Content Server items.

The OpenText Document Filters detect and convert items of the following formats:

• Microsoft Word
• Microsoft Excel
• Microsoft PowerPoint
• Microsoft Outlook
• Standard Mail (RFC822)
• Adobe PDF
• HTML
• ZIP
• TAR
• LZH

If you want to extend the DCS's MIME type detection, text-extraction, and
document-conversion capabilities, you can create a custom filter pack that your
Content Server Administrator can install. For more information, see the OpenText
Content Server Filter Pack - Creating Filter Packs (LLESCF-CFP) guide.

The DCS architecture provides a framework for MIME type detection and document
conversion that is easily extensible. The architecture is composed of one or more
DCS servers that work with DCS workers and filter packs. DCS workers are
processes that identify and load the filters required to extract text or convert a
document of a particular file format to a simple text format such as HTML. DCS
filter packs are installable sets of DCS components and associated files that extend
Content Server's document-conversion capability.

When items are added to Content Server, the default MIME type detection relies on
the following sequence:

• browser identification
• item file extension
• DCS process

You can change the sequence by modifying the trustBrowserMIMEType parameter.

When you set the trustBrowserMIMEType parameter to False in the General section
of the opentext.ini file, the MIME type detection relies on the following sequence:

• DCS process
• browser identification
• item file extension
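
As a minimal sketch, the corresponding opentext.ini entry would look like this (the
exact casing of the section header may differ in your installation):

    [general]
    ; FALSE: let the DCS process detect the MIME type before trusting the
    ; browser-supplied type or the item file extension.
    trustBrowserMIMEType=FALSE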

2.2.1 DCS Servers


A DCS server receives data in two ways: from a DCS client (for MIME type
detection, viewing, hit highlighting, find similar, recommender synopsis generation,
or classification profile generation) or from a data interchange pool (for indexing).
The DCS server that receives and processes information from a DCS client is named
dcs. The DCS server that receives and processes information from data interchange
pools is named otdoccnv.

Note: When represented in indexing data flows, the DCS server is called the
Document Conversion Process and data interchange pools are called iPools.
For more information, see OpenText Content Server Admin Online Help - Search
Administration (LLESWBS-H-AGD).

DCS clients are processes that make requests directly to a DCS server (dcs). For
example, if a user tries to view a document of a particular MIME type, a DCS client
named LLView opens a socket connection with an available DCS server (dcs) and
passes the document to that DCS server for MIME type verification and conversion.
In this case, the document's MIME type must be listed in the [filters] section of
the opentext.ini file; otherwise, the user will be prompted to Open or Download the
document. Similarly, if a user tries to hit highlight a search result item, find a similar
item, generate a recommender synopsis, or generate a classification profile, a DCS
client named wfwconv opens a socket connection with an available DCS server, and
passes the appropriate item to that DCS server for MIME type verification and
conversion. Figure 2-4 illustrates the sequence of operations performed when a user
makes one of these types of requests in Content Server.

To establish a socket connection with an available DCS server, a DCS client
randomly selects one of the DCS servers listed in the [DCSServers] section of the
opentext.ini file and attempts to connect to it. The DCS client continues its
connection attempts until it finds an available DCS server to handle its request. If no
DCS server is available, the DCS client generates a temporary instance of a DCS
server to handle its request. After it completes the requested operation, this
temporary instance of the DCS server is removed.


Figure 2-4: Receiving Information from DCS Clients

In Figure 2-4, OTDF is OpenText Document Filters, and QDF is OpenText Quality
Document Filters.

In order to be indexed, Content Server items must be converted to a simple text
format. This conversion is handled by the DCS server named otdoccnv. Unlike the
DCS servers used to convert data for viewing or hit highlighting, the DCS server
used to convert data for indexing reads data from a data interchange pool. Data
interchange pools temporarily store Content Server items as they pass through
indexing data flows. In this case, the DCS server uses an internal library (the
DCSIPool library) to read data from the appropriate data interchange pool in a data
flow.


Figure 2-5: Receiving Information from iPools

In Figure 2-5, OTDF is OpenText Document Filters, and QDF is OpenText Quality
Document Filters.

For information about how DCS servers interact with Content Server Admin servers,
see the OpenText Content Server Admin Online Help - Search Administration (LLESWBS-
H-AGD).

2.2.2 DCS Workers


When a DCS server receives data (from either a DCS client or a data interchange
pool), it performs its own MIME type detection to verify the type of document it has
received. Then, depending on the rules defined in the DCS rules file (which governs
the behavior of a DCS server), the DCS server generates the appropriate worker to
handle the text extraction or conversion of the document.

DCS workers are processes that identify and load the filters required to extract text
or convert a document of a particular MIME type to a simple text format. For
example, for a user to view a Microsoft Word document, the DCS server generates a
worker process that loads a third-party set of filters, which is used to convert the
Microsoft Word document to HTML format for viewing in Content Server.

For hit highlighting, find similar, recommender synopsis generation, and
classification profile generation, the DCS server generates a worker process that
loads QDF, which is used to convert items to the appropriate formats. DCS workers
are also responsible for returning converted items to DCS servers, which, in turn,
make them available to the user in Content Server's interface.

DCS workers are persistent processes. This means that once a particular worker
process is loaded by the DCS, it remains available for future conversion operations
as long as the DCS server is active.


2.2.3 DCS Filter Packs


DCS filter packs are installable sets of DCS components and associated files that
extend Content Server's document conversion capability. The only filter pack
available by default is Content Server's custom filter pack, QDF. For specific
information about QDF, see the OpenText Content Server Admin Online Help - Content
Server Administration (LLESWBA-H-AGD).

2.3 Updating an Index


The Update Distributor reads data that has been written to an iPool by the
Document Conversion process, and then distributes the data among the Index
Engines that are active at the Content Server site. The data either represents new
objects that have been added to Content Server or updates (modifications or deletes)
that have been made to existing objects or their metadata.

For new objects, the Update Distributor looks in the search.ini file and balances its
distribution of data between the available Index Engines listed there. The Update
Distributor continues making updates in this manner until all partitions are full – at
which point it stops all indexing processes and returns an error indicating that the
partitions are full.

Note: Each Index Engine has a dedicated section in the search.ini file
(delineated with the [IndexEngine_ prefix). This section contains information
about the data flow, partition, ports, and log files associated with the specified
Index Engine process.

When a new object is indexed, the object identifier and the user information in the
object are passed to each Partition in the system via the Index Engines. The Index
Engines then respond indicating that none of them presently contain the object, so
one Index Engine is selected and an Add operation is performed. The object
identifier is a value that uniquely identifies the object in Content Server. For objects
in an Enterprise data source, this is an objID number and version number, and for
objects in a Directory Walker data source, it is an absolute path.

When an update is made to an object that has already been indexed, a similar
process occurs. The object identifier and the user information in the object are passed
to each partition in the system via the Index Engines. The Index Engines then
respond indicating which one of them contains the object, and an Update operation
is performed.


2.3.1 Merging Index Fragments


In Content Server, a full-text index consists of a metadata index, a content
accumulator, and a content index. The content index consists of multiple index
fragments and each fragment represents an incremental update to the full-text index.
When executing a Query, the Search Engine evaluates each index fragment,
combines the results from each fragment with the results from the metadata index
and the accumulator, and displays a consolidated result list. Typically, the more
index fragments that the Search Engine has to evaluate for a Query, the longer the
search time.

To remove deleted data from the index (compacting it) and reduce the number of
index fragments in the system, the Index Engine periodically merges one, two, or
three index fragments. The new index fragment is similar in size to the sum of all the
input index fragments. Once the merge operation is complete, the new index
fragment and the original input index fragments are both stored in the index
directory, which means that at least two times the original disk space requirement
must be available. For example, merging fragments of 2 GB, 3 GB, and 5 GB
produces a new fragment of roughly 10 GB, so roughly 20 GB of disk space is
occupied until a cleanup thread later deletes the original input index fragments.

The frequency with which merges occur is determined by Content Server's merge
policy. For example, the Index Engine does not perform a merge if the available disk
space (based on the value of the Maximum Content Disk Size setting on the
Specific Properties page of a partition) is smaller than its estimation of required disk
space. The merge policy is composed of a set of user-defined settings and a set of
system-defined rules. Together, they control when and how a merge process is
launched.

User-Defined Settings
The following table describes the settings that you can set to influence the merge and
compaction policy in Content Server:

Table 2-3: User-Defined Settings

Setting Description
Maximum Content Disk Size Specifies the total disk space available for use
where the partition's content index is stored.
This setting is also used indirectly by the Update
Distributor when determining to which partition
to send an update. The Update Distributor shuts
down when all the partitions are full.

You can change the value of the Maximum


Content Disk Size setting on the Specific
Properties page of a partition.

LLESSRC210400-GGD-EN-01 Advanced Indexing and Searching Guide 31


Chapter 2 Administering Indexing

Setting Description
Merges
   Enables or disables merge capability in the system. You can change the value of the Merges setting to True or False on the Partition Map's Specific Properties page.
   OpenText recommends that you do not turn off merge capability unless advised to do so by OpenText Customer Support.
Merge Attempt Interval
   Specifies how often an Index Engine checks a partition's index to determine whether a merge should occur.
   You can change the value of the Merge Attempt Interval setting on the Partition Map's Specific Properties page.
Target Index Number
   Determines whether or not the Index Engine merges index fragments. If the number of index fragments in the partition exceeds the value of this setting, the Index Engine begins aggressively merging the smaller index fragments.
   You can change the value of the Target Index Number setting on the Partition Map's Specific Properties page. Valid values range from 1 to 15. By default, this setting is set to 5.
Oldest Index Date
   Specifies the number of days after which an index fragment is included in a merge in order to remove deleted data (compact).
   You can change the value of the Oldest Index Date setting on the Partition Map's Specific Properties page. Valid values range from 30 to 60. By default, this setting is set to 30.
Index Ratio
   Specifies the ratio for determining whether subindexes are candidates for a merge operation. The default value of the setting is 3, which specifies that the merge ratio is 3:1. This means that a merge will occur when neighboring indexes are 1/3 of the largest of 3 (or 2) consecutive index fragments.
   You can change the value of the Index Ratio setting on the Partition Map's Specific Properties page. Valid values range from 2 to 5. By default, this parameter is set to 3.

TailMergeMinimumNumberOfSubIndexes
   Specifies the number of index fragments that must exist before the index engine runs a secondary merge thread.
   You can set the value of the TailMergeMinimumNumberOfSubIndexes parameter in the search.ini file. By default, this parameter is set to 8.
   OpenText recommends not changing this setting, as resynchronizing a Content Server Admin Server will reset this parameter to its default value.
SubIndexCapSizeInMBytes
   Specifies the maximum size an index fragment can reach before it is no longer considered for merge or compaction operations.
   You can set the value of the SubIndexCapSizeInMBytes parameter in the search.ini file. By default, this parameter is set to infinity.
   OpenText recommends that you do not alter the value of this parameter without consulting Customer Support. Resynchronizing a Content Server Admin Server will reset this parameter to its default value.

System-Defined Rules
An index engine runs two merge threads: a primary thread and a secondary thread.
The threads wake according to the value of the Merge Attempt Interval setting on
the Partition Map's Specific Properties page (when a merge is not already in
progress), and then apply a two-phase policy of system-defined rules. The first
phase nominates sets of fragments as candidates for the merge. The minimum
number of fragments to merge is one (in which case, compaction occurs) and the
maximum is three. The second phase determines whether or not the merge is
possible and desirable given the current state of the system. The first set of index fragments that passes both phases is merged. If there are no feasible candidate sets, the thread goes back to sleep.

The index engine runs the secondary thread if the following conditions are true:

• The primary thread is performing a merge


• The number of index fragments is greater than or equal to the value of the
TailMergeMinimumNumberOfSubIndexes parameter in the search.ini file (by
default this parameter is set to 8).


Table 2-4: System-Defined Rules

Phase I
   Candidates are nominated in the following order:
   • If the number of index fragments is greater than the value of the Target Index Number setting (by default, five), the smallest consecutive grouping of three is nominated, then the smallest consecutive grouping of two is nominated (according to the default value of the Index Ratio setting). If both fail, a compaction is nominated with the next step. For more information about the Target Index Number and Index Ratio settings, see “User-Defined Settings” on page 31.
   • If an index fragment has not been included in a merge for more than the number of days specified by the Oldest Index Date setting (by default, 30), it is nominated. If this candidate fails during Phase II, a warning message is logged indicating that the index fragment is too large to compact.
   • If a fragment is present from an index migration, it is nominated. If this candidate fails during Phase II, a warning message is logged, indicating that the index fragment is too large to compact.
   • All consecutive sets of three and two index fragments are examined and sets are nominated in descending order of combined fragment directory size, provided that neighboring fragments are at least a certain percentage of the size of the remaining fragments. This percentage is determined by the Index Ratio setting (for example, if this setting is 3, the neighboring fragments must be at least one-third of the size of the remaining fragments).

   If a merge cannot be started and the number of fragments is greater than the Target Index Number, a message is logged indicating that the merge cannot start.
Phase II
   A merge does not occur if one of the following criteria is satisfied, in the order presented:
   • If merging has been disabled (that is, if Merges is set to False)
   • If the nominated set of fragments has a combined file size greater than or equal to the current lockout (described below)
   • If the remaining disk space is less than the combined size of the nominated set of fragments and the expected size of the next accumulator dump
   • If a fragment in the nominated set is greater than the value of the SubIndexCapSizeInMBytes parameter in the search.ini file. Once a fragment reaches the value specified by the SubIndexCapSizeInMBytes parameter, it is no longer considered for merge or compaction operations. The default value of this parameter is infinity. OpenText recommends that you do not alter the value of this parameter without consulting Customer Support.

If the merge fails, the combined file size of the nominated candidates becomes the current lockout. If the merge succeeds, the number of successful merges since setting the lockout is incremented. If five successful merges have occurred since the last lockout, the lockout is removed. If the lockout remains for more than 24 hours and there have been no successful merges, the lockout is removed. If the Index Engine process is restarted, the lockout is removed.
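The following minimal sketch (Python; an illustration of the rules just described, not product code) models the lockout bookkeeping. Class and method names are assumptions; a restart simply discards the object, which mirrors the lockout being removed when the Index Engine process restarts.

import time

class MergeLockout:
    def __init__(self):
        self.lockout_size = None   # combined size of the failed candidate set
        self.lockout_time = None
        self.successes_since_lockout = 0

    def allows(self, combined_size):
        if self.lockout_size is None:
            return True
        stale = time.time() - self.lockout_time > 24 * 3600
        if stale and self.successes_since_lockout == 0:
            self.lockout_size = None   # 24 hours, no successes: remove lockout
            return True
        # Sets at or above the lockout size are refused
        return combined_size < self.lockout_size

    def record_failure(self, combined_size):
        self.lockout_size = combined_size
        self.lockout_time = time.time()
        self.successes_since_lockout = 0

    def record_success(self):
        self.successes_since_lockout += 1
        if self.successes_since_lockout >= 5:
            self.lockout_size = None   # five successful merges: remove lockout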

2.3.2 Managing Memory via Automatic Defragmentation


During metadata updates, the memory used for storing metadata can slowly fragment in certain scenarios. Content Server includes a scheduled task that proactively defragments the metadata. This ensures that the estimates of percent full are more accurate, and reduces the likelihood that new partitions are created prematurely.

The defragmentation mechanism is controlled in the data flow section of the search.ini configuration file via three settings, shown below along with their default values:
DefragmentMemoryOptions=2
DefragmentSpaceInMBytes=10
DefragmentDailyTimes=2:30

The DefragmentMemoryOptions setting controls if and how defragmentation runs. The defaults are suitable for most systems, but you can adjust them as follows:

• DefragmentMemoryOptions=0 – defragmentation is disabled and will not run automatically. This setting is not recommended for most systems where updates are frequent.
• DefragmentMemoryOptions=1 – defragmentation runs every time a checkpoint is written, and defragments the entire index. This setting is not recommended for most systems because it leads to frequent defragmentation that blocks updates in the Index Engine and searches in the Search Engine.
• DefragmentMemoryOptions=2 – the default, which causes the Index Engine and the Search Engine to invoke a daily scheduler daemon.

The daily scheduler daemon is controlled via the DefragmentSpaceInMBytes and DefragmentDailyTimes settings.

The DefragmentSpaceInMBytes setting allows the administrator to specify how many megabytes of space are allowed to be used for the defragmentation process.

• DefragmentSpaceInMBytes=0 – This option minimizes the memory used by the defragmentation process, but defragmentation may take up to twice as long as with other settings.
• DefragmentSpaceInMBytes=-1 – This option typically minimizes the running
time of the defragmentation process, but when using this setting administrators
should ensure that sufficient free memory is available to temporarily hold an
extra copy of the largest region in memory.
• DefragmentSpaceInMBytes=m – where m is a positive integer, the faster defragmentation approach is used if the region undergoing defragmentation is less than m megabytes in size. If the region undergoing defragmentation is m megabytes or larger, the lower-memory defragmentation approach is used. The default is DefragmentSpaceInMBytes=10.

The DefragmentDailyTimes setting allows the administrator to specify the times (in
the system's default time zone at startup) at which to start the defragmentation. The
default setting is DefragmentDailyTimes=2:30, which is 2:30 a.m. local time. Only
the hour (ranging from 0 to 23) and minute (ranging from 0 to 59) may be specified.
Multiple start times can be expressed as a comma-delimited list (no spaces allowed),
for example, DefragmentDailyTimes=0:00,23:59.

Note: If the system's time zone changes, for example when daylight saving time begins or ends, the scheduler continues to run based on the time zone at startup. The next time the Index Engine or Search Engine is restarted, it will use the new system time zone.

For the Search Engine, these settings can be reloaded through the Admin server, so
they can be rescheduled without having to stop and restart the Search Engine. This
option is currently not available for the Index Engine.

When a Search Engine is reloaded, if a daily defragmentation scheduler daemon was previously running, it is cancelled, and a log message, Cancelled Previously Scheduled Daily Defragmentation Tasks, is written at the Status level. If defragmentation was in progress at the time of scheduler cancellation, it continues until it completes.

If the reloaded ini settings specify daily defragmentation times, a new daily
defragmentation scheduler daemon is started using the latest set of times.

2.3.3 Rebalancing Partitions


The previous partition percent full model, and the associated approach to partition rebalancing based on how full a partition is, has been modified and enhanced. The enhanced model helps administrators better manage partition memory use by preventing over-full partitions and the premature creation of new partitions. This model applies only to partitions in Read-Write mode (see details in the Note below) and provides the following functionality:

• When a partition is full, it stops accepting new objects and will only accept
updates to existing objects.
• If the updates exceed the estimated reserved space, then the index will do some
rebalancing to bring the memory usage below 80%, typically down to about 77%.

Note: Rebalancing of partitions that are in No-Add (update only) mode can be
enabled via the AllowRebalancingOfNoAddPartitions setting in the update
distributor section of the search.ini file (this setting is false by default). This
setting should only be enabled after careful consideration as it is applied to all
partitions of the No-Add type and may not be appropriate for some email
archiving installations.


The partition percent full model is described through the following settings which
apply per partition (each partition can have its own values for these settings). The
global defaults for all the partitions are shown below:

• StartRebalancingAtMetadataPercentFull=80
• StopRebalancingAtMetadataPercentFull=77
• StopAddAtMetadataPercentFull=70 (85 is recommended for new CM apps)
• MetadataPercentFullWarnThreshold=65 (80 is recommended for new CM
apps)
• WarnAboutAddPercentFull=true (will show up at the Info Level of logging)

Note: The default settings for percent full and partition rebalancing are
lower, so if an existing partition is already above the new lower limits, that
partition may stop accepting new objects and may start rebalancing itself.
This is a normal part of the maintenance process, but you may need to add
more partitions and change automatic partition creation rules, if they are
defined, to match these new default settings.
You can disable this behavior using an Admin server management page to
manually set these settings higher. In earlier versions of Content Server
there was no Admin server interface for these settings, so to override the
new lower defaults, you must use a search.ini_override file. For more
information, see “Overriding the search.ini file” on page 97.

The metadata percent full value of a partition is calculated based on the MaxMetadataSizeInMBytes setting, which is set for each partition via a Content Server management page. The following description of the model uses the default partition values that are listed above as an example.

Based on the default values listed above, when a partition reaches 65% metadata full (the default for MetadataPercentFullWarnThreshold), warning messages appear by default in the Index Engine logs at the Info Level. To disable this logging, set WarnAboutAddPercentFull=false. Setting a less verbose logging level, such as the Warning Level, also prevents these messages from being logged in the Index Engine logs.

When the metadata percent use of a partition reaches 70% full (the default for the StopAddAtMetadataPercentFull threshold), the addition of new objects to this specific partition is halted. Updates to objects currently residing in this partition are still allowed. Warning messages appear in the Index Engine logs at the Warning Level specifying that the partition is no longer accepting new objects.

If the metadata percent use of a partition reaches 80% full (the default for the StartRebalancingAtMetadataPercentFull threshold), an update coming in for an object residing in that partition causes the partition to rebalance. The rebalance moves the specific object from this partition to a partition that is accepting new objects to decrease the memory use of that partition. Warning messages appear in the Index Engine logs at the Guaranteed Level, stating that the partition has entered a rebalance mode. The partition stays in the rebalance mode until the metadata percent full use decreases to the 77% level (the default value for the StopRebalancingAtMetadataPercentFull threshold). Once the metadata percent full use decreases to this 77% level, updates coming in to this partition are allowed, and no longer cause a rebalance. If the metadata percent full once again increases to 80%, a new partition rebalance is triggered.

Note: New objects are still not accepted into this partition unless the metadata
percent use decreases below the 70% use threshold for
StopAddAtMetadataPercentFull. If no partitions are available to accept new
objects, the Update Distributor will go down with process error 174 or 175.
The administrator should use the log warnings and/or the automatic partition
creation rules of Content Server to ensure that new partitions are created as
required to prevent this.

Automatic partition creation can be triggered either by date or by percent full. OpenText recommends that you set the automatic partition creation rules to trigger between 65% and 70%.

One advantage of this model is that it automatically places a partition into a soft update-only mode (no new objects are accepted) as it begins to fill. Previously, administrators had to closely watch partition sizes and make explicit decisions about when a partition mode should be changed. This system also gracefully and automatically puts that partition back into Read-Write mode if conditions change, for example, if large numbers of objects are deleted.

Figure 2-6: Rebalancing Partitions
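The following simplified model (Python; names and structure are assumptions, using the default threshold values listed above) illustrates how a partition moves between the warning, no-add, and rebalancing states:

class Partition:
    # Default thresholds from the settings listed above
    WARN, STOP_ADD, START_REBALANCE, STOP_REBALANCE = 65, 70, 80, 77

    def __init__(self):
        self.rebalancing = False

    def on_metadata_percent_full(self, pct):
        states = []
        if pct >= self.WARN:
            states.append("warn")            # Info-level log messages
        if pct >= self.STOP_ADD:
            states.append("no-new-objects")  # updates only
        if pct >= self.START_REBALANCE:
            self.rebalancing = True          # updates move objects elsewhere
        elif self.rebalancing and pct <= self.STOP_REBALANCE:
            self.rebalancing = False         # updates accepted normally again
        if self.rebalancing:
            states.append("rebalancing")
        return states

p = Partition()
print(p.on_metadata_percent_full(82))  # ['warn', 'no-new-objects', 'rebalancing']
print(p.on_metadata_percent_full(77))  # ['warn', 'no-new-objects']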

The amount of space to set aside for updates depends on the expected usage. A pure archiving solution with few anticipated updates might operate best on the provided defaults. Other systems might require a larger “buffer” for rebalancing the partitions.

Note: Earlier versions of Content Server did not provide an Admin page to
change these default settings. To change any of the default settings for a
partition, the administrator must use the search.ini_override file. For more
information, see “Overriding the search.ini file” on page 97.

This example of the search.ini_override file overrides the default settings only for the first partition in the system, which is called firstPartition.
[Partition_firstPartition]
StartRebalancingAtMetadataPercentFull=70
StopRebalancingAtMetadataPercentFull=85
WarnAboutAddPercentFull=true
StopAddAtMetadataPercentFull=75
MetadataPercentFullWarnThreshold=70

2.3.4 Preventing Update Blocking


Previous Search Engine behavior gave search queries priority over updates, so for applications with a constant search load, updates could be blocked for prolonged periods. Content Server now has the capability to ensure that some time is allocated to reading the updates. This capability is controlled by a setting in the data flow section of the search.ini file, MaxSearchesWhileUpdateBlocked, with a default value of 20.

The MaxSearchesWhileUpdateBlocked setting specifies the maximum number of searches that are allowed to start running while an update is blocked. If it is set to a positive integer x, then once an update is waiting, at most x new searches can start to run, and subsequent searches are blocked until an update has completed.

The special setting of MaxSearchesWhileUpdateBlocked=0 means no maximum number of searches is specified. This will not cause new search delays compared to previous Search Engine behavior, when high search volumes could prevent updates from being made.

When MaxSearchesWhileUpdateBlocked has a positive setting x, the Search Engine log messages for “Search starting” and “Search finished”, which show the ReadWriteGate state, display an extra counter, curReadsWhileWriteBlocked, which shows the count of searches attempted while a current update has been waiting. If this number reaches or exceeds x, new searches wait until after an update has run. An example log line is:
1268262953661:RMI TCP Connection(11)-10.16.14.190:4:Search finished ;
readers(waiting=2,running=0), writers(waiting=1,running=0), curReadsWhileWriteBlocked=3:

In the case of non-stop updating and searching, the Search Engine still serializes the updates in transaction order and still enforces at least a 10-millisecond (ms) delay between updates. If MaxSearchesWhileUpdateBlocked=x (for x > 0), the following occurs:

• an update runs
• when it finishes, all waiting searches start
• any new search arriving in the next 10 milliseconds also starts
• the next update is issued, but has to wait because of the running searches
• the next x searches that arrive are allowed to start
• subsequent searches are blocked
• when all running searches finish, the waiting update runs
• when this update finishes, all waiting searches start

Technically, the internal ReadWriteGate controls not just searches and updates but also other background threads. For example, the merge thread performs both read and write operations, which count like a search or an update.

The MaxSearchesWhileUpdateBlocked parameter is not affected by reloadSettings().
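The gate behavior described above can be sketched as follows (Python; a minimal illustration of the counting logic, not the actual ReadWriteGate implementation):

import threading

class ReadWriteGate:
    def __init__(self, max_searches_while_update_blocked=20):
        self.cv = threading.Condition()
        self.max_reads = max_searches_while_update_blocked
        self.running_readers = 0
        self.writer_waiting = False
        self.cur_reads_while_write_blocked = 0

    def start_search(self):
        with self.cv:
            # Once a waiting update has been bypassed by max_reads searches,
            # block new searches (0 means no maximum).
            while (self.writer_waiting and self.max_reads > 0
                   and self.cur_reads_while_write_blocked >= self.max_reads):
                self.cv.wait()
            if self.writer_waiting:
                self.cur_reads_while_write_blocked += 1
            self.running_readers += 1

    def finish_search(self):
        with self.cv:
            self.running_readers -= 1
            self.cv.notify_all()

    def run_update(self, apply_update):
        with self.cv:
            self.writer_waiting = True
            while self.running_readers > 0:
                self.cv.wait()      # wait for running searches to drain
            apply_update()          # the update runs with no readers active
            self.writer_waiting = False
            self.cur_reads_while_write_blocked = 0
            self.cv.notify_all()    # wake all blocked searches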

2.4 Avoiding Index Pollution


Depending on the type of content at your site, it may not be reasonable or feasible to
index everything. In fact, some content (for example, audio files and binary files)
should not be indexed, because it cannot be converted effectively and therefore
bloats the index. To detect and avoid this type of content, the Index Engines
implement a bad object heuristic.

Note: You can use the opentext.ini file to ignore the content of specified MIME types. For example, you may want to exclude audio files by adding the audio/mpeg=TRUE parameter. Only the metadata (name, creation date, and so on) is extracted and indexed. For information about the opentext.ini file, see the OpenText Content Server Admin Online Help - Content Server Administration (LLESWBA-H-AGD).

The bad object heuristic is based on the fact that bad documents contain a lot of
unique words. If the Index Engines allowed bad objects to be indexed, the index's
word list would grow to enormous proportions and search performance would
rapidly deteriorate.

The bad object heuristic assumes that users are primarily interested in searching
natural language documents. Natural language documents have certain predictable
characteristics and the heuristic prunes objects that do not match those
characteristics. In theory, this may include part lists or directories that could
potentially be valuable; however, in practice, most of the bad objects being detected
include video, audio, binary files, or spreadsheets containing tables of floating point
numbers which pollute the index's word list with many unique words that users
would never use when searching. In fact, statistics indicate that roughly 1% of the
bad objects account for over 60% of the unique words.

The heuristic is based on the observation that natural language documents include function words (for example, articles, conjunctions, and prepositions) that make up a significant fraction of all word occurrences. In a typical English document, the 100 most frequent words account for over 50% of all word occurrences. The distribution is such that a few words occur often and others occur rarely, where the majority of occurrences are repeats. Using a value of 0.25, the heuristic discards any object in which the number of unique words is more than 25% of the total number of word occurrences. For example, a 2000-word object would have to contain more than 500 unique words, which is highly unlikely in a natural language document.

The heuristic also assumes that natural language documents consist mainly of short words. This is due to the presence of function words, which tend to be short in most alphabetic languages. For example, the accepted statistic for English is that the average word size is about 5 characters; for German, the average is 6 characters.

2.4.1 Bad Object Heuristic


The bad object heuristic identifies bad documents as objects that contain more
tokens than the value specified by the Minimum Document Size to Consider
setting on the Partition Map's Specific Properties page (the default value is 16384)
and fulfill one of the following criteria (the settings described also appear on the
Partition Map's Specific Properties page):

• The unique ratio (that is, the number of unique words in the document versus the total number of words in the document) is greater than the value of the Document Word Ratio setting, and the average word length is greater than the value of the Maximum Average Word Length setting. Using the default values of these settings, this means that if 10% of the object content consists of unique words and the average word length is greater than 10 characters, it is considered a bad object.
• The unique ratio (that is, the number of unique words in the document versus the total number of words in the document) is greater than or equal to the value of the Restrictive Document Word Ratio setting. The default value of this setting is higher than that of its counterpart, the Document Word Ratio setting.
• The average word length is greater than or equal to the value of the Restrictive Maximum Average Word Length setting. The default value of this setting is higher than that of its counterpart, the Maximum Average Word Length setting.

If an object satisfies the heuristic's conditions, its content and content-derived metadata (for example, OTSummary) are excluded from the index, and an entry is added to the Index Engine's log file. The entry lists which document's content-related metadata was deleted from the indexing queue, which heuristic it failed, and the three parameters that are inputs to the bad object heuristic.

Example 2-1: Sample Bad Object Heuristic

Deleting object /raid1/davidb/src/Livelink_Control_Doc_116.txt
Heuristic 2 nUniqueWords 1138 nWords 1552 nLengthOfWords 10650:
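The heuristic's conditions can be restated in a short sketch (Python). The Document Word Ratio (0.10), Maximum Average Word Length (10), and Restrictive Document Word Ratio (0.5) defaults follow the text in this section and the Tip below; the Restrictive Maximum Average Word Length default shown here is an assumed placeholder, not the product value.

def is_bad_object(words, min_doc_size=16384, word_ratio=0.10,
                  max_avg_word_len=10, restrictive_word_ratio=0.5,
                  restrictive_avg_word_len=15):  # 15 is an assumed placeholder
    n_words = len(words)
    if n_words <= min_doc_size:
        return False  # documents at or below the size threshold are never pruned
    unique_ratio = len(set(words)) / n_words
    avg_len = sum(len(w) for w in words) / n_words
    return ((unique_ratio > word_ratio and avg_len > max_avg_word_len)
            or unique_ratio >= restrictive_word_ratio
            or avg_len >= restrictive_avg_word_len)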


Tip: You can disable the bad object heuristic by adjusting the values of the
associated settings (for example, change the value of the Restrictive Document
Word Ratio setting from 0.5 to 1.0). Turning off the heuristic means that all
objects will be indexed, except for objects with excluded MIME types or object
types that are not extracted.

2.4.2 Objects with Invalid Field Values


The behavior of the index engine and the update distributor indicates when an object from an iPool has invalid data. Typical input data errors include dates or integers with invalid formats, integers out of range, or poorly formed data, for example, invalid UTF-8 characters.

Previously, an exception would occur in the index engine and, as a result, both the index engine and the update distributor would stop. Now an object with an invalid field value can be indexed: a note about the invalid field is recorded in the Index Engine log, and the error count in the OTIndexError field is incremented.

The integer field OTIndexError holds a count of the errors that occurred for a given object. This field can be queried to determine how many objects were indexed with problems in the metadata.

Modification of the metadata does not reset this value, although re-indexing of the
object will. This functionality has been extended to accept seed values of
OTIndexError coming in through an iPool as well.

2.4.3 OTCheckSum Region


The OTCheckSum region is automatically externalized so that it becomes searchable.
The OTCheckSum region stores a checksum that is computed over the content of an
object. It can be used to see if the content of that object has changed. Objects that
have no content have a content checksum of 0. Objects that have been classified as
bad objects via the bad object heuristic also have a content checksum of 0. For more
information, see “Bad Object Heuristic” on page 41. An example of a search for all
objects that have no content or that are bad objects is:
Select “OTObject” where [region “OTCheckSum”]“0”

This feature allows searching for all documents whose content is empty or whose content was not indexed. Such objects have a content checksum value of 0.


2.4.4 OTContentStatus Region


This region provides valuable feedback for eDiscovery and index quality.
Assessment of quality can be inherited from a prior stage (such as DCS), or added by
the Search Engine. The worst score is kept. The codes are roughly ranked by severity,
allowing assessment by category. For example, a search for objects with codes of 300
or greater would identify only objects for which content is basically missing. The
codes and their meanings are as follows:
sContentIndexed_OK = 100;
sOnlyHeaderNoContentExpected_OK = 101;
sNoContentNoneExpected_OK = 102;
sUknownLegacy_OK = 103;

sContentProcessedAggregated_Pass = 200;
sContentProcessedUnknownMIMEType_Pass = 201;
sContentProcessedAlmostBadObject_Pass = 202;
sMostProcessedSomeNonUTF8_Pass = 203;
sMostProcessedSomeUnsupportedCodePage_Pass = 204;
sMostProcessedSomeUnsupportedLang_Pass = 205;

sContentIndexedTruncatedSize_Suspect = 300;
sContentIndexedSomeUnreadableSections_Suspect = 301;

sNotIndexedEncrypted_Bad = 400;
sNotIndexedUnsupportedMIMEType_Bad = 401;
sNotIndexedBadObjectHeuristic_Bad = 402;
sNotIndexedAccumSizeExceeded_Bad = 403;
sNotIndexedInvalidCorrupt_Bad = 404;
sNotIndexedUnknown_Bad = 405;

Codes have been reserved to indicate that an object’s content is OK (codes in the 100
group). However, these codes are not used in the current release of Content Server
to indicate that an object is OK. Rather, a non-empty code is assigned only to objects
that have undergone an event (the 200+ group). Therefore, an example where clause
to retrieve objects that have undergone an event is:
where [region “OTContentStatus”] >= “200”

And an example where clause to retrieve objects with no event is:
where [region “OTContentStatus”] != range “200~999”

2.5 Index Directory Structure


The exact components of each index directory in Content Server vary according to
the index's status; however, there are some common elements. For example, the top
level directory of a partition's index may contain the following:

• Subdirectories
• accumLog file
• checkpoint file
• livelink.# file
• livelink.ctl file


• metaLog file

Figure 2-7: Directory Structure of a Simple Index

2.5.1 Subdirectories
A top level index directory contains one subdirectory for each index fragment
associated with the index. An index fragment's subdirectory provides a word list
(with ancillary multi-level word lists), object level postings, offset level postings, and
a skip file. These files are separated into the following components:

• Core files, which contain the index structures for pure alpha words that do not
contain the numbers 0-9, an ampersand (&), a hyphen (-), or a period (.)
• Other files, which contain the index structures for non-alpha words that contain
the numbers 0-9, an ampersand (&), a hyphen (-), or a period (.)
• Region files, which contain the index structures for XML regions

In Figure 2-8, the coreidx1.idx file is the complete UTF-8 word list for the alpha
words. The *.idx{2,3,..} files contain partitions of the coreidx1.idx word list
used for word lookup. The coreobj.dat file contains the object-level postings. Object-
level postings are the IDs of the objects in which the word occurs. The coreoff.dat
file contains the offset-level postings. Offset-level postings are the offsets of
occurrences of the word within an object. The coreskip.idx file contains object-
level skipping information. The region files are indexed in a similar way and the
map file contains a version number, some pointers, and a list of checksums for every
file in the index fragment.


Figure 2-8: Structure of a Subdirectory

accumLog File
The accumLog file (content accumulator log file) contains the updates that have been
made to the in-memory content accumulator since the last index fragment was
created from an accumulator dump. The contents of the in-memory accumulator can
be recreated by replaying the updates contained in this file. For more information
about the in-memory content accumulator, see “Content Accumulator” on page 11.

checkpoint File

The checkpoint file is an on-disk snapshot of the metadata index. The metadata
index can be recreated by reading the checkpoint file, and then replaying the
updates contained in the metaLog file. For more information about the metaLog file,
see “metaLog File” on page 47.

When the search grid needs to create a checkpoint, all partitions simultaneously
write their data. In some customer scenarios, where there are many partitions, this
can overwhelm the file system capacity and cause thrashing, resulting in
unexpectedly slow checkpoint creation.

An optional configuration setting (not on by default) causes partitions to take turns writing their checkpoint files (sequential checkpointing). In typical systems, this results in slower checkpoint generation, but in some scenarios it can improve results, depending on the size and architecture of the system.

This feature is controlled via a setting in the dataflow section of the search.ini file:
ParallelCommit=true or false

If ParallelCommit=true, the update distributor will log: Added partition X to checkpointing.... Finished parallel checkpointing

If ParallelCommit=false, the update distributor will log: Added partition X to checkpointing... Finished sequential checkpointing


livelink.# File

The livelink.# file is a configuration file that contains information about the
current state of the index. The numeric extension for the configuration file increases
incrementally each time the information changes inside the file. The configuration
file may contain the following information:

• Version 1, which is the version number of the configuration file. If OpenText makes changes to the contents of the configuration file in future versions of Content Server, this number will increase. This helps to prevent compatibility issues during upgrades.
• SubIndex, which is a list of the index fragments that comprise the current index.
The index fragments are listed in ascending order according to the range of
internal object IDs that they contain. The ordering of the index fragments is
significant because it indicates when merges occurred. For example, suppose that
index fragments if1, if2, if3, if4, and if5 exist. If fragments if2, if3, and if4 are then
merged, the merge creates a new fragment if6. Now the order of the fragments
appears as if1, if6, if5.
• SubIndexName<x>, which is the name of the directory in which the index fragment resides
• SubIndexDir, which allows hard-coded alternate locations for a subindex to be specified in the future
• SubIndexSize, which is the size (in bytes) of the index fragment directory
• SubIndexCheckSum, which is the checksum of the map file in the index
fragment's directory. This is used to verify that the correct index fragment is in
the directory (that is, to prevent Content Server administrators from including
the wrong index fragment from a backup and restore).

Note: A checksum is a value used to ensure data are stored or transmitted without error. It is created by calculating a value from the binary values in a block of data using some algorithm and storing the results with the data. When the data are received, a new checksum may be computed and matched against the existing checksum. A non-match indicates an error. A match indicates, with high probability, that the data is the same.
• Index, which indicates that the index fragment list is now complete and the
upper level index is about to be described
• AccumLog<x>, which specifies that the current accumulator log file is named
accumLog.x. This log file contains all updates to the in-memory accumulator
since the last index fragment was created from an accumulator dump.
• AccumLogValidOffset<x>, which indicates that the current valid depth in the
file is <x>. The depth pointer indicates the depth at which the file is guaranteed
to have been synched.
• MetaLog<x>, which specifies that the current metadata log is named metaLog.x
and that the current checkpoint file is named checkpoint.x.


• MetaLogValidOffset<x>, which indicates that the current valid depth of the metadata log is <x>. The depth pointer indicates the depth at which the file is guaranteed to have been synched.
• CheckpointSize<x>, which is the size (in bytes) of the checkpoint file
• TimeStamp<x>, which is the time (in milliseconds) when this configuration file was created
• LastModifiedDate, which is the largest value of the last modified date associated
with an object in the index. This is used to re-extract data from a backup and
restore.

livelink.ctl File

The livelink.ctl file (control file) contains the name of the configuration file that
describes the current state of the index. Each time the index's state is modified, the
numeric extension is increased by one. In the example directory structure shown in
Figure 2-7, the livelink.ctl file contains the string livelink.13.
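Based on this description, locating the current configuration file is straightforward; the following sketch (Python; an assumption-based illustration, not a supported API) reads the control file and returns the path of the configuration file it names:

import os

def current_config_file(index_dir):
    with open(os.path.join(index_dir, "livelink.ctl")) as ctl:
        name = ctl.read().strip()          # for example, "livelink.13"
    return os.path.join(index_dir, name)   # path of the current livelink.# file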

metaLog File

The metaLog file (metadata log file) contains a chronological list of the updates that
have been made to the metadata index since the last checkpoint, where a snapshot of
the metadata index was captured in the checkpoint file. For information about
checkpoint files, see “checkpoint File” on page 45.

2.6 Managing Rotating Log Files


A mechanism for generating log files uses a series of rotating log files.
Administrators can now back up and delete log files that are not currently in use for
active logging, and an upper limit on the disk space consumed by log files can be
established.

A log file is deemed to be full when its size exceeds a defined limit. When the active log file is full, it is renamed to include a time/date stamp and given a file extension that indicates it can be archived, for example: my_partition_name_20100315199w949.log-2010.03.29-11.57.15.archive.log

The number of log files that should be retained can also be configured. Archive.log
files exceeding the maximum number will be deleted based upon age.

Special handling is implemented for the startup log files. The first log file often has valuable diagnostic information, so it is excluded from the deletion schedule for the archive log files. There is a separate count of the number of startup log files that should be retained. Startup logs are renamed, for example: my_partition_name_20100315199w949.log_startup-2010.03.29-11.57.15.archive.log

To enable rolling log files, for each process in the search.ini file, set:


CreationStatus=4
LogSizeLimitInMbytes=xxx
MaxLogFiles=yyy
MaxStartupLogFiles=zzz
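For example, a hypothetical configuration that caps each log file at 100 MB and retains 10 archive logs and 3 startup logs might look like the following (the numeric values are illustrative, not OpenText recommendations):

CreationStatus=4
LogSizeLimitInMbytes=100
MaxLogFiles=10
MaxStartupLogFiles=3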



Chapter 3
Understanding How Data Flow Processes and
Search Grid Processes Communicate

Components of the search grid communicate with each other through data flow processes, which pass information to one another using iPools.

3.1 Working With Data Interchange Pools (iPools)


As data passes through a Content Server data flow, it is deposited in data
interchange pools (iPools). iPools are temporary storage areas that connect the
processes in a data flow. They form a disk-based queue that allows large amounts of
data to flow from one data flow process to another. Each data element in the queue
is stored as a single file, called an iPool message.

Figure 3-1: The Flow of Data

Understanding the role that iPools play in data flows can help you to more
accurately administer the indexing processes at your Content Server site.

This section covers the following topics:

• “Monitoring iPool Activity” on page 50


• “Exploring the iPool Directory Structure” on page 50
• “Understanding How Data Flows Communicate With iPools” on page 53


3.1.1 Monitoring iPool Activity


You can monitor a data flow's activity in the Interchange Pools section on the
corresponding Data Flow Manager page in Content Server. The Interchange Pools
section illustrates a particular data flow's performance based on the status of its
iPool operations. It lists the status of each iPool in the data flow, the number of
messages that the iPool currently contains (Pending), and the total number of iPool
messages that have successfully passed through the data flow (Processed).

The information in the Interchange Pools section of a Data Flow Manager page
allows you to monitor iPool activity and debug data flow process errors. For
example, if a data flow process shuts down unexpectedly, the Interchange Pools
section allows you to determine the number of iPool messages that were successfully
processed before it shut down and the number of iPool messages that remain in the
queue.

By default, a single extracted iPool message can contain encoded versions of up to 500 objects and their corresponding metadata. Because most indexes are updated incrementally, the number of objects encoded in an iPool message varies according to the amount of data being indexed at one time. However, if you purge and reindex a data source, all of the corresponding data is reindexed at once, which means that each iPool message contains encoded versions of approximately 500 objects.

Exploring the iPool Directory Structure


Each data source that you create contains a data flow directory, where it stores its
iPool information. For example, the data flow directory for an Enterprise data flow
in a standard Content Server installation is Content Server_home/index/enterprise/
data_flow. iPools are represented as subdirectories in the data flow directory.

Note: The data flow directory supports Windows Universal Naming Conventions (UNC). You can also access a Windows data flow directory using a Samba-mounted device in a Unix environment.

The data flow directory (sometimes called the iPool base) contains a tree of
subdirectories for each iPool in the corresponding data flow. By default, the iPool
subdirectories are named using a sequence of numbers that indicate their position in
the data flow. The first iPool subdirectory name contains _0, the second iPool
subdirectory name contains _1, and so on. For example, the first iPool tree may be
named 2154_0.0, 2154_0.1, or 2154_0.2. The second iPool tree may be named
2154_1.0, 2154_1.1, or 2154_1.2. At their deepest branches, the iPool subdirectory
trees contain the iPool messages that pass through the data flow.


Figure 3-2: Windows Directory Structure of the Admin Online Help Index

The extension on an iPool subdirectory name is an integer value. If you add 2 to this
integer value, the resulting number indicates the level at which the iPool message
files reside in the directory tree. For example, an iPool subdirectory tree named
2154_0.2 indicates that the iPool message files reside four levels deep. The iPool
mechanism automatically appends the last digit to an iPool subdirectory name.

Figure 3-3: An iPool Subdirectory Tree
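This naming rule can be expressed as a one-line calculation (Python; illustrative only):

def message_depth(subdir_name):
    # "2154_0.2" -> extension 2 -> message files reside four levels deep
    extension = int(subdir_name.rsplit(".", 1)[1])
    return extension + 2

print(message_depth("2154_0.2"))  # -> 4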

The location of a particular iPool subdirectory is specified in the command line of the corresponding data flow process. Depending on the type of data flow process that you examine, the command line may contain the -writeiPool, -writearea, -readiPool, and -readarea arguments, which specify the location of the data flow process's read and write iPools. Data flow processes that write data to an iPool contain -writeiPool and -writearea arguments. Data flow processes that read data from an iPool contain -readiPool and -readarea arguments. Data flow processes that read and write data from iPools contain -writeiPool, -writearea, -readiPool, and -readarea arguments.


Although you can change the iPool subdirectory names by modifying the command
line for a data flow process, Content Server may not reference the new name
consistently and errors may occur. For this reason, OpenText does not recommend
changing iPool subdirectory names in this way.

Warning

The iPool system fails if any of the directories in the iPool tree are locked. If an
application accesses a directory in a Windows environment, the operating
system automatically locks the directory on the application's behalf. You must
ensure that no other application (for example, Windows Explorer or a DOS
Command Prompt) accesses the directories in the iPool tree when the data flow
runs.

Transaction Directories
All communication between data flow processes and iPools occurs inside
transactions. Transactions are units of interaction with an iPool. For example, data
flow processes execute operations such as reading and writing data inside
transactions. A temporary directory called a transaction directory is created for each
active transaction in a data flow.

Transaction directories reside in an index's data flow directory, at the same level as
the iPool subdirectory trees. Transaction directory names have the following
extensions that indicate the current state of the transaction.

• .trn, which indicates that a transaction is ongoing


• .cmt, which indicates that a transaction is being committed
• .rlb, which indicates that a transaction is being rolled back

For more information about iPool transactions and transaction directories, see “Understanding How Data Flows Communicate With iPools” on page 53.

Status Directories
An index's data flow directory also contains a directory named status.prv. The
status.prv directory records the number of iPool message files that have entered
the iPool as well as the number of iPool message files that have left the iPool. It
contains subdirectories that have the same name as the iPool subdirectories in the
data flow directory. Two files reside inside these subdirectories, one with an .in
extension, and one with an .out extension:

• .in – specifies the total number of iPool messages that were written to the
corresponding iPool. This number is always greater than or equal to the number
specified in the .out file, depending on the current state of the data flow.
• .out – specifies the total number of iPool messages that were read from the
corresponding iPool. This number is always less than or equal to the number
specified in the .in file, depending on the current state of the data flow.


The difference between the two numbers specified in the .in and .out files is the
total number of messages that remain in the corresponding iPool, as specified in the
Interchange Pools section on the data flow page.
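Based on this layout, the number of pending messages for an iPool can be derived from the counter files. The following sketch (Python) illustrates the arithmetic; it assumes that each counter file contains a plain integer, which is an assumption about the file format, not documented behavior:

import os

def pending_messages(status_prv_dir, ipool_name):
    # status.prv/<ipool_name>/ holds one .in and one .out counter file
    sub = os.path.join(status_prv_dir, ipool_name)
    def read_counter(extension):
        for name in os.listdir(sub):
            if name.endswith(extension):
                with open(os.path.join(sub, name)) as f:
                    return int(f.read().strip())
        return 0
    return read_counter(".in") - read_counter(".out")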

3.1.2 Understanding How Data Flows Communicate With iPools
Before it can communicate with an iPool, a data flow process must make a
connection to the iPool. It does this by calling a function in the iPool library. When a
connection is made, the following events occur to ensure proper data recovery:

• The iPool library code locates each unlocked transaction log file (.log), and then locks it, preventing other new connections from interfering with this connection. Transaction log files are not typical log files—they exist specifically as a means to locate incomplete transactions. A transaction log file is unlocked if it is associated with an orphaned or outstanding transaction.
• If the iPool library code locates an unlocked transaction log file that is associated
with a transaction that has not finished rolling back, it completes the roll-back
operation. For more information about roll-back operations, see “Rolling Back
Transactions” on page 61.
• If the iPool library code locates an unlocked transaction log file that is associated
with a transaction that has not finished its commit operation, it completes the
commit operation. For more information about commit operations, see
“Committing Transactions” on page 56 .
• If the iPool library code locates an unlocked transaction log file that is associated
with a transaction that was active when the corresponding data flow process
shut down (.trn extension), it marks the transaction with a roll-back extension
(.rlb), and then completes the roll-back operation.

Note: If a connection fails to complete for some reason, the next connection
operation automatically completes all outstanding tasks.

Once a connection is made, all communication between data flow processes and
iPools occurs inside transactions. Transactions are units of interaction with an iPool
that provide recovery capabilities to a series of read and write operations. The iPool
library processes a single iPool transaction in a coherent and reliable manner,
independent of all other transactions.

The recovery capabilities provided by the transaction mechanism are subject only to
the integrity of the underlying operating system calls. This means that the following
kernel calls must be atomic (that is, they must complete successfully or not at all):

• Creating files or directories (in locked or unlocked mode)


• Locking and unlocking files
• Opening existing files
• Renaming a single file or directory


A temporary directory called a transaction directory is created for each active transaction in a data flow. Before you can start a transaction, you must connect to the appropriate iPool.

Starting Transactions
When a data flow process starts a transaction, a temporary transaction directory is
created in the data flow directory. The transaction directory has an arbitrary and
unique name with the .trn extension (for example, data_flow/transaction.trn).

A log file of the same name is also created in the data flow directory (for example,
data_flow/transaction.log). The log file acts as a locking mechanism for the
transaction, ensuring that no two processes access the same transaction at the same
time. It is not used to record the activities that occur in the iPool. The log file is
locked when the transaction starts.

If the process that started the transaction shuts down unexpectedly, the next
connection operation automatically completes all outstanding tasks.

Writing Transactions
In most data flows, a producer process extracts or assembles data, and then writes
the data to an iPool, making it available to the next process in the data flow. The next
data flow process then reads the data from the iPool, processes it, and writes the
new data to another iPool. This processing chain continues until the data flow
terminates, usually resulting in an index of the data that the producer process
originally extracted or assembled.

The data that a process writes to an iPool is encoded in iPool messages, which enter
the iPool and are processed in a queue. To add iPool messages to an iPool, a data
flow process performs a series of write operations within an active transaction.

Figure 3-4: Writing Transactions


All of the write operations that occur within a single transaction are collected inside
of a write directory that resides in the corresponding transaction.trn directory (for
example, data_flow/transaction.trn/write). Each write directory contains a
subdirectory that has the same name as the iPool subdirectory to which the data is
being written. The data flow process opens and writes data inside this subdirectory
(for example, data_flow/transaction.trn/write/2154_1). Because a transaction
can contain write operations for many different iPool subdirectories, its write
directory may contain multiple subdirectories, each representing the corresponding
iPool subdirectory.

Although a single transaction can contain multiple write operations, each of which
may write data to a different iPool subdirectory, a single write operation cannot
write data to multiple iPool subdirectories.

The changes that a write operation makes are not finalized or implemented until the
transaction is committed. For more information, see “Committing Transactions”
on page 56 .

Reading Transactions
After a data flow process writes data to an iPool, the next data flow process in the
chain reads the data from that iPool so that it can process it. This data is stored as
iPool message files in the iPool subdirectory. Data flow processes read these iPool
message files by performing a series of read operations within an active transaction.

All of the read operations that occur within a single transaction are collected inside
of a read directory that resides in the corresponding transaction.trn directory (for
example, data_flow/transaction.trn/read). Each read directory contains a
subdirectory that has the same name as the iPool subdirectory from which the data
is being read. The data flow process opens and reads data inside this subdirectory
(for example, data_flow/transaction.trn/read/2154_0). Because transactions
can contain read operations for many different iPool subdirectories, their read
directories may contain multiple subdirectories, each representing the
corresponding iPool subdirectory.

A read operation moves iPool messages from their location in the iPool subdirectory
to the appropriate location in the read directory. This ensures that the iPool
messages are not available to other processes that may be reading from the same
iPool subdirectory. Data flow processes read iPool message files one at a time. Each
time that the data flow process returns to the iPool subdirectory to read another file,
it deletes any empty directories that exist in the iPool as a result of the previous read
operation.

Although a single transaction can contain multiple read operations, each of which
may read data from a different iPool subdirectory, a single read operation cannot
read data from multiple iPool subdirectories.

Read operations are not finalized or implemented until the transaction is committed.
Files that a data flow process reads are not available to any other processes before a
commit operation occurs. For more information, see “Committing Transactions”
on page 56.


Figure 3-5: Reading Transactions

iPool messages are read recursively, starting at the root of the iPool subdirectory.
The state of the iPool determines which file is next in the queue for read operations.
For example, if two files have already been read, the next read operation will
automatically read the third file.

Note: iPools contain multiple subdirectories in order to keep the number of files in each subdirectory to a manageable limit.

Committing Transactions
A data flow process commits a transaction to finalize the write or read operations
that it has performed within the transaction. A commit operation closes the active
transaction.

When a transaction is committed, the corresponding transaction directory is renamed from transaction.trn to transaction.cmt. When this occurs, the write operations specified within the transaction are officially finalized and, after an undetermined amount of time, the corresponding iPool message files are made available in the iPool subdirectory. Renaming the transaction directory in this way prevents a loss of data if the data flow process shuts down unexpectedly while committing a transaction.


Figure 3-6: Committing Transactions

During a commit operation for a write transaction, the subdirectory of the write directory is renamed. For example, the data_flow/transaction.cmt/write/2154_0 directory is renamed to data_flow/2154_0.0/0000000004. This moves all of the iPool items that were written in the data_flow/transaction.cmt/write/2154_0 directory to the new directory. After the data_flow/transaction.cmt/write/2154_0 directory is renamed, its contents are available to the read operations that are issued within other transactions.

During a commit operation of a read transaction, all of the iPool items in the data_flow/transaction.cmt/read/2154_0 directory are deleted. These iPool items are no longer required because they have already been used for the read operation that was just finalized.

Both the rename operation and the delete operation are atomic. Renaming the data_flow/transaction.cmt/write/2154_0 directory in this way ensures that all of the iPool items in the directory are available at the same time or that none are available at all. Because a single read operation cannot read data from multiple iPool subdirectories, separate rename calls are issued for each iPool subdirectory.

If a data flow process shuts down unexpectedly while an active transaction is in a .cmt state, the commit operation automatically completes the next time that a data flow process connects to the iPool.

When a write transaction is committed and an iPool message is written to an iPool subdirectory, the message is added to the end of a queue. All iPool messages are processed in the order that they enter the queue (that is, the first one in the queue is always the first one out of the iPool, and so on).

Figure 3-7: Exceeding the Maximum Width

The ten-digit directories that reside inside the iPool subdirectory are created when
transactions that write files to this subdirectory are committed. The files that they
contain are the iPool message files that were written to the iPool subdirectory when
the transaction was committed. By default, an iPool subdirectory can contain a
maximum of 50 ten-digit directories that are named incrementally from 0000000000
to 0000000049. Each of these ten-digit directories can contain an unlimited number
of iPool message files.

If the iPool subdirectory contains 50 ten-digit directories, and an additional transaction that writes files to the iPool subdirectory is committed, the iPool library performs the following operations (sketched in the example after this list):

• Creates a new iPool subdirectory, whose extension increments by 1 (for example, 2154_1.1)
• Renames the existing iPool subdirectory (2154_1.0) to 0000000000 and moves it to the new iPool subdirectory
• Commits the new transaction
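
A minimal Python sketch of that roll-over, assuming the 50-directory limit described above; the function and its arguments are hypothetical:

import os

MAX_WIDTH = 50   # maximum number of ten-digit directories per iPool subdirectory

def roll_over(data_flow, base="2154_1", ext=0):
    """Create the next iPool subdirectory once the current one is full."""
    old = os.path.join(data_flow, f"{base}.{ext}")
    if len(os.listdir(old)) < MAX_WIDTH:
        return old                                 # still room; no roll-over needed
    new = os.path.join(data_flow, f"{base}.{ext + 1}")   # for example, 2154_1.1
    os.mkdir(new)
    # The full subdirectory becomes the first queue entry of the new one.
    os.rename(old, os.path.join(new, "0000000000"))
    return new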

Figure 3-8: Committing Transactions

Note: If a process tries to read data from the iPool subdirectory in the middle of this procedure, it always looks for the iPool subdirectory that has the highest numeric value (in this case 2154_1.1, not 2154_1.0). If that subdirectory is empty (if the original iPool subdirectory has not been renamed yet), the process can retry later.

Rolling Back Transactions


A roll-back operation is the opposite of a commit operation. When a transaction is
rolled back, the corresponding transaction directory is renamed from
transaction.trn to transaction.rlb. When this occurs, the read or write
operations specified within the transaction are officially rolled back. For a read
transaction, this means that the corresponding iPool message files are returned, in
their original state, to the appropriate location in the iPool subdirectory. For a write
transaction, this means that the corresponding iPool message files are not written to
the iPool subdirectory.

During a roll back of a read transaction, the subdirectory of the read directory is renamed. For example, the data_flow/transaction.rlb/read/2154_0 directory is renamed to data_flow/2154_0.0/XXXXXXXXXX (where XXXXXXXXXX is a ten-digit directory in the iPool subdirectory). This returns all of the iPool items in the data_flow/transaction.rlb/read/2154_0 directory to the iPool subdirectory from which they were originally read. The subdirectory of the read directory is renamed and returned to the front of the transaction queue in the iPool subdirectory.

During a roll back of a write operation, all of the iPool items in the data_flow/transaction.rlb/write/2154_0 directory are deleted (if any exist). These iPool items are no longer required because they may contain modifications that should not be retained.

Figure 3-9: Rolling Back Transactions

Using the scenario illustrated by Figure 3-9, suppose that a data flow process reads
two iPool message files from the 0000000008 directory in the iPool subdirectory but,
for some reason, cannot read the third file. Because the read operation cannot
complete successfully, the entire operation rolls back. This means that the
subdirectory of the read directory is renamed to 0000000007, so that it can be placed
at the front of the queue. The next time that the data flow process tries to read files
from the iPool, it reads the files in the 0000000007 directory first, because it is now at
the front of the queue.

Note: Figure 3-6 should not be used to illustrate subdirectory scenarios.

Now suppose that the same problem occurs when reading from the 0000000000
directory in the iPool. The data flow process reads the first two iPool message files
successfully, but cannot read the third file. Again, the subdirectory of the read
directory must be renamed and returned to the front of the queue. But because the
0000000000 directory already exists in this case, the iPool library assigns a very large
number to the directory. In a sense, the numbering system wraps around (think of a
clock, where 12 is earlier than 1), allowing roll back operations to continue
successfully.
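
The front-of-queue naming rule can be sketched as follows; the wrap-around constant is an assumption for illustration, since the exact "very large number" the iPool library assigns is not documented here.

WRAP_AROUND = 9_999_999_999   # assumed wrap-around value

def front_of_queue_name(current_front):
    """Return a directory name that sorts ahead of the current queue front."""
    n = int(current_front)
    # Take the name just before the current front; if the front is already
    # 0000000000, wrap around to a very large number instead (like a clock).
    n = n - 1 if n > 0 else WRAP_AROUND
    return "%010d" % n

print(front_of_queue_name("0000000008"))   # 0000000007
print(front_of_queue_name("0000000000"))   # 9999999999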

Note: Gaps cannot exist between directory names in the iPool subdirectory.
For example, the 0000000000 directory cannot be followed by the 0000000003
directory. The only permissible gap is the one that occurs between the
0000000000 directory and the directory name that is given when the numbering
system wraps around.

If a data flow process shuts down unexpectedly while an active transaction is in a .trn or .rlb state, the corresponding transaction automatically rolls back the next time that a data flow process connects to the iPool.

Example: Intermediate Data Flow Process


Figure 3-10 illustrates the operations that occur as an intermediate data flow process
reads files from one iPool subdirectory, and then writes the files to a different iPool
subdirectory within a single iPool transaction.

Figure 3-10: Intermediate Data Flow Process

This intermediate data flow process does the following:

• Connects to the data flow directory associated with the 2504_0.0 iPool.
• Starts an iPool transaction, which creates a randomly named transaction directory in the corresponding data flow directory (XYNDHFGY.trn).
• Locates the start of the 2504_0.0 iPool queue. In Figure 3-10, the iPool queue starts at the data_flow/2504_0.0/0000000005/0000000000 file.
• Moves this iPool message to the data_flow/XYNDHFGY.trn/read/2504_0 directory. This prevents other data flow processes from accessing the same iPool message while the intermediate process is using it.
• Reads data from the iPool message and processes it. Then it creates a new iPool message named 0000000000 in the data_flow/XYNDHFGY.trn/write/2504_1 directory. The new iPool message resides in the transaction directory until a commit operation occurs.

Note: If the intermediate process or the entire system shuts down unexpectedly during this process, an implicit roll back takes place the next time that a process connects to the 2504_0.0 iPool. During the roll back operation, all of the iPool messages in the data_flow/XYNDHFGY.trn/read directory are returned to the front of the iPool queue and all of the iPool messages in the data_flow/XYNDHFGY.trn/write directory are deleted.

Before it closes its iPool connection, the process either commits the transaction or rolls back the transaction. To finalize its changes, the data flow process commits the transaction. This renames the data_flow/XYNDHFGY.trn directory to XYNDHFGY.cmt. Because the rename operation is atomic, it guarantees that the commit operation completes successfully before the intermediate process can start a new transaction. During the commit operation, the data_flow/XYNDHFGY.trn/write/2504_1 directory moves to the back of the 2504_1 queue (in Figure 3-10 this is the data_flow/2504_1.0/0000000002 position in the queue). If the intermediate process or the entire system shuts down unexpectedly at this point, the commit operation completes the next time that a data flow process connects to the iPool.

To cancel its operations, the data flow process rolls back the transaction. This
renames the XYNDHFGY.trn directory to XYNDHFGY.rlb. Because the rename
operation is atomic, it guarantees that the roll back operation completes successfully
before the intermediate process can start a new transaction. If the intermediate
process or the entire system shuts down unexpectedly at this point, the roll back
operation completes the next time that a data flow process connects to the iPool.
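
Putting these steps together, a highly simplified Python sketch of the whole read-process-write transaction might look like this. Directory names are taken from Figure 3-10; the transaction name, queue positions, and the process() helper are hypothetical, and error handling and queue discovery are omitted.

import os, shutil

def process(data):
    # Hypothetical transformation applied by the intermediate process.
    return data

def run_intermediate(data_flow):
    txn = os.path.join(data_flow, "XYNDHFGY.trn")   # randomly named in practice
    read_dir = os.path.join(txn, "read", "2504_0")
    write_dir = os.path.join(txn, "write", "2504_1")
    os.makedirs(read_dir)
    os.makedirs(write_dir)
    try:
        # Read: move the message at the front of the 2504_0.0 queue
        # into the transaction so no other process can consume it.
        src = os.path.join(data_flow, "2504_0.0", "0000000005", "0000000000")
        msg = shutil.move(src, os.path.join(read_dir, "0000000000"))
        # Process: stage the transformed message for the 2504_1 data flow.
        with open(msg) as fin, \
             open(os.path.join(write_dir, "0000000000"), "w") as fout:
            fout.write(process(fin.read()))
        os.rename(txn, txn[:-4] + ".cmt")   # atomic commit
    except OSError:
        os.rename(txn, txn[:-4] + ".rlb")   # atomic roll back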

Chapter 4
Administering Searching

You search for indexed information by constructing Queries and submitting them to
a target index. You construct Queries using a query language called the Live Query
Language (LQL); however, Content Server then converts the LQL Queries to
OTSTARTS and OTSQL query languages for internal processing.

When a Query is submitted to Content Server, it goes to the Search Manager associated with the index that is to be searched (if not specified in the Query, this is the Enterprise index). The Search Manager passes the Query to an available Search Federator, which distributes it to the Search Engines that it manages. The Search
Engines do the real work, searching each Partition's index and producing search
results which they then send back to the Search Federator. Once the Search
Federator has received results from all the Search Engines, it merges the results and
sends the final result set to the Search Manager. The Search Manager then sends the
final result set to Content Server for display on the Search Results page.

Along with the appropriate search result items themselves, Content Server displays
a score for each result. The score indicates the relevance of a particular search result
item in relation to all other items in the result set.

This chapter covers the following topics:

• “Understanding Query Languages” on page 67
• “Understanding Relevance Ranking” on page 88
• “Working with Regions” on page 95

4.1 Understanding Query Languages


Content Server accepts Queries constructed in the Live Query Language (LQL),
converts the Queries internally from LQL to the OTSTARTS query language, and then
wraps the OTSTARTS statements in OpenText Structured Query Language (OTSQL).
It then passes the queries to the appropriate Search Federators and Search Engines, which evaluate and respond to the corresponding search request.

OTSTARTS is OpenText's implementation of the STARTS syntax, which is a simple protocol that Search Engines and Search Federators can use to search and index collections of text documents.

OTSQL is OpenText's implementation of the Structured Query Language (SQL) syntax for full-text searching, which is commonly used in relational database management systems. OTSQL is a limited version of SQL that specifies query and retrieval components in single statements. To do this, OTSQL depends on certain expressions in the OTSTARTS query language.

LLESSRC210400-GGD-EN-01 Advanced Indexing and Searching Guide 67


Chapter 4 Administering Searching

When Content Server converts an LQL query to the OTSTARTS query language, it wraps the OTSTARTS expressions with the appropriate OTSQL expressions. The OTSTARTS expressions define the query criteria and the OTSQL expressions define additional processing information about the query.
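
For illustration, a user query such as cats AND dogs might be wrapped roughly as follows. This is a schematic example; the regions selected and the exact statement that Content Server generates will differ by deployment.

SELECT "OTName", "OTScore"
WHERE "cats" and "dogs"
ORDEREDBY RELEVANCY

Here the WHERE clause carries the OTSTARTS query criteria, and the surrounding OTSQL clauses carry the processing information.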

This section covers the following topics:

• “Live Query Language (LQL)” on page 68

• “OTSQL Query Language” on page 78

• “OTSTARTS Query Language” on page 80

4.1.1 Live Query Language (LQL)


The Live Query Language consists of keywords and conventions. You use LQL
keywords when you perform searches with complex Queries. LQL conventions help
define the syntax of a complex Query. For more information about using complex
Queries, see “Searching with Complex Queries” in OpenText Content Server - Search
(LLESWBB-H-UGD).

The Live Query Language (LQL) consists of the keywords listed in the following
table. For the purposes of this table:

• a term is a single word or phrase, and an expression is two or more terms connected with ANDs, ORs, or both.

• in lexicographic ordering, 100 is less than 20, because the terms are ordered as
Unicode strings.

• in numeric ordering, 100 is greater than 20, because the terms are converted to
numbers before comparing.

Table 4-1: LQL Keywords

AND
The AND keyword finds Content Server items that contain both specified query expressions.
Syntax: <expression> AND <expression>
For example, type fish AND bird to find Content Server items that contain both fish and bird.

AND-NOT
The AND-NOT keyword finds Content Server items that contain the expression to the left of the keyword, and do not contain the expression to the right of the keyword.
Syntax: <expression> AND-NOT <expression>
For example, type fish AND-NOT bird to find Content Server items that contain fish but do not contain bird.

OR, |
The OR keyword has the same function as the | keyword. It finds Content Server items that contain the expression to the left of the keyword, or those that contain the expression to the right of the keyword, or those that contain both expressions.
Syntax: <expression> OR <expression>
For example, type fish OR bird to find Content Server items that contain fish, or Content Server items that contain bird, or Content Server items that contain both fish and bird.

SOR
The SOR keyword has the same function as the OR keyword but its relevancy ranking algorithm considers the expressions to the left and right of the SOR keyword as synonyms and treats them as a single term. This prevents synonyms from carrying more weight in the calculation of the score for a search result item. The ranked scores and order of the results produced by queries that use the SOR keyword may differ from the ranked scores and order of the results produced by queries that use the OR keyword.
Syntax: <expression> SOR <expression>
For example, type "U.K." SOR "United Kingdom" to find Content Server items that contain at least one instance of U.K., or United Kingdom, or both. Content Server may list the results of this query in a different order than the results of "U.K." OR "United Kingdom".

XOR
The XOR keyword finds Content Server items that contain one of the two specified query expressions, but not both. This keyword is also known as the <exclusive OR> keyword.
Syntax: <expression> XOR <expression>
For example, type fish XOR bird to find Content Server items that contain fish or bird, but not both.

=
The = keyword finds Content Server items that contain the specified phrase, as follows:
• For metadata text regions, the search term or phrase must match the entire region, or an entire component value of a multivalued region, to be considered a match. For example, [qlregion "CityName"] ="York" would match “York” in the CityName region, but would not match “New York”.
• For content searches, the search term or phrase may appear anywhere in the object content. For example, ="York" would match an object with City of York in its content.
• Searches are not case-sensitive. For example, ="York" would match “YORK”.
• Searches are not sensitive to extraneous whitespace or punctuation (as long as they do not break the individual search terms). For example, [qlregion "CityName"] ="York" would match a CityName of “: York :”.
• Searches of multivalued metadata fields only require one component of the multivalue to match. For example, if the input XML of an object included <CityName>London</CityName><CityName>New York</CityName>, then a search for [qlregion "CityName"] ="New York" will match the object because “New York” matches an entire component of the multivalued CityName region, and a search for [qlregion "CityName"] ="York" would not match.
Syntax: =<term>

NOT, !, !=
The NOT keyword has the same function as the ! keyword and the != keyword. It finds Content Server items that do not contain the specified phrase.
Syntax: NOT <term>
For example, type NOT bird to find Content Server items that do not contain bird.

<
The < keyword finds Content Server items that match all values which exist, and are less than the specified term. If a phrase is provided, only the first term in the phrase is used. If the specified term is an integer or real number, Content Server uses numeric ordering. Otherwise, it uses lexicographic ordering.
Syntax: < <term>
For example, type < 20200630 to find Content Server items that contain at least one number that is less than 20200630. If you restrict this Query to a date attribute, it finds Content Server items dated before June 30, 2020.

<=
The <= keyword finds Content Server items that match all values which exist, and are less than or equal to the specified term. If a phrase is provided, only the first term in the phrase is used. If the specified term is an integer or real number, Content Server uses numeric ordering. Otherwise, it uses lexicographic ordering.
Syntax: <= <term>
For example, type <= 20200630 to find Content Server items that contain at least one number that is less than or equal to 20200630. If you restrict this Query to a date attribute, it finds Content Server items dated June 30, 2020, or earlier.

>
The > keyword finds Content Server items that match all values which exist, and are greater than the specified term. If a phrase is provided, only the first term in the phrase is used. If the specified term is an integer or real number, Content Server uses numeric ordering. Otherwise, it uses lexicographic ordering.
Syntax: > <term>
For example, type > 20200630 to find Content Server items that contain at least one number that is greater than 20200630. If you restrict this query to a date attribute, it finds Content Server items dated after June 30, 2020.

>=
The >= keyword finds Content Server items that match all values which exist, and are greater than or equal to the specified term. If a phrase is provided, only the first term in the phrase is used. If the specified term is an integer or real number, Content Server uses numeric ordering. Otherwise, it uses lexicographic ordering.
Syntax: >= <term>
Type >= 20200630 to find Content Server items that contain at least one number that is greater than or equal to 20200630. If you restrict this Query to a date attribute, it finds Content Server items dated June 30, 2020, or later.

PROX
The PROX[N,T/F] keyword finds Content Server items that contain all specified query phrases that are within a designated distance (N) of one another, and where the ordering is either ordered (T) or unordered (F). Distance (N) is calculated as the number of tokens between the last token in the first phrase and the first token in the second phrase. The specification of the second parameter (T/F) is optional and defaults to F.
Syntax: <expression> prox[N,T/F] <expression>
For example, type cats prox[10] dogs to find all documents in which the phrase dogs precedes or follows the phrase cats by no more than ten words. Typing cats prox[10,T] dogs will find all documents in which the phrase dogs follows the phrase cats by no more than ten words.

QLATTRIBUTE
The QLATTRIBUTE keyword finds Content Server items with specific language values of OTName and OTDComment. All OTName and OTDComment field values in OpenText receive a lang="x" attribute, where x is the metadata language code. QLATTRIBUTE is used in conjunction with the QLREGION keyword to narrow searches to region values with a particular attribute. lang is the only region attribute that is currently supported. The two operators used with QLATTRIBUTE are = and is null.
Syntax: QLATTRIBUTE <expression>
For example, type QLATTRIBUTE lang = fr, voiture to find Content Server items that contain the word voiture.

QLNEAR
The QLNEAR keyword has the same function as the PROX keyword but it is set to a default of prox[10,F].
For example, type cats QLNEAR dogs to find all documents in which the phrase dogs precedes or follows the phrase cats by no more than ten words.

QLPHONETIC
The QLPHONETIC keyword finds Content Server items that contain words that sound like the specified query expression. This keyword is based on a popular algorithm called Soundex. The Soundex algorithm takes an English word and produces a four-digit representation of the word, designed to match the phonetic pronunciation of the original word. Soundex is often used for vague searches where a close match is required.
Syntax: QLPHONETIC <term>
For example, type QLPHONETIC sail to find Content Server items that contain words like sail or sale.

QLREGION
The QLREGION keyword finds Content Server items that contain search terms within a specific index region. Each region stores a different kind of information about indexed items; for example, the content and metadata of documents. For a list of regions, contact your Content Server administrator.
Syntax: [QLREGION region] <term>
For example, type [QLREGION OTCreatedByName] jdoe to find Content Server items that contain the word jdoe in the OTCreatedByName region.

QLSTEM
The QLSTEM keyword finds Content Server items that are likely noun plural or singular forms of the specified query expression.
Syntax: QLSTEM <term>
For example, type QLSTEM tax to find Content Server items that contain either the word tax or taxes.

QLTHESAURUS
The QLTHESAURUS keyword finds Content Server items that contain thesaurus entries that are associated with the specified query expression.
Syntax: QLTHESAURUS <term>
For example, type QLTHESAURUS stone to find Content Server items that contain at least one of the thesaurus entries associated with the word stone, such as stone, pit, seed, or kernel.

QLRIGHT-TRUNCATION
The QLRIGHT-TRUNCATION keyword finds Content Server items that contain words whose prefixes match the specified query expression. This keyword does not work with phrases.
Syntax: QLRIGHT-TRUNCATION <term>
For example, type QLRIGHT-TRUNCATION post to find Content Server items that contain at least one word that begins with post, such as postpone, postman, or posture.

QLLEFT-TRUNCATION
The QLLEFT-TRUNCATION keyword finds Content Server items that contain words whose suffixes match the specified query expression. This keyword does not work with phrases.
Syntax: QLLEFT-TRUNCATION <term>
For example, type QLLEFT-TRUNCATION thing to find Content Server items that contain at least one word that ends with thing, such as thing, anything, or soothing.

QLRANGE
The QLRANGE keyword finds Content Server items that contain words in the lexicographic or numeric range between two specified query expressions. The terms that you use to specify the range are also included in the range. If the specified terms are integers or real numbers, Content Server uses numeric ordering. Otherwise, it uses lexicographic ordering. You must enclose the range in double quotation marks (" ") and separate the terms that you use to specify the range with a tilde (~).
Syntax: QLRANGE "<term> ~ <term>"
For example, type QLRANGE "giraffe ~ gnu" to find Content Server items that contain at least one word in the lexicographic range from giraffe to gnu, such as giraffe, giraffes, give, glow, gnome, or gnu. Or, type QLRANGE "20100801 ~ 20100901" to find items that contain at least one number between 20100801 and 20100901 (for example, dates between August 1, 2010 and September 1, 2010, inclusive).

QLREGEX
The QLREGEX keyword finds Content Server items that contain a regular expression expansion of the specified query expression. Query expressions used with the QLREGEX keyword must be enclosed in double quotation marks (" "). This keyword does not work with phrases. For information about using regular expressions, see “Working with Regular Expressions Using LQL” in the OpenText Content Server - Search (LLESWBB-H-UGD).
Syntax: QLREGEX "<term>"
For example, type QLREGEX "^the" to find Content Server items that contain at least one of the regular expression expansions of ^the, such as the, their, them, then, there, therefore, thesaurus, these, or they.

LQL Conventions
The following LQL conventions help define the syntax of a complex Query.

Using Phrases

A phrase is a fixed sequence of one or more words that is enclosed in double quotation marks (" "). For example, "good" and "for better or for worse" are phrases. You can, however, omit the double quotation marks around one-word phrases (for example, rice).

When you construct phrases, the following rules apply:

• A space character (for example, a tab or space) denotes the end of a word.
• Content Server interprets strings that include punctuation but no spaces as one
phrase (for example, the phrase “www.opentext.com”).
• You cannot omit the double quotation marks around reserved keywords (for
example, qlnear) when you want to treat the keywords as search criteria. For
example, to search the Content Server Online Help system for information about
the qlnear keyword, you must specify "qlnear" as your Query and choose the
appropriate Help slice.

Using Wildcards

You can use an asterisk (*) as a placeholder for characters in Queries. For example,
to use an asterisk to represent the end of a word that has many different endings,
type account* to find Content Server items that contain the words account,
accounts, accountant, accounting, and so on.

Using Escape Characters

You can use escape characters to search for certain special characters in your query syntax. The single backslash (\) is treated as the beginning of an escape sequence for certain special cases.

Common and important special characters you can search for include:

• quotation mark " use \"


• apostrophe ' use \'
• single backslash \ use \\
• double backslash \\ use \\\\

Other non-special characters are not affected by the backslash, so a query like:
\wheat is still mapped to \wheat.

A single backslash at the end of a query term is invalid syntax and will return an invalid syntax error message from the Search Engine. For example:

• invalid syntax: select "OTName" where [region "OTName"]"C:\"
• proper syntax: select "OTName" where [region "OTName"]"C:\\"

Understanding the Order of Processing

By default, Content Server processes Query terms from left to right. Use parentheses
() to override the default order, and have Content Server evaluate the expressions
within parentheses first. For example, type (grow OR farm) AND (wheat OR grain)
to find Content Server items that contain at least one instance of the words grow or
farm and at least one instance of the words wheat or grain.

Working with Regular Expressions in LQL


You can use regular expressions in Live Query Language if you enclose them in
quotation marks (" ") and precede them with the QLREGEX keyword.

A regular expression is a set of strings that match certain patterns. To specify these patterns, the special characters $, ^, ., *, +, ?, [, ], (, ), and \ function as operators on all remaining ordinary characters. An ordinary character is a simple regular expression that matches a character and nothing else (for example, h matches h and nothing else). To represent a special character as an ordinary character, precede it by a backslash (\). For example, \$ matches $.

The following guidelines describe how to use regular expressions most effectively:

• When used together with the QLREGEX keyword in a Content Server search, the
specified regular expression is found anywhere within a word in a Content
Server item. For example, QLREGEX "cat" finds items that contain at least one
word with cat in it, such as cat, category, or concatenate.
• Constraining and narrowing searches saves time, especially in prefix searches.
For example, to find 2014 part numbers that begin with d, a search using QLREGEX
"^d" takes much longer than a search using QLREGEX "^d2014".

• Regular expression searches on prefixes are the most efficient. For any other
regular expression search, every word in the index must be examined to see if it
matches.
• For every word that matches the regular expression, a query is performed, so it is
best to constrain a regular expression search as much as possible.

You can use the following list of operators within regular expressions to describe
sets of strings.

Table 4-2: Using Operators to Describe Sets of Words

Operator Description
. Matches any single character.

For example, a.2 matches any word containing a three-character string that
begins with a and ends in 2 (such as, ab2z, aa2, or aaa2zzz).

76 OpenText Content Server LLESSRC210400-GGD-EN-01


4.1. Understanding Query Languages

Operator Description
[] Encloses a character set or range. The following rules apply:
• You can intermix ranges and single characters.
• In a character set, ], -, and ^ have special meaning; all other characters
represent themselves only.
• The minus sign (-) is a range operator between two characters.
• The caret (^) can be used only in the first position in a character set.

For example,
[abc] matches a, b, or c.
[a-z] matches any lower case character.
[-$a0-9] matches -, $, a, or any single digit.
[^] Begins a character set that complements the specified character set (it matches
any character except those that are specified). If - or ] follows [^, - or ] is
treated as the first character.

For example,
[^a-z] matches any character, except the letters of the alphabet.
[^]^a-z0-9] matches any character, except ], ^, or alphanumeric characters.
^ Matches the beginning of a word.

For example,
^sp matches any instance of sp at the beginning of a word only (such as special,
but not especially).
* Matches the smallest preceding regular expression zero or more times.
However, the * operator following a regular expression that has ^ as the
beginning of a word is interpreted as the expression with any ending (like a
wildcard (*) in queries).

For example,
ad* matches a, ad, add, and so on.
+ Matches the smallest preceding regular expression when the preceding regular
expression occurs at least once.

For example,
tr[ei]+ matches tre, tri, tree, trie, triie, and so on. It does not match tr.
? Matches the smallest preceding regular expression when the preceding regular
expression occurs zero or one time.

For example,
se[ea]? matches only se, sea, and see.
$ Matches the end of a word.

For example,
the$ matches the characters the when they appear at the end of a word.

| Separates two alternatives. If x and y are regular expressions, then x|y
matches anything that either x or y matches.

For example,
sea|lake matches words containing sea and lake only.
[abc] can also be written as a|b|c
() Groups items (such as alternatives or complex regular expressions) so that you
can combine them with other regular expressions and operators.

For example,
(ro)?(co)+ matches any non-zero number of co strings that is preceded by
nothing or ro (such as, co, coco, rococo, and so on).

4.1.2 OTSQL Query Language


OTSQL queries may contain the following clauses:

• SELECT "region1", "region2", "region3"


• [FROM "slice"]
• WHERE "search_expression"
• [ORDEREDBY "ordering_criteria"]

Note: The OTSQL keywords (SELECT, FROM, WHERE, ORDEREDBY) are not case-
sensitive.

Most OTSQL queries use only the SELECT and WHERE clauses; however, more
complex queries can use all clauses and embed OTSTARTS expressions as necessary.

SELECT
The SELECT clause contains a comma-separated list of quoted strings which
represent the regions to retrieve from the search result item (for example, the OTMeta
region and OTScore region). It is a required clause.

For more information about regions, see "Configure Index Regions" in the OpenText
Content Server Admin Online Help - Search Administration (LLESWBS-H-AGD).

Syntax

SELECT "region1", "region2", "region3"

FROM
The FROM clause is an optional OTSQL clause that is not used in Content Server. It
can contain the names of the Content Server slices to query; however, because
Content Server submits OTSQL queries directly to a Search Federator (which
corresponds to a particular index or slice), it is not necessary to specify the slice
name explicitly.

The OR operator is the only valid OTSTARTS operator used in the FROM clause. For
more information about OTSTARTS operators, see “Combined Queries” on page 86.

Syntax

FROM "slice" OR "slice"

WHERE
The WHERE clause contains the search term(s) or phrase(s) that Content Server users
specify on the search bar or the Content Server Search page. It is a required clause.

All OTSTARTS query language components can be embedded in the WHERE clause. For
more information about OTSTARTS, see “OTSTARTS Query Language” on page 80.

Syntax

WHERE "search_expression"

ORDEREDBY
The ORDEREDBY clause is an optional OTSQL clause that contains result ranking
criteria that the Search Engine uses when evaluating queries. Content Server always
uses the following syntax:
ORDEREDBY RankingExpression ("search_expression")

Although the following values represent valid ordering criteria for the ORDEREDBY
clause, Content Server users cannot specify these values through the Content Server
interface.

Table 4-3: Result Ranking Criteria

Expression Description
DEFAULT Ranks results according to a default ranking criteria
NOTHING Does not rank results

RELEVANCY Ranks results according to the relevancy ranking that the Content
Server Search Engine supports. This is the default ordering criteria.
RANKINGEXPRESSION (rankingexpression) Ranks results according to the results of the OTSTARTS ranking expression that you specify
EXISTENCE Ranks results based on the number of distinct query terms that appear
in a document (not the instances of the same query term)
RAWCOUNT Ranks results based on the instances of the query terms that appear in
a document (not the number of distinct query terms)
REGION fieldname Ranks results based on the contents of the region or field named
fieldname. This is only supported for integer, time, and date regions.

Syntax

ORDEREDBY ordering_criteria
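
Putting the clauses together, a complete OTSQL statement might look like the following. This is an illustrative sketch; the regions shown are regions discussed elsewhere in this guide, and the exact statements Content Server generates may differ.

SELECT "OTName", "OTObjectDate", "OTScore"
WHERE "cats" and "dogs"
ORDEREDBY RELEVANCY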

4.1.3 OTSTARTS Query Language


This section describes the components of the OTSTARTS query language, including:

• “Simple Word or Phrase” on page 80
• “Simple Word with Modifiers” on page 81
• “Comparator” on page 83
• “Region” on page 85
• “Combined Queries” on page 86

Simple Word or Phrase


A simple word or phrase query consists of a word or a phrase surrounded by
quotation marks. Examples of simple phrase queries include:
"Fred"
"OpenText Corporation"

The result of a simple word or phrase query is the collection of documents containing that word or phrase.

Simple Word with Modifiers


Modifiers enhance simple word queries, making them more general or more specific
than the original query. For example, a thesaurus expansion can be requested to
expand the collection of matching documents.

The following are valid modifiers:

• Phonetic
• Stem
• Thesaurus
• Right-truncation
• Left-truncation
• Regex
• Range

Phonetic

The Phonetic modifier expands a simple word query to include words that sound
like the specified word. For example, the following query produces documents
containing at least one instance of sail or sale.
phonetic "sail"

Stem

The Stem modifier expands a simple word query to include words that are likely
noun plural or singular forms of the specified word. For example, the following
query produces documents containing at least one instance of tax or taxes.
stem "tax"

The languages supported for the Stem modifier are English, French, German, Italian
and Spanish.

Thesaurus

The Thesaurus modifier expands a simple word query to include words that are
derived from the thesaurus expansion of the specified word. For example, the
following query produces documents containing terms such as stone, rubble, rock, or
flintstone.
thesaurus "stone"

Right-truncation

The Right-truncation modifier expands a simple word query to include terms whose
prefixes match the specified query expression. For example, the following query
produces documents containing terms such as sailing, sails, sailed, sailboat, and sailor.
right-truncation "sail"

Left-truncation

The Left-truncation modifier expands a simple word query to include terms whose
suffixes match the specified query expression. For example, the following query
produces documents containing terms such as sailing, remaining, boating, or banding.
left-truncation "ing"

Regex

The Regex modifier expands a simple word query to include words that are derived
from the regular expression expansion of the specified word. For example, the
following query produces documents containing at least one instance of the, their,
them, then, there, therefore, these, or they.
regex "^[Tt]he"

Range

The Range modifier expands a simple word query to include words in the
lexicographic or numeric range of the specified words or numbers. (If the specified
terms are integers or real numbers, it uses the numeric range. Otherwise, it uses the
lexicographic range.) The ends of a range specification are separated by the tilde
character (~). In most cases, the Range modifier is used to perform date range
queries. For example, the following query produces documents containing numbers
in the numeric range of 20140101 to 20160101.
range "20140101~20160101"

The Range modifier can also be used in other contexts. For example, the following
query produces documents containing words in the lexicographic range between
twelve and twenty.
range "twelve~twenty"

Comparator
Comparators are used to perform queries over lexicographic or numeric ranges. (If
the specified term is an integer or real number, it uses the numeric range. Otherwise,
it uses the lexicographic range.) In most cases, comparators are used to perform date
range queries; however, they can also be used in other contexts.

Comparators can be used with simple word queries, or with compound word
queries, though some comparators only use the first word if given a phrase.

The following are valid comparators:

• <
• <=
• >
• >=
• !=
• =

Note: The behavior of comparators depends upon the type definition of the
region. Text string comparisons use a text sort, so that 2000 > 1000000 for
values stored in a text region.

<

The Less Than comparator will match all values which exist, and are less than the
specified term. If a phrase is provided, only the first term in the phrase is used.

For example, the following query produces all documents containing numbers that
are numerically less than 20140101. This is especially useful when searching for
documents that were published before a specific date.
< "20140101"

<=

The Less Than or Equal To comparator will match all values which exist, and are less
than or equal to the specified term. If a phrase is provided, only the first term in the
phrase is used.

>

The Greater Than comparator will match all values which exist, and are greater than
the specified term. If a phrase is provided, only the first term in the phrase is used.

For example, the following query produces all documents containing numbers that
are numerically greater than 20140101. This is especially useful when searching for
documents that were published after a specific date.
> "20140101"

>=

The Greater Than or Equal To comparator will match all values which exist, and are
greater than or equal to the specified term. If a phrase is provided, only the first term
in the phrase is used.

!=

The Not Equal To comparator restricts the result to exclude the specified word or
phrase. Although this comparator is useful for performing date queries, it can also
be used in other word or phrase queries. For example, the following query produces
all documents that do not contain the word Fred.
!= "Fred"

=

The Equals comparator indicates that the specified word or phrase must be included
in the query, as follows:

• For metadata text regions, the specified word or phrase must match the entire
region, or an entire component value of a multivalued region, to be considered a
match.
For example, [region "CityName"] ="York" would match “York” in the
CityName region, but would not match “New York”.
• For content searches, the specified word or phrase may appear anywhere in the
object's content.
For example, ="York" would match an object with City of York in its content.
• Searches are not case-sensitive.
For example, ="York" would match YORK.
• Searches are not sensitive to extraneous whitespace or punctuation, as long as
they do not break the individual search words.
For example, [region "CityName"] ="York" would match a CityName of : York
:.
• Searches of multivalued metadata fields only require one component of the
multivalue to match.
For example, if the input XML of an object included <CityName>London</
CityName><CityName>New York</CityName>, then a search for [region "CityName"] ="New York" will match the object because “New York” matches
an entire component of the multivalued CityName region, and a search for
[region "CityName"] ="York" would not match.

Region

Region queries are those in which the specified query must exist in the specified region in order to be considered a match. For example, the following query produces all of the documents containing the phrase Reproductive Habits of the Australian Cane Toad in the title region.
title "Reproductive Habits of the Australian Cane Toad"

The following query produces all of the documents for which the last-modified date
does not include values in the numeric range from 20140101 to 20150101.
date-last-modified != range "20140101~20150101"

All OTSTARTS regions have the following syntax:


[ region "regionName" ]

Regions can be nested, which means that the following query can be used to
produce all of the documents containing chapters, that contain sections, that have
headings, which include the phrase Frog (assuming that the database contains those
regions).
[ region "OTDoc" : "Chapter" : "Section" : "Heading" ] "Frog"

The following table describes the valid STARTS fields and the equivalent fields in
OTSTARTS:

Table 4-4: Valid STARTS Fields

title
Specifies the title of the document. OTSTARTS equivalent: Title.

author
Specifies the author of the document. OTSTARTS equivalent: Author.

body-of-text
Specifies the body of the document. OTSTARTS equivalent: OTData.

document-text
Specifies the text of the document. OTSTARTS equivalent: OTData.

date-last-modified
Specifies the date on which the document was last modified. OTSTARTS equivalent: OTObjectDate.

time-last-modified
Specifies the time at which the document was last modified. OTSTARTS equivalent: OTObjectTime.

any
Specifies an unrestricted query. This is the default field. No OTSTARTS equivalent.

linkage
Specifies the Uniform Resource Locator (URL) for the document. OTSTARTS equivalent: OTURN.

linkage-type
Specifies the Multipurpose Internet Mail Extension (MIME) type of the document. OTSTARTS equivalent: OTMIMEType.

cross-reference-linkage
Specifies the list of URLs in a document. Because Content Server Search Engines do not modify or interpret documents, there is no direct OTSTARTS equivalent for this field. However, for those HTML documents that may exist in a database, the <A> tags may be a suitable substitute for this field.

language
Specifies the language of the document, as described in the RFC-1766 specification. Because Content Server Search Engines do not require language specifications of this sort, there is no OTSTARTS equivalent.

Combined Queries
Operators are used to combine queries. There are two kinds of operators supported
in OTSTARTS: Boolean and proximity. There are multiple Boolean operators and one
proximity operator. The proximity operator (prox) has a slightly different syntax
than the Boolean operators.

The following query sample describes the general syntax of operators:


expression operator expression

Where expression denotes a simple phrase, a simple phrase with modifiers, a compound phrase, a comparator, or field queries.

and

The and operator denotes the case in which both the left and right expressions must
exist in the result. For example, the following query produces all documents
containing the phrase cats and the phrase dogs.
"cats" and "dogs"

or

The or operator denotes the case in which the left or right expression must exist in
the result. For example, the following query produces all documents containing the
phrase cats or the phrase dogs and all documents that contain both cats and dogs.
"cats" or "dogs"

and-not

The and-not operator denotes the case in which the left expression must exist but
the right expression must not exist in the result. For example, the following query
produces all of the documents containing the phrase cats and not containing the
phrase dogs.
"cats" and-not "dogs"

xor

The xor operator (also called the exclusive-or operator) denotes the case in which
the left expression or the right expression (but not both expressions) must exist in
the result. For example, the following query produces all of the documents
containing the phrase cats or the phrase dogs but not containing both cats and dogs.
"cats" xor "dogs"

The xor operator provides a shortcut for the following query:

(left-expression or right-expression) and-not (left-expression and right-expression)

sor

The sor operator (also called the synonym-or operator) denotes the case in which
the left or right expression exists in the result and the expressions are synonyms of each other.
Functionally, the or and sor operators generate the same number of matching
documents; however, the relevance ranking of the documents may differ slightly.
For example, the following query produces all of the documents containing the
phrase U.K. or the phrase United Kingdom, or both phrases.
"U.K." sor "United Kingdom"

When presented in a summary list, these results may be ranked differently than the
results of the same query using the or operator.

prox[ distance, order ]


The prox operator (the proximity operator) denotes the case in which both
expressions exist in the result, and one expression precedes the other by no more
than a specified number of words. Unless specified by the order parameter, the
prox operator does not consider the order of the expressions to be important.

The prox operator must take at least one parameter, the maximum allowable
distance between the expressions.

Parameters

distance
Denotes the maximum number of words that can exist between two phrases if they are to be considered in proximity to each other. The distance parameter must be a positive integer.

order
Denotes whether the order of the phrases is important. The order parameter can have values of T or t to denote that order is important and F or f to denote that order is not important. If the order parameter is omitted, order is not considered important.

For example, the following query produces all of the documents in which the phrase
dogs precedes or follows the phrase cats by no more than ten words.
"cats" prox[10] "dogs"

The following query produces all of the documents in which the phrase dogs follows
the phrase cats by no more than ten words.
"cats" prox[10,T] "dogs"

Precedence can also be enforced using parentheses. For example, the following
query produces all of the documents containing the phrase cats followed (by no
more than ten words) by the phrase dogs and the phrase raining.
"raining" and ( "cats" prox[10,T] "dogs" )

4.2 Understanding Relevance Ranking


Relevance ranking is based on a measurement of similarity between a Query and the
content, the metadata, or both of each Content Server item. When a Query is run,
each item that matches the Query is assigned a score, and those scores are used to
sort (rank) the items from most to least relevant. Items that Content Server perceives
to be more relevant to a Query receive a higher score and appear at the top of the list
of results on Content Server's Search Results page.

In Content Server, a Query has a weight and each Advanced Criteria has a weight.
These weights determine how important each component of the overall score is
during the score calculation. The Query weight is the weight given to the query
score in the calculation of the overall search result score. It can be specified in the
Query Weight field on the Ranking Properties page of a Search Manager. For more
information about how Advanced criteria are weighted, see “Calculating Advanced
Criteria Scores” on page 90.

The calculation of an item's search result score is based on the weighted average of its
query score and its advanced criteria scores.

Note: A weighted average is just an average, with weights indicating how many times an element should be included in the average. If all of the weights are identical, the weighted average is identical to the normal average. If one element's weight is twice as much as all of the other element weights, that element is counted twice in the average.
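
A small Python sketch of that calculation follows. The scores and weights in the usage comment are invented for illustration; they are not Content Server defaults.

def search_result_score(scores_and_weights):
    # Weighted average of the query score and the advanced criteria scores.
    total = sum(score * weight for score, weight in scores_and_weights)
    return total / sum(weight for _, weight in scores_and_weights)

# A query score of 0.8 at weight 100 and a date rank score of 0.5 at weight 45:
# (0.8 * 100 + 0.5 * 45) / (100 + 45) is approximately 0.71, displayed as 71%.
print(search_result_score([(0.8, 100), (0.5, 45)]))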

Internally, search result scores have a range from 0 to 1. For display on the Search
Results page, the internal scores are multiplied by 100 and listed as percentage
values.

4.2.1 Calculating the Query Score


A query score is a statistical measure of an item's importance to a Query. The more
important an item, the higher its query score. The query score is the equivalent of
the overall score that was displayed on the Search Results page.

Query scores are calculated by the Content Server Search Engines, based on the following main criteria:

Table 4-5: Query Score Criteria

The number of times that a search term occurs in an item
An item with three occurrences of a search term is considered more relevant than an item with two occurrences of the same term.

The length of an item
An item that is shorter in length is considered more relevant than a longer item. Document length only affects the contribution from the content field, not metadata fields.

The number of times that the search term(s) appear throughout the entire Content Server database
A search term that occurs in only 10% of all documents in the database is considered more relevant than a search term that occurs in every document. This means that if two items have the same length and the same number of matches for two different search terms (that is, term 1 occurs N times in document 1 and term 2 occurs N times in document 2), their scores differ if term 1 is much more common than term 2 throughout the entire database.

The co-occurrence of search terms
If a Query contains two (or more) search terms, items containing both terms are considered more relevant than items containing only one term. Suppose that you run a Query for dog and cat and Content Server returns two results: document A, which contains 10 instances of dog, and document B, which contains 5 instances of dog and 5 instances of cat. Assuming that the documents are equal in length, and dog and cat occur with similar frequency throughout the database, document B will receive a higher score than document A. Although document A contains more instances of dog, document B contains instances of both dog and cat in the same document.

If the search criteria consists of multiple phrases (with OR operators), Content Server
calculates scores for each phrase separately and then combines the scores to
determine the final display value. This increases the scores of those Content Server
items that contain several of the specified phrases.

4.2.2 Calculating Advanced Criteria Scores


Advanced criteria are additional contributors to an item's overall score. Like the
query score, the closer that an item matches the advanced criteria (for example, if it
is a specific MIME type), the higher its associated advanced criteria score.

Advanced criteria include the following:

• Field Rank
• Date Rank
• Type Rank
• Object Rank

Advanced Rankings
The data flow section of the search.ini configuration file contains Advanced
Ranking settings used in search. To optimize search, new settings have been added,
which are now the default settings for both Enterprise and non-Enterprise data
sources. These settings are listed below:

• ExpressionWeight=100
• ObjectRankRanker="OTObjectScore",50
• ExtraWeightFieldRankers="OTName",200;"OTDComment",50
• DateFieldRankers="OTModifyDate",45,2
• TypeFieldRankers="OTFilterMIMEType",2:"application/pdf",100:"application/msword",100:"application/vnd.ms-excel",30:"text/plain",75:"text/html",75;"OTSubType",2:"144",100:"0",100:"202",100:"140",200

Note: These new settings are present in the search.ini file. If you are
migrating from an older version of Content Server, the older settings will be
preserved, but a conversion button will be available on the Search Manager
Ranking tab for administrators who wish to implement the new recommended
settings.

Field Rank
The Field Rank criteria is based on an item's metadata regions. Metadata regions
store information (metadata) about indexed items (for example, the OTSummary
region stores an item's summary, the OTLocation region stores the item's location,
and the OTCreatedby region stores the name of the user who originally created the
item). When you configure the Field Rank, you specify which metadata regions are
most relevant. If a term in a Query matches the metadata in a region that you
specify, the corresponding item is given a higher advanced criteria score. For
example, you can use the Field Rank criteria to give emphasis to items that match
the specified query in the OTName or OTSummary regions. For more information
about index regions, see “Configuring Index Regions” in the OpenText Content Server
Admin Online Help - Search Administration (LLESWBS-H-AGD).

The score for the Field Rank criteria is determined by calculating a query score as if
the search terms were being sought in the region name or names that you specify.
This score is then combined with the other scores in the weighted average to come
up with an overall search result score.

You specify field rank criteria (that is, region names and their corresponding
weights) with the Field Rank setting on a Search Manager's Ranking Properties
page. For more information about setting Field Rank Criteria, see “Configuring
Advanced Ranking” in the OpenText Content Server Admin Online Help - Search
Administration (LLESWBS-H-AGD).

Date Rank
The Date Rank criteria is based on an item's dated metadata regions. Dated metadata
regions store dates associated with indexed items (for example, the OTModifyDate
region stores the date on which the item was last modified). When you configure the
Date Rank, you specify which dated metadata regions are most relevant. The more
recent the date stored in a dated region that you specify, the higher the advanced
criteria score given to the corresponding item.

The score for the Date Rank criteria is determined by looking up the date stored in
the specified dated region, calculating the difference in days between that date and
today's date, and then applying a distribution to the difference. This score is then
combined with the other scores in the weighted average to come up with an overall
search result score.

You specify date rank criteria (that is, region names, the days of interest, and
weights) with the Date Rank setting on a Search Manager's Ranking Properties
page. For more information about setting Date Rank Criteria, see “Configuring
Advanced Ranking” in the OpenText Content Server Admin Online Help - Search
Administration (LLESWBS-H-AGD).

Suppose that the Date Rank field was set to the following value:
"OTModifyDate",45,2

Then D=45, and an item's score is D/(d+D), where d is the number of days since the
date stored in the region. If an item was modified five days ago, the score would be
0.9 (45/(5+45)). If an item was modified 45 days ago, the score would be 0.5
(45/(45+45)). If an item was modified 135 days ago, the score would be 0.25
(45/(135+45)).

Note: The 2 in the example is the relative weight compared to the other
advanced criteria.
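
A quick sketch of this distribution, using the score = D/(d+D) formula inferred from
the worked example above:

    def date_rank_score(days_old, D=45):
        """Date Rank distribution from the worked example: D / (d + D)."""
        return D / (days_old + D)

    for d in (5, 45, 135):
        print(d, date_rank_score(d))  # 0.9, 0.5, 0.25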

Type Rank
The Type Rank criteria is used to give emphasis to certain subtypes or MIME types.
When you configure the Type Rank, you specify which metadata regions of type
Enum are most relevant. For example, you can use the Type Rank criteria to boost
the scores of any Documents or Discussions if you consider those items to be more
relevant to most searches. You can then boost the scores of Documents in PDF
format further.

The score for the Type Rank criteria is determined independently of a Query by
looking up the value of an item's Enum metadata region, and comparing it to the list
of corresponding values specified in the Type Rank criteria. If an item's value
matches any of the values specified in the criteria, it is assigned the corresponding
Type Rank score. If there is no match, the item is assigned a score of zero. This score
is then combined with the other scores in the weighted average to come up with an
overall search result score.

Type Rank criteria consist of a list of per-region specifications separated by
semicolons. Each per-region specification includes a region name, the weight given
to the Type Rank criteria for this region, and a colon-separated list of corresponding
region values and scores. You specify Type Rank criteria (that is, the Enum region
name and its corresponding weight, as well as each Enum region value and its score)
in the Type Rank setting on a Search Manager's Properties page. For more
information about setting Type Rank criteria, see “Configuring Advanced Ranking”
in the OpenText Content Server Admin Online Help - Search Administration (LLESWBS-
H-AGD).

Tip: The Type Rank criteria applies to any field of type Enum (see the
LLFieldDefinitions.txt file in the config directory of your Content Server
installation for some examples).

Suppose that the Type Rank setting contains the following value:
"OTFilterMIMEType",2:"application/msword",50:"application/msexcel",30

This means that the score for application/msword occurring in the
OTFilterMIMEType region is 1.0 (from 50/50), while the score for
application/msexcel occurring in the OTFilterMIMEType region is 0.6 (from 30/50).

Note: The 2 in the example is the relative weight compared to the other
advanced criteria.
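
A small sketch of the per-value normalization implied by this example; the inference
from the 50/50 and 30/50 arithmetic above is that each configured score is divided by
the largest score listed for the region.

    def type_rank_scores(value_scores):
        """Normalize per-value Type Rank scores against the region's maximum.

        value_scores: dict of Enum region value -> configured score, as in
        "OTFilterMIMEType",2:"application/msword",50:"application/msexcel",30
        """
        top = max(value_scores.values())
        return {value: score / top for value, score in value_scores.items()}

    print(type_rank_scores({"application/msword": 50,
                            "application/msexcel": 30}))
    # {'application/msword': 1.0, 'application/msexcel': 0.6}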

Object Rank
The Object Rank criteria is based on the relative value of items to Content Server
users. Relative value is measured by tracking the usage of items at the Content
Server site and makes the following assumptions:

• Items that are downloaded more frequently than others are considered more
valuable.
• Items that are modified more frequently than others are considered more
valuable.
• Items that have many aliases are considered more valuable.
• Items that appear on the Favorites page of many users are considered more
valuable.
• Items that reside in a highly valued parent container are considered more
valuable.

If an item is considered valuable, based on these assumptions, it has a high score
associated with it in a special field in the database. This is a query-independent score
that can be used to boost the overall score of an important item. The name of the
database field that stores this score, and the weight given to this advanced criteria,
are specified by the value of the Object Rank setting on a Search Manager's
Properties page. If you enable the Object Rank criteria calculation, the following
additions are made to the opentext.ini file (an illustrative snippet follows the list):

• The ObjectRankEnabled parameter is set to true in the [options] section of the
opentext.ini file.
• A [relagent] section is added to the opentext.ini file. The [relagent] section
contains the SleepIntervalSec parameter, whose value represents the
frequency (in seconds) with which the Object Rank score is calculated for
Content Server items.
• A reference to the [relagent] section is added to the value of the load
parameter in the [loader] section of the opentext.ini file.
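
For illustration only, the resulting additions to opentext.ini might look like the
following. The SleepIntervalSec value (86400 seconds, or once daily) and the exact
form of the load entry are placeholders; the file Content Server generates may differ.

    [options]
    ObjectRankEnabled=true

    [relagent]
    SleepIntervalSec=86400

    [loader]
    load=...,relagent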

For more information about these parameters, see the “opentext.ini File Reference”
in the OpenText Content Server Admin Online Help - Content Server Administration
(LLESWBA-H-AGD).

Because this advanced criteria score is based on an item's usage in Content Server, its
value changes constantly. It is not efficient to send an updated score to the Index
Engine each time an item is downloaded or modified, but Object Rank scores become
meaningless if they get stale. To manage this tradeoff, Content Server allows you to
schedule the frequency of the Object Rank calculation. Scheduling the calculation
allows only large changes to trigger an index update, and minimizes the amount of
information sent to the index with each update. You enable and schedule the Object
Rank calculation on the Configure Search Options page in Content Server. For more
information, see “Configuring Search Options” in the OpenText Content Server Admin
Online Help - Search Administration (LLESWBS-H-AGD).

The Object Rank score is calculated on the Content Server server by the relevance
thread. Only one relevance thread must run at each Content Server site, which means
that in a clustered environment, only one Content Server instance should run the
relevance thread. The relevance thread is responsible for:

• Calculating the object rank scores for all Content Server items
• Comparing the scores to the previous scores (if this is not the first time the thread
has run)
• Updating the DTreeNotifyTable with the Node ID of the items whose scores
have changed. These updates are handled by the Content Server Extractor
process in the same way that any other update is handled. For more information,
see “Updating an Index” on page 30.

Note: When the relevance thread runs for the first time, all of the Favorites lists
are scanned. Subsequently, changes to the Favorites lists are tracked as they are
made.

For large Content Server repositories, the Object Rank calculation can be quite
taxing. To mitigate this, you can configure the relevance thread to run on a separate
Content Server instance within a cluster. Although this removes the load from the
front-end Content Server instance, system performance may still be affected by
competition for resources on the RDB machine. If the performance impact is too
severe, OpenText recommends disabling the Object Rank criteria entirely; Object
Rank scores will then not be included in the calculation of the overall search result
score.

4.3 Working with Regions


You can control which metadata regions are Queryable, Displayable, Search by
Default, Sortable, and available for use with Search Filters. Once the new data has
been indexed, you can configure the search options for the new regions.

You access the region search options on the Regions tab of a Search Manager. On
this page, all of the indexed regions are displayed, and check boxes allow you to
make each region queryable, displayable, searched by default, sortable, and usable
with Search Filters. A standard set of regions is selected for you.

The Search by Default check box allows you to select exactly which regions are
searched. Doing so allows for more precise searching and greater relevance of
search results, because hits in unforeseen metadata regions no longer occur.

The Filter check box allows your users to use Search Filters with regions on the
Search Results page. For more information, see OpenText Content Server Admin
Online Help - Search Administration (LLESWBS-H-AGD).

MaximumNumberOfValuesPerFacet parameter

The [DataFlow_xxx] section of the search.ini file controls options that configure
Content Server data flow processes, including the
MaximumNumberOfValuesPerFacet parameter.

• Description:
The total count of facet values for the region. This parameter determines the
threshold at which an indicator is displayed in the Content Server user interface,
on the search filter region title bar, to show that the facet values for the Search
Results set are incomplete.

Note: OpenText strongly recommends that you do not modify the
MaximumNumberOfValuesPerFacet setting directly in the search.ini file.
Instead, use the search.ini_override file. For more information, see
“Overriding the search.ini file” on page 97.
• Syntax:
MaximumNumberOfValuesPerFacet=32767

• Values:
A positive integer. The default value is 32767.

4.4 Reloading Settings


Settings in the data flow section of the search.ini configuration file are reloadable:
they can be modified, and the changes will be picked up by the Search Engine(s)
once a reloadSettings command is issued to those Search Engine(s) via the Admin
server. This is in contrast with non-reloadable settings, where changes require that
the Search Engine(s) be restarted.

Note: Although some of these settings apply to both the Index Engines and
Search Engines, they are only reloadable without engine restart for the Search
Engines.

The following Multilingual Metadata search settings are now reloadable:

• HitLocationRestrictionFields
• SystemDefaultSortLanguage
• DefaultMetadataAttributeFieldNames="OTName","OTDComment"
• DefaultMetadataAttributeFieldNames_OTDComment="lang","gr";"orig","unknown";"color","pink"
• DefaultMetadataAttributeFieldNames_OTName="lang","gr";"orig","unknown";"color","pink"

The following defragmentation settings are now reloadable:

• DefragmentMemoryOptions
• DefragmentSpaceInMBytes
• DefragmentDailyTimes

The following settings were previously reloadable:

• all of the settings in Advanced Rankings
• field aliases
• ContentTruncSizeInMBytes
• QueryTimeOutInMS
• thesaurus settings
• DefaultMetadataFieldNamesCSL

4.5 Overriding the search.ini file


The search.ini_override configuration file provides administrators with more
options for maintenance and upgrades. This feature exists because there are
occasionally search configuration values that must be set but that are not accessible
from within Content Server. The search.ini configuration file is owned and
generated by Content Server, so any direct changes made to it to meet special
requirements are lost each time Content Server rewrites the file.

This configuration file is a shadow version of the search.ini file. If the
search.ini_override file exists, then any values in the search.ini_override file take
precedence over their corresponding values in the main search.ini configuration
file.

Note: A shadow file should only be used when recommended by OpenText
Support.

OpenText recommends that you create only a small number of entries in the
search.ini_override file.

The shadow file must be named search.ini_override, and it must reside in the
same directory as the search.ini file. It has the same format as the search.ini file,
with sections and name/value pairs; values in the override file take precedence over
those in the ini file.

The ini file reading functionality restricts how white space is handled, so name/value
entries should appear exactly as in the example below, with no extra white space
anywhere:
name=value<CR>

There is also a special value, DELETE_OVERRIDE, which effectively removes an ini
entry rather than simply replacing it. Typically, this causes the search component
using the ini file to revert to its default. An example of such an entry is:
DumpToDiskOnStart=DELETE_OVERRIDE
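
Putting these pieces together, a small hypothetical search.ini_override file might
look like the following. The [DataFlow_Enterprise] section name, the facet limit,
and the placement of the DumpToDiskOnStart entry are illustrative assumptions
only; use the section and parameter names from your own generated search.ini file.

    [DataFlow_Enterprise]
    MaximumNumberOfValuesPerFacet=10000
    DumpToDiskOnStart=DELETE_OVERRIDE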

The search.ini_override configuration file allows the administrator to set
configuration parameters that cannot be erased or modified by Content Server. It
also allows future versions of the search component that require new configuration
settings to be released independently of updates to the Content Server configuration
code.

Note: Content Server ensures that standard configuration files are replicated
among partitions as necessary. However, if administrators use shadow
configuration files, then they are responsible for replicating these files among
partitions.

For example, a file generated by Content Server is named
search_partition_name.ini, while the equivalent shadow file is named
search_partition_name.ini_override.

Using a shadow configuration file should not be required in most Content Server
installations during normal operation. One example use of this feature is converting
regions to Retrieve-Only mode.



Index

<, less than comparator 83
<=, less than or equal to comparator 83
>, greater than comparator 84
>=, greater than or equal to comparator 84
!=, not equal to comparator 84
.cmt 56
.in 52
.log 53
.out 52
.rlb 61, 63
.trn 53, 54
=, equals comparator 84
\xdc= comparator 83

A
accumLog file 11, 45
accumulator, content 11
active transactions 54
adding
  information to DTreeNotify table 16
  objects 11
  partitions 9
  search federators 12
AddOrModify operation 22
AddOrReplace operation 21
algorithm
  checksum 46
  relevance ranking 69
  Soundex 73
and operator 86
and-not operator 87
architecture 5, 26

B
bad object
  heuristic 40, 41
  indexing 40
binary file, INDEXUPDATE 16
blocking updates 39

C
character
  escape 75
  quotation mark 76
  special 76
  wildcard 75
checkpoint file 10, 45
checksum algorithm 46
clients 27
codes, severity 43
combined query
  and operator 86
  and-not operator 87
  or operator 86
  prox operator 87
  sor operator 87
  xor operator 87
committing transactions 56, 57
  example 60
comparator
  \xdc= 83
configuration file 97
  parameters 97
content accumulator 11
crawl database history file 19
creating
  checksums 46
  data sources 50
  iPool subdirectories 56
  log files 54
  partition prematurely 35
  transaction directory 52

D
data flow
  directory structure 50
  iPool subdirectories 50
  reading from iPools 55
  writing to iPools 54
data interchange pools 27, 28, 49
data source creation 50
date rank 91
DCS (Document Conversion Service) 30
  architecture 26
  clients 27
  overview 26
  servers 27
  temporary instance 27
  workers 29
DCSIPool library 28
defragmenting
  index 35
  scheduler daemon 35, 36
Delete operation 23
deleting objects 24
detecting
  file formats 26
  MIME type (Multipurpose Internet Mail Extension) 26, 29
  new files 19
directory
  status 52
  transaction 52
disk 11
  Maximum Content Disk Size 31
  space consumed by log files 47
DISK_RET (Retrieve-Only mode) 10
document conversion
  for indexing 28
  for viewing 27
DTreeNotify table 16, 93

E
engine
  index 10
  search 12
equals comparator 84
error
  input data 42
  process 11, 30, 38
  syntax 75
escape character 75
excluding MIME types 40, 42
extending document conversion 30
extensions, iPool subdirectories 50
external file storage 18
extractor process 16

F
field
  memory mode 11
  metadata 21
  query 83
field rank 91
file
  accumLog 11, 45
  binary, INDEXUPDATE 16
  checkpoint 10, 45
  configuration 97
  external storage 18
  history, Directory Walker 19
  livelink 46
  livelink.ctl 47
  metaLog 10, 47
  opentext.ini 16, 27
  rotating log 47
  search.ini 30
  search.ini_override 97
  shadow version of search.ini file 97
  temporary storage 49
file formats
  detecting 26
  MIME type (Multipurpose Internet Mail Extension) 26
  OpenText Document Filters (OTDF) 26
  QDF 26
filter packs 30
  loading 29
  overview 26
fragments, index 11, 31, 35
framework 26
FROM, OTSQL query language 79

G
greater than comparator 84
greater than or equal to comparator 84

H
heuristic, bad object 40, 41
history file, Directory Walker 19

I
index
  bad object 40
  defragmenting 35
  fragments 11, 31
  invalid field value 42
  merging 11, 31
  metadata 10, 45
  pollution 40
  updating 30
index engine 10
  log file 41
  restarting 11
  scheduler daemon 35
  section in search.ini file 30
indexing 28
INDEXUPDATE file 16
input data error 42
interchange pools 49
introduction 5
invalid
  data 42
  field value 42
  syntax 75
iPool 27, 28
  base 50
  connecting to 53
  directory structure 50
  status directories 52
  transaction directories 52
iPool (data interchange pool) 49
iPool messages 50
  committing 56
  reading 55
  status, .in and .out 53
  writing 54
iPool subdirectories 50, 51

L
language
  metadata 72
  natural 40
  query 67
left-truncation modifier 82
less than comparator 83
less than or equal to comparator 83
Live Query Language (LQL) 68
livelink file 46
livelink.ctl file 47
loading filter packs 29
log files
  creating 54
  index engine 41
  rotating 47
LQL (Live Query Language) 68

M
Maximum Content Disk Size 31
memory
  field mode 11
  managing 35
  metadata 10
  on disk 10
  rebalancing 36
merging
  extracted data 16
  index fragments 11, 31
  search results 12
  system capability 32
metadata
  field 21
  index 10, 45
  language 72
  memory 10
  percent full value 37
  update 35
metaLog file 10, 47
MIME type (Multipurpose Internet Mail Extension) 27, 86
  detecting 26, 29
  excluding 40, 42
  ranking 92
modifier
  left-truncation 82
  Phonetic 81
  Range 82
  Regex 82
  Right-truncation 82
  Stem 81
  Thesaurus 81
modifying objects 25

N
natural language 40
normal operation 98
not equal to comparator 84

O
object rank 93
objects
  adding new 11
  deleting 24
  modifying 25
  operation on 21
  with invalid field values 42
OpenText Document Filters (OTDF) 26
opentext.ini file
  addition of object rank criteria 93
  DCS servers listed 27
  Extractor processes 16
operation
  AddOrModify 22
  AddOrReplace 21
  Delete 23
  normal 98
  on data object 21
  read 55
  roll back 61
  write 54
or operator 86
order of processing 76
ORDEREDBY, OTSQL query language 79
OTDF (OpenText Document Filters) 26
otdoccnv 27
OTSQL query language 78
  FROM 79
  ORDEREDBY 79
  SELECT 78
  WHERE 79
OTSTARTS query language 80
overriding
  default order of processing 76
  partition default settings 39
  search.ini file 97
overview 5
  DCS (Document Conversion Service) 26
  filter packs 26

P
partition 9
  adding 9
  adding objects to 11
  checkpoint creation 45
  creating prematurely 35
  overriding default settings 39
  percent full value 37
  rebalancing 36
  search grid 5
  size of metadata index 10
  structure 43
  updating 30
percent full value of partition 37
persistence 29
Phonetic modifier 81
phrase 75
pollution, of index 40
port numbers 30
process
  error 11, 38
  extractor 16
  workers 29
processing order 76
prox operator 87

Q
QDF (Quality Document Filters) 26
query language
  LQL (Live Query Language) 68
  OTSQL 78
  OTSTARTS 80
quotation mark 76

R
RAM 11 (see also memory, disk)
Range modifier 82
rank
  algorithm 69
  date 91
  field 91
  object 93
  relevance 88
  severity codes 43
  type 92
read operation 55
  committing 56
  rolling back 61
Reading Transactions 56
rebalancing partitions 36
Regex modifier 82
regular expressions 76
relevance ranking 88
  algorithm 69
reloadable settings 96
restarting
  index engine 11
  search engine 11, 96
Retrieve-Only mode 11, 98
  (DISK_RET) 10
Right-truncation modifier 82
roll back transactions 61
rotating log files 47

S
Samba mounts 50
scheduler daemon 35, 36
search
  federator 12
  phrase 75
  query languages 67
  regular expressions 76
  results 12
search engine 12
  restarting 11, 96
  scheduler daemon 35
search.ini file
  index engine section 30
  override file 97
  shadow version of 97
SELECT, OTSQL query language 78
servers 27
settings, reloadable 96
severity codes 43
shadow version of search.ini file 97
simple phrase with modifiers (see also modifier)
sor operator 87
Soundex algorithm 73
space consumed by log files 47
special characters 76
status directory 52
Stem modifier 81
storage
  external 18
  Retrieve-Only mode 11
  temporary areas 49
syntax, invalid 75

T
table, DTreeNotify 16, 93
temporary
  file storage 49
  instance of DCS server 27
  transaction directory 52
Thesaurus modifier 81
transaction directory 52
transaction mechanism
  committing 56
  reading 55
  starting 54
  writing 54
transaction.cmt 56
transaction.log 54
transaction.rlb 61
transaction.trn 54
type rank 92

U
undo transactions 61
Universal Naming Conventions (UNC) 50
update blocking 39
Update Distributor 7, 9
updating
  creating partition prematurely 35
  index 30
  metadata 35

V
View As Web Page command 29

W
WHERE, OTSQL query language 79
wildcard character 75
Windows Directory Structure of the Admin Online Help Index 50
workers 29
write operations
  committing 56
  rolling back 61
writing transactions 54

X
xor operator 87
