Crossing the Full-Text Search / Fielded Data Divide from a Development Perspective
Reprinted with permission of PC AI Online Magazine V. 16 #5
For more information about PC AI Online Magazine, visit www.pcai.com.

Where individual PCs can store gigabytes of data, and enterprise Intranets and public sites terabytes of data, finding the correct document (or Web page) requires a complete arsenal of full-text indexed and fielded data search tools. While this combination makes sense for the end-user, from a development perspective, these two approaches to data are very different, the equivalent of “apples and oranges.” This article discusses methods for synthesizing the “apples” of full-text searching with the “oranges” of fielded data, using the dtSearch® Text Retrieval Engine as an example.

[Figure: Options for synthesizing the "apples" of full-text indexing with the "oranges" of fielded data searching from a development perspective. Panels: Self-Contained Documents with Fields; Separate Database and Documents; BLOB Data; Adding Fields “On-the-Fly” During Indexing; “Stored Fields” in Search Results.]

Apples and Oranges: Full-Text and Fielded Data Searching
A full-text search index is the democracy of data structures, holding every word in a document collection, along with its location in the document. A full-text indexed search can instantly retrieve a reference to the rarest of rare birds, and display this reference as a highlighted hit in the document. The retrieval process in a full-text indexed search is independent of whether the match appears in a document title, or a single footnote buried in a collection of Web-based files.

By contrast, information in a fielded database structure is more hierarchical than democratic. In an XML database, for instance, the rarest of rare birds might reside in a category tree many levels deep. In an SQL database, the rarest of rare birds might reside in a multi-level table matrix.

Precision searching of these databases means not only finding the rarest of rare birds anywhere in the database, but also finding the phrase (and highlighting the match) only when it appears in a highly specific field structure. In an XML database, the rarest of rare birds might appear in two separate North American and South American branches of a tree structure. Precision searching could find the rarest of rare birds in the North American tree branch, and not, for example, in the South American tree branch.

In summary, while precision full-text searching treats all hits as equal, precision fielded data searching of a structured database must make distinctions among the various places in the database where a hit appears. This leads to two questions: from a searching perspective, why is it important to bridge these two very different approaches to data retrieval; and from a development perspective, how is it possible to bridge them efficiently.

[Figure: A full-text search index treats all hits as equal. Precision searching of a database means finding a hit only when it appears in a highly specific field structure.]

Limits of Full-Text Searching
A bird watcher wanting to search a document collection for the effect of pesticides on the mating habits of North American hummingbirds might use the Boolean / proximity search:

(effect w/9 pesticides) and mating and North American and hummingbirds

This search would look for a document containing all of the following: the word effect within nine words of the word pesticides, the word mating, the phrase North American and the word hummingbirds. However, even this level of precision in searching might generate a number of false hits, or documents that contain terms that match the full-text search but are not really on topic. For example, the search could retrieve a document that simply mentions the effect on South American hummingbirds’ food supply of pesticides and North American partridges’ mating habits.

Additional Boolean logic, term weighting and other search techniques can further refine the full-text search to eliminate some false hits. A search, for example, could exclude documents containing South American hummingbirds by excluding entire documents that contain the phrase South American hummingbirds, or less drastically, by giving documents containing South American hummingbirds a negative term weighting. However, if an author’s name happened to be Hummingbird Q. Pesticide and he enjoys writing articles on the mating habits of North American gulls, these articles would still result in a large quantity of false hits. Yet, even if a searcher were aware of this author, avoiding
these false hits through full-text queries alone would not be an easy task.

An alternative to Boolean logic is natural language “clustering.” With this approach, once a search finds an applicable document, a follow-up search effectively enters the text of the entire document to look for other documents of that type or cluster. Natural language relevancy-ranked searching looks for all words in a search request or document, and ranks retrieved documents with matching terms by hit term density and rarity. For example, if hummingbird appears in thousands of documents, but pesticides only appears in dozens, the latter receives a much higher relevancy ranking in looking for matching cluster documents. Documents that contain both words would rank even higher. Unfortunately, this also includes articles by Hummingbird Q. Pesticide.

While false hits, or over-inclusiveness in full-text searching, are annoying, under-inclusiveness, or false misses, because of spelling variants, phrase variants, and the like is also a concern. Certain techniques can find word variants: stemming can find variants such as hummingbirding; fuzziness can sift through misspellings, such as hummingsbird; and thesaurus searching can find a Native American hummingbird synonym. However, at a certain point, extending the list of retrieved documents to encompass word variants will itself start resulting in false hits. Adding in fielded data components to encompass key search criteria assists in avoiding both false hits and false misses. For example, mating habits might reside in a

Sidebar: XML as a Database Format

To take full advantage of XML as a hierarchical data structure, dtSearch supports nested field searching. For example, sample dtSearch nested field searches over Shakespeare converted into an XML database might be:
• persona contains Henry
• scene/stagedir contains exeunt citizens
• scene/speech/line contains publius
• /play/title contains Henry the Fifth
• scene//line contains publius
• (henry the fifth) and (scene/speech/line contains publius)

The first example looks for any field entitled persona that contains Henry. The second search, containing the / as a field separator, looks for a field called stagedir containing exeunt citizens, with the stagedir field directly nested in a field called scene. The third example looks for a triple nested hierarchical scene/speech/line field sequence containing publius. The fourth example, starting with the /, looks for the play field at the top of the hierarchy, with a title field just beneath it containing Henry the Fifth.

The fifth example, with the //, looks for a field called line containing publius. In contrast to the other examples, which specify precise hierarchical sequences, in this example the line field could be anywhere from directly beneath the scene field, to nested at multiple levels of depth.

Finally, the last example combines full-text searching with nested field searching. This example would combine a full-text search for henry the fifth and a nested field search for scene/speech/line contains publius.
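
The nested field rules in this sidebar can be mimicked with Python's standard library. The following is only a sketch of the matching semantics, not dtSearch's implementation or API: the sample XML and the names `field_paths`, `compile_pattern`, and `search_fields` are hypothetical, and "contains" is approximated as a case-insensitive substring test.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature play; tag names follow the sidebar's examples.
PLAY_XML = """<play>
  <title>Henry the Fifth</title>
  <persona>King Henry</persona>
  <scene>
    <stagedir>Exeunt Citizens</stagedir>
    <speech><line>Publius, how now?</line></speech>
  </scene>
</play>"""


def field_paths(elem, ancestors=()):
    """Yield (path, text) for each element; path is the tuple of tags from the root."""
    path = ancestors + (elem.tag,)
    yield path, (elem.text or "").strip()
    for child in elem:
        yield from field_paths(child, path)


def compile_pattern(pattern):
    """Turn a nested-field pattern into a regex over '/tag/tag/...' strings.

    'a/b'  means b directly nested in a, anywhere in the tree;
    '/a/b' means a is the top-level field;
    'a//b' means b nested at any depth beneath a.
    """
    regex = "^/" if pattern.startswith("/") else "/"
    any_depth = False
    first = True
    for token in pattern.lstrip("/").split("/"):
        if token == "":  # an empty token comes from the '//' separator
            any_depth = True
            continue
        if not first:
            regex += "/(?:[^/]+/)*" if any_depth else "/"
        regex += re.escape(token)
        first = False
        any_depth = False
    return re.compile(regex + "$")


def search_fields(root, pattern, phrase):
    """Return paths of fields that match the pattern and contain the phrase."""
    matcher = compile_pattern(pattern)
    needle = phrase.lower()
    hits = []
    for path, text in field_paths(root):
        joined = "/" + "/".join(path)
        if matcher.search(joined) and needle in text.lower():
            hits.append(joined)
    return hits


root = ET.fromstring(PLAY_XML)
print(search_fields(root, "scene//line", "publius"))  # line at any depth beneath scene
print(search_fields(root, "scene/line", "publius"))   # empty: line is not directly beneath scene
```

The combined query in the last bullet would, under the same assumptions, amount to requiring both a full-text match for the phrase anywhere in the document and a non-empty result from `search_fields(root, "scene/speech/line", "publius")`.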

behavior field, North American might translate to a geographic field, and hummingbirds would be a bird type field, leaving pesticides for the full-text search component (assuming no pesticide field). The result is a much narrower margin of error.

Combining Full-Text and Fielded Data Searching
While combined hierarchical fielded data searching and “democratic” full-text searching can improve search precision, the question remains how to synthesize these two very different approaches to data from a development perspective.

Option 1: Self-Contained Documents with Fields
The easiest approach to bridging the full-text / fielded data search divide is to have each document in a collection contain its own fielded or meta data. For example, the bird document collection could contain HTML, PDF, or Microsoft Office files (word processor, spreadsheet, presentation, etc.). Each document in the collection could contain a bird type field, a behavior field, and a geographic field, along with searchable full-text content.

For indexing efficiency, this solution requires a single pass over each document in the collection to pick up fields and full-text data. Another advantage to this approach is its organizational flexibility. Since each document contains its own fields, it represents its own self-contained unit. Even after removing the document from the group, its fielded information remains. This advantage is distinctly not present in most of the other options described below.

While storing fields inside the documents certainly has its advantages, getting to this result in a large document collection may not be an easy process. Simply because a file format supports meta data does not mean that all documents of that file type will already include such data. The size of a document collection as well as perhaps the diversity of document types can make editing each document in the collection to add fields prohibitively time consuming.

Finally, the fielded data itself may require a more complex structure than the underlying documents support. Classifying the fielded data may require a table structure or hierarchical data classification, which does not work in the limited fielded data options that, for example, the PDF format provides. In that case, a separate database structure may offer a more flexible option for fielded data storage.

Option 2: Separate Database and Documents
This option stores meta information, pertaining to each file, within a separate database structure such as SQL or XML. Database entries can include fielded data information for each document, such as bird type, behavior, and geographic area. Along with the fielded data, the database would also store pointers to associate each set of fields to the correct document.

One approach in dtSearch for linking this structure to full-text index data is through a database access library such as Microsoft’s ADO.NET. An integrating ADO.NET application iterates over every row of every table or field in an SQL database, associating each field with the relevant document identifier. For example, specific fielded data entries for bird type, behavior, and geographic area would all correspond to specific document designations.

The full-text search index then incorporates the ADO.NET fielded data components along with the document pointers. In this manner, the document pointers act as the bridge between the full-text component and the fielded data component. Because an actual structured database holds the meta or fielded data, this approach supports a more complex relational structure to the field components.

The separate database and documents approach also preserves the original documents, and any existing fields inside of the original documents, for indexing and searching. This preservation of the original fields is an advantage this approach shares with the following option.

Option 3: BLOB Data
The previous option stored meta information pertaining to each document in a database structure, along with pointers to

that document. This option stores the full document copy, called BLOB data, in the database along with its fielded data. BLOB data can be anything from a raw text file to a structured file type such as a word processor, spreadsheet or presentation document.

The indexing mechanism for BLOB data is similar to indexing a standalone file. It also has the benefit of supporting any preexisting fields inside a BLOB document, along with the fields in the database. For example, if the database stored a word processor document as BLOB data, indexing supports the title and subject fields contained within that document, in addition to fielded data in the database.

Whether using document pointers or BLOB data, tools such as ADO.NET for accessing database fields are fairly advanced and easy to use. On the negative side, the database fields approach requires a separate database, resulting in substantial maintenance overhead. A separate database also eliminates the efficiency of single-pass integrated indexing of documents and fields.

Option 4: Adding Fields “On-the-Fly” During Indexing
Another alternative for combining full-text and fielded data searching is to add document attributes while indexing. In contrast to the above examples, these new fields are not part of the documents themselves, nor do they require storage in a separate database. Rather, these fields simply become a part of the full-text search index.

A special function stores these document attributes upon indexing, in addition to the full-text data. As with adding fields to the documents themselves, indexing returns to a seamless, one-pass operation. As with the BLOB and separate database approaches, the dynamic addition of fields upon indexing supports preexisting fielded data already inside a document, in addition to newly added fields.

Adding attributes while indexing requires the matching of specific fields to existing documents. The easiest approach to obtaining the fielded data information for dynamic addition upon indexing is to require the entry of certain attributes when a document joins the collection. For example, a separate document management interface could require entry of fields such as bird type, behavior, and geographic area when checking in a document.

Dynamically adding fields to documents during indexing, similar to adding fields internally to documents, has the disadvantage of requiring potentially cumbersome individual attribute assignments to each document before indexing. However, this dynamic approach is more efficient for one key reason: it adds fields independent of the actual documents. Therefore, the dynamic indexing approach does not require opening, editing and then closing each individual document to add the fields, resulting in yet another major benefit relative to adding fields inside each file. Under the dynamic approach, the original documents remain untouched, preserving the original documents intact for archival purposes.

Option 5: Modifying Search Results Data to Add “Stored Fields”
The above options for combining fielded and full-text searching focus on the text retrieval engine’s method for executing a query. Another alternative is to focus on the presentation of search results after a query. This alternative relies on fields and other methods to more efficiently sift through a data set following a search.

Drawing a sharp line between the presentation of search results and search techniques is, of course, impossible. For example, natural language

relevancy ranking is both an integral part of the search process and an integral part of the retrieved document sorting that underlies the presentation of search results. However, while the execution of a search works on programmatic autopilot, after a search, sifting through a retrieved data set relies on the human factor. Adding fielded data into this phase provides valuable information for the human search results browser to use in separating relevant items from false hits. “Stored fields” in search results, where a stored field information tag describes the contents of each document, is an example of this human-oriented approach.

If a search retrieved false hits in the form of articles by Hummingbird Q. Pesticide, this fact would become immediately apparent through a stored author fields tag. The user could then skip over the Hummingbird Q. Pesticide documents without even bothering to browse them. In this way, the stored fields approach effectively offers similar benefits to those of narrowing the scope of a search by entering specific fields.

Unlike the other approaches in this article, however, stored fields upon search results achieves its benefits without requiring specific fielded data elements in a text query. Because of its simplicity from an end-user perspective, stored fields is an ideal approach for experienced searchers and novices alike.

From a development perspective, the same function for adding dynamic document attributes upon indexing can also store document attributes as fields upon search results presentation. In other words, the same code that adds a bird type field, a behavior field, and a geographic field dynamically during indexing also serves to create the information tags in search results.

The option for adding fields on-the-fly dynamically while indexing thus serves double duty: in the automatic execution search-request phase and in the human-driven search results browsing phase. The end-result is a two-fold bridge of the gap between full-text and fielded data searching.

Please visit dtSearch online at www.dtsearch.com

Sidebar: Sample Objects for Document Classification

In the dtSearch Engine, an "xfilter" can combine a full-text query with a filter for specific document attributes, such as file name, date, or size, or the presence in the document of a word or field. The field component can consist of a standard document attribute, or an attribute that dtSearch adds "on the fly" while indexing.

Search: (user request) and xfilter(name "abc*.html")
Results: This query would match any document that contains (user request) with a file name matching abc*.html

Search: (user request) and xfilter(word "projectxyz")
Results: This query would match any document that contains (user request) and that also contains the word projectxyz

Search: (user request) and (xfilter(word "Type::projectx") and xfilter(word "classification::high"))
Results: This final query adds two field restrictions to the (user request): one for a named field called type with an entry of projectx, and the second for a named field called classification with an entry of high.

A dtSearch SearchFilter uses an in-memory object, consisting of a table of bit vectors, to achieve similar results to that of an xfilter.
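
The bit-vector idea behind a filter like the SearchFilter mentioned in this article can be sketched in Python. This is an illustration under stated assumptions, not the dtSearch API: the class `BitFilter`, the helper functions, the sample documents, and their field values are all hypothetical. Each (field, value) pair gets one integer used as a bit vector, with bit i set when document i carries that value, so combining field restrictions becomes a bitwise AND.

```python
from collections import defaultdict

# Hypothetical collection: a document's id is its position in the list.
DOCS = [
    {"text": "pesticides and projectxyz",
     "fields": {"type": "projectx", "classification": "high"}},
    {"text": "hummingbirds and pesticides",
     "fields": {"type": "projectx", "classification": "low"}},
    {"text": "partridges",
     "fields": {"type": "projecty", "classification": "high"}},
]


class BitFilter:
    """Maps each (field, value) pair to an int whose bit i is set if document i has it."""

    def __init__(self, docs):
        self.size = len(docs)
        self.vectors = defaultdict(int)
        for doc_id, doc in enumerate(docs):
            for field, value in doc["fields"].items():
                self.vectors[(field, value)] |= 1 << doc_id

    def select(self, **criteria):
        """AND together the bit vectors for every field=value restriction."""
        result = (1 << self.size) - 1  # start with every document selected
        for field, value in criteria.items():
            result &= self.vectors.get((field, value), 0)
        return result


def doc_ids(bits):
    """Expand a bit vector into the list of selected document ids."""
    return [i for i in range(bits.bit_length()) if bits >> i & 1]


def search(docs, word, allowed_bits):
    """Full-text word match, restricted to documents whose bit is set in the filter."""
    return [i for i, doc in enumerate(docs)
            if allowed_bits >> i & 1 and word in doc["text"].split()]


bit_filter = BitFilter(DOCS)
high_projectx = bit_filter.select(type="projectx", classification="high")
print(doc_ids(high_projectx))
print(search(DOCS, "pesticides", high_projectx))
```

Because the filter is built once and held in memory, the same vectors can be reused across many full-text queries; only the word match runs per query.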
