IRS Study Material
• Item (Data):
The smallest complete textual unit processed and manipulated by an IR system.
What constitutes an item depends on how a specific source treats information.
Overhead:
The time a user spends in all of the steps leading to reading an item containing
needed information, excluding the time for actually reading the relevant data
• Query generation
• Search composition
• Search execution
• Scanning results of query to select items to read
Query Generation:
The user enters keywords, e.g., "Sachin Tendulkar", "Best Restaurants", etc., describing the information needed; the system will provide relevant information to the user.
Search Composition:
The query is composed and checked against the given text to determine whether the user's required information can be found.
Search Execution:
The execution procedure for finding relevant information starts automatically.
Scanning Results of Query to Select Items to Read:
The user scans the results of the query, and the required information is presented on the user's screen.
An Information Retrieval System consists of a software program that facilitates a user in finding the information the user needs.
The system may use standard computer hardware or specialized hardware to support the search sub-function and to convert non-textual sources to searchable media.
Overhead is expressed as the time a user spends in all of the steps leading to reading an item containing the needed information.
Overhead, from the user's perspective, is the time required to find the information.
Thus, search composition, search execution, and reading non-relevant items are all aspects of IR overhead.
Relevant Item:
In IRS the term "relevant" item is used to represent an item containing the needed information.
Measures Used:
The two major measures commonly associated with information systems are:
Precision
Recall
Precision:
The ability to retrieve top-ranked documents that are mostly relevant.
Recall:
The ability to retrieve all of the relevant items in the database.
Precision = Number_Retrieved_Relevant / Number_Total_Retrieved
Recall = Number_Retrieved_Relevant / Number_Possible_Relevant
The two major measures commonly associated with information systems are precision and recall. When a user decides to issue a search looking for information on a topic, the total database is logically divided into the four segments shown in Figure 1.1. Relevant items are those documents that contain information that helps the searcher in answering his question. Non-relevant items are those items that do not provide any directly useful information. There are two possibilities with respect to each item: it can be retrieved or not retrieved by the user's query. Precision and recall are defined as shown above.
Figure 1.1 Effects of Search on Total Document Space
Searching an Item:
For a search on a topic, the total database is logically divided into 4 segments,
where Number_Possible_Relevant is the number of relevant items in the database,
Number_Total_Retrieved is the total number of items retrieved by the query, and
Number_Retrieved_Relevant is the number of items retrieved that are relevant to the user's search need.
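As a quick illustration, the two measures can be computed as below; this is a minimal sketch, with hypothetical function and variable names chosen to mirror the definitions above.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one search.

    retrieved: set of item ids returned by the query
    relevant:  set of relevant item ids in the database (Number_Possible_Relevant)
    """
    retrieved_relevant = len(retrieved & relevant)   # Number_Retrieved_Relevant
    precision = retrieved_relevant / len(retrieved)  # divide by Number_Total_Retrieved
    recall = retrieved_relevant / len(relevant)      # divide by Number_Possible_Relevant
    return precision, recall

# Example: 10 items retrieved, 4 of them relevant; 8 relevant items exist in total.
p, r = precision_recall(set(range(10)), {1, 3, 5, 7, 11, 12, 13, 14})
print(p, r)  # 0.4 0.5
```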
Precision measures one aspect of the information retrieval overhead for a user associated with a particular search.
Two More Objectives of IR Systems:
• Support of user search generation: how to specify the information a user needs.
• Language ambiguities – e.g., the word "field".
• The vocabulary corpus of a user and of item authors: the system must assist users, automatically and through interaction, in developing a search specification that represents the needs of the users and the writing styles of diverse authors.
• How to present the search results in a format that facilitates the user in determining relevant items:
A) Ranking in order of potential relevance
B) Item clustering and link analysis.
The major processes of a total Information Retrieval System (Figure 1.4) are:
Item normalization,
Selective dissemination of information (i.e., "mail"),
Archival document database search, and
An index database search along with the automatic file build process that supports index files.
Figure 1.4 Total Information Retrieval System
Standardize Input:
• Standardizing the input takes the different external formats of input data and performs the translation to the formats acceptable to the system.
• Translate foreign languages into Unicode: this allows a single browser to display the languages and potentially a single search system to search them.
• Translate multi-media input into a standard format:
Video: MPEG-2, MPEG-1, AVI, Real Video…
Audio: WAV, Real Audio
Image: GIF, JPEG, BMP…
• Parse the item into logical sub-divisions that have meaning to the user: Title, Author, Abstract, Main Text, Conclusion, References, Country, Keyword…
• These zones are visible to the user and are used to increase the precision of a search and to optimize the display. The zoning information is passed to the processing-token identification operation, which stores the information, allowing searches to be restricted to a specific zone and allowing display of the minimum data required from each item to determine its possible relevance (display zones such as Title, Abstract…).
• Identify the information that is used in the search process – processing tokens (better than words).
• The first step is to determine a word, dividing the input symbols into three classes:
• Valid word symbols: alphabetic characters, numbers
• Inter-word symbols: blanks, periods, semicolons (non-searchable)
• Special processing symbols: hyphen (-)
A word is defined as a contiguous set of word symbols bounded by inter-word symbols.
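A minimal sketch of this word-boundary rule follows; the exact symbol classes are an assumption based on the three classes listed above.

```python
import re

# Assumed classes: letters/digits are word symbols; the hyphen is a special
# processing symbol kept inside a token; everything else is inter-word.
WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokenize(text):
    """Return contiguous runs of word symbols bounded by inter-word symbols."""
    return WORD.findall(text)

print(tokenize("Follow-up tests; see section 2."))
# ['Follow-up', 'tests', 'see', 'section', '2']
```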
Stop Algorithm:
• Save system resources by eliminating from the set of searchable processing tokens those that have little value to the search, i.e., those whose frequency and/or semantic use make them of no use as searchable tokens:
• Any word found in almost every item
• Any word found only once or twice in the database
Zipf's law: Frequency * Rank = Constant
Stop algorithm vs. stop list
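A sketch of the stop-list variant (the word list here is a small illustrative assumption, not any standard list):

```python
STOP_LIST = {"the", "a", "an", "of", "and", "are", "most"}  # illustrative subset

def remove_stop_words(tokens):
    """Drop high-frequency tokens that have little value as searchable tokens."""
    return [t for t in tokens if t.lower() not in STOP_LIST]

print(remove_stop_words(["search", "engines", "are", "the", "most", "visible"]))
# ['search', 'engines', 'visible']
```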
Characterize Tokens:
• Identify any specific word characteristics: word-sense disambiguation, part-of-speech tagging.
• Uppercase – proper names, acronyms, and organizations; numbers and dates.
Stemming Algorithm:
Document Database Search:
Provides the capability for a query to search against all items received by the system.
Composed of the search process, user-entered queries, and the document database.
The document database contains all items that have been received, processed, and stored by the system. Usually items in the document database do not change. It may be partitioned by time to allow archiving by time partition.
Queries differ from profiles in that they are typically short and focused on a specific area of interest.
When an item is determined to be of interest, a user may want to save it (file it) for future reference; this is accomplished via the index process.
In the index process, the user can logically store an item in a file along with additional index terms and descriptive text the user wants to associate with the item. An index can reference the original item or contain substantive information on the original item, similar to a card catalog in a library.
The Index Database Search Process provides the capability to create indexes and search them.
The user may search the index and retrieve the index and/or the document it references.
The system also provides the capability to search the index and then search the items referenced by the index records that satisfied the index portion of the query: a combined file search.
In an ideal system the index record could reference portions of items versus the total item.
There are two classes of index files: public and private index files. Every user can have one or more private index files, leading to a very large number of files, each referencing only a small subset of the total number of items in the document database. Public index files are maintained by professional library services personnel and typically index every item in the document database.
The capability to create private and public index files is frequently implemented via a structured Database Management System (RDBMS).
To assist the users in generating indexes, the system provides a process called Automatic File Build (Information Extraction).
It processes selected incoming documents and automatically determines potential indexing for the item:
• Authors, date of publication, source, and references.
The rules that govern which documents are processed for extraction of index information, and the index-term extraction process itself, are stored in Automatic File Build profiles. When an item is processed, it results in the creation of candidate index records, which are reviewed and edited by a user prior to the actual update of an index file.
1.4 Relationship to Database Management Systems:
1. An Information Retrieval System is software that has the features and functions required to manipulate "information" items, versus a DBMS that is optimized to handle "structured" data.
2. Structured data is well-defined data (facts) typically represented by tables. There is a semantic description associated with each attribute within a table that well defines that attribute. For example, there is no confusion between the meaning of "employee name" or "employee salary" and what values to enter in a specific database record. On the other hand, if two different people generate an abstract for the same item, they can be different. One abstract may generally discuss the most important topic in an item. Another abstract, using a different vocabulary, may specify the details of many topics. It is this diversity and ambiguity of language that distinguishes information items from structured data.
3. With structured data a user enters a specific request and the results returned provide the user with the desired information. The results are frequently tabulated and presented in a report format for ease of use. In contrast, a search of "information" items has a high probability of not finding all the items a user is looking for. The user has to refine his search to locate additional items of interest. This process is called "iterative search".
Two other systems frequently described in the context of information retrieval are digital libraries and data warehouses.
DIGITAL LIBRARIES:
Libraries have always been concerned with storing and retrieving information in the media it is created on.
As the quantities of information grew exponentially, libraries were forced to make maximum use of electronic tools to facilitate the storage and retrieval process. With the worldwide interconnection of libraries and information sources (e.g., publishers, news agencies, etc.) via the Internet, more focus has been placed on the concept of an electronic library.
List of software for Digital Libraries:
KOHA
BIBLIOTEQ
PMP
Indexing is one of the critical disciplines in library science, and significant effort has gone into the establishment of indexing and cataloging standards.
The migration of many of the library products to a digital format introduces both opportunities and challenges; having the full text of items available for search changes the index process.
Another important library service is providing search intermediaries to assist users in finding information.
Information storage and retrieval technology has addressed only a small subset of the issues associated with digital libraries; the focus has been on the search and retrieval of textual data, with no concern for establishing standards on the contents of the system.
DATA WAREHOUSES:
A data warehouse is a type of data management system that is designed to enable and support business intelligence activities, especially analytics. Data warehouses are solely intended to perform queries and analysis and often contain large amounts of historical data.
A data warehouse is a group of data specific to the entire organization, not only to a particular group of users.
It is not used for daily operations and transaction processing, but for making decisions.
1.6 Information Retrieval Systems Capabilities :
1.6.1 Querying:
Main paradigms:
Query term sets
Query terms connected with Boolean operations
Weighted terms
Relaxation or restriction of term matching
Term expansion
Natural language
Boolean Queries :
Select all documents that discuss computers, or documents that discuss servers
that do not discuss mainframes.
Select all documents that discuss computers or servers, do not select any
documents that discuss mainframes.
Select all documents that discuss computers, and do not discuss either servers
or mainframes.
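A sketch of how these three selections might be evaluated over a small collection (the mini-collection and the helper function are illustrative assumptions):

```python
docs = {  # hypothetical mini-collection: item id -> set of index terms
    1: {"computers"},
    2: {"servers"},
    3: {"servers", "mainframes"},
    4: {"computers", "mainframes"},
}

def select(predicate):
    return {d for d, terms in docs.items() if predicate(terms)}

# computers OR (servers AND NOT mainframes)
q1 = select(lambda t: "computers" in t or ("servers" in t and "mainframes" not in t))
# (computers OR servers) AND NOT mainframes
q2 = select(lambda t: ("computers" in t or "servers" in t) and "mainframes" not in t)
# computers AND NOT (servers OR mainframes)
q3 = select(lambda t: "computers" in t and not ("servers" in t or "mainframes" in t))
print(q1, q2, q3)  # {1, 2, 4} {1, 2} {1}
```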
Weighting:
A weight is associated with each term.
Weighted Term:
Term weighting is a procedure that takes place during the text-indexing process in order to assess the value of each term to the document. Term weighting is the assignment of numerical values to terms that represent their importance in a document, in order to improve retrieval effectiveness.
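A common concrete scheme is TF-IDF (named here as a standard example; the exact formula varies between systems). A minimal sketch:

```python
import math

def tf_idf(term, doc_tokens, all_docs):
    """Weight = term frequency in the document * log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in all_docs if term in d)  # number of docs containing the term
    return tf * math.log(len(all_docs) / df) if df else 0.0

docs = [["search", "engine"], ["search", "retrieval"], ["cooking"]]
print(tf_idf("search", docs[0], docs))  # common term -> lower weight (~0.405)
print(tf_idf("engine", docs[0], docs))  # rarer term  -> higher weight (~1.099)
```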
Natural Language :
Natural language, understood as a tool that people use to express themselves, has specific properties that
reduce the efficacy of textual information retrieval systems. These properties are linguistic variation and
ambiguity. By linguistic variation we mean the possibility of using different words or expressions to
communicate the same idea. Linguistic ambiguity is when a word or phrase allows for more than one
interpretation.
1.6.2 Browsing :
Browsing can be defined as an interactive search activity in which the direction of the search is
determined by the user on the basis of immediate feedback from the system being browsed. Most users of most
information retrieval systems exhibit browsing behavior no matter what the underlying system structure.
Vocabulary browse:
Vocabulary browse provides the capability to display, in alphabetically sorted order, words from the document database.
Logically, all unique words (processing tokens) in the database are kept in sorted order along with a count of the number of unique items in which each word is found.
It makes the search procedure easier; for example, it helps correct the entered word "computen" when the user really meant "computer".
Iterative search (query refinement):
The result of a previous search is subjected to a new query; this is the same as repeating the previous query with additional conditions.
Canned (stored) queries:
Users tend to reuse previous queries, so the system allows users to store previously-used queries and incorporate them into new queries. Canned queries tend to be large.
Information Retrieval Systems
UNIT-2
CATALOGING AND INDEXING
CATALOGING:
Cataloging is a systematic arrangement of items in an alphabetical or other logical order, including brief descriptions.
A catalogue is the record of the collection in the library.
A library catalogue is a list of books and other reading material available in a particular library.
The card catalogue has been a familiar sight to library users for generations, but it has been effectively replaced by the online public access catalog.
TYPES OF CATALOGUES:
Author Catalogues
Title Catalogues
Author/Title Catalogues
Subject Catalogues
Author Catalogues:
The author catalogues contain entries with author names as the heading. Authors may be persons or corporate bodies, and the term author is normally extended to include writers, illustrators, performers, producers, translators, and others with some intellectual or artistic responsibility for a work.
E.g.: Vikas Publishing Pvt Ltd, SIA Publications
Title Catalogues:
The title catalogue has entries with the title as the heading. Some libraries and information centers make title entries for all items being indexed, but in other situations title entries are made selectively for only some material.
Author/Title Catalogues:
The author/title catalogues contain both title and author entries. As both titles and authors' names are in alphabetical order, it is easy to file authors' names and titles together as headings.
Subject Catalogues:
Subject catalogues have an indication of the subject of the documents being indexed as their headings. The entries are arranged in an appropriate systematic order.
Ex:
Cars, Lawyers: these entries are arranged alphabetically according to the subject heading.
INDEXING:
Indexing is an important process in Information Retrieval Systems.
It forms the core functionality of the IR process: since it is the first step in IR, it assists in efficient information retrieval. Indexing reduces the documents to the informative terms contained in them.
The transformation from received item to searchable data structure is called indexing.
• The process can be manual or automatic.
• It creates either a direct search of the document database or an indirect search through index files.
• Concept-based representation: instead of transforming the input into a searchable format, some systems transform the input into a different, concept-based representation, and search and return items in terms of those concepts.
History of indexing:
The history of indexing shows the dependency of information-processing capabilities on manual and then automatic processing systems.
• Indexing was originally called cataloging: the oldest technique to identify the contents of items to assist in their retrieval.
• Cataloging and indexing are similar techniques, as both are systematic arrangements of items in alphabetical order.
• Items overlap between full-item indexing and public and private indexing of files.
Objectives of Indexing:
The public file indexer needs to consider the information needs of all users of the library system. Items overlap between full-item indexing and public and private indexing of files.
• Users may use public index files as part of their search criteria to increase recall.
• The primary objective is representing the concepts within an item to facilitate users finding relevant information.
Two decisions must be made in indexing:
1. Decide the scope of the indexing and the level of detail to be provided, based on the usage scenarios of the users.
2. Decide whether to link index terms together in a single index for a particular concept.
Indexing process:
The indexing process is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science.
TEXT PROCESSING
Text process phases
1. Document Parsing: Documents come in all sorts of languages, character sets, and
formats; often, the same document may contain multiple languages or formats, e.g., a
French email with Portuguese PDF attachments. Document parsing deals with the
recognition and “breaking down” of the document structure into individual components.
In this pre processing phase, unit documents are created; e.g., emails with attachments are
split into one document representing the email and as many documents as there are
attachments.
2. Lexical Analysis: After parsing, lexical analysis tokenizes a document, seen as an
input stream, into words. Issues related to lexical analysis include the correct
identification of accents, abbreviations, dates, and cases. The difficulty of this operation
depends much on the language at hand: for example, the English language has neither
diacritics nor cases, French has diacritics but no cases, German has both diacritics and
cases. The recognition of abbreviations and, in particular, of time expressions would
deserve a separate chapter due to its complexity and the extensive literature in the field.
3. Stop-Word Removal: A subsequent step optionally applied to the results of lexical
analysis is stop-word removal, i.e., the removal of high-frequency words. For example,
given the sentence “search engines are the most visible information retrieval applications”
and a classic stop words set such as the one adopted by the Snowball stemmer, the effect
of stop-word removal would be: “search engine most visible information retrieval
applications”.
4. Phrase Detection: This step captures text meaning beyond what is possible with pure
bag-of-words approaches, thanks to the identification of noun groups and other phrases.
Phrase detection may be approached in several ways, including rules (e.g., retaining terms
that are not separated by punctuation marks), morphological analysis , syntactic analysis,
and combinations thereof. For example, scanning our example sentence “search engines
are the most visible information retrieval applications” for noun phrases would probably
result in identifying “search engines” and “information retrieval”.
5. Stemming and Lemmatization: Following phrase extraction, stemming and
lemmatization aim at stripping down word suffixes in order to normalize the word.
Stemming removes word endings. In particular, stemming is a heuristic process that
"chops off" the ends of words in the hope of achieving the goal correctly most of the time;
a classic rule-based algorithm for this was devised by Porter.
According to the Porter stemmer, our example sentence “Search engines are the most
visible information retrieval applications” would result in: “Search engine are the most
visible inform retrieval application”.
• Lemmatization is a process that typically uses dictionaries and morphological analysis of
words in order to return the base or dictionary form of a word, thereby collapsing its
inflectional forms (see, e.g., [278]). For example, our sentence would result in “Search
engine are the most visible information retrieval application” when lemmatized according
to a WordNet-based lemmatizer
6. Weighting: The final phase of text pre processing deals with term weighting. As
previously mentioned, words in a text have different descriptive power; hence, index
terms can be weighted differently to account for their significance within a document
and/or a document collection. Such a weighting can be binary, e.g., assigning 0 for term
absence and 1 for presence.
SCOPE OF INDEXING
• When the indexing is performed manually, problems arise from two sources: the author and the indexer.
• The vocabulary domains of the author and the indexer may be different.
• This results in different quality levels of indexing.
• The indexer must determine when to stop the indexing process.
• Two factors decide the level to which to index the concepts in an item:
• The exhaustivity and the specificity of the indexing desired.
• Exhaustivity of indexing is the extent to which the different concepts in the item are indexed.
• For example, if two sentences of a 10-page item on microprocessors discuss on-board caches, should this concept be indexed?
• Specificity relates to the preciseness of the index terms used in indexing.
• For example, whether the term "processor", the term "microcomputer", or the term "Pentium" should be used in the index of an item is based upon the specificity decision.
• Indexing an item only on the most important concept in it, and using general index terms, yields low exhaustivity and specificity.
• Another decision in indexing is what portion of an item is to be indexed. The simplest case is to limit the indexing to the title and abstract (conceptual) zones.
• General indexing leads to loss of precision and recall.
PRECOORDINATION AND LINKAGES
• Another decision in the linkage process is whether linkages are available between index terms for an item.
• Linkages are used to correlate attributes associated with concepts discussed in an item. When this is done at index time, the process is called precoordination.
• When index terms are not coordinated at index time, the coordination occurs at search time. This is called postcoordination, implemented by "AND"ing index terms together.
• A factor that must be determined in the linkage process is the number of terms that can be related.
• Ex.: an item discusses "the drilling of oil wells in Mexico by CITGO and the introduction of oil refineries in Peru by the U.S."
AUTOMATIC INDEXING
Automatic indexing is the computerized process of scanning large volumes of documents against a controlled vocabulary, taxonomy, or ontology, and using those controlled terms to quickly and effectively index large electronic document repositories.
• The simplest case is total document indexing.
• Automatic indexing requires only a few seconds, depending on the processor and the complexity of the algorithms used to generate the indexes.
• Indexes resulting from automated indexing fall into two classes: weighted and unweighted.
• Weighted indexing system: an attempt is made to place a value on each index term associated with a concept in the document, based on the frequency of occurrence of the term in the item.
• Unweighted indexing system: the existence of an index term in a document, and sometimes its word location, are kept as part of the searchable data structure.
• Values are normalized between 0 and 1.
• The results are presented to the user in order of rank value, from the highest number to the lowest number.
Indexing By Term
• Terms (vocabulary) of the original item are used as the basis of the index process.
• There are two major techniques for creation of the index: statistical and natural language.
• Statistical techniques can be based upon vector models and probabilistic models, with a special case being the Bayesian model (accounting for the uncertainty inherent in the model selection process).
• They are called statistical because their calculation of weights uses information such as the frequency of occurrence of words.
• Natural language techniques also use some statistical information, but perform more complex parsing to define the final set of index concepts.
• Other weighted systems are discussed as vectorized information systems.
• These systems emphasize weights as a foundation for information detection and store these weights in vector form.
• Each vector represents a document, and each position in a vector represents a unique word (processing token) in the database.
• The value assigned to each position is the weight of that term in the document.
• 0 indicates that the word was not in the document.
• Search is accomplished by calculating the distance between the query vector and the document vectors.
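A minimal sketch of this distance calculation using the cosine measure (the vectors and weights below are illustrative assumptions):

```python
import math

def cosine_similarity(query_vec, doc_vec):
    """Similarity between query and document vectors
    (1.0 = same direction, 0.0 = no terms in common)."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm = math.sqrt(sum(q * q for q in query_vec)) * \
           math.sqrt(sum(d * d for d in doc_vec))
    return dot / norm if norm else 0.0

# Positions correspond to unique words in the database,
# e.g. [computers, servers, mainframes]; weights are illustrative.
print(cosine_similarity([1.0, 0.5, 0.0], [0.8, 0.4, 0.0]))  # 1.0: same topic mix
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 0.0, 0.7]))  # 0.0: disjoint terms
```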
• Bayesian approach: based on evidential reasoning (drawing conclusions from evidence).
• It could be applied as part of index-term weighting, but it is usually applied as part of the retrieval process by calculating the relationship between an item and a specific query.
• Graphic representation: each node represents a random variable; arcs between the nodes represent probabilistic dependencies between a node and its parents.
• Two-level Bayesian network:
• "c" represents a concept in a query
• "f" represents concepts in an item
• Concept-vector approaches use a neural network to learn word relationships (synonyms) and use that information in generating context vectors.
• Two neural networks are used: one to generate stem context vectors and another to perform the query.
• The interpretation is the same as for weights.
• Multimedia indexing:
• Indexing video or images can be accomplished at the raw data level.
INFORMATION EXTRACTION
There are two processes associated with information extraction:
• 1.determination of facts to go into structured fields in a database and
• 2. Extraction of text that can be used to summarize an item.
• The process of extracting facts to go into indexes is called Automatic File Build.
• In establishing metrics to compare information extraction, precision and recall are applied
with slight modifications.
• Recall refers to how much information was extracted from an item versus how much should have been extracted from the item.
• It shows the amount of correct and relevant data extracted versus the correct and relevant
data in the item.
• Precision refers to how much information was extracted accurately versus the total
information extracted.
• Additional metrics used are over generation and fallout.
• Over generation measures the amount of irrelevant information that is extracted.
• This could be caused by templates filled on topics that are not intended to be extracted or
slots that get filled with non-relevant data.
• Fallout measures how much a system assigns incorrect slot fillers as the number of potential incorrect slot fillers increases.
• These measures are applicable to both human and automated extraction processes.
• Another related information technology is document summarization.
• Rather than trying to determine specific facts, the goal of document summarization is to
extract a summary of an item maintaining the most important ideas while significantly
reducing the size.
• Examples of summaries that are often part of any item are titles, table of contents, and
abstracts with the abstract being the closest.
• The abstract can be used to represent the item for search purposes or as a way for a user to
determine the utility of an item without having to read the complete item.
DATA STRUCTURES
• Introduction to Data Structures
• Stemming Algorithms
• Inverted File Structure
• N-Gram Data Structure
• PAT Data Structure
• Signature File Structure
• Hypertext and XML Data Structures
• The knowledge of a data structure gives an insight into the capabilities available to the system.
• Each data structure has a set of associated capabilities:
1. Ability to represent the concepts
2. Support for locating those concepts
Item Normalization:
Item normalization converts incoming items to a standard format: whatever the user is searching for, the user's keywords are converted into a format the system understands rather than being used exactly as entered.
Document Manager:
A document manager is a system used to receive, track, manage, and store documents and reduce paper. Most are capable of keeping a record of the various versions created and modified by different users. In the case of management of digital documents, such systems are based on computer programs.
Processing Tokens:
Identify the information that is used in the search process – processing tokens.
Special processing symbols: hyphen (-)
(Figure: total system data flow – item normalization, document file creation, document manager, and search manager.)
The results of a search are references to the items that satisfy the search statement; these are passed to the document manager for retrieval.
Focus: the data structures that support the search function.
STEMMING ALGORITHMS:
Stemming is, in essence, cutting and trimming word endings.
The concept of stemming was introduced in the 1960s.
The main goal of stemming was to improve performance and require fewer system resources by reducing the number of unique words that a system has to contain.
Stemming algorithms are used to improve the efficiency of the information system and to improve recall. Effects of stemming:
Reduces precision
Increases recall
Reduces diversity
Increases search efficiency
Conflation: the mapping of multiple morphological variants of a word to a single representation (the stem).
Stemming variations:
Table lookup stemming
Porter stemming
Dictionary stemming (K-stemming)
Successor stemming
Table lookup stemming:
Uses large data structures
Ex: RetrievalWare
K-stemming example: INQUERY
Combines rules + dictionary words
Iterative in nature
Removes large prefixes & suffixes
Porter Stemming Algorithm:
The Porter stemming algorithm is based upon a set of conditions on the stem, suffix, and prefix, and associated actions given the condition.
Conditions:
The measure m of a stem is a function of sequences of vowels (A, E, I, O, U, Y) followed by a consonant.
If V is a sequence of vowels and C is a sequence of consonants, then every stem can be written in the form
[C] (VC)^m [V]
where the initial C and the final V are optional and m is the measure.
Other conditions:
*<X> – the stem ends with the letter X
*v* – the stem contains a vowel
*d – the stem ends with a double consonant
Uses wildcard characters.
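A small sketch of computing the measure m (simplified: it treats a, e, i, o, u as vowels and ignores the context-dependent handling of "y" in the full algorithm):

```python
def porter_measure(stem):
    """Count m in the decomposition [C](VC)^m[V] of a stem (simplified)."""
    pattern = "".join("v" if ch in "aeiou" else "c" for ch in stem.lower())
    # Each vowel-run followed by a consonant-run contributes one VC pair.
    return pattern.count("vc")

for word in ["tree", "trouble", "oaten", "private"]:
    print(word, porter_measure(word))
# tree 0, trouble 1, oaten 2, private 2
```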
Dictionary Stemming:
Also called K-stemming; dictionary based.
Used in INQUERY, where it is called KSTEM.
Avoids collapsing words with different meanings.
Uses NLP dictionaries.
Example: "British" maps to "Britain".
Returns words rather than truncated word forms, using only meanings it finds in its dictionaries.
Uses 6 major data files:
1. Dictionary of words
2. Supplemental list of words
3. Exceptions list of those words that should retain an "e" at the end (e.g., "suites" to "suite", but "suited" to "suit")
4. Direct conflation – allows definition of direct conflation via word pairs that override the stemming algorithm
5. Country-nationality conflations between nationalities and countries ("British" maps to "Britain")
6. Proper nouns – a list of proper nouns that should not be stemmed
Successor Stemming:
Successor stemming constructs a symbol tree from the words in the database, representing both prefixes and suffixes.
It implements 3 methods for determining where to break a word:
Cut-off
Peak & plateau
Complete word
Symbol tree for terms bag, barn, bring, box, bottle
Stemming – Conclusions:
Stemming is reasonably efficient, though its benefit depends on the nature of the vocabulary.
Stemming is as effective as manual conflation.
Stemming can affect retrieval (recall), and where effects were identified they were positive.
It has the potential to increase recall.
INVERTED FILE STRUCTURE:
The most common data structure used in both database management and information retrieval systems is the inverted file structure.
Inverted file structures are composed of three basic files:
Document file
Dictionary
Inversion lists
Features of Inverted Files:
Increased precision
Zoning support
Ranking support
Storage of concepts & relationships
Natural Language Processing (NLP) support
Increased precision:
The ability to retrieve top-ranked documents that are mostly relevant; the inverted structure makes it practical to prioritize results.
Zoning:
Logical subsetting: information can be stored for, and searches restricted to, a specific zone.
Ranking:
Results can be ranked per item, e.g., by file or document type (PDF, Word, etc.).
Concepts & relationships:
Inversion lists can store concepts and the relationships maintained between them.
Natural Language Processing:
NLP techniques can be applied, e.g., sentiment-oriented processing of emotions and views such as happy or sad.
Inversion list structures are used because they provide optimum performance in searching large databases.
Inversion list file structures are well suited to storing concepts and their relationships.
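A minimal sketch of the three components, using an in-memory dict as the dictionary (the documents are illustrative):

```python
from collections import defaultdict

documents = {  # document file: id -> text
    1: "sea colony",
    2: "colony of ants",
    3: "deep sea diving",
}

# Dictionary + inversion lists: word -> sorted list of document ids
index = defaultdict(list)
for doc_id, text in sorted(documents.items()):
    for word in sorted(set(text.split())):
        index[word].append(doc_id)

def posting_list(word):
    """Return the inversion list for a word."""
    return index.get(word, [])

print(posting_list("sea"))     # [1, 3]
print(posting_list("colony"))  # [1, 2]
```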
N-Gram Data Structures
N-grams treat the text as a continuous stream of characters rather than as words; word boundaries and word repetitions can be ignored.
N-gram:
An n-gram is a contiguous sequence of n characters from the text.
Trigram:
A trigram is an n-gram of 3 characters.
Inversion lists and document vectors can be used to store n-grams.
The maximum number of unique n-grams is bounded by the alphabet size raised to the power n.
The trigrams for "retail" are: ret, eta, tai, ail.
Disadvantage: longer n-grams give poorer results, and a single n-gram character carries very little strength of meaning.
Example for the text "sea colony":
Bigrams: se ea co ol lo on ny
Trigrams: sea col olo lon ony
Pentagrams (with inter-word symbol #): #sea# #colo colon olony lony#
Inter-word symbols (e.g., #) stand for the non-searchable symbols such as blanks.
Bigrams = 2 characters (no inter-word symbols)
Trigrams = 3 characters (acceptable with or without inter-word symbols)
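A sketch consistent with the example above (treating n-grams per word is an assumption; systems differ in whether n-grams may span word boundaries):

```python
def ngrams(text, n, pad=""):
    """Character n-grams per word; pad='#' adds inter-word symbols at
    word boundaries (used for longer n-grams such as pentagrams)."""
    grams = []
    for word in text.split():
        w = pad + word + pad
        grams += [w[i:i + n] for i in range(len(w) - n + 1)]
    return grams

print(ngrams("sea colony", 2))           # ['se', 'ea', 'co', 'ol', 'lo', 'on', 'ny']
print(ngrams("sea colony", 3))           # ['sea', 'col', 'olo', 'lon', 'ony']
print(ngrams("sea colony", 5, pad="#"))  # ['#sea#', '#colo', 'colon', 'olony', 'lony#']
```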
Advantages:
The first use of n-grams dates to World War II, when they were used by cryptographers.
Another major use of n-grams, in particular trigrams, is in spelling error detection and correction.
The frequency of occurrence of n-gram patterns can also be used to identify the language of an item.
Because of the processing-token bounds of n-gram data structures, optimized performance techniques can be applied in mapping items to an n-gram searchable structure and in query processing.
Disadvantage: there is no semantic meaning in a particular n-gram, since it is a fragment of a processing token and may not represent a concept. Thus n-grams are a poor representation of concepts and their relationships.
PAT data structure:
PAT (Practical Algorithm To retrieve information coded in alphanumeric) is a data structure that allows very efficient searching with pre-processing.
PAT structure, PAT tree, or PAT array: continuous-text input data structures (string oriented, like the n-gram data structure).
The input stream is transformed into a searchable data structure consisting of substrings; all substrings are unique.
Each position in the input string is an anchor point for a substring.
Using n-grams with inter-word symbols included between valid processing tokens equates to a continuous-text input data structure that is indexed in contiguous n-character tokens.
A different view of addressing continuous-text input data structures comes from PAT trees and PAT arrays.
In the creation of PAT trees, each position in the input string is the anchor point for a sistring (semi-infinite string) that starts at that point and includes all text up to the end of the input.
All sistrings are unique.
This view of text lends itself to many different search processing structures.
A substring can start at any point in the text and can be uniquely identified by its starting location and length.
A PAT tree is an unbalanced binary digital tree defined by the sistrings.
The individual bits of the sistrings decide the branching patterns, with zero branching left and one branching right.
PAT trees also allow each node in the tree to specify which bit is used to determine the branching, via a bit position.
A sistring is identified by the position in the text at which it starts, e.g.:
Text: Economics for Warsaw is complex.
sistring 1: Economics for Warsaw is complex.
sistring 2: conomics for Warsaw is complex.
sistring 5: omics for Warsaw is complex.
sistring 11: for Warsaw is complex.
sistring 25: complex.
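A minimal sketch of generating sistrings (1-based positions, matching the example above):

```python
def sistrings(text):
    """Each position in the input is the anchor for a semi-infinite string
    running to the end of the input."""
    return {i + 1: text[i:] for i in range(len(text))}

s = sistrings("Economics for Warsaw is complex.")
print(s[2])   # 'conomics for Warsaw is complex.'
print(s[25])  # 'complex.'
```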
In the creation of PAT trees, each position in the input string is the anchor point for a sub-string that starts at that point and includes all new text up to the end of the input.
PAT trees are binary trees, the most common class for prefix search; but PAT trees are sorted logically, which facilitates range search, and they are more accurate than inversion files.
PAT trees provide an alternate structure for supporting string search.
The key values are stored at the leaf nodes (bottom nodes) in the PAT tree.
For a text input of size "n" there are "n" leaf nodes and at most "n-1" higher-level nodes.
It is possible to place additional constraints on the sistrings for the leaf nodes.
(Figure: the full PAT binary tree for the example text.)
Signature file structure:
The goal of a signature file structure is to provide a fast test to eliminate the majority of items that are not related to a query.
Because the file structure is highly compressed and unordered, it requires significantly less space than an inverted file structure.
New items can be concatenated to the end of the structure.
When items are deleted from the information database, the system leaves the deleted items in place and marks them as deleted.
A signature file search is a linear scan of the compressed items, producing a response time linear with respect to file size.
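A sketch of the fast test using superimposed coding (the signature width and bits-per-word are illustrative assumptions):

```python
SIG_BITS = 64       # illustrative signature width
BITS_PER_WORD = 3   # illustrative number of bits set per word

def word_signature(word):
    """Superimposed coding: each word sets a few pseudo-random bit positions."""
    sig = 0
    for k in range(BITS_PER_WORD):
        sig |= 1 << (hash((word, k)) % SIG_BITS)
    return sig

def item_signature(words):
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def maybe_contains(item_sig, query_words):
    """Fast elimination test: may give false positives, never false negatives."""
    q = item_signature(query_words)
    return item_sig & q == q

doc = item_signature(["signature", "file", "structure"])
print(maybe_contains(doc, ["file"]))      # True
print(maybe_contains(doc, ["inverted"]))  # almost certainly False
```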
Application(s)/Advantage(s)
Signature files provide a practical solution for storing and locating information in a
number of different situations.
Signature files have been applied to medium-size databases, databases with a low frequency of terms, WORM devices, parallel processing machines, and distributed environments.
Hypertext and XML Data Structures:
Both of these languages (HTML and XML) provide detailed descriptions for subsets of text, similar to the zoning discussed previously.
Hypertext allows one item to reference another item via an embedded pointer.
HTML defines the internal structure for information exchange over the WWW on the Internet.
XML: its structure and processing are defined by the DTD, DOM, XSL, etc.
HIDDEN MARKOV MODELS:
The use of Hidden Markov Models for searching textual queries introduced a new paradigm for search: the output of one state of the model becomes the input of the next.
(Figure: a chain of states Q1 → Q2 → … → Qn.)
In a Hidden Markov Model, one input generates an output, and that output in turn becomes the next input; the process proceeds as a chain.
A document can be modeled as the statistical process that generates output equivalent to the set of queries for which that document would be considered relevant.
The general definition is that an HMM is defined by the output that is produced by passing some unknown key via state transitions through a noisy channel; the output is the query, and the unknown keys are the relevant documents.
The channel is the mismatch between the author's way of expressing ideas and the user's ability to specify his query.
The development of the HMM approach begins with applying Bayes' rule to the conditional probability.
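In the usual formulation (a standard statement of Bayes' rule applied to document relevance; the notation is assumed):

```latex
P(D \text{ is relevant} \mid Q)
  = \frac{P(Q \mid D \text{ is relevant})\, P(D \text{ is relevant})}{P(Q)}
```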
Disadvantages of HMMs:
The biggest problem in using this approach is estimating the transition probability matrix and the output distribution for every document.
If there were a large training database of queries and their relevant documents, the problem could be solved using Expectation-Maximization algorithms.
UNIT-3
AUTOMATIC INDEXING:
Automatic indexing is the computerized process of scanning large volumes of documents against a controlled vocabulary, taxonomy, or ontology, and using those controlled terms to quickly and effectively index large electronic document repositories.
• The simplest case is total document indexing.
• Automatic indexing requires only a few seconds, depending on the processor and the complexity of the algorithms used to generate the indexes.
• Indexes resulting from automated indexing fall into two classes: weighted and unweighted.
• Weighted indexing system: an attempt is made to place a value on each index term associated with a concept in the document, based on the frequency of occurrence of the term in the item.
• Unweighted indexing system: the existence of an index term in a document, and sometimes its word location, are kept as part of the searchable data structure.
• Values are normalized between 0 and 1.
• The results are presented to the user in order of rank value, from the highest number to the lowest number.
Indexing By Term
• Terms (vocabulary) of the original item are used as the basis of the index process.
• There are two major techniques for creation of the index: statistical and natural language.
• Statistical techniques can be based upon vector models and probabilistic models, with a special case being the Bayesian model (accounting for the uncertainty inherent in the model selection process).
• They are called statistical because their calculation of weights uses information such as the frequency of occurrence of words.
• Natural language techniques also use some statistical information, but perform more complex parsing to define the final set of index concepts.
Types of Classes in Automatic Indexing:
Standardize Input
Logical Subsetting (Zoning)
Apply Stop Lists (Stop Algorithm)
Characterize Tokens
Apply Stemming
Statistical Indexing:
Statistical strategies cover the broadest range of indexing techniques and are the most prevalent in commercial systems; two key statistics are document frequency and term frequency.
The basis for a statistical approach is the use of frequency of occurrence of events.
The events usually are related to occurrences of processing tokens (words/phrases) within documents and within the database.
The words/phrases are the domain of searchable values.
The static approach stores a single statistic, such as how often each word occurs in an item, that is used in generating relevance scores after a standard Boolean search.
Probabilistic Weighting:
This has the advantage of being able to use the developed formal theory of probability to direct the algorithmic development.
This is summarized by the Probability Ranking Principle (PRP).
The system stores the information that is used in calculating a probability that a particular item is relevant to a particular query.
We apply two notions: a hypothesis and a plausible corollary.
HYPOTHESIS:
If a system's response to each query is a ranking of the items in order of decreasing probability of relevance, the overall effectiveness of the system will be the best obtainable.
PLAUSIBLE COROLLARY:
The most promising source of techniques for estimating the probabilities of usefulness for output ranking in IR is standard probability theory and statistics.
The Bayesian model provides a conceptually simple yet complete model for information systems.
The Bayesian approach is based upon conditional probabilities.
The Bayesian approach stresses information used in generating a relative confidence level of an item's relevance to a query.
It produces a good relative relevance value rather than an absolute probability.
Vector Weighting:
One of the earliest systems using statistical approaches in information retrieval was the SMART system at Cornell University.
Recall is improved.
Vector weighting can be used with concept classes using neural networks.
An example of applying a concept approach is the Convectis system.
The Convectis system uses a neural network algorithm (a neural network is a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain).
The process of mapping from a specific term to a concept that the term represents is complex, because a term may represent multiple different concepts to different degrees.
Document Clustering:
Document clustering groups documents based upon the terms they contain and upon which terms co-occur across documents.
It is used to create document clusters.
The search can then retrieve items similar to an item of interest, even if the query would not have retrieved the item (result-set expansion).
Result-set clustering.
Steps:
Define the domain for clustering.
Documents: may focus on specific zones within the items that are used to determine similarity.
Thesaurus: determine which words are synonyms and the strength of their relationships.
Apply some algorithm to determine the classes to which each object will be assigned.
Guidelines on the characteristics of the classes in clustering:
There is a risk that the name assigned to the semantic definition of the class could be misleading.
The sizes of the classes should be within the same order of magnitude.
Within a class, one object should not dominate the class.
Word Relationships:
Paradigmatic: paradigmatic relationships relate words with the same semantic base, such as "formula" and "equation", and include taxonomy and synonym relationships.
Thesaurus Generation:
The collection of terms can be generated or clustered either manually or automatically.
Automatically generated thesauri contain classes that reflect the use of words.
The classes do not naturally have a name, but are just groups of statistically similar terms (statistical term clustering).
The more frequently two terms co-occur in the same items, the more likely they are to be about the same concept.
For each term, the number of times it repeats in each item, as well as its possible locations across the documents, is identified.
Once the terms are selected, they are clustered based upon the word-relationship guidelines and the interpretation of the strength of the relationship. This is also part of the art of manual creation of a thesaurus, using the judgment of the human analyst.
There are many techniques for the automatic generation of term clusters to create statistical thesauri; the number of clusters created can be very large.
The basis for automatic generation of a thesaurus is a set of items that represents the vocabulary to be included in the thesaurus.
The processing tokens (words) in the set of items are the attributes used to create the clusters.
The automated methods of clustering documents are based upon clustering in which each cluster is defined by a set of words and phrases.
They all use as their basis the concept that the more frequently two terms co-occur in the same items, the more likely they are to be about the same concept.
They differ in the completeness with which terms are correlated.
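A sketch of the co-occurrence idea (similarity here is a raw count of shared items; real systems use normalized measures):

```python
from itertools import combinations
from collections import Counter

docs = [  # hypothetical items, already reduced to processing tokens
    {"oil", "wells", "drilling"},
    {"oil", "refineries", "drilling"},
    {"cooking", "oil"},
]

cooccur = Counter()
for terms in docs:
    for a, b in combinations(sorted(terms), 2):
        cooccur[(a, b)] += 1  # the pair co-occurs in this item

# The most frequently co-occurring pairs are candidates for the same class.
print(cooccur[("drilling", "oil")])  # 2 -> likely about the same concept
```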
Item Clustering:
Hierarchical Clustering:
One approach starts with un-clustered items and performs pair-wise similarity measures to determine the clusters (agglomerative: un-clustered items are clustered based on similarity and arranged in a tree, i.e., a hierarchical structure).
The other approach starts with one large cluster and breaks it down into smaller clusters (divisive).
Search Statements & Binding:
Search statements are statements of an information need, generated by users to specify the concepts they are trying to locate in items.
The longer the search query, the easier it is for the system to find items.
Searching is concerned with calculating the similarity between a user's search statement and the items in the database.
To determine the weight an item has with respect to the search statement, the cosine formula is often used to calculate the distance between the vector for the item and the vector for the query. A related measure, the Dice coefficient, simplifies the denominator from the Jaccard formula and introduces a factor of 2 in the numerator.
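For reference, the standard forms of the three measures mentioned, where D and Q are the item and query vectors:

```latex
\mathrm{Sim}_{\cos}(D,Q)   = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2}\,\sqrt{\sum_i q_i^2}} \qquad
\mathrm{Sim}_{\mathrm{Jac}}(D,Q)  = \frac{\sum_i d_i q_i}{\sum_i d_i^2 + \sum_i q_i^2 - \sum_i d_i q_i} \qquad
\mathrm{Sim}_{\mathrm{Dice}}(D,Q) = \frac{2\sum_i d_i q_i}{\sum_i d_i^2 + \sum_i q_i^2}
```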
The Internet has multiple different mechanisms that are the basis for searching of
items. The primary techniques are associated with servers on the Internet that
create indexes of items on the Internet and allow search of them. Some of the most
commonly used nodes are YAHOO, AltaVista and Lycos. In all of these systems
there are active processes that visit a large number of Internet sites and retrieve
textual data which they index.
The primary design decisions are on the level to which they retrieve data and their general philosophy on user access. LYCOS (http://www.lycos.com) and AltaVista automatically go out to other Internet sites and return the text at the sites for automatic indexing (http://www.altavista.digital.com). Lycos returns home pages from each site for automatic indexing, while AltaVista indexes all of the text at a site.
The retrieved text is then used to create an index to the source items storing
the Uniform Resource Locator (URL) to provide to the user to retrieve an
item. All of the systems use some form of ranking algorithm to assist in
display of the retrieved items. The algorithm is kept relatively simple using
statistical information on the occurrence of words within the retrieved text
Closely associated with the creation of the indexes is the technique for
accessing nodes on the Internet to locate text to be indexed. This search
process is also directly available to users via Intelligent Agents. Intelligent Agents provide the capability for a user to specify an information need, which will be used by the Intelligent Agent as it independently moves between Internet sites, locating information of interest.
INFORMATION VISUALIZATION:
The functions that are available with electronic display and visualization of data include:
Modifying representations of data and information, or display conditions (changing color scales)
Using the same representation while showing changes in data
Animating the display to show changes in space and time
Creating hyperlinks under user control to establish relationships between data
Information visualization addresses how the results of a search may be optimally displayed to the users to facilitate their understanding of what the search has provided and their selection of the most likely items of interest to read.
Cognitive engineering (cognition: the action or process of acquiring knowledge and understanding through experience and the senses) derives design principles for visualization techniques from the attention, memory, and information processing of the human visual system.
There are many areas where information visualization and presentation can help the user:
A. Reduce the amount of time needed to understand the results of a search and the likely clusters of relevant information
B. Yield information that comes from the relationships between items, versus treating each item as independent
C. Perform simple actions that produce sophisticated information search functions
Visualization can be divided into two broad classes:
Link visualization
Attribute visualization
Link Visualization:
Attribute Visualization:
Cognition & Perception:
Cognition: cognition means storing and processing information (the action or process of acquiring knowledge and understanding through thought, experience, and the senses).
CONE-TREE:
Perspective Wall:
The perspective wall divides the information into three visual areas.
This allows the user to keep all of the information in perspective while focusing on a particular area.
Tree Maps:
This technique makes maximum use of the display screen space by using rectangular boxes that are recursively subdivided based on the parent-child relationships between the data.
For example, the CPU, OS, Memory, and Network Management articles are all related to a general category of computer operating systems, while the computer applications are shown in the rest of the figure.
Envision System:
The Envision system not only displays the relevance rank and estimated relevance of each item found by a query, but also simultaneously presents other query information.
CITY SCAPE:
TEXT SCANNING:
The basic concept of a text scanning system is the ability for one or more users to enter queries, with the text of the items to be searched accessed sequentially and compared to the query terms.
When all of the text has been accessed, the query is complete.
One advantage of this type of architecture is that as soon as an item is identified as satisfying a query, the result can be presented to the user for retrieval.
Brute Force Search:
The idea is to try to match the search string against the input text.
As soon as a mismatch is detected in the comparison process, shift the input text one position and start the comparison process over.
The expected number of comparisons when searching an input text string of N characters for a pattern of M characters depends on N, M, and C, where C is the size of the alphabet for the text: the pattern P is compared with the text T at each possible shift of P relative to T until a match is found or the text is exhausted.
In the worst case this is of order O(N*M), e.g.:
T = aaa…ah
P = aaah
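A sketch of the brute-force scan (worst case O(N*M), as in the T/P example above):

```python
def brute_force_search(text, pattern):
    """Shift the pattern one position on every mismatch; O(N*M) worst case."""
    n, m = len(text), len(pattern)
    for shift in range(n - m + 1):
        if text[shift:shift + m] == pattern:  # compare at this alignment
            return shift
    return -1

print(brute_force_search("aaaaaaah", "aaah"))  # 4
```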
Boyer-Moore Algorithm:
The Boyer-Moore algorithm is a string search algorithm that is significantly enhanced in that the comparison process starts at the end of the search pattern, processing right to left, versus at the start of the search pattern.
The advantage is that large jumps in the input stream are possible when the mismatched character does not occur in the search pattern, which happens frequently (shifting from one position to another position).
The original Boyer-Moore algorithm was developed into additional text search techniques.
It was originally designed to support scanning for a single search string; it was expanded to handle multiple search strings in a single pass.
Enhanced and simplified versions of the Boyer-Moore algorithm have been developed by many researchers (Mollier-Nielson-84, Iyengar-80).
The Boyer-Moore pattern matching algorithm is based on two heuristics:
Looking-glass heuristic: compare P with a sub-sequence of T moving backwards.
Character-jump heuristic: when a mismatch occurs at T[i] = c, if P contains c, shift P to align the last occurrence of c in P with T[i]; else shift P to align P[0] with T[i+1].
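A runnable sketch of the character-jump idea; this is the simplified Boyer-Moore-Horspool variant, not the full original algorithm:

```python
def horspool_search(text, pattern):
    """On a mismatch, jump by the bad-character shift of the window's last char."""
    n, m = len(text), len(pattern)
    # Shift table: distance from a character's last occurrence to the pattern end
    shift = {ch: m - i - 1 for i, ch in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        i += shift.get(text[i + m - 1], m)  # jump based on the last window char
    return -1

print(horspool_search("which finally halts. at that point", "at that"))  # 21
```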
Knuth-Morris-Pratt Algorithm:
The Knuth-Morris-Pratt algorithm made a major improvement over previous algorithms.
It is a text search algorithm that works well even in the worst case.
Unlike the previous algorithms, its worst-case behavior does not degrade with the length of the input string, and it also works on long strings.
The basic concept behind the algorithm is that whenever a mismatch is detected, the previously matched characters define the number of characters that can be skipped in the input stream prior to starting the comparison again.
Position: 1 2 3 4 5 6 7 8
Input stream: a b d a d e f g
Positions are identified as shown above.
When the mismatch occurs at position 4, with an "f" in the pattern and an "a" in the input stream, this algorithm allows the comparison to jump at least the three positions associated with the recognized "abd".
Since the mismatch position could itself be the beginning of the search string, four positions cannot be skipped.
To know the number of positions to jump based upon a mismatch in the search pattern, the search pattern is pre-processed to define the number of characters to be jumped for each position.
(P = position, S = search position, I = input stream)
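A sketch of that pre-processing step, the standard KMP failure (prefix) function; the pattern "abdf" is an assumption consistent with the example above:

```python
def kmp_failure(pattern):
    """fail[i] = length of the longest proper prefix of the pattern that is
    also a suffix of pattern[:i+1]; it determines the safe jump on a mismatch."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

print(kmp_failure("abdf"))   # [0, 0, 0, 0] -> after matching "abd", skip 3 positions
print(kmp_failure("ababc"))  # [0, 0, 1, 2, 0]
```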
HARDWARE TEXT SEARCH SYSTEMS:
Software text search is applicable in many situations, but it faces restrictions in handling many search terms simultaneously against the same text, and limits due to I/O speeds.
One approach is to have a specialized hardware machine perform the searches and pass the results to the main computer, which supports the user interface and retrieval of hits.
Since the searcher is hardware based, scalability is achieved by increasing the number of hardware search devices.
The only limit on speed is the time it takes to flow the text off of secondary storage (disk drives) to the searchers.
By having one search machine per disk, the maximum time it takes to search a database of any size will be the time to search one disk.
In some systems, the disks were formatted to optimize the data flow off the drives.
Another major advantage of using a hardware text search unit is the elimination of the index that represents the document database.
Typically the indexes are 70% of the size of the actual items.
Other advantages are that new items can be searched as soon as they are received by the system, rather than waiting for the index to be created, and that the search speed is deterministic.
Even though it may be slower than using an index, the predictability of how long it will take to stream the data provides the user with an exact search time.
As hits are discovered, they can immediately be made available to the user, versus waiting for the total search to complete as in index searches.
One of the earliest hardware text string search units was the Rapid Search Machine developed by General Electric.
A more sophisticated search unit was developed by Operating Systems Inc., called the Associative File Processor (AFP).
It is capable of searching against multiple queries at the same time.
In this regard we discuss two types of hardware text search systems.
In some of the systems, the Boolean logic between terms is resolved in the term-detector hardware.
The GESCAN system uses a text array processor that simultaneously matches many terms and conditions against a given text stream.
The Fast Data Finder (FDF) is the most recent specialized hardware text search unit still in use in many organizations.
It was developed to search text and has been used to search English and foreign languages.
The early Fast Data Finders consisted of an array of programmable text-processing cells connected in series, forming a pipelined hardware search processor.
MULTIMEDIA INFORMATION RETRIEVAL
Graph Retrieval:
A graph-based information retrieval system is one whose query can be expressed as a graph of topics and subtopics; documents are ranked with respect to a query based upon relationships among documents, relationships among topics/subtopics, and relationships between query terms and documents.
The relationship of one object to another object can be presented using bar charts, pie charts, and similar pictorial representations, which users can understand very easily.
Such a system has been evaluated and compared with two information retrieval systems on two standard text collections.
It is also a ranking-based approach.
Examples of result analysis include the number of products sold, the number of products refunded, or the pass and fail percentages of students.
The results show that the proposed approach outperforms the other systems.
The goal of information retrieval is to effectively retrieve documents relevant to users' queries.
The graph-based approach to information retrieval is fast and scalable in its computation, and its structure is flexible enough to incorporate many performance-enhancement techniques.
It is most useful for result analysis, which users can understand very easily.
Imagery Retrieval:
An image retrieval system is used for retrieving images related to the user's request from the database.
An image retrieval system is a computer system used for browsing, searching, and retrieving images from a large database of digital images.
The most traditional and common methods of image retrieval utilize some method of adding metadata, such as captioning, keywords, titles, or descriptions, to the images, so that retrieval can be performed over the annotation words.
Manual image annotation is time-consuming, laborious, and expensive; to address this, there has been a large amount of research done on automatic image annotation.
Additionally, the increase in social web applications and the Semantic Web has inspired the development of several web-based image annotation tools.
The need to retrieve a desired image from a collection and to efficiently access the information is shared by many groups, including journalists, engineers, designers, artists, and advertising agencies.
Image needs and uses across users in these groups vary considerably. Users may require access to images based on primitive features such as color, texture, or shape, or users may require access to images based on abstract concepts and symbolic imagery.
Content-Based Image Retrieval (CBIR) technology is now beginning to move out of the laboratory into the marketplace.
The technology still lacks maturity and is not yet being used on a significant scale.
Video Retrieval:
Video is an electronic medium for the recording, copying, playback, broadcasting, and display of moving visual media.
Video was first developed for mechanical television systems, which were quickly replaced by CRT (cathode-ray tube) systems, which in turn were replaced by flat-panel displays.
A video retrieval system is an interactive web-based application which takes a video frame from users and retrieves the information from the database.
The database consists of various video data, such as still video frames, audio, and video.
During recent years, methods have been developed for retrieval of videos based on their visual features: color, texture, shape, motion, and spatial-temporal composition are the most common visual features used in similarity matching.
Additionally, there has been an increase in social media web applications and the Semantic Web; videos are shared by many groups and designers, as well as on social media platforms like YouTube, etc.