Lec 1 - Intro - Unit 1 Information Technology
NINAD MIRAJKAR
[email protected]
MOB: 9321 727 943
Part 1 – Unit 1
Introduction:
- Overview of IR Systems
- Historical Perspectives - Goals of IR
- The impact of the web on IR
- The role of artificial intelligence (AI) in IR.
In today’s lecture..
• We cover some concepts from Unit 1.
• The definition of IR
• An overview of the IR system
• Components of an IR system
• The past, present and future of IR
In Unit 1
Introduction:
Overview of IR Systems - Historical Perspectives - Goals of IR - The impact of the web on IR - The role of
artificial intelligence (AI) in IR.
Text representation: Statistical Characteristics of Text: Zipf's law; Porter stemmer; morphology; index
term selection; using thesauri. Basic Tokenizing,
Indexing: Simple tokenizing, stop-word removal, and stemming; inverted indices; Data Structure and File
Organization for IR - efficient processing with sparse vectors
Definition
“A software program that deals with the organization, storage, retrieval, and evaluation of information from
document repositories, particularly textual information.”
Information Retrieval is the activity of obtaining material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections stored on computers. For example, information retrieval takes place whenever a user enters a query into such a system.
Information retrieval is no longer limited to librarians, professional searchers, and other specialists: nowadays hundreds of millions of people engage in IR every day when they use web search engines. Information retrieval is believed to be the dominant form of information access.
Overview of IR Systems:
Information Retrieval (IR) is a field of study that deals with the organization, storage, retrieval, and
presentation of information from various sources.
IR systems are designed to help users find relevant information efficiently and effectively. They are
essential in managing large volumes of data, such as those found on the web or in databases.
These systems play a crucial role in managing and organizing various types of information, including text
documents, images, audio, and video. The process of information retrieval involves matching user
queries with indexed data and presenting the most relevant results.
Difference between Information and Data
Data
Data is defined as a collection of individual facts or statistics. Data can come in the form of text,
observations, figures, images, numbers, graphs, or symbols. For example, data might include individual
prices, weights, addresses, ages, names, temperatures, dates, or distances.
Data is a raw form of knowledge and, on its own, doesn’t carry any significance or purpose. In other
words, you have to interpret data for it to have meaning. Data can be simple—and may even seem
useless until it is analyzed, organized, and interpreted.
Quantitative data is provided in numerical form, like the weight, volume, or cost of an item.
Qualitative data is descriptive, but non-numerical, like the name, sex, or eye color of a person.
Difference between Information and Data
Information
Information is data that has been processed, organized, and interpreted so that it carries meaning. For example, a set of data could include temperature readings in a location over several years. Without any additional context, those temperatures have no meaning. However, when you analyze and organize that data, you can determine seasonal temperature patterns or even broader climate trends. Only when the data is organized and compiled in a useful way can it provide information that is beneficial to others.
The Key Differences Between Data vs
Information
•Data is a collection of facts, while information puts those facts into context.
•While data is raw and unorganized, information is organized.
•Data points are individual and sometimes unrelated. Information maps out that data to provide a
big-picture view of how it all fits together.
•Data, on its own, is meaningless. When it’s analyzed and interpreted, it becomes meaningful
information.
•Data isn’t sufficient for decision-making, but you can make decisions based on information.
Examples of Data vs Information
1. At a restaurant, a single customer’s bill amount is data. However, when the restaurant owners collect
and interpret multiple bills over a range of time, they can produce valuable information, such as what
menu items are most popular and whether the prices are sufficient to cover supplies, overhead, and
wages.
2. The number of likes on a social media post is a single element of data. When that is combined with other social media engagement statistics, like followers, comments, and shares, a company can infer which social media platforms perform best and which platforms it should focus on to engage its audience more effectively.
Historical Perspectives
The roots of Information Retrieval can be traced back to early systems like library catalogs, which
aimed to organize and retrieve books and documents. The development of digital computers and the
internet revolutionized IR, making it possible to index and search vast amounts of information
electronically. Early systems like Boolean and vector space models laid the foundation for modern IR
approaches. In the 1950s and 1960s, the development of electronic computers led to the first digital
information retrieval systems.
One of the earliest information retrieval systems was the Cranfield project, initiated at the Cranfield
Institute of Technology in the UK in the late 1950s. The Cranfield experiments laid the groundwork
for the vector space model and introduced the concept of relevance ranking.
In the 1960s, researchers like Gerard Salton made significant contributions to IR by introducing the
SMART (System for the Mechanical Analysis and Retrieval of Text) information retrieval system.
SMART employed the vector space model and statistical techniques for term weighting and ranking.
Past, Present, and Future of Information Retrieval
1. Early Developments: As the volume of available information grew, it became necessary to build data structures that allow faster access. The index is the central data structure for fast retrieval of information; for centuries, indexes were built by manually categorizing documents into hierarchies.
2. Information Retrieval in Libraries: Libraries were among the first institutions to adopt IR systems. First-generation systems automated existing card-catalog technology, with search based on author name and title. Second-generation systems added searching by subject heading, keywords, and similar fields. Third-generation systems introduced graphical interfaces, electronic forms, hypertext features, and more.
3. The Web and Digital Libraries: The web is cheaper than many other sources of information, it provides greater access through digital communication networks, and it gives anyone free access to publish to a large audience.
Types of IR Models
Classical IR models include the Boolean model, the vector space model, and probabilistic models. Each IR model has its strengths and weaknesses, and the choice of the appropriate model depends on the specific requirements of the retrieval task, the size and nature of the collection, and the user's information needs. In practice, hybrid models that combine different IR models are often used to achieve more accurate and effective retrieval results.
IR systems consist of several key processes:
1. Document Collection: This is the set of documents that the IR system will index and search. It can range
from a small local database to the entire World Wide Web.
2. Preprocessing: Before indexing, text documents undergo preprocessing steps, including tokenization,
stop-word removal, and stemming. Preprocessing helps standardize the text and reduces variations in
word forms, enabling more accurate retrieval.
3. Indexing: The process of creating an inverted index, which maps terms (words or phrases) to the
documents that contain them. The inverted index allows for efficient lookup of documents containing
specific terms, making retrieval faster.
4. Query Processing: When a user submits a query, it goes through the same preprocessing steps as the
documents. The processed query is then matched against the inverted index to identify relevant
documents.
5. Ranking: The retrieved documents need to be ranked based on their relevance to the query. Various
ranking algorithms, such as TF-IDF (Term Frequency-Inverse Document Frequency) and BM25, assign
scores to documents based on term frequency, document frequency, and other factors.
6. Presentation of Results: The top-ranked documents are presented to the user in a user-friendly format,
often as a list of titles and snippets.
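The preprocessing, indexing, and query-processing steps above can be sketched in Python; the tiny document collection, stop-word list, and query below are invented purely for illustration:

```python
import re
from collections import defaultdict

# A tiny illustrative document collection (invented for this sketch).
docs = {
    1: "Information retrieval systems index large document collections.",
    2: "An inverted index maps terms to the documents that contain them.",
    3: "Search engines rank documents by relevance to the query.",
}

STOP_WORDS = {"an", "the", "to", "that", "by", "them"}

def preprocess(text):
    """Tokenize, lowercase, and remove stop words (no stemming here)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Indexing: build an inverted index mapping each term to the doc IDs containing it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in preprocess(text):
        inverted_index[term].add(doc_id)

# Query processing: the query goes through the same preprocessing, then the
# matching documents are those containing every query term.
def search(query):
    terms = preprocess(query)
    if not terms:
        return set()
    result = inverted_index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= inverted_index.get(term, set())
    return result

print(sorted(search("documents index")))  # [2]
```

A real system would add stemming and relevance ranking on top of this Boolean-style matching.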
Components of an IR system
The components of an Information Retrieval (IR) model are the fundamental building blocks that
define how the model operates and processes information. These components collectively
determine how the model indexes, retrieves, and ranks documents to match user queries. The main
components of an IR model include:
1. Document Representation: This component defines how documents are represented in the IR
system. It involves converting raw text or other types of data (e.g., images, audio) into a suitable
format that can be indexed and processed. Document representation is crucial for capturing the
content and characteristics of each document, enabling efficient retrieval.
2. Query Representation: Query representation determines how user queries are processed and
matched against indexed documents. Queries can be represented as sets of terms, vectors, or other
data structures that facilitate matching with document representations.
3. Indexing: Indexing involves the creation of data structures that efficiently map terms to the
documents that contain them. The most common indexing structure is the inverted index, which
lists terms along with pointers to the documents in which they appear. Indexing plays a vital role in
speeding up the retrieval process by enabling quick access to relevant documents.
4. Term Weighting: Term weighting assigns importance scores to terms based on their frequency
and relevance. Weighting helps to distinguish between important and less important terms during
retrieval. Popular term weighting schemes include TF-IDF (Term Frequency-Inverse Document Frequency).
5. Retrieval Algorithm: The retrieval algorithm determines how documents are ranked and selected
for presentation to the user. Different IR models use various algorithms, such as Boolean operations,
vector similarity measures, or probabilistic scoring functions, to determine document relevance.
6. Ranking Strategy: The ranking strategy dictates how the retrieved documents are ordered to
present the most relevant ones at the top of the result list. The ranking strategy may consider
factors like document similarity to the query, document popularity, or user feedback.
7. Relevance Feedback (optional): Some IR models incorporate relevance feedback, where users
provide feedback on the initial search results, and the model refines the results based on this
feedback. Relevance feedback helps improve the accuracy of retrieval by incorporating user
preferences.
8. Query Processing: Query processing involves various preprocessing steps applied to user queries
before they are matched against the indexed data. These steps may include tokenization, stop-word
removal, stemming, and other text processing techniques to standardize and enhance query
representation.
9. Presentation of Results: This component deals with how the search results are presented to the
user. The results can be displayed as a ranked list, snippets of text, images, or any other format that
facilitates user understanding and interaction.
10. Evaluation Metrics (for model assessment): To assess the performance of an IR model,
evaluation metrics are used. These metrics measure the quality of retrieval results, such as
precision, recall, F1-score, mean average precision (MAP), and normalized discounted cumulative
gain (NDCG).
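The simplest of these metrics can be computed directly; the retrieved and relevant document sets below are invented judgments for a single query:

```python
# Precision, recall, and F1 for a single query, given the set of documents
# the system retrieved and the set judged relevant (invented judgments).
retrieved = {1, 2, 3, 5, 8}
relevant = {2, 3, 4, 8, 9, 10}

true_positives = len(retrieved & relevant)          # relevant docs we found
precision = true_positives / len(retrieved)         # fraction of retrieved that are relevant
recall = true_positives / len(relevant)             # fraction of relevant that were retrieved
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, round(f1, 3))  # 0.6 0.5 0.545
```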
---------------------------------------------------------------------------------------------------------------------------------------------------------------
These components work together in an IR model to enable effective and efficient information
retrieval. The specific implementation and design of these components vary across different IR
models, reflecting the diversity of approaches used to address the challenges of information
retrieval.
Sequence of activities of how the system works
User Query → Query Preprocessing → Query Representation → Indexing → Term Weighting → Retrieval Algorithm → Ranking Strategy → Presentation of Results → User Receives Results

Note: the specific retrieval algorithm does its work at the Retrieval Algorithm stage. The rest of the steps in the process are followed in general and are independent of the algorithm.
Explanation of Flowchart:
User Query: The process starts when a user enters a query to search for specific information.
Query Preprocessing: The user query undergoes preprocessing steps, such as tokenization,
stop-word removal, and stemming, to standardize and enhance its representation.
Query Representation: The preprocessed query is converted into a suitable data structure, such as a
set of terms or a vector, that can be matched against the indexed data.
Indexing: The IR system creates an inverted index that maps terms to the documents containing
them. This index allows for efficient retrieval of relevant documents during the search process.
Term Weighting: Term weighting assigns importance scores to the terms in the query and
documents based on their frequency and relevance.
Retrieval Algorithm: The IR system uses a retrieval algorithm (e.g., Boolean operations, vector
similarity measures, probabilistic scoring) to identify relevant documents that match the user's
query.
Ranking Strategy: The retrieved documents are ranked based on their relevance to the query using a
ranking strategy, such as sorting by similarity scores or incorporating user feedback.
Presentation of Results: The top-ranked documents are presented to the user in a user-friendly
format, such as a list of titles and snippets, images, or other relevant information.
User Receives Results: The user receives the search results, and based on the presented documents,
they can interact with the retrieved information.
The Impact of the Web on IR
The advent of the World Wide Web brought new challenges and opportunities to Information
Retrieval. Web search engines, such as Google, Bing, and Yahoo, have become an integral part of our
daily lives. The sheer size and dynamic nature of the web necessitated the development of
specialized algorithms and techniques to crawl, index, and rank web pages effectively.
The emergence of the World Wide Web in the 1990s brought unprecedented challenges and
opportunities to information retrieval. The web presented an enormous and dynamic collection of
interconnected documents, making traditional IR techniques less effective.
Web search engines like Google revolutionized IR by combining efficient crawling, indexing, and
ranking algorithms. Google's PageRank algorithm, developed by Larry Page and Sergey Brin,
revolutionized web search by considering the link structure between web pages to determine
relevance and authority.
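The core idea of PageRank can be sketched as power iteration over a toy link graph; the four-page graph and the damping factor of 0.85 below are illustrative, not Google's production algorithm:

```python
# Minimal PageRank by power iteration over an invented four-page link graph.
# links[p] lists the pages that p links to; d is the usual damping factor.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
d = 0.85
rank = {p: 1 / len(pages) for p in pages}

for _ in range(50):  # iterate until the ranks stabilize
    new_rank = {}
    for p in pages:
        # Sum the rank flowing in from every page that links to p.
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - d) / len(pages) + d * incoming
    rank = new_rank

# Page C, with the most inbound links, ends up with the highest rank.
print(max(rank, key=rank.get))  # C
```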
The Role of Artificial Intelligence (AI) in
IR
Artificial Intelligence plays a crucial role in modern IR systems. AI techniques, such as natural
language processing (NLP), machine learning, and deep learning, are employed to improve various
aspects of IR, including query understanding, relevance ranking, and user personalization.
AI-powered systems can learn from user interactions and feedback, leading to more accurate and
context-aware search results.
1. Query Understanding: AI-powered systems can better understand user queries, accounting for
synonyms, word order, and context, leading to more accurate results.
2. Relevance Ranking: Machine learning models can learn from user interactions and feedback to
improve ranking algorithms and present more relevant results.
3. Personalization: AI enables the creation of personalized IR experiences, tailoring search results to
individual user preferences and behaviors.
4. Natural Language Search: Voice-based and natural language search has become more prevalent,
allowing users to interact with IR systems using spoken language.
AI-powered IR systems continue to evolve, and ongoing research in this area is essential for further
improvements and innovations in the field.
References Links
1. https://bloomfire.com/blog/data-vs-information/
2. https://www.geeksforgeeks.org/what-is-information-retrieval/
3. https://www.geeksforgeeks.org/difference-between-information-and-data/
4. The Cranfield Project
5. The SMART Information Retrieval System
6. Google's PageRank Algorithm
Part 2 – Unit 1
Text representation:
Statistical Characteristics of Text: Zipf's law;
Porter stemmer; morphology;
Index term selection;
Using thesauri.
In this lecture..
1. Text Representation in Information Retrieval
2. Statistical Characteristics of Text: Zipf's Law
3. Porter Stemmer and Morphology
4. Index Term Selection and Thesauri
Text Representation in Information
Retrieval
Statistical Characteristics of Text: Zipf's
Law:
Zipf's law states that in a given corpus* of natural language text, the frequency of any word is inversely
proportional to its rank in the frequency table.
In simpler terms, a few words occur very frequently (like "the," "and," "is") while the majority of words
occur rarely. Understanding Zipf's law helps in designing efficient indexing and retrieval strategies, such
as term weighting and relevance ranking.
Zipf's law, proposed by linguist George Zipf, is an empirical law that describes the frequency distribution
of words in natural language.
*corpus = collection
Zipf's law has significant implications for
information retrieval:
1. Term Frequency Weighting: Inverted index-based IR systems often use term frequency (TF)
weighting to rank documents. Zipf's law suggests that frequent terms are less informative than rare
terms. Therefore, TF-IDF weighting, which considers both term frequency and inverse document
frequency, helps in giving more importance to rare terms.
2. Stop-Word Removal: Since common words appear frequently and have little discriminative power,
they are often removed during preprocessing. Stop-word lists are created, containing words like "a,"
"an," "the," etc., which are omitted during indexing.
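Stop-word removal itself is a one-line filter; the stop-word list below is a tiny invented subset of the much longer lists real systems use:

```python
# Removing stop words from a token stream. Real systems use much longer
# stop-word lists; this small set is just for illustration.
STOP_WORDS = {"a", "an", "the", "is", "of", "for", "and", "in", "to"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the frequency of a word is inversely proportional to its rank".split()
print(remove_stop_words(tokens))
# → ['frequency', 'word', 'inversely', 'proportional', 'its', 'rank']
```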
Mathematically, Zipf's law can be
expressed as:
f(r) = C / r^s
Where:
• f(r) is the frequency of the word at rank r
• C is a constant
• s is an exponent, typically close to 1
Zipf's Law
Explanation:
Zipf's Law suggests that the most frequent word in a language will appear approximately twice as
often as the second most frequent word, three times as often as the third most frequent word, and
so on. This power-law distribution is common in various natural phenomena, and it has significant
implications for information retrieval and language processing.
Let's illustrate Zipf's Law using a simple
example
Example: Consider the following text corpus containing seven sentences:
1. "Information retrieval is essential for data science."
2. "Information retrieval systems aim to find relevant documents."
3. "Data science is an interdisciplinary field."
4. "Information retrieval deals with indexing and searching."
5. "Data science includes statistics and machine learning."
6. "Search engines use information retrieval techniques."
7. "Data science is widely used in various industries."
Step 1: Tokenization and Word Frequency
Count
Tokenize the text corpus and count the frequency of each word:
Word Frequency
information 4
retrieval 4
data 4
science 4
is 3
and 2
essential 1
for 1
systems 1
aim 1
to 1
find 1
relevant 1
documents 1
an 1
interdisciplinary 1
field 1
deals 1
with 1
indexing 1
searching 1
includes 1
statistics 1
machine 1
learning 1
search 1
engines 1
use 1
techniques 1
widely 1
used 1
in 1
various 1
industries 1
Step 2: Rank Calculation
Rank the words based on their frequencies: the highest-frequency words take the top ranks, and the many words that occur only once share the lowest ranks. How ties are broken depends on the documents you have and how each term is eventually ranked.
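Steps 1 and 2 can be reproduced in a few lines of Python over the seven-sentence corpus:

```python
import re
from collections import Counter

corpus = [
    "Information retrieval is essential for data science.",
    "Information retrieval systems aim to find relevant documents.",
    "Data science is an interdisciplinary field.",
    "Information retrieval deals with indexing and searching.",
    "Data science includes statistics and machine learning.",
    "Search engines use information retrieval techniques.",
    "Data science is widely used in various industries.",
]

# Step 1: tokenize (lowercase, letters only) and count word frequencies.
tokens = re.findall(r"[a-z]+", " ".join(corpus).lower())
freq = Counter(tokens)

# Step 2: rank the words from most to least frequent.
ranking = freq.most_common()
for rank, (word, count) in enumerate(ranking[:6], start=1):
    print(rank, word, count)
```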
How is the TF-IDF calculated? And what is
it?
TF-IDF (Term Frequency-Inverse Document Frequency) is a popular term weighting scheme used in
information retrieval and text mining to measure the importance of a term in a document relative to
a collection of documents. TF-IDF takes into account both the frequency of a term in a document
(TF) and its rarity across the entire collection (IDF). The formula to calculate TF-IDF is as follows:
TF-IDF = TF(term, document) * IDF(term)
Where:
• TF(term, document) is the term frequency of the term in the document. It represents how many
times the term appears in the document.
• IDF(term) is the inverse document frequency of the term. It measures the rarity of the term across
the entire collection of documents.
The TF-IDF value for a specific term in a particular document is the product of its term frequency
and its inverse document frequency.
Term Frequency (TF):
The term frequency (TF) of a term in a document is a measure of how frequently the term appears in
that document. It is calculated as the ratio of the number of times the term occurs in the document
to the total number of terms in the document. This normalization prevents bias towards longer documents. (An alternative normalization divides the raw term frequency by the maximum term frequency in the document.)
Formula for normalized TF (TF_norm):
TF_norm(term, document) = (Number of occurrences of term in the document) / (Total number of
terms in the document)
Inverse Document Frequency (IDF):
The inverse document frequency (IDF) of a term is a measure of how rare the term is across the
entire collection of documents. It is calculated as the logarithm of the ratio of the total number of
documents in the collection (N) to the number of documents that contain the term (n).
Formula for IDF:
IDF(term) = log(N / n)
Where:
• N: Total number of documents in the collection.
• n: Number of documents that contain the term.
The IDF value is higher when the term is rare in the collection and lower when the term is more
common.
Calculating TF-IDF:
Once the TF and IDF values are calculated for each term in a document, you can compute the TF-IDF
score for that term in the document using the formula mentioned at the beginning:
TF-IDF(term, document) = TF_norm(term, document) * IDF(term)
The TF-IDF score reflects the importance of a term in a specific document relative to the entire
collection. Terms with high TF-IDF scores are considered more informative and discriminative and
are often used for ranking and retrieval purposes in information retrieval systems. They indicate that
the term is frequent in the document (high TF) and rare in the collection (high IDF), suggesting that
the term is highly relevant to the content of the document. Conversely, terms with low TF-IDF scores
are either very common across all documents or very rare in the document, indicating lower
relevance to the document's content.
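The formulas above can be applied directly; the three-document collection below is invented for illustration:

```python
import math
import re

# An invented three-document collection for illustrating TF-IDF.
docs = [
    "information retrieval is essential for data science",
    "information retrieval systems aim to find relevant documents",
    "data science is an interdisciplinary field",
]

tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
N = len(tokenized)  # total number of documents in the collection

def tf_norm(term, doc_tokens):
    # TF_norm = occurrences of term in the document / total terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    # IDF = log(N / n), where n = number of documents containing the term
    n = sum(1 for doc in tokenized if term in doc)
    return math.log(N / n) if n else 0.0

def tf_idf(term, doc_tokens):
    return tf_norm(term, doc_tokens) * idf(term)

# "information" appears in 2 of 3 docs; "essential" in only 1, so it is
# rarer in the collection and scores higher for the first document.
print(round(tf_idf("information", tokenized[0]), 4))  # 0.0579
print(round(tf_idf("essential", tokenized[0]), 4))    # 0.1569
```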
Thesauri
A thesaurus is a controlled vocabulary that organizes and relates
words and phrases to express concepts and their relationships. It
serves as a knowledge organization system, helping users find
synonyms, antonyms, hierarchical relationships, and related terms.
Thesauri play a vital role in information retrieval and are commonly
used in search engines, libraries, and databases to improve search
precision and recall.
How Thesauri Work: Thesauri are typically organized as a
hierarchical structure or network, with broader and narrower terms
representing the hierarchical relationships between concepts.
Synonyms and related terms are also linked to provide alternative
and related access points to the same or similar information.
Note: the word "thesaurus" comes from the Greek thēsauros, meaning "treasure" or "storehouse".
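A thesaurus entry of the kind discussed here can be sketched as a simple data structure; the entry for "Dog" is illustrative, and the relation labels follow the usual thesaurus conventions:

```python
# A minimal thesaurus entry, modeled as a dictionary. Standard thesauri
# label these relations UF (synonyms / used-for), BT (broader term),
# NT (narrower term), and RT (related term).
thesaurus = {
    "Dog": {
        "synonyms": ["Canine", "Pooch"],      # UF: alternative terms
        "broader": ["Animal"],                # BT: broader concept
        "narrower": ["Beagle"],               # NT: more specific concept
        "related": ["Bark", "Pet", "Leash"],  # RT: conceptually linked terms
    }
}

def expand_query(term):
    """Expand a query term with its synonyms to improve recall."""
    entry = thesaurus.get(term, {})
    return [term] + entry.get("synonyms", [])

print(expand_query("Dog"))  # ['Dog', 'Canine', 'Pooch']
```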
In this example, the thesaurus provides information about synonyms for "Dog" (Canine, Pooch), its
broader concept (Animal), a narrower term representing a specific type of dog (Beagle), and related
terms that are conceptually linked to "Dog" (Bark, Pet, Leash). Users can utilize this information to
find relevant documents or resources associated with different terms that relate to the concept of
"Dog."
By integrating thesauri into an information retrieval system, users can access a broader range of
relevant information and discover alternative ways of expressing their information needs, enhancing
the effectiveness and flexibility of the retrieval process.
Hash Tables
Let's consider a practical example of using a hash table to implement a simple phone book contacts
application. The goal is to store a list of contacts with their phone numbers and efficiently retrieve
contact information based on the contact name.
Step 1: Creating the Hash Table We'll start by creating a hash table to store the contacts. For
simplicity, let's assume we have a limited number of contacts, and the phone book can store up to
10 contacts. We'll use an array-based implementation for the hash table.
Step 2: Hash Function Next, we need a hash function to convert the contact name into an index
where the contact will be stored in the hash table. For this example, we'll use a simple hash function
that calculates the sum of ASCII values of characters in the name and takes the modulo of the hash
table size.
Step 3: Inserting Contacts Now, we'll implement a function to insert contacts into the phone book
using the hash table.
Step 4: Retrieving Contacts To retrieve contact information, we'll implement a function that takes
the contact name and returns the corresponding phone number.
Step 5: Putting It All Together Now, let's insert some contacts and retrieve their phone numbers:
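The five steps above can be sketched in Python; the contact names and numbers are invented, and collisions are resolved by linear probing:

```python
# Step 1: an array-based hash table with 10 slots for (name, phone) pairs.
hash_table_size = 10
table = [None] * hash_table_size

# Step 2: hash function - sum of the ASCII values of the characters in the
# name, modulo the table size.
def hash_name(name):
    return sum(ord(ch) for ch in name) % hash_table_size

# Step 3: insert a contact, using linear probing on collisions.
def insert(name, phone):
    index = hash_name(name)
    for i in range(hash_table_size):
        slot = (index + i) % hash_table_size
        if table[slot] is None or table[slot][0] == name:
            table[slot] = (name, phone)
            return
    raise RuntimeError("phone book is full")

# Step 4: retrieve a phone number by name, probing past collisions.
def lookup(name):
    index = hash_name(name)
    for i in range(hash_table_size):
        slot = (index + i) % hash_table_size
        if table[slot] is None:
            return None          # empty slot: the name is not stored
        if table[slot][0] == name:
            return table[slot][1]
    return None

# Step 5: putting it all together.
insert("Alice", "555-0101")
insert("Bob", "555-0102")
print(lookup("Alice"))  # 555-0101
print(lookup("Carol"))  # None
```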
Working of Hash Tables:
In this example, the hash table is created with 10 slots (hash_table_size = 10). The hash function
converts the contact name into an index, where the contact information will be stored. If two
contacts hash to the same index due to a collision, linear probing is used to find the next available
slot. When retrieving a phone number, the hash function is applied to the contact name to locate the
correct slot, and the phone number is returned.
Hash tables provide efficient access to contact information based on the contact name. The time
complexity for insertion, retrieval, and deletion is O(1) on average (assuming a good hash function),
making hash tables a practical and effective data structure for various applications, including phone
book contacts, databases, and caching systems.
B-trees
B-trees are self-balancing tree data structures designed to efficiently store and retrieve large
amounts of data in blocks or pages. They are commonly used in databases and information retrieval
systems for indexing and organizing data on disk.
How B-trees Work:
1. Node Structure: B-trees consist of internal nodes and leaf nodes. Internal nodes store keys and
pointers to child nodes, while leaf nodes store actual data entries or references to data blocks.
2. Balance: B-trees maintain balance, ensuring that all leaf nodes are at the same level. This balance
reduces the number of disk accesses required for retrieval and insertion.
Pros and Cons of B-trees:
Pros:
• Efficient for large datasets and disk-based storage.
• Maintains balanced structure, leading to predictable and consistent performance.
• Reduces the number of disk I/O operations, making it suitable for databases and file systems.
Cons:
• More complex to implement and maintain compared to simple data structures like hash tables.
• Insertions and deletions require tree restructuring, leading to higher overhead compared to hash
tables for small datasets.
• In-memory B-trees can be less efficient than other tree structures, such as binary search trees or
AVL trees, for smaller datasets.
Example: Consider a database containing a large number of user records. B-trees can be used to
index the user records efficiently based on a unique identifier, such as a user ID.
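Search over a B-tree's node structure can be sketched as follows; the tree here is hand-built and search-only, since balanced insertion and deletion (the harder parts of a real B-tree) are beyond this sketch:

```python
from bisect import bisect_left

class BTreeNode:
    """A B-tree node: sorted keys, with children[i] holding keys < keys[i]."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []  # empty for leaf nodes

    @property
    def is_leaf(self):
        return not self.children

def search(node, key):
    """Descend from the root, choosing the child between neighboring keys."""
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return True               # key found in this node
    if node.is_leaf:
        return False              # nowhere left to descend
    return search(node.children[i], key)

# A small hand-built B-tree of user IDs; note all leaves are at the same level.
root = BTreeNode(
    [20, 40],
    [BTreeNode([5, 12]), BTreeNode([25, 33]), BTreeNode([47, 58, 60])],
)
print(search(root, 33), search(root, 34))  # True False
```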
Trie-based structures
Trie (pronounced "try") is a tree-like data structure used for efficiently storing and retrieving strings
or sequences. Trie-based structures, such as prefix trees and compressed trie structures (like
Patricia trie), are used in information retrieval for various tasks involving string matching and
searching.
How Trie Based Structures Work:
1. Node Structure: In a trie, each node represents a single character of a string. Nodes are linked based
on the characters they represent, forming a hierarchical tree-like structure.
2. Prefix Matching: Trie-based structures excel at prefix matching, making them efficient for
autocompletion and searching tasks.
Pros and Cons of Trie-Based Structures
Pros:
• Excellent for tasks requiring string matching and prefix search, such as autocompletion and spell
checking.
• Space-efficient when there are many common prefixes in the dataset, as shared prefixes are
represented only once.
Cons:
• Inefficient for storing large datasets of long strings as they can be memory-intensive.
• High space overhead when there is little repetition of prefixes.
• More complex to implement compared to simple data structures like arrays or linked lists.
Example: For a search engine's autocompletion feature, a trie-based structure can be used to
efficiently store and retrieve a large number of search queries for real-time suggestions as the user
types.
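A minimal trie with insertion and prefix search can be sketched as follows; the stored query strings are invented:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # one child node per character
        self.is_end = False  # marks the end of a complete stored word

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_end = True

def starts_with(root, prefix):
    """Return all stored words beginning with the given prefix."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    # Collect every complete word in the subtree under the prefix.
    results, stack = [], [(node, prefix)]
    while stack:
        node, word = stack.pop()
        if node.is_end:
            results.append(word)
        for ch, child in node.children.items():
            stack.append((child, word + ch))
    return sorted(results)

root = TrieNode()
for query in ["search", "searching", "seal", "stemming"]:
    insert(root, query)
print(starts_with(root, "sea"))  # ['seal', 'search', 'searching']
```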
File Organization for IR
File organization involves how the indexed data is stored on disk to optimize access times.
Sequential access methods are suitable for processing documents in order, while direct access
methods, such as hashing and indexing, allow for faster retrieval of specific documents.
Efficient Processing with Sparse Vectors
In IR, document-term matrices are often very sparse since most documents contain only a small
subset of the entire vocabulary. Efficient storage and processing techniques for sparse vectors are
critical to avoid unnecessary memory usage and computational overhead.
Sparse vectors can be represented using various data structures, such as:
1. Compressed Sparse Row (CSR) Format: This format stores only the non-zero elements of a sparse
matrix, along with row and column indices. It reduces memory requirements by omitting zero
elements.
2. Inverted Index Compression: Posting lists in the inverted index can be compressed to save space.
Techniques like variable-byte encoding, delta encoding, and Golomb coding are used to represent
integers more efficiently.
Efficient processing of sparse vectors is essential for high-performance IR systems, especially when
dealing with large-scale collections containing millions or billions of documents.
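The CSR idea can be sketched in pure Python; the 3x6 document-term matrix and the query vector below are invented:

```python
# Compressed Sparse Row (CSR) storage for a tiny 3x6 document-term matrix.
# Only the non-zero counts are stored, plus column indices and row boundaries.
dense = [
    [2, 0, 0, 1, 0, 0],  # doc 0: term counts over a 6-term vocabulary
    [0, 0, 3, 0, 0, 1],  # doc 1
    [0, 4, 0, 0, 0, 0],  # doc 2
]

data, indices, indptr = [], [], [0]
for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)   # the non-zero values, row by row
            indices.append(col)  # the column (term) of each value
    indptr.append(len(data))     # where each row's values end

print(data)     # [2, 1, 3, 1, 4]
print(indices)  # [0, 3, 2, 5, 1]
print(indptr)   # [0, 2, 4, 5]

# Row i's non-zeros live in data[indptr[i]:indptr[i+1]], so a sparse dot
# product of doc 1 with a dense query vector touches only 2 entries.
query = [0, 1, 1, 0, 0, 1]
i = 1
score = sum(data[k] * query[indices[k]] for k in range(indptr[i], indptr[i + 1]))
print(score)  # 4
```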