AICS Unit I

Unit I: Introduction to Information Retrieval

AUTOMATIC CLASSIFICATION: Measures of association

In the context of Information Retrieval (IR), measures of association refer to statistical


techniques or metrics that help quantify the relationship between terms or variables within a
dataset. These measures are commonly used to analyze the significance of co-occurrences or
associations between terms in documents, queries, or other elements of the IR process. They
provide insights into the relevance and potential connections between terms, which can be
valuable for tasks like document indexing, query expansion, and relevance ranking.

Here are some common measures of association used in IR:

1. Mutual Information (MI): Mutual Information measures the extent to which the
presence or absence of one term is related to the presence or absence of another
term. It is the expected value, taken over all presence/absence combinations, of
the log ratio of the joint probability of the terms to the product of their
individual probabilities. High mutual information indicates a strong association
between terms.
2. Pointwise Mutual Information (PMI): PMI measures the association for one
specific outcome, typically both terms being present. It is calculated as the
logarithm of the ratio of the joint probability to the product of the individual
probabilities. Positive PMI values indicate that the terms co-occur more often
than expected under independence, while negative values indicate that they
co-occur less often than expected.
3. Chi-Square (χ²): Chi-Square measures the difference between the observed and
expected co-occurrence of terms in a contingency table. It's often used to assess
whether the observed associations are statistically significant. Higher χ² values imply
a stronger association between terms.
4. T-score (Student's t-test): T-score is used to determine if the difference between the
observed and expected co-occurrences of terms is significant. It's calculated using the
means and variances of the term frequencies. A higher absolute t-score indicates a
stronger association.
5. Association Measures in Association Rule Mining:
o Support: The proportion of transactions (documents) in which a term or term
pair occurs.
o Confidence: The conditional probability of finding the second term in a
transaction (document) given the presence of the first term.
o Lift: Measures how much more likely the second term is to occur when the
first term is present compared to its expected occurrence.
6. Correlation Coefficients: Measures like Pearson's correlation coefficient and
Spearman's rank correlation coefficient can be used to quantify the linear or
monotonic relationship between term occurrences.
7. Normalized Google Distance (NGD): This measure uses Google search frequencies
to quantify the relatedness of terms. It's based on the idea that the distance between
terms in the Google search space reflects their semantic relatedness.
8. Cramer's V: This measure is used to assess the strength of association between
categorical variables in a contingency table. It's an extension of the chi-square test and
can be useful for analyzing associations between terms.
These measures of association are used in various IR tasks such as document clustering,
query expansion, and term weighting. However, the choice of measure depends on the
specific task, the characteristics of the data, and the goals of the analysis.
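The first three measures can be computed directly from a 2x2 contingency table of term co-occurrence counts. A minimal sketch in Python, using made-up counts for two hypothetical terms:

```python
import math

# Hypothetical 2x2 contingency table for two terms across N documents
# (all counts are invented for illustration):
# n11 = both terms present, n10 = only term A, n01 = only term B, n00 = neither.
n11, n10, n01, n00 = 40, 10, 20, 130
N = n11 + n10 + n01 + n00

p_a = (n11 + n10) / N   # P(term A present)
p_b = (n11 + n01) / N   # P(term B present)

# Pointwise Mutual Information for the "both present" outcome.
pmi = math.log2((n11 / N) / (p_a * p_b))

# Mutual Information: expected PMI over all four cells of the table.
cells = [(n11, p_a, p_b), (n10, p_a, 1 - p_b),
         (n01, 1 - p_a, p_b), (n00, 1 - p_a, 1 - p_b)]
mi = sum((n / N) * math.log2((n / N) / (pa * pb))
         for n, pa, pb in cells if n > 0)

# Chi-square: squared deviation of observed from expected counts
# under independence, summed over the cells.
chi2 = sum((n - N * pa * pb) ** 2 / (N * pa * pb) for n, pa, pb in cells)

print(f"PMI = {pmi:.3f}, MI = {mi:.3f}, chi^2 = {chi2:.2f}")
```

With these counts the terms co-occur far more often than independence would predict, so PMI and MI come out positive and chi-square lands well above 3.84, the usual significance threshold at one degree of freedom.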

AUTOMATIC CLASSIFICATION: The cluster hypothesis

Two related ideas are involved here: automatic classification itself and the
cluster hypothesis that motivates it:

1. Automatic Classification: Automatic classification, also known as text classification


or document classification, is a natural language processing (NLP) task where
machine learning algorithms are used to automatically assign predefined categories or
labels to text documents based on their content. The goal is to train a model that can
accurately predict the appropriate category for new, unseen documents. This task has
various applications, such as spam detection, sentiment analysis, topic categorization,
and more.
2. Cluster Hypothesis: The "cluster hypothesis" is a concept from document
clustering and cluster-based retrieval. It states that closely associated
(similar) documents tend to be relevant to the same requests: documents that
share common themes, keywords, or concepts are likely to satisfy the same
information needs, and so can usefully be grouped together in a cluster. This
hypothesis forms the basis for many clustering algorithms, such as K-means
clustering and hierarchical clustering.
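The intuition behind the cluster hypothesis can be shown with a toy similarity computation: documents on the same topic have more similar term vectors than documents on different topics. The documents below are invented for illustration:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Three invented documents: two about cooking, one about football.
docs = [
    "recipe for pasta sauce with tomato and basil",
    "tomato basil sauce recipe for dinner",
    "football team won the league match",
]
vecs = [Counter(d.split()) for d in docs]

sim_cooking = cosine(vecs[0], vecs[1])   # same topic: high similarity
sim_mixed = cosine(vecs[0], vecs[2])     # different topics: low similarity
print(sim_cooking, sim_mixed)
```

A clustering algorithm applied to such vectors would place the two cooking documents in the same cluster, which is exactly what the hypothesis predicts.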

AUTOMATIC TEXT ANALYSIS: Indexing and Index Term Weighting, Probabilistic
Indexing, Automatic Classification

Let's break down these concepts in simple words:

1. Indexing and Index Term Weighting:


o Think of indexing as creating a catalog for a library. When you want to find a
book in a library, you look at the catalog, which tells you where to find the
book on the shelves.
o In information retrieval, indexing is like creating a catalog for documents, web
pages, or any kind of text. It involves going through the text and picking out
important words or phrases (index terms) that describe what the text is about.
o Index term weighting is like giving more importance to certain words in the
catalog. For example, if you're looking for a book about cooking, the word
"recipe" might be more important than the word "book" in the catalog. So,
index term weighting assigns values to words to indicate their importance in
describing the content.
2. Probabilistic Indexing:
o This is a fancy term for making educated guesses. In information retrieval,
probabilistic indexing means that the system tries to estimate the probability
that a document (like a web page) is relevant to your search.
o Imagine you're searching for information about cats. The system might look at
various factors like how often the word "cat" appears in a document, how
many other documents mention cats, and how recent the document is. Based
on these factors, it makes a guess about how likely it is that the document is
what you're looking for.
3. Automatic Classification:
o Automatic classification is like sorting things into different categories without
someone doing it manually. It's a bit like a mail sorting machine at a post
office that reads the address on an envelope and decides which bin to put it in.
o In information retrieval, automatic classification means that a computer
program or algorithm categorizes documents into different groups or topics
based on their content. For example, it might classify news articles into
categories like "sports," "politics," or "entertainment" without a human having
to do it.
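The catalog analogy for index term weighting is commonly made concrete with TF-IDF, which weights a term by how often it occurs in a document and how rare it is across the collection. A minimal sketch, using an invented mini-collection:

```python
import math
from collections import Counter

# Invented mini-collection of three documents.
docs = [
    "cheap recipe book with many recipes",
    "recipe for chicken soup",
    "history of ancient rome",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for toks in tokenized for term in set(toks))

def tf_idf(term, toks):
    """Classic tf-idf: term frequency times log inverse document frequency."""
    return toks.count(term) * math.log(N / df[term])

# "recipe" appears in two of the three documents, "history" in only one,
# so "history" is the more discriminating index term.
w_recipe = tf_idf("recipe", tokenized[1])
w_history = tf_idf("history", tokenized[2])
print(w_recipe, w_history)
```

This is why a rare, topical word like "recipe" or "history" ends up mattering more in the catalog than a word that appears everywhere.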

In a nutshell, these concepts help organize and find information efficiently in the vast sea of
digital content on the internet or in databases. Indexing and weighting make it easier to find
relevant stuff, probabilistic indexing tries to guess what you're looking for, and automatic
classification helps in sorting things into neat categories automatically.
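Automatic classification, in turn, can be sketched with a tiny naive Bayes text classifier, one common technique among many; the training examples below are made up:

```python
import math
from collections import Counter, defaultdict

# Tiny invented training set of labelled documents.
train = [
    ("the team won the match", "sports"),
    ("player scores a goal", "sports"),
    ("parliament passed the new bill", "politics"),
    ("the minister gave a speech", "politics"),
]

# Per-category term counts.
counts = defaultdict(Counter)
for text, label in train:
    counts[label].update(text.split())

vocab = {t for c in counts.values() for t in c}

def classify(text):
    """Multinomial naive Bayes with add-one smoothing and uniform priors."""
    best, best_lp = None, float("-inf")
    for label, c in counts.items():
        total = sum(c.values())
        lp = sum(math.log((c[t] + 1) / (total + len(vocab)))
                 for t in text.split())
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(classify("the player won"))   # prints "sports"
```

Given a new document, the classifier picks the category whose training documents make the new document's words most probable, with no human sorting involved.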
