AICS Unit I
1. Mutual Information (MI): Mutual Information measures how much the presence or
absence of one term tells us about the presence or absence of another term. It is the
expected value, taken over all four presence/absence combinations, of the logarithm of
the ratio of the joint probability to the product of the individual (marginal)
probabilities. High mutual information indicates a strong association between terms.
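As a minimal sketch, the expectation above can be computed directly from a 2x2 document-count table. The counts below are made up purely for illustration:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Expected MI over the four presence/absence cells of a 2x2
    term co-occurrence table: sum of p(x,y) * log2(p(x,y) / (p(x)p(y)))."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # (cell count, row marginal, column marginal) for each cell
    for joint, row, col in [
        (n11, n11 + n10, n11 + n01),  # term A present, term B present
        (n10, n11 + n10, n10 + n00),  # A present, B absent
        (n01, n01 + n00, n11 + n01),  # A absent,  B present
        (n00, n01 + n00, n10 + n00),  # A absent,  B absent
    ]:
        if joint > 0:
            mi += (joint / n) * math.log2((joint * n) / (row * col))
    return mi

# Strongly associated terms vs. exactly independent terms (made-up counts):
print(mutual_information(40, 10, 10, 40))  # clearly above zero
print(mutual_information(25, 25, 25, 25))  # 0.0: no association
```

Note that MI is zero exactly when the observed cell counts match what independence would predict, as in the second call.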
2. Pointwise Mutual Information (PMI): PMI is related to Mutual Information but
looks at a single outcome (both terms present) rather than the expectation over all
outcomes. It is calculated as the logarithm of the ratio of the joint probability to the
product of the individual probabilities. Positive PMI values indicate that the terms
co-occur more often than expected under independence, while negative values indicate
they co-occur less often than expected.
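A sketch of the same ratio computed from raw document counts (the counts are invented for illustration):

```python
import math

def pmi(n_ab, n_a, n_b, n):
    """PMI = log2( P(a,b) / (P(a) * P(b)) ), from document counts:
    n_ab docs contain both terms, n_a / n_b contain each term, n total docs."""
    return math.log2((n_ab * n) / (n_a * n_b))

# Made-up counts: 80 of 1000 docs contain both terms, 100 contain each.
# Under independence we would expect 100 * 100 / 1000 = 10 such docs.
print(pmi(80, 100, 100, 1000))  # 3.0 -> strong positive association
print(pmi(2, 100, 100, 1000))   # negative -> rarer than chance
```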
3. Chi-Square (χ²): Chi-Square measures the difference between the observed and
expected co-occurrence of terms in a contingency table. It is often used to assess
whether an observed association is statistically significant. Higher χ² values give
stronger evidence against the independence of the terms.
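For a 2x2 table there is a convenient closed form of the statistic; a small sketch with made-up counts:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term co-occurrence table, using the
    closed form N * (n11*n00 - n10*n01)^2 / (product of the marginals)."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den

# Made-up counts; 3.84 is the critical value at the 5% level (1 df).
print(chi_square(40, 10, 10, 40))  # 36.0, far above 3.84 -> significant
print(chi_square(25, 25, 25, 25))  # 0.0 -> exactly independent
```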
4. T-score (Student's t-test): T-score is used to determine whether the difference
between the observed and expected co-occurrences of terms is significant. It compares
the observed co-occurrence probability with the probability expected under
independence, scaled by an estimate of the standard error. A higher absolute t-score
indicates a stronger association.
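A sketch of the collocation-style t-score, where the variance is approximated by the observed probability itself (counts invented for illustration):

```python
import math

def t_score(n_ab, n_a, n_b, n):
    """Collocation t-score: (observed - expected co-occurrence probability)
    divided by the estimated standard error sqrt(observed / n)."""
    observed = n_ab / n
    expected = (n_a / n) * (n_b / n)
    return (observed - expected) / math.sqrt(observed / n)

# Made-up counts: observed 0.08 vs. expected 0.01 co-occurrence rate.
print(t_score(80, 100, 100, 1000))  # well above 2.576 -> significant
```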
5. Association Measures in Association Rule Mining:
- Support: The proportion of transactions (documents) in which a term or term
pair occurs.
- Confidence: The conditional probability of finding the second term in a
transaction (document) given the presence of the first term.
- Lift: Measures how much more likely the second term is to occur when the
first term is present compared to its expected occurrence.
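The three measures can be sketched over a tiny made-up corpus, treating each document as a transaction (set of terms):

```python
def rule_metrics(transactions, a, b):
    """Support, confidence, and lift for the rule a -> b over a list
    of transactions (here: sets of terms, one per document)."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if a in t)
    n_b = sum(1 for t in transactions if b in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    support = n_ab / n            # how often the pair occurs at all
    confidence = n_ab / n_a       # P(b | a)
    lift = confidence / (n_b / n) # confidence relative to P(b)
    return support, confidence, lift

# Five invented "documents"; lift > 1 means b is more likely given a.
docs = [{"data", "mining"}, {"data", "mining"}, {"data"},
        {"mining"}, {"text"}]
print(rule_metrics(docs, "data", "mining"))
```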
6. Correlation Coefficients: Measures like Pearson's correlation coefficient and
Spearman's rank correlation coefficient can be used to quantify the linear or
monotonic relationship between term occurrences.
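On 0/1 term-occurrence vectors, Pearson's coefficient reduces to the phi coefficient; a minimal sketch with invented occurrence vectors:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors;
    on 0/1 term-occurrence vectors this is the phi coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Occurrence of two terms across six documents (made-up 0/1 vectors):
print(pearson([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
```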
7. Normalized Google Distance (NGD): This measure uses search engine result counts
to quantify the relatedness of terms. It is based on the idea that the distance between
terms in the Google search space reflects their semantic relatedness: terms that
frequently appear on the same pages get a small distance.
8. Cramer's V: This measure is used to assess the strength of association between
categorical variables in a contingency table. It is derived from the chi-square statistic,
normalized to the range [0, 1], and can be useful for analyzing associations between
terms.
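For a 2x2 table the normalization divisor min(r-1, c-1) is 1, so V is simply sqrt(χ²/N); a sketch with the same made-up counts used above:

```python
import math

def cramers_v(n11, n10, n01, n00):
    """Cramer's V for a 2x2 table: sqrt(chi2 / (N * min(r-1, c-1))),
    where min(r-1, c-1) = 1 for two binary variables."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    chi2 = num / den
    return math.sqrt(chi2 / n)

print(cramers_v(40, 10, 10, 40))  # 0.6 -> moderately strong association
print(cramers_v(25, 25, 25, 25))  # 0.0 -> no association
```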
These measures of association are used in various IR tasks such as document clustering,
query expansion, and term weighting. However, the choice of measure depends on the
specific task, the characteristics of the data, and the goals of the analysis.
Automatic Text Analysis: Indexing and Index Term Weighting, Probabilistic Indexing,
Automatic Classification
In a nutshell, these concepts help organize and retrieve information efficiently from the
vast sea of digital content on the internet or in databases. Indexing and term weighting
make relevant documents easier to find, probabilistic indexing estimates how likely a term
is to be relevant to a document, and automatic classification sorts documents into
categories without manual effort.