Data Mining Exam Answers - April 2024
Process:
1. Construct a graph where each data point is a vertex, and edges represent distances
between points.
2. Use an algorithm like Kruskal’s or Prim’s to find the MST, ensuring the total edge
weight is minimized.
3. For k partitions, remove the k-1 longest edges, splitting the MST into k subtrees, each
representing a cluster.
Example: For 5 points (A, B, C, D, E) with given pairwise distances, the MST might connect A-B, B-C, C-D, and D-E. Removing the longest edge (e.g., D-E) splits it into two clusters.
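The following is a minimal Python sketch of this procedure, using Kruskal's algorithm with a simple union-find structure; the five coordinates and k = 2 below are illustrative assumptions, not values given in the question.

# MST-based partitioning sketch: build the MST with Kruskal's algorithm,
# then cut the k-1 longest edges and read off the connected components.
# The coordinates and k are assumed purely for illustration.
from itertools import combinations
import math

points = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (3, 0), "E": (7, 0)}
k = 2  # desired number of clusters

def dist(p, q):
    return math.dist(points[p], points[q])

# Kruskal's: scan edges in increasing weight, keep an edge if it joins
# two components (tracked with union-find).
parent = {v: v for v in points}
def find(v):
    while parent[v] != v:
        parent[v] = parent[parent[v]]  # path halving
        v = parent[v]
    return v

mst = []
for u, v in sorted(combinations(points, 2), key=lambda e: dist(*e)):
    ru, rv = find(u), find(v)
    if ru != rv:
        parent[ru] = rv
        mst.append((u, v))

# Remove the k-1 longest MST edges, then group points by component.
mst.sort(key=lambda e: dist(*e))  # already in increasing order; kept for clarity
kept = mst[:len(mst) - (k - 1)]
parent = {v: v for v in points}
for u, v in kept:
    parent[find(u)] = find(v)
clusters = {}
for v in points:
    clusters.setdefault(find(v), []).append(v)
print(list(clusters.values()))  # [['A', 'B', 'C', 'D'], ['E']]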
Advantages:
Ensures connectivity of all points with minimal distance, providing a robust initial
structure.
Reduces sensitivity to initial centroid selection, a common issue in k-means.
Limitations:
Computational cost increases with dataset size (O(E log V) for Kruskal’s, where E is
edges and V is vertices).
Less effective in high-dimensional spaces where distance metrics become unreliable.
Support: This measures the frequency of the rule’s itemset in the dataset, calculated
as the proportion of transactions containing both the antecedent and consequent (e.g.,
support = 60% if 60% of transactions include {bread, butter}). It indicates the rule’s
statistical significance.
Confidence: This assesses the reliability of the rule, defined as the probability of the
consequent given the antecedent (e.g., confidence = 75% if 75% of bread transactions
include butter). It reflects the rule’s predictive strength.
Lift: This compares the observed support of the rule with the expected support if
items were independent (e.g., lift > 1 indicates a positive correlation, such as 1.5
meaning the rule is 50% more likely than random). It measures the rule’s interest or
strength of association.
Application: These metrics help filter rules—high support ensures commonality, high
confidence ensures reliability, and high lift ensures relevance. However, challenges include
balancing trade-offs (e.g., high confidence with low support) and avoiding spurious
correlations, requiring domain knowledge for interpretation.
In association rule mining, these measures collectively determine rule quality, guiding
decisions in applications like market basket analysis.
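As a worked example, the following Python sketch computes all three metrics for the rule {bread} -> {butter}; the five transactions are assumed and chosen so the resulting figures match those quoted above (60% support, 75% confidence).

# Support, confidence, and lift for {bread} -> {butter} on a toy
# transaction list (the transactions themselves are assumed).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / n

antecedent, consequent = {"bread"}, {"butter"}
rule_support = support(antecedent | consequent)      # P(bread and butter)
confidence = rule_support / support(antecedent)      # P(butter | bread)
lift = confidence / support(consequent)              # ratio vs. independence

print(f"support={rule_support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# support=0.60, confidence=0.75, lift=1.25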
Classification: Assigns data points to predefined categories (e.g., spam vs. not spam)
using algorithms like decision trees or neural networks. It requires labeled training
data and is widely used in fraud detection.
Clustering: Groups similar data points into clusters without prior labels, using
methods like k-means or hierarchical clustering. It’s useful for customer segmentation
and pattern recognition.
Association Rule Mining: Identifies relationships between variables (e.g., “if bread,
then butter”) using rules like support and confidence, as seen in market basket
analysis.
Regression: Predicts continuous outcomes (e.g., sales figures) based on input variables, employing techniques such as linear regression (logistic regression, despite its name, predicts categorical outcomes and belongs to classification).
Anomaly Detection: Spots unusual patterns or outliers (e.g., fraudulent transactions)
using statistical or distance-based methods.
Summarization: Provides concise data representations, such as averages or trends, to aid understanding.
These tasks require preprocessing (e.g., cleaning data), feature selection, and validation to ensure reliability, though over-reliance on automated tools can sometimes overlook contextual nuances.
21. Discuss Bayes' theorem from a statistical perspective in data mining.
Bayes' theorem, expressed as \( P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \), is a foundational statistical tool in data mining for probabilistic classification and decision-making. Here, \( P(A|B) \) is the posterior probability of hypothesis A given evidence B, \( P(B|A) \) is the likelihood, \( P(A) \) is the prior probability, and \( P(B) \) is the marginal probability of the evidence. In data mining, it underpins Naive Bayes classifiers,
which assume independence between features to predict categories (e.g., email spam
detection). Its strength lies in handling small datasets and incorporating prior
knowledge, but the independence assumption can oversimplify real-world data,
potentially skewing results. Variants like Bayesian networks address this by modeling
dependencies, making it versatile for tasks like medical diagnosis or sentiment
analysis, though it requires careful tuning to avoid bias from inaccurate priors.
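A minimal Python sketch of a Naive Bayes spam classifier built directly on this theorem follows; the four labelled messages, the bag-of-words features, and the Laplace smoothing constant are illustrative assumptions.

# Naive Bayes sketch: the posterior P(class | words) is proportional to
# P(class) * product of P(word | class), assuming independent words.
# The tiny labelled dataset below is assumed for illustration only.
from collections import Counter

docs = [("buy cheap pills now", "spam"),
        ("cheap pills cheap offer", "spam"),
        ("meeting agenda for monday", "ham"),
        ("lunch on monday", "ham")]

priors = Counter(label for _, label in docs)           # class frequencies
word_counts = {"spam": Counter(), "ham": Counter()}    # likelihood counts
for text, label in docs:
    word_counts[label].update(text.split())

def posterior(text, label, alpha=1.0):
    # Unnormalised posterior with Laplace smoothing (alpha) so unseen
    # words do not zero out the product.
    counts = word_counts[label]
    total = sum(counts.values())
    vocab = len({w for c in word_counts.values() for w in c})
    p = priors[label] / len(docs)                      # prior P(class)
    for w in text.split():
        p *= (counts[w] + alpha) / (total + alpha * vocab)
    return p

msg = "cheap pills for monday"
scores = {c: posterior(msg, c) for c in priors}
print(max(scores, key=scores.get))  # 'spam' for this toy dataset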
22. Illustrate the K-nearest neighbors method in distance-based algorithms.
The K Nearest Neighbors (k-NN) algorithm is a simple, instance-based method in
distance-based algorithms used for classification and regression. It operates by finding
the k closest data points (neighbors) to a new, unclassified point based on a distance
metric, typically Euclidean distance (\( d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \)). For
classification, the majority class among the k neighbors determines the new point’s
label; for regression, it averages the neighbors’ values.
Illustration: Suppose we have a dataset with two classes (A and B) and a new point P. With
k=3, we calculate distances to all points, select the three nearest (e.g., 2 from A, 1 from B),
and assign P to class A. The choice of k is critical: small k (e.g., 1) is sensitive to noise, while large k smooths the decision boundary but may pull in irrelevant, more distant points. Preprocessing like normalization is
essential to avoid bias from varying feature scales. k-NN’s strength lies in its simplicity and
adaptability, but it struggles with high-dimensional data (curse of dimensionality) and
requires significant memory for large datasets, making it less efficient for real-time
applications.
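A minimal Python sketch of the classification step just illustrated follows; the 2-D training points, their labels, and the query point are assumed, and are arranged so that two of the three nearest neighbours belong to class A.

# k-NN sketch: classify a query point by majority vote among its k
# nearest training points (Euclidean distance). Data are assumed.
import math
from collections import Counter

train = [((1.0, 1.0), "A"), ((2.0, 1.0), "A"), ((4.0, 3.0), "A"),
         ((3.0, 2.0), "B"), ((6.0, 6.0), "B")]

def knn_classify(p, k=3):
    # Sort training points by distance to p, keep the k nearest,
    # and return the majority class among them.
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], p))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# In practice, features are normalised first so no dimension dominates.
print(knn_classify((2.5, 1.5)))  # 'A': two neighbours from A, one from B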
Demonstration: Given a dataset with n points and k clusters, PAM (Partitioning Around Medoids) starts by randomly
selecting k medoids. It iteratively swaps a medoid with a non-medoid point if the swap
reduces the total cost (sum of distances). For example, with 5 points (A, B, C, D, E) and k=2,
if A and C are initial medoids, PAM evaluates swapping C with D. If the new configuration
lowers the total distance, D replaces C. This process repeats until convergence.
PAM is more robust to noise and outliers than k-means because medoids are real data points,
not averages. However, its computational complexity, O(k(n-k)²) per iteration, makes it slower than k-means, especially for large datasets. It's ideal for small-to-medium datasets where outlier
resistance is key, though it requires careful initial medoid selection to avoid suboptimal
clustering.
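A minimal Python sketch of the swap loop described above follows; the five 1-D points, the initial medoids A and C, and k = 2 are illustrative assumptions (a real implementation would choose the initial medoids randomly or with a build step).

# PAM (k-medoids) sketch: greedily swap a medoid with a non-medoid point
# whenever the swap lowers the total distance to the nearest medoid.
points = {"A": 1.0, "B": 2.0, "C": 6.0, "D": 7.0, "E": 8.0}
k = 2

def cost(medoids):
    # Sum over all points of the distance to the closest medoid.
    return sum(min(abs(points[p] - points[m]) for m in medoids) for p in points)

medoids = ["A", "C"]  # assumed initial medoids
improved = True
while improved:
    improved = False
    for m in list(medoids):
        for p in points:
            if p in medoids:
                continue
            candidate = [p if x == m else x for x in medoids]
            if cost(candidate) < cost(medoids):
                medoids = candidate  # accept the improving swap
                improved = True
print(medoids, cost(medoids))  # ['A', 'D'] 3.0 for this toy data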
Process: The Apriori algorithm, for instance, starts by generating frequent 1-itemsets (items
meeting minimum support). It then iteratively builds larger itemsets (2-itemsets, 3-itemsets,
etc.) by joining frequent sets and pruning those below the threshold using the Apriori
property: any subset of a frequent itemset must also be frequent. Example: In a transaction
dataset with {bread, milk, butter}, if {bread, butter} has 60% support (above the 50%
threshold), it’s a large 2-itemset. This continues until no new large itemsets are found.
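A minimal Python sketch of this join-and-prune loop follows; the transactions and the 50% minimum support threshold are assumptions chosen so that {bread, butter} reaches the 60% support quoted in the example.

# Apriori sketch: grow frequent itemsets level by level, pruning any
# candidate with an infrequent subset (the Apriori property).
from itertools import combinations

transactions = [{"bread", "milk", "butter"}, {"bread", "butter"},
                {"bread", "milk"}, {"bread", "butter"}, {"milk"}]
min_support = 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent 1-itemsets.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

k = 2
while frequent[-1]:
    prev = frequent[-1]
    # Join step: combine frequent (k-1)-itemsets into k-item candidates.
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must already be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in frequent[:-1]:
    print(sorted(tuple(sorted(s)) for s in level))
# [('bread',), ('butter',), ('milk',)]
# [('bread', 'butter')]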
Key Metrics:
Support: Percentage of transactions containing the itemset (e.g., 60% for {bread,
butter}).
Confidence: Probability of buying butter given bread (e.g., 75% if 75% of bread
transactions include butter).
Lift: Ratio of observed support to expected support, indicating rule strength (e.g., lift
> 1 suggests positive correlation).
Conclusion: Large itemsets enable actionable insights (e.g., product placement strategies),
but success depends on tuning parameters and handling scalability, making advanced
algorithms and parallel processing valuable for large datasets.