Data Mining Long Answers
- Binning: Sort data and partition into equal-sized bins, then smooth by bin mean, median,
or boundaries.
- Regression: Fit a regression function (like linear) to the data and use the function to
smooth.
- Clustering: Detect and smooth outliers by grouping similar data and averaging within
clusters.
- Moving average: Average a window of nearby values to smooth fluctuations.
Together, these smoothing techniques reduce noise and improve data quality before mining algorithms are applied.
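The binning technique above can be sketched in a few lines. This is an illustrative example (the data values and bin size are chosen here, not taken from a textbook exercise): sort the data, split it into equal-sized bins, and replace every value in a bin with the bin mean.

```python
# Smoothing by bin means: sort, partition into equal-sized bins,
# then replace each value with its bin's mean (illustrative data).
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)  # smooth by bin mean
    smoothed.extend([mean] * len(bin_values))

# First bin (4, 8, 15) becomes 9.0; second (21, 21, 24) becomes 22.0;
# third (25, 28, 34) becomes 29.0.
```

Smoothing by bin medians or bin boundaries follows the same loop, changing only the value each bin member is replaced with.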
5. Explain slice and pivot operations on data cube with a neat sketch?
A. The slice operation selects a single value along one dimension of a data cube, producing a sub-cube (e.g., all sales for the year 2023).
Pivot operation (also called rotate) reorients the cube view, changing the dimensional
orientation to view data from different angles (e.g., swap rows and columns).
[Diagram: a 3-D data cube, showing a slice cut along one plane and the cube rotated (pivoted) to a new orientation.]
Choosing the right type affects performance and storage during aggregation.
Apriori first finds frequent 1-itemsets (e.g., milk, bread), then 2-itemsets (e.g., milk &
bread), and so on, pruning infrequent ones early.
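The level-wise search described above can be sketched as follows. This is a minimal illustration of the Apriori idea (the transactions and support threshold are made-up example data, and the candidate-generation step is simplified to a pairwise join):

```python
from itertools import combinations

# Made-up market-basket transactions and an absolute support threshold.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
min_support = 2

def apriori(transactions, min_support):
    # Level 1: candidate 1-itemsets are all single items.
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Count the support of each candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join surviving k-itemsets to form (k+1)-candidates;
        # infrequent itemsets are pruned early (the Apriori property).
        keys = list(survivors)
        current = list({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == k + 1})
        k += 1
    return frequent

freq = apriori(transactions, min_support)
# e.g. {milk, bread} appears in 3 of the 4 transactions, so it survives;
# {eggs} appears only once and is pruned at level 1.
```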
8. Define the terms frequent item sets, closed item sets and association rules?
A. - Frequent Item Sets: Groups of items that appear together frequently in transactions
(e.g., {milk, bread}).
- Closed Item Sets: Itemsets that have no proper superset with the same support; a closed frequent itemset is both closed and frequent.
- Association Rules: Implication rules of the form X → Y, meaning if X occurs, Y is likely to
occur (e.g., milk → bread).
These are used in market basket analysis to find relationships among items.
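The strength of a rule X → Y is usually measured by its support and confidence. A small sketch of both measures, using made-up transactions (confidence(X → Y) = support(X ∪ Y) / support(X)):

```python
# Illustrative support/confidence computation for a rule X -> Y.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    # confidence(X -> Y) = support(X union Y) / support(X)
    return support(X | Y, transactions) / support(X, transactions)

s = support({"milk", "bread"}, transactions)      # 2/4 = 0.5
c = confidence({"milk"}, {"bread"}, transactions)  # 0.5 / 0.75 = 2/3
```

A rule such as milk → bread is reported only when both measures exceed user-specified minimum thresholds.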
9. Explain each of the following clustering algorithms in terms of the following
criteria: (i) shapes of clusters that can be determined; (ii) input parameters that
must be specified; and (iii) limitations. (a) k-means (b) k-medoids (c) CLARA
A. (a) K-Means:
(i) Assumes spherical clusters
(ii) Requires number of clusters (k)
(iii) Sensitive to outliers and initial seed
(b) K-Medoids (PAM):
(i) Tends to find compact, roughly spherical clusters; centers (medoids) are actual data points
(ii) Requires number of clusters (k)
(iii) More robust to noise and outliers than k-means, but computationally expensive on large datasets
(c) CLARA:
(i) Same cluster shapes as k-medoids, since it applies PAM to samples of the data
(ii) Requires number of clusters (k) and the sample size
(iii) Quality depends on the samples drawn; good medoids may be missed if the sample is not representative
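The k-means criteria above can be seen in a minimal implementation. This is an illustrative sketch on 2-D points (the data, k, and iteration count are chosen here): note that it takes k as an input parameter and that the result depends on the random initial seeds.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # k must be specified; k-means is sensitive to these initial seeds.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center
        # (squared Euclidean distance favors spherical clusters).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        # (means are pulled by outliers, hence the outlier sensitivity).
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

# Two well-separated groups around (1, 1) and (8, 8).
points = [(1, 1), (1.2, 0.9), (0.8, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
centers = kmeans(points, k=2)
```

K-medoids would differ only in the update step, choosing the cluster member that minimizes total dissimilarity instead of the mean; CLARA would run that procedure on random samples of the points.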
10. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
Compute (a) the Euclidean distance between the two objects; (b) the Manhattan distance between the two objects; (c) the Minkowski distance between the two objects, using p = 3.
A. Let A = (22, 1, 42, 10), B = (20, 0, 36, 8)
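The arithmetic can be checked with a short script (a verification aid, not part of the original answer). All three are instances of the Minkowski distance, (Σ|aᵢ − bᵢ|ᵖ)^(1/p), with p = 2, p = 1, and p = 3 respectively; the attribute differences are |2|, |1|, |6|, |2|.

```python
# Check the three distances for A = (22, 1, 42, 10), B = (20, 0, 36, 8).
A = (22, 1, 42, 10)
B = (20, 0, 36, 8)

def minkowski(a, b, p):
    # Minkowski distance: (sum of |a_i - b_i|^p) ** (1/p).
    # p = 2 gives Euclidean distance, p = 1 gives Manhattan distance.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

euclidean = minkowski(A, B, 2)  # sqrt(4 + 1 + 36 + 4) = sqrt(45) ~ 6.708
manhattan = minkowski(A, B, 1)  # 2 + 1 + 6 + 2 = 11
mink_p3 = minkowski(A, B, 3)    # (8 + 1 + 216 + 8)**(1/3) = 233**(1/3) ~ 6.153
```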