
AAT-II

Data Mining and Knowledge Discovery


1. Explain, with an example, a case where data mining is crucial to the success of a
business. What data mining functions does this business need? Can these functions be
performed instead by data query processing or simple statistical analysis?
A. Data mining plays a crucial role in customer relationship management (CRM),
particularly for businesses like Amazon. Amazon uses data mining to analyze customer
purchase histories, browsing patterns, and feedback to recommend products. For example,
if a customer frequently buys fiction books, Amazon will recommend new fiction releases.

The data mining functions needed include:

- Classification (to predict customer buying behavior),
- Clustering (to segment customers),
- Association Rule Mining (to discover patterns like “customers who bought X also bought
Y”), and
- Prediction (to forecast future sales).

These functions cannot be fully replaced by simple query processing or basic statistical
analysis: queries can retrieve data but cannot reveal hidden patterns, and statistical
analysis provides summaries, whereas data mining yields predictive insights. Hence, data
mining is essential for success in such cases.

2. Explain the difference between discrimination and classification? Between
characterization and clustering? Between classification and prediction? For each of these
pairs of tasks, how are they similar?
A. Discrimination vs. Classification:
- Discrimination compares the general features of a target class against those of one or
more contrasting classes (descriptive), whereas classification builds models to assign new
data to predefined classes (predictive).
- Similarity: Both deal with class labels.

Characterization vs. Clustering:

- Characterization provides summarized data about a target class (descriptive), while
clustering groups data based on similarity without predefined labels (unsupervised).
- Similarity: Both analyze general patterns in data.

Classification vs. Prediction:

- Classification assigns categories, prediction forecasts continuous values.
- Similarity: Both are supervised learning and involve a training phase.

3. Briefly explain the data smoothing techniques?

A. Data smoothing is used to remove noise and outliers from data. Key techniques include:

- Binning: Sort the data and partition it into equal-frequency (equal-depth) bins, then
smooth by bin means, bin medians, or bin boundaries (see the sketch after this answer).
- Regression: Fit a regression function (like linear) to the data and use the function to
smooth.
- Clustering: Detect and smooth outliers by grouping similar data and averaging within
clusters.
- Moving average: Average a window of nearby values to smooth fluctuations.

These techniques help improve data quality before applying mining algorithms.
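As an illustration of smoothing by bin means, a minimal Python sketch; the nine price
values and the bin depth of 3 are purely illustrative:

```python
# Smoothing by bin means: sort values, split into equal-frequency bins,
# then replace every value in a bin with that bin's mean.
def smooth_by_bin_means(values, depth):
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), depth):
        bin_ = data[i:i + depth]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([round(mean, 2)] * len(bin_))
    return smoothed

# Illustrative price list, smoothed with bins of depth 3.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```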

4. Classify the various data reduction techniques?

A. Data reduction techniques aim to reduce the volume of data without losing its integrity.
Major types:

- Dimensionality Reduction: Removes irrelevant or redundant attributes (e.g., PCA –
Principal Component Analysis).
- Numerosity Reduction: Uses models (e.g., regression, clustering) or histograms to
represent data with fewer numbers.
- Data Compression: Uses encoding schemes like wavelet transforms.
- Aggregation: Data is summarized into higher-level forms, such as daily to monthly sales.
- Sampling: Selects representative subsets of the data.

These techniques are essential to make large datasets manageable for analysis.
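To make dimensionality reduction concrete, a short sketch using scikit-learn's PCA; the toy
matrix and the choice of two components are assumptions for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy dataset: 6 records with 4 correlated attributes (illustrative values).
X = np.array([
    [2.5, 2.4, 1.2, 0.5],
    [0.5, 0.7, 0.3, 0.1],
    [2.2, 2.9, 1.1, 0.6],
    [1.9, 2.2, 0.9, 0.4],
    [3.1, 3.0, 1.5, 0.7],
    [2.3, 2.7, 1.0, 0.5],
])

# Project the 4-dimensional data onto its 2 strongest principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (6, 2) -- half the attributes
print(pca.explained_variance_ratio_)  # fraction of variance retained per component
```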

5. Explain slice and pivot operations on a data cube with a neat sketch?
A. The slice operation selects a single value along one dimension of a data cube, producing
a sub-cube with one fewer dimension (e.g., sales for the year 2023).
The pivot operation (also called rotate) reorients the cube view, changing the dimensional
orientation to view the data from different angles (e.g., swapping rows and columns).

[Diagram: a 3D data cube, with one plane cut out to show a slice and the cube rotated to
show a pivot.]

These operations help in multidimensional analysis and OLAP (Online Analytical Processing).

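Both operations can be approximated on a flat fact table with pandas; the sales figures and
column names below are invented for illustration:

```python
import pandas as pd

# Toy fact table: one row per (year, region, product) cell of the cube.
sales = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023, 2023, 2023],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "product": ["A", "A", "A", "A", "B", "B"],
    "amount":  [100, 150, 120, 180, 90, 110],
})

# Slice: fix one dimension at a single value (year = 2023).
slice_2023 = sales[sales["year"] == 2023]

# Pivot (rotate): view the slice with regions as rows and products as columns...
view1 = slice_2023.pivot_table(index="region", columns="product",
                               values="amount", aggfunc="sum")
# ...then swap the orientation to view products by region instead.
view2 = view1.T

print(view1)
print(view2)
```
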
6. Summarize different measures used in data warehouse construction with an
example?
A. Data warehouse measures can be classified as:

- Distributive (e.g., COUNT, SUM): Can be computed on partitions and merged.
- Algebraic (e.g., AVERAGE): Computed from a fixed number of distributive measures.
- Holistic (e.g., MEDIAN, MODE): Require access to the entire data set.

Example: In a sales warehouse,
- SUM of sales is distributive,
- AVERAGE sale value is algebraic,
- MEDIAN sales price is holistic.

Choosing the right type affects performance and storage during aggregation.
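A small sketch showing why the distinction matters when data is split across partitions (the
values are illustrative):

```python
import statistics

# Sales values held in two separate partitions (e.g., two sites).
part1 = [10, 20, 30]
part2 = [40, 50]

# Distributive: SUM can be computed per partition and the parts merged.
total = sum(part1) + sum(part2)           # 150

# Algebraic: AVERAGE is derived from two distributive values (SUM, COUNT).
avg = total / (len(part1) + len(part2))   # 30.0

# Holistic: MEDIAN needs the entire data set; partial medians do not merge.
median = statistics.median(part1 + part2) # 30

print(total, avg, median)
```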

7. Discuss which algorithm is most influential for mining frequent itemsets for Boolean
association rules? Explain with an example?
A. Apriori is the most influential algorithm for mining frequent itemsets in transactional
databases. It relies on the Apriori property: if an itemset is frequent, all of its subsets
must also be frequent, so any candidate with an infrequent subset can be pruned.

Example: For a supermarket:

Transaction 1: {milk, bread}
Transaction 2: {milk, diaper, beer, bread}
Transaction 3: {milk, diaper, beer, cola}

Apriori first finds frequent 1-itemsets (e.g., milk, bread), then 2-itemsets (e.g., milk &
bread), and so on, pruning infrequent ones early.

It’s effective but can be computationally expensive due to candidate generation.
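A minimal pure-Python sketch of the Apriori level-wise loop on the three transactions above;
a minimum support count of 2 is an assumed threshold, and the full subset-based candidate
pruning step is omitted for brevity:

```python
# Toy transactions from the example above.
transactions = [
    {"milk", "bread"},
    {"milk", "diaper", "beer", "bread"},
    {"milk", "diaper", "beer", "cola"},
]
MIN_SUPPORT = 2  # itemset must appear in at least 2 transactions (assumed)

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
items = {item for t in transactions for item in t}
frequent = [{frozenset([i]) for i in items
             if support_count(frozenset([i])) >= MIN_SUPPORT}]

# Level k: join frequent (k-1)-itemsets into k-item candidates, keep frequent ones.
# (Full Apriori also discards candidates having any infrequent subset.)
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    frequent.append({c for c in candidates if support_count(c) >= MIN_SUPPORT})
    k += 1

for level, itemsets in enumerate(frequent[:-1], start=1):
    print(f"frequent {level}-itemsets:", [sorted(s) for s in itemsets])
```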

8. Define the terms frequent item sets, closed item sets and association rules?
A. - Frequent Item Sets: Groups of items that appear together frequently in transactions
(e.g., {milk, bread}).
- Closed Item Sets: Itemsets with no proper superset having the same support; a closed
frequent itemset is both closed and frequent.
- Association Rules: Implication rules of the form X → Y, meaning if X occurs, Y is likely to
occur (e.g., milk → bread).

These are used in market basket analysis to find relationships among items.
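A short sketch computing support and confidence for the rule milk → bread, reusing the toy
transactions from question 7:

```python
transactions = [
    {"milk", "bread"},
    {"milk", "diaper", "beer", "bread"},
    {"milk", "diaper", "beer", "cola"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

X, Y = {"milk"}, {"bread"}

# Rule milk -> bread: support = P(milk and bread), confidence = P(bread | milk).
rule_support = support(X | Y)                  # 2/3 ~ 0.67
rule_confidence = support(X | Y) / support(X)  # (2/3) / (3/3) ~ 0.67

print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```
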
9. Explain each of the following clustering algorithms in terms of the following
criteria: (i) shapes of clusters that can be determined; (ii) input parameters that
must be specified; and (iii) limitations. (a) k-means (b) k-medoids (c) CLARA
A. (a) K-Means:
(i) Assumes spherical clusters
(ii) Requires number of clusters (k)
(iii) Sensitive to outliers and initial seed

(b) K-Medoids:
(i) Also favors compact, roughly spherical clusters, but uses actual data points (medoids)
as cluster centers
(ii) Needs number of clusters (k)
(iii) More robust to noise and outliers than k-means, but computationally expensive on
large datasets

(c) CLARA (Clustering LARge Applications):
(i) Cluster shapes as in k-medoids, since it runs k-medoids on samples of the data
(ii) Requires number of clusters (k) and the sample size
(iii) May miss global patterns because medoids are chosen only from the samples
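For k-means, a hedged sketch with scikit-learn; the 2-D points and k = 2 are illustrative,
and only k-means is shown because k-medoids and CLARA are not part of scikit-learn's core
API:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two loose 2-D blobs; k-means must be told k up front.
points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # blob around (1, 1)
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],   # blob around (5, 5)
])

# n_init=10 reruns with several random seeds, easing sensitivity to the initial seed.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)

print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # centroids of the (roughly spherical) clusters
```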

10. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8), compute
(a) the Euclidean distance between the two objects; (b) the Manhattan distance between the
two objects; (c) the Minkowski distance between the two objects, using p = 3.
A. Let A = (22, 1, 42, 10), B = (20, 0, 36, 8)

(a) Euclidean Distance = sqrt((22-20)^2 + (1-0)^2 + (42-36)^2 + (10-8)^2)
= sqrt(4 + 1 + 36 + 4) = sqrt(45) ≈ 6.71

(b) Manhattan Distance = |22-20| + |1-0| + |42-36| + |10-8| = 2 + 1 + 6 + 2 = 11

(c) Minkowski (p=3) = [|2|^3 + |1|^3 + |6|^3 + |2|^3]^(1/3)
= (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) ≈ 6.15
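These results can be verified in a few lines of NumPy:

```python
import numpy as np

A = np.array([22, 1, 42, 10])
B = np.array([20, 0, 36, 8])

diff = np.abs(A - B)  # [2, 1, 6, 2]

print(np.sqrt(np.sum(diff ** 2)))    # Euclidean      ~ 6.71
print(np.sum(diff))                  # Manhattan      = 11
print(np.sum(diff ** 3) ** (1 / 3))  # Minkowski p=3  ~ 6.15
```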
