0400CST466082401

Scheme of Valuation/Answer Key


(Scheme of evaluation (marks in brackets) and answers of problems/key)
APJ ABDUL KALAM TECHNOLOGICAL UNIVERSITY
EIGHTH SEMESTER B.TECH(S) DEGREE EXAMINATION, AUGUST 2024
Course Code: CST466
Course Name: DATA MINING
Max. Marks: 100 Duration: 3 Hours

PART A
Answer all questions, each carries 3 marks.

1 Define data mining. (1.5)


Any three applications (1.5)
2 Difference between OLTP and OLAP (3)
3 Need for data preprocessing. Explain any three points. (3)
4 Need of data normalization (1)
Min-max normalization (new min = 0, new max = 1): 300 → 0.25 (1)
Z-score normalization: 300 → -0.35 (1)
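A minimal Python sketch of both normalizations. The question's original data (its min, max, mean and standard deviation) is not reproduced in this key, so the parameter values below are hypothetical choices that happen to reproduce the stated results.

```python
def min_max_normalize(v, old_min, old_max, new_min=0.0, new_max=1.0):
    # v' = (v - min) / (max - min) * (new_max - new_min) + new_min
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score_normalize(v, mean, std):
    # v' = (v - mean) / std
    return (v - mean) / std

# Hypothetical parameters chosen only to match the key's figures:
print(min_max_normalize(300, old_min=200, old_max=600))   # 0.25
print(z_score_normalize(300, mean=370, std=200))          # -0.35
```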
5 Clustering (an unsupervised learning technique) is the most suitable choice: it can (3)
organize a large collection of documents into groups based on their content without needing labelled examples.
6 For a spam email classifier, precision is more important than recall because the (3)
harm of misclassifying legitimate emails as spam (false positives) is greater
than the harm of letting some spam through (false negatives). High precision
ensures that users' important emails are not lost to the spam folder and helps
maintain trust in the spam filtering system.
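For reference, a short sketch of the standard definitions behind this argument (the counts below are hypothetical):

```python
def precision(tp, fp):
    # Of the emails flagged as spam, how many really were spam?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all spam emails, how many were caught?
    return tp / (tp + fn)

# A filter that flags 95 true spam and 5 legitimate mails, missing 20 spam:
print(precision(95, 5))    # 0.95
print(recall(95, 20))      # ~0.826
```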
7 Compare partition algorithm with Apriori algorithm. (3)
8 Support ({bread, peanut-butter}) = 3/5 (3)
Support ({beer, bread}) =1/5
Confidence ({bread, peanut-butter}) = 3/4
Confidence ({beer, bread}) =1/2
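A short Python sketch of how these values are obtained. The question's five transactions are not reproduced in this key; the list below is a hypothetical stand-in that yields the same figures, with the rule directions assumed to be bread ⇒ peanut-butter and beer ⇒ bread.

```python
transactions = [                      # hypothetical 5-transaction database
    {"bread", "peanut-butter", "milk"},
    {"bread", "peanut-butter"},
    {"bread", "peanut-butter", "beer"},
    {"bread", "jelly"},
    {"beer", "milk"},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "peanut-butter"}))        # 3/5 = 0.6
print(support({"beer", "bread"}))                 # 1/5 = 0.2
print(confidence({"bread"}, {"peanut-butter"}))   # 3/4 = 0.75
print(confidence({"beer"}, {"bread"}))            # 1/2 = 0.5
```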
9 Focused crawling (1.5)
Regular crawling (1.5)


10 TF-IDF is preferred over raw term frequency counts for document similarity (3)
analysis because it reduces the influence of common words and highlights
distinctive terms

TF-IDF(d, t) = TF(d, t) × IDF(t)
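A minimal sketch of this formula with the standard idf(t) = log(N / df(t)); the document collection below is purely illustrative.

```python
import math

docs = [
    ["data", "mining", "extracts", "patterns"],
    ["text", "mining", "analyses", "documents"],
    ["web", "mining", "analyses", "web", "data"],
]

def tf(doc, term):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(term in doc for doc in docs)     # number of documents containing the term
    return math.log(len(docs) / df)

def tf_idf(doc, term):
    return tf(doc, term) * idf(term)

# "mining" occurs in every document, so its idf (and tf-idf) is 0 -- exactly the
# down-weighting of common words the answer refers to.
print(tf_idf(docs[0], "mining"), tf_idf(docs[0], "patterns"))
```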

PART B
Answer one full question from each module, each carries 14 marks.
Module I
11 a) Three tier architecture - diagram (4)
Explanation (3)

b) Draw star schema diagram for the data warehouse. (5)


Write Slice operation for viewing all patient visits for a specific doctor. (2)
OR
12 a) Draw star schema by showing fact table and dimension table. (5)
Write Roll Up operation for summarizing daily to monthly sales data. (2)
b) Explanation of various stages – selection, preprocessing, transformation, data mining, interpretation and evaluation (4)
Diagram (3)
Module II
13 a) Missing data handling techniques (6)
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging to the same class
6. Use the most probable value to fill in the missing value
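As an illustration of a few of these strategies, the corresponding pandas one-liners are sketched below (the column and class names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"income": [40, None, 55, None, 70],
                   "cls":    ["A", "A", "B", "B", "B"]})

# Each call returns a new object; assign it back to keep the result.
df.dropna()                                               # 1. ignore the tuple
df["income"].fillna(0)                                    # 3. fill with a global constant
df["income"].fillna(df["income"].mean())                  # 4. fill with the attribute mean
df["income"].fillna(df.groupby("cls")["income"].transform("mean"))   # 5. class-wise mean
```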
b) Partition into equal-frequency (equi-depth) bins:


- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25 (2)
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9 (3)
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15 (3)
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
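A short sketch that reproduces these three results from the sorted data implied by the bins (4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34):

```python
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [data[i:i + 4] for i in range(0, len(data), 4)]    # equal-frequency, depth 4

# smoothing by bin means: every value becomes the (rounded) mean of its bin
means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# smoothing by bin boundaries: every value moves to the nearer of the bin's two ends
boundaries = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(bins)         # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```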

OR
14 a) Stepwise forward selection with example (2)
Stepwise backward elimination with example (2)
Combination of forward selection and backward elimination with example (2)
b) PCA Explanation (4)
Example (4)
Module III
15 a) Gain of attribute A = 0.5769 (7)
Gain of attribute B = 1.0365
Root is B
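The training data for this part is not reproduced in the key, so the following is only a generic sketch of how entropy and information gain are computed for a candidate splitting attribute:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """labels: all class labels; groups: label lists produced by splitting on an attribute."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# The attribute with the larger gain (here B, since 1.0365 > 0.5769) is chosen as the root.
```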

b) Explain DBSCAN with core point, border point and noise point (4)
Write any three advantages of DBSCAN
Can find clusters of arbitrary shape. (3)
Does not require specifying the number of clusters in advance.
Can identify noise and outliers.
Needs only two parameters (eps and MinPts).
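A minimal usage sketch with scikit-learn's DBSCAN (the data below is illustrative, not from the question):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [1.1, 1.0],   # dense region -> one cluster
              [8.0, 8.1], [8.2, 7.9], [8.1, 8.0],   # second dense region
              [25.0, 0.0]])                          # isolated point -> noise

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # cluster ids per point; -1 marks points labelled as noise
```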
OR
16 a) Explain the use of dendrogram in hierarchical clustering. (3)
Single Linkage: Minimum distance between points in the two clusters.
Complete Linkage: Maximum distance between points in the two clusters. (3)
Average Linkage: Average distance between points in the two clusters.
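A brief SciPy sketch showing how the same data can be merged under each linkage criterion and displayed as a dendrogram (the points are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [2, 2], [8, 8], [9, 8], [5, 5]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)          # bottom-up pairwise merges
    plt.figure()
    dendrogram(Z)                          # tree showing the order and height of merges
    plt.title(f"{method} linkage")
plt.show()
```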


b) Step 1: Randomly select 2 medoids. (8)
Step 2: Calculate the distance matrix between each medoid and the non-medoid points
using Manhattan distance = |X1-X2| + |Y1-Y2|, and assign every point to its nearest medoid.
Step 3: Randomly select one non-medoid point, swap it with a medoid and recalculate the cost.
If Swap Cost = New Cost - Previous Cost > 0, the swap is rejected and the previous medoids are kept; otherwise the swap is accepted.
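The coordinates for this problem are not given in the key; the sketch below only illustrates the Manhattan cost computation and the swap test described in the steps above:

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    # each non-medoid point contributes the distance to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

def swap_is_accepted(points, medoids, old_medoid, candidate):
    new_medoids = [candidate if m == old_medoid else m for m in medoids]
    swap_cost = total_cost(points, new_medoids) - total_cost(points, medoids)
    return swap_cost < 0     # positive swap cost -> reject and keep the previous medoids
```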
Module IV
17 a) Frequent 3-itemsets: {I1, I2, I3} and {I1, I2, I5} (5)
Association Rules:
[I1^I2]=>[I3] (3)
[I1^I3]=>[I2]
[I2^I3]=>[I1]
b) Dynamic item set counting technique in association rule mining - Explanation (6)

OR
18 a) Frequent Itemsets (support ≥ 40%): {milk, bread}, {bread, eggs}, {bread, butter} (5)
Association Rules (confidence ≥ 60%):
- {milk} → {bread} (Confidence = 100%) (3)
- {bread} → {milk} (Confidence = 80%)
- {eggs} → {bread} (Confidence = 100%)
- {butter} → {bread} (Confidence = 100%)
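For completeness, the same computation sketched with mlxtend's Apriori implementation. The question's transaction table is not reproduced in this key, so `transactions` below is a hypothetical placeholder and the printed output is illustrative only.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [                       # hypothetical stand-in for the question's table
    ["milk", "bread", "eggs"],
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "bread", "butter"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)                # support >= 40%
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)   # confidence >= 60%
print(rules[["antecedents", "consequents", "support", "confidence"]])
```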

b) Pincer search algorithm (3)


Example (3)
Module V
19 a) Web usage mining (3)
Web structure mining (3)
Web content mining (3)
b) Trie and suffix tree - Explanation (5)
OR
20 a) Methods used for text mining (3)
Compare text mining with web mining (4)
b) HITS algorithm (4)
Example (3)
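A compact sketch of the HITS hub/authority iteration on a small, hypothetical link graph (adjacency[i][j] = 1 means page i links to page j):

```python
import numpy as np

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

hubs = np.ones(3)
auths = np.ones(3)
for _ in range(20):                      # iterate until the scores stabilise
    auths = A.T @ hubs                   # authority = sum of hub scores of pages linking to it
    hubs = A @ auths                     # hub = sum of authority scores of pages it links to
    auths /= np.linalg.norm(auths)       # normalise to keep the values bounded
    hubs /= np.linalg.norm(hubs)

print("authority:", auths.round(3), "hub:", hubs.round(3))
```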


****
