Data Mining: List of Important Questions

Uploaded by

Amrit Sapkota

Probable Questions Collection for Data Mining (Elective-I)

Chapter 1: Introduction [4 Marks]


1. What is Data Mining? Differentiate between descriptive and predictive Data Mining.
2. Explain the architecture of a typical data mining system.
3. What are the main requirements of data mining?
4. What are the key steps in Knowledge Discovery in Databases (KDD)? Explain.
5. Is KDD the same as Data Mining? Explain the phases of KDD with an example.
6. Explain Data Mining as a step in KDD.
7. How is Data Mining different from query tools?
8. Describe Data Warehouse and explain its characteristics.
9. What is metadata? Briefly explain the architecture of a Data Warehouse.
10. Explain different Data Warehouse models.
11. What are Virtual Warehouse, Data Mart and Enterprise Warehouse? Explain.
12. What do you mean by “schema”? Briefly describe different Data Warehouse schemas.
13. Compare Star, Snowflake and Fact Constellation schemas with examples.
14. Explain symmetric and asymmetric data with examples.

Chapter 2: Data Pre-processing [10 marks]
1. Why is data preprocessing required?
2. What do you mean by attribute? Explain Nominal, Ordinal, Interval and Ratio attribute types.
3. Differentiate discrete and continuous attributes with examples.
4. What are the types of data sets? What are spatial and temporal data? Give examples.
5. What do you mean by the dimension of data? Briefly explain the curse of dimensionality.
6. Explain why and how to avoid the curse of dimensionality.
7. Explain the techniques for dimensionality reduction.
8. Differentiate between feature selection and feature extraction.
9. What do you mean by skewed data? Describe positive and negative skewness. Why are real-life data skewed?
10. Explain different processes involved in data preprocessing.
11. What is data cleaning? Explain different approaches to handle missing data, noisy data and outliers.
12. What is data integration? What are the challenges of Data Integration? How can they be handled?
13. Explain data reduction with strategies.
14. What do you mean by Data Sampling? Explain the various ways of sampling data.
15. Briefly describe Data Discretization and how it can be achieved.
16. What is Online Analytical Processing (OLAP)? Explain the various OLAP operations with suitable examples.
17. Differentiate between OLAP and OLTP. Define data cube, OLAP operations, fact table and dimension table.
18. Differentiate between Data Warehouse and Database. (Note: Answer is same as that of Qn.14)
19. What are the approaches to measure similarity between data?
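For Question 19, a few commonly used proximity measures can be sketched as below. This is an illustration rather than a model answer: the function names and the sample vectors are made up for the example, and the right measure depends on the attribute types involved.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two numeric vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard(s, t):
    """Shared items divided by all items; assumes at least one set is non-empty."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

# Hypothetical sample data, not taken from the question paper.
x, y = (1, 2, 3), (4, 6, 3)
print("Euclidean:", euclidean(x, y))
print("Manhattan:", manhattan(x, y))
print("Cosine:   ", cosine_similarity(x, y))
print("Jaccard:  ", jaccard({"a", "b", "c"}, {"b", "c", "d"}))
```

Distances grow as objects become less alike, while the cosine and Jaccard values are similarities that grow as objects become more alike.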

Chapter 3: Classification [20 marks]
1. Define Classification and prediction with examples. Explain the different stages in Classification with a clear block diagram.
2. Differentiate supervised and unsupervised learning with suitable examples.
3. Describe Decision Tree Classifier with example. Explain Hunt’s Algorithm.
4. Describe Iterative Dichotomizer 3 (ID3) algorithm. What makes ID3 Algorithm “greedy”?
5. How does C4.5 overcome the problems of the ID3 algorithm? Compare Information Gain, Gain Ratio and Gini Index.
6. What do you mean by overfitting? What are the approaches to avoid it? Differentiate Pre-pruning and Post-pruning.
7. What are the advantages and disadvantages of Decision Tree Classifier?
8. Define rule-based classifier with an example. Describe the rule assessment measures: Coverage and Accuracy.
9. How can rules be extracted from a decision tree? Explain with an example.
10. Explain the Sequential Covering algorithm for rule induction. What are the characteristics of a rule-based classifier?
11. Describe the CN2 and RIPPER algorithms for rule growing. What are the measures for rule evaluation?
12. What are the advantages and disadvantages of a rule-based classifier?
13. What is a rote learner? Describe the K-Nearest Neighbor Classifier with an example. Explain the issues in choosing K.
14. What are the advantages and disadvantages of Nearest Neighbor Classifier?
15. Describe Naïve Bayes Classifier. Why is it “Naïve”? Explain Laplacian Correction for the zero-probability problem.
16. What are the advantages and disadvantages of Naïve Bayes Classifier?
17. Define Artificial Neural Network (ANN). Explain Back Propagation Algorithm for training an ANN.
18. How can you measure classifier accuracy by Holdout and Cross validation methods?
19. Explain the role of Receiver Operating Characteristics (ROC) Curve for classifier model selection.
20. Describe the Confusion Matrix with an example. Define Accuracy, Error rate, Sensitivity, Specificity, Precision, Recall, Positive Predictive Value, Negative Predictive Value, TPR, TNR, FPR and FNR of a classifier model.
21. Explain the inverse relation between Precision and Recall of classifier model.
22. Consider the following training data (Buys Computer data):

SN  age          income  student  credit_rating  class: buys_computer
1   young        high    no       fair           no
2   young        high    no       excellent      no
3   middle_aged  high    no       fair           yes
4   senior       medium  no       fair           yes
5   senior       low     yes      fair           yes
6   senior       low     yes      excellent      no
7   middle_aged  low     yes      excellent      yes
8   young        medium  no       fair           no
9   young        low     yes      fair           yes
10  senior       medium  yes      fair           yes
11  young        medium  yes      excellent      yes
12  middle_aged  medium  no       excellent      yes
13  middle_aged  high    yes      fair           yes
14  senior       medium  no       excellent      no

a. Determine the root node of the decision tree for the above data set using the ID3 algorithm.
b. Draw the complete decision tree using the ID3 algorithm and determine the class for X={age = young, income = low, student = yes and credit rating = fair}
c. Use Naïve Bayes Classifier to determine class for
i. X={age = young, income = low, student = yes and credit rating = fair}
ii. X={age = middle_aged, income = low and credit rating = excellent}
iii. X={age = senior and credit rating = excellent}
23. Consider the following data set (Play Golf data). Use the “ID3 algorithm” and the “Naïve Bayes Classifier” to predict if people will play golf on a
a. Hot, sunny day with high humidity and no wind.
b. Cool, rainy day with normal humidity and no wind.
c. Mild, overcast, windy day with normal humidity.

SN  Outlook   Temperature  Temperature  Humidity   Humidity   Windy  Class:
              (Numeric)    (Nominal)    (Numeric)  (Nominal)         Play
1   Overcast  83           Hot          86         High       False  Yes
2   Overcast  64           Cool         65         Normal     True   Yes
3   Overcast  72           Mild         90         High       True   Yes
4   Overcast  81           Hot          75         Normal     False  Yes
5   Rainy     70           Mild         96         High       False  Yes
6   Rainy     68           Cool         80         Normal     False  Yes
7   Rainy     65           Cool         70         Normal     True   No
8   Rainy     75           Mild         80         Normal     False  Yes
9   Rainy     71           Mild         91         High       True   No
10  Sunny     85           Hot          85         High       False  No
11  Sunny     80           Hot          90         High       True   No
12  Sunny     72           Mild         95         High       False  No
13  Sunny     69           Cool         70         Normal     False  Yes
14  Sunny     75           Mild         70         Normal     True   Yes

24. Given the following confusion matrix, determine the Accuracy, Error rate, Sensitivity, Specificity, Precision, Recall, Positive Predictive Value, Negative Predictive Value, TPR, TNR, FPR and FNR of the classifier model.

             Predicted +ve  Predicted -ve
Actual +ve   152            130
Actual -ve   88             630

25. Given the following confusion matrix, determine the Accuracy, Error rate, Sensitivity, Specificity, Precision, Recall, Positive Predictive Value, Negative Predictive Value, TPR, TNR, FPR and FNR of the classifier model.

                      Predicted cancer = yes  Predicted cancer = no
Actual cancer = yes   90                      210
Actual cancer = no    140                     9560
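
Hand-computed answers to Questions 24 and 25 can be cross-checked with a short script. The sketch below assumes that rows are actual classes and columns are predicted classes, as the matrices are labelled; the function name and dictionary keys are illustrative choices, not part of the question paper.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard measures derived from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),   # same as recall and TPR
        "specificity": tn / (tn + fp),   # same as TNR
        "precision": tp / (tp + fp),     # same as positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "fpr": fp / (fp + tn),           # 1 - specificity
        "fnr": fn / (fn + tp),           # 1 - sensitivity
    }

# Question 24: actual +ve row is (152, 130), actual -ve row is (88, 630).
q24 = confusion_metrics(tp=152, fn=130, fp=88, tn=630)
# Question 25: actual yes row is (90, 210), actual no row is (140, 9560).
q25 = confusion_metrics(tp=90, fn=210, fp=140, tn=9560)

for label, result in (("Q24", q24), ("Q25", q25)):
    print(label)
    for name, value in result.items():
        print(f"  {name}: {value:.4f}")
```

Sensitivity, Recall and TPR are one quantity, as are Precision and Positive Predictive Value, so the twelve measures named in the questions reduce to the eight values computed here.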

Chapter 4: Association Analysis [18 marks]


1. Explain association rule mining with an example. Define support, frequent item set and confidence with examples.
2. Describe the Apriori principle and explain the Apriori algorithm for frequent item set generation.
3. Discuss the advantages and disadvantages of the Apriori algorithm.
4. Define the FP-Growth algorithm for frequent item set generation. Explain how the FP-Growth approach overcomes the disadvantages of the Apriori algorithm.
5. Define the measure “Lift” with a suitable example. What does it signify?
6. What are FP-Tree and Conditional FP-Tree? Explain with example.
7. Discuss advantages and disadvantages of FP-Growth algorithm.
8. Describe the issues related to categorical data. Explain Sequential, Sub-graph and infrequent patterns.
9. Given min_sup = 33.34% and min_conf = 60%, use the Apriori algorithm on the following transaction data to determine the frequent item sets. Also, indicate the association rules generated, underline the strong ones and sort them by confidence.

Transaction ID  Item set
TID1            HotDogs, Buns, Ketchup
TID2            HotDogs, Buns
TID3            HotDogs, Coke, Chips
TID4            Coke, Chips
TID5            Chips, Ketchup
TID6            HotDogs, Coke, Chips

10. Use the data set in Qn. 9 with the same min_sup to build an FP-tree. Show how the tree evolves for each transaction. Then use the FP-Growth approach to discover the frequent item sets from this FP-tree.
11. Identify the candidate and large item sets of the following transaction table. Use the Apriori algorithm with minimum support 2. Also, indicate the association rules generated, underline the strong ones and sort them by confidence (min_conf = 60%).

Transaction ID  Items
t1              {A, C, D}
t2              {B, C, E}
t3              {A, B, C, E}
t4              {B, E}
t5              {A, B, C, E}

12. Use the data set in Qn. 11 with the same minimum support to build an FP-tree. Show how the tree evolves for each transaction. Then use the FP-Growth approach to discover the frequent item sets from this FP-tree.
13. Write a short note on improving the efficiency of the Apriori algorithm.
14. How can we handle categorical, sequential, graphical and data stream data using association mining?
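
The level-wise search asked for in the Apriori questions can be sketched as follows, using the transaction table of the question with items A to E and a minimum support count of 2. This is a plain illustrative implementation, not an optimised one; the function names `apriori` and `confidence` are choices made for this sketch.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frequent item set: support count} by level-wise search."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]): support(frozenset([i])) for i in items}
    level = {s: c for s, c in level.items() if c >= min_count}

    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        prev = list(level)
        # Join step: merge frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c: support(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= min_count}
    return frequent

def confidence(frequent, antecedent, consequent):
    """conf(X -> Y) = support(X union Y) / support(X)."""
    antecedent = frozenset(antecedent)
    return frequent[antecedent | frozenset(consequent)] / frequent[antecedent]

transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]
frequent = apriori(transactions, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
print("conf(A -> B) =", confidence(frequent, {"A"}, {"B"}))
```

For this table the sketch reports 15 frequent item sets, the largest being {A, B, C, E} with support count 2; item D is dropped at the first level. A rule is strong when its confidence reaches min_conf, so the `confidence` helper can be used to check each candidate rule against the 60% threshold.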

Chapter 5: Cluster Analysis [16 marks]
1. Define Clustering. Why do we need cluster analysis? Discuss the qualities of a good cluster.
2. Describe the major clustering approaches. Differentiate between hierarchical and partitioning clustering techniques.
3. Describe the K-means algorithm for clustering and discuss its strengths and weaknesses.
4. Compare the k-means, k-medoids and k-modes algorithms.
5. What is hierarchical clustering? Describe the AGNES and DIANA methods of the hierarchical clustering approach.
6. Describe DBSCAN clustering. What are the advantages of density-based clustering?
7. Describe external, internal and relative measures of clustering quality.
8. Write short notes on
a. Partitioning Clustering
b. Hierarchical Clustering
c. Evaluation of Clustering
9. Identify the clusters of the following instances using the K-Means clustering algorithm (take K = 2 and K = 3).
X = {2, 24, 21, 5, 6, 41, 35, 36, 9, 26, 44, 7, 46, 26, 11, 1, 32, 43, 48, 13}
10. Identify the clusters of the following instances using the K-Means clustering algorithm (take K = 2). [IOE 2063]

Instance  X    Y
1         1.0  1.5
2         2.5  5.5
3         1.5  1.0
4         2.0  3.0
5         2.5  3.5
6         4.0  6.2

11. Use the K-means algorithm with Euclidean distance to cluster the following 8 examples into 3 clusters:
A1=(2, 10), A2=(2, 5), A3=(8, 4), A4=(5, 8), A5=(7, 5), A6=(6, 4), A7=(1, 2), A8=(4, 9)
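
A hand-worked K-means answer for the eight points A1 to A8 can be checked with the sketch below. The question does not state the initial centroids, so the choice of A1, A4 and A7 as seeds is an assumption made here; other seeds may give a different partition. The function name `kmeans` is illustrative.

```python
import math

def kmeans(points, initial_centers, max_iter=100):
    """Basic K-means: assign to nearest centre, recompute means, repeat."""
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre (Euclidean distance).
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the algorithm has converged
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous centre
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

names = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"]
points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]

# Assumed seeds (not given in the question): A1, A4 and A7.
labels, centers = kmeans(points, [points[0], points[3], points[6]])

for j, center in enumerate(centers):
    members = [n for n, l in zip(names, labels) if l == j]
    print(f"Cluster {j + 1}: {members}, centre = {center}")
```

With these seeds the sketch settles on the clusters {A1, A4, A8}, {A3, A5, A6} and {A2, A7}. Printing the labels inside the loop shows the intermediate assignments that a written answer would tabulate iteration by iteration.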

Chapter 6: Anomaly/Fraud Detection [6 marks]
1. What do you mean by anomaly detection? Why is it important and where is it applicable?
2. What are the challenges in anomaly detection? Explain the different types of anomaly detection schemes.
3. List the drawbacks of the graphical approach and describe the statistical approaches for anomaly detection.
4. Write short notes on
a. Distance based approaches for anomaly detection
b. Likelihood approach for anomaly detection
c. Base rate fallacy

Chapter 7: Advanced Applications [6 marks]


1. Describe web mining along with its structure. What are the challenges in web mining?
2. Briefly explain the page ranking algorithm.
3. Write short notes on
a. WWW Mining
b. Time series data and regression analysis
c. Multimedia Mining
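
For the page ranking question, the iterative computation can be illustrated with the sketch below. It assumes the common damped formulation with damping factor 0.85 and a made-up three-page link graph; neither is specified in the question paper, and the function name `pagerank` is illustrative.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively spread each page's rank over its outgoing links.

    Assumes every page appears as a key in `links` and has at least one
    outgoing link (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a base share from the random-jump term.
        new_ranks = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = damping * ranks[page] / len(outlinks)
            for target in outlinks:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page, rank in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {rank:.4f}")
```

In this small graph page C ends up with the highest rank because it receives links from both A and B, which is the intuition a written answer would usually explain.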
