Data Mining-1

Data mining

Uploaded by

Aswin

DATA MINING

MODULE :1

MODULE I : Introduction to Data mining


Data Mining Introduction - Data - Data Mining Functionalities - Classification of
Data Mining Systems - Issues in Data Mining - Data Objects and Attribute Types,
Basic Statistical Descriptions of Data, Data Visualization, Measuring Data
Similarity and Dissimilarity - Data Preprocessing.

Part A

1. List and explain any two data mining functionalities.


2. Explain data mining with an example.
3. Give one example each of supervised and unsupervised learning, and explain
why each example belongs to that category.
4. List and explain any 4 common data mining tasks.
5. List any two types of data mining systems according to the data that can
be mined. (Ans: Database Data, Data warehouse data and transactional
data)
6. Explain descriptive and predictive data mining.
(Ans: Descriptive mining tasks characterize the general properties of the
data in the database. Predictive mining tasks perform inference on the
current data in order to make predictions.)
7. Explain the issue in data mining due to diversity of database types.
(Ans: Handling complex types of data; mining dynamic, networked and
global data repositories)
8. List the major dimensions in a multidimensional view of data mining.
(Ans: data, knowledge, technologies and applications)
9. Explain nominal and binary attributes with examples.
10. Explain the various ways to measure the central tendency of data.
11. Explain five-number summary with an example.
12. Illustrate box plot with an example.
13. Six observations on two variables are available as shown in the
following table. Plot the observations in a scatter diagram. How many
groups would you say there are? List their members.
Obs   a  b  c  d  e  f
X1    3  4  2  5  1  4
X2    2  1  5  2  6  2
14. With clustering as an example, explain the measures of object similarity
and dissimilarity.
15. Given two objects represented by tuples (22, 1, 42, 10) and (20, 0, 36,
8):
a. Compute the Euclidean distance between the two objects,
b. Compute the Manhattan distance between the two objects.
16. List any 4 methods for data cleaning.
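
The two distances asked for in question 15 can be checked with a few lines of Python. This is only a sketch: the function names `euclidean` and `manhattan` are illustrative choices, not part of the question bank.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute attribute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

p = (22, 1, 42, 10)
q = (20, 0, 36, 8)

print(round(euclidean(p, q), 3))  # sqrt(45), about 6.708
print(manhattan(p, q))            # 2 + 1 + 6 + 2 = 11
```

The attribute differences are 2, 1, 6 and 2, so the Euclidean distance is the square root of 4 + 1 + 36 + 4 = 45 and the Manhattan distance is 11.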

Part B
17. Explain the concept of data mining and its significance in modern
society. [4]
Discuss how data mining techniques are used to extract valuable
insights from large datasets, and provide examples of real-world
applications. [4]
18. Illustrate the steps involved in the KDD process.
19. Discuss in detail the various functionalities of data mining, with
examples, including classification, clustering, association rule mining,
and anomaly detection. [8]
20. Explain any four functionalities of data mining.
21. Explain classification and clustering with the help of an example and
appropriate diagram.
22. Explain the three challenges to data mining regarding data mining
methodology and user interaction issues.
23. Differentiate between
a. Discrimination and classification.
b. Characterization and clustering.
c. Classification and prediction.
d. For each of these pairs of tasks, explain how they are similar.
24. Explain any four major issues in data mining.
25. Explain attributes of a data object and its different types.
26. Suppose we have the following values for salary (in thousands), shown
in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. Give
the equations and calculate mean, median and mode.
(Ans: Mean = 58 or 58000, Median = 54 or 54000, Modes = 52 and 70, i.e. 52000 and 70000; the data is bimodal)

27. Explain the following for measuring the dispersion or spread of numeric
data with an example.
a. Range
b. Quantiles
c. Quartiles
d. Interquartile range.
28. Suppose that the data for analysis includes the attribute age. The age
values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20,
20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46,
52, 70.
a. What is the mean of the data?
b. What is the median?
c. What is the mode of the data?
d. Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
e. Can you find (roughly) the first quartile (Q1) and the third quartile
(Q3) of the data?
f. Give the five-number summary of the data.
g. Show a boxplot of the data.

29. Suppose a hospital tested the age and body fat data for 18 randomly
selected adults with the following result:

a. Plot an equal-width histogram of width 10 and explain the
correlation.
b. Plot a scatter plot and explain the correlation between age and %fat.

30. Explain any 4 methods to graphically visualize data with example.


31. Explain why data preprocessing is required. [2]
In real world tuples with missing values for some attributes are a
common occurrence. Explain the various methods to handle this
problem. [6]
32. Illustrate the major tasks in data preprocessing.
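
The descriptive statistics asked for in questions 26 and 28 can be checked with Python's standard `statistics` module (Python 3.8 or later). This is a minimal sketch: the helper name `describe` is hypothetical, and the quartiles use the module's default "exclusive" interpolation, which is only one of several quartile conventions, so a hand calculation based on a different rule may differ slightly.

```python
import statistics

def describe(values):
    """Return mean, median, modes and the five-number summary of a list."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4)  # default method: "exclusive"
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "modes": statistics.multimode(data),       # every most-frequent value
        "five_number": (data[0], q1, q2, q3, data[-1]),
    }

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
       33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

for name, values in (("salary", salary), ("age", age)):
    print(name, describe(values))
```

In the salary data both 52 and 70 occur twice, and in the age data both 25 and 35 occur four times, so `multimode` reports two modes for each data set.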
MODULE-2 (Questions)

MODULE II : Data Warehouse & OLAP


Introduction to Data Warehouse & OLAP - Data Warehousing - Multidimensional
data models - data warehouse architectures - Implementation - Data Warehousing
to Data mining- Data Cube Computation Methods - Data mining query languages
- Architectures of data mining systems.

Part A

33. Explain the differences between an operational database and a data
warehouse.
34. Explain why we need a separate data warehouse even when we have
large operational databases.
35. List and explain the four features of a data warehouse.
36. Explain any four ways in which organizations use information from
data warehouses.
37. Differentiate between fact table and dimension table.
38. Draw the concept hierarchy of dimension location, described by the
attributes number, street, city, province-or-state, zip-code and country.
39. Differentiate between the apex cube and base cube.
40. Draw the lattice of cuboids, making up a 4-D data cube for time, item,
location and supplier.
41. Explain concept hierarchy with an example for a particular dimension.
42. Differentiate between full materialization and partial materialization of
data cube.
43. Explain any two choices for data cube materialization.
44. Explain any two data warehouse applications.
45. A data cube, C, has n dimensions, and each dimension has exactly p
distinct values in the base cuboid. Assume that there are no concept
hierarchies associated with the dimensions.
a. What is the maximum number of cells possible in the base cuboid?
b. What is the minimum number of cells possible in the base cuboid?
46. Explain base cell and aggregate cell with an example.
47. Explain ancestor and descendant cells with an example.
48. Explain lattice of cuboids with a diagram.
49. List any two efficient data cube computation methods.

Part B
50. Explain with examples the different OLAP services.
51. Explain with example any two data warehouse models from an architecture
point of view. (Ans: Enterprise warehouse, Data mart, virtual warehouse)
52. Illustrate the 3-tier data warehouse architecture.
53. Data warehouses often adopt a three tier architecture. Explain it with the
help of a diagram.
54. Explain any two schemas with examples for multidimensional
databases.
55. Draw the different schemas for a data warehouse that consists of the
dimensions subscriber, phone, time and location, and the measure call
details.
56. Illustrate with examples the different OLAP operations in
multidimensional data models.
57. Suppose that a data warehouse consists of the three dimensions time,
doctor, and patient, and the two measures count and charge, where
charge is the fee that a doctor charges a patient for a visit.
a. Enumerate three classes of schemas that are popularly used for
modeling data warehouses.
b. Draw a schema diagram for the above data warehouse using star
and snowflake schema.
c. Starting with the base cuboid [day, doctor, patient], what specific
OLAP operations should be performed in order to list the total fee
collected by each doctor in 2023?
58. Explain the process of a data warehouse design. [4 steps -2 marks each]
59. Explain the different views to be considered in a data warehouse design. [2
marks each]
60. A data cube, C, has n dimensions, and each dimension has exactly p
distinct values in the base cuboid. Assume that there are no concept
hierarchies associated with the dimensions.
a. What is the maximum number of cells possible in the base cuboid?
b. What is the minimum number of cells possible in the base cuboid?
c. What is the maximum number of cells possible (including both base
cells and aggregate cells) in the data cube, C? (Ex: 3.13)
d. What is the minimum number of cells possible in the data cube, C?
61. Design a data warehouse for a regional weather bureau. The weather
bureau has about 1,000 probes, which are scattered throughout various
land and ocean locations in the region to collect basic weather data,
including air pressure, temperature, and precipitation at each hour. All data
is sent to the central station, which has collected such data for over 10
years. Your design should facilitate efficient querying and on-line
analytical processing, and derive general weather patterns in
multidimensional space. (use star Schema) (Ex: 3.7)
62. Explain any one data cube computation method with an example.
63. Illustrate multiway aggregation for full cube computation.
64. Explain with an example BUC construction of an iceberg cube.
65. Explain with an example star-cubing construction of an iceberg cube.
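
The cell counts asked for in questions 45 and 60 can be explored by brute force for small n and p. The sketch below is an illustration under stated assumptions, not a textbook solution: the function name `cube_cells` and the two base-cell layouts (every combination of values, and a "diagonal" layout in which tuple i uses value i in every dimension, the usual candidate for the sparsest case) are choices made here. It generates every cell of every cuboid from a set of base cells, writing `*` for an aggregated dimension, and counts the distinct cells.

```python
from itertools import product

def cube_cells(base_cells):
    """Return the set of all distinct cells (base and aggregate) in the cube."""
    n = len(next(iter(base_cells)))
    cells = set()
    for cell in base_cells:
        # Each mask chooses which dimensions are rolled up to '*'.
        for mask in product((False, True), repeat=n):
            cells.add(tuple('*' if m else v for m, v in zip(mask, cell)))
    return cells

n, p = 3, 2
dense = set(product(range(p), repeat=n))         # all p**n combinations
diagonal = {tuple([i] * n) for i in range(p)}    # only p base cells

print(len(dense), len(cube_cells(dense)))        # p**n and (p + 1)**n
print(len(diagonal), len(cube_cells(diagonal)))  # p and (2**n - 1) * p + 1
```

With every combination present, each dimension of a cell can take p values or `*`, which gives (p + 1)^n cells. In the diagonal layout every cuboid except the apex holds p cells and the apex holds one.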

MODULE III : Association Rule Mining


Mining Frequent Patterns, Associations and Correlations - Mining Methods -
Mining Various Kinds of Association Rules - Correlation Analysis - Pattern
Mining in Multilevel, Multidimensional Space - Constraint-Based Frequent
Pattern Mining - Semantic Annotation of Frequent Patterns

Part A

66. Explain, with an example, the support of an itemset in association rule
mining.
67. Discuss the importance of the confidence measure in association rule
mining.
68. Explain the difference between support and confidence in association
rule mining.
69. Explain frequent patterns in the context of data mining and association rule
mining.

70. Briefly describe the process of generating association rules from
frequent itemsets.
71. Explain how correlation analysis differs from association rule mining.
72. Explain the concept of correlation analysis in data mining.
73. Describe the role of lift in association rule mining and its significance.

74. List the different ways in which pattern mining can be classified with
respect to data and applications involved.
75. List the different ways in which pattern mining can be classified based
on pattern diversity.
76. Differentiate between pattern mining in multilevel association and
multidimensional association.
77. Explain what type of mining is used to study the sales of diamond
watches in jewelry sales data. (Ans: Rare Pattern)
78. List any four constraints that can be used in constraint based mining.
79. In constraint based mining, differentiate between data constraint and
dimension/level constraints.
80. In constraint based mining, differentiate between knowledge type
constraint and interestingness constraints.
81. Explain the use of semantic annotation in frequent patterns.

Part B
83. Explain with an example the Apriori algorithm used for frequent itemset
mining.
84. Using the Apriori algorithm, find the frequent itemsets for the following
transactional data. (Min-Sup = 2)

85. A database has five transactions. Let min_sup = 60% and min_conf = 80%.
Find all frequent itemsets using Apriori algorithm.
86. Explain the Apriori principle and its significance in frequent pattern
mining with an example.

87. Suppose the following transaction data contain the frequent itemset X =
{I1, I2, I5}.
What are the association rules that can be generated from X, if the
minimum confidence threshold is 70%?

88. A database has five transactions. Let min_sup = 60% and min_conf = 80%.
a. Find all frequent itemsets using the FP-growth algorithm.

89. Briefly explain how the FP-Growth algorithm works for mining frequent
patterns with an example.
90. Using Vertical data format, find the frequent itemset for the following
transactional data. (Min-Sup =2)
91. Explain the different ways in which pattern mining can be classified
a. based on pattern diversity,
b. with respect to data and applications involved.
92. Explain with an example pattern mining in multilevel associations.
93. Explain with an example pattern mining in multidimensional
associations.
94. Briefly explain the different types of pattern mining in multilevel and
multidimensional space.

95. Explain with an example constraint based pattern generation.


96. Explain with an example how to prune pattern space with four pattern
mining constraints.
97. Explain with an example pruning of data space with data pruning
constraints.
98. Explain pruning pattern space and pruning data space in constraint
based pattern generation.

MODULE IV : Classification and Prediction


Basic Concepts - Decision Tree Induction - Bayesian Classification - Rule Based
Classification - Classification by Backpropagation - Support Vector Machines -
Associative Classification-Lazy Learners - Other Classification Methods -
Prediction.

Part A
1. Compare Classification and prediction.
2. List the major steps of decision tree classification.
3. Explain the concept of entropy in the context of decision tree induction and
how it is used to determine attribute selection.
4. Discuss the key components of the Naive Bayes classifier and how it makes
classification decisions based on Bayesian probability theory.
5. Explain rule-based classification.
6. Explain the concept of rule pruning in rule-based classification and its
impact on the performance of the classifier.
7. Explain the role of backpropagation algorithm in training artificial neural
networks for classification tasks.
8. Explain any two activation functions.
9. Explain the basic principle of SVM and how it constructs a hyperplane to
separate different classes in the feature space.
10. Explain the concept of the kernel trick in SVM and its use.
11. Differentiate between association rule mining and associative classification.
12. List the different steps in associative classification.
13. Compare lazy learners in machine learning to eager learners.
14. Explain, with an example, a lazy learner in machine learning.
15. Explain the genetic algorithm with an example.
16. Explain the fuzzy set approach with an example.

Part B
17. Distinguish between classification and prediction with examples.
18. Briefly outline the major steps of decision tree classification.
19. a) Explain tree pruning with an example. [3]
b) Explain why tree pruning is useful in decision tree induction. [3]
c) What is a drawback of using a separate set of tuples to evaluate pruning?
[2]
20. Discuss the concept of entropy in decision tree induction. [3]
Explain how entropy is calculated and how it influences the decision tree
construction process. [3]
Provide examples to illustrate how entropy is used to determine the best
attribute for splitting at each node. [2]
21. You're working on a spam email detection system using a Naive Bayes
classifier.
a) Describe how you would preprocess the email data, extract relevant
features, and apply the Naive Bayes algorithm for classification. [4]
b) Discuss the assumptions underlying the Naive Bayes model and how they
may affect the classifier's performance. [2]
c) Evaluate the effectiveness of your classifier using appropriate metrics and
propose strategies for improving its accuracy. [2]
22. Explain the Bayesian classification framework and its underlying principles.
[4]
Discuss the assumptions of the Naive Bayes classifier and how they affect
its performance. [4]
23. Compare and contrast the strengths and weaknesses of Naive Bayes with
other classification algorithms. [4]
Illustrate your explanations with examples to demonstrate the application of
Bayesian classification in real-world scenarios. [4]
24. Explain rule-based classification and discuss how association rules can be
transformed into classification rules with an example.
25. Imagine you're building a rule-based system to diagnose medical conditions
based on patient symptoms.
Explain the process of rule induction from data and how you would generate
interpretable rules for medical diagnosis. [4]
Discuss the trade-offs between accuracy and interpretability in rule-based
classification systems and propose strategies for optimizing both. [4]
26. Illustrate the architecture of your neural network, including the number of
layers, activation functions, and learning parameters. [4]
Explain the backpropagation algorithm and how it is used to train the neural
network. [4]
27. Explain the backpropagation algorithm and its role in training artificial
neural networks for classification tasks.
Discuss the significance of activation functions in backpropagation-based
classification and provide examples of commonly used activation functions.
28. You're developing a neural network model for predicting housing prices
based on various features such as location, size, and amenities.
Describe the architecture of your neural network, including the number of
layers, activation functions, and learning parameters. [4]
Explain the backpropagation algorithm and how it is used to train the neural
network. [4]
29. Explain support vector machines and how they construct decision
boundaries to separate classes in the feature space. [4]
Discuss the concept of margin maximization and its importance in SVM's
generalization ability. [4]
30. Explain the kernel in SVM and its significance in handling non-linearly
separable data. [4]
Compare SVM with other classification methods in terms of their
effectiveness and computational complexity. [4]
31. Explain CMAR with an example. Compare CMAR with the FP-growth
algorithm.
32. Illustrate the general framework for discriminative frequent pattern based
classification.
33. Explain the k-nearest neighbors (KNN) algorithm as an example of a lazy
learner [4] and discuss its strengths and weaknesses. [4]
34. Imagine you're using the k-nearest neighbors (KNN) algorithm for image
classification. Explain the process of feature extraction from images and the
role of distance metrics in KNN classification. Discuss the trade-offs
between lazy and eager learners in terms of computational efficiency and
model interpretability.
35. Compare and contrast the genetic algorithm, rough set and fuzzy set
approaches.
36. Illustrate the fuzzy set approach with an example.

MODULE V : Clustering
Cluster Analysis - Types of Data - Categorization of Major Clustering Methods -
K-means Partitioning Methods - Hierarchical Methods - Density-Based Methods -
Grid-Based Methods - Model-Based Clustering Methods - Clustering
High-Dimensional Data - Constraint-Based Cluster Analysis - Outlier Analysis -
Data Mining Applications.

Part A
1. Explain the purpose of cluster analysis and how it differs from
classification.
2. Differentiate between numerical and categorical data types.
3. Clustering is recognised as an important data mining technique with
broad applications. Give one example for each of the following cases:
a. An application that uses clustering as a major data mining
technique.
b. An application that uses clustering for preprocessing.
4. List the four major categories of clustering methods.
5. Explain the basic idea behind the K-means algorithm.
6. Describe how clusters are represented in the K-means algorithm using
centroids.
7. Explain the concept of hierarchical clustering and how it differs from
partitioning methods like K-means.
8. Explain how to interpret a dendrogram produced by hierarchical
clustering and how it represents the clustering hierarchy.
9. Provide an example of a density-based clustering algorithm and its
application domain.
10. Discuss the advantages of grid-based methods in handling large
datasets.
11. Explain the fundamental concept behind the Density-Based Spatial
Clustering of Applications with Noise (DBSCAN) algorithm.
12. List any two model based clustering methods.
(Ans: Fuzzy clustering and probabilistic model-based clustering)
13. Define the concept of outliers in data analysis and their significance.
14. Explain any two challenges of outlier detection.
15. Explain any 2 applications of data mining.
Part B
17. Discuss the advantages and disadvantages of hierarchical clustering
compared to partitioning methods.
18. Explain the steps you would take to perform cluster analysis on the
dataset of customer data for a retail company to identify distinct
customer segments. [3]
Discuss the choice of distance measures, clustering algorithms, and
evaluation metrics suitable for this task. [3]
Provide insights into how the resulting clusters can be interpreted and
utilized to enhance marketing strategies and customer engagement. [2]
19. Explain the different requirements for clustering as a data mining tool.
20. Tabulate the general characteristics of the major categories of clustering
methods.
21. Explain with an example how the K-means algorithm initializes cluster
centroids and iteratively updates them.
22. Illustrate how the agglomerative hierarchical clustering algorithm
constructs a dendrogram to represent the clustering hierarchy. [3]
Discuss the different linkage criteria used in hierarchical clustering and
their effects on the resulting dendrogram structure. [3]
Analyze the advantages and limitations of hierarchical clustering
compared to partitioning methods. [2]
23. Explain each of the following clustering algorithms in terms of (1)
shapes of clusters that can be determined (2) input parameters that must
be specified and (3) limitations.
a. k-means
b. K-medoids
24. Explain BIRCH and Chameleon with examples.
25. Explain the concept of density-based clustering and its approach to
identifying clusters based on regions of high density separated by
regions of low density. [4]
Explain the DBSCAN (Density-Based Spatial Clustering of
Applications with Noise) algorithm and its parameters, including
epsilon (ε) and minimum points (MinPts). [4]

26. Consider a dataset containing information about traffic flow in a city,
including vehicle speeds and traffic congestion levels.
a. Apply grid-based clustering to the dataset to identify spatial
clusters of high traffic density. [4]
b. Describe the grid partitioning process and how you would analyze
the resulting clusters to identify traffic hotspots and congestion
patterns. [4]

27. Identify and explain the following.
a. The cluster analysis method used to overcome the difficulty in
using one set of global parameters in clustering analysis. [4]
b. The clustering method based on a set of density distribution
functions. [4]
28. Apply the CLIQUE algorithm to an embedded data space containing
three dimensions: age, salary, vacation.
a. Explain how CLIQUE partitions the d-dimensional data space.
[4]
b. Explain how CLIQUE uses dense cells to assemble clusters. [4]
29. List and explain the different types of outliers.
(Ans: List - 2 marks, Global, Contextual and Collective - 2 marks each)
30. Explain in detail with an example.
a. Fuzzy clustering and
b. Probabilistic model-based clustering.
31. Illustrate how intrusion can be detected in TCP connections using
clustering based outlier detection. [Explanation -3 marks, Diagram -3
marks, Example diagram -2]
32. Explain how supervised and unsupervised methods can be used for outlier
detection with examples. [2+2, 2+2]
