Classification

Uploaded by

titleofyour123

CONTINUOUS ASSESSMENT TEST – 2

Regulations R 2019 - V21


Department of Computer Science and Engineering
Third Year / Fifth Semester
191CSC503T/Data Mining

CO1: Apply suitable preprocessing and visualization techniques on data.
CO2: Formulate association rules by mining frequent patterns.
CO3: Categorize the data using classification algorithms.
CO4: Organize the data using clustering methods.
CO5: Apply WEKA tool to provide solutions for real world problems.

Unit – III CLASSIFICATION

Q.No Questions CO’s Bloom's level

Part - A

1 Define Classification. CO3 K1

2 List the applications of classification and prediction. CO3 K2

3 Define Support vector machine. CO3 K1

4 Define back propagation. CO3 K1

5 What are K-nearest neighbor classifiers? CO3 K1

6 Differentiate lazy learners and Eager learners. CO3 K2

7 Illustrate support vector machines with examples. CO3 K2

8 How would you show your understanding about rule-based classification? CO3 K2

9 Discuss why pruning is needed in the decision tree. CO3 K2

10 Define Lazy learners with an example. CO3 K2

11 What are eager learners? CO3 K1


Q.No Questions CO’s Bloom's level

PART-B

1 Illustrate in detail about the Bayesian Classification methods CO3 K3
with an example.

2 Discuss about constraint based association rule mining with CO3 K3
example

3 Outline the working principle of the support vector machine CO3 K4
with a neat sketch.

4 Illustrate in detail about the Backpropagation classification CO3 K3
methods with an example.

5 Elucidate the different techniques used to improve the CO3 K4
classification accuracy

Q.No Questions CO’s Bloom's level

PART-C

1 Evaluate the following dataset using Naive Bayes CO3 K5
classification algorithm.

Sl. No.  Color  Legs  Height  Smelly  Species
1        White  3     Short   Yes     M
2        Green  2     Tall    No      M
3        Green  3     Short   Yes     M
4        White  3     Short   Yes     M
5        Green  2     Short   No      H
6        White  2     Tall    No      H
7        White  2     Tall    No      H
8        White  2     Short   Yes     H

2 Justify your answer: For a university dataset assume the CO3 K5
necessary features required for model evaluation and
selection.

UNIT IV : CLUSTERING TECHNIQUES


Q.No Questions CO’s Bloom's level

PART - A

1 What is cluster analysis? CO4 K1

2 Define Clustering? CO4 K1

3 How is the quality of a cluster represented? CO4 K2

4 Define K-means partitioning CO4 K1

5 List the major clustering methods. CO4 K2

6 Define outlier. How will you determine outliers in the data? CO4 K1

7 Discuss the challenges of outlier detection. CO4 K2

8 Explain the typical phases of outlier detection methods. CO4 K2

9 Distinguish between Classification and clustering. CO4 K2

10 Give the methods of clustering high dimensional data. CO4 K2

11 How is the goodness of clusters measured? CO4 K2

12 Classify hierarchical clustering methods CO4 K2

13 Define grid-based method in clustering. CO4 K1

14 What are the applications of cluster analysis? CO4 K1

15 What is the concept of partitioning methods? CO4 K1

Q.No Questions CO’s Bloom's level

PART-B

1 Consider that the data mining task is to cluster the following CO4 K3
eight points A1, A2, A3, B1, B2, B3, C1 and C2 (with (X,Y)
representing location) into three clusters: A1(2,10), A2(2,5),
A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).
The distance function is Euclidean distance. Suppose initially
we assign A1, B1 and C1 as the center of each cluster,
respectively. Use the K-means algorithm to show the three
cluster centers after the first round of execution and the final
three clusters.

2 Use K-medoid algorithm to determine clusters for the following CO4 K3
with k=2

Point  X  Y
P1     2  6
P2     3  4
P3     3  8
P4     4  7
P5     6  2
P6     6  4
P7     7  3
P8     7  4
P9     8  5
P10    7  6

3 Outline the steps involved in the DBSCAN algorithm. CO4 K4
Determine the core, border, noise points from following data
using DBSCAN. minpts=4 and eps=1.9

Point  X  Y
P1     2  10
P2     2  5
P3     8  4
P4     5  8
P5     7  5
P6     6  4
P7     1  2
P8     4  9

4 Discuss about the requirements of Clustering in data mining. CO4 K3

5 Let us consider four points (X1,X2,X3,X4) with the following CO4 K3
co-ordinates as two-dimensional samples for clustering:
X1=(1,0), X2=(0,1), X3=(2,1), X4=(3,3)
a) Apply one iteration of the K-means partition clustering
algorithm.
b) What is the change in the total square error?
c) Apply the second iteration of the K-means algorithm.
Clusters: C1=(X1,X3) C2=(X2,X4)

6 Analyze the different clustering techniques used in data mining. CO4 K4

7 Give an insight of various outlier detection methods used in data CO4 K3
mining.

8 Analyze the various constraints while clustering high CO4 K4
dimensional data.

Q.No Questions CO’s Bloom's level

PART-C

1 Cluster the following eight points (with (x, y) representing CO4 K5
locations) into three clusters: (1, 2), (2, 5), (2, 10), (4, 9),
(5, 8), (6, 4), (7, 5), (8, 4)
Initial cluster centers are: (8, 4), (5, 8), (1, 2)
Use K-Means Algorithm to find the three cluster centers till
the second iteration.

2 Outline the steps involved in the DBSCAN algorithm. CO4 K5
Determine the core, border, noise points from following data
using DBSCAN. minpts=4 and eps=1.9

POINTS  X  Y
P1      3  7
P2      4  6
P3      5  5
P4      6  4
P5      7  3
P6      6  2
P7      7  2
P8      8  4
P9      3  3
P10     2  6
P11     3  5
P12     2  4

UNIT V : WEKA TOOL


Q.No Questions CO’s Bloom's level

PART A

1 Why is data preprocessing needed? Name any four preprocessing CO5 K2
filters used in the WEKA tool.

2 What are the foundations of data mining? CO5 K1

3 Name some specific application oriented databases. CO5 K2


4 Explain how data mining is used in health care analysis. CO5 K1

5 Explain data mining applications for bio medical and DNA data CO5 K1
analysis.

6 Differentiate between data mining and data warehousing. CO5 K2

7 What are the applications of data mining? CO5 K1

8 List out the various data mining tools. CO5 K2

9 What is a dataset? Give an example. CO5 K1

10 What is association-rule learner? CO5 K1

11 Draw the layout of the Weka tool. CO5 K1

12 List out the limitations of the Weka tool. CO5 K2

13 Write down the functionalities of the Weka tool. CO5 K1

14 What is auto import? Give an example. CO5 K1

15 List out various data warehouse tools. CO5 K2

Q.No Questions CO’s Bloom's level

PART B

1 Discuss in detail about the WEKA tool and its functionalities. CO5 K3

2 Outline the features involved in the Iris plant database in detail. CO5 K4

3 Outline the features involved in the breast cancer database in CO5 K3
detail.

4 Give a detailed note on Association rule learner. CO5 K3

5 Evaluate the performance measures of the different CO5 K4
classification algorithm for Iris plant dataset using WEKA tool

6 Evaluate the performance measures of the different clustering CO5 K4
algorithm for breast cancer dataset using WEKA tool

7 Evaluate the performance measures of the different clustering CO5 K4
algorithm for Iris plant dataset using WEKA tool

8 Evaluate the performance measures of the different CO5 K4
classification algorithm for breast cancer dataset using WEKA
tool

Q.No Questions CO’s Bloom's level

PART C

1 Illustrate the steps involved in loading and classifying the Iris CO5 K5
plant database in the WEKA tool.

2 Elucidate the steps involved in loading and classifying Breast CO5 K5
cancer databases in the WEKA tool.

CHAPTER 3
PART A
1. Define Classification.
Classification is a data mining technique used to predict categorical
labels by analyzing and categorizing data into predefined classes. It
helps in assigning instances to specific groups based on input data.

2. List the applications of classification and prediction.
Classification and prediction are applied in medical diagnosis, spam
email detection, customer segmentation, credit scoring, and fraud
detection. These techniques help make future predictions and improve
decision-making.
3. Define Support vector machine.
A support vector machine (SVM) is a supervised learning algorithm that
finds the optimal boundary between different classes in a dataset. It aims
to maximize the margin between the data points of different classes.

4. Define back propagation.
Back propagation is a learning algorithm used in neural networks to
adjust weights so as to minimize the error between predicted and actual
outputs. It propagates the error backward from the output layer toward
the input layer and updates each weight by gradient descent.
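The weight update described above can be sketched on a deliberately tiny network. This is a minimal illustration, assuming one input, one sigmoid hidden unit, one sigmoid output unit, no bias terms, and a squared-error loss; the function names, starting weights, and learning rate are hypothetical choices for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """Forward pass: input -> hidden activation h -> output y."""
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    return h, y

def squared_error(x, target, w1, w2):
    _, y = forward(x, w1, w2)
    return 0.5 * (y - target) ** 2

def backprop_step(x, target, w1, w2, lr=0.5):
    """One gradient-descent update of both weights via back propagation."""
    h, y = forward(x, w1, w2)
    # Error signal at the output unit (derivative of loss w.r.t. its input).
    delta_out = (y - target) * y * (1 - y)
    # Error signal propagated backward to the hidden unit through w2.
    delta_hidden = delta_out * w2 * h * (1 - h)
    # Each weight moves against its gradient.
    new_w2 = w2 - lr * delta_out * h
    new_w1 = w1 - lr * delta_hidden * x
    return new_w1, new_w2

# Assumed toy training example: input 1.0 should map to output 0.0.
before = squared_error(1.0, 0.0, 0.5, 0.5)
w1, w2 = backprop_step(1.0, 0.0, 0.5, 0.5)
after = squared_error(1.0, 0.0, w1, w2)
print(before, after)
```

Repeating `backprop_step` over many examples is what training amounts to in this simplified setting; real networks apply the same chain-rule idea to vectors of weights per layer.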

5. What are K-nearest neighbor classifiers?
K-nearest neighbor (KNN) is a non-parametric classification algorithm
that assigns a class to a data point based on the majority class of its
k nearest neighbors in the training data. Nearness is measured with a
distance function such as Euclidean distance.
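As a rough illustration of the voting step, the following Python sketch classifies a query point by the majority label of its k nearest training points. The function name `knn_predict`, the toy data, and the use of Euclidean distance are assumptions made for the example.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Sort the labelled points by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    # Vote: the most common label among the k neighbours wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (features, label) pairs.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]

print(knn_predict(train, (2, 2)))
print(knn_predict(train, (6, 5)))
```

Note that no model is built ahead of time: all the work happens inside `knn_predict` at prediction time, which is what makes KNN a lazy learner.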

6. Differentiate lazy learners and eager learners.
Lazy learners store the training data and perform classification during
prediction, like KNN. Eager learners, such as decision trees, build a
model before making predictions.

7. Illustrate support vector machines with examples.
SVMs separate data points using a hyperplane. For example, an SVM can
classify emails as "spam" or "not spam" by finding the boundary that
separates the two groups with the widest possible margin. The training
points closest to the boundary are the support vectors, and a wider
margin generally gives better classification of unseen data.

8. How would you show your understanding about rule-based classification?
Rule-based classification assigns classes based on predefined "if-then"
rules extracted from the training data. For example, "If age > 30 and
income > 50K, then classify as high-spender."
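The example rule can be written directly as code. This is a small sketch, assuming records are dictionaries with `age` and `income` keys and that 50K means 50,000; the rule list layout and the default class `"regular"` are hypothetical additions for illustration.

```python
# Ordered list of (condition, class label) pairs; the first matching rule fires.
RULES = [
    # IF age > 30 AND income > 50K THEN high-spender (rule from the text).
    (lambda r: r["age"] > 30 and r["income"] > 50_000, "high-spender"),
]

# Assumed default class for records that no rule covers.
DEFAULT_CLASS = "regular"

def classify(record):
    """Return the label of the first rule whose condition holds."""
    for condition, label in RULES:
        if condition(record):
            return label
    return DEFAULT_CLASS

print(classify({"age": 35, "income": 60_000}))
print(classify({"age": 25, "income": 60_000}))
```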
9. Discuss why pruning is needed in the decision tree.
Pruning reduces the size of decision trees by removing branches that
have little significance, preventing overfitting. It improves
generalization on unseen data.

10. Define Lazy learners with an example.
Lazy learners like KNN store all training data and only perform
computation when predicting. For instance, KNN classifies a new data
point by analyzing its closest neighbors from the training data.

11. What are eager learners?
Eager learners, such as decision trees and SVMs, build a classification
model in advance before making predictions. They learn a generalized
pattern from the training data to classify future data.

CHAPTER 4
1. What is cluster analysis?
Cluster analysis is a data mining technique used to group similar objects
into clusters based on shared characteristics. It helps in identifying
natural patterns in data without predefined labels.
2. Define Clustering.
Clustering is the process of dividing data into meaningful groups
(clusters) where objects in the same cluster are more similar to each
other than to objects in other clusters.
3. How is the quality of a cluster represented?
The quality of a cluster is represented by metrics such as cohesion,
which measures how closely related the data points are within a cluster,
and separation, which assesses how distinct or well-separated a cluster is
from others.
4. Define K-means partitioning.
K-means is a partitioning algorithm that divides a dataset into K clusters
by assigning each point to the nearest centroid and then recalculating the
centroids based on cluster memberships.
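The assign-then-recompute loop can be sketched in plain Python. The example below uses the eight points and the initial centers A1, B1, C1 from Unit IV, Part B, Question 1; the function names are hypothetical, and the sketch assumes no cluster ever becomes empty.

```python
from math import dist

def assign(points, centers):
    """Put each named point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for name, p in points.items():
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(name)
    return clusters

def update(points, clusters):
    """Recompute each center as the mean of its members (assumes none is empty)."""
    centers = []
    for members in clusters:
        xs = [points[m][0] for m in members]
        ys = [points[m][1] for m in members]
        centers.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return centers

def kmeans(points, centers, max_rounds=100):
    """Alternate update and assignment until memberships stop changing."""
    clusters = assign(points, centers)
    for _ in range(max_rounds):
        centers = update(points, clusters)
        new_clusters = assign(points, centers)
        if new_clusters == clusters:
            break
        clusters = new_clusters
    return clusters, centers

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
initial = [points["A1"], points["B1"], points["C1"]]

first_round = assign(points, initial)
first_centers = update(points, first_round)
final_clusters, final_centers = kmeans(points, initial)
print(first_centers)
print(final_clusters)
```

After the first round the centers are (2, 10), (6, 6) and (1.5, 3.5); the loop settles on the clusters {A1, B1, C2}, {A3, B2, B3} and {A2, C1}.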
5. List the major clustering methods.
The major clustering methods include partitioning methods, hierarchical
methods, density-based methods, grid-based methods, and model-based
methods.
6. Define outlier. How will you determine outliers in the data?
An outlier is an observation that significantly deviates from the rest of
the data. Outliers can be detected using statistical methods, such as the
Z-score, or visual techniques, like box plots.
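A Z-score check can be sketched as follows. This is a minimal example, assuming the population standard deviation and a cut-off passed in as `threshold`; the function name and the sample values are hypothetical. A cut-off of 3 is a common convention, but in a sample of only nine values no point can reach a Z-score of 3, so the example uses 2.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so nothing deviates.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical measurements with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(zscore_outliers(data, threshold=2.0))
```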
7. Discuss the challenges of outlier detection.
Challenges of outlier detection include handling high-dimensional data,
differentiating between noise and meaningful outliers, and detecting
outliers in data with varying distributions.
8. Explain the typical phases of outlier detection methods.
Outlier detection methods typically involve data preprocessing, selecting
an appropriate detection technique (statistical, distance-based, or model-
based), and validating detected outliers.
9. Distinguish between Classification and clustering.
Classification assigns labels to data based on predefined categories,
whereas clustering groups data into clusters based on inherent
similarities without prior labeling.
10. Give the methods of clustering high dimensional data.
Methods for clustering high-dimensional data include subspace
clustering, projected clustering, and biclustering, which focus on finding
meaningful clusters in a subset of dimensions.
11. How is the goodness of clusters measured?
The goodness of clusters is measured using metrics such as the
Silhouette score, Dunn index, and Davies-Bouldin index, which evaluate
both cohesion and separation of clusters.
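The Silhouette score can be computed by hand for small data. For each point, a is the mean distance to the other members of its own cluster (cohesion) and b is the smallest mean distance to the members of any other cluster (separation); the point's score is (b - a) / max(a, b). The sketch below assumes Euclidean distance, at least two clusters, and a score of 0 for singleton clusters; the function name and toy points are hypothetical.

```python
from math import dist

def silhouette(clusters):
    """Mean silhouette score over all points; clusters is a list of point lists."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                # Convention: a point alone in its cluster scores 0.
                scores.append(0.0)
                continue
            # Cohesion: mean distance to the other members (dist(p, p) is 0).
            a = sum(dist(p, q) for q in cluster) / (len(cluster) - 1)
            # Separation: mean distance to the closest other cluster.
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

good = [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]   # compact, well separated
bad = [[(0, 0), (5, 5)], [(0, 1), (5, 6)]]    # same points, mixed up
print(silhouette(good), silhouette(bad))
```

Scores near 1 indicate tight, well-separated clusters, while negative scores suggest points sit closer to another cluster than to their own.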
12. Classify hierarchical clustering methods.
Hierarchical clustering methods are classified into agglomerative
(bottom-up) and divisive (top-down). Agglomerative methods start with
individual data points and merge them, while divisive methods start with
one large cluster and split it.
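The agglomerative (bottom-up) direction can be sketched in a few lines. This is an illustrative version, assuming single-linkage distance (the closest pair of points between two clusters) and a target number of clusters `k` as the stopping rule; both choices, the function names, and the sample points are assumptions for the example.

```python
from math import dist

def single_link(a, b):
    """Single-linkage distance: the closest pair of points across two clusters."""
    return min(dist(p, q) for p in a for q in b)

def agglomerative(points, k):
    """Start with one cluster per point and merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the indices of the two closest clusters.
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        # Merge cluster j into cluster i, then drop j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

sample = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(sample, 3))
```

A divisive method would run the other way, starting from `[sample]` as one cluster and repeatedly splitting.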
13. Define grid-based method in clustering.
Grid-based clustering divides the data space into a grid of cells and
clusters the data points based on the density of points within each grid
cell. Examples include STING and CLIQUE.
14. What are the applications of cluster analysis?
Cluster analysis is applied in customer segmentation, image recognition,
market research, bioinformatics, and document classification to find
patterns and relationships in data.
15. What is the concept of partitioning methods?
Partitioning methods divide the dataset into distinct clusters by
iteratively relocating points between clusters to optimize a given
criterion, such as minimizing within-cluster variance. Examples include
K-means and K-medoids.

CHAPTER 5
1. Why is data preprocessing needed?
Data preprocessing is essential to clean, transform, and prepare raw data
for analysis, improving the quality of data mining results. It handles
missing values, noise, and inconsistencies in the data.
2. Name any four preprocessing filters used in the WEKA tool.
Four preprocessing filters in WEKA include RemoveUseless,
ReplaceMissingValues, Normalize, and Discretize.
3. What are the foundations of data mining?
The foundations of data mining include statistics, machine learning,
artificial intelligence, and database systems, which together enable
extracting meaningful patterns from large datasets.
4. Name some specific application-oriented databases.
Examples of application-oriented databases include spatial databases,
time-series databases, multimedia databases, and web databases.
5. Explain how data mining is used in health care analysis.
Data mining in healthcare is used to predict disease outbreaks, identify
patient risk groups, and optimize treatment plans by discovering patterns
in medical records and clinical data.
6. Explain data mining applications for biomedical and DNA data
analysis.
In biomedical and DNA analysis, data mining is used to identify genes
related to diseases, classify biological sequences, and discover genetic
patterns, improving diagnostics and drug development.
7. Differentiate between data mining and data warehousing.
Data warehousing focuses on storing large volumes of data in a
centralized repository, while data mining analyzes this data to uncover
patterns and insights for decision-making.
8. What are the applications of data mining?
Data mining applications include fraud detection, market basket
analysis, customer segmentation, financial forecasting, and social media
analysis.
9. List out the various data mining tools.
Popular data mining tools include WEKA, RapidMiner, KNIME,
Orange, and IBM SPSS Modeler.
10. What is a dataset? Give an example.
A dataset is a collection of data, usually organized in rows and columns.
For example, a customer dataset may include columns like "Name,"
"Age," and "Purchase History."
11. What is association-rule learner?
An association-rule learner discovers relationships or associations
between variables in large datasets, often used for market basket
analysis. It finds rules like "If A is bought, B is also likely to be bought."
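Rules of this kind are usually ranked by two standard measures: support (the fraction of transactions containing all the items) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both; the basket data and function names are hypothetical, and it assumes the antecedent occurs in at least one transaction.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(A and B) / support(A) for the rule A -> B."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

Here bread and milk appear together in 2 of 4 baskets (support 0.5), and milk appears in 2 of the 3 baskets containing bread (confidence 2/3).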
12. Draw the layout of the Weka tool.
The WEKA Explorer window is laid out as a row of tabbed panels:
Preprocess, Classify, Cluster, Associate, Select attributes, and
Visualize, one for each kind of data mining task.
13. List out the limitations of the Weka tool.
Limitations of WEKA include scalability issues with very large datasets,
limited support for real-time data mining, and a relatively basic user
interface for complex operations.
14. Write down the functionalities of the Weka tool.
WEKA offers functionalities like data preprocessing, classification,
regression, clustering, association rule mining, and data visualization.
15. What is auto import? Give an example.
Auto import refers to the automatic loading of data or modules in a
software tool. For example, in Python, pandas can load a CSV file into a
table for analysis with a single read_csv call.
16. List out various data warehouse tools.
Popular data warehouse tools include Amazon Redshift, Google
BigQuery, Snowflake, Microsoft Azure SQL Data Warehouse (now Azure
Synapse Analytics), and IBM Db2 Warehouse.
