Classification
PART - A
8 How would you show your understanding of rule-based
classification? CO3 K2
PART-B
PART-C
2 Green 2 Tall No M
5 Green 2 Short No H
6 White 2 Tall No H
7 White 2 Tall No H
8 White 2 Short Yes H
PART - A
6 Define outlier. How will you determine outliers in the data? CO4 K1
PART-B
1 Consider that the data mining task is to cluster the following
eight points, A1, A2, A3, B1, B2, B3, C1 and C2 (with (X, Y)
representing location), into three clusters: A1(2,10), A2(2,5),
A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).
The distance function is Euclidean distance. Suppose initially
we assign A1, B1 and C1 as the center of each cluster,
respectively. Use the K-means algorithm to show the three
cluster centers after the first round of execution and the final
three clusters. CO4 K3
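The exercise above can be checked with a short script. This is a minimal sketch, not a model answer: the function names (assign, update, kmeans) are our own, and it assumes no cluster ever becomes empty, which holds for this data.

```python
from math import dist

# The eight points from the exercise, keyed by name.
POINTS = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4),
    "B1": (5, 8), "B2": (7, 5), "B3": (6, 4),
    "C1": (1, 2), "C2": (4, 9),
}


def assign(points, centers):
    """Put every point into the cluster of its nearest center (Euclidean)."""
    clusters = [[] for _ in centers]
    for name, p in points.items():
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(name)
    return clusters


def update(points, clusters):
    """Recompute each center as the mean of its member points."""
    centers = []
    for members in clusters:  # assumption: no cluster is empty
        xs = [points[m][0] for m in members]
        ys = [points[m][1] for m in members]
        centers.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return centers


def kmeans(points, centers):
    """Alternate assign/update until the assignment stops changing.

    Returns the centers computed after each round and the final clusters.
    """
    history, clusters = [], None
    while True:
        new_clusters = assign(points, centers)
        if new_clusters == clusters:
            return history, clusters
        clusters = new_clusters
        centers = update(points, clusters)
        history.append(centers)


if __name__ == "__main__":
    start = [POINTS["A1"], POINTS["B1"], POINTS["C1"]]
    history, clusters = kmeans(POINTS, start)
    print("centers after round 1:", history[0])
    print("final clusters:", clusters)
```

Run on the exercise data, the first round gives the centers (2, 10), (6, 6) and (1.5, 3.5), and the loop stops with the clusters {A1, B1, C2}, {A3, B2, B3} and {A2, C1}.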
Point X Y
P1 2 6
P2 3 4
P3 3 8
P4 4 7
P5 6 2
P6 6 4
P7 7 3
P8 7 4
P9 8 5
P10 7 6
Point X Y
P1 2 10
P2 2 5
P3 8 4
P4 5 8
P5 7 5
P6 6 4
P7 1 2
P8 4 9
PART-C
POINTS X Y
P1 3 7
P2 4 6
P3 5 5
P4 6 4
P5 7 3
P6 6 2
P7 7 2
P8 8 4
P9 3 3
P10 2 6
P11 3 5
P12 2 4
PART A
5 Explain data mining applications for biomedical and DNA data
analysis. CO5 K1
PART B
1 Discuss in detail about the WEKA tool and its functionalities. CO5 K3
2 Outline the features involved in the Iris plant database in detail. CO5 K4
PART C
1 Illustrate the steps involved in loading and classifying the Iris
plant database in the WEKA tool. CO5 K5
CHAPTER 3
PART A
1. Define Classification.
Classification is a supervised data mining technique that predicts
categorical class labels. A model is learned from training data whose
classes are already known and is then used to assign new instances to
one of those predefined classes.
CHAPTER 4
1. What is cluster analysis?
Cluster analysis is a data mining technique used to group similar objects
into clusters based on shared characteristics. It helps in identifying
natural patterns in data without predefined labels.
2. Define Clustering.
Clustering is the process of dividing data into meaningful groups
(clusters) where objects in the same cluster are more similar to each
other than to objects in other clusters.
3. How is the quality of a cluster represented?
The quality of a cluster is represented by metrics such as cohesion,
which measures how closely related the data points are within a cluster,
and separation, which assesses how distinct or well-separated a cluster is
from others.
4. Define K-means partitioning.
K-means is a partitioning algorithm that divides a dataset into K clusters
by assigning each point to the nearest centroid and then recalculating
each centroid as the mean of its cluster's members. The two steps repeat
until the assignments no longer change.
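In symbols, the assignment and recalculation steps, and the criterion they reduce, can be written as below. The notation is our own assumption, not taken from the question bank: x_i are the data points, mu_j the centroids and C_j the clusters.

```latex
\begin{aligned}
\text{Assignment:}\quad & C_j = \{\, x_i : \lVert x_i - \mu_j \rVert \le \lVert x_i - \mu_l \rVert \ \text{for all } l = 1,\dots,K \,\} \\
\text{Update:}\quad & \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \\
\text{Criterion:}\quad & \min \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}
\end{aligned}
```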
5. List the major clustering methods.
The major clustering methods include partitioning methods, hierarchical
methods, density-based methods, grid-based methods, and model-based
methods.
6. Define outlier. How will you determine outliers in the data?
An outlier is an observation that significantly deviates from the rest of
the data. Outliers can be detected using statistical methods, such as the
Z-score, or visual techniques, like box plots.
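Both checks can be sketched in a few lines. The thresholds (3 standard deviations, 1.5 times the interquartile range, which is the rule behind box-plot whiskers) are common conventions assumed here, and the sample readings are made up for illustration.

```python
from statistics import mean, pstdev, quantiles


def zscore_outliers(values, threshold=3.0):
    """Return values whose Z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can deviate
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the box-plot rule."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sample: eleven ordinary readings and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]

if __name__ == "__main__":
    print(zscore_outliers(readings))
    print(iqr_outliers(readings))
```

On this sample both functions flag only the value 95. On small samples the Z-score rule is less sensitive than the box-plot rule, because a single extreme value also inflates the standard deviation it is measured against.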
7. Discuss the challenges of outlier detection.
Challenges of outlier detection include handling high-dimensional data,
differentiating between noise and meaningful outliers, and detecting
outliers in data with varying distributions.
8. Explain the typical phases of outlier detection methods.
Outlier detection methods typically involve data preprocessing, selecting
an appropriate detection technique (statistical, distance-based, or model-
based), and validating detected outliers.
9. Distinguish between Classification and clustering.
Classification assigns labels to data based on predefined categories,
whereas clustering groups data into clusters based on inherent
similarities without prior labeling.
10. Give the methods of clustering high dimensional data.
Methods for clustering high-dimensional data include subspace
clustering, projected clustering, and biclustering, which focus on finding
meaningful clusters in a subset of dimensions.
11. How is the goodness of clusters measured?
The goodness of clusters is measured using metrics such as the
Silhouette score, Dunn index, and Davies-Bouldin index, which evaluate
both cohesion and separation of clusters.
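As an illustration of how one of these metrics combines cohesion and separation, here is a minimal sketch of the Silhouette score. It assumes at least two clusters, Euclidean distance, and the usual convention that a point in a singleton cluster scores 0. The function name and the toy data are our own.

```python
from math import dist


def silhouette_score(clusters):
    """Mean silhouette coefficient; clusters is a list of lists of points."""
    scores = []
    for i, own in enumerate(clusters):
        for p in own:
            if len(own) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            # a: mean distance to the other members of p's cluster (cohesion).
            # dist(p, p) is 0, so summing over the whole cluster is harmless.
            a = sum(dist(p, q) for q in own) / (len(own) - 1)
            # b: smallest mean distance to any other cluster (separation).
            b = min(
                sum(dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters)
                if j != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: the same four points grouped sensibly and grouped badly.
good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
bad = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]

if __name__ == "__main__":
    print(silhouette_score(good), silhouette_score(bad))
```

The score lies between -1 and 1. The sensible grouping scores close to 1, while the bad grouping scores below 0 because each point is nearer to the other cluster than to its own.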
12. Classify hierarchical clustering methods.
Hierarchical clustering methods are classified into agglomerative
(bottom-up) and divisive (top-down). Agglomerative methods start with
individual data points and merge them, while divisive methods start with
one large cluster and split it.
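The agglomerative (bottom-up) direction can be sketched as follows. The answer above does not name a linkage criterion, so single linkage (distance between the closest pair of members) is assumed here, and the sample points are made up.

```python
from math import dist


def agglomerative(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into i
        del clusters[j]
    return clusters


sample = [(1, 1), (1, 2), (5, 5), (6, 5), (10, 10)]

if __name__ == "__main__":
    print(agglomerative(sample, 3))
```

Recording the order of the merges instead of stopping at k clusters would give the dendrogram that hierarchical methods are usually drawn with. A divisive method would run the other way, starting from one cluster and splitting.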
13. Define grid-based method in clustering.
Grid-based clustering divides the data space into a grid of cells and
clusters the data points based on the density of points within each grid
cell. Examples include STING and CLIQUE.
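The idea can be illustrated with a deliberately simplified sketch. It is not an implementation of STING or CLIQUE: it only bins 2-D points into square cells, keeps cells with at least min_points members, and joins dense cells that share an edge. The parameter names and the sample data are assumptions.

```python
from collections import defaultdict


def grid_clusters(points, cell_size, min_points):
    """Cluster 2-D points by joining adjacent dense grid cells."""
    # 1. Map every point to the grid cell that contains it.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    # 2. Keep only the dense cells.
    dense = {c for c, members in cells.items() if len(members) >= min_points}
    # 3. Join dense cells that share an edge into one cluster (flood fill).
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            cx, cy = stack.pop()
            members.extend(cells[(cx, cy)])
            for n in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if n in dense and n not in seen:
                    seen.add(n)
                    stack.append(n)
        clusters.append(sorted(members))
    return clusters


sample = [(0.5, 0.5), (1.0, 1.5), (2.5, 0.5), (3.0, 1.0),
          (8.5, 8.5), (9.0, 9.5), (5.0, 5.0)]

if __name__ == "__main__":
    print(grid_clusters(sample, cell_size=2, min_points=2))
```

Points that fall in sparse cells, such as (5.0, 5.0) here, are left out as noise. The cost depends mainly on the number of cells rather than the number of points, which is the usual argument for grid-based methods.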
14. What are the applications of cluster analysis?
Cluster analysis is applied in customer segmentation, image recognition,
market research, bioinformatics, and document classification to find
patterns and relationships in data.
15. What is the concept of partitioning methods?
Partitioning methods divide the dataset into distinct clusters by
iteratively relocating points between clusters to optimize a given
criterion, such as minimizing within-cluster variance. Examples include
K-means and K-medoids.
CHAPTER 5
1. Why is data preprocessing needed?
Data preprocessing is essential to clean, transform, and prepare raw data
for analysis, improving the quality of data mining results. It handles
missing values, noise, and inconsistencies in the data.
2. Name any four preprocessing filters used in the WEKA tool.
Four preprocessing filters in WEKA include RemoveUseless,
ReplaceMissingValues, Normalize, and Discretize.
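To make two of these filters concrete, the sketch below mimics what ReplaceMissingValues and Normalize do to a single numeric column: missing values are replaced by the column mean, and values are rescaled to the range [0, 1]. This is plain Python written for illustration, not WEKA code (WEKA filters are Java classes), and the function names are our own.

```python
def replace_missing_values(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    return [mean if v is None else v for v in column]


def normalize(column):
    """Rescale values linearly to the interval [0, 1] (min-max scaling)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in column]


# Made-up column with one missing value.
raw = [2.0, None, 4.0, 6.0]

if __name__ == "__main__":
    filled = replace_missing_values(raw)
    print(filled)
    print(normalize(filled))
```

The order matters: the missing value has to be filled before scaling, since min and max cannot be computed over a column that still contains gaps.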
3. What are the foundations of data mining?
The foundations of data mining include statistics, machine learning,
artificial intelligence, and database systems, which together enable
extracting meaningful patterns from large datasets.
4. Name some specific application-oriented databases.
Examples of application-oriented databases include spatial databases,
time-series databases, multimedia databases, and web databases.
5. Explain how data mining is used in health care analysis.
Data mining in healthcare is used to predict disease outbreaks, identify
patient risk groups, and optimize treatment plans by discovering patterns
in medical records and clinical data.
6. Explain data mining applications for biomedical and DNA data
analysis.
In biomedical and DNA analysis, data mining is used to identify genes
related to diseases, classify biological sequences, and discover genetic
patterns, improving diagnostics and drug development.
7. Differentiate between data mining and data warehousing.
Data warehousing focuses on storing large volumes of data in a
centralized repository, while data mining analyzes this data to uncover
patterns and insights for decision-making.
8. What are the applications of data mining?
Data mining applications include fraud detection, market basket
analysis, customer segmentation, financial forecasting, and social media
analysis.
9. List out the various data mining tools.
Popular data mining tools include WEKA, RapidMiner, KNIME,
Orange, and IBM SPSS Modeler.
10. What is a dataset? Give an example.
A dataset is a collection of data, usually organized in rows and columns.
For example, a customer dataset may include columns like "Name,"
"Age," and "Purchase History."
11. What is an association-rule learner?
An association-rule learner discovers relationships or associations
between variables in large datasets, often used for market basket
analysis. It finds rules like "If A is bought, B is also likely to be bought."
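How strongly such a rule holds is usually measured by support and confidence. The sketch below computes both for a rule A -> B over a handful of made-up baskets. The function names are our own, and a real learner such as Apriori would additionally search for all rules above chosen thresholds.

```python
def count(transactions, items):
    """Number of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, items):
    """Fraction of all transactions that contain the item set."""
    return count(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing A, the fraction that also contain B."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)


# Made-up market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

if __name__ == "__main__":
    print(support(baskets, {"bread", "milk"}))       # 3 of 5 baskets
    print(confidence(baskets, {"bread"}, {"milk"}))  # 3 of the 4 bread baskets
```

Here the rule "if bread is bought, milk is also bought" has support 0.6 and confidence 0.75.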
12. Draw the layout of the Weka tool.
The WEKA Explorer window is laid out as a row of tabs, one per data
mining task: Preprocess, Classify, Cluster, Associate, Select attributes,
and Visualize.
13. List out the limitations of the Weka tool.
Limitations of WEKA include scalability issues with very large datasets,
limited support for real-time data mining, and a relatively basic user
interface for complex operations.
14. Write down the functionalities of the Weka tool.
WEKA offers functionalities like data preprocessing, classification,
regression, clustering, association rule mining, and data visualization.
15. What is auto import? Give an example.
Auto import refers to the automatic loading of data or modules in a
software tool. For example, in Python, the pandas function read_csv
loads a CSV file into a table (DataFrame) and infers the column types
automatically.
16. List out various data warehouse tools.
Popular data warehouse tools include Amazon Redshift, Google
BigQuery, Snowflake, Microsoft Azure SQL Data Warehouse (now Azure
Synapse Analytics), and IBM Db2 Warehouse.