
SECTION D

Q.5 What is data discretization? Discuss the issues to be considered in data mining.
Data discretization is a method of converting a large number of continuous data values into a smaller number of intervals so that the evaluation and management of the data become easier. In other words, data discretization converts the values of a continuous attribute into a finite set of intervals with minimal information loss.
Issues to be considered in data mining include the quality of the data (accuracy, completeness, consistency), the preprocessing required before mining, and the choice of suitable algorithms for tasks such as classification, clustering, and association rule mining.
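As an illustration of discretization, here is a minimal sketch in Python using pandas; the 'ages' values, the number of bins, and the interval labels are assumptions made only for this example.

import pandas as pd

# Hypothetical continuous attribute: customer ages
ages = pd.Series([22, 25, 31, 38, 45, 52, 58, 63, 70])

# Equal-width discretization into 3 intervals (labels are illustrative)
equal_width = pd.cut(ages, bins=3, labels=["young", "middle_aged", "senior"])

# Equal-frequency (quantile) discretization into 3 intervals
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(equal_width.value_counts())
print(equal_freq.value_counts())

Both calls replace each continuous value with the interval it falls into, which is exactly the reduction in distinct values that discretization aims for.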

Q4 Illustrate data preprocessing in detail.


Data preprocessing is the process of transforming raw data into an understandable format. It is an important step in data mining because we cannot work directly with raw data; the quality of the data should be checked before applying machine learning or data mining algorithms.
Preprocessing is mainly concerned with checking data quality, which can be assessed on the following criteria (a short pandas sketch follows the list):

● Accuracy: To check whether the data entered is correct or not.


● Completeness: To check whether all required data is recorded and available.
● Consistency: To check whether the same data stored in different places matches.
● Timeliness: To check whether the data is kept up to date.
● Believability: To check whether the data is trustworthy.
● Interpretability: To check whether the data is easy to understand.
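The checks above can be carried out programmatically. Below is a small, hedged sketch using pandas; the DataFrame, its column names, and the plausibility threshold for age are all assumptions made for illustration.

import pandas as pd

# Hypothetical raw data with typical quality problems
df = pd.DataFrame({
    "age": [25, None, 42, 130],             # a missing value and an implausible outlier
    "country": ["IN", "India", "IN", "US"],  # inconsistent coding of the same country
})

print(df.isnull().sum())                        # completeness: missing values per column
print(df[(df["age"] < 0) | (df["age"] > 110)])  # accuracy: flag implausible ages
print(df["country"].unique())                   # consistency: inspect conflicting codes
df["country"] = df["country"].replace({"India": "IN"})  # harmonize the codes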
Q3 Explain with example the various steps in decision tree induction.
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.
A typical decision tree for the concept buy_computer indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class.

The benefits of having a decision tree are as follows −


● It does not require any domain knowledge.
● It is easy to comprehend.
● The learning and classification steps of a decision tree are simple and fast.
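The induction itself can be sketched as follows in Python with scikit-learn; the attribute encoding, the tiny training set, and the feature names are assumptions made only for illustration, not part of the question.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy buy_computer-style data: [age, income, student, credit_rating], label = buys_computer
# Categorical attributes are integer-encoded here purely for brevity.
X = [[0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0],
     [2, 0, 1, 0], [2, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]

tree = DecisionTreeClassifier(criterion="entropy")  # information-gain-style splitting
tree.fit(X, y)

# Each internal node tests an attribute; each leaf holds a class label
print(export_text(tree, feature_names=["age", "income", "student", "credit_rating"]))
print(tree.predict([[1, 1, 1, 0]]))  # classify a new, unseen customer

The printed tree makes the root node, branches, and leaf class labels described above directly visible.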

Q2 Briefly outline the key features of various clustering methods with relevant
examples.

The various types of clustering are:


1. Connectivity-based Clustering (Hierarchical clustering)
2. Centroid-based Clustering (Partitioning methods)
3. Density-based Clustering
4. Distribution-based Clustering (Model-based methods)
5. Fuzzy Clustering
6. Constraint-based Clustering (Supervised clustering)

1. Connectivity-Based Clustering (Hierarchical Clustering)


Hierarchical clustering is an unsupervised machine learning method that builds a hierarchy of clusters, either top-down (divisive, starting from one all-inclusive cluster that is split) or bottom-up (agglomerative, starting from single-point clusters that are merged step by step).

2. Centroid Based Clustering


Centroid-based clustering is considered one of the simplest clustering approaches, yet it is an effective way of creating clusters and assigning data points to them; k-means is the best-known example.

3. Density-based Clustering


If one looks at the previous two methods, one observes that both hierarchical and centroid-based algorithms depend on a distance (similarity/proximity) metric. Density-based methods instead grow clusters from regions where data points are densely packed and treat isolated points in sparse regions as noise; DBSCAN is a well-known example.

4. Distribution-Based Clustering (Model-Based Methods)
Until now, the clustering techniques we have seen are based on either proximity (similarity/distance) or composition (density). There is a family of clustering algorithms that takes a different quantity into consideration: probability. Each cluster is modelled as a probability distribution, as in Gaussian mixture models.

5. Fuzzy Clustering
The general idea of clustering revolves around assigning data points to mutually exclusive clusters: a data point resides uniquely inside one cluster and cannot belong to more than one. Fuzzy clustering relaxes this assumption and allows each data point to belong to several clusters with a degree of membership; fuzzy c-means is a common example.

6. Constraint-based (Supervised Clustering)


The clustering process, in general, is based on the approach that the data can be divided into an optimal number of "unknown" groups; the underlying stages of all clustering algorithms find those hidden patterns and similarities without any intervention or predefined conditions. Constraint-based (supervised) clustering, in contrast, guides the process with user- or application-specified constraints, for example must-link/cannot-link pairs of points or a required cluster size. A short comparative sketch of three of the methods above follows.
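To make the contrast concrete, here is a small hedged sketch comparing a centroid-based method (k-means), a connectivity-based method (agglomerative clustering), and a density-based method (DBSCAN) on synthetic data; every parameter value is an illustrative assumption.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Synthetic 2-D data with three natural groups
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)  # centroid-based
hierarchical = AgglomerativeClustering(n_clusters=3).fit(X)       # connectivity-based
dbscan = DBSCAN(eps=1.5, min_samples=5).fit(X)                    # density-based

print(kmeans.labels_[:10])
print(hierarchical.labels_[:10])
print(dbscan.labels_[:10])   # -1 marks points treated as noise

The three methods assign labels in different ways: k-means needs the number of clusters up front, the hierarchical method can be cut at any level of the hierarchy, and DBSCAN discovers the number of clusters from the density of the data.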
SECTION C
1. DIFFERENCE BETWEEN CLASSIFICATION & PREDICTION.
● Classification is the method of identifying to which group a new observation belongs, on the basis of a training data set containing observations whose group membership is already known.
● Prediction is the method of estimating missing or unavailable numerical values for a new observation.
● A classifier is built to predict categorical (class) labels.
● A predictor is built to predict a continuous-valued function, i.e., an ordered or numeric value.
● In classification, accuracy depends on detecting the class label correctly.
● In prediction, accuracy depends on how well a given predictor can estimate the value of the predicted attribute for new data.
● In classification, the model can be called the classifier.
● In prediction, the model can be called the predictor (a short sketch contrasting the two follows).
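A minimal sketch of the contrast in Python with scikit-learn: the classifier outputs a class label, while the predictor (a regressor) outputs a numeric estimate. The data is synthetic and purely illustrative.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1], [2], [3], [4], [5], [6]]

# Classification: the target is a categorical class label
clf = DecisionTreeClassifier().fit(X, ["low", "low", "low", "high", "high", "high"])
print(clf.predict([[2.5]]))   # -> a class label such as 'low'

# Prediction (numeric estimation): the target is a continuous value
reg = DecisionTreeRegressor().fit(X, [10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
print(reg.predict([[2.5]]))   # -> a numeric estimate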

2. (Co1)Explain the steps in the knowledge discovery database.

Some people don't differentiate data mining from knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. The steps involved in the knowledge discovery process are listed below (a small end-to-end sketch in Python follows the list) −

● Data Cleaning − In this step, noise and inconsistent data are removed.
● Data Integration − In this step, multiple data sources are combined.
● Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
● Data Transformation − In this step, data is transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations.
● Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
● Pattern Evaluation − In this step, data patterns are evaluated.
● Knowledge Presentation − In this step, knowledge is represented.
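A hedged end-to-end sketch of these steps on a toy sales table using pandas and scikit-learn; every table, column name, and parameter here is an assumption made only to illustrate the flow, not a prescribed implementation.

import pandas as pd
from sklearn.cluster import KMeans

# Data cleaning: drop rows with missing values
sales = pd.DataFrame({"customer": ["a", "b", "b", "c", None],
                      "amount": [120.0, 80.0, None, 300.0, 50.0]})
sales = sales.dropna()

# Data integration: combine a second (hypothetical) source
regions = pd.DataFrame({"customer": ["a", "b", "c"], "region": ["N", "S", "N"]})
data = sales.merge(regions, on="customer")

# Data selection and transformation: aggregate to one summary row per customer
per_customer = data.groupby("customer")["amount"].sum().to_frame()

# Data mining: apply an intelligent method (here, clustering) to extract patterns
per_customer["cluster"] = KMeans(n_clusters=2, n_init=10,
                                 random_state=0).fit_predict(per_customer[["amount"]])

# Pattern evaluation / knowledge presentation: inspect and report the result
print(per_customer)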
3. (CO5)State Bayes theorem and discuss how Bayesian classifiers work.

Bayes' theorem describes the probability of occurrence of an event given a related condition, i.e., it is a statement about conditional probability. Bayes' theorem is also known as the formula for the probability of "causes".
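Stated as a formula in standard notation, where A is the hypothesis (cause) and B the observed evidence:

\[
P(A \mid B) \;=\; \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0.
\]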

The Naive Bayes classifier works on the principle of conditional probability as given by Bayes' theorem, under the simplifying ("naive") assumption that the attributes are conditionally independent given the class. When working with probabilities we usually denote a probability as P; for example, when tossing two fair coins, the probability of getting two heads is P(two heads) = 1/4.
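A hedged sketch of a Naive Bayes classifier in Python using scikit-learn; the tiny integer-encoded data set and its attributes are assumptions made only for illustration.

from sklearn.naive_bayes import CategoricalNB

# Toy integer-encoded data: [outlook, temperature], label = play
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["no", "no", "yes", "yes", "yes", "no"]

nb = CategoricalNB()
nb.fit(X, y)

# For a new observation, the class with the highest posterior P(class | attributes) is chosen
print(nb.predict([[1, 0]]))
print(nb.predict_proba([[1, 0]]))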

4. (CO4)How will you handle missing values in the dataset before the mining
process? Explain.

Common ways of handling missing values in the data set before mining (approaches 2-4 are illustrated in the sketch after this list):

1. Ignore the data row.
2. Use a global constant to fill in for missing values.
3. Use the attribute mean.
4. Use the attribute mean for all samples belonging to the same class.
5. Use a data mining algorithm to predict the most probable value.
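A hedged illustration of approaches 2-4 with pandas; the DataFrame, the 'income' attribute, and the 'class' column are assumptions made for illustration.

import pandas as pd

df = pd.DataFrame({"class": ["A", "A", "B", "B", "B"],
                   "income": [30000.0, None, 52000.0, None, 48000.0]})

# 2. Fill with a global constant
df["income_const"] = df["income"].fillna(-1)

# 3. Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 4. Fill with the mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)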

5. (CO1)Outline the characteristics of data warehouses and define metadata.


Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata.
Metadata is the roadmap to a data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory.

Data warehouses are characterized by being:
1. Subject-oriented: A data warehouse typically provides information on a topic (such
as a sales inventory or supply chain) rather than company operations.
2. Time-variant: Time variant keys (e.g., for the date, month, time) are typically present.
3. Integrated: A data warehouse combines data from various sources. These may
include a cloud, relational databases, flat files, structured and semi-structured data,
metadata, and master data. The sources are combined in a manner that’s
consistent, relatable, and ideally certifiable, providing a business with confidence in
the data’s quality.
4. Persistent and non-volatile: Prior data isn’t deleted when new data is added.
Historical data is preserved for comparisons, trends, and analytics.

Section – B (Very Short answers)

1. (CO1) Define Data Mining.

Data mining is the process of analyzing a large batch of information to discern trends and
patterns. Data mining can be used by corporations for everything from learning about what
customers are interested in or want to buy to fraud detection and spam filtering.
OR
It is the process of finding patterns and correlations within large data sets to identify
relationships between data. Data mining tools allow a business organization to predict
customer behavior. Data mining tools are used to build risk models and detect fraud.

2. (CO2) State the different layers of the data warehouse.

Data Source Layer


The Data Source Layer is where data from the source systems first arrives before being sent on to the other layers for the desired operations.

Data Staging Layer


Step #1: Data Extraction
Step #2: Landing Database
Step #3: Staging Area
Step #4: ETL

Data Storage Layer


The processed data is stored in the Data Warehouse.

Data Presentation Layer


This Layer is where the users get to interact with the data stored in the data warehouse.

3. (CO3) What is the need for data preprocessing?

Data Preprocessing is required because:


Real world data are generally:
Incomplete: Missing attribute values, missing certain attributes of importance, or having only
aggregate data
Noisy: Containing errors or outliers
Inconsistent: Containing discrepancies in codes or names

4. (CO4) State the important terms used in association rule mining.

1. Support: Support is the fraction of transactions in which an item or itemset appears, out of the total number of transactions.
2. Confidence: Confidence is the conditional probability of occurrence of the consequent (then) given the occurrence of the antecedent (if).
3. Lift: Lift is the ratio of the rule's confidence to the support of the consequent. It tells how much more likely an item is to be purchased when another item is purchased (these measures are computed in the sketch below).
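A small hedged sketch computing these measures for the rule {bread} → {butter} over a handful of made-up transactions; plain Python is used, and the items and numbers are purely illustrative.

# Toy market-basket transactions (illustrative only)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "milk"},
]
n = len(transactions)

support_bread = sum("bread" in t for t in transactions) / n
support_butter = sum("butter" in t for t in transactions) / n
support_both = sum({"bread", "butter"} <= t for t in transactions) / n

confidence = support_both / support_bread   # P(butter | bread)
lift = confidence / support_butter          # > 1 suggests a positive association

print(support_both, confidence, lift)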

5. (CO5) Define classification model.

A classification model tries to draw conclusions from the input values given for training and then predicts the class labels/categories for new data. A feature is an individual measurable property of the phenomenon being observed.

Data preprocessing is a data mining technique that is used to transform raw data into a useful and efficient format.
