Classification Unit 3
Classification:
Classification is the process of finding a good model that describes the data classes or
concepts, and the purpose of classification is to predict the class of objects whose class
label is unknown. In simple terms, we can think of classification as categorizing incoming new
data based on the assumptions we have already made and the data we already have with us.
Prediction:
We can think of prediction as estimating something that may happen in the future. In
prediction, we identify or estimate the missing or unavailable value for a new observation,
based on the data we already have and on assumptions about the future. In prediction, the
output is a continuous value.
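To make the distinction concrete, here is a minimal sketch (assuming scikit-learn is available; the hours-studied feature, the labels, and the scores are made-up toy data). The classifier outputs a discrete class label, while the regressor outputs a continuous value, which is what prediction refers to here.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

hours = [[1], [2], [3], [8], [9], [10]]        # single feature: hours studied (hypothetical)
passed = [0, 0, 0, 1, 1, 1]                    # class labels -> classification
score = [35.0, 40.0, 48.0, 82.0, 88.0, 95.0]   # continuous target -> prediction

# Classification: the output is one of the known class labels.
clf = DecisionTreeClassifier().fit(hours, passed)
print(clf.predict([[7]]))   # e.g. [1], a discrete class

# Prediction: the output is a continuous, numeric value.
reg = DecisionTreeRegressor().fit(hours, score)
print(reg.predict([[7]]))   # e.g. [82.], an estimated continuous value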
Difference between Prediction and Classification:
Prediction: e.g., predicting the correct treatment for a particular disease for an individual
person can be considered prediction.
Classification: e.g., the grouping of patients based on their medical records can be considered
classification.
The model used to predict the unknown value is called a predictor, whereas the model used to
classify the unknown value is called a classifier.
Accuracy: Accuracy of the classifier refers to the ability of the classifier to predict the class
label correctly, and the accuracy of the predictor refers to how well a given predictor can
estimate the unknown value.
Speed: The speed of the method depends on the computational cost of generating and using
the classifier/predictor.
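As a small illustration of the accuracy criterion, the following sketch (assuming scikit-learn is available; the labels are made up) compares predicted class labels with the true labels:

from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]   # actual class labels (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1]   # labels produced by a classifier

# Accuracy = fraction of tuples whose class label is predicted correctly.
print(accuracy_score(y_true, y_pred))   # 5 of 6 correct -> about 0.833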
Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
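The minimal sketch below (a made-up weather-style dataset; scikit-learn is assumed to be available) shows this structure: the printed tree consists of internal nodes that test features, branches for the test outcomes, and leaf nodes holding the predicted class.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [outlook_is_sunny, humidity]; target: play (1 = yes, 0 = no)
X = [[1, 85], [1, 90], [0, 78], [0, 65], [1, 70], [0, 96]]
y = [0, 0, 1, 1, 1, 0]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Internal nodes test a feature, branches are outcomes, leaves are class labels.
print(export_text(tree, feature_names=["outlook_is_sunny", "humidity"]))
print(tree.predict([[1, 60]]))   # classify a new, unlabeled tuple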
Attribute Selection Measures
An attribute selection measure is a heuristic for choosing the splitting test that “best”
separates a given data partition, D, of class-labeled training tuples into individual classes.
If we split D into smaller partitions according to the outcomes of the splitting criterion, ideally
each partition will be pure (i.e., all of the tuples that fall into a given partition belong to the
same class).
Conceptually, the "best" splitting criterion is the one that most closely results in such a
scenario.
Attribute selection measures are also known as splitting rules because they determine how the
tuples at a given node are to be split.
The attribute selection measure provides a ranking for each attribute describing the given
training tuples. The attribute having the best score for the measure is chosen as the splitting
attribute for the given tuples.
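As a rough sketch of one such measure, the code below computes information gain (an entropy-based criterion) for two hypothetical attributes of a made-up play/don't-play dataset and selects the attribute with the best score as the splitting attribute:

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of class labels.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    # Information gain obtained by splitting `rows` on attribute `attr`.
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

data = [  # hypothetical class-labeled training tuples
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "rain",     "windy": False, "play": "yes"},
    {"outlook": "rain",     "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "overcast", "windy": True,  "play": "yes"},
]

gains = {a: info_gain(data, a, "play") for a in ("outlook", "windy")}
print(gains)                       # ranking of the candidate attributes
print(max(gains, key=gains.get))   # best-scoring attribute becomes the splitting attribute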
Tree Pruning
Pruning is a procedure that reduces the size of a decision tree. It lowers the risk of overfitting
by limiting the size of the tree or removing sections of the tree that provide little classification
power. Pruning helps by trimming branches that reflect anomalies in the training data caused
by noise or outliers, and it adjusts the original tree in a way that improves the generalization
performance of the tree.
Pruning methods generally use statistical measures to remove the least reliable branches,
frequently resulting in faster classification and an improvement in the ability of the tree to
correctly classify independent test data.
Pruning means changing the model by deleting the child nodes of a branch node; the pruned
node is then regarded as a leaf node. Leaf nodes themselves cannot be pruned.
A decision tree consists of a root node, several branch nodes, and several leaf nodes. The root
node represents the top of the tree. It does not have a parent node; however, it has one or
more child nodes.
Branch nodes are in the middle of the tree. A branch node has a parent node and several child
nodes.
Leaf nodes represent the bottom of the tree. A leaf node has a parent node but does not have
any child nodes.
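As a rough, practical sketch of pruning (using scikit-learn's cost-complexity pruning via the ccp_alpha parameter on generated toy data; the alpha value is an arbitrary assumption), increasing alpha removes more of the weak branches and yields a smaller tree:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data, so the fully grown tree tends to overfit.
X, y = make_classification(n_samples=200, n_features=5, flip_y=0.2, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# Pruning removes unreliable branches, so the pruned tree has fewer leaf nodes.
print(unpruned.get_n_leaves(), pruned.get_n_leaves())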
Scalability in data mining
Scalability in data mining refers to the ability of a data mining algorithm to handle large
amounts of data efficiently and effectively. This means that the algorithm should be able to
process the data in a timely manner, without sacrificing the quality of the results. In other
words, a scalable data mining algorithm should be able to handle an increasing amount of data
without requiring a significant increase in computational resources. This is important because
the amount of data available for analysis is growing rapidly, and the ability to process that data
quickly and accurately is essential for making informed decisions.
Vertical Scalability
Vertical scalability, also known as scale-up, refers to the ability of a system or algorithm to handle an increase in
workload by adding more computational resources, such as faster processors or more memory. This is in contrast
to horizontal scalability, which involves adding more machines to a distributed computing system to handle an
increase in workload.
Vertical scalability can be an effective way to improve the performance of a system or algorithm, particularly for
applications that are limited by the computational resources available to them. By adding more resources, a
system can often handle more data or perform more complex calculations, which can improve the speed and
accuracy of the results. However, there are limitations to vertical scalability, and at some point, adding more
resources may not result in a significant improvement in performance.
Horizontal Scalability
Horizontal scalability, also known as scale-out, refers to the ability of a system or algorithm to handle an increase
in workload by adding more machines to a distributed computing system. This is in contrast to vertical scalability,
which involves adding more computational resources, such as faster processors or more memory, to a single
machine.
Horizontal scalability can be an effective way to improve the performance of a system or algorithm, particularly
for applications that require a lot of computational power. By adding more machines to the system, the workload
can be distributed across multiple machines, which can improve the speed and accuracy of the results. However,
there are limitations to horizontal scalability, and at some point, adding more machines may not result in a
significant improvement in performance. Additionally, horizontal scalability can be more complex to implement
and manage than vertical scalability.
Decision Tree Induction in Data Mining
Decision tree induction is a common technique in data mining that is used to generate a predictive
model from a dataset. This technique involves constructing a tree-like structure, where each internal
node represents a test on an attribute, each branch represents the outcome of the test, and each leaf
node represents a prediction. The goal of decision tree induction is to build a model that can accurately
predict the outcome of a given event, based on the values of the attributes in the dataset.
To build a decision tree, the algorithm first selects the attribute that best splits the data into distinct
classes. This is typically done using a measure of impurity, such as entropy or the Gini index, which
measures the degree of disorder in the data. The algorithm then repeats this process for each branch of
the tree, splitting the data into smaller and smaller subsets until all of the data is classified.
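The following is a simplified, self-contained sketch of that recursive process, using the Gini index as the impurity measure on made-up categorical data (real implementations add stopping criteria, handling of numeric attributes, and pruning):

from collections import Counter

def gini(labels):
    # Gini index: impurity of a set of class labels.
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    # Stop when the partition is pure or no attributes remain: create a leaf.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute whose split gives the lowest weighted impurity.
    def weighted_gini(a):
        return sum(
            len(sub) / len(rows) * gini([r[target] for r in sub])
            for v in {r[a] for r in rows}
            for sub in [[r for r in rows if r[a] == v]]
        )
    best = min(attrs, key=weighted_gini)
    # Recurse: one branch per value of the chosen splitting attribute.
    return {best: {v: build_tree([r for r in rows if r[best] == v],
                                 [a for a in attrs if a != best], target)
                   for v in {r[best] for r in rows}}}

data = [  # hypothetical class-labeled training tuples
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "rain",     "windy": False, "play": "yes"},
    {"outlook": "rain",     "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
]

print(build_tree(data, ["outlook", "windy"], "play"))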
Decision tree induction is a popular technique in data mining because it is easy to understand and
interpret, and it can handle both numerical and categorical data. Additionally, decision trees can handle
large amounts of data, and they can be updated with new data as it becomes available. However,
decision trees can be prone to overfitting, where the model becomes too complex and does not
generalize well to new data. As a result, data scientists often use techniques such as pruning to simplify
the tree and improve its performance.
Decision Tree Induction in Data Mining example