Data Mining Unit 3

Unit 3

Classification and prediction


• Basic concepts
• Decision tree induction
• Bayesian classification
• Rule-based classification
• Lazy learner
Classification and prediction
• There are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. These two forms are as follows −
• Classification
• Prediction
Classification models predict categorical class labels, while prediction models predict continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to estimate the expenditure in dollars of potential customers on computer equipment, given their income and occupation.
What is classification?
• Following are examples of cases where the data analysis task is Classification −
• A bank loan officer wants to analyze loan application data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to determine whether a customer with a given profile will buy a new computer.
• In both of the above examples, a model or classifier is constructed to predict categorical labels. These labels are "risky" or "safe" for the loan application data and "yes" or "no" for the marketing data.
What is prediction?
• Following are examples of cases where the data analysis task is Prediction −
• Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we need to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function or ordered value.
• Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.
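As an illustration of numeric prediction, the sketch below fits a simple regression model to estimate customer spend from income. The income and spend figures are made up for illustration; only the general scikit-learn API is assumed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: annual income (in $1000s) and the amount spent
# on computer equipment (in $). The values are illustrative only.
income = np.array([[30], [45], [60], [75], [90], [120]])
spend = np.array([400, 650, 900, 1100, 1400, 1900])

# Fit a linear regression model, i.e., a simple numeric predictor.
predictor = LinearRegression().fit(income, spend)

# Predict the expenditure of a new customer with an income of $80k.
print(predictor.predict(np.array([[80]])))  # a continuous value, not a class label
```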
How Does Classification Work?
• With the help of the bank loan application example discussed above, let us understand how classification works. The data classification process includes two steps −
• Building the Classifier or Model
• Using Classifier for Classification
Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the
classifier.
• The classifier is built from the training set made up of
database tuples and their associated class labels.
• Each tuple in the training set belongs to a predefined category or class, indicated by its class label attribute. The tuples themselves are also referred to as samples, examples, objects, or data points.
Using Classifier for Classification

• In this step, the classifier is used for classification.
• Here the test data is used to estimate the accuracy of the classification rules.
• The classification rules can be applied to new data tuples if the accuracy is considered acceptable.
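A minimal sketch of this two-step process using scikit-learn: the classifier is built (learned) from labeled training tuples, and held-out test tuples are then used to estimate its accuracy before it is applied to new data. The built-in dataset and the 70/30 split are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled tuples: feature vectors X and their associated class labels y.
X, y = load_iris(return_X_y=True)

# Step 1: build the classifier from a training set (70% of the tuples here).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: use held-out test tuples to estimate accuracy; if it is acceptable,
# the classifier can be applied to new, unlabeled tuples.
print("estimated accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```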
Classification and Prediction Issues
• The major issue is preparing the data for Classification and Prediction.
Preparing the data involves the following activities −
• Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute (see the sketch after this list).
• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.
• Data Transformation and Reduction − The data can be transformed by any of the following methods.
• Normalization − The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is useful when, in the learning step, neural networks or methods involving distance measurements are used.
• Generalization − The data can also be transformed by generalizing it to higher-level concepts. For this purpose we can use concept hierarchies.
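A small sketch, using pandas, of the preparation steps just listed: replacing a missing value with the attribute's most frequent value (data cleaning), checking whether two attributes are correlated (relevance analysis), and min-max normalization (data transformation). The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical loan-application data with a missing value in 'employment'.
df = pd.DataFrame({
    "income":      [30, 45, 60, 75, 90],
    "loan_amount": [5, 10, 12, 20, 25],
    "employment":  ["salaried", "salaried", None, "self-employed", "salaried"],
})

# Data cleaning: fill the missing value with the most common value of that attribute.
df["employment"] = df["employment"].fillna(df["employment"].mode()[0])

# Relevance analysis: correlation between two numeric attributes.
print(df["income"].corr(df["loan_amount"]))

# Data transformation: min-max normalization scales 'income' into the range [0, 1].
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
print(df)
```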
Comparison of Classification and Prediction Methods
• Here are the criteria for comparing methods of Classification and Prediction −
• Accuracy − Accuracy of a classifier refers to its ability to predict the class label correctly; accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the classifier or predictor.
• Robustness − This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently given a large amount of data.
• Interpretability − This refers to the level of understanding and insight that the classifier or predictor provides.
Classification by Decision Tree Induction
• Decision tree induction is the learning of decision trees from class-
labeled training tuples.
• A decision tree is a flowchart-like tree structure, where each internal
node (nonleaf node) denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf node (or terminal
node) holds a class label. The topmost node in a tree is the root node.
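To make this structure concrete, the sketch below trains a small decision tree and prints it in flowchart form: each internal node is a test on an attribute, each branch an outcome of the test, and each leaf a class label. The built-in dataset and depth limit are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each "<=" line is a test on an attribute (internal node); the indented lines
# below it are the branches (test outcomes); "class:" lines are the leaf nodes.
print(export_text(tree, feature_names=list(iris.feature_names)))
```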
Decision Tree Induction
• During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in
machine learning, developed a decision tree algorithm known as ID3 (Iterative
Dichotomiser).
• Quinlan later presented C4.5 (a successor of ID3), which became a benchmark
to which newer supervised learning algorithms are often compared.
• Classification and Regression Trees (CART), published by L. Breiman, J. Friedman, R. Olshen, and C. Stone, described the generation of binary decision trees.
• ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which
decision trees are constructed in a top-down recursive divide-and-conquer
manner.
Information gain
• ID3 uses information gain as its attribute selection measure.
• Let node N represent or hold the tuples of partition D. The attribute
with the highest information gain is chosen as the splitting attribute
for node N. This attribute minimizes the information needed to
classify the tuples in the resulting partitions and reflects the least
randomness or “impurity” in these partitions.
• The expected information needed to classify a tuple in D is given by

Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)

where p_i is the probability that an arbitrary tuple in D belongs to class C_i and m is the number of classes. Info(D) is just the average amount of information needed to identify the class label of a tuple in D; it is also known as the entropy of D.
• Now, suppose we were to partition the tuples in D on some attribute
A having v distinct values, {a1, a2,..., av}, as observed from the
training data. If A is discrete-valued, these values correspond directly
to the v outcomes of a test on A. Attribute A can be used to split D
into v partitions or subsets, {D1, D2,..., Dv}, where Dj contains those
tuples in D that have outcome aj of A. These partitions would
correspond to the branches grown from node N.
• How much more information would we still need (after the partitioning) in order to arrive at an exact classification? This amount is measured by

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

The term |D_j| / |D| acts as the weight of the jth partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
• Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) − Info_A(D)
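A short sketch of these formulas in Python: Info(D) is computed from the class distribution, Info_A(D) from the weighted partitions, and the gain is their difference. The tuples and labels below are made-up illustrative data.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Gain(attr) = Info(D) - Info_attr(D) for a discrete-valued attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

# Hypothetical tuples (age, income) with their class labels.
rows   = [("youth", "high"), ("youth", "low"), ("senior", "low"), ("senior", "high")]
labels = ["no", "yes", "yes", "no"]
print(info_gain(rows, 0, labels))  # gain of splitting on 'age'
print(info_gain(rows, 1, labels))  # gain of splitting on 'income'
```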
Gain ratio
• The information gain measure is biased toward tests with many outcomes.
That is, it prefers to select attributes having a large number of values.
• For example, consider an attribute that acts as a unique identifier, such as
product ID.
• A split on product ID would result in a large number of partitions (as many
as there are values), each one containing just one tuple.
• Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_{product ID}(D) = 0. Therefore, the information gained by partitioning on this attribute is maximal.
• Clearly, such a partitioning is useless for classification.
• C4.5, a successor of ID3, uses an extension to information gain known as the gain ratio, which normalizes the information gain by the split information of the attribute.
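A minimal sketch of the gain ratio, assuming the standard C4.5 definition GainRatio(A) = Gain(A) / SplitInfo_A(D), where SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j|/|D|) log2(|D_j|/|D|). It continues the previous sketch and reuses the Counter and log2 imports and the info_gain, rows, and labels defined there.

```python
def split_info(rows, attr_index):
    """SplitInfo_A(D): potential information generated by splitting on the attribute."""
    n = len(rows)
    counts = Counter(row[attr_index] for row in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain_ratio(rows, attr_index, labels):
    # Normalize the information gain by the split information.
    si = split_info(rows, attr_index)
    return info_gain(rows, attr_index, labels) / si if si > 0 else 0.0

print(gain_ratio(rows, 1, labels))  # gain ratio of splitting on 'income'
```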
Gini index
• The Gini index is used in CART and considers only binary splits. It measures the impurity of a data set D of training tuples as

Gini(D) = 1 − Σ_{i=1}^{m} p_i^2

where p_i is the probability that a tuple in D belongs to class C_i. For a binary split of D by attribute A into partitions D_1 and D_2, the Gini index is given by:

Gini_A(D) = (|D_1| / |D|) Gini(D_1) + (|D_2| / |D|) Gini(D_2)

The attribute (and split point) that minimizes Gini_A(D) is chosen as the splitting attribute.
Example:
Let's consider the dataset below and build a decision tree using the Gini index.

Index   A     B     C     D     E
1       4.8   3.4   1.9   0.2   positive
2       5.0   3.0   1.6   1.2   positive
3       5.0   3.4   1.6   0.2   positive
4       5.2   3.5   1.5   0.2   positive
5       5.2   3.4   1.4   0.2   positive
6       4.7   3.2   1.6   0.2   positive
7       4.8   3.1   1.6   0.2   positive
8       5.4   3.4   1.5   0.4   positive
9       7.0   3.2   4.7   1.4   negative
10      6.4   3.2   4.7   1.5   negative
11      6.9   3.1   4.9   1.5   negative
12      5.5   2.3   4.0   1.3   negative
13      6.5   2.8   4.6   1.5   negative
14      5.7   2.8   4.5   1.3   negative
15      6.3   3.3   4.7   1.6   negative
16      4.9   2.4   3.3   1.0   negative
• For the Gini index, candidate split values are chosen to convert each continuous attribute into a binary test; the Gini index is then computed for each candidate split, and the attribute and split value with the lowest Gini_A(D) are selected for the node.
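A sketch of the Gini computation on this dataset for one candidate binary split. The threshold of 2.5 on attribute C is just an illustrative choice (it happens to separate the two classes here); in practice every candidate split value of every attribute would be evaluated and the split with the lowest Gini_A(D) chosen.

```python
def gini(labels):
    """Gini(D) = 1 - sum(p_i^2) over the classes present in the partition."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(values, labels, threshold):
    """Gini_A(D) for the binary split 'value <= threshold' vs 'value > threshold'."""
    left  = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Attribute C and class label E from the table above.
C = [1.9, 1.6, 1.6, 1.5, 1.4, 1.6, 1.6, 1.5, 4.7, 4.7, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3]
E = ["positive"] * 8 + ["negative"] * 8

print(gini(E))                # impurity of the whole dataset: 0.5
print(gini_split(C, E, 2.5))  # 0.0 -> a pure split, so "C <= 2.5" is an excellent test
```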
Tree Pruning

• Pruning is the method of removing unwanted branches from the decision tree. Some branches of the decision tree might represent outliers or noisy data.
• Tree pruning reduces the unwanted branches of the tree. This reduces the complexity of the tree and helps in effective predictive analysis. It reduces overfitting because it removes unimportant branches from the tree.
There are two ways of pruning the tree:
• #1) Prepruning: In this approach, the construction of the decision tree is stopped early, i.e., it is decided not to partition the branches further. The last node constructed becomes a leaf node, and this leaf node may hold the most frequent class among the tuples.
• The attribute selection measures are used to assess the goodness of a split. Threshold values are prescribed to decide which splits are regarded as useful. If partitioning a node would result in a split whose measure falls below the threshold, further partitioning is halted.
• #2) Postpruning: This method removes outlier branches from a fully grown tree. The unwanted branches are removed and replaced by a leaf node denoting the most frequent class label. This technique requires more computation than prepruning; however, it is more reliable.
• Pruned trees are more precise and compact than unpruned trees, but decision trees can still suffer from the problems of replication and repetition.
• Repetition occurs when the same attribute is tested again and again along a branch of the tree. Replication occurs when duplicate subtrees are present within the tree. These issues can be addressed by using multivariate splits.
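As an illustration of postpruning, the sketch below uses scikit-learn's cost-complexity pruning: a fully grown tree is compared with a pruned version obtained via the ccp_alpha parameter. This is one concrete postpruning technique, not the only one, and the dataset and alpha value are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown (unpruned) tree.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The cost-complexity pruning path gives candidate alpha values; a larger
# ccp_alpha removes more branches from the fully grown tree.
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alphas[len(alphas) // 2]).fit(X_train, y_train)

print("unpruned: nodes =", full.tree_.node_count, " test accuracy =", full.score(X_test, y_test))
print("pruned:   nodes =", pruned.tree_.node_count, " test accuracy =", pruned.score(X_test, y_test))
```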
