Unit 5 - Data Mining - WWW - Rgpvnotes.in
Subject Name: Data Mining
Subject Code: CS-8003
Semester: 8th
Introduction:
Classification is a data mining function that assigns items in a collection to target categories or
classes. The goal of classification is to accurately predict the target class for each case in the data.
For example, a classification model could be used to identify loan applicants as low, medium, or
high credit risks. Typical application areas of classification include:
1. Medical diagnosis
2. Image and pattern recognition
3. Fault detection
4. Financial market analysis, etc.
There are two forms of data analysis that can be used for extracting models describing important
classes or to predict future data trends. These two forms are as follows −
Classification
Prediction
Classification models predict categorical class labels, whereas prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditure in dollars of potential customers on computer equipment, given their income and occupation.
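To make this distinction concrete, here is a minimal sketch (the loan data is hypothetical and scikit-learn is assumed to be available) in which one model produces a categorical class label and another produces a continuous value.

# Sketch: classification (categorical label) vs. prediction/regression (continuous value).
# The applicant data below is hypothetical; scikit-learn is assumed to be installed.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Each applicant is described by [income_in_thousands, years_employed]
X = [[25, 1], [60, 8], [40, 3], [90, 12]]

# Classification: predict a categorical class label ("risky" / "safe")
clf = DecisionTreeClassifier().fit(X, ["risky", "safe", "risky", "safe"])
print(clf.predict([[55, 5]]))   # -> a class label such as ['safe']

# Prediction: predict a continuous value (expected spend in dollars)
reg = DecisionTreeRegressor().fit(X, [300.0, 1500.0, 700.0, 2200.0])
print(reg.predict([[55, 5]]))   # -> a dollar amount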
There are three general approaches to building a classification model:
1. The first approach divides the space defined by the data points into regions, where each region corresponds to a given class.
2. The second approach is to find the probability that an example belongs to each class.
3. The third approach is to find the probability that a class contains the example.
Decision Tree:
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class, as shown in figure 1.
Decision trees are easy to comprehend, and the learning and classification steps of decision tree induction are simple and fast.
The algorithm needs three parameters: D (the data partition), attribute_list, and Attribute_selection_method. Initially, D is the entire set of training tuples and their associated class labels. attribute_list is a list of attributes describing the tuples. Attribute_selection_method specifies a heuristic procedure for selecting the attribute that “best” discriminates the given tuples according to class.
Input:
Data partition D, which is a set of training tuples and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split point or a splitting subset.
Output:
A decision tree.
Method: Generate_decision_tree(D, attribute_list)
create a node N;
if the tuples in D are all of the same class C then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;  // majority voting
apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
label node N with the splitting_criterion;
for each outcome j of the splitting_criterion
    let Dj be the set of data tuples in D satisfying outcome j;  // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
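In practice, library implementations automate this induction procedure. The following is a minimal sketch (the encoded buy_computer-style data is made up for illustration; scikit-learn is assumed) of training a decision tree and inspecting the structure it learns.

# Sketch: decision tree induction with a library implementation.
# The encoded buy_computer data is hypothetical; scikit-learn is assumed.
from sklearn.tree import DecisionTreeClassifier, export_text

# Attributes: [age (0=youth, 1=middle_aged, 2=senior), student (0/1), credit_rating (0=fair, 1=excellent)]
X = [[0, 0, 0], [0, 1, 0], [1, 0, 1], [2, 1, 0], [2, 0, 1], [1, 1, 1]]
y = ["no", "yes", "yes", "yes", "no", "yes"]   # class label: buys_computer

tree = DecisionTreeClassifier(criterion="entropy")   # information gain as the attribute selection measure
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "student", "credit_rating"]))   # the learned tree
print(tree.predict([[0, 1, 1]]))   # classify a new tuple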
Splitting Criterion
The Gini index is calculated by subtracting the sum of the squared class probabilities from one. It favors larger partitions. Information gain is based on entropy, which multiplies each class probability by the log (base 2) of that probability; information gain tends to favor splits on attributes with many distinct values, which produce many small partitions. Ultimately, you have to experiment with your data and the splitting criterion.
The Gini index measure of goodness is based on a measure of diversity. The best splitter is the one that decreases the diversity of the record sets by the greatest amount; in other words, we want to maximize the reduction in diversity.
The Gini index is used in the classic CART algorithm and is very easy to calculate.
Gini Index:
Gini(D) = 1 − Σ pi², where pi is the proportion of tuples in D that belong to class Ci (subtract the sum of the squared class proportions from 1; this is the Gini index for a branch). For a candidate split, the Gini values of the branches are combined as a sum weighted by branch size.
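As a small sketch (the class counts are hypothetical), both measures can be computed directly from the class proportions of a partition, and a candidate split can be scored as the size-weighted sum of its branch impurities.

# Sketch: Gini index and entropy (the basis of information gain) from class counts.
# The counts used here are hypothetical.
from math import log2

def gini(counts):
    # Gini(D) = 1 - sum(p_i^2) over the class proportions p_i
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    # Info(D) = -sum(p_i * log2(p_i)); information gain is the drop in this value after a split
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

print(gini([9, 5]), entropy([9, 5]))   # impurity of a partition with 9 'yes' and 5 'no' tuples

# Score a candidate split into two branches: lower weighted Gini is better
branches = [[7, 1], [2, 4]]
n = sum(sum(b) for b in branches)
print(sum(sum(b) / n * gini(b) for b in branches))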
Overfitting is a significant practical difficulty for decision tree models and many other predictive models. Overfitting happens when the learning algorithm continues to develop hypotheses that reduce training-set error at the cost of increased test-set error. Overfitted models are incorrect, they require the collection of unnecessary features, and they are more difficult to comprehend. There are two main approaches to avoiding overfitting when building decision trees:
Pre-pruning, which stops growing the tree early, before it perfectly classifies the training set.
Post-pruning, which allows the tree to perfectly classify the training set and then prunes the tree back.
Practically, the second approach of post-pruning overfit trees is more successful because it is not
easy to precisely estimate when to stop growing the tree.
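As a hedged sketch (synthetic data; scikit-learn is assumed and the parameter values are arbitrary), pre-pruning can be approximated by limiting tree growth up front, while cost-complexity pruning is one common way to post-prune a fully grown tree.

# Sketch: pre-pruning vs. post-pruning of a decision tree (synthetic data; illustrative parameters).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier().fit(X_tr, y_tr)                                # unpruned, prone to overfitting
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_tr, y_tr)  # pre-pruning: stop growing early
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_tr, y_tr)                  # post-pruning: cost-complexity pruning
                                                                               # (alpha would normally be chosen by cross-validation)
for name, m in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "train:", m.score(X_tr, y_tr), "test:", m.score(X_te, y_te))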
Once a decision tree has been constructed, it is a simple matter to convert it into an equivalent set of rules, as shown in figure 2.
Converting a decision tree to rules before pruning has three main advantages:
1. Converting to rules allows distinguishing among the different contexts in which a decision node is used. Because each distinct path through the tree produces a distinct rule, a single path can be pruned rather than an entire decision node. If the tree itself were pruned, the only possible actions would be to remove an entire decision node or keep it in its original form.
2. Unlike the tree, the rules do not maintain a distinction between attribute tests that occur near the root of the tree and those that occur near the leaves. This allows pruning to occur without having to consider how to rebuild the tree if a root-level test is removed.
3. Converting to rules improves readability; rules are often easier for people to understand.
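A small sketch of this idea (hypothetical data; scikit-learn is assumed): each root-to-leaf path of a fitted tree is turned into one IF-THEN rule.

# Sketch: converting each root-to-leaf path of a fitted tree into an IF-THEN rule.
# Data and feature names are hypothetical; scikit-learn is assumed.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = ["no", "yes", "yes", "yes"]
feature_names = ["student", "credit_ok"]
model = DecisionTreeClassifier().fit(X, y)

def tree_to_rules(model, feature_names):
    t, classes = model.tree_, model.classes_
    rules = []
    def walk(node, conditions):
        if t.children_left[node] == -1:                      # leaf node: one rule per path
            label = classes[t.value[node][0].argmax()]
            rules.append("IF " + " AND ".join(conditions or ["TRUE"]) + " THEN class = " + str(label))
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node], conditions + ["%s <= %.2f" % (name, thr)])
        walk(t.children_right[node], conditions + ["%s > %.2f" % (name, thr)])
    walk(0, [])
    return rules

for rule in tree_to_rules(model, feature_names):
    print(rule)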
The Naive Bayesian classifier is based on Bayes’ theorem with independence assumptions between predictors. A Naive Bayesian model is easy to build and implement, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly well and is widely used because it often outperforms more sophisticated classification methods.
Algorithm
Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):
P(c|x) = P(x|c) · P(c) / P(x)
The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is called class conditional independence.
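As a minimal sketch (synthetic data; scikit-learn is assumed), the Gaussian variant of Naive Bayes estimates P(c) and the per-feature class-conditional distributions from the training tuples and then applies Bayes’ theorem to classify new tuples.

# Sketch: Naive Bayes classification (Gaussian variant) on synthetic data; scikit-learn is assumed.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

nb = GaussianNB().fit(X_tr, y_tr)                     # estimates P(c) and per-feature P(x|c)
print("accuracy:", nb.score(X_te, y_te))
print("posterior P(c|x) for one tuple:", nb.predict_proba(X_te[:1]))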
Cluster Analysis
Introduction
Cluster analysis is the process of finding groups of objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
Cluster analysis is a multivariate method which aims to classify a sample of subjects (or objects)
on the basis of a set of measured variables into a number of different groups such that similar
subjects are placed in the same group. An example where this might be used is in the field of
psychiatry, where the characterization of patients on the basis of clusters of symptoms can be
useful in the identification of an appropriate form of therapy. In marketing, it may be useful to
identify distinct groups of potential customers so that, for example, advertising can be
appropriately targeted.
The idea of cluster analysis is that we have a set of observations, on which we have available
several measurements. Using these measurements, we want to find out if the observations naturally
group together in some predictable way. For example, we may have recorded physical measurements on many animals, and we want to know whether there is a natural grouping (based, perhaps, on species) that distinguishes the animals from one another. (This use of cluster analysis is sometimes called "numerical taxonomy".)
Cluster is a group of objects that belongs to the same class. In other words, similar objects are
grouped in one cluster and dissimilar objects are grouped in another cluster.
Clustering is the process of making a group of abstract objects into classes of similar objects.
Points to Remember
While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
Clustering can also help marketers discover distinct groups in their customer base and characterize those groups based on their purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize
genes with similar functionalities and gain insight into structures inherent to populations.
Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to house
type, value, and geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit card
fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
Requirements of Clustering
Scalability − We need highly scalable clustering algorithms to deal with large databases.
Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be restricted to distance measures that tend to find only small spherical clusters.
High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional data.
Ability to deal with noisy data − Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.
Clustering Methods
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’
partition of data. Each partition will represent a cluster and k ≤ n. It means that it will classify the
data into k groups, which satisfy the following requirements −
Points to remember −
For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving
objects from one group to other.
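A minimal sketch of this idea with k-means, the classic partitioning algorithm (synthetic 2-D data; scikit-learn is assumed, and k = 3 is an arbitrary choice):

# Sketch: partitioning by iterative relocation with k-means (synthetic data; scikit-learn assumed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)   # k <= n partitions
print(km.labels_[:10])        # cluster assignment of the first objects
print(km.cluster_centers_)    # one representative (centroid) per partition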
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can classify
hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two
approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. We start with each object forming a separate group and keep merging the objects or groups that are close to one another. This continues until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. We start with all of the objects in the same cluster. With each iteration, a cluster is split into smaller clusters. This is done until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
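A minimal sketch of the agglomerative (bottom-up) approach (synthetic data; scikit-learn is assumed; stopping at 3 clusters is an arbitrary termination condition):

# Sketch: agglomerative hierarchical clustering (synthetic data; scikit-learn assumed).
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Each object starts in its own group; close groups are merged until 3 groups remain
agg = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)
print(agg.labels_[:10])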
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
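A minimal sketch of density-based clustering with DBSCAN (synthetic data; scikit-learn is assumed; eps and min_samples are illustrative values):

# Sketch: density-based clustering with DBSCAN (synthetic data; illustrative parameters).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# A point's neighborhood of radius eps must contain at least min_samples points
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))   # clusters of arbitrary shape; label -1 marks noise points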
Grid-based Method
In this method, the objects together form a grid: the object space is quantized into a finite number of cells that form a grid structure.
Advantage
The major advantage of this method is fast processing time, which is dependent only on the number of cells in each dimension of the quantized space.
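A simplified sketch of the grid-based idea (NumPy is assumed; this only illustrates quantizing the object space into cells, not a full grid-based algorithm such as STING):

# Sketch: quantize the object space into a grid of cells and work with per-cell counts.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                      # hypothetical 2-D objects

counts, edges = np.histogramdd(X, bins=(10, 10))   # 10 cells in each dimension
dense_cells = np.argwhere(counts >= 15)            # cells whose density exceeds a chosen threshold
print(len(dense_cells), "dense cells out of", counts.size)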
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given
model. This method locates the clusters by clustering the density function. It reflects spatial
distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
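A minimal sketch of the model-based idea with a Gaussian mixture (synthetic data; scikit-learn is assumed), where the number of clusters is chosen automatically by a standard statistic (BIC):

# Sketch: model-based clustering with a Gaussian mixture; BIC selects the number of components.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

models = [GaussianMixture(n_components=k, random_state=7).fit(X) for k in range(1, 7)]
best = min(models, key=lambda m: m.bic(X))   # lower BIC balances fit against model complexity
print("chosen number of clusters:", best.n_components)
print(best.predict(X[:5]))                   # cluster assignment for the first objects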
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-oriented
constraints. A constraint refers to the user expectation or the properties of desired clustering
results. Constraints provide us with an interactive way of communication with the clustering
process. Constraints can be specified by the user or the application requirement.
The term cluster validation is used to describe the procedure of evaluating the goodness of clustering algorithm results. This is important to avoid finding patterns in random data, as well as in situations where you want to compare two clustering algorithms.
Generally, clustering validation statistics can be categorized into 3 classes (Charrad et al. 2014; Brock et al. 2008; Theodoridis and Koutroumbas 2008):
1. Internal cluster validation, which uses the internal information of the clustering process to
evaluate the goodness of a clustering structure without reference to external information. It
can be also used for estimating the number of clusters and the appropriate clustering algorithm
without any external data.
2. External cluster validation, which consists of comparing the results of a cluster analysis to an externally known result, such as externally provided class labels. It measures the extent to which cluster labels match externally supplied class labels. Since we know the “true” cluster number in advance, this approach is mainly used for selecting the right clustering algorithm for a specific data set.
3. Relative cluster validation, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g., varying the number of clusters k). It is generally used for determining the optimal number of clusters.
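As a hedged sketch (synthetic data; scikit-learn is assumed), the silhouette coefficient is a common internal index, the adjusted Rand index compares a clustering to externally known labels, and running the same algorithm for several values of k is a simple form of relative validation:

# Sketch: internal (silhouette), external (adjusted Rand index), and relative validation.
# Synthetic data; scikit-learn is assumed.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=1)

for k in (2, 3, 4, 5):                                  # relative validation: vary k, compare indices
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(k,
          "silhouette:", round(silhouette_score(X, labels), 3),                     # internal index
          "ARI vs. known labels:", round(adjusted_rand_score(y_true, labels), 3))   # external index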