
Volume: 03, June 2014, Pages: 361-364

International Journal of Data Mining Techniques and Applications

ISSN: 2278-2419

Analysis of Classification Algorithm in Data Mining

R. Aruna devi¹, K. Nirmala²

¹ Research Scholar, Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu, India
² Associate Professor, Department of Computer Science, Quaid-E-Millath Government College for Women (A), Chennai, Tamil Nadu, India
[email protected], [email protected]

ABSTRACT
Data mining is the extraction of hidden predictive information from large databases. Classification is the process of finding a model that describes and distinguishes data classes or concepts. This paper studies the prediction of class labels using the C4.5 and naïve Bayesian algorithms. C4.5 generates classifiers expressed as decision trees from a fixed set of examples. The resulting tree is used to classify future samples. The leaf nodes of the decision tree contain the class name, whereas a non-leaf node is a decision node. The decision node is an attribute test, with each branch (to another decision tree) being a possible value of the attribute. C4.5 uses information gain to help it decide which attribute goes into a decision node. A naïve Bayesian classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naïve) independence assumptions. The naïve Bayesian classifier assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. The results indicate that predicting the class label using the naïve Bayesian classifier is very effective and simple compared to the C4.5 classifier.
Keywords: Data Mining, Classification, Naïve Bayesian Classifier, Entropy
I. INTRODUCTION
Data mining is the extraction of implicit, previously unknown, and potentially useful information from large databases. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans. Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database [1]. Predictive mining tasks perform inference on the current data in order to make predictions. Classification is the process of finding a model that describes and distinguishes data classes or concepts. The goal of data mining is to extract knowledge from a data set in a human-understandable structure; it involves database and data management, data preprocessing, model and inference considerations, complexity considerations, post-processing of found structures, visualization and online updating. The actual data-mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining).

A primary reason for using data mining is to assist in the analysis of collections of observations of behavior. Data mining involves six common classes of tasks:
(1) Anomaly detection: the identification of unusual data records that might be interesting, or data errors that require further investigation.
(2) Association rule learning: searches for relationships between variables.
(3) Clustering: the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
(4) Classification: the task of generalizing known structure to apply to new data.
(5) Regression: attempts to find a function that models the data with the least error.
(6) Summarization: providing a more compact representation of the data set, including visualization and report generation.

Integrated Intelligent Research (IIR)


II. C4.5 ALGORITHM

The C4.5 algorithm was introduced by Quinlan for inducing classification models, also called decision trees [13]. We are given a set of records. Each record has the same structure, consisting of a number of attribute/value pairs. One of these attributes represents the category of the record. The problem is to determine a decision tree that, on the basis of answers to questions about the non-category attributes, correctly predicts the value of the category attribute. Usually the category attribute takes only the values {true, false}, {success, failure}, or something equivalent.

The C4.5 algorithm can be summarized as follows:
Step 1: Given a set S of cases, C4.5 first grows an initial tree using the concept of information entropy. The training data is a set S = S1, S2, ... of already classified samples. Each sample Si = X1, X2, ... is a vector where X1, X2, ... represent attributes or features of the sample. The training data is augmented with a vector C = c1, c2, ... where c1, c2, ... represent the class to which each sample belongs.
Step 2: At each node of the tree, C4.5 chooses the one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision.
Step 3: Create a decision node based on the best attribute.
Step 4: Apply the same procedure recursively to the subsets produced by the split.

III. NAÏVE BAYESIAN

A naïve Bayesian classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naïve) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model" [2]. In simple terms, a naïve Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable.

Depending on the precise nature of the probability model, naïve Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naïve Bayes models uses the method of maximum likelihood; in other words, one can work with the naïve Bayes model without believing in Bayesian probability or using any Bayesian methods. In spite of their naïve design and apparently over-simplified assumptions, naïve Bayes classifiers have worked quite well in many complex real-world situations.

Given data sets with many attributes, it would be extremely computationally expensive to compute P(X/Ci). To reduce computation in evaluating P(X/Ci), the naïve assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,

$$P(X/C_i) = \prod_{k=1}^{n} P(X_k/C_i) = P(X_1/C_i) \times P(X_2/C_i) \times \cdots \times P(X_n/C_i)$$

We can easily estimate the probabilities P(X1/Ci), P(X2/Ci), ..., P(Xn/Ci) from the training tuples. Recall that here Xk refers to the value of attribute Ak for tuple X.
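As a rough illustration of estimating these probabilities from training tuples, the sketch below counts relative frequencies within a class and multiplies them under the independence assumption. The tuples, the attribute names and the helper names `conditional` and `likelihood` are hypothetical; they are not the paper's Table 1.

```python
# Hypothetical training tuples: (attribute values, class label).
# Illustrative only; these are NOT the tuples of the paper's Table 1.
TRAINING = [
    ({"dept": "sales", "age": "31..35"}, "senior"),
    ({"dept": "sales", "age": "26..30"}, "junior"),
    ({"dept": "systems", "age": "26..30"}, "junior"),
    ({"dept": "systems", "age": "31..35"}, "senior"),
    ({"dept": "systems", "age": "26..30"}, "junior"),
]


def conditional(training, attribute, value, label):
    """Estimate P(Xk/Ci) as a relative frequency within class Ci."""
    in_class = [attrs for attrs, cls in training if cls == label]
    matches = sum(1 for attrs in in_class if attrs[attribute] == value)
    return matches / len(in_class)


def likelihood(training, x, label):
    """P(X/Ci) as the product of the per-attribute estimates."""
    result = 1.0
    for attribute, value in x.items():
        result *= conditional(training, attribute, value, label)
    return result


x = {"dept": "systems", "age": "26..30"}
print(likelihood(TRAINING, x, "junior"))  # (2/3) * (3/3)
print(likelihood(TRAINING, x, "senior"))  # (1/2) * (0/2)
```

Each factor needs only a count over the tuples of one class, which is why the product is cheap to evaluate compared with estimating the joint distribution of all attributes.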
IV. DATASET DESCRIPTION
The main objective of this paper is to use classification algorithms to predict the class label using the C4.5 classifier and the Bayesian classifier on a large dataset. The model used in this paper predicts the status of a tuple whose department is system, whose age is 26..30 years, and whose income is 46..50K [2].
Table 1: Department


V. EXPERIMENTAL RESULTS AND DISCUSSIONS

In this paper we study C4.5 and naïve Bayesian classification, given the training data in Table 1. The data tuples are described by the attributes department, status, age and salary. The class label attribute, status, has two distinct values, namely {Senior, Junior}. Let C1 correspond to the class Status=Junior and C2 correspond to Status=Senior. The tuple we wish to classify is

X = (dept=system, age=26..30, salary=46..50k)

5.1 C4.5 ALGORITHM

C4.5 uses a statistical property called information gain. Gain measures how well a given attribute separates training examples into the target classes. The attribute with the highest information gain is selected as the test attribute. To define gain, we first borrow an idea from information theory called entropy. Entropy measures the amount of information in an attribute. In the definitions of entropy and gain, $A_i$ is the i-th possible value of attribute A, and $S_{A_i}$ is the subset of S containing all items where the value of A is $A_i$.
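A minimal Python sketch of these two quantities, using the standard definitions: entropy is the negated sum of proportion times binary logarithm of proportion over the class labels, and gain is the entropy of the whole set minus the weighted entropies of the subsets produced by a split. The rows and labels are hypothetical, and the sketch computes plain information gain without the normalization that C4.5 applies when choosing a split.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy of the full set minus the weighted entropy after the split."""
    total = len(labels)
    # Group the class labels by the value each row takes for the attribute.
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(
        (len(subset) / total) * entropy(subset) for subset in subsets.values()
    )
    return entropy(labels) - remainder


# Hypothetical toy data: "dept" separates the classes perfectly, "age" not at all.
rows = [
    {"dept": "systems", "age": "26..30"},
    {"dept": "systems", "age": "31..35"},
    {"dept": "sales", "age": "26..30"},
    {"dept": "sales", "age": "31..35"},
]
labels = ["junior", "junior", "senior", "senior"]

print(information_gain(rows, labels, "dept"))  # 1.0
print(information_gain(rows, labels, "age"))   # 0.0
```

On this toy data a tree builder would test `dept` first, since it has the highest gain.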

$$E(S) = -\sum_{j=1}^{n} f_S(j) \log_2 f_S(j)$$

Where:
$E(S)$ is the information entropy of the set S;
$n$ is the number of different values of the attribute in S (entropy is computed for one chosen attribute);
$f_S(j)$ is the frequency (proportion) of the value j in the set S;
$\log_2$ is the binary logarithm.
An entropy of 0 identifies a perfectly classified set.

$$G(S, A) = E(S) - \sum_{i=1}^{m} f_S(A_i) E(S_{A_i})$$

Where:
$G(S, A)$ is the gain of the set S after a split over the attribute A;
$E(S)$ is the information entropy of the set S;
$m$ is the number of different values of the attribute A in S;
$f_S(A_i)$ is the frequency (proportion) of the items possessing $A_i$ as value for A in S.

5.2 NAÏVE BAYESIAN CLASSIFIER

Given data sets with many attributes, it would be extremely computationally expensive to compute P(X/Ci). We need to maximize P(X/Ci) × P(Ci) for i = 1, 2. P(Ci), the prior probability of each class, can be computed from the training tuples:
P(status=senior) = 5/11 = 0.455
P(status=junior) = 6/11 = 0.545
To compute P(X/Ci) for i = 1 to 2, we compute the following conditional probabilities:
P(dept=system/status=senior) = 2/5 = 0.4
P(dept=system/status=junior) = 2/6 = 0.33
P(age=26..30/status=senior) = 0/5 = 0
P(age=26..30/status=junior) = 3/6 = 0.5
P(salary=46k..50k/status=senior) = 2/5 = 0.4
P(salary=46k..50k/status=junior) = 2/6 = 0.33
Using the above probabilities, we obtain
P(X/status=junior) = 0.33 × 0.5 × 0.33 = 0.054

Similarly,
P(X/status=senior) = 0.4 × 0 × 0.4 = 0
To find the class Ci that maximizes P(X/Ci) × P(Ci), we compute
P(X/status=senior) × P(status=senior) = 0 × 0.455 = 0
P(X/status=junior) × P(status=junior) = 0.054 × 0.545 = 0.0294
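The arithmetic above can be checked with a short Python sketch. The priors and conditional probabilities are the rounded values reported in the text; the dictionary layout and the helper name `score` are assumptions made for illustration. The junior score comes out near 0.0297 rather than 0.0294 because the text rounds the intermediate product 0.05445 to 0.054 before multiplying by the prior.

```python
# Priors and conditional probabilities as reported in the text (already rounded).
priors = {"senior": 0.455, "junior": 0.545}
conditionals = {
    "senior": {"dept=system": 0.4, "age=26..30": 0.0, "salary=46k..50k": 0.4},
    "junior": {"dept=system": 0.33, "age=26..30": 0.5, "salary=46k..50k": 0.33},
}


def score(label):
    """P(X/Ci) * P(Ci) under class conditional independence."""
    likelihood = 1.0
    for probability in conditionals[label].values():
        likelihood *= probability
    return likelihood * priors[label]


scores = {label: score(label) for label in priors}
predicted = max(scores, key=scores.get)  # class with the largest score
print(scores)
print(predicted)
```

The single zero factor for age=26..30 forces the senior score to 0 regardless of the other attributes.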

Therefore, the naïve Bayesian classification predicts status=junior for tuple X, since this class maximizes P(X/Ci) × P(Ci).

5.3 COMPARISON AND RESULTS

For the comparison, we first used the C4.5 classification algorithm, which derives its classes from a fixed set of training instances. The classes created by C4.5 are inductive; that is, given a small set of training instances, the specific classes created by C4.5 are expected to work for all future instances.
Second, we used the naïve Bayesian classification algorithm. An advantage of the naïve Bayesian classifier is that it requires only a small amount of training data to estimate the parameters necessary for classification. Because the variables are assumed to be independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix. It handles missing values by ignoring the instance. It handles quantitative and discrete data. The naïve Bayesian algorithm is very fast and space efficient.

VI. CONCLUSION AND FUTURE DEVELOPMENT

This paper presents a comparative study of two classification algorithms. The naïve Bayesian model is tremendously appealing because of its simplicity, elegance, and robustness. The results indicate that the naïve Bayesian classifier is very effective and simple compared to C4.5. A large number of modifications have been introduced by the statistical, data mining, machine learning, and pattern recognition communities in an attempt to make it more flexible.

REFERENCES

[1] A. K. Pujari, Data Mining Techniques, University Press, India, 2001.


[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques.
[3] S. N. Sivanandam and S. Sumathi, Data Mining: Concepts, Tasks and Techniques, Thomson Business Information India Pvt. Ltd., India, 2006.
[4] H. Wang, W. Fan, P. Yu, and J. Han, Mining concept-drifting data streams using ensemble classifiers.
[5] V. Ganti, J. Gehrke, R. Ramakrishnan, and W. Loh, Mining data streams under block evolution.
[6] N. Friedman, D. Geiger, and M. Goldszmidt (1997), Bayesian network classifiers.
[7] F. Jensen, An Introduction to Bayesian Networks.
[8] Murthy, Automatic Construction of Decision Trees from Data.
[9] Website: www.cs.umd.edu/~samir/498/10Algorithms-08.pdf
[10] Website: www.hkws.org/seminar/sem4332006-2007-no69.pdf
[11] Website: en.wikipedia.org/wiki/Data_mining
[12] http://en.wikipedia.org/wiki/ID3_algorithm
[13] J. R. Quinlan (1993), C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo.
