Analysis of Classification Algorithm in Data Mining
Associate Professor, Department of Computer Science, Quaid-E-Millath Government College for
Women (A), Chennai, Tamil Nadu, India
[email protected], [email protected]
ABSTRACT
Data mining is the extraction of hidden predictive information from large databases. Classification is
the process of finding a model that describes and distinguishes data classes or concepts. This paper
studies the prediction of class labels using the C4.5 and Naive Bayesian algorithms. C4.5
generates classifiers expressed as decision trees from a fixed set of examples. The resulting tree is
used to classify future samples. The leaf nodes of the decision tree contain the class name, whereas a
non-leaf node is a decision node. A decision node is an attribute test, with each branch (to another
decision tree) being a possible value of the attribute. C4.5 uses information gain to decide
which attribute goes into a decision node. A Naive Bayesian classifier is a simple probabilistic
classifier based on applying Bayes' theorem with strong (naive) independence assumptions. The Naive
Bayesian classifier assumes that the effect of an attribute value on a given class is independent of the
values of the other attributes. This assumption is called class conditional independence. The results
indicate that predicting class labels using the Naive Bayesian classifier is very effective and simple
compared to the C4.5 classifier.
Keywords: Data Mining, Classification, Naive Bayesian Classifier, Entropy
I. INTRODUCTION
Data mining is the extraction of implicit, previously unknown, and potentially useful information
from large databases. It uses machine learning, statistical, and visualization techniques to discover
and present knowledge in a form that is easily comprehensible to humans. Data mining functionalities
are used to specify the kind of patterns to be found in data mining tasks. Data mining tasks can be
classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the
general properties of the data in the database [1]. Predictive mining tasks perform inference on the
current data in order to make predictions. Classification is the process of finding a model that
describes and distinguishes data classes or concepts. The goal of data mining is to extract knowledge
from a data set in a human-understandable structure; it involves database and data management, data
preprocessing, model and inference considerations, complexity considerations, post-processing of
found structures, visualization, and online updating.
The actual data-mining task is the automatic or semi-
Integrated Intelligent Research (IIR)
(1) The identification of unusual data records that might be interesting, or data errors that require
further investigation.
(2) Association rule learning: searches for relationships between variables.
(3) Clustering: the task of discovering groups and structures in the data that are in some way or
another "similar", without using known structures in the data.
(4) Classification: the task of generalizing known structure to apply to new data.
(5) Regression: attempts to find a function that models the data with the least error.
(6) Summarization: providing a more compact representation of the data set, including visualization
and report generation.
X = (dept = system, age = 26..30, salary = 46k..50k)
5.1 C4.5 ALGORITHM
Where:
P(dept=system | status=senior) = 2/5 = 0.4
P(dept=system | status=junior) = 2/6 = 0.33
P(age=26..30 | status=senior) = 0/5 = 0
P(age=26..30 | status=junior) = 3/6 = 0.5
P(salary=46k..50k | status=senior) = 2/5 = 0.4
P(salary=46k..50k | status=junior) = 2/6 = 0.33
Using the above probabilities, we obtain
P(X | status=junior) = 0.33 × 0.5 × 0.33 = 0.054
Similarly,
P(X | status=senior) = 0.4 × 0 × 0.4 = 0
To find the class Ci that maximizes P(X | Ci) × P(Ci), we compute
P(X | status=senior) × P(status=senior) = 0 × 0.455 = 0
P(X | status=junior) × P(status=junior) =
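The arithmetic of this worked example can be reproduced with a short Python sketch. It is an
illustration only, not the implementation used in the paper. The class sizes (5 senior and 6 junior
tuples) are an assumption inferred from the denominators in the conditional probabilities and from
P(status=Senior) = 0.455, which is approximately 5/11. The names class_counts, match_counts,
likelihood, prior and score are hypothetical.

```python
from fractions import Fraction

# Class sizes: an assumption inferred from the denominators (/5, /6)
# and from P(status=Senior) = 0.455, roughly 5/11.
class_counts = {"senior": 5, "junior": 6}

# For each class, the number of training tuples that match each
# attribute value of X, taken from the numerators in the text.
match_counts = {
    "senior": {"dept=system": 2, "age=26..30": 0, "salary=46k..50k": 2},
    "junior": {"dept=system": 2, "age=26..30": 3, "salary=46k..50k": 2},
}


def likelihood(label):
    """P(X | class): product of per-attribute conditional probabilities,
    which relies on class conditional independence."""
    result = Fraction(1)
    for count in match_counts[label].values():
        result *= Fraction(count, class_counts[label])
    return result


def prior(label):
    """P(class): relative frequency of the class in the training set."""
    return Fraction(class_counts[label], sum(class_counts.values()))


def score(label):
    """P(X | class) * P(class), the quantity to be maximized."""
    return likelihood(label) * prior(label)


scores = {label: score(label) for label in class_counts}
predicted = max(scores, key=scores.get)

for label, value in scores.items():
    print(f"{label}: P(X|C) = {float(likelihood(label)):.3f}, "
          f"P(X|C)P(C) = {float(value):.3f}")
print("predicted class:", predicted)
```

Because the sketch uses exact fractions, the junior likelihood comes out as 1/18 (about 0.056)
rather than the 0.054 obtained from the rounded value 0.33. The senior score is exactly 0 because
no senior tuple has age 26..30, so the sketch selects junior for X.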
REFERENCES