
International Journal of Engineering and Advanced Technology (IJEAT)
ISSN: 2249-8958, Volume-9, Issue-2, December 2019
Retrieval Number: B3317129219/2019©BEIESP | DOI: 10.35940/ijeat.B3317.129219

Classification of Pruning Methodologies for Model Development using Data Mining Techniques

Parashu Ram Pal, Pankaj Pathak, Vikash Yadav, Priyanka Ora

Abstract: The knowledge discovery process deals with two essential data mining techniques, association and classification. Classification produces a large number of associative classification rules for a given observation. Pruning removes unnecessary class association rules without losing classification accuracy. These processes are very significant but at the same time very challenging. The experimental results and limitations of existing class association rule mining techniques have shown that more pruning parameters need to be considered so that the size of the classifier can be further optimized. Through this paper we present a survey of various strategies for class association rule pruning, study the effects that enable us to extract an efficient, compact and high-confidence class association rule set, and propose a pruning methodology.

Keywords: associative classification, data mining, knowledge discovery process, pruning.

I. INTRODUCTION

The associative classification rule mining concept was first coined in the year 1997 [1] and [2]. The first such classifier, named CBA [3], followed in 1998, and later classifiers such as CPAR [4] in 2003 and MCAR [6] in 2005 came after it. An associative classifier is built in steps: the first step finds the frequent item-sets and candidate class association rules, with a support threshold used to remove unwanted sets. Then the strong rules are segregated: using the confidence threshold, the weak rules are pruned. Lastly, a small subset of all these rules is selected and a classifier is built. Many techniques have proposed various methods for optimizing rules to form a classifier [5]. The proposed associative classification techniques use several approaches to find, extract, save, rank and prune the redundant rules. This paper aims to compare the pruning methodologies used in different classifiers in order to find and develop an efficient classifier in as efficient a manner as possible.

II. ASSOCIATIVE CLASSIFICATION RULE MINING

Association rules are derived by association rule mining from transactional databases according to co-occurrences in the database. Let us have an item-set with a set of elements E = {E1, E2, …, En} and a transaction set with a number of transactions S = {S1, S2, …, Sm}, such that each Si ∈ S holds a set of elements E′ with E′ ⊆ E. The support threshold and confidence threshold (used to determine the importance of the rules) are defined as below:
(i) The frequency with which the items of an item-set co-occur in S is known as its support, bounded below by a user-defined minimum-support. I is called an item-set, where I ⊆ E and all ai ∈ I co-occur in S; I is said to be a frequent item-set if and only if the occurrence of I in S is greater than the minimum-support.
(ii) The strength with which an item-set I1 implies another item-set I2, where I1, I2 ⊆ E and I1 ∩ I2 = {}, is known as confidence; its user-defined lower bound is the minimum-confidence.
An association rule represented by I1 → I2 is accepted if the frequency of the co-occurrence of I1 and I2 is more than the minimum-support and the confidence of the rule is more than the minimum-confidence. Support is calculated as: support(I1 → I2) = count(I1 ∪ I2) / (number of transactions in DT). Confidence is calculated as: confidence(I1 → I2) = support(I1 ∪ I2) / support(I1). I1 → I2 can be read as "if I1 exists, it is likely that I2 will also exist".
Associative classification rule mining has two steps: (i) finding all frequent item-sets. This is the most complex and time-consuming task: it considers all possible item-sets (combinations), 2^n for a set of n items, and extracts only those item-sets whose frequency in the training database is at least the minimum-support. (ii) Generating strong association rules. These are produced from the frequent item-sets obtained in step (i); all rules having confidence not less than the minimum-confidence are extracted. For instance, we work with the training data shown in Table 1. It has three attributes A (A1, A2, A3), B (B1, B2, B3), C (C1, C2, C3) and two class labels (L1, L2), with minimum-support = 30% and minimum-confidence = 70%. Table 2 shows a classifier with the strong class association rules, ordered by the confidence they hold, along with their support and confidence.
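Before turning to the pruning techniques themselves, the two-step procedure and the threshold definitions above can be made concrete with a short sketch (an illustration, not code from the paper). Over the Table 1 training data given below, it enumerates candidate antecedents, computes support and confidence exactly as defined, and keeps the class association rules that clear the 30% support and 70% confidence thresholds, reproducing the rule set of Table 2.

```python
from itertools import combinations

# Training database of Table 1: one (item-set, class label) pair per row.
DATA = [
    ({"A2", "B2", "C1"}, "L1"), ({"A1", "B2", "C2"}, "L2"),
    ({"A1", "B3", "C3"}, "L2"), ({"A3", "B1", "C2"}, "L1"),
    ({"A1", "B1", "C3"}, "L2"), ({"A2", "B3", "C1"}, "L1"),
    ({"A3", "B3", "C2"}, "L1"), ({"A1", "B1", "C1"}, "L1"),
    ({"A2", "B3", "C1"}, "L1"), ({"A1", "B1", "C1"}, "L2"),
]
MIN_SUP, MIN_CONF = 0.30, 0.70  # thresholds of the running example

def mine_cars(data, min_sup, min_conf):
    """Return all class association rules meeting both thresholds.

    support(I -> L)    = |rows containing I with label L| / |data|
    confidence(I -> L) = support(I -> L) / support(I)
    """
    n = len(data)
    items = sorted({item for row, _ in data for item in row})
    labels = {label for _, label in data}
    rules = []
    for size in range(1, 4):                 # antecedents of 1..3 items
        for antecedent in combinations(items, size):
            body = set(antecedent)
            covered = [label for row, label in data if body <= row]
            if len(covered) / n < min_sup:   # antecedent itself too rare
                continue
            for label in labels:
                sup = covered.count(label) / n
                conf = covered.count(label) / len(covered)
                if sup >= min_sup and conf >= min_conf:
                    rules.append((antecedent, label, sup, conf))
    return rules

for ant, label, sup, conf in mine_cars(DATA, MIN_SUP, MIN_CONF):
    print(f"{','.join(ant)} -> {label}  support={sup:.2f} confidence={conf:.2f}")
```

Brute-force enumeration like this is exactly the 2^n blow-up mentioned above, which is why the pruning techniques surveyed in the following sections matter.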

Revised Manuscript Received on December 15, 2019.
* Correspondence Author
* Parashu Ram Pal, Department of Information Technology, ABES Engineering College, Ghaziabad, India
Pankaj Pathak, Department of Information Technology, Symbiosis Institute of Telecom Management, Pune, India
Vikash Yadav, Department of Information Technology, ABES Engineering College, Ghaziabad, India
Priyanka Ora, Department of Computer Science, Medi-Caps University, Indore, India

Table 1. Training Database

TID     A    B    C    Class
i.      A2   B2   C1   L1
ii.     A1   B2   C2   L2
iii.    A1   B3   C3   L2
iv.     A3   B1   C2   L1
v.      A1   B1   C3   L2
vi.     A2   B3   C1   L1
vii.    A3   B3   C2   L1
viii.   A1   B1   C1   L1
ix.     A2   B3   C1   L1
x.      A1   B1   C1   L2

Table 2. Strong Associative Classification Rule Set

Antecedent   Consequent   Support   Confidence
A2           L1           3/10      3/3
B3           L1           3/10      3/4
A2,C1        L1           3/10      3/3
A1           L2           4/10      4/5
C1           L1           4/10      4/5

III. DIFFERENT PRUNING TECHNIQUES FOR ASSOCIATIVE CLASSIFICATION

Pruning means the removal of unwanted or unuseful elements. In associative classification, pruning is used to remove infrequent item-sets and infrequent or weak class association rules. Pruning also decides which rules from the strong associative classification rule set will be included in the final classifier, based on their confidence and coverage capacity. Pruning can be applied at three levels in the overall process of associative classification:
(a) Early pruning: early pruning avoids generating irrelevant candidate sets and removes infrequent item-sets and infrequent associative classification rules. The support threshold value is used at this level.
(b) Intermediate pruning: intermediate pruning removes weak class association rules and extracts only strong class association rules. The confidence threshold value is used at this stage to accomplish the task.
(c) Late pruning: late pruning extracts only a selected subset of strong class association rules to form the final associative classifier. The strength and coverage capacity of the class association rules are used to do it.

Associative classification easily removes noise and achieves higher accuracy; it produces a more complete rule set than traditional classification techniques [14]. Lasso regularization has been used to mine association rules and to prune them: its variable-selection approach gives a new way to tackle the problem of rule pruning and summarization. For massive high-dimensional data, association rules are used to find the relationships among attributes; as the dimensionality of the data grows, the data becomes sparse and the resulting number of association rules makes them difficult to understand. The class-association rules (CARs) approach based on Lasso regularization prunes the least interesting association rules, which enhances the number and the quality of the association rules obtained and gives better results than CBA [15].

Associative classification (AC) is an approach that utilizes the technique of association rule discovery to learn a classifier. Associative classification can be categorized in two ways: one is eager associative classification and the second is lazy learning associative classification [16]. The His-GC classifier works in two phases, subset generation and subset evaluation. In the first phase, all possible combinations of item-sets are generated from the training dataset, based on the testing dataset. The second phase calculates the posterior probability for each item-set generated in the first phase, and the class with the highest probability is assigned to the test tuple [17].

LLAC (Lazy Learning in Associative Classification): test data is taken as input and a subset is generated for each combination of class values. Support s and confidence c are calculated for each generated subset, and the class label for the new, unseen test tuple is assigned as per the highest support and confidence values. The HiSC (Highest Subset Confidence) algorithm is used to predict the class label of an unseen test instance. LLAC outperforms traditional classification systems [17].

LACI (Lazy Associative Classification using Information Gain): in the previous works, all possible subsets are generated, which takes huge computation time. To address this challenge, LACI was proposed: to reduce the number of generated subsets, the information gain of every attribute in the training dataset is calculated and the attribute with the highest information gain is chosen to generate the subsets [18].

An Apriori-type algorithm finds it very difficult to predict defects when run on an imbalanced dataset. To improve this, class rules, a vital type of association rule, are used to separate the class labels; they also describe the relationship between the attributes and the categories into which the dataset is divided. An empirical comparison on four datasets showed the technique to be superior to the other classification techniques; the research reports results on the basis of an SDP (software defect prediction) model built on class association rules [18].

A. Early Pruning Techniques

In early pruning, we use the support threshold value to delete the infrequent item-sets. The following methods are used to reduce the effort involved in finding frequent item-sets.

● Handling Mutually Exclusive Items

This technique exploits the fact that the values of an attribute are mutually exclusive: an instance contains only one value of each attribute. To put this into effect, candidate item-sets holding more than one value of the same attribute must be rejected by the candidate generator, as sketched below. This firstly accelerates the second pass, and secondly means that on subsequent passes the candidates pruned by this technique do not need to be explicitly checked by subset-support based pruning.
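A minimal sketch of this check (an illustration, not code from any of the surveyed systems): it assumes an item's attribute can be recovered from its name, as in the A1..C3 encoding of Table 1.

```python
def attribute(item: str) -> str:
    # Assumption of this sketch: the leading letters name the attribute,
    # e.g. "A2" belongs to attribute "A". A dataset schema lookup would
    # serve equally well.
    return item.rstrip("0123456789")

def is_viable(candidate) -> bool:
    """Mutually-exclusive-items check: a candidate holding two values of
    the same attribute can never occur in any instance, so it is pruned
    before any support counting takes place."""
    attrs = [attribute(item) for item in candidate]
    return len(attrs) == len(set(attrs))

candidates = [("A1", "A2"), ("A2", "C1"), ("B1", "B3", "C2")]
print([c for c in candidates if is_viable(c)])  # [('A2', 'C1')]
```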


● Exploiting the Equivalence of Supports

Usually an item-set has its support near to either the support of one of its subsets or the support of one of its supersets. To prevent such equivalent rules from being generated or incurring any overhead, the exact-equivalence strategy removes from the set of frequent item-sets any set having a subset with equivalent support, before forming the next set of candidates.

● Other Techniques

Some other techniques may optimize the candidate rule sets. The algorithm of [11] reduces the candidate item-sets by removing those item-sets that contain more items than the number of available attributes. Other algorithms do not generate candidate item-sets at all and instead use the FP-tree data structure to produce the frequent item-sets [16]. Further techniques for pruning the infrequent item-sets more efficiently can still be explored.

B. Intermediate Pruning Techniques

Intermediate pruning uses the confidence threshold value to remove the weak class association rules. The following further methods may optimize the size of the class association rule set without reducing its classification strength:

● Removing Redundant Class Association Rules

The idea behind this technique is to exploit the fact that if a rule R meets the confidence threshold value, then any rule containing R and having confidence less than R will apply only to instances already covered. Such rules are termed redundant class association rules. Redundant rule pruning has been reported in [5]. It works as follows: let R → C be a general rule; any rule R′ → C such that R ⊂ R′ and R′ → C has lower confidence compared to R → C is redundant and is removed from the associative classification rule set. This significantly reduces the size of the class association rule set and minimizes rule redundancy. The algorithms including [5, 7] have used the pruning of redundant class association rules, performing the pruning immediately as a rule is inserted into a data structure called the CR-tree.

● Handling Conflicting Class Association Rules

Conflicting class association rules are rules that have similar LHS item-sets but predict different classes in the RHS, for example R → C1 and R → C2. [7] proposed a pruning technique that considers these conflicting rules and removes them. The algorithm of [8] instead treats such rules as useful knowledge and combines them into a single rule, called a multi-label class association rule, i.e. R → C1 ∨ C2.

● Correlation Testing Between Rule Body & its Class

This concept is taken from statistics: the correlation between a rule body and its predicted class is measured to determine whether they are correlated or not. Chi-square testing is used for this purpose, comparing the observed co-occurrence counts of the rule body and the class against the counts expected if they were independent. If the discovered class association rule is negatively correlated, it is pruned; if the rule body is positively correlated with its class, the rule is stored in the class association rule set. The algorithm [4] performs the chi-square testing in its rule discovery step to retain or remove the class association rules.

● Backward Class Association Rule Pruning

Pruning in decision trees involves pre-pruning and post-pruning. Post-pruning, known as backward pruning, is frequently used by decision tree algorithms like C4.5 [9]. First a decision tree is constructed, and then it is decided whether each node and its descendants should be replaced by a single leaf. The decision is made by estimating the error of a node using pessimistic error estimation [10] and comparing it with that of its potential replacement leaf. This backward pruning can also be used in class association rule pruning; the algorithms including [3] have used it to effectively reduce the number of extracted class association rules.

C. Late Pruning Techniques

Late pruning is involved in the final step of classifier formation. A large number of late pruning techniques have been used by different researchers. Some of these techniques are listed here along with their scope and importance:

● Matching Longest Associative Classification Rules

In this technique we select those class association rules having the longest left-hand side that matches a particular case. The longest-match method is based on the conclusion that class association rules with the longest left-hand side contain more accurate and richer information for the prediction of a class. Although the longest match is more specific and accurate, the problem is that the support and confidence of a class association rule decrease exponentially as the size of its left-hand side increases.

● Database Coverage

Database coverage is a very popular pruning technique in associative classification. The algorithms including [3] and [4] have successfully exploited this pruning technique to minimize the size of the class association rule set. The method works as follows: first, all the rules of the class association rule set are sorted in descending order of confidence. Then each class association rule is tested against the training dataset instances. If a rule correctly classifies some instances in the training dataset, all instances covered by the rule are removed from the training dataset and the rule is marked as a candidate rule. If a rule does not correctly cover any instance in the training dataset, it is removed from the class association rule set. Finally we get a class association rule set containing candidate rules only.

● Lazy Pruning

Lazy pruning aims to remove only those rules from the class association rule set that incorrectly classify training data set instances. In lazy pruning, each rule of the class association rule set is tested against the training dataset instances, and we delete those rules that either incorrectly classify at least one training data set instance or do not cover even a single instance of the training data set. Here we do not delete the covered instances from the training data set as is done in the database coverage method. Lazy pruning considers all class association rules classifying the instances of the training data set, whereas the database coverage method considers only a single rule classifying an instance. In other words, in lazy pruning an instance is classified by several class association rules, but in the database coverage method an instance is covered by only a single class association rule.
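To make the contrast between the two procedures concrete, here is a minimal sketch of both, following the descriptions above. It assumes rules are (antecedent, label, support, confidence) tuples and training instances are (item-set, label) pairs, as in the earlier sketch; these representations are assumptions of the illustration, not those of the surveyed algorithms.

```python
def database_coverage(rules, training):
    """Database coverage: rules are visited in descending order of
    confidence; a rule that correctly classifies at least one remaining
    instance is kept and every instance it covers is removed, otherwise
    the rule is discarded."""
    remaining = list(training)
    classifier = []
    for rule in sorted(rules, key=lambda r: -r[3]):
        body, label = set(rule[0]), rule[1]
        covered = [(row, c) for row, c in remaining if body <= row]
        if any(c == label for _, c in covered):
            classifier.append(rule)
            remaining = [(row, c) for row, c in remaining
                         if not body <= row]
    return classifier

def lazy_pruning(rules, training):
    """Lazy pruning as described in this section: every rule is tested
    against the full training set (covered instances are NOT removed);
    a rule survives only if it covers at least one instance and
    misclassifies none of the instances it covers."""
    kept = []
    for rule in rules:
        body, label = set(rule[0]), rule[1]
        covered = [c for row, c in training if body <= row]
        if covered and all(c == label for c in covered):
            kept.append(rule)
    return kept
```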


The experimental results have shown that lazy pruning produces a large number of potential class association rules and therefore consumes more memory space compared to the other techniques.

● Laplace Accuracy

Laplace accuracy is used by the associative classification algorithm [4]. It is mainly used in class association rule mining to estimate the expected error of the rules: it computes an expected accuracy for each class association rule before the rule is applied for classification of test instances, commonly as (nc + 1) / (ntot + k), where ntot is the number of training instances the rule covers, nc of those belong to its predicted class, and k is the number of classes. The CPAR algorithm has shown that exploiting Laplace accuracy produces better results compared to CBA.

IV. PROPOSED METHODOLOGY

Our proposed methodology aims to reduce the number of rules as well as to study the impact of pruning on accuracy. The database coverage method provides heuristics to select a rule subset from the set of rules, but it has one shortfall: in some cases, when there is no rule to classify an instance, it assigns the largest-frequency class to the remaining unclassified instances. We can hybridize database coverage pruning with rule induction for maximum coverage of the dataset: a rule is induced and then passed through a rule evaluation step driven by its rank. The proposed method thus includes rule induction, evaluation of the rules, and classification of the test data. Evaluation of a rule helps to find out whether the rule is able to cover a large part of the dataset or not; while a rule is being evaluated, its rank is constantly revised to reflect the coverage of the rule on test examples. The proposed method tries to cover as many instances of the dataset as possible within each rule, and hence fewer rules are derived, so no majority-voting class concept is needed for unclassified instances.

V. EXPERIMENTAL EVALUATION

We have compared the effect of three pruning techniques on the number of rules they derive: CBA [3] (pessimistic error and database coverage), MCAR [6] (database coverage) and lazy pruning [12]. The experiments are done on fourteen datasets available in the UCI machine learning repository [13]. Table 3 gives the number of rules derived by the different pruning techniques.

Table 3: Set of Rules Derived by Different Pruning Techniques

S.No.  Data Set   Pessimistic Error &    Proposed Database   Lazy Pruning
                  Database Coverage      Coverage
1.     Breast     47                     67                  22183
2.     Glass      29                     39                  11061
3.     Heart      43                     80                  40069
4.     Iris       5                      15                  190
5.     Labor      17                     16                  7967
6.     Lymph      35                     52                  86917
7.     Pima       40                     93                  9842
8.     Tic-tac    28                     28                  41823
9.     Wine       11                     51                  40775
10.    Zoo        5                      9                   380921

From the above table it is obvious that a huge number of classification rules are generated by the lazy pruning technique. The reason is that lazy pruning stores a large number of class association rules (spare rules) that do not cover any objects. The proposed database coverage method overcomes this problem and removes these spare rules, which reduces the size of the associative classifiers. The CBA [3] algorithm likewise generates associative classifiers of reasonable size compared to the lazy pruning methods [12].

VI. CONCLUSION

Associative classification is a significant technique of knowledge discovery in the data mining field. Pruning techniques are the most important part of the process of constructing an effective classifier with a high accuracy standard. Effective class association rule mining yields a classifier that reduces the possibility of error, increases the accuracy rate, and can be deployed for use in big data analytics and data science. This paper discusses the different pruning methods that have been proposed since the inception of the class association rule mining technique and compares them with the latest ones. The comparison has been made on the results obtained by different associative classification algorithms employing a particular pruning technique. The results show that pessimistic error combined with database coverage, and database coverage pruning alone, have produced better results than the lazy pruning methods: they generate compact classifiers that are easy to understand, implement and use for classification of new data items.

REFERENCES

1. Azmi M. and Bernado A. 2016. "Class Association Rules Pruning using Regularization." In Proceedings of the International Conference on Computer Systems and Applications. IEEE.
2. Agrawal R., Imielinski T. and Swami A. 1993. "Mining Association Rules between Sets of Items in Large Databases." In Proceedings of the International Conference on Management of Data. Washington DC. 207-216.
3. Bayardo R. 1997. "Brute-Force Mining of High-Confidence Classification Rules." In Proceedings of the International Conference on Knowledge Discovery and Data Mining. Newport Beach, CA, United States. 123-126.
4. Coenen F. and Leng P. 2004. "An Evaluation of Approaches to Classification Rule Selection." In Proceedings of the International Conference on Data Mining. Brighton, United Kingdom: IEEE. 359-362.
5. Mohammad S. A. and Tze Hiang. 2017. "Effects of Pruning on Accuracy in Associative Classification." Journal of Informatics and Mathematical Sciences, Vol. 9, No. 4.
6. Quinlan J. R. 1993. "C4.5: Programs for Machine Learning." San Mateo, CA: Morgan Kaufmann.
7. Vishwakarma N. and Agrawal J. 2013. "Comparative Analysis of Different Techniques in Classification based on Association Rules." In Proceedings of the International Conference on Computational Intelligence and Computing Research. IEEE.
8. Liu B., Hsu W. and Ma Y. 1998. "Integrating Classification and Association Rule Mining." In Proceedings of the International Conference on Knowledge Discovery and Data Mining. New York. 80-86.
9. Baralis E. and Garza P. 2002. "A Lazy Approach to Pruning Classification Rules." In Proceedings of the International Conference on Data Mining. IEEE.
10. Merz C. and Murphy P. n.d. "UCI Repository of Machine Learning Databases." Irvine, CA: University of California.


11. Pal P. R. and Jain R. C. 2010. "CAAC: Combinatorial Approach of Associative Classification." International Journal of Networking and Applications, Vol. 2, No. 1. 470-474.
12. Tamrakar P. and Ibrahim S. 2018. "A Review of Lazy Learning Associative Classifications." International Journal of Pure and Applied Mathematics, Vol. 119, No. 15.
13. Tao F., Murtagh F. and Farid M. 2003. "Weighted Association Rule Mining using Weighted Support and Significance Framework." In Proceedings of the 9th ACM Conference on Knowledge Discovery and Data Mining. Washington DC. 661-666.
14. Thabtah F., Cowling P. and Peng Y. 2005. "MCAR: Multi-class Classification based on Association Rule Approach." In Proceedings of the International Conference on Computer Systems and Applications. Cairo, Egypt: IEEE. 1-7.
15. Thabtah F., Cowling P. and Peng Y. 2004. "MMAC: A New Multi-class, Multi-label Associative Classification Approach." In Proceedings of the International Conference on Data Mining. Brighton, United Kingdom. 217-224.
16. Han J., Pei J. and Yin Y. 2000. "Mining Frequent Patterns without Candidate Generation." In Proceedings of the ACM SIGMOD International Conference. 1-12.
17. Yin X. and Han J. 2003. "Classification based on Predictive Association Rules." In Proceedings of the International Conference on Data Mining.
18. Shao Y., Liu B., Li G. and Wang S. 2017. "Software Defect Prediction based on Class Association Rules." In Proceedings of the International Conference on Reliability Systems Engineering. IEEE. 1-7.

AUTHORS PROFILE
Dr. Parashu Ram Pal obtained his Ph.D. in Computer Science. He is working as a Professor in the Department of Information Technology, ABES Engineering College, Ghaziabad, India. He has published three books and more than 40 research papers in various international and national journals and conferences. His areas of interest are Data Mining, Computer Architecture, Computer Graphics and Operations Research. He has been devoted to education, research and development for more than twenty years and always tries to create a proper environment for imparting quality education with the spirit of service to humanity. He believes in motivating colleagues and students to achieve excellence in the field of education and research.

Dr. Pankaj Pathak obtained his Masters and Ph.D. in 2005 and 2014 respectively. He is working as an Assistant Professor at the Symbiosis Institute of Telecom Management, Pune. His areas of interest are Data Mining, AI and Smart Technologies. He has published several research papers in the areas of Data Mining, IoT security and Speech Recognition Technology.

Dr. Vikash Yadav received his Ph.D. (Computer Science & Engineering) degree from Dr. A.P.J. Abdul Kalam Technical University (formerly U.P. Technical University), Lucknow (U.P., India) in 2017. He is currently working as an Assistant Professor in the Department of Computer Science & Engineering, ABES Engineering College, Ghaziabad, India, has more than 7 years of teaching and research experience, and has published more than 30 research papers in various national and international conferences and journals. He is also a reviewer for various SCI/SCIE/Scopus indexed journals. His areas of interest include Data Structures, Data Mining, Image Processing and Big Data Analytics.

Dr. Priyanka Ora completed her Masters and Ph.D. in Computer Science. She is working as an Assistant Professor in the Department of Computer Science, Medi-Caps University, Indore, India. She has more than six years of academic experience and has published more than 10 research papers in various international and national journals and conferences. Her areas of interest are cloud computing, cyber security and the internet of things.
