Feature Extraction and Reduction Using a Modified Apriori Algorithm
Abstract— Feature selection is of great importance in data mining. Existing algorithms lead to extremely high dimensionality when interactions among features are considered, so an algorithm is needed that extracts only useful features. We devise an improvised approach to association rule mining that does not require users to provide support and confidence values; instead, it considers interactions among features and uses relative confidence to obtain better results. The user specifies the number of rules and features to be generated. This method has an advantage over Standard Apriori, as it gives more control over the efficiency and quality of the results, and it handles large data sets and high dimensionality more effectively. We calculated the mean, median, variance, and standard deviation of the lift values of the generated rules and found that the variance of the improvised Apriori is 30–40% lower than that of standard Apriori, which indicates that Modified Apriori performs better than standard Apriori. The mean and median of the lift are also 20–50% higher in Modified Apriori than in standard Apriori, which implies that the features are more tightly coupled in Modified Apriori and hence have a higher probability of producing accurate results.
Index Terms— Data mining, Itemset, Rule, Support, Feature Selection, Association rule mining,
Apriori algorithm, Confidence, Relative Confidence, Lift, ARAF.
I. INTRODUCTION
Data mining is a process of extracting knowledge from large datasets using advanced computational methods,
including statistics, machine learning, and artificial intelligence. It involves the discovery of hidden patterns,
relationships, and trends within the data, which can be used to make informed decisions. Data mining involves
several steps: data collection and preparation, then data modeling, evaluation, and interpretation.
One of the most widely used techniques in data mining is association rule mining. It is a process of discovering
interesting relationships or patterns among variables in large datasets. It involves identifying frequently occurring
itemsets, and then generating rules based on those itemsets. These rules can be used to make predictions or to gain
insights into the behavior of the underlying system.
Association rules are typically represented in the form of "if-then" statements, where the "if" part represents the
antecedent, and the "then" part represents the consequent. The strength of the association is measured using
statistical measures such as support, confidence, and lift.
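For concreteness, the following sketch (an illustration only; the toy transactions and function names are not from the paper) shows how support, confidence, and lift can be computed for a candidate rule over a small list of transactions.

# Toy transactions; each transaction is a set of items (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """How much more often the two sides co-occur than expected under independence."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# Rule: {bread} -> {milk}
print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # ~0.67
print(lift({"bread"}, {"milk"}, transactions))        # ~0.89, slightly below 1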
The Apriori algorithm is one of the most widely used algorithms for mining frequent itemsets in a dataset. It was
proposed by Agrawal and Srikant in 1994 [1] and is considered a seminal algorithm in the field of data mining.
III. MOTIVATION
Association rule mining is a valuable tool for uncovering meaningful patterns in large datasets across a range of
domains. However, the existing Apriori algorithm used for association rule mining has certain limitations,
including an unpredictable runtime and the need for users to set two difficult-to-determine parameters. Recent
works have proposed innovative approaches to mining frequent itemsets and association rules, including dynamic
parameters, pruning techniques, and ranking mechanisms. By combining and adapting these methods, we aim to
develop an enhanced version of the Apriori algorithm that addresses the limitations of the existing algorithm and
provides more efficient, scalable, and high-quality results for association rule mining.
V. PROBLEM DEFINITION
The existing Apriori algorithm requires the user to specify two parameters: the minimum support and the minimum
confidence. These parameters determine which itemsets and rules are considered frequent and interesting,
respectively. However, setting these parameters can be challenging and subjective, as different values may lead to
different results. Moreover, these parameters may not capture all the relevant or useful patterns in the data, as they
only consider the frequency and accuracy of the rules, but not other factors such as novelty, diversity, or utility.
Furthermore, the existing Apriori algorithm may generate a large number of candidate itemsets and rules, many of
which can be redundant or irrelevant and may not contribute enough to produce the desired output. Therefore,
there is a need for a modified version of the Apriori algorithm that can provide more precise and meaningful results
by allowing the user to specify the number of itemsets and confident rules that they want to consider, instead of
the minimum support and minimum confidence values. This would make our algorithm more user-friendly and
applicable to users who are not domain experts.
In this section, we provide the design that we have used for capturing the problem statement. The problem is
captured in four steps as given in Fig. 1. First, the data is fed into the algorithm after preprocessing. Now, we
perform feature selection on the provided dataset by using the association rule mining algorithm that we propose
in later sections. Following that, we extract feasible interactions among the generated features and remove the
redundant features. Finally, we evaluate the effectiveness of the proposed algorithm and record the results.
IX. METHODOLOGY
The Apriori Algorithm is a popular algorithm used for finding frequent itemsets in a dataset. It uses a bottom-up
approach where frequent subsets of items are iteratively generated and combined to form larger itemsets. The
algorithm reduces the search space by eliminating infrequent itemsets that cannot be part of any frequent itemset.
The algorithm was first proposed by Rakesh Agrawal and Ramakrishnan Srikant in their 1994 paper titled "Fast
Algorithms for Mining Association Rules" [1]. The main idea of the Apriori algorithm is that only itemsets that contain no infrequent subsets need to be examined for frequency; therefore, the number of candidate itemsets is reduced rapidly.
Algorithm 1 extracts features from a set of association rules. The input is a set of association rules, denoted AR, and the output is a set of features, denoted F. For each association rule, the algorithm checks whether its right-hand side contains only a class label; if it does, the antecedent of the rule (the left-hand side of the arrow) is added to the set of features. Once all association rules have been processed, the set of features is returned as the output.
Algorithm 1 Generate Features
Input: D (database);
minsupp (minimum support);
minconf (minimum confidence);
Output: Rt (all association rules)
1: L1 = {frequent 1-itemsets}
2: for k = 2; Lk−1 != ∅; k++ do
3:    Ck = apriori-gen(Lk−1);
4:    for all transactions T ∈ D do
5:       Ct = subset(Ck, T);
6:       for all candidates C ∈ Ct do
7:          Count(C) ← Count(C) + 1;
8:       end for
9:    end for
10:   Lk = {C ∈ Ck | Count(C) ≥ minsupp · |D|}
11: end for
12: Lf = ∪k Lk
13: Return Rt = GenerateRules(Lf, minconf)
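As a minimal illustration of the feature-extraction step described above, the following Python sketch keeps the antecedent of every rule whose consequent is a single class label. The rule representation (pairs of frozensets of attribute-value pairs) and the class-attribute name "Y" are assumptions made for this sketch, not notation from the paper.

CLASS_ATTRIBUTE = "Y"  # hypothetical name of the class attribute

def generate_features(rules):
    """Collect antecedents of rules whose right-hand side is only a class label."""
    features = set()
    for antecedent, consequent in rules:
        # Keep the rule only if the consequent is a single (class attribute, label) pair.
        if len(consequent) == 1 and next(iter(consequent))[0] == CLASS_ATTRIBUTE:
            features.add(antecedent)
    return features

# Example usage with two toy rules.
rules = [
    (frozenset({("X1", 1), ("X2", 0)}), frozenset({("Y", "pos")})),
    (frozenset({("X3", 1)}), frozenset({("X2", 0)})),  # consequent is not a class label
]
print(generate_features(rules))  # {frozenset({('X1', 1), ('X2', 0)})}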
ARAF has two advantages. First, it can handle any antecedent size in theory. Second, it is easy to interpret. Rules
are very understandable for humans, and the features from rules inherit this property.
ARAF –
Input: D (database);
minsupp (minimum support);
minconf (minimum confidence);
Output: F (generated features);
1: Run the Apriori algorithm to get Lk
2: Generate the confident rules from Lk:
   for i from 1 to k:
      CRi = {R = "(X1 = x1, X2 = x2, ..., Xi = xi) → Y = c" | conf(R) ≥ minconf, (X1 = x1, X2 = x2, ..., Xi = xi, Y = c) ∈ Li}
3: Return F = GenerateFeatures(∪i CRi)
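Step 2 above can be sketched in Python as follows. The dictionary supports (mapping frozensets of attribute-value pairs to their supports, including the antecedents themselves) and the class-attribute name are assumptions made for illustration only.

def generate_confident_rules(frequent_itemsets, supports, minconf, class_attr="Y"):
    """Form class rules (X1=x1, ..., Xi=xi) -> Y=c whose confidence meets minconf.

    frequent_itemsets: iterable of frozensets of (attribute, value) pairs, each
    containing exactly one (class_attr, c) pair; supports: dict mapping frozensets
    (both full itemsets and antecedents) to their support values.
    """
    rules = []
    for itemset in frequent_itemsets:
        consequent = frozenset(p for p in itemset if p[0] == class_attr)
        antecedent = itemset - consequent
        if not antecedent or not consequent:
            continue
        # conf(X -> Y=c) = supp(X u {Y=c}) / supp(X)
        conf = supports[itemset] / supports[antecedent]
        if conf >= minconf:
            rules.append((antecedent, consequent, conf))
    return rules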
The parameters of minimum support (minsupp) and minimum confidence (minconf) are not easy to determine and
have some limitations. The values of these parameters affect the number and quality of the association rules that
are produced. If the values are too high, the algorithm may not find any rules or only a few that are not very useful.
On the other hand, if the values are too low, the algorithm may generate many rules that are not reliable or relevant.
This can also cause problems with the space and time efficiency of the algorithm, as it has to store and process a
large number of rules.
To avoid the above drawbacks of the Standard Apriori algorithm, we propose a new method called Modified Apriori with feature reduction, in which we specify the number of frequent itemsets and confident rules instead of support and confidence thresholds, as sketched below.
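The essential change can be sketched as replacing threshold filtering with top-k selection; heapq.nlargest is used here only as one possible way to keep the dfreq most frequent itemsets and the dconf most confident rules, and the data structures are assumptions for the sketch.

import heapq

def top_k_itemsets(candidate_counts, dfreq):
    """Keep the dfreq most frequent candidate itemsets instead of filtering by minsupp."""
    return heapq.nlargest(dfreq, candidate_counts.items(), key=lambda kv: kv[1])

def top_k_rules(rules_with_conf, dconf):
    """Keep the dconf most confident rules instead of filtering by minconf."""
    return heapq.nlargest(dconf, rules_with_conf, key=lambda rc: rc[1])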
Unbalanced data is a situation where the number of observations in one class is significantly higher or lower than the number of observations in the other classes. This can lead to biased or inaccurate model performance, as the model tends to learn more about the majority class than the minority class. The proposed method mitigates this by allocating an equal quota of dfreq/|C| frequent itemsets to every class, so that minority classes remain represented among the generated features.
Modified Apriori –
Input: D (database);
dfreq (the number of frequent itemsets);
dconf (the number of confident rules);
Output: F (generated features);
1: Count the support of itemsets in IS1(c), where
   IS1(c) = {(Xi = xi, Y = c) | i ∈ {1, 2, ..., m}, c ∈ C}
2: Generate FS1(c) = {dfreq/|C| most frequent itemsets in IS1(c)}
3: k = 2
4: while ISk(c) != ∅ and k < number of columns:
   Count the support of itemsets in ISk(c), where
   ISk(c) = {(Xi = xi, Xj = xj, ..., Xk = xk, Y = c) | all (k−1)-length subsets of (Xi = xi, Xj = xj, ..., Xk = xk) are in FSk−1(c)}
   Generate FSk(c) = {dfreq/|C| most frequent itemsets in ISk−1(c) ∪ ISk(c)}
   k = k + 1
5: Calculate the relative confidence of rules in AR, where
   AR = {class association rules generated from FSk(c)}
6: Generate CR = {dconf most relatively confident rules in AR}
7: Return F = GenerateFeatures(CR)
The method we propose is a modified version of the Standard Apriori algorithm that reduces both complexity and dimensionality. We have implemented functions that help reduce the complexity, namely: generating frequent itemsets, generating candidate itemsets, and generating confident association rules from frequent itemsets.
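As a minimal sketch of steps 1 and 2 of the procedure above, the function below counts the single-attribute itemsets (Xi = xi, Y = c) per class and keeps only the dfreq/|C| most frequent ones for each class. The row representation (a list of dictionaries with a "Y" class key) is an assumption for this sketch.

from collections import Counter
import heapq

def first_level_frequent_itemsets(rows, dfreq, class_attr="Y"):
    """Return FS1(c): the dfreq/|C| most frequent (Xi=xi, Y=c) itemsets for each class c."""
    classes = {row[class_attr] for row in rows}
    quota = max(1, dfreq // len(classes))  # dfreq / |C| itemsets per class

    counts = {c: Counter() for c in classes}
    for row in rows:
        c = row[class_attr]
        for attr, value in row.items():
            if attr != class_attr:
                counts[c][(attr, value)] += 1  # support of (Xi = xi, Y = c)

    return {c: heapq.nlargest(quota, counts[c].items(), key=lambda kv: kv[1])
            for c in classes}

# Example usage on a toy dataset.
rows = [
    {"X1": 1, "X2": 0, "Y": "pos"},
    {"X1": 1, "X2": 1, "Y": "pos"},
    {"X1": 0, "X2": 1, "Y": "neg"},
]
print(first_level_frequent_itemsets(rows, dfreq=4))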
Consider a rule A -> B with a confidence of 0.8 and another rule C -> B with a confidence of 0.6. Although the
confidence of the first rule is higher, the relative confidence of the second rule may be higher because it may have
fewer antecedents or a more significant antecedent than the first rule. Therefore, relative confidence is useful in
identifying the most important rules for a specific class.
By using relative confidence, we can select the most relevant rules for each class based on their strength of
association. This can help in identifying the most significant features that contribute to the prediction of
a specific class.
Relative confidence (rconf) is a measure of the strength of association between two sets of items in a transactional dataset. It compares the odds of the consequent occurring among transactions that contain the antecedent with the odds of the consequent occurring in the dataset as a whole. With supports expressed as transaction counts and n denoting the total number of transactions, the formula for relative confidence is:
rconf(X → Y) = [supp(X ∪ Y) / (supp(X) − supp(X ∪ Y))] / [supp(Y) / (n − supp(Y))]
Relative confidence provides a way to evaluate the importance of a rule while taking into account the frequency
of the antecedent in the dataset. Rules with high relative confidence indicate a strong association between the items
in the antecedent and consequent and can be used to identify patterns in the data.
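Under the formula above, relative confidence can be computed as in the following sketch, where supports are transaction counts and n is the total number of transactions (the guards against division by zero are additions for the sketch).

def relative_confidence(supp_xy, supp_x, supp_y, n):
    """rconf(X -> Y) = [supp(XuY) / (supp(X) - supp(XuY))] / [supp(Y) / (n - supp(Y))]."""
    odds_given_x = supp_xy / max(supp_x - supp_xy, 1e-9)  # odds of Y among transactions containing X
    baseline_odds = supp_y / max(n - supp_y, 1e-9)        # odds of Y over the whole dataset
    return odds_given_x / baseline_odds

# Example: X in 40 transactions, X and Y together in 30, Y in 50, out of n = 200.
print(relative_confidence(supp_xy=30, supp_x=40, supp_y=50, n=200))  # 9.0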
Modified Apriori with feature reduction –
Suppose we have three feature sets, {A}, {A, B}, and {A, C}. Because the subset {A} is present in all of the feature sets under consideration, we keep only the feature set whose rule has the highest relative confidence among the three.
Inputs:
Rules
Transactions
Output: Verdict (the selected rule)
1: rules = ∅
2: for rules_out in confident_rules:
3:    out_rconf = rules_out.rconf
4:    rule_considered = rules_out
5:    for rules_in in confident_rules:
6:       in_rconf = rules_in.rconf
7:       if in_rconf > out_rconf:
8:          out_rconf = in_rconf
9:          rule_considered = rules_in
10:   rules = rule_considered
11: return rules
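The loop above amounts to selecting the confident rule with the largest relative confidence; in Python this reduces to a single call (the rconf attribute mirrors the pseudocode and is an assumption of the sketch).

def select_rule(confident_rules):
    """Return the confident rule with the highest relative confidence."""
    return max(confident_rules, key=lambda rule: rule.rconf)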
TABLE I: COMPARISON TABLE BETWEEN STANDARD APRIORI AND MODIFIED APRIORI WITH FEATURE REDUCTION.
Table I reports the mean, median, variance, and standard deviation of the lifts of the features generated by the Standard Apriori and Modified Apriori algorithms on four datasets: diabetes, country_wise, heart, and heart_2020_cleaned. Lift measures how well a rule predicts its consequent: a high lift means the rule is accurate and useful, while a lift close to 1 means the rule is little better than random guessing.
We can observe that for the diabetes, country_wise, and heart_2020_cleaned datasets, the mean and median of the lifts are higher for Modified Apriori than for Standard Apriori, while the standard deviation and variance are lower. For the heart dataset, the opposite holds: the mean and median of the lifts are lower for Modified Apriori, and the standard deviation and variance are higher.
Based on these observations, we can infer that, in general, the variance of Standard Apriori is higher than that of Modified Apriori, which indicates that Modified Apriori performs better. Low variance implies that the lifts of the features are closer to the mean value, and since the mean value is higher for the Modified Apriori algorithm, it yields more accurate features. The mean and median of the lift are also higher in Modified Apriori than in Standard Apriori, which implies that the features are more tightly coupled in Modified Apriori and hence have a higher probability of producing accurate results.
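The summary statistics used in this comparison can be computed from the lift values of the generated rules with the Python standard library, as in the sketch below; the lift values shown are placeholders, not the numbers reported in Table I.

import statistics

def summarize_lifts(lifts):
    """Mean, median, variance, and standard deviation of a list of lift values."""
    return {
        "mean": statistics.mean(lifts),
        "median": statistics.median(lifts),
        "variance": statistics.pvariance(lifts),
        "std_dev": statistics.pstdev(lifts),
    }

# Placeholder lift values for illustration only.
print(summarize_lifts([1.2, 1.5, 1.1, 1.8]))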
Fig. 3: Output of proposed method for feature selection.
To support the above results, we plotted a line graph of Dataset [X-axis] versus Numerical intervals [Y-axis] for both algorithms, i.e., Standard Apriori and Modified Apriori with feature reduction. As the line chart shows, Modified Apriori with feature reduction performs better than Standard Apriori in terms of both time and space complexity.
XIII. CONCLUSION
Association rule mining is a fast-growing research area. Researchers aim to find the association rules that provide the best basis for analysis. This paper presents a more efficient and optimized version of the Standard Apriori algorithm that can be used in various fields for predictive analysis; several evaluation metrics and supporting statistical data are provided to substantiate this. The research can be further enhanced by considering additional factors that help reduce the complexity and by exploring other fields in which the method can be optimized further.
REFERENCES
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", Proc. 20th Int. Conf. on VLDB, Volume 30, Pages 487–499, 1994.
[2] Qiuqiang Lin and Chuanhou Gao, "Discovering Categorical Main and Interaction Effects Based on Association Rule Mining", IEEE Transactions on Knowledge and Data Engineering, Volume 1, Pages 1-1, 2021.
[3] M. Zhou, M. Dai, Y. Yao, J. Liu, C. Yang, and H. Peng, "BOLT-SSI: A statistical approach to screening interaction effects for ultra-high dimensional data," arXiv, Volume 3, Pages 30-40, 2019.
[4] R. Agrawal, T. Imieliński, and A. Swami, "Mining association rules between sets of items in large databases," SIGMOD Rec., Volume 22, Pages 207–216, 1993.
[5] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New algorithms for fast discovery of association rules," KDD'99, Volume 10, Pages 100-111, 1999.
[6] J. Hipp, U. Guntzer, and G. Nakhaeizadeh, "Algorithms for association rule mining — a general survey and comparison," SIGKDD Explor. Newsl., Volume 2, No. 1, Pages 58–64, 2000.
[7] B. Liu, Y. Ma, and C. K. Wong, "Improving an association rule-based classifier," PKDD'00, Volume 32, Pages 504–509, 2000.
[8] W. Li, J. Han, and J. Pei, "CMAR: Accurate and efficient classification based on multiple class-association rules," ICDM'01, USA: IEEE Computer Society, Volume 25, Pages 369–376, 2001.
[9] R. D. Shah, "Modelling interactions in high-dimensional data with backtracking," J. Mach. Learn. Res., Volume 17, No. 1, Pages 7225–7255, 2016.
[10] S. Ventura and J. M. Luna, "Pattern Mining with Evolutionary Algorithms", Springer, Volume 16, Pages 134-145, 2016.