A New Feature Selection Method Based on Frequent and Associated Itemsets
Research Article
Keywords: Feature Selection, Dimensionality Reduction, Text Classification, Association Rule Mining, Feature Interaction
DOI: https://doi.org/10.21203/rs.3.rs-1180542/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Feature selection is one of the major issues in pattern recognition. The quality of the selected features is important for classification, as low-quality data can degrade the performance of model construction. Because selected features often contain redundant information, this paper draws on association analysis theory from data mining to select important features. In this study, a novel Feature Selection method based on Frequent and Associated Itemsets (FS-FAI) for text classification is proposed. FS-FAI seeks to find relevant features and also takes feature interaction into account. Moreover, it uses association as a metric to evaluate the relevance between the target concept and the feature(s). To evaluate the efficacy of the proposed method, several experiments were conducted on a BBC data set collected from the BBC news website, and the obtained results were compared to well-known feature selection methods. The results demonstrate the effectiveness of the proposed feature selection method in selecting high-quality features and in handling redundant information in text classification.
1. Introduction
Feature selection is a significant component of data mining [1] and an important step in text categorization [2]. Text data is unstructured and consists of words. The Vector Space Model (VSM) is often used to represent textual data so that it can be processed by a computer: it converts the unstructured data into a structured form and treats each document as a bag of words (features) [3]. In fact, not all features are valuable when building the document classifier; some may be irrelevant or redundant. Moreover, when the irrelevant features outnumber the relevant ones, classification results are badly affected. In this case, selecting a subset of the original features often improves classification performance [4]. Consequently, feature selection in the field of text classification can be defined as the process of finding a minimum-size set of relevant text features that reduces the text classification error [5].
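As a brief illustration of the bag-of-words representation, the following sketch builds a small document-term matrix with scikit-learn; the library and the toy documents are illustrative assumptions, not part of the original study.

```python
# A minimal sketch of the Vector Space Model idea: documents become rows
# of a document-term matrix (a "bag of words").
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stocks rise as markets rally",
    "team wins the championship match",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)       # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())                       # each row holds one document's word counts
```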
A good feature selection method is important because it enhances both the accuracy and the efficiency of text classification. In this paper, we therefore propose a novel feature selection approach for text classification that reduces the size of the feature subset and improves the efficiency of the classifier without reducing its accuracy.
2. Problem Statement
One of the most important issues in text classification problems is the high dimensionality of the feature space. Therefore, selecting distinctive features is important for text classification. According to [6], there are two primary reasons for selecting certain features over others. The first is scalability: using a large number of features requires memory, computational power, storage, network bandwidth, and so on, so running on a smaller subset of features reduces the computation time. The second is related to algorithm performance: algorithms perform better when they do not consider features that add noise without providing additional information.
To overcome these problems, a novel Feature Selection method based on Frequent and Associated Itemsets (FS-FAI) for text classification is introduced to select important features. The proposed method builds on association analysis theory from data mining; association analysis can detect interesting associations between data items [7]. The idea of the proposed method is to identify features that are closely related to the target attribute and also correlated with each other by mining frequent and associated itemsets from the training data set, and thereby to effectively remove both irrelevant and redundant features.
3. Related Work
Feature selection has been an active area of research since the 1970s, and a large body of work has been published. This section reviews several existing studies on feature selection methodologies used to select important and distinctive features.
Pawening et al. [8] proposed an approach that selects features based on Mutual Information (MI) in order to classify heterogeneous features. The approach used the Joint Mutual Information Maximisation (JMIM) method to select features while taking the class label into consideration, and the Unsupervised Feature Transformation (UFT) method to convert non-numerical features into numerical ones.
The authors of [9] presented an improved Chi-Square (Chi2) test combined with inter-class concentration and frequency; the improvement addresses three aspects: in-class dispersion, frequency, and inter-class concentration.
In [10], the authors proposed an efficient data classification algorithm that incorporates feature selection based on association rule mining to select the features that have a significant impact on the target attribute.
A feature selection method based on association rules, ARFS, was developed in [11]. The algorithm uses association rules to extract the frequent 2-itemsets of the category and feature attributes in the data set, combines them with a sequential forward selection search for feature subsets, and uses the performance of a decision tree algorithm as the evaluation criterion for the selected feature subsets.
In [12], the authors integrated feature weighting and feature selection methods with the support vector machine (SVM) classifier. Chi2 was used to reduce the number of attributes significantly by keeping the K = 500 top-ranked attributes, and feature weighting was then used to calculate the weight of each selected attribute.
As is clear from the above studies, most feature selection approaches can efficiently identify non-relevant features using different evaluation functions, but they focus either on eliminating redundant features or on considering feature interaction. While feature selection has been studied extensively in the field of data mining, the use of association analysis as an approach to feature selection remains rare. In contrast, our algorithm uses association analysis as a feature selection method that aims to eliminate redundant and irrelevant features while also considering feature interactions.
4. Methodology
The quality of the data is important for classification, as low-quality data can degrade the performance of model construction. The reason for this low performance is that many irrelevant features are evaluated during the model building process. In this study, we therefore propose a novel feature subset selection approach named FS-FAI, which seeks to find relevant features and also takes feature interaction into account. Moreover, FS-FAI uses association as a metric to evaluate the relevance between the target concept and the feature(s), which differs from traditional measures such as the distance measure [13, 14, 15], the dependence measure [16, 17], and the consistency measure [7, 18]. The proposed system for text classification comprises five steps: text preprocessing, feature extraction, feature selection using the FS-FAI approach, classification, and finally evaluation of the performance of the proposed technique. Fig. 1 and Fig. 2 show the main steps of the algorithm and the general framework of the proposed system.
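The following is a hedged sketch of the text preprocessing step (tokenization, stop-word removal, and stemming [19, 20, 21]); NLTK's stop-word list and the Porter stemmer are assumed tools, not necessarily those used by the authors.

```python
# Illustrative text preprocessing: lowercase, tokenize, drop stop words, stem.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(document: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", document.lower())      # lowercase and tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    return [STEMMER.stem(t) for t in tokens]              # reduce words to stems

print(preprocess("The markets rallied sharply after the announcement."))
```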
In the feature extraction step, each term is weighted with the TF-IDF scheme:

$$W_{ji} = tf_{ji} \times \log\frac{M}{df_j} \qquad (1)$$

where $W_{ji}$ is the weight of term $j$ in document $i$, $M$ is the total number of documents, $tf_{ji}$ is the frequency of term $j$ in document $i$, and $df_j$ is the number of documents that contain term $j$ [24].
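A minimal plain-Python sketch of Equation (1) is given below, assuming the documents are already tokenized; the function and variable names are illustrative only.

```python
import math
from collections import Counter

# Direct computation of Equation (1): w_ji = tf_ji * log(M / df_j).
def tfidf_weights(tokenized_docs):
    M = len(tokenized_docs)
    df = Counter()                        # df_j: number of documents containing term j
    for doc in tokenized_docs:
        df.update(set(doc))
    weights = []
    for doc in tokenized_docs:
        tf = Counter(doc)                 # tf_ji: frequency of term j in document i
        weights.append({t: tf[t] * math.log(M / df[t]) for t in tf})
    return weights

docs = [["market", "stock", "rise"], ["match", "team", "win", "win"]]
print(tfidf_weights(docs))
```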
In this study, we focus on frequent itemsets that contain two or more items. After collecting the frequent itemsets for each class label, the next step is to prune these itemsets based on certain constraints.
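As an illustration under stated assumptions, the sketch below mines frequent itemsets separately for each class label with the Apriori algorithm [25]; mlxtend and the Boolean document-term DataFrame are assumed tooling, not the authors' implementation. Single-item itemsets are kept here only because their supports are needed later for the all-confidence computation.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori

def frequent_itemsets_per_class(bool_df: pd.DataFrame, labels: pd.Series, min_supp: float):
    """Mine frequent itemsets from the documents of each class label."""
    result = {}
    for label in labels.unique():
        subset = bool_df[labels == label]                        # documents of one class
        result[label] = apriori(subset, min_support=min_supp, use_colnames=True)
    return result
```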
All-confidence is used to estimate the degree of mutual association within an itemset. The all-confidence [26] of an itemset $Y = \{k_1, \ldots, k_i\}$, denoted all-conf$(Y)$, is defined as follows:

$$\text{all-conf}(Y) = \frac{supp(Y)}{\max\{supp(k_1), \ldots, supp(k_i)\}} \qquad (2)$$
According to Equation (2), the all-confidence of each frequent itemset is calculated, and every itemset whose all-confidence is less than or equal to the minimum all-confidence threshold (Min_allconf) is eliminated.
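A hedged sketch of Equation (2) and of the Min_allconf pruning step follows; the dictionary-based support lookup is an assumed data structure.

```python
# all-conf(Y) = supp(Y) / max_k supp({k}); `supports` maps frozensets
# (including single items) to their support values.
def all_confidence(itemset: frozenset, supports: dict) -> float:
    return supports[itemset] / max(supports[frozenset([k])] for k in itemset)

def prune_by_all_confidence(itemsets, supports, min_allconf: float):
    # keep only itemsets whose all-confidence exceeds the Min_allconf threshold
    return [y for y in itemsets if all_confidence(y, supports) > min_allconf]
```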
The itemsets are also pruned for redundancy based on itemset containment between the mined itemsets. At the end of the pruning phase, the frequent and associated itemsets, whose support and all-confidence exceed the respective thresholds, have been mined; the closer the association among the items of an itemset, the more effective that itemset is for classification. Redundant itemsets are then eliminated according to the containment principle.
At the end of the feature selection phase, the final feature subset is identified: it retains frequent and relevant features, eliminates redundant and irrelevant ones, and takes feature interaction into account.
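The sketch below illustrates one possible reading of the containment-based redundancy pruning and the assembly of the final feature subset; the assumption that an itemset contained in another surviving itemset is the redundant one is ours, not stated explicitly above.

```python
# Assumed containment rule: drop any itemset strictly contained in another itemset.
def remove_contained_itemsets(itemsets):
    kept = []
    for y in itemsets:
        if not any(y < other for other in itemsets):   # y strictly contained in another itemset
            kept.append(y)
    return kept

def final_feature_subset(itemsets_per_class):
    """Union of the terms appearing in the surviving itemsets of all classes."""
    selected = set()
    for itemsets in itemsets_per_class.values():
        for y in remove_contained_itemsets(itemsets):
            selected.update(y)
    return selected
```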
$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \qquad (3)$$

$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall} = \frac{(\beta^2 + 1) \times TP}{(\beta^2 + 1) \times TP + \beta^2 \times FN + FP}, \quad \beta = 1 \qquad (4)$$
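A short sketch of how Equations (3) and (4) can be computed with scikit-learn follows; the label arrays are toy values used only for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["sport", "business", "sport", "tech"]   # toy ground-truth labels
y_pred = ["sport", "business", "tech", "tech"]    # toy predicted labels

print(accuracy_score(y_true, y_pred))                 # (TP + TN) / (TP + TN + FP + FN)
print(f1_score(y_true, y_pred, average="macro"))      # harmonic mean of precision and recall
```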
Table 1
The BBC data set description.
Class label No. of documents No. of documents in the training set No. of documents in the testing set
Table 2 shows the number and percentage of frequent features selected using our proposed feature selection method from the preprocessed BBC training data set, which contains 2909 words/features.
Table 2
Number and percentage of frequent features using the FS-FAI feature
selection method.
Support NO. of selected features Percentage of selected features
Table 3
Number and percentage of selected features using the Chi2 and MI feature selection methods.

Percentage   No. of selected features
≈ 5%         146
≈ 6%         175
≈ 8%         233
≈ 10%        291
≈ 20%        582
≈ 30%        873
≈ 40%        1164
Tables 2 and 3 present comparative results for the number of features selected by the three feature selection methods. All of the methods significantly decrease the number of selected features. Our proposed method, however, is not limited to reducing the number of selected features: it also finds the frequent and associated features that are related to each other as well as to the target variable, and it removes redundant and irrelevant features. Additionally, FS-FAI does not use ranking criteria for variable selection as the Chi2 and MI methods do, where the Chi2 method measures the independence of two variables (i.e., a feature and the target) while the MI method is used to address redundancy.
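For comparison, the sketch below shows what percentage-based ranking selection with Chi2 and MI could look like in scikit-learn; the toy data and variable names are assumptions, not the experimental setup of this paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, chi2, mutual_info_classif

# Toy non-negative "document-term" matrix; in the paper X would be the TF-IDF
# matrix of the BBC training set.
X, y = make_classification(n_samples=100, n_features=50, random_state=0)
X = np.abs(X)                                    # chi2 requires non-negative features

X_chi2 = SelectPercentile(score_func=chi2, percentile=20).fit_transform(X, y)
X_mi = SelectPercentile(score_func=mutual_info_classif, percentile=20).fit_transform(X, y)
print(X_chi2.shape, X_mi.shape)                  # 20% of 50 features -> 10 columns each
```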
The effectiveness of the features selected using the three methods will be evaluated in the next section.
As shown in Table 4, the performance of the various classification techniques is compared under the first and second scenarios in terms of F-measure and classification accuracy; the highest values achieved are highlighted in bold.
Table 4
Comparison of different classifiers when using the first and second scenarios in terms of F-measure and accuracy. For the second scenario, each cell gives the Chi2 / MI result at the indicated feature selection percentage.

Classifier  Metric      First scenario  5%               6%               8%               10%              20%              30%              40%
NB          Accuracy    90.719          83.533 / 78.743  85.329 / 81.138  87.425 / 83.383  89.521 / 84.880  92.365 / 84.731  91.018 / 84.431  8…
NB          F-measure   91              84 / 79          85 / 81          87 / 83          90 / 85          92 / 85          91 / 84          8…
DT          Accuracy    84.443          85.179 / 78.443  86.527 / 80.239  85.479 / 78.743  86.228 / 80.988  84.880 / 81.138  84.132 / 83.234  8…
DT          F-measure   84              85 / 78          86 / 80          85 / 79          86 / 81          85 / 81          84 / 83          8…
LR          Accuracy    94.431          87.575 / 80.389  89.521 / 83.084  90.868 / 85.928  92.066 / 87.725  94.910 / 91.317  95.359 / 92.814  9…
LR          F-measure   94              87 / 80          89 / 83          91 / 86          92 / 88          95 / 91          95 / 93          9…
RF          Accuracy    90.569          86.976 / 82.784  89.970 / 85.179  90.868 / 87.575  90.269 / 88.922  90.868 / 88.024  90.269 / 89.222  9…
RF          F-measure   91              87 / 83          90 / 85          91 / 88          90 / 89          91 / 88          90 / 89          9…
From Table 4, when applying the classification techniques under the first scenario, LR achieves the highest performance, with 94.43% accuracy and a 94% F-measure.
The aim of the second-scenario experiment is to evaluate the effectiveness of the classification techniques when the Chi2 and MI feature selection methods are applied, and to compare it with the first scenario. Features were selected from the training data set at different percentages of the highest-scoring features.

From Table 4, we notice that classification under the second scenario is not always effective, especially at low feature selection percentages; this may be because some important features are filtered out at low percentages, which affects the efficiency of the classifiers. However, it can improve the performance of the classifiers when higher feature selection percentages are used. Additionally, the Chi2 feature selection method consistently yields better results than the MI method in terms of F-measure and accuracy, indicating that the features selected by the Chi2 method are more effective.
In most cases, at feature selection percentages of 5%, 6%, 8%, and 10%, the performance of the classifiers with the selected features was very poor, except in the case of DT, whereas at percentages of 20%, 30%, and 40% the selected features usually improved the performance of the classifiers. For NB, performance is improved when using Chi2 at 20% (92.37% accuracy, 92% F-measure) and at 30% (91.02% accuracy, 91% F-measure). DT performs better with the features selected by Chi2 at 5%, 6%, 8%, 10%, and 20%, with classification accuracies of 85.18%, 86.53%, 85.48%, 86.23%, and 84.88%, respectively, and F-measures of 85%, 86%, 85%, 86%, and 85%, respectively. For LR, performance is improved when using Chi2 at 20% (94.91% accuracy, 95% F-measure) and at 30% (95.36% accuracy, 95% F-measure). For RF, performance is improved when using Chi2 at 8%, 20%, and 40%, with classification accuracies of 90.87%, 90.87%, and 91.47%, respectively, and an F-measure of 91% at all three percentages.

These results also show that the features selected by Chi2 at 20% improve the performance of all the classification methods used, even if the improvement is small, while the number of features used is much smaller than in the first scenario.
Table 5 summarizes the previous results presented in Table 4, where it shows the best results achieved for each classifier using the second scenario in terms
of classification accuracy and F-measure, as well as the number of features that are used to train the classifier.
Table 5
The best results achieved for each classifier using the second scenario
Accuracy F-measure percentage % No. of selected features Feature selection method
The NB obtained its best F-measure of 92% and best accuracy of 92.37% using 582 features, and the DT obtained its best F-measure of 86% and best accuracy of 86.53% using 175 features. The LR obtained the best F-measure of 95% and the best accuracy of 95.36% using 873 features. The RF obtained the best F-measure of 91% and the best accuracy of 91.47% using 1164 features.
Figure 8 shows comparative results of the classification techniques under the first and second scenarios. When the classification techniques are applied under the second scenario with Chi2 feature selection, the results improve compared to the first scenario in terms of F-measure and classification accuracy. LR with Chi2 gives the best performance, with 95.36% accuracy and a 95% F-measure, using only 30% of the features.
The performance of classification techniques using the third scenario with the proposed feature selection method (FS-FAI) is shown in Table 6.
Table 6
Performance comparison of different classifiers using the proposed FS-FAI method with different Minsupp threshold values.

Classifier   Metric       Minsupp = 0.030   0.035   0.040   0.045
NB           F-measure    99                97      97      95
DT           F-measure    96                95      95      93
LR           F-measure    97                96      96      95
RF           F-measure    98                97      97      95
The aim of the third-scenario experiment is to evaluate the effectiveness of the classification techniques when the proposed FS-FAI feature selection method is applied, and to compare it with the first and second scenarios. In the FS-FAI method, features were selected from the training data set using different values of Minsupp (0.030, 0.035, 0.040, and 0.045), while the Min_allconf threshold was fixed at 0.13 and the minimum frequency threshold was fixed at 0.048.
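The following illustrative sweep reuses the hypothetical helpers sketched in the methodology section (frequent_itemsets_per_class, prune_by_all_confidence, final_feature_subset) with the threshold values reported above; the toy data merely stands in for the preprocessed BBC training set.

```python
import pandas as pd

# Toy stand-in for the preprocessed BBC training data: a Boolean document-term
# DataFrame and a parallel Series of class labels (assumed format).
train_bool_df = pd.DataFrame(
    [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]],
    columns=["market", "stock", "team", "match"],
).astype(bool)
train_labels = pd.Series(["business", "business", "sport", "sport"])

MIN_ALLCONF = 0.13  # fixed threshold reported in the text

for min_supp in (0.030, 0.035, 0.040, 0.045):
    per_class = frequent_itemsets_per_class(train_bool_df, train_labels, min_supp)
    pruned = {}
    for label, fi in per_class.items():
        supports = dict(zip(fi["itemsets"], fi["support"]))       # itemset -> support
        candidates = [y for y in fi["itemsets"] if len(y) >= 2]   # itemsets with 2+ terms
        pruned[label] = prune_by_all_confidence(candidates, supports, MIN_ALLCONF)
    features = final_feature_subset(pruned)
    print(f"Minsupp={min_supp}: {len(features)} selected features")
```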
From Table 6 we notice that the classification techniques perform better when the proposed FS-FAI approach is used to select features. We can also observe that, in most cases, classifier performance increases as the support value decreases. This is reasonable because some frequent itemsets are filtered out by a high Minsupp threshold: the FS-FAI method generates fewer itemsets when the Minsupp threshold is high, so useful itemsets are not extracted.
When comparing the results of the third scenario with those of the first, we found that the classification techniques perform much better under the third scenario than under the first, as depicted in Fig. 9. In addition, the NB classifier under the third scenario has the best performance, with the highest accuracy (99.1%) and F-measure (99%), using only 8% of the features.
When comparing the results of the third scenario shown in Table 6 with those of the second scenario shown in Table 4, we found that the classification techniques perform much better under the third scenario. The exception is LR, which, when applied with Chi2 at 20% and 30% of the features, performs better in terms of classification accuracy than when applied with the proposed FS-FAI method at Minsupp = 0.045: the accuracy of LR with Chi2 at 20% and 30% is 94.9% and 95.4%, respectively, while it is 94.7% with the FS-FAI method (Minsupp = 0.045). The difference in accuracy between the two methods is not significant, however, and the percentage of features used by the FS-FAI method is 4%, which is much lower than that used by the Chi2 method.
In Fig. 10, the best results achieved when applying the classification methods are compared using the second (as mentioned in Table 5) and the third scenario.
It can be concluded that the NB classifier using the third scenario has the best performance with accuracy (99.1%), and F-measure (99%) using only 8% of
features.
6. Conclusion
In this paper, we proposed a novel feature selection method based on frequent and associated itemsets (FS-FAI) for text classification, aimed at identifying important features. The goal of the proposed method is to find the frequent and associated features that are correlated with each other and closely related to the target attribute; furthermore, it can effectively remove both irrelevant and redundant features from the feature space.

The results of the experiments presented in this work are summarized as follows:
• When applying the classification techniques under the second scenario compared to the first, and using fewer features, the Chi2 method sometimes improved the efficiency of the classifiers in terms of F-measure and classification accuracy, unlike MI. This means that the features selected by the Chi2 method are more effective.
• When applying classification techniques using the third scenario compared to other scenarios, we found that
• The third scenario achieved the highest performance in terms of F-measure, and accuracy using the features selected by the proposed FS-FAI method.
The proposed FS-FAI method reduced the number of selected features by a great percentage.
Although the number of selected features was much less, their quality and efficiency were higher compared to the selected features by Chi2 and MI
methods, resulting in reduced computational cost and improved classification performance.
The NB classifier using the proposed FS-FAI method had the best performance, with the highest accuracy (99.1%) and F-measure (99%), using only 230 of the 2909 features (8%).
In conclusion, the work presented in this paper has achieved the objectives of this study, and the obtained results demonstrate the effectiveness of the proposed method.
Declarations
Acknowledgements
The authors sincerely acknowledge the Computer Science Department, Faculty of Science, Minia University, for the facilities and support.
Disclosure of potential Conflict of Interest: The authors declare that they have no conflict of interest.
Ethical Statement: “All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or
national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.”
Consent Statement: “Informed consent was obtained from all individual participants included in the study.”
References
1. Peng, H., & Fan, Y. (2017). Feature selection by optimizing a lower bound of conditional mutual information. Information Sciences, 418, 652–667.
2. Shang, C., Li, M., Feng, S., Jiang, Q., & Fan, J. (2013). Feature selection via maximizing global information gain for text classification. Knowledge-Based
Systems, 54, 298–309.
3. Zhang, L., & Duan, Q. (2019). A feature selection method for multi-label text based on feature importance. Applied Sciences, 9(4), 665.
4. Sangodiah, A., Ahmad, R., & Ahmad, W. F. W. (2014). A review in feature extraction approach in question classification using Support Vector Machine.
2014 IEEE International Conference on Control System, Computing and Engineering (ICCSCE 2014), 536–541.
5. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1), 1–47.
6. Forman, G. (2007). Feature selection for text classification. Computational Methods of Feature Selection, 16, 257–274.
7. Zhao, Z., & Liu, H. (2009). Searching for interacting features in subset selection. Intelligent Data Analysis, 13(2), 207–228.
8. Pawening, R. E., Darmawan, T., Bintana, R. R., Arifin, A. Z., & Herumurti, D. (2016). Feature selection methods based on mutual information for classifying
heterogeneous features. Jurnal Ilmu Komputer Dan Informasi, 9(2), 106–112.
9. Sun, J., Zhang, X., Liao, D., & Chang, V. (2017). Efficient method for feature selection in text classification. 2017 International Conference on Engineering
and Technology (ICET), 1–6.
10. Kaoungku, N., Suksut, K., Chanklan, R., Kerdprasop, K., & Kerdprasop, N. (2017). Data classification based on feature selection with association rule
mining. Proceedings of the International MultiConference of Engineers and Computer Scientists, 1.
11. Qu, Y., Fang, Y., & Yan, F. (2019). Feature Selection Algorithm Based on Association Rules. Journal of Physics: Conference Series, 1168(5).
12. Larasati, U. I., Muslim, M. A., Arifudin, R., & Alamsyah (2019). Improve the Accuracy of Support Vector Machine Using Chi Square Statistic and Term Frequency Inverse Document Frequency on Movie Review Sentiment Analysis. Scientific Journal of Informatics, 6(1), 138–149.
13. Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. European Conference on Machine Learning, 171–182
14. Liu, M., & Zhang, D. (2016). Feature selection with effective distance. Neurocomputing, 215, 100–109.
15. Anggraeny, F. T., Purbasari, I. Y., & Suryaningsih, E. (2018). Relief Feature Selection and Bayesian Network Model for Hepatitis Diagnosis. Prosiding
International Conference on Information Technology and Business (ICITB), 113–118.
16. Barraza, N., Moro, S., Ferreyra, M., & de la Peña, A. (2019). Mutual information and sensitivity analysis for feature selection in customer targeting: A
comparative study. Journal of Information Science, 45(1), 53–67.
17. Sinayobye, J. O., Kyanda, S. K., Kiwanuka, N. F., & Musabe, R. (2019). Hybrid model of correlation based filter feature selection and machine learning
classifiers applied on smart meter data set. 2019 IEEE/ACM Symposium on Software Engineering in Africa (SEiA), 1–10.
18. Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1(3), 131–156.
19. Verma, T., Renu, R., & Gaur, D. (2014). Tokenization and filtering process in RapidMiner. International Journal of Applied Information Systems, 7(2), 16–18.
20. Saif, H., Fernández, M., He, Y., & Alani, H. (2014). On stop words, filtering and data sparsity for sentiment analysis of twitter.
21. Samir, A., & Lahbib, Z. (2018). Stemming and lemmatization for information retrieval systems in amazigh language. International Conference on Big Data,
Cloud and Applications, 222–233.
22. Liu, Q., Wang, J., Zhang, D., Yang, Y., & Wang, N. (2018). Text features extraction based on TF-IDF associating semantic. 2018 IEEE 4th International Conference on Computer and Communications (ICCC), 2338–2343.
23. Soucy, P., & Mineau, G. W. (2005). Beyond TFIDF weighting for text categorization in the vector space model. IJCAI, 5, 1130– 1135.
24. Ahuja, R., Chug, A., Kohli, S., Gupta, S., & Ahuja, P. (2019). The impact of features extraction on the sentiment analysis. Procedia Computer Science, 152,
341–348.
25. Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th VLDB Conference, 487–499.
26. Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., & Verkamo, A. I. (1994). Finding interesting rules from large sets of discovered association rules.
Proceedings of the Third International Conference on Information and Knowledge Management, 401–407.
27. Sokolova, M., Japkowicz, N., & Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation.
Australasian Joint Conference on Artificial Intelligence, 1015–1021.
Figures
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Comparison of the best results for classification techniques using first and third scenarios