
A New Feature Selection Method Based on Frequent and Associated Itemsets

For Text Classification


Heba Mamdouh Farghaly
Minia University Faculty of Science
Tarek Abd El-Hafeez (  [email protected] )
Minia University Faculty of Science https://orcid.org/0000-0003-1785-1058

Research Article

Keywords: Feature Selection, Dimensionality Reduction, Text Classification, Association Rule Mining, Feature Interaction

Posted Date: March 7th, 2022

DOI: https://doi.org/10.21203/rs.3.rs-1180542/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract
Feature selection is one of the major issues in pattern recognition. The quality of the selected features is important for classification, as low-quality data can degrade the performance of model construction. Because selected features often contain redundant information that is difficult to deal with, this paper draws on association analysis theory from data mining to select important features. In this study, a novel Feature Selection method based on Frequent and Associated Itemsets (FS-FAI) for text classification is proposed. FS-FAI seeks to find relevant features and also takes feature interaction into account. Moreover, it uses association as a metric to evaluate the relevance between the target concept and feature(s). To evaluate the efficacy of the proposed method, several experiments were conducted on the BBC data set from the BBC news website, and the obtained results were compared to well-known feature selection methods. The reported results demonstrate the effectiveness of the proposed feature selection method in selecting high-quality features and in handling redundant information in text classification.

1. Introduction
Feature selection is a significant component of data mining [1] and an important step in text categorization [2]. Text data is unstructured and consists of words. The Vector Space Model (VSM) is often used to represent textual data so that it can be processed by a computer: it converts unstructured data into a structured form and treats a document as a bag of words (features) [3]. In fact, not all features are valuable when building a document classifier; some of them may be irrelevant or redundant. Moreover, when irrelevant features outnumber relevant ones, classification results suffer. In this case, selecting a subset of the original features often improves classification performance [4]. Consequently, feature selection in the field of text classification can be defined as a process that aims to find a minimum-size set of relevant text features that reduces the text classification error [5].

A good feature selection method is important because it enhances both the accuracy and the efficiency of text classification. In this paper, we therefore propose a novel feature selection approach for text classification that reduces the size of the selected feature subset and improves the efficiency of the classifier without reducing its accuracy.

2. Problem Statement
One of the most important issues in text classification problems is the high dimensionality of the feature space. Therefore, the selection of distinctive features is important for text classification. According to [6], there are two primary reasons for selecting certain features over others. The first reason is scalability: using a large number of features requires memory, computation power, storage, network bandwidth, and so on, so running on a smaller subset of features reduces the computation time. The second reason is related to the performance of the algorithm; for example, algorithms perform better when they do not consider features that add noise without providing additional information.

To overcome these problems, a novel Feature Selection method based on Frequent and Associated Itemsets (FS-FAI) for text classification is introduced to select important features. The proposed method incorporates association analysis theory from data mining. Association analysis can detect interesting associations between data items [7]. The idea of the proposed method is to identify features that are closely related to the target attribute and also correlated with each other by mining frequent and associated itemsets from the training data set, and in doing so to effectively remove both irrelevant and redundant features.

3. Related Work
Feature selection has been an active area of research since the 1970s, and a large body of work has been published. This section reviews several existing studies on the feature selection methodologies used to select important and distinctive features.

Pawening et al. [8] proposed an approach to select features based on Mutual Information (MI) in order to classify heterogeneous features. The approach used the Joint Mutual Information Maximisation (JMIM) method to select features while taking the class label into consideration. It also used the Unsupervised Feature Transformation (UFT) method to convert non-numerical features into numerical ones.

The authors of [9] presented an improved Chi-square (Chi2) test that combines the test with inter-class concentration and frequency. The Chi2 improvement was performed along three aspects: in-class dispersion, frequency, and inter-class concentration.

In [10], the authors proposed an efficient data classification algorithm that incorporates feature selection based on association rule mining to select the features having a significant impact on the target attribute.

A feature selection method based on association rules, ARFS, was developed in [11]. The algorithm uses association rules to extract the frequent 2-itemsets of the category and feature attributes in the data set. It then combines a sequential forward selection approach to search for feature subsets and uses the performance of a decision tree algorithm as the evaluation criterion for the selected feature subsets.

In [12], the authors integrated feature weighting and feature selection methods with the Support Vector Machine (SVM) classifier. Chi2 was used to reduce the number of attributes significantly by keeping the top K = 500 ranked attributes; feature weighting was then used to calculate the weight of each selected attribute.

As is clear from the above studies, most feature selection approaches can efficiently identify non-relevant features using different evaluation functions, but they focus either on eliminating redundant features or on considering feature interaction, rarely on both. While feature selection has been studied extensively in the field of data mining, the use of association analysis as an approach for feature selection has rarely been studied. In contrast, our algorithm uses association analysis as a feature selection method that aims to eliminate redundant and irrelevant features while also considering feature interactions.

4. Methodology
The quality of the data is important for classification, as low-quality data can degrade the performance of model construction. The reason for this low performance is that many irrelevant features are evaluated during the model building process. In this study, we therefore propose a novel feature subset selection approach named FS-FAI, which seeks to find relevant features and also takes feature interaction into account. Moreover, FS-FAI uses association as a metric to evaluate the relevance between the target concept and feature(s), which differs from traditional measures such as the distance measure [13, 14, 15], the dependence measure [16, 17], and the consistency measure [7, 18]. The proposed system for text classification consists of five steps: text preprocessing, feature extraction, feature selection using the FS-FAI approach, classification, and, finally, evaluation of the performance of the proposed technique. Fig. 1 and Fig. 2 show the main steps of the algorithm and the general framework of the proposed system.

4.1. Text Preprocessing


Preprocessing is a critical and important task in text mining that is used to reduce the number of features in a data set and to improve the performance of the classification technique in terms of classification accuracy and resource requirements. Most document and text data sets include many useless tokens such as slang, misspellings, and stop words. In several algorithms, especially probabilistic and statistical learning algorithms, unnecessary features and noise can have an adverse impact on system performance. Three common preprocessing steps of text classification, namely tokenization [19], stop-word removal [20], and lemmatization [21], are considered within the scope of this study.
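As a concrete illustration, the sketch below applies these three steps; the use of NLTK and the helper name preprocess are our own assumptions, since the paper does not name the preprocessing tools it used.

# Minimal preprocessing sketch; the use of NLTK is an assumption.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time download of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(document):
    """Tokenize, drop non-alphabetic tokens and stop words, then lemmatize."""
    tokens = word_tokenize(document.lower())              # tokenization
    tokens = [t for t in tokens if t.isalpha()]           # remove punctuation/numbers
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [LEMMATIZER.lemmatize(t) for t in tokens]      # lemmatization

print(preprocess("The markets were rising sharply in London today."))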

4.2. Feature Extraction


Text feature extraction is an essential step in information retrieval and data mining. It quantifies the distinctive terms that represent the content of the text and converts the text from its unstructured original form into structured information that can be recognized and processed by a computer [22].
In this study, the TF-IDF technique, the most frequently used statistic for feature weighting, is used to weight and select unique words from the data set [23].
The weight of any term in any document can be calculated according to the following equation:

W_{ji} = tf_{ji} \times \log\left(\frac{M}{df_j}\right) \qquad (1)

where W_{ji} is the weight of term j in document i, M is the total number of documents, tf_{ji} is the frequency of term j in document i, and df_j is the number of documents that contain term j [24].
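The following minimal sketch computes these weights directly from Equation (1); the function name and the choice of a base-10 logarithm are our assumptions, since the paper does not state the log base.

import math
from collections import Counter

def tfidf_weights(documents):
    """Compute W_ji = tf_ji * log(M / df_j) for every term j in document i (Eq. 1).
    `documents` is a list of token lists; base-10 log is an assumption."""
    M = len(documents)
    df = Counter()                       # df_j: number of documents containing term j
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)                # tf_ji: frequency of term j in document i
        weights.append({t: tf[t] * math.log10(M / df[t]) for t in tf})
    return weights

docs = [["market", "profit", "market"], ["match", "goal"], ["market", "goal"]]
print(tfidf_weights(docs)[0])            # weights for the first document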

4.3. Proposed Feature Selection Method


In this section, a novel feature selection approach named FS-FAI is proposed. The proposed approach is specifically designed for text categorization tasks. It seeks to find the relevant features while considering the interaction of the features with the class attribute and also with each other. As depicted in Fig. 3 and Fig. 4, it consists of three steps:

(1) finding frequent itemsets for each class label;

(2) frequent itemset pruning;

(3) feature subset identification.

4.3.1. Finding Frequent Itemsets for Each Class Label


This step focuses on mining the frequent itemsets from the preprocessed training data set. The purpose of mining frequent itemsets is to discover interesting, hidden relationships between patterns in the data set. Frequent itemsets are defined as those whose occurrence frequency is greater than a specified minimum support (Minsupp) threshold. At this stage, the Apriori method is applied to the text documents. Apriori first identifies all frequent individual items (i.e., 1-itemsets) that satisfy the Minsupp threshold, then iteratively joins the frequent (k-1)-itemsets into candidate k-itemsets and keeps those that still satisfy the threshold; this process is repeated until no new frequent itemsets can be generated [25].

In this study, we focus on frequent itemsets that contain two or more items. After collecting the frequent itemsets for each class label, the next step is to prune these itemsets based on some constraints.
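A minimal sketch of this mining step is shown below; it counts frequent 2-itemsets per class label with plain Python rather than a dedicated Apriori library, and the function and variable names are our own.

from collections import Counter
from itertools import combinations

def frequent_itemsets_per_class(docs_by_class, min_supp=0.03):
    """Apriori-style mining of frequent 2-itemsets for each class label (sketch).
    `docs_by_class` maps a class label to a list of token lists; `min_supp` is the
    relative Minsupp threshold (0.03-0.045 in the experiments)."""
    result = {}
    for label, docs in docs_by_class.items():
        n = len(docs)
        # Pass 1: frequent 1-itemsets that satisfy the Minsupp threshold.
        item_counts = Counter()
        for doc in docs:
            item_counts.update(set(doc))
        frequent_items = {t for t, c in item_counts.items() if c / n >= min_supp}
        # Pass 2: candidate 2-itemsets built only from frequent items (Apriori property).
        pair_counts = Counter()
        for doc in docs:
            kept = sorted(set(doc) & frequent_items)
            pair_counts.update(combinations(kept, 2))
        result[label] = {pair: c / n for pair, c in pair_counts.items() if c / n >= min_supp}
    return result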

4.3.2. Frequent Itemsets Pruning


The pruning process aims at reducing the number of itemsets generated in the frequent itemset mining step. Some itemsets may not have the discriminative power to distinguish between classes and may contribute to invalid classification, so we need to prune the itemsets to remove redundant and noisy information. Our proposed algorithm uses the following methods for itemset pruning, as shown in Figures 5 and 6:

• First, find the associated itemsets:

All-confidence is used to estimate the degree of mutual association within an itemset. The all-confidence [26] of an itemset Y = {k_1, ..., k_i}, denoted all-conf(Y), is defined as follows:

\text{all-conf}(Y) = \frac{supp(Y)}{\max\left(supp(k_1), \ldots, supp(k_i)\right)} \qquad (2)

According to Equation (2), the all-confidence of each frequent itemset is calculated, and each itemset whose all-confidence is less than or equal to the minimum all-confidence (Min_allconf) threshold is eliminated.
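A small sketch of this filter is given below, assuming the supports of the itemsets and of the individual items have already been computed; the helper names are illustrative.

def all_confidence(itemset_support, item_supports):
    """Equation (2): all-conf(Y) = supp(Y) / max(supp(k1), ..., supp(ki))."""
    return itemset_support / max(item_supports)

def prune_by_all_confidence(itemsets, item_support, min_allconf=0.13):
    """Keep only itemsets whose all-confidence exceeds the Min_allconf threshold.
    `itemsets` maps an itemset (tuple of terms) to its support; `item_support`
    maps a single term to its support. 0.13 is the threshold used in Section 5.2.1."""
    return {
        items: supp
        for items, supp in itemsets.items()
        if all_confidence(supp, [item_support[k] for k in items]) > min_allconf
    }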

• Second, pruning based on the itemset redundancy method:

The itemsets are also pruned using the itemset redundancy criterion, under which an itemset is considered redundant if it contains another mined itemset.

At the end of the pruning phase, the frequent and associated itemsets whose support and all-confidence exceed their respective thresholds are retained. The closer the association among the items, the more effective the itemsets are for classification. Furthermore, the redundant itemsets are eliminated based on the containment principle described above.
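One possible reading of this containment rule is sketched below: an itemset is dropped when it strictly contains another mined itemset. The exact containment test used by FS-FAI is not spelled out in the text, so this is an assumption.

def prune_redundant(itemsets):
    """Drop an itemset if it strictly contains another mined itemset
    (one possible reading of the containment rule described above)."""
    all_sets = [frozenset(items) for items in itemsets]
    kept = {}
    for items, supp in itemsets.items():
        fs = frozenset(items)
        if not any(other < fs for other in all_sets):   # a strict subset exists inside fs
            kept[items] = supp
    return kept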

4.3.3. Feature Subset Identification


In this phase, features are selected from the set of frequent and associated itemsets. The goal is to keep a set of words that are helpful in predicting the class labels effectively. For this purpose, features whose frequency of occurrence in the set of frequent and associated itemsets is below the defined minimum frequency threshold are deleted. Finally, the algorithm returns the set of features that are important in predicting the class attribute, based on an analysis of their occurrence in the set of itemsets extracted from the training data set. Fig. 7 shows our algorithm for identifying the frequent feature subset.
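The sketch below illustrates one plausible reading of this rule, in which a feature is kept if the fraction of retained itemsets containing it reaches the minimum frequency threshold; the exact frequency definition is an assumption on our part.

from collections import Counter

def select_features(pruned_itemsets_per_class, min_freq=0.048):
    """Keep a feature if the fraction of frequent and associated itemsets that
    contain it reaches the minimum frequency threshold (0.048 in the experiments)."""
    counts = Counter()
    n_itemsets = 0
    for itemsets in pruned_itemsets_per_class.values():
        for items in itemsets:
            n_itemsets += 1
            counts.update(set(items))
    if n_itemsets == 0:
        return set()
    return {term for term, c in counts.items() if c / n_itemsets >= min_freq}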

At the end of the feature selection phase, the final feature subset is identified; it retains frequent and relevant features, eliminates redundant and irrelevant ones, and takes feature interaction into account.

4.4. Classification Process


Classification algorithms are used to accurately assign a class to unseen documents. In this work, the well-known classifiers Random Forest (RF), Naïve Bayes (NB), Decision Tree (DT), and Logistic Regression (LR) are applied in order to investigate the contribution of the features selected by the proposed method to the classification accuracy.
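A minimal sketch of this stage with scikit-learn is given below; the specific scikit-learn classes (including the multinomial NB variant) are our assumptions, as the paper only names the four classifier families and states that default parameters were used.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Default-parameter classifiers; the concrete implementations are assumptions.
classifiers = {
    "NB": MultinomialNB(),
    "DT": DecisionTreeClassifier(),
    "LR": LogisticRegression(),
    "RF": RandomForestClassifier(),
}

def train_and_score(X_train, y_train, X_test, y_test):
    """Fit each classifier on the selected features and return its test accuracy."""
    return {name: clf.fit(X_train, y_train).score(X_test, y_test)
            for name, clf in classifiers.items()}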

4.5. Performance Evaluation


The performance of the proposed method is measured using well-known evaluation metrics: classification accuracy and the F-measure [27].

Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \qquad (3)

F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} = \frac{(\beta^2 + 1) \times TP}{(\beta^2 + 1) \times TP + \beta^2 \times FN + FP} \qquad (4)

where \beta = 1 for the F1-measure.
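For reference, a minimal sketch of these two metrics using scikit-learn is shown below; macro averaging over the five classes is our assumption, since the paper does not state how the per-class F-measures are aggregated.

from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """Accuracy (Eq. 3) and F-measure (Eq. 4 with beta = 1); macro averaging over
    the classes is an assumption."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f_measure": f1_score(y_true, y_pred, average="macro"),
    }

print(evaluate(["sport", "tech", "sport", "business"],
               ["sport", "business", "sport", "business"]))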

5. Experimental Results And Analysis


In this section, we conduct experiments to assess the performance of the proposed FS-FAI feature selection method. The FS-FAI algorithm is compared with two different feature selection methods, Chi2 and MI, using four different classifiers: DT, RF, NB, and LR. The experiments were run on a 3 GHz i5 computer with 4 GB of main memory and a 64-bit Windows 7 operating system, and were implemented in the Python programming language.

5.1. Preparing Data for Evaluation


In this study, the BBC data set from the BBC news website [28] is used to evaluate the classification system. BBC is one of the most popular text classification data sets; it contains 43,772 unique words and 2,225 documents/articles belonging to five different classes (business, entertainment, politics, sport, and tech). The raw documents were preprocessed and converted into word vectors: redundant symbols, characters, and words were removed, and words were then converted to their base form using lemmatization. To train the classifier, we used about 70% of the documents as a training data set to generate the frequent feature subset and build the classifier; the remaining 30% of the documents were used to test and evaluate the classifier. Table 1 shows the description of the BBC data set, and a minimal data preparation sketch is given after the table.

Table 1
The BBC data set description.
Class label No. of documents No. of documents in the training set No. of documents in the testing set

Business 510 354 156

Entertainment 386 261 125

Politics 417 299 118

Sport 511 357 154

Tech 401 286 115

Total 2225 1557 668
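As referenced above, the following is a minimal sketch of the data preparation; the CSV column names ("text", "category"), the stratified split, and the use of scikit-learn's TfidfVectorizer are assumptions rather than details taken from the paper.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load the released CSV; the column names "text" and "category" are assumptions.
data = pd.read_csv("cs-65924-bbctext.csv")

# 70%/30% train/test split of the documents (stratification is an assumption).
train_text, test_text, y_train, y_test = train_test_split(
    data["text"], data["category"], test_size=0.30,
    stratify=data["category"], random_state=42)

# TF-IDF weighting as described in Section 4.2.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)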

5.2. Results Analysis and Evaluation


In this section, we provide an evaluation of the proposed method. The first part of this section focuses on generating the frequent feature subset, while the second part evaluates the quality of the selected features through a set of experiments.

5.2.1. Frequent Feature Subset Generation


The main objective of this experiment is to select the best features from the list of features of the BBC data set using the proposed FS-FAI feature selection method. To find the best results for FS-FAI, different values of Minsupp are examined while fixing the Min_allconf threshold and the minimum frequency threshold for each run. The Minsupp threshold values were set to 0.030, 0.035, 0.04, and 0.045 to extract frequent itemsets from the preprocessed BBC training data set using the Apriori algorithm, while the Min_allconf threshold was fixed at 0.13 to prune the frequent itemsets and the minimum frequency threshold was fixed at 0.048 to select frequent and relevant features from the pruned itemsets. The minimum frequency and Min_allconf threshold values are the best values determined by trial and error.

Table 2 shows the number and percentage of frequent features selected using our proposed feature selection method from the preprocessed BBC training data set, which contains 2909 words/features.

Table 2
Number and percentage of frequent features using the FS-FAI feature
selection method.
Support NO. of selected features Percentage of selected features

0.03 230 ≈8%

0.035 181 ≈6%

0.04 154 ≈5%

0.045 125 ≈4%


The performance of the proposed feature selection method (FS-FAI) is compared with two well-known feature selection methods, Chi2 and MI, in terms of the number of selected features. The percentage of the highest-scoring features chosen by the Chi2 and MI methods is set to 5%, 6%, 8%, 10%, 20%, 30%, and 40%. The number of features selected by the Chi2 and MI methods from the preprocessed BBC training data set is listed in Table 3.

Table 3
Number and percentage of selected
features using the Chi2 and MI feature
selection method.
Percentage NO. of selected features

≈ 5% 146

≈ 6% 175

≈ 8% 233

≈ 10 % 291

≈ 20 % 582

≈ 30 % 873

≈ 40 % 1164
Tables 2 and 3 present comparative results for the number of features selected by the three feature selection methods. It can be seen that all of the feature selection methods significantly decrease the number of selected features. However, our proposed method is not limited to reducing the number of selected features: it also tries to find the frequent and associated features that are related to each other as well as to the target variable, and it can remove redundant and irrelevant features. Additionally, unlike the Chi2 and MI methods, it does not use ranking criteria for variable selection; the Chi2 method measures the independence of two variables (i.e., a feature and the target), while the MI method is used to address the problem of redundancy.
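For completeness, a minimal sketch of how these two baselines can be run with scikit-learn is given below; deriving K from a percentage of the vocabulary follows the setup described above, while the function name and defaults are our own.

from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

def select_top_percent(X_train, y_train, X_test, percent, method="chi2"):
    """Baseline used for comparison: keep the top `percent` of features ranked by
    Chi2 or MI (scikit-learn scoring functions; sketch only)."""
    score_func = chi2 if method == "chi2" else mutual_info_classif
    k = max(1, int(X_train.shape[1] * percent))
    selector = SelectKBest(score_func=score_func, k=k)
    return selector.fit_transform(X_train, y_train), selector.transform(X_test)

# Example: keep the top 20% of features ranked by Chi2.
# X_tr_sel, X_te_sel = select_top_percent(X_train, y_train, X_test, 0.20, "chi2")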

The effectiveness of the features selected using the three methods will be evaluated in the next section.

5.2.2. Results Evaluation


In order to test the effectiveness of our ideas and evaluate the quality of the selected features, several experiments were conducted using the well-known classification techniques DT, NB, RF, and LR in three scenarios: (1) without applying any feature selection method, (2) applying the well-known Chi2 and MI feature selection methods, and (3) applying the proposed feature selection method (FS-FAI). The performance of the classification techniques was then evaluated, using the default parameters of each classification technique.

As shown in Table 4, the performance of the various classification techniques when using the first and second scenarios is compared in terms of F-measure and classification accuracy.

Table 4
Comparison of different classifiers when using the first and second scenarios in terms of F-measure and accuracy. Second-scenario cells are given as Chi2 / MI for each feature selection percentage.

Classifier  Metric     First scenario  5%             6%             8%             10%            20%            30%            40%
NB          Accuracy   90.719          83.533/78.743  85.329/81.138  87.425/83.383  89.521/84.880  92.365/84.731  91.018/84.431  —
NB          F-measure  91              84/79          85/81          87/83          90/85          92/85          91/84          —
DT          Accuracy   84.443          85.179/78.443  86.527/80.239  85.479/78.743  86.228/80.988  84.880/81.138  84.132/83.234  —
DT          F-measure  84              85/78          86/80          85/79          86/81          85/81          84/83          —
LR          Accuracy   94.431          87.575/80.389  89.521/83.084  90.868/85.928  92.066/87.725  94.910/91.317  95.359/92.814  —
LR          F-measure  94              87/80          89/83          91/86          92/88          95/91          95/93          —
RF          Accuracy   90.569          86.976/82.784  89.970/85.179  90.868/87.575  90.269/88.922  90.868/88.024  90.269/89.222  —
RF          F-measure  91              87/83          90/85          91/88          90/89          91/88          90/89          —
From Table 4, when applying the classification techniques under the first scenario, we notice that LR achieves the highest performance, with 94.43% accuracy and a 94% F-measure.

In the second scenario, the aim of the experiment is to evaluate the effectiveness of applying the classification techniques with the Chi2 and MI feature selection methods and to compare it with the first scenario. Features were selected from the training data set under the different percentages of the highest-scoring features.

Table 4 shows that classification under the second scenario is not always effective, especially at low feature selection percentages; this may be because important features are filtered out at low percentages, which affects the efficiency of the classifiers.

However, feature selection can improve the performance of the classifiers when higher percentages are used. Additionally, we find that the Chi2 feature selection method consistently yields significantly better results than the MI method in terms of F-measure and accuracy, indicating that the features selected by the Chi2 method are more effective.

In most cases, when using feature selection percentages of 5%, 6%, 8%, and 10%, the performance of the classifiers with the selected features was very poor, except in the case of DT, whereas with percentages of 20%, 30%, and 40%, the selected features most often contribute to improving the performance of the classifiers. For NB, the performance is improved when using Chi2 at a percentage of 20%, with 92.37% accuracy and a 92% F-measure, and also at 30%, with 91.02% accuracy and a 91% F-measure. DT performs better with the features selected by Chi2 at percentages of 5%, 6%, 8%, 10%, and 20%, with classification accuracies of 85.18%, 86.53%, 85.48%, 86.23%, and 84.88%, respectively, and F-measures of 85%, 86%, 85%, 86%, and 85%, respectively. For LR, the performance is improved when using Chi2 at a percentage of 20%, with 94.91% accuracy and a 95% F-measure, and also at 30%, with 95.36% accuracy and a 95% F-measure. For RF, the performance is improved when using Chi2 at percentages of 8%, 20%, and 40%, with classification accuracies of 90.87%, 90.87%, and 91.47%, respectively, and an F-measure of 91% for all three percentages.

These results also show that using the features selected by Chi2 at a percentage of 20% improves the performance of all the classification methods used; even though the improvement is small, the number of features used in this case is much smaller than in the first scenario.

Table 5 summarizes the previous results presented in Table 4, where it shows the best results achieved for each classifier using the second scenario in terms
of classification accuracy and F-measure, as well as the number of features that are used to train the classifier.

Table 5
The best results achieved for each classifier using the second scenario.
Classifier  Accuracy  F-measure  Percentage (%)  No. of selected features  Feature selection method

NB  92.37  92  20  582  Chi2

DT  86.53  86  6  175  Chi2

LR  95.36  95  30  873  Chi2

RF  91.47  91  40  1164  Chi2


As shown in Table 5, the best performance for each classifier under the second scenario is achieved with Chi2 feature selection. The NB classifier obtained its best F-measure of 92% and best accuracy of 92.37% using 582 features. The DT obtained its best F-measure of 86% and best accuracy of 86.53% using 175 features. The LR obtained its best F-measure of 95% and best accuracy of 95.36% using 873 features. The RF obtained its best F-measure of 91% and best accuracy of 91.47% using 1164 features.

Figure 8 shows comparative results for the classification techniques using the first and second scenarios. When applying the classification techniques under the second scenario with Chi2 feature selection, the results are improved compared to the first scenario in terms of F-measure and classification accuracy. LR with Chi2 gives the best performance, with 95.36% accuracy and a 95% F-measure using only 30% of the features.

The performance of classification techniques using the third scenario with the proposed feature selection method (FS-FAI) is shown in Table 6.

Table 6
Performance comparison of different classifiers using the
proposed FS-FAI method with different Minsupp threshold
values
Minsupp threshold

0.03 0.035 0.04 0.045

NB Accuracy 99.102 96.707 96.707 95.059

F-measure 99 97 97 95

DT Accuracy 96.257 94.910 95.359 93.713

F-measure 96 95 95 93

LR Accuracy 97.455 96.107 96.407 94.760

F-measure 97 96 96 95

RF Accuracy 98.503 96.55 97.006 95.059

F-measure 98 97 97 95
In the third scenario, the aim of the experiment is to evaluate the effectiveness of applying the classification techniques with the proposed FS-FAI feature selection method and then to compare it with the first and second scenarios. In the FS-FAI feature selection method, features were selected from the training data set under different Minsupp values, namely 0.030, 0.035, 0.04, and 0.045, while the Min_allconf threshold was fixed at 0.13 and the minimum frequency threshold was fixed at 0.048.

From Table 6, we notice that the performance of the classification techniques improves when the proposed FS-FAI approach is used to select features. We can also observe that, in most cases, the performance of a classifier increases as the support value decreases. This is reasonable because some frequent itemsets are filtered out by a high Minsupp threshold; accordingly, the FS-FAI method generates fewer itemsets when the Minsupp threshold is high, and useful itemsets may not be extracted.

When comparing the results of the third scenario with those of the first scenario, we find that the classification techniques perform much better under the third scenario, as depicted in Fig. 9. In addition, we observe that the NB classifier under the third scenario has the best performance, with the highest accuracy (99.1%) and F-measure (99%) using only 8% of the features.

When comparing the results of the third scenario shown in Table 6 with the results of the second scenario shown in Table 4, we find that the classification techniques perform much better under the third scenario. The exception is LR: with Chi2 at 20% and 30% of the features, it performs better in terms of classification accuracy than with the proposed FS-FAI method at Minsupp = 0.045. The accuracy of LR with Chi2 at percentages of 20% and 30% is 94.9% and 95.4%, respectively, while it is 94.7% with the FS-FAI method (Minsupp = 0.045). The difference in accuracy between the two methods is not significant, yet the percentage of features used by the FS-FAI method is 4%, much lower than that used by the Chi2 method.

In Fig. 10, the best results achieved when applying the classification methods are compared using the second (as mentioned in Table 5) and the third scenario.
It can be concluded that the NB classifier using the third scenario has the best performance with accuracy (99.1%), and F-measure (99%) using only 8% of
features.

6. Conclusion
In this paper, we proposed a novel feature selection method based on frequent and associated itemsets (FS-FAI) for text classification in order to search for important features. The aim of the proposed method is to find the frequent and associated features that are correlated with each other and that are also closely correlated with the target attribute; furthermore, it can effectively remove both irrelevant and redundant features from the feature space.

The results of the experiments presented in this work are summarized as follows:

• When applying the classification techniques under the second scenario compared to the first scenario, and using a smaller number of features, the Chi2 method sometimes contributed to improving the efficiency of the classifiers in terms of F-measure and classification accuracy, unlike MI. This means that the features selected by the Chi2 method are more effective.

• When applying the classification techniques under the third scenario compared to the other scenarios, we found that:

  • The third scenario achieved the highest performance in terms of F-measure and accuracy using the features selected by the proposed FS-FAI method.

  • The proposed FS-FAI method reduced the number of selected features by a great percentage.

  • Although the number of selected features was much smaller, their quality and effectiveness were higher than those of the features selected by the Chi2 and MI methods, resulting in reduced computational cost and improved classification performance.

  • The NB classifier using the proposed FS-FAI method had the best performance, with the highest accuracy (99.1%) and F-measure (99%), using only 230 features out of 2909, i.e., 8% of the features.

As a concluding observation, the work presented in this paper has fulfilled the objectives of this study, and the results obtained demonstrate its effectiveness.

Declarations
Acknowledgements

The authors sincerely acknowledge the Computer Science Department, Faculty of Science, Minia University, for the facilities and support.

Funding: Not applicable.

Availability of data and material: https://github.com/tarekhemdan/Feature-Selection/blob/main/cs-65924-bbctext.csv

Disclosure of potential Conflict of Interest: The authors declare that they have no conflict of interest.

Ethical Statement: “All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or
national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.”

Consent Statement: “Informed consent was obtained from all individual participants included in the study.”

Code availability: https://github.com/tarekhemdan/Feature-Selection

References
1. Peng, Hanyang, & Fan, Y. (2017). Feature selection by optimizing a lower bound of conditional mutual information. Information Sciences, 418, 652–667.
2. Shang, C., Li, M., Feng, S., Jiang, Q., & Fan, J. (2013). Feature selection via maximizing global information gain for text classification. Knowledge-Based
Systems, 54, 298–309.
3. Zhang, L., & Duan, Q. (2019). A feature selection method for multi-label text based on feature importance. Applied Sciences, 9(4), 665.
4. Sangodiah, A., Ahmad, R., & Ahmad, W. F. W. (2014). A review in feature extraction approach in question classification using Support Vector Machine.
2014 IEEE International Conference on Control System, Computing and Engineering (ICCSCE 2014), 536–541.
5. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1), 1–47.
6. Forman, G. (2007). Feature selection for text classification. Computational Methods of Feature Selection, 16, 257–274.
7. Zhao, Z., & Liu, H. (2009). Searching for interacting features in subset selection. Intelligent Data Analysis, 13(2), 207–228.
8. Pawening, R. E., Darmawan, T., Bintana, R. R., Arifin, A. Z., & Herumurti, D. (2016). Feature selection methods based on mutual information for classifying
heterogeneous features. Jurnal Ilmu Komputer Dan Informasi, 9(2), 106–112.
9. Sun, J., Zhang, X., Liao, D., & Chang, V. (2017). Efficient method for feature selection in text classification. 2017 International Conference on Engineering
and Technology (ICET), 1–6.
10. Kaoungku, N., Suksut, K., Chanklan, R., Kerdprasop, K., & Kerdprasop, N. (2017). Data classification based on feature selection with association rule
mining. Proceedings of the International MultiConference of Engineers and Computer Scientists, 1.
11. Qu, Y., Fang, Y., & Yan, F. (2019). Feature Selection Algorithm Based on Association Rules. Journal of Physics: Conference Series, 1168(5).
12. Larasati, U. I., Muslim, M. A., Arifudin, R., & Alamsyah, A. (2019). Improve the Accuracy of Support Vector Machine Using Chi Square Statistic and Term Frequency Inverse Document Frequency on Movie Review Sentiment Analysis. Scientific Journal of Informatics, 6(1), 138–149.
13. Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. European Conference on Machine Learning, 171–182
14. Liu, M., & Zhang, D. (2016). Feature selection with effective distance. Neurocomputing, 215, 100–109.
15. Anggraeny, F. T., Purbasari, I. Y., & Suryaningsih, E. (2018). Relief Feature Selection and Bayesian Network Model for Hepatitis Diagnosis. Prosiding
International Conference on Information Technology and Business (ICITB), 113–118.
16. Barraza, N., Moro, S., Ferreyra, M., & de la Peña, A. (2019). Mutual information and sensitivity analysis for feature selection in customer targeting: A
comparative study. Journal of Information Science, 45(1), 53–67.
17. Sinayobye, J. O., Kyanda, S. K., Kiwanuka, N. F., & Musabe, R. (2019). Hybrid model of correlation based filter feature selection and machine learning
classifiers applied on smart meter data set. 2019 IEEE/ACM Symposium on Software Engineering in Africa (SEiA), 1–10.
18. Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1(3), 131–156.
19. Verma, T., Renu, R., & Gaur, D. (2014). Tokenization and filtering process in RapidMiner. International Journal of Applied Information Systems, 7(2), 16–18.

20. Saif, H., Fernández, M., He, Y., & Alani, H. (2014). On stop words, filtering and data sparsity for sentiment analysis of twitter.
21. Samir, A., & Lahbib, Z. (2018). Stemming and lemmatization for information retrieval systems in amazigh language. International Conference on Big Data,
Cloud and Applications, 222–233.
22. Liu, Qing, Wang, J., Zhang, D., Yang, Y., & Wang, N. (2018). Text features extraction based on TF-IDF associating semantic. 2018 IEEE 4th International
Conference on Computer and Communications (ICCC), 2338–2343.
23. Soucy, P., & Mineau, G. W. (2005). Beyond TFIDF weighting for text categorization in the vector space model. IJCAI, 5, 1130– 1135.
24. Ahuja, R., Chug, A., Kohli, S., Gupta, S., & Ahuja, P. (2019). The impact of features extraction on the sentiment analysis. Procedia Computer Science, 152,
341–348.
25. Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proc. of the 20th VLDB Conference, 487–499.
26. Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., & Verkamo, A. I. (1994). Finding interesting rules from large sets of discovered association rules.
Proceedings of the Third International Conference on Information and Knowledge Management, 401–407.
27. Sokolova, M., Japkowicz, N., & Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation.
Australasian Joint Conference on Artificial Intelligence, 1015–1021.

Figures

Figure 1

General steps of the proposed system for text classification.

Figure 2

The general framework of the proposed system for text classification.

Figure 3

The proposed feature selection method FS-FAI.

Figure 4

Steps of the proposed feature selection method FS-FAI.

Figure 5

Algorithm of pruning frequent itemsets

Figure 6

Algorithm for computing the all-confidence of an itemset.

Figure 7

Algorithm for selecting a frequent feature subset.

Figure 8

Comparative results of classification techniques using first and second scenarios.

Figure 9

Comparative results of classification techniques using the first and third scenarios.

Figure 10

Comparison of the best results for classification techniques using first and third scenarios

