Feature Extraction or Feature Selection for Text Classification: A Case Study on Phishing Email Detection
Seeja K. R
Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, New Delhi,
India
[email protected]
Abstract—Dimensionality reduction is generally performed when high dimensional data like text are classified. This can be done either by using feature extraction techniques or by using feature selection techniques. This paper analyses which dimensionality reduction technique is better for classifying text data like emails. Email classification is difficult due to its high dimensional sparse features, which affect the generalization performance of classifiers. In phishing email detection, dimensionality reduction techniques are used to keep the most informative and discriminative features from a collection of emails, consisting of both phishing and legitimate messages, for better detection. Two feature selection techniques (Chi-Square and Information Gain Ratio) and two feature extraction techniques (Principal Component Analysis and Latent Semantic Analysis) are used for the analysis. It is found that the feature extraction techniques offer better classification performance, give stable classification results for different numbers of features chosen, and robustly maintain that performance.

Index Terms—Feature Selection, Feature Extraction, Dimensionality Reduction, Text mining, Phishing, Classification.

I. INTRODUCTION

Phishing is a new internet crime in comparison with others, such as hacking. The word phishing is a variation on the word fishing; the idea is that bait is thrown out in the hope that a user will grab it and bite into it, just like a fish. Phishing is capable of damaging electronic commerce because it causes users to lose their trust in the internet. To make customers aware of the latest phishing attacks, some international organizations, such as the Anti-Phishing Working Group (APWG), publish phishing alerts on their websites [1]. According to the Anti-Phishing Working Group trends report for the first quarter of 2014 [2], the number of phishing sites increased by 10.7 percent over the fourth quarter of 2013, and payment services were the most targeted industry sector. E-mails can be categorized into three types [3]: ham, spam and phishing. Ham is solicited and legitimate email, while spam is unsolicited email. Phishing, on the other hand, is unsolicited, deceitful, and potentially harmful email. Phishing emails are created by fraudulent people to imitate real e-banking emails. Phishing attacks are classified into two types [4], as shown in Fig. 1.

Fig.1. Types of phishing attack

The first is deceptive phishing, which is related to social engineering schemes and depends on forged emails that pretend to come from a legitimate company or bank. Through a link within the email, the attacker attempts to mislead users to fake websites. These fake websites are designed to deceptively obtain financial data (usernames, passwords, credit card numbers, personal information, etc.) from genuine users. The second technique is malware phishing, which is related to technical subterfuge schemes that rely on malware installed after users click on a link embedded in the email, or on detecting and exploiting security holes in the user's computer, to obtain the user's online account information directly. Phishing emails look exactly the same as e-banking e-mails and easily trap internet banking users into disclosing their banking credentials, such as bank account number, password, credit card number, and other important information needed for transactions. The attacker then performs fraudulent transactions from the user's account using this collected information.
features are not used in the computations anymore.

Chi-Square

The Chi-Square (χ2) [13] is a popular feature selection method that evaluates features individually by computing the chi-square statistic with respect to the classes. The chi-squared score thus measures the dependency between a term and the class: if the term is independent of the class, its score is close to zero, while a larger score indicates a stronger dependency. A term with a higher chi-squared score is therefore more informative.
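As an illustration of this scoring-and-ranking step, the following sketch uses scikit-learn rather than the WEKA tooling used for the experiments in this paper; the names X_counts, labels and k are hypothetical placeholders for the term-document matrix, the class labels and the number of terms to keep.

```python
# Illustrative sketch (not the paper's WEKA workflow): keep the k terms with the
# highest chi-square score with respect to the phishing/ham class labels.
from sklearn.feature_selection import SelectKBest, chi2

def select_top_terms_chi2(X_counts, labels, k=100):
    selector = SelectKBest(score_func=chi2, k=k)
    X_reduced = selector.fit_transform(X_counts, labels)  # shape: (n_emails, k)
    return X_reduced, selector.scores_                    # one chi-square score per term
```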
Information Gain

Information Gain [14] is a feature selection technique that can decrease the number of features by computing the value of each attribute and ranking the attributes. A threshold on the metric is then chosen and only the attributes with a value above it, i.e. the top ranking ones, are kept. In general, Information Gain selects the features via scores. This technique can be simpler than the previous one: the basic idea is that we only have to compute, for each feature, a score that reflects how well it discriminates between the classes; the features are then sorted according to this score and only the top ranking ones are kept.
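A minimal sketch of this rank-and-threshold idea is given below. It uses mutual information with the class (scikit-learn's mutual_info_classif) as the information gain score; the threshold value and variable names are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch: score every term by information gain (mutual information with
# the class), keep only terms above a threshold, sorted from most to least informative.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_by_information_gain(X_counts, labels, threshold=0.01):
    scores = mutual_info_classif(X_counts, labels, discrete_features=True)
    kept = np.flatnonzero(scores > threshold)    # indices of terms above the threshold
    kept = kept[np.argsort(scores[kept])[::-1]]  # top-ranking terms first
    return X_counts[:, kept], scores
```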
B. Bagging Classifier

The bagging classifier is an ensemble technique proposed by Leo Breiman in 1994. It is designed to improve the stability and accuracy of machine learning algorithms used in classification and regression. The basic principle behind ensemble methods is that a group of “weak learners” can come together to form a “strong learner”: here each individual classifier is a “weak learner”, while all the classifiers taken together form a “strong learner”. Bagging works by combining the classifications of randomly generated training sets to form a final prediction. Such techniques are typically used for variance reduction, by incorporating randomization into the construction procedure and then creating an ensemble out of it. The bagging classifier has attracted much attention due to its simple implementation and accuracy. Thus, bagging can be viewed as a “smoothing operation” that improves the predictive performance of regression or classification.

In the case of classification with two possible classes {positive, negative}, a classification algorithm creates a classifier on the basis of a training set (in this paper, the email dataset). The bagging method creates a series of such classifiers, which are combined into a “compound classifier”. The final prediction of the “compound classifier” is obtained from a weighted combination of the individual classifier predictions. This can be described as a “voting procedure” whose objective is to find the classifier that has a stronger influence on the final prediction than the other classifiers.
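The bootstrap-and-vote idea described above can be sketched as follows. This is a simplified illustration with an arbitrary decision-tree base learner and ensemble size, not the WEKA bagging implementation used later in the paper; the inputs are assumed to be numpy arrays with labels 0 (ham) and 1 (phishing).

```python
# Simplified sketch of bagging: each "weak learner" is trained on a bootstrap sample
# of the training set, and the "compound classifier" predicts by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_learners=10, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros((n_learners, X_test.shape[0]))
    for i in range(n_learners):
        idx = rng.integers(0, X_train.shape[0], size=X_train.shape[0])  # bootstrap sample
        learner = DecisionTreeClassifier(random_state=seed)
        votes[i] = learner.fit(X_train[idx], y_train[idx]).predict(X_test)
    return (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote over the weak learners
```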
III. PHISHING E-MAIL CLASSIFICATION FRAMEWORK

The phishing email classification framework used in this research is shown in Fig. 2.
The various steps in phishing email classification are:

1. The data set is prepared by collecting a group of e-mails from a publicly available corpus of legitimate and phishing e-mails. The e-mails are then labeled as legitimate and phishing correspondingly.
2. Tokenization is performed to separate the words of each e-mail, using white space (space, tab, newline) as the delimiter.
3. Words that do not have any significant importance in building the classifier are removed. This is called stop word removal, and stop words are words like a, the, that, etc.
4. Stemming is performed to remove inflectional endings from the retained words.
5. Finally, the Term-Document-Frequency (TDF) matrix is created, where each row in the matrix corresponds to a document (e-mail) and each column corresponds to a term (word) in the document. Each cell represents the frequency (number of occurrences) of the corresponding word in the corresponding document. Thus, each e-mail in the data set is converted into an equivalent vector (a rough sketch of steps 2 to 5 is given after this list).
6. Generally, prior to the classification, dimensionality reduction techniques are applied to convert the long vector created in step 5. Feature selection or feature extraction techniques are used for dimension reduction, and this improves the training time of the classifiers.
7. Finally, the classification model classifies the dataset into phishing and legitimate.
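The following sketch illustrates the preprocessing steps 2 to 5, using scikit-learn's CountVectorizer as a stand-in for the WEKA text filters actually used in this work; the two e-mail strings are placeholders, and stemming is omitted for brevity.

```python
# Sketch of tokenization, stop word removal and Term-Document-Frequency matrix creation.
from sklearn.feature_extraction.text import CountVectorizer

emails = ["Dear customer, please verify your account at http://bank.example.com",
          "Meeting minutes attached, see you on Thursday"]      # placeholder e-mails
labels = [1, 0]                                                 # 1 = phishing, 0 = legitimate

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
tdf_matrix = vectorizer.fit_transform(emails)     # rows = e-mails, columns = terms
terms = vectorizer.get_feature_names_out()        # vocabulary, one entry per column
```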
IV. IMPLEMENTATION

The proposed text mining based text classification is implemented by using the text mining features available in WEKA 3.7.11. The procedure is described in the following sections.
A. Dataset Preparation

The data set is prepared by collecting a group of e-mails from the well known publicly available corpus that most authors in this area have used: a phishing dataset of 1,000 phishing emails received from November 2004 to August 2007, provided by the Monkey website [15], and 1,700 ham emails from the SpamAssassin project [16]. The e-mails are then labeled as phishing and legitimate correspondingly.

Table 1. Dataset

Total number of samples: 2700
Phishing emails: 1000
Legitimate emails: 1700

B. Creation of Long Vector

In general, an email consists of two parts: the header and the body message. The header contains information about the message in the form of many fields like sender, subject, receiver, date, etc. The body contains the plain text and may embed HTML links. HTML emails additionally contain a set of tags to format the text displayed on screen. In our work we did not use any separation between body and header. We consider the whole email itself, and so the feature vector contains all kinds of features: HEADER based features, URL based features and BODY based features [17].

Body based features: all body-based features occur in the body of emails and include (body-keyword), (body-jspopup), (body-javascript), multipart emails, html emails, verify phrase emails, htmllink, image link, etc.
URL based features: these features are extracted from the URL links of emails and include html-link, URL IP addresses, image link, etc.
Header based features: these features are extracted from the e-mail header, like subject, sender, receiver, etc.

The extracted features are converted into a long vector by using parsing and stemming. Parsing is the process of extracting features from an email and analyzing them. Stemming is the process of reducing inflected (or sometimes derived) words to their stem or root form. Then stop words, those words that do not have any significant importance in building the classifiers, are removed. Thus the email dataset of 2700 emails is converted into 2,173 terms (features). This means that the email dataset can be represented as a term-document matrix with 2700 rows and 2173 columns. Each row in the matrix corresponds to a document (e-mail) and each column corresponds to a term (word) in the document. Each cell represents the frequency (number of occurrences) of the corresponding word in the corresponding document. The vector generated in this stage is considered the long vector.

C. Conversion of Long Vector into Short Vector

This conversion is done for 3 reasons:

1. To transform the data representation into a shorter, more compact, and more predictive one.
2. To reduce the complexity of handling features in the classification process.
3. To increase the speed of the classification process.

This conversion (long to short) can be done either by using feature selection or by using feature extraction. We selected PCA and LSA as feature extraction techniques and Chi-Square and Info Gain as feature selection techniques for the analysis. From the initial 2173 features, small sets of 10, 15, 50, 100, 300, 500, 1000 and 2000 features are selected/extracted with PCA, LSA, Chi-Square and Info Gain for analysis.
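For the feature extraction side of this conversion, LSA is commonly computed as a truncated SVD of the term-document matrix and PCA as a projection onto principal components. The sketch below shows the long-to-short conversion with scikit-learn for an assumed target of 100 features; the actual runs in this paper used WEKA and the feature set sizes listed above.

```python
# Sketch of the long-vector to short-vector conversion by feature extraction.
from sklearn.decomposition import PCA, TruncatedSVD

def extract_lsa(tdf_matrix, n_components=100):
    # LSA: truncated SVD applied directly to the (sparse) term-document matrix
    return TruncatedSVD(n_components=n_components, random_state=0).fit_transform(tdf_matrix)

def extract_pca(tdf_matrix, n_components=100):
    # PCA works on a dense, mean-centred matrix
    return PCA(n_components=n_components, random_state=0).fit_transform(tdf_matrix.toarray())
```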
D. Classification

After converting the long vector to a short vector based on feature extraction and feature selection, we trained different classifiers on the dataset.
After several trials and comparisons, we decided to use the ensemble classifier bagging with the J48 decision tree as the base classifier [18]. The reason for this decision is that, for our dataset and methods (feature extraction and selection), bagging gives good results by reducing the variance of the dataset and thus reduces overfitting of the training data.
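As a sketch of this classification stage, the snippet below trains a bagged decision tree (scikit-learn's BaggingClassifier with a CART tree standing in for WEKA's bagging with J48) on the short feature vectors; the train/test split and parameter values are illustrative assumptions.

```python
# Sketch of the classification stage on the reduced (short) feature vectors.
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def train_bagged_tree(X_short, labels, n_trees=10):
    X_train, X_test, y_train, y_test = train_test_split(
        X_short, labels, test_size=0.3, random_state=42, stratify=labels)
    model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n_trees,
                              random_state=42)
    model.fit(X_train, y_train)
    return model, X_test, y_test
```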
V. RESULTS AND DISCUSSIONS

The results of the various experiments conducted on the selected dataset for different numbers of features are shown in Fig. 3 and Fig. 4.

True positive (TP) = number of phishing emails correctly classified as phishing.
False positive (FP) = number of ham emails incorrectly classified as phishing.

For better visualization, the results are presented in the form of the area under the ROC (Receiver Operating Characteristic) curve, which reaches its best value at 1 and its worst value at 0. The results are also shown in terms of classification accuracy for better understanding.
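These two measures can be computed as in the sketch below (scikit-learn metrics assumed; model, X_test and y_test refer to the hypothetical objects from the previous sketch, with phishing as the positive class).

```python
# Sketch of the evaluation: classification accuracy and area under the ROC curve.
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(model, X_test, y_test):
    accuracy = accuracy_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # P(phishing) per e-mail
    return accuracy, auc
```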
It is evident from Figs. 3 and 4 that both the FS and FE methods obtain creditable results. FS performance improves as the number of features increases, but the results are still not better than those of the FE techniques. On the other hand, the FE methods generally need far fewer features to obtain good performance; in FE methods, choosing more features might even degrade the performance of the algorithm. From these results, we can observe that the statistical feature extraction techniques are well suited to discriminating between ham and phishing emails. In particular, the LSA technique shows good and stable performance in terms of area under the ROC curve, irrespective of the number of features chosen.

When we compare the techniques in terms of accuracy (Fig. 4), it is observed that the FE techniques perform well with a small number of features and their performance decreases when a large number of features is chosen, while the FS algorithms need more features for accurate classification. This is because the FS methods directly select features from the dataset, and with a small number of features they may miss some of the more informative and important ones, whereas the extracted features combine information from the whole dataset.