


I.J. Information Engineering and Electronic Business, 2015, 2, 60-65
Published Online March 2015 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijieeb.2015.02.08

Feature Extraction or Feature Selection for Text Classification: A Case Study on Phishing Email Detection
Masoumeh Zareapoor
Department of Computer Science, Jamia Hamdard, New Delhi, India
[email protected]

Seeja K. R
Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, New Delhi,
India
[email protected]

Abstract—Dimensionality reduction is generally performed when high dimensional data like text are classified. This can be done either with feature extraction techniques or with feature selection techniques. This paper analyses which dimensionality reduction technique is better for classifying text data such as emails. Email classification is difficult due to the high dimensional sparse features that affect the generalization performance of classifiers. In phishing email detection, dimensionality reduction techniques are used to keep the most instructive and discriminative features from a collection of emails, consisting of both phishing and legitimate messages, for better detection. Two feature selection techniques (Chi-Square and Information Gain Ratio) and two feature extraction techniques (Principal Component Analysis and Latent Semantic Analysis) are used for the analysis. It is found that the feature extraction techniques offer better classification performance, give stable results for different numbers of features chosen, and robustly keep their performance over time.

Index Terms—Feature Selection, Feature Extraction, Dimensionality Reduction, Text Mining, Phishing, Classification.


I. INTRODUCTION

Phishing is a new internet crime in comparison with others, such as hacking. The word phishing is a variation on the word fishing: the idea is that bait is thrown out in the hope that a user will grab it and bite, just like a fish. Phishing is capable of damaging electronic commerce because it causes users to lose their trust in the internet. To make customers aware of the latest phishing attacks, some international organizations, such as the Anti-Phishing Working Group (APWG), publish phishing alerts on their websites [1]. According to the APWG trends report for the first quarter of 2014 [2], the number of phishing sites increased by 10.7 percent over the fourth quarter of 2013, and payment services were the most targeted industry sector.

E-mails can be categorized into three types [3]: ham, spam and phishing. Ham is solicited and legitimate email, while spam is unsolicited email. Phishing, on the other hand, is unsolicited, deceitful and potentially harmful email. Phishing emails are created by fraudulent people to imitate real e-banking emails. Phishing attacks are classified into two types [4], as shown in Fig. 1.

Fig.1. Types of phishing attack

The first type is deceptive phishing, which relies on social engineering schemes: a forged email pretends to come from a legitimate company or bank, and through a link within the email the attacker attempts to mislead users to fake websites. These fake websites are designed to deceptively obtain financial data (usernames, passwords, credit card numbers, other personal information, etc.) from genuine users. The second type is malware phishing, which relies on technical subterfuge: malware is installed after users click on a link embedded in the email, or security holes in the user's computer are detected and exploited to obtain the victim's online account information directly. Phishing emails look exactly the same as genuine e-banking e-mails and easily trap internet banking users into disclosing banking credentials such as the bank account number, password, credit card number and other information needed for a transaction. The attacker then performs fraudulent transactions from the user's account using this collected information.

Copyright © 2015 MECS I.J. Information Engineering and Electronic Business, 2015, 2, 60-65

Several techniques have already been developed for phishing email detection. They include blacklisting and whitelisting [4], network and content based filtering [22], firewalls [4, 21, 22], client-side toolbars [4, 21, 22], server-side filters [21, 22] and user awareness [22]. The most critical issue with these techniques is that, when classifying email (text), the data contained in emails are often very complex, multidimensional, and represented by a large number of features. This results in high space and time complexity [5] and poor classifier performance. The computational cost of the classification algorithms can be reduced by using fewer, more distinctive features [6]. Thus dimensionality reduction techniques are used in the email classification task to avoid the dimensionality problem. This can be done either by feature extraction or by feature selection. In this paper, we use both feature selection and feature extraction techniques to discriminate between two classes of emails (ham or phishing) with fewer and more distinctive features, in order to reduce the computational cost and improve the results.

Dimensionality reduction techniques like PCA have been popular in text processing tasks since the early 90s [7, 8]. Tsymbal et al. [9] propose two variants of PCA that use the within-class and between-class covariance matrices to take the class information into account; they test their approach on typical database data, but not on text categorization. Brutlag and Meek [10] investigate the effect of feature selection based on a common information statistic on email filtering. Xia and Wong [11] discuss the email categorization problem in the context of personal information management.

This paper analyses the effect of various dimensionality reduction techniques on text classification. Feature extraction methods such as Principal Component Analysis (PCA) [7] and Latent Semantic Analysis (LSA) [12] are compared with classical feature selection techniques such as Chi-Square (χ2) [13] and Information Gain (IG) [14], which have an established reputation in text classification. In order to study the effectiveness of the various dimensionality reduction techniques in phishing email classification, each technique is tested with the Bagging classifier [8], which researchers have already shown to work well for e-mail classification.


II. MATERIALS AND METHODS

A. Dimensionality Reduction Techniques

In text classification tasks, the documents or examples are represented by thousands of tokens, which makes the classification problem very hard for many classifiers. Dimensionality reduction is a typical step in text mining that transforms the data representation into a shorter, more compact and more predictive one [8]. The new space is easier to handle because of its size, yet carries the most important part of the information needed to distinguish between emails, allowing the creation of profiles that describe the data set. The two major classes of dimensionality reduction techniques are described in the following sections.

 Feature Extraction

In feature extraction [8], the original feature space is converted to a more compact new space. The original features are not deleted; all of them are transformed into the new reduced space, i.e. replaced by a smaller representative set. In other words, when the number of features in the input data is too large to be processed, the input data are transformed into a reduced representative set of features.

Principal Component Analysis (PCA)

PCA is a well-known technique that reduces the dimensionality of data by transforming the original attribute space into a smaller one. In other words, the purpose of principal component analysis is to derive new variables that are uncorrelated combinations of the original variables. This is achieved by transforming the original variables Y = [y1, y2, ..., yp] (where p is the number of original variables) into a new set of variables T = [t1, t2, ..., tq] (where q is the number of new variables) that are combinations of the original variables. The transformed attributes are obtained by first computing the mean (μ) of the dataset and the covariance matrix of the original attributes [5], and then extracting the eigenvectors of that matrix. The eigenvectors (principal components) define a linear transformation from the original attribute space to a new space in which the attributes are uncorrelated, and they can be sorted by the amount of variation in the original data that they account for. The best n eigenvectors (those with the highest eigenvalues) are selected as new features, while the rest are discarded.

Latent Semantic Analysis (LSA)

LSA is a more recent technique in text classification. Generally, LSA analyzes the relationships between terms and the concepts contained in an unstructured collection of text. It is called Latent Semantic Analysis because of its ability to correlate semantically related terms that are latent in a text. LSA produces a set of concepts, smaller in size than the original set, related to the documents and terms [11, 12]. It uses SVD (Singular Value Decomposition) to identify patterns between the terms and concepts contained in the text and to find the relationships between documents; the method is commonly referred to as concept search. It is able to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts. LSA is mostly used for page retrieval systems and for text clustering. It overcomes two of the most problematic aspects of keyword queries: multiple words with similar meanings, and words with more than one meaning.

 Feature Selection

In feature selection, a subset of the original features is selected, and only the selected features are used for training and testing the classifiers. The removed features are not used in the computations any more.
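As an illustration of the feature extraction side, the PCA and LSA transforms described above can be sketched on a toy term matrix. This is a hypothetical scikit-learn sketch (the study itself is implemented in WEKA 3.7.11; the library choice and the toy data are assumptions, not part of the paper):

```python
# Hypothetical sketch of the two feature extraction techniques (PCA, LSA).
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# Toy term-document-frequency matrix: 6 "emails" x 8 "terms".
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(6, 8)).astype(float)

# PCA: centre the data, then project onto the top-q eigenvectors
# of the covariance matrix (here q = 3 new uncorrelated variables).
X_pca = PCA(n_components=3).fit_transform(X)

# LSA is commonly implemented as truncated SVD applied directly to the
# (uncentred, typically tf or tf-idf weighted) term-document matrix.
X_lsa = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)

print(X_pca.shape, X_lsa.shape)  # both (6, 3): 8 original terms -> 3 features
```

Note that the new columns are combinations of all original terms, not a subset of them, which is exactly what distinguishes extraction from selection.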

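The feature selection side can be sketched in the same hypothetical scikit-learn setting. Here `chi2` scores each term against the class label, and mutual information serves as the usual stand-in for Information Gain; again, this is an illustrative assumption, not the paper's WEKA setup:

```python
# Hypothetical sketch of the two feature selection techniques (Chi-Square, IG).
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(20, 8)).astype(float)  # 20 emails x 8 term counts
y = rng.integers(0, 2, size=20)                     # 0 = ham, 1 = phishing

# Chi-square scores each term's dependence on the class label;
# SelectKBest keeps the k top-ranking terms and discards the rest.
selector = SelectKBest(chi2, k=3).fit(X, y)
X_chi2 = selector.transform(X)

# Information gain corresponds to the mutual information between term and class.
ig_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

print(X_chi2.shape)  # (20, 3): only selected original terms survive
```

Unlike extraction, the surviving columns here are three of the original terms, so the reduced representation stays directly interpretable.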

Chi-Square

The Chi-Square (χ2) test [13] is a popular feature selection method that evaluates features individually by computing the chi-square statistic with respect to the classes; that is, the chi-square score measures the dependency between a term and the class. If the term is independent of the class, its score is equal to 0; a term with a higher chi-square score is more informative.

Information Gain

Information Gain [14] is a feature selection technique that reduces the number of features by computing a value for each attribute and ranking the attributes accordingly. A threshold on this metric is then chosen, and only the attributes with a value above it, i.e. the top-ranking ones, are kept. Generally, Information Gain selects features via scores, and the technique is simpler than the previous one: the score computed for each feature reflects how well it discriminates between the classes, the features are sorted by this score, and only the top-ranking ones are retained.

B. Bagging Classifier

The Bagging classifier is an ensemble technique proposed by Leo Breiman in 1994. It is designed to improve the stability and accuracy of machine learning algorithms used in classification and regression. The basic principle behind ensemble methods is that a group of "weak learners" can come together to form a "strong learner": here each individual classifier is a weak learner, while all the classifiers taken together are a strong learner. Bagging works by combining the classifications of randomly generated training sets into a final prediction. Such techniques can typically be used for variance reduction, by incorporating randomization into the construction procedure and then creating an ensemble out of it. The Bagging classifier has attracted much attention due to its simple implementation and accuracy; bagging can thus be seen as a "smoothing operation" that improves the predictive performance of regression or classification.

In the case of classification with two possible classes {positive, negative}, a classification algorithm creates a classifier on the basis of a training set (in this paper, an email dataset). The bagging method creates a series of such classifiers, which are combined into a "compound classifier". The final prediction of the compound classifier is obtained from a weighted combination of the individual classifier predictions. This can be described as a "voting procedure" in which some classifiers have a stronger influence on the final prediction than others.


III. PHISHING E-MAIL CLASSIFICATION FRAMEWORK

The phishing email classification framework used in this research is shown in Fig. 2.

Fig.2. Phishing Email Classification framework
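The bagging procedure of Section II-B (bootstrap-resampled weak learners combined by voting) can be sketched as follows. This is a hypothetical scikit-learn sketch on synthetic data; the paper's experiments use WEKA's Bagging with J48, for which scikit-learn's default decision tree is only a rough stand-in:

```python
# Hypothetical sketch of bagging: many "weak" trees trained on bootstrap
# resamples of the training set, combined by voting into a "strong" learner.
import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))             # 200 emails x 10 reduced features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # 1 = phishing, 0 = ham (synthetic rule)

# The default base estimator is a decision tree (the weak learner);
# n_estimators sets the size of the voted ensemble.
bag = BaggingClassifier(n_estimators=25, random_state=0).fit(X, y)

print(bag.score(X, y))  # training accuracy of the voted ensemble
```

Because each tree sees a different bootstrap sample, the vote averages away much of the individual trees' variance, which is the "smoothing" effect described above.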


The various steps in phishing email classification are:

 Data set preparation: a group of e-mails is collected from publicly available corpora of legitimate and phishing e-mails, and the e-mails are labeled as legitimate or phishing accordingly.
 Tokenization: words are separated from the e-mail using white space (space, tab, newline) as the delimiter.
 Stop word removal: words that have no significant importance in building the classifier, such as "a", "the" and "that", are removed.
 Stemming: inflectional endings are removed from the remaining words.
 Term-Document-Frequency (TDF) matrix creation: a matrix is built in which each row corresponds to a document (e-mail) and each column corresponds to a term (word). Each cell holds the frequency (number of occurrences) of the corresponding word in the corresponding document. Thus each e-mail in the data set is converted into an equivalent vector.
 Prior to classification, dimensionality reduction techniques (feature selection or feature extraction) are generally applied to shorten the long vector created in the previous step, which also improves the training time of the classifiers.
 Finally, the classification model classifies the emails as phishing or legitimate.


IV. IMPLEMENTATION

The proposed text mining based classification is implemented using the text mining features available in WEKA 3.7.11. The procedure is described in the following sections.

A. Dataset Preparation

The data set is prepared by collecting a group of e-mails from the well known publicly available corpora that most authors in this area have used: 1,000 phishing emails received from November 2004 to August 2007, provided by the Monkey website [15], and 1,700 ham emails from the SpamAssassin project [16]. The e-mails are then labeled as phishing or legitimate accordingly.

Table 1. Dataset

Total number of samples 2700
Phishing emails 1000
Legitimate emails 1700

B. Creation of Long Vector

In general, an email consists of two parts: the header and the body. The header contains information about the message in the form of fields like sender, subject, receiver, date, etc. The body contains the plain text and may embed HTML links; HTML emails also contain a set of tags to format the text displayed on screen. In our work we did not separate body and header: we consider the whole email, so the feature vector contains all kinds of features, i.e. HEADER based, URL based and BODY based features [17].

 Body based features: these occur in the body of the email and include (body-keyword), (body-jspopup), (body-javascript), multipart emails, HTML emails, "verify phrase" emails, HTML links, image links, etc.
 URL based features: these are extracted from the URL links in the email and include HTML links, URL IP addresses, image links, etc.
 Header based features: these are extracted from the e-mail header fields like subject, sender, receiver, etc.

The extracted features are converted into a long vector by parsing and stemming. Parsing is the process of extracting features from an email and analyzing them; stemming is the process of reducing inflected (or sometimes derived) words to their stem or root form. Then stop words, i.e. words with no significant importance in building the classifiers, are removed. In this way the email dataset of 2,700 emails is converted into 2,173 terms (features), which means the dataset can be represented as a term-document matrix with 2,700 rows and 2,173 columns. Each row corresponds to a document (e-mail), each column corresponds to a term (word), and each cell holds the frequency (number of occurrences) of the corresponding word in the corresponding document. The vector generated in this stage is the long vector.

C. Conversion of Long Vector into Short Vector

This conversion is done for three reasons:

1. To transform the data representation into a shorter, more compact, and more predictive one.
2. To reduce the complexity of handling the features in the classification process.
3. To increase the speed of the classification process.

The conversion (long to short) can be done either by feature selection or by feature extraction. We selected PCA and LSA as feature extraction techniques and Chi-Square and Information Gain as feature selection techniques for the analysis. From the initial 2,173 features, small sets of 10, 15, 50, 100, 300, 500, 1000 and 2000 features are selected/extracted with PCA, LSA, Chi-Square and Information Gain.

D. Classification

After converting the long vector to a short vector by feature extraction or feature selection, we trained different classifiers on the dataset.
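The chain of Sections IV-B to IV-D (term-document long vector, LSA short vector, bagged trees) can be sketched end to end. The mini "emails", vocabulary and scikit-learn pipeline below are illustrative assumptions; the actual study runs in WEKA on 2,700 real messages:

```python
# Hypothetical end-to-end sketch: long vector -> short vector -> classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline

emails = [
    "verify your account password at our bank link now",
    "urgent click link to confirm credit card details",
    "meeting agenda attached for monday project review",
    "lunch on friday with the project team confirmed",
]
labels = [1, 1, 0, 0]  # 1 = phishing, 0 = ham

pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),          # long vector: term counts
    TruncatedSVD(n_components=2, random_state=0),   # short vector via LSA
    BaggingClassifier(n_estimators=10, random_state=0),  # voted decision trees
).fit(emails, labels)

print(pipeline.predict(["click this link to verify your bank password"]))
```

The pipeline object mirrors the framework of Fig. 2: at prediction time an unseen email passes through the same vectorization and reduction steps before the ensemble votes.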


After several trials and comparisons, we decided to use the ensemble classifier Bagging with the J48 decision tree as the base classifier [18]. The reason for this decision is that, for our dataset and methods (feature extraction and selection), bagging gives good results: it reduces the variance of the data set and thus reduces overfitting of the training data.


V. RESULTS AND DISCUSSION

The results of the various experiments conducted on the selected dataset for different numbers of features are shown in Fig. 3 and Fig. 4, where:

True positives (TP) = number of phishing emails correctly classified as phishing.
False positives (FP) = number of ham emails incorrectly classified as phishing.

For better visualization, the results are presented as the area under the ROC (Receiver Operating Characteristic) curve, which reaches its best value at 1 and its worst value at 0. The results are also shown in terms of classification accuracy for better understanding.

Fig.3. Performance Comparison in terms of ROC Area

It is evident from Figs. 3 and 4 that both the FS and FE methods obtain creditable results. The performance of FS improves as the number of features increases, but the results are still not better than those of the FE techniques. The FE methods, on the other hand, generally need far fewer features to obtain good performance; in FE methods, choosing more features may even degrade the performance of the algorithm. From these results, we can observe that the statistical feature extraction techniques are well suited to discriminating between ham and phishing emails. In particular, the LSA technique shows good and stable performance in terms of area under the ROC curve, irrespective of the number of features chosen.

When we compare the techniques in terms of accuracy (Fig. 4), it is observed that the FE techniques perform well with a small number of features and that their performance decreases when a large number of features is chosen, while the FS algorithms need more features for accurate classification. This is because, unlike FE methods, whose new features combine information from the whole dataset, FS methods directly select features from the dataset, so with a small number of features they may miss some of the more informative and important ones.

Fig.4. Performance Comparison in terms of Accuracy
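The two reported metrics can be sketched on hypothetical classifier scores (these numbers are illustrative, not the paper's results). Accuracy depends on a fixed decision threshold, while the ROC area summarizes ranking quality across all thresholds:

```python
# Hypothetical sketch of the evaluation metrics: accuracy and ROC area
# (1.0 = perfect, 0.5 = chance level for the ROC area).
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [1, 1, 1, 0, 0, 0]               # 1 = phishing, 0 = ham
y_score = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]  # classifier's phishing probability
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

print(accuracy_score(y_true, y_pred))   # fraction correctly classified (here 4/6)
print(roc_auc_score(y_true, y_score))   # ranking quality (here 8/9)
```

This distinction is why a method can look stable in terms of ROC area (Fig. 3) while its accuracy curve (Fig. 4) still varies with the number of features.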


VI. CONCLUSION

In this paper, feature selection methods are compared with statistical feature extraction techniques for email classification. The results show good classification performance when the feature extraction techniques are used to classify emails. One of the significant findings of this work is that the results of the feature extraction methods (PCA, LSA) do not depend on the number of features chosen. This is an advantage in text classification, because choosing the correct number of features in a high dimensional space is a difficult problem. Moreover, Latent Semantic Analysis is found to be the best method, since it outperforms the other methods in terms of area under the ROC curve and accuracy, even when the dataset is represented with very few features.


REFERENCES

[1] APWG. Anti-Phishing Working Group: http://www.antiphishing.org
[2] Phishing Activity Trends Report 2014: http://docs.apwg.org/reports/apwg_trends_report_q1_2014.pdf
[3] I.R.A. Hamid, J. Abawajy. Hybrid feature selection for phishing email detection. International Conference on Algorithms and Architectures for Parallel Processing, (2011), Lecture Notes in Computer Science, Springer, Berlin, Germany; 266-275.
[4] G. L. Huillier, R. Weber, N. Figueroa. Online Phishing Classification Using Adversarial Data Mining and Signaling Games. ACM SIGKDD Explorations Newsletter, (2009), 11(2); 92-99.
[5] J.J. Verbeek. Supervised Feature Extraction for Text Categorization. Tenth Belgian-Dutch Conference on Machine Learning, (2000).
[6] G. Biricik, B. Diri, A.C. Sonmez. Abstract feature extraction for text classification. Turk J Elec Eng & Comp Sci, (2012), 20(1); 1102-1015.
[7] J.C. Gomez, M.F. Moens. PCA document reconstruction for email classification. Computational Statistics and Data Analysis, (2012), 56(3); 741-751.
[8] J.C. Gomez, E. Boiy, M.F. Moens. Highly discriminative statistical features for email classification. Knowledge and Information Systems, (2012), 31(1); 23-53.
[9] A. Tsymbal, S. Puuronen, M. Pechenizkiy, M. Baumgarten, D.W. Patterson. Eigenvector-based feature extraction for classification. AAAI Press, (2002); 354-358.
[10] J.D. Brutlag, C. Meek. Challenges of the email domain for text classification. Proceedings of the Seventeenth International Conference on Machine Learning, (2000); 103-110.
[11] Y. Xia, K.F. Wong. Binarization approaches to email categorization. In: ICCPOL; 474-481.
[12] G.L. Huillier, A. Hevia, R. Weber, S. Rios. Latent Semantic Analysis and Keyword Extraction for Phishing Classification. IEEE International Conference on Intelligence and Security Informatics, (2010); 129-131.
[13] M. Hall, L. Smith. Practical feature subset selection for machine learning. Proceedings of the 21st Australasian Conference on Computer Science, (1998); 181-191.
[14] T. Mori. Information gain ratio as term weight: The case of summarization of IR results. Proceedings of the 19th International Conference on Computational Linguistics, Taiwan, (2002); 688-694.
[15] Phishing Corpus: http://monkey.org/wjose/wiki/doku.php
[16] SpamAssassin Public Corpus: http://spamassassin.apache.org/publiccorpus
[17] A. Almomani, T.C. Wan, A. Manasrah, A. Altaher, M. Baklizi, S. Ramadass. An enhanced online phishing e-mail detection framework based on evolving connectionist system. International Journal of Innovative Computing, Information and Control, (2012), 9(2); 1065-1086.
[18] D. Opitz, R. Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, (1999), 11; 169-198.
[19] F. Toolan, J. Carthy. Phishing Detection using Classifier Ensembles. IEEE eCrime Researchers Summit, Tacoma, WA, USA, (2009); 1-9.
[20] S.A. Nimeh, D. Nappa, X. Wang, S. Nair. A comparison of machine learning techniques for phishing detection. Proceedings of the eCrime Researchers Summit, Pittsburgh, PA, USA, (2007); 60-69.
[21] V. Ramanathan, H. Wechsler. Phishing detection and impersonated entity discovery using Conditional Random Field and Latent Dirichlet Allocation. Computers & Security, (2013), 34; 123-139.
[22] V. Ramanathan, H. Wechsler. PhishGILLNET: phishing detection methodology using probabilistic latent semantic analysis, AdaBoost and co-training. Journal on Information Security, (2012).


Authors' Profiles

Masoumeh Zareapoor is a Ph.D. student at Jamia Hamdard University, New Delhi, India. She received her Master's degree in computer science from Jamia Hamdard University in 2010.

Seeja K.R. received her Ph.D. degree in Computer Science from Jamia Hamdard University, New Delhi, India, in July 2010. She is currently working as an associate professor in the Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India. Her research interests include data mining, algorithm design, bioinformatics and NP-complete problems.

