Feature Extraction or Feature Selection for Text Classification: A Case Study on Phishing Email Detection
Seeja K. R
Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, New Delhi,
India
[email protected]
Abstract—Dimensionality reduction is generally performed when high dimensional data like text are classified. This can be done either by using feature extraction techniques or by using feature selection techniques. This paper analyses which dimensionality reduction technique is better for classifying text data like emails. Email classification is difficult due to its high dimensional sparse features, which affect the generalization performance of classifiers. In phishing email detection, dimensionality reduction techniques are used to keep the most informative and discriminative features from a collection of emails, consisting of both phishing and legitimate messages, for better detection. Two feature selection techniques (Chi-Square and Information Gain Ratio) and two feature extraction techniques (Principal Component Analysis and Latent Semantic Analysis) are used for the analysis. It is found that the feature extraction techniques offer better classification performance, give stable classification results for different numbers of features chosen, and robustly maintain that performance.

Index Terms—Feature Selection, Feature Extraction, Dimensionality Reduction, Text mining, Phishing, Classification.

I. INTRODUCTION

Phishing is a new internet crime in comparison with others, such as hacking. The word phishing is a variation on the word fishing; the idea is that bait is thrown out in the hope that a user will grab it and bite into it, just like a fish. Phishing is capable of damaging electronic commerce because it causes users to lose their trust in the internet. To make customers aware of the latest phishing attacks, some international organizations, such as the Anti-Phishing Working Group (APWG), publish phishing alerts on their websites [1]. According to the Anti-Phishing Working Group trends report for the first quarter of 2014 [2], the number of phishing sites increased by 10.7 percent over the fourth quarter of 2013, and payment services were the most targeted industry sector. E-mails can be categorized into three types [3]: ham, spam and phishing. Ham is solicited and legitimate email, while spam is unsolicited email. Phishing, on the other hand, is unsolicited, deceitful, and potentially harmful email. Phishing emails are created by fraudulent people to imitate real e-banking emails. Phishing attacks are classified into two types [4], as shown in Fig. 1.

Fig.1. Types of phishing attack

The first is deceptive phishing, which is related to social engineering schemes and depends on forged emails that pretend to come from a legitimate company or bank. Through a link within the email, the attacker attempts to mislead users to fake websites. These fake websites are designed to deceptively obtain financial data (usernames, passwords, credit card numbers, personal information, etc.) from genuine users. The second technique is malware phishing, which is related to technical subterfuge schemes that rely on malware installed after users click on a link embedded in the email, or on detecting and exploiting security holes in the user's computer, to obtain the user's online account information directly. Phishing emails look exactly the same as e-banking e-mails and easily trap internet banking users into disclosing their banking credentials, such as bank account number, password, credit card number, and other important information needed for transactions. The attacker then performs fraudulent transactions from the user's account using this collected information.
features are not used in the computations anymore.

Chi-Square

The Chi-Square (χ2) [13] is a popular feature selection method that evaluates features individually by computing the chi-square statistic with respect to the classes. The chi-squared score thus measures the dependency between a term and the class: if the term is independent of the class, its score is close to zero, while a larger score indicates a stronger dependency. A term with a higher chi-squared score is therefore more informative.
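As an illustration of this scoring-and-ranking step, the following sketch uses scikit-learn rather than the WEKA tooling used for the experiments in this paper; the names X_counts, labels and k are hypothetical placeholders for the term-document matrix, the class labels and the number of terms to keep.

```python
# Illustrative sketch (not the paper's WEKA workflow): keep the k terms with the
# highest chi-square score with respect to the phishing/ham class labels.
from sklearn.feature_selection import SelectKBest, chi2

def select_top_terms_chi2(X_counts, labels, k=100):
    selector = SelectKBest(score_func=chi2, k=k)
    X_reduced = selector.fit_transform(X_counts, labels)  # shape: (n_emails, k)
    return X_reduced, selector.scores_                    # one chi-square score per term
```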
Information Gain

Information Gain [14] is a feature selection technique that can decrease the number of features by computing the value of each attribute and ranking the attributes. A threshold on the metric is then chosen and only the attributes with a value above it, i.e. the top ranking ones, are kept. In general, Information Gain selects the features via scores. This technique can be simpler than the previous one: the basic idea is that we only have to compute, for each feature, a score that reflects how well it discriminates between the classes; the features are then sorted according to this score and only the top ranking ones are kept.
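A minimal sketch of this rank-and-threshold idea is given below. It uses mutual information with the class (scikit-learn's mutual_info_classif) as the information gain score; the threshold value and variable names are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch: score every term by information gain (mutual information with
# the class), keep only terms above a threshold, sorted from most to least informative.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_by_information_gain(X_counts, labels, threshold=0.01):
    scores = mutual_info_classif(X_counts, labels, discrete_features=True)
    kept = np.flatnonzero(scores > threshold)    # indices of terms above the threshold
    kept = kept[np.argsort(scores[kept])[::-1]]  # top-ranking terms first
    return X_counts[:, kept], scores
```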
B. Bagging Classifier

The bagging classifier is an ensemble technique proposed by Leo Breiman in 1994. It is designed to improve the stability and accuracy of machine learning algorithms used in classification and regression. The basic principle behind ensemble methods is that a group of “weak learners” can come together to form a “strong learner”: here each individual classifier is a “weak learner”, while all the classifiers taken together form a “strong learner”. Bagging works by combining the classifications of randomly generated training sets to form a final prediction. Such techniques are typically used for variance reduction, by incorporating randomization into the construction procedure and then creating an ensemble out of it. The bagging classifier has attracted much attention due to its simple implementation and accuracy. Thus, bagging can be viewed as a “smoothing operation” that improves the predictive performance of regression or classification.

In the case of classification with two possible classes {positive, negative}, a classification algorithm creates a classifier on the basis of a training set (in this paper, the email dataset). The bagging method creates a series of such classifiers, which are combined into a “compound classifier”. The final prediction of the “compound classifier” is obtained from a weighted combination of the individual classifier predictions. This can be described as a “voting procedure” whose objective is to find the classifier that has a stronger influence on the final prediction than the other classifiers.
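The bootstrap-and-vote idea described above can be sketched as follows. This is a simplified illustration with an arbitrary decision-tree base learner and ensemble size, not the WEKA bagging implementation used later in the paper; the inputs are assumed to be numpy arrays with labels 0 (ham) and 1 (phishing).

```python
# Simplified sketch of bagging: each "weak learner" is trained on a bootstrap sample
# of the training set, and the "compound classifier" predicts by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_learners=10, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros((n_learners, X_test.shape[0]))
    for i in range(n_learners):
        idx = rng.integers(0, X_train.shape[0], size=X_train.shape[0])  # bootstrap sample
        learner = DecisionTreeClassifier(random_state=seed)
        votes[i] = learner.fit(X_train[idx], y_train[idx]).predict(X_test)
    return (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote over the weak learners
```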
III. PHISHING E-MAIL CLASSIFICATION FRAMEWORK

The phishing email classification framework used in this research is shown in Fig. 2.
The various steps in phishing email classification are:

1. The data set is prepared by collecting a group of e-mails from a publicly available corpus of legitimate and phishing e-mails. The e-mails are then labeled as legitimate and phishing correspondingly.
2. Tokenization is performed to separate the words of each e-mail, using white space (space, tab, newline) as the delimiter.
3. Words that do not have any significant importance in building the classifier are removed. This is called stop word removal, and stop words are words like a, the, that, etc.
4. Stemming is performed to remove inflectional endings from the retained words.
5. Finally, the Term-Document-Frequency (TDF) matrix is created, where each row in the matrix corresponds to a document (e-mail) and each column corresponds to a term (word) in the document. Each cell represents the frequency (number of occurrences) of the corresponding word in the corresponding document. Thus, each e-mail in the data set is converted into an equivalent vector (a rough sketch of steps 2 to 5 is given after this list).
6. Generally, prior to the classification, dimensionality reduction techniques are applied to convert the long vector created in step 5. Feature selection or feature extraction techniques are used for dimension reduction, and this improves the training time of the classifiers.
7. Finally, the classification model classifies the dataset into phishing and legitimate.
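The following sketch illustrates the preprocessing steps 2 to 5, using scikit-learn's CountVectorizer as a stand-in for the WEKA text filters actually used in this work; the two e-mail strings are placeholders, and stemming is omitted for brevity.

```python
# Sketch of tokenization, stop word removal and Term-Document-Frequency matrix creation.
from sklearn.feature_extraction.text import CountVectorizer

emails = ["Dear customer, please verify your account at http://bank.example.com",
          "Meeting minutes attached, see you on Thursday"]      # placeholder e-mails
labels = [1, 0]                                                 # 1 = phishing, 0 = legitimate

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
tdf_matrix = vectorizer.fit_transform(emails)     # rows = e-mails, columns = terms
terms = vectorizer.get_feature_names_out()        # vocabulary, one entry per column
```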
IV. IMPLEMENTATION

The proposed text mining based text classification is implemented by using the text mining features available in WEKA 3.7.11. The procedure is described in the following sections.
A. Dataset Preparation

The data set is prepared by collecting a group of e-mails from the well known publicly available corpus that most authors in this area have used: a phishing dataset of 1,000 phishing emails received from November 2004 to August 2007, provided by the Monkey website [15], and 1,700 ham emails from the SpamAssassin project [16]. The e-mails are then labeled as phishing and legitimate correspondingly.

Table 1. Dataset

Total number of samples: 2700
Phishing emails: 1000
Legitimate emails: 1700

B. Creation of Long Vector

In general, an email consists of two parts: the header and the body message. The header contains information about the message in the form of many fields like sender, subject, receiver, date, etc. The body contains the plain text and may embed HTML links. HTML emails additionally contain a set of tags to format the text displayed on screen. In our work we did not use any separation between body and header. We consider the whole email itself, and so the feature vector contains all kinds of features: HEADER based features, URL based features and BODY based features [17].

Body based features: all body-based features occur in the body of emails and include (body-keyword), (body-jspopup), (body-javascript), multipart emails, html emails, verify phrase emails, htmllink, image link, etc.
URL based features: these features are extracted from the URL links of emails and include html-link, URL IP addresses, image link, etc.
Header based features: these features are extracted from the e-mail header, like subject, sender, receiver, etc.

The extracted features are converted into a long vector by using parsing and stemming. Parsing is the process of extracting features from an email and analyzing them. Stemming is the process of reducing inflected (or sometimes derived) words to their stem or root form. Then stop words, those words that do not have any significant importance in building the classifiers, are removed. Thus the email dataset of 2700 emails is converted into 2,173 terms (features). This means that the email dataset can be represented as a term-document matrix with 2700 rows and 2173 columns. Each row in the matrix corresponds to a document (e-mail) and each column corresponds to a term (word) in the document. Each cell represents the frequency (number of occurrences) of the corresponding word in the corresponding document. The vector generated in this stage is considered the long vector.

C. Conversion of Long Vector into Short Vector

This conversion is done for 3 reasons:

1. To transform the data representation into a shorter, more compact, and more predictive one.
2. To reduce the complexity of handling features in the classification process.
3. To increase the speed of the classification process.

This conversion (long to short) can be done either by using feature selection or by using feature extraction. We selected PCA and LSA as feature extraction techniques and Chi-Square and Info Gain as feature selection techniques for the analysis. From the initial 2173 features, small sets of 10, 15, 50, 100, 300, 500, 1000 and 2000 features are selected/extracted with PCA, LSA, Chi-Square and Info Gain for analysis.
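For the feature extraction side of this conversion, LSA is commonly computed as a truncated SVD of the term-document matrix and PCA as a projection onto principal components. The sketch below shows the long-to-short conversion with scikit-learn for an assumed target of 100 features; the actual runs in this paper used WEKA and the feature set sizes listed above.

```python
# Sketch of the long-vector to short-vector conversion by feature extraction.
from sklearn.decomposition import PCA, TruncatedSVD

def extract_lsa(tdf_matrix, n_components=100):
    # LSA: truncated SVD applied directly to the (sparse) term-document matrix
    return TruncatedSVD(n_components=n_components, random_state=0).fit_transform(tdf_matrix)

def extract_pca(tdf_matrix, n_components=100):
    # PCA works on a dense, mean-centred matrix
    return PCA(n_components=n_components, random_state=0).fit_transform(tdf_matrix.toarray())
```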
D. Classification

After converting the long vector to a short vector based on feature extraction and feature selection, we trained different classifiers on the dataset.
After several trials and comparisons, we decided to use the ensemble classifier bagging with the J48 decision tree as the base classifier [18]. The reason for this decision is that, for our dataset and methods (feature extraction and selection), bagging gives good results by reducing the variance of the dataset and thus reduces overfitting of the training data.
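As a sketch of this classification stage, the snippet below trains a bagged decision tree (scikit-learn's BaggingClassifier with a CART tree standing in for WEKA's bagging with J48) on the short feature vectors; the train/test split and parameter values are illustrative assumptions.

```python
# Sketch of the classification stage on the reduced (short) feature vectors.
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def train_bagged_tree(X_short, labels, n_trees=10):
    X_train, X_test, y_train, y_test = train_test_split(
        X_short, labels, test_size=0.3, random_state=42, stratify=labels)
    model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n_trees,
                              random_state=42)
    model.fit(X_train, y_train)
    return model, X_test, y_test
```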
V. RESULTS AND DISCUSSIONS

The results of the various experiments conducted on the selected dataset for different numbers of features are shown in Fig. 3 and Fig. 4.

True positive (TP) = number of phishing emails correctly classified as phishing.
False positive (FP) = number of ham emails incorrectly classified as phishing.

For better visualization, the results are presented in the form of the area under the ROC (Receiver Operating Characteristic) curve, which reaches its best value at 1 and its worst value at 0. The results are also shown in terms of classification accuracy for better understanding.
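These two measures can be computed as in the sketch below (scikit-learn metrics assumed; model, X_test and y_test refer to the hypothetical objects from the previous sketch, with phishing as the positive class).

```python
# Sketch of the evaluation: classification accuracy and area under the ROC curve.
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(model, X_test, y_test):
    accuracy = accuracy_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # P(phishing) per e-mail
    return accuracy, auc
```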
It is evident from Figs. 3 and 4 that both the FS and FE methods obtain creditable results. FS performance improves as the number of features increases, but the results are still not better than those of the FE techniques. On the other hand, the FE methods generally need far fewer features to obtain good performance; in FE methods, choosing more features might even degrade the performance of the algorithm. From these results, we can observe that the statistical feature extraction techniques are well suited to discriminating between ham and phishing emails. In particular, the LSA technique shows good and stable performance in terms of area under the ROC curve, irrespective of the number of features chosen.

When we compare the techniques in terms of accuracy (Fig. 4), it is observed that the FE techniques perform well with a small number of features and their performance decreases when a large number of features is chosen, while the FS algorithms need more features for accurate classification. This is because the FS methods directly select features from the dataset, and with a small number of features they may miss some of the more informative and important ones, whereas the extracted features combine information from the whole dataset.