Phishing Web Sites Features Classification Based on Extreme Learning Machine


Yasin Sönmez (1), Türker Tuncer (2), Hüseyin Gökal (3), Engin Avcı (4)

(1) Dicle University, Technical Sciences Vocational School, Diyarbakır / Turkey, [email protected]
(2) Fırat University, Faculty of Technology, Forensic Comp., Elazığ / Turkey, [email protected]
(3) Cyprus International University, Faculty of Edu., Lefkoşa / Cyprus, [email protected]
(4) Fırat University, Faculty of Technology, Software Eng., Elazığ / Turkey, [email protected]

Abstract—Phishing is one of the most common and most dangerous attacks among cybercrimes. The aim of these attacks is to steal the information used by individuals and organizations to conduct transactions. Phishing websites contain various hints among their contents and web browser-based information. The purpose of this study is to perform Extreme Learning Machine (ELM) based classification for the 30 features of the Phishing Websites Data in the UC Irvine Machine Learning Repository database. For results assessment, ELM was compared with other machine learning methods such as Support Vector Machine (SVM) and Naïve Bayes (NB), and was found to have the highest accuracy, 95.34%.

Keywords—Extreme Learning Machine, Features Classification, Information Security, Phishing.

I. INTRODUCTION

Internet use has become an essential part of our daily activities as a result of rapidly growing technology. Due to this rapid growth of technology and the intensive use of digital systems, the data security of these systems has gained great importance. The primary objective of maintaining security in information technologies is to ensure that necessary precautions are taken against the threats and dangers likely to be faced by users during the use of these technologies [1]. Phishing is defined as imitating reliable websites in order to obtain the proprietary information entered into websites every day for various purposes, such as usernames, passwords and citizenship numbers. Phishing websites contain various hints among their contents and web browser-based information [2-4]. The individual(s) committing the fraud send the fake website or e-mail information to the target address as if it came from an organization, bank or any other reliable source that performs reliable transactions. The contents of the website or the e-mail include requests aiming to lure individuals into entering or updating their personal information or changing their passwords, as well as links to websites that look like exact copies of the websites of the organizations concerned [6-10].

Phishing Web Sites Features

Many articles have been published about how to predict phishing websites by using artificial intelligence techniques. We examined phishing websites and extracted features of these websites. Guidelines regarding the extracted features of this database are given below.

In the first section we define rules and give equations for the web features. We need these equations in order to explain phishing attack characterization.

1.1. Address Bar based Features

1.1.1. Using the IP Address
Rule: (1)

1.1.2. Long URL to Hide the Suspicious Part
(2)

1.1.3. Using URL Shortening Services “TinyURL”
(3)

1.1.4. URL’s having “@” Symbol
(4)

1.1.5. Redirecting using “//”
(5)

1.1.6. Adding Prefix or Suffix Separated by (-) to the Domain
(6)
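The address bar checks named in Sections 1.1.1 to 1.1.6 lend themselves to simple string tests on the URL. The following is a minimal Python sketch of how such checks could be coded, not the rules of equations (1) to (6): the length thresholds, the shortener list, the value 0 for a "suspicious" outcome, and the function name `address_bar_features` are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Class encoding: -1 and 1 follow Table I; 0 ("suspicious") is an assumption.
PHISHING, SUSPICIOUS, LEGITIMATE = -1, 0, 1

# Hypothetical thresholds and shortener list, for illustration only.
LONG_URL_SUSPICIOUS = 54
LONG_URL_PHISHING = 75
SHORTENERS = {"tinyurl.com", "bit.ly", "goo.gl", "t.co"}

# Dotted-decimal IPv4 only; hexadecimal or IPv6 hosts are not handled here.
IPV4 = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def address_bar_features(url: str) -> dict:
    """Score a URL on features 1.1.1 to 1.1.6 with simple string tests."""
    host = urlparse(url).hostname or ""
    # Everything after the scheme separator, used to spot a second "//".
    after_scheme = url.split("://", 1)[-1]

    if len(url) < LONG_URL_SUSPICIOUS:
        long_url = LEGITIMATE
    elif len(url) <= LONG_URL_PHISHING:
        long_url = SUSPICIOUS
    else:
        long_url = PHISHING

    return {
        "ip_address": PHISHING if IPV4.fullmatch(host) else LEGITIMATE,  # 1.1.1
        "long_url": long_url,                                            # 1.1.2
        "shortener": PHISHING if host in SHORTENERS else LEGITIMATE,     # 1.1.3
        "at_symbol": PHISHING if "@" in url else LEGITIMATE,             # 1.1.4
        "redirect": PHISHING if "//" in after_scheme else LEGITIMATE,    # 1.1.5
        "prefix_suffix": PHISHING if "-" in host else LEGITIMATE,        # 1.1.6
    }


if __name__ == "__main__":
    print(address_bar_features("http://125.98.3.123/fake.html"))
```

Each check returns a value in the same -1 / 1 encoding as the class label, so the resulting dictionary can be turned directly into one row of a feature matrix.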

1.1.7. Sub Domain and Multi Sub Domains
(7)

1.1.8. HTTPS (Hyper Text Transfer Protocol with Secure Sockets Layer)
(8)

1.1.9. Domain Registration Length
(9)

1.1.10. Favicon
(10)

1.1.11. Using Non-Standard Port
(11)

1.1.12. The Existence of “HTTPS” Token in the Domain Part of the URL
(12)

1.2. Abnormal Based Features

1.2.1. Request URL
Rule: IF
(13)

1.2.2. URL of Anchor
(14)

1.2.3. Links in <Meta>, <Script> and <Link> tags
(15)

1.2.4. Server Form Handler (SFH)
(16)

1.2.5. Submitting Information to Email
(17)

1.2.6. Abnormal URL
(18)

1.3. HTML and JavaScript based Features

1.3.1. Website Forwarding
(19)

1.3.2. Status Bar Customization
(20)

1.3.3. Disabling Right Click
(21)

1.3.4. Using Pop-up Window
(22)

1.3.5. IFrame Redirection
(23)

1.4. Domain based Features

1.4.1. Age of Domain
(24)

1.4.2. DNS Record
(25)

1.4.3. Website Traffic
(26)

1.4.4. PageRank
(27)

1.4.5. Google Index
(28)

1.4.6. Number of Links Pointing to Page
(29)

1.4.7. Statistical-Reports Based Feature
(30)

In this study, Extreme Learning Machine (ELM) based classification was performed for the following 30 features [11], extracted based on the features of websites in the UC Irvine Machine Learning Repository. The features of the websites are listed in Table I.

TABLE I. FEATURES OF WEBSITES

Input (Features)
1.1. Address Bar based Features
1.1.1. Using the IP Address
1.1.2. Long URL to Hide the Suspicious Part
1.1.3. Using URL Shortening Services “TinyURL”
1.1.4. URL’s having “@” Symbol
1.1.5. Redirecting using “//”
1.1.6. Adding Prefix or Suffix Separated by (-) to the Domain
1.1.7. Sub Domain and Multi Sub Domains
1.1.8. HTTPS (Hyper Text Transfer Protocol with Secure Sockets Layer)
1.1.9. Domain Registration Length
1.1.10. Favicon
1.1.11. Using Non-Standard Port
1.1.12. The Existence of “HTTPS” Token in the Domain Part of the URL
1.2. Abnormal Based Features
1.2.1. Request URL
1.2.2. URL of Anchor
1.2.3. Links in <Meta>, <Script> and <Link> tags
1.2.4. Server Form Handler (SFH)
1.2.5. Submitting Information to Email
1.2.6. Abnormal URL
1.3. HTML and JavaScript based Features
1.3.1. Website Forwarding
1.3.2. Status Bar Customization
1.3.3. Disabling Right Click
1.3.4. Using Pop-up Window
1.3.5. IFrame Redirection
1.4. Domain based Features
1.4.1. Age of Domain
1.4.2. DNS Record
1.4.3. Website Traffic
1.4.4. PageRank
1.4.5. Google Index
1.4.6. Number of Links Pointing to Page
1.4.7. Statistical-Reports Based Feature

Output (Class)
-1 Phishing
1 Legitimate

II. MATERIAL AND METHOD

The procedural steps for solving the classification problem presented are as follows:

• Identification of the problem

This study attempts to solve the problem of how phishing analysis data will be classified.

• Data set

The data set consists of approximately 11,000 samples containing the 30 features extracted based on the features of websites in the UC Irvine Machine Learning Repository database.

• Modeling

After the data is ready to be processed, the modeling process for the learning algorithm is initiated. In essence, the model is constructed to produce the output identified in accordance with the task requirements.

A. Classification

Classification is to determine the class to which each data sample belongs; classification methods are used when the outputs of the input data are qualitative. The purpose is to divide the whole problem space into a certain number of classes. A wide range of classification methods exists. This is because different classification methods have been constructed for different data, as there is no perfect method that works on every data set. As mentioned in literature studies, the aim of classification is to assign new samples to classes by using the pre-labeled samples. The most commonly used classification methods are listed below.

• Artificial Neural Networks (ANN)
• Support Vector Machine (SVM)
• Naive Bayes (NB)

Extreme Learning Machine (ELM)

Extreme Learning Machine (ELM) is a feed-forward artificial neural network (ANN) model with a single hidden layer. For the ANN to learn well, parameters such as the threshold value, weights and activation function must have appropriate values for the data system to be modeled. In gradient-based learning approaches, all of these parameters are changed iteratively toward appropriate values. Thus, they may be slow and produce low-performing results due to the likelihood of getting stuck in local minima.

In the ELM learning process, unlike an ANN that updates its parameters by gradient-based methods, the input weights are randomly selected while the output weights are analytically calculated. As an analytical learning process substantially reduces both the solution time and the likelihood of the error value getting stuck in local minima, it increases the performance ratio. To activate the cells in the hidden layer of ELM, a linear function as well as non-linear (sigmoid, sine, Gaussian), non-differentiable or discrete activation functions can be used [12-19]. The ELM structure is given in Figure 1.

[Figure 1: inputs X(1), X(2), ..., X(n) in the input layer (i = 1, 2, ..., n); thresholds b(1), b(2), ..., b(m) in the hidden layer (j = 1, 2, ..., m); output y(p) in the output layer (k = 1, 2, ..., p)]

Fig. 1. A feed-forward artificial neural network model with a single hidden layer

$$ y_p = \sum_{j=1}^{m} \beta_j \, g\!\left( \sum_{i=1}^{n} w_{i,j} x_i + b_j \right) \tag{31} $$
In equation (31), x_i refers to the input vector and y_p refers to the output vector (n and m are the input layer and hidden layer neuron counts), w_{i,j} indicates the input layer to hidden layer weights and β_j indicates the hidden layer to output layer weights, b_j represents the threshold value of the neurons in the hidden layer and g(.) represents the activation function. The input layer weights (w_{i,j}) and bias (b_j) values in the equation are randomly assigned. The activation function (g(.)), the input layer neuron count (n) and the hidden layer neuron count (m) are assigned in the beginning step [12-19].
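The training scheme described above (random input weights and thresholds, analytically computed output weights) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: it assumes a sigmoid activation, weights drawn uniformly from [-1, 1], and the Moore-Penrose pseudo-inverse for the output weights, which is the usual choice in the ELM literature [12]. The function names, the synthetic data and the hidden layer size of 400 are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def elm_train(X, y, m, rng):
    """Fit an ELM. X: (N, n) inputs, y: (N,) labels in {-1, 1}, m: hidden neurons."""
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n, m))  # random input weights w_{i,j}
    b = rng.uniform(-1.0, 1.0, size=m)       # random thresholds b_j
    H = sigmoid(X @ W + b)                   # hidden layer outputs g(.)
    beta = np.linalg.pinv(H) @ y             # analytic output weights beta_j
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Classify by the sign of the network output."""
    return np.where(sigmoid(X @ W + b) @ beta >= 0.0, 1, -1)


# Synthetic stand-in for the data set: 30 features with values in {-1, 0, 1}.
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 0.0, 1.0], size=(400, 30))
y = np.where(X[:, :5].sum(axis=1) >= 0, 1, -1)  # hypothetical labelling rule

X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

W, b, beta = elm_train(X_train, y_train, m=400, rng=rng)
train_acc = np.mean(elm_predict(X_train, W, b, beta) == y_train)
test_acc = np.mean(elm_predict(X_test, W, b, beta) == y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

No iterative weight updates appear anywhere in the sketch: the only fitted quantity is `beta`, obtained in one linear least-squares step, which is where the speed advantage over gradient-based training comes from.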
• Model performance evaluation

The topics addressed in this section are the two measures that affect the performance of the model and the algorithm used: the first is the division of the data set into training and test data sets, and the second is the definition of the expressions measuring the performance. In the first measure, the data set is divided into three parts (training, validation and test data) by three-phase division in the K-Fold method, and model selection and performance assessment are performed simultaneously. In the second measure, the performance assessment of classifier models generally uses a validation value. The validation value can be measured as the ratio of the number of samples detected or estimated correctly by the algorithm to the number of all samples in the data set.

$$ \text{Validation value} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}} \tag{32} $$

III. EXPERIMENTAL RESULTS

These results were obtained by using MATLAB 2013b software and a PC with an Intel i7-6500 CPU and 8 GB RAM.

Fig. 2. ELM performance chart.

While attaining these results, the cell count in the hidden layer was 1000 and the activation function was sigmoid for ELM.

• Comparison of the results of different classification methods

The achieved performance of the ELM method and of the other machine learning methods (Support Vector Machine (SVM), Naive Bayes (NB)) is presented in Table II. As deduced from these data, ELM outperformed the other methods in terms of performance and speed.

TABLE II. ACCURACY OF MACHINE LEARNING METHODS

Methods    Train Accuracy    Test / True Accuracy
ELM        100%              95.34%
NB         100%              93.80%
SVM        100%              92.98%

IV. EXPERIMENTAL RESULTS

In this study, the features in the database created for phishing websites are classified by determining the input and output parameters for the ELM classifier. The results obtained by ELM show that ELM has higher achievement compared to the other classifier methods (SVM and NB). This study is considered to be an applicable design for automated systems, with high-performing classification against the phishing activity of websites. Furthermore, in literature comparisons, this study is observed to be high-performing: its test accuracy of 95.34% exceeds 92.18%, the highest test performance reported in publication [3].

V. CONCLUSIONS

In this paper, we defined the features of phishing attacks and proposed a classification model to classify phishing attacks. This method consists of feature extraction from websites and a classification section. In the feature extraction, we clearly defined the rules of phishing feature extraction, and these rules were used for obtaining the features. To classify these features, SVM, NB and ELM were used. In the ELM, 6 different activation functions were used, and ELM achieved the highest accuracy score.

REFERENCES

[1] G. Canbek and Ş. Sağıroğlu, “A Review on Information, Information Security and Security Processes,” Politek. Derg., vol. 9, no. 3, pp. 165–174, 2006.
[2] L. McCluskey, F. Thabtah, and R. M. Mohammad, “Intelligent rule-based phishing websites classification,” IET Inf. Secur., vol. 8, no. 3, pp. 153–160, 2014.
[3] R. M. Mohammad, F. Thabtah, and L. McCluskey, “Predicting phishing websites based on self-structuring neural network,” Neural Comput. Appl., vol. 25, no. 2, pp. 443–458, 2014.
[4] R. M. Mohammad, F. Thabtah, and L. McCluskey, “An assessment of features related to phishing websites using an automated technique,” Internet Technol. …, pp. 492–497, 2012.
[5] W. Hadi, F. Aburub, and S. Alhawari, “A new fast associative classification algorithm for detecting phishing websites,” Appl. Soft Comput. J., vol. 48, pp. 729–734, 2016.
[6] N. Abdelhamid, “Multi-label rules for phishing classification,” Appl. Comput. Informatics, vol. 11, no. 1, pp. 29–46, 2015.
[7] N. Sanglerdsinlapachai and A. Rungsawang, “Using domain top-page similarity feature in machine learning-based web phishing detection,” in 3rd International Conference on Knowledge Discovery and Data Mining, WKDD 2010, 2010, pp. 187–190.
[8] W. D. Yu, S. Nargundkar, and N. Tiruthani, “A phishing vulnerability analysis of web based systems,” IEEE Symp. Comput. Commun. (ISCC 2008), pp. 326–331, 2008.
[9] P. Ying and D. Xuhua, “Anomaly based web phishing page detection,” in Proceedings - Annual Computer Security Applications Conference, ACSAC, 2006, pp. 381–390.
[10] M. Moghimi and A. Y. Varjani, “New rule-based phishing detection method,” Expert Syst. Appl., vol. 53, pp. 231–242, 2016.
[11] DATASET: M. Lichman, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2013.
[12] G.-B. Huang et al., “Extreme learning machine: Theory and applications,” Neurocomputing, vol. 70, no. 1–3, pp. 489–501, 2006.
[13] C. S. Guang-bin Huang, Qin-yu Zhu, “Extreme learning machine: A new learning scheme of feedforward neural networks,” Neurocomputing, vol. 70, pp. 489–501, 2006.
[14] T. S. Guzella and W. M. Caminhas, “A review of machine learning approaches to Spam filtering,” Expert Systems with Applications, vol. 36, no. 7, pp. 10206–10222, 2009.
[15] Ö. F. Ertuğrul, “Aşırı Öğrenme Makineleri ile biyolojik sinyallerin gizli kaynaklarına ayrıştırılması,” D.Ü. Mühendislik Dergisi, vol. 7, no. 1, pp. 3–9, 2016.
[16] M. E. Tagluk, M. S. Mamiş, M. Arkan, and Ö. F. Ertugrul, “Aşiri Ögrenme Makineleri ile Enerji Iletim Hatlari Ariza Tipi ve Yerinin Tespiti,” in 2015 23rd Signal Processing and Communications Applications Conference, SIU 2015 - Proceedings, 2015, pp. 1090–1093.
[17] Ö. Faruk Ertuğrul and Y. Kaya, “A detailed analysis on extreme learning machine and novel approaches based on ELM,” Am. J. Comput. Sci. Eng., vol. 1, no. 5, pp. 43–50, 2014.
[18] Ö. F. Ertugrul, “Forecasting electricity load by a novel recurrent extreme learning machines approach,” Int. J. Electr. Power Energy Syst., vol. 78, pp. 429–435, 2016.
[19] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: Theory and applications,” Neurocomputing, vol. 70, no. 1, pp. 489–501, 2006.
