
A PROJECT REPORT ON

A HYBRID MODEL TO DETECT PHISHING SITES USING


CLUSTERING AND BAYESIAN APPROACH

SUBMITTED TO UNIVERSITY OF PUNE,


IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF THE DEGREE
OF
BACHELOR OF ENGINEERING (COMPUTER ENGINEERING)

BHUSHAN DHAMDHERE
ROHIT CHINCHWADE
KAUSHAL DHONDE
SWAPNIL MEHETRE

Under the Guidance of


PROF. RAHUL PATIL

DEPARTMENT OF COMPUTER ENGINEERING,


PIMPRI CHINCHWAD COLLEGE OF ENGINEERING, PUNE

DEPARTMENT OF COMPUTER ENGINEERING


PIMPRI CHINCHWAD COLLEGE OF ENGINEERING, PUNE

CERTIFICATE

This is to certify that the Final Year Project report entitled


A HYBRID MODEL TO DETECT PHISHING SITES USING CLUSTERING AND
BAYESIAN APPROACH
is a record of bonafide work carried out and submitted by

BHUSHAN DHAMDHERE
ROHIT CHINCHWADE
KAUSHAL DHONDE
SWAPNIL MEHETRE

under the guidance of
Prof. Rahul Patil,
in partial fulfillment of the requirements for the
award of the degree of Bachelor of Engineering (Computer Engineering) of the University of Pune.

(PROF. RAHUL PATIL)


Project Guide

(PROF. DR. J. S. UMALE)


Head, Computer Engineering

Examination Approval Sheet


The Project Report entitled

A HYBRID MODEL TO DETECT PHISHING SITES USING CLUSTERING AND


BAYESIAN APPROACH

By
Bhushan Dhamdhere
Rohit Chinchwade
Kaushal Dhonde
Swapnil Mehetre
is approved for Project, B.E Computer Engineering, University of Pune
at
Pimpri Chinchwad College of Engineering

Examiners :

External Examiner :

Internal Examiner :

Date

Acknowledgments
We express our sincere thanks to our Guide Prof. Rahul Patil, for his constant encouragement and support throughout our project, especially for the useful suggestions given during the
course of project and having laid down the foundation for the success of this work.
We would also like to thank our Project Coordinator Mrs. Deepa Abin, for her assistance,
genuine support and guidance from early stages of the project. We would like to thank Prof.
Dr. J. S. Umale, Head of Computer Department for his unwavering support during the entire
course of this project work. We are very grateful to our Principal Dr. A. M. Fulambarkar for
providing us with an environment to complete our project successfully. We also thank all the
staff members of our college and technicians for their help in making this project a success. We
also thank all the web committees for enriching us with their immense knowledge. Finally, we
take this opportunity to extend our deep appreciation to our family and friends, for all that they
meant to us during the crucial times of the completion of our project.

Bhushan Dhamdhere
Rohit Chinchwade
Kaushal Dhonde
Swapnil Mehetre

Contents

List of Figures

List of Tables

Abstract

1 Introduction
  1.1 Overview
  1.2 Brief Description
  1.3 Problem Definition
  1.4 Applying Software Engineering approach

2 Literature Survey

3 Software Requirements Specifications
  3.1 Introduction
    3.1.1 Purpose
    3.1.2 Intended audience and reading suggestions
    3.1.3 Project Scope
    3.1.4 Design and Implementation Constraints
    3.1.5 Assumptions and Dependencies
  3.2 System Features
    3.2.1 System Feature 1: String Searching
    3.2.2 System Feature 2: String Tokenization
    3.2.3 System Feature 3: K-Means Clustering
    3.2.4 System Feature 4: DOM Tree Parsing
    3.2.5 System Feature 5: Naive Bayes Classifier
  3.3 External Interface Requirements
    3.3.1 User Interfaces
    3.3.2 Hardware Interfaces
    3.3.3 Software Interfaces
    3.3.4 Communication Interfaces
  3.4 Non-Functional Requirements
    3.4.1 Performance Requirements
    3.4.2 Safety Requirements
    3.4.3 Security Requirements
    3.4.4 Software Quality Attributes
  3.5 Analysis Models
    3.5.1 Data Flow Diagrams
    3.5.2 Class Diagram
    3.5.3 State-Transition Diagram
  3.6 System Implementation Plan
    3.6.1 Cost Estimation Model
    3.6.2 Gantt Chart

4 System Design
  4.1 System Architecture
  4.2 UML Diagrams

5 Technical Specification
  5.1 Technology used

6 Schedule, Estimate and Team Structure
  6.1 Project Estimate
  6.2 Schedule
  6.3 Team Structure

7 Software Implementation
  7.1 Introduction
  7.2 Databases
  7.3 Important Modules
  7.4 Business Logic

8 Software Testing
  8.1 Introduction
  8.2 Test Cases
  8.3 Snapshot of GUI

9 Results
  9.1 Accuracy of Result
  9.2 Project Results

10 Deployment and Maintenance
  10.1 Installation

11 Appendix A: Glossary

12 Appendix B: Semester I Assignments

List of Figures

2.1 Total reported attacks per month for 1 year [7]
2.2 Major attacked countries by volume of attack [7]
2.3 Major attacked countries by Brands attacked [7]
3.1 Level 1 DFD
3.2 Level 2 DFD
3.3 Class Diagram
3.4 State-Transition Diagram
3.5 Cocomo-II Embedded Project Model
3.6 Gantt Chart Model
4.1 System Architecture Model
4.2 Feature Extraction Model
4.3 K-Means Clustering Model
4.4 Naive Bayes Classifier Model
4.5 Use Case Diagram
4.6 Class Diagram
4.7 Sequence Diagram
4.8 State-Transition Diagram
4.9 Collaboration Diagram
4.10 Package Diagram
4.11 Activity Diagram
4.12 Component Diagram
4.13 Deployment Diagram
6.1 Cocomo-II Embedded Project Model
7.1 Sample DOM Tree
7.2 DOM Tree constructed in PROJECT
8.1 Test Cases for Project Main Modules
8.2 Main Form
8.3 Manual Entry Form
8.4 Manual Entry Form Empty
8.5 Prediction Model
8.6 Load Form
9.1 Accuracy Testing graph
9.2 New site feature extraction in progress
9.3 Prediction Results of Site
10.1 JDK Step 1
10.2 JDK Step 2
10.3 JDK Step 3
10.4 JDK Step 4

List of Tables

7.1 Sample Dataset for K-Means Clustering
7.2 Initial Cluster Centroid values
7.3 Dataset after clustering
7.4 Final Cluster Centroid values
7.5 Sample Training data set of Classifier
7.6 New Unknown site
7.7 Probability for feature set to be Original
7.8 Probability for feature set to be Phish

Abstract
As electronic commerce and on-line trade expand, phishing has become one of the most common forms of network crime. Our project presents an automatic approach for intelligent phishing web detection based on learning from a large number of legitimate and phishing web pages. Given a web page, its Uniform Resource Locator (URL) features are first analyzed and then classified by K-Means clustering. When the web page's legitimacy is still suspicious, the page is parsed into a Document Object Model (DOM) tree and then classified by a Naive Bayes (NB) classifier. Experimental results show that our approach achieves high detection accuracy and low detection time, and performs well even with a small sample for training the classification model.
A novel framework using a Bayesian approach for content-based phishing web page detection is presented. Our model takes into account textual and visual contents to measure the similarity between the protected web page and suspicious web pages. A text classifier and an algorithm fusing the results from the classifiers are introduced.

Chapter 1
Introduction
1.1 Overview

Some of the most dangerous attacks on today's internet happen in the form of phishing sites. Most of these attacks aim to retrieve users' personal information, particularly in the banking sector.
Phishing is the act of acquiring electronic information such as usernames, passwords, and credit card information by masquerading as a trustworthy authority. This information may then be used to log into the victim's accounts or to perform transactions using the stolen username, password, and credit card details.
Phishing takes many forms, but the most usual way today is through e-mail or through web sites imitating well-known brands (like ICICI Bank, SBI Bank, www.facebook.com, etc.) which look very similar to the legitimate sites and ask users to enter their username, password, or other personal information.
Phishing sites are the major attack by which most internet users are fooled by the phisher. Replicas of legitimate sites are created, and users are directed to those sites by luring them with offers. There are certain standards given by the W3C (World Wide Web Consortium); based on these standards we choose features which can easily describe the difference between a legitimate site and a phishing site.


Phishers are attackers who create replicas of legitimate web sites to retrieve users' personal information such as passwords, credit card numbers, and financial transaction information. As per the survey done by the RSA Fraud Surveyor, phishing attacks rose by 2% from December 2012 to January 2013.
The W3C has set standards that are followed by most legitimate sites, but a phisher may not care to follow these standards, since a phishing site is intended to catch as many victims as possible in a very short time and with little bait. There are certain characteristics of the URL and source code of a phishing site from which we can judge whether the site is fake.
To detect and prevent attacks from such phishing sites, various preventive strategies are employed by anti-phishing service providers such as Google Toolbar and anti-virus vendors. These providers create and maintain databases of blacklisted sites. Anti-phishing organizations such as www.phishtank.com maintain a blacklist of reported phishing sites along with their current status, i.e., whether they are still online.
Phishers create sites at such a rate that there is always a period during which a site has not yet been reported as a phish; in that case these techniques of maintaining online blacklist repositories fail. The major drawback we see in this method is that a normal user will not always take caution about phishing sites: he may be tricked by the overall look of a site that resembles the legitimate one, and the site may not yet have been verified by the service providers and hence is not blocked.

1.2 Brief Description

We propose a system that detects phishing sites based on training models built after studying the results from various phishing sites. In our approach, we determine whether a site is phishing or not based on URL and HTML features of the website. We first retrieve the following features from the URL of the website:
IP as URL
Dots in URL
Slashes in URL
Suspicious Characters in URL
After retrieving these URL features, we download the source code of the web page and parse it with an HTML DOM parser to get further HTML features:
Null Anchor count
Foreign Anchor count
HTTPS / SSL / TLS certificate validity check

1.3 Problem Definition

The aim of our project is the detection of phishing web pages by selecting textual and visual contents of a web site, such as URL features and anchor-tag features from the visual contents of web pages. We apply a string parsing algorithm to the textual features and use the DOM tree of the web site's visual content to analyze further features that may contribute to a more efficient prediction of the result.
The model we propose uses textual features of the web site, such as the number of slashes and the number of dots in the URL; these features are used to place the web site in a cluster of the database using the K-Means clustering algorithm.
If the site still lies in the Suspicious cluster, more visual features are extracted by downloading the web site and applying DOM tree parsing, then extracting the features we require, such as HTTPS/SSL certification, the number of foreign anchor tags, and the number of null anchor tags. We then apply a Naive Bayes classifier to predict the result, so results are predicted more accurately.

1.4 Applying Software Engineering approach

New advances in internet technology and the rapid growth of networks in quality and quantity have introduced new applications and concerns in internet banking and industry. The unique requirements and constraints associated with internet security have brought new challenges to software development for such environments, as the field demands extensive improvements to traditional anti-phishing system development methodologies in order to fulfill its special needs.
We examine the challenges of developing software for personal systems connected to the internet, starting by reviewing website characteristics and investigating the status of anti-phishing software development methods. It has been shown that Agile methodologies are appropriate for the development of such systems; based on this assumption, we identify specific requirements for an internet security software development methodology, from which a new agile method is engineered using the Hybrid Methodology Design approach.


Chapter 2
Literature Survey
A literature survey on anti-phishing was carried out for this model; the following are the conclusive records of that survey.
The model is surveyed with respect to the following points:
1. Existing models
2. Current phishing status
3. Existing documentation for the proposed model, referred to for the current project.
Existing Models
1. Plug-in for Browsers
Browser plug-ins (for Mozilla Firefox, Google Chrome, etc.) are used to detect whether a site is a phishing site. Whenever the user enters a URL in the browser's address bar, the plug-in copies the URL and sends it to the browser's online repository, which is searched for entries for that URL; if there are no entries, no alarm is raised even if the site is a phishing site[8].
If the site is not present in the browser's repository, no alarm is raised and the user continues to the web site, because the plug-in reports that the site is not malicious.

It may not be possible for the online repository to maintain a record of each and every site, because a very large number of web sites are launched every day.
2. Anti-Viruses having Internet Phishing Security

An anti-virus works very similarly to the browser plug-in: it also catches the URL from the browser and checks it against its own repository, which may be updated at the client side on a daily basis.

Here the anti-virus service provider conducts surveys, checks sites on a regular basis, and updates the database when a phishing site is found; the database is then updated at the client end, which prevents attacks more effectively than depending only on browser plug-ins[8].

The question remains the same for new web sites that have not yet been identified by the anti-virus service provider. There is no protection for such users, who can only rely on the anti-virus service provider and assume the site may still be under test. The models currently used to detect phishing attacks use only URL features to predict whether a site is malicious; even where visual features of sites are used, only a very small number of features are used for prediction, and a machine learning approach has not yet been applied to detect phishing sites[4].
Current Phishing Status

Looking at the first fortnightly report by the anti-phishing organization (www.antiphishing.org) and the RSA online fraud attack surveys, a few major points emerge:
Phishing attacks have increased by 2% since December 2012.
India receives 4% of global attacks by volume of attack.
India is targeted by 4% of global attacks by volume of brands attacked.


Figure 2.1: Total reported attacks per month for 1 year[7]

In January, RSA identified 30,151 attacks launched worldwide, a 2% increase in attack volume from December. Considering historical data, the overall trend in attack numbers in an annual view shows a slightly lower volume of attacks through the first quarter of the year.

Figure 2.2: Major attacked countries by volume of attack[7]

The U.S. was targeted most by phishing attacks in January, with 57% of total phishing volume. The UK endured 10%, followed by India and Canada, each with 4% of attack volume.


Figure 2.3: Major attacked countries by Brands attacked[7]

Brands in the US were most targeted in January: 30% of phishing attacks targeted US organizations, followed by the UK, representing 11% of worldwide brands attacked by phishers. Other nations whose brands were heavily targeted include India, Italy, Australia, France and Brazil.

Supporting papers

A Layout-Similarity-Based Approach for Detecting Phishing Pages - Angelo P. E. Rosiello, Engin Kirda, Christopher Kruegel, Fabrizio Ferrandi, Politecnico di Milano

This paper presents an extension of the authors' system (called DOM-Anti-Phish) that mitigates the shortcomings of their previous system. In particular, the novel approach leverages layout similarity information to distinguish between malicious and benign web pages. This makes it possible to reduce the involvement of the user and significantly reduces the false alarm rate. The experimental evaluation demonstrates that the solution is feasible in practice.

We refer to this paper for the use of the DOM tree in the feature extraction process and for the visual features of web pages.

Textual and Visual Content-Based Anti-Phishing: A Bayesian Approach - IEEE Transactions, October 2011 - Haijun Zhang, Gang Liu, Tommy W. S. Chow, Senior Member, IEEE, and Wenyin Liu, Senior Member, IEEE

A novel framework using a Bayesian approach for content-based phishing web page detection is presented. The model takes into account textual and visual contents to measure the similarity between the protected web page and suspicious web pages. A text classifier, an image classifier, and an algorithm fusing the results from the classifiers are introduced. An outstanding feature of this paper is the exploration of a Bayesian model to estimate the matching threshold. This is required in the classifier for determining the class of the web page and identifying whether the web page is phishing. In the text classifier, the Naive Bayes rule is used to calculate the probability that a web page is phishing. In the image classifier, the earth mover's distance is employed to measure visual similarity, and the Bayesian model is designed to determine the threshold. In the data fusion algorithm, Bayes theory is used to synthesize the classification results from textual and visual content. The effectiveness of the proposed approach was examined on a large-scale data set collected from real phishing cases. Experimental results demonstrated that the text classifier and the image classifier deliver promising results, the fusion algorithm outperforms either of the individual classifiers, and the model can be adapted to different phishing cases.

We refer to this paper for the use of the Naive Bayes classifier for the detection of malicious web pages.

An Efficient Approach to Detecting Phishing Web - Xiaoqing GU, Hongyuan WANG, Tongguang NI

This paper presents an automatic approach for intelligent phishing web detection based on learning from a large number of legitimate and phishing webs. Given a web page, its Uniform Resource Locator (URL) features are first analyzed and then classified by a Naive Bayesian (NB) classifier. When the web page's legitimacy is still suspicious, the page is parsed into a document object model tree and then classified by a Support Vector Machine (SVM) classifier. Experimental results show that the approach achieves high detection accuracy and low detection time with a small sample of the classification model training set.

This paper is referred to for the use of textual features of the URL, which can be used for the detection of fraudulent web pages.


Chapter 3
Software Requirements Specifications
3.1 Introduction

3.1.1 Purpose

Our project aims at the detection of phishing web pages by selecting textual and visual contents of a web site, such as URL features and anchor-tag features from the visual contents of web pages. We apply a string parsing algorithm to the textual features and use the DOM tree of the web site's visual content to analyze further features that may contribute to a more efficient prediction of the result.
The model we propose uses textual features of the web site, such as the number of slashes and the number of dots in the URL; these features are used to place the web site in a cluster of the database using the K-Means clustering algorithm[8].
If the site still lies in the Suspicious cluster, more visual features are extracted by downloading the web site and applying DOM tree parsing, then extracting the features we require, such as HTTPS/SSL certification, the number of foreign anchor tags, and the number of null anchor tags. We then apply a Naive Bayes classifier to predict the result, so results are predicted more accurately[3][2].


3.1.2 Intended audience and reading suggestions

This SRS is intended to be read by the project development team, the project analysis team, the project head, users, and other managing committees. It follows the IEEE standard format defined in IEEE Std 830-1998.
Readers of this SRS are advised to go through the indexed points in order to use this SRS more efficiently.

3.1.3 Project Scope

Phishing frequently impacts users' privacy and safety. Internet Service Providers (ISPs) face a huge problem in the internet community from phishers and hackers. The scope of this project revolves around the identification, reduction and elimination of phishing activities and the protection of users from phishing artists.
The software detects whether web sites are malicious (phishing) based on strong features using clustering, and if it is not able to determine the result, it uses the Naive Bayes classifier, which gives a result based on a probabilistic model.
In this model a fast and accurate approach is proposed to detect phishing webs. Our approach determines whether a web page is a phishing web or a legitimate one, based on its URL and web page features, and is a combination of NB and K-Means. K-Means is used on the URL because it is a rapid classification method and URL features can be easily acquired. If the K-Means step cannot judge the given web page's legitimacy definitively, the NB classifier is used to detect it based on its web page features.

3.1.4 Design and Implementation Constraints

Java Technology to be used

Java technology enables portability and scalability of the software, hence the Java platform is to be used. Most of the techniques used for processing the data are already implemented in Java, which reduces the programming effort.

HTTP communication protocol to be used

The software uses internet access to download the web page for textual feature extraction when required for prediction, hence the standard HTTP protocol is used for online data downloading.

Serialization of databases required

Serialization is the process by which the application can send program objects through a stream, which can be a file stream or a network stream. Sending objects through a stream allows developers to create solutions that were not available before.

Strong Database Requirement

The system uses the existing database entries to predict the result for the current data set. Thus the system requires a strong database of VALID as well as INVALID phish entries, without which it is very hard to produce the output for the Naive Bayes classifier.

3.1.5 Assumptions and Dependencies

The input database is assumed to be correct.

The database used for the initial entries for training the system is assumed to be correct input. A URL selected as a fake or phishing web page must originally have been declared a phishing web page, and vice versa.

The training data set is taken from online repositories such as www.phishtank.com, from which known valid phishing web pages can be retrieved, and some legitimate web pages are taken directly from the Google search tool.


3.2 System Features

3.2.1 System Feature 1: String Searching

String searching algorithms, sometimes called string matching algorithms, are an important
class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text[1].
String parsing of the URL is done to extract features from the URL and to create the data set of textual contents.
This feature will create the following set of database fields:
Total number of slashes in URL
Total number of dots in URL
Total number of suspicious characters in URL
URL as IP address
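For illustration, a minimal Java sketch of this URL string-feature extraction is given below. The class and method names and the heuristics are our own assumptions for the example and are not taken from the project's code:

import java.util.regex.Pattern;

// Minimal sketch of URL string-feature extraction (illustrative names, not the project's code).
public class UrlFeatureExtractor {

    private static final Pattern IP_HOST = Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

    // Count occurrences of a single character in the URL (used for dots and slashes).
    static int countChar(String s, char c) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) count++;
        }
        return count;
    }

    // Characters outside letters, digits and the usual URL delimiters are treated as suspicious.
    static int countSuspicious(String url) {
        int count = 0;
        for (char c : url.toCharArray()) {
            if (!Character.isLetterOrDigit(c) && ".:/-".indexOf(c) < 0) count++;
        }
        return count;
    }

    // Returns 1 if the host part of the URL is a raw IP address, otherwise 0.
    static int ipAsUrl(String url) {
        String host = url.replaceFirst("^https?://", "").split("[/:]")[0];
        return IP_HOST.matcher(host).matches() ? 1 : 0;
    }

    public static void main(String[] args) {
        String url = "http://192.168.0.1/login/secure@page";
        System.out.println("Slashes: " + countChar(url, '/'));
        System.out.println("Dots: " + countChar(url, '.'));
        System.out.println("Suspicious characters: " + countSuspicious(url));
        System.out.println("IP as URL: " + ipAsUrl(url));
    }
}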

3.2.2 System Feature 2: String Tokenization

The system accepts CSV input in which all the entries for a given URL are enclosed within a single string and separated by commas.
This type of input cannot be directly transformed into a data set entry; we first need to format the string according to the data set requirements. Hence string tokenization is required to accept the CSV from the user and store it in the database.
This feature works as in the following example:
CSV INPUT: http://www.my.input.com,0,3,0,0,2,1

This will produce the following data set:

URL                   : http://www.my.input.com
Number of Slashes     : 0
Number of Dots        : 3
Suspicious Characters : 0
SSL Certificate       : 3
Foreign Anchors       : 2
Null Anchors          : 1
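As an illustration only, this tokenization step could be sketched in Java with the standard String.split method; the class and field names below are assumptions, not the project's actual code:

// Illustrative sketch: split one CSV line into the data set fields described above.
public class CsvRecord {
    String url;
    int slashes, dots, suspiciousChars, sslCertificate, foreignAnchors, nullAnchors;

    static CsvRecord fromCsvLine(String line) {
        String[] tokens = line.split(",");            // tokenize the string on commas
        CsvRecord r = new CsvRecord();
        r.url = tokens[0];
        r.slashes = Integer.parseInt(tokens[1]);
        r.dots = Integer.parseInt(tokens[2]);
        r.suspiciousChars = Integer.parseInt(tokens[3]);
        r.sslCertificate = Integer.parseInt(tokens[4]);
        r.foreignAnchors = Integer.parseInt(tokens[5]);
        r.nullAnchors = Integer.parseInt(tokens[6]);
        return r;
    }

    public static void main(String[] args) {
        CsvRecord r = fromCsvLine("http://www.my.input.com,0,3,0,0,2,1");
        System.out.println(r.url + " -> dots=" + r.dots + ", nullAnchors=" + r.nullAnchors);
    }
}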

3.2.3 System Feature 3: K-Means Clustering

The K-Means clustering algorithm is used to cluster the strong features of the system, directly producing results in two clusters: a site is either More Suspicious or Less Suspicious.
K-Means clustering is applied to the features that have discrete values, such as the counts of suspicious characters, slashes, null anchors, foreign anchors and dots. These discrete values are converted into the form of 0 and 1: 0 for less suspicious values and 1 for more suspicious values. The feature provides the result in two clusters based on all the above-mentioned features[5].
This feature is of high priority, and this preliminary data mining gives better system performance. There is a risk if the result is an unpredictable one.
This feature takes the data set prepared by the String Searching feature of the system and applies the K-Means clustering algorithm for data mining, which minimizes the within-cluster sum of squares:

$\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$

where $\mu_i$ is the mean of the points in $S_i$.


The feature provides the result in two forms, based only on the considerably strong features of the web site whose result is to be declared:
More Suspicious: if the values of the features are very large, the site is more suspected of being a phishing site.
Less Suspicious: if the values of the features are considerably low, the site may not be treated as a phishing site.
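To make the algorithm concrete, the following is a compact Java sketch of a two-cluster K-Means run over such 0/1 feature vectors. It is a simplified illustration under our own naming; the project's actual implementation may differ:

import java.util.Arrays;

// Simplified illustration: K-Means with k = 2 on 0/1 feature vectors.
public class KMeansSketch {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    static int[] cluster(double[][] data, double[][] centroids, int iterations) {
        int[] assignment = new int[data.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: put each point into the nearest cluster.
            for (int p = 0; p < data.length; p++) {
                assignment[p] = distance(data[p], centroids[0]) <= distance(data[p], centroids[1]) ? 0 : 1;
            }
            // Update step: recompute each centroid as the mean of its points.
            for (int c = 0; c < 2; c++) {
                double[] mean = new double[data[0].length];
                int count = 0;
                for (int p = 0; p < data.length; p++) {
                    if (assignment[p] == c) {
                        for (int d = 0; d < mean.length; d++) mean[d] += data[p][d];
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < mean.length; d++) mean[d] /= count;
                    centroids[c] = mean;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0, 1, 0, 0}, {1, 1, 1, 1, 0}, {0, 0, 0, 0, 0}, {1, 0, 1, 1, 1}};
        double[][] centroids = {{0, 0, 0, 0, 0}, {1, 1, 1, 1, 1}}; // initial centroid guesses
        System.out.println(Arrays.toString(cluster(data, centroids, 10)));
    }
}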

3.2.4 System Feature 4: DOM Tree Parsing

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy-to-use JavaBeans. It is a fast, robust and well tested package[6].
If the result of the K-Means step lies in the Suspicious region, we need to extract visual features of the URL; this requires downloading the page and parsing it into a DOM tree.
This parsing helps to identify the following data set:
SSL Certificate
NULL Anchor Tags
Foreign Anchor Tags
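A hedged sketch of this anchor-feature extraction is shown below. For brevity it uses the jsoup library as a stand-in HTML parser (the project itself uses the HTML Parser package), and the URL and heuristics are assumed example values:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.net.URI;

// Illustrative sketch of anchor-tag feature extraction from a parsed page.
public class AnchorFeatures {

    public static void main(String[] args) throws Exception {
        String pageUrl = "https://example.com/";          // hypothetical input URL
        Document doc = Jsoup.connect(pageUrl).get();       // download and parse into a DOM-like tree
        String pageHost = new URI(pageUrl).getHost();

        int nullAnchors = 0, foreignAnchors = 0;
        for (Element a : doc.select("a")) {
            String href = a.attr("href").trim();
            if (href.isEmpty() || href.equals("#") || href.startsWith("javascript:")) {
                nullAnchors++;                              // anchor that points to nowhere
            } else {
                String absolute = a.attr("abs:href");
                String host = absolute.isEmpty() ? "" : new URI(absolute).getHost();
                if (host != null && !host.isEmpty() && !host.equalsIgnoreCase(pageHost)) {
                    foreignAnchors++;                       // anchor linking to a different domain
                }
            }
        }
        boolean https = pageUrl.startsWith("https://");     // crude SSL indicator for illustration
        System.out.println("Null anchors: " + nullAnchors
                + ", Foreign anchors: " + foreignAnchors + ", HTTPS: " + https);
    }
}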

3.2.5 System Feature 5: Naive Bayes Classifier

The Naive Bayes classifier is the strong prediction algorithm we use in this module, but it is used only if the site has not already been classified by the clustering step, because of the execution cost of the algorithm.
This feature is of medium priority and is used for secondary data mining; it does not improve the performance of the system, but better prediction accuracy can be achieved. The risk factor of the clustering step can be lowered using the Naive Bayes classifier.
The Naive Bayes classifier uses the data set prepared by both String Searching and DOM Tree HTML Parsing to predict the output, hence the results are close to accurate[3].
The following formula is used to calculate the results:
$v_{NB} = \arg\max_{v_j} P(v_j) \prod_i P(a_i \mid v_j)$

Generally, $P(a_i \mid v_j)$ is estimated using the m-estimate:

$P(a_i \mid v_j) = \dfrac{n_c + m\,p}{n + m}$

where,
$n$ = the number of training examples for which $v = v_j$,
$n_c$ = the number of examples for which $v = v_j$ and $a = a_i$,
$p$ = the a priori estimate for $P(a_i \mid v_j)$,
$m$ = the equivalent sample size.
The feature provides the result in two forms, based on all the features taken into the data set of the web site whose result is to be declared:
Phishing Site: if the site results in a Valid Phish.
Legitimate Site: if the site results in an Invalid Phish.
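To make the formula concrete, here is a minimal Java sketch of the m-estimate Naive Bayes decision for binary (0/1) features. The counts and names are toy values for illustration, not the project's training data:

// Minimal m-estimate Naive Bayes sketch for binary features (illustrative only).
public class NaiveBayesSketch {

    // P(a = 1 | class) with the m-estimate: (n_c + m*p) / (n + m).
    static double mEstimate(int nc, int n, double p, double m) {
        return (nc + m * p) / (n + m);
    }

    public static void main(String[] args) {
        // Toy training counts for 3 binary features, 4 phishing and 4 legitimate examples.
        int[][] featureOnCount = { {3, 4, 2},   // feature == 1 counts among phishing examples
                                   {1, 0, 1} }; // feature == 1 counts among legitimate examples
        int[] classCount = {4, 4};
        int[] newSite = {1, 1, 0};              // feature vector of the unknown site

        double m = 2.0, p = 0.5;                // equivalent sample size and prior estimate
        double[] score = new double[2];
        for (int c = 0; c < 2; c++) {
            score[c] = (double) classCount[c] / (classCount[0] + classCount[1]); // P(class)
            for (int f = 0; f < newSite.length; f++) {
                double pOn = mEstimate(featureOnCount[c][f], classCount[c], p, m);
                score[c] *= (newSite[f] == 1) ? pOn : (1.0 - pOn);               // P(a_i | class)
            }
        }
        System.out.println(score[0] > score[1] ? "Phishing (valid phish)" : "Legitimate (invalid phish)");
    }
}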

3.3 External Interface Requirements

3.3.1 User Interfaces

The user of the system interacts with the system by using the following functionality:
1. Manage Dataset
This feature enables users to add or delete records in the dataset.
2. Upload CSV
If the user wishes to get a ready dataset from another site or another computer, he may upload any CSV file that has a compatible dataset format.
3. Apply Clustering
To determine the new centroids after adding records to the dataset, the user can use this feature.
4. Prediction Model
To determine whether a site is a phishing site or not, the user can make use of this main feature of the project.
5. Save Database
After changing the database, the user needs to save the new database; for this the user can rely on this feature of the project.

3.3.2 Hardware Interfaces

Operating System: Windows platform
Hardware: Intel Core 2 Duo or better
Internet Connection

3.3.3 Software Interfaces

Java SDK: 1.7 or better
Database System: MySQL
Libraries


DOM Tree Parser
SAX Parser

Database
Serialized Database

Operating System
Windows XP or better

Data Set[8]

URL

The input URL for which the detection is to be done by the system.
IP as URL

Getting a domain name requires the phisher to buy space on some web-hosting site. The phisher may avoid this and use the IP address itself as the URL. Legitimate sites will always have a proper domain name.

Suspicious Characters

The total count of characters in the URL that are not in A-Z or 0-9. The phisher may use tricky characters to make the site look like the legitimate one; the standard practice is not to include characters other than A-Z and 0-9 so that the URL is easy for users to remember.

The phisher may trick the user by inserting any of & % - _ @ to make the web site look like a legitimate site.
Number of Slashes

The total number of slashes occurring in the URL. The URL should not contain too many slashes; if it contains more than five slashes, the URL is considered a phishing URL.

Number of Dots

The total number of dots occurring in the URL. The dots indicate the number of sub-domains used by the URL; the more sub-domains used, the more suspicious the site.
Anchor Tags
1. NULL Anchor
A null anchor is an anchor that points to nowhere. The more null anchors a page has, the more suspicious it becomes.
2. Foreign Anchor
An anchor tag contains an href attribute whose value is the URL to which the page is linked. If the domain name in that URL is not the same as the domain of the page's URL, it is called a foreign anchor.
HTTPS-SSL Certificate

Most legitimate sites use an SSL certificate for online identity. An SSL certificate is provided by a trusted authority and needs to be renewed after some time period.

A phisher cannot get an SSL certificate by providing a fake identity and will not manage to keep the certificate renewed.

3.3.4 Communication Interfaces

A standard HTTP communication interface is required for the internet connection.


3.4 Non-Functional Requirements

3.4.1 Performance Requirements

The product must use clustering as the preliminary method to detect a phishing site; only if that module is unable to determine the result should it go on to the Naive Bayes classifier. This increases the performance of the system, as Naive Bayes is a complex prediction algorithm while K-Means clustering is a simple data mining method.

3.4.2 Safety Requirements

The safety of the system is achieved by providing an authenticated login to the system and limited privileges for end users to make changes to the databases. Safety is also achieved by backing up the data contained in the system, so that even if the system crashes while working, all data remains safe and no data loss takes place.

3.4.3 Security Requirements

The system to be developed is provided with authentication (i.e., username and password) so that workers who should not be granted access to the system are restricted. This also helps to keep the database secure from attempts by unauthorized users to alter the data.

3.4.4 Software Quality Attributes

1. We do not depend on a single data mining method; thus we ensure reliability of the software in case the primary module fails, and correctness of the output can also be stated.
2. Most of the components can be used cross-platform, so we can state the robustness of the system.
3. Scalability of the software can be considered a software quality attribute, as Java components are used; the Java components can be modified and more packages and classes can be added to the system to extend its features.
4. Portability is achieved by using Java platform components, as Java is easily available on any system, is open source, and is easy to install and use.

3.5 Analysis Models

3.5.1 Data Flow Diagrams

Figure 3.1: Level 1 DFD

Figure 3.2: Level 2 DFD


3.5.2 Class Diagram

Figure 3.3: Class Diagram


3.5.3 State-Transition Diagram

Figure 3.4: State-Transition Diagram

3.6 System Implementation Plan

3.6.1 Cost Estimation Model

Basic COCOMO computes software development effort (and cost) as a function of program size. Program size is expressed in estimated thousands of source lines of code (SLOC)[1].
COCOMO applies to three classes of software projects:
Organic Projects - small teams with good experience working with less-than-rigid requirements.
Semi-Detached Projects - medium teams with mixed experience working with a mix of rigid and less-than-rigid requirements.
Embedded Projects - developed within a set of tight constraints (hardware, software, operational, ...). This class is also a combination of organic and semi-detached projects.
The basic COCOMO equations take the form:
1. Effort Applied: $E = a_b(\mathrm{KLOC})^{b_b}$ [person-months]
2. Development Time: $D = c_b(E)^{d_b}$ [months]
3. People Required: $P = E/D$ [count]
where KLOC is the estimated number of delivered lines of code for the project (expressed in thousands). The coefficients $a_b$, $b_b$, $c_b$ and $d_b$ are given in the following table:
Project class     a_b    b_b     c_b    d_b
Organic           2.4    1.05    2.5    0.38
Semi-Detached     3.0    1.12    2.5    0.35
Embedded          3.6    1.20    2.5    0.32

Basic COCOMO is good for quick estimate of software costs. However it does not account
for differences in hardware constraints, personnel quality and experience, use of modern tools
and techniques, and so on.
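As a hedged illustration of these equations, assume an embedded-class project of 4 KLOC (an example figure, not the measured size of this project). The basic model then gives approximately:

E = 3.6 x (4)^1.20 ~ 19.0 person-months
D = 2.5 x (19.0)^0.32 ~ 6.4 months
P = E / D ~ 3 people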

Figure 3.5: Cocomo-II Embedded Project Model


3.6.2 Gantt Chart

Gantt charts illustrate the start and finish dates of the terminal elements and summary elements of a project. Terminal elements and summary elements comprise the work breakdown
structure of the project. Some Gantt charts also show the dependency (i.e. precedence network)
relationships between activities. Gantt charts can be used to show current schedule status using percent-complete shadings and a vertical "TODAY" line[1].


Figure 3.6: Gantt Chart Model


Chapter 4
System Design
4.1 System Architecture

Figure 4.1: System Architecture Model

The above figure explains the architecture of the system, showing the major components and their connectors along with the topology among the components.
The system is divided into three major modules:
1. Feature Extraction
This module extracts the features of the URL required to identify a phishing site. It includes various methods which are explained in the next section.
2. Apply Clustering Algorithm
The database clustering is done using K-Means clustering, which helps to produce results at a very early stage using only a small amount of the data set built from the features extracted by the previous methods.
3. Apply Naive Bayes Classifier
The Naive Bayes classifier is only used when the system has placed the current data set in the suspicious cluster using K-Means clustering. NB then uses all the features, compares them with the existing data set, and finally produces a prediction of whether the site is a VALID or INVALID phish.
Feature Extraction

The URL is provided as input to the system, and the system applies methods to fetch features from that URL. Features include visual and textual features.
The feature extraction process involves two major algorithms to extract features from the URL: a string searching algorithm and a DOM tree parsing algorithm.
The string searching algorithm is used to determine the textual features of the web site URL. The DOM tree parser is used to parse the HTML source code of the web page and extract the required features from the DOM tree.


Figure 4.2: Feature Extraction Model


Clustering Algorithm

The data set prepared by the feature extraction process is used in the K-Means clustering data mining algorithm, where three clusters are created: VALID Phish, INVALID Phish, and Suspicious.
According to the threshold value of the data set, the site is inserted into a cluster. If the site shows a value well above the threshold, it goes into the VALID Phish cluster, where it can be declared a phishing web page.
If the value of the data set is well below the threshold value, the web page lies in the INVALID Phish cluster, where it is declared a legitimate web page.
If the value of the data set is near the threshold value, the web page lies in the Suspicious cluster, where another method of classification is applied to predict the result.
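A hedged Java sketch of this threshold rule is given below; the threshold and margin are assumed example values, and the actual project may derive them from the cluster centroids instead:

// Illustrative threshold rule for placing a site into one of the three clusters.
public class ClusterDecision {
    enum Cluster { VALID_PHISH, INVALID_PHISH, SUSPICIOUS }

    // 'score' is the site's aggregated feature value; threshold and margin are example values.
    static Cluster place(double score, double threshold, double margin) {
        if (score > threshold + margin) return Cluster.VALID_PHISH;    // clearly phishing
        if (score < threshold - margin) return Cluster.INVALID_PHISH;  // clearly legitimate
        return Cluster.SUSPICIOUS;                                     // needs the Naive Bayes step
    }

    public static void main(String[] args) {
        System.out.println(place(0.9, 0.5, 0.2));  // VALID_PHISH
        System.out.println(place(0.55, 0.5, 0.2)); // SUSPICIOUS
    }
}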


Figure 4.3: K-Means Clustering Model


Naive Bayes Classifier

Once a site is placed in the Suspicious cluster of the database, the Naive Bayes classifier is applied to its data set; the data set is compared with the existing data sets in the database, a result is produced stating whether the site is a VALID or INVALID phish, and the site is accordingly moved from the Suspicious cluster to the applicable cluster.


Figure 4.4: Naive Bayes Classifier Model

4.2 UML Diagrams

Use Case Diagram


A use case diagram at its simplest is a representation of a user's interaction with the system, depicting the specification of a use case. A use case itself might drill into a lot of detail about every possibility; a use-case diagram can help provide a higher-level view of the system.
For our system the only applicable actor is the user, who can perform tasks such as logging into the system and accessing the application to provide input to the system. Other tasks are included in accessing the application itself, such as Enter URL, Enter CSV File, Access Database, and Apply System Functionality.


Figure 4.5: Use Case Diagram


Class Diagram
There are 5 classes which can be identified based on the features and functions of the respective classes.
A single User class is identified; the user is able to access the system for login, log-out, managing the databases, viewing output results, etc. This class has a one-to-one association with the Application class.
Two other classes, based on the basic functionality, are KMeans and NaiveBayes, which perform the computations and provide the results for the system.
The main Application class is the parent class of all the other classes; it contains all the functionality control of the application, and the other classes are called through the Application class.

Figure 4.6: Class Diagram


Sequence Diagram
The sequence diagram shows the flow of messages with respect to time. In the given system only the Log-in and Log-out stimuli have synchronous messages, for authentication and confirmation respectively.
All the other stimuli are asynchronous in nature, as the system performs its action and leaves the data set in place, so no return call is used for these messages.

Figure 4.7: Sequence Diagram


State-Transition Diagram
The state transition diagram shows the various phases the software or application goes through during its life cycle.
Here the application being developed goes through various phases of activities which are performed one after another.

Figure 4.8: State-Transition Diagram

Collaboration Diagram
Communication diagrams show a lot of the same information as sequence diagrams, but
because of how the information is presented, some of it is easier to find in one diagram than
the other. Communication diagrams show which elements each one interacts with better, but
sequence diagrams show the order in which the interactions take place more clearly.


In order to maintain the ordering of messages in such a free-form diagram, messages are labeled with a chronological number and placed near the link the message is sent over. Reading a communication diagram involves starting at message 1.0 and following the messages from object to object.
For the given system there are no sub-messages communicated amongst the objects; messages are only passed from one object to another, without return calls.

Figure 4.9: Collaboration Diagram


Package Diagram
Package diagrams can use packages that represent the different layers of a software system to illustrate its layered architecture. The dependencies between these packages can be adorned with labels/stereotypes to indicate the communication mechanism between the layers.
The package diagram used for this application mainly contains packages from the Java platform, as the development platform is Java and most of the functions are derived from the built-in packages of Java technology; hence the main Java package includes various sub-packages such as AWT, JPCAP and more.

Figure 4.10: Package Diagram


Activity Diagram
Here the prediction using Naive Bayes and the prediction using K-Means can be performed in parallel, while all the other activities need to be performed serially.
The activities are mostly system controlled, hence swim-lanes are not required. Very few activities are branching or conditional, such as log-on and log-out.

Figure 4.11: Activity Diagram


Component Diagram
When using a component diagram to show the internal structure of a component, the provided and required interfaces of the encompassing component can delegate to the corresponding interfaces of the contained components.
The major components that can be distinguished based on the functionality of the system are given in the diagram below. A Java collections component includes the packages and classes which are used as-is from the Java software development kit. Process Builder is the component which enables the system to download the web site for a URL.
The Serialization component shows the database to be used and the types of data sets, including the data set members. The Naive Bayes collection is a new component, not directly available in the platform, which includes the data mining techniques used to predict the output of the system.

Figure 4.12: Component Diagram


Deployment Diagram
For our system no hardware node needs to be attached, hence only the software deployment is shown in the diagram below. Most of the nodes are Java packages, and the remaining ones are the modules which need to be connected with one or more other modules of the project.

Figure 4.13: Deployment Diagram


Chapter 5
Technical Specification
5.1 Technology used

Java Platform
Java is a set of several computer software products and specifications from Sun Microsystems
(which has since merged with Oracle Corporation), that together provide a system for developing application software and deploying it in a cross-platform computing environment. Java is
used in a wide variety of computing platforms from embedded devices and mobile phones on
the low end, to enterprise servers and supercomputers on the high end. While less common,
Java applet are sometimes used to provide improved and secure functions while browsing the
World Wide Web on desktop computers.
Writing in the Java programming language is the primary way to produce code that will be
deployed as Java byte code. There are, however, byte code compilers available for other languages such as Ada, JavaScript, Python, and Ruby. Several new languages have been designed
to run natively on the Java Virtual Machine (JVM), such as Scala, Clojure and Groovy. Java syntax borrows heavily from C and C++, but object-oriented features are modelled after Smalltalk
and Objective-C.[9] Java eliminates certain low-level constructs such as pointers and has a very
simple memory model where every object is allocated on the heap and all variables of object
types are references. Memory management is handled through integrated automatic garbage
collection performed by the JVM.


Clustering
Clustering, in the context of databases, refers to the ability of several servers or instances
to connect to a single database. An instance is the collection of memory and processes that
interacts with a database, which is the set of physical files that actually store data.
Clustering takes different forms, depending on how the data is stored and how resources are allocated. The first type is known as the shared-nothing architecture. In this clustering mode,
each node/server is fully independent, so there is no single point of contention. An example
of this would be when a company has multiple data centers for a single website. With many
servers across the globe, no single server is a master. Shared-nothing is also known as database
sharding.

Classification
Classification consists of predicting a certain outcome based on a given input. In order to
predict the outcome, the algorithm processes a training set containing a set of attributes and the
respective outcome, usually called goal or prediction attribute. The algorithm tries to discover
relationships between the attributes that would make it possible to predict the outcome. Next
the algorithm is given a data set not seen before, called prediction set, which contains the same
set of attributes, except for the prediction attribute not yet known. The algorithm analyses the
input and produces a prediction. The prediction accuracy defines how good the algorithm is.


Chapter 6
Schedule, Estimate and Team Structure

6.1 Project Estimate

The project does not require any new hardware components, so its financial requirements are very small.
What the project does require is cost estimation for the manpower to be allocated and used efficiently. We followed the COCOMO II model with moderate constraints to allocate the manpower so that the project would be completed on or before the deadline.

Figure 6.1: Cocomo-II Embedded Project Model



6.2 Schedule

The project had to be completed on or before the given deadlines, hence strong project planning was required; the use of a Gantt chart increased our efficiency in keeping the project on track. With the help of the Gantt chart we could track the project flow, and corrective actions were taken in order to follow the deadlines strictly.
Because the deadlines were followed strictly, the project was completed before the given deadlines, so we could thoroughly test the project modules, and most of the small defects found were removed immediately.

6.3 Team Structure

The team required for the development of this project is 4.5 persons, as per the estimation
of the COCOMO II model based on working hours and the average number of lines of code to be produced.
The team structure we decided on has four developing members and one guide member.
The guide member played a major role by keeping the project flow on schedule and by resolving
the major errors and obstacles that affected the project development schedule.

The other four members worked as a team of developers and testers. The team lead role was
given to member 1, whose responsibility was the initial system design and the development of the system.
Member 2 worked on resource gathering and the literature survey
as well as developing the source code of the system. Member 3 was allocated the role of
tester, whose job was to thoroughly test the system against the test cases written by the developers.

The last team member worked as the scribe of the team and did all the documentation during the
development of the system.


Chapter 7
Software Implementation
7.1 Introduction

We conducted a survey of the current status of, and the patterns in, the phishing techniques
used by phishers. We found recurring patterns in phishing sites; for example, phishers use
characters that look very similar to letters of the English alphabet, such as '@' in place of the
character 'a'.
This series of patterns, and many more, can be traced from the URL of the site. Some phishers
copy the source code of the original site and make minimal changes to it, so the visual appearance
is almost identical to the original site. However, this causes the anchor tags of the page to
point to another domain. This can also be traced, and both the textual and the HTML features
can be used to decide whether a site is an original site or a phishing site.
We can use a classification technique to predict whether the site is original or a phishing
site.

7.2 Databases

As the system is implemented in the Java language and works through simple procedure
calls, there is no overhead of a separate database server.


All the records are stored in a single table; the attributes of the table are as follows:
URL: the URL of the record
IP as URL: Boolean value
Dots in URL: numerical value
Slashes in URL: numerical value
Suspicious characters in URL: numerical value
HTTPS / SSL / TLS: Boolean value
Foreign anchors: numerical value
Null anchors: numerical value
A serialized database is used, hence there is no requirement for any other database management tool to store the data.
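A minimal sketch of how such a serialized record could be represented in Java is given below. The class and field names are our own illustration, not the project's exact identifiers.

import java.io.Serializable;

// Illustrative sketch of one record of the feature table described above.
// Class and field names are assumptions, not the project's actual identifiers.
public class SiteRecord implements Serializable {
    private static final long serialVersionUID = 1L;

    String url;              // URL of the record
    boolean ipAsUrl;         // true if an IP address is used instead of a host name
    int dotsInUrl;           // number of '.' characters in the URL
    int slashesInUrl;        // number of '/' characters in the URL
    int suspiciousChars;     // count of suspicious characters such as '@' or '-'
    boolean usesHttps;       // HTTPS / SSL / TLS present
    int foreignAnchors;      // anchor tags pointing to a different domain
    int nullAnchors;         // anchor tags with empty or '#' targets
}

A java.util.List of such records can then be written to and read from disk with ObjectOutputStream and ObjectInputStream, which matches the serialized-database idea above.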

7.3 Important Modules

There are three major modules:
1. Feature Extraction
2. Apply Clustering
3. Apply Classifier
A detailed description of each module is given below.

Feature Extraction
The feature extraction process is the initial stage of the project, used both to create the database and to find out whether a site is a phishing site or not. Two methods are required to create a single data set of features in the
feature extraction process. These methods are as follows:
1. String Parsing
2. DOM Tree Parsing

String parsing is applied to the URL string given as input by the user. In Java, string
searching is made easy by the built-in package for string operations; we only need to
supply what to search for and the input string.
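The following is a minimal sketch of this step. The exact list of "suspicious" characters and the example URL are assumptions made for illustration only.

// Illustrative sketch of the string-parsing step on the raw URL.
public class UrlStringFeatures {

    // Count the occurrences of a single character in the URL string.
    static int count(String url, char target) {
        int n = 0;
        for (int i = 0; i < url.length(); i++) {
            if (url.charAt(i) == target) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String url = "http://192.168.0.1/login//verify@account";   // example input

        int dots = count(url, '.');
        int slashes = count(url, '/');
        int suspicious = count(url, '@') + count(url, '-') + count(url, '_');
        boolean https = url.startsWith("https://");
        // Simple (not exhaustive) check for an IP address used as the host part.
        boolean ipAsUrl = url.matches("https?://\\d{1,3}(\\.\\d{1,3}){3}.*");

        System.out.println(dots + " dots, " + slashes + " slashes, "
                + suspicious + " suspicious chars, https=" + https
                + ", ipAsUrl=" + ipAsUrl);
    }
}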
DOM tree parsing is a much harder job than string searching, because we need to obtain the
source code of the page as a string and parse it ourselves. For our purposes no readily usable package
was available in Java to build this DOM tree directly, hence we create a vector and insert each tag found in the HTML source
code of the URL, one tag at a time, and then, with the help of DFS traversal, these nodes are
added to the list that is displayed.
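A simplified sketch of this idea is shown below; it scans the tags with a regular expression instead of building a full DOM tree, and the host name and HTML snippet are assumed example values.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of collecting tags from the downloaded HTML source
// and counting null / foreign anchors.
public class AnchorFeatures {

    public static void main(String[] args) {
        String host = "example.com";   // host of the URL being checked (assumed)
        String html = "<html><body>"
                + "<a href=\"#\">home</a>"
                + "<a href=\"http://other-domain.com/login\">login</a>"
                + "</body></html>";

        List<String> tags = new ArrayList<>();
        Matcher tag = Pattern.compile("<[^>]+>").matcher(html);
        while (tag.find()) {
            tags.add(tag.group());     // every tag, in document order
        }

        int nullAnchors = 0, foreignAnchors = 0;
        Matcher href = Pattern.compile("<a\\s+href=\"([^\"]*)\"").matcher(html);
        while (href.find()) {
            String target = href.group(1);
            if (target.isEmpty() || target.equals("#")) {
                nullAnchors++;                                   // null anchor
            } else if (target.startsWith("http") && !target.contains(host)) {
                foreignAnchors++;                                // points to another domain
            }
        }
        System.out.println(tags.size() + " tags, null anchors = " + nullAnchors
                + ", foreign anchors = " + foreignAnchors);
    }
}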

Figure 7.1: Sample DOM Tree


Figure 7.2: DOM Tree constructed in PROJECT

Apply Clustering
As explained in the technical specification, database clustering takes different forms depending
on how the data is stored and how resources are allocated. In the shared-nothing architecture
each node or server is fully independent, so there is no single point of contention; this mode is
also known as database sharding. Contrast this with the shared-disk architecture, in which all
data is stored centrally and then accessed via instances stored on different servers or nodes.


The distinction between the two types has recently become blurred with the introduction
of grid computing and distributed caching. In this setup, data is still centrally managed but
controlled by a powerful virtual server that is composed of many servers working together as
one.
In this project, clustering is applied to reduce the complexity of the stored values.
Simply put, two clusters are created for each discrete feature of the data set. The lowest and highest
values of that feature are then taken as the initial centroids to start the algorithm.
K-Means clustering is used because it is an unsupervised algorithm and provides faster results
than the other clustering algorithms considered.
The following is an example of the execution of K-Means clustering for this project.
We take a data set and find the minimum and maximum values present in each of the features used
for K-Means clustering.

Table 7.1: Sample Dataset for K-Means Clustering (feature columns DOTS, SLASH, S.CHAR, N.ANCHR and F.ANCHR, together with the assigned CLUSTER).

Then we calculate the distance of each new data-set item, one feature at a time, by comparing
the current value with the high centroid and the low centroid of that feature. The value is
labelled with the cluster number of the centroid it lies nearer to, and that centroid is recalculated
as the mean of all values present in the cluster.

Table 7.2: Initial Cluster Centroid values (rows Less = 0 and More = 1; feature columns DOTS, SLASH, S.CHAR, N.ANCHR and F.ANCHR).

This algorithm is unsupervised, which means it must terminate itself once some condition
is satisfied. In this case the algorithm halts when no movement of the centroids is observed.
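A minimal sketch of this per-feature, two-cluster K-Means step is given below; the input values are an arbitrary example, not the project data.

// Illustrative sketch of the per-feature, two-cluster K-Means step described
// above: centroids start at the minimum and maximum value of the feature and
// are recomputed until they stop moving. Variable names are assumptions.
public class TwoClusterKMeans {

    public static void main(String[] args) {
        double[] values = {1, 2, 2, 3, 7, 8, 9, 10};     // one feature column

        double low = values[0], high = values[0];
        for (double v : values) {                         // initial centroids: min and max
            low = Math.min(low, v);
            high = Math.max(high, v);
        }

        int[] label = new int[values.length];             // 0 = "Less", 1 = "More"
        boolean moved = true;
        while (moved) {
            double sumLow = 0, sumHigh = 0;
            int nLow = 0, nHigh = 0;
            for (int i = 0; i < values.length; i++) {     // assign to the nearer centroid
                label[i] = Math.abs(values[i] - low) <= Math.abs(values[i] - high) ? 0 : 1;
                if (label[i] == 0) { sumLow += values[i]; nLow++; }
                else               { sumHigh += values[i]; nHigh++; }
            }
            double newLow = nLow > 0 ? sumLow / nLow : low;       // recompute centroids
            double newHigh = nHigh > 0 ? sumHigh / nHigh : high;
            moved = newLow != low || newHigh != high;             // stop when stable
            low = newLow;
            high = newHigh;
        }
        System.out.println("Less centroid = " + low + ", More centroid = " + high);
    }
}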

Table 7.3: Dataset after clustering (each feature column DOTS, SLASH, S.CH, N.AN and F.AN is followed by its assigned cluster label LBL).


Then each calculated centroid is declared as the final centroid for that feature in the database.

Table 7.4: Final Cluster Centroid values (Less = 0 centroids: 1.5, 1.85, 0.25, 0.825; More = 1 centroids: 7.5, 7.25, 8.5, 10).

Apply Classifier
As described in the technical specification, classification predicts an outcome from a given input: a
training set with known outcomes is processed to discover relationships between the attributes,
and the trained model is then applied to a prediction set whose outcome attribute is not yet
known; the prediction accuracy defines how good the algorithm is.
In simple terms, a naive Bayes classifier assumes that the value of a particular feature is
unrelated to the presence or absence of any other feature, given the class variable. For example,
a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. A naive
Bayes classifier considers each of these features to contribute independently to the probability
that this fruit is an apple, regardless of the presence or absence of the other features.


For some types of probability models, naive Bayes classifiers can be trained very efficiently
in a supervised learning setting. In many practical applications, parameter estimation for naive
Bayes models uses the method of maximum likelihood; in other words, one can work with the
naive Bayes model without accepting Bayesian probability or using any Bayesian methods.
Abstractly, the probability model for a classifier is a conditional model

p(C | F1, ..., Fn)

over a dependent class variable C with a small number of outcomes or classes, conditional on
several feature variables F1 through Fn. The problem is that if the number of features is large, or when
a feature can take on a large number of values, then basing such a model on probability tables
is infeasible. We therefore reformulate the model to make it more tractable.
Using Bayes' theorem, this can be written as

p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn)

In plain English, using Bayesian probability terminology, the above equation can be written
as

posterior = (prior × likelihood) / evidence

A more simplified formula, used to estimate the individual feature probabilities, is given below:

p(fsi | Cj) = (nc + mp) / (n + m)

where,
n = the number of training examples belonging to class Cj,
nc = the number of those examples in which the feature takes the value fsi,
p = the a priori estimate of p(fsi | Cj),
m = the equivalent sample size.
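The following is a minimal sketch of how this m-estimate formula can be applied to score one unknown site; the counts below are made-up example values, not the project's training data.

// Illustrative sketch of scoring one unknown site with the m-estimate
// formula p(fsi | Cj) = (nc + m*p) / (n + m).
public class NaiveBayesScore {

    // nc = matching examples in class Cj, n = examples in class Cj,
    // p = prior estimate (0.5 for a binary label), m = equivalent sample size.
    static double mEstimate(int nc, int n, double p, int m) {
        return (nc + m * p) / (double) (n + m);
    }

    public static void main(String[] args) {
        int n = 5, m = 1;
        double p = 0.5;

        // Per-feature match counts of the unknown site against each class (assumed).
        int[] matchesLegit = {3, 4, 2, 3, 3};
        int[] matchesPhish = {1, 0, 3, 2, 1};

        double probLegit = 0.5, probPhish = 0.5;          // class priors
        for (int i = 0; i < matchesLegit.length; i++) {
            probLegit *= mEstimate(matchesLegit[i], n, p, m);
            probPhish *= mEstimate(matchesPhish[i], n, p, m);
        }
        // The class with the larger product is returned as the prediction.
        System.out.println(probLegit > probPhish ? "LEGITIMATE" : "PHISHING");
    }
}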
Suppose we have taken a training data set of 10 websites to which K-Means clustering has
already been applied. This training data set is given as input to the Naive Bayes classifier.

Table 7.5: Sample training data set of the classifier (columns URL, IP, DTS, SLS, SCH, NAC and FAC with their cluster labels LB, plus SSL and the PSH class attribute).

We then take a new site for which it is not yet known whether it is a valid or an invalid phish,
as follows:
Table 7.6: New unknown site (same columns as Table 7.5, with the PSH class attribute unknown).

Using the formula p(fsi | Cj) = (nc + mp) / (n + m), we can evaluate the probability with which
each feature contributes to the final probability.

The final probability of the feature set belonging to a legitimate site can be calculated as follows:


Feature      NC     PROB
IP           0.5    0.5
DOTS         0.5    0.75
SLASHES      0.5    0.56
S.CHARS      0.5    0.5
N.ANCHR      0.5    0.6
F.ANCHR      0.5    0.5
SSL          0.5    0.57143
Final product: 0.017857

Table 7.7: Probability for feature set to be Original


Feature      NC     PROB
IP           0.5    0.5
DOTS         0.5    0.25
SLASHES      0.5    0.44
S.CHARS      0.5    0.5
N.ANCHR      0.5    0.4
F.ANCHR      0.5    0.5
SSL          0.5    0.4285
Final product: 0.00238

Table 7.8: Probability for feature set to be Phish

After calculating the final probabilities of the feature set being a valid or an invalid phish, we
compare the results; the class with the maximum probability is declared as the final prediction.
Hence we can say that the current record is not a phishing site, since 0.017857 > 0.002381.

7.4 Business Logic

After analyzing the project approach and studying the research papers for the given system,
we decided to follow the Waterfall Model for the Software Development Life Cycle in this
project development process. This approach consists of the following steps, which are followed in
order to achieve the goal:
1. Requirement Specification, resulting in the project requirement documentation.
2. Design, resulting in the software architecture.
3. Construction, resulting in actually writing code and developing the software.
4. Integration, resulting in combining all the modules of the project and finalizing the
development phase.
5. Testing and Debugging, giving defect-free software.
6. Installation, resulting in providing the software to the end user.
Thus the waterfall model maintains that one should move to the next phase only when the previous
phase has been verified and reviewed.


Chapter 8
Software Testing
8.1 Introduction

In data mining, data scientists use algorithms to identify previously unrecognized patterns and
trends hidden within vast amounts of structured and unstructured information. These patterns
are used to create predictive models that try to forecast future behaviour.
These models have many practical business applications: they help banks decide which customers to approve for loans, and marketers use them to determine which leads to target with
campaigns.
But extracting real meaning from data can be challenging. Bad data, flawed processes and the
misinterpretation of results can yield false positives and negatives, which can lead to inaccurate
conclusions and ill-advised business decisions.
Thorough testing therefore needs to be done before the software is handed over to the end user, as
the user may rely on the predictions made by the software to take major decisions for his
business requirements.


8.2 Test Cases

Figure 8.1: Test Cases for Project Main Modules

8.3 Snapshot of GUI

Figure 8.2: Main Form


Figure 8.3: Manual Entry Form

Figure 8.4: Manual Entry Form Empty


Figure 8.5: Prediction Model

Figure 8.6: Load Form


Chapter 9
Results
9.1 Accuracy of Result

Accuracy testing is done to measure up to what level the software may be trusted when
making decisions based on the predictions of the project.
We have taken a total of 100 sites to build the training model of the database, which is used
to predict the result using the classifier constructed in this project. The sites are taken as a
50-50 division: 50 sites are known phishing sites taken from http://www.phishtank.com/, a
site that stores phishing sites reported by users and makes its database available to other
software companies; the other 50 sites are known legitimate sites taken from official page links.
Then we gave the classifier 20 sites for which the results were not known to the system.
These sites were classified with 85% accuracy. Another 20 sites were then introduced to the
software as input, and they were also classified correctly with 83.33% accuracy.
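In terms of the usual definition, accuracy here is simply the number of correctly classified sites divided by the number of sites tested; for the first batch this is 17 / 20 = 85%.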


Figure 9.1: Accuracy Testing graph

9.2 Project Results

Figure 9.2: New site feature extraction in progress


Figure 9.3: Prediction Results of Site


Chapter 10
Deployment and Maintenance
10.1 Installation

Java Standard Edition JDK 7 Installation


The JDK 7u2 can be downloaded from this website:
http://www.oracle.com/technetwork/java/javase/downloads/index.html
Click the Download JDK button in the Java Platform, Standard Edition section. Make sure
you download the JDK and not the JRE.

Figure 10.1: JDK Step 1


Then, select the installation file for your platform: if your system is 32-bit, select jdk-7u2-windows-i586.exe; if your system is 64-bit, select jdk-7u2-windows-x64.exe. You can
find out what type of system you have by going to Start, Control Panel, System and looking at the
information listed under System type.

Figure 10.2: JDK Step 2

Once you have obtained the installation file, double-click it to begin the installation process.
This process will lead you through the following series of windows:
1. Setup: click Next.
2. Custom Setup: you do not need to make any changes to the default settings; just verify the
installation directory and click Next.
3. Progress: wait for the next window to open.
4. Destination Folder: you do not need to make any changes to the default setting; just verify the
installation directory and click Next.
5. Progress: wait for the process to end.
6. Complete: click Finish to complete the installation. A browser window may open asking you to register the
software; you may do so, or just close it without registering.
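If it helps to confirm the setup, the installation can be checked from a command prompt; both of the following commands should report the installed version:

java -version
javac -version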

Figure 10.3: JDK Step 3

The documentation can be downloaded from the same website as the JDK:
http://www.oracle.com/technetwork/java/javase/downloads/index.html
This time, scroll down, and click the Download button in the Java SE 7 Documentation
section of the Additional Resources box.


Figure 10.4: JDK Step 4


Chapter 11
Appendix A: Glossary
NB: Naive Bayes Classifier, a mathematical model from the Bayesian approach used to produce
results based on existing evidence.
CSV: Comma Separated Values, a term used in databases referring to a string which
includes all the table column entries of the current database record.
DOM Tree: Document Object Model tree, the internal representation used by browsers to represent a web page.
IDS: Intrusion Detection System, a system that works as a background process for the detection of web pages in real time.
JDK: Java Development Kit, the set of standard libraries provided by Java which are required to develop the basic building blocks of a Java project.
URL: Uniform Resource Locator, the name by which a website is known in computer
networks.
SSL: Secure Sockets Layer, a cryptographic protocol designed to provide communication
security over the Internet.


Chapter 12
Appendix B: Semester I Assignments
Assignment No. 1
Modules of the project development: Mathematical model for the project
This hybrid model adopts the divide-and-conquer strategy: we divide the problem
into two smaller problems and solve them individually in order to solve the given problem. Here clustering and the Bayesian classifier approach are the two different methods applied to the separate parts of the problem.
S = {DS, FS, FE, L, URL, KMEANSpred, NBmodel, NBpred}
where,
DS = the data set for the given model,
FE = the feature extraction procedure that produces FS,
KMEANSpred = the K-Means clustering prediction,
NBmodel = the Naive Bayes classifier training model,
NBpred = the Naive Bayes classifier prediction.

FS = FE(URL)
where,
FS = the feature set for the given model,
FE = the feature extraction procedure that produces FS,
URL = the URL given as input to the system.


fs1, fs2, fs3, ..., fsn ∈ DS
L1, L2, L3 ∈ L

fsi = FE(URL)
where,
fsi = the current feature set.

Li = KMEANSpred(DS, L)

KMEANSpred(fs1, fs2, fs3, ..., fsn) = arg min over S of the sum, for i = 1 to k, of the sum over fsj ∈ Li of ||fsj − μi||²
where,
μi = the mean of the points in cluster Si.

NBmodel = the Naive Bayes training model built from (DS, L)

fsi = FE(URL)
Li = NBpred(NBmodel, fsi)

NBpred(C | fs1, fs2, fs3, ..., fsn) = p(C) p(fs1, fs2, fs3, ..., fsn | C) / p(fs1, fs2, fs3, ..., fsn)
where,
C = the dependent class variable,
p = probability.

The K-Means clustering problem is NP-hard in general. The Naive Bayes classifier, on the other hand, can be evaluated in polynomial time, so that part of the given problem can be solved exactly.


Assignment No. 2
Algorithmic strategies used in project: Algorithms K-Means Clustering
and Naive Bayes Classifier
K-Means Clustering Simulation
The following is a sample data set and its evaluation based on K-Means clustering.
For this example we have used K = 3 as the number of clusters.

Sample data set for the simulation (columns: WebSite, IP, DOTS, SLASHES, SUS.CHAR, REMARK).

For the above data set, applying K-Means with K = 3, we form 3 clusters with the following
initial centroids.

Initial centroids for Cluster1, Cluster2 and Cluster3 (columns Centroid1 to Centroid4).


After the 1st iteration the centroids become: Cluster1: 2.5; Cluster2: 11.33; Cluster3: 8.5, 11, 15.5.

Data set with the corresponding cluster assignments (columns: WebSite, IP, DOTS, SLASHES, SUS.CHAR, REMARK).

Here we can see that feature set D is first included in the 1st cluster, and after re-arranging the
centroids, feature set D is included in the 3rd cluster.
After a few iterations the centroids become stable, as follows:


Final stable centroids: Cluster1: 1.67, 2.34; Cluster2: 0.75, 4.25, 6.25, 10.75; Cluster3: 8.5, 11, 15.5.

Naive Bayes Classifier Simulation


To evaluate the results using the Naive Bayes classifier we can use the following formula:

P(ai | vj) = (nc + mp) / (n + m)

Training data set for the simulation, with the cluster assigned to each site (columns: WebSite, IP, DOTS, SLASHES, SUS.CHAR, SSL, FrA, NlA, CLUSTER).

The feature set for which the cluster is to be decided has the columns WebSite, IP, DOTS, SLASHES, SUS.CHAR, SSL, FrA and NlA; its values are used in the probability calculations below.

Here we want to classify our new feature set, which is not listed above and is unique for the
given model. We need to calculate the probabilities
P(1|1), P(5|1), P(9|1), P(10|1), P(1|1), P(7|1), P(2|1)
P(1|3), P(5|3), P(9|3), P(10|3), P(1|3), P(7|3), P(2|3)
using
P(ai | vj) = (nc + mp) / (n + m)

For the calculation of P(J|3):

IP = 1: 0.5, 0.625
DOTS = 5: 0.5, 0.500
SLASHES = 9: 0.5, 0.611
SUS.CHAR = 10: 0.5, 0.563
SSL = 1: 0.5, 0.625
FrA = 7: 0.5, 0.611
NlA = 1: 0.5, 0.438

(the second value in each row is P(ai | vj))

For the calculation of P(J|1):

IP = 1: 0.5, 0.375
DOTS = 5: 0.5, 0.500
SLASHES = 9: 0.5, 0.389
SUS.CHAR = 10: 0.5, 0.438
SSL = 1: 0.5, 0.375
FrA = 7: 0.5, 0.389
NlA = 1: 0.5, 0.563

(the second value in each row is P(ai | vj))

P(1|1) × P(5|1) × P(9|1) × P(10|1) × P(1|1) × P(7|1) × P(2|1)
= 0.375 × 0.500 × 0.389 × 0.438 × 0.375 × 0.389 × 0.563
= 0.002617

P(1|3) × P(5|3) × P(9|3) × P(10|3) × P(1|3) × P(7|3) × P(2|3)
= 0.625 × 0.500 × 0.611 × 0.563 × 0.625 × 0.611 × 0.438
= 0.017950

Here 0.017950 > 0.002617, hence our feature set gets classified as VALID PHISH.


Assignment No. 3
Study of the various options available to implement the project modules, and why the given
options were chosen
Our project aims at detecting whether a web page is a valid phish or an invalid phish. We use
the K-Means algorithm to cluster the data set, and the machine-learning technique of the Naive
Bayes classifier to identify the most important features that differentiate a phishing site from a
legitimate site.
Why use Data Mining?
Two major reasons to use data mining :
1. The amount of data is very large and useful information is very little.
2. There is a need to extract useful information from the data and to interpret the data.
Faced with enormous volumes of data, human analysts with no special tools can no longer
make sense of it. Data mining can automate the process of finding relationships and patterns in raw data, and the results can either be utilized in an automated decision support system
or assessed by a human analyst. This is why data mining is used, especially in science and business areas that need to analyse large amounts of data to discover trends they could not
otherwise find.
If we know how to reveal the valuable knowledge hidden in raw data, data might be one of our
most valuable assets, while data mining is the tool to extract the diamonds of knowledge from our
historical data and predict outcomes of future situations.
Why Clustering?
Clustering analysis to mine the Web is quite different from traditional clustering due to the
inherent difference between Web usage data clustering and classic clustering. Therefore, there
is a need to develop specialized techniques for clustering analysis based on Web usage data.
Some approaches to clustering analysis have been developed for mining the Web access logs.

Why K-Means Clustering?


K-means is one of the simplest unsupervised learning algorithms that partition feature vectors
into k clusters so that the within group sum of squares is minimized.
Mean Shift clustering is able to produce clusters with shapes that depend upon the topology
of the data and does not need an a priori estimate of the number of clusters to find.
K-means, on the other hand, assumes isotropy of the clusters and needs to be told the
number of clusters to extract in advance.
Why Naive Bayes Classifier?
A Naive Bayes Classifier is a simple probabilistic classifier based on applying Bayes theorem
with strong (Naive) independence assumptions. A more descriptive term for the underlying
probability model would be independent feature model.
In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given
the class variable. For example, a fruit may be considered to be an apple if it is red, round,
and about 4 inches in diameter. Even if these features depend on each other or upon the existence
of the other features, a Naive Bayes classifier considers all of these properties to independently
contribute to the probability that this fruit is an apple.
Depending on the precise nature of the probability model, Naive Bayes classifiers can be
trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for Naive Bayes models uses the method of maximum likelihood; in other words,
one can work with the Naive Bayes model without believing in Bayesian probability or using
any Bayesian methods.
Technologies Used:

Platform: Java 2 Standard Edition
SDK: JDK 1.7 / NetBeans IDE 7.3
Database Connection: JDBC
Database: MySQL

Java 2 Standard Edition


Java Platform, Standard Edition or Java SE is a widely used platform for programming in the
Java language. It is the Java Platform used to deploy portable applications for general use. In
practical terms, Java SE consists of a virtual machine, which must be used to run Java programs,
together with a set of libraries (or packages) needed to allow the use of file systems, networks,
graphical interfaces, and so on, from within those programs.
There are two basic techniques involved in reflection:
Discovery - this involves taking an object or class and discovering the members, super classes,
implemented interfaces, and then possibly using the discovered elements.
Use by Name - involves starting with the symbolic name of an element and using the named
element.
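A minimal sketch of these two techniques, using an arbitrary class purely as an example, could look like this:

import java.lang.reflect.Method;

// Illustrative sketch of the two reflection techniques mentioned above.
public class ReflectionDemo {
    public static void main(String[] args) throws Exception {
        // Use by Name: obtain the class object from its symbolic name.
        Class<?> c = Class.forName("java.lang.String");

        // Discovery: list the methods the class declares.
        for (Method m : c.getDeclaredMethods()) {
            System.out.println(m.getName());
        }
    }
}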
NetBeans
NetBeans refers to both a platform framework for Java desktop applications, and an integrated development environment (IDE) for developing with Java, JavaScript, PHP, Python,
Ruby, Groovy, C, C++, Scala, Clojure, and others.
The NetBeans IDE is written in Java and can run anywhere a JVM is installed, including
Windows, Mac OS, Linux and Solaris. A JDK is required for Java development functionality,
but is not required for development in other programming languages.
The NetBeans platform allows applications to be developed from a set of modular software
components called modules.


Java Development Kit


The Java Development Kit (JDK) is a Sun Microsystems product aimed at Java developers. Since the introduction of Java, it has been by far the most widely used Java SDK. On 17
November 2006, Sun announced that it would be released under the GNU General Public License (GPL), thus making it free software. This happened in large part on 8 May 2007; Sun
contributed the source code to the OpenJDK.


Bibliography
[1] Wikipedia, the online encyclopedia. http://www.wikipedia.org/, online, 2013.
[2] Angelo P. E. Rosiello, Engin Kirda, Christopher Kruegel, and Fabrizio Ferrandi (Politecnico di Milano). A layout-similarity-based approach for detecting phishing pages.
[3] Haijun Zhang, Gang Liu, Tommy W. S. Chow (Senior Member, IEEE), and Wenyin Liu (Senior Member, IEEE). Textual and visual content-based anti-phishing: A Bayesian approach. IEEE Transactions, 2011.
[4] Anti-Phishing Organization. First fortnight report, 2013.
[5] Shi Na, Liu Xumin, and Guan Yong (College of Information Engineering, Capital Normal University CNU, Beijing, China). Research on k-means clustering algorithm: An improved k-means clustering algorithm. Pages 63-67, 2011.
[6] Kogent Learning Solutions. Java 6 and J2EE 1.5, Black Book.
[7] RSA Online Survey. Phishing kit: the same wolf, just a different sheep's clothing. February 2013.
[8] Xiaoqing Gu, Hongyuan Wang, and Tongguang Ni. An efficient approach to detecting phishing web.

