PROJECT REPORT
ON
"DETECTING PHISHING WEBSITE USING MACHINE LEARNING"
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
BY
M JAYA BHARATHI (1NH16CS054)
TEJA PRAVEEN KUMAR CH (1NH17CS427)
B PREETHI REDDY (1NH16CS21)
Under the guidance of
Ms. TINU NS
Assistant Professor,
Dept. of CSE, NHCE
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
It is hereby certified that the project work entitled “DETECTING PHISHING WEBSITE
USING MACHINE LEARNING” is a bonafide work carried out by M JAYA BHARATHI
(1NH16CS054), TEJA PRAVEEN KUMAR CH (1NH17CS427), B PREETHI REDDY (1NH16CS21)
in partial fulfilment for the award of Bachelor of Engineering in COMPUTER SCIENCE AND
ENGINEERING of the New Horizon College of Engineering during the year 2019-2020. It is
certified that all corrections/suggestions indicated for Internal Assessment have been
incorporated in the Report deposited in the departmental library. The project report has
been approved as it satisfies the academic requirements in respect of project work
prescribed for the said Degree.
External Viva
1.………………………………………….. ………………………………….
2.…………………………………………… …………………………………..
ABSTRACT
Criminals who want to obtain sensitive data first create unauthorized replicas of a real website and e-mail. The e-mail is created using the logos and slogans of a legitimate company. The ease of website creation is one of the reasons the Internet has grown so rapidly as a communication medium, but it also permits the abuse of trademarks, trade names, and other corporate identifiers on which consumers rely for authentication. Phishers then send the "spoofed" e-mails to as many people as possible in an attempt to lure them into the scheme. When these e-mails are opened, or when a link in the mail is clicked, the consumers are redirected to a spoofed website that appears to belong to the legitimate entity. We discuss the methods used for detection of phishing websites based on URL and page importance properties.
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the successful completion of any task
would be impossible without the mention of the people who made it possible, whose
constant guidance and encouragement crowned our efforts with success.
I would also like to thank Dr. B.Rajalakshmi, Professor and Head, Department of
Computer Science and Engineering, for her constant support.
I express my gratitude to Ms. Tinu NS, Assistant Professor, my project guide, for
constantly monitoring the development of the project and setting up precise deadlines.
Her valuable suggestions were the motivating factors in completing the work.
Finally, a note of thanks to the teaching and non-teaching staff of Dept of Computer
Science and Engineering, for their cooperation extended to me, and my friends, who
helped me directly or indirectly in the course of the project work.
ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES
1. INTRODUCTION
   1.1 DOMAIN INTRODUCTION
       MACHINE LEARNING
       DATA MINING
   1.2 PROBLEM DEFINITION
   1.3 OBJECTIVES
   1.4 SCOPE OF THE PROJECT
2. LITERATURE SURVEY
   2.1 MACHINE LEARNING
   2.2 EXISTING SYSTEM
   2.3 PROPOSED SYSTEM
   2.4 ADVANTAGES AND DISADVANTAGES
3. REQUIREMENT ANALYSIS
   3.1 FUNCTIONAL REQUIREMENTS
   3.2 NON-FUNCTIONAL REQUIREMENTS
       3.2.1 ACCESSIBILITY
       3.2.2 MAINTAINABILITY
       3.2.3 SCALABILITY
       3.2.4 PORTABILITY
   3.3 HARDWARE REQUIREMENTS
   3.4 SOFTWARE REQUIREMENTS
4. DESIGN
   4.1 DESIGN GOALS
   4.2 SYSTEM ARCHITECTURE
   4.3 UML DIAGRAMS
       4.3.1 USE CASE DIAGRAM
       4.3.2 ACTIVITY DIAGRAM
       4.3.3 DATA FLOW DIAGRAM
       4.3.4 SEQUENCE DIAGRAM
5. IMPLEMENTATION
   5.1 ALGORITHMS USED
   5.2 FUNCTIONS USED
6. TESTING
   6.1 TYPES OF TESTING
       6.1.1 UNIT TESTING
       6.1.2 INTEGRATION TESTING
       6.1.3 VALIDATION TESTING
       6.1.4 SYSTEM TESTING
   6.2 TESTING OF INITIALIZATION AND UI COMPONENTS
7. SNAPSHOTS
8. CONCLUSION
9. REFERENCES
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
Phishing costs Internet users billions of dollars per year. It refers to luring techniques used
by identity thieves to fish for personal information in a pond of unsuspecting Internet users.
Phishers use spoofed e-mails and phishing software to steal personal information and financial account details such as usernames and passwords. This report deals with methods for
detecting phishing Web sites by analyzing various features of benign and phishing URLs by
Machine learning techniques. We discuss the methods used for detection of phishing Web
sites based on lexical features, host properties and page importance properties. We
consider various machine learning algorithms for evaluation of the features in order to get a
better understanding of the structure of URLs that spread phishing. The fine-tuned
parameters are useful in selecting the apt machine learning algorithm for separating the
phishing sites from benign sites.
Criminals who want to obtain sensitive data first create unauthorized replicas of a real website and e-mail, usually of a financial institution or another company that deals with financial information. The e-mail is created using the logos and slogans of a legitimate company. The ease of website creation is one of the reasons the Internet has grown so rapidly as a communication medium; however, it also permits the abuse of trademarks, trade names, and other corporate identifiers upon which consumers have come to rely as mechanisms for authentication. Phishers then send the "spoofed" e-mails to as many people as possible in an attempt to lure them into the scheme. When these e-mails are opened, or when a link in the mail is clicked, the consumers are redirected to a spoofed website that appears to belong to the legitimate entity.
Advantages
• This system can be used by e-commerce and other websites in order to maintain a good customer relationship.
• Users can make online payments securely.
• The data mining algorithm used in this system provides better performance than traditional classification algorithms.
• With the help of this system, users can also purchase products online without any hesitation.
Disadvantages
To overcome this problem we use machine learning algorithms that help us identify phishing websites based on the features extracted from them. By using these algorithms we are able to keep the user's personal credentials and other sensitive data safe from intruders.
The main purpose of the project is to detect fake or phishing websites that try to gain access to sensitive data by imitating genuine websites and capturing users' personal credentials. We use machine learning algorithms to safeguard the sensitive data and to detect the phishing websites that attempt to gain access to it.
One of the challenges faced by our research was the unavailability of reliable training datasets. In fact, any researcher in the field faces this challenge. Although plenty of articles about predicting phishing websites using data mining techniques have been disseminated in recent years, no reliable training dataset has been published publicly, perhaps because there is no agreement in the literature on the definitive features that characterize phishing websites; hence it is difficult to shape a dataset that covers all possible features. In this report, we shed light on the important features that have proved to be sound and effective in predicting phishing websites. In addition, we propose some new features, experimentally assign new rules to some well-known features and update some other features.
Phishers can use a long URL to hide the doubtful part in the address bar. For example:
http://federmacedoadv.com.br/3f/aze/ab51e2e319e51502f416dbe46b773a5e/?cmd=_hom
e&dispatch=11004d58f5b74f8dc1e7c2e8dd4105e811004d58f5b74f8dc1e7c2e8dd4105
[email protected]
To ensure the accuracy of our study, we calculated the length of the URLs in the dataset and produced an average URL length. The results showed that if the length of the URL is greater than or equal to 54 characters, the URL is classified as phishing. By reviewing our dataset we found 1,220 URLs with a length of 54 characters or more, which constitute 48.8% of the total dataset size.
We have been able to update this feature rule by using a method based on frequency and
thus improving upon its accuracy.
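As an illustration, a minimal sketch of this length rule is given below. The 54-character threshold comes from the discussion above; the function name and the returned labels are only illustrative and not taken from the report.

def url_length_feature(url: str) -> str:
    """Classify a URL by its raw length, following the >= 54 character rule described above."""
    if len(url) >= 54:
        return "phishing"
    return "legitimate"

# Example: the legitimate URL quoted in the URL-shortening discussion below is well under 54 characters.
print(url_length_feature("http://portal.hud.ac.uk/"))  # -> legitimate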
URL shortening is a method on the “World Wide Web” in which a URL may be made
considerably smaller in length and still lead to the required webpage. This is accomplished
by means of an “HTTP Redirect” on a domain name that is short, which links to the webpage
that has a long URL. For example, the URL “http://portal.hud.ac.uk/” can be shortened to
“bit.ly/19DXSk4”.
Rule: IF { TinyURL (a URL shortening service) is used → Phishing
           Otherwise → Legitimate }
Using “@” symbol in the URL leads the browser to ignore everything preceding the “@”
symbol and the real address often follows the “@” symbol.
The existence of “//” within the URL path means that the user will be redirected to another
website. An example of such URL’s is:
“http://www.legitimate.com//http://www.phishing.com”. We examine the location where the “//” appears. We find that if the URL starts with “HTTP”, the “//” should appear in the sixth position. However, if the URL employs “HTTPS”, the “//” should appear in the seventh position.
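A minimal sketch of these two checks follows; the function names are illustrative, and the index comparison simply encodes the sixth/seventh position observation above (using 0-based indexing).

def has_at_symbol(url: str) -> bool:
    # The "@" feature: browsers ignore everything before "@", so its presence is suspicious.
    return "@" in url

def has_redirecting_double_slash(url: str) -> bool:
    # The "//" feature: the protocol's own "//" sits at 0-based index 5 ("http://")
    # or 6 ("https://"); any "//" found later suggests redirection to another site.
    return url.rfind("//") > 6

examples = [
    "http://www.legitimate.com//http://www.phishing.com",
    "http://portal.hud.ac.uk/",
]
for u in examples:
    print(u, has_at_symbol(u), has_redirecting_double_slash(u))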
The dash symbol is rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes
separated by (-) to the domain name so that users feel that they are dealing with a
legitimate webpage. For example http://www.Confirme-paypal.com/.
1.1.8. HTTPS (Hyper Text Transfer Protocol with Secure Sockets Layer)
The existence of HTTPS is very important in giving the impression of website legitimacy, but
this is clearly not enough. The authors in (Mohammad, Thabtah and McCluskey 2012)
(Mohammad, Thabtah and McCluskey 2013) suggest checking the certificate assigned with
HTTPS including the extent of the trust certificate issuer, and the certificate age. Certificate
Authorities that are consistently listed among the top trustworthy names include:
“GeoTrust, GoDaddy, Network Solutions, Thawte, Comodo, Doster and VeriSign”.
Furthermore, by testing out our datasets, we find that the minimum age of a reputable
certificate is two years.
Rule: IF { Uses HTTPS and issuer is trusted and age of certificate ≥ 1 year → Legitimate
           Uses HTTPS but issuer is not trusted → Suspicious
           Otherwise → Phishing }
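A minimal sketch of how this rule might be evaluated with Python's standard ssl module is shown below. The trusted-issuer list contains only the sample names quoted above, error handling is omitted, and a URL that does not use HTTPS at all would have to be caught separately and classified under "Otherwise".

import socket
import ssl
import time

# Sample trusted issuers quoted in the text above; a real deployment would use a fuller list.
TRUSTED_ISSUERS = {"GeoTrust", "GoDaddy", "Network Solutions", "Thawte",
                   "Comodo", "Doster", "VeriSign"}

def https_feature(hostname: str, port: int = 443) -> str:
    """Fetch the server certificate and apply the HTTPS rule sketched above."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # "issuer" is a tuple of relative distinguished names; flatten it into a dictionary.
    issuer = dict(rdn[0] for rdn in cert["issuer"]).get("organizationName", "")
    age_years = (time.time() - ssl.cert_time_to_seconds(cert["notBefore"])) / (365 * 24 * 3600)
    trusted = any(name in issuer for name in TRUSTED_ISSUERS)
    if trusted and age_years >= 1:
        return "legitimate"
    if not trusted:
        return "suspicious"
    return "phishing"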
Based on the fact that a phishing website lives for a short period of time, we believe that
trustworthy domains are regularly paid for several years in advance. In our dataset, we find
that the longest fraudulent domains have been used for one year only.
1.1.10. Favicon
A favicon is a graphic image (icon) associated with a specific webpage. Many existing user
agents such as graphical browsers and newsreaders show favicon as a visual reminder of the
website identity in the address bar. If the favicon is loaded from a domain other than that
shown in the address bar, then the webpage is likely to be considered a Phishing attempt.
PORT   Service   Meaning                                                        Preferred Status
445    SMB       Providing shared access to files, printers and serial ports    Close
1.1.12. The Existence of “HTTPS” Token in the Domain Part of the URL
Request URL examines whether the external objects contained within a webpage such as
images, videos and sounds are loaded from another domain. In legitimate webpages, the webpage address and most of the objects embedded within the webpage share the same domain.
An anchor is an element defined by the <a> tag. This feature is treated exactly as “Request
URL”. However, for this feature we examine:
1. If the <a> tags and the website have different domain names. This is similar to the Request URL feature.
2. If the anchor does not link to any webpage, for example:
A. <a href=“#”>
B. <a href=“#content”>
C. <a href=“#skip”>
Given that our investigation covers all angles likely to be used in the webpage source code,
we find that it is common for legitimate websites to use <Meta> tags to offer metadata
about the HTML document; <Script> tags to create a client side script; and <Link> tags to
retrieve other web resources. It is expected that these tags are linked to the same domain
of the webpage.
Rule: IF { % of links in “<Meta>”, “<Script>” and “<Link>” tags < 17% → Legitimate
           % of links in “<Meta>”, “<Script>” and “<Link>” tags ≥ 17% and ≤ 81% → Suspicious
           Otherwise → Phishing }
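A minimal sketch of how this percentage could be computed with Python's built-in html.parser is given below; the class and function names are illustrative, and the 17%/81% thresholds are simply the ones stated in the rule above.

from html.parser import HTMLParser
from urllib.parse import urlparse

class TagLinkCollector(HTMLParser):
    """Collect URLs referenced by <meta>, <script> and <link> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("meta", "script", "link"):
            for name, value in attrs:
                if name in ("src", "href", "content") and value and "//" in value:
                    self.links.append(value)

def links_in_tags_feature(html: str, page_domain: str) -> str:
    parser = TagLinkCollector()
    parser.feed(html)
    if not parser.links:
        return "legitimate"
    external = sum(1 for link in parser.links
                   if urlparse(link).netloc and urlparse(link).netloc != page_domain)
    pct = 100.0 * external / len(parser.links)
    if pct < 17:
        return "legitimate"
    if pct <= 81:
        return "suspicious"
    return "phishing"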
SFHs that contain an empty string or “about:blank” are considered doubtful because an
action should be taken upon the submitted information. In addition, if the domain name in
SFHs is different from the domain name of the webpage, this reveals that the webpage is
suspicious because the submitted information is rarely handled by external domains.
A web form allows a user to submit personal information that is directed to a server for processing. A phisher might redirect the user's information to his own e-mail instead. To that end, a server-side scripting language might be used, such as the “mail()” function in PHP. One more client-side mechanism that might be used for this purpose is the “mailto:” function.
This feature can be extracted from WHOIS database. For a legitimate website, identity is
typically part of its URL.
The fine line that distinguishes phishing websites from legitimate ones is how many times a website has been redirected. In our dataset, we find that legitimate websites have been redirected at most once. On the other hand, phishing websites containing this feature have been redirected at least four times.
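One way to count those redirects is sketched below using the third-party requests library; the thresholds simply encode the one-redirect and four-redirect observations above, and the function name is illustrative.

import requests

def redirect_feature(url: str) -> str:
    """Follow a URL and classify it by the number of HTTP redirects encountered."""
    response = requests.get(url, timeout=10, allow_redirects=True)
    hops = len(response.history)   # each entry in .history is one redirect that was followed
    if hops <= 1:
        return "legitimate"
    if hops < 4:
        return "suspicious"
    return "phishing"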
Phishers may use JavaScript to show a fake URL in the status bar to users. To extract this
feature, we must dig-out the webpage source code, particularly the “onMouseOver” event,
and check if it makes any changes on the status bar.
Phishers use JavaScript to disable the right-click function, so that users cannot view and save
the webpage source code. This feature is treated exactly as “Using onMouseOver to hide
the Link”. Nonetheless, for this feature, we will search for event “event.button==2” in the
webpage source code and check if the right click is disabled.
It is unusual to find a legitimate website asking users to submit their personal information through a pop-up window. On the other hand, this feature has been used in some legitimate websites, where its main goal is to warn users about fraudulent activities or to broadcast an announcement.
IFrame is an HTML tag used to display an additional webpage into one that is currently
shown. Phishers can make use of the “iframe” tag and make it invisible i.e. without frame
borders. In this regard, phishers make use of the “frameBorder” attribute which causes the
browser to render a visual delineation.
This feature can be extracted from WHOIS database (Whois 2005). Most phishing websites
live for a short period of time. By reviewing our dataset, we find that the minimum age of
the legitimate domain is 6 months.
For phishing websites, either the claimed identity is not recognized by the WHOIS database (Whois 2005) or no records are found for the hostname (Pan and Ding 2006). If the DNS
record is empty or not found then the website is classified as “Phishing”, otherwise it is
classified as “Legitimate”.
This feature measures the popularity of the website by determining the number of visitors
and the number of pages they visit. However, since phishing websites live for a short period
of time, they may not be recognized by the Alexa database (Alexa the Web Information
Company., 1996). By reviewing our dataset, we find that in worst scenarios, legitimate
websites ranked among the top 100,000. Furthermore, if the domain has no traffic or is not
recognized by the Alexa database, it is classified as “Phishing”. Otherwise, it is classified as
“Suspicious”.
1.4.4. PageRank
PageRank is a value ranging from “0” to “1”. PageRank aims to measure how important a
webpage is on the Internet. The greater the PageRank value the more important the
webpage. In our datasets, we find that about 95% of phishing webpages have no PageRank.
Moreover, we find that the remaining 5% of phishing webpages may reach a PageRank
value up to “0.2”.
This feature examines whether a website is in Google’s index or not. When a site is indexed
by Google, it is displayed on search results (Webmaster resources, 2014). Usually, phishing
webpages are merely accessible for a short period and as a result, many phishing webpages
may not be found on the Google index.
The number of links pointing to the webpage indicates its legitimacy level, even if some links
are of the same domain (Dean, 2014). In our datasets and due to its short life span, we find
that 98% of phishing dataset items have no links pointing to them. On the other hand,
legitimate websites have at least 2 external links pointing to them.
Rule: IF { Number of links pointing to the webpage = 0 → Phishing
           Number of links pointing to the webpage > 0 and ≤ 2 → Suspicious
           Otherwise → Legitimate }
Phishing is one of the most common and most dangerous attacks among cybercrimes. The
aim of these attacks is to steal the information used by individuals and organizations to
conduct transactions. Phishing websites are fake websites that contain various hints among
their contents and web browser-based information. When a user opens a fake webpage
and enters the username and protected password, the credentials of the user are acquired
by the attacker and can be used for malicious purposes. Phishing websites look very similar in appearance to their corresponding legitimate websites in order to attract a large number of Internet users.
GOALS
1. Use of features extracted from websites which explain characteristics of a website for
phishing detection
2. Classification of website based on such features, using Extreme Learning Machines (ELM)
which is an advanced neural network leveraging generalization capabilities given by
randomization of weights
METHODOLOGY
The study uses a dataset containing approximately 11,000 samples, each described by 30 features extracted from websites, taken from the UC Irvine Machine Learning Repository. For classification, a neural network named the Extreme Learning Machine (ELM) will be used. The Extreme Learning Machine is a feed-forward artificial neural network (ANN) model with a single hidden layer. In the ELM learning process, unlike a conventional ANN that updates its parameters with gradient-based methods, the input weights are randomly selected while the output weights are analytically calculated. The given dataset will be divided into three parts, training, validation and test data, by a three-way split within the K-fold method, and model selection and performance assessment will be performed simultaneously. In this way the performance of the model can be measured in a reliable manner.
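To make the ELM idea above concrete, the following is a minimal sketch of a single-hidden-layer ELM with random input weights and analytically computed output weights. The data here is random and merely stands in for the 30-feature dataset; labels are assumed to be -1/+1, and the function names are illustrative.

import numpy as np

def train_elm(X, y, hidden_units=100, seed=0):
    """Train a minimal ELM: random input weights, output weights solved by pseudo-inverse."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], hidden_units))   # random input weights (never trained)
    b = rng.normal(size=hidden_units)                 # random hidden-layer biases
    H = np.tanh(X @ W + b)                            # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                      # analytic output weights
    return W, b, beta

def predict_elm(X, W, b, beta):
    return np.sign(np.tanh(X @ W + b) @ beta)

# Illustrative usage on random stand-in data shaped like the 30-feature phishing dataset.
X = np.random.rand(1000, 30)
y = np.sign(np.random.rand(1000) - 0.5)
W, b, beta = train_elm(X, y)
print("training accuracy:", np.mean(predict_elm(X, W, b, beta) == y))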
CHAPTER 2
LITERATURE SURVEY
The purpose or goal behind phishing is to steal data, money or personal information through a fake website. The best strategy for avoiding contact with a phishing website is to detect malicious URLs in real time. Phishing websites can be determined on the basis of their domains; they are usually related to a URL which needs to be registered (low-level domain and upper-level domain, path, query). The recently acquired status of intra-URL relationships is used to evaluate a URL, using distinctive properties extracted from the words that compose it based on query data from search engines such as Google and Yahoo. These properties are then fed to machine-learning-based classification for the identification of phishing URLs from a real dataset. That work focuses on real-time detection of phishing URLs using PhishStorm; a few relationships between the registered domain and the rest of the URL are considered, and intra-URL relatedness is also considered, which helps to distinguish between phishing and non-phishing URLs. For detecting a phishing website, blacklists of known URLs have been used, but this technique is unproductive because the lifetime of phishing websites is very short. Phishing can be defined as the act of deceiving an organization's customers into communicating their confidential information in an inappropriate manner. It can also be defined as intentionally using harmful tools such as spam to automatically target victims and their private information. As many of the weaknesses in SMTP are exploited as vectors for phishing, there is great availability of channels for malicious message delivery.
A novel classification approach has also been proposed that uses heuristic-based feature extraction. In this approach, the extracted features are classified into different categories such as URL obfuscation features and hyperlink-based features. The proposed technique gives 92.5% accuracy. However, this model depends purely on the quality and quantity of the training set and on broken-link feature extraction.
Machine learning
Machine learning (ML) is a category of algorithms that enables software applications to become progressively more accurate in predicting outcomes without being explicitly programmed. The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output, while updating outputs as new data becomes available.
The procedures involved in machine learning are similar to those of data mining and predictive modelling. Both require searching through data to look for patterns and adjusting program actions accordingly. Many people are familiar with machine learning from shopping online and being served advertisements related to their purchase. This happens because recommendation engines use machine learning to personalise online advertisement delivery in near real time. Beyond personalised marketing, other common machine learning use cases include fraud detection, spam filtering, network security threat detection, predictive maintenance and building news feeds.
Programmers prefer python because of the increased productivity it provides. Since there is
no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs
is easy. A bug or bad input will never cause a segmentation fault. Instead, when the
interpreter discovers an error, it raises an exception. When the program doesn't catch the
exception, the interpreter prints a stack trace. A source level debugger allows inspection of
local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping
through the code a line at a time, and so on. On the other hand, often the quickest way to debug a program is to add a few print statements to the source. The fast edit-test-debug cycle makes this simple approach very effective.
The Jupyter Notebook App is a server-client application that allows editing and running notebook documents via a web browser. The Jupyter Notebook App can be executed on a local desktop requiring no internet access, or it can be installed on a remote server and accessed through the web. In addition to displaying, editing and running notebook documents, the Jupyter Notebook App has a "Dashboard" (Notebook Dashboard), a control panel showing local files and allowing notebook documents to be opened or their kernels shut down.
• A notebook kernel is a "computational engine" that executes the code contained in a notebook document. The IPython kernel, referenced in this guide, executes Python code. Kernels for many other languages exist (official kernels).
• When you open a notebook document, the associated kernel is automatically launched. When the notebook is executed (either cell-by-cell or with the menu Cell -> Run All), the kernel performs the computation and produces the results. Depending on the type of computation, the kernel may consume significant CPU and RAM. Note that the RAM is not released until the kernel is shut down. The Notebook Dashboard is the part that is shown first when you launch the Jupyter Notebook App; it is mainly used to open notebook documents and to manage the running kernels (view and shutdown).
• The Notebook Dashboard has other features, such as a file manager, for navigating folders and renaming/deleting files.
2.2.2 MATPLOTLIB
People are highly visual creatures: we understand things better when we see them visualised. However, the step of presenting analyses, results or insights can be a bottleneck: you might not know where to start, or you might already have the right format in mind, but then questions like "Is this the right way to visualise the insights that I want to convey to my audience?" will certainly have crossed your mind.
When you are working with the Python plotting library Matplotlib, the first step in answering the above questions is to build up knowledge of topics like the anatomy of a Matplotlib plot: what is a subplot? What are the Axes? What exactly is a figure?
Plot creation can raise questions about which module you need to import (pylab or pyplot?), how exactly you should go about initialising the figure and the Axes of your plot, how to use Matplotlib in Jupyter notebooks, and so on.
Saving and showing your plots covers showing the plot, saving one or more figures to, for example, PDF files, clearing the axes, clearing the figure or closing the plot, and so on.
Finally, Matplotlib can be customised in two ways: with style sheets and with the rc settings.
Once everything is set for you to start plotting your data, it is time to explore some plotting routines. You will regularly come across functions like plot() and scatter(), which either draw points with lines or markers connecting them, or draw unconnected points that are scaled or coloured. But, as you have already seen in the example of the first section, you should not forget to pass the data that you want these functions to use.
These functions are only the bare basics; you will need some other functions to make sure your plots look good.
2.2.3 NUMPY
NumPy is, just like SciPy, Scikit-Learn and Pandas, one of the packages that you cannot miss when you are learning data science, mainly because this library provides an array data structure that holds several benefits over Python lists, such as being more compact, offering faster access when reading and writing items, and being more convenient and more efficient.
NumPy arrays are somewhat like Python lists, but still very much different at the same time. For those who are new to the topic, let us clarify what it exactly is and what it is good for. As the name gives away, a NumPy array is the central data structure of the numpy library. The library's name is short for "Numeric Python" or "Numerical Python".
In other words, NumPy is the core Python library for scientific computing. It contains a collection of tools and techniques that can be used to solve mathematical models of problems in science and engineering on a computer. One of these tools is a high-performance multidimensional array object, which is a powerful data structure for efficient computation on arrays and matrices. To work with these arrays, there is a vast number of high-level mathematical functions that operate on these arrays and matrices. Once you have set up your environment, it is time for the real work. In fact, you have already tried out some things with arrays in the DataCamp Light snippets above, but you have not really had any hands-on practice with them, since you first needed to install NumPy on your own PC. Now that you have done this, it is time to see what you need to do in order to run the above code snippets on your own.
Some exercises have been included below so that you can already practise how it is done before you start on your own. To create a numpy array, you can simply use the np.array() function. All you need to do is pass a list to it, and optionally you can also specify the data type of the data. If you want to know more about the possible data types that you can choose, consider looking at DataCamp's NumPy cheat sheet. There is no need to memorise these NumPy data types if you are a new user, but you do need to know and care what data you are dealing with. The data types are there for when you need more control over how your data is stored in memory and on disk. Especially in cases where you are working with extensive data, it is good that you know how to control the storage type.
Remember that, in order to work with the np.array() function, you need to make sure that the numpy library is present in your environment. The NumPy library follows an import convention: when you import this library, you have to make sure that you import it as np. By doing this, you ensure that other Pythonistas understand your code more easily.
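As a quick illustration of the conventions described above (the values and the dtype choice are arbitrary):

import numpy as np   # the conventional import alias

# Create an array from a list, optionally forcing the data type.
features = np.array([1, -1, 0, 1, 1], dtype=np.int8)

print(features.dtype)   # int8
print(features.shape)   # (5,)
print(features * 2)     # element-wise arithmetic: [ 2 -2  0  2  2]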
2.2.4 PANDAS
2.2.5 ANACONDA
Anaconda is similar to pyenv, venv and miniconda; it is designed to achieve a Python environment that is 100% reproducible on another machine, independent of whatever other versions of a project's dependencies are available. It is a bit like Docker, but restricted to the Python ecosystem. Jupyter is an excellent presentation tool for analytical work, where you can present code in "blocks", combined with rich text descriptions between blocks, the inclusion of formatted output from the blocks, and graphs generated in a well-designed manner by another block's code. Jupyter is extraordinarily good in analytical work for ensuring reproducibility in someone's research, so that anybody can come back many months later and visually understand what someone tried to explain, and see exactly which code led to which visualisation and conclusion. Often in analytical work you will end up with huge numbers of half-finished notebooks explaining proof-of-concept ideas, most of which will not lead anywhere at first. Some of these presentations may, months or even years later, provide a foundation to build from for a new problem.
2.2.6 PYTHON
Debugging Python programs is simple: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter finds an error, it raises an exception. When the program does not catch the exception, the interpreter prints a stack trace. A source-level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on. The debugger is written in Python itself, testifying to Python's introspective power. On the other hand, often the quickest way to debug a program is to add a few print statements to the source: the fast edit-test-debug cycle makes this simple approach effective.
Python is an object-oriented, high-level programming language with integrated dynamic semantics, primarily for web and application development. It is extremely attractive in the field of Rapid Application Development since it offers dynamic typing and dynamic binding options.
Python is relatively simple, so it is easy to learn, since it requires a unique syntax that focuses on readability. Developers can read and translate Python code much more easily than other languages. In turn, this reduces the cost of program maintenance and development, since it allows teams to work collaboratively without significant language and experience barriers.
Moreover, Python supports the use of modules and packages, which means that programs can be designed in a modular style and code can be reused across a variety of projects. Once you have developed a module or package you need, it can be scaled for use in other projects, and it is easy to import or export these modules.
One of the most promising advantages of Python is that both the standard library and the interpreter are available free of charge, in both binary and source form. There is no exclusivity either, as Python and all the necessary tools are available on every major platform. Thus, it is a tempting option for developers who do not want to worry about paying high development costs.
CHAPTER 3
REQUIREMENT ANALYSIS
A functional requirement defines a function of the software system and describes the behaviour of the system when presented with specific inputs or conditions, which may include calculations, data manipulation and processing, and other specific functionality.
• Our system should be able to load the phishing website dataset and preprocess the data.
• It should be able to analyze the extracted URL and webpage features.
• It should be able to group data based on hidden patterns.
• It should be able to assign a label (phishing or legitimate) based on its data groups.
• It should be able to split the data into a training set and a test set.
• It should be able to train the model using the training set.
• It must validate the trained model using the test set.
• It should be able to display the trained model's accuracy.
• It should be able to accurately classify unseen websites as phishing or legitimate (a minimal sketch of this flow is given after this list).
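The sketch below illustrates the intended flow under stated assumptions: the CSV file name phishing_dataset.csv and the label column "Result" are hypothetical, and scikit-learn's random forest is used purely as an example classifier, not as the report's chosen model.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load and preprocess (the file name is hypothetical; replace it with the actual dataset path).
data = pd.read_csv("phishing_dataset.csv").fillna(0)
X = data.drop(columns=["Result"])
y = data["Result"]

# Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train, validate and report accuracy.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))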
Nonfunctional requirements describe how a system must behave and establish constraints on its functionality. This type of requirement is also known as the system's quality attributes. Attributes such as performance, security, usability and compatibility are not features of the system; they are required characteristics. They are emergent properties that arise from the whole arrangement, and hence we cannot write a particular line of code to implement them. Any attributes required by the customer are described by the specification. We must include only those requirements that are appropriate for our project.
Some Non-Functional Requirements are as follows:
• Reliability
• Maintainability
• Performance
• Portability
• Scalability
• Flexibility
Some of the quality attributes are as follows:
3.2.1 ACCESSIBILITY:
Accessibility is a general term used to describe the degree to which a product, device, service, or environment is usable by as many people as possible.
In our project, users who have registered with the system can access it to store and retrieve their data with the help of a secret key sent to their e-mail IDs.
3.2.2 MAINTAINABILITY:
In software engineering, maintainability is the ease with which a software product can be modified in order to:
• Correct defects
New functionality can be added to the project based on user requirements simply by adding the appropriate files to the existing project. Since the programming is very straightforward, it is easier to find and correct defects and to make changes in the project.
3.2.3 SCALABILITY:
The system is capable of handling an increase in total throughput under an increased load when resources (typically hardware) are added.
The system can operate normally under conditions such as low bandwidth and a large number of users.
3.2.4 PORTABILITY:
Portability is one of the key concepts of high-level programming. Portability is the ability of the software code base to be reused in a new environment rather than creating new code when moving software from one environment to another. The project can be executed under different operating conditions provided it meets its minimum configuration. Only system files and dependent assemblies would need to be configured in such a case.
The functional requirements for a system describe what the system should do.
Those requirements depend on the type of software being developed and the expected users of the software. They are statements of the services the system should provide, how the system should react to particular inputs, and how the system should behave in particular situations.
The four primary functions of systems engineering are all performed by the end users, i.e. the customers. The operational requirements are given by:
• Mission profile or scenario: It is a map which describes the procedures and leads us to the final goal/objective. The goal of the proposed system is to classify websites as phishing or legitimate using features extracted from previously collected URL data.
• Performance: It gives the system parameters needed to reach our goal. The parameter for the proposed system is an accurate prediction, which is compared with the existing system.
• Utilization environments: It enlists the different permutations and combinations in which the system can be reused in many other applications, which gives better prediction as well as a new approach to prediction techniques.
• Life cycle: It discusses the life span of the system. As the amount of data increases, the number of iterations increases, which gives more accurate output.
➢ Organizational Requirement
• Process Standards: To make sure the system is a quality product, IEEE standards
have been used during system development.
• Design Methods: Design is an important step, on which all other steps in the engineering process are based.
• It takes the project from a theoretical idea to an actual product and gives us the basis of our solution. Because all the steps after designing are based on the design itself, this step affects the quality of the product and is a major factor in how the testing and maintenance of a project take place and how successful they are. Following the design to the 'T' is of utmost importance.
➢ Product Requirement
• Portability: As the system is Python based, it will run on a platform which is
supported by ANACONDA.
• Correctness: The system has been put through rigorous testing after it has followed
strict guidelines and rules. The testing has validated the data.
• Ease of Use: The user interface allows the user to interact with the system at a very
comfortable level with no hassles.
• Modularity: The many different modules in the system are neatly defined for ease of
use and to make the product as flexible as possible with different permutations and
combinations.
• Robustness: During the development of the system, special care was taken to make sure that the end results are optimized to the highest level and that the results are relevant and validated. The Python language used for development itself provides robustness to the system and thus makes it highly unlikely to fail.
User Requirement
• The user should be able to have a User Interface window with visual graphics.
• The user should be able to configure all the parameters through a neat GUI.
Resource Requirement
Anaconda 3-5.0.3: Anaconda is a free and open source distribution of the Python and R programming languages for data science, machine learning and other applications. The Anaconda distribution comes with over 1,400 packages as well as the conda package and virtual environment manager and a desktop graphical user interface called Anaconda Navigator. Packages can be built using the conda build command. Anaconda Navigator allows users to manage conda packages without the command line. The following applications are available by default in Navigator: JupyterLab, Jupyter Notebook, Spyder, Orange, RStudio, etc. conda is an open source, cross-platform, language-agnostic package manager and environment management system. It installs, runs and updates packages and their dependencies.
1. Jupyter Notebook: The code is fully written in the Python language using Jupyter Notebook. It is a spin-off project from the IPython project, which used to have an IPython Notebook project of its own. It uses the IPython kernel, which allows you to write your programs in Python. We can install Jupyter Notebook using the command pip install jupyter. It has several menus that you can use to interact with your notebook; they are listed as:
• File
• Edit
• View
• Insert
• Cell
• Kernel, Widgets, Help
The Kernel menu is for working with the kernel that is running in the background. Here we can restart the kernel, reconnect to it, shut it down, or even change which kernel your notebook is using.
The following are the hardware requirements for the proposed system:
The following are the software requirements for the proposed system:
• OS : Windows 10
• Platform : Jupyter Notebook
• Language : Python
• IDE/tool : Anaconda 3-5.0.3
CHAPTER 4
DESIGN
Technologies Used
■ PYTHON
■ MACHINE LEARNING
1.1. Open CV
OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products. Being a BSD-licensed product, OpenCV makes it easy for businesses to utilize and modify the code. The library has more than 2,500 optimized algorithms, which includes a comprehensive set of both classic and state-of-the-art computer vision and machine learning algorithms. These algorithms can be used to detect and recognize faces, identify objects, classify human actions in videos, track camera movements, track moving objects, extract 3D models of objects, produce 3D point clouds from stereo cameras, stitch images together to produce a high-resolution image of an entire scene, find similar images from an image database, remove red eyes from images taken using flash, follow eye movements, recognize scenery and establish markers to overlay it with augmented reality, and so on.
OpenCV has more than 47 thousand people in its user community and an estimated number of downloads exceeding 18 million. The library is used extensively in companies, research groups and by governmental bodies. It has C++, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS.
Tensorflow:
Neural Networks:
The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs. Such systems learn to perform tasks by considering examples, generally without being programmed with any task-specific rules. For example, in image recognition, they might learn to identify images by analysing example images and using the results to recognise them in other images.
They do this with no prior knowledge about cats, for instance, that they have fur, tails, whiskers and cat-like faces. Instead, they automatically generate identifying characteristics from the learning material that they process.
As of 2011, the state of the art in deep learning feedforward networks alternated between convolutional layers and max-pooling layers, topped by several fully or sparsely connected layers followed by a final classification layer. Learning is normally done without unsupervised pre-training. In the convolutional layer, there are filters that are convolved with the input. Each filter is equivalent to a weights vector that must be trained. Such supervised deep learning methods were the first to achieve human-competitive performance on certain tasks.
➢ Class diagram
➢ Sequence Diagram:
CHAPTER 5
IMPLEMENTATION
Implementation is the process of defining how the system should be built, ensuring that it is
operational and meets quality standards. It is a systematic and structured approach for
effectively integrating a software-based service or component into the requirements of end
users.
Anaconda3 includes Python 3.6. Anaconda Navigator is a desktop graphical user interface
(GUI) included in Anaconda distribution that allows users to launch applications and manage
anaconda packages, environments and channels without using command-line commands.
Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository,
install them in an environment, run the packages and update them. It is available for
Windows, macOS and Linux. The following are the system requirements:
▪ License: Free use and redistribution under the terms of the Anaconda End User License
Agreement.
▪ Operating system: Windows Vista or newer, 64-bit macOS 10.10+, or Linux, including Ubuntu, RedHat, CentOS 6+, and others. Windows XP is supported on Anaconda versions 2.2 and earlier.
▪ System architecture: 64-bit x86, 32-bit x86 with Windows or Linux, Power8 or Power9.
Minimum 3 GB disk space to download and install.
After the installation of Anaconda Navigator, we were taught Python programming. We covered various Python libraries such as NumPy, including an introduction to NumPy, NumPy arrays, notes on array indexing, NumPy operations and a few exercises to recall them. We were taught how to use Pandas: how to work with data frames, finding and replacing missing data with useful information, group-by functions, merging, joining and concatenating, and other data input and output operations. We were also taught Python for data visualization, that is, Matplotlib and Seaborn. Matplotlib is a plotting library for Python and its numerical extension NumPy. It makes use of general-purpose GUI toolkits and provides an object-oriented API for embedding plots. In Seaborn we were taught distribution plots, categorical plots, matrix plots, grids, regression plots, etc.
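As a small illustration of the kind of exploratory steps described above (the CSV file name and the column names used here are hypothetical):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the feature table (hypothetical file name) into a DataFrame.
df = pd.read_csv("phishing_dataset.csv")

# Replace missing values and inspect the class balance with a group-by.
df = df.fillna(0)
print(df.groupby("Result").size())

# A categorical plot of one (hypothetical) feature against the phishing/legitimate label.
sns.countplot(data=df, x="having_At_Symbol", hue="Result")
plt.show()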
CHAPTER 6
TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, sub-assemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of test, and each test type addresses a specific testing requirement.
TYPES OF TESTS
Verification is a Quality control process that is used to evaluate whether or not a product,
service, or system complies with regulations, specifications, or conditions imposed at the
start of a development phase. Verification can be in development, scale-up, or production.
This is often an internal process.
As a rule, system testing takes, as its input, all of the "integrated" software components that
have successfully passed integration testing and also the software system itself integrated
with any applicable hardware system(s).
System testing is a more limited type of testing; it seeks to detect defects both within the
"inter-assemblages" and also within the system as a whole.
System testing tests not only the design, but also the behavior and even the believed
expectations of the customer. It is also intended to test up to and beyond the bounds
defined in the software/hardware requirements specification(s).
CHAPTER 7
SNAPSHOTS
CHAPTER 8
CONCLUSION
It is well known that a good anti-phishing tool should predict phishing attacks within a good timescale. We believe that the availability of a good anti-phishing tool at a good timescale is also important to increase the scope of predicting phishing websites. This tool should be improved constantly through continuous retraining. In fact, the availability of fresh and up-to-date training datasets, which may be acquired using our own tool [30, 32], will help us to retrain our model constantly and handle any changes in the features that are influential in determining the website class. Although a neural network demonstrates its ability to solve a wide variety of classification problems, the process of finding the optimal structure is quite difficult, and in many cases this structure is determined by trial and error. Our model takes care of this problem by automating the process of structuring a neural network scheme; hence, if we build an anti-phishing model and for any reason need to update it, our model will facilitate this process, since it automates the structuring process and requires hardly any user-defined parameters.
CHAPTER 9
REFERENCES
• APWG, Aaron G, Manning R (2013) APWG phishing reports. APWG, 1 February 2013.
• Kaspersky Lab (2013) Spam in January 2012: love, politics and sport. [Online]. Available: http://www.kaspersky.com/about/news/spam/2012
• Seogod (2011) Black Hat SEO. SEO tools. [Online]. Available: http://www.seobesttools.com/black-hat-seo/. Accessed 8 Jan 2013
• Dhamija R, Tygar JD, Hearst M (2006) Why phishing works. In: Proceedings of the SIGCHI conference on human factors in computing systems, Montréal, Canada
• Cranor LF (2008) A framework for reasoning about the human in the loop. In: UPSEC'08 Proceedings of the 1st conference on usability, psychology, and security, Berkeley, CA, USA
• Xiang G, Hong J, Rose CP, Cranor L (2011) CANTINA+: a feature-rich machine learning framework for detecting phishing web sites. ACM Trans Inf Syst Secur 14(2):1–28