Nisha Internship3
CHAPTER 1
COMPANY PROFILE
1.1 About The Company:
Pantech E Learning is a subsidiary of the Pantech Group. Pantech is a think tank with a
keen interest in sharing technical knowledge and expertise with the student and staff community
through On-Campus Courses, In-House Courses, Faculty Development Programs, Hands-on
Sessions, Workshops and Seminars. Domains of expertise: Python Programming, Arduino
Programming, Embedded Systems, PCB Design, Android App Development, IoT Applications,
VHDL Programming, Verilog Programming, Core Java and Advanced Java Programming,
Simulink Design using MATLAB, Power Electronics using MATLAB, IoT using Arduino,
Robotics, MATLAB Programming, Cloud Computing using Java, Data Mining and its
Programming, Machine Learning using MATLAB, Blockchain, OpenCV and Image Processing,
IoT using Raspberry Pi and the Cloud Interface, Deep Learning using Python, NS2 Programming,
Machine Learning using Python, Computer Vision and Machine Learning, Power Systems using
MATLAB, Image Processing using MATLAB, Renewable Energy using MATLAB, Virtual
Reality, Electric Vehicle Design, Augmented Reality and Computer Vision (CV) Robots.
At PANTECH, life is all about delivering the highest quality to customers. Reduced costs,
quicker time-to-market, huge value adds and enhanced productivity are our way of life. The
very cornerstone of our success has been our unerring commitment to ensuring that QA
processes and procedures are followed with unwavering dedication.
1.2 Mission:
"To help our customers in achieving their time-to-market objective by being their
dependable technology partners and delivering our commitments on time and every time
with quality."
1.3 Vision:
"Pantech e Learning solutions will become the market leader in embedded system
development, firmware & manpower outsourcing, focusing on specific application areas in
Communications, Automotive and Consumer Electronics."
1.4 Services:
Direct Staffing
Contract Staffing
Outsourcing
Corporate Training
Contract Staffing Services provides skilled resources to clients to meet their requirements
for defined periods and to overcome the lengthy selection process by absorbing consultants
based on their performance during the deputation.
Corporate Training: KNOWX is a BRIDGE between the IT/Electronics industry and the
student community. We have a broad range of course offerings to equip you and your
organization with the right skills, at precisely the right time and at the right cost.
CHAPTER 2
INTRODUCTION
In just five years, the number of smartphone users has risen from 1 billion to 3.8 billion. China,
India, and the United States are the top three mobile phone users. SMS, or Short Message
Service, is a text messaging service that has been around for a while. It is also possible to use
SMS without having access to the internet. As a result, SMS is supported by both smartphones
and basic mobile phones. Although smartphones come with a variety of text messaging apps
such as WhatsApp, these services are only available via the internet, whereas SMS can be sent
at any point in time. As a result, SMS traffic is steadily expanding. Spammers send unsolicited
communications, bombarding people with large quantities of messages for the benefit of their
organisations or for personal gain. Such messages are called spam. Despite the availability of
numerous SMS spam filtering solutions, more sophisticated strategies are still required to deal
with this problem. Spam messages on mobile devices can be irritating. SMS spam and email
spam are two different types of spam; in this report the terms "spam" and "SMS spam" refer to
the same thing. Spammers use these spam messages to promote their services or businesses,
and users may sometimes suffer financial losses as a result of them.
Machine Learning is a technology that allows machines to learn from past data and anticipate
future data. Machine learning and deep learning can now be used to tackle most real-world
problems in a variety of fields, including health, security, and market analysis. Machine
learning approaches include supervised learning, unsupervised learning, semi-supervised
learning, and others. The dataset in supervised learning has output labels, whereas datasets
without labels are dealt with in unsupervised learning. We used a UCI dataset with labels and
employed multiple supervised learning techniques to detect SMS spam.
For example, if a data set contained the characteristics and purchasing behavior of shoppers at
grocery stores, the unsupervised learning task might be to segment the shoppers into groups
or "clusters" that exhibit similar behaviors. Such learning methods might find that college
students, parents with young children, and older adults have characteristic shopping behaviors
that are similar within each group but dissimilar from the others. This is an unsupervised
learning task because there is no right or wrong answer about how many clusters can be found
in the data, which people belong in which cluster, or even how to describe each cluster. With
this understanding of machine learning, the authors use it to generate rules that identify, based
on the input, whether a message is SPAM or HAM. For processing the document content the
authors use TF-IDF vectorization and generate a word cloud from it. How TF-IDF
vectorization works is briefly described next.
TF-IDF stands for Term Frequency-Inverse Document Frequency; it is used in machine learning
and text mining as a weighting factor for identifying word features. The weight increases the
more times a term occurs in a document, but this is offset by the number of documents in the
data set in which the word appears. This offset removes the importance of very common words
like "the" or "a" that appear across almost all documents. TF-IDF is used very often in relevance
ranking and scoring, and to remove stop words from an ML model, since stop words do not give
any relevant information about a particular document type or class. Figure 3 represents the
TF-IDF mathematical formula.
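Since Figure 3 is not reproduced here, the standard form of the weighting and a short scikit-learn
sketch are given below; the toy corpus is purely illustrative. In the common smoothed
formulation, tf-idf(t, d) = tf(t, d) x (log((1 + n) / (1 + df(t))) + 1), where n is the number of
documents and df(t) is the number of documents containing term t.

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only
corpus = [
    "win a free prize now",           # spam-like message
    "are we meeting for lunch today"  # ham-like message
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)   # rows = documents, columns = terms

print(vectorizer.get_feature_names_out())
print(X.toarray())                     # TF-IDF weight of each term in each document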
CHAPTER 3
SYSTEM ARCHITECTURE
It is not a new idea to use machine learning and deep learning techniques to detect spam.
Previously, ML approaches were used to classify SMS spam by a number of researchers.
Nilam Nur Amir Sjarif et al. combined the TF-IDF technique with a random forest classifier
and reached 97.5 percent accuracy. The TF-IDF approach uses two measurements, Term
Frequency and Inverse Document Frequency, to quantify the words in a document. For
email spam filtering, A. Lakshmanarao and colleagues used four machine learning
classifiers: Decision Trees, Naive Bayes, Logistic Regression, and Random Forest, with the
random forest classifier achieving 97 percent accuracy. Pavas Navaney et al. evaluated
several machine learning techniques and obtained 97.4 percent accuracy with support vector
machines. Luo GuangJun et al. used a variety of shallow machine learning techniques and
found that the logistic regression classifier had a high accuracy rate. For the detection of
SMS spam, Tian Xia et al. presented a Hidden Markov Model. Their model used information
about the order of words, thereby solving issues with low term frequency caused by
obfuscated spellings such as m0ney or mo.ney for the word money; instead of whole words,
their model focused on spam detection at the character level, such as letters in English.
M. Nivaashini et al. applied a deep neural network for SMS spam detection and
achieved an accuracy of 98%. They also compared DNN performance with NB, Random
Forest, SVM, and KNN. Mehul Gupta et al. compared various machine learning spam
detection models with deep learning models and showed that the deep learning models
achieved a higher accuracy rate in SMS spam detection. Gomatham Sai Sravya et al.
compared various machine learning algorithms for SMS spam detection and achieved the
best accuracy with the Naive Bayes classification model. M. Rubin Julis et al. applied
various machine learning classifiers and achieved an accuracy of 97% with a support vector
machine. K. Sree Ram Murthy et al. proposed Recurrent Neural Networks for SMS spam
detection and achieved a good accuracy rate. S. Sheikh proposed SMS spam detection using
feature selection and a Neural Network model and achieved a good accuracy rate. Adem
Tekerek et al. applied various machine learning classification models for SMS spam
detection and achieved an accuracy of 97% with a support vector machine classifier.
Hardware:
2. RAM - 4GB
Software:
1. Anaconda Navigator
2. ML - NLP
Disadvantages:
● High time complexity
● Prediction accuracy was not high
The prediction method employs three machine learning algorithms: support vector machine
and naive Bayes (multinomial and Gaussian NB).
Steps for the Proposed Approach:
Step 1: Initialize the dataset containing the training data.
Step 2: Select all the rows and column 1 from the dataset as "x", the independent variable.
Step 3: Select all the rows and column 2 from the dataset as "y", the dependent variable.
Step 4: Fit the chosen model (e.g., SVM or naive Bayes) to the dataset.
Step 5: Predict the new value.
Step 6: Visualize the result and check the accuracy.
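A minimal sketch of these steps with scikit-learn, using the SVM and multinomial naive Bayes
classifiers named above; the file name sms_spam.csv and the column names are illustrative
assumptions, not the actual dataset layout.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 1: initialize the dataset (file and column names are illustrative)
df = pd.read_csv("sms_spam.csv")          # e.g. columns: "message", "label"

# Steps 2-3: independent variable x and dependent variable y
x = df["message"]
y = (df["label"] == "spam").astype(int)   # 1 = spam, 0 = ham

# Vectorize the text and hold out a test split
x_vec = TfidfVectorizer(stop_words="english").fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(
    x_vec, y, test_size=0.2, random_state=42)

# Step 4: fit the classifiers
for name, model in [("SVM", SVC()), ("MultinomialNB", MultinomialNB())]:
    model.fit(x_train, y_train)
    # Steps 5-6: predict new values and check the accuracy
    print(name, accuracy_score(y_test, model.predict(x_test)))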
1.Data Ingestion:
Data ingestion is the transportation of data from assorted sources to a storage medium
where it can be accessed, used and analyzed by an organization. The destination is typically
a data warehouse, data mart, database, or a document store. Sources may be almost
anything – including SaaS data, in-house apps, databases, spreadsheets, or even
information scraped from the internet. The data ingestion layer is the backbone of any
analytics architecture. Downstream reporting and analytics systems rely on consistent and
accessible data. There are different ways of ingesting data, and the design of a particular
data ingestion layer can be based on various models or architectures.
2. Data Preprocessing:
Data preprocessing is a data mining technique used to transform raw data into a useful and
efficient format. The data here goes through two stages. 1. Data Cleaning: it is very important
for data to be error free and free of unwanted data, so the data is cleansed before performing
the next steps. Cleansing of data includes checking for missing values, duplicate records and
invalid formatting, and removing them. 2. Data Transformation: the datasets are transformed
mathematically into appropriate forms suitable for the data mining process. This allows us to
understand the data more keenly by arranging the hundreds of records in an orderly way.
Transformation includes Normalization, Standardization and Attribute Selection.
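A minimal sketch of these two stages with Pandas and scikit-learn; the column names and
values are illustrative.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative raw data; a real pipeline would receive this from the ingestion layer
df = pd.DataFrame({"length": [50, 120, None, 120, 33],
                   "label": ["ham", "spam", "ham", "spam", "ham"]})

# 1. Data cleaning: remove duplicate records and rows with missing values
df = df.drop_duplicates().dropna()

# 2. Data transformation: normalization and standardization of a numeric attribute
df["length_norm"] = MinMaxScaler().fit_transform(df[["length"]])
df["length_std"] = StandardScaler().fit_transform(df[["length"]])
print(df)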
From the trend graphs it is observed that US exports depend on and follow the areas planted and
harvested annually. A sudden drop in China's exports in the year 2009 is observed, while its
imports kept increasing over the last 12 years regardless of the global yield, which implies that
China has a huge and lasting demand for the soybean crop but now relies on the global supply
to meet its needs.
Evaluation Metric
Modelling of data involves creating a data model for the data to be stored in the database.
The process of modeling means training a Machine Learning Algorithm to predict the
labels from the features, tuning it for business need, and validating it on the hold out data.
The output from modeling is a trained model that can be used for inference, making
predictions on new data points. Modeling is independent of the previous steps in the
Machine Learning process and has standardized inputs which means we can alter the
prediction problem without needing to rewrite all our code. If the business requirements
change, we can generate new label times, build corresponding features, and input them into
the model. Models are implemented and later evaluated for their accuracy using the root mean
square error.
Regressors used for prediction:
● Random Forest Regressor – ensemble regression method
● Support Vector Regression (SVR) – uses kernel functions
● Linear Regression – regression method
● Decision Tree Regression – regression method
Evaluation metrics:
● R2 score – the proportion of the variance in the test set tuples that is explained by the regression
● Root Mean Square Error (RMSE) – the square root of the mean squared difference between predicted and actual values, evaluated for each model (see the sketch below)
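A minimal sketch of fitting these regressors and reporting both metrics with scikit-learn;
synthetic data stands in for the real features and labels.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic regression data for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {"RandomForest": RandomForestRegressor(random_state=0),
          "SVR": SVR(kernel="rbf"),
          "Linear": LinearRegression(),
          "DecisionTree": DecisionTreeRegressor(random_state=0)}

for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))  # root mean square error
    print(name, "R2:", round(r2_score(y_test, pred), 3), "RMSE:", round(rmse, 2))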
UML DIAGRAMS
The Unified Modeling Language (UML) is used to specify, visualize, modify, construct
and document the artifacts of an object-oriented software intensive system under
development. UML offers a standard way to visualize a system's architectural blueprints,
including elements such as:
● actors
● business processes
● (logical) components
● activities
● programming language statements
● database schemas, and
● Reusable software components.
UML combines best techniques from data modeling (entity-relationship diagrams), business
modeling (workflows), object modeling, and component modeling. It can be used with all
processes, throughout the software development life cycle, and across different
implementation technologies. UML has synthesized the notations of the Booch method, the
Object-modeling technique (OMT) and Object-oriented software engineering (OOSE) by
fusing them into a single, common and widely usable modeling language. UML aims to be
a standard modeling language which can model concurrent and distributed systems.
Sequence Diagram:
Sequence Diagrams represent the objects participating in the interaction horizontally and
time vertically. A Use Case is a kind of behavioral classifier that represents a declaration of
an offered behavior. Each use case specifies some behavior, possibly including variants that
the subject can perform in collaboration with one or more actors. Use cases define the offered
behavior of the subject without reference to its internal structure. These behaviors, involving
interactions between the actor and the subject, may result in changes to the state of the
subject and communications with its environment. A use case can include possible variations
of its basic behavior, including exceptional behavior and error handling.
Activity Diagrams:
Activity diagrams are graphical representations of Workflows of stepwise activities and
actions with support for choice, iteration and concurrency. In the Unified Modeling
Language, activity diagrams can be used to describe the business and operational step-by-
step workflows of components in a system. An activity diagram shows the overall flow of
control.
Class diagram
The class diagram is the main building block of object-oriented modeling. It is used for
general conceptual modeling of the system of the application, and for detailed modeling
translating the models into programming code. Class diagrams can also be used for data
modeling. The classes in a class diagram represent both the main elements and interactions in
the application, and the classes to be programmed. In the diagram, classes are represented
with boxes that contain three compartments:
1. The top compartment contains the name of the class. It is printed in bold and centered,
and the first letter is capitalized.
2. The middle compartment contains the attributes of the class. They are left-aligned and
the first letter is lowercase.
3. The bottom compartment contains the operations the class can execute. They are also
left-aligned and the first letter is lowercase.
CLASS DIAGRAM
CHAPTER 4
DOMAIN SPECIFICATION
Machine Learning is a system that can learn from example through self-improvement and
without being explicitly coded by programmers. The breakthrough comes with the idea
that a machine can singularly learn from the data (i.e., example) to produce accurate
results.
Machine learning combines data with statistical tools to predict an output. This output is
then used by corporations to make actionable insights. Machine learning is closely related
to data mining and Bayesian predictive modeling. The machine receives data as input and
uses an algorithm to formulate answers.
A typical machine learning task is to provide a recommendation. For those who have a
Netflix account, all recommendations of movies or series are based on the user's historical
data. Tech companies are using unsupervised learning to improve the user experience with
personalizing recommendations.
Machine learning is also used for a variety of tasks like fraud detection, predictive
maintenance, portfolio optimization, automating tasks and so on.
Machine Learning
For instance, suppose the machine is trying to understand the relationship between the wage of
an individual and the likelihood of going to a fancy restaurant. If the machine finds a positive
relationship between wage and going to a high-end restaurant, that relationship is the model.
Inferring: when the model is built, it is possible to test how powerful it is on never-seen-before
data. The new data are transformed into a feature vector, go through the model and give a
prediction. This is the beautiful part of machine learning: there is no need to update the rules
or retrain the model. You can use the previously trained model to make inferences on new data.
The life of Machine Learning programs is straightforward and can be summarized in the
following points:
1. Define a question
2. Collect data
3. Visualize data
4. Train algorithm
5. Test the Algorithm
6. Collect feedback
7. Refine the algorithm
8. Loop 4-7 until the results are satisfying
9. Use the model to make a prediction
Once the algorithm gets good at drawing the right conclusions, it applies that knowledge to
new sets of data.
Machine learning can be grouped into two broad learning tasks: Supervised and
Unsupervised.
There are many algorithms within each category.
In supervised learning, an algorithm uses training data and feedback from humans to learn the
relationship of given inputs to a given output. For instance, a practitioner can use marketing
expense and weather forecasts as input data to predict the sales of cans.
You can use supervised learning when the output data is known. The algorithm will then predict
new data.
There are two categories of supervised learning:
● Classification task
● Regression Task
Imagine you want to predict the gender of a customer for a commercial. You will start
gathering data on the height, weight, job, salary, purchasing basket, etc. from your
customer database. You know the gender of each of your customers, it can only be male
or female. The objective of the classifier will be to assign a probability of being a male or
a female (i.e., label) based on the information (i.e., features you have collected). When the
model learns how to recognize male or female, you can use new data to make a prediction.
For instance, you just got new information from an unknown customer, and you want to
know whether it is a male or female. If the classifier predicts male = 70%, it means the algorithm
is 70% sure that this customer is a male and 30% sure it is a female. The label can have two or
more classes. The above example has only two classes, but if a classifier needs to predict objects,
it can have dozens of classes (e.g., glass, table, shoes, etc.; each object represents a class).
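A minimal sketch of such a probabilistic classifier with scikit-learn; the feature values and
labels are invented purely for illustration.

from sklearn.linear_model import LogisticRegression

# Toy data: [height_cm, weight_kg, salary_k]; labels 1 = male, 0 = female (illustrative)
X = [[180, 85, 40], [165, 60, 45], [175, 78, 52], [160, 55, 38],
     [185, 90, 60], [158, 52, 41]]
y = [1, 0, 1, 0, 1, 0]

clf = LogisticRegression().fit(X, y)

# Probability of each class for a new, unseen customer
new_customer = [[172, 70, 48]]
proba = clf.predict_proba(new_customer)[0]
print(f"female: {proba[0]:.0%}, male: {proba[1]:.0%}")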
When the output is a continuous value, the task is a regression. For instance, a financial
analyst may need to forecast the value of a stock based on a range of features like equity,
previous stock performance, and macroeconomic indices. The system will be trained to
estimate the price of the stocks with the lowest possible error.
In unsupervised learning, an algorithm explores input data without being given an explicit
output variable (e.g., explores customer demographic data to identify patterns)
● Machine learning assists humans with their day-to-day tasks, personally or commercially,
and is used in different ways such as virtual assistants, data analysis and software solutions.
Automation:
● Machine learning can work entirely autonomously in some fields without the need for any
human intervention, for example robots performing the essential process steps in
manufacturing plants.
Finance Industry
● Machine learning is growing in popularity in the finance industry. Banks mainly use ML to
find patterns inside the data but also to prevent fraud.
Government organization
● The government makes use of ML to manage public safety and utilities. Take the example
of China with its massive face recognition program; the government applies artificial
intelligence on a very large scale.
Healthcare industry
● Healthcare was one of the first industries to use machine learning for image detection.
Marketing
● Machine learning is widely used in marketing, for example to estimate the value of a
customer. With the boom of data, the marketing department increasingly relies on machine
learning.
When combining big data and machine learning, better forecasting techniques have been
implemented (an improvement of 20 to 30 % over traditional forecasting tools). In terms
of sales, it means an increase of 2 to 3 % due to the potential reduction in inventory costs.
Deep learning is a computer software that mimics the network of neurons in the brain. It is a
subset of machine learning and is called deep learning because it makes use of deep neural
networks. The machine uses different layers to learn from the data. The depth of the model
is represented by the number of layers in the model. Deep learning is the new state of the art
in terms of AI. In deep learning, the learning phase is done through a neural network.
Reinforcement learning methods include:
● Q-learning
● Deep Q Network (DQN)
● State-Action-Reward-State-Action (SARSA)
● Deep Deterministic Policy Gradient (DDPG)
AI in Finance:
The financial technology sector has already started using AI to save time, reduce costs, and add
value. Deep learning is changing the lending industry by using more robust credit scoring. Credit
decision-makers can use AI for robust credit lending applications to achieve faster, more accurate
risk assessment, using machine intelligence to factor in the character and capacity of applicants.
Underwrite.ai is a fintech company providing an AI solution for credit decision-makers; it uses
AI to detect which applicants are more likely to pay back a loan. Their approach radically
outperforms traditional methods.
AI in HR:
Under Armour, a sportswear company, revolutionized hiring and modernized the candidate
experience with the help of AI. In fact, Under Armour reduced the hiring time for its retail stores
by 35%. Under Armour faced growing popularity back in 2012; they had, on average, 30,000
resumes a month. Reading all of those applications and starting the screening and interview
process was taking too long. The lengthy process to get people hired and on-boarded impacted
Under Armour's ability to have their retail stores fully staffed, ramped and ready to operate.
At that time, Under Armour had all of the 'must have' HR technology in place, such as
transactional solutions for sourcing, applying, tracking and onboarding, but those tools weren't
useful enough. Under Armour chose HireVue, an AI provider for HR solutions, for both
on-demand and live interviews. The results were impressive; they managed to decrease the time
to fill by 35%, and in addition they hired higher-quality staff.
AI in Marketing:
Artificial Intelligence
With machine learning, you need less data to train the algorithm than with deep learning. Deep
learning requires an extensive and diverse set of data to identify the underlying structure. Besides,
machine learning provides a faster-trained model; the most advanced deep learning architectures
can take days to a week to train. The advantage of deep learning over machine learning is that it
is highly accurate: you do not need to understand which features are the best representation of
the data, because the neural network learns how to select critical features. In machine learning,
you need to choose for yourself what features to include in the model.
CHAPTER 5
TensorFlow
The most famous deep learning library in the world is Google's TensorFlow. Google uses
machine learning in all of its products to improve the search engine, translation, image
captioning or recommendations.
To give a concrete example, Google users can experience a faster and more refined search
with AI. If the user types a keyword in the search bar, Google provides a recommendation
about what could be the next word.
Google wants to use machine learning to take advantage of their massive datasets to give
users the best experience. Three different groups use machine learning:
● Researchers
● Data scientists
● Programmers.
They can all use the same toolset to collaborate with each other and improve their efficiency.
Google does not just have any data; they have the world's most massive computer, so
TensorFlow was built to scale. TensorFlow is a library developed by the Google Brain Team
to accelerate machine learning and deep neural network research.
It was built to run on multiple CPUs or GPUs and even mobile operating systems, and it has
several wrappers in several languages like Python, C++ or Java.
It is called TensorFlow because it takes input as a multi-dimensional array, also known as a
tensor. You can construct a sort of flowchart of operations (called a graph) that you want to
perform on that input. The input goes in at one end, flows through this system of multiple
operations and comes out the other end as output. This is why it is called TensorFlow: the
tensor goes in, flows through a list of operations, and then comes out the other side.
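A minimal sketch of this idea with the TensorFlow 2 Python API (eager execution); the tensors
and operations are arbitrary illustrations.

import tensorflow as tf

# Input tensor (a multi-dimensional array)
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# A small "flow" of operations applied to the input
w = tf.constant([[0.5], [0.25]])
y = tf.matmul(x, w)    # matrix multiplication
z = tf.nn.relu(y - 1)  # element-wise non-linearity

print(z.numpy())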
You can train it on multiple machines then you can run it on a different machine, once you
have the trained model.
The model can be trained and used on GPUs as well as CPUs. GPUs were initially designed
for video games. In late 2010, Stanford researchers found that GPUs were also very good at
matrix operations and algebra, which makes them very fast for these kinds of calculations.
Deep learning relies on a lot of matrix multiplication. TensorFlow is very fast at computing
matrix multiplication because it is written in C++. Although it is implemented in C++,
TensorFlow can be accessed and controlled by other languages, mainly Python.
Finally, a significant feature of TensorFlow is TensorBoard. TensorBoard enables you to
monitor graphically and visually what TensorFlow is doing.
● Python is Interactive: You can actually sit at a Python prompt and interact with
the interpreter directly to write your programs.
Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.
Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-
68, Small Talk, Unix shell, and other scripting languages.
Python is copyrighted. Its source code is available under an open-source license administered
by the Python Software Foundation, which is GPL-compatible.
Python is now maintained by a core development team at the institute, although Guido van
Rossum still holds a vital role in directing its progress.
Easy-to-learn: Python has few keywords, simple structure, and a clearly defined syntax.
This allows the student to pick up the language quickly.
Easy-to-read: Python code is more clearly defined and visible to the eyes.
A broad standard library: Python's bulk of the library is very portable and cross-platform
compatible on UNIX, Windows, and Macintosh.
Interactive Mode: Python has support for an interactive mode which allows interactive
testing and debugging of snippets of code.
Portable: Python can run on a wide variety of hardware platforms and has the same interface
on all platforms.
Extendable: You can add low-level modules to the Python interpreter. These modules enable
programmers to add to or customize their tools to be more efficient.
GUI Programming: Python supports GUI applications that can be created and ported to
many system calls, libraries, and windows systems, such as Windows MFC, Macintosh, and
the X Window system of Unix.
Scalable: Python provides a better structure and support for large programs than shell
scripting.
Apart from the above-mentioned features, Python has a big list of good features, few are
listed below:
● It provides very high-level dynamic data types and supports dynamic type checking.
● It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
Python is available on a wide variety of platforms including Linux and Mac OS X. Let's
understand how to set up our Python environment.
● Pandas
● Numpy
● Sklearn
● seaborn
● matplotlib
● Importing Datasets
CHAPTER 6
Pandas
Pandas is quite a game changer when it comes to analyzing data with Python, and it is one
of the most preferred and widely used tools in data munging/wrangling, if not THE most
used one. Pandas is an open source library. What's cool about Pandas is that it takes data
(like a CSV or TSV file, or a SQL database) and creates a Python object with rows and
columns called a data frame that looks very similar to a table in statistical software (think
Excel or SPSS, for example; people who are familiar with R would see similarities to R too).
This is so much easier to work with in comparison to working with lists and/or dictionaries
through for loops or list comprehensions.
● Open a local file using Pandas, usually a CSV file, but it could also be a delimited text
file (like TSV), Excel, etc.
● Open a remote file or database like a CSV or a JSON on a website through a URL, or
read from a SQL table/database
There are different commands to each of these options, but when you open a file, they
would look like this:
● pd.readfiletype()
As mentioned before, there are different file types Pandas can work with, so you would
replace "filetype" with the actual file type (like CSV). You would give the path, filename,
etc. inside the parentheses. Inside the parentheses you can also pass different arguments
that relate to how to open the file. There are numerous arguments, and in order to know all
of them you would have to read the documentation (for example, the documentation for
pd.read_csv() contains all the arguments you can pass to this Pandas command).
In order to convert a certain Python object (dictionary, list, etc.) into a data frame, the basic
command is:
● pd.DataFrame()
Inside the parentheses you would specify the object(s) you're creating the data frame from.
This command also has different arguments.
You can also save a data frame you’re working with/on to different kinds of files (like CSV,
Excel, JSON and SQL tables). The general code for that is:
● df.to_filetype(filename)
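A short sketch of these read, construct and save commands; the file names are illustrative.

import pandas as pd

# Read a CSV file into a DataFrame (path is illustrative)
df = pd.read_csv("spam.csv", encoding="latin-1")

# Build a DataFrame from a plain Python object
toy = pd.DataFrame({"label": ["ham", "spam"],
                    "message": ["see you at 5", "WIN a free prize now!"]})

# Save a DataFrame back to different file types
toy.to_csv("toy.csv", index=False)
toy.to_json("toy.json")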
Now that you’ve loaded your data, it’s time to take a look. How does the data frame look?
Running the name of the data frame would give you the entire table, but you can also get
the first n rows with df.head(n) or the last n rows with df.tail(n). df.shape would give you
the number of rows and columns. df.info() would give you the index, datatype and
memory information. The command s.value_counts(dropna=False) would allow you to
view unique values and counts for a series (like a column or a few columns). A very useful
command is df.describe(), which outputs summary statistics for numerical columns. It is
also possible to get statistics on the entire data frame or a series (a column, etc.).
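A small sketch of these inspection commands on an illustrative data frame:

import pandas as pd

df = pd.DataFrame({"label": ["ham", "spam", "ham", "ham"],
                   "length": [20, 120, 35, 20]})

print(df.head(2))                              # first n rows
print(df.shape)                                # (rows, columns)
df.info()                                      # index, dtypes, memory usage
print(df["label"].value_counts(dropna=False))  # unique values and counts
print(df.describe())                           # summary statistics for numeric columns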
Selection of Data
One of the things that is so much easier in Pandas is selecting the data you want, in
comparison to selecting a value from a list or a dictionary. You can select a column (df[col]),
which returns the column with label col as a Series, or a few columns (df[[col1, col2]]),
which returns the columns as a new DataFrame. You can select by position (s.iloc[0]) or by
index (s.loc['index_one']). In order to select the first row you can use df.iloc[0,:], and in
order to select the first element of the first column you would run df.iloc[0,0]. These can
also be used in different combinations, so I hope this gives you an idea of the different
selection and indexing you can perform in Pandas.
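A short sketch of these selection commands; the data frame is illustrative.

import pandas as pd

df = pd.DataFrame({"label": ["ham", "spam", "ham"],
                   "message": ["ok", "free prize", "see you"]},
                  index=["a", "b", "c"])

col = df["label"]               # single column as a Series
sub = df[["label", "message"]]  # several columns as a new DataFrame
row = df.iloc[0, :]             # first row, selected by position
cell = df.iloc[0, 0]            # first element of the first column
by_index = df.loc["b"]          # row selected by index label
print(cell, by_index["label"])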
Data Cleaning
Data cleaning is a very important step in data analysis. For example, we always check for
missing values in the data by running df.isnull(), which checks for null values and returns
a boolean array (true for missing values and false for non-missing values). In order to get a
count of null/missing values, run df.isnull().sum(); df.notnull() is the opposite of df.isnull().
After you get a list of missing values you can get rid of them, or drop them, by using
df.dropna() to drop the rows or df.dropna(axis=1) to drop the columns. A different approach
would be to fill the missing values with other values by using df.fillna(x), which fills the
missing values with x (you can put there whatever you want), or s.fillna(s.mean()) to replace
all null values with the mean (the mean can be replaced with almost any function from the
statistics section). It is sometimes necessary to replace values with different values. For
example, s.replace(1,'one') would replace all values equal to 1 with 'one'. It is possible to do
it for multiple values: s.replace([1,3],['one','three']) would replace all 1 with 'one' and 3 with
'three'. You can also rename specific columns: df.rename(columns={'old_name': 'new_name'}),
or use df.set_index('column_one') to change the index of the data frame.
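A short sketch of these cleaning commands on an illustrative data frame:

import pandas as pd

df = pd.DataFrame({"old_name": [1, None, 3, 1],
                   "score": [0.5, 0.9, None, 0.7]})

print(df.isnull().sum())                            # count of missing values per column
filled = df.fillna(df.mean(numeric_only=True))      # fill NaNs with column means
replaced = filled.replace([1, 3], ["one", "three"])  # replace specific values
renamed = replaced.rename(columns={"old_name": "new_name"})
print(renamed)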
Join/Combine
The last set of basic Pandas commands is for joining or combining data frames or
rows/columns. The three commands are:
● df1.append(df2) — add the rows of df2 to the end of df1 (columns should be identical;
in recent Pandas versions pd.concat([df1, df2]) is preferred)
● pd.concat([df1, df2], axis=1) — add the columns of df2 to the end of df1 (rows should
be identical)
● df1.join(df2, on=col1, how='inner') — SQL-style join of the columns in df1 with the
columns of df2 where the rows for col1 have identical values; how can be equal to one of
'left', 'right', 'outer', 'inner'
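A short sketch of combining illustrative data frames (pd.concat is used for stacking rows, since
append is deprecated in recent Pandas versions):

import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "msg": ["hi", "free prize"]})
df2 = pd.DataFrame({"id": [3], "msg": ["see you"]})
labels = pd.DataFrame({"label": ["ham", "spam"]}, index=[0, 1])

rows = pd.concat([df1, df2], ignore_index=True)  # stack rows
cols = pd.concat([df1, labels], axis=1)          # add columns side by side
joined = df1.join(labels)                        # join on the index
print(rows, cols, joined, sep="\n\n")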
CHAPTER 7
Numpy
Numpy is one such powerful library for array processing along with a large collection of
high-level mathematical functions to operate on these arrays. These functions fall into
categories like Linear Algebra, Trigonometry, Statistics, Matrix manipulation, etc.
NumPy's main object is a homogeneous multidimensional array. Unlike Python's array class,
which only handles one-dimensional arrays, NumPy's ndarray class can handle
multidimensional arrays and provides more functionality. NumPy's dimensions are known
as axes. For example, the array below has 2 dimensions or 2 axes, namely rows and columns.
Sometimes the number of dimensions is also known as the rank of that particular array or matrix.
NumPy is imported using the following command. Note here np is the convention followed
for the alias so that we don't need to write numpy every time. NumPy is the basic library for
scientific computations in Python and this article illustrates some of its most frequently used
functions. Understanding NumPy is the first major step in the journey of machine learning
and deep learning.
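The import below follows the np alias convention mentioned above; the small array illustrates
the two axes.

import numpy as np  # np is the conventional alias

# A 2-dimensional array (2 axes: rows and columns)
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.ndim)    # number of dimensions (axes): 2
print(a.shape)   # (2, 3)
print(a.mean())  # one of many built-in mathematical functions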
7.3 Sklearn
In Python, the scikit-learn library has pre-built functionality under sklearn.preprocessing.
The next step is feature extraction. Feature extraction is an attribute reduction process: unlike
feature selection, which ranks the existing attributes according to their predictive significance,
feature extraction actually transforms the attributes, and the transformed attributes, or features,
are linear combinations of the original attributes. Finally, our models are trained using a
classifier algorithm. We use the nltk.classify module of the Natural Language Toolkit library
in Python together with the labelled dataset gathered; the rest of our labelled data is used to
evaluate the models. Some machine learning algorithms were used to classify the pre-processed
data. The chosen classifiers were Decision Tree, Support Vector Machine and Random Forest.
These algorithms are very popular in text classification tasks.
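A minimal sketch of this training step with scikit-learn classifiers (Decision Tree, SVM and
Random Forest); the tiny corpus and labels are illustrative stand-ins for the real labelled dataset.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Tiny labelled corpus for illustration only
texts = ["win cash now", "meeting at noon", "free prize claim", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels)

for clf in (DecisionTreeClassifier(), LinearSVC(), RandomForestClassifier()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))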
7.4 Seaborn
Data Visualization in Python
Python offers multiple great graphing libraries that come packed with lots of different
features. No matter if you want to create interactive, live or highly customized plots, Python
has an excellent library for you.
In this article, we will learn how to create basic plots using Matplotlib, Pandas visualization
and Seaborn as well as how to use some specific features of each library. This article will
focus on the syntax and not on interpreting the graphs.
7.5 Matplotlib
Matplotlib is the most popular Python plotting library. It is a low-level library with a
MATLAB-like interface which offers lots of freedom at the cost of having to write more code.
Matplotlib is especially good for creating basic graphs like line charts, bar charts,
histograms and many more. It can be imported by typing:
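import matplotlib.pyplot as plt  # conventional alias for the pyplot interface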
Line Chart
In Matplotlib we can create a line chart by calling the plot method. We can also plot multiple
columns in one graph, by looping through the columns we want, and plotting each column
on the same axis.
Line Chart
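A minimal sketch, assuming a Pandas DataFrame df with illustrative numeric columns:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"week": range(1, 6),
                   "spam": [10, 14, 9, 20, 17],
                   "ham": [55, 60, 58, 62, 61]})

# Plot several columns on the same axis by looping over them
for col in ["spam", "ham"]:
    plt.plot(df["week"], df[col], label=col)

plt.xlabel("week")
plt.ylabel("message count")
plt.legend()
plt.show()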
Histogram
In Matplotlib we can create a Histogram using the hist method. If we pass it categorical data
like the points column from the wine-review dataset it will automatically calculate how
often each class occurs.
Histogram
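A minimal sketch of the hist call; the points values stand in for the wine-review column
mentioned above.

import matplotlib.pyplot as plt

points = [80, 85, 85, 90, 92, 85, 88, 90, 95, 85]  # illustrative scores

plt.hist(points, bins=5)
plt.xlabel("points")
plt.ylabel("frequency")
plt.show()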
Bar Chart
A bar chart can be created using the bar method. The bar chart doesn't automatically
calculate the frequency of a category, so we are going to use the Pandas value_counts function
to do this. The bar chart is useful for categorical data that doesn't have a lot of different
categories (less than 30), because otherwise it can get quite messy.
Bar-Chart
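A minimal sketch combining value_counts with the bar method; the labels are illustrative.

import matplotlib.pyplot as plt
import pandas as pd

labels = pd.Series(["ham", "spam", "ham", "ham", "spam", "ham"])  # illustrative

counts = labels.value_counts()       # frequency of each category
plt.bar(counts.index, counts.values)
plt.xlabel("class")
plt.ylabel("count")
plt.show()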
Pandas Visualization
Pandas is an open source, high-performance, easy-to-use library providing data structures,
such as data frames, and data analysis tools like the visualization tools we will use in this
article. Pandas Visualization makes it really easy to create plots out of a Pandas dataframe
and series. It also has a higher-level API than Matplotlib, and therefore we need less code
for the same results.
Heatmap
A Heatmap is a graphical representation of data where the individual values contained in a
matrix are represented as colors. Heatmaps are perfect for exploring the correlation of
features in a dataset.
To get the correlation of the features inside a dataset we can call <dataset>.corr(), which is
a Pandas dataframe method. This will give us the correlation matrix.
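A minimal sketch using Seaborn's heatmap on the correlation matrix of an illustrative data frame:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"length": [20, 120, 35, 90, 15],
                   "digits": [0, 8, 1, 5, 0],
                   "spam": [0, 1, 0, 1, 0]})

corr = df.corr()  # correlation matrix of the features
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()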
CHAPTER 8
TESTING
Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user
manuals.
Functional testing is centered on the following items:
● Functions: Identified functions must be exercised.
● Output: Identified classes of software outputs must be exercised.
● Systems/Procedures: system should work properly
➔ Integration Testing
Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects.
Here in machine learning we are dealing with a dataset in spreadsheet format, so for any test
case we need to check the spreadsheet file. Later on, classification works on the respective
columns of the dataset.
CHAPTER 9
RESULTS
Data mining is a process to extract knowledge from existing data. It is used as a tool in
banking and finance, in general, to discover useful information from the operational and
historical data to enable better decision-making. It is an interdisciplinary field, the confluence
of Statistics, Database technology, Information science, Machine learning, and Visualization.
It involves steps that include data selection, data integration, data transformation, data
mining, pattern evaluation, and knowledge presentation.
CHAPTER 10
OUTPUT
CONCLUSION
The research aims at predicting whether messages are spam or genuine, and it runs on efficient
machine learning algorithms and technologies with good accuracy. The training datasets
obtained provide enough insight for predicting the appropriate messages. Thus, the system
helps users identify whether their messages are spam or genuine, with reasonably accurate
predictions.
REFERENCES
[1] Online: https://www.statista.com/statistics/330695/number-of-smartphone-users-worldwide/
[2] S. M. Abdulhamid, M. S. Abd Latiff, Haruna Chiroma, et al., "A Review on Mobile SMS
Spam Filtering Techniques," IEEE Access, vol. 5, pp. 15650-15666, 2017, doi:
10.1109/ACCESS.2017.2666785.
[3] Nilam Nur Amir Sjarif, N. F. Mohd Azmi, Suriayati Chuprat, "SMS Spam Message
Detection using Term Frequency-Inverse Document Frequency and Random Forest
Algorithm," The Fifth Information Systems International Conference 2019, Procedia
Computer Science 161 (2019) 509-515, ScienceDirect.
[4] A. Lakshmanarao, K. Chandra Sekhar, Y. Swathi, "An Efficient Spam Classification
System using Ensemble Machine Learning Algorithm," Journal of Applied Science and
Computations, Volume 5, Issue 9, September 2018.
[5] Pavas Navaney, Gaurav Dubey, Ajay Rana, "SMS Spam Filtering using Supervised
Machine Learning Algorithms," 8th International Conference on Cloud Computing, Data
Science & Engineering, 978-1-5386-1719-9/18, IEEE, 2018.
[6] Luo GuangJun, Shah Nazir, Habib Ullah Khan, Amin Ul Haq, "Spam Detection Approach
for Secure Mobile Message Communication using Machine Learning Algorithms," Security
and Communication Networks, Hindawi, Volume 2020, Article ID 8873639, July 2020.
[7] Tian Xia, Xuemin Chen, "A Discrete Hidden Markov Model for SMS Spam Detection,"
Applied Sciences, MDPI, Appl. Sci. 2020, 10, 5011, doi: 10.3390/app10145011.
[8] M. Nivaashini, R. S. Soundariya, A. Kodieswari, P. Thangaraj, "SMS Spam Detection
using Deep Neural Network," International Journal of Pure and Applied Mathematics,
Volume 119, No. 18, 2018, pp. 2425-2436.
[9] Mehul Gupta, Aditya Bakliwal, Shubhangi Agarwal, Pulkit Mehndiratta, "A Comparative
Study of Spam SMS Detection using Machine Learning Classifiers," 2018 Eleventh
International Conference on Contemporary Computing (IC3), 2-4 August 2018.
[10] Gomatham Sai Sravya, G. Pradeepini, Vaddeswaram, "Mobile SMS Spam Filter
Techniques using Machine Learning Techniques," International Journal of Scientific &
Technology Research, Volume 9, Issue 03, March 2020.
[11] M. Rubin Julis, S. Alagesan, "Spam Detection in SMS using Machine Learning through
Text Mining," International Journal of Scientific & Technology Research, Volume 9, Issue 02,
February 2020.
[12] K. Sree Ram Murthy, K. Kranthi Kumar, K. Srikar, CH. Nithya, S. Alagesan, "SMS Spam
Detection using RNN," International
While the Internet has brought unprecedented convenience to many people for managing
their finances and investments, it also provides opportunities for conducting fraud on a
massive scale with little cost to the fraudsters. Fraudsters can manipulate users instead of
hardware/software systems, where barriers to technological compromise have increased
significantly. Phishing is one of the most widely practised Internet frauds. It focuses on the
theft of sensitive personal information such as passwords and credit card details. Phishing
attacks take two forms: attacks that rely on specific malware, and attacks that proceed by
deceiving users.
The specific malware used in phishing attacks is the subject of research by the virus and
malware community and is not addressed in this thesis. Phishing attacks that proceed by
deceiving users are the research focus of this thesis, and the term 'phishing attack' will be
used to refer to this type of attack.
1.1 Objective:
The main objective of this paper is to detect the benign, malicious and malware URLs with
the use of machine learning.
1.2 Motivation:
The reason behind this system is to take precautions to prevent users from visiting these
harmful sites. It will make people conscious of the threat, in addition to building strong
security mechanisms which are able to detect and prevent phishing URLs from reaching the user.
One way to address this problem is by restructuring the NN model in terms of tuning some
parameters, adding new neurons to the hidden layer or sometimes adding a new layer to the
network. An NN with a small number of hidden neurons may not have satisfactory
representational power to model the complexity and diversity inherent in the data. On the
other hand, networks with too many hidden neurons could overfit the data. However, at a
certain stage the model can no longer be improved; therefore, the structuring process should
be terminated. Hence, an acceptable error rate should be specified when creating any NN
model, which is itself a problem since it is difficult to determine the acceptable error rate a
priori. For instance, the model designer may set the acceptable error rate to a value that is
unreachable, which causes the model to get stuck in local minima, or the model designer may
set the acceptable error rate to a value that could be further improved.
Disadvantages:
1. It takes time to load the entire dataset.
2. The process is not accurate.
3. It analyses data slowly.
We find that phishing websites prefer to have longer URLs, more levels (delimited by dots),
more tokens in the domain and path, and longer tokens. Besides, phishing and malware
websites could pretend to be benign ones by containing popular brand names as tokens other
than those in the second-level domain. Phishing and malware websites may also use an IP
address directly so as to cover up the suspicious URL, which is very rare in the benign case.
Also, phishing URLs are found to contain several suggestive word tokens (confirm, account,
banking, secure, ebayisapi, webscr, login, signin), so we check the presence of these
security-sensitive words and include the binary value in our features. Intuitively, malicious
sites are always less popular than benign ones; for this reason, site popularity can be
considered an important feature. The traffic rank feature is acquired from Alexa.com.
Host-based features are based on the observation that malicious sites are always registered
in less reputable hosting centres or regions.
Advantages:
1. All URLs in the dataset are labelled.
2. We used two supervised learning algorithms, random forest and support vector machine,
trained using the scikit-learn library.
CHAPTER 3
LITERATURE SURVEY
Abstract:
Phishing websites, fraudulent sites that impersonate a trusted third party to gain access to
private data, continue to cost Internet users over a billion dollars each year. In this paper,
we describe the design and performance characteristics of a scalable machine learning
classifier we developed to detect phishing websites. We use this classifier to maintain
Google’s phishing blacklist automatically. Our classifier analyses millions of pages a day,
examining the URL and the contents of a page to determine whether or not a page is
phishing. Unlike previous work in this field, we train the classifier on a noisy dataset
consisting of millions of samples from previously collected live classification data. Despite
the noise in the training data, our classifier learns a robust model for identifying phishing
pages which correctly classifies more than 90% of phishing pages several weeks after
training concludes.
Author: Chris Seiffert, Taghi M. Khoshgoftaar, Jason Van Hulse, Amri Napolitano.
Abstract:
Constructing classification models using skewed training data can be a challenging task.
We present RUSBoost, a new algorithm for alleviating the problem of class imbalance.
RUSBoost combines data sampling and boosting, providing a simple and efficient method
for improving classification performance when training data is imbalanced. In addition to
performing favorably when compared to SMOTEBoost (another hybrid sampling/boosting
algorithm), RUSBoost is computationally less expensive than SMOTEBoost and results in
significantly shorter model training times. This combination of simplicity, speed and
performance makes RUSBoost an excellent technique for learning from imbalanced data.
Title: Application of Machine Learning Algorithms to an Intrusion Detection Dataset within
the Misuse Detection Context.
A small subset of machine learning algorithms, mostly inductive learning based, applied to
the KDD 1999 Cup intrusion detection dataset resulted in dismal performance for
user-to-root and remote-to-local attack categories, as reported in the recent literature. The
uncertainty about whether other machine learning algorithms can demonstrate better
performance compared to the ones already employed constitutes the motivation for the
study reported herein. Specifically, exploring whether certain algorithms perform better for
certain attack classes and, consequently, whether a multi-expert classifier design can deliver
the desired performance measure is of high interest. This paper evaluates the performance
of a comprehensive set of pattern recognition and machine learning algorithms on four
attack categories as found in the KDD 1999 Cup intrusion detection dataset. Results of the
simulation study implemented to that effect indicated that certain classification algorithms
perform better for certain attack categories.
experiments: the syntactic similarity for sentences, and the subject and object of verb
comparison. The results of the experiments indicated that both features can be used for
some verbs, but more work has to be done for others.
Because iTrustPage is user-assisted, it avoids the false positives and the false negatives
associated with automatic phishing detection. We implemented iTrustPage as a
downloadable extension to Firefox. After being featured on the Mozilla website for Firefox
extensions, iTrustPage was downloaded by more than 5,000 users in a two-week period.
We present an analysis of our tool's effectiveness and ease of use based on our examination
of usage logs collected from the 2,050 users who used iTrustPage for more than two weeks.
Based on these logs, we find that iTrustPage disrupts users on fewer than 2% of the pages
they visit, and the number of disruptions decreases over time.
Title : Online Phishing Classification Using Adversarial Data Mining and Signaling Games.
CHAPTER 4
SYSTEM ARCHITECTURE
1. Data Set
2. Python
4.2 Methodology
• Data Collection
• Data Pre-Processing
• Feature Extraction
• Evaluation model
Data used in this paper is a set of records. This step is concerned with selecting the subset of
all available data that you will be working with. ML problems start with data, preferably lots
of data (examples or observations), for which you already know the target answer. Data for
which you already know the target answer is called labelled data.
Organize your selected data by formatting, cleaning and sampling from it.
1. Formatting
2. Cleaning
3. Sampling
Formatting: The data you have selected may not be in a format that is suitable for you to work
with. The data may be in a relational database and you would like it in a flat file, or the data
may be in a proprietary file format and you would like it in a relational database or a text file.
Cleaning: Cleaning data is the removal or fixing of missing data. There may be data instances
that are incomplete and do not carry the data you believe you need to address the problem.
These instances may need to be removed. Additionally, there may be sensitive information in
some of the attributes, and these attributes may need to be anonymized or removed from the
data entirely.
Sampling: There may be far more selected data available than you need to work with. More
data can result in much longer running times for algorithms and larger computational and
memory requirements. You can take a smaller representative sample of the selected data that
may be much faster for exploring and prototyping solutions before considering the whole
dataset.
The next step is feature extraction, an attribute extension in which we created more columns
from the URLs. Finally, our models are trained using a classifier algorithm. We use the
labelled dataset gathered; the rest of our labelled data will be used to evaluate the models.
Some machine learning algorithms were used to classify the pre-processed data. The chosen
classifier was Random Forest.
Model Evaluation is an integral part of the model development process. It helps to find the best
model that represents our data and how well the chosen model will work in the future.
To avoid overfitting, both methods use a test set (not seen by the model) to evaluate model
performance. The performance of each classification model is estimated based on its averaged
results. The results will be presented in visualized form, as graphs of the classified data.
Accuracy is defined as the percentage of correct predictions for the test data. It can be
calculated easily by dividing the number of correct predictions by the total number of
predictions. We compare the actual and predicted outputs and calculate the accuracy as
shown below.
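The calculation referred to above is the standard one; a minimal sketch in Python:

from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]  # actual labels (illustrative)
y_pred = [1, 0, 0, 1, 0]  # predicted labels

# accuracy = number of correct predictions / total number of predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
print(correct / len(y_true))           # 0.8
print(accuracy_score(y_true, y_pred))  # same result via scikit-learn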
CHAPTER 5
ALGORITHM
Random forest is a type of supervised machine learning algorithm based on ensemble learning.
Ensemble learning is a type of learning where you join different types of algorithms, or the
same algorithm multiple times, to form a more powerful prediction model. The random forest
algorithm combines multiple algorithms of the same type, i.e. multiple decision trees, resulting
in a forest of trees, hence the name "Random Forest". The random forest algorithm can be used
for both regression and classification tasks.
The random forest algorithm has the following useful properties:
1. The random forest algorithm is not biased, since there are multiple trees and each tree
is trained on a subset of the data. Basically, the random forest algorithm relies on the power
of "the crowd"; therefore, the overall bias of the algorithm is reduced.
2. This algorithm is very stable. Even if a new data point is introduced in the dataset the
overall algorithm is not affected much since new data may impact one tree, but it is very
hard for it to impact all the trees.
3. The random forest algorithm works well when you have both categorical and numerical
features.
4. The random forest algorithm also works well when the data has missing values or has not
been scaled well (see the sketch below for a basic training example).
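A minimal sketch of training a random forest classifier on URL features with scikit-learn; the
feature columns and values are illustrative stand-ins for the URL features described earlier.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Illustrative URL features: [url_length, num_dots, has_ip, has_login_token]
X = [[54, 3, 0, 0], [110, 6, 1, 1], [23, 1, 0, 0], [95, 5, 0, 1],
     [30, 2, 0, 0], [130, 7, 1, 1]]
y = [0, 1, 0, 1, 0, 1]  # 0 = benign, 1 = malicious

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

# An ensemble of decision trees, each trained on a bootstrap sample of the data
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))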
Motion of all the cars around it. It uses all of that data to figure out not only how to drive the
car but also to figure out and predict what potential drivers around the car are going to do.
What's impressive is that the car is processing almost a gigabyte a second of data.
Deep Learning:
Deep learning is a computer software that mimics the network of neurons in a brain. It is a
subset of machine learning and is called deep learning because it makes use of deep neural
networks. The machine uses different layers to learn from the data. The depth of the model is
represented by the number of layers in the model. Deep learning is the new state of the art in
terms of AI. In deep learning, the learning phase is done through a neural network.
Reinforcement Learning:
● Q-learning
● Deep Q network
● State-Action-Reward-State-Action (SARSA)
● Deep Deterministic Policy Gradient (DDPG)
AI in Finance: The financial technology sector has already started using AI to save time,
reduce costs, and add value. Deep learning is changing the lending industry by using more
robust credit scoring. Credit decision-makers can use AI for robust credit lending applications
to achieve faster, more accurate risk assessment, using machine intelligence to factor in the
character and capacity of applicants.
Underwrite.ai is a fintech company providing an AI solution for credit decision-makers; it uses
AI to detect which applicants are more likely to pay back a loan. Their approach radically
outperforms traditional methods.
AI in HR: Under Armour, a sportswear company, revolutionized hiring and modernized the
candidate experience with the help of AI. In fact, Under Armour reduced the hiring time for its
retail stores by 35%. Under Armour faced growing popularity back in 2012; they had, on
average, 30,000 resumes a month. Reading all of those applications and starting the screening
and interview process was taking too long. The lengthy process to get people hired and
on-boarded impacted Under Armour's ability to have their retail stores fully staffed, ramped
and ready to operate.
At that time, Under Armour had all of the 'must have' HR technology in place, such as
transactional solutions for sourcing, applying, tracking and onboarding, but those tools weren't
useful enough. Under Armour chose HireVue, an AI provider for HR solutions, for both
on-demand and live interviews. The results were impressive; they managed to decrease the
time to fill by 35%, and in addition they hired higher-quality staff.
For example, deep-learning analysis of audio allows systems to assess a customer's emotional
tone. If the customer is responding poorly to the AI chatbot, the system can reroute the
conversation to real, human operators who take over the issue.
Apart from the three examples above, AI is widely used in other sectors/industries.
Artificial Intelligence
With machine learning, you need less data to train the algorithm than with deep learning. Deep
learning requires an extensive and diverse set of data to identify the underlying structure.
Besides, machine learning provides a faster-trained model; most advanced deep learning
architectures can take days to a week to train. The advantage of deep learning over machine
learning is that it is highly accurate. You do not need to understand which features are the best
representation of the data; the neural network learns how to select the critical features. In
machine learning, you need to choose for yourself which features to include in the model.
TensorFlow
The most famous deep learning library in the world is Google's TensorFlow. Google uses
machine learning across its products to improve the search engine, translation, image captioning
and recommendations.
To give a concrete example, Google users can experience a faster and more refined search with
AI: if the user types a keyword into the search bar, Google provides a recommendation about
what the next word could be.
Google wants to use machine learning to take advantage of their massive datasets to give users
the best experience. Three different groups use machine learning:
● Researchers
● Data scientists
● Programmers
They can all use the same toolset to collaborate with each other and improve their efficiency.
Google does not just have any data; they have the world's most massive computer, so
TensorFlow was built to scale. TensorFlow is a library developed by the Google Brain Team to
accelerate machine learning and deep neural network research.
It was built to run on multiple CPUs or GPUs and even mobile operating systems, and it has
several wrappers in several languages like Python, C++ or Java.
TensorFlow Architecture
A tensor goes in, flows through a list of operations, and then comes out the other side; this is
why the library is called TensorFlow.
Development Phase: This is when you train the model. Training is usually done on your desktop
or laptop.
Run Phase or Inference Phase: Once training is done TensorFlow can be run on many different
platforms. You can run it on
● Desktop running Windows, macOS or Linux
● Cloud as a web service
● Mobile devices like iOS and Android
You can train it on multiple machines then you can run it on a different machine, once you have
the trained model.
The model can be trained and used on GPUs as well as CPUs. GPUs were initially designed
for video games. In late 2010, Stanford researchers found that GPUs were also very good at
matrix operations and algebra, which makes them very fast for these kinds of calculations.
Deep learning relies on a lot of matrix multiplication, and TensorFlow is very fast at computing
matrix multiplications because it is written in C++. Although it is implemented in C++,
TensorFlow can be accessed and controlled from other languages, mainly Python.
● Classification: tf.estimator.LinearClassifier
● Deep learning classification: tf.estimator.DNNClassifier
● Deep learning wide and deep: tf.estimator.DNNLinearCombinedClassifier
● Boosted tree regression: tf.estimator.BoostedTreesRegressor
● Boosted tree classification: tf.estimator.BoostedTreesClassifier
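As an illustration, the following is a minimal sketch of training one of these canned estimators (assumptions: a TensorFlow release that still ships the tf.estimator and tf.feature_column APIs, and a small synthetic dataset generated inside the snippet):

import numpy as np
import tensorflow as tf

# Synthetic two-feature binary-classification data, for illustration only
features = {"x1": np.random.rand(100).astype(np.float32),
            "x2": np.random.rand(100).astype(np.float32)}
labels = np.random.randint(0, 2, size=100)

feature_columns = [tf.feature_column.numeric_column("x1"),
                   tf.feature_column.numeric_column("x2")]

def input_fn():
    # Wrap the in-memory arrays in a tf.data pipeline
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    return ds.shuffle(100).batch(16).repeat(5)

# Canned linear classifier from the list above
model = tf.estimator.LinearClassifier(feature_columns=feature_columns)
model.train(input_fn=input_fn)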
CHAPTER 6
REQUIREMENTS ANALYSIS
● Anaconda Navigator
● Python
● Python built-in modules
o Numpy
o Pandas
o Matplotlib
o Sklearn
o Seaborn
In order to run, many scientific packages depend on specific versions of other packages.
Data scientists often use multiple versions of many packages, and use multiple environments
to separate these different versions.
The command line program conda is both a package manager and an environment manager, to
help data scientists ensure that each version of each package has all the dependencies it requires
and works correctly.
Navigator is an easy, point-and-click way to work with packages and environments without
needing to type conda commands in a terminal window. You can use it to find the packages you
want, install them in an environment, run the packages and update them, all inside Navigator.
The simplest way is with Spyder. From the Navigator Home tab, click Spyder, and write and
execute your code.
You can also use Jupyter Notebooks the same way. Jupyter Notebooks are an increasingly
popular system that combine your code, descriptive text, output, images and interactive
interfaces into a single notebook file that is edited, viewed and used in a web browser.
Recent Navigator releases have also added support for an offline mode for all environment-related
actions and for custom configuration of the main window links, along with numerous bug fixes
and performance enhancements.
Python is Interactive: You can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.
Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.
Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-68,
SmallTalk, Unix shell, and other scripting languages.
Python is copyrighted; its source code is available under an open-source licence (the
GPL-compatible Python Software Foundation License).
Python is now maintained by a core development team, although Guido van Rossum still holds
a vital role in directing its progress.
Easy-to-learn: Python has few keywords, simple structure, and a clearly defined
syntax. This allows the student to pick up the language quickly.
Easy-to-read: Python code is more clearly defined and visible to the eyes.
A broad standard library: The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.
Interactive Mode: Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.
Portable: Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.
Extendable: You can add low-level modules to the Python interpreter. These modules
enable programmers to add to or customize their tools to be more efficient.
GUI Programming: Python supports GUI applications that can be created and ported
to many system calls, libraries, and windows systems, such as Windows MFC,
Macintosh, and the X Window system of Unix.
Scalable: Python provides a better structure and support for large programs than shell
scripting.
Apart from the above-mentioned features, Python has a long list of good features, a few of which
are listed below:
It provides very high-level dynamic data types and supports dynamic type checking.
It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
Python is available on a wide variety of platforms including Linux and Mac OS X. Let's understand
how to set up our Python environment.
● Numpy
● Sklearn
● seaborn
● matplotlib
● Importing Datasets
6.3.1 Pandas
Pandas is quite a game changer when it comes to analyzing data with Python, and it is one of
the most preferred and widely used tools in data munging/wrangling, if not THE most used one.
Pandas is an open source library. What's cool about Pandas is that it takes data (like a CSV or
TSV file, or a SQL database) and creates a Python object with rows and columns called a data
frame, which looks very similar to a table in a statistical software package (think Excel or SPSS,
for example; people who are familiar with R would see similarities to R too). This is so much
easier to work with than lists and/or dictionaries through for loops or list comprehensions.
In order to use Pandas in your Python IDE (Integrated Development Environment) like Jupyter
Notebook or Spyder (both of them come with Anaconda by default), you need to import the
Pandas library first. Importing a library means loading it into the memory and then it’s there
for you to work with. In order to import Pandas all you have to do is run the following code:
● import pandas as pd
● import numpy as np
Usually you would add the second part (‘as pd’) so you can access Pandas with ‘pd.command’
instead of needing to write ‘pandas.command’ every time you need to use it. Also, you would
import numpy as well, because it is a very useful library for scientific computing with Python.
Now Pandas is ready for use! Remember, you would need to do it every time you start a new
Jupyter Notebook, Spyder file etc.
● Open a local file using Pandas, usually a CSV file, but could also be a delimited text file
(like TSV), Excel, etc
● Open a remote file or database, like a CSV or a JSON file on a website, through a URL, or
read from a SQL table/database
There are different commands to each of these options, but when you open a file, they would
look like this:
● pd.read_filetype()
As I mentioned before, there are different filetypes Pandas can work with, so you would replace
“filetype” with the actual, well, filetype (like CSV). You would give the path, filename etc
inside the parenthesis. Inside the parenthesis you can also pass different arguments that relate
to how to open the file. There are numerous arguments, and in order to know all of them,
would have to read the documentation (for example, the documentation for pd.read_csv()
would contain all the arguments you can pass in this Pandas command).
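For example, reading a CSV file might look like the snippet below (the file name spam.csv and the encoding argument are purely illustrative; adjust them to your own dataset):

import pandas as pd

df = pd.read_csv("spam.csv", encoding="latin-1")  # load the file into a data frame
print(df.head())   # first five rows
print(df.shape)    # (number of rows, number of columns)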
In order to convert a certain Python object (a dictionary, lists, etc.) into a data frame, the basic command is:
● pd.DataFrame()
Inside the parenthesis you would specify the object(s) you’re creating the data frame from. This
command also has different arguments.
You can also save a data frame you’re working with/on to different kinds of files (like CSV,
Excel, JSON and SQL tables). The general code for that is:
● df.to_filetype(filename)
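A minimal sketch of both commands (the dictionary and the output file name are made up for illustration):

import pandas as pd

data = {"message": ["Free entry now!", "See you at 5"], "label": ["spam", "ham"]}
df = pd.DataFrame(data)                 # build a data frame from a dictionary
df.to_csv("messages.csv", index=False)  # save it back to a CSV file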
Selection of Data
One of the things that is so much easier in Pandas is selecting the data you want in comparison
to selecting a value from a list or a dictionary. You can select a column (df[col]) and return
column with label col as Series or a few columns (df[[col1, col2]]) and returns columns as a
new DataFrame. You can select by position (s.iloc[0]), or by index (s.loc['index_one']) . In
order to select the first row you can use df.iloc[0,:] and in order to select the first element of
the first column you would run df.iloc[0,0] . These can also be used in different combinations,
so I hope it gives you an idea of the different selection and indexing you can perform in Pandas.
You can use different conditions to filter columns. For example, df[df['year'] > 1984] would give
you only the rows where the year column is greater than 1984. You can use & (and) or | (or) to
add different conditions to your filtering. This is also called boolean filtering.
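A small sketch of these selection and filtering commands on a toy data frame (the column names year and title are hypothetical):

import pandas as pd

df = pd.DataFrame({"year": [1980, 1990, 2000], "title": ["a", "b", "c"]})
first_row = df.iloc[0, :]        # first row, selected by position
first_cell = df.iloc[0, 0]       # first element of the first column
years = df["year"]               # one column as a Series
subset = df[["year", "title"]]   # several columns as a new DataFrame
recent = df[df["year"] > 1984]   # boolean filtering: rows where year > 1984
combined = df[(df["year"] > 1984) & (df["title"] != "b")]  # two conditions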
It is possible to sort values in a certain column in ascending order using df.sort_values(col1),
and in descending order using df.sort_values(col2, ascending=False). Furthermore, it is possible
to sort values by col1 in ascending order and then by col2 in descending order using
df.sort_values([col1,col2], ascending=[True,False]).
The last command in this section is groupby. It involves splitting the data into groups based on
some criteria, applying a function to each group independently and combining the results into
a data structure. df.groupby(col) returns a groupby object for values from one column while
df.groupby([col1,col2]) returns a groupby object for values from multiple columns.
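A short sketch of sorting and groupby on a toy data frame (col1 and col2 are hypothetical column names):

import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "b"], "col2": [3, 1, 2]})
sorted_df = df.sort_values(["col1", "col2"], ascending=[True, False])
grouped = df.groupby("col1")["col2"].mean()   # mean of col2 within each col1 group
print(grouped)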
Data Cleaning
Data cleaning is a very important step in data analysis. For example, we always check for
missing values in the data by running pd.isnull(df) (equivalently df.isnull()), which checks for
null values and returns a boolean array (an array of True for missing values and False for
non-missing values). In order to get a count of null/missing values, run df.isnull().sum().
pd.notnull() is the opposite of pd.isnull(). After you get a list of missing values you can get rid
of them, or drop them, by using
df.dropna() to drop the rows or df.dropna(axis=1) to drop the columns. A different approach
would be to fill the missing values with other values by using df.fillna(x) which fills the missing
values with x (you can put there whatever you want) or s.fillna(s.mean()) to replace all null
values with the mean (mean can be replaced with almost any function from the statistics
section).
It is sometimes necessary to replace values with different values. For example, s.replace(1,'one')
would replace all values equal to 1 with 'one'. It is possible to do this for multiple values:
s.replace([1,3],['one','three']) would replace all 1 with 'one' and 3 with 'three'. You can also
rename specific columns by running df.rename(columns={'old_name': 'new_name'}), or use
df.set_index('column_one') to change the index of the data frame.
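A brief sketch of these cleaning commands on a toy data frame that contains missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})
print(df.isnull().sum())                                   # missing values per column
filled = df.fillna({"a": df["a"].mean(), "b": "unknown"})  # fill with mean / constant
renamed = filled.rename(columns={"a": "value", "b": "label"})
print(renamed)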
Join/Combine
The last set of basic Pandas commands is for joining or combining data frames or
rows/columns. The three commands are:
● df1.append(df2) — append the rows of df2 to the end of df1 (columns should be identical)
● pd.concat([df1, df2], axis=1) — add the columns of df2 to the end of df1 (rows should be
identical)
● df1.join(df2, on=col1, how='inner') — SQL-style join of the columns in df1 with the columns
of df2 where the rows for col1 have identical values. how can be equal to one
of: 'left', 'right', 'outer', 'inner'
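A short sketch of combining data frames (note that append is deprecated in recent pandas releases, so pd.concat is used here for stacking rows):

import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "msg": ["a", "b"]})
df2 = pd.DataFrame({"id": [3], "msg": ["c"]})
stacked = pd.concat([df1, df2])                # rows of df2 appended after df1
joined = df1.merge(df2, on="id", how="outer")  # SQL-style join on the "id" column
print(stacked)
print(joined)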
6.3.2 Numpy
NumPy is a powerful library for array processing, along with a large collection of high-level
mathematical functions that operate on these arrays. These functions fall into categories like
linear algebra, trigonometry, statistics, matrix manipulation, etc.
Getting NumPy
NumPy's main object is the homogeneous multidimensional array. Unlike Python's array class,
which only handles one-dimensional arrays, NumPy's ndarray class can handle multidimensional
arrays and provides more functionality. NumPy's dimensions are known as axes. For example,
the array below has 2 dimensions, or 2 axes, namely rows and columns. Sometimes the number
of dimensions is also known as the rank of that particular array or matrix.
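A minimal sketch of such a two-axis array (the import is repeated here so the snippet is self-contained):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # 2 rows and 3 columns
print(a.ndim)          # 2 -> number of axes (dimensions)
print(a.shape)         # (2, 3)
print(a.mean(axis=0))  # column-wise mean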
Importing NumPy
NumPy is imported using the following command. Note that here np is the convention followed
for the alias, so that we don't need to write numpy every time.
● import numpy as np
NumPy is the basic library for scientific computations in Python and this article illustrates some
of its most frequently used functions. Understanding NumPy is the first major step in the
journey of machine learning and deep learning.
6.3.3 Sklearn
In Python, the scikit-learn library provides pre-built preprocessing functionality under
sklearn.preprocessing.
The next step is feature extraction. Feature extraction is an attribute reduction process: unlike
feature selection, which ranks the existing attributes according to their predictive significance,
feature extraction actually transforms the attributes. The transformed attributes, or features, are
linear combinations of the original attributes. Finally, our models are trained using classification
algorithms. We use the nltk.classify module of the Natural Language Toolkit (NLTK) library in
Python, together with the labelled dataset gathered earlier; the rest of the labelled data is used to
evaluate the models. Several machine learning algorithms were used to classify the preprocessed
data. The chosen classifiers were Decision Tree, Support Vector Machines and Random Forest.
These algorithms are very popular in text classification tasks.
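The following is a minimal, illustrative sketch of this pipeline built with scikit-learn directly (assumptions: a small hand-made data frame with message and label columns stands in for the real dataset, and TF-IDF is used as the feature-extraction step in place of the NLTK-based workflow described above):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.DataFrame({"message": ["Win a free prize now", "Lunch at noon?",
                               "Claim your reward", "See you tomorrow"],
                   "label": [1, 0, 1, 0]})   # 1 = spam, 0 = ham (toy labels)

X = TfidfVectorizer().fit_transform(df["message"])   # feature extraction
X_train, X_test, y_train, y_test = train_test_split(
    X, df["label"], test_size=0.5, random_state=42)

# Train and evaluate the three classifiers mentioned above
for clf in (DecisionTreeClassifier(), SVC(), RandomForestClassifier()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))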
6.3.4 Seaborn
Data visualization is the discipline of trying to understand data by placing it in a visual context,
so that patterns, trends and correlations that might not otherwise be detected can be exposed.
Python offers multiple great graphing libraries that come packed with lots of different features.
No matter if you want to create interactive, live or highly customized plots, Python has an
excellent library for you.
In this section, we will learn how to create basic plots using Matplotlib, Pandas visualization
and Seaborn, as well as how to use some specific features of each library. The focus here is on
the syntax rather than on interpreting the graphs.
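As a small illustration, a Seaborn histogram can be drawn as follows (assuming a reasonably recent Seaborn release that provides histplot, with a toy points column standing in for the real data):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"points": [85, 88, 90, 90, 92, 95]})  # toy stand-in data
sns.histplot(df["points"])   # distribution of the points column
plt.show()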
6.3.5 Matplotlib
Matplotlib is the most popular Python plotting library. It is a low-level library with a
MATLAB-like interface, which offers lots of freedom at the cost of having to write more code.
Matplotlib is specifically good for creating basic graphs like line charts, bar charts, histograms
and many more. It can be imported by typing:
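● import matplotlib.pyplot as plt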
Line Chart
Histogram
In Matplotlib we can create a Histogram using the hist method. If we pass it categorical data
like the points column from the wine-review dataset it will automatically calculate how often
each class occurs.
Histogram
Bar Chart
A bar chart can be created using the bar method. The bar chart does not automatically calculate
the frequency of each category, so we use the pandas value_counts function to do this first. The
bar chart is useful for categorical data that doesn't have a lot of different categories (fewer than
about 30), because otherwise it can get quite messy.
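A minimal sketch of both plots (the wine-review dataset is not reproduced here, so a toy data frame stands in for it):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"points": [85, 88, 90, 92],
                   "country": ["IT", "FR", "FR", "US"]})

plt.hist(df["points"])                 # histogram of the points column
plt.show()

counts = df["country"].value_counts()  # frequencies computed with pandas
plt.bar(counts.index, counts.values)   # bar chart of the category frequencies
plt.show()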
Bar-Chart
Pandas Visualization
Pandas is an open source, high-performance, easy-to-use library providing data structures, such
as data frames, and data analysis tools like the visualization tools we use in this section.
Pandas Visualization makes it really easy to create plots out of a pandas dataframe and series.
It also has a higher level API than Matplotlib and therefore we need less code for the same
results.
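For example, the same kind of histogram can be produced directly from a pandas Series in a single line (toy data again):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"points": [85, 88, 90, 90, 92, 95]})
df["points"].plot.hist()   # one-liner histogram straight from the Series
plt.show()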
In this chapter we looked at Matplotlib, Pandas visualization and Seaborn.
CHAPTER 7
TESTING
Software testing is an investigation conducted to provide stakeholders with information about
the quality of the product or service under test. Software Testing also provides an objective,
independent view of the software to allow the business to appreciate and understand the risks
at implementation of the software. Test techniques include, but are not limited to, the process
of executing a program or application with the intent of finding software bugs.
Software Testing can also be stated as the process of validating and verifying that a software
program/application/product:
● Meets the business and technical requirements that guided its design and development.
● Works as expected and can be implemented with the same characteristics.
● Functional Testing
Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
● Functions: Identified functions must be exercised.
● Output: Identified classes of software outputs must be exercised.
● Systems/Procedures: system should work properly
Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects.
Here, in this machine learning project, the dataset is stored in spreadsheet (Excel/CSV) format,
so any test case first needs to check that the file is read correctly; the classification step then
works on the respective columns of the dataset.
Test Case 1 :
CHAPTER 8
RESULTS
CHAPTER 9
OUTPUT
CONCLUSION
In this work, we describe a large-scale system for automatically classifying phishing pages
which maintains a false positive rate below 0.1%. The classification system examines millions
of potential phishing pages daily in a fraction of the time of a manual review process. By
automatically updating the blacklist with the classifier, we minimize the amount of time that
phishing pages can remain active before users are protected from them. Even with a perfect
classifier and a robust system, we recognize that the blacklist approach keeps us perpetually a
step behind the phishers. Using machine learning algorithms we can distinguish phishing URLs
from normal URLs, and the results are reported in terms of the accuracy metric.
REFERENCES
[1] G. Aaron and R. Rasmussen, “Global phishing survey: Trends and domain name use in
2016,” 2016.
[2] B. Gupta, A. Tewari, A. K. Jain, and D. P. Agrawal, “Fighting against phishing attacks:
state of the art and future challenges,” Neural Computing and Applications, vol. 28, no.
12, pp. 3629–3654, 2017.
[3] “… quarter 2016,” 2014.
[4] R. Verma, N. Shashidhar, and N. Hossain, “Detecting phishing emails the natural language
way,” in Computer Security – ESORICS 2012, Springer, 2012, pp. 824–841.
[5] M. Khonji, Y. Iraqi, and A. Jones, “Phishing detection: a literature survey,” IEEE
Communications Surveys & Tutorials, vol. 15, no. 4, pp. 2091–2121, 2013.
[6] G. Park and J. M. Taylor, “Using syntactic features for phishing detection,” arXiv preprint
arXiv:1506.00037, 2015.
[7] R. Dazeley, J. L. Yearwood, B. H. Kang, and A. V. Kelarev