
Introduction to Machine Learning

The Machine Learning Tutorial covers both the fundamentals and more complex ideas of
machine learning. Students and professionals in the workforce can benefit from our machine
learning tutorial.

A rapidly developing field of technology, machine learning allows computers to automatically


learn from previous data. For building mathematical models and making predictions based on
historical data or information, machine learning employs a variety of algorithms. It is currently
being used for a variety of tasks, including speech recognition, email filtering, auto-tagging on
Facebook, a recommender system, and image recognition.

You will learn about the many different methods of machine learning, including reinforcement
learning, supervised learning, and unsupervised learning, in this machine learning tutorial.
Regression and classification models, clustering techniques, hidden Markov models, and
various sequential models will all be covered.

What is Machine Learning

In the real world, we are surrounded by humans who can learn from their experiences, and we have computers or machines that simply follow our instructions. But can a machine also learn from experience or past data the way a human does? This is where machine learning comes in.


A subset of artificial intelligence known as machine learning focuses primarily on the creation
of algorithms that enable a computer to independently learn from data and previous
experiences. Arthur Samuel first used the term "machine learning" in 1959. It could be
summarized as follows:

Without being explicitly programmed, machine learning enables a machine to automatically learn from data, improve performance from experiences, and predict things.

Machine learning algorithms create a mathematical model that, without being explicitly
programmed, aids in making predictions or decisions with the assistance of sample historical
data, or training data. For the purpose of developing predictive models, machine learning brings
together statistics and computer science. Algorithms that learn from historical data are either
constructed or utilized in machine learning. The performance will rise in proportion to the
quantity of information we provide.

In other words, a machine can be said to learn if its performance improves as it gains more data.

How does Machine Learning work

A machine learning system learns from previous data, builds prediction models, and predicts the output for new data whenever it receives it. The more data available, the better the model, and the more accurate the predicted output.

Let's say we have a complex problem in which we need to make predictions. Instead of writing
code, we just need to feed the data to generic algorithms, which build the logic based on the
data and predict the output. Our perspective on the issue has changed as a result of machine
learning. The operation of a machine learning algorithm can be depicted as a simple block diagram: past data is fed to a learning algorithm, which builds a model, and the model produces output predictions for new data.
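To make this concrete, here is a minimal, hypothetical sketch of "feeding data to a generic algorithm" using scikit-learn; the tiny dataset and the choice of algorithm are illustrative assumptions, not part of the original notes.

from sklearn.tree import DecisionTreeClassifier

# Historical (training) data: [hours studied, hours slept] -> pass (1) / fail (0).
# These numbers are invented purely for illustration.
X_train = [[8, 7], [1, 4], [6, 8], [2, 5]]
y_train = [1, 0, 1, 0]

model = DecisionTreeClassifier()   # a generic learning algorithm
model.fit(X_train, y_train)        # the algorithm builds its logic from the data

print(model.predict([[7, 6]]))     # predicted output for new, unseen data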

Features of Machine Learning:

o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is quite similar to data mining, as both deal with huge amounts of data.

Need for Machine Learning

The demand for machine learning is steadily rising. Because it is able to perform tasks that are
too complex for a person to directly implement, machine learning is required. Humans are
constrained by our inability to manually access vast amounts of data; as a result, we require
computer systems, which is where machine learning comes in to simplify our lives.
By providing them with a large amount of data and allowing them to automatically explore the
data, build models, and predict the required output, we can train machine learning algorithms.
The performance of a machine learning algorithm can be measured with a cost function, and it generally improves as more data becomes available. We can save both time and money by using machine learning.

The importance of machine learning can be easily understood through its use cases. Currently, machine learning is used in self-driving vehicles, cyber-fraud detection, face recognition, friend suggestions on Facebook, and so on. Various top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze user interest and recommend products accordingly.

Following are some key points which show the importance of Machine Learning:

o Rapid increase in the production of data
o Solving complex problems that are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data

Classification of Machine Learning

At a broad level, machine learning can be classified into three types:

1. Supervised learning

2. Unsupervised learning

3. Reinforcement learning

Applications of Machine learning
Machine learning is a buzzword for today's technology, and it is growing very rapidly day by
day. We are using machine learning in our daily life even without knowing it such as Google
Maps, Google assistant, Alexa, etc. Below are some most trending real-world applications of
Machine Learning:
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. The popular use case of image recognition
and face detection is, Automatic friend tagging suggestion:

Facebook provides us a feature of auto friend tagging suggestion. Whenever we upload a photo
with our Facebook friends, then we automatically get a tagging suggestion with name, and the
technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "Deep Face," which is responsible for face
recognition and person identification in the picture.

2. Speech Recognition
While using Google, we get an option of "Search by voice," it comes under speech recognition,
and it's a popular application of machine learning.

Speech recognition is a process of converting voice instructions into text, and it is also known
as "Speech to text", or "Computer speech recognition." At present, machine learning
algorithms are widely used by various applications of speech recognition. Google assistant,
Siri, Cortana, and Alexa are using speech recognition technology to follow the voice
instructions.

3. Traffic prediction:
If we want to visit a new place, we take help of Google Maps, which shows us the correct path
with the shortest route and predicts the traffic conditions.

It predicts the traffic conditions such as whether traffic is cleared, slow-moving, or heavily
congested with the help of two ways:

o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time

Everyone who uses Google Maps is helping to make the app better. It takes information from users and sends it back to its database to improve performance.

4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such as
Amazon, Netflix, etc., for product recommendations to users. Whenever we search for a product on Amazon, we then start getting advertisements for the same product while browsing the internet in the same browser; this is because of machine learning.

Google understands the user's interests using various machine learning algorithms and suggests products according to customer interest.

Similarly, when we use Netflix, we get recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning
plays a significant role in self-driving cars. Tesla, a well-known car manufacturer, is working on self-driving cars and uses machine learning to train models to detect people and objects while driving.

6. Email Spam and Malware Filtering:


Whenever we receive a new email, it is filtered automatically as important, normal, and spam.
We always receive an important mail in our inbox with the important symbol and spam emails
in our spam box, and the technology behind this is Machine learning. Below are some spam
filters used by Gmail:

o Content filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms, such as Multi-Layer Perceptron, Decision Tree, and the Naïve Bayes classifier, are used for email spam filtering and malware detection.
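As a minimal sketch of how one of these algorithms could be used for spam filtering (the example messages and labels below are made up and this is not any mail provider's actual implementation), assuming scikit-learn is available:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages labelled as spam (1) or not spam (0).
messages = ["win a free prize now", "meeting scheduled for monday",
            "claim your free reward", "project report attached"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()            # turn text into word-count features
X = vectorizer.fit_transform(messages)

classifier = MultinomialNB()              # Naive Bayes classifier
classifier.fit(X, labels)

new_mail = vectorizer.transform(["free prize waiting for you"])
print(classifier.predict(new_mail))       # expected to be flagged as spam (1)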

7. Virtual Personal Assistant:


We have various virtual personal assistants such as Google assistant, Alexa, Cortana, Siri. As
the name suggests, they help us in finding the information using our voice instruction. These
assistants can help us in various ways just by our voice instructions such as Play music, call
someone, Open an email, Scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important part.

These assistants record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act accordingly.

8. Online Fraud Detection:


Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a
fraudulent transaction can take place, such as fake accounts, fake IDs, or money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern, which changes for fraudulent transactions; the network detects this change and makes our online transactions more secure.

9. Stock Market trading:


Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.

10. Medical Diagnosis:


In medical science, machine learning is used for disease diagnosis. With it, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:


Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all; machine learning helps by converting the text into a language we know. Google's GNMT (Google Neural Machine Translation) provides this feature: it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm, which, combined with image recognition, translates text from one language to another.
Machine learning Life cycle
Machine learning has given computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? This can be described using the machine learning life cycle. The machine learning life cycle is a cyclic process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.

Machine learning life cycle involves seven major steps, which are given below:

o Gathering data
o Data preparation
o Data wrangling
o Analyse data
o Train the model
o Test the model
o Deployment

The most important thing in the complete process is to understand the problem and to know its purpose. Therefore, before starting the life cycle, we need to understand the problem, because a good result depends on a good understanding of the problem.

In the complete life cycle process, to solve a problem we create a machine learning system called a "model", and this model is created by providing "training". But to train a model we need data; hence, the life cycle starts with collecting data.

1. Gathering Data:

Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify the data-related problems and obtain the required data.

In this step, we need to identify the different data sources, as data can be collected from various
sources such as files, database, internet, or mobile devices. It is one of the most important
steps of the life cycle. The quantity and quality of the collected data will determine the
efficiency of the output. The more will be the data, the more accurate will be the prediction.

This step includes the below tasks:
o Identify various data sources
o Collect data
o Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in further steps.

2. Data preparation

After collecting the data, we need to prepare it for further steps. Data preparation is a step
where we put our data into a suitable place and prepare it to use in our machine learning
training.

In this step, first, we put all data together, and then randomize the ordering of data.

This step can be further divided into two processes:

o Data exploration:
It is used to understand the nature of data that we have to work with. We need to understand
the characteristics, format, and quality of data.
A better understanding of the data leads to an effective outcome. In this step, we find correlations, general trends, and outliers.

o Data pre-processing:
Now the next step is preprocessing of data for its analysis.

3. Data Wrangling

Data wrangling is the process of cleaning and converting raw data into a useable format. It is
the process of cleaning the data, selecting the variable to use, and transforming the data in a
proper format to make it more suitable for analysis in the next step. It is one of the most
important steps of the complete process. Cleaning of data is required to address the quality
issues.

It is not necessary that the data we have collected is always useful, as some of it may not be. In real-world applications, collected data may have various issues, including:

o Missing values
o Duplicate data
o Invalid data
o Noise

So, we use various filtering techniques to clean the data.

It is mandatory to detect and resolve these issues because they can negatively affect the quality of the outcome.
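A minimal, hypothetical sketch of this cleaning step with pandas (the column names and values are invented for illustration):

import pandas as pd

# A tiny raw dataset containing a missing value and a duplicate row.
raw = pd.DataFrame({
    "Country": ["India", "France", "Germany", "Germany"],
    "Age": [38, 43, 40, 40],
    "Salary": [48000, 45000, None, None],
})

clean = raw.drop_duplicates()                              # remove duplicate rows
clean = clean.fillna({"Salary": clean["Salary"].mean()})   # impute the missing salary with the mean

print(clean)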

4. Data Analysis

Now the cleaned and prepared data is passed on to the analysis step. This step involves:

o Selection of analytical techniques
o Building models
o Reviewing the result

The aim of this step is to build a machine learning model to analyze the data using various analytical techniques and review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as classification, regression, cluster analysis, association, etc., then build the model using the prepared data, and finally evaluate the model.

Hence, in this step, we take the data and use machine learning algorithms to build the model.

5. Train Model

The next step is to train the model. In this step, we train our model to improve its performance and obtain a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can understand the various patterns, rules, and features.

6. Test Model

Once our machine learning model has been trained on a given dataset, then we test the model.
In this step, we check for the accuracy of our model by providing a test dataset to it.

Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem.
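A minimal sketch of the train and test steps, assuming scikit-learn; the tiny feature matrix and labels below are placeholders for a real prepared dataset:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical prepared data: each row is [feature1, feature2]; labels are 0/1.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Step 5: train the model on one part of the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 6: test the model on the held-out data to get its percentage accuracy.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.0%}")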

7. Deployment

The last step of machine learning life cycle is deployment, where we deploy the model in the
real-world system.

If the above-prepared model is producing an accurate result as per our requirement with
acceptable speed, then we deploy the model in the real system. But before deploying the project,
we will check whether it is improving its performance using available data or not.
The deployment phase is similar to making the final report for a project.
Difference between Artificial intelligence and Machine learning
Artificial intelligence and machine learning are parts of computer science that are closely related to each other. These two technologies are among the most trending technologies used for creating intelligent systems.

Although they are related and people sometimes use them as synonyms for each other, they are still two different terms in various cases.

On a broad level, we can differentiate both AI and ML as:


AI is a bigger concept to create intelligent machines that can simulate human thinking
capability and behavior, whereas, machine learning is an application or subset of AI that
allows machines to learn from data without being programmed explicitly.

Below are some main differences between AI and machine learning along with the overview of
Artificial intelligence and machine learning.

Artificial Intelligence

Artificial intelligence is a field of computer science which makes a computer system that can
mimic human intelligence. It is comprised of two words "Artificial" and "intelligence", which
means "a human-made thinking power." Hence we can define it as,

Artificial intelligence is a technology using which we can create intelligent systems that can
simulate human intelligence.

An artificial intelligence system does not need to be pre-programmed; instead, it uses algorithms that can work with their own intelligence. These involve machine learning techniques such as reinforcement learning algorithms and deep learning neural networks. AI is used in many places, such as Siri, Google's AlphaGo, AI chess programs, etc.

Based on capabilities, AI can be classified into three types:

o Weak AI
o General AI
o Strong AI

Currently, we are working with weak AI and general AI. The future of AI is strong AI, which is said to be more intelligent than humans.

Machine learning

Machine learning is about extracting knowledge from the data. It can be defined as,

Machine learning is a subfield of artificial intelligence, which enables machines to learn from
past data or experiences without being explicitly programmed.

Machine learning enables a computer system to make predictions or take some decisions using
historical data without being explicitly programmed. Machine learning uses a massive amount of structured and semi-structured data so that a machine learning model can generate accurate results or make predictions based on that data.

Machine learning works on algorithms that learn on their own using historical data. It works only for specific domains: if we create a machine learning model to detect pictures of dogs, it will only give results for dog images; if we provide new data such as a cat image, it will fail to respond usefully. Machine learning is used in various places such as online recommender systems, Google search algorithms, email spam filters, Facebook's auto friend tagging suggestion, etc.

It can be divided into three types:
o Supervised learning
o Unsupervised learning
o Reinforcement learning

Key differences between Artificial Intelligence (AI) and Machine learning (ML):

o Artificial intelligence is a technology which enables a machine to simulate human behavior, whereas machine learning is a subset of AI which allows a machine to automatically learn from past data without being programmed explicitly.
o The goal of AI is to make a smart computer system, like humans, to solve complex problems, whereas the goal of ML is to allow machines to learn from data so that they can give accurate output.
o In AI, we make intelligent systems to perform any task like a human, whereas in ML, we teach machines with data to perform a particular task and give an accurate result.
o Machine learning and deep learning are the two main subsets of AI, whereas deep learning is the main subset of machine learning.
o AI has a very wide range of scope, whereas machine learning has a limited scope.
o AI works to create an intelligent system which can perform various complex tasks, whereas machine learning works to create machines that can perform only the specific tasks for which they are trained.
o An AI system is concerned with maximizing the chances of success, whereas machine learning is mainly concerned with accuracy and patterns.
o The main applications of AI are Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc., whereas the main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.
o On the basis of capabilities, AI can be divided into three types: Weak AI, General AI, and Strong AI, whereas machine learning can be divided into three types: supervised learning, unsupervised learning, and reinforcement learning.
o AI includes learning, reasoning, and self-correction, whereas machine learning includes learning and self-correction when introduced to new data.
o AI deals with structured, semi-structured, and unstructured data, whereas machine learning deals with structured and semi-structured data.

How to get datasets for Machine Learning


The field of machine learning depends heavily on datasets for training models and making accurate predictions. Datasets play a vital role in the success of machine learning projects and are essential for becoming a skilled data scientist. In this section, we will look at the different types of datasets used in machine learning and give a detailed guide on where to find them.

What is a dataset?
A dataset is a collection of data in which data is arranged in some order. A dataset can contain
any data from a series of an array to a database table. Below table shows an example of the
dataset:

Country    Age    Salary    Purchased
India      38     48000     No
France     43     45000     Yes
Germany    30     54000     No
France     48     65000     No
Germany    40               Yes
India      35     58000     Yes

A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable and each row corresponds to a single record of the dataset. The most commonly supported file type for a tabular dataset is the "Comma Separated Values" file, or CSV. But to store "tree-like" data, we can use a JSON file more efficiently.
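As a small illustration (the file names here are hypothetical), both formats can be loaded with pandas:

import pandas as pd

# Tabular data: one row per record, one column per variable.
purchases = pd.read_csv("purchases.csv")    # e.g. the Country/Age/Salary/Purchased table above

# Tree-like (nested) data is more naturally stored as JSON.
customers = pd.read_json("customers.json")

print(purchases.head())
print(customers.head())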

Types of data in datasets

o Numerical data: such as house price, temperature, etc.
o Categorical data: such as Yes/No, True/False, Blue/Green, etc.
o Ordinal data: similar to categorical data, but the categories can be ordered and compared.

Note: A real-world dataset is of huge size, which is difficult to manage and process at the
initial level. Therefore, to practice machine learning algorithms, we can use any dummy
dataset.

Types of datasets

Machine learning spans different domains, each requiring specific kinds of datasets. Some common types of datasets used in machine learning include:

Image Datasets:
Image datasets contain a collection of images and are normally used in computer vision tasks such as image classification, object detection, and image segmentation.

Examples:
o ImageNet
o CIFAR-10
o MNIST

Text Datasets:
Text datasets consist of textual information, such as articles, books, or social media posts. These datasets are used in NLP tasks like sentiment analysis, text classification, and machine translation.

Examples:
o Project Gutenberg dataset
o IMDb film reviews dataset

Time Series Datasets:


Time series datasets consist of data points collected over time. They are generally used in forecasting, anomaly detection, and trend analysis.

Examples:
o Stock market data
o Climate data
o Sensor readings

Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets. They contain rows representing instances or samples and columns representing features or attributes. Tabular datasets are used for tasks such as regression and classification. The dataset given earlier in this article is an example of a tabular dataset.
Need of Dataset

o Properly prepared and pre-processed datasets are essential for machine learning projects.
o They provide the foundation for training accurate and reliable models. However, working with large datasets can present challenges in management and processing.
o To address these challenges, efficient data management techniques and processing algorithms are required.

Data Pre-processing:

Data pre-processing is a fundamental stage in preparing datasets for machine learning. It involves transforming raw data into a format suitable for model training. Common pre-processing steps include data cleaning to remove inconsistencies or errors, normalization to scale data within a specific range, feature scaling to ensure features have comparable ranges, and handling missing values through imputation or removal.
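A minimal sketch of one of these steps, normalization with scikit-learn (the age and salary values are invented for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw features on very different scales: age and salary.
X = np.array([[38, 48000], [43, 45000], [30, 54000], [48, 65000]], dtype=float)

scaler = MinMaxScaler()              # rescale each feature into the [0, 1] range
X_scaled = scaler.fit_transform(X)
print(X_scaled)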

During the development of the ML project, the developers completely rely on the datasets. In
building ML applications, datasets are divided into two parts:

o Training dataset
o Test dataset

Note: The datasets are of large size, so to download these datasets, you must have fast internet
on your computer.

Training Dataset and Test Dataset:

In machine learning, datasets are typically partitioned into two parts: the training dataset and the test dataset. The training dataset is used to train the machine learning model, while the test dataset is used to evaluate the model's performance. This split assesses the model's ability to generalize to unseen data. It is essential to ensure that the datasets are representative of the problem space and appropriately split to avoid bias or overfitting.
Popular sources for Machine Learning datasets

Below is a list of dataset sources which are freely available for the public to work with:

1. Kaggle Datasets

Kaggle is one of the best sources for providing datasets for Data Scientists and Machine
Learners. It allows users to find, download, and publish datasets in an easy way. It also provides
the opportunity to work with other machine learning engineers and solve difficult Data Science
related tasks.

Kaggle provides a high-quality dataset in different formats that we can easily find and
download.

The link for the Kaggle datasets is https://www.kaggle.com/datasets.

2. UCI Machine Learning Repository


The UCI Machine Learning Repository is an important resource that has been widely used by researchers and practitioners since 1987. It contains a large collection of datasets sorted by machine learning task, such as regression, classification, and clustering. Notable datasets in the repository include the Iris dataset, the Car Evaluation dataset, and the Poker Hand dataset.

The link for the UCI Machine Learning Repository is https://archive.ics.uci.edu/ml/index.php.

3. Datasets via AWS


We can search, download, access, and share the datasets that are publicly available via AWS
resources. These datasets can be accessed through AWS resources but are provided and maintained by different government organizations, researchers, businesses, or individuals.

Anyone can analyze and build various services using shared data via AWS resources. The
shared dataset on cloud helps users to spend more time on data analysis rather than on
acquisitions of data.

This source provides the various types of datasets with examples and ways to use the dataset.
It also provides the search box using which we can search for the required dataset. Anyone can
add any dataset or example to the Registry of Open Data on AWS.

The link for the resource is https://registry.opendata.aws/.

4. Google's Dataset Search Engine


Google's Dataset Search helps researchers find and access relevant datasets from various sources across the web. It indexes datasets from areas such as the social sciences, biology, and environmental science. Researchers can use keywords to find datasets, filter results based on specific criteria, and access the datasets directly from the source.

The link for the Google Dataset Search engine is https://toolbox.google.com/datasetsearch.

5. Microsoft Datasets
Microsoft has launched the "Microsoft Research Open Data" repository with a collection of free datasets in various areas such as natural language processing, computer vision, and domain-specific sciences. It gives access to diverse, curated datasets that can be valuable for machine learning projects.

The link to download or use datasets from this resource is https://msropendata.com/.

6. Awesome Public Dataset Collection
Awesome public dataset collection provides high-quality datasets that are arranged in a well-
organized manner within a list according to topics such as Agriculture, Biology, Climate,
Complex networks, etc. Most of the datasets are available free, but some may not, so it is better
to check the license before downloading the dataset.

The link for the Awesome Public Dataset Collection is https://github.com/awesomedata/awesome-public-datasets.

7. Government Datasets
There are different sources to get government-related data. Various countries publish
government data for public use collected by them from different departments.

The goal of providing these datasets is to increase transparency of government work among the
people and to use the data in an innovative approach. Below are some links of government
datasets:

o Indian Government dataset
o US Government dataset
o Northern Ireland Public Sector datasets
o European Union Open Data Portal

8. Computer Vision Datasets

VisualData provides a large number of high-quality datasets that are specific to computer vision tasks such as image classification, video classification, image segmentation, etc. Therefore, if you want to build a project on deep learning or image processing, you can refer to this source.

The link for downloading datasets from this source is https://www.visualdata.io/.

9. Scikit-learn dataset
Scikit-learn, a well-known machine learning library in Python, provides several built-in datasets for practice and experimentation. These datasets are available through the scikit-learn API and can be used for learning different machine learning algorithms. Scikit-learn offers both toy datasets, which are small and simplified, and real-world datasets with greater complexity. Examples of scikit-learn datasets include the Iris dataset, the Boston Housing dataset, and the Wine dataset.

The link to the datasets from this source is https://scikit-learn.org/stable/datasets/index.html.
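A minimal sketch of loading these built-in toy datasets (assuming scikit-learn is installed):

from sklearn.datasets import load_iris, load_wine

iris = load_iris()              # small built-in toy dataset
print(iris.data.shape)          # feature matrix: 150 samples x 4 features
print(iris.target_names)        # the three iris species used as class labels

wine = load_wine()              # another built-in toy dataset
print(wine.data.shape)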

Data Ethics and Privacy:

Data ethics and privacy are critical considerations in machine learning projects. It is essential to ensure that data is collected and used ethically, respecting privacy rights and observing relevant laws and regulations. Data practitioners should take measures to safeguard data privacy, obtain proper consent, and handle sensitive data responsibly. Resources such as ethical guidelines and privacy frameworks can provide guidance on maintaining ethical practices in data collection and use.

Conclusion:

In conclusion, datasets form the foundation of effective machine learning projects. Understanding the different kinds of datasets, the importance of data pre-processing, and the role of training and testing datasets are key steps towards building powerful models. By using well-known sources such as Kaggle, the UCI Machine Learning Repository, AWS, Google's Dataset Search, Microsoft Datasets, and government datasets, data scientists and practitioners can access a wide variety of datasets for their machine learning projects. It is essential to consider data ethics and privacy throughout the entire data lifecycle to ensure responsible and ethical use of data. With the right datasets and ethical practices, machine learning models can make accurate predictions and deliver meaningful insights.

Supervised Machine Learning


Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means that some input data is already tagged with the correct output.

In supervised learning, the training data provided to the machine works as the supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y).

In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.

How Supervised Learning Works?

In supervised learning, models are trained using labelled dataset, where the model learns about
each type of data. Once the training process is completed, the model is tested on the basis of
test data (a subset of the training set), and then it predicts the output.

The working of Supervised learning can be easily understood by the below example and
diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle,
triangle, and Polygon. Now the first step is that we need to train the model for each shape.

o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.

o If the given shape has three sides, then it will be labelled as a triangle.

o If the given shape has six equal sides then it will be labelled as hexagon.

Now, after training, we test our model using the test set, and the task of the model is to identify
the shape.

The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.

Steps Involved in Supervised Learning:

o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the training dataset into a training dataset, test dataset, and validation dataset.
o Determine the input features of the training dataset, which should have enough knowledge so that the model can accurately predict the output.
o Determine the suitable algorithm for the model, such as a support vector machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation sets as control parameters, which are a subset of the training dataset.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, our model is accurate. (A minimal code sketch of these steps follows the list.)
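A minimal, hypothetical sketch of these steps with scikit-learn; the synthetic data and the roughly 60/20/20 split proportions are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Labelled training data (a synthetic stand-in for a real labelled dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Split into training, validation, and test sets (roughly 60/20/20).
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

# Choose a suitable algorithm (here a support vector machine) and train it.
model = SVC().fit(X_train, y_train)

# Use the validation set as a control check, then evaluate accuracy on the test set.
print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))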

Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used if there is a relationship between the input variable and the output variable. They are used for the prediction of continuous variables, such as weather forecasting, market trends, etc. Below are some popular regression algorithms which come under supervised learning (a minimal regression sketch follows the list):
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
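A minimal sketch of a regression algorithm predicting a continuous value (the temperature and sales figures are invented for illustration), assuming scikit-learn:

from sklearn.linear_model import LinearRegression

# Input variable: temperature in °C; continuous output: daily sales (made-up numbers).
X = [[20], [25], [30], [35]]
y = [200, 260, 310, 370]

model = LinearRegression().fit(X, y)
print(model.predict([[28]]))     # predicted sales for an unseen temperature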

2. Classification

Classification algorithms are used when the output variable is categorical, which means the output falls into classes such as Yes-No, Male-Female, True-False, etc. Spam filtering is a typical example. Below are some popular classification algorithms which come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support Vector Machines

Note: We will discuss these algorithms in detail in later chapters.

Advantages of Supervised learning:

o With the help of supervised learning, the model can predict the output on the basis of prior
experiences.

o In supervised learning, we can have an exact idea about the classes of objects.

o Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.

Disadvantages of supervised learning:

o Supervised learning models are not suitable for handling the complex tasks.

o Supervised learning cannot predict the correct output if the test data is different from the
training dataset.

o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.

Unsupervised Machine Learning


In the previous topic, we learned supervised machine learning in which models are trained
using labeled data under the supervision of training data. But there may be many cases in which
we do not have labeled data and need to find the hidden patterns from the given dataset. So, to
solve such types of cases in machine learning, we need unsupervised learning techniques.

What is Unsupervised Learning?

As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a training dataset. Instead, the models themselves find hidden patterns and insights from the given data. It can be compared to the learning that takes place in the human brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained using
unlabeled dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given
dataset, which means it does not have any idea about the features of the dataset. The task of the
unsupervised learning algorithm is to identify the image features on their own.

Unsupervised learning algorithm will perform this task by clustering the image dataset into the
groups according to similarities between images.

Why use Unsupervised Learning?

Below are some main reasons which describe the importance of Unsupervised Learning:

o Unsupervised learning is helpful for finding useful insights from the data.

o Unsupervised learning is quite similar to how a human learns to think from their own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
o In the real world, we do not always have input data with corresponding output, so to solve such cases we need unsupervised learning.

Working of Unsupervised Learning

Working of unsupervised learning can be understood by the below diagram:

Here, we have taken unlabeled input data, which means it is not categorized and corresponding outputs are not given. This unlabeled input data is fed to the machine learning model in order to train it. First, the model interprets the raw data to find hidden patterns in it, and then suitable algorithms such as k-means clustering, hierarchical clustering, etc. are applied.

Once a suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects, as in the minimal sketch below.
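A minimal sketch of this grouping with k-means (the 2-D points are invented; note that no labels are given to the algorithm), assuming scikit-learn:

from sklearn.cluster import KMeans

# Unlabeled input data: six 2-D points forming two natural groups.
X = [[1, 2], [1, 4], [1, 0],
     [10, 2], [10, 4], [10, 0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # group assignments found purely from similarities
print(kmeans.cluster_centers_)   # the two discovered cluster centres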

Types of Unsupervised Learning Algorithm:

The unsupervised learning algorithm can be further categorized into two types of problems:

o Clustering: Clustering is a method of grouping the objects into clusters such that objects with
most similarities remains into a group and has less or no similarities with the objects of another
group. Cluster analysis finds the commonalities between the data objects and categorizes them
as per the presence and absence of those commonalities.

o Association: An association rule is an unsupervised learning method which is used for finding
the relationships between variables in a large database. It determines the set of items that occur together in the dataset. Association rules make marketing strategies more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A typical example of an association rule application is Market Basket Analysis.

Note: We will learn these algorithms in later chapters.

Unsupervised Learning algorithms:

Below is the list of some popular unsupervised learning algorithms:

o K-means clustering
o KNN (k-nearest neighbours)
o Hierarchical clustering
o Anomaly detection
o Neural networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Advantages of Unsupervised Learning

o Unsupervised learning is used for more complex tasks as compared to supervised learning
because, in unsupervised learning, we don't have labeled input data.

o Unsupervised learning is preferable, as it is easy to get unlabeled data in comparison to labeled data.

Disadvantages of Unsupervised Learning

o Unsupervised learning is intrinsically more difficult than supervised learning as it does not have
corresponding output.

o The result of the unsupervised learning algorithm might be less accurate as input data is not
labeled, and algorithms do not know the exact output in advance.
Difference between Supervised and Unsupervised Learning
Supervised and unsupervised learning are the two main techniques of machine learning, but they are used in different scenarios and with different datasets. Below, both learning methods are explained, along with a table of their differences.

Supervised Machine Learning:

Supervised learning is a machine learning method in which models are trained using labeled
data. In supervised learning, models need to find the mapping function to map the input variable
(X) with the output variable (Y).

Supervised learning needs supervision to train the model, which is similar to as a student learns
things in the presence of a teacher. Supervised learning can be used for two types of problems:
Classification and Regression.


Example: Suppose we have images of different types of fruits. The task of our supervised learning model is to identify the fruits and classify them accordingly. To identify the images in supervised learning, we give the input data as well as the output for it, which means we train the model with the shape, size, colour, and taste of each fruit. Once the training is completed, we test the model by giving it a new set of fruit images. The model identifies the fruit and predicts the output using a suitable algorithm.

Unsupervised Machine Learning:

Unsupervised learning is another machine learning method, in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any supervision; instead, it finds patterns in the data on its own.


Unsupervised learning can be used for two types of problems: Clustering and Association.

Example: To understand the unsupervised learning, we will use the example given above. So
unlike supervised learning, here we will not provide any supervision to the model. We will just
provide the input dataset to the model and allow the model to find the patterns from the data.
With the help of a suitable algorithm, the model will train itself and divide the fruits into
different groups according to the most similar features between them.
The main differences between Supervised and Unsupervised learning are given below:

o Supervised learning algorithms are trained using labelled data, whereas unsupervised learning algorithms are trained using unlabelled data.
o A supervised learning model takes direct feedback to check whether it is predicting the correct output or not, whereas an unsupervised learning model does not take any feedback.
o A supervised learning model predicts the output, whereas an unsupervised learning model finds the hidden patterns in data.
o In supervised learning, input data is provided to the model along with the output, whereas in unsupervised learning, only input data is provided to the model.
o The goal of supervised learning is to train the model so that it can predict the output when it is given new data, whereas the goal of unsupervised learning is to find hidden patterns and useful insights from the dataset.
o Supervised learning needs supervision to train the model, whereas unsupervised learning does not need any supervision to train the model.
o Supervised learning can be categorized into Classification and Regression problems, whereas unsupervised learning can be classified into Clustering and Association problems.
o Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs, whereas unsupervised learning can be used for cases where we have only input data and no corresponding output data.
o A supervised learning model produces an accurate result, whereas an unsupervised learning model may give a less accurate result compared to supervised learning.
o Supervised learning is not close to true Artificial Intelligence, as we first train the model for each type of data and only then can it predict the correct output, whereas unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things from experience.
o Supervised learning includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc., whereas unsupervised learning includes algorithms such as Clustering, KNN, and the Apriori algorithm.

Note: Supervised and unsupervised learning are both machine learning methods, and the choice between them depends on factors related to the structure and volume of your dataset and the use case of the problem.

What is Reinforcement Learning?

o Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and seeing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.

o In reinforcement learning, the agent learns automatically using feedback, without any labeled data, unlike supervised learning.

o Since there is no labeled data, the agent is bound to learn from its experience only.

o RL solves a specific type of problem where decision making is sequential, and the goal is long-
term, such as game-playing, robotics, etc.

o The agent interacts with the environment and explores it by itself. The primary goal of an agent
in reinforcement learning is to improve the performance by getting the maximum positive
rewards.
o The agent learns through a process of trial and error, and based on that experience, it learns to perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it." How a robotic dog learns the movement of its arms is an example of reinforcement learning.
o It is a core part of Artificial Intelligence, and all AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.

o Example: Suppose there is an AI agent present within a maze environment, and his goal is to
find the diamond. The agent interacts with the environment by performing some actions, and
based on those actions, the state of the agent gets changed, and it also receives a reward or
penalty as feedback.

o The agent continues doing these three things (take action, change state/remain in the same
state, and get feedback), and by doing these actions, he learns and explores the environment.

o The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalties. As a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.

Terms used in Reinforcement Learning

o Agent(): An entity that can perceive/explore the environment and act upon it.

o Environment(): The situation in which an agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
o Action(): Actions are the moves taken by an agent within the environment.

o State(): State is a situation returned by the environment after each action taken by the agent.

o Reward(): A feedback returned to the agent from the environment to evaluate the action of the
agent.

o Policy(): Policy is a strategy applied by the agent for the next action based on the current state.

o Value(): The expected long-term return with the discount factor, as opposed to the short-term reward.
o Q-value(): It is mostly similar to the value, but it takes one additional parameter as a current
action (a).

Key Features of Reinforcement Learning

o In RL, the agent is not instructed about the environment and what actions need to be taken.

o It is based on a trial-and-error process.

o The agent takes the next action and changes state according to the feedback from the previous action.
o The agent may get a delayed reward.

o The environment is stochastic, and the agent needs to explore it in order to get the maximum positive rewards.

Approaches to implement Reinforcement Learning

There are mainly three ways to implement reinforcement-learning in ML, which are:

1. Value-based:
The value-based approach is about finding the optimal value function, which is the maximum value at a state under any policy. Therefore, the agent expects the long-term return at any state(s) under policy π.

2. Policy-based:
The policy-based approach tries to find the optimal policy for the maximum future rewards without using the value function. In this approach, the agent tries to apply a policy such that the action performed at each step helps to maximize the future reward. The policy-based approach has mainly two types of policy:

o Deterministic: The same action is produced by the policy (π) at any state.
o Stochastic: In this policy, probability determines the produced action.

3. Model-based: In the model-based approach, a virtual model is created for the environment,
and the agent explores that environment to learn it. There is no particular solution or
algorithm for this approach because the model representation is different for each
environment.

Elements of Reinforcement Learning

There are four main elements of Reinforcement Learning, which are given below:

1. Policy

2. Reward Signal

3. Value Function

4. Model of the environment

1) Policy: A policy can be defined as a way how an agent behaves at a given time. It maps
the perceived states of the environment to the actions taken on those states. A policy is the core
element of the RL as it alone can define the behavior of the agent. In some cases, it may be a
simple function or a lookup table, whereas, for other cases, it may involve general computation
as a search process. It could be deterministic or a stochastic policy:

For a deterministic policy: a = π(s)

For a stochastic policy: π(a | s) = P[A_t = a | S_t = s]
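As a small, hypothetical illustration of these two policy types (the states, actions, and probabilities below are invented for a toy example):

import random

# Deterministic policy: a = π(s), one fixed action per state.
deterministic_policy = {"s1": "left", "s2": "right"}

# Stochastic policy: π(a | s) = P[A_t = a | S_t = s], a probability distribution over actions.
stochastic_policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.3, "right": 0.7},
}

def sample_action(state):
    # Draw an action according to the stochastic policy's probabilities.
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s1"])   # always 'left'
print(sample_action("s1"))          # 'left' about 80% of the time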

2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At
each state, the environment sends an immediate signal to the learning agent, and this signal is
known as a reward signal. These rewards are given according to the good and bad actions
taken by the agent. The agent's main objective is to maximize the total number of rewards for
good actions. The reward signal can change the policy, such as if an action selected by the agent
leads to low reward, then the policy may change to select other actions in the future.
3) Value Function: The value function gives information about how good the situation
and action are and how much reward an agent can expect. A reward indicates the immediate
signal for each good and bad action, whereas a value function specifies the good state and
action for the future. The value function depends on the reward as, without reward, there
could be no value. The goal of estimating values is to achieve more rewards.

4) Model: The last element of reinforcement learning is the model, which mimics the
behavior of the environment. With the help of the model, one can make inferences about how
the environment will behave. Such as, if a state and an action are given, then a model can
predict the next state and reward.

The model is used for planning, which means it provides a way to take a course of action by
considering all future situations before actually experiencing those situations. The approaches
for solving the RL problems with the help of the model are termed as the model-based
approach. Comparatively, an approach without using a model is called a model-free
approach.

Classification Algorithm in Machine Learning


As we know, supervised machine learning algorithms can be broadly classified into regression and classification algorithms. Regression algorithms predict the output for continuous values, but to predict categorical values, we need classification algorithms.

What is the Classification Algorithm?

The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In Classification, a program learns
from the given dataset or observations and then classifies new observation into a number of
classes or groups. Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can
be called as targets/labels or categories.

Unlike regression, the output variable of Classification is a category, not a value, such as "Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised Learning technique, it takes labeled input data, which means the input comes with the corresponding output.

In a classification algorithm, a discrete output function (y) is mapped to an input variable (x):

y = f(x), where y = categorical output


The best example of an ML classification algorithm is Email Spam Detector.

The main goal of the Classification algorithm is to identify the category of a given dataset, and
these algorithms are mainly used to predict the output for the categorical data.

Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, class A and Class B. These classes have features that are similar
to each other and dissimilar to other classes.

The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.

o Multi-class Classifier: If a classification problem has more than two outcomes, then it is called
as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.

Learners in Classification Problems:

In the classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. Classification is then done on the basis of the most related data stored in the training dataset. Lazy learners take less time in training but more time in prediction.
Example: K-NN algorithm, Case-based reasoning

2. Eager Learners: Eager learners develop a classification model from the training dataset before receiving the test dataset. Opposite to lazy learners, eager learners take more time in learning and less time in prediction.
Example: Decision Trees, Naïve Bayes, ANN

Types of ML Classification Algorithms:

Classification algorithms can be broadly divided into two categories:

o Linear Models
  o Logistic Regression
  o Support Vector Machines

o Non-linear Models
  o K-Nearest Neighbours
  o Kernel SVM
  o Naïve Bayes
  o Decision Tree Classification
  o Random Forest Classification

Note: We will learn the above algorithms in later chapters.

Evaluating a Classification model:

Once our model is complete, it is necessary to evaluate its performance, whether it is a Classification or a Regression model. A Classification model can be evaluated in the following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.

o For a good binary classification model, the value of log loss should be near 0.
o The value of log loss increases as the predicted probability deviates from the actual value.
o A lower log loss represents higher accuracy of the model.
o For binary classification, the cross-entropy for a single prediction can be calculated as:

   −(y log(p) + (1 − y) log(1 − p)), where y = actual output and p = predicted probability.
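The formula above can be turned into a few lines of Python. The sketch below is only illustrative: the y_true labels and y_prob probabilities are made-up values, not outputs of any model in this tutorial.

# Binary cross-entropy (log loss) computed directly from the formula above.
import numpy as nm

def binary_log_loss(y_true, y_prob, eps=1e-15):
    y_prob = nm.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return -nm.mean(y_true * nm.log(y_prob) + (1 - y_true) * nm.log(1 - y_prob))

y_true = nm.array([1, 0, 1, 1])
y_prob = nm.array([0.9, 0.2, 0.8, 0.6])
print(binary_log_loss(y_true, y_prob))   # about 0.27, i.e. close to 0 (a reasonably good model)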

2. Confusion Matrix:

o The confusion matrix provides us a matrix/table as output and describes the performance of the
model.

o It is also known as the error matrix.


The matrix summarizes the prediction results, showing the total number of correct and incorrect predictions. It looks like the table below:

                        Actual Positive      Actual Negative

Predicted Positive      True Positive        False Positive

Predicted Negative      False Negative       True Negative

3. AUC-ROC curve:

o ROC stands for Receiver Operating Characteristic curve, and AUC stands for Area Under the Curve.

o It is a graph that shows the performance of the classification model at different thresholds.

o The AUC-ROC curve is used to visualize the performance of a classification model.

o The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False Positive Rate) on the X-axis.
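As a brief sketch of how the curve is obtained in practice, scikit-learn's roc_curve and roc_auc_score functions can be used. The y_true labels and y_score probabilities below are made up for illustration only.

# Plotting an ROC curve and computing AUC for a small set of example scores.
import matplotlib.pyplot as mtp
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # FPR on X-axis, TPR on Y-axis
print("AUC:", roc_auc_score(y_true, y_score))

mtp.plot(fpr, tpr, label="ROC curve")
mtp.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
mtp.xlabel("False Positive Rate")
mtp.ylabel("True Positive Rate")
mtp.legend()
mtp.show()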

Use cases of Classification Algorithms

Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:

o Email Spam Detection
o Speech Recognition
o Identification of Cancer tumor cells
o Drugs Classification
o Biometric Identification, etc.

Logistic Regression in Machine Learning

o Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting a categorical dependent variable using a given set of independent variables.

o Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value, such as Yes or No, 0 or 1, True or False, etc.; but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.

o Logistic Regression is much like Linear Regression except in how it is used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.

o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).

o The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.

o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.

o Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below image
is showing the logistic function:

Note: Logistic regression uses the concept of predictive modeling like regression, which is why it is called logistic regression; however, because it is used to classify samples, it falls under the classification algorithms.
Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to probabilities.
o It maps any real value into another value within the range of 0 and 1.

o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or
the logistic function.

o In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
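A minimal sketch of the sigmoid function and the threshold idea is given below; the z values are arbitrary examples, not outputs of a trained model.

# The sigmoid function maps any real value into (0, 1); a 0.5 threshold then
# converts the probabilities into class labels.
import numpy as nm

def sigmoid(z):
    return 1 / (1 + nm.exp(-z))

z = nm.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)       # values above the threshold tend to 1
print(probs)    # approximately [0.018 0.269 0.5 0.731 0.982]
print(labels)   # [0 0 1 1 1]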

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.

Logistic Regression Equation:

The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical steps to get the Logistic Regression equation are given below:

o We know the equation of a straight line can be written as:

   y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above expression by (1 − y):

   y / (1 − y);  0 for y = 0 and infinity for y = 1

o But we need a range between -infinity and +infinity, so taking the logarithm of the expression gives:

   log[ y / (1 − y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.

o Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".

o Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "Medium", or "High".

Python Implementation of Logistic Regression (Binomial)

To understand the implementation of Logistic Regression in Python, we will use the below
example:

Example: We are given a dataset containing information about various users, obtained from a social networking site. A car-making company has recently launched a new SUV, and the company wants to check how many users from the dataset want to purchase the car.

For this problem, we will build a Machine Learning model using the Logistic regression
algorithm. The dataset is shown in the below image. In this problem, we will predict the
purchased variable (Dependent Variable) by using age and salary (Independent
variables).

Steps in Logistic Regression: To implement the Logistic Regression using Python, we will
use the same steps as we have done in previous topics of Regression. Below are the steps:

o Data Pre-processing step
o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result (creation of the Confusion matrix)
o Visualizing the test set result

1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use
it in our code efficiently. It will be the same as we have done in Data pre-processing topic. The
code for this is given below:

1. #Data Pre-processing Step
2. # importing libraries
3. import numpy as nm

4. import matplotlib.pyplot as mtp
5. import pandas as pd
6.
7. #importing datasets
8. data_set= pd.read_csv('user_data.csv')
By executing the above lines of code, we will get the dataset as the output. Consider the given
image:

Now, we will extract the dependent and independent variables from the given dataset. Below
is the code for it:

1. #Extracting Independent and dependent Variable

2. x= data_set.iloc[:, [2,3]].values
3. y= data_set.iloc[:, 4].values
In the above code, we have taken [2, 3] for x because our independent variables are age and
salary, which are at index 2, 3. And we have taken 4 for y variable because our dependent
variable is at index 4. The output will be:

Now we will split the dataset into a training set and test set. Below is the code for it:

1. # Splitting the dataset into training and test set.


2. from sklearn.model_selection import train_test_split
3. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

The output for this is given below:

For test set:

For training set:

In logistic regression, we will do feature scaling because we want accurate prediction results. Here we will only scale the independent variables, because the dependent variable has only 0 and 1 values. Below is the code for it:

1. #feature Scaling
2. from sklearn.preprocessing import StandardScaler
3. st_x= StandardScaler()
4. x_train= st_x.fit_transform(x_train)
5. x_test= st_x.transform(x_test)

The scaled output is given below:

2. Fitting Logistic Regression to the Training set:

Our dataset is now well prepared, so we will train the model using the training set. To fit the model to the training set, we will import the LogisticRegression class of the sklearn library.

After importing the class, we will create a classifier object and use it to fit the model to the
logistic regression. Below is the code for it:

1. #Fitting Logistic Regression to the training set


2. from sklearn.linear_model import LogisticRegression
3. classifier= LogisticRegression(random_state=0)
4. classifier.fit(x_train, y_train)
Output: By executing the above code, we will get the below output:

Out[5]:

1. LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
2. intercept_scaling=1, l1_ratio=None, max_iter=100,
3. multi_class='warn', n_jobs=None, penalty='l2',
4. random_state=0, solver='warn', tol=0.0001, verbose=0,
5. warm_start=False)
Hence our model is well fitted to the training set.

3. Predicting the Test Result

Our model is well trained on the training set, so we will now predict the result by using test set
data. Below is the code for it:

1. #Predicting the test set result


2. y_pred= classifier.predict(x_test)
In the above code, we have created a y_pred vector to predict the test set result.

Output: By executing the above code, a new vector (y_pred) will be created under the variable
explorer option. It can be seen as:

The above output image shows the corresponding predicted users who want to purchase or not
purchase the car.

4. Test Accuracy of the result

Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need to import the confusion_matrix function of the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function takes two main parameters: y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:

1. #Creating the Confusion matrix
2. from sklearn.metrics import confusion_matrix
3. cm= confusion_matrix(y_test, y_pred)

Output:

By executing the above code, a new confusion matrix will be created. Consider the below
image:

We can find the accuracy of the predicted result by interpreting the confusion matrix. By above
output, we can interpret that 65+24= 89 (Correct Output) and 8+3= 11(Incorrect Output).
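As a short follow-up sketch, assuming the cm, y_test and y_pred variables created in the steps above, the accuracy can also be computed directly:

# Accuracy = correct predictions / all predictions, read off the confusion matrix.
correct   = cm[0, 0] + cm[1, 1]          # true negatives + true positives
incorrect = cm[0, 1] + cm[1, 0]          # false positives + false negatives
accuracy  = correct / cm.sum()
print(correct, incorrect, accuracy)      # e.g. 89, 11, 0.89 for this example

# Equivalently, scikit-learn computes it in one call:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))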

5. Visualizing the training set result

Finally, we will visualize the training set result. To visualize the result, we will use the ListedColormap class of the matplotlib library. Below is the code for it:

1. #Visualizing the training set result


2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_train, y_train

4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step
=0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1
.shape),
7. alpha = 0.75, cmap = ListedColormap(('purple','green' )))
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('purple', 'green'))(i), label = j)
13. mtp.title('Logistic Regression (Training set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()
In the above code, we have imported the ListedColormap class of the Matplotlib library to create the colormap for visualizing the result. We have created two new variables x_set and y_set to replace x_train and y_train. After that, we have used the nm.meshgrid command to create a rectangular grid that ranges from each feature's minimum value minus 1 to its maximum value plus 1, with a resolution (step) of 0.01.

To create a filled contour, we have used the mtp.contourf command; it will create regions of the provided colors (purple and green). In this function, we have passed classifier.predict to show the regions predicted by the classifier.

Output: By executing the above code, we will get the below output:

The graph can be explained in the below points:

o In the above graph, we can see that there are some Green points within the green region and
Purple points within the purple region.
o All these data points are the observation points from the training set, which shows the result
for purchased variables.

o This graph is made by using two independent variables i.e., Age on the x-axis and Estimated
salary on the y-axis.

o The purple point observations are for which purchased (dependent variable) is probably 0,
i.e., users who did not purchase the SUV car.

o The green point observations are for which purchased (dependent variable) is probably 1
means user who purchased the SUV car.

o We can also estimate from the graph that younger users with a low salary did not purchase the car, whereas older users with a high estimated salary did purchase the car.

o However, there are some purple points in the green region (buying the car) and some green points in the purple region (not buying the car); these are the observations that the classifier has misclassified.

The goal of the classifier:

We have successfully visualized the training set result for the logistic regression, and our goal
for this classification is to divide the users who purchased the SUV car and who did not
purchase the car. So from the output graph, we can clearly see the two regions (Purple and
Green) with the observation points. The Purple region is for those users who didn't buy the car,
and Green Region is for those users who purchased the car.

Linear Classifier:

As we can see from the graph, the classifier is a straight line, i.e., linear in nature, because we have used a linear model for Logistic Regression. In further topics, we will learn about non-linear classifiers.

Visualizing the test set result:

Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (Test set). The code for the test set will remain same as above except that here we
will use x_test and y_test instead of x_train and y_train. Below is the code for it:

1. #Visualizing the test set result


2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_test, y_test
4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step
=0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1
.shape),
7. alpha = 0.75, cmap = ListedColormap(('purple','green' )))
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],

12. c = ListedColormap(('purple', 'green'))(i), label = j)
13. mtp.title('Logistic Regression (Test set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()

Output:

The above graph shows the test set result. As we can see, the graph is divided into two regions
(Purple and Green). And Green observations are in the green region, and Purple observations
are in the purple region. So we can say it is a good prediction and model. Some of the green
and purple data points are in different regions, which can be ignored as we have already
calculated this error using the confusion matrix (11 Incorrect output).

Hence our model is pretty good and ready to make new predictions for this classification
problem.

K-Nearest Neighbor (KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.

o The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.

o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification, but it is mostly used for Classification problems.

o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.

o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an action
on the dataset.

o KNN algorithm at the training phase just stores the dataset and when it gets new data, then it
classifies that data into a category that is much similar to the new data.

o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data
point x1, so this data point will lie in which of these categories. To solve this type of problem,
we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class
of a particular dataset. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance to the data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the
below image:

o Firstly, we will choose the number of neighbors, so we will choose k=5.

o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. Between two points (x1, y1) and (x2, y2) it can be calculated as:

   d = sqrt((x2 − x1)² + (y2 − y1)²)

o By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors
in category A and two nearest neighbors in category B. Consider the below image:

o As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.
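The steps above can also be written as a compact NumPy sketch. This is only an illustration of the procedure, not the scikit-learn implementation used later in this topic; the tiny 2-D dataset and the function name knn_predict are made up.

# K-NN by hand: Euclidean distances, pick the k nearest, majority vote.
import numpy as nm
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=5):
    distances = nm.sqrt(((X_train - new_point) ** 2).sum(axis=1))  # Step 2
    nearest = nm.argsort(distances)[:k]                            # Step 3
    votes = Counter(y_train[nearest])                              # Step 4
    return votes.most_common(1)[0][0]                              # Step 5

X_train = nm.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]])
y_train = nm.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, nm.array([2, 2]), k=3))        # -> "A"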

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try several values to find the best one. The most commonly preferred value for K is 5.

o A very low value for K, such as K=1 or K=2, can be noisy and sensitive to the effects of outliers in the model.

o Larger values for K smooth out the decision boundary, but they can make class boundaries less distinct and increase computation.

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o It always needs a value of K to be determined, which can be complex at times.
o The computation cost is high because of calculating the distance between the new data point and all the training samples.

Python implementation of the KNN algorithm

To do the Python implementation of the K-NN algorithm, we will use the same problem and
dataset which we have used in Logistic Regression. But here we will improve the performance
of the model. Below is the problem description:

Problem for the K-NN Algorithm: A car manufacturer has produced a new SUV and wants to show ads to the users who are most likely to buy it. For this problem, we have a dataset that contains information about multiple users collected from a social network. The dataset contains lots of information, but we will use Estimated Salary and Age as the independent variables and Purchased as the dependent variable. Below is the dataset:

Steps to implement the K-NN algorithm:

o Data Pre-processing step
o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (creation of the Confusion matrix)
o Visualizing the test set result

Data Pre-Processing Step:

The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is
the code for it:

1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd

5.
6. #importing datasets
7. data_set= pd.read_csv('user_data.csv')
8.
9. #Extracting Independent and dependent Variable
10. x= data_set.iloc[:, [2,3]].values
11. y= data_set.iloc[:, 4].values
12.
13. # Splitting the dataset into training and test set.
14. from sklearn.model_selection import train_test_split
15. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

16.
17. #feature Scaling
18. from sklearn.preprocessing import StandardScaler
19. st_x= StandardScaler()
20. x_train= st_x.fit_transform(x_train)
21. x_test= st_x.transform(x_test)
By executing the above code, our dataset is imported to our program and well pre-processed.
After feature scaling our test dataset will look like:

From the above output image, we can see that our data is successfully scaled.

o Fitting the K-NN classifier to the Training data:

Now we will fit the K-NN classifier to the training data. To do this we will import the KNeighborsClassifier class of the sklearn.neighbors library. After importing the class, we will create the classifier object of the class. The parameters of this class are:

o n_neighbors: defines the required number of neighbors for the algorithm. Usually it is set to 5.
o metric='minkowski': this is the default parameter, and it decides how the distance between the points is measured.
o p=2: with the Minkowski metric, this is equivalent to the standard Euclidean metric.

And then we will fit the classifier to the training data. Below is the code for it:

1. #Fitting K-NN classifier to the training set


2. from sklearn.neighbors import KNeighborsClassifier
3. classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )

4. classifier.fit(x_train, y_train)
Output: By executing the above code, we will get the output as:

Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we
did in Logistic Regression. Below is the code for it:

1. #Predicting the test set result


2. y_pred= classifier.predict(x_test)

Output:

The output for the above code will be:

o Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to see the accuracy of the
classifier. Below is the code for it:
1. #Creating the Confusion matrix
2. from sklearn.metrics import confusion_matrix
3. cm= confusion_matrix(y_test, y_pred)
In above code, we have imported the confusion_matrix function and called it using the variable
cm.

Output: By executing the above code, we will get the matrix as below:

In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7 incorrect
predictions, whereas, in Logistic Regression, there were 11 incorrect predictions. So we can
say that the performance of the model is improved by using the K-NN algorithm.

o Visualizing the Training set result:


Now, we will visualize the training set result for K-NN model. The code will remain same as
we did in Logistic Regression, except the name of the graph. Below is the code for it:

1. #Visualizing the training set result


2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_train, y_train
4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step
=0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1
.shape),

7. alpha = 0.75, cmap = ListedColormap(('red','green' )))
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('red', 'green'))(i), label = j)
13. mtp.title('K-NN Algorithm (Training set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()

Output:

By executing the above code, we will get the below graph:

The output graph is different from the graph that we obtained in Logistic Regression. It can be understood from the points below:

o As we can see, the graph shows red points and green points. The green points are for Purchased (1) and the red points for Not Purchased (0).

o The graph shows an irregular boundary instead of a straight line or a curve, because the K-NN algorithm classifies by finding the nearest neighbors.

o The graph has classified users into the correct categories, as most of the users who didn't buy the SUV are in the red region and the users who bought the SUV are in the green region.

o The graph shows a good result, but there are still some green points in the red region and some red points in the green region. This is not a big issue, as it keeps the model from overfitting.

o Hence our model is well trained.


o Visualizing the Test set result:
After the training of the model, we will now test the result by putting a new dataset,
i.e., Test dataset. Code remains the same except some minor changes: such as x_train and
y_train will be replaced by x_test and y_test. Below is the code for it:

1. #Visualizing the test set result


2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_test, y_test
4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step
=0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1
.shape),
7. alpha = 0.75, cmap = ListedColormap(('red','green' )))
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('red', 'green'))(i), label = j)
13. mtp.title('K-NN algorithm(Test set)')
14. mtp.xlabel('Age')

15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()

Output:

The above graph shows the output for the test data set. As we can see in the graph, the predicted output is quite good, as most of the red points are in the red region and most of the green points are in the green region.

However, there are a few green points in the red region and a few red points in the green region. These are the incorrect observations that we already counted in the confusion matrix (7 incorrect outputs).

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using
a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of a dog. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. Since the support vectors define a decision boundary between the two classes (cat and dog) from the extreme cases, the model will compare the creature with those extreme cases and, on that basis, classify it as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.

o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-


dimensional space, but we need to find out the best decision boundary that helps to classify the
data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features in the dataset: if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a two-dimensional plane.

We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of either class.

Support Vectors:

The data points or vectors that are the closest to the hyperplane and which affect the position
of the hyperplane are termed as Support Vector. Since these vectors support the hyperplane,
hence called a Support vector.

How does SVM works?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have
a dataset that has two tags (green and blue), and the dataset has two features x1 and x2.
We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Consider the below image:

So as it is 2-d space so by just using a straight line, we can easily separate these two classes.
But there can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points of both classes that are closest to the line; these points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
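As a brief sketch, once a linear-kernel SVC has been fitted (as is done in the Python implementation later in this topic) and stored in a variable named classifier, the support vectors and the margin can be read off the fitted model:

# Support vectors and margin width of a fitted linear SVC.
import numpy as nm

support_vectors = classifier.support_vectors_   # the points closest to the hyperplane
w = classifier.coef_[0]                         # normal vector of the hyperplane (linear kernel only)
b = classifier.intercept_[0]
margin = 2 / nm.linalg.norm(w)                  # width between the two margin lines

print(support_vectors.shape)                    # (number of support vectors, number of features)
print("hyperplane: w.x + b = 0, margin width =", margin)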

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be
calculated as:

z = x^2 + y^2

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:

Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert
it in 2d space with z=1, then it will become as:

Hence we get a circumference of radius 1 in case of non-linear data.
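The idea above can be sketched in a few lines: lift the 2-D points with z = x^2 + y^2 so that a linear separator works, which is what a non-linear kernel (for example the RBF kernel) does implicitly. The small point arrays below are made up for illustration only.

# Explicit feature mapping versus an implicit non-linear kernel.
import numpy as nm
from sklearn.svm import SVC

X = nm.array([[0.1, 0.2], [-0.2, 0.1], [0.0, -0.3],      # inner class
              [1.5, 1.4], [-1.6, 1.2], [1.3, -1.5]])     # outer class
y = nm.array([0, 0, 0, 1, 1, 1])

# Explicit mapping: add the third dimension z = x^2 + y^2, then use a linear SVM
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X3d = nm.hstack([X, z])
print(SVC(kernel='linear').fit(X3d, y).score(X3d, y))    # training accuracy in 3-D

# Implicit mapping: a non-linear (RBF) kernel on the original 2-D data
print(SVC(kernel='rbf').fit(X, y).score(X, y))           # training accuracy in 2-D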

Python Implementation of Support Vector Machine

Now we will implement the SVM algorithm using Python. Here we will use the same dataset, user_data, which we have used in Logistic Regression and KNN classification.

o Data Pre-processing step

Till the Data pre-processing step, the code will remain the same. Below is the code:

1. #Data Pre-processing Step


2. # importing libraries
3. import numpy as nm
4. import matplotlib.pyplot as mtp
5. import pandas as pd
6.
7. #importing datasets

8. data_set= pd.read_csv('user_data.csv')
9.
10. #Extracting Independent and dependent Variable
11. x= data_set.iloc[:, [2,3]].values
12. y= data_set.iloc[:, 4].values
13.
14. # Splitting the dataset into training and test set.
15. from sklearn.model_selection import train_test_split
16. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

17. #feature Scaling


18. from sklearn.preprocessing import StandardScaler
19. st_x= StandardScaler()
20. x_train= st_x.fit_transform(x_train)
21. x_test= st_x.transform(x_test)
After executing the above code, we will pre-process the data. The code will give the dataset as:

The scaled output for the test set will be:

Fitting the SVM classifier to the training set:

Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will
import SVC class from Sklearn.svm library. Below is the code for it:

1. from sklearn.svm import SVC # "Support vector classifier"


2. classifier = SVC(kernel='linear', random_state=0)
3. classifier.fit(x_train, y_train)
In the above code, we have used kernel='linear', as here we are creating an SVM for linearly separable data; we could change the kernel for non-linear data. We then fitted the classifier to the training dataset (x_train, y_train).

Output:

Out[8]:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr',
degree=3, gamma='auto_deprecated', kernel='linear', max_iter=-1, probability=False,
random_state=0,
shrinking=True, tol=0.001, verbose=False)
The model performance can be altered by changing the value of C(Regularization factor),
gamma, and kernel.

o Predicting the test set result:


Now, we will predict the output for test set. For this, we will create a new vector y_pred. Below
is the code for it:
1. #Predicting the test set result
2. y_pred= classifier.predict(x_test)
After getting the y_pred vector, we can compare the result of y_pred and y_test to check the
difference between the actual value and predicted value.

Output: Below is the output for the prediction of the test set:

o Creating the confusion matrix:
Now we will check how many incorrect predictions the SVM classifier makes compared to the Logistic Regression classifier. To create the confusion matrix, we need to import the confusion_matrix function of the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function takes two main parameters: y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:

1. #Creating the Confusion matrix

2. from sklearn.metrics import confusion_matrix
3. cm= confusion_matrix(y_test, y_pred)

Output:

As we can see in the above output image, there are 66+24= 90 correct predictions and 8+2= 10 incorrect predictions. Therefore, we can say that our SVM model improved as compared to the Logistic Regression model.

o Visualizing the training set result:


Now we will visualize the training set result, below is the code for it:

1. from matplotlib.colors import ListedColormap


2. x_set, y_set = x_train, y_train
3. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step
=0.01),
4. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
5. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1
.shape),

6. alpha = 0.75, cmap = ListedColormap(('red', 'green')))
7. mtp.xlim(x1.min(), x1.max())
8. mtp.ylim(x2.min(), x2.max())
9. for i, j in enumerate(nm.unique(y_set)):
10. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
11. c = ListedColormap(('red', 'green'))(i), label = j)
12. mtp.title('SVM classifier (Training set)')
13. mtp.xlabel('Age')
14. mtp.ylabel('Estimated Salary')
15. mtp.legend()
16. mtp.show()

By executing the above code, we will get the output as:

As we can see, the above output appears similar to the Logistic Regression output. In the output, we got a straight line as the hyperplane because we have used a linear kernel in the classifier. We have also discussed above that for 2-D space, the hyperplane in SVM is a straight line.

o Visualizing the test set result:

1. #Visualizing the test set result
2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_test, y_test
4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step
=0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1
.shape),
7. alpha = 0.75, cmap = ListedColormap(('red','green' )))
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('red', 'green'))(i), label = j)
13. mtp.title('SVM classifier (Test set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()

By executing the above code, we will get the output as:


As we can see in the above output image, the SVM classifier has divided the users into two regions (purchased or not purchased). Users who purchased the SUV are in the green region with green scatter points, and users who did not purchase the SUV are in the red region with red scatter points. The hyperplane has divided the two classes into the purchased and not purchased variable.

Naïve Bayes Classifier Algorithm

o The Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems.

o It is mainly used in text classification with a high-dimensional training dataset.

o The Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.

o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.

o Some popular examples of the Naïve Bayes algorithm are spam filtration, sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as follows:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.

o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' Law and is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.
o The formula for Bayes' theorem is given as:

   P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions with a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:

1. Convert the given dataset into frequency tables.

2. Generate Likelihood table by finding the probabilities of given features.

3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:

     Outlook     Play

0    Rainy       Yes
1    Sunny       Yes
2    Overcast    Yes
3    Overcast    Yes
4    Sunny       No
5    Rainy       Yes
6    Sunny       Yes
7    Overcast    Yes
8    Rainy       No
9    Sunny       No
10   Sunny       Yes
11   Rainy       No
12   Overcast    Yes
13   Overcast    Yes

Frequency table for the Weather Conditions:

Weather     Yes    No
Overcast     5      0
Rainy        2      2
Sunny        3      2
Total       10      4

Likelihood table for the Weather Conditions:

Weather     No            Yes
Overcast    0             5             5/14 = 0.35
Rainy       2             2             4/14 = 0.29
Sunny       2             3             5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|NO)= 2/4=0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

So as we can see from the above calculation that P(Yes|Sunny)>P(No|Sunny)

Hence on a Sunny day, Player can play the game.
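The hand calculation above can be reproduced with a short Python sketch built directly from the 14 Outlook/Play rows; the helper function posterior is illustrative only.

# Naive Bayes by hand: P(label|weather) = P(weather|label) * P(label) / P(weather)
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play    = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

def posterior(weather, label):
    n = len(play)
    prior      = play.count(label) / n                                   # P(label)
    likelihood = sum(1 for o, p in zip(outlook, play)
                     if o == weather and p == label) / play.count(label) # P(weather|label)
    evidence   = outlook.count(weather) / n                              # P(weather)
    return likelihood * prior / evidence

print(posterior("Sunny", "Yes"))   # about 0.60, as in the hand calculation
print(posterior("Sunny", "No"))    # about 0.40 (the rounded hand calculation gives 0.41)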

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class predictions as compared to the other algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.

Applications of Naïve Bayes Classifier:

o It is used for credit scoring.
o It is used in medical data classification.
o It can be used for real-time predictions because the Naïve Bayes Classifier is an eager learner.
o It is used in text classification, such as spam filtering and sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means
if predictors take continuous values instead of discrete, then the model assumes that these
values are sampled from the Gaussian distribution.

o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., deciding which category a particular document belongs to, such as Sports, Politics, Education, etc. The classifier uses the frequency of words as the predictors.

o Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the
predictor variables are the independent Booleans variables. Such as if a particular word is
present or not in a document. This model is also famous for document classification tasks.
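A brief sketch of the three scikit-learn variants is given below; the tiny feature arrays are made up purely to show which kind of input each model expects.

# The three Naive Bayes variants in scikit-learn.
import numpy as nm
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = nm.array([0, 0, 1, 1])

# Gaussian: continuous features assumed normally distributed within each class
X_cont = nm.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])
print(GaussianNB().fit(X_cont, y).predict([[3.1, 4.1]]))        # expected: [1]

# Multinomial: count features, e.g. word frequencies in documents
X_counts = nm.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[0, 5, 2]]))    # expected: [1]

# Bernoulli: binary features, e.g. whether a word is present or absent
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[0, 1, 1]]))         # expected: [1]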

Python Implementation of the Naïve Bayes algorithm:

Now we will implement a Naive Bayes Algorithm using Python. So for this, we will use the
"user_data" dataset, which we have used in our other classification model. Therefore we can
easily compare the Naive Bayes model with the other models.

Steps to implement:

o Data Pre-processing step
o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result (creation of the Confusion matrix)
o Visualizing the test set result

1) Data Pre-processing step:


In this step, we will pre-process/prepare the data so that we can use it efficiently in our code. It is similar to what we did in the data pre-processing topic. The code for this is given below:

1. # Importing the libraries


2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. # Importing the dataset
7. dataset = pd.read_csv('user_data.csv')
8. x = dataset.iloc[:, [2, 3]].values
9. y = dataset.iloc[:, 4].values
10.

11. # Splitting the dataset into the Training set and Test set
12. from sklearn.model_selection import train_test_split
13. x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
14.
15. # Feature Scaling
16. from sklearn.preprocessing import StandardScaler
17. sc = StandardScaler()
18. x_train = sc.fit_transform(x_train)
19. x_test = sc.transform(x_test)
In the above code, we have loaded the dataset into our program using dataset = pd.read_csv('user_data.csv'). The loaded dataset is divided into training and test sets, and then we have scaled the feature variables.

The output for the dataset is given as:

2) Fitting Naive Bayes to the Training Set:
After the pre-processing step, now we will fit the Naive Bayes model to the Training set. Below
is the code for it:

1. # Fitting Naive Bayes to the Training set


2. from sklearn.naive_bayes import GaussianNB
3. classifier = GaussianNB()

4. classifier.fit(x_train, y_train)
In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We
can also use other classifiers as per our requirement.

Output:

Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)


3) Prediction of the test set result:
Now we will predict the test set result. For this, we will create a new predictor variable y_pred,
and will use the predict function to make the predictions.

1. # Predicting the Test set results


2. y_pred = classifier.predict(x_test)

Output:

The above output shows the results for the prediction vector y_pred and the real vector y_test. We can see that some predictions are different from the real values; these are the incorrect predictions.

4) Creating Confusion Matrix:


Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix.
Below is the code for it:

1. # Making the Confusion Matrix


2. from sklearn.metrics import confusion_matrix

3. cm = confusion_matrix(y_test, y_pred)

Output:

As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions,
and 65+25=90 correct predictions.

5) Visualizing the training set result:


Next we will visualize the training set result using Naïve Bayes Classifier. Below is the code
for it:

1. # Visualising the Training set results


2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_train, y_train
4. X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max( ) + 1,
step = 0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(
X1.shape),
7. alpha = 0.75, cmap = ListedColormap(('purple', 'green')))

8. mtp.xlim(X1.min(), X1.max())
9. mtp.ylim(X2.min(), X2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('purple', 'green'))(i), label = j)
13. mtp.title('Naive Bayes (Training set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show() Output:

In the above output, we can see that the Naïve Bayes classifier has segregated the data points with a fine boundary. The boundary is a Gaussian curve because we have used the GaussianNB classifier in our code.

6) Visualizing the Test set result:


1. # Visualising the Test set results
2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_test, y_test
4. X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
5.                      nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
7.              alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
8. mtp.xlim(X1.min(), X1.max())
9. mtp.ylim(X2.min(), X2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11.     mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12.                 c = ListedColormap(('purple', 'green'))(i), label = j)
13. mtp.title('Naive Bayes (Test set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()

Output:

The above output is the final output for the test set data. As we can see, the classifier has created a Gaussian boundary to divide the "purchased" and "not purchased" classes. There are some wrong predictions, which we have counted in the confusion matrix, but it is still a pretty good classifier.

Supervised learning: Regression analysis

Regression Analysis in Machine learning


Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us to understand how the value of the dependent variable changes corresponding to one independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A, which runs various advertisements every year and gets sales from them. The below list shows the advertisements made by the company in the last 5 years and the corresponding sales:

Now the company wants to spend $200 on advertisement in the year 2019 and wants to know the prediction of sales for this year. So, to solve such prediction problems in machine learning, we need regression analysis.
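
The workflow can be sketched in a few lines of scikit-learn. The advertisement and sales figures below are purely hypothetical placeholders (the actual table from the example is not reproduced here), so only the shape of the workflow matters:

1. import numpy as nm
2. from sklearn.linear_model import LinearRegression
3. # hypothetical advertisement spend (x) and sales (y) for the last 5 years
4. x = nm.array([[90], [120], [150], [100], [130]])
5. y = nm.array([1000, 1300, 1800, 1200, 1380])
6. model = LinearRegression().fit(x, y)
7. print(model.predict([[200]]))   # predicted sales for a $200 advertisement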

Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict the continuous output variable based on the one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
determining the causal-effect relationship between variables.

In Regression, we plot a graph between the variables which best fits the given datapoints, using
this plot, the machine learning model can make predictions about the data. In simple words,
"Regression shows a line or curve that passes through all the datapoints on target-predictor
graph in such a way that the vertical distance between the datapoints and the regression line
is minimum." The distance between datapoints and line tells whether a model has captured a
strong relationship or not.

Some examples of regression are:

o Prediction of rain using temperature and other factors
o Determining market trends
o Prediction of road accidents due to rash driving

Terminologies Related to the Regression Analysis:

o Dependent Variable: The main factor in regression analysis which we want to predict or understand is called the dependent variable. It is also called the target variable.

o Independent Variable: The factors which affect the dependent variable, or which are used to predict the values of the dependent variable, are called independent variables, also called predictors.

o Outliers: Outlier is an observation which contains either very low value or very high value in
comparison to other observed values. An outlier may hamper the result, so it should be avoided.

o Multicollinearity: If the independent variables are highly correlated with each other, this condition is called multicollinearity. It should not be present in the dataset, because it creates problems while ranking the most influential variables.

o Underfitting and Overfitting: If our algorithm works well with the training dataset but not
well with test dataset, then such problem is called Overfitting. And if our algorithm does not
perform well even with training dataset, then such problem is called underfitting.

Why do we use Regression Analysis?

As mentioned above, Regression analysis helps in the prediction of a continuous variable.


There are various scenarios in the real world where we need future predictions, such as weather conditions, sales, marketing trends, etc. For such cases we need a technique which can make accurate predictions, and Regression analysis is such a statistical method, used in both machine learning and data science. Below are some other reasons for using Regression analysis:

o Regression estimates the relationship between the target and the independent variables.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can confidently determine the most important factor, the least important factor, and how each factor affects the other factors.

Types of Regression

There are various types of regressions which are used in data science and machine learning.
Each type has its own importance on different scenarios, but at the core, all the regression

methods analyze the effect of the independent variable on dependent variables. Here we are
discussing some important types of regression which are given below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:

o Linear regression is a statistical regression method which is used for predictive analysis.

o It is one of the simplest and easiest algorithms, which works on regression and shows the relationship between continuous variables.

o It is used for solving the regression problem in machine learning.

o Linear regression shows the linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence called linear regression.

o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is called
multiple linear regression.
The relationship between variables in the linear regression model can be explained using the
below image. Here we are predicting the salary of an employee on the basis of the year of
experience.

o Below is the mathematical equation for Linear regression:

1. Y= aX+b
Here, Y = dependent variables (target variables), X= Independent variables (predictor
variables), a and b are the linear coefficients

Some popular applications of linear regression are:

o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic

Logistic Regression:

o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a binary
or discrete format such as 0 or 1.

Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No,
True or False, Spam or not spam, etc.

o It is a predictive analysis algorithm which works on the concept of probability.

o Logistic regression is a type of regression, but it is different from the linear regression algorithm in terms of how it is used.

o Logistic regression uses the sigmoid function or logistic function, which is a complex cost function. This sigmoid function is used to model the data in logistic regression. The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between the 0 and 1 value.
o x = input to the function.
o e = base of the natural logarithm.

When we provide the input values (data) to the function, it gives the S-curve as follows:

o It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
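
A minimal sketch of the sigmoid and the threshold idea in plain NumPy (the 0.5 threshold is the common default, not something fixed by the text above):

1. import numpy as nm
2. def sigmoid(x):
3.     return 1 / (1 + nm.exp(-x))          # maps any real input to (0, 1)
4. scores = nm.array([-3.0, -0.5, 0.0, 1.2, 4.0])
5. probs = sigmoid(scores)
6. labels = (probs >= 0.5).astype(int)      # apply the threshold level
7. print(probs, labels)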

There are three types of logistic regression:

o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)

Polynomial Regression:

o Polynomial Regression is a type of regression which models the non-linear dataset using a
linear model.

o It is similar to multiple linear regression, but it fits a non-linear curve between the value of x
and corresponding conditional values of y.

o Suppose there is a dataset which consists of datapoints which are present in a non-linear
fashion, so for such case, linear regression will not best fit to those datapoints. To cover such
datapoints, we need Polynomial regression.

o In Polynomial regression, the original features are transformed into polynomial features
of given degree and then modeled using a linear model. Which means the datapoints are
best fitted using a polynomial line.

o The equation for polynomial regression is also derived from the linear regression equation, which means the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x² + b3x³ + ..... + bnxⁿ.

o Here Y is the predicted/target output, and b0, b1, ..., bn are the regression coefficients. x is our independent/input variable.

o The model is still linear because the coefficients are linear; only the input features are raised to higher powers (quadratic, cubic, and so on).

Note: This is different from Multiple Linear Regression in such a way that in Polynomial regression, a single element has different degrees instead of multiple variables with the same degree.
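
A minimal sketch of this transformation with scikit-learn (PolynomialFeatures and LinearRegression are standard scikit-learn classes; degree 2 and the synthetic data are only illustrative choices):

1. import numpy as nm
2. from sklearn.preprocessing import PolynomialFeatures
3. from sklearn.linear_model import LinearRegression
4. x = nm.arange(10).reshape(-1, 1)                 # single input feature
5. y = 3 + 2 * x.ravel() + 0.5 * x.ravel() ** 2     # non-linear (quadratic) target
6. poly = PolynomialFeatures(degree=2)              # adds x^2 (and a bias column)
7. x_poly = poly.fit_transform(x)
8. regressor = LinearRegression().fit(x_poly, y)    # still a linear model in the new features
9. print(regressor.predict(poly.transform([[12]])))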
Support Vector Regression:
Support Vector Machine is a supervised learning algorithm which can be used for regression
as well as classification problems. So if we use it for regression problems, then it is termed as
Support Vector Regression.

Support Vector Regression is a regression algorithm which works for continuous variables.
Below are some keywords which are used in Support Vector Regression:

o Kernel: It is a function used to map a lower-dimensional data into higher dimensional data.

o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a
line which helps to predict the continuous variables and cover most of the datapoints.

o Boundary line: Boundary lines are the two lines apart from the hyperplane which create a margin for the datapoints.

o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane and the opposite class.

In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of datapoints are covered in that margin. The main goal of SVR is to consider the
maximum datapoints within the boundary lines and the hyperplane (best-fit line) must
contain a maximum number of datapoints. Consider the below image:

Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
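
A minimal usage sketch with scikit-learn's SVR class (the RBF kernel, the epsilon value, and the tiny synthetic dataset are illustrative assumptions, not part of the original example):

1. import numpy as nm
2. from sklearn.svm import SVR
3. x = nm.linspace(0, 5, 40).reshape(-1, 1)
4. y = nm.sin(x).ravel()                                 # continuous target variable
5. regressor = SVR(kernel='rbf', C=1.0, epsilon=0.1)     # epsilon controls the margin (boundary lines)
6. regressor.fit(x, y)
7. print(regressor.predict([[2.5]]))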

Decision Tree Regression:

o Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.

o It can solve problems for both categorical and numerical data.

o Decision Tree regression builds a tree-like structure in which each internal node represents a "test" on an attribute, each branch represents the result of the test, and each leaf node represents the final decision or result.

o A decision tree is constructed starting from the root node/parent node (dataset), which splits
into left and right child nodes (subsets of dataset). These child nodes are further divided into
their children node, and themselves become the parent node of those nodes. Consider the below
image:

The above image shows an example of Decision Tree regression; here, the model is trying to predict the choice of a person between a sports car and a luxury car.
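
A minimal usage sketch with scikit-learn's DecisionTreeRegressor (the feature values and max_depth are hypothetical illustrative choices):

1. import numpy as nm
2. from sklearn.tree import DecisionTreeRegressor
3. x = nm.array([[1], [2], [3], [4], [5], [6]])          # hypothetical single feature
4. y = nm.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8])          # continuous target
5. regressor = DecisionTreeRegressor(max_depth=2, random_state=0)
6. regressor.fit(x, y)
7. print(regressor.predict([[3.5]]))                     # prediction = mean of the matching leaf node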

Random Forest Regression:

o Random forest is one of the most powerful supervised learning algorithms, which is capable of performing regression as well as classification tasks.

o Random Forest regression is an ensemble learning method which combines multiple decision trees and predicts the final output based on the average of each tree's output. The combined decision trees are called base models, and the ensemble can be represented more formally as:
g(x) = f0(x) + f1(x) + f2(x) + ....

o Random forest uses the Bagging or Bootstrap Aggregation technique of ensemble learning, in which the aggregated decision trees run in parallel and do not interact with each other.

o With the help of Random Forest regression, we can prevent Overfitting in the model by creating
random subsets of the dataset.
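
A minimal usage sketch with scikit-learn's RandomForestRegressor (n_estimators=10 is only an illustrative choice, and x, y stand for any prepared feature matrix and continuous target, such as the hypothetical ones in the decision tree sketch above):

1. from sklearn.ensemble import RandomForestRegressor
2. regressor = RandomForestRegressor(n_estimators=10, random_state=0)   # 10 bagged decision trees
3. regressor.fit(x, y)                    # x, y: previously prepared features and continuous target
4. print(regressor.predict([[3.5]]))      # average of the individual tree predictions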

Ridge Regression:

o Ridge regression is one of the most robust versions of linear regression, in which a small amount of bias is introduced so that we can get better long-term predictions.

o The amount of bias added to the model is known as the Ridge Regression penalty. We can compute this penalty term by multiplying lambda by the squared weight of each individual feature.

o The equation for ridge regression (the least-squares cost with the L2 penalty added) can be written in its general form, where the bj are the model coefficients, as:

Cost = Σ(yi − ŷi)² + λ Σ bj²

o A general linear or polynomial regression will fail if there is high collinearity between the independent variables, so to solve such problems, Ridge regression can be used.

o Ridge regression is a regularization technique which is used to reduce the complexity of the model. It is also called L2 regularization.

o It helps to solve problems where we have more parameters than samples.
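
A minimal usage sketch with scikit-learn's Ridge class (alpha plays the role of the lambda penalty strength; the value 1.0 is only an illustrative default, and x_train, y_train stand for any prepared training data):

1. from sklearn.linear_model import Ridge
2. ridge = Ridge(alpha=1.0)           # alpha is the lambda of the L2 penalty
3. ridge.fit(x_train, y_train)        # any prepared training features/target
4. print(ridge.coef_)                 # coefficients are shrunk towards (but not exactly to) 0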

Lasso Regression:

o Lasso regression is another regularization technique to reduce the complexity of the model.

o It is similar to the Ridge Regression except that penalty term contains only the absolute weights
instead of a square of weights.

o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge Regression
can only shrink it near to 0.

o It is also called L1 regularization. The equation for Lasso regression (the least-squares cost with the L1 penalty added) can be written in its general form as:

Cost = Σ(yi − ŷi)² + λ Σ |bj|
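
A minimal usage sketch with scikit-learn's Lasso class (alpha again plays the role of lambda; 0.1 is only an illustrative value, and x_train, y_train stand for any prepared training data):

1. from sklearn.linear_model import Lasso
2. lasso = Lasso(alpha=0.1)           # alpha is the lambda of the L1 penalty
3. lasso.fit(x_train, y_train)        # any prepared training features/target
4. print(lasso.coef_)                 # some coefficients may be shrunk exactly to 0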

Linear Regression in Machine Learning


Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:

Mathematically, we can represent a linear regression as:


y= a0+a1x+ ε
Here,

Y = Dependent Variable (Target Variable)
X = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor to each input value)
ε = random error

The values for x and y variables are training datasets for Linear Regression model
representation.

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.

o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

Linear Regression Line

A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:

o Positive Linear Relationship:


If the dependent variable increases on the Y-axis and independent variable increases on X-axis,
then such a relationship is termed as a Positive linear relationship.

o Negative Linear Relationship:

If the dependent variable decreases on the Y-axis and independent variable increases on the X-
axis, then such a relationship is called a negative linear relationship.

Finding the best fit line:

When working with linear regression, our main goal is to find the best fit line, which means the error between the predicted values and actual values should be minimized. The best fit line will have the least error.

The different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line, and to calculate them we use the cost function.

Cost function:

o The different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line.

o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.

o We can use the cost function to find the accuracy of the mapping function, which maps the
input variable to the output variable. This mapping function is also known as Hypothesis
function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and actual values. For the above linear equation, MSE can be calculated as:

MSE = (1/N) Σ (Yi − (a1xi + a0))²

Where,

N = total number of observations
Yi = actual value
(a1xi + a0) = predicted value

Residuals: The distance between the actual value and the predicted value is called the residual. If the observed points are far from the regression line, then the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, then the residuals will be small and hence the cost function will be small as well.

Gradient Descent:

o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.

o A regression model uses gradient descent to update the coefficients of the line by reducing the cost function. This is done by randomly selecting initial values for the coefficients and then iteratively updating them to reach the minimum of the cost function.
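
A minimal sketch of one way to implement this update rule in plain NumPy (the learning rate, number of iterations, and the zero initialisation are illustrative assumptions; x and y stand for any 1-D training arrays):

1. import numpy as nm
2. def gradient_descent(x, y, lr=0.01, epochs=1000):
3.     a0, a1 = 0.0, 0.0                                          # initial coefficient values
4.     n = len(x)
5.     for _ in range(epochs):
6.         y_pred = a1 * x + a0
7.         a1 = a1 - lr * (-2 / n) * nm.sum(x * (y - y_pred))     # gradient of MSE w.r.t. a1
8.         a0 = a0 - lr * (-2 / n) * nm.sum(y - y_pred)           # gradient of MSE w.r.t. a0
9.     return a0, a1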

Model Performance:

The goodness of fit determines how well the line of regression fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved by the below method:

1. R-squared method:

o R-squared is a statistical method that determines the goodness of fit.

o It measures the strength of the relationship between the dependent and independent variables
on a scale of 0-100%.

o A high value of R-square indicates a smaller difference between the predicted values and actual values, and hence represents a good model.

o It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.

o It can be calculated from the below formula:

R-squared = Explained variation / Total variation
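
A minimal sketch of computing this score with scikit-learn (r2_score is a standard scikit-learn metric; y_test and y_pred stand for any actual and predicted value arrays):

1. from sklearn.metrics import r2_score
2. r2 = r2_score(y_test, y_pred)     # 1 - (residual sum of squares / total sum of squares)
3. print('R-squared: ', r2)          # closer to 1 means a better fit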

Assumptions of Linear Regression

Below are some important assumptions of Linear Regression. These are some formal checks
while building a Linear Regression model, which ensures to get the best possible result from
the given dataset.

o Linear relationship between the features and target:


Linear regression assumes the linear relationship between the dependent and independent
variables.

o Small or no multicollinearity between the features:


Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable. In other words, it is difficult to determine which predictor variable is affecting the target variable and which is not. So the model assumes either little or no multicollinearity between the features or independent variables.

o Homoscedasticity Assumption:

Homoscedasticity is a situation when the error term is the same for all the values of independent
variables. With homoscedasticity, there should be no clear pattern distribution of data in the
scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal distribution pattern. If
error terms are not normally distributed, then confidence intervals will become either too wide
or too narrow, which may cause difficulties in finding coefficients.
It can be checked using a Q-Q plot: if the plot shows a straight line without any deviation, the error terms are normally distributed.

o No autocorrelations:
The linear regression model assumes no autocorrelation in error terms. If there will be any
correlation in the error term, then it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs if there is a dependency between residual errors.

Simple Linear Regression in Machine Learning


Simple Linear Regression is a type of Regression algorithms that models the relationship
between a dependent variable and a single independent variable. The relationship shown by a
Simple Linear Regression model is linear or a sloped straight line, hence it is called Simple
Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on continuous or
categorical values.

Simple Linear regression algorithm has mainly two objectives:

o Model the relationship between the two variables. Such as the relationship between Income
and expenditure, experience and Salary, etc.

o Forecasting new observations, such as weather forecasting according to temperature, revenue of a company according to the investments in a year, etc.

Simple Linear Regression Model:

The Simple Linear Regression model can be represented using the below equation:

y= a0+a1x+ ε

Where,

a0 = the intercept of the regression line (can be obtained by putting x=0)
a1 = the slope of the regression line, which tells whether the line is increasing or decreasing
ε = the error term (for a good model it will be negligible)

Implementation of Simple Linear Regression Algorithm using Python

Problem Statement example for Simple Linear Regression:

Here we are taking a dataset that has two variables: salary (dependent variable) and experience (independent variable). The goals of this problem are:

o We want to find out if there is any correlation between these two variables.
o We will find the best fit line for the dataset.
o We will see how the dependent variable changes by changing the independent variable.

In this section, we will create a Simple Linear Regression model to find out the best fitting line
for representing the relationship between these two variables.

To implement the Simple Linear regression model in machine learning using Python, we need
to follow the below steps:

Step-1: Data Pre-processing

The first step for creating the Simple Linear Regression model is data pre-processing. We have
already done it earlier in this tutorial. But there will be some changes, which are given in the
below steps:

o First, we will import the three important libraries, which will help us for loading the dataset,
plotting the graphs, and creating the Simple Linear Regression model.

1. import numpy as nm
2. import matplotlib.pyplot as mtp
3. import pandas as pd

o Next, we will load the dataset into our code:

1. data_set= pd.read_csv('Salary_Data.csv')
By executing the above line of code (ctrl+ENTER), we can read the dataset on our Spyder IDE
screen by clicking on the variable explorer option.

The above output shows the dataset, which has two variables: Salary and Experience.
Note: In Spyder IDE, the folder containing the code file must be saved as a working
directory, and the dataset or csv file should be in the same folder.
o After that, we need to extract the dependent and independent variables from the given dataset.
The independent variable is years of experience, and the dependent variable is salary. Below
is code for it:

1. x= data_set.iloc[:, :-1].values
2. y= data_set.iloc[:, 1].values

In the above lines of code, for the x variable we have used -1 since we want to remove the last column from the dataset. For the y variable we have passed 1 as the parameter, since we want to extract the second column and indexing starts from zero.

By executing the above line of code, we will get the output for X and Y variable as:

In the above output image, we can see the X (independent) variable and Y (dependent) variable
has been extracted from the given dataset.

o Next, we will split both variables into the test set and training set. We have 30 observations, so
we will take 20 observations for the training set and 10 observations for the test set. We are
splitting our dataset so that we can train our model using a training dataset and then test the
model using a test dataset. The code for this is given below:

1. # Splitting the dataset into training and test set.


2. from sklearn.model_selection import train_test_split

3. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state=0)
By executing the above code, we will get x-test, x-train and y-test, y-train dataset. Consider the
below images:

Test-dataset:

Training Dataset:

o For Simple Linear Regression, we will not use feature scaling, because the Python libraries take care of it in such cases, so we don't need to perform it here. Now our dataset is well prepared to work on, and we are going to start building a Simple Linear Regression model for the given problem.

Step-2: Fitting the Simple Linear Regression to the Training Set:

Now the second step is to fit our model to the training dataset. To do so, we will import the
LinearRegression class of the linear_model library from the scikit learn. After importing the
class, we are going to create an object of the class named as a regressor. The code for this is
given below:

1. #Fitting the Simple Linear Regression model to the training dataset


2. from sklearn.linear_model import LinearRegression
3. regressor= LinearRegression()
4. regressor.fit(x_train, y_train)

In the above code, we have used a fit() method to fit our Simple Linear Regression object to
the training set. In the fit() function, we have passed the x_train and y_train, which is our
training dataset for the dependent and an independent variable. We have fitted our regressor
object to the training set so that the model can easily learn the correlations between the predictor
and target variables. After executing the above lines of code, we will get the below output.

Output:

Out[7]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


Step: 3. Prediction of test set result:

Our model has now learned the correlation between the dependent variable (Salary) and the independent variable (Experience). So now our model is ready to predict the output for new observations. In this step, we will provide the test dataset (new observations) to the model to check whether it can predict the correct output or not.

We will create a prediction vector y_pred, and x_pred, which will contain predictions of test
dataset, and prediction of training set respectively.

1. #Prediction of Test and Training set result


2. y_pred= regressor.predict(x_test)
3. x_pred= regressor.predict(x_train)
On executing the above lines of code, two variables named y_pred and x_pred will be generated in the variable explorer; they contain the salary predictions for the test set and the training set, respectively.
Output:

You can check the variable by clicking on the variable explorer option in the IDE, and also
compare the result by comparing values from y_pred and y_test. By comparing these values,
we can check how good our model is performing.

Step: 4. Visualizing the Training set results:

Now in this step, we will visualize the training set result. To do so, we will use the scatter()
function of the pyplot library, which we have already imported in the pre-processing step. The
scatter () function will create a scatter plot of observations.

In the x-axis, we will plot the Years of Experience of employees and on the y-axis, salary of
employees. In the function, we will pass the real values of training set, which means a year of
experience x_train, training set of Salaries y_train, and color of the observations. Here we are
taking a green color for the observation, but it can be any color as per the choice.

Now, we need to plot the regression line, so for this, we will use the plot() function of the
pyplot library. In this function, we will pass the years of experience for training set, predicted
salary for training set x_pred, and color of the line.

Next, we will give the title for the plot. So here, we will use the title() function of the pyplot library and pass the name "Salary vs Experience (Training Dataset)".

After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel() function.

Finally, we will represent all above things in a graph using show(). The code is given below:

1. mtp.scatter(x_train, y_train, color="green")
2. mtp.plot(x_train, x_pred, color="red")
3. mtp.title("Salary vs Experience (Training Dataset)")
4. mtp.xlabel("Years of Experience")
5. mtp.ylabel("Salary(In Rupees)")
6. mtp.show()

Output:

By executing the above lines of code, we will get the below graph plot as an output.

In the above plot, we can see the real values observations in green dots and predicted values
are covered by the red regression line. The regression line shows a correlation between the
dependent and independent variable.

The good fit of the line can be observed by calculating the difference between actual values
and predicted values. But as we can see in the above plot, most of the observations are close
to the regression line, hence our model is good for the training set.

Step: 5. Visualizing the Test set results:

In the previous step, we have visualized the performance of our model on the training set. Now,
we will do the same for the Test set. The complete code will remain the same as the above code,
except in this, we will use x_test, and y_test instead of x_train and y_train.

Here we are also changing the color of observations and regression line to differentiate between
the two plots, but it is optional.

1. #visualizing the Test set results
2. mtp.scatter(x_test, y_test, color="blue")
3. mtp.plot(x_train, x_pred, color="red")
4. mtp.title("Salary vs Experience (Test Dataset)")
5. mtp.xlabel("Years of Experience")
6. mtp.ylabel("Salary(In Rupees)")
7. mtp.show()

Output:

By executing the above line of code, we will get the output as:

In the above plot, there are observations given by the blue color, and prediction is given by the
red regression line. As we can see, most of the observations are close to the regression line,
hence we can say our Simple Linear Regression is a good model and able to make good
predictions.

Multiple Linear Regression
In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there may
be various cases in which the response variable is affected by more than one predictor variable;
for such cases, the Multiple Linear Regression algorithm is used.

Moreover, Multiple Linear Regression is an extension of Simple Linear Regression, as it takes more than one predictor variable to predict the response variable. We can define it as:

Multiple Linear Regression is one of the important regression algorithms which models the
linear relationship between a single dependent continuous variable and more than one
independent variable.
Example:

Prediction of CO2 emission based on engine size and number of cylinders in a car.

Some key points about MLR:

o For MLR, the dependent or target variable(Y) must be the continuous/real, but the predictor or
independent variable may be of continuous or categorical form.

o Each feature variable must model the linear relationship with the dependent variable.
o MLR tries to fit a regression line through a multidimensional space of data-points.

MLR equation:
In Multiple Linear Regression, the target variable(Y) is a linear combination of multiple
predictor variables x1, x2, x3, ...,xn. Since it is an enhancement of Simple Linear Regression, so
the same is applied for the multiple linear regression equation, the equation becomes:

1. Y = b0 + b1x1 + b2x2 + b3x3 + ...... + bnxn ............... (a)
Where,

Y = Output/Response variable
b0, b1, b2, b3, ..., bn = Coefficients of the model

x1, x2, x3, x4, ... = Various independent/feature variables

Assumptions for Multiple Linear Regression:

o A linear relationship should exist between the target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.

Implementation of Multiple Linear Regression model using Python:


To implement MLR using Python, we have below problem:

Problem Description:

We have a dataset of 50 start-up companies. This dataset contains five main information:
R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial
year. Our goal is to create a model that can easily determine which company has a maximum
profit, and which is the most affecting factor for the profit of a company.

Since we need to find the Profit, so it is the dependent variable, and the other four variables are
independent variables. Below are the main steps of deploying the MLR model:

1. Data Pre-processing Steps

2. Fitting the MLR model to the training set

3. Predicting the result of the test set

Step-1: Data Pre-processing Step:

The very first step is data pre-processing, which we have already discussed in this tutorial. This
process contains the below steps:

o Importing libraries: Firstly we will import the library which will help in building the model.
Below is the code for it:
1. # importing libraries

2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd

o Importing dataset: Now we will import the dataset (50_CompList), which contains all the variables. Below is the code for it:

1. #importing datasets
2. data_set= pd.read_csv('50_CompList.csv')

Output: We will get the dataset as:

In the above output, we can clearly see that there are five variables, of which four are continuous and one is a categorical variable.

o Extracting dependent and independent variables:

1. #Extracting Independent and dependent Variable

2. x= data_set.iloc[:, :-1].values
3. y= data_set.iloc[:, 4].values

Output:

Out[5]:
array([[165349.2, 136897.8, 471784.1, 'New York'],
[162597.7, 151377.59, 443898.53, 'California'],
[153441.51, 101145.55, 407934.54, 'Florida'],
[144372.41, 118671.85, 383199.62, 'New York'],
[142107.34, 91391.77, 366168.42, 'Florida'],
[131876.9, 99814.71, 362861.36, 'New York'],
[134615.46, 147198.87, 127716.82, 'California'],
[130298.13, 145530.06, 323876.68, 'Florida'],
[120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'],
[101913.08, 110594.11, 229160.95, 'Florida'],
[100671.96, 91790.61, 249744.55, 'California'],
[93863.75, 127320.38, 249839.44, 'Florida'],
[91992.39, 135495.07, 252664.93, 'California'],
[119943.24, 156547.42, 256512.92, 'Florida'],
[114523.61, 122616.84, 261776.23, 'New York'],
[78013.11, 121597.55, 264346.06, 'California'],
[94657.16, 145077.58, 282574.31, 'New York'],
[91749.16, 114175.79, 294919.57, 'Florida'],
[86419.7, 153514.11, 0.0, 'New York'],
[76253.86, 113867.3, 298664.47, 'California'],
[78389.47, 153773.43, 299737.29, 'New York'],
[73994.56, 122782.75, 303319.26, 'Florida'],

[67532.53, 105751.03, 304768.73, 'Florida'],
[77044.01, 99281.34, 140574.81, 'New York'],
[64664.71, 139553.16, 137962.62, 'California'],
[75328.87, 144135.98, 134050.07, 'Florida'],
[72107.6, 127864.55, 353183.81, 'New York'],
[66051.52, 182645.56, 118148.2, 'Florida'],
[65605.48, 153032.06, 107138.38, 'New York'],
[61994.48, 115641.28, 91131.24, 'Florida'],
[61136.38, 152701.92, 88218.23, 'New York'],
[63408.86, 129219.61, 46085.25, 'California'],
[55493.95, 103057.49, 214634.81, 'Florida'],
[46426.07, 157693.92, 210797.67, 'California'],
[46014.02, 85047.44, 205517.64, 'New York'],
[28663.76, 127056.21, 201126.82, 'Florida'],
[44069.95, 51283.14, 197029.42, 'California'],
[20229.59, 65947.93, 185265.1, 'New York'],
[38558.51, 82982.09, 174999.3, 'California'],
[28754.33, 118546.05, 172795.67, 'California'],
[27892.92, 84710.77, 164470.71, 'Florida'],
[23640.93, 96189.63, 148001.11, 'California'],
[15505.73, 127382.3, 35534.17, 'New York'],
[22177.74, 154806.14, 28334.72, 'California'],
[1000.23, 124153.04, 1903.93, 'New York'],
[1315.46, 115816.21, 297114.46, 'Florida'],
[0.0, 135426.92, 0.0, 'California'],
[542.05, 51743.15, 0.0, 'New York'],
[0.0, 116983.8, 45173.06, 'California']], dtype=object)

As we can see in the above output, the last column contains categorical variables which are not
suitable to apply directly for fitting the model. So we need to encode this variable.

Encoding Dummy Variables:

As we have one categorical variable (State), which cannot be directly applied to the model, so
we will encode it. To encode the categorical variable into numbers, we will use the
LabelEncoder class. But it is not sufficient because it still has some relational order, which
may create a wrong model. So in order to remove this problem, we will use OneHotEncoder,
which will create the dummy variables. Below is code for it:

1. #Catgorical data
2. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
3. labelencoder_x= LabelEncoder()
4. x[:, 3]= labelencoder_x.fit_transform(x[:,3])
5. onehotencoder= OneHotEncoder(categorical_features= [3])
6. x= onehotencoder.fit_transform(x).toarray()
Here we are only encoding one independent variable, which is State, as the other variables are continuous.

Output:

As we can see in the above output, the State column has been converted into dummy variables (0 and 1). Here each dummy variable column corresponds to one State. We can check this by comparing it with the original dataset: the first column corresponds to the California State, the second column corresponds to the Florida State, and the third column corresponds to the New York State.
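
Note that the categorical_features argument of OneHotEncoder has been removed in newer scikit-learn releases, so the listing above only runs on older versions. A minimal sketch of an equivalent approach with ColumnTransformer, assuming the same x matrix with the State column at index 3:

1. from sklearn.compose import ColumnTransformer
2. from sklearn.preprocessing import OneHotEncoder
3. ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough')
4. x = ct.fit_transform(x)     # dummy columns for State first, remaining columns passed through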
Note: We should not use all the dummy variables at the same time; we must use 1 less than the total number of dummy variables, else it will create a dummy variable trap.
o Now, we are writing a single line of code just to avoid the dummy variable trap:

1. #avoiding the dummy variable trap:


2. x = x[:, 1:]

If we do not remove the first dummy variable, then it may introduce multicollinearity in the
model.

As we can see in the above output image, the first column has been removed.

o Now we will split the dataset into training and test set. The code for this is given below:

1. # Splitting the dataset into training and test set.


2. from sklearn.model_selection import train_test_split

3. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
The above code will split our dataset into a training set and test set.

Output: The above code will split the dataset into training set and test set. You can check the
output by clicking on the variable explorer option given in Spyder IDE. The test set and training
set will look like the below image:

Test set:

Training set:

Note: In MLR, we will not do feature scaling as it is taken care by the library, so we don't
need to do it manually.
Step: 2- Fitting our MLR model to the Training set:
Now, we have well prepared our dataset in order to provide training, which means we will fit
our regression model to the training set. It will be similar to as we did in Simple Linear
Regression model. The code for this will be:

1. #Fitting the MLR model to the training set:


2. from sklearn.linear_model import LinearRegression
3. regressor= LinearRegression()
4. regressor.fit(x_train, y_train)

Output:

Out[9]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


Now, we have successfully trained our model using the training dataset. In the next step, we
will test the performance of the model using the test dataset.

Step: 3- Prediction of Test set results:

The last step for our model is checking the performance of the model. We will do it by
predicting the test set result. For prediction, we will create a y_pred vector. Below is the code
for it:

1. #Predicting the Test set result;


2. y_pred= regressor.predict(x_test)
By executing the above lines of code, a new vector will be generated under the variable explorer
option. We can test our model by comparing the predicted values and test set values.

Output:

In the above output, we have the predicted result set and the test set. We can check the model performance by comparing these two values index by index. For example, the first index has a predicted value of $103,015 profit and a test/real value of $103,282 profit. The difference is only $267, which is a good prediction, so, finally, our model is completed here.

o We can also check the score for training dataset and test dataset. Below is the code for it:

1. print('Train Score: ', regressor.score(x_train, y_train))


2. print('Test Score: ', regressor.score(x_test, y_test))

Output: The score is:

Train Score: 0.9501847627493607


Test Score: 0.9347068473282446
The above scores (the R² values returned by score()) tell us that our model explains about 95% of the variance on the training dataset and about 93% on the test dataset.
Note: In the next topic, we will see how we can improve the performance of the model
using the Backward Elimination process.
Applications of Multiple Linear Regression:
There are mainly two applications of Multiple Linear Regression:

o Effectiveness of an independent variable on prediction
o Predicting the impact of changes


Clustering in Machine Learning
Clustering or cluster analysis is a machine learning technique, which groups the unlabelled
dataset. It can be defined as "A way of grouping the data points into different clusters,
consisting of similar data points. The objects with the possible similarities remain in a group
that has less or no similarities with another group."

It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color,
behavior, etc., and divides them as per the presence and absence of those similar patterns.

It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals with the unlabeled dataset.

After applying this clustering technique, each cluster or group is provided with a cluster-ID.
ML system can use this id to simplify the processing of large and complex datasets.

The clustering technique is commonly used for statistical data analysis.


Note: Clustering is somewhat similar to the classification algorithm, but the difference is the type of dataset that we are using. In classification, we work with a labeled dataset, whereas in clustering, we work with an unlabelled dataset.
Example: Let's understand the clustering technique with the real-world example of Mall:

When we visit any shopping mall, we can observe that the things with similar usage are grouped
together. Such as the t-shirts are grouped in one section, and trousers are at other sections,
similarly, at vegetable sections, apples, bananas, Mangoes, etc., are grouped in separate
sections, so that we can easily find out the things. The clustering technique also works in the
same way. Other examples of clustering are grouping documents according to the topic.

The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

Apart from these general usages, it is used by Amazon in its recommendation system to provide recommendations based on the past search of products. Netflix also uses this technique to recommend movies and web series to its users based on their watch history.

The below diagram explains the working of the clustering algorithm. We can see the different
fruits are divided into several groups with similar properties.

Types of Clustering Methods

The clustering methods are broadly divided into Hard clustering (datapoint belongs to only
one group) and Soft Clustering (data points can belong to another group also). But there are
also other various approaches of Clustering exist. Below are the main clustering methods used
in Machine learning:

1. Partitioning Clustering

2. Density-Based Clustering

3. Distribution Model-Based Clustering

4. Hierarchical Clustering

5. Fuzzy Clustering

Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-
Means Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where K is used to define the number
of pre-defined groups. The cluster center is created in such a way that the distance between the
data points of one cluster is minimum as compared to another cluster centroid.

Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the
arbitrarily shaped distributions are formed as long as the dense region can be connected. This
algorithm does it by identifying different clusters in the dataset and connects the areas of high
densities into clusters. The dense areas in data space are divided from each other by sparser
areas.

These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.

Distribution Model-Based Clustering


In the distribution model-based clustering method, the data is divided based on the probability
of how a dataset belongs to a particular distribution. The grouping is done by assuming some
distributions commonly Gaussian Distribution.

The example of this type is the Expectation-Maximization Clustering algorithm that uses
Gaussian Mixture Models (GMM).

Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there is no
requirement of pre-specifying the number of clusters to be created. In this technique, the dataset
is divided into clusters to create a tree-like structure, which is also called a dendrogram. The
observations or any number of clusters can be selected by cutting the tree at the correct level.
The most common example of this method is the Agglomerative Hierarchical algorithm.

Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one
group or cluster. Each dataset has a set of membership coefficients, which depend on the degree
of membership to be in a cluster. Fuzzy C-means algorithm is the example of this type of
clustering; it is sometimes also known as the Fuzzy k-means algorithm.

Clustering Algorithms

The Clustering algorithms can be divided based on their models that are explained above. There
are different types of clustering algorithms published, but only a few are commonly used. The
clustering algorithm is based on the kind of data that we are using. Such as, some algorithms
need to guess the number of clusters in the given dataset, whereas some are required to find the
minimum distance between the observation of the dataset.

Here we are discussing mainly popular Clustering algorithms that are widely used in machine
learning:

1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It classifies the dataset by dividing the samples into different clusters of equal variances. The number of clusters must be specified in this algorithm. It is fast, with fewer computations required, and has a linear complexity of O(n). (A minimal usage sketch is given after this list.)

2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density
of data points. It is an example of a centroid-based model, that works on updating the candidates
for centroid to be the center of the points within a given region.

3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with


Noise. It is an example of a density-based model similar to the mean-shift, but with some
remarkable advantages. In this algorithm, the areas of high density are separated by the areas
of low density. Because of this, the clusters can be found in any arbitrary shape.

4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an


alternative for the k-means algorithm or for those cases where K-means can be failed. In GMM,
it is assumed that the data points are Gaussian distributed.

5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm performs


the bottom-up hierarchical clustering. In this, each data point is treated as a single cluster at the
outset and then successively merged. The cluster hierarchy can be represented as a tree-
structure.

6. Affinity Propagation: It is different from other clustering algorithms as it does not require the number of clusters to be specified. In this algorithm, data points exchange messages with each other until convergence. It has O(N²T) time complexity, which is the main drawback of this algorithm.
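
As mentioned in the K-Means item above, here is a minimal usage sketch with scikit-learn's KMeans class (n_clusters=2 and the tiny synthetic points are illustrative assumptions):

1. import numpy as nm
2. from sklearn.cluster import KMeans
3. points = nm.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
4. kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)   # the number of clusters must be specified
5. labels = kmeans.fit_predict(points)                        # cluster-ID for each data point
6. print(labels, kmeans.cluster_centers_)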

Applications of Clustering

Below are some commonly known applications of clustering technique in Machine Learning:

o In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data sets into
different groups.

o In Search Engines: Search engines also work on the clustering technique. The search result
appears based on the closest object to the search query. It does it by grouping similar data
objects in one group that is far from the other dissimilar objects. The accurate result of a query
depends on the quality of the clustering algorithm used.

o Customer Segmentation: It is used in market research to segment the customers based on their choice and preferences.

o In Biology: It is used in the biology stream to classify different species of plants and animals
using the image recognition technique.

o In Land Use: The clustering technique is used to identify areas of similar land use in the GIS database. This can be very useful for finding the purpose for which a particular piece of land is most suitable.
Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to
group the unlabeled datasets into a cluster and also known as hierarchical cluster analysis or
HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped
structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work, as in hierarchical clustering there is no requirement to predetermine the number of clusters as we did in the K-Means algorithm.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with


taking all data points as single clusters and merging them until one cluster is left.

2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it is a top-down
approach.

Why hierarchical clustering?

As we already have other clustering algorithms such as K-Means Clustering, why do we need hierarchical clustering? As we have seen, K-means clustering has some challenges: it needs a predetermined number of clusters, and it always tries to create clusters of the same size. To solve these two challenges, we can opt for the hierarchical clustering algorithm, because in this algorithm we don't need to have knowledge about the predefined number of clusters.

In this topic, we will discuss the Agglomerative Hierarchical clustering algorithm.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the datasets into clusters, it follows the bottom-up approach. That is, the algorithm treats each data point as a single cluster at the beginning and then starts combining the closest pair of clusters. It does this until all the clusters are merged into a single cluster that contains all the data points.

This hierarchy of clusters is represented in the form of the dendrogram.

How does Agglomerative Hierarchical Clustering work?

The working of the AHC algorithm can be explained using the below steps:

o Step-1: Create each data point as a single cluster. Let's say there are N data points, so the
number of clusters will also be N.

o Step-2: Take two closest data points or clusters and merge them to form one cluster. So, there
will now be N-1 clusters.

o Step-3: Again, take the two closest clusters and merge them together to form one cluster. There
will be N-2 clusters.

o Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters. Consider the below images:

o Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to
divide the clusters as per the problem.
Note: To better understand hierarchical clustering, it is advised to have a look at k-means clustering.

Measure for the distance between two clusters
As we have seen, the closest distance between the two clusters is crucial for the hierarchical
clustering. There are various ways to calculate the distance between two clusters, and these
ways decide the rule for clustering. These measures are called Linkage methods. Some of the
popular linkage methods are given below:

1. Single Linkage: It is the Shortest Distance between the closest points of the clusters. Consider
the below image:

2. Complete Linkage: It is the farthest distance between the two points of two different clusters.
It is one of the popular linkage methods as it forms tighter clusters than single-linkage.

3. Average Linkage: It is the linkage method in which the distance between each pair of datasets
is added up and then divided by the total number of datasets to calculate the average distance
between two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroid of the
clusters is calculated. Consider the below image:

We can apply any of the above approaches according to the type of problem or business requirement; a short sketch comparing them is given below.
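For illustration, the below minimal sketch (an assumption for demonstration only, not part of the implementation used later in this topic) shows how the linkage method can be selected when computing the hierarchy with scipy on a small toy array:

# Minimal sketch (illustrative): comparing linkage methods with scipy
import numpy as np
from scipy.cluster.hierarchy import linkage

x = np.array([[1, 2], [2, 2], [8, 8], [8, 9], [25, 80]])  # assumed toy data

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(x, method=method)              # pairwise merge history
    print(method, "-> last merge distance:", Z[-1, 2])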

Working of the Dendrogram in Hierarchical Clustering


The dendrogram is a tree-like structure that is mainly used to record each merge step that the HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points, and the X-axis shows all the data points of the given dataset.

The working of the dendrogram can be explained using the below diagram:

In the above diagram, the left part is showing how clusters are created in agglomerative
clustering, and the right part is showing the corresponding dendrogram.

o As discussed above, first the data points P2 and P3 combine together and form a cluster; correspondingly, a dendrogram link is created that connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.

o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram link is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.

o Again, two new dendrogram links are created that combine P1, P2, and P3 in one branch, and P4, P5, and P6 in another branch.

o At last, the final dendrogram is created that combines all the data points together.

We can cut the dendrogram tree structure at any level as per our requirement.
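As an illustrative sketch (assuming a scipy linkage matrix like the one in the snippet above, not the scikit-learn implementation used below), the tree can be cut either into a chosen number of clusters or at a chosen distance using fcluster:

# Minimal sketch (illustrative): cutting the dendrogram with scipy's fcluster
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

x = np.array([[1, 2], [2, 2], [8, 8], [8, 9], [25, 80]])  # assumed toy data
Z = linkage(x, method="ward")

labels_by_k = fcluster(Z, t=2, criterion="maxclust")      # cut into 2 clusters
labels_by_dist = fcluster(Z, t=10, criterion="distance")  # cut at distance 10
print(labels_by_k, labels_by_dist)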

Python Implementation of Agglomerative Hierarchical Clustering

Now we will see the practical implementation of the agglomerative hierarchical clustering
algorithm using Python. To implement this, we will use the same dataset problem that we have
used in the previous topic of K-means clustering so that we can compare both concepts easily.

The dataset contains information about customers who have visited a mall for shopping. The mall owner wants to find patterns or particular behaviors of his customers using this information.

Steps for implementation of AHC using Python:


The steps for implementation will be the same as the k-means clustering, except for some
changes such as the method to find the number of clusters. Below are the steps:

1. Data Pre-processing

2. Finding the optimal number of clusters using the Dendrogram

3. Training the hierarchical clustering model

4. Visualizing the clusters

Data Pre-processing Steps:


In this step, we will import the libraries and datasets for our model.

o Importing the libraries

1. # Importing the libraries


2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd

The above lines of code import the libraries needed for specific tasks: numpy for mathematical operations, matplotlib for drawing graphs or scatter plots, and pandas for importing the dataset.

o Importing the dataset

1. # Importing the dataset


2. dataset = pd.read_csv('Mall_Customers_data.csv')
As discussed above, we have imported the same dataset of Mall_Customers_data.csv, as we
did in k-means clustering. Consider the below output:

o Extracting the matrix of features

Here we will extract only the matrix of features as we don't have any further information about
the dependent variable. Code is given below:

1. x = dataset.iloc[:, [3, 4]].values


Here we have extracted only columns 3 and 4, as we will use a 2D plot to see the clusters. So, we are considering the Annual Income and Spending Score as the matrix of features.

Step-2: Finding the optimal number of clusters using the Dendrogram


Now we will find the optimal number of clusters using the Dendrogram for our model. For this,
we are going to use scipy library as it provides a function that will directly return the
dendrogram for our code. Consider the below lines of code:

1. #Finding the optimal number of clusters using the dendrogram
2. import scipy.cluster.hierarchy as shc
3. dendro = shc.dendrogram(shc.linkage(x, method="ward"))
4. mtp.title("Dendrogram Plot")
5. mtp.ylabel("Euclidean Distances")
6. mtp.xlabel("Customers")
7. mtp.show()
In the above lines of code, we have imported the hierarchy module of the scipy library. This module provides us the method shc.dendrogram(), which takes linkage() as a parameter. The linkage function is used to define the distance between two clusters, so here we have passed x (the matrix of features) and the method "ward", a popular linkage method in hierarchical clustering.

The remaining lines of code are to describe the labels for the dendrogram plot.

Output:

By executing the above lines of code, we will get the below output:

Using this Dendrogram, we will now determine the optimal number of clusters for our model.
For this, we will find the maximum vertical distance that does not cut any horizontal bar.
Consider the below diagram:

In the above diagram, we have shown the vertical distances that do not cut any horizontal bar. As we can see, the 4th distance looks the largest, so according to this the number of clusters will be 5 (the number of vertical lines in this range). We could also take the 2nd distance, as it is approximately equal to the 4th, but we will consider 5 clusters because that is the same number we calculated with the K-means algorithm.

So, the optimal number of clusters will be 5, and we will train the model in the next step,
using the same.

Step-3: Training the hierarchical clustering model


As we know the required optimal number of clusters, we can now train our model. The code is
given below:

1. #training the hierarchical model on dataset

2. from sklearn.cluster import AgglomerativeClustering
3. hc= AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
4. y_pred= hc.fit_predict(x)
In the above code, we have imported the AgglomerativeClustering class from the cluster module of the scikit-learn library.

Then we have created the object of this class named as hc. The AgglomerativeClustering class
takes the following parameters:

o n_clusters=5: It defines the number of clusters, and we have taken here 5 because it is the
optimal number of clusters.

o affinity='euclidean': It is a metric used to compute the linkage.

o linkage='ward': It defines the linkage criteria, here we have used the "ward" linkage. This
method is the popular linkage method that we have already used for creating the Dendrogram.
It reduces the variance in each cluster.

In the last line, we have created the variable y_pred to fit or train the model. This not only trains the model but also returns the cluster to which each data point belongs.

After executing the above lines of code, if we go through the variable explorer option in our Spyder IDE, we can check the y_pred variable and compare it with the original dataset. Consider the below image:

As we can see in the above image, the y_pred variable shows the cluster values: customer ID 1 belongs to the 5th cluster (as indexing starts from 0, a value of 4 means the 5th cluster), customer ID 2 belongs to the 4th cluster, and so on.

Step-4: Visualizing the clusters


As we have trained our model successfully, now we can visualize the clusters corresponding to
the dataset.

Here we will use the same lines of code as we did in k-means clustering, except for one change: here we will not plot the centroids as we did in k-means, because we have used the dendrogram to determine the optimal number of clusters. The code is given below:

1. #visualizing the clusters
2. mtp.scatter(x[y_pred == 0, 0], x[y_pred == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')
3. mtp.scatter(x[y_pred == 1, 0], x[y_pred == 1, 1], s = 100, c = 'green', label = 'Cluster 2')
4. mtp.scatter(x[y_pred == 2, 0], x[y_pred == 2, 1], s = 100, c = 'red', label = 'Cluster 3')
5. mtp.scatter(x[y_pred == 3, 0], x[y_pred == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
6. mtp.scatter(x[y_pred == 4, 0], x[y_pred == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
7. mtp.title('Clusters of customers')
8. mtp.xlabel('Annual Income (k$)')
9. mtp.ylabel('Spending Score (1-100)')
10. mtp.legend()
11. mtp.show()
Output: By executing the above lines of code, we will get the below output:

K-Means Clustering Algorithm


K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science. In this topic, we will learn what is K-means
clustering algorithm, how the algorithm works, along with the Python implementation of k-
means clustering.

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters that need to
be created in the process, as if K=2, there will be two clusters, and for K=3, there will be three
clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim
of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.

o Assigns each data point to its closest k-center. Those data points which are near to the particular
k-center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)

Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
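Before walking through the visual plots, the following is a minimal NumPy sketch of these steps (an illustrative assumption, not the scikit-learn implementation used later in this topic):

# Minimal sketch (illustrative): the K-means loop in plain NumPy
import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick k random points from the data as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for it in range(n_iters):
        # Step-3/5: assign every point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step-6: stop when no reassignment happens
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step-4: recompute each centroid as the mean of its assigned points
        # (empty-cluster handling is omitted to keep the sketch short)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Example usage on a tiny assumed dataset
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 8.2]])
labels, centroids = simple_kmeans(X, k=2)
print(labels, centroids)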

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:

o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different
clusters. It means here we will try to group these datasets into two different clusters.

o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute it by applying some mathematics that we have studied to calculate the distance
between two points. So, we will draw a median between both
the centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.

o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity of the points in each cluster and place the new centroids there, as shown below:

o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same
process of finding a median line. The median will be like below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.

Since reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.

o We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:

o As we have the new centroids, we will again draw the median line and reassign the data points. The image will be:

o We can see in the above image that there are no data points left to reassign on either side of the line, which means our model has converged. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, and choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters; here we discuss the most commonly used method, which is given below:

Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the value
of WCSS (for 3 clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
In the above formula of WCSS,

∑(Pi in Cluster1) distance(Pi, C1)²: the sum of the squares of the distances between each data point in Cluster1 and its centroid C1; the other two terms are analogous.

To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
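As a small illustration of this formula (an assumed sketch, not part of the implementation that follows), WCSS can be computed directly from the data points, their cluster labels, and the centroids:

# Minimal sketch (illustrative): computing WCSS for a given clustering
import numpy as np

def wcss(X, labels, centroids):
    # Sum of squared Euclidean distances of every point to its own cluster centroid
    total = 0.0
    for j, c in enumerate(centroids):
        diff = X[labels == j] - c
        total += np.sum(diff ** 2)
    return total

# Tiny assumed example with two clusters
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == 0].mean(axis=0), X[labels == 1].mean(axis=0)])
print(wcss(X, labels, centroids))   # the same quantity as kmeans.inertia_ in scikit-learn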

To find the optimal value of clusters, the elbow method follows the below steps:

o It executes K-means clustering on a given dataset for different K values (ranging from 1 to 10).

o For each value of K, it calculates the WCSS value.

o It plots a curve between the calculated WCSS values and the number of clusters K.

o The sharp point of bend, where the plot looks like an arm, is considered the best value of K.

Since the graph shows a sharp bend that looks like an elbow, it is known as the elbow method. The graph for the elbow method looks like the below image:

Note: We can choose the number of clusters up to the number of given data points. If we choose the number of clusters equal to the number of data points, then the value of WCSS becomes zero, and that will be the endpoint of the plot.
Python Implementation of K-means Clustering Algorithm

In the above section, we have discussed the K-means algorithm, now let's see how it can be
implemented using Python.

Before implementation, let's understand what type of problem we will solve here. So, we have
a dataset of Mall_Customers, which is the data of customers who visit the mall and spend
there.

In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (which is a calculated value of how much a customer has spent in the mall; the higher the value, the more he has spent). From this dataset, we need to find some patterns; as it is an unsupervised method, we don't know exactly what to look for.

The steps to be followed for the implementation are given below:

o Data Pre-processing

o Finding the optimal number of clusters using the elbow method

o Training the K-means algorithm on the training dataset

o Visualizing the clusters

Step-1: Data pre-processing Step


The first step will be the data pre-processing, as we did in our earlier topics of Regression and
Classification. But for the clustering problem, it will be different from other models. Let's
discuss it:

o Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our model, which is part
of data pre-processing. The code is given below:

1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
In the above code, numpy is imported for performing mathematical calculations, matplotlib for plotting the graph, and pandas for managing the dataset.

o Importing the Dataset:


Next, we will import the dataset that we need to use. So here, we are using the
Mall_Customer_data.csv dataset. It can be imported using the below code:

1. # Importing the dataset


2. dataset = pd.read_csv('Mall_Customers_data.csv')
By executing the above lines of code, we will get our dataset in the Spyder IDE. The dataset
looks like the below image:

From the above dataset, we need to find some patterns in it.

o Extracting Independent Variables

Here we don't need any dependent variable for the data pre-processing step, as it is a clustering problem and we have no idea what to determine. So we will just add a line of code for the matrix of features.

1. x = dataset.iloc[:, [3, 4]].values


As we can see, we are extracting only the 3rd and 4th features. This is because we need a 2D plot to visualize the model, and some features, such as customer_id, are not required.

Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our clustering problem.
So, as discussed above, here we are going to use the elbow method for this purpose.

As we know, the elbow method uses the WCSS concept to draw the plot by plotting WCSS
values on the Y-axis and the number of clusters on the X-axis. So we are going to calculate the
value for WCSS for different k values ranging from 1 to 10. Below is the code for it:

1. #finding optimal number of clusters using the elbow method
2. from sklearn.cluster import KMeans
3. wcss_list= []  #Initializing the list for the values of WCSS
4.
5. #Using for loop for iterations from 1 to 10.
6. for i in range(1, 11):
7.     kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
8.     kmeans.fit(x)
9.     wcss_list.append(kmeans.inertia_)
10. mtp.plot(range(1, 11), wcss_list)
11. mtp.title('The Elbow Method Graph')
12. mtp.xlabel('Number of clusters(k)')
13. mtp.ylabel('wcss_list')
14. mtp.show()
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.

Next, we have created the wcss_list variable as an empty list, which is used to hold the WCSS values computed for different values of k ranging from 1 to 10.

After that, we have initialized the for loop to iterate over values of k ranging from 1 to 10; since range() in Python excludes the upper bound, it is written as 11 to include the 10th value.

The rest part of the code is similar as we did in earlier topics, as we have fitted the model on a
matrix of features and then plotted the graph between the number of clusters and WCSS.

Output: After executing the above code, we will get the below output:

From the above plot, we can see the elbow point is at 5. So the number of clusters here will
be 5.

Step- 3: Training the K-means algorithm on the training dataset
As we have got the number of clusters, so we can now train the model on the dataset.

To train the model, we will use the same two lines of code as we have used in the above section,
but here instead of using i, we will use 5, as we know there are 5 clusters that need to be formed.
The code is given below:

1. #training the K-means model on a dataset


2. kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)
3. y_predict= kmeans.fit_predict(x)
The first line is the same as above for creating the object of KMeans class.

In the second line of code, we have created the variable y_predict, which trains the model and stores the predicted cluster for each data point.

By executing the above lines of code, we will get the y_predict variable. We can check it under
the variable explorer option in the Spyder IDE. We can now compare the values of y_predict
with our original dataset. Consider the below image:

From the above image, we can see that CustomerID 1 belongs to cluster 3 (as the index starts from 0, a value of 2 corresponds to the 3rd cluster), CustomerID 2 belongs to cluster 4, and so on.

Step-4: Visualizing the Clusters


The last step is to visualize the clusters. As we have 5 clusters for our model, so we will
visualize each cluster one by one.

To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of matplotlib.

1. #visualizing the clusters
2. mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
3. mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
4. mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
5. mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster
6. mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
7. mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
8. mtp.title('Clusters of customers')
9. mtp.xlabel('Annual Income (k$)')
10. mtp.ylabel('Spending Score (1-100)')
11. mtp.legend()
12. mtp.show()
In the above lines of code, we have written a scatter call for each cluster, ranging from 1 to 5. The first coordinate of mtp.scatter, i.e., x[y_predict == 0, 0], contains the x values of the points in that cluster, and the cluster labels in y_predict range from 0 to 4.

Output:

The output image clearly shows the five different clusters with different colors. The clusters are formed between two parameters of the dataset: the annual income of the customer and the spending score. We can change the colors and labels as per the requirement or choice. We can also observe some points from the above patterns, which are given below:

o Cluster1 shows the customers with average salary and average spending, so we can categorize these customers as standard.

o Cluster2 shows the customers with a high income but low spending, so we can categorize them as careful.

o Cluster3 shows the customers with low income and also low spending, so they can be categorized as sensible.

o Cluster4 shows the customers with low income but very high spending, so they can be categorized as careless.

o Cluster5 shows the customers with high income and high spending, so they can be categorized as target customers; these can be the most profitable customers for the mall owner.

Apriori Algorithm in Machine Learning


The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or how weakly two objects are connected. This algorithm uses a breadth-first search and a Hash Tree to calculate the itemset associations efficiently. It is an iterative process for finding the frequent itemsets from a large dataset.

This algorithm was given by R. Agrawal and R. Srikant in the year 1994. It is mainly used for market basket analysis and helps to find products that can be bought together. It can also be used in the healthcare field to find drug reactions for patients.

What is Frequent Itemset?

Frequent itemsets are those itemsets whose support is greater than the threshold value or user-specified minimum support. This implies that if {A, B} is a frequent itemset, then A and B individually must also be frequent itemsets.

Suppose there are two transactions: A = {1,2,3,4,5} and B = {2,3,7}; in these two transactions, 2 and 3 are the frequent items.
Note: To better understand the apriori algorithm, and related term such as support and
confidence, it is recommended to understand the association rule learning.
Steps for Apriori Algorithm
Below are the steps for the apriori algorithm:

Step-1: Determine the support of itemsets in the transactional database, and select the
minimum support and confidence.

Step-2: Take all the itemsets in the transactions that have a higher support value than the minimum or selected support value.

Step-3: Find all the rules of these subsets that have a higher confidence value than the threshold or minimum confidence.

Step-4: Sort the rules in decreasing order of lift. A short sketch of these calculations is given below.
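To make the support, confidence, and lift calculations concrete, here is a minimal sketch over a tiny assumed transaction list (illustrative only, not the apyori implementation used later in this topic):

# Minimal sketch (illustrative): support, confidence and lift for one rule A -> B
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

A, B = {"A"}, {"B"}
sup_ab = support(A | B)              # how often A and B appear together
confidence = sup_ab / support(A)     # P(B | A)
lift = confidence / support(B)       # > 1 means A and B occur together more than by chance
print(sup_ab, confidence, lift)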

Apriori Algorithm Working


We will understand the apriori algorithm using an example and mathematical calculation:

Example: Suppose we have the following dataset that has various transactions, and from this
dataset, we need to find the frequent itemsets and generate the association rules using the
Apriori algorithm:

Solution:

Step-1: Calculating C1 and L1:

o In the first step, we will create a table that contains support count (The frequency of each
itemset individually in the dataset) of each itemset in the given dataset. This table is called the
Candidate set or C1.

o Now, we will take out all the itemsets that have a support count greater than the Minimum Support (2). This gives us the table for the frequent itemset L1. Since all the itemsets except E have a support count greater than or equal to the minimum support, the E itemset will be removed.

Step-2: Candidate Generation C2, and L2:

o In this step, we will generate C2 with the help of L1. In C2, we will create the pairs of the itemsets of L1 in the form of subsets.

o After creating the subsets, we will again find the support count from the main transaction table of the dataset, i.e., how many times these pairs have occurred together in the given dataset. So, we will get the below table for C2:

o Again, we need to compare the C2 support counts with the minimum support count, and after comparing, the itemsets with a smaller support count will be eliminated from table C2. This gives us the below table for L2:

Step-3: Candidate generation C3, and L3:

o For C3, we will repeat the same two processes, but now we will form the C3 table with subsets
of three itemsets together, and will calculate the support count from the dataset. It will give the
below table:

o Now we will create the L3 table. As we can see from the above C3 table, there is only one
combination of itemset that has support count equal to the minimum support count. So, the L3
will have only one combination, i.e., {A, B, C}.

Step-4: Finding the association rules for the subsets:


To generate the association rules, first, we will create a new table with the possible rules from the combination {A, B, C}. For each rule, we will calculate the Confidence using the formula sup(A^B)/sup(A). After calculating the confidence value for all rules, we will exclude the rules that have less confidence than the minimum threshold (50%).

Consider the below table:

Rules       Support   Confidence
A^B → C     2         sup{(A^B)^C}/sup(A^B) = 2/4 = 0.5 = 50%
B^C → A     2         sup{(B^C)^A}/sup(B^C) = 2/4 = 0.5 = 50%
A^C → B     2         sup{(A^C)^B}/sup(A^C) = 2/4 = 0.5 = 50%
C → A^B     2         sup{C^(A^B)}/sup(C) = 2/5 = 0.4 = 40%
A → B^C     2         sup{A^(B^C)}/sup(A) = 2/6 = 0.33 = 33.33%
B → B^C     2         sup{B^(B^C)}/sup(B) = 2/7 = 0.28 = 28%

As the given threshold or minimum confidence is 50%, the first three rules, A^B → C, B^C → A, and A^C → B, can be considered as the strong association rules for the given problem.

Advantages of Apriori Algorithm

o It is an easy-to-understand algorithm.

o The join and prune steps of the algorithm can be easily implemented on large datasets.

Disadvantages of Apriori Algorithm

o The apriori algorithm works slowly compared to other algorithms.

o The overall performance can be reduced as it scans the database multiple times.

o The time complexity and space complexity of the apriori algorithm is O(2^D), which is very high. Here D represents the horizontal width present in the database.

Python Implementation of Apriori Algorithm
Now we will see the practical implementation of the Apriori Algorithm. To implement this, we have the problem of a retailer who wants to find associations between his shop's products, so that he can provide a "Buy this and Get that" offer to his customers.

The retailer has a dataset that contains a list of transactions made by his customers. In the dataset, each row shows the products purchased by a customer in one transaction. To solve this problem, we will perform the below steps:

o Data Pre-processing

o Training the Apriori model on the dataset

o Visualizing the results

1. Data Pre-processing Step:


The first step is the data pre-processing step. Under this, we will first import the libraries. The code for this is given below:

o Importing the libraries:

Before importing the libraries, we will use the below line of code to install the apyori package, as the Spyder IDE does not contain it by default:

1. pip install apyori


Below is the code to implement the libraries that will be used for different tasks of the model:

1. import numpy as nm
2. import matplotlib.pyplot as mtp
3. import pandas as pd

o Importing the dataset:
Now we will import the dataset for our apriori model. There will be some changes here: each row of the dataset shows a different transaction made by a customer, and the first row is the transaction done by the first customer, which means there is no particular name for each column; each cell just holds an individual product value (see the dataset given below after the code). So, we need to mention in our code that no header is specified. The code is given below:

1. #Importing the dataset
2. dataset = pd.read_csv('Market_Basket_data1.csv', header = None)
3. transactions=[]
4. for i in range(0, 7501):
5.     transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])
In the above code, the first line imports the dataset into a pandas DataFrame. The following lines are needed because the apriori() function that we will use for training our model takes the dataset as a list of transactions. So, we have created an empty list of transactions, which will contain all the transactions from index 0 to 7500. Here we have taken 7501 because, in Python, the upper bound of range() is not included.

The dataset looks like the below image:

2. Training the Apriori Model on the dataset


To train the model, we will use the apriori function that will be imported from the apyroi
package. This function will return the rules to train the model on the dataset. Consider the
below code:

1. from apyori import apriori


2. rules= apriori(transactions= transactions, min_support=0.003, min_confidence = 0.2,
min_lift=3, min_length=2, max_length=2)

In the above code, the first line is to import the apriori function. In the second line, the apriori
function returns the output as the rules. It takes the following parameters:

o transactions: A list of transactions.

o min_support: To set the minimum support float value. Here we have used 0.003, which is calculated by taking 3 transactions per customer each week relative to the total number of transactions.

o min_confidence: To set the minimum confidence value. Here we have taken 0.2; it can be changed as per the business problem.

o min_lift: To set the minimum lift value.

o min_length: The minimum number of products for the association.

o max_length: The maximum number of products for the association.

3. Visualizing the result


Now we will visualize the output for our apriori model. Here we will follow some more steps,
which are given below:

o Displaying the results of the rules obtained from the apriori function

1. results= list(rules)
2. results
By executing the above lines of code, we will get 9 rules. Consider the below output:

Output:
[RelationRecord(items=frozenset({'chicken', 'light cream'}),
support=0.004533333333333334,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}),
items_add=frozenset({'chicken'}), confidence=0.2905982905982906,
lift=4.843304843304844)]),
RelationRecord(items=frozenset({'escalope', 'mushroom cream sauce'}),
support=0.005733333333333333,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}),

items_add=frozenset({'escalope'}), confidence=0.30069930069930073,
lift=3.7903273197390845)]),
RelationRecord(items=frozenset({'escalope', 'pasta'}), support=0.005866666666666667,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}),
items_add=frozenset({'escalope'}), confidence=0.37288135593220345,
lift=4.700185158809287)]),
RelationRecord(items=frozenset({'fromage blanc', 'honey'}),
support=0.0033333333333333335,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'fromage blanc'}),
items_add=frozenset({'honey'}), confidence=0.2450980392156863,
lift=5.178127589063795)]),
RelationRecord(items=frozenset({'ground beef', 'herb & pepper'}), support=0.016,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'herb & pepper'}),
items_add=frozenset({'ground beef'}), confidence=0.3234501347708895,
lift=3.2915549671393096)]),
RelationRecord(items=frozenset({'tomato sauce', 'ground beef'}),
support=0.005333333333333333,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'tomato sauce'}),
items_add=frozenset({'ground beef'}), confidence=0.37735849056603776,
lift=3.840147461662528)]),
RelationRecord(items=frozenset({'olive oil', 'light cream'}), support=0.0032,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}),
items_add=frozenset({'olive oil'}), confidence=0.20512820512820515,
lift=3.120611639881417)]),
RelationRecord(items=frozenset({'olive oil', 'whole wheat pasta'}), support=0.008,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'whole wheat pasta'}),

items_add=frozenset({'olive oil'}), confidence=0.2714932126696833,
lift=4.130221288078346)]),
RelationRecord(items=frozenset({'pasta', 'shrimp'}), support=0.005066666666666666,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}),
items_add=frozenset({'shrimp'}), confidence=0.3220338983050848,
lift=4.514493901473151)])]
As we can see, the above output is in a form that is not easily readable. So, we will print all the rules in a suitable format.

o Visualizing the rule, support, confidence, and lift in a clearer way:

1. for item in results:
2.     pair = item[0]
3.     items = [x for x in pair]
4.     print("Rule: " + items[0] + " -> " + items[1])
5.
6.     print("Support: " + str(item[1]))
7.     print("Confidence: " + str(item[2][0][2]))
8.     print("Lift: " + str(item[2][0][3]))
9.     print("=====================================")

Output:

By executing the above lines of code, we will get the below output:
Rule: chicken -> light cream
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
=====================================
Rule: escalope -> mushroom cream sauce
Support: 0.005733333333333333

Confidence: 0.30069930069930073
Lift: 3.7903273197390845
=====================================
Rule: escalope -> pasta
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
=====================================
Rule: fromage blanc -> honey
Support: 0.0033333333333333335
Confidence: 0.2450980392156863
Lift: 5.178127589063795
=====================================
Rule: ground beef -> herb & pepper
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
=====================================
Rule: tomato sauce -> ground beef
Support: 0.005333333333333333
Confidence: 0.37735849056603776

Lift: 3.840147461662528
=====================================
Rule: olive oil -> light cream
Support: 0.0032
Confidence: 0.20512820512820515
Lift: 3.120611639881417
=====================================
Rule: olive oil -> whole wheat pasta
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
=====================================
Rule: pasta -> shrimp
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151

=====================================
From the above output, we can analyze each rule. The first rule, light cream → chicken, states that light cream and chicken are frequently bought together. The support for this rule is 0.0045 and the confidence is 29%. Hence, if a customer buys light cream, there is a 29% chance that he also buys chicken, and this pair appears in 0.45% of all transactions. We can check these values for the other rules as well.
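To make the relation between these three numbers explicit, the small assumed check below re-derives the individual supports from the printed values, using confidence = support(both)/support(antecedent) and lift = confidence/support(consequent):

# Small assumed check: re-deriving supports for the rule "light cream -> chicken"
support_both = 0.004533333333333334     # support of {light cream, chicken}, from the output above
confidence = 0.2905982905982906         # printed confidence = support_both / support(light cream)
lift = 4.843304843304844                # printed lift = confidence / support(chicken)

support_light_cream = support_both / confidence
support_chicken = confidence / lift
print(round(support_light_cream, 4), round(support_chicken, 4))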

Decision Tree Classification Algorithm

o Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.

o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.

o The decisions or the test are performed on the basis of features of the given dataset.

o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.

o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.

o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.

o Below diagram explains the general structure of a decision tree:


Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so they are easy to understand.

o The logic behind the decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies

• Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.

• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.

• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.

• Pruning: Pruning is the process of removing the unwanted branches from the tree.

• Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.

o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).

o Step-3: Divide S into subsets that contain possible values for the best attribute.

o Step-4: Generate the decision tree node, which contains the best attribute.

o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final nodes are then called leaf nodes.

Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node. Finally,
the decision node splits into two leaf nodes (Accepted offers and Declined offer). Consider the
below diagram:

Attribute Selection Measures

While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems, there is a technique called the Attribute Selection Measure, or ASM. Using this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

o Information Gain

o Gini Index

1. Information Gain:

o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.

o It calculates how much information a feature provides us about a class.

o According to the value of information gain, we split the node and build the decision tree.

o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the below formula:

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

where:

o S = the total set of samples

o P(yes) = probability of yes

o P(no) = probability of no

2. Gini Index:

o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.

o An attribute with a low Gini index should be preferred over one with a high Gini index.

o It only creates binary splits, and the CART algorithm uses the Gini index to create these binary splits.

o The Gini index can be calculated using the below formula:

Gini Index = 1 - ∑j Pj²
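Both measures can be computed directly; below is a minimal sketch (with an assumed toy split, not part of the implementation that follows) that evaluates entropy, information gain, and the Gini index for one candidate attribute:

# Minimal sketch (illustrative): entropy, information gain and Gini index for one split
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Assumed toy data: class labels before the split and the two subsets after it
parent = np.array(["yes", "yes", "yes", "no", "no", "no", "yes", "no"])
left   = np.array(["yes", "yes", "yes", "yes"])   # subset where the attribute is True
right  = np.array(["no", "no", "no", "no"])       # subset where the attribute is False

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted
print("Information Gain:", info_gain)   # 1.0 for this perfectly separating split
print("Gini (parent):", gini(parent))   # 0.5 for a 50/50 class mix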

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the learning
tree without reducing accuracy is known as Pruning. There are mainly two types of tree
pruning technology used:

o Cost Complexity Pruning

o Reduced Error Pruning
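As a brief illustration of cost complexity pruning (a sketch on assumed toy data, not the implementation used below), scikit-learn exposes it through the ccp_alpha parameter of DecisionTreeClassifier:

# Minimal sketch (illustrative): cost complexity pruning with scikit-learn's ccp_alpha
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)   # assumed toy data

# The pruning path gives the candidate alpha values for this training set
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print("Leaves before pruning:", unpruned.get_n_leaves())
print("Leaves after pruning :", pruned.get_n_leaves())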

Advantages of the Decision Tree

o It is simple to understand as it follows the same process which a human follow while making
any decision in real-life.

o It can be very useful for solving decision-related problems.

o It helps to think about all the possible outcomes for a problem.

o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which makes it complex.

o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.

o For more class labels, the computational complexity of the decision tree may increase.

Python Implementation of Decision Tree

Now we will implement the Decision tree using Python. For this, we will use the dataset
"user_data.csv," which we have used in previous classification models. By using the same
dataset, we can compare the Decision tree classifier with other classification models such as KNN, SVM, Logistic Regression, etc.

Steps will also remain the same, which are given below:

o Data Pre-processing step

o Fitting a Decision-Tree algorithm to the Training set

o Predicting the test result

o Test accuracy of the result (creation of the confusion matrix)

o Visualizing the test set result

1. Data Pre-Processing Step:


Below is the code for the pre-processing step:

1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. #importing datasets
7. data_set= pd.read_csv('user_data.csv')
8.
9. #Extracting Independent and dependent Variable
10. x= data_set.iloc[:, [2,3]].values
11. y= data_set.iloc[:, 4].values
12.
13. # Splitting the dataset into training and test set.
14. from sklearn.model_selection import train_test_split
15. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
16.
17. #feature Scaling
18. from sklearn.preprocessing import StandardScaler
19. st_x= StandardScaler()
20. x_train= st_x.fit_transform(x_train)
21. x_test= st_x.transform(x_test)
In the above code, we have pre-processed the data and loaded the dataset, which is given as:

2. Fitting a Decision-Tree algorithm to the Training set
Now we will fit the model to the training set. For this, we will import the DecisionTreeClassifier class from the sklearn.tree library. Below is the code for it:

1. #Fitting Decision Tree classifier to the training set
2. from sklearn.tree import DecisionTreeClassifier
3. classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)
4. classifier.fit(x_train, y_train)

In the above code, we have created a classifier object, in which we have passed two main parameters:

o criterion='entropy': Criterion is used to measure the quality of a split, which here is calculated by the information gain given by entropy.

o random_state=0: For generating reproducible random states.

Below is the output for this:

Out[8]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=0, splitter='best')
3. Predicting the test result
Now we will predict the test set result. We will create a new prediction vector y_pred. Below
is the code for it:

1. #Predicting the test set result
2. y_pred= classifier.predict(x_test)

Output:

In the below output image, the predicted output and real test output are given. We can clearly
see that there are some values in the prediction vector, which are different from the real vector
values. These are prediction errors.

4. Test accuracy of the result (Creation of Confusion matrix)
In the above output, we have seen that there were some incorrect predictions, so if we want to
know the number of correct and incorrect predictions, we need to use the confusion matrix.
Below is the code for it:

1. #Creating the Confusion matrix
2. from sklearn.metrics import confusion_matrix
3. cm= confusion_matrix(y_test, y_pred)

Output:

In the above output image, we can see the confusion matrix, which has 6+3 = 9 incorrect predictions and 62+29 = 91 correct predictions. Therefore, we can say that, compared to other classification models, the Decision Tree classifier made good predictions.
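The same counts can also be summarised as a single accuracy value; the snippet below is a small assumed addition on top of the confusion matrix already computed:

# Small assumed addition: accuracy from the confusion matrix computed above
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))   # 91 correct out of 100 test samples -> 0.91

# Equivalent computation directly from the confusion matrix cm
print(cm.trace() / cm.sum())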

5. Visualizing the training set result:


Here we will visualize the training set result. To do this, we will plot a graph for the decision tree classifier. The classifier will predict Yes or No for the users who have either purchased or not purchased the SUV car, as we did in Logistic Regression. Below is the code for it:

1. #Visualizing the training set result
2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_train, y_train
4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
5.     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
7.     alpha = 0.75, cmap = ListedColormap(('purple','green' )))
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11.     mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12.         c = ListedColormap(('purple', 'green'))(i), label = j)
13. mtp.title('Decision Tree Algorithm (Training set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()

Output:

The above output is completely different from the rest classification models. It has both vertical
and horizontal lines that are splitting the dataset according to the age and estimated salary
variable.

As we can see, the tree is trying to capture each dataset, which is the case of overfitting.

6. Visualizing the test set result:
The visualization of the test set result will be similar to the visualization of the training set, except that the training set is replaced with the test set.

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

As we can see in the above image, there are some green data points within the purple region and vice versa. These are the incorrect predictions that we discussed in the confusion matrix.

Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based
on the concept of ensemble learning, which is a process of combining multiple classifiers to
solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, predicts the final output.

A greater number of trees in the forest generally leads to higher accuracy and helps prevent the problem of overfitting.

The below diagram explains the working of the Random Forest algorithm:

Note: To better understand the Random Forest Algorithm, you should have knowledge of
the Decision Tree Algorithm.
Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all
the trees predict the correct output. Therefore, below are two assumptions for a better Random
forest classifier:

o There should be some actual values in the feature variable of the dataset so that the classifier
can predict accurate results rather than a guessed result.

o The predictions from each tree must have very low correlations.

Why use Random Forest?

Below are some points that explain why we should use the Random Forest algorithm:


o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy; even for a large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions for each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority votes.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. This dataset is given to the Random Forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, the Random Forest classifier predicts the final decision based on the majority of results. Consider the below image:
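The majority-vote idea in this example can be sketched in a few lines of Python. The code below is not the library implementation; it is a simplified illustration (assuming numpy as nm, a 0/1 target, and the x_train, y_train, x_test arrays from the earlier sections) that trains several decision trees on random bootstrap subsets and combines their votes.

#Simplified illustration of the Random Forest idea: bagging + majority vote (sketch)
from sklearn.tree import DecisionTreeClassifier
import numpy as nm

n_trees = 10
trees = []
rng = nm.random.RandomState(0)
for t in range(n_trees):
    # Step 1 & 2: draw a random bootstrap subset and build a decision tree on it
    idx = rng.choice(len(x_train), size=len(x_train), replace=True)
    tree = DecisionTreeClassifier(criterion='entropy', random_state=t)
    tree.fit(x_train[idx], y_train[idx])
    trees.append(tree)

# Step 5: every tree votes, and the majority vote gives the final prediction
votes = nm.array([tree.predict(x_test) for tree in trees])     # shape: (n_trees, n_samples)
y_pred_vote = (votes.mean(axis=0) >= 0.5).astype(int)          # works because the labels are 0/1

In practice, the RandomForestClassifier class used later in this chapter does all of this (plus random feature selection at each split) internally.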

Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.

2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.

3. Land Use: We can identify the areas of similar land use by this algorithm.

4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest

o Although random forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.

Python Implementation of Random Forest Algorithm

Now we will implement the Random Forest Algorithm tree using Python. For this, we will use
the same dataset "user_data.csv", which we have used in previous classification models. By
using the same dataset, we can compare the Random Forest classifier with other classification
models such as Decision tree Classifier, KNN, SVM, Logistic Regression, etc.

Implementation Steps are given below:

o Data Pre-processing step
o Fitting the Random Forest algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result

1. Data Pre-Processing Step:


Below is the code for the pre-processing step:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have pre-processed the data and loaded the dataset, which is given as:

2. Fitting the Random Forest algorithm to the training set:
Now we will fit the Random forest algorithm to the training set. To fit it, we will import the
RandomForestClassifier class from the sklearn.ensemble library. The code is given below:

#Fitting the Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier= RandomForestClassifier(n_estimators= 10, criterion="entropy")
classifier.fit(x_train, y_train)
In the above code, the classifier object takes below parameters:

o n_estimators= The required number of trees in the Random Forest. The default value is 10. We can choose any number, but we need to take care of the overfitting issue.
o criterion= It is a function to measure the quality of the split. Here we have taken "entropy" for the information gain.

Output:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy', max_depth=None,


max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None, oob_score=False,
random_state=None, verbose=0, warm_start=False)
3. Predicting the Test Set result
Since our model is fitted to the training set, we can now predict the test result. For prediction, we will create a new prediction vector y_pred. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

Output:

The prediction vector is given as:

By comparing the above prediction vector with the real test set vector, we can determine the incorrect predictions made by the classifier.

4. Creating the Confusion Matrix


Now we will create the confusion matrix to determine the correct and incorrect predictions.
Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:

As we can see in the above matrix, there are 4+4= 8 incorrect predictions and 64+28= 92
correct predictions.

5. Visualizing the training Set result


Here we will visualize the training set result. To visualize it, we will plot the decision regions of the Random Forest classifier. As in Logistic Regression, the classifier will predict Yes or No for the users who have either purchased or not purchased the SUV car. Below is the code for it:

from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:

The above image is the visualization result for the Random Forest classifier working with the
training set result. It is very much similar to the Decision tree classifier. Each data point
corresponds to each user of the user_data, and the purple and green regions are the prediction
regions. The purple region is classified for the users who did not purchase the SUV car, and the
green region is for the users who purchased the SUV.

So, in the Random Forest classifier, we have taken 10 trees that have predicted Yes or NO for
the Purchased variable. The classifier took the majority of the predictions and provided the
result.

6. Visualizing the test set result


Now we will visualize the test set result. Below is the code for it:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above image is the visualization result for the test set. We can check that there is a minimal number of incorrect predictions (8) without an overfitting issue. We will get different results by changing the number of trees in the classifier.

Cross-Validation in Machine Learning


Cross-validation is a technique for validating the model's efficiency by training it on a subset of the input data and testing it on a previously unseen subset of the input data. We can also say that it is a technique to check how a statistical model generalizes to an independent dataset.

In machine learning, there is always a need to test the stability of the model, which means we cannot judge it based only on the training dataset. For this purpose, we reserve a particular sample of the dataset that was not part of the training dataset. After training, we test our model on that sample before deployment, and this complete process comes under cross-validation. This is something different from the general train/test split.

Hence the basic steps of cross-validations are:

o Reserve a subset of the dataset as a validation set.
o Train the model using the training dataset.

o Now, evaluate model performance using the validation set. If the model performs well with the validation set, perform the further steps; otherwise, check for issues.
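These steps can be automated with scikit-learn's cross_val_score, which repeats the train/evaluate loop for us. The sketch below is illustrative only and assumes the scaled x_train, y_train arrays from the earlier classification sections:

#Basic cross-validation with scikit-learn (sketch)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(criterion='entropy', random_state=0)
scores = cross_val_score(model, x_train, y_train, cv=5)   # 5 train/validate rounds
print(scores)          # score of each round
print(scores.mean())   # averaged performance estimate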

Methods used for Cross-Validation

There are some common methods that are used for cross-validation. These methods are given
below:

1. Validation Set Approach

2. Leave-P-out cross-validation

3. Leave one out cross-validation

4. K-fold cross-validation

5. Stratified k-fold cross-validation

Validation Set Approach


We divide our input dataset into a training set and a test or validation set in the validation set approach. Each subset is given 50% of the dataset.

But it has a big disadvantage: we are using only 50% of the dataset to train the model, so the model may miss important information in the data. It also tends to give an underfitted model.

Leave-P-out cross-validation
In this approach, p data points are left out of the training data. It means that if there are a total of n data points in the original input dataset, then n-p data points will be used as the training dataset and the p data points as the validation set. This complete process is repeated for all the samples, and the average error is calculated to know the effectiveness of the model.

The disadvantage of this technique is that it can be computationally expensive for large p.

Leave one out cross-validation
This method is similar to leave-p-out cross-validation, but instead of p, we take 1 data point out of the training data. It means that, in this approach, for each learning set only one data point is reserved, and the remaining dataset is used to train the model. This process repeats for each data point. Hence for n samples, we get n different training sets and n test sets. It has the following features:

o In this approach, the bias is minimal, as all the data points are used.
o The process is executed n times; hence the execution time is high.
o This approach leads to high variation in testing the effectiveness of the model, as we iteratively check against one data point.
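A minimal leave-one-out sketch with scikit-learn is shown below; x and y are assumed to be the full feature matrix and labels, and the classifier is just an example.

#Leave-one-out cross-validation (sketch)
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(), x, y, cv=loo)   # one 0/1 score per data point
print(scores.mean())                                           # overall accuracy estimate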

K-Fold Cross-Validation
K-fold cross-validation approach divides the input dataset into K groups of samples of equal
sizes. These samples are called folds. For each learning set, the prediction function uses k-1
folds, and the rest of the folds are used for the test set. This approach is a very popular CV
approach because it is easy to understand, and the output is less biased than other methods.

The steps for k-fold cross-validation are:

o Split the input dataset into K groups.
o For each group:
  o Take one group as the reserve or test dataset.
  o Use the remaining groups as the training dataset.
  o Fit the model on the training set and evaluate its performance using the test set.

Let's take an example of 5-fold cross-validation. The dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train it. In the 2nd iteration, the second fold is used to test the model, and the rest are used to train it. This process continues until each fold has been used as the test fold.

Consider the below diagram:
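In code, the same 5-fold procedure can be written explicitly with scikit-learn's KFold; the sketch below assumes x and y are NumPy arrays of features and labels.

#5-fold cross-validation written out with KFold (sketch)
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(x):
    model = DecisionTreeClassifier(criterion='entropy', random_state=0)
    model.fit(x[train_idx], y[train_idx])                       # k-1 folds for training
    fold_scores.append(model.score(x[test_idx], y[test_idx]))   # held-out fold for testing
print(fold_scores)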

Stratified k-fold cross-validation
This technique is similar to k-fold cross-validation with some small changes. This approach works on the concept of stratification, which is the process of rearranging the data to ensure that each fold or group is a good representative of the complete dataset. It is one of the best approaches to deal with bias and variance.

It can be understood with the example of housing prices: the price of some houses can be much higher than that of other houses. To tackle such situations, a stratified k-fold cross-validation technique is useful.
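For classification problems, the stratified variant is available as StratifiedKFold. The sketch below assumes a binary label vector y and a feature matrix x, and it simply shows that every test fold keeps roughly the same share of positive labels as the full dataset.

#Stratified k-fold keeps the class proportions in every fold (sketch)
from sklearn.model_selection import StratifiedKFold
import numpy as nm

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(x, y):
    print(nm.mean(y[test_idx]))   # approximately the same proportion in each fold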

Holdout Method
This method is the simplest cross-validation technique of all. In this method, we remove a subset of the training data and use it to get prediction results from a model trained on the rest of the dataset.

The error that occurs in this process tells us how well our model will perform on unknown data. Although this approach is simple to perform, it still faces the issue of high variance, and it can sometimes produce misleading results.

Comparison of Cross-validation to train/test split in Machine Learning

o Train/test split: The input data is divided into two parts, a training set and a test set, in a ratio such as 70:30 or 80:20. It provides high variance, which is one of its biggest disadvantages.

o Training Data: The training data is used to train the model, and the dependent variable is known.
o Test Data: The test data is used to make predictions with the model that has already been trained on the training data. It has the same features as the training data but is not part of it.
o Cross-Validation dataset: It is used to overcome the disadvantage of the train/test split by splitting the dataset into groups of train/test splits and averaging the results. It can be used if we want to optimize a model, trained on the training dataset, for the best performance. It is more efficient than a single train/test split, as every observation is used for both training and testing.

Limitations of Cross-Validation

There are some limitations of the cross-validation technique, which are given below:

o Under ideal conditions, it provides the optimum output. But for inconsistent data, it may produce drastically different results. This is one of the big disadvantages of cross-validation, as there is no certainty about the type of data in machine learning.

o In predictive modeling, the data evolves over time, which may create differences between the training and validation sets. For example, if we create a model for the prediction of stock market values and train it on the previous 5 years of stock values, the realistic future values for the next 5 years may be drastically different, so it is difficult to expect the correct output in such situations.

Applications of Cross-Validation

o This technique can be used to compare the performance of different predictive modeling methods.
o It has great scope in the medical research field.
o It can also be used for meta-analysis, as it is already being used by data scientists in the field of medical statistics.

Introduction to Dimensionality Reduction Technique

What is Dimensionality Reduction?

The number of input features, variables, or columns present in a given dataset is known as
dimensionality, and the process to reduce these features is called dimensionality reduction.

In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.

Dimensionality reduction technique can be defined as, "It is a way of converting the higher
dimensions dataset into lesser dimensions dataset ensuring that it provides similar
information." These techniques are widely used in machine learning for obtaining a better fit
predictive model while solving the classification and regression problems.

It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.


The Curse of Dimensionality

Handling high-dimensional data is very difficult in practice, a problem commonly known as the curse of dimensionality. If the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples needed also increases proportionally, and the chance of overfitting increases. If the machine learning model is trained on high-dimensional data, it becomes overfitted and results in poor performance.

Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.

Benefits of applying Dimensionality Reduction

Some benefits of applying dimensionality reduction technique to the given dataset are given
below:

o By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
o Less computation/training time is required for reduced dimensions of features.
o Reduced dimensions of the features help in visualizing the data quickly.
o It removes redundant features (if present) by taking care of multicollinearity.

Disadvantages of dimensionality Reduction

There are also some disadvantages of applying the dimensionality reduction, which are given
below:

o Some data may be lost due to dimensionality reduction.
o In the PCA dimensionality reduction technique, the number of principal components to consider is sometimes unknown.

Approaches of Dimension Reduction

There are two ways to apply the dimension reduction technique, which are given below:

Feature Selection
Feature selection is the process of selecting the subset of the relevant features and leaving out
the irrelevant features present in a dataset to build a model of high accuracy. In other words, it
is a way of selecting the optimal features from the input dataset.

Three methods are used for the feature selection:

1. Filters Methods

In this method, the dataset is filtered, and a subset that contains only the relevant features is
taken. Some common techniques of filters method are:

o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.

2. Wrappers Methods

The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model, and its performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filtering method but more complex to work with. Some common techniques of wrapper methods are:

o Forward Selection
o Backward Selection
o Bi-directional Elimination

3. Embedded Methods: Embedded methods check the different training iterations of the
machine learning model and evaluate the importance of each feature. Some common
techniques of Embedded methods are:

o LASSO
o Elastic Net
o Ridge Regression, etc.

Feature Extraction:
Feature extraction is the process of transforming the space containing many dimensions into
space with fewer dimensions. This approach is useful when we want to keep the whole
information but use fewer resources while processing the information.

Some common feature extraction techniques are:

a. Principal Component Analysis
b. Linear Discriminant Analysis
c. Kernel PCA
d. Quadratic Discriminant Analysis

Common techniques of Dimensionality Reduction


a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio

f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder

Principal Component Analysis (PCA)


Principal Component Analysis is a statistical process that converts the observations of
correlated features into a set of linearly uncorrelated features with the help of orthogonal
transformation. These new transformed features are called the Principal Components. It is
one of the popular tools that is used for exploratory data analysis and predictive modeling.

PCA works by considering the variance of each attribute, because a high variance indicates a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.

Backward Feature Elimination


The backward feature elimination technique is mainly used while developing a Linear Regression or Logistic Regression model. The below steps are performed in this technique for dimensionality reduction or feature selection:

o In this technique, firstly, all the n variables of the given dataset are taken to train the model.

o The performance of the model is checked.

o Now we will remove one feature each time and train the model on n-1 features for n times, and
will compute the performance of the model.

www.bisttechnologies.com
mail: [email protected]
Contact: 6305624547, 9392162778
o We will check for the variable that has made the smallest (or no) change in the performance of the model and then drop that variable or feature; after that, we will be left with n-1 features.
o Repeat the complete process until no more features can be dropped.

In this technique, by selecting the optimum performance of the model and the maximum tolerable error rate, we can define the optimal number of features required for the machine learning algorithm.

Forward Feature Selection


Forward feature selection follows the inverse process of backward elimination. It means that, in this technique, we don't eliminate features; instead, we find the best features that produce the highest increase in the performance of the model. The below steps are performed in this technique:

o We start with a single feature only, and progressively add one feature at a time.
o Here we train the model on each feature separately.
o The feature with the best performance is selected.

The process is repeated until adding a new feature no longer gives a significant increase in the performance of the model.
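Both forward selection and backward elimination are available in scikit-learn (version 0.24 or later) as SequentialFeatureSelector. The sketch below is illustrative only; x and y are assumed to be the full feature matrix and target, and the estimator and the number of features are example choices.

#Forward (or backward) feature selection with scikit-learn (sketch)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

selector = SequentialFeatureSelector(LogisticRegression(),
                                     n_features_to_select=3,
                                     direction='forward',   # use 'backward' for backward elimination
                                     cv=5)
selector.fit(x, y)
x_selected = selector.transform(x)   # dataset reduced to the chosen features
print(selector.get_support())        # boolean mask of the selected features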

Missing Value Ratio


If a variable in the dataset has too many missing values, we drop that variable, as it does not carry much useful information. To perform this, we can set a threshold level, and if a variable has more missing values than that threshold, we drop the variable. The choice of threshold controls how aggressive the reduction is.
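A minimal pandas sketch of this idea is given below; data_set is assumed to be a DataFrame, and 0.5 is only an example threshold.

#Dropping columns whose missing-value ratio exceeds a threshold (sketch)
missing_ratio = data_set.isnull().mean()                    # fraction of missing values per column
cols_to_drop = missing_ratio[missing_ratio > 0.5].index     # example threshold of 50%
data_reduced = data_set.drop(columns=cols_to_drop)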

Low Variance Filter


Similar to the missing value ratio technique, data columns with few changes in the data carry less information. Therefore, we need to calculate the variance of each variable, and all data columns with a variance lower than a given threshold are dropped, because low variance features will not affect the target variable.
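scikit-learn provides VarianceThreshold for exactly this filter. The sketch below assumes x is a numeric feature matrix and uses an example threshold.

#Removing low-variance features (sketch)
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)   # features with variance below 0.01 are dropped
x_reduced = selector.fit_transform(x)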

High Correlation Filter
High correlation refers to the case when two variables carry approximately the same information. Due to this, the performance of the model can be degraded. The correlation between independent numerical variables is measured by the correlation coefficient. If this value is higher than the threshold value, we can remove one of the variables from the dataset, preferring to keep the variable that shows a higher correlation with the target variable.
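One common way to apply this filter is with a pandas correlation matrix, as in the sketch below; x_df is assumed to be a DataFrame of numerical features, and 0.9 is an example cut-off.

#Dropping one variable from each highly correlated pair (sketch)
import numpy as nm

corr = x_df.corr().abs()                                             # absolute pairwise correlations
upper = corr.where(nm.triu(nm.ones(corr.shape), k=1).astype(bool))   # keep only the upper triangle
cols_to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
x_reduced = x_df.drop(columns=cols_to_drop)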

Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine learning. The algorithm has built-in feature importance scores, so we do not need to program them separately. In this technique, we generate a large set of trees against the target variable and use the usage statistics of each attribute to find the most relevant subset of features.

The random forest algorithm takes only numerical variables, so we need to convert the input data into numeric data using one-hot encoding.
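The built-in importance scores are exposed through the feature_importances_ attribute. The sketch below assumes x_encoded is a one-hot encoded DataFrame of features and y is the target.

#Ranking features with Random Forest importances (sketch)
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(x_encoded, y)
importances = pd.Series(model.feature_importances_, index=x_encoded.columns)
print(importances.sort_values(ascending=False).head(10))   # keep only the top-ranked features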

Factor Analysis
Factor analysis is a technique in which each variable is kept within a group according to its correlation with other variables. It means variables within a group can have a high correlation among themselves but a low correlation with variables of other groups.

We can understand it with an example of two variables, Income and Spend. These two variables have a high correlation: people with high income spend more, and vice versa. So, such variables are put into a group, and that group is known as a factor. The number of these factors will be smaller than the original number of dimensions of the dataset.

Auto-encoders
One of the popular methods of dimensionality reduction is auto-encoder, which is a type of
ANN or artificial neural network, and its main aim is to copy the inputs to their outputs. In this,

In this, the input is compressed into a latent-space representation, and the output is reconstructed from this representation. It has mainly two parts:

o Encoder: The function of the encoder is to compress the input to form the latent-space
representation.

o Decoder: The function of the decoder is to recreate the output from the latent-space
representation.
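A minimal encoder/decoder pair can be sketched with Keras, assuming TensorFlow is installed; the layer sizes, the 20 input features, and the training settings below are illustrative only, and x is assumed to be a scaled numeric array with 20 columns.

#Tiny auto-encoder for dimensionality reduction (sketch)
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(20,))                        # 20 original features (example)
encoded = layers.Dense(3, activation='relu')(inputs)        # encoder: latent-space representation
decoded = layers.Dense(20, activation='linear')(encoded)    # decoder: reconstructs the input
autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(x, x, epochs=20, batch_size=32, verbose=0)  # the target is the input itself

encoder = Model(inputs, encoded)
x_reduced = encoder.predict(x)                              # 3-dimensional representation of the data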

Principal Component Analysis


Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique to draw strong patterns from the given dataset by reducing
the variances.

PCA generally tries to find the lower-dimensional surface to project the high-dimensional data.

PCA works by considering the variance of each attribute, because a high variance indicates a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it keeps the important variables and drops the least important ones.

The PCA algorithm is based on some mathematical concepts such as:

o Variance and Covariance
o Eigenvalues and Eigenvectors

Some common terms used in PCA algorithm:

o Dimensionality: It is the number of features or variables present in the given dataset. More
easily, it is the number of columns present in the dataset.

o Correlation: It signifies that how strongly two variables are related to each other. Such as if
one changes, the other variable also gets changed. The correlation value ranges from -1 to +1.
Here, -1 occurs if variables are inversely proportional to each other, and +1 indicates that
variables are directly proportional to each other.

o Orthogonal: It defines that variables are not correlated to each other, and hence the correlation
between the pair of variables is zero.

o Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariances between pairs of variables is called the covariance matrix.

Principal Components in PCA


As described above, the transformed new features, or the output of PCA, are the Principal Components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:

o The principal components must be linear combinations of the original features.
o These components are orthogonal, i.e., the correlation between a pair of components is zero.
o The importance of each component decreases when going from 1 to n; the 1st PC has the most importance, and the nth PC has the least importance.

Steps for PCA algorithm

1. Getting the dataset


Firstly, we need to take the input dataset and divide it into two subparts X and Y, where X is
the training set, and Y is the validation set.

2. Representing data into a structure

www.bisttechnologies.com
mail: [email protected]
Contact: 6305624547, 9392162778
Now we will represent our dataset in a structure. For example, we will represent the independent variables X as a two-dimensional matrix, where each row corresponds to a data item and each column corresponds to a feature. The number of columns equals the dimensionality of the dataset.

3. Standardizing the data


In this step, we will standardize our dataset. In a particular column, features with high variance would otherwise appear more important than features with lower variance. If the importance of features should be independent of their variance, we divide each data item in a column by the standard deviation of that column. We will name the resulting matrix Z.

4. Calculating the Covariance of Z


To calculate the covariance of Z, we take the matrix Z and transpose it. After transposing, we multiply it by Z. The output matrix will be the covariance matrix of Z.

5. Calculating the Eigen Values and Eigen Vectors


Now we need to calculate the eigenvalues and eigenvectors of the resulting covariance matrix of Z. The eigenvectors of the covariance matrix are the directions of the axes with the most information, and the corresponding eigenvalues measure the amount of variance along those directions.

6. Sorting the Eigen Vectors


In this step, we take all the eigenvalues and sort them in decreasing order, i.e., from largest to smallest, and simultaneously sort the eigenvectors accordingly in the matrix P of eigenvectors. The resulting matrix is named P*.

7. Calculating the new features Or Principal Components


Here we calculate the new features. To do this, we multiply the P* matrix by Z. In the resulting matrix Z*, each observation is a linear combination of the original features, and the columns of Z* are independent of each other.

8. Remove less or unimportant features from the new dataset.

www.bisttechnologies.com
mail: [email protected]
Contact: 6305624547, 9392162778
Once the new feature set is obtained, we decide what to keep and what to remove. It means we keep only the relevant or important features in the new dataset, and the unimportant features are removed.
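Steps 2 to 7 can be written almost directly in NumPy. The sketch below assumes x is a numeric data matrix (rows are observations, columns are features) and keeps the first k components; it is an illustration, not a full implementation.

#PCA following the steps above (sketch)
import numpy as nm

Z = (x - x.mean(axis=0)) / x.std(axis=0)     # step 3: standardize the data
cov = (Z.T @ Z) / (len(Z) - 1)               # step 4: covariance matrix of Z
eig_vals, eig_vecs = nm.linalg.eigh(cov)     # step 5: eigenvalues and eigenvectors

order = nm.argsort(eig_vals)[::-1]           # step 6: sort from largest to smallest eigenvalue
P_star = eig_vecs[:, order]

k = 2                                        # step 8: keep only the most important components
Z_star = Z @ P_star[:, :k]                   # step 7: the new features (principal components)

In practice, the same result can be obtained with sklearn.decomposition.PCA(n_components=k).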

Applications of Principal Component Analysis

o PCA is mainly used as the dimensionality reduction technique in various AI applications such
as computer vision, image compression, etc.

o It can also be used for finding hidden patterns if data has high dimensions. Some fields where
PCA is used are Finance, data mining, Psychology, etc.

