Unit 2 ML

Machine Learning unit2 notes

Week 1

Machine Learning Techniques


Machine learning is a data analytics technique that teaches computers to do what
comes naturally to humans and animals: learn from experience. Machine learning
algorithms use computational methods to directly "learn" from data without relying on
a predetermined equation as a model.

As the number of samples available for learning increases, the algorithm adapts to
improve performance. Deep learning is a special form of machine learning.

How does machine learning work?


Machine learning uses two main techniques: supervised learning, which trains a model on
known input and output data so that it can predict future outputs, and unsupervised learning,
which finds hidden patterns or intrinsic structures in the input data.
Supervised learning
Supervised machine learning creates a model that makes predictions based on evidence
in the presence of uncertainty. A supervised learning algorithm takes a known set of
input data and known responses to the data (output) and trains a model to generate
reasonable predictions for the response to the new data. Use supervised learning if you
have known data for the output you are trying to estimate.

Supervised learning uses classification and regression techniques to develop machine
learning models.

Classification models classify the input data. Classification techniques predict discrete
responses. For example, the email is genuine, or spam, or the tumor is cancerous or
benign. Typical applications include medical imaging, speech recognition, and credit
scoring.

Use classification if your data can be tagged, categorized, or divided into specific groups or
classes. For example, applications for handwriting recognition use classification to
recognize letters and numbers. In image processing and computer vision, unsupervised
pattern recognition techniques are used for object detection and image segmentation.

Common algorithms for performing classification include support vector machines (SVMs),
boosted and bagged decision trees, k-nearest neighbors, Naive Bayes, discriminant analysis,
logistic regression, and neural networks.

Regression techniques predict continuous responses - for example, changes in temperature
or fluctuations in electricity demand. Typical applications include power load forecasting
and algorithmic trading.

If you are working with a data range or if the nature of your response is a real number,
such as temperature or the time until a piece of equipment fails, use regression
techniques.

Common regression algorithms include linear and nonlinear models, regularization,
stepwise regression, boosted and bagged decision trees, neural networks, and
adaptive neuro-fuzzy learning.

Using supervised learning to predict heart attacks

Physicians want to predict whether someone will have a heart attack within a year. They
have data on previous patients, including age, weight, height, and blood pressure.
They know if previous patients had had a heart attack within a year. So the problem is to
combine existing data into a model that can predict whether a new person will have a
heart attack within a year.
Unsupervised Learning
Unsupervised learning finds hidden patterns or intrinsic structures in data. It is used to
draw inferences from datasets consisting of input data without labeled responses.

Clustering is a common unsupervised learning technique. It is used for exploratory data
analysis to find hidden patterns and clusters in the data. Applications for cluster analysis
include gene sequence analysis, market research, and commodity identification.

For example, if a cell phone company wants to optimize the locations where it builds
towers, it can use machine learning to estimate the number of people who will rely on
each tower.

A phone can only talk to one tower at a time, so the team uses clustering algorithms to
design a good placement of cell towers that optimizes signal reception for groups, or
clusters, of customers.

Common algorithms for performing clustering are k-means and k-medoids, hierarchical
clustering, Gaussian mixture models, hidden Markov models, self-organizing maps,
fuzzy C-means clustering, and subtractive clustering.

Ten methods are described below; they form a foundation you can build on to improve your
machine learning knowledge and skills:

o Regression

o Classification

o Clustering
o Dimensionality Reduction

o Ensemble Methods

o Neural Nets and Deep Learning

o Transfer Learning

o Reinforcement Learning

o Natural Language Processing

o Word Embeddings

Let's differentiate between two general categories of machine learning: supervised and
unsupervised. We apply supervised ML techniques when we have a quantity that we want to
predict or interpret: we use previous input and output data to predict the output for a new
input.

For example, you can use supervised ML techniques to help a service business that
wants to estimate the number of new users who will sign up for the service in the next
month. In contrast, unsupervised ML looks at ways of relating and grouping data points
without using a target variable to predict.

In other words, it evaluates data in terms of traits and uses the traits to group objects that
are similar to each other. For example, you can use unsupervised learning techniques to
help a retailer who wants to segment products with similar characteristics, without
specifying in advance which features to use.

1. Regression
Regression methods fall under the category of supervised ML. They help predict or
interpret a particular numerical value based on prior data, such as predicting an asset's
price based on past pricing data for similar properties.

The simplest method is linear regression, where we use the mathematical equation of a
line (y = m * x + b) to model the data set. We train a linear regression model with many
data pairs (x, y) by computing the position and slope of the line that minimizes the total
distance between all of the data points and the line. In other words, we calculate the
slope (m) and the y-intercept (b) for the line that best approximates the observations in
the data.

Let us consider a more concrete example of linear regression. I once used linear
regression to predict the energy consumption (in kWh) of some buildings by gathering
the age of the building, the number of stories, the square footage, and the number of
wall devices plugged in.

Since there was more than one input (age, square footage, etc.), I used multivariable
linear regression. The principle was the same as for simple one-variable linear regression,
but in this case the "line" I created occurred in a multi-dimensional space, depending on
the number of variables.

Now imagine that you have access to the characteristics of a building (age, square feet,
etc.), but you do not know the energy consumption. In this case, we can use the fitted
line to estimate the energy consumption of the particular building. The plot below
shows how well the linear regression model fits the actual energy consumption of the
building.

Note that you can also use linear regression to estimate the weight of each factor that
contributes to the final prediction of energy consumed. For example, once you have a
formula, you can determine whether age, size, or height are most important.
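To make this concrete, here is a minimal sketch of a multivariable linear regression of this kind, assuming scikit-learn is available; the building features and kWh values below are made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical building data: [age (years), stories, square feet, plugged-in devices]
X = np.array([
    [10, 2, 1500, 20],
    [25, 3, 3000, 45],
    [ 5, 1,  800, 10],
    [40, 4, 5000, 80],
    [15, 2, 2200, 30],
])
# Hypothetical energy consumption in kWh for each building
y = np.array([120, 260, 70, 430, 180])

model = LinearRegression().fit(X, y)

# The learned coefficients estimate the weight each factor contributes to the prediction
print("Intercept:", model.intercept_)
print("Weights (age, stories, sq ft, devices):", model.coef_)

# Estimate consumption for a new building with known characteristics
new_building = np.array([[20, 2, 1800, 25]])
print("Predicted kWh:", model.predict(new_building))

Inspecting model.coef_ is the step described above: it shows which factors carry the most weight in the fitted model.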

Linear regression model estimates of building energy consumption (kWh).

Regression techniques run the gamut from simple (linear regression) to complex
(regularized linear regression, polynomial regression, decision trees, random forest
regression, and neural nets). But don't get confused: start by studying simple linear
regression, master the techniques, and move on.
2. Classification
In another class of supervised ML, classification methods predict or explain a class value.
For example, they can help predict whether an online customer will purchase a product.
Output can be yes or no: buyer or no buyer. But the methods of classification are not
limited to two classes. For example, a classification method can help assess whether a
given image contains a car or a truck. The simplest classification algorithm is logistic
regression, which sounds like a regression method, but it is not. Logistic regression
estimates the probability of occurrence of an event based on one or more inputs.

For example, logistic regression can take two test scores for a student and predict whether the
student will be admitted to a particular college. Because the estimate is a probability, the
output is a number between 0 and 1, where 1 represents complete certainty. For the
student, if the predicted probability is greater than 0.5, we predict that they will be
admitted. If the predicted probability is less than 0.5, we predict that they will be rejected.

The chart below shows the marks of past students and whether they were admitted.
Logistic regression allows us to draw a line that represents the decision boundary.

Because logistic regression is the simplest classification model, it is a good place to start
for classification. As you progress, you can dive into nonlinear classifiers such as decision
trees, random forests, support vector machines, and neural nets, among others.
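A hedged sketch of the two-test-score example using scikit-learn's LogisticRegression; the scores, admission labels, and the 0.5 threshold below are assumptions invented for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: two test scores per student, and whether they were admitted (1) or not (0)
scores = np.array([[45, 50], [60, 65], [80, 85], [90, 75],
                   [55, 40], [70, 90], [35, 30], [85, 95]])
admitted = np.array([0, 0, 1, 1, 0, 1, 0, 1])

clf = LogisticRegression().fit(scores, admitted)

# Predicted probability of admission for a new student
new_student = np.array([[72, 68]])
prob = clf.predict_proba(new_student)[0, 1]
print("Probability of admission:", prob)
print("Admitted" if prob > 0.5 else "Rejected")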

3. Clustering
Clustering methods fall under unsupervised ML because they aim to group, or cluster,
observations that have similar characteristics. Clustering methods do not use output
information for training but instead let the algorithm define the output. In clustering
methods, we can only use visualization to inspect the quality of the solution.

The most popular clustering method is K-Means, where "K" represents the number of
clusters selected by the user. (Note that there are several techniques for selecting the
value of K, such as the elbow method.)

K-Means proceeds as follows (a minimal sketch follows these steps):

o Randomly choose K centers within the data.

o Assign each data point to the closest of the K centers.

o Recompute each center as the mean of the data points assigned to it.

o If the centers do not change (or change very little), the process is over. Otherwise,
return to the assignment step. (To prevent ending up in an infinite loop if the centers
keep changing slightly, set a maximum number of iterations in advance.)
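Here is a minimal sketch of these steps using scikit-learn's KMeans on synthetic two-dimensional data; the blob data and the choice K = 3 are assumptions for illustration (in practice, something like the elbow method mentioned above would guide the choice of K):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three blobs of 2-D points
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
])

# K is chosen by the user; max_iter caps the iterations to avoid looping forever
kmeans = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0).fit(data)

print("Cluster centers:\n", kmeans.cluster_centers_)
print("First 10 cluster labels:", kmeans.labels_[:10])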

The next plot applies the K-means to the building's data set. The four measurements
pertain to air conditioning, plug-in appliances (microwave, refrigerator, etc.), household
gas, and heating gas. Each column of the plot represents the efficiency of each building.


Clustering Buildings into Efficient (Green) and Inefficient (Red) Groups.

As you explore clustering, you will come across very useful algorithms such as
Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Mean Shift Clustering,
Agglomerative Hierarchical Clustering, and Expectation-Maximization Clustering using
Gaussian Mixture Models, among others.

4. Dimensionality Reduction
We use dimensionality reduction to remove the least important information (sometimes
unnecessary columns) from a data set. For example, images may consist of thousands of
pixels, not all of which matter to your analysis. Or, when testing microchips within the
manufacturing process, you may have thousands of measurements and tests applied to each
chip, many of which provide redundant information. In these cases, you need a
dimensionality reduction algorithm to make the data set manageable.

The most popular dimensionality reduction method is Principal Component Analysis (PCA),
which reduces the dimensionality of the feature space by finding new vectors that
maximize the linear variance of the data. (You can also measure the extent of
information loss and adjust accordingly.) When the linear correlations of the data are
strong, PCA can dramatically reduce the dimension of the data without losing too much
information.

Another popular method is t-distributed stochastic neighbor embedding (t-SNE), which
performs nonlinear dimensionality reduction. People usually use t-SNE for data visualization,
but you can also use it for machine learning tasks such as feature space reduction and
clustering, to mention a few.

The next plot shows the analysis of the MNIST database of handwritten digits. MNIST
contains thousands of images of numbers 0 to 9, which the researchers use to test their
clustering and classification algorithms. Each row of the data set is a vector version of
the original image (size 28 x 28 = 784) and a label for each image (zero, one, two,
three, …, nine). Therefore, we are reducing the dimensionality from 784 (pixels) to 2 (the
dimensions in our visualization). Projecting to two dimensions allows us to visualize
higher-dimensional original data sets.
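A sketch of the same idea using scikit-learn's small built-in digits dataset (8 x 8 = 64 pixels rather than the full 28 x 28 MNIST images, chosen only to keep the example self-contained and fast), projected to two dimensions with PCA and t-SNE:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()              # 1797 images, each a 64-dimensional vector
X, labels = digits.data, digits.target

# Linear reduction: keep the 2 directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear reduction, mainly used for visualization
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print("Original shape:", X.shape)    # (1797, 64)
print("PCA shape:", X_pca.shape)     # (1797, 2)
print("t-SNE shape:", X_tsne.shape)  # (1797, 2)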

5. Ensemble Methods
Imagine that you have decided to build a bicycle because you are not happy with the
options available in stores and online. You might begin by finding the best of each part
you need; once you've assembled all these great parts, the resulting bike will outshine
all the other options.

Ensemble methods use the same idea of combining multiple predictive models (supervised
ML) to obtain higher quality predictions than each of the models could provide on its own.

For example, the Random Forest algorithm is an ensemble method that combines
multiple decision trees trained with different samples from a data set. As a result, the
quality of predictions of a random forest exceeds the quality of predictions predicted
with a single decision tree.

Another way to think about ensembles is as a way to reduce the variance and bias of a
single machine learning model. One model may be accurate under some conditions but
inaccurate under others, while another model's relative accuracy may be reversed; by
combining the two models, the quality of the predictions is balanced out.

Most of the top winners of Kaggle competitions use some ensemble method. The most
popular ensemble algorithms are Random Forest, XGBoost, and LightGBM.
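As a rough illustration, the sketch below compares a single decision tree with a random forest using scikit-learn; the breast cancer dataset is an arbitrary built-in choice, not one used in the text:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# The forest averages many trees trained on different samples of the data
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Single tree accuracy:", tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))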

6. Neural networks and deep learning

Unlike linear and logistic regression, which are considered linear models, neural networks
aim to capture nonlinear patterns in data by adding layers of parameters to the model.
The simple neural net in the image below has three inputs, a hidden layer with five
parameters, and an output layer.

Neural network with a hidden layer.

The neural network structure is flexible enough to reproduce well-known models such as
linear and logistic regression. The term deep learning comes from a neural net with many
hidden layers and encompasses a wide variety of architectures.
It is especially difficult to keep up with development in deep learning as the research
and industry communities redouble their deep learning efforts, spawning whole new
methods every day.

Deep learning: A neural network with multiple hidden layers.

Deep learning techniques require a lot of data and computational power for best
performance because the method self-tunes many parameters within vast architectures. It
quickly becomes clear why deep learning practitioners need powerful computers with
GPUs (Graphics Processing Units).

In particular, deep learning techniques have been extremely successful in vision (image
classification), text, audio, and video. The most common software packages for deep
learning are Tensorflow and PyTorch.
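A minimal PyTorch sketch of the small network described above (three inputs, a hidden layer with five units, one output); the random data, loss, and training settings are assumptions for illustration only:

import torch
import torch.nn as nn

# Simple feed-forward net: 3 inputs -> 5 hidden units -> 1 output
model = nn.Sequential(
    nn.Linear(3, 5),
    nn.ReLU(),          # the nonlinearity is what lets the net capture nonlinear patterns
    nn.Linear(5, 1),
)

# Dummy data: 16 samples with 3 features each
x = torch.randn(16, 3)
y = torch.randn(16, 1)

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # how wrong the current parameters are
    loss.backward()               # gradients of the loss w.r.t. every parameter
    optimizer.step()              # gradient descent update

print("Final training loss:", loss.item())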

7. Transfer learning
Let's say you are a data scientist working in the retail industry. You've spent months
training a high-quality model to classify images as shirts, t-shirts, and polos. Your new
task is to create a similar model to classify clothing images like jeans, cargo, casual, and
dress pants.

Transfer learning refers to reusing part of an already trained neural net and adapting it
to a new but similar task. Specifically, once you train a neural net using the data for a
task, you can move a fraction of the trained layers and combine them with some new
layers that you can use for the new task. The new neural net can learn and adapt quickly
to a new task by adding a few layers.
The main advantage of transfer learning is that you need less data to train the neural net,
which is especially important because training deep learning algorithms is expensive in
terms of both time and money (computational resources). Of course, it also isn't easy to
find enough labeled data for training.

Let's come back to your example and assume that you use a neural net with 20 hidden
layers for the shirt model. After running a few experiments, you realize that you can
move the 18 layers of the shirt model and combine them with a new layer of parameters
to train on the pant images.

So the pants model will have 19 hidden layers. The inputs and outputs of the two tasks
are different, but the reusable layers can summarize information relevant to both,
for example, aspects of fabric.

Transfer learning has become more and more popular, and there are many concrete
pre-trained models now available for common deep learning tasks such as image and
text classification.
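A sketch of the reuse-and-retrain idea with a pre-trained torchvision model; ResNet-18, the recent torchvision weights API, and the four pant classes are assumptions for illustration, and only the new final layer is left trainable:

import torch.nn as nn
from torchvision import models

# Load a network pre-trained on a large image dataset (downloads weights on first use)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the already-trained layers so their weights are reused, not retrained
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new one for the 4 new classes
# (jeans, cargo, casual, dress pants)
model.fc = nn.Linear(model.fc.in_features, 4)

# Only the new layer's parameters will be updated during training
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)   # ['fc.weight', 'fc.bias']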

8. Reinforcement Learning
Imagine a mouse in a maze trying to find hidden pieces of cheese. At first, the mouse
may move randomly, but after a while, its experience helps it sense which actions bring
it closer to the cheese. The more times we expose the mouse to the maze, the better it
gets at finding the cheese.

This process for the mouse mirrors what we do with Reinforcement Learning (RL) to train a
system or a game. Generally speaking, RL is a machine learning method that helps an
agent learn from experience.

RL can maximize a cumulative reward by recording actions and using a trial-and-error
approach in a set environment. In our example, the mouse is the agent, and the maze is
the environment. The set of possible actions for the mouse is: move forward, backward,
left, or right. The reward is the cheese.
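A toy Q-learning sketch of this agent-environment loop, using a one-dimensional "maze" of five cells with the cheese in the last cell; the environment, rewards, and hyperparameters are all invented for illustration:

import random

n_states = 5            # cells 0..4; the cheese is in cell 4
actions = [-1, +1]      # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Trial and error: sometimes explore, otherwise take the best known action
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), n_states - 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0   # cheese found
        # Update the estimate of the action's long-term (cumulative) value
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy policy moves right toward the cheese from every cell
print([max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)])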

You can use RL when you have little or no historical data about a problem, because it does
not need prior information (unlike traditional machine learning methods). In the RL
framework, you learn from the data as you go. Not surprisingly, RL is particularly
successful with games, especially games of "perfect information" such as chess and Go.
With games, feedback from the agent and the environment comes quickly, allowing the
model to learn fast. The downside of RL is that it can take a very long time to train if
the problem is complex.

Just as IBM's Deep Blue beat the best human chess player in 1997, the RL-based program
AlphaGo beat the best Go player in 2016. The current frontrunners of RL are the teams at
DeepMind in the UK.

In April 2019, the OpenAI Five team was the first AI to defeat a world champion team at
the e-sport Dota 2, a very complex video game that the OpenAI team chose because, at the
time, no RL algorithm was capable of winning at it. You can tell that reinforcement
learning is a particularly powerful form of AI, and we certainly want to see more
progress from these teams. Still, it's also worth remembering the limitations of the
method.

9. Natural Language Processing


A large percentage of the world's data and knowledge is in some form of human
language. For example, we can train our phones to autocomplete our text messages or
correct misspelled words. We can also teach a machine to have a simple conversation
with a human.

Natural Language Processing (NLP) is not a machine learning method per se, but a widely used
technique for preparing text for machine learning. Think of many text documents in
different formats (Word, online blogs, etc.). Most of these text documents will be full of
typos, missing characters, and other words that need to be filtered out. At the moment, the
most popular package for processing text is NLTK (Natural Language Toolkit), created by
researchers at the University of Pennsylvania.

The easiest way to map text to a numerical representation is to count the frequency of
each word in each text document. Think of a matrix of integers where each row
represents a text document, and each column represents a word. This matrix
representation of the term frequency is usually called the term frequency matrix (TFM).
We can create a more popular matrix representation of a text document by weighting each
entry of the matrix by how important each word is in the entire corpus of documents
(words that appear in few documents get a higher weight). We call this method Term
Frequency Inverse Document Frequency (TFIDF), and it generally works better for machine
learning tasks.
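A sketch of both representations with scikit-learn (CountVectorizer for the term frequency matrix, TfidfVectorizer for TFIDF); the three toy documents are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Term frequency matrix: rows are documents, columns are words, entries are counts
tfm = CountVectorizer().fit_transform(docs)
print(tfm.toarray())

# TFIDF: counts reweighted by how important each word is across the whole corpus
tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.toarray().round(2))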

10. Word Embedding


TFM and TFIDF are numerical representations of text documents that consider only
frequency and weighted frequencies to represent text documents. In contrast, word
embeddings can capture the context of a word in a document. Because they capture word
context, embeddings can measure similarity between words, which allows us to perform
arithmetic with words.

Word2Vec is a neural net-based method that maps the words in a corpus to numerical
vectors. We can then use these vectors to find synonyms, perform arithmetic operations
with words, or represent text documents (by taking the mean of all the word vectors in a
document). We assume that a sufficiently large corpus of text documents is used to
estimate the word embeddings.

Let's say vector('word') is the numeric vector that represents the word 'word'. To
approximate vector('queen'), we can perform an arithmetic operation with the vectors:

vector('king') + vector('woman') - vector('man') ~ vector('queen')

Arithmetic with Word Embeddings (Vectors).

The word representation allows finding the similarity between words by computing the
cosine similarity between the vector representations of two words. The cosine similarity
measures the angle between two vectors.
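A small sketch of the cosine similarity calculation itself, using toy 3-dimensional vectors in place of real learned embeddings (actual Word2Vec vectors typically have hundreds of dimensions):

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1 means same direction, 0 means unrelated
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy "embeddings" invented for illustration
king  = np.array([0.8, 0.6, 0.1])
man   = np.array([0.7, 0.1, 0.1])
woman = np.array([0.7, 0.1, 0.9])
queen = np.array([0.8, 0.6, 0.9])

approx_queen = king + woman - man
print("similarity(approx_queen, queen):", cosine_similarity(approx_queen, queen))
print("similarity(approx_queen, man):  ", cosine_similarity(approx_queen, man))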

We calculate word embeddings using machine learning methods, but this is often a
pre-processing stage before applying machine learning algorithms on top. For example,
suppose we have access to the tweets of several thousand Twitter users, and we also know
which of these users bought a house. To estimate the probability that a new Twitter user
will buy a house, we can combine Word2Vec with logistic regression.

You can train the word embeddings yourself or get a pre-trained (transfer learning) set of
word vectors. To download pre-trained word vectors in 157 different languages, take a look
at fastText.

Summary
Studying these methods thoroughly and fully understanding the basics of each can
serve as a solid starting point for further study of more advanced algorithms and
methods.

There is no best way or one size fits all. Finding the right algorithm is partly just trial and
error - even highly experienced data scientists can't tell whether an algorithm will work
without trying it out. But algorithmic selection also depends on the size and type of data
you're working with, the insights you want to derive from the data, and how those
insights will be used.
Machine Learning Paradigms
Machine Learning (ML) is a field in which algorithms learn from experience without being
explicitly programmed. We provide input data that we want our algorithm to examine, and the
algorithm returns an output based on what it has learned from that input.
The learning component is what makes ML unique. It is almost like a black box that takes an input,
does the magic inside, and outputs the values we want to predict.
Yet, just like humans, machines can take different approaches to learning the material. Those
different approaches are called ML paradigms, and they help us understand how a computer learns
from data, namely from the input.
There are three basic ML paradigms:
1. Reinforcement Learning
2. Supervised Learning
3. Unsupervised Learning
Reinforcement Learning

In psychology, reinforcement is a term that refers to “anything that increases the likelihood that a
response will occur”(Cherry). Let’s understand this with an example.
Let's say that you want to teach your cat, Nancy, how to sit. When Nancy sits, you reward her with a
treat. She eventually learns that sitting is an action that brings her something delicious, so she learns
how to sit with the intention of getting the reward.
When this psychological phenomenon is applied to ML, machines observe the environment, decide on
their actions and get a reward or punishment in return. While doing so, the algorithms learn what to
do to optimize their decisions based on those punishments and rewards. This learning system is
called an agent.
One of the most common examples of reinforcement learning is the chess game you play against your
computer. Fun fact: AlphaGo, a computer program developed by DeepMind, was the first computer
program able to defeat a world champion at the game of Go!
Supervised Learning
Supervised learning means that the training data that we feed our algorithm with has labels on it. As a
result, it maps the input (training data) to output and labels the data accordingly. It is commonly
represented in a table.
There are two supervised learning tasks:
Classification
The model assigns a category to the target variable. The target variable is the category you want your
algorithm to find.
If you have an input of patients and you want to predict if someone has heart disease or not, your
model makes predictions from the data using features such as a patient’s sex, age, etc. and assigns the
category (yes or no in this case) to the target variable (heart disease). It would look something like
this:

Other famous examples include predicting whether an email is spam or not and, given handwritten
digits, classifying which digit each input represents.
Regression
The model assigns a continuous variable to the target variable. Let’s say that you have the data of past
stock prices and you want to predict the price for tomorrow. You cannot assign a specific category to
the target variable as in the above example, therefore you should predict a specific value for each row
in the target column.
Other examples of regression would be predicting height, weight, or weather.
Unsupervised Learning
In unsupervised learning, we don’t have a column for the target variable — we actually don’t really
know what we are looking for.
We will look at an example to understand what that means.
Let's say that you work for a company and you are asked to categorize the different customers you
have so that the company can improve its marketing strategies, yet you do not know how to create
those categories. A clustering model looks at the consumption behavior of the customers, decides how
many groups there should be, and determines who should be placed in each group. In summary, the
model divides the dataset according to its similarities. This is called Clustering, a subcategory of
Unsupervised Learning.
Applications of Machine learning
Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. We
use machine learning in our daily lives without even knowing it, for example in Google Maps, Google
Assistant, and Alexa. Below are some of the most trending real-world applications of machine learning:

1. Image Recognition:

Image recognition is one of the most common applications of machine learning. It is used to identify
objects, persons, places, digital images, etc. A popular use case of image recognition and face
detection is the automatic friend tagging suggestion:

Facebook provides a feature of auto friend tagging suggestions. Whenever we upload a photo with
our Facebook friends, we automatically get a tagging suggestion with their names, and the technology
behind this is machine learning's face detection and recognition algorithms.

It is based on the Facebook project named "DeepFace," which is responsible for face recognition and
person identification in pictures.

2. Speech Recognition

While using Google, we get a "Search by voice" option; this falls under speech recognition, and it is
a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text, and it is also known as
"speech to text" or "computer speech recognition." At present, machine learning algorithms are
widely used in various speech recognition applications. Google Assistant, Siri, Cortana,
and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with
the shortest route and predicts the traffic conditions.

It predicts the traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested,
with the help of two kinds of information:

o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better. It takes information from the
user and sends it back to its database to improve performance.

4. Product recommendations:

Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for recommending products to the user. Whenever we search for a
product on Amazon, we start getting advertisements for the same product while browsing the internet
in the same browser, and this is because of machine learning.

Google understands the user's interests using various machine learning algorithms and suggests
products according to those interests.

Similarly, when we use Netflix, we find recommendations for series, movies, etc.,
and this is also done with the help of machine learning.

5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars. Machine learning plays
a significant role in self-driving cars. Tesla, one of the most popular car manufacturers, is working
on self-driving cars. It uses machine learning methods to train the car models to detect people and
objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is automatically filtered as important, normal, or spam. We
always receive important mail in our inbox marked with the important symbol and spam emails in our
spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistant:

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the
name suggests, they help us find information using our voice instructions. These assistants can
help us in various ways just through voice instructions, such as playing music, calling someone,
opening an email, scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important part of their operation.

These assistants record our voice instructions, send them to a server in the cloud, decode them using
ML algorithms, and act accordingly.

8. Online Fraud Detection:

Machine learning is making our online transactions safe and secure by detecting fraudulent transactions.
Whenever we perform an online transaction, there are various ways a fraudulent transaction
can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction.
To detect this, a feed-forward neural network helps by checking whether a transaction is genuine or
fraudulent.

For each genuine transaction, the output is converted into hash values, and these values become
the input for the next round. Each genuine transaction follows a specific pattern, which changes
for a fraudulent transaction; the network detects this and makes our online transactions more secure.

9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a risk of
ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural networks are
used for predicting stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With this, medical technology is
growing very fast and is able to build 3D models that can predict the exact position of lesions in the
brain.

It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:

Nowadays, if we visit a new place and do not know the language, it is not a problem at all, because
machine learning helps us by converting the text into a language we know. Google's
GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation
system that translates text into our familiar language, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm, which
is used together with image recognition to translate text from one language to another.
Simple Linear Regression

Simple linear regression is an approach for predicting a response using a single feature.

It is assumed that the two variables are linearly related. Hence, we try to find a linear
function that predicts the response value (y) as accurately as possible as a function of
the feature or independent variable (x).

For generality, we define:

x as feature vector, i.e x = [x_1, x_2, …., x_n],

y as response vector, i.e y = [y_1, y_2, …., y_n]

for n observations (in the example below, n = 10).

A scatter plot of this dataset (produced by the code at the end of this section) shows the
points trending upward, roughly along a line.


Now, the task is to find the line that best fits the above scatter plot so that we can
predict the response for any new feature value (i.e., a value of x not present in the
dataset).

This line is called the regression line.

The equation of the regression line is represented as:

h(x_i) = b_0 + b_1 * x_i

Here,

• h(x_i) represents the predicted response value for the ith observation.
• b_0 and b_1 are regression coefficients and represent the y-intercept and slope of the
regression line respectively.

To create our model, we must "learn" or estimate the values of the regression coefficients
b_0 and b_1. And once we've estimated these coefficients, we can use the model to predict
responses!

In this article, we are going to use the Least Squares technique.

Now consider:

e_i = y_i - h(x_i)

Here, e_i is the residual error in the ith observation.

So, our aim is to minimize the total residual error.

We define the squared error or cost function, J, as:

J(b_0, b_1) = (1 / 2n) * Σ e_i²

and our task is to find the values of b_0 and b_1 for which J(b_0, b_1) is minimum!

Without going into the mathematical details, we present the result here:

b_1 = SS_xy / SS_xx
b_0 = m_y - b_1 * m_x

where m_x and m_y are the means of x and y, SS_xy is the sum of cross-deviations of y and x:

SS_xy = Σ (x_i - m_x)(y_i - m_y) = Σ x_i * y_i - n * m_x * m_y

and SS_xx is the sum of squared deviations of x:

SS_xx = Σ (x_i - m_x)² = Σ x_i² - n * m_x²

Note: The complete derivation for finding the least squares estimates in simple linear
regression can be found here.

Given below is the Python implementation of the above technique on our small dataset:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vectors
    m_x, m_y = np.mean(x), np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # show plot
    plt.show()

def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

The output of the above piece of code is:


Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
What is Cost Function in Machine Learning
Machine Learning models require a high level of accuracy to work in the real world. But how
do you calculate how wrong or right your model is? This is where the cost function comes into
the picture. A cost function is a machine learning quantity used to judge the model; understanding
cost functions is important for knowing how well the model has estimated the relationship
between your input and output parameters.

What Is Cost Function in Machine Learning?

After training your model, you need to see how well it is performing. While accuracy
metrics tell you how well the model is performing, they do not give you insight into how
to improve it. Hence, you need a corrective function that can help you determine when the
model is most accurate, as you need to hit that sweet spot between an undertrained model
and an overtrained model.

A Cost Function is used to measure just how wrong the model is in finding the relation between
the input and output. It tells you how badly your model is behaving/predicting.

Consider a robot trained to stack boxes in a factory. The robot might have to consider certain
changeable parameters, called Variables, which influence how it performs. Let’s say the robot
comes across an obstacle, like a rock. The robot might bump into the rock and realize that it is
not the correct action.

It will learn from this, and next time it will learn to avoid rocks. Hence, your machine uses
variables to better fit the data. The outcome of all these obstacles will further optimize the robot
and help it perform better. It will generalize and learn to avoid obstacles in general, say like a fire
that might have broken out. The outcome acts as a cost function, which helps you optimize the
variable, to get the best variables and fit for the model.
Figure 1: Robot learning to avoid obstacles

What Is Gradient Descent?

Gradient Descent is an algorithm that is used to optimize the cost function or the error of the
model. It is used to find the minimum value of error possible in your model.

Gradient Descent can be thought of as the direction you have to take to reach the least possible
error. The error in your model can be different at different points, and you have to find the
quickest way to minimize it, to prevent resource wastage.

Gradient Descent can be visualized as a ball rolling down a hill. Here, the ball will roll to the
lowest point on the hill. It can take this point as the point where the error is least as for any
model, the error will be minimum at one point and will increase again after that.

In gradient descent, you find the error in your model for different values of input variables. This
is repeated, and soon you see that the error values keep getting smaller and smaller. Soon you’ll
arrive at the values for variables when the error is the least, and the cost function is optimized.
Figure 2: Gradient Descent

What Is the Cost Function For Linear Regression?

A Linear Regression model uses a straight line to fit the model. This is done using the equation
for a straight line as shown :

Figure 3: Linear regression function

In the equation, you can see two entities whose values can change (variables): a, which is
the point at which the line intercepts the y-axis (the intercept), and b, which is how steep
the line is (the slope).
At first, if the variables are not properly optimized, you get a line that might not properly fit the
model. As you optimize the values of the model, for some variables, you will get the perfect fit.
The perfect fit will be a straight line running through most of the data points while ignoring the
noise and outliers. A properly fit Linear Regression model looks as shown below
:

Figure 4: Linear regression graph

For the Linear Regression model, the cost function is the Mean Squared Error of the model,
obtained by averaging the squares of the differences between the predicted values and the
actual values. Training searches for the parameter values that minimize this error.

Figure 5: Linear regression cost function

By the definition of gradient descent, you have to find the direction in which the error
decreases most rapidly. This is done by differentiating the cost function and subtracting a
fraction of the resulting gradient from the previous parameter values to move down the
slope.

Figure 6: Linear regression gradient descent function

After substituting the value of the cost function (J) in the above equation, you get :

Figure 7: Linear regression gradient descent function simplified

In the above equations, alpha is known as the learning rate. It decides how fast you move down the
slope. If alpha is large, you take big steps, and if it is small, you take small steps. If alpha is too
large, you can entirely miss the point of least error, and your results will not be accurate. If it is too
small, it will take too long to optimize the model and you will also waste computational power.
Hence, you need to choose an optimal value of alpha.

Figure 8: (a) Large learning rate, (b) Small learning rate, (c) Optimum learning rate
What Is the Cost Function for Neural Networks?

A neural network is a machine learning algorithm that takes in multiple inputs, runs them
through an algorithm, and essentially sums the output of the different algorithms to get the final
output.

The cost function of a neural network is based on the errors of its layers. The error is found
at each layer first, and then the individual errors are summed to get the total error. In the
end, a neural network with cost function optimization can be represented as:

Figure 9: Neural network with the error function

For neural networks, the cost function can have many local minima, each with its own
minimum error value. Depending on where you start, you can arrive at a different local
minimum. You need to find the smallest value out of all the local minima; this value
is called the global minimum.
Figure 10: Cost function graph for Neural Networks

The cost function for neural networks is given as :

Figure 11: Cost function for Neural Networks

Gradient descent uses the derivative (gradient) of the cost function. It is given as:

Figure 12: Gradient descent for Neural Networks


How to Implement Cost Functions in Python?

You have looked at what a cost function is and the formulae required to find the cost function for
different algorithms. Now let's implement cost functions using Python. For this, we will use a
NumPy array of random numbers as our data.

Start by importing important modules.

Figure 13: Importing necessary modules

Now, let’s load up the data.

Figure 14: Importing data


The NumPy array is a 2-D array of random points. Each row of the array corresponds to an
x and y coordinate. Here, x is the input and y is the required output. Let's separate these points
and plot them.

Figure 15: Plotting the data

Now, let's set our theta values and store the y values in a separate array so we can predict y
from the x values.

Figure 16: Setting theta values and separating x and y

Let’s initialize the ‘m’ and ‘b’ values along with the learning rate.
Figure 17: Setting learning parameters

Using mathematical operations, find the cost function value for our inputs.

Figure 18: Finding cost function

Using the cost function, you can update the theta value.

Figure 19: Updating theta value

Now, find the gradient descent and print the updated value of theta at every iteration.
Figure 20: Finding gradient descent

On plotting the gradient descent, you can see the decrease in the loss at each iteration.
Figure 21: Plotting gradient descent
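Since only the figure captions survive here, the following is a hedged reconstruction of the steps they describe, written from scratch for illustration: random 2-D points, parameters m and b updated by gradient descent on a mean squared error cost, and the loss printed and plotted at every iteration.

import numpy as np
import matplotlib.pyplot as plt

# Random 2-D data: each row is an (x, y) point
rng = np.random.default_rng(0)
data = rng.random((100, 2))
x, y = data[:, 0], data[:, 1]
plt.scatter(x, y)              # plot the raw points
plt.show()

# Initialize the parameters (slope m and intercept b) and the learning rate
m, b = 0.0, 0.0
learning_rate = 0.1
n = len(x)

def cost(m, b):
    # Mean squared error between predictions and actual values
    return np.mean((y - (m * x + b)) ** 2)

losses = []
for i in range(100):
    y_pred = m * x + b
    # Gradients of the MSE cost with respect to m and b
    dm = (-2 / n) * np.sum(x * (y - y_pred))
    db = (-2 / n) * np.sum(y - y_pred)
    # Gradient descent update: step against the gradient
    m -= learning_rate * dm
    b -= learning_rate * db
    losses.append(cost(m, b))
    print(f"iteration {i}: m={m:.4f}, b={b:.4f}, cost={losses[-1]:.4f}")

# The loss decreases at each iteration
plt.plot(losses)
plt.xlabel("iteration")
plt.ylabel("cost")
plt.show()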
MSE and MAE

The objective of Linear Regression is to find a line that minimizes the prediction error of
all the data points.

The mean absolute error (MAE) is a quantity used to measure how close predictions are to the
actual outcomes. It is the average of all the absolute errors and is a common measure of
estimation error in time series analysis. The mean squared error (MSE) of an estimator
measures the average of the squares of the errors, that is, the average squared difference
between the estimated values and the actual values.
MSE is a risk function, corresponding to the expected value of the squared error loss or
quadratic loss. The difference between estimate and true value arises because of randomness
or because the estimator does not account for all the available information. The MSE is a
measure of the quality of an estimator: it is always non-negative, and values closer to zero
are better. The MSE is the second moment of the error and incorporates both the variance of
the estimator and its bias. For an unbiased estimator, the MSE equals the variance of the
estimator.
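A small sketch computing both measures with NumPy on made-up actual and predicted values:

import numpy as np

actual    = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
predicted = np.array([2.5, 5.0, 3.0, 8.0, 4.0])

errors = actual - predicted
mae = np.mean(np.abs(errors))      # average of the absolute errors
mse = np.mean(errors ** 2)         # average of the squared errors

print("MAE:", mae)   # 0.5
print("MSE:", mse)   # 0.35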

***
Epoch vs Batch Size vs Iterations

Gradient Descent

It is an iterative optimization algorithm used in machine learning to find the best results
(the minimum of a curve).

Gradient means the rate of inclination or declination of a slope.

Descent means the instance of descending.

The algorithm being iterative means that we need to run it multiple times to get the most
optimal result. The iterative quality of gradient descent helps an under-fitted model fit
the data optimally.


Gradient descent has a parameter called the learning rate. As shown above (left), initially
the steps are bigger, which means the learning rate is higher, and as the point goes down,
the steps become smaller, meaning the learning rate decreases. Also, the cost function is
decreasing, i.e., the cost is decreasing. Sometimes you might see people saying that the
loss function is decreasing or the loss is decreasing; both cost and loss represent the same
thing (and it is a good thing that our loss/cost is decreasing).

We need terminologies like epochs, batch size, iterations only when the data
is too big which happens all the time in machine learning and we can’t pass all
the data to the computer at once. So, to overcome this problem we need to
divide the data into smaller sizes and give it to our computer one by one and
update the weights of the neural networks at the end of every step to fit it to
the data given.

Epochs

One Epoch is when an ENTIRE dataset is passed forward and backward through the neural
network only ONCE.
Since one epoch is too big to feed to the computer at once we divide it in
several smaller batches.

Why we use more than one Epoch?

It may not make sense at first: why is passing the entire dataset through a neural network
once not enough, and why do we need to pass the full dataset through the same neural network
multiple times? Keep in mind that we are using a limited dataset, and to optimize the
learning we are using Gradient Descent, which is an iterative process. So, updating the
weights with a single pass, or one epoch, is not enough.

One epoch leads to underfitting of the curve in the graph (below).

As the number of epochs increases, the weights are updated more times in the neural network,
and the curve goes from underfitting to optimal to an overfitting curve.

So, what is the right number of epochs?

Unfortunately, there is no right answer to this question. The answer is different for
different datasets, but you can say that the number of epochs is related to how diverse
your data is... just an example - do you have only black cats in your dataset, or is it a
much more diverse dataset?

Batch Size

Total number of training examples present in a single batch.

Note: Batch size and number of batches are two different things.

But What is a Batch?

As I said, you can't pass the entire dataset into the neural net at once. So, you divide
the dataset into a number of batches, sets, or parts.

Just like you divide a big article into multiple sets/batches/parts such as Introduction,
Gradient Descent, Epoch, Batch Size, and Iterations, which makes it easy for the reader to
read and understand the entire article.

Iterations

Iterations is the number of batches needed to complete one epoch.

Note: The number of batches is equal to the number of iterations for one epoch.

Let's say we have 2000 training examples that we are going to use.

We can divide the dataset of 2000 examples into batches of 500; then it will take 4
iterations to complete 1 epoch.

Here the Batch Size is 500 and the number of Iterations is 4 for 1 complete epoch.
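A sketch of this bookkeeping in plain Python/NumPy, assuming 2000 training examples and a batch size of 500 as in the example above; the training step itself is only a placeholder comment:

import numpy as np

data = np.arange(2000)        # stand-in for 2000 training examples
batch_size = 500
epochs = 3

iterations_per_epoch = len(data) // batch_size
print("Iterations per epoch:", iterations_per_epoch)   # 4

for epoch in range(epochs):
    np.random.shuffle(data)                       # new order each epoch
    for i in range(iterations_per_epoch):         # one iteration = one batch
        batch = data[i * batch_size:(i + 1) * batch_size]
        # ... forward pass, loss, backward pass, weight update would go here ...
    print(f"epoch {epoch + 1}: processed {iterations_per_epoch} batches")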


Classification Algorithm in Machine Learning
As we know, Supervised Machine Learning algorithms can be broadly classified into
Regression and Classification algorithms. With Regression algorithms we predict the output
for continuous values, but to predict categorical values we need Classification
algorithms.

What is the Classification Algorithm?


The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In classification, a program learns
from the given dataset or observations and then classifies new observations into a number of
classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can
be called targets/labels or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as "Green
or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised learning
technique, hence it takes labeled input data, which means it contains input with the
corresponding output.
In a classification algorithm, a discrete output function (y) is mapped to the input variable (x):

y = f(x), where y is the categorical output

The best example of an ML classification algorithm is Email Spam Detector.


The main goal of the Classification algorithm is to identify the category of a given dataset, and
these algorithms are mainly used to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, class A and Class B. These classes have features that are similar
to each other and dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:

o Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.

o Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.

Learners in Classification Problems:


In the classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives
the test dataset. In the lazy learner case, classification is done on the basis of the most
related data stored in the training dataset. It takes less time in training but more time for
predictions.
Example: K-NN algorithm, Case-based reasoning

2. Eager Learners: Eager Learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to lazy learners, Eager Learners take
more time in learning and less time in prediction. Example: Decision Trees, Naïve
Bayes, ANN.

Types of ML Classification Algorithms:


Classification Algorithms can be further divided into two main categories:

o Linear Models

  o Logistic Regression

  o Support Vector Machines

o Non-linear Models

  o K-Nearest Neighbours

  o Kernel SVM

  o Naïve Bayes

  o Decision Tree Classification

  o Random Forest Classification

Evaluating a Classification model:


Once our model is complete, it is necessary to evaluate its performance, whether it is a
Classification or a Regression model. For evaluating a Classification model, we have the
following ways:
1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a probability
value between 0 and 1.

o For a good binary Classification model, the value of log loss should be near to 0.

o The value of log loss increases if the predicted value deviates from the actual value.

o The lower log loss represents the higher accuracy of the model.

o For binary classification, cross-entropy can be calculated as:

  -(y * log(p) + (1 - y) * log(1 - p))

  where y is the actual label (0 or 1) and p is the predicted probability of class 1.
  For a multi-class classification problem, the log loss is calculated in the same way,
  summed over the classes:

  -Σ_c y_c * log(p_c)
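A sketch of the binary log loss calculation, both by hand with NumPy and with scikit-learn's log_loss; the labels and probabilities are invented for illustration:

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])              # actual labels
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.1])    # predicted probability of class 1

# -(y*log(p) + (1-y)*log(1-p)) averaged over all samples
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print("Manual log loss:   ", manual)
print("sklearn log_loss():", log_loss(y_true, y_prob))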

2. Confusion Matrix:

o The confusion matrix provides us a matrix/table as output and describes the performance
of the model.

o It is also known as the error matrix.

o The matrix consists of the prediction results in a summarized form, giving the total number
of correct predictions and incorrect predictions. The matrix looks like the table below:

                         Actual Positive     Actual Negative
  Predicted Positive     True Positive       False Positive
  Predicted Negative     False Negative      True Negative

3. AUC-ROC curve:

o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.

o It is a graph that shows the performance of the classification model at different
thresholds.

o To visualize the performance of a multi-class classification model, we use the
AUC-ROC Curve.

o The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False
Positive Rate) on the X-axis.

Use cases of Classification Algorithms


Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:

o Email Spam Detection

o Speech Recognition

o Identifications of Cancer tumour cells.

o Drugs Classification

o Biometric Identification, etc.


***
K-Nearest Neighbor (KNN) Algorithm for Machine
Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.

o The K-NN algorithm assumes similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.

o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm.

o K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.

o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.

o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.

o KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.

o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
but we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find features of the new
image that are similar to those of the cat and dog images and, based on the most similar
features, will put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point
x1; in which of these categories will this data point lie? To solve this type of problem, we need
a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular data point. Consider the below diagram:

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors

o Step-2: Calculate the Euclidean distance of K number of neighbors


o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

o Step-4: Among these k neighbors, count the number of the data points in each category.

o Step-5: Assign the new data points to that category for which the number of the neighbor
is maximum.

o Step-6: Our model is ready.


Suppose we have a new data point and we need to put it in the required category. Consider the
below image:

o Firstly, we will choose the number of neighbors, so we will choose the k=5.

o Next, we will calculate the Euclidean distance between the new point and the existing data points. The Euclidean distance is the straight-line distance between two points, familiar from geometry. It can be calculated as:

d = √((x2 - x1)² + (y2 - y1)²)

o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:

o As three of the five nearest neighbors are from category A, the new data point must belong to category A. A from-scratch sketch of these steps is given below.
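To make the steps concrete, below is a minimal from-scratch sketch in Python/NumPy (the toy data and the function name knn_predict are illustrative, not the sklearn implementation used later in this unit):

import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_new, k=5):
    # Step 2-3: Euclidean distance from the new point to every training point
    distances = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]        # indices of the K nearest neighbours
    # Step 4-5: majority vote among the labels of the K nearest neighbours
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset: two features, categories 'A' and 'B'
x_train = np.array([[1, 1], [2, 1], [1, 2], [6, 5], [7, 6], [6, 6]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
print(knn_predict(x_train, y_train, np.array([2, 2]), k=5))   # -> 'A'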

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN algorithm (a small cross-validation sketch for choosing K follows this list):

o There is no fixed rule for determining the best value of "K", so we need to try several values and pick the one that works best. K = 5 is a common default.

o A very low value of K, such as K = 1 or K = 2, can be noisy and makes the model sensitive to outliers.

o Larger values of K smooth out noise, but if K is too large the class boundaries become blurred and the prediction becomes more expensive to compute.
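One common way to pick K in practice is cross-validation. A minimal sketch (the toy data from make_classification stands in for a real training set):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data standing in for a real, labelled training set
x, y = make_classification(n_samples=200, n_features=4, random_state=0)

for k in range(1, 16, 2):                      # try odd values of K to avoid ties
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, x, y, cv=5).mean()
    print(k, round(score, 3))                  # keep the K with the best average score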

When to use KNN?

o We can use KNN when the dataset is labelled, noise-free, and relatively small, because KNN is a "lazy learner" that stores the whole training set. Let's understand the KNN algorithm with the help of an example.
NAME AGE GENDER CLASS OF SPORTS

Ajay 32 0 Football

Mark 40 0 Neither

Sara 16 1 Cricket

Zaira 34 1 Cricket

Sachin 55 0 Neither

Rahul 40 0 Cricket

Pooja 20 1 Neither

Smith 15 0 Cricket

Laxmi 55 1 Football

Michael 15 0 Football

o Here male is denoted with the numeric value 0 and female with 1. Let's find which class of sports Angelina will lie in, where K = 3, her age is 5 and her gender value is 1. We have to find the distance using
o d = √((x2 - x1)² + (y2 - y1)²), the distance between any two points.
o So let's find the distance between Ajay and Angelina using the formula
o d = √((age2 - age1)² + (gender2 - gender1)²)
o d = √((5 - 32)² + (1 - 0)²)
o d = √(729 + 1) = √730
o d ≈ 27.02
o Similarly, we find all the other distances one by one.
Distance between Angelina and        Distance

Ajay 27.02

Mark 35.01

Sara 11.00

Zaira 29.00

Sachin 50.01

Rahul 35.01

Pooja 15.00

Smith 10.05

Laxmi 50.00

Michael 10.05

The value of K for Angelina is 3, and the three smallest distances are 10.05, 10.05 and 11.00, so the people closest to Angelina are Smith, Michael and Sara (a short Python check of this follows).

o Smith 10.05 Cricket
o Michael 10.05 Football
o Sara 11.00 Cricket
o So, according to the KNN algorithm, two of Angelina's three nearest neighbours play cricket, and she will be placed in the class of people who like cricket. This is how the KNN algorithm works.
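The whole worked example can be reproduced in a few lines of Python; math.dist computes the same Euclidean distance used above:

import math

# (age, gender) for each person; gender: 0 = male, 1 = female
people = {'Ajay': (32, 0), 'Mark': (40, 0), 'Sara': (16, 1), 'Zaira': (34, 1),
          'Sachin': (55, 0), 'Rahul': (40, 0), 'Pooja': (20, 1), 'Smith': (15, 0),
          'Laxmi': (55, 1), 'Michael': (15, 0)}
sport = {'Ajay': 'Football', 'Mark': 'Neither', 'Sara': 'Cricket', 'Zaira': 'Cricket',
         'Sachin': 'Neither', 'Rahul': 'Cricket', 'Pooja': 'Neither', 'Smith': 'Cricket',
         'Laxmi': 'Football', 'Michael': 'Football'}

angelina = (5, 1)
dist = {name: math.dist(angelina, p) for name, p in people.items()}
nearest = sorted(dist, key=dist.get)[:3]         # K = 3 nearest neighbours
print(nearest, [sport[n] for n in nearest])      # majority class -> Cricket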

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data

o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o We always need to determine the value of K, which can sometimes be complex.

o The computation cost is high because of calculating the distance between the data points
for all the training samples.

Python implementation of the KNN algorithm


To do the Python implementation of the K-NN algorithm, we will use the same problem and
dataset which we have used in Logistic Regression. But here we will improve the performance of
the model. Below is the problem description:
Problem for K-NN Algorithm: A car manufacturer has produced a new SUV and wants to show its ads only to users who are likely to buy the car. For this problem, we have a dataset that contains information about multiple users from a social network. The dataset contains many columns, but we will use Estimated Salary and Age as the independent variables and Purchased as the dependent variable. Below is the dataset:

Steps to implement the K-NN algorithm:

o Data Pre-processing step

o Fitting the K-NN algorithm to the Training set

o Predicting the test result

o Test accuracy of the result(Creation of Confusion matrix)

o Visualizing the test set result.


Data Pre-Processing Step:
The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the
code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set= pd.read_csv('user_data.csv')

# Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

# feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

By executing the above code, our dataset is imported into the program and well pre-processed. After feature scaling, the test dataset will look as shown in the output, where we can see that our data is successfully scaled.

o Fitting K-NN classifier to the Training data:


Now we will fit the K-NN classifier to the training data. To do this we will import the KNeighborsClassifier class of the sklearn.neighbors library. After importing the class, we will create the classifier object. The parameters of this class are:

o n_neighbors: the number of neighbours required by the algorithm. It usually takes the value 5.

o metric='minkowski': the default parameter; it decides how the distance between the points is measured.

o p=2: with the Minkowski metric, p=2 is equivalent to the standard Euclidean metric.


And then we will fit the classifier to the training data. Below is the code for it:
#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the output as:
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')

o Predicting the Test Result: To predict the test set result, we will create a y_pred vector
as we did in Logistic Regression. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)

Output:
The output for the above code will be:
o Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to see the accuracy of the
classifier. Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

In the above code, we have imported the confusion_matrix function, called it with the true and predicted labels, and stored the resulting matrix in the variable cm.
Output: By executing the above code, we will get the matrix as below:
In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7 incorrect
predictions, whereas, in Logistic Regression, there were 11 incorrect predictions. So we can say
that the performance of the model is improved by using the K-NN algorithm.

o Visualizing the Training set result:


Now, we will visualize the training set result for K-NN model. The code will remain
same as we did in Logistic Regression, except the name of the graph. Below is the code
for it:
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:
By executing the above code, we will get the below graph:

The output graph is different from the graph which we have occurred in Logistic Regression. It
can be understood in the below points:

o As we can see the graph is showing the red point and green points. The green
points are for Purchased(1) and Red Points for not Purchased(0) variable.

o The graph shows an irregular boundary instead of a straight line or a smooth curve, because the K-NN algorithm classifies each point by finding its nearest neighbours.

o The graph has classified users in the correct categories as most of the users who
didn't buy the SUV are in the red region and users who bought the SUV are in the
green region.

o The graph shows a good result, but there are still some green points in the red region and some red points in the green region. This is not a big issue, as accepting these few errors helps prevent the model from overfitting.

o Hence our model is well trained.


o Visualizing the Test set result:
After training the model, we will now test it on new data, i.e., the test dataset. The code remains the same except for minor changes: x_train and y_train are replaced by x_test and y_test.
Below is the code for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:
The above graph shows the output for the test dataset. As we can see in the graph, the predicted output is quite good, since most of the red points are in the red region and most of the green points are in the green region.
However, there are a few green points in the red region and a few red points in the green region. These are the incorrect observations that we already saw in the confusion matrix (7 incorrect outputs).
Logistic Regression in Machine Learning
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.

o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic
values which lie between 0 and 1.

o Logistic Regression is much like Linear Regression except in how it is used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.

o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).

o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.

o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.

o Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The
below image is showing the logistic function:
Note: Logistic regression borrows the idea of predictive modelling from regression, which is why it is called logistic regression; however, since it is used to classify samples, it falls under the classification algorithms.

Logistic Function (Sigmoid Function):


o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.

o It maps any real value into another value within a range of 0 and 1.

o The output of logistic regression must lie between 0 and 1 and cannot go beyond these limits, so it forms a curve like the "S" form. The S-form curve is called the sigmoid function or the logistic function.

o In logistic regression, we use the concept of a threshold value, which decides between the classes 0 and 1: predicted probabilities above the threshold tend to class 1, and probabilities below the threshold tend to class 0 (see the short sketch below).
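A short numeric sketch of the sigmoid and the threshold rule (the values of z are arbitrary examples):

import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
p = sigmoid(z)
labels = (p >= 0.5).astype(int)    # threshold 0.5: above -> class 1, below -> class 0
print(p.round(3), labels)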

Assumptions for Logistic Regression:


o The dependent variable must be categorical in nature.

o The independent variable should not have multi-collinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression y can be between 0 and 1 only, so let's divide y by (1 - y):

y / (1 - y)   (this is 0 for y = 0 and infinity for y = 1)

o But we need a range between -[infinity] and +[infinity], so we take the logarithm of the equation, and it becomes:

log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.

o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep" (see the sketch after this list).

o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
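As an aside, a minimal sketch of a multinomial case with sklearn (the Iris dataset has three unordered classes; the multi_class argument may be deprecated or behave differently across sklearn versions):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

x, y = load_iris(return_X_y=True)
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=200)
clf.fit(x, y)
print(clf.predict(x[:5]))                   # predicted class labels
print(clf.predict_proba(x[:5]).round(2))    # one probability per class, rows sum to 1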

Python Implementation of Logistic Regression (Binomial)


To understand the implementation of Logistic Regression in Python, we will use the below
example:
Example: We are given a dataset containing information about various users obtained from a social networking site. A car-making company has recently launched a new SUV, and it wants to check how many users from the dataset want to purchase the car.
For this problem, we will build a Machine Learning model using the Logistic regression
algorithm. The dataset is shown in the below image. In this problem, we will predict
the purchased variable (Dependent Variable) by using age and salary (Independent
variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use
the same steps as we have done in previous topics of Regression. Below are the steps:

o Data Pre-processing step

o Fitting Logistic Regression to the Training set

o Predicting the test result

o Test accuracy of the result(Creation of Confusion matrix)

o Visualizing the test set result.


1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can
use it in our code efficiently. It will be the same as we have done in Data pre-processing topic.
The code for this is given below:
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

By executing the above lines of code, we will get the dataset as the output. Consider the given
image:

Now, we will extract the dependent and independent variables from the given dataset. Below is
the code for it:
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
In the above code, we have taken [2, 3] for x because our independent variables are age and
salary, which are at index 2, 3. And we have taken 4 for y variable because our dependent
variable is at index 4. The output will be:

Now we will split the dataset into a training set and test set. Below is the code for it:
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

The output for this is given below:


For test
set:

For training set:


In logistic regression, we will do feature scaling because we want accurate prediction results. Here we will only scale the independent variables, because the dependent variable has only 0 and 1 values. Below is the code for it:
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

The scaled output is given below:


2. Fitting Logistic Regression to the Training set:
We have prepared our dataset well, and now we will train the model using the training set. To fit the model to the training set, we will import the LogisticRegression class of the sklearn library.
After importing the class, we will create a classifier object and use it to fit the logistic regression model to the training data. Below is the code for it:
#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the below output:
Out[5]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Hence our model is well fitted to the training set.


3. Predicting the Test Result
Our model is well trained on the training set, so we will now predict the result by using test set
data. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)

In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under the variable
explorer option. It can be seen as:

The above output image shows the corresponding predicted users who want to purchase or not
purchase the car.
4. Test Accuracy of the result
Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need to import the confusion_matrix function of the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function takes two parameters, mainly y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:
By executing the above code, a new confusion matrix will be created. Consider the below image:

We can find the accuracy of the predicted result by interpreting the confusion matrix. By above
output, we can interpret that 65+24= 89 (Correct Output) and 8+3= 11(Incorrect Output).
5. Visualizing the training set result
Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

In the above code, we have imported the ListedColormap class of the Matplotlib library to create the colormap for visualizing the result. We have created two new variables, x_set and y_set, to replace x_train and y_train. After that, we have used the nm.meshgrid command to create a rectangular grid that extends from one unit below the minimum to one unit above the maximum of each feature. The pixel points we have taken are of 0.01 resolution.
To create a filled contour, we have used mtp.contourf command, it will create regions of
provided colors (purple and green). In this function, we have passed the classifier.predict to
show the predicted data points predicted by the classifier.
Output: By executing the above code, we will get the below output:

The graph can be explained in the below points:


o In the above graph, we can see that there are some Green points within the green region
and Purple points within the purple region.

o All these data points are the observation points from the training set, which shows the
result for purchased variables.

o This graph is made by using two independent variables i.e., Age on the
x-axis and Estimated salary on the y-axis.

o The purple point observations are for which purchased (dependent variable) is probably
0, i.e., users who did not purchase the SUV car.

o The green point observations are for which purchased (dependent variable) is probably
1 means user who purchased the SUV car.

o We can also estimate from the graph that the users who are younger with low salary, did
not purchase the car, whereas older users with high estimated salary purchased the car.

o But there are also some purple points in the green region (the region predicted as buying the car) and some green points in the purple region (the region predicted as not buying the car). These are users whose actual purchase decision went against the general trend, and they correspond to the misclassified observations.
The goal of the classifier:
We have successfully visualized the training set result for the logistic regression, and our goal for
this classification is to divide the users who purchased the SUV car and who did not purchase the
car. So from the output graph, we can clearly see the two regions (Purple and Green) with the
observation points. The Purple region is for those users who didn't buy the car, and Green
Region is for those users who purchased the car.
Linear Classifier:
As we can see from the graph, the classifier is a Straight line or linear in nature as we have used
the Linear model for Logistic Regression. In further topics, we will learn for non-linear
Classifiers.
Visualizing the test set result:
Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (Test set). The code for the test set will remain same as above except that here we
will use x_test and y_test instead of x_train and y_train. Below is the code for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph shows the test set result. As we can see, the graph is divided into two regions
(Purple and Green). And Green observations are in the green region, and Purple observations are
in the purple region. So we can say it is a good prediction and model. Some of the green and
purple data points are in different regions, which can be ignored as we have already calculated
this error using the confusion matrix (11 Incorrect output).
Hence our model is pretty good and ready to make new predictions for this classification
problem.
Logistic Regression vs Linear Regression
Logistic regression is a probability model, meaning that the outcome of the algorithm lies between 0 and 1. It maintains a threshold value to classify the data points (samples): a probability value above the threshold is evaluated as true, otherwise as false.

Logistic regression is based on linear regression, i.e. it is derived from it. To understand how this probabilistic algorithm is based on linear regression, let us take an example and derive the logistic regression equation.

Problem: estimate the probability of being infected given a person's age, for ages in the range 20 to 55.

Let us first try to solve this problem using linear regression.

Given: age between 20 and 55


Note: 0 <= p <= 1

// The probability value should range from 0 to 1.

We have y = β0 + β1*X1 // linear regression equation

Case 1: β0 = -1.700, β1 = 0.064, age = 35 // β0 is the y-intercept and β1 is the slope of the straight line drawn; β0 and β1 are chosen arbitrarily here.

Let us plug these values into the straight-line equation and observe the corresponding results.

y(p=1|age) = -1.700 + 0.064(35) = 0.54

Case 2: for ages 25 and 45

y(p=1|age) = -1.700 + 0.064(25) = -0.10 // negative result

y(p=1|age) = -1.700 + 0.064(45) = 1.18 // value is more than 1

We have noticed that for age 25 the final value of the equation is less than 0, and for age 45 it is greater than 1.

A probability, however, must range from 0 to 1. Thus, trying to solve this problem using linear regression fails to satisfy this requirement.

The corresponding plot is shown below.


Now let us try to solve the same problem using logistic regression

The probability value should be greater than or equal to 0.

For P >= 0:

P(X) = exp(β0 + β1*X)

// Taking the exponential of the straight-line equation guarantees a positive value.

For P <= 1:

// The probability value should also be less than or equal to 1.

P(X) = exp(β0 + β1*X) / (1 + exp(β0 + β1*X)) // dividing a positive number by something larger than itself keeps the value below 1


This equation is referred to as the Logistic Regression equation.

P(X)(1 + exp(β0 + β1*X)) = exp(β0 + β1*X)

P(X) = exp(β0 + β1*X) - P(X) exp(β0 + β1*X)

P(X) = (1 - P(X)) exp(β0 + β1*X)

P(X) / (1 - P(X)) = exp(β0 + β1*X)

Take the natural log on both sides:

ln(P / (1 - P)) = β0 + β1*X // the exponential is cancelled by the natural log; this function is referred to as the logit function

The right-hand side of the equation is the straight-line equation, Y = β0 + β1*X. Hence, logistic regression is based on linear regression.

Going back to the equation below:

P(X) = exp(β0 + β1*X) / (1 + exp(β0 + β1*X))

Divide both numerator and denominator by exp(β0 + β1*X).

Therefore we get P(X) = 1 / (1 + exp(-(β0 + β1*X))),

sometimes written as P(X) = 1 / (1 + exp(-z)) and referred to as the sigmoid equation, where z = β0 + β1*X.

As can be noticed, the values now range from 0 to 1, which satisfies the logistic regression assumptions.

The final plot is shown below. The curve is also referred to as the S-curve or sigmoid curve.
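A quick check of the same age example with the sigmoid equation (same β0 and β1 as above) shows that all outputs now lie between 0 and 1:

import math

b0, b1 = -1.700, 0.064                 # the coefficients used in the linear attempt above

def p(age):
    z = b0 + b1 * age
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid of the linear score

for age in (25, 35, 45):
    print(age, round(p(age), 3))       # roughly 0.475, 0.632, 0.765 -- all within (0, 1)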
Difference between Sigmoid, Logistic, Softmax
Functions, and Cross-Entropy Loss (Log Loss)
in Classification Problems

1. Introduction

When learning logistic regression and deep learning (neural networks), I always encounter terms including:

• Sigmoid function

• Logistic function

• Softmax function

• Log loss

• Cross entropy Loss

• Negative log-likelihood

Every time I saw them, I did not really try to understand them, because there are existing libraries out there I can use that do everything for me. For example, when I build logistic regression models, I will directly
use sklearn.linear_model.LogisticRegression from Scikit-Learn. When I work on deep
learning classification problems using PyTorch, I know that I need to add
a sigmoid activation function at the output layer with Binary Cross-Entropy
Loss for binary classifications, or add a (log) softmax function with Negative
Log-Likelihood Loss (or just Cross-Entropy Loss instead) for multiclass
classification problems.

Recently, when I revisited these concepts, I found it useful to look into the
math and understand what was buried underneath. So, in this post, I gathered
materials from different sources and I will demonstrate the mathematical
formulas with some explanations.

I also made a cheat sheet for the calculations, which can be accessed in a GitHub repo.

2. Sigmoid Function (Logistic Function)

Sigmoid functions are general mathematical functions that share a similar property: they have S-shaped curves, just as the figure below shows.

Members of Sigmoid Functions Family, from Wikipedia


The Curve of a Logistic Function, from Wikipedia

The most common sigmoid function used in machine learning is the logistic function, given by the formula below:

f(x) = 1 / (1 + e^(-x))

The formula is simple, but it is quite useful because it offers us some nice
properties:

1. It maps the feature space into probability functions

2. It uses exponential
3. It is differentiable

For property 1, It is not difficult to see that:

• When x is really large (goes to infinity), the output will be close to 1

• When x is really small (goes to -infinity), the output will be close to


0

• When x is 0, the output will be 1/2

For property 2, the exponential creates a nonlinear relationship that pushes most points close to either 0 or 1, instead of leaving them stuck in the ambiguous zone in the middle.

Property 3 is also quite important: we need the function to be differentiable to


calculate the gradient when updating the weight from errors either using
gradient descent in general ML problems or backpropagation in neural
networks.

The properties of the logistic function are great, but how is the logistic
function used in logistic regression to solve binary classification problems?

3. Logistic Function in Logistic Regression

3.1 Review on Linear Regression

Before going too far, let’s review the concept of regression models.
Regression has long been used in statistical modeling and is part of the
supervised machine learning methods. It is the process of modeling the
relationship between a dependent variable with one or more independent
variables.

Example of Simple Linear Regression, from Wikipedia

The most commonly used regression model is linear regression, which


predicts values using linear combinations of features. The plot shown above is
the simplest form of linear regression, called simple linear regression. It has
two parameters β_0 and β_1 where each represents the intercept and slope to
define the red best fit line among the data points. With the two parameters
trained using the existing data, we will be able to predict a new y value given
an unseen x value.

Simple linear regression, Image by author


With the simplest form defined, we can generalize the linear regression
formula to accommodate multiple dimensions of x, which can also be called
multiple linear regression (multivariate regression). Basically, it extends to
multiple dimensions and uses multiple features (e.g., house size, age, location,
etc.) to make predictions (e.g., sale price).

Generalized linear regression model, Image by author

3.2 Logistic Function and Logistic Regression

Besides predicting actual values as in regression, linear regression models can also be used for classification problems by predicting the probability of the subject being in a specific class; this can be done simply by replacing y with p:

Image by author

The problem is that the probability p here is unbound — it can be any value.
So, in order to constrain the probability range to be between 0 and 1, we can
use the logistic function introduced in the previous section and map it:
Image by author

This will make sure that no matter what the predicted value is, the probability
p will be in the range between 0 and 1 with all the advantages introduced
earlier. However, the exponential form is not easy to deal with, so we can
rearrange the formula using the odds function. Odds is the brother of
probability, and it represents the ratio between “success” and “nonsuccess”.
When the p=0, odds is 0; when p=0.5, odds is 1; when p=1, odds is ∞. The
relationships are shown below:

Relationship between Odds and Probability, Image by author

With the odds function defined, we get:


Image by author

It is easy to see the similarity between the two equations, so we have:

Image by author

We use log to remove the exponential relationship, so it goes back to the term
that we are familiar with at the end. The part on the right of the equals sign is
still the linear combination of the input x and the parameters β. The part on the
left of the equals sign now becomes the logarithm of odds, or giving it a new
name logit of probability p. So, the whole equation becomes the definition of
the logit function, or log-odds, and it is the inverse function of the standard
logistic function. By modeling using the logit function, we have two
advantages:

1. We can still treat it as a linear regression model using our familiar


linear function of the predictors

2. We can use it to predict the true probability of the subject in a


class— by transforming the predicted value using the inverse logit
function.
That is how logistic regression works under the hood using the logistic function, and it is perfectly suitable for binary classification (2 classes): for classes A and B, if the predicted probability of being class A is above the threshold we set (e.g., 0.5), then the instance is classified as class A; on the other hand, if the predicted probability is below the threshold, then it is classified as class B.

We have just covered the binary classification using logistic regression. So,
what if there are more than 2 classes?

4. Multi-class Classification and Softmax Function

4.1 Methods of Multi-class Classifications

There are several ways of using binary classifiers to handle multi-class


classification problems, and two common ones
are: one-versus-the-rest and one-versus-one.

The one-versus-the-rest method trains K-1 binary classifiers to separate each


class from the rest. Then the instances which are excluded by all of the
classifiers will be classified as class K. It works in many cases, but the biggest
issue with the one-versus-the-rest method is ambiguous region: where some
instances may be put into multiple classes.

On the other hand, we have the one-versus-one method, which trains a binary
classifier between each of the classes. Similar to one-versus-the-rest,
the ambiguous region also exists here, but this time there exist instances that
are not classified into any of the classes. What’s even worse is the efficiency:
we need n choose 2 (combination) classifiers for n classes, as shown below in
the equation. For example, if we have 10 classes, we need 45 classifiers to use
this method!

Number of classifiers needed for the one-versus-one method, Image by author

With these restrictions on one-versus-the-rest and one-versus-one methods,


how can we do multi-class classifications then? The answer is to use the
softmax function.

4.2 Softmax Function

The Softmax function is a generalized form of the logistic function as


introduced in the binary classification part above. Here is the equation:

Softmax Function, Image by author

To interpret it, we can see it as: the probability of classifying the instance
as j can be calculated as the exponential of the j th element of the input
divided by the sum of exponentials of all the input elements. To better
understand it, we can see the example below:

Example of Applying Softmax Function to Model Output, by Sewade Ogun (Public License)

An image classifier gives numerical outputs after feeding forward through the neural network; in this case, we have a 3x3 array where rows are instances and columns are classes. The first row contains the predictions for the first image: the scores are 5, 4, and 2 for the classes cat, dog, and horse respectively. Raw scores are hard to interpret on their own, so we feed them into a softmax function. By plugging the three numbers into the equation, we get the probability of the image being a cat, dog, or horse as 0.71, 0.26, and 0.04, which sum to 1.
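The numbers in this example can be verified with a few lines of NumPy (a minimal sketch of the softmax formula above):

import numpy as np

def softmax(scores):
    e = np.exp(scores)      # exponentiate each raw score
    return e / e.sum()      # normalize so the outputs sum to 1

print(softmax(np.array([5.0, 4.0, 2.0])).round(2))   # -> [0.71 0.26 0.04]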

Similar to the logistic function, the softmax function also has the following
advantages so that people are widely using it in multi-class classification
problems:

1. It maps the feature space into probability functions


2. It uses exponential

3. It is differentiable

Another way to interpret the softmax function is through the famous Bayes
Theorem, where:

Bayes Theorem, Image by author

Applying it to our case in softmax, all of the terms can be interpreted as


probabilities:

Image by author

where

Image by author

5. Cross-Entropy Loss and Log Loss


When we train classification models, we usually define a loss function that describes how much our predicted values deviate from the true values. Then we use gradient descent methods to adjust the model parameters in order to lower the loss. It is an optimization problem; in deep learning the required gradients are computed with backpropagation.

Before we start on this, I strongly recommend the article from Daniel Godoy: Understanding binary cross-entropy / log loss: a visual explanation. It gives a really good explanation of the practical math concepts underneath and shows them in a visual way. Here in this post, I am using slightly different conventions, more like Wikipedia's.

Let’s get started!

5.1 Log Loss (Binary Cross-Entropy Loss)

One loss function commonly used in classification problems is called cross-entropy loss, or log loss in the binary case. Let's first write down the expression:

Log Loss = -(1/N) Σ [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]

Log Loss (Binary Cross Entropy), Image by author

Since the log function has the property that when y is at 0 its log goes to -infinity, and when y is at 1 its log is 0, we can use it to model the loss quite efficiently. For an instance with true label 0 (a quick numeric check follows this list):

• If the predicted value is 0, then the formula above will return a loss of 0.

• If the predicted value is 0.5, then the formula above will return a loss of 0.69.

• If the predicted value is 0.99, then the formula above will return a loss of 4.6.
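The numeric check promised above, as a minimal sketch (the helper name log_loss_single is just for illustration):

import math

def log_loss_single(y_true, p_pred):
    # binary cross-entropy (log loss) for a single instance
    return -math.log(p_pred) if y_true == 1 else -math.log(1 - p_pred)

for p in (0.0, 0.5, 0.99):             # predicted probability of class 1, true label is 0
    print(p, round(log_loss_single(0, p), 2))   # -> 0.0, 0.69, 4.61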

As we can see here, the log magnifies the mistake in the classification, so the
misclassification will be penalized much more heavily compared to any linear
loss functions. The closer the predicted value is to the opposite of the true
value, the higher the loss will be, which will eventually become infinity.
That’s exactly what we want a loss function to be. So where does the
definition of log loss come from?

5.2 Derivation of Log Loss

Cross-Entropy is a concept derived from information theory that measures the


difference between two probability distributions, and the definition of it
between true probability distribution p and estimated probability q in the
information theory is:

Cross-Entropy, Image by author


where H(p) is the entropy of
distribution p, and D_KL(p||q) is Kullback–Leibler Divergence, a divergence
of p from q. It is also called relative entropy, of p with respect to q.

The definition of entropy and Kullback-Leibler Divergence are shown as


below:

Definitions of entropy and Kullback-Leibler Divergence, Image by author

Plugging them in, it is easy to get the expression of cross-entropy:

Image by author

For binary classification problems, there are only two classes, so we can
express them explicitly:
Image by author

Note that the p here is the probability function instead of the distribution p.
Also, we can express the true distribution p(y) as 1/N, so the binary
cross-entropy (log loss) can be expressed as:

Log Loss (Binary Cross Entropy), Image by author

Note that a minus sign is placed at the beginning because the log function of
values between 0 to 1 gives us negative values. We want to flip the sign so
that the loss will be positive — we want to minimize the loss.

If we want, this formula can be further expanded in its expression to include


the relationship with the model parameters θ, shown below, but it’s essentially
the same as what we have above.

Log Loss with respect to Model Parameters, Image by author

5.3 Cross-Entropy Loss (Multi-class)


After deriving the binary case above, we can easily extend it to multi-class
classification problems. Below is a generalized form of the cross-entropy loss
function. It only sums the log of the probability when the instance class is k,
similar to the binary case, where there is always only part of the expression
taken account of, and the others are just 0.

Cross-Entropy Loss (Generalized Form), Image by author

Again, it can also be expressed with respect to the model parameters θ, but it
is essentially the same equation:

Cross-Entropy Loss with respect to Model Parameter, Image by author

5.4 Cross-Entropy Loss vs Negative Log-Likelihood

The cross-entropy loss is always compared to the negative log-likelihood. In


fact, in PyTorch, the Cross-Entropy Loss is equivalent to (log) softmax
function plus Negative Log-Likelihood Loss for multiclass classification
problems. So how are these two concepts really connected?

Before we dive into it, we have to understand the difference between


probability and likelihood. In short:
• Probability: Find the chance of some event given a sample
distribution of the data

• Likelihood: Find the best distribution of the data given the sample
data

So we are essentially modeling the same problem using different expressions, but they are equivalent:

Expression of Likelihood, Image by author

Above is the definition of the likelihood of parameters θ given the data (from
x_1 to x_n), which is equivalent to the probability of getting these data (x_1 to
x_n) given the parameters θ, and it can be expressed as the product of each
individual probability.

Knowing p is the true probability distribution, we can further rewrite the


product using the estimated probability distribution as follow:

Image by author
where q_i (estimated probability distribution) and p_i (true probability
distribution) are:

Image by author

where n_i is the number of times i occurs in the training data. Then by taking
the negative log on the likelihood, we can get:

Negative Log-Likelihood is Equivalent to Cross-Entropy, Image by author

We can easily get the equation above given the log of a product becomes the
sum of logs. Magically, the negative log-likelihood becomes the cross-entropy
as introduced in the sections above.
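Assuming PyTorch is available, the equivalence mentioned above can be checked numerically with a minimal sketch (the logits below are the same toy scores used in the softmax example):

import torch
import torch.nn as nn

logits = torch.tensor([[5.0, 4.0, 2.0]])     # raw scores for one instance, three classes
target = torch.tensor([0])                   # index of the true class

ce  = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
print(ce.item(), nll.item())                 # the two losses coincide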

6. Conclusions

To summarize the concepts introduced in this article so far:

• Sigmoid Function: A general mathematical function that has an


S-shaped curve, or sigmoid curve, which is bounded, differentiable,
and real.
• Logistic Function: A certain sigmoid function that is widely used in
binary classification problems using logistic regression. It maps
inputs from -infinity to infinity to be from 0 to 1, which intends to
model the probability of binary events.

• Softmax Function: A generalized form of the logistic function to be


used in multi-class classification problems.

• Log Loss (Binary Cross-Entropy Loss): A loss function that


represents how much the predicted probabilities deviate from the
true ones. It is used in binary cases.

• Cross-Entropy Loss: A generalized form of the log loss, which is


used for multi-class classification problems.

• Negative Log-Likelihood: Another interpretation of the


cross-entropy loss using the concepts of maximum likelihood
estimation. It is equivalent to cross-entropy loss.
Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed the Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using a
decision boundary or hyperplane:

Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called the Linear SVM classifier.

o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called the Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary that helps to classify the
data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), then the hyperplane is a straight line, and if there are 3 features, then the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position of
the hyperplane are termed as Support Vector. Since these vectors support the hyperplane, hence
called a Support vector.

How does SVM works?


Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset
that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that
can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:

So as it is 2-d space so by just using a straight line, we can easily separate these two classes. But
there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is
called as a hyperplane. SVM algorithm finds the closest point of the lines from both the classes. These
points are called support vectors. The distance between the vectors and the hyperplane is called
as margin. And the goal of SVM is to maximize this margin. The hyperplane with maximum margin is
called the optimal hyperplane.
Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we will add a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as in the below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below image:

Since we are in 3-d space, the separating boundary looks like a plane parallel to the x-axis. If we convert it back to 2-d space with z = 1, it becomes:

Hence we get a circumference of radius 1 in the case of non-linear data.
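A minimal sketch of this idea (the toy circular data and variable names are assumptions for illustration): adding z = x² + y² explicitly makes the data separable by a linear SVM, and in practice a non-linear kernel such as RBF performs a similar mapping implicitly.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Toy non-linear data: points inside the unit circle are class 1, outside are class 0
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Add the third dimension z = x^2 + y^2 and fit a linear SVM in 3-d
Z = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])
print(SVC(kernel='linear').fit(Z, y).score(Z, y))   # separable with a flat plane

# A non-linear (RBF) kernel handles the original 2-d data directly
print(SVC(kernel='rbf').fit(X, y).score(X, y))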

Python Implementation of Support Vector Machine

Now we will implement the SVM algorithm using Python. Here we will use the same
dataset user_data, which we have used in Logistic regression and KNN classification.

o Data Pre-processing step

Till the Data pre-processing step, the code will remain the same. Below is the code:
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

After executing the above code, we will pre-process the data. The code will give the
dataset as:
The scaled output for the test set will be:
Fitting the SVM classifier to the training set:

Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will import the SVC class from the sklearn.svm library. Below is the code for it:

from sklearn.svm import SVC  # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', as here we are creating SVM for
linearly separable data. However, we can change it for non-linear data. And then we
fitted the classifier to the training dataset(x_train, y_train)

Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)

The model performance can be altered by changing the value of C (the regularization factor), gamma, and the kernel; a small tuning sketch is given below.
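A minimal tuning sketch with grid search over these hyperparameters (it reuses the x_train and y_train prepared above; the grid values are arbitrary examples):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5)    # 5-fold cross-validation over the grid
search.fit(x_train, y_train)
print(search.best_params_, round(search.best_score_, 3))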
o Predicting the test set result:
Now, we will predict the output for test set. For this, we will create a new vector
y_pred. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)

After getting the y_pred vector, we can compare the result of y_pred and y_test to
check the difference between the actual value and predicted value.

Output: Below is the output for the prediction of the test set:

o Creating the confusion matrix:


Now we will see the performance of the SVM classifier that how many incorrect
predictions are there as compared to the Logistic regression classifier. To create
the confusion matrix, we need to import the confusion_matrix function of the
sklearn library. After importing the function, we will call it using a new
variable cm. The function takes two parameters, mainly y_true( the actual values)
and y_pred (the targeted value return by the classifier). Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:

As we can see in the above output image, there are 66+24= 90 correct predictions and 8+2= 10 incorrect predictions. Therefore we can say that our SVM model improved as compared to the Logistic Regression model.

o Visualizing the training set result:


Now we will visualize the training set result, below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:

As we can see, the above output is appearing similar to the Logistic regression output.
In the output, we got the straight line as hyperplane because we have used a linear
kernel in the classifier. And we have also discussed above that for the 2d space, the
hyperplane in SVM is a straight line.

o Visualizing the test set result:


#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:

As we can see in the above output image, the SVM classifier has divided the users into two regions (purchased or not purchased). Users who purchased the SUV appear as green scatter points in the green region, and users who did not purchase the SUV appear as red scatter points in the red region (class 1 is plotted in green and class 0 in red, as set in the code). The hyperplane has divided the two classes into the purchased and not-purchased variable.
Decision Tree Classification Algorithm
o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?

There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is a key consideration when creating a machine learning model. Below are two reasons for using a decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies

Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further once a leaf node is reached.
Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes according to the given conditions.
Branch/Sub-tree: A subtree formed by splitting a node of the tree.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child node: A node that splits into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (real dataset) and, based on the comparison, follows a branch and jumps to the next node.

At the next node, the algorithm again compares the attribute value with the sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such final nodes are called leaf nodes.

Example: Suppose a candidate has a job offer and wants to decide whether he should accept it or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the diagram below:

Attribute Selection Measures

While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems, there is a technique called the Attribute Selection Measure, or ASM. Using this measure, we can easily select the best attribute for the nodes of the tree. There are two popular ASM techniques:

o Information Gain
o Gini Index

1. Information Gain:

o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:
Information Gain = Entropy(S) - [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in
data. Entropy can be calculated as:

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

Where,

o S = the set of samples at the node
o P(yes) = the probability of class "yes" in S
o P(no) = the probability of class "no" in S
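
To make these two formulas concrete, here is a minimal Python sketch that computes entropy and information gain for a made-up binary split; the counts below are purely illustrative and are not taken from the dataset used elsewhere in this unit:

import numpy as np

def entropy(labels):
    # Entropy of a collection of class labels (0 for a pure node)
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(parent_labels, child_label_groups):
    # Entropy(S) minus the weighted average entropy of the child subsets
    n = len(parent_labels)
    weighted_child_entropy = sum(
        (len(child) / n) * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy

# Toy node with 9 "yes" and 5 "no" samples, split into two hypothetical branches
parent = ['yes'] * 9 + ['no'] * 5
left = ['yes'] * 6 + ['no'] * 2
right = ['yes'] * 3 + ['no'] * 3
print(entropy(parent))                          # about 0.940
print(information_gain(parent, [left, right]))  # about 0.048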

2. Gini Index:

o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 - ∑j (Pj)²
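
A corresponding sketch of the Gini index formula, reusing the same kind of toy label counts as above (again purely illustrative):

import numpy as np

def gini_index(labels):
    # Gini Index = 1 - sum of Pj^2 over the classes present in `labels`
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

print(gini_index(['yes'] * 9 + ['no'] * 5))  # about 0.459 (impure node)
print(gini_index(['yes'] * 5))               # 0.0 (pure node)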

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision
tree.

A tree that is too large increases the risk of overfitting, while a tree that is too small may not capture all the important patterns in the dataset. Pruning is therefore a technique that decreases the size of the learned tree without reducing its accuracy (a short code sketch follows the list below). There are mainly two tree pruning techniques:

o Cost Complexity Pruning
o Reduced Error Pruning
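
Scikit-learn exposes cost complexity pruning through the ccp_alpha parameter of DecisionTreeClassifier. The sketch below is only an illustration on synthetic data (not the user_data.csv dataset used later in this unit); a larger ccp_alpha removes more branches:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely to illustrate the effect of pruning
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Fully grown (unpruned) tree
unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)

# Candidate alpha values for cost complexity pruning; pick a large one
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print(unpruned.tree_.node_count, pruned.tree_.node_count)  # the pruned tree has far fewer nodes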

Advantages of the Decision Tree

o It is simple to understand because it follows the same process that a human follows while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o It requires less data cleaning compared to other algorithms.
Disadvantages of the Decision Tree

o A decision tree can contain many layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For datasets with more class labels, the computational complexity of the decision tree may increase.

Python Implementation of Decision Tree

Now we will implement the Decision Tree using Python. For this, we will use the dataset
"user_data.csv," which we have used in previous classification models. By using the same dataset, we
can compare the Decision Tree classifier with other classification models such
as KNN, SVM, Logistic Regression, etc.

The steps also remain the same as before and are given below (a condensed code sketch follows the list):

o Data Pre-processing step
o Fitting a Decision Tree algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (creation of the confusion matrix)
o Visualizing the test set result.
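
A condensed sketch of these steps is given below. It assumes the same user_data.csv layout (Age and Estimated Salary in columns 2 and 3, Purchased in column 4) used for the other classifiers in this unit; the visualization step is omitted because it is identical to the plotting code shown elsewhere in this section:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Data pre-processing (same as the other classification examples)
data_set = pd.read_csv('user_data.csv')
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

# Fitting a Decision Tree classifier; "entropy" makes it split on information gain
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

# Predicting the test result and checking it with a confusion matrix
y_pred = classifier.predict(x_test)
print(confusion_matrix(y_test, y_pred))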
Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each
tree and, based on the majority vote of those predictions, predicts the final output.

A greater number of trees in the forest generally leads to higher accuracy and helps prevent
overfitting.

The below diagram explains the working of the Random Forest algorithm:


Note: To better understand the Random Forest Algorithm, you should have knowledge of the Decision
Tree Algorithm.

Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the dataset, it is possible that
some decision trees predict the correct output while others do not. But together, the trees predict the
correct output. Therefore, below are two assumptions for a better Random Forest classifier:

o There should be some actual signal in the feature variables of the dataset so that the classifier can
predict accurate results rather than guessed results.
o The predictions from the individual trees must have very low correlations with each other.

Why use Random Forest?

Below are some points that explain why we should use the Random Forest algorithm:

<="" li="">
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, and it runs efficiently even on large datasets.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision
trees, and the second is to make a prediction by aggregating the output of each tree created in the first phase.

The working process can be explained in the steps and diagram below (a small code sketch of the two phases follows the steps):

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For a new data point, find the prediction of each decision tree, and assign the new data point
to the category that wins the majority vote.
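
Before the full scikit-learn implementation later in this section, here is a minimal sketch of these two phases on made-up synthetic data, using plain decision trees as the individual learners and a simple majority vote (everything here is illustrative and is not part of the user_data.csv example):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
rng = np.random.default_rng(0)
n_trees = 10                                   # Step-3: choose N, the number of trees

# Phase 1: build each tree on a random (bootstrap) subset of the training data
forest = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))                   # Step-1: random data points
    forest.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # Step-2: fit a tree

# Phase 2: every tree votes on a new point; the majority vote is the prediction (Step-5)
new_point = X[:1]
votes = np.array([tree.predict(new_point)[0] for tree in forest])
print(votes, "-> majority vote:", np.bincount(votes).argmax())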

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree. During the
training phase, each decision tree produces a prediction result, and when a new data point occurs, then
based on the majority of results, the Random Forest classifier predicts the final decision. Consider the
below image:
Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and helps prevent the overfitting issue.
Disadvantages of Random Forest

o Although Random Forest can be used for both classification and regression tasks, it is less suitable
for regression tasks.

Python Implementation of Random Forest Algorithm

Now we will implement the Random Forest algorithm using Python. For this, we will use the same
dataset "user_data.csv", which we have used in previous classification models. By using the same
dataset, we can compare the Random Forest classifier with other classification models such as Decision
tree Classifier, KNN, SVM, Logistic Regression, etc.

Implementation Steps are given below:

o Data Pre-processing step
o Fitting the Random Forest algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (creation of the confusion matrix)
o Visualizing the test set result.

1. Data Pre-Processing Step:

Below is the code for the pre-processing step:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

In the above code, we have pre-processed the data and loaded the dataset, which is given
as:

2. Fitting the Random Forest algorithm to the training set:

Now we will fit the Random forest algorithm to the training set. To fit it, we will import
the RandomForestClassifier class from the sklearn.ensemble library. The code is given below:

# Fitting the Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion="entropy")
classifier.fit(x_train, y_train)
In the above code, the classifier object takes the following parameters:

o n_estimators = the required number of trees in the Random Forest. The default value was 10 in older
versions of scikit-learn (it is 100 in version 0.22 and later). We can choose any number, but we need to
take care of the overfitting issue.
o criterion = the function used to measure the quality of a split. Here we have used "entropy" for
information gain.

Output:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

3. Predicting the Test Set result

Since our model is fitted to the training set, we can now predict the test result. For prediction, we
will create a new prediction vector y_pred. Below is the code for it:

# Predicting the test set result
y_pred = classifier.predict(x_test)

Output:

The prediction vector is given as:


By comparing the above prediction vector with the actual test set vector, we can identify the incorrect
predictions made by the classifier.

4. Creating the Confusion Matrix

Now we will create the confusion matrix to determine the correct and incorrect predictions. Below is
the code for it:

# Creating the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:
As we can see in the above matrix, there are 4+4= 8 incorrect predictions and 64+28= 92 correct
predictions.
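
From the same variables, the overall accuracy can also be computed directly; a small sketch using the cm, y_test, and y_pred defined in the code above:

from sklearn.metrics import accuracy_score

print(cm)                              # rows: actual classes, columns: predicted classes
print(accuracy_score(y_test, y_pred))  # 92 correct out of 100 test samples gives 0.92 here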

5. Visualizing the training Set result

Here we will visualize the training set result by plotting a graph for the Random Forest classifier. The
classifier predicts Yes or No for the users who either purchased or did not purchase the SUV, as we did
in Logistic Regression. Below is the code for it:

from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above image is the visualization of the Random Forest classifier on the training set. It is very
similar to the Decision Tree classifier's result. Each data point corresponds to a user in user_data, and
the purple and green areas are the prediction regions: the purple region is for users who did not
purchase the SUV, and the green region is for users who purchased it.

So, in the Random Forest classifier, we have taken 10 trees that predicted Yes or No for the Purchased
variable, and the classifier took the majority of those predictions to produce the result.

6. Visualizing the test set result

Now we will visualize the test set result. Below is the code for it:

# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above image is the visualization result for the test set. We can see that the number of incorrect
predictions is small (8) and there is no sign of an overfitting issue. We will get somewhat different
results by changing the number of trees in the classifier.
