ML Notes N
The Machine Learning Tutorial covers both the fundamentals and more complex ideas of
machine learning. Students and professionals in the workforce can benefit from our machine
learning tutorial.
You will learn about the many different methods of machine learning, including reinforcement
learning, supervised learning, and unsupervised learning, in this machine learning tutorial.
Regression and classification models, clustering techniques, hidden Markov models, and
various sequential models will all be covered.
In the real world, we are surrounded by humans who can learn everything from their
experiences with their learning capability, and we have computers or machines which work on
our instructions. But can a machine also learn from experiences or past data like a human does?
So here comes the role of Machine Learning.
Introduction to Machine Learning
A subset of artificial intelligence known as machine learning focuses primarily on the creation
of algorithms that enable a computer to independently learn from data and previous
experiences. Arthur Samuel first used the term "machine learning" in 1959. It could be
summarized as follows:
Machine learning algorithms create a mathematical model that, without being explicitly
programmed, aids in making predictions or decisions with the assistance of sample historical
data, or training data. For the purpose of developing predictive models, machine learning brings
together statistics and computer science. Algorithms that learn from historical data are either
constructed or utilized in machine learning. The performance will rise in proportion to the
quantity of information we provide.
A machine can learn if it can gain more data to improve its performance.
How does Machine Learning work
A machine learning system builds prediction models, learns from previous data, and predicts the output of new data whenever it receives it. The more data we provide, the better the model becomes and the more accurate its predictions are.
Let's say we have a complex problem in which we need to make predictions. Instead of writing code, we just feed the data to generic algorithms, which build the logic based on the data and predict the output. Machine learning has changed our perspective on such problems (a small code sketch follows the feature list below). Key features of machine learning include:
o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is quite similar to data mining, as both deal with huge amounts of data.
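To make this concrete, here is a minimal, illustrative sketch (not from the original tutorial) of feeding data to a generic algorithm instead of writing explicit rules; the toy data and the choice of scikit-learn's DecisionTreeClassifier are assumptions for illustration only:

# Toy data (hypothetical): [hours studied, hours slept] -> passed the exam (1) or not (0)
from sklearn.tree import DecisionTreeClassifier

X = [[1, 4], [2, 5], [8, 7], [9, 6], [3, 4], [7, 8]]
y = [0, 0, 1, 1, 0, 1]

model = DecisionTreeClassifier()   # a generic algorithm
model.fit(X, y)                    # builds the logic from the data, no explicit rules written
print(model.predict([[6, 7]]))     # predicts the output for new, unseen data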
The demand for machine learning is steadily rising, because it can perform tasks that are too complex for a person to implement directly. Humans cannot manually process vast amounts of data, so we need computer systems, and this is where machine learning comes in to simplify our lives.
We can train machine learning algorithms by providing them with a large amount of data and allowing them to automatically explore the data, build models, and predict the required output. A cost function can be used to measure how well a machine learning algorithm performs on the given data. Machine learning can save us both time and money.
Following are some key points which show the importance of Machine Learning:
o Rapid increase in the production of data
o Solving complex problems which are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data
At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Applications of Machine learning
Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. We use machine learning in our daily lives, often without knowing it, in tools such as Google Maps, Google Assistant, Alexa, etc. Below are some of the most trending real-world applications of Machine Learning:
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend tagging suggestions.
Facebook provides a feature of auto friend tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.
It is based on the Facebook project named "Deep Face," which is responsible for face recognition and person identification in the picture.
2. Speech Recognition
While using Google, we get an option of "Search by voice"; this comes under speech recognition, and it is a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in various speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time
Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for a product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.
Google understands the user's interests using various machine learning algorithms and suggests products according to those interests.
Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, a popular car manufacturing company, is working on self-driving cars and uses machine learning methods to train its car models to detect people and objects while driving.
Virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri are another common application. These assistants record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act accordingly.
Machine learning also makes our online transactions safe and secure by detecting fraud. Whenever we perform an online transaction, a fraudulent transaction can take place in various ways, such as fake accounts, fake IDs, or money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into some hash values, and these values become the input for the next round. For each genuine transaction there is a specific pattern, which changes for a fraudulent transaction; hence the system detects it and makes our online transactions more secure.
Machine learning is also behind automatic language translation. The technology used for automatic translation is a sequence-to-sequence learning algorithm, which, used together with image recognition, translates text from one language to another.
Machine learning Life cycle
Machine learning has given computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? It can be described using the machine learning life cycle. The machine learning life cycle is a cyclic process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.
Machine learning life cycle involves seven major steps, which are given below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
The most important thing in the complete process is to understand the problem and to know its purpose. Therefore, before starting the life cycle, we need to understand the problem, because a good result depends on a good understanding of the problem.
In the complete life cycle process, to solve a problem, we create a machine learning system called a "model", and this model is created by providing "training". But to train a model we need data; hence, the life cycle starts with collecting data.
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify the data sources and obtain the data related to the problem.
In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and quality of the collected data will determine the efficiency of the output: the more data there is, the more accurate the prediction will be.
This step includes the below tasks:
o Identify various data sources
o Collect data
o Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in further steps.
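As a small, hypothetical sketch of collecting and integrating data from different sources (the file names and the customer_id column are assumptions, not part of the original text), pandas could be used as follows:

import pandas as pd

# Hypothetical files representing two different data sources
orders = pd.read_csv('orders_2023.csv')
customers = pd.read_csv('customers.csv')

# Integrate the data obtained from the different sources into one coherent dataset
dataset = orders.merge(customers, on='customer_id', how='left')
print(dataset.shape)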
2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a step
where we put our data into a suitable place and prepare it to use in our machine learning
training.
In this step, first, we put all data together, and then randomize the ordering of data.
o Data exploration:
It is used to understand the nature of the data that we have to work with. We need to understand the characteristics, format, and quality of the data. A better understanding of the data leads to an effective outcome. In this step, we find correlations, general trends, and outliers.
o Data pre-processing:
Now the next step is preprocessing of data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a useable format. It is
the process of cleaning the data, selecting the variable to use, and transforming the data in a
proper format to make it more suitable for analysis in the next step. It is one of the most
important steps of the complete process. Cleaning of data is required to address the quality
issues.
It is not necessary that the data we have collected is always useful to us, as some of it may not be. In real-world applications, collected data may have various issues, including:
o Missing values
o Duplicate data
o Invalid data
o Noise
It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.
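A minimal, illustrative data-wrangling sketch (the toy values are assumptions) showing how such issues could be addressed with pandas:

import pandas as pd

# Hypothetical raw data containing a missing value, a duplicate row, and an implausible value
raw = pd.DataFrame({
    'age':    [25, 31, 31, 200, 40],
    'salary': [48000, None, None, 65000, 54000],
    'city':   ['Delhi', 'Paris', 'Paris', 'Berlin', 'Berlin'],
})

clean = raw.drop_duplicates()                                        # remove duplicate rows
clean = clean[clean['age'].between(18, 100)].copy()                  # drop invalid / outlier ages
clean['salary'] = clean['salary'].fillna(clean['salary'].median())   # fill missing salaries
print(clean)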
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
o Selection of analytical techniques
o Building models
o Reviewing the result
The aim of this step is to build a machine learning model to analyze the data using various
analytical techniques and review the outcome. It starts with the determination of the type of
the problems, where we select the machine learning techniques
such as Classification, Regression, Cluster analysis, Association, etc. then build the
model using prepared data, and evaluate the model.
Hence, in this step, we take the data and use machine learning algorithms to build the model.
5. Train Model
Now the next step is to train the model. In this step, we train our model to improve its performance for a better outcome of the problem.
We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can understand the various patterns, rules, and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, then we test the model.
In this step, we check for the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirement of
project or problem.
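The train and test steps can be sketched with scikit-learn; this is an illustrative example using the built-in Iris dataset (an assumption, not part of the original text), not the tutorial's own code:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is tested on examples it has never seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)                        # train the model (step 5)
y_pred = model.predict(X_test)                     # test the model (step 6)
print('Accuracy:', accuracy_score(y_test, y_pred))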
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in the
real-world system.
If the above-prepared model is producing an accurate result as per our requirement with
acceptable speed, then we deploy the model in the real system. But before deploying the project,
we will check whether it is improving its performance using available data or not.
The deployment phase is similar to making the final report for a project.
Difference between Artificial intelligence and Machine learning
Artificial intelligence and machine learning are two closely related branches of computer science. They are among the most trending technologies used for creating intelligent systems.
Although they are related, and people sometimes use them as synonyms for each other, they are still two different terms in various cases.
Below are some main differences between AI and machine learning along with the overview of
Artificial intelligence and machine learning.
Artificial Intelligence
Artificial intelligence is a field of computer science which makes a computer system that can
mimic human intelligence. It is comprised of two words "Artificial" and "intelligence", which
means "a human-made thinking power." Hence we can define it as,
Artificial intelligence is a technology using which we can create intelligent systems that can
simulate human intelligence.
The Artificial intelligence system does not need to be pre-programmed; instead, it uses algorithms that can work with their own intelligence. It involves machine learning algorithms such as reinforcement learning and deep learning neural networks. AI is used in many places, such as Siri, Google's AlphaGo, AI in chess playing, etc.
Based on capabilities, AI can be divided into three types:
o Weak AI
o General AI
o Strong AI
Currently, we are working with weak AI and general AI. The future of AI is strong AI, which is said to be more intelligent than humans.
Machine learning
Machine learning is about extracting knowledge from the data. It can be defined as,
Machine learning is a subfield of artificial intelligence, which enables machines to learn from
past data or experiences without being explicitly programmed.
Machine learning enables a computer system to make predictions or take some decisions using
historical data without being explicitly programmed. Machine learning uses a massive amount
of structured and semi-structured data so that a machine learning model can generate accurate
result or give predictions based on that data.
Machine learning works on algorithms which learn on their own using historical data. It works only for specific domains: for example, if we are creating a machine learning model to detect pictures of dogs, it will only give results for dog images; if we provide new data, such as a cat image, it will become unresponsive. Machine learning is being used in various places, such as online recommender systems, Google search algorithms, email spam filters, Facebook auto friend tagging suggestions, etc.
Key differences between Artificial Intelligence (AI) and Machine learning (ML):
Artificial Intelligence (AI) | Machine Learning (ML)
The goal of AI is to make a smart computer system, like humans, to solve complex problems. | The goal of ML is to allow machines to learn from data so that they can give accurate output.
In AI, we make intelligent systems to perform any task like a human. | In ML, we teach machines with data to perform a particular task and give an accurate result.
Machine learning and deep learning are the two main subsets of AI. | Deep learning is a main subset of machine learning.
AI has a very wide range of scope. | Machine learning has a limited scope.
AI is working to create an intelligent system which can perform various complex tasks. | Machine learning is working to create machines that can perform only those specific tasks for which they are trained.
An AI system is concerned with maximizing the chances of success. | Machine learning is mainly concerned with accuracy and patterns.
The main applications of AI are Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc. | The main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.
On the basis of capabilities, AI can be divided into three types: Weak AI, General AI, and Strong AI. | Machine learning can also be divided into mainly three types: supervised learning, unsupervised learning, and reinforcement learning.
AI completely deals with structured, semi-structured, and unstructured data. | Machine learning deals with structured and semi-structured data.
What is a dataset?
A dataset is a collection of data in which the data is arranged in some order. A dataset can contain anything from a series or an array to a database table. The table below shows an example of a dataset:
Country | Age | Salary | Purchased
India | 38 | 48000 | No
Germany | 30 | 54000 | No
France | 48 | 65000 | No
Germany | 40 |  | Yes
A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable and each row corresponds to a record of the dataset.
The most supported file type for a tabular dataset is the "Comma Separated Values" file, or CSV. But to store tree-like data, the JSON format can be used more efficiently.
Note: A real-world dataset is of huge size, which is difficult to manage and process at the
initial level. Therefore, to practice machine learning algorithms, we can use any dummy
dataset.
Types of datasets
Machine learning spans different domains, each requiring specific types of datasets. A few common types of datasets used in machine learning include:
Image Datasets:
Image datasets contain a collection of images and are typically used in computer vision tasks such as image classification, object detection, and image segmentation.
Text Datasets:
Text datasets consist of textual data, such as articles, books, or social media posts. These datasets are used in NLP tasks like sentiment analysis, text classification, and machine translation.
Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets. They contain rows representing instances or samples and columns representing features or attributes. Tabular datasets are used for tasks like regression and classification. The dataset given earlier in the article is an example of a tabular dataset.
Need of Dataset
o Properly prepared and pre-processed datasets are essential for machine learning projects.
o They provide the foundation for training accurate and reliable models. However, working with large datasets can present challenges in terms of management and processing.
o To address these challenges, efficient data management techniques and processing algorithms are required.
Data Pre-processing:
During the development of an ML project, the developers completely rely on the datasets. In building ML applications, datasets are divided into two parts: a training dataset and a test dataset.
Note: The datasets are of large size, so to download these datasets, you must have fast internet
on your computer.
In machine learning, datasets are ordinarily partitioned into two parts: the training dataset and the test dataset. The training dataset is used to train the machine learning model, while the test dataset is used to assess the model's performance. This division evaluates the model's ability to generalize to unseen data. It is essential to ensure that the datasets are representative of the problem space and appropriately split to avoid bias or overfitting.
Popular sources for Machine Learning datasets
Below is a list of dataset sources which are freely available for the public to work with:
1. Kaggle Datasets
Kaggle is one of the best sources for providing datasets for Data Scientists and Machine
Learners. It allows users to find, download, and publish datasets in an easy way. It also provides
the opportunity to work with other machine learning engineers and solve difficult Data Science
related tasks.
Kaggle provides a high-quality dataset in different formats that we can easily find and
download.
2. UCI Machine Learning Repository
The UCI Machine Learning Repository is another widely used source of datasets for machine learning; datasets in the repository include the Iris dataset, the Car Evaluation dataset, and the Poker Hand dataset.
3. Registry of Open Data on AWS
Anyone can analyze and build various services using the data shared via AWS resources. The shared datasets on the cloud help users spend more time on data analysis rather than on data acquisition.
This source provides various types of datasets, with examples and ways to use each dataset. It also provides a search box with which we can search for the required dataset. Anyone can add any dataset or example to the Registry of Open Data on AWS.
4. Google Dataset Search
The link for the Google dataset search engine is https://toolbox.google.com/datasetsearch.
5. Microsoft Datasets
Microsoft has launched the "Microsoft Research Open Data" repository with a collection of free datasets in various areas such as natural language processing, computer vision, and domain-specific sciences. It gives access to diverse, curated datasets that can be valuable for machine learning projects.
The link to download or use datasets from this resource is https://msropendata.com/.
6. Awesome Public Dataset Collection
Awesome public dataset collection provides high-quality datasets that are arranged in a well-
organized manner within a list according to topics such as Agriculture, Biology, Climate,
Complex networks, etc. Most of the datasets are available free, but some may not, so it is better
to check the license before downloading the dataset.
7. Government Datasets
There are different sources to get government-related data. Various countries publish government data, collected from their different departments, for public use.
The goal of providing these datasets is to increase the transparency of government work among the people and to allow the data to be used in innovative ways. Below are some links to government datasets:
8. Visual Data
Visual Data provides a number of great datasets that are specific to computer vision tasks such as image classification, video classification, image segmentation, etc. Therefore, if you want to build a project on deep learning or image processing, you can refer to this source.
The link for downloading datasets from this source is https://www.visualdata.io/.
9. Scikit-learn dataset
Scikit-learn, a well-known machine learning library in Python, gives a few underlying datasets
to practice and trial and error. These datasets are open through the sci-kit-learn Programming
interface and can be utilized for learning different machine-learning calculations. Scikit-learn
offers both toy datasets, which are little and improved, and genuine world datasets with greater
intricacy. Instances of sci-kit-learn datasets incorporate the Iris dataset, the Boston Lodging
dataset, and the Wine dataset.
The link to download datasets from this source is https://scikit-
learn.org/stable/datasets/index.html.
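As a small illustration (note that recent scikit-learn versions have removed the Boston Housing loader, so the Iris and Wine datasets are used here), the built-in datasets can be loaded as follows:

from sklearn.datasets import load_iris, load_wine

iris = load_iris()
print(iris.data.shape)        # (150, 4) feature matrix
print(iris.feature_names)     # names of the four measurements
print(iris.target[:5])        # class labels

wine = load_wine()
print(wine.data.shape, list(wine.target_names))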
Data ethics and privacy are critical considerations in machine learning projects. It is essential to ensure that data is collected and used ethically, respecting privacy rights and complying with relevant laws and regulations. Data practitioners should take measures to protect data privacy, obtain proper consent, and handle sensitive data responsibly. Resources such as ethical guidelines and privacy frameworks can provide direction on maintaining ethical practices in data collection and use.
Conclusion:
By using popular sources such as Kaggle, the UCI repository, scikit-learn, Google's Dataset Search, Microsoft Datasets, and government datasets, data scientists and researchers can access a wide variety of datasets for their machine learning projects. It is essential to consider data ethics and privacy throughout the entire data lifecycle to ensure responsible and ethical use of data. With the right datasets and ethical practices, machine learning models can achieve accurate predictions and drive meaningful insights.
Supervised Machine Learning
In supervised learning, the training data provided to the machines works as a supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested using test data (data held out from training), and then it predicts the output.
The working of Supervised learning can be easily understood by the below example and
diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle,
triangle, and Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify
the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.
Steps involved in supervised learning:
o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the training dataset into a training dataset, test dataset, and validation dataset.
o Determine the input features of the training dataset, which should have enough knowledge so that the model can accurately predict the output.
o Determine the suitable algorithm for the model, such as support vector machine, decision tree,
etc.
o Execute the algorithm on the training dataset. Sometimes we need validation sets as the control
parameters, which are the subset of training datasets.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, it means our model is accurate.
Types of supervised machine learning algorithms:
Supervised learning can be further divided into two types of problems:
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output variable. They are used for the prediction of continuous variables, such as weather forecasting, market trends, etc. Below are some popular regression algorithms which come under supervised learning (a small sketch follows this list):
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
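As a minimal, illustrative sketch of a regression algorithm predicting a continuous variable (the toy numbers are assumptions), linear regression in scikit-learn looks like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy continuous-valued problem: predict y from a single input feature
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 1.9, 3.1, 4.0, 5.1])

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)   # the learned linear relationship
print(reg.predict([[6]]))          # prediction for an unseen input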
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are two classes, such as Yes-No, Male-Female, True-False, etc. A common example is spam filtering.
Below are some popular classification algorithms which come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support Vector Machines
Note: We will discuss these algorithms in detail in later chapters.
Advantages of supervised learning:
o With the help of supervised learning, the model can predict the output on the basis of prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning models help us solve various real-world problems such as fraud detection, spam filtering, etc.
Disadvantages of supervised learning:
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the training dataset.
Unsupervised Machine Learning
As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a training dataset. Instead, the models themselves find the hidden patterns and insights from the given data. It can be compared to the learning that takes place in the human brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using
unlabeled dataset and are allowed to act on that data without any supervision.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained on the given dataset, which means it does not have any idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. The unsupervised learning algorithm will perform this task by clustering the image dataset into groups according to similarities between images.
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is quite similar to how a human learns to think through their own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised learning more important.
o In the real world, we do not always have input data with the corresponding output, so to solve such cases we need unsupervised learning.
Here, we have unlabeled input data, which means it is not categorized and the corresponding outputs are also not given. This unlabeled input data is fed to the machine learning model in order to train it. First, the model interprets the raw data to find the hidden patterns in the data and then applies suitable algorithms such as k-means clustering, decision trees, etc.
Once it applies the suitable algorithm, the algorithm divides the data objects into groups
according to the similarities and difference between the objects.
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities (see the sketch after this list).
o Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the set of items that occur together in the dataset. Association rules make marketing strategy more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of an association rule is Market Basket Analysis.
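A minimal clustering sketch (the toy points are assumptions) using k-means from scikit-learn, in the spirit of the clustering method described above:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two loose groups of 2-D points, with no output column at all
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # the group assigned to each point
print(kmeans.cluster_centers_)   # the centre of each discovered group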
Advantages of unsupervised learning:
o Unsupervised learning is used for more complex tasks as compared to supervised learning because, in unsupervised learning, we don't have labeled input data.
Disadvantages of unsupervised learning:
o Unsupervised learning is intrinsically more difficult than supervised learning as it does not have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate, as the input data is not labeled and the algorithm does not know the exact output in advance.
Difference between Supervised and Unsupervised Learning
Supervised and unsupervised learning are the two main techniques of machine learning, but they are used in different scenarios and with different datasets. The explanation of both learning methods, along with their difference table, is given below.
Supervised learning is a machine learning method in which models are trained using labeled
data. In supervised learning, models need to find the mapping function to map the input variable
(X) with the output variable (Y).
Supervised learning needs supervision to train the model, which is similar to as a student learns
things in the presence of a teacher. Supervised learning can be used for two types of problems:
Classification and Regression.
Example: Suppose we have an image of different types of fruits. The task of our supervised
learning model is to identify the fruits and classify them accordingly. So to identify the image
in supervised learning, we will give the input data as well as output for that, which means we
will train the model by the shape, size, color, and taste of each fruit. Once the training is
completed, we will test the model by giving the new set of fruit. The model will identify the
fruit and predict the output using a suitable algorithm.
Unsupervised learning is another machine learning method in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any supervision. Instead, it finds patterns from the data on its own.
Unsupervised learning can be used for two types of problems: Clustering and Association.
Example: To understand the unsupervised learning, we will use the example given above. So
unlike supervised learning, here we will not provide any supervision to the model. We will just
provide the input dataset to the model and allow the model to find the patterns from the data.
With the help of a suitable algorithm, the model will train itself and divide the fruits into
different groups according to the most similar features between them.
The main differences between Supervised and Unsupervised learning are given below:
Supervised Learning | Unsupervised Learning
Supervised learning model takes direct feedback to check if it is predicting the correct output or not. | Unsupervised learning model does not take any feedback.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the data.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be used for those cases where we know the input as well as the corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data.
Supervised learning is not close to true Artificial Intelligence, as in this we first train the model for each data and only then can it predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things from his experiences.
It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | It includes various algorithms such as the Apriori algorithm.
Note: Supervised and unsupervised learning are both machine learning methods, and the choice between them depends on factors related to the structure and volume of your dataset and the use case of the problem.
Reinforcement Learning
o In reinforcement learning, the agent learns automatically using feedback, without any labeled data, unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its experience only.
o RL solves a specific type of problem where decision making is sequential and the goal is long-term, such as game playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by getting the maximum positive rewards.
o The agent learns through a process of trial and error, and based on the experience, it learns to perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within that." How a robotic dog learns the movement of its arms is an example of reinforcement learning.
o It is a core part of Artificial Intelligence, and all AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.
o Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing some actions; based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback.
o The agent continues doing these three things (take action, change state or remain in the same state, and get feedback), and by doing these actions, it learns and explores the environment.
o The agent learns which actions lead to positive feedback (rewards) and which actions lead to negative feedback (penalties). For a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.
Terms used in Reinforcement Learning
o Agent(): An entity that can perceive/explore the environment and act upon it.
o State(): State is a situation returned by the environment after each action taken by the agent.
o Reward(): A feedback returned to the agent from the environment to evaluate the action of the
agent.
o Policy(): Policy is a strategy applied by the agent for the next action based on the current state.
o Value(): The expected long-term return with the discount factor, as opposed to the short-term reward.
o Q-value(): It is mostly similar to the value, but it takes one additional parameter, the current action (a).
Key features of reinforcement learning:
o In RL, the agent is not instructed about the environment or what actions need to be taken.
o The agent takes the next action and changes states according to the feedback of the previous action.
o The agent may get a delayed reward.
o The environment is stochastic, and the agent needs to explore it to get the maximum positive rewards.
There are mainly three ways to implement reinforcement-learning in ML, which are:
1. Value-based:
The value-based approach is about finding the optimal value function, which is the maximum value at a state under any policy. Therefore, the agent expects the long-term return at any state(s) under policy π.
2. Policy-based:
Policy-based approach is to find the optimal policy for the maximum future rewards without
using the value function. In this approach, the agent tries to apply such a policy that the action
performed in each step helps to maximize the future reward. The policy-based approach has
mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any state.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the environment,
and the agent explores that environment to learn it. There is no particular solution or
algorithm for this approach because the model representation is different for each
environment.
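To make the value-based approach concrete, here is a small, illustrative tabular Q-learning sketch on a made-up corridor environment (the environment, rewards, and hyper-parameters are all assumptions for illustration, not part of the original text):

import numpy as np

# Corridor of 5 states; the agent starts at state 0 and gets a reward of +1 on reaching state 4.
n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # value of taking each action in each state
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # epsilon-greedy: mostly exploit the current Q-values, sometimes explore
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: move Q towards the reward plus the discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.argmax(Q[:-1], axis=1))   # greedy action in each non-terminal state (should prefer "right", i.e. 1)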
There are four main elements of Reinforcement Learning, which are given below:
1. Policy
2. Reward Signal
3. Value Function
4. Model
1) Policy: A policy can be defined as a way how an agent behaves at a given time. It maps
the perceived states of the environment to the actions taken on those states. A policy is the core
element of the RL as it alone can define the behavior of the agent. In some cases, it may be a
simple function or a lookup table, whereas, for other cases, it may involve general computation
as a search process. A policy could be deterministic or stochastic:
o For a deterministic policy: a = π(s)
o For a stochastic policy: π(a | s) = P[A_t = a | S_t = s]
2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At
each state, the environment sends an immediate signal to the learning agent, and this signal is
known as a reward signal. These rewards are given according to the good and bad actions
taken by the agent. The agent's main objective is to maximize the total number of rewards for
good actions. The reward signal can change the policy, such as if an action selected by the agent
leads to low reward, then the policy may change to select other actions in the future.
3) Value Function: The value function gives information about how good the situation
and action are and how much reward an agent can expect. A reward indicates the immediate
signal for each good and bad action, whereas a value function specifies the good state and
action for the future. The value function depends on the reward as, without reward, there
could be no value. The goal of estimating values is to achieve more rewards.
4) Model: The last element of reinforcement learning is the model, which mimics the
behavior of the environment. With the help of the model, one can make inferences about how
the environment will behave. Such as, if a state and an action are given, then a model can
predict the next state and reward.
The model is used for planning, which means it provides a way to take a course of action by
considering all future situations before actually experiencing those situations. The approaches
for solving the RL problems with the help of the model are termed as the model-based
approach. Comparatively, an approach without using a model is called a model-free
approach.
Classification Algorithm in Machine Learning
The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data. In Classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets, labels, or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as "Green
or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised learning
technique, it takes labeled input data, which means it contains inputs with the corresponding outputs.
The main goal of the Classification algorithm is to identify the category of a given dataset, and
these algorithms are mainly used to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, class A and Class B. These classes have features that are similar
to each other and dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
Learners in classification problems:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner's case, classification is done on the basis of the most related data stored in the training dataset. It takes less time in training but more time for predictions.
Examples: K-NN algorithm, Case-based reasoning
2. Eager Learners: An eager learner builds a classification model from the training dataset before receiving the test dataset. Opposite to lazy learners, it takes more time in learning and less time in prediction.
Examples: Decision Trees, Naïve Bayes, ANN
Classification algorithms can be further divided into two main categories: linear models and non-linear models.
Once our model is completed, it is necessary to evaluate its performance, whether it is a Classification or a Regression model. For evaluating a Classification model, we have the following ways:
1. Log Loss or Cross-Entropy Loss:
o It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
o For a good binary classification model, the value of log loss should be near 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o A lower log loss represents a higher accuracy of the model.
o For binary classification, cross-entropy can be calculated as:
  -(y log(p) + (1 - y) log(1 - p)), where y is the actual label (0 or 1) and p is the predicted probability of class 1 (a small example follows this list).
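A small illustrative example (the toy labels and probabilities are assumptions) of computing log loss with scikit-learn:

from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]            # actual labels
y_prob = [0.1, 0.9, 0.8, 0.3]    # predicted probability of class 1

print(log_loss(y_true, y_prob))  # close to 0 when the predicted probabilities match the labels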
2. Confusion Matrix:
o The confusion matrix provides us a matrix/table as output and describes the performance of the
model.
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area
Under the Curve.
o It is a graph that shows the performance of the classification model at different thresholds.
o To visualize the performance of the multi-class classification model, we use the AUC-ROC curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) is on the Y-axis and FPR (False Positive Rate) is on the X-axis (a small sketch follows this list).
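A minimal, illustrative sketch (the toy labels and scores are assumptions) of computing the ROC curve points and the AUC with scikit-learn:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]      # predicted probabilities or scores for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)                      # points of the ROC curve (FPR on X-axis, TPR on Y-axis)
print('AUC:', roc_auc_score(y_true, y_score))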
Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:
Logistic Regression in Machine Learning
o Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False,
etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values which
lie between 0 and 1.
o Logistic Regression is much like Linear Regression except for how they are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below image
is showing the logistic function:
Note: Logistic regression uses the concept of predictive modeling as regression; therefore,
it is called logistic regression, but is used to classify samples; Therefore, it falls under the
classification algorithm.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or
the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0 (a small numerical sketch follows).
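A minimal numerical sketch of the sigmoid (logistic) function and the 0.5 threshold idea described above (an illustration, not the tutorial's own code):

import numpy as np

def sigmoid(z):
    # Maps any real value into the (0, 1) range, producing the "S"-shaped curve
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-6, -2, 0, 2, 6])))   # values near 0 ... 0.5 ... near 1
threshold = 0.5
print(sigmoid(2.0) > threshold)               # above the threshold -> predicted class 1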
Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.
Logistic Regression Equation:
The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical steps to get the Logistic Regression equation are given below:
o We know the equation of a straight line can be written as:
  y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above equation by (1 - y):
  y / (1 - y), which is 0 for y = 0 and infinity for y = 1
o But we need a range between -[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:
  log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "Medium", or "High".
To understand the implementation of Logistic Regression in Python, we will use the below
example:
Example: There is a given dataset which contains information about various users, obtained from social networking sites. A car manufacturing company has recently launched a new SUV car, and the company wants to check how many users from the dataset want to purchase the car.
For this problem, we will build a Machine Learning model using the Logistic regression
algorithm. The dataset is shown in the below image. In this problem, we will predict the
purchased variable (Dependent Variable) by using age and salary (Independent
variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will
use the same steps as we have done in previous topics of Regression. Below are the steps:
o Data Pre-processing step
o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result (creation of the Confusion matrix)
o Visualizing the test set result
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use
it in our code efficiently. It will be the same as we have done in Data pre-processing topic. The
code for this is given below:
# Data Pre-processing step
# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('user_data.csv')
By executing the above lines of code, we will get the dataset as the output. Consider the given
image:
Now, we will extract the dependent and independent variables from the given dataset. Below
is the code for it:
# Extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values
In the above code, we have taken [2, 3] for x because our independent variables are age and
salary, which are at index 2, 3. And we have taken 4 for y variable because our dependent
variable is at index 4. The output will be:
Now we will split the dataset into a training set and test set. Below is the code for it:
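The splitting code itself does not appear here; a minimal version, consistent with the x_train, x_test, y_train, and y_test variables used in the following steps, would be:

# Splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)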
For training set:
In logistic regression, we will do feature scaling because we want accurate prediction results. Here we will only scale the independent variables, because the dependent variable has only 0 and 1 values. Below is the code for it:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
The scaled output is given below:
2. Fitting Logistic Regression to the Training set:
Our dataset is now well prepared, and we will train the model using the training set. To fit the model, we will import the LogisticRegression class from the sklearn library. After importing the class, we will create a classifier object and use it to fit the logistic regression model to the training data. Below is the code for it:
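The fitting code itself is not reproduced here; a minimal version consistent with the Out[5] output shown below would be:

# Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)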
Out[5]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
Hence our model is well fitted to the training set.
Our model is well trained on the training set, so we will now predict the result by using test set
data. Below is the code for it:
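The prediction code is not shown here; a one-line version consistent with the y_pred vector described below would be:

# Predicting the test set result
y_pred = classifier.predict(x_test)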
Output: By executing the above code, a new vector (y_pred) will be created under the variable
explorer option. It can be seen as:
The above output shows, for each test user, whether the model predicts that they will purchase the car or not.
Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need to import the confusion_matrix function of the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function takes two main parameters, y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:
# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:
By executing the above code, a new confusion matrix will be created. Consider the below
image:
We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above output, we can see that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect.
# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
In the above code, we have imported the ListedColormap class of the Matplotlib library to create the colormap for visualizing the result. We have created two new variables, x_set and y_set, to replace x_train and y_train. After that, we have used the nm.meshgrid command to create a rectangular grid, which ranges from the minimum value minus 1 to the maximum value plus 1 for each feature. The pixel points we have taken have a resolution of 0.01.
To create a filled contour, we have used the mtp.contourf command; it creates regions of the provided colors (purple and green). In this function, we have passed classifier.predict so that each region is colored according to the class predicted by the classifier.
Output: By executing the above code, we will get the below output:
The graph can be explained in the below points:
o In the above graph, we can see that there are some Green points within the green region and
Purple points within the purple region.
o All these data points are the observation points from the training set, which shows the result
for purchased variables.
o This graph is made by using two independent variables i.e., Age on the x-axis and Estimated
salary on the y-axis.
o The purple points are the observations for which Purchased (the dependent variable) is 0, i.e., users who did not purchase the SUV car.
o The green points are the observations for which Purchased (the dependent variable) is 1, i.e., users who purchased the SUV car.
o We can also estimate from the graph that younger users with a low salary did not purchase the car, whereas older users with a high estimated salary purchased the car.
o But there are some purple points in the green region (buying the car) and some green points in the purple region (not buying the car); for example, some younger users with a high estimated salary purchased the car, whereas some older users with a low estimated salary did not.
The goal of the classifier:
We have successfully visualized the training set result for the logistic regression, and our goal for this classification is to separate the users who purchased the SUV car from those who did not. From the output graph, we can clearly see the two regions (purple and green) with the observation points. The purple region is for the users who didn't buy the car, and the green region is for the users who purchased the car.
Linear Classifier:
As we can see from the graph, the classifier is a straight line, or linear in nature, because we have used the linear model for Logistic Regression. In further topics, we will learn about non-linear classifiers.
Our model is well trained using the training dataset. Now we will visualize the result for new observations (the test set). The code for the test set remains the same as above, except that here we will use x_test and y_test instead of x_train and y_train. Below is the code for it:
12. c = ListedColormap(('purple', 'green'))(i), label = j)
13. mtp.title('Logistic Regression (Test set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()
Output:
The above graph shows the test set result. As we can see, the graph is divided into two regions (purple and green), and the green observations are in the green region while the purple observations are in the purple region. So we can say it is a good prediction and a good model. Some of the green and purple data points are in different regions, which can be ignored as we have already counted these errors using the confusion matrix (11 incorrect outputs).
Hence our model is pretty good and ready to make new predictions for this classification
problem.
K-Nearest Neighbor (K-NN) Algorithm
o The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used
for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an action
on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will compare the features of the new image with the cat and dog images, and based on the most similar features it will put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to the available data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:
o Firstly, we will choose the number of neighbors, so we will choose K = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
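For two points A(x1, y1) and B(x2, y2) in the plane, the distance is:
Euclidean distance = √((x2 − x1)² + (y2 − y1)²)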
o By calculating the Euclidean distance, we got the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some values
to find the best out of them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
o Large values for K are good, but too large a value may include too many points from the other categories and blur the class boundaries.
Advantages of the K-NN Algorithm:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
Disadvantages of the K-NN Algorithm:
o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the new data point and all the training samples.
To do the Python implementation of the K-NN algorithm, we will use the same problem and
dataset which we have used in Logistic Regression. But here we will improve the performance
of the model. Below is the problem description:
Problem for the K-NN Algorithm: There is a car manufacturer company that has manufactured a new SUV car. The company wants to show ads to the users who are interested in buying that SUV. So for this problem, we have a dataset that contains information about multiple users collected from a social network. The dataset contains a lot of information, but we will consider Age and Estimated Salary as the independent variables and the Purchased variable as the dependent variable. Below is the dataset:
Steps to implement the K-NN algorithm:
o Data Pre-processing step
o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (creation of the Confusion matrix)
o Visualizing the test set result
The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is
the code for it:
1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. #importing datasets
7. data_set= pd.read_csv('user_data.csv')
8.
9. #Extracting Independent and dependent Variable
10. x= data_set.iloc[:, [2,3]].values
11. y= data_set.iloc[:, 4].values
12.
13. # Splitting the dataset into training and test set.
14. from sklearn.model_selection import train_test_split
15. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
16.
17. #feature Scaling
18. from sklearn.preprocessing import StandardScaler
19. st_x= StandardScaler()
20. x_train= st_x.fit_transform(x_train)
21. x_test= st_x.transform(x_test)
By executing the above code, our dataset is imported to our program and well pre-processed.
After feature scaling our test dataset will look like:
From the above output image, we can see that our data is successfully scaled.
Fitting the K-NN classifier to the Training data: we will create the classifier using the KNeighborsClassifier class of the sklearn.neighbors library. Its main parameters are:
o n_neighbors: the required number of neighbors; here we take 5 (as shown in the Out[10] line below).
o metric='minkowski': this is the default parameter, and it decides how the distance between the points is measured.
o p=2: it makes the Minkowski metric equivalent to the standard Euclidean metric.
Then we will fit the classifier to the training data. Below is the code for it:
1. #Fitting K-NN classifier to the training set
2. from sklearn.neighbors import KNeighborsClassifier
3. classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
4. classifier.fit(x_train, y_train)
Output: By executing the above code, we will get the output as:
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we
did in Logistic Regression. Below is the code for it:
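The listing is not reproduced here; a minimal sketch, assuming the fitted classifier object from the previous step:
#Predicting the test set result (sketch)
y_pred= classifier.predict(x_test)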
o Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to see the accuracy of the
classifier. Below is the code for it:
1. #Creating the Confusion matrix
2. from sklearn.metrics import confusion_matrix
3. cm= confusion_matrix(y_test, y_pred)
In above code, we have imported the confusion_matrix function and called it using the variable
cm.
Output: By executing the above code, we will get the matrix as below:
In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7 incorrect
predictions, whereas, in Logistic Regression, there were 11 incorrect predictions. So we can
say that the performance of the model is improved by using the K-NN algorithm.
7. alpha = 0.75, cmap = ListedColormap(('red','green' )))
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('red', 'green'))(i), label = j)
13. mtp.title('K-NN Algorithm (Training set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()
Output:
The output graph is different from the graph we obtained with Logistic Regression. It can be understood from the below points:
o As we can see, the graph shows red points and green points. The green points are for the Purchased (1) value and the red points for the not Purchased (0) value of the variable.
o The graph shows an irregular boundary instead of a straight line or a smooth curve because it is produced by the K-NN algorithm, i.e., by finding the nearest neighbors.
o The graph has classified users into the correct categories, as most of the users who didn't buy the SUV are in the red region and users who bought the SUV are in the green region.
o The graph shows a good result, but there are still some green points in the red region and red points in the green region. This is not a big issue, as it keeps the model from overfitting the training data.
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()
Output:
The above graph shows the output for the test data set. As we can see in the graph, the predicted output is quite good, as most of the red points are in the red region and most of the green points are in the green region.
However, there are a few green points in the red region and a few red points in the green region. These are the incorrect observations that we have already counted in the confusion matrix (7 incorrect outputs).
Support Vector Machine (SVM) Algorithm
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called as support vectors, and hence algorithm is termed as Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using
a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of a dog. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors); on the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane:
The dimensions of the hyperplane depend on the number of features present in the dataset, which means that if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has a maximum margin, which means the maximum distance between the data points of the two classes.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position
of the hyperplane are termed as Support Vector. Since these vectors support the hyperplane,
hence called a Support vector.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have
a dataset that has two tags (green and blue), and the dataset has two features x1 and x2.
We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Consider the below image:
As it is a 2-d space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be
calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert
it in 2d space with z=1, then it will become as:
Hence we get a circumference of radius 1 in case of non-linear data.
Now we will implement the SVM algorithm using Python. Here we will use the same dataset user_data, which we have used in Logistic Regression and KNN classification.
o Data Pre-processing step
Till the data pre-processing step, the code will remain the same. Below is the code:
8. data_set= pd.read_csv('user_data.csv')
9.
10. #Extracting Independent and dependent Variable
11. x= data_set.iloc[:, [2,3]].values
12. y= data_set.iloc[:, 4].values
13.
14. # Splitting the dataset into training and test set.
15. from sklearn.model_selection import train_test_split
16. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
The scaled output for the test set will be:
Fitting the SVM classifier to the training set:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will import the SVC class from the sklearn.svm library. Below is the code for it:
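A minimal sketch of that listing, matching the kernel='linear' and random_state=0 settings shown in the Out[8] line below:
#Fitting the SVM classifier to the training set (sketch)
from sklearn.svm import SVC
classifier= SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)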
Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr',
degree=3, gamma='auto_deprecated', kernel='linear', max_iter=-1, probability=False,
random_state=0,
shrinking=True, tol=0.001, verbose=False)
The model performance can be altered by changing the value of C (the regularization factor), gamma, and the kernel.
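o Predicting the test set result: the listing for this step is not reproduced here; under the same names, a minimal sketch would be:
y_pred= classifier.predict(x_test)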
Output: Below is the output for the prediction of the test set:
o Creating the confusion matrix:
Now we will check the performance of the SVM classifier, i.e., how many incorrect predictions there are compared to the Logistic Regression classifier. To create the confusion matrix, we need to import the confusion_matrix function of the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function takes two parameters, y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:
1. #Creating the Confusion matrix
2. from sklearn.metrics import confusion_matrix
3. cm= confusion_matrix(y_test, y_pred)
Output:
As we can see in the above output image, there are 66 + 24 = 90 correct predictions and 8 + 2 = 10 incorrect predictions. Therefore we can say that our SVM model improved compared to the Logistic Regression model.
o Visualizing the training set result: Below is the code for it:
6. alpha = 0.75, cmap = ListedColormap(('red', 'green')))
7. mtp.xlim(x1.min(), x1.max())
8. mtp.ylim(x2.min(), x2.max())
9. for i, j in enumerate(nm.unique(y_set)):
10. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
11. c = ListedColormap(('red', 'green'))(i), label = j)
12. mtp.title('SVM classifier (Training set)')
13. mtp.xlabel('Age')
14. mtp.ylabel('Estimated Salary')
15. mtp.legend()
16. mtp.show()
As we can see, the above output appears similar to the Logistic Regression output. In the output, we got a straight line as the hyperplane because we have used a linear kernel in the classifier. We have also discussed above that for 2-d space, the hyperplane in SVM is a straight line.
o Visualizing the test set result:
1. #Visualizing the test set result
2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_test, y_test
4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
7. alpha = 0.75, cmap = ListedColormap(('red','green' )))
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('red', 'green'))(i), label = j)
13. mtp.title('SVM classifier (Test set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()
Output:
As we can see in the above output image, the SVM classifier has divided the users into two regions (Purchased or Not purchased). Users who purchased the SUV are in the red region with the red scatter points, and users who did not purchase the SUV are in the green region with the green scatter points. The hyperplane has divided the two classes into the Purchased and Not purchased categories.
Naïve Bayes Classifier Algorithm
o The Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems.
o The Naïve Bayes Classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions.
o Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
The Naïve Bayes algorithm is composed of two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, and it is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.
o The formula for Bayes' theorem is given as:
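In its standard form, the theorem reads:
P(A|B) = P(B|A) * P(A) / P(B)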
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the Likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Working of the Naïve Bayes Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table of the weather conditions:
Weather No Yes
Overcast 0 5 5/14 = 0.35
Rainy 2 2 4/14 = 0.29
Sunny 2 3 5/14 = 0.35
All 4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny)= 0.35
P(Yes)=0.71
P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No)= 2/4 = 0.5
P(No)= 0.29
P(Sunny)= 0.35
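Completing the arithmetic with the values given above (and P(Sunny|Yes) = 3/10 = 0.3 from the likelihood table):
P(Yes|Sunny) = 0.3 * 0.71 / 0.35 ≈ 0.60
P(No|Sunny) = 0.5 * 0.29 / 0.35 ≈ 0.41
Since P(Yes|Sunny) > P(No|Sunny), on a sunny day the player can play the game.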
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict the class of a dataset.
o It can be used for binary as well as multi-class classification.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationships between features.
Applications of Naïve Bayes Classifier:
o It is used for credit scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because the Naïve Bayes classifier is an eager learner.
o It is used in text classification, such as spam filtering and sentiment analysis.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means
if predictors take continuous values instead of discrete, then the model assumes that these
values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., deciding which category a particular document belongs to, such as Sports, Politics, Education, etc. The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also popular for document classification tasks.
Now we will implement a Naive Bayes Algorithm using Python. So for this, we will use the
"user_data" dataset, which we have used in our other classification model. Therefore we can
easily compare the Naive Bayes model with the other models.
Steps to implement:
o Data Pre-processing step
o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result (creation of the Confusion matrix)
o Visualizing the test set result
11. # Splitting the dataset into the Training set and Test set
12. from sklearn.model_selection import train_test_split
13. x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
14.
15. # Feature Scaling
16. from sklearn.preprocessing import StandardScaler
17. sc = StandardScaler()
18. x_train = sc.fit_transform(x_train)
19. x_test = sc.transform(x_test)
In the above code, we have loaded the dataset into our program using pd.read_csv('user_data.csv'). The loaded dataset is divided into a training set and a test set, and then we have scaled the feature variables.
2) Fitting Naive Bayes to the Training Set:
After the pre-processing step, now we will fit the Naive Bayes model to the Training set. Below
is the code for it:
1. # Fitting Naive Bayes to the Training set
2. from sklearn.naive_bayes import GaussianNB
3. classifier = GaussianNB()
4. classifier.fit(x_train, y_train)
In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We
can also use other classifiers as per our requirement.
Output:
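The listing for the prediction step is not reproduced here; assuming the same variable names, a minimal sketch would be:
# Predicting the Test set results (sketch)
y_pred = classifier.predict(x_test)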
The above output shows the result for the prediction vector y_pred and the real vector y_test. We can see that some predictions are different from the real values; these are the incorrect predictions.
1. # Making the Confusion Matrix
2. from sklearn.metrics import confusion_matrix
3. cm = confusion_matrix(y_test, y_pred)
Output:
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions,
and 65+25=90 correct predictions.
8. mtp.xlim(X1.min(), X1.max())
9. mtp.ylim(X2.min(), X2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('purple', 'green'))(i), label = j)
13. mtp.title('Naive Bayes (Training set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()
Output:
In the above output we can see that the Naïve Bayes classifier has segregated the data points with a fine boundary. The boundary is curved rather than a straight line because we have used the GaussianNB classifier in our code.
1. # Visualising the Test set results
2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_test, y_test
4. X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
7. alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
8. mtp.xlim(X1.min(), X1.max())
9. mtp.ylim(X2.min(), X2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('purple', 'green'))(i), label = j)
13. mtp.title('Naive Bayes (test set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()
Output:
The above output is the final output for the test set data. As we can see, the classifier has created a curved boundary to divide the "purchased" and "not purchased" variables. There are some wrong predictions, which we have already counted in the confusion matrix, but it is still a pretty good classifier.
Regression Analysis in Machine Learning
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, which runs various advertisements every year and gets sales accordingly. The below list shows the advertisements made by the company in the last 5 years and the corresponding sales:
Now, the company wants to spend $200 on advertisement in the year 2019 and wants to know the prediction of the sales for this year. To solve such prediction problems in machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict the continuous output variable based on the one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
determining the causal-effect relationship between variables.
In Regression, we plot a graph between the variables which best fits the given datapoints, using
this plot, the machine learning model can make predictions about the data. In simple words,
"Regression shows a line or curve that passes through all the datapoints on target-predictor
graph in such a way that the vertical distance between the datapoints and the regression line
is minimum." The distance between datapoints and line tells whether a model has captured a
strong relationship or not.
Some examples of regression are:
o Prediction of rain using temperature and other factors
o Determining market trends
o Prediction of road accidents due to rash driving
o Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variable, or which are used to predict the values of the dependent variable, are called independent variables, also known as predictors.
o Outliers: Outlier is an observation which contains either very low value or very high value in
comparison to other observed values. An outlier may hamper the result, so it should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each other, then such a condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, then such a problem is called overfitting. And if our algorithm does not perform well even with the training dataset, then such a problem is called underfitting.
o Regression estimates the relationship between the target and the independent variables.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important factor, the
least important factor, and how each factor is affecting the other factors.
Types of Regression
There are various types of regressions which are used in data science and machine learning.
Each type has its own importance on different scenarios, but at the core, all the regression
methods analyze the effect of the independent variable on dependent variables. Here we are
discussing some important types of regression which are given below:
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables.
o Linear regression shows the linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is called
multiple linear regression.
The relationship between variables in the linear regression model can be explained using the
below image. Here we are predicting the salary of an employee on the basis of the year of
experience.
1. Y= aX+b
Here, Y = dependent variables (target variables), X= Independent variables (predictor
variables), a and b are the linear coefficients
Some popular applications of linear regression are:
o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a binary
or discrete format such as 0 or 1.
Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No,
True or False, Spam or not spam, etc.
o Logistic regression is a type of regression, but it is different from the linear regression algorithm in terms of how it is used.
o Logistic regression uses sigmoid function or logistic function which is a complex cost
function. This sigmoid function is used to model the data in logistic regression. The function
can be represented as:
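The standard logistic (sigmoid) function is:
f(x) = 1 / (1 + e^(-x))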
When we provide the input values (data) to the function, it gives the S-curve as follows:
o It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using a
linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of x
and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear
fashion, so for such case, linear regression will not best fit to those datapoints. To cover such
datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial features
of given degree and then modeled using a linear model. Which means the datapoints are
best fitted using a polynomial line.
o The equation for polynomial regression is also derived from the linear regression equation, which means the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output and b0, b1, ..., bn are the regression coefficients. x is our independent/input variable.
o The model is still linear, as the coefficients are still linear even though the features (x, x², x³, ...) are of higher degree.
Note: This is different from Multiple Linear regression in such a way that in Polynomial
regression, a single element has different degrees instead of multiple variables with the
same degree.
Support Vector Regression:
Support Vector Machine is a supervised learning algorithm which can be used for regression
as well as classification problems. So if we use it for regression problems, then it is termed as
Support Vector Regression.
Support Vector Regression is a regression algorithm which works for continuous variables.
Below are some keywords which are used in Support Vector Regression:
o Kernel: It is a function used to map a lower-dimensional data into higher dimensional data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a
line which helps to predict the continuous variables and cover most of the datapoints.
o Boundary line: Boundary lines are the two lines apart from the hyperplane which create a margin for the datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane and to the opposite class.
In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of datapoints are covered in that margin. The main goal of SVR is to consider the
maximum datapoints within the boundary lines and the hyperplane (best-fit line) must
contain a maximum number of datapoints. Consider the below image:
Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
Decision Tree Regression:
o Decision Tree is a supervised learning algorithm which can be used for solving both classification and regression problems.
o It can solve problems for both categorical and numerical data.
o Decision Tree regression builds a tree-like structure in which each internal node represents a "test" on an attribute, each branch represents the result of the test, and each leaf node represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node (dataset), which splits
into left and right child nodes (subsets of dataset). These child nodes are further divided into
their children node, and themselves become the parent node of those nodes. Consider the below
image:
The above image shows an example of Decision Tree regression; here, the model is trying to predict the choice of a person between a sports car and a luxury car.
Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms, capable of performing regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which combines multiple decision trees and predicts the final output based on the average of each tree's output. The combined decision trees are called base models, and the ensemble can be represented more formally as:
g(x)= f0(x)+ f1(x)+ f2(x)+....
o With the help of Random Forest regression, we can prevent Overfitting in the model by creating
random subsets of the dataset.
Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression in which a small amount
of bias is introduced so that we can get better long term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. We can compute this penalty term by multiplying lambda by the squared weight of each individual feature.
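In its usual form (a sketch, using b1, ..., bn for the coefficients and λ for the regularization strength), the ridge objective adds this penalty to the squared-error cost:
Cost = Σ (yi − ŷi)² + λ Σ bj²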
A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique which is used to reduce the complexity of the model. It is also called L2 regularization.
o It helps to solve problems where we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model.
o It is similar to the Ridge Regression except that penalty term contains only the absolute weights
instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge Regression
can only shrink it near to 0.
o It is also called L1 regularization. The equation for Lasso regression will be:
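In its usual form (a sketch, with the same notation as above), the Lasso objective penalizes the absolute values of the coefficients:
Cost = Σ (yi − ŷi)² + λ Σ |bj|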
The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
Mathematically, we can represent a linear regression as:
y= a0+a1x+ ε
Here, y is the dependent (target) variable, x is the independent (predictor) variable, a0 is the intercept of the line, a1 is the linear regression coefficient (slope), and ε is the random error term.
The values for the x and y variables are the training datasets used for the Linear Regression model representation.
Types of Linear Regression
Linear regression can be further divided into two types of algorithm: Simple Linear Regression, which uses a single independent variable, and Multiple Linear Regression, which uses more than one independent variable.
A regression line is a line showing the relationship between the dependent and independent variables. A regression line can show two types of relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is termed a positive linear relationship.
If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.
When working with linear regression, our main goal is to find the best fit line, which means that the error between the predicted values and the actual values should be minimized. The best fit line will have the least error.
The different values for weights or the coefficient of lines (a0, a1) gives a different line of
regression, so we need to calculate the best values for a0 and a1 to find the best fit line, so to
calculate this we use cost function.
Cost function:
o The different values for the weights or coefficients of the line (a0, a1) give different lines of regression, and the cost function is used to estimate the values of the coefficients for the best fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the
input variable to the output variable. This mapping function is also known as Hypothesis
function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the
average of squared error occurred between the predicted values and actual values. It can be
written as:
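Using the notation of this tutorial (N observations, actual values yi, predicted values a0 + a1xi), the MSE is:
MSE = (1/N) Σ (yi − (a0 + a1xi))²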
Where N is the total number of observations, yi is the actual value, and (a0 + a1xi) is the value predicted by the line.
Residuals: The distance between the actual value and the predicted value is called the residual. If the observed points are far from the regression line, the residuals will be high and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small and hence so will the cost function.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.
o It is done by randomly selecting initial values of the coefficients and then iteratively updating these values to reach the minimum of the cost function.
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The
process of finding the best model out of various models is called optimization. It can be
achieved by below method:
1. R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent variables
on a scale of 0-100%.
o A high value of R-square indicates less difference between the predicted values and actual values and hence represents a good model.
Below are some important assumptions of Linear Regression. These are some formal checks
while building a Linear Regression model, which ensures to get the best possible result from
the given dataset.
o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of independent
variables. With homoscedasticity, there should be no clear pattern distribution of data in the
scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal distribution pattern. If
error terms are not normally distributed, then confidence intervals will become either too wide
or too narrow, which may cause difficulties in finding coefficients.
It can be checked using the q-q plot. If the plot shows a straight line without any deviation, it means the error is normally distributed.
o No autocorrelations:
The linear regression model assumes no autocorrelation in error terms. If there will be any
correlation in the error term, then it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs if there is a dependency between residual errors.
The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on continuous or
categorical values.
o Model the relationship between the two variables. Such as the relationship between Income
and expenditure, experience and Salary, etc.
Simple Linear Regression Model:
The Simple Linear Regression model can be represented using the below equation:
y= a0+a1x+ ε
Where,
a0 = It is the intercept of the Regression line (can be obtained by putting x = 0).
a1 = It is the slope of the regression line, which tells whether the line is increasing or decreasing.
ε = The error term. (For a good model it will be negligible.)
Implementation of Simple Linear Regression Algorithm using Python
Here we are taking a dataset that has two variables: salary (dependent variable) and experience (independent variable). The goals of this problem are:
o To find out if there is any correlation between these two variables.
o To find the best fit line for the dataset.
o To see how the dependent variable changes when the independent variable changes.
In this section, we will create a Simple Linear Regression model to find out the best fitting line
for representing the relationship between these two variables.
To implement the Simple Linear regression model in machine learning using Python, we need
to follow the below steps:
The first step for creating the Simple Linear Regression model is data pre-processing. We have
already done it earlier in this tutorial. But there will be some changes, which are given in the
below steps:
o First, we will import the three important libraries, which will help us for loading the dataset,
plotting the graphs, and creating the Simple Linear Regression model.
1. import numpy as nm
2. import matplotlib.pyplot as mtp
3. import pandas as pd
o Next, we will load the dataset into our code:
1. data_set= pd.read_csv('Salary_Data.csv')
By executing the above line of code (ctrl+ENTER), we can read the dataset on our Spyder IDE
screen by clicking on the variable explorer option.
The above output shows the dataset, which has two variables: Salary and Experience.
Note: In Spyder IDE, the folder containing the code file must be saved as a working
directory, and the dataset or csv file should be in the same folder.
o After that, we need to extract the dependent and independent variables from the given dataset.
The independent variable is years of experience, and the dependent variable is salary. Below
is code for it:
1. x= data_set.iloc[:, :-1].values
2. y= data_set.iloc[:, 1].values
In the above lines of code, for the x variable we have taken -1, since we want to remove the last column from the dataset. For the y variable we have taken 1 as the parameter, since we want to extract the second column and indexing starts from zero.
By executing the above line of code, we will get the output for X and Y variable as:
In the above output image, we can see the X (independent) variable and Y (dependent) variable
has been extracted from the given dataset.
o Next, we will split both variables into the test set and training set. We have 30 observations, so
we will take 20 observations for the training set and 10 observations for the test set. We are
splitting our dataset so that we can train our model using a training dataset and then test the
model using a test dataset. The code for this is given below:
1. # Splitting the dataset into training and test set.
2. from sklearn.model_selection import train_test_split
3. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state=0)
By executing the above code, we will get x-test, x-train and y-test, y-train dataset. Consider the
below images:
Test-dataset:
Training Dataset:
o For simple linear Regression, we will not use Feature Scaling. Because Python libraries take
care of it for some cases, so we don't need to perform it here. Now, our dataset is well prepared
to work on it and we are going to start building a Simple Linear Regression model for the given
problem.
Now the second step is to fit our model to the training dataset. To do so, we will import the LinearRegression class of the linear_model library from scikit-learn. After importing the class, we are going to create an object of the class named regressor. The code for this is given below:
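A minimal sketch of that listing, assuming the x_train and y_train variables prepared above:
#Fitting the Simple Linear Regression model to the training dataset (sketch)
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)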
In the above code, we have used the fit() method to fit our Simple Linear Regression object to the training set. In the fit() function, we have passed x_train and y_train, which are our training data for the independent and dependent variables. We have fitted our regressor object to the training set so that the model can easily learn the correlations between the predictor and target variables. After executing the above lines of code, we will get the below output.
Output:
The regressor is now fitted on the dependent (Salary) and independent (Experience) variables, so our model is ready to predict the output for new observations. In this step, we will provide the test dataset (new observations) to the model to check whether it can predict the correct output or not.
We will create two prediction vectors, y_pred and x_pred, which will contain the predictions for the test dataset and the training set respectively.
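A minimal sketch of that step, under the names described above:
#Prediction of Test and Training set result (sketch)
y_pred= regressor.predict(x_test)
x_pred= regressor.predict(x_train)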
You can check the variable by clicking on the variable explorer option in the IDE, and also
compare the result by comparing values from y_pred and y_test. By comparing these values,
we can check how good our model is performing.
Now in this step, we will visualize the training set result. To do so, we will use the scatter()
function of the pyplot library, which we have already imported in the pre-processing step. The
scatter () function will create a scatter plot of observations.
On the x-axis, we will plot the years of experience of the employees, and on the y-axis, the salary of the employees. In the scatter() function, we will pass the real values of the training set, i.e., the years of experience x_train, the training set of salaries y_train, and the color of the observations. Here we are taking a green color for the observations, but it can be any color of your choice.
Now we need to plot the regression line, so for this we will use the plot() function of the pyplot library. In this function, we will pass the years of experience for the training set, the predicted salary for the training set x_pred, and the color of the line.
Next, we will give the title for the plot. Here we will use the title() function of the pyplot library and pass the name "Salary vs Experience (Training Dataset)".
After that, we will assign labels for the x-axis and y-axis using the xlabel() and ylabel() functions.
Finally, we will display all of the above in a graph using show(). The code is given below:
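A minimal sketch of that listing, matching the colors and labels described above and in the test-set code further below:
mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()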
By executing the above lines of code, we will get the below graph plot as an output.
In the above plot, we can see the real observations as green dots, and the predicted values are covered by the red regression line. The regression line shows the correlation between the dependent and independent variable.
The goodness of fit of the line can be judged by calculating the difference between the actual and predicted values. As we can see in the above plot, most of the observations are close to the regression line, hence our model is good for the training set.
In the previous step, we have visualized the performance of our model on the training set. Now,
we will do the same for the Test set. The complete code will remain the same as the above code,
except in this, we will use x_test, and y_test instead of x_train and y_train.
Here we are also changing the color of observations and regression line to differentiate between
the two plots, but it is optional.
1. #visualizing the Test set results
2. mtp.scatter(x_test, y_test, color="blue")
3. mtp.plot(x_train, x_pred, color="red")
4. mtp.title("Salary vs Experience (Test Dataset)")
5. mtp.xlabel("Years of Experience")
6. mtp.ylabel("Salary(In Rupees)")
7. mtp.show()
Output:
By executing the above lines of code, we will get the output as:
In the above plot, the observations are shown in blue, and the prediction is given by the red regression line. As we can see, most of the observations are close to the regression line, hence we can say that our Simple Linear Regression model is a good model and is able to make good predictions.
Multiple Linear Regression
In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there may
be various cases in which the response variable is affected by more than one predictor variable;
for such cases, the Multiple Linear Regression algorithm is used.
Multiple Linear Regression is one of the important regression algorithms which models the
linear relationship between a single dependent continuous variable and more than one
independent variable.
Example:
Prediction of CO2 emission based on engine size and number of cylinders in a car.
o For MLR, the dependent or target variable (Y) must be continuous/real-valued, but the predictor or independent variables may be of continuous or categorical form.
o Each feature variable must model the linear relationship with the dependent variable.
o MLR tries to fit a regression line through a multidimensional space of data-points.
MLR equation:
In Multiple Linear Regression, the target variable(Y) is a linear combination of multiple
predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple Linear Regression, the same idea is applied to the multiple linear regression equation, which becomes:
1. Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn ............ (a)
Where,
Y = Output/Response (dependent) variable
b0, b1, b2, ..., bn = Coefficients of the model
x1, x2, x3, ... = Various independent/feature variables
Assumptions for Multiple Linear Regression:
o A linear relationship should exist between the target and the predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.
Problem Description:
We have a dataset of 50 start-up companies. This dataset contains five main pieces of information: R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal is to create a model that can easily determine which company has the maximum profit, and which factor most affects the profit of a company.
Since we need to find the Profit, so it is the dependent variable, and the other four variables are
independent variables. Below are the main steps of deploying the MLR model:
The very first step is data pre-processing, which we have already discussed in this tutorial. This
process contains the below steps:
o Importing libraries: Firstly we will import the library which will help in building the model.
Below is the code for it:
1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
o Importing dataset: Now we will import the dataset (50_CompList), which contains all the variables. Below is the code for it:
1. #importing datasets
2. data_set= pd.read_csv('50_CompList.csv')
Output: We will get the dataset as:
In the above output, we can clearly see that there are five variables, in which four variables are continuous and one is a categorical variable.
o Extracting dependent and independent variables:
1. #Extracting Independent and dependent Variable
2. x= data_set.iloc[:, :-1].values
3. y= data_set.iloc[:, 4].values
Output:
Out[5]:
array([[165349.2, 136897.8, 471784.1, 'New York'],
[162597.7, 151377.59, 443898.53, 'California'],
[153441.51, 101145.55, 407934.54, 'Florida'],
[144372.41, 118671.85, 383199.62, 'New York'],
[142107.34, 91391.77, 366168.42, 'Florida'],
[131876.9, 99814.71, 362861.36, 'New York'],
[134615.46, 147198.87, 127716.82, 'California'],
[130298.13, 145530.06, 323876.68, 'Florida'],
[120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'],
[101913.08, 110594.11, 229160.95, 'Florida'],
[100671.96, 91790.61, 249744.55, 'California'],
[93863.75, 127320.38, 249839.44, 'Florida'],
[91992.39, 135495.07, 252664.93, 'California'],
[119943.24, 156547.42, 256512.92, 'Florida'],
[114523.61, 122616.84, 261776.23, 'New York'],
[78013.11, 121597.55, 264346.06, 'California'],
[94657.16, 145077.58, 282574.31, 'New York'],
[91749.16, 114175.79, 294919.57, 'Florida'],
[86419.7, 153514.11, 0.0, 'New York'],
[76253.86, 113867.3, 298664.47, 'California'],
[78389.47, 153773.43, 299737.29, 'New York'],
[73994.56, 122782.75, 303319.26, 'Florida'],
[67532.53, 105751.03, 304768.73, 'Florida'],
[77044.01, 99281.34, 140574.81, 'New York'],
[64664.71, 139553.16, 137962.62, 'California'],
[75328.87, 144135.98, 134050.07, 'Florida'],
[72107.6, 127864.55, 353183.81, 'New York'],
[66051.52, 182645.56, 118148.2, 'Florida'],
[65605.48, 153032.06, 107138.38, 'New York'],
[61994.48, 115641.28, 91131.24, 'Florida'],
[61136.38, 152701.92, 88218.23, 'New York'],
[63408.86, 129219.61, 46085.25, 'California'],
[55493.95, 103057.49, 214634.81, 'Florida'],
[46426.07, 157693.92, 210797.67, 'California'],
[46014.02, 85047.44, 205517.64, 'New York'],
[28663.76, 127056.21, 201126.82, 'Florida'],
[44069.95, 51283.14, 197029.42, 'California'],
[20229.59, 65947.93, 185265.1, 'New York'],
[38558.51, 82982.09, 174999.3, 'California'],
[28754.33, 118546.05, 172795.67, 'California'],
[27892.92, 84710.77, 164470.71, 'Florida'],
[23640.93, 96189.63, 148001.11, 'California'],
[15505.73, 127382.3, 35534.17, 'New York'],
[22177.74, 154806.14, 28334.72, 'California'],
[1000.23, 124153.04, 1903.93, 'New York'],
[1315.46, 115816.21, 297114.46, 'Florida'],
[0.0, 135426.92, 0.0, 'California'],
[542.05, 51743.15, 0.0, 'New York'],
[0.0, 116983.8, 45173.06, 'California']], dtype=object)
As we can see in the above output, the last column contains categorical variables which are not
suitable to apply directly for fitting the model. So we need to encode this variable.
As we have one categorical variable (State), which cannot be directly applied to the model, so
we will encode it. To encode the categorical variable into numbers, we will use the
LabelEncoder class. But it is not sufficient because it still has some relational order, which
may create a wrong model. So in order to remove this problem, we will use OneHotEncoder,
which will create the dummy variables. Below is the code for it:
1. #Categorical data
2. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
3. labelencoder_x= LabelEncoder()
4. x[:, 3]= labelencoder_x.fit_transform(x[:,3])
5. onehotencoder= OneHotEncoder(categorical_features= [3])
6. x= onehotencoder.fit_transform(x).toarray()
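Note that the categorical_features parameter used above was deprecated and later removed in newer scikit-learn releases (0.22 and later). On such versions, an equivalent sketch uses ColumnTransformer to select the State column (the column index 3 is taken from the code above):
# Equivalent encoding on newer scikit-learn versions (sketch)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct= ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough', sparse_threshold=0)
x= ct.fit_transform(x)   # dummy columns for State come first, followed by the remaining columns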
Here we are only encoding one independent variable, State, as the other variables are continuous.
Output:
As we can see in the above output, the State column has been converted into dummy variables (0 and 1). Here each dummy variable column corresponds to one State. We can check this by comparing it with the original dataset: the first column corresponds to California, the second column to Florida, and the third column to New York.
Note: We should not use all the dummy variables at the same time; we must use one less than the total number of dummy variables, otherwise we will create a dummy variable trap.
o Now, we are writing a single line of code just to avoid the dummy variable trap:
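That line, a single slice that drops the first dummy column, would typically be:
x= x[:, 1:]   # avoiding the dummy variable trap by removing the first dummy column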
If we do not remove the first dummy variable, then it may introduce multicollinearity in the
model.
As we can see in the above output image, the first column has been removed.
o Now we will split the dataset into training and test set. The code for this is given below:
1. # Splitting the dataset into training and test set
2. from sklearn.model_selection import train_test_split
3. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Output: The above code will split the dataset into a training set and a test set. You can check the output by clicking on the variable explorer option given in the Spyder IDE. The test set and training set will look like the below images:
Test set:
Training set:
Note: In MLR, we will not do feature scaling as it is taken care of by the library, so we don't need to do it manually.
Step: 2- Fitting our MLR model to the Training set:
Now that we have well prepared our dataset, we will fit our regression model to the training set. It will be similar to what we did in the Simple Linear Regression model. The code for this will be:
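A minimal sketch of this step, mirroring the Simple Linear Regression code, is:
#Fitting the MLR model to the Training set
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)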
The last step for our model is checking the performance of the model. We will do it by
predicting the test set result. For prediction, we will create a y_pred vector. Below is the code
for it:
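A one-line sketch of this prediction step:
#Predicting the Test set result
y_pred= regressor.predict(x_test)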
Output:
In the above output, we have the predicted result set and the test set. We can check model performance by comparing these two sets of values index by index. For example, the first index has a predicted value of $103,015 profit and a test/real value of $103,282 profit. The difference is only $267, which is a good prediction, so, finally, our model is completed here.
o We can also check the scores for the training dataset and test dataset. Below is the code for it:
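A sketch of this check, using the score() method of the fitted regressor (which returns the R² value):
print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))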
Clustering in Machine Learning
Clustering is an unsupervised learning technique that groups the unlabelled dataset. It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, behavior, etc., and divides the data points as per the presence and absence of those similar patterns.
After applying this clustering technique, each cluster or group is provided with a cluster-ID. An ML system can use this ID to simplify the processing of large and complex datasets.
Clustering is somewhat similar to classification, but the difference is the type of dataset that we are using: in classification, we work with a labeled dataset, whereas in clustering, we work with an unlabelled dataset.
Example: Let's understand the clustering technique with the real-world example of a mall. When we visit any shopping mall, we can observe that things with similar usage are grouped together: t-shirts are grouped in one section and trousers in another, and similarly, in the fruit section, apples, bananas, mangoes, etc., are grouped separately so that we can easily find things. The clustering technique works in the same way. Another example of clustering is grouping documents according to their topic.
The clustering technique is widely used in various tasks, such as market segmentation, statistical data analysis, social network analysis, image segmentation, and anomaly detection.
Apart from these general usages, it is used by Amazon in its recommendation system to provide recommendations based on the past search of products. Netflix also uses this technique to recommend movies and web series to its users based on their watch history.
The below diagram explains the working of the clustering algorithm. We can see the different
fruits are divided into several groups with similar properties.
Types of Clustering Methods
The clustering methods are broadly divided into Hard clustering (a data point belongs to only one group) and Soft clustering (data points can belong to more than one group). But various other clustering approaches also exist. Below are the main clustering methods used in Machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-
Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k is used to define the number of pre-defined groups. The cluster centers are created in such a way that the distance between the data points of one cluster and its centroid is minimal compared with the other cluster centroids.
Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the
arbitrarily shaped distributions are formed as long as the dense region can be connected. This
algorithm does it by identifying different clusters in the dataset and connects the areas of high
densities into clusters. The dense areas in data space are divided from each other by sparser
areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution, most commonly the Gaussian distribution. The example of this type is the Expectation-Maximization Clustering algorithm, which uses Gaussian Mixture Models (GMM).
Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there is no
requirement of pre-specifying the number of clusters to be created. In this technique, the dataset
is divided into clusters to create a tree-like structure, which is also called a dendrogram. The
observations or any number of clusters can be selected by cutting the tree at the correct level.
The most common example of this method is the Agglomerative Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients, which depend on its degree of membership in each cluster. The Fuzzy C-means algorithm is the example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
The clustering algorithms can be divided based on the models explained above. There are many published clustering algorithms, but only a few are commonly used. The choice of clustering algorithm depends on the kind of data that we are using: some algorithms need us to guess the number of clusters in the given dataset, whereas others require finding the minimum distance between the observations of the dataset.
Here we are discussing mainly popular Clustering algorithms that are widely used in machine
learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms.
It classifies the dataset by dividing the samples into different clusters of equal variances. The
number of clusters must be specified in this algorithm. It is fast with fewer computations
required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density
of data points. It is an example of a centroid-based model, that works on updating the candidates
for centroid to be the center of the points within a given region.
6. Affinity Propagation: It is different from other clustering algorithms as it does not require specifying the number of clusters. In this algorithm, each data point sends messages between pairs of data points until convergence. It has O(N²T) time complexity, which is the main drawback of this algorithm.
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data sets into
different groups.
o In Search Engines: Search engines also work on the clustering technique. The search result
appears based on the closest object to the search query. It does it by grouping similar data
objects in one group that is far from the other dissimilar objects. The accurate result of a query
depends on the quality of the clustering algorithm used.
o In Biology: It is used in the biology stream to classify different species of plants and animals
using the image recognition technique.
o In Land Use: The clustering technique is used in identifying areas of similar land use in a GIS database. This can be very useful for finding the purpose for which a particular piece of land is most suitable.
Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group the unlabeled datasets into clusters; it is also known as hierarchical cluster analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped
structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work, as there is no requirement to predetermine the number of clusters as we did in the K-means algorithm.
1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts by taking all data points as single clusters and merging them until one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.
The working of the AHC algorithm can be explained using the below steps:
o Step-1: Create each data point as a single cluster. Let's say there are N data points, so the
number of clusters will also be N.
o Step-2: Take two closest data points or clusters and merge them to form one cluster. So, there
will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together to form one cluster. There
will be N-2 clusters.
o Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters.
Consider the below images:
o Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to
divide the clusters as per the problem.
Note: To better understand hierarchical clustering, it is advised to have a look at k-means clustering.
Measure for the distance between two clusters
As we have seen, the closest distance between the two clusters is crucial for the hierarchical
clustering. There are various ways to calculate the distance between two clusters, and these
ways decide the rule for clustering. These measures are called Linkage methods. Some of the
popular linkage methods are given below:
1. Single Linkage: It is the Shortest Distance between the closest points of the clusters. Consider
the below image:
2. Complete Linkage: It is the farthest distance between the two points of two different clusters.
It is one of the popular linkage methods as it forms tighter clusters than single-linkage.
3. Average Linkage: It is the linkage method in which the distance between each pair of data points (one from each cluster) is added up and then divided by the total number of pairs to calculate the average distance between two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroid of the
clusters is calculated. Consider the below image:
From the above-given approaches, we can apply any of them according to the type of problem
or business requirement.
The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part is showing how clusters are created in agglomerative
clustering, and the right part is showing the corresponding dendrogram.
o As we have discussed above, firstly, the data points P2 and P3 combine together and form a cluster; correspondingly, a dendrogram is created, which connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6 in another dendrogram.
o At last, the final dendrogram is created that combines all the data points together.
We can cut the dendrogram tree structure at any level as per our requirement.
Now we will see the practical implementation of the agglomerative hierarchical clustering
algorithm using Python. To implement this, we will use the same dataset problem that we have
used in the previous topic of K-means clustering so that we can compare both concepts easily.
The dataset contains information about customers who have visited a mall for shopping. The mall owner wants to find some patterns or particular behaviors of his customers using this dataset information.
1. Data Pre-processing
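A sketch of the library imports for this step, using the same aliases (nm, mtp, pd) as in the earlier topics:
# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd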
The above lines of code are used to import the libraries to perform specific tasks, such as numpy for mathematical operations, matplotlib for drawing the graphs or scatter plots, and pandas for importing the dataset.
o Importing the dataset
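A sketch of the dataset import; the file name Mall_Customers_data.csv is an assumption, reused from the K-means example:
# Importing the dataset (file name assumed)
data_set= pd.read_csv('Mall_Customers_data.csv')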
Here we will extract only the matrix of features, as we don't have any further information about the dependent variable. The code is given below:
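A sketch of this step; the column indices [3, 4] (Annual Income and Spending Score) are assumed from the K-means example:
x= data_set.iloc[:, [3, 4]].values   # matrix of features: Annual Income and Spending Score
The dendrogram itself is typically built with scipy's hierarchy module, for example:
#Finding the optimal number of clusters using the dendrogram
import scipy.cluster.hierarchy as shc
dendro= shc.dendrogram(shc.linkage(x, method="ward"))
mtp.title("Dendrogram Plot")
mtp.ylabel("Euclidean Distances")
mtp.xlabel("Customers")
mtp.show()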
The remaining lines of code are to describe the labels for the dendrogram plot.
Output:
By executing the above lines of code, we will get the below output:
Using this Dendrogram, we will now determine the optimal number of clusters for our model.
For this, we will find the maximum vertical distance that does not cut any horizontal bar.
Consider the below diagram:
In the above diagram, we have shown the vertical distances that are not cutting their horizontal bars. As we can visualize, the 4th distance appears to be the maximum, so according to this, the number of clusters will be 5 (the vertical lines in this range). We could also take the 2nd distance, as it is approximately equal to the 4th, but we will consider 5 clusters because that is the same number we calculated in the K-means algorithm.
So, the optimal number of clusters will be 5, and we will train the model in the next step,
using the same.
1. #training the hierarchical clustering model on the dataset
2. from sklearn.cluster import AgglomerativeClustering
3. hc= AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
4. y_pred= hc.fit_predict(x)
In the above code, we have imported the AgglomerativeClustering class of cluster module of
scikit learn library.
Then we have created the object of this class named as hc. The AgglomerativeClustering class
takes the following parameters:
o n_clusters=5: It defines the number of clusters, and we have taken 5 here because it is the optimal number of clusters.
o affinity='euclidean': It is the metric used to compute the linkage.
o linkage='ward': It defines the linkage criterion; here we have used the "ward" linkage. This method is the popular linkage method that we have already used for creating the dendrogram. It reduces the variance in each cluster.
In the last line, we have created the dependent variable y_pred to fit or train the model. This call not only trains the model but also returns the cluster to which each data point belongs.
After executing the above lines of code, if we go through the variable explorer option in our Spyder IDE, we can check the y_pred variable. We can compare the original dataset with the y_pred variable. Consider the below image:
As we can see in the above image, y_pred shows the cluster values, which means that customer ID 1 belongs to the 5th cluster (as indexing starts from 0, the value 4 means the 5th cluster), customer ID 2 belongs to the 4th cluster, and so on.
Here we will use the same lines of code as we did in k-means clustering, except for one change. Here we will not plot the centroids as we did in k-means, because we have used the dendrogram to determine the optimal number of clusters. The code is given below:
1. #visualizing the clusters
2. mtp.scatter(x[y_pred == 0, 0], x[y_pred == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')
3. mtp.scatter(x[y_pred == 1, 0], x[y_pred == 1, 1], s = 100, c = 'green', label = 'Cluster 2')
4. mtp.scatter(x[y_pred == 2, 0], x[y_pred == 2, 1], s = 100, c = 'red', label = 'Cluster 3')
5. mtp.scatter(x[y_pred == 3, 0], x[y_pred == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
6. mtp.scatter(x[y_pred == 4, 0], x[y_pred == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
7. mtp.title('Clusters of customers')
8. mtp.xlabel('Annual Income (k$)')
9. mtp.ylabel('Spending Score (1-100)')
10. mtp.legend()
11. mtp.show()
Output: By executing the above lines of code, we will get the below output:
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters that need to
be created in the process, as if K=2, there will be two clusters, and for K=3, there will be three
clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group and shares similar properties with the other points in that group.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim
of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters,
and repeats the process until it does not find the best clusters. The value of k should be
predetermined in this algorithm.
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points which are near a particular k-center create a cluster.
Hence each cluster has data points with some commonalities and is away from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go back to Step-4; otherwise, the model is ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:
o Let's take the number k of clusters, i.e., K=2, to identify the dataset and to put the points into different clusters. It means here we will try to group these data points into two different clusters.
o We need to choose some random k points or centroid to form the cluster. These points can be
either the points from the dataset or any other point. So, here we are selecting the below two
points as k points, which are not the part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute it by applying some mathematics that we have studied to calculate the distance
between two points. So, we will draw a median between both
the centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the data points in each cluster, and will find new centroids as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same
process of finding a median line. The median will be like below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:
o As we have got the new centroids, we will again draw the median line and reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final clusters
will be as shown in the below image:
How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends upon highly efficient clusters
that it forms. But choosing the optimal number of clusters is a big task. There are some different
ways to find the optimal number of clusters, but here we are discussing the most appropriate
method to find the number of clusters or value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the value
of WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
In the above formula of WCSS,
Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point Pi in Cluster1 and its centroid C1, and the same applies to the other two terms.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes K-means clustering on a given dataset for different K values (ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the
elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters to be equal to the number of data points. In that case, the value of WCSS becomes zero, and that will be the endpoint of the plot.
Python Implementation of K-means Clustering Algorithm
In the above section, we have discussed the K-means algorithm, now let's see how it can be
implemented using Python.
Before implementation, let's understand what type of problem we will solve here. So, we have
a dataset of Mall_Customers, which is the data of customers who visit the mall and spend
there.
In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (which is the calculated value of how much a customer has spent in the mall; the higher the value, the more he has spent). From this dataset, we need to find some patterns, as it is an unsupervised method, so we don't know exactly what to look for.
The steps to be followed for the implementation are given below:
o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters
o Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our model, which is part
of data pre-processing. The code is given below:
1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
In the above code, we have imported numpy for performing mathematical calculations, matplotlib for plotting the graphs, and pandas for managing the dataset.
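The dataset import step, whose output is referred to below, is sketched here; the file name Mall_Customers_data.csv is an assumption:
o Importing the Dataset:
# Importing the dataset (file name assumed)
data_set= pd.read_csv('Mall_Customers_data.csv')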
From the above dataset, we need to find some patterns in it.
o Extracting Independent Variables
Here we don't need any dependent variable for the data pre-processing step, as it is a clustering problem and we have no idea about what to determine. So we will just add a line of code for the matrix of features, sketched below.
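That line, assuming Annual Income and Spending Score are in columns 3 and 4, would be:
x= data_set.iloc[:, [3, 4]].values   # matrix of features: Annual Income (k$) and Spending Score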
Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our clustering problem.
So, as discussed above, here we are going to use the elbow method for this purpose.
As we know, the elbow method uses the WCSS concept to draw the plot by plotting WCSS
values on the Y-axis and the number of clusters on the X-axis. So we are going to calculate the
value for WCSS for different k values ranging from 1 to 10. Below is the code for it:
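A sketch of this step, consistent with the wcss_list variable and the loop described next (kmeans.inertia_ holds the WCSS value of a fitted model; the random_state value is an assumption):
#finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list= []   #Initializing the list for the values of WCSS
#Using a for loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans= KMeans(n_clusters=i, init='k-means++', random_state= 42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()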
Next, we have created the wcss_list variable to initialize an empty list, which is used to contain
the value of wcss computed for different values of k ranging from 1 to 10.
After that, we have initialized the for loop for the iteration over different values of k ranging from 1 to 10; since a for loop in Python excludes the upper bound, it is taken as 11 to include the 10th value.
The rest part of the code is similar as we did in earlier topics, as we have fitted the model on a
matrix of features and then plotted the graph between the number of clusters and WCSS.
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at 5. So the number of clusters here will
be 5.
Step- 3: Training the K-means algorithm on the training dataset
As we have got the number of clusters, so we can now train the model on the dataset.
To train the model, we will use the same two lines of code as we have used in the above section,
but here instead of using i, we will use 5, as we know there are 5 clusters that need to be formed.
The code is given below:
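A sketch of the two lines referred to above (the random_state value is an assumption carried over from the elbow-method sketch):
#training the K-means model on the dataset
kmeans= KMeans(n_clusters=5, init='k-means++', random_state= 42)
y_predict= kmeans.fit_predict(x)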
In the second line of code, we have created the dependent variable y_predict to train the model.
By executing the above lines of code, we will get the y_predict variable. We can check it under
the variable explorer option in the Spyder IDE. We can now compare the values of y_predict
with our original dataset. Consider the below image:
From the above image, we can now see that CustomerID 1 belongs to cluster 3 (as indexing starts from 0, the value 2 corresponds to the 3rd cluster), CustomerID 2 belongs to cluster 4, and so on.
To visualize the clusters, we will draw a scatter plot using the mtp.scatter() function of matplotlib.
1. #visualizing the clusters
2. mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
3. mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
4. mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
5. mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster
6. mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
7. mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
8. mtp.title('Clusters of customers')
9. mtp.xlabel('Annual Income (k$)')
10. mtp.ylabel('Spending Score (1-100)')
11. mtp.legend()
12. mtp.show()
In the above lines of code, we have written code for each of the clusters, ranging from 1 to 5. The first coordinate of mtp.scatter, i.e., x[y_predict == 0, 0], selects the first feature column (Annual Income) for the points assigned to cluster 0, and x[y_predict == 0, 1] selects the second feature column (Spending Score) for the same points.
Output:
The output image clearly shows the five different clusters with different colors. The clusters are formed between two parameters of the dataset: the annual income of the customer and the spending score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above patterns, which are given below:
o Cluster1 shows the customers with average salary and average spending, so we can categorize these customers as standard.
o Cluster2 shows the customer has a high income but low spending, so we can categorize them
as careful.
o Cluster3 shows the low income and also low spending so they can be categorized as sensible.
o Cluster4 shows the customers with low income with very high spending so they can be
categorized as careless.
o Cluster5 shows the customers with high income and high spending so they can be categorized
as target, and these customers can be the most profitable customers for the mall owner.
Apriori Algorithm
The Apriori algorithm was given by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket analysis and helps to find products that can be bought together. It can also be used in the healthcare field to find drug reactions for patients.
Frequent itemsets are those itemsets whose support is greater than the threshold value or user-specified minimum support. It means that if A and B are frequent itemsets together, then individually A and B should also be frequent itemsets.
Suppose there are two transactions: A= {1,2,3,4,5} and B= {2,3,7}. In these two transactions, 2 and 3 are the frequent itemsets.
Note: To better understand the Apriori algorithm, and related terms such as support and confidence, it is recommended to understand association rule learning.
Steps for Apriori Algorithm
Below are the steps for the apriori algorithm:
Step-1: Determine the support of itemsets in the transactional database, and select the minimum support and confidence.
Step-2: Take all itemsets in the transactional database with a higher support value than the minimum or selected support value.
Step-3: Find all the rules of these subsets that have a higher confidence value than the threshold or minimum confidence.
Example: Suppose we have the following dataset that has various transactions, and from this
dataset, we need to find the frequent itemsets and generate the association rules using the
Apriori algorithm:
Solution:
Step-1: Calculating C1 and L1:
o In the first step, we will create a table that contains the support count (the frequency of each itemset individually in the dataset) of each itemset in the given dataset. This table is called the Candidate set or C1.
o Now, we will take out all the itemsets that have a support count greater than the Minimum Support (2). It will give us the table for the frequent itemset L1. Since all the itemsets have a support count greater than or equal to the minimum support, except E, the E itemset will be removed.
Step-2: Candidate Generation C2, and L2:
o In this step, we will generate C2 with the help of L1. In C2, we will create pairs of the itemsets of L1 in the form of subsets.
o After creating the subsets, we will again find the support count from the main transaction table
of datasets, i.e., how many times these pairs have occurred together in the given dataset. So,
we will get the below table for C2:
o Again, we need to compare the C2 Support count with the minimum support count, and after
comparing, the itemset with less support count will be eliminated from the table C2. It will give
us the below table for L2
Step-3: Candidate generation C3, and L3:
o For C3, we will repeat the same two processes, but now we will form the C3 table with subsets of three itemsets together, and will calculate the support count from the dataset. It will give the below table:
o Now we will create the L3 table. As we can see from the above C3 table, there is only one combination of itemsets that has a support count equal to the minimum support count. So, L3 will have only one combination, i.e., {A, B, C}.
Step-4: Finding the association rules for the subsets:
To generate the association rules, we create the possible rules from the combination {A, B, C} and compute the confidence of each rule (confidence = support(A ∪ B) / support(A)). Consider the below table:
As the given threshold or minimum confidence is 50%, the first three rules, A^B → C, B^C → A, and A^C → B, can be considered strong association rules for the given problem.
Advantages of Apriori Algorithm
o It is an easy-to-understand algorithm.
o The join and prune steps of the algorithm can be easily implemented on large datasets.
Disadvantages of Apriori Algorithm
o The Apriori algorithm works slowly compared to other algorithms.
o The overall performance can be reduced as it scans the database multiple times.
o The time complexity and space complexity of the Apriori algorithm is O(2^D), which is very high. Here D represents the horizontal width (the number of distinct items) present in the database.
Python Implementation of Apriori Algorithm
Now we will see the practical implementation of the Apriori Algorithm. To implement this, we have a problem of a retailer who wants to find the association between his shop's products, so that he can provide an offer of "Buy this and Get that" to his customers.
The retailer has a dataset that contains a list of transactions made by his customers. In the dataset, each row shows the products purchased by a customer in a single transaction. To solve this problem, we will perform the below steps:
o Data Pre-processing
o Training the Apriori model on the dataset
o Visualizing the results
Before importing the libraries, we will use the below line of code to install the apyori package
to use further, as Spyder IDE does not contain it:
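A typical way to do this is to run the pip installer from the IDE console or a terminal:
pip install apyori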
1. import numpy as nm
2. import matplotlib.pyplot as mtp
3. import pandas as pd
o Importing the dataset:
Now, we will import the dataset for our Apriori model. To import the dataset, there will be some changes here. All the rows of the dataset show different transactions made by the customers. The first row is the transaction made by the first customer, which means there is no particular name for each column; each cell holds its own individual value or product detail (see the dataset given below after the code). So, we need to mention in our code that no header is specified. The code is given below:
1. #Importing the dataset
2. dataset= pd.read_csv('Market_Basket_data1.csv', header= None)
3. transactions=[]
4. for i in range(0, 7501):
5.     transactions.append([str(dataset.values[i,j]) for j in range(0,20)])
In the above code, the read_csv line imports the dataset into pandas format; header= None tells pandas that the file has no header row. The transactions list is created because the apriori() function that we will use for training our model takes the dataset in the format of a list of transactions. So, we have created an empty list of transactions, and the loop fills it with all the transactions from 0 to 7500. Here we have taken 7501 because, in Python, the upper bound of range() is excluded.
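The training step itself is sketched below; min_support and min_confidence follow the values explained next, while min_lift= 3, min_length= 2 and max_length= 2 are assumptions consistent with the rules shown in the output:
#Training the Apriori model on the dataset
from apyori import apriori
rules= apriori(transactions= transactions, min_support= 0.003, min_confidence= 0.2, min_lift= 3, min_length= 2, max_length= 2)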
In the above code, the first line imports the apriori function. In the second line, the apriori function returns the output as the rules. It takes the following parameters:
o min_support= To set the minimum support float value. Here we have used 0.003, which corresponds to an item appearing in roughly 21 of the 7,501 transactions (about 3 per day over a week).
o min_confidence= To set the minimum confidence value. Here we have taken 0.2. It can be changed as per the business problem.
o min_lift= To set the minimum lift value.
o min_length= It takes the minimum number of products for the association.
o max_length= It takes the maximum number of products for the association.
o Displaying the results of the rules obtained from the apriori function
1. results= list(rules)
2. results
By executing the above lines of code, we will get the 9 rules. Consider the below output:
Output:
[RelationRecord(items=frozenset({'chicken', 'light cream'}),
support=0.004533333333333334,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}),
items_add=frozenset({'chicken'}), confidence=0.2905982905982906,
lift=4.843304843304844)]),
RelationRecord(items=frozenset({'escalope', 'mushroom cream sauce'}),
support=0.005733333333333333,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}),
items_add=frozenset({'escalope'}), confidence=0.30069930069930073,
lift=3.7903273197390845)]),
RelationRecord(items=frozenset({'escalope', 'pasta'}), support=0.005866666666666667,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}),
items_add=frozenset({'escalope'}), confidence=0.37288135593220345,
lift=4.700185158809287)]),
RelationRecord(items=frozenset({'fromage blanc', 'honey'}),
support=0.0033333333333333335,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'fromage blanc'}),
items_add=frozenset({'honey'}), confidence=0.2450980392156863,
lift=5.178127589063795)]),
RelationRecord(items=frozenset({'ground beef', 'herb & pepper'}), support=0.016,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'herb & pepper'}),
items_add=frozenset({'ground beef'}), confidence=0.3234501347708895,
lift=3.2915549671393096)]),
RelationRecord(items=frozenset({'tomato sauce', 'ground beef'}),
support=0.005333333333333333,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'tomato sauce'}),
items_add=frozenset({'ground beef'}), confidence=0.37735849056603776,
lift=3.840147461662528)]),
RelationRecord(items=frozenset({'olive oil', 'light cream'}), support=0.0032,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}),
items_add=frozenset({'olive oil'}), confidence=0.20512820512820515,
lift=3.120611639881417)]),
RelationRecord(items=frozenset({'olive oil', 'whole wheat pasta'}), support=0.008,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'whole wheat pasta'}),
items_add=frozenset({'olive oil'}), confidence=0.2714932126696833,
lift=4.130221288078346)]),
RelationRecord(items=frozenset({'pasta', 'shrimp'}), support=0.005066666666666666,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}),
items_add=frozenset({'shrimp'}), confidence=0.3220338983050848,
lift=4.514493901473151)])]
As we can see, the above output is in a form that is not easily understandable. So, we will print all the rules in a suitable format.
o Visualizing the rule, support, confidence, and lift in a clearer way:
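A sketch of a loop that prints each rule in the format shown in the output below (it reads the items, support, confidence and lift fields of each RelationRecord):
for item in results:
    pair = list(item.items)
    print("Rule: " + pair[0] + " -> " + pair[1])
    print("Support: " + str(item.support))
    print("Confidence: " + str(item.ordered_statistics[0].confidence))
    print("Lift: " + str(item.ordered_statistics[0].lift))
    print("=====================================")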
By executing the above lines of code, we will get the below output:
Rule: chicken -> light cream
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
=====================================
Rule: escalope -> mushroom cream sauce
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
=====================================
Rule: escalope -> pasta
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
=====================================
Rule: fromage blanc -> honey
Support: 0.0033333333333333335
Confidence: 0.2450980392156863
Lift: 5.178127589063795
=====================================
Rule: ground beef -> herb & pepper
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
=====================================
Rule: tomato sauce -> ground beef
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
=====================================
Rule: olive oil -> light cream
Support: 0.0032
Confidence: 0.20512820512820515
Lift: 3.120611639881417
=====================================
Rule: olive oil -> whole wheat pasta
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
=====================================
Rule: pasta -> shrimp
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
=====================================
From the above output, we can analyze each rule. The first rule, which is Light cream → chicken, states that light cream and chicken are frequently bought together by customers. The support for this rule is 0.0045, and the confidence is 29%. Hence, if a customer buys light cream, there is a 29% chance that he also buys chicken, and the pair appears in about 0.45% of the transactions. We can check all these things for the other rules as well.
Decision Tree Classification Algorithm
o In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make a decision and have multiple branches, whereas Leaf nodes are the outputs of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.
• Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches the leaf node of the tree. The complete
process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final node is then called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. Using this measure, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric that measures the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
where S is the set of samples, P(yes) is the probability of the class "yes", and P(no) is the probability of the class "no".
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over an attribute with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o The Gini index can be calculated using the below formula:
Gini Index = 1 - Σj (Pj)², where Pj is the probability of class j.
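To make these measures concrete, here is a small NumPy sketch (not part of the original listing; the labels and the split used at the end are purely illustrative) that computes entropy, information gain, and the Gini index for a set of class labels:
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent_labels, subsets):
    # IG = Entropy(parent) - weighted average entropy of the child subsets
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Illustrative example: a Yes/No target split into two subsets
y = np.array(["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes"])
left, right = y[:4], y[4:]          # hypothetical split
print(entropy(y), gini(y), information_gain(y, [left, right]))
A split with a higher information gain (or a lower weighted Gini index) is the one the tree prefers.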
Pruning: Getting an Optimal Decision Tree
Pruning is the process of deleting unnecessary nodes from a tree in order to obtain the optimal decision tree.
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learning tree without reducing accuracy is therefore known as pruning. There are mainly two types of tree pruning techniques used: cost complexity pruning and reduced error pruning.
Advantages of the Decision Tree
o It is simple to understand as it follows the same process which a human follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes of a problem.
o There is less requirement for data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
Now we will implement the Decision tree using Python. For this, we will use the dataset
"user_data.csv," which we have used in previous classification models. By using the same
dataset, we can compare the Decision tree classifier with other classification models such as
KNN, SVM, Logistic Regression, etc.
Steps will also remain the same, which are given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
In the above code, we have pre-processed the data. Here we have loaded the dataset, which is given as:
2. Fitting a Decision-Tree algorithm to the Training set
Now we will fit the model to the training set. For this, we will import the DecisionTreeClassifier class from the sklearn.tree library. Below is the code for it:
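A minimal sketch of this step, consistent with the classifier parameters shown in the output that follows (criterion='entropy' for information gain and random_state=0 for reproducibility):
# Fitting a Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)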
In the above code, we have created a classifier object, in which we have passed two main parameters: criterion='entropy', so that splits are chosen by information gain, and random_state=0, so that the results are reproducible.
Out[8]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=0, splitter='best')
3. Predicting the test result
Now we will predict the test set result. We will create a new prediction vector y_pred. Below
is the code for it:
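A minimal sketch of this step, assuming the classifier fitted above:
# Predicting the test set results
y_pred = classifier.predict(x_test)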
In the below output image, the predicted output and real test output are given. We can clearly
see that there are some values in the prediction vector, which are different from the real vector
values. These are prediction errors.
4. Test accuracy of the result (Creation of Confusion matrix)
In the above output, we have seen that there were some incorrect predictions, so if we want to
know the number of correct and incorrect predictions, we need to use the confusion matrix.
Below is the code for it:
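A minimal sketch of this step using scikit-learn's confusion_matrix:
# Creating the confusion matrix from the real and predicted test labels
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)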
In the above output image, we can see the confusion matrix, which has 6+3= 9 incorrect predictions and 62+29= 91 correct predictions. Therefore, we can say that, compared to other classification models, the Decision Tree classifier made a good prediction.
5. Visualizing the training set result
Now we will visualize the training set result. Below is the code for it:
# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Decision Tree Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
The above output is completely different from the other classification models. It has both vertical and horizontal lines that split the dataset according to the age and estimated salary variables.
As we can see, the tree is trying to capture every data point, which is a sign of overfitting.
6. Visualizing the test set result:
Visualization of test set result will be similar to the visualization of the training set except that
the training set will be replaced with the test set.
As we can see in the above image, there are some green data points within the purple region and vice versa. These are the incorrect predictions that we discussed in the confusion matrix.
Random Forest Algorithm
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.
A greater number of trees in the forest generally leads to higher accuracy and reduces the risk of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Note: To better understand the Random Forest Algorithm, you should have knowledge of
the Decision Tree Algorithm.
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all
the trees predict the correct output. Therefore, below are two assumptions for a better Random
forest classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier
can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy; even for a large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of the data is missing.
Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions with each tree created in the first phase.
The working process can be explained in the below steps and diagram:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority vote.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest
classifier predicts the final decision. Consider the below image:
Applications of Random Forest
Below are a few sectors where the Random Forest algorithm is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
Advantages and Disadvantages of Random Forest
o Although Random Forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.
Now we will implement the Random Forest algorithm using Python. For this, we will use
the same dataset "user_data.csv", which we have used in previous classification models. By
using the same dataset, we can compare the Random Forest classifier with other classification
models such as Decision tree Classifier, KNN, SVM, Logistic Regression, etc.
The implementation steps are given below:
o Data Pre-processing step
o Fitting the Random Forest algorithm to the training set
o Predicting the test result
o Test accuracy of the result (creation of the confusion matrix)
o Visualizing the test set result
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
In the above code, we have pre-processed the data. Here we have loaded the dataset, which is given as:
2. Fitting the Random Forest algorithm to the training set:
Now we will fit the Random forest algorithm to the training set. To fit it, we will import the
RandomForestClassifier class from the sklearn.ensemble library. The code is given below:
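A minimal sketch of this step, using the parameter values described below (n_estimators=10 and criterion='entropy'); random_state is added here only to make the run reproducible:
# Fitting the Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)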
o n_estimators= The required number of trees in the Random Forest. The default value is 10 in older versions of scikit-learn (newer versions default to 100). We can choose any number, but we need to take care of the overfitting issue.
o criterion= The function used to measure the quality of a split. Here we have taken "entropy" for information gain.
Output:
By checking the above prediction vector and test set real vector, we can determine the incorrect
predictions done by the classifier.
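The prediction and confusion-matrix steps mirror those of the Decision Tree model; a minimal sketch:
# Predicting the test set results and creating the confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)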
As we can see in the above matrix, there are 4+4= 8 incorrect predictions and 64+28= 92
correct predictions.
5. Visualizing the training set result
Now we will visualize the training set result. Below is the code for it:
# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Random Forest Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
The above image is the visualization result for the Random Forest classifier working with the
training set result. It is very much similar to the Decision tree classifier. Each data point
corresponds to each user of the user_data, and the purple and green regions are the prediction
regions. The purple region is classified for the users who did not purchase the SUV car, and the
green region is for the users who purchased the SUV.
So, in the Random Forest classifier, we have taken 10 trees that predicted Yes or No for the Purchased variable. The classifier took the majority of the predictions and provided the result.
The above image is the visualization result for the test set. We can see that there is a minimal number of incorrect predictions (8) without the overfitting issue. We will get different results by changing the number of trees in the classifier.
Cross-Validation in Machine Learning
In machine learning, there is always a need to test the stability of a model; we cannot judge a model based only on how well it fits the training dataset. For this purpose, we reserve a particular sample of the dataset that was not part of the training data. After that, we test our model on that sample before deployment, and this complete process comes under cross-validation. It is somewhat different from the general train/test split.
The basic steps of cross-validation are:
o Reserve a subset of the dataset as a validation set.
o Provide the training to the model using the training dataset.
o Now, evaluate the model's performance using the validation set. If the model performs well with the validation set, perform the further steps; otherwise, check for issues.
There are some common methods that are used for cross-validation. These methods are given below:
1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
Validation Set Approach
In this approach, we divide the input dataset into a training set and a validation set, each containing roughly 50% of the data. But it has one big disadvantage: because we use only 50% of the dataset to train the model, the model may miss important information in the data. It also tends to give an underfitted model.
Leave-P-out cross-validation
In this approach, p data points are left out of the training data. It means that, if there are a total of n data points in the original input dataset, then n-p data points are used as the training dataset and the p data points as the validation set. This complete process is repeated for all possible samples, and the average error is calculated to know the effectiveness of the model.
A disadvantage of this technique is that it can be computationally expensive for large p.
Leave one out cross-validation
This method is similar to leave-p-out cross-validation, but instead of p, we leave out only one data point from the training data. It means that, for each learning set, only one data point is reserved, and the remaining dataset is used to train the model. This process repeats for each data point. Hence for n samples, we get n different training sets and n test sets. It has the following features:
o In this approach, the bias is minimal, as all the data points are used.
o The process is executed n times; hence the execution time is high.
o This approach leads to high variation in testing the effectiveness of the model, as we iteratively check against a single data point.
K-Fold Cross-Validation
K-fold cross-validation approach divides the input dataset into K groups of samples of equal
sizes. These samples are called folds. For each learning set, the prediction function uses k-1 folds for training, and the remaining fold is used as the test set. This approach is a very popular CV approach because it is easy to understand, and the output is less biased than that of other methods.
The steps for k-fold cross-validation are:
o Split the input dataset into K groups.
o For each group:
o Take one group as the reserve or test data set.
o Use the remaining groups as the training dataset.
o Fit the model on the training set and evaluate the performance of the model using the test set.
Let's take an example of 5-fold cross-validation: the dataset is grouped into 5 folds. On the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train the model. On the 2nd iteration, the second fold is used to test the model, and the rest are used to train the model. This process continues until each fold has been used once as the test fold.
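A minimal sketch of 5-fold cross-validation with scikit-learn; the dataset and estimator are illustrative stand-ins:
# 5-fold cross-validation: each fold takes one turn as the test set
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)   # illustrative data
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores, scores.mean())
Replacing KFold with StratifiedKFold gives the stratified variant described next, which preserves the class proportions in every fold.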
Stratified k-fold cross-validation
This technique is similar to k-fold cross-validation with a few small changes. This approach works on the concept of stratification, which is the process of rearranging the data to ensure that each fold or group is a good representative of the complete dataset. It is one of the best approaches for dealing with bias and variance.
It can be understood with an example of housing prices: the price of some houses can be much higher than that of other houses. To tackle such situations, a stratified k-fold cross-validation technique is useful.
Holdout Method
This method is the simplest cross-validation technique of all. In this method, we hold out a subset of the data and use it to evaluate the model after training it on the remaining part of the dataset.
The error that occurs in this process tells how well our model will perform with the unknown
dataset. Although this approach is simple to perform, it still faces the issue of high variance,
and it also produces misleading results sometimes.
Comparison of Cross-Validation to Train/Test Split
o Train/test split: The input data is divided into two parts, a training set and a test set, in a ratio such as 70:30 or 80:20. Its biggest disadvantage is that it provides a high variance.
o Training Data: The training data is used to train the model, and the dependent variable is
known.
o Test Data: The test data is used to make predictions from the model that has already been trained on the training data. It has the same features as the training data but is not part of it.
o Cross-Validation dataset: It is used to overcome the disadvantage of the train/test split by splitting the dataset into groups of train/test splits and averaging the results. It can be used if we want to optimize a model that has been trained on the training dataset for the best performance. It is more efficient than a single train/test split, as every observation is used for both training and testing.
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
o Under ideal conditions, it provides the optimum output, but for inconsistent data it may produce misleading results. This is one of the big disadvantages of cross-validation, as there is no certainty about the type of data in machine learning.
o In predictive modeling, the data evolves over a period of time, due to which differences may appear between the training and validation sets. For example, if we create a model for the prediction of stock market values and train it on the previous 5 years of stock values, the realistic future values for the next 5 years may be drastically different, so it is difficult to expect the correct output in such situations.
Applications of Cross-Validation
o This technique can be used to compare the performance of different predictive modeling
methods.
o It can also be used for meta-analysis, as it is already being used by data scientists in the field of medical statistics.
Introduction to Dimensionality Reduction Technique
What is Dimensionality Reduction?
The number of input features, variables, or columns present in a given dataset is known as
dimensionality, and the process to reduce these features is called dimensionality reduction.
A dataset contains a huge number of input features in various cases, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques need to be used in such cases.
Dimensionality reduction technique can be defined as, "It is a way of converting the higher
dimensions dataset into lesser dimensions dataset ensuring that it provides similar
information." These techniques are widely used in machine learning for obtaining a better fit
predictive model while solving the classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
The Curse of Dimensionality
Handling the high-dimensional data is very difficult in practice, commonly known as the curse
of dimensionality. If the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples needed to cover the feature space also grows rapidly, and the chance of overfitting increases as well. If a machine learning model is trained on high-dimensional data, it becomes overfitted and results in poor performance.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
Some benefits of applying dimensionality reduction technique to the given dataset are given
below:
o By reducing the dimensions of the features, the space required to store the dataset also gets
reduced.
o Less computation and training time is required for reduced dimensions of features.
o Reduced dimensions of the features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.
There are also some disadvantages of applying dimensionality reduction, which are given below:
o Some data may be lost due to dimensionality reduction.
o In the PCA dimensionality reduction technique, the number of principal components to consider is sometimes unknown.
There are two ways to apply the dimension reduction technique, which are given below:
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and leaving out
the irrelevant features present in a dataset to build a model of high accuracy. In other words, it
is a way of selecting the optimal features from the input dataset.
1. Filters Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some common techniques of the filter method are correlation, the chi-square test, ANOVA, and information gain.
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model, and the performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filtering method but more complex to work with. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
3. Embedded Methods: Embedded methods check the different training iterations of the
machine learning model and evaluate the importance of each feature. Some common
techniques of Embedded methods are:
o LASSO
Feature Extraction:
Feature extraction is the process of transforming the space containing many dimensions into
space with fewer dimensions. This approach is useful when we want to keep the whole
information but use fewer resources while processing the information.
Some common feature extraction techniques are Principal Component Analysis (PCA) and Kernel PCA.
Common techniques of dimensionality reduction, several of which are described below, include:
a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. High Correlation Filter
f. Low Variance Filter
g. Random Forest
h. Factor Analysis
i. Auto-Encoder
Principal Component Analysis (PCA)
PCA works by considering the variance of each attribute, because high variance indicates a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
Backward Feature Elimination
o In this technique, firstly, all the n variables of the given dataset are taken to train the model.
o We then remove one feature at a time, training the model on n-1 features n times, and compute the performance of the model.
o We identify the variable whose removal makes the smallest (or no) change in the performance of the model and drop that variable or feature; after that, we are left with n-1 features.
o Repeat the complete process until no further feature can be dropped.
In this technique, by selecting the optimum performance of the model and the maximum tolerable error rate, we can define the optimal number of features required for the machine learning algorithm.
Forward Feature Selection
o We start with a single feature only, and progressively add one feature at a time.
o Here we train the model on each feature separately.
o The feature with the best performance is selected.
The process is repeated until we get a significant increase in the performance of the model.
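As a sketch of both directions, scikit-learn's SequentialFeatureSelector (available in version 0.24 and later) can perform forward or backward selection; the dataset and estimator here are illustrative:
# Forward and backward sequential feature selection
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
estimator = LogisticRegression(max_iter=1000)

forward = SequentialFeatureSelector(estimator, n_features_to_select=4, direction='forward').fit(X, y)
backward = SequentialFeatureSelector(estimator, n_features_to_select=4, direction='backward').fit(X, y)
print(forward.get_support())    # mask of features kept by forward selection
print(backward.get_support())   # mask of features kept by backward elimination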
High Correlation Filter
High Correlation refers to the case when two variables carry approximately the same information. Due to this factor, the performance of the model can be degraded. The correlation between independent numerical variables is measured by the correlation coefficient; if this value is higher than a threshold value, we can remove one of the variables from the dataset. We should, however, retain the variables or features that show a high correlation with the target variable.
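A small pandas sketch of this filter; the column names, the engineered correlated column, and the 0.9 threshold are all illustrative:
# Drop one variable from each pair whose absolute correlation exceeds a threshold
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 4)), columns=['a', 'b', 'c', 'd'])
df['e'] = df['a'] * 0.9 + rng.random(100) * 0.1      # 'e' is highly correlated with 'a'

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)                                       # expected: ['e']
print(df.drop(columns=to_drop).columns.tolist())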
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine learning.
This algorithm contains an in-built feature importance package, so we do not need to program
it separately. In this technique, we need to generate a large set of trees against the target
variable, and with the help of usage statistics of each attribute, we need to find the subset of
features.
The Random Forest algorithm takes only numerical variables, so we need to convert the input data into numeric data using one-hot encoding.
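A short sketch of ranking features with a random forest's built-in importance scores; the dataset and the 0.05 threshold are illustrative:
# Rank features with the forest's built-in importance scores
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = forest.feature_importances_
selected = [i for i, imp in enumerate(importances) if imp > 0.05]   # keep the stronger features
print(importances)
print(selected)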
Factor Analysis
Factor analysis is a technique in which each variable is kept within a group according to its correlation with the other variables; variables within a group can have a high correlation among themselves, but they have a low correlation with variables of other groups.
We can understand it with an example: suppose we have two variables, Income and Spend. These two variables have a high correlation, which means that people with a high income spend more, and vice versa. So, such variables are put into a group, and that group is known as a factor. The number of these factors will be reduced as compared to the original dimension of the dataset.
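A minimal sketch with scikit-learn's FactorAnalysis; the Iris data and the choice of two factors are illustrative:
# Project correlated numeric features onto a smaller number of latent factors
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X = load_iris().data                          # four correlated numeric features
fa = FactorAnalysis(n_components=2, random_state=0)
X_factors = fa.fit_transform(X)
print(X_factors.shape)                        # (150, 2)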
Auto-encoders
One of the popular methods of dimensionality reduction is auto-encoder, which is a type of
ANN or artificial neural network, and its main aim is to copy the inputs to their outputs. In this,
the input is compressed into a latent-space representation, and the output is reconstructed from this representation. It has mainly two parts:
o Encoder: The function of the encoder is to compress the input to form the latent-space
representation.
o Decoder: The function of the decoder is to recreate the output from the latent-space
representation.
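A tiny Keras sketch of this encoder/decoder structure (assumes TensorFlow is installed; the layer sizes and the random data are illustrative):
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 20).astype("float32")        # illustrative input data

inputs = keras.Input(shape=(20,))
latent = keras.layers.Dense(3, activation="relu")(inputs)        # encoder: compress to latent space
outputs = keras.layers.Dense(20, activation="sigmoid")(latent)   # decoder: reconstruct the input

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)        # the input is also the target

encoder = keras.Model(inputs, latent)                 # reduced-dimension representation
print(encoder.predict(X, verbose=0).shape)            # (1000, 3)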
Coming back to PCA: it generally tries to find a lower-dimensional surface onto which to project the high-dimensional data. It is a feature extraction technique, so it retains the important variables and drops the least important ones. Some common terms used in the PCA algorithm are given below:
o Dimensionality: It is the number of features or variables present in the given dataset. More
easily, it is the number of columns present in the dataset.
o Correlation: It signifies how strongly two variables are related to each other, such that if one changes, the other variable also changes. The correlation value ranges from -1 to +1. Here, -1 occurs if the variables are inversely proportional to each other, and +1 indicates that the variables are directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence the correlation between the pair of variables is zero.
o Eigenvectors: If M is a square matrix and v is a non-zero vector, then v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariances between pairs of variables is called the covariance matrix.
The principal components produced by PCA have the following properties:
o These components are orthogonal, i.e., the correlation between a pair of variables is zero.
o The importance of each component decreases when going from 1 to n: the first principal component has the most importance, and the nth principal component has the least.
Now we will represent our dataset in a structure: a two-dimensional matrix of the independent variables X. Here each row corresponds to a data item, and each column corresponds to a feature. The number of columns gives the dimensionality of the dataset.
Once the new feature set is obtained, we decide what to keep and what to remove: we keep only the relevant or important features in the new dataset, and the unimportant features are removed.
o PCA is mainly used as the dimensionality reduction technique in various AI applications such
as computer vision, image compression, etc.
o It can also be used for finding hidden patterns if data has high dimensions. Some fields where
PCA is used are Finance, data mining, Psychology, etc.
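A minimal scikit-learn sketch of PCA as used for dimensionality reduction; the Iris data and the choice of two components are illustrative:
# Standardise the features, then project them onto the top two principal components
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # share of variance captured by each component
print(X_reduced.shape)                 # (150, 2)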