
Unit – 1

Machine Learning Algorithms


Unit I Introduction: Basic Definitions, Types of Learning (Supervised Learning,
Unsupervised Learning, Reinforcement Learning), Hypothesis Space and Inductive Bias,
Evaluation and Cross-validation

Machine Learning
Machine learning is a growing technology that enables computers to
learn automatically from past data. Machine learning uses various
algorithms to build mathematical models and make predictions using
historical data or information. Currently, it is used for various tasks
such as image recognition, speech recognition, email filtering,
Facebook auto-tagging, recommender systems, and many more.

This unit gives an introduction to machine learning along
with a wide range of machine learning techniques, such
as supervised, unsupervised, and reinforcement learning. You
will learn about regression and classification models, clustering
methods, hidden Markov models, and various sequential models.

What is Machine Learning


Machine learning is a subset of artificial intelligence that is
mainly concerned with the development of algorithms that allow a
computer to learn from data and past experiences on its own.
The term machine learning was first introduced by Arthur
Samuel in 1959. We can define it in a summarized way as:

Machine learning enables a machine to automatically learn from data, improve
performance from experience, and predict things without being explicitly
programmed.

With the help of sample historical data, known as training
data, machine learning algorithms build a mathematical model that
helps in making predictions or decisions without being explicitly
programmed. Machine learning brings computer science and statistics
together to create predictive models. It constructs or uses algorithms
that learn from historical data: the more information we provide,
the better the performance.

A machine has the ability to learn if it can improve its performance by gaining more data.
How does Machine Learning work
A machine learning system learns from historical data, builds
prediction models, and, whenever it receives new data,
predicts the output for it. The accuracy of the predicted output
depends on the amount of data: a larger amount of data helps
build a better model, which predicts the output more accurately.

Features of Machine Learning:


o Machine learning uses data to detect various patterns in a given
dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining in that both deal
with huge amounts of data.

Need for Machine Learning


The need for machine learning is increasing day by day. The reason
is that machine learning is capable of doing tasks that are too
complex for a person to implement directly. As humans, we have
limitations: we cannot process huge amounts of data manually. For
this, we need computer systems, and machine learning makes things
easy for us.

We can train machine learning algorithms by providing them with huge
amounts of data and letting them explore the data, construct models,
and predict the required output automatically. The performance of a
machine learning algorithm depends on the amount of data, and its
quality of fit can be measured by a cost function. With the help of
machine learning, we can save both time and money.

The importance of machine learning can be easily understood by its
use cases. Currently, machine learning is used in self-driving
cars, cyber fraud detection, face recognition, friend
suggestions by Facebook, etc. Various top companies such as Netflix
and Amazon have built machine learning models that use vast
amounts of data to analyze user interests and recommend products
accordingly.

Following are some key points which show the importance of
machine learning:

o Rapid increase in the production of data
o Solving complex problems which are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from
data

Classification of Machine Learning


At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning
Supervised learning is a type of machine learning method in which we
provide sample labeled data to the machine learning system in order
to train it, and on that basis, it predicts the output.

The system creates a model using labeled data to understand the
datasets and learn about each data point. Once training and
processing are done, we test the model by providing sample data to
check whether it predicts the correct output.
The goal of supervised learning is to map input data to output
data. Supervised learning is based on supervision, just as a student
learns under the supervision of a teacher. A classic example of
supervised learning is spam filtering.

Supervised learning can be further grouped into two categories of
algorithms (a short illustrative sketch follows the list):

o Classification
o Regression
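As an illustration beyond the original text, here is a minimal sketch of supervised classification, assuming the scikit-learn library is installed; the tiny fruit dataset (weight and texture features, apple/orange labels) is made up:

# A minimal supervised-learning sketch: labeled examples in, predictions out.
# Assumes scikit-learn is installed; the toy fruit dataset below is made up.
from sklearn.tree import DecisionTreeClassifier

# Each row is [weight_in_grams, texture 0=smooth 1=bumpy]; labels: 0=apple, 1=orange
X_train = [[140, 0], [130, 0], [150, 1], [170, 1]]
y_train = [0, 0, 1, 1]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # learn from labeled data (the supervision)

print(model.predict([[160, 1]]))     # predict the label of an unseen example

Regression follows the same fit/predict pattern, with a numeric target instead of a class label.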

2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns
without any supervision.

The training is provided to the machine with a set of data that has
not been labeled, classified, or categorized, and the algorithm needs to
act on that data without any supervision. The goal of unsupervised
learning is to restructure the input data into new features or groups of
objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The
machine tries to find useful insights from the huge amount of data.
Unsupervised learning can be further classified into two categories of
algorithms (a short sketch follows the list):

o Clustering
o Association
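For illustration, here is a minimal clustering sketch, again assuming scikit-learn; the 2-D points are made up and carry no labels:

# A minimal clustering sketch: unlabeled points are grouped by similarity.
# Assumes scikit-learn; the 2-D points are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])   # note: no labels are provided

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignments discovered by the algorithm
print(kmeans.cluster_centers_)   # centers of the two discovered groups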

3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which
a learning agent gets a reward for each right action and a penalty
for each wrong action. The agent learns automatically from this
feedback and improves its performance. In reinforcement learning,
the agent interacts with the environment and explores it. The goal of
the agent is to collect the most reward points, and in doing so it
improves its performance.
A robotic dog that automatically learns the movement of its limbs
is an example of reinforcement learning (a short sketch follows).
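As a rough illustration of the reward/penalty loop described above, here is a minimal epsilon-greedy bandit sketch in plain Python; the two actions and their hidden reward probabilities are made up, and a full reinforcement learning system would be considerably more involved:

# A minimal reward/penalty loop (an epsilon-greedy bandit).
# The reward probabilities are made up; this is only a sketch of the idea.
import random

true_reward_prob = [0.2, 0.8]   # hidden from the agent: action 1 is the "right" action
q_values = [0.0, 0.0]           # agent's estimate of each action's value
counts = [0, 0]
epsilon = 0.1                   # how often the agent explores

for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(2)                        # explore a random action
    else:
        action = max(range(2), key=lambda a: q_values[a])   # exploit the best-known action
    reward = 1 if random.random() < true_reward_prob[action] else -1  # reward or penalty
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]  # running average

print(q_values)   # the agent learns that action 1 yields more reward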

History of Machine Learning


Some 40-50 years ago, machine learning was science fiction, but today
it is part of our daily life, from self-driving cars to Amazon's
virtual assistant "Alexa". The idea behind machine learning,
however, is old and has a long history. Below are some milestones
in the history of machine learning:

The early history of Machine Learning (Pre-1940):

o 1834: Charles Babbage, the father of the computer, conceived a
device that could be programmed with punch cards. The machine was
never built, but all modern computers rely on its logical structure.
o 1936: Alan Turing gave a theory of how a machine can determine
and execute a set of instructions.

The era of stored program computers:

o 1943: A human neural network was modeled with an electrical
circuit. In 1950, scientists started applying this idea and analyzed
how human neurons might work.
o 1945: "ENIAC", the first electronic general-purpose computer, was
completed. After that, stored-program computers such as EDSAC in
1949 and EDVAC in 1951 were built.
Computing machinery and intelligence:

o 1950: Alan Turing published the seminal paper,
"Computing Machinery and Intelligence," on the topic of
artificial intelligence. In the paper, he asked, "Can machines
think?"

Machine intelligence in Games:

o 1952: Arthur Samuel, a pioneer of machine learning,
created a program that helped an IBM computer play
checkers. The program performed better the more it played.
o 1959: The term "machine learning" was first coined
by Arthur Samuel.

The first "AI" winter:

o The period from 1974 to 1980 was a tough time for AI and ML
researchers; this period is called the AI winter.
o During this period, machine translation efforts failed, people lost
interest in AI, and government funding for research was reduced.

Machine Learning from theory to reality

o 1959: The first neural network was applied to a real-world
problem, removing echoes over phone lines using an adaptive filter.
o 1985: Terry Sejnowski and Charles Rosenberg invented
NETtalk, a neural network that taught itself how to
correctly pronounce 20,000 words in one week.
o 1997: IBM's Deep Blue computer won a chess match against
world champion Garry Kasparov, becoming the first computer
to beat a human chess champion.

Machine Learning in the 21st century


o 2006: Computer scientist Geoffrey Hinton gave neural network
research the new name "deep learning," and it has since become
one of the most trending technologies.
o 2012: Google created a deep neural network that learned to
recognize images of humans and cats in YouTube videos.
o 2014: The chatbot "Eugene Goostman" was claimed to have passed
the Turing test; it was the first chatbot to convince 33% of the
human judges that it was not a machine.
o 2014: DeepFace, a deep neural network created by Facebook, was
claimed to recognize a person with the same precision as a human.
o 2016: AlphaGo beat the world number two Go player, Lee Sedol.
In 2017 it beat the world number one, Ke Jie.
o 2017: Alphabet's Jigsaw team built an intelligent system that
learned to recognize online trolling. It read millions of comments
from different websites to learn to stop online trolling.

Machine Learning at present:


Machine learning research has now advanced greatly, and it is
present everywhere around us, in self-driving cars, Amazon Alexa,
chatbots, recommender systems, and many more. It includes
supervised, unsupervised, and reinforcement learning with clustering,
classification, decision tree, and SVM algorithms, among others.

Modern machine learning models can be used for making various
predictions, including weather prediction, disease prediction,
stock market analysis, etc.

Applications of Machine learning


Machine learning is a buzzword in today's technology, and it is
growing very rapidly day by day. We use machine learning in our
daily lives even without knowing it, for example in Google Maps,
Google Assistant, and Alexa. Below are some of the most trending
real-world applications of machine learning:

1. Image Recognition:
Image recognition is one of the most common applications of machine
learning. It is used to identify objects, persons, and places in
digital images. A popular use case of image recognition and face
detection is automatic friend tagging suggestions:

Facebook provides a feature of automatic friend tagging suggestions.
Whenever we upload a photo with our Facebook friends, we
automatically get a tagging suggestion with names, and the technology
behind this is machine learning's face detection and recognition
algorithms.

It is based on the Facebook project named "DeepFace," which is
responsible for face recognition and person identification in
pictures.

2. Speech Recognition
While using Google, we get an option of "Search by voice"; this comes
under speech recognition, a popular application of machine learning.

Speech recognition is the process of converting voice instructions
into text; it is also known as "speech to text" or "computer speech
recognition." At present, machine learning algorithms are widely
used in speech recognition applications. Google Assistant, Siri,
Cortana, and Alexa use speech recognition technology to follow
voice instructions.

3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps,
which shows us the correct path with the shortest route and predicts
traffic conditions.

It predicts whether traffic is clear, slow-moving, or heavily
congested with the help of two things:

o the real-time location of vehicles from the Google Maps app and
sensors, and
o the average time taken on past days at the same time of day.

Everyone who uses Google Maps helps make the app better: it takes
information from users and sends it back to its database to improve
performance.

4. Product recommendations:
Machine learning is widely used by various e-commerce and
entertainment companies such as Amazon and Netflix for product
recommendations. Whenever we search for a product on Amazon, we
start getting advertisements for the same product while surfing the
internet in the same browser, and this is because of machine
learning.

Google understands user interest using various machine learning
algorithms and suggests products according to customer interest.

Similarly, when we use Netflix, we find recommendations for series,
movies, etc., and this is also done with the help of machine
learning.

5. Self-driving cars:
One of the most exciting applications of machine learning is self-
driving cars, where machine learning plays a significant role.
Tesla, a well-known car manufacturer, is working on self-driving
cars and uses machine learning methods to train its models to detect
people and objects while driving.

6. Email Spam and Malware Filtering:


Whenever we receive a new email, it is automatically filtered as
important, normal, or spam. We always receive important mail in
our inbox, marked with the important symbol, and spam emails in our
spam box, and the technology behind this is machine learning. Below
are some spam filters used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms, such as the Multi-Layer
Perceptron, Decision Tree, and Naïve Bayes classifier, are used for
email spam filtering and malware detection; a small sketch follows.
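As a small illustrative sketch of the Naïve Bayes approach, assuming scikit-learn is available; the four example emails and their spam labels are made up:

# A minimal spam-filtering sketch with a Naive Bayes classifier.
# Assumes scikit-learn; the tiny example emails and labels are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "meeting at noon", "cheap money offer", "project update"]
labels = [1, 0, 1, 0]                        # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)         # turn text into word-count features

classifier = MultinomialNB().fit(X, labels)
print(classifier.predict(vectorizer.transform(["win a cheap offer"])))  # -> [1]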
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google
Assistant, Alexa, Cortana, and Siri. As the name suggests, they help
us find information using voice instructions. These assistants can
help us in various ways just by our voice instructions, such as
playing music, calling someone, opening an email, scheduling an
appointment, etc.

Machine learning algorithms are an important part of these virtual
assistants. The assistants record our voice instructions, send them
to a server in the cloud, decode them using ML algorithms, and act
accordingly.

8. Online Fraud Detection:


Machine learning is making our online transactions safe and secure by
detecting fraudulent transactions. Whenever we perform an online
transaction, fraud can take place in various ways, such as fake
accounts, fake IDs, and money being stolen in the middle of a
transaction. To detect this, a feed-forward neural network helps by
checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash
values, and these values become the input for the next round.
Genuine transactions follow a specific pattern, which changes for a
fraudulent transaction; the network detects this change and makes
our online transactions more secure.

9. Stock Market trading:


Machine learning is widely used in stock market trading. In the stock
market, there is always a risk of ups and downs in share prices, so
machine learning's long short-term memory (LSTM) neural networks are
used for the prediction of stock market trends.

10. Medical Diagnosis:


In medical science, machine learning is used for disease diagnosis.
With it, medical technology is growing very fast and is able to build
3D models that can predict the exact position of lesions in the
brain. It helps in finding brain tumors and other brain-related
diseases easily.

11. Automatic Language Translation:


Nowadays, if we visit a new place and are not aware of the
language, it is not a problem at all; machine learning helps us by
converting the text into a language we know. Google's GNMT (Google
Neural Machine Translation) provides this feature: a neural machine
translation system that translates text into a familiar language,
which is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence
learning algorithm, which is also used with image recognition, and
which translates text from one language to another.

Machine learning Life cycle


Machine learning life cycle involves seven major steps, which are given
below:

o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment

The most important thing in the complete process is to understand
the problem and to know its purpose. Therefore, before starting the
life cycle, we need to understand the problem, because a good result
depends on a good understanding of the problem.

In the complete life cycle process, to solve a problem we create a
machine learning system called a "model", and this model is created
by "training" it. But to train a model we need data; hence, the life
cycle starts with collecting data.

1. Gathering Data:
Data gathering is the first step of the machine learning life cycle.
The goal of this step is to identify the different data sources and
obtain the data the problem requires, as data can be collected from
various sources such as files, databases, the internet, or mobile
devices. It is one of the most important steps of the life cycle:
the quantity and quality of the collected data determine the quality
of the output, and the more data we have, the more accurate the
prediction will be.

This step includes the tasks below:

o Identify various data sources
o Collect data
o Integrate the data obtained from different sources

By performing these tasks, we get a coherent set of data, also
called a dataset, which will be used in the further steps.

2. Data preparation
After collecting the data, we need to prepare it for the further
steps. Data preparation is the step where we put our data into a
suitable place and prepare it for use in machine learning training.

In this step, we first put all the data together and then randomize
its ordering.

This step can be further divided into two processes:

o Data exploration:
It is used to understand the nature of the data we have to work
with. We need to understand the characteristics, format, and
quality of the data.
A better understanding of the data leads to a more effective
outcome. Here we look for correlations, general trends, and outliers.
o Data pre-processing:
The next step is preprocessing the data for analysis.

3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into
a usable format. It involves cleaning the data, selecting the
variables to use, and transforming the data into a proper format to
make it more suitable for analysis in the next step. It is one of the
most important steps of the complete process; cleaning the data is
required to address quality issues.

The collected data is not always useful to us, as some of it may be
irrelevant. In real-world applications, collected data may have
various issues, including:

o Missing values
o Duplicate data
o Invalid data
o Noise

So, we use various filtering techniques to clean the data (a short
sketch with assumed tools follows).

It is mandatory to detect and remove these issues because they can
negatively affect the quality of the outcome.
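As an illustrative sketch of such cleaning, assuming the pandas library; the toy table and its column names are made up:

# A minimal data-wrangling sketch with pandas: handling missing values
# and duplicates. The toy table and column names are made up.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 25, None, 40],
    "salary": [50000, 50000, 60000, None],
})

df = df.drop_duplicates()                            # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())     # fill missing values
df = df.dropna(subset=["salary"])                    # drop rows missing a salary
print(df)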

4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step.
This step involves:

o Selection of analytical techniques
o Building models
o Reviewing the result

The aim of this step is to build a machine learning model that
analyzes the data using various analytical techniques, and to review
the outcome. It starts with determining the type of problem, where we
select machine learning techniques such as classification, regression,
cluster analysis, or association; we then build the model using the
prepared data and evaluate it.

Hence, in this step, we take the data and use machine learning
algorithms to build the model.

5. Train Model
The next step is to train the model. In this step, we train our model
to improve its performance and obtain a better outcome for the
problem. We use datasets to train the model with various machine
learning algorithms. Training is required so that the model can learn
the various patterns, rules, and features.

6. Test Model
Once our machine learning model has been trained on a given dataset,
we test the model. In this step, we check the accuracy of our model
by providing it with a test dataset.

Testing the model determines its percentage accuracy as per the
requirements of the project or problem.

7. Deployment
The last step of the machine learning life cycle is deployment, where
we deploy the model in a real-world system.

If the model produces accurate results as per our requirements at an
acceptable speed, we deploy it in the real system. Before deploying
the project, we check whether it keeps improving its performance
using the available data. The deployment phase is similar to making
the final report for a project.

Difference between Artificial Intelligence and Machine Learning
AI is a bigger concept to create intelligent machines that can simulate human thinking
capability and behavior, whereas machine learning is an application or subset of AI
that allows machines to learn from data without being programmed explicitly.

Artificial Intelligence
Artificial intelligence is a field of computer science which makes a
computer system that can mimic human intelligence. It is composed of
the two words "artificial" and "intelligence", which together mean
"a human-made thinking power." Hence we can define it as:

Artificial intelligence is a technology with which we can create
intelligent systems that can simulate human intelligence.

On the basis of capabilities, AI can be divided into three types:

o Weak AI
o General AI
o Strong AI

Currently, we are working with weak AI and general AI. The future of
AI is strong AI, which is said to be more intelligent than humans.

Machine learning
Machine learning is about extracting knowledge from data. It can
be defined as:

Machine learning is a subfield of artificial intelligence, which enables machines to
learn from past data or experiences without being explicitly programmed.

Machine learning enables a computer system to make predictions or
take decisions using historical data without being explicitly
programmed. Machine learning uses a massive amount of structured
and semi-structured data so that a machine learning model can
generate accurate results or give predictions based on that data.

It can be divided into three types:

o Supervised learning
o Unsupervised learning
o Reinforcement learning

Key differences between Artificial Intelligence (AI) and Machine Learning (ML):

o Artificial intelligence is a technology which enables a machine to
simulate human behavior, whereas machine learning is a subset of AI
which allows a machine to automatically learn from past data without
being programmed explicitly.

o The goal of AI is to make a smart computer system, like humans, to
solve complex problems; the goal of ML is to allow machines to learn
from data so that they can give accurate output.

o In AI, we make intelligent systems to perform any task like a
human; in ML, we teach machines with data to perform a particular
task and give an accurate result.

o Machine learning and deep learning are the two main subsets of AI;
deep learning is the main subset of machine learning.

o AI has a very wide scope; machine learning has a limited scope.

o AI aims to create an intelligent system which can perform various
complex tasks; machine learning aims to create machines that can
perform only those specific tasks for which they are trained.

o An AI system is concerned with maximizing the chances of success;
machine learning is mainly concerned with accuracy and patterns.

o The main applications of AI are Siri, customer support using
chatbots, expert systems, online game playing, intelligent humanoid
robots, etc.; the main applications of machine learning are online
recommender systems, Google search algorithms, Facebook auto friend
tagging suggestions, etc.

o On the basis of capabilities, AI can be divided into three types:
weak AI, general AI, and strong AI; machine learning can be divided
into three types: supervised learning, unsupervised learning, and
reinforcement learning.

o AI includes learning, reasoning, and self-correction; ML includes
learning and self-correction when introduced to new data.

o AI deals with structured, semi-structured, and unstructured data;
machine learning deals with structured and semi-structured data.
Hypothesis in Machine Learning
Hypothesis is a common term in machine learning and data science
projects. As we know, machine learning is one of the most powerful
technologies across the world, helping us predict results based on
past experience. Data scientists and ML professionals conduct
experiments that aim to solve a problem, and they begin by making an
initial assumption about the solution.

This assumption in machine learning is known as a hypothesis. In
machine learning, the terms hypothesis and model are often used
interchangeably. However, a hypothesis is an assumption made by
scientists, whereas a model is a mathematical representation that is
used to test the hypothesis. In this topic, "Hypothesis in Machine
Learning," we will discuss a few important concepts related to
hypotheses in machine learning and their importance. So, let's start
with a quick introduction to the hypothesis.

What is Hypothesis?
A hypothesis is a supposition or proposed explanation based on
insufficient evidence or assumptions. It is just a guess based on
some known facts that has not yet been proven. A good hypothesis is
testable: it turns out either true or false.

Example: Let's understand the hypothesis with a common example.
Suppose a scientist claims that ultraviolet (UV) light can damage
the eyes; we further assume that it may also cause blindness.

In this example, the scientist just claims that UV rays are harmful
to the eyes, and we assume they may cause blindness, which may or
may not turn out to be true. Such assumptions are called hypotheses.

Hypothesis in Machine Learning (ML)


The hypothesis is one of the commonly used statistical concepts in
machine learning. It is specifically used in supervised machine
learning, where an ML model learns a function that best maps inputs
to the corresponding outputs with the help of an available dataset.
In supervised learning techniques, the main aim is to determine the
possible hypothesis, out of the hypothesis space, that best maps
inputs to the correct outputs.

There are common methods for finding a possible hypothesis in the
hypothesis space, where the hypothesis space is represented by an
uppercase H and a hypothesis by a lowercase h. These are defined
as follows:

Hypothesis space (H):


The hypothesis space is defined as the set of all possible legal
hypotheses; hence it is also known as a hypothesis set. It is
used by supervised machine learning algorithms to determine the best
possible hypothesis to describe the target function, i.e., the one
that best maps inputs to outputs.

It is often constrained by the framing of the problem, the choice of
model, and the choice of model configuration.

Hypothesis (h):
It is defined as the approximate function that best describes the
target in supervised machine learning algorithms. It is primarily
based on the data as well as the bias and restrictions applied to
the data.

Hence, a hypothesis (h) is a single function that maps inputs to
proper outputs and can be evaluated and used to make predictions.

For a simple linear case, the hypothesis (h) can be formulated in
machine learning as follows:

y = mx + b

Where:

y: output (range)

m: slope of the line that divides the data, i.e., the change in y
divided by the change in x

x: input (domain)

b: intercept (constant)
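As an illustration, here is a minimal sketch of choosing one hypothesis h(x) = mx + b out of the hypothesis space of all lines, assuming numpy is available; the training points are made up:

# A minimal sketch: picking one hypothesis h(x) = m*x + b out of the space H
# of all lines, by fitting m and b to made-up training data. Assumes numpy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])      # roughly y = 2x

m, b = np.polyfit(x, y, deg=1)          # choose the best line (hypothesis) h

def h(x_new):
    return m * x_new + b                # the selected hypothesis

print(m, b)        # slope and intercept of h
print(h(5.0))      # prediction of h for an unseen input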

Example: Let's understand the hypothesis (h) and hypothesis space
(H) with a two-dimensional coordinate plane showing the distribution
of data. Assume we have some test data for which the ML algorithm
must predict outputs, and we divide the coordinate plane in a way
that helps predict the output for the given test data. Depending on
the data, the algorithm, and the constraints, the coordinate plane
can be divided in several different ways.

From this example, we can conclude that:

The hypothesis space (H) is the set of all legal possible ways to
divide the coordinate plane so that inputs are best mapped to proper
outputs.

Each individual possible way is called a hypothesis (h).

Hypothesis in Statistics
Similar to a hypothesis in machine learning, a hypothesis in
statistics is also an assumption about the output. However, it is
falsifiable, which means it can fail in the presence of sufficient
evidence.

Unlike in machine learning, we cannot simply accept a hypothesis in
statistics, because it is an imagined result based on probability.
Before starting work on an experiment, we must be aware of two
important types of hypotheses:

o Null Hypothesis: A null hypothesis is a type of statistical
hypothesis which states that there is no statistically significant
effect in the given set of observations. It is also known as a
conjecture and is used in quantitative analysis to test theories
about markets, investment, and finance to decide whether an idea
is true or false.
o Alternative Hypothesis: An alternative hypothesis is a direct
contradiction of the null hypothesis, which means that if one of
the two hypotheses is true, the other must be false. In other
words, an alternative hypothesis is a type of statistical
hypothesis which states that there is some significant effect in
the given set of observations.

Inductive Bias & Variance in Machine Learning
Machine learning is a branch of artificial intelligence which allows
machines to perform data analysis and make predictions. However, if
the machine learning model is not accurate, it makes prediction
errors, and these prediction errors are usually known as bias and
variance. In machine learning, these errors will always be present,
as there is always a slight difference between the model's
predictions and the actual values. The main aim of ML/data science
analysts is to reduce these errors in order to get more accurate
results. We are going to discuss bias and variance, the
bias-variance trade-off, underfitting, and overfitting.

Errors in Machine Learning

In machine learning, an error is a measure of how accurately an
algorithm can make predictions for a previously unseen dataset. On
the basis of these errors, we select the machine learning model that
performs best on the particular dataset. There are mainly two types
of errors in machine learning:

o Reducible errors: These errors can be reduced to improve the model
accuracy. They can be further classified into bias and variance.

o Irreducible errors: These errors will always be present in the
model regardless of which algorithm is used. Their cause is unknown
variables whose influence on the output cannot be reduced.

What is Bias?
In general, a machine learning model analyzes the data, finds
patterns in it, and makes predictions. While training, the model
learns these patterns in the dataset and applies them to the test
data for prediction. The difference that occurs between the values
predicted by the model and the actual/expected values is known as
bias error, or error due to bias. Bias can be seen as the inability
of a machine learning algorithm such as linear regression to capture
the true relationship between the data points. Every algorithm
begins with some amount of bias, because bias arises from
assumptions in the model that make the target function simpler to
learn. A model has either:

o Low bias: A low-bias model makes fewer assumptions about the form
of the target function.
o High bias: A high-bias model makes more assumptions and becomes
unable to capture the important features of the dataset. A high-bias
model also cannot perform well on new data.

Generally, a linear algorithm has high bias, which lets it learn
fast. The simpler the algorithm, the more bias it is likely to
introduce, whereas a nonlinear algorithm often has low bias.

Some examples of machine learning algorithms with low bias are
decision trees, k-nearest neighbours, and support vector machines.
Algorithms with high bias include linear regression, linear
discriminant analysis, and logistic regression.

Ways to reduce High Bias:

High bias mainly occurs due to an overly simple model. Below are
some ways to reduce high bias (a short sketch follows the list):

o Increase the input features, as the model is underfitted.
o Decrease the regularization term.
o Use more complex models, for example by including some polynomial
features.
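An illustrative sketch of the last point, assuming scikit-learn; the quadratic toy data is made up:

# A minimal sketch of reducing high bias by adding polynomial features.
# Assumes scikit-learn; the curved toy data is made up.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = X.ravel() ** 2                            # a curved target a plain line underfits

plain = LinearRegression().fit(X, y)                        # high-bias model
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LinearRegression()).fit(X, y)          # more complex model

print(plain.score(X, y))   # poor fit (underfitting)
print(poly.score(X, y))    # near-perfect fit after adding polynomial features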

What is a Variance Error?


Variance specifies how much the prediction would vary if different
training data were used. In simple words, variance tells how much a
random variable differs from its expected value. Ideally, a model
should not vary too much from one training dataset to another, which
means the algorithm should be good at understanding the hidden
mapping between input and output variables. Variance errors are
either low variance or high variance.

Low variance means there is a small variation in the prediction of
the target function with changes in the training dataset. High
variance means a large variation in the prediction of the target
function with changes in the training dataset.

A model with high variance learns a lot and performs well on the
training dataset, but it does not generalize well to unseen data. As
a result, such a model gives good results on the training dataset
but shows high error rates on the test dataset.

Since a high-variance model learns too much from the dataset, it
leads to overfitting. A model with high variance has the following
problems:

o A high-variance model leads to overfitting.
o It increases model complexity.

Usually, nonlinear algorithms, which have a lot of flexibility to
fit the model, have high variance.

Some examples of machine learning algorithms with low variance are
linear regression, logistic regression, and linear discriminant
analysis. Algorithms with high variance include decision trees,
support vector machines, and k-nearest neighbours.

Ways to Reduce High Variance (a short sketch follows the list):

o Reduce the input features or number of parameters, as the model is
overfitted.
o Do not use an overly complex model.
o Increase the training data.
o Increase the regularization term.
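An illustrative sketch of increasing the regularization term, assuming scikit-learn's Ridge regression; the noisy toy data is made up:

# A minimal sketch of taming high variance by raising the regularization
# term. Assumes scikit-learn; the noisy toy data is made up.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))                     # few samples, many features
y = X[:, 0] + rng.normal(scale=0.1, size=20)      # only feature 0 matters

weak = Ridge(alpha=0.001).fit(X, y)    # little regularization: larger weights
strong = Ridge(alpha=10.0).fit(X, y)   # strong regularization: shrunken weights

print(abs(weak.coef_).sum(), abs(strong.coef_).sum())  # strong has smaller weights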

Different Combinations of Bias-Variance


There are four possible combinations of bias and variance:

1. Low bias, low variance: The combination of low bias and low
variance is the ideal machine learning model. However, it is not
practically possible.
2. Low bias, high variance: With low bias and high variance, model
predictions are inconsistent but accurate on average. This case
occurs when the model learns a large number of parameters, which
leads to overfitting.
3. High bias, low variance: With high bias and low variance,
predictions are consistent but inaccurate on average. This case
occurs when a model does not learn well from the training dataset or
uses a small number of parameters. It leads to underfitting.
4. High bias, high variance: With high bias and high variance,
predictions are both inconsistent and inaccurate on average.

How to identify High variance or High Bias?


High variance can be identified if the model has:
o Low training error and high test error.

High bias can be identified if the model has:

o High training error, with the test error almost the same as the
training error (a short diagnostic sketch follows).
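A minimal diagnostic sketch, assuming scikit-learn; the synthetic regression data and the choice of a decision tree as the flexible model are made up for illustration:

# Compare training error with test error to spot high variance (big gap)
# or high bias (both errors high). Assumes scikit-learn.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeRegressor().fit(X_tr, y_tr)   # flexible, high-variance model
print("train R^2:", model.score(X_tr, y_tr))      # near 1.0 (low training error)
print("test  R^2:", model.score(X_te, y_te))      # noticeably lower: high variance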

Bias-Variance Trade-Off
While building a machine learning model, it is really important to
take care of bias and variance in order to avoid overfitting and
underfitting. If the model is very simple, with few parameters, it
may have low variance and high bias. Whereas if the model has a large
number of parameters, it will have high variance and low bias. So we
need to strike a balance between bias and variance errors, and this
balance between the bias error and variance error is known as the
bias-variance trade-off.

For accurate predictions, a model needs low variance and low bias.
But this is not fully achievable, because bias and variance are
related to each other:
o If we decrease the variance, the bias will increase.
o If we decrease the bias, the variance will increase.

The bias-variance trade-off is a central issue in supervised
learning. Ideally, we want a model that accurately captures the
regularities in the training data and simultaneously generalizes
well to unseen data. Unfortunately, doing both perfectly at once is
not possible: a high-variance algorithm may perform well on training
data but overfit to noisy data, whereas a high-bias algorithm
generates a much simpler model that may fail to capture important
regularities in the data. So we need to find a sweet spot between
bias and variance to build an optimal model.

Hence, the bias-variance trade-off is about finding the sweet spot
that balances bias and variance errors.

Cross-Validation in Machine Learning


Cross-validation is a technique for validating model performance by
training the model on a subset of the input data and testing it on a
previously unseen subset of the input data. We can also say that it
is a technique to check how a statistical model generalizes to an
independent dataset.

In machine learning, there is always a need to test the stability of
the model; we cannot judge a model based only on the training
dataset. For this purpose, we reserve a particular sample of the
dataset that was not part of the training dataset, and we test the
model on that sample before deployment. This complete process comes
under cross-validation, and it is somewhat different from the
general train/test split.

Hence the basic steps of cross-validation are:

o Reserve a subset of the dataset as a validation set.
o Train the model using the training dataset.
o Evaluate model performance using the validation set. If the model
performs well on the validation set, proceed to the further steps;
otherwise, check for issues.

Methods used for Cross-Validation


There are some common methods that are used for cross-validation.
These methods are given below:

1. Validation set approach
2. Leave-p-out cross-validation
3. Leave-one-out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
6. Holdout method

Validation Set Approach


In the validation set approach, we divide the input dataset into a
training set and a test or validation set; each subset gets 50% of
the dataset.

This has one big disadvantage: since we use only 50% of the data to
train the model, the model may fail to capture important information
in the dataset. It also tends to give an underfitted model (see the
sketch below).
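A minimal sketch of the 50/50 validation set approach, assuming scikit-learn and its bundled iris dataset:

# Validation set approach: a single 50/50 split. Assumes scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0)        # 50% train, 50% validation

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_val, y_val))                # accuracy on the held-out half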

Leave-P-out cross-validation
In this approach, p data points are left out of the training data.
If there are n data points in the original input dataset, then n-p
data points are used as the training set and the p data points as
the validation set. This process is repeated for all possible
combinations, and the average error is calculated to judge the
effectiveness of the model.

The disadvantage of this technique is that it can be computationally
expensive for large p (see the sketch below).
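A minimal leave-p-out sketch with p = 2, assuming scikit-learn; the four tiny samples are made up. Note how quickly the number of splits grows with n and p:

# Leave-p-out cross-validation with p = 2. Assumes scikit-learn.
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(8).reshape(4, 2)                  # 4 tiny made-up samples
lpo = LeavePOut(p=2)

print(lpo.get_n_splits(X))                      # C(4, 2) = 6 train/test splits
for train_idx, test_idx in lpo.split(X):
    print("train:", train_idx, "test:", test_idx)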

Leave one out cross-validation


This method is similar to leave-p-out cross-validation, but with
p = 1. For each learning round, only one data point is reserved for
testing, and the remaining dataset is used to train the model. The
process repeats for each data point, so for n samples we get n
different training sets and n test sets. It has the following
features:

o The bias is minimal, as all the data points are used.
o The process is executed n times, so the execution time is high.
o This approach leads to high variation in testing the effectiveness
of the model, as we iteratively check against a single data point
(see the sketch below).
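A minimal leave-one-out sketch, assuming scikit-learn and its bundled iris dataset:

# Leave-one-out cross-validation: n samples give n train/test rounds.
# Assumes scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=LeaveOneOut())  # one score per sample
print(len(scores), scores.mean())               # 150 rounds; average accuracy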

K-Fold Cross-Validation
The k-fold cross-validation approach divides the input dataset into
k groups of samples of equal size, called folds. For each learning
round, the prediction function uses k-1 folds for training, and the
remaining fold is used as the test set. This is a very popular CV
approach because it is easy to understand and the output is less
biased than with other methods.

The steps for k-fold cross-validation are:

o Split the input dataset into k groups.
o For each group:
o Take one group as the reserve or test dataset.
o Use the remaining groups as the training dataset.
o Fit the model on the training set and evaluate its performance
using the test set.

Let's take an example of 5-fold cross-validation, where the dataset
is grouped into 5 folds. In the 1st iteration, the first fold is
reserved for testing the model, and the rest are used for training.
In the 2nd iteration, the second fold is used to test the model, and
the rest are used for training. This process continues until each
fold has been used as the test fold (see the sketch below).
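A minimal 5-fold sketch, assuming scikit-learn and its bundled iris dataset:

# 5-fold cross-validation: each fold serves as the test set exactly once.
# Assumes scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold)   # one accuracy per fold
print(scores, scores.mean())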


Stratified k-fold cross-validation
This technique is similar to k-fold cross-validation, with a few
small changes. It is based on the concept of stratification, the
process of rearranging the data to ensure that each fold or group is
a good representative of the complete dataset. It is one of the best
approaches for dealing with bias and variance.

It can be understood with the example of housing prices: the price
of some houses can be much higher than that of other houses. To
tackle such situations, a stratified k-fold cross-validation
technique is useful (see the sketch below).
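A minimal stratified k-fold sketch, assuming scikit-learn; the imbalanced labels are made up:

# Stratified k-fold: each fold keeps roughly the same class proportions
# as the full (imbalanced, made-up) dataset. Assumes scikit-learn.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((20, 1))                            # features don't matter here
y = np.array([0] * 16 + [1] * 4)                 # imbalanced labels: 80% / 20%

skf = StratifiedKFold(n_splits=4)
for train_idx, test_idx in skf.split(X, y):
    print("test labels:", y[test_idx])           # every fold keeps the 4:1 ratio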

Holdout Method
This is the simplest cross-validation technique of all. In this
method, we remove a subset of the training data, train the model on
the remaining part of the dataset, and use the removed subset to get
prediction results.

The error that occurs in this process tells how well our model will
perform on an unknown dataset. Although this approach is simple to
perform, it still faces the issue of high variance, and it sometimes
produces misleading results.

Comparison of Cross-validation to train/test split in Machine Learning
o Train/test split: The input data is divided into two parts, a
training set and a test set, in a ratio such as 70:30 or 80:20. Its
major disadvantage is that it provides high variance.
o Training data: The training data is used to train the model, and
the dependent variable is known.
o Test data: The test data is used to make predictions from the
model that has already been trained on the training data. It has the
same features as the training data but is not part of it.
o Cross-validation dataset: Cross-validation overcomes the
disadvantage of the train/test split by splitting the dataset into
several groups of train/test splits and averaging the results. It
can be used when we want to optimize a model trained on the training
dataset for the best performance. It is more efficient than a single
train/test split, as every observation is used for both training and
testing.

Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are
given below:

o Under ideal conditions, cross-validation provides optimal output,
but with inconsistent data it may produce drastically misleading
results. This is one of its big disadvantages, as there is no
certainty about the type of data in machine learning.
o In predictive modeling, the data evolves over time, which can
create differences between the training and validation sets. For
example, if we create a model for predicting stock market values and
train it on the previous 5 years of stock values, the realistic
values for the next 5 years may be drastically different, so it is
difficult to expect correct output in such situations.

Applications of Cross-Validation
o This technique can be used to compare the performance of different
predictive modeling methods.
o It has great scope in the field of medical research.
o It can also be used for meta-analysis, as it is already being used
by data scientists in the field of medical statistics.
