Machine learning notes_Unit-1
In the real world, we are surrounded by humans, who learn from their experiences, and by computers
or machines, which work on our instructions. But can a machine also learn from experience or past
data as a human does? This is where Machine Learning comes in.
Machine learning is a subset of artificial intelligence that focuses primarily on the creation of
algorithms that enable a computer to learn independently from data and previous experience. Arthur
Samuel first used the term "machine learning" in 1959. It can be summarized as follows:
Without being explicitly programmed, machine learning enables a machine to automatically learn from
data, improve its performance with experience, and make predictions.
A machine is said to learn if its performance improves as it gains more data.
Facebook provides an automatic friend-tagging suggestion feature. Whenever we upload a photo with
our Facebook friends, we automatically get a tagging suggestion with their names; the technology
behind this is machine learning's face detection and recognition algorithms.
It is based on the Facebook project named "DeepFace," which is responsible for face recognition and
person identification in the picture.
2. Speech Recognition
While using Google, we get an option to "Search by voice." This comes under speech recognition, a
popular application of machine learning.
Speech recognition is the process of converting voice instructions into text; it is also known as
"speech to text" or "computer speech recognition." At present, machine learning algorithms are
widely used by various applications of speech recognition. Google Assistant, Siri, Cortana,
and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with
the shortest route and predicts the traffic conditions.
It predicts whether traffic is clear, slow-moving, or heavily congested
in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies, such
as Amazon and Netflix, for product recommendation to the user. Whenever we search for a
product on Amazon, we start getting advertisements for the same product while surfing the
internet on the same browser, and this is because of machine learning.
Google understands the user's interests using various machine learning algorithms and suggests
products accordingly.
Similarly, when we use Netflix, we find recommendations for series, movies, etc.,
and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars, in which machine
learning plays a significant role. Tesla, a popular car manufacturer, is working on
self-driving cars. It trains machine learning models to detect people and
objects while driving.
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We
receive important mail in our inbox with the important symbol and spam emails in our spam
box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
o Content filter
o Header filter
o Rules-based filters
o Permission filters
Some machine learning algorithms, such as the Multi-Layer Perceptron, Decision Tree, and Naïve Bayes
classifier, are used for email spam filtering and malware detection.
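As an illustration of one of the algorithms named above, here is a minimal sketch of a Naïve Bayes spam classifier in plain Python. The four training messages, the "ham" label for non-spam mail, and the helper names log_score and classify are assumptions made for this example; it is not how Gmail implements its filters.

```python
import math
from collections import Counter

# Tiny hand-made training set; real filters learn from far more mail.
training_mail = [
    ("win money now", "spam"),
    ("free prize win", "spam"),
    ("meeting at noon", "ham"),
    ("project meeting notes", "ham"),
]

# Count how often each word appears under each label.
word_counts = {"spam": Counter(), "ham": Counter()}
label_counts = Counter()
for text, label in training_mail:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {word for counts in word_counts.values() for word in counts}

def log_score(text, label):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(label_counts[label] / sum(label_counts.values()))
    for word in text.split():
        likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
        score += math.log(likelihood)
    return score

def classify(text):
    """Pick the label with the higher log score."""
    return max(("spam", "ham"), key=lambda label: log_score(text, label))

print(classify("win a free prize"))
print(classify("project meeting"))
```

Add-one smoothing keeps an unseen word (such as "a" above) from driving a probability to zero.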
We have various virtual personal assistants, such as Google Assistant, Alexa, Cortana, and Siri. As the name
suggests, they help us find information using our voice instructions. These assistants can help
us in various ways just through voice instructions, such as playing music, calling someone, opening an email,
or scheduling an appointment.
These assistants record our voice instructions, send them to a server in the cloud, decode them using
ML algorithms, and act accordingly.
For each genuine transaction, the output is converted into some hash values, and these values become
the input for the next round. Each genuine transaction follows a specific pattern, which changes
for a fraudulent transaction; the model detects this change, which makes our online transactions more secure.
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of
ups and downs in share prices, so a long short-term memory (LSTM) neural network is
used to predict stock market trends.
In medical science, machine learning is used for disease diagnosis. With this, medical technology is
growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.
Nowadays, if we visit a new place and do not know the language, it is not a problem at all,
because machine learning helps us by converting text into languages we know. Google's
GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation
system that translates text into our familiar language, and this is called automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm, which
can be used together with image recognition to translate text from one language to another.
Machine learning is a branch of Artificial Intelligence and computer science that helps build a model based on training
data and make predictions and decisions without being explicitly programmed. Machine Learning is
used in various applications such as email filtering, speech recognition, computer vision, self-driving
cars, Amazon product recommendation, etc.
The major issue that arises while using machine learning algorithms is the lack of both quality and
quantity of data. Data plays a vital role in machine learning, and
many data scientists report that inadequate, noisy, and unclean data severely
hamper machine learning algorithms. For example, a simple task may require thousands of training
samples, and an advanced task such as speech or image recognition may need millions.
Further, data quality is also important for the algorithms to work well, yet poor
data quality is common in machine learning applications. Data quality can be affected by
factors such as the following:
o Noisy data - It is responsible for inaccurate predictions, which affect decisions as well as
accuracy in classification tasks.
o Incorrect data - It is responsible for faulty results in machine
learning models and therefore reduces their accuracy.
o Generalizing of output data - Sometimes generalizing from the output data
becomes complex, which results in comparatively poor future actions.
As discussed above, data plays a significant role in machine learning, and it must be of good
quality as well. Noisy, incomplete, inaccurate, and unclean data lead to lower accuracy in
classification and low-quality results. Hence, data quality can also be considered a major common
problem while applying machine learning algorithms.
To make sure our model generalizes well, we have to ensure that the training
data is representative of the new cases to which we need to generalize. The training data must cover all
cases that have already occurred as well as those that are occurring.
If we use non-representative training data, the model makes less accurate
predictions. A machine learning model is said to be ideal if it predicts well for generalized cases and
provides accurate decisions. If there is too little training data, the sample contains sampling noise and
is called a non-representative training set. A model trained on it won't be accurate in its predictions, and
it will be biased against one class or group.
Hence, we should use representative training data to guard against bias and make
accurate predictions without any drift.
A machine learning model operates within a specific context, and when that context changes, the model
can give poor recommendations; this is known as drift. For example, at a specific time a customer is
looking for some gadgets, but the customer's requirements change over time while the machine learning
model still shows the same recommendations, even though the customer's expectations have
changed. This is called data drift. It generally occurs when new data is introduced or the
interpretation of data changes. We can overcome it by regularly monitoring the data and updating
the model accordingly.
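The monitoring mentioned above can be sketched as a simple comparison between the data the model was trained on and recent data. The feature, its values, the drift_score helper, and the alert threshold are all assumptions chosen for illustration; production systems typically use more robust statistical tests.

```python
from statistics import mean, stdev

def drift_score(reference, recent):
    """Shift of the recent mean, measured in reference standard deviations."""
    return abs(mean(recent) - mean(reference)) / stdev(reference)

# Hypothetical feature: share of a customer's clicks that land on gadget pages.
reference_window = [0.62, 0.58, 0.65, 0.60, 0.63, 0.59]  # when the model was trained
recent_window = [0.21, 0.18, 0.25, 0.20, 0.22, 0.19]     # current behaviour

THRESHOLD = 3.0  # assumed alert level; tune per application
score = drift_score(reference_window, recent_window)
needs_retraining = score > THRESHOLD

print(round(score, 2), needs_retraining)
```

When the score crosses the threshold, the model is a candidate for retraining on newer data.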
Although Machine Learning and Artificial Intelligence are continuously growing in the market,
these fields are still young in comparison to others. The shortage of skilled
people is also an issue. Hence, we need people with in-depth knowledge of mathematics,
science, and technology to develop and manage machine learning systems.
6. Irrelevant features
Although machine learning models are intended to give the best possible outcome, if we feed garbage
data as input, then the result will also be garbage. Hence, we should use relevant features in our
training sample. A machine learning model is said to be good if the training data has a good set of features
with few or no irrelevant features.
This issue is also very commonly seen in machine learning models. Machine learning models
can be highly effective at producing accurate results, but they are time-consuming. Slow programs,
excessive requirements, and overloaded data mean that it takes longer than expected to produce
accurate results. Delivering accurate results also needs continuous maintenance and monitoring
of the model.
The machine learning process is very complex, which is another major issue faced by machine
learning engineers and data scientists. Machine Learning and Artificial Intelligence are
relatively new technologies that are still in an experimental phase and are continuously changing over time.
Much of the work proceeds by trial and error; hence the probability of error is higher than
expected. Further, the process includes analyzing the data, removing data bias, training on the data, applying
complex mathematical calculations, etc., which makes the procedure more complicated and quite tedious.
Underfitting occurs when the model is unable to establish an accurate relationship between input and
output variables. It is like trying to fit into undersized jeans. It signifies that the model is too simple to
establish a precise relationship.
Overfitting refers to a machine learning model that fits its training data too closely, including the
noise and bias in it, which negatively affects its performance on new data. It is like trying to fit
into oversized jeans. Unfortunately, this is one of the
significant issues faced by machine learning professionals. It means that the algorithm has been trained
with noisy and biased data, which will affect its overall performance. Let’s understand this with the
help of an example. Consider a model trained to differentiate between a cat, a rabbit, a dog, and
a tiger. The training data contains 1000 cats, 1000 dogs, 1000 tigers, and 4000 rabbits. Then there is a
considerable probability that it will identify a cat as a rabbit. In this example, we had a vast amount
of data, but it was biased; hence the prediction was negatively affected.
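The imbalance in this example can be quantified with a few lines of Python. The class counts come from the text; the inverse-frequency class weights are one common way to compensate and are shown here as an assumption, not as part of the original example.

```python
from collections import Counter

# Class counts taken from the example in the text.
train_counts = Counter({"cat": 1000, "dog": 1000, "tiger": 1000, "rabbit": 4000})

total = sum(train_counts.values())
majority_class, majority_count = train_counts.most_common(1)[0]

# A model that ignores its input and always answers with the majority class
# already looks deceptively good on data with this class mix.
baseline_accuracy = majority_count / total

# Inverse-frequency weights: rare classes count more during training.
class_weights = {label: total / (len(train_counts) * count)
                 for label, count in train_counts.items()}

print(majority_class, round(baseline_accuracy, 3))
print(class_weights)
```

A model that always answers "rabbit" is right for 4000 of the 7000 training images, which is why a biased training set pulls predictions toward the over-represented class.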
4. Reinforcement Learning
1. Supervised Machine Learning
As its name suggests, supervised machine learning is based on supervision. In the
supervised learning technique, we train the machine using a "labelled" dataset, and based
on the training, the machine predicts the output. Here, labelled data means that the
inputs are already mapped to their outputs. More precisely, we first train
the machine with inputs and their corresponding outputs, and then we ask the machine to predict
the output for the test dataset.
Let's understand supervised learning with an example. Suppose we have an input dataset of
cat and dog images. First, we train the machine to understand the
images using features such as the shape and size of the tail, the shape of the eyes, colour, and height (dogs
are usually taller, cats smaller). After training, we input the picture of a cat and
ask the machine to identify the object and predict the output. The machine is now well
trained, so it checks all the features of the object, such as height, shape, colour, eyes, ears,
and tail, and finds that it's a cat. So, it puts it in the Cat category. This is how
the machine identifies objects in Supervised Learning.
The main goal of the supervised learning technique is to map the input variable(x) with the
output variable(y). Some real-world applications of supervised learning are Risk Assessment,
Fraud Detection, Spam filtering, etc.
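The mapping from input x to output y can be illustrated with a minimal sketch. It uses a 1-nearest-neighbour rule, chosen here only because it is short; the two features (height and weight), the values, and the predict helper are made up for the example and are not taken from a real dataset.

```python
import math

# Labelled training data: (height_cm, weight_kg) -> label. Values are made up.
training_set = [
    ((25.0, 4.0), "cat"),
    ((23.0, 3.5), "cat"),
    ((28.0, 5.0), "cat"),
    ((55.0, 22.0), "dog"),
    ((60.0, 30.0), "dog"),
    ((48.0, 18.0), "dog"),
]

def predict(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training_set, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]

print(predict((26.0, 4.2)))
print(predict((52.0, 20.0)))
```

The labels in training_set play the role of the supervision: the model can only answer with categories it has seen during training.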
Supervised machine learning can be classified into two types of problems, which are given
below:
o Classification
o Regression
a) Classification
Classification algorithms are used to solve classification problems, in which the output
variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. The classification
algorithms predict the categories present in the dataset. Some real-world examples of
classification are spam detection, email filtering, etc.
Some popular classification algorithms are given below:
b) Regression
Regression algorithms are used to solve regression problems, in which the output variable is
continuous and is modelled as a function of the input variables (a linear function in the simplest
case). These are used to predict continuous quantities, such as market trends, weather, etc.
o Lasso Regression
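As a sketch of the simplest regression case, the following fits a straight line to one input variable with ordinary least squares. The data and the names fit_line, hours, and score are invented for the example; Lasso and other regression variants add terms that are not shown here.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one input variable: y ≈ slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1.
hours = [1, 2, 3, 4, 5]
score = [3, 5, 7, 9, 11]

slope, intercept = fit_line(hours, score)
prediction = slope * 6 + intercept  # predict a continuous value for a new input

print(slope, intercept, prediction)
```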
Advantages:
o Since supervised learning works with a labelled dataset, we can have an exact idea about
the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
o It may predict the wrong output if the test data is different from the training data.
o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image
classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis. This is done by
using medical images and past data labelled with disease conditions. With such a
process, the machine can identify a disease in new patients.
o Fraud Detection - Supervised learning classification algorithms are used for identifying fraudulent
transactions, fraudulent customers, etc. This is done by using historical data to identify the patterns that
can indicate possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used. These
algorithms classify an email as spam or not spam. The spam emails are sent to the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech recognition. The
algorithm is trained with voice data, and various identifications can be done using the same,
such as voice-activated passwords, voice commands, etc.
Unsupervised learning is different from the supervised learning technique; as its name
suggests, there is no need for supervision. In unsupervised machine learning, the
machine is trained using an unlabelled dataset, and the machine predicts the output without
any supervision.
In unsupervised learning, the models are trained with data that is neither classified nor
labelled, and the model acts on that data without any supervision.
The main aim of an unsupervised learning algorithm is to group or categorize the unsorted
dataset according to similarities, patterns, and differences. Machines are instructed to
find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely. Suppose there is a basket of fruit
images, and we input it into the machine learning model. The images are totally unknown to
the model, and the task of the machine is to find the patterns and categories of the objects.
The machine will discover patterns and differences, such as differences in colour and
shape, and predict the output when it is tested with the test dataset.
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups in the data. It is
a way to group objects into clusters such that the objects with the most similarities remain
in one group and have few or no similarities with the objects of other groups. An example
of clustering is grouping customers by their purchasing behaviour.
o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
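To make the idea of grouping by similarity concrete, here is a minimal sketch of k-means clustering on a single feature. K-means is used only as a familiar example of a clustering algorithm; the spending values, the starting centroids, and the k_means_1d helper are assumptions for illustration.

```python
from statistics import mean

def k_means_1d(values, centroids, iterations=10):
    """Plain k-means on one feature, starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to the nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical monthly spend of eight customers; no labels are given.
monthly_spend = [12, 15, 14, 90, 95, 88, 13, 92]
centroids, clusters = k_means_1d(monthly_spend, centroids=[0, 100])

print(centroids)
print(clusters)
```

The algorithm separates low spenders from high spenders without ever being told which customer belongs to which group.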
2) Association
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth
algorithm.
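Association rule algorithms such as Apriori rank rules of the form "if A is bought, B is bought too" by measures like support and confidence. The sketch below computes both for a hand-made list of shopping baskets; the transactions and the helper names support and confidence are assumptions for illustration, and no candidate-pruning step of the named algorithms is shown.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent appears in baskets that contain the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))
print(confidence({"bread"}, {"butter"}))
```

Here bread and butter appear together in three of the five baskets, and three of the four baskets that contain bread also contain butter.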
Advantages:
o These algorithms can be used for more complicated tasks than the supervised ones because
they work on unlabelled datasets.
o Unsupervised algorithms are preferable for various tasks, as an unlabelled dataset is
easier to obtain than a labelled one.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled
and the algorithms are not trained with the exact output beforehand.
o Working with unsupervised learning is more difficult, as it works with an unlabelled dataset
that is not mapped to an output.
o Network Analysis: Unsupervised learning is used for identifying plagiarism and copyright issues in
document network analysis of text data for scholarly articles.
3. Semi-Supervised Learning
We can picture these algorithms with an example. Supervised learning is where a student is
under the supervision of an instructor both at home and at college. If that student
analyses the same concept alone, without any help from the instructor, it comes under unsupervised
learning. Under semi-supervised learning, the student revises alone after analysing the
same concept under the guidance of an instructor at college.
Advantages:
o It is highly efficient.
Disadvantages:
o Accuracy is low.
4. Reinforcement Learning
In reinforcement learning, unlike supervised learning, there is no labelled data, and agents learn
from their experiences only.
The reinforcement learning process is similar to how a human being learns; for example, a child learns
various things through experience in day-to-day life. An example of reinforcement learning is
playing a game, where the game is the environment, the moves of the agent at each step define states,
and the goal of the agent is to get a high score. The agent receives feedback in terms of punishments
and rewards.
Due to its way of working, reinforcement learning is employed in different fields such as game
theory, operations research, information theory, and multi-agent systems.
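The reward-and-punishment loop described above can be sketched with tabular Q-learning, one widely used reinforcement learning algorithm, chosen here as an assumption since the notes do not name one. The corridor environment, the reward values, and the learning-rate and discount settings are all invented for the example.

```python
import random

N_STATES, GOAL = 5, 4       # corridor cells 0..4, the reward waits in cell 4
ACTIONS = (-1, +1)          # step left or step right
ALPHA, GAMMA = 0.5, 0.9     # learning rate and discount factor (assumed values)

def step(state, action):
    """Environment: move, stay inside the corridor, return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == GOAL:
        return next_state, 1.0, True      # reward for reaching the goal
    return next_state, -0.01, False       # small punishment for every other move

rng = random.Random(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]

for _ in range(500):                       # episodes
    state = 0
    for _ in range(200):                   # cap on steps per episode
        a = rng.randrange(2)               # explore with random moves
        next_state, reward, done = step(state, ACTIONS[a])
        target = reward if done else reward + GAMMA * max(q[next_state])
        q[state][a] += ALPHA * (target - q[state][a])
        state = next_state
        if done:
            break

# Greedy policy read from the learned table for the non-goal cells.
policy = ["left" if q[s][0] > q[s][1] else "right" for s in range(GOAL)]
print(policy)
```

The agent is never told that moving right is correct; it infers this from the rewards it collects while exploring.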
o Video Games:
RL algorithms are very popular in gaming applications, where they are used to reach super-human
performance. Well-known systems built on RL algorithms are AlphaGo and AlphaGo Zero, which play the game of Go.
o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how to
use RL to automatically learn to allocate and schedule computer resources for waiting jobs in
order to minimize average job slowdown.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement learning.
There are different industries that have their vision of building intelligent robots using AI and
Machine learning technology.
o Text Mining
Text mining, one of the major applications of NLP, is now being implemented with the help of
reinforcement learning by Salesforce.
Advantages
o It helps in solving complex real-world problems which are difficult to solve with conventional
techniques.
o The learning model of RL is similar to the way human beings learn; hence it can produce highly
accurate results.
Disadvantages
o Too much reinforcement learning can lead to an overload of states, which can weaken the
results.
What is a Dataset?
A Dataset is a set of data grouped into a collection with which developers can work to meet
their goals. In a dataset, each row represents a data point and each column
represents a feature. Datasets are mostly used in fields like machine learning,
business, and government to gain insights, make informed decisions, or train algorithms.
Datasets vary in size and complexity, and they usually require cleaning and preprocessing
to ensure data quality and suitability for analysis or modeling.
Dataset
This is the Iris dataset. Since this is a dataset with which we build models, there are input
features and output features. Here:
The input features are Sepal Length, Sepal Width, Petal Length, and Petal Width.
Datasets can be stored in multiple formats. The most common ones are CSV, Excel, JSON, and
zip files for large datasets such as image datasets.
Why are datasets used?
Datasets are used to train and test AI models, analyze trends, and gain insights from data. They
provide the raw material for computers to learn patterns and make predictions.
Types of Datasets
There are various types of datasets available out there. They are:
Numerical Dataset: These include numerical data points that can be analysed with mathematical
operations. Examples are temperature, humidity, marks and so on.
Categorical Dataset: These include categories such as colour, gender, occupation, games,
sports and so on.
Web Dataset: These include datasets created by calling APIs using HTTP requests and
populating them with values for data analysis. These are mostly stored in JSON (JavaScript
Object Notation) format.
Time series Dataset: These contain data recorded over a period of time, for example, changes in
geographical terrain over time.
Image Dataset: This is a dataset consisting of images. It is mostly used to differentiate
types of diseases, heart conditions and so on.
Ordered Dataset: These datasets contain data that are ordered in ranks, for example, customer
reviews, movie ratings and so on.
Partitioned Dataset: These datasets have data points segregated into different members or
different partitions.
File-Based Datasets: These datasets are stored in files, such as .csv files or Excel .xlsx files.
Bivariate Dataset: In this dataset, two variables are related to each other.
For example, height and weight in a dataset are directly related to each other.
Multivariate Dataset: In these datasets, as the name suggests, more than two variables are
related to each other. For example, attendance and assignment grades are both
related to a student’s overall grade.
Properties of Dataset
Center of data: This refers to the “middle” value of the data, often measured by mean, median,
or mode. It helps understand where most of the data points are concentrated.
Skewness of data: This indicates how symmetrical the data distribution is. A perfectly
symmetrical distribution (like a normal distribution) has a skewness of 0. Positive skewness
means the data is clustered towards the left with a longer tail on the right, while negative
skewness means it’s clustered towards the right with a longer tail on the left.
Spread among data members: This describes how much the data points vary from the center.
Common measures include standard deviation or variance, which quantify how far individual
points deviate from the average.
Presence of outliers: These are data points that fall significantly outside the overall pattern.
Identifying outliers can be important as they might influence analysis results and require
further investigation.
Correlation among the data: This refers to the strength and direction of relationships between
different variables in the dataset. A positive correlation indicates values in one variable tend
to increase as the other does, while a negative correlation suggests they move in opposite
directions. No correlation means there’s no linear relationship between the variables.
Type of probability distribution that the data follows: Understanding the distribution (e.g.,
normal, uniform, binomial) helps us predict how likely it is to find certain values within the
data and choose appropriate statistical methods for analysis.
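Several of these properties can be computed directly. The sketch below uses only Python's standard library; the income, height, and weight values are made up, the skewness formula is the population version (mean cubed z-score), and the outlier rule of "more than 2 standard deviations from the mean" is an assumed convention.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Population skewness: the mean cubed z-score."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

incomes = [20, 22, 23, 25, 26, 28, 30, 95]  # made-up values with one large entry
heights = [150, 160, 165, 170, 180]
weights = [50, 58, 63, 68, 80]

centre = (mean(incomes), median(incomes))   # centre of the data
spread = pstdev(incomes)                    # spread among data members
outliers = [v for v in incomes if abs(v - mean(incomes)) / spread > 2]

print(centre, round(spread, 2), outliers)
print(round(skewness(incomes), 2), round(pearson(heights, weights), 3))
```

The single large income pulls the mean above the median and gives a positive skewness, while height and weight show a strong positive correlation.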
Features of a Dataset
The features of a dataset are the columns available in the dataset. They are the most critical
aspect of the dataset, because models are trained on the features of each available
data point and are then used to predict the output for any new data point
that is added to the dataset.
It is not possible to define a standard set of features for all datasets, since the
functionality and data of one dataset can be completely different from those of another.
Some possible features of a dataset are:
Numerical Features: These may include numerical values such as height, weight, and so on.
These may be continuous over an interval, or discrete variables.
Categorical Features: These include multiple classes/ categories, such as gender, colour, and
so on.
Metadata: Includes a general description of a dataset. For very large datasets, having
a description of the dataset when it is handed over to a new developer saves a lot of
time and improves efficiency.
Size of the Data: It refers to the number of entries and features in the file containing
the Dataset.
Formatting of Data: The datasets available online come in several formats. Some of
them are JSON (JavaScript Object Notation), CSV (Comma Separated Values), XML (eXtensible
Markup Language), DataFrame, and Excel files (.xlsx or .xlsm). Particularly large datasets,
especially those involving images for disease detection, are usually downloaded from the
internet as zip files, which need to be extracted on the system into individual
components.
Target Variable: It is the feature whose values are predicted from
the other features using machine learning techniques.
Data Entries: These refer to the individual values of data present in the Dataset. They play a
huge role in data analysis.
Data preprocessing
Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and a crucial step in creating a machine learning model.
When creating a machine learning project, we do not always come across clean
and formatted data. Before doing any operation with data, it is necessary to clean it and
put it into a consistent format. For this, we use data preprocessing.
Real-world data generally contains noise and missing values, and may be in an unusable format
that cannot be used directly by machine learning models. Data preprocessing is the required
task of cleaning the data and making it suitable for a machine learning model, which also
increases the accuracy and efficiency of the model.
It involves below steps:
o Getting the dataset
o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling
read_csv() function:
To import the dataset, we will use the read_csv() function of the pandas library, which
reads a CSV file into a DataFrame on which we can perform various operations. Using this function,
we can read a CSV file locally as well as through a URL.
import pandas as pd
data_set = pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable that stores our dataset, and inside the function, we have
passed the name of our dataset file. Once we execute the above lines of code, the dataset is
imported into our code.
Extracting dependent and independent variables:
To extract the independent variables, we will use the iloc[] indexer of the pandas library. It is used to
extract the required rows and columns from the dataset.
x = data_set.iloc[:, :-1].values
In the above code, the first colon (:) takes all the rows, and the second colon (:) is for
the columns. Here we have used :-1 because we don't want to take the last column, as it
contains the dependent variable. By doing this, we get the matrix of features.
To extract the dependent variable:
y = data_set.iloc[:, 3].values
Here we have taken all the rows with the last column only.
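To see what the two slices select, here is a standard-library sketch that mimics them on a small in-memory table, so it runs without pandas. The three sample rows and the column names are a hypothetical stand-in for Dataset.csv, which is not shown in these notes.

```python
import csv
import io

# Hypothetical stand-in for Dataset.csv; the real file is not shown in these notes.
raw = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
"""

rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]

x = [record[:-1] for record in records]  # every column except the last, like iloc[:, :-1]
y = [record[3] for record in records]    # column index 3 only, like iloc[:, 3]

print(x)
print(y)
```

With four columns, index 3 is the last column, which is why both slices together cover the whole table.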
There are mainly two ways to handle missing data, which are:
By deleting the particular row: This is the common way to deal with null values. In
this approach, we simply delete the specific row or column which contains null values. But this
approach is not very efficient, and removing data may lead to a loss of information, so the
output may not be accurate.
By calculating the mean: In this approach, we calculate the mean of the column or row
which contains a missing value and put it in the place of the missing value. This strategy
is useful for features which have numeric data such as age, salary, year, etc. Here, we will
use this approach.
To handle missing values, we will use the scikit-learn library in our code, which contains
various modules for building machine learning models. Here we will use the SimpleImputer class
of the sklearn.impute module (older scikit-learn releases provided this as the Imputer class
of sklearn.preprocessing, which has since been removed).
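The mean strategy itself is small enough to sketch by hand. The example below is a plain-Python illustration and not the scikit-learn implementation; the Age values and the impute_mean helper are made up.

```python
from statistics import mean

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [value for value in column if value is not None]
    fill = mean(observed)
    return [fill if value is None else value for value in column]

ages = [38, 43, None, 30, 49]  # hypothetical Age column with one missing entry
filled = impute_mean(ages)

print(filled)
```

The mean is computed from the observed values only, so the missing entry does not distort it.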
Categorical data is data which falls into categories; in our dataset, there are two
categorical variables, Country and Purchased.
A machine learning model works entirely on mathematics and numbers, so if our
dataset has a categorical variable, it may create trouble while building the
model. It is therefore necessary to encode these categorical variables into numbers.
But in our case, there are three country categories, and as we can see in the above output, these
categories are encoded into 0, 1, and 2. From these values, the machine learning model may
assume that there is some order or correlation between the categories, which will produce the wrong
output. To remove this issue, we will use dummy encoding.
Dummy Variables:
Dummy variables are variables which take the value 0 or 1. A value of 1 indicates the
presence of that category in a particular column, and the remaining columns are 0. With dummy
encoding, we have a number of columns equal to the number of categories.
In our dataset, we have 3 categories, so it will produce three columns having 0 and 1 values.
For dummy encoding, we will use the OneHotEncoder class of the sklearn.preprocessing module.
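What dummy encoding produces can be shown with a short plain-Python sketch. It is an illustration of the idea, not the OneHotEncoder implementation; the country names and the one_hot helper are hypothetical.

```python
def one_hot(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

countries = ["India", "France", "Germany", "France"]  # hypothetical Country column
categories, encoded = one_hot(countries)

print(categories)
for row in encoded:
    print(row)
```

Each row contains exactly one 1, so no category is treated as larger or smaller than another.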
For Purchased Variable:
from sklearn.preprocessing import LabelEncoder
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
For the second categorical variable, we use only a labelencoder object
of the LabelEncoder class. Here we are not using the OneHotEncoder class because the Purchased
variable has only two categories, yes and no, which are automatically encoded into 0 and 1.
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and a test set.
This is one of the crucial steps of data preprocessing because it lets us measure how well the
model performs on data it has not seen, and so helps us improve it.
Suppose we train our machine learning model on one dataset and then test it on a completely
different dataset. The model will then have difficulty applying the correlations it learned
from the training data.
If we train our model very well and its training accuracy is very high, but its performance
drops when we provide a new dataset, the model is not useful. So we always try to build a
machine learning model that performs well on the training set and also on the test set. Here,
we can define these datasets as:
Training Set: A subset of the dataset used to train the machine learning model, for which we
already know the output.
Test set: A subset of the dataset used to test the machine learning model. The model predicts
the output for the test set, and the predictions are compared with the known values.
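A minimal sketch of such a split, using scikit-learn's train_test_split. The toy arrays, the 80/20 ratio, and the random_state value are assumptions chosen for illustration, not values taken from the notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (10 rows, 2 features) and binary target
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

The model is fitted only on X_train and y_train. X_test and y_test are kept aside to estimate how it behaves on unseen data.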
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to
bring the independent variables of the dataset into a specific range. In feature scaling, we
put our variables in the same range and on the same scale so that no variable dominates the
others.
If we combine values from age and salary in a calculation such as a distance, the salary values
will dominate the age values because they are numerically much larger, and this will produce an
incorrect result. To remove this issue, we need to perform feature scaling.
There are two common ways to perform feature scaling:
Standardization
Normalization
Here, we will use the standardization method for our dataset.
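As a rough sketch of the two methods: standardization rescales each column to x' = (x - mean) / standard deviation, while min-max normalization rescales it to x' = (x - min) / (max - min). The age and salary numbers below are invented for the example, and the variable names are assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical rows of [age, salary]; salary is numerically much larger than age
X = np.array([
    [25, 30000.0],
    [35, 52000.0],
    [45, 79000.0],
    [30, 61000.0],
])

# Standardization: each column ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max): each column is mapped into the range [0, 1]
X_norm = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_norm)
```

In practice the scaler is usually fitted on the training set only and then applied to the test set with transform, so that no information from the test set leaks into training.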
• Bias: Bias refers to the error due to overly simplistic assumptions in the learning
algorithm. These assumptions make the model easier to comprehend and learn but
might not capture the underlying complexities of the data. It is the error due to the
model’s inability to represent the true relationship between input and output accurately.
When a model performs poorly on both the training and testing data, it has high bias
because the model is too simple, which indicates underfitting.
• Variance: Variance, on the other hand, is the error due to the model’s sensitivity to
fluctuations in the training data. It’s the variability of the model’s predictions for
different instances of training data. High variance occurs when a model learns the
training data’s noise and random fluctuations rather than the underlying pattern. As a
result, the model performs well on the training data but poorly on the testing data,
indicating overfitting.
Note: An underfitting model has high bias and low variance.
Reasons for underfitting:
The model is too simple, so it may not be capable of representing the complexities in the
data.
The input features used to train the model are not adequate representations of the
underlying factors influencing the target variable.
The size of the training dataset is not large enough.
Excessive regularization is used to prevent overfitting, which constrains the model
from capturing the data well.
To reduce underfitting, one option is to increase the number of epochs or the duration of
training.
A statistical model is said to be overfitted when it fits the training data closely but does
not make accurate predictions on testing data. When a model is trained for too long or is too
flexible for the amount of data available, it starts learning from the noise and inaccurate
data entries in the data set. Testing with test data then shows high variance: the model
does not categorize the data correctly because it has learned too many details and too much
noise. Overfitting is more likely with non-parametric and non-linear methods, because these
types of machine learning algorithms have more freedom in building the model from the
dataset and can therefore build unrealistic models. One way to avoid overfitting is to use a
linear algorithm if we have linear data, or to limit parameters such as the maximal depth if
we are using decision trees.
Increasing the amount of training data can improve the model’s ability to generalize to unseen
data and reduce the likelihood of overfitting.
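The effect of limiting the maximal depth of a decision tree can be sketched as below. The synthetic dataset, the noise level (flip_y), and the depth of 3 are assumptions picked for illustration. The exact scores depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy classification data (illustrative only)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree can memorize the training set, noise included
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting max_depth restricts how much detail the tree can learn
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

deep_train = deep_tree.score(X_train, y_train)
deep_test = deep_tree.score(X_test, y_test)
shallow_train = shallow_tree.score(X_train, y_train)
shallow_test = shallow_tree.score(X_test, y_test)

print(f"deep tree:    train={deep_train:.2f}  test={deep_test:.2f}")
print(f"shallow tree: train={shallow_train:.2f}  test={shallow_test:.2f}")
```

A large gap between training and test accuracy, as typically seen for the unrestricted tree, is the usual sign of overfitting.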
What is Bias?
Bias is the inability of the model to capture the true relationship, which causes a
difference or error between the model’s predicted value and the actual value.
This difference between the actual or expected values and the predicted values is
known as error, bias error, or error due to bias. Bias is a systematic error that occurs
due to wrong assumptions in the machine learning process.
$\mathrm{Bias}(\hat{Y}) = E(\hat{Y}) - Y$
where $E(\hat{Y})$ is the expected value of the estimator $\hat{Y}$ and $Y$ is the actual
value. Bias measures how well the model fits the data.
• Low Bias: A low bias value means fewer assumptions are made to build the target
function. In this case, the model will closely match the training dataset.
• High Bias: A high bias value means more assumptions are made to build the target
function. In this case, the model will not match the training dataset closely.
The high-bias model will not be able to capture the trend in the dataset. It is considered an
underfitting model, which has a high error rate. This is due to an overly simplified algorithm.
For example, a linear regression model may have high bias if the data has a non-linear
relationship.
Ways to reduce high bias in Machine Learning:
• Use a more complex model: One of the main reasons for high bias is an overly
simplified model. It will not be able to capture the complexity of the data. In such cases,
we can make our model more complex by increasing the number of hidden layers in the
case of a deep neural network. Or we can use a more complex model like polynomial
regression for non-linear datasets, CNN for image processing, and RNN for sequence
learning.
• Increase the number of features: Adding more features to the training data increases
the complexity of the model and improves its ability to capture the underlying
patterns in the data.
• Increase the size of the training data: Increasing the size of the training data can help
to reduce bias by providing the model with more examples to learn from, although more
data alone rarely fixes a model that is too simple.
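As a small sketch of the first point, the example below fits a straight line and a degree-2 polynomial to data with a quadratic relationship. The data is synthetic and the names `linear` and `quadratic` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data: y = x^2 plus a little noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

# High-bias model: a straight line cannot follow the curve
linear = LinearRegression().fit(X, y)

# More complex model: polynomial features of degree 2 followed by linear regression
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

linear_r2 = linear.score(X, y)
quadratic_r2 = quadratic.score(X, y)
print(f"linear R^2:    {linear_r2:.3f}")
print(f"quadratic R^2: {quadratic_r2:.3f}")
```

The straight line underfits even on its own training data, which is the typical symptom of high bias.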
What is Variance?
Variance is the measure of the spread of data around its mean. In machine learning,
variance is the amount by which the performance of a predictive model changes when
it is trained on different subsets of the training data. More specifically, variance is the
variability of the model: how sensitive it is to another subset of the training
dataset, i.e. how much it adjusts to a new subset of the training dataset.
Let $Y$ be the actual values of the target variable, and $\hat{Y}$ be the predicted values of
the target variable. Then the variance of a model can be measured as the expected value of
the square of the difference between the predicted values and the expected value of the
predicted values.
$\mathrm{Variance} = E[(\hat{Y} - E[\hat{Y}])^2]$
where $E[\hat{Y}]$ is the expected value of the predicted values. Here the expected value is
averaged over all the training data.
• Low variance: Low variance means that the model is less sensitive to changes in the
training data and can produce consistent estimates of the target function with different
subsets of data from the same distribution. Combined with high bias, this is the case of
underfitting, where the model performs poorly on both training and test data.
• High variance: High variance means that the model is very sensitive to changes in the
training data and can result in significant changes in the estimate of the target function
when trained on different subsets of data from the same distribution. This is the case of
overfitting, where the model performs well on the training data but poorly on new,
unseen test data. It fits the training data so closely that it fails on new data.
Ways to reduce high variance in Machine Learning:
• Cross-validation: By splitting the data into training and testing sets multiple times,
cross-validation can help identify whether a model is overfitting or underfitting and can be
used to tune hyperparameters to reduce variance.
• Feature selection: Choosing only the relevant features decreases the model’s
complexity, and it can reduce the variance error.
• Simplifying the model: Reducing the complexity of the model, such as decreasing the
number of parameters or layers in a neural network, can also help reduce variance and
improve generalization performance.
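A minimal sketch of cross-validation used to compare model complexities. It uses scikit-learn's built-in Iris dataset with 5 folds. The choice of dataset, the decision tree model, and the candidate depths are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Evaluate trees of different depth with 5-fold cross-validation
cv_scores = {}
for depth in (1, 3, None):  # None means the tree depth is not limited
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_scores[depth] = cross_val_score(model, X, y, cv=5)

for depth, scores in cv_scores.items():
    print(f"max_depth={depth}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

A low mean score across folds suggests underfitting, while a large spread between folds suggests the model is sensitive to the particular training subset.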
If the algorithm is too simple (a hypothesis with a linear equation), it may be in a high
bias and low variance condition and is thus error-prone. If the algorithm is too complex
(a hypothesis with a high-degree equation), it may be in a high variance and low bias
condition. In the latter condition, the model will not perform well on new entries. There is
something between both of these conditions, known as the Trade-off or Bias-Variance Trade-off.
This tradeoff in complexity is why there is a tradeoff between bias and variance. An
algorithm can’t be more complex and less complex at the same time. For the graph, the
perfect tradeoff will be like this.
We try to optimize the value of the total error for the model by using the Bias-
Variance Tradeoff.
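The tradeoff can be estimated numerically. For squared error, the expected total error is the sum of squared bias, variance, and irreducible noise. The sketch below refits polynomials of different degree on many noisy copies of the same data and measures squared bias and variance directly. The target function sin(3x), the noise level, and the degrees 1, 3, and 9 are assumptions chosen for illustration.

```python
import numpy as np

def true_fn(x):
    # Assumed "true" relationship for this illustration
    return np.sin(3 * x)

def bias_variance(degree, n_repeats=200, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(-1, 1, 30)  # fixed inputs; only the noise is redrawn
    x_test = np.linspace(-1, 1, 50)
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        y_train = true_fn(x_train) + rng.normal(0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)  # (E[Y^] - Y)^2
    variance = np.mean(preds.var(axis=0))                  # E[(Y^ - E[Y^])^2]
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

The straight line shows the largest squared bias, the degree-9 polynomial the largest variance, and the total of the two is smallest somewhere in between.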
The best fit is given by the hypothesis at the tradeoff point. The graph of error against
model complexity that shows the trade-off is given as –