Introduction To Machine Learning
Machine learning
• Machine learning is a type of artificial intelligence that enables
computers to detect patterns and establish baseline behavior using
algorithms that learn through training or observation.
• It can process and analyze vast amounts of data that are simply
impractical for humans.
• Machine learning tasks are classified into two main categories:
• Supervised learning – the machine is presented with a set of inputs
and their expected outputs; later, given a new input, it predicts the
output.
• Unsupervised learning – the machine aims to find patterns within a
dataset without explicit human input as to what those patterns might
look like.
What is supervised learning?
• Supervised machine learning is a branch of artificial intelligence that
focuses on training models to make predictions or decisions based on
labeled training data.
• It involves a learning process where the model learns from known
examples to predict or classify unseen or future instances accurately.
• Supervised machine learning has two key components: the input data
and the corresponding output labels.
• The goal is to build a model that can learn from this labeled data to
make predictions or classifications on new, unseen data.
• The labeled data consists of input features (also known as
independent variables or predictors) and the corresponding output
labels (also known as dependent variables or targets).
• The model’s objective is to capture patterns and relationships
between the input features and the output labels, allowing it to
generalize and make accurate predictions on unseen data.
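The idea above can be sketched in a few lines of code. This is a minimal illustration with assumed toy data (the fruit features and labels are invented for this example, not taken from the slides): a model is fit on labeled examples, then predicts the label of a new, unseen instance.

```python
# Minimal supervised-learning sketch; the fruit data below is an
# assumed toy example, not from the slides.
from sklearn.tree import DecisionTreeClassifier

# Input features (independent variables) with their output labels (targets).
X = [[150, 7.0], [130, 6.5], [180, 8.0], [170, 7.5]]   # weight (g), width (cm)
y = ["apple", "apple", "orange", "orange"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)                       # learn from the labeled data

print(model.predict([[160, 7.2]]))    # classify a new, unseen instance
```

Any classifier with a `fit`/`predict` interface would serve equally well here; the point is only the labeled-data-in, prediction-out workflow.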
How Does Supervised Learning Work?
• In Regression, the output variable must be continuous or real-valued;
in Classification, the output variable must be a discrete value.
• The task of the regression algorithm is to map the input value (x) to
a continuous output variable (y); the task of the classification
algorithm is to map the input value (x) to a discrete output variable (y).
• Regression algorithms are used with continuous data; Classification
algorithms are used with discrete data.
• In Regression, we try to find the best-fit line, which can predict the
output more accurately; in Classification, we try to find the decision
boundary, which can divide the dataset into different classes.
• Regression algorithms can be used to solve problems such as weather
prediction and house price prediction; Classification algorithms can be
used to solve problems such as identification of spam emails, speech
recognition, and identification of cancer cells.
• Regression algorithms can be further divided into Linear and
Non-linear Regression; Classification algorithms can be divided into
Binary and Multi-class Classifiers.
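The contrast above can be shown side by side. This sketch uses assumed toy data (the numbers and the spam framing are illustrative): regression maps inputs to a continuous value, classification to a discrete class.

```python
# Side-by-side sketch of the two supervised task types (toy data assumed).
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]

# Regression: continuous target, e.g. a price.
y_continuous = [10.0, 20.0, 30.0, 40.0]
reg = LinearRegression().fit(X, y_continuous)
print(reg.predict([[5]]))    # the best-fit line gives a real value (≈ 50.0)

# Classification: discrete target, e.g. spam (1) vs. not spam (0).
y_discrete = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y_discrete)
print(clf.predict([[5]]))    # the decision boundary gives a class (1)
```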
What is Unsupervised Learning?
• There may be many cases in which we do not have labeled data and need
to find the hidden patterns from the given dataset.
• So, to solve such types of cases in machine learning, we need unsupervised
learning techniques.
• Unsupervised learning is a type of machine learning in which
models are trained on an unlabeled dataset and are allowed to act
on that data without any supervision.
• Unsupervised learning is a machine learning technique in which models
are not supervised using a training dataset.
• Instead, the models themselves find the hidden patterns and insights
in the given data.
• It can be compared to learning which takes place in the human brain while
learning new things.
• Unsupervised learning cannot be directly applied to a regression or
classification problem because unlike supervised learning, we have
the input data but no corresponding output data.
• The goal of unsupervised learning is to find the underlying structure
of the dataset, group the data according to similarities, and represent
the dataset in a compressed format.
• Example:
• Suppose the unsupervised learning algorithm is given an input
dataset containing images of different types of cats and dogs.
• The algorithm is never trained upon the given dataset, which means it
does not have any idea about the features of the dataset.
• The task of the unsupervised learning algorithm is to identify the
image features on their own.
• The unsupervised learning algorithm will perform this task by
clustering the image dataset into groups according to the similarities
between the images.
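The clustering idea in the cats-and-dogs example can be sketched with k-means. The 2-D points below are assumed toy data standing in for image features; no labels are given, and the algorithm groups the points purely by similarity.

```python
# Clustering sketch mirroring the cats-and-dogs example (toy data assumed):
# no labels are provided; k-means groups points by similarity alone.
from sklearn.cluster import KMeans

X = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],   # one natural group
     [8.0, 8.1], [7.9, 8.0], [8.1, 7.9]]   # another natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # cluster ids discovered without any supervision
```

Note that the cluster ids themselves are arbitrary (0/1 could be swapped); only the grouping is meaningful, which is exactly the "no corresponding output data" point made above.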
Unsupervised Learning Algorithms
• K-means clustering
• KNN (k-nearest neighbors)
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
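To make one entry of the list concrete, here is a sketch of Principal Component Analysis on assumed toy data: it represents the data in compressed form by keeping only the direction of greatest variance.

```python
# PCA sketch (toy data assumed): compress nearly-1-D data from
# 2 features down to 1 principal component.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2.0 * t + 0.01 * rng.normal(size=100)])  # ~1-D data in 2-D

pca = PCA(n_components=1)
X_compressed = pca.fit_transform(X)        # 2 features -> 1 component
print(X_compressed.shape)                  # (100, 1)
print(pca.explained_variance_ratio_)       # nearly all variance kept
```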
Underfitting
• This instance, where the model cannot find patterns in our training
set and hence fails on both seen and unseen data, is called
Underfitting.
• The figure below shows an example of Underfitting. As we can see,
the model has found no patterns in our data, and the line of best fit is
a straight line that does not pass through the data points. The
model has failed to train properly on the given data and cannot
predict new data either.
• Underfitting occurs when our machine learning model is not able to
capture the underlying trend of the data.
• To avoid overfitting, the feeding of training data can be
stopped at an early stage, but then the model may not learn
enough from the training data.
• As a result, it may fail to find the best fit of the dominant trend in the
data.
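Underfitting is easy to reproduce. In this sketch with assumed toy data, a straight line is fit to a clearly quadratic pattern; because it cannot capture the underlying trend, it scores poorly even on the data it was trained on.

```python
# Underfitting sketch (toy data assumed): a linear model fit to a
# quadratic relationship fails even on its own training data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = X.ravel() ** 2                       # quadratic relationship

line = LinearRegression().fit(X, y)
print(line.score(X, y))                  # R^2 near 0 even on training data
```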
What is Variance?
• Variance is the very opposite of Bias.
• The variability of model predictions for a given data point,
which tells us the spread of our model's outputs, is called the
variance of the model.
• The model with high variance has a very complex fit to the
training data and thus is not able to fit accurately on the
data which it hasn’t seen before.
• As a result, such models perform very well on training data but
have high error rates on test data.
• When a model has high variance, it is said to overfit the data.
In the figure above, we can see that our model has learned the training
data extremely well, which taught it to identify cats. But when given new
data, such as a picture of a fox, the model still predicts "cat", because
that is all it has learned. This happens when variance is high: the model
captures all the features of the data given to it, including the noise,
tunes itself to that data, and predicts it very well, but it cannot
predict new data because it is too specific to the training data.
• During training, we allow our model to ‘see’ the data a certain number
of times so that it can find patterns in it. If the model does not work
on the data for long enough, it will not find the patterns, and bias
occurs. On the other hand, if the model is allowed to view the data too
many times, it will learn very well for only that data: it will capture
most patterns in the data, but it will also learn from the unnecessary
data, or noise, that is present.
Overfitting
• Our model will perform really well on the training data and achieve
high accuracy, but it will fail on new, unseen data. New data may not
have exactly the same features, so the model won't be able to predict
it very well. This is called Overfitting.
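The gap just described can be reproduced directly. In this sketch with assumed toy data, a high-degree polynomial fit to a handful of noisy points scores almost perfectly on its own training data but much worse on noise-free points from the same underlying curve.

```python
# Overfitting sketch (toy data assumed): a degree-9 polynomial fit to
# 15 noisy samples memorizes the noise and generalizes poorly.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, 15)).reshape(-1, 1)
y = np.sin(3 * X).ravel() + 0.3 * rng.normal(size=15)   # pattern + noise

model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
model.fit(X, y)

X_new = np.linspace(-1, 1, 100).reshape(-1, 1)
y_new = np.sin(3 * X_new).ravel()                       # noise-free truth
print(model.score(X, y))          # very high: the noise was memorized
print(model.score(X_new, y_new))  # lower: poor generalization
```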
• Overfitting occurs when our machine learning model tries to cover all
the data points, or more than the required data points, present in the
given dataset.
• Because of this, the model starts capturing the noise and inaccurate
values present in the dataset, and these factors reduce the efficiency
and accuracy of the model.
• The overfitted model has low bias and high variance.
Techniques to reduce overfitting include:
• Cross-Validation
• Training with more data
• Removing features
• Early stopping the training
• Regularization
• Ensembling
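One remedy from the list can be sketched concretely: regularization. With assumed toy data, Ridge regression adds a penalty on large coefficients, so the same high-degree polynomial ends up with far smaller, tamer weights than an unregularized fit (at the cost of a slightly worse fit to the training data).

```python
# Regularization sketch (toy data assumed): Ridge shrinks the
# coefficients of an otherwise overfit polynomial model.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-1, 1, 20)).reshape(-1, 1)
y = np.sin(3 * X).ravel() + 0.3 * rng.normal(size=20)

plain = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    LinearRegression()).fit(X, y)
ridge = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    Ridge(alpha=1.0)).fit(X, y)

plain_coef = plain.named_steps["linearregression"].coef_
ridge_coef = ridge.named_steps["ridge"].coef_
print((plain_coef ** 2).sum())   # large: the fit chases the noise
print((ridge_coef ** 2).sum())   # much smaller: the penalty tames it
```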
The Bias-Variance Tradeoff
• The bias-variance tradeoff illustrates the relationship between bias
and variance in machine learning models. As we decrease bias,
variance tends to increase, and vice versa. Finding the optimal
tradeoff is crucial to achieve good model performance.
1. High Bias, Low Variance: Underfitting
• When a model has high bias and low variance, it tends to underfit the
data. Underfitting occurs when the model is too simple to capture the
underlying patterns in the data. It leads to poor performance on both
the training and testing data, as the model fails to generalize.
Underfitting can be addressed by increasing the model’s complexity
or incorporating more relevant features.
2. Low Bias, High Variance: Overfitting
• Conversely, a model with low bias and high variance tends to overfit
the data. Overfitting happens when the model becomes too complex,
capturing noise or random fluctuations in the training data. It
performs exceptionally well on the training data but fails to generalize
to unseen data. To address overfitting, techniques like regularization,
feature selection, or collecting more training data can be employed.
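The tradeoff can be seen by sweeping model complexity. In this sketch with assumed toy data, as polynomial degree grows, the training fit only improves, while test performance improves at first (bias falls) and then typically degrades (variance rises).

```python
# Bias-variance sketch (toy data assumed): compare an underfit,
# a balanced, and an overfit polynomial on held-out data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, 20)).reshape(-1, 1)
y = np.sin(3 * X).ravel() + 0.2 * rng.normal(size=20)   # signal + noise

X_test = np.linspace(-1, 1, 100).reshape(-1, 1)
y_test = np.sin(3 * X_test).ravel()                     # noise-free truth

train_scores, test_scores = [], []
for degree in (1, 5, 12):   # underfit, balanced, overfit
    m = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    train_scores.append(m.score(X, y))
    test_scores.append(m.score(X_test, y_test))

print(train_scores)  # rises monotonically with degree
print(test_scores)   # typically peaks at moderate complexity
```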
Finding the Optimal Balance