
Machine Learning
Subject code: AL3451
Regulations: 2021

Unit - 1 Introduction to Machine Learning

Introduction and Motivation for Machine Learning

Machine Learning (ML): An Introduction
• ML is a subfield of Artificial Intelligence (AI).

• AI reasons and answers questions about a system it has been trained on.

• ML is the underlying computer program by which AI can 'reason'.

• The aim of ML is to make the computer learn by itself.
Machine Learning (ML): An Introduction
Traditional Programming vs ML Programming

• Traditional programming: Input Data + Program → Output.
• ML programming (training phase): Input Data (training data) + Output (training data) → ML Algorithm (the learned program).
• ML programming (prediction phase): new Input Data (huge volume) + learned ML Algorithm → Output.
Machine Learning (ML): An Introduction
• ML enables a system to learn from past data or past experience without being explicitly programmed. ML is highly dependent on models, that is, algorithms (in simple words, computer programs).
• ML Definitions:
 The use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyse and draw inferences from patterns in data.
 With the help of sample historical data, known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed.
Machine Learning (ML): An Introduction
• Example:

  House Size (sq ft)   No. of Bedrooms   No. of Bathrooms   No. of Years   Selling Price
  2000                 3                 2                  5              3 Lakhs
  1500                 2                 1                  7              2 Lakhs
  2500                 4                 2                  2              4 Lakhs
  1200                 2                 1                  9              1.5 Lakhs

• A linear regression model might come up with an equation like this:
  Selling Price = θ0 + θ1·(Size in sq ft) + θ2·(Bedrooms) + θ3·(Bathrooms) − θ4·(Years of age)

• Coefficients learned through the training process:
 θ0 = 50,000 (Intercept)
 θ1 = 100 (Coefficient for Size in sq ft)
 θ2 = 20,000 (Coefficient for each Bedroom)
 θ3 = 30,000 (Coefficient for each Bathroom)
 θ4 = 10,000 (Coefficient for each Year of Age; age reduces the predicted price, hence the minus sign)
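A minimal sketch of applying such a model, using the coefficients listed above; treating the age coefficient as subtracted is an assumption (the slide gives only its magnitude), and the function name predict_price is illustrative:

```python
# Minimal sketch: applying the learned linear-regression coefficients by hand.
# The coefficient values come from the slide; the negative sign on age is an
# assumption so that older houses are predicted to sell for less.

def predict_price(size_sqft, bedrooms, bathrooms, age_years):
    theta0 = 50_000      # intercept
    theta1 = 100         # per sq ft
    theta2 = 20_000      # per bedroom
    theta3 = 30_000      # per bathroom
    theta4 = 10_000      # per year of age (subtracted)
    return (theta0 + theta1 * size_sqft + theta2 * bedrooms
            + theta3 * bathrooms - theta4 * age_years)

# First row of the training table: 2000 sq ft, 3 bed, 2 bath, 5 years old.
print(predict_price(2000, 3, 2, 5))   # 320000, close to the listed 3 Lakhs
```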
Types of Machine Learning Algorithms
The three types of ML algorithms are
1. Supervised Learning.
2. Unsupervised Learning.
3. Reinforcement Learning.

Supervised Learning:
• In supervised learning, a model is trained on a labelled dataset.
• It is the process of providing input data along with the correct output data; the supervised learning algorithm's task is to find a mapping function that maps the input to the output.
Supervised Learning
• Supervised learning is a fundamental paradigm in machine
learning where the model is trained on a labeled dataset.

• This means that for each input in the training set, the desired
output or "label" is provided.

• The goal of supervised learning is to learn a mapping from inputs to outputs, enabling the model to make predictions.
• Examples of Supervised Learning Algorithms:
1. Linear Regression
2. Logistic Regression
3. Support Vector Machines (SVM)
4. Decision Trees and Random Forests
5. Neural Networks
Supervised Learning
• The learning task in supervised learning can be broadly divided into two types:
• Classification: The output variable is categorical, and the goal is to predict
which category or class the new data will fall into. For example, identifying
whether an email is spam or not is a classification problem.
• Regression: The output variable is continuous, and the goal is to predict a
quantitative value. Predicting the price of a house based on its features (like
size, location, and number of bedrooms) is a regression problem.

[Figures: a classification example plotting emails by Feature 1 (frequency of suspicious keywords, lower is better) and Feature 2 (personalization level, higher is better); a regression example plotting selling price against Feature x (size of the house in sq ft).]
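A minimal sketch contrasting the two supervised tasks, assuming scikit-learn is available; the tiny spam and house-price datasets are illustrative, not taken from the slides:

```python
# Minimal sketch contrasting classification and regression with scikit-learn.
# The tiny datasets below are illustrative assumptions.
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: predict a category (spam = 1, not spam = 0)
X_cls = [[8, 1], [7, 2], [1, 8], [2, 9]]   # [suspicious-keyword count, personalization level]
y_cls = [1, 1, 0, 0]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[6, 2]]))               # -> likely spam (class 1)

# Regression: predict a continuous value (price from house size in sq ft)
X_reg = [[1200], [1500], [2000], [2500]]
y_reg = [1.5, 2.0, 3.0, 4.0]               # price in Lakhs
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[1800]]))               # -> a value between 2 and 3 Lakhs
```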
Unsupervised Learning
• Unsupervised learning is a type of machine learning where the
algorithm learns patterns from unlabelled data.

• The system tries to learn without explicit instructions by finding structure in the input data.

• It is widely used for exploratory data analysis, pattern discovery, and feature extraction.
• Examples of Unsupervised Learning Algorithms:
1. K-Means Clustering
2. Hierarchical Clustering
3. Principal Component Analysis (PCA) etc.
Unsupervised Learning
• The learning task in unsupervised learning can be broadly divided into three types:
• Clustering: Dividing the dataset into groups based on similarity.
Examples include K-means clustering and hierarchical clustering.
• Association: Discovering rules that describe large portions of the
data, such as frequent itemsets in market basket analysis.
• Dimensionality Reduction: Reducing the number of variables
under consideration, to simplify the model and reduce noise.
Techniques include Principal Component Analysis (PCA) and t-
Distributed Stochastic Neighbor Embedding (t-SNE).
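A minimal sketch of one clustering algorithm from the list above, assuming scikit-learn is available; the 2-D points are illustrative and no labels are given to the algorithm:

```python
# Minimal sketch: K-Means clustering on unlabelled 2-D points.
# The algorithm discovers the two natural groups on its own.
from sklearn.cluster import KMeans

X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # one natural group
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]    # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # centres of the two discovered clusters
```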
Reinforcement Learning

• Reinforcement Learning (RL) is a type of machine learning paradigm where an agent learns to make decisions by performing actions in an environment to achieve a goal.

• The agent learns through trial and error, receiving rewards or penalties for the actions performed, without prior knowledge of the best actions to take.

• The objective is to develop a strategy, known as a policy, that maximizes the cumulative reward over time.
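A minimal sketch of trial-and-error learning on a simple multi-armed bandit, using only the Python standard library; the reward probabilities and the exploration rate are illustrative assumptions, not taken from the slides:

```python
# Minimal sketch of trial-and-error learning: an epsilon-greedy agent on a
# 3-armed bandit. The reward probabilities below are illustrative assumptions.
import random

reward_prob = [0.2, 0.5, 0.8]      # unknown to the agent
q = [0.0, 0.0, 0.0]                # estimated value of each action
counts = [0, 0, 0]
epsilon = 0.1                      # exploration rate

for step in range(5000):
    if random.random() < epsilon:
        action = random.randrange(3)        # explore a random action
    else:
        action = q.index(max(q))            # exploit current best estimate
    reward = 1 if random.random() < reward_prob[action] else 0
    counts[action] += 1
    # incremental average update of the action-value estimate
    q[action] += (reward - q[action]) / counts[action]

print(q)   # estimates should approach [0.2, 0.5, 0.8]
```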
Examples of Machine Learning Applications

1. Healthcare: Disease Diagnosis and Prediction


2. Finance: Fraud Detection
3. Retail: Personalized Recommendations
4. Entertainment: Content Recommendation
5. Transportation: Route Optimization
Vapnik-Chervonenkis (VC) Dimension
• The VC dimension is used to quantify the capacity of an ML model.
• VC dimension is a measure of the capacity or complexity of a space of functions that can be learned by a classification algorithm.
• VC dimension is the maximum number of points that can be shattered by the model, i.e., separated correctly for every possible labelling of those points.
Linear Classifier with two data points
• Consider two data points, each of which can be labelled positive or negative.
• The possible labellings of N data points are 2^N. So for 2 data points there are 2^2 = 4 combinations (++, --, +-, -+).
• The classification of these two data points using a straight line is as follows:

• In all the cases, the linear classifier shatters all possible combinations.
• Shattering is the ability of a model to classify a set of points perfectly for every possible labelling.
Linear Classifier with three data points
• Consider three data points, each of which can be either positive or negative.
• The possible combinations of 3 data points are 2^3 = 8 combinations (+++, ---, ++-, +-+, -++, --+, -+-, +--).
• The classification of these three data points using a straight line is as follows:

• In all the cases, the linear classifier shatters all possible combinations of data points, assuming the points are in general position (i.e., the 3 points are not collinear).
• Collinear points are points that lie on the same straight line.
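The shattering argument can also be checked programmatically. Below is a minimal sketch, assuming scikit-learn is available: it enumerates every labelling of three non-collinear points and verifies that a linear classifier (LogisticRegression with very weak regularization, standing in for an arbitrary straight-line classifier) can realize each one; all-positive and all-negative labellings are treated as trivially separable.

```python
# Minimal sketch: verify that 3 non-collinear points can be shattered by a
# linear classifier.
from itertools import product
from sklearn.linear_model import LogisticRegression

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # in general position (not collinear)

shattered = True
for labels in product([0, 1], repeat=len(points)):
    if len(set(labels)) == 1:
        continue                      # all-positive or all-negative: trivially separable
    clf = LogisticRegression(C=1e6, max_iter=1000).fit(points, labels)
    if clf.score(points, labels) < 1.0:
        shattered = False             # some labelling could not be realized
print("shattered:", shattered)        # expected: True for 3 points in general position
```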
Linear Classifier with Four data points
• Consider 4 data points, each of which can be either positive or negative.
• The possible combinations of 4 data points are 2^4 = 16 combinations.
• The classification of these 4 data points using a straight line is as follows:

• As seen in the last two figures, some combinations of the 4 data points (for example, the XOR-like labelling in which opposite corners share a label) cannot be shattered by a linear classifier.
• So it can be concluded that a linear classifier can shatter at most 3 data points in general position.
Rectangle Classifier with Four data points
• For a suitably placed set of 4 data points, the axis-aligned rectangle classifier can shatter all 16 possible labellings.
• The considered classifier is: data points inside the axis-aligned rectangle are positive, and points outside the axis-aligned rectangle are negative.

• In all the cases, the axis-aligned rectangle classifier shatters all possible combinations of data points, provided the 4 points are placed suitably (for example, one point extreme in each of the four axis directions, like the vertices of a diamond).
Rectangle Classifier with Five data points
• For any set of 5 data points, consider the axis-aligned rectangle whose boundary passes through the extreme points (leftmost, rightmost, topmost, bottommost).

• For a 5-point set there are 2^5 = 32 possible labellings.

• Label these four extreme points positive and the remaining point negative: any axis-aligned rectangle that contains all four extreme points must also contain the fifth point, so this labelling cannot be realized. Hence no set of 5 points can be shattered by the axis-aligned rectangle classifier.
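The same kind of check works for the axis-aligned rectangle classifier. The sketch below is a minimal illustration (the helper name can_shatter and the example point sets are assumptions): for each labelling it takes the bounding box of the positive points, which is the tightest candidate rectangle, and tests whether any negative point falls inside it.

```python
# Minimal sketch: can an axis-aligned rectangle classifier shatter a point set?
# A labelling is realizable iff the bounding box of its positive points
# contains no negative point.
from itertools import product

def can_shatter(points):
    for labels in product([0, 1], repeat=len(points)):
        pos = [p for p, l in zip(points, labels) if l == 1]
        neg = [p for p, l in zip(points, labels) if l == 0]
        if not pos:
            continue                      # an empty rectangle handles the all-negative case
        xs, ys = [p[0] for p in pos], [p[1] for p in pos]
        box = (min(xs), max(xs), min(ys), max(ys))
        if any(box[0] <= x <= box[1] and box[2] <= y <= box[3] for x, y in neg):
            return False                  # this labelling cannot be realized
    return True

diamond4 = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # 4 points, one extreme per axis direction
five = diamond4 + [(0, 0)]                      # adding any 5th point breaks shattering
print(can_shatter(diamond4))   # True  -> the rectangle classifier shatters these 4 points
print(can_shatter(five))       # False -> 5 points cannot be shattered
```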
VC Dimension: A Summary
• A data set containing N points can be labelled in 2^N ways.

• If for every such labelling there is a hypothesis h ∈ H that separates the positive points from the negative points, then we say H shatters the N points.

• The maximum number of points that can be shattered by H is called the Vapnik-Chervonenkis (VC) dimension of H, denoted VC(H); it measures the capacity of H.

• The VC dimension of a linear classifier in a d-dimensional space is d + 1.

• The VC dimension of an axis-aligned rectangle classifier in a two-dimensional space is 4.
VC Dimension: A Summary
• In the context of a linear classifier working with two data points, a hypothesis is a function that attempts to separate these points into distinct classes using a linear decision boundary.

• For simplicity, let's consider a two-dimensional space (features x1 and x2) and two data points that belong to either Class 0 or Class 1.

• Suppose we have two data points:

• Point A = (xA1, xA2) belonging to Class 0.

• Point B = (xB1, xB2) belonging to Class 1.
VC Dimension: A Summary
Hypothesis for a Linear Classifier:

• A linear classifier tries to find a line that can separate these two points based on their classes. The hypothesis in this context can be represented by the equation of the line:

  h(x) = w0 + w1·x1 + w2·x2

• Here, x1 and x2 are the features of a data point, and w0, w1 and w2 are the parameters of the model that define the decision boundary. The hypothesis predicts the class of a data point based on the sign of the output:

• if h(x) ≥ 0, the point is classified as Class 1.

• if h(x) < 0, the point is classified as Class 0.
VC Dimension: A Summary
Example for a Linear Classifier:
• A simple linear classifier might learn a decision boundary that correctly separates these two points.

• For instance, the decision boundary might be the line x2 = x1 − 1, which can be rewritten in the form h(x) = w0 + w1·x1 + w2·x2 with w0 = −1, w1 = 1, w2 = −1.

• Imagine we have Point A (1, 2) in Class 0 and Point B (4, 2) in Class 1.

• For Point A, h(A) = −1 + 1 − 2 = −2 < 0, so it is classified as Class 0; for Point B, h(B) = −1 + 4 − 2 = 1 ≥ 0, so it is classified as Class 1.

• The goal during the training phase is to adjust the parameters (w0, w1 and w2) so that the hypothesis correctly classifies the two points.

• For a perfectly linearly separable set of two points, there always exists at least one set of parameters that can correctly classify them.
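A minimal sketch that reproduces this worked example in code; the weights are exactly the boundary parameters given above (w0 = −1, w1 = 1, w2 = −1):

```python
# Minimal sketch: evaluate the hypothesis h(x) = w0 + w1*x1 + w2*x2 for the
# two example points and classify them by the sign of the output.
w0, w1, w2 = -1.0, 1.0, -1.0           # decision boundary x2 = x1 - 1

def classify(x1, x2):
    h = w0 + w1 * x1 + w2 * x2
    return 1 if h >= 0 else 0          # Class 1 if h(x) >= 0, else Class 0

print(classify(1, 2))   # Point A -> 0 (h = -2)
print(classify(4, 2))   # Point B -> 1 (h =  1)
```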
Probably Approximately Correct (PAC) learning
• Introduced by Leslie Valiant in 1984, PAC learning provides a formal
way to discuss the learnability of concepts (e.g., classifications or
other predictions) from data.

• PAC learning offers insights into how well a learning algorithm can be
expected to perform given certain conditions, including the
complexity of the learning task, the amount of training data, and the
allowable error rate.

• Probably: Refers to the confidence (1 − δ) with which we can expect the learning algorithm to achieve its goal.

• Approximately Correct: Indicates that the hypothesis output by the learning algorithm doesn't have to be perfect but must be close to the true concept within some error margin (ε).
Probably Approximately Correct (PAC) learning
• With high probability, a PAC learning algorithm produces a hypothesis that is approximately identical to the target concept.

• PAC Learning Example:

• Given a training set of N cars, each described by its price and engine power as a pair (p, e), find whether the car is a family car or not. The hypothesis answers whether a given car is a family car.
• In the following figure, C is the target function. Instances inside the rectangle represent family cars and those outside are not family cars.
• The hypothesis h closely approximates C, and there may be an error region.
Probably Approximately Correct (PAC) learning
• Instances lying in the shaded region are positive/negative according to the actual function C, but negative/positive according to the hypothesis h. Hence they are called false negatives or false positives.

• The probability of the error region needs to be small.

• The error region: P(C XOR h) ≤ ε, where 0 ≤ ε ≤ 0.5.
Probably Approximately Correct (PAC) learning
• The aim of PAC learning is a low generalization error with high probability.
• Generalization Error:
• The difference between the error on the training dataset and the error on the
entire distribution of data.
• It's a measure of how well a hypothesis generalizes to new data, beyond the
examples it was trained on.
• A low generalization error indicates that the model performs well on both the
training data and unseen data.
• Generalization Error < ϵ, means the model's predictions are, on average, ϵ-
close to the true outcomes for unseen data.
• ϵ is a threshold that defines the maximum allowable average difference
(error) between the model's predicted outcomes and the true outcomes for
the unseen data. It's a measure of how much error we're willing to tolerate.
Probably Approximately Correct (PAC) learning
• Generalization Error:
• For example, let's choose ϵ = 0.1 or 10%. This means we're allowing up to a 10% generalization error on the model's predictions on new data compared to the true outcomes.
• Suppose the training accuracy is 98% and the unseen-data accuracy is 90%, so the generalization error is 98% − 90% = 8%.
• Since we've set ϵ = 10%, and our calculated generalization error is 8%, which is less than ϵ, the model's performance is within our tolerance for error.
• Therefore the Generalization Error (8%) < ϵ (10%). A small sketch of this check appears at the end of this topic.

• High Probability (At Least 1−δ):


• The model achieves this generalization error with a confidence level of at
least 1−δ. This probability threshold indicates the reliability of the model's
performance on unseen data.
Probably Approximately Correct (PAC) learning
• δ is a small positive number chosen by the practitioner before the learning
process begins.
• It represents the tolerance for the model's failure to meet the ϵ-close
criterion on the unseen data.
• A smaller δ corresponds to a higher confidence requirement.
• 1−δ represents the confidence level that the learning algorithm will
produce a hypothesis with a generalization error less than ϵ. If δ=0.05, for
example, it implies that we are 95% confident (since 1−0.05=0.95) that the
generalization error of the learned hypothesis (8%) will be less than ϵ(10%).
• Hence, the learning algorithm is “Probably Approximately Correct”
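Below is a minimal sketch of the ε-check described in the example above, using the same accuracy figures; the helper name pac_check is illustrative and not part of any library:

```python
# Minimal sketch: the "approximately correct" part of the PAC check, using the
# accuracies (in percent) from the example above.
def pac_check(train_accuracy_pct, test_accuracy_pct, epsilon_pct):
    generalization_error = train_accuracy_pct - test_accuracy_pct
    return generalization_error, generalization_error < epsilon_pct

gen_err, within_tolerance = pac_check(98, 90, 10)
print(gen_err)             # 8    -> the 8% from the example
print(within_tolerance)    # True -> 8% < 10%, so "approximately correct"
# The "probably" part says this holds with probability at least 1 - delta,
# e.g. delta = 0.05 means 95% confidence that the bound holds on unseen data.
```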
Hypothesis Space
[Slides illustrating the hypothesis space with figures only.]
Generalization
• Generalization refers to the ability of a trained model to
accurately make predictions on new, unseen data.
• That is, after being trained on a training set, a model
can analyse new data and make accurate predictions.
• A model’s ability to generalize is central to the success
of a model.
• Without generalization, the model may become too
specific to the training set, memorizing specific words
or phrases that were common in the training data and
failing to understand new examples.
Generalization
• Overfitting:
• Overfitting refers to a scenario where a machine
learning model memorizes the training data but does
not correctly learn its underlying patterns.
• Overfit models perform exceptionally well on training
data but fail to generalize to new, unseen data.
• If a model has been trained too well on training data, it
will be unable to generalize. It will make inaccurate
predictions when given new data, making the model
useless even though it is able to make accurate
predictions for the training data.
Generalization
• Underfitting:
• Occurs when a machine learning model is too simplistic
and can’t capture the underlying patterns in the data.
• Underfitting happens when a model has not been trained
enough on the data.
• Underfitting makes the model just as useless: it is not capable of making accurate predictions, even on the training data.
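A minimal sketch of underfitting and overfitting, assuming NumPy; it fits polynomials of degree 1, 2 and 9 to a small noisy quadratic dataset (these choices are illustrative) and compares training and test errors:

```python
# Minimal sketch: underfitting vs overfitting with polynomial regression (NumPy).
# The noisy quadratic data and the degrees compared are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0, 0.05, 10)
x_test = np.linspace(0, 1, 50)
y_test = 1.0 + 2.0 * x_test - 3.0 * x_test**2 + rng.normal(0, 0.05, 50)

for degree in (1, 2, 9):                     # too simple, about right, too complex
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 4), round(test_err, 4))
# degree 1: noticeable error on both sets            -> underfitting
# degree 2: low error on both sets                   -> good generalization
# degree 9: near-zero training error, usually a
#           larger test error                        -> overfitting
```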
Inductive Bias
• Inductive bias is the set of assumptions that a machine
learning algorithm makes about the relationship between
input variables (features) and output variables (labels)
based on the training data.
• Real life example: When a child learns to recognize
animals, they may start with a bias that animals have four
legs. This bias helps them identify a wide range of animals
like dogs, cats, cows, etc., but it may fail when confronted
with animals like snakes or birds.
• Inductive bias is the prior knowledge or beliefs that the
algorithm uses to generalize from the training data to new,
unseen data.
Inductive Bias
• The phrase "preferring one answer over another after
viewing certain instances" sums up inductive bias.
• Every ML model has a bias of its own.

• Different models can be trained based on fixed training


data. All those models will behave differently for new
unseen data.
Inductive Bias
• A few examples of inductive bias are listed below:
• Linear ML model:
• The linear model presupposes that each of the input
characteristics and the target have a linear connection.
• Example: A direct linear relationship between the number
of years someone has worked (input feature) and their
salary (target).
• Decision Trees:
• Decision trees assume the target can be predicted by a sequence of simple, rule-based splits, with a constant prediction stored in each leaf.
• In a decision tree, each node applies a simple rule like "wear a raincoat if raining" to direct decisions from top to bottom, leading to a final choice based on the evaluated conditions.
Inductive Bias
• Convolutional Neural Network:
• The layer-based structure of a convolutional neural network
imposes a bias toward hierarchical processing.
• For example, facial recognition technology, using CNNs,
starts with layers that identify basic features like edges, then
builds up to complex ones, ultimately recognizing faces with
precision.
• Bayesian modeling:
• The priors selected in this case greatly reveal the bias (which
tells the model what happens when not much data is
available).
• In spam email detection with scant data, the model uses prior
assumptions, like marking emails with "free" as spam,
demonstrating how initial biases guide early machine learning
decisions.
Inductive Bias
• Importance of Inductive Bias in ML:

1. Necessity in ML: Essential for guiding the algorithm in


learning the target function from given training
examples.
2. Role in Prediction: Enables the model to generalize from
training data to unseen situations, crucial for accurate
predictions.
3. Example of Inductive Bias: Occam's razor principle,
favoring simpler hypotheses that explain the data just as
well as more complex ones.
Inductive Bias
• Importance of Inductive Bias in ML:
4. Consistency Requirement: The chosen hypothesis
should correctly predict outcomes for all training
instances.
5. Mathematical Foundation: Inductive bias helps logically
imply the learner’s hypothesis, guiding the learning
process.
6. Real-world Application Challenges: In many cases, such
as with neural networks, the inductive bias is implicit
and not easily formalized.
Inductive Bias
• Types of Inductive Bias in ML:
1. Maximum conditional independence: It aims to maximize
conditional independence if the hypothesis can be framed within a
Bayesian framework. The Naive Bayes classifier employs this bias.

2. Minimum cross-validation error: It picks the hypothesis


with the lowest cross-validation error when trying to decide
between them. Despite the fact that cross-validation may appear
to be bias-free, the "no free lunch" theorems demonstrate that
cross-validation is in fact biased.

3. Maximum margin: When dividing data points into two classes, try to make the boundary (the margin) between them as wide as possible. This is the bias in support vector machines. It is assumed that distinct classes tend to be separated by a wide gap.
Inductive Bias
• Types of Inductive Bias in ML:
4. Minimum description length: When formulating a hypothesis,
make an effort to keep the description as brief as possible.

5. Minimum features: Unless a feature is supported by solid


evidence, it should be removed. The underlying premise of feature
selection algorithms is this.

6. Nearest neighbors: In a small neighborhood in feature space, it is reasonable to assume that most of the cases belong to the same class. Assume that a case whose class is unknown belongs to the same class as the majority in its neighborhood. The k-nearest neighbors algorithm employs this bias. The underlying premise is that cases that are close to one another typically belong to the same class.
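A minimal sketch contrasting two of these biases on the same toy data, assuming scikit-learn is available; LinearSVC carries the maximum-margin bias and KNeighborsClassifier the nearest-neighbor bias, and the data points and query are illustrative:

```python
# Minimal sketch: the same toy data seen through two different inductive biases.
# LinearSVC prefers the widest-margin linear boundary; KNeighborsClassifier
# assumes nearby points share a class. The data below is illustrative.
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 0], [0, 1], [4, 4], [5, 4], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

svm = LinearSVC(C=1.0).fit(X, y)                       # maximum-margin bias
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)    # nearest-neighbor bias

query = [[2.5, 2.5]]                          # a point halfway between the groups
print(svm.predict(query), knn.predict(query)) # each model commits to a class
                                              # according to its own bias
```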
Bias-Variance Trade-off
• Grasping the concept of prediction errors, including both bias and
variance, is crucial for enhancing the accuracy of machine-learning
algorithms.
• Bias represents assumptions made by a model that may simplify
reality too much.(If a model trying to predict ice cream sales based on temperature always
estimates 200 sales regardless of whether it's 75°F or 90°F, it shows high bias. The model oversimplifies by
not adjusting its predictions for varying temperatures, leading to a consistent under- or overestimation.)

• Variance indicates how much a model's predictions vary for a given


data point. (If you're trying to predict how many ice creams will be sold based on temperature, but
your model gives widely varying forecasts (e.g., 100 ice creams on one 75°F day, then 300 on another 75°F
day with similar conditions), this demonstrates high variance. The model is too sensitive to small changes in
the input data, leading to inconsistent predictions for the same situation.)

• Achieving the right balance between bias and variance is essential


for developing accurate and reliable machine-learning models.
Bias-Variance Trade-off
• Bias:
• Bias is the difference between the predictions of the ML model and the correct values.
• High bias gives a large error on training as well as testing data. It is recommended that an algorithm should always be low-biased to avoid the problem of underfitting.
• With high bias, the predictions follow a straight-line format and thus do not fit the data set accurately. Such fitting is known as Underfitting of Data. This happens when the hypothesis is too simple or linear in nature. Refer to the graph given below for an example of such a situation.
• In such a problem, the hypothesis looks as follows:
  h(x) = θ0 + θ1·x
Bias-Variance Trade-off
• Variance:
• The variability of model prediction for a given data point, which tells us the spread of our data, is called the variance of the model.
• A model with high variance
 • has a very complex fit to the training data,
 • is not able to fit accurately on data it hasn't seen before, and
 • performs very well on training data but has high error rates on test data.
• When a model is high on variance, it is said to be Overfitting the Data.
• In such a problem, the hypothesis looks as follows (a high-degree polynomial):
  h(x) = θ0 + θ1·x + θ2·x² + … + θn·xⁿ
Bias-Variance Trade-off
• Bias-Variance Trade-off:
• If the algorithm is too simple (a hypothesis with a linear equation), then it may be in a high-bias, low-variance condition and thus be error-prone.
• If the algorithm fits too complex a model (a hypothesis with a high-degree equation), then it may be in a high-variance, low-bias condition; in this latter condition, new entries will not perform well.
• The condition between these two extremes is known as the Bias-Variance Trade-off.
• The perfect trade-off looks like the figure below.
Bias-Variance Trade-off

• As model complexity increases, training error (yellow dashed line) decreases, but variance (blue dashed line) increases, indicating overfitting.
• A less complex model has higher bias (black dotted line), leading to underfitting.
• The red dashed line represents irreducible error, which cannot be reduced by improving the model.
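These curves can also be estimated empirically. The sketch below, assuming NumPy, repeatedly refits polynomials of increasing degree on freshly sampled training sets and measures the squared bias and the variance of their predictions at one fixed test input; the data-generating function, noise level and degrees are illustrative assumptions:

```python
# Minimal sketch: empirical bias^2 and variance of polynomial models of
# increasing complexity, averaged over many resampled training sets.
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)   # illustrative data-generating function
x0 = 0.3                                   # fixed test input to evaluate predictions at

for degree in (1, 3, 9):                   # increasing model complexity
    preds = []
    for _ in range(200):                   # many independently sampled training sets
        x = rng.uniform(0, 1, 15)
        y = true_f(x) + rng.normal(0, 0.2, 15)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    print(degree, round(bias_sq, 4), round(variance, 4))
# Expected pattern: bias^2 shrinks while variance grows as the degree increases.
```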
