UNIT II Part-1
Parametric Methods
Code:U18CST7002
Presented by: Nivetha Raju
Department: CSE
Machine Learning
• Machine learning can be described as learning a function (f) that maps input variables (X) to an output variable (Y):
• Y = f(X)
Examples of parametric machine learning algorithms include:
• Logistic Regression
• Linear Discriminant Analysis
• Perceptron
• Naive Bayes
• Simple Neural Networks
Parametric model
Assumptions can greatly simplify the learning
process but can also limit what can be learned.
Algorithms that simplify the function to a known
form are called parametric machine learning
algorithms.
We assume that the sample is drawn from some
distribution that obeys a known model, for
example, Gaussian.
The algorithms involve two steps:
1. Select a form for the function.
2. Learn the coefficients (parameters) for the function from the training data.
The advantage of the parametric approach is that the model is defined by a small number of parameters (for example, the mean and variance), which are the sufficient statistics of the distribution.
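As a hedged illustration of these two steps, a minimal sketch in Python: assume a linear form y = w0 + w1·x and learn the two coefficients from training data (the data and numbers below are illustrative assumptions).

```python
import numpy as np

# Step 1: select a form for the function, here a line y = w0 + w1 * x.
# Step 2: learn the coefficients (parameters) w0 and w1 from training data.

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)              # synthetic inputs (assumption)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 50)     # synthetic noisy targets (assumption)

w1, w0 = np.polyfit(x, y, deg=1)             # least-squares estimates of the two coefficients
print(f"learned form: y = {w0:.2f} + {w1:.2f} * x")
```

Only w0 and w1 need to be stored, which is what makes the approach parametric.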
Maximum Likelihood
Maximum Likelihood Estimation (MLE) is a statistical method used to
estimate the parameters of a statistical model. The core idea is to find the
parameter values that make the observed data most probable under the
assumed model.
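In the usual textbook notation, for an i.i.d. sample $\mathcal{X} = \{x^t\}_{t=1}^{N}$ this idea can be written as (a sketch of the standard definitions):

$$l(\theta \mid \mathcal{X}) = \prod_{t=1}^{N} p(x^t \mid \theta), \qquad \mathcal{L}(\theta \mid \mathcal{X}) = \log l(\theta \mid \mathcal{X}) = \sum_{t=1}^{N} \log p(x^t \mid \theta), \qquad \hat{\theta}_{ML} = \arg\max_{\theta} \mathcal{L}(\theta \mid \mathcal{X})$$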
Maximum Likelihood - Gaussian Density
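As a hedged sketch of the standard result for this case: if p(x) is assumed Gaussian, N(μ, σ²), the maximum likelihood estimates of its parameters are the sample mean and the sample variance.

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad m = \frac{1}{N}\sum_{t=1}^{N} x^t, \qquad s^2 = \frac{1}{N}\sum_{t=1}^{N} (x^t - m)^2$$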
Parametric Classification
To predict the class of new data, the trained classifier chooses the class with the smallest misclassification cost, which (assuming a 0/1 loss) corresponds to choosing the class with the largest discriminant function g_i(x) = log p(x | C_i) + log P(C_i).
• Given the sample $\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$, where

$$r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases}$$

• ML estimates are

$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}, \qquad m_i = \frac{\sum_t x^t r_i^t}{\sum_t r_i^t}, \qquad s_i^2 = \frac{\sum_t (x^t - m_i)^2\, r_i^t}{\sum_t r_i^t}$$

• Discriminant becomes

$$g_i(x) = -\frac{1}{2}\log 2\pi - \log s_i - \frac{(x - m_i)^2}{2 s_i^2} + \log \hat{P}(C_i)$$
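A minimal sketch of this classifier in Python (the toy data and all names are illustrative assumptions):

```python
import numpy as np

def fit_gaussian_classifier(x, y):
    """Estimate prior, mean, and variance for each class (the ML estimates above)."""
    params = {}
    for c in np.unique(y):
        xc = x[y == c]
        params[c] = {
            "prior": len(xc) / len(x),      # P^(C_i)
            "mean": xc.mean(),              # m_i
            "var": xc.var(),                # s_i^2 (ML estimate, divides by N_i)
        }
    return params

def discriminant(x_new, p):
    """g_i(x) = -1/2 log 2*pi - log s_i - (x - m_i)^2 / (2 s_i^2) + log P^(C_i)."""
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(p["var"])
            - (x_new - p["mean"]) ** 2 / (2 * p["var"]) + np.log(p["prior"]))

def predict(x_new, params):
    # choose the class with the largest discriminant value
    return max(params, key=lambda c: discriminant(x_new, params[c]))

# Usage on a toy 1-D dataset (assumed, for illustration only)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
y = np.array([0] * 100 + [1] * 100)
params = fit_gaussian_classifier(x, y)
print(predict(2.4, params))
```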
Parametric Classification
(Figure) Equal variances: a single boundary at the halfway point between the means.
Parametric Classification
(Figure) Two boundaries.
Logistic Regression
Logistic regression is another supervised learning algorithm, used to solve classification problems.
The logistic regression algorithm works with a categorical target variable such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc.
What is Logistic Regression?
Logistic regression is used for binary classification. It uses the sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1.
Logistic Function – Sigmoid Function
• The sigmoid function is a mathematical function used to map predicted values to probabilities.
• It maps any real value to a value between 0 and 1. Since the output of logistic regression must lie between 0 and 1 and cannot go beyond these limits, the function forms an "S"-shaped curve.
• The S-shaped curve is called the sigmoid function or the logistic function.
• In logistic regression, we use a threshold value to decide between the two classes: values above the threshold are mapped to 1, and values below the threshold are mapped to 0.
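Its standard form is

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$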
How does Logistic Regression work?
The logistic regression model transforms the continuous output of the linear regression function into a categorical output using the sigmoid function, which maps the real-valued linear combination of the independent variables to a value between 0 and 1. This function is known as the logistic function.
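A minimal sketch of this pipeline (linear combination, then sigmoid, then a 0.5 threshold), trained by gradient descent; the toy data and all names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary-classification data (assumed): one feature, labels 0/1.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1, 1, 100), rng.normal(2, 1, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

w, b, lr = np.zeros(X.shape[1]), 0.0, 0.1
for _ in range(1000):                       # gradient descent on the log loss
    p = sigmoid(X @ w + b)                  # predicted probabilities in (0, 1)
    w -= lr * X.T @ (p - y) / len(y)
    b -= lr * np.mean(p - y)

probs = sigmoid(X @ w + b)
preds = (probs >= 0.5).astype(int)          # threshold at 0.5 -> class 0 or 1
print("training accuracy:", (preds == y).mean())
```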
Parametric Regression
In regression, we would like to write the numeric output, called the dependent variable, as a function of the input, called the independent variable, plus random noise:

$$r = f(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

Our estimator of the unknown $f(\cdot)$ is $g(x \mid \theta)$, so

$$p(r \mid x) \sim \mathcal{N}\big(g(x \mid \theta), \sigma^2\big), \qquad p(x, r) = p(r \mid x)\, p(x)$$

The log likelihood of the parameters is

$$\mathcal{L}(\theta \mid \mathcal{X}) = \log \prod_{t=1}^{N} p(x^t, r^t) = \sum_{t=1}^{N} \log p(r^t \mid x^t) + \sum_{t=1}^{N} \log p(x^t)$$
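Substituting the Gaussian density for p(r | x) and dropping the terms that do not depend on θ, maximizing this log likelihood is equivalent to minimizing the least-squares error (the standard derivation, sketched here):

$$E(\theta \mid \mathcal{X}) = \frac{1}{2} \sum_{t=1}^{N} \big[r^t - g(x^t \mid \theta)\big]^2$$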
Parametric Regression
Linear Regression
Polynomial regression
Bias
• Bias is one type of error that occurs due to wrong assumptions about the data, such as assuming the data is linear when in reality it follows a more complex function.
• Variance, on the other hand, is introduced by high sensitivity to variations in the training data.
• This is also a type of error, since we want our model to be robust against noise.
• There are two types of error in machine learning: reducible error and irreducible error. Bias and variance come under reducible error.
What is Bias?
• Bias is simply the inability of the model to fit the data, which causes a difference, or error, between the model's predicted value and the actual value.
• These differences between the actual (or expected) values and the predicted values are known as bias error, or error due to bias. Bias is a systematic error that occurs due to wrong assumptions in the machine learning process.
• Let $Y$ be the true value of a parameter, and let $\hat{Y}$ be an estimator of $Y$ based on a sample of data. Then the bias of the estimator $\hat{Y}$ is given by $\text{Bias}(\hat{Y}) = E[\hat{Y}] - Y$.
Ways to reduce high bias in Machine
Learning
• Use a more complex model: One of the main reasons for high bias is an overly simplified model, which will not be able to capture the complexity of the data.
• Increase the number of features: Adding more features to the training dataset will increase the complexity of the model.
• Reduce Regularization of the model: Regularization
techniques such as L1 or L2 regularization can help to
prevent overfitting and improve the generalization ability
of the model.
• Increase the size of the training data
Variance
• Variance is the measure of spread in data from its mean
position.
• In machine learning variance is the amount by which the
performance of a predictive model changes when it is
trained on different subsets of the training data.
• More specifically, variance is the variability of the model: how sensitive it is to a different subset of the training dataset, i.e. how much it adjusts to a new subset of the training data.
Ways to Reduce the Variance in Machine
Learning:
• Cross-validation: By splitting the data into training and testing
sets multiple times, cross-validation can help identify if a model
is overfitting or underfitting and can be used to tune
hyperparameters to reduce variance.
• Feature selection: Choosing only the relevant features decreases the model's complexity and can reduce the variance error.
• Regularization: We can use L1 or L2 regularization to reduce
variance in machine learning models
• Ensemble methods: These combine multiple models to improve generalization performance. Bagging, boosting, and stacking are common ensemble methods that can help reduce variance.
Variance errors are classified as either low-variance or high-variance errors.
• Low variance: Low variance means that the model is less sensitive to changes in the training data and can produce consistent estimates of the target function with different subsets of data from the same distribution. Combined with high bias, this is the case of underfitting, where the model fails to generalize on both training and test data.
Different Combinations of Bias-Variance
Tuning Model Complexity: Bias/Variance
Dilemma
• Let us say that a sample $\mathcal{X} = \{x^t, r^t\}$ is drawn from some unknown joint probability density p(x, r). Using this sample, we construct our estimate g(·). The expected square error (over the joint density) at x can be written as shown below.
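A sketch of the standard decomposition, in the usual textbook form:

$$E\big[(r - g(x))^2 \mid x\big] = \underbrace{E\big[(r - E[r \mid x])^2 \mid x\big]}_{\text{noise}} + \underbrace{\big(E[r \mid x] - g(x)\big)^2}_{\text{squared error}}$$

Averaging the second term over samples $\mathcal{X}$ splits it further into bias and variance:

$$E_{\mathcal{X}}\Big[\big(E[r \mid x] - g(x)\big)^2 \mid x\Big] = \underbrace{\big(E[r \mid x] - E_{\mathcal{X}}[g(x)]\big)^2}_{\text{bias}^2} + \underbrace{E_{\mathcal{X}}\Big[\big(g(x) - E_{\mathcal{X}}[g(x)]\big)^2\Big]}_{\text{variance}}$$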
Tuning Model Complexity: Bias/Variance
Dilemma
(a) Function f(x) = 2 sin(1.5x) and one noisy dataset (noise ~ N(0, 1)) sampled from the function. Five samples are taken, each containing twenty instances. (b), (c), (d) show five polynomial fits, g_i(·), of order 1, 3, and 5, respectively. For each case, the dotted line is the average of the five fits, ḡ(·).
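A minimal sketch that reproduces this kind of experiment in Python (the constants follow the caption; everything else, including the fixed input grid, is an illustrative assumption):

```python
import numpy as np

def f(x):
    return 2 * np.sin(1.5 * x)

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)                  # 20 fixed input points (assumption)
n_samples, orders = 100, (1, 3, 5)

fits = {d: [] for d in orders}
for _ in range(n_samples):                 # draw noisy datasets r = f(x) + N(0, 1)
    r = f(x) + rng.normal(0, 1, x.size)
    for d in orders:
        fits[d].append(np.polyval(np.polyfit(x, r, deg=d), x))

for d in orders:
    g = np.array(fits[d])                  # shape: (n_samples, n_points)
    g_bar = g.mean(axis=0)                 # average fit over the samples
    bias2 = np.mean((g_bar - f(x)) ** 2)   # squared bias
    var = np.mean(g.var(axis=0))           # variance across samples
    print(f"order {d}: bias^2 = {bias2:.3f}, variance = {var:.3f}, error = {bias2 + var:.3f}")
```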
Tuning Model Complexity: Bias/Variance
Dilemma
In the same setting as the previous figure, but using one hundred models instead of five: bias, variance, and error for polynomials of order 1 to 5.
Order 1 has the smallest variance. Order 5 has the smallest bias. As the
order is increased, bias decreases but variance increases. Order 3 has the
minimum error.
Tuning Model Complexity: Bias/Variance
Dilemma
As the order of the polynomial increases, small changes in the dataset
cause a greater change in the fitted polynomials; thus variance increases.
But a complex model on average allows a better fit to the underlying function; thus bias decreases. This is called the bias/variance dilemma.
• If there is bias, this indicates that our model class does not contain
the solution; this is underfitting
• If there is variance, the model class is too general and also learns the
noise; this is overfitting.
Tuning Model Complexity: Bias/Variance
Dilemma
• If the algorithm is too simple (a hypothesis with a linear equation), it may suffer from high bias and low variance and is thus error-prone.
• If the algorithm is too complex (a hypothesis with a high-degree equation), it may suffer from high variance and low bias.
• In the latter condition, the model will not perform well on new data. Between these two conditions lies a balance known as the bias-variance trade-off.
• This trade-off in complexity is why there is a trade-off between bias and variance: an algorithm can't be more complex and less complex at the same time.
Bias Variance Tradeoff
Parameters and Hyperparameters
Cross-validation
Comparison of Cross-validation to train/test
split in Machine Learning
• Train/test split: The input data is divided into two parts, a training set and a test set, in a ratio such as 70:30 or 80:20. This approach can give high-variance estimates of model performance, which is one of its biggest disadvantages.
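For contrast, a minimal sketch of k-fold cross-validation written with plain NumPy (the dataset, the polynomial model, and k = 5 are illustrative assumptions):

```python
import numpy as np

def k_fold_scores(x, r, k=5, order=3):
    """Fit a polynomial on k-1 folds and score it on the held-out fold, k times."""
    idx = np.random.default_rng(0).permutation(len(x))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], r[train], deg=order)
        pred = np.polyval(coeffs, x[test])
        scores.append(np.mean((r[test] - pred) ** 2))   # mean squared error on the held-out fold
    return scores

# Toy data (assumed) just to exercise the function
x = np.linspace(0, 5, 100)
r = 2 * np.sin(1.5 * x) + np.random.default_rng(1).normal(0, 1, 100)
print(k_fold_scores(x, r))   # averaging these gives a lower-variance error estimate
```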
Structural Risk Minimization
Regularization
Lasso Regression
• A regression model which uses the L1 regularization technique is called LASSO (Least Absolute Shrinkage and Selection Operator) regression. Lasso regression adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function (L).
• Lasso regression also helps us achieve feature selection by driving the weights of features that do not serve any purpose in the model approximately to zero.
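As a hedged sketch of the penalized objective (here w_j are the coefficients and λ is the regularization strength, a hyperparameter):

$$\text{Loss} = \sum_{t=1}^{N} \big(r^t - \hat{r}^t\big)^2 + \lambda \sum_{j=1}^{d} |w_j|$$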
Regularization
Ridge Regression
A regression model that uses the L2 regularization technique
is called Ridge regression. Ridge regression adds the “squared
magnitude” of the coefficient as a penalty term to the loss
function (L).
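The corresponding objective, in the same notation (a sketch):

$$\text{Loss} = \sum_{t=1}^{N} \big(r^t - \hat{r}^t\big)^2 + \lambda \sum_{j=1}^{d} w_j^2$$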
Regularization
Elastic Net Regression
This model is a combination of both L1 and L2 regularization. That implies that we add the absolute norm of the weights as well as the squared norm of the weights to the loss, with an extra hyperparameter that controls the ratio of the L1 and L2 regularization.
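One common parameterization (a sketch; the exact form of the mixing term varies between texts), with α ∈ [0, 1] controlling the L1/L2 ratio:

$$\text{Loss} = \sum_{t=1}^{N} \big(r^t - \hat{r}^t\big)^2 + \lambda \left( \alpha \sum_{j} |w_j| + (1 - \alpha) \sum_{j} w_j^2 \right)$$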
Akaike’s Information Criterion (AIC) & Bayesian Information Criterion (BIC)
AIC and BIC
• AIC and BIC: These methods estimate how much the
training error might be optimistic (underestimated) and
adjust it to predict the test error.
• No Validation Needed: They do this adjustment
without requiring a separate validation set.
• Influence of Input Features: The more input features
(or parameters) the model has, the greater the
adjustment for underestimation.
• Effect of Training Set Size: As the size of the training
set increases, the underestimation decreases.
• Impact of Noise: The adjustment also increases with
the amount of noise in the data, which can be estimated
from the error of a low-bias model.
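For reference, the usual forms of these criteria (a sketch; k is the number of parameters, N the number of training samples, and $\hat{L}$ the maximized likelihood):

$$\text{AIC} = 2k - 2\ln \hat{L}, \qquad \text{BIC} = k \ln N - 2\ln \hat{L}$$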
Minimum description length (MDL)