
UNIT II

Parametric Methods

Code: U18CST7002
Presented by: Nivetha Raju
Department: CSE
Machine Learning
•Machine learning can be described as learning a function (f)
that maps input variables (X) to output variables (Y).

•Y = f(X)

•Machine learning models can therefore be parameterized so
that their behavior can be tuned for a given problem.
•These models can have many parameters, and finding
the best combination of parameters can be treated as a
search problem.
Machine Learning

• An algorithm learns this target function
from the training data.
• In order to estimate the unknown function,
we need to fit a model over the data (the
training data to be more precise).
• The form of the function is unknown, so our
job as machine learning practitioners is to
evaluate different machine learning
algorithms and see which is better at
approximating the underlying function.
• In general, this process can be parametric
or non-parametric.
Machine Learning
•Parametric models assume a specific functional form for
the underlying distribution of the data and are
characterized by a fixed number of parameters. The
model structure is predefined, and the parameters are
estimated from the data.
•Example: Linear regression, logistic regression, and
Gaussian Naive Bayes.

•Non-parametric models do not assume a specific
functional form for the data. Instead, they are flexible and
can adapt to the shape of the data. The number of
parameters can grow with the amount of data.
Example: k-Nearest Neighbors (k-NN), Decision Trees,
and Kernel Density Estimation.

•Semi-parametric models combine aspects of both
parametric and non-parametric models. They include a
parametric component that captures the main structure
of the data and a non-parametric component that allows
for additional flexibility.
What is a parameter in a machine
learning model?
A model parameter is a configuration variable that is internal to
the model and whose value can be estimated from the given
data.

•They are required by the model when making predictions.
•Their values define the skill of the model on your problem.
•They are estimated or learned from historical training data.
•They are often not set manually by the practitioner.
•They are often saved as part of the learned model.
What is a parameter in a machine
learning model?
The examples of model parameters include:

• The weights in an artificial neural network.
• The support vectors in a support vector machine.
• The coefficients in linear regression or logistic regression.
Parametric model
A parametric model is a learning model that summarizes data with a set of
fixed-size parameters (independent of the number of training
instances). Parametric machine learning algorithms simplify
the mapping function to a known form.
Parametric model
• In a parametric model, you know in advance which model you
are going to fit to the data, for example, a linear
regression line.
• b0 + b1*x1 + b2*x2 = 0
where,
• b0, b1, b2 → the coefficients of the line that control the
intercept and slope
• x1, x2 → input variables
• The assumed functional form is often a linear
combination of the input variables, and as such, parametric
machine learning algorithms are also frequently referred
to as ‘linear machine learning algorithms.’
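As a minimal sketch of fitting such a fixed-size parameter vector (illustrative code, assuming NumPy and synthetic data; the names are not from the slides), ordinary least squares estimates b0, b1, b2 directly:

```python
import numpy as np

# Synthetic data: y depends linearly on two input variables plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # columns x1, x2
y = 1.5 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Add a column of ones so the intercept b0 is learned as a coefficient.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares estimate of the fixed-size parameter vector (b0, b1, b2).
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("b0, b1, b2 =", coef)
```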
Parametric model
Some more examples of parametric machine learning
algorithms include:

• Logistic Regression
• Linear Discriminant Analysis
• Perceptron
• Naive Bayes
• Simple Neural Networks

Parametric model
Assumptions can greatly simplify the learning
process but can also limit what can be learned.
Algorithms that simplify the function to a known
form are called parametric machine learning
algorithms.
We assume that the sample is drawn from some
distribution that obeys a known model, for
example, Gaussian.
The algorithms involve two steps:
1. Select a form for the function.
2. Learn the coefficients (parameters) for the function from
the training data.
The advantage of the parametric approach is that
the model is defined up to a small number of
parameters—for example, mean, variance—the
sufficient statistics of the distribution.
Maximum Likelihood
Maximum Likelihood Estimation (MLE) is a statistical method used to
estimate the parameters of a statistical model. The core idea is to find the
parameter values that make the observed data most probable under the
assumed model.
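As an illustration (my own sketch, not from the slides), for a Gaussian the ML estimates of the mean and variance are the sample mean and the variance normalized by N; the snippet below checks this numerically with NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # sample drawn from N(5, 2^2)

# Closed-form ML estimates for a Gaussian: sample mean and
# sample variance normalized by N (not N - 1).
mu_hat = x.mean()
var_hat = ((x - mu_hat) ** 2).mean()
print("ML estimate of the mean:", mu_hat)
print("ML estimate of the variance:", var_hat)
```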

Maximum Likelihood - Gaussian Density
Parametric Classification
To predict the classes of new data, the trained classifier finds the class
with the smallest misclassification cost. So we use the discriminant
function

• Given the sample $\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$, where

  $r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases}$

• The ML estimates are

  $\hat{P}(C_i) = \dfrac{\sum_t r_i^t}{N}$,  $m_i = \dfrac{\sum_t x^t r_i^t}{\sum_t r_i^t}$,  $s_i^2 = \dfrac{\sum_t (x^t - m_i)^2\, r_i^t}{\sum_t r_i^t}$

• The discriminant becomes

  $g_i(x) = -\dfrac{1}{2}\log 2\pi - \log s_i - \dfrac{(x - m_i)^2}{2 s_i^2} + \log \hat{P}(C_i)$
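As a hedged, illustrative sketch of these formulas (my own code with synthetic 1-D data; not from the slides), the classifier below estimates the prior, mean, and variance of each class and picks the class with the largest discriminant:

```python
import numpy as np

def fit_gaussian_classifier(x, y):
    """Estimate prior P(C_i), mean m_i, and variance s_i^2 for each class."""
    params = {}
    for c in np.unique(y):
        xc = x[y == c]
        params[c] = {
            "prior": len(xc) / len(x),
            "mean": xc.mean(),
            "var": ((xc - xc.mean()) ** 2).mean(),   # ML (biased) variance
        }
    return params

def discriminant(x0, p):
    """g_i(x) = -0.5*log(2*pi) - log(s_i) - (x - m_i)^2 / (2*s_i^2) + log P(C_i)."""
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(p["var"])
            - (x0 - p["mean"]) ** 2 / (2 * p["var"]) + np.log(p["prior"]))

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
y = np.array([0] * 50 + [1] * 50)

params = fit_gaussian_classifier(x, y)
x0 = 1.4
pred = max(params, key=lambda c: discriminant(x0, params[c]))
print("predicted class for x =", x0, "is", pred)
```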
Parametric Classification

(Figure: equal variances; a single boundary at halfway between the means.)
Parametric Classification

(Figure: variances are different; two boundaries.)
Regression
Logistic regression is another supervised learning algorithm,
which is used to solve classification problems.
The logistic regression algorithm works with a categorical target
variable such as 0 or 1, Yes or No, True or False, Spam or Not
Spam, etc.

What is Logistic Regression?
Logistic regression is used for binary classification. It uses the
sigmoid function, which takes the independent variables as input
and produces a probability value between 0 and 1.

For example, suppose we have two classes, Class 0 and Class 1. If the
value of the logistic function for an input is greater than 0.5
(the threshold value), then the input belongs to Class 1; otherwise it
belongs to Class 0. It is referred to as regression because it
is an extension of linear regression, but it is mainly used for
classification problems.

Logistic Function – Sigmoid Function
• The sigmoid function is a mathematical function used to
map the predicted values to probabilities.
• It maps any real value to another value within the range
of 0 and 1. The output of logistic regression must be
between 0 and 1 and cannot go beyond this limit, so it
forms a curve shaped like the letter "S".
• The S-shaped curve is called the sigmoid function or the
logistic function.
• In logistic regression, we use the concept of a threshold
value, which defines the probability of either 0 or 1: values
above the threshold tend to 1, and values
below the threshold tend to 0.
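A minimal sketch of the sigmoid and the 0.5 threshold rule described above (assuming NumPy; the numbers are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Map any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
labels = (probs > 0.5).astype(int)      # threshold at 0.5
print(probs)                            # approx [0.018 0.269 0.5 0.731 0.982]
print(labels)                           # [0 0 0 1 1]
```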
How does Logistic Regression work?
The logistic regression model transforms the continuous output of
the linear regression function into a categorical output
using the sigmoid function, which maps any real-valued
combination of the independent variables to a value between 0
and 1. This function is known as the logistic function.
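For instance (a sketch assuming scikit-learn is installed; the data is synthetic and the feature is a placeholder), fitting and thresholding look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data with one informative feature.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)               # learns intercept and coefficient
probs = clf.predict_proba([[0.8], [-0.8]])[:, 1]   # sigmoid of the linear score
print("P(class 1):", probs)
print("predicted classes:", (probs > 0.5).astype(int))
```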

Parametric Regression
In regression, we would like to write the numeric output, called the
dependent variable, as a function of the input, called the
independent variable:

$r = f(x) + \epsilon$, where the noise $\epsilon \sim N(0, \sigma^2)$

We approximate $f(x)$ with an estimator $g(x \mid \theta)$, so that

$p(r \mid x) \sim N(g(x \mid \theta), \sigma^2)$ and $p(x, r) = p(r \mid x)\, p(x)$

The log likelihood of the parameters given the sample $\mathcal{X}$ is

$\mathcal{L}(\theta \mid \mathcal{X}) = \log \prod_{t=1}^{N} p(x^t, r^t) = \sum_{t=1}^{N} \log p(r^t \mid x^t) + \sum_{t=1}^{N} \log p(x^t)$
Parametric Regression

(Figures: linear regression and polynomial regression fits were shown on these slides.)
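As a brief sketch of linear vs. polynomial regression (illustrative NumPy code on synthetic data, not taken from the slides):

```python
import numpy as np

# Noisy sample from an underlying nonlinear function.
rng = np.random.default_rng(4)
x = np.linspace(0, 5, 20)
r = 2.0 * np.sin(1.5 * x) + rng.normal(size=x.size)

# Fit polynomials of increasing order; order 1 is plain linear regression.
for order in (1, 3, 5):
    coeffs = np.polyfit(x, r, order)            # least-squares coefficient estimates
    residuals = r - np.polyval(coeffs, x)
    print(f"order {order}: training MSE = {np.mean(residuals ** 2):.3f}")
```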
Bias
• Bias is one type of error that occurs due to wrong
assumptions about the data, such as assuming the data is linear
when, in reality, it follows a complex function.
• On the other hand, variance is introduced by high
sensitivity to variations in the training data.
• This is also a type of error, since we want to make our
model robust against noise.
• There are two types of error in machine learning:
reducible error and irreducible error.
• Bias and variance come under reducible error.

What is Bias?
• Bias is the inability of the model to capture the true relationship
in the data, because of which there is some difference or error between
the model's predicted value and the actual value.
• These differences between the actual (or expected) values and
the predicted values are known as bias error or
error due to bias. Bias is a systematic error that occurs due
to wrong assumptions in the machine learning process.
• Let Y be the true value of a parameter, and let Ŷ be an
estimator of Y based on a sample of data. Then, the bias of
the estimator Ŷ is given by:
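The formula itself appeared as an image on the slide; the standard definition is

$\text{Bias}(\hat{Y}) = E[\hat{Y}] - Y$

where $E[\hat{Y}]$ denotes the expected value of the estimator over repeated samples.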

What is Bias?

• Low Bias: A low bias value means fewer assumptions are
made to build the target function. In this case, the model
will closely match the training dataset.
• High Bias: A high bias value means more assumptions are
made to build the target function. In this case, the model
will not match the training dataset closely.
A high-bias model will not be able to capture the dataset
trend. It is considered an underfitting model with a
high error rate. It is due to a very simplified algorithm.

Ways to reduce high bias in Machine
Learning
• Use a more complex model: One of the main reasons
for high bias is an overly simplified model, which will not be able
to capture the complexity of the data.
• Increase the number of features: Adding more
features to the training dataset will increase the complexity of
the model.
• Reduce regularization of the model: Regularization
techniques such as L1 or L2 regularization help to
prevent overfitting, but too much regularization can cause
underfitting, so reducing it lowers bias.
• Increase the size of the training data.

Variance
• Variance is the measure of spread in data from its mean
position.
• In machine learning, variance is the amount by which the
performance of a predictive model changes when it is
trained on different subsets of the training data.
• More specifically, variance is the variability of the model:
how sensitive it is to another subset of the
training dataset, i.e., how much it adjusts to a new
subset of the training dataset.

Ways to Reduce the Variance in Machine
Learning:
• Cross-validation: By splitting the data into training and testing
sets multiple times, cross-validation can help identify whether a model
is overfitting or underfitting and can be used to tune
hyperparameters to reduce variance.
• Feature selection: Choosing only the relevant features will
decrease the model's complexity and can reduce the variance
error.
• Regularization: We can use L1 or L2 regularization to reduce
variance in machine learning models.
• Ensemble methods: These combine multiple models to
improve generalization performance. Bagging, boosting, and
stacking are common ensemble methods that can help reduce variance.
Variance errors are either low-variance or
high-variance errors.
• Low variance: Low variance means that the model is less
sensitive to changes in the training data and can produce
consistent estimates of the target function with different
subsets of data from the same distribution. This is the case in
underfitting, when the model fails to generalize on both
training and test data.

• High variance: High variance means that the model is very
sensitive to changes in the training data and can result in
significant changes in the estimate of the target function when
trained on different subsets of data from the same distribution.
This is the case in overfitting, when the model performs well on
the training data but poorly on unseen test data.
Ways to Reduce the Variance in Machine
Learning:
• Simplifying the model: Reducing the complexity of the
model, such as decreasing the number of parameters or
layers in a neural network, can also help reduce variance
and improve generalization performance.
• Early stopping: Early stopping is a technique used to
prevent overfitting by stopping the training of the learning
model when the performance on the validation set stops
improving.

Different Combinations of Bias-Variance

• High Bias, Low Variance
• High Variance, Low Bias
• High Bias, High Variance
• Low Bias, Low Variance

Tuning Model Complexity: Bias/Variance
Dilemma
• Let us say that a sample X = {x^t, r^t} is drawn from
some unknown joint probability density p(x, r). Using this
sample, we construct our estimate g(·). The expected
squared error (over the joint density) at x can be written as:
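The equation itself appeared as an image on the slide; the standard decomposition is

$E\big[(r - g(x))^2 \mid x\big] = E\big[(r - E[r \mid x])^2 \mid x\big] + \big(E[r \mid x] - g(x)\big)^2$

The first term is the noise, which does not depend on g(·); the second is the squared error of the estimate. Taking the expectation of the second term over samples $\mathcal{X}$ splits it into bias squared and variance:

$E_{\mathcal{X}}\big[(E[r \mid x] - g(x))^2\big] = \big(E[r \mid x] - E_{\mathcal{X}}[g(x)]\big)^2 + E_{\mathcal{X}}\big[(g(x) - E_{\mathcal{X}}[g(x)])^2\big]$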

Tuning Model Complexity: Bias/Variance
Dilemma
(a) Function, f(x) = 2 sin(1.5x), and one noisy (N(0,1)) dataset sampled from
the function. Five samples are taken, each containing twenty instances.
(b), (c), (d) show the five polynomial fits, g_i(·), of order 1, 3, and 5, respectively.
In each case, the dotted line is the average of the five fits.

Tuning Model Complexity: Bias/Variance
Dilemma
In the same setting as that of the previous figure, but using one hundred models
instead of five: bias, variance, and error for polynomials of order 1 to 5.
Order 1 has the smallest variance. Order 5 has the smallest bias. As the
order is increased, bias decreases but variance increases. Order 3 has the
minimum error.
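A rough sketch reproducing this kind of experiment (my own illustrative code, assuming NumPy): fit polynomials of orders 1 to 5 to many noisy samples of f(x) = 2 sin(1.5x) and measure squared bias and variance against the true function.

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: 2.0 * np.sin(1.5 * x)
x = np.linspace(0, 5, 20)
n_models = 100

for order in range(1, 6):
    fits = []
    for _ in range(n_models):
        r = f(x) + rng.normal(size=x.size)        # a fresh noisy sample
        fits.append(np.polyval(np.polyfit(x, r, order), x))
    fits = np.array(fits)                          # shape (n_models, len(x))
    g_bar = fits.mean(axis=0)                      # average fit over the models
    bias2 = np.mean((g_bar - f(x)) ** 2)           # squared bias
    variance = np.mean(fits.var(axis=0))           # variance across the models
    print(f"order {order}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```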

Tuning Model Complexity: Bias/Variance
Dilemma
As the order of the polynomial increases, small changes in the dataset
cause a greater change in the fitted polynomials; thus variance increases.
But a complex model on average allows a better fit to the underlying
function; thus bias decreases. This is called the bias/variance dilemma.
• If there is bias, this indicates that our model class does not contain
the solution; this is underfitting.
• If there is variance, the model class is too general and also learns the
noise; this is overfitting.

If the model class contains the true function, for example a polynomial of
the same order, we have an unbiased estimator, and the estimated bias
decreases as the number of models increases. This shows the error-reducing
effect of choosing the right model (which we called the inductive bias).
Tuning Model Complexity: Bias/Variance
Dilemma
• If the algorithm is too simple (a hypothesis with a linear
equation), then it may be in a high-bias and low-variance
condition and thus error-prone.
• If the algorithm is too complex (a hypothesis with a high-degree
equation), then it may be in a high-variance and low-bias condition.
• In the latter condition, the new entries will not perform
well. There is something between both of these
conditions, known as a trade-off, or the bias-variance trade-off.
• This trade-off in complexity is why there is a trade-off
between bias and variance.
• An algorithm can't be more complex and less complex at
the same time.
Bias Variance Tradeoff

Parameters and Hyperparameters

Parameters are the internal variables of a model that are
learned from the training data during the training process.
These values are optimized to minimize the error between
the predicted outputs and the actual outputs in the
training set.
•Examples:
•Weights in Neural Networks
•Coefficients in Linear Regression
•Support Vectors in SVM

Hyperparameters are external configurations or settings
used to control the learning process of a model. These are
set before the training process begins and are not learned
from the data.
Examples:
Learning Rate
Number of Trees in a Random Forest
Number of Hidden Layers in a Neural Network
Model Selection Procedures

There are six procedures that can be used to fine-tune
model complexity:
• Cross-validation
• Regularization
• Akaike's information criterion (AIC) and Bayesian
information criterion (BIC)
• Structural risk minimization (SRM)
• Minimum description length (MDL)
• Bayesian model selection

Cross-validation

Cross-validation is a technique for validating the model's
efficiency by training it on a subset of the input data and testing
it on a previously unseen subset of the input data.
The basic steps of cross-validation are listed below (a short usage
sketch follows the list):
• Reserve a subset of the dataset as a validation set.
• Train the model using the training dataset.
• Now, evaluate model performance using the validation set.
If the model performs well on the validation set, proceed to
the further steps; otherwise, check for issues.
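As a usage sketch (assuming scikit-learn; the data and model here are placeholders), k-fold cross-validation can be run as follows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder synthetic data standing in for a real dataset.
rng = np.random.default_rng(6)
X = rng.normal(size=(150, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```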

Comparison of Cross-validation to train/test
split in Machine Learning
• Train/test split: The input data is divided into two parts, a
training set and a test set, in a ratio such as 70:30 or 80:20.
This provides a high-variance estimate, which is one of its biggest
disadvantages.

• Training data: The training data is used to train the model,
and the dependent variable is known.

• Test data: The test data is used to make predictions
with the model that has already been trained on the training data.
It has the same features as the training data but is not part
of it.

• Cross-validation dataset: Cross-validation is used to overcome the
disadvantage of the train/test split by splitting the dataset into
groups of train/test splits and averaging the result. It can
be used if we want to optimize a model that has been
trained on the training dataset for the best performance.
Cross Validation and Structural Risk
Minimization
• In the same setting as that of the earlier figure, training
and validation sets (each containing 50 instances) are
generated. (a) Training data and fitted polynomials of
order 1 to 8. (b) Training and validation errors as a
function of the polynomial order. The “elbow” is at 3.

Structural Risk Minimization

• Structural Risk Minimization (SRM) is a principle in
machine learning and statistical learning theory that
aims to find a balance between two competing factors:
model complexity and the ability to minimize
errors on both training data and unseen data
(generalization). It's a foundational concept behind
techniques like Support Vector Machines (SVMs).

• Model Selection: SRM involves evaluating a series of
models with increasing complexity. For each model, you
calculate the empirical risk and add a penalty based on
the model's complexity.

• Optimal Model: The model that achieves the best trade-off
between low empirical risk and low complexity
penalty is chosen. This model is expected to generalize
well to new data.
Regularization
Regularization is a technique used to reduce errors by fitting
the function appropriately on the given training set and
avoiding overfitting. The commonly used regularization
techniques are:

• Lasso Regularization – L1 Regularization
• Ridge Regularization – L2 Regularization
• Elastic Net Regularization – L1 and L2 Regularization

Regularization
Lasso Regression
• A regression model which uses the L1 Regularization
technique is called LASSO(Least Absolute Shrinkage and
Selection Operator) regression. Lasso Regression adds the
“absolute value of magnitude” of the coefficient as a
penalty term to the loss function (L).
• Lasso regression also helps us achieve feature selection by
penalizing the weights to approximately equal to zero if
that feature does not serve any purpose in the model.

Regularization
Ridge Regression
A regression model that uses the L2 regularization technique
is called Ridge regression. Ridge regression adds the “squared
magnitude” of the coefficient as a penalty term to the loss
function (L).

Regularization
Elastic Net Regression
This model is a combination of L1 and L2 regularization.
That implies that we add the absolute norm of the weights as
well as the squared measure of the weights, with the help of
an extra hyperparameter that controls the ratio of the L1 and
L2 regularization.
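A minimal sketch (assuming scikit-learn; the data is synthetic) comparing the three regularized regressions described above:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
# Only the first two features matter; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

for model in (Lasso(alpha=0.1),                        # L1: drives some weights to ~0
              Ridge(alpha=1.0),                        # L2: shrinks all weights
              ElasticNet(alpha=0.1, l1_ratio=0.5)):    # mix of L1 and L2
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```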

Akaike’s Information Criterion (AIC) &
Bayesian Information Criterion (BIC)

Methods such as Akaike's information criterion (AIC) and the
Bayesian information criterion (BIC) work by estimating the
optimism of the training error (how much it underestimates the
test error) and adding it to the training error to estimate the
test error, without any need for validation.
Bayesian model selection is used when we have some
prior knowledge about the appropriate class of
approximating functions. This prior knowledge is defined
as a prior distribution over models, p(model).

AIC and BIC
• AIC and BIC: These methods estimate how much the
training error might be optimistic (underestimated) and
adjust it to predict the test error.
• No Validation Needed: They do this adjustment
without requiring a separate validation set.
• Influence of Input Features: The more input features
(or parameters) the model has, the greater the
adjustment for underestimation.
• Effect of Training Set Size: As the size of the training
set increases, the underestimation decreases.
• Impact of Noise: The adjustment also increases with
the amount of noise in the data, which can be estimated
from the error of a low-bias model.
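As an illustrative sketch (not from the slides), under a Gaussian-noise assumption AIC and BIC for a least-squares fit can be computed from the residual sum of squares, where k counts the parameters and n the training examples:

```python
import numpy as np

def aic_bic(rss, n, k):
    """AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L; under Gaussian noise,
    -2 ln L = n * ln(rss / n) up to an additive constant."""
    neg2_loglik = n * np.log(rss / n)
    return neg2_loglik + 2 * k, neg2_loglik + k * np.log(n)

# Compare polynomial orders on the same noisy data.
rng = np.random.default_rng(8)
x = np.linspace(0, 5, 40)
r = 2.0 * np.sin(1.5 * x) + rng.normal(size=x.size)

for order in (1, 3, 5):
    coeffs = np.polyfit(x, r, order)
    rss = np.sum((r - np.polyval(coeffs, x)) ** 2)
    aic, bic = aic_bic(rss, len(x), order + 1)    # order + 1 coefficients
    print(f"order {order}: AIC = {aic:.1f}, BIC = {bic:.1f}")
```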
Minimum description length (MDL)

• If the data is simple, it has a short description; for
example, if it is a sequence of ‘0’s, we can just write ‘0’
and the length of the sequence. If the data is completely
random, then we cannot have any description of the data
shorter than the data itself.
• If a model is appropriate for the data, then it has a good
fit to the data, and instead of the data, we can send/store
the model description.
• Out of all the models that describe the data, we want to
have the simplest model, so that it lends itself to the
shortest description.
• Again, we have a trade-off between how simple the model is
and how well it explains the data.
Bayesian model selection

Bayesian model selection is used when we have some
prior knowledge about the appropriate class of
approximating functions. This prior knowledge is defined
as a prior distribution over models, p(model). Given the
data and assuming a model, we can calculate p(model | data)
using Bayes’ rule:
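The rule itself appeared as an image on the slide; in its standard form,

$p(\text{model} \mid \text{data}) = \dfrac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})}$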

