3rd Unit Last 5 Answer AIML


What is Risk? Explain Parametric Methods.

Parametric Machine Learning Algorithms


Assumptions can greatly simplify the learning process, but can also limit
what can be learned. Algorithms that simplify the function to a known form
are called parametric machine learning algorithms.

A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.

— Artificial Intelligence: A Modern Approach, page 737


The algorithms involve two steps:

1. Select a form for the function.
2. Learn the coefficients for the function from the training data.

An easy-to-understand functional form for the mapping function is a line, as is used in linear regression:

y = b0 + b1*x1 + b2*x2

Where b0, b1 and b2 are the coefficients of the line that control the
intercept and slope, and x1 and x2 are two input variables.

Assuming the functional form of a line greatly simplifies the learning process. Now, all we need to do is estimate the coefficients of the line equation and we have a predictive model for the problem.
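To make this concrete, here is a minimal sketch (assuming NumPy and a small made-up dataset; none of these values come from the original text) that estimates the coefficients b0, b1 and b2 by ordinary least squares:

import numpy as np

# Illustrative only: fit the assumed linear form y = b0 + b1*x1 + b2*x2
# to a tiny synthetic dataset using ordinary least squares.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])            # two input variables x1, x2
y = np.array([8.0, 7.0, 17.0, 16.0])  # target values

# Add a column of ones so the intercept b0 is learned too.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve for the fixed set of parameters [b0, b1, b2].
b, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print("b0, b1, b2 =", b)

# Predict for a new point: the model never needs more parameters,
# no matter how much training data we add.
x_new = np.array([1.0, 5.0, 2.0])     # [1, x1, x2]
print("prediction:", x_new @ b)

Once b0, b1 and b2 are learned, that fixed-size set of parameters is the entire model.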

Often the assumed functional form is a linear combination of the input variables, and as such parametric machine learning algorithms are often also called “linear machine learning algorithms“.
The problem is, the actual unknown underlying function may not be a linear
function like a line. It could be almost a line and require some minor
transformation of the input data to work right. Or it could be nothing like a
line in which case the assumption is wrong and the approach will produce
poor results.
Some more examples of parametric machine learning algorithms include:

 Logistic Regression
 Linear Discriminant Analysis
 Perceptron
 Naive Bayes
 Simple Neural Networks
Benefits of Parametric Machine Learning Algorithms:

 Simpler: These methods are easier to understand, and their results are easier to interpret.
 Speed: Parametric models are very fast to learn from data.
 Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.
Limitations of Parametric Machine Learning Algorithms:

 Constrained: By choosing a functional form these methods are highly constrained to the specified form.
 Limited Complexity: The methods are more suited to simpler problems.
 Poor Fit: In practice the methods are unlikely to match the underlying mapping function.

Explain in detail univariate and multivariate trees with an algorithm?


Linear Discriminant Analysis for Machine Learning

Logistic regression is a classification algorithm traditionally limited to only two-class classification problems.

If you have more than two classes then Linear Discriminant Analysis is the
preferred linear classification technique.

In this post you will discover the Linear Discriminant Analysis (LDA)
algorithm for classification predictive modeling problems. After reading this
post you will know:

 The limitations of logistic regression and the need for linear discriminant analysis.
 The representation of the model that is learned from data and can be saved to
file.
 How the model is estimated from your data.
 How to make predictions from a learned LDA model.
 How to prepare your data to get the most from the LDA model.
This post is intended for developers interested in applied machine learning,
how the models work and how to use them well. As such no background in
statistics or linear algebra is required, although it does help if you know
about the mean and variance of a distribution.
LDA is a simple model in both preparation and application. There are some interesting statistics behind how the model is set up and how the prediction equation is derived, but they are not covered in this post.

Limitations of Logistic Regression
Logistic regression is a simple and powerful linear classification algorithm.
It also has limitations that suggest the need for alternative linear classification algorithms.

 Two-Class Problems. Logistic regression is intended for two-class or binary classification problems. It can be extended for multi-class classification, but is rarely used for this purpose.
 Unstable With Well Separated Classes. Logistic regression can become unstable when the classes are well separated.
 Unstable With Few Examples. Logistic regression can become unstable when there are few examples from which to estimate the parameters.
Linear Discriminant Analysis does address each of these points and is the
go-to linear method for multi-class classification problems. Even with
binary-classification problems, it is a good idea to try both logistic
regression and linear discriminant analysis.
Representation of LDA Models
The representation of LDA is straightforward.

It consists of statistical properties of your data, calculated for each class. For a single input variable (x) this is the mean and the variance of the variable for each class. For multiple variables, these are the same properties calculated over the multivariate Gaussian, namely the means and the covariance matrix.

These statistical properties are estimated from your data and plugged into the LDA equation to make predictions. These are the model values that you would save to file for your model.

Let’s look at how these parameters are estimated.

Describe models in linear discrimination?

Learning LDA Models


LDA makes some simplifying assumptions about your data:
1. That your data is Gaussian, that each variable is shaped like a bell curve when plotted.
2. That each attribute has the same variance, that values of each variable vary
around the mean by the same amount on average.
With these assumptions, the LDA model estimates the mean and variance
from your data for each class. It is easy to think about this in the univariate
(single input variable) case with two classes.

The mean (mu) value of each input (x) for each class (k) can be estimated
in the normal way by dividing the sum of values by the total number of
values.

muk = 1/nk * sum(x)

Where muk is the mean value of x for the class k, nk is the number of
instances with class k. The variance is calculated across all classes as the
average squared difference of each value from the mean.

sigma^2 = 1 / (n-K) * sum((x – mu)^2)

Where sigma^2 is the variance across all inputs (x), n is the number of
instances, K is the number of classes and mu is the mean for input x.
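As a rough illustration of these two estimates, the sketch below computes the per-class means and the pooled variance for a tiny made-up univariate dataset (the numbers are illustrative assumptions, not from the original text):

import numpy as np

# Minimal sketch of the estimates above for a single input variable x
# and integer class labels y.
x = np.array([4.2, 5.1, 4.8, 7.9, 8.3, 7.5])
y = np.array([0,   0,   0,   1,   1,   1])

classes = np.unique(y)
n, K = len(x), len(classes)

# muk = 1/nk * sum(x) for each class k
mu = {k: x[y == k].mean() for k in classes}

# sigma^2 = 1/(n - K) * sum((x - muk)^2), pooled across classes
sigma2 = sum(((x[y == k] - mu[k]) ** 2).sum() for k in classes) / (n - K)

print("class means:", mu)
print("pooled variance:", sigma2)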

Making Predictions with LDA


LDA makes predictions by estimating the probability that a new set of
inputs belongs to each class. The class that gets the highest probability is
the output class and a prediction is made.

The model uses Bayes Theorem to estimate the probabilities.


Briefly Bayes’ Theorem can be used to estimate the probability of the
output class (k) given the input (x) using the probability of each class and
the probability of the data belonging to each class:
P(Y=k|X=x) = (PIk * fk(x)) / sum(PIl * fl(x))

Where PIk refers to the base probability of each class (k) observed in your
training data (e.g. 0.5 for a 50-50 split in a two class problem). In Bayes’
Theorem this is called the prior probability.
PIk = nk/n

The fk(x) above is the estimated probability of x belonging to class k. A Gaussian distribution function is used for fk(x). Plugging the Gaussian into the above equation and simplifying, we end up with the equation below. This is called a discriminant function, and the class whose function gives the largest value is the output classification (y):

Dk(x) = x * (muk/sigma^2) – (muk^2/(2*sigma^2)) + ln(PIk)

Dk(x) is the discriminant function for class k given input x; muk, sigma^2 and PIk are all estimated from your data.
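A minimal sketch of this prediction step is shown below; the means, pooled variance and priors are assumed to have been estimated as in the earlier sketch, and the numeric values are illustrative:

import numpy as np

# Minimal sketch of the discriminant function above.
mu = {0: 4.70, 1: 7.90}       # per-class means (illustrative values)
sigma2 = 0.185                # pooled variance (illustrative value)
priors = {0: 0.5, 1: 0.5}     # PIk = nk / n

def discriminant(x, k):
    # Dk(x) = x * (muk / sigma^2) - muk^2 / (2 * sigma^2) + ln(PIk)
    return x * (mu[k] / sigma2) - mu[k] ** 2 / (2 * sigma2) + np.log(priors[k])

x_new = 6.0
scores = {k: discriminant(x_new, k) for k in mu}
print("discriminant values:", scores)
print("predicted class:", max(scores, key=scores.get))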

How to Prepare Data for LDA


This section lists some suggestions you may consider when preparing your
data for use with LDA.

 Classification Problems. This might go without saying, but LDA is intended for
classification problems where the output variable is categorical. LDA supports
both binary and multi-class classification.
 Gaussian Distribution. The standard implementation of the model assumes a
Gaussian distribution of the input variables. Consider reviewing the univariate
distributions of each attribute and using transforms to make them more
Gaussian-looking (e.g. log and root for exponential distributions and Box-Cox for
skewed distributions).
 Remove Outliers. Consider removing outliers from your data. These can skew the basic statistics used to separate classes in LDA, such as the mean and the standard deviation.
 Same Variance. LDA assumes that each input variable has the same variance. It is almost always a good idea to standardize your data before using LDA so that it has a mean of 0 and a standard deviation of 1 (see the short sketch after this list).
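As a minimal sketch of that standardization step (the feature matrix below is an illustrative assumption):

import numpy as np

# Illustrative only: standardize each input variable to mean 0 and
# standard deviation 1 before applying LDA.
X = np.array([[4.2, 180.0],
              [5.1, 200.0],
              [7.9, 150.0],
              [8.3, 140.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column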
Extensions to LDA
Linear Discriminant Analysis is a simple and effective method for
classification. Because it is simple and so well understood, there are many
extensions and variations to the method. Some popular extensions include:
 Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of
variance (or covariance when there are multiple input variables).
 Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are used, such as splines.
 Regularized Discriminant Analysis (RDA): Introduces regularization into the
estimate of the variance (actually covariance), moderating the influence of
different variables on LDA.
The original development was called the Linear Discriminant or Fisher’s Discriminant Analysis. The multi-class version was referred to as Multiple Discriminant Analysis. These are all simply referred to as Linear Discriminant Analysis now.

What is Gradient Descent and logistic discrimination?

Gradient Descent

The gradient descent algorithm and its variants (Adam, SGD, etc.) have become very popular training (optimisation) algorithms in many machine learning applications. Optimisation algorithms can be informally grouped into two categories: gradient-based and gradient-free (e.g. particle swarm, genetic algorithms). As you can guess, gradient descent is a gradient-based algorithm. Why is the gradient important in training machine learning models?

The objective of training a machine learning model is to minimize the loss or error between ground truths and predictions by changing the trainable parameters. The gradient, which is the extension of the derivative to multi-dimensional space, tells the direction along which the loss or error changes fastest; stepping against it reduces the loss most quickly. If you recall from vector calculus, the gradient points in the direction of the maximum rate of change. Therefore, the update rule for gradient descent is simply:

θj := θj − α * ∂J(θ)/∂θj

where θj is a trainable parameter, α is the learning rate, and J(θ) is the cost function.
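As a tiny illustration of this update rule, the sketch below minimizes a made-up one-parameter cost J(θ) = (θ − 3)^2; the cost function, starting point and learning rate are all illustrative assumptions:

# Minimal gradient descent sketch on a one-parameter cost
# J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0     # initial parameter value
alpha = 0.1     # learning rate

for step in range(100):
    theta = theta - alpha * grad(theta)  # theta := theta - alpha * dJ/dtheta

print(theta)    # converges towards the minimum at theta = 3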

Picture a loss surface: the shortest path from the starting point (the peak) to the optimum (the valley) is along the gradient trajectory. The same principle applies in multi-dimensional space, which is generally the case for machine learning training. To demonstrate how gradient descent is applied in machine learning training, we’ll use logistic regression.

Logistic Regression (LR)

Binary Case

To understand how LR works, let’s imagine the following scenario: we want to predict the sex of a person (male = 0, female = 1) based on age (x1), annual income (x2) and education level (x3). If Y is the predicted value, a logistic regression model for this problem would take the form:

Z = b0 + b1(x1) + b2(x2) + b3(x3)

Y = 1.0 / (1.0 + e^-Z)

b0 is often called ‘bias’ and b1, b2 and b3 are called ‘weights’.

Z has the same form as a linear regression, while Y is a sigmoid activation function. Y takes a value between 0 and 1. If Y is less than 0.5, we conclude the predicted output is 0, and if Y is greater than 0.5, we conclude the output is 1.
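A minimal sketch of this decision rule is given below; the bias and weight values are illustrative assumptions rather than fitted parameters:

import math

# Minimal sketch of the binary prediction rule above.
b0, b1, b2, b3 = -2.0, 0.05, 0.00001, 0.3   # illustrative bias and weights

def predict(x1, x2, x3):
    z = b0 + b1 * x1 + b2 * x2 + b3 * x3    # linear part Z
    y = 1.0 / (1.0 + math.exp(-z))          # sigmoid activation Y
    return (1 if y > 0.5 else 0), y         # thresholded class and probability

label, prob = predict(x1=30, x2=40000, x3=2)
print(label, prob)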

Now, we are ready to look at a more formal form of LR below:


p(C1|Φn) = yn = σ(W^T Φn),  p(C2|Φn) = 1 − yn

Φn is the augmented transformation of Xn in feature space. tn is the class label. σ is the sigmoid activation. W is a weight vector (including the bias term). p(C1|Φ) and p(C2|Φ) are the probabilities of assigning to C1 and C2 given Φ, respectively.

Given the above formulation, the main goal here is to maximise the likelihood of observing the data given the weights (W). The likelihood function is the joint distribution of the observed data and is given below:

p(t|W) = Π (n = 1..N) yn^tn * (1 − yn)^(1 − tn)

Π is the product operator.

From the likelihood function, it can be observed that y is Bernoulli distributed.

When working with probabilities, it is desirable to convert to logarithms, since the logarithm turns a product into a sum and thus avoids the issue of taking a product of very small numbers (typical for probabilities). Below are the negative log-likelihood (NLL) and its gradient with respect to the weights. The NLL is used to turn the maximization into a minimization problem; essentially, minimizing the NLL is equivalent to maximizing the likelihood.

NLL(W) = − sum over n of [tn * ln(yn) + (1 − tn) * ln(1 − yn)]

∇W NLL = sum over n of (yn − tn) * Φn
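Putting the pieces together, here is a minimal sketch of binary LR trained with gradient descent on this NLL; the data matrix, labels and learning rate are illustrative assumptions:

import numpy as np

# Phi already includes a bias column; data values are illustrative.
Phi = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])  # N x D
t = np.array([0.0, 0.0, 1.0, 1.0])                                # class labels

W = np.zeros(Phi.shape[1])
alpha = 0.1

for _ in range(1000):
    y = 1.0 / (1.0 + np.exp(-Phi @ W))   # yn = sigmoid(W^T Phi_n)
    grad = Phi.T @ (y - t)               # sum_n (yn - tn) * Phi_n
    W -= alpha * grad                    # gradient descent step on the NLL

print("learned weights:", W)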

Multiclass Case

The binary case of LR can be extended to the multiclass case with some changes of notation.

Let’s assume there are K classes. So, p(Ck|Φ) is the probability of assigning to class k given Φ.

Instead of the sigmoid activation, the softmax activation is used to convert the class scores (ak) into proper probabilities:

p(Ck|Φ) = exp(ak) / sum over j of exp(aj), with ak = Wk^T Φ

W is now a weight matrix (D x K), where D is the feature space dimension. The likelihood function and the negative log-likelihood (NLL) are given below:

p(T|W) = product over n and k of ynk^tnk

NLL(W) = − sum over n and k of tnk * ln(ynk)

y is now Multinoulli distributed.
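As a small sketch of the softmax step, the code below converts class scores into probabilities and picks the most likely class; the weight matrix and feature vector are illustrative assumptions (D = 3, K = 4):

import numpy as np

W = np.random.randn(3, 4)          # D x K weight matrix (illustrative)
phi = np.array([1.0, 0.5, -1.2])   # augmented feature vector for one example

a = phi @ W                        # class scores a_k = W_k^T phi
e = np.exp(a - a.max())            # subtract max for numerical stability
y = e / e.sum()                    # softmax probabilities

print("class probabilities:", y, "sum:", y.sum())
print("predicted class:", int(np.argmax(y)))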

Explain in detail the backpropagation algorithm?

Backpropagation:

Backpropagation is a supervised learning algorithm for training Multi-layer Perceptrons (Artificial Neural Networks).


But some of you might be wondering why we need to train a Neural Network, or what exactly training means.

Why We Need Backpropagation?

While designing a Neural Network, in the beginning, we initialize the weights with some random values (or any values, for that matter).

Now obviously, we are not superhuman. So, it’s not necessary that whatever weight values we have selected will be correct, or that they fit our model the best.

Okay, fine, we selected some weight values in the beginning, but our model output is way different from our actual output, i.e. the error value is huge.

Now, how will you reduce the error?

Basically, we need to somehow tell the model to change its parameters (weights) such that the error becomes minimum.

Let’s put it another way: we need to train our model.

One way to train our model is called Backpropagation.

Let me summarize the steps for you:

 Calculate the error – How far is your model output from the actual
output.
 Minimum Error – Check whether the error is minimized or not.
 Update the parameters – If the error is huge, then update the parameters (weights and biases). After that, check the error again. Repeat the process until the error becomes minimum.
 Model is ready to make a prediction – Once the error becomes
minimum, you can feed some inputs to your model and it will produce
the output.
I am pretty sure that now you know why we need Backpropagation and what training a model means.

Now is the right time to understand what Backpropagation is.

What is Backpropagation?

The Backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent. The weights that minimize the error function are then considered to be a solution to the learning problem.

Let’s understand how it works with an example:

You have a dataset, which has labels.

Consider the below table:

Input   Desired Output
0       0
1       2
2       4
Now the output of your model when the ‘W’ value is 3:

Input   Desired Output   Model output (W=3)
0       0                0
1       2                3
2       4                6
Notice the difference between the actual output and the desired output:

Input   Desired Output   Model output (W=3)   Absolute Error   Square Error
0       0                0                    0                0
1       2                3                    1                1
2       4                6                    2                4
Let’s change the value of ‘W’. Notice the error when ‘W’ = ‘4’
Input   Desired Output   Model output (W=3)   Absolute Error (W=3)   Square Error (W=3)   Model output (W=4)   Square Error (W=4)
0       0                0                    0                      0                    0                    0
1       2                3                    1                      1                    4                    4
2       4                6                    2                      4                    8                    16
Now, if you notice, when we increased the value of ‘W’ the error increased. So, obviously there is no point in increasing the value of ‘W’ further. But what happens if I decrease the value of ‘W’? Consider the table below:

Input   Desired Output   Model output (W=3)   Absolute Error (W=3)   Square Error (W=3)   Model output (W=2)   Square Error (W=2)
0       0                0                    0                      0                    0                    0
1       2                3                    1                      1                    2                    0
2       4                6                    2                      4                    4                    0
Now, what we did here:

 We first initialized some random value for ‘W’ and propagated forward.
 Then, we noticed that there is some error. To reduce that error, we propagated backwards and increased the value of ‘W’.
 After that, we noticed that the error had increased, so we learned that we can’t increase the ‘W’ value further.
 So, we again propagated backwards and decreased the ‘W’ value.
 Now, we noticed that the error has reduced.

So, we are trying to find the value of the weight such that the error becomes minimum. Basically, we need to figure out whether we should increase or decrease the weight value. Once we know that, we keep on updating the weight value in that direction until the error becomes minimum. You might reach a point where, if you update the weight further, the error will increase. At that time you need to stop, and that is your final weight value.
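As a rough sketch of this trial-and-error search for the toy model above (output = W * input), the code below evaluates the total square error for a few candidate values of W taken from the tables:

# Minimal sketch of the search above for the toy model: output = W * input.
inputs = [0, 1, 2]
desired = [0, 2, 4]

def total_square_error(w):
    return sum((w * x - d) ** 2 for x, d in zip(inputs, desired))

for w in [2, 3, 4]:   # candidate weight values from the tables
    print(f"W = {w}: total square error = {total_square_error(w)}")
# W = 2 gives zero error, so that is the weight value we stop at.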

Picture the error plotted against the weight value: we need to reach the ‘Global Loss Minimum’.

This is nothing but Backpropagation.

Let’s now understand the math behind Backpropagation.

How Backpropagation Works?

Consider a simple Neural Network; a minimal forward-pass sketch follows the list below.

The network contains the following:

 two inputs
 two hidden neurons
 two output neurons
 two biases
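Here is that sketch: a single forward pass through a 2-input, 2-hidden-neuron, 2-output-neuron network with sigmoid activations and one bias per layer. All weight, bias and input values are illustrative assumptions, not values from the original example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters for a 2-input, 2-hidden, 2-output network.
x  = np.array([0.05, 0.10])        # two inputs
W1 = np.array([[0.15, 0.25],       # input -> hidden weights (2 x 2)
               [0.20, 0.30]])
b1 = 0.35                          # bias for the hidden layer
W2 = np.array([[0.40, 0.50],       # hidden -> output weights (2 x 2)
               [0.45, 0.55]])
b2 = 0.60                          # bias for the output layer

h   = sigmoid(x @ W1 + b1)         # activations of the two hidden neurons
out = sigmoid(h @ W2 + b2)         # activations of the two output neurons
print("network output:", out)

During training, backpropagation would compute the error at the outputs, propagate it back through W2 and W1, and update each weight in the direction that reduces the error, just as we did by hand with ‘W’ in the example above.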
