Mathematics For Machine Learning
Preamble: This is a foundational course for awarding B. Tech. Minor in Computer Science and Engineering with specialization in Machine Learning. The purpose of this course is to introduce learners to the mathematical foundations on which Machine Learning systems are built. The course covers Linear Algebra, Vector Calculus, Probability and Distributions, Optimization, and central Machine Learning problems. The concepts in this course help learners understand the mathematical principles in Machine Learning, aid in the creation of new Machine Learning solutions, in understanding and debugging existing ones, and in learning about the inherent assumptions and limitations of the current methodologies.
Prerequisite:
1. A sound background in higher secondary school Mathematics.
2. Python for Machine Learning (CST 253)
Course Outcomes: After the completion of the course the student will be able to
CO 1: Make use of the concepts, rules and results about linear equations, matrix algebra, vector spaces, eigenvalues & eigenvectors, and orthogonality & diagonalization to solve computational problems (Cognitive Knowledge Level: Apply)
CO 2: Perform calculus operations on functions of several variables and matrices, including partial derivatives and gradients (Cognitive Knowledge Level: Apply)
CO 3: Utilize the concepts, rules and results about probability, random variables, additive & multiplicative rules, conditional probability, probability distributions and Bayes' theorem to find solutions of computational problems (Cognitive Knowledge Level: Apply)
CO 4: Train Machine Learning models using unconstrained and constrained optimization methods (Cognitive Knowledge Level: Apply)
CO 5: Illustrate how the mathematical objects - linear algebra, probability, and calculus - can be used to design machine learning algorithms (Cognitive Knowledge Level: Understand)
PO 1 PO 2 PO 3 PO 4 PO 5 PO 6 PO 7 PO 8 PO 9 PO 10 PO 11 PO 12
CO 1 √ √ √ √ √
CO 2 √ √ √ √
CO 3 √ √ √ √ √
CO 4 √ √ √ √ √ √
CO 5 √ √ √ √ √ √ √ √
Assessment Pattern
Bloom's Category    Test 1    Test 2    End Semester Examination
Apply               40%       40%       40%
Analyse
Evaluate
Create
Mark Distribution
Attendance: 10 marks
First Internal Examination shall be preferably conducted after completing the first half of the syllabus, and the Second Internal Examination shall be preferably conducted after completing the remaining part of the syllabus.
There will be two parts: Part A and Part B. Part A contains 5 questions (preferably, 2 questions each from the completed modules and 1 question from the partly covered module), each carrying 3 marks, adding up to 15 marks for Part A. Students should answer all questions from Part A. Part B contains 7 questions (preferably, 3 questions each from the completed modules and 1 question from the partly covered module), each carrying 7 marks. Out of the 7 questions in Part B, a student should answer any 5.
End Semester Examination Pattern: There will be two parts: Part A and Part B. Part A contains 10 questions with 2 questions from each module, each carrying 3 marks. Students should answer all questions. Part B contains 2 questions from each module, of which the student should answer any one. Each question can have a maximum of 2 sub-divisions and carries 14 marks.
Module 1
Module 2
Module 3
Continuous Probabilities, Sum Rule, Product Rule, and Bayes' Theorem. Summary Statistics and Independence – Important Probability distributions - Conjugacy and the Exponential Family - Change of Variables/Inverse Transform.
Module 4
Module 5
Textbook:
1. Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, published by Cambridge University Press (freely available at https://mml-book.github.io)
Reference books:
… 2018, published by Cambridge University Press
4. Convex Optimization by Stephen Boyd and Lieven Vandenberghe, 2004, published by Cambridge University Press
5. Pattern Recognition and Machine Learning by Christopher M Bishop, 2006, published by Springer
Course Outcome 1 (CO1):
4. A set of n linearly independent vectors in Rⁿ forms a basis. Does the set of vectors (2, 4, -3), (0, 1, 1), (0, 1, -1) form a basis for R³? Explain your reasons.
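A quick numerical check of question 4 is possible in Python (a sketch assuming NumPy; not a prescribed solution method): three vectors form a basis of R³ exactly when the matrix having them as rows has rank 3, equivalently a non-zero determinant.

```python
# Hedged sketch: checking linear independence of the vectors in question 4.
import numpy as np

V = np.array([[2, 4, -3],
              [0, 1, 1],
              [0, 1, -1]], dtype=float)  # one candidate vector per row

# Three vectors form a basis of R^3 iff this matrix has rank 3
# (equivalently, a non-zero determinant).
print(np.linalg.matrix_rank(V))   # 3 -> linearly independent
print(np.linalg.det(V))           # -4.0 (non-zero), so they form a basis
```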
5. Consider the transformation T(x, y) = (x + y, x + 2y, 2x + 3y). Obtain ker T and use this to calculate the nullity. Also find the transformation matrix for T.
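The transformation matrix follows from T's action on the standard basis vectors; a SymPy sketch (illustrative only) that verifies ker T and the nullity for question 5:

```python
# Hedged sketch: verifying ker T and the nullity with SymPy.
import sympy as sp

# Transformation matrix of T(x, y) = (x + y, x + 2y, 2x + 3y)
# with respect to the standard bases.
A = sp.Matrix([[1, 1],
               [1, 2],
               [2, 3]])

print(A.nullspace())          # [] -> ker T = {(0, 0)}
print(A.shape[1] - A.rank())  # nullity = 0 (rank-nullity theorem)
```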
6. Find the characteristic equation, eigenvalues, and eigenspaces corresponding to each eigenvalue of the following matrix
"
7. Diagonalize the following matrix, if possible
"
Course Outcome 2 (CO2):
1. For a scalar function f(x, y, z) = x² + 3y² + 2z², find the gradient and its magnitude at the point (1, 2, -1).
2. Find the maximum and minimum values of the function f(x, y) = 4x + 4y - x² - y² subject to the condition x² + y² <= 2.
3. Suppose you were trying to minimize f(x, y) = x² + 2y + 2y². Along what vector should you travel from (5, 12)?
4. Find the second order Taylor series expansion for f(x, y) = (x + y)² about (0, 0).
5. Find the critical points of f(x, y) = x² – 3xy + 5x - 2y + 6y² + 8.
6. Compute the gradient of the Rectified Linear Unit (ReLU) function ReLU(z) = max(0, z).
7. Let L = ||Ax - b||₂², where A is a matrix and x and b are vectors. Derive dL in terms of dx.
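As an illustration, question 1 can be checked symbolically (a sketch assuming SymPy; not a prescribed solution method):

```python
# Hedged sketch: gradient and its magnitude for question 1.
import sympy as sp

x, y, z = sp.symbols('x y z')
f = x**2 + 3*y**2 + 2*z**2

grad = [sp.diff(f, v) for v in (x, y, z)]        # [2x, 6y, 4z]
at_point = {x: 1, y: 2, z: -1}
g = [d.subs(at_point) for d in grad]             # [2, 12, -4]
print(g, sp.sqrt(sum(c**2 for c in g)))          # magnitude = sqrt(164)
```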
Course Outcome 3 (CO3):
1. Let J and T be independent events, where P(J) = 0.4 and P(T) = 0.7.
i. Find P(J∩T)
ii. Find P(J∪T)
iii. Find P(J∩T′)
i. Given that E(R) = 2.85, find a and b.
ii. Find P(R > 2).
4. A biased coin (with probability of obtaining a head equal to p > 0) is tossed repeatedly and independently until the first head is observed. Compute the probability that the first head appears at an even numbered toss.
5. Two players A and B are competing at a trivia quiz game involving a series of questions. On any individual question, the probabilities that A and B give the correct answer are p and q respectively, for all questions, with outcomes for different questions being independent. The game finishes when a player wins by answering a question correctly. Compute the probability that A wins if
i. A answers the first question,
ii. B answers the first question.
6. A coin for which P(heads) = p is tossed until two successive tails are obtained. Find the probability that the experiment is completed on the nth toss.
7. You roll a fair die twice. Let the random variable X be the product of the outcomes of the two rolls. What is the probability mass function of X? What are the expected value and the standard deviation of X?
8. While watching a game of Cricket, you observe someone who is clearly supporting Mumbai Indians. What is the probability that they were actually born within 25 km of Mumbai? Assume that:
• the probability that a randomly selected person is born within 25 km of Mumbai is 1/20;
• the chance that a person born within 25 km of Mumbai actually supports MI is 7/10;
• the probability that a person not born within 25 km of Mumbai supports MI is 1/10.
9. What is an exponential family? Why are exponential families useful?
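Question 8 is a direct application of Bayes' theorem; a minimal sketch of the computation (plain Python, values taken from the question):

```python
# Hedged sketch: Bayes' theorem for the Mumbai Indians question.
p_born = 1 / 20          # P(born within 25 km of Mumbai)
p_mi_born = 7 / 10       # P(supports MI | born within 25 km)
p_mi_not_born = 1 / 10   # P(supports MI | not born within 25 km)

p_mi = p_born * p_mi_born + (1 - p_born) * p_mi_not_born   # total probability
posterior = p_born * p_mi_born / p_mi                      # Bayes' theorem
print(posterior)   # ~0.269
```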
10. Let Z1 and Z2 be independent random variables each having the standard normal
distribution. Define the random variables X and Y by X = Z1 + 3Z2 and Y = Z1 + Z2.
Argue that the joint distribution of (X, Y) is a bivariate normal distribution. What are
the parameters of this distribution?
11. Given a continuous random variable x, with cumulative distribution function Fx(x),
show that the random variable y = Fx(x) is uniformly distributed.
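Question 11 (the probability integral transform) can also be illustrated by simulation; a sketch assuming NumPy, with an exponential distribution chosen arbitrarily for X:

```python
# Hedged sketch: Y = F_X(X) should look uniform on [0, 1].
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)
y = 1 - np.exp(-x / 2.0)          # F_X(x) for an exponential with mean 2

# Empirical quantiles of Y should be close to those of Uniform(0, 1).
print(np.quantile(y, [0.25, 0.5, 0.75]))   # ~[0.25, 0.5, 0.75]
```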
12. Explain Normal distribution, Binomial distribution and Poisson distribution in the exponential family form.
Course Outcome 4 (CO4):
5. Consider the update equation for stochastic gradient descent. Write down the update when we use a mini-batch size of one.
6. Consider the function
"
9. Solve the following LP problem with the simplex method.
"
subject to the constraints
Course Outcome 5 (CO5):
1. What is a loss function? Give examples.
2. What are training/validation/test sets? What is cross-validation? Name one or two examples of cross-validation methods.
3. Explain generalization, overfitting, model selection, kernel trick, Bayesian learning.
4. Distinguish between Maximum Likelihood Estimation (MLE) and Maximum A Posteriori Estimation (MAP).
5. What is the link between structural risk minimization and regularization?
6. What is a kernel? What is a dot product? Give examples of kernels that are valid dot products.
7. What is ridge regression? How can one train a ridge regression linear model?
8. What is Principal Component Analysis (PCA)? Which eigenvalue indicates the direction of largest variance? In what sense is the representation obtained from a projection onto the eigendirections corresponding to the largest eigenvalues optimal for data reconstruction?
9. Suppose that you have a linear support vector machine (SVM) binary classifier. Consider a point that is currently classified correctly, and is far away from the decision boundary. If you remove the point from the training set, and re-train the classifier, will the decision boundary change or stay the same? Explain your answer in one sentence.
10. Suppose you have n independent and identically distributed (i.i.d.) sample data points x1, ..., xn. These data points come from a distribution where the probability of a given datapoint x is
"
i. What are the prior and posterior odds for the fair coin?
ii. What are the prior and posterior predictive probabilities of heads on the next
flip? Here prior predictive means prior to considering the data of the first four
flips.
… an eigenvector corresponding to each of the eigenvalues?
3 Let f(x, y, z) = xy e^r, where r = x² + z² - 5. Calculate the gradient of f at the point (1, 3, -2).
4 Compute the Taylor polynomials Tn, n = 0, ..., 5 of f(x) = sin(x) + cos(x) at x0 = 0.
5 Let X be a continuous random variable with probability density function on 0 <= x <= 1 defined by f(x) = 3x². Find the pdf of Y = X².
6 Show that if two events A and B are independent, then A and B' are independent.
7 Explain the principle of the gradient descent algorithm.
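For item 7, the principle fits in a few lines of code (a sketch; the function and step size are arbitrary):

```python
# Hedged sketch: gradient descent repeatedly steps in the direction of the
# negative gradient, here minimizing f(x) = (x - 3)^2.
def grad_f(x):
    return 2 * (x - 3)   # f'(x)

x, eta = 0.0, 0.1        # initial point and learning rate
for _ in range(100):
    x = x - eta * grad_f(x)
print(x)   # ~3.0, the minimizer of f
```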
8 Briefly explain the difference between (batch) gradient descent and stochastic gradient descent. Give an example of when you might prefer one over the other.
9 What is the empirical risk? What is "empirical risk minimization"?
10 Explain the concept of a Kernel function in Support Vector Machines. Why are kernels so useful? What properties should a kernel possess to be used in an SVM?
PART B
Answer any one question from each module. Each question carries 14 Marks
11 a) i. Find all solutions to the system of linear equations (6)
ii. The set of vectors orthogonal to [2, -3, 1]^T forms a subspace W of R³. What is dim(W) and why?
b) Use the Gram-Schmidt process to find an orthogonal basis for the column space of the following matrix (8)
"
OR
12 a) i. Let L be the line through the origin in R² that is parallel to the vector [3, 4]^T. Find the standard matrix of the orthogonal projection onto L. Also find the point on L which is closest to the point (7, 1) and find the point on L which is closest to the point (-3, 5). (6)
ii. Find the rank-1 approximation of
"
13 a) A skier is on a mountain with equation z = 100 – 0.4x² – 0.3y², where z denotes height. (8)
i. The skier is located at the point with xy-coordinates (1, 1), and wants to ski downhill along the steepest possible path. In which direction (indicated by a vector (a, b) in the xy-plane) should the skier begin skiing?
ii. The skier begins skiing in the direction given by the xy-vector (a, b) you found in part (i), so the skier heads in a direction in space given by the vector (a, b, c). Find the value of c.
b) Find the linear approximation to the function f(x, y) = 2 - sin(-x - 3y) at the point (0, π), and then use your answer to estimate f(0.001, π). (6)
OR
14 a) Let g be the function given by (8)
"
i. Calculate the partial derivatives of g at (0, 0).
ii. Show that g is not differentiable at (0, 0).
b) Find the second order Taylor series expansion for f(x, y) = e^-(x²+y²) cos(xy) about (0, 0). (6)
15 a) There are two bags. The first bag contains four mangos and two apples; the second bag contains four mangos and four apples. We also have a biased coin, which shows "heads" with probability 0.6 and "tails" with probability 0.4. If the coin shows "heads", we pick a fruit at random from bag 1; otherwise we pick a fruit at random from bag 2. Your friend flips the coin (you cannot see the result), picks a fruit at random from the corresponding bag, and presents you a mango. What is the probability that the mango was picked from bag 2? (6)
b) Suppose that one has written a computer program that sometimes compiles and sometimes not (code does not change). You decide to model the apparent stochasticity (success vs. no success) x of the compiler using a Bernoulli distribution with parameter μ: (8)
"
Choose a conjugate prior for the Bernoulli likelihood and compute the posterior distribution p(μ | x1, ..., xN).
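A sketch of the standard result behind 15 b): with a Beta(a, b) prior, the conjugate prior for a Bernoulli likelihood, the posterior is again a Beta distribution (the prior parameters and data below are illustrative):

```python
# Hedged sketch: with prior Beta(a, b) and N Bernoulli observations with
# s successes, the posterior is Beta(a + s, b + N - s).
def beta_bernoulli_posterior(a, b, observations):
    s = sum(observations)               # number of successful compiles
    n = len(observations)
    return a + s, b + n - s             # parameters of the Beta posterior

x = [1, 0, 1, 1, 0, 1]                  # illustrative outcomes x1, ..., xN
print(beta_bernoulli_posterior(2, 2, x))   # Beta(6, 4)
```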
OR
16 a) Consider a mixture of two Gaussian distributions (8)
i. Compute the marginal distributions for each dimension.
ii. Compute the mean, mode and median for each marginal distribution.
iii. Compute the mean and mode for the two-dimensional distribution.
b) Express the Binomial distribution as an exponential family distribution. Also express the Beta distribution as an exponential family distribution. Show that the product of the Beta and the Binomial distribution is also a member of the exponential family. (6)
17 a) Find the extrema of f(x, y, z) = x - y + z subject to g(x, y, z) = x² + y² + z² = 2. (8)
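A hedged SymPy sketch solving the Lagrange conditions for 17 a) (illustrative, not a prescribed method):

```python
# Hedged sketch: solve grad f = lam * grad g together with the constraint.
import sympy as sp

x, y, z, lam = sp.symbols('x y z lam', real=True)
f = x - y + z
g = x**2 + y**2 + z**2 - 2

L = f - lam * g
eqs = [sp.diff(L, v) for v in (x, y, z)] + [g]
sols = sp.solve(eqs, [x, y, z, lam], dict=True)
for s in sols:
    print(s, 'f =', f.subs(s))   # f = +sqrt(6) (max) and -sqrt(6) (min)
```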
b) Let
"
Show that x* = (1, 1/2, -1) is optimal for the optimization problem (6)
OR
18 a) Derive the gradient descent training rule assuming that the target function is represented as od = w0 + w1x1 + ... + wnxn. Define explicitly the cost/error function E, assuming that a set of training examples D is provided, where each training example d ∈ D is associated with the target output td. (8)
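A sketch of the rule being derived in 18 a): with E(w) = ½ Σ_d (t_d - o_d)², gradient descent gives the update w_i <- w_i + η Σ_d (t_d - o_d) x_{d,i} (the data below is illustrative):

```python
# Hedged sketch: batch gradient descent for a linear unit.
import numpy as np

def train_linear_unit(X, t, eta=0.01, epochs=500):
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend x0 = 1 for w0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                              # outputs o_d
        w += eta * X.T @ (t - o)               # gradient descent step
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])             # t = 1 + 2x
print(train_linear_unit(X, t))                 # ~[1.0, 2.0]
```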
b) Find the maximum value of f(x,y,z) = xyz given that g(x,y,z) = x + y + z = 3 and (6)
x,y,z >= 0.
19 a) Consider the following probability distribution (7)
where θ is a parameter and x is a positive real number. Suppose you get m i.i.d. samples xi drawn from this distribution. Compute the maximum likelihood estimator for θ based on these samples.
b) Consider the following Bayesian network with boolean variables. (7)
i. List variable(s) conditionally independent of X33 given X11 and X12
ii. List variable(s) conditionally independent of X33 and X22
iii. Write the joint probability P(X11, X12, X13, X21, X22, X31, X32, X33)
factored according to the Bayes net. How many parameters are
necessary to define the conditional probability distributions for
this Bayesian network?
iv. Write an expression for P(X13 = 0,X22 = 1,X33 = 0) in terms of the
conditional probability distributions given in your answer to part
(iii). Justify your answer.
OR
20 a) Consider the following one dimensional training data set, 'x' denotes negative examples and 'o' positive examples. The exact data points and their labels are given in the table below. Suppose an SVM is used to classify this data. (6)
"
i. Indicate which are the support vectors and mark the decision boundary.
ii. Give the value of the cost function and the model parameter after training.
b) Suppose that we are fitting a Gaussian mixture model for data items consisting of a single real value, x, using K = 2 components. We have N = 5 training cases, in which the values of x are 5, 15, 25, 30, 40. Using the EM algorithm to find the maximum likelihood estimates for the model parameters, what are the mixing proportions for the two components, π1 and π2, and the means for the two components, μ1 and μ2? The standard deviations for the two components are fixed at 10. (8)
"
What values for the parameters π1, π2, μ1, and μ2 will be found in the next M step of the algorithm?
****
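A sketch of one EM iteration for 20 b) (the responsibilities table from the question did not survive extraction, so the initial parameters below are assumed):

```python
# Hedged sketch: one E step and one M step for a two-component 1-D Gaussian
# mixture with both standard deviations fixed at 10.
import numpy as np

x = np.array([5.0, 15.0, 25.0, 30.0, 40.0])
pi = np.array([0.5, 0.5])        # assumed initial mixing proportions
mu = np.array([10.0, 30.0])      # assumed initial means
sigma = 10.0

# E step: responsibilities r[n, k] proportional to pi_k * N(x_n | mu_k, sigma^2).
# The common normalizing constant of the Gaussians cancels.
dens = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
r = pi * dens
r /= r.sum(axis=1, keepdims=True)

# M step: update mixing proportions and means (sigmas stay fixed).
Nk = r.sum(axis=0)
pi_new = Nk / len(x)
mu_new = (r * x[:, None]).sum(axis=0) / Nk
print(pi_new, mu_new)
```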
Teaching Plan
No    Topic    No. of Lectures (45)
7. Cholesky Decomposition, Eigen decomposition and Diagonalization    1
8. Singular Value Decomposition - Matrix Approximation    1
Module-II (VECTOR CALCULUS) 6
5. Convex Optimization    1
6. Linear Programming    1
7. Quadratic Programming    1
Module-V (CENTRAL MACHINE LEARNING PROBLEMS)    14
14. Kernels 1
*Assignments may include applications of the above theory. With respect to module V,
programming assignments may be given.