Mid Sem Solution 2019
Monsoon 2019
Mid-Semester Exam: CSE543/ECE5ML
13/9/2019
Time Limit: 60 Minutes Roll No:
Instructions:
Please do not plagiarize. It will be dealt with very strictly!
Try to answer all questions. The last question is Extra Credit. Try to make the most of it.
In the unlikely case that a question is ambiguous, please clearly state any assumptions that you
are making. For reducing subjectivity in grading, please do this even after clarifying with the
invigilator.
Good Luck!
1. (20 points) A receiver receives two types of signals, zero and non-zero. The zero signal is
distributed as p(v|y = yz ) ∼ N (0, 4), i.e., a Gaussian distribution with mean µ = 0 and
variance σ 2 = 4. The non-zero signal generates a voltage distributed as p(v|y = ynz ) ∼
0.5N (−5, 4) + 0.5N (+5, 4). Your task is to design a classifier that, given the voltage, can
identify whether the signal is zero or non-zero. Given the task and a training set, please answer
the following questions:
i) (4 points) Would a Bayes classifier be useful to solve this problem? Can you write the
Bayes classifier decision rule for this classification problem? Be sure to have the complete
definition, so that all cases are covered.
ii) (2 points) Why can you not use a logistic regression (LR) classifier for this problem
directly?
iii) (8 points) Let’s say Reverend Bayes was a fan of logistic regression, and asked you to solve
this confounding problem using LR. To appease Rev. Bayes, could you perhaps find a way
to apply logistic regression and still solve this problem? {Hint: Think transformation of
variables, perhaps adding more features.}
iv) (2+4=6 points) Let the two signals be 2-dimensional, such that the zero signal is distributed as N ([0, 0], 4I), where I is the 2 × 2 identity matrix, and the non-zero signal is distributed as 0.5N ([−5, −5], 4I) + 0.5N ([+5, +5], 4I). Would a Bayesian classifier still work better than logistic regression? Justify your argument. Could your strategy for applying LR in the previous part also work for this 2-D version of the signal? If yes, write down the modified LR model and the corresponding classification rule.
Solution:
i) (1 point) Yes, a Bayes classifier can be used to solve the given problem.
$$p(v \mid y = y_z) = \mathcal{N}(0, 4) = \frac{1}{\sqrt{8\pi}}\, e^{-(v-0)^2/8}$$
Assign equal priors; the decision rule is then to predict $y_{nz}$ if $p(v \mid y = y_{nz}) > p(v \mid y = y_z)$ and $y_z$ otherwise (ties broken arbitrarily). (If equal priors are not assumed, the decision rule must include the prior terms as well.)
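As an illustrative sketch (not part of the official solution), the equal-prior decision rule can be implemented by directly comparing the two class-conditional densities; the test voltages below are arbitrary examples:

```python
from math import exp, pi, sqrt

def gauss(v, mu, var):
    # Univariate Gaussian density N(mu, var)
    return exp(-(v - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def p_zero(v):
    # p(v | y = y_z) = N(0, 4)
    return gauss(v, 0.0, 4.0)

def p_nonzero(v):
    # p(v | y = y_nz) = 0.5 N(-5, 4) + 0.5 N(+5, 4)
    return 0.5 * gauss(v, -5.0, 4.0) + 0.5 * gauss(v, 5.0, 4.0)

def bayes_classify(v):
    # Equal priors: pick the class with the larger likelihood
    return "zero" if p_zero(v) >= p_nonzero(v) else "nonzero"

print(bayes_classify(0.5), bayes_classify(4.8))  # → zero nonzero
```

A small voltage near 0 falls under the zero-signal density, while a voltage near ±5 is better explained by the mixture.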
(1 point)
$$\operatorname{logit}[p(v)] = \ln\left[\frac{p(v)}{1 - p(v)}\right]$$
With the quadratic feature expansion,
$$p(v) = \frac{e^{w_0 + w_1 x_1 + w_2 x_1^2}}{1 + e^{w_0 + w_1 x_1 + w_2 x_1^2}}$$
so that
$$\operatorname{logit}[p(v)] = \ln\left[\frac{e^{w_0 + w_1 x_1 + w_2 x_1^2}/(1 + e^{w_0 + w_1 x_1 + w_2 x_1^2})}{1 - e^{w_0 + w_1 x_1 + w_2 x_1^2}/(1 + e^{w_0 + w_1 x_1 + w_2 x_1^2})}\right] = w_0 + w_1 x_1 + w_2 x_1^2$$
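The transformation can be illustrated with simulated data and a hand-rolled gradient-ascent fit (a sketch; the sample sizes, standardization step, learning rate, and iteration count are all assumptions, not part of the exam solution):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # samples per class (illustrative)

# Simulate voltages: zero signal ~ N(0,4), non-zero ~ 0.5 N(-5,4) + 0.5 N(+5,4)
v_zero = rng.normal(0.0, 2.0, n)
mix = rng.random(n) < 0.5
v_nonzero = np.where(mix, rng.normal(-5.0, 2.0, n), rng.normal(5.0, 2.0, n))

v = np.concatenate([v_zero, v_nonzero])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Key idea: add v^2 as a feature so the non-linear boundary |v| = c
# becomes linear in the expanded feature space [1, v, v^2].
feats = np.column_stack([v, v ** 2])
feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)  # standardize
X = np.column_stack([np.ones(len(v)), feats])

# Plain gradient ascent on the logistic log-likelihood
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w = np.zeros(3)
for _ in range(5000):
    w += 0.1 * X.T @ (y - sigmoid(X @ w)) / len(y)

acc = np.mean((sigmoid(X @ w) > 0.5) == y)
print(round(acc, 2))
```

Without the squared term, no single threshold on v can separate the classes (accuracy stays near 0.5); with it, accuracy approaches the Bayes rate.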
iv) (2 points) Yes, the Bayes classifier still works better than logistic regression (for the same reason as above): the 2-D data are also not linearly separable.
(1 point) Yes, the modified logistic regression works for the 2-D model too; we just need to include the squared and cross terms of both coordinates.
(3 points)
Transformed linear predictor: $w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2$
$$p(v) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2)}} = \frac{e^{w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2}}{1 + e^{w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2}}$$
Obtaining the logit:
$$\operatorname{logit}[p(v)] = \ln\left[\frac{p(v)}{1 - p(v)}\right] = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2$$
Classification rule: predict non-zero when $\operatorname{logit}[p(v)] > 0$, i.e., when $p(v) > 0.5$.
2. (15 points) What is the probabilistic model for logistic regression? Write the maximum like-
lihood (ML) based objective function for this model. Extend it to the Maximum-a-posteriori
(MAP) based objective with a Gaussian prior (N (0, σp2 )) on the parameters. Feel free to use
the log-likelihood and the log-posterior expressions. Derive the gradient expression for both
log-likelihood and log-posterior. (Points break-up 2+4+4+5=15)
Solution: Let us consider a logistic regression model where the output variable $y_i$ is a Bernoulli random variable, i.e., $y_i$ can be 0 or 1. The logistic regression model can be written as:
$$P(y_i = 1 \mid x_i) = \sigma(x_i\beta) = \frac{1}{1 + \exp(-x_i\beta)}$$
The goal is to estimate the parameter $\beta$ using maximum likelihood estimation. Let us consider an i.i.d. sample of N data points $(y_i, x_i) \sim \mathcal{D}$, $i \in \{1, \ldots, N\}$.
The likelihood of a single input-output pair $(y_i, x_i)$ can be written as:
$$P(y_i \mid x_i; \beta) = [\sigma(x_i\beta)]^{y_i}[1 - \sigma(x_i\beta)]^{(1-y_i)}$$
Since all observations are i.i.d., the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:
$$L(\beta; Y, X) = \prod_{i=1}^{N} [\sigma(x_i\beta)]^{y_i}[1 - \sigma(x_i\beta)]^{(1-y_i)} \qquad (5)$$
Here, Y is the N × 1 vector of all outputs and X is the N × K matrix of all inputs.
The log-likelihood of $L(\beta; Y, X)$ can be written as:
$$\ell(\beta; Y, X) = \sum_{i=1}^{N} [-\ln(1 + \exp(x_i\beta)) + y_i x_i \beta] \qquad (6)$$
Proof: Taking the log of (5),
$$\ell(\beta; Y, X) = \sum_{i=1}^{N} [y_i \ln \sigma(x_i\beta) + (1 - y_i)\ln(1 - \sigma(x_i\beta))].$$
Since $\sigma(x_i\beta) = \exp(x_i\beta)/(1 + \exp(x_i\beta))$, we have $\ln \sigma(x_i\beta) = x_i\beta - \ln(1 + \exp(x_i\beta))$ and $\ln(1 - \sigma(x_i\beta)) = -\ln(1 + \exp(x_i\beta))$; substituting these gives (6). Differentiating term by term gives the gradient of the log-likelihood:
$$\nabla_\beta \ell(\beta; Y, X) = \sum_{i=1}^{N} [y_i - \sigma(x_i\beta)]\, x_i$$
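A quick numerical sanity check of the simplification in (6) (a sketch; the random data and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 20, 3  # illustrative sample size and dimension
X = rng.normal(size=(N, K))
y = (rng.random(N) < 0.5).astype(float)
beta = rng.normal(size=K)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
z = X @ beta

# The Bernoulli log-likelihood, written two equivalent ways
ll_direct = np.sum(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))
ll_simplified = np.sum(-np.log1p(np.exp(z)) + y * z)  # equation (6)

print(np.isclose(ll_direct, ll_simplified))  # → True
```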
In the MAP estimate we consider $\beta$ as a random variable and assume a prior belief distribution on it. Given the Gaussian prior $N(0, \sigma^2)$:
$$P(\beta) = N(0, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(\frac{-\beta^T\beta}{2\sigma^2}\right) \qquad (25)$$
Substituting the values of $P(\beta)$ and $P(\mathcal{D}\mid\beta)$ from equations (25) and (7) into $\beta_{MAP} = \arg\max_\beta [\ln P(\beta) + \ln P(\mathcal{D}\mid\beta)]$, we get:
$$\beta_{MAP} = \arg\max_\beta \left[-\frac{1}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\beta^T\beta + \sum_{i=1}^{N}[-\ln(1 + \exp(x_i\beta)) + y_i x_i \beta]\right] \qquad (29)$$
Ignoring the constant term, we get:
$$\beta_{MAP} = \arg\max_\beta \sum_{i=1}^{N}[-\ln(1 + \exp(x_i\beta)) + y_i x_i \beta] - \frac{1}{2\sigma^2}\beta^T\beta \qquad (30)$$
$$\beta_{MAP} = \arg\max_\beta \ell_{MAP} \qquad (32)$$
$$\ell_{MAP} = \ell(\beta; Y, X) - \frac{1}{2\sigma^2}\beta^T\beta \qquad (33)$$
$$\nabla_\beta \ell_{MAP} = \nabla_\beta \ell(\beta; Y, X) - \nabla_\beta \frac{1}{2\sigma^2}\beta^T\beta \qquad (34)$$
From equation (20) we get the first-order derivative of $\ell_{MAP}$:
$$\nabla_\beta \ell_{MAP} = \sum_{i=1}^{N} \left[y_i - \sigma(x_i\beta)\right] x_i - \frac{1}{\sigma^2}\beta \qquad (35)$$
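The gradient expression (35) can be verified numerically with central finite differences (a sketch; the dimensions and the choice σ² = 4 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, sigma2 = 50, 3, 4.0  # illustrative sizes and prior variance
X = rng.normal(size=(N, K))
y = (rng.random(N) < 0.5).astype(float)
beta = rng.normal(size=K)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def log_posterior(b):
    z = X @ b
    # Equation (33): log-likelihood minus the Gaussian-prior penalty
    return np.sum(y * z - np.log1p(np.exp(z))) - b @ b / (2 * sigma2)

# Closed-form gradient from equation (35)
grad = X.T @ (y - sigmoid(X @ beta)) - beta / sigma2

# Central finite differences, one coordinate at a time
eps = 1e-6
I = np.eye(K)
num = np.array([(log_posterior(beta + eps * I[j]) -
                 log_posterior(beta - eps * I[j])) / (2 * eps)
                for j in range(K)])

print(np.allclose(grad, num, atol=1e-4))  # → True
```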
i) You are given a data set on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model's performance?
Is accuracy the right metric for this problem? {Hint: What is the fraction of people who
have cancer as opposed to people who do not?}
ii) You are working on a classification problem. For validation purposes, you have randomly split the training data set into train and validation sets. You are confident that your model will work incredibly well on unseen data since your validation accuracy is high. However, you are shocked by the poor test accuracy. What went wrong? How would you resolve it?
iii) My student conducted an experiment in which he claims to achieve a training accuracy of 93% and a test accuracy of 78%. Would you have any suggestions to improve on the obtained results? What would you have suggested if the training accuracy were 78% and the test accuracy 93%?
iv) You came to know that your model is suffering from low bias and high variance. What are the approaches that you can use to tackle it? Justify why they would work.
Solution:
i) Given the nature of the problem at hand, i.e., cancer detection, we know that the dataset can be highly imbalanced: the majority of the data contains samples in which people are not suffering from cancer, and only a minority are from those who actually suffer from the disease.
In such a case, accuracy is not a good measure of model performance: we can predict all the majority samples correctly and still get a high accuracy like 96%, while probably misclassifying the minority class, the people suffering from cancer, who are of primary interest here.
(2 marks)
To evaluate model performance in such a scenario, the True Positive Rate (sensitivity), True Negative Rate (specificity), precision, recall, or F-score should be used to measure classifier performance class-wise.
(1 mark)
So, if the minority class performance is low, we can do the following:
• Use precision / recall instead of overall classification accuracy.
• To deal with the class imbalance, we can do weighted classification by giving more weight to the loss on minority-class samples.
• Use undersampling / oversampling to settle the imbalance present in the data.
(2 marks)
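The point can be illustrated with a degenerate classifier that always predicts "no cancer" (a hypothetical 96/4 class split chosen to match the accuracy in the question):

```python
# 96 healthy (0) and 4 cancer (1) cases; classifier always predicts 0
y_true = [0] * 96 + [1] * 4
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)  # sensitivity on the cancer class

print(accuracy, recall)  # → 0.96 0.0
```

Accuracy looks excellent, yet every cancer patient is missed; recall on the minority class exposes the failure immediately.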
ii) Poor performance on the test set is possible despite good validation accuracy if the random split produces a lucky validation set, or a validation split biased by imbalanced class ratios. (1 mark)
In order to resolve it, the following measures can be taken: (2 marks for each point)
• Consider stratified sampling instead of random sampling. This will ensure that the
data samples belonging to different classes are distributed in a more balanced way
throughout the train-test-validation split unlike in the case of random sampling.
• Apart from this, k-fold cross-validation can be used to make sure that each data sample appears in the held-out validation fold exactly once.
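A minimal sketch of a stratified split (the class sizes and the 80/20 ratio are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)  # 10% positive class

# Stratified 80/20 split: partition each class separately
train_idx, val_idx = [], []
for c in np.unique(y):
    idx = rng.permutation(np.where(y == c)[0])
    cut = int(0.8 * len(idx))
    train_idx.extend(idx[:cut])
    val_idx.extend(idx[cut:])

# Both splits preserve the overall 10% positive ratio
print(np.mean(y[train_idx]), np.mean(y[val_idx]))  # → 0.1 0.1
```

A plain random 80/20 split could easily leave the validation set with 0 or 4+ positives, which is exactly the biased-split failure described above.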
iii) In the first case, when the training accuracy is higher than the test accuracy, it is a case of overfitting. Measures to improve the performance are:
• Add regularisation in order to reduce the model complexity.
• Reduce the number of features.
• Try increasing the number of training samples.
(2.5 marks)
In the second scenario, when the training accuracy is much lower than the test accuracy, it can be a case of an unfortunate split, where the test data is dominated by samples from classes that had higher accuracy on the training set. To avoid this, we can try stratified sampling or k-fold cross-validation. (2.5 marks)
iv) A low-bias, high-variance scenario occurs when the trained model mimics the training data distribution too closely and gives very high accuracy on the training set (overfitting). However, such a model has little generalization capability and is therefore likely to perform poorly on unseen data during inference. (1 mark)
For such a high-variance scenario, an ensemble approach like random forest (bagging) can be used. Bagging algorithms divide a data set into subsets made with repeated randomized sampling. Then, these samples are used to generate a set of models using a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression). (2 marks)
Other measures to deal with high variance are as follows:
• Reduce model complexity to avoid overfitting.
• Use the regularisation technique to lower the model complexity by penalizing higher
model coefficients.
(2 marks)
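A toy sketch of bagging, using threshold "stumps" as the weak learner (the data, base learner, grid, and B = 25 resamples are all illustrative choices, not a prescribed method):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x > 0).astype(int)  # toy labels: positive when x > 0

def fit_stump(xs, ys):
    # Weak learner: pick the best threshold on a coarse grid
    grid = np.linspace(-2, 2, 41)
    accs = [np.mean((xs > t) == ys) for t in grid]
    return grid[int(np.argmax(accs))]

# Bagging: fit one stump per bootstrap resample, combine by majority vote
B = 25
thresholds = []
for _ in range(B):
    idx = rng.integers(0, len(x), len(x))  # bootstrap resample
    thresholds.append(fit_stump(x[idx], y[idx]))

votes = np.mean([(x > t).astype(int) for t in thresholds], axis=0)
bagged_pred = (votes > 0.5).astype(int)

acc = np.mean(bagged_pred == y)
print(round(acc, 2))
```

Each stump is trained on a different bootstrap sample, so its threshold varies; averaging the votes smooths out that variation, which is exactly the variance-reduction argument above.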
4. (10 points) (Extra Credit) Define Entropy and provide the mathematical expression for it.
Compute the entropy corresponding to the “PlayTennis” column (i.e., H(P layT ennis)) in the
table given in Fig. 1. Compute the conditional entropy H(W ind|T emperature). {Advice: Do
not try to find the solution to the fifteenth decimal place. The emphasis is on your ability to write the expression and compute the probabilities correctly. You may use fractions and
simplify the expression as far as possible. You do not really need a calculator here.}
Solution: Entropy is defined as the expected number of bits needed to encode a randomly
drawn value of a random variable X. It could also be defined as a measure of impurity, disorder
or uncertainty in a set of samples. (2 marks)
Entropy H(X) of a random variable X can be written as (2 marks; 1 mark if written only for the binary case):
$$H(X) = -\sum_{i=1}^{n} P(X = i)\log_2 P(X = i)$$
$$H(PlayTennis) = -\frac{5}{14}\log_2\frac{5}{14} - \frac{9}{14}\log_2\frac{9}{14} = 0.94$$
(2.5 marks. 1 if partially correct with silly mistakes. 0 otherwise)
$$H(Wind \mid Temperature) = -\frac{4}{14}\left[\frac{3}{4}\log_2\frac{3}{4} + \frac{1}{4}\log_2\frac{1}{4}\right] - \frac{6}{14}\left[\frac{3}{6}\log_2\frac{3}{6} + \frac{3}{6}\log_2\frac{3}{6}\right] - \frac{4}{14}\left[\frac{2}{4}\log_2\frac{2}{4} + \frac{2}{4}\log_2\frac{2}{4}\right]$$
(3.5 marks. 2.5 marks if partially correct with silly mistakes. Conditional Entropy expression
written, but probabilities not shown - 1 mark. 0 otherwise)
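The arithmetic above can be checked in a few lines (the counts 5/9 for PlayTennis and the three Temperature groupings are taken directly from the expressions above):

```python
from math import log2

def entropy(counts):
    # H = -sum p_i * log2(p_i) over non-empty categories
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

h_play = entropy([5, 9])  # H(PlayTennis)

# H(Wind | Temperature): weight each Temperature group's entropy by its size
groups = [(4, [3, 1]), (6, [3, 3]), (4, [2, 2])]
h_cond = sum(n / 14 * entropy(c) for n, c in groups)

print(round(h_play, 2), round(h_cond, 2))  # → 0.94 0.95
```

Note that two of the three Temperature groups split Wind evenly, so their entropies are exactly 1 bit; only the first group contributes less.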