Mid Sem Solution 2019


Machine Learning (Monsoon 2019)
Mid-Semester Exam: CSE543/ECE5ML
Date: 13/9/2019
Time Limit: 60 Minutes
Name:                                  Roll No:

Instructions:
Please do not plagiarize. It will be dealt with very strictly!
Try to answer all questions. The last question is Extra Credit. Try to make the most of it.
In the unlikely case that a question is ambiguous, please clearly state any assumptions that you
are making. For reducing subjectivity in grading, please do this even after clarifying with the
invigilator.
Good Luck!

1. (20 points) A receiver receives two types of signals, zero and non-zero. The zero signal is
distributed as p(v | y = y_z) ∼ N(0, 4), i.e., a Gaussian distribution with mean µ = 0 and
variance σ² = 4. The non-zero signal generates a voltage distributed as p(v | y = y_nz) ∼
0.5 N(−5, 4) + 0.5 N(+5, 4). Your task is to design a classifier that, given the voltage, can
identify whether the signal is zero or non-zero. Given the task and a training set, please answer
the following questions:

i) (4 points) Would a Bayes classifier be useful to solve this problem? Can you write the
Bayes classifier decision rule for this classification problem? Be sure to have the complete
definition, so that all cases are covered.
ii) (2 points) Why can you not use a logistic regression (LR) classifier for this problem
directly?
iii) (8 points) Let’s say Reverend Bayes was a fan of logistic regression, and asked you to solve
this confounding problem using LR. To appease Rev. Bayes, could you perhaps find a way
to apply logistic regression and still solve this problem? {Hint: Think transformation of
variables, perhaps adding more features.}
iv) (2+4=6 points) Let the two signals be 2-dimensional, such that the zero signal is
distributed as N([0, 0], 4I), where I is the 2 × 2 identity matrix, and the non-zero signal is
distributed as 0.5 N([−5, −5], 4I) + 0.5 N([+5, +5], 4I). Would a Bayesian classifier still
work better than logistic regression? Justify your argument. Could your strategy for
applying LR in the previous part also work for this 2-D version of the signal? If yes, write
down the modified LR model and the corresponding classification rule.

Solution:

i) (1 point) Yes, Bayes classifier can be used to solve the given problem.

    p(v \mid y = y_z) \sim \mathcal{N}(0, 4) = \frac{1}{\sqrt{8\pi}} e^{-(v-0)^2/8}

    p(v \mid y = y_{nz}) \sim 0.5\,\mathcal{N}(-5, 4) + 0.5\,\mathcal{N}(+5, 4)


Assign equal priors. (If equal priors are not assumed, the decision boundary must include
the prior terms as well.) The decision boundary is where the two posteriors are equal:

    p(y = y_z \mid v) = p(y = y_{nz} \mid v)

Expanding using Bayes' rule and cancelling the common evidence term p(v):

    p(v \mid y = y_z)\, p(y_z) = p(v \mid y = y_{nz})\, p(y_{nz})

    p(v \mid y = y_z) = p(v \mid y = y_{nz})

(1 point)

    \frac{1}{\sqrt{8\pi}} e^{-v^2/8} = 0.5\,\frac{1}{\sqrt{8\pi}} e^{-(v+5)^2/8} + 0.5\,\frac{1}{\sqrt{8\pi}} e^{-(v-5)^2/8}

    e^{-v^2/8} = 0.5\, e^{-(v+5)^2/8} + 0.5\, e^{-(v-5)^2/8}

(2 points) If e^{-v^2/8} > 0.5\, e^{-(v+5)^2/8} + 0.5\, e^{-(v-5)^2/8}, classify the signal as zero; otherwise, classify it as non-zero.
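As a quick numerical check, the decision rule can be evaluated directly from the two class-conditional densities. The following Python sketch is illustrative only (it assumes equal priors and uses SciPy's normal pdf; the function name classify_voltage is ours):

    import numpy as np
    from scipy.stats import norm

    def classify_voltage(v):
        """Bayes decision rule for zero vs. non-zero signals, assuming equal priors."""
        # Class-conditional densities from the problem statement (variance 4 => std. dev. 2).
        p_zero = norm.pdf(v, loc=0.0, scale=2.0)
        p_nonzero = 0.5 * norm.pdf(v, loc=-5.0, scale=2.0) + 0.5 * norm.pdf(v, loc=+5.0, scale=2.0)
        return "zero" if p_zero > p_nonzero else "non-zero"

    print(classify_voltage(0.3))   # expected: zero
    print(classify_voltage(4.6))   # expected: non-zero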
ii) (2 points) We cannot use logistic regression directly because standard logistic regression
produces a linear decision boundary (a single threshold in 1-D), while the non-zero signal
voltages lie on both sides of the zero signal, so the required decision boundary is non-linear.
iii) (2 points) Yes, we can still solve this with logistic regression, provided we first apply a
feature transformation.
(2 points) For this problem we add the squared input voltage as an extra feature, which
makes the two classes linearly separable in the transformed feature space.
(4 points) Let us model the class label y given the voltage as a Bernoulli random variable:

    p(y \mid v) = p(v)^{y} (1 - p(v))^{1-y}

The transformed linear score is w_0 + w_1 x_1 + w_2 x_1^2, where x_1 = v, so that

    p(v) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_1^2)}} = \frac{e^{w_0 + w_1 x_1 + w_2 x_1^2}}{1 + e^{w_0 + w_1 x_1 + w_2 x_1^2}}

The corresponding logit function is

    \mathrm{logit}[p(v)] = \ln\!\left[\frac{p(v)}{1 - p(v)}\right] = w_0 + w_1 x_1 + w_2 x_1^2
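To make part (iii) concrete, here is a minimal, illustrative sketch (not part of the original solution) that fits logistic regression on the features [v, v²] using synthetic data drawn from the stated distributions; all variable names are ours:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic training data: voltages v with labels 0 (zero signal) and 1 (non-zero signal).
    rng = np.random.default_rng(0)
    v_zero = rng.normal(0, 2, size=500)
    v_nonzero = np.concatenate([rng.normal(-5, 2, size=250), rng.normal(+5, 2, size=250)])
    v = np.concatenate([v_zero, v_nonzero])
    y = np.concatenate([np.zeros(500), np.ones(500)])

    # Feature transformation: [v, v^2]; the squared term makes the classes (nearly) linearly separable.
    X = np.column_stack([v, v ** 2])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    print(clf.predict(np.column_stack([[0.3, 4.6], [0.3 ** 2, 4.6 ** 2]])))  # expected roughly [0, 1]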

iv) (2 points) Yes, the Bayes classifier still works better than plain logistic regression, for the
same reason as above: the 2-D data is also not linearly separable.
(1 point) Yes, the modified logistic regression works for the 2-D model too; we just need to
add the squared and cross terms of the two inputs.
(3 points)
Transformed score: w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2

    p(y \mid v) = p(v)^{y} (1 - p(v))^{1-y}

    p(v) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2)}} = \frac{e^{w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2}}{1 + e^{w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2}}

Obtaining the logit:

    \mathrm{logit}[p(v)] = \ln\!\left[\frac{p(v)}{1 - p(v)}\right] = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2

The corresponding classification rule: predict the class with the higher probability, i.e. assign y = 1 if p(v) > 0.5 (equivalently, if the logit is positive) and y = 0 otherwise.
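The same idea can be sketched for the 2-D case; the snippet below (illustrative only, with synthetic data drawn from the stated 2-D mixtures, names ours) uses scikit-learn's PolynomialFeatures to generate the cross and squared terms automatically:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Synthetic 2-D samples drawn from the distributions in part (iv).
    rng = np.random.default_rng(1)
    X_zero = rng.multivariate_normal([0, 0], 4 * np.eye(2), size=500)
    X_nonzero = np.vstack([rng.multivariate_normal([-5, -5], 4 * np.eye(2), size=250),
                           rng.multivariate_normal([+5, +5], 4 * np.eye(2), size=250)])
    X = np.vstack([X_zero, X_nonzero])
    y = np.concatenate([np.zeros(500), np.ones(500)])

    # Degree-2 expansion adds x1*x2, x1^2 and x2^2, matching the transformed score above.
    model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                          LogisticRegression(max_iter=1000))
    model.fit(X, y)
    print(model.predict([[0.5, -0.2], [4.8, 5.3]]))  # expected roughly [0, 1]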

2. (15 points) What is the probabilistic model for logistic regression? Write the maximum
likelihood (ML) based objective function for this model. Extend it to the maximum-a-posteriori
(MAP) based objective with a Gaussian prior (N(0, σ_p²)) on the parameters. Feel free to use
the log-likelihood and the log-posterior expressions. Derive the gradient expression for both the
log-likelihood and the log-posterior. (Points break-up: 2+4+4+5=15)

Solution: Let us consider the logistic regression model where the output variable y_i is a
Bernoulli random variable, i.e., y_i ∈ {0, 1}. The logistic regression model can be written as:

    P(y_i = 1 \mid x_i) = \sigma(x_i \beta)    (1)

where \sigma(t) is the logistic function:

    \sigma(t) = \frac{1}{1 + \exp(-t)}    (2)

Here x_i is a 1 × n row vector of inputs (the i-th row of the design matrix) and \beta is an n × 1
vector of coefficients. Furthermore:

    P(y_i = 0 \mid x_i) = 1 - P(y_i = 1 \mid x_i) = 1 - \sigma(x_i \beta)    (3)

The goal is to estimate the parameter \beta by maximum likelihood estimation. Consider an
i.i.d. sample of N data points (y_i, x_i) ∼ D, i ∈ {1, ..., N}.
The likelihood of a single input-output pair (y_i, x_i) can be written as:

    L(\beta; y_i, x_i) = [\sigma(x_i \beta)]^{y_i} [1 - \sigma(x_i \beta)]^{1 - y_i}    (4)

Since all observations are i.i.d., the likelihood of the entire sample equals the product of the
likelihoods of the single observations:

    L(\beta; Y, X) = \prod_{i=1}^{N} [\sigma(x_i \beta)]^{y_i} [1 - \sigma(x_i \beta)]^{1 - y_i}    (5)

Here Y is the N × 1 vector of all outputs and X is the N × n matrix of all inputs.
The log-likelihood of L(\beta; Y, X) can be written as:

    \ell(\beta; Y, X) = \sum_{i=1}^{N} \left[-\ln(1 + \exp(x_i \beta)) + y_i x_i \beta\right]    (6)

The MLE estimate of \beta is given by:

    \beta_{MLE} = \arg\max_{\beta} \ln P(D \mid \beta) = \arg\max_{\beta} \ell(\beta; Y, X)    (7)

Proof:

    \ell(\beta; Y, X) = \ln\big(L(\beta; Y, X)\big)    (8)

    \ell(\beta; Y, X) = \ln \prod_{i=1}^{N} [\sigma(x_i \beta)]^{y_i} [1 - \sigma(x_i \beta)]^{1 - y_i}    (9)

    \ell(\beta; Y, X) = \sum_{i=1}^{N} \left[y_i \ln \sigma(x_i \beta) + (1 - y_i) \ln(1 - \sigma(x_i \beta))\right]    (10)

    \ell(\beta; Y, X) = \sum_{i=1}^{N} \left[y_i \ln\!\left(\frac{1}{1 + \exp(-x_i \beta)}\right) + (1 - y_i) \ln\!\left(1 - \frac{1}{1 + \exp(-x_i \beta)}\right)\right]    (11)

    \ell(\beta; Y, X) = \sum_{i=1}^{N} \left[\ln\!\left(\frac{1}{1 + \exp(x_i \beta)}\right) + y_i \ln\!\left(\frac{1}{\exp(-x_i \beta)}\right)\right]    (12)

    \ell(\beta; Y, X) = \sum_{i=1}^{N} \left[\ln(1) - \ln(1 + \exp(x_i \beta)) + y_i \big(\ln(1) - \ln(\exp(-x_i \beta))\big)\right]    (13)

    \ell(\beta; Y, X) = \sum_{i=1}^{N} \left[-\ln(1 + \exp(x_i \beta)) + y_i x_i \beta\right]    (14)

First-order derivative (gradient) of \ell(\beta; Y, X):

    \nabla_\beta \ell(\beta; Y, X) = \nabla_\beta \sum_{i=1}^{N} \left[-\ln(1 + \exp(x_i \beta)) + y_i x_i \beta\right]    (15)

    \nabla_\beta \ell(\beta; Y, X) = \sum_{i=1}^{N} \nabla_\beta \left[-\ln(1 + \exp(x_i \beta)) + y_i x_i \beta\right]    (16)

    \nabla_\beta \ell(\beta; Y, X) = \sum_{i=1}^{N} \left(-\frac{\exp(x_i \beta)}{1 + \exp(x_i \beta)} + y_i\right) x_i    (17)

    \nabla_\beta \ell(\beta; Y, X) = \sum_{i=1}^{N} \left(y_i - \frac{\exp(x_i \beta)\exp(-x_i \beta)}{(1 + \exp(x_i \beta))\exp(-x_i \beta)}\right) x_i    (18)

    \nabla_\beta \ell(\beta; Y, X) = \sum_{i=1}^{N} \left(y_i - \frac{1}{1 + \exp(-x_i \beta)}\right) x_i    (19)

    \nabla_\beta \ell(\beta; Y, X) = \sum_{i=1}^{N} \big(y_i - \sigma(x_i \beta)\big)\, x_i    (20)

Second-order derivative:

    \nabla_{\beta\beta} \ell(\beta; Y, X) = \nabla_\beta \big(\nabla_\beta \ell(\beta; Y, X)\big)    (21)

    \nabla_{\beta\beta} \ell(\beta; Y, X) = \nabla_\beta \sum_{i=1}^{N} \big(y_i - \sigma(x_i \beta)\big)\, x_i    (22)

    \nabla_{\beta\beta} \ell(\beta; Y, X) = -\sum_{i=1}^{N} x_i \,\nabla_\beta \sigma(x_i \beta)    (23)

    \nabla_{\beta\beta} \ell(\beta; Y, X) = -\sum_{i=1}^{N} x_i^T x_i\, \sigma(x_i \beta)\big[1 - \sigma(x_i \beta)\big]    (24)
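As an aside, the gradient in eq. (20) translates directly into code. The following is a minimal, illustrative NumPy sketch of gradient ascent on the log-likelihood (the function names and learning-rate settings are ours, not part of the solution):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def log_likelihood_grad(beta, X, y):
        """Eq. (20): sum_i (y_i - sigma(x_i beta)) x_i, written in matrix form."""
        return X.T @ (y - sigmoid(X @ beta))

    def fit_mle(X, y, lr=0.1, n_iter=2000):
        """Plain gradient ascent on the log-likelihood (illustrative, not tuned)."""
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            beta += lr * log_likelihood_grad(beta, X, y) / len(y)
        return beta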

In the MAP estimate we treat \beta as a random variable and place a prior belief distribution on
it. The given prior is the Gaussian N(0, σ²I):

    P(\beta) = \mathcal{N}(0, \sigma^2 I) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(\frac{-\beta^T \beta}{2\sigma^2}\right)    (25)

The MAP estimate is given by:

    \beta_{MAP} = \arg\max_{\beta} \ln P(\beta \mid D)    (26)

    \beta_{MAP} = \arg\max_{\beta} \left[\ln P(\beta) + \ln P(D \mid \beta) - \ln P(D)\right]    (27)

    \beta_{MAP} = \arg\max_{\beta} \left[\ln P(\beta) + \ln P(D \mid \beta)\right]    (28)

Substituting P(\beta) and P(D \mid \beta) from equations (25) and (7) into the above, we get:

    \beta_{MAP} = \arg\max_{\beta} \left[-\frac{1}{2\sigma^2} \beta^T \beta + \sum_{i=1}^{N} \left(-\ln(1 + \exp(x_i \beta)) + y_i x_i \beta\right) + \text{const}\right]    (29)

Ignoring the constant term, we get:

    \beta_{MAP} = \arg\max_{\beta} \left[\sum_{i=1}^{N} \left(-\ln(1 + \exp(x_i \beta)) + y_i x_i \beta\right) - \frac{1}{2\sigma^2} \beta^T \beta\right]    (30)

    \beta_{MAP} = \arg\max_{\beta}\; \ell_{MAP}    (31)

where

    \ell_{MAP} = \ell(\beta; Y, X) - \frac{1}{2\sigma^2} \beta^T \beta    (32)

    \nabla_\beta \ell_{MAP} = \nabla_\beta \ell(\beta; Y, X) - \nabla_\beta \left(\frac{1}{2\sigma^2} \beta^T \beta\right)    (33)

Using equation (20) for the first term, the gradient of \ell_{MAP} is:

    \nabla_\beta \ell_{MAP} = \sum_{i=1}^{N} \big(y_i - \sigma(x_i \beta)\big)\, x_i - \frac{1}{\sigma^2} \beta    (34)

The second-order derivative is:

    \nabla_{\beta\beta} \ell_{MAP} = -\sum_{i=1}^{N} x_i^T x_i\, \sigma(x_i \beta)\big[1 - \sigma(x_i \beta)\big] - \frac{1}{\sigma^2} I    (35)
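Similarly, eq. (34) gives an L2-regularised gradient that can be used for gradient ascent. A minimal, self-contained sketch (our naming, illustrative only):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def log_posterior_grad(beta, X, y, sigma2):
        """Eq. (34): the MLE gradient minus the Gaussian-prior term beta / sigma^2."""
        return X.T @ (y - sigmoid(X @ beta)) - beta / sigma2

    def fit_map(X, y, sigma2=1.0, lr=0.1, n_iter=2000):
        """Gradient ascent on the log-posterior (equivalent to L2-regularised logistic regression)."""
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            beta += lr * log_posterior_grad(beta, X, y, sigma2) / len(y)
        return beta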

3. (20 points) (5 points each) Answer the following questions:

i) You are given a data set on cancer detection. You’ve built a classification model and
achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance?
Is accuracy the right metric for this problem? {Hint: What is the fraction of people who
have cancer as opposed to people who do not?}
ii) You are working on a classification problem. For validation purposes, you’ve randomly
sampled the training data set into train and validation. You are confident that your model
will work incredibly well on unseen data since your validation accuracy is high. However,
you get shocked after getting poor test accuracy. What went wrong? How would you
resolve it?
iii) My student conducted an experiment in which he claims to achieve a training accuracy
of 93% and a test accuracy of 78%. Would you have any suggestions to improve on the
obtained results? What would you have suggested if the training accuracy was 78% and
the test accuracy was 93%?
iv) You came to know that your model is suffering from low bias and high variance. What
are the approaches that you can use to tackle it? Justify why would they work.

Solution:

i) Given the nature of the problem at hand, i.e. cancer detection, the dataset is likely to be
highly imbalanced: the vast majority of the samples are people who do not have cancer,
and only a small minority are people who actually do.

In such a case, accuracy is not a good measure of model performance: a classifier can
predict the majority class for every sample and still reach an accuracy of around 96%,
while misclassifying the rare class, the people who actually have cancer, who are of
primary interest here.
(2 marks)
To evaluate model performance in such a scenario, class-wise metrics such as the true
positive rate (sensitivity), true negative rate (specificity), precision, recall, or F-score
should be used instead.
(1 mark)
So, if the minority-class performance is low, we can do the following (a short sketch follows after this list):
• Use precision / recall (or F-score) instead of overall classification accuracy.
• Deal with the class imbalance via weighted classification, giving more weight to the loss
on minority-class samples.
• Use undersampling / oversampling to correct the imbalance present in the data.
(2 marks)
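The following is an illustrative scikit-learn sketch (synthetic data standing in for the cancer set; all names are ours) showing class weighting and per-class metrics:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score, f1_score
    from sklearn.model_selection import train_test_split

    # Synthetic, highly imbalanced data set (~4% positives) standing in for the cancer data.
    X, y = make_classification(n_samples=5000, n_features=20, weights=[0.96, 0.04], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # class_weight="balanced" reweights the loss inversely to the class frequencies.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Per-class metrics tell the real story even when plain accuracy looks high.
    print("precision:", precision_score(y_test, y_pred))
    print("recall:   ", recall_score(y_test, y_pred))
    print("F1-score: ", f1_score(y_test, y_pred))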

ii) Poor performance on the test set is possible despite good validation accuracy if the random
split happened to produce an easy ("lucky") validation set, or if the validation split is biased,
e.g. because of imbalanced class ratios. (1 mark)
To resolve it, the following measures can be taken (2 marks for each point; a sketch follows after this list):
• Use stratified sampling instead of purely random sampling. This ensures that the
data samples belonging to different classes are distributed in a balanced way across
the train / validation / test splits.
• In addition, k-fold cross-validation can be used so that every data sample appears in
a validation fold exactly once.
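An illustrative scikit-learn sketch of both ideas (synthetic data, names ours):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic, mildly imbalanced classification data standing in for the problem at hand.
    X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0)

    # Stratified k-fold keeps the class ratio the same in every train/validation split,
    # and every sample is used for validation exactly once.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    print("per-fold validation accuracy:", scores, "mean:", scores.mean())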
iii) In the first case, where the training accuracy is clearly higher than the test accuracy, the
model is overfitting. Measures to improve the performance are:
• Add regularisation to reduce the model complexity.
• Reduce the number of features.
• Try increasing the number of training samples.
(2.5 marks)
In the second scenario, where the training accuracy is much lower than the test accuracy,
the cause may be an unfortunate split in which the test data is dominated by samples from
the classes that were already classified well during training. To avoid this, we can use
stratified sampling or k-fold cross-validation. (2.5 marks)
iv) A low-bias, high-variance scenario occurs when the trained model mimics the training
data too closely and achieves a deceptively high accuracy on the training set (overfitting).
Such a model has poor generalization capability and is likely to perform badly on unseen
data during inference. (1 mark)
For such a high-variance scenario, an ensemble approach like random forest (bagging) can
be used (a bagging sketch follows below). Bagging draws repeated bootstrap samples from
the data set, fits a model on each sample with a single learning algorithm, and then combines
the model predictions by voting (classification) or averaging (regression). (2 marks)
Other measures to deal with high variance are as follows:
• Reduce model complexity to avoid overfitting.
• Use regularisation to lower the effective model complexity by penalizing large
model coefficients.
(2 marks)
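An illustrative bagging sketch with scikit-learn's random forest (synthetic data, names ours):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic data on which a single deep tree would overfit (high variance).
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Bagging many trees trained on bootstrap samples averages away much of the variance.
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    print("cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())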

4. (10 points) (Extra Credit) Define Entropy and provide the mathematical expression for it.
Compute the entropy corresponding to the “PlayTennis” column (i.e., H(P layT ennis)) in the
table given in Fig. 1. Compute the conditional entropy H(W ind|T emperature). {Advice: Do
not try to find the solution to the fifteenth decimal place. The emphasis is on your ability to
write the expression and compute the probabilities correctly. You may use fractions and
simplify the expression as far as possible. You do not really need a calculator here.}
Solution: Entropy is defined as the expected number of bits needed to encode a randomly
drawn value of a random variable X. It could also be defined as a measure of impurity, disorder
or uncertainty in a set of samples. (2 marks)

Figure 1: Data for playing tennis decisions

Entropy H(X) of a random variable X can be written as (2 marks; 1 mark if written only for
the binary case)

    H(X) = -\sum_{i=1}^{n} P(X = i)\, \log_2 P(X = i)

    H(PlayTennis) = -\frac{5}{14} \log_2\frac{5}{14} - \frac{9}{14} \log_2\frac{9}{14} \approx 0.94
(2.5 marks. 1 if partially correct with silly mistakes. 0 otherwise)

    H(Wind \mid Temperature) = -\frac{4}{14}\left[\frac{3}{4}\log_2\frac{3}{4} + \frac{1}{4}\log_2\frac{1}{4}\right] - \frac{6}{14}\left[\frac{3}{6}\log_2\frac{3}{6} + \frac{3}{6}\log_2\frac{3}{6}\right] - \frac{4}{14}\left[\frac{2}{4}\log_2\frac{2}{4} + \frac{2}{4}\log_2\frac{2}{4}\right]

    \qquad\qquad\qquad\;\; = \frac{4}{14}(0.811) + \frac{6}{14}(1) + \frac{4}{14}(1) \approx 0.95
(3.5 marks. 2.5 marks if partially correct with silly mistakes. Conditional Entropy expression
written, but probabilities not shown - 1 mark. 0 otherwise)
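For reference, a small Python sketch (ours, not part of the official solution) that reproduces both numbers from the counts used above; the grouped Wind/Temperature lists are reconstructed from those counts, not from the full table:

    import numpy as np

    def entropy(labels):
        """H(X) = -sum_i p_i log2 p_i, estimated from a list of sample labels."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def conditional_entropy(x, given):
        """H(X | Y) = sum_y P(Y = y) * H(X | Y = y)."""
        x, given = np.asarray(x), np.asarray(given)
        return sum(np.mean(given == g) * entropy(x[given == g]) for g in np.unique(given))

    # PlayTennis column: 9 Yes / 5 No.
    play = ["Yes"] * 9 + ["No"] * 5
    print(entropy(play))                    # ~0.94

    # Wind grouped by Temperature, consistent with the counts used above
    # (Hot: 3 Weak / 1 Strong, Mild: 3 / 3, Cool: 2 / 2).
    temp = ["Hot"] * 4 + ["Mild"] * 6 + ["Cool"] * 4
    wind = (["Weak"] * 3 + ["Strong"] * 1 + ["Weak"] * 3 + ["Strong"] * 3
            + ["Weak"] * 2 + ["Strong"] * 2)
    print(conditional_entropy(wind, temp))  # ~0.95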
