SMAI Question Papers

This document collects question papers for the Statistical Methods in AI course: two mid-semester examinations and a quiz with solutions, comprising multiple choice, long answer, and true/false questions (the true/false section carries negative marking for incorrect answers). The general instructions ask students to return the question booklet tied together with their answer sheets, state that no questions will be answered during the exam, and direct students to make and state any necessary assumptions.

International Institute of Information Technology, Hyderabad

(Deemed to be University)
Statistical Methods in AI (CSE/ECE 471) - Spring-2019
Mid-semester Examination 1

Maximum Time : 90 Minutes Total Marks : 100


Roll No. Programme Date
Room No. Seat No. Invigilator Sign.

Marks secured

Multiple Choice Questions


Question: 1 2 3 4 Total
Points: 2 2 2 3 9
Score:

Long Questions-1
Question: 5 6 7 8 9 10 11 12 13 14 15 Total
Points: 9 15 12 2 2 6 3 3 6 2 2 62
Score:

Long Questions-2
Question: 16 17 18 19 Total
Points: 4 14 2 9 29
Score:

General Instructions to the students


1. QUESTION BOOKLET NEEDS TO BE RETURNED ALONG WITH
ANSWER SHEETS. PLEASE TIE TOGETHER YOUR ANSWER SHEETS
AND QUESTION BOOKLET, WITH THE BOOKLET ON TOP.

2. No questions will be answered during the exam. Make necessary reasonable


assumptions, state them and proceed.

1
Multiple Choice Questions
For the following questions, specify ALL the correct answers. (Note: Partial marks will
not be given for partially correct answers.)

1. (2 points) Suppose the k-means algorithm is used to cluster n samples from a dataset.
Each sample is l-dimensional. Suppose the number of clusters is K and the number
of iterations for k-means to converge is m. What is the order of the run-time for the
algorithm ?
A. O(nKm) B. O(nKlm) C. O(nlm) D. None of the above

2. (2 points) For the same settings above, i.e. K, n, l, m, what is the order of total
storage space for the k-means algorithm ?
A. O((K + n) ∗ l) B. O(Kl) C. O(nKlm) D. None of the above
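To see where these complexity terms come from, here is a minimal sketch of one k-means pass (the function name `kmeans_iteration` and the pure-Python style are illustrative, not part of the exam): the nested loops over n samples, K centers and l dimensions cost O(nKl) per iteration, so m iterations cost O(nKlm), while storage holds the n samples plus K centers, i.e. O((K + n) · l).

```python
def kmeans_iteration(points, centers):
    """One assignment + update pass of k-means.

    Assignment: for each of n points, squared distance to each of
    K centers over l dimensions -> O(n*K*l) work per iteration.
    """
    K = len(centers)
    l = len(points[0])
    assignments = []
    for p in points:                                   # n points
        dists = [sum((p[d] - c[d]) ** 2 for d in range(l))
                 for c in centers]                     # K centers, l dims
        assignments.append(dists.index(min(dists)))
    new_centers = []
    for k in range(K):                                 # update: O(n*l)
        members = [p for p, a in zip(points, assignments) if a == k]
        if members:
            new_centers.append([sum(m[d] for m in members) / len(members)
                                for d in range(l)])
        else:                                          # empty cluster: keep center
            new_centers.append(list(centers[k]))
    return new_centers, assignments
```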

3. (2 points) In Figure 1, consider the 2-D dataset whose labels are given as + and −.
What is the smallest value of k for which a point at the marked location will be classified
with a − label ? Assume Euclidean distance.

Figure 1:

A. 1 B. 3 C. 4 D. 7

4. (3 points) Suppose X is a random variable with range {a1 , a2 , . . . an }. What is the
maximum possible value for H(X) – the entropy of X ?
A. 1/n B. log2 (n) C. n/2 D. None of the above
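As a quick check on this question, here is a small sketch (the helper name `entropy` is mine) confirming that the uniform distribution over n outcomes attains H(X) = log2(n), while a non-uniform distribution falls below it:

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over n outcomes attains the maximum, log2(n).
n = 8
uniform = [1 / n] * n
skewed = [0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]  # sums to 1
```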

Long Questions
Write detailed answers. Adequately explain your assumptions and thought process.

Suppose we have a 2-class dataset with n samples. Each sample is labelled positive
or negative class. Suppose the fraction of positive-labelled samples is x. It is also
known that x < 0.5. We can define some simple non-machine learning classifiers that
assign labels based simply on the proportions found in the training data as follows:

• Random Guess Classifier (RGC): Randomly assign half of the samples to positive
class and the other half to negative.
• Weighted Guess Classifier (WGC): Randomly assign x fraction of the samples
to positive class and the remaining (1 − x) fraction to negative class.
• Majority Class Classifier (MCC): assign the label of majority class to all the
samples.

2
The baseline performance of these classifiers is determined by predicting labels on
the n-sample dataset.
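One way to reason about these baselines is to note that a random prediction matches the true label with a probability determined by the class fractions. A short sketch of the resulting expected accuracies (the helper name is mine, and this assumes random predictions are independent of the true labels):

```python
def baseline_accuracies(x):
    """Expected accuracy of the three trivial classifiers, for
    positive-class fraction x < 0.5 (illustrative helper name)."""
    rgc = 0.5                    # each sample is correct with prob 1/2
    wgc = x * x + (1 - x) ** 2   # match positives w.p. x, negatives w.p. 1-x
    mcc = 1 - x                  # always predict the (negative) majority
    return rgc, wgc, mcc
```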

5. (9 points) Write down the confusion matrices for the three classifiers.

6. (15 points) Fill the following table (write your answer in the answer sheet only)

Classifier Accuracy Precision Recall


RGC
WGC
MCC

7. (12 points) Suppose we now have k classes and xi represents the fraction of i-th class
samples among the n samples. What is the accuracy for each of the classifiers specified
above ?

For a k-means clustering setting, assume the following notation:

• K : The number of clusters


• nk : Number of instances in k-th cluster
• µk : The center of k-th cluster
• xpq : A data sample within the q-th cluster, i.e. 1 ≤ p ≤ nq

8. (2 points) Write down the expression Jk for the average Euclidean distance between
members of k-th cluster.

9. (2 points) Write down the expression Sk for the sum of Euclidean distance of each
cluster member from its center in k-th cluster.
10. (6 points) Let J = Σ_k J_k and S = Σ_k S_k . Derive the mathematical relationship
between S and J.

Consider a labelled dataset with B binary input variables, Xi ∈ {0, 1}, 1 ≤ i ≤ B.


The number of output classes is C.

11. (3 points) How many parameters (probabilities) must be estimated to train a Naive
Bayes classifier on this data ?

12. (3 points) How many parameters must be estimated if we do not make the Naive
Bayes assumption ?
Suppose we have a set of observations x1 , x2 , . . . xn . It is assumed that the data has
been generated by sampling from an exponential distribution

f (x; λ) = λe^(−λx) if x > 0, and 0 if x < 0.    (1)

13. (6 points) What is the maximum likelihood estimate of λ ?
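For reference, the standard result here is λ̂ = n / Σ x_i, the reciprocal of the sample mean; a small sketch (the function name is mine) to check it numerically:

```python
def mle_lambda(samples):
    """MLE of the exponential rate: lambda_hat = n / sum(x_i) = 1 / mean."""
    return len(samples) / sum(samples)
```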


For the set of points in Figure 2

Figure 2:

14. (2 points) What is the equation of the least-squares-error linear regression line ?
15. (2 points) What is the value of the mean squared error for the estimated line ?
Suppose you are given a labelled dataset D = (xi , yi ), 1 ≤ i ≤ N, xi ∈ Rd where the
class labels are binary, i.e. yi ∈ {0, 1}.
16. (4 points) Let p(z) = e^z / (1 + e^z ). Show that its derivative p′(z) = p(z)(1 − p(z)).
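The identity in this question can be checked numerically with a central-difference approximation (the helper names below are mine):

```python
import math

def p(z):
    """Logistic sigmoid: p(z) = e^z / (1 + e^z)."""
    return math.exp(z) / (1.0 + math.exp(z))

def numeric_derivative(f, z, h=1e-6):
    """Central-difference estimate of f'(z), error O(h^2)."""
    return (f(z + h) - f(z - h)) / (2 * h)
```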
17. (14 points) From the expression for the likelihood of the given data under the logistic
regression model and from the equations used to obtain the maximum likelihood estimate
of the model parameters, show that Σ_{i=1}^{N} y_i x_i = Σ_{i=1}^{N} P_i x_i , where
P_i = p(y_i = 1|x_i ; β) and β ∈ R^(d+1) stands for the parameter vector of the logistic
regression model. Hint: Use the result from the previous question.
Suppose we wish to predict the gender of a person based on two binary attributes:
Leg-Cover (pants or skirts) and Facial-Hair (some or none). We have a dataset of
2,000 people, half male and half female. 75% of the males have no facial hair. Skirts
are worn by 50% of the females. All females are fully bare-faced and no male wears
a skirt.

Figure 3:

18. (2 points) What is the initial entropy of the dataset ?


19. (9 points) Suppose we wish to build a decision tree for our prediction task. Compute
the Information Gain for each choice of ‘Leg-Cover’, ‘Facial-Hair’ as root node. Based
on the gain values, which attribute is preferable as root node ? Use the values from
Figure 3.

International Institute of Information Technology, Hyderabad
(Deemed to be University)
Statistical Methods in AI (CSE/ECE 471) - Spring-2019
Mid-semester Examination 2

Maximum Time : 90 Minutes Total Marks : 75


Roll No. Programme Date
Room No. Seat No. Invigilator Sign.

Marks secured

Multiple Choice Questions


Question: 1 2 3 4 5 6 7 8 9 10 11 Total
Points: 2 2 2 2 2 2 2 2 2 2 3 23
Score:

Long Questions-1
Question: 12 13 14 15 16 17 18 19 20 Total
Points: 5 5 15 6 4 5 5 3 4 52
Score:

General Instructions to the students


1. QUESTION BOOKLET NEEDS TO BE RETURNED ALONG WITH
ANSWER SHEETS. PLEASE TIE TOGETHER YOUR ANSWER SHEETS
AND QUESTION BOOKLET, WITH THE BOOKLET ON TOP.

2. Multiple-choice and True/False questions MUST be answered clearly within the


question booklet itself. NO MARKS FOR WRITING THE CHOICES IN
ANSWER SHEET.

3. No questions will be answered during the exam. Make necessary reasonable


assumptions, state them and proceed.
True or False
Circle True or False. NOTE: This section (True or False) has negative marking
for incorrect answers. (2 points each)

1. (2 points) True False Two random variables A, B are independent if p(A, B) =
p(A|B)p(B).
2. (2 points) True False By minimizing its loss function, k-means clustering always reaches
the global minimum.
3. (2 points) True False Naive Bayes classifier finds a Maximum Aposteriori Probability
(MAP) estimate of its parameters.
4. (2 points) True False Any boolean function can be learnt by a linear classifier (perceptron).

5. (2 points) True False Suppose x1 , x2 are two data points with the same class label A and
x1 ≠ x2 . Suppose x3 = (x1 + x2 )/2 is a datapoint that belongs to a
different class B. No perceptron exists that classifies x1 , x2 into A
and classifies x3 into class B.
6. (2 points) True False Suppose we have a model from a fixed hypothesis set. As the amount
of training data decreases, the possibility of overfitting the model
increases.
7. (2 points) True False For a given dataset, a random forest classifier tends to have a lower
bias than a decision tree.

Multiple Choice
Mark all answers you think are correct. No marks for partially correct answers.

8. (2 points) Consider the following regression model : arg min_θ ||y − Xθ||₂² + λ ||θ||₂² .
What does increasing λ do ?
A. Bias of the model increases, Variance decreases
B. Bias of the model increases, Variance stays the same
C. Bias of the model decreases, Variance increases
D. Bias of the model decreases, Variance stays the same

9. (2 points) Which of the following activation functions has an unbounded range ?


A. ReLU (max(x, 0)) B. Linear C. Sigmoid D. Tanh

10. (2 points) For which of the following machine learning approaches can we have a
kernel-ized version (similar to SVM) ?
A. k-NN B. k-means C. PCA D. None of the above

11. (3 points) A 1-nearest neighbor classifier has _________ than a 5-nearest
neighbor classifier.
A. larger variance B. larger bias C. smaller variance D. smaller bias

Long Questions
Write detailed answers. Adequately explain your assumptions and thought process.
12. (5 points) Figure 1 shows two plots, corresponding to the 2-D distribution of two
different datasets. Suppose PCA is performed on the given data. Clearly draw the
directions of the first and second principal component vectors in each plot. NOTE:
Draw directly on the plots in the question paper.

Figure 1:

13. (5 points) Suppose the month of the year is one of the attributes in your dataset.
Currently, each month is represented by an integer k, 0 ≤ k ≤ 11, and let's say k = 0
corresponds to December, k = 1 to January, etc. Come up with a feature representation
f (k) such that the representation for December is at equal Euclidean distance from the
representations of January and November, i.e. ||f (0) − f (1)||₂ = ||f (0) − f (11)||₂.
Hint: f (k) can be a vector.
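One representation satisfying the requirement (among many) places the months on the unit circle, so consecutive months are always equidistant; a sketch, with hypothetical helper names:

```python
import math

def f(k):
    """Embed month k (0..11) on the unit circle; one valid choice."""
    angle = 2 * math.pi * k / 12
    return (math.cos(angle), math.sin(angle))

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])
```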
14. (15 points) Figure 2 shows a 2-D dataset (circles). Suppose the k-means algorithm
is run with k = 2 and the squares represent the initial locations of the estimated
means. Indicate the new locations of the cluster means after 1 iteration of the k-
means algorithm. Draw a triangle at the location of each cluster mean. Also write
1, 2 alongside each data point and the new cluster mean to show which data points
belong to cluster 1 and which datapoints belong to cluster 2. Assume that datapoints
whose locations do not align with integer axes coordinates have coordinates of 0.5.
For e.g. the coordinates of top-left datapoint are (0, 7). The coordinates of datapoint
immediately to its right are (0.5, 7)
15. (6 points) The loss function for k-means clustering with k > 1 clusters, data-points
x1 , x2 . . . xn , centers µ1 , µ2 , . . . µk and Euclidean distance is given by

L = Σ_{j=1}^{k} Σ_{x_i ∈ S_j} ||x_i − µ_j||₂²

where Sj refers to points with cluster center µj . Suppose stochastic gradient descent
with a learning rate of η is used. Derive the update rule for parameter µ1 for a given
data-point xp . NOTE: xp may or may not be a sample in S1 .
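As a sanity check on the derivation: only the term ||x_p − µ_1||² involves µ_1, and only when x_p ∈ S_1; its gradient w.r.t. µ_1 is −2(x_p − µ_1), giving the update µ_1 ← µ_1 + 2η(x_p − µ_1). A minimal sketch (the function name is mine):

```python
def sgd_update_mu1(mu1, xp, assigned_to_1, eta):
    """One SGD step on mu_1 for sample xp.

    Gradient of ||xp - mu1||^2 w.r.t. mu1 is -2(xp - mu1) when
    xp belongs to S_1, and zero otherwise.
    """
    if not assigned_to_1:
        return list(mu1)          # xp not in S_1: mu_1 unchanged
    return [m + 2 * eta * (x - m) for m, x in zip(mu1, xp)]
```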

Figure 2:

Consider the following dataset (each row is a data sample; each sample has two dimensions)

X = [  6  −4
      −3   5
      −2   6
       7  −3 ]
Suppose PCA is used to determine the principal components.
16. (4 points) What are the unit vectors in the directions corresponding to the principal
components ? HINT: There might be a faster way to guess the vectors instead of
computing the covariance matrix.
17. (5 points) What is the sum of the eigenvalues corresponding to the principal components ?
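The arithmetic for both PCA questions can be checked with a short sketch (the function name is mine; it uses the sample covariance with divisor n − 1, so the eigenvalue sum equals the trace of that covariance matrix — with divisor n it would scale by (n − 1)/n):

```python
def pca_2d(rows):
    """Eigenvalues of the 2-D sample covariance matrix (divisor n-1)."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    cx = [r[0] - mx for r in rows]          # centered x-coordinates
    cy = [r[1] - my for r in rows]          # centered y-coordinates
    sxx = sum(a * a for a in cx) / (n - 1)
    syy = sum(b * b for b in cy) / (n - 1)
    sxy = sum(a * b for a, b in zip(cx, cy)) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] via trace/determinant.
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = (tr * tr / 4 - det) ** 0.5
    return tr / 2 + disc, tr / 2 - disc, tr  # lam1, lam2, their sum
```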
18. (5 points) Figure 3 shows the truth table for a NAND logic gate. Implement the
NAND function via a neural network architecture with a single neuron and an ap-
propriate choice of weights, bias and activation function.
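One valid single-neuron solution (the weights, bias and step activation below are one choice among many):

```python
def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def nand_neuron(x1, x2, w=(-1.0, -1.0), b=1.5):
    """Single neuron computing NAND; outputs 0 only when both inputs are 1."""
    return step(w[0] * x1 + w[1] * x2 + b)
```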

Figure 3:
In the lecture on SVM, we saw that one could use a mapping function φ : Rn −→ Rd
to transform points from the original Rn space to another space Rd . We also saw that
one could define a kernel function K(x, z) such that K(x, z) = φ(x)T φ(z). Suppose α
is a positive real constant value. Suppose φ1 : Rn −→ Rd , φ2 : Rn −→ Rd are feature
mappings of K1 and K2 respectively. In terms of φ1 , φ2
19. (3 points) Write the formula for the feature mapping φ3 corresponding to K(x, z) =
αK1 (x, z)
20. (4 points) Write the formula for the feature mapping φ3 corresponding to K(x, z) =
K1 (x, z)K2 (x, z)

SMAI Spring-2019 Quiz-3
Full marks: 30
Time: 50 Mins

1. Given the hyperplane defined by the line y = (1, −2)^T x = w^T x, what is the
minimal adjustment to w to make a new point y = 1, x = (1, 1) be correctly
classified? [5]

2. Which of the following is/are true regarding an


SVM? Give explanation:
(a) For two dimensional data points, the separating
hyperplane learnt by a linear SVM will be a straight
line.
(b) In theory, a Gaussian kernel SVM can model
any complex separating hyperplane.
(c) For every kernel function used in an SVM, one
can obtain an equivalent closed-form basis
expansion.
(d) Overfitting in an SVM is a function of the number
of support vectors. [5]

3. Suppose a support vector machine for separating
pluses from minuses finds a plus support vector at
the point x1 = (1, 0) and a minus support vector at
x2 = (0, 1). You are to determine values for the
classification vector w and the threshold value b.
Your expression for w may contain x1 and x2.
Hint: Think about the values produced by the
decision rule for the support vectors x1 and x2. [5]
4. Suppose you have trained an SVM classifier with a
Gaussian kernel and it learned the following
decision boundary on the training set:

When you measure the SVM's performance on a
cross-validation set, it performs poorly. Should you
increase or decrease the value of σ²? Give an
explanation. [5]
5. State True/ False with proper justification:
A. If a learning algorithm is suffering from high
bias only adding more training samples may
not improve the test error significantly
B. A model with more parameters is more prone
to overfitting and typically has higher variance
C. When debugging learning algorithms it is
useful to plot a learning curve to understand if
there is high bias or high variance problem
D. If a neural network has much lower training
error than test error then adding more layers
would help bring the test error down as we
can fit the test set better. [10]
Quiz-3 Solution

1.

2. Solution - a, b, d
a - Trivially true.
b - The Gaussian kernel can be written as a Taylor expansion and seen as a basis
expansion of infinite dimensions, hence giving it the ability to model any
separating hyperplane.
d - The more support vectors there are, the higher the chance of the classifier being
overfit.

3. The decision boundary goes through the origin, so b = 0. As w = (a1·x1 − a2·x2), and
x1 is a support vector, w · x1 = 1. Substituting for w, we get w · x1 = (a1·x1 − a2·x2) · x1 =
a1·(x1 · x1) = a1, because x1 and x2 are orthogonal and x1 is a unit vector. Hence a1 = 1.
The same reasoning yields a2 = 1. So w = a1·x1 − a2·x2 = x1 − x2 = [1 −1].
w = x1 − x2 = [1 −1]
b = 0
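This solution can be verified by checking that the decision rule places each support vector exactly on its margin (the helper name is mine):

```python
def decision(x, w=(1.0, -1.0), b=0.0):
    """Decision value w.x + b for the derived w = x1 - x2, b = 0."""
    return w[0] * x[0] + w[1] * x[1] + b
```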

4. The figure shows a decision boundary that is overfit to the training set, so we
would like to increase the bias / lower the variance of the SVM. Hence σ² should be
increased.
5.
A. True. If a learning algorithm is suffering from high bias, only adding more
training examples may not improve the test error significantly.
B. True. More model parameters increase the model's complexity, so it can more
tightly fit data in training, increasing the chances of overfitting.
C. True. The shape of a learning curve is a good indicator of bias or variance
problems with your learning algorithm.
D. False. With lower training than test error, the model has high variance. Adding
more layers will increase model complexity, making the variance problem worse.
