SMAI Question Papers
International Institute of Information Technology, Hyderabad
(Deemed to be University)
Statistical Methods in AI (CSE/ECE 471) - Spring-2019
Mid-semester Examination 1
Marks secured
Long Questions-1
Question: 5 6 7 8 9 10 11 12 13 14 15 Total
Points: 9 15 12 2 2 6 3 3 6 2 2 62
Score:
Long Questions-2
Question: 16 17 18 19 Total
Points: 4 14 2 9 29
Score:
Multiple Choice Questions
For the following questions, specify ALL the correct answers. (Note: Partial marks will
not be given for partially correct answers.)
1. (2 points) Suppose the k-means algorithm is used to cluster n samples from a dataset.
Each sample is l-dimensional. Suppose the number of clusters is K and the number
of iterations for k-means to converge is m. What is the order of the run-time of the
algorithm?
A. O(nKm) B. O(nKlm) C. O(nlm) D. None of the above
2. (2 points) For the same settings as above, i.e. K, n, l, m, what is the order of the
total storage space for the k-means algorithm?
A. O((K + n) · l) B. O(Kl) C. O(nKlm) D. None of the above
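To build intuition for questions 1 and 2, here is a minimal NumPy sketch of Lloyd's k-means with the asymptotic cost of each step annotated; the random initialization and data layout are assumptions, not part of the question:

```python
import numpy as np

def kmeans(X, K, m):
    """Lloyd's k-means: X has shape (n, l); K clusters; m iterations."""
    n, l = X.shape
    rng = np.random.default_rng(0)
    centers = X[rng.choice(n, K, replace=False)].copy()  # O(K*l) extra storage
    labels = np.zeros(n, dtype=int)                      # O(n) labels
    for _ in range(m):                                   # m iterations
        # Assignment: n samples x K centers x l dims -> O(n*K*l) per iteration
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update: each sample contributes once per dimension -> O(n*l)
        for j in range(K):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Total time: O(n*K*l*m); total storage: O((K + n) * l) for data + centers
    return centers, labels
```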
3. (2 points) In Figure 1, consider the 2-D dataset whose labels are given as + and −.
What is the smallest value of k for which a point at the marked location will be
classified with a − label? Assume Euclidean distance.
Figure 1:
A. 1 B. 3 C. 4 D. 7
Long Questions
Write detailed answers. Adequately explain your assumptions and thought process.
Suppose we have a 2-class dataset with n samples. Each sample is labelled as belonging
to the positive or the negative class. Suppose the fraction of positive-labelled samples
is x, and it is known that x < 0.5. We can define some simple non-machine-learning
classifiers that assign labels based simply on the proportions found in the training data, as follows:
• Random Guess Classifier (RGC): Randomly assign half of the samples to positive
class and the other half to negative.
• Weighted Guess Classifier (WGC): Randomly assign x fraction of the samples
to positive class and the remaining (1 − x) fraction to negative class.
• Majority Class Classifier (MCC): Assign the label of the majority class to all the
samples.
The baseline performance of these classifiers is determined by predicting labels on
the n-sample dataset.
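As a quick sanity check for questions 5-7, here is a simulation sketch; the values of n and x are hypothetical, and RGC is approximated by a fair coin per sample rather than an exact half/half split:

```python
import numpy as np

rng = np.random.default_rng(0)
n, x = 1000, 0.3                      # hypothetical values; the paper only fixes x < 0.5
y = rng.random(n) < x                 # True marks a positive-labelled sample

def confusion(y_true, y_pred):
    """Return (TP, FP, FN, TN) counts for boolean label arrays."""
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    return tp, fp, fn, tn

rgc = rng.random(n) < 0.5             # RGC, approximated by a fair coin per sample
wgc = rng.random(n) < x               # WGC: positive with probability x
mcc = np.zeros(n, dtype=bool)         # MCC: majority class is negative since x < 0.5

for name, pred in (("RGC", rgc), ("WGC", wgc), ("MCC", mcc)):
    tp, fp, fn, tn = confusion(y, pred)
    print(name, (tp, fp, fn, tn), "accuracy:", (tp + tn) / n)
```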
5. (9 points) Write down the confusion matrices for the three classifiers.
6. (15 points) Fill the following table (write your answer in the answer sheet only)
7. (12 points) Suppose we now have k classes and xi represents the fraction of i-th class
samples among the n samples. What is the accuracy of each of the classifiers specified
above?
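For reference, under the natural k-class generalizations of the three classifiers (a sketch, assuming WGC assigns class i with probability $x_i$ and MCC predicts the largest class), the expected accuracies work out to

$$\text{RGC}: \sum_i x_i \cdot \tfrac{1}{k} = \tfrac{1}{k}, \qquad \text{WGC}: \sum_i x_i^2, \qquad \text{MCC}: \max_i x_i.$$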
8. (2 points) Write down the expression Jk for the average Euclidean distance between
members of the k-th cluster.
9. (2 points) Write down the expression Sk for the sum of Euclidean distances of each
cluster member from its center in the k-th cluster.
10. (6 points) Let $J = \sum_k J_k$ and $S = \sum_k S_k$. Derive the mathematical relationship
between S and J.
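A useful stepping stone, assuming the distances in $J_k$ and $S_k$ are squared Euclidean (the convention under which the classic identity holds): for a cluster $C_k$ with $n_k$ members and mean $\mu_k$,

$$\sum_{x_i \in C_k} \sum_{x_j \in C_k} \|x_i - x_j\|^2 = 2\, n_k \sum_{x_i \in C_k} \|x_i - \mu_k\|^2,$$

so the pairwise quantity and the within-cluster scatter differ only by a factor involving $n_k$.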
11. (3 points) How many parameters (probabilities) must be estimated to train a Naive
Bayes classifier on this data?
12. (3 points) How many parameters must be estimated if we do not make the Naive
Bayes assumption?
Suppose we have a set of observations $x_1, x_2, \ldots, x_n$. It is assumed that the data has
been generated by sampling from an exponential distribution

$$f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases} \qquad (1)$$
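For reference, the standard maximum-likelihood derivation of $\lambda$ from such a sample (a sketch):

$$\ell(\lambda) = \sum_{i=1}^{n} \ln f(x_i; \lambda) = n \ln \lambda - \lambda \sum_{i=1}^{n} x_i, \qquad \frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0 \;\Rightarrow\; \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i}.$$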
Figure 2:
14. (2 points) What is the equation of the least-squares-error linear regression line?
15. (2 points) What is the value of the mean squared error for the estimated line?
Suppose you are given a labelled dataset $D = \{(x_i, y_i)\}, 1 \leq i \leq N, x_i \in \mathbb{R}^d$, where the
class labels are binary, i.e. $y_i \in \{0, 1\}$.

16. (4 points) Let $p(z) = \frac{e^z}{1 + e^z}$. Show that its derivative $p'(z) = p(z)(1 - p(z))$.
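One way to verify this, via the quotient rule (a sketch):

$$p'(z) = \frac{e^z (1 + e^z) - e^z \cdot e^z}{(1 + e^z)^2} = \frac{e^z}{1 + e^z} \cdot \frac{1}{1 + e^z} = p(z)\,(1 - p(z)).$$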
17. (14 points) From the expression for the likelihood of the given data under the logistic
regression model, and from the equations used to obtain the maximum likelihood estimate
of the model parameters, show that

$$\sum_{i=1}^{N} y_i x_i = \sum_{i=1}^{N} P_i x_i, \quad \text{where } P_i = p(y_i = 1 \mid x_i; \beta)$$

and $\beta \in \mathbb{R}^{d+1}$ is the parameter vector of the logistic regression model. Hint:
use the result from the previous question.
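A sketch of the standard route: the log-likelihood is $\ell(\beta) = \sum_i \left[y_i \ln P_i + (1 - y_i)\ln(1 - P_i)\right]$; using $p'(z) = p(z)(1 - p(z))$ from question 16, its gradient is

$$\nabla_\beta\, \ell(\beta) = \sum_{i=1}^{N} (y_i - P_i)\, x_i,$$

and setting this gradient to zero at the maximum-likelihood estimate gives the stated equality.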
Suppose we wish to predict the gender of a person based on two binary attributes:
Leg-Cover (pants or skirts) and Facial-Hair (some or none). We have a dataset of
2,000 people, half male and half female. 75% of the males have no facial hair. Skirts
are worn by 50% of the females. All females are fully bare-faced and no male wears
a skirt.
Figure 3:
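To make the stated proportions concrete, here is a minimal Naive Bayes sketch built directly from the description above; the function name and the example query are hypothetical:

```python
# Probabilities read directly from the description:
# P(male) = P(female) = 0.5; 75% of males have no facial hair;
# 50% of females wear skirts; no female has facial hair; no male wears a skirt.
p_prior = {"M": 0.5, "F": 0.5}
p_facial_hair = {"M": 0.25, "F": 0.0}    # P(Facial-Hair = some | gender)
p_skirt = {"M": 0.0, "F": 0.5}           # P(Leg-Cover = skirt | gender)

def posterior_male(skirt: bool, facial_hair: bool) -> float:
    """P(male | attributes) under the Naive Bayes independence assumption."""
    def joint(g):
        p1 = p_skirt[g] if skirt else 1.0 - p_skirt[g]
        p2 = p_facial_hair[g] if facial_hair else 1.0 - p_facial_hair[g]
        return p_prior[g] * p1 * p2
    m, f = joint("M"), joint("F")
    return m / (m + f) if (m + f) > 0 else float("nan")  # guard: impossible evidence

print(posterior_male(skirt=False, facial_hair=False))    # pants, bare-faced -> 0.6
```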
International Institute of Information Technology, Hyderabad
(Deemed to be University)
Statistical Methods in AI (CSE/ECE 471) - Spring-2019
Mid-semester Examination 2
Marks secured
Long Questions-1
Question: 12 13 14 15 16 17 18 19 20 Total
Points: 5 5 15 6 4 5 5 3 4 52
Score:
5. (2 points) True False Suppose $x_1, x_2$ are two data points with the same class label A and
$x_1 \neq x_2$. Suppose $x_3 = \frac{x_1 + x_2}{2}$ is a data point that belongs to a
different class B. No perceptron exists that classifies $x_1, x_2$ into A
and classifies $x_3$ into class B.
6. (2 points) True False Suppose we have a model from a fixed hypothesis set. As the amount
of training data decreases, the possibility of overfitting the model
increases.
7. (2 points) True False For a given dataset, a random forest classifier tends to have a lower
bias than a decision tree.
Multiple Choice
Mark all answers you think are correct. No marks for partially correct answers.
8. (2 points) Consider the following regression model: $\arg\min_\theta \|y - X\theta\|_2^2 + \lambda \|\theta\|_2^2$.
What does increasing λ do?
A. Bias of the model increases, Variance decreases
B. Bias of the model increases, Variance stays the same
C. Bias of the model decreases, Variance increases
D. Bias of the model decreases, Variance stays the same
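A closed-form ridge sketch on hypothetical data makes the shrinkage behind these options visible (the design matrix, noise level, and coefficient values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=50)

for lam in (0.0, 1.0, 100.0):
    # Ridge solution: theta = (X^T X + lambda I)^(-1) X^T y
    theta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(lam, np.round(theta, 3))  # coefficients shrink toward 0 as lambda grows
```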
10. (2 points) For which of the following machine learning approaches can we have a
kernelized version (similar to SVM)?
A. k-NN B. k-means C. PCA D. None of the above
Long Questions
Write detailed answers. Adequately explain your assumptions and thought process.
12. (5 points) Figure 1 shows two plots, corresponding to the 2-D distribution of two
different datasets. Suppose PCA is performed on the given data. Clearly draw the
directions of the first and second principal component vectors in each plot. NOTE:
Draw directly on the plots in the question paper.
Figure 1:
13. (5 points) Suppose the month of the year is one of the attributes in your dataset.
Currently, each month is represented by an integer k, 0 ≤ k ≤ 11, and let's say k = 0
corresponds to December, k = 1 to January, etc. Come up with a feature representation
f(k) such that the representation for December is at equal Euclidean distance from the
representations of January and November, i.e. $\|f(0) - f(1)\|_2 = \|f(0) - f(11)\|_2$.
Hint: f(k) can be a vector; one such encoding is sketched below.
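One encoding that satisfies the constraint is the standard cyclical embedding onto the unit circle (a sketch; other valid answers exist):

```python
import numpy as np

def month_features(k: int) -> np.ndarray:
    """Map month index k (0..11) onto the unit circle."""
    theta = 2 * np.pi * k / 12
    return np.array([np.cos(theta), np.sin(theta)])

# December (k = 0) is now equidistant from January (k = 1) and November (k = 11)
d_jan = np.linalg.norm(month_features(0) - month_features(1))
d_nov = np.linalg.norm(month_features(0) - month_features(11))
print(np.isclose(d_jan, d_nov))   # True
```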
14. (15 points) Figure 2 shows a 2-D dataset (circles). Suppose the k-means algorithm
is run with k = 2, and the squares represent the initial locations of the estimated
means. Indicate the new locations of the cluster means after one iteration of the k-means
algorithm: draw a triangle at the location of each cluster mean, and write 1 or 2 alongside
each data point and each new cluster mean to show which data points belong to cluster 1
and which to cluster 2. Assume that data points whose locations do not align with integer
axis coordinates have half-integer (0.5) coordinates; e.g., the coordinates of the top-left
data point are (0, 7), and the data point immediately to its right is at (0.5, 7).
15. (6 points) The loss function for k-means clustering with k > 1 clusters, data points
$x_1, x_2, \ldots, x_n$, centers $\mu_1, \mu_2, \ldots, \mu_k$, and Euclidean distance is given by

$$L = \sum_{j=1}^{k} \sum_{x_i \in S_j} \|x_i - \mu_j\|_2^2$$

where $S_j$ refers to the points with cluster center $\mu_j$. Suppose stochastic gradient descent
with a learning rate of η is used. Derive the update rule for the parameter $\mu_1$ for a given
data point $x_p$. NOTE: $x_p$ may or may not be a sample in $S_1$.
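A sketch of the derivation: for a single point $x_p$, only the term for its assigned cluster contributes to the per-sample loss, so

$$\frac{\partial L_p}{\partial \mu_1} = \begin{cases} -2\,(x_p - \mu_1) & \text{if } x_p \in S_1 \\ 0 & \text{otherwise} \end{cases} \qquad\Rightarrow\qquad \mu_1 \leftarrow \mu_1 + 2\eta\,(x_p - \mu_1) \;\text{ if } x_p \in S_1,$$

and $\mu_1$ is left unchanged when $x_p \notin S_1$.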
Figure 2:
Consider the following dataset (each row is a data sample; each sample has two dimensions):

$$X = \begin{pmatrix} 6 & -4 \\ -3 & 5 \\ -2 & 6 \\ 7 & -3 \end{pmatrix}$$
Suppose PCA is used to determine the principal components.
16. (4 points) What are the unit vectors in the directions corresponding to the principal
components? HINT: There might be a faster way to guess the vectors than computing
the covariance matrix.
17. (5 points) What is the sum of the eigenvalues corresponding to the principal components?
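A quick numerical check (a sketch; NumPy's np.cov uses the 1/(n − 1) convention, and eigenvector signs may flip):

```python
import numpy as np

X = np.array([[6, -4], [-3, 5], [-2, 6], [7, -3]], dtype=float)
C = np.cov(X, rowvar=False)       # 2x2 sample covariance of the data
vals, vecs = np.linalg.eigh(C)
print(vecs)                       # columns ~ (1, 1)/sqrt(2) and (1, -1)/sqrt(2), up to sign
print(vals.sum(), np.trace(C))    # sum of eigenvalues equals the trace
```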
18. (5 points) Figure 3 shows the truth table for a NAND logic gate. Implement the
NAND function via a neural network architecture with a single neuron and an appropriate
choice of weights, bias and activation function. (One valid choice is sketched below.)
Figure 3:
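A minimal sketch with one valid parameter choice, w = (−1, −1), b = 1.5, and a step activation; the function name is hypothetical:

```python
import numpy as np

def nand_neuron(x1: int, x2: int) -> int:
    """Single neuron: step(w . x + b) with w = (-1, -1), b = 1.5."""
    w, b = np.array([-1.0, -1.0]), 1.5
    return int(w @ np.array([x1, x2]) + b > 0)

for a in (0, 1):
    for c in (0, 1):
        print(a, c, nand_neuron(a, c))   # outputs 1, 1, 1, 0 -- the NAND truth table
```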
In the lecture on SVM, we saw that one could use a mapping function $\phi: \mathbb{R}^n \to \mathbb{R}^d$
to transform points from the original $\mathbb{R}^n$ space to another space $\mathbb{R}^d$. We also saw that
one could define a kernel function $K(x, z)$ such that $K(x, z) = \phi(x)^T \phi(z)$. Suppose α
is a positive real constant. Suppose $\phi_1: \mathbb{R}^n \to \mathbb{R}^d$ and $\phi_2: \mathbb{R}^n \to \mathbb{R}^d$ are feature
mappings of $K_1$ and $K_2$ respectively. In terms of $\phi_1, \phi_2$:
19. (3 points) Write the formula for the feature mapping φ3 corresponding to K(x, z) =
αK1 (x, z)
20. (4 points) Write the formula for the feature mapping φ3 corresponding to K(x, z) =
K1 (x, z)K2 (x, z)
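For reference, the standard constructions (sketched; the index notation is one common convention): for $K = \alpha K_1$, take $\phi_3(x) = \sqrt{\alpha}\,\phi_1(x)$, since $\phi_3(x)^T \phi_3(z) = \alpha\, \phi_1(x)^T \phi_1(z)$. For $K = K_1 K_2$, take $\phi_3(x) \in \mathbb{R}^{d^2}$ with components $\phi_3(x)_{(i,j)} = \phi_1(x)_i\, \phi_2(x)_j$, so that

$$\phi_3(x)^T \phi_3(z) = \sum_{i,j} \phi_1(x)_i \phi_2(x)_j\, \phi_1(z)_i \phi_2(z)_j = \left(\phi_1(x)^T \phi_1(z)\right)\left(\phi_2(x)^T \phi_2(z)\right).$$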
SMAI Spring-2019 Quiz-3
Full marks: 30
Time: 50 Mins
1.
2. Solution - a, b, d
a - Trivially true.
b - The Gaussian kernel can be written as a Taylor expansion and viewed as a basis
expansion of infinite dimension, which gives it the ability to model any
separating hyperplane.
d - The more support vectors there are, the higher the chance that the classifier is
overfit.
3. The decision boundary passes through the origin, so b = 0. Since w = a1·x1 − a2·x2 and
x1 is a support vector, w · x1 = 1. Substituting for w: w · x1 = (a1·x1 − a2·x2) · x1 =
a1(x1 · x1) = a1, because x1 and x2 are orthogonal and x1 is a unit vector. Hence a1 = 1,
and the same reasoning yields a2 = 1. Therefore w = x1 − x2 = [1, −1] and b = 0.
4. The figure shows a decision boundary that is overfit to the training set, so we would
like to increase the bias / lower the variance of the SVM. Hence σ² (the width of the
Gaussian kernel) should be increased.
5.