
30412 MACHINE LEARNING April 29, 2023

Lectures 14 to 21
Problem Set 3

Instructor: Andrea Celli∗

∗ Computing Sciences Department, Bocconi University. [email protected]. Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

1 Multiple Choice Questions


1. Convolutional neural networks generally have fewer free parameters than fully connected neural networks.
☐ true
☐ false
[ac: True]
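As a rough illustration (the layer sizes below are assumptions, not from the problem set), compare the parameter count of a fully connected layer with that of a convolutional layer on the same input:

```python
# Illustrative parameter counts (sizes are assumptions, not from the problem set).
H, W = 32, 32  # a 32x32 single-channel input image

# Fully connected layer with 100 hidden units: every pixel gets its own
# weight for every unit.
fc_params = (H * W) * 100 + 100        # weights + biases -> 102,500

# Convolutional layer with 100 filters of size 3x3: the same 3x3 kernel is
# reused at every spatial location (weight sharing).
conv_params = (3 * 3 * 1) * 100 + 100  # weights + biases -> 1,000

print(fc_params, conv_params)
```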
2. Weight sharing allows CNNs to deal with image data without using too many parameters. However, weight sharing increases the variance of a model.
☐ true
☐ false [ac: false, it increases bias]
3. Your friend performed PCA on some data and claimed that they retained at least 95% of the variance using k principal components. This is equivalent to $\sum_{i \geq k+1} \lambda_i / \sum_{i} \lambda_i \leq 0.05$, where $\lambda_1, \lambda_2, \dots$ are the eigenvalues associated with each principal component, sorted in non-increasing order.
☐ true [ac: true!]
☐ false
4. Neural networks:
☐ Optimize a convex objective function [ac: False]
☐ Can only be trained with stochastic gradient descent [ac: False]
☐ Can use a mix of different activation functions [ac: True]
☐ Can be made to perform well even when the number of parameters/weights is much greater than the number of data points. [ac: True]
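As a small illustration of the third statement (sizes and names below are my own), a single network can mix different activation functions across its layers, e.g. ReLU in the hidden layer and a sigmoid at the output:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, b1, W2, b2):
    # Hidden layer with a ReLU activation...
    h = np.maximum(0.0, W1 @ x + b1)
    # ...followed by an output layer with a sigmoid activation:
    # different activation functions coexist in one network.
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))

# Tiny random weights for a 4 -> 8 -> 1 network.
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)
print(forward(rng.normal(size=4), W1, b1, W2, b2))
```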
5. Consider the value
$$L_k = \inf_{\hat{\mu}_1,\dots,\hat{\mu}_k \in \mathbb{R}^d,\; z_{i,j}} \; \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{k} z_{i,j}\,\lVert x_i - \hat{\mu}_j\rVert_2^2, \qquad (1)$$

what extra conditions on $z_{i,j}$ do we need to obtain the correct objective for k-means clustering?
☐ $z_{i,j} \in \{0, 1\}$, $\sum_{j=1}^{k} z_{i,j} = 1$ [ac: this one]
☐ $z_{i,j} \in [0, 1]$, $\sum_{j=1}^{k} z_{i,j} = 1$
☐ $z_{i,j} \in [0, 1]$, $\sum_{j=1}^{k} z_{i,j} = 0$
6. Consider the loss in Equation (1) with the appropriate conditions on $z_{i,j}$ specified in the previous question. The statements below are all true except one. Which of the following statements is false?
☐ For $2 \leq k < N$ and with the initial means randomly chosen as k data points, the k-means algorithm with k clusters is not guaranteed to reach the optimal $L_k$ loss value.
☐ For $2 \leq k < N$, $L_k$ is computationally hard to compute.
☐ For $k \geq N$, $L_k = 0$.
☐ The sequence $(L_k)_{1 \leq k \leq N}$ is not necessarily strictly decreasing. [ac: this one is false. Assuming all the data points are distinct, we can always strictly improve the loss by assigning $\hat{\mu}_{k+1} = x_i$ where $x_i \neq \hat{\mu}_j$ for all $1 \leq j \leq k$.]
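A small numerical illustration of the answers to questions 5 and 6 (the toy data, function, and variable names below are my own, not from the notes): with the hard-assignment constraint $z_{i,j} \in \{0,1\}$, $\sum_j z_{i,j} = 1$, the infimum over z in Equation (1) assigns each point to its nearest mean, and placing an extra mean exactly on a data point that coincides with no current mean strictly lowers that value, which is why the sequence $(L_k)_k$ strictly decreases for distinct data points.

```python
import numpy as np

def kmeans_loss(X, centers):
    """Value of Equation (1) for fixed means: with z_{i,j} in {0,1} and
    sum_j z_{i,j} = 1, the optimal z assigns each point to its nearest mean."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, k) squared distances
    return d2.min(axis=1).mean()

# Toy data set of distinct points and k = 2 means.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
centers_k = np.array([[0.3, 0.3], [4.0, 4.0]])

# Add a (k+1)-th mean placed exactly on a data point that differs from all
# current means: that point's distance term drops to zero, so the loss
# strictly decreases (and the optimal L_{k+1} can only be smaller still).
centers_k_plus_1 = np.vstack([centers_k, X[1]])
print(kmeans_loss(X, centers_k), kmeans_loss(X, centers_k_plus_1))
```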
7. You are given the data in $\mathbb{R}^2$ illustrated in the following figure, which you want to cluster into an inner ring and an outer ring (hence a number of clusters k = 2). Which of the following statement(s) is/are correct?
(a) There exists some initialization such that k-means clustering succeeds.
(b) There exists an appropriate feature expansion such that k-means (with standard initialization) succeeds.
(c) There exists an appropriate feature expansion such that the Expectation Maximization algorithm (with standard initialization) for a Gaussian Mixture Model succeeds.

☐ Only a and c
☐ Only a
☐ All of them
☐ Only b and c [ac: this one]
☐ None of them
☐ Only a and b
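A sketch of option (b) on synthetic rings (the data, the seed, and the particular feature map are my own choices, not taken from the figure): representing each point by its distance from the origin makes the two rings linearly separable for k-means, whereas on raw coordinates the algorithm splits the plane with a straight boundary and mixes them.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def ring(radius, n):
    # Noisy ring of n points centred at the origin.
    theta = rng.uniform(0, 2 * np.pi, n)
    r = radius + rng.normal(0, 0.05, n)
    return np.c_[r * np.cos(theta), r * np.sin(theta)]

X = np.vstack([ring(1.0, 200), ring(3.0, 200)])

# On raw coordinates, k-means cuts the plane with a straight boundary and mixes the rings.
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Feature expansion: describe each point by its radius ||x||. The two rings sit
# around 1 and 3 on this axis, so k-means separates them regardless of initialization.
radius = np.linalg.norm(X, axis=1, keepdims=True)
labels_expanded = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(radius)
```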

2 Open Questions
1. Describe the main ideas behind autoencoders.
2. Illustrate the AdaBoost algorithm, explain its purpose, and describe in which cases it is useful.
3. State whether the following statements about the Principal Component Analysis (PCA) procedure are true or false. Motivate your answers.

(a) The set of Principal Component vectors provides an orthonormal basis for the original feature space.
(b) Using the projections of the original features onto the principal components provided by PCA as features for regression/classification problems reduces overfitting.
(c) The percentage of the variance explained by a Principal Component is inversely proportional to
the value of the corresponding eigenvalue.
(d) PCA might get stuck into local optima, thus trying multiple random initializations might help.
(e) Even if all the input features have similar scales, we should still perform mean normalization (so
that each feature has zero mean) before running PCA.
[ac:
(a) TRUE: PCA looks for the direction in the dataset with the most variance and extracts the (first) Principal Component (PC) as the unit vector identifying that direction. It then iteratively finds the direction with the most variance that is orthogonal to the previous ones and extracts the next PC. The process iterates over a number of PCs equal to the number of dimensions of the dataset, so the PCs form an orthonormal basis for the original feature space.
(b) FALSE: only by selecting the first K PCs does one have a chance to remove noise from the data and avoid overfitting. If we keep all the PCs, we obtain a linear transformation of the original dataset, which is likely to perform as well as the original features on the supervised task.
(c) FALSE: the variance explained by each PC is directly proportional to the corresponding eigenvalue.
(d) FALSE: there is no source of stochasticity in the process of performing PCA, so multiple initializations would not produce different results.
(e) TRUE: the procedure assumes that the points are centered at the origin (have zero mean). If one of the components has a mean far from zero, we might bias the direction of the first eigenvector towards that dimension.]
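To tie the points above together, here is a minimal PCA sketch (function and variable names are my own, not from the notes): mean-centering as in (e), a deterministic eigendecomposition with no random initialization as in (d), an orthonormal set of directions as in (a), and explained variance directly proportional to each eigenvalue as in (c).

```python
import numpy as np

def pca(X, k):
    """Minimal PCA via the eigendecomposition of the covariance matrix."""
    X_centered = X - X.mean(axis=0)            # (e) mean normalization
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # (d) deterministic, no random initialization
    order = np.argsort(eigvals)[::-1]          # eigenvalues in non-increasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()        # (c) proportional to each eigenvalue
    # (a) the columns of eigvecs form an orthonormal basis of the feature space;
    # (b) projecting onto only the first k columns gives the reduced representation.
    return X_centered @ eigvecs[:, :k], explained
```

The cumulative sum of `explained` also gives the retained-variance check from multiple-choice question 3: k components retain at least 95% of the variance exactly when `explained[:k].sum() >= 0.95`.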
