ML Interview Questions and Answers
September 6, 2023
Contents

5 Math
  5.1 Algebra and (little) calculus
    5.1.1 Vectors
    5.1.2 Matrices
    5.1.3 Dimensionality reduction
    5.1.4 Calculus and convex optimization
  5.2 Probability and statistics
    5.2.1 Probability
    5.2.2 Stats

6 Computer Science
  6.1 Algorithms
  6.2 Complexity and numerical analysis
  6.3 Data Structures
This document contains my solutions to “Introduction to Machine Learning Interviews” by Chip Huyen. The
section and question numbering has been structured to align with the original book by Chip Huyen in order to
maintain consistency.
The LaTeX source files for this booklet are open source, and available at:
github.com/starzmustdie/ml-interview-questions-and-answers.
You can find the compiled PDF document in the GitHub repository or on Google Drive.
Please note: not all questions have answers, and there’s a possibility that my responses may not be entirely
accurate. Therefore, it’s advisable to consider the information provided cautiously.
• linkedin: linkedin.com/in/zafir-stojanovski
If you found this guide helpful, please consider supporting me by buying me a coffee. :)
Figure 1: Prompt: “Alan Turing, preparing for his upcoming Machine Learning Interview, cinematic, analog
film”. Source: Clipdrop by stability.ai.
5 Math
5.1 Algebra and (little) calculus
5.1.1 Vectors
1. Dot product
i. What’s the geometric interpretation of the dot product of two vectors?
Answer: Multiplication of the length of one vector and the length of the projection of the other
vector onto the first one.
ii. Given a vector u, find vector v of unit length such that the dot product of u and v is maximum.
Answer:
v = u / ∥u∥
2. Outer product
i. Given two vectors a = [3, 2, 1] and b = [−1, 0, 1], calculate the outer product a^T b.
Answer:
a^T b = [ −3  0  3 ]
        [ −2  0  2 ]
        [ −1  0  1 ]
ii. Give an example of how the outer product can be useful in ML.
Answer: The covariance matrix is a commonly used quantity in Machine Learning algorithms
(e.g. PCA). Given a dataset X ∈ R^{n×d} with n samples and d features, we calculate the (empirical)
covariance as follows:
Cov[X] = (1/n) Σ_{i=1}^n (xi − x̄)(xi − x̄)^T
where x̄ is the mean feature vector: x̄ = (1/n) Σ_{i=1}^n xi. Note that each term (xi − x̄)(xi − x̄)^T
is exactly an outer product of the centered sample with itself.
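To make the connection to the outer product concrete, here is a minimal NumPy sketch (the data and variable names are illustrative) that computes the empirical covariance as an average of outer products of centered samples and checks it against np.cov:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))             # n = 100 samples, d = 3 features
    x_bar = X.mean(axis=0)                    # mean feature vector

    # Empirical covariance as an average of outer products (x_i - x_bar)(x_i - x_bar)^T.
    cov = sum(np.outer(x - x_bar, x - x_bar) for x in X) / len(X)

    # np.cov divides by (n - 1) by default, so pass bias=True to match the 1/n formula above.
    assert np.allclose(cov, np.cov(X, rowvar=False, bias=True))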
3. What does it mean for two vectors to be linearly independent?
Answer: Two vectors a, b ∈ R^n are linearly independent iff neither is a scalar multiple of the other, i.e. a ≠ c · b and b ≠ c · a for all c ∈ R.
4. Given two sets of vectors A = {a1 , a2 , . . . , an } and B = {b1 , b2 , . . . , bm } how do you check if they share
the same basis?
Answer: We should check if every vector in B can be written as a linear combination of vectors in A.
More precisely, for all bi ∈ B it should hold that:
∥ · ∥ is a semi-norm if only N.1 - N.3 are satisfied. In general, we define the p-norm as follows:
∥x∥_p := ( Σ_{i=1}^d |xi|^p )^{1/p}
ii. How do norm and metric differ? Given a norm, make a metric. Given a metric, can we make a norm?
Answer: A metric is a function d : X × X → R satisfying:
M.1. Non-negativity: d(x, y) > 0 if x ≠ y, and d(x, y) = 0 if x = y
M.2. Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y)
M.3. Symmetry: d(x, y) = d(y, x)
A norm can always induce a metric:
d(x, y) := ∥x − y∥
The converse does not hold in general: a metric need not be induced by any norm (for example, the discrete metric), since a norm additionally requires homogeneity and translation invariance of the induced distance.
5.1.2 Matrices
1. Why do we say that matrices are linear transformations?
Answer: They are called linear transformations because they preserve vector addition and scalar multi-
plication:
A(x + y) = Ax + Ay
A(c · x) = c · (Ax)
2. What’s the inverse of a matrix? Do all matrices have an inverse? Is the inverse of a matrix always unique?
Answer: A matrix A ∈ Rn×n is called invertible if there exists a matrix B ∈ Rn×n s.t. AB = BA = I.
The inverse of a matrix is unique. A square matrix that does not have an inverse is called singular.
A square matrix is singular iff its determinant is 0. Furthermore, non-square matrices do not have an
inverse.
5. A 4 × 4 matrix has four eigenvalues 3, 3, 2, −1. What can we say about the trace and the determinant of
this matrix?
Answer: The trace is the sum of the eigenvalues: 3 + 3 + 2 − 1 = 7.
The determinant is the product of the eigenvalues: 3 · 3 · 2 · (−1) = −18.
6. Given the following matrix:
[  1  4  −2 ]
[ −1  3   2 ]
[  3  5  −6 ]
Without explicitly using the equation for calculating determinants, what can we say about this matrix's
determinant? (Hint: rely on a property of this matrix to determine its determinant)
Answer: The third column is −2 times the first column, so the columns are linearly dependent. A matrix with linearly dependent columns is singular, hence its determinant is 0.
7. What’s the difference between the Covariance matrix AT A and the Gram matrix AAT ?
Answer: Suppose A ∈ Rn×d , corresponding to n samples with each having d features. Then, the
Covariance matrix AT A ∈ Rd×d captures the ”distance” between features, whereas the Gram matrix
AAT ∈ Rn×n captures the ”distance” between samples.
8. Given A ∈ Rn×m and b ∈ Rn :
i. Find x such that Ax = b.
Answer: If A is square and has full rank, then: x = A−1 b.
ii. When does this have a unique solution?
Answer: In a set of linear simultaneous equations, a unique solution exists if and only if:
• the number of unknowns and the number of equations are equal
• all equations are consistent
• there is no linear dependence between any two or more equations, that is, all equations are
independent.
iii. Why is it when A has more columns than rows, Ax = b has multiple solutions?
Answer: Each unknown can be seen as an available degree of freedom. Each equation introduced
into the system can be viewed as a constraint that restricts one degree of freedom.
Therefore, the critical case occurs when the number of equations and the number of free variables
are equal – for every variable giving a degree of freedom, there exists a corresponding constraint
removing a degree of freedom.
The underdetermined case, by contrast, occurs when the system has been underconstrained – that
is, when the unknowns outnumber the equations.
iv. Given a matrix A with no inverse. How would you solve the equation Ax = b ? What is the
pseudoinverse and how to calculate it?
Answer: For an arbitrary matrix A ∈ Rm×n , its pseudoinverse A# ∈ Rn×m is a matrix that satisfies
the following criteria:
• AA# A = A
• A# AA# = A#
• (AA# )T = AA#
• (A# A)T = A# A
Back to our system of linear equations, regardless of whether A is square and regardless of the rank
of A, all solutions (if any exist) can be obtained using the pseudoinverse:
x = A# b + (I − A# A)w
where w is a vector of free parameters that ranges over all possible n × 1 vectors. A necessary and
sufficient condition for any solution(s) to exist is that the potential solution obtained using w = 0
satisfy Ax = b, that is: AA# b = b (substitute x = A# b in Ax = b). If this condition does not
hold, the equation system is inconsistent and has no solution. If the condition holds, the system is
consistent and at least one solution exists. For example, in the above-mentioned case in which A is
square and of full rank, A# simply equals A^{−1} and the general solution equation simplifies to
x = A^{−1} b, as previously stated, where w has completely dropped out of the solution, leaving only a
single solution. In other cases, w remains, and hence an infinitude of potential values of the free
parameter vector w gives an infinitude of solutions of the equation.
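A minimal NumPy sketch of this procedure (the matrix and right-hand side below are illustrative; the third column of A is the sum of the first two, so A is singular):

    import numpy as np

    A = np.array([[1., 2., 3.],
                  [4., 5., 9.],
                  [7., 8., 15.]])             # rank-deficient: col3 = col1 + col2
    b = np.array([6., 18., 30.])              # lies in the column space of A

    A_pinv = np.linalg.pinv(A)                # Moore-Penrose pseudoinverse A#
    x0 = A_pinv @ b                           # particular (minimum-norm) solution, i.e. w = 0

    # Consistency check AA#b = b, and verification that x0 indeed solves Ax = b.
    print(np.allclose(A @ A_pinv @ b, b), np.allclose(A @ x0, b))   # True True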
9. Derivative is the backbone of gradient descent.
i. What does the derivative represent?
Answer: The derivative of a function measures the sensitivity to change in the function output with
respect to a change in the input.
Moreover, when it exists, the derivative at a given point is the slope of the tangent line to the graph
of the function at that point. The tangent line is the best linear approximation of the function at
that input value. This is the reason why in gradient descent we (slowly) move in the (negative)
direction of the derivative.
ii. What’s the difference between derivative, gradient, and Jacobian?
Answer:
• When f : R → R, we compute the derivative df/dx.
• When f : R^n → R, we compute the gradient:
∇_x f = [ ∂f/∂x1  ∂f/∂x2  . . .  ∂f/∂xn ]
• When f : R^n → R^m, we compute the Jacobian, the m × n matrix of all first-order partial derivatives, with entries Jac(f)_{ij} = ∂f_i/∂x_j.
10. Say we have the weights w ∈ R^{d×m} and a mini-batch x of n elements, each element of shape 1 × d,
so that x ∈ R^{n×d}. We have the output y = f(x; w) = xw. What's the dimension of the Jacobian ∂y/∂x?
Answer: First, notice that y ∈ Rn×m . With that said, Jacx (f ) ∈ R(n×m)×(n×d) , or equivalently Jacx (f ) ∈
R(n·m)×(n·d) , given that we have reshaped the 4-dim tensor into a 2-dim tensor, i.e. a matrix.
11. Given a very large symmetric matrix A that doesn’t fit into memory, say A ∈ R1M ×1M and a function f
that can quickly compute f (x) = Ax for x ∈ R1M . Find the unit vector x so that xT Ax is minimal. (Hint:
Can you frame it as an optimization problem and use gradient descent to find an approximate solution?).
Answer: Since this is a constrained optimization problem, we can turn it into an unconstrained one by
using a Lagrange multiplier:
L(x, λ) = x^T Ax + λ(x^T x − 1)
The critical points of Lagrangians occur at saddle points, rather than at local minima. Unfortunately,
methods such as gradient descent which are designed to find local minima (or maxima) cannot be applied
directly. For this reason, we must either use an optimization technique that finds stationary points that are not
necessarily extrema (such as Newton's method without an extremum-seeking line search), or we modify
the formulation to ensure that it's a minimization problem.
For the purpose of our problem, let us use the latter solution. First, let us compute the gradients:
∂L/∂x = 2Ax + 2λx ∈ R^n
∂L/∂λ = x^T x − 1 ∈ R
Next, we stack them into a single column vector:
∇_{x,λ} L = [ ∂L/∂x ; ∂L/∂λ ] ∈ R^{n+1}
We then minimize g(x, λ) := ∥∇_{x,λ} L∥². At the optimal x∗, λ∗ we have ∥∇_{x,λ} L∥² ≈ 0, meaning that all partial
derivatives are approximately 0, so x∗, λ∗ form a stationary point of the original Lagrangian L(x, λ).
Unlike the critical points of L, however, x∗ and λ∗ occur at local minima of g (instead of at saddle points),
so numerical optimization techniques (such as gradient descent) can be used to find them.
(Source: Wikipedia)
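As a simpler concrete variant of the idea above, here is a minimal sketch (names, step size, and iteration count are illustrative) that runs projected gradient descent on f(x) = x^T A x over the unit sphere, using only the black-box product x ↦ Ax; it is not the exact ∥∇L∥² reformulation described above, but it attacks the same constrained problem with plain gradient steps:

    import numpy as np

    def min_unit_quadratic(matvec, n, lr=0.05, steps=5000, seed=0):
        """Approximately minimize x^T A x over unit vectors, given only x -> Ax."""
        rng = np.random.default_rng(seed)
        x = rng.normal(size=n)
        x /= np.linalg.norm(x)
        for _ in range(steps):
            x = x - lr * 2.0 * matvec(x)      # gradient step on x^T A x
            x /= np.linalg.norm(x)            # project back onto the unit sphere
        return x

    # Sanity check on a small matrix; in the real setting matvec would stream A from disk.
    A = np.diag([5.0, 2.0, -1.0, 3.0])
    x_hat = min_unit_quadratic(lambda v: A @ v, n=4)
    print(x_hat @ A @ x_hat)                  # approx. -1, the smallest eigenvalue of A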
5.1.3 Dimensionality reduction
1. Why do we need dimensionality reduction?
Answer: Curse of dimensionality – When the dimensionality increases, the volume of the space
increases so fast that the available data become sparse. In order to obtain a reliable result, the amount
of data needed often grows exponentially with the dimensionality. Also, organizing and searching data
often relies on detecting areas where objects form groups with similar properties; in high dimensional
data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data
organization strategies from being efficient.
2. Eigendecomposition is a common factorization technique used for dimensionality reduction. Is the eigen-
decomposition of a matrix always unique?
Answer: The decomposition is not always unique. Suppose A ∈ R2×2 has two equal eigenvalues λ1 =
λ2 = λ, with corresponding eigenvectors u1 , u2 . Then:
Au1 = λ1 u1 = λu1
Au2 = λ2 u2 = λu2
Notice that we can permute the matrix of eigenvectors (thus obtaining a different factorization):
A [u2 u1] = [u2 u1] [ λ  0 ]
                    [ 0  λ ]
i.e. Au2 = λu2 and Au1 = λu1 still hold.
The singular value decomposition (SVD) of a matrix A ∈ R^{m×n} of rank r is:
A = U Σ V^T
where U ∈ R^{m×m} is an orthogonal matrix of left singular vectors, V ∈ R^{n×n} is an orthogonal matrix
of right singular vectors, and Σ ∈ R^{m×n} is a "diagonal" matrix of singular values s.t. exactly r of the
values σi := Σii are non-zero. By construction:
• The left singular vectors of A are the eigenvectors of AA^T. From the Spectral Theorem, the
eigenvectors (and thus the left singular vectors) are orthonormal.
• The right singular vectors of A are the eigenvectors of A^T A. From the Spectral Theorem, the
eigenvectors (and thus the right singular vectors) are orthonormal.
• If λ is an eigenvalue of AA^T (or A^T A), then √λ is a singular value of A. From the positive-
semidefiniteness of AA^T (or A^T A), the eigenvalues (and thus the singular values) are non-negative.
ii. What’s the relationship between PCA and SVD?
Answer: Suppose we have data X ∈ Rn×d with n samples and d features. Moreover, assume that
the data has been centered s.t. the mean of each feature is 0. Then, we can perform PCA in two
main ways:
• First we compute the covariance matrix C = (1/(n−1)) X^T X ∈ R^{d×d}, and perform eigendecomposition:
C = V L V^T, with eigenvalues on the diagonal of L ∈ R^{d×d}, and eigenvectors as the columns
of V ∈ R^{d×d}. Then, we stack the k eigenvectors of V corresponding to the top k eigenvalues into
a matrix Ṽ ∈ R^{d×k}. Finally, we obtain the component values as follows: X̃ = X Ṽ ∈ R^{n×k}.
• Alternatively, instead of first computing the covariance matrix and then performing eigendecomposition,
notice that given the above formulation, we can directly compute the SVD of the data
matrix X, thus obtaining X = U Σ V^T. By construction, the right singular vectors in V are the
eigenvectors of X^T X. Similarly, we stack the k right singular vectors corresponding to the top
k singular values into a matrix Ṽ ∈ R^{d×k}. Finally, we obtain the component values as follows:
X̃ = X Ṽ ∈ R^{n×k}.
Even though SVD is slower, it is often considered to be the preferred method because of its higher
numerical accuracy.
(See more here)
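A minimal NumPy sketch of the two equivalent routes (random data for illustration); the principal component scores agree up to sign flips:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X = X - X.mean(axis=0)                    # center the data
    k = 2

    # Route 1: eigendecomposition of the covariance matrix.
    C = X.T @ X / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues returned in ascending order
    V_eig = eigvecs[:, ::-1][:, :k]           # top-k eigenvectors
    scores_eig = X @ V_eig

    # Route 2: SVD of the centered data matrix.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores_svd = X @ Vt[:k].T

    print(np.allclose(np.abs(scores_eig), np.abs(scores_svd)))   # True (equal up to sign)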
6. How does t-SNE (T-distributed Stochastic Neighbor Embedding) work? Why do we need it?
Answer: t-SNE is a statistical method for visualizing high-dimensional data by giving each datapoint a
location in a two or three-dimensional map. Specifically, it models each high-dimensional object by a two-
or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar
objects are modeled by distant points with high probability.
First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that
similar objects are assigned a higher probability while dissimilar points are assigned a lower probability.
In particular, given a set of N high-dimensional objects x1 , . . . , xN , t-SNE computes:
p_{j|i} = exp(−∥xi − xj∥² / 2σi²) / Σ_{k≠i} exp(−∥xi − xk∥² / 2σi²)
and sets p_{i|i} = 0. Note that Σ_j p_{j|i} = 1 for all i. Then, it computes:
p_{ij} = (p_{j|i} + p_{i|j}) / (2N)
Note that p_{ij} = p_{ji}, p_{ii} = 0, and Σ_{i,j} p_{ij} = 1. The similarity of datapoint xj to datapoint xi is the conditional
probability p_{j|i} that xi would pick xj as its neighbor if neighbors were picked in proportion to their prob-
ability density under a Gaussian centered at xi.
Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map.
In particular, it aims to learn d-dimensional map y1 , . . . , yN (where d is chosen 2 or 3) that reflects the
similarities pij as well as possible. To this end, it measures similarities qij between two points yi and yj
using:
q_{ij} = (1 + ∥yi − yj∥²)^{−1} / Σ_k Σ_{l≠k} (1 + ∥yk − yl∥²)^{−1}
and sets q_{ii} = 0. Herein a heavy-tailed Student t-distribution (with one degree of freedom, which is the
same as a Cauchy distribution) is used to measure similarities between low-dimensional points in order to
allow dissimilar objects to be modeled far apart in the map.
Finally, the locations yi in the map are determined by minimizing the Kullback-Leibler divergence of
the distribution P from the distribution Q:
KL[P ∥ Q] = Σ_{i≠j} p_{ij} log( p_{ij} / q_{ij} )
The minimization of the KL divergence wrt the points yi is performed using gradient descent.
While t-SNE plots often seem to display clusters, the visual clusters can be influenced strongly by the cho-
sen parameterization and therefore a good understanding of the parameters for t-SNE is necessary. Such
”clusters” can be shown to even appear in non-clustered data, and thus may be false findings. Interactive
exploration may thus be necessary to choose parameters and validate results.
(Source: Wikipedia)
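For completeness, a minimal usage sketch with scikit-learn (assuming scikit-learn is installed; the data and the perplexity value are illustrative and, as discussed above, the parameters usually need tuning and interactive exploration):

    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 50))            # 500 points in 50 dimensions

    emb = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0).fit_transform(X)
    print(emb.shape)                          # (500, 2)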
f′(a) = lim_{h→0} ( f(a + h) − f(a) ) / h
exists. This implies that the function is continuous at a. Note that the converse does not hold: not every
continuous function is differentiable.
ii. Give an example of when a function doesn’t have a derivative at a point.
Answer: Any non-continuous function is not differentiable at its points of discontinuity. For example,
f(x) = x²       if x ≤ 0
     = 1 + x    otherwise
is not differentiable at x = 0.
iii. Give an example of non-differentiable functions that are frequently used in machine learning. How
do we do backpropagation if those functions aren’t differentiable?
Answer: Some non-differentiable functions commonly used in Machine Learning:
f(x) = |x|
ReLU(x) = x     if x ≥ 0
        = 0     otherwise
LeakyReLU(x) = x      if x ≥ 0
             = αx     otherwise
Each of these functions is not differentiable at x = 0. In theory, since any finite set of points is a set
of measure 0, we have p(x = 0) = 0 and can therefore disregard that these functions are not differentiable there.
In practice however, due to finite precision it can occur that x = 0. In these cases we can use a "faux"
derivative, usually by picking the left or the right derivative at the given point.
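A minimal NumPy sketch of this "faux" derivative convention for ReLU (here we pick the left derivative, 0, at exactly x = 0):

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def relu_grad(x):
        # Subgradient choice: 1 for x > 0, 0 for x <= 0 (including the kink at x = 0).
        return np.where(x > 0, 1.0, 0.0)

    x = np.array([-2.0, 0.0, 3.0])
    print(relu(x), relu_grad(x))              # [0. 0. 3.] [0. 0. 1.]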
2. Convexity
i. What does it mean for a function to be convex or concave? Draw it.
Answer: A function is called convex if the line segment between any two points on the graph of
the function lies above the graph between the two points. More precisely, the function f : X → R is
convex if and only if for all 0 ≤ t ≤ 1 and all x1, x2 ∈ X:
f(t x1 + (1 − t) x2) ≤ t f(x1) + (1 − t) f(x2)
Theorem (log sum inequality): Let a1, . . . , an and b1, . . . , bn be non-negative real numbers, and define a = Σ_{i=1}^n ai and
b = Σ_{i=1}^n bi. Then:
Σ_{i=1}^n ai log(ai / bi) ≥ a log(a / b)
(The proof applies Jensen's inequality to the convex function t ↦ t log t with the weights bi/b, which
satisfy bi/b ≥ 0 and Σ_{i=1}^n bi/b = 1.)
Theorem: The Kullback-Leibler divergence is convex in the pair of probability distributions (p, q),
i.e.
KL[λp1 + (1 − λ)p2 ∥ λq1 + (1 − λ)q2] ≤ λ KL[p1 ∥ q1] + (1 − λ) KL[p2 ∥ q2]
Now:
where in (3) we use the log sum inequality with:
a1 = λp1(x),  a2 = (1 − λ)p2(x)
b1 = λq1(x),  b2 = (1 − λ)q2(x)
Finally, going back to the original task of proving the convexity of H [P, Q] in Q, let us first decompose
the cross-entropy:
H [P, Q] = H [P ] + KL [P ∥Q]
Since H [P ] is constant wrt. Q, and we showed that KL [P ∥Q] is convex wrt. both P and Q, we can
deduce that H [P, Q] is convex wrt. Q.
(Source: The Book of Statistical Proofs)
3. Given a logistic discriminant classifier:
p(y = 1|x) = σ(w^T x)
where the sigmoid function is given by:
σ(z) = (1 + exp(−z))^{−1}
The logistic loss for training sample xi with class label yi is given by:
L(yi, xi; w) = −yi log p(yi|xi)
(Note: I believe in the book yi is missing from the loss L)
i. Show that p(y = −1|x) = σ(−w^T x).
Answer:
p(y = −1|x) = 1 − p(y = 1|x)
            = 1 − 1 / (1 + exp(−w^T x))
            = (1 + exp(−w^T x) − 1) / (1 + exp(−w^T x))
            = exp(−w^T x) / (1 + exp(−w^T x))
            = 1 / (exp(w^T x)(1 + exp(−w^T x)))
            = 1 / (exp(w^T x) + 1)
            = σ(−w^T x)
ii. Show that ∇w L(yi , xi ; w) = −yi (1 − p(yi |xi ))xi .
Answer: We have the following computation graph:
z = w^T x
σ(z) = 1 / (1 + exp(−z))
L(yi, xi; w) = −yi log σ(z)
Therefore, using the chain rule we can decompose the derivative as follows:
∂L/∂w = ∂L/∂σ · ∂σ/∂z · ∂z/∂w
Let's compute each term separately:
∂L/∂σ = −yi / σ(z)
∂σ(z)/∂z = ∂/∂z [ 1 / (1 + exp(−z)) ] = exp(−z) / (1 + exp(−z))²
         = (1 / (1 + exp(−z))) · ((exp(−z) + 1 − 1) / (1 + exp(−z)))
         = (1 / (1 + exp(−z))) · (1 − 1 / (1 + exp(−z)))
         = σ(z)(1 − σ(z))
∂z/∂w = ∂/∂w (w^T x) = x
Therefore:
∂L/∂w = ∂L/∂σ · ∂σ/∂z · ∂z/∂w = (−yi / σ(z)) · σ(z)(1 − σ(z)) · x = −yi (1 − σ(z)) x
4. Most ML algorithms we use nowadays use first-order derivatives (gradients) to construct the next training
iteration.
i. How can we use second-order derivatives for training models?
Answer: Given a twice differentiable function f : R → R we seek to solve the optimization problem:
min_{x∈R} f(x)
Newton’s method approaches this problem in an iterative fashion by constructing a sequence {xk }
that converges towards the minimizer x∗ of f. In particular, it performs second-order Taylor approx-
imation of f around xk :
f(xk + t) ≈ f(xk) + f′(xk) t + (1/2) f″(xk) t²
The next iterate x_{k+1} is defined so as to minimize this quadratic approximation in t, setting
x_{k+1} = xk + t. If the second derivative is positive, the quadratic approximation is a convex function
of t, and its minimum can be found by setting the derivative to zero. More precisely:
0 = d/dt [ f(xk) + f′(xk) t + (1/2) f″(xk) t² ] = f′(xk) + f″(xk) t
t = −f′(xk) / f″(xk)
Putting everything together, Newton's method performs the following iteration:
x_{k+1} = xk + t = xk − f′(xk) / f″(xk)
(Source: Wikipedia)
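A minimal sketch of this iteration on an illustrative one-dimensional function, f(x) = x² − log(x), whose minimizer is x* = 1/√2:

    import math

    def newton_minimize(f_prime, f_second, x0, steps=20):
        x = x0
        for _ in range(steps):
            x = x - f_prime(x) / f_second(x)  # x_{k+1} = x_k - f'(x_k) / f''(x_k)
        return x

    f_prime = lambda x: 2 * x - 1 / x         # derivatives of f(x) = x^2 - log(x)
    f_second = lambda x: 2 + 1 / x ** 2

    x_star = newton_minimize(f_prime, f_second, x0=2.0)
    print(x_star, 1 / math.sqrt(2))           # both approx. 0.7071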
ii. Pros and cons of second-order optimization.
Answer: The advantage of the Newton optimization method is that, in general, it converges faster
than pure gradient descent, since we perform a higher-order local approximation (second order as
opposed to first order), and therefore make a more informed choice about the next step in the descent.
The main disadvantage is cost: each step requires computing and inverting (or solving a linear system with)
the Hessian, which takes O(d²) memory and roughly O(d³) time for d parameters, and the method can be
unstable when the Hessian is not positive definite.
• If the Hessian is positive definite (equivalently, has all eigenvalues positive) at a, then f attains a
local minimum at a.
• If the Hessian is negative definite (equivalently, has all eigenvalues negative) at a, then f attains a
local maximum at a.
• If the Hessian has both positive and negative eigenvalues, then a is a saddle point for f .
In those cases not listed above, the test is inconclusive.
(Source: Wikipedia)
6. Jensen’s inequality forms the basis for many algorithms for probabilistic inference, including Expectation-
Maximization and variational inference. Explain what Jensen’s inequality is.
Answer: As stated before, for a given convex function f, we had the following property:
f(t x1 + (1 − t) x2) ≤ t f(x1) + (1 − t) f(x2)
Let us generalize this property. Again, suppose we have a convex function f, variables x1, . . . , xn ∈ I, and
non-negative real numbers α1, . . . , αn s.t. Σ_i αi = 1. Then, by induction we have:
f( Σ_i αi xi ) ≤ Σ_i αi f(xi)
Let's formalize it one step further. Consider a convex function f, a discrete random variable X with n
possible values x1, . . . , xn, and real non-negative values ai = p(X = xi). Then, we obtain the general form
of Jensen's inequality:
f(E[X]) ≤ E[f(X)]
(Source: Introduction to Probability, Statistics, and Random Processes)
7. Explain the chain rule.
Answer: The chain rule is a formula that expresses the derivative of the composition of two differentiable
functions f and g in terms of the derivatives of f and g. More precisely, if h = f ◦ g is the function such
that h(x) = f (g(x)) for every x, then the chain rule is:
∂h/∂x = ∂f/∂g · ∂g/∂x
(Source: Wikipedia)
8. Let x ∈ Rn , L = CrossEntropy(Sof tmax(x), y), in which y is a one-hot vector. Take the derivative of L
with respect to x.
Answer: First, let us expand the loss:
L = − Σ_{c=1}^C yc log(sc)
where C is the total number of classes, and sc is the c-th entry of the softmax output:
sc = exp(xc) / Σ_{k=1}^C exp(xk)
Now:
∂/∂xl log(sc) = ∂/∂xl ( xc − log( Σ_{k=1}^C exp(xk) ) )
              = ∂xc/∂xl − ∂/∂xl log( Σ_{k=1}^C exp(xk) )
              = 1{c=l} − exp(xl) / Σ_{k=1}^C exp(xk)
              = 1{c=l} − sl
Therefore:
∂L/∂xl = − Σ_{c=1}^C yc ∂log(sc)/∂xl
       = Σ_{c=1}^C yc (sl − 1{c=l})
       = sl Σ_{c=1}^C yc − Σ_{c=1}^C yc 1{c=l}
       = sl − yl
(using the fact that y is one-hot, so Σ_c yc = 1).
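A minimal finite-difference check of the result ∂L/∂x = softmax(x) − y (the logits and class index below are illustrative):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())               # shift for numerical stability
        return e / e.sum()

    def cross_entropy(x, y):
        return -np.sum(y * np.log(softmax(x)))

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)
    y = np.eye(5)[2]                          # one-hot label for class 2

    eps = 1e-6
    numeric = np.array([(cross_entropy(x + eps * e, y) - cross_entropy(x - eps * e, y)) / (2 * eps)
                        for e in np.eye(5)])
    print(np.allclose(numeric, softmax(x) - y, atol=1e-6))   # True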
Solving the system, we obtain the following stationary points (x, y, λ):
(0, 1, 1/2),  (0, −1, −1/2),  (3√7/8, −1/8, −4),  (−3√7/8, −1/8, −4)
By evaluating the function f that we wish to optimize at these stationary points, we get:
f(0, 1) = −1
f(0, −1) = 1
f(3√7/8, −1/8) = 65/16
f(−3√7/8, −1/8) = 65/16
Therefore, the constrained maximum is 65/16, and the constrained minimum is −1.
(Source: Wikipedia)
Therefore, the probability of interest is:
P(X = 0.5) = ∫_{0.5}^{0.5} p(x) dx = ∫_{0.5}^{0.5} 1/(1 − 0) dx = ∫_{0.5}^{0.5} 1 dx = 0
In fact, for any continuous random variable X, the probability of obtaining a specific value c is P(X =
c) = 0, since any single point is a set of measure 0. Distributions with point masses (e.g. ones involving a
Dirac delta) are an exception, but they are not continuous in this sense.
2. Can the values of PDF be greater than 1? If so, how do we interpret the PDF?
Answer: The values of the PDF can indeed be greater than 1. All that matters is that the PDF p(x)
integrates to 1, that is:
Z
p(x)dx = 1
R
Intuitively, you can imagine the PDF as being a ‘border’ to a fluid container of volume 1. If we squeeze it
from the sides, thereby reducing the volume in this area, we have to increase the volume (that is, extend
the border) in the middle in order to compensate for the loss of volume on the sides. Nevertheless, the
total volume of the container is still exactly 1.
3. What’s the difference between multivariate distribution and multi-modal distribution?
Answer: Multi-modal distribution refers to a distribution that has multiple modes (values that appear
‘significantly’ more often than others). On the other hand, a multivariate distribution refers to a distri-
bution of multiple variables.
(See more here)
5. It’s a common practice to assume an unknown variable to be of the normal distribution. Why is that?
Answer: The Central Limit Theorem (CLT) states that the distribution of the sum of a large number of
independent, identically distributed random variables is approximately normal, regardless of the underly-
ing distribution. Because so many things in the universe can be modeled as the sum of a large number of
independent random variables, the normal distribution pops up a lot.
(Source: yours truly, GPT-3)
6. How would you turn a probabilistic model into a deterministic model?
Answer: A deterministic mathematical model is meant to yield a single solution describing the outcome of
some ”experiment” given appropriate inputs. A probabilistic model is, instead, meant to give a distribution
of possible outcomes (i.e. it describes all outcomes and gives some measure of how likely each is to occur).
It should be noted that a probabilistic model can be quite useful even for a person who believes the entire
universe to be deterministic. This utility arises because even a deterministic process may have so many
variables that any model that attempts to account for them all is too cumbersome to work with. For
example, a coin toss might be deterministic if one could precisely measure everything about the flip, the
coin, the floor, the air currents, the tides, the precise location on earth, etc. In practice, this level of
deterministic modeling is impossible, so stochastic models are used instead.
Similarly, deterministic models can be used to great effect even for real-world processes that are clearly
stochastic. For example, the heat equation works great in many situations despite the fact that it ignores
the ”random” motion of the atoms involved. Usually, in these scenarios, the distribution of possible final
answers is so sharply peaked (i.e. has such a small variance) that there is no need to complicate the model
by forcing it to calculate the distribution rather than just a single value.
(Source: Quora)
7. Is it possible to transform non-normal variables into normal variables? How?
Answer: In statistics, a power transform is a family of functions applied to create a monotonic trans-
formation of data using power functions. It is a data transformation technique used to stabilize variance,
make the data more normal distribution-like, improve the validity of measures of association (such as the
Pearson correlation between variables), and for other data stabilization procedures.
One common example is the Box-Cox transformation, yi(λ) = (yi^λ − 1)/λ for λ ≠ 0 and yi(λ) = log(yi) for λ = 0,
where the parameter λ is estimated using the profile likelihood function and goodness-of-fit tests.
Moreover, this transformation only holds for yi > 0.
On the other hand, the Yeo-Johnson transformation allows also for zero and negative values of y. The
hyperparameter λ can be any real number, where λ = 1 produces the identity transformation. The
transformation law reads:
yi(λ) = ((yi + 1)^λ − 1)/λ                  if λ ≠ 0, yi ≥ 0
      = log(yi + 1)                          if λ = 0, yi ≥ 0
      = −((−yi + 1)^(2−λ) − 1)/(2 − λ)       if λ ≠ 2, yi < 0
      = −log(−yi + 1)                        if λ = 2, yi < 0
(Source: Wikipedia)
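A minimal sketch applying the Yeo-Johnson transform with SciPy (assuming SciPy is available; the skewed data below is illustrative), with λ estimated by maximum likelihood:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    y = rng.exponential(scale=2.0, size=1000) - 0.5   # skewed data, includes negative values

    y_t, lam = stats.yeojohnson(y)                     # lambda chosen by maximum likelihood
    print(lam, stats.skew(y), stats.skew(y_t))         # skewness shrinks towards 0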
The t-distribution is symmetric and bell-shaped, like the normal distribution. However, the t-distribution
has heavier tails, meaning that it allows for producing values that fall far from its mean.
The t-distribution plays a role in a number of widely used statistical analyses, including Student’s t-
test for assessing the statistical significance of the difference between two sample means, the construction
of confidence intervals for the difference between two population means, and in linear regression analysis.
Moreover, we previously saw that the Student t-distribution (with one degree of freedom) is used to
measure the similarities between low-dimensional points in t-SNE, in order to allow for dissimilar objects
to be modeled far apart in the map. Since it has heavier tails than the Gaussian distribution, it penalizes
these larger distances less vigorously.
(Source: Wikipedia)
9. Assume you manage an unreliable file storage system that crashed 5 times in the last year, each crash
happens independently.
i. What’s the probability that it will crash in the next month?
Answer: We will resort to the Poisson distribution, which is concerned with expressing the proba-
bility of a given number of events occurring in a fixed time interval, given that these events happen
with a known constant mean rate λ, and independently of the time since the last event. For this
problem, the expected number of events is λ = 5 crashes per year.
Because the events occur at a constant rate, we can look at what happens in multiples of the fixed
time interval. In particular, since we are interested in the number of crashes per month, we can
construct a new Poisson random variable X with a mean rate of λm = 5/12 crashes per month. With
that said, we now resort to calculating the probability that there is going to be at least one crash in
the following month:
P(X ≥ 1) = 1 − P(X = 0) = 1 − λm^0 e^{−λm} / 0! = 1 − e^{−5/12} ≈ 1 − 0.6592 = 0.3408
ii. What’s the probability that it will crash at any given moment?
Answer: Even though time is continuous, let us attempt to discretize it into individual moments (timestamps).
If we split the year into t moments, the per-moment rate is λt = 5/t → 0 as t → ∞. However, since the Poisson distribution
is only defined for λ ∈ (0, ∞), i.e. λ > 0, we cannot apply it to this problem formulation.
10. Say you built a classifier to predict the outcome of football matches. In the past, it’s made 10 wrong
predictions out of 100. Assume all predictions are made independently., what’s the probability that the
next 20 predictions are all correct?
Answer: [not sure]Assuming binomial distribution, we can calculate the Maximum Likelihood Estimate
for the probability of the classifier being right:
90
p̂ = = 0.9
100
Therefore, the probability that the classifier correctly predicts the next 20 out of 20 games is:
P(X = 20 | n = 20, p = 0.9) = C(20, 20) · (0.9)^20 · (0.1)^0 ≈ 0.1216
11. Given two random variables X and Y . We have the values P (X|Y ) and P (Y ) for all values of X and Y .
How would you calculate P (X)?
Answer:
P(X = x) = Σ_{y∈Y} P(X = x, Y = y) = Σ_{y∈Y} P(X = x | Y = y) P(Y = y)
12. You know that your colleague Jason has two children and one of them is a boy. What's the probability
that Jason has two sons? (Hint: it's not 1/2)
Answer: Define the following variables:
b = 1 if there is a boy in the family, 0 if there are no boys;  g = 1 if there is a girl in the family, 0 if there are no girls.
Given that the 4 possibilities are {(B, B), (B, G), (G, B), (G, G)}, we obtain the following probabilities:
p(b = 1) = 3/4,  p(b = 0) = 1/4,  p(g = 1) = 3/4,  p(g = 0) = 1/4
With that said, we are interested in the probability that there are no girls (or equivalently, both children
are boys), given that one of the children is a boy. More precisely:
p(g = 0 | b = 1) = p(b = 1 | g = 0) p(g = 0) / p(b = 1)
                 = (1 · 1/4) / (3/4)
                 = 1/3
(Read more here)
13. There are only two electronic chip manufacturers: A and B, both manufacture the same amount of chips.
A makes defective chips with probability of 30%, while B makes defective chips with a probability of 70%.
i. If you randomly pick a chip from the store, what is the probability that it is defective?
Answer: Define the following variables:
c = 1 if the chip is functional, 0 if the chip is defective;  m = A if the manufacturer is A, B if the manufacturer is B.
From the problem statement we have:
p(m = A) = p(m = B) = 0.5
p(c = 0|m = A) = 0.3
p(c = 0|m = B) = 0.7
Now, let us compute the probability that a randomly chosen chip is defective:
p(c = 0) = p(c = 0|m = A)p(m = A) + p(c = 0|m = B)p(m = B) = 0.3 · 0.5 + 0.7 · 0.5 = 0.5
ii. Suppose you now get two chips coming from the same company, but you don’t know which one.
When you test the first chip, it appears to be functioning. What is the probability that the second
electronic chip is also good?
Answer: Let us denote the two chips with random variables c1 and c2. Then, we are interested in
the probability:
p(c2 = 1 | c1 = 1)
= p(c2 = 1, c1 = 1) / p(c1 = 1)
= [p(c2 = 1, c1 = 1 | m = A) p(m = A) + p(c2 = 1, c1 = 1 | m = B) p(m = B)] / [p(c1 = 1 | m = A) p(m = A) + p(c1 = 1 | m = B) p(m = B)]
= [p(c2 = 1 | m = A) p(c1 = 1 | m = A) p(m = A) + p(c2 = 1 | m = B) p(c1 = 1 | m = B) p(m = B)] / [p(c1 = 1 | m = A) p(m = A) + p(c1 = 1 | m = B) p(m = B)]    (conditional independence given m)
= (0.7 · 0.7 · 0.5 + 0.3 · 0.3 · 0.5) / (0.7 · 0.5 + 0.3 · 0.5)
= 0.58
14. There’s a rare disease that only 1 in 10000 people get. Scientists have developed a test to diagnose the
disease with the false positive rate and the false negative rate of 1%.
i. Given a person is diagnosed positive, what’s the probability that this person actually has the disease?
Answer:
Define the following variables:
d = 1 if the patient has the disease, 0 otherwise;  t = 1 if the test is positive, 0 if the test is negative.
From the problem statement, we have:
p(d = 1) = 0.0001
p(t = 1|d = 0) = 0.01
p(t = 0|d = 1) = 0.01
Now, using Bayes’ theorem, we can update our prior belief for the presence of the disease given that
the test came back positive:
p(d = 1 | t = 1) = p(t = 1 | d = 1) p(d = 1) / p(t = 1)
= p(t = 1 | d = 1) p(d = 1) / [p(t = 1 | d = 1) p(d = 1) + p(t = 1 | d = 0) p(d = 0)]
= (1 − p(t = 0 | d = 1)) p(d = 1) / [(1 − p(t = 0 | d = 1)) p(d = 1) + p(t = 1 | d = 0)(1 − p(d = 1))]
= ((1 − 0.01) · 0.0001) / ((1 − 0.01) · 0.0001 + 0.01 · (1 − 0.0001))
≈ 0.0098
ii. What’s the probability that a person has the disease if two independent tests both come back positive?
Answer: Let us denote the outcomes of the two tests as t1 and t2 . Similarly as before, by reusing
some of the quantities we have already computed, we obtain:
p(d = 1 | t1 = 1, t2 = 1) = p(t1 = 1, t2 = 1 | d = 1) p(d = 1) / p(t1 = 1, t2 = 1)
= p(t1 = 1, t2 = 1 | d = 1) p(d = 1) / [p(t1 = 1, t2 = 1 | d = 1) p(d = 1) + p(t1 = 1, t2 = 1 | d = 0) p(d = 0)]
= p(t1 = 1 | d = 1) p(t2 = 1 | d = 1) p(d = 1) / [p(t1 = 1 | d = 1) p(t2 = 1 | d = 1) p(d = 1) + p(t1 = 1 | d = 0) p(t2 = 1 | d = 0) p(d = 0)]    (conditional independence given d)
= ((1 − 0.01)² · 0.0001) / ((1 − 0.01)² · 0.0001 + 0.01² · 0.9999)
≈ 0.495
Notice the sudden increase in belief when we have two positive tests compared to only one.
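A minimal sketch reproducing both posterior calculations above from the problem's rates:

    prior = 0.0001                  # p(d = 1)
    fpr, fnr = 0.01, 0.01           # p(t = 1 | d = 0) and p(t = 0 | d = 1)
    sens = 1 - fnr                  # p(t = 1 | d = 1)

    def posterior_after_positives(k):
        """p(d = 1 | k independent positive tests), via Bayes' theorem."""
        num = (sens ** k) * prior
        den = num + (fpr ** k) * (1 - prior)
        return num / den

    print(posterior_after_positives(1), posterior_after_positives(2))   # ~0.0098, ~0.495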
15. A dating site allows users to select 10 out of 50 adjectives to describe themselves. Two users are said to
match if they share at least 5 adjectives. If Jack and Jin randomly pick adjectives, what is the probability
that they match?
Answer: Without loss of generality, suppose that we know exactly which 10 of the 50 adjectives Jin picks.
With that said, we apply the naive definition of probability, with the numerator being the total number
of ways that Jack can match with Jin on at least 5 of those 10 adjectives, and the denominator being the number of
possible adjective combinations that Jack can pick:
Σ_{k=5}^{10} C(10, k) · C(40, 10 − k) / C(50, 10)
(Source: Quora)
16. Consider a person A whose sex we don't know. We know that for the general human height, there
are two distributions: the height of males follows hm = N(µm, σm²) and the height of females follows
hf = N(µf, σf²). Derive a probability density function to describe A's height.
Answer: We will resort to constructing a Gaussian Mixture Model (GMM). Under the assumption that
the probability of picking a male φm is equal to the probability of picking a female φf, that is φm = φf = 1/2,
we obtain:
h(x) = φm · N(x; µm, σm²) + φf · N(x; µf, σf²)
     = (1/2) · (1/√(2πσm²)) exp(−(x − µm)²/(2σm²)) + (1/2) · (1/√(2πσf²)) exp(−(x − µf)²/(2σf²))
17. There are three weather apps, each with a probability of being wrong 1/3 of the time. What's the probability
that it will be foggy in San Francisco tomorrow if all three apps predict that it's going to be foggy in San
Francisco tomorrow, and during this time of the year San Francisco is foggy 50% of the time? (Hint:
you'd need to consider both the cases where all the apps are independent and where they are dependent.)
Answer: Assuming conditional independence, we can resort to Bayes' rule to estimate the probability of fog.
Define the following variables:
w = 1 if the weather is foggy, 0 otherwise;  ai = 1 if the i-th app says it will be foggy, 0 otherwise, for i ∈ {1, 2, 3}.
p(w = 1 | a1 = 1, a2 = 1, a3 = 1)
= p(a1 = 1, a2 = 1, a3 = 1 | w = 1) p(w = 1) / p(a1 = 1, a2 = 1, a3 = 1)
= p(a1 = 1, a2 = 1, a3 = 1 | w = 1) p(w = 1) / [p(a1 = 1, a2 = 1, a3 = 1 | w = 1) p(w = 1) + p(a1 = 1, a2 = 1, a3 = 1 | w = 0) p(w = 0)]
= Π_i p(ai = 1 | w = 1) p(w = 1) / [Π_i p(ai = 1 | w = 1) p(w = 1) + Π_i p(ai = 1 | w = 0) p(w = 0)]
= (2/3)³ · (1/2) / [ (2/3)³ · (1/2) + (1/3)³ · (1/2) ]
= 8/9 ≈ 0.89
Not sure about the case when the apps are dependent.
(Read more here)
18. Given n samples from a uniform (discrete) distribution [0, d]. How do you estimate d? (Also known as
the German tank problem)
Answer: Suppose we have a sample of serial numbers x1 , . . . , xn drawn without replacement from U [0, d].
Based on this sample of size n, we would like to obtain an (unbiased) estimate of the parameter d.
Given the sample, a first approach would be to build the estimator m = max_i xi. However, on average
we would always underestimate the true value d, since xi ≤ d for all xi ∼ U[0, d]. In order to build an
unbiased estimator d̂, it would have to hold that E[d̂] = d, which is clearly not the case here. Therefore,
we need another approach.
Even though we are trying to estimate d, suppose for a moment we know its true value. Then, let M
denote the random variable that expresses the maximum of a given sample, i.e. M = maxi {X1 , . . . , Xn }
(since the sample is random, the variable M is also random). Then, the probability that the random
variable M obtains a specific value m, given that we know the true value d is:
P(M = m | d) = C(m − 1, n − 1) / C(d, n),   for n ≤ m ≤ d
i.e. by setting the sample maximum to m and producing a sample of size n without replacement, we
calculate the ratio of the number of ways of picking the remaining n − 1 from the valid m − 1 values
(remember we already picked one, and it’s the sample maximum m, so the remaining Xi have to be
≤ m − 1), divided by the total number of ways of picking any n samples from the d possible values.
Since the only possible values for m are in the interval [n, d], by the law of total probability we have:
1 = Σ_{m=n}^d P(M = m | d) = Σ_{m=n}^d C(m − 1, n − 1) / C(d, n) = (1 / C(d, n)) Σ_{m=n}^d C(m − 1, n − 1)
For reasons that will become apparent later, we can also express the identity in the following manner:
C(d + 1, n + 1) = Σ_{m=n+1}^{d+1} C(m − 1, n) = Σ_{m=n}^{d} C(m, n)
Now, since we are interested in building an unbiased estimator, let us look at the expectation of the
random variable M, conditioned on the true value d:
E[M | d] = Σ_{m=n}^d m P(M = m | d)
         = Σ_{m=n}^d m · C(m − 1, n − 1) / C(d, n)
         = (1 / C(d, n)) Σ_{m=n}^d m · (m − 1)! / ((n − 1)!(m − n)!)
         = (n / C(d, n)) Σ_{m=n}^d m! / (n!(m − n)!)
         = (n / C(d, n)) Σ_{m=n}^d C(m, n)
         = n · C(d + 1, n + 1) / C(d, n)      (identity above)
         = n · [(d + 1)! / ((n + 1)!(d − n)!)] · [n!(d − n)! / d!]
After canceling out the terms, we obtain:
E[M | d] = n (d + 1) / (n + 1)
Finally, we can arrive at an unbiased estimator d̂ by replacing E[M | d] in the expression above with
our observation m ≡ max_i xi, and then solving for d:
m = n (d + 1) / (n + 1)
⟹ d̂ = m + m/n − 1
Let us quickly check that it is indeed an unbiased estimator:
Bias(d̂; d) = E[d̂] − d
           = E[m + m/n − 1] − d
           = E[m] + (1/n) E[m] − 1 − d
           = ((n + 1)/n) · E[m] − 1 − d
           = ((n + 1)/n) · (n/(n + 1)) (d + 1) − 1 − d
           = d + 1 − 1 − d
           = 0
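A minimal simulation check of the estimator d̂ = m + m/n − 1 (d, n, and the number of trials are illustrative; samples are drawn without replacement from {1, . . . , d}):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, trials = 1000, 10, 20000

    estimates = []
    for _ in range(trials):
        sample = rng.choice(np.arange(1, d + 1), size=n, replace=False)
        m = sample.max()
        estimates.append(m + m / n - 1)       # the estimator derived above

    print(np.mean(estimates))                 # close to the true d = 1000 (unbiasedness)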
Letting Y ≡ min{n ∈ N |Xn > 0.5} be the first day where we draw a value greater than 0.5, this random
variable has a geometric distribution Y ∼ Geom(θ). Therefore, the expected number of days until we
draw a value greater than 0.5 is:
E[Y] = 1/θ ≈ 3.24 days
(Source: StackExchange)
20. You’re part of a class. How big the class has to be for the probability of at least a person sharing the
same birthday with you is greater than 50%?
Answer: From a permutations perspective, let A be the event of finding a group of k people without any
repeated birthdays, and let B be the event of finding a group of k people with at least one repeated birthday,
so that P(B) = 1 − P(A).
Now, we can calculate P(A) as the ratio of the number of ways to assign birthdays without repetition where
order matters (e.g. for a group of 2 people, we consider the options {{05.12, 12.05}, {01.03, 06.08}, . . .})
divided by the total number of ways to assign birthdays with repetition where order matters. More precisely:
P(A) = ( 365! / (365 − k)! ) / 365^k
⟹ P(B) = 1 − P(A) = 1 − ( 365! / (365 − k)! ) / 365^k
The smallest group of k people for which P(B) > 0.5 is k = 23. Note, however, that this is the classic
birthday-problem answer, where any two people may share a birthday. For the question as stated (at least
one person sharing a birthday with you specifically), the probability is 1 − (364/365)^k, which first exceeds
0.5 at k = 253.
(Source: Wikipedia)
21. You decide to fly to Vegas for a weekend. You pick a table that doesn’t have a bet limit, and for each
game, you have the probability p of winning, which doubles your bet, and 1 − p of losing your bet. Assume
that you have unlimited money (e.g. you bought Bitcoin when it was 10 cents), is there a betting strategy
that has a guaranteed positive payout, regardless of the value of p?
Answer: Yes, in this idealized setting: the martingale strategy. Bet 1 unit; whenever you lose, double your
previous bet, and stop as soon as you win. If the first win comes on game k, you have lost
1 + 2 + . . . + 2^{k−2} = 2^{k−1} − 1 units and then gain 2^{k−1} units, for a net payout of +1. Since p > 0,
you eventually win with probability 1, so the payout is guaranteed to be positive. Note that this relies crucially
on the unlimited bankroll and the absence of a bet limit.
22. Given a fair coin, what’s the number of flips you have to do to get two consecutive heads?
Answer: Suppose X is the number of coin flips that you need in order to get two heads in a row. We are
interested in the quantity E [X].
We can condition E [X] on whatever our first flip is. Let E [X|H] denote the number of remaining flips
that are needed in order to get two heads in a row, conditioned on the fact that we have already rolled a
head. We define E [X|T ] analogously.
Now, notice that E[X|T] = E[X], since if we flipped tails on the first try, we haven't made any progress
towards flipping two heads in a row, so we have to start over. Conditioning on the first one or two flips:
with probability 1/2 the first flip is tails and we have used 1 flip before starting over; with probability 1/4
the first two flips are heads then tails and we have used 2 flips before starting over; and with probability 1/4
the first two flips are both heads and we are done in 2 flips. Therefore
E[X] = (1/2)(1 + E[X]) + (1/4)(2 + E[X]) + (1/4)(2), which solves to E[X] = 6.
23. In national health research in the US, the results show that the top 3 cities with the lowest rate of kidney
failure are cities with populations under 5000. Doctors originally thought that there must be something
special about small town diets, but when they looked at the top 3 cities with the highest rate of kidney
failure, they are also very small cities. What might be a probabilistic explanation for this phenomenon?
Answer: Hasty generalization (also known as the Law of Small Numbers) is an informal fallacy of faulty
generalization, which involves reaching an inductive generalization based on insufficient evidence – essen-
tially making a rushed conclusion without considering all of the variables or enough evidence.
In statistics, it may involve basing broad conclusions regarding a statistical survey from a small sam-
ple group that fails to sufficiently represent an entire population.
(Source: Wikipedia)
24. Derive the maximum likelihood estimator of an exponential distribution.
Answer: The probability density function of the exponential distribution with rate λ is given as follows:
p(x; λ) = λ e^{−λx}    if x ≥ 0
        = 0             if x < 0
Given a sample x1, . . . , xn, the likelihood is:
L(x1, . . . , xn; λ) = Π_{i=1}^n λ exp(−λ xi) = λ^n exp( −λ Σ_{i=1}^n xi )
Taking the derivative of the log-likelihood wrt. λ and setting it to 0, we obtain our estimator λ̂:
∂ log L / ∂λ = n/λ − Σ_{i=1}^n xi = 0
n/λ = Σ_{i=1}^n xi
⟹ λ̂ = n / Σ_{i=1}^n xi
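A minimal simulation check of the estimator λ̂ = n / Σ xi (the true rate and the sample size below are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    true_lambda = 2.5
    x = rng.exponential(scale=1 / true_lambda, size=10_000)   # scale = 1 / rate

    lambda_hat = len(x) / x.sum()             # equivalently 1 / x.mean()
    print(lambda_hat)                          # close to the true rate 2.5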
5.2.2 Stats
1. Explain frequentist vs. Bayesian statistics.
Answer: The frequentist approach views probabilities as objective quantities, and calculates them as rel-
ative frequencies. The model parameters are fixed unknown constants, and it cannot make probabilistic
statements about them. The goal is to use the sample data to build point estimates of the parameters
(potentially with standard error).
On the other hand, the Bayesian approach views probabilities as subjective quantities representing
degrees of belief. Moreover, the model parameters are random variables that cannot be determined exactly,
but are expressed with uncertainty. The goal is to build a posterior distribution of the parameters, given
the data at hand.
2. Given the array x = [1, 5, 3, 2, 4, 4], find its mean, median, variance, and standard deviation.
Answer:
mean(x) = (1/n) Σ_{i=1}^n xi = (1/6)(1 + 5 + 3 + 2 + 4 + 4) ≈ 3.17
med(x): sort x and take the middle element if n is odd, or the average of the two middle elements if n is even:
med(x) = (3 + 4)/2 = 3.5
var(x) = (1/n) Σ_{i=1}^n (xi − mean(x))²
       = (1/6) [ (1 − 3.17)² + (5 − 3.17)² + (3 − 3.17)² + (2 − 3.17)² + (4 − 3.17)² + (4 − 3.17)² ] ≈ 1.81
std(x) = √var(x) = √1.81 ≈ 1.34
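A quick NumPy check of these numbers (note that np.var and np.std use the 1/n convention by default, matching the formulas above):

    import numpy as np

    x = np.array([1, 5, 3, 2, 4, 4])
    print(x.mean(), np.median(x), x.var(), x.std())   # 3.1667, 3.5, 1.8056, 1.3437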
3. When should we use median instead of mean? When should we use mean instead of median?
Answer: Almost all analytic calculations on sets of data are more natural and easier to work with in
terms of the mean than the median. This is because the mean can be analytically expressed as a sum of
terms (multiplied by a constant), whereas the median doesn’t have a clear analytic form. Examples of
usages: test of significance, measuring bias, estimating convergence,...
The real use of the median comes when the data set may contain extreme outliers, as it is a more
robust measure than the mean.
(Source: StackExchange)
4. What is a moment of function? Explain the meanings of the zeroth to fourth moments.
Answer: Moments describe how the probability mass of a random variable is distributed. In particular,
they quantify a distribution’s location, spread, and shape. The mathematical concept is closely related to
the concept of moment in physics.
Let X be a random variable. Then its kth raw moment is defined as follows:
µ_k = E[X^k] = ∫_{−∞}^{∞} x^k p(x) dx
If the random variable X has mean µx, then its kth central moment is:
m_k = E[(X − µx)^k] = ∫_{−∞}^{∞} (x − µx)^k p(x) dx
Central moments are useful because they allow us to quantify properties of distributions in ways that are
location-invariant. For example, we may be interested in comparing the variability in height of adults
versus children. Obviously, adults are taller than children on average, but we want to measure which
group has greater variability while disregarding the absolute heights of people in each group.
If the random variable X has standard deviation σx, then its kth standardized moment is:
m̄_k = E[ ((X − µx)/σx)^k ] = ∫_{−∞}^{∞} ((x − µx)/σx)^k p(x) dx
Standardization makes the moment both location- and scale-invariant. Why might we care about scale
invariance? As we will see, the third, fourth, and higher standardized moments quantify the relative and
absolute tailedness of distributions. In such cases, we do not care about how spread out a distribution is,
but rather how the mass is distributed along the tails.
Since x^0 = 1 for any number x, the zeroth raw, central, and standardized moments are all 1:
µ_0 = m_0 = m̄_0 = ∫_{−∞}^{∞} (. . .)^0 p(x) dx = ∫_{−∞}^{∞} p(x) dx = 1
The zeroth moment captures the fact that probability distributions are normalized quantities, and they
always sum to one regardless of their location, scale, or shape.
The first moment tells us how far away from the origin the center of mass is. The first central and stan-
dardized moments are less interesting because they are always zeros.
The second central moment increases quadratically as mass gets further away from the distribution’s mean.
In other words, variance captures how spread out a distribution is – points that are further away from the
mean than others are penalized disproportionally. More often we calculate the second central rather than
the second raw moment, since we are interested in comparing each distribution’s relative spread while
ignoring its location – otherwise, distributions that have further-away locations can have larger second
raw moments, compared to ones that are closer to the origin.
The third standardized moment, called skewness, measures the relative size of the two tails of a given
distribution:
m̄_3 = E[ ((X − µx)/σx)^3 ] = ∫_{−∞}^{∞} ((x − µx)/σx)^3 p(x) dx
To see how skewness quantifies the relative size of the two tails, consider this: any data point less than a
standard deviation from the mean (i.e. data near the center) results in a standard score less than 1; which
is then raised to the third power, making the absolute value of the cubed standard score even smaller. In
other words, data points less than a standard deviation from the mean contribute very little to the final
calculation of skewness. Since the cubic function preserves sign, if both tails are balanced, the skewness
is zero. Otherwise, the skewness is positive for longer right tails and negative for longer left tails.
The fourth standardized moment, called kurtosis, measures the combined size of the tails relative to
the whole distribution:
m̄_4 = E[ ((X − µx)/σx)^4 ] = ∫_{−∞}^{∞} ((x − µx)/σx)^4 p(x) dx
Unlike skewness’s cubic term which preserves sign, kurtosis’s even power means that the metric is al-
ways positive and that long tails on either side dominate the calculation. Just as we saw with skewness,
kurtosis’s fourth power means that standard scores less than 1—again, data near the peak of the distri-
bution—only marginally contribute to the total calculation. In other words, kurtosis measures tailedness,
not peakedness.
What about moments of higher order? The short answer is, for k ≥ 5:
• if k is odd, then this standardized moment essentially captures nearly the same information as the
skewness (since it preserves sign), with the only difference being the magnitude with which we penalize
outliers far away from the center.
• Similarly, if k is even, then this standardized moment captures nearly the same information as
kurtosis, since it disregards the sign of the terms.
(Source: this awesome blog from Gregory Gundersen)
5. Are independence and zero covariance the same? Give a counterexample if not.
Answer: Independence implies zero covariance. But the opposite does not hold, i.e. zero covariance does not imply independence. Suppose X ∼ N(0, σ²),
and Y = X². Then:
E[X³] = E[(−X)³] = E[−X³] = −E[X³]  ⟹  E[X³] = 0
Hence Cov[X, Y] = E[XY] − E[X] E[Y] = E[X³] − 0 · E[X²] = 0, even though Y = X² is a deterministic
function of X, so X and Y are clearly not independent.
6. Suppose that you take 100 random newborn puppies and determine that the average weight is 1 pound
with the population standard deviation of 0.12 pounds. Assuming the weight of newborn puppies follows
a normal distribution, calculate the 95% confidence interval for the average weight of all newborn puppies.
Answer: Let X1 , . . . , Xn be a sample s.t. E [Xi ] = µ and Var [Xi ] = σ 2 . Suppose we are interested in
building an estimator for the mean µ from the sample as follows:
X̄n = (1/n) Σ_{i=1}^n Xi
Then, the expected value of this estimator is:
E[X̄n] = E[ (1/n) Σ_{i=1}^n Xi ] = (1/n) Σ_{i=1}^n E[Xi] = (1/n) Σ_{i=1}^n µ = (1/n) · n · µ = µ
And its variance is:
Var[X̄n] = Var[ (1/n) Σ_{i=1}^n Xi ] = (1/n²) Σ_{i=1}^n Var[Xi] = (1/n²) Σ_{i=1}^n σ² = (1/n²) · n · σ² = σ²/n
Moreover, due to the Central Limit Theorem (CLT), the estimator X̄n has approximately normal distribution.
Therefore, the following derived random variable is standard normal:
(X̄n − E[X̄n]) / √Var[X̄n] = (X̄n − µ) / (σ/√n) ∼ N(0, 1)
Coming back to our example, if we do not have the true population standard deviation σ, we plug in a
proxy, the sample standard deviation s; as a rule of thumb, as long as the number of samples n is larger
than 50, we can make this substitution. With n = 100, X̄ = 1 and standard deviation 0.12, the 95%
confidence interval is X̄ ± 1.96 · 0.12/√100 = 1 ± 0.0235, i.e. approximately [0.976, 1.024] pounds.
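A minimal sketch of this interval with SciPy (assuming SciPy is available), using the numbers from the problem:

    import numpy as np
    from scipy import stats

    n, mean, sd = 100, 1.0, 0.12
    z = stats.norm.ppf(0.975)                 # approx. 1.96 for a 95% interval
    half_width = z * sd / np.sqrt(n)
    print(mean - half_width, mean + half_width)   # approx. [0.976, 1.024]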
7. Suppose that we examine 100 newborn puppies and the 95% confidence interval for their average weight
is [0.9, 1.1] pounds. Which of the following statements is true?
i. Given a random newborn puppy, its weight has a 95% chance of being between 0.9 and 1.1 pounds.
ii. If we examine another 100 newborn puppies, their mean has a 95% chance of being in that interval.
iii. We’re 95% confident that this interval captured the true mean weight.
Answer: From our solution to the previous problem, it is obvious that the correct answer is iii.
8. Suppose we have a random variable X supported on [0, 1] from which we can draw samples. How can we
come up with an unbiased estimate of the median of X?
Answer: An answer to this question is provided here.
9. Can correlation be greater than 1? Why or why not? How to interpret a correlation value of 0.3?
Answer: In its base form, the Cauchy-Schwarz inequality states that for all vectors u and v of an inner
product space, it is true that:
|⟨u, v⟩|² ≤ ⟨u, u⟩ · ⟨v, v⟩
Furthermore, we can extend this rule to random variables, by defining an inner product as the expectation
of the product of the random variables:
⟨X, Y⟩ := E[XY]
Using this idea, let us provide a bound for the covariance of two variables:
|Cov[X, Y]|² = |E[(X − E[X])(Y − E[Y])]|²
             = ⟨X − E[X], Y − E[Y]⟩²
             ≤ ⟨X − E[X], X − E[X]⟩ · ⟨Y − E[Y], Y − E[Y]⟩      (Cauchy-Schwarz)
             = E[(X − E[X])²] · E[(Y − E[Y])²]
Therefore, we obtain:
|Cov[X, Y]| ≤ √Var[X] · √Var[Y]
Dividing both sides by √Var[X] · √Var[Y] shows that the correlation ρ = Cov[X, Y] / (√Var[X] √Var[Y])
always lies in [−1, 1], so it cannot be greater than 1. A correlation value of 0.3 indicates a relatively weak
positive linear relationship between the two variables.
ii. How much does your newborn puppy have to weigh to be in the top 10% in terms of weight?
Answer: From the z-table we obtain:
Φ(Z = 1.29) ≈ 0.9
Therefore, for values greater than
X = Z · σ + µ = 1.29 · 0.12 + 1 = 1.15
we observe puppies in the top 10% in terms of weight.
iii. Suppose the weight of newborn puppies followed a skew distribution. Would it still make sense to
calculate z-scores?
Answer: Z-score wouldn’t be a meaningful quantity for non-symmetric distributions.
In the case of the Normal distribution, a z-score of +1 corresponds to a percentile of about 0.841, or
equivalently, 84.1% of the samples from the distribution are below one standard deviation above
the mean. Likewise, a z-score of 0 is the point where half the data is below and half the data is above.
Now, consider a binary random variable that can take only the values 1 or 1000, and addition-
ally comes from a skewed distribution s.t. p(X = 1) = 0.9999, p(X = 1000) = 0.0001. Then, the
mean of this random variable is:
µx = E [X] = 0.9999 · 1 + 0.0001 · 1000 = 1.0999
Now, the z-scores for 1, the mean, and 1000 are:
Z(X = 1) = (1 − µx)/σ = (1 − 1.0999)/σ = −0.0999/σ < 0
Z(X = 1.0999) = (1.0999 − µx)/σ = (1.0999 − 1.0999)/σ = 0
Z(X = 1000) = (1000 − µx)/σ = (1000 − 1.0999)/σ = 998.9/σ > 0
But in this case, it does not hold anymore that the variable whose z-score is 0 (i.e. the mean) is the
50-th percentile: 99.99% of the variables are on the left (since p(X = 1) = 0.9999), and only 0.01%
are on the right of Z = 0 (since p(X = 1000) = 0.0001). Therefore, this score does not hold the same
meaning as the one for a symmetric distribution.
(Source: Quora)
11. Tossing a coin 15 times resulted in 10 heads and 5 tails. How would you analyze whether a coin is fair?
Answer: Let us denote with p the probability of getting head. Then, we define the following two
hypotheses:
H0: p = 1/2 (the coin is fair)
Ha: p ≠ 1/2 (the coin is not fair)
Now, we calculate the probability of the event we observed (10 heads and 5 tails) and events more extreme
than it (11 heads and 4 tails, 12 heads and 3 tails. . . ). Since we are performing two-sided test, we also
consider the events in the ”opposite” direction of extremeness (5 heads and 10 tails, 4 heads and 11 tails, 3
heads and 12 tails . . . ). Given that each of these events comes from the binomial distribution, we obtain:
p-value = 2 Σ_{k=10}^{15} C(15, k) (1/2)^15 ≈ 0.3018
We reject the null hypothesis if p-value < α, where α is a chosen significance level. In this case, at none of
the common significance levels: 90% (α = 0.1), 95% (α = 0.05), 99% (α = 0.01); we would have sufficient
evidence to reject the null hypothesis.
If we had instead performed a one-sided test, and appropriately only considered the event we observed (10 heads and 5 tails) and events more extreme
than it (11 heads and 4 tails, 12 heads and 3 tails. . . ) without the events in the "opposite" direction, we
would still not have sufficient evidence to reject the null hypothesis, since:
p-value = Σ_{k=10}^{15} C(15, k) (1/2)^15 ≈ 0.1509
=⇒ p-value > α ∀α ∈ {0.1, 0.05, 0.01}
(Source: StackExchange)
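The same two-sided p-value can be checked with SciPy's exact binomial test (assuming SciPy >= 1.7, which provides binomtest):

    from scipy import stats

    res = stats.binomtest(k=10, n=15, p=0.5, alternative="two-sided")
    print(res.pvalue)                          # approx. 0.30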
12. Statistical significance.
i. How do you assess the statistical significance of a pattern whether it is a meaningful pattern or just
by chance?
Answer: In statistical hypothesis testing, a result has statistical significance when it is very unlikely
to have occurred given the null hypothesis simply by chance alone. More precisely, a study’s defined
significance level, denoted by α, is the probability of the study rejecting the null hypothesis, given
that the null hypothesis is true (also known as the probability of Type I error); and the p-value of
a result is the probability of obtaining a result at least as extreme, given that the null hypothesis is
true. The result is statistically significant, by the standards of the study, when p-value ≤ α. The
significance level for a study is chosen before data collection, and is typically set to 5% or much lower
– depending on the field of study.
(Source: Wikipedia)
ii. What’s the distribution of p-values?
Answer: Let T denote your test statistic, which under the null hypothesis has the cumulative
distribution function (CDF) F (t) ≡ Pr(T < t) for all t. Assuming that F is invertible, we can derive
the distribution of the random p-value P = F (T ) (remember that the p-value is the probability of
obtaining a result at least as extreme as our test statistic) as follows:
Pr(P ≤ p) = Pr(F(T) ≤ p) = Pr(T ≤ F⁻¹(p)) = F(F⁻¹(p)) = p,   for p ∈ [0, 1]
By definition, this is the CDF of the uniform distribution on [0, 1]. Therefore, we can conclude that under
the null hypothesis, the p-values are distributed uniformly.
(Source: StackExchange)
iii. Recently, a lot of scientists started a war against statistical significance. What do we need to keep
in mind when using p-value and statistical significance?
Answer: The multiple testing problem occurs when one considers a set of statistical inferences si-
multaneously – the more inferences are made, the more likely erroneous inferences become.
Then, the family-wise error rate (FWER) is the probability that at least one Type I error occurs in
the family. Assuming m independent tests, each carried out at level α = 0.05:
FWER = 1 − (1 − α)^m
m = 1 ⟹ FWER = 0.05
m = 10 ⟹ FWER ≈ 0.40
m = 50 ⟹ FWER ≈ 0.92
In other words, at merely 50 simultaneous tests at level α = 0.05, we have a 92% probability of
rejecting the null hypothesis (and thus obtaining “statistically significant” results) when it is in fact
true.
One approach to alleviate this issue is to apply the Bonferroni correction: assume we run m tests,
and we want to achieve a FWER of at most α (e.g. α = 0.05); then, run each individual test with
level α_single := α/m. In this case, notice that (by the union bound):
FWER ≤ Σ_{i=1}^{m} α_single = m · (α/m) = α
In other words, no matter how large the family is, the FWER is bounded by a user-set constant. The
advantage of the Bonferroni correction is that it is simple and correct. However, its disadvantage is
that it is too conservative – the test rarely rejects the null hypothesis.
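A quick simulation illustrates both the inflated FWER without correction and the effect of the Bonferroni
correction (a hedged sketch; the choice of m = 50 one-sample t-tests, n = 30 observations per test and
10,000 repetitions is arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, n, reps, alpha = 50, 30, 10_000, 0.05

any_rejection = 0        # at least one rejection per family, uncorrected
any_rejection_bonf = 0   # at least one rejection per family, Bonferroni-corrected

for _ in range(reps):
    # m independent datasets generated under the null hypothesis (true mean is 0)
    data = rng.normal(loc=0.0, scale=1.0, size=(m, n))
    _, pvals = stats.ttest_1samp(data, popmean=0.0, axis=1)
    any_rejection += np.any(pvals < alpha)
    any_rejection_bonf += np.any(pvals < alpha / m)

print("empirical FWER (uncorrected):", any_rejection / reps)        # close to 0.92
print("empirical FWER (Bonferroni): ", any_rejection_bonf / reps)   # at most about 0.05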
13. Variable correlation.
i. What happens to a regression model if two of their supposedly independent variables are strongly
correlated?
Answer: Collinearity is a linear association between two input variables. More precisely, two vari-
ables X1 and X2 are perfectly collinear if there exist parameters λ0 and λ1 such that, for all obser-
vations i we have:
X2^(i) = λ0 + λ1 · X1^(i)
Multicollinearity refers to a situation in which more than two input variables in a multiple regression
model are highly linearly related. In that case, the closed-form least-squares solution
w̃ = (Xᵀ X)⁻¹ Xᵀ y
becomes ill-conditioned: Xᵀ X is (nearly) singular, so the estimated coefficients have very large variance
and become unstable and hard to interpret.
iii. How do we test for independence between two continuous variables?
Answer: Testing for independence between two continuous variables is in general a hard problem.
With that said, suppose we have two continuous random variables X and Y , with joint distribution
P(X,Y ) , and marginal distributions PX and PY . Then, the mutual information of these two variables
is calculated as:
I(X; Y) = ∫_Y ∫_X P_(X,Y)(x, y) · log( P_(X,Y)(x, y) / ( P_X(x) · P_Y(y) ) ) dx dy
If X and Y are independent, then P_(X,Y)(x, y) = P_X(x) P_Y(y), so the log term vanishes and
the mutual information is 0. A similar line of reasoning can be applied to show that a mutual information
of 0 implies that X and Y are independent.
(See more here)
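In practice one rarely evaluates this integral directly. A common shortcut is to use an off-the-shelf
estimator; below is a minimal sketch with scikit-learn's mutual_info_regression (a nearest-neighbour-based
estimator), on purely illustrative synthetic data:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 5_000

x = rng.normal(size=n)
y_dep = x**2 + 0.1 * rng.normal(size=n)   # nonlinearly dependent on x
y_ind = rng.normal(size=n)                # independent of x

# mutual_info_regression expects a 2D feature array
mi_dep = mutual_info_regression(x.reshape(-1, 1), y_dep)[0]
mi_ind = mutual_info_regression(x.reshape(-1, 1), y_ind)[0]

print(mi_dep)   # clearly greater than 0
print(mi_ind)   # close to 0

Note how the dependent case would be missed by Pearson correlation (which is roughly zero for y = x²),
while the mutual information is clearly positive.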
14. A/B testing is a method of comparing two versions of a solution against each other to determine which
one performs better. What are some of the pros and cons of A/B testing?
Answer: A/B testing is a shorthand for a simple randomized controlled experiment, in which two sam-
ples (A and B) of a single vector-variable are compared. These values are similar except for one variation
which might affect a user’s behavior. A/B tests are widely considered the simplest form of controlled
experiment. However, by adding more variants to the test, its complexity grows.
A/B tests are useful for understanding user engagement and satisfaction of online features like a new
feature or product. Large social media sites like LinkedIn, Facebook, and Instagram use A/B testing to
make user experiences more successful and as a way to streamline their services.
When conducting A/B testing, the user should evaluate the pros and cons of it to see if it aligns best with
the results that they’re hoping for.
• Pros: Through A/B testing, it’s easy to get a clear idea of what users prefer, since it’s directly testing
one thing over the other. It’s based on real user behavior so the data can be very helpful especially
when determining what works better between two options. In addition, it can also provide answers
to very specific design questions. One example of this is Google’s A/B testing with hyperlink colors.
In order to optimize revenue, they tested dozens of different hyperlink hues to see which color the
users tend to click more on.
• Cons: However, there are a couple of cons to A/B testing. As mentioned above, A/B testing is good
for specific design questions but it can also be a downside since it’s mostly only good for specific
design problems with very measurable outcomes. It could also be a very costly and timely process.
Depending on the size of the company and/or team, there could be a lot of meetings and discussions
about what exactly to test and what the impact of the A/B test is. If there’s not a significant impact,
it could end up as a waste of time and resources.
(Source: Wikipedia)
15. You want to test which of the two ad placements on your website is better. How many visitors and/or
how many times each ad is clicked do we need so that we can be 95% sure that one placement is better?
Answer: In A/B testing, we are often interested in testing if the treatment group is significantly different
from the control group in a certain success metric (e.g., conversion rate). The null hypothesis is that there
is no significant difference.
Type I error happens when we reject the null hypothesis when it should not be rejected. The proba-
bility of Type I error is known as the significance level, or α (commonly chosen as 0.05).
Type II error happens when we fail to reject the null hypothesis when it should be rejected. The proba-
bility of Type II error is also known as β.
Statistical power is the probability that the test rejects the null hypothesis when it should be rejected, or
in other words 1 − β. A common value for statistical power is 0.80 (meaning that β = 0.20).
In order to obtain meaningful results, we want our test to have sufficient statistical power. Since in-
creasing the sample size results in higher statistical power, but also in larger costs for the experiment,
we are interested in finding the minimum sample size required that guarantees the desired statistical power.
Now, let us tackle the problem of determining the required number of sessions in order to confirm with
95% confidence that one ad placement is better than the other. We decide to create 2 groups (whose size
we aim to figure out), and show each one a distinct ad.
If each session is a Bernoulli trial (the user clicks the ad or not), each group follows a binomial dis-
tribution. In this case, we aim to figure out the needed sample size to compare two binomial proportions
using a two-sided test with significance level α and power 1 − β, where the size of one sample n2 is k times
as large as the size of the other sample n1 . To test the hypothesis H0 : p1 = p2 vs. Ha : p1 ̸= p2 , the
following sample size is required:
n1 = [ √( p̄ q̄ (1 + 1/k) ) · z_{1−α/2} + √( p1 q1 + p2 q2 / k ) · z_{1−β} ]² / Δ²

n2 = k · n1

where Δ = |p2 − p1| is the minimum difference in proportions we want to detect, q1 = 1 − p1, q2 = 1 − p2,
p̄ = (p1 + k · p2)/(1 + k), q̄ = 1 − p̄, and z_q denotes the q-th quantile of the standard normal distribution.
(Source: TowardsDataScience)
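A small sketch of this calculation in Python; the baseline and treatment click-through rates p1 = 0.10 and
p2 = 0.12, the equal group sizes (k = 1), α = 0.05 and power 0.8 are all illustrative assumptions:

import numpy as np
from scipy.stats import norm

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80, k=1.0):
    """Approximate n1 (and n2 = k * n1) for a two-sided test of H0: p1 == p2."""
    q1, q2 = 1 - p1, 1 - p2
    p_bar = (p1 + k * p2) / (1 + k)
    q_bar = 1 - p_bar
    delta = abs(p2 - p1)
    z_alpha = norm.ppf(1 - alpha / 2)    # z_{1 - alpha/2}
    z_beta = norm.ppf(power)             # z_{1 - beta}
    n1 = (np.sqrt(p_bar * q_bar * (1 + 1 / k)) * z_alpha
          + np.sqrt(p1 * q1 + p2 * q2 / k) * z_beta) ** 2 / delta**2
    return int(np.ceil(n1)), int(np.ceil(k * n1))

n1, n2 = sample_size_two_proportions(p1=0.10, p2=0.12)
print(n1, n2)   # required number of sessions per ad placement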
16. Your company runs a social network whose revenue comes from showing ads in news-feed. To double
revenue, your coworker suggests that you should just double the number of ads shown. Is that a good
idea? How do you find out?
Answer: We could perform an A/B test, in which one group sees the standard amount of ads, and the
other group sees twice the amount.
With that said, minimum detectable effect (MDE) is a calculation that estimates the smallest improve-
ment we are willing to be able to detect. This metric is used to estimate how long an experiment will
take given the baseline conversion rate, the statistical significance and traffic allocation. Informally, the
smaller the MDE, the longer the experiment will need to last in order to reach statistical significance
at the particular sensitivity we are interested in.
17. Imagine that you have the prices of 10,000 stocks over the last 24 month period and you only have the
price at the end of each month, which means you have 24 price points for each stock. After calculating
the correlations of 10,000 * 9,999 pairs of stock, you found a pair that has the correlation to be above 0.8.
i. What’s the probability that this happens by chance?
Answer: In order to answer this question, we first start off by calculating the t-score for the Pearson
correlation coefficient r = 0.8 between two random stocks consisting of n = 24 price points:
t = r / √(1 − r²) · √(n − 2)
  = 0.8 / √(1 − 0.8²) · √(24 − 2)
  = 6.253
Then, we can calculate the p-value – that is the probability of observing correlation scores of 0.8
or more extreme, given that the null hypothesis "The two chosen stocks are uncorrelated" is true.
Using a Cumulative Distribution Function (CDF) calculator for the t-distribution with (n − 2) = 22
degrees of freedom, we find that the two-sided p-value is on the order of 10⁻⁶.
Even though at first sight this p-value seems low enough to discard the null hypothesis that the two
stocks are not correlated, keep in mind that we might actually fall into the trap of multiple hypothesis
testing, given that we are performing m = 10000 · 9999 = 99990000 simultaneous tests. With that
said, let us observe the family-wise error rate (FWER), that is the probability that at least one
Type I error (reject the null when it is in fact true) occurs in the family:
FWER = 1 − (1 − p)^m ≈ 1   (since m · p ≫ 1)
Therefore, we see that in fact the probability of observing a correlation of 0.8 or higher, when in fact
there is no correlation between the stocks, is close to 1.
ii. How to avoid this kind of accidental patterns?
Answer: As we discussed previously, one approach would be to perform Bonferroni correction.
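A short sketch of the computation with SciPy; the values r = 0.8, n = 24 and the number of tested pairs
come from the question:

import numpy as np
from scipy import stats

r, n = 0.8, 24
m = 10_000 * 9_999   # number of (ordered) stock pairs tested

# t-statistic for the Pearson correlation under H0: "the two stocks are uncorrelated"
t = r / np.sqrt(1 - r**2) * np.sqrt(n - 2)
p = 2 * stats.t.sf(t, df=n - 2)   # two-sided p-value

# family-wise error rate over all m simultaneous tests
fwer = 1 - (1 - p) ** m

print(t, p, fwer)   # t ~ 6.25, p on the order of 1e-6, FWER ~ 1.0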
18. How are sufficient statistics and Information Bottleneck Principle used in machine learning?
Answer: The information bottleneck method is a technique designed for finding the best trade-off be-
tween accuracy and complexity (compression) when summarizing (e.g. clustering) a random variable X,
given a joint probability distribution p(X, Y ), between X and an observed relevant variable Y .
Applications include distributional clustering and dimension reduction, and more recently it has been
suggested as a theoretical foundation for deep learning. It generalizes the classical notion of minimal suf-
ficient statistics from parametric statistics to arbitrary distributions, not necessarily of exponential form.
It does so by relaxing the sufficiency condition to capture some fraction of the mutual information with
the relevant variable Y .
The information bottleneck can also be viewed as a rate distortion problem, with a distortion func-
tion that measures how well Y is predicted from a compressed representation T compared to its direct
prediction from X. This interpretation provides a general iterative algorithm for solving the information
bottleneck trade-off and calculating the information curve from the distribution p(X, Y ). Let the com-
pressed representation be given by random variable T . The algorithm minimizes the following functional
with respect to the conditional distribution p(t|x):
min_{p(t|x)} I(X; T) − β · I(T; Y)
where I(X; T) and I(T; Y) are the mutual information of X and T, and of T and Y respectively, and β
is a Lagrange multiplier.
(Source: Wikipedia)
6 Computer Science
6.1 Algorithms
1.
i. You have three matrices: A ∈ R100×5 , B ∈ R5×200 , C ∈ R200×20 , and you need to calculate the
product ABC. In what order would you perform your multiplication and why?
Answer: Since matrix multiplication is associative, the result is the same whether we multiply
in the order of (AB)C or A(BC). However, let us observe the cost through the number of scalar
multiplications we need to perform:
(AB)C: computing AB costs 100 · 5 · 200 = 100,000 multiplications (yielding a 100 × 200 matrix), and
multiplying the result by C costs 100 · 200 · 20 = 400,000 multiplications, for a total of 500,000.
A(BC): computing BC costs 5 · 200 · 20 = 20,000 multiplications (yielding a 5 × 20 matrix), and
multiplying A by the result costs 100 · 5 · 20 = 10,000 multiplications, for a total of 30,000.
Therefore, we should multiply in the order A(BC), which requires roughly 16× fewer scalar multiplications.
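A quick check of both the cost counts and the (numerical) equality of the two orderings, as a small sketch:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
B = rng.standard_normal((5, 200))
C = rng.standard_normal((200, 20))

def matmul_cost(p, q, r):
    """Scalar multiplications needed for a (p x q) @ (q x r) product."""
    return p * q * r

cost_ab_c = matmul_cost(100, 5, 200) + matmul_cost(100, 200, 20)   # 500,000
cost_a_bc = matmul_cost(5, 200, 20) + matmul_cost(100, 5, 20)      # 30,000
print(cost_ab_c, cost_a_bc)

# both orderings give the same matrix (up to floating-point error)
assert np.allclose((A @ B) @ C, A @ (B @ C))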
2. What are some of the causes for numerical instability in deep learning?
Answer: Overflow, underflow, division by zero, log 0, NaN as input, etc.
3. In many machine learning techniques (e.g. batch norm), we often see a small term ϵ added to the
calculation. What’s the purpose of that term?
Answer: The purpose is to avoid operations that are undefined for 0, such as division by 0, log 0, etc.
4. What made GPUs popular for deep learning? How are they compared to TPUs?
Answer: GPUs became popular for deep learning because matrix multiplications can be efficiently
parallelized over hundreds of cores.
TPUs are specialized hardware for neural nets, with the difference that they have lower precision for
representing floating point numbers, allowing for higher memory throughput and faster addition/multi-
plication.
(See more here)
5. What does it mean when we say a problem is intractable?
Answer: From a computational complexity stance, intractable problems are problems for which there
exist no efficient algorithms to solve them. For example, exact Bayesian inference is (often) intractable
(i.e. there is no closed-form solution, and numerical approximations are also computationally expensive)
because it involves the computation of high-dimensional integrals over a range of real numbers.
More precisely, if you want to find the parameters θ ∈ Θ of a model given some data D, then Bayesian
inference is simply the application of the Bayes theorem:
p(θ|D) = p(D|θ) p(θ) / p(D) = p(D|θ) p(θ) / ∫ p(D|θ′) p(θ′) dθ′
where p(θ|D) is the posterior (the quantity we are interested in computing), p(D|θ) is the likelihood of the
data given the parameters θ, p(θ) is the prior, and p(D) = ∫ p(D|θ′) p(θ′) dθ′ is the evidence of the data –
which is the intractable quantity, as it requires integrating over all possible values of θ. If all terms were
tractable (polynomially computable), then given more data D we could iteratively update our posterior
(which becomes prior in the next iteration), and exact Bayesian inference would become tractable.
The variational Bayesian approach casts the problem of inferring p(θ|D)(which requires the computation of
the intractable evidence term) as an optimization problem, which approximately finds the posterior. More
precisely, it approximates the intractable posterior p(θ|D) with a tractable one, q(θ|D) (the variational
distribution). For example, Variational Autoencoders (VAEs) utilize the variational Bayes Approach to
approximate the posterior in the context of neural networks, so that existing deep-learning techniques
could be used to learn the parameters of the model.
The variational Bayesian approach (VBA) is becoming increasingly appealing in machine learning. For
example, Bayesian neural networks (which can partially address some of the inherent problems of non-
Bayesian neural networks) are often inspired by the results reported in the VAE paper, which demonstrated
the feasibility of the VBA in the context of deep learning.
(Source: StackExchange)
6. What are the time and space complexity for doing backpropagation on a recurrent neural network?
Answer: For the forward-pass of a single example in one timestep we need to evaluate all the weights,
resulting in O(w) time complexity, where w is the number of weights. Due to the recurrence, we repeat the
computation for T timesteps, resulting in O(T · w). Moreover, performing this un-rolled forward pass for
an entire batch will amount the time complexity to O(B · T · w). Lastly, we note that the time complexity
of the forward and the backward pass is the same.
As for the space complexity, note that we need to keep in memory both the network weights and the
activations from the forward pass (required for the backprop computation). Given that storing the
activations for a single timestep is O(a), the space complexity amounts to O(w + B · T · a).
7. Is knowing a model’s architecture and its hyperparameters enough to calculate the memory requirements
for that model?
Answer: Memory consumption of a neural network depends on many factors and variables that are very
hard to account for all at once. Without holistic knowledge of the backend internals of the framework
you use (e.g. CUDA), estimating the memory footprint is extremely hard. Following are several potential
factors to consider when estimating memory usage:
• Model size.
• Batch size.
• Tensor types (FP64, FP32, FP16, INT8).
• Optimizers – for example Adam keeps momentum updated gradients and variances for each network
parameter over time.
• Loss functions – for example, losses in Self-supervised settings require auxiliary memory for sample-
to-sample or feature-to-feature relation matrices.
• In-place tensor ops are faster and consume less memory — make sure you know which layers in your
network support them.
• Training (forward pass + backward pass on each iteration) is way slower and consumes more memory
than inference (forward pass only on each iteration) — measure them separately.
• For benchmarking make sure that no other process uses the same device/set of devices.
• Some wrappers (e.g. DataParallel) cause additional overhead via synchronization between the devices
or machines — always benchmark the network on a single device first.
• CUDA driver and cuDNN versions matter.
(Source: Quora)
8. Your model works fine on a single GPU but gives poor results when you train it on 8 GPUs. What might
be the cause of this? What would you do to address it?
Answer: [TODO]
9. What benefits do we get from reducing the precision of our model? What problems might we run into?
How to solve these problems?
Answer: There are numerous benefits to using numerical formats with lower precision than 32-bit floating
point. First, they require less memory, enabling the training and deployment of larger neural networks.
Second, they require less memory bandwidth which speeds up data transfer operations. Third, math
operations run much faster in reduced precision.
One major drawback is that numerical instabilities can occur due to the reduction in precision. In par-
ticular, it might happen that very small gradients underflow in the half precision setting (FP16), thereby
effectively stalling the learning process.
Mixed precision training achieves all the aforementioned benefits while ensuring that no task-specific
accuracy is lost compared to full precision training. It does so by: 1) maintaining a 32-bit copy of the
network for performing the gradient updates; 2) porting the model to use the FP16 data type where
appropriate; 3) adding loss scaling to preserve small gradient values. We maintain a single-precision copy
of the network in order to preserve small gradients (recall that we multiply by a small learning rate at
the update step). While each forward-backward step operates in half precision mode, the half-precision
gradients are cast into single-precision before performing the update on the master weights.
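A minimal PyTorch-style sketch of this recipe, using the standard torch.cuda.amp utilities (the toy linear
model, random data and hyperparameters are illustrative, and a CUDA device is assumed):

import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()   # handles the loss scaling

for step in range(100):
    inputs = torch.randn(64, 128, device=device)
    targets = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad()

    # forward pass runs in mixed precision (FP16 where safe, FP32 otherwise)
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # scaling the loss prevents small FP16 gradients from underflowing
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # unscales the gradients, then updates the FP32 master weights
    scaler.update()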
10. How to calculate the average of 1M floating-point numbers with minimal loss of precision?
Answer: In numerical analysis, the Kahan summation algorithm, also known as compensated summa-
tion, significantly reduces the numerical error in the total obtained by adding a sequence of finite-precision
floating-point numbers, compared to the obvious approach. This is done by keeping a separate running
compensation (a variable to accumulate small errors), in effect extending the precision of the sum by the
precision of the compensation variable.
Lastly, to obtain the average, we can simply divide by the number of elements.
(Source: Wikipedia)
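A minimal sketch of compensated (Kahan) summation applied to the averaging problem, everything
deliberately kept in float32 so that the effect is visible (the offset of 1000 is an arbitrary choice to
provoke round-off):

import numpy as np

def kahan_mean(values):
    """Compensated (Kahan) summation in float32, followed by division by the count."""
    total = np.float32(0.0)
    compensation = np.float32(0.0)       # accumulates the lost low-order bits
    for x in values:
        y = x - compensation             # correct the next addend by the accumulated error
        t = total + y                    # low-order digits of y may be lost here
        compensation = (t - total) - y   # algebraically zero; numerically, the lost part
        total = t
    return total / np.float32(len(values))

rng = np.random.default_rng(0)
values = np.float32(1000.0) + rng.random(1_000_000).astype(np.float32)

naive = np.float32(0.0)
for v in values:
    naive = naive + v                    # plain running sum in float32

print(naive / np.float32(len(values)))   # naive float32 mean, visibly off
print(kahan_mean(values))                # compensated float32 mean
print(values.astype(np.float64).mean())  # float64 reference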
11. How should we implement batch normalization if a batch is spread out over multiple GPUs?
Answer: First, let us observe the computation graph of BN when executed on a single device:
• Forward pass: for input data X = x1 , . . . , xN the data are normalized to be zero-mean and unit-
variance, then scaled and shifted:
y_i = γ · (x_i − µ)/σ + β
where µ = (1/N) Σ_{i=1}^{N} x_i and σ = √( (1/N) Σ_{i=1}^{N} (x_i − µ)² + ϵ ), and γ and β are learnable parameters.
• Backward pass: for calculating the gradient ∂L/∂x_i we need to consider the partial gradient from ∂L/∂y_i,
and the gradients from ∂L/∂µ and ∂L/∂σ, since µ and σ are functions of x_i:
∂L/∂x_i = (∂L/∂y_i) · (∂y_i/∂x_i) + (∂L/∂µ) · (∂µ/∂x_i) + (∂L/∂σ) · (∂σ/∂x_i)
Standard implementations of BN in public frameworks (such as Caffe, MXNet, Torch, TF, PyTorch)
are unsynchronized, which means that the data are normalized within each GPU. Therefore the working
batch-size of the BN layer is BatchSize/nGPU (batch-size in each GPU). Since the working batch-size
is typically large enough for standard vision tasks, such as classification and detection, there is no need
to synchronize BN layer during the training, as the synchronization will mainly slow down the training.
However, for the Semantic Segmentation task, the state-of-the-art approaches typically adopt dilated con-
volution, which is very memory consuming. Due to memory constraints, the working batch-size can be too
small for BN layers (2 or 4 in each GPU) when using larger/deeper pre-trained networks.
In order to implement a synchronised BN (see Figure 2), suppose we have K numbers of GPUs, with
sum(x)k and sum(x2 )k denoting the sum of elements and sum of squares in the k-th GPU accordingly.
Then:
• Forward pass: we can calculate the sum of elements sum(x) = Σ_i x_i and the sum of squares
sum(x²) = Σ_i x_i² in each GPU, and then synchronise the two sums across the devices. Then, calculate
the global mean µ = sum(x)/N and the global standard deviation σ = √( sum(x²)/N − µ² + ϵ ). Note the
equivalence of the variance term:
σ = √( (1/N) Σ_{i=1}^{N} (x_i − µ)² )
  = √( (1/N) Σ_{i=1}^{N} (x_i² − 2µ x_i + µ²) )
  = √( (1/N) [ Σ_i x_i² − 2µ Σ_i x_i + Σ_i µ² ] )
  = √( (1/N) Σ_i x_i² − 2µ² + µ² )
  = √( (1/N) Σ_i x_i² − µ² )
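A small NumPy sketch of the synchronised statistics, where two hypothetical "devices" are simulated by
splitting one batch in half:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=4096)
device_batches = np.split(x, 2)   # pretend each half lives on a different GPU
eps = 1e-5

# each "device" only reports its local sum and sum of squares
sums = np.array([b.sum() for b in device_batches])
sums_sq = np.array([(b**2).sum() for b in device_batches])
n_total = sum(b.size for b in device_batches)

# "all-reduce": combine the per-device statistics into global mean and std
mu = sums.sum() / n_total
sigma = np.sqrt(sums_sq.sum() / n_total - mu**2 + eps)

# matches the statistics computed over the full (un-split) batch
print(mu, x.mean())
print(sigma, np.sqrt(x.var() + eps))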
Figure 2: Sync BN1
import numpy as np

Answer: The issue with the code is that it is not properly vectorized, i.e. it doesn't utilize NumPy's
optimized and pre-compiled functions and operations. One way to vectorize the code is as follows (the
definitions of first, second and third are an assumed reconstruction: the squared offsets from roi along
each axis, broadcast to shape (m, n, r)):

import numpy as np

np.random.seed(42)
m, n, r = 90, 110, 100
volume = np.random.rand(m, n, r)
roi = np.random.rand(3)
radius = 3.4

# squared offsets from roi along each axis, shaped so that they broadcast to (m, n, r)
first = (np.arange(m)[:, None, None] - roi[0]) ** 2
second = (np.arange(n)[None, :, None] - roi[1]) ** 2
third = (np.arange(r)[None, None, :] - roi[2]) ** 2

vals = first + second + third   # (m, n, r)
mask = vals < radius ** 2       # (m, n, r)
7 Machine learning workflows
7.1 Basics
1. Explain supervised, unsupervised, weakly supervised, semi-supervised, and active learning.
Answer: Supervised learning is the machine learning task of learning a function that maps an input to
an output based on example input-output pairs. It infers a function from labeled training data consisting
of a set of training examples. (Source: Wikipedia)
Unsupervised learning is the machine learning task of learning patterns from unlabeled data. The hope is
that through mimicry, the algorithm is forced to build a compact internal representation of its world and
then generate imaginative content from it (Source: Wikipedia)
Weakly supervised learning is a branch of machine learning where noisy, limited, or imprecise sources
are used to provide supervision signal for labeling large amounts of training data in a supervised learning
setting. This approach alleviates the burden of obtaining hand-labeled data sets, which can be costly or
impractical. Instead, inexpensive weak labels are employed with the understanding that they are imper-
fect, but can nonetheless be used to create a strong predictive model. (Source: Wikipedia)
Semi-supervised learning is an approach to machine learning that combines a small amount of labeled
data with a large amount of unlabeled data during training. Semi-supervised learning falls between
unsupervised learning and supervised learning, and is a special instance of weak supervision. (Source:
Wikipedia)
Active learning is a branch of machine learning where a learning algorithm can interactively query a user
(or some other information source) to label new data points with the desired outputs. (Source: Wikipedia)
Self-supervised learning is a branch of machine learning that learns from unlabeled data by automati-
cally extracting labels from the sample. For example, we could mask out a word in a sentence, which the
algorithm then has to predict. (Source: Wikipedia)
2. Empirical risk minimization.
i. What’s the risk in empirical risk minimization?
Answer: Assume that there is a joint probability distribution P (x, y) over an input space X and a
target space Y . The goal is to learn a function h : X → Y (often called hypothesis) which outputs an
object y ∈ Y , given x ∈ X. Moreover, assume that we are given a non-negative real-valued loss func-
tion L(ŷ, y) which measures how different the prediction ŷ of a hypothesis is from the true outcome y.
The risk associated with the hypothesis h(x) is then defined as the expectation of the loss func-
tion:
R(h) = E[L(h(x), y)] = ∫ L(h(x), y) dP(x, y)
(The above answer is for the general form of Risk Minimization, instead of the Empirical one. See
the next answer which introduces the idea of Empirical Risk Minimization.)
ii. Why is it empirical?
Answer: In practice, the risk R(h) cannot be computed because the distribution P (x, y) is un-
available. However, we can compute an approximation, called empirical risk, by averaging the loss
function on the training set:
R_emp(h) = (1/n) Σ_{i=1}^{n} L(h(x_i), y_i)
The empirical risk minimization principle states that the learning algorithm should choose a hypothesis ĥ
which minimizes the empirical risk:
ĥ = argmin_{h ∈ H} R_emp(h)
For classification with the 0-1 loss, this minimization is non-convex and computationally hard. In practice,
machine learning algorithms cope with this problem by employing a convex approximation to the 0-1 loss
(like the hinge loss for SVMs), which is easier to optimize.
(Source: Wikipedia)
3. Occam’s razor states that when the simple explanation and complex explanation both work equally well,
the simple explanation is usually correct. How do we apply this principle in ML?
Answer: Statistical versions of Occam’s razor have a more rigorous formulation than what philosophical
discussions produce. In particular, they must have a specific definition of the term simplicity, and that
definition can vary.
For example, Minimum Description Length (MDL) is a model selection principle where the shortest
description of the data is the best model. Within Algorithmic Information Theory, the description length
of a data sequence is defined as the length of the smallest program that outputs that data set.
(Source: Wikipedia)
4. What are the conditions that allowed deep learning to gain popularity in the last decade?
Answer: The three main factors that allowed deep learning to gain popularity are:
• Increased access to large datasets of high quality and fine-grained labels (ImageNet, CityScapes).
• Improved hardware (GPUs, TPUs).
• Algorithmic advances (Residual Connections, Attention mechanism, BatchNorm)
5. If we have a wide NN and a deep NN with the same number of parameters, which one is more expressive
and why?
Answer: Deeper networks are more expressive, since they encode an inductive bias that complex functions
can be modeled as composition of simple functions. In turn, this allows the network to learn multiple levels
of an abstraction hierarchy. Empirically, it has been shown that deeper networks lead to more compact
models with better generalization performance.
6. The Universal Approximation Theorem states that a neural network with 1 hidden layer can approximate
any continuous function for inputs within a specific range. Then why can’t a simple neural network reach
an arbitrarily small positive error?
Answer: While at first look the universality of 2-layer networks is appealing, one aspect that is not
explicitly stated is the arbitrarily large requirements for width. Typically, these networks require expo-
nential width with respect to the input size, which in turn leads to an exponential increase in memory
and computation time.
Moreover, the Universal Approximation Theorem states that these wide networks can approximate the
train set arbitrarily closely – but it makes no statement about the generalization performance on the test set.
Empirically, it has been shown that networks that overfit on the train set do not generalize well on the test
set, since the network merely memorizes the input-output pairs.
7. What are saddle points and local minima? Which are thought to cause more problems for training large
NNs?
Answer: A saddle point is a point on the surface of the graph of a function where the slopes (derivatives)
in orthogonal directions are all zero (a critical point), but which is not a local extremum of the function.
A function f has a local minimum at x∗ if there exists some ϵ > 0 such that f (x∗ ) ≤ f (x) for all
x ∈ X within distance ϵ of x∗ .
Today, neither saddle points nor local minima are considered significant problems (e.g. see this pa-
per). Nevertheless, let us provide some intuitive explanations for why these long-feared mathematical
phenomena are not as harmful or common.
For a point to truly be a local minimum of the loss function, it has to be a local minimum in all di-
rections, where each direction is specified by one of the parameters of the network. In contrast, for a
saddle point, only one direction has to be different than others. Given that we usually train million-
(or even billion-) parameter models, it is much more likely that at least one direction displays different
behavior than the others, as opposed to all of them displaying the same behavior. Therefore, we can
conclude that local minima are not as common.
Okay, but what if we actually do arrive at a local minimum, or even more commonly – a saddle point?
It is important to stress that today’s networks are typically trained with Stochastic Gradient Descent
(or one of its momentum-based variants), as opposed to pure Gradient Descent. Since we are calculating
the loss wrt. the current batch (and not the entire dataset), we are not truly traversing the original loss
landscape, but a proxy of it. And if we eventually get stuck in a local minima / saddle point in the
loss landscape (or even its current proxy), in the next iteration we are optimizing over a different batch,
which is a different proxy of the loss, and therefore will slightly nudge us in a different direction. This
regularization effect is a huge reason why we are able to train neural networks that show remarkable
capabilities.
(See more here)
8. Hyperparameters.
i. What are the differences between parameters and hyperparameters?
Answer: Parameters are quantities that are optimized by the model during the training process
(e.g. weights of a neural network).
Hyperparameters are quantities related to the learning procedure, which are not optimized during
training, but are set by the user before the training starts (e.g. learning rate of the optimizer).
ii. Why is hyperparameter tuning important?
Answer: Each learning algorithm has unique properties. When it comes to selecting a value for a
hyperparameter, in general there is no one-size-fits-all. For this reason, we need to tune the values
of the hyperparameters on a hold-out validation set.
Furthermore, even if a given hyperparameter value seems to work well for a given model when
trained on a particular dataset, there is no guarantee that the same value will be appropriate when
the model is trained on a different dataset.
iii. Explain an algorithm for tuning hyperparameters.
Answer: Two fairly naive, but commonly used algorithms for tuning hyperparameters are:
• Grid Search – given a set of values for each hyperparameter, the algorithm looks over each
possible combination.
• Random Search – given an interval of possible hyperparameter values, the algorithm trains the
model by sampling randomly from the provided ranges.
One major drawback of these two approaches is that they are uninformative – the choice of the
next set of parameters is independent of the performance of the previous choice. This serves as a
motivating factor as to why someone might consider using Bayesian Optimization.
In Bayesian Optimization, we treat the (validation) performance of the model as a black-box objective f of
the hyperparameters x, and we search for
x* = argmax_x f(x)
Practically speaking, this function would wrap the entire training procedure of the model, which
means that we cannot apply gradient based optimization since it is not differentiable. We have to
settle on optimizing the function f by merely evaluating it at a given point x. Given these con-
straints, we now introduce the two main components of Bayesian Optimization: the surrogate model and the acquisition function.
2 Source: https://towardsdatascience.com/shallow-understanding-on-bayesian-optimization-324b6c1f7083
Figure 3: Bayesian Optimization2 .
Since we lack an expression for the objective function, the first step is to use a surrogate model
to approximate f (x). While there are several possible approaches for the surrogate function, a com-
mon choice is to use Gaussian Processes (GPs). GPs are especially useful for our purposes, since
they not only provide a model which best fits the data µ(x), but also produce uncertainty estimates
σ(x) at each point x.
Next, we need to choose an acquisition function, whose main purpose is to decide which point x
is best to sample next, based on the information provided by the surrogate model. Given that we
have already sampled several hyperparameters x1, x2, . . . for evaluation, the acquisition function has
to make a choice whether to keep exploiting the already visited regions in order to greedily optimize
the surrogate to the objective f , or take on a risk by venturing into a previously unexplored area
which could potentially yield an even better optimum. This is known as the exploration-exploitation
dilemma in Bayesian optimization. Among the several possible choices for an acquisition function,
the Upper Confidence Bound (UCB) is the most intuitive one, because it explicitly trades off be-
tween exploring new regions, and exploiting already visited ones. In particular, the affinity for the
hyperparameter x is estimated as follows:
a_UCB(x, λ) = µ(x) + λ · σ(x)
where µ(x) is the mean of the GP posterior at x, σ(x) is the standard deviation of the GP posterior
at x, and λ is a hyperparameter which trades-off the two (this is user set, and not optimized). As
we discussed, the affinity for each hyperparameter x is a weighted sum of the expected performance
and the uncertainty surrounding the choice of this parameter.
See Figure 3 for a visual explanation of the inter-play between the surrogate model and the ac-
quisition function. Finally, here is what the Bayesian Optimization algorithm would look like in
pseudocode:
1. Evaluate f (x) at n initial points
2. While n ≤ N repeat:
• Update the surrogate model (the GP posterior) using all available data D_{1:n}
• Select the next candidate: x_{n+1} = argmax_x a_UCB(x, λ)
• Evaluate y_{n+1} = f(x_{n+1})
• Augment the dataset D_{1:n+1} = {D_{1:n}, (x_{n+1}, y_{n+1})}
3. Return the x that achieved the largest f(x)
(Source: Stathis Kamperis’s blog)
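A compact sketch of this loop with a Gaussian-Process surrogate and a UCB acquisition. The toy 1-D
objective, the candidate grid and λ = 2 are illustrative assumptions; in a real setting f(x) would wrap a
full training run and return a validation score:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):
    """Toy black-box objective standing in for 'validation score as a function of x'."""
    return -(x - 2.0) ** 2 + np.sin(5 * x)

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 4.0, 500).reshape(-1, 1)
lam = 2.0                                   # exploration/exploitation trade-off

# 1. evaluate f at a few initial points
X = rng.uniform(0.0, 4.0, size=(3, 1))
y = f(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)

# 2. iterate: fit the surrogate, maximise the acquisition, evaluate, augment
for _ in range(15):
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + lam * sigma                  # a_UCB(x, lambda)
    x_next = candidates[np.argmax(ucb)].reshape(1, 1)
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next).ravel())

# 3. return the x that achieved the largest f(x)
print(X[np.argmax(y)], y.max())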
9. Classification vs. regression.
i. What makes a classification problem different from a regression problem?
Answer: Classification is the task of predicting a discrete class label, whereas regression is the task
of predicting a continuous quantity.
ii. Can a classification problem be turned into a regression problem and vice versa?
Answer: While it is technically possible to turn a classification problem into a regression problem (and vice
versa), there is rarely ever a reason to do so. Following are some of the reasons why this is not a good idea:
• One loses information by binning the response.
• Continuous targets have an order. In general, classification problems don’t assume class ordering.
• Continuity implies smoothness. This is in contrast to the orthogonal representation of one-hot
encoded labels.
(Source: StackExchange)
10. Parametric vs. non-parametric methods.
i. What’s the difference between parametric methods and non-parametric methods? Give an example
of each method.
Answer: A learning model that summarizes data with a set of parameters of fixed size (independent
of the number of training examples) is called a parametric model. Some prominent examples: Linear
regression, Logistic regression, Neural Networks, Naive Bayes.
In contrast, non-parametric methods do not take a predetermined form, but are constructed ac-
cording to the information derived from the data. For this reason, they require larger sample sizes
because the data must supply the model structure. Some prominent examples: k Nearest Neighbors,
Decision Trees, Gaussian Processes.
(Source: MachineLearningMastery, Wikipedia)
Interestingly, SVMs can be considered both parametric and non-parametric model, depending on
whether you view them from the primal (parameters w and b) or the dual problem space (formula-
tion through support vectors).
(Source: Quora)
ii. When should we use one and when should we use the other?
Answer: To answer this question, we will consider two aspects:
(a) Dataset size: Non-parametric methods are more applicable in cases when we have larger datasets
that provide sufficient coverage s.t. we can derive the entire model structure based on the data
alone. Otherwise, if the datasets are not large enough, we can inject a prior into our training
process by fixing the parametric form of the model. This allows the optimization procedure to
only focus on inferring the parameter values, and not having to derive the entire model structure.
(b) Inference time requirements: Since parametric models use a fixed parametrization, they are
more applicable in cases when we need consistent inference time guarantees. In contrast, the
prediction time of non-parametric methods might depend on the dataset size (e.g. finding k-
nearest neighbors, iterating over all support vectors, . . . )
11. Why does ensembling independently trained models generally improve performance?
Answer: Consider K regression models, each of which has an error ϵ_k ∼ N(0, σ), with Var[ϵ_k] = E[ϵ_k²] =
v and covariances Cov[ϵ_k, ϵ_l] = E[ϵ_k ϵ_l] = c (notice that E[ϵ_k] = 0). Then, the expected squared error of
the ensemble predictor (with each model having the same weight) is given as:
E[ ( (1/K) Σ_k ϵ_k )² ] = (1/K²) E[ Σ_k ϵ_k² + Σ_k Σ_{l≠k} ϵ_k ϵ_l ]
= (1/K²) ( Σ_k E[ϵ_k²] + Σ_k Σ_{l≠k} E[ϵ_k ϵ_l] )
= (1/K²) ( K·v + K(K−1)·c )
= v/K + ((K−1)/K)·c
Now, consider the following two cases:
(a) The errors are maximally correlated:
corr(ϵ_k, ϵ_l) = 1
⟹ Cov[ϵ_k, ϵ_l] / √( Var[ϵ_k] · Var[ϵ_l] ) = 1
⟹ (Cov[ϵ_k, ϵ_l])² = Var[ϵ_k] · Var[ϵ_l]
⟹ c² = v · v
⟹ c = v
Plugging c = v into the expression above:
E[ ( (1/K) Σ_k ϵ_k )² ] = v/K + ((K−1)/K)·v = v = E[ϵ_k²]
In other words, we observe no gain wrt. a single model when the errors of the models are correlated.
(b) The errors are uncorrelated:
corr(ϵ_k, ϵ_l) = 0 ⟹ Cov[ϵ_k, ϵ_l] = 0 ⟹ c = 0
Plugging c = 0 into the expression above:
E[ ( (1/K) Σ_k ϵ_k )² ] = v/K
In other words, the expected squared error of the ensemble is K times smaller than that of a single model
when the errors are uncorrelated. In practice, the errors of independently trained models are only partially
correlated, so the gain lies somewhere in between these two extremes.
12. Why does L1 regularization tend to produce sparse weights, whereas L2 regularization pushes weights
towards small but non-zero values?
Answer: Let w = 0.001. Then, L1(w) = |0.001| = 0.001, but L2(w) = (0.001)² = 0.000001. We see that for
the same value of w, the L2 penalty is much lower than the L1 penalty. As a consequence, in order to minimize
the L1-regularized loss, the optimizer is forced to push w even closer to 0, whereas the (near-zero) L2 penalty
provides almost no incentive to do so.
13. Why does an ML model’s performance degrade in production?
Answer: Sources of bugs in production can be:
(a) Caused by the training data:
• Selection bias: the sampling procedure for collecting the dataset does not account for all relevant
characteristics of the population.
• Missing data: the decision on how to handle missing data (by dropping it, or imputing it with
a certain technique) might not hold beyond the offline test set evaluation.
• Misspecified schema: the data is there, but the semantics of the features can be misunderstood.
This happens when there is no proper documentation (schema) explaining how the dataset was
gathered, and what the features exactly mean.
(b) Caused by the model:
• Model mismatch: the model assumes structure not applicable to the data seen in deployment.
(c) Caused by the algorithm:
• Classic implementation bugs: off-by-one, type mismatches, value-v-reference, etc.
• Subtle mathematical bugs: incorrect gradients, unexpected broadcast, biased estimators, etc.
• Fundamental mathematical challenges: non-convex optimization, numerical stability, sampling
noise, etc.
(d) Caused by the test data. Given a distribution P (X, Y ), where X are the inputs, and Y are the
targets, a model deployed in production can experience the following shifts in data:
• Covariate shift: P (X) changes, but P (Y |X) remains the same.
• Label shift: P (Y ) changes, but P (X|Y ) remains the same.
• Concept drift: P (Y |X) changes, but P (X) remains the same.
(See more here)
14. What problems might we run into when deploying large machine learning models?
Answer: There are several challenges that come with deploying large ML models:
(a) Component efficiency. For complex models with many different components, it’s especially important
to conduct ablation studies – removing each component while keeping the rest – to determine the
efficiency of each component. One might consider discarding / substituting components that are
inefficient in time or provide only minor improvement in accuracy.
(b) Server- vs client- side inference. Inferencing on the user phone consumes the phone’s memory and
battery, and makes it harder for collecting user feedback. Inferencing on the cloud increases the
product latency, requires setting up a server to process all user requests, and might scare away
privacy-conscious users.
(c) Interpretability. If a model predicts that someone shouldn’t get a loan, that person deserves to know
the reason why. One needs to consider the performance/interpretability tradeoffs. Making a model
more complex might increase its performance but make the results harder to interpret.
(d) Bias. One should do extensive testing and model validation to confirm that the model doesn’t
perpetuate potential biases that can occur in the training data, such as racial and gender stereotypes.
Moreover, it’s important to design post-hoc safety systems that prevent misuse by users with malicious intent.
(Source: Chip Huyen’s ML system design interview booklet)
15. Your model performs really well on the test set but poorly in production.
i. What are your hypotheses about the causes?
Answer: As we discussed above, the three main reasons for change in performance compared to the
test set could be attributed to one or more of the following: covariate shift, label shift, concept drift.
Let us explain each one in a bit more detail.
(a) During covariate shift P (X) changes, but P (Y |X) remains the same. In supervised ML, the
label is the variable of direct interest, and the input features are covariate variables.
Suppose you want to build a model that predicts breast cancer. Research has shown that the
risk of breast cancer is higher for women over the age of 40, so you decide to have a variable
“age” as your input. Given that women over 40 are more likely to have breast cancer, due to
potentially sampling bias from the hospital data, during training we might have more women
with age over 40 compared to inference time — resulting in P (X) changing. However, the prob-
ability of a patient of a certain age having cancer doesn’t change just because we start seeing a
higher number of young users – therefore, P (Y |X) remains the same. The change in our user
base doesn’t affect the nature of the cancer.
(b) During label shift P (Y ) changes, but P (X|Y ) remains the same.
Quite often, label shift is accompanied with covariate shift. Consider the breast cancer exam-
ple again, and suppose we have covariate shift (P (X) changes but P (Y |X) remains the same).
Because of the change in P(X), we are also bound to see a change in class distribution: during
training there are more positive classes (due to the aforementioned sampling bias in our dataset),
but during inference time there are less patients with positive classes (since we experience change
in user base, more women with age less than 40 start using the model, but we also know they
are less likely to have cancer) — therefore P (Y ) also changes. However, the distribution of age
for sick patients doesn’t change just because we start seeing lower number of users who have
cancer – therefore, P (X|Y ) remains the same. Again, the change in our user base doesn’t affect
the nature of the cancer.
However, it’s not always the case that label shift is accompanied by covariate shift. Imagine that
there is now a preventive drug that every woman takes that helps reduce their chance of getting
breast cancer. The probability P (Y |X) reduces for women of all ages, so it’s no longer a case
of covariate shift. However, given a person with breast cancer, the age distribution remains the
same so this is still a case of label shift.
(c) During concept drift P (Y |X) changes, but P (X) remains the same. You can think of this as
“same input, different output”.
Consider you’re in charge of a model that predicts the price of a house based on its features.
Before COVID-19, a 3 bedroom apartment in San Francisco could cost $2,000,000. However, at
the beginning of COVID-19, many people left San Francisco, so the same house would cost only
$1,500,000. So even though the distribution of house features remains the same, the conditional
distribution of the price of a house given its features has changed.
In many cases, concept drifts are cyclic or seasonal. For example, rideshare’s prices will fluctuate
on weekdays versus weekends, and flight tickets rise during holiday seasons. Companies might
have different models to deal with cyclic and seasonal drifts. For example, they might have one
model to predict rideshare prices on weekdays and another model for weekends.
ii. How do you validate whether your hypotheses are correct?
Answer: In order to validate whether our hypotheses are correct, we need to put methods in place
that are able to detect these data distribution shifts. One can classify the approaches in two primary
groups: statistical methods, and time scale windows for detecting shifts.
Statistical methods aim to detect a shift between the training and serving data by quantifying it
in some statistical sense. One simple approach is to compare the moments (mean, variance, skew-
ness, kurtosis, etc.) of the two distributions – in case these quantities significantly differ, one can
conclude that there is a shift. However, if there is no significant difference between the quanti-
ties describing the two distributions, we cannot conclude that there is no shift present, since these
statistics are not sufficient.
A more sophisticated solution is to use a two-sample hypothesis test, shortened as two-sample test.
It’s a test to determine whether the difference between two populations (two sets of data) is statisti-
cally significant. If the difference is statistically significant, then the probability that the difference is
a random fluctuation due to sampling variability is very low, and therefore, the difference is caused
by the fact that these two populations come from two distinct distributions. If you consider the data
from yesterday to be the source population and the data from today to be the target population and
they are statistically different, it’s likely that the underlying data distribution has shifted between
yesterday and today. Some examples include: the Kolmogorov-Smirnov test (a non-parametric test, but
one that can only be used for one-dimensional data), Least-Squares Density Difference (based on the
least-squares density-difference estimation method), Maximum Mean Discrepancy (a kernel-based technique
for multivariate two-sample testing), and many more; a small Kolmogorov-Smirnov sketch is given at the
end of this question.
On the other hand, time scale windows are designed to detect temporal shifts that happen over
time. To detect temporal shifts, a common approach is to treat input data to ML applications as
time series data.
When dealing with temporal shifts, the time scale window of the data we look at affects the shifts
we can detect. If your data has a weekly cycle, then a time scale of less than a week won’t detect
the cycle. Therefore, by setting the time window too small, your detection technique can produce a
false alarm simply due to the seasonality inherent to the data.
When computing running statistics over time, it’s important to differentiate between cumulative
and sliding statistics. Sliding statistics are computed within a single time scale window, e.g. an
hour. Cumulative statistics are continually updated with more data. This means for each beginning
of a time scale window, the sliding accuracy is reset, whereas the cumulative accuracy is not.
Because cumulative statistics contain information from previous time windows, they might obscure
what happens in a specific time window.
iii. Imagine your hypotheses about the causes are correct. What would you do to address them?
Answer: To make a model work with a new distribution in production, there are three main ap-
proaches. The first is the approach that currently dominates research: train models using massive
datasets. The hope here is that if the training dataset is large enough, the model will be able to learn
such a comprehensive distribution that whatever data points the model will encounter in production
will likely come from this distribution.
The second approach, less popular in research, is to adapt a trained model to a target distribu-
tion without requiring new labels. Zhang et al. (2013) used causal interpretations together with
kernel embedding of conditional and marginal distributions to correct models’ predictions for both
covariate shifts and label shifts without using labels from the target distribution. Similarly, Zhao et
al. (2020) proposed domain-invariant representation learning: an unsupervised domain adaptation
technique that can learn data representations invariant to changing distributions. However, this area
of research is heavily underexplored and hasn’t found wide adoption in industry.
The third approach is what is usually done in the industry today: retrain your model using the
labeled data from the target distribution. However, retraining your model is not so straightforward.
Retraining can mean retraining your model from scratch on both the old and new data or continuing
training the existing model on new data. The latter approach is also called fine-tuning.
(Source: Chip Huyen’s blog on Data Distribution Shifts)
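To make the two-sample-test idea from part (ii) concrete, here is a minimal Kolmogorov-Smirnov sketch
with SciPy; the "yesterday" and "today" feature samples are synthetic and purely illustrative:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# a one-dimensional input feature observed on two different days
yesterday = rng.normal(loc=0.0, scale=1.0, size=5_000)
today_no_shift = rng.normal(loc=0.0, scale=1.0, size=5_000)
today_shifted = rng.normal(loc=0.3, scale=1.0, size=5_000)   # simulated covariate shift

# two-sample Kolmogorov-Smirnov test: a tiny p-value suggests the distributions differ
print(stats.ks_2samp(yesterday, today_no_shift))   # large p-value, no alarm
print(stats.ks_2samp(yesterday, today_shifted))    # tiny p-value, raise an alert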
2. What is the difference between sampling with vs. without replacement? Name an example of when you
would use one rather than the other?
Answer: When we sample with replacement, the two sample values are independent. Practically, this
means that what we get on the first one doesn’t affect what we get on the second. Mathematically, this
means that the covariance between the two is zero.
In sampling without replacement, the two sample values aren’t independent. Practically, this means
that what we get on the first one affects what we can get for the second one. Mathematically, this means
that the covariance between the two isn’t zero. That complicates the computations. In particular, if we
have a SRS (simple random sample) without replacement, from a population with variance σ², then the
covariance of two of the different sample values is −σ²/(N − 1), where N is the population size.
With that said, if our sampling specification states that there should be no duplicates, we should sam-
ple without replacement – keeping in mind that this in fact introduces covariance between the samples.
However, if the learning algorithm has strong IID assumptions (independent and identically distributed),
then we have to sample with replacement.
Nevertheless, this argument is mostly regarding small population sizes. In large populations, if we sample
without replacement, the covariance will be close to 0 because of the N − 1 term in the denominator. In
this case, both sampling with and without replacement is nearly identical.
(Source: University of Texas)
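For reference, both sampling schemes are a single call in NumPy (a tiny sketch with an arbitrary population
of 10 items):

import numpy as np

rng = np.random.default_rng(0)
population = np.arange(10)

# with replacement: draws are independent, duplicates are possible
print(rng.choice(population, size=5, replace=True))

# without replacement: no duplicates, draws are (slightly) negatively correlated
print(rng.choice(population, size=5, replace=False))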
3. Explain Markov chain Monte Carlo sampling.
Answer: Integration is the core computation of probabilistic inference. As a general formulation, we
might want to estimate:
ϕ = ∫ f(x) p(x) dx = E_{p(x)}[f(x)]
Algorithms that approximate such expectations by averaging over samples x_i ∼ p(x), i.e.
ϕ̂ = (1/S) Σ_{i=1}^{S} f(x_i), are called Monte Carlo methods.
A joint distribution p(X) over a sequence of random variables X = [x1, . . . , xn] is said to have the
Markov property if each variable depends only on its immediate predecessor:
p(x_i | x_1, . . . , x_{i−1}) = p(x_i | x_{i−1})
The main idea behind Markov Chain Monte Carlo (MCMC) methods is to generate samples of a tar-
get distribution p(x) by iteratively building approximations q(x) that only need to be good locally. To
illustrate the intuition behind MCMC, consider the following iterative algorithm for finding the maximum
of p(x):
• Draw a proposal x′ ∼ q(x′ | x_t)
• Compute a = p(x′) / p(x_t)
• If a > 1, accept x_{t+1} = x′, else reject x′ and set x_{t+1} = x_t.
By forcing each new entry x_{t+1} to be at least as high as the previous one, we eventually reach a (local)
maximum of p(x).
This very same procedure can be adapted to return a sample from p(x) instead by tweaking the rules of
transition from xt to xt+1 , leading to the Metropolis-Hasting method:
• Draw a proposal x′ ∼ q(x′ |xt ) from a proposal distribution q, for example q(x′ |xt ) = N (x′ ; xt , σ 2 ).
• Compute a = ( p(x′) / p(x_t) ) · ( q(x_t | x′) / q(x′ | x_t) ).
• If a > 1, accept xt+1 = x′ .
• Otherwise, accept with probability a, and reject with probability 1 − a. The outcome can be decided
by drawing uniformly from [0, 1] and comparing it to a.
The Markov chain stays at the same place for one time period when rejecting, meaning that the corre-
sponding point will later show up at least 2 times. Usually, the proposal distribution is symmetric, such
that q(xt |x′ ) = q(x′ |xt ).
Using this method, the samples will spend more “time” in regions where p(x) is high (lower probabil-
ity of sampling a better proposition) and less “time” in regions where p(x) is low (any proposition would
be better than the current one), but the algorithm can still visit regions of low probability.
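A minimal NumPy sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal; the
target density (a two-component Gaussian mixture, known only up to a constant), the proposal width and
the chain length are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

def p_unnormalized(x):
    """Target density known only up to a normalizing constant."""
    return 0.3 * np.exp(-0.5 * (x + 2.0) ** 2) + 0.7 * np.exp(-0.5 * (x - 2.0) ** 2)

n_steps, proposal_std = 50_000, 1.0
samples = np.empty(n_steps)
x = 0.0                                             # arbitrary starting point

for t in range(n_steps):
    x_prop = x + proposal_std * rng.normal()        # symmetric proposal q(x'|x)
    a = p_unnormalized(x_prop) / p_unnormalized(x)  # the q terms cancel for a symmetric q
    if rng.uniform() < a:                           # accept with probability min(1, a)
        x = x_prop
    samples[t] = x                                  # on rejection the chain repeats x

# after discarding a burn-in period, the samples approximate draws from p(x)
print(samples[1_000:].mean(), samples[1_000:].std())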
4. If you need to sample from high-dimensional data, which sampling method would you choose?
Answer: Gibbs Sampling is a special case of the Metropolis-Hasting algorithm. It employs the idea
that sampling from a high-dimensional joint distribution is often difficult, while sampling from a one-
dimensional conditional distribution is easier. So, instead of directly sampling from the joint p(x), the
Gibbs sampler alternates between drawing from the respective conditional distributions, as illustrated in
Algorithm 1.
It generates an instance from the distribution of each variable in turn, conditioning it on the current
values of the other variables. Thus, Gibbs is useful when drawing from the joint is hard / infeasible,
while drawing from the conditional is more tractable. Although there are theoretical guarantees for the
convergence of Gibbs Sampling, as it is a special case of the Metropolis-Hastings, it is unknown how many
iterations are needed to reach the stationary distribution.
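A small sketch of a Gibbs sampler for a standard bivariate Gaussian with correlation ρ, where both
one-dimensional conditionals are available in closed form (ρ = 0.8 and the chain length are illustrative
choices):

import numpy as np

rng = np.random.default_rng(0)
rho, n_steps = 0.8, 20_000
x, y = 0.0, 0.0
samples = np.empty((n_steps, 2))

# for a standard bivariate normal with correlation rho:
#   x | y ~ N(rho * y, 1 - rho^2)   and   y | x ~ N(rho * x, 1 - rho^2)
cond_std = np.sqrt(1 - rho**2)

for t in range(n_steps):
    x = rng.normal(loc=rho * y, scale=cond_std)   # draw x from p(x | y)
    y = rng.normal(loc=rho * x, scale=cond_std)   # draw y from p(y | x)
    samples[t] = (x, y)

# the empirical correlation of the chain should be close to rho
print(np.corrcoef(samples[1_000:, 0], samples[1_000:, 1])[0, 1])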
5. Suppose we have a classification task with many classes. An example is when you have to predict the next
word in a sentence – the next word can be one of many, many possible words. If we have to calculate the
probabilities for all classes, it’ll be prohibitively expensive. Instead, we can calculate the probabilities for
a small set of candidate classes. This method is called candidate sampling. Name and explain some of
the candidate sampling algorithms.
Answer: There exist several methods for candidate sampling:
(a) Sampled Softmax. Assume that we have a single-label problem. Each training example (xi , ti ) con-
sists of a context and a target class. We write P (y|x) for the probability of the target class being y,
given that the context is x.
We would like to train a function F (x, y) to produce log-probabilities (commonly used in many
loss functions):
F(x, y) ← log(P(y|x)) + K(x)
where K(x) is an arbitrary function that doesn't depend on y. In full softmax training, for every
training example (xi , ti ) we would need to compute F (xi , y) for all classes y ∈ L. This can get
expensive if the universe of classes L is very large.
In Sampled Softmax, for each training example (xi , ti ), we pick a small set Si ⊂ L of ”sampled”
classes according to a chosen sampling function Q(y|x). Each class y ∈ L is included in Si indepen-
dently with probability Q(y|xi ):
P(S_i = S | x_i) = Π_{y ∈ S} Q(y|x_i) · Π_{y ∈ (L−S)} (1 − Q(y|x_i))
We create a set of candidates Ci as the union of the target class and the sampled classes:
Ci = Si ∪ {ti }
Our task during training is now simplified to finding out which of the classes in Ci is the target class.
For each class y ∈ Ci , we want to compute the posterior probability that y is the target class
given our knowledge of xi and Ci , that is P (ti = y|xi , Ci ). Now, let us further expand this term:
P(t_i = y | x_i, C_i) = P(t_i = y, x_i, C_i) / P(C_i, x_i)
 = P(t_i = y, C_i | x_i) P(x_i) / [P(C_i | x_i) P(x_i)]
 = P(t_i = y, C_i | x_i) / P(C_i | x_i)
 = P(C_i | t_i = y, x_i) P(t_i = y | x_i) / P(C_i | x_i)
 = P(C_i | t_i = y, x_i) P(y | x_i) / P(C_i | x_i)
 = [∏_{y′ ∈ C_i − {y}} Q(y′ | x_i) ∏_{y′ ∈ L − C_i} (1 − Q(y′ | x_i))] · P(y | x_i) / P(C_i | x_i)
 = [P(y | x_i) / Q(y | x_i)] · [∏_{y′ ∈ C_i} Q(y′ | x_i) ∏_{y′ ∈ L − C_i} (1 − Q(y′ | x_i))] / P(C_i | x_i)
 = [P(y | x_i) / Q(y | x_i)] · K(x_i, C_i)

where K(x_i, C_i) is constant with respect to y.
Using this quantity, we can train our sampled version of the softmax classifier to predict which is the true class among the candidates in Ci. In particular, for the set of candidate classes y ∈ Ci, we compute F(x, y) (the output of our network) and subtract log Q(y|x) (which is set by the user and ideally has an analytical form). Then, we pass this quantity to the Cross Entropy loss function, and backpropagate in order to optimize F to give us the desired outputs.
It should be noted that the procedure of sampling the softmax is only applied during training (in
order to decrease the computational complexity), whereas during inference time we compute the
standard softmax.
(b) Noise Contrastive Estimation. The idea is similar as before – computing the entire softmax can be
prohibitively expensive during training, therefore instead of comparing the true class to all other
classes, we compare to a much smaller sample of negative classes. The difference is that now the
target can be a (multi-)set, compared to only being a single label as before.
Assume that each training example (xi, Ti) ∼ P(x, T) consists of a context and a set of target classes. For each such example, we sample a set of negative classes Si from a predefined distribution Q(y|xi).
Then, we construct a (multi-)set of candidates consisting of the sum of the target and sampled
classes:
Ci = Ti + Si
Remember that the logit of a Logistic Regression model in fact represents the log-odds that the input comes from the positive vs the negative class:

logit = θᵀx = log(P / Q) = log(P) − log(Q)
For NCE, we want to output logits which represent the log-odds that the inputs come from the target
set T (true distribution P ), instead of the sampled set S (the noise distribution Q).
For this reason, we ”repurpose” our network to output F (x, y) := log(P (y|x)), and subtract from it log(Q(y|x)):

F(x, y) − log(Q(y|x)) = log(P(y|x)) − log(Q(y|x))

resulting in the desired set of logits, which as mentioned above, represent the log-odds that the input
comes from the target set T vs the sampled set S. It should be noted that the Q is chosen by the
user, typically in such a way that its log has an analytical form.
Finally, we pass these new logits to the Binary Cross Entropy loss function, with a label indicating
whether y came from T or S. The backpropagation signal trains F (x, y) to approximate what we
want it to.
(c) Negative Sampling. Negative sampling is a simplified variant of Noise Contrastive Estimation where
we neglect to subtract off log(Q(y|x)) during training. As a result, F (x, y) is trained to approximate
log(P (y|x)) − log(Q(y|x)).
Therefore, in order to build a robust classifier capable of generalizing to unseen comments (and
events related to them) in the future, we find it important to have representative samples from each
historical period.
One such approach that ensures the above-mentioned property is stratified sampling. In particular, we treat each month as a single stratum, and then sample randomly within each stratum.
ii. Suppose you get back 100K labeled comments from 20 annotators and you want to look at some
labels to estimate the quality of the labels. How many labels would you look at? How would you
sample them?
Answer: Given that we have 20 annotators labeling 100K comments, in order to estimate the qual-
ity, we could look at 5% of the comments (5000 out of the 100K), or roughly one annotator’s workload.
Since we are only interested in estimating the overall quality of the sample, we could perform simple
random sampling. On the other hand, if we were interested in the quality of work of each annota-
tor, we could again perform stratified sampling, where each annotator’s labeled pool of comments is
considered a stratum. From there, we can make a more informed decision about which annotators to
retain in order to improve the overall quality of the work.
7. Suppose you work for a news site that historically has translated only 1% of all its articles. Your coworker
argues that we should translate more articles into Chinese because translations help with the readership.
On average, your translated articles have twice as many views as your non-translated articles. What might
be wrong with this argument? (Hint: think about selection bias.)
Answer: Selection bias is the bias introduced by the selection of individuals, groups, or data for anal-
ysis in such a way that proper randomization is not achieved, thereby failing to ensure that the sample
obtained is representative of the population intended to be analyzed.
It might be the case that the 1% of translated articles are not a random sample from the population
of all articles, but only ones containing news appealing to readers in China.
Therefore, if the news site starts putting more resources into translating articles from the entire population (not only the ones that could be considered of interest to Chinese readers), the return on investment might not pay off, since not all articles will be of interest to readers in China.
8. How to determine whether two sets of samples (e.g. train and test splits) come from the same distribution?
Answer: The Kolmogorov–Smirnov test (K-S test or KS test) is a nonparametric test of the equality
of continuous one-dimensional probability distributions that can be used to compare a sample with a
reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S
test). In essence, the test answers the question ”What is the probability that this collection of samples
could have been drawn from that probability distribution?” or, in the second case, ”What is the prob-
ability that these two sets of samples were drawn from the same (but unknown) probability distribution?”.
The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of
the sample and the cumulative distribution function of the reference distribution, or between the empir-
ical distribution functions of two samples. The null distribution of this statistic is calculated under the
null hypothesis that the sample is drawn from the reference distribution (in the one-sample case) or that
the samples are drawn from the same distribution (in the two-sample case).
In order to determine whether two sets of samples come from the same distribution, the two-sample Kolmogorov–Smirnov statistic is computed as:

D_{n,m} = sup_x |F_{1,n}(x) − F_{2,m}(x)|

where F_{1,n} and F_{2,m} are the empirical distribution functions of the first and the second sample respectively, and sup is the supremum function (see Figure 4).
It should be noted that the K-S test is only applicable to one-dimensional data. There exist other
approaches, such as Least-Squares Density Difference (based on the least squares density-difference es-
timation method), Maximum Mean Discrepancy (a kernel-based technique for multivariate two-sample
testing), and many more.
Figure 4: Illustration of the two-sample Kolmogorov–Smirnov statistic. Red and blue lines each correspond to
an empirical distribution function, and the black arrow is the two-sample KS statistic.
(Source: Wikipedia)
9. How do you know you’ve collected enough samples to train your ML model?
Answer: In statistical learning theory, the sample complexity of a machine learning algorithm is the number of training samples that it needs in order to successfully learn a target function.
More precisely, the sample complexity is the number of training samples that we need to supply to the algorithm, so that the function returned by the algorithm is within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1.
If we place no restriction on the class of target functions, this sample complexity is infinite: no algorithm can learn an arbitrary function to arbitrary accuracy from finitely many samples. However, if we are only interested in a particular class of target functions (e.g., only linear functions), then the sample complexity is finite, and it depends linearly on the VC dimension of the class of target functions.
The Vapnik–Chervonenkis (VC) dimension is a measure of the capacity (complexity, expressive power,
richness, or flexibility) of a set of functions that can be learned by a statistical binary classification algo-
rithm. It is defined as the cardinality of the largest set of points that the algorithm can shatter, which
means the algorithm can always learn a perfect classifier for any labeling of at least one configuration of
those data points.
Nevertheless, in practice these theoretical bounds are rarely used or computed. A recent review arti-
cle suggests that the most prominent approach for sample size determination is the post hoc method of
fitting a learning curve. In essence, you take increasingly large subsets of the data, train the model, and
obtain the metric of interest (e.g. accuracy). From there, plot a function with the x axis being the sample
size and the y axis being the obtained accuracy, and then fit an exponential curve. The resulting curve
should approximate the needed number of samples in order to achieve the desired accuracy (see Figure
5).
(Source: Wikipedia, University of Alabama at Birmingham)
Figure 5: Example of an estimated sample size vs. accuracy curve.
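A minimal sketch of the learning-curve approach, fitting a saturating power law to accuracies measured on increasingly large subsets; the accuracy numbers, target value and function names below are purely illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical accuracies measured on increasingly large training subsets
sizes = np.array([100, 250, 500, 1000, 2500, 5000])
accuracy = np.array([0.62, 0.70, 0.76, 0.81, 0.85, 0.87])

def power_law(n, a, b, c):
    # Saturating learning curve: accuracy approaches c as n grows
    return c - a * n ** (-b)

params, _ = curve_fit(power_law, sizes, accuracy, p0=[1.0, 0.5, 0.9], maxfev=10_000)
a, b, c = params

# Extrapolate: roughly how many samples are needed for ~90% accuracy?
target = 0.90
n_needed = (a / (c - target)) ** (1 / b) if c > target else np.inf
print(f"estimated asymptote: {c:.3f}, samples needed for {target:.0%}: {n_needed:.0f}")
```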
10. How to determine outliers in your data samples? What to do with them?
3 Source: https://keras.io/examples/keras_recipes/sample_size_estimate/
Answer: In data analysis, anomaly detection (also referred to as outlier detection and sometimes as
novelty detection) is generally understood to be the identification of rare items, events or observations
which deviate significantly from the majority of the data and do not conform to a well defined notion of
normal behaviour. Such examples may arouse suspicions of being generated by a different mechanism, or
appear inconsistent with the remainder of that set of data.
Anomaly detection finds application in many domains including cyber security, medicine, machine vision, statistics, neuroscience, law enforcement and financial fraud, to name only a few. Historically, anomalies were identified so that they could be rejected or omitted from the data to aid statistical analysis, for example when computing the mean or standard deviation, or to improve the predictions of models such as linear regression; more recently, their removal aids the performance of machine learning algorithms. However, in many applications the anomalies themselves are the observations of greatest interest in the entire data set, and need to be identified and separated from noise or irrelevant outliers.
There are three broad categories of anomaly detection: supervised, semi-supervised and unsupervised.
Due to its wider and relevant applications, unsupervised anomaly detection is the most popular approach.
Many anomaly detection techniques have been proposed in the literature, some of which are:
• Statistical. A common approach is to assume Normality of the data and calculate the Z-score. Then,
given a certain threshold (e.g. 3), mark all samples above the threshold as outliers; in this example,
we essentially label points that are 3 standard deviations away from the mean as anomalies.
• Density-based techniques. A prominent example is to use k-NN in such a way that for each point one
computes the sum of the distances of the k nearest neighbors, denoted as weight. Then, outliers are
points that have the largest weight.
• Gaussian Anomaly Detection (Rippel et al.). Use a pre-trained model as a feature extractor, and estimate the empirical mean and covariance using the dataset at hand. Then, score each item in the dataset via the Mahalanobis distance, and define a cut-off threshold. (A small sketch of the Z-score and Mahalanobis approaches is given below.)
(Source: Wikipedia)
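A hedged NumPy sketch of the Z-score and Mahalanobis-distance approaches mentioned above; the data, thresholds and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))           # pretend these are extracted features
X[:10] += 8                              # inject a few obvious anomalies

# Statistical approach: per-feature Z-scores, flag points > 3 std devs away
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
z_outliers = np.where(z.max(axis=1) > 3)[0]

# Mahalanobis-distance approach: fit an empirical Gaussian and score each point
mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d = np.sqrt(np.einsum("ij,jk,ik->i", X - mu, cov_inv, X - mu))
maha_outliers = np.where(d > np.percentile(d, 99))[0]

print(len(z_outliers), len(maha_outliers))
```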
11. Sample duplication.
i. When should you remove duplicate training samples? When shouldn’t you?
Answer: While training a supervised learning algorithm, the usual assumptions are that:
1. Data points are independent and identically distributed
2. Training and testing data is sampled from the same distribution
If you believe that these assumptions hold for the problem you are working on, or are working with
structured data (e.g. classic tabular data) where duplicates can actually happen, you should not
throw out the duplicate data points. If your algorithm achieves a very low training error by learning
to be very good at some data that repeats a lot in the training set, it should achieve a similarly low
testing error because that same data point is going to repeat just as frequently (assumption #2).
On the other hand, if you believe that these 2 assumptions will not hold in your problem setting
(e.g. you expect a larger shift in distribution), or you are working on a problem with unstructured
data (e.g. text) where duplicates are highly improbable (and are likely due to an error in the ETL
pipeline) it might be a good idea to remove the repeated samples.
(Source: Quora)
ii. What happens if we accidentally duplicate every data point in your train set or in your test set?
Answer: There is not going to be any change in the optimal parameters θ. However, the training
process will take twice as long.
(Source: Quora)
12. Missing data
i. In your dataset, two out of 20 variables have more than 30% missing values. What would you do?
Answer: Datasets with missing values are typically incompatible with most statistical estimators, which assume that all inputs are numerical and that every value is present and meaningful.
A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing
missing values. However, this comes at the price of losing data which may be valuable (even though
incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known
part of the data. Two prominent approaches are: Univariate, and Multivariate feature imputation.
In the Univariate case, each missing value for a feature is substituted with either a constant value,
or a calculated statistic of the present values in the feature (mean, median, mode).
In the Multivariate case, one models each feature with missing values as a function of other fea-
tures, and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each
step, a feature column is designated as output y and the other feature columns are treated as inputs
X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values
of y. This is done for each feature in an iterative fashion, and then is repeated for a pre-defined
number of imputation rounds. The results of the final imputation round are returned.
(Source: Scikit)
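A small sketch of both strategies using scikit-learn's imputers (the toy matrix below is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Univariate: replace each missing value with the column mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Multivariate: model each feature from the others in a round-robin fashion
X_iter = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

print(X_mean)
print(X_iter)
```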
ii. How might techniques that handle missing data make selection bias worse? How do you handle this
bias?
Answer: By far, the most common means of dealing with missing data is listwise deletion (also
known as complete case), which is when all cases with a missing value are deleted. If the data are
missing completely at random, then listwise deletion does not add any bias, but it does decrease the
power of the analysis by decreasing the effective sample size.
If the cases are not missing completely at random, then listwise deletion will introduce bias, because the sub-sample of cases that remains after removing the missing data is not representative of the original sample (and if the original sample was itself a representative sample of a population, the complete cases are not representative of that population either). While listwise deletion is unbiased when the missing data is missing completely at random, this is rarely the case in reality.
One way of preventing the introduction of this bias is to use an imputation technique. This way, we
can retain the discussed samples and thereby not perpetuate the sampling bias.
(Source: Wikipedia)
In the statistical theory of design of experiments, randomization involves randomly allocating the exper-
imental units across the treatment groups. For example, if an experiment compares a new drug against
a standard drug, then the patients should be allocated to either the new drug or to the standard drug
control using randomization.
Randomized experimentation is not unsystematic. Randomization reduces bias by equalizing other fac-
tors that have not been explicitly accounted for in the experimental design. Randomization also produces
ignorable designs (the method of data collection and the nature of missing data do not depend on the
missing data), which are valuable in model-based statistical inference.
In the design of experiments, the simplest design for comparing treatments is the ”completely randomized
design”. Some ”restriction on randomization” can occur with blocking (arranging of experimental units
in groups that are similar to one another) and experiments that have hard-to-change factors. Additional
restrictions on randomization can occur when a full randomization is infeasible or when it is desirable to
reduce the variance of estimators of selected effects.
Randomization of treatment in clinical trials poses ethical problems. In some cases, randomization reduces
the therapeutic options for both physician and patient, and so randomization requires clinical equipoise –
an assumption that there is not a better treatment present for either the control or experimental group.
(Source: Wikipedia)
14. Class imbalance.
Answer: The accuracy paradox is the paradoxical finding that accuracy is not a good metric for
predictive models when classifying in predictive analytics. This is because a simple model may have
a high level of accuracy but be too crude to be useful. For example, if the incidence of category A
is dominant, being found in 99% of cases, then predicting that every case is category A will have an
accuracy of 99%.
(Source: Wikipedia)
ii. Why is it hard for ML models to perform well on data with class imbalance?
Answer: The loss functions that we train the models with typically do not explicitly account for
the class imbalance. In the case of Binary Cross Entropy, we have:
L = (1/n) Σ_{i=1}^{n} [ −y_i log(ŷ_i) − (1 − y_i) log(1 − ŷ_i) ]

where the first term is the cost for the minority class (y_i = 1) and the second term is the cost for the majority class (y_i = 0).
Therefore, when we train the model, the optimization procedure aims to lower the loss as much as
possible, resulting in disproportionate preference towards the more common class.
iii. Imagine you want to build a model to detect skin lesions from images. In your training dataset, only
1% of your images shows signs of lesions. After training, your model seems to make a lot more false
negatives than false positives. What are some of the techniques you’d use to improve your model?
Answer: One way to fix this problem is to use cost-sensitive learning by re-weighting the cost for
the majority and minority classes:
L = (1/n) Σ_{i=1}^{n} [ −w_1 y_i log(ŷ_i) − w_0 (1 − y_i) log(1 − ŷ_i) ]

where the first term is the cost for the minority class (y_i = 1) and the second term is the cost for the majority class (y_i = 0).
There are several possible heuristics for the weighting, with a popular one being:

w_j = n / (2 · n_j)
where n_j is the number of samples in class j, for j ∈ {0, 1}. Let us observe how the weighting changes in a balanced vs. an imbalanced class setting:

Balanced: n_1 = n/2, n_0 = n/2  ⟹  w_1 = n / (2 · n/2) = 1, w_0 = n / (2 · n/2) = 1, i.e. w_0 = w_1.
Imbalanced: n_1 = n/10, n_0 = 9n/10  ⟹  w_1 = n / (2 · n/10) = 5, w_0 = n / (2 · 9n/10) ≈ 0.55, i.e. w_1 > w_0.
In other words, when the classes are balanced, we default to the standard Binary Cross Entropy
Loss; otherwise, we put more weight on the minority class.
(See more here)
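A minimal NumPy sketch of the re-weighted Binary Cross Entropy described above, using the w_j = n / (2 · n_j) heuristic; the labels and predictions are illustrative.

```python
import numpy as np

def weighted_bce(y_true, y_pred, eps=1e-7):
    """Cost-sensitive binary cross-entropy with weights w_j = n / (2 * n_j)."""
    n = len(y_true)
    n1 = y_true.sum()                 # minority (positive) count
    n0 = n - n1                       # majority (negative) count
    w1, w0 = n / (2 * n1), n / (2 * n0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return np.mean(-w1 * y_true * np.log(y_pred)
                   - w0 * (1 - y_true) * np.log(1 - y_pred))

# 10% positives: the positive class gets weight 5, the negative class ~0.55
y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0] * 10)
p = np.full_like(y, 0.5, dtype=float)
print(weighted_bce(y, p))
```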
15. Training data leakage.
i. Imagine you’re working with a binary task where the positive class accounts for only 1% of your
data. You decide to oversample the rare class then split your data into train and test splits. Your
model performs well on the test split but poorly in production. What might have happened?
Answer: By oversampling when training offline, we force the model to put more weight on the rare
class, thereby increasing the recall, but decreasing the precision.
However, since the rare class remains just as rare in production (i.i.d. assumption), the results that we obtain on our test set are not representative of the results we can expect in production: by oversampling before the train/test split, we wrongfully introduce a shift between the test set and the data seen in production.
A better approach to the problem is to only oversample the train set after we have done the train/test
split – this way, we don’t introduce a distribution shift in the test set with respect to the data seen
during production, thereby obtaining more representative results.
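A small sketch of this recipe (split first, then oversample only the training portion) using scikit-learn utilities; the synthetic data and parameter values are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))
y = (rng.uniform(size=10_000) < 0.01).astype(int)   # ~1% positives

# 1) Split first, keeping the test set at the natural class ratio
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                           stratify=y, random_state=0)

# 2) Oversample the rare class *only* in the training split
pos = np.where(y_tr == 1)[0]
pos_upsampled = resample(pos, replace=True,
                         n_samples=(y_tr == 0).sum(), random_state=0)
idx = np.concatenate([np.where(y_tr == 0)[0], pos_upsampled])
X_tr_bal, y_tr_bal = X_tr[idx], y_tr[idx]

print(y_tr_bal.mean(), y_te.mean())   # ~0.5 in train, ~0.01 in test
```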
ii. You want to build a model to classify whether a comment is spam or not spam. You have a dataset
of a million comments over the period of 7 days. You decide to randomly split all your data into the
train and test splits. Your co-worker points out that this can lead to data leakage. How?
Answer: This is a case of time leakage. Notice that our dataset has a temporal component: the trend of which comments turn into spam changes over time. We introduce leakage by splitting this dataset randomly; instead, we should put the newer data in the test set and the older data in the train set. That way, we don’t “cheat” by looking into the future of the trends, and we get a realistic estimate of the model’s performance once it is deployed in production.
(See more here)
16. How does data sparsity affect your models? (Hint: Sparse data is different from missing data.)
Answer: One implicit assumption of many models is that our data is in some sense “nice” – the output Y varies smoothly with the input X, and the features show dependencies. Sparse data often satisfies neither of these assumptions, meaning that the model will have a hard time finding a smooth function to fit. In turn, this will cause the model to overfit the training set and display poor generalization performance when deployed in production.
Naturally, another common issue that can arise from sparse features is the increase in space and time
complexity. Models will require more coefficients, and storing the dataset in memory can become in-
tractable.
17. Feature leakage
Typically, one does not apply intricate pre-processing techniques for the continuous variables, except
for normalization/standardization.
Categorical variables on the other hand are a bit more intricate and require special care. Neural nets
cannot handle textual features, so we have to transform them to numerical ones. However, simply enu-
merating the categories is plain wrong – if you represent ”apples” with 1, and ”oranges” with 2, does it
mean that ”oranges” = 2 x ”apples”? Another way to encode the categorical features is with one-hot
encoding, but this can introduce data sparsity, which can be an undesired trait of our dataset as we have
seen in a previous question.
Therefore, one of the most widely used methods for encoding textual features is to use word embeddings, such as Word2Vec. The benefit of using an embedding is that it provides a low-dimensional, distributed representation that allows for capturing relationships between the categories (eg. ”king” - ”man” + ”woman” = ”queen”)
(See more here)
20. Your model has been performing fairly well using just a subset of features available in your data. Your
boss decided that you should use all the features available instead. What might happen to the training
error? What might happen to the test error?
Answer: Due to the curse of dimensionality, as we use more dimensions to describe our data, the more
sparse the feature space becomes. The data points become further from each other, which results in loss
of smoothness of the output Y with respect to the input X, as well as degradation of feature dependencies.
Moreover, recall that as we increase the dimensionality of the data, the number of needed training ex-
amples increases exponentially. However, in the problem statement it is said that we merely increase the
number of features, but keep the number of samples the same! Nevertheless, given sufficient capacity, the
model can in fact learn a decision boundary that better separates the training data.
However, since we break the smoothness assumptions of the data, the model will actually overfit! In
turn, this will result in poor generalization performance on unseen data, or more precisely – an increase
in the test error.
This is where Occam’s razor can come in handy – when the simple explanation and complex explanation
both work equally well, the simple explanation is usually correct.
4 Source: https://laptrinhx.com/boost-your-network-performance-1920317541/
Figure 6: Loss curves for overfitting and underfitting.
The mean squared error of an estimator decomposes as MSE[θ̂, θ] = Bias[θ̂, θ]² + Var[θ̂], where:

MSE[θ̂, θ] = E_D[(θ̂ − θ)²]
Bias[θ̂, θ] = E_D[θ̂ − θ] = E_D[θ̂] − θ
Var[θ̂] = E_D[(E_D[θ̂] − θ̂)²]
Figure 7: Bias–variance tradeoff.
(Source: Wikipedia)
ii. How’s this tradeoff related to overfitting and underfitting?
Answer: The relationship to overfitting and underfitting is as follows:
• The bias error is an error from erroneous assumptions in the learning algorithm. High bias can
cause an algorithm to miss the relevant relations between features and target outputs (underfit-
ting).
• The variance is an error from sensitivity to small fluctuations in the training set. High variance
may result from an algorithm modeling the random noise in the training data (overfitting).
See also Figure 8.
Figure 8: The relationship of bias and variance to overfitting and underfitting. Left: High variance (overfitting);
Right: High bias (underfitting).
(Source: Wikipedia)
iii. How do you know that your model is high variance, low bias? What would you do in this case?
Answer: In the high-variance, low-bias regime, the model performs well on the training set but not on the test set. In this case one can: get more training examples, try a smaller set of features, reduce the number of model parameters, or increase the regularization parameter λ.
iv. How do you know that your model is low variance, high bias? What would you do in this case?
Answer: In the low-variance, high-bias regime, the model fails to even learn the data from the training set. In this case one can: add more features, increase the number of model parameters, or decrease the regularization parameter λ.
4. Cross-validation.
5 Source: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
i. Explain different methods for cross-validation.
Answer: Cross-validation is a model validation technique for assessing how the results of a statistical
analysis will generalize to an independent data set. Cross-validation is a resampling method that
uses different portions of the data to test and train a model on different iterations. It is mainly used
in settings where the goal is prediction, and one wants to estimate how accurately a predictive model
will perform in practice. In a prediction problem, a model is usually given a dataset of known data on
which training is run (training dataset), and a dataset of unknown data (or first seen data) against
which the model is tested. The goal of cross-validation is to test the model’s ability to generalize to
an independent dataset (i.e., an unknown dataset, for instance from a real problem).
However, if we merely have a train/test split and evaluate the effectiveness of the chosen hyperparameters on the test set directly, we will get an overoptimistic estimate of the model’s performance, since we are overfitting to that particular test set. Therefore, we typically split the dataset into train/val/test sets, such that we tune our hyperparameters on the validation set, and evaluate the final model only once on the test set in order to get a realistic estimate of how the model will behave in production. Common cross-validation schemes include k-fold cross-validation (partition the data into k folds, train on k − 1 folds and validate on the held-out fold, rotating over all folds), leave-one-out cross-validation (k equal to the number of samples), and stratified k-fold (folds preserve the class proportions).
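A short scikit-learn sketch of k-fold cross-validation; the dataset and model are illustrative stand-ins.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold CV: each fold is used exactly once as the validation set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```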
iii. Your model’s loss curves on the train, valid, and test sets look like this (see Figure 9). What might
have been the cause of this? What would you do?
Answer: It might be the case that the train set provides good enough coverage for the test distri-
bution, but not for the entirety of the validation distribution. The cause might have been a bug in
our splitting procedure, or we might simply have been unlucky with the random draws. A quick fix
would be to check the implementation again, and perform a new split with a different random seed.
6. Your team is building a system to aid doctors in predicting whether a patient has cancer or not from their
X-ray scan. Your colleague announces that the problem is solved now that they’ve built a system that
can predict with 99.99% accuracy. How would you respond to that claim?
Answer: Consider we have a dataset of 100,000 patients, where only 10 of them have been labeled as
having cancer.
Figure 9: Example of train, val, test curves.
A naive model that always predicts that a given patient doesn’t have cancer would have 99990/100000 = 99.99% accuracy, but is a totally useless model.
7. F1 score.
i. What’s the benefit of F1 over the accuracy?
Answer:
• F1 score. Pro: takes into account how the data is distributed; useful when you have imbalanced classes. Con: less interpretable, since it is merely a trade-off between precision and recall.
• Accuracy. Pro: easy to understand. Con: It does not take into account how the data is dis-
tributed.
(Source: StackExchange)
ii. Can we still use F1 for a problem with more than two classes? How?
Answer: For a multi-class classification problem, we don’t calculate a single overall F1 score directly. Instead, we calculate the F1 score per class in a one-vs-rest manner, and then aggregate the per-class scores, e.g. via macro- or micro-averaging (see the sketch below).
(See more here.)
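A quick scikit-learn illustration of per-class and averaged F1 scores on made-up labels:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print(f1_score(y_true, y_pred, average=None))      # one F1 per class (one-vs-rest)
print(f1_score(y_true, y_pred, average="macro"))   # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average="micro"))   # computed from global TP/FP/FN counts
```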
8. Given a binary classifier that outputs the following confusion matrix (Table 1).
9. Consider a classification where 99% of data belongs to class A and 1% of data belongs to class B.
i. If your model predicts A 100% of the time, what would the F1 score be? (Hint: The F1 score when
A is mapped to 0 and B to 1 is different from the F1 score when A is mapped to 1 and B to 0.)
Answer: First, let us suppose that B is mapped as the positive class, and A as the negative. Since the model always predicts A (the negative class), we have TP = 0, FP = 0, FN = 0.01n and TN = 0.99n, so that:

Precision = TP / (TP + FP) is undefined (0/0)
Recall = TP / (TP + FN) = 0

This implies that F1 is undefined (by convention it is often reported as 0). Conversely, if A is mapped as the positive class, then Precision = 0.99 and Recall = 1, giving F1 ≈ 0.995.
ii. If we have a model that predicts A and B at a random (uniformly), what would the expected F1 be?
Answer: Given that we choose each class with uniform probability, we obtain the following expected confusion matrix (taking A as the positive class):

              Predicted Pos   Predicted Neg
Actual Pos    0.5 · 0.99n     0.5 · 0.99n
Actual Neg    0.5 · 0.01n     0.5 · 0.01n

From this, Precision = 0.495n / (0.495n + 0.005n) = 0.99 and Recall = 0.5, so the expected F1 is approximately 2 · 0.99 · 0.5 / (0.99 + 0.5) ≈ 0.66.
Let us see why using MSE in logistic regression leads to a non-convex optimization problem. We have the
following computation graph:
L(y, ŷ) = (y − ŷ)²,   ŷ = 1 / (1 + exp(−z)),   z = θx
Now, let us obtain the first and second derivatives:

∂L/∂θ = ∂L/∂ŷ · ∂ŷ/∂z · ∂z/∂θ
 = −2(y − ŷ) · ŷ(1 − ŷ) · x
 = −2x (y − ŷ)(ŷ − ŷ²)
 = −2x (yŷ − yŷ² − ŷ² + ŷ³)

∂²L/∂θ² = −2x [ yŷ(1 − ŷ)x − 2yŷ · ŷ(1 − ŷ)x − 2ŷ · ŷ(1 − ŷ)x + 3ŷ² · ŷ(1 − ŷ)x ]
 = −2x² ŷ(1 − ŷ) [ y − 2yŷ − 2ŷ + 3ŷ² ]

Since 2 ≥ 0, x² ≥ 0 and ŷ(1 − ŷ) ≥ 0, we can exclude these factors from the analysis of the sign of the second derivative. Taking, for instance, the case y = 0:

∂²L/∂θ² ∝ −(−2ŷ + 3ŷ²) = −3ŷ(ŷ − 2/3)

Now, notice that when ŷ ∈ (0, 2/3) then ∂²L/∂θ² ≥ 0, but when ŷ ∈ (2/3, 1) then ∂²L/∂θ² ≤ 0. Therefore, we proved that the MSE loss is not convex with respect to θ.
Now, let us see why using the Log-loss in logistic regression leads to a convex optimization problem.
We have the following computation graph:
L(y, ŷ) = −[y log(ŷ) + (1 − y) log(1 − ŷ)],   ŷ = 1 / (1 + exp(−z)),   z = θx
∂²L/∂θ² = x [ ŷ(1 − ŷ)x(1 − y) + y ŷ(1 − ŷ)x ]
 = x² ŷ(1 − ŷ) [ (1 − y) + y ]
 = x² ŷ(1 − ŷ)
 ≥ 0
In many circumstances it makes sense to give more weight to points further away from the mean. For
example, being off by 10 is more than twice as bad as being off by 5. In such cases RMSE is a more
appropriate measure of error.
If being off by 10 is just twice as bad as being off by 5, then MAE is more appropriate.
(Source: StackExchange)
12. Show that the negative log-likelihood and cross-entropy are the same for binary classification tasks.
Answer: I am not sure I understand the question 100%, so I will give an answer to the following instead (perhaps the author meant this): Show that the Binary Cross Entropy loss can be naturally derived from the maximum likelihood principle under the assumption that the likelihood of the output is a Bernoulli random variable:

p(y|x, w) = ŷ^y (1 − ŷ)^(1−y),   ŷ = f_w(x)

Maximizing this likelihood over the (i.i.d.) dataset is equivalent to minimizing the negative log-likelihood:

−log p(y|x, w) = −[y log(ŷ) + (1 − y) log(1 − ŷ)]

which, averaged over the dataset, is exactly the Binary Cross Entropy loss.
Perhaps my confusion stems from misunderstanding what exactly people mean when they say “log-loss”
or “negative log-likelihood” for a particular loss: every loss commonly used today is derived from the
maximum likelihood principle when we assume a certain form for the likelihood: Bernoulli likelihood →
Binary Cross-Entropy; Categorical likelihood → Categorical Cross Entropy; Gaussian likelihood → MSE;
Laplace likelihood → MAE. There is no one particular loss that is a “log-loss”, many of them are!!!
(/* end rant */)
13. For classification tasks with more than two labels (e.g. MNIST with 10 labels), why is cross-entropy a
better loss function than MSE?
Answer: Similarly as before, let us see how we can derive the Cross Entropy loss from the maximum
likelihood principle. Assume that the likelihood of the output is a Categorical random variable:
p(y|x, w) = ∏_{c=1}^{C} ŷ_c^{y_c},   ŷ = f_w(x)
where y ∈ R^C is a one-hot encoding of the true class. During training we wish to maximize the likelihood, which is equivalent to minimizing the negative log-likelihood:

−log p(y|x, w) = −Σ_{c=1}^{C} y_c log(ŷ_c)

which results in minimizing the Cross Entropy loss.
In contrast, now assume that the likelihood of the output is a Normal random variable:

p(y|x, w) = (1 / √(2πσ²)) exp(−(y − ŷ)² / (2σ²)),   ŷ = f_w(x)

whose negative log-likelihood is proportional to (y − ŷ)², i.e. maximizing it corresponds to minimizing the MSE loss.
When doing a multi-class classification problem, the assumption about the likelihood being a Categorical
random variable as opposed to a Normal random variable is more natural, hence the common use of the
Cross Entropy loss function.
In fact, the same line of reasoning can be applied for answering Question 10 of this section.
14. Consider a language with an alphabet of 27 characters. What would be the maximal entropy of this
language?
Answer: The entropy is maximized when we have the least amount of knowledge about predicting the next character in a sequence, given the previous characters. This corresponds to a uniform distribution over the alphabet, which implies that the probability of every character is 1/27. The maximal entropy is therefore H = −Σ (1/27) log₂(1/27) = log₂(27) ≈ 4.75 bits per character.
15. A lot of machine learning models aim to approximate probability distributions. Let’s say P is the distri-
bution of the data and Q is the distribution learned by our model. How do measure how close Q is to
P?
Answer: In mathematical statistics, the Kullback–Leibler divergence, denoted KL [P ∥Q], is a type of
statistical distance: a measure of how one probability distribution P is different from a second, reference
probability distribution Q:
KL[P ∥ Q] = Σ_{x∈X} P(x) log(P(x) / Q(x))
A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q
as a model when the actual distribution is P .
While it is a distance, it is not a metric, the most familiar type of distance: it is not symmetric in
the two distributions, and does not satisfy the triangle inequality:
We will give a counter-example to prove that the triangle inequality is not satisfied. Let X = {0, 1} and:

P(0) = 1/2,  P(1) = 1/2
R(0) = 3/4,  R(1) = 1/4
Q(0) = 1/10, Q(1) = 9/10

One can check that (in nats) KL[R ∥ P] ≈ 0.131 and KL[P ∥ Q] ≈ 0.511, while KL[R ∥ Q] ≈ 1.191 > 0.131 + 0.511, so KL[R ∥ Q] > KL[R ∥ P] + KL[P ∥ Q] and the triangle inequality is violated.
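A quick NumPy check of the asymmetry and the triangle-inequality violation for the distributions above:

```python
import numpy as np

def kl(p, q):
    """KL[P || Q] for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

P = [0.5, 0.5]
R = [0.75, 0.25]
Q = [0.1, 0.9]

print(kl(P, Q), kl(Q, P))                  # not symmetric
print(kl(R, Q), kl(R, P) + kl(P, Q))       # triangle inequality fails: left > right
```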
X Y Z P(X, Y, Z)
0 0 0 0.27000
0 0 1 0.00930
0 1 0 0.21500
0 1 1 0.10750
1 0 0 0.09690
1 0 1 0.19380
1 1 0 0.05375
1 1 1 0.05375
Now imagine we have observed Y = 0, and are interested in the most probable value for X. While we are not interested in Z, it is a quantity that is part of our distribution, hence we cannot ignore it completely.
If we want to perform Maximum A Posteriori estimation, then we marginalize over Z:
P(X = 0 | Y = 0) = P(X = 0, Y = 0) / P(Y = 0)
 = [P(X=0, Y=0, Z=0) + P(X=0, Y=0, Z=1)] / [P(X=0, Y=0, Z=0) + P(X=0, Y=0, Z=1) + P(X=1, Y=0, Z=0) + P(X=1, Y=0, Z=1)]
 = 0.2793 / 0.57
 = 0.49
Therefore, according to the MAP principle, the most likely value for X is 1, given that Y = 0.
Alternatively, if we go down the Maximum Probable Explanation route, we have to maximize over
all pairs of conditional probabilities where Y = 0:
P(X = 0, Z = 0 | Y = 0) = P(X=0, Y=0, Z=0) / P(Y=0) = 0.27 / 0.57 = 0.4736
P(X = 0, Z = 1 | Y = 0) = P(X=0, Y=0, Z=1) / P(Y=0) = 0.0093 / 0.57 = 0.0163
P(X = 1, Z = 0 | Y = 0) = P(X=1, Y=0, Z=0) / P(Y=0) = 0.0969 / 0.57 = 0.17
P(X = 1, Z = 1 | Y = 0) = P(X=1, Y=0, Z=1) / P(Y=0) = 0.1938 / 0.57 = 0.34
From the 4 pairs, we maximize the (joint) conditional when X = 0 (and Z = 0).
With this, we showed that MAP and MPE can give different solutions, given a same observation.
(See more here)
17. Suppose you want to build a model to predict the price of a stock in the next 8 hours and that the
predicted price should never be off more than 10% from the actual price. Which metric would you use?
Answer: I am not sure we can define a metric that guarantees the predicted price is never off by more than 10%, but using the Mean Absolute Percentage Error we can check whether our model, on average, is not off by more than 10%.
The Mean Absolute Percentage Error (MAPE) is a measure of prediction accuracy of a forecasting method
in statistics. It usually expresses the accuracy as a ratio defined by the formula:
MAPE = (100% / n) Σ_{t=1}^{n} |A_t − F_t| / A_t

where A_t is the actual value and F_t is the forecast value.
(See more here)
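A tiny NumPy sketch of MAPE on made-up price data:

```python
import numpy as np

def mape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

actual = np.array([100.0, 102.0, 98.0, 105.0])
forecast = np.array([101.0, 99.0, 100.0, 110.0])
print(mape(actual, forecast))   # average percentage error of the predictions
```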
2. What happens if we don’t apply feature scaling to logistic regression?
Answer: Standardization isn’t required for logistic regression. The main goal of standardizing features
is to help convergence of the technique used for optimization.
However, if one uses regularization (eg. Lasso or Ridge), then standardization is required since the
obtained solutions are not equivariant under scaling the input.
(Source: StackExchange)
3. What are the algorithms you’d use when developing the prototype of a fraud detection model?
Answer: Decision Trees, Random Forests, Gradient Boosting Machine, Support Vector Machine, K-
Nearest Neighbors, . . .
4. Feature selection.
i. Why do we use feature selection?
Answer: Feature selection is the process of selecting a subset of relevant features for use in model
construction. Feature selection techniques are used for several reasons:
• simplification of models to make them easier to interpret by users
• shorter training times
• to avoid the curse of dimensionality
The central premise when using a feature selection technique is that the data contains some features
that are either redundant or irrelevant, and can thus be removed without incurring much loss of
information. Redundant and irrelevant are two distinct notions, since one relevant feature may be
redundant in the presence of another relevant feature with which it is strongly correlated.
Feature selection techniques should be distinguished from feature extraction. Feature extraction
creates new features from functions of the original features, whereas feature selection returns a sub-
set of the features. Feature selection techniques are often used in domains where there are many
features and comparatively few samples. Archetypal cases for the application of feature selection
include the analysis of written texts and DNA microarray data, where there are many thousands of
features, and a few tens to hundreds of samples.
ii. What are some of the algorithms for feature selection? Pros and cons of each.
Answer: There are three main categories of feature selection techniques:
• Wrapper methods use a predictive model to score feature subsets. Each new subset is used to
train a model, which is tested on a hold-out set. Counting the number of mistakes made on that
hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods
train a new model for each subset, they are very computationally intensive, but usually provide
the best performing feature set for that particular type of model or typical problem. Common
approaches are greedy forward (start with the best performing variable against the target, and
iteratively add new ones), and greedy backward (start with all features, and iteratively truncate
one by one) selection.
• Filter methods use a proxy measure instead of the error rate to score a feature subset. This
measure is chosen to be fast to compute, while still capturing the usefulness of the feature set.
Common measures include the mutual information, Pearson correlation coefficient, Relief-based
algorithms, inter/intra class distance, and the scores of significance tests for each class/feature
combinations. Filters are usually less computationally intensive than wrappers, but they produce
a feature set which is not tuned to a specific type of predictive model. This lack of tuning means
a feature set from a filter is more general than the set from a wrapper, usually giving lower
prediction performance than a wrapper. However the feature set doesn’t contain the assumptions
of a prediction model, and so is more useful for exposing the relationships between the features.
Many filters provide a feature ranking rather than an explicit best feature subset, and the cut
off point in the ranking is chosen via cross-validation.
• Embedded methods perform feature selection as part of the model construction process. The
exemplar of this approach is the LASSO method for constructing a linear model, which penalizes
the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which
have non-zero regression coefficients are ’selected’ by the LASSO algorithm.
(Source: Wikipedia)
5. k-means clustering.
i. How would you choose the value of k?
Answer: While there is no universally optimal way to set the value of k, there are two approaches
that have empirically worked well in practice:
• Elbow Curve Method. The idea is to perform k-means clustering for several values of k, plot the
sum of squared distances of samples to their closest cluster center, and pick the first value of k
for which the aforementioned metric plateaus (see Figure 10).
Figure 10: Choosing the optimal k according to the Elbow method.
• Silhouette coefficient. Similarly as before, the idea is to perform k-means clustering for several
values of k, plot the mean Silhouette coefficient (see below), and pick the value of k for which
the metric is maximized.
The Silhouette coefficient is calculated as the (normalized) difference between the mean nearest-cluster distance (b) and the mean intra-cluster distance (a):

Silhouette score = (b − a) / max{b, a}

When a < b, the score simplifies to:

Silhouette score = (b − a) / max{b, a} = (b − a) / b = 1 − a/b

which approaches the best value of 1 when the ratio a/b is minimized, i.e. by minimizing the within-cluster distances or maximizing the distance to the nearest cluster. (A small sketch of both criteria is given below.)
(Source: SciKit)
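A short scikit-learn sketch that evaluates both criteria over a range of k values; the synthetic blobs and range of k are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the closest centroid (elbow curve)
    sil = silhouette_score(X, km.labels_)
    print(k, round(km.inertia_, 1), round(sil, 3))
# Pick k at the "elbow" of the inertia values, or where the silhouette peaks.
```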
ii. If the labels are known, how would you evaluate the performance of your k-means clustering algo-
rithm?
Answer: Given that labels are known, we can evaluate the performance of our clustering with:
6 Source: https://towardsdatascience.com/clustering-metrics-better-than-the-elbow-method-6926e1f723a6
• Purity. Purity is a measure of the extent to which clusters contain a single class. Its calculation
can be thought of as follows: For each cluster, count the number of data points from the most
common class in said cluster. Now take the sum over all clusters and divide by the total number
of data points. Formally, given some set of clusters M and some set of classes C, both partitioning
N data points, the metric can be defined as follows:
Purity = (1/N) Σ_{m∈M} max_{c∈C} |m ∩ c|
This measure doesn’t penalize having many clusters, and more clusters will make it easier to
produce a high purity. A purity score of 1 is always possible by putting each data point in its
own cluster. Also, purity doesn’t work well for imbalanced data, where even poorly performing
clustering algorithms will give a high purity value. For example, if a size 1000 dataset consists
of two classes, one containing 999 points and the other containing 1 point, then every possible
partition will have a purity of at least 99.9%.
• Rand index. The Rand index computes the fraction of correct pairwise assignments between the
clustering output and the ground truth. Formally:
RI = (TP + TN) / (TP + FP + FN + TN)
where T P is the number of pairs of points that are clustered together in the predicted partition
and in the ground truth partition, F P is the number of pairs of points that are clustered together
in the predicted partition but not in the ground truth partition, etc. If the dataset is of size N, then TP + FP + FN + TN = C(N, 2) = N(N − 1)/2.
iii. How would you do it if the labels aren’t known?
Answer: If the labels are not known, as we have seen before, we can compute the Silhouette
coefficient to get a better idea of the goodness of cluster assignment. Moreover, let us introduce 2
more techniques:
• Davies–Bouldin index. The Davies–Bouldin index can be calculated by the following formula:

DB = (1/M) Σ_{m=1}^{M} max_{k≠m} (σ_m + σ_k) / d(c_m, c_k)
where M is the number of clusters, cm is the centroid of cluster m, σm is the average distance of
all elements in cluster m to the centroid cm , and d(cm , ck ) is the distance between the centroids
cm and ck . Since algorithms that produce clusters with low intra-cluster distances (high intra-
cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low
Davies–Bouldin index, the clustering algorithm that produces a collection of clusters with the
smallest Davies–Bouldin index is considered the best algorithm based on this criterion.
• Dunn index. The Dunn index aims to identify dense and well-separated clusters. It is defined as
the ratio between the minimal inter-cluster distance to maximal intra-cluster distance. For each
cluster partition, the Dunn index can be calculated by the following formula:
D = min_{1≤m<k≤M} d(m, k) / max_{1≤t≤M} d′(t)
where d(m, k) represents the distance between clusters m and k, and d′ (t) measures the intra-
cluster distance of cluster t. The inter-cluster distance d(m, k) between two clusters may be any
number of distance measures, such as the distance between the centroids of the clusters. Similarly,
the intra-cluster distance d′ (t) may be measured in a variety ways, such as the maximal distance
between any pair of elements in cluster t. Since internal criterion seek clusters with high intra-
cluster similarity and low inter-cluster similarity, algorithms that produce clusters with high
Dunn index are more desirable.
iv. Given the following dataset (see Figure 11), can you predict how K-means clustering works on it?
Explain.
Answer: K-means is good at finding clusters when they have spherical shapes. In this example only one of the clusters has a spherical shape, meaning that k-means will likely fail to recover what a human would consider to be the natural clusters. Given that we set k = 2,
depending on initialization, we could end up with something along the lines of Figure 12.
(See more here.)
Figure 11: K means example
On the other hand, using higher k results in less flexibility, therefore high bias, but low variance
(different draws don’t result in differently learned functions).
(Source: StackExchange)
a set of clusters {V_1, V_2, . . . , V_K} s.t.:

∪_{i=1}^{K} V_i = X   and   ∀i, j ≠ i: V_i ∩ V_j = ∅
K-means minimizes the aforementioned objective by first initializing µk for all k, and then iterating
between:
1. Assign step. For all n, compute z_n (given µ):

z_nk = 1 if k = argmin_j ∥x_n − µ_j∥²₂, and z_nk = 0 otherwise.

2. Update step. For all k, recompute the centroid µ_k (given z) as the mean of its assigned points:

µ_k = Σ_n z_nk x_n / Σ_n z_nk
We terminate the algorithm when assignments do not change anymore. Convergence to a local op-
timum is guaranteed since each assign step decreases the cost. Convergence to a globally optimal
solution is not guaranteed.
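A compact NumPy sketch of the assign/update loop described above (initialization and tie-handling are kept deliberately simple):

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    """Plain k-means (Lloyd's algorithm): alternate assign and update steps."""
    rng = np.random.default_rng() if rng is None else rng
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initialize centroids
    for _ in range(n_iters):
        # Assign step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)
        z = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_mu = np.array([X[z == j].mean(axis=0) if np.any(z == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break                                        # assignments stabilized
        mu = new_mu
    return z, mu

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)
```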
Interestingly, we can arrive at the above-mentioned optimization objective from the maximum like-
lihood principle, by assuming that the data is i.i.d, and that the clusters correspond to a Gaussian
distribution with mean µk and I (identity) covariance matrix (resulting in spherical clusters):
p(x_n | z, µ) = ∏_{k=1}^{K} [N(x_n | µ_k, I)]^{z_nk}

⟹ p(X | z, µ) = ∏_{n=1}^{N} ∏_{k=1}^{K} [N(x_n | µ_k, I)]^{z_nk}

⟹ −log p(X | z, µ) = Σ_{n=1}^{N} Σ_{k=1}^{K} z_nk · ½∥x_n − µ_k∥²₂ + const ∝ Σ_{n=1}^{N} Σ_{k=1}^{K} z_nk ∥x_n − µ_k∥²₂
θ := { {µ_k}_{k=1}^{K}, {Σ_k}_{k=1}^{K}, {π_k}_{k=1}^{K} }
Similarly as before, the goal is to find θ that maximizes the log-likelihood of the data X:

log p(X | θ) = Σ_{n=1}^{N} log Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k)
However, finding an analytical expression that directly maximizes the log-likelihood is hard, since
we cannot push the log inside the sum. For this reason, we turn to an iterative technique called
Expectation Maximization (EM), which consists of two steps:
• E-step: missing data are estimated given observed data and current estimate of the parameters.
• M-step: likelihood function is maximized under the assumption that the missing data are known.
EM is guaranteed to increase the likelihood at each iteration, although might end in a local optimum.
Going back to our clustering problem, let us expand the posterior probability that a sample n
belongs to a cluster k (which is in fact, the optimal cluster assignment):
q_nk := p(z_n = k | x_n, θ) = p(z_n = k, x_n, θ) / p(x_n, θ)
 = p(z_n = k, x_n | θ) / p(x_n | θ)
 = p(z_n = k, x_n | θ) / Σ_{j=1}^{K} p(z_n = j, x_n | θ)
 = p(z_n = k | θ) p(x_n | z_n = k, θ) / Σ_{j=1}^{K} p(z_n = j | θ) p(x_n | z_n = j, θ)
 = π_k N(x_n | µ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | µ_j, Σ_j)
So, if we know µ, Σ and π, we have a closed form solution for the posterior q.
On the other hand, by computing a derivative of the log-likelihood mentioned above, and setting
it to zero, we obtain the optimal parameters for µ, Σ and π:
µ_k = Σ_{n=1}^{N} q_nk x_n / Σ_{n=1}^{N} q_nk
Σ_k = Σ_{n=1}^{N} q_nk (x_n − µ_k)(x_n − µ_k)ᵀ / Σ_{n=1}^{N} q_nk
π_k = (1/N) Σ_{n=1}^{N} q_nk
However, this requires us to know the optimal q. Therefore, we are in a chicken-and-egg type prob-
lem: we need µ, Σ, π to estimate the posterior cluster assignments qnk ; but also we need qnk in order
to estimate µ, Σ, π.
This leads us to the iterative EM solution for the GMM algorithm. At time t perform:
1. E-step (assign): compute cluster assignments (given the current estimate of θ):

q_nk^t = π_k^t N(x_n | µ_k^t, Σ_k^t) / Σ_{j=1}^{K} π_j^t N(x_n | µ_j^t, Σ_j^t)
2. M-step (update): update the cluster parameters µ_k^{t+1}, Σ_k^{t+1}, π_k^{t+1}:

µ_k^{t+1} = Σ_{n=1}^{N} q_nk^t x_n / Σ_{n=1}^{N} q_nk^t
Σ_k^{t+1} = Σ_{n=1}^{N} q_nk^t (x_n − µ_k^{t+1})(x_n − µ_k^{t+1})ᵀ / Σ_{n=1}^{N} q_nk^t
π_k^{t+1} = (1/N) Σ_{n=1}^{N} q_nk^t
(Source: CAIS)
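In practice one would typically reach for sklearn.mixture.GaussianMixture; the following is a bare-bones NumPy/SciPy sketch of the EM loop described above, with simplified initialization and no stopping criterion.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iters=100, rng=None):
    """A bare-bones EM loop for a Gaussian Mixture Model."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]
    sigma = np.array([np.cov(X, rowvar=False) for _ in range(k)])
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: posterior responsibilities q_nk
        q = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                      for j in range(k)], axis=1)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate mu, sigma, pi from the responsibilities
        nk = q.sum(axis=0)
        mu = (q.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (q[:, j, None] * diff).T @ diff / nk[j]
        pi = nk / n
    return mu, sigma, pi, q

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 4])
mu, sigma, pi, q = gmm_em(X, k=2)
print(mu, pi)
```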
ii. When would you choose one over another?
Answer: In the general case, it is advisable to default to using GMMs, since they generalize better to various cluster shapes. However, given prior knowledge that the clusters are supposed to have spherical shapes, it might be beneficial to use K-Means, as it explicitly encodes this property.
8. Bagging and boosting are two popular ensembling methods. Random forest is a bagging example while
XGBoost is a boosting example.
i. What are some of the fundamental differences between bagging and boosting algorithms?
Answer: With bagging, instead of training one model on the entire dataset, we sample with re-
placement to create several different datasets, and train a separate model on each of the bootstraps.
Sampling with replacement ensures each bootstrap is independent of its peers.
In contrast, with boosting we train several models iteratively (one by one) on the entire dataset.
However, after training a model, we adjust each sample’s weight based on the current ensemble per-
formance, and train the next model on the re-weighted dataset.
With that said, the fundamental differences between bagging and boosting are:
• Bagging can train all models in parallel (fast), but each one of them is independent of how well
the others are doing.
• Boosting trains the models iteratively (slow), but each next model learns from the mistakes of
the previous ones.
ii. How are they used in deep learning?
Answer: While in theory there is nothing stopping you from applying bagging and boosting to deep
neural networks, in practice it is almost infeasible.
Historically, these ensembling techniques have been introduced by combining many weak (slightly
above random performance), but inexpensive learners such as decision trees.
Today, deep neural networks are so large that even a single model is distributed among many nodes
for distributed training. To consider training hundreds of networks for the purposes of boosting/bag-
ging is almost unimaginable. Even if we somehow train them, at inference time the predictions can
take from seconds to several minutes, rendering them extremely impractical (e.g. for self-driving cars).
Moreover, it is known that neural networks are very sample inefficient, requiring huge datasets which
can take weeks to months to train on. This means that training in an iterative manner, such as in
boosting, could take months to years to finish.
9. Given this directed graph (see Figure 13).
i. Construct its adjacency matrix.
Answer:
0 1 0 1 1
0 0 1 1 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
Figure 13: Example of a directed graph
ii. How would this matrix change if the graph is now undirected?
Answer: We add edges in the opposite direction as well, making the matrix symmetric:
0 1 0 1 1
1 0 1 1 0
0 1 0 0 0
1 1 0 0 0
1 0 0 0 0
iii. What can you say about the adjacency matrices of two isomorphic graphs?
Answer: Let us quickly note that the (i, j)-th entry of a matrix M is equal to eTi M ej , where ei is
a vector with 1 in the i-th entry and 0 elsewhere.
Let GA and GB be two isomorphic graphs, whose adjacency matrices are A and B. Given that
isomorphic graphs are identical up to a permutation π, we want the (i, j)-th entry of A to be the
same as the (π(i), π(j))-th entry of B:

e_iᵀ A e_j = e_{π(i)}ᵀ B e_{π(j)} = (P e_i)ᵀ B (P e_j) = e_iᵀ (Pᵀ B P) e_j

where P is the matrix corresponding to the permutation π (so that P e_i = e_{π(i)}). For this to hold for all i and j, we must have:

A = Pᵀ B P
(Source: StackExchange)
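A tiny NumPy check of this identity on a toy directed graph; the graph and permutation are arbitrary examples.

```python
import numpy as np

A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])   # directed 3-cycle 0 -> 1 -> 2 -> 0
perm = [2, 0, 1]                                   # relabel nodes: i -> perm[i]
P = np.eye(3)[perm]                                # permutation matrix, rows e_{perm[i]}

B = P @ A @ P.T                                    # adjacency of the relabeled graph: B[i, j] = A[perm[i], perm[j]]
print(np.array_equal(A, P.T @ B @ P))              # True: A = P^T B P
```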
10. Imagine we build a user-item collaborative filtering system to recommend to each user items similar to
the items they’ve bought before.
i. You can build either a user-item matrix or an item-item matrix. What are the pros and cons of each
approach?
Answer: [TODO]
ii. How would you handle a new user who hasn’t made any purchases in the past?
Answer: [TODO]
11. Is feature scaling necessary for kernel methods?
Answer: In general, feature scaling is important for kernel methods. The kernel is effectively a distance
and if different features vary on different scales then this can matter. For instance, for the RBF kernel we
have:
K(x, x′) = exp(−γ ∥x − x′∥²)
so if one dimension takes much larger values than others then it will dominate the kernel values and you’ll
lose some signal in other dimensions. This applies to the linear kernel too.
Nevertheless, this doesn’t apply to all kernels, since some have scaling built in. A typical example is
the Mahalanobis kernel:
K(x, x′) = exp(−γ (x − x′)ᵀ Σ̂⁻¹ (x − x′))
12. Naive Bayes classifier.
i. How is Naive Bayes classifier naive?
Answer: Suppose we are performing classification, and want to determine the posterior probability for class y, given some input x_1, . . . , x_m. Then, using Bayes’ rule we would get:

p(y | x_1, . . . , x_m) = p(x_1, . . . , x_m | y) p(y) / p(x_1, . . . , x_m)
Naive Bayes is naive since it assumes conditional independence of the observations xi given the class
y:
p(y | x_1, . . . , x_m) = p(x_1, . . . , x_m | y) p(y) / p(x_1, . . . , x_m)  =(naive)=  p(x_1 | y) · · · p(x_m | y) p(y) / p(x_1, . . . , x_m)
Since we are typically only interested in comparing the unnormalized scores for the various classes,
notice that we can disregard the evidence term in the denominator:
p(y|x1 , . . . xm ) ∝ p(x1 |y) . . . p(xm |y)p(y)
Lastly, if we are performing text classification, and certain words are missing from the corpora for
the given class, notice that the entire product will result in 0, since p(xi |y) = 0. In this case, it is a
good idea to perform Laplace smoothing, by adding a pseudocount of 1 for each word.
ii. Let’s try to construct a Naive Bayes classifier to classify whether a tweet has a positive or negative
sentiment. We have four training samples (see Table 3). According to your classifier, what’s sentiment
of the sentence “The hamster is upset with the puppy”?
Tweet Label
This makes me so upset Negative
This puppy makes me happy Positive
Look at this happy hamster Positive
No hamsters allowed in my house Negative
Answer: In order to improve the robustness of Naive Bayes, it is a good idea to transform all words
to lowercase, remove stop words, and perform word stemming (remove plural form of nouns, tense
of verbs, etc.). After pre-processing the inputs, our resulting dataset is illustrated in Table 4.
Tweet | Label
make upset | Negative
puppy make happy | Positive
look happy hamster | Positive
hamster allow house | Negative
Moreover, the pre-processed input which we wish to classify becomes “hamster upset puppy”. Lastly,
notice that some of the words in the input sentence don’t exist for one of the classes (e.g. upset is not
present for the positive class); therefore, we will perform Laplace smoothing by adding a pseudocount
of 1 for all words in the corpus.
With that said, we specify the counts of each word given that they appear in the positive/negative
tweets (accounted for Laplace smoothing) in Table 5.
Finally, let us calculate the scores for the positive and negative class:
p(Positive|hamster upset puppy) ∝ p(hamster|Positive) p(upset|Positive) p(puppy|Positive) p(Positive)
= (2/14) · (1/14) · (2/14) · (1/2) ≈ 0.00073
word | positive count | negative count
make | 2 | 2
upset | 1 | 2
puppy | 2 | 1
happy | 3 | 1
look | 2 | 1
hamster | 2 | 2
allow | 1 | 2
house | 1 | 2
total | 14 | 13

Similarly, for the negative class:
p(Negative|hamster upset puppy) ∝ p(hamster|Negative) p(upset|Negative) p(puppy|Negative) p(Negative)
= (2/13) · (2/13) · (1/13) · (1/2) ≈ 0.00091
Since 0.00091 > 0.00073, the classifier predicts that the sentence "The hamster is upset with the puppy" has a Negative sentiment.
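For illustration, a minimal Python sketch of the computation above (using the pre-processed tweets from Table 4 and add-one Laplace smoothing; the structure of the code is just one possible way to organize it):

from collections import Counter

# Pre-processed training data (Table 4)
train = [
    ("make upset".split(), "Negative"),
    ("puppy make happy".split(), "Positive"),
    ("look happy hamster".split(), "Positive"),
    ("hamster allow house".split(), "Negative"),
]

vocab = {w for words, _ in train for w in words}
classes = {label for _, label in train}

# Per-class word counts and class priors
counts = {c: Counter() for c in classes}
priors = {c: 0.0 for c in classes}
for words, label in train:
    counts[label].update(words)
    priors[label] += 1 / len(train)

def score(words, c):
    # Unnormalized p(c) * prod_i p(w_i | c), with add-one smoothing
    total = sum(counts[c].values()) + len(vocab)   # 14 for Positive, 13 for Negative
    s = priors[c]
    for w in words:
        s *= (counts[c][w] + 1) / total
    return s

query = "hamster upset puppy".split()
print({c: round(score(query, c), 5) for c in classes})
# Positive ~ 0.00073, Negative ~ 0.00091  =>  predict Negative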
13. Two popular algorithms for winning Kaggle solutions are Light GBM and XGBoost. They are both
gradient boosting algorithms.
i. What is gradient boosting?
Answer: Gradient boosting is an idea that combines the techniques of gradient descent and boosting.
The idea is to fit an ensemble Σ_t ρ_t h_t(x) in a stage-wise manner, where in each stage we introduce
a weak learner to compensate for the shortcomings of the existing weak learners. In contrast to Adaboost
which identifies shortcomings by high-weight data points, Gradient Boosting identifies shortcomings
by gradients.
To illustrate Gradient Boosting, suppose we are doing least-squares regression, where the goal is
to teach a model F to predict values of the form ŷ = F (xi ) by minimizing the mean squared error:
L(y, ŷ) = (1/2n) Σ_{i=1}^{n} (ŷ_i − y_i)²
In gradient boosting, we approach this problem by building the final model in M stages. At each
stage m (1 ≤ m ≤ M ), suppose we have some imperfect model Fm which we wish to improve by
adding a new estimator h_m as follows:
F_{m+1}(x_i) = F_m(x_i) + h_m(x_i) = y_i
or equivalently:
h_m(x_i) = y_i − F_m(x_i)
Now, let us see how this algorithm is related to gradient descent. Recall that when training neural
nets, we minimize the objective by moving in the opposite direction of the gradient of the loss wrt.
the parameters:
θ_{t+1} := θ_t − ∂L/∂θ_t
In contrast, in Gradient Boosting we minimize the loss by instead adjusting the outputs F (xi ) for
each sample xi . In other words, we treat F (xi ) as parameters and take derivatives:
∂L/∂F(x_i) = ∂(Σ_i L(y_i, F(x_i)))/∂F(x_i) = ∂L(y_i, F(x_i))/∂F(x_i) = F(x_i) − y_i = −h(x_i)
Finally, notice how the improvement step of our current estimator resembles the gradient descent
update in neural nets:
F_{m+1}(x_i) = F_m(x_i) + h_m(x_i) = F_m(x_i) − ∂L/∂F_m(x_i)
It turns out that performing the update with gradients, instead of fitting the residual, is more
general and useful. In short, we can theoretically plug in any differentiable function for L, and inject
its properties when fitting the next model. For example, we can utilize the Huber loss which is more
robust to outliers:
L(y, ŷ) = (1/2)(ŷ − y)²   if |ŷ − y| ≤ δ
L(y, ŷ) = δ(|ŷ − y| − δ/2)   otherwise
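As an illustrative sketch (not the exact algorithm behind XGBoost or LightGBM), gradient boosting for the squared loss can be written as repeatedly fitting a small regression tree to the current residuals; the toy data, the learning rate lr and scikit-learn's DecisionTreeRegressor are just convenient assumptions:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

n_stages, lr = 100, 0.1
F = np.full_like(y, y.mean())       # F_0: constant initial model
trees = []

for m in range(n_stages):
    # For the squared loss, the negative gradient w.r.t. F(x_i) is the residual y_i - F(x_i)
    residuals = y - F
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F += lr * tree.predict(X)       # F_{m+1} = F_m + lr * h_m
    trees.append(tree)

print("train MSE:", np.mean((y - F) ** 2))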
In the linearly separable case, SVM is trying to find the hyperplane that maximizes the margin,
with the condition that both classes are classified correctly. But in reality, datasets are almost never
linearly separable, so the condition of 100% correctly classified by a hyperplane will never be met.
Answer: SVM would work as intended, by finding a linear decision boundary that maximizes the
margin between the two classes (see Figure 15).
iii. How well would vanilla SVM work on this dataset (see Figure 16)?
Answer: Again, SVM will still find a linear decision boundary that classifies all samples correctly.
However, since we don’t allow for any slackness, due to the one outlier, the boundary is more slanted
compared to the previous one (see Figure 17).
Figure 15: Solution to SVM example 1
iv. How well would vanilla SVM work on this dataset (see Figure 18)?
Answer: SVM won’t work at all, since the two classes are not linearly separable. As discussed
above, two potential solutions are: introducing slack variables; using non-linear kernels.
On the other hand, a language model is a probability distribution over sequences of words. Given such a
sequence of length m, a language model assigns a probability P (w1 , . . . , wm ) to the whole sequence. Lan-
guage models generate probabilities by training on text corpora, which are considered random samples
from the underlying population.
(Source: Wikipedia)
3. Language models are often referred to as unsupervised learning, but some say its mechanism isn’t that
different from supervised learning. What are your thoughts?
Answer: Language models are trained in a self-supervised manner, which can be considered an inter-
section point between supervised and unsupervised learning. By utilizing a large corpus of text scraped
from the Internet without any labels (unsupervised learning), we mask out some words/tokens and train
the model to predict the missing values (supervised learning, since we know the ”ground-truth”).
4. Word embeddings.
i. Why do we need word embeddings?
Answer: Neural nets cannot handle textual features, so we have to transform them to numerical
ones. However, simply enumerating the categories is plain wrong – if you represent ”apples” with
1, and ”oranges” with 2, does it mean that ”oranges” = 2 x ”apples”? Another way to encode the
categorical features is with one-hot encoding, but this can introduce data sparsity, which can be an
undesired trait of our dataset as we have seen in a previous question.
Therefore, one of the most widely used methods for encoding textual features is to use word embed-
dings, such as Word2Vec. The benefit of using embeddings is that they provide low-dimensional,
distributed representations that allow for capturing relationships between the categories (eg. "king"
- "man" + "woman" = "queen").
(See more here)
ii. What’s the difference between count-based and prediction-based word embeddings?
Answer: Words that appear in the same context or have semantic relevance have proximity in the
vector representation. There are two different ways vectors are represented from text: count-based
and predictive-based methods.
Count-based word embeddings start off by building a co-occurrence matrix, and then either use a
matrix decomposition method (e.g. SVD, see Figure 19), or an algorithm based on gradient descent
(e.g. GloVe) to extract the embeddings. The intuitive reason why matrix decomposition should
work is that the correlation (or relative co-occurrence) of the words in the text corpora represented
by the (i, j)-th entry in the co-occurrence matrix should be representable as a dot product (co-
variance) of the two embedding vectors. Similarly, GloVe optimizes for the dot products of pairs of
embeddings (scaled by the co-occurrence), but instead using gradient descent to generate the vectors.
Predictive-based methods instead use a 2-layer neural network to learn the word embeddings, by pre-
dicting which words appear in the nearby context for a given input word (see Figure 20). Intuitively,
if two words generally appear in the same context, they will result in similar word embeddings, since
this will minimize the loss function.
iii. Most word embedding algorithms are based on the assumption that words that appear in similar
contexts have similar meanings. What are some of the problems with context-based word embed-
dings?
(Source: https://www.youtube.com/watch?v=5PL0TmQhItY&ab_channel=macheads101)
Answer: Context size. Using too small of a context window can result in inferring fewer relationships
between the words, but using too large of a context window can result in every word being in the
context of the other words.
Co-occurrence doesn’t imply similarity. Just because words occur together in a sentence often doesn’t
necessarily mean that the two words have the same meaning. For example, it might be the case that
“really” and “like” occur frequently together, thereby being grouped closely in embedding space, but
the meanings of the words are rather different.
Polysemy. Traditionally, word embeddings don’t account for polysemy – words having multiple
meanings based on the context. For example, in the sentence ”The club I tried yesterday was
great!”, it is not clear if the term club is related to the word sense of a club sandwich, baseball club,
clubhouse, golf club, or any other sense that club might have.
5. Given 5 documents:
D1: The duck loves to eat the worm
D2: The worm doesn’t like the early bird
D3: The bird loves to get up early to get the worm
D4: The bird gets the worm from the early duck
D5: The duck and the birds are so different from each other but one thing they have in common is that
they both get the worm
i. Given a query Q: “The early bird gets the worm”, find the two top-ranked documents according to
the TF/IDF rank using the cosine similarity measure and the term set {bird, duck, worm, early, get,
love}. Are the top-ranked documents relevant to the query?
Table 6: TF, IDF and TF-IDF scores
       TF:  D1    D2    D3    D4    D5    Q    | IDF             | TF-IDF: D1    D2    D3    D4    D5    Q
bird        0     1/3   1/5   1/5   1/4   1/4  | log(5/4) = 0.22 |         0     0.07  0.04  0.04  0.05  0.05
duck        1/3   0     0     1/5   1/4   0    | log(5/3) = 0.51 |         0.17  0     0     0.1   0.12  0
worm        1/3   1/3   1/5   1/5   1/4   1/4  | log(5/5) = 0.00 |         0     0     0     0     0     0
early       0     1/3   1/5   1/5   0     1/4  | log(5/3) = 0.51 |         0     0.17  0.1   0.1   0     0.12
get         0     0     1/5   1/5   1/4   1/4  | log(5/3) = 0.51 |         0     0     0.1   0.1   0.12  0.12
love        1/3   0     1/5   0     0     0    | log(5/2) = 0.91 |         0.3   0     0.18  0     0     0
Answer: The calculations of the TF-IDF scores are given in Table 6. Recall that:
TF-IDF(t, D) = TF(t, D) · IDF(t)
TF(t, D) = freq(t ∈ D)
IDF(t) = log( |D| / |{d ∈ D : t ∈ d}| )
Now, we calculate the cosine similarities between the TF-IDF scores in order to obtain a ranking of
the document similarity:
cos (Q, D1) = 0
cos (Q, D2) = 0.73
cos (Q, D3) = 0.63
cos (Q, D4) = 0.82
cos (Q, D5) = 0.54
We obtain that the two most similar documents to Q are D4 and D2. However, one could argue
that those two documents are not most relevant to the query Q, since D3 and D5 are semantically
somewhat closer in meaning.
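A short NumPy sketch that reproduces this ranking from the TF-IDF vectors in Table 6 (small numerical differences, e.g. 0.82 vs. 0.83 for D4, come from the rounding of the table entries):

import numpy as np

# TF-IDF vectors from Table 6; order of terms: bird, duck, worm, early, get, love
tfidf = {
    "D1": [0,    0.17, 0, 0,    0,    0.3 ],
    "D2": [0.07, 0,    0, 0.17, 0,    0   ],
    "D3": [0.04, 0,    0, 0.1,  0.1,  0.18],
    "D4": [0.04, 0.1,  0, 0.1,  0.1,  0   ],
    "D5": [0.05, 0.12, 0, 0,    0.12, 0   ],
}
q = np.array([0.05, 0, 0, 0.12, 0.12, 0])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for name, vec in tfidf.items():
    print(name, round(cosine(q, np.array(vec)), 2))
# D4 and D2 come out as the two top-ranked documents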
ii. Assume that document D5 goes on to tell more about the duck and the bird and mentions “bird”
three times, instead of just once. What happens to the rank of D5? Is this change in the ranking of
D5 a desirable property of TF/IDF? Why?
Answer: [not sure] When one recalculates the TF-IDF scores for the modified document D5, it
becomes apparent that the rankings do not change.
6. Your client wants you to train a language model on their dataset but their dataset is very small with only
about 10,000 tokens. Would you use an n-gram or a neural language model?
Answer: Language models estimate the probability distribution over a sequence discrete tokens x =
(x1 , . . . , xT ) as follows:
p(x) = p(x1 , . . . , xT )
= p(x1 )p(x2 |x1 ) . . . p(xT |x1 , . . . xT −1 )
We can represent each conditional distribution with a probability table and learn the entries of these
tables, but this becomes intractable as T grows. For example, the probability table for p(x_T|x_1, ..., x_{T−1})
comprises |V|^T entries. Therefore, huge memory and training sets are needed.
Even if we use an n-gram, which is based on the Markov assumption – thereby reducing p(x_T|x_1, ..., x_{T−1})
to p(x_T|x_{T−n+1}, ..., x_{T−1}) – we still need tables as large as |V|^n.
Due to the exponential nature of the (non-parametric) n-gram model, we can easily suffer from the
curse of dimensionality. Therefore, it might be a good idea to go with a neural language model, which op-
erates in low-dimensional spaces and encodes smooth parametric functions for estimating the conditional
distribution (instead of using large tables). The advantage is that these models typically scale linearly
with |V | and n. Moreover, since neural models also operate on word embeddings (instead of orthogonal
one-hot encodings), modeling the relationships between the context and the target becomes an easier task.
Of course, the small dataset of 10,000 tokens might not be large enough to train a sufficiently strong
neural model, so it might be a good idea to try both approaches and see what works best.
7. For n-gram language models, does increasing the context length (n) improve the model’s performance?
Why or why not?
Answer: As we discussed above, by increasing n, the tables representing the conditional distributions
p(xT |xT −n+1 , . . . xT −1 ) grow exponentially. In order to obtain meaningful representations, we’d need
exponentially more datapoints, thereby falling into the trap of the curse of dimensionality.
8. What problems might we encounter when using softmax as the last layer for word-level language models?
How do we fix it?
Answer: This is already answered in Question 5, Section 7.2.
9. What’s the Levenshtein distance of the two words “doctor” and “bottle”?
Answer: The Levenshtein distance between two words is the minimum number of single-character edits
(i.e. insertions, deletions or substitutions) required to change one word into the other.
1. Change d into b
2. Change c into t
3. Change o into l
4. Change r into e
Hence, the Levenshtein distance between "doctor" and "bottle" is 4.
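For completeness, a standard dynamic-programming (Wagner-Fischer style) sketch in Python:

def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("doctor", "bottle"))  # 4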
10. BLEU is a popular metric for machine translation. What are the pros and cons of BLEU?
Answer: BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which
has been machine-translated from one natural language to another. Quality is considered to be the cor-
respondence between a machine’s output and that of a human: ”the closer a machine translation is to a
professional human translation, the better it is” – this is the central idea behind BLEU. BLEU was one
of the first metrics to claim a high correlation with human judgements of quality, and remains one of the
most popular automated and inexpensive metrics.
Scores are calculated for individual translated segments—generally sentences—by comparing them with a
set of good quality reference translations. Those scores are then averaged over the whole corpus to reach
an estimate of the translation’s overall quality. Intelligibility or grammatical correctness are not taken
into account.
BLEU’s output is always a number between 0 and 1. This value indicates how similar the candidate
text is to the reference texts, with values closer to 1 representing more similar texts. Few human trans-
lations will attain a score of 1, since this would indicate that the candidate is identical to one of the
reference translations. For this reason, it is not necessary to attain a score of 1. Because there are more
opportunities to match, adding additional reference translations will increase the BLEU score.
Language models are typically comprised of an embedding layer, followed by a number of Transformer
or LSTM layers, which are finally followed by a softmax layer. Embedding layers learn word represen-
tations, such that similar words (in meaning) are represented by vectors that are near each other (in
cosine distance). [Press & Wolf, 2016] showed that the softmax matrix, in which every word also has a
vector representation, also exhibits this property. This leads them to propose to share the softmax and
embedding matrices, which is done today in nearly all language models.
Additionally, [Press & Wolf, 2016] propose Three-way Weight Tying, a method for NMT models in which
the embedding matrix for the source language, the embedding matrix for the target language, and the
softmax matrix for the target language are all tied. That method has been adopted by the Attention Is
All You Need model and many other neural machine translation models.
(Source: PapersWithCode; See more here)
8.3 Computer vision
1. For neural networks that work with images like VGG-19, InceptionNet, you often see a visualization of
what type of features each filter captures. How are these visualizations created?
Answer: Feature visualization answers questions about what a network (or parts of it) is looking for by
generating examples. Neural networks are generally differentiable with respect to their inputs. Therefore,
if we want to find out what kind of input would cause a certain behavior – whether that’s an internal
neuron firing or the final output behavior – we can use derivatives to iteratively tweak the input towards
that goal. If we want to understand individual features, we can search for examples where they have high
values – either for a neuron at an individual position, a layer, or an entire channel (see Figure 21).
(Source: DistillPub)
2. Filter size
i. How are your model’s accuracy and computational efficiency affected when you decrease or increase
its filter size?
Answer: Increasing the filter size results in decrease in computational efficiency (since the number
of model parameters increases), and an increase in accuracy up to a certain point beyond which the
network can overfit and imitate a fully-connected network.
Alternatively, decreasing the filter size results in increase of computational efficiency, and a decrease
in accuracy when tending towards using extremely small kernels (e.g. 1x1) which do not capture the
local structure of the inputs properly.
ii. How do you choose the ideal filter size?
Answer: First of all, even-sized filters are not typically used because they break the symmetry around
the neuron we are computing the local features for. Since 1x1 filters are too noisy and don’t capture
local dependencies, and 5x5 filters are rather computationally expensive, it has been empirically
shown that 3x3 filters combine the best of both worlds: low computation cost and high accuracy.
3. Convolutional layers are also known as “locally connected”. Explain what it means.
Answer: Each filter operates only on a small neighborhood (e.g. 3x3) around the central pixel, and is
applied across the entire spatial domain. This sort of weight sharing drastically reduces the number of
the parameters, and injects the “spatial equivariance” bias.
4. When we use CNNs for text data, what would the number of channels be for the first conv layer?
Answer: The number of input channels for the first conv layer will be the embedding dimensionality of
the words.
5. What is the role of zero padding?
Answer: By applying the convolution operation, we slightly reduce the size of the input image. In order
to retain the same size, we append a border of zeros around the image.
6. Why do we need upsampling? How to do it?
Answer: If pixel-level outputs are desired, we have to upsample the features again. Downsampling
provides strong features with large receptive field; upsampling yields output at the same resolution as
input. There are several strategies for upsampling:
• Nearest neighbor : scale each neuron using nearest neighbor interpolation.
• Bilinear : scale each channel using bilinear interpolation.
• Bed of nails: insert elements at sparse location (followed by convolution).
• Max unpooling: remember which position was maximum during pooling, and insert the new element
in the same location. Requires corresponding pairs of downsampling and upsampling layers.
Alternatively, instead of first downsampling and then upsampling, we can use dilated convolutions in order
to reach a large receptive field size quickly. Dilated convolutions increase the receptive field of standard
convolutions without increasing the number of parameters.
The max pool layer merely operates on the spatial dimension, and works on each channel inde-
pendently. Therefore, the output is of shape: (W/2, H/2, C), and the number of parameters is 0.
In contrast, the conv layer will span over the entire channel dimension, resulting in the output:
(W/2, H/2, 1); however, the number of parameters is 1 × 2 × 2 × C.
9. When we replace a normal convolutional layer with a depthwise separable convolutional layer, the number
of parameters can go down. How does this happen? Give an example to illustrate this.
Answer: Imagine we have an input image of shape 12 × 12 × 3, and we want to apply 256 filters of size
5 × 5, thereby obtaining an output volume of size 8 × 8 × 256.
In standard convolution, we directly apply 256 filters of size 5 × 5 × 3, where each filter spans the entire
input channel dimension (which in this case is 3).
In a depthwise separable convolution, we instead first apply a depthwise convolution (one 5 × 5 filter per
input channel), and then a pointwise convolution (256 filters of size 1 × 1 × 3) that mixes the channels.
Now, notice that in the standard convolution the number of parameters is 5 × 5 × 3 × 256 = 19200,
whereas in the depthwise separable convolution the number of parameters is 5 × 5 × 3 + 3 × 256 = 843.
The difference in the number of parameters is evident.
Nevertheless, the reduction in model size comes at a cost of loss in expressivity. The intuition is as
follows: in the standard convolution operation we transform the input volume 256 times independently,
whereas in the depthwise separable convolution we transform the input volume per-channel only once,
and only then extend it to 256 dimensions.
(Source: TowardsDataScience)
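These counts can be sanity-checked, for instance with PyTorch (biases disabled so that the numbers match the calculation above):

import torch.nn as nn

standard = nn.Conv2d(3, 256, kernel_size=5, bias=False)
depthwise_separable = nn.Sequential(
    nn.Conv2d(3, 3, kernel_size=5, groups=3, bias=False),  # depthwise: one 5x5 filter per input channel
    nn.Conv2d(3, 256, kernel_size=1, bias=False),           # pointwise: 1x1 convolution up to 256 channels
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))             # 5 * 5 * 3 * 256 = 19200
print(count(depthwise_separable))  # 5 * 5 * 3 + 3 * 256 = 843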
10. Can you use a base model trained on ImageNet (image size 256 x 256) for an object classification task on
images of size 320 x 360? How?
Answer: Typically convnets have a global average pooling layer at the end of the stack of convolutional
layers (and right before the stack of fully connected layers). The purpose of this layer is to reduce an
activation volume of size C × W × H down to C × 1 × 1 by computing the average of all values along
the spatial dimension for each channel. Therefore, no matter the input image size, the spatial dimension
at the end will be averaged out, and thus the fully connected layers will receive a fixed size input as
expected.
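As a sketch of how this looks in practice (the backbone below is hypothetical; PyTorch's AdaptiveAvgPool2d plays the role of the global average pooling layer):

import torch
import torch.nn as nn

backbone = nn.Sequential(                 # hypothetical convolutional backbone
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
)
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),              # C x W x H  ->  C x 1 x 1, for any W and H
    nn.Flatten(),
    nn.Linear(128, 10),                   # always receives a fixed-size (128-dim) input
)

for size in [(256, 256), (320, 360)]:
    x = torch.randn(1, 3, *size)
    print(head(backbone(x)).shape)        # torch.Size([1, 10]) in both cases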
11. How can a fully-connected layer be converted to a convolutional layer?
Answer: Suppose the output volume of the stack of convolution layers has a shape of W × H × C.
Moreover, say we want to apply a fully-connected layer on this volume, and turn it into T dimensional
vector representation.
Now, notice that instead of first flattening the input and then applying a linear transformation which
consists of (W × H × C) × T parameters, we can directly apply a convolution operation with T filters of
size W × H (under the implicit assumption that the filter spans the entire channel dimension C). Again,
the number of parameters is same: T × W × H × C, with the only difference being the shape of the output:
1 × 1 × T (instead of simply T ).
(Source: DeepLearningAI)
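A small NumPy sketch of this equivalence, with arbitrarily chosen shapes:

import numpy as np

W, H, C, T = 4, 4, 8, 10
rng = np.random.default_rng(0)

x = rng.normal(size=(W, H, C))            # output volume of the conv stack
filters = rng.normal(size=(T, W, H, C))   # T filters, each spanning the full W x H x C volume

# Fully-connected view: flatten x and multiply by a (T, W*H*C) weight matrix
fc_out = filters.reshape(T, -1) @ x.reshape(-1)

# Convolutional view: each filter is "slid" over a volume of exactly its own size,
# so there is a single spatial position and the output has shape 1 x 1 x T
conv_out = np.array([(filters[t] * x).sum() for t in range(T)])

print(np.allclose(fc_out, conv_out))      # True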
12. Pros and cons of FFT-based convolution and Winograd-based convolution.
Answer: [TODO]
In the problem, each machine provides a random reward from a probability distribution specific to that
machine, that is not known a-priori. The objective of the gambler is to maximize the sum of rewards
earned through a sequence of lever pulls. The crucial tradeoff the gambler faces at each trial is between
”exploitation” of the machine that has the highest expected payoff and ”exploration” to get more informa-
tion about the expected payoffs of the other machines. In practice, multi-armed bandits have been used
to model problems such as managing research projects in a large organization, like a science foundation
or a pharmaceutical company.
(Source: Wikipedia)
2. How would a finite or infinite horizon affect our algorithms?
Answer: In reinforcement learning, an agent receives reward on each time step and the goal, loosely
speaking, is to maximize the reward received. But that doesn’t actually fully define the goal, because each
decision can affect what the agent can do in the future. Consequently, we’re left with the question “how
does potential future reward affect our decision right now?”
One answer is it doesn’t! We might say the objective is to only maximize the reward you’ll receive
immediately for your next action, and ignore the consequences of how that affects your ability to receive
reward after that. When your define the objective this way, you’ve defined the objective to be over a finite
horizon, where horizon refers to how many steps into the future the agent cares about the reward it can
receive. Specifically, in this case, we’ve defined a 1 step horizon objective since we only care about the
next reward received.
But we could define any arbitrary horizon we want as the objective. We could define a 2 step hori-
zon, in which the agent makes a decision that will enable it to maximize the reward it will receive in the
next 2 time steps. Or we could choose a 3, or 4, or n step horizon!
Or, we could go even further and define an infinite horizon objective in which the agent tries to maximize
the reward it will receive infinitely far into the future; that is, the agent cares about all possible future
rewards and makes its decision accordingly.
Although that might sound strange, infinite horizon objectives are the most common kind of objective
in reinforcement learning work. Typically, in infinite horizon problems the objective is to maximize the
discounted infinite horizon reward, which means the value of rewards further away in time successively
matter less to the agent. Discounting has the important property that the best possible infinite horizon
value is finite itself, allowing the agent to compare the relative value of different decisions.
So when someone says “finite horizon” what they mean is describing the value the agent can achieve
over only a finite number of steps into the future from its current state, rather than an infinite horizon,
in which the agent cares about reward over all possible future steps.
(Source: Quora)
3. Why do we need the discount term for objective functions?
Answer: From a theoretical perspective, using a discount rate (smaller than 1) is a mathematical trick
to make an infinite sum finite. From a practical perspective, discount factors are associated with time
horizons – longer time horizons have much more variance as they include more uncertainty about
the future, while short time horizons are biased towards only short-term gains.
(See more here)
4. Fill in the empty circles using the minimax algorithm (see Figure 22).
Figure 23: Minimax solution
Answer: We consider an algorithm on-policy if it navigates the environment with the policy that is
currently being learnt. In contrast, an off-policy algorithm uses a different policy for navigating the envi-
ronment.
Suppose we collect experience by executing a small number of policies π1 , . . . , πN in a desired Markov en-
vironment. These could be existing policies that are known to be fairly good, or they could be exploration
policies that seek to explore the state and action space well. In either case, we collect our experiences of
the form of (st , at , rt , st+1 ) tuples, where st is the state of the environment at time t, at is the action that
was executed, rt is the reward that was received, and st+1 is the resulting state. Suppose we now want
to evaluate a new policy π̃ to estimate its expected cumulative discounted reward. We could do this by
executing π̃ in the real environment, but usually that is expensive and could lead to large losses in the
real world (such as a robot car striking a pedestrian or crashing into a wall). Instead we would like to
evaluate π̃ using the collected experience. This is off-policy policy evaluation.
Off-policy learning seeks to find a good (or even optimal) policy by doing a series of off-policy policy
evaluations. Suppose πθ is a policy that is defined by a set of weights, θ. For example, πθ could be a
neural network for choosing actions. Using off-policy policy evaluation, we can estimate both the expected
value (cumulative discounted reward) of πθ and also its gradient with respect to θ. Then we can update
θ by taking a step in the direction of the gradient and repeat.
Off-policy evaluation is not as accurate as on-policy evaluation. If the policy π̃ that we are evaluat-
ing is very different from the initial policies π1 , . . . , πN that were used to collect experience, then we
cannot obtain a very accurate estimate of its value. In such cases, we need to collect more “on policy”
experience by executing π̃ in the real Markov environment.
In summary, off-policy evaluation is faster since it only involves computation, and is safer since it doesn’t
require acting in the real world. However, it is less accurate than on-policy evaluation.
(Source: Quora)
8. What’s the difference between model-based and model-free? Which one is more data-efficient?
Answer: The main difference between model-based and model-free agents is whether during learning or
acting, the agent uses predictions of the environment response.
Model-based agents can ask the model for a single prediction about the next state and reward,
or the full distribution of next states and rewards. These predictions can be provided entirely outside of
the learning agent (e.g. by computer code that understands the rules of a dice or board game), or they
can be learned by the agent (in which case they will be approximate).
Algorithms that purely sample from experience such as Monte Carlo Control, SARSA, Q-learning, Actor-
Critic are ”model free” RL algorithms. They rely on real samples from the environment and never use
generated predictions of next state and next reward to alter behaviour (although they might sample from
experience memory, which is close to being a model).
The archetypical model-based algorithms are Dynamic Programming (Policy Iteration and Value Itera-
tion) – these all use the model’s predictions or distributions of next state and reward in order to calculate
optimal actions. Specifically in Dynamic Programming, the model must provide state transition proba-
bilities, and expected reward from any (state, action) pair. Note this is rarely a learned model.
Basic TD learning is also model-based – in order to pick the best action, it needs to query a model
that predicts what will happen on each action, and implement a policy as follows:
π(s) = argmax_a Σ_{s′,r} p(s′, r|s, a) (r + V(s′))
In general, model-based agents are more data-efficient since you can relate the sparsely available sam-
ples to the inner built model.
(Source: StackExchange)
2. Self-attention.
i. What’s the motivation for self-attention?
Answer: Say the following sentence is an input sentence we want to translate: “The animal didn’t
cross the street because it was too tired”.
What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a
simple question to a human, but not as simple to an algorithm. When the model is processing the
word “it”, self-attention allows it to associate “it” with “animal”.
As the model processes each word (each position in the input sequence), self attention allows it
to look at other positions in the input sequence for clues that can help lead to a better encoding for
this word.
If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate
its representation of previous words/vectors it has processed with the current one it’s processing.
Self-attention is the method the Transformer uses to bake the “understanding” of other relevant
words into the one we’re currently processing.
(Source: Jay Alammar’s blog)
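As a rough illustration, a single-head, unbatched scaled dot-product self-attention in NumPy (toy dimensions and random weights; a sketch rather than a full Transformer layer):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 5, 16, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))            # one embedding per token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)                    # how strongly each token attends to every other token
attn = softmax(scores, axis=-1)                    # each row sums to 1
out = attn @ V                                     # each output is a convex combination of the values

print(attn.shape, out.shape)                       # (5, 5) (5, 8)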
ii. Why would you choose a self-attention architecture over RNNs or CNNs?
Answer: Like recurrent neural networks (RNNs), transformers are designed to process sequential
input data, such as natural language, with applications towards tasks such as translation and text
summarization. However, unlike RNNs, transformers process the entire input all at once. The
attention mechanism provides context for any position in the input sequence. For example, if the
input data is a natural language sentence, the transformer does not have to process one word at a
time. This allows for more parallelization than RNNs and therefore reduces training times.
(Source: Wikipedia)
iii. Why would you need multi-headed attention instead of just one head for attention?
Answer: Consider the sentence: “I kicked the ball”. Notice that the sentence has a causal structure:
“I (who?) kicked (did what?) the ball (to whom?)”.
Using a single attention head wouldn’t be able to disambiguate the three pieces of information
since it is a convex combination of the vector representations.
Therefore, instead of using a single overparametrized attention head, the authors suggest to use
multiple smaller heads which would be able to learn the different concepts without entanglement.
(See more here)
iv. How would changing the number of heads in multi-headed attention affect the model’s performance?
Answer: As we discussed in the previous question, it is of crucial importance to use more than one
head, as it increases the expressivity of the Transformer.
However, setting too many heads might cause duplication of information, overfitting, and entan-
glement of sentence structures.
Nevertheless, this is a hyperparameter that can be searched over – the authors of the paper “Attention
is All You Need” found that using 8 heads was optimal for the problem of machine translation.
3. Transfer learning
i. You want to build a classifier to predict sentiment in tweets but you have very little labeled data
(say 1000). What do you do?
Answer: It would be a good idea to take a pre-trained model (e.g. BERT), and fine tune it on our
limited dataset.
ii. What’s gradual unfreezing? How might it help with transfer learning?
Answer: A common practice in DL is to take large pretrained models achieving state-of-the-art
results on many benchmarks, and fine tune them to some other downstream application. Nevertheless,
naive fine-tuning can result in catastrophic forgetting, and overfitting to the new dataset. Gradual
unfreezing tackles this problem by gradually unfreezing the layers (usually starting from the deeper
layers) in order to prevent catastrophic forgetting of the knowledge learned during the source task.
4. Bayesian methods.
i. How do Bayesian methods differ from the mainstream deep learning approach?
Answer: In deep learning, we mostly aim to optimize the weights W by maximizing the likelihood
of the data p(Y|X, W). In contrast, Bayesian methods assume a posterior distribution over the weights
p(W|X, Y), which is often learnt through an approximate parametric formulation q(W|X, Y) due
to intractability – a process known as Variational Inference. Another approach of estimating the
posterior is through sampling, which is arguably a much more expensive technique.
ii. How are the pros and cons of Bayesian neural networks compared to the mainstream neural networks?
Answer: The advantage of Bayesian Neural Nets is that they provide a natural notion of uncertainty
for the prediction, which is quite useful in fields like medicine, biology, aerospace, etc.
Nevertheless, at this time, this comes at the cost of decrease in accuracy compared to standard
neural nets, difficulty to scale to large problems, the ambiguity of choosing a proper prior, etc.
iii. Why do we say that Bayesian neural networks are natural ensembles?
Answer: The reason why they are natural ensembles is because we can average the predictions from
all possible weight configurations, scaled by the probability of each configuration occurring:
p(Y|X) = ∫_W p(Y, W|X) dW
       = ∫_W p(Y|X, W) p(W) dW
5. GANs.
i. What do GANs converge to?
Answer: Let x ∈ RD denote an observation and p(z) a prior over latent variables z ∈ RQ . More-
over, let GwG : RQ → RD denote the generator network with induced distribution pmodel , and
DwD : RD → [0, 1] denote the discriminator network which outputs a probability of a sample being
real or not.
D and G play the following two-player minimax game with value function V (G, D):
G*, D* = argmin_G argmax_D V(D, G)
V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]
We train D to assign probability 1 to samples from pdata and 0 to samples from pmodel ; whereas we
train G to fool D such that it assigns probability 1 to samples from pmodel .
It can be shown that for any generator G, the optimal discriminator D∗ is:
D*_G(x) = p_data(x) / (p_data(x) + p_model(x))
Now, let’s explore what happens to the value function V when we are at D∗ :
V(G, D*_G) = E_{x∼p_data}[log D*_G(x)] + E_{x∼p_model}[log(1 − D*_G(x))]
= E_{x∼p_data}[log (p_data(x) / (p_data(x) + p_model(x)))] + E_{x∼p_model}[log (p_model(x) / (p_data(x) + p_model(x)))]
= − log 4 + KL[p_data ∥ (p_data + p_model)/2] + KL[p_model ∥ (p_data + p_model)/2]
= − log 4 + 2 · JSD[p_data, p_model]
where JSD is the Jensen-Shannon divergence. Since this is a non-negative quantity, when G tries to
minimize V(G, D*_G) it will push JSD[p_data, p_model] down to 0, which is achieved for p_model = p_data.
In turn, this implies:
D*_G(x) = p_data(x) / (p_data(x) + p_model(x)) = 1/2   (since p_model = p_data)
V(G*, D*_G) = − log 4 + 2 · JSD[p_data, p_model] = − log 4   (since p_model = p_data)
ii. Why are GANs so hard to train?
Answer: One major failure mode of GANs is mode collapse, where the generator learns to produce
high-quality sample with very low variability, covering only a fraction of pdata . Consider the example
in Figure 26:
1. The generator learns to fool the discriminator by producing values close to Antarctic tempera-
tures.
2. The discriminator can’t distinguish Antarctic temperatures, but learns that all Australian tem-
peratures are real.
3. The generator learns that it should produce Australian temperatures and abandons the Antarctic
mode.
4. The discriminator can’t distinguish Australian temperatures, but learns that all Antarctic tem-
peratures are real.
5. Repeat.
3. Neural network in simple NumPy.
i. Write in plain NumPy the forward and backward pass for a two-layer feed-forward neural network
with a ReLU layer in between.
Answer: A typical neural network will have the following computation graph:
Z[l] = W[l] A[l−1]
A[l] = g(Z[l])
L = ∥A[L] − Y∥²
Accordingly, the gradients can be derived as follows (see more here and here):
dZ[l] = dA[l] ⊙ g′(Z[l])
dW[l] = dZ[l] A[l−1]^T
dA[l−1] = W[l]^T dZ[l]
The computation graph for our task will look as follows:
Z[1] = W[1] X
A[1] = ReLU(Z[1])
Z[2] = W[2] A[1]
L = ∥Z[2] − Y∥²
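A plain NumPy sketch following the computation graph above (no biases, squared-error loss; the shapes and learning rate are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n = 4, 8, 2, 16

X = rng.normal(size=(d_in, n))          # A[0], columns are samples
Y = rng.normal(size=(d_out, n))
W1 = 0.1 * rng.normal(size=(d_hidden, d_in))
W2 = 0.1 * rng.normal(size=(d_out, d_hidden))

# Forward pass
Z1 = W1 @ X
A1 = np.maximum(Z1, 0)                  # ReLU
Z2 = W2 @ A1
loss = np.sum((Z2 - Y) ** 2)

# Backward pass (dZ = dA * g'(Z), dW = dZ A^T, dA_prev = W^T dZ)
dZ2 = 2 * (Z2 - Y)
dW2 = dZ2 @ A1.T
dA1 = W2.T @ dZ2
dZ1 = dA1 * (Z1 > 0)                    # ReLU derivative
dW1 = dZ1 @ X.T

# One vanilla gradient descent step
lr = 1e-3
W1 -= lr * dW1
W2 -= lr * dW2
print(loss)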
Figure 27: Activation functions
For a vanilla RNN with recurrent weight w_h and tanh activations, the gradient of the hidden state h_t
with respect to an earlier hidden state h_{t−k} is:
∂h_t / ∂h_{t−k} = ( ∏_{i=t−k+1}^{t} (tanh_i)′ ) (w_h)^k
First, if we don't carefully initialize the weights, we might end up with saturating activation func-
tions, that is (tanh_i)′ ≈ 0, implying that ∂h_t/∂h_{t−k} ≈ 0.
Given that we carefully initialize the weights, we have tanh(x) ≈ x (no saturation), or more precisely
∂h_t/∂h_{t−k} ≈ (w_h)^k. Hence, for w_h > 1 the gradients explode exponentially in k.
On the other hand, for w_h < 1 the gradients will vanish – for example, if w_h = 0.9 and k = 100 we
have ∂h_t/∂h_{t−k} = (w_h)^k ≈ 0.0000266.
7. Weight normalization separates a weight vector’s norm from its gradient. How would it help with training?
Answer: [TODO]
8. When training a large neural network, say a language model with a billion parameters, you evaluate your
model on a validation set at the end of every epoch. You realize that your validation loss is often lower
than your train loss. What might be happening?
Answer: There are several possible reasons as to why the validation loss is lower than the training loss:
• It might be that the training loss uses regularizers (e.g. L2 norm on the weights). Since at validation
time we only evaluate the main loss function, it might happen that the validation loss is lower than
the (composite) train loss.
• The model might have dropout layers, which impose heavy regularization during training. These are
layers that behave differently during training (dropping out neurons) and inference time (no dropping
out neurons, but multiplying weights by dropout probability to imitate ensembling).
• The validation set might simply be easier compared to the train set.
9. What criteria would you use for early stopping?
Answer: We can perform early stopping if there is no change or a decrease in a metric of interest
(validation loss, accuracy, etc.). It is a good idea to use a patience parameter in order to avoid noisy
estimates.
10. Gradient descent vs SGD vs mini-batch SGD.
Answer:
• In gradient descent we first perform a forward pass and a backward pass for each sample in the
dataset, before we take a step in the direction of the cumulative gradient. This is extremely slow
to converge, as today’s datasets are extremely large in size, which implies that we perform gradient
updates too rarely.
• In SGD, after performing a forward and a backward pass for a single sample, we take a step in the
direction of the single gradient. Even though we perform gradient updates much more often, the
entire process is extremely noisy as we are optimizing with respect to a single sample at a time.
• Mini-batch SGD combines the best of both worlds: perform a forward/backward pass for a batch
(e.g. 32) of samples, and take a step in the direction of the gradient for the current mini-batch. On
one hand, we perform updates more often than pure gradient descent; on the other hand, we optimize
over an entire mini-batch, minimizing the noise in the estimated gradient.
11. It’s a common practice to train deep learning models using epochs: we sample batches from data without
replacement. Why would we use epochs instead of just sampling data with replacement?
Answer: First of all, (Bottou et al, 2009) empirically show that random shuffling before each epoch leads
to faster convergence. On top of it, there are some practical reasons as to why one would not perform
random draws with replacement:
• It’s a well defined metric: ”the neural network was trained for 10 epochs” is a clearer statement
than ”the neural network was trained for 18942 iterations” or ”the neural network was trained over
303072 samples”.
• There’s enough randomness going on during the training phase: random weight initialization, mini-
batch shuffling, dropout, etc.
• It is easy to implement.
• It avoids wondering whether the training set is large enough not to have epochs.
(Source: StackExchange)
12. Your model’s weights fluctuate a lot during training. How does that affect your model’s performance?
What to do about it?
Answer: Let us look at the gradient update of the weights:
w_{t+1} := w_t − α ∇_w L(w_t)
The fluctuation can be attributed to two primary factors: large learning rate or exploding gradients. In
either case, this can seriously destabilize the training process. In order to resolve this issue, we could: 1)
lower the learning rate; 2) perform gradient clipping.
13. Learning rate.
i. Draw a graph number of training epochs vs training error for when the learning rate is: i) too high;
ii) too low; iii) acceptable.
Answer: The graph is depicted in Figure 28 (source: towardsdatascience.com/useful-plots-to-diagnose-your-neural-network-521907fa2f45).
Intuitively, the idea is that this (learning rate warmup) helps your network slowly adapt to the data. However, in prac-
tice the main reason for warmup steps is to allow adaptive optimizers (e.g. Adam, RMSProp, etc.)
to compute correct statistics of the gradients. Therefore, a warmup period makes little sense when
training with plain SGD.
For example, RMSProp computes a moving average of the squared gradients to get an estimate
of the variance in the gradients for each parameter. For the first update, the estimated variance is
just the square root of the sum of the squared gradients for the first batch. Since, in general, this
will not be a good estimate, your first update could push your network in a wrong direction. To
avoid this problem, you give the optimiser a few steps to estimate the variance while making as little
changes as possible (low learning rate) and only when the estimate is reasonable, you use the actual
(high) learning rate.
(Source: StackExchange)
14. Compare batch norm and layer norm.
Answer: As an example we will use a fully-connected network, although the same reasoning can be
applied to CNNs, Transformers, etc.
Suppose the output of the previous layer is X ∈ RB×D where B is the batch size and D is the di-
mensionality of the embedding. Both techniques normalize the input as follows:
Y = (X − E[X]) / sqrt(Var[X] + ϵ) ∗ γ + β
where γ and β are learnt affine parameters, and all operations are treated as broadcasts. The main
difference stems from how the two techniques compute the statistics:
• Batch norm computes the mean and standard deviation over the batch, meaning that E [X] ∈ R1×D
and Var [X] ∈ R1×D .
• Layer norm computes the mean and standard deviation over the features, meaning that E [X] ∈ RB×1
and Var [X] ∈ RB×1 .
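The difference is easy to see in NumPy, where only the axis over which the statistics are computed changes (γ, β and ϵ are omitted for brevity):

import numpy as np

B, D = 32, 64
X = np.random.randn(B, D)

# Batch norm: statistics per feature, computed over the batch dimension
bn = (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)   # stats have shape (1, D)

# Layer norm: statistics per sample, computed over the feature dimension
ln = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)   # stats have shape (B, 1)

print(bn.mean(axis=0).round(6))   # ~0 for every feature
print(ln.mean(axis=1).round(6))   # ~0 for every sample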
15. Why is squared L2 norm sometimes preferred to L2 norm for regularizing neural networks?
Answer: Both the L2 norm and the squared L2 norm provide the same optimization goal. However,
the squared L2 norm is computationally simpler, as you don't have to calculate the square root.
Moreover, when deriving the gradients by hand, the derivatives have a simpler form to work with.
16. Some models use weight decay: after each gradient update, the weights are multiplied by a factor slightly
less than 1. What is this useful for?
Answer: Weight Decay, or L2 Regularization, is a regularization technique applied to the weights of a
neural network. We minimize a loss function comprising both the primary loss function and a penalty
on the L2 Norm of the weights:
L̃(w) = L(w) + (λ/2) w^T w
where λ is a hyperparameter determining the strength of the penalty. This encourages smaller weights,
which helps prevent overfitting by keeping the network from fitting the training data arbitrarily closely.
Let us now observe the implicit effect of weight decay when performing the update rule in the optimization
procedure:
wt+1 = wt − α∇w L̃(wt )
= wt − α(∇w L(wt ) + λwt )
= (1 − αλ)wt − α∇w L(wt )
In other words, since α > 0, λ > 0, what we essentially do is we first slightly decay the weights, and then
take a small step in the opposite direction of the gradient.
Weight decay can be incorporated directly into the weight update rule, rather than just implicitly by
defining it through the objective function. Often weight decay refers to the implementation where we spec-
ify it directly in the weight update rule (whereas L2 regularization is usually the implementation which
is specified in the objective function). For example, see the PyTorch documentation on how the SGD
optimizer incorporates weight decay explicitly.
(Source: PapersWithCode)
17. It’s a common practice for the learning rate to be reduced throughout the training
i. What’s the motivation?
Answer: The learning rate is controlling the size of the update steps along the gradient. This
parameter sets how much of the gradient you update, with:
• A larger learning rate allows the model to converge faster, but may overstep the optimal point.
• A smaller learning rate is more receptive to the loss function, but may require more epochs to
converge and may get stuck in local minima.
The learning rate schedule allows you to start training with larger steps and reduce the learning rate,
in an effort to combine the best of both worlds.
(Source: Peltarion)
ii. What might be the exceptions?
Answer: An exception might be a continual learning scenario, where we continuously update the
model with new data. In this case, if we constantly decrease the learning rate, the model will not
learn from data further down the stream. On the other hand, if we use a scheduler with (warm)
restarts, the alternation between small and large learning rates can further exacerbate the effects of
catastrophic forgetting. Therefore, it might be best to use a constant learning rate.
18. Batch size.
i. What happens to your model training when you decrease the batch size to 1?
Answer: After performing a forward and a backward pass for a single sample, we take a step in
the direction of the single gradient. Even though we perform gradient updates much more often, the
entire process is extremely noisy as we are optimizing with respect to a single sample at a time.
ii. What happens when you use the entire training data in a batch?
Answer: After performing a forward pass and a backward pass for each sample in the dataset, we
take a step in the direction of the cumulative gradient. This is extremely slow to converge, as today’s
datasets are extremely large in size, which implies that we perform gradient updates too rarely.
iii. How should we adjust the learning rate as we increase or decrease the batch size?
Answer: When utilizing larger batches, we can afford to have large learning rates, as the approx-
imated gradient of the batch is closer to the true gradient. On the other hand, using very small
batches yields noisy estimates of the gradient, and is therefore advisable to also use small learning
rates so that we don’t diverge in our optimization procedure.
19. Why is Adagrad sometimes favored in problems with sparse gradients?
Answer: First, let us note that feature sparsity will induce gradient sparsity, simply due to the chain
rule (the derivative of the weight is dependent on the value of the feature/activation). In turn, this can
result in a lot of oscillation when performing the gradient update, since the direction will be dominated
by the non-sparse gradients.
In order to alleviate this problem, Adagrad proposes to use a scaled learning rate per weight, such that we
move slower in the direction of a lot of oscillation (variance), and give more importance in the direction
of sparsity. More precisely, the update rule is as follows:
G_t = Σ_{i=1}^{t} ∇_w L(w_i) ∇_w L(w_i)^T
w_{t+1} = w_t − ( α / sqrt(diag(G_t) + ϵ I) ) ∗ ∇_w L(w_t)
This way, we dampen gradients that have high variance; and give more weight to gradients with low
variance (induced by the sparsity).
(See more here)
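A per-parameter NumPy sketch of this update, keeping only the diagonal of G_t as is done in practice (the learning rate and the toy gradients are arbitrary):

import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    # accum is the running sum of squared gradients, i.e. the diagonal of G_t
    accum += grad ** 2
    w -= lr * grad / np.sqrt(accum + eps)
    return w, accum

w, accum = np.zeros(3), np.zeros(3)
grads = [np.array([1.0, 1.0, 0.0]),   # a frequently updated ("dense") direction in the first two coordinates
         np.array([1.0, 1.0, 0.0]),
         np.array([0.0, 0.0, 1.0])]   # a rarely updated ("sparse") direction in the last coordinate
for grad in grads:
    w, accum = adagrad_step(w, grad, accum)
print(w)   # the sparse coordinate receives a relatively larger step on its single update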
20. Adam vs. SGD.
i. What can you say about the ability to converge and generalize of Adam vs. SGD?
Answer: (Wilson et al., 2017) show that solutions found by adaptive methods (e.g. Adam) generalize
worse than SGD, even when these solutions have better training performance. In other words,
adaptive methods converge faster, but have worse generalization performance than pure SGD.
(See more here)
ii. What else can you say about the difference between these two optimizers?
Answer: Pure SGD performs gradient updates based on mini-batches instead of the entire dataset:
w_{t+1} = w_t − α ∇_w L(w_t; x_batch)
One issue with it is that the learning rate scales the gradients equally in every direction. Therefore,
if one direction dominates the gradient, the optimization procedure might oscillate a lot and take a
lot longer to converge to the minimum.
With this in mind, Adam aims to improve the convergence performance by: 1) operating on a
momentum-updated version of the gradients; 2) dividing the momentum by the gradient variance;
The goal of both terms is to dampen the variance of the gradient. More precisely, the update step looks
as follows:
m_t = β_1 m_{t−1} + (1 − β_1) ∇_w L(w_t)
v_t = β_2 v_{t−1} + (1 − β_2) (∇_w L(w_t))²
m̂_t = m_t / (1 − β_1^t),   v̂_t = v_t / (1 − β_2^t)
w_{t+1} = w_t − α m̂_t / (sqrt(v̂_t) + ϵ)
21. With model parallelism, you might update your model weights using the gradients from each machine
asynchronously or synchronously. What are the pros and cons of asynchronous SGD vs. synchronous
SGD?
Answer: [TODO]
22. Why shouldn’t we have two consecutive linear layers in a neural network?
Answer: Consider the following two-layer MLP:
h = g(W_1 x + b_1)
y = g(W_2 h + b_2)
or, written as a single expression:
y = g(W_2 g(W_1 x + b_1) + b_2)
If there is no non-linearity g between the two layers (i.e. two consecutive linear layers), the composition collapses:
y = W_2 (W_1 x + b_1) + b_2 = W_2 W_1 x + W_2 b_1 + b_2 = W x + b
In other words, two consecutive linear layers are equivalent to a single linear layer, so the extra layer adds parameters and computation without increasing the expressivity of the network.
24. Design the smallest neural network that can function as an XOR gate.
Answer:
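One possible construction (a 2-2-1 network with step activations that computes XOR as OR AND NOT-AND; smaller variants with skip connections exist, so this is only one minimal sketch):

import numpy as np

step = lambda z: (z > 0).astype(float)

def xor(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: h1 = OR(x1, x2), h2 = AND(x1, x2)
    h = step(np.array([[1, 1], [1, 1]]) @ x - np.array([0.5, 1.5]))
    # Output: OR(x1, x2) AND NOT AND(x1, x2)  =  XOR(x1, x2)
    return step(np.array([1, -1]) @ h - 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor(a, b)))   # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0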
25. Why don’t we just initialize all weights in a neural network to zero?
Answer: The problem of symmetry is not only present when we initialize the weights with zero, but also
when we (in decreasing order of restrictiveness):
• Initialize all weight matrices with a constant.
• Initialize each weight matrix with a different constant
• Initialize each weight matrix such that all rows for a given matrix are the same.
The issue with this sort of initialization is that all units in each layer will produce the same outputs.
Therefore, during backpropagation, we will have the same local derivatives ∂L/∂z_i^[l], which in turn will cause
the weights for all neurons in a given layer to perform the same update. By induction, one can show
that even after n iterations of gradient descent we will preserve the symmetry, thus seriously straining the
capacity of the network.
(See more here)
26. Stochasticity.
i. What are some sources of randomness in a neural network?
Answer: Random weight initialization, mini-batch shuffling, dropout, etc.
ii. Sometimes stochasticity is desirable when training neural networks. Why is that?
Answer:
• Random noise allows neural nets to produce multiple outputs given the same instance of input.
For example, AlphaGo outputs probability of winning the game for every possible move, and
chooses the next step by sampling with respect to these probabilities – otherwise, the behavior
of the agent is going to be highly deterministic.
• Random noise limits the amount of information flowing through the network, forcing the net-
work to learn meaningful representations of data. For example, vanilla Autoencoders can easily
overfit to the train data, thus not being able to generate samples with high fidelity. In contrast,
Variational Autoencoders (VAEs) inject Gaussian noise in the bottleneck, thereby forcing the
network to learn meaningful representations of the data.
• Random noise provides ”exploration energy” for finding better optimization solutions during
gradient descent. For example, in Stochastic Gradient Descent (SGD) instead of navigating the
loss landscape by computing the gradients over the entire dataset, we estimate the gradient by
evaluating the network on a mini-batch of samples. This way, even if we get stuck in a local
optima in one iteration, in the next it is very likely we escape it as we are always optimizing
with respect to a “proxy” to the original loss.
(Source: Eric Jang’s blog)
28. Pruning.
i. Pruning is a popular technique where certain weights of a neural network are set to 0. Why is it
desirable?
Answer: In the context of artificial neural networks, pruning is the practice of removing parameters
(which may entail removing individual parameters, or parameters in groups such as by neurons) from
an existing network. The goal of this process is to maintain accuracy of the network while increasing
its efficiency. This can be done to reduce the computational resources required to run the neural
network.
(Source: Wikipedia)
ii. How do you choose what to prune from a neural network?
Answer: Typically, there are 2 techniques: 1) pruning weights; 2) pruning neurons.
Weight-based pruning doesn’t cause a large dip in performance, but might require special hard-
ware to allow for efficient sparse computations. On the other hand, neuron-based pruning allows
the network to be run normally without sparse computation / special hardware, but might seriously
strain the performance capabilities of the model.
Nevertheless, in both cases the goal is to remove more of the less important parameters. Typical
pruning criteria used in practice are:
• Magnitude. Remove weights that have magnitude close to 0; remove neurons whose L2 norm of
the weight is also close to 0.
• Activations. We could also use the training data to observe the activations of the neurons. We
can remove neurons whose distribution of activations is extremely peaked (invariant to the input);
Moreover, we can also remove a neuron if its activation pattern is highly correlated to another
neuron in the same layer.
We repeat the pruning process until certain conditions are met – dipping below a threshold accuracy,
achieving certain memory size / computation time for the model, etc. Nonetheless, depending on the
problem at hand, we might have to fine-tune the pruned model for a couple of iterations in order to
improve the performance.
(Source: TowardsDataScience)
29. Under what conditions would it be possible to recover training data from the weight checkpoints?
Answer: In general, there are 2 main types of model inversion attacks:
• White-box, where we have access to the model checkpoints. In fact, today it is very common to
open-source the model weights, on platforms such as PyTorch Hub, Hugging Face, etc. The main
idea of white-box model inversion is to fix the pre-trained network, and optimize the input
(through back-propagation) with respect to a given confidence score.
• Black-box, where we can only query the model. This is the most typical scenario, as many services
offer access to their model through an API. The main idea of black-box model inversion is to train a
decoder network to recreate the input with which we query the online model, based on its outputted
confidence scores / latent representation.
(See more here)
30. Why do we try to reduce the size of a big trained model through techniques such as knowledge distillation
instead of just training a small model from the beginning?
Answer: Empirically, given enough data larger models often outperform their smaller counterparts. Since
the process of model distillation is related to compressing large models while still keeping nearly the same
performance, there might be indication that the compressed model might also outperform the purely
trained variant.
One might even relate the idea to polysemanticity – the ability of neurons to pack many concepts in
a single neuron. Since larger models are more expressive, distillation might provide a way of packing the
knowledge of the larger models into a smaller one. In turn, this might lead to outperforming a purely
trained small variant, simply due to the lack of expressivity of the small model.