Introduction to Machine Learning (4)
• p-dimensional input
• m hidden layers, each with k units
• a single output unit
Solution
1. Input to First Hidden Layer:
• Each hidden unit in the first layer has weights from all p input dimensions.
• Number of weights: p × k
2. Between Hidden Layers:
• Each of the (m − 1) transitions between consecutive hidden layers connects k units to k units.
• Number of weights: (m − 1) × k²
3. Last Hidden Layer to Output Layer:
• The single output unit receives one weight from each of the k units in the last hidden layer.
• Number of weights: k
Correct Answer: pk + (m − 1)k² + k
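As a quick sanity check, the count can be reproduced programmatically. The NumPy sketch below is illustrative only: it assumes k units per hidden layer, a single output unit, and no bias terms, and the function name count_weights is ours.

import numpy as np

def count_weights(p, k, m):
    # Weight matrix shapes: input -> first hidden, (m-1) hidden -> hidden, last hidden -> output
    shapes = [(k, p)] + [(k, k)] * (m - 1) + [(1, k)]
    return int(sum(np.prod(s) for s in shapes))

p, k, m = 10, 5, 3
assert count_weights(p, k, m) == p * k + (m - 1) * k ** 2 + k
print(count_weights(p, k, m))   # 105 for p=10, k=5, m=3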
y = ReLU(Wx)
where:
• x ∈ Rp (input)
• W ∈ Rd×p (weight matrix)
• y ∈ Rd (output)
ReLU(z) = max(0, z)
(applied element-wise)
Solution
1. ReLU Activation:
• If z = Wi · x is positive: ReLU is active, so the derivative is 1.
• If z = Wi · x ≤ 0: ReLU is inactive, so the derivative is 0.
2. Computing ∂yi/∂Wij:
• Since:
yi = ReLU( Σ_{k=1}^{p} Wik xk ),
• This means:
– If Σ_{k=1}^{p} Wik xk > 0, the gradient is xj; otherwise the gradient is 0.
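The case analysis can be verified numerically. The sketch below is illustrative only (the dimensions and the finite-difference step are arbitrary choices); it compares the closed-form gradient xj · 1[Wi · x > 0] against a finite-difference estimate.

import numpy as np

rng = np.random.default_rng(0)
p, d = 4, 3
W = rng.normal(size=(d, p))
x = rng.normal(size=p)

def y(W):
    return np.maximum(0.0, W @ x)           # y_i = ReLU(sum_k W_ik x_k)

# Closed-form gradient: dy_i/dW_ij = x_j if W_i . x > 0, else 0
analytic = ((W @ x) > 0)[:, None] * x[None, :]

# Finite-difference check for a single entry (i, j)
i, j, eps = 1, 2, 1e-6
W_plus = W.copy()
W_plus[i, j] += eps
numeric = (y(W_plus)[i] - y(W)[i]) / eps

print(analytic[i, j], numeric)              # the two values should agree closely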
Solution
1. Gradient of Output w.r.t. the Weight Matrices:
∇W(B)(y) = h, the hidden representation, which is itself computed from W(A); hence ∇W(B)(y) depends on W(A). By the chain rule, ∇W(A)(y) in turn contains a factor of W(B), so it depends on W(B).
Final Answer
Accepted Answers: ∇W(A)(y) depends on W(B), ∇W(B)(y) depends on W(A)
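To make the mutual dependence concrete, here is a small NumPy sketch. It assumes a network of the form y = W_B ReLU(W_A x) with a scalar output, which may differ in detail from the network in the original question; the variable names are ours.

import numpy as np

rng = np.random.default_rng(1)
p, k = 4, 3
x = rng.normal(size=p)
W_A = rng.normal(size=(k, p))        # first-layer weights (input -> hidden)
W_B = rng.normal(size=(1, k))        # second-layer weights (hidden -> scalar output)

h = np.maximum(0.0, W_A @ x)         # hidden representation h = ReLU(W_A x)
y = (W_B @ h).item()                 # scalar output y = W_B h

# Gradient of y w.r.t. W_B is simply h, which is a function of W_A.
grad_W_B = h

# By the chain rule, the gradient of y w.r.t. W_A carries a factor of W_B.
active = (W_A @ x > 0).astype(float)
grad_W_A = (W_B.ravel() * active)[:, None] * x[None, :]

print(grad_W_B)          # depends on W_A through h
print(grad_W_A.shape)    # (3, 4); entries depend on W_B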
• Two different initializations of the same network could converge to different minima.
• For a given initialization, gradient descent will converge to the same minima irrespective of the learning rate.
• Initializing all weights to the same constant value leads to undesirable results.
Solution
1. Different Initializations Lead to Different Minima:
• Neural networks have non-convex loss surfaces, so different starting points can converge to different minima. The first statement is true.
2. Effect of the Learning Rate:
• The learning rate affects the optimization trajectory, so for a given initialization different learning rates can lead to different minima. The second statement is false.
3. Constant Initialization:
• If all weights are initialized to the same value, all neurons in a layer will produce the same output and receive identical gradient updates. This symmetry is never broken by gradient descent, which is why constant initialization leads to undesirable results. The third statement is true.
Final Answer
Accepted Answers:
• Two different initializations of the same network could converge to different minima.
• Initializing all weights to the same constant value leads to undesirable results.
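The symmetry problem behind the last accepted answer can be demonstrated in a few lines. The sketch below is illustrative only: it uses a single tanh hidden layer, a squared-error loss, and constant weights of 0.5, all of which are arbitrary choices.

import numpy as np

p, k = 4, 3
x = np.ones(p)
target = 1.0

W1 = np.full((k, p), 0.5)            # every first-layer weight has the same value
w2 = np.full(k, 0.5)                 # every second-layer weight has the same value

h = np.tanh(W1 @ x)                  # all hidden units compute the same output
y = w2 @ h
err = y - target                     # derivative of 0.5 * (y - target)^2 w.r.t. y

# Gradients are identical across hidden units, so the units remain identical
# after every update: the symmetry is never broken.
grad_w2 = err * h
grad_W1 = (err * w2 * (1.0 - h ** 2))[:, None] * x[None, :]

print(h)          # all entries equal
print(grad_W1)    # all rows equal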
σ(x) = 1/(1 + exp(−x)),    tanh(x) = (exp(x) − exp(−x))/(exp(x) + exp(−x))
Which of the following statements are true?
• 0 < σ′(x) ≤ 1/4
Solution
1. Derivative of Sigmoid Function:
• σ′(x) = σ(x)(1 − σ(x)). Since σ(x) is always between 0 and 1, the derivative is maximized when σ(x) = 0.5, giving a maximum value of 1/4. Hence 0 < σ′(x) ≤ 1/4.
2. Derivative of Tanh Function:
• tanh′(x) = 1 − tanh²(x) is maximum at x = 0, where tanh(0) = 0 and tanh′(0) = 1.
Final Answer
Accepted Answers:
• 0 < σ′(x) ≤ 1/4
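A quick numerical check of both derivative bounds (illustrative sketch; the grid range is arbitrary):

import numpy as np

x = np.linspace(-10.0, 10.0, 10001)          # grid that includes x = 0
sigma = 1.0 / (1.0 + np.exp(-x))
d_sigma = sigma * (1.0 - sigma)              # sigma'(x) = sigma(x)(1 - sigma(x))
d_tanh = 1.0 - np.tanh(x) ** 2               # tanh'(x) = 1 - tanh(x)^2

print(d_sigma.min() > 0.0, d_sigma.max())    # strictly positive, maximum 0.25 at x = 0
print(d_tanh.max())                          # maximum 1.0 at x = 0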
f(x; p) = (1 − p)^(x−1) p,    x = 1, 2, . . .
Given the samples [4, 5, 6, 5, 4, 3] drawn from this distribution, find the maximum likelihood estimate (MLE) of p.
Solution
The MLE for p in a geometric distribution is given by:
p̂ = 1/x̄
where x̄ is the sample mean.
Step 1: Compute Sample Mean
x̄ = (4 + 5 + 6 + 5 + 4 + 3)/6 = 27/6 = 4.5
Step 2: Compute MLE of p
p̂ = 1/4.5 ≈ 0.222
Final Answer
Accepted Answer: 0.222
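The closed-form estimate can be cross-checked by maximizing the log-likelihood numerically. The sketch below is illustrative only and assumes the geometric pmf f(x; p) = (1 − p)^(x−1) p given above.

import numpy as np

samples = np.array([4, 5, 6, 5, 4, 3])

# Closed-form MLE for the geometric distribution: p_hat = 1 / sample mean
p_closed = 1.0 / samples.mean()

# Numerical cross-check: maximize the log-likelihood over a grid of p values
grid = np.linspace(1e-4, 1.0 - 1e-4, 100001)
loglik = ((samples[:, None] - 1) * np.log(1.0 - grid) + np.log(grid)).sum(axis=0)
p_grid = grid[loglik.argmax()]

print(p_closed, p_grid)   # both approximately 0.222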
• If the prior is N (0.6, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.4, 0.1).
• If the prior is N (0.4, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.6, 0.1).
• With a prior of N (0.1, 0.001), the estimate will never converge to the true value,
regardless of the number of samples used.
• With a prior of U (0, 0.5) (uniform between 0 and 0.5), the estimate will never
converge to the true value, regardless of the number of samples used.
Solution
1. Effect of the Prior Mean:
MAP estimation incorporates prior knowledge. If the prior mean is closer to the true
value (p = 0.7), the estimator converges faster. Since N (0.6, 0.1) is closer to 0.7 than
N (0.4, 0.1), fewer samples will be needed to converge, making the first statement true.
2. Incorrect Claim about N (0.4, 0.1):
Since 0.4 is farther from 0.7 than 0.6, this statement is false.
3. Effect of a Strongly Biased Prior:
A prior of N (0.1, 0.001) is highly concentrated near 0.1. While it slows down convergence, given sufficient data, the MAP estimate will eventually reach the true value. This statement is false.
4. Effect of a Uniform Prior on [0, 0.5]:
Since the uniform prior does not include 0.7 in its support, the MAP estimate will never
reach 0.7, regardless of the number of samples. This statement is true.
Final Answer
Accepted Answers:
• If the prior is N (0.6, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.4, 0.1).
• With a prior of U (0, 0.5), the estimate will never converge to the true value, regardless of the number of samples used.
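The role of the prior's support can be illustrated with a grid-based MAP estimate. The sketch below is only a rough illustration: it assumes Bernoulli samples with true p = 0.7, treats the second argument of N(·, ·) as a variance, and works with unnormalized log-priors.

import numpy as np

rng = np.random.default_rng(0)
data = rng.random(500) < 0.7                   # Bernoulli samples, true p = 0.7
heads, n = int(data.sum()), data.size

grid = np.linspace(1e-4, 1.0 - 1e-4, 10001)
loglik = heads * np.log(grid) + (n - heads) * np.log(1.0 - grid)

def map_estimate(log_prior):
    # MAP = argmax over the grid of log-likelihood + log-prior
    return grid[(loglik + log_prior).argmax()]

def log_normal(mu, var):
    return -(grid - mu) ** 2 / (2.0 * var)     # Gaussian log-density up to a constant

log_uniform_0_05 = np.where(grid <= 0.5, 0.0, -np.inf)   # U(0, 0.5): zero density above 0.5

print(map_estimate(log_normal(0.6, 0.1)))      # close to 0.7
print(map_estimate(log_normal(0.4, 0.1)))      # also close, pulled slightly towards 0.4
print(map_estimate(log_uniform_0_05))          # stuck at 0.5 no matter how much data we add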
• To obtain a distribution over the predicted values for a new data point, we need to
compute an integral over the parameter space.
• The MAP estimate of the parameter gives a point prediction for a new data point.
• The MLE of a parameter gives a distribution of predicted values for a new data
point.
Solution
1. Bayesian Approach and Integral Computation:
To obtain a predictive distribution, we integrate over the parameter space using Bayesian
inference:
P(y|x, D) = ∫ P(y|x, θ) P(θ|D) dθ
2. MAP as a Point Estimate:
The MAP estimate selects a single "best" parameter value. Since it provides a single best estimate, it gives a point prediction, making the second statement true.
3. Incorrect MLE Claim:
The MLE finds the most likely parameter value but does not provide a distribution over
predicted values. This statement is false.
4. Incorrect Requirement for Point Estimate:
A full Bayesian approach computes distributions directly without needing a point estimate, making this statement false.
Final Answer
Accepted Answers:
• To obtain a distribution over the predicted values for a new data point, we need to
compute an integral over the parameter space.
• The MAP estimate of the parameter gives a point prediction for a new data point.
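The predictive integral can be approximated on a grid to make the contrast with point estimates concrete. The sketch below is illustrative only: it assumes Bernoulli data with a flat prior, which is not necessarily the model in the original question.

import numpy as np

data = np.array([1, 1, 0, 1, 1, 0, 1, 1])        # toy Bernoulli observations
theta = np.linspace(1e-4, 1.0 - 1e-4, 10001)     # grid over the parameter space

# Posterior P(theta | D) on the grid, using a flat prior
loglik = data.sum() * np.log(theta) + (len(data) - data.sum()) * np.log(1.0 - theta)
posterior = np.exp(loglik - loglik.max())        # subtract max for numerical stability
posterior /= posterior.sum()

# Fully Bayesian prediction: integrate (here, sum) over the parameter space
p_pred = (theta * posterior).sum()               # P(y = 1 | D)

# MLE / MAP give a single parameter value, hence a point prediction
theta_map = theta[posterior.argmax()]

print(p_pred, theta_map)                         # approximately 0.70 and 0.75 for this toy data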
HCE(p, q) = − Σ_i pi log(qi)
where pi and qi are the true and predicted distributions, respectively. Given this,
which of the following statements are true?
Solution
1. Connection to KL Divergence:
The cross-entropy loss can be rewritten using the KL divergence:
HCE(p, q) = H(p) + DKL(p||q)
where H(p) is the entropy of p. Since H(p) does not depend on q, minimizing HCE(p, q) with respect to q is equivalent to minimizing DKL(p||q).
Final Answer
Accepted Answer: Minimizing HCE (p, q) is equivalent to minimizing DKL (p||q).
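The identity HCE(p, q) = H(p) + DKL(p||q) behind this equivalence can be checked numerically (illustrative sketch with arbitrary discrete distributions):

import numpy as np

p = np.array([0.2, 0.5, 0.3])                    # true distribution
q = np.array([0.1, 0.6, 0.3])                    # predicted distribution

cross_entropy = -(p * np.log(q)).sum()           # H_CE(p, q)
entropy = -(p * np.log(p)).sum()                 # H(p), does not depend on q
kl = (p * np.log(p / q)).sum()                   # D_KL(p || q)

# H_CE(p, q) = H(p) + D_KL(p || q); since H(p) is fixed, minimizing the
# cross-entropy over q is the same as minimizing the KL divergence.
print(np.isclose(cross_entropy, entropy + kl))   # True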
• Using the ReLU activation function avoids all problems arising due to gradients
being too small.
• The dead neurons problem in ReLU networks can be fixed using a leaky ReLU
activation function.
Solution
1. Incorrect Claim About Non-Linearity:
Non-linearity is crucial for deep networks; otherwise, the entire network collapses into a
linear function. The first statement is false.
2. Incorrect Claim About Saturating Activation Functions:
Saturating activation functions (e.g., sigmoid, tanh) cause vanishing gradients, making
optimization difficult. The second statement is also false.
3. Incorrect Claim About ReLU Solving All Gradient Issues:
While ReLU mitigates vanishing gradients, it still suffers from the dying ReLU problem
where neurons output zero permanently. Thus, the third statement is false.
4. Correct Statement About Leaky ReLU:
Leaky ReLU assigns a small slope for negative inputs, preventing neurons from completely
dying. The fourth statement is true.
Final Answer
Accepted Answers:
• The dead neurons problem in ReLU networks can be fixed using a leaky ReLU activation function.
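The difference in the negative regime is what prevents units from dying. A small sketch (illustrative only; the leaky slope of 0.01 is a common but arbitrary choice) compares the gradients of ReLU and leaky ReLU:

import numpy as np

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])        # example pre-activations

relu_grad = (z > 0).astype(float)                # ReLU gradient: 0 whenever z <= 0
leaky_grad = np.where(z > 0, 1.0, 0.01)          # leaky ReLU gradient: small non-zero slope

print(relu_grad)     # [0. 0. 0. 1. 1.]  -> no gradient flows through "dead" units
print(leaky_grad)    # [0.01 0.01 0.01 1. 1.]  -> gradient still flows for z <= 0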