
Machine Learning Assignment 5 Solution

Name: Sunny Singh Jadon


Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
March 19, 2025

Question 1: Total Trainable Parameters in a Feedforward Neural Network

Problem Statement
A feedforward neural network is given with:

• p-dimensional input

• m hidden layers

• k hidden units per layer

• A scalar output (i.e., one neuron in the output layer)

• Bias terms are ignored.

Determine the total number of trainable parameters (weights).

Solution
1. Input to First Hidden Layer:

• The first hidden layer receives input from p features.

• Each hidden unit in the first layer has weights from all p input dimensions.

• Number of weights:
p×k

2. Hidden Layer Connections:

• Each of the m hidden layers has k units.

• Each unit in a layer is connected to all k units in the previous layer.

• There are m − 1 such hidden layer transitions.

• Number of weights:
(m − 1) × k²

3. Last Hidden Layer to Output Layer:

• The output layer has a single neuron.

• It is connected to all k units in the last hidden layer.

• Number of weights:
k

Total Trainable Parameters:

Total Parameters = pk + (m − 1)k² + k

Correct Answer:
pk + (m − 1)k² + k
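
As a quick sanity check, the count can be computed in code (a minimal Python sketch; the function name count_params and the example values p = 4, m = 3, k = 5 are illustrative, not part of the assignment):

    def count_params(p, m, k):
        """Trainable weights in a bias-free feedforward net with
        p inputs, m hidden layers of k units each, and one output neuron."""
        input_to_hidden = p * k              # input -> first hidden layer
        hidden_to_hidden = (m - 1) * k * k   # m-1 transitions between hidden layers
        hidden_to_output = k                 # last hidden layer -> single output
        return input_to_hidden + hidden_to_hidden + hidden_to_output

    # Example: p = 4 inputs, m = 3 hidden layers, k = 5 units per layer
    print(count_params(4, 3, 5))  # 4*5 + 2*25 + 5 = 75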

Question 2: Gradient of a Neural Network Layer with ReLU Activation

Problem Statement
Consider a neural network layer:

y = ReLU(Wx)
where:

• x ∈ Rp (input)

• y ∈ Rd (output)

• W ∈ Rd×p (weight matrix)

• ReLU activation function:

ReLU(z) = max(0, z)

(applied element-wise)

Find the gradient ∂y_i/∂W_ij for i = 1, . . . , d and j = 1, . . . , p.

Solution
1. ReLU Activation:
• If z = W_i · x is positive: ReLU is active, so the derivative is 1.
• If z = W_i · x ≤ 0: ReLU is inactive, so the derivative is 0.
2. Computing ∂y_i/∂W_ij:

• Since

y_i = ReLU(Σ_{k=1}^p W_ik x_k),

• its derivative w.r.t. W_ij is

I(Σ_{k=1}^p W_ik x_k > 0) · x_j

• This means:
– If Σ_{k=1}^p W_ik x_k > 0, the gradient is x_j.
– Otherwise, the gradient is 0.


Correct Answer:

I(Σ_{k=1}^p W_ik x_k > 0) · x_j
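
The expression can be verified numerically against a finite-difference approximation (a sketch assuming NumPy; the shapes d = 3, p = 4, the seed, and the tolerance are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    d, p = 3, 4
    W = rng.normal(size=(d, p))
    x = rng.normal(size=p)

    # Analytic gradient: dy_i/dW_ij = I(sum_k W_ik x_k > 0) * x_j
    pre_act = W @ x                                 # shape (d,)
    analytic = (pre_act > 0)[:, None] * x[None, :]  # shape (d, p)

    # Finite-difference check on one entry, (i, j) = (0, 1)
    eps = 1e-6
    W_plus = W.copy(); W_plus[0, 1] += eps
    numeric = (np.maximum(W_plus @ x, 0)[0] - np.maximum(W @ x, 0)[0]) / eps
    print(np.isclose(numeric, analytic[0, 1], atol=1e-4))  # True (away from the kink at 0)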

Question 3: Gradient Dependencies in a Two-Layer Neural Network

Problem Statement
Consider a two-layered neural network:
y = σ(W^(B) σ(W^(A) x))
where:
• W^(A) and W^(B) are weight matrices.
• h = σ(W^(A) x) is the hidden layer representation.
• σ is the activation function.
• ∇_g(f) denotes the gradient of f w.r.t. g.
Determine which of the following statements are true:
• ∇_h(y) depends on W^(A)
• ∇_{W^(A)}(y) depends on W^(B)
• ∇_{W^(A)}(h) depends on W^(B)
• ∇_{W^(B)}(y) depends on W^(A)

Solution
1. Gradient of Output w.r.t. Hidden Representation:

∇_h(y) = σ′(W^(B) h) W^(B)

Expressed in terms of h and W^(B), this does not explicitly involve W^(A), so the first option is incorrect.


2. Gradient of Output w.r.t. W^(A):

∇_{W^(A)}(y) = ∇_h(y) · ∇_{W^(A)}(h)

The factor ∇_h(y) = σ′(W^(B) h) W^(B) contains W^(B), so the second option is correct.


3. Gradient of Hidden Representation w.r.t. W^(A):

∇_{W^(A)}(h) = σ′(W^(A) x) x^T

This does not depend on W^(B), so the third option is incorrect.


4. Gradient of Output w.r.t. W^(B):

∇_{W^(B)}(y) = σ′(W^(B) h) h^T

Since h = σ(W^(A) x) depends on W^(A), the fourth option is correct.

Final Answer
Accepted Answers: ∇_{W^(A)}(y) depends on W^(B), ∇_{W^(B)}(y) depends on W^(A)
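
The dependencies can also be checked numerically (a sketch assuming NumPy and a sigmoid activation; the shapes and seed are arbitrary): changing W^(A) changes h and therefore changes ∇_{W^(B)}(y).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(1)
    x = rng.normal(size=3)
    W_A1 = rng.normal(size=(2, 3))
    W_A2 = rng.normal(size=(2, 3))   # a different first-layer weight matrix
    W_B = rng.normal(size=(1, 2))

    def grad_y_wrt_WB(W_A):
        h = sigmoid(W_A @ x)          # hidden representation
        z = W_B @ h
        # sigma'(W_B h) h^T
        return (sigmoid(z) * (1 - sigmoid(z)))[:, None] * h[None, :]

    # Different W^(A) give different gradients w.r.t. W^(B)
    print(np.allclose(grad_y_wrt_WB(W_A1), grad_y_wrt_WB(W_A2)))  # False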

Question 4: Neural Network Weight Initialization


Problem Statement
Which of the following statements about weight initialization in a network using the
sigmoid activation function are true?

• Two different initializations of the same network could converge to different minima.

• For a given initialization, gradient descent will converge to the same minima irre-
spective of the learning rate.

• Initializing all weights to the same constant value leads to undesirable results.

• Initializing all weights to very large values leads to undesirable results.

Solution
1. Different Initializations Lead to Different Minima:
• Neural networks have non-convex loss surfaces.

• Different initializations may lead to different local minima.

• This statement is true.


2. Convergence to the Same Minima Regardless of Learning Rate:

• The learning rate affects the optimization trajectory.

• High learning rates may lead to divergence or poor convergence.

• Different learning rates may lead to different minima.

• This statement is false.

3. Initializing All Weights to the Same Constant Value:

• If all weights are initialized to the same value, all neurons will produce the same
output.

• This leads to symmetry in gradients, preventing effective learning.

• This statement is true.

4. Initializing Weights to Very Large Values:

• Large weights cause extreme values in the sigmoid activation.

• This leads to vanishing gradients due to saturation.

• This statement is true.

Final Answer
Accepted Answers:

• Two different initializations of the same network could converge to different minima.

• Initializing all weights to the same constant value leads to undesirable results.

• Initializing all weights to very large values leads to undesirable results.
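
The symmetry argument behind the constant-initialization answer can be seen directly in code (a minimal sketch assuming NumPy, a two-unit hidden layer, sigmoid activations, and a squared-error loss, all illustrative assumptions): both hidden units receive identical gradients, so they remain identical after every update.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.0])          # one training example
    t = 1.0                            # target
    W1 = np.full((2, 2), 0.3)          # all first-layer weights equal
    w2 = np.full(2, 0.3)               # all second-layer weights equal

    # Forward pass
    h = sigmoid(W1 @ x)
    y = sigmoid(w2 @ h)

    # Backward pass for squared error L = 0.5 * (y - t)^2
    dy = (y - t) * y * (1 - y)
    dW1 = np.outer(dy * w2 * h * (1 - h), x)

    print(dW1)  # both rows are identical: the two hidden units learn the same function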

Question 5: Derivatives of Sigmoid and Tanh Activation Functions

Problem Statement
Consider the following statements about the derivatives of the sigmoid σ(x) and hyper-
bolic tangent tanh(x) activation functions:

σ(x) = 1 / (1 + exp(−x)),    tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x))
Which of the following statements are true?

• σ′(x) = σ(x)(1 − σ(x))

• 0 < σ′(x) ≤ 1/4

• tanh′(x) = (1/2)(1 − (tanh(x))²)

• 0 < tanh′(x) ≤ 1

Solution
1. Derivative of Sigmoid Function:

σ ′ (x) = σ(x)(1 − σ(x))

This is a well-known result, so the first statement is true.


2. Range of σ ′ (x):

• Since σ(x) is always between 0 and 1, its derivative is maximized when σ(x) = 0.5.

• This gives σ′(x) = 0.25, meaning 0 < σ′(x) ≤ 1/4.

• Thus, the second statement is true.

3. Derivative of Tanh Function:

tanh′(x) = 1 − (tanh(x))²

The given formula tanh′(x) = (1/2)(1 − (tanh(x))²) is incorrect (the factor 1/2 is unnecessary).
Therefore, the third statement is false.
4. Range of tanh′ (x):

• The function tanh′ (x) is maximum at x = 0, where tanh(0) = 0 and tanh′ (0) = 1.

• tanh′(x) approaches 0 (but never reaches it) as x → ±∞.

• Thus, 0 < tanh′ (x) ≤ 1, making the fourth statement true.

Final Answer
Accepted Answers:

• σ′(x) = σ(x)(1 − σ(x))

• 0 < σ′(x) ≤ 1/4

• 0 < tanh′(x) ≤ 1
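
Both accepted derivative identities and their ranges can be confirmed numerically (a sketch assuming NumPy; the grid of x values is arbitrary):

    import numpy as np

    x = np.linspace(-5, 5, 1001)
    sig = 1.0 / (1.0 + np.exp(-x))

    d_sig = sig * (1 - sig)          # sigma'(x) = sigma(x)(1 - sigma(x))
    d_tanh = 1 - np.tanh(x) ** 2     # tanh'(x) = 1 - tanh(x)^2

    print(d_sig.max())   # 0.25, attained at x = 0
    print(d_tanh.max())  # 1.0,  attained at x = 0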

Question 6: MLE of p in a Geometric Distribution


Problem Statement
A geometric distribution is defined by the probability mass function:

f(x; p) = (1 − p)^(x−1) p,    x = 1, 2, . . .

Given the samples [4, 5, 6, 5, 4, 3] drawn from this distribution, find the maximum likeli-
hood estimate (MLE) of p.

Solution
The MLE for p in a geometric distribution is given by:
p̂ = 1/x̄

where x̄ is the sample mean.
Step 1: Compute Sample Mean
x̄ = (4 + 5 + 6 + 5 + 4 + 3) / 6 = 27/6 = 4.5
Step 2: Compute MLE of p
p̂ = 1/4.5 ≈ 0.222

Final Answer
Accepted Answer: 0.222
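
The same computation in code (a minimal sketch using only the sample mean):

    samples = [4, 5, 6, 5, 4, 3]
    x_bar = sum(samples) / len(samples)   # 4.5
    p_hat = 1 / x_bar                     # MLE of p for the geometric distribution
    print(round(p_hat, 3))                # 0.222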

Question 7: Maximum A Posteriori (MAP) Estimation for a Bernoulli Distribution

Problem Statement
Consider a Bernoulli distribution with p = 0.7 (true value of the parameter). We compute
a MAP estimate of p by assuming a prior distribution over p. Let N(µ, σ²) denote a
Gaussian distribution with mean µ and variance σ².
Which of the following statements are true?

• If the prior is N (0.6, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.4, 0.1).

• If the prior is N (0.4, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.6, 0.1).

• With a prior of N (0.1, 0.001), the estimate will never converge to the true value,
regardless of the number of samples used.

• With a prior of U (0, 0.5) (uniform between 0 and 0.5), the estimate will never
converge to the true value, regardless of the number of samples used.

Solution
1. Effect of the Prior Mean:
MAP estimation incorporates prior knowledge. If the prior mean is closer to the true
value (p = 0.7), the estimator converges faster. Since N (0.6, 0.1) is closer to 0.7 than
N (0.4, 0.1), fewer samples will be needed to converge, making the first statement true.
2. Incorrect Claim about N (0.4, 0.1):
Since 0.4 is farther from 0.7 than 0.6, this statement is false.

3. Effect of a Strongly Biased Prior:
A prior of N (0.1, 0.001) is highly concentrated near 0.1. While it slows down conver-
gence, given sufficient data, the MAP estimate will eventually reach the true value. This
statement is false.
4. Effect of a Uniform Prior on [0, 0.5]:
Since the uniform prior does not include 0.7 in its support, the MAP estimate will never
reach 0.7, regardless of the number of samples. This statement is true.

Final Answer
Accepted Answers:

• If the prior is N (0.6, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.4, 0.1).

• With a prior of U (0, 0.5), the estimate will never converge to the true value, regard-
less of the number of samples used.
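
The role of the prior's support can be illustrated with a grid-based MAP computation (a sketch assuming NumPy; the sample size, seed, and grid resolution are arbitrary, and the grid search is only an approximation): with a uniform prior on [0, 0.5], the MAP estimate can never exceed 0.5, no matter how much data is observed.

    import numpy as np

    rng = np.random.default_rng(2)
    data = rng.binomial(1, 0.7, size=10_000)   # Bernoulli samples with true p = 0.7
    k, n = data.sum(), data.size

    p_grid = np.linspace(1e-3, 1 - 1e-3, 999)
    log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)

    # Uniform prior on [0, 0.5]: log-prior is 0 inside the support, -inf outside
    log_prior = np.where(p_grid <= 0.5, 0.0, -np.inf)

    map_estimate = p_grid[np.argmax(log_lik + log_prior)]
    print(map_estimate)  # stays at ~0.5 and never reaches 0.7, however large n is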

Question 8: Parameter Estimation Techniques


Problem Statement
Which of the following statements about parameter estimation techniques are true?

• To obtain a distribution over the predicted values for a new data point, we need to
compute an integral over the parameter space.

• The MAP estimate of the parameter gives a point prediction for a new data point.

• The MLE of a parameter gives a distribution of predicted values for a new data
point.

• We need a point estimate of the parameter to compute a distribution of the predicted
values for a new data point.

Solution
1. Bayesian Approach and Integral Computation:
To obtain a predictive distribution, we integrate over the parameter space using Bayesian
inference:

P(y|x) = ∫ P(y|x, θ) P(θ|D) dθ

This means the first statement is true.


2. MAP Estimate as a Point Prediction:
The MAP estimate is defined as:

θ̂_MAP = arg max_θ P(θ|D)

Since it provides a single best estimate, it gives a point prediction, making the second
statement true.

3. Incorrect MLE Claim:
The MLE finds the most likely parameter value but does not provide a distribution over
predicted values. This statement is false.
4. Incorrect Requirement for Point Estimate:
A full Bayesian approach computes distributions directly without needing a point esti-
mate, making this statement false.

Final Answer
Accepted Answers:
• To obtain a distribution over the predicted values for a new data point, we need to
compute an integral over the parameter space.
• The MAP estimate of the parameter gives a point prediction for a new data point.
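
For a conjugate Beta-Bernoulli model the predictive integral can be evaluated explicitly, which makes the first accepted answer concrete (a sketch assuming SciPy; the Beta(2, 2) prior and the observed data are illustrative assumptions, not part of the question): the probability of the next observation comes from integrating over the parameter, not from plugging in a single point estimate.

    from scipy.integrate import quad
    from scipy.stats import beta

    a, b = 2, 2                   # Beta prior hyperparameters (assumed)
    data = [1, 1, 0, 1, 1, 0, 1]  # observed Bernoulli outcomes (assumed)
    k, n = sum(data), len(data)

    posterior = beta(a + k, b + n - k)   # conjugate posterior over theta

    # Predictive probability of y_new = 1: integral of theta * posterior(theta) dtheta
    pred, _ = quad(lambda th: th * posterior.pdf(th), 0, 1)
    print(pred)                          # equals the posterior mean
    print((a + k) / (a + b + n))         # closed form for comparison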

Question 9: Minimizing Cross-Entropy Loss


Problem Statement
In classification settings, it is common in machine learning to minimize the discrete cross-
entropy loss:
H_CE(p, q) = − Σ_i p_i log q_i

where p_i and q_i are the probabilities assigned to class i by the true and predicted
distributions, respectively. Given this, which of the following statements are true?

• Minimizing H_CE(p, q) is equivalent to minimizing the (self) entropy H(q).
• Minimizing H_CE(p, q) is equivalent to minimizing H_CE(q, p).
• Minimizing H_CE(p, q) is equivalent to minimizing the KL divergence D_KL(p||q).
• Minimizing H_CE(p, q) is equivalent to minimizing the KL divergence D_KL(q||p).

Solution
1. Connection to KL Divergence:
The cross-entropy loss can be rewritten using the KL divergence:

H_CE(p, q) = H(p) + D_KL(p||q)

Since the true distribution p is fixed, minimizing H_CE(p, q) is equivalent to minimizing
D_KL(p||q), making the third statement true.
2. Incorrect Statements:
• The self-entropy H(q) refers to the entropy of the predicted distribution and is
unrelated to the loss function.
• Cross-entropy is not symmetric, so H_CE(p, q) ≠ H_CE(q, p).
• KL divergence is asymmetric, meaning D_KL(p||q) ≠ D_KL(q||p).

Final Answer
Accepted Answer: Minimizing H_CE(p, q) is equivalent to minimizing D_KL(p||q).
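
The decomposition H_CE(p, q) = H(p) + D_KL(p||q) and the asymmetry of cross-entropy can be checked numerically (a sketch assuming NumPy; the two distributions are arbitrary):

    import numpy as np

    p = np.array([0.7, 0.2, 0.1])   # true distribution
    q = np.array([0.5, 0.3, 0.2])   # predicted distribution

    cross_entropy = -np.sum(p * np.log(q))
    entropy_p = -np.sum(p * np.log(p))
    kl_pq = np.sum(p * np.log(p / q))

    print(np.isclose(cross_entropy, entropy_p + kl_pq))        # True: H_CE = H(p) + KL(p||q)
    print(np.isclose(cross_entropy, -np.sum(q * np.log(p))))   # False: H_CE(p,q) != H_CE(q,p)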

Question 10: Activation Functions in Neural Networks


Problem Statement
Which of the following statements about activation functions are NOT true?

• Non-linearity of activation functions is not a necessary criterion when designing
very deep neural networks.

• Saturating non-linear activation functions (derivative → 0 as x → ±∞) avoid the
vanishing gradients problem.

• Using the ReLU activation function avoids all problems arising due to gradients
being too small.

• The dead neurons problem in ReLU networks can be fixed using a leaky ReLU
activation function.

Solution
1. Incorrect Claim About Non-Linearity:
Non-linearity is crucial for deep networks; otherwise, the entire network collapses into a
linear function. The first statement is false.
2. Incorrect Claim About Saturating Activation Functions:
Saturating activation functions (e.g., sigmoid, tanh) cause vanishing gradients, making
optimization difficult. The second statement is also false.
3. Incorrect Claim About ReLU Solving All Gradient Issues:
While ReLU mitigates vanishing gradients, it still suffers from the dying ReLU problem
where neurons output zero permanently. Thus, the third statement is false.
4. Correct Statement About Leaky ReLU:
Leaky ReLU assigns a small slope for negative inputs, preventing neurons from completely
dying. The fourth statement is true.
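
The leaky-ReLU fix for dead neurons is easy to see in code (a minimal sketch assuming NumPy; the slope 0.01 is a common default but is an assumption here): for negative pre-activations, ReLU passes zero gradient while leaky ReLU keeps a small one.

    import numpy as np

    def relu_grad(z):
        return (z > 0).astype(float)

    def leaky_relu_grad(z, alpha=0.01):
        return np.where(z > 0, 1.0, alpha)

    z = np.array([-2.0, -0.5, 0.5, 2.0])
    print(relu_grad(z))        # [0.   0.   1. 1.]  -> negative units get no gradient
    print(leaky_relu_grad(z))  # [0.01 0.01 1. 1.]  -> a small gradient still flows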

Final Answer
Accepted Answers:

• Non-linearity of activation functions is not a necessary criterion when designing
very deep neural networks.

• Saturating non-linear activation functions (derivative → 0 as x → ±∞) avoid the
vanishing gradients problem.

• Using the ReLU activation function avoids all problems arising due to gradients
being too small.
