
Machine Learning Assignment 5 Solution

Name: Sunny Singh Jadon


Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
March 19, 2025

Question 1: Total Trainable Parameters in a Feedforward Neural Network

Problem Statement
A feedforward neural network is given with:

• p-dimensional input

• m hidden layers

• k hidden units per layer

• A scalar output (i.e., one neuron in the output layer)

• Bias terms are ignored.

Determine the total number of trainable parameters (weights).

Solution
1. Input to First Hidden Layer:

• The first hidden layer receives input from p features.

• Each hidden unit in the first layer has weights from all p input dimensions.

• Number of weights:
p×k

2. Hidden Layer Connections:

• Each of the m hidden layers has k units.

• Each unit in a layer is connected to all k units in the previous layer.

• There are m − 1 such hidden layer transitions.

• Number of weights:
(m − 1) × k²

3. Last Hidden Layer to Output Layer:

• The output layer has a single neuron.

• It is connected to all k units in the last hidden layer.

• Number of weights:
k

Total Trainable Parameters:

Total Parameters = pk + (m − 1)k² + k

Correct Answer:
pk + (m − 1)k² + k
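
As a quick sanity check, the count can be computed in code (a minimal Python sketch; the function name count_params and the example values p = 4, m = 3, k = 5 are illustrative, not part of the assignment):

    def count_params(p, m, k):
        """Trainable weights in a bias-free feedforward net with
        p inputs, m hidden layers of k units each, and one output neuron."""
        input_to_hidden = p * k              # input -> first hidden layer
        hidden_to_hidden = (m - 1) * k * k   # m-1 transitions between hidden layers
        hidden_to_output = k                 # last hidden layer -> single output
        return input_to_hidden + hidden_to_hidden + hidden_to_output

    # Example: p = 4 inputs, m = 3 hidden layers, k = 5 units per layer
    print(count_params(4, 3, 5))  # 4*5 + 2*25 + 5 = 75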

Question 2: Gradient of a Neural Network Layer with ReLU Activation

Problem Statement
Consider a neural network layer:

y = ReLU(Wx)
where:

• x ∈ Rp (input)

• y ∈ Rd (output)

• W ∈ Rd×p (weight matrix)

• ReLU activation function:

ReLU(z) = max(0, z)

(applied element-wise)

Find the gradient ∂y_i/∂W_ij for i = 1, . . . , d and j = 1, . . . , p.

Solution
1. ReLU Activation:
• If z = W_i · x is positive: ReLU is active, so the derivative is 1.
• If z = W_i · x ≤ 0: ReLU is inactive, so the derivative is 0.
2. Computing ∂y_i/∂W_ij:

• Since

y_i = ReLU(Σ_{k=1}^p W_ik x_k),

• its derivative w.r.t. W_ij is

I(Σ_{k=1}^p W_ik x_k > 0) · x_j

• This means:
– If Σ_{k=1}^p W_ik x_k > 0, the gradient is x_j.
– Otherwise, the gradient is 0.


Correct Answer:

I(Σ_{k=1}^p W_ik x_k > 0) · x_j
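
The expression can be verified numerically against a finite-difference approximation (a sketch assuming NumPy; the shapes d = 3, p = 4, the seed, and the tolerance are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    d, p = 3, 4
    W = rng.normal(size=(d, p))
    x = rng.normal(size=p)

    # Analytic gradient: dy_i/dW_ij = I(sum_k W_ik x_k > 0) * x_j
    pre_act = W @ x                                 # shape (d,)
    analytic = (pre_act > 0)[:, None] * x[None, :]  # shape (d, p)

    # Finite-difference check on one entry, (i, j) = (0, 1)
    eps = 1e-6
    W_plus = W.copy(); W_plus[0, 1] += eps
    numeric = (np.maximum(W_plus @ x, 0)[0] - np.maximum(W @ x, 0)[0]) / eps
    print(np.isclose(numeric, analytic[0, 1], atol=1e-4))  # True (away from the kink at 0)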

Question 3: Gradient Dependencies in a Two-Layer Neural Network

Problem Statement
Consider a two-layered neural network:
y = σ(W^(B) σ(W^(A) x))
where:
• W^(A) and W^(B) are weight matrices.
• h = σ(W^(A) x) is the hidden layer representation.
• σ is the activation function.
• ∇_g(f) denotes the gradient of f w.r.t. g.
Determine which of the following statements are true:
• ∇_h(y) depends on W^(A)
• ∇_{W^(A)}(y) depends on W^(B)
• ∇_{W^(A)}(h) depends on W^(B)
• ∇_{W^(B)}(y) depends on W^(A)

Solution
1. Gradient of Output w.r.t. Hidden Representation:

∇_h(y) = σ′(W^(B) h) W^(B)

Expressed in terms of h and W^(B), this does not explicitly involve W^(A), so the first option is incorrect.


2. Gradient of Output w.r.t. W^(A):

∇_{W^(A)}(y) = ∇_h(y) · ∇_{W^(A)}(h)

The factor ∇_h(y) = σ′(W^(B) h) W^(B) contains W^(B), so the second option is correct.


3. Gradient of Hidden Representation w.r.t. W^(A):

∇_{W^(A)}(h) = σ′(W^(A) x) x^T

This does not depend on W^(B), so the third option is incorrect.


4. Gradient of Output w.r.t. W^(B):

∇_{W^(B)}(y) = σ′(W^(B) h) h^T

Since h = σ(W^(A) x) depends on W^(A), the fourth option is correct.

Final Answer
Accepted Answers: ∇_{W^(A)}(y) depends on W^(B), ∇_{W^(B)}(y) depends on W^(A)
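
The dependencies can also be checked numerically (a sketch assuming NumPy and a sigmoid activation; the shapes and seed are arbitrary): changing W^(A) changes h and therefore changes ∇_{W^(B)}(y).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(1)
    x = rng.normal(size=3)
    W_A1 = rng.normal(size=(2, 3))
    W_A2 = rng.normal(size=(2, 3))   # a different first-layer weight matrix
    W_B = rng.normal(size=(1, 2))

    def grad_y_wrt_WB(W_A):
        h = sigmoid(W_A @ x)          # hidden representation
        z = W_B @ h
        # sigma'(W_B h) h^T
        return (sigmoid(z) * (1 - sigmoid(z)))[:, None] * h[None, :]

    # Different W^(A) give different gradients w.r.t. W^(B)
    print(np.allclose(grad_y_wrt_WB(W_A1), grad_y_wrt_WB(W_A2)))  # False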

Question 4: Neural Network Weight Initialization


Problem Statement
Which of the following statements about weight initialization in a network using the
sigmoid activation function are true?

• Two different initializations of the same network could converge to different minima.

• For a given initialization, gradient descent will converge to the same minima irre-
spective of the learning rate.

• Initializing all weights to the same constant value leads to undesirable results.

• Initializing all weights to very large values leads to undesirable results.

Solution
1. Different Initializations Lead to Different Minima:
• Neural networks have non-convex loss surfaces.

• Different initializations may lead to different local minima.

• This statement is true.


2. Convergence to the Same Minima Regardless of Learning Rate:

• The learning rate affects the optimization trajectory.

• High learning rates may lead to divergence or poor convergence.

• Different learning rates may lead to different minima.

• This statement is false.

3. Initializing All Weights to the Same Constant Value:

• If all weights are initialized to the same value, all neurons will produce the same
output.

• This leads to symmetry in gradients, preventing effective learning.

• This statement is true.

4. Initializing Weights to Very Large Values:

• Large weights cause extreme values in the sigmoid activation.

• This leads to vanishing gradients due to saturation.

• This statement is true.

Final Answer
Accepted Answers:

• Two different initializations of the same network could converge to different minima.

• Initializing all weights to the same constant value leads to undesirable results.

• Initializing all weights to very large values leads to undesirable results.
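
The symmetry argument behind the constant-initialization answer can be seen directly in code (a minimal sketch assuming NumPy, a two-unit hidden layer, sigmoid activations, and a squared-error loss, all illustrative assumptions): both hidden units receive identical gradients, so they remain identical after every update.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.0])          # one training example
    t = 1.0                            # target
    W1 = np.full((2, 2), 0.3)          # all first-layer weights equal
    w2 = np.full(2, 0.3)               # all second-layer weights equal

    # Forward pass
    h = sigmoid(W1 @ x)
    y = sigmoid(w2 @ h)

    # Backward pass for squared error L = 0.5 * (y - t)^2
    dy = (y - t) * y * (1 - y)
    dW1 = np.outer(dy * w2 * h * (1 - h), x)

    print(dW1)  # both rows are identical: the two hidden units learn the same function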

Question 5: Derivatives of Sigmoid and Tanh Activation Functions

Problem Statement
Consider the following statements about the derivatives of the sigmoid σ(x) and hyper-
bolic tangent tanh(x) activation functions:

σ(x) = 1 / (1 + exp(−x)),    tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x))
Which of the following statements are true?

• σ′(x) = σ(x)(1 − σ(x))

• 0 < σ′(x) ≤ 1/4

• tanh′(x) = (1/2)(1 − (tanh(x))²)

• 0 < tanh′(x) ≤ 1

Solution
1. Derivative of Sigmoid Function:

σ ′ (x) = σ(x)(1 − σ(x))

This is a well-known result, so the first statement is true.


2. Range of σ ′ (x):

• Since σ(x) is always between 0 and 1, its derivative is maximized when σ(x) = 0.5.

• This gives σ′(x) = 0.25, meaning 0 < σ′(x) ≤ 1/4.

• Thus, the second statement is true.

3. Derivative of Tanh Function:

tanh′(x) = 1 − (tanh(x))²

The given formula tanh′(x) = (1/2)(1 − (tanh(x))²) is incorrect (the factor 1/2 is unnecessary).
Therefore, the third statement is false.
4. Range of tanh′ (x):

• The function tanh′ (x) is maximum at x = 0, where tanh(0) = 0 and tanh′ (0) = 1.

• tanh′(x) approaches 0 (but never reaches it) as x → ±∞.

• Thus, 0 < tanh′ (x) ≤ 1, making the fourth statement true.

Final Answer
Accepted Answers:

• σ′(x) = σ(x)(1 − σ(x))

• 0 < σ′(x) ≤ 1/4

• 0 < tanh′(x) ≤ 1
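
Both accepted derivative identities and their ranges can be confirmed numerically (a sketch assuming NumPy; the grid of x values is arbitrary):

    import numpy as np

    x = np.linspace(-5, 5, 1001)
    sig = 1.0 / (1.0 + np.exp(-x))

    d_sig = sig * (1 - sig)          # sigma'(x) = sigma(x)(1 - sigma(x))
    d_tanh = 1 - np.tanh(x) ** 2     # tanh'(x) = 1 - tanh(x)^2

    print(d_sig.max())   # 0.25, attained at x = 0
    print(d_tanh.max())  # 1.0,  attained at x = 0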

Question 6: MLE of p in a Geometric Distribution


Problem Statement
A geometric distribution is defined by the probability mass function:

f(x; p) = (1 − p)^(x−1) p,    x = 1, 2, . . .

Given the samples [4, 5, 6, 5, 4, 3] drawn from this distribution, find the maximum likeli-
hood estimate (MLE) of p.

Solution
The MLE for p in a geometric distribution is given by:
p̂ = 1/x̄

where x̄ is the sample mean.
Step 1: Compute Sample Mean
x̄ = (4 + 5 + 6 + 5 + 4 + 3) / 6 = 27/6 = 4.5
Step 2: Compute MLE of p
p̂ = 1/4.5 ≈ 0.222

Final Answer
Accepted Answer: 0.222
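
The same computation in code (a minimal sketch using only the sample mean):

    samples = [4, 5, 6, 5, 4, 3]
    x_bar = sum(samples) / len(samples)   # 4.5
    p_hat = 1 / x_bar                     # MLE of p for the geometric distribution
    print(round(p_hat, 3))                # 0.222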

Question 7: Maximum A Posteriori (MAP) Estimation for a Bernoulli Distribution

Problem Statement
Consider a Bernoulli distribution with p = 0.7 (true value of the parameter). We compute
a MAP estimate of p by assuming a prior distribution over p. Let N(µ, σ²) denote a
Gaussian distribution with mean µ and variance σ².
Which of the following statements are true?

• If the prior is N (0.6, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.4, 0.1).

• If the prior is N (0.4, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.6, 0.1).

• With a prior of N (0.1, 0.001), the estimate will never converge to the true value,
regardless of the number of samples used.

• With a prior of U (0, 0.5) (uniform between 0 and 0.5), the estimate will never
converge to the true value, regardless of the number of samples used.

Solution
1. Effect of the Prior Mean:
MAP estimation incorporates prior knowledge. If the prior mean is closer to the true
value (p = 0.7), the estimator converges faster. Since N (0.6, 0.1) is closer to 0.7 than
N (0.4, 0.1), fewer samples will be needed to converge, making the first statement true.
2. Incorrect Claim about N (0.4, 0.1):
Since 0.4 is farther from 0.7 than 0.6, this statement is false.

3. Effect of a Strongly Biased Prior:
A prior of N (0.1, 0.001) is highly concentrated near 0.1. While it slows down conver-
gence, given sufficient data, the MAP estimate will eventually reach the true value. This
statement is false.
4. Effect of a Uniform Prior on [0, 0.5]:
Since the uniform prior does not include 0.7 in its support, the MAP estimate will never
reach 0.7, regardless of the number of samples. This statement is true.

Final Answer
Accepted Answers:

• If the prior is N (0.6, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.4, 0.1).

• With a prior of U (0, 0.5), the estimate will never converge to the true value, regard-
less of the number of samples used.
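
The role of the prior's support can be illustrated with a grid-based MAP computation (a sketch assuming NumPy; the sample size, seed, and grid resolution are arbitrary, and the grid search is only an approximation): with a uniform prior on [0, 0.5], the MAP estimate can never exceed 0.5, no matter how much data is observed.

    import numpy as np

    rng = np.random.default_rng(2)
    data = rng.binomial(1, 0.7, size=10_000)   # Bernoulli samples with true p = 0.7
    k, n = data.sum(), data.size

    p_grid = np.linspace(1e-3, 1 - 1e-3, 999)
    log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)

    # Uniform prior on [0, 0.5]: log-prior is 0 inside the support, -inf outside
    log_prior = np.where(p_grid <= 0.5, 0.0, -np.inf)

    map_estimate = p_grid[np.argmax(log_lik + log_prior)]
    print(map_estimate)  # stays at ~0.5 and never reaches 0.7, however large n is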

Question 8: Parameter Estimation Techniques


Problem Statement
Which of the following statements about parameter estimation techniques are true?

• To obtain a distribution over the predicted values for a new data point, we need to
compute an integral over the parameter space.

• The MAP estimate of the parameter gives a point prediction for a new data point.

• The MLE of a parameter gives a distribution of predicted values for a new data
point.

• We need a point estimate of the parameter to compute a distribution of the predicted
values for a new data point.

Solution
1. Bayesian Approach and Integral Computation:
To obtain a predictive distribution, we integrate over the parameter space using Bayesian
inference:

P(y|x) = ∫ P(y|x, θ) P(θ|D) dθ

This means the first statement is true.


2. MAP Estimate as a Point Prediction:
The MAP estimate is defined as:

θ̂_MAP = arg max_θ P(θ|D)

Since it provides a single best estimate, it gives a point prediction, making the second
statement true.

3. Incorrect MLE Claim:
The MLE finds the most likely parameter value but does not provide a distribution over
predicted values. This statement is false.
4. Incorrect Requirement for Point Estimate:
A full Bayesian approach computes distributions directly without needing a point esti-
mate, making this statement false.

Final Answer
Accepted Answers:
• To obtain a distribution over the predicted values for a new data point, we need to
compute an integral over the parameter space.
• The MAP estimate of the parameter gives a point prediction for a new data point.
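
For a conjugate Beta-Bernoulli model the predictive integral can be evaluated explicitly, which makes the first accepted answer concrete (a sketch assuming SciPy; the Beta(2, 2) prior and the observed data are illustrative assumptions, not part of the question): the probability of the next observation comes from integrating over the parameter, not from plugging in a single point estimate.

    from scipy.integrate import quad
    from scipy.stats import beta

    a, b = 2, 2                   # Beta prior hyperparameters (assumed)
    data = [1, 1, 0, 1, 1, 0, 1]  # observed Bernoulli outcomes (assumed)
    k, n = sum(data), len(data)

    posterior = beta(a + k, b + n - k)   # conjugate posterior over theta

    # Predictive probability of y_new = 1: integral of theta * posterior(theta) dtheta
    pred, _ = quad(lambda th: th * posterior.pdf(th), 0, 1)
    print(pred)                          # equals the posterior mean
    print((a + k) / (a + b + n))         # closed form for comparison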

Question 9: Minimizing Cross-Entropy Loss


Problem Statement
In classification settings, it is common in machine learning to minimize the discrete cross-
entropy loss:
H_CE(p, q) = − Σ_i p_i log q_i

where p_i and q_i are the probabilities assigned to class i by the true and predicted
distributions, respectively. Given this, which of the following statements are true?

• Minimizing H_CE(p, q) is equivalent to minimizing the (self) entropy H(q).
• Minimizing H_CE(p, q) is equivalent to minimizing H_CE(q, p).
• Minimizing H_CE(p, q) is equivalent to minimizing the KL divergence D_KL(p||q).
• Minimizing H_CE(p, q) is equivalent to minimizing the KL divergence D_KL(q||p).

Solution
1. Connection to KL Divergence:
The cross-entropy loss can be rewritten using the KL divergence:

H_CE(p, q) = H(p) + D_KL(p||q)

Since the true distribution p is fixed, minimizing H_CE(p, q) is equivalent to minimizing
D_KL(p||q), making the third statement true.
2. Incorrect Statements:
• The self-entropy H(q) refers to the entropy of the predicted distribution and is
unrelated to the loss function.
• Cross-entropy is not symmetric, so H_CE(p, q) ≠ H_CE(q, p).
• KL divergence is asymmetric, meaning D_KL(p||q) ≠ D_KL(q||p).

Final Answer
Accepted Answer: Minimizing H_CE(p, q) is equivalent to minimizing D_KL(p||q).
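
The decomposition H_CE(p, q) = H(p) + D_KL(p||q) and the asymmetry of cross-entropy can be checked numerically (a sketch assuming NumPy; the two distributions are arbitrary):

    import numpy as np

    p = np.array([0.7, 0.2, 0.1])   # true distribution
    q = np.array([0.5, 0.3, 0.2])   # predicted distribution

    cross_entropy = -np.sum(p * np.log(q))
    entropy_p = -np.sum(p * np.log(p))
    kl_pq = np.sum(p * np.log(p / q))

    print(np.isclose(cross_entropy, entropy_p + kl_pq))        # True: H_CE = H(p) + KL(p||q)
    print(np.isclose(cross_entropy, -np.sum(q * np.log(p))))   # False: H_CE(p,q) != H_CE(q,p)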

Question 10: Activation Functions in Neural Networks


Problem Statement
Which of the following statements about activation functions are NOT true?

• Non-linearity of activation functions is not a necessary criterion when designing
very deep neural networks.

• Saturating non-linear activation functions (derivative → 0 as x → ±∞) avoid the
vanishing gradients problem.

• Using the ReLU activation function avoids all problems arising due to gradients
being too small.

• The dead neurons problem in ReLU networks can be fixed using a leaky ReLU
activation function.

Solution
1. Incorrect Claim About Non-Linearity:
Non-linearity is crucial for deep networks; otherwise, the entire network collapses into a
linear function. The first statement is false.
2. Incorrect Claim About Saturating Activation Functions:
Saturating activation functions (e.g., sigmoid, tanh) cause vanishing gradients, making
optimization difficult. The second statement is also false.
3. Incorrect Claim About ReLU Solving All Gradient Issues:
While ReLU mitigates vanishing gradients, it still suffers from the dying ReLU problem
where neurons output zero permanently. Thus, the third statement is false.
4. Correct Statement About Leaky ReLU:
Leaky ReLU assigns a small slope for negative inputs, preventing neurons from completely
dying. The fourth statement is true.
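
The leaky-ReLU fix for dead neurons is easy to see in code (a minimal sketch assuming NumPy; the slope 0.01 is a common default but is an assumption here): for negative pre-activations, ReLU passes zero gradient while leaky ReLU keeps a small one.

    import numpy as np

    def relu_grad(z):
        return (z > 0).astype(float)

    def leaky_relu_grad(z, alpha=0.01):
        return np.where(z > 0, 1.0, alpha)

    z = np.array([-2.0, -0.5, 0.5, 2.0])
    print(relu_grad(z))        # [0.   0.   1. 1.]  -> negative units get no gradient
    print(leaky_relu_grad(z))  # [0.01 0.01 1. 1.]  -> a small gradient still flows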

Final Answer
Accepted Answers:

• Non-linearity of activation functions is not a necessary criterion when designing
very deep neural networks.

• Saturating non-linear activation functions (derivative → 0 as x → ±∞) avoid the
vanishing gradients problem.

• Using the ReLU activation function avoids all problems arising due to gradients
being too small.
