Deep Learning Assignment0
Instructions:
• This assignment is meant to help you understand certain concepts we will use in the
course.
1. Simple Derivatives
(a) Find the derivative of the sigmoid function with respect to x where the sigmoid
function σ(x) is given by,
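\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
Solution:
\[
\begin{aligned}
\sigma'(x) &= \frac{d\sigma(x)}{dx} \\
&= \frac{d}{dx}\left(\frac{1}{1 + e^{-x}}\right) \\
&= \frac{d}{dx}(1 + e^{-x})^{-1} \\
&= -(1 + e^{-x})^{-2}\,\frac{d}{dx}(1 + e^{-x}) \\
&= -(1 + e^{-x})^{-2}(-e^{-x}) \\
&= \frac{e^{-x}}{(1 + e^{-x})^{2}} \\
&= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\
&= \frac{1}{1 + e^{-x}} \cdot \frac{(1 + e^{-x}) - 1}{1 + e^{-x}} \\
&= \frac{1}{1 + e^{-x}} \left(1 - \frac{1}{1 + e^{-x}}\right) \\
&= \sigma(x)(1 - \sigma(x))
\end{aligned}
\]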
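The closed form above can be sanity-checked numerically. The following is a minimal sketch, not part of the original solution: the helper names (`sigmoid`, `sigmoid_grad`, `central_difference`) and the step size `h` are illustrative assumptions. It compares σ(x)(1 − σ(x)) against a central finite difference.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Closed form from the derivation: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-5):
    """Numerical derivative (f(x + h) - f(x - h)) / (2h); h is an assumed step size."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Compare the analytic and numerical derivatives at a few sample points.
for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, sigmoid_grad(x), central_difference(sigmoid, x))
```

The two columns should agree to many decimal places; at x = 0 the derivative is exactly 0.25.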
Solution: Given,
\[ L = (y - \hat{y})^2 = \frac{1}{2\pi}\left(e^{-\frac{x^2}{2}} - e^{-\frac{(x-1)^2}{2}}\right)^2 \]
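The derivative of L w.r.t. x is given by $\frac{dL}{dx} = L'$, which can be found as follows:
\[
\begin{aligned}
L' &= \frac{1}{2\pi}\,\frac{d}{dx}\left(e^{-\frac{x^2}{2}} - e^{-\frac{(x-1)^2}{2}}\right)^2 \\
&= \frac{2}{2\pi}\left(e^{-\frac{x^2}{2}} - e^{-\frac{(x-1)^2}{2}}\right)\frac{d}{dx}\left(e^{-\frac{x^2}{2}} - e^{-\frac{(x-1)^2}{2}}\right) \\
&= \frac{1}{\pi}\left(e^{-\frac{x^2}{2}} - e^{-\frac{(x-1)^2}{2}}\right)\left(\frac{d}{dx}e^{-\frac{x^2}{2}} - \frac{d}{dx}e^{-\frac{(x-1)^2}{2}}\right) \\
&= \frac{1}{\pi}\left(e^{-\frac{x^2}{2}} - e^{-\frac{(x-1)^2}{2}}\right)\left(e^{-\frac{x^2}{2}}\frac{d}{dx}\left(-\frac{x^2}{2}\right) - e^{-\frac{(x-1)^2}{2}}\frac{d}{dx}\left(-\frac{(x-1)^2}{2}\right)\right) \\
&= \frac{1}{\pi}\left(e^{-\frac{x^2}{2}} - e^{-\frac{(x-1)^2}{2}}\right)\left(e^{-\frac{x^2}{2}}(-x) - e^{-\frac{(x-1)^2}{2}}(-(x-1))\right) \\
&= \frac{-1}{\pi}\left(e^{-\frac{x^2}{2}} - e^{-\frac{(x-1)^2}{2}}\right)\left(x e^{-\frac{x^2}{2}} - (x-1)e^{-\frac{(x-1)^2}{2}}\right)
\end{aligned}
\]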
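By substituting x = 1, we get:
\[
\begin{aligned}
\left.\frac{dL}{dx}\right|_{x=1} &= \frac{-1}{\pi}\left(e^{-\frac{1^2}{2}} - e^{-\frac{(1-1)^2}{2}}\right)\left(e^{-\frac{1^2}{2}} - (1-1)e^{-\frac{(1-1)^2}{2}}\right) \\
&= \frac{-1}{\pi}\left(e^{-\frac{1}{2}} - 1\right)e^{-\frac{1}{2}}
\end{aligned}
\]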
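As a check on the algebra, the sketch below evaluates the loss L from this solution, its derived gradient, and the closed-form value at x = 1. The function names and the finite-difference step are assumptions made for illustration only.

```python
import math

def loss(x):
    """L = (1 / 2pi) * (e^{-x^2/2} - e^{-(x-1)^2/2})^2, as given in the solution."""
    a = math.exp(-x * x / 2.0)
    b = math.exp(-((x - 1.0) ** 2) / 2.0)
    return (a - b) ** 2 / (2.0 * math.pi)

def loss_grad(x):
    """Derived gradient: (-1/pi) * (a - b) * (x*a - (x-1)*b)."""
    a = math.exp(-x * x / 2.0)
    b = math.exp(-((x - 1.0) ** 2) / 2.0)
    return -(a - b) * (x * a - (x - 1.0) * b) / math.pi

# Closed-form value at x = 1 from the last line of the solution.
closed_form_at_1 = -(math.exp(-0.5) - 1.0) * math.exp(-0.5) / math.pi

# Central finite difference at x = 1 (step size h is an assumed value).
h = 1e-5
numeric_at_1 = (loss(1.0 + h) - loss(1.0 - h)) / (2.0 * h)

print(loss_grad(1.0), closed_form_at_1, numeric_at_1)
```

All three printed values should coincide up to finite-difference error, and the result is positive because $e^{-1/2} - 1 < 0$.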
(c) Find the derivative of f (ρ) with respect to ρ where f (ρ) is given by,
\[ f(\rho) = \rho \log\frac{\rho}{\hat{\rho}} + (1 - \rho)\log\frac{1 - \rho}{1 - \hat{\rho}} \]
(Hint : You can treat ρ̂ as a constant.)
Solution: The derivative of f (ρ) with respect to ρ can be found as follows:
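\[
\begin{aligned}
f'(\rho) &= \frac{d}{d\rho}\left(f(\rho)\right) \\
&= \frac{d}{d\rho}\left[\rho\log\left(\frac{\rho}{\hat{\rho}}\right) + (1-\rho)\log\left(\frac{1-\rho}{1-\hat{\rho}}\right)\right] \\
&= \frac{d}{d\rho}\left[\rho\log(\rho) - \rho\log(\hat{\rho}) + (1-\rho)\log(1-\rho) - (1-\rho)\log(1-\hat{\rho})\right] \\
&= \frac{d}{d\rho}\left(\rho\log(\rho)\right) - \frac{d}{d\rho}\left(\rho\log(\hat{\rho})\right) + \frac{d}{d\rho}\left((1-\rho)\log(1-\rho)\right) - \frac{d}{d\rho}\left((1-\rho)\log(1-\hat{\rho})\right)
\end{aligned}
\]
Treating $\hat{\rho}$ as a constant and using the product rule of derivatives, we get,
\[
\begin{aligned}
f'(\rho) &= \left(\rho\cdot\frac{1}{\rho} + \log(\rho)(1)\right) - \log(\hat{\rho})(1) + \left((1-\rho)\cdot\frac{-1}{1-\rho} + \log(1-\rho)(-1)\right) - \log(1-\hat{\rho})(-1) \\
&= 1 + \log(\rho) - \log(\hat{\rho}) - 1 - \log(1-\rho) + \log(1-\hat{\rho}) \\
&= \log\left(\frac{\rho}{\hat{\rho}}\right) - \log\left(\frac{1-\rho}{1-\hat{\rho}}\right) \\
&= \log\left(\frac{\rho(1-\hat{\rho})}{\hat{\rho}(1-\rho)}\right)
\end{aligned}
\]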
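The result can be verified numerically. This is a small sketch under stated assumptions: natural logarithms are used, the sample values ρ = 0.3 and ρ̂ = 0.6 are arbitrary, and the function names are illustrative.

```python
import math

def f(rho, rho_hat):
    """f(rho) = rho*log(rho/rho_hat) + (1-rho)*log((1-rho)/(1-rho_hat))."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))

def f_grad(rho, rho_hat):
    """Derived result: log(rho*(1-rho_hat) / (rho_hat*(1-rho)))."""
    return math.log(rho * (1.0 - rho_hat) / (rho_hat * (1.0 - rho)))

rho, rho_hat, h = 0.3, 0.6, 1e-5  # arbitrary sample point and assumed step size
numeric = (f(rho + h, rho_hat) - f(rho - h, rho_hat)) / (2.0 * h)
print(f_grad(rho, rho_hat), numeric)
```

Note that the derivative vanishes when ρ = ρ̂, which is where f attains its minimum value of zero.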
2. Chain Rule
Using the chain rule of derivatives, find the derivative of f (x) with respect to x where
(a) $f(x) = x\log(3^x)$
Solution: Let,
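\[ z = 3^x \quad\therefore\quad \frac{dz}{dx} = \frac{d}{dx}3^x = 3^x\log 3 \]
Also let,
\[ y = \log(z) \quad\therefore\quad \frac{dy}{dz} = \frac{d}{dz}\log z = \frac{1}{z} = \frac{1}{3^x} \]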
Therefore, we can write f (x) in terms of y which itself can be written in terms
of z, i.e. ,
f (x) = xy
The derivative of f (x) can be found as follows:
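\[
\begin{aligned}
f'(x) &= \frac{d}{dx}\left(f(x)\right) \\
&= \frac{d}{dx}(xy) \\
&= x\frac{dy}{dx} + y\frac{d}{dx}x && \text{(By Product Rule)} \\
&= x\frac{dy}{dz}\frac{dz}{dx} + y && \text{(By Chain Rule)} \\
&= x\cdot\frac{1}{3^x}\cdot 3^x\log 3 + \log 3^x \\
&= x\log 3 + \log 3^x \\
&= \log 3^x + \log 3^x \\
&= 2\log 3^x
\end{aligned}
\]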
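A quick numerical check of this result is sketched below. The function names and the sample point x = 1.5 are illustrative assumptions; the natural logarithm is assumed for log.

```python
import math

def f(x):
    """f(x) = x * log(3^x)."""
    return x * math.log(3.0 ** x)

def f_grad(x):
    """Derived result: 2 * log(3^x), which equals 2 * x * log(3)."""
    return 2.0 * math.log(3.0 ** x)

x, h = 1.5, 1e-5  # arbitrary sample point and assumed step size
numeric = (f(x + h) - f(x - h)) / (2.0 * h)
print(f_grad(x), numeric)
```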
where,
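\[ z = w_0 x + b_0 \quad\therefore\quad \frac{dz}{dx} = \frac{d}{dx}(w_0 x + b_0) = w_0 \]
and
\[ y = w_1\sigma(z) + b_1 \quad\therefore\quad \frac{dy}{dz} = w_1\frac{d\sigma(z)}{dz} = w_1\sigma(z)(1 - \sigma(z)) \]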
Therefore, we can write f (x) in terms of y which itself can be written in terms
of z, i.e. ,
f (x) = σ(y)
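The derivative of f(x) can be found as given below. Also, recall from Q1(a) that
the derivative of σ(x) w.r.t. x is given by $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.
\[
\begin{aligned}
f(x) &= \sigma(y) \\
f'(x) &= \frac{d}{dx}\sigma(y) \\
&= \frac{d}{dy}\sigma(y)\,\frac{dy}{dx} && \text{(By Chain Rule)} \\
&= \sigma(y)(1 - \sigma(y))\,\frac{dy}{dz}\frac{dz}{dx} && \text{(By Chain Rule)} \\
&= \sigma(y)(1 - \sigma(y))\,w_1\,\sigma(z)(1 - \sigma(z))\,w_0
\end{aligned}
\]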
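The nested chain rule above can be checked with a finite difference. This is a sketch only: the parameter values for w0, b0, w1, b1 and the sample point are hypothetical, chosen purely for illustration.

```python
import math

def sigmoid(t):
    """Logistic sigmoid 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + math.exp(-t))

def f(x, w0, b0, w1, b1):
    """f(x) = sigma(y) with y = w1*sigma(z) + b1 and z = w0*x + b0."""
    z = w0 * x + b0
    y = w1 * sigmoid(z) + b1
    return sigmoid(y)

def f_grad(x, w0, b0, w1, b1):
    """Derived result: sigma(y)(1-sigma(y)) * w1 * sigma(z)(1-sigma(z)) * w0."""
    z = w0 * x + b0
    y = w1 * sigmoid(z) + b1
    return (sigmoid(y) * (1.0 - sigmoid(y))
            * w1 * sigmoid(z) * (1.0 - sigmoid(z)) * w0)

# Hypothetical parameter values, not taken from the assignment.
params = dict(w0=0.7, b0=-0.2, w1=1.5, b1=0.3)
x, h = 0.4, 1e-5
numeric = (f(x + h, **params) - f(x - h, **params)) / (2.0 * h)
print(f_grad(x, **params), numeric)
```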
3. Taylor Series
(a) Consider x ∈ R and f (x) ∈ R. Write down the Taylor series expansion of f (x).
\[ f(x + \delta x) = f(x) + f'(x)(\delta x) + \frac{f''(x)}{2!}(\delta x)^2 + \dots + \frac{f^{(n)}(x)}{n!}(\delta x)^n + \dots \]
where δx is very small, $f'(x)$ is the first derivative of f(x) with respect to x and
$f^{(n)}(x)$ is the nth derivative of f(x) with respect to x.
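To illustrate how the truncated series approximates f(x + δx), the sketch below uses f = sin as an example function (an assumption made here because its derivatives are known in closed form) and shows the error shrinking as more terms are kept.

```python
import math

def taylor_sin(x, dx, order):
    """Approximate sin(x + dx) with the Taylor series around x, truncated at `order`."""
    # The derivatives of sin repeat with period 4: sin, cos, -sin, -cos.
    derivs = [math.sin(x), math.cos(x), -math.sin(x), -math.cos(x)]
    return sum(derivs[n % 4] * dx ** n / math.factorial(n)
               for n in range(order + 1))

x, dx = 0.5, 0.1  # arbitrary expansion point and small step
exact = math.sin(x + dx)
errors = [abs(taylor_sin(x, dx, order) - exact) for order in range(4)]
for order, err in enumerate(errors):
    print(order, err)
```

Each additional term reduces the error by roughly a factor of δx, which is why low-order truncations are adequate when δx is small.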
(b) Consider x ∈ Rn and f (x) ∈ R. Write down the Taylor series expansion of f (x).
4. Softmax Function
(a) How is the softmax function defined ?
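\[
\begin{aligned}
\text{softmax}(v)_1 &= \frac{e^{v_1}}{\sum_{k=1}^{3} e^{v_k}} \quad \text{(note that here } K = 3\text{)} \\
&= \frac{e^{2.1}}{e^{2.1} + e^{4.8} + e^{3.5}} = 0.0502 \\
\text{softmax}(v)_2 &= \frac{e^{v_2}}{\sum_{k=1}^{3} e^{v_k}} \\
&= \frac{e^{4.8}}{e^{2.1} + e^{4.8} + e^{3.5}} = 0.7464 \\
\text{softmax}(v)_3 &= \frac{e^{v_3}}{\sum_{k=1}^{3} e^{v_k}} \\
&= \frac{e^{3.5}}{e^{2.1} + e^{4.8} + e^{3.5}} = 0.2034
\end{aligned}
\]
Therefore, softmax(v) = [0.0502 0.7464 0.2034]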
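The computation above can be reproduced in a few lines. In this sketch the vector v = [2.1, 4.8, 3.5] is read off the exponents in the worked example; subtracting the maximum before exponentiating is an added numerical-stability step that is not part of the original solution and does not change the result.

```python
import math

def softmax(v):
    """Return e^{v_i} / sum_k e^{v_k} for each component of v."""
    shift = max(v)  # stability trick (an addition here); cancels in the ratio
    exps = [math.exp(vi - shift) for vi in v]
    total = sum(exps)
    return [e / total for e in exps]

v = [2.1, 4.8, 3.5]
probs = softmax(v)
print([round(p, 4) for p in probs])  # [0.0502, 0.7464, 0.2034]
print(sum(probs))
```

The outputs are non-negative and sum to 1, which is what allows them to be read as a probability distribution.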
(b) Can you think of any concept which is similar to what the softmax function
computes? (Hint: You probably learnt it in high school)
Solution: The output of the softmax function can be used to represent the
probability distribution over K components of the input vector.
5. Matrix Multiplication
(a) What are the four ways of multiplying two matrices ?
Solution:
1. The most common way of finding the product of two matrices A and B is
to compute the ij-th element of the resultant product matrix C using the
ith row of A and the jth column of B. For example, suppose matrix A is of
size m × n with elements $a_{ij}$ and matrix B is of size n × p with elements $b_{jk}$;
then multiplying matrices A and B will produce a matrix C of size m × p.
The ij-th element of this matrix is computed as,
\[ c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} \]
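2. The second way is to realise that the columns of C are linear combinations
of the columns of A. To get the ith column of C, multiply the whole
matrix A with the ith column of B. (Remember that a matrix times a column
is a column.)
Example: Let A be a 3 × 2 matrix and B be a 2 × 3 matrix. Then,
\[
\begin{aligned}
C &= AB \\
&= \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix}
\begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{bmatrix} \\
&= \begin{bmatrix}
\underbrace{\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix}\begin{bmatrix} b_{11} \\ b_{21} \end{bmatrix}}_{\text{1st column of } C} &
\underbrace{\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix}\begin{bmatrix} b_{12} \\ b_{22} \end{bmatrix}}_{\text{2nd column of } C} &
\underbrace{\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix}\begin{bmatrix} b_{13} \\ b_{23} \end{bmatrix}}_{\text{3rd column of } C}
\end{bmatrix} \\
&= \begin{bmatrix}
a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} & a_{11}b_{13} + a_{12}b_{23} \\
a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} & a_{21}b_{13} + a_{22}b_{23} \\
a_{31}b_{11} + a_{32}b_{21} & a_{31}b_{12} + a_{32}b_{22} & a_{31}b_{13} + a_{32}b_{23}
\end{bmatrix}
\end{aligned}
\]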
3. The third way is to realise that the rows of C are the linear combinations
of rows of B. To get the ith row of C, multiply the ith row of A with the
whole matrix B. (Remember that a row times matrix is a row.)
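Example: Let A be a 3 × 2 matrix and B be a 2 × 3 matrix. Then,
\[
\begin{aligned}
C &= AB \\
&= \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix}
\begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{bmatrix} \\
&= \begin{bmatrix}
\begin{bmatrix} a_{11} & a_{12} \end{bmatrix}\begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{bmatrix} \\[4pt]
\begin{bmatrix} a_{21} & a_{22} \end{bmatrix}\begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{bmatrix} \\[4pt]
\begin{bmatrix} a_{31} & a_{32} \end{bmatrix}\begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{bmatrix}
\end{bmatrix}
\begin{matrix} \leftarrow \text{1st row of } C \\[4pt] \leftarrow \text{2nd row of } C \\[4pt] \leftarrow \text{3rd row of } C \end{matrix} \\
&= \begin{bmatrix}
a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} & a_{11}b_{13} + a_{12}b_{23} \\
a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} & a_{21}b_{13} + a_{22}b_{23} \\
a_{31}b_{11} + a_{32}b_{21} & a_{31}b_{12} + a_{32}b_{22} & a_{31}b_{13} + a_{32}b_{23}
\end{bmatrix}
\end{aligned}
\]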
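\[
\begin{aligned}
C &= AB \\
&= \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix}
\begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{bmatrix} \\
&= \underbrace{\begin{bmatrix} a_{11} \\ a_{21} \\ a_{31} \end{bmatrix}}_{\text{1st column of } A}
\underbrace{\begin{bmatrix} b_{11} & b_{12} & b_{13} \end{bmatrix}}_{\text{1st row of } B}
+ \underbrace{\begin{bmatrix} a_{12} \\ a_{22} \\ a_{32} \end{bmatrix}}_{\text{2nd column of } A}
\underbrace{\begin{bmatrix} b_{21} & b_{22} & b_{23} \end{bmatrix}}_{\text{2nd row of } B} \\
&= \begin{bmatrix}
a_{11}b_{11} & a_{11}b_{12} & a_{11}b_{13} \\
a_{21}b_{11} & a_{21}b_{12} & a_{21}b_{13} \\
a_{31}b_{11} & a_{31}b_{12} & a_{31}b_{13}
\end{bmatrix}
+ \begin{bmatrix}
a_{12}b_{21} & a_{12}b_{22} & a_{12}b_{23} \\
a_{22}b_{21} & a_{22}b_{22} & a_{22}b_{23} \\
a_{32}b_{21} & a_{32}b_{22} & a_{32}b_{23}
\end{bmatrix} \\
&= \begin{bmatrix}
a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} & a_{11}b_{13} + a_{12}b_{23} \\
a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} & a_{21}b_{13} + a_{22}b_{23} \\
a_{31}b_{11} + a_{32}b_{21} & a_{31}b_{12} + a_{32}b_{22} & a_{31}b_{13} + a_{32}b_{23}
\end{bmatrix}
\end{aligned}
\]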
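The views of matrix multiplication discussed in this question (element-wise dot products, column combinations, row combinations, and the column-times-row expansion in the last worked example) can be compared side by side. The following is an illustrative sketch in plain Python using nested lists; the function names and the sample matrices are assumptions, not part of the assignment.

```python
def shape(A, B):
    """Return (m, n, p) for an m x n matrix A and an n x p matrix B."""
    assert len(A[0]) == len(B), "inner dimensions must match"
    return len(A), len(B), len(B[0])

def matmul_dot(A, B):
    """c_ij is the dot product of row i of A with column j of B."""
    m, n, p = shape(A, B)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

def matmul_columns(A, B):
    """Column j of C is a linear combination of the columns of A."""
    m, n, p = shape(A, B)
    C = [[0] * p for _ in range(m)]
    for j in range(p):
        for k in range(n):          # weight B[k][j] applied to column k of A
            for i in range(m):
                C[i][j] += B[k][j] * A[i][k]
    return C

def matmul_rows(A, B):
    """Row i of C is a linear combination of the rows of B."""
    m, n, p = shape(A, B)
    C = [[0] * p for _ in range(m)]
    for i in range(m):
        for k in range(n):          # weight A[i][k] applied to row k of B
            for j in range(p):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_outer(A, B):
    """C is the sum over k of (column k of A) times (row k of B)."""
    m, n, p = shape(A, B)
    C = [[0] * p for _ in range(m)]
    for k in range(n):              # accumulate one rank-one matrix per k
        for i in range(m):
            for j in range(p):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4], [5, 6]]        # 3 x 2 sample matrix
B = [[7, 8, 9], [10, 11, 12]]       # 2 x 3 sample matrix
for fn in (matmul_dot, matmul_columns, matmul_rows, matmul_outer):
    print(fn.__name__, fn(A, B))
```

All four functions perform the same multiplications and differ only in the order in which the products are grouped and accumulated.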
(b) Consider a matrix A of size m × n and a vector x of size n. What is the result of
the matrix-vector multiplication Ax? Is it a vector or a matrix? What are the
dimensions of the product?
6. L2-norm
(a) What is meant by L2-norm of a vector?
(b) Given a vector $v = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix} \in \mathbb{R}^3$, find its L2-norm, i.e. $\|v\|_2$.
Solution: \[ \|v\|_2 = \sqrt{v_1^2 + v_2^2 + v_3^2} \]
(c) Given a vector $v = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \in \mathbb{R}^n$, find its L2-norm, i.e. $\|v\|_2$.
Solution: \[ \|v\|_2 = \sqrt{\sum_{i=1}^{n} v_i^2} \]
7. Euclidean Distance
Consider two vectors x and y ∈ Rn . How would you compute the Euclidean distance
between the two vectors ?
Solution: Let $x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ and $y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$ be the two vectors. The Euclidean distance,
d, between the two vectors can then be calculated as:
\[ d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} \]
8. Consider two vectors x and y ∈ Rn . How do you compute the dot product between the
two vectors ? Is it a matrix of size n × n, a vector of size n or a scalar ?
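Solution: Let $x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ and $y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$ be the two vectors. Then, the dot product
between them is defined as follows:
\[
\begin{aligned}
x \cdot y &= x^T y \\
&= x_1 y_1 + x_2 y_2 + \dots + x_n y_n \\
&= \sum_{i=1}^{n} x_i y_i
\end{aligned}
\]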
9. Consider two vectors x and y ∈ Rn . How do you compute the cosine of the angle between
the two vectors ?
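Solution: Let $x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ and $y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$ be the two vectors and θ be the angle
between them. Then, the cosine of the angle between the two vectors is given by:
\[ \cos\theta = \frac{x \cdot y}{|x||y|} \]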
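The L2-norm, Euclidean distance, dot product and cosine formulas from the preceding questions are closely related, as the sketch below shows: each is built from the dot product. The function names and sample vectors are illustrative assumptions.

```python
import math

def dot(x, y):
    """Dot product: sum of x_i * y_i (a scalar)."""
    return sum(xi * yi for xi, yi in zip(x, y))

def l2_norm(v):
    """L2-norm: square root of the sum of squared components."""
    return math.sqrt(dot(v, v))

def euclidean_distance(x, y):
    """Euclidean distance: the L2-norm of the difference x - y."""
    return l2_norm([xi - yi for xi, yi in zip(x, y)])

def cosine(x, y):
    """Cosine of the angle between x and y (both assumed non-zero)."""
    return dot(x, y) / (l2_norm(x) * l2_norm(y))

print(l2_norm([3.0, 4.0]))                         # 5.0
print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0
print(dot([1, 2, 3], [4, 5, 6]))                   # 32
print(cosine([1.0, 0.0], [0.0, 1.0]))              # 0.0
```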
Solution: The equation of line can be written as:
\[ y = mx + b \]
\[ a_1 x_1 + a_2 x_2 = b \]
where $x_1 = x$, $x_2 = y$, $a_1 = -m$, $a_2 = 1$.
(b) What is the equation of a plane in 3 dimensions (assume the axes are $x_1, x_2, x_3$)?
\[ a_1 x_1 + a_2 x_2 + a_3 x_3 = b \]
(c) What is the equation of a plane in n dimensions (assume the axes are $x_1, x_2, \dots, x_n$)?
11. Basis Consider a set of vectors S = {v1 , v2 , . . . , vn } ∈ Rn . When do you say that these
vectors form a basis in Rn ?
x = c1 v1 + c2 v2 + . . . + cn vn
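For example:
The unit basis vectors for $\mathbb{R}^3$ are $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$, $\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$. Note that you can represent
any vector $v \in \mathbb{R}^3$ as a linear combination of these three basis vectors.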
Solution: Two vectors u and v are said to be orthogonal vectors when their
dot-product is zero i.e. u · v = uT v = 0.
Solution: From part (a) of this question, we know that two vectors u and v are
said to be orthogonal if their dot product is zero. Therefore, to check whether
v1 , v2 and v3 are orthogonal, we have to find the dot product between them.
We do this by taking two vectors at a time.
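\[
\begin{aligned}
v_1 \cdot v_2 &= v_1^T v_2 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} = 0 \\
v_2 \cdot v_3 &= v_2^T v_3 = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} = 0 \\
v_1 \cdot v_3 &= v_1^T v_3 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} = 0
\end{aligned}
\]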
As we can see, the dot product of every pair of the three vectors above is zero.
Therefore, v1, v2 and v3 are orthogonal to each other.
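The pairwise check described above generalises to any number of vectors. A minimal sketch, assuming the helper names `dot` and `pairwise_orthogonal` and a small tolerance for floating-point inputs:

```python
from itertools import combinations

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def pairwise_orthogonal(vectors, tol=1e-12):
    """True if every pair of distinct vectors has a dot product of (almost) zero."""
    return all(abs(dot(u, v)) <= tol for u, v in combinations(vectors, 2))

v1, v2, v3 = [1, 0, 0], [0, 1, 0], [0, 0, 1]
print(pairwise_orthogonal([v1, v2, v3]))          # True
print(pairwise_orthogonal([[1, 1, 0], [1, 0, 0]]))  # False
```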
13. Consider two vectors a and b ∈ Rn . What is the vector projection of b onto a ?
Solution: The vector projection of b onto a will have the same direction as vector
a but it will be either a scaled up or down version of a depending on the vector b.
The vector projection of b onto a is given by,
\[ \frac{a \cdot b}{\|a\|^2}\, a = \frac{a^T b}{\|a\|^2}\, a \]
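The projection formula can be tried on a small example. In this sketch the function names and the sample vectors are assumptions; it also checks the defining property that the residual b minus the projection is orthogonal to a.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def project(b, a):
    """Vector projection of b onto a: (a.b / ||a||^2) * a, with a non-zero."""
    scale = dot(a, b) / dot(a, a)   # dot(a, a) equals ||a||^2
    return [scale * ai for ai in a]

a = [1.0, 2.0]
b = [3.0, 4.0]
p = project(b, a)
residual = [bi - pi for bi, pi in zip(b, p)]
print(p)
print(dot(residual, a))  # close to zero: the residual is orthogonal to a
```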
But for a set of linearly independent vectors no vector in the set can be written as
a linear combination of the remaining vectors in the set. An alternate way of saying
this is that, a set of vectors is linearly independent if the only solution to the equation
\[ \sum_{k=1}^{n} c_k x_k = 0 \quad\text{is}\quad c_k = 0 \;\; \forall k \in \{1, 2, \dots, n\} \]
17. Consider a vector $x \in \mathbb{R}^n$ and a matrix $A \in \mathbb{R}^{n \times n}$. The product $x^T A x$ can be written
as $\sum_{i=1}^{n}\sum_{j=1}^{n}\,?$
Solution: $\sum_{i=1}^{n}\sum_{j=1}^{n} x_i A_{ij} x_j$
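The double-sum form can be checked against the matrix form $x^T(Ax)$. This is an illustrative sketch with assumed function names and sample values. Because both indices run over the full range 1..n, writing the matrix entry as A[i][j] or A[j][i] inside the double sum gives the same total.

```python
def quadratic_form_double_sum(x, A):
    """Sum over i and j of x_i * A_ij * x_j."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def quadratic_form_matrix(x, A):
    """Compute x^T (A x): first the vector A x, then its dot product with x."""
    n = len(x)
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(x[i] * Ax[i] for i in range(n))

x = [1, 2]
A = [[1, 2], [3, 4]]
print(quadratic_form_double_sum(x, A))  # 27
print(quadratic_form_matrix(x, A))      # 27
```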
18. KL Divergence
(a) Consider a discrete random variable X which can take one of k values from the
set $\{x_1, \dots, x_k\}$. A distribution over X defines the value of $\Pr(X = x)\ \forall x \in
\{x_1, \dots, x_k\}$. Consider two such distributions P and Q. How do you compute the
KL divergence between P and Q?
\[ Q = \Big[\ \underbrace{0.228}_{\Pr(X = x_1)} \quad \underbrace{0.619}_{\Pr(X = x_2)} \quad \underbrace{0.153}_{\Pr(X = x_3)}\ \Big] \]
Then, the KL divergence between P and Q can be calculated as:
\[
\begin{aligned}
D_{KL}(P\|Q) &= \left(0.0 \cdot \log\frac{0}{0.228} + 1.0 \cdot \log\frac{1}{0.619} + 0.0 \cdot \log\frac{0}{0.153}\right) \\
&= 0.691
\end{aligned}
\]
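The calculation above can be sketched in code. Two assumptions are made here and are not stated in the text: terms with P(x) = 0 are taken to contribute zero (the usual 0 · log 0 = 0 convention), and the logarithm is taken to base 2, since that base yields a value of roughly 0.69 for this example.

```python
import math

def kl_divergence(p, q, base=2.0):
    """Sum of p_i * log(p_i / q_i); terms with p_i == 0 contribute 0 by convention."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            total += pi * math.log(pi / qi, base)  # base 2 is an assumption
    return total

P = [0.0, 1.0, 0.0]
Q = [0.228, 0.619, 0.153]
print(kl_divergence(P, Q))
print(kl_divergence(P, P))  # a distribution has zero divergence from itself
```

Since P places all of its mass on x2, only the middle term survives and the result reduces to log(1 / 0.619).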
Solution: The cross entropy between two distributions P and Q is given by,
\[ H(P, Q) = -\sum_{x} P(x)\log Q(x) \]
For example,
Consider a discrete random variable X which can take one of 3 values from the set
{x1 , x2 , x3 }. A distribution over X defines the value of P r(X = x) ∀x ∈ {x1 , x2 , x3 }.
Consider two such distributions P and Q which are defined as follows:
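\[ P = \Big[\ \underbrace{0}_{\Pr(X = x_1)} \quad \underbrace{1}_{\Pr(X = x_2)} \quad \underbrace{0}_{\Pr(X = x_3)}\ \Big] \]
\[ Q = \Big[\ \underbrace{0.228}_{\Pr(X = x_1)} \quad \underbrace{0.619}_{\Pr(X = x_2)} \quad \underbrace{0.153}_{\Pr(X = x_3)}\ \Big] \]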
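For the distributions P and Q defined above, the cross entropy formula can be evaluated directly. This is a minimal sketch: the base of the logarithm is not stated in the text, so base 2 is assumed here, and terms with P(x) = 0 are skipped because they contribute nothing to the sum.

```python
import math

def cross_entropy(p, q, base=2.0):
    """H(P, Q) = -sum_x P(x) * log Q(x); base 2 is an assumed choice."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0.0)

P = [0.0, 1.0, 0.0]
Q = [0.228, 0.619, 0.153]
print(cross_entropy(P, Q))
```

Because P is one-hot, the sum collapses to the single term −log Q(x2).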