Weekly Homework X
Bhavik Shangari
CS550 Machine Learning
September 2, 2024
Question 1.
Figure 1: Data
Solution. Applying PCA to the given data involves the following steps:
1. Arrange the data as a matrix, with one data point per row:
$$X = \begin{pmatrix} 2 & 5 \\ 6 & 4 \\ 10 & 11 \\ 14 & 14 \end{pmatrix}$$
3. Calculate the covariance matrix $M = \frac{1}{N} X^T X$ of the standardized data, where $N$ denotes the number of data points:
$$M = \begin{pmatrix} 1 & 0.91524923 \\ 0.91524923 & 1 \end{pmatrix}$$
4. Calculate the eigenvectors and eigenvalues of this matrix $M$ ($Mx = \lambda x$):
$$\lambda_1 = 1.91524923, \qquad \lambda_2 = 0.08475077$$
$$v_1 = \begin{pmatrix} 0.70710678 \\ 0.70710678 \end{pmatrix}, \qquad v_2 = \begin{pmatrix} -0.70710678 \\ 0.70710678 \end{pmatrix}$$
5. After obtaining the eigenvectors and eigenvalues, project the standardized data points onto these vectors to get the projections:
$$P = \begin{pmatrix} -1.34164079 & -0.84270097 \\ -0.4472136 & -1.08347268 \\ 0.4472136 & 0.60192927 \\ 1.34164079 & 1.32424438 \end{pmatrix} \begin{pmatrix} 0.70710678 & -0.70710678 \\ 0.70710678 & 0.70710678 \end{pmatrix}$$
7. If only one dimension is required, take the first column of the projection, which corresponds to the largest eigenvalue.
8. The eigenvalues provided in the question do not match the eigenvalues computed above, so the corresponding eigenvectors cannot be obtained from this covariance matrix, and there is nothing to compare against.
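The numbers above can be reproduced with a short NumPy sketch; the data matrix comes from step 1, while the standardization step and the variable names are assumptions made for illustration.

import numpy as np

# Data matrix from step 1: rows are data points, columns are features
X = np.array([[2, 5], [6, 4], [10, 11], [14, 14]], dtype=float)

# Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix M = (1/N) X^T X of the standardized data
N = X_std.shape[0]
M = X_std.T @ X_std / N

# Eigenvalues and eigenvectors of M (M x = lambda x), sorted by decreasing eigenvalue
eig_vals, eig_vecs = np.linalg.eigh(M)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Project the standardized data onto the eigenvectors
P = X_std @ eig_vecs

print(eig_vals)    # approx. [1.91524923, 0.08475077]
print(P[:, 0])     # 1-D representation: the column for the largest eigenvalue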
Question 2.
The rows represent different data points and the columns represent the x and y coordinates.
3. To start the K-means clustering algorithm, we randomly initialize the centroids (given K = 2) and assign each point to a cluster based on its Euclidean distance from each centroid; see Figure 3.
Figure 3: Randomly Initialized Centroids
4. Then we replace each centroid with the mean of the points assigned to its cluster and repeat this process iteratively until the centroids no longer change. At the end of the iterations we get the result shown in Figure 4.
Figure 4: K Means Final Result
import numpy as np
import matplotlib.pyplot as plt

def distance(a, b):
    # Euclidean distance between two points
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

# points: (N, 2) array of data points; median: (2, 2) array holding the two
# randomly initialized centroids. Both are set up earlier in the script.
prev_median = np.zeros_like(median)
while not np.allclose(median, prev_median):
    # Distance of every point to every centroid
    dist_mat = []
    for pt in points:
        a = []
        for med in median:
            a.append(distance(pt, med))
        dist_mat.append(a)
    dist_mat = np.array(dist_mat)

    plt.pause(0.1)  # let the cluster plot refresh between iterations

    # Assign each point to its nearest centroid (minimum distance)
    cluster = {0: [], 1: []}
    for i, el in enumerate(np.argmin(dist_mat, axis=1)):
        cluster[el].append(points[i])
    cluster[0] = np.stack(cluster[0])
    cluster[1] = np.stack(cluster[1])

    # Replace each centroid with the mean of the points assigned to it
    prev_median = median.copy()
    for i in range(len(median)):
        median[i] = np.mean(cluster[i], axis=0)
    print(median)
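For completeness, a small setup that the loop above can operate on, to be run before it; the points here are synthetic placeholders (the actual data for this question comes from the figure), and the centroids are initialized by sampling two of the points at random.

import numpy as np

rng = np.random.default_rng(0)

# Two synthetic blobs standing in for the assignment's data points
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(10, 2)),
    rng.normal(loc=[5.0, 5.0], scale=1.0, size=(10, 2)),
])

# Randomly initialized centroids (K = 2), drawn from the data points
median = points[rng.choice(len(points), size=2, replace=False)].copy()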
Question 3.
2. In the projected space, the difference between the means of the two classes should be large and the variance of each class's cluster should be small, which leads to the objective function
$$\max_v \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} \qquad (1)$$
where $\tilde{\mu}_1$ and $\tilde{\mu}_2$ are the means of the projections of the two classes, and $\tilde{s}_1^2$ and $\tilde{s}_2^2$ are the covariances of the projections of each class.
3. We compute the separate class covariance (scatter) matrices $s_1$ and $s_2$ for the data points of the two classes, in the original feature space:
$$s_1 = \sum_{x_i \in c_1} (x_i - \mu_1)(x_i - \mu_1)^T \qquad (2)$$
$$s_2 = \sum_{x_i \in c_2} (x_i - \mu_2)(x_i - \mu_2)^T \qquad (3)$$
where $c_1$ and $c_2$ denote the sets of points in class 1 and class 2 respectively, $\mu_1$ and $\mu_2$ are the means of the points of each class, and $s_1$ and $s_2$ are the class covariance matrices.
4. Let $v$ be the direction onto which the data is projected; the projection of a point $x_i$ is
$$p_i = v^T x_i \qquad (4)$$
5. So we will have
$$\tilde{s}_1^2 = \sum_{p_i \in c_1} (p_i - \tilde{\mu}_1)(p_i - \tilde{\mu}_1)^T$$
$$\tilde{s}_1^2 = \sum_{x_i \in c_1} (v^T x_i - v^T \mu_1)(v^T x_i - v^T \mu_1)^T$$
$$\tilde{s}_1^2 = v^T s_1 v \qquad (5)$$
We get the Within-Class Covariance Matrix as
$$S_W = s_1 + s_2 \qquad (6)$$
and the Between-Class Covariance Matrix as
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$$
So
$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (v^T \mu_1 - v^T \mu_2)^2 = v^T S_B v$$
and the objective becomes
$$\max_v \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \max_v \frac{v^T S_B v}{v^T S_W v}$$
Setting the derivative of this objective $J(v)$ to zero,
$$\frac{\partial J(v)}{\partial v} = 0,$$
we finally get the (generalized) eigenvalue problem
$$M v = \lambda v \qquad (7)$$
where $M = S_W^{-1} S_B$.
Figure 5: Data
$$\lambda_1 = 3.13137004, \qquad \lambda_2 = 0$$
$$v_1 = \begin{pmatrix} 0.91955932 \\ 0.39295122 \end{pmatrix}, \qquad v_2 = \begin{pmatrix} -0.59522755 \\ 0.80355719 \end{pmatrix}$$
The projected vector is shown below, with class labels in Figure 6:
$$P = \begin{pmatrix} 4.07118849 \\ 3.41092352 \\ 3.0179723 \\ 5.11638527 \\ 5.25004215 \\ 12.20554606 \\ 8.66096567 \\ 10.24078996 \\ 10.10713308 \\ 12.33920294 \end{pmatrix}$$
Figure 6: Projection
# class1 (the 5x2 array of class-1 points) and the per-class means mean1, mean2
# are defined earlier in the script (not shown in this excerpt).
class2 = np.array([[9, 10],
                   [6, 8],
                   [9, 5],
                   [8, 7],
                   [10, 8]])

# Between-class scatter S_B = (mu1 - mu2)(mu1 - mu2)^T
Sb = (mean1.reshape(-1, 2) - mean2.reshape(-1, 2)).T @ (
    mean1.reshape(1, 2) - mean2.reshape(1, 2))
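A sketch of the rest of the computation, assuming class1 is the array of class-1 points defined earlier in the script (its values come from the question's data and are not reproduced here). It follows equations (2)-(7) and should reproduce the eigenvalue and direction quoted above up to sign and scale.

import numpy as np

# Per-class means in the original feature space
mean1 = class1.mean(axis=0)          # class1: (5, 2) array of class-1 points
mean2 = class2.mean(axis=0)

# Class scatter matrices s1, s2 and the within-class matrix S_W (equations (2), (3), (6))
s1 = (class1 - mean1).T @ (class1 - mean1)
s2 = (class2 - mean2).T @ (class2 - mean2)
Sw = s1 + s2

# Between-class matrix S_B (the same quantity as Sb above)
Sb = np.outer(mean1 - mean2, mean1 - mean2)

# Solve M v = lambda v with M = Sw^{-1} S_B (equation (7))
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
eig_vals, eig_vecs = np.real(eig_vals), np.real(eig_vecs)
v = eig_vecs[:, np.argmax(eig_vals)]   # direction with the largest eigenvalue

# Project all points onto the discriminant direction v
P = np.vstack([class1, class2]) @ v
print(eig_vals, v, P)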
1. Advantages
2. Disadvantages
Question 4.
Question 5.
Solution. Let's start by exploring the difference between Model Parameters and Model Hyperparameters.
Model Parameters:
1. The internal variables that are learned through the optimization process, such as the weights and biases in models like linear regression and neural networks, or the positional embeddings of LLMs.
2. Learned during the training / optimization process.
3. Define the performance of the model.

Model Hyperparameters:
1. The external variables that are set before the training process, such as the number of neurons or layers in a neural network, or the learning rate in gradient descent.
2. Can be set manually, or made learnable (by turning them into parameters).
3. Control the learning process.
Hyperparameter Tuning: This is the process of finding the set of hyperparameters that makes training more stable and yields the best set of parameters, i.e. the model with optimal performance, tailored to the intended application.
K-Means: number of clusters (K) and the distance metric (Manhattan, Euclidean).
DBSCAN: the minimum number of samples in a neighbourhood (min_samples) and the maximum neighbourhood distance (eps).
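As an illustration of tuning one of these hyperparameters, a minimal sketch that sweeps K for K-means with scikit-learn and records the inertia (within-cluster sum of squares); the data X is a synthetic placeholder and the candidate range for K is arbitrary.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder data; in practice X is the dataset being clustered
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 4], [0, 4])])

# Sweep the hyperparameter K and record the inertia for each value
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # pick K at the "elbow" of this curve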
Question 6.
$$x = As$$
where $A$ is the mixing matrix and $s$ is the vector of source signals. The de-mixing matrix $W$ is the inverse of the mixing matrix, such that
$$s = Wx$$
Log-Likelihood and Cost Function
The log-likelihood of the observed data x, given the de-mixing matrix W, can be written
as:
$$L(W) = \sum_{i=1}^{n} \log p(s_i) + \log|\det(W)|$$
Gradient Descent
To minimize the cost function $C(W)$, we use gradient descent. The update rule for $W$ is given by
$$W \leftarrow W - \eta \frac{\partial C(W)}{\partial W}$$
where $\eta$ is the learning rate.
Gradient of the Cost Function
The gradient of the cost function C(W) with respect to W can be derived as follows:
$$\frac{\partial C(W)}{\partial W} = -\sum_{i=1}^{n} \frac{\partial \log p(s_i)}{\partial s_i}\,\frac{\partial s_i}{\partial W} - \frac{\partial \log|\det(W)|}{\partial W}$$
For a non-Gaussian distribution of the sources $s$, the gradient of the log-density $\log p(s)$ is a non-linear function of $s$. Therefore,
$$\frac{\partial \log p(s_i)}{\partial s_i} = g(s_i)$$
where g(si ) is a non-linear function that depends on the assumed distribution of the
sources.
The gradient of the determinant term is
$$\frac{\partial \log|\det(W)|}{\partial W} = (W^{-1})^\top$$
Therefore, the gradient of the cost function becomes
$$\frac{\partial C(W)}{\partial W} = -(W^{-1})^\top + \sum_{i=1}^{n} g(s_i)\,x_i^\top$$
Update Rule
The update rule for the de-mixing matrix $W$ using gradient descent is
$$W \leftarrow W + \eta\left((W^{-1})^\top - \sum_{i=1}^{n} g(s_i)\,x_i^\top\right)$$
This iterative update is performed until convergence, at which point the de-mixing ma-
trix W will allow the extraction of the independent source signals from the observed mixtures.
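A minimal NumPy sketch of this iterative update on synthetic data; it uses the sample-averaged form of the rule and the common choice g(s) = tanh(s) for a super-Gaussian source prior. The data, dimensions, learning rate and iteration count are illustrative assumptions, not part of the original solution.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: two super-Gaussian sources mixed by a random full-rank A
n = 2000
S = rng.laplace(size=(2, n))            # source signals (one per row)
A = rng.normal(size=(2, 2))             # mixing matrix
X = A @ S                               # observed mixtures, x = A s

W = np.eye(2)                           # initial de-mixing matrix
eta = 0.02                              # learning rate

for _ in range(5000):
    Y = W @ X                           # current source estimates, s = W x
    g = np.tanh(Y)                      # non-linearity g(s)
    # Averaged form of the update  W <- W + eta * (W^{-T} - sum_i g(s_i) x_i^T)
    W += eta * (np.linalg.inv(W).T - (g @ X.T) / n)

print(W @ A)    # ideally close to a scaled permutation matrix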
A is not Full rank: When the mixing matrix A in an Independent Component Analysis
(ICA) problem is not full rank, meaning it is singular or nearly singular, several challenges
and implications arise for the ICA process:
1. Loss of Information
2. Incomplete Separation
3. Instability in Estimation
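To make the rank issue concrete, a small NumPy check on a made-up rank-deficient mixing matrix: an exact inverse (and hence an exact de-mixing matrix W) does not exist, and only a pseudo-inverse approximation is possible.

import numpy as np

# Example of a rank-deficient 2x2 mixing matrix (second row = 2 * first row)
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

print(np.linalg.matrix_rank(A))   # 1, i.e. not full rank

try:
    W = np.linalg.inv(A)          # an exact de-mixing matrix does not exist
except np.linalg.LinAlgError as err:
    print("inversion failed:", err)

W_approx = np.linalg.pinv(A)      # best least-squares substitute; information is lost
print(W_approx)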