
Assignment 1

Bhavik Shangari
CS550 Machine Learning
September 2, 2024

Question 1.

Figure 1: Data

Solution. Applying PCA to the given data involves the following steps:
1. Arrange the data in matrix form:
 
\[
X = \begin{bmatrix} 2 & 5 \\ 6 & 4 \\ 10 & 11 \\ 14 & 14 \end{bmatrix}
\]

2. Zero-mean the data and normalize by the standard deviation (standardize each column):


 
\[
\begin{bmatrix} -1.34164079 & -0.84270097 \\ -0.4472136 & -1.08347268 \\ 0.4472136 & 0.60192927 \\ 1.34164079 & 1.32424438 \end{bmatrix}
\]

3. Calculate the covariance matrix \( M = \frac{X^T X}{N} \), where N denotes the number of data points and X is the standardized data:
\[
M = \begin{bmatrix} 1 & 0.91524923 \\ 0.91524923 & 1 \end{bmatrix}
\]

4. Calculate the eigenvectors and eigenvalues of this matrix M (Mx = λx).

\[
\lambda_1 = 1.91524923, \quad \lambda_2 = 0.08475077
\]
\[
v_1 = \begin{bmatrix} 0.70710678 \\ 0.70710678 \end{bmatrix}, \quad v_2 = \begin{bmatrix} -0.70710678 \\ 0.70710678 \end{bmatrix}
\]

5. After obtaining the eigenvectors and eigenvalues, project the standardized points onto these vectors to get the projections:

 
\[
\begin{bmatrix} -1.34164079 & -0.84270097 \\ -0.4472136 & -1.08347268 \\ 0.4472136 & 0.60192927 \\ 1.34164079 & 1.32424438 \end{bmatrix}
\ast
\begin{bmatrix} 0.70710678 & -0.70710678 \\ 0.70710678 & 0.70710678 \end{bmatrix}
\]
where \(\ast\) denotes matrix multiplication.


6. The projections come out to be
\[
\begin{bmatrix} -1.54456287 & 0.35280373 \\ -1.08235864 & -0.44990311 \\ 0.74185603 & 0.1094005 \\ 1.88506548 & -0.01230111 \end{bmatrix}
\]

7. If you want to keep only one dimension, take the first column, which corresponds to the maximum eigenvalue.

8. The eigenvalues provided in the question do not match the eigenvalues computed here, so eigenvectors cannot be found through the covariance matrix for the given values, and thus there is nothing to compare.
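
The steps above are reproduced in the short numpy sketch below; the variable names are chosen for illustration, and the eigenvector signs may differ from the hand-computed ones.

import numpy as np

# Sketch of the PCA steps above (data taken from the worked example)
X = np.array([[2, 5], [6, 4], [10, 11], [14, 14]], dtype=float)

# Step 2: zero mean and unit standard deviation per column
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 3: covariance matrix M = Z^T Z / N
M = Z.T @ Z / len(Z)

# Step 4: eigenvalues and eigenvectors of the symmetric matrix M
eigvals, eigvecs = np.linalg.eigh(M)
order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue

# Step 5: project the standardized data onto the eigenvectors
projections = Z @ eigvecs[:, order]
print(eigvals[order])    # approx. [1.915, 0.085]
print(projections)       # first column is the 1-D PCA projection (up to sign)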

Question 2.

Solution. 1. The given data in matrix form:


 
\[
\begin{bmatrix} 1 & 4 \\ 1 & 3 \\ 0 & 4 \\ 5 & 1 \\ 6 & 2 \\ 4 & 0 \end{bmatrix}
\]

Rows represent different data points and columns represent the x and y coordinates.

2. The plot of the above data is shown in Figure 2.

Figure 2: Plot of data

3. To start the K-means clustering algorithm, we randomly initialize the centroids (given K = 2) and assign each point to a cluster based on its Euclidean distance from the centroids; see Figure 3.

Figure 3: Randomly Initialized Centroids

Red -> Centroid


Yellow -> Cluster 1
Purple -> Cluster 2

4. Then we replace each centroid with the mean of the points assigned to that cluster and repeat this process iteratively until the centroids no longer change. At the end of the iterations we get the result shown in Figure 4.

Figure 4: K Means Final Result

Final centroids: (0.66666667, 3.66666667) and (5.0, 1.0)


Code used:

import numpy as np
import matplotlib.pyplot as plt

def distance(pt1, pt2):
    # Euclidean distance between two 2-D points
    return np.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)

points = np.array([(1, 4), (1, 3), (0, 4), (5, 1), (6, 2), (4, 0)])

# Randomly initialize K = 2 centroids
centroids = 4 * np.random.rand(2, 2)
prev_centroids = np.zeros_like(centroids)

# Repeat until the centroids stop moving
while not np.allclose(centroids, prev_centroids):
    # Distance of every point to every centroid
    dist_mat = np.array([[distance(pt, c) for c in centroids] for pt in points])
    # Assign each point to its nearest centroid
    labels = np.argmin(dist_mat, axis=1)

    plt.scatter(centroids[:, 0], centroids[:, 1], c='r')
    plt.scatter(points[:, 0], points[:, 1], c=labels)
    plt.pause(0.1)

    # Replace each centroid with the mean of the points assigned to it
    prev_centroids = centroids.copy()
    for k in range(len(centroids)):
        if np.any(labels == k):              # guard against an empty cluster
            centroids[k] = points[labels == k].mean(axis=0)

print(centroids)

Question 3.

Solution. Given data (feature1, feature2) : (Class label)


   
\[
X = \begin{bmatrix} 4 & 1 \\ 2 & 4 \\ 2 & 3 \\ 3 & 6 \\ 4 & 4 \\ 9 & 10 \\ 6 & 8 \\ 9 & 5 \\ 8 & 7 \\ 10 & 8 \end{bmatrix}, \quad
Y = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}
\]
1. LDA is used for dimensionality reduction when we need to account for the separation between data points of different classes while reducing the number of dimensions.

2. In the projected space, the difference between the means of the two classes should be large and the spread of each class's cluster should be small, which leads to the objective function
\[
\max_v \; \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^{\,2} + \tilde{s}_2^{\,2}} \tag{1}
\]
where \(\tilde{\mu}_1\) and \(\tilde{\mu}_2\) are the means of the projections of the two classes and \(\tilde{s}_1^{\,2}\) and \(\tilde{s}_2^{\,2}\) are the scatters (variances) of the projections of each class.

3. We compute the separate class scatter matrices s_1 and s_2 for the data points of the two classes, in the original feature space:
\[
s_1 = \sum_{x_i \in c_1} (x_i - \mu_1)(x_i - \mu_1)^T \tag{2}
\]
\[
s_2 = \sum_{x_i \in c_2} (x_i - \mu_2)(x_i - \mu_2)^T \tag{3}
\]
where c_1 and c_2 denote the sets of points in class 1 and class 2 respectively, \(\mu_1\) and \(\mu_2\) are the means of the points of class 1 and class 2, and s_1 and s_2 denote the class scatter (covariance) matrices.

4. Let p_i denote the projection of the data point x_i; we can write it as
\[
p_i = v^T x_i \tag{4}
\]
where v denotes the vector onto which the projection is taken.

5. So we will have
\[
\tilde{s}_1^{\,2} = \sum_{p_i \in c_1} (p_i - \tilde{\mu}_1)(p_i - \tilde{\mu}_1)^T
\]
\[
\tilde{s}_1^{\,2} = \sum_{x_i \in c_1} (v^T x_i - v^T \mu_1)(v^T x_i - v^T \mu_1)^T
\]
\[
\tilde{s}_1^{\,2} = v^T s_1 v \tag{5}
\]
We get the within-class scatter of the projections as
\[
\tilde{s}_1^{\,2} + \tilde{s}_2^{\,2} = v^T (s_1 + s_2) v = v^T S_W v
\]
where S_W is the within-class scatter matrix
\[
S_W = s_1 + s_2 \tag{6}
\]

6. Similarly, we define the between-class scatter matrix S_B as
\[
S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T
\]
so that
\[
(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (v^T \mu_1 - v^T \mu_2)^2 = v^T S_B v
\]

7. Putting this all into our objective function, we get
\[
\max_v \; \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^{\,2} + \tilde{s}_2^{\,2}} = \max_v \; \frac{v^T S_B v}{v^T S_W v}
\]
Setting
\[
\frac{\partial J(v)}{\partial v} = 0
\]
we finally get
\[
M v = \lambda v \tag{7}
\]
where \(M = S_W^{-1} S_B\). We need to perform an eigendecomposition of M to get our projection direction.


For the values given to us, the plot of the data is shown in Figure 5.

Figure 5: Data

Red -> Class1


Blue -> Class2

\[
\mu_1 = [3.0,\ 3.6], \quad \mu_2 = [8.4,\ 7.6]
\]
\[
S_W = \begin{bmatrix} 13.2 & -2.2 \\ -2.2 & 26.4 \end{bmatrix}, \quad
S_B = \begin{bmatrix} 29.16 & 21.6 \\ 21.6 & 16.0 \end{bmatrix}
\]
Now we calculate the matrix M as in equation (7), and then its eigenvalues and eigenvectors.

\[
\lambda_1 = 3.13137004, \quad \lambda_2 = 0
\]
\[
v_1 = \begin{bmatrix} 0.91955932 \\ 0.39295122 \end{bmatrix}, \quad
v_2 = \begin{bmatrix} -0.59522755 \\ 0.80355719 \end{bmatrix}
\]

We consider only v_1, as it corresponds to the largest eigenvalue and therefore the maximum class separation in the projected data.

The projected vector is shown below, with class labels in Figure 6:
\[
P = \begin{bmatrix} 4.07118849 \\ 3.41092352 \\ 3.0179723 \\ 5.11638527 \\ 5.25004215 \\ 12.20554606 \\ 8.66096567 \\ 10.24078996 \\ 10.10713308 \\ 12.33920294 \end{bmatrix}
\]

Figure 6: Projection

The code used for all calculations is provided below:


import numpy as np
import matplotlib.pyplot as plt

class1 = np.array([[4, 1],
                   [2, 4],
                   [2, 3],
                   [3, 6],
                   [4, 4]])

class2 = np.array([[9, 10],
                   [6, 8],
                   [9, 5],
                   [8, 7],
                   [10, 8]])

mean1 = np.mean(class1, axis=0)
mean2 = np.mean(class2, axis=0)

# Within-class scatter matrix S_W = s1 + s2
Sw = (class1 - mean1).T @ (class1 - mean1) + (class2 - mean2).T @ (class2 - mean2)

# Between-class scatter matrix S_B = (mu1 - mu2)(mu1 - mu2)^T
diff = (mean1 - mean2).reshape(-1, 1)
Sb = diff @ diff.T

# Eigendecomposition of M = S_W^{-1} S_B; the eigenvector with the
# largest eigenvalue is the LDA projection direction
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
v = eigvecs[:, np.argmax(eigvals)]

# Project both classes onto v and plot them along a line
plt.scatter(class1 @ v, np.zeros(len(class1)))
plt.scatter(class2 @ v, np.zeros(len(class2)))
plt.show()

The advantages and disadvantages of LDA include:

1. Advantages

(a) Dimensionality Reduction: reduces the number of features while preserving class separability.
(b) Class Separation: maximizes the separation between different classes in the feature space.
(c) Linear Decision Boundary: offers a straightforward model with clear decision boundaries.

2. Disadvantages

(a) Sensitive to Outliers


(b) Not Suitable for Non-Linear Problems

Question 4.

Solution. Github Link


Most ML algorithms work with numbers, and many of them rely on the distance between feature vectors, which is affected by the scale and units of the data.
For example, consider weight and height: the typical range of height is 4 to 6.5 feet, whereas weight varies between roughly 60 kg and 120 kg.
Now think of algorithms that rely solely on distance, such as K-means: the decisions will be driven almost entirely by weight, because its scale is larger, so most of the contribution to the Euclidean distance comes from the weight feature, diminishing the effect of height, which is not a good approach.
These kinds of issues call for normalization, and this is why min-max scaling is used; a small sketch follows.
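
As a minimal illustration (the height and weight values below are assumed, not taken from the assignment), min-max scaling rescales every feature to the [0, 1] range so that no single feature dominates the Euclidean distance:

import numpy as np

# Illustrative (assumed) data: columns are [height in feet, weight in kg]
X = np.array([[5.0,  60.0],
              [5.5,  80.0],
              [6.0, 120.0],
              [4.5,  70.0]])

# Min-max scaling: map every column to the [0, 1] range
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)   # both features now contribute on a comparable scale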

Question 5.

Solution. Let’s start by exploring the difference between Model Parameters and Model Hy-
perparameters.

Model Parameters:

1. The model parameters are the internal variables that are learned through the optimization process, such as weights and biases in models like linear regression and neural networks, or positional embeddings in the case of LLMs.
2. Learned during the training / optimization process.
3. Define the performance of the model.

Model Hyperparameters:

1. The model hyperparameters are the external variables that are set before the training process; examples include the number of neurons or layers in neural networks and the learning rate in gradient descent.
2. Can be set manually or made learnable (by turning them into parameters).
3. Control the learning process.

Hyperparameter Tuning: the process of finding the set of hyperparameters that makes training stable and yields the best set of parameters, i.e., the model with the best performance for the intended application.
Example hyperparameters:
K-Means: number of clusters (K), distance metric (Manhattan, Euclidean)
DBSCAN: min_samples and eps (maximum neighborhood distance)
A small sketch of tuning K for K-means is shown below.
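
A minimal sketch, assuming scikit-learn is available, of sweeping the number of clusters K on the Question 2 data and recording the inertia (within-cluster sum of squares) for each candidate value:

import numpy as np
from sklearn.cluster import KMeans   # assumes scikit-learn is installed

# Data points reused from Question 2
X = np.array([(1, 4), (1, 3), (0, 4), (5, 1), (6, 2), (4, 0)])

# Treat the number of clusters K as a hyperparameter: sweep candidate values
# and record the within-cluster sum of squares (inertia) for each one
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)    # choose K near the "elbow" of this curve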

Question 6.

Solution. Gradient Descent for Determining the De-Mixing Matrix in ICA


In Independent Component Analysis (ICA), the observed data x is modeled as a linear combination of independent source signals s:

x = As

where A is the mixing matrix and s are the source signals. The goal is to determine the de-mixing matrix W, the inverse of the mixing matrix, such that:

s = Wx
Log-Likelihood and Cost Function
The log-likelihood of the observed data x, given the de-mixing matrix W, can be written
as:
\[
L(W) = \sum_{i=1}^{n} \log p(s_i) + \log |\det(W)|
\]

where p(si ) is the probability density function of the independent sources.


The cost function to be minimized (negative log-likelihood) is therefore:
\[
C(W) = - \sum_{i=1}^{n} \log p(s_i) - \log |\det(W)|
\]

Gradient Descent
To minimize the cost function C(W), we use gradient descent. The update rule for W is
given by:

\[
W \leftarrow W - \eta \, \frac{\partial C(W)}{\partial W}
\]
where η is the learning rate.
Gradient of the Cost Function
The gradient of the cost function C(W) with respect to W can be derived as follows:
\[
\frac{\partial C(W)}{\partial W} = - \sum_{i=1}^{n} \frac{\partial \log p(s_i)}{\partial s_i} \frac{\partial s_i}{\partial W} - \frac{\partial \log |\det(W)|}{\partial W}
\]
For a non-Gaussian distribution of the sources s, the derivative of the log-density log p(s) is a non-linear function of s. We therefore write
\[
g(s_i) = - \frac{\partial \log p(s_i)}{\partial s_i}
\]
where g(s_i) is a non-linear function that depends on the assumed distribution of the sources (for example, g(s) = tanh(s) for super-Gaussian sources).
The gradient of the determinant term is:

\[
\frac{\partial \log |\det(W)|}{\partial W} = (W^{-1})^\top
\]
Therefore, the gradient of the cost function becomes:
\[
\frac{\partial C(W)}{\partial W} = -(W^{-1})^\top + \sum_{i=1}^{n} g(s_i)\, x_i^\top
\]

Update Rule
The update rule for the de-mixing matrix W using gradient descent is:
\[
W \leftarrow W + \eta \left( (W^{-1})^\top - \sum_{i=1}^{n} g(s_i)\, x_i^\top \right)
\]

This iterative update is performed until convergence, at which point the de-mixing ma-
trix W will allow the extraction of the independent source signals from the observed mixtures.
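
A minimal numpy sketch of this update rule, under assumed choices: a toy 2x2 mixing matrix, g(s) = tanh(s) (a common choice for super-Gaussian sources), a fixed learning rate, and the sum over samples replaced by an average for a stable step size.

import numpy as np

rng = np.random.default_rng(0)

# Toy problem: two super-Gaussian (Laplacian) sources mixed by an assumed matrix A
n = 2000
S = rng.laplace(size=(2, n))            # true sources, columns are samples
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])              # assumed mixing matrix
X = A @ S                               # observed mixtures x_i

W = np.eye(2)                           # initial de-mixing matrix
eta = 0.05                              # assumed learning rate

def g(s):
    # assumed non-linearity g(s) = tanh(s), suited to super-Gaussian sources
    return np.tanh(s)

# Gradient-descent updates: W <- W + eta * ( (W^{-1})^T - mean_i g(s_i) x_i^T )
for _ in range(1000):
    S_hat = W @ X                       # current source estimates s_i = W x_i
    grad_term = g(S_hat) @ X.T / n      # (1/n) * sum_i g(s_i) x_i^T
    W = W + eta * (np.linalg.inv(W).T - grad_term)

# If unmixing worked, W @ A should be close to a scaled permutation matrix
print(W @ A)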

A is not Full rank: When the mixing matrix A in an Independent Component Analysis
(ICA) problem is not full rank, meaning it is singular or nearly singular, several challenges
and implications arise for the ICA process:

1. Loss of Information

2. Incomplete Separation

3. Instability in Estimation

Solutions include increasing the number of observed mixtures.

