
Principal Component Analysis

Rohan Bansal

1 Introduction
Principal Component Analysis is a tool used to reduce the dimensions of a set of variables while still retaining the majority of the information. Working with a high-dimensional dataset often leads to difficulties: a low-dimensional dataset is easier to analyze and visualize, and it is less expensive to store. A given set of variables might be correlated, which causes certain redundancies. PCA removes these redundancies by transforming the original variables into an uncorrelated set of variables. This is analogous to working with a colour image by collapsing the red, green and blue channels into shades of white and black. In simpler terms, it is always easier to work in 2D than in 3D.

In PCA, we try to project a d-dimensional variable, say x ∈ R^d, onto an orthogonal set of k vectors in R^d, say u = [b_1, b_2, ..., b_k]. Then we can write the projection of the vector x in terms of the vectors b_1, b_2, ..., b_k. Let us denote the projection of x by π_u(x):

π_u(x) = Σ_{i=1}^{k} λ_i b_i

Using the property of orthogonality, we get

⟨π_u(x) − x, b_i⟩ = 0          (1)

Substituting π_u(x) = Bλ in eq. (1), where B is the d × k matrix consisting of b_1, b_2, ..., b_k as columns and λ = [λ_1, ..., λ_k]^T, we get

π_u(x) = B(B^T B)^{-1} B^T x

This is the basic intuition of PCA. Our next job is to find the k-dimensional subspace maximizing the (uncentered) variance of this d-dimensional variable inside the subspace [1]. So our optimisation problem can be written as [1]:

max_B  B^T x x^T B          (2)

subject to  B^T B = I
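As a quick numerical check of the projection formula, the following sketch projects a random vector onto a 2-dimensional subspace of R^4; the basis B is generated at random and is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 2

# Random orthonormal basis B (columns b_1, ..., b_k) via QR decomposition.
B, _ = np.linalg.qr(rng.normal(size=(d, k)))
x = rng.normal(size=d)

# Projection pi_u(x) = B (B^T B)^{-1} B^T x; with orthonormal B this reduces to B B^T x.
proj = B @ np.linalg.solve(B.T @ B, B.T @ x)

# Orthogonality check from eq. (1): <pi_u(x) - x, b_i> = 0 for every column b_i.
print(B.T @ (proj - x))        # numerically close to the zero vector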

2 PCA Algorithms
2.1 Singular Value Decomposition
In principal component analysis we find the directions in the data with the most variation, i.e. the eigenvectors corresponding to the largest eigenvalues of the covariance matrix, and project the data onto these directions. Suppose U is the matrix whose columns are the eigenvectors corresponding to the k largest eigenvalues of the covariance matrix of X; then the PCA transformation is given by Y = U^T X.
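A minimal numpy sketch of this procedure is given below (not the exact code used in the implementation later); the function name pca_svd and the choice to centre the data before the decomposition are illustrative.

import numpy as np

def pca_svd(X, n_components):
    """PCA via singular value decomposition.

    X is an (n_samples, d) data matrix; returns the projected data,
    the principal directions and the variance captured by each of them.
    """
    # Centre the data so the SVD of X corresponds to the eigendecomposition
    # of the covariance matrix.
    X_centred = X - X.mean(axis=0)
    # Thin SVD: X_centred = U_svd @ diag(S) @ Vt
    U_svd, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    # Rows of Vt are the eigenvectors of the covariance matrix, ordered by
    # decreasing singular value, hence by decreasing variance.
    components = Vt[:n_components]            # shape (k, d)
    Y = X_centred @ components.T              # projected data, shape (n, k)
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    return Y, components, explained_variance

# Example: project 6-dimensional data onto its top 2 principal components.
X = np.random.rand(100, 6)
Y, components, var = pca_svd(X, n_components=2)
print(Y.shape, var)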

2.2 Stochastic Gradient Descent


Stochastic Gradient Descent (SGD) is a simple method to carry out the optimization in eq. (2). It iteratively updates the matrix B at each data point x, just like gradient descent. The analysis can be difficult, given that the optimization objective is not convex. So we start with a unit-norm iterate B_0, and then keep updating it according to equation (3):

B_{t+1} = (I + η x x^T) B_t          (3)

where η is the step-size parameter [5]. The algorithm is highly efficient in terms of memory and runtime per iteration, requiring storage of a single d-dimensional vector and performing only vector-vector and vector-scalar products in each iteration [5].
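A minimal sketch of this update for the top direction (k = 1) is shown below, assuming centred data and adding a renormalisation step at each iteration to keep the iterate at unit norm; this is a common variant and not necessarily the exact scheme analysed in [5].

import numpy as np

def sgd_pca_top_direction(X, eta=0.01, n_epochs=5, seed=0):
    """Stochastic-gradient (Oja-style) estimate of the top principal direction.

    Implements the update b <- (I + eta * x x^T) b from eq. (3) for k = 1,
    followed by renormalisation so that b stays a unit vector.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    b = rng.normal(size=d)
    b /= np.linalg.norm(b)                     # start from a random unit vector B_0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            x = X[i]
            b = b + eta * x * (x @ b)          # (I + eta * x x^T) b, using only vector ops
            b /= np.linalg.norm(b)             # keep the iterate on the unit sphere
    return b

# Example: the recovered direction should align with the top eigenvector.
X = np.random.randn(500, 6) @ np.diag([3.0, 1.0, 1.0, 0.5, 0.5, 0.1])
b = sgd_pca_top_direction(X - X.mean(axis=0))
print(b)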

2.3 Matrix Stochastic Gradient


We try to formulate a better algorithm than SGD because SGD does not work well on non-convex optimization problems. We parameterise our subspace using the projection matrix M = BB^T and relax our constraints by taking the convex hull of the feasible region [2]. This changes our constraint to trace(M) = k with 0 ⪯ M ⪯ I, which is a convex feasible set. The remaining steps are the same as in SGD; a rank-k solution is sampled from the average of the iterates.
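As a rough illustration, the sketch below performs an MSG-style iteration under these assumptions: a gradient step on M followed by projection of its eigenvalues onto [0, 1] with trace k, where the common offset used in the projection is found by bisection. This is a simplified sketch, not the exact routine of [2].

import numpy as np

def project_to_fantope(M, k, tol=1e-8):
    """Project a symmetric matrix onto {M : 0 <= eig(M) <= 1, trace(M) = k}.

    The projection shifts the eigenvalues by a common offset and clips them
    to [0, 1] so that they sum to k; the offset is found by bisection.
    """
    eigvals, eigvecs = np.linalg.eigh(M)
    lo, hi = -1.0 - eigvals.max(), 1.0 - eigvals.min()
    while hi - lo > tol:
        shift = (lo + hi) / 2.0
        if np.clip(eigvals + shift, 0.0, 1.0).sum() > k:
            hi = shift
        else:
            lo = shift
    clipped = np.clip(eigvals + (lo + hi) / 2.0, 0.0, 1.0)
    return (eigvecs * clipped) @ eigvecs.T     # V diag(clipped) V^T

def msg_pca(X, k, eta=0.05, seed=0):
    """Matrix Stochastic Gradient sketch: gradient step on M, then projection."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    M = np.zeros((d, d))
    M_avg = np.zeros((d, d))
    for i in rng.permutation(n):
        x = X[i]
        M = project_to_fantope(M + eta * np.outer(x, x), k)
        M_avg += M / n
    # A rank-k subspace can be read off from the top-k eigenvectors of the average.
    _, vecs = np.linalg.eigh(M_avg)
    return vecs[:, -k:]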

2.3.1 Capped MSG


This algorithm puts a further constraint on the rank of the iterates, which makes it computationally faster and more efficient than MSG [2]. For K = k, it is similar to the incremental algorithm of [2].

3 Implementation
Here, we implement PCA (computed via SVD) on a random dataset. The dataset is created with the 'random' library and the analysis is done with the sklearn library.
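A minimal sketch of this workflow is given below; the dataset is regenerated at random, so its values will differ from the table that follows, and the plotting of the variance figures is omitted.

import random
import pandas as pd
from sklearn.decomposition import PCA

# Build a random dataset with twelve columns, mirroring the one tabulated below.
random.seed(0)
columns = [f"x{i}" for i in range(1, 7)] + [f"y{i}" for i in range(1, 7)]
data = {c: [random.randint(0, 1000) for _ in range(100)] for c in columns}
df = pd.DataFrame(data)

# sklearn's PCA uses SVD internally; keep all components to inspect the variance.
pca = PCA()
scores = pca.fit_transform(df)

# Variance captured by each principal component (cf. Figures 1 and 2).
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {ratio:.3f}")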

x1 x2 x3 x4 x5 x6 y1 y2 y3 y4 y5 y6
754 787 742 759 785 722 290 243 294 245 265 254
492 501 495 489 494 483 502 515 474 507 452 468
57 63 54 53 71 56 442 459 460 444 477 483
769 744 753 733 766 782 738 756 782 795 772 750
863 894 886 884 894 876 557 572 567 537 602 546
Dataset

Figure 1: Variance captured by each component

We choose PC1 and PC2 as the components to represent the data, as they capture the maximum variance.

Figure 2: Variance comparison

We can easily observe that PC1 captures the maximum variance.

4 Conclusion
4.1 PCA in noisy settings
So far, we have considered clean stochastic settings for the implementation of our algorithms. Here, we move on to noisy settings, i.e., we deal with noisy gradients and missing data [6]. Oja's method works well in the case of bounded noise. It cannot be applied if the noise is unbounded, because the maximization objective can then never be achieved.

4.2 Kernel PCA
Kernel methods represent an important class of machine learning algorithms that enjoy both strong theoretical guarantees and strong empirical performance [7]. Standard PCA only allows linear dimensionality reduction. However, if the data has more complicated structure which cannot be well represented in a linear subspace, standard PCA will not be very helpful. Fortunately, kernel PCA allows us to generalize standard PCA to nonlinear dimensionality reduction [8].
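As an illustration, a nonlinear dataset such as two concentric circles cannot be separated by any linear projection, but kernel PCA with an RBF kernel can unfold it. A minimal sklearn sketch follows; the choice of gamma is illustrative.

from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: a structure no linear 1-D projection can separate.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_scores = PCA(n_components=2).fit_transform(X)

# RBF kernel PCA maps the data into a feature space where the circles separate.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
kernel_scores = kpca.fit_transform(X)

print(linear_scores[:3])
print(kernel_scores[:3])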

4.3 Partial Least Squares


PLS can be regarded as a substitute for the method of multiple regression. Suppose we have a dataset consisting of two sets of variables (which may have different dimensions); we need to find a lower-dimensional subspace that captures the maximum covariance between the two sets. It is often posed as the following problem: given a dataset of n samples of two sets of variates (or views), x ∈ R^{d_x} and y ∈ R^{d_y}, respectively, what is the k-dimensional subspace that captures most of the covariance between the two views [4]? The rest of the method is the same as that discussed earlier for PCA, with some changes in the optimization objective.
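A minimal sketch using sklearn's cross-decomposition module (PLSSVD, which extracts directions of maximal covariance between two views) is shown below; this is a batch illustration of the objective, not the stochastic method of [4], and the two random views are purely illustrative.

import numpy as np
from sklearn.cross_decomposition import PLSSVD

# Two views of the same samples, with different dimensionalities.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                   # shared structure
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
Y = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))

# Find a k = 2 dimensional subspace per view capturing maximal covariance.
pls = PLSSVD(n_components=2)
X_scores, Y_scores = pls.fit_transform(X, Y)
print(X_scores.shape, Y_scores.shape)                # (200, 2) (200, 2)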

4.4 Canonical Correlation Analysis


Canonical Correlation Analysis (CCA) is a ubiquitous statistical technique for finding maximally correlated linear components of two sets of random variables [9]. It is posed as the problem given in [9, para. 1], which is a non-convex problem. Another difficulty, compared to PCA and most other machine learning problems, is that the constraints also involve stochastic quantities that depend on the unknown distribution D, and the CCA objective does not decompose over samples. To address the non-convexity, we extend the problem to a Matrix Stochastic Gradient formulation for CCA [9].
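For completeness, a minimal sketch of batch CCA with sklearn follows; this uses the standard iterative estimator rather than the stochastic MSG-style method of [9], and the data is synthetic and illustrative.

import numpy as np
from sklearn.cross_decomposition import CCA

# Two views whose correlation is driven by a shared latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
X = np.hstack([latent + 0.2 * rng.normal(size=(300, 1)) for _ in range(5)])
Y = np.hstack([latent + 0.2 * rng.normal(size=(300, 1)) for _ in range(3)])

cca = CCA(n_components=1)
X_c, Y_c = cca.fit_transform(X, Y)

# The first pair of canonical variates should be strongly correlated.
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])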

References
[1] Raman Arora, Andrew Cotter, Karen Livescu and Nathan Srebro, ”Stochas-
tic optimization for PCA and PLS,” in Allerton, 2012.
[2] Raman Arora, Andrew Cotter and Nathan Srebro, ”Stochastic Optimization
of PCA with Capped MSG,” Advances in Neural Information Processing
Systems 26 (NIPS 2013).
[3] Mianjy, Poorya and Raman Arora. “Stochastic PCA with l2 and l1 Regu-
larization.” ICML (2018).
[4] Arora, Raman, Poorya Mianjy and Teodor Marinov. "Stochastic Optimization for Multiview Representation Learning using Partial Least Squares." ICML (2016), pp. 1786–1794.

[5] Shamir, Ohad. “Convergence of Stochastic Gradient Descent for PCA.”
ICML (2016).
[6] Marinov, Teodor Vanislavov, Poorya Mianjy and Raman Arora. “Streaming
Principal Component Analysis in Noisy Settings.” ICML (2018).

[7] Ullah, Enayat, Poorya Mianjy, Teodor V. Marinov and Raman Arora. "Streaming Kernel PCA with Õ(√n) Random Features." CoRR abs/1808.00934 (2018).
[8] Schölkopf, Bernhard, Smola, Alexander, and Müller, Klaus-Robert. "Kernel principal component analysis," in Advances in Kernel Methods – Support Vector Learning, pp. 327–352. MIT Press, 1999.
[9] Arora, R., Marinov, T.V., Mianjy, P. (2017). Stochastic Approximation for
Canonical Correlation Analysis. NIPS.
