Communication Efficient Distributed Kernel Principal Component Analysis

Balcan, Maria-Florina; Liang, Yingyu; Song, Le; Woodruff, David; Xie, Bo

arXiv:1503.06858 (cs)

[Submitted on 23 Mar 2015 (v1), last revised 13 Feb 2016 (this version, v4)]

Abstract: Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear features from data. When a large volume of high-dimensional data is collected in a distributed fashion, it becomes very costly to communicate all of this data to a single data center and then perform kernel PCA. Can we perform kernel PCA on the entire dataset in a distributed and communication-efficient fashion while maintaining strong, provable guarantees on solution quality?
In this paper, we answer this question affirmatively by developing a communication-efficient algorithm for kernel PCA in the distributed setting. The algorithm combines subspace embedding and adaptive sampling techniques. We show that it can take as input an arbitrary configuration of distributed datasets and compute a set of global kernel principal components with relative-error guarantees that are independent of the dimension of the feature space or the total number of data points. In particular, computing $k$ principal components with relative error $\epsilon$ over $s$ workers has communication cost $\tilde{O}(s \rho k/\epsilon+s k^2/\epsilon^3)$ words, where $\rho$ is the average number of nonzero entries in each data point. Furthermore, we evaluated the algorithm on large-scale real-world datasets and showed that it produces a high-quality kernel PCA solution while using significantly less communication than alternative approaches.
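
To get a rough feel for the communication bound stated in the abstract, the sketch below evaluates its two leading terms and compares them with the cost of shipping all data to one site. This is only an illustration: the function names are hypothetical, the logarithmic factors and constants hidden by $\tilde{O}$ are ignored, and the baseline of roughly $n\rho$ words for sending $n$ sparse points is an assumption made here, not a figure from the paper.

```python
def distributed_cost_words(s, rho, k, eps):
    """Leading terms of the stated bound: s*rho*k/eps + s*k^2/eps^3.

    s   : number of workers
    rho : average number of nonzero entries per data point
    k   : number of principal components
    eps : relative error
    Log factors and constants hidden by the O-tilde notation are dropped.
    """
    return s * rho * k / eps + s * k**2 / eps**3


def centralize_cost_words(n, rho):
    """Assumed baseline: send every one of n points (about rho words each)."""
    return n * rho


if __name__ == "__main__":
    # Hypothetical setting chosen only for illustration.
    s, rho, k, eps, n = 10, 100, 5, 0.5, 1_000_000
    print("distributed (leading terms):", distributed_cost_words(s, rho, k, eps))
    print("centralized (assumed n*rho):", centralize_cost_words(n, rho))
```

Note that the leading terms do not depend on the number of data points $n$ or on the feature-space dimension, which matches the independence claim made above; the $1/\epsilon^3$ factor means the second term grows quickly as the target error shrinks.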

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:1503.06858 [cs.LG]
	(or arXiv:1503.06858v4 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1503.06858

Submission history

From: Yingyu Liang
[v1] Mon, 23 Mar 2015 22:00:51 UTC (6,186 KB)
[v2] Sun, 19 Jul 2015 03:19:53 UTC (271 KB)
[v3] Tue, 13 Oct 2015 17:23:53 UTC (98 KB)
[v4] Sat, 13 Feb 2016 23:40:11 UTC (798 KB)
