Memory-Efficient Topic Modeling

Zeng, Jia; Liu, Zhi-Qiang; Cao, Xiao-Qin

arXiv:1206.1147 (cs)

[Submitted on 6 Jun 2012 (v1), last revised 8 Jun 2012 (this version, v2)]

Abstract: As one of the simplest probabilistic topic modeling techniques, latent Dirichlet allocation (LDA) has found many important applications in text mining, computer vision and computational biology. Recent training algorithms for LDA can be interpreted within a unified message passing framework. However, message passing requires storing previous messages, which takes a large amount of memory that grows linearly with the number of documents or the number of topics. High memory usage is therefore often a major problem for topic modeling of massive corpora containing a large number of topics. To reduce the space complexity, we propose a novel algorithm for training LDA that does not store previous messages: tiny belief propagation (TBP). The basic idea of TBP is to relate message passing algorithms to non-negative matrix factorization (NMF) algorithms, which absorb the message updating into the message passing process and thus avoid storing previous messages. Experimental results on four large data sets confirm that TBP performs comparably to or even better than current state-of-the-art training algorithms for LDA, but with much lower memory consumption. TBP can do topic modeling when massive corpora cannot fit in computer memory, for example, extracting thematic topics from the 7 GB PUBMED corpus on a common desktop computer with 2 GB of memory.
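The memory argument in the abstract can be illustrated with a minimal sketch. This is not the authors' TBP implementation; it only shows the general idea of computing each message on the fly, folding it into the accumulated statistics, and discarding it. The function name `message_free_sweep`, the smoothing hyperparameters `alpha` and `beta`, and the exact form of the message (a normalized product of smoothed document-topic and topic-word statistics) are assumptions made for illustration.

```python
def message_free_sweep(counts, theta, phi, alpha=0.1, beta=0.01):
    """One sweep over a D x W count matrix without storing any messages.

    counts: D x W word counts (list of lists)
    theta:  D x K document-topic statistics from the previous sweep
    phi:    W x K word-topic statistics from the previous sweep
    alpha, beta: hypothetical smoothing hyperparameters (assumed values)
    """
    D, W, K = len(counts), len(counts[0]), len(theta[0])
    # Per-topic totals of the word-topic statistics.
    phi_tot = [sum(phi[w][k] for w in range(W)) for k in range(K)]
    new_theta = [[0.0] * K for _ in range(D)]
    new_phi = [[0.0] * K for _ in range(W)]
    for d in range(D):
        for w in range(W):
            x = counts[d][w]
            if x == 0:
                continue
            # Temporary K-vector: the only "message" alive at any time.
            mu = [(theta[d][k] + alpha) * (phi[w][k] + beta)
                  / (phi_tot[k] + W * beta) for k in range(K)]
            z = sum(mu)
            for k in range(K):
                share = x * mu[k] / z
                # Absorb the message into both factors, then drop it.
                new_theta[d][k] += share
                new_phi[w][k] += share
    return new_theta, new_phi


counts = [[2, 0, 1], [0, 3, 1]]
theta = [[1.0, 2.0], [2.0, 1.0]]
phi = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
new_theta, new_phi = message_free_sweep(counts, theta, phi)
```

Under these assumptions the working memory is the two factor tables, (D + W) × K numbers, plus one temporary K-vector. Keeping every message between sweeps would instead need one K-vector per nonzero entry of `counts`.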

Comments:	20 pages, 7 figures
Subjects:	Machine Learning (cs.LG); Information Retrieval (cs.IR)
Cite as:	arXiv:1206.1147 [cs.LG]
	(or arXiv:1206.1147v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1206.1147

Submission history

From: Jia Zeng
[v1] Wed, 6 Jun 2012 08:34:43 UTC (2,235 KB)
[v2] Fri, 8 Jun 2012 14:07:26 UTC (2,239 KB)
