Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections

Hoobin, Christopher; Puglisi, Simon J.; Zobel, Justin

arXiv:1106.2587 (cs)

[Submitted on 14 Jun 2011 (v1), last revised 9 Dec 2011 (this version, v2)]

Abstract: Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-size blocks and compress each block with a general-purpose adaptive algorithm such as GZIP. Random access to a specific document then requires decompressing the whole block that contains it. The choice of block size is critical: it trades off compression effectiveness against document retrieval time. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in an LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large collections that, using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods and in general gives much better compression.
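
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (sample_dictionary, longest_match, factorize, decode), the evenly spaced sampling policy, the sample length, and the (position, length) factor layout with length 0 marking a literal byte are all assumptions made for illustration. The match search uses a naive scan with bytes.find, where a practical system would use an index over the dictionary.

```python
from typing import List, Tuple

Factor = Tuple[int, int]  # (position in dictionary, length); length 0 means "literal byte"


def sample_dictionary(collection: bytes, dict_size: int, sample_len: int = 1024) -> bytes:
    """Build a dictionary from evenly spaced samples (assumed sampling policy)."""
    n_samples = max(1, dict_size // sample_len)
    stride = max(1, len(collection) // n_samples)
    samples = (collection[i:i + sample_len] for i in range(0, len(collection), stride))
    return b"".join(samples)[:dict_size]


def longest_match(dictionary: bytes, text: bytes, start: int) -> Factor:
    """Find the longest prefix of text[start:] that occurs in the dictionary."""
    pos, length = 0, 0
    while start + length < len(text):
        # Any longer match must begin at or after the first hit of the shorter one.
        hit = dictionary.find(text[start:start + length + 1], pos)
        if hit == -1:
            break
        pos, length = hit, length + 1
    return pos, length


def factorize(dictionary: bytes, text: bytes) -> List[Factor]:
    """Greedily encode text as factors relative to the dictionary."""
    factors: List[Factor] = []
    i = 0
    while i < len(text):
        pos, length = longest_match(dictionary, text, i)
        if length == 0:
            factors.append((text[i], 0))  # byte absent from dictionary: store it literally
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def decode(dictionary: bytes, factors: List[Factor]) -> bytes:
    """Rebuild a document from its factors using only the dictionary."""
    out = bytearray()
    for pos, length in factors:
        if length == 0:
            out.append(pos)
        else:
            out += dictionary[pos:pos + length]
    return bytes(out)


# Each document is factorized on its own, so any one of them can be decoded
# without touching its neighbours.
dictionary = b"<html><body> world</body></html>"
docs = [b"<html><body>hello world</body></html>",
        b"<html><body>goodbye world!</body></html>"]
encoded = [factorize(dictionary, d) for d in docs]
assert decode(dictionary, encoded[1]) == docs[1]
```

Because every factor refers only to the fixed dictionary and never to previously decoded text, retrieving one document needs just the dictionary and that document's factors. This is the property that separates the relative scheme from block-based adaptive compression, where a whole block must be decompressed first.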

Comments:	VLDB2012
Subjects:	Data Structures and Algorithms (cs.DS); Databases (cs.DB); Information Retrieval (cs.IR)
Report number:	vol5no3/p265_christopherhoobin_vldb2012
Cite as:	arXiv:1106.2587 [cs.DS]
	(or arXiv:1106.2587v2 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.1106.2587
Journal reference:	Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 3, pp. 265-273 (2011)

Submission history

From: Christopher Hoobin
[v1] Tue, 14 Jun 2011 00:53:40 UTC (125 KB)
[v2] Fri, 9 Dec 2011 03:26:13 UTC (98 KB)
