Improving Compression of Massive Log Data
Robert Christensen
University of Utah
[email protected]
May 1, 2013
Abstract
In this paper we explore a novel method of improving log compression by partitioning data into
homogeneous buckets. By partitioning log records into buckets that improve the homogeneity of the
log data, the effectiveness of generic compression methods, such as bzip2 and gzip, can be improved.
A system is developed to archive log records using a partition function to distribute log records
into buckets. Several partition functions are implemented, and the improvement in compression ratio
they provide is reported when archiving several real-world data sets. Several improvements to the
method used in this paper are also proposed.
1 Introduction
Log records are always being collected. In many instances the log is a collection of log records from a
variety of sources, gathered at a single location. System administrators use log data for a variety
of tasks, such as identifying troubled hardware [8], analyzing network traffic [3], postmortem analysis
of security breaches [2], and performance analysis [6]. Log data is a history of the status of a system.
Because of the usefulness of this data, administrators are reluctant to delete this data, opting to archive
the data in case it is needed in the future.
Each log record is typically a single line written in ASCII text, terminated with a new line character.
The format of log data is designed for system administrators to easily read the log records as a massive
text document without the need for special tools.
A system log for a large system can include many intermingled types of status messages, such as unusual
memory errors mixed with routine DHCP status messages. Data compressors are more effective when the
data being archived is consistent and predictable. A large collection of log records loses this consistency
when events from different sources are intermingled into a single repository.
Splitting the log collection into multiple repositories, one repository for each log record type, is not
an optimal solution because it would make it very difficult for administrators of the system
to find crucial information when needed. Records of one system fault may provide information about
other system faults that would not necessarily be located in the same system log repository. Also, the
fact that a series of log records has a common source does not mean the records are similar in structure.
Current systems collect log data in a centralized source with a variety of heterogeneous records in close
proximity.
A system is proposed to automatically distribute log data within an archive to improve the homo-
geneous nature of the archive, thus improving the effectiveness of generic compression utilities
such as bzip2 and gzip.
Figure 1: Snapshot of Log data from Red Storm HPC system log
2 Problem Statement
A snapshot of a log file is shown in figure 1. This figure shows a segment of a system log from the HPC
system Red Storm located at Sandia National Laboratories [6]. The characteristics of this segment are
common to all real-world log files studied. It is easily seen that a series of records of similar form is
being recorded. For a small period, records of a different form are recorded by the system, creating a
heterogeneous log. The existence of these two inconsistent records, in a series of many records, decreases
the consistency of the repository as a whole, which hurts compressibility.
If the log records contained in the boxes bordered by the solid green line were grouped together and
the two log records in the box bordered by the dashed red line were grouped together for archiving, the
archive would contain more homogeneous data. Naturally, partitioning data into homogeneous groups
decreases the randomness of the data, allowing compression algorithms to work more effectively.
The system being proposed here has three basic steps:
1. log record collection. Log records are collected in order to be processed by the system. Log
records can be collected and archived in real time, but the option must exist to compress log
records after the log is completely collected. Since it is common to delimit log records by a new
line character, we define a single log record as a string terminated with the new line
character.
2. log record partitioning. The log archive consists of one or more buckets containing log records
previously processed. The log partitioning step analyzes the new log record being inserted and
determines which bucket to place the new record.
3. log compression. After the log records have been collected and partitioned, the buckets containing
the log records can be compressed. Each bucket is completely independent, so the buckets can
be compressed separately in parallel. Also, since most compression utilities compress data in
blocks [4, 7], when a bucket becomes sufficiently full, a block of data from the bucket can be
compressed and stored.
In order to distribute log records in such a way that compression algorithms can work more effec-
tively, we have defined the following formal statement of a system to partition log records.
Suppose there exist n log records l_1, l_2, . . . , l_{n-1}, l_n and m buckets b_1, b_2, . . . , b_{m-1}, b_m. The buckets
are ordered, meaning a log record l_i can be inserted into one of the buckets only if ∃k s.t. l_{i-1} ∈ b_k or
i = 1. This ordering allows streaming algorithms to be used to distribute the records into the
buckets.
We define a partitioning function partition(li ), which returns the index k such that the record
li should be inserted as the next log record in the bucket bk .
When a record is inserted into a bucket, the value of k is recorded so the complete log can be
reconstructed as it appeared in its original form when the buckets are unarchived. This collection of
index values is also compressed and stored with the log archive.
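The overall flow can be summarized in a short sketch. The following Python code is a minimal illustration of the collect / partition / compress pipeline described above, assuming a partition function as defined and a generic block compressor (bz2 is used here); the class name, the `flush_bucket` helper, and the block-size constant are illustrative choices, not part of the implementation evaluated in this paper.

```python
import bz2

BLOCK_SIZE = 900 * 1024  # illustrative: roughly one bzip2 block


class LogArchiver:
    """Minimal sketch of the collect / partition / compress pipeline."""

    def __init__(self, partition, num_buckets):
        self.partition = partition          # partition(record) -> bucket index k
        self.buckets = [bytearray() for _ in range(num_buckets)]
        self.compressed = [[] for _ in range(num_buckets)]
        self.order = []                     # bucket index per record, for reconstruction

    def insert(self, record):
        k = self.partition(record)
        self.order.append(k)                # stored so the original order can be rebuilt
        self.buckets[k] += record.encode() + b"\n"
        if len(self.buckets[k]) >= BLOCK_SIZE:
            self.flush_bucket(k)            # compress a full block independently

    def flush_bucket(self, k):
        if self.buckets[k]:
            self.compressed[k].append(bz2.compress(bytes(self.buckets[k])))
            self.buckets[k] = bytearray()

    def close(self):
        for k in range(len(self.buckets)):
            self.flush_bucket(k)
        # the index list is also compressed and stored with the archive
        order_blob = bz2.compress(",".join(map(str, self.order)).encode())
        return self.compressed, order_blob
```

A round-robin partition function, for example, can be expressed as `lambda record: next(counter) % m` with `counter = itertools.count()`.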
3 Partition Functions
Each partition function implemented is explained below, together with a formal explanation of how it
functions and, where applicable, a short description of its unique behaviors.
3.1 Segmentation
This distribution algorithm only considers the total number of log records the original log contains.
Because of this, the technique only works if the number of log records is known beforehand.
When the log records are partitioned into m buckets, the bucket b_k will contain the series of log
records:

$$\left[\, l_{\frac{n}{m}(k-1)},\ l_{\frac{n}{m}(k-1)+1},\ \ldots,\ l_{\frac{n}{m}k-2},\ l_{\frac{n}{m}k-1} \,\right]$$
Partitioning the records this way will separate the log archive into m segments, where each segment
contains contiguous log records.
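As a concrete sketch (an illustrative helper under the assumption that n is known in advance, not the exact code used here), the bucket index for a record can be computed directly from its position:

```python
def segmentation_partition(i, n, m):
    """Bucket index (1-based) for record i (0-based) when n records are split
    into m contiguous segments of roughly n/m records each."""
    seg = n / m                       # records per segment
    return min(int(i // seg) + 1, m)  # clamp the final records into bucket m

# Example: with n = 100 and m = 4, records 0-24 map to bucket 1,
# records 25-49 to bucket 2, and so on.
```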
3
When inserting the log record l_i, the bucket b_k is found that results in the smallest value of the
formula:
$$\frac{\sum_{t=1}^{|S_k|} \mathrm{ed}(l_i,\, S_k[t])}{|S_k|}$$
When the value of k is found, the log record li is then inserted into the bucket bk .
If all buckets containing at least one log record had an average edit distance larger than
some threshold and at least one bucket was empty, the log record being inserted is placed
into an unused bucket. Otherwise, the log record is inserted into the bucket with the smallest
average edit distance, even if that value is very large.
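A minimal sketch of this selection rule is shown below, assuming each bucket keeps its sliding window in a `deque(maxlen=s)`; the helper names and the way the threshold is passed in are illustrative assumptions, not the archiver's actual interface.

```python
from collections import deque

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def choose_bucket_ed(record, windows, threshold):
    """windows[k] is a deque(maxlen=s) of the last s records placed in bucket k.
    Return the bucket with the smallest average edit distance to the record,
    falling back to an empty bucket when every used bucket exceeds the threshold."""
    best_k, best_avg = None, float("inf")
    for k, win in enumerate(windows):
        if win:
            avg = sum(edit_distance(record, r) for r in win) / len(win)
            if avg < best_avg:
                best_k, best_avg = k, avg
    if best_avg > threshold:
        for k, win in enumerate(windows):
            if not win:
                best_k = k
                break
    windows[best_k].append(record)
    return best_k

# Example setup: 32 buckets, sliding window depth 10.
windows = [deque(maxlen=10) for _ in range(32)]
```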
For the implementation presented in this paper, the edit distance between the log record being
inserted and the last s log records inserted into each bucket was calculated for each log record inserted.
Calculating the edit distance between two strings is a costly operation: the standard dynamic
programming algorithm runs in O(len(A) · len(B)) time. The real-world data set used contained some log
records over 100,000 characters. Even though most log records did not exceed 200 characters, the existence
of an extremely large log record results in extremely poor performance. The large log record must be inserted
into the archive in order to guarantee lossless compression, but the log records inserted after it are likely
to be of ordinary size. Comparing a short string with a very long string always results in a large edit
distance, because a large number of insertions is always required. Because the new log record is placed in
the bucket with the smallest average edit distance, the bucket whose sliding window contains the large log
record is rarely selected, so the large record is never pushed out of that window and every subsequent record
must be compared against it. This causes significant performance degradation due to the time the edit
distance function takes when given large strings.
The edit distance distribution algorithm was found to be so slow that it was impractical: archiving
the test data with this scheme would take days to complete. Optimizations of the edit distance
calculation could be used, but the edit distance similarity function did not perform as well as the
other similarity functions explored. The edit distance distribution algorithm is included to show
that it was considered as a distribution method, but both its running time and the resulting compression
size show that it is not a good method.
If l_a contains one more appearance of a character than l_b, that appearance increases the
character distance value by 1.
Figure 2: Selection of words from a log record for Word Jaccard Distance
For explanation purposes, we will define a function chardist(la , lb ) which will calculate the character
count arrays for la and lb and return the character count distance between the two log records. When
implemented, the character count arrays should only be calculated once and stored in memory until
they are no longer needed.
We will define S_k to be the sliding window of records for the bucket b_k, with the constraint 0 ≤
|S_k| ≤ s. The contents of S_k are the last s log records inserted into the bucket b_k. If |b_k| ≤ s, then
S_k contains all log records inserted into b_k. The notation S_k[t] denotes the log record at the t-th index
of S_k.
To find the optimal bucket b_k in which to place a new log record l_i, the index k is found that minimizes
the formula:
$$\frac{\sum_{t=1}^{|S_k|} \mathrm{chardist}(l_i,\, S_k[t])}{|S_k|}$$
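As a minimal sketch (assuming the same sliding-window machinery as before; the function name and the use of `collections.Counter` are illustrative choices), the character count distance and the selection rule can be expressed as:

```python
from collections import Counter

def chardist(counts_a, counts_b):
    """Character count distance between two precomputed character-count maps:
    each extra appearance of a character in one record adds 1 to the distance."""
    return sum(abs(counts_a[c] - counts_b[c]) for c in set(counts_a) | set(counts_b))

def choose_bucket_chardist(record, windows):
    """windows[k] holds the character counts of the last s records in bucket k.
    The counts for the new record are computed once and reused, as the text suggests."""
    counts = Counter(record)
    best_k, best_avg = None, float("inf")
    for k, win in enumerate(windows):
        if win:
            avg = sum(chardist(counts, c) for c in win) / len(win)
            if avg < best_avg:
                best_k, best_avg = k, avg
    return best_k, counts
```

The fall-back to an unused bucket shown in the edit distance sketch can be applied here in the same way.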
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The Jaccard index identifies the similarity between two sets. Its value is a rational number in the
range [0, 1], where values near 0 indicate the two sets are almost completely different and values near
1 indicate the two sets are very similar. The Jaccard distance, which identifies dissimilarity between
two sets, is defined as 1 − J(A, B).
When a log record l_i is going to be inserted into the log archive, a set of the words found in the log
record is created. We will use the symbol W_i to denote the set of words found in the log record l_i. Here,
we define a word to be a contiguous series of alphanumeric characters delimited by non-alphanumeric
characters.
Figure 2 shows an example of the words that would be selected from a log record from the Red
Storm HPC system log. Words may contain only letters, only numbers, or a mixture of letters and numbers.
When a log record is analyzed to create the word set, no attempt is made to determine the meaning of
the words. In the example, the time 22:41:29 is separated into three words. Similar log records appearing
in the future will contain a different value in one or more of the time fields.
No context should be inferred from fields such as this, because no standard format exists for
writing a time to a log record. The partition function remains more generic by simply collecting
all alphanumeric words in a log record.
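Under this definition, the word set can be extracted with a single regular expression; the sketch below is an illustrative tokenizer, not necessarily the exact one used by the archiver.

```python
import re

def word_set(record):
    """Set of maximal alphanumeric runs in a log record; for example,
    the timestamp 22:41:29 yields the three words '22', '41', and '29'."""
    return set(re.findall(r"[A-Za-z0-9]+", record))
```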
We will define S_k to be a sliding window of the last s records inserted into the bucket b_k. The
number of elements in S_k has the constraint 0 ≤ |S_k| ≤ s. The value of s is a parameter that can
be provided to the archiver. If S_k contains fewer than s sets, it contains the word sets of all log
records which have been inserted into b_k.
When inserting the log record l_i into the log archive with its associated set of words W_i, the bucket
b_k is found that results in the smallest value of the formula
$$\frac{\sum_{t=1}^{|S_k|} \bigl(1 - J(W_i,\, S_k[t])\bigr)}{|S_k|}$$
When the bucket bk is found, the log record li is inserted into the bucket bk and Wi is inserted into
Sk , removing the oldest set in Sk if needed.
If all buckets containing at least one log record had an average Jaccard distance larger than some
threshold and at least one bucket was empty, the log record being inserted is placed into the unused
bucket. If no empty bucket exists, the log record is inserted into the bucket with the smallest average
Jaccard distance, even if that distance is extremely large.
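The following sketch combines the word-set extraction above with this selection rule; as before, each bucket's window is assumed to be a `deque(maxlen=s)` of word sets, and the function names and threshold handling are illustrative assumptions rather than the archiver's actual interface.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets (taken as 0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def choose_bucket_jaccard(words, windows, threshold):
    """windows[k] is a deque(maxlen=s) of the word sets W of the last s records
    in bucket k. Pick the bucket with the smallest average Jaccard distance,
    falling back to an empty bucket when every used bucket exceeds the threshold."""
    best_k, best_avg = None, float("inf")
    for k, win in enumerate(windows):
        if win:
            avg = sum(1.0 - jaccard(words, w) for w in win) / len(win)
            if avg < best_avg:
                best_k, best_avg = k, avg
    if best_avg > threshold:
        for k, win in enumerate(windows):
            if not win:
                best_k = k
                break
    windows[best_k].append(words)
    return best_k
```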
Source Uncompressed (MB) bzip2 compressed (MB) gzip compressed (MB)
Blue Gene/L 509 35 45
Thunderbird 24,172 904 1511
Red Storm 32,596 664 1080
Spirit (ICC2) 28,583 389 666
Liberty 21,768 276 429
Web Data 7501 296 533
Table 1: Statistics for the size of the log data, both before and after compression.
Source log record count median length min length max length
Blue Gene/L 4,747,963 89 62 895
Thunderbird 211,212,192 89 22 1045
Red Storm 219,096,168 99 31 173,254
Spirit (ICC2) 272,298,969 100 13 994
Liberty 265,569,231 89 23 1037
Web Data 26,568,851 277 67 4187
Table 2: Statistics for the log record contained in each data source.
A KMV synopsis [1] makes it possible to efficiently estimate the Jaccard distance
between two log records if a KMV synopsis has already been created for the log records.
This partitioning method uses the same process as that in section 3.6, except that rather than building an
exact set of q-grams, a KMV synopsis is built and the value of the Jaccard distance is approximated.
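The surviving text does not show the synopsis construction itself, so the sketch below follows the common KMV approach of Beyer et al. [1]: keep the k smallest hash values of each q-gram set, and estimate the Jaccard index from the k smallest hashes of the combined synopses. The hash function and helper names are assumptions made for illustration.

```python
import hashlib

def kmv_synopsis(items, k=60):
    """Keep the k smallest 64-bit hash values of the items (e.g. a record's q-grams)."""
    hashes = {int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(), "big")
              for x in items}
    return sorted(hashes)[:k]

def estimate_jaccard(syn_a, syn_b, k=60):
    """Estimate J(A, B) from two KMV synopses: the k smallest hashes of the combined
    synopses form a synopsis of A ∪ B; the fraction of those hashes present in both
    synopses estimates |A ∩ B| / |A ∪ B|."""
    union_k = sorted(set(syn_a) | set(syn_b))[:k]
    if not union_k:
        return 0.0
    both = set(syn_a) & set(syn_b)
    return sum(1 for h in union_k if h in both) / len(union_k)
```

The q-grams of a record can be produced, for example, as `{record[i:i+q] for i in range(len(record) - q + 1)}`.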
4 Experiments
System logs from five supercomputers and one Apache web server log file were used to compare the
compression ratio of compressing the entire log with bzip2 or gzip, without partitioning,
against partitioning with the various algorithms described in section 3.
Four of the supercomputer system logs come from computers located at Sandia National Laborato-
ries: Thunderbird, Red Storm, Spirit (ICC2), and Liberty. The fifth supercomputer system log is from
the Blue Gene/L computer located at Lawrence Livermore National Laboratory [6]. All supercomputer
system logs have been scrubbed of personal data. Otherwise, the log data used in the experiments is
the raw system log. When these system logs are referenced, they are referenced by the name of the
computer from which they originated.
The Apache web server data came from the MesoWest project [5]. When this data is referenced, it
will be called Web Data.
As we can see from table 1, using bzip2 or gzip to compress the log data is extremely effective.
bzip2 is much more effective than gzip at compressing the log data used in this experiment, but
the bzip2 compression algorithm takes significantly more time than compressing with gzip. This
is one of the primary reasons why gzip, rather than bzip2, is often used to compress large log
files [3].
The archiving system developed allowed for many command line arguments to set different variables
for a given run. These arguments include the partitioning algorithm used; arguments for the parti-
tioning algorithm, such as maximum number of buckets; and arguments for the compression algorithm
used internally, such as the block size for the compressor for each bucket.
All command line arguments were fixed except the variable being tested. Unless otherwise noted
the following are the default values used in the experiment. The maximum number of buckets m = 32.
The sliding window depth s = 10. The q-gram size q = 6. The KMV synopsis size k = 60. If a
partitioning algorithm does not need one of these values, the value is ignored. The
partitioning algorithm is always explicitly listed for each experiment.
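For reference, these defaults can be summarized as a small configuration mapping; the key names below are illustrative, since the archiver actually receives these values as command line arguments.

```python
# Default experiment parameters described above (key names are illustrative).
DEFAULTS = {
    "num_buckets": 32,    # m, maximum number of buckets
    "window_depth": 10,   # s, sliding window depth
    "qgram_size": 6,      # q, q-gram size
    "kmv_size": 60,       # k, KMV synopsis size
}
```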
The y-axis scale in the figures shows the ratio of the compressed archive size obtained with partitioning
to the compressed size obtained without partitioning; values below 1 indicate that partitioning improved
compression.
5 Future Work
During the development of this log archiving system, several shortcomings of the proposed system became
apparent. Since some require changes to the method proposed here, they were not added to the
current system. After these concerns are addressed, the log archiving system will be much more usable
to administrators interested in long-term log archives and will also achieve better compression ratios
than the results reported in this paper.
First, the number of buckets is fixed. In the current method, an unused bucket is used when the
log record being inserted is not similar to any of the currently used buckets. After all buckets are in use,
if a log record which is very different from the contents of every used bucket is inserted
into the archive, it will be placed in the bucket which most closely matches. However, it is easy to
conceive of cases where doing this decreases the homogeneous nature of the buckets.
Also, this limitation on how buckets are initially allocated can have significant repercussions. If the
first log records inserted into the archive are very different from the log records to be inserted later,
the buckets will be lopsided. Or, if one very unusual log record is the first inserted into a bucket it
could be possible for that bucket to never be used again because the partition function will never
select that bucket. This negatively impacts the compression ratio.
[Figure residue: compression-ratio plots comparing the partition functions (Round Robin, Segmentation, Edit Distance, Character Distance, Word Jaccard, Jaccard approximation, Jaccard q-gram) against no partitioning, for the bzip2 and gzip compressors.]
Figure 5: Vary number of buckets on web data using Word Jaccard Distance partitioning. bzip2 compressor.
Figure 6: Vary number of buckets on web data using Word Jaccard Distance partitioning. gzip compressor.
Figure 7: Vary sliding window depth on web data using Word Jaccard Distance partitioning. bzip2 compressor.
Figure 8: Vary sliding window depth on web data using Word Jaccard Distance partitioning. gzip compressor.
Figure 11: Vary KMV synopsis size on web data using Estimated Q-gram Jaccard Distance partitioning. bzip2 compressor.
Figure 12: Vary KMV synopsis size on web data using Estimated Q-gram Jaccard Distance partitioning. gzip compressor.
Figure 13: Vary data using Word Jaccard Distance partitioning. bzip2 compressor.
Figure 14: Vary data using Word Jaccard Distance partitioning. gzip compressor.
By observing trends in the contents of the buckets dynamically, it is possible to react to unusual
log records by creating and closing archive buckets when needed, so that the log records contained in a
single bucket remain very similar. This would also decrease the negative impact of the first log records
inserted into the archive.
Second, in all partition functions described in this paper which use a sliding window, the depth
of the sliding window s is fixed. The value of s does impact the effectiveness of the archive, so larger
values of s are desired. However, if s is larger the processing time to select the appropriate bucket
increases.
If a bucket is extremely homogeneous, the value of s does not need to be as large as a bucket which
is not as homogeneous. By observing how homogeneous a bucket is, the value of s can be adjusted
dynamically to improve speed and improve the compression ratio of the log archive.
Some compression methods, such as bzip2 and gzip, compress data in blocks [4, 7]. The contents
of one compression block should have no impact on a future block. Keeping a sliding window which
straddles two compression blocks will therefore not improve the compression ratio, but it could negatively
impact the compression ratio by restricting the contents of a bucket.
Third, if the algorithm were more closely tied to a specific compression algorithm, additional opti-
mizations could be applied. As already explained, by considering data in blocks rather than one
continuous bucket of many log records, the compression algorithm used would be better utilized.
Fourth, since log records are placed in buckets based on the content of the log record, using
a column-based compression algorithm could significantly improve how efficiently the log archive can
be compressed [2, 9]. The method used to distribute log records into buckets naturally inspires the use
of such compression techniques rather than generic compressors such as bzip2 and gzip.
Fifth, in order for the log archive to be reviewed, it must be completely decompressed. Applying an
index to the log archive could provide the ability to decompress specific sections. This would improve
the usability of this method as an archiving tool because traditional log archiving methods do not
support selective decompression of log archives [3].
Each of the future work topics discussed here is currently being studied and will be implemented
in continued research.
6 Conclusion
In this paper we have implemented a preprocessing step for log archiving to be used in conjunction with
traditional generic compression methods, such as bzip2 and gzip. The preprocessing step separates log
records using a partition function into homogeneous buckets. By improving the homogeneous nature
of the log data, traditional compression methods are able to compress the log data more effectively.
Several partition functions were implemented using a variety of distance functions to detect
similarity between log records. These functions include edit distance, character count distance, word
Jaccard distance, q-gram Jaccard distance, and q-gram Jaccard distance estimation. These functions
were compared to each other and to the traditional method of compressing the entire log using bzip2 or gzip.
By using a partition function to improve the homogeneous nature of log data, log archives were
compressed more effectively than compressing the log file as a whole.
References
[1] Kevin Beyer, Peter J Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. On synopses
for distinct-value estimation under multiset operations. In Proceedings of the 2007 ACM SIGMOD
international conference on Management of data, pages 199–210. ACM, 2007.
[2] Francesco Fusco, Marc Ph. Stoecklin, and Michail Vlachos. Net-fli: On-the-fly compression, archiving
and indexing of streaming network traffic. VLDB, 3(2), 2010.
[3] Francesco Fusco, Michail Vlachos, and Xenofontas Dimitropoulos. Rasterzip: compressing network
monitoring data with support for partial decompression. In Proceedings of the 2012 ACM conference
on Internet measurement conference, pages 51–64. ACM, 2012.
[4] Jean-Loup Gailly and Mark Adler. The gzip home page. URL: http://www.gzip.org/ (accessed January 3,
2013), 1999.
[6] Adam Oliner and Jon Stearley. What supercomputers say: A study of five system logs. In Depend-
able Systems and Networks, 2007. DSN’07. 37th Annual IEEE/IFIP International Conference on,
pages 575–584. IEEE, 2007.
[7] Julian Seward. bzip2 and libbzip2. Available at http://www.bzip.org, 1996.
[8] Jon Stearley. Towards informatic analysis of syslogs. In Cluster Computing, 2004 IEEE Interna-
tional Conference on, pages 309–318. IEEE, 2004.
[9] Binh Dao Vo and Gurmeet Singh Manku. Radixzip: Linear time compression of token streams. In
Proceedings of the 33rd international conference on Very large data bases, pages 1162–1172. VLDB
Endowment, 2007.