Improving Compression of Massive Log Data
Robert Christensen
University of Utah
[email protected]
May 1, 2013
Abstract
In this paper we explore a novel method of improving log compression by partitioning data into
homogeneous buckets. By partitioning log records into buckets that improve the homogeneity of the
log data, the effectiveness of generic compression methods, such as bzip2 and gzip, can be improved.
A system is developed to archive log records using a partition function to distribute log records
into buckets. Several partition functions are implemented, and the improvement in compression ratio
they provide is reported when archiving several real-world data sets. Several improvements to the
method used in this paper are also proposed.
1 Introduction
Log records are always being collected. In many instances the log is a collection of log records from a
variety of sources, gathered at a single location. System administrators use log data for a variety
of tasks, such as identifying troubled hardware [8], analyzing network traffic [3], postmortem analysis
of security breaches [2], and performance analysis [6]. Log data is a history of the status of a system.
Because of the usefulness of this data, administrators are reluctant to delete this data, opting to archive
the data in case it is needed in the future.
Each log record is typically a single line written in ASCII text, terminated with a new line character.
The format of log data is designed for system administrators to easily read the log records as a massive
text document without the need for special tools.
A system log for a large system can include many intermingled types of status messages, such as unusual
memory errors mixed with routine DHCP status messages. Data compressors are more effective when the
data being archived is consistent and predictable. A large collection of log records loses this consistency
when events from different sources are intermingled into a single repository.
Splitting the log collection into multiple repositories, one repository for each log record type, is not
an optimal solution because it would make it very difficult for administrators of the system
to find crucial information when needed. Records of one system fault may provide information about
other system faults that would not necessarily be located in the same system log repository. Also, the
fact that a series of log records has a common source does not mean the records are similar in structure.
Current systems collect log data in a centralized source with a variety of heterogeneous records in close
proximity.
A system is proposed to automatically distribute log data within an archive to improve the homo-
geneous nature of the archive, thus improving the effectiveness of generic compression utilities
such as bzip2 and gzip.
Figure 1: Snapshot of Log data from Red Storm HPC system log
2 Problem Statement
A snapshot of a log file is shown in figure 1. This figure shows a segment of a system log from the HPC
system Red Storm located at Sandia National Laboratories [6]. The characteristics of this segment are
common to all real-world log files studied. It is easily seen that a series of records of similar form is
being recorded. For a small period, records of a different form are recorded by the system, creating a
heterogeneous log. The existence of these two inconsistent records, in a series of many records, decreases
the consistency of the repository as a whole, which hurts compressibility.
If the log records contained in the boxes bordered by the solid green line were grouped together and
the two log records in the box bordered by the dashed red line were grouped together for archiving, the
archive would contain more homogeneous data. Naturally, partitioning data into homogeneous groups
decreases the randomness of the data, allowing compression algorithms to work more effectively.
The system being proposed here has three basic steps:
1. log record collection. Log records are collected in order to be processed by the system. Log
records can be collected and archived in real time, but the option must exist to compress log
records after the log is completely collected. Since it is common to delimit log records by a new
line character, we define a single log record as a string terminated with the new line
character.
2. log record partitioning. The log archive consists of one or more buckets containing log records
previously processed. The log partitioning step analyzes the new log record being inserted and
determines which bucket to place the new record.
3. log compression. After the log records have been collected and partitioned, the buckets containing
the log records can be compressed. Each bucket is completely independent, so the buckets can
be compressed separately in parallel. Also, since most compression utilities compress data in
blocks [4, 7], when a bucket becomes sufficiently full, a block of data from the bucket can be
compressed and stored.
In order to distribute log records in such a way that compression algorithms can work more effec-
tively, we have defined the following formal statement of a system to partition log records.
Suppose there exist n log records l_1, l_2, . . . , l_{n-1}, l_n and m buckets b_1, b_2, . . . , b_{m-1}, b_m. The buckets
are ordered, meaning a log record l_i can be inserted into one of the buckets only if ∃k s.t. l_{i-1} ∈ b_k or
i = 1. This ordering allows streaming algorithms to be used to distribute the records into the
buckets.
We define a partitioning function partition(li ), which returns the index k such that the record
li should be inserted as the next log record in the bucket bk .
When a record is inserted into a bucket, the value of k is recorded so the complete log can be
reconstructed as it appeared in its original form when the buckets are unarchived. This collection of
index values is also compressed and stored with the log archive.
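The overall flow can be summarized in a short sketch. The following Python code is a minimal illustration of the collect / partition / compress pipeline described above, assuming a partition function as defined and a generic block compressor (bz2 is used here); the class name, the `flush_bucket` helper, and the block-size constant are illustrative choices, not part of the implementation evaluated in this paper.

```python
import bz2

BLOCK_SIZE = 900 * 1024  # illustrative: roughly one bzip2 block


class LogArchiver:
    """Minimal sketch of the collect / partition / compress pipeline."""

    def __init__(self, partition, num_buckets):
        self.partition = partition          # partition(record) -> bucket index k
        self.buckets = [bytearray() for _ in range(num_buckets)]
        self.compressed = [[] for _ in range(num_buckets)]
        self.order = []                     # bucket index per record, for reconstruction

    def insert(self, record):
        k = self.partition(record)
        self.order.append(k)                # stored so the original order can be rebuilt
        self.buckets[k] += record.encode() + b"\n"
        if len(self.buckets[k]) >= BLOCK_SIZE:
            self.flush_bucket(k)            # compress a full block independently

    def flush_bucket(self, k):
        if self.buckets[k]:
            self.compressed[k].append(bz2.compress(bytes(self.buckets[k])))
            self.buckets[k] = bytearray()

    def close(self):
        for k in range(len(self.buckets)):
            self.flush_bucket(k)
        # the index list is also compressed and stored with the archive
        order_blob = bz2.compress(",".join(map(str, self.order)).encode())
        return self.compressed, order_blob
```

A round-robin partition function, for example, can be expressed as `lambda record: next(counter) % m` with `counter = itertools.count()`.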
3 Partition Functions
Each partition function implemented is explained below, together with a formal explanation of how it
functions and, where applicable, a short description of its unique behaviors.
3.1 Segmentation
This distribution algorithm only considers the total number of log records the original log contains.
Because of this, the technique only works if the number of log records is known beforehand.
When the log records are partitioned into m buckets, the bucket b_k will contain the series of log
records:

$$\left[\, l_{\frac{n}{m}(k-1)},\ l_{\frac{n}{m}(k-1)+1},\ \ldots,\ l_{\frac{n}{m}k-2},\ l_{\frac{n}{m}k-1} \,\right]$$
Partitioning the records this way will separate the log archive into m segments, where each segment
contains contiguous log records.
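As a concrete sketch (an illustrative helper under the assumption that n is known in advance, not the exact code used here), the bucket index for a record can be computed directly from its position:

```python
def segmentation_partition(i, n, m):
    """Bucket index (1-based) for record i (0-based) when n records are split
    into m contiguous segments of roughly n/m records each."""
    seg = n / m                       # records per segment
    return min(int(i // seg) + 1, m)  # clamp the final records into bucket m

# Example: with n = 100 and m = 4, records 0-24 map to bucket 1,
# records 25-49 to bucket 2, and so on.
```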
3
When inserting the log record l_i, the bucket b_k is found that results in the smallest value of the
formula:
$$\frac{\sum_{t=1}^{|S_k|} \mathrm{ed}(l_i,\, S_k[t])}{|S_k|}$$
When the value of k is found, the log record li is then inserted into the bucket bk .
If all buckets containing at least one log record had an average edit distance larger than
some threshold and at least one bucket was empty, the log record being inserted is placed
into an unused bucket. Otherwise, the log record is inserted into the bucket with the smallest
average edit distance, even if that value is very large.
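A minimal sketch of this selection rule is shown below, assuming each bucket keeps its sliding window in a `deque(maxlen=s)`; the helper names and the way the threshold is passed in are illustrative assumptions, not the archiver's actual interface.

```python
from collections import deque

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def choose_bucket_ed(record, windows, threshold):
    """windows[k] is a deque(maxlen=s) of the last s records placed in bucket k.
    Return the bucket with the smallest average edit distance to the record,
    falling back to an empty bucket when every used bucket exceeds the threshold."""
    best_k, best_avg = None, float("inf")
    for k, win in enumerate(windows):
        if win:
            avg = sum(edit_distance(record, r) for r in win) / len(win)
            if avg < best_avg:
                best_k, best_avg = k, avg
    if best_avg > threshold:
        for k, win in enumerate(windows):
            if not win:
                best_k = k
                break
    windows[best_k].append(record)
    return best_k

# Example setup: 32 buckets, sliding window depth 10.
windows = [deque(maxlen=10) for _ in range(32)]
```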
For the implementation presented in this paper, the edit distance between the log record being
inserted and the last s log records inserted into each bucket was calculated for each log record inserted.
Calculating the edit distance between two strings is a costly operation: the standard dynamic
programming algorithm runs in O(len(A) · len(B)) time. The real-world data set used contained some log
records over 100,000 characters. Even though most log records did not exceed 200 characters, the existence
of an extremely large log record results in extremely poor performance. The large log record must be inserted
into the archive in order to guarantee lossless compression, but the log records inserted after it are likely
to be of ordinary size. Comparing a short string with a very long string always results in a large edit
distance, because a large number of insertions is always required. Because the new log record is placed in
the bucket with the smallest average edit distance, the bucket whose sliding window contains the large log
record is rarely selected, so the large record is never pushed out of that window and every subsequent record
must be compared against it. This causes significant performance degradation due to the time the edit
distance function takes when given large strings.
The edit distance distribution algorithm was found to be so slow that it was impractical: archiving
the test data with this scheme would take days to complete. Optimizations of the edit distance
calculation could be used, but the edit distance similarity function did not perform as well as the
other similarity functions explored. The edit distance distribution algorithm is included to show
that it was considered as a distribution method, but both its running time and the resulting compression
size show that it is not a good method.
If l_a contains one more appearance of a character than l_b, that appearance increases the
character distance value by 1.
Figure 2: Selection of words from a log record for Word Jaccard Distance
For explanation purposes, we will define a function chardist(la , lb ) which will calculate the character
count arrays for la and lb and return the character count distance between the two log records. When
implemented, the character count arrays should only be calculated once and stored in memory until
they are no longer needed.
We will define S_k to be the sliding window of records for the bucket b_k, with the constraint 0 ≤
|S_k| ≤ s. The contents of S_k are the last s log records inserted into the bucket b_k. If |b_k| ≤ s, then
S_k contains all log records inserted into b_k. The notation S_k[t] denotes the log record at the t-th index
of S_k.
To find the optimal bucket b_k in which to place a new log record l_i, the index k is found that minimizes
the formula:
$$\frac{\sum_{t=1}^{|S_k|} \mathrm{chardist}(l_i,\, S_k[t])}{|S_k|}$$
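As a minimal sketch (assuming the same sliding-window machinery as before; the function name and the use of `collections.Counter` are illustrative choices), the character count distance and the selection rule can be expressed as:

```python
from collections import Counter

def chardist(counts_a, counts_b):
    """Character count distance between two precomputed character-count maps:
    each extra appearance of a character in one record adds 1 to the distance."""
    return sum(abs(counts_a[c] - counts_b[c]) for c in set(counts_a) | set(counts_b))

def choose_bucket_chardist(record, windows):
    """windows[k] holds the character counts of the last s records in bucket k.
    The counts for the new record are computed once and reused, as the text suggests."""
    counts = Counter(record)
    best_k, best_avg = None, float("inf")
    for k, win in enumerate(windows):
        if win:
            avg = sum(chardist(counts, c) for c in win) / len(win)
            if avg < best_avg:
                best_k, best_avg = k, avg
    return best_k, counts
```

The fall-back to an unused bucket shown in the edit distance sketch can be applied here in the same way.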
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The Jaccard index identifies the similarity between two sets. Its value is a rational number in the
range [0, 1], where values near 0 indicate the two sets are almost completely different and values near
1 indicate the two sets are very similar. The Jaccard distance, which identifies dissimilarity between
two sets, is defined as 1 − J(A, B).
When a log record l_i is going to be inserted into the log archive, a set of the words found in the log
record is created. We will use the symbol W_i to denote the set of words found in the log record l_i. Here,
we define a word to be a contiguous series of alphanumeric characters delimited by non-alphanumeric
characters.
Figure 2 shows an example of the words that would be selected from a log record from the Red
Storm HPC system log. Words may contain only letters, only numbers, or a mixture of letters and numbers.
When a log record is analyzed to create the word set, no attempt is made to determine the meaning of
the words. In the example, the time 22:41:29 is separated into three words. Similar log records appearing
in the future will contain a different value in one or more of the time fields.
No context should be inferred from fields such as this, because no standard format exists for
writing a time to a log record. The partition function remains more generic by simply collecting
all alphanumeric words in a log record.
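Under this definition, the word set can be extracted with a single regular expression; the sketch below is an illustrative tokenizer, not necessarily the exact one used by the archiver.

```python
import re

def word_set(record):
    """Set of maximal alphanumeric runs in a log record; for example,
    the timestamp 22:41:29 yields the three words '22', '41', and '29'."""
    return set(re.findall(r"[A-Za-z0-9]+", record))
```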
We will define S_k to be a sliding window of the last s records inserted into the bucket b_k. The
number of elements in S_k has the constraint 0 ≤ |S_k| ≤ s. The value of s is a parameter that can
be provided to the archiver. If S_k contains fewer than s sets, it contains the word sets of all log
records which have been inserted into b_k.
When inserting the log record l_i into the log archive with its associated set of words W_i, the bucket
b_k is found that results in the smallest value of the formula
$$\frac{\sum_{t=1}^{|S_k|} \bigl(1 - J(W_i,\, S_k[t])\bigr)}{|S_k|}$$
When the bucket bk is found, the log record li is inserted into the bucket bk and Wi is inserted into
Sk , removing the oldest set in Sk if needed.
If all buckets containing at least one log record had an average Jaccard distance larger than some
threshold and at least one bucket was empty, the log record being inserted is placed into the unused
bucket. If no empty bucket exists, the log record is inserted into the bucket with the smallest average
Jaccard distance, even if that distance is extremely large.
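The following sketch combines the word-set extraction above with this selection rule; as before, each bucket's window is assumed to be a `deque(maxlen=s)` of word sets, and the function names and threshold handling are illustrative assumptions rather than the archiver's actual interface.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets (taken as 0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def choose_bucket_jaccard(words, windows, threshold):
    """windows[k] is a deque(maxlen=s) of the word sets W of the last s records
    in bucket k. Pick the bucket with the smallest average Jaccard distance,
    falling back to an empty bucket when every used bucket exceeds the threshold."""
    best_k, best_avg = None, float("inf")
    for k, win in enumerate(windows):
        if win:
            avg = sum(1.0 - jaccard(words, w) for w in win) / len(win)
            if avg < best_avg:
                best_k, best_avg = k, avg
    if best_avg > threshold:
        for k, win in enumerate(windows):
            if not win:
                best_k = k
                break
    windows[best_k].append(words)
    return best_k
```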
Source Uncompressed (MB) bzip2 compressed (MB) gzip compressed (MB)
Blue Gene/L 509 35 45
Thunderbird 24,172 904 1511
Red Storm 32,596 664 1080
Spirit (ICC2) 28,583 389 666
Liberty 21,768 276 429
Web Data 7501 296 533
Table 1: Statistics for the size of the log data, both before and after compression.
Source log record count median length min length max length
Blue Gene/L 4,747,963 89 62 895
Thunderbird 211,212,192 89 22 1045
Red Storm 219,096,168 99 31 173,254
Spirit (ICC2) 272,298,969 100 13 994
Liberty 265,569,231 89 23 1037
Web Data 26,568,851 277 67 4187
Table 2: Statistics for the log record contained in each data source.
A KMV synopsis [1] makes it possible to efficiently estimate the Jaccard distance
between two log records if a KMV synopsis has already been created for the log records.
This partitioning method uses the same process as that in section 3.6, except that rather than building an
exact set of q-grams, a KMV synopsis is built and the value of the Jaccard distance is approximated.
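The surviving text does not show the synopsis construction itself, so the sketch below follows the common KMV approach of Beyer et al. [1]: keep the k smallest hash values of each q-gram set, and estimate the Jaccard index from the k smallest hashes of the combined synopses. The hash function and helper names are assumptions made for illustration.

```python
import hashlib

def kmv_synopsis(items, k=60):
    """Keep the k smallest 64-bit hash values of the items (e.g. a record's q-grams)."""
    hashes = {int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(), "big")
              for x in items}
    return sorted(hashes)[:k]

def estimate_jaccard(syn_a, syn_b, k=60):
    """Estimate J(A, B) from two KMV synopses: the k smallest hashes of the combined
    synopses form a synopsis of A ∪ B; the fraction of those hashes present in both
    synopses estimates |A ∩ B| / |A ∪ B|."""
    union_k = sorted(set(syn_a) | set(syn_b))[:k]
    if not union_k:
        return 0.0
    both = set(syn_a) & set(syn_b)
    return sum(1 for h in union_k if h in both) / len(union_k)
```

The q-grams of a record can be produced, for example, as `{record[i:i+q] for i in range(len(record) - q + 1)}`.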
4 Experiments
System logs from five supercomputers and one Apache web server log file were used to compare the
compression ratio of compressing the entire log with bzip2 or gzip, without partitioning,
against partitioning with the various algorithms described in section 3.
Four of the supercomputer system logs come from computers located at Sandia National Laborato-
ries: Thunderbird, Red Storm, Spirit (ICC2), and Liberty. The fifth supercomputer system log is from
the Blue Gene/L computer located at Lawrence Livermore National Laboratory [6]. All supercomputer
system logs have been scrubbed of personal data. Otherwise, the log data used in the experiments is
the raw system log. When these system logs are referenced, they are referenced by the name of the
computer from which they originated.
The Apache web server data came from the MesoWest project [5]. When this data is referenced, it
will be called Web Data.
As we can see from table 1, using bzip2 or gzip to compress the log data is extremely effective.
bzip2 is much more effective than gzip at compressing the log data used in this experiment, but
the bzip2 compression algorithm takes significantly more time than compressing with gzip. This
is one of the primary reasons why gzip, rather than bzip2, is often used to compress large log
files [3].
The archiving system developed allowed for many command line arguments to set different variables
for a given run. These arguments include the partitioning algorithm used; arguments for the parti-
tioning algorithm, such as maximum number of buckets; and arguments for the compression algorithm
used internally, such as the block size for the compressor for each bucket.
All command line arguments were fixed except the variable being tested. Unless otherwise noted
the following are the default values used in the experiment. The maximum number of buckets m = 32.
The sliding window depth s = 10. The q-gram size q = 6. The KMV synopsis size k = 60. If a
partitioning algorithm does not need one of these values, the value is ignored. The
partitioning algorithm is always explicitly listed for each experiment.
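For reference, these defaults can be summarized as a small configuration mapping; the key names below are illustrative, since the archiver actually receives these values as command line arguments.

```python
# Default experiment parameters described above (key names are illustrative).
DEFAULTS = {
    "num_buckets": 32,    # m, maximum number of buckets
    "window_depth": 10,   # s, sliding window depth
    "qgram_size": 6,      # q, q-gram size
    "kmv_size": 60,       # k, KMV synopsis size
}
```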
The y-axis scale in the figures shows the ratio of the compressed archive size obtained with partitioning
to the compressed size obtained without partitioning; values below 1 indicate that partitioning improved
compression.
5 Future Work
During the development of this log archiving system, several shortcomings of the proposed system became
apparent. Since some require changes to the method proposed here, they were not added to the
current system. After these concerns are addressed, the log archiving system will be much more usable
to administrators interested in long-term log archives and will also achieve better compression ratios
than the results reported in this paper.
First, the number of buckets is fixed. In the current method, an unused bucket is used when the
log record being inserted is not similar to any of the currently used buckets. After all buckets are in use,
if a log record which is very different from the contents of every used bucket is inserted
into the archive, it will be placed in the bucket which most closely matches. However, it is easy to
conceive of cases where doing this decreases the homogeneous nature of the buckets.
Also, this limitation on how buckets are initially allocated can have significant repercussions. If the
first log records inserted into the archive are very different from the log records to be inserted later,
the buckets will be lopsided. Or, if one very unusual log record is the first inserted into a bucket it
could be possible for that bucket to never be used again because the partition function will never
select that bucket. This negatively impacts the compression ratio.
[Figure residue: compression-ratio plots comparing the partition functions (Round Robin, Segmentation, Edit Distance, Character Distance, Word Jaccard, Jaccard approximation, Jaccard q-gram) against no partitioning, for the bzip2 and gzip compressors.]
Figure 5: Vary number of buckets on web data using Word Jaccard Distance partitioning. bzip2 compressor.
Figure 6: Vary number of buckets on web data using Word Jaccard Distance partitioning. gzip compressor.
Figure 7: Vary sliding window depth on web data using Word Jaccard Distance partitioning. bzip2 compressor.
Figure 8: Vary sliding window depth on web data using Word Jaccard Distance partitioning. gzip compressor.
Figure 11: Vary KMV synopsis size on web data using Estimated Q-gram Jaccard Distance partitioning. bzip2 compressor.
Figure 12: Vary KMV synopsis size on web data using Estimated Q-gram Jaccard Distance partitioning. gzip compressor.
Figure 13: Vary data using Word Jaccard Distance partitioning. bzip2 compressor.
Figure 14: Vary data using Word Jaccard Distance partitioning. gzip compressor.
By observing trends in the contents of the buckets dynamically, it is possible to react to unusual
log records by creating and closing archive buckets when needed, so that the log records contained in a
single bucket remain very similar. This would also decrease the negative impact of the first log records
inserted into the archive.
Second, in all partition functions described in this paper which use a sliding window, the depth
of the sliding window s is fixed. The value of s does impact the effectiveness of the archive, so larger
values of s are desired. However, if s is larger the processing time to select the appropriate bucket
increases.
If a bucket is extremely homogeneous, the value of s does not need to be as large as a bucket which
is not as homogeneous. By observing how homogeneous a bucket is, the value of s can be adjusted
dynamically to improve speed and improve the compression ratio of the log archive.
Some compression methods, such as bzip2 and gzip, compress data in blocks [4, 7]. The contents
of one compression block should have no impact on a future block. Keeping a sliding window which
straddles two compression blocks will therefore not improve the compression ratio, but it could negatively
impact the compression ratio by restricting the contents of a bucket.
Third, if the algorithm were more closely tied to a specific compression algorithm, additional opti-
mizations could be applied. As already explained, by considering data in blocks rather than one
continuous bucket of many log records, the compression algorithm used would be better utilized.
Fourth, since log records are placed in buckets based on the content of the log record, using
a column-based compression algorithm could significantly improve how efficiently the log archive can
be compressed [2, 9]. The method used to distribute log records into buckets naturally inspires the use
of such compression techniques rather than generic compressors such as bzip2 and gzip.
Fifth, in order for the log archive to be reviewed, it must be completely decompressed. Applying an
index to the log archive could provide the ability to decompress specific sections. This would improve
the usability of this method as an archiving tool because traditional log archiving methods do not
support selective decompression of log archives [3].
Each of the future work topics discussed here is currently being studied and will be implemented
in continued research.
6 Conclusion
In this paper we have implemented a preprocessing step for log archiving to be used in conjunction with
traditional generic compression methods, such as bzip2 and gzip. The preprocessing step separates log
records using a partition function into homogeneous buckets. By improving the homogeneous nature
of the log data, traditional compression methods are able to compress the log data more effectively.
Several partition functions were implemented using a variety of distance functions to detect
similarity between log records. These functions include edit distance, character count distance, word
Jaccard distance, q-gram Jaccard distance, and q-gram Jaccard distance estimation. These functions
were compared to each other and to the traditional method of compressing the entire log using bzip2 or gzip.
By using a partition function to improve the homogeneous nature of log data, log archives were
compressed more effectively than compressing the log file as a whole.
References
[1] Kevin Beyer, Peter J Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. On synopses
for distinct-value estimation under multiset operations. In Proceedings of the 2007 ACM SIGMOD
international conference on Management of data, pages 199–210. ACM, 2007.
[2] Francesco Fusco, Marc Ph. Stoecklin, and Michail Vlachos. Net-fli: On-the-fly compression, archiving
and indexing of streaming network traffic. VLDB, 3(2), 2010.
[3] Francesco Fusco, Michail Vlachos, and Xenofontas Dimitropoulos. Rasterzip: compressing network
monitoring data with support for partial decompression. In Proceedings of the 2012 ACM conference
on Internet measurement conference, pages 51–64. ACM, 2012.
[4] Jean-Loup Gailly and Mark Adler. The gzip home page. URL: http://www.gzip.org/ (accessed January 3,
2013), 1999.
[6] Adam Oliner and Jon Stearley. What supercomputers say: A study of five system logs. In Depend-
able Systems and Networks, 2007. DSN’07. 37th Annual IEEE/IFIP International Conference on,
pages 575–584. IEEE, 2007.
[7] Julian Seward. bzip2 and libbzip2. Available at http://www.bzip.org, 1996.
[8] Jon Stearley. Towards informatic analysis of syslogs. In Cluster Computing, 2004 IEEE Interna-
tional Conference on, pages 309–318. IEEE, 2004.
[9] Binh Dao Vo and Gurmeet Singh Manku. Radixzip: Linear time compression of token streams. In
Proceedings of the 33rd international conference on Very large data bases, pages 1162–1172. VLDB
Endowment, 2007.