DGIM Algorithm Theory Explanation
Suppose we have a window of length N on a binary stream. We want at all times to be able
to answer queries of the form “how many 1’s are there in the last k bits?” for any k ≤ N. For
this purpose we use the DGIM algorithm.
The basic version of the algorithm uses O(log² N) bits to represent a window of N bits, and allows us to estimate the number of 1’s in the window with an error of no more than 50%.
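For contrast, here is a sketch of the obvious exact solution: it answers every query precisely but must store the entire window of N bits, which is exactly the state that DGIM’s O(log² N)-bit summary avoids. The class and method names below are illustrative, not part of the algorithm.

from collections import deque

# Naive baseline: keep all N bits of the window and count exactly.
# For N = 2**30 (about a billion bits) this needs a billion bits of state,
# whereas O(log^2 N) is on the order of 30**2 = 900 bits, at the cost of an
# estimate that may be off by up to 50%.
class ExactWindowCounter:
    def __init__(self, window_size):
        self.N = window_size
        self.window = deque(maxlen=window_size)   # keeps only the most recent N bits

    def add_bit(self, bit):
        self.window.append(bit)

    def count_ones(self, k):
        """Exact number of 1's among the last k bits, for 1 <= k <= N."""
        assert 1 <= k <= self.N
        return sum(list(self.window)[-k:])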
To begin, each bit of the stream has a timestamp, the position in which it arrives. The first
bit has timestamp 1, the second has timestamp 2, and so on.
Since we only need to distinguish positions within the window of length N, we shall
represent timestamps modulo N, so they can be represented by log₂ N bits. If we also store
the total number of bits ever seen in the stream (i.e., the most recent timestamp) modulo N,
then we can determine from a timestamp modulo N where in the current window the bit with
that timestamp is.
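To make this modular bookkeeping concrete, the small sketch below (the function name and the numbers in the example are mine, for illustration) converts a timestamp stored modulo N back into a position within the current window, given the most recent timestamp modulo N.

def bits_back(current_mod_n, stamp_mod_n, n):
    """How many positions before the current bit is the bit with this timestamp?

    Both timestamps are stored modulo n; the modular subtraction is what lets
    us keep only log2(n) bits per timestamp yet still locate the bit in the window.
    """
    return (current_mod_n - stamp_mod_n) % n

# Example: with N = 8, the current timestamp 13 is stored as 13 % 8 = 5, and a
# bit with timestamp 10 is stored as 10 % 8 = 2.  It lies (5 - 2) % 8 = 3
# positions back, i.e. still inside the window of the last 8 bits.
print(bits_back(5, 2, 8))   # prints 3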
We divide the window into buckets, each consisting of:
1. The timestamp of its right (most recent) end.
2. The number of 1’s in the bucket. This number must be a power of 2, and we refer to
the number of 1’s as the size of the bucket.
To represent a bucket, we need log₂ N bits to represent the timestamp (modulo N) of its right end. To represent the number of 1’s we only need log₂ log₂ N bits. The reason is that we know this number i is a power of 2, say 2^j, so we can represent i by coding j in binary. Since j is at most log₂ N, it requires log₂ log₂ N bits. Thus, O(log N) bits suffice to represent a bucket.
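As a concrete illustration of this packing, the sketch below stores a bucket as its right-end timestamp modulo N together with the exponent j of its size 2^j; the names and layout are illustrative assumptions rather than a prescribed representation.

import math
from dataclasses import dataclass

@dataclass
class Bucket:
    right_end_mod_n: int   # timestamp of the bucket's right (most recent) end, modulo N
    size_exponent: int     # j, where the bucket contains exactly 2**j ones

    @property
    def size(self):
        return 2 ** self.size_exponent

def bits_per_bucket(n):
    """Rough bucket cost: log2(N) bits for the timestamp plus log2(log2(N)) bits for j."""
    log_n = math.ceil(math.log2(n))
    return log_n + math.ceil(math.log2(log_n))

# For N = 2**20 (a window of roughly a million bits), a bucket fits in about
# 20 + 5 = 25 bits, which is O(log N) as claimed.
print(bits_per_bucket(2 ** 20))   # prints 25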
There are six rules that must be followed when representing a stream by buckets.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.