Data Analytics Unit 3
Internet and Web Traffic – A switching node in the middle of the internet receives
streams of IP packets from many inputs and routes them to its outputs. Websites
receive streams of heterogeneous types. For example, Google receives a hundred
million search queries per day.
The first generation of message brokers, such as RabbitMQ and Apache ActiveMQ,
relied on the Message Oriented Middleware (MOM) paradigm. Later, high-performance
messaging platforms (often called stream processors) emerged that are better suited
to a streaming paradigm.
Popular stream processing tools:
Apache Kafka
RabbitMQ
Organizations typically store their streaming event data in cloud object stores, which
serve as an operational data lake, due to the sheer volume and multi-structured nature
of event streams. Object stores offer a low-cost, long-term solution for storing large
amounts of event data. They are also a flexible integration point, allowing tools from
outside your streaming ecosystem to access the data.
Examples:
Amazon S3
Microsoft Azure Data Lake Storage
Google Cloud Storage
With data processed and stored in a data warehouse/data lake, you will now need
data analytics tools.
1. Sliding Window:- This is the simplest and most straightforward method. A first-in,
first-out (FIFO) queue of size n with a skip/sub-sampling factor k ≥ 1 is maintained.
In addition, a stride factor s ≥ 1 describes by how many time steps the window is
shifted before it is analyzed (a sketch follows the list below).
Advantages:
simple to implement
deterministic; the reservoir can be filled very quickly from the beginning
Drawbacks:
the time history represented by the reservoir R is short; long-term concept drifts
cannot be detected easily
outliers can create noisy analyses
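A minimal Python sketch of such a window (the function and parameter names are
illustrative, not from any specific library):

    from collections import deque

    def sliding_window(stream, n, k=1, s=1, analyze=print):
        """Maintain a FIFO window of size n over `stream`,
        keeping every k-th point and analyzing every s steps."""
        window = deque(maxlen=n)          # old points fall out automatically
        for t, x in enumerate(stream):
            if t % k != 0:                # skip/sub-sampling factor k
                continue
            window.append(x)
            if len(window) == n and (t // k) % s == 0:
                analyze(list(window))     # stride factor s between analyses

    sliding_window(range(100), n=10, k=2, s=5)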
2. Reservoir Sampling:- In classic reservoir sampling, a fixed-size reservoir R of B
data points is maintained such that every point of the stream seen so far has the same
probability of being in the reservoir.
Advantages:
The reservoir contains data points from the whole history of the stream with equal
probability.
Very simple implementation; adding a point requires only O(1) operations.
Drawbacks:
A concept drift cannot be compensated for; the oldest data point x(1) is equally
important in this sampling technique as the latest data point x(t).
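A minimal sketch of this classic, unbiased scheme (often called Algorithm R),
assuming a reservoir of B slots: the first B points fill the reservoir, after which
the t-th point replaces a randomly chosen slot with probability B/t:

    import random

    def reservoir_sample(stream, B):
        """Keep a uniform random sample of B points from a stream."""
        R = []
        for t, x in enumerate(stream, start=1):
            if len(R) < B:
                R.append(x)               # fill the reservoir first
            elif random.random() < B / t: # keep x with probability B/t
                R[random.randrange(B)] = x
        return R

    print(reservoir_sample(range(10_000), B=5))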
3. Biased Reservoir Sampling:- In biased reservoir sampling (Alg. 3.1, [2]), the
probability of a data point x(t) being in the reservoir is a decreasing function of its
lingering time within R. So the probability of finding points of the recent history in R
is high, while very old data points will be in R only with very low probability.
[Figure: Illustration of biased reservoir sampling over reservoir slots R1…RB.]
Advantages:
recent data points are over-represented, so the sample adapts to concept drift
Drawbacks:
very old behavior of the stream is almost completely forgotten
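One simple way to realize such an exponential bias, sketched here in the spirit of
Alg. 3.1 (the reservoir capacity B plays the role of the inverse bias rate, an
assumption of this sketch): each arriving point enters the reservoir; with probability
equal to the current fill fraction it overwrites a random existing point, otherwise
the reservoir grows.

    import random

    def biased_reservoir_sample(stream, B):
        """Biased reservoir of capacity B: recent points are kept with
        higher probability; old points decay out of the sample."""
        R = []
        for x in stream:
            q = len(R) / B                # fraction of the reservoir in use
            if random.random() < q:
                R[random.randrange(len(R))] = x   # overwrite an old point
            else:
                R.append(x)               # reservoir still growing
        return R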
4. Histograms
A histogram is maintained while observing the data stream. To do this, data points are
sorted into intervals/buckets [l_i, u_i].
If the useful range of the observed values is known in advance, a simple vector with
counts and breakpoints can do the job (a sketch follows below).
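A minimal sketch of that idea, assuming equal-width buckets over a known range
[lo, hi] (all names are illustrative):

    def make_histogram(lo, hi, n_buckets):
        """Count vector over n_buckets equal-width intervals of [lo, hi];
        assumes every observed x satisfies lo <= x <= hi."""
        width = (hi - lo) / n_buckets
        counts = [0] * n_buckets
        def add(x):
            i = min(int((x - lo) / width), n_buckets - 1)  # clamp top edge
            counts[i] += 1
        return counts, add

    counts, add = make_histogram(lo=0.0, hi=100.0, n_buckets=10)
    for x in [3.2, 47.9, 99.9, 100.0]:
        add(x)
    print(counts)   # one count per bucket [l_i, u_i]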
A V-optimal histogram tries to minimize the variance within each histogram bucket.
[4] proposes an algorithm for efficiently maintaining an approximate V-optimal
histogram from a data stream. This is relevant for interval data, such as a time series
of temperature values, i.e., data where the absolute value and the distance between
values have a meaning.
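The streaming algorithm of [4] is more involved; as a point of reference, the exact
(offline) V-optimal cost of partitioning sorted values into B contiguous buckets can
be computed with a standard dynamic program, sketched here:

    def sse(prefix, prefix_sq, i, j):
        """Sum of squared errors of values v_i..v_j (1-indexed prefixes)."""
        n = j - i + 1
        s = prefix[j] - prefix[i - 1]
        s2 = prefix_sq[j] - prefix_sq[i - 1]
        return s2 - s * s / n

    def v_optimal(values, B):
        """Minimum total within-bucket variance over all partitions of
        `values` into B contiguous buckets (O(n^2 * B) dynamic program)."""
        n = len(values)
        prefix = [0.0] * (n + 1)
        prefix_sq = [0.0] * (n + 1)
        for i, v in enumerate(values, 1):
            prefix[i] = prefix[i - 1] + v
            prefix_sq[i] = prefix_sq[i - 1] + v * v
        INF = float("inf")
        E = [[INF] * (B + 1) for _ in range(n + 1)]
        E[0][0] = 0.0
        for i in range(1, n + 1):
            for b in range(1, min(B, i) + 1):
                for j in range(b - 1, i):
                    E[i][b] = min(E[i][b],
                                  E[j][b - 1] + sse(prefix, prefix_sq, j + 1, i))
        return E[n][B]

    print(v_optimal([1, 1, 1, 9, 9, 9], B=2))  # 0.0: each bucket has zero variance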
An obvious approach, if the number of elements is not very large, is to maintain a set:
when a new element arrives, we check whether the set already contains it, and if not,
we add it. The size of the set then gives the number of distinct elements. However, if
the number of elements is vast, or we are maintaining counts for multiple streams, it
would be infeasible to keep the set in memory. The algorithms for counting distinct
elements in limited memory are therefore approximate, with an error threshold that can
be tweaked by changing the algorithms' parameters.
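For reference, a sketch of the exact, set-based approach just described:

    def count_distinct_exact(stream):
        seen = set()
        for x in stream:
            seen.add(x)     # memory grows with the number of distinct elements
        return len(seen)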
The first algorithm for counting distinct elements is the Flajolet-Martin algorithm,
named after the algorithm's creators. The Flajolet-Martin algorithm is a single-pass
algorithm: if there are m distinct elements in a universe comprising n elements, the
algorithm runs in O(n) time and O(log m) space. The following steps define the
algorithm.
First, we pick a hash function h that takes stream elements as input and outputs
a bit string. The bit strings must be long enough that the range of the hash
function is much larger than the size of the universe; we require at least log n
bits if there are n elements in the universe.
r(a) is used to denote the number of trailing zeros in the binary representation
of h(a) for an element a in the stream.
For each element x in the stream:
1. y = hash(x)
2. r = get_rightmost_set_bit(y)
3. set_bit(B, r)
Once the stream has been consumed:
4. R = get_rightmost_unset_bit(B)
5. return 2 ^ R
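A runnable sketch of these steps, assuming the bitmap B is kept as a Python integer;
the hash function (md5 here) and the helper names are illustrative choices, and a
practical implementation would combine estimates from many independent hash functions
to reduce variance:

    import hashlib

    def trailing_zeros(y):
        """r(a): number of trailing zeros in the binary representation of y."""
        if y == 0:
            return 0
        r = 0
        while y & 1 == 0:
            y >>= 1
            r += 1
        return r

    def flajolet_martin(stream):
        B = 0                                    # bitmap, kept as an int
        for x in stream:
            y = int(hashlib.md5(str(x).encode()).hexdigest(), 16)
            B |= 1 << trailing_zeros(y)          # set_bit(B, r)
        R = 0
        while B & (1 << R):                      # rightmost unset bit of B
            R += 1
        return 2 ** R

    print(flajolet_martin([1, 2, 3, 2, 1, 4, 5]))   # estimate of 5 distinct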
Estimating Moments
The k-th frequency moment of a stream is the sum, over all distinct elements, of the
k-th power of each element's occurrence count. The 0th moment is the number of
distinct elements, the 1st moment is the length of the stream, and the 2nd moment
(the "surprise number") measures how unevenly the elements are distributed.
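For small data the moments can be computed exactly in memory (streaming algorithms
approximate this in limited space); a baseline sketch:

    from collections import Counter

    def kth_moment(stream, k):
        """Exact k-th frequency moment: sum of (count of each element)^k."""
        counts = Counter(stream)
        return sum(m ** k for m in counts.values())

    s = [1, 1, 2, 3, 1, 2]
    print(kth_moment(s, 0), kth_moment(s, 1), kth_moment(s, 2))  # 3, 6, 14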
TYPES OF QUERIES
Ad-hoc query - You ask a query and there is an immediate response. E.g.: What is the
maximum value seen so far in the stream S?
Standing query - You ask the system a query of the form "Any time you have an answer
to this query, send me the response"; here you don't get the answer immediately.
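The ad-hoc example above needs only a single value of state; a tiny sketch:

    def running_max(stream):
        best = None
        for x in stream:
            if best is None or x > best:
                best = x
            yield best        # the answer is available at any time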
Now suppose we have a window of length N (say N = 24) on a binary stream. We want,
at all times, to be able to answer queries of the form "How many 1's are there in the
last K bits?" for any K <= N.
Here the DGIM algorithm comes into the picture. It is designed to estimate the number
of 1's in such a window. The algorithm uses O(log² N) bits to represent a window of N
bits and allows us to estimate the number of 1's in the window with an error of no
more than 50%, i.e., the estimate is within 50% of the true count.
In the DGIM algorithm, each bit that arrives has a timestamp for the position at which
it arrives: the first bit has timestamp 1, the second bit has timestamp 2, and so on.
Positions are represented modulo the window size N (window sizes are usually taken as
a power of 2). The window is divided into buckets consisting of 1's and 0's.
1. The right side of a bucket should always start with a 1 (if it starts with a 0, that
0 is neglected). E.g., 1001011 → a bucket of size 4, having four 1's and starting
with a 1 at its right end.
2. Every bucket should contain at least one 1, else no bucket can be formed.
Let us take an example to understand the algorithm: estimating the number of 1's and
counting the buckets in a given data stream. The picture shows how the buckets are
formed, based on the number of ones, by following these rules.
In the given data stream, assume new bits arrive from the right. When the new bit = 0:
after the new bit (0) arrives with timestamp 101, there is no change in the buckets.
But if the new bit that arrives is 1, then we need to make changes:
Create a new bucket with the current timestamp and size 1.
If there was only one bucket of size 1, nothing more needs to be done. However, if
there are now three buckets of size 1 (the buckets with timestamps 100, 102, and 103
in the second step of the picture), we fix the problem by combining the leftmost
(earliest) two buckets of size 1 (purple box).
To combine any two adjacent buckets of the same size, replace them by one bucket of
twice the size. The timestamp of the new bucket is the timestamp of the rightmost of
the two buckets.
Now, sometimes combining two buckets of size 1 may create a third bucket of size 2.
If so, we combine the leftmost two buckets of size 2 into a bucket of size 4. This
process may ripple through the bucket sizes.
A bucket is retained as long as (current timestamp - leftmost bucket's timestamp) < N
(= 24 here). E.g., 103 - 87 = 16 < 24, so the leftmost bucket is kept; once the
difference is greater than or equal to N, the oldest bucket is dropped.
Finally, the answer to the query "How many 1's are there in the last 20 bits?": counting
the sizes of the buckets in the last 20 bits, we estimate that there are 11 ones.
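A compact Python sketch of the bucket maintenance and query logic described above.
Buckets are stored newest-first as (timestamp, size) pairs; timestamps are kept
absolute here for clarity, whereas DGIM proper stores them modulo N:

    class DGIM:
        """Approximate count of 1's in the last N bits of a binary stream."""
        def __init__(self, N):
            self.N = N
            self.t = 0
            self.buckets = []   # list of (timestamp, size), newest first

        def add(self, bit):
            self.t += 1
            # drop buckets that have fallen out of the window
            self.buckets = [(ts, sz) for ts, sz in self.buckets
                            if self.t - ts < self.N]
            if bit == 0:
                return          # a 0 changes nothing
            self.buckets.insert(0, (self.t, 1))
            # merge: at most two buckets of any size are allowed
            i = 0
            while i + 2 < len(self.buckets):
                if self.buckets[i][1] == self.buckets[i + 2][1]:
                    # combine the two oldest of the three equal-sized buckets,
                    # keeping the timestamp of the more recent (rightmost) one
                    ts = self.buckets[i + 1][0]
                    sz = self.buckets[i + 1][1] * 2
                    self.buckets[i + 1:i + 3] = [(ts, sz)]
                else:
                    i += 1

        def count_ones(self, k):
            """Estimate of the number of 1's among the last k bits (k <= N)."""
            total, last = 0, 0
            for ts, sz in self.buckets:
                if self.t - ts < k:
                    total += sz
                    last = sz
            return total - last // 2   # only half of the oldest bucket counts

    d = DGIM(N=24)
    for b in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
        d.add(b)
    print(d.count_ones(8))   # estimate of the 1's among the last 8 bits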
Applications of Sentiment Analysis
2. Result prediction: By analyzing sentiment from related sources, one can predict
the probable outcome of a particular event.
4. Product and service review: The most common application of sentiment analysis
is in the area of reviews of customer products and services.
Healthcare
The level of data generated within healthcare systems is not trivial. Traditionally,
the healthcare industry lagged in using Big Data because of its limited ability to
standardize and consolidate data. Researchers are mining the data to see which
treatments are more effective for particular conditions, to identify patterns related
to drug side effects, and to gain other important information that can help patients
and reduce costs.
With the added adoption of mHealth, eHealth and wearable technologies, the volume of
data is increasing at an exponential rate. This includes electronic health record
data, imaging data, patient-generated data, sensor data, and other forms of data. By
mapping healthcare data with geographical data sets, it is possible to predict
diseases that will escalate in specific areas. Based on these predictions, it is
easier to strategize diagnostics and plan for stocking serums and vaccines.
Media and Entertainment
Various companies in the media and entertainment industry are facing new business
models for the way they create, market and distribute their content. This is happening
because today's consumers search for and require access to content anywhere, any time,
on any device.
Government
The use and adoption of Big Data within governmental processes allows efficiencies in
terms of cost, productivity, and innovation. In government use cases, the same data
sets are often applied across multiple applications, which requires multiple
departments to work in collaboration.