Test2 1819
Question 1
Discuss why the MapReduce model (and its Hadoop implementation) is not appropriate for
real-time stream processing.
Question 2
When executing computations over a stream of events, it is possible to define windows based
either on the timestamps of the events or on the time at which each event arrives at the event
processing system.
a) Discuss the implications of adopting each approach for the result of the computations
(give examples where appropriate).
b) Discuss the implications of each approach for the event processing system.
Question 3
The Spark system has two APIs for expressing computations: the Base API and Dataframes/SQL.
Compare both APIs, presenting the advantages of each.
Question 4
The Apache Flink operation env.setBufferTimeout(timeoutMillis) is used to force transmission
after some time. Explain why this is necessary in Apache Flink (discussing how events are
processed).
Question 5
Apache Kafka topics can be broken up into partitions.
a) Explain what partitions are in Kafka and why this is an interesting mechanism.
b) Discuss what guarantees are provided when consuming events from a topic that has
multiple partitions.
Question 6
Time-series databases include sophisticated mechanisms for compressing information.
a) Explain why data compression is a key feature in these systems.
b) What makes compression particularly efficient in time-series databases?
Question 7
In sensor networks (and in IoT-based sensing systems), nodes are often organized in a tree,
where a sensor node only communicates with its parent and children. Briefly present two
mechanisms used by these systems to save energy when processing a stream of events from
sensors.
DI/FCT/UNL
Mestrado Integrado em Engenharia Informática
Processamento de Streams
2nd Semester, 2018/2019
Consider a stream of events from sensors with the following format (the type of each value is
presented in parenthesis):
timestamp (date), coord x (double), coord y (double), sensor id (long), sensor type (int),
value (double)
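For reference, the record format above can be sketched as a typed structure. This is a minimal illustration, not part of the required answers: the underscored field names, the comma-separated text encoding, and the ISO-8601 timestamp are all assumptions, since the statement only lists the fields and their types.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SensorEvent:
    """One sensor reading; field names are assumed from the format above."""
    timestamp: datetime
    coord_x: float
    coord_y: float
    sensor_id: int
    sensor_type: int
    value: float


def parse_event(line: str) -> SensorEvent:
    """Parse one record, assuming comma-separated fields in the listed order."""
    ts, x, y, sid, stype, value = (field.strip() for field in line.split(","))
    return SensorEvent(
        timestamp=datetime.fromisoformat(ts),  # assumed ISO-8601 timestamp
        coord_x=float(x),
        coord_y=float(y),
        sensor_id=int(sid),
        sensor_type=int(stype),
        value=float(value),
    )
```

A record such as `2019-06-01T12:00:00, 1.5, 2.5, 42, 47, 103.2` would parse into an event with `sensor_type` 47 under these assumptions.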
Question 8
For your favorite event processing system, write a program that reads the above stream of
events from Kafka topic “IoT”, and continuously outputs for each area the average value for
each sensor type.
Question 9
For your favorite event processing system, write a program that reads the above stream of
events from Kafka topic “IoT”, and outputs alarms when the readings from sensors with
sensor type = 47 are larger than ten times the average value over the last 10 days. For
each area, you should output a single alarm every 5 seconds.
NOTE: if you cannot answer this question, write a program that outputs alarms when the
readings from sensors with sensor type = 47 are larger than 100 – for each area, you should
output a single alarm every 5 seconds if at least two sensors read values larger than 100.
Question 10
Present the pseudo-code that should run in each node of a TinyDB deployment for executing
the computation of question 9. Express the code in function exec(local_evt, rcv_evts): out_evts,
where local_evt is the value read from the local sensor, expressed using the format presented
previously; rcv_evts is a list of messages received from children nodes; and out_evts is a list of
messages to send to the parent node.
Question 11
Some time-series databases use LSM-trees to store events. Databases that use LSM-trees often
keep in memory Bloom filters that summarize the keys that are present in a given tree – this
allows efficient access to the data.
Discuss whether this approach is efficient for executing queries over a time interval. If so,
present, in pseudo-code, the algorithm used for executing a query. If not, propose which
additional information should be maintained and present, in pseudo-code, the algorithm used for
executing a query.
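For reference, the Bloom filter mentioned in the question stem can be sketched as below. This is a minimal illustration of the membership test only, under assumed parameters: the class name, the bit-array size, the number of hash functions, and the use of salted SHA-256 are illustrative choices, not those of any particular database.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(key))
```

A lookup would consult `might_contain(key)` for each tree and read from disk only those trees for which it returns true.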