Test2 1819

This document contains questions for a second exam on stream processing. The questions cover topics like: why MapReduce is not suitable for real-time stream processing; time-based windows in stream processing; Spark APIs; buffering in Flink; Kafka partitions; time-series data compression; energy saving in sensor networks; and programming exercises to analyze a stream of sensor data using different stream processing systems and queries.

DI/FCT/UNL

Mestrado Integrado em Engenharia Informática


Processamento de Streams
2nd Semester, 2018/2019

Second Test (14/June/2019)


Part 1 – closed book

Question 1
Discuss why the MapReduce model (and its Hadoop implementation) is not appropriate for
real-time stream processing.

Question 2
When executing computations over a stream of events, it is possible to define windows based
on the timestamps of the events or based on the time at which each event arrives at the event
processing system.
a) Discuss the implications of adopting each approach for the result of the computations
(give examples where appropriate).
b) Discuss the implications of each approach for the event processing system.
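
As an illustration of the two notions of time contrasted in this question, the following is a minimal sketch in plain Python, not tied to any particular stream processing system. The 10-second tumbling window, the field names event_time and arrival_time, and the sample events are all assumptions made for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # hypothetical tumbling-window size


def window_start(t, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains time t (seconds)."""
    return (t // size) * size


def count_per_window(events, time_field):
    """Count events per tumbling window, using the chosen notion of time."""
    counts = defaultdict(int)
    for event in events:
        counts[window_start(event[time_field])] += 1
    return dict(counts)


# Each event records when it was produced (event_time) and when it reached
# the processing system (arrival_time). The third event is delayed.
events = [
    {"event_time": 1, "arrival_time": 2},
    {"event_time": 4, "arrival_time": 5},
    {"event_time": 8, "arrival_time": 13},  # produced in [0, 10), arrives in [10, 20)
    {"event_time": 12, "arrival_time": 14},
]

by_event_time = count_per_window(events, "event_time")
by_arrival_time = count_per_window(events, "arrival_time")
print(by_event_time)    # {0: 3, 10: 1}
print(by_arrival_time)  # {0: 2, 10: 2}
```

The delayed event is counted in the first window under event time and in the second window under arrival time, so the same stream yields different per-window results.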

Question 3
The Spark system has two APIs for expressing computations: the base API and DataFrames/SQL.
Compare the two APIs, presenting the advantages of each.

Question 4
The Apache Flink operation env.setBufferTimeout(timeoutMillis) is used to force transmission
after some time. Explain why this is necessary in Apache Flink (discussing how events are
processed).

Question 5
Apache Kafka topics can be broken up into partitions.
a) Explain what partitions are in Kafka and why they are an interesting mechanism.
b) Discuss the guarantees provided when consuming events from a topic that has
multiple partitions.

Question 6
Time-series databases include sophisticated mechanisms for compressing information.
a) Explain why data compression is a key feature in these systems.
b) What makes compression particularly efficient in time-series databases?

Question 7
In sensor networks (and in IoT-based sensing systems), nodes are often organized in a tree,
where a sensor node only communicates with its parent and children. Briefly present two
mechanisms used by these systems to save energy when processing a stream of events from
sensors.


Part 2 – open book

Consider a stream of events from sensors with the following format (the type of each value is
presented in parentheses):

timestamp (date), coord x (double), coord y (double), sensor id (long), sensor type (int),
value (double)

The sensor id is a unique identifier of the sensor.
The sensor type identifies the type of sensor (temperature, light, etc.).
Coord x and coord y are the coordinates of the position where the sensor is placed.
An area is a square identified by a pair (x,y), such that two points are in the same area if the
value of the area, area(coord x, coord y), is the same, with area(x,y) = (round(x*1000),
round(y*1000)).
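
The event format and the area function above can be sketched in Python as follows. This is only an illustration: the comma-separated serialization, the ISO 8601 timestamp, and the names SensorEvent and parse_event are assumptions, since the test specifies only the fields and their types. Note also that Python's round() rounds exact halves to the nearest even integer, which may differ from the rounding intended in the definition.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SensorEvent:
    """One sensor reading, with the fields listed in the event format."""
    timestamp: datetime
    coord_x: float
    coord_y: float
    sensor_id: int
    sensor_type: int
    value: float


def parse_event(line: str) -> SensorEvent:
    """Parse one event, assuming comma-separated fields and an ISO 8601 date."""
    ts, x, y, sensor_id, sensor_type, value = (f.strip() for f in line.split(","))
    return SensorEvent(
        datetime.fromisoformat(ts),
        float(x),
        float(y),
        int(sensor_id),
        int(sensor_type),
        float(value),
    )


def area(coord_x: float, coord_y: float) -> tuple:
    """area(x,y) = (round(x*1000), round(y*1000)), as defined in the text."""
    return (round(coord_x * 1000), round(coord_y * 1000))


event = parse_event("2019-06-14T10:00:00, 0.0012, 0.0018, 7, 47, 21.5")
print(area(event.coord_x, event.coord_y))
# Two nearby points fall in the same area; a point further away does not.
print(area(0.0011, 0.0021) == area(0.0013, 0.0019))
print(area(0.0011, 0.0021) == area(0.0016, 0.0021))
```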

Question 8
For your favorite event processing system, write a program that reads the above stream of
events from Kafka topic “IoT”, and continuously outputs for each area the average value for
each sensor type.
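
Independently of the system chosen, the output expected here can be illustrated with a short sketch in plain Python. It does not read from Kafka and is not a full answer: it only shows an average maintained per (area, sensor type) and re-emitted after every event. The class name RunningAverages and the sample readings are hypothetical, and averaging over the whole stream (rather than over a window) is an assumption.

```python
from collections import defaultdict


def area(x, y):
    """Area of a point, following area(x,y) = (round(x*1000), round(y*1000))."""
    return (round(x * 1000), round(y * 1000))


class RunningAverages:
    """Keeps a running average of the value per (area, sensor type)."""

    def __init__(self):
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    def update(self, coord_x, coord_y, sensor_type, value):
        """Add one reading and return the key with its updated average."""
        key = (area(coord_x, coord_y), sensor_type)
        self._sums[key] += value
        self._counts[key] += 1
        return key, self._sums[key] / self._counts[key]


averages = RunningAverages()
readings = [
    (0.0011, 0.0021, 1, 20.0),
    (0.0013, 0.0019, 1, 22.0),   # same area and sensor type as the first reading
    (0.0011, 0.0021, 2, 300.0),  # same area, different sensor type
]
outputs = [averages.update(*reading) for reading in readings]
for key, average in outputs:
    print(key, average)
```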

Question 9
For your favorite event processing system, write a program that reads the above stream of
events from Kafka topic “IoT”, and outputs alarms when the readings from sensors with
sensor type = 47 are larger than ten times the average of the value in the last 10 days – for
each area, you should output a single alarm every 5 seconds.
NOTE: if you cannot answer this question, write a program that outputs alarms when the
readings from sensors with sensor type = 47 are larger than 100 – for each area, you should
output a single alarm every 5 seconds if at least two sensors read values larger than 100.

Answer one of the following questions.

Question 10
Present the pseudo-code that should run in each node of a TinyDB deployment for executing
the computation of question 9. Express the code as a function exec(local evt, rcv evts): out evts,
where local evt is the value read from the local sensor, expressed using the format presented
previously; rcv evts is a list of messages received from children nodes; and out evts is a list of
messages to send to the parent node.

Question 11
Some time-series databases use LSM-trees to store events. Databases that use LSM-trees often
keep in memory Bloom filters that summarize the keys that are present in a given tree – this
allows efficient access to the data.
Discuss whether this approach is efficient for executing queries over a time interval. If so, present,
in pseudo-code, the algorithm used for executing a query. If not, propose which additional
information should be maintained and present, in pseudo-code, the algorithm used for
executing a query.
