Test1 1617
In the following questions, mark each of the sentences as (T)rue or (F)alse. Note that wrong answers
have a negative score. Leave blank unless you are sure.
a) The programming model of most distributed processing frameworks is based on
"divide & conquer" strategies, because it works for any kind of problem.
b) Distributed processing over very large datasets relies on partitioning the data sets across
many machines, which rely on specialised hardware to perform the computations in a
fault-tolerant manner.
c) Efficient distributed processing over very large datasets, partitioned across many machines,
exploits the idea that computations are deployed to the machine nodes where the data resides.
d) Velocity in Big Data means that data needs to be stored quickly to avoid losing a lot of data
due to faults.
e) Variability in Big Data is about elastic solutions, that is, solutions that are very general and can be
adapted to every kind of problem.
f) Volume in Big Data is one of the reasons why distributed storage solutions, on top of which
processing is performed, are central.
g) Batch processing means the data is processed in small batches to reduce the latency and
produce results quickly.
h) Stream processing means the data may be infinite in size and needs to be processed continuously,
as soon as it arrives.
2 Regarding MapReduce...
a) MapReduce processing over big data usually involves specialised hardware machines located
in a cluster.
b) In MapReduce, as the name implies, the programming model comprises only two parts: the
map and the reduce phases. InCoop is an extension of MapReduce to allow a third phase -
the combiner;
c) MapReduce execution involves storing intermediate files on local disks at the nodes that
executed the map phase;
d) The execution of a MapReduce program requires that all files that the nodes involved in the
computation read and write are stored in the Hadoop filesystem to leverage replication,
including intermediate files; this is necessary to ensure fault-tolerance;
e) MapReduce does not store intermediate computations; instead, it keeps them in memory. The
master node issues reduce tasks to the nodes that produced intermediate results. Since
everything is in memory, the reduce is fast, which explains why MapReduce is very performant;
f) MapReduce supports incremental computations. When the data is updated by a small
fraction, only the reducers need to be executed, thus producing new results very fast.
3 Regarding Spark...
4 Regarding Storm and stream processing...
a) Storm computations are defined as a topology of processing elements of two kinds: spouts
and bolts.
b) Storm topologies cannot have cycles.
c) Spouts are components in Storm that perform the initial injection of data in the system.
d) To provide fault-tolerance guarantees, Storm requires each data tuple to be acknowledged
(acked) at each processing step.
e) If a Storm machine fails, the internal state of the lost components is recovered from storage,
making Storm resilient to node failures.
f) Trident is a complement to Storm. Trident provides a very low-level programming interface
that allows the programmer to produce optimal topologies with minimal latency.
g) Storm provides exactly once tuple processing semantics.
h) Trident extends Storm to provide at least once and at most once processing semantics.
i) Trident extends Storm to provide exactly once tuple processing semantics.
j) Storm provides at least once processing semantics, which means idempotence is available out
of the box.
5 Explain the role and motivation for using a system like Kafka or Flume, or both, in the
context of distributed processing, in conjunction with Spark or Storm, for instance.
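For illustration only, a minimal sketch of the kind of pipeline this question refers to, assuming the Spark 1.x/2.x Kafka integration (pyspark.streaming.kafka) and a hypothetical "taxi-rides" topic; Kafka sits between the data producers and Spark Streaming as a durable, replayable buffer:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="TaxiRidesFromKafka")
ssc = StreamingContext(sc, 5)                          # 5-second micro-batches

# Spark pulls records from the Kafka topic instead of receiving them directly
# from the (possibly bursty) producers: Kafka decouples ingestion from
# processing and lets the stream be replayed after a failure.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["taxi-rides"],                             # hypothetical topic name
    kafkaParams={"metadata.broker.list": "kafka:9092"})

stream.map(lambda kv: kv[1]).count().pprint()          # e.g. rides per micro-batch

ssc.start()
ssc.awaitTermination()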
6 Explain the impact of data locality (or lack thereof) in distributed processing using Spark.
Explain your answer.
7 Consider the following Pig program written for the dataset of the Debs 2015 Grand
Challenge (the one used in the Labs assignment)
8 Consider a very large data set of taxi rides, similar to the dataset of the Debs 2015 Grand
Challenge, but covering the whole world. You can assume each data tuple contains all
the information found in the Debs Grand Challenge and, in addition, also includes the
country and city where the taxi ride took place.
a) How would you partition the data among the storage nodes in the following cases:
i) Query for the average taxi ride duration (in seconds) per country.
ii) Query for the top 3 cities of the world that generate the greatest gross amount of revenue
(total), per day.
b) Implement the first of the queries of the previous question using Spark. You can use pseudo-
code in place of context creation, parsing, etc. Try to use the actual RDD transformations
that Spark provides for the rest of the code.
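A minimal PySpark sketch for query i) is given below; the field positions, the input path and the parse_ride() helper are assumptions, since the exact record layout of the extended dataset is not specified in the question.

from pyspark import SparkContext

sc = SparkContext(appName="AvgRideDurationPerCountry")

# Assumed field layout: the Debs fields plus two extra columns appended at the
# end of every CSV record, country and city (positions are assumptions).
TRIP_TIME_IDX = 4      # trip_time_in_secs in the Debs format (assumption)
COUNTRY_IDX = -2       # country, assumed to be the second-to-last field

def parse_ride(line):
    # Hypothetical parser: extract (country, duration in seconds) from one record.
    fields = line.split(",")
    return (fields[COUNTRY_IDX], float(fields[TRIP_TIME_IDX]))

rides = sc.textFile("hdfs:///taxi/rides.csv")          # input path is an assumption

avg_per_country = (rides
    .map(parse_ride)                                   # (country, duration_secs)
    .mapValues(lambda d: (d, 1))                       # (country, (sum, count))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .mapValues(lambda s: s[0] / s[1]))                 # (country, average duration)

avg_per_country.saveAsTextFile("hdfs:///taxi/avg-duration-per-country")

Using reduceByKey on (sum, count) pairs keeps the aggregation associative and partial, so averages are combined map-side before any shuffle, which matches the partitioning-by-country idea of question a) i).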