
Stream Processing

Faculdade de Ciências e Tecnologia / Universidade Nova de Lisboa


Test #1 (DSPS)
2016/17 closed book - 120 minutes
Number: Name:

In the following questions, mark each of the sentences as (T)rue or (F)alse. Note that wrong answers
have a negative score. Leave blank unless you are sure.

1 In the context of big data and distributed data processing in general...

a) The programming models of most distributed processing frameworks are based on
"divide & conquer" strategies, because these work for any kind of problem.
b) Distributed processing over very large datasets relies on partitioning the datasets across
many machines, which rely on specialised hardware to perform the computations in a
fault-tolerant manner.
c) Efficient distributed processing over very large datasets, partitioned across many machines,
exploits the idea that computations are deployed to the machine nodes where the data resides.
d) Velocity in Big Data means that data needs to be stored quickly to avoid losing a lot of data
in the event of faults.
e) Variability in Big Data is about elastic solutions: solutions that are very general and can be
adapted to every kind of problem.
f) Volume in Big Data is one of the reasons why distributed storage solutions, on top of which
processing is performed, are central.
g) Batch processing means the data is processed in small batches to reduce the latency and
produce results quickly.
h) Stream processing means the data may be infinite in size and needs to be processed
continuously, as soon as it arrives.

2 Regarding Google’s MapReduce and closely related systems...

a) MapReduce processing over big data usually involves machines with specialised hardware,
located in a cluster.
b) In MapReduce, as the name implies, the programming model comprises only two parts: the
map and the reduce phases. Incoop is an extension of MapReduce that allows a third phase,
the combiner;
c) MapReduce execution involves storing intermediate files on local disks at the nodes that
executed the map phase;
d) The execution of a MapReduce program requires that all files that the nodes involved in the
computation read and write are stored in the Hadoop filesystem to leverage replication,
including intermediate files; this is necessary to ensure fault-tolerance;

e) MapReduce does not store intermediate results; instead, it keeps them in memory. The
master node issues reduce tasks to the nodes that produced intermediate results. Since
everything is in memory, the reduce is fast, which explains why MapReduce is very performant;
f) MapReduce supports incremental computations. When a small fraction of the data is
updated, only the reducers need to be re-executed, thus producing new results very fast.
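
As an illustration of the two-phase programming model referred to above, the following is a
minimal word-count sketch in plain Python; the function names and the local driver are
assumptions for illustration only and do not correspond to any particular framework's API.

from collections import defaultdict

def map_phase(document):
    # Emit one intermediate (word, 1) pair per occurrence.
    for word in document.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Aggregate all intermediate values that share the same key.
    return (word, sum(counts))

def run(documents):
    # Simulates the shuffle: group intermediate pairs by key,
    # then hand each group to the reduce function.
    intermediate = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            intermediate[key].append(value)
    return [reduce_phase(k, v) for k, v in intermediate.items()]

print(run(["a rose is a rose"]))  # [('a', 2), ('rose', 2), ('is', 1)]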

3 Regarding Spark...

a) Spark is a distributed processing platform that does batch processing;


b) Spark is a distributed processing platform that does stream processing;
c) Spark allows batch and stream processing to be mixed and combined in the same program.
d) Spark uses the abstraction of RDDs - resilient distributed datasets. RDDs are mutable and can
be updated at multiple nodes at the same time, which explains why Spark is very performant.
e) Spark's programming model is based on the idea of applying transformations and actions to
RDDs.
f) Spark fault-tolerance works by performing hot replication - meaning the computing tasks are
always issued to a minimum of two nodes so that one set of results is available with high
probability.
g) Spark programs produce a graph that can have narrow and wide dependencies, but not both.
h) Wide dependencies in Spark are very fast because the same transformation can be applied
to many items at once.
i) Narrow dependencies occur when a Spark program includes a map transformation that
transforms each item in an RDD into a single item in the resulting RDD.
j) Wide dependencies occur when a Spark program includes a flatmap transformation that
transforms each item in an RDD into multiple items in the resulting RDD.
k ) A Narrow dependency cannot occur in Spark when each partition of an RDD produces items
that end up in multiple partitions in the resulting RDD.
l ) The lineage graph in Spark encodes the dependencies between the RDDs.
m) The lineage graph in Spark defines the order of the items inside an RDD, so that deterministic
results are produced despite faults.
n) RDD transformations that involve narrow dependencies can be pipelined and executed on
the same task/node.
o) Spark allows the programmer to control if a particular RDD should be persisted in the
filesystem or kept in memory.
p) Persisting in the filesystem RDDs that are the result of wide dependencies can improve
recovery times after node/task failures.
q) RDDs that result from narrow dependencies should never be considered for persistence in
the filesystem.
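
To make the notions of transformations, actions, narrow/wide dependencies and
programmer-controlled persistence concrete, here is a minimal PySpark-style sketch; the
input file name and the field positions are assumptions for illustration only.

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="lineage-example")

lines = sc.textFile("trips.csv")                     # hypothetical input file
fields = lines.map(lambda l: l.split(","))           # narrow dependency
by_taxi = fields.map(lambda f: (f[0], float(f[5])))  # narrow: (taxi, trip_distance)
totals = by_taxi.reduceByKey(lambda a, b: a + b)     # wide: shuffles data by key
totals.persist(StorageLevel.MEMORY_AND_DISK)         # programmer-controlled persistence
print(totals.take(3))                                # action: triggers execution of the lineage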

4 Regarding Storm and stream processing...

a) Storm computations are defined as a topology of processing elements of two kinds: spouts
and bolts.
b) Storm topologies cannot have cycles.
c) Spouts are components in Storm that perform the initial injection of data in the system.
d) To provide fault-tolerance guarantees, Storm requires each data tuple to be acknowledged
(acked) at each processing step.
e) If a Storm machine fails, the internal state of the lost components is recovered from storage,
making Storm resilient to node failures.
f ) Trident is a complement to Storm. Trident provides a very low-level programming interface
that allows the programmer to produce optimal topologies with minimal latency.
g) Storm provides exactly once tuple processing semantics.
h) Trident extends Storm to provide at least once and at most once processing semantics.
i) Trident extends Storm to provide exactly once tuple processing semantics.
j ) Storm provides at least once processing semantics, which means idempotence is available out
of the box.

5 Explain the role and motivation for using a system like Kafka or Flume, or both, in the
context of distributed processing, in conjunction with Spark or Storm, for instance.

6 Explain the impact of data locality (or lack thereof) in distributed processing using Spark.
Justify your answer.

7 Consider the following Pig program, written for the dataset of the DEBS 2015 Grand
Challenge (the one used in the Labs assignment)

A = LOAD 'sorted_data.csv' USING PigStorage(',') AS
    (taxi, hack_license, pickup_datetime, dropoff_datetime, trip_time_in_secs, trip_distance);
B = FOREACH A GENERATE taxi, trip_distance;
C = GROUP B BY taxi;
D = FOREACH C GENERATE $0, AVG(B.trip_distance) AS average_trip_distance;
E = ORDER D BY average_trip_distance DESC;
F = LIMIT E 3;
STORE F INTO 'taxi-results' USING PigStorage();
a) Explain what is being computed by this program. (Do not explain each step of the program.)
b) Assuming the program will be compiled into (multiple) map/reduce jobs, explain which steps/instructions
will involve costly shuffle map/reduce phases.
c) In this example, local storage is being used to read the input dataset and store the results.
What would be the benefit of replacing local storage with Hadoop HDFS? Explain.

8 Consider a very large dataset of taxi rides, similar to the dataset of the DEBS 2015 Grand
Challenge, but covering the whole world. You can assume each data tuple contains all
the information found in the DEBS Grand Challenge and, in addition, also includes the
country and city where the taxi ride took place.

a) How would you partition the data among the storage nodes in the following cases:
i) Query for the average taxi ride duration (in seconds) per country.
ii ) Query for the top 3 cities of the world that generate the greatest gross amount of revenue
(total), per day.
b) Implement the first of the queries of the previous question using Spark. You can use pseudo-
code in place of context creation, parsing, etc. Try to use the actual RDD transformations
that Spark provides for the rest of the code.
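
For reference, one possible sketch of the first query in PySpark, under the assumption that
each input line is a CSV record with the Debs fields followed by country and city; the input
path and the field positions used below are assumptions, not part of the dataset specification.

from pyspark import SparkContext

sc = SparkContext(appName="avg-duration-per-country")

def parse(line):
    # Assumed layout: the Debs fields followed by country and city.
    f = line.split(",")
    return (f[-2], (float(f[4]), 1))  # (country, (trip_time_in_secs, 1))

avg_per_country = (sc.textFile("world_taxi_rides.csv")  # hypothetical input path
                     .map(parse)
                     .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # sum durations and counts
                     .mapValues(lambda s: s[0] / s[1]))                     # average = sum / count

avg_per_country.saveAsTextFile("avg-duration-per-country")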
