Test1 1617
In the following questions, mark each of the sentences as (T)rue or (F)alse. Note that wrong answers
have a negative score. Leave blank unless you are sure.
a) The programming model of most distributed processing frameworks is based on
"divide & conquer" strategies, because it works for any kind of problem.
b) Distributed processing over very large datasets relies on partitioning the data sets across
many machines, which rely on specialised hardware to perform the computations in a
fault-tolerant manner.
c) Efficient distributed processing over very large datasets, partitioned across many machines,
exploits the idea that computations are deployed to the machine nodes where the data resides.
d) Velocity in Big Data means that data needs to be stored quickly to avoid losing a lot of data
due to faults.
e) Variability in Big Data is about elastic solutions, that is, solutions that are very general and can be
adapted to every kind of problem.
f) Volume in Big Data is one of the reasons why distributed storage solutions, on top of which
processing is performed, are central.
g) Batch processing means the data is processed in small batches to reduce the latency and
produce results quickly.
h) Stream processing means the data may be infinite in size and needs to be processed continuously,
as soon as it arrives.
2 Regarding MapReduce...
a) MapReduce processing over big data usually involves specialised hardware machines located
in a cluster.
b) In MapReduce, as the name implies, the programming model comprises only two parts: the
map and the reduce phases. InCoop is an extension of MapReduce to allow a third phase -
the combiner;
c) MapReduce execution involves storing intermediate files on local disks at the nodes that
executed the map phase;
d) The execution of a MapReduce program requires that all files that the nodes involved in the
computation read and write are stored in the Hadoop filesystem to leverage replication,
including intermediate files; this is necessary to ensure fault-tolerance;
e) MapReduce does not store intermediate computations; instead, it keeps them in memory. The
master node issues reduce tasks to the nodes that produced intermediate results. Since
everything is in memory, the reduce is fast, which explains why MapReduce is very performant;
f) MapReduce supports incremental computations. When the data is updated by a small
fraction, only the reducers need to be executed, thus producing new results very fast.
3 Regarding Spark...
4 Regarding Storm and stream processing...
a) Storm computations are defined as a topology of processing elements of two kinds: spouts
and bolts.
b) Storm topologies cannot have cycles.
c) Spouts are components in Storm that perform the initial injection of data in the system.
d) To provide fault-tolerance guarantees, Storm requires each data tuple to be acknowledged
(acked) at each processing step.
e) If a Storm machine fails, the internal state of the lost components is recovered from storage,
making Storm resilient to node failures.
f) Trident is a complement to Storm. Trident provides a very low-level programming interface
that allows the programmer to produce optimal topologies with minimal latency.
g) Storm provides exactly once tuple processing semantics.
h) Trident extends Storm to provide at least once and at most once processing semantics.
i) Trident extends Storm to provide exactly once tuple processing semantics.
j) Storm provides at least once processing semantics, which means idempotence is available out
of the box.
5 Explain the role and motivation for using a system like Kafka or Flume, or both, in the
context of distributed processing, in conjunction with Spark or Storm, for instance.
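For illustration only, a minimal sketch of the kind of pipeline this question refers to, assuming the Spark 1.x/2.x Kafka integration (pyspark.streaming.kafka) and a hypothetical "taxi-rides" topic; Kafka sits between the data producers and Spark Streaming as a durable, replayable buffer:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="TaxiRidesFromKafka")
ssc = StreamingContext(sc, 5)                          # 5-second micro-batches

# Spark pulls records from the Kafka topic instead of receiving them directly
# from the (possibly bursty) producers: Kafka decouples ingestion from
# processing and lets the stream be replayed after a failure.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["taxi-rides"],                             # hypothetical topic name
    kafkaParams={"metadata.broker.list": "kafka:9092"})

stream.map(lambda kv: kv[1]).count().pprint()          # e.g. rides per micro-batch

ssc.start()
ssc.awaitTermination()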
6 Explain the impact of data locality (or lack thereof) in distributed processing using Spark.
Explain your answer.
7 Consider the following Pig program written for the dataset of the Debs 2015 Grand
Challenge (the one used in the Labs assignment)
8 Consider a very large data set of taxi rides, similar to the dataset of the Debs 2015 Grand
Challenge, but covering the whole world. You can assume each data tuple contains all
the information found in the Debs Grand Challenge and, in addition, also includes the
country and city where the taxi ride took place.
a) How would you partition the data among the storage nodes in the following cases:
i) Query for the average taxi ride duration (in seconds) per country.
ii) Query for the top 3 cities of the world that generate the greatest gross amount of revenue
(total), per day.
b) Implement the first of the queries of the previous question using Spark. You can use pseudo-
code in place of context creation, parsing, etc. Try to use the actual RDD transformations
that Spark provides for the rest of the code.
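A minimal PySpark sketch for query i) is given below; the field positions, the input path and the parse_ride() helper are assumptions, since the exact record layout of the extended dataset is not specified in the question.

from pyspark import SparkContext

sc = SparkContext(appName="AvgRideDurationPerCountry")

# Assumed field layout: the Debs fields plus two extra columns appended at the
# end of every CSV record, country and city (positions are assumptions).
TRIP_TIME_IDX = 4      # trip_time_in_secs in the Debs format (assumption)
COUNTRY_IDX = -2       # country, assumed to be the second-to-last field

def parse_ride(line):
    # Hypothetical parser: extract (country, duration in seconds) from one record.
    fields = line.split(",")
    return (fields[COUNTRY_IDX], float(fields[TRIP_TIME_IDX]))

rides = sc.textFile("hdfs:///taxi/rides.csv")          # input path is an assumption

avg_per_country = (rides
    .map(parse_ride)                                   # (country, duration_secs)
    .mapValues(lambda d: (d, 1))                       # (country, (sum, count))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .mapValues(lambda s: s[0] / s[1]))                 # (country, average duration)

avg_per_country.saveAsTextFile("hdfs:///taxi/avg-duration-per-country")

Using reduceByKey on (sum, count) pairs keeps the aggregation associative and partial, so averages are combined map-side before any shuffle, which matches the partitioning-by-country idea of question a) i).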