Sample Exam Problems
Marion Neumann
Spring 2017
Note 1: This is a collection of problems to exemplify the style of questions you may expect on
the written exam. The length and difficulty of the exam problems may vary from the ones in
this collection, and these sample problems do not reflect the length and difficulty of the entire exam.
It's really just a collection of problems. Not every covered topic is represented in these sample
problems, so keep an eye on those topics as well.
Note 2: I do not have an answer key for these practice problems. All solutions can be derived from
the course materials. If you have questions or doubts about the correctness of a solution you
derived, please ask us in our office hours or discuss them with your peers on Piazza. I encourage you
to actively discuss the problems on Piazza. This way you will learn the most and be prepared for
the exam!
Part I: True or False and Multiple Choice
(xx points) Problem 1
Please mark for each statement whether it is true or false. Make sure your choice is clear. Correct
answers will count as 1 point, wrong answers will count as -1 point. The minimum total amount
of points for this problem is 0 points.
Mark zero, one, or multiple right answers for each problem. Wrongly marked answers count
negative with the same weight as correctly marked answers count positive. The minimum
total amount of points for each problem is 0 points.
(2 points) Problem 2
Analysis of text data (e.g. webpages) primarily addresses the following aspect of Big Data (mark
ONE).
(A) Velocity
(B) Variety
(C) Volume
(2 points) Problem 3
Making your MapReduce implementation Hadoop-agnostic means to
(2 points) Problem 4
The Driver is executed on
Part II: Cluster Computing Distributed Storage & Analysis
(4 points) Problem 5
How do systems for distributed storage and data analysis handle hardware failure? Consider both
data storage and analysis job execution, as well as master and worker node failure in your answer.
(4 points) Problem 6
When dealing with Big Data, you have to consider file compression.
(a) Briefly discuss the tradeoff you face when compressing data.
(b) Which way of compressing the data is most suitable if you want to analyze it using a
MapReduce program?
(c) Which way of compressing the data is most suitable for archiving?
(6 points) Problem 7
Assume you have a data file of size 640MB, the replication rate in the distributed file system is 2,
the default block size is 128MB, and the cluster consists of 6 nodes on 3 racks as shown below.
(a) In the figure below, separate the file into the appropriate number of blocks, label the blocks
with numbers 1, 2, 3, ..., and distribute them across the data nodes A, B, C, D, E, and F.
[Figure: a file to be split into blocks, and six data nodes on three racks.]
rack1: A, B    rack2: C, D    rack3: E, F
(b) Write down the dictionary mapping the blocks to the file and the dictionary of data nodes
per block. Where are these dictionaries stored in a Hadoop distributed file system?
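To check the arithmetic behind part (a), here is a minimal sketch that counts blocks and stored block copies for the numbers given in the problem. The function names are made up for illustration, and the sketch says nothing about where the copies are placed; that part is for you to work out from the course materials.

```python
import math


def block_count(file_size_mb, block_size_mb):
    """Number of blocks needed; the last block may be only partially filled."""
    return math.ceil(file_size_mb / block_size_mb)


def stored_block_copies(file_size_mb, block_size_mb, replication):
    """Total number of block copies kept in the cluster."""
    return block_count(file_size_mb, block_size_mb) * replication


if __name__ == "__main__":
    # Values taken from the problem statement: 640MB file, 128MB blocks, replication 2.
    print(block_count(640, 128))             # 5
    print(stored_block_copies(640, 128, 2))  # 10
```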
Part III: MapReduce
(6 points) Problem 8
Given the following input data:
2013-03-15 12:39 - 74.125.226.230 /common/logo.gif 1200ms - 2326
2013-03-15 12:39 - 157.166.255.18 /catalog/cat1.html 900ms - 1211
2013-03-15 12:40 - 65.50.196.141 /common/logo.gif 1900ms - 1198
2013-03-15 12:41 - 64.69.4.150 /common/promoex.jpg 4000ms - 2326
2013-03-15 12:44 - 157.166.255.18 /catalog/cat2.html 1100ms - 1451
Write down the data flow for a MapReduce program that analyzes the log data given above to
retrieve the average processing time for each file type; give (i.e., compute) the specific
Mapper outputs, Reducer inputs, and Reducer outputs.
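If you want to check your hand-computed data flow, the following sketch simulates the map, shuffle, and reduce steps in plain Python on the five log lines above. It is not Hadoop code. It assumes that "file type" means the file extension of the requested path and that the processing time is the field ending in "ms"; the function names are illustrative only.

```python
from collections import defaultdict

LOG_LINES = [
    "2013-03-15 12:39 - 74.125.226.230 /common/logo.gif 1200ms - 2326",
    "2013-03-15 12:39 - 157.166.255.18 /catalog/cat1.html 900ms - 1211",
    "2013-03-15 12:40 - 65.50.196.141 /common/logo.gif 1900ms - 1198",
    "2013-03-15 12:41 - 64.69.4.150 /common/promoex.jpg 4000ms - 2326",
    "2013-03-15 12:44 - 157.166.255.18 /catalog/cat2.html 1100ms - 1451",
]


def mapper(line):
    """Emit (file type, processing time in ms) for one log line."""
    fields = line.split()
    path, elapsed = fields[4], fields[5]
    file_type = path.rsplit(".", 1)[-1]  # assumption: file type = extension
    yield file_type, int(elapsed[:-2])   # drop the trailing "ms"


def shuffle(pairs):
    """Group mapper outputs by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reducer(key, values):
    """Average all processing times seen for one file type."""
    return key, sum(values) / len(values)


def run(lines):
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    for file_type, average in run(LOG_LINES).items():
        print(file_type, average)
```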
(2 points) Problem 10
What is speculative execution?
(3 points) Problem 11
Describe how serialization is achieved in Hadoop MapReduce.
(6 points) Problem 12
When implementing MapReduce programs in Hadoop, one common-sense debugging and
development strategy is to start small and build incrementally. Explain what is meant by this phrase
with respect to input data and implementation steps.
Part IV: MapReduce Algorithms
(2 points) Problem 13
Name three performance indicators to consider when analyzing MapReduce algorithms.
(5 points) Problem 14
(a) Name three use cases for secondary sort. Give an example composite key for each use case.
(b) The Partitioner in a secondary sort MapReduce implementation partitions the key-value
pairs by primary key to ensure that all the key-value pairs with the same primary key end
up at the same Reduce Task. Why do you need to additionally implement a custom Group
Comparator?
(8 points) Problem 15
(a) Write down a MapReduce program using pseudo-code or short textual statements that
computes an inverted index.
Each entry in the index should be a word followed by a list of pairs (i, j), where i is a
unique identifier for the document, and j is the position of the word in the document.
(b) Consider the following three "documents," each consisting of a single sentence:
First, stem the words by replacing plurals by their singular forms. (Stemming involves other
transformations as well, but only plural-singular appears in these documents.) Construct an
inverted index for the above documents (using your MapReduce program developed in the
previous part). Now, i is the number of a document (1, 2, 3), and j is the position of the
word in the document (positions start at 1, count spaces).
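To test your own pseudo-code against something executable, here is a rough Python sketch of a positional inverted index in map/reduce style. It rests on several assumptions: the two documents are made up (they are not the exam documents), j is taken to be the word's token position counted from 1, stemming is left out, and the function names are illustrative.

```python
import string
from collections import defaultdict


def mapper(doc_id, text):
    """Emit (word, (doc_id, position)) for every word of one document."""
    for position, token in enumerate(text.split(), start=1):
        word = token.strip(string.punctuation).lower()
        if word:
            yield word, (doc_id, position)


def reducer(word, postings):
    """Collect all (doc_id, position) pairs of one word into a sorted list."""
    return word, sorted(postings)


def inverted_index(docs):
    """Simulate map, shuffle (group by word), and reduce in memory."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, posting in mapper(doc_id, text):
            groups[word].append(posting)
    return dict(reducer(word, postings) for word, postings in sorted(groups.items()))


# Hypothetical documents, used only to exercise the sketch.
DOCS = {1: "Fish swim fast.", 2: "Fast fish eat."}

if __name__ == "__main__":
    for word, postings in inverted_index(DOCS).items():
        print(word, postings)
```

In a real job the sorted posting lists would come from the framework's sort or a secondary sort rather than from sorting inside the Reducer, which only works while each list fits in memory.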
Part V: Tools for Big Data Analysis
(2 points) Problem 16
Name four selection criteria when choosing the right tool for Big Data processing and analysis
tasks.
(a) Which tool would be the best choice if you want to explore a data set but aren't yet sure
what fields it contains? Briefly state why.
(b) Which tool would be the best choice for a Java developer who wants to do image processing
on 75 million digital photos? Briefly state why.
(c) Which tool would be the best choice to implement the PageRank algorithm to rank 4 billion
webpages?
(d) Which tool would be the best choice to implement a linear perceptron classifier for text
categorization trained on a corpus of one million text documents represented as bags of
words over a vocabulary of 10,000 words? Briefly state why.
(e) Which tool would be the best choice for hosting a hotel customer database and reservation
system for a hotel chain operating 5,000 hotels in the US?
(f) Which tool would be the best choice for a Python developer who wants to do sentiment
analysis on 1 million tweets? Briefly state why.
(g) Which tool would be the best choice for someone who is already familiar with SQL and needs
to analyze a directory containing 20 TB of Web server log files? Briefly state why.
(h) Which tool would be the best choice for an analyst who is already familiar with SQL and
wants to quickly run several "what if" scenarios based on 10 billion detail records from a
Point of Sale system? Briefly state why.
(i) Which tool would be the best choice to implement an Extract Transform Load (ETL) workflow
integrating terabytes of data from multiple heterogeneous sources?
(j) Which tool would be the best choice for an analyst who wants to quickly run several "what
if" scenarios based on 10 billion detail records from a Point of Sale system? Briefly state why.
Part VI: MapReduce for Big Data Applications
Is this MapReduce approach scalable or do we have to expect memory issues for large input data?
Briefly justify your answer.
(8 points) Problem 21
The essential part of many recommendation and classification approaches is to find similar data
points, such as text documents, movies, products, or users. Given the following utility matrix
representing ratings by users A, B, and C for items a through f
a b c d e f
A 4 5 5 1 2
B 3 4 5 1
C 2 1 3 1 5
find the most similar user to user A. That is, is user A more similar to user B or C?
(a) Compute the Jaccard similarity J(A, B) and J(A, C) between user A and users B and C.
(b) Now, treat ratings of 3, 4, and 5 as 1, and ratings of 1 and 2 as blank. Compute the Jaccard
similarity J(A, B) and J(A, C) between user A and users B and C.
(c) Intuitively, should A be more similar to B or C? Which of the similarity measures better
reflects intuition?
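The two variants of the Jaccard similarity used in parts (a) and (b) can be sketched as follows. The ratings for the users U and V are made up for illustration and are not the utility matrix above; the helper names are also just assumptions. Plug in the ratings from the problem to check your own numbers.

```python
def jaccard(x, y):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def rated_items(ratings):
    """Part (a) view: an item counts as soon as the user rated it at all."""
    return set(ratings)


def liked_items(ratings, threshold=3):
    """Part (b) view: ratings >= threshold count as 1, lower ratings as blank."""
    return {item for item, rating in ratings.items() if rating >= threshold}


# Hypothetical users, each a dict mapping item -> rating.
U = {"a": 5, "b": 1, "c": 4}
V = {"b": 2, "c": 5, "d": 3}

if __name__ == "__main__":
    print(jaccard(rated_items(U), rated_items(V)))  # 2 shared of 4 rated items
    print(jaccard(liked_items(U), liked_items(V)))  # 1 shared of 3 liked items
```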
(15 points) Problem 22
Online news reading has become very popular as the web provides access to news articles from
millions of sources around the world. A key challenge of online news platforms is to help users in
finding and recommending news articles they are interested in.
(a) What is the main difference between traditional newspapers and online news platforms?
(b) Explain the long tail phenomenon and what it means in the context of news article
recommendation.
(c) Name four properties of news articles that could be used as features for content-based
recommendation.
(d) State the pseudo-code of a MapReduce implementation for collaborative filtering for news
article recommendation using the cosine similarity on normalized ratings. Assume the
following input for each rating: (user-id, article-id, rating). (Consider ratings only and
ignore any additional information, such as the publication date of an article, or other
metadata.) You may use one variable called statistics for each user-article pair to store all
statistics required for the similarity computation. Carefully list the required statistics stored
in statistics and indicate in your pseudo-code when they are computed.
(e) Why do you need to sort the list of (user-id, rating) pairs in the Reducer of the first
MapReduce job?
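As a reminder of the similarity measure named in part (d), here is a small single-machine sketch of cosine similarity on normalized ratings. It is not a MapReduce solution. It assumes that "normalized" means subtracting the mean of each rating vector, and the function names and example ratings are hypothetical.

```python
import math


def normalize(ratings):
    """Subtract the mean rating (assumed meaning of 'normalized')."""
    mean = sum(ratings.values()) / len(ratings)
    return {key: value - mean for key, value in ratings.items()}


def cosine(x, y):
    """Cosine similarity of two sparse vectors given as dicts; blanks count as 0."""
    dot = sum(x[key] * y[key] for key in x.keys() & y.keys())
    norm_x = math.sqrt(sum(value * value for value in x.values()))
    norm_y = math.sqrt(sum(value * value for value in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


if __name__ == "__main__":
    # Hypothetical rating vectors over the same two keys.
    first = normalize({"k1": 4, "k2": 2})
    second = normalize({"k1": 1, "k2": 5})
    print(cosine(first, second))
```

The sums inside cosine (the dot product and the two sums of squares) are the kind of quantity you might consider keeping in statistics; which ones you need and when they are computed is left to your answer.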