cse427s Sample Exam Problems

Marion Neumann
Spring 2017

Note 1: This is a collection of problems to exemplify the style of questions you may expect for
the written exam. The length and difficulty of the exam problems may vary from the ones in
this collection. These sample problems do not reflect the length and difficulty of the entire exam.
It's really just a collection of problems. Not every covered topic is represented in these sample
problems. So, keep an eye on those topics as well.

Note 2: I do not have an answer key for these practice problems. All solutions can be derived from
the course materials. If you have questions or doubts about the correctness of a solution you
derived, please ask us in our office hours or discuss them with your peers on Piazza. I encourage you
to actively discuss the problems on Piazza. This way you will learn the most and be prepared for
the exam!

Exam Guidelines (for the actual exam):

Show your work to receive maximum credit. Partial credit will be given if work is shown.
Please keep your written answers brief and to the point. Incorrect or rambling statements can hurt
your score on a question.
If your handwriting is not readable, we cannot give you credit.
Clearly indicate parts (a-c) when answering the respective parts of a problem.
Pages xx-xx are blank pages if you need extra space. Clearly indicate problem and part number if
(part of) your answer is written on these pages.
This exam is worth 30% of your final grade.
The exam is broken up into 6 parts and is worth xx points. There are xx problems total. It is your
responsibility to make sure you have all of the pages.
You have 80 mins to complete the exam.

Part   Topic                       Possible Points   Page(s)   For grading purpose only
                                                               Initials   Score
I      T/F and Multiple Choice
II     Cluster Computing
III    MapReduce
IV     MapReduce Algorithms
V      Big Data Analysis Tools
VI     Big Data Applications
Total

Part I: True or False and Multiple Choice
(xx points) Problem 1
Please mark for each statement whether it is true or false. Make sure your choice is clear. Correct
answers will count as 1 point, wrong answers will count as -1 point. The minimum total amount
of points for this problem is 0 points.

cf. Recap Quizzes (linked on course webpage)

Mark zero, one, or multiple right answers for each problem. Wrongly marked answers will
count negative with the same weight as correctly marked answers count positive. The minimum
total number of points for each problem is 0 points.

(2 points) Problem 2
Analysis of text data (e.g. webpages) primarily addresses the following aspect of Big Data (mark
ONE).
(A) Velocity
(B) Variety

(2 points) Problem 3
Making your MapReduce implementation Hadoop-agnostic means to

(A) have as few Hadoop dependencies as possible.

(B) have as many Hadoop dependencies as possible.
(C) decouple data parsing from the Mapper implementation.

(D) make your program only executable on a real Hadoop cluster.

(2 points) Problem 4
The Driver is executed on

(A) a compute node (TaskNode)

(B) the master node running the JobTracker
(C) the NameNode

(D) the client

Part II: Cluster Computing Distributed Storage & Analysis

(4 points) Problem 5
How do systems for distributed storage and data analysis handle hardware failure? Consider both
data storage and analysis job execution, as well as master and worker node failure in your answer.

(4 points) Problem 6
When dealing with Big Data, you have to consider file compression.

(a) Briefly discuss the tradeoff you face when compressing data.
(b) Which way of compressing the data is most suitable if you want to analyze it using a
MapReduce program?
(c) Which way of compressing the data is most suitable for archiving?

(6 points) Problem 7
Assume you have a data file of size 640MB, the replication rate in the distributed file system is 2,
the default block size is 128MB, and the cluster consists of 6 nodes on 3 racks as shown below.

(a) In the figure below, separate the file into the appropriate number of blocks, label the blocks
with numbers 1, 2, 3, . . . , and distribute them across the data nodes A, B, C, D, E, and F .

[Figure: a file to be split into blocks, and six data nodes on three racks: A and B on rack1, C and D on rack2, E and F on rack3.]

(b) Write down the dictionary mapping the blocks to the file and the dictionary of data nodes
per block. Where are these dictionaries stored in a Hadoop distributed file system?
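
The arithmetic behind part (a) can be checked with a few lines of Python. This is only a minimal sketch: the function name `block_layout` is made up for illustration, and it assumes that a replication rate of 2 means two stored copies of every block in total. It does not decide which data nodes or racks the copies should go to.

```python
import math

def block_layout(file_size_mb, block_size_mb, replication):
    """Return (number of blocks, total number of stored block copies)."""
    # The last block may be only partially filled, hence the ceiling.
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    return num_blocks, num_blocks * replication

blocks, copies = block_layout(640, 128, 2)
print(blocks, copies)
```

The same function can be reused to see how the layout changes when the file size is not a multiple of the block size.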

Part III: MapReduce
(6 points) Problem 8
Given the following input data:
2013-03-15 12:39 - 74.125.226.230 /common/logo.gif 1200ms - 2326
2013-03-15 12:39 - 157.166.255.18 /catalog/cat1.html 900ms - 1211
2013-03-15 12:40 - 65.50.196.141 /common/logo.gif 1900ms - 1198
2013-03-15 12:41 - 64.69.4.150 /common/promoex.jpg 4000ms - 2326
2013-03-15 12:44 - 157.166.255.18 /catalog/cat2.html 1100ms - 1451

Write down the data flow for a MapReduce program that analyzes the log data provided in input
data to retrieve the average processing time for each file type; give (i.e. compute) the specific
Mapper outputs, Reducer inputs, and Reducer outputs.
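
To check a data flow you wrote down by hand, the job can be simulated on a single machine. The sketch below is one possible reading of the problem, not an official solution: it assumes that the file type is the extension of the requested path, that the processing time is the field ending in `ms`, and that the helper names (`mapper`, `shuffle`, `reducer`, `run_job`) are purely illustrative.

```python
from collections import defaultdict

LOG_LINES = [
    "2013-03-15 12:39 - 74.125.226.230 /common/logo.gif 1200ms - 2326",
    "2013-03-15 12:39 - 157.166.255.18 /catalog/cat1.html 900ms - 1211",
    "2013-03-15 12:40 - 65.50.196.141 /common/logo.gif 1900ms - 1198",
    "2013-03-15 12:41 - 64.69.4.150 /common/promoex.jpg 4000ms - 2326",
    "2013-03-15 12:44 - 157.166.255.18 /catalog/cat2.html 1100ms - 1451",
]

def mapper(line):
    """Emit (file type, processing time in ms) for one log line."""
    fields = line.split()
    path, time_field = fields[4], fields[5]
    file_type = path.rsplit(".", 1)[-1]   # assumed: extension = file type
    yield file_type, int(time_field[:-2])  # drop the trailing "ms"

def shuffle(pairs):
    """Group values by key and sort the keys, as shuffle and sort would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reducer(file_type, times):
    """Emit (file type, average processing time)."""
    yield file_type, sum(times) / len(times)

def run_job(lines):
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)   # these are the Reducer inputs
    return dict(pair for key, values in grouped.items()
                for pair in reducer(key, values))

print(run_job(LOG_LINES))
```

Printing `mapped` and `grouped` inside `run_job` shows the intermediate Mapper outputs and Reducer inputs that the problem asks for.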

(xx points) Problem 9

Have a look at the exercises covered in the sections of the MMDS book Chapter 2 (cf. Readings on
course webpage).

(2 points) Problem 10
What is speculative execution?

(3 points) Problem 11
Describe how serialization is achieved in Hadoop MapReduce.

(6 points) Problem 12
When implementing MapReduce programs in Hadoop, one common-sense debugging and development
strategy is to start small and build incrementally. Explain what is meant by this phrase
with respect to input data and implementation steps.

Part IV: MapReduce Algorithms
(2 points) Problem 13
Name three performance indicators to consider when analyzing MapReduce algorithms.

(5 points) Problem 14
(a) Name three use cases for secondary sort. Give an example composite key for each use case.
(b) The Partitioner in a secondary sort MapReduce implementation partitions the key-value
pairs by primary key to ensure that all the key-value pairs with the same primary key end
up at the same Reduce Task. Why do you need to additionally implement a custom Group
Comparator?
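
The interplay of sorting and grouping in part (b) can be explored with a small single-machine simulation. This is a sketch under stated assumptions, not Hadoop code: the (station, year, temperature) records are hypothetical, `sorted` stands in for the shuffle sort on the composite key, and `itertools.groupby` stands in for the grouping step that decides which records share one reduce call.

```python
from itertools import groupby

# Hypothetical (primary key, secondary key, value) records;
# none of this data comes from the exam.
RECORDS = [
    ("stationB", 2013, 17),
    ("stationA", 2014, 21),
    ("stationA", 2012, 19),
    ("stationB", 2011, 15),
]

def sort_by_composite_key(records):
    """Mimic the shuffle sort: order by (primary, secondary)."""
    return sorted(records, key=lambda r: (r[0], r[1]))

def reduce_calls(records, group_key):
    """Return one (key, values) entry per simulated reduce() call."""
    ordered = sort_by_composite_key(records)
    return [
        (key, [(r[1], r[2]) for r in group])
        for key, group in groupby(ordered, key=group_key)
    ]

# Grouping on the full composite key: every record gets its own call.
per_composite = reduce_calls(RECORDS, group_key=lambda r: (r[0], r[1]))
# Grouping on the primary key only: one call per station, years in order.
per_primary = reduce_calls(RECORDS, group_key=lambda r: r[0])
print(len(per_composite), len(per_primary))
```

Comparing `per_composite` with `per_primary` shows how the choice of grouping key changes the number of reduce calls while the sort order stays the same.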

(8 points) Problem 15
(a) Write down a MapReduce program using pseudo-code or short textual statements that
computes an inverted index.
Each entry in the index should be a word followed by a list of pairs (i, j), where i is a
unique identifier for the document, and j is the position of the word in the document.
(b) Consider the following three "documents," each consisting of a single sentence:

cats and dogs like to fight

take your cat to the dog store

we all like cats and we all like dogs

First, stem the words by replacing plurals by their singular forms. (Stemming involves other
transformations as well, but only plural-singular appears in these documents.) Construct an
inverted index for the above documents (using your MapReduce program developed in the
previous part). Now, i is the number of a document (1, 2, 3), and j is the position of the
word in the document (positions start at 1, count spaces).
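
A hand-built index can be compared against a small simulation such as the one below. It is a sketch, not a reference solution, and rests on two assumptions: j is taken to be the word position within the document (first word = 1), which gives different numbers than counting characters including spaces, and the lookup table `SINGULAR` stands in for a real stemmer because only these two plurals occur in the documents.

```python
from collections import defaultdict

DOCUMENTS = {
    1: "cats and dogs like to fight",
    2: "take your cat to the dog store",
    3: "we all like cats and we all like dogs",
}

# Assumption: a lookup table replaces a real stemmer for these documents.
SINGULAR = {"cats": "cat", "dogs": "dog"}

def mapper(doc_id, text):
    """Emit (stemmed word, (document id, word position)) pairs."""
    for position, word in enumerate(text.split(), start=1):
        yield SINGULAR.get(word, word), (doc_id, position)

def reducer(word, postings):
    """Emit the word with its postings sorted by document, then position."""
    yield word, sorted(postings)

def build_index(documents):
    grouped = defaultdict(list)   # stands in for shuffle and sort
    for doc_id, text in documents.items():
        for word, posting in mapper(doc_id, text):
            grouped[word].append(posting)
    return dict(entry for word in sorted(grouped)
                for entry in reducer(word, grouped[word]))

index = build_index(DOCUMENTS)
for word, postings in index.items():
    print(word, postings)
```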

Part V: Tools for Big Data Analysis

(2 points) Problem 16
Name four selection criteria when choosing the right tool for Big Data processing and analysis
tasks.

(12 points) Problem 17

Consider the following Big Data analysis tasks. Briefly explain (one sentence) the goal of each task,
name the Hadoop data analysis tool that you believe would be best-suited to accomplish the task,
and briefly explain your choice (1-2 sentences or bullet points):

(a) Business Intelligence Tool

(b) (Interactive) Analysis of crawled web documents
(c) Log-data Analysis
(d) Extract Transform Load (ETL)

(e) Frequently Bought Together (FBT)

(f) News Article Recommendation

(4 points each) Problem 18

Name four selection criteria when choosing the right tool for Big Data processing and analysis
tasks.

(3 points each) Problem 19

Choosing the best tool

(a) Which tool would be the best choice if you want to explore a data set but aren't yet sure
what fields it contains? Briefly state why.
(b) Which tool would be the best choice for a Java developer who wants to do image processing
on 75 million digital photos? Briefly state why.

(c) Which tool would be the best choice to implement the PageRank algorithm to rank 4 billion
webpages?
(d) Which tool would be the best choice to implement a linear perceptron classifier for text
categorization trained on a corpus of one million text documents represented as bags of
words over a vocabulary of 10,000 words? Briefly state why.

(e) Which tool would be the best choice for hosting a hotel customer database and reservation
system for a hotel chain operating 5,000 hotels in the US?

(f) Which tool would be the best choice for a Python developer who wants to do sentiment
analysis on 1 million tweets? Briefly state why.
(g) Which tool would be the best choice for someone who is already familiar with SQL and needs
to analyze a directory containing 20 TB of Web server log files? Briefly state why.

(h) Which tool would be the best choice for an analyst who is already familiar with SQL and
wants to quickly run several "what if" scenarios based on 10 billion detail records from a
Point of Sale system? Briefly state why.
(i) Which tool would be the best choice to implement an Extract Transform Load (ETL) workflow
integrating terabytes of data from multiple heterogeneous sources?

(j) Which tool would be the best choice for an analyst who wants to quickly run several "what
if" scenarios based on 10 billion detail records from a Point of Sale system? Briefly state why.

Part VI: MapReduce for Big Data Applications

(10 points) Problem 20

A simple approach to recommend items to users is to suggest the items that are most popular.
Briefly describe the MapReduce approach to retrieve those items. Include input, output, and brief
descriptions of Mapper(s) and Reducer(s).

Is this MapReduce approach scalable or do we have to expect memory issues for large input data?
Briefly justify your answer.
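
One way to reason about this problem is to simulate a counting job on a tiny input. The sketch below is an illustration only: the (user, item) interaction records are hypothetical, the helper names are made up, and the final top-k selection is done locally rather than in a second MapReduce job.

```python
from collections import defaultdict

# Hypothetical (user id, item id) interaction records.
RECORDS = [
    ("u1", "a"), ("u2", "a"), ("u3", "b"),
    ("u1", "c"), ("u2", "c"), ("u3", "c"),
]

def mapper(record):
    """Emit (item, 1) for every interaction."""
    _user_id, item_id = record
    yield item_id, 1

def reducer(item_id, ones):
    """Emit (item, total number of interactions)."""
    yield item_id, sum(ones)

def most_popular(records, k):
    grouped = defaultdict(list)   # stands in for shuffle and sort
    for record in records:
        for item_id, one in mapper(record):
            grouped[item_id].append(one)
    totals = [pair for item_id, ones in grouped.items()
              for pair in reducer(item_id, ones)]
    # Highest count first; ties broken by item id for a stable result.
    return sorted(totals, key=lambda pair: (-pair[1], pair[0]))[:k]

print(most_popular(RECORDS, 2))
```

Note that each reducer call here only needs a running sum, which is a useful starting point for the scalability question.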

(8 points) Problem 21
The essential part of many recommendation and classification approaches is to find similar data
points, such as text documents, movies, products, or users. Given the following utility matrix
representing ratings by users A, B, and C for items a through f
a b c d e f
A 4 5 5 1 2
B 3 4 5 1
C 2 1 3 1 5
find the most similar user to user A. That is, is user A more similar to user B or C?

(a) Compute the Jaccard similarity J(A, B) and J(A, C) between user A and users B and C.

(b) Now, treat ratings of 3, 4, and 5 as 1 and ratings of 1 and 2 as blank. Compute the Jaccard
similarity J(A, B) and J(A, C) between user A and users B and C.

(d) What are possible drawbacks of the Jaccard similarity?
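
For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below shows the computation on hypothetical ratings that are deliberately not taken from the utility matrix above; the names `rated_items`, `liked_items`, and the threshold parameter are illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0   # convention chosen here for two empty sets
    return len(a & b) / len(union)

def rated_items(ratings):
    """Set of items a user rated at all (ratings: item -> score)."""
    return set(ratings)

def liked_items(ratings, threshold=3):
    """Set of items rated at or above the threshold."""
    return {item for item, score in ratings.items() if score >= threshold}

# Hypothetical users, not the ones from the problem.
user_x = {"a": 5, "b": 2, "d": 4}
user_y = {"b": 1, "c": 3, "d": 5, "e": 4}

print(jaccard(rated_items(user_x), rated_items(user_y)))
print(jaccard(liked_items(user_x), liked_items(user_y)))
```

The first call ignores the rating values entirely, while the second only keeps items with a high rating, which mirrors the difference between parts (a) and (b).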

(15 points) Problem 22
Online news reading has become very popular as the web provides access to news articles from
millions of sources around the world. A key challenge for online news platforms is to help users
find news articles they are interested in by recommending such articles to them.

(a) What is the main difference between traditional newspapers and online news platforms?

(b) Explain the long tail phenomenon and what it means in the context of news article
recommendation.

(c) Name four properties of news articles that could be used as features for content-based
recommendation.
(d) State the pseudo-code of a MapReduce implementation for collaborative filtering for news
article recommendation using the cosine similarity on normalized ratings. Assume the
following input for each rating: (user-id, article-id, rating). (Consider ratings only and
ignore any additional information, such as the publication date of an article, or other
metadata.) You may use one variable called statistics for each user-article pair to store all
statistics required for the similarity computation. Carefully list the required statistics stored
in statistics and indicate in your pseudo-code when they are computed.

(e) Why do you need to sort the list of (user-id, rating) pairs in the Reducer of the first
MapReduce job?
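
Before writing the MapReduce pseudo-code for part (d), it can help to see the similarity computation itself on a single machine. The sketch below is not a MapReduce implementation and not the expected answer. It assumes that "normalized" means subtracting each user's mean rating, it compares two articles over the users who rated them, and the rating triples are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical (user-id, article-id, rating) triples.
RATINGS = [
    ("u1", "n1", 5), ("u1", "n2", 3), ("u1", "n3", 1),
    ("u2", "n1", 4), ("u2", "n2", 2), ("u2", "n3", 3),
]

def normalize(ratings):
    """Subtract each user's mean rating from that user's ratings."""
    by_user = defaultdict(list)
    for user, _article, rating in ratings:
        by_user[user].append(rating)
    means = {user: sum(vals) / len(vals) for user, vals in by_user.items()}
    return [(user, article, rating - means[user])
            for user, article, rating in ratings]

def cosine(ratings, article_x, article_y):
    """Cosine similarity of two articles' normalized rating vectors."""
    vectors = defaultdict(dict)   # article -> {user: normalized rating}
    for user, article, value in normalize(ratings):
        vectors[article][user] = value
    x, y = vectors[article_x], vectors[article_y]
    dot = sum(x[user] * y[user] for user in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0   # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)

print(cosine(RATINGS, "n1", "n2"))
```

The dot product and the two squared norms are the quantities a distributed version would have to accumulate per pair of articles.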
