Assignment 1 - 2024

The document describes two MapReduce assignments. The first asks to analyze population census data to count births by month, and suggests handling many records by adding more reducers. The second asks to precompute friend recommendations by counting common friends between users, and provides sample input/output and asks to design the MapReduce job.

Uploaded by

claudia wong
Copyright
© All Rights Reserved

Assignment 1: MapReduce Design

Out: Feb 29, 2024

Due: Mar 12, 2024, end of day

1. (4pts) Suppose the National Bureau of Statistics wants to analyze its population census data.
The dataset contains the details of babies born in the United States in 2017. Each record is of the
form and there are around 17.23 million
records. In order to find the number of babies born during each month of the year, you come up
with the following mapper and reducer.

The MapReduce cluster provided to you consists of N mappers but only 2 reducers, as shown
in the figure above. Reducer1 receives all (key, value) pairs where keys are between A and M
inclusive, and Reducer2 receives all (key, value) pairs where keys are between N and Z inclusive.

Given that the mapper and reducer functions produce the correct output, what possible issue(s) could
you face while processing a job consisting of 17.23 million records? Suggest a workaround for
each issue.
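
The partitioning rule above can be explored with a minimal sketch. Because the mapper figure is not reproduced here, the sketch assumes the mapper emits the English month name as its key; that assumption, the function name range_partition, and the key list are illustrative and not part of the assignment.

```python
from collections import Counter

# Assumption: the mapper's output key is the English month name.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def range_partition(key: str) -> str:
    """Mimic the rule in the question: keys A-M go to Reducer1, N-Z to Reducer2."""
    first = key[0].upper()
    return "Reducer1" if "A" <= first <= "M" else "Reducer2"

# Count how many distinct keys each reducer would be responsible for.
keys_per_reducer = Counter(range_partition(month) for month in MONTHS)
print(keys_per_reducer)
```

Under this assumption the two reducers are responsible for different numbers of distinct keys, which is one way to start reasoning about load balance. If the real mapper uses a different key (for example a month number), the split would have to be re-examined.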

2. (6pts) Facebook has a list of friends (note that friends are a bi-directional thing on Facebook.
If I'm your friend, you're mine). They also have lots of disk space and they serve hundreds of
millions of requests every day. They've decided to pre-compute calculations when they can to
reduce the processing time of requests. One common processing request is the "You and Joe
have 230 friends in common" feature. When you visit someone's profile, you see a list of friends
that you have in common. This list doesn't change frequently so it'd be wasteful to recalculate it
every time you visit the profile. We're going to use MapReduce so that we can calculate
everyone's common friends once a day and store those results.

Assume each record in the input file is stored as User: [List of Friends], and each list of friends is sorted, for
example

A: [B, C, D, E, F]

B: [A, C, F]

C: [A, B]

D: [A]

E: [A]

F: [A, B]

This friendship network can be visualized as

As you can see, A and B have common friends C and F, and A and E have no common friend. Your
output will contain a common friends list for every pair of users (excluding pairs with no
common friend). For each pair of users, you only need to output their common friends once. For
example, you will output the common friends for [B, C] and [C, B] only once, as [B, C]. It's also
okay if the common friends list is not sorted. For the above data, you will have the output:

[A, B]: [C, F]

[A, C]: [B]

[A, F]: [B]

[B, C]: [A]


[B, D]: [A]

[B, E]: [A]

[B, F]: [A]

[C, D]: [A]

[C, E]: [A]

[C, F]: [A, B]

[D, E]: [A]

[D, F]: [A]

[E, F]: [A]
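
The expected output above can be cross-checked with a small single-machine reference. This is only a brute-force sketch for validating results on the sample data, not a MapReduce design; the names friends and common_friends_bruteforce are illustrative.

```python
from itertools import combinations

# Sample input from the assignment, as User -> sorted list of friends.
friends = {
    "A": ["B", "C", "D", "E", "F"],
    "B": ["A", "C", "F"],
    "C": ["A", "B"],
    "D": ["A"],
    "E": ["A"],
    "F": ["A", "B"],
}

def common_friends_bruteforce(friends):
    """Intersect the friend lists of every unordered user pair."""
    result = {}
    # combinations over sorted users yields each pair once, smaller name first.
    for u, v in combinations(sorted(friends), 2):
        shared = sorted(set(friends[u]) & set(friends[v]))
        if shared:  # pairs with no common friend are omitted
            result[(u, v)] = shared
    return result

expected = common_friends_bruteforce(friends)
for pair, shared in expected.items():
    print(list(pair), shared)
```

Comparing every pair of users is quadratic in the number of users, so this approach is only suitable as a correctness check on small inputs.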

Given this input and desired output, design a MapReduce job to perform the required processing.
In particular, detail the sequence of map/reduce phases of your algorithm: what are the map keys,
what are the map values, what are the reduce keys, what are the reduce values, what does the
map function do, what does the reduce function do. Also indicate if there is a possibility to use a
combiner at each step. You can use natural language, diagrams, examples AND/OR pseudo-code
to describe the algorithm, as you prefer (so long as it is readable).

(hint: given one input line A: [B, C, D, E, F], you can infer that B and C have common friend
A. It's also okay to use a tuple, such as (user1, user2), as the key of a key-value pair.)
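
One possible reading of the hint can be simulated in plain Python. The sketch below is not a reference solution and does not use any real MapReduce framework: the shuffle phase is imitated with a dictionary, and the names map_common, reduce_common and run_job are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Sample input from the assignment, as User -> sorted list of friends.
records = {
    "A": ["B", "C", "D", "E", "F"],
    "B": ["A", "C", "F"],
    "C": ["A", "B"],
    "D": ["A"],
    "E": ["A"],
    "F": ["A", "B"],
}

def map_common(user, friend_list):
    """For each pair of the user's friends, emit ((f1, f2), user).

    Sorting first keeps the smaller name first in the key, so [B, C] and
    [C, B] always collapse into the single key (B, C).
    """
    for f1, f2 in combinations(sorted(friend_list), 2):
        yield (f1, f2), user

def reduce_common(pair, users):
    """Collect every user that was emitted for this pair."""
    return pair, sorted(users)

def run_job(records):
    # Imitate the shuffle: group all mapper values by key.
    shuffled = defaultdict(list)
    for user, friend_list in records.items():
        for key, value in map_common(user, friend_list):
            shuffled[key].append(value)
    return dict(reduce_common(k, v) for k, v in sorted(shuffled.items()))

output = run_job(records)
for pair, shared in output.items():
    print(list(pair), shared)
```

In this sketch the map key is a (user1, user2) tuple, the map value is the user whose record produced the pair, and the reducer only gathers the values. Other designs are possible, for example emitting each (user, friend) pair with the full friend list and intersecting the lists in the reducer; the choice affects how much data is shuffled.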
