bigdata-1

Uploaded by

Ramesh Raj
Copyright
© All Rights Reserved
Suppose we execute the word-count MapReduce program described in this section
on a large repository such as a copy of the Web. We shall use 100 Map tasks
and some number of Reduce tasks.
(a) Suppose we do not use a combiner at the Map tasks. Do you expect there
to be significant skew in the times taken by the various reducers to process
their value list? Why or why not?
(b) If we combine the reducers into a small number of Reduce tasks, say 10
tasks, at random, do you expect the skew to be significant? What if we
instead combine the reducers into 10,000 Reduce tasks?
! (c) Suppose we do use a combiner at the 100 Map tasks. Do you expect skew
to be significant? Why or why not?

Design MapReduce algorithms to take a very large file of integers and produce
as output:
(a) The largest integer.
(b) The average of all the integers.
(c) The same set of integers, but with each integer appearing only once.
(d) The count of the number of distinct integers in the input.
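
One possible approach to parts (a) through (d) can be sketched with a tiny in-memory stand-in for a MapReduce round. The helper `run_mapreduce` and the key names `"max"`, `"avg"`, and `"count"` are assumptions made for illustration, not part of any real framework, and the sketch assumes a non-empty input. Routing everything to a single key is only reasonable when a combiner reduces the data at each Map task first.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-machine simulation of one MapReduce round."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # group values by key (the shuffle)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


def largest(integers):
    # (a) Every integer goes to one key; the reducer takes the maximum.
    return run_mapreduce(integers,
                         lambda x: [("max", x)],
                         lambda key, values: [max(values)])[0]


def average(integers):
    # (b) Emit (sum, count) pairs so partial results could be combined.
    def reducer(key, values):
        total = sum(s for s, _ in values)
        count = sum(n for _, n in values)
        return [total / count]
    return run_mapreduce(integers, lambda x: [("avg", (x, 1))], reducer)[0]


def distinct(integers):
    # (c) The integer itself is the key; the reducer emits each key once.
    return run_mapreduce(integers,
                         lambda x: [(x, None)],
                         lambda key, values: [key])


def count_distinct(integers):
    # (d) Second round: count the outputs of the first round.
    return run_mapreduce(distinct(integers),
                         lambda x: [("count", 1)],
                         lambda key, values: [sum(values)])[0]
```

Part (d) is written as two rounds because a single reducer cannot both deduplicate by key and count across keys.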

Our formulation of matrix-vector multiplication assumed that the matrix M was
square. Generalize the algorithm to the case where M is an r-by-c matrix for
some number of rows r and columns c.
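
A minimal sketch of one way the generalization might look, assuming the vector v (of length c) fits in memory at every Map task and the matrix is stored as sparse triples (i, j, m_ij). The helper `run_mapreduce` and the function name `mat_vec` are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-machine simulation of one MapReduce round."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


def mat_vec(entries, v):
    """Multiply an r-by-c matrix, given as (i, j, m_ij) triples, by v.

    Assumes len(v) == c. Returns {i: x_i} for every row i that has at
    least one stored entry.
    """
    def mapper(entry):
        i, j, m_ij = entry
        return [(i, m_ij * v[j])]      # key is the row index

    def reducer(i, products):
        return [(i, sum(products))]    # x_i = sum over j of m_ij * v_j

    return dict(run_mapreduce(entries, mapper, reducer))
```

Nothing in the Map or Reduce function depends on the matrix being square: the column index only selects a component of v, and the row index only names an output component.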

In the form of relational algebra implemented in SQL, relations are not sets, but
bags; that is, tuples are allowed to appear more than
once. There are extended definitions of union, intersection, and difference for
bags, which we shall define below. Write MapReduce algorithms for computing
the following operations on bags R and S:
(a) Bag Union, defined to be the bag of tuples in which tuple t appears the
sum of the numbers of times it appears in R and S.
(b) Bag Intersection, defined to be the bag of tuples in which tuple t appears
the minimum of the numbers of times it appears in R and S.
(c) Bag Difference, defined to be the bag of tuples in which the number of
times a tuple t appears is equal to the number of times it appears in R
minus the number of times it appears in S. A tuple that appears more
times in S than in R does not appear in the difference.
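
The three definitions above can be illustrated with a small simulation. This is a sketch under stated assumptions: `run_mapreduce` and `bag_op` are hypothetical names, each input record is tagged with the relation it came from, and the reducer simply counts how often a tuple arrived from each relation.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-machine simulation of one MapReduce round."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


def bag_op(R, S, op):
    """Compute the bag union, intersection, or difference of bags R and S."""
    records = [("R", t) for t in R] + [("S", t) for t in S]

    def mapper(record):
        name, t = record
        return [(t, name)]             # key: the tuple, value: its origin

    def reducer(t, names):
        r, s = names.count("R"), names.count("S")
        if op == "union":
            copies = r + s
        elif op == "intersection":
            copies = min(r, s)
        elif op == "difference":
            copies = max(r - s, 0)     # never a negative number of copies
        else:
            raise ValueError("unknown operation: " + op)
        return [t] * copies

    return run_mapreduce(records, mapper, reducer)
```

For bag union alone the reducer does no real work, since every incoming copy is passed through; the counting only matters for intersection and difference.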

Selection can also be performed on bags. Give a MapReduce implementation that
produces the proper number of copies of each tuple t that passes the selection
condition. That is, produce key-value pairs from which the correct result of
the selection can be obtained easily from the values.

The relational-algebra operation $R(A, B) \bowtie_{B<C} S(C, D)$ produces all
tuples (a, b, c, d) such that tuple (a, b) is in relation R, tuple (c, d) is
in S, and b < c. Give a MapReduce implementation of this operation, assuming
R and S are sets.

Suppose a job consists of n tasks, each of which takes time t seconds. Thus,
if there are no failures, the sum over all compute nodes of the time taken to
execute tasks at that node is nt. Suppose also that the probability of a task
failing is p per job per second, and when a task fails, the overhead of
management of the restart is such that it adds 10t seconds to the total
execution time of the job. What is the total expected execution time of the
job?

Suppose a Pregel job has a probability p of a failure during any superstep.
Suppose also that the execution time (summed over all compute nodes) of taking
a checkpoint is c times the time it takes to execute a superstep. To minimize
the expected execution time of the job, how many supersteps should elapse
between checkpoints?

What is the communication cost of each of the following algorithms, as a
function of the size of the relations, matrices, or vectors to which they are
applied?
(a) The matrix-vector multiplication algorithm of Section 2.3.2.
(b) The union algorithm of Section 2.3.6.
(c) The aggregation algorithm of Section 2.3.8.

Suppose relations R, S, and T have sizes r, s, and t, respectively, and we
want to take the 3-way join $R(A, B) \bowtie S(B, C) \bowtie T(A, C)$, using k
reducers. We shall hash values of attributes A, B, and C to a, b, and c
buckets, respectively, where abc = k. Each reducer is associated with a vector
of buckets, one for each of the three hash functions. Find, as a function of
r, s, t, and k, the values of a, b, and c that minimize the communication cost
of the algorithm.
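
A hedged numeric sketch of one way to attack this exercise. It assumes the usual replication pattern: a tuple of R(A, B) lacks a C-value and is sent to c reducers, a tuple of S(B, C) is sent to a reducers, and a tuple of T(A, C) is sent to b reducers, so the cost to be minimized is rc + sa + tb subject to abc = k. Setting the Lagrangian derivatives to zero gives sa = tb = rc, and multiplying the three equal terms gives a common value of (krst)^(1/3). The function names are hypothetical, and the shares are treated as real numbers; in practice they would be rounded to integers of at least 1.

```python
def join_cost(r, s, t, a, b, c):
    """Communication cost from Map to Reduce under the assumed replication."""
    return r * c + s * a + t * b


def optimal_shares(r, s, t, k):
    """Real-valued shares (a, b, c) with a*b*c = k that equalize s*a, t*b, r*c."""
    common = (k * r * s * t) ** (1.0 / 3.0)  # value of s*a = t*b = r*c
    return common / s, common / t, common / r


if __name__ == "__main__":
    a, b, c = optimal_shares(8, 1, 1, 8)
    print(a, b, c, join_cost(8, 1, 1, a, b, c))
```

With equal relation sizes the shares come out equal, as symmetry suggests; a much larger R pushes c down so that R's tuples are replicated less.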

Suppose we take a star join of a fact table $F(A_1, A_2, \ldots, A_m)$ with
dimension tables $D_i(A_i, B_i)$ for $i = 1, 2, \ldots, m$. Let there be k
reducers, each associated with a vector of buckets, one for each of the key
attributes $A_1, A_2, \ldots, A_m$. Suppose the number of buckets into which
we hash $A_i$ is $a_i$. Naturally, $a_1 a_2 \cdots a_m = k$. Finally, suppose
each dimension table $D_i$ has size $d_i$, and the size of the fact table is
much larger than any of these sizes. Find the values of the $a_i$'s that
minimize the cost of taking the star join as one MapReduce operation.

Describe the graphs that model the following problems.
(a) The multiplication of an n × n matrix by a vector of length n.
(b) The natural join of R(A, B) and S(B, C), where A, B, and C have domains of
sizes a, b, and c, respectively.
(c) The grouping and aggregation on the relation R(A, B), where A is the
grouping attribute and B is aggregated by the MAX operation. Assume
A and B have domains of size a and b, respectively.

Provide the details of the proof that a one-pass matrix-multiplication
algorithm requires replication rate at least $r \ge 2n^2/q$, including:
(a) The proof that, for a fixed reducer size, the maximum number of outputs
are covered by a reducer when that reducer receives an equal number of
rows of M and columns of N.
(b) The algebraic manipulation needed, starting with
$\sum_{i=1}^{k} q_i^2 \ge 4n^4$.

Suppose our inputs are bit strings of length b, and the outputs
correspond to pairs of strings at Hamming distance 1.
(a) Prove that a reducer of size q can cover at most $(q/2)\log_2 q$ outputs.
(b) Use part (a) to show the lower bound on replication rate:
$r \ge b/\log_2 q$.
(c) Show that there are algorithms with replication rate as given by part (b)
for the cases $q = 2$, $q = 2^b$, and $q = 2^{b/2}$.
