bigdata-1
In the form of relational algebra implemented in SQL, relations are not sets, but
bags; that is, tuples are allowed to appear more than
once. There are extended definitions of union, intersection, and difference for
bags, which we shall define below. Write MapReduce algorithms for computing
the following operations on bags R and S:
(a) Bag Union, defined to be the bag of tuples in which the number of times
tuple t appears is the sum of the numbers of times it appears in R and S.
(b) Bag Intersection, defined to be the bag of tuples in which the number of
times tuple t appears is the minimum of the numbers of times it appears in
R and S.
(c) Bag Difference, defined to be the bag of tuples in which the number of
times a tuple t appears is equal to the number of times it appears in R
minus the number of times it appears in S. A tuple that appears more
times in S than in R does not appear in the difference.
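
One possible approach to all three parts, sketched here as an in-memory simulation rather than a real MapReduce job: the Map function keys each tuple on itself and tags it with the name of its relation, and the Reduce function counts the tags. The helper names (map_reduce, tag_mapper, bag_op) and the sample relations are assumptions made for illustration only.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Tiny in-memory stand-in for a MapReduce job (hypothetical helper)."""
    groups = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output

def tag_mapper(record):
    # Input record is (relation_name, tuple); the tuple itself is the key.
    name, t = record
    yield t, name

def union_reducer(t, names):
    # t appears once per occurrence in either relation.
    return [t] * len(names)

def intersection_reducer(t, names):
    return [t] * min(names.count("R"), names.count("S"))

def difference_reducer(t, names):
    # Never emit a negative number of copies.
    return [t] * max(names.count("R") - names.count("S"), 0)

def bag_op(R, S, reducer):
    inputs = [("R", t) for t in R] + [("S", t) for t in S]
    return sorted(map_reduce(inputs, tag_mapper, reducer))

# Assumed sample bags, not taken from the exercise.
R = [(1, 2), (1, 2), (3, 4)]
S = [(1, 2), (5, 6), (3, 4), (3, 4)]

union = bag_op(R, S, union_reducer)
intersection = bag_op(R, S, intersection_reducer)
difference = bag_op(R, S, difference_reducer)
```

With these sample bags, (1, 2) appears twice in R and once in S, so it should appear three times in the union, once in the intersection, and once in the difference.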

The grouping and aggregation on the relation R(A, B), where A is the
grouping attribute and B is aggregated by the MAX operation. Assume
A and B have domains of size a and b, respectively.
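
To make the operation concrete, here is a minimal sketch of grouping by A and aggregating B with MAX, simulated in memory. The function name group_max and the sample tuples are assumptions for illustration. Note that there are a·b possible input tuples but at most a distinct keys, so at most a outputs.

```python
from collections import defaultdict

def group_max(R):
    """Group tuples (a, b) by a and keep the largest b in each group."""
    # Map phase: each tuple (a, b) becomes the key-value pair a -> b.
    groups = defaultdict(list)
    for a, b in R:
        groups[a].append(b)
    # Reduce phase: one reducer per key a emits (a, max of its b-values).
    return {a: max(bs) for a, bs in groups.items()}

# Assumed sample relation, not taken from the exercise.
R = [("x", 3), ("y", 7), ("x", 9), ("y", 1), ("z", 4)]
result = group_max(R)
```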

Suppose our inputs are bit strings of length b, and the outputs
correspond to pairs of strings at Hamming distance 1.
(a) Prove that a reducer of size q can cover at most (q/2) log_2 q outputs.
(b) Use part (a) to show the lower bound on replication rate: r ≥ b / log_2 q.
(c) Show that there are algorithms with replication rate as given by part (b)
for the cases q = 2, q = 2^b, and q = 2^(b/2).
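
As a numerical sanity check rather than a proof, the sketch below brute-forces one family of mapping schemas for a small b. It assumes a "splitting" scheme: cut each string into k equal segments and send the string to k reducers, the i-th keyed by all bits except segment i. Taking k = 1, 2, and b should give reducer sizes 2^b, 2^(b/2), and 2 respectively. The function names and the choice b = 4 are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product
from math import log2

def splitting_schema(b, k):
    """Assign every b-bit string to k reducers (assumes k divides b)."""
    seg = b // k
    reducers = defaultdict(list)
    for bits in product("01", repeat=b):
        s = "".join(bits)
        for i in range(k):
            # Key: index of the dropped segment plus all remaining bits.
            key = (i, s[:i * seg] + s[(i + 1) * seg:])
            reducers[key].append(s)
    return reducers

def analyse(b, k):
    """Return (max reducer size q, replication rate r, outputs covered)."""
    reducers = splitting_schema(b, k)
    q = max(len(v) for v in reducers.values())
    r = sum(len(v) for v in reducers.values()) / 2 ** b
    covered = set()
    for strings in reducers.values():
        for x in strings:
            for y in strings:
                # Count each unordered pair at Hamming distance 1 once.
                if x < y and sum(c != d for c, d in zip(x, y)) == 1:
                    covered.add((x, y))
    return q, r, len(covered)

b = 4                          # small enough to enumerate all 2^b inputs
total_outputs = b * 2 ** b // 2
results = [analyse(b, k) for k in (1, 2, b)]
for q, r, covered in results:
    print(f"q={q:2d}  r={r}  b/log2(q)={b / log2(q)}  "
          f"covered={covered}/{total_outputs}")
```

For b = 4 there are 32 pairs at distance 1. In each of the three cases, the schema should cover all 32 with a replication rate equal to b / log_2 q.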