
M2 MIAGE Lab Session

Exercise sheet: understanding the main principles behind SQL and MapReduce via Python programming

Searching via partitioning. Write a Python program that

• generates a collection I of 900,000 random integers.

• then asks the user for 5 integers and, for each one, answers YES if the
number is in I, NO otherwise.

Provide two possible implementations: one simply using a list for I and a linear search,
the other using a partitioned list for I, that is, a list of lists, each sub-list being a
partition. In this second case, you can use a Python dictionary, in which a couple (key,
value) represents one partition: the key is the label/index of the partition, while the value is
the list of values in the partition. How many partitions? How could you distribute the
900,000 numbers over the partitions? Could a hashing function based on the modulo
operator help? Recall that H(n) = n mod m, where n and m are positive integers, returns a
number between 0 and m-1 for any n.
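A minimal sketch of the partitioned approach is given below. The function names and the choice of m = 100 partitions are illustrative assumptions, not prescribed by the exercise; the point is that a lookup only needs to scan the single partition selected by H(n) = n mod m.

```python
import random

def build_partitions(values, m=100):
    """Distribute values into m partitions keyed by H(n) = n mod m."""
    parts = {i: [] for i in range(m)}
    for n in values:
        parts[n % m].append(n)
    return parts

def contains(parts, n, m=100):
    """Only the partition with key n mod m needs to be searched."""
    return n in parts[n % m]

# illustrative value range; the exercise only fixes the collection size
I = [random.randint(0, 1_000_000) for _ in range(900_000)]
parts = build_partitions(I)
```

On average each partition holds 900,000 / m values, so a membership test scans roughly 1/m of the data compared with a linear search over a flat list.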

Grouping and aggregation. Write a Python program that:

- generates a collection C of 30,000,000 couples (k, v), where k is a random integer
between 1 and 7 and v is a random integer between 1 and 1000.

- computes the sum of the values v for each k value; for instance, the couple (3, S) should
be produced, where S = Σ(3,v)∈C v

- first provide a naive solution, by means of a function groupAndSum(C), where C is
the collection of couples (k, v). Please do not use any form of dictionary (or nested
list). You can use a (flat) list to keep track of already processed key values.
Then estimate the cost, in terms of execution time, of groupAndSum(C); if n is the
number of couples, is the execution time linear in n, quadratic (n²), or
other?

- how could you improve execution time and memory consumption? Could sorting
the (k, v) couples on k help? If so, use a technique you know to sort C on k, and then
provide a Python function groupSortedAndSum(C), where C is now assumed to be
sorted on k. Then observe/estimate the cost of sorting + grouping and summing. Do
you notice any improvements?

- could partitioning based on the modulo operator help in lowering execution time? If
so, provide a Python program relying on modulo-based partitioning, re-using
groupAndSum(C) on each part. Does execution time improve? Do you believe that
with partitioning, a parallel approach in which the parts are processed in parallel would
be even better?
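A sketch of the sorted variant follows. Since C is sorted on k, all couples sharing a key are adjacent, so one linear pass suffices; a running sum is flushed whenever the key changes. The collection size is scaled down to 100,000 here purely to keep the demo quick; the exercise asks for 30,000,000.

```python
import random

def groupSortedAndSum(C):
    """Single linear pass over C, assumed sorted on k."""
    result = []
    cur_k, cur_sum = None, 0
    for k, v in C:
        if k != cur_k:
            if cur_k is not None:
                result.append((cur_k, cur_sum))  # flush previous group
            cur_k, cur_sum = k, 0
        cur_sum += v
    if cur_k is not None:
        result.append((cur_k, cur_sum))          # flush last group
    return result

C = sorted((random.randint(1, 7), random.randint(1, 1000))
           for _ in range(100_000))
totals = groupSortedAndSum(C)
```

The grouping pass is O(n); the overall cost is then dominated by the O(n log n) sort, which is still far better than a quadratic scan when the list of already-seen keys must be searched for every couple.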

n-way merge-sort. Design and implement in Python an algorithm for n-way merge-sort of
n lists of numbers. We assume that each of the n lists is sorted in ascending order.
The algorithm is encapsulated into a function, which we name n-way-ms, taking as input the
list of sorted lists.

For instance, if L = [ [2, 3.5, 3.7, 4.5, 6, 7], [2.6, 3.6, 6.6, 9], [3, 7, 10] ]

then n-way-ms(L) returns

[ 2, 2.6, 3, 3.5, 3.6, 3.7, 4.5, 6, 6.6, 7, 7, 9, 10]

Consider that you are not allowed to use any sort operation; the n-way-ms algorithm is
required to produce its output by means of a single linear scan over all the lists in the input list.

In coding the algorithm, it is worth using the pop(i) method for lists. For instance, l.pop(0)
returns the first value of the list and discards it from the list itself.

Question: can you estimate the cost of the algorithm? Do you believe more efficient
versions exist?

Remark: besides being interesting per se, n-way merge-sort is important to know
because this kind of data manipulation lies behind one of the main steps of shuffle-and-sort
in MapReduce-based systems.
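One possible sketch, following the pop(0) hint above (the name is written n_way_ms because hyphens are not valid in Python identifiers): at each step, pick the list whose current head is smallest, pop that head, and drop any list that becomes empty.

```python
def n_way_ms(L):
    """Merge n lists, each sorted ascending, into one sorted list."""
    lists = [l[:] for l in L if l]   # work on copies; skip empty inputs
    out = []
    while lists:
        # index of the list whose current head is smallest
        i = min(range(len(lists)), key=lambda j: lists[j][0])
        out.append(lists[i].pop(0))  # pop(0) as suggested above
        if not lists[i]:
            del lists[i]             # drop a list once exhausted
    return out

L = [[2, 3.5, 3.7, 4.5, 6, 7], [2.6, 3.6, 6.6, 9], [3, 7, 10]]
print(n_way_ms(L))  # [2, 2.6, 3, 3.5, 3.6, 3.7, 4.5, 6, 6.6, 7, 7, 9, 10]
```

With N total elements across n lists, each output element costs one scan of up to n heads (plus the linear shift inside pop(0)), so this version is O(N·n); a min-heap over the heads would bring the head selection down to O(log n) per element.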

Counting words. Write a Python function that takes a list T of strings and returns a list of
couples (w, c), where w is a word occurring in the strings of T and c is the number of
occurrences of w. For instance, if

T = ["aa bb cc", "bb aa gg", "aa gg"], then the list returned is [("aa", 3), ("bb", 2), ("cc", 1), ("gg", 2)]

In case T is a big collection of text lines, would techniques previously explored help for
improving execution time?
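A straightforward dictionary-based sketch is shown below (the function name count_words is an illustrative choice). Each line is split into words and a per-word counter is updated, mirroring the map (emit word) and reduce (sum counts) phases discussed above.

```python
def count_words(T):
    """Return a list of (word, count) couples for the strings in T."""
    counts = {}
    for line in T:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1  # increment this word's count
    return list(counts.items())

T = ["aa bb cc", "bb aa gg", "aa gg"]
result = count_words(T)
```

For a very large T, the modulo-based partitioning idea from the grouping exercise applies here as well: words can be hashed into parts that are counted independently, possibly in parallel.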