MapReduce (Streaming) TP Report
1. Introduction
In this practical session, I implemented and executed MapReduce jobs using Hadoop Streaming. My objective was to
understand how to process large datasets using Python scripts as Mappers and Reducers instead of Java. I followed a step-
by-step approach:
Work 1: Implementing a word count program using MapReduce.
Work 2: Running the program on Hadoop to analyze word occurrences.
Work 3: Analyzing a sales dataset to compute store revenues.
This TP allowed me to explore Hadoop’s distributed data processing model and practice Python-based MapReduce
programming.
2. Work 1: Local Testing of MapReduce (Word Count)
1. Sharing files between local and VM machine
a) Transferring Programs from Windows to Cloudera via a Shared Directory
c) Execute the mapper with file1 as its input and verify the results. Do the same with file2.
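A minimal word-count mapper along these lines can be tested locally by piping a file into it. This is a sketch (the file names file1 and file2 come from the exercise; the function name map_line is mine):

```python
#!/usr/bin/env python3
# wordcount_mapper.py -- sketch of a Hadoop Streaming mapper for word count.
import sys

def map_line(line):
    """Emit one (word, 1) pair per whitespace-separated token."""
    return [(word, 1) for word in line.strip().split()]

if __name__ == "__main__":
    # Streaming mappers read records from stdin and write
    # tab-separated key/value pairs to stdout.
    for line in sys.stdin:
        for word, count in map_line(line):
            print(f"{word}\t{count}")
```

Locally it can be exercised with something like `cat file1 | python3 wordcount_mapper.py`, then repeated with file2.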
2. Mapper
a) Identify the pair that you should use.
The key-value pair to use is <store, cost>, where the key is the store name and the value is the purchase
cost. This allows us to group all purchases by store and calculate the total sales for each store in the
reduce step.
d) Test the mapper locally and verify the results (test on the first 10 lines). Is there any error?
e) Add a control statement to avoid the problem. Verify the results.
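A mapper for the purchases file could look like the sketch below. I assume the common tab-separated layout date, time, store, item, cost, payment (six fields, store third, cost fifth); adjust the indices to the actual file. The length check is the control statement that skips malformed lines:

```python
#!/usr/bin/env python3
# purchases_mapper.py -- sketch: emit <store, cost> pairs.
import sys

def map_line(line):
    """Return (store, cost) for a well-formed record, else None.

    Assumed field layout (tab-separated): date, time, store, item,
    cost, payment. The length check skips malformed lines.
    """
    fields = line.strip().split("\t")
    if len(fields) != 6:          # control statement: reject bad records
        return None
    return fields[2], fields[4]   # (store, cost)

if __name__ == "__main__":
    for line in sys.stdin:
        pair = map_line(line)
        if pair:
            print(f"{pair[0]}\t{pair[1]}")
```

Without the length check, a short or empty line would raise an IndexError, which is the kind of problem the local test on the first 10 lines is meant to expose.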
3. Reducer
a) Draw the flowchart of the reducer (we call it purchases_reducer)
b) Write the code of “purchases_reducer.py” in python
c) Test the reducer locally and verify the results (test on the first 20 lines). Do the sum manually to check.
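A purchases_reducer.py along these lines exploits the fact that Hadoop Streaming delivers the mapper output sorted by key, so one pass with a running total per store is enough (the helper name reduce_stream is mine):

```python
#!/usr/bin/env python3
# purchases_reducer.py -- sketch: sum costs per store.
import sys

def reduce_stream(lines):
    """Sum cost values over consecutive lines sharing the same key.

    Streaming sorts mapper output by key before the reducer runs,
    so all records for one store arrive contiguously.
    """
    current_store, total = None, 0.0
    results = []
    for line in lines:
        store, _, cost = line.strip().partition("\t")
        if store != current_store:
            if current_store is not None:
                results.append((current_store, total))
            current_store, total = store, 0.0
        total += float(cost)
    if current_store is not None:
        results.append((current_store, total))   # flush the last store
    return results

if __name__ == "__main__":
    for store, total in reduce_stream(sys.stdin):
        print(f"{store}\t{total}")
```

For the local test, the mapper output must be sorted first, e.g. `head -20 purchases.txt | python3 purchases_mapper.py | sort | python3 purchases_reducer.py`, and the totals compared against a manual sum.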
4. MapReduce on Hadoop
a) Put it all together and run the MapReduce job on the whole file "purchases.txt"
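On the cluster this amounts to a Hadoop Streaming invocation along the following lines (the jar path and HDFS paths are assumptions; on Cloudera the streaming jar typically lives under /usr/lib/hadoop-mapreduce):

```bash
# Sketch of the full-file run; adjust jar and HDFS paths to your setup.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -mapper purchases_mapper.py \
    -reducer purchases_reducer.py \
    -file purchases_mapper.py -file purchases_reducer.py \
    -input /user/cloudera/purchases.txt \
    -output /user/cloudera/sales_by_store
```

The per-store totals can then be inspected with `hadoop fs -cat /user/cloudera/sales_by_store/part-00000`.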
5. Sales by Category
a) Establish the list of sales by category
b) What is the sales value for the Toys category?
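Only the mapper needs to change for this question: it emits the category as the key instead of the store, and the same reducer sums the values. I assume here that the category is the fourth of six tab-separated fields; adjust the index to the actual dataset:

```python
#!/usr/bin/env python3
# category_mapper.py -- sketch: emit <category, cost> pairs.
import sys

def map_line(line):
    """Return (category, cost), assuming the category is the 4th of
    six tab-separated fields; malformed lines are skipped."""
    fields = line.strip().split("\t")
    if len(fields) != 6:
        return None
    return fields[3], fields[4]   # (category, cost)

if __name__ == "__main__":
    for line in sys.stdin:
        pair = map_line(line)
        if pair:
            print(f"{pair[0]}\t{pair[1]}")
```

The Toys figure is then read directly from the reducer output line whose key is Toys.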