
MapReduce (Streaming) TP Report

Date: 04/04/2025
Full Name: Nourhène Guellouz
Class: MP DS 1

1. Introduction
In this practical session, I implemented and executed MapReduce jobs using Hadoop Streaming. My objective was to understand how to process large datasets using Python scripts as Mappers and Reducers instead of Java. I followed a step-by-step approach:
Work 1: Implementing a word count program using MapReduce.
Work 2: Running the program on Hadoop to analyze word occurrences.
Work 3: Analyzing a sales dataset to compute store revenues.
This TP allowed me to explore Hadoop’s distributed data processing model and practice Python-based MapReduce
programming.
2. Work 1: Local Testing of MapReduce (Word Count)
1. Sharing files between the local machine and the VM
a) Transferring the programs from Windows to Cloudera via a shared directory:

b) Drop both files in the shared directory


c) Create a folder on the Cloudera VM to hold the two files: “/home/cloudera/TP4”
d) Copy the scripts into that folder
e) Verify that they exist

f) Read the two files in two different ways.

2. Understanding the mapper code


a) Read the “mapper” function and explain how it works
The mapper function reads input data line by line, processes each line, and emits key-value pairs as output. It first strips unnecessary whitespace and splits the line into words (or relevant fields). Each extracted word is then emitted in the format "key \t value", where the key is the word (or another grouping factor) and the value is a count of 1. This output is passed to Hadoop’s shuffle and sort phase, which groups the data by key before sending it to the reducer. By breaking the input into small, independent records, the mapper lets large datasets be processed efficiently in parallel.
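A minimal sketch of such a word-count mapper, written in the usual Hadoop Streaming style (the exact script used in the TP may differ in details):

    #!/usr/bin/env python
    # mapper.py: a minimal word-count mapper sketch.
    import sys

    for line in sys.stdin:            # Hadoop Streaming feeds the input line by line on stdin
        line = line.strip()           # remove leading/trailing whitespace and the newline
        for word in line.split():     # split the line into words on whitespace
            # emit one <word, 1> pair per word occurrence; Hadoop's shuffle/sort
            # phase groups identical keys before they reach the reducer
            print('%s\t%s' % (word, 1))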
b) Create two files, file1 and file2, with random text and put them in “/home/cloudera/TP4”

c) Execute the mapper with file1 as its input and verify the results. Do the same with file2.

d) Execute the mapper for both file1 and file2. Interpret.


When I ran the mapper function on both file1.txt and file2.txt, I noticed that the results were different for each file. This makes sense because the mapper processes each file separately: it reads the lines, splits them into words, and outputs each word with a count of 1. Since the contents of file1.txt and file2.txt are not the same, the words in the two outputs also differ. Words that appear in both files show up in both outputs, but not necessarily in the same order or quantity. At this stage the mapper does not combine or count repeated words; it only prepares the data for the reducer, which will later sum up the occurrences of each word across both files.
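For instance (with hypothetical file contents), if file1 contained the line "big data" and file2 the line "big cluster", the two mapper runs would print, tab-separated:

    big     1
    data    1

and

    big     1
    cluster 1

The word "big" appears in both outputs with a count of 1 each time; only the reducer will later merge these pairs into a total of 2.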
3. Understanding the reducer
a) Read the “reducer” function and explain how it works
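In the standard word-count pattern, the reducer relies on Hadoop having sorted the mapper output by key, so all <word, 1> pairs for a given word arrive consecutively on its standard input. The script keeps a running count for the current word and, whenever the key changes, prints the accumulated total before starting to count the new word; the last word is flushed after the loop. A minimal sketch of such a reducer (assuming this classic pattern; the TP script may differ in details):

    #!/usr/bin/env python
    # reducer.py: a minimal word-count reducer sketch.
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        # each input line is "word \t count", already sorted by word
        word, _, count = line.strip().partition('\t')
        try:
            count = int(count)
        except ValueError:
            continue                  # skip malformed lines
        if word == current_word:
            current_count += count    # same key: keep accumulating
        else:
            if current_word is not None:
                # key changed: emit the total for the previous word
                print('%s\t%d' % (current_word, current_count))
            current_word = word
            current_count = count

    if current_word is not None:      # flush the last word
        print('%s\t%d' % (current_word, current_count))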
Work to do n°2: MapReduce on a Hadoop cluster
1. Setting working environment on Cloudera VM and HDFS
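On the Cloudera VM this step typically amounts to creating a working directory on HDFS (with hadoop fs -mkdir) and uploading the input files into it (with hadoop fs -put), so that the streaming job can read its input from HDFS and write its output back to it.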
Work to do n°3: More MapReduce
1. Files exploration
a) Load the file to the /home/cloudera/MUST/TP4 directory
b) Display the first 10 lines

c) Display the last 10 lines

2. Mapper
a) Identify the pair that you should use.
The key-value pair to use is <store, cost>, where the key is the store name and the value is the purchase
cost. This allows us to group all purchases by store and calculate the total sales for each store in the
reduce step.

b) Draw up the flowchart of the mapper (we call it: purchases_mapper)


c) Write the code of “purchases_mapper.py” in Python (a minimal sketch is given after this list)

d) Test the mapper locally and verify the results (test on the first 10 lines). Are there any errors?
e) Add a control statement to avoid the problem, then verify the results
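A minimal sketch of this mapper, assuming the tab-separated six-field record layout (date, time, store, item, cost, payment) commonly used for this kind of purchases dataset; the length check is the control statement of step e), which skips malformed lines that would otherwise crash the script with an index error:

    #!/usr/bin/env python
    # purchases_mapper.py: sketch assuming records of the form
    # date \t time \t store \t item \t cost \t payment
    import sys

    for line in sys.stdin:
        fields = line.strip().split('\t')
        if len(fields) != 6:      # control statement: ignore malformed lines
            continue
        store = fields[2]         # key: store name
        cost = fields[4]          # value: purchase cost
        print('%s\t%s' % (store, cost))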
3. Reducer
a) Draw up the flowchart of the reducer (we call it: purchases_reducer)
b) Write the code of “purchases_reducer.py” in Python (a minimal sketch follows this list)

c) Test the reducer locally and verify the results (test on the first 20 lines), doing the sum manually as a check.
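A minimal sketch of this reducer, which sums the costs per store and follows the same change-of-key pattern as the word-count reducer:

    #!/usr/bin/env python
    # purchases_reducer.py: sums the cost values for each store,
    # assuming input lines "store \t cost" already sorted by store.
    import sys

    current_store = None
    total = 0.0

    for line in sys.stdin:
        store, _, cost = line.strip().partition('\t')
        try:
            cost = float(cost)
        except ValueError:
            continue                  # skip malformed lines
        if store == current_store:
            total += cost             # same store: accumulate the revenue
        else:
            if current_store is not None:
                print('%s\t%.2f' % (current_store, total))
            current_store = store
            total = cost

    if current_store is not None:     # flush the last store
        print('%s\t%.2f' % (current_store, total))

Locally, the whole chain can be simulated with a pipe along the lines of head -20 purchases.txt | python purchases_mapper.py | sort | python purchases_reducer.py, which mimics Hadoop's map, shuffle/sort, and reduce stages.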
4. MapReduce on Hadoop
a) Put everything together and run the MapReduce job over the whole file “purchases.txt” (a typical launch command is sketched after this list)
b) Establish a list of sales by category
c) What is the sales value for the Toys category?
d) Give the highest sale amount for each store (a reducer sketch follows this list)
e) What is this value for the following stores: Lincoln, Austin?
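On the cluster, the full job is typically launched through the hadoop-streaming jar, along the lines of hadoop jar <path to hadoop-streaming.jar> -input <HDFS input> -output <HDFS output> -mapper purchases_mapper.py -reducer purchases_reducer.py -file purchases_mapper.py -file purchases_reducer.py; the jar path depends on the Hadoop installation on the VM.

The remaining questions need only small variations on the two scripts. For the sales by category, the mapper emits the item category (fields[3] in the layout assumed above) as the key instead of the store, and the summing reducer stays unchanged; the Toys total is then read directly from the job output. For the highest sale per store, the reducer keeps a running maximum instead of a running sum, as in this sketch (purchases_max_reducer.py is a hypothetical name):

    #!/usr/bin/env python
    # purchases_max_reducer.py: highest single sale per store,
    # for key-sorted "store \t cost" input (hypothetical variant script).
    import sys

    current_store = None
    best = 0.0

    for line in sys.stdin:
        store, _, cost = line.strip().partition('\t')
        try:
            cost = float(cost)
        except ValueError:
            continue
        if store == current_store:
            best = max(best, cost)    # keep the largest sale seen so far
        else:
            if current_store is not None:
                print('%s\t%.2f' % (current_store, best))
            current_store = store
            best = cost

    if current_store is not None:
        print('%s\t%.2f' % (current_store, best))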
