MapReduce (Streaming) TP Report
1. Introduction
In this practical session, I implemented and executed MapReduce jobs using Hadoop Streaming. My objective was to
understand how to process large datasets using Python scripts as Mappers and Reducers instead of Java. I followed a step-
by-step approach:
Work 1: Implementing a word count program using MapReduce.
Work 2: Running the program on Hadoop to analyze word occurrences.
Work 3: Analyzing a sales dataset to compute store revenues.
This TP allowed me to explore Hadoop’s distributed data processing model and practice Python-based MapReduce
programming.
2. Work 1: Local Testing of MapReduce (Word Count)
1. Sharing files between local and VM machine
a) Transferring Programs from Windows to Cloudera via a Shared Directory
c) Execute the mapper with file1 as its input and verify the results. Do the same with file2.
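A minimal word-count mapper along these lines can be tested locally by piping a file into it. This is a sketch (the file names file1 and file2 come from the exercise; the function name map_line is mine):

```python
#!/usr/bin/env python3
# wordcount_mapper.py -- sketch of a Hadoop Streaming mapper for word count.
import sys

def map_line(line):
    """Emit one (word, 1) pair per whitespace-separated token."""
    return [(word, 1) for word in line.strip().split()]

if __name__ == "__main__":
    # Streaming mappers read records from stdin and write
    # tab-separated key/value pairs to stdout.
    for line in sys.stdin:
        for word, count in map_line(line):
            print(f"{word}\t{count}")
```

Locally it can be exercised with something like `cat file1 | python3 wordcount_mapper.py`, then repeated with file2.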
2. Mapper
a) Identify the pair that you should use.
The key-value pair to use is <store, cost>, where the key is the store name and the value is the purchase
cost. This allows us to group all purchases by store and calculate the total sales for each store in the
reduce step.
d) Test the mapper locally and verify the results (test on the first 10 lines). Is there any error?
e) Add a control statement to avoid the problem. Verify the results.
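A mapper for the purchases file could look like the sketch below. I assume the common tab-separated layout date, time, store, item, cost, payment (six fields, store third, cost fifth); adjust the indices to the actual file. The length check is the control statement that skips malformed lines:

```python
#!/usr/bin/env python3
# purchases_mapper.py -- sketch: emit <store, cost> pairs.
import sys

def map_line(line):
    """Return (store, cost) for a well-formed record, else None.

    Assumed field layout (tab-separated): date, time, store, item,
    cost, payment. The length check skips malformed lines.
    """
    fields = line.strip().split("\t")
    if len(fields) != 6:          # control statement: reject bad records
        return None
    return fields[2], fields[4]   # (store, cost)

if __name__ == "__main__":
    for line in sys.stdin:
        pair = map_line(line)
        if pair:
            print(f"{pair[0]}\t{pair[1]}")
```

Without the length check, a short or empty line would raise an IndexError, which is the kind of problem the local test on the first 10 lines is meant to expose.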
3. Reducer
a) Draw the flowchart of the reducer (we call it purchases_reducer)
b) Write the code of “purchases_reducer.py” in python
c) Test the reducer locally and verify the results (test on the first 20 lines). Do the sum manually to check.
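A purchases_reducer.py along these lines exploits the fact that Hadoop Streaming delivers the mapper output sorted by key, so one pass with a running total per store is enough (the helper name reduce_stream is mine):

```python
#!/usr/bin/env python3
# purchases_reducer.py -- sketch: sum costs per store.
import sys

def reduce_stream(lines):
    """Sum cost values over consecutive lines sharing the same key.

    Streaming sorts mapper output by key before the reducer runs,
    so all records for one store arrive contiguously.
    """
    current_store, total = None, 0.0
    results = []
    for line in lines:
        store, _, cost = line.strip().partition("\t")
        if store != current_store:
            if current_store is not None:
                results.append((current_store, total))
            current_store, total = store, 0.0
        total += float(cost)
    if current_store is not None:
        results.append((current_store, total))   # flush the last store
    return results

if __name__ == "__main__":
    for store, total in reduce_stream(sys.stdin):
        print(f"{store}\t{total}")
```

For the local test, the mapper output must be sorted first, e.g. `head -20 purchases.txt | python3 purchases_mapper.py | sort | python3 purchases_reducer.py`, and the totals compared against a manual sum.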
4. MapReduce on Hadoop
a) Put it all together and run the MapReduce job on the whole file "purchases.txt"
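On the cluster this amounts to a Hadoop Streaming invocation along the following lines (the jar path and HDFS paths are assumptions; on Cloudera the streaming jar typically lives under /usr/lib/hadoop-mapreduce):

```bash
# Sketch of the full-file run; adjust jar and HDFS paths to your setup.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -mapper purchases_mapper.py \
    -reducer purchases_reducer.py \
    -file purchases_mapper.py -file purchases_reducer.py \
    -input /user/cloudera/purchases.txt \
    -output /user/cloudera/sales_by_store
```

The per-store totals can then be inspected with `hadoop fs -cat /user/cloudera/sales_by_store/part-00000`.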
5. Sales by Category
a) Establish the list of sales by category
b) What is the sales value for the Toys category?
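Only the mapper needs to change for this question: it emits the category as the key instead of the store, and the same reducer sums the values. I assume here that the category is the fourth of six tab-separated fields; adjust the index to the actual dataset:

```python
#!/usr/bin/env python3
# category_mapper.py -- sketch: emit <category, cost> pairs.
import sys

def map_line(line):
    """Return (category, cost), assuming the category is the 4th of
    six tab-separated fields; malformed lines are skipped."""
    fields = line.strip().split("\t")
    if len(fields) != 6:
        return None
    return fields[3], fields[4]   # (category, cost)

if __name__ == "__main__":
    for line in sys.stdin:
        pair = map_line(line)
        if pair:
            print(f"{pair[0]}\t{pair[1]}")
```

The Toys figure is then read directly from the reducer output line whose key is Toys.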