
MAPREDUCE TRAINING

1. Introduction
The objective of this practice is to give the trainees a broad overview of the MapReduce
paradigm. Because installing and setting up a Spark infrastructure is difficult, we provide
the trainees with a connection to a preconfigured server.

2. PuTTY Installation
Download the latest version of PuTTY:

https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
Run the installer:

Follow the instructions:


3. PuTTY Configuration
Open the PuTTY configuration:
Set the following value in the Host Name field:

spark.autoritas.net
Select Connection / SSH / Auth in the left menu:

Browse for the spark.ppk key, and accept the security alert:
Once connected to the server, log in using the user: ubuntu

4. Generating the user space


Invent a username by combining words and numbers, without any special characters or
spaces. For example, my name is Kico Rangel and I was born in 1977, so my username
could be: kicorangel77

Enter the following commands in the command line:

./start.sh [your_username]
cd [your_username]

5. Practice with MapReduce


The objective of this practice is to obtain the list of the words contained in the novel, together
with their frequency of occurrence. To do so, we follow the steps below, executing commands
on the command line.

All code and data can be downloaded from the following URLs:

https://s3-eu-west-1.amazonaws.com/autoritas.academy/EuroStat/mapreduce/mapper.py
https://s3-eu-west-1.amazonaws.com/autoritas.academy/EuroStat/mapreduce/reducer.py
https://s3-eu-west-1.amazonaws.com/autoritas.academy/EuroStat/mapreduce/wuthering-heights.txt
https://s3-eu-west-1.amazonaws.com/autoritas.academy/EuroStat/mapreduce/wuthering-heights.words.txt
Mapper.py
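
The actual mapper.py is downloaded from the URL above. As a hint of what it does, a minimal
word-count mapper in the Hadoop-streaming style could look like the sketch below (illustrative
only; it may differ from the downloaded file):

#!/usr/bin/env python
# Minimal word-count mapper (sketch): read lines from standard input,
# split them into words and emit one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))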

Reducer.py
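
Likewise, the downloaded reducer.py groups the [word, 1] pairs and sums the counts. A
minimal sketch could be the following (illustrative only; it uses a dictionary, so it does not
require sorted input, which matches the pipelines used below):

#!/usr/bin/env python
# Minimal word-count reducer (sketch): read "word<TAB>count" pairs from
# standard input, group them by word and print the summed frequency.
import sys

counts = {}
for line in sys.stdin:
    parts = line.strip().split('\t')
    if len(parts) != 2:
        continue
    word, value = parts
    counts[word] = counts.get(word, 0) + int(value)

for word, total in counts.items():
    print('%s\t%d' % (word, total))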
Let’s run them:

● Showing the wuthering-heights.txt file, which contains a plain-text version of the novel:

cat wuthering-heights.txt

● Counting the number of lines, words and characters:

cat wuthering-heights.txt | wc

The output should be something like the following:

Lines    Words     Characters
4283     118903    684482

● Mapping the job: it decomposes the novel into its words, creating one [word, 1] pair per word occurrence:

cat wuthering-heights.txt | ./mapper.py

● Sorting the mapper output: we can see that the same word appears many times, always
accompanied by the number 1:

cat wuthering-heights.txt | ./mapper.py | sort

● Reducing the job: it reduces the list of words by grouping by word and summing up
the 1s, obtaining the frequency of occurrence of each word:

cat wuthering-heights.txt | ./mapper.py | ./reducer.py

● Sorting the output: we can see the frequency of occurrence of each word:

cat wuthering-heights.txt | ./mapper.py | ./reducer.py | sort

● Redirecting the output to a file and exploring the file:

cat wuthering-heights.txt | ./mapper.py | ./reducer.py | sort > wuthering-heights.words.txt

nano wuthering-heights.words.txt

● Exiting from the editor:

[Ctrl] + X
6. Conclusions
The MapReduce paradigm is based on the divide-and-conquer philosophy: a set of
mappers decomposes the data into small groups and applies a simple operation (e.g. extracting
words and creating [word, 1] pairs), and the reducer regroups the data by performing a simple
joining task on the output of the mappers (e.g. summing up the frequencies).
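
To make the idea concrete, the toy Python snippet below (not part of the training files)
reproduces both stages in memory on a single sentence:

# Illustrative only: the map/reduce idea on a tiny in-memory example.
text = "to be or not to be"

# Map: emit a [word, 1] pair for every word occurrence.
pairs = [(word, 1) for word in text.split()]

# Reduce: group the pairs by word and sum the 1s to obtain frequencies.
frequencies = {}
for word, one in pairs:
    frequencies[word] = frequencies.get(word, 0) + one

print(frequencies)  # e.g. {'to': 2, 'be': 2, 'or': 1, 'not': 1}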
