EuroStat MapReduce Training
1. Introduction
The objective of this practice session is to give trainees a broad overview of the MapReduce
paradigm. Because installing and setting up a Spark infrastructure is difficult, we provide
trainees with a connection to a preconfigured server.
2. PuTTY Installation
Download the latest version of PuTTY:
https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
Run the installer:
spark.autoritas.net
Select Connection / SSH / Auth in the left menu:
Browse for the spark.ppk key, and accept the security alert:
Once connected to the server, log in using the user: ubuntu
./start.sh [your_username]
cd [your_username]
All code and data can be downloaded from the following URLs:
https://s3-eu-west-1.amazonaws.com/autoritas.academy/EuroStat/mapreduce/mapper.py
https://s3-eu-west-1.amazonaws.com/autoritas.academy/EuroStat/mapreduce/reducer.py
https://s3-eu-west-1.amazonaws.com/autoritas.academy/EuroStat/mapreduce/wuthering-heights.txt
https://s3-eu-west-1.amazonaws.com/autoritas.academy/EuroStat/mapreduce/wuthering-heights.words.txt
mapper.py
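The mapper listing is not reproduced in this handout. As a minimal sketch only, assuming a streaming-style script that turns lines of text into tab-separated [word, 1] pairs, a mapper of this kind might look as follows. The function name map_words, the lowercasing, and the tokenisation rule are illustrative assumptions, not the contents of the downloadable file.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: a "word" is a run of letters, optionally with one inner apostrophe.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word found in the input lines."""
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            yield word, 1


# Small in-memory sample so the sketch runs without any input file.
sample_lines = ["I have just returned", "I have"]
mapped_pairs = list(map_words(sample_lines))

# One pair per line, tab-separated, as a streaming mapper would print them.
for word, count in mapped_pairs:
    print(f"{word}\t{count}")
```

To use such a function as a pipeline stage, the sample list could be replaced by sys.stdin (after importing sys), so that the script reads the text piped into it.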
reducer.py
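The reducer listing is likewise not reproduced here. The following is a hedged sketch of how a word-count reducer could work, assuming its input is a list of tab-separated "word, count" lines that has already been sorted by word. The names parse_pairs and reduce_counts are hypothetical and are not taken from the downloadable file.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def parse_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Turn 'word<TAB>count' lines back into (word, count) tuples."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t")
        yield word, int(count)


def reduce_counts(sorted_lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Sum the counts of consecutive lines that share the same word.

    The input must already be sorted by word: groupby only merges neighbours.
    """
    pairs = parse_pairs(sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Sorted sample input, as it would arrive after the ordering step.
sample_sorted = ["have\t1", "have\t1", "i\t1", "i\t1", "just\t1"]
reduced = list(reduce_counts(sample_sorted))

for word, total in reduced:
    print(f"{word}\t{total}")
```

Because only neighbouring lines are merged, the ordering step between mapping and reducing is what makes this simple single-pass summation possible.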
Let’s run them:
● Showing the wuthering-heights.txt file, with a version of the novel in plain text:
cat wuthering-heights.txt
cat wuthering-heights.txt | wc
● Mapping the job: the mapper decomposes the novel into its words, creating pairs of [word, 1]
● Sorting the mapping: we can see that the same word appears many times, always
accompanied by the number 1
● Reducing the job: the reducer groups the list by word and sums up the 1s,
obtaining the frequency of occurrence of each word:
● Sorting the output: we can see the frequency of occurrence per word:
nano wuthering-heights.words.txt
[ctrl]x
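The commands for the intermediate steps are not reproduced above. As a rough illustration, the same sequence of steps (input, map, sort, reduce, order the output) can be mirrored in a single Python process. The sample sentence, the function names, and the choice to order the final output by descending frequency are assumptions made for this sketch; on the server, the steps run on wuthering-heights.txt.

```python
from itertools import groupby
from typing import List, Tuple

Pair = Tuple[str, int]


def map_step(text: str) -> List[Pair]:
    """Mapping: decompose the text into words and pair each with 1."""
    return [(word, 1) for word in text.lower().split()]


def sort_step(pairs: List[Pair]) -> List[Pair]:
    """Sorting the mapping: identical words become neighbours."""
    return sorted(pairs)


def reduce_step(sorted_pairs: List[Pair]) -> List[Pair]:
    """Reducing: group by word and sum up the 1s."""
    return [
        (word, sum(count for _, count in group))
        for word, group in groupby(sorted_pairs, key=lambda pair: pair[0])
    ]


def order_output(counts: List[Pair]) -> List[Pair]:
    """Sorting the output: most frequent first, ties broken alphabetically."""
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


# Input: a short stand-in for the text of the novel.
sample_text = "the moor the house the moor"

mapped = map_step(sample_text)
ordered = sort_step(mapped)
reduced = reduce_step(ordered)
frequencies = order_output(reduced)

for word, total in frequencies:
    print(word, total)
```

Keeping each step in its own function makes it easy to inspect the intermediate lists, much as each stage of the exercise is inspected on the server.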
6. Conclusions
The MapReduce paradigm is based on the divide-and-conquer philosophy: a set of
mappers decomposes the data into small groups and applies a simple operation (e.g. obtaining
words and creating pairs [word, 1]), and the reducer regroups the data by performing a simple
joining task on the output of the mappers (e.g. summing up frequencies).
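The divide-and-conquer idea can be sketched with several mappers working on separate chunks of the data. This is a simplified, single-process illustration under assumed names (split_into_chunks, mapper, shuffle, reducer); on a real cluster the mappers would run on different machines, and the grouping step would be handled by the framework.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_into_chunks(lines: List[str], n_chunks: int) -> List[List[str]]:
    """Divide: deal the lines out into n_chunks roughly equal groups."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def mapper(chunk: List[str]) -> List[Tuple[str, int]]:
    """Simple operation on one chunk: emit (word, 1) for every word."""
    return [(word, 1) for line in chunk for word in line.lower().split()]


def shuffle(mapper_outputs: List[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Collect the values emitted by all mappers under their word."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for output in mapper_outputs:
        for word, count in output:
            grouped[word].append(count)
    return grouped


def reducer(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Conquer: join the mapper outputs by summing the values per word."""
    return {word: sum(counts) for word, counts in grouped.items()}


lines = ["a b a", "b c", "a c c", "c"]
chunks = split_into_chunks(lines, n_chunks=2)

# Each mapper call depends only on its own chunk, so the calls are independent.
mapper_outputs = [mapper(chunk) for chunk in chunks]
totals = reducer(shuffle(mapper_outputs))
print(totals)
```

The result does not depend on how the lines are distributed among the mappers, which is what allows the work to be spread over as many mappers as are available.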