
The HADOOP platform

Introduction
In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers.
Introduction

Before talking about Hadoop, do you know the prefixes?

Sign  Prefix  Factor  Example
K     Kilo    10^3    Page of text
M     Mega    10^6    Transfer speed per second
G     Giga    10^9    DVD, USB key
T     Tera    10^12   Hard disk
P     Peta    10^15
E     Exa     10^18   Facebook, Amazon
Z     Zetta   10^21   Entire internet since 2010
Introduction

The processing of such large amounts of data requires special methods. A classic DBMS is unable to process so much information.

Distribute data across multiple machines (up to several million computers) in data centers:

➔ A special file system presenting a single space, which can contain gigantic and/or very numerous files,
➔ Specific databases (HBase, Cassandra, ElasticSearch).

Processing of the "map-reduce" type:

➔ Algorithms that are easy to write,
➔ Executions that are easy to parallelize.
Introduction

Imagine 5000 computers connected together, forming a cluster.

Data center
Introduction

Each of these blade servers can look like this (4 multi-core CPUs, 1 TB RAM, 24 TB of hard disks, around $5000; prices and technology change constantly).

Blade server
Connected machines
All these machines are connected to each other in order to share storage space and computing power.

The Cloud is an example of distributed storage space: files are stored on different machines, usually in duplicate to prevent failure.

The execution of programs is also distributed: they are executed on one or more machines on the network.

This whole module aims to teach application programming on a cluster, using Hadoop tools.
Hadoop is a distributed data management and processing system. It contains many components, including:

● HDFS (Hadoop Distributed File System), a file system that distributes data over many machines,
● YARN, a MapReduce-like program scheduling mechanism.

We will first present HDFS, then YARN/MapReduce.
HDFS is a distributed file system:

Files and folders are organized in a tree (like Unix). These files are stored on a large number of machines in such a way as to make the exact position of a file invisible.

Access is transparent, regardless of the machines that contain the files.

● Files are copied in several copies for reliability and to allow multiple simultaneous accesses.

HDFS makes it possible to see all the folders and files of these thousands of machines as a single tree, containing petabytes (PB) of data, as if they were on the local hard disk.
File organization
Seen from the user, HDFS looks like a Unix file system: there is a root, directories and files. Files have an owner, a group and access rights.

Under the root /, there are:

● directories for Hadoop services: /hbase, /tmp, /var
➔ a directory for users' personal files: /user. In this directory, there are also three system folders: /user/hive, /user/history and /user/spark.
➔ a directory for files to share with all users: /share

You will need to distinguish between HDFS files and "normal" files.
Install Hadoop 3.2.2
Step 1: Install Java
First you should install the JDK (Java Development Kit).

Step 2: Create environment variables

For Java, we create a new user variable called "JAVA_HOME". It contains the Java directory.
For Hadoop, we create a new user variable called "HADOOP_HOME". It contains the Hadoop directory.
For Spark, we create a new user variable called "SPARK_HOME". It contains the Spark directory.
Install Hadoop 3.2.2

Step 3: Create environment paths

For Java, we create a new user path. It contains the Java directory + "\bin".

For Hadoop, we create two new user paths:
The 1st contains the Hadoop directory + "\bin".
The 2nd contains the Hadoop directory + "\sbin".

For Spark, we create two new user paths:
The 1st contains the Spark directory + "\bin".
The 2nd contains the Spark directory + "\sbin".
Install Hadoop 3.2.2
Step 4: Update the Hadoop XML files
In the directory "Hadoop\hadoop-3.2.2\etc\hadoop", open the file "core-site.xml" and change the temporary directory (hadoop.tmp.dir) according to your Hadoop directory.
Install Hadoop 3.2.2
Step 4: Update the Hadoop XML files
In the directory "Hadoop\hadoop-3.2.2\etc\hadoop", open the file "hdfs-site.xml" and change the directories of the namenode and datanode according to your Hadoop directory.
Install Hadoop 3.2.2
Step 5: Update the Hadoop-env command file
In the directory "Hadoop\hadoop-3.2.2\etc\hadoop", open the file "Hadoop-env" and add the following commands at the end:

set HADOOP_PREFIX=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin
Install Hadoop 3.2.2
Step 6: Get the JDK short path
In the Command Prompt, execute the command "spark-shell".
Then the command "for %I in (.) do echo %~sI".
This last command must be executed in the Java directory to display the short name of your installed JDK.
Use the short name to update the "Hadoop-env" file.
Install Hadoop 3.2.2
Step 7: Start Hadoop

In a new Command Prompt, execute the command "start-dfs" to start the Hadoop services.
Install Hadoop 3.2.2
Step 8: Open http://localhost:9870/ in a browser.
How does HDFS work?

As with many systems, each HDFS file is split into fixed-size blocks. An HDFS block = 256 MB. Depending on its size, a file will need a certain number of blocks; on HDFS, the last block of a file holds only the remaining bytes.

The blocks of the same file are not necessarily all on the same machine. They are each copied to different machines in order to be accessed simultaneously by several processes. By default, each block is copied to 3 different machines (this is configurable).

This replication of blocks on several machines also makes it possible to guard against failures. Each file therefore exists in several copies and in different places.
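To make the block arithmetic concrete, here is a minimal Python sketch (an illustration only, not part of HDFS): given a file size and the block size quoted above, it returns the size of each block, the last one holding only what remains.

BLOCK_SIZE = 256 * 1024 * 1024                 # 256 MB, the block size quoted above

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    # returns the size in bytes of each block of a file of file_size bytes
    full_blocks = file_size // block_size
    remainder = file_size % block_size
    sizes = [block_size] * full_blocks
    if remainder:                              # the last block only holds the remaining bytes
        sizes.append(remainder)
    return sizes

# example: a 600 MB file needs 3 blocks of 256 MB, 256 MB and 88 MB
print(split_into_blocks(600 * 1024 * 1024))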
Organization of machines for HDFS
An HDFS cluster is made up of machines playing different, mutually exclusive roles:

• One of the machines is the HDFS master, called the namenode. This machine contains all the file names and blocks, like a big phone book.
• Another machine is the secondary namenode, a kind of backup namenode, which saves backups of the directory at regular intervals.
• Some machines are clients. These are access points to the cluster, to connect to and work with.
• All other machines are datanodes. They store blocks of file content.
Diagram of HDFS nodes

The datanodes contain blocks (A, B, C...); the namenode knows where the files are: which blocks they consist of and on which datanodes those blocks reside.
More explanation
Datanodes contain blocks. The same blocks are duplicated (replication) on different datanodes, generally 3 times. This ensures:
• data reliability in the event of a datanode failure,
• parallel access by different processes to the same data.

The namenode knows both:
• which blocks each file is made of,
• on which datanodes the desired blocks are located.

This is called metadata.

Major drawback: a failure of the namenode means the death of HDFS. This is why there is the secondary namenode: it archives the metadata, for example every hour.
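To give an idea of what this metadata looks like, here is a small hypothetical sketch in Python (the names and structures are invented for the illustration; they are not the namenode's real internal structures): one mapping from file names to their blocks, and one from blocks to the datanodes that hold them.

# hypothetical namenode metadata (illustration only)
file_to_blocks = {
    "/user/alice/data.csv": ["blk_1", "blk_2", "blk_3"],
}
block_to_datanodes = {                       # each block replicated on 3 datanodes
    "blk_1": ["datanode2", "datanode5", "datanode7"],
    "blk_2": ["datanode1", "datanode2", "datanode6"],
    "blk_3": ["datanode3", "datanode4", "datanode5"],
}

def locate(path):
    # the question the namenode answers: which blocks, and on which datanodes?
    return {blk: block_to_datanodes[blk] for blk in file_to_blocks[path]}

print(locate("/user/alice/data.csv"))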
Java API for HDFS
Hadoop offers a complete Java API for accessing HDFS files. It is based on two main classes:
• FileSystem represents the file tree (file system). This class allows copying local files to HDFS (and vice versa), and renaming, creating and deleting files and folders.
• FileStatus manages the information of a file or folder:
  its size with getLen(),
  its nature with isDirectory() and isFile().

These classes need to know the configuration of the HDFS cluster, using the Configuration class. Full file names are represented by the Path class.
Java API example
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path fullName = new Path("/user/userName", "hello.txt");
FileStatus infos = fs.getFileStatus(fullName);
System.out.println(Long.toString(infos.getLen()) + " bytes");
fs.rename(fullName, new Path("/user/etudiant1", "g_mor.txt"));
Displaying the list of blocks in a file
import ...;
public class HDFSinfo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path fullName = new Path("test.txt");
        FileStatus infos = fs.getFileStatus(fullName);
        BlockLocation[] blocks = fs.getFileBlockLocations(infos, 0, infos.getLen());
        for (BlockLocation blocloc: blocks)
            System.out.println(blocloc.toString());
    }
}
Reading an HDFS file
import java.io.*;
import ...;
public class HDFSread {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path fullName = new Path("apitest.txt");
        FSDataInputStream inStream = fs.open(fullName);
        InputStreamReader isr = new InputStreamReader(inStream);
        BufferedReader br = new BufferedReader(isr);
        String line = br.readLine();
        System.out.println(line);
        inStream.close();
        fs.close();
    }
}
Creating an HDFS File
import ...;
public class HDFSwrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path fullName = new Path("apitest.txt");
        if (! fs.exists(fullName)) {
            FSDataOutputStream outStream = fs.create(fullName);
            outStream.writeUTF("Hello world!");
            outStream.close();
        }
        fs.close();
    }
}
MapReduce algorithms
Principles
We want to collect synthetic information from a data set.
Examples on a list of items with a price:
 Calculate the total amount of sales of an item,
 Find the most expensive item,
 Calculate the average price of items.

For each of these examples, the problem can be written as the composition of two functions:
 map: extraction/calculation of information on each tuple,
 reduce: grouping of this information.
MapReduce algorithms
Example

Calculating the maximum, average or total price can be written using algorithms of the type:

for each tuple:
    value = FunctionM(current tuple)
return FunctionR(values encountered)
MapReduce algorithms
Example
FunctionM is a correspondence function. It calculates a value that interests us from a tuple.
FunctionR is a grouping function (aggregation): maximum, sum, count, average, distinct...
For example, FunctionM retrieves the price of a car, and FunctionR calculates the max of a set of values:

all_prices = list()
for each car:
    all_prices.add( getPrice(current car) )
return max(all_prices)
MapReduce algorithms
Example in Python
from functools import reduce

data = [
    {'id': 1, 'mark': 'Renault', 'model': 'Clio', 'price': 4200},
    {'id': 2, 'mark': 'Fiat', 'model': '500', 'price': 8840},
    {'id': 3, 'mark': 'Peugeot', 'model': '206', 'price': 4300},
    {'id': 4, 'mark': 'Peugeot', 'model': '306', 'price': 6140} ]

# returns the price of the car passed as a parameter
def getPrice(car): return car['price']

# show the list of car prices
print(list(map(getPrice, data)))

# display the highest price
print(reduce(max, map(getPrice, data)))
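With this data, the map produces the price list [4200, 8840, 4300, 6140] and the reduce returns the highest price, 8840.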
MapReduce algorithms
Example in Python

map(function, list) applies the function to each element of


the list. It performs the “for” loop of the previous algorithm
and returns the list of car prices. The result contains as
many values as in the input list. The function

reduce (function, list) aggregates the values of the list by


the function and returns the final result
MapReduce algorithms
Example in Python

These two functions constitute a "map-reduce" couple and


the goal of this course is to understand and program them.
The key point is the possibility of parallelizing these functions
in order to calculate much faster on a machine with several
cores or on a set of machines linked together.
MapReduce algorithms
Parallelization of Map
The map function is parallelizable by nature, because the calculations are independent.
Example, for 4 elements to process:

value1 = FunctionM(element1)
value2 = FunctionM(element2)
value3 = FunctionM(element3)
value4 = FunctionM(element4)
MapReduce algorithms
Parallelization of Map
The four calculations can be done simultaneously, for example on 4 different machines, provided that the data has been copied there.

Note: the mapped function must be a pure function of its parameter; it must have no side effects such as modifying a global variable or remembering its previous values.
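As an illustration of this idea on a single multi-core machine, here is a minimal sketch using Python's multiprocessing module on the car data above; the pool of 4 worker processes is an arbitrary choice.

from multiprocessing import Pool

def getPrice(car):                         # pure function: no side effects
    return car['price']

data = [{'id': 1, 'mark': 'Renault', 'model': 'Clio', 'price': 4200},
        {'id': 2, 'mark': 'Fiat',    'model': '500',  'price': 8840},
        {'id': 3, 'mark': 'Peugeot', 'model': '206',  'price': 4300},
        {'id': 4, 'mark': 'Peugeot', 'model': '306',  'price': 6140}]

if __name__ == '__main__':
    with Pool(4) as pool:                  # 4 worker processes
        prices = pool.map(getPrice, data)  # each element is mapped independently
    print(prices)                          # [4200, 8840, 4300, 6140]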
MapReduce algorithms
Parallelization of Reduce
The reduce function can only be partially parallelized, in a hierarchical form, for example:

Inter1&2 = FunctionR(value1, value2)
Inter3&4 = FunctionR(value3, value4)
Result   = FunctionR(Inter1&2, Inter3&4)

Only the first two calculations can be done simultaneously; the 3rd must wait. If there were more values, we would proceed as follows:
MapReduce algorithms
Parallelization of Reduce
1. parallel computation of FunctionR on all pairs of values from the map,
2. parallel computation of FunctionR on all pairs of intermediate values from the previous phase,
3. and so on, until there is only one value left.
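The following small Python sketch mimics this hierarchical reduction on a list of values; within each level the pairs are independent, so they could be computed in parallel, but each level must wait for the previous one.

def functionR(a, b):                       # the aggregation, here the maximum
    return max(a, b)

def tree_reduce(values, fr=functionR):
    # reduce the values pairwise, level by level, until one value is left
    while len(values) > 1:
        pairs = [values[i:i + 2] for i in range(0, len(values), 2)]
        values = [fr(*p) if len(p) == 2 else p[0] for p in pairs]
    return values[0]

print(tree_reduce([4200, 8840, 4300, 6140]))   # 8840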
MapReduce algorithms
Schema: Data → Map → Reduce
YARN and MapReduce
What is YARN?
YARN (Yet Another Resource Negotiator) is the mechanism in Hadoop for managing jobs on a cluster of machines. YARN allows users to launch MapReduce jobs on data present in HDFS, to follow (monitor) their progress, and to retrieve the messages (logs) displayed by the programs.
Eventually YARN can move a process from one machine to another in the event of failure or of progress deemed too slow. In fact, YARN is transparent to the user: we launch the execution of a MapReduce program and YARN ensures that it is executed as quickly as possible.
YARN and MapReduce
What is MapReduce?
MapReduce is a Java environment for writing programs for YARN. Java is not the simplest language for this: there are packages to import, class paths to provide...
There are several points to know:

• the principles of a MapReduce job in Hadoop,
• programming the Map function,
• programming the Reduce function,
• programming a MapReduce job that calls the two functions,
• launching the job and retrieving the results.
YARN and MapReduce
Key-value pairs
It's actually a bit more complicated than what was initially explained. The data exchanged between Map and Reduce, and more generally in the whole job, are (key, value) pairs:

 a key: any type of data (integer, text...),
 a value: any type of data.

Everything is represented like this. For example:

 a text file is a set of (line number, line) pairs,
 a weather file is a set of (date and time, temperature) pairs.
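A minimal Python sketch of this representation (the file contents are invented for the illustration):

# a text "file" as a set of (line number, line) pairs
lines = ["first line", "second line", "third line"]
pairs = list(enumerate(lines))     # key = position of the line, value = the line
print(pairs)                       # [(0, 'first line'), (1, 'second line'), (2, 'third line')]

# a weather "file" as a set of (date and time, temperature) pairs
weather = [("2021-03-01 12:00", 3.5), ("2021-03-01 13:00", 4.1)]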
YARN and MapReduce
Map
The Map function receives a pair as input and can produce any number of pairs as output: none, one, or many, at will. The types of the inputs and outputs are as desired.

This very unconstrained specification allows a lot of things. In general, the pairs that Map receives are made up as follows:
 the value, of type text, is one of the lines or one of the tuples of the file to process,
 the key, of type integer, is the position of this line in the file (we call it the offset).
YARN and MapReduce
Map

It should be understood that YARN launches an instance of Map for each row of each data file to be processed. Each instance processes the row assigned to it and produces output pairs.
YARN and MapReduce
Map schema

The MAP tasks each process one pair and produce 0..n pairs. The same keys and/or values may be produced.
YARN and MapReduce
Reduce
The Reduce function receives a list of pairs as input. These are the pairs produced by the instances of Map. Reduce can produce any number of output pairs, but most of the time it is just one. However, the crucial point is that the input pairs processed by an instance of Reduce all have the same key.
YARN launches one Reduce instance for each different key that the Map instances have produced, and provides it with only the pairs having that key. This is what makes it possible to aggregate the values. In general, Reduce must do some processing on the values, like adding all the values together, or determining the largest of the values... When we design a MapReduce process, we have to think about the keys and values necessary for it to work.
YARN and MapReduce
Reduce schema

Reduce tasks receive a list of pairs that all have the same key and produce one pair that contains the expected result. This output pair can have the same key as the input pairs.
YARN and MapReduce
Example
A telephone company wants to calculate the total duration of a subscriber's telephone calls from a CSV file containing all calls from all subscribers (subscriber number, called number, date, call duration).
This problem is handled as follows:
1. As input, we have the calls file (1 call per line).
2. YARN launches one instance of the Map function per call.
3. Each instance of Map receives a pair (offset, line) and produces a pair (subscriber number, duration), or nothing if it is not the subscriber that we want. NB: the offset is useless here.
4. YARN sends all pairs to a single instance of Reduce (because there is only one distinct key).
5. The Reduce instance adds all the values of the pairs it receives and produces a single output pair (subscriber number, total duration).
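Here is a minimal Python sketch of these 5 steps on a single machine; the CSV lines and the subscriber number are invented for the illustration, and functools.reduce plays the role of the single Reduce instance.

from functools import reduce

# hypothetical calls file: subscriber number, called number, date, duration (seconds)
calls_file = ["0611111111,0622222222,2021-03-01,120",
              "0633333333,0611111111,2021-03-01,45",
              "0611111111,0644444444,2021-03-02,300"]

wanted = "0611111111"                       # the subscriber we are interested in

def map_call(pair):
    # receives (offset, line); emits (subscriber number, duration) or nothing
    offset, line = pair                     # the offset is useless here
    subscriber, _, _, duration = line.split(",")
    return (subscriber, int(duration)) if subscriber == wanted else None

mapped = [p for p in map(map_call, enumerate(calls_file)) if p is not None]

def reduce_calls(p1, p2):                   # all the pairs received share the same key
    return (p1[0], p1[1] + p2[1])

print(reduce(reduce_calls, mapped))         # ('0611111111', 420)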
YARN and MapReduce
MapReduce job phases
A MapReduce job consists of several phases:
1. Pre-processing of the input data, e.g. decompression of files.
2. Split: separation of the data into separately processable blocks, put in the form of (key, value) pairs, e.g. lines or tuples.
3. Map: application of the map function on all (key, value) pairs formed from the input data; this produces other (key, value) pairs as output.
4. Shuffle & Sort: redistribution of the data so that the pairs produced by Map having the same key are on the same machines.
5. Reduce: aggregation of the pairs having the same key to obtain the final result.
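The Shuffle & Sort phase can be imitated in a few lines of Python: after the Map phase, the pairs are sorted by key and grouped, so that one reduction is applied per distinct key. This is only a sketch of the idea (here on call data for all subscribers), not the way Hadoop actually implements it.

from itertools import groupby
from operator import itemgetter

# pairs produced by the Map phase: (subscriber number, duration), for all subscribers
mapped = [("0611111111", 120), ("0633333333", 45), ("0611111111", 300)]

# Shuffle & Sort: bring the pairs having the same key together
mapped.sort(key=itemgetter(0))

# Reduce: one aggregation per distinct key
totals = {key: sum(duration for _, duration in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(totals)                               # {'0611111111': 420, '0633333333': 45}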
YARN and MapReduce
Schema: Data → Split → Map → Shuffle & Sort → Reduce → Result
Workout 1
Some commands
hadoop version                          # gives the version of Hadoop
hadoop fs -mkdir /test                  # creates a new directory named "test"
hadoop fs -ls /                         # displays the content of the directory
hadoop fs -copyFromLocal <localsrc> <hdfs destination>
                                        # copies a file from <localsrc> to <dest>
hadoop fs -put <localsrc> <dest>        # copies localfile1 of the local file system to the Hadoop file system
hadoop fs -get <src> <localdest>        # copies the file or directory from the Hadoop file system to the local file system
hadoop fs -cat /path_to_file_in_hdfs    # displays the content of the file
Workout 2
Map reduce

Diagram: a Name node and p Data nodes (Data node 1 ... Data node p); each data node stores blocks (Block 1 ... Block m) and each block contains tuples (Tuple 1 ... Tuple n).
Nb_tuples=n
Nb_blocks=m
Nb_datanodes=p
Block_size
Workout 2
Student score (example)
Tuple architecture
{'st_id': 89, 'sp': 'GLSD', 'math': 3.09, 'phy': 16.89, 'sci': 14.26, 'phyl': 12.45, 'geog': 19.15, 'eng': 14.1}

Block architecture

{'id_bk': 4, 'data': [{'st_id': 16, 'sp': 'STW', 'math': 10.42, 'phy': 19.07, 'sci': 18.46, 'phyl': 5.47, 'geog': 10.92, 'eng': 17.65}, {'st_id': 17, 'sp': 'STW', 'math': 12.88, 'phy': 1.15, 'sci': 12.46, 'phyl': 17.02, 'geog': 17.39, 'eng': 17.22}, {'st_id': 18, 'sp': 'STW', 'math': 6.33, 'phy': 14.58, 'sci': 19.39, 'phyl': 17.84, 'geog': 0.02, 'eng': 14.65}, {'st_id': 19, 'sp': 'ST', 'math': 7.81, 'phy': 4.39, 'sci': 9.6, 'phyl': 13.19, 'geog': 18.34, 'eng': 14.88}]}
Workout 2
Student score (example)
Data node architecture
[{'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'STW', 'math': 18.23, 'phy': 5.32, 'sci': 17.34, 'phyl': 4.61, 'geog': 19.37, 'eng': 1.8}, {'st_id': 1,
'sp': 'ST', 'math': 4.53, 'phy': 13.81, 'sci': 12.32, 'phyl': 4.95, 'geog': 0.31, 'eng': 17.78}, {'st_id': 2, 'sp': 'MATH', 'math': 6.65, 'phy':
14.35, 'sci': 5.35, 'phyl': 0.9, 'geog': 15.71, 'eng': 8.39}, {'st_id': 3, 'sp': 'ST', 'math': 17.9, 'phy': 17.06, 'sci': 5.42, 'phyl': 1.4, 'geog':
7.23, 'eng': 18.99}]}, {'id_bk': 1, 'data': [{'st_id': 4, 'sp': 'MATH', 'math': 15.0, 'phy': 2.47, 'sci': 19.41, 'phyl': 16.03, 'geog': 0.79,
'eng': 6.61}, {'st_id': 5, 'sp': 'ST', 'math': 0.6, 'phy': 9.12, 'sci': 9.59, 'phyl': 6.29, 'geog': 11.94, 'eng': 4.59}, {'st_id': 6, 'sp': 'STW',
'math': 15.37, 'phy': 12.88, 'sci': 17.88, 'phyl': 7.88, 'geog': 12.73, 'eng': 14.11}, {'st_id': 7, 'sp': 'MATH', 'math': 14.83, 'phy':
9.07, 'sci': 9.59, 'phyl': 15.93, 'geog': 4.28, 'eng': 14.38}]}, {'id_bk': 2, 'data': [{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32,
'sci': 15.19, 'phyl': 13.02, 'geog': 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH', 'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44,
'geog': 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci': 3.49, 'phyl': 9.05, 'geog': 9.78, 'eng': 16.54},
{'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog': 15.25, 'eng': 9.84}]}, {'id_bk': 2, 'data':
[{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32, 'sci': 15.19, 'phyl': 13.02, 'geog': 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH',
'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44, 'geog': 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci':
3.49, 'phyl': 9.05, 'geog': 9.78, 'eng': 16.54}, {'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog':
15.25, 'eng': 9.84}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37, 'phyl': 14.56, 'geog': 7.35, 'eng':
11.54}, {'st_id': 13, 'sp': 'GLSD', 'math': 3.62, 'phy': 8.83, 'sci': 4.58, 'phyl': 3.66, 'geog': 4.38, 'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD',
'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15, 'sp': 'MATH', 'math': 1.8, 'phy': 18.2,
'sci': 5.26, 'phyl': 19.19, 'geog': 14.35, 'eng': 18.45}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37,
'phyl': 14.56, 'geog': 7.35, 'eng': 11.54}, {'st_id': 13, 'sp': 'GLSD', 'math': 3.62, 'phy': 8.83, 'sci': 4.58, 'phyl': 3.66, 'geog': 4.38,
'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD', 'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15,
'sp': 'MATH', 'math': 1.8, 'phy': 18.2, 'sci': 5.26, 'phyl': 19.19, 'geog': 14.35, 'eng': 18.45}]}]
Workout 2
Student score (example)
Dataset architecture
[[{'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'STW', 'math': 18.23, 'phy': 5.32, 'sci': 17.34, 'phyl': 4.61, 'geog': 19.37, 'eng': 1.8}, {'st_id': 1, 'sp': 'ST', 'math': 4.53, 'phy': 13.81, 'sci': 12.32, 'phyl': 4.95, 'geog': 0.31, 'eng':
17.78}, {'st_id': 2, 'sp': 'MATH', 'math': 6.65, 'phy': 14.35, 'sci': 5.35, 'phyl': 0.9, 'geog': 15.71, 'eng': 8.39}, {'st_id': 3, 'sp': 'ST', 'math': 17.9, 'phy': 17.06, 'sci': 5.42, 'phyl': 1.4, 'geog': 7.23, 'eng': 18.99}]},
{'id_bk': 1, 'data': [{'st_id': 4, 'sp': 'MATH', 'math': 15.0, 'phy': 2.47, 'sci': 19.41, 'phyl': 16.03, 'geog': 0.79, 'eng': 6.61}, {'st_id': 5, 'sp': 'ST', 'math': 0.6, 'phy': 9.12, 'sci': 9.59, 'phyl': 6.29, 'geog': 11.94, 'eng':
4.59}, {'st_id': 6, 'sp': 'STW', 'math': 15.37, 'phy': 12.88, 'sci': 17.88, 'phyl': 7.88, 'geog': 12.73, 'eng': 14.11}, {'st_id': 7, 'sp': 'MATH', 'math': 14.83, 'phy': 9.07, 'sci': 9.59, 'phyl': 15.93, 'geog': 4.28, 'eng':
14.38}]}, {'id_bk': 2, 'data': [{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32, 'sci': 15.19, 'phyl': 13.02, 'geog': 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH', 'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44, 'geog':
1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci': 3.49, 'phyl': 9.05, 'geog': 9.78, 'eng': 16.54}, {'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog': 15.25,
'eng': 9.84}]}, {'id_bk': 2, 'data': [{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32, 'sci': 15.19, 'phyl': 13.02, 'geog': 5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH', 'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44,
'geog': 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci': 3.49, 'phyl': 9.05, 'geog': 9.78, 'eng': 16.54}, {'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog':
15.25, 'eng': 9.84}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37, 'phyl': 14.56, 'geog': 7.35, 'eng': 11.54}, {'st_id': 13, 'sp': 'GLSD', 'math': 3.62, 'phy': 8.83, 'sci': 4.58, 'phyl':
3.66, 'geog': 4.38, 'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD', 'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15, 'sp': 'MATH', 'math': 1.8, 'phy': 18.2, 'sci': 5.26, 'phyl': 19.19,
'geog': 14.35, 'eng': 18.45}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37, 'phyl': 14.56, 'geog': 7.35, 'eng': 11.54}, {'st_id': 13, 'sp': 'GLSD', 'math': 3.62, 'phy': 8.83, 'sci': 4.58,
'phyl': 3.66, 'geog': 4.38, 'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD', 'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15, 'sp': 'MATH', 'math': 1.8, 'phy': 18.2, 'sci': 5.26, 'phyl':
19.19, 'geog': 14.35, 'eng': 18.45}]}], [{'id_bk': 1, 'data': [{'st_id': 4, 'sp': 'MATH', 'math': 15.0, 'phy': 2.47, 'sci': 19.41, 'phyl': 16.03, 'geog': 0.79, 'eng': 6.61}, {'st_id': 5, 'sp': 'ST', 'math': 0.6, 'phy': 9.12,
'sci': 9.59, 'phyl': 6.29, 'geog': 11.94, 'eng': 4.59}, {'st_id': 6, 'sp': 'STW', 'math': 15.37, 'phy': 12.88, 'sci': 17.88, 'phyl': 7.88, 'geog': 12.73, 'eng': 14.11}, {'st_id': 7, 'sp': 'MATH', 'math': 14.83, 'phy': 9.07,
'sci': 9.59, 'phyl': 15.93, 'geog': 4.28, 'eng': 14.38}]}, {'id_bk': 3, 'data': [{'st_id': 12, 'sp': 'ST', 'math': 13.3, 'phy': 8.36, 'sci': 11.37, 'phyl': 14.56, 'geog': 7.35, 'eng': 11.54}, {'st_id': 13, 'sp': 'GLSD', 'math':
3.62, 'phy': 8.83, 'sci': 4.58, 'phyl': 3.66, 'geog': 4.38, 'eng': 4.07}, {'st_id': 14, 'sp': 'GLSD', 'math': 12.52, 'phy': 14.36, 'sci': 6.33, 'phyl': 0.19, 'geog': 12.93, 'eng': 10.98}, {'st_id': 15, 'sp': 'MATH', 'math':
1.8, 'phy': 18.2, 'sci': 5.26, 'phyl': 19.19, 'geog': 14.35, 'eng': 18.45}]}, {'id_bk': 4, 'data': [{'st_id': 16, 'sp': 'STW', 'math': 10.42, 'phy': 19.07, 'sci': 18.46, 'phyl': 5.47, 'geog': 10.92, 'eng': 17.65}, {'st_id': 17,
'sp': 'STW', 'math': 12.88, 'phy': 1.15, 'sci': 12.46, 'phyl': 17.02, 'geog': 17.39, 'eng': 17.22}, {'st_id': 18, 'sp': 'STW', 'math': 6.33, 'phy': 14.58, 'sci': 19.39, 'phyl': 17.84, 'geog': 0.02, 'eng': 14.65}, {'st_id':
19, 'sp': 'ST', 'math': 7.81, 'phy': 4.39, 'sci': 9.6, 'phyl': 13.19, 'geog': 18.34, 'eng': 14.88}]}, {'id_bk': 4, 'data': [{'st_id': 16, 'sp': 'STW', 'math': 10.42, 'phy': 19.07, 'sci': 18.46, 'phyl': 5.47, 'geog': 10.92,
'eng': 17.65}, {'st_id': 17, 'sp': 'STW', 'math': 12.88, 'phy': 1.15, 'sci': 12.46, 'phyl': 17.02, 'geog': 17.39, 'eng': 17.22}, {'st_id': 18, 'sp': 'STW', 'math': 6.33, 'phy': 14.58, 'sci': 19.39, 'phyl': 17.84, 'geog':
0.02, 'eng': 14.65}, {'st_id': 19, 'sp': 'ST', 'math': 7.81, 'phy': 4.39, 'sci': 9.6, 'phyl': 13.19, 'geog': 18.34, 'eng': 14.88}]}], [{'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'STW', 'math': 18.23, 'phy': 5.32, 'sci': 17.34, 'phyl':
4.61, 'geog': 19.37, 'eng': 1.8}, {'st_id': 1, 'sp': 'ST', 'math': 4.53, 'phy': 13.81, 'sci': 12.32, 'phyl': 4.95, 'geog': 0.31, 'eng': 17.78}, {'st_id': 2, 'sp': 'MATH', 'math': 6.65, 'phy': 14.35, 'sci': 5.35, 'phyl': 0.9, 'geog':
15.71, 'eng': 8.39}, {'st_id': 3, 'sp': 'ST', 'math': 17.9, 'phy': 17.06, 'sci': 5.42, 'phyl': 1.4, 'geog': 7.23, 'eng': 18.99}]}, {'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'STW', 'math': 18.23, 'phy': 5.32, 'sci': 17.34, 'phyl': 4.61,
'geog': 19.37, 'eng': 1.8}, {'st_id': 1, 'sp': 'ST', 'math': 4.53, 'phy': 13.81, 'sci': 12.32, 'phyl': 4.95, 'geog': 0.31, 'eng': 17.78}, {'st_id': 2, 'sp': 'MATH', 'math': 6.65, 'phy': 14.35, 'sci': 5.35, 'phyl': 0.9, 'geog': 15.71,
'eng': 8.39}, {'st_id': 3, 'sp': 'ST', 'math': 17.9, 'phy': 17.06, 'sci': 5.42, 'phyl': 1.4, 'geog': 7.23, 'eng': 18.99}]}, {'id_bk': 1, 'data': [{'st_id': 4, 'sp': 'MATH', 'math': 15.0, 'phy': 2.47, 'sci': 19.41, 'phyl': 16.03, 'geog':
0.79, 'eng': 6.61}, {'st_id': 5, 'sp': 'ST', 'math': 0.6, 'phy': 9.12, 'sci': 9.59, 'phyl': 6.29, 'geog': 11.94, 'eng': 4.59}, {'st_id': 6, 'sp': 'STW', 'math': 15.37, 'phy': 12.88, 'sci': 17.88, 'phyl': 7.88, 'geog': 12.73, 'eng':
14.11}, {'st_id': 7, 'sp': 'MATH', 'math': 14.83, 'phy': 9.07, 'sci': 9.59, 'phyl': 15.93, 'geog': 4.28, 'eng': 14.38}]}, {'id_bk': 2, 'data': [{'st_id': 8, 'sp': 'STW', 'math': 2.16, 'phy': 9.32, 'sci': 15.19, 'phyl': 13.02, 'geog':
5.38, 'eng': 19.62}, {'st_id': 9, 'sp': 'MATH', 'math': 4.69, 'phy': 4.85, 'sci': 12.8, 'phyl': 18.44, 'geog': 1.99, 'eng': 3.64}, {'st_id': 10, 'sp': 'GLSD', 'math': 9.84, 'phy': 9.0, 'sci': 3.49, 'phyl': 9.05, 'geog': 9.78, 'eng':
16.54}, {'st_id': 11, 'sp': 'GLSD', 'math': 11.52, 'phy': 2.56, 'sci': 16.79, 'phyl': 12.35, 'geog': 15.25, 'eng': 9.84}]}, {'id_bk': 4, 'data': [{'st_id': 16, 'sp': 'STW', 'math': 10.42, 'phy': 19.07, 'sci': 18.46, 'phyl': 5.47,
'geog': 10.92, 'eng': 17.65}, {'st_id': 17, 'sp': 'STW', 'math': 12.88, 'phy': 1.15, 'sci': 12.46, 'phyl': 17.02, 'geog': 17.39, 'eng': 17.22}, {'st_id': 18, 'sp': 'STW', 'math': 6.33, 'phy': 14.58, 'sci': 19.39, 'phyl': 17.84,
'geog': 0.02, 'eng': 14.65}, {'st_id': 19, 'sp': 'ST', 'math': 7.81, 'phy': 4.39, 'sci': 9.6, 'phyl': 13.19, 'geog': 18.34, 'eng': 14.88}]}]]
Workout 2
System architecture: Nb_tuples=n, Nb_blocks=m, Nb_datanodes=p, Block_size, Specialities, nb_copies
Generate a random dataset

import random

def randomDataset(nb_tuples=2000, block_size=15, nb_dataNode=7, nb_copies=3,
                  specialities=specialities):
    data_nodes=[]
    for i in range(nb_dataNode):
        data_nodes.append([])

    nb_blocks=int(nb_tuples/block_size)
    if (nb_tuples%block_size!=0):
        nb_blocks=nb_blocks+1

    for i in range(nb_blocks):
        block={}
        block['id_bk']=i
        block['data']=[]
        for j in range(block_size):
            if (i*block_size+j==nb_tuples):
                break
            sp=random.randint(0, len(specialities)-1)
            tuple={}
            tuple['st_id']=i*block_size+j
            tuple['sp']=specialities[sp]
            tuple['math']= round(random.uniform(0.,20.),2)
            tuple['phy']= round(random.uniform(0.,20.),2)
            tuple['sci']= round(random.uniform(0.,20.),2)
            tuple['phyl']= round(random.uniform(0.,20.),2)
            tuple['geog']=round(random.uniform(0.,20.),2)
            tuple['eng']= round(random.uniform(0.,20.),2)
            block['data'].append(tuple)
        for k in range(nb_copies):
            dns=[]
            dn=random.randint(0, len(data_nodes)-1)
            while dn in dns:
                dn=random.randint(0, len(data_nodes)-1)
            data_nodes[dn].append(block)
            dns.append(dn)
    return [data_nodes,nb_blocks]
Workout 2
System architecture
specialities=['ST','MATH','GLSD','STW']

def randomDataset(nb_tuples=2000, block_size=15, nb_dataNode=7, nb_copies=3,


specialities=specialities):

data_nodes=[]                    # create an empty list of data nodes

for i in range(nb_dataNode):
data_nodes.append([])

nb_blocks=int(nb_tuples/block_size)
if (nb_tuples%block_size!=0):
nb_blocks=nb_blocks+1

for i in range(nb_blocks):
block={}
block['id_bk']=i
block['data']=[]
for j in range(block_size):
Workout 2
System architecture
specialities=['ST','MATH','GLSD','STW']

nb_blocks=int(nb_tuples/block_size)
if (nb_tuples%block_size!=0):    # calculate the number of blocks
nb_blocks=nb_blocks+1

for i in range(nb_blocks):
block={}
block['id_bk']=i
block['data']=[]
for j in range(block_size):
if (i*block_size+j==nb_tuples):
break
sp=random.randint(0, len(specialities)-1)
tuple={}
tuple['st_id']=i*block_size+j
tuple['sp']=specialities[sp]
tuple['math']= round(random.uniform(0.,20.),2)
tuple['phy']= round(random.uniform(0.,20.),2)
Workout 2
System architecture
specialities=['ST','MATH','GLSD','STW']

for i in range(nb_blocks):
block={}
block['id_bk']=i                 # create blocks
block['data']=[]
for j in range(block_size):
if (i*block_size+j==nb_tuples):
break
sp=random.randint(0, len(specialities)-1)
tuple={}
tuple['st_id']=i*block_size+j
tuple['sp']=specialities[sp]
tuple['math']= round(random.uniform(0.,20.),2)
tuple['phy']= round(random.uniform(0.,20.),2)
tuple['sci']= round(random.uniform(0.,20.),2)
tuple['phyl']= round(random.uniform(0.,20.),2)
tuple['geog']=round(random.uniform(0.,20.),2)
tuple['eng']= round(random.uniform(0.,20.),2)
Workout 2
System architecture
specialities=['ST','MATH','GLSD','STW']

for j in range(block_size):
if (i*block_size+j==nb_tuples):
break
sp=random.randint(0, len(specialities)-1)    # create tuples
tuple={}
tuple['st_id']=i*block_size+j
tuple['sp']=specialities[sp]
tuple['math']= round(random.uniform(0.,20.),2)
tuple['phy']= round(random.uniform(0.,20.),2)
tuple['sci']= round(random.uniform(0.,20.),2)
tuple['phyl']= round(random.uniform(0.,20.),2)
tuple['geog']=round(random.uniform(0.,20.),2)
tuple['eng']= round(random.uniform(0.,20.),2)

block['data'].append(tuple)
for k in range(nb_copies):
dns=[]
Workout 2
System architecture
specialities=['ST','MATH','GLSD','STW']

block['data'].append(tuple)
for k in range(nb_copies):
dns=[]
dn=random.randint(0, len(data_nodes)-1)      # save dataset (place the block on a data node)
while dn in dns:
dn=random.randint(0, len(data_nodes)-1)
data_nodes[dn].append(block)
dns.append(dn)
return [data_nodes,nb_blocks]
Workout 2
System architecture: Nb_tuples=n, Nb_blocks=m, Nb_datanodes=p, Block_size, Specialities, nb_copies
Recuperate the dataset, without repetition

import random
import numpy as np

def findBlock(id,dataset):
    random_sort=np.arange(len(dataset[0]))
    for i in range(len(dataset[0])-1):
        n=random.randint(0, len(dataset[0])-1)
        m=random.randint(0, len(dataset[0])-1)
        x=random_sort[n]
        random_sort[n]=random_sort[m]
        random_sort[m]=x
    for i in random_sort:
        for data in dataset[0][i]:
            if data['id_bk']==id:
                return [data['data'],i]
    return

def recuperateDataset(dataset):
    result=[]
    nb_blk=dataset[1]
    for i in range(nb_blk):
        blk=findBlock(i, dataset)
        result.append(blk)
    return result
Workout 2
Sample output of randomDataset: each data node holds a list of blocks, and each block (identified by 'id_bk') can appear on several data nodes because of the replication.

Sample output of recuperateDataset (dataset without repetition): one entry per block, of the form [tuples of the block, index of the data node it was read from], for example:

[[[{'st_id': 0, 'sp': 'GLSD', 'math': 16.25, 'phy': 11.24, 'sci': 12.98, 'phyl': 19.56, 'geog': 13.32, 'eng': 7.6}, ...], 2],
 [[{'st_id': 5, 'sp': 'MATH', 'math': 2.5, 'phy': 9.33, 'sci': 2.48, 'phyl': 13.07, 'geog': 10.56, 'eng': 0.09}, ...], 0],
 [[{'st_id': 10, 'sp': 'ST', 'math': 9.3, 'phy': 4.03, 'sci': 4.65, 'phyl': 19.33, 'geog': 12.55, 'eng': 15.24}, ...], 2],
 [[{'st_id': 15, 'sp': 'STW', 'math': 10.99, 'phy': 3.04, 'sci': 6.65, 'phyl': 10.15, 'geog': 18.88, 'eng': 7.9}, ...], 2]]
Workout 2
Map function: compute each student's average

def getAverage(student):
    # coefficients: dictionary of per-subject coefficients, defined elsewhere
    marks=np.array([*student.values()][2:])
    coef=np.array([*coefficients.values()])
    return {'st_id':student['st_id'],
            'avg':round(np.dot(marks,coef)/coef.sum(),2)}

def getAverages(dataset):
    studentaverages=[]
    for block in dataset:
        if (block!=None):
            avgs= list(map(getAverage,block[0]))
            studentaverages.append([avgs,block[1]])
    return studentaverages
Workout 2
Result of the Map function (student averages per block, with the data node index):

[[[{'st_id': 0, 'avg': 3.93}, {'st_id': 1, 'avg': 8.73}, {'st_id': 2, 'avg': 9.49}, {'st_id': 3, 'avg': 5.02}, {'st_id': 4, 'avg': 16.3}], 0],
 [[{'st_id': 5, 'avg': 12.71}, {'st_id': 6, 'avg': 14.06}, {'st_id': 7, 'avg': 11.17}, {'st_id': 8, 'avg': 13.31}, {'st_id': 9, 'avg': 8.71}], 2],
 [[{'st_id': 10, 'avg': 16.49}, {'st_id': 11, 'avg': 6.88}, {'st_id': 12, 'avg': 7.9}, {'st_id': 13, 'avg': 11.04}, {'st_id': 14, 'avg': 7.88}], 0],
 [[{'st_id': 15, 'avg': 15.84}, {'st_id': 16, 'avg': 8.25}, {'st_id': 17, 'avg': 14.43}, {'st_id': 18, 'avg': 12.81}, {'st_id': 19, 'avg': 10.53}], 2]]
Workout 2
Reduce function: sum and number of values per block, then the overall average

from functools import reduce

def avgSum(st1,st2):
    return {'avg':round(st1['avg']+st2['avg'],2)}

def sum_blocks(dataset):
    res=[]
    nb_tuples=0
    blk_nb=0
    for data in dataset:                      # sum and nb_values per block
        nb_tuples=nb_tuples+len(data[0])
        sr={}
        sr['block']=blk_nb
        sr['DN']=data[1]
        sr['sum']=reduce(avgSum, data[0])['avg']
        sr['nb_val']=len(data[0])
        sr['avg']=round(sr['sum']/len(data[0]),2)
        blk_nb=blk_nb+1
        res.append(sr)
    rs=map(lambda r:{'avg':r['sum']}, res)
    sumb=reduce(avgSum, rs)
    return {'detail':res,
            'res':[{'sum':sumb['avg'],'nb_val':nb_tuples,
                    'avg':round(sumb['avg']/nb_tuples,2)}]}
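A possible way to chain everything together and obtain the tables shown below (this driver code is not in the original slides: the coefficients dictionary is invented for the illustration, and pandas is assumed to be available for the display):

import pandas as pd

# hypothetical per-subject coefficients (one value per mark, defined here for the example)
coefficients = {'math': 3, 'phy': 3, 'sci': 2, 'phyl': 1, 'geog': 1, 'eng': 1}

dataset = randomDataset(nb_tuples=20, block_size=5, nb_dataNode=3)
averages = getAverages(recuperateDataset(dataset))   # Map phase
result = sum_blocks(averages)                        # Reduce phase

print(pd.DataFrame(result['detail']))                # detail of blocks
print(pd.DataFrame(result['res']))                   # final result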
Workout 2
Result of the Reduce function

________________ detail of blocks ______________________________
   block  DN    sum  nb_val    avg
0      0   0  43.47       5   8.69
1      1   2  59.96       5  11.99
2      2   0  50.19       5  10.04
3      3   2  61.86       5  12.37

_____________________ final result
      sum  nb_val    avg
0  215.48      20  10.77
