08 SQLOperators BigDataNB
The MapReduce paradigm can be used to
implement relational operators
However, the MapReduce implementation is
efficient only when a full scan of the input table(s)
is needed
▪ i.e., when queries are not selective and process all data
Selective queries, which return few tuples/records
of the input tables, are usually not efficient when
implemented by using a MapReduce approach
Most preprocessing activities involve
relational operators
E.g., ETL processes in the data warehousing
application context
Relations/Tables (also the big ones) can be
stored in the HDFS distributed file system
They are split into blocks, which are spread across the
servers of the Hadoop cluster
Note
In relational algebra, relations/tables do not
contain duplicate records by definition
This constraint must be satisfied by both
the input and the output relations/tables
σ_C(R)
Applies predicate (condition) C to each
record of table R
Produces a relation containing only the
records that satisfy predicate C
The selection operator can be
implemented by using the filtering
pattern
Courses
CCode  CName             Semester  ProfID
M2170  Computer science  1         D102
M4880  Digital systems   2         D104
F1401  Electronics       1         D104
F0410  Databases         2         D102
Map-only job
Each mapper
Analyzes the records of its split one
at a time
▪ If the record satisfies C then it emits a
(key,value) pair with key=record and
value=null
▪ Otherwise, it discards the record
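The map-only selection job above can be sketched in plain Python, without Hadoop. This is a minimal simulation, not a definitive implementation: the predicate `condition` (Semester = 1) is a hypothetical choice of C, and `select_mapper` and `run_map_only` are illustrative names that do not come from the slides.

```python
from typing import Iterator, List, NamedTuple, Optional, Tuple


class Course(NamedTuple):
    ccode: str
    cname: str
    semester: int
    prof_id: str


# The Courses table from the example.
COURSES = [
    Course("M2170", "Computer science", 1, "D102"),
    Course("M4880", "Digital systems", 2, "D104"),
    Course("F1401", "Electronics", 1, "D104"),
    Course("F0410", "Databases", 2, "D102"),
]


def condition(record: Course) -> bool:
    # Hypothetical predicate C (an assumption): Semester = 1.
    return record.semester == 1


def select_mapper(record: Course) -> Iterator[Tuple[Course, Optional[str]]]:
    # Emit (key=record, value=null) if the record satisfies C,
    # otherwise emit nothing, i.e. discard the record.
    if condition(record):
        yield record, None


def run_map_only(split: List[Course]) -> List[Tuple[Course, Optional[str]]]:
    # A map-only job has no shuffle and no reducers:
    # the output is simply what the mappers emit.
    return [pair for record in split for pair in select_mapper(record)]


selected = run_map_only(COURSES)
for key, _ in selected:
    print(key.ccode, key.cname)
```

Because no reducer is involved, the job never moves data across the network for a shuffle; every record is still read once, which is why the approach pays off only for non-selective queries.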
π_S(R)
For each record of table R, keeps only
the attributes in S
Produces a relation with a schema
equal to S (i.e., a relation containing
only the attributes in S)
Removes duplicates, if any
Professors
ProfId  PSurname  Department
D102    Smith     Computer engineering
D105    Jones     Computer engineering
D104    Smith     Electronics
Result
PSurname
Smith
Jones
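One way to obtain this result with a MapReduce job is sketched below in plain Python. The job design is an assumption, since the text above gives only the definition of the operator: mappers emit the projected record as the key, and reducers emit each distinct key once, which removes duplicates. The names `project_mapper`, `dedup_reducer`, and `run_projection` are hypothetical.

```python
from collections import defaultdict

SCHEMA = ("ProfId", "PSurname", "Department")

# The Professors table from the example.
PROFESSORS = [
    ("D102", "Smith", "Computer engineering"),
    ("D105", "Jones", "Computer engineering"),
    ("D104", "Smith", "Electronics"),
]


def project_mapper(record, attributes):
    # Keep only the attributes in S and use the projected record as the key.
    projected = tuple(record[SCHEMA.index(a)] for a in attributes)
    yield projected, None


def dedup_reducer(key, values):
    # All copies of the same projected record share one key,
    # so emitting the key once removes duplicates.
    yield key, None


def run_projection(table, attributes):
    groups = defaultdict(list)  # simulated shuffle: group values by key
    for record in table:
        for key, value in project_mapper(record, attributes):
            groups[key].append(value)
    return [pair for key, values in groups.items()
            for pair in dedup_reducer(key, values)]


result = run_projection(PROFESSORS, ("PSurname",))
print([key for key, _ in result])
```

The two Smith records collapse into one output record because they produce the same key after projection.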
DegreeCourseProf
ProfID  PSurname  Department
D102    Smith     Computer engineering
D105    Jones     Computer engineering
D104    White     Electronics
MasterCourseProf
ProfID  PSurname  Department
D102    Smith     Computer engineering
D101    Red       Electronics
Result
ProfID  PSurname  Department
D102    Smith     Computer engineering
D105    Jones     Computer engineering
D104    White     Electronics
D101    Red       Electronics
Mappers
For each input record t in R, emit one (key,
value) pair with key=t and value=null
For each input record t in S, emit one (key,
value) pair with key=t and value=null
Reducers
Emit one (key, value) pair for each input (key,
[list of values]) pair with key=t and value=null
▪ i.e., one single copy of each input record is
emitted
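The mapper and reducer steps above emit exactly one copy of every record that appears in R or in S. A minimal in-memory sketch, using the DegreeCourseProf and MasterCourseProf tables as R and S, could look as follows; the function names and the dictionary-based shuffle are assumptions made for illustration, not Hadoop API calls.

```python
from collections import defaultdict

# R and S: the two example tables, with the same schema.
DEGREE_COURSE_PROF = [
    ("D102", "Smith", "Computer engineering"),
    ("D105", "Jones", "Computer engineering"),
    ("D104", "White", "Electronics"),
]
MASTER_COURSE_PROF = [
    ("D102", "Smith", "Computer engineering"),
    ("D101", "Red", "Electronics"),
]


def union_mapper(record):
    # Same logic for records of R and of S: key=t, value=null.
    yield record, None


def union_reducer(key, values):
    # One output pair per distinct key, however many copies arrived.
    yield key, None


def run_union(r, s):
    groups = defaultdict(list)  # simulated shuffle: group values by key
    for record in r + s:
        for key, value in union_mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items()
            for pair in union_reducer(key, values)]


union_result = run_union(DEGREE_COURSE_PROF, MASTER_COURSE_PROF)
for key, _ in union_result:
    print(*key)
```

The record for D102 reaches the reducer twice, once from each table, and is emitted once.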
R ∩ S
R and S have the same schema
Produces a relation with the same schema
as R and S
There is a record t in the output of the
intersection operator if and only if t appears
in both relations (R and S)
DegreeCourseProf
ProfID  PSurname  Department
D102    Smith     Computer engineering
D105    Jones     Computer engineering
D104    White     Electronics
MasterCourseProf
ProfID  PSurname  Department
D102    Smith     Computer engineering
D101    Red       Electronics
Result
ProfID  PSurname  Department
D102    Smith     Computer engineering
Mappers
For each input record t in R, emit one
(key, value) pair with key=t and
value=“R”
For each input record t in S, emit one
(key, value) pair with key=t and
value=“S”
Reducers
Emit one (key, value) pair with key=t
and value=null for each input (key, [list
of values]) pair with [list of values]
containing two values
▪ This happens if and only if both R and S
contain t
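The tagging mappers and the two-value test in the reducer can be sketched in plain Python as follows. This is a simulation under stated assumptions: the shuffle is a dictionary, the function names are hypothetical, and, as required by the note on duplicates, neither input table contains the same record twice.

```python
from collections import defaultdict

# R and S: the two example tables, each free of duplicate records.
DEGREE_COURSE_PROF = [
    ("D102", "Smith", "Computer engineering"),
    ("D105", "Jones", "Computer engineering"),
    ("D104", "White", "Electronics"),
]
MASTER_COURSE_PROF = [
    ("D102", "Smith", "Computer engineering"),
    ("D101", "Red", "Electronics"),
]


def r_mapper(record):
    yield record, "R"  # tag the record with its source relation


def s_mapper(record):
    yield record, "S"


def intersection_reducer(key, values):
    # Two values means the record arrived once from R and once from S.
    if len(values) == 2:
        yield key, None


def run_intersection(r, s):
    groups = defaultdict(list)  # simulated shuffle: group values by key
    for mapper, table in ((r_mapper, r), (s_mapper, s)):
        for record in table:
            for key, value in mapper(record):
                groups[key].append(value)
    return [pair for key, values in groups.items()
            for pair in intersection_reducer(key, values)]


intersection_result = run_intersection(DEGREE_COURSE_PROF, MASTER_COURSE_PROF)
print([key for key, _ in intersection_result])
```

Counting values is sufficient only because the inputs are duplicate-free; with duplicates, the reducer would have to check that both tags "R" and "S" are present.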
R-S
R and S have the same schema
Produces a relation with the same schema
as R and S
There is a record t in the output of the
difference operator if and only if t appears
in R but not in S
DegreeCourseProf
ProfID  PSurname  Department
D102    Smith     Computer engineering
D105    Jones     Computer engineering
D104    White     Electronics
MasterCourseProf
ProfID  PSurname  Department
D102    Smith     Computer engineering
D101    Red       Electronics
Result
ProfID  PSurname  Department
D105    Jones     Computer engineering
D104    White     Electronics
Mappers
For each input record t in R, emit one (key,
value) pair with key=t and value=name of
the relation (i.e., R)
For each input record t in S, emit one (key,
value) pair with key=t and value=name of
the relation (i.e., S)
Two mapper classes are needed
One for each relation
Reducers
Emit one (key, value) pair with key=t
and value=null for each input (key, [list
of values]) pair with [list of values]
containing only the value R
▪ This happens if and only if t appears in R but
not in S
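The difference job can be simulated in plain Python as sketched below, with one mapper function per relation standing in for the two mapper classes. The function names and the dictionary-based shuffle are assumptions for illustration and are not part of any Hadoop API.

```python
from collections import defaultdict

# R and S: the two example tables, with the same schema.
DEGREE_COURSE_PROF = [
    ("D102", "Smith", "Computer engineering"),
    ("D105", "Jones", "Computer engineering"),
    ("D104", "White", "Electronics"),
]
MASTER_COURSE_PROF = [
    ("D102", "Smith", "Computer engineering"),
    ("D101", "Red", "Electronics"),
]


def r_mapper(record):
    yield record, "R"  # value = name of the relation


def s_mapper(record):
    yield record, "S"


def difference_reducer(key, values):
    # Emit t only when its list of values contains only the value R.
    if values == ["R"]:
        yield key, None


def run_difference(r, s):
    groups = defaultdict(list)  # simulated shuffle: group values by key
    for mapper, table in ((r_mapper, r), (s_mapper, s)):
        for record in table:
            for key, value in mapper(record):
                groups[key].append(value)
    return [pair for key, values in groups.items()
            for pair in difference_reducer(key, values)]


difference_result = run_difference(DEGREE_COURSE_PROF, MASTER_COURSE_PROF)
for key, _ in difference_result:
    print(*key)
```

Unlike union and intersection, the operator is not symmetric, so the reducer must know which relation each value came from; this is why the tag carries the relation name.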
The join operators can be implemented by
using the Join pattern
Either the reduce-side or the map-side join pattern is used,
depending on the size of the input relations/tables
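As an illustration of the reduce-side variant, the sketch below joins the Courses and Professors example tables on the professor identifier. Both the choice of join attribute and the job design are assumptions made for this example: mappers emit the join attribute as the key and tag each record with its table name, and reducers combine the records that share a key. All function names are hypothetical.

```python
from collections import defaultdict

COURSES = [
    ("M2170", "Computer science", 1, "D102"),
    ("M4880", "Digital systems", 2, "D104"),
    ("F1401", "Electronics", 1, "D104"),
    ("F0410", "Databases", 2, "D102"),
]
PROFESSORS = [
    ("D102", "Smith", "Computer engineering"),
    ("D105", "Jones", "Computer engineering"),
    ("D104", "Smith", "Electronics"),
]


def course_mapper(record):
    # The join attribute ProfID is the last attribute of Courses.
    yield record[3], ("Courses", record)


def professor_mapper(record):
    # The join attribute ProfId is the first attribute of Professors.
    yield record[0], ("Professors", record)


def join_reducer(key, values):
    courses = [rec for tag, rec in values if tag == "Courses"]
    professors = [rec for tag, rec in values if tag == "Professors"]
    # Pair every course with every professor that has the same key.
    for course in courses:
        for professor in professors:
            # Drop the repeated join attribute from the professor side.
            yield key, course + professor[1:]


def run_join(courses, professors):
    groups = defaultdict(list)  # simulated shuffle: group values by key
    for mapper, table in ((course_mapper, courses), (professor_mapper, professors)):
        for record in table:
            for key, value in mapper(record):
                groups[key].append(value)
    return [pair for key, values in groups.items()
            for pair in join_reducer(key, values)]


joined = run_join(COURSES, PROFESSORS)
for _, record in joined:
    print(*record)
```

Professor D105 teaches no course, so that key reaches the reducer with no matching course record and produces no output.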
Aggregations and Group by are implemented
by using the Summarization pattern
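A minimal sketch of a group-by aggregation follows, using a hypothetical query on the Courses example table: the number of courses per professor. Mappers emit the grouping attribute as the key and a partial count as the value; reducers sum the values of each group. The query and the function names are assumptions for illustration.

```python
from collections import defaultdict

COURSES = [
    ("M2170", "Computer science", 1, "D102"),
    ("M4880", "Digital systems", 2, "D104"),
    ("F1401", "Electronics", 1, "D104"),
    ("F0410", "Databases", 2, "D102"),
]


def count_mapper(record):
    # Group-by attribute (ProfID) as key, partial count 1 as value.
    yield record[3], 1


def sum_reducer(key, values):
    # Aggregate all partial counts of one group.
    yield key, sum(values)


def run_summarization(table):
    groups = defaultdict(list)  # simulated shuffle: group values by key
    for record in table:
        for key, value in count_mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items()
            for pair in sum_reducer(key, values)]


counts = dict(run_summarization(COURSES))
print(counts)
```

Because summation is associative and commutative, the same reducer logic could also serve as a combiner to reduce the data sent over the network.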