Module 5 Data Science
Lecture Notes
on
Module 5
HADOOP AND SPARK
(21CS754)
2021 Scheme
Prepared By,
Mrs. Prathibha S,
Assistant Professor,
Department of CSE, PESITM
MODULE - 5
Hadoop: a framework for storing and processing large data sets.
Hadoop Architecture
Hadoop is a framework for the distributed storage and parallel processing of large data sets. Its core programming model is MapReduce.
A MapReduce algorithm splits up the data, processes it in parallel, and then sorts,
combines, and aggregates the results back together. However, the MapReduce algorithm
isn’t well suited for interactive analysis or iterative programs because it writes the data
to disk between each computational step. This is expensive when working with
large data sets.
MapReduce example: (a simplified example of a MapReduce flow for counting the
colors in input texts)
Let’s look at how MapReduce would work on a small fictitious example. You’re the director
of a toy company. Every toy has two colors, and when a client orders a toy from the web page,
the web page puts an order file on Hadoop with the colors of that toy. Your task is to find out
how many color units you need to prepare. You’ll use a MapReduce-style algorithm to count
the colors. First let’s look at a simplified version.
As the name suggests, the process roughly boils down to two big phases:
■ Mapping phase—The documents are split up into key-value pairs. Until we
reduce, we can have many duplicates.
■ Reduce phase—It’s not unlike a SQL “group by.” The different unique occurrences
are grouped together, and depending on the reducing function, a different
result can be created. Here we wanted a count per color, so that’s what the
reduce function returns.
The whole process is described in the following six steps and depicted in figure 5.4;
a small Python sketch of the same flow follows the list.
1. Reading the input files.
2. Passing each line to a mapper job.
3. The mapper job parses the colors (keys) out of the file and outputs a file for each
color with the number of times it has been encountered (value). More technically,
it maps a key (the color) to a value (the number of occurrences).
4. The keys get shuffled and sorted to facilitate the aggregation.
5. The reduce phase sums the number of occurrences per color and outputs one
file per key with the total number of occurrences for each color.
6. The keys are collected in an output file.
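The same flow can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle/sort, and reduce steps on a single machine; the file names and their contents are invented for the example, and a real Hadoop job would distribute this work over many machines.

from collections import defaultdict

# Fictitious order files: each order line lists the two colors of one toy.
# The file names and contents are invented for illustration only.
order_files = {
    "order_1.txt": ["green blue", "blue red"],
    "order_2.txt": ["red green", "blue blue"],
}

# Steps 1-3 (map): read each file, pass every line to a mapper that emits
# one (color, 1) pair per color it encounters. Duplicates are allowed here.
mapped = []
for filename, lines in order_files.items():
    for line in lines:
        for color in line.split():
            mapped.append((color, 1))

# Step 4 (shuffle and sort): group all pairs that share the same key (color).
grouped = defaultdict(list)
for color, one in sorted(mapped):
    grouped[color].append(one)

# Steps 5-6 (reduce): sum the occurrences per color and collect the result.
result = {color: sum(ones) for color, ones in grouped.items()}
print(result)    # {'blue': 4, 'green': 2, 'red': 2}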
Spark: replacing MapReduce for better performance
What is Spark?
Spark is well suited to interactive and iterative analysis because it doesn’t need to
write the data to disk after each step, and it has a much better model than Hadoop
MapReduce for sharing data between servers (a kind of distributed memory).
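As a minimal sketch of this idea (assuming the interactive PySpark shell, where the Spark context is already available as the variable sc, and a hypothetical input file numbers.txt with one integer per line), an intermediate data set can be kept in memory with cache() and reused by several computations without being written to disk in between:

# Sketch only: reuse an in-memory RDD across several computations.
# 'sc' is the SparkContext the PySpark shell provides; 'numbers.txt' is a
# hypothetical file with one integer per line.
raw = sc.textFile("numbers.txt")
numbers = raw.map(lambda line: int(line)).cache()   # keep the parsed data in memory

total = numbers.reduce(lambda a, b: a + b)   # first computation over the cached data
count = numbers.count()                      # second computation reuses the cached data
print(total / float(count))                  # average, computed without re-reading the file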
The transformation consists of four parts:
1. Start up PySpark (it should still be open from section 5.2.2) and load the Spark
and Hive contexts.
2. Read and parse the .CSV file.
3. Split the header line from the data.
4. Clean the data.
Listing 5.3 Connecting to Apache Spark
Step 1: Starting up Spark in interactive mode and loading the context
The Spark context import isn’t required in the PySpark console because a context is readily
available as the variable sc. You might have noticed that this is also mentioned in the startup
message when you open PySpark, in case you overlooked it. We then load a Hive context to
enable us to work interactively with Hive. If you work interactively with Spark, the Spark and
Hive contexts are loaded automatically, but if you want to use Spark in batch mode you need
to load them manually.
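As a minimal sketch of what step 1 could look like in batch mode (assuming Spark 1.x-style APIs, where HiveContext is still available; the application name below is an arbitrary, illustrative choice):

# Sketch: create the Spark and Hive contexts manually (needed in batch mode).
# In the interactive PySpark shell these already exist as sc and sqlContext.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="module5_example")   # appName is an illustrative choice
sqlContext = HiveContext(sc)                   # enables working with Hive tables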