
PES Institute of Technology and Management

NH-206, Sagar Road, Shivamogga-577204

Department of Computer Science and Engineering

Affiliated to

VISVESVARAYA TECHNOLOGICAL UNIVERSITY


Jnana Sangama, Belagavi, Karnataka – 590018

Lecture Notes
on

Module 5
HADOOP AND SPARK
(21CS754)
2021 Scheme

Prepared By,
Mrs. Prathibha S,
Assistant Professor,
Department of CSE, PESITM

MODULE -5
Hadoop: a framework for storing and processing large data sets.

Apache Hadoop is a framework that simplifies working with a cluster of computers. It
aims to be all of the following things and more:
■ Reliable—By automatically creating multiple copies of the data and redeploying
processing logic in case of failure.
■ Fault tolerant—It detects faults and applies automatic recovery.
■ Scalable—Data and its processing are distributed over clusters of computers
(horizontal scaling).
■ Portable—Installable on all kinds of hardware and operating systems.

Hadoop Architecture

At the heart of Hadoop we find:
■ A distributed file system (HDFS)
■ A method to execute programs on a massive scale (MapReduce)
■ A system to manage the cluster resources (YARN)
MAPREDUCE: HOW HADOOP ACHIEVES PARALLELISM
Hadoop uses a programming method called MapReduce to achieve parallelism.

A MapReduce algorithm splits up the data, processes it in parallel, and then sorts,
combines, and aggregates the results back together. However, the MapReduce algorithm
isn’t well suited for interactive analysis or iterative programs because it writes the data
to a disk in between each computational step. This is expensive when working with
large data sets.

Map Reduce example: (simplified example of a MapReduce flow for counting the
colors in input texts)

Let's look at how MapReduce would work on a small fictitious example. You're the director of a toy company.
Every toy has two colors, and when a client orders a toy from the web page, the web page
puts an order file on Hadoop with the colors of the
toy. Your task is to find out how many color units you need to prepare. You’ll use a
MapReduce-style algorithm to count the colors. First let’s look at a simplified version.
As the name suggests, the process roughly boils down to two big phases:
■ Mapping phase—The documents are split up into key-value pairs. Until we
reduce, we can have many duplicates.

■ Reduce phase—It’s not unlike a SQL “group by.” The different unique occurrences
are grouped together, and depending on the reducing function, a different
result can be created. Here we wanted a count per color, so that’s what the
reduce function returns.

The whole process is described in the following six steps and depicted in figure 5.4.
1. Reading the input files.
2. Passing each line to a mapper job.
3. The mapper job parses the colors (keys) out of the file and outputs a file for each
color with the number of times it has been encountered (value). Or more technically
said, it maps a key (the color) to a value (the number of occurrences).
4. The keys get shuffled and sorted to facilitate the aggregation.
5. The reduce phase sums the number of occurrences per color and outputs one
file per key with the total number of occurrences for each color.
6. The keys are collected in an output file.
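The same flow can be sketched in plain Python. This is only an illustrative, single-machine stand-in for what Hadoop runs in a distributed fashion, and the sample orders below are made up:

from collections import defaultdict

# Hypothetical order files: each order lists the two colors of a toy.
orders = ["green blue", "blue orange", "green green"]

# Mapping phase: emit a (color, 1) key-value pair for every color seen.
mapped = [(color, 1) for line in orders for color in line.split()]

# Shuffle and sort: group the pairs by key (the color).
grouped = defaultdict(list)
for color, count in sorted(mapped):
    grouped[color].append(count)

# Reduce phase: sum the occurrences per color.
reduced = {color: sum(counts) for color, counts in grouped.items()}
print(reduced)  # {'blue': 2, 'green': 3, 'orange': 1}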
Spark: replacing MapReduce for better performance

What is Spark?

Spark is a cluster computing framework similar to MapReduce. Spark, however, doesn't
handle the storage of files on the (distributed) file system itself, nor does it handle the
resource management. For this it relies on systems such as the Hadoop File System, YARN, or
Apache Mesos. Hadoop and Spark are thus complementary systems. For testing and
development, you can even run Spark on your local system.

HOW DOES SPARK SOLVE THE PROBLEMS OF MAPREDUCE?


Spark creates a kind of shared RAM memory between the computers of your cluster. This
allows the different workers to share variables (and their state) and thus eliminates the need
to write the intermediate results to disk. More technically and more correctly if you’re into
that: Spark uses Resilient Distributed Datasets (RDD), which are a distributed memory
abstraction that lets programmers perform in-memory computations on large clusters in a
faulttolerant way.1 Because it’s an in-memory system, it avoids costly disk operations.
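A short PySpark sketch of that idea, assuming an interactive PySpark shell where the Spark context is available as sc (the numbers are made up):

# Build an RDD and keep it in memory, so later steps reuse it
# instead of writing intermediate results to disk.
numbers = sc.parallelize(range(1, 1001))
squares = numbers.map(lambda x: x * x).cache()

# Both actions below reuse the cached, in-memory RDD.
print(squares.count())  # 1000
print(squares.take(5))  # [1, 4, 9, 16, 25]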

THE DIFFERENT COMPONENTS OF THE SPARK ECOSYSTEM


Spark core provides a NoSQL environment well suited for interactive, exploratory analysis.
Spark can be run in batch and interactive mode and supports Python. Spark has four other
large components, as listed below and depicted in figure 5.5.
1 Spark streaming is a tool for real-time analysis.
2 Spark SQL provides a SQL interface to work with Spark.
3 MLLib is a tool for machine learning inside the Spark framework.
4 GraphX is a graph database for Spark. We'll go deeper into graph databases later.

DATA PREPARATION IN SPARK


Cleaning data is often an interactive exercise: you spot a problem and fix it, and you'll
likely do this a couple of times before you have clean and crisp data. An example of
dirty data would be a string such as “UsA”, which is improperly capitalized. At this
point, we no longer work in jobs.py but use the PySpark command line interface to
interact directly with Spark.

Spark is well suited for this type of interactive analysis because it doesn’t need to
save the data after each step and has a much better model than Hadoop for sharing
data between servers (a kind of distributed memory).
The transformation consists of four parts:
1 Start up PySpark (should still be open from section 5.2.2) and load the Spark
and Hive context.
2 Read and parse the .CSV file.
3 Split the header line from the data.
4 Clean the data.
Listing 5.3 Connecting to Apache Spark
Step 1: Starting up Spark in interactive mode and loading the context
The Spark context import isn't required in the PySpark console because a context is readily
available as the variable sc; this is also mentioned when PySpark starts up, in case you
overlooked it. We then load a Hive context to enable us to work interactively with Hive. If
you work interactively with Spark, the Spark and Hive contexts are loaded automatically, but
if you want to use Spark in batch mode you need to load them manually.
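A minimal sketch of what batch mode looks like, assuming the Spark 1.x-style API with HiveContext that these notes describe (the app name is just a placeholder):

from pyspark import SparkContext
from pyspark.sql import HiveContext

# In the PySpark shell these already exist; in batch mode we create them ourselves.
sc = SparkContext(appName="data_preparation")
hiveCtx = HiveContext(sc)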

Step 2: Reading and parsing the .CSV file


Next we read the file from the Hadoop file system and split it at every comma we encounter.
In our code the first line reads the .CSV file from the Hadoop file system; the second line
splits every line when it encounters a comma. Our .CSV parser is naïve by design because
we're learning about Spark, but you could also use Python's csv package to parse the lines
more robustly.
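A sketch of those two lines, assuming sc is the Spark context and the HDFS path is a placeholder:

# Read the raw .CSV file from the Hadoop file system (placeholder path).
raw_data = sc.textFile("/user/hduser/data.csv")

# Split every line on commas; each record becomes a list of strings.
data = raw_data.map(lambda line: line.split(","))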

Step 3: Split the header line from the data


To separate the header from the data, we read in the first line and retain every line
that’s not similar to the header line.

Step 4: Clean the data


In this step we perform basic cleaning to enhance the data quality, which allows us to
build a better report. After the second step, our data consists of arrays, so every input
to our lambda function is an array and it returns an array. To ease this task, we build a
helper function that does the cleaning. Our cleaning consists of reformatting an input such
as “10,4%” to 0.104, encoding every string as utf-8, replacing underscores with spaces, and
lowercasing all the strings.
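A sketch of such a helper, applied to the data RDD from the previous step; the exact column layout is an assumption, and the utf-8 encoding step only matters on Python 2:

def clean_row(row):
    # row is a list of strings; return a cleaned list.
    cleaned = []
    for value in row:
        value = value.replace("_", " ").lower()  # underscores to spaces, lowercase
        # On Python 2 you would also encode here: value = value.encode("utf-8")
        if value.endswith("%"):
            # Reformat e.g. "10,4%" (comma as decimal separator) to 0.104.
            value = float(value[:-1].replace(",", ".")) / 100
        cleaned.append(value)
    return cleaned

data = data.map(clean_row)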
SAVE THE DATA IN HIVE
To store data in Hive we need to complete two steps:
1 Create and register metadata.
2 Execute SQL statements to save data in Hive.
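A hedged sketch of both steps with the Spark 1.x HiveContext created earlier; the column names and table names are placeholders, and the column types should be adjusted to match the cleaned values:

from pyspark.sql.types import StructType, StructField, StringType

# Step 1: create and register metadata (placeholder column names).
fields = [StructField(name, StringType(), True)
          for name in ("country", "indicator", "value")]
schema = StructType(fields)

df = hiveCtx.createDataFrame(data, schema)
df.registerTempTable("cleaned_data")  # temporary, in-memory table

# Step 2: execute SQL statements to persist the data in Hive.
hiveCtx.sql("DROP TABLE IF EXISTS cleaned_data_hive")
hiveCtx.sql("CREATE TABLE cleaned_data_hive AS SELECT * FROM cleaned_data")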
