
PES Institute of Technology and Management

NH-206, Sagar Road, Shivamogga-577204

Department of Computer Science and Engineering

Affiliated to

VISVESVARAYA TECHNOLOGICAL UNIVERSITY


Jnana Sangama, Belagavi, Karnataka – 590018

Lecture Notes
on

Module 5
HADOOP AND SPARK
(21CS754)
2021 Scheme

Prepared By,
Mrs. Prathibha S,
Assistant Professor,
Department of CSE, PESITM

MODULE -5
Hadoop: a framework for storing and processing large data sets.

Apache Hadoop is a framework that simplifies working with a cluster of computers. It
aims to be all of the following things and more:
■ Reliable—By automatically creating multiple copies of the data and redeploying
processing logic in case of failure.
■ Fault tolerant—It detects faults and applies automatic recovery.
■ Scalable—Data and its processing are distributed over clusters of computers
(horizontal scaling).
■ Portable—Installable on all kinds of hardware and operating systems.

Hadoop Architecture

At the heart of Hadoop we find:
■ A distributed file system (HDFS)
■ A method to execute programs on a massive scale (MapReduce)
■ A system to manage the cluster resources (YARN)
MAPREDUCE: HOW HADOOP ACHIEVES PARALLELISM
Hadoop uses a programming method called MapReduce to achieve parallelism.

A MapReduce algorithm splits up the data, processes it in parallel, and then sorts,
combines, and aggregates the results back together. However, the MapReduce algorithm
isn’t well suited for interactive analysis or iterative programs because it writes the data
to a disk in between each computational step. This is expensive when working with
large data sets.

Map Reduce example: (simplified example of a MapReduce flow for counting the
colors in input texts)

Let's look at how MapReduce would work on a small fictitious example. You're the director of a toy company.
Every toy has two colors, and when a client orders a toy from the web page, the web page
puts an order file on Hadoop with the colors of the
toy. Your task is to find out how many color units you need to prepare. You’ll use a
MapReduce-style algorithm to count the colors. First let’s look at a simplified version.
As the name suggests, the process roughly boils down to two big phases:
■ Mapping phase—The documents are split up into key-value pairs. Until we
reduce, we can have many duplicates.

■ Reduce phase—It’s not unlike a SQL “group by.” The different unique occurrences
are grouped together, and depending on the reducing function, a different
result can be created. Here we wanted a count per color, so that’s what the
reduce function returns.

The whole process is described in the following six steps and depicted in figure 5.4.
1. Reading the input files.
2. Passing each line to a mapper job.
3. The mapper job parses the colors (keys) out of the file and outputs a file for each
color with the number of times it has been encountered (value). Or more technically
said, it maps a key (the color) to a value (the number of occurrences).
4. The keys get shuffled and sorted to facilitate the aggregation.
5. The reduce phase sums the number of occurrences per color and outputs one
file per key with the total number of occurrences for each color.
6. The keys are collected in an output file.
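The same flow can be sketched in plain Python. This is only an illustrative, single-machine stand-in for what Hadoop runs in a distributed fashion, and the sample orders below are made up:

from collections import defaultdict

# Hypothetical order files: each order lists the two colors of a toy.
orders = ["green blue", "blue orange", "green green"]

# Mapping phase: emit a (color, 1) key-value pair for every color seen.
mapped = [(color, 1) for line in orders for color in line.split()]

# Shuffle and sort: group the pairs by key (the color).
grouped = defaultdict(list)
for color, count in sorted(mapped):
    grouped[color].append(count)

# Reduce phase: sum the occurrences per color.
reduced = {color: sum(counts) for color, counts in grouped.items()}
print(reduced)  # {'blue': 2, 'green': 3, 'orange': 1}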
Spark: replacing MapReduce for better performance

What is Spark?

Spark is a cluster computing framework similar to MapReduce. Spark, however, doesn't
handle the storage of files on the (distributed) file system itself, nor does it handle the
resource management. For this it relies on systems such as the Hadoop File System, YARN, or
Apache Mesos. Hadoop and Spark are thus complementary systems. For testing and
development, you can even run Spark on your local system.

HOW DOES SPARK SOLVE THE PROBLEMS OF MAPREDUCE?


Spark creates a kind of shared RAM memory between the computers of your cluster. This
allows the different workers to share variables (and their state) and thus eliminates the need
to write the intermediate results to disk. More technically and more correctly if you’re into
that: Spark uses Resilient Distributed Datasets (RDD), which are a distributed memory
abstraction that lets programmers perform in-memory computations on large clusters in a
faulttolerant way.1 Because it’s an in-memory system, it avoids costly disk operations.
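A short PySpark sketch of that idea, assuming an interactive PySpark shell where the Spark context is available as sc (the numbers are made up):

# Build an RDD and keep it in memory, so later steps reuse it
# instead of writing intermediate results to disk.
numbers = sc.parallelize(range(1, 1001))
squares = numbers.map(lambda x: x * x).cache()

# Both actions below reuse the cached, in-memory RDD.
print(squares.count())  # 1000
print(squares.take(5))  # [1, 4, 9, 16, 25]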

THE DIFFERENT COMPONENTS OF THE SPARK ECOSYSTEM


Spark core provides a NoSQL environment well suited for interactive, exploratory analysis.
Spark can be run in batch and interactive mode and supports Python. Spark has four other
large components, as listed below and depicted in figure 5.5.
1 Spark streaming is a tool for real-time analysis.
2 Spark SQL provides a SQL interface to work with Spark.
3 MLLib is a tool for machine learning inside the Spark framework.
4 GraphX is a graph database for Spark. We'll go deeper into graph databases later.

DATA PREPARATION IN SPARK


Cleaning data is often an interactive exercise: you spot a problem and fix it, and you'll
likely do this a couple of times before you have clean and crisp data. An example of
dirty data would be a string such as “UsA”, which is improperly capitalized. At this
point, we no longer work in jobs.py but use the PySpark command line interface to
interact directly with Spark.

Spark is well suited for this type of interactive analysis because it doesn’t need to
save the data after each step and has a much better model than Hadoop for sharing
data between servers (a kind of distributed memory).
The transformation consists of four parts:
1 Start up PySpark (should still be open from section 5.2.2) and load the Spark
and Hive context.
2 Read and parse the .CSV file.
3 Split the header line from the data.
4 Clean the data.
Listing 5.3 Connecting to Apache Spark
Step 1: Starting up Spark in interactive mode and loading the context
The Spark context import isn't required in the PySpark console because a context is readily
available as the variable sc; this is also mentioned when PySpark starts up, in case you
overlooked it. We then load a Hive context to enable us to work interactively with Hive. If
you work interactively with Spark, the Spark and Hive contexts are loaded automatically, but
if you want to use Spark in batch mode you need to load them manually.
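A minimal sketch of what batch mode looks like, assuming the Spark 1.x-style API with HiveContext that these notes describe (the app name is just a placeholder):

from pyspark import SparkContext
from pyspark.sql import HiveContext

# In the PySpark shell these already exist; in batch mode we create them ourselves.
sc = SparkContext(appName="data_preparation")
hiveCtx = HiveContext(sc)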

Step 2: Reading and parsing the .CSV file


Next we read the file from the Hadoop file system and split it at every comma we encounter.
In our code the first line reads the .CSV file from the Hadoop file system; the second line
splits every line when it encounters a comma. Our .CSV parser is naïve by design because
we're learning about Spark, but you could also use Python's csv package to parse the lines
more robustly.
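A sketch of those two lines, assuming sc is the Spark context and the HDFS path is a placeholder:

# Read the raw .CSV file from the Hadoop file system (placeholder path).
raw_data = sc.textFile("/user/hduser/data.csv")

# Split every line on commas; each record becomes a list of strings.
data = raw_data.map(lambda line: line.split(","))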

Step 3: Split the header line from the data


To separate the header from the data, we read in the first line and retain every line
that’s not similar to the header line.

Step 4: Clean the data


In this step we perform basic cleaning to enhance the data quality, which allows us to
build a better report. After the second step, our data consists of arrays, so every input
to our lambda function is an array and it returns an array. To ease this task, we build a
helper function that does the cleaning. Our cleaning consists of reformatting an input such
as “10,4%” to 0.104, encoding every string as utf-8, replacing underscores with spaces, and
lowercasing all the strings.
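A sketch of such a helper, applied to the data RDD from the previous step; the exact column layout is an assumption, and the utf-8 encoding step only matters on Python 2:

def clean_row(row):
    # row is a list of strings; return a cleaned list.
    cleaned = []
    for value in row:
        value = value.replace("_", " ").lower()  # underscores to spaces, lowercase
        # On Python 2 you would also encode here: value = value.encode("utf-8")
        if value.endswith("%"):
            # Reformat e.g. "10,4%" (comma as decimal separator) to 0.104.
            value = float(value[:-1].replace(",", ".")) / 100
        cleaned.append(value)
    return cleaned

data = data.map(clean_row)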
SAVE THE DATA IN HIVE
To store data in Hive we need to complete two steps:
1 Create and register metadata.
2 Execute SQL statements to save data in Hive.
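A hedged sketch of both steps with the Spark 1.x HiveContext created earlier; the column names and table names are placeholders, and the column types should be adjusted to match the cleaned values:

from pyspark.sql.types import StructType, StructField, StringType

# Step 1: create and register metadata (placeholder column names).
fields = [StructField(name, StringType(), True)
          for name in ("country", "indicator", "value")]
schema = StructType(fields)

df = hiveCtx.createDataFrame(data, schema)
df.registerTempTable("cleaned_data")  # temporary, in-memory table

# Step 2: execute SQL statements to persist the data in Hive.
hiveCtx.sql("DROP TABLE IF EXISTS cleaned_data_hive")
hiveCtx.sql("CREATE TABLE cleaned_data_hive AS SELECT * FROM cleaned_data")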
