
Overview of Spark

Let’s learn something!


Python and Spark

● Before we begin the setup and coding with Python and Spark, let’s discuss what Spark is in the context of Big Data.
● We’ll begin with a general explanation of what Big Data is and related technologies.
Big Data Overview

● What is “Big Data”?
● Explanation of Hadoop, MapReduce, and Spark
● Local versus Distributed Systems
● Overview of the Hadoop Ecosystem
● Overview of Spark
Big Data

● Data that can fit on a local computer is on the scale of 0–32 GB, depending on RAM.
● But what can we do if we have a larger set of data?
○ Try using a SQL database to move storage onto the hard drive instead of RAM
○ Or use a distributed system that distributes the data across multiple machines/computers
Local versus Distributed

[Diagram: a single local machine with several cores, versus a distributed system in which many networked machines each contribute their own cores]
Big Data

● A local process will use the computational resources of a single machine
● A distributed process has access to the computational resources of a number of machines connected through a network
Big Data

● After a certain point, it is easier to scale out to many machines with lower-powered CPUs than to scale up a single machine with a very powerful CPU.
● Distributed systems also have the advantage of easy scaling: you can just add more machines.
Big Data

● They also provide fault tolerance: if one machine fails, the rest of the network can carry on.
● Let’s discuss the typical layout of a distributed architecture that uses Hadoop.
Hadoop

● Hadoop is a way to distribute very large files across multiple machines.
● It uses the Hadoop Distributed File System (HDFS).
● HDFS allows a user to work with large data sets.
● HDFS also duplicates blocks of data for fault tolerance.
● Hadoop then uses MapReduce.
● MapReduce allows computations on that data.
Distributed Storage - HDFS
[Diagram: a Name Node (CPU, RAM) coordinating three Data Nodes, each with its own CPU and RAM]

Distributed Storage - HDFS

● HDFS will use blocks of data, with a size of 128 MB by default.
● Each of these blocks is replicated 3 times.
● The blocks are distributed in a way that supports fault tolerance.
Distributed Storage - HDFS

● Smaller blocks provide more parallelization during processing.
● Multiple copies of a block prevent loss of data due to the failure of a node.
● A quick arithmetic sketch of this block layout follows below.
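To make these defaults concrete, here is a minimal arithmetic sketch in plain Python showing how a hypothetical 1 GB file would be laid out; the file size is just an example, while the block size and replication factor are the HDFS defaults mentioned above.

```python
import math

# Hypothetical example file; 128 MB blocks and 3x replication are the
# HDFS defaults described on the previous slides.
file_size_mb = 1024   # a 1 GB file (hypothetical)
block_size_mb = 128   # HDFS default block size
replication = 3       # HDFS default replication factor

num_blocks = math.ceil(file_size_mb / block_size_mb)
total_copies = num_blocks * replication

print(num_blocks)    # 8 blocks
print(total_copies)  # 24 block copies spread across the data nodes
```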
MapReduce

● MapReduce is a way of splitting a computation task across a distributed set of files (such as HDFS).
● It consists of a Job Tracker and multiple Task Trackers.
MapReduce

● The Job Tracker sends code to run on the Task Trackers.
● The Task Trackers allocate CPU and memory for the tasks and monitor the tasks on the worker nodes.
● A plain-Python sketch of the map and reduce idea follows below.
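To see what the map and reduce phases mean, here is a conceptual sketch in plain Python. It is not the Hadoop API, just an illustration of the word-count pattern MapReduce is built around, using a made-up list of lines.

```python
from collections import defaultdict
from functools import reduce

lines = ["spark is fast", "hadoop uses hdfs", "spark uses memory"]  # toy input

# Map phase: each task independently emits (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the counts for each word.
word_counts = {word: reduce(lambda a, b: a + b, counts)
               for word, counts in grouped.items()}

print(word_counts)  # e.g. {'spark': 2, 'is': 1, 'fast': 1, 'hadoop': 1, ...}
```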
Big Data

● What we covered can be thought of as two distinct parts:
○ Using HDFS to distribute large data sets
○ Using MapReduce to distribute a computational task across a distributed data set
● Next we will learn about the latest technology in this space, known as Spark.
● Spark improves on these ideas of distributed storage and computation.
Spark

● This lecture will be an abstract overview; we will discuss:
○ Spark
○ Spark vs MapReduce
○ Spark RDDs
○ Spark DataFrames
Spark

● Spark is one of the latest technologies being used to quickly and easily handle Big Data.
● It is an open-source Apache project.
● It was first released in February 2013 and has exploded in popularity due to its ease of use and speed.
● It was created at the AMPLab at UC Berkeley.
Spark

● You can think of Spark as a flexible alternative to MapReduce.
● Spark can use data stored in a variety of formats:
○ Cassandra
○ AWS S3
○ HDFS
○ And more
● The sketch below shows the same read API pointed at several of these sources.
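As a rough PySpark sketch of this flexibility, the snippet below reads data with the same API from a local file, HDFS, and S3. Every path, host name, and bucket name here is hypothetical, and the S3 read assumes the cluster has the Hadoop AWS connector and credentials configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data_sources_sketch").getOrCreate()

# Local file system (hypothetical path)
df_local = spark.read.csv("file:///tmp/example.csv", header=True)

# HDFS (hypothetical namenode address and path)
df_hdfs = spark.read.parquet("hdfs://namenode:9000/data/example")

# AWS S3 (hypothetical bucket; requires the hadoop-aws package and credentials)
df_s3 = spark.read.json("s3a://my-bucket/example.json")
```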
Spark vs MapReduce

● MapReduce requires files to be stored in HDFS; Spark does not!
● Spark can also perform operations up to 100x faster than MapReduce.
● So how does it achieve this speed?
Spark vs MapReduce

● MapReduce writes most data to disk after each map and reduce operation.
● Spark keeps most of the data in memory after each transformation.
● Spark can spill over to disk if memory is filled (see the caching sketch below).
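A minimal sketch of this in-memory behaviour, assuming a hypothetical input file: the MEMORY_AND_DISK storage level asks Spark to keep data in memory and spill to disk only when memory fills up.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching_sketch").getOrCreate()

# Hypothetical input file on HDFS
df = spark.read.csv("hdfs://namenode:9000/data/big.csv", header=True)

# Keep the data in memory, spilling to disk only if memory fills up
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()  # first action: reads from storage and populates the cache
df.count()  # second action: served mostly from memory, so much faster
```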
Spark RDDs

● At the core of Spark is the idea of a Resilient Distributed Dataset (RDD).
● An RDD has 4 main features:
○ Distributed collection of data
○ Fault tolerance
○ Parallel operation - partitioned
○ Ability to use many data sources
● A minimal example of creating an RDD follows below.
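A minimal sketch of creating an RDD; the data here is just a small in-memory Python list used for illustration, but the same kind of collection could come from HDFS, S3, Cassandra, and so on.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_sketch").getOrCreate()
sc = spark.sparkContext

# A distributed collection of data, split into 4 partitions that can be
# operated on in parallel
rdd = sc.parallelize(range(100), numSlices=4)

print(rdd.getNumPartitions())  # 4
print(rdd.sum())               # 4950, computed across the partitions
```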
Spark RDDs

● RDDs are immutable, lazily evaluated, and cacheable.
● There are two types of Spark operations:
○ Transformations
○ Actions
● Transformations are basically a recipe to follow.
● Actions actually perform what the recipe says to do and return a result.
Spark RDDs

● This behaviour carries over to the syntax when coding.
● A lot of the time you will write a method call but won’t see any result until you call an action.
● This makes sense: with a large dataset, you don’t want to compute all the transformations until you are sure you want to perform them!
● The sketch below shows this lazy behaviour in practice.
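A small sketch of this lazy behaviour with a toy dataset: the two transformations only build up the recipe, and nothing runs until the action at the end.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy_eval_sketch").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))

squared = rdd.map(lambda x: x * x)            # transformation: nothing runs yet
evens = squared.filter(lambda x: x % 2 == 0)  # still just a recipe

print(evens.collect())  # action: Spark now executes the whole pipeline
# [0, 4, 16, 36, 64]
```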
Spark RDDs

● When discussing Spark syntax you will see RDD versus DataFrame syntax show up.
● With the release of Spark 2.0, Spark is moving towards a DataFrame-based syntax. Keep in mind that the way the data is distributed can still be thought of as RDDs; it is just the syntax you type that is changing (the two styles are compared below).
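A hedged sketch of the same filtering logic written in both styles; the rows and the column names are made up purely for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("syntax_sketch").getOrCreate()

data = [("alice", 34), ("bob", 19), ("carol", 45)]  # hypothetical rows

# Older RDD-style syntax
rdd = spark.sparkContext.parallelize(data)
adults_rdd = rdd.filter(lambda row: row[1] >= 21).collect()

# Newer DataFrame-style syntax (the direction Spark has taken since 2.0)
df = spark.createDataFrame(data, ["name", "age"])
adults_df = df.filter(df["age"] >= 21).collect()
```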
Spark RDDs

● We’ve covered a lot!
● Don’t worry if you didn’t memorize all these details; much of this will be covered again as we learn how to actually code with and utilize these ideas!
Spark DataFrames

● Spark DataFrames are also now the standard way of using Spark’s machine learning capabilities.
● Spark DataFrame documentation is still pretty new and can be sparse.
● Let’s take a brief tour of the documentation: http://spark.apache.org/
● A small DataFrame sketch follows below.
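A brief sketch of the DataFrame syntax the rest of the course builds on; the file name and column names are hypothetical. Spark’s DataFrame-based machine learning API lives in the pyspark.ml module.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe_sketch").getOrCreate()

# Hypothetical CSV file with a header row; inferSchema guesses column types
df = spark.read.csv("people.csv", header=True, inferSchema=True)

df.printSchema()                   # column names and inferred types
df.select("name", "age").show(5)   # display the first 5 rows of two columns
```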
Python and Spark
