Introduction to Apache Spark (Spark) - by Praveen

Spark is a cluster computing software used for large-scale data processing. It provides a programming model where developers can write parallel programs to process large datasets across a cluster. When data exceeds the capacity of a single machine or server, Spark can distribute the data and processing across multiple nodes in a cluster. Developers write Spark programs using APIs in Scala, Java, Python or R to analyze large datasets stored in HDFS, S3, Cassandra or other data sources.


Introduction to Apache Spark

(Spark)
-By Praveen
PART 1: runs for around 20 to 30 mins …

We are going to answer these questions …

What is Spark?
When to use Spark?
How to use Spark?

PART 2: runs for around 10 to 15 mins …

Overview of Spark Software


The activity: filter the data by country

Data size .. > 5MB
# Records .. ~40K
# Attributes / columns .. 15

Schema of the data:
userID, visited_page, country, device_type, time, os_type, interaction_type, pincode, …
The activity: filter the data by country

Data size .. > 5MB
# Records .. ~40K
# Attributes / columns .. 15

Apply the filter function (a small Python sketch follows below)
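At ~40K records a single machine is enough; besides Excel's filter, a few lines of Python do the same job. A minimal sketch, assuming the data is a CSV file with the columns listed above; the file name and the country value are illustrative assumptions, not values from these slides:

import pandas as pd

# Load the small dataset into memory (hypothetical file name)
df = pd.read_csv("web_visits.csv")

# The activity: keep only the rows for one country
india_visits = df[df["country"] == "India"]

# Save the filtered rows
india_visits.to_csv("web_visits_india.csv", index=False)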
The activity: filter the data by country

What if the data size exceeds the capacity of Excel (i.e. ~1M rows)?

Data size .. > 50GB
# Records .. ~x millions
# Attributes / columns .. 15
The activity: filter the data by country

What if the data size exceeds the capacity of Excel (i.e. ~1M rows)?

HERE IS THE SOLUTION: move the processing from Excel to a single server.

Data size .. > 50GB
# Records .. ~x millions
# Attributes / columns .. 15

*May take a long time (~10 to 12 hrs) to finish the "FILTER" process, due to the unavailability of enough computing resources
The activity: filter the data by country

What if the data size exceeds the capacity (RAM, processor & disk) of the existing server?

HERE IS THE SOLUTION: either increase the capacity of the existing server (scale-up) OR replace the existing server with a brand new high-capacity server.
Data size growing from > 50GB to > 250GB to > 500GB
# Records .. ~x billions
# Attributes / columns .. 15

Expensive & single point of failure ...


The activity: filter the data by country

What if the server goes down during the processing? (Single point of failure)
E.g. after processing ~498GB for some ~12 hours

Data size .. > 500GB
# Records .. ~x billions
# Attributes / columns .. 15

High-capacity server

The activity: filter the data by country

What if the data size exceeds the capacity of the server? OR
What if the server goes down during the processing? (Single point of failure)

HERE IS THE ULTIMATE (CHEAPER & MORE RELIABLE) SOLUTION:

The solution is ... build a "Cluster Computing System" (scale-out)

CLUSTER SOFTWARES: HADOOP, SPARK, NOSQL DATABASES
SPARK CLUSTER: cluster capacity = sum of the allocated capacities available in every individual server/computer
Solution: use the "Spark cluster computing software" to solve the large-volume data (Big Data) problem
OUR STRATEGY STEPS …

STEP 1): LOCAL CLUSTER / DEVELOPMENT CLUSTER (to deal with some sample data)
 Set up a local Spark cluster on your computer/server (Spark setup details are available in an upcoming video)
 Write a small pySpark / SparkR program to filter the data (a minimal sketch follows below)
 Test and make sure your pySpark/SparkR program works correctly over sample data on your local computer/server
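A minimal pySpark sketch of Step 1, assuming the sample data is a CSV file with the schema shown earlier; the file name, output path and country value are illustrative assumptions, not values from these slides:

from pyspark.sql import SparkSession

# Start a local Spark session that uses all cores of this machine
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("FilterByCountry") \
    .getOrCreate()

# Hypothetical sample file; columns follow the schema shown earlier
visits = spark.read.csv("sample_visits.csv", header=True, inferSchema=True)

# The activity: filter the data by country
india_visits = visits.filter(visits["country"] == "India")

# Inspect a few rows, then write the filtered result out
india_visits.show(10)
india_visits.write.mode("overwrite").csv("output/india_visits")

spark.stop()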

STEP 2): PRODUCTION CLUSTER (to deal with large-volume data)

 Assume you already have access to your org's / client's production Spark cluster
 Ask your client, admin or manager for the data path in the cluster
 Usually the data will reside in Hadoop's HDFS, Hive, AWS S3, NoSQL DBs like Cassandra, etc.
 Add this data path to your program and make minor configuration changes if required
 Submit your program to the cluster and wait a while to get the result/output (a submission sketch follows below)
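A hedged sketch of Step 2: the same logic, now pointed at a cluster data path and submitted with spark-submit; the HDFS paths and script name are assumptions, not values from these slides:

# Submit from an edge node of the cluster, e.g.:
#   spark-submit --master yarn --deploy-mode cluster filter_by_country.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FilterByCountry").getOrCreate()

# Hypothetical HDFS path; in practice, use the data path given by your admin
visits = spark.read.csv("hdfs:///data/web_visits/", header=True, inferSchema=True)

# Filter by country, exactly as in the local version
india_visits = visits.filter(visits["country"] == "India")

# Write the result back to cluster storage
india_visits.write.mode("overwrite").parquet("hdfs:///output/india_visits")

spark.stop()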
Some facts about existing "production" clusters around the world …
 Usually organisations enable the "Spark Processing Engine" in their existing Hadoop cluster
 Remember, Spark is independent; Hadoop is not a prerequisite for setting up a Spark cluster
 We can use Spark without Hadoop

 To save infrastructure & maintenance costs, organisations may enable the Spark service in their
existing Hadoop cluster
 If you ever come across the term "MapReduce Processing Engine (MR Engine)",
note that the "Spark Processing Engine" is an alternative to the MR Engine.
 Writing programs for the "Spark Processing Engine" is much easier than writing programs for the "MR Engine"
 The Spark processing engine is much faster than the MR Engine
 Often, Hadoop's HDFS, Hive, HBase, AWS S3, Vertica, Cassandra, etc. are the data
sources for our Spark programs in production clusters
Now, let's answer these questions …

What is Spark?
When to use Spark?
How to use Spark?
The Answers …

 What is Spark?
 Spark is cluster software (to be specific, it is a general-purpose "large-scale data processing engine")
 When to use Spark?
 Whenever you have to process large volumes of data
 Whenever you have to process high-velocity streaming data
 Also to implement ML / AI solutions
 How to use Spark?
 To make use of a Spark cluster, as a developer/analyst, you need to write your programs/queries in your favourite programming language, following Spark's programming guidelines
Have a cup of Coffee .. Let’s continue with Part 2 …
Introduction to Apache Spark
(Spark) - Part 2 of 2
PART 1
We are going to answer these questions …
 What is Spark ?
 When to use Spark?
 How to use Spark?

PART 2
Runs for around 10 to 15 mins …

Overview of Spark Software


Overview of Spark Software
When we install/set up the Spark software …
 We get 4 built-in Libraries/Modules ..
 SparkSQL & DataFrames
 Spark Streaming
 MLlib (Machine Learning Library)
 GraphX (Graph)

 We get 4 built-in APIs ..


 Scala API
 Java API
 Python API
 R API
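As a small illustration of two of these pieces working together, here is a sketch that uses the SparkSQL & DataFrames library through the Python API; the in-memory sample rows and column names are assumptions for illustration only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Hypothetical DataFrame; in practice this would be read from HDFS, Hive, S3, etc.
visits = spark.createDataFrame(
    [("u1", "IN", "mobile"), ("u2", "US", "desktop"), ("u3", "IN", "desktop")],
    ["userID", "country", "device_type"],
)

# Register the DataFrame as a temporary view and query it with SparkSQL
visits.createOrReplaceTempView("visits")
spark.sql("SELECT country, COUNT(*) AS visit_count FROM visits GROUP BY country").show()

spark.stop()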
Now let us land on the Apache Spark home page to learn more about Spark …

https://spark.apache.org/
Let’s connect at “Contact class” to learn more …
