Introduction to Apache Spark (Spark) - by Praveen

Spark is a cluster computing software used for large-scale data processing. It provides a programming model where developers can write parallel programs to process large datasets across a cluster. When data exceeds the capacity of a single machine or server, Spark can distribute the data and processing across multiple nodes in a cluster. Developers write Spark programs using APIs in Scala, Java, Python or R to analyze large datasets stored in HDFS, S3, Cassandra or other data sources.


Introduction to Apache Spark

(Spark)
-By Praveen
PART 1: runs for around 20 to 30 mins …

We are going to answer these questions …

What is Spark?
When to use Spark?
How to use Spark?

PART 2: runs for around 10 to 15 mins …

Overview of Spark Software


The activity: filter the data by country

Data size .. > 5MB
# Records .. ~40K
# Attributes / columns .. 15

Schema of the data:
userID, visited_page, country, device_type, time, os_type, interaction_type, pincode, …
The activity: filter the data by country

Data size .. > 5MB
# Records .. ~40K
# Attributes / columns .. 15

Apply the filter function (a small Python sketch follows below)
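At ~40K records a single machine is enough; besides Excel's filter, a few lines of Python do the same job. A minimal sketch, assuming the data is a CSV file with the columns listed above; the file name and the country value are illustrative assumptions, not values from these slides:

import pandas as pd

# Load the small dataset into memory (hypothetical file name)
df = pd.read_csv("web_visits.csv")

# The activity: keep only the rows for one country
india_visits = df[df["country"] == "India"]

# Save the filtered rows
india_visits.to_csv("web_visits_india.csv", index=False)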
The activity: filter the data by country

What if the data size exceeds the capacity of Excel (i.e. ~1M rows)?

Data size .. > 50GB
# Records .. ~x millions
# Attributes / columns .. 15
The activity: filter the data by country

What if the data size exceeds the capacity of Excel (i.e. ~1M rows)?

HERE IS THE SOLUTION: move the processing from Excel to a single server.

Data size .. > 50GB
# Records .. ~x millions
# Attributes / columns .. 15

*May take a long time (~10 to 12 hrs) to finish the "FILTER" process, due to the unavailability of enough computing resources
The activity: filter the data by country

What if the data size exceeds the capacity (RAM, processor & disk) of the existing server?

HERE IS THE SOLUTION: either increase the capacity of the existing server (scale-up) OR replace the existing server with a brand new high-capacity server.
Data size growing from > 50GB to > 250GB to > 500GB
# Records .. ~x billions
# Attributes / columns .. 15

Expensive & single point of failure ...


The activity: filter the data by country

What if the server goes down during the processing? (Single point of failure)
E.g. after processing ~498GB for some ~12 hours

Data size .. > 500GB
# Records .. ~x billions
# Attributes / columns .. 15

High-capacity server

The activity: filter the data by country

What if the data size exceeds the capacity of the server? OR
What if the server goes down during the processing? (Single point of failure)

HERE IS THE ULTIMATE (CHEAPER & MORE RELIABLE) SOLUTION:

The solution is ... build a "Cluster Computing System" (scale-out)

CLUSTER SOFTWARES: HADOOP, SPARK, NOSQL DATABASES
SPARK CLUSTER: cluster capacity = sum of the allocated capacities available in every individual server/computer
Solution: use the "Spark cluster computing software" to solve the large-volume data (Big Data) problem
OUR STRATEGY STEPS …

STEP 1): LOCAL CLUSTER / DEVELOPMENT CLUSTER (to deal with some sample data)
 Set up a local Spark cluster on your computer/server (Spark setup details are available in an upcoming video)
 Write a small pySpark / SparkR program to filter the data (a minimal sketch follows below)
 Test and make sure your pySpark/SparkR program works correctly over sample data on your local computer/server
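A minimal pySpark sketch of Step 1, assuming the sample data is a CSV file with the schema shown earlier; the file name, output path and country value are illustrative assumptions, not values from these slides:

from pyspark.sql import SparkSession

# Start a local Spark session that uses all cores of this machine
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("FilterByCountry") \
    .getOrCreate()

# Hypothetical sample file; columns follow the schema shown earlier
visits = spark.read.csv("sample_visits.csv", header=True, inferSchema=True)

# The activity: filter the data by country
india_visits = visits.filter(visits["country"] == "India")

# Inspect a few rows, then write the filtered result out
india_visits.show(10)
india_visits.write.mode("overwrite").csv("output/india_visits")

spark.stop()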

STEP 2): PRODUCTION CLUSTER (to deal with large-volume data)

 Assume you already have access to your org's / client's production Spark cluster
 Ask your client, admin or manager for the data path in the cluster
 Usually the data will reside in Hadoop's HDFS, Hive, AWS S3, NoSQL DBs like Cassandra, etc.
 Add this data path to your program and make minor configuration changes if required
 Submit your program to the cluster and wait a while to get the result/output (a submission sketch follows below)
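A hedged sketch of Step 2: the same logic, now pointed at a cluster data path and submitted with spark-submit; the HDFS paths and script name are assumptions, not values from these slides:

# Submit from an edge node of the cluster, e.g.:
#   spark-submit --master yarn --deploy-mode cluster filter_by_country.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FilterByCountry").getOrCreate()

# Hypothetical HDFS path; in practice, use the data path given by your admin
visits = spark.read.csv("hdfs:///data/web_visits/", header=True, inferSchema=True)

# Filter by country, exactly as in the local version
india_visits = visits.filter(visits["country"] == "India")

# Write the result back to cluster storage
india_visits.write.mode("overwrite").parquet("hdfs:///output/india_visits")

spark.stop()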
Some facts about existing "production" clusters around the world …
 Usually organisations enable the "Spark Processing Engine" in their existing Hadoop cluster
 Remember, Spark is independent; Hadoop is not a prerequisite for setting up a Spark cluster
 We can use Spark without Hadoop

 To save infrastructure & maintenance costs, organisations may enable the Spark service in their
existing Hadoop cluster
 If you ever come across the term "MapReduce Processing Engine (MR Engine)",
note that the "Spark Processing Engine" is an alternative to the MR Engine.
 Writing programs for the "Spark Processing Engine" is much easier than writing programs for the "MR Engine"
 The Spark processing engine is much faster than the MR Engine
 Often, Hadoop's HDFS, Hive, HBase, AWS S3, Vertica, Cassandra, etc. are the data
sources for our Spark programs in production clusters
Now, let's answer these questions …

What is Spark?
When to use Spark?
How to use Spark?
The Answers …

 What is Spark?
 Spark is cluster software (to be specific, it is a general-purpose "large-scale data processing engine")
 When to use Spark?
 Whenever you have to process large volumes of data
 Whenever you have to process high-velocity streaming data
 Also to implement ML / AI solutions
 How to use Spark?
 To make use of a Spark cluster, as a developer/analyst, you need to write your programs/queries in your favourite programming language, following Spark's programming guidelines
Have a cup of Coffee .. Let’s continue with Part 2 …
Introduction to Apache Spark
(Spark) - Part 2 of 2
PART 1
We are going to answer these questions …
 What is Spark ?
 When to use Spark?
 How to use Spark?

PART 2
Runs for around 10 to 15 mins …

Overview of Spark Software


Overview of Spark Software
When we install/set up the Spark software …
 We get 4 built-in Libraries/Modules ..
 SparkSQL & DataFrames
 Spark Streaming
 MLlib (Machine Learning Library)
 GraphX (Graph)

 We get 4 built-in APIs ..


 Scala API
 Java API
 Python API
 R API
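As a small illustration of two of these pieces working together, here is a sketch that uses the SparkSQL & DataFrames library through the Python API; the in-memory sample rows and column names are assumptions for illustration only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Hypothetical DataFrame; in practice this would be read from HDFS, Hive, S3, etc.
visits = spark.createDataFrame(
    [("u1", "IN", "mobile"), ("u2", "US", "desktop"), ("u3", "IN", "desktop")],
    ["userID", "country", "device_type"],
)

# Register the DataFrame as a temporary view and query it with SparkSQL
visits.createOrReplaceTempView("visits")
spark.sql("SELECT country, COUNT(*) AS visit_count FROM visits GROUP BY country").show()

spark.stop()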
Now let us land on the Apache Spark home page to learn more about Spark …

https://spark.apache.org/
Let’s connect at “Contact class” to learn more …
