High Performance Computing Using Apache Spark

The document discusses using Apache Spark for high performance computing. It introduces Spark, explaining that more data means more computational challenges that exceed the capabilities of single machines. It then outlines some key Spark concepts, including SparkSession and SparkContext for connecting to clusters, RDDs for distributed datasets, transformations and actions for processing RDDs lazily and in parallel, and Spark SQL for querying structured data like tables. The document provides an overview of Spark as a tool for distributed computing on large datasets across clusters of machines.


High Performance Computing

using Apache Spark

Eliezer Beczi, December 7, 2020
Introduction
● More data means more computational challenges.

● Single machines can no longer handle these data sizes.

● Computation therefore needs to be distributed across multiple nodes.


PySpark

Why Apache Spark?


● Open-source.

● General-purpose.

● Fast: keeps intermediate data in memory across operations.

● High-level APIs in Scala, Java, Python (PySpark), and R.

● Built-in libraries: Spark SQL, MLlib, GraphX, Spark Streaming.
Spark essentials
● SparkSession:
○ the main entry point to all Spark functionality.

● SparkContext:
○ connects to a cluster manager;
○ acquires executors;
○ sends app code to executors;
○ sends tasks for the executors to run.
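A minimal PySpark sketch of how these pieces fit together; the application name and the local master URL are placeholders chosen for illustration, not anything prescribed by the slides.

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; "local[*]" runs Spark on all local cores.
spark = (
    SparkSession.builder
    .appName("hpc-demo")      # placeholder application name
    .master("local[*]")       # placeholder master URL
    .getOrCreate()
)

# The SparkContext behind the session connects to the cluster manager,
# acquires executors, ships the application code, and schedules tasks.
sc = spark.sparkContext
print(sc.version)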
Spark essentials
● RDD (Resilient Distributed Datasets):
○ immutable and fault-tolerant collection of elements that can be operated on in parallel.

● RDD operations:
○ transformations;
○ actions.
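A small sketch of the two kinds of RDD operations, reusing the hypothetical spark/sc objects from the previous snippet; the numbers are made-up sample data.

# Distribute a local Python list across the cluster as an RDD.
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformation: only describes a new RDD; nothing runs yet.
squares = numbers.map(lambda x: x * x)

# Action: triggers the computation and returns the result to the driver.
print(squares.collect())   # [1, 4, 9, 16, 25, 36]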
Spark essentials
● Transformations:
○ produce new RDDs;
○ lazy, not executed until an action is performed.

● The laziness of transformations allows Spark to boost performance by optimizing how a sequence of transformations is executed at runtime.
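A sketch of this laziness in practice, again assuming the sc object from the earlier snippet; the sample strings are invented for illustration.

lines = sc.parallelize(["spark is fast", "spark is lazy", "hpc on spark"])

# Each step only records a transformation in the RDD's lineage graph.
words = lines.flatMap(lambda line: line.split())      # transformation
spark_words = words.filter(lambda w: w == "spark")    # transformation

# Nothing has executed yet; count() is the action that lets Spark optimize
# and run the whole chain in a single pass over the data.
print(spark_words.count())   # 3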
Spark essentials
● Actions:
○ trigger execution of the pending transformations and return non-RDD values (e.g., a count, a list, or a saved file) to the driver.

● Together, transformations and actions follow the Map-Reduce processing technique (word-count sketch below).
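A classic word-count sketch in the MapReduce style, assuming the same sc as above; the input lines are made up.

text = sc.parallelize([
    "more data means more challenges",
    "spark distributes the work",
])

word_counts = (
    text.flatMap(lambda line: line.split())    # map: line -> words
        .map(lambda word: (word, 1))           # map: word -> (word, 1)
        .reduceByKey(lambda a, b: a + b)       # reduce: sum counts per word
)

# collect() is the action that triggers the whole map-reduce pipeline.
print(word_counts.collect())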


Spark SQL
● DataFrames:
○ immutable and fault-tolerant collection of elements that can be operated on in parallel.

● DataFrames are organized into named columns.

● Conceptually equivalent to a table in a relational database.
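A sketch of building a DataFrame from in-memory rows with the spark session from earlier; the column names and rows are invented for illustration.

# Create a DataFrame with named columns from a list of tuples.
people = spark.createDataFrame(
    [("Ada", 36), ("Linus", 51), ("Grace", 85)],
    ["name", "age"],
)

people.printSchema()   # named, typed columns
people.show()          # table-like view of the rows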


Spark SQL
● DataFrames can be easily queried using SQL operations.

● Spark allows you to run queries directly on DataFrames, similar to how transformations are performed on RDDs (sketch below).
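A sketch of querying the hypothetical people DataFrame above, once through SQL and once through the equivalent DataFrame operations.

# Register the DataFrame as a temporary view so it can be queried with SQL.
people.createOrReplaceTempView("people")

# Plain SQL over the DataFrame ...
over_40_sql = spark.sql("SELECT name, age FROM people WHERE age > 40")

# ... and the same query expressed as DataFrame transformations.
over_40_df = people.filter(people.age > 40).select("name", "age")

over_40_sql.show()
over_40_df.show()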
Thank you for your attention!
