High Performance Computing Using Apache Spark

The document discusses using Apache Spark for high performance computing. It introduces Spark, explaining that more data means more computational challenges that exceed the capabilities of single machines. It then outlines some key Spark concepts, including SparkSession and SparkContext for connecting to clusters, RDDs for distributed datasets, transformations and actions for processing RDDs lazily and in parallel, and Spark SQL for querying structured data like tables. The document provides an overview of Spark as a tool for distributed computing on large datasets across clusters of machines.


High Performance Computing

using Apache Spark

Eliezer Beczi, December 7, 2020
Introduction
● More data means more computational challenges.

● Single machines can no longer handle these data sizes.

● Computation therefore needs to be distributed across multiple nodes.


PySpark

Why Apache Spark?


● Open-source.

● General-purpose.

● Fast: keeps intermediate data in memory across operations.

● High-level APIs in Scala, Java, Python (PySpark), and R.

● Built-in libraries: Spark SQL, MLlib, GraphX, Spark Streaming.
Spark essentials
● SparkSession:
○ the main entry point to all Spark functionality.

● SparkContext:
○ connects to a cluster manager;
○ acquires executors;
○ sends app code to executors;
○ sends tasks for the executors to run.
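A minimal PySpark sketch of how these pieces fit together; the application name and the local master URL are placeholders chosen for illustration, not anything prescribed by the slides.

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; "local[*]" runs Spark on all local cores.
spark = (
    SparkSession.builder
    .appName("hpc-demo")      # placeholder application name
    .master("local[*]")       # placeholder master URL
    .getOrCreate()
)

# The SparkContext behind the session connects to the cluster manager,
# acquires executors, ships the application code, and schedules tasks.
sc = spark.sparkContext
print(sc.version)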
Spark essentials
● RDD (Resilient Distributed Datasets):
○ immutable and fault-tolerant collection of elements that can be operated on in parallel.

● RDD operations:
○ transformations;
○ actions.
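A small sketch of the two kinds of RDD operations, reusing the hypothetical spark/sc objects from the previous snippet; the numbers are made-up sample data.

# Distribute a local Python list across the cluster as an RDD.
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformation: only describes a new RDD; nothing runs yet.
squares = numbers.map(lambda x: x * x)

# Action: triggers the computation and returns the result to the driver.
print(squares.collect())   # [1, 4, 9, 16, 25, 36]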
Spark essentials
● Transformations:
○ produce new RDDs;
○ lazy, not executed until an action is performed.

● The laziness of transformations allows Spark to boost performance by optimizing how a sequence of transformations is executed at runtime.
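A sketch of this laziness in practice, again assuming the sc object from the earlier snippet; the sample strings are invented for illustration.

lines = sc.parallelize(["spark is fast", "spark is lazy", "hpc on spark"])

# Each step only records a transformation in the RDD's lineage graph.
words = lines.flatMap(lambda line: line.split())      # transformation
spark_words = words.filter(lambda w: w == "spark")    # transformation

# Nothing has executed yet; count() is the action that lets Spark optimize
# and run the whole chain in a single pass over the data.
print(spark_words.count())   # 3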
Spark essentials
● Actions:
○ trigger execution of the pending transformations and return non-RDD values (e.g., a count, a list, or a saved file) to the driver.

● Together, transformations and actions follow the Map-Reduce processing technique (word-count sketch below).
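A classic word-count sketch in the MapReduce style, assuming the same sc as above; the input lines are made up.

text = sc.parallelize([
    "more data means more challenges",
    "spark distributes the work",
])

word_counts = (
    text.flatMap(lambda line: line.split())    # map: line -> words
        .map(lambda word: (word, 1))           # map: word -> (word, 1)
        .reduceByKey(lambda a, b: a + b)       # reduce: sum counts per word
)

# collect() is the action that triggers the whole map-reduce pipeline.
print(word_counts.collect())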


Spark SQL
● DataFrames:
○ immutable and fault-tolerant collection of elements that can be operated on in parallel.

● DataFrames are organized into named columns.

● Conceptually equivalent to a table in a relational database.
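A sketch of building a DataFrame from in-memory rows with the spark session from earlier; the column names and rows are invented for illustration.

# Create a DataFrame with named columns from a list of tuples.
people = spark.createDataFrame(
    [("Ada", 36), ("Linus", 51), ("Grace", 85)],
    ["name", "age"],
)

people.printSchema()   # named, typed columns
people.show()          # table-like view of the rows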


Spark SQL
● DataFrames can be easily queried using SQL operations.

● Spark allows you to run queries directly on DataFrames, similar to how transformations are performed on RDDs (sketch below).
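A sketch of querying the hypothetical people DataFrame above, once through SQL and once through the equivalent DataFrame operations.

# Register the DataFrame as a temporary view so it can be queried with SQL.
people.createOrReplaceTempView("people")

# Plain SQL over the DataFrame ...
over_40_sql = spark.sql("SELECT name, age FROM people WHERE age > 40")

# ... and the same query expressed as DataFrame transformations.
over_40_df = people.filter(people.age > 40).select("name", "age")

over_40_sql.show()
over_40_df.show()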
Thank you for your attention!
