
L02 – Spark SQL for Data Processing

CBG1C04 Big Data Programming


What is Apache Spark

• Apache Spark is a fast, general-purpose, distributed computing platform for large-scale data processing
• Similar to Hadoop, but many times faster
• Spark supports the Scala, Java, Python and R programming languages

TEMASEK POLYTECHNIC • SCHOOL OF INFORMATICS & IT


The Spark Ecosystem

from Apache Spark 2.x Machine Learning Cookbook



Spark Applications

from Spark: The Definitive Guide



Spark Shell

• The Spark Shell provides interactive data exploration with Spark
• The Spark Shell is a Read-Evaluate-Print-Loop (REPL) shell



Spark Configuration

• In this subject, we will be using Spark 2.3.1 and Python 3.6
• Spark provides the variable spark, which is the main entry point for interacting with Spark through the DataFrame API



Jupyter Notebook with Spark

• When you enter pyspark at the terminal, a Jupyter notebook will be launched instead of the plain REPL shell



Spark SQL

• What is Spark SQL?
  – A Spark module for structured data processing
• What does Spark SQL provide?
  – The DataFrame API – a library for working with data as tables
  – The Catalyst optimizer, which helps speed up PySpark queries



DataFrames

• A DataFrame is an immutable, distributed collection of data organised into named columns, analogous to a table in a relational database
• The DataFrame API is used for handling structured data in DataFrames



Transformations and Actions

• Transformations specify how to change from one DataFrame to another
• Actions compute a result from a series of transformations
• Spark waits until the very last moment to execute the graph of computation instructions; this is known as lazy evaluation



Creating a DataFrame from a JSON File

people.json
{"name":"Alice", "pcode":"94304"}
{"name":"Brayden", "age":30, "pcode":"94304"}
{"name":"Carla", "age":19, "pcode":"10036"}
{"name":"Diana", "age":46}
{"name":"Etienne", "pcode":"94104"}



Creating a DataFrame from a Database using JDBC



DataFrame Basic Metadata Operations



DataFrame Actions



DataFrame Transformations





DataFrame Transformations

• The pyspark.sql.functions module contains many useful functions that operate on columns



DataFrame Transformations

• DataFrame transformations can be chained
• Other methods:
  – distinct: returns a new DataFrame containing only the distinct rows of this DataFrame
  – join: joins this DataFrame with a second DataFrame



SQL Queries

• Spark SQL also supports traditional SQL queries. However, you first need to create a temporary view from the DataFrame using createOrReplaceTempView("name")



Saving DataFrames

• Data in DataFrames can be saved to a data source
  – Built-in support for the JDBC, CSV, JSON and Parquet formats



Low-Level APIs
• Spark has a set of lower-level APIs based on
the Resilient Distributed Dataset (RDD).
• You generally use the lower-level APIs in three
situations:
– You need some functionality that you cannot find
in the higher-level APIs; for example, if you need
very tight control over physical data placement
across the cluster.
– You need to maintain some legacy codebase
written using RDDs.
– You need to do some custom shared variable
manipulation.
Resilient Distributed Dataset (RDD)

• In memory
• Partitioned
• Typed
• Lazy Evaluation
• Immutable
• Parallel
• Cacheable

from Learning Apache Spark 2

