Spark SQL

Spark SQL provides a relational processing engine for Apache Spark. It allows users to write SQL queries over distributed datasets and take advantage of Spark's optimizations. Spark SQL includes a DataFrame API that represents data as distributed tables, a Catalyst optimizer that applies rules to execution plans, and integration with data sources and machine learning libraries. It aims to support SQL queries on large datasets through its automatic optimization capabilities.

Uploaded by

Roxana Godoy Astudillo

Spark SQL

The 8 fastest-growing tech skills worth over $110,000

No. 1: Spark, up 120%, worth $113,214

Do you know how to write code in Spark?
Can you write SQL?

“SQL is a highly sought-after technical skill due to its ability to work with
nearly all databases.”
Ibro Palic, CEO of Resumes Templates
History and Evolution of Big Data Technologies

Procedural programming interface
Declarative queries
Automatic optimization
So Far…

We have established that we need a platform with automatic optimization.

What do users want?

1. ETL from different sources
2. Advanced analytics
Introducing

Spark SQL: Relational Data Processing in Spark
Background

• Apache Spark is a general-purpose cluster computing engine with APIs in Scala, Java and Python and libraries for streaming, graph processing and machine learning.
• RDDs are fault-tolerant: the system can recover lost data using the lineage graph of the RDDs, rerunning operations such as a filter to rebuild missing partitions. They can also be explicitly cached in memory or on disk to support iteration.
• Shark modified the Apache Hive system to run on Spark and implemented traditional RDBMS optimizations, such as columnar processing, over the Spark engine.
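
The lineage idea in the RDD bullet can be illustrated with a small sketch. This is plain Python, not Spark code: `ToyRDD`, `partition` and `lose_partition` are hypothetical names invented for illustration, and the only point is that a lost partition can be rebuilt by replaying the recorded operations on the parent's data.

```python
class ToyRDD:
    """Toy dataset that records its lineage: a parent plus the function applied to it."""

    def __init__(self, parent=None, fn=None, source=None):
        self.parent = parent        # lineage pointer to the dataset this one derives from
        self.fn = fn                # transformation applied to one parent partition
        self.source = source        # list of partitions, only set for a root dataset
        self.materialized = {}      # partition index -> rows computed so far

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def partition(self, i):
        """Return partition i, recomputing it from lineage if it is missing."""
        if i not in self.materialized:
            if self.parent is None:
                rows = list(self.source[i])               # root: read from stable storage
            else:
                rows = self.fn(self.parent.partition(i))  # replay the recorded operation
            self.materialized[i] = rows
        return self.materialized[i]

    def lose_partition(self, i):
        """Simulate losing the computed copy of partition i."""
        self.materialized.pop(i, None)


base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
evens = base.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)

before = evens.partition(1)
evens.lose_partition(1)          # drop the mapped partition
evens.parent.lose_partition(1)   # drop the filtered partition as well
after = evens.partition(1)       # rebuilt by rerunning filter, then map
```

Only the lost partition is recomputed; partition 0 is untouched, which is the property that makes lineage-based recovery cheaper than replicating every intermediate result.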
Goals for Spark SQL

• Support relational processing both within Spark programs and on external data sources.
• Provide high performance using established DBMS techniques.
• Easily support new data sources.
• Enable extension with advanced analytics algorithms such as graph processing and machine learning.
Programming Interface
DataFrame API

• A DataFrame is a distributed collection of rows with a homogeneous schema.

Keep Track of Hashtags ##
#A Lazy Computation
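
A rough way to picture the "lazy computation" hashtag: each DataFrame operation only records a step in a plan, and nothing runs until a result is requested. The sketch below is plain Python with hypothetical names (`LazyFrame`, `where`, `select`, `collect`); it mimics the shape of the idea and is not the Spark API.

```python
class LazyFrame:
    """Toy table whose operations are recorded in a plan and run only on collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self.plan = plan  # tuple of (operation name, argument) steps, not yet executed

    def where(self, pred):
        return LazyFrame(self._rows, self.plan + (("where", pred),))

    def select(self, *cols):
        return LazyFrame(self._rows, self.plan + (("select", cols),))

    def collect(self):
        """Execute the recorded plan step by step and return the resulting rows."""
        out = self._rows
        for op, arg in self.plan:
            if op == "where":
                out = [r for r in out if arg(r)]
            else:  # "select"
                out = [{c: r[c] for c in arg} for r in out]
        return list(out)


users = [
    {"name": "Ann", "age": 19},
    {"name": "Bob", "age": 34},
    {"name": "Cy", "age": 20},
]

calls = []  # records every time the predicate actually runs

def is_young(row):
    calls.append(row["name"])
    return row["age"] < 21

young = LazyFrame(users).where(is_young).select("name")
calls_before_collect = list(calls)   # still empty: only a plan exists so far
result = young.collect()             # the plan executes here
```

Because the whole plan is available before anything executes, an optimizer has the chance to inspect and rewrite it first.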
Data Model and DataFrame Operations
• Spark SQL uses a nested data model based on Hive.
• It supports all major SQL data types, including boolean, integer, double, decimal, string, date and timestamp, as well as user-defined types.

Example of DataFrame Operations

DataFrame Operations Cont.

#Access DF with DSL or SQL

Real World Problems

#Heterogeneous Data Sources
Schema Inference

• Spark SQL can automatically infer the schema of native objects using reflection.
• Scala/Java: the schema is extracted from the language's type system.
• Python: the schema is inferred by sampling the dataset.
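
Sampling-based inference for dynamically typed records can be sketched as follows. This is a minimal illustration in plain Python, not Spark's actual algorithm: the function names, the type labels and the merge rules (integer widens to double, conflicts fall back to string) are all assumptions chosen for the example.

```python
def infer_type(value):
    """Map one Python value to a simple type label."""
    if value is None:
        return "null"
    if isinstance(value, bool):   # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    return "string"


def merge_types(a, b):
    """Pick a type that can hold values of both a and b (assumed widening rules)."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"integer", "double"}:
        return "double"
    return "string"   # most general fallback


def infer_schema(records, sample_size=100):
    """Infer a field -> type mapping from the first sample_size records."""
    schema = {}
    for record in records[:sample_size]:
        for field, value in record.items():
            observed = infer_type(value)
            schema[field] = merge_types(schema[field], observed) if field in schema else observed
    return schema


rows = [
    {"id": 1, "score": 3},
    {"id": 2, "score": 4.5, "tag": None},
    {"id": 3, "tag": "x"},
]
schema = infer_schema(rows)
```

Here `score` is seen first as an integer and then as a float, so it widens to `double`; `tag` is missing in one record and null in another, so its type comes from the one record that has a value.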
In-Memory Caching

#Invoked with .cache()

User-Defined Functions

How are Spark SQL's user-defined functions different from those in traditional database systems?
Catalyst Optimizer

• Catalyst is based on functional programming constructs in Scala.

Purposes

• Ability to add new optimization techniques and features to Spark SQL
• Ability to extend the optimizer
Catalyst Optimization

#Trees

#Rules
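
The two hashtags above, trees and rules, can be sketched in a few lines. Catalyst itself is written in Scala and uses pattern matching; the version below is a plain Python approximation under stated assumptions. The node classes `Literal`, `Attribute` and `Add` and the helpers `transform`, `fold_constants`, `drop_zero` and `optimize` are illustrative names, and the bottom-up traversal is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def transform(node, rule):
    """Rebuild the tree bottom-up, applying rule to every node."""
    if isinstance(node, Add):
        node = Add(transform(node.left, rule), transform(node.right, rule))
    return rule(node)


def fold_constants(node):
    """Rule: Add(Literal(a), Literal(b)) becomes Literal(a + b)."""
    if isinstance(node, Add) and isinstance(node.left, Literal) and isinstance(node.right, Literal):
        return Literal(node.left.value + node.right.value)
    return node


def drop_zero(node):
    """Rule: adding a literal zero is a no-op."""
    if isinstance(node, Add):
        if node.left == Literal(0):
            return node.right
        if node.right == Literal(0):
            return node.left
    return node


def optimize(tree, rules, max_iters=100):
    """Apply the rules repeatedly until the tree stops changing (a fixed point)."""
    for _ in range(max_iters):
        new_tree = tree
        for rule in rules:
            new_tree = transform(new_tree, rule)
        if new_tree == tree:
            break
        tree = new_tree
    return tree


expr = Add(Attribute("x"), Add(Literal(1), Literal(2)))           # x + (1 + 2)
optimized = optimize(expr, [fold_constants, drop_zero])           # x + 3

expr_zero = Add(Attribute("x"), Add(Literal(1), Literal(-1)))     # x + (1 + -1)
optimized_zero = optimize(expr_zero, [fold_constants, drop_zero]) # x
```

The second expression shows why rules run to a fixed point: constant folding produces `x + 0`, which only then becomes a match for the zero-removal rule.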
Catalyst Optimization Cont.

Rule Based Optimization

Cost Based Optimization

Query Planning in Spark SQL
Extension Points

#Open Source Projects

Extension Points Cont.

• Data Sources
Examples:
• CSV
• Avro
• Parquet
• JDBC
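
One way to think about a pluggable data source is as an object that exposes a schema and a scan operation, where the engine may tell the source which columns and rows it needs so that less data is read. The sketch below is a hypothetical plain Python illustration over an in-memory CSV string; `ToyCsvRelation`, `schema` and `scan` are invented names and do not reproduce Spark's data source interfaces.

```python
import csv
import io


class ToyCsvRelation:
    """Toy data source over CSV text that supports column pruning and row filtering."""

    def __init__(self, text):
        self._text = text

    def schema(self):
        """Return the column names from the header line."""
        return next(csv.reader(io.StringIO(self._text)))

    def scan(self, required_columns=None, predicate=None):
        """Yield rows, keeping only the requested columns and the rows that pass predicate."""
        reader = csv.DictReader(io.StringIO(self._text))
        columns = required_columns or reader.fieldnames
        for row in reader:
            if predicate is None or predicate(row):
                yield {c: row[c] for c in columns}


data = "name,age,city\nAnn,19,Oslo\nBob,34,Lima\n"
relation = ToyCsvRelation(data)

columns = relation.schema()
adults = list(relation.scan(required_columns=["name"],
                            predicate=lambda r: int(r["age"]) > 21))
```

A source backed by a columnar file format or a remote database could use the same two hints to skip unneeded columns or to push the filter to the remote side.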
Extension Points Cont.
• User-Defined Types (UDTs)

#Useful for Machine Learning

Advanced Analytics Features

1. Schema Inference for Semi-structured Data

2. Query Federation to External Databases

Advanced Analytics Features Cont.
3. Integration with Spark's Machine Learning Library
Evaluation

• SQL Performance
Evaluation Cont.

• DataFrames vs. Native Spark Code

Pipeline Performance
Applications

• Generalized Online Aggregation

• Computational Genomics
• The list is endless, limited only by your imagination…
Conclusion

Our Final Hash Tags

#A platform with automatic optimization
#Complex pipelines that mix relational and complex analytics
#Large-scale data analysis
#Semi-structured data
#Data types for machine learning
#Extensible optimizer called Catalyst
#Easy to add Optimization rules, data sources and data types
