Building Robust ETL
Pipelines with Apache Spark
Xiao Li
Spark Summit | SF | Jun 2017
About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
About Me
• Apache Spark Committer
• Software Engineer at Databricks
• Ph.D. from the University of Florida
• Previously: IBM Master Inventor, QRep, GDPS A/A and STC
• Spark SQL, Database Replication, Information Integration
• GitHub: gatorsmile
Overview
1. What’s an ETL Pipeline?
2. Using Spark SQL for ETL
- Extract: Dealing with Dirty Data (Bad Records or Files)
- Extract: Multi-line JSON/CSV Support
- Transformation: Higher-order functions in SQL
- Load: Unified write paths and interfaces
3. New Features in Spark 2.3
- Performance (Data Source API v2, Python UDF)
What is a Data Pipeline?
1. Sequence of transformations on data
2. Source data is typically semi-structured/unstructured
(JSON, CSV, etc.) and structured (JDBC, Parquet, ORC, and
other Hive-serde tables)
3. Output data is integrated, structured and curated.
– Ready for further data processing, analysis and reporting
Example of a Data Pipeline
[Diagram: streaming and log sources (Kafka, logs) and a database feed a cloud warehouse, which in turn serves aggregate reporting applications, an ML model, and ad-hoc queries.]
ETL is the First Step in a Data Pipeline
1. ETL stands for EXTRACT, TRANSFORM and LOAD
2. Goal is to clean or curate the data
- Retrieve data from sources (EXTRACT)
- Transform data into a consumable format (TRANSFORM)
- Transmit data to downstream consumers (LOAD)
An ETL Query in Apache Spark
spark.read.json("/source/path")   // EXTRACT
  .filter(...)                    // TRANSFORM
  .agg(...)
  .write.mode("append")           // LOAD
  .parquet("/output/path")
An ETL Query in Apache Spark: Extract
val csvTable = spark.read.csv("/source/path")
val jdbcTable = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql:...")
  .option("dbtable", "TEST.PEOPLE")
  .load()
csvTable
  .join(jdbcTable, Seq("name"), "outer")
  .filter("id <= 2999")
  .write
  .mode("overwrite")
  .format("parquet")
  .saveAsTable("outputTableName")
What's so hard about ETL queries?
Why is ETL Hard?
ETL is too complex, error-prone, too slow, and too expensive because of:
1. Various sources/formats
2. Schema mismatch
3. Different representation
4. Corrupted files and data
5. Scalability
6. Schema evolution
7. Continuous ETL
This is why ETL is important
Consumers of this data don’t want to deal with this
messiness and complexity
Using Spark SQL for ETL
Spark SQL's flexible APIs, support for a wide variety of data sources, built-in support for Structured Streaming, a state-of-the-art Catalyst optimizer, and the Tungsten execution engine make it a great framework for building end-to-end ETL pipelines.
Data Source Support
1. Built-in connectors in Spark:
– JSON, CSV, Text, Hive, Parquet, ORC, JDBC
2. Third-party data source connectors:
– https://spark-packages.org
3. Define your own data source connectors using the Data Source APIs
– Ref link: https://youtu.be/uxuLRiNoDio
{"a":1, "b":2, "c":3}
{"e":2, "c":3, "b":5}
{"a":5, "d":7}
spark.read
.json("/source/path")
.printSchema()
Schema Inference – semi-structured files
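For reference, printSchema() on the three records above would report something like this (integral JSON numbers are inferred as long):
root
 |-- a: long (nullable = true)
 |-- b: long (nullable = true)
 |-- c: long (nullable = true)
 |-- d: long (nullable = true)
 |-- e: long (nullable = true)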
{"a":1, "b":2, "c":3.1}
{"e":2, "c":3, "b":5}
{"a":"5", "d":7}
spark.read
.json("/source/path")
.printSchema()
Schema Inference – semi-structured files
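Here the conflicting value types force inference to widen the columns (1 and "5" widen a to string; 3.1 and 3 widen c to double), so printSchema() would report something like:
root
 |-- a: string (nullable = true)
 |-- b: long (nullable = true)
 |-- c: double (nullable = true)
 |-- d: long (nullable = true)
 |-- e: long (nullable = true)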
{"a":1, "b":2, "c":3}
{"e":2, "c":3, "b":5}
{"a":5, "d":7}
val schema = new StructType()
.add("a", "int")
.add("b", "int")
spark.read
.json("/source/path")
.schema(schema)
.show()
User-specified Schema
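The result contains only the two declared columns, with fields missing from a record returned as null. The show() above would print roughly:
+----+----+
|   a|   b|
+----+----+
|   1|   2|
|null|   5|
|   5|null|
+----+----+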
{"a":1, "b":2, "c":3}
{"e":2, "c":3, "b":5}
{"a":5, "d":7}
Availability: Apache Spark 2.2
spark.read
.json("/source/path")
.schema("a INT, b INT")
.show()
User-specified DDL-format Schema
Dealing with Bad Data: Skip Corrupt Files
Typical errors caused by corrupt files:
- java.io.IOException. For example, java.io.EOFException: Unexpected end of input stream at org.apache.hadoop.io.compress.DecompressorStream.decompress
- java.lang.RuntimeException: file:/temp/path/c000.json is not a Parquet file (too small)
spark.sql.files.ignoreCorruptFiles = true
[SPARK-17850] If true, Spark jobs will continue to run even when they encounter corrupt files. The contents that have been read will still be returned.
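A minimal sketch of enabling this from code, assuming an existing SparkSession named spark (the input path is a placeholder):
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.read.parquet("/source/path").count()   // unreadable files are skipped; contents already read are still returned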
Dealing with Bad Data: Skip Corrupt Records
Missing or corrupt records: [SPARK-12833][SPARK-13764] Text file formats (JSON and CSV) support 3 different parse modes while reading data:
1. PERMISSIVE
2. DROPMALFORMED
3. FAILFAST
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord", "_corrupt_record")
.json(corruptRecords)
.show()
The default can be configured via spark.sql.columnNameOfCorruptRecord
Json: Dealing with Corrupt Records
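In PERMISSIVE mode every input line is kept; the malformed line ends up in the corrupt-record column while the parsed columns are null for that row. The show() above would print roughly:
+---------------+----+----+----+
|_corrupt_record|   a|   b|   c|
+---------------+----+----+----+
|           null|   1|   2|   3|
|   {"a":{, b:3}|null|null|null|
|           null|   5|   6|   7|
+---------------+----+----+----+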
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "DROPMALFORMED")
.json(corruptRecords)
.show()
Json: Dealing with Corrupt Records
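In DROPMALFORMED mode the malformed line is simply dropped, so the show() above would print roughly:
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  5|  6|  7|
+---+---+---+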
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "FAILFAST")
.json(corruptRecords)
.show()
org.apache.spark.sql.catalyst.json
.SparkSQLJsonProcessingException:
Malformed line in FAILFAST mode:
{"a":{, b:3}
Json: Dealing with Corrupt Records
spark.read
.option("mode", "FAILFAST")
.csv(corruptRecords)
.show()
java.lang.RuntimeException:
Malformed line in FAILFAST mode:
2015,Chevy,Volt
CSV: Dealing with Corrupt Records
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
spark.read
.option("mode", "PERMISSIVE")
.csv(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
spark.read
.option("header", true)
.option("mode", "PERMISSIVE")
.csv(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
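With a header row and PERMISSIVE mode, the short record is kept and its missing trailing fields are filled with null. The show() above would print roughly:
+----+-----+-----+-------------------+-----+
|year| make|model|            comment|blank|
+----+-----+-----+-------------------+-----+
|2012|Tesla|    S|         No comment| null|
|1997| Ford| E350|Go get one now they| null|
|2015|Chevy| Volt|               null| null|
+----+-----+-----+-------------------+-----+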
val schema = "col1 INT, col2 STRING, col3 STRING, col4 STRING, " +
"col5 STRING, __corrupted_column_name STRING"
spark.read
.schema(schema)
.option("header", true)
.option("mode", "PERMISSIVE")
.csv(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
spark.read
.option("mode", ”DROPMALFORMED")
.csv(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
Functionality: Better Corruption Handling
badRecordsPath: a user-specified path to store exception files for
recording the information about bad records/files.
- A unified interface for both corrupt records and files
- Enabling multi-phase data cleaning
- DROPMALFORMED + Exception files
- No need for an extra column for corrupt records
- Recording the exception data, reasons and time.
Availability: Databricks Runtime 3.0
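A minimal sketch of the option in use (Databricks Runtime 3.0+; the paths are placeholders):
spark.read
.option("badRecordsPath", "/tmp/badRecordsPath")
.json("/source/path")
.show()
Bad records and unreadable files are skipped, and the exception data, reasons and time are recorded under the given path.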
Functionality: Better JSON and CSV Support
[SPARK-18352] [SPARK-19610] Multi-line JSON and CSV Support
- By default, Spark SQL reads JSON/CSV one line (one record) at a time
- Before 2.2, multi-line records required custom ETL
spark.read
.option("multiLine", true)
.json(path)
spark.read
.option("multiLine", true)
.csv(path)
Availability: Apache Spark 2.2
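For illustration (sample record, not from the talk), a single pretty-printed record spanning several lines can now be read directly with the option above:
{
  "a": 1,
  "b": 2,
  "c": 3
}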
Transformation: Higher-order Functions in SQL
Transformation on complex objects like arrays, maps and
structures inside of columns.
UDFs? Expensive data serialization
tbl_nested
|-- key: long (nullable = false)
|-- values: array (nullable = false)
| |-- element: long (containsNull = false)
Transformation: Higher-order Functions in SQL
1) Check for element existence
SELECT EXISTS(values, e -> e > 30) AS v
FROM tbl_nested;
2) Transform an array
SELECT TRANSFORM(values, e -> e * e) AS v
FROM tbl_nested;
tbl_nested
|-- key: long (nullable = false)
|-- values: array (nullable = false)
| |-- element: long (containsNull = false)
Transformation on complex objects like arrays, maps and
structures inside of columns.
Transformation: Higher-order Functions in SQL
tbl_nested
|-- key: long (nullable = false)
|-- values: array (nullable = false)
| |-- element: long (containsNull = false)
3) Filter an array
SELECT FILTER(values, e -> e > 30) AS v
FROM tbl_nested;
4) Aggregate an array
SELECT REDUCE(values, 0, (value, acc) -> value + acc) AS sum
FROM tbl_nested;
Availability: Databricks Runtime 3.0
Ref Databricks Blog: http://dbricks.co/2rUKQ1A
More cool features available in DB Runtime 3.0: http://dbricks.co/2rhPM4c
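A minimal end-to-end sketch, assuming a runtime where these SQL higher-order functions are available (Databricks Runtime 3.0 at the time of this talk); the table contents are illustrative:
// Register a small nested dataset; backticks guard the column name against the VALUES keyword
spark.sql("SELECT 1L AS key, array(10L, 20L, 40L) AS `values`").createOrReplaceTempView("tbl_nested")
spark.sql("SELECT FILTER(`values`, e -> e > 30) AS v FROM tbl_nested").show()
// +----+
// |   v|
// +----+
// |[40]|
// +----+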
New Format in DataFrameWriter API
Users can create Hive-serde tables using the DataFrameWriter APIs
CREATE data source tables:
df.write.format("parquet")
.saveAsTable("tab")
CREATE Hive-serde tables:
df.write.format("hive")
.option("fileFormat", "avro")
.saveAsTable("tab")
Availability: Apache Spark 2.2
Unified CREATE TABLE [AS SELECT]
CREATE data source tables:
CREATE TABLE t1(a INT, b INT)
USING ORC
CREATE Hive-serde tables (Hive syntax):
CREATE TABLE t1(a INT, b INT)
STORED AS ORC
CREATE Hive-serde tables (unified syntax):
CREATE TABLE t1(a INT, b INT)
USING hive
OPTIONS(fileFormat 'ORC')
Availability: Apache Spark 2.2
CREATE [TEMPORARY] TABLE [IF NOT EXISTS]
[db_name.]table_name
USING table_provider
[OPTIONS table_property_list]
[PARTITIONED BY (col_name, col_name, ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)]
INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[AS select_statement];
Availability: Apache Spark 2.2
Unified CREATE TABLE [AS SELECT]
Apache Spark preferred syntax
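A hypothetical instance of this syntax (the table, columns and comment are made up for illustration):
spark.sql("""
  CREATE TABLE IF NOT EXISTS events (id INT, ts STRING, country STRING)
  USING parquet
  PARTITIONED BY (country)
  COMMENT 'illustrative table created with the unified syntax'
""")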
Apache Spark 2.3+
Massive focus on building ETL-friendly pipelines
[SPARK-15689] Data Source API v2
1. [SPARK-20960] An efficient column batch interface for data
exchanges between Spark and external systems.
o Cost for conversion to and from RDD[Row]
o Cost for serialization/deserialization
o Publish the columnar binary formats
2. Filter pushdown and column pruning
3. Additional pushdown: limit, sampling and so on.
Target: Apache Spark 2.3
Performance: Python UDFs
1. Python is the most popular language for ETL
2. Python UDFs are often used to express elaborate data
conversions/transformations
3. Any improvements to Python UDF processing will ultimately
improve ETL
4. Improve data exchange between Python and the JVM
5. Block-level UDFs
o Block-level arguments and return types
Target: Apache Spark 2.3
Recap
1. What’s an ETL Pipeline?
2. Using Spark SQL for ETL
- Extract: Dealing with Dirty Data (Bad Records or Files)
- Extract: Multi-line JSON/CSV Support
- Transformation: Higher-order functions in SQL
- Load: Unified write paths and interfaces
3. New Features in Spark 2.3
- Performance (Data Source API v2, Python UDF)
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
• Collaborative cloud environment
• Free version (community edition)
DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today.
databricks.com
Questions?
Xiao Li (lixiao@databricks.com)