Simplify Data Conversion from Spark to Deep Learning
Liang Zhang
Software Engineer @ Databricks
About Me
▪ Machine Learning Team @ Databricks
▪ Master's degree from Carnegie Mellon University
▪ linkedin.com/in/liangz1/
Agenda
▪ Why should we care about data conversion between Spark and deep learning frameworks?
▪ Pain points
▪ Overview of the Spark Dataset Converter
▪ Demo
▪ Best Practices
Motivation: Data Conversion from Spark to DL
[Diagram: Spark DataFrame -> ? -> TensorFlow / PyTorch]
• Images from a driving camera: detect traffic lights
• Large amount of data (TBs)
• New images arriving every day
• Data cleaning and labeling
• Train the model with all available data and periodically re-train with new data
• Predict the label of new images
Pain points: Data Conversion from Spark to Deep Learning frameworks
Pain points: Data Conversion from Spark to DL
• Single-node training:
  • Collect a sample of data to the driver in a pandas DataFrame
• Distributed training:
  • Save the Spark DataFrame to TFRecord files and parse the serialized data in the TFRecords using TensorFlow
  • Save the Spark DataFrame to Parquet files and write a custom PyTorch DataLoader to load the partitions
• Hard to migrate from single-node to distributed training
• Many lines of extra code to save, load, and parse intermediate files
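For context, the manual workflows listed above might look like the following minimal sketch. It assumes `df` is a PySpark DataFrame and uses only its `sample`, `toPandas`, and `write.mode(...).parquet(...)` calls; the function names, the sampling fraction, and the output path are hypothetical and chosen for illustration.

```python
def collect_sample_to_driver(df, fraction=0.01, seed=42):
    """Single-node route: pull a small random sample onto the driver as a pandas DataFrame.

    Assumes `df` is a PySpark DataFrame; the default fraction is illustrative only.
    """
    return df.sample(fraction=fraction, seed=seed).toPandas()


def save_for_distributed_training(df, path):
    """Distributed route: write intermediate Parquet files.

    The training job still needs its own loader code to read these partitions back.
    """
    df.write.mode("overwrite").parquet(path)
    return path
```

Note that the two routes share no code, which is the migration problem the slide points out: moving from the first function to the second means rewriting the data loading path.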
Overview of the Spark Dataset Converter
Spark Dataset Converter API Overview
[Diagram: Spark DataFrame -> Spark Dataset Converter -> TensorFlow Dataset / PyTorch DataLoader]
    from petastorm.spark import make_spark_converter

    converter = make_spark_converter(df)

    # TensorFlow: the context manager yields a dataset for training
    with converter.make_tf_dataset() as dataset:
        tf_model.fit(dataset)

    # PyTorch: the context manager yields a data loader for training
    with converter.make_torch_dataloader() as dataloader:
        train(torch_model, dataloader)
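The slide code leaves out one setup step: petastorm needs a parent cache directory configured on the Spark session before `make_spark_converter` is called. The sketch below shows one way this could be wired up. The cache URL is a hypothetical local path, and `build_converter` and `run_training` are illustrative helper names, not part of petastorm. Imports are deferred so the module loads without Spark installed.

```python
def build_converter(spark, df, cache_dir_url="file:///tmp/petastorm_cache"):
    """Configure the cache location, then create a converter for `df`.

    `cache_dir_url` is a hypothetical path; on a cluster it would point at
    shared storage such as HDFS or DBFS.
    """
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, cache_dir_url)
    return make_spark_converter(df)


def run_training(converter, fit_fn, batch_size=32):
    """Open the TensorFlow dataset, hand it to `fit_fn`, and close it afterwards."""
    with converter.make_tf_dataset(batch_size=batch_size) as dataset:
        return fit_fn(dataset)
```

A call such as `run_training(build_converter(spark, df), tf_model.fit)` would then reproduce the TensorFlow half of the slide, with the dataset closed when the `with` block exits.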
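Spark Dataset Converter API
[Flow diagram: ETL -> Training]
• ETL: start from a Spark DataFrame
• Found cached parquet file?
  • No: cache the DataFrame in a parquet file (data.parquet) on HDFS/DBFS
  • Yes: reuse the cached parquet file
• Training: load the cached parquet file with petastorm as a tf.data.Dataset / PyTorch DataLoader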
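The cache check in the flow above can be pictured as a lookup keyed by a fingerprint of the DataFrame's query plan. The following is a conceptual sketch only, not petastorm's implementation: the `ParquetCache` class, the use of SHA-256, and the `materialize` callback are all assumptions made for illustration.

```python
import hashlib


class ParquetCache:
    """Toy cache: maps a query-plan fingerprint to the path of a cached file."""

    def __init__(self):
        self._paths = {}   # plan fingerprint -> cached file path
        self.writes = 0    # how many times data was actually materialized

    def get_or_create(self, plan_text, materialize):
        """Return the cached path for `plan_text`, writing the file only on a miss."""
        key = hashlib.sha256(plan_text.encode("utf-8")).hexdigest()
        if key not in self._paths:
            # Cache miss: run the (expensive) write exactly once for this plan.
            self._paths[key] = materialize(key)
            self.writes += 1
        return self._paths[key]
```

Two DataFrames with the same plan text map to the same key, so the second request skips the write and goes straight to loading.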
Spark Dataset Converter Features
• Cache intermediate files
  • Recognize a cached Spark DataFrame by checking the analyzed query plan
  • Automatic cache cleaning at program exit
• Easy migration to distributed
  • Change two arguments to migrate your data loading code from a single-node to a distributed setting
• MLlib vector handling
  • Convert MLlib vectors to 1D arrays automatically
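The slide does not name the two arguments. In petastorm they are most likely the reader options `cur_shard` and `shard_count`, which can be passed through `make_tf_dataset` and `make_torch_dataloader`. The sketch below rests on that assumption; `shard_kwargs` is a hypothetical helper that returns nothing extra on a single node and the two sharding arguments when training is distributed.

```python
def shard_kwargs(rank=None, world_size=None):
    """Build the extra keyword arguments for a data loading call.

    Single node: returns an empty dict, so the call is unchanged.
    Distributed: returns the two sharding arguments for this worker.
    """
    if world_size is None or world_size <= 1:
        return {}
    if rank is None or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return {"cur_shard": rank, "shard_count": world_size}
```

With Horovod, for example, the distributed call could be written as `converter.make_tf_dataset(batch_size=32, **shard_kwargs(hvd.rank(), hvd.size()))`, while the single-node call passes `**shard_kwargs()` and behaves as before.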
How to use the Spark Dataset Converter API? (demo)
Demo notebooks
• Image Classification
  • Spark to TensorFlow Dataset
    • https://docs.databricks.com/_static/notebooks/deep-learning/petastorm-spark-converter-tensorflow.html
  • Spark to PyTorch DataLoader
    • https://docs.databricks.com/_static/notebooks/deep-learning/petastorm-spark-converter-pytorch.html
Best Practices
Best Practices with Spark Dataset Converter
• Image data decoding and preprocessing
  • Decode image bytes and preprocess in TransformSpec, not in Spark
  • Pipeline stages: Spark -> TransformSpec -> Dataset.map -> in the model (GPU)
• Generate infinite batches using num_epochs=None
  • In distributed training, this helps guarantee that every worker gets exactly the same amount of data
• Manage the lifecycle of cached data
  • On a local laptop or in a scheduled job on Databricks, the cache files are automatically deleted when the Python process exits
  • In a Databricks notebook, we recommend configuring lifecycle rules for the underlying S3 buckets that store the cache files
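The `num_epochs=None` advice can be illustrated without any deep learning framework. When shards are uneven, a worker that reads its shard once may run out of batches before the others. If every worker instead reads an endless stream and stops after a fixed number of steps, all of them process the same number of batches. The sketch below assumes that the step count is derived from the total row count; `steps_per_epoch` and `take_steps` are hypothetical helper names, and `itertools.cycle` stands in for an infinite dataset.

```python
import itertools


def steps_per_epoch(num_rows, batch_size, world_size=1):
    """Number of steps each worker runs so that one epoch covers the data once overall."""
    return num_rows // (batch_size * world_size)


def take_steps(batches, steps):
    """Consume exactly `steps` batches from a (possibly infinite) iterator."""
    return list(itertools.islice(batches, steps))


# Two workers with uneven shards: 5 batches versus 3 batches.
worker_a = itertools.cycle(["a0", "a1", "a2", "a3", "a4"])
worker_b = itertools.cycle(["b0", "b1", "b2"])

steps = 4
seen_a = take_steps(worker_a, steps)
seen_b = take_steps(worker_b, steps)  # wraps around instead of running dry
```

Both workers finish after four steps, so a collective operation such as gradient averaging never waits on a worker that has already stopped.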
