Databricks’ Data Pipelines:
Journey and Lessons Learned
Yu Peng, Burak Yavuz
07/06/2016
Who Are We
Yu Peng
Data Engineer at Databricks
Building Databricks’ next-generation data pipeline
on top of Apache Spark
BS from Xiamen University
Ph.D. from the University of Hong Kong
Burak Yavuz
Software Engineer at Databricks
Contributor to Spark since Spark 1.1
Maintainer of Spark Packages
BS in Mechanical Engineering at Bogazici University
MS in Management Science & Engineering at Stanford University
Building a data pipeline is hard
• At-least-once or exactly-once semantics
• Fault tolerance
• Resource management
• Scalability
• Maintainability
Apache® Spark™ + Databricks = Our Solution
• All ETL jobs are built on top of Apache Spark
• Unified solution, everything in the same place
• All ETL jobs are run on Databricks platform
• Platform for Data Engineers and Scientists
• Test out new Spark and Databricks features
Apache, Apache Spark and Spark are trademarks of the Apache Software Foundation
Classic Lambda Data Pipeline
[Diagram components: services (service 0, service 1, …, service x), each group with a log collector; a Centralized Messaging System; Delta ETL and Batch ETL; a Storage System]
Databricks Data Pipeline Overview
[Diagram, built up over several slides. Labeled components:]
• Customer deployments (Customer Dep 0, Customer Dep 1, Customer Dep 2, …) and the Databricks deployment (Databricks Dep)
• Clusters (Cluster 0, Cluster 1, Cluster 2, …), each running services (service 0, service 1, service 2, …, service x, service y) together with a log-daemon
• Amazon Kinesis
• Databricks Deployment, containing DBFS (Databricks Filesystem) and Databricks Jobs
• Sync daemon, producing raw record batches (json)
• ETL jobs, producing tables (parquet)
• Data analysis
• Real-time analysis
Log collection (Log-daemon)
• Fault tolerance and at-least-once semantics
• Streaming
• Batch
• Spark History Server
• Multi-tenant and config driven
• Spark container
Log Daemon Architecture
[Diagram components:]
• Services (Service 1, Service 2, …, Service x), each writing an active.log; log rotation produces files named by date and hour (2015-11-30-20.log, 2015-11-30-19.log, …)
• Log Daemon with one log stream per service (logStream1, logStream2, …, logStreamX); each log stream has a reader and a producer
• State files
• Message Producer
• Kinesis topics (topic-1, topic-2, …)
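The slides do not show the log daemon's code. As a minimal sketch of how a reader, a producer, and a state file can combine to give at-least-once delivery, the example below commits a byte offset only after the send succeeds. The function names, the JSON state format, and the `send` callback are assumptions for illustration, and log rotation is not handled.

```python
import json
import os


def load_offset(state_path):
    """Return the last committed byte offset, or 0 if no state file exists."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path) as f:
        return json.load(f)["offset"]


def save_offset(state_path, offset):
    """Persist the offset via write-then-rename so the state file is never torn."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp_path, state_path)


def ship_new_lines(log_path, state_path, send):
    """Read lines past the committed offset, send them, then commit the offset.

    The offset is saved only after send() returns. A crash between the two
    steps causes a resend (duplicates), never a loss: at-least-once delivery.
    """
    offset = load_offset(state_path)
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Ship only complete lines; a partially written last line waits for the next pass.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8").splitlines()
    if lines:
        send(lines)  # hypothetical producer callback, e.g. a put to a message queue
        save_offset(state_path, offset + end)
    return len(lines)
```

Because duplicates are possible under this scheme, a downstream step has to tolerate or remove them.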
Sync Daemon
• Read from Kinesis and write to DBFS
• Buffer and write in batches (128 MB or 5 minutes, whichever comes first)
• Partitioned by date
• A long-running Apache Spark job
• Easy to scale up and down
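The buffering rule above (flush at 128 MB or after 5 minutes) can be sketched as follows. This is an illustration under assumptions, not the sync daemon itself: the `BatchBuffer` class, the `flush` callback, and the `date=YYYY-MM-DD` directory naming are hypothetical, and the real job runs on Apache Spark rather than in plain Python.

```python
import time
from datetime import datetime, timezone


def partition_dir(root, epoch_s):
    """Return a date-partitioned directory for a record timestamp (assumed layout)."""
    day = datetime.fromtimestamp(epoch_s, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{root}/date={day}"


class BatchBuffer:
    """Accumulate records and flush when a size or age threshold is reached."""

    def __init__(self, flush, max_bytes=128 * 1024 * 1024, max_age_s=5 * 60,
                 clock=time.monotonic):
        self.flush = flush          # callback that receives a list of records
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.clock = clock          # injectable so the age rule can be tested
        self.records = []
        self.size = 0
        self.started = None         # time the first buffered record arrived

    def add(self, record: bytes):
        if not self.records:
            self.started = self.clock()
        self.records.append(record)
        self.size += len(record)
        self.maybe_flush()

    def maybe_flush(self):
        """Flush if the buffer is big enough or old enough; call this periodically."""
        if not self.records:
            return False
        too_big = self.size >= self.max_bytes
        too_old = self.clock() - self.started >= self.max_age_s
        if too_big or too_old:
            batch, self.records, self.size = self.records, [], 0
            self.flush(batch)
            return True
        return False
```

The age check matters for low-traffic streams: without it, a small buffer could sit unwritten indefinitely.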
ETL Jobs
[Diagram: in the Databricks Deployment, raw records in the Databricks Filesystem are processed by Databricks Jobs into ETL Tables (Parquet) in the Databricks Filesystem]
• Delta job (every 10 mins): new files, current day, no dedup, append
• Batch job (daily): all files, previous day, dedup, overwrite
ETL Jobs
• Use the same code for Delta and Batch jobs
• Run as scheduled Databricks jobs
• Use spot instances and fall back to on-demand
• Deliver to Databricks as Parquet tables
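One way to read "the same code for Delta and Batch jobs" is a shared transform with only the write semantics switched by mode: the delta job appends without deduplication, and the batch job deduplicates and overwrites. The sketch below shows that shape in plain Python. The record fields, the `transform` body, and the in-memory `table` list are placeholders; the actual jobs run on Spark and write Parquet.

```python
def transform(record):
    """Shared per-record logic used by both jobs (a placeholder normalization)."""
    return {"id": record["id"], "level": record["level"].upper()}


def run_etl(raw_records, table, mode):
    """Run the shared transform, then write with mode-specific semantics.

    mode="delta": new records only, no dedup, append to the table.
    mode="batch": the whole day's records, dedup by id, overwrite the table.
    """
    rows = [transform(r) for r in raw_records]
    if mode == "delta":
        table.extend(rows)
    elif mode == "batch":
        unique = {row["id"]: row for row in rows}  # last occurrence of each id wins
        table[:] = list(unique.values())
    else:
        raise ValueError(f"unknown mode: {mode}")
    return table
```

With this split, the daily batch run repairs any duplicates that the frequent delta runs appended.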
Lessons Learned
- Partition pruning can save a lot of time and money
Reduced query time from 2800 seconds to just 15 seconds (roughly a 187x speedup).
Don’t partition on too many levels; doing so worsens metadata discovery performance and cost.
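To illustrate why pruning helps, the sketch below models a table partitioned by date as one directory per day and keeps only the directories a date-filtered query needs. The `date=YYYY-MM-DD` layout and both function names are assumptions for illustration; in Spark SQL the pruning happens inside the query planner.

```python
from datetime import date, timedelta


def partition_dirs(root, start, end):
    """List the date-partition directories covering [start, end] inclusive."""
    days = (end - start).days + 1
    return [f"{root}/date={start + timedelta(days=i)}" for i in range(days)]


def prune(all_dirs, wanted_start, wanted_end):
    """Keep only partitions whose date falls inside the query's date range."""
    kept = []
    for d in all_dirs:
        day = date.fromisoformat(d.rsplit("date=", 1)[1])
        if wanted_start <= day <= wanted_end:
            kept.append(d)
    return kept


# A year of daily partitions, queried for a single day.
dirs = partition_dirs("/tables/logs", date(2016, 1, 1), date(2016, 12, 31))
kept = prune(dirs, date(2016, 5, 26), date(2016, 5, 26))
print(len(dirs), len(kept))
```

A query restricted to one day touches one directory instead of all of them. Each extra partition level multiplies the number of directories, which is the metadata discovery cost the slide warns about.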
Lessons Learned
- High S3 costs: Lots of LIST requests
Metadata discovery on S3 is expensive. Spark SQL tries to refresh its metadata cache even after write operations.
Running It All in Databricks - Jobs
Running It All in Databricks - Spark
Data Analysis & Tools
We get the data in. What’s next?
● Monitoring
● Debugging
● Usage Analysis
● Product Design (A/B testing)
Debugging
Access to logs in a matter of seconds thanks to Apache Spark.
Monitoring
Monitor logs by log level. Bug introduced on 2016-05-26 01:00:00 UTC. Fix deployed in 2 hours.
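Monitoring by log level amounts to counting records per time bucket and level, so that a spike in errors stands out. The sketch below does this in plain Python. The record fields (`timestamp`, `level`) and the hourly bucket are assumptions, and the sample records are made up for illustration.

```python
from collections import Counter


def count_by_hour_and_level(records):
    """Count log records per (hour, level); timestamps are 'YYYY-MM-DD HH:MM:SS'."""
    counts = Counter()
    for r in records:
        hour = r["timestamp"][:13]  # keep 'YYYY-MM-DD HH'
        counts[(hour, r["level"])] += 1
    return counts


# Made-up sample records, not data from the talk.
sample = [
    {"timestamp": "2016-05-26 00:59:10", "level": "INFO"},
    {"timestamp": "2016-05-26 01:00:05", "level": "ERROR"},
    {"timestamp": "2016-05-26 01:30:42", "level": "ERROR"},
]
counts = count_by_hour_and_level(sample)
print(counts[("2016-05-26 01", "ERROR")])
```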
Usage Analysis + Product Design
SparkR + ggplot2 = Match made in heaven
Summary
Databricks + Apache Spark create a unified platform for:
- ETL
- Data Warehousing
- Data Analysis
- Real-time analytics
DevOps issues are off the table:
- No need to manage a huge cluster
- Jobs are isolated; they don’t cannibalize each other’s resources
- Can launch any Spark version
Ongoing & Future Work
Structured Streaming
- Reduce Complexity of pipeline:
Sync Daemon + Delta + Batch Jobs => Single Streaming Job
- Reduce Latency
Availability of data in seconds instead of minutes
- Event Time Dashboards
Try Apache Spark with Databricks
http://databricks.com/try
Thank you.
Have questions about ETL with Spark?
Join us at the Databricks Booth 3.45-6.00pm!
