Azure Databricks - An Introduction 2019 Roadshow
Why Spark?
• Open-source data processing engine built around speed, ease of use, and
sophisticated analytics
• In-memory engine that can run up to 100 times faster than Hadoop MapReduce
• Largest open-source data project with 1000+ contributors
• Highly extensible with support for Scala, Java and Python alongside Spark SQL,
GraphX, Streaming and Machine Learning Library (MLlib)
Why Databricks?
• Databricks is the premium commercial offering of Spark on the market
• Databricks was founded by the creators of Spark
• Spark is the dominant workload in Hadoop
• Databricks contributes 75% of the code to open-source Spark
Hadoop MapReduce
MapReduce in Hadoop
[Diagram: Azure Storage feeds a Driver, which fans work out to groups of VMs; each group writes its results to Disk before the next group runs.]
Azure Storage > Driver > VM/Parallelization > write to Disk > VM/Parallelization > write to disk >
repeat…
Writing to disk takes time, and MapReduce pays that cost at every step, every time you run this process.
What is Azure Databricks?
Apache® Spark™ is FASTER and EASIER than MapReduce in Hadoop
[Diagram: Azure Storage feeds a Driver, which fans work out to groups of VMs; results pass between groups through Cache instead of Disk.]
Faster – In Spark, data stays in cache, which gives Spark its speed advantage over MapReduce (which writes to disk)
Easier – You can use the language you are most comfortable with in Spark (Python, Scala, R, SQL)
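The difference between writing to disk at every step and keeping data in memory can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Spark or Hadoop code: the stage functions (square, add_one) and the helpers run_with_disk and run_in_memory are hypothetical names made up for this sketch, and local temp files stand in for the Disk boxes in the MapReduce diagram.

```python
import json
import os
import tempfile


def square(x):
    return x * x


def add_one(x):
    return x + 1


# Two made-up processing stages, applied in order.
STAGES = [square, add_one]


def run_with_disk(records, stages, workdir):
    """MapReduce-style: each stage writes its output to disk,
    and the next stage has to read it back before it can start."""
    data = list(records)
    disk_writes = 0
    for i, stage in enumerate(stages):
        result = [stage(x) for x in data]
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(result, f)  # intermediate result goes to disk
        disk_writes += 1
        with open(path) as f:
            data = json.load(f)  # next stage reloads it from disk
    return data, disk_writes


def run_in_memory(records, stages):
    """Spark-style (conceptually): intermediate results stay in memory."""
    data = list(records)
    for stage in stages:
        data = [stage(x) for x in data]
    return data, 0  # no intermediate disk writes


if __name__ == "__main__":
    records = list(range(5))
    with tempfile.TemporaryDirectory() as workdir:
        print(run_with_disk(records, STAGES, workdir))
    print(run_in_memory(records, STAGES))
```

Both functions return the same records; the only difference is how many times the intermediate data touches disk, which is the cost the MapReduce slide calls out.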
What is Azure Databricks?
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized
for Azure
Best of Databricks, Best of Microsoft
Interactive workspace that enables collaboration between data scientists, data engineers, and
business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Platform: Databricks I/O, Apache Spark, High-Concurrency
Connects to: Hadoop storage, REST APIs, data warehouses
Enhance Productivity | Build on secure & trusted cloud | Scale without limits
DATA WAREHOUSING PATTERN IN AZURE
Loading and preparing data for analysis with a data warehouse
[Diagram: data flows from sources through loading or ingestion, storage, and processing. Labels recovered from the diagram:]
• Sources: applications; logs, files and media (unstructured); sensors and IoT (unstructured)
• Stages: data loading, long-term storage, real-time applications
• Storage and processing services: Data Lake Store, Azure Storage, Cosmos DB, SQL DB, Azure SQL DW, Azure Databricks, HDInsight, AAS
• Machine learning: Azure ML Studio, R Server, Azure Databricks (Spark ML)
• Stream ingestion: Event Hubs, IoT Hub, Kafka on HDInsight
• Stream analytics: Stream Analytics, Azure Databricks (Spark Streaming)
Reduced Administration
BIG DATA ANALYTICS
• Azure Databricks: frictionless & optimized Spark clusters
• Azure HDInsight: workload optimized, managed clusters
• Azure Marketplace (HDP | CDH | MapR): any Hadoop technology, any distribution
BIG DATA STORAGE
• Azure Storage
Azure Databricks Next Step
Azure Databricks Home
Documentation, Pricing, Get Started Information
https://azure.microsoft.com/en-us/services/databricks/
Demo