Introduction to
Azure Databricks
James Serra
Big Data Evangelist
Microsoft
JamesSerra3@gmail.com
About Me
• Microsoft, Big Data Evangelist
• In IT for 30 years, worked on many BI and DW projects
• Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
• Been perm employee, contractor, consultant, business owner
• Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
• Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
• Blog at JamesSerra.com
• Former SQL Server MVP
• Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
• Big Data Architectures
• Why data lakes?
• Top-down vs Bottom-up
• Data lake defined
• Hadoop as the data lake
• Modern Data Warehouse
• Federated Querying
• Solution in the cloud
• SMP vs MPP
Security and performance | Flexibility of choice | Reason over any data, anywhere
Data warehouses
Data Lakes
Operational databases
Hybrid
Data warehouses
Data Lakes
Operational databases
Social | LOB | Graph | IoT | Image | CRM
THE MODERN DATA ESTATE
Security and performance | Flexibility of choice | Reason over any data, anywhere
Data warehouses
Operational databases
Hybrid
Data warehouses
Operational databases
SQL Server Azure Data Services
AI built-in | Most secure | Lowest TCO
Industry leader 2 years in a row
#1 TPC-H performance
T-SQL query over any data
70% faster than Aurora
2x the global reach of Redshift
No Limits Analytics with 99.9% SLA
Easiest lift and shift
with no code changes
Social | LOB | Graph | IoT | Image | CRM
THE MICROSOFT OFFERING
Data lakes Data lakes
Introduction to Azure Databricks
CONTROL EASE OF USE
Azure Data Lake
Analytics
Azure Data Lake Store
Azure Storage
Any Hadoop technology,
any distribution
Workload optimized,
managed clusters
Data Engineering in a
Job-as-a-service model
Azure Marketplace
HDP | CDH | MapR
Azure Data Lake
Analytics
IaaS Clusters Managed Clusters Big Data as-a-service
Azure HDInsight
Frictionless & Optimized
Spark clusters
Azure Databricks
BIG DATA STORAGE
BIG DATA ANALYTICS
Reduced Administration
KNOWING THE VARIOUS BIG DATA SOLUTIONS
Model & Serve | Prep & Train
Databricks
HDInsight
Data Lake Analytics
Custom
apps
Sensors
and devices
Store
Blobs
Data Lake
Ingest
Data Factory
(Data movement, pipelines & orchestration)
Machine
Learning
Cosmos DB
SQL Data
Warehouse
Analysis Services
Event Hub
IoT Hub
SQL Database
Analytical dashboards
Predictive apps
Operational reports
Intelligence
BIG DATA & ADVANCED ANALYTICS AT A GLANCE
Business apps
SQL
Kafka
Introduction to Azure Databricks
Why Spark?
What is Azure Databricks?
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks Best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click set up; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
APACHE SPARK
A unified, open source, parallel data processing framework for Big Data Analytics
Spark Core Engine
Spark SQL
Interactive
Queries
Spark Structured
Streaming
Stream processing
Spark MLlib
Machine
Learning
Yarn Mesos
Standalone
Scheduler
Spark MLlib
Machine
Learning
Spark
Streaming
Stream processing
GraphX
Graph
Computation
DATABRICKS SPARK IS FAST
Benchmarks have shown Databricks to often have better performance than alternatives
SOURCE: Benchmarking Big Data SQL Platforms in the Cloud
ADVANTAGES OF A UNIFIED PLATFORM
Spark Streaming
Spark Machine
Learning
Spark SQL
Get started quickly by launching
your new Spark environment with
one click.
Share your insights in powerful
ways through rich integration with
Power BI.
Improve collaboration amongst
your analytics team through a
unified workspace.
Innovate faster with native
integration with the rest of the
Azure platform.
Simplify security and identity control
with built-in integration with Active
Directory.
Regulate access with fine-grained user
permissions to Azure Databricks’
notebooks, clusters, jobs and data.
Build with confidence on the trusted
cloud backed by unmatched support,
compliance and SLAs.
Operate at massive scale
without limits globally.
Accelerate data processing with
the fastest Spark engine.
ENHANCE PRODUCTIVITY | BUILD ON THE MOST COMPLIANT CLOUD | SCALE WITHOUT LIMITS
Differentiated experience on Azure
Optimized Databricks Runtime Engine
DATABRICKS I/O SERVERLESS
Collaborative Workspace
Cloud storage
Data warehouses
Hadoop storage
IoT / streaming data
Rest APIs
Machine learning models
BI tools
Data exports
Data warehouses
Azure Databricks
Enhance Productivity
Deploy Production Jobs & Workflows
APACHE SPARK
MULTI-STAGE PIPELINES
DATA ENGINEER
JOB SCHEDULER NOTIFICATION & LOGS
DATA SCIENTIST BUSINESS ANALYST
Build on secure & trusted cloud Scale without limits
Azure Databricks
Collaborative Workspace
Optimized Databricks Runtime Engine
DATABRICKS I/O SERVERLESS
Collaborative Workspace
Rest APIs
Azure Databricks
Deploy Production Jobs & Workflows
APACHE SPARK
MULTI-STAGE PIPELINES
DATA ENGINEER
JOB SCHEDULER NOTIFICATION & LOGS
DATA SCIENTIST BUSINESS ANALYST
Deploy Production Jobs & Workflows
Optimized Databricks Runtime Engine
DATABRICKS I/O SERVERLESS
Collaborative Workspace
Rest APIs
Azure Databricks
Deploy Production Jobs & Workflows
APACHE SPARK
MULTI-STAGE PIPELINES
DATA ENGINEER
JOB SCHEDULER NOTIFICATION & LOGS
DATA SCIENTIST BUSINESS ANALYST
Optimized Databricks Runtime Engine
Optimized Databricks Runtime Engine
DATABRICKS I/O SERVERLESS
Collaborative Workspace
Rest APIs
Azure Databricks
Deploy Production Jobs & Workflows
APACHE SPARK
MULTI-STAGE PIPELINES
DATA ENGINEER
JOB SCHEDULER NOTIFICATION & LOGS
DATA SCIENTIST BUSINESS ANALYST
AZURE DATABRICKS CORE ARTIFACTS
Azure
Databricks
GENERAL SPARK CLUSTER ARCHITECTURE
Data Sources (HDFS, SQL, NoSQL, …)
Cluster Manager
Worker Node Worker Node Worker Node
Driver Program
SparkContext
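The driver/worker split in the diagram above can be mimicked in a few lines. This is a toy sketch only, not Spark: a thread pool stands in for the worker nodes, and the function names (`split_into_partitions`, `worker_task`, `driver`) are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: slice the input into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def worker_task(partition):
    """Worker side: compute a partial result for one partition."""
    return sum(x * x for x in partition)


def driver(data, num_workers=3):
    """Hand one partition to each worker, then combine the partial results."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(worker_task, partitions))
    return sum(partial_sums)


total = driver(list(range(10)))
print(total)  # sum of squares of 0..9
```

In a real cluster the driver program holds the SparkContext, the cluster manager allocates the workers, and the partitions live on different machines rather than in one process.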
AZURE DATABRICKS INTEGRATION WITH AAD
Azure Databricks is integrated with Azure Active Directory (AAD), so Azure Databricks users are just regular AAD users
• There is no need to define users, and their access control, separately in Databricks.
• AAD users can be used directly in Azure Databricks for all user-based access control (clusters, jobs, notebooks, etc.).
• Databricks has delegated user authentication to AAD, enabling single sign-on (SSO) and unified authentication.
• Notebooks, and their outputs, are stored in the Databricks account. However, AAD-based access control ensures that only authorized users can access them.
Access
Control
Azure Databricks
Authentication
CLUSTERS: AUTOSCALING AND AUTO TERMINATION
Simplifies cluster management and reduces costs by eliminating wastage
When creating Azure Databricks clusters you can choose Autoscaling and Auto Termination options.
Autoscaling: Just specify the minimum and maximum number of worker nodes. Azure Databricks automatically scales the cluster up or down based on load.
Auto Termination: After the specified number of minutes of inactivity, the cluster is automatically terminated.
Benefits:
• You do not have to guess, or determine by trial and error, the correct number of nodes for the cluster
• As the workload changes you do not have to manually tweak the number of nodes
• You do not have to worry about wasting resources when the cluster is idle. You only pay for resources when they are actually being used
• You do not have to wait and watch for jobs to complete just so you can shut down the clusters
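As a rough illustration of the two options, the sketch below builds a cluster definition of the kind the Databricks Clusters REST API accepts. The field names (`autoscale`, `min_workers`, `max_workers`, `autotermination_minutes`) follow that API as I understand it and should be checked against the current reference. The cluster name and the placeholder runtime and VM values are assumptions.

```python
import json


def autoscaling_cluster_spec(name, min_workers, max_workers, idle_minutes):
    """Build a cluster definition with autoscaling and auto termination."""
    if not 0 < min_workers <= max_workers:
        raise ValueError("need 0 < min_workers <= max_workers")
    return {
        "cluster_name": name,
        # Placeholders: choose a runtime version and VM size your workspace offers.
        "spark_version": "<runtime-version>",
        "node_type_id": "<vm-size>",
        # Autoscaling: the service picks a worker count within this range.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Auto termination: shut the cluster down after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }


spec = autoscaling_cluster_spec("etl-nightly", 2, 8, 30)
print(json.dumps(spec, indent=2))
```

The same two settings appear as checkboxes and fields in the cluster creation page, so the dictionary is mainly useful when clusters are created from scripts.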
JOBS
Jobs are the mechanism to submit Spark application code for execution on the Databricks clusters
• Spark application code is submitted as a ‘Job’ for execution on Azure Databricks clusters
• Jobs execute either ‘Notebooks’ or ‘Jars’
• Azure Databricks provides a comprehensive set of graphical tools to create, manage and monitor Jobs.
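Besides the graphical tools, a job can be described as data. The sketch below builds a job definition for either a notebook or a JAR, using field names from the Databricks Jobs REST API as I understand it (`notebook_task`, `spark_jar_task`, `existing_cluster_id`). The helper name, job names, cluster ID and paths are hypothetical.

```python
def job_spec(name, cluster_id, notebook_path=None, jar_main_class=None):
    """Describe a job that runs exactly one notebook or one JAR main class."""
    if (notebook_path is None) == (jar_main_class is None):
        raise ValueError("pass exactly one of notebook_path or jar_main_class")
    spec = {"name": name, "existing_cluster_id": cluster_id}
    if notebook_path is not None:
        spec["notebook_task"] = {"notebook_path": notebook_path}
    else:
        # A JAR job would normally also list the JAR itself under "libraries".
        spec["spark_jar_task"] = {"main_class_name": jar_main_class}
    return spec


notebook_job = job_spec("daily-report", "<cluster-id>", notebook_path="/Users/someone@example.com/report")
jar_job = job_spec("nightly-etl", "<cluster-id>", jar_main_class="com.example.Etl")
print(notebook_job)
print(jar_job)
```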
WORKSPACES
Workspaces enable users to organize and share their Notebooks, Libraries and Dashboards
• Icons indicate the type of the object contained in a folder
• By default, the workspace and all its contents are available to users.
AZURE DATABRICKS NOTEBOOKS OVERVIEW
Notebooks are a popular way to develop, and run, Spark Applications
• Notebooks are not only for authoring Spark applications but can be run/executed directly on clusters
  – Shift+Enter
• Notebooks support fine-grained permissions, so they can be securely shared with colleagues for collaboration (see following slide for details on permissions and abilities)
• Notebooks are well-suited for prototyping, rapid development, exploration, discovery and iterative development
Notebooks typically consist of code, data, visualization, comments and notes
LIBRARIES OVERVIEW
Enables external code to be imported and stored into a Workspace
VISUALIZATION
Azure Databricks supports a number of visualization plots out of the box
• All notebooks, regardless of their language, support Databricks visualizations.
• When you run the notebook the visualizations are rendered inside the notebook in-place
• The visualizations are written in HTML.
  – You can save the HTML of the entire notebook by exporting to HTML.
  – If you use Matplotlib, the plots are rendered as images so you can just right click and download the image
• You can change the plot type just by picking from the selection
DATABRICKS FILE SYSTEM (DBFS)
DBFS is a distributed file system that is a layer over Azure Blob Storage
Azure Blob Storage
Python Scala CLI dbutils
DBFS
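One practical consequence of the layering is that the same file has two spellings: a `dbfs:/` URI for Spark and dbutils, and (on cluster nodes, where DBFS is exposed through a local mount) a `/dbfs/` path for ordinary Python file APIs. The helper below is a minimal sketch under that assumption. The function name and the example path are hypothetical.

```python
def dbfs_uri_to_local_path(uri):
    """Translate a dbfs:/ URI into the local-mount path used by plain file APIs."""
    prefix = "dbfs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a DBFS URI: {uri!r}")
    # Drop the scheme and any extra leading slashes, then re-root under /dbfs.
    return "/dbfs/" + uri[len(prefix):].lstrip("/")


local_path = dbfs_uri_to_local_path("dbfs:/mnt/raw/events.csv")
print(local_path)
```

In a notebook the URI form would typically be passed to a Spark reader or to dbutils, and the local form to something like Python's built-in `open`.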
SPARK SQL OVERVIEW
Spark SQL is a distributed SQL query engine for processing structured data
DATABASES AND TABLES OVERVIEW
Tables enable data to be structured and queried using Spark SQL or any of Spark’s language APIs
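A minimal sketch of that idea in PySpark: register a DataFrame under a table-like name, then query it with SQL. The function takes an existing SparkSession (in a Databricks notebook one is already available as `spark`), so nothing here starts a cluster. The view name `sales`, the column names and the helper name are assumptions made for illustration.

```python
# SQL text kept separate so it can be reused from a %sql cell or a BI tool.
TOTAL_BY_CITY_SQL = """
    SELECT city, SUM(amount) AS total
    FROM sales
    GROUP BY city
    ORDER BY total DESC
"""


def total_by_city(spark, rows):
    """rows: list of (city, amount) tuples; spark: an existing SparkSession."""
    df = spark.createDataFrame(rows, ["city", "amount"])
    # Register the DataFrame under a name that SQL queries can refer to.
    df.createOrReplaceTempView("sales")
    return spark.sql(TOTAL_BY_CITY_SQL)
```

In a notebook this could be called as `total_by_city(spark, [("Oslo", 10.0), ("Lima", 4.5), ("Oslo", 2.0)]).show()`. The same aggregation could also be written with the DataFrame API (`groupBy("city").sum("amount")`), which is the sense in which tables are reachable from either SQL or the language APIs.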
SPARK MACHINE LEARNING (ML) OVERVIEW
• Offers a set of parallelized machine learning algorithms (MMLSpark, Spark ML, Deep Learning, SparkR)
• Supports Model Selection (hyperparameter tuning) using Cross Validation and Train-Validation Split.
• Supports Java, Scala or Python apps using DataFrame-based API (as of Spark 2.0). Benefits include:
  – A uniform API across ML algorithms and across multiple languages
  – Facilitates ML pipelines (enables combining multiple algorithms into a single pipeline).
  – Optimizations through Tungsten and Catalyst
• Spark MLlib comes pre-installed on Azure Databricks
• 3rd Party libraries supported include: H2O Sparkling Water, scikit-learn and XGBoost
Enables Parallel, Distributed ML for large datasets on Spark Clusters
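To make the cross-validation bullet concrete, here is a plain-Python sketch of how k-fold cross validation carves a dataset into train and validation index sets. It is not Spark ML's implementation (which, as far as I know, assigns rows to folds randomly); it uses a deterministic round-robin assignment so the result is easy to inspect. The function name is hypothetical.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("need 2 <= k <= n_rows")
    # Round-robin assignment: row i goes to fold i % k.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, validation


splits = list(k_fold_indices(6, 3))
for train, validation in splits:
    print("train", train, "validate", validation)
```

Each row is used for validation exactly once and for training in the other folds. A hyperparameter setting is scored by averaging its metric over the folds, and the best-scoring setting is kept.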
SPARK STRUCTURED STREAMING OVERVIEW
• Unifies streaming, interactive and batch queries: a single API for both static bounded data and streaming unbounded data.
• Runs on Spark SQL. Uses the Spark SQL Dataset/DataFrame API used for batch processing of static data.
• Runs incrementally and continuously and updates the results as data streams in.
• Supports app development in Scala, Java, Python and R.
• Supports streaming aggregations, event-time windows, windowed grouped aggregation, stream-to-batch joins.
• Features streaming deduplication, multiple output modes and APIs for managing/monitoring streaming queries.
• Built-in sources: Kafka, File source (json, csv, text, parquet)
A unified system for end-to-end fault-tolerant, exactly-once stateful stream processing
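The event-time window idea can be shown without Spark. The sketch below assigns each event to a tumbling window by its own timestamp, not by arrival order, and counts events per window and key. It is a plain-Python illustration of windowed grouped aggregation, not the Structured Streaming API. The function names and sample events are made up.

```python
from collections import Counter


def window_start(event_time_s, width_s):
    """Start of the tumbling window that contains the event's timestamp."""
    return (event_time_s // width_s) * width_s


def count_per_window(events, width_s):
    """events: iterable of (event_time_seconds, key) in arrival order."""
    counts = Counter()
    for event_time_s, key in events:
        counts[(window_start(event_time_s, width_s), key)] += 1
    return dict(counts)


# The event stamped 59 arrives after the one stamped 61, but still lands in window 0.
events = [(3, "a"), (61, "a"), (59, "a"), (62, "b")]
result = count_per_window(events, 60)
print(result)
```

A streaming engine does the same bookkeeping incrementally and also has to decide how long to keep an old window open for late events, which this toy version ignores.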
APACHE KAFKA FOR HDINSIGHT INTEGRATION
Azure Databricks Structured Streaming integrates with Apache Kafka for HDInsight
• Apache Kafka for Azure HDInsight is an enterprise-grade streaming ingestion service running in Azure.
• Azure Databricks Structured Streaming applications can use Apache Kafka for HDInsight as a data source or sink.
• No additional software (gateways or connectors) is required.
• Setup: Apache Kafka on HDInsight does not provide access to the Kafka brokers over the public internet, so the Kafka clusters and the Azure Databricks cluster must be located in the same Azure Virtual Network.
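A sketch of what using Kafka as a source looks like from PySpark, assuming the brokers are reachable inside the shared virtual network. The option keys (`kafka.bootstrap.servers`, `subscribe`, `startingOffsets`) are the standard ones for Spark's Kafka source. The broker host names, topic and helper names are hypothetical, and `spark` is an existing SparkSession passed in by the caller.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Collect the options Spark's Kafka source expects."""
    return {
        "kafka.bootstrap.servers": ",".join(bootstrap_servers),
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def read_kafka_stream(spark, options):
    """Return a streaming DataFrame over the topic; rows carry key, value, timestamp, etc."""
    return spark.readStream.format("kafka").options(**options).load()


# Private addresses: these only resolve from inside the same virtual network.
options = kafka_source_options(["broker1:9092", "broker2:9092"], "telemetry")
print(options)
```

In a notebook, `read_kafka_stream(spark, options)` would return a DataFrame whose `value` column holds the raw message bytes, ready to be cast and parsed.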
SPARK GRAPHX OVERVIEW
• Unifies ETL, exploratory analysis, and iterative graph computation within a single system.
• Developers can:
  – view the same data as both graphs and collections,
  – transform and join graphs with RDDs, and
  – write custom iterative graph algorithms using the Pregel API.
• Currently only supports using the Scala and RDD APIs.
A set of APIs for graph and graph-parallel computation.
Algorithms:
• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count
AMPLab
PageRank Benchmark
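For readers who have not met PageRank, the textbook power-iteration version fits in a few lines of plain Python. This is not GraphX's implementation (GraphX is Scala-only, per the slide above, and scales its ranks differently). The function name, edge list and parameter defaults are illustrative assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """edges: list of (source, destination) pairs. Returns {node: rank}, ranks sum to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share (the "random jump").
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
print(max(ranks, key=ranks.get))
```

Each iteration is a round of "send my rank along my out-edges, then sum what arrives", which is the kind of message-passing step that graph-parallel APIs such as Pregel are designed to run across a cluster.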
DATABRICKS CLI
An easy-to-use interface built on top of the Databricks REST API
Currently, the CLI fully implements the DBFS API and the Workspace API
DATABRICKS REST API
Cluster API: Create/edit/delete clusters
DBFS API: Interact with the Databricks File System
Groups API: Manage groups of users
Instance Profile API: Allows admins to add, list, and remove instance profiles that users can launch clusters with
Job API: Create/edit/delete jobs
Library API: Create/edit/delete libraries
Workspace API: List/import/export/delete notebooks/folders
Databricks
REST API
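A sketch of how a call to one of these APIs is put together, using only the Python standard library. The request is built but not sent. The `/api/2.0/` path prefix, the `clusters/list` endpoint and bearer-token authentication reflect the Databricks REST API as I understand it. The workspace URL, the token placeholder and the helper name are hypothetical.

```python
import urllib.request


def build_api_request(workspace_url, token, endpoint):
    """Build (but do not send) an authenticated GET request for a REST endpoint."""
    url = f"{workspace_url.rstrip('/')}/api/2.0/{endpoint.lstrip('/')}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


request = build_api_request(
    "https://example-workspace.azuredatabricks.net",  # hypothetical workspace URL
    "<personal-access-token>",
    "clusters/list",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it and return a JSON response. The CLI wraps this same kind of call behind its commands.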
Introduction to Azure Databricks
Modern Big Data Warehouse
Business / custom apps
(Structured)
Logs, files and media
(unstructured)
Azure storage
Polybase
Azure SQL Data Warehouse
Data factory
Data factory
Azure Databricks
(Spark)
Analytical dashboards
Model & Serve | Prep & Train | Store | Ingest | Intelligence
Advanced Analytics on Big Data
Web & mobile apps
Azure Databricks
(Spark MLlib, SparkR, sparklyr)
Azure Cosmos DB
Business / custom apps
(Structured)
Logs, files and media
(unstructured)
Azure storage
Polybase
Azure SQL Data Warehouse
Data factory
Data factory
Analytical dashboards
Model & Serve | Prep & Train | Store | Ingest | Intelligence
Real-time analytics on Big Data
Unstructured data
Azure storage
Polybase
Azure SQL Data Warehouse
Azure HDInsight
(Kafka)
Azure Databricks
(Spark)
Analytical dashboards
Model & Serve | Prep & Train | Store | Ingest | Intelligence
Introduction to Azure Databricks
What it is
• Hadoop (Hortonworks’ Distribution) as a managed
service supporting a variety of open-source analytics
engines such as Apache Spark, Hive LLAP, Storm, Kafka,
HBase.
• Security via Ranger (Kerberos based)
Pricing
• Priced to compete with AWS EMR. Standard offering.
Use When
• Customer prefers a PaaS-like experience and wants to address big
data use cases by working with different OSS analytics
engines. Cost sensitive.
Big Data OSS - Comparison
Azure HDInsight (1st party + Support)
What it is
• Databricks Spark, the most popular open-source analytics
engine, as a managed service providing an easy and fast
way to unlock big data use cases. Offers best-in-class
notebooks experience for productivity and collaboration as
well as integration with Azure SQL Data Warehouse, Power BI, etc.
• Security via native Azure AD integration
Pricing
• Priced to match Databricks on AWS. Premium offering.
Use When
• Customer prefers a SaaS-like experience to address big data
use cases and values Databricks’ ease of use, productivity
& collaboration features.
Azure Databricks (1st party + Support)
What it is
Hadoop distributions from Cloudera, MapR &
Hortonworks available on Azure Marketplace as IaaS
VMs.
Pricing
• N/A. Vendor prices their products.
Use When
• Customer wants to move their on premises
Hadoop distribution to Azure IaaS using their
existing licenses.
3rd Party Offerings
Azure HDInsight
What It Is
• Hortonworks distribution as a first party service on Azure
• Big Data engines support – Hadoop Projects, Hive on Tez,
Hive LLAP, Spark, HBase, Storm, Kafka, R Server
• Best-in-class developer tooling and Monitoring capabilities
• Enterprise Features
• VNET support (join existing VNETs)
• Ranger support (Kerberos based Security)
• Log Analytics via OMS
• Orchestration via Azure Data Factory
• Available in most Azure Regions (27) including Gov
Cloud and Federal Clouds
Guidance
• Customer needs Hadoop technologies other than, or in
addition to Spark
• Customer prefers Hortonworks Spark distribution to stay
closer to OSS codebase and/or ‘Lift and Shift’ from
on-premises deployments
• Customer has specific project requirements that are only
available on HDInsight
Azure Databricks
What It Is
• Databricks’ Spark service as a first party service on Azure
• Single engine for Batch, Streaming, ML and Graph
• Best-in-class notebooks experience for optimal productivity
and collaboration
• Enterprise Features
• Native Integration with Azure for Security via AAD (OAuth)
• Optimized engine for better performance and scalability
• RBAC for Notebooks and APIs
• Auto-scaling and cluster termination capabilities
• Native integration with SQL DW and other Azure services
• Serverless pools for easier management of resources
Guidance
• Customer needs the best option for Spark on Azure
• Customer teams are comfortable with notebooks and Spark
• Customers need Auto-scaling and
• Customer needs to build integrated and performant data
pipelines
• Customer is comfortable with limited regional availability (3
in preview, 8 by GA)
Azure ML
What It Is
• Azure first party service for Machine Learning
• Leverage existing ML libraries or extend with Python and R
• Targets emerging data scientists with drag & drop offering
• Targets professional data scientists with
– Experimentation service
– Model management service
– Works with customer’s IDE of choice
Guidance
• Azure Machine Learning Studio is a GUI based ML tool for
emerging Data Scientists to experiment and operationalize
with least friction
• Azure Machine Learning Workbench is not a compute
engine & uses external engines for Compute, including SQL
Server and Spark
• AML deploys models to HDI Spark currently
• AML should be able to deploy to Azure Databricks in the near
future
LOOKING ACROSS THE OFFERINGS
Introduction to Azure Databricks
Azure Databricks – service home page
Azure Databricks – creating a workspace
Azure Databricks – workspace deployment
Azure Databricks – launching the workspace
Azure Databricks – workspace home page
Introduction to Azure Databricks
Engage Microsoft experts for a workshop to help identify
high impact scenarios
Sign up for preview at http://databricks.azurewebsites.net
Learn more about Azure Databricks www.azure.com/databricks
How to get started
Q & A ?
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)

Introduction to Azure Databricks

  • 1. Introduction to Azure Databricks James Serra Big Data Evangelist Microsoft [email protected]
  • 2. About Me  Microsoft, Big Data Evangelist  In IT for 30 years, worked on many BI and DW projects  Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer  Been perm employee, contractor, consultant, business owner  Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference  Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions  Blog at JamesSerra.com  Former SQL Server MVP  Author of book “Reporting with Microsoft SQL Server 2012”
  • 3. Agenda  Big Data Architectures  Why data lakes?  Top-down vs Bottom-up  Data lake defined  Hadoop as the data lake  Modern Data Warehouse  Federated Querying  Solution in the cloud  SMP vs MPP
  • 4. Security and performanceFlexibility of choiceReason over any data, anywhere Data warehouses Data Lakes Operational databases Hybrid Data warehouses Data Lakes Operational databases SocialLOB Graph IoTImageCRM T H E M O D E R N D A T A E S T A T E
  • 5. Security and performanceFlexibility of choiceReason over any data, anywhere Data warehouses Operational databases Hybrid Data warehouses Operational databases SQL Server Azure Data Services AI built-in | Most secure | Lowest TCO Industry leader 2 years in a row #1 TPC-H performance T-SQL query over any data 70% faster than Aurora 2x global reach than Redshift No Limits Analytics with 99.9% SLA Easiest lift and shift with no code changes SocialLOB Graph IoTImageCRM T H E M I C R O S O F T O F F E R I N G Data lakes Data lakes
  • 7. CONTROL EASE OF USE Azure Data Lake Analytics Azure Data Lake Store Azure Storage Any Hadoop technology, any distribution Workload optimized, managed clusters Data Engineering in a Job-as-a-service model Azure Marketplace HDP | CDH | MapR Azure Data Lake Analytics IaaS Clusters Managed Clusters Big Data as-a-service Azure HDInsight Frictionless & Optimized Spark clusters Azure Databricks BIGDATA STORAGE BIGDATA ANALYTICS ReducedAdministration K N O W I N G T H E V A R I O U S B I G D A T A S O L U T I O N S
  • 8. Model & ServePrep & Train Databricks HDInsight Data Lake Analytics Custom apps Sensors and devices Store Blobs Data Lake Ingest Data Factory (Data movement, pipelines & orchestration) Machine Learning Cosmos DB SQL Data Warehouse Analysis Services Event Hub IoT Hub SQL Database Analytical dashboards Predictive apps Operational reports Intelligence B I G D ATA & A D VA N C E D A N A LY T I C S AT A G L A N C E Business apps 10 01 SQLKafka
  • 11. What is Azure Databricks? A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure Best of Databricks Best of Microsoft Designed in collaboration with the founders of Apache Spark One-click set up; streamlined workflows Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage) Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs)
  • 12. A P A C H E S P A R K An unified, open source, parallel, data processing framework for Big Data Analytics Spark Core Engine Spark SQL Interactive Queries Spark Structured Streaming Stream processing Spark MLlib Machine Learning Yarn Mesos Standalone Scheduler Spark MLlib Machine Learning Spark Streaming Stream processing GraphX Graph Computation
  • 13. D A T A B R I C K S S P A R K I S F A S T Benchmarks have shown Databricks to often have better performance than alternatives SOURCE: Benchmarking Big Data SQL Platforms in the Cloud
  • 14. A D V A N T A G E S O F A U N I F I E D P L A T F O R M Spark Streaming Spark Machine Learning Spark SQL
  • 15. Get started quickly by launching your new Spark environment with one click. Share your insights in powerful ways through rich integration with Power BI. Improve collaboration amongst your analytics team through a unified workspace. Innovate faster with native integration with rest of Azure platform Simplify security and identity control with built-in integration with Active Directory. Regulate access with fine-grained user permissions to Azure Databricks’ notebooks, clusters, jobs and data. Build with confidence on the trusted cloud backed by unmatched support, compliance and SLAs. Operate at massive scale without limits globally. Accelerate data processing with the fastest Spark engine. ENHANCE PRODUCTIVITY BUILD ON THE MOST COMPLIANT CLOUD SCALE WITHOUT LIMITS Differentiated experience on Azure
  • 16. Azure Databricks (platform diagram). Inputs: cloud storage, data warehouses, Hadoop storage, IoT / streaming data. Collaborative Workspace: data engineer, data scientist, business analyst. Deploy Production Jobs & Workflows: multi-stage pipelines, job scheduler, notification & logs. Optimized Databricks Runtime Engine: Apache Spark, Databricks I/O, Serverless. REST APIs. Outputs: machine learning models, BI tools, data exports, data warehouses. Themes: enhance productivity, build on secure & trusted cloud, scale without limits.
  • 17. Collaborative Workspace (repeats the Azure Databricks platform diagram from slide 16)
  • 18. Deploy Production Jobs & Workflows (repeats the Azure Databricks platform diagram from slide 16)
  • 19. Optimized Databricks Runtime Engine (repeats the Azure Databricks platform diagram from slide 16)
  • 20. AZURE DATABRICKS CORE ARTIFACTS Azure Databricks
  • 21. GENERAL SPARK CLUSTER ARCHITECTURE: Data Sources (HDFS, SQL, NoSQL, …); Cluster Manager; Worker Nodes (three shown); Driver Program (SparkContext)
  • 22. AZURE DATABRICKS INTEGRATION WITH AAD: Azure Databricks is integrated with AAD, so Azure Databricks users are just regular AAD users. • There is no need to define users, and their access control, separately in Databricks. • AAD users can be used directly in Azure Databricks for all user-based access control (clusters, jobs, notebooks etc.). • Databricks has delegated user authentication to AAD, enabling single sign-on (SSO) and unified authentication. • Notebooks, and their outputs, are stored in the Databricks account. However, AAD-based access control ensures that only authorized users can access them. (Diagram labels: Access Control, Azure Databricks, Authentication)
  • 23. CLUSTERS: AUTOSCALING AND AUTO TERMINATION: Simplifies cluster management and reduces costs by eliminating wastage. When creating Azure Databricks clusters you can choose Autoscaling and Auto Termination options. Autoscaling: just specify the minimum and maximum number of nodes; Azure Databricks automatically scales the cluster up or down based on load. Auto Termination: after the specified number of minutes of inactivity, the cluster is automatically terminated. Benefits: • You do not have to guess, or determine by trial and error, the correct number of nodes for the cluster. • As the workload changes, you do not have to manually tweak the number of nodes. • You do not have to worry about wasting resources when the cluster is idle; you only pay for resources when they are actually being used. • You do not have to wait and watch for jobs to complete just so you can shut down the clusters.
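The two options on this slide can be pictured as fields in a cluster definition. The following is a minimal sketch, assuming the field names used by the Databricks Clusters REST API (`autoscale`, `min_workers`, `max_workers`, `autotermination_minutes`); the helper `build_cluster_spec`, the cluster name, and the placeholder runtime and VM size values are hypothetical and not taken from the deck.

```python
import json


def build_cluster_spec(name, min_workers, max_workers, idle_minutes):
    """Return a cluster definition with autoscaling and auto termination enabled.

    Hypothetical helper: it only builds the dictionary, it does not call any service.
    """
    if not 0 < min_workers <= max_workers:
        raise ValueError("need 0 < min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": "<runtime-version>",  # placeholder, pick a real runtime
        "node_type_id": "<vm-size>",           # placeholder, pick a real VM size
        # Autoscaling: the service chooses a worker count within this range.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Auto termination: shut the cluster down after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }


spec = build_cluster_spec("etl-nightly", min_workers=2, max_workers=8, idle_minutes=30)
print(json.dumps(spec, indent=2))
```

Keeping the range and the idle timeout in one definition is what removes the manual resizing and shutdown steps listed under Benefits.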
  • 24. JOBS: Jobs are the mechanism to submit Spark application code for execution on the Databricks clusters. • Spark application code is submitted as a ‘Job’ for execution on Azure Databricks clusters. • Jobs execute either ‘Notebooks’ or ‘Jars’. • Azure Databricks provides a comprehensive set of graphical tools to create, manage and monitor Jobs.
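Besides the graphical tools, a job can also be described as data. The sketch below assumes the field names of the Databricks Jobs REST API (`notebook_task`, `spark_jar_task`, `libraries`, `existing_cluster_id`) and shows the "either a Notebook or a Jar" rule from the slide; the helper `build_job_spec`, the cluster ID and all paths are hypothetical.

```python
def build_job_spec(name, cluster_id, notebook_path=None, jar_main_class=None, jar_uri=None):
    """Return a job definition that runs exactly one notebook or one JAR.

    Hypothetical helper: it only builds the dictionary, it does not submit anything.
    """
    if (notebook_path is None) == (jar_main_class is None):
        raise ValueError("specify exactly one of notebook_path or jar_main_class")
    job = {"name": name, "existing_cluster_id": cluster_id}
    if notebook_path is not None:
        job["notebook_task"] = {"notebook_path": notebook_path}
    else:
        if jar_uri is None:
            raise ValueError("a JAR job needs jar_uri")
        job["spark_jar_task"] = {"main_class_name": jar_main_class}
        job["libraries"] = [{"jar": jar_uri}]  # the JAR that contains the main class
    return job


notebook_job = build_job_spec("nightly-etl", "cluster-123", notebook_path="/Shared/etl")
jar_job = build_job_spec(
    "scoring", "cluster-123",
    jar_main_class="com.example.Score", jar_uri="dbfs:/jars/app.jar",
)
print(notebook_job)
print(jar_job)
```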
  • 25. WORKSPACES: Workspaces enable users to organize and share their Notebooks, Libraries and Dashboards. • Icons indicate the type of the object contained in a folder. • By default, the workspace and all its contents are available to all users.
  • 26. AZURE DATABRICKS NOTEBOOKS OVERVIEW: Notebooks are a popular way to develop and run Spark applications. • Notebooks are not only for authoring Spark applications; they can also be run directly on clusters (for example, with Shift+Enter). • Notebooks support fine-grained permissions, so they can be securely shared with colleagues for collaboration (see following slide for details on permissions and abilities). • Notebooks are well suited for prototyping, rapid development, exploration, discovery and iterative development. Notebooks typically consist of code, data, visualization, comments and notes.
  • 27. LIBRARIES OVERVIEW: Libraries enable external code to be imported into and stored in a Workspace.
  • 28. VISUALIZATION: Azure Databricks supports a number of visualization plots out of the box. • All notebooks, regardless of their language, support Databricks visualizations. • When you run the notebook, the visualizations are rendered in place inside the notebook. • The visualizations are written in HTML. You can save the HTML of the entire notebook by exporting to HTML. If you use Matplotlib, the plots are rendered as images, so you can right-click and download the image. • You can change the plot type by picking from the selection.
  • 29. DATABRICKS FILE SYSTEM (DBFS): DBFS is a distributed file system that is a layer over Azure Blob Storage. (Diagram: Python, Scala, the CLI and dbutils access DBFS, which sits on top of Azure Blob Storage.)
  • 30. SPARK SQL OVERVIEW: Spark SQL is a distributed SQL query engine for processing structured data.
  • 31. DATABASES AND TABLES OVERVIEW: Tables enable data to be structured and queried using Spark SQL or any of Spark’s language APIs.
  • 32. SPARK MACHINE LEARNING (ML) OVERVIEW: Enables parallel, distributed ML for large datasets on Spark clusters. • Offers a set of parallelized machine learning algorithms (MMLSpark, Spark ML, Deep Learning, SparkR). • Supports model selection (hyperparameter tuning) using Cross Validation and Train-Validation Split. • Supports Java, Scala or Python apps using the DataFrame-based API (as of Spark 2.0). Benefits include: a uniform API across ML algorithms and across multiple languages; ML pipelines (combining multiple algorithms into a single pipeline); optimizations through Tungsten and Catalyst. • Spark MLlib comes pre-installed on Azure Databricks. • 3rd-party libraries supported include H2O Sparkling Water, scikit-learn and XGBoost.
  • 33. SPARK STRUCTURED STREAMING OVERVIEW: A unified system for end-to-end fault-tolerant, exactly-once stateful stream processing. • Unifies streaming, interactive and batch queries: a single API for both static bounded data and streaming unbounded data. • Runs on Spark SQL. Uses the same Spark SQL Dataset/DataFrame API that is used for batch processing of static data. • Runs incrementally and continuously and updates the results as data streams in. • Supports app development in Scala, Java, Python and R. • Supports streaming aggregations, event-time windows, windowed grouped aggregation, stream-to-batch joins. • Features streaming deduplication, multiple output modes and APIs for managing/monitoring streaming queries. • Built-in sources: Kafka, File source (json, csv, text, parquet).
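To make "runs incrementally and updates the results as data streams in" concrete, here is a plain-Python illustration of a windowed grouped count over event time. It is not Spark code and does not use the Structured Streaming API; the class `WindowedCounter`, the 60-second window and the sample events are all assumptions made for the sketch.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window length


def window_start(event_time):
    """Map an event time (seconds) to the start of its tumbling window."""
    return event_time - (event_time % WINDOW_SECONDS)


class WindowedCounter:
    """Keeps running counts per (window, key) and updates them batch by batch."""

    def __init__(self):
        self.counts = defaultdict(int)

    def process_batch(self, events):
        """Consume (event_time, key) pairs; return only the groups that changed."""
        updated = {}
        for event_time, key in events:
            group = (window_start(event_time), key)
            self.counts[group] += 1
            updated[group] = self.counts[group]
        return updated


counter = WindowedCounter()
first = counter.process_batch([(5, "a"), (42, "a"), (61, "b")])
# The event at t=59 arrives in a later batch but still lands in window 0,
# because grouping is by event time rather than arrival time.
second = counter.process_batch([(59, "a"), (130, "b")])
print(first)
print(second)
```

Returning only the changed groups is similar in spirit to an "update" style output mode: earlier results are revised rather than recomputed from scratch.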
  • 34. APACHE KAFKA FOR HDINSIGHT INTEGRATION: Azure Databricks Structured Streaming integrates with Apache Kafka for HDInsight. • Apache Kafka for Azure HDInsight is an enterprise-grade streaming ingestion service running in Azure. • Azure Databricks Structured Streaming applications can use Apache Kafka for HDInsight as a data source or sink. • No additional software (gateways or connectors) is required. • Setup: Apache Kafka on HDInsight does not provide access to the Kafka brokers over the public internet, so the Kafka clusters and the Azure Databricks cluster must be located in the same Azure Virtual Network.
  • 35. SPARK GRAPHX OVERVIEW: A set of APIs for graph and graph-parallel computation. • Unifies ETL, exploratory analysis, and iterative graph computation within a single system. • Developers can view the same data as both graphs and collections, transform and join graphs with RDDs, and write custom iterative graph algorithms using the Pregel API. • Currently only supports using the Scala and RDD APIs. Algorithms: PageRank, connected components, label propagation, SVD++, strongly connected components, triangle count. (Figure: AMPLab PageRank Benchmark)
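PageRank, the first algorithm listed, is a good example of the iterative graph computation GraphX targets. The sketch below is a small single-machine version in plain Python that shows the algorithm only; it is not the GraphX API (which, per the slide, is Scala and RDD based). The function name `pagerank`, the damping factor 0.85, the iteration count and the sample edges are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Iterative PageRank over a list of (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        incoming = {node: 0.0 for node in nodes}
        for src in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = rank[src] / len(targets)
            for dst in targets:
                incoming[dst] += share
        rank = {
            node: (1 - damping) / len(nodes) + damping * incoming[node]
            for node in nodes
        }
    return rank


ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")])
for node, score in sorted(ranks.items()):
    print(node, round(score, 4))
```

Each pass only needs the previous ranks and the edge list, which is why the computation distributes well: every node's contribution can be sent along its edges in parallel.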
  • 36. DATABRICKS CLI: An easy-to-use interface built on top of the Databricks REST API. Currently, the CLI fully implements the DBFS API and the Workspace API.
  • 37. DATABRICKS REST API: • Cluster API: create/edit/delete clusters • DBFS API: interact with the Databricks File System • Groups API: manage groups of users • Instance Profile API: allows admins to add, list, and remove instance profiles that users can launch clusters with • Job API: create/edit/delete jobs • Library API: create/edit/delete libraries • Workspace API: list/import/export/delete notebooks/folders
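A call to any of these APIs is an authenticated HTTPS request. The following sketch builds, but deliberately does not send, a request to list clusters. It assumes the `/api/2.0/` path prefix and bearer-token authentication with a personal access token; the workspace URL, the token value and the helper `build_request` are hypothetical.

```python
import json
import urllib.request


def build_request(workspace_url, token, endpoint, payload=None):
    """Build (without sending) an authenticated REST request.

    A payload turns the call into a JSON POST; otherwise it is a GET.
    """
    url = workspace_url.rstrip("/") + "/api/2.0/" + endpoint.lstrip("/")
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(url, data=data, method="POST" if data else "GET")
    request.add_header("Authorization", "Bearer " + token)
    if data:
        request.add_header("Content-Type", "application/json")
    return request


# Hypothetical workspace URL and token, for illustration only.
req = build_request(
    "https://adb-example.azuredatabricks.net", "dapi-example-token", "clusters/list"
)
print(req.get_method(), req.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the call against a real workspace.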
  • 39. Modern Big Data Warehouse. Stages: Ingest, Store, Prep & Train, Model & Serve, Intelligence. Components: business / custom apps (structured); logs, files and media (unstructured); Data Factory; Azure Storage; Azure Databricks (Spark); PolyBase; Azure SQL Data Warehouse; analytical dashboards.
  • 40. Advanced Analytics on Big Data. Stages: Ingest, Store, Prep & Train, Model & Serve, Intelligence. Components: business / custom apps (structured); logs, files and media (unstructured); Data Factory; Azure Storage; Azure Databricks (Spark MLlib, SparkR, sparklyr); PolyBase; Azure SQL Data Warehouse; Azure Cosmos DB; web & mobile apps; analytical dashboards.
  • 41. Real-time analytics on Big Data. Stages: Ingest, Store, Prep & Train, Model & Serve, Intelligence. Components: unstructured data; Azure HDInsight (Kafka); Azure Storage; Azure Databricks (Spark); PolyBase; Azure SQL Data Warehouse; analytical dashboards.
  • 43. Big Data OSS - Comparison
      Azure HDInsight (1st party + Support)
        What it is: • Hadoop (Hortonworks’ distribution) as a managed service supporting a variety of open-source analytics engines such as Apache Spark, Hive LLAP, Storm, Kafka and HBase. • Security via Ranger (Kerberos based).
        Pricing: • Priced to compete with AWS EMR. Standard offering.
        Use when: • Customer prefers a PaaS-like experience and wants to work with different OSS analytics engines to address big data use cases. Cost sensitive.
      Azure Databricks (1st party + Support)
        What it is: • Databricks Spark, the most popular open-source analytics engine, as a managed service providing an easy and fast way to unlock big data use cases. Offers a best-in-class notebooks experience for productivity and collaboration, as well as integration with Azure SQL Data Warehouse, Power BI, etc. • Security via native Azure AD integration.
        Pricing: • Priced to match Databricks on AWS. Premium offering.
        Use when: • Customer prefers a SaaS-like experience to address big data use cases and values Databricks’ ease of use, productivity and collaboration features.
      3rd Party Offerings
        What it is: • Hadoop distributions from Cloudera, MapR and Hortonworks available on Azure Marketplace as IaaS VMs.
        Pricing: • N/A. Vendor prices their products.
        Use when: • Customer wants to move their on-premises Hadoop distribution to Azure IaaS using their existing licenses.
  • 44. LOOKING ACROSS THE OFFERINGS
      Azure HDInsight
        What it is: • Hortonworks distribution as a first-party service on Azure • Big data engines supported: Hadoop projects, Hive on Tez, Hive LLAP, Spark, HBase, Storm, Kafka, R Server • Best-in-class developer tooling and monitoring capabilities • Enterprise features: VNET support (join existing VNETs); Ranger support (Kerberos-based security); Log Analytics via OMS; orchestration via Azure Data Factory • Available in most Azure regions (27) including Gov Cloud and Federal Clouds
        Guidance: • Customer needs Hadoop technologies other than, or in addition to, Spark • Customer prefers the Hortonworks Spark distribution to stay closer to the OSS codebase and/or ‘lift and shift’ from on-premises deployments • Customer has specific project requirements that are only available on HDInsight
      Azure Databricks
        What it is: • Databricks’ Spark service as a first-party service on Azure • Single engine for batch, streaming, ML and graph • Best-in-class notebooks experience for optimal productivity and collaboration • Enterprise features: native integration with Azure for security via AAD (OAuth); optimized engine for better performance and scalability; RBAC for notebooks and APIs; auto-scaling and cluster termination capabilities; native integration with SQL DW and other Azure services; serverless pools for easier management of resources
        Guidance: • Customer needs the best option for Spark on Azure • Customer teams are comfortable with notebooks and Spark • Customers need Auto-scaling and • Customer needs to build integrated and performant data pipelines • Customer is comfortable with limited regional availability (3 in preview, 8 by GA)
      Azure ML
        What it is: • Azure first-party service for machine learning • Leverage existing ML libraries or extend with Python and R • Targets emerging data scientists with a drag & drop offering • Targets professional data scientists with an Experimentation service, a Model management service, and support for the customer’s IDE of choice
        Guidance: • Azure Machine Learning Studio is a GUI-based ML tool for emerging data scientists to experiment and operationalize with the least friction • Azure Machine Learning Workbench is not a compute engine and uses external engines for compute, including SQL Server and Spark • AML deploys models to HDI Spark currently • AML should be able to deploy to Azure Databricks in the near future
  • 46. Azure Databricks – service home page
  • 47. Azure Databricks – creating a workspace
  • 48. Azure Databricks – workspace deployment
  • 49. Azure Databricks – launching the workspace
  • 50. Azure Databricks – workspace home page
  • 52. How to get started • Engage Microsoft experts for a workshop to help identify high-impact scenarios • Sign up for preview at http://databricks.azurewebsites.net • Learn more about Azure Databricks: www.azure.com/databricks
  • 53. Q & A? James Serra, Big Data Evangelist. Email me at: JamesSerra3@gmail.com. Follow me at: @JamesSerra. Link to me at: www.linkedin.com/in/JamesSerra. Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)

Editor's Notes

  • #2: Azure Analysis Services, Databricks, Cosmos DB, Azure Time Series, ADF v2
  • #3: Fluff, but point is I bring real work experience to the session
  • #5: All kinds of data being generated. Stored on-premises and in the cloud, but the vast majority in hybrid. Reason over all this data without needing to move it. They want a choice of platform and languages, privacy and security. <Transition> Microsoft’s offering
  • #6: The most complete and compelling offering. SQL Server, Azure Database Services (+ Open Source), Hybrid. AI built in, most secure, lowest TCO – 1/10th cost of Oracle. <Transition> DEMO
  • #9: Azure build-out of DMSA. Supports hybrid\on-premises too through ADF \ Blob \ Stretch.
  • #11: When it comes to ease of use, Spark is a lot better than Hadoop. Spark has APIs for several languages such as Scala, Java and Python, as well as Spark SQL. It is relatively simple to write user-defined functions, and Spark has an interactive mode for running commands. Hadoop, on the other hand, is written in Java and has earned the reputation of being pretty difficult to program, although it does have tools that assist in the process. (To learn more about Spark, see How Apache Spark Helps Rapid Application Development.)
      In-Memory Technology: One of the distinctive aspects of Apache Spark is its "in-memory" technology, which makes it an extremely good data processing system. Spark loads the data into memory and writes it to disk later, so a user can keep part of the processed data in memory and leave the remainder on disk. Spark also has an innate ability to load necessary information to its core with the help of its machine learning algorithms. This allows it to be extremely fast.
      Spark’s Core: Spark’s core manages several important functions like scheduling tasks and interactions as well as input/output operations. Its central abstraction is the RDD, or resilient distributed dataset: a collection of data that is spread across several machines connected via a network. The data is transformed through operations such as mapping, sorting, reducing and joining. The RDD is then exposed through an API, which is available in three languages: Scala, Java and Python.
      Spark’s SQL: Spark SQL has a relatively new data management solution called SchemaRDD (renamed DataFrame in Spark 1.3). It allows data to be arranged into many levels and queried via a specific language.
      GraphX Service: Apache Spark can process graphs and information that is graphical in nature, enabling easy and precise analysis.
      Streaming: This is a prime part of Spark that allows it to stream large chunks of data with help from the core. It does so by breaking the large data into smaller batches and then transforming them, thereby accelerating the creation of the RDD.
      MLlib – Machine Learning Library: Apache Spark has MLlib, a framework meant for structured machine learning. It is also predominantly faster in implementation than Hadoop. MLlib is capable of solving several problems, such as statistical reading, data sampling and hypothesis testing, to name a few.
  • #12: Azure Databricks features. Enhance your teams’ productivity: Get started quickly by launching your new Spark environment with one click. Share your insights in powerful ways through rich integration with Power BI. Improve collaboration amongst your analytics team through a unified workspace. Innovate faster with native integration with the rest of the Azure platform. Build on the most compliant and trusted cloud: Simplify security and identity control with built-in integration with Active Directory. Regulate access with fine-grained user permissions to Azure Databricks’ notebooks, clusters, jobs and data. Build with confidence on the trusted cloud backed by unmatched support, compliance and SLAs. Scale without limits: Operate at massive scale without limits globally. Accelerate data processing with the fastest Spark engine.
  • #16: Databricks, founded by the team that created Apache Spark, is a unified analytics platform that accelerates innovation by unifying data science, engineering and business. 75% of the code committed to Apache Spark comes from Databricks. Unified Runtime: create clusters in seconds and dynamically scale them up and down. They’ve made enhancements to the Spark engine to make it 10x faster than open source Spark. Serverless: auto-configured multi-user cluster; reliable sharing with fault isolation. Unified Collaboration: overall, a simple and collaborative environment that enables your entire team to use Spark and interact with your data simultaneously. DE: improve ETL performance, zero-management clusters, execute production code from within notebooks. DS: easy data exploration in notebooks. Business SME: interactive dashboards empower teams to create dynamic reports. Enterprise Security: encryption; fine-grained role-based access control (files, clusters, code, application, dashboard); compliance. REST APIs. DE: DBIO, Spark, APIs, Jobs. DS: Spark and Serverless, interactive data science. Data Products: everything. Creators of Spark: training, people, number of customers. Ingest workflow: schedule / run / monitor, execute, troubleshoot, debug. Production Jobs: ingest, ETL, scheduling, monitoring.
  • #29: Additionally, all Azure Databricks notebook languages (Python, Scala, R) support interactive HTML graphics using JavaScript libraries like D3. To use this, pass any HTML, CSS, or JavaScript code to the displayHTML() function to render its results. You can display Matplotlib and ggplot objects in Python notebooks. You can use Plotly, an interactive graphing library. Azure Databricks supports htmlwidgets: with R htmlwidgets you can generate interactive plots using R’s flexible syntax and environment.
  • #31: Diagram from Databricks
  • #32: Managed and Unmanaged Tables. Every Spark SQL table has metadata, which stores the schema, and the data itself. Managed tables are Spark SQL tables where Spark manages both the data and the metadata; since Spark SQL manages the table, DROP TABLE example_data deletes both the metadata and the data automatically. Unmanaged tables: here Spark SQL manages only the metadata and you control the data’s location, so DROP TABLE example_data removes only the metadata and not the data itself. The data will still be present in the path you provided. Note that you can also create an unmanaged table with your data in other data sources such as Cassandra or a JDBC table.
  • #34: Addresses many of the pain points with DStreams. Enables the development of “Continuous Applications” that need to interact with batch data, interactive analysis, ML etc.
  • #36: This article claims that Facebook found in their benchmarks that GraphX was not as performant as Giraph. See https://code.facebook.com/posts/319004238457019/a-comparison-of-state-of-the-art-graph-processing-systems/
  • #37: Setting Up Authentication. There are two ways to authenticate to Databricks. The first is to use your username and password: run databricks configure and follow the prompts. The second, recommended way is to use an access token generated from Databricks: run databricks configure --token. After you follow the prompts, your access credentials are stored in the file ~/.databrickscfg.