The Developer Data Scientist – Creating New Analytics Driven Applications using Apache Spark with Azure Databricks

Big Data & Advanced Analytics at a Glance
[Architecture diagram]
Sources: Business apps; Custom apps; Sensors and devices
Ingest: Data Factory (data movement, pipelines & orchestration); Event Hub; IoT Hub; Kafka
Store: Blobs; Data Lake
Prep & Train: Databricks; HDInsight; Data Lake Analytics; Machine Learning
Model & Serve: Cosmos DB; SQL Data Warehouse; Analysis Services; SQL Database
Intelligence: Analytical dashboards; Predictive apps; Operational reports

Azure Databricks
A fast, easy, and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks, best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click setup; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)

RAPID EXPERIMENTATION
DATA VISUALIZATION
CROSS-TEAM COLLABORATION
EASY SHARING OF INSIGHTS

• Infrastructure management
• Data exploration and visualization at scale
• Time to value: from model iterations to intelligence
• Integrating various ML tools to stitch a solution together
• Operationalizing ML models to integrate them into applications

Azure Databricks
[Platform diagram]
Inputs: Cloud storage; Data warehouses; Hadoop storage; IoT / streaming data
Collaborative Workspace (Enhance Productivity): Data Engineer; Data Scientist; Business Analyst
Deploy Production Jobs & Workflows: Multi-stage pipelines; Job scheduler; Notification & logs
Optimized Databricks Runtime Engine: Apache Spark; Databricks I/O; Serverless
Outputs: REST APIs; Machine learning models; BI tools; Data exports; Data warehouses
Build on secure & trusted cloud. Scale without limits.

• Easy to create and manage compute clusters that auto-scale
• Rapid development using the integrated workspace that facilitates cross-team collaboration
• Interactive exploration with notebooks and dashboards
• Seamless integration with ML ecosystem libraries and tools
• Deep Learning support with GPUs (coming soon in next release)

Spark components: Spark SQL, Streaming, MLlib, GraphX

[ML workflow diagram]
Data sources: Datasource 1, Datasource 2, Datasource 2
Extract features
Feature transforms: Feature transform 1, Feature transform 2, Feature transform 3
Training: Train model 1, Train model 2
Ensemble
Evaluate

Simple construction, tuning, and testing for ML workflows

from pyspark.ml import Pipeline

# Without a Pipeline: every stage is chained by hand.
features = tf2.transform(tf1.transform(data))
model = est2.fit(est1.fit(features).transform(features))

# With a Pipeline: the same stages, declared once.
model = Pipeline(stages=[tf1, tf2, est1, est2]).fit(data)
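
To make the Pipeline one-liner above less magical, here is a minimal plain-Python sketch of what a pipeline's fit step does: transformers are applied as they are, estimators are fitted on the data seen so far and replaced by the model they produce. This is not the Spark ML implementation; the classes Scale, MeanCenter, Shift and the toy Pipeline/PipelineModel are hypothetical stand-ins that only mimic the fit/transform contract.

```python
class Scale:
    """Transformer (hypothetical): multiplies every value by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, data):
        return [x * self.factor for x in data]


class Shift:
    """Fitted model (hypothetical) returned by MeanCenter.fit()."""
    def __init__(self, offset):
        self.offset = offset

    def transform(self, data):
        return [x + self.offset for x in data]


class MeanCenter:
    """Estimator (hypothetical): learns the mean, returns a Shift model."""
    def fit(self, data):
        return Shift(-sum(data) / len(data))


class PipelineModel:
    """Holds only fitted stages; transform() runs them in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


class Pipeline:
    """Toy pipeline: not Spark's class, just the same idea in miniature."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                # Estimator: fit on the data produced by the earlier stages.
                stage = stage.fit(data)
            fitted.append(stage)
            # Feed the transformed data to the next stage.
            data = stage.transform(data)
        return PipelineModel(fitted)


training = [1.0, 2.0, 3.0]
model = Pipeline(stages=[Scale(2.0), MeanCenter()]).fit(training)
result = model.transform(training)
```

The fitted model can then be applied to new data with model.transform(...), which is the property that lets the whole workflow be tuned and deployed as a single object.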

Cross Validation
Feature Extraction → Model Training
regularization parameter: {0.0, 0.1, ...}

Cross Validation
Feature Extraction → Model #1 Training, Model #2 Training, Model #3 Training, ... → Best Model
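
The cross-validation diagrams above can be sketched in plain Python: train one model per candidate regularization value, score each with k-fold cross validation, and keep the best. This is only an illustration of the idea, not the MLlib tuning API; the one-feature ridge model, the helper names (fit_ridge, mse, cross_validate), the grid values, and the toy data are all assumptions made for the example.

```python
def fit_ridge(xs, ys, reg):
    """One-feature ridge regression without intercept (closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg)


def mse(weight, xs, ys):
    """Mean squared error of the prediction weight * x."""
    return sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, reg_grid, num_folds=3):
    """Return (best_reg, model refit on all data, mean error per candidate)."""
    scores = {}
    for reg in reg_grid:
        fold_errors = []
        for k in range(num_folds):
            # Every num_folds-th row is held out; the rest is used for training.
            train = [i for i in range(len(xs)) if i % num_folds != k]
            held_out = [i for i in range(len(xs)) if i % num_folds == k]
            weight = fit_ridge([xs[i] for i in train], [ys[i] for i in train], reg)
            fold_errors.append(
                mse(weight, [xs[i] for i in held_out], [ys[i] for i in held_out])
            )
        scores[reg] = sum(fold_errors) / num_folds
    best_reg = min(scores, key=scores.get)
    # Refit the winning configuration on the full data set.
    return best_reg, fit_ridge(xs, ys, best_reg), scores


# Toy data following y = 2x exactly, so no regularization should win.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]
best_reg, best_weight, scores = cross_validate(xs, ys, reg_grid=[0.0, 0.1, 1.0])
```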

Advanced Analytics: Pipeline

Data Science: Prototype (Python/R) → Create model
Software Engineering: Re-implement model for production (Java) → Deploy model

Create Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make prediction
Re-implement Pipeline for production (Java)
• Extra implementation work
• Different code paths
• Synchronization overhead
Deploy Pipeline

Create Pipeline
Persist model or Pipeline:
model.save("path://...")
Load Pipeline (Scala/Java)
Model.load("path://…")
Deploy in production

Output
{
  "id": 5923937,
  "prediction": 1.0
}
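
The persist-then-load flow above can be illustrated with a plain-Python stand-in. This is a sketch of the pattern only, not MLlib's persistence format: LinearModel, its JSON file layout, the weight and bias values, and the temporary path are hypothetical. The id in the output record is copied from the slide.

```python
import json
import os
import tempfile


class LinearModel:
    """Hypothetical model whose parameters are persisted as JSON."""
    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def predict(self, x):
        return self.weight * x + self.bias

    def save(self, path):
        # Persist only the learned parameters, not the training code.
        with open(path, "w") as handle:
            json.dump({"weight": self.weight, "bias": self.bias}, handle)

    @classmethod
    def load(cls, path):
        with open(path) as handle:
            params = json.load(handle)
        return cls(params["weight"], params["bias"])


# "Data science" side: create the model and persist it.
path = os.path.join(tempfile.mkdtemp(), "model.json")
LinearModel(weight=0.5, bias=1.0).save(path)

# "Production" side: load the same artifact and serve a prediction.
restored = LinearModel.load(path)
output = {"id": 5923937, "prediction": restored.predict(0.0)}
```

Because both sides read the same saved artifact, there is no second implementation to keep in sync, which is the point the slide makes about avoiding a Java re-implementation.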

Classification
• Logistic regression w/ elastic net
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multilayer perceptron
• One-vs-rest
Regression
• Least squares w/ elastic net
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Recommendation
• Alternating Least Squares
Frequent itemsets
• FP-growth
• Prefix span
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
• Block-partitioned matrix
• Row matrix
• Indexed row matrix
• Coordinate matrix
• Matrix decompositions
Model import/export
Pipelines
Feature extraction & selection
• Binarizer
• Bucketizer
• Chi-Squared selection
• CountVectorizer
• Discrete cosine transform
• ElementwiseProduct
• Hashing term frequency
• Inverse document frequency
• MinMaxScaler
• Ngram
• Normalizer
• One-Hot Encoder
• PCA
• PolynomialExpansion
• RFormula
• SQLTransformer
• Standard scaler
• StopWordsRemover
• StringIndexer
• Tokenizer
• VectorAssembler
• VectorIndexer
• VectorSlicer
• Word2Vec
And more…

• Classification
• Regression
• Recommendation
• Clustering
• Frequent itemsets
• Model import/export
• Pipelines
• DataFrames
• Cross validation
• Feature extraction & selection
• Statistics
• Linear algebra

• Use Azure Databricks for scaling out ML tasks
• Leverage well-known model architectures
• MLlib Pipeline API simplifies ML workflows
• Leverage pre-trained models for common tasks

[Code slide: DeepImageFeaturizer.transform; timings shown: 10 minutes vs. 6 hours]

[Graph diagram: airports JFK, IAD, LAX, SFO, SEA, DFW as vertices (nodes), flights as edges]
edge:
src  dest  delay  tripid
SFO  SEA   45     1058923
vertex:
id   city     state
SEA  Seattle  WA

[Graph diagram: airports JFK, IAD, LAX, SFO, SEA, DFW]
edges DataFrame:
src  dest  delay  tripid
SFO  SEA   45     1058923
LAX  JFK   52     4100224
vertices DataFrame:
id   city           state
SEA  Seattle        WA
SFO  San Francisco  CA
JFK  New York       NY

[Graph diagram: airports JFK, IAD, LAX, SFO, SEA, DFW, with three vertices labeled (a), (b), (c)]
Search for structural patterns within a graph.
val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
Then filter using vertex & edge data.
paths.filter("e1.delay > 20")
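
To show what the motif above selects, here is a plain-Python sketch that applies the same pattern to a small edge list: two consecutive edges a→b→c with no edge from c back to a, followed by the delay filter. It does not use GraphFrames; find_two_hop_paths is a hypothetical helper, and only the first two edges come from the slides (the other three are invented so that the pattern both matches and excludes something).

```python
edges = [
    {"src": "SFO", "dest": "SEA", "delay": 45},  # from the slides
    {"src": "LAX", "dest": "JFK", "delay": 52},  # from the slides
    {"src": "SEA", "dest": "LAX", "delay": 10},  # hypothetical
    {"src": "LAX", "dest": "SFO", "delay": 25},  # hypothetical
    {"src": "JFK", "dest": "DFW", "delay": 30},  # hypothetical
]


def find_two_hop_paths(edges):
    """Match (a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a) on a list of edges."""
    linked = {(e["src"], e["dest"]) for e in edges}
    paths = []
    for e1 in edges:
        for e2 in edges:
            if e1["dest"] != e2["src"]:
                continue  # e2 must start where e1 ends
            a, b, c = e1["src"], e1["dest"], e2["dest"]
            if (c, a) in linked:
                continue  # negated term: no edge from c back to a
            paths.append({"a": a, "b": b, "c": c, "e1": e1, "e2": e2})
    return paths


paths = find_two_hop_paths(edges)
# Equivalent of the filter on e1.delay > 20.
delayed = [p for p in paths if p["e1"]["delay"] > 20]
```

With this data, the triangle SFO → SEA → LAX → SFO is rejected by the negated term, and the filter keeps only paths whose first leg was delayed by more than 20.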

Save & load the DataFrames.
vertices = sqlContext.read.parquet(...)
edges = sqlContext.read.parquet(...)
g = GraphFrame(vertices, edges)
g.vertices.write.parquet(...)
g.edges.write.parquet(...)
