@s_kontopoulos
Machine Learning at Scale: Challenges
and Solutions
Stavros Kontopoulos
Senior Software Engineer @ Lightbend, M.Sc.
Who am I?
skonto
s_kontopoulos
Senior Software Engineer @ Lightbend, Fast Data Team
Contributor at Apache Flink
SlideShare stavroskontopoulos
stavroskontopoulos
All trademarks and registered trademarks are property of their respective holders.
Agenda
- ML in the Enterprise
- ML from development to production
- Key technologies: Apache Spark as a case study
ML in the Enterprise
ML is a key tool that fuels the effort of coupling business monitoring (BI) with
predictive and prescriptive analytics.
business insights -> business optimization -> data monetization
ML in the Enterprise - The Data-Science LifeCycle
Identify Business Question
Identify and collect related Data
Data cleansing, feature extraction (Data pre-processing)
Experiment planning
Model Building
Model Evaluation
Model Deployment/Management in Production
Model Optimization - Performance
Machine Learning Model
A model is a function that maps inputs to outputs and essentially expresses a
mathematical abstraction.
Linear Regression:
Neural Network:
Random Forest:
Function composition
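The formulas on this slide were not preserved in the transcript. As a minimal sketch of the idea that a model is a function, and that larger models are compositions of simpler functions, the following Python uses hypothetical weights and decision stumps chosen only for illustration; none of the names come from the talk.

```python
import math


def linear_regression(weights, bias):
    """Return a model f(x) = w . x + b as a plain function."""
    def predict(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return predict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def compose(*functions):
    """Compose left to right: compose(f, g)(x) == g(f(x))."""
    def composed(x):
        for function in functions:
            x = function(x)
        return x
    return composed


def forest(trees):
    """Average the outputs of several trees (each tree is itself a function)."""
    def predict(x):
        return sum(tree(x) for tree in trees) / len(trees)
    return predict


# Hypothetical parameters, chosen only for illustration.
linear = linear_regression(weights=[2.0, -1.0], bias=0.5)
neuron = compose(linear, sigmoid)  # a linear map followed by a non-linearity
stumps = [lambda x: 1.0 if x[0] > 0 else 0.0,
          lambda x: 1.0 if x[1] > 0 else 0.0]
tiny_forest = forest(stumps)

print(linear([1.0, 2.0]))         # 2*1 - 1*2 + 0.5 = 0.5
print(tiny_forest([1.0, -1.0]))   # one stump outputs 1, the other 0 -> 0.5
```

A neural network layer and a forest both reduce to the same pattern here: simple functions combined into a new function with the same input and output contract.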
Model Evolution
- Models can be either pre-computed (e.g. trained offline) or updated online.
- Online ML with streaming:
- Pure online learning uses only the latest data point to update the model. In practice, models
are usually updated per batch or window (e.g. online k-means).
- An interesting case is sampling the stream and training a model only when the distribution
changes.
- Adaptive supervised learning: SGD (Stochastic Gradient Descent) + random sampling.
- Re-train the model from scratch, ignoring the previous one.
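To make the "pure online" case concrete, here is a minimal sketch of a single-point SGD update for a linear model with squared error. The stream, the learning rate and the function name `sgd_update` are assumptions made for illustration, not part of the talk.

```python
def sgd_update(weights, bias, x, y, learning_rate=0.1):
    """One stochastic gradient step for squared error on a single data point."""
    prediction = sum(w * xi for w, xi in zip(weights, x)) + bias
    error = prediction - y
    new_weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


# Hypothetical stream: noiseless points from y = 2x + 1, replayed 200 times.
stream = [([i / 10.0], 2.0 * (i / 10.0) + 1.0) for i in range(10)] * 200

weights, bias = [0.0], 0.0
for x, y in stream:
    # The model is updated as each point arrives; no history is kept.
    weights, bias = sgd_update(weights, bias, x, y)

print(weights, bias)  # close to [2.0] and 1.0
```

A windowed variant would accumulate the gradient over a batch of points before applying one update, which is closer to how streaming libraries usually do it.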
Machine Learning Pipeline
A machine learning pipeline in production describes all the steps from data
pre-processing, before the data is fed to the model, to processing of the model
output (post-processing).
Machine Learning Pipeline in Libraries
Pros:
- Training data and test data go through the same steps
- Like a CI (continuous integration) pipeline, people can reason about data
transformations
- Caching of computations
- Easier model serving
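The first benefit, that training and test data go through the same steps, can be sketched with a tiny fit/transform pipeline. The class names below mirror a common library convention but are hypothetical and written from scratch; this is not the API of Spark ML or any other library.

```python
class StandardScaler:
    """Pre-processing stage: learns per-column mean and standard deviation."""

    def fit(self, rows):
        n, dims = len(rows), len(rows[0])
        self.means = [sum(r[d] for r in rows) / n for d in range(dims)]
        self.stds = [
            (sum((r[d] - self.means[d]) ** 2 for r in rows) / n) ** 0.5 or 1.0
            for d in range(dims)
        ]
        return self

    def transform(self, rows):
        return [[(r[d] - self.means[d]) / self.stds[d] for d in range(len(r))]
                for r in rows]


class ThresholdModel:
    """Toy model stage: predicts 1 when the first scaled feature is positive."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [1 if r[0] > 0 else 0 for r in rows]


class Pipeline:
    """Runs stages in order; fit learns each stage on the output of the previous."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


pipeline = Pipeline([StandardScaler(), ThresholdModel()]).fit([[1.0], [2.0], [3.0]])
predictions = pipeline.transform([[2.5], [0.0]])  # scaled with the training statistics
print(predictions)  # [1, 0]
```

Because the scaler's statistics are stored inside the fitted pipeline, serving only needs the one fitted object, which is what makes model serving easier.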
Multiple Models in a Pipeline
Within the same pipeline it is also possible to run multiple models:
a) Model Segmentation
b) Model Ensemble
c) Model Chaining
d) Model Composition
http://dmg.org/pmml/v4-1/MultipleModels.html
http://dl.acm.org/citation.cfm?id=1859403
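Three of the patterns listed above (segmentation, ensemble, chaining) can be sketched as higher-order functions over models. The toy churn rules and field names below are hypothetical and exist only to make the sketch runnable.

```python
from collections import Counter


def segmented(selector, models):
    """Model segmentation: pick one model per record based on a segment key."""
    def predict(record):
        return models[selector(record)](record)
    return predict


def ensemble(models):
    """Model ensemble: majority vote over the predictions of all models."""
    def predict(record):
        votes = Counter(model(record) for model in models)
        return votes.most_common(1)[0][0]
    return predict


def chained(first, second):
    """Model chaining: the output of the first model is the input of the second."""
    def predict(record):
        return second(first(record))
    return predict


# Hypothetical rule-based "models", for illustration only.
def heavy_user(r):
    return "churn" if r["minutes"] > 500 else "stay"

def late_payer(r):
    return "churn" if r["late_payments"] > 2 else "stay"

def new_customer(r):
    return "churn" if r["months"] < 3 else "stay"


vote = ensemble([heavy_user, late_payer, new_customer])
by_region = segmented(lambda r: r["region"], {"eu": late_payer, "us": new_customer})
risk_band = chained(lambda r: r["late_payments"] * 10,
                    lambda score: "high" if score >= 20 else "low")

record = {"minutes": 600, "late_payments": 0, "months": 1, "region": "eu"}
print(vote(record), by_region(record), risk_band(record))  # churn stay low
```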
Model Development & Production
[Figure: Data Scientist, Data Engineer]
Model Standardization
[Figure: ML Framework -> Export -> Model Definition -> Import -> Evaluation; Data in, Predictions out]
PFA - Portable Format for Analytics
Model Standardization
- PFA or PMML won’t break the pipeline. PFA is more flexible than PMML.
“Unlike PMML, PFA has control structures to direct program flow, a true type system for both
model parameters and data, and its statistical functions are much more finely grained and can
accept callbacks to modify their behavior” (http://dmg.org/pfa/docs/motivation/)
- Custom model definitions and implementations are more flexible or more
optimized but could break the pipeline.
- Some implementations:
- https://github.com/jpmml/jpmml-evaluator-spark
- https://github.com/jpmml
- https://github.com/opendatagroup/hadrian
Model Lifecycle
Some concerns about model lifecycle:
- Model evolution
- Model release practices
- Model versioning
- Model update process
Model Governance
● Governed by the company’s policies and procedures, laws and regulations,
and the organization’s goals
● Searchable across the company
● Transparent, explainable, traceable and interpretable for auditors and
regulators. Example, GDPR requirements:
https://iapp.org/news/a/is-there-a-right-to-explanation-for-machine-learning-in-the-gdpr/
● Subject to an approval and release process
Model Server
“A model server is a system which handles the lifecycle of a model and provides
the required APIs for deploying a model/pipeline.”
Clipper (image: https://rise.cs.berkeley.edu/blog/low-latency-model-serving-clipper/)
TensorFlow Serving (image: https://www.tensorflow.org/serving/)
Model Serving - Requirements
Other requirements:
- Response time: the time to calculate a prediction. It can be a few milliseconds.
- Throughput: predictions per second.
- Support for running multiple models. It is very common to run hundreds of models,
e.g. a telecom operator with one model per customer, or IoT with one
model per site/sensor.
Model Serving - Requirements
- Multiple versions of the same machine learning pipeline within the system.
One reason can be A/B testing.
- Model update: how quickly and easily can a model be updated?
- Uptime/reliability
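A minimal sketch of how multiple versions, A/B testing and in-place updates can fit together is shown below. The `ModelRegistry` class, its method names and the integer traffic weights are hypothetical; real model servers expose their own APIs.

```python
import threading
import zlib


class ModelRegistry:
    """Keeps several versions of a model and splits traffic between them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = {}  # version name -> callable model
        self._weights = {}   # version name -> positive integer traffic weight

    def deploy(self, version, model, weight):
        """Add or replace a version while the registry keeps serving."""
        with self._lock:
            self._versions[version] = model
            self._weights[version] = weight

    def route(self, request_id):
        """Map a request id to a version, deterministically, by traffic weight."""
        with self._lock:
            total = sum(self._weights.values())
            bucket = zlib.crc32(request_id.encode("utf-8")) % total
            for version in sorted(self._weights):
                bucket -= self._weights[version]
                if bucket < 0:
                    return version, self._versions[version]

    def predict(self, request_id, features):
        version, model = self.route(request_id)
        return version, model(features)


registry = ModelRegistry()
registry.deploy("v1", lambda x: x * 2, weight=9)  # roughly 90% of request ids
registry.deploy("v2", lambda x: x * 3, weight=1)  # roughly 10% of request ids

first = registry.predict("user-42", 3)
second = registry.predict("user-42", 3)  # same id, same version: a stable A/B split
```

Hashing the request id rather than drawing a random number keeps each user on one version, which makes A/B results easier to interpret.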
TensorFlow Serving Issues
Not all systems cover the requirements. For example:
● Metadata not available. (https://github.com/tensorflow/serving/issues/612)
● No new models at runtime: (https://github.com/tensorflow/serving/issues/422)
● Can be hard to build from scratch:
https://github.com/tensorflow/serving/issues/327
Model Serving with Apache Flink
Apache Flink: a streaming engine based on the Beam model, with low latency
compared to Spark.
Model Serving with Apache Flink
Idea: Exploit Flink’s low latency capabilities for serving models. The focus is on
offline-trained models that are loaded from permanent storage and updated without interruption.
FLIP Proposal:
(https://docs.google.com/document/d/1ON_t9S28_2LJ91Fks2yFw0RYyeZvIvndu8oGRPsPuk8)
Combines different efforts: https://github.com/FlinkML
● https://github.com/FlinkML/flink-jpmml (https://radicalbit.io/)
● https://github.com/FlinkML/flink-modelServer (Boris Lublinsky)
● https://github.com/FlinkML/flink-tensorflow (Eron Wright)
Model Serving with Apache Flink
Use a control stream and a data stream. Keep the model in the operator’s state. Join the streams.
Flink provides two ways of implementing low-level joins: a key-based join based on CoProcessFunction and a
partition-based join based on RichCoFlatMapFunction.
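The control-stream pattern can be sketched without Flink. The class below is a pure-Python stand-in for a two-input operator, not Flink code: the names `ModelServingOperator`, `on_control` and `on_data` are hypothetical, and a dictionary plays the role of keyed operator state.

```python
class ModelServingOperator:
    """Pure-Python stand-in for a two-input streaming operator (not Flink code)."""

    def __init__(self):
        self.model_by_key = {}  # plays the role of keyed operator state

    def on_control(self, key, model):
        """Control stream element: install or replace the model for a key."""
        self.model_by_key[key] = model

    def on_data(self, key, record):
        """Data stream element: score with the current model for the key, if any."""
        model = self.model_by_key.get(key)
        if model is None:
            return None  # no model yet; a real job might buffer or drop the record
        return model(record)


def run(operator, events):
    """Feed interleaved (stream, key, payload) events and collect the scores."""
    results = []
    for stream, key, payload in events:
        if stream == "control":
            operator.on_control(key, payload)
        else:
            results.append(operator.on_data(key, payload))
    return results


events = [
    ("data", "wine", 3.0),                    # arrives before any model
    ("control", "wine", lambda x: x * 2),     # first model version
    ("data", "wine", 3.0),
    ("control", "wine", lambda x: x + 100),   # update without stopping the job
    ("data", "wine", 3.0),
]
results = run(ModelServingOperator(), events)
print(results)  # [None, 6.0, 103.0]
```

The point of the design is that a model update is just another event: the data stream is never paused while the model in state is swapped.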
Model Serving with Apache Flink
More here:
https://info.lightbend.com/ebook-serving-machine-learning-models-register.html
Data Lakes
How can we work with data in a way that covers future needs and use cases? We need a
robust ML framework plus flexible infrastructure. Data warehouses will not work.
Data lakes to the rescue.
“A data lake is a method of storing data within a system or repository, in its natural
format, that facilitates the collocation of data in various schemata and structural
forms, usually object blobs or files.”
- Wikipedia
Data Lakes
● Agility. It can be seen as a tool that makes data accessible to different users
and facilitates ML.
● Designed for low-cost storage
● Schema on read
● Security and governance still maturing.
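"Schema on read" means the raw records are stored untouched and each consumer applies its own schema when reading. A minimal sketch, assuming JSON lines as the raw format and hypothetical sensor fields:

```python
import json

# Raw records are stored as they arrived; the fields are not uniform.
RAW_EVENTS = [
    '{"sensor": "a1", "temp": "21.5", "site": "athens"}',
    '{"sensor": "b7", "temp": 19, "humidity": 0.4}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> converter) while reading raw records."""
    for line in lines:
        raw = json.loads(line)
        yield {
            field: convert(raw[field]) if field in raw else None
            for field, convert in schema.items()
        }


# Two consumers read the same raw data with different schemas.
temperature_view = list(read_with_schema(RAW_EVENTS, {"sensor": str, "temp": float}))
site_view = list(read_with_schema(RAW_EVENTS, {"sensor": str, "site": str}))

print(temperature_view)
print(site_view)
```

A warehouse would have forced one schema at write time; here a new use case only needs a new schema dictionary, not a reload of the data.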
Data Lake Issues
“Through 2018, 80% of data lakes will not include effective metadata management
capabilities, making them inefficient.”
- Gartner
Several vendors try to deliver end-to-end solutions: Databricks Delta platform, IBM
Watson Platform etc.
Notebooks
Very convenient for the data scientist or the analyst.
Production is usually based on traditional deployment methods.
- Spark Notebook
- Apache Zeppelin
- Jupyter
ML with Apache Spark
“A popular big data framework for ML and data-science.”
- You can work locally and move to production fast
- ETL/Feature Engineering
- Hyper-parameter tuning
- Rich Model support
- Multiple language support (Scala, Java, Python, R)
Apache Spark - Intro
A framework for distributed in-memory data processing.
Apache Spark - Intro
- The user defines computations/operations (map, flatMap etc.) on the data sets
(bounded or not) as a DAG.
- The DAG is shipped to the nodes where the data lie, the computation is executed and
the results are sent back to the user.
- The data sets are treated as immutable distributed data (RDDs).
- A Resilient Distributed Dataset (RDD) is an immutable distributed
collection of objects.
Apache Spark - Basic Example in Scala
Basic statistics: a hello world for ML.
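The Scala code shown on this slide was not preserved in the transcript. As a stand-in, here is a minimal pure-Python sketch of the kind of basic statistics such a hello world computes (mean, sample variance, Pearson correlation). It does not use Spark, and the column names and values are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                       sum((y - my) ** 2 for y in ys))
    return covariance / spread


# Hypothetical columns, for illustration only.
sugar = [1.0, 2.0, 3.0, 4.0]
quality = [2.0, 4.0, 6.0, 8.0]

print(mean(sugar), sample_variance(sugar), pearson_correlation(sugar, quality))
```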
Apache Spark - Intro
There are three APIs: RDD, DataFrames, Datasets
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
                 RDD       DataFrames (SQL)   Datasets
Syntax Errors    Runtime   Compile Time       Compile Time
Analysis Errors  Runtime   Runtime            Compile Time
Apache Spark - Intro
“Datasets support encoders which allow to map semi-structured formats (eg
JSON) to constructs of type safe languages (Scala, Java). Also they have better
performance compared to java serialization or kryo.”
MLlib
A library for machine learning on top of Spark. It has two APIs:
- RDD based (spark.mllib).
- Datasets / DataFrames based (spark.ml).
The latter is relatively new and makes it easier to construct an ML pipeline or run an
algorithm. The former is older and has more features.
MLlib
“As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered
maintenance mode.”
What are the implications?
● MLlib will still support the RDD-based API in spark.mllib with bug fixes.
● MLlib will not add new features to the RDD-based API.
● In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
● After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
● The RDD-based API is expected to be removed in Spark 3.0.
MLlib
Supports different categories of ML algorithms:
● Basic statistics (correlations etc)
● Pipelines (LSH, TF-IDF)
● Extracting, transforming and selecting features
● Classification and Regression (Random forests, Gradient boosted trees)
● Clustering (K-means, LDA, etc)
● Collaborative filtering
● Frequent Pattern Mining
● Model selection and tuning
It allows implementing use cases such as fraud detection and recommendation engines.
MLlib Local
A new package is available for production use of the algorithms without the need
of Spark itself. How about PMML vs this method?
https://issues.apache.org/jira/browse/SPARK-13944
https://issues.apache.org/jira/browse/SPARK-16365
MLlib - Unsupervised Learning Example
Our data set: https://www.kaggle.com/danielpanizzo/wine-quality/data
Describes wine quality. Different dimensions like: chlorides, sugar etc.
We will apply k-means to identify different clusters of wine quality.
Both the mllib and ml versions are implemented as Spark notebooks.
Normalize Data -> K-means -> PCA -> Visualize
MLlib - Unsupervised Learning Example
parse data
train k-means with different k
MLlib - Unsupervised Learning Example
Counting errors for elbow method
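The notebook code for this step was not preserved in the transcript. The elbow method itself can be sketched in plain Python: train k-means for several values of k, compute the within-set sum of squared errors for each, and look for the k after which the cost stops dropping sharply. The data points and the farthest-first seeding below are assumptions made so the sketch is small and reproducible; this is not the MLlib implementation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def initial_centers(points, k):
    """Deterministic farthest-first seeding (an assumption, for reproducibility)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centers)))
    return [list(c) for c in centers]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm: assign points to centers, then recompute centers."""
    centers = initial_centers(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


def wssse(points, centers):
    """Within-set sum of squared errors: the cost plotted by the elbow method."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical two-dimensional data with two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
costs = {k: wssse(points, kmeans(points, k)) for k in range(1, 5)}

for k, cost in costs.items():
    print(k, round(cost, 4))
```

With this data the cost falls steeply from k=1 to k=2 and only marginally afterwards, so the elbow is at k=2.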
MLlib - Unsupervised Learning Example
PCA analysis to verify k-means with k=2
MLlib - Unsupervised Learning Example
PCA K=2
MLlib - Unsupervised Learning Example
Available with the mllib implementation
Spark Deep Learning Pipelines
- People know SQL
- Models are productized as SQL UDFs.
Predictions as a SQL statement:
SELECT my_custom_keras_model_udf(image) as predictions from my_spark_image_table
https://github.com/databricks/spark-deep-learning
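The idea of exposing a model as a SQL function can be illustrated without Spark. The sketch below swaps in SQLite from the Python standard library, registering a Python function as a SQL UDF with `sqlite3`'s `create_function`. The model, table and column names are hypothetical, and this is not the Deep Learning Pipelines API.

```python
import sqlite3


def brightness_model(pixels_csv):
    """Hypothetical stand-in for a trained model: is the mean intensity above 0.5?"""
    pixels = [float(p) for p in pixels_csv.split(",")]
    return "bright" if sum(pixels) / len(pixels) > 0.5 else "dark"


connection = sqlite3.connect(":memory:")
# Register the Python function under a SQL name that takes one argument.
connection.create_function("my_model_udf", 1, brightness_model)

connection.execute("CREATE TABLE images (name TEXT, pixels TEXT)")
connection.executemany(
    "INSERT INTO images VALUES (?, ?)",
    [("sun.png", "0.9,0.8,1.0"), ("cave.png", "0.1,0.0,0.2")],
)

rows = connection.execute(
    "SELECT name, my_model_udf(pixels) AS prediction FROM images ORDER BY name"
).fetchall()
print(rows)  # [('cave.png', 'dark'), ('sun.png', 'bright')]
```

Once registered, the model is usable by anyone who can write a SELECT statement, which is the appeal of the approach.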
BigDL
● Developed by Intel.
● It does not use GPUs, optimized for Intel processors.
“It is orders of magnitude faster than out-of-box open source Caffe, Torch or
TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPU).”
● It is implemented as a standalone package on Spark.
● Can be used with existing Spark or Hadoop clusters.
● High-performance powered by Intel MKL and multi-threaded programming.
● Easily scaled-out
● Appropriate for users who are not DL experts.
BigDL
● Offers a user-friendly, idiomatic Scala and Python 2.7/3.5 API for training and
testing machine learning models.
● A lot of useful features: loss functions, layers support etc.
● Implements a parameter server for distributed training of DL models
● Supports visualization via TensorBoard:
https://intel-analytics.github.io/bigdl-doc/UserGuide/visualization-with-tensorboard
BigDL in practice
For a cool example of using BigDL on Mesos, check our blog:
http://developer.lightbend.com/blog/2017-06-22-bigdl-on-mesos/
Thank you! Questions?
https://github.com/skonto/talks/blob/master/big-data-italy-2017/ml/references.md

Machine learning at scale challenges and solutions

  • 1. @s_kontopoulos Machine Learning at Scale: Challenges and Solutions Stavros Kontopoulos Senior Software Engineer @ Lightbend, M.Sc.
  • 2. @s_kontopoulos Who am I? 2 skonto s_kontopoulos S. Software Engineer @ Lightbend, Fast Data Team Apache Flink Contributor at SlideShare stavroskontopoulos stavroskontopoulos All trademarks and registered trademarks are property of their respective holders.
  • 3. @s_kontopoulos Agenda - ML in the Enterprise - ML from development to production - Key technologies: Apache Spark as a case study 3
  • 4. @s_kontopoulos ML in the Enterprise ML is a key tool that fuels the effort of coupling business monitoring (BI) with predictive and prescriptive analytics. business insights -> business optimization -> data monetization 4
  • 5. @s_kontopoulos ML in the Enterprise - The Data-Science LifeCycle Identify Business Question Identify and collect related Data Data cleansing, feature extraction (Data pre-processing) Experiment planning Model Building Model Evaluation Model Deployment/Management in Production Model Optimization - Performance 5
  • 6. @s_kontopoulos Machine Learning Model A model is a function that maps inputs to outputs and essentially expresses a mathematical abstraction. Linear Regression: Neural Network: Random Forest: Function composition 6
  • 7. @s_kontopoulos Model Evolution - Models can be either pre-computed eg. trained off-line or updated on-line. - Online ML with Streaming: - Pure online means only use the latest arrived data point to update the model. Usually models are updated per batch/window eg. online k-means though. - An interesting case is when we sample the stream and train a model only when the distribution changes. - Adaptive supervised learning: SGD (Stochastic Gradient Descent) + random sampling - Re-train the model by ignoring the previous one. 7
  • 8. @s_kontopoulos Machine Learning Pipeline Machine learning pipeline in Production: describes all steps from data preprocessing before feeding the model to model output processing (post-processing). 8
  • 9. @s_kontopoulos Machine Learning Pipeline in Libraries Pros: - Data and test data go through the same steps - Like a CI (continuous integration) pipeline people can reason about data transformation - Caching of computations - Model serving easier 9
  • 10. @s_kontopoulos Multiple Models in a Pipeline Within the same pipeline it is also possible to run multiple models: a) Model Segmentation b) Model Ensemble c) Model Chaining d) Model Composition http://dmg.org/pmml/v4-1/MultipleModels.html http://dl.acm.org/citation.cfm?id=1859403 10
  • 11. @s_kontopoulos Model Development & Production Data Scientist All trademarks and registered trademarks are property of their respective holders. GO Data Engineer 11
  • 12. @s_kontopoulos Model Standardization 12 ML Framework Model Definition Evaluation Data Predictions Export Import PFA - Portable Format For Analytics
  • 13. @s_kontopoulos Model Standardization 13 - PFA or PMML won’t break the pipeline. PFA is more flexible than PMML. “Unlike PMML, PFA has control structures to direct program flow, a true type system for both model parameters and data, and its statistical functions are much more finely grained and can accept callbacks to modify their behavior” (http://dmg.org/pfa/docs/motivation/) - Custom model definitions and implementations are more flexible or more optimized but could break the pipeline. - Some Implementations: - https://github.com/jpmml/jpmml-evaluator-spark - https://github.com/jpmml - https://github.com/opendatagroup/hadrian
  • 14. @s_kontopoulos Model Lifecycle Some concerns about model lifecycle: - Model evolution - Model release practices - Model versioning - Model update process 14
  • 15. @s_kontopoulos Model Governance ● governed by the company’s policies and procedures, laws and regulations and organization’s goals ● searchable across company ● be transparent, explainable, traceable and interpretable for auditors and regulators. Example GDPR requirements: https://iapp.org/news/a/is-there-a-right-to-explanation-for-machine-learning-in- the-gdpr/ ● have approval and release process 15
  • 16. @s_kontopoulos Model Server “A model server is a system which handles the lifecycle of a model and provides the required APIs for deploying a model/pipeline.” Image: https://rise.cs.berkeley.edu/blog/low-latency-model-serving-clipper/ Image: https://www.tensorflow.org/serving/ CLIPPER Tensorflow Serving 16
  • 17. @s_kontopoulos Model Serving - Requirements Other requirements: - Response time - time to calculate a prediction. Could be a few mills. - Throughput - predictions per second. - Support for running multiple models (very common to run hundreds of models eg. A telecom operator where there is one model per customer or in IoT one model per site/sensor). 17
  • 18. @s_kontopoulos Model Serving - Requirements - multiple versions of the same machine learning pipeline within the system. One reason can be A/B testing. - Model update- How quickly and easy a model can be updated? - Uptime/reliability 18
  • 19. @s_kontopoulos Tensorflow Serving Issues Not all systems cover the requirements. For example: ● Metadata not available. (https://github.com/tensorflow/serving/issues/612) ● No new models at runtime: (https://github.com/tensorflow/serving/issues/422) ● Can be hard to build from scratch: https://github.com/tensorflow/serving/issues/327 19
  • 20. @s_kontopoulos Model Serving with Apache Flink Apache Flink: Low latency compared to Spark streaming engine based on the Beam model. 20
  • 21. @s_kontopoulos Model Serving with Apache Flink Idea: Exploit Flink’s low latency capabilities for serving models. Focus on offline models loaded from a permanent storage and update them without interruption. FLIP Proposal: (https://docs.google.com/document/d/1ON_t9S28_2LJ91Fks2yFw0RYyeZvIvndu8 oGRPsPuk8) Combines different efforts: https://github.com/FlinkML ● https://github.com/FlinkML/flink-jpmml (https://radicalbit.io/) ● https://github.com/FlinkML/flink-modelServer (Boris Lublinsky) ● https://github.com/FlinkML/flink-tensorflow (Eron Wright) 21
  • 22. @s_kontopoulos Model Serving with Apache Flink 22 Use a control stream and a data Stream. Keep model in operator’s state. Join the streams. Flink provides 2 ways of implementing low-level joins - key based join based on CoProcessFunction and partitions-based join based on RichCoFlatMapFunction.
  • 23. @s_kontopoulos Model Serving with Apache Flink 23 More here: https://info.lightbend.com/ebook-serving-machine-learning-models-register.html
  • 24. @s_kontopoulos Data Lakes How can we work with data to cover future needs and use cases. We need a robust ML framework plus flexible infrastructure. Data Warehouses will not work. Data lake to the rescue. “A data lake is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files.” - Wikipedia 24
  • 25. @s_kontopoulos Data Lakes ● Agility. It can be seen as a tool that makes data accessible to different users and facilitates ML. ● Designed for low-cost storage ● Schema on read ● Security and governance still maturing. 25
  • 26. @s_kontopoulos Data Lake Issues “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.” - Gartner Several vendors try to deliver end-to-end solutions: Databricks Delta platform, IBM Watson Platform etc. 26
  • 27. @s_kontopoulos Notebooks Very convenient for the data scientist or the analyst. Production usually is based on traditional deployment methods. - Spark Notebook - Apache zeppelin - Jupyter 27
  • 28. @s_kontopoulos ML with Apache Spark “A popular big data framework for ML and data-science.” - You can work locally and move to production fast - ETL/Feature Engineering - Hyper-parameter tuning - Rich Model support - Multiple language support (Scala, Java, Python, R) 28
  • 29. @s_kontopoulos Apache Spark - Intro 29 A framework for distributed in-memory data processing.
  • 30. @s_kontopoulos Apache Spark - Intro - User defines computations/operations (map, flatMap etc) on the data-sets (bounded or not) as a DAG. - DAG is shipped to nodes where the data lie, computation is executed and results are sent back to the user. - The data-sets are considered as immutable distributed data (RDDs). - Resilient Distributed Datasets (RDD) an immutable distributed collection of objects. 30
  • 31. @s_kontopoulos Apache Spark - Basic Example in Scala 31 basic statistics, a hello world for ML
  • 32. @s_kontopoulos Apache Spark - Intro There are three APIs: RDD, DataFrames, Datasets https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dat aframes-and-datasets.html 32 RDD DataFrames (SQL) Datasets Syntax Errors Runtime Compile Time Compile Time Analysis Errors Runtime Runtime Compile Time
  • 33. Apache Spark - Intro “Datasets support encoders which allow to map semi-structured formats (eg JSON) to constructs of type safe languages (Scala, Java). Also they have better performance compared to java serialization or kryo.”
  • 34. MLlib A library for machine learning on top of Spark. Has two APIs: - RDD based (spark.mllib). - Datasets / DataFrames based (spark.ml). The latter is relatively new and makes it easier to construct an ML pipeline or run an algorithm. The former is older and has more features.
  • 35. MLlib “As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode.” What are the implications? ● MLlib will still support the RDD-based API in spark.mllib with bug fixes. ● MLlib will not add new features to the RDD-based API. ● In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API. ● After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated. ● The RDD-based API is expected to be removed in Spark 3.0.
  • 36. MLlib Supports different categories of ML algorithms: ● Basic statistics (correlations etc.) ● Pipelines (LSH, TF-IDF) ● Extracting, transforming and selecting features ● Classification and Regression (Random forests, Gradient boosted trees) ● Clustering (K-means, LDA, etc.) ● Collaborative filtering ● Frequent Pattern Mining ● Model selection and tuning Typical applications: fraud detection, recommendation engines, ...
  • 37. MLlib Local A new package is available for production use of the algorithms without the need for Spark itself. How does PMML compare with this method? https://issues.apache.org/jira/browse/SPARK-13944 https://issues.apache.org/jira/browse/SPARK-16365
  • 38. MLlib - Unsupervised Learning Example Our data set: https://www.kaggle.com/danielpanizzo/wine-quality/data It describes wine quality along different dimensions such as chlorides, sugar etc. We will apply k-means to identify different clusters of wine quality. Both the mllib and the ml versions are implemented as Spark notebooks. Pipeline: Normalize Data -> K-means -> PCA -> Visualize
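The first step of this example, normalizing the data, matters because k-means relies on distances, so a feature with a large numeric range would otherwise dominate. The slides do not say which normalization the notebooks use; the sketch below assumes z-score standardization, and the `standardize` helper and the sample values are hypothetical (they are not taken from the wine data set).

```python
import math


def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    n = len(rows)
    means = [sum(c) / n for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / n)
        for c, m in zip(cols, means)
    ]
    return [
        # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Made-up rows with two features on very different scales.
samples = [[0.05, 2.0], [0.07, 6.0], [0.09, 10.0]]
scaled = standardize(samples)
```

After scaling, both columns contribute comparably to the distances computed by the clustering step.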
  • 39. MLlib - Unsupervised Learning Example: parse data, train k-means with different k
  • 40. MLlib - Unsupervised Learning Example: counting errors for the elbow method
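The notebook code for these two slides is not in the transcript. As a self-contained sketch of the technique, the following plain-Python code trains k-means for several values of k and records the within-cluster sum of squared errors for each, which is the quantity inspected in the elbow method. It is a toy implementation, not the Spark notebook; the function names, the fixed iteration count and the sample points are assumptions.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm with random initial centers drawn from the data."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers


def wssse(points, centers):
    """Within-cluster sum of squared errors: squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Made-up points forming two well separated groups.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
errors = {k: wssse(points, kmeans(points, k)) for k in range(1, 5)}
```

Plotting `errors` against k and looking for the point where the curve flattens gives the "elbow"; for the toy points above the error drops sharply from k=1 to k=2 and changes little afterwards.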
  • 41. MLlib - Unsupervised Learning Example: PCA to verify k-means with k=2
  • 42. MLlib - Unsupervised Learning Example: PCA, k=2
  • 43. MLlib - Unsupervised Learning Example: available with the mllib implementation
  • 44. Spark Deep Learning Pipelines - People know SQL - Models are productized as SQL UDFs. Predictions as a SQL statement: SELECT my_custom_keras_model_udf(image) AS predictions FROM my_spark_image_table https://github.com/databricks/spark-deep-learning
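The "model as a SQL UDF" pattern can be tried without a cluster. The sketch below uses SQLite from the Python standard library instead of Spark SQL, and a trivial threshold function instead of a Keras model, so it only illustrates the pattern and not the Deep Learning Pipelines API. The table, column and function names are hypothetical.

```python
import sqlite3


def toy_model(pixel_sum):
    """Hypothetical 'model': labels an image as bright or dark from one feature."""
    return "bright" if pixel_sum > 100 else "dark"


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call it by name.
conn.create_function("toy_model_udf", 1, toy_model)

conn.execute("CREATE TABLE images (name TEXT, pixel_sum REAL)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?)",
    [("a.png", 30.0), ("b.png", 250.0)],
)

# Anyone who knows SQL can now request predictions.
predictions = conn.execute(
    "SELECT name, toy_model_udf(pixel_sum) AS prediction FROM images ORDER BY name"
).fetchall()
conn.close()
```

The appeal of this design is that the model's consumers do not need to know how it was trained or which library runs it; they only need the UDF name.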
  • 45. BigDL ● Developed by Intel. ● It does not use GPUs; it is optimized for Intel processors. “It is orders of magnitude faster than out-of-box open source Caffe, Torch or TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPU).” ● It is implemented as a standalone package on Spark. ● Can be used with existing Spark or Hadoop clusters. ● High performance, powered by Intel MKL and multi-threaded programming. ● Easily scaled out. ● Appropriate for users who are not DL experts.
  • 46. BigDL ● Offers a user-friendly, idiomatic Scala and Python 2.7/3.5 API for training and testing machine learning models. ● Many useful features: loss functions, layer support etc. ● Implements a parameter server for distributed training of DL models. ● Supports visualization via TensorBoard: https://intel-analytics.github.io/bigdl-doc/UserGuide/visualization-with-tensorboard
  • 47. BigDL in practice For an example of using BigDL on Mesos, see our blog: http://developer.lightbend.com/blog/2017-06-22-bigdl-on-mesos/