Machine Learning at Scale: Challenges and Solutions

@s_kontopoulos
Stavros Kontopoulos
Senior Software Engineer @ Lightbend, M.Sc.

Who am I?
S. Software Engineer @ Lightbend, Fast Data Team
Contributor at Apache Flink
skonto
s_kontopoulos
SlideShare: stavroskontopoulos
stavroskontopoulos
All trademarks and registered trademarks are property of their respective holders.

Agenda
- ML in the Enterprise
- ML from development to production
- Key technologies: Apache Spark as a case study

ML in the Enterprise
ML is a key tool that fuels the effort of coupling business monitoring (BI) with
predictive and prescriptive analytics.
business insights -> business optimization -> data monetization

ML in the Enterprise - The Data-Science Lifecycle
- Identify Business Question
- Identify and collect related Data
- Data cleansing, feature extraction (Data pre-processing)
- Experiment planning
- Model Building
- Model Evaluation
- Model Deployment/Management in Production
- Model Optimization - Performance

Machine Learning Model
A model is a function that maps inputs to outputs and essentially expresses a
mathematical abstraction.
Linear Regression:
Neural Network:
Random Forest:
Function composition
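The formulas for the three examples on this slide did not survive extraction. As a rough sketch using standard textbook forms (the symbols w, b, W, sigma, h_t and T are assumptions, not the slide's own notation), each model can be written as a function f of the input x:

```latex
\begin{aligned}
\text{Linear regression:}\quad & f(x) = w^{\top} x + b \\
\text{Neural network:}\quad & f(x) = (f_L \circ f_{L-1} \circ \dots \circ f_1)(x),
  \qquad f_\ell(z) = \sigma_\ell(W_\ell z + b_\ell) \\
\text{Random forest (regression):}\quad & f(x) = \frac{1}{T} \sum_{t=1}^{T} h_t(x)
\end{aligned}
```

Here each f_l is one layer of the network, which makes the network a composition of functions, and each h_t is the prediction of one of the T trees in the forest.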

Model Evolution
- Models can be either pre-computed (e.g. trained off-line) or updated on-line.
- Online ML with streaming:
  - Pure online learning uses only the latest data point to update the model. In practice,
    though, models are usually updated per batch/window (e.g. online k-means).
  - An interesting case is sampling the stream and training a model only when the
    distribution changes.
  - Adaptive supervised learning: SGD (Stochastic Gradient Descent) + random sampling.
- Re-train the model from scratch, ignoring the previous one.
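A pure online update can be sketched in a few lines of plain Python. This is only an illustration of the idea for k-means, assuming a running-mean learning rate and made-up data; the function names are hypothetical and it is not the Spark or Flink implementation.

```python
from typing import List


def nearest(centroids: List[List[float]], point: List[float]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    def sq_dist(c: List[float]) -> float:
        return sum((ci - pi) ** 2 for ci, pi in zip(c, point))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def online_kmeans_update(centroids, counts, point) -> int:
    """Update the model in place using a single newly arrived point."""
    j = nearest(centroids, point)
    counts[j] += 1
    lr = 1.0 / counts[j]  # running-mean learning rate (an assumption of this sketch)
    centroids[j] = [c + lr * (p - c) for c, p in zip(centroids[j], point)]
    return j


# Made-up initial model and stream; each initial centroid counts as one seen point.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
stream = [[1.0, 1.0], [9.0, 11.0], [2.0, 0.0]]
assignments = [online_kmeans_update(centroids, counts, p) for p in stream]
```

A per-batch/window variant would apply the same update once per window, using the mean of the points assigned to each centroid in that window.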

Machine Learning Pipeline
A machine learning pipeline in production describes all steps, from data
pre-processing (before the data is fed to the model) to processing of the
model output (post-processing).

Machine Learning Pipeline in Libraries
Pros:
- Training data and test data go through the same steps
- Like a CI (continuous integration) pipeline, people can reason about the data
transformations
- Caching of computations
- Easier model serving
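The first point can be illustrated with a tiny pipeline in plain Python. This is a sketch under assumptions: the class names (MinMaxScaler, ThresholdModel, Pipeline) and the fit/transform convention are chosen for illustration and are not any library's API.

```python
class MinMaxScaler:
    """Pre-processing stage: learns the range on training data, rescales any data."""
    def fit(self, xs):
        self.lo, self.hi = min(xs), max(xs)
        return self

    def transform(self, xs):
        span = (self.hi - self.lo) or 1.0  # avoid dividing by zero for constant input
        return [(x - self.lo) / span for x in xs]


class ThresholdModel:
    """Toy model: predicts 1 when a value is above the mean of the training values."""
    def fit(self, xs):
        self.threshold = sum(xs) / len(xs)
        return self

    def transform(self, xs):
        return [1 if x > self.threshold else 0 for x in xs]


class Pipeline:
    """Runs the same ordered stages for training and for prediction."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, xs):
        for stage in self.stages:
            xs = stage.fit(xs).transform(xs)
        return self

    def transform(self, xs):
        for stage in self.stages:
            xs = stage.transform(xs)
        return xs


pipeline = Pipeline([MinMaxScaler(), ThresholdModel()]).fit([1.0, 2.0, 3.0, 4.0])
predictions = pipeline.transform([0.5, 2.0, 5.0])
```

Because the scaler's parameters are learned once during fit and reused in transform, new data cannot accidentally be scaled differently from the training data.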

Multiple Models in a Pipeline
Within the same pipeline it is also possible to run multiple models:
a) Model Segmentation
b) Model Ensemble
c) Model Chaining
d) Model Composition
http://dmg.org/pmml/v4-1/MultipleModels.html
http://dl.acm.org/citation.cfm?id=1859403
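Three of these patterns can be sketched with models represented as plain Python functions. The function names and the toy models below are assumptions made for illustration; model composition is left out of the sketch.

```python
from statistics import mean


def segmentation(models_by_segment, segment, x):
    """Model segmentation: select one model based on a property of the input."""
    return models_by_segment[segment](x)


def ensemble(models, x):
    """Model ensemble: combine the outputs of several models (here by averaging)."""
    return mean(m(x) for m in models)


def chain(models, x):
    """Model chaining: the output of one model is the input of the next."""
    for m in models:
        x = m(x)
    return x


# Two toy "models", purely hypothetical.
def double(x):
    return 2.0 * x


def plus_one(x):
    return x + 1.0


segmented = segmentation({"eu": double, "us": plus_one}, "us", 3.0)
averaged = ensemble([double, plus_one], 3.0)
chained = chain([double, plus_one], 3.0)
```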

Model Development & Production
Data Scientist
GO
Data Engineer

Model Standardization
[Diagram: an ML framework exports a model definition in PFA (Portable Format for Analytics); an evaluation component imports it and turns data into predictions.]

Model Standardization
- PFA or PMML won’t break the pipeline. PFA is more flexible than PMML.
“Unlike PMML, PFA has control structures to direct program flow, a true type system for both
model parameters and data, and its statistical functions are much more finely grained and can
accept callbacks to modify their behavior” (http://dmg.org/pfa/docs/motivation/)
- Custom model definitions and implementations are more flexible or more
optimized but could break the pipeline.
- Some implementations:
- https://github.com/jpmml/jpmml-evaluator-spark
- https://github.com/jpmml
- https://github.com/opendatagroup/hadrian
Model Lifecycle
Some concerns about model lifecycle:
- Model evolution
- Model release practices
- Model versioning
- Model update process

Model Governance
A model should:
● be governed by the company’s policies and procedures, laws and regulations,
and the organization’s goals
● be searchable across the company
● be transparent, explainable, traceable and interpretable for auditors and
regulators. Example: GDPR requirements,
https://iapp.org/news/a/is-there-a-right-to-explanation-for-machine-learning-in-the-gdpr/
● have an approval and release process

Model Server
“A model server is a system which handles the lifecycle of a model and provides
the required APIs for deploying a model/pipeline.”
Examples:
- Clipper (image: https://rise.cs.berkeley.edu/blog/low-latency-model-serving-clipper/)
- TensorFlow Serving (image: https://www.tensorflow.org/serving/)

Model Serving - Requirements
Other requirements:
- Response time: the time to calculate a prediction. Could be a few milliseconds.
- Throughput: predictions per second.
- Support for running multiple models. It is very common to run hundreds of models,
e.g. a telecom operator with one model per customer, or an IoT deployment with one
model per site/sensor.

Model Serving - Requirements
- Multiple versions of the same machine learning pipeline within the system.
One reason can be A/B testing.
- Model update: how quickly and easily can a model be updated?
- Uptime/reliability
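Serving several versions side by side for A/B testing needs some rule for splitting traffic. The sketch below assumes hash-based routing on a request key with fixed traffic weights; the names pick_version, registry and weights are hypothetical and do not come from any model server.

```python
import hashlib


def pick_version(request_key: str, versions: dict) -> str:
    """Deterministically map a request key to a version name.

    `versions` maps version name -> traffic share; shares should sum to 1.0.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    cumulative = 0.0
    name = None
    for name, weight in versions.items():
        cumulative += weight
        if bucket < cumulative:
            return name
    return name  # fall back to the last version if rounding leaves a gap


# Two versions of the same toy model, with 90% / 10% of the traffic.
registry = {"v1": lambda x: x + 1, "v2": lambda x: x + 2}
weights = {"v1": 0.9, "v2": 0.1}


def serve(request_key: str, x):
    version = pick_version(request_key, weights)
    return version, registry[version](x)


version, prediction = serve("user-42", 10)
```

Hashing the key instead of drawing a random number keeps each user on the same version across requests, which makes the comparison between versions easier to interpret.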

TensorFlow Serving Issues
Not all systems cover the requirements. For example:
● Metadata not available: https://github.com/tensorflow/serving/issues/612
● No new models at runtime: https://github.com/tensorflow/serving/issues/422
● Can be hard to build from scratch:
https://github.com/tensorflow/serving/issues/327

Model Serving with Apache Flink
Apache Flink: a streaming engine based on the Beam model, with low latency
compared to Spark.

Idea: Exploit Flink’s low latency capabilities for serving models. Focus on offline
models loaded from permanent storage and update them without interruption.
FLIP Proposal:
(https://docs.google.com/document/d/1ON_t9S28_2LJ91Fks2yFw0RYyeZvIvndu8oGRPsPuk8)
Combines different efforts: https://github.com/FlinkML
● https://github.com/FlinkML/flink-jpmml (https://radicalbit.io/)
● https://github.com/FlinkML/flink-modelServer (Boris Lublinsky)
● https://github.com/FlinkML/flink-tensorflow (Eron Wright)

Use a control stream and a data stream. Keep the model in the operator’s state. Join the streams.
Flink provides two ways of implementing low-level joins: a key-based join based on CoProcessFunction and
a partition-based join based on RichCoFlatMapFunction.
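The control-stream/data-stream idea can be simulated without Flink. The sketch below is plain Python, not the Flink API: the two streams are assumed to be already merged into one ordered sequence of tagged events, and the function name serve and the event tags are hypothetical.

```python
def serve(events):
    """Process a merged stream of ("control", model) and ("data", record) events.

    The active model is the only state kept; it is replaced whenever a new
    model arrives on the control stream, without stopping the data stream.
    """
    model = None  # stands in for the operator's state
    for kind, payload in events:
        if kind == "control":
            model = payload       # a new model replaces the current one
        elif model is not None:
            yield model(payload)  # score the record with the active model
        # Records arriving before any model is loaded are dropped in this sketch.


events = [
    ("data", 1.0),                   # no model yet: dropped
    ("control", lambda x: x * 2),    # first model
    ("data", 2.0),
    ("control", lambda x: x + 100),  # updated model, swapped in on the fly
    ("data", 3.0),
]
results = list(serve(events))
```

In a real deployment the state would be kept per key or per partition, depending on which of the two join styles is used.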

More here:
https://info.lightbend.com/ebook-serving-machine-learning-models-register.html

Data Lakes
How can we work with data to cover future needs and use cases? We need a
robust ML framework plus flexible infrastructure. Data warehouses will not work.
Data lakes to the rescue.
“A data lake is a method of storing data within a system or repository, in its natural
format, that facilitates the collocation of data in various schemata and structural
forms, usually object blobs or files.”
- Wikipedia

Data Lakes
● Agility. It can be seen as a tool that makes data accessible to different users
and facilitates ML.
● Designed for low-cost storage
● Schema on read
● Security and governance still maturing.

Data Lake Issues
“Through 2018, 80% of data lakes will not include effective metadata management
capabilities, making them inefficient.”
- Gartner
Several vendors try to deliver end-to-end solutions: Databricks Delta platform, IBM
Watson Platform etc.

Notebooks
Very convenient for the data scientist or the analyst.
Production is usually based on traditional deployment methods.
- Spark Notebook
- Apache Zeppelin
- Jupyter

ML with Apache Spark
“A popular big data framework for ML and data-science.”
- You can work locally and move to production fast
- ETL/Feature Engineering
- Hyper-parameter tuning
- Rich Model support
- Multiple language support (Scala, Java, Python, R)

Apache Spark - Intro
A framework for distributed in-memory data processing.

- The user defines computations/operations (map, flatMap etc.) on the data-sets
(bounded or not) as a DAG.
- The DAG is shipped to the nodes where the data lie, the computation is executed and
the results are sent back to the user.
- The data-sets are treated as Resilient Distributed Datasets (RDDs): immutable
distributed collections of objects.
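The combination of immutable data-sets and a recorded graph of operations can be mimicked on a single machine. The toy class below is an assumption-laden sketch in plain Python: LazyDataset is a made-up name, it records only a linear chain of operations (the simplest kind of DAG), and nothing in it is distributed or part of the Spark API.

```python
class LazyDataset:
    """Immutable data-set: a transformation returns a new object and only records
    the operation; nothing is computed until collect() is called."""

    def __init__(self, source, ops=()):
        self._source = tuple(source)
        self._ops = tuple(ops)

    def map(self, f):
        return LazyDataset(self._source, self._ops + (("map", f),))

    def flat_map(self, f):
        return LazyDataset(self._source, self._ops + (("flat_map", f),))

    def collect(self):
        """Execute the recorded operations in order and return the result."""
        data = list(self._source)
        for kind, f in self._ops:
            if kind == "map":
                data = [f(x) for x in data]
            else:
                data = [y for x in data for y in f(x)]
        return data


lines = LazyDataset(["a b", "c"])
words = lines.flat_map(str.split)  # recorded, not yet executed
lengths = words.map(len)           # recorded, not yet executed
result = lengths.collect()         # the whole chain runs here
```

Since every transformation returns a new object, `lines` and `words` are still usable after `lengths` has been computed.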

Apache Spark - Basic Example in Scala
Basic statistics, a hello world for ML

There are three APIs: RDD, DataFrames, Datasets
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
                 RDD      DataFrames (SQL)  Datasets
Syntax Errors    Runtime  Compile Time      Compile Time
Analysis Errors  Runtime  Runtime           Compile Time

“Datasets support encoders, which allow mapping semi-structured formats (e.g.
JSON) to constructs of type-safe languages (Scala, Java). They also perform
better than Java serialization or Kryo.”

MLlib
A library for machine learning on top of Spark. Has two APIs:
- RDD based (spark.mllib).
- Datasets / DataFrames based (spark.ml).
The latter is relatively new and makes it easier to construct an ML pipeline or run an
algorithm. The former is older and has more features.

MLlib
“As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered
maintenance mode.”
What are the implications?
● MLlib will still support the RDD-based API in spark.mllib with bug fixes.
● MLlib will not add new features to the RDD-based API.
● In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
● After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
● The RDD-based API is expected to be removed in Spark 3.0.

MLlib
Supports different categories of ML algorithms:
● Basic statistics (correlations etc.)
● Pipelines (LSH, TF-IDF)
● Extracting, transforming and selecting features
● Classification and Regression (Random forests, Gradient boosted trees)
● Clustering (K-means, LDA, etc.)
● Collaborative filtering
● Frequent Pattern Mining
● Model selection and tuning
Enables use cases such as fraud detection, recommendation engines, ...

MLlib Local
A new package is available for production use of the algorithms without the need
for Spark itself. How does PMML compare with this method?
https://issues.apache.org/jira/browse/SPARK-13944
https://issues.apache.org/jira/browse/SPARK-16365

MLlib - Unsupervised Learning Example
Our data set: https://www.kaggle.com/danielpanizzo/wine-quality/data
Describes wine quality along different dimensions such as chlorides, sugar etc.
We will apply k-means to identify different clusters of wine quality.
Both the mllib and the ml versions are implemented as Spark notebooks.
Normalize Data -> K-means -> PCA -> Visualize
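The normalization step matters because k-means is distance based, so a feature with a large numeric range would otherwise dominate. A minimal sketch in plain Python, assuming z-score scaling with the population standard deviation and made-up rows (the notebooks may normalize differently):

```python
def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [(sum((v - m) ** 2 for v in c) / n) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Two made-up features on very different scales.
rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize(rows)
```

After scaling, both columns contribute equally to the Euclidean distance used by k-means.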

parse data
train k-means with different k

Counting errors for elbow method
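The elbow method compares the clustering error for several values of k and picks the point where adding clusters stops helping much. The sketch below is a plain-Python stand-in for the notebook code, which is not reproduced here: it assumes 1-D toy data, a deterministic initialization, and the within-set sum of squared errors as the error measure; kmeans_1d and wssse are hypothetical names.

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm on 1-D data; centroids start at evenly spaced sorted points."""
    pts = sorted(points)
    step = len(pts) / k
    centroids = [pts[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            j = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[j].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def wssse(points, centroids):
    """Within-set sum of squared errors: each point against its nearest centroid."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)


# Made-up data with two obvious groups.
points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
errors = {k: wssse(points, kmeans_1d(points, k)) for k in (1, 2, 3)}
```

On this toy data the error drops sharply from k=1 to k=2 and only marginally from k=2 to k=3, so the elbow is at k=2.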

MLlib - Unsupervised Learning Example
PCA analysis to verify k-means with k=2

MLlib - Unsupervised Learning Example
PCA, k=2

Available with the mllib implementation

Spark Deep Learning Pipelines
- People know SQL
- Models are productized as SQL UDFs.
Predictions as a SQL statement:
SELECT my_custom_keras_model_udf(image) as predictions from my_spark_image_table
https://github.com/databricks/spark-deep-learning
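The "model as a SQL UDF" pattern can be tried without a cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Spark SQL, which is an assumption made only to keep the example self-contained; my_model, my_model_udf and the inputs table are hypothetical, and the "model" is a fixed linear function rather than a Keras network.

```python
import sqlite3


def my_model(feature: float) -> float:
    """Hypothetical stand-in for a trained model: a fixed linear function."""
    return 2.0 * feature + 1.0


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
conn.create_function("my_model_udf", 1, my_model)
conn.execute("CREATE TABLE inputs (feature REAL)")
conn.executemany("INSERT INTO inputs VALUES (?)", [(1.0,), (2.5,)])

# Predictions as a SQL statement, mirroring the query on the slide.
predictions = [row[0] for row in conn.execute(
    "SELECT my_model_udf(feature) AS prediction FROM inputs ORDER BY feature")]
conn.close()
```

The appeal of the pattern is that anyone who can write the SELECT statement can get predictions, without touching the model code.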

BigDL
● Developed by Intel.
● It does not use GPUs; it is optimized for Intel processors.
“It is orders of magnitude faster than out-of-box open source Caffe, Torch or
TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPU).”
● It is implemented as a standalone package on Spark.
● Can be used with existing Spark or Hadoop clusters.
● High performance, powered by Intel MKL and multi-threaded programming.
● Easily scaled out.
● Appropriate for users who are not DL experts.

BigDL
● Offers a user-friendly, idiomatic Scala and Python 2.7/3.5 API for training and
testing machine learning models.
● A lot of useful features: loss functions, layers support etc.
● Implements a parameter server for distributed training of DL models.
● Supports visualization via TensorBoard:
https://intel-analytics.github.io/bigdl-doc/UserGuide/visualization-with-tensorboard

BigDL in practice
For a cool example of using BigDL on Mesos, check our blog:
http://developer.lightbend.com/blog/2017-06-22-bigdl-on-mesos/

Thank you! Questions?
https://github.com/skonto/talks/blob/master/big-data-italy-2017/ml/references.md
