Dataiku - Data Science Operationalization
GUIDEBOOK
Operationalization
Finding the Common Ground in 10 Steps
INTRODUCTION
In data science projects, the derivation of business value follows something akin to the Pareto Principle,
where the vast majority of the business value is generated not from the planning, the scoping, or even
from producing a viable machine learning model. Rather, business value comes from the final few steps:
operationalization of that project.
More often than not, there is a disconnect between the worlds of development and production. Some
teams may choose to re-code everything in an entirely different language while others may make
changes to core elements, such as testing procedures, backup plans, and programming languages.
Operationalizing analytics products can become complicated as different opinions and methods vie for
supremacy, resulting in projects that needlessly drag on for months beyond promised deadlines.
The goal of this guide is to explore grounds for commonality and introduce strategies & procedures
designed to bridge the gap between development and operationalization. The topics range from Best
Operating Procedures (managing environmental consistency, data scalability, and consistent code & data
packaging) to Risk Management for unforeseen situations (roll-back and failover strategies). We also discuss modelling (continuous retraining of models, A/B testing, and multivariate optimization) and implementing communication strategies (auditing and functional monitoring).
Successfully building an analytics product and then operationalizing it is not an easy task — it becomes
twice as hard when teams are isolated and playing by their own rules.
This guide will help your organization find the common ground needed to empower your Data
Science and IT Teams to work together for the benefit of your data and analytics projects as a whole.
To support the reliable transport of code and data from one environment to the next, they need to be packaged together.
On average, a project includes 700 lines of SQL, 2,500 lines of Python, 200 lines of JavaScript, and 350 lines of configurations (or scripts) that combine all of these elements.
If you do not support the proper packaging of code and data, the end result is inconsistent code and data during
operationalization. These inconsistencies are particularly dangerous when training or applying a predictive model:
they are quite difficult to detect, and they can lead to a subtle degradation of the model’s performance between
development and production.
Without proper packaging, the deployment in production can pose significant challenges.
“Where did Mike put the 10GB reference file he used to create the model?
We only have the 10MB sample on Git!”
THE SOLUTION
1. The first step toward consistent packaging and release for operationalization is to establish
a versioning tool, such as Git, to manage all of the code versioning within your product.
2. The next step is to package the code and data together. Create packaging scripts that generate
snapshots, in the form of a ZIP file, of both the code and the data; these snapshots should be
consistent with the model (or model parameters) that you need to ship. Deploy that ZIP file to
production (see the sketch below).
3. Lastly, be vigilant. Remain aware of situations where the data file size is too large (e.g., > 1GB).
In these scenarios, you need to snapshot and version the required data files in a dedicated storage system.
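As an illustration of step 2, here is a minimal packaging sketch: it bundles the project code and reference data into one versioned ZIP and records the Git commit it was built from. The directory names (project_code/, data/reference.csv, models/model.pkl) are hypothetical placeholders, not a prescribed layout, and the script assumes the project lives in a Git repository.

```python
import subprocess
import zipfile
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical project layout; adjust to your own repository structure.
CODE_DIR = Path("project_code")
DATA_FILES = [Path("data/reference.csv"), Path("models/model.pkl")]

def package_release(output_dir: Path = Path("releases")) -> Path:
    """Snapshot code and data into one ZIP, tagged with the current Git commit."""
    commit = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    output_dir.mkdir(parents=True, exist_ok=True)
    archive = output_dir / f"release_{stamp}_{commit}.zip"

    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # Code: every file under the code directory.
        for path in CODE_DIR.rglob("*"):
            if path.is_file():
                zf.write(path, path.as_posix())
        # Data: the reference files the model was trained and validated with.
        for path in DATA_FILES:
            zf.write(path, path.as_posix())
        # Manifest so production knows exactly what it received.
        zf.writestr("MANIFEST.txt", f"commit={commit}\nbuilt_at={stamp}\n")
    return archive

if __name__ == "__main__":
    print(f"Packaged release: {package_release()}")
```

The same ZIP can then be deployed to production as a single artifact, which keeps code and data aligned by construction.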
THE SOLUTION
The solution to the re-training challenge lies in the data science production workflow. This means that you
need to implement a dedicated command for your workflow that does the following: re-trains the model on
the most recent data, re-scores it against a held-out validation set, and re-validates that its performance still
meets requirements before the new version is swapped into production.
With regards to implementation, the re-train/re-score/re-validate steps should be automated and executed
every week. The final swap is then manually executed by a human operator who performs the final
consistency check. This approach provides a good balance between automation and reduced re-training cost
while preserving that final human check.
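A minimal sketch of such a command is shown below. The retrain(), rescore(), and revalidate() functions are stand-ins for your own pipeline code, and the metric names and thresholds are illustrative assumptions; the new model is only staged for a manual swap, as described above.

```python
import json
from pathlib import Path

CANDIDATE_DIR = Path("models/candidate")

# Stand-ins for your own pipeline code; replace with real implementations.
def retrain():
    """Re-train the model on the most recent data and return it."""
    return {"weights": [0.1, 0.2, 0.3]}          # placeholder model object

def rescore(model) -> dict:
    """Re-score the model on a held-out validation set."""
    return {"auc": 0.88, "precision": 0.74}      # placeholder metrics

def revalidate(metrics: dict) -> bool:
    """Re-validate the metrics against agreed thresholds."""
    return metrics["auc"] >= 0.85

def weekly_retrain() -> None:
    model = retrain()
    metrics = rescore(model)
    passed = revalidate(metrics)

    CANDIDATE_DIR.mkdir(parents=True, exist_ok=True)
    (CANDIDATE_DIR / "model.json").write_text(json.dumps(model))
    (CANDIDATE_DIR / "metrics.json").write_text(
        json.dumps({"metrics": metrics, "passed": passed}, indent=2)
    )
    # The final swap into production stays manual: an operator reviews
    # metrics.json and promotes the candidate only after a consistency check.
    print(f"Candidate staged in {CANDIDATE_DIR}; validation passed: {passed}")

if __name__ == "__main__":
    weekly_retrain()
```

A script like this would typically be triggered once a week by a scheduler such as cron.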
• In use cases such as credit scoring and fraud detection, only real world tests can
provide the actual data output required. Offline tests are simply unable to convey real-
time events, such as credit authorizations (e.g., is the credit offering aligned with the
customer’s repayment ability?);
• A real-world production setup may differ from your development setup. As mentioned
above, data consistency is a major issue and can result in misaligned behavior in production;
• If the underlying data and its behavior are evolving rapidly, then it will be difficult to
validate the models fast enough to cope with the rate of change.
THE SOLUTION
There are three levels of A/B testing that can be used to test the validity of models:
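Whichever level is used, the common mechanism is routing a share of live traffic to a challenger model and comparing outcomes against the incumbent. Below is a minimal sketch of such a split; the stub models, the 10% challenger share, and the user-id hashing scheme are illustrative assumptions, not a prescribed design.

```python
import hashlib
import random

class _StubModel:
    """Stand-in for a real trained model; replace with your own."""
    def __init__(self, name: str):
        self.name = name
    def predict(self, features: dict) -> float:
        return random.random()  # placeholder score

champion_model = _StubModel("champion")
challenger_model = _StubModel("challenger")

CHALLENGER_SHARE = 0.10  # fraction of traffic routed to the challenger

def assign_variant(user_id: str) -> str:
    """Deterministically bucket a user so they always see the same model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 1000
    return "challenger" if bucket < CHALLENGER_SHARE * 1000 else "champion"

def score(user_id: str, features: dict) -> float:
    model = challenger_model if assign_variant(user_id) == "challenger" else champion_model
    return model.predict(features)

if __name__ == "__main__":
    print(assign_variant("customer-42"), score("customer-42", {"amount": 120.0}))
```

Hashing the user id (rather than drawing randomly per request) keeps each user on a single variant, which is what makes the comparison between champion and challenger meaningful.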
FUNCTIONAL MONITORING
Functional monitoring is used to convey the key aspects of a model’s performance to the
business sponsors/owners. From a business perspective, functional monitoring is critical because it
provides an opportunity to demonstrate the end-results of your predictive model and how it impacts the
product. The kind of functional information that can be conveyed is variable and depends largely on the
industry and use case. Examples of the kind of data displayed can include the number of contacts in a case,
the number of broken flows in a system, and measurements of performance drifts.
FUNCTIONAL MONITORING
“One day, our CEO spotted a funny recommendation on the company
website. We realised that part of the rebuild chain had been broken for 5
days without anyone noticing. Well, we decided to keep this to ourselves.”
THE SOLUTION
A successful communication strategy lies at the heart of any effective organization; such a strategy typically
combines multiple channels:
• A channel for the quick and continuous communication of events — these are channels where
events are seamlessly communicated to team members, such as: a new model in production;
outliers in production; a drop or increase in model performance over the last 24 hours, etc.
• An e-mail-based channel with a Daily Report. Such a report should be a succinct summary of key
data, such as: a subject line with core metrics; the top n customers matching specific model criteria; three
model metrics (e.g., a technical metric, a high-level long-term metric, and a short-term business
metric), etc.
• A web-based dashboard with drill-down capability; other channels should always include links to
the dashboard in order to drive usage.
• A real-time notification platform, such as Slack, is a popular option that provides flexible
subscription options to stakeholders. If building a monitoring dashboard, visualization tools such
as Tableau and Qlik are popular as well.
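As an illustration of the real-time channel, the sketch below posts a short model-performance event to a Slack incoming webhook. The webhook URL and the metric values in the example message are placeholders.

```python
import requests  # third-party library: pip install requests

# Placeholder: create an incoming webhook in Slack and paste its URL here.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify(message: str) -> None:
    """Send a one-line event to the team channel."""
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    response.raise_for_status()

if __name__ == "__main__":
    # Example event: a drop in model performance over the last 24 hours.
    notify(":warning: Fraud model AUC dropped from 0.91 to 0.86 over the last 24h.")
```

The same helper can be called from the retraining and monitoring jobs so that events such as "new model in production" reach stakeholders without any manual step.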
IT ENVIRONMENT CONSISTENCY
The smooth flow of the modelling process relies heavily on the existence of a consistent IT environment
during development and production. Modern data science commonly uses technologies such as Python, R,
Spark, Scala, along with open source frameworks/libraries, such as H2O, scikit-learn, and MLlib.
In the past, data scientists used technologies that were already available in the production environment,
such as SQL databases, Java, and .NET.
In today’s predictive technology environment, it is not practical to translate a data science project to older
technologies like SQL, Java, and .NET — doing so incurs substantial re-write costs. Consequently, 80% of
companies involved with predictive modelling use newer technologies such as Python and R.
IT ENVIRONMENT CONSISTENCY
“I’m never trusting them with R in production again. Last time we
attempted to deploy a project with multiple R packages, it was literally a
nightmare.”
THE SOLUTION
Fortunately, there are multiple options available when establishing a consistent IT environment, such as:
• Use the built-in mechanisms in open source distributions (e.g., virtualenv and pip for Python) or rely
on third-party software (e.g., Anaconda™ for Python). Anaconda™ is becoming an increasingly
popular choice amongst Python users, with one-third of our respondents indicating usage. For
Spark, Scala, and R, the vast majority of the data science community relies solely on open
source options;
• Use a build-from-source system (e.g., pip for Python) or a binary mechanism (e.g., wheels). In
the scientific community, binary systems are enjoying increased popularity. This is partly due
to the difficulty involved in building an optimized library that leverages all of the capabilities of
scientific computing packages, such as NumPy;
• Rely on a stable release and common package list (in all of your systems) or build a virtual
environment for each project. In the former, IT would rather maintain a common list of “trusted”
packages and then push those packages to software development. In the latter, each data
project would have its own dedicated environment. Remember that the first significant migration
or new product delivery may require you to maintain several environments in order to support
the transition.
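However the environment is built, it helps to verify at deployment time that production actually matches the package list pinned during development. Below is a minimal sketch of such a check, assuming a hypothetical requirements.txt of exact pins (e.g., scikit-learn==1.4.2).

```python
from importlib.metadata import version, PackageNotFoundError
from pathlib import Path
import sys

def check_environment(requirements_file: str = "requirements.txt") -> bool:
    """Compare installed package versions against exact pins (name==version)."""
    ok = True
    for line in Path(requirements_file).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, expected = line.split("==", 1)
        try:
            installed = version(name)
        except PackageNotFoundError:
            print(f"MISSING  {name} (expected {expected})")
            ok = False
            continue
        if installed != expected:
            print(f"MISMATCH {name}: installed {installed}, expected {expected}")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_environment() else 1)
```

Running this as the first step of a deployment catches environment drift before it can silently change model behavior.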
ROLL-BACK STRATEGY
A roll-back strategy is required in order to return to a previous model version after the
latest version has been deployed.
ROLL-BACK STRATEGY
“When our model started showing a 3% drop in performance, we all
panicked. Getting back to a previous version of the model took us
over 4 days!”
THE SOLUTION
A successful roll-back strategy must include all aspects of the data project, such as:
1. Transformation Code
2. Data
3. Software Dependencies
4. Data Schemas
The roll-back will need to be executable by users who may not be trained in predictive technologies, so
it must be established as an accessible and easy-to-use procedure that could be implemented by an IT
Administrator. Roll-back strategies must be tested for usage in a test environment and be accessible in both
development and production environments.
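A minimal sketch of such a procedure is shown below, assuming releases are kept as the versioned ZIP bundles described earlier (code, data, dependency pins, and schemas packaged together) under a hypothetical releases/ directory; rolling back is then simply re-deploying the previous bundle.

```python
import zipfile
from pathlib import Path

RELEASES_DIR = Path("releases")   # hypothetical: one ZIP bundle per deployed version
DEPLOY_DIR = Path("/opt/model")   # hypothetical production location

def rollback(steps_back: int = 1) -> Path:
    """Redeploy an earlier release bundle (code, data, dependencies, schemas)."""
    bundles = sorted(RELEASES_DIR.glob("release_*.zip"))
    if len(bundles) <= steps_back:
        raise RuntimeError("No earlier release available to roll back to.")
    target = bundles[-1 - steps_back]

    with zipfile.ZipFile(target) as zf:
        zf.extractall(DEPLOY_DIR)  # overwrite the current deployment
    print(f"Rolled back to {target.name}")
    return target

if __name__ == "__main__":
    rollback()
```

Because the whole bundle is restored at once, an IT administrator can execute the roll-back without any knowledge of the model itself.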
THE SOLUTION
Formulating a failover strategy in a big data workflow presents some unique challenges, mostly due to the
sheer volume of the data involved. It’s not feasible to take a “rebuild” approach, as there is just too much
information to do this efficiently. Given this, a big data workflow must be “state-aware,” meaning that it must
make decisions based on a previously calculated state. As expected, ETL methods are typically not capable
of encoding this kind of logic.
The Ends: The end-points (start & finish) require particular attention, as they are frequently the weakest
points of a big data and analytics workflow. In a real-world environment, a weak point can be an FTP server,
an API, or even a daily e-mail with a CSV attachment. When working with end-points, your scripts should
meticulously check for error codes, file sizes, and so on;
Be Parallel: In our study, 90% of big data workflows have multiple branches, with an initial data input used
in two ways and then merged together to train the predictive model. This means that some processing can
be parallelized at the workflow level, as opposed to the cluster, MapReduce, or Spark level. As the product
evolves, it is likely that the number of branches will grow — using a parallel methodology helps to keep your
system fast;
Intelligent Re-execution: This simply means that data is automatically re-updated after a temporary
interruption in data input, such as a late update or temporarily missing data. For example, your big data
workflow may retrieve daily pricing data via FTP; your workflow combines this data with existing browser
and order data in order to formulate a pricing strategy. If this third-party data is not updated, the pricing
strategy can still be created using the existing up-to-date data... but, ideally, the update would be re-applied
automatically as soon as the missing data becomes available (see the sketch below).
User Interface: Graphically conveying a workflow enables users to more fully understand, and investigate,
the overall progress of the workflow. At some point a textual interface, or raw logs, reach their limit in terms
of being able to describe the big picture. When this happens, an easy-to-use Web-based user interface is the
best option.
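The Intelligent Re-execution principle can be sketched as a small state check around each workflow step: record a fingerprint of the input the last time the step ran, skip the step while the input is unchanged or missing, and re-run it automatically once the late data finally arrives. The file names and the fingerprinting choice (modification time plus size) below are illustrative assumptions.

```python
import json
from pathlib import Path

STATE_FILE = Path(".workflow_state.json")  # remembers what each step last consumed

def _fingerprint(path: Path) -> str:
    stat = path.stat()
    return f"{stat.st_mtime_ns}:{stat.st_size}"

def run_if_updated(step_name: str, input_file: Path, step) -> bool:
    """Run `step` only when its input changed since the last successful run."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    if not input_file.exists():
        print(f"[{step_name}] input missing, keeping previous result")
        return False
    current = _fingerprint(input_file)
    if state.get(step_name) == current:
        print(f"[{step_name}] input unchanged, skipping")
        return False
    step(input_file)                       # e.g., rebuild the pricing strategy
    state[step_name] = current
    STATE_FILE.write_text(json.dumps(state, indent=2))
    return True

if __name__ == "__main__":
    run_if_updated(
        "pricing",
        Path("incoming/daily_prices.csv"),          # hypothetical FTP drop location
        lambda f: print(f"rebuilding pricing strategy from {f}"),
    )
```

Scheduling this check frequently means the pricing step keeps its previous result while the feed is late and is rebuilt automatically, without a full workflow rebuild, as soon as the file lands.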
AUDITING
Auditing, in a data science environment, means being able to know which version of the code was used
to create each output. In regulated domains, such as healthcare and financial services,
organizations must be able to trace everything that is related to a data science workflow. In this context,
organizations must be able to:
• Trace any wrongdoing down to the specific person who modified a workflow for malicious purposes;
• Prove that there is no illegal data usage, particularly personal data;
• Trace the usage of sensitive data in order to avoid data leaks;
• Demonstrate quality and the proper maintenance of the data flow.
Non-regulated companies still must meet auditing requirements in order to understand exactly what’s
going on with their data and workflows, especially if they are compromised. The ramifications of not
implementing an auditing strategy are typically felt the most when a data science practice moves from the
arena of experimentation to actual real-world production and critical use cases.
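A minimal sketch of an audit trail that supports those requirements is shown below: every run appends a record linking an output to the code version, input data, user, and time that produced it. The file layout and field names are illustrative assumptions, and the script assumes the project is a Git repository.

```python
import getpass
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("audit_log.jsonl")  # append-only, one JSON record per line

def record_run(input_file: Path, output_file: Path) -> dict:
    """Append an audit record tying an output to the code, data, and user that produced it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": getpass.getuser(),
        "code_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "input_file": str(input_file),
        "input_sha256": hashlib.sha256(input_file.read_bytes()).hexdigest(),
        "output_file": str(output_file),
    }
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Because the log is append-only and keyed by commit hash and data hash, it can answer both "who changed this workflow?" and "which data produced this output?" after the fact.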
Volume Scalability: What happens when the volume of data you manage grows from a few gigabytes
to dozens of terabytes?
Request Scalability: What happens when the number of customer requests is multiplied by 100?
Complexity Scalability: What happens when you increase the number of workflows, or processes, from 1 to 20?
Team Scalability: Can your team handle scalability-related changes? Can they cooperate, collaborate,
and work concurrently?
1. Overnight Data Overflow: Multiple dependent batch jobs that last 1 or 2 hours tend to
eventually break the expected timespan, effectively running throughout the night and into the
next day. Without proper job management and careful monitoring, your resources could quickly
be consumed by out-of-control processes;
2. Bottlenecks: Data bottlenecks can pose a significant problem in any architecture, no matter
how many computing resources are used. Regular testing can help to alleviate this issue;
3. Logs and Bins: Data volume can grow quickly, but at the vanguard of data growth are logs and
bins. This is particularly true when a Hadoop cluster or database is full — when searching for a
culprit, always check the logs and bins first as they’re typically full of garbage.
We can simplify the journey from a prototyping analytics capability to robust productionized analytics with
the following steps:
• deploying models and entire workflows to the production environment in a fast and effective manner;
• monitoring and managing these models in terms of drift, and retraining them either regularly or according
to a predefined trigger; and
• ensuring that the models in production continue to serve their purpose as well as possible given changes
in data and business needs.
This last point is one that most organizations haven’t struggled with or even really encountered, but it’s vital
to keep in mind now, because sustaining the lifecycle of models in production is the price of successfully
deploying and managing them.
Even with measures of accuracy, sustainability becomes an issue. Regular manual checks for drift, even if
conducted monthly and in the most efficient manner, will soon become unwieldy as the number of models
that need to be checked multiplies. When you add monitoring for business metrics, the workload and
complexity are even more daunting.
And finally, data is constantly shifting. Data sources are being changed, new ones are added, and new insights
develop around this data. This means that models need to be constantly updated and refined in ways that
simple retraining doesn’t address, and this is where the bulk of your team’s effort on sustainability will need to
be focused.
In order to manage the lifecycle of models in a sustainable way, as well as to extend the lifecycle of these
models, you need to be able to:
1. Manage all of your models from a central place, so that there is full visibility into model performance. Have
a central location where you measure and track the drift of models via an API, and to the fullest extent
possible provide for automated retraining and updating of these models;
2. Build webapps and other tools to evaluate models against specific business metrics, so that everyone, from
the data scientists designing the models to the end users of analytics products, is aligned on the goals of the
models; and
3. Free up the time of data scientists and data engineers to focus on making models better and not only on
addressing drift and lagging performance of existing models.
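One common way to put a number on drift, and to trigger automated retraining when a threshold is crossed, is the population stability index (PSI) between a feature's training distribution and its recent production distribution. Below is a minimal sketch using NumPy; the 0.2 alert threshold is a widely used rule of thumb rather than a universal constant, and the sample data is synthetic.

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a baseline (training) sample and a recent (production) sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero / log of zero for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    baseline = rng.normal(0.0, 1.0, 10_000)     # feature values at training time
    recent = rng.normal(0.3, 1.0, 10_000)       # same feature, shifted in production
    psi = population_stability_index(baseline, recent)
    print(f"PSI = {psi:.3f}", "-> consider retraining" if psi > 0.2 else "-> stable")
```

Exposing a metric like this through a central API, as described in point 1 above, is what allows drift to be tracked across many models from one place.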
CONCLUSION
The ultimate success of a data science project comes down to contributions from individual team members
working together towards a common goal. As can be seen from the topics discussed, “effective contribution”
goes beyond specialization in an individual skill-set. Team members must be aware of the bigger picture
and embrace project level requirements, from diligently packaging both code and data to creating Web-
based dashboards for their project’s business owners. When all team members adopt a “big picture”
approach, they are able to help each other complete tasks outside of their comfort zone.
Data science projects can be intimidating; after all, there are a lot of factors to consider. In today’s
competitive environment, individual silos of knowledge will hinder your team’s effectiveness. Best practices,
model management, communications, and risk management are all areas that need to be mastered when
bringing a project to life. In order to do this, team members need to bring adaptability, a collaborative spirit,
and flexibility to the table. With these ingredients, data science projects can successfully make the transition
from the planning room to actual implementation in a business environment.
Everyday AI,
Extraordinary People
Elastic Architecture Built for the Cloud
450+ CUSTOMERS | 45,000+ ACTIVE USERS
Dataiku is the world’s leading platform for Everyday AI, systemizing the use of data for
exceptional business results. Organizations that use Dataiku elevate their people (whether
technical and working in code or on the business side and low- or no-code) to extraordinary,
arming them with the ability to make better day-to-day decisions with data.