Dataiku - Data Science Operationalization
GUIDEBOOK
Operationalization
Finding the Common Ground in 10 Steps
INTRODUCTION
In data science projects, the derivation of business value follows something akin to the Pareto Principle,
where the vast majority of the business value is generated not from the planning, the scoping, or even
from producing a viable machine learning model. Rather, business value comes from the final few steps:
operationalization of that project.
More often than not, there is a disconnect between the worlds of development and production. Some
teams may choose to re-code everything in an entirely different language while others may make
changes to core elements, such as testing procedures, backup plans, and programming languages.
Operationalizing analytics products can become complicated as different opinions and methods vie for
supremacy, resulting in projects that needlessly drag on for months beyond promised deadlines.
The goal of this guide is to explore grounds for commonality and introduce strategies & procedures
designed to bridge the gap between development and operationalization. The topics range from Best
Operating Procedures (managing environmental consistency, data scalability, and consistent code & data
packaging) to Risk Management for unforeseen situations (roll-back and failover strategies). We also discuss modelling (continuous retraining of models, A/B testing, and multivariate optimization) and implementing communication strategies (auditing and functional monitoring).
Successfully building an analytics product and then operationalizing it is not an easy task — it becomes
twice as hard when teams are isolated and playing by their own rules.
This guide will help your organization find the common ground needed to empower your Data
Science and IT Teams to work together for the benefit of your data and analytics projects as a whole.
To support the reliable transport of code and data from one environment to the next, they need to be packaged together.
On average, a project includes 700 lines of SQL, 2,500 lines of Python, 200 lines of JavaScript, and 350 lines of configurations (or scripts) that combine all of these elements.
If you do not support the proper packaging of code and data, the end result is inconsistent code and data during
operationalization. These inconsistencies are particularly dangerous when training or applying a predictive model:
they are quite difficult to detect, and they can lead to a subtle degradation of the model’s performance between
development and production.
Without proper packaging, the deployment in production can pose significant challenges.
“Where did Mike put the 10GB reference file he used to create the model?
We only have the 10MB sample on Git!”
THE SOLUTION
1. The first step toward consistent packaging and release for operationalization is to establish
a versioning tool, such as Git, to manage all of the code versioning within your product.
2. The next step is to package the code and data together. Create packaging scripts that generate
snapshots, in the form of a ZIP file, of both the code and the data; these snapshots should be
consistent with the model (or model parameters) that you need to ship. Deploy that ZIP file to
production (see the sketch below).
3. Lastly, be vigilant. Remain aware of situations where the data file size is too large (e.g., > 1GB).
In these scenarios, you need to snapshot and version the required data files in a dedicated storage system.
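As an illustration of step 2, here is a minimal packaging sketch: it bundles the project code and reference data into one versioned ZIP and records the Git commit it was built from. The directory names (project_code/, data/reference.csv, models/model.pkl) are hypothetical placeholders, not a prescribed layout, and the script assumes the project lives in a Git repository.

```python
import subprocess
import zipfile
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical project layout; adjust to your own repository structure.
CODE_DIR = Path("project_code")
DATA_FILES = [Path("data/reference.csv"), Path("models/model.pkl")]

def package_release(output_dir: Path = Path("releases")) -> Path:
    """Snapshot code and data into one ZIP, tagged with the current Git commit."""
    commit = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    output_dir.mkdir(parents=True, exist_ok=True)
    archive = output_dir / f"release_{stamp}_{commit}.zip"

    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # Code: every file under the code directory.
        for path in CODE_DIR.rglob("*"):
            if path.is_file():
                zf.write(path, path.as_posix())
        # Data: the reference files the model was trained and validated with.
        for path in DATA_FILES:
            zf.write(path, path.as_posix())
        # Manifest so production knows exactly what it received.
        zf.writestr("MANIFEST.txt", f"commit={commit}\nbuilt_at={stamp}\n")
    return archive

if __name__ == "__main__":
    print(f"Packaged release: {package_release()}")
```

The same ZIP can then be deployed to production as a single artifact, which keeps code and data aligned by construction.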
THE SOLUTION
The solution to the re-training challenge lies in the data science production workflow. This means that you
need to implement a dedicated command for your workflow that does the following: re-trains the model on
the most recent data, re-scores it against a held-out validation set, and re-validates that its performance still
meets requirements before the new version is swapped into production.
With regards to implementation, the re-train/re-score/re-validate steps should be automated and executed
every week. The final swap is then manually executed by a human operator who performs the final
consistency check. This approach provides a good balance between automation and reduced re-training cost
while preserving that final human check.
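A minimal sketch of such a command is shown below. The retrain(), rescore(), and revalidate() functions are stand-ins for your own pipeline code, and the metric names and thresholds are illustrative assumptions; the new model is only staged for a manual swap, as described above.

```python
import json
from pathlib import Path

CANDIDATE_DIR = Path("models/candidate")

# Stand-ins for your own pipeline code; replace with real implementations.
def retrain():
    """Re-train the model on the most recent data and return it."""
    return {"weights": [0.1, 0.2, 0.3]}          # placeholder model object

def rescore(model) -> dict:
    """Re-score the model on a held-out validation set."""
    return {"auc": 0.88, "precision": 0.74}      # placeholder metrics

def revalidate(metrics: dict) -> bool:
    """Re-validate the metrics against agreed thresholds."""
    return metrics["auc"] >= 0.85

def weekly_retrain() -> None:
    model = retrain()
    metrics = rescore(model)
    passed = revalidate(metrics)

    CANDIDATE_DIR.mkdir(parents=True, exist_ok=True)
    (CANDIDATE_DIR / "model.json").write_text(json.dumps(model))
    (CANDIDATE_DIR / "metrics.json").write_text(
        json.dumps({"metrics": metrics, "passed": passed}, indent=2)
    )
    # The final swap into production stays manual: an operator reviews
    # metrics.json and promotes the candidate only after a consistency check.
    print(f"Candidate staged in {CANDIDATE_DIR}; validation passed: {passed}")

if __name__ == "__main__":
    weekly_retrain()
```

A script like this would typically be triggered once a week by a scheduler such as cron.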
• In use cases such as credit scoring and fraud detection, only real world tests can
provide the actual data output required. Offline tests are simply unable to convey real-
time events, such as credit authorizations (e.g., is the credit offering aligned with the
customer’s repayment ability?);
• A real-world production setup may differ from your development setup. As mentioned
above, data consistency is a major issue and can result in misaligned behavior in production;
• If the underlying data and its behavior are evolving rapidly, then it will be difficult to
validate the models fast enough to cope with the rate of change.
THE SOLUTION
There are three levels of A/B testing that can be used to test the validity of models:
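Whichever level is used, the common mechanism is routing a share of live traffic to a challenger model and comparing outcomes against the incumbent. Below is a minimal sketch of such a split; the stub models, the 10% challenger share, and the user-id hashing scheme are illustrative assumptions, not a prescribed design.

```python
import hashlib
import random

class _StubModel:
    """Stand-in for a real trained model; replace with your own."""
    def __init__(self, name: str):
        self.name = name
    def predict(self, features: dict) -> float:
        return random.random()  # placeholder score

champion_model = _StubModel("champion")
challenger_model = _StubModel("challenger")

CHALLENGER_SHARE = 0.10  # fraction of traffic routed to the challenger

def assign_variant(user_id: str) -> str:
    """Deterministically bucket a user so they always see the same model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 1000
    return "challenger" if bucket < CHALLENGER_SHARE * 1000 else "champion"

def score(user_id: str, features: dict) -> float:
    model = challenger_model if assign_variant(user_id) == "challenger" else champion_model
    return model.predict(features)

if __name__ == "__main__":
    print(assign_variant("customer-42"), score("customer-42", {"amount": 120.0}))
```

Hashing the user id (rather than drawing randomly per request) keeps each user on a single variant, which is what makes the comparison between champion and challenger meaningful.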
FUNCTIONAL MONITORING
Functional monitoring is used to convey the key aspects of a model’s performance to the
business sponsors/owners. From a business perspective, functional monitoring is critical because it
provides an opportunity to demonstrate the end-results of your predictive model and how it impacts the
product. The kind of functional information that can be conveyed is variable and depends largely on the
industry and use case. Examples of the kind of data displayed can include the number of contacts in a case,
the number of broken flows in a system, and measurements of performance drifts.
FUNCTIONAL MONITORING
“One day, our CEO spotted a funny recommendation on the company
website. We realised that part of the rebuild chain had been broken for 5
days without anyone noticing. Well, we decided to keep this to ourselves.”
THE SOLUTION
A successful communication strategy lies at the heart of any effective organization; such a strategy typically
combines multiple channels:
• A channel for the quick and continuous communication of events — these are channels where
events are seamlessly communicated to team members, such as: a new model in production;
outliers in production; a drop or increase in model performance over the last 24 hours, etc.
• An e-mail-based channel with a Daily Report. Such a report should be a succinct summary of key
data, such as: a subject line with core metrics; the top n customers matching specific model criteria; three
model metrics (e.g., a technical metric, a high-level long-term metric, and a short-term business
metric), etc.
• A web-based dashboard with drill-down capability; other channels should always include links to
the dashboard in order to drive usage.
• A real-time notification platform, such as Slack, is a popular option that provides flexible
subscription options to stakeholders. If building a monitoring dashboard, visualization tools such
as Tableau and Qlik are popular as well.
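As an illustration of the real-time channel, the sketch below posts a short model-performance event to a Slack incoming webhook. The webhook URL and the metric values in the example message are placeholders.

```python
import requests  # third-party library: pip install requests

# Placeholder: create an incoming webhook in Slack and paste its URL here.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify(message: str) -> None:
    """Send a one-line event to the team channel."""
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    response.raise_for_status()

if __name__ == "__main__":
    # Example event: a drop in model performance over the last 24 hours.
    notify(":warning: Fraud model AUC dropped from 0.91 to 0.86 over the last 24h.")
```

The same helper can be called from the retraining and monitoring jobs so that events such as "new model in production" reach stakeholders without any manual step.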
IT ENVIRONMENT CONSISTENCY
The smooth flow of the modelling process relies heavily on the existence of a consistent IT environment
during development and production. Modern data science commonly uses technologies such as Python, R,
Spark, Scala, along with open source frameworks/libraries, such as H2O, scikit-learn, and MLlib.
In the past, data scientists used technologies that were already available in the production environment,
such as SQL databases, Java, and .NET.
In today’s predictive technology environment, it is not practical to translate a data science project to older
technologies like SQL, Java, and .NET — doing so incurs substantial re-write costs. Consequently, 80% of
companies involved with predictive modelling use newer technologies such as Python and R.
IT ENVIRONMENT CONSISTENCY
“I’m never trusting them with R in production again. Last time we
attempted to deploy a project with multiple R packages, it was literally a
nightmare.”
THE SOLUTION
Fortunately, there are multiple options available when establishing a consistent IT environment, such as:
• Use the built-in mechanisms in open source distributions (e.g., virtualenv and pip for Python) or rely
on third-party software (e.g., Anaconda™ for Python). Anaconda™ is becoming an increasingly
popular choice amongst Python users, with one-third of our respondents indicating usage. For
Spark, Scala, and R, the vast majority of the data science community relies solely on open
source options;
• Use a build-from-source system (e.g., pip for Python) or a binary mechanism (e.g., wheels). In
the scientific community, binary systems are enjoying increased popularity. This is partly due
to the difficulty involved in building an optimized library that leverages all of the capabilities of
scientific computing packages, such as NumPy;
• Rely on a stable release and common package list (in all of your systems) or build a virtual
environment for each project. In the former, IT would rather maintain a common list of “trusted”
packages and then push those packages to software development. In the latter, each data
project would have its own dedicated environment. Remember that the first significant migration
or new product delivery may require you to maintain several environments in order to support
the transition.
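However the environment is built, it helps to verify at deployment time that production actually matches the package list pinned during development. Below is a minimal sketch of such a check, assuming a hypothetical requirements.txt of exact pins (e.g., scikit-learn==1.4.2).

```python
from importlib.metadata import version, PackageNotFoundError
from pathlib import Path
import sys

def check_environment(requirements_file: str = "requirements.txt") -> bool:
    """Compare installed package versions against exact pins (name==version)."""
    ok = True
    for line in Path(requirements_file).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, expected = line.split("==", 1)
        try:
            installed = version(name)
        except PackageNotFoundError:
            print(f"MISSING  {name} (expected {expected})")
            ok = False
            continue
        if installed != expected:
            print(f"MISMATCH {name}: installed {installed}, expected {expected}")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_environment() else 1)
```

Running this as the first step of a deployment catches environment drift before it can silently change model behavior.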
ROLL-BACK STRATEGY
A roll-back strategy is required in order to return to a previous model version after the
latest version has been deployed.
ROLL-BACK STRATEGY
“When our model started showing a 3% drop in performance, we all
panicked. Getting back to a previous version of the model took us
over 4 days!”
THE SOLUTION
A successful roll-back strategy must include all aspects of the data project, such as:
1. Transformation Code
2. Data
3. Software Dependencies
4. Data Schemas
The roll-back will need to be executable by users who may not be trained in predictive technologies, so
it must be established as an accessible and easy-to-use procedure that could be implemented by an IT
Administrator. Roll-back strategies must be tested for usage in a test environment and be accessible in both
development and production environments.
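A minimal sketch of such a procedure is shown below, assuming releases are kept as the versioned ZIP bundles described earlier (code, data, dependency pins, and schemas packaged together) under a hypothetical releases/ directory; rolling back is then simply re-deploying the previous bundle.

```python
import zipfile
from pathlib import Path

RELEASES_DIR = Path("releases")   # hypothetical: one ZIP bundle per deployed version
DEPLOY_DIR = Path("/opt/model")   # hypothetical production location

def rollback(steps_back: int = 1) -> Path:
    """Redeploy an earlier release bundle (code, data, dependencies, schemas)."""
    bundles = sorted(RELEASES_DIR.glob("release_*.zip"))
    if len(bundles) <= steps_back:
        raise RuntimeError("No earlier release available to roll back to.")
    target = bundles[-1 - steps_back]

    with zipfile.ZipFile(target) as zf:
        zf.extractall(DEPLOY_DIR)  # overwrite the current deployment
    print(f"Rolled back to {target.name}")
    return target

if __name__ == "__main__":
    rollback()
```

Because the whole bundle is restored at once, an IT administrator can execute the roll-back without any knowledge of the model itself.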
THE SOLUTION
Formulating a failover strategy in a big data workflow presents some unique challenges, mostly due to the
sheer volume of the data involved. It’s not feasible to take a “rebuild” approach, as there is just too much
information to do this efficiently. Given this, a big data workflow must be “state-aware,” meaning that it must
make decisions based on a previously calculated state. As expected, ETL methods are typically not capable
of encoding this kind of logic.
The Ends: The end-points (start & finish) require particular attention, as they are frequently the weakest
points of a big data and analytics workflow. In a real-world environment, a weak point can be an FTP server,
an API, or even a daily e-mail with a CSV attachment. When working with end-points, your scripts should
meticulously check for error codes, file sizes, and so on;
Be Parallel: In our study, 90% of big data workflows have multiple branches, with an initial data input used
in two ways and then merged together to train the predictive model. This means that some processing can
be parallelized at the workflow level, as opposed to the cluster, MapReduce, or Spark level. As the product
evolves, it is likely that the number of branches will grow — using a parallel methodology helps to keep your
system fast;
Intelligent Re-execution: This simply means that data is automatically re-updated after a temporary
interruption in data input, such as a late update or temporarily missing data. For example, your big data
workflow may retrieve daily pricing data via FTP; your workflow combines this data with existing browser
and order data in order to formulate a pricing strategy. If this third-party data is not updated, the pricing
strategy can still be created using the existing up-to-date data... but, ideally, the update would be re-applied
automatically as soon as the missing data becomes available (see the sketch below).
User Interface: Graphically conveying a workflow enables users to more fully understand, and investigate,
the overall progress of the workflow. At some point a textual interface, or raw logs, reach their limit in terms
of being able to describe the big picture. When this happens, an easy-to-use Web-based user interface is the
best option.
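The Intelligent Re-execution principle can be sketched as a small state check around each workflow step: record a fingerprint of the input the last time the step ran, skip the step while the input is unchanged or missing, and re-run it automatically once the late data finally arrives. The file names and the fingerprinting choice (modification time plus size) below are illustrative assumptions.

```python
import json
from pathlib import Path

STATE_FILE = Path(".workflow_state.json")  # remembers what each step last consumed

def _fingerprint(path: Path) -> str:
    stat = path.stat()
    return f"{stat.st_mtime_ns}:{stat.st_size}"

def run_if_updated(step_name: str, input_file: Path, step) -> bool:
    """Run `step` only when its input changed since the last successful run."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    if not input_file.exists():
        print(f"[{step_name}] input missing, keeping previous result")
        return False
    current = _fingerprint(input_file)
    if state.get(step_name) == current:
        print(f"[{step_name}] input unchanged, skipping")
        return False
    step(input_file)                       # e.g., rebuild the pricing strategy
    state[step_name] = current
    STATE_FILE.write_text(json.dumps(state, indent=2))
    return True

if __name__ == "__main__":
    run_if_updated(
        "pricing",
        Path("incoming/daily_prices.csv"),          # hypothetical FTP drop location
        lambda f: print(f"rebuilding pricing strategy from {f}"),
    )
```

Scheduling this check frequently means the pricing step keeps its previous result while the feed is late and is rebuilt automatically, without a full workflow rebuild, as soon as the file lands.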
AUDITING
Auditing, in a data science environment, means being able to know which version of the code was used
to create each output. In regulated domains, such as healthcare and financial services,
organizations must be able to trace everything that is related to a data science workflow. In this context,
organizations must be able to:
• Trace any wrongdoing down to the specific person who modified a workflow for malicious purposes;
• Prove that there is no illegal data usage, particularly personal data;
• Trace the usage of sensitive data in order to avoid data leaks;
• Demonstrate quality and the proper maintenance of the data flow.
Non-regulated companies still must meet auditing requirements in order to understand exactly what’s
going on with their data and workflows, especially if they are compromised. The ramifications of not
implementing an auditing strategy are typically felt the most when a data science practice moves from the
arena of experimentation to actual real-world production and critical use cases.
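A minimal sketch of an audit trail that supports those requirements is shown below: every run appends a record linking an output to the code version, input data, user, and time that produced it. The file layout and field names are illustrative assumptions, and the script assumes the project is a Git repository.

```python
import getpass
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("audit_log.jsonl")  # append-only, one JSON record per line

def record_run(input_file: Path, output_file: Path) -> dict:
    """Append an audit record tying an output to the code, data, and user that produced it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": getpass.getuser(),
        "code_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "input_file": str(input_file),
        "input_sha256": hashlib.sha256(input_file.read_bytes()).hexdigest(),
        "output_file": str(output_file),
    }
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Because the log is append-only and keyed by commit hash and data hash, it can answer both "who changed this workflow?" and "which data produced this output?" after the fact.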
Volume Scalability: What happens when the volume of data you manage grows from a few gigabytes
to dozens of terabytes?
Request Scalability: What happens when the number of customer requests is multiplied by 100?
Complexity Scalability: What happens when you increase the number of workflows, or processes, from 1 to 20?
Team Scalability: Can your team handle scalability-related changes? Can they cooperate, collaborate,
and work concurrently?
1. Overnight Data Overflow: Multiple dependent batch jobs that last 1 or 2 hours tend to
eventually break the expected timespan, effectively running throughout the night and into the
next day. Without proper job management and careful monitoring, your resources could quickly
be consumed by out-of-control processes;
2. Bottlenecks: Data bottlenecks can pose a significant problem in any architecture, no matter
how many computing resources are used. Regular testing can help to alleviate this issue;
3. Logs and Bins: Data volume can grow quickly, but at the vanguard of data growth are logs and
bins. This is particularly true when a Hadoop cluster or database is full — when searching for a
culprit, always check the logs and bins first as they’re typically full of garbage.
We can simplify the journey from a prototyping analytics capability to robust productionized analytics with
the following steps:
• deploying models and entire workflows to the production environment in a fast and effective manner;
• monitoring and managing these models in terms of drift, and retraining them either regularly or according
to a predefined trigger; and
• ensuring that the models in production continue to serve their purpose as well as possible given changes
in data and business needs.
This last point is one that most organizations haven’t struggled with or even really encountered, but it’s vital
to keep in mind now, because sustaining the lifecycle of models in production is the price of successfully
deploying and managing them.
Even with measures of accuracy, sustainability becomes an issue. Regular manual checks for drift, even if
conducted monthly and in the most efficient manner, will soon become unwieldy as the number of models
that need to be checked multiplies. When you add monitoring for business metrics, the workload and
complexity are even more daunting.
And finally, data is constantly shifting. Data sources are being changed, new ones are added, and new insights
develop around this data. This means that models need to be constantly updated and refined in ways that
simple retraining doesn’t address, and this is where the bulk of your team’s effort on sustainability will need to
be focused.
In order to manage the lifecycle of models in a sustainable way, as well as to extend the lifecycle of these
models, you need to be able to:
1. Manage all of your models from a central place, so that there is full visibility into model performance. Have
a central location where you measure and track the drift of models via an API, and to the fullest extent
possible provide for automated retraining and updating of these models;
2. Build webapps and other tools to evaluate models against specific business metrics, so that everyone, from
the data scientists designing the models to the end users of analytics products, is aligned on the goals of the
models; and
3. Free up the time of data scientists and data engineers to focus on making models better and not only on
addressing drift and lagging performance of existing models.
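One common way to put a number on drift, and to trigger automated retraining when a threshold is crossed, is the population stability index (PSI) between a feature's training distribution and its recent production distribution. Below is a minimal sketch using NumPy; the 0.2 alert threshold is a widely used rule of thumb rather than a universal constant, and the sample data is synthetic.

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a baseline (training) sample and a recent (production) sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero / log of zero for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    baseline = rng.normal(0.0, 1.0, 10_000)     # feature values at training time
    recent = rng.normal(0.3, 1.0, 10_000)       # same feature, shifted in production
    psi = population_stability_index(baseline, recent)
    print(f"PSI = {psi:.3f}", "-> consider retraining" if psi > 0.2 else "-> stable")
```

Exposing a metric like this through a central API, as described in point 1 above, is what allows drift to be tracked across many models from one place.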
CONCLUSION
The ultimate success of a data science project comes down to contributions from individual team members
working together towards a common goal. As can be seen from the topics discussed, “effective contribution”
goes beyond specialization in an individual skill-set. Team members must be aware of the bigger picture
and embrace project level requirements, from diligently packaging both code and data to creating Web-
based dashboards for their project’s business owners. When all team members adopt a “big picture”
approach, they are able to help each other complete tasks outside of their comfort zone.
Data science projects can be intimidating; after all, there are a lot of factors to consider. In today’s
competitive environment, individual silos of knowledge will hinder your team’s effectiveness. Best practices,
model management, communications, and risk management are all areas that need to be mastered when
bringing a project to life. In order to do this, team members need to bring adaptability, a collaborative spirit,
and flexibility to the table. With these ingredients, data science projects can successfully make the transition
from the planning room to actual implementation in a business environment.
Everyday AI,
Extraordinary People
Elastic Architecture Built for the Cloud
450+ CUSTOMERS | 45,000+ ACTIVE USERS
Dataiku is the world’s leading platform for Everyday AI, systemizing the use of data for
exceptional business results. Organizations that use Dataiku elevate their people (whether
technical and working in code or on the business side and low- or no-code) to extraordinary,
arming them with the ability to make better day-to-day decisions with data.