
GUIDEBOOK

Data Science Operationalization
Finding the Common Ground in 10 Steps
INTRODUCTION

In data science projects, the derivation of business value follows something akin to the Pareto Principle,
where the vast majority of the business value is generated not from the planning, the scoping, or even
from producing a viable machine learning model. Rather, business value comes from the final few steps:
operationalization of that project.

Operationalization simply means deploying a machine learning model for use across the organization.

More often than not, there is a disconnect between the worlds of development and production. Some teams may
choose to re-code everything in an entirely different language, while others may make changes to core elements,
such as testing procedures, backup plans, and programming languages.

Operationalizing analytics products can become complicated as different opinions and methods vie for
supremacy, resulting in projects that needlessly drag on for months beyond promised deadlines.

The goal of this guide is to explore common ground and introduce strategies and procedures
designed to bridge the gap between development and operationalization. The topics range from Best
Operating Procedures (managing environment consistency, data scalability, and consistent code and data
packaging) to Risk Management for unforeseen situations (roll-back and failover strategies). We also discuss
modelling (continuous retraining of models, A/B testing, and multivariate optimization) and implementing
communication strategies (auditing and functional monitoring).

Successfully building an analytics product and then operationalizing it is not an easy task — it becomes
twice as hard when teams are isolated and playing by their own rules.

This guide will help your organization find the common ground needed to empower your Data
Science and IT Teams to work together for the benefit of your data and analytics projects as a whole.



STRATEGY

CONSISTENT PACKAGING AND RELEASE


In the process of operationalization, there are multiple workflows: some internal flows correspond to production,
while some external or referential flows relate to specific environments. Moreover, data science projects
comprise not only code, but also data:

• Code for data transformation;
• Configuration and schema for data;
• Public referential data, such as postcodes / zip codes;
• Internal referential data, such as product category descriptions.

That’s why, to support the reliable transport of code and data from one environment to the next, they need to be
packaged together.

REAL WORLD USAGE

According to our statistics, a typical data science project over the course of a six-week timespan would include:

• An average of 700 lines of SQL;
• An average of 2,500 lines of Python;
• An average of 200 lines of JavaScript;
• An average of 350 lines of configuration (or scripts) that combine all of these elements.

WHY IS THIS IMPORTANT?

If you do not support the proper packaging of code and data, the end result is inconsistent code and data during
operationalization. These inconsistencies are particularly dangerous during training or when applying a predictive
model and, because they are quite difficult to detect, they can lead to a subtle degradation of the model’s
performance between development and production.

Without proper packaging, deployment to production can pose significant challenges.



APPROACH

CONSISTENT PACKAGING AND RELEASE

“Where did Mike put the 10GB reference file he used to create the model?
We only have the 10MB sample on Git!”

THE SOLUTION
1. The first step toward consistent packaging and release for operationalization is to establish
a versioning tool, such as Git, to manage all of the code versioning within your product.
2. The next step is to package the code and data together. Create packaging scripts that generate
snapshots, in the form of a ZIP file, of both the code and the data; these should be
consistent with the model (or model parameters) that you need to ship. Deploy that ZIP file to
production (a minimal sketch follows this list).
3. Lastly, be vigilant. Remain aware of situations where the data file size is too large (e.g., > 1GB).
In these scenarios, you need to snapshot and version the required data files in dedicated storage.
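As an illustration, a packaging script along these lines might look like the following Python sketch. The directory names, version scheme, and manifest layout are assumptions for the example, not conventions prescribed by any particular tool:

```python
# package_release.py - a minimal sketch of a packaging script; all paths
# and names here are illustrative assumptions.
import hashlib
import json
import subprocess
import zipfile
from pathlib import Path

CODE_DIR = Path("src")          # transformation code
DATA_DIR = Path("referential")  # postcodes, product categories, etc.

def md5(path: Path) -> str:
    """Checksum each data file so the package contents can be verified later."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def build_package(version: str) -> Path:
    manifest = {
        "version": version,
        # Record the exact code revision the snapshot was built from.
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        "data_checksums": {str(p): md5(p)
                           for p in DATA_DIR.rglob("*") if p.is_file()},
    }
    out = Path(f"release-{version}.zip")
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        # Ship code and data together so they cannot drift apart in transit.
        for p in list(CODE_DIR.rglob("*")) + list(DATA_DIR.rglob("*")):
            if p.is_file():
                zf.write(p)
        zf.writestr("MANIFEST.json", json.dumps(manifest, indent=2))
    return out

if __name__ == "__main__":
    print(f"Built {build_package('1.0.0')}")
```

The manifest makes the deployed ZIP self-describing: production can verify checksums against it, and the recorded commit ties the data snapshot back to the exact code version in Git.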



STRATEGY

CONTINUOUS RETRAINING OF MODELS


It is critical to implement an efficient strategy for the re-training, validation, and deployment of models. The
process needs to follow an internal workflow loop, be validated, and then passed on to production (with an API to
log real outcomes).

In data science projects, predictive models need to be updated regularly because:

• In a competitive environment, models need to be continuously enhanced, adjusted, and updated;
• The environment changes (e.g., new customers with new behaviors); and,
• The underlying data is constantly changing.

WHY IS THIS IMPORTANT?

If your organization has not implemented a “re-train in production” methodology, then re-training
a model becomes an actual “deploy to production” task... with the result requiring significant manpower
and a loss of agility. If relying on a manual approach, the model deployment could consume from 5
to 30 days of manpower.

In addition, the cost of not deploying a re-trained model could be huge. In a typical real-life situation,
a scoring model’s AUC could degrade by 0.01 per week due to the natural drift of input data (remember,
Internet user behavior changes along with its related data). This means that the hard-won 0.05 performance
improvement that was painstakingly tuned during project setup could disappear within a few weeks.

REAL WORLD USAGE

In our experience, there are three typical re-training scenarios:

1. People building and maintaining online learning systems. Today, online learning systems are
particularly popular for recommender systems and online advertising platforms;

2. People who automatically re-train and update their model (typically every 2.3 days); and,

3. People who re-train their model offline and then manually release it to production (typically
updated every 24 days). This group includes the greatest number of participants, who are unsure of
how many models they have put into production within the last 6 months.



APPROACH

CONTINUOUS RETRAINING OF MODELS


“I’ve re-played the cross validation score for the model and, after 4 weeks
in production, we’re back to square one. Should I tell my boss?”

THE SOLUTION
The solution to the re-training challenge lies in the data science production workflow. This means that you
need to implement a dedicated command for your workflow that does the following:

1. Re-trains the new predictive model candidate;
2. Re-scores and re-validates the model (this step produces the required metrics for your model); and
3. Swaps the old predictive model with the new one.

With regard to implementation, the re-train/re-score/re-validate steps should be automated and executed
every week. The final swap is then manually executed by a human operator who performs the final
consistency check. This approach provides a good balance between automation and reduced re-training cost
while maintaining the final consistency check.

AUTOMATED MODEL CHECKING

Automated model checking is the testing and comparison of metrics on old and new models, with a
new model put into production when its performance exceeds that of its predecessor. This comparison
is done after re-validation and should reflect current data (e.g., daily model updates). A more complex
method, called multivariate testing, can be used in scenarios that require the continuous testing of
multiple models.
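A weekly re-train/re-score/re-validate command of this kind might look like the following hedged Python sketch. The scikit-learn model, file paths, and choice of AUC as the comparison metric are assumptions for illustration; in keeping with the approach above, the final swap is deliberately left to a human operator:

```python
# retrain_and_check.py - a sketch of the weekly re-train/re-score/re-validate
# command plus automated model checking; names and paths are placeholders.
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

PRODUCTION_MODEL = "model_production.joblib"
CANDIDATE_MODEL = "model_candidate.joblib"

def retrain(X_train, y_train):
    """Step 1: re-train the new predictive model candidate on fresh data."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    joblib.dump(model, CANDIDATE_MODEL)

def revalidate(X_val, y_val):
    """Step 2: re-score both the candidate and the current production model
    on the same recent validation data, producing comparable metrics."""
    candidate = joblib.load(CANDIDATE_MODEL)
    current = joblib.load(PRODUCTION_MODEL)
    auc_candidate = roc_auc_score(y_val, candidate.predict_proba(X_val)[:, 1])
    auc_current = roc_auc_score(y_val, current.predict_proba(X_val)[:, 1])
    return auc_candidate, auc_current

def check_candidate(auc_candidate, auc_current) -> bool:
    """Automated model checking: flag the candidate for the manual swap
    (step 3) only when it beats the model currently in production."""
    print(f"candidate AUC={auc_candidate:.4f}, production AUC={auc_current:.4f}")
    return auc_candidate > auc_current
```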
STRATEGY

FROM A/B TESTING TO MULTIVARIATE OPTIMIZATION

The purpose of A/B testing different models is to be able to evaluate multiple models in parallel and then
compare expected model performance to actual results.

WHY IS THIS IMPORTANT?

Offline testing is not sufficient when validating the performance of a data product. Here are a few reasons why:

• In use cases such as credit scoring and fraud detection, only real-world tests can
provide the actual data output required. Offline tests are simply unable to convey real-time
events, such as credit authorizations (e.g., is the credit offering aligned with the
customer’s repayment ability?);
• A real-world production setup may be different from your development setup. As mentioned
above, data consistency is a major issue that results in misaligned production behavior;
• If the underlying data and its behavior are evolving rapidly, then it will be difficult to
validate the models fast enough to cope with the rate of change.



APPROACH

FROM A/B TESTING TO MULTIVARIATE OPTIMIZATION

“We put the model in production just after Christmas. The performance
dropped. It took us two weeks to convince the business team that it was
perfectly normal at this time of the season, and not due to the model’s
recent release.”

THE SOLUTION
There are three levels of A/B testing that can be used to test the validity of models:

1. Simple A/B testing;
2. Multi-armed bandit testing; and
3. Multi-variable armed bandit testing with optimization.

The first, simple A/B testing, is required for most companies engaged in digital activities, while the latter
is used primarily in advanced, competitive real-time use cases (e.g., real-time bidding/advertising and
trading).
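To make the second level concrete, here is a minimal epsilon-greedy multi-armed bandit sketch for routing live traffic between candidate models. The class name, the reward signal, and the epsilon value are assumptions for the example, not a prescribed implementation:

```python
# A minimal epsilon-greedy bandit for splitting traffic between models;
# model names and the reward definition are placeholders.
import random

class EpsilonGreedyRouter:
    def __init__(self, model_names, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {m: 0 for m in model_names}
        self.rewards = {m: 0.0 for m in model_names}

    def choose(self) -> str:
        """Explore a random model with probability epsilon; otherwise
        exploit the model with the best observed average reward."""
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))
        return max(self.counts,
                   key=lambda m: self.rewards[m] / max(self.counts[m], 1))

    def record(self, model: str, reward: float):
        """Feed back the real-world outcome (click, repayment, conversion...)."""
        self.counts[model] += 1
        self.rewards[model] += reward

# Usage sketch:
# router = EpsilonGreedyRouter(["model_a", "model_b"])
# name = router.choose()   # serve the prediction from this model
# router.record(name, 1.0) # once the real outcome is observed
```

Unlike a fixed 50/50 A/B split, the bandit progressively shifts traffic toward the better-performing model while still exploring, which is why this approach suits the real-time use cases mentioned above.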



STRATEGY

FUNCTIONAL MONITORING
Functional monitoring is used to convey the business performance of the model to the
business sponsors/owners. From a business perspective, functional monitoring is critical because it
provides an opportunity to demonstrate the end results of your predictive model and how it impacts the
product. The kind of functional information that can be conveyed is variable and depends largely on the
industry and use case. Examples of the kind of data displayed include the number of contacts in a case,
the number of broken flows in a system, and measurements of performance drift.

BUSINESS SPONSOR INVOLVEMENT

In most situations, business sponsors must have the capability to detect early signs of drift. This is possible
by enabling sponsors to easily review the model’s characteristics, view its history, determine the drift’s
validity, and then take appropriate action. For example, a marketing campaign reaching more customers from a
certain age group (e.g., 20-30 years old) could result in an inaccurate transformation ratio prediction due to the
relative inconsistency of that group’s consumer behavior.

In addition, business sponsors must be provided with access to high-level technical errors. For example, if a
pricing model is lacking data from a specific category, the business owner needs to be notified of the missing
data so that they are aware of factors that impact their strategies.

REAL WORLD USAGE

Some applications of functional monitoring based on industry or functional need include:

Fraud:
The number of predicted fraudulent events, the evolution of the prediction’s likelihood, the number of false
positive predictions, and rolling fraud figures;

Churn Reduction:
The number of predicted churn events, key variables for churn prediction, and the efficiency of marketing
strategies towards churners (e.g., opening rates of e-mails);

Pricing:
Key variables of the pricing model, pricing drift, pricing variation over time, pricing variation across
products, evolution of margin evaluations per day/year, and average transformation ratios.

WHY IS THIS IMPORTANT?

Knowledge transparency must be constantly shared and evangelized throughout an organization at every
opportunity. A lapse in communication can compromise the importance and the value of using machine learning
technology within your organization.



APPROACH

FUNCTIONAL MONITORING
“One day, our CEO spotted a funny recommendation on the company
website. We realised that part of the rebuild chain had been broken for 5
days without anyone noticing. Well, we decided to keep this to ourselves.”

THE SOLUTION
A successful communication strategy lies at the heart of any effective organization; such a strategy typically
combines multiple channels:

• A channel for the quick and continuous communication of events, where events are seamlessly
communicated to team members, such as: a new model in production; outliers in production; a drop
or increase in model performance over the last 24 hours; etc.
• An e-mail-based channel with a Daily Report. Such a report should be a succinct summary of key
data, such as: a subject line with core metrics; the top n customers matching specific model criteria;
three model metrics (e.g., a technical metric, a high-level long-term metric, and a short-term business
metric); etc.
• A web-based dashboard with drill-down capability; other channels should always include links to
the dashboard in order to drive usage.
• A real-time notification platform, such as Slack, is a popular option that provides flexible
subscription options to stakeholders. If building a monitoring dashboard, visualization tools such
as Tableau and Qlik are popular as well.
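As a small illustration of the notification channel, the following hedged Python sketch posts a daily model summary to a Slack incoming webhook. The webhook URL, the metric names, and the dashboard link are placeholders you would replace with your own:

```python
# A sketch of a daily-report notification via a Slack incoming webhook;
# the URL, metrics, and dashboard link below are placeholders.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def post_daily_report(auc_7d: float, n_outliers: int, top_segment: str):
    text = (
        "Daily model report\n"
        f"• 7-day AUC: {auc_7d:.3f}\n"
        f"• Outliers scored in production: {n_outliers}\n"
        f"• Top segment matching model criteria: {top_segment}\n"
        # Always link back to the dashboard to drive usage.
        "Full dashboard: https://dashboards.example.com/model (placeholder)"
    )
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```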



STRATEGY

IT ENVIRONMENT CONSISTENCY
The smooth flow of the modelling process relies heavily on the existence of a consistent IT environment
during development and production. Modern data science commonly uses technologies such as Python, R,
Spark, and Scala, along with open source frameworks/libraries such as H2O, scikit-learn, and MLlib.

In the past, data scientists used technologies that were already available in the production environment,
such as SQL databases, Java, and .NET.

In today’s predictive technology environment, it is not practical to translate a data science project to older
technologies like SQL, Java, and .NET, since doing so incurs substantial re-write costs. Consequently, 80% of
companies involved with predictive modelling use newer technologies such as Python and R.

WHY IS THIS IMPORTANT?

Putting Python or R into production poses its own set of unique challenges in terms of environment and
package management. This is the case due to the large number of packages typically involved; data science
projects rely on an average of 100 R packages, 40 Python packages, and several hundred Java/Scala packages
(most of which are behind Hadoop dependencies). Another challenge is maintaining version control in the
development environment; for example, scikit-learn receives a significant update about twice a year.



APPROACH

IT ENVIRONMENT CONSISTENCY
“I’m never trusting them with R in production again. Last time we
attempted to deploy a project with multiple R packages, it was literally a
nightmare.”

THE SOLUTION
Fortunately, there are multiple options available when establishing a consistent IT environment, such as:

• Use the built-in mechanisms in open source distributions (e.g., virtualenv and pip for Python) or rely
on third-party software (e.g., Anaconda for Python). Anaconda is becoming an increasingly
popular choice amongst Python users, with one-third of our respondents indicating usage. For
Spark, Scala, and R, the vast majority of the data science community relies solely on open
source options;
• Use a build-from-source system (e.g., pip for Python) or a binary mechanism (e.g., wheel). In
the scientific community, binary systems are enjoying increased popularity. This is partly due
to the difficulty involved in building an optimized library that leverages all of the capabilities of
scientific computing packages, such as NumPy;
• Rely on a stable release and a common package list (across all of your systems) or build a virtual
environment for each project (see the version-check sketch below). In the former, IT maintains a
common list of “trusted” packages and then pushes those packages to software development. In the
latter, each data project has its own dedicated environment. Remember that the first significant
migration or new product delivery may require you to maintain several environments in order to
support the transition.
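Whichever option you choose, a production job can defend itself against environment drift by checking installed packages against the pinned list before running. The following is a minimal Python sketch; the pinned versions and manifest format are illustrative assumptions (in practice you would read them from your requirements file):

```python
# A sketch that verifies a production environment against pinned package
# versions before a job runs; the pins below are illustrative only.
from importlib.metadata import version, PackageNotFoundError

PINNED = {  # in practice, load this from your requirements/lock file
    "scikit-learn": "1.3.2",
    "numpy": "1.26.4",
    "pandas": "2.1.4",
}

def check_environment(pins: dict) -> list:
    """Return a list of mismatches between installed and pinned versions."""
    problems = []
    for package, expected in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            problems.append(f"{package}: not installed (expected {expected})")
            continue
        if installed != expected:
            problems.append(f"{package}: {installed} installed, {expected} pinned")
    return problems

if __name__ == "__main__":
    for issue in check_environment(PINNED):
        print("ENVIRONMENT MISMATCH:", issue)
```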
STRATEGY

ROLL-BACK STRATEGY
A roll-back strategy is required in order to return to a previous model version after the
latest version has been deployed.

REAL WORLD USAGE

Since a subtle impact on your model’s production performance may only become visible after a few days or
weeks, a roll-back capability is required. For example, a model may show a 1% or 2% drop in
performance after a new release. In our studies, 82% of companies had to roll back at
least once after initially putting predictive modelling technologies into production.

WHY IS THIS IMPORTANT?


Without a functional roll-back plan, your team may face an existential crisis the first time something goes
wrong with the model. A roll-back plan is like an insurance policy that provides a second chance in the
production environment.



APPROACH

ROLL-BACK STRATEGY
“When our model started showing a 3% drop in performance, we all
panicked. Getting back to a previous version of the model took us
over 4 days!”

THE SOLUTION
A successful roll-back strategy must include all aspects of the data project, such as:

1. Transformation Code
2. Data
3. Software Dependencies
4. Data Schemas

The roll-back will need to be executable by users who may not be trained in predictive technologies, so
it must be established as an accessible and easy-to-use procedure that could be implemented by an IT
Administrator. Roll-back strategies must be tested for usage in a test environment and be accessible in both
development and production environments.
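As a hedged sketch of what an easy-to-use, administrator-friendly roll-back could look like, the following assumes the versioned-ZIP release layout from the packaging sketch earlier in this guide (unpacked releases next to a "current" symlink); the paths are placeholders:

```python
# rollback.py - a one-command roll-back an IT administrator could run;
# the release layout and paths are assumptions for illustration.
import sys
from pathlib import Path

RELEASES = Path("/opt/ml/releases")  # release-1.0.0/, release-1.1.0/, ...
CURRENT = Path("/opt/ml/current")    # symlink to the deployed release

def rollback(to_version: str):
    """Point production back at a previously deployed release. Because each
    release packages transformation code, data, dependencies, and schemas
    together, switching the link rolls all four back at once."""
    target = RELEASES / f"release-{to_version}"
    if not target.exists():
        raise SystemExit(f"release {to_version} not found in {RELEASES}")
    if CURRENT.is_symlink():
        CURRENT.unlink()
    CURRENT.symlink_to(target)
    print(f"production now serving release {to_version}")

if __name__ == "__main__":
    rollback(sys.argv[1])  # e.g., python rollback.py 1.0.0
```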



STRATEGY

ROBUST DATA FLOW


Preparing for the worst is part of intelligent strategizing; in the world of predictive analytics, this means
having a robust failover strategy. After all, robust data science applications must rely on failover and
validation procedures in order to maintain stability. A failover strategy’s job is to integrate all of the events in
the production system, monitor the system in case the cluster is performing poorly, and immediately alert IT if a
job is not working. The question that needs to be asked is: how do you script a process that will re-run or
recover in case of failure?

CHALLENGES IN STRATEGY FORMULATION

Creating a failover strategy, and developing relevant scripting solutions, requires an awareness of multiple
“What if...?” scenarios:

• What happens when some data is missing?
• What happens when there is a new column in the underlying data?
• What happens when there is a change in the data?
• What happens when the workflow relies on an external API (e.g., geocoding, social data), but the API
cannot be reached?
• What happens if the script’s computing time exceeds the expected limit? For example, the time for an
overnight job might oscillate from one hour to several hours, effectively making it difficult
to finish the job within an expected time range. There could be multiple underlying reasons behind the
time variations, such as a physical error on a server or resources being pulled by another process.

ADOPTING ETL STRATEGIES


In traditional business intelligence systems, ETL (extract - transform - load) provides some failover
mechanisms, but the failover and validation strategies depend on the intrinsic knowledge of the application.
In addition, ETL technologies do not connect well to Python, R, predictive models, and Hadoop. Some
levels of access are possible, but it is unlikely that the level of detail required for reliable scripting exists. For
example, when running on Hadoop, it is common to leverage the underlying MD5 and hash facilities in order
to check file consistency and store/manage the workflow. This capability is typically not easy to do with ETL.

BIG DATA STRATEGIES


Formulating a failover strategy in a big data workflow presents some unique challenges, mostly due to the
sheer volume of the data involved. It’s not feasible to take a “rebuild” approach, as there is just too much
information to do this efficiently. Given this, a big data workflow must be “state-aware,” meaning that it must
make decisions based on a previously calculated state. As expected, ETL methods are typically not capable
of encoding this kind of logic.



WHY IS THIS IMPORTANT?
Without a proper failover strategy, your data and analytics workflow will fail. The result will be a loss of credibility
for using a data science approach in your IT environment. It is important to dedicate time and attention to the
creation of your failover strategy, as failover strategies are notoriously difficult to perfect the first time around.

THE SOLUTION

The Ends: The end-points (start & finish) require particular attention, as they are frequently the weakest
points of a big data and analytics workflow. In a real-world environment, a weak point can be an FTP server,
an API, or even a daily e-mail with a CSV attachment. When working with end-points, your scripts should
meticulously check for error codes, file size, and so on;

Be Parallel: In our study, 90% of big data workflows have multiple branches, with an initial data input used
in two ways and then merged together to train the predictive model. This means that some processing can
be parallelized at the workflow level, as opposed to the cluster, MapReduce, or Spark level. As the product
evolves, it is likely that the number of branches will grow; using a parallel methodology helps to keep your
system fast;

Intelligent Re-execution: This simply means that data is automatically re-updated after a temporary
interruption in data input, such as a late update or temporarily missing data. For example, your big data
workflow may retrieve daily pricing data via FTP; your workflow combines this data with existing browser
and order data in order to formulate a pricing strategy. If this third-party data is not updated, the pricing
strategy can still be created using existing up-to-date data... but, ideally, the data would be re-updated when
the missing data becomes available.
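A minimal sketch of this pattern, combining the end-point checks above with a fall-back to the last known-good data, might look as follows in Python. The feed URL, file paths, and size threshold are placeholders, not a prescribed design:

```python
# A sketch of intelligent re-execution for a fragile end-point: fetch the
# daily third-party pricing file, validate it, and fall back to the last
# known-good copy if the update is late; URLs and paths are placeholders.
import shutil
import urllib.request
from pathlib import Path

FEED_URL = "https://feeds.example.com/daily_prices.csv"  # placeholder
FRESH = Path("data/daily_prices.csv")
LAST_GOOD = Path("data/daily_prices.last_good.csv")
MIN_BYTES = 10_000  # reject suspiciously small (truncated/empty) files

def fetch_daily_prices() -> Path:
    try:
        urllib.request.urlretrieve(FEED_URL, FRESH)
        # Meticulously check what came back: size here; error codes,
        # schema, and row counts would be checked the same way.
        if FRESH.stat().st_size < MIN_BYTES:
            raise ValueError("feed file too small, treating as missing")
        shutil.copy(FRESH, LAST_GOOD)  # remember the new known-good state
        return FRESH
    except Exception as exc:
        print(f"feed unavailable ({exc}); continuing with last good data")
        return LAST_GOOD  # workflow proceeds; re-run later to re-update
```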

User Interface: Graphically conveying a workflow enables users to more fully understand, and investigate,
the overall progress of the workflow. At some point a textual interface, or raw logs, reach their limit in terms
of being able to describe the big picture. When this happens, an easy-to-use Web-based user interface is the
best option.

USING A WORKFLOW FRAMEWORK


Some programming environments, such as Spark, provide a consistent workflow mechanism. This is a good
option if you do not have requirements for other technologies and if your data science processes are created by
software developers. For large integrated workflows, a programming framework such as Cascading may be a wise
choice if you want to implement a single framework / language.



STRATEGY

AUDITING
Auditing, in a data science environment, means being able to know which version of each output corresponds
to the code that was used to create it. In regulated domains, such as healthcare and financial services,
organizations must be able to trace everything that is related to a data science workflow. In this context,
organizations must be able to:

• Trace any wrongdoing down to the specific person who modified a workflow for malicious purposes;
• Prove that there is no illegal data usage, particularly personal data;
• Trace the usage of sensitive data in order to avoid data leaks;
• Demonstrate quality and the proper maintenance of the data flow.
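One lightweight way to support all four capabilities is an append-only audit trail recorded at every run. The following Python sketch is illustrative only; the record layout is an assumption, not a regulatory standard:

```python
# A sketch of an audit trail: every produced output is recorded with the
# code revision, input checksums, and the user who ran the flow.
import getpass
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("audit_log.jsonl")

def record_output(output_path: Path, input_paths: list):
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": getpass.getuser(),
        # The exact code revision that produced this output.
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        "output": str(output_path),
        "output_md5": hashlib.md5(output_path.read_bytes()).hexdigest(),
        "inputs": {str(p): hashlib.md5(p.read_bytes()).hexdigest()
                   for p in input_paths},
    }
    # Append-only JSON lines: each output can be traced back to the exact
    # code and data versions, and to the person who produced it.
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```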

WHY IS THIS IMPORTANT?


Failure to comply with auditing requirements, particularly in highly-regulated sectors, can have a profound
impact on smooth business continuity. Regulatory-sensitive organizations may run the risk of heavy fines and/
or the loss of a highly-coveted compliance status.

Non-regulated companies still must meet auditing requirements in order to understand exactly what’s
going on with their data and workflows, especially if they are being compromised. The ramifications of not
implementing an auditing strategy are typically felt the most when a data science practice moves from the
arena of experimentation to actual real-world production and critical use cases.



STRATEGY

PERFORMANCE AND SCALABILITY


Performance and scalability go hand-in-hand: as scalability limits are tested (more data / customers /
processes), performance needs to meet or exceed those limits. Strategically, the challenge lies in being
able to create an elastic architecture; the kind of environment that can handle significant transitions (e.g.,
from 10 calls per hour to 100k per hour) without disruption. As you push your data science workflow into
production, you need to consider appropriate increases in your production capability:

• Volume Scalability: What happens when the volume of data you manage grows from a few gigabytes
to dozens of terabytes?
• Request Scalability: What happens when the number of customer requests is multiplied by 100?
• Complexity Scalability: What happens when you increase the number of workflows, or processes, from 1 to 20?
• Team Scalability: Can your team handle scalability-related changes? Can they cooperate, collaborate,
and work concurrently?

WHY IS THIS IMPORTANT?


Obviously, there is no silver bullet to solve all scalability problems at once. Some real-world examples,
however, may help to illustrate the unique challenges of scalability and performance:

1. Overnight Data Overflow: Multiple dependent batch jobs that last 1 or 2 hours tend to
eventually break the expected timespan, effectively running throughout the night and into the
next day. Without proper job management and careful monitoring, your resources could quickly
be consumed by out-of-control processes;
2. Bottlenecks: Data bottlenecks can pose a significant problem in any architecture, no matter
how many computing resources are used. Regular testing can help to alleviate this issue;
3. Logs and Bins: Data volume can grow quickly, but at the vanguard of data growth are logs and
bins. This is particularly true when a Hadoop cluster or database is full; when searching for a
culprit, always check the logs and bins first, as they’re typically full of garbage.



STRATEGY

SUSTAINABLE MODEL LIFECYCLE MANAGEMENT
Previously in this document, we’ve discussed issues such as model monitoring and business sponsor
involvement, but as the number of analytics workflows deployed to production increases exponentially, the
issue of sustainability grows in urgency and importance.

We can simplify the journey from a prototyping analytics capability to robust productionized analytics with
the following steps:

• deploying models and entire workflows to the production environment in a fast and effective manner;
• monitoring and managing these models in terms of drift, and retraining them either regularly or according
to a predefined trigger; and
• ensuring that the models in production continue to serve their purpose as well as possible given changes
in data and business needs.

This last point is one that most organizations haven’t struggled with or even really encountered, but it’s vital
to keep in mind now, because sustaining the lifecycle of models in production is the price of successfully
deploying and managing them.

WHY IS THIS IMPORTANT?


Model management is often concerned with the performance of models, and the key metrics are generally
related to the accuracy of scored datasets. But the usefulness of a model is measured in terms of business
metrics; that is, if a model has excellent accuracy but no business impact, how could it be considered
useful? An example could be a churn prediction model which accurately predicts churn but provides no
insight into how to reduce that churn.

Even with measures of accuracy, sustainability becomes an issue. Regular manual checks for drift, even if
conducted monthly and in the most efficient manner, will soon become unwieldy as the number of models
that need to be checked multiplies. When you add monitoring for business metrics, the workload and
complexity are even more daunting.

And finally, data is constantly shifting. Data sources are being changed, new ones are added, and new insights
develop around this data. This means that models need to be constantly updated and refined in ways that
simple retraining doesn’t address, and this is where the bulk of your team’s effort on sustainability will need to
be focused.

©2020 Dataiku, Inc. | www.dataiku.com | [email protected] | @dataiku 18


THE SOLUTION

In order to manage the lifecycle of models in a sustainable way, as well as to extend the lifecycle of these
models, you need to be able to:

1. Manage all of your models from a central place, so that there is full visibility into model performance. Have
a central location where you measure and track the drift of models via an API, and to the fullest extent
possible provide for automated retraining and updating of these models (a drift-measure sketch follows
this list);
2. Build webapps and other tools to evaluate models against specific business metrics, so that everyone, from
the data scientists designing the models to the end users of analytics products, is aligned on the goals of
the models; and
3. Free up the time of data scientists and data engineers to focus on making models better, and not only on
addressing drift and lagging performance of existing models.
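As one hedged example of a drift measure that lends itself to automated tracking, here is a Population Stability Index (PSI) sketch comparing the score distribution at training time against production. The 0.2 alert threshold is a widely used rule of thumb, not a universal standard:

```python
# A sketch of the Population Stability Index (PSI), a common drift measure
# comparing training-time and production score distributions.
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI = sum((a% - e%) * ln(a% / e%)) over shared score bins."""
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the full score range
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid division by zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Usage: psi = population_stability_index(train_scores, prod_scores)
# A PSI above ~0.2 is commonly treated as significant drift worth a retrain.
```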
CONCLUSION
The ultimate success of a data science project comes down to contributions from individual team members
working together towards a common goal. As can be seen from the topics discussed, “effective contribution”
goes beyond specialization in an individual skill set. Team members must be aware of the bigger picture
and embrace project-level requirements, from diligently packaging both code and data to creating web-based
dashboards for their project’s business owners. When all team members adopt a “big picture”
approach, they are able to help each other complete tasks outside of their comfort zone.

Data science projects can be intimidating; after all, there are a lot of factors to consider. In today’s
competitive environment, individual silos of knowledge will hinder your team’s effectiveness. Best practices,
model management, communications, and risk management are all areas that need to be mastered when
bringing a project to life. In order to do this, team members need to bring adaptability, a collaborative spirit,
and flexibility to the table. With these ingredients, data science projects can successfully make the transition
from the planning room to actual implementation in a business environment.
Everyday AI,
Extraordinary People

Elastic Architecture Built for the Cloud

Machine Learning · Visualization · Data Preparation
[Product screenshot: interactive data preparation on a sample passenger dataset]
DataOps · Governance & MLOps · Applications

450+ CUSTOMERS | 45,000+ ACTIVE USERS

Dataiku is the world’s leading platform for Everyday AI, systemizing the use of data for
exceptional business results. Organizations that use Dataiku elevate their people (whether
technical and working in code or on the business side and low- or no-code) to extraordinary,
arming them with the ability to make better day-to-day decisions with data.

©2021 dataiku | dataiku.com


GUIDEBOOK
www.dataiku.com
