
Data Science – The New Troublemaker in the Delivery Pipeline
Sebastian Neubauer
@sebineubauer
Blue Yonder:
• Data-driven SaaS solutions for the retail industry:
• Replenishment
• Pricing
• Doing data science for more than 10 years
Me:
• Data scientist at Blue Yonder for 5 years
• Now working in a team that aims to increase the efficiency of the data science customer project teams
Agenda
• What is „Data Science“?
• Why is „Data Science“ so important?
• Troublemaker: why is it so hard to integrate?
• Ways to successfully integrate „Data Science“
• Summary
What is „Data Science“?
What is „Data Science“?

„Data Science is
statistics on a Mac.“
@BigDataBorat

@sebineubauer
What is „Data Science“?
„Data Scientist (n.):
Person who is better at
statistics than any software
engineer and better at
software engineering than any
statistician.“ @josh_wills
@sebineubauer
What is „Data Science“?
My definition:
„Data Science aims to build systems that support and automate data-driven operational decisions.“
@sebineubauer
Examples
What is the optimal price for
the clementines today?
How many apples should be
ordered for next week?
Why is „Data Science“ so
important?
Operational decisions done by humans are suboptimal
• various known biases:
• Bandwagon effect: The tendency to do things
because many other people do the same.
• Confirmation bias: The tendency to focus on
information in a way that confirms one's
preconceptions.
• Neglect of probability: The tendency to
completely disregard probability when making
a decision under uncertainty.

@sebineubauer
Operational decisions done by
humans are suboptimal
• humans are lazy:
• each and every decision should be made optimally
• not making a decision at all is suboptimal
• simplifying the decision making is suboptimal

@sebineubauer
Optimal decisions are increasingly important
• Competition is getting harder:
• „online“ increases the pressure:
• markets become increasingly transparent
• competition becomes global
• digitalization: technology is evolving faster:
• new players can catch up very fast (e.g. Amazon)
• costs for new players are often very low
German retailers: <1% profit margin
(Deloitte, Global Powers of Retail)

@sebineubauer
Data Science can help
• Make optimal decisions:
• unbiased
• take all information into account
• do complex computations
• Automation makes it possible to optimize all decisions:
• as often as needed: daily, realtime…
• as granular as needed: each individual product…
• actually make decisions that were not made before

@sebineubauer
Troublemaker:
Why is „Data Science“ so
hard to integrate?
Q: „Are you already doing this data science stuff?“
A: „We have had a data science team for a year, but progress is very slow.“
A typical data science workflow

@sebineubauer
„Data science for
decision making is
deeply integrated
into the business
processes.“
Trouble on the input side
• data availability:
• the data needs to be available in machine-readable form
• as „realtime“ as possible (remember: we are predicting the future)
• consistent data: sync different sources
• fully automated, no humans in the loop
• data quality:
• raw, unaggregated data
• data quality discipline: missing values, typos… (see the sketch below)

@sebineubauer
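To make the data quality bullet concrete, here is a minimal validation sketch in Python/pandas; the file name and the columns (store_id, product_id, date, sales) are illustrative assumptions, not a prescribed schema:

```python
# Minimal input-data validation sketch (hypothetical file and columns).
import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["date"])

# Data quality: how many rows lack the sales figure?
print(df["sales"].isna().sum(), "rows with missing sales")

# Consistency: exactly one record per (store, product, day)?
dupes = df.duplicated(subset=["store_id", "product_id", "date"]).sum()
print(dupes, "duplicate keys")

# Availability/freshness: is the newest record recent enough to predict from?
lag = pd.Timestamp.today().normalize() - df["date"].max()
print("data is", lag.days, "days old")
```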
Trouble on the output side
• potentially huge number of decisions
• how to validate:
• monitoring/alerting (see the sketch below)
• human approval?
• decisions need to trigger real actions:
• digital price tags
• automated ordering process

@sebineubauer
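What such a validation gate could look like, as a hedged sketch: implausible decisions are filtered out before they trigger real actions, and a human is alerted when too many show up. The price bounds and the 1% alert threshold are invented for illustration:

```python
# Gate automated decisions before they trigger real actions
# (digital price tags, automated orders). Bounds are illustrative.
def validate_prices(decisions, min_price=0.10, max_price=99.0):
    ok = [d for d in decisions if min_price <= d["price"] <= max_price]
    rejected = len(decisions) - len(ok)
    if rejected > 0.01 * len(decisions):  # too many outliers: stop and alert
        raise RuntimeError(f"{rejected} implausible prices - page a human")
    return ok  # only plausible prices are released

approved = validate_prices([{"product_id": 17, "price": 1.99},
                            {"product_id": 42, "price": 2.49}])
print(len(approved), "prices released to the stores")
```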
Increase of coherence and
entanglement
• data needed from different departments:
• need to agree on common IDs (see the data contract sketch below)
• common data types
• need to synchronize data flow
• data dependencies:
• hard to change data structures, since someone downstream may use the data:
• changing relations, types, names
• data cleaning, removing unused data
@sebineubauer
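One lightweight way to pin down the agreement on common IDs and data types is an explicit, versioned record type that all departments share. A sketch with purely illustrative field names:

```python
# A cross-department data contract as an explicit record type.
# Changing it becomes a visible, coordinated act instead of a surprise.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SalesRecord:      # schema v1
    store_id: int       # the ID scheme everyone agreed on
    product_id: int
    day: date
    units_sold: int

r = SalesRecord(store_id=7, product_id=12345, day=date(2017, 5, 1), units_sold=3)
print(r)
```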
Data science is greedy by nature
“The current DB should be sufficiently sized for the next year”
No data scientist ever!

@sebineubauer
Data science is greedy by nature
In general, the outcome of a data science effort
gets better with…
• …more features („columns“)
• …more historic data („rows“)
• …more independent data sources (e.g. weather, stock exchange data, social media data…)
• …higher complexity of the algorithm (e.g. deep
learning)
Don’t blame the data scientists

@sebineubauer
Data science is resource intensive

Example: training for a daily sales demand forecast for a supermarket
• 100 stores
• 5000 products
• 3 years of historic data
• 50 features: sales aggregates, product details, weather…

@sebineubauer
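A back-of-the-envelope calculation of what these numbers mean, assuming one row per store, product and day and 8-byte floats (both assumptions for illustration):

```python
# Rough size of the training matrix for the supermarket example.
rows = 100 * 5000 * 3 * 365              # stores x products x days of history
cols = 50                                # features
print(f"{rows:,} rows")                  # 547,500,000 rows
print(f"{rows * cols * 8 / 1e9:.0f} GB") # ~219 GB of raw feature data
```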
Data science is resource intensive

Example: daily predictions
• 100 stores
• 5000 products
• 14 daily forecast horizons

@sebineubauer
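The same arithmetic for the prediction side; the 1 ms per forecast is a made-up figure, purely to show the order of magnitude:

```python
# Daily prediction volume for the supermarket example.
per_day = 100 * 5000 * 14                 # stores x products x horizons
print(f"{per_day:,} forecasts per day")   # 7,000,000
# even at an assumed 1 ms per forecast that is ~2 h of single-core compute
print(f"~{per_day * 0.001 / 3600:.1f} h per day")
```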
Security issues
• which data is allowed where:
• is the data scientist allowed to use production data on their laptop for analysis?
• may turnover/sales data be seen by other departments?
• anonymization of restricted/personal data (see the sketch below)
• establish access control:
• a simple firewall is not enough anymore
• separate restricted data from allowed data (e.g. passwords)

@sebineubauer
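One common approach to the anonymization bullet is pseudonymization with a keyed hash, sketched below; managing the secret key outside the shared dataset is assumed, otherwise the mapping is reversible:

```python
# Replace personal IDs with a keyed hash before data leaves the department.
import hashlib
import hmac

SECRET_KEY = b"rotate-me-regularly"  # must never travel with the data

def pseudonymize(customer_id: str) -> str:
    return hmac.new(SECRET_KEY, customer_id.encode(), hashlib.sha256).hexdigest()

print(pseudonymize("customer-4711"))  # stable token, no personal data
```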
Resilience and failure
• What happens if no decisions are made due to a failure?
• Is your business going down?
• Easy answer: Yes. Due to the deep integration, without any resilience measures your business is likely to get seriously harmed.

Big trouble!
@sebineubauer
Ways to successfully
integrate „Data Science“
Yes, it is a DevOps thing!

alignment · no walls · delivery pipeline · no silos · value stream
lean thinking: small and quick changes with focused value to the end customer.
@sebineubauer
Yes, it is a DevOps thing!
• accept that data science is part of your value stream
• in return, the data science effort should strongly focus on the added value:
• evaluate the costs and compare them to the added value
• data science is not an end in itself
• a siloed data science team will not work out - never
• just adding data science without changing processes will most likely fail
• apply the same quality standards
@sebineubauer
Rule #1

Keep it Simple, Stupid
Ockham’s Razor:
„Among competing hypotheses, the one with the fewest assumptions should be selected.“
„If the outcomes of two data science models are compatible, take the one with the smaller resource footprint.“
@sebineubauer
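Applied to model selection, the rule could look like the sketch below; the metric, the numbers and the 1% tolerance are all illustrative assumptions:

```python
# Prefer the smaller model unless the complex one is measurably better.
def pick_model(simple, complex_, tolerance=0.01):
    if complex_["error"] < simple["error"] - tolerance:
        return complex_   # measurably better: pay for the complexity
    return simple         # otherwise Ockham wins

chosen = pick_model({"name": "linear model", "error": 0.120, "cores": 2},
                    {"name": "deep net",     "error": 0.118, "cores": 64})
print("use:", chosen["name"])   # -> linear model
```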
Need to know principle
• Only allow data to leave a department if it is absolutely necessary:
• security and entanglement
• I know this is hard, the data scientists will complain
• in principle they are right: more data -> better results
• here is my suggestion (sketched below):
• prepare a one-time excerpt of the data
• evaluate the impact of every single data source
• based on the evaluation, decide for each source
@sebineubauer
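The suggested evaluation can be run as a simple ablation loop; train_and_score below is a hypothetical placeholder for whatever model and metric the team actually uses, with invented numbers:

```python
# Measure the impact of each data source by leaving it out once.
sources = ["sales", "weather", "social_media"]

def train_and_score(used_sources):  # hypothetical helper, fake "lift" metric
    lift = {"sales": 0.25, "weather": 0.02, "social_media": 0.001}
    return sum(lift[s] for s in used_sources)

full = train_and_score(sources)
for s in sources:
    without = train_and_score([x for x in sources if x != s])
    print(f"dropping {s} costs {full - without:.3f} lift")
```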
Digitalization
• Get human interaction out of the loop:
• machine-readable data (hint: log files and PDFs are for humans…)
• sensors, RFID, …the IoT stuff…
• Streamline data handling:
• let machines communicate over APIs (see the sketch below)
• agree upon data formats and interfaces

@sebineubauer
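What machine-to-machine communication over an agreed format can look like, as a sketch; the endpoint URL and the field names are fictional:

```python
# Hand an order from the forecasting system to the ERP system as JSON.
import json
from urllib import request

order = {"store_id": 7, "product_id": 12345, "quantity": 40, "unit": "crate"}
req = request.Request("https://erp.example.com/api/orders",
                      data=json.dumps(order).encode(),
                      headers={"Content-Type": "application/json"})
# request.urlopen(req)  # left commented out: the endpoint is fictional
```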
Modernize your IT infrastructure
• traditional IT is built for human operation:
• GUI-centric
• Excel as the data processing backbone
• ticket-driven operation
• provisioning and change are expensive:
• centralized, homogeneous architecture
• if you only have one hammer: a tendency to use the wrong tools
• modern architectures are decentralized & heterogeneous:
• e.g. event sourcing: immutable events as a universal atomic data source (sketched below)
@sebineubauer
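A minimal event sourcing sketch: a state such as the current stock level is never stored, only derived from an immutable event log, so any past state can be replayed. The event shapes are illustrative:

```python
# Derive the current stock level from an immutable log of events.
events = [
    {"type": "delivered", "product_id": 12345, "qty": 40},
    {"type": "sold",      "product_id": 12345, "qty": 3},
    {"type": "sold",      "product_id": 12345, "qty": 1},
]

def stock(product_id, log):
    sign = {"delivered": +1, "sold": -1}
    return sum(sign[e["type"]] * e["qty"]
               for e in log if e["product_id"] == product_id)

print(stock(12345, events))  # -> 36
```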
General roadmap
• Start by defining the problem: which decisions should be automated and improved?
• embed data scientists in a cross-functional team building the „decision making system“
• start with a minimum viable product
• measure and extrapolate the impact of the delivered improved decisions, best in real money
• based on this „freed money“, define the acceptable costs: hardware, personnel and long-term maintenance costs due to the increased complexity (sketched below)
@sebineubauer
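How the „freed money“ calculation might look; every number below is invented, purely to show the mechanics of comparing measured value against costs:

```python
# Extrapolate the measured MVP impact and compare it to the running costs.
pilot_saving_per_store = 2000.0                       # EUR/month, measured in the MVP
stores = 100
yearly_value = pilot_saving_per_store * stores * 12   # 2.4M EUR/year

yearly_costs = 300_000 + 400_000                      # hardware + personnel/maintenance
print(f"freed money: {yearly_value - yearly_costs:,.0f} EUR/year")
```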
Summary
• Data science aims to automate operational
decisions
• Data science has a big and growing potential
• Data science is greedy by nature:
• resource intensive
• increases entanglement
• Data science efforts need to be part of the value
stream:
• align with company goals
• compare costs with actual improvements

@sebineubauer
Thank You!

We’re hiring: blue-yonder.com/careers

@sebineubauer