Sebastian Neubauer: Data Science - The New Troublemaker in the Delivery Pipeline
„Data Science is
statistics on a Mac.“
@BigDataBorat
@sebineubauer
What is „Data Science“?
„Data Scientist (n.):
Person who is better at
statistics than any software
engineer and better at
software engineering than any
statistician.“ @josh_wills
My definition:
Operational decisions made by
humans are suboptimal
• humans are lazy:
• each and every decision should be made
optimally
• not making a decision is suboptimal
• simplifying decision making is suboptimal
Optimal decisions are increasingly
important
• Competition is getting harder:
• „online“ increases the pressure:
• markets are increasingly transparent
• competition gets global
• digitalization: technology is evolving faster:
• new players can catch up very fast (e.g.
Amazon)
• costs for new players are often very low
German retailers: <1% profit margin
(Deloitte, Global Powers of Retail)
Data Science can help
• Make optimal decisions
• unbiased
• take all information into account
• do complex computations
• Automation makes it possible to optimize all decisions
• as often as needed: daily, realtime…
• as granular as needed: each individual
product…
• actually make decisions that were not made
before
Troublemaker:
Why is „Data Science“ so
hard to integrate?
Q: „Are you already
doing this data
science stuff?“
A: „We have had a data
science team for
a year, but
progress is very slow“
A typical data science workflow
„Data science for
decision making is
deeply integrated
into the business
processes.“
Trouble on the input side
• data availability:
• the data needs to be available in machine
readable form
• as „realtime“ as possible (remember: we are
predicting the future):
• consistent data: sync different sources
• fully automated, no humans in the loop
• data quality:
• raw unaggregated data
• data quality discipline: missing values, typos…
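The data quality bullets above can be checked automatically before any model sees the data. A minimal sketch, assuming raw records arrive as dictionaries; the field names (`product_id`, `quantity`, `store`) and the set of known stores are hypothetical and only serve as illustration:

```python
from typing import Any

# Hypothetical reference data: the set of valid store names.
KNOWN_STORES = {"Berlin", "Hamburg", "Munich"}
# Hypothetical schema: fields every raw record is expected to carry.
REQUIRED_FIELDS = ("product_id", "quantity", "store")


def find_quality_issues(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems found in one raw record."""
    issues = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value == "":
            issues.append(f"missing value: {field}")
    store = record.get("store")
    if store and store not in KNOWN_STORES:
        # A value outside the reference set is often a typo.
        issues.append(f"unknown store: {store}")
    quantity = record.get("quantity")
    if isinstance(quantity, (int, float)) and quantity < 0:
        issues.append("negative quantity")
    return issues


records = [
    {"product_id": "A1", "quantity": 3, "store": "Berlin"},
    {"product_id": "A2", "quantity": None, "store": "Hamburg"},
    {"product_id": "A3", "quantity": 5, "store": "Munchen"},
]
# Map each product to the issues found in its record (empty list = clean).
report = {r["product_id"]: find_quality_issues(r) for r in records}
```

Running such a check inside the pipeline, rather than by hand, keeps humans out of the loop while still making bad input visible.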
Trouble on the output side
Increase of coherence and
entanglement
• data needed from different departments
• need to agree on common IDs
• common data types
• need to synchronize data flow
• data dependencies:
• hard to change data structures, since someone
downstream may use the data:
• change relations, types, names
• data cleaning, remove unused data
Data science is greedy by nature
In general, the outcome of a data science effort
gets better with…
• …more features („columns“)
• …more historic data („rows“)
• …more independent data sources (e.g.
weather, stock exchange data, social media
data…)
• …higher complexity of the algorithm (e.g. deep
learning)
Don’t blame the data scientists
Data science is resource intensive
Security issues
• which data is allowed where:
• is the data scientist allowed to use production
data on their laptop for analysis?
• may turnover/sales data be seen by different
departments?
• anonymization of restricted/personal data
• establish access control:
• a simple firewall is not enough anymore
• separate restricted data from allowed data
(e.g. passwords)
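One way to approach the anonymization and separation bullets above is to filter records before they leave the source system. A minimal sketch, assuming records are plain dictionaries; the field names and the key are hypothetical. Note that keyed hashing is pseudonymization, not full anonymization: whoever holds the key can link the pseudonyms back to known values.

```python
import hashlib
import hmac

# Assumption: the key stays inside the department that owns the data.
SECRET_KEY = b"replace-with-a-real-secret"
# Hypothetical field classification.
RESTRICTED_FIELDS = {"password"}              # never leaves the source system
PERSONAL_FIELDS = {"customer_id", "email"}    # replaced by a pseudonym


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a personal value by a stable keyed hash (HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_for_analysis(record: dict) -> dict:
    """Drop restricted fields and pseudonymize personal ones."""
    cleaned = {}
    for field, value in record.items():
        if field in RESTRICTED_FIELDS:
            continue
        if field in PERSONAL_FIELDS:
            cleaned[field] = pseudonymize(str(value))
        else:
            cleaned[field] = value
    return cleaned


raw = {
    "customer_id": "C-1001",
    "email": "a@example.com",
    "password": "hunter2",
    "turnover": 129.9,
}
safe = prepare_for_analysis(raw)
```

Because the same input always maps to the same pseudonym, records can still be joined on `customer_id` after this step.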
Resilience and failure
Big trouble!
Ways to successfully
integrate „Data Science“
Yes, it is a DevOps thing!
• alignment
• no walls
• delivery pipeline
• no silos
• value stream
„Among competing
hypotheses, the one with
the fewest assumptions
should be selected.“
„If the outcomes of two data science models
are comparable, take the one with the smaller
resource footprint.“
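This rule of thumb can be written down as a selection procedure. A minimal sketch, assuming each candidate model has already been evaluated; the `Candidate` fields, the 2% tolerance that defines "comparable", and all numbers are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    error: float  # validation error, lower is better
    cost: float   # resource footprint, e.g. CPU hours per training run


def pick_model(candidates, tolerance=0.02):
    """Among models within `tolerance` of the best error, take the cheapest."""
    best_error = min(c.error for c in candidates)
    comparable = [c for c in candidates if c.error <= best_error * (1 + tolerance)]
    return min(comparable, key=lambda c: c.cost)


# Invented numbers, for illustration only.
models = [
    Candidate("linear", error=10.5, cost=1.0),
    Candidate("gbt", error=10.1, cost=20.0),
    Candidate("deep", error=10.0, cost=400.0),
]
chosen = pick_model(models)
```

The tolerance is a business decision: a wider tolerance trades a little accuracy for a much smaller footprint.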
Need-to-know principle
• Only allow data to leave a department if it is
absolutely necessary:
• security and entanglement
• I know this is hard, the data scientists will
complain
• in principle they are right: more data -> better
results
• here is my suggestion:
• prepare a one-time extract of the data
• evaluate the impact of every single data
source
• based on the evaluation, decide for each source
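The suggested evaluation can be sketched as a leave-one-source-out comparison. This is one possible way to measure impact, not necessarily the author's; `evaluate` is a hypothetical function that trains a model on the given sources and returns its validation error. The toy stand-in and the 0.5 threshold below are invented so that the example runs:

```python
def source_impact(sources, evaluate):
    """Return, per source, how much the error grows when that source is removed."""
    all_sources = frozenset(sources)
    baseline = evaluate(all_sources)
    impact = {}
    for source in sources:
        impact[source] = evaluate(all_sources - {source}) - baseline
    return impact


# Toy stand-in for a real training run: each source lowers the error
# by a fixed, invented amount.
TOY_GAIN = {"sales_history": 4.0, "weather": 1.0, "social_media": 0.0}


def toy_evaluate(included):
    return 10.0 - sum(TOY_GAIN[s] for s in included)


impact = source_impact(list(TOY_GAIN), toy_evaluate)
# Hypothetical rule: only request sources that improve the error by at least 0.5.
keep = [source for source, gain in impact.items() if gain >= 0.5]
```

Only the sources in `keep` would then be requested as a permanent data feed; the rest stay inside their department.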
Digitalization
• Get human interaction out of the loop:
• machine readable data (hint: log files and
PDFs are for humans…)
• sensors, RFID, …the IoT stuff…
• Streamline data handling:
• let machines communicate over APIs
• agree upon data formats and interfaces
Modernize your IT infrastructure
• traditional IT is built for human operation:
• GUI centric
• Excel as data processing backbone
• Ticket driven operation
• provisioning and change is expensive:
• centralized, homogeneous architecture
• if you only have one hammer: tendency to use
the wrong tools
• modern architectures are decentralized &
heterogeneous:
• e.g. event sourcing: immutable events as
universal atomic data source
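To illustrate the event sourcing idea: state is never overwritten, it is derived by replaying an append-only log of immutable events. A minimal in-memory sketch; the `StockEvent` type and its fields are hypothetical, and a real system would use a durable log instead of a Python list:

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an event cannot be modified after creation
class StockEvent:
    product_id: str
    delta: int       # positive: goods received, negative: goods sold
    timestamp: str   # ISO 8601 string


class EventLog:
    """Append-only log: events are added, never updated or deleted."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay(self):
        # Hand out a tuple so callers cannot alter the log itself.
        return tuple(self._events)


def current_stock(events):
    """Derive the current stock per product by replaying all events."""
    stock = {}
    for event in events:
        stock[event.product_id] = stock.get(event.product_id, 0) + event.delta
    return stock


log = EventLog()
log.append(StockEvent("A1", 10, "2016-01-04T08:00:00"))
log.append(StockEvent("A1", -3, "2016-01-04T12:30:00"))
log.append(StockEvent("B2", 5, "2016-01-05T09:15:00"))
stock = current_stock(log.replay())
```

Each consumer, including a data science team, can build its own view from the same events without asking the producer to change its data structure.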
General roadmap
• Start by defining the problem: which decisions
should be automated and improved
• embed data scientists in a cross functional team
building the „decision making system“
• start with a minimum viable product
• measure and extrapolate the impact of the
delivered improved decisions, ideally in real
money
• based on this „freed money“ define the
acceptable costs: hardware, personnel and long
term maintenance costs due to the increased
complexity
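The last two roadmap steps amount to simple arithmetic. A minimal sketch, assuming the pilot result scales linearly with the number of stores and with time; the function names and all figures are invented for illustration:

```python
def annual_benefit(pilot_saving, pilot_days, pilot_stores, total_stores):
    """Extrapolate a saving measured in a pilot to all stores for one year.

    Assumption: the saving scales linearly with stores and days.
    """
    saving_per_day = pilot_saving / pilot_days
    scale = total_stores / pilot_stores
    return saving_per_day * scale * 365


def is_acceptable(benefit, hardware, personnel, maintenance):
    """Costs are acceptable if the freed money exceeds their yearly sum."""
    return benefit > hardware + personnel + maintenance


# Invented pilot: 6,000 saved in 30 days in 10 of 200 stores.
benefit = annual_benefit(pilot_saving=6000.0, pilot_days=30,
                         pilot_stores=10, total_stores=200)
worth_it = is_acceptable(benefit, hardware=100_000,
                         personnel=400_000, maintenance=200_000)
```

The linear extrapolation is the weakest part of such a calculation and should be re-measured as the rollout grows.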
Summary
• Data science aims to automate operational
decisions
• Data science has a big and growing potential
• Data science is greedy by nature:
• resource intensive
• increases entanglement
• Data science efforts need to be part of the value
stream:
• align with company goals
• compare costs with actual improvements
Thank You!
Attribution
• https://commons.wikimedia.org/wiki/File%3AKMeans-Gaussian-data.svg
• By Chire (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)],
via Wikimedia Commons
• https://www.flickr.com/photos/33466493@N04/15585845006/in/album-72157648523366230/
• By karsten.thoms [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0/)]