Roman's Data Science
How to monetize your data
Roman Zykov
Copyright © 2021 Roman Zykov
No part of this book may be reproduced, or stored in a retrieval system, or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, or otherwise, without express written
permission of the publisher.
PREFACE TO THE
ENGLISH EDITION
This book first hit the shelves in Russia on April 26, 2021 and was the
culmination of more than a year of hard work. The initial run sold out in
just three weeks, becoming a bestseller! This inspired me to have the book
translated into English and publish it on Amazon.com.
I would like to think that my experience in data analysis will be of interest
to readers outside of Russia. To help you understand this book better, I will
describe here some of the Russian companies that I refer to in the text.
These are Ozon.ru, Wikimart.ru and RetailRocket.ru.
Ozon.ru is one of the first e-commerce companies in Russia, sometimes
referred to as the “Amazon of Russia.” The company was established in
1998, and by 2019 it was among the top three online retailers in the country.
In 2020, Forbes ranked Ozon the third most valuable Russian internet
company.
Wikimart.ru is an online B2C marketplace that enables Russian retailers to
sell a wide range of products, from electronics to apparel. The company
closed in 2016.
Retailrocket.ru offers online shopping recommendation services based on
user behaviour. I founded the company in 2012 alongside two of my
partners. The company has proved to be extremely successful, with offices
in Russia, Western Europe and South America providing services for over
1,000 clients. I left Retailrocket.ru in 2021.
The book was translated into English by RPK Group. RPK's translators,
Philip Taylor and Alexander Alexandrov, helped me get my messages across
and make the book read as an authentic English text.
INTRODUCTION
ARCHIMEDES
TABLE OF CONTENTS
Introduction
Preface to the English edition
Who is This Book For?
How to Read this Book
Chapter 1. How We Make Decisions
Four Hundred Relatively Honest Ways
What We Can Learn from Amazon
Analysis Paralysis
Mistakes and the Caliper Rule
The Pareto Principle
Can We Make Decisions Based on Data Alone?
Chapter 2. Let’s Do Some Data Analysis
Data Analysis Artifacts
Business Intelligence Artifacts
Insights and Hypotheses
Reports, Dashboards and Metrics
Machine Learning Artifacts
Data Engineering Artifacts
Who Analyses Data?
The Perfect Button
Sell Analytics Internally
The Conflict Between the Researcher and Business
The Weaknesses of a Statistical Approach in Data Analysis
Chapter 3. Building Analytics from Scratch
Step One
Choosing the Tech
Let’s Talk about Outsourcing
Hiring and Firing (or Resigning)
Who Do Analysts Answer To?
Should the Head of Analytics Write Code?
Task Management
How to Get the Best Out of Daydreamers
Chapter 4. How about Some Analytical Tasks?
PART I
How to Set Tasks for Data Scientists
How to Check Tasks
How to Test and Introduce Changes into a Working System
Justifying the Task for the Originator
Do You Need to Know How to Code?
PART II
Datasets
Descriptive Statistics
Diagrams
A General Approach to Data Visualization
Pair Data Analysis
Technical Debt
Chapter 5. Data
How We Collect Data
Big Data
Data Connectivity
There’s No Such Thing as Too Much Data
Data Access
Data Quality
Checking and Monitoring Data Quality
Data Types
File Storage Formats
Ways to Retrieve Data
Chapter 6. Data Warehouses
Why Do We Need Data Warehouses?
Data Warehouse Layers
What Kinds of Data Warehouse Exist?
How Data Makes It Into the Warehouse
Hadoop and MapReduce
Spark
Optimizing Work Speed
Data Archiving and Obsolescence
Monitoring Data Warehouses
My Personal Experience
Chapter 7. Data Analysis Tools
Spreadsheets
Notebooks
Visual Analysis Tools
Statistical Software
Working With Data in the Cloud
What Makes a Good Reporting System?
Pivot Tables
OLAP Cubes
Enterprise and Self-Service BI Systems
My Experience
Chapter 8. Machine Learning Algorithms
Types of ML Problems
ML Task Metrics
ML from the Inside
Linear Regression
Logistic Regression
Decision Trees
Learning Errors
What to Do about Overfitting
Ensemble Methods
Chapter 9. The Practice of Machine Learning
How to Learn Machine Learning
ML Competitions
Artificial Intelligence
Required Data Transformations
The Accuracy and Cost of ML Solutions
The Simplicity of the Solution
The Amount of Work Involved in Checking the Result
Mechanical Turk / Yandex Toloka
ML and Big Data
Recency, Frequency and Monetary
Conclusion
Chapter 10. Implementing ML in Real Life: Hypotheses and Experiments
Hypotheses
Hypothesis Testing: Planning
What Is a Hypothesis in Statistics?
The Statistical Significance of Hypotheses
Statistical Criteria for P-Values
Bootstrapping
Bayesian Statistics
A/B Tests in the Real World
A/A Tests
A Few More Words about A/B Tests
Setting up an A/B Test
Experiment Pipeline
Chapter 11. Data Ethics
How We Are Being Watched
Good and Bad Data Usage
The Problem of Data Leakage
Data Ethics
How User Data is Protected
Chapter 12. Challenges and Startups
Web Analytics in Advertising
Internal Web Analytics
Database Marketing
Startups
My Personal Experience
Chapter 13. Building a Career
Starting Your Career
How to Find a Job
Requirements for Candidates
You’ve Accepted an Offer
Continuing Professional Development
How to Change Jobs
Do You Need to Know Everything?
Epilogue
About the Author
Acknowledgements
Bibliography
WHO IS THIS BOOK FOR?
This book is for thinking readers who want to try their hand at data
analysis and develop data-driven services.
If you are an investor, this book will help you to better understand the
potential and infrastructure of the startup team asking you for your
investment.
If you are itching to get your startup off the ground, this book will help you
find the right partners and hire the right team from the outset.
If you are starting your career, this book will broaden your horizons and
help you start applying approaches that you may not have considered.
HOW TO READ THIS BOOK
The beauty of this book is that it does not need to be read in order. The
following is a short description of each chapter:
Chapter 1. "How We Make Decisions" describes the general principles of
decision-making and how data affects decisions.
Chapter 2. "Let’s Do Some Data Analysis" introduces general concepts:
What artifacts do we deal with when analysing data? In this chapter, I also
start to raise some organizational issues relating to data analysis.
Chapter 3. "Building Analytics from Scratch" describes the process of
building analytics, from the first tasks to the choice of technology and
hiring personnel.
Chapter 4. "How about Some Analytical Tasks?" This chapter is all about
tasks. What is a good analytical task? And how can we test it? The technical
attributes of such tasks are datasets, descriptive statistics, graphs, pair
analysis and technical debt.
Chapter 5. "Data" covers everything you ever wanted to know about data –
volume, access, quality and formats.
Chapter 6. "Data Warehouses" explains why we need data warehouses and
what kind of data warehouses exist. The chapter also touches upon the
popular Big Data systems Hadoop and Spark.
Chapter 7. "Data Analysis Tools" describes the most popular analytical
methods, from Excel spreadsheets to cloud systems.
Chapter 8. "Machine Learning Algorithms" provides a basic introduction to
machine learning.
Chapter 9. "The Practice of Machine Learning" shares practical tips on how to
study machine learning and how to apply it in ways that are actually useful.
Chapter 10. "Implementing ML in Real Life: Hypotheses and Experiments"
describes three types of statistical analysis of experiments (Fisher statistics,
Bayesian statistics and bootstrapping) and the use of A/B tests in practice.
Chapter 11. "Data Ethics". I could not ignore this topic. Our field is
becoming increasingly regulated by the state. Here we will discuss the
reasons why.
Chapter 12. "Challenges and Startups" describes the main tasks that I faced
in my time in ecommerce, as well as my experience as a co-founder of
Retail Rocket.
Chapter 13. "Building a Career" is aimed more at beginners – how to look
for a job, develop as an analyst and when to move on to something new.
CHAPTER 1. HOW WE MAKE
DECISIONS
The first principle is that you must not fool yourself – and you are the
easiest person to fool. So you have to be very careful about that. After
you’ve not fooled yourself, it’s easy not to fool other scientists. You just
have to be honest in a conventional way after that […] I have just one
wish for you – the good luck to be somewhere where you are free to
maintain the kind of integrity I have described, and where you do not
feel forced by a need to maintain your position in the organization, or
financial support, or so on, to lose your integrity. May you have that
freedom.
Monetizing data is only possible when we make the right decisions based
on that data. That said, it is a bad idea to base your decisions on statistics
alone – at the very least, you need to know how to read between the lines
and listen to what your gut is telling you. The first chapter of this book is
thus given over to a discussion of the principles that I use when making
data-driven decisions. These are principles that I have tried and tested over
the course of many years and I can assure you that they work.
Decision-making is difficult. So difficult, in fact, that scientists have even
come up with a term for it – “decision fatigue” [7] . The stress of making
hundreds of choices every day builds up to the point where we are so
exhausted by the need to constantly make decisions that we just give up and
start choosing at random. There’s a reason I quoted the brilliant Nobel Prize
winning physicist Richard Feynman at the beginning of this chapter, as his
words have a direct bearing both on data analytics and on our lives in
general.
How can we make the correct decisions while at the same time remaining
true to ourselves?
In his book, Behave: The Biology of Humans at Our Best and Worst ,
neuroscientist and Stanford University professor Robert Sapolsky [1] notes
that our actions, and thus our decisions, are influenced by a multitude of
factors: the environment in which we grew up, childhood trauma, head
injuries, hormones, feelings, emotions, etc. We are constantly influenced by
all kinds of processes that we are not even aware of. We are incapable of
being objective!
It always seemed obvious to me that it is far easier for us to be biased
and thus cut corners than it is to be objective, because objectivity requires
serious effort.
Keep that in mind the next time you give someone a set of figures and ask
them to make a decision. As my colleagues have to remind me from time to
time, I too am guilty of letting my biases get the better of me when looking
at the results of certain A/B tests. Their reminders help me stay grounded, and I
inevitably end up agreeing with them, as objectivity is always more
important than any a priori decisions I may make before we have even
conducted an experiment.
In today's world, we are forced to make decisions quickly and under a great
deal of uncertainty. This is not inherently a bad thing. Quantum
physics teaches us that we do not know the exact location of an electron,
but we do know the probability of finding it. In fact, the entire discipline of
quantum physics is based on such probabilities. And it is the exact same
thing with decisions – no one knows the truth; we’re just trying to guess
what is “right” and hope for the best. And that’s where data comes in, to
increase the likelihood of our making the right decisions.
Analysis Paralysis
Haste makes waste. The biggest mistakes I have ever made can all be put
down to my being in a hurry. For example, 15 years ago, I was hired by
Ozon.ru to build up its analytics division. Part of my job was to compile a
huge weekly data sheet on the company’s activities, without actually being
able to verify the information on a regular basis. Pressure from above,
together with the time constraints, meant that the weekly reports were
riddled with errors, which took a great deal of time and effort to rectify.
The modern world moves at breakneck speeds. Compiling metrics,
however, requires careful attention, and this takes time. Of course, this does
not mean that the opposite situation – “analysis paralysis,” where an
inordinate amount of time is spent on every single figure on the sheet – is
desirable either. Sometimes, the desire to make the right choice leads to this
very state of analysis paralysis, where it becomes impossible to make any
decision whatsoever. The uncertainty about what the decision will lead to is
prohibitively high, or the framework for making the decision too rigid. An
easy way to fall into analysis paralysis is to approach decision-making in a
purely rational manner, guided by logic only.
Graeme Simsion’s The Rosie Project (a favourite of Bill Gates and his wife)
illustrates this idea perfectly. The novel tells the story of Don Tillman, a
successful young genetics professor who is eager to find himself a wife but
has never managed to get further than the first date. He thus concludes that
the traditional way of finding a soulmate is inefficient and decides to take a
scientific approach to the problem. The first stage of his so-called “Wife
Project” is a detailed 30-page questionnaire designed to weed out unsuitable
candidates and identify the perfect mate. Obviously, no one can satisfy such
a list of requirements. Don is then introduced to a girl who does not
demonstrate any of the traits of his perfect woman. No prizes for guessing
who Don ends up with.
Another example is buying a car. The last time I bought a car, I drew up a
massive spreadsheet in Excel listing all the technical parameters of the cars
I had my eye on, even down to the size of the boot in centimetres! I spent a
whole year pondering the decision, going to showrooms, checking out cars,
yet I ended up buying one that wasn’t even on my list – one that spoke to
me. Well, I guess I didn’t follow my heart entirely, because that year of
searching and analysing helped me realize what was really important on
that list and what wasn’t.
I could give you an example from my professional life too, one that is
connected with hypotheses or, more precisely, with tests. Imagine you’ve
come up with a new recommendation algorithm that you think is better than
the existing one and you want to test it out. So, you compare the two
algorithms on ten websites. In four cases, your new algorithm proves to be
superior, in two it fares worse, and in four there is nothing between the two.
Do you phase out the old algorithm in favour of the new one? It all depends
on the criteria you devised for making the decision before comparing the
two algorithms. Was the new algorithm supposed to be better in all cases?
Or just most of the time? In the first case, you are likely to bury yourself in
endless iterations of the algorithm, honing it to perfection, especially given
the fact that the tests will take several weeks to complete. This is a classic
case of “analysis paralysis.” As for the second case, the task would appear
to be far more straightforward. But practice tells us otherwise.
I believe there should always be an element of conscious risk when it
comes to making decisions; we should be able to make a decision even if
we don’t have all the information. These days, we can’t afford to spend time
mulling over our decisions. The world is moving too quickly to afford us
that luxury. If you don’t make a decision, then someone else – a competitor
– will make it for you.
The Pareto Principle
I've been working with data for almost 20 years and every day I am
convinced that this principle holds true. At first, it might seem like an
excuse to cut corners. But the fact is that you’ve got to give 100% in order
to learn exactly which 20% will produce the desired results. Back in 1998,
Steve Jobs told Business Week, “Simple can be harder than complex: you
have to work hard to get your thinking clean to make it simple. But it’s
worth it in the end because once you get there, you can move mountains.”
Let me give you an example of how the Pareto principle can be applied in
machine learning. For any given project, a number of features are typically
readied to help test a model. There can be dozens of these features.
Launching the model in this state will be problematic, as hundreds of lines
of code will be required to maintain it. There’s a way around this – to
calculate the relative importance of each feature and discard those that do
not pull their weight. This is a direct application of the Pareto principle –
80% of the model’s accuracy is produced by 20% of its features. In most
cases, it is better to simplify the model by sacrificing an almost
imperceptible amount of its accuracy. The upside to this is that the resulting
project will be several times smaller than the original one. In practice, you
can save time by looking at the features used to tackle similar problems on
kaggle.com, taking the best ones and implementing them in the alpha
version of your own project.
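To make this idea concrete, here is a minimal sketch of the approach in Python with scikit-learn, using synthetic data rather than a real project (the 80% cut-off is an illustrative assumption, not a universal rule): the features are ranked by their relative importance, and only the smallest subset that carries most of the signal is kept.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic dataset: 40 candidate features, only a handful genuinely informative
X, y = make_regression(n_samples=1000, n_features=40, n_informative=8, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Rank features by importance and keep the smallest set covering ~80% of it
order = np.argsort(model.feature_importances_)[::-1]
cumulative = np.cumsum(model.feature_importances_[order])
keep = order[: np.searchsorted(cumulative, 0.80) + 1]
print(f"Keeping {len(keep)} of {X.shape[1]} features:", sorted(keep.tolist()))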
CHAPTER 2. LET’S DO SOME
DATA ANALYSIS
DJ Patil points out that the strongest data-driven organizations all follow
the principle “if you can’t measure it, you can’t fix it.” This paves the way
for a number of recommendations:
Collect as much data as you can, whether you're doing business
intelligence or building products.
Measure in a proactive and timely manner.
Get as many people in the organization as possible to look at the
data. Multiple eyes help you identify obvious problems quicker.
Foster increased curiosity among employees about why the data has
or has not changed.
I will return to this topic in the chapter on data. Right now, it is time to talk
about what we get at the end of the data analysis process.
Data analysis and machine learning courses have a useful role to play, but
they can be compared to model ships: such courses are as far from real-life
ML applications as model ships are from real ships.
Former Basecamp data scientist Noah Lorang wrote in his blog [16] :
The dirty little secret of the ongoing ‘data science’ boom is that
most of what people talk about as being data science isn’t what
businesses actually need. Businesses need accurate and
actionable information to help them make decisions about how
they spend their time and resources. There is a very small
subset of business problems that are best solved by machine
learning; most of them just need good data and an
understanding of what it means that is best gained using simple
methods.
I can personally vouch for every single word here. Sadly, there’s a lot of
hype around our profession. YouTubers promote data science courses,
presidents talk about AI, and Tesla’s stock goes up every day. But young
specialists who want to build spaceships eventually meet companies that
want to make money. In the following chapters, I will discuss ways to
reconcile their interests.
There are things that can be measured. There are things that
are worth measuring. But what can be measured is not always
what is worth measuring; what gets measured may have no
relationship to what we really want to know. The costs of
measuring may be greater than the benefits. The things that get
measured may draw effort away from the things we really care
about.
CHAPTER 3. BUILDING
ANALYTICS FROM SCRATCH
Step One
When I need to create or significantly improve an analytics system, I
always take a two-pronged approach: on the one hand, I determine what
tasks and issues we are facing, and on the other, I find out what data is
available.
To compile a list of tasks, we need to conduct interviews with all potential
information consumers who may be affected. When designing the system
for its users, you need to know the answers to the following questions:
What metrics do you need to calculate?
What dashboards do you need to develop?
What information do you need to feed to interactive systems?
Will there be any ML (machine learning) tasks?
What makes this step hard is that consumers (customers) don’t always
know what kind of information they will need. And in order to build an
effective system, the analyst must have at least some expertise in the
business he or she is analysing. After working in ecommerce, I found the
transition to Ostrovok.ru (a hotel booking system) somewhat jarring. Sure,
we did online sales, but I needed very specific knowledge of the hotel
business. When you understand the business, you know what questions to
ask the customer. You can then use their answers to build a data structure
that will help solve the client’s problems.
Then I go to the developers and start exploring what they actually have –
what data they collect and where this data is located. First, I am looking for
data that will help solve the client’s tasks (and I make a point of looking not
only at flowcharts, but also at examples of such data – the actual table rows
and files). Second, I look for data that exists but has not yet been used,
and consider which tasks it could help solve. By the end of this stage, I already have:
A list of issues that are covered by current data
A list of questions for which there is no data yet, and an understanding of
how much effort it will take to obtain it
Data that does not yet solve any urgent problems
Data sources and their approximate volumes.
This is just the first iteration. I take this list to the customers. I talk to the
same people, explain whether it is possible to answer their questions and
whether additional data is needed, and then I go back to the developers. It
looks like shuttle diplomacy, but that’s how I plan the project.
In the end, I have a list of system requirements and a list of available data
and tasks that need to be completed in order to get the missing numbers. It
looks simple, but sometimes these steps can take weeks. I don’t just go and
download all the data from the data warehouse and immediately start trying
to make metrics and dashboards. Instead, I try to solve the problem in my
head. This saves me a lot of energy, and the customers a lot of worry. They
will know in advance what will work out right away and what will not.
These are the most basic questions, but a lot depends on them – including
what kind of employees to hire, how much investment is needed, and how
quickly we can start the project.
My rule of thumb when it comes to data storage is that if a company is
going to make a significant portion of its revenue from data, then it is better
to have its own storage. If analytics is just a supporting project, then it is
better to use cloud storage.
The goal of any business is to make profit. Profit is revenue minus costs,
which include the cost of storing data. And this cost can be quite large if the
data is stored in the cloud. Setting up our own storage can be a solution.
Yes, this will entail administration costs, and the system will require more
attention. But you will obviously have more ways to reduce costs, and the
system will be far more flexible. If analytics do not have such a direct
impact on P&L, then cloud storage will be much easier. You won’t have to
think about failed servers – “the cloud” will do that job for you.
Open-source technologies are very important in analytics. I first
encountered them when I was studying at Phystech. In my second year, I
got a computer. It wasn’t particularly powerful even by those days’
standards, so I installed Linux on it. I spent hours compiling the kernel to fit
my needs, learning to work in the console. This experience would come in
handy exactly ten years later, when I visited the Netflix office in Los Gatos,
California, and met Eric Colson, who was Head of Analytics. He talked
about the tools his employees use in their work, and even wrote them out on
a board with a marker. He was also big on open-source software for data
analysis, such as Python, Hadoop and R. Before that, I had only used
commercial software. I remember sitting alone in the empty Wikimart.ru
offices one night that summer (everyone had gone to a staff party) writing
the first nine lines of code in Pig for the Hadoop platform (my Linux skills
came in handy here). It took four hours. What I didn’t know then was that
just a few years later, this language and this platform would be used to write
the “brain” of the Retail Rocket recommendation system. By the way, the
entire Retail Rocket analytics system – both the internal system for making
decisions and the system that generates recommendations – was written
using only open-source technologies.
Now, looking back, I can say that Retail Rocket is the coolest thing I have
ever done in my career: the company quickly broke even and is now
successfully competing with Western counterparts, employing more than a
hundred people around the world with main offices in Moscow, Togliatti
(Russia), The Hague, Santiago, Madrid and Barcelona. A Russian company,
developing and creating jobs abroad! The development vector has changed:
in addition to the Retail Rocket recommendation system, the company is
selling many related services for online stores. The big data analysis and
ML technologies that we created back in 2013 are still relevant today, and I
am very proud that we managed to rise head and shoulders above our
competitors in terms of technology.
When is using proprietary software a good idea? The answer is, when you
have the money. Almost any proprietary software has an open-source
counterpart, although the open-source option is generally weaker, especially in
certain areas. For example, I’ve never found a decent open-source analogue
for OLAP cubes. Open-source reporting systems also look half-baked.
However, when it comes to engineering technologies such as Hadoop,
Spark or Kafka, they are very reliable and powerful developer tools that
have proven their excellence in commercial applications.
Let’s discuss the programming languages that will be used in the
development of the system. My motto is, the fewer the better. Before Retail
Rocket, I was able to get by with just SQL – although I did have to use
commercial tools from Microsoft to move data (ETL) from the source to the
warehouse. Retail Rocket’s recommendation engine used to be
implemented using four programming languages: Pig, Hive, Java and
Python. Then we replaced all of them with Scala, since it is a JVM language,
just like the Java that Hadoop is written in. This makes Scala very easy to use on
the Hadoop/Spark platform, with the latter supporting it natively. But a
couple of years ago, we started using Python and SQL – we had to move
away from Scala, as it was inconvenient for some things.
Scala is a beautiful and elegant programming language, but we ran into two
problems. First, it would be very difficult for users to use it as an interface
to data. SQL is much better for this. Second, all modern ML libraries are
written in Python. So we decided to use Scala for the central core of the
system, data aggregation and delivery, SQL for reports, and Python for
developing ML models and simple prototypes. The choice of a
programming language usually depends on several things:
the system it will be used for (for example, SQL is ideal for
databases)
whether or not there are people who work with this language in your
company or on the market as a whole.
For example, forcing the users of your system to learn difficult
programming languages to obtain data access is a bad idea. For users, it is
just an auxiliary tool, and they don’t want to spend too much time learning
it.
The skills market is a source of constant headaches. Scala is a very rare
language, and it is quite difficult to learn. There are very few people on the
market who know it, and those who do are expensive. Python, on the
contrary, is very popular. But I’d happily give three Python programmers
for one Scala developer. Here we made a conscious decision – the quality of
our work is more important to us, so we chose Scala. It was almost
impossible to hire people already trained in Scala, so we devised our own
crash course [19], teaching beginners to program in it in six months.
Let’s Talk about Outsourcing
Now let’s look at bringing in external contractors to set up an analytical
system. Various aspects can be outsourced:
the creation and maintenance of the technical part of the system
the analytical part of the system
specific parts of the system.
Good contractors are needed when we need to reduce the set-up time of the
technical part of the project and achieve high-quality results. Good luck on
that one! Contractors generally do not have a deep enough knowledge of
the task at hand and, on top of that, the customer rarely knows what they
want.
I remember being put on a project team once at one of the places I worked.
It wasn’t an analytical project, and in theory it looked great. Better still, the
team leader taught systems design at one of the best universities in the
country. We chose the most “up-to-date” technologies to implement the
project. Three or four developers spent an entire year writing the system.
When it was finally ready, they wasted a whole day trying to get it to
work… with absolutely no success, so we ended up throwing the entire
system onto the scrapheap. The same thing can happen with analytics.
Theory and practice are two wildly different things, especially in today’s
rapidly changing world.
The risk is reduced if you have an experienced analyst on your team who
has personally implemented a number of similar projects in the past. He or
she can be an independent advisor on your project, or even a moderator of
sorts. This is necessary, on the one hand, to keep the customer “grounded,”
and, on the other, to keep a rein on the contractor. I believe that, to begin
with, it is better to produce a stripped-down version of the project that we
can get up and running as soon as possible. There are a few reasons for this.
First off, after you (the client) have spent some time with the program, you
will have a much better idea of what you actually want. It’s hard to get a
clear picture of the project when it is still on paper – it’s like a spherical
cow in a vacuum. The second reason is motivation, which is extremely
important for me personally. When the time starts to drag, the team, as well
as the customers, gradually lose interest in the project. The result is a
laboured project that we don’t really want to be a part of anymore.
If you can’t find someone who can act as an advisor on the project, try to
get an understanding of the problem yourself – read a book, watch video
recordings of conferences, etc. Otherwise, you’re running the risk of the
project never making it off the ground, which would be a huge waste of
time and resources.
It’s all well and good outsourcing the technological part of projects, but can
you outsource analytics too? The short answer is, “no, you can’t.” Third-
party analysts will never be able to fully grasp the complete business
context. That said, some areas of analytics – advertising, for example – can
indeed be outsourced.
Another option is to outsource an entire part of the project: you give them
the data and end up with a finished product.
One example of this kind of cooperation is Retail Rocket, where we started
out with product recommendations. Online stores provided us with data and
a product base, and in the end, they got readymade recommendations. The
idea for this kind of business came to me when I was working at
Wikimart.ru. I was making recommendations for the company’s website
and thought to myself, “why not launch a retail solution?” This way, the
online store would not need to hire machine learning engineers and reinvent
the wheel. The result was obtained much faster, literally within a week. And
our recommendation system was better than the in-house one. If an online
store was to hire me today, then I would probably contract a third-party
recommendation service instead of developing one in-house.
Let me tell you a little about my own personal experience with outsourcing.
I left Ozon.ru in 2009. I'd been running my own blog, KPIs.ru, for a few years
by that time and it was proving to be rather popular. People had even started
to look to me for consulting services in all kinds of areas: game
development, ecommerce, venture capital, etc. Slowly, I started to build up
my consulting activities, working for three companies at the same time. I
helped the first company choose the right technology and hire personnel for
the project team, which included conducting interviews. The second
company was looking for assistance in growing startups. My job with the
third company was more hands-on, as I helped set up the analytical system
from scratch. I took a lot away from this experience. First of all, I was able
to help companies without being bogged down in corporate details and
politics, something that certainly would not have been possible if I had been
hired as a full-time member of staff there. My work also allowed the
companies to launch projects quickly. I ended up staying on at the third
company (Wikimart.ru) full time, as its founder offered me the job of head
of analytics. I agreed because, at the time, I wanted to be working directly
with data and “get my hands dirty,” as it were. And that was the end of my
little outsourcing adventure.
Should the Head of Analytics Write Code?
And Eric Colson is absolutely right! First, you're drawn in by the magic of the black
boxes. Then you want more. You become a manager and that’s where things
end – what was once magical becomes routine. You only see the metrics,
and the code becomes less and less comprehensible. That pretty much sums
up my story. The role of player-manager is necessary and useful, but only at
the initial stages. At some point, you need to start delegating everything,
otherwise you’ll do a poor job at both player and manager. Moreover, any
coding or other tasks that the manager performs are far more expensive.
Once I had put my first team together at Retail Rocket, I transitioned from
programming to checking (testing) all the tasks. Then, one of the partners
talked me into going back to coding, a decision I would later regret. I agree
with Colson – at some point, the manager’s got to abandon programming
and problem-solving.
Another important aspect is how motivated the manager is. I often like to
quote Richard Hamming’s lecture “You and Your Research.” Richard
worked at Bell Labs alongside Claude Shannon (the “father of information
theory”). Like many prominent scientists of the time, Hamming was
involved in the Manhattan Project to produce the first atomic bomb. Bell
Labs itself has an impressive track record: the first transistor was developed
there, and seven Nobel Prizes have been awarded for work conducted there.
During the lecture, Hamming was asked to compare research and
management, to which he replied:
Task Management
Analytics tasks can generally be broken down into various areas, each of
which requires a specific approach:
Engineering tasks
Investigating the causes of a given phenomenon (insight)
Hypothesis testing or research.
This is a rather typical scheme and is based on common sense. Tasks can be
accepted or rejected for various reasons (for example, we may be unable to
proceed without the manager’s signoff). When a given task has been
accepted, the analytics team discusses it directly with the client and then
estimates its size and the amount of work required to complete it. One of
the methods used to do this is planning poker [22] . The task is then queued
for execution according to its level of priority. And we had a rule: do
whichever task is at the top of the pile. The categories of tasks are thus
randomized and the analysts develop a rounded skillset.
What are the benefits of this kind of randomization? To begin with, all the
members of the team are able to get an understanding of how the system as
a whole works, which means that they can cover for each other if one of
them goes on vacation, is ill or dismissed. This partly solves the “bus
factor” (the number of team members a company can afford to lose while
still functioning). I never thought about such issues before I was hired by
Retail Rocket. If someone quit, we would put their project on hold until we
found someone to replace them. However, analytics plays a far more
important role at my current company (which I co-founded), and the level
of responsibility is greater too.
Randomization also has its downsides:
Every member of the team has his or her strengths and weaknesses.
Some are good at the engineering side, others are good at building
models, etc. It naturally follows that an engineer will have problems
with ML models, while data scientists will struggle with engineering
tasks.
Team members also have a professional and personal interest in
taking on certain kinds of tasks. Random tasks can prove to be
disruptive for them.
Initiative gets suppressed. Team members stop putting forward
interesting tasks – one person makes a suggestion, another doesn’t
think it’s worthwhile, etc. If the latter gets that task, then there’s a
chance they will sabotage it. At the very least, they will not put the
same kind of effort into it as the person who proposed it.
Thus, this system is not a fix-all solution.
When a task is complete, another employee in the department checks to see
whether it has been implemented correctly from an engineering point of
view. This could involve a code review, an analysis of the architectural
decisions that were made, a check for software tests, etc. Once this has been
done, the task is handed over to the client for further checks. If there are no
issues, the task is considered complete, and no sooner. In our case, we use a
combination of Scrum and Kanban methodologies. But that doesn’t mean
you should create a cargo cult out of them. They depend on the size of the
teams, the ins and outs of the tasks at hand and, most importantly, the
technological knowhow of the team members. I started out with the most
basic columns with simple statuses for managing tasks in Trello and
eventually worked my way to the scheme I outlined above, although I don’t
think that’s perfect either. There’s no single methodology that will make
you completely happy. You’ve just got to use a bit of common sense.
The next class of tasks relates to data analysis: the search for insights.
Usually these are tasks set by managers or clients. They typically describe a
problem, and you need to find the cause. Such tasks typically go through
the same statuses as engineering tasks, but with one difference – we have no
idea whether or not we will actually find the cause. The end result is
unknown, which means that, in theory, we can waste countless hours on the
task without producing anything.
The third class of tasks concerns research, which includes testing
hypotheses and conducting experiments. These are the most difficult (and at
the same time most interesting) tasks. They are perfect for those who like to
learn and experiment, and they are characterized by unpredictable results
and long waiting times.
Hypothesis management is not as easy as it would first appear. For
example, at Retail Rocket, only 3 out of every 10 hypotheses for improving
recommendations produced positive results. It takes at least six weeks to
test a single hypothesis. That’s not cheap by any stretch. By “hypothesis,”
we mean any change that will improve something. This is typically a
streamlining proposal aimed at improving a specific metric. Metrics are
mandatory attributes. The most important metric to begin with was website
conversion (the percentage of visitors who made a purchase). Then we took
it one step further: we wanted to increase revenue per visitor, average order
value, average items per order, and even the visual appeal of recommended
items. Streamlining proposals can vary: from correcting an error in an
algorithm to implementing a machine learning algorithm based on neural
networks. We tried to run all the changes in the algorithms through
hypotheses, because even fixing a simple mistake in real life can adversely
impact the metric.
Just like tasks, hypotheses have their own life cycle. First, hypotheses need
to be clearly prioritized, as they are extremely labour-intensive and do not
produce immediate results. Making a mistake at this prioritization stage can
be very costly. I believe that hypotheses should be prioritized from the
outside, that is, the goals should be determined by the business. This is
typically done by the product team. They talk to customers and know what
is best for them. One of the mistakes I made at Retail Rocket was
prioritizing hypotheses myself. The analysts would come up with
hypotheses on their own and then prioritize and test them. While we did
optimize the algorithms, and the groundwork we laid would later prove
useful against our competition, we could have achieved so much more if we
had only given more thought to what the client wanted. I put that down to
the analysts becoming overqualified while the business itself just couldn't
keep up. Evaluating a hypothesis, understanding its potential benefits and
finding a balance between the work put in and the end result is a fine art.
Interestingly, these problems persist in the West too. In 2016, I submitted a
paper entitled “Hypothesis Testing: How to Eliminate Ideas as Soon as
Possible” [23] for the ACM RecSys International Conference on
Recommender Systems. It’s extremely difficult to get accepted, as all
submissions are reviewed by several researchers. We had submitted a paper
once before [24] , which was rejected. This time, however, the topic turned
out to be a good fit for the conference programme. I presented the paper – a
talk on how we test hypotheses – at MIT in Boston. I remember being
incredibly nervous beforehand, and I learnt the text almost by heart. But
everything went well in the end. Xavier Amatriain, ex-head of analytics at
Netflix and one of the organizers of the conference, even gave me a
proverbial pat on the back. He then invited me for an interview at the Quora
office, where he had a management position at the time – apparently, my
story about hypothesis testing had made quite the impression.
CHAPTER 4. HOW ABOUT SOME
ANALYTICAL TASKS?
This chapter is made up of two parts: first, I will talk about how to break
down the process of data analysis into tasks, and then I will talk about data
analysis specifically.
PART I
This is bad from start to finish. The task is set through an intermediary, and
there’s no deadline – it’s needed “yesterday.” The originator is convinced
that they know the reason for the problem and does not see any reason to let
the employee in on the details. As a result, the employee charged with
carrying out the task is cut off from the context and has no idea why they
should do it. A good analyst will, of course, complete the task, but it will
likely be sent back to them, as the reason – and thus the hypothesis – turned
out to be wrong. Setting tasks this way leaves no room for creativity, and I
know from personal experience that this can be extremely demotivating –
you feel like you’re nothing more than a calculator. Of course, there are
people who like this kind of approach. But this is no way to hold on to the
best employees. Eventually, they’ll start looking for something else –
something that gives them a better chance of reaching their potential.
Task scheduling [22] is an important process that comes in many different
forms: it can be handled by a manager, or by the team as a whole. It can be
done in real time, with tasks being scheduled as they come up, or
periodically, after a number of tasks have been accumulated. Having tried
out all these methods, I am convinced that it is better when everyone is
involved in planning. This is how we do it at Retail Rocket [22] , and we do
it at specific times. I personally find it difficult to argue with my employees
about options and deadlines, and I often want to make decisions myself. But
there’s a formula to this – the stronger you are, the better employees you
will hire and the greater freedom you will give them in decision making.
This is how you build a strong team, where everyone has the opportunity to
speak their minds.
At Retail Rocket, we always record the audio of our scheduling meetings.
Not only does this keep us disciplined, but we can always go back to the
recordings if we disagree about something later on. That said, it is still
better to write everything down in the wording of the task itself, as this
ensures that everyone has understood everything correctly, especially if
there were some heated discussions before the task was agreed upon.
PART II
Datasets
Datasets most often come in the form of a table, exported from a data
warehouse (for example, via SQL) or obtained in some other way. A table is
made up of columns and rows (rows are usually referred to as records). In
machine learning, the columns are divided into independent variables – also
called predictors or, most commonly, features – and dependent variables (outcomes).
This is how you will find them described in the literature. Machine learning
involves training a model that uses the independent variables (features) to
correctly predict the value of the dependent variable (as a rule, there is only one
dependent variable in a dataset).
There are two main kinds of variables – categorical and quantitative. A
categorical variable contains a text or numeric encoding of “categories.”
This can be:
Binary – which can only be one of two values (for example: yes/no,
0/1)
Nominal – which can have more than two values (for example:
yes/no/don’t know)
Ordinal – when order matters (for example, athlete ranking).
A quantitative variable may be:
Discrete – a countable value (for example: the number of people in a
room)
Continuous – any value from an interval (for example: box weight,
product price).
Let’s look at an example. We’ve got a table of apartment prices (the
dependent variable) with one row (record) for each apartment. Each
apartment has a set of attributes (independent variables) with the following
columns:
Price of the apartment – continuous, dependent
Size of the apartment – continuous
The number of rooms – discrete (1, 2, 3, etc.)
En suite bathroom (yes/no) – binary
Floor number – ordinal or nominal (depending on the task at hand)
Distance to the city centre – continuous.
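As a rough illustration, here is how such an apartment dataset might be laid out in pandas; the figures and column names below are invented for the example.

import pandas as pd

apartments = pd.DataFrame({
    "price": [125_000.0, 98_500.0, 210_000.0],       # continuous, dependent variable
    "size_m2": [54.3, 41.0, 88.7],                   # continuous
    "rooms": [2, 1, 3],                              # discrete
    "en_suite": ["yes", "no", "yes"],                # binary
    "floor": [3, 7, 1],                              # ordinal or nominal
    "distance_to_centre_km": [4.2, 11.8, 2.5],       # continuous
})

# Mark the binary column as categorical; the rest stay numeric
apartments["en_suite"] = apartments["en_suite"].astype("category")
print(apartments.dtypes)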
Descriptive Statistics
The very first thing you should do after exporting data from the data
warehouse is exploratory data analysis, which includes descriptive
statistics and data visualization, and possibly data cleaning by removing
outliers.
Descriptive statistics typically include different statistics for each of the
variables in the input datasets:
The number of non-missing values
The number of unique values
Minimum–maximum
The average (mean) value
The median value
The standard deviation
Percentiles – 25%, 50% (median), 75%, 95%.
Not all statistics can be calculated for every variable type. For example, the average can only
be calculated for quantitative variables. Statistical packages and data
analysis libraries already have ready-made functions that calculate
descriptive statistics. For example, the pandas Python Data Analysis Library
has a describe function that immediately displays several statistics for one
or all of the variables in the dataset:
import pandas as pd

s = pd.Series([1, 2, 3])
s.describe()
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
While this book is not intended to be a textbook on statistics, I will give you
some helpful tips. In theory, we often assume that we are working with
normally distributed data, which has a bell-shaped histogram (Fig. 4.1).
Diagrams
A good diagram is worth a thousand words. I typically use the following
types of diagram:
histograms
scatterplots
time series charts with trend lines
box plot, box and whiskers plot.
Histograms (Fig. 4.2) are the most useful analysis tool. They allow you to
visualize the frequency distribution of a given value (for categorical
variables) or break a continuous variable into ranges (bins). The latter is
used more frequently, and if you provide additional descriptive statistics to
such a diagram, then you will have a complete picture of the variable you
are interested in. The histogram is a simple and intuitive tool.
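As a small example (synthetic data, not a real store's figures), this is how a continuous variable can be broken into bins and plotted as a histogram alongside its descriptive statistics:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, right-skewed "order value" data for illustration only
order_values = pd.Series(np.random.default_rng(42).lognormal(mean=3.5, sigma=0.6, size=5000))

print(order_values.describe())     # descriptive statistics for the full picture
order_values.plot.hist(bins=50)    # frequency distribution over 50 ranges (bins)
plt.xlabel("Order value")
plt.ylabel("Frequency")
plt.show()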
Fig. 4.3. Scatterplot
Scatterplots (Fig. 4.3) allow you to see the dependent relationship between
two variables. They are easy to plot: the independent variable is plotted
along the horizontal axis, while the dependent variable is plotted along the
vertical axis. Values (records) are displayed as a series of dots. A trend line
can also be added. Advanced statistical analysis software allows you to
mark outliers interactively.
Fig. 4.4. Time series
Time series charts (Fig. 4.4) are similar to scatterplots, with the independent
variable (on the horizontal axis) representing time. Two components can
typically be distinguished from a time series graph – cyclical components
and trend components. A trend can be built when you know how long the
cycle is. For example, grocery stores generally use a seven-day cycle,
meaning that a repeating picture can be seen on the time series every seven
days. A moving average with a window length equal to the cycle is then
superimposed on the chart to give you a trend line. Most statistical packages
(Excel, Google Sheets, etc.) are capable of doing this. If you need a cyclical
component, just subtract the trend line from the time series. These simple
calculations are used to build basic algorithms for time series forecasting.
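A minimal sketch of these calculations, assuming a seven-day cycle and synthetic daily sales data (the numbers are invented for illustration):

import numpy as np
import pandas as pd

days = pd.date_range("2021-01-01", periods=90, freq="D")
rng = np.random.default_rng(0)
sales = pd.Series(
    100 + 0.5 * np.arange(90)              # slow upward trend
    + 15 * (days.dayofweek >= 5)           # weekend spikes: the seven-day cycle
    + rng.normal(0, 3, size=90),
    index=days,
)

# A moving average with a window equal to the cycle length gives the trend line
trend = sales.rolling(window=7, center=True).mean()

# Subtracting the trend from the time series leaves the cyclical component
cyclical = sales - trend
print(cyclical.groupby(cyclical.index.dayofweek).mean().round(1))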
Fig. 4.5. Box plots
Box plots (Fig. 4.5) are interesting: to some extent, they do the same thing
as histograms, as they show an estimate of the distribution.
Box plots consist of several elements: whiskers denoting the minimum and
maximum feasible data values, and a box whose upper edge represents the
75th percentile and whose lower edge represents the 25th percentile. The
line in the box is the median – the “middle” value that divides the sample
into halves. This type of graph is useful for comparing experimental results
or variables with each other. An example is given below (Fig. 4.6). For me,
this is the best way to visualize the results of hypothesis testing.
Fig. 4.6. Box and whiskers plot for various experiments
says data visualization expert Edward Tufte in his book The Cognitive Style
of PowerPoint [29] .
Technical Debt
Another important thing I learned from the engineers at Retail Rocket was
how to work with technical debt [31]. Servicing technical debt means working with old
projects, optimizing the speed of operations, switching to new versions of
libraries, removing old code from hypothesis testing, and simplifying
projects from an engineering standpoint. All these tasks take up a good third
of analysts’ development time.
I’ve seen software “swamps,” where old stuff gets in the way of creating
new stuff. The whole idea of technical debt is that you have to service
everything you’ve done before. It’s like car maintenance – you need to have
your car serviced regularly, otherwise it will break down when you least
expect it. Code that hasn’t been changed or updated for a long time is bad
code. Although the mantra “if it ain’t broke, don’t fix it” usually reigns
supreme. I remember talking to a Bing developer four years ago. He told
me that the search engine’s architecture contained a compiled library, the
code of which had been lost. What’s more, no one knows how to restore it.
The longer this situation persists, the worse things will get.
This is how Retail Rocket analysts serve technical debt:
Whenever we carry out hypothesis testing, we delete the code for the
hypothesis wherever possible once we are done. This way, we don’t
have to deal with unnecessary junk that doesn’t work.
If any versions of the libraries have been updated, we wait a while to
perform the update ourselves, although we do update regularly. For
example, we update the Spark platform regularly, starting with version
1.0.0.
If any components of data processing are slow, we set a task and
carry it out.
If there are any potentially dangerous risks (for example, cluster disk
overflow), we set the necessary task to eliminate the problem.
My work at Retail Rocket convinced me that dealing with technical debt is
the key to ensuring quality. From an engineering point of view, the project
is built to the best Silicon Valley standards.
CHAPTER 5. DATA
ISO/IEC/IEEE 24765-2010
Data Connectivity
One of the most important features of data is its ability to link different
sources of data. For example, data can be used to link the cost of online
advertising to sales, thus giving you an efficiency tool. Next, we add data
on completed orders, as the bounce rate can be extremely high in some
online businesses. The output gives us efficiency adjusted for the number of
visitors who did not make a purchase. Further, we can add product
categories to the data, which shows how effective advertising has been for
different product categories. We could continue this process of linking
sources of data indefinitely. This is an illustration of what we call “end-to-
end analytics.”
The above example shows that, by adding a new source of data, we can
improve accuracy and increase the number of “degrees of freedom” in
terms of what we can do with the data itself. As far as I am concerned, this
increases the value of data exponentially. The only snag is how to link the
data. This requires a “key,” which must be present in both sources. But the
key is not always as accurate as we need. Let me give you an example. You
want to link online advertising costs with purchases, but you need a key – a
user ID. The problem is that the advertising systems likely don’t provide
information about how much you have spent on a specific user. Because of
this, you have to use a set of link keys that characterize the ad. This
negatively affects accuracy, but such is the nature of data – it’s better to
have something than nothing at all.
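Here is a minimal sketch of this kind of linking in pandas; the tables, column names and figures are invented purely to illustrate joining two sources on a shared key:

import pandas as pd

ad_costs = pd.DataFrame({
    "campaign_id": ["c1", "c2", "c3"],
    "cost": [1000.0, 500.0, 750.0],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "campaign_id": ["c1", "c1", "c3", "c2"],
    "revenue": [80.0, 120.0, 60.0, 40.0],
    "completed": [True, True, False, True],   # bounced orders are excluded below
})

# Keep completed orders only, then link their revenue to advertising spend by campaign
revenue = (orders[orders["completed"]]
           .groupby("campaign_id", as_index=False)["revenue"].sum())
report = ad_costs.merge(revenue, on="campaign_id", how="left").fillna({"revenue": 0})
report["revenue_per_cost"] = report["revenue"] / report["cost"]
print(report)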
Data Access
Who exactly in a company has access to its data?
Let’s look at Netflix, one of the biggest streaming services in the world (I’m
a big fan of House of Cards myself). Netflix has an interesting corporate
culture [32] . One of its principles is: “Share information openly, broadly,
and deliberately.”
There is one exception to this rule, however: the company has a zero-
tolerance policy when it comes to trading insider information and customer
payment details. Access to this information is restricted. How can this
principle be applied in practice? Don’t restrict access to information for
your employees, rather, restrict access to the personal data of your
customers. I usually go even further and try to remove the barrier between
non-analysts and data. The point is, I believe in the freedom of access to
data. More than that, I believe that the number of intermediaries between
the person seeking data and the data itself should be kept to a minimum.
This is important because we are always in a race against time. The data
requests themselves are often rather simple – simple enough to do yourself.
“Upload such and such data for me, will you!” That's not a task for an
analyst. The manager knows exactly what he or she needs, so wouldn't it be
easier for them to get it themselves through a simple interface? To do this, you need
to train your team to work with data by themselves. Having intermediaries
only creates delays. The only time an intermediary should be used is when
the person seeking information is incapable of getting the information on
their own, or simply does not want to. This way, you will kill two birds with
one stone – your data scientists won’t be demotivated by the mundane task
of uploading data, and your managers will get the data they need almost
instantaneously, which in turn will keep them incentivized.
Of course, customers’ personal data must be anonymized. This can be done
by encrypting their personal information. It’s better not to delete it
completely so that you can solve certain customer support issues using your
data analysis system.
I try to use this approach wherever I go. You’ve got no idea how grateful
the users of your analytical systems will be when they can access their data
on their own. The brightest and most proactive employees devour
information to help them make decisions, and it would be a crime to put
obstacles in their way.
Data Quality
Data can be dirty, filthy even. If you ever come across “clean” data, it’s
likely not clean at all. But you never know, you might get lucky. Data
scientists spend the majority of their time removing outliers and other
artifacts that can prevent them from getting the right solution from data. We
work in conditions of uncertainty, and we don’t want to increase the
likelihood of error due to dirt in the data.
For me, quality data is data that can be used to solve a specific task without
having to go through any preliminary cleaning. I deliberately wrote
“specific task” here because I believe that different tasks require different
degrees of accuracy. As such, they carry different consequences and risks
for the company. And we walk on the razor’s edge, trying to solve the
problem as quickly as possible with minimum effort. We have to balance
labour intensity against the cost of making a mistake. If we’re talking about
an accounting problem, then accuracy is of utmost importance, as tax
penalties can be rather painful. If it’s a managerial task and the stakes are
not quite as high, then accuracy is not as critical. It’s up to the head of
analytics to make the call.
Poor data quality is usually the result of one of the following:
the human factor
technical data loss
errors integrating and uploading data to the data warehouse
data latency.
Let’s look at these in greater detail.
Some data comes from people directly – we receive numbers from them
through various channels. For the sake of simplicity, let’s assume that they
occasionally fill out a form and send it to us. Our physics teacher at school
taught us about scale-reading errors, which are equal to half the division
value for any given measurement. If you’ve got a ruler that measures in
millimetres, for example, then the reading error would be half a millimetre.
This is due to the fact that we may be looking from the wrong angle – move
the ruler only slightly and our reading could be wrong. What can you expect
from people using tools that are far more complicated than a ruler?
And then we’ve got the deliberate falsification of data. Let’s call it like it is:
whenever someone alters data, even if they do it with the best of intentions,
it is a deliberate falsification. One glaring example of this is elections! It is
thanks to the work of independent researchers who analyse the polls and
look for anomalies, outliers and other “non-random” patterns that such
falsifications are identified. Businesses too have their own techniques for
finding anomalies – for example, using statistical process control charts.
Technical data loss is a big issue in web analytics (the analysis of website
traffic). Not all data from your computer or smartphone reaches the analytic
server. There may be a dozen routers between the server and the client,
some of the network packets may get lost, the user might close his or her
browser while sending information, etc. As a rule, approximately 5% of
data is lost, and it is almost impossible to reduce this number without
complex tweaks. One way to reduce losses is to place the snippet of code that
calls the analytics system at the very top of the web page. This will ensure
that the data is sent before the page has fully loaded, thus decreasing losses
somewhat.
Data integration errors are extremely unpleasant. They always crop up
when you least expect them and are difficult to diagnose. When I say
integration error, I mean the loss of data due to a malfunction in the data
collection process. Sometimes the error can be reversed; other times it can't.
Errors can be reversed if the data can be found at the source and
subsequently re-read. This is impossible if the data “disappears” after it has
been sent to us – that is, if it flows in a stream and is not saved at any
point on its journey. I’ve already written about the time I discovered that
one of the most advanced web analytics systems in the world was not
transmitting data via the Opera browser. After the problem was fixed, the
data started to flow once again. But there is no way of restoring the old
data. It sometimes happens that the developers of a complex solution lose
sight of the fact that statistical information about the use of a product needs
to be collected, or make a mistake in implementing this function. In this
case, you will only get the statistics after the problem has been fixed, and
you can say goodbye to the old data forever.
Data latency can also be a problem. At the very least, it’s something you
need to think about when working with this data. There are many different
ways to update data warehouses, and we’ll talk about them in the relevant
chapter. The most important thing to remember is that the customer (and
sometimes you as the data scientist) expect the system to correspond
perfectly with the real picture of this world. But this is not the case. There is
always a lag from the moment an event occurs to the moment that
information about this event reaches your data warehouse. It could take
seconds, but it could just as well take days. This is not necessarily down to
an error, but you've always got to keep in mind which data source is being
updated and when, and keep an eye on the refresh schedule to avoid any unpleasant
surprises.
Data Types
These are the main types of data that you will have to work with:
Point-in-time data (data snapshot)
Data change logs
Reference data.
Let’s look at each of these separately. Imagine you’ve got a current account
where your salary is paid on the first of every month. You’ve got a card
linked to that account that you use to make purchases. Your current balance
reflects the state of your account at a certain point in time. The movement
of funds on the account is what we call the log of changes in the state of the
account (data change logs). The reference data might be categories of
purchases listed in the bank’s online app, for example: groceries, air tickets,
cinema, restaurant, etc. Now let’s look at each data type in more detail.
Point-in-time data. We all deal with different objects, both physical and
virtual. These objects have properties or attributes that can change over
time. For example, the coordinates of your current location on a map, your
bank account balance, the colour of your hair after a trip to the hair salon,
your height and weight over time, the status of your order in an online store,
your position at work, etc. These are all objects with a given property. To
track changes in these properties, you need to “remember” them at certain
points in time, for example, by taking a “snapshot” of all customer accounts
(Table 5.1). Two snapshots allow you to calculate any changes with ease.
There is, however, another way to track changes.
Table 5.1. Example of a “snapshot” of customer accounts
Account ID    Balance
234           2000
245           5000
857           2000
XML is less common than JSON, its main competitor. It is often used for
configuration data in various systems. I prefer JSON for data storage, as it
is the lighter and easier option. As for XML, online stores use it to transmit
product information: catalogue structure, prices, and product names and
attributes.
<person>
<firstName>Ivan</firstName>
<lastName>Ivanov</lastName>
<address>
<streetAddress>101 Moskovskoye Highway, apartment 101</streetAddress>
<city>St. Petersburg</city>
<postalCode>101101</postalCode>
</address>
<phoneNumbers>
<phoneNumber>812 123-1234</phoneNumber>
<phoneNumber>916 123-4567</phoneNumber>
</phoneNumbers>
</person>
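For comparison, the same record might look roughly like this in JSON (a sketch of the same person object, not taken from any real system):

{
  "person": {
    "firstName": "Ivan",
    "lastName": "Ivanov",
    "address": {
      "streetAddress": "101 Moskovskoye Highway, apartment 101",
      "city": "St. Petersburg",
      "postalCode": "101101"
    },
    "phoneNumbers": ["812 123-1234", "916 123-4567"]
  }
}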
Some of the more exotic formats you can encounter in your practice
include:
pkl (“pickle”) is a binary Python object. If you read it using Python,
you will get the necessary data structure in your memory without
any parsing.
hdf is a hierarchical data structure format. It may contain various
types of data, for example, a store’s product catalogue, sales, etc.
The file contains metainformation: product names, data types, etc.
I've never worked with such files myself, but they may be useful for
transferring data from a complex project to another team or for publishing
it on the internet.
parquet and avro are formats designed for big data. They generally
contain a data schema (metainformation) that defines field types and
names and are optimized for use in systems such as Hadoop. Both of
these formats are binary, though avro can be based on JSON.
What else is there to know about these storage formats? They differ in their
approach to storing metainformation. If someone sends you data in a CSV
file, then the first line will likely contain field names, but you won’t get
information about the data type (whether a value is a number, text, date,
etc.). Field descriptions will have to be sent separately, otherwise you will
have to make assumptions. If you get a JSON or XML file, they have a way
of describing data types, which makes them more convenient in this regard.
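A small illustration of the point, assuming the pandas library is available (the file contents below are invented for the example):

import io
import pandas as pd

# A CSV file carries field names in the first line, but no data types
csv_text = "order_id,amount,created\n2134,102,2020-11-10\n2135,1012,2020-11-11\n"
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.dtypes)  # the 'created' column is read as a plain string (object) unless we ask for parse_dates

# JSON at least distinguishes numbers from strings
json_text = '[{"order_id": 2134, "amount": 102}, {"order_id": 2135, "amount": 1012}]'
df_json = pd.read_json(io.StringIO(json_text))
print(df_json.dtypes)  # order_id and amount come back as numeric columns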
As for databases, we will discuss them in the chapter on data warehouses.
These layers are logical and, in many ways, arbitrary. Physically, they can
be located either in different systems, or in a single system.
At Retail Rocket, I worked a lot with user activity data. Data arrives in the
cluster in real time in JavaScript Object Notation (JSON) format and is
stored in one folder. This is the first layer, raw data. We immediately
convert the data so that we can work with it in the CSV format. We store it
in a second folder. This is the second layer. Data from the second layer is then
sent to the ClickHouse analytical database, where special tables have been
built for whatever task you want – for example, to track the effectiveness of
recommendation algorithms. This is the third layer, data marts.
If you want to find all products that belong to the “Living room” type (in
the Type column), you will have to read all data rows (if there is no index
for this column). This is an exceedingly depressing task if you have billions,
or even trillions, of rows. Columnar databases store data separately by
columns (hence the term “columnar databases”):
ID: 1 (1), 2 (2), 3 (3)
Name: Chair (1), Chair (2), Chair (3)
Type: Kitchen furniture (1), Kitchen furniture (2), Living room (3)
Color: White (1), Red (2), Blue (3)
Now if you select products by “Living room” type, only the data in the
relevant column needs to be read. So, how do we find the records we need?
No problem: each column value contains the number of the record to which
it refers. What’s more, column values can be sorted and even compressed,
which makes the database even faster. The downside of such a system is
that you are limited by the power of the server machines that the
database is running on. Columnar databases are perfect for analytical tasks
that involve a large number of filtering and aggregate queries.
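A toy sketch of the idea in Python, using the furniture example above (the layout is invented purely for illustration):

# Row-oriented storage: to filter by Type, every full row has to be read.
rows = [
    {"id": 1, "name": "Chair", "type": "Kitchen furniture", "color": "White"},
    {"id": 2, "name": "Chair", "type": "Kitchen furniture", "color": "Red"},
    {"id": 3, "name": "Chair", "type": "Living room", "color": "Blue"},
]

# Column-oriented storage: each column is stored on its own,
# together with the position of the record it refers to.
columns = {
    "id":    [1, 2, 3],
    "name":  ["Chair", "Chair", "Chair"],
    "type":  ["Kitchen furniture", "Kitchen furniture", "Living room"],
    "color": ["White", "Red", "Blue"],
}

# Filtering by "Living room" only touches the "type" column;
# the positions found are then used to pull values from the other columns.
positions = [i for i, t in enumerate(columns["type"]) if t == "Living room"]
print([columns["name"][i] for i in positions])  # ['Chair']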
If you need a “thresher” for extremely large amounts of data, then look no
further than Yahoo's Hadoop, which runs on its own distributed file
system (HDFS) and does not place excessive demands on hardware. Facebook
developed the Hive SQL engine for Hadoop, which allows users to write
data queries in SQL. The main advantage of Hadoop is its reliability,
although this comes at the expense of speed. That said, Hadoop is able to
complete tasks that a regular database simply cannot cope with – sure, it
might do it at an agonizingly slow pace, but it gets the job done! I’ve used
Hive in the past and been extremely impressed with it. I’ve even managed
to write some basic recommendation algorithms for it.
So, the main question here is: Which technologies should we use and when?
I’ve got a kind of cheat sheet to help me out on such occasions:
If there’s a relatively small amount of data (i.e. not tables with
billions of rows), then it is easier to use a regular relational database.
If you’ve got several billion rows of data, or speed is a big concern
for dealing with analytical queries (aggregation and selection), then
a columnar database is the way to go.
If you need to store a massive amount of data with hundreds of
billions of rows and are willing to put up with low speeds, or if you
want an archive of raw data, then Hadoop is what you need.
At Retail Rocket, I used Hive on top of Hadoop for technical support;
requests could take more than half an hour to process. The required data
was thus transferred to ClickHouse (a columnar database), which increased
query processing speeds by several hundred times. Speed is extremely
important for interactive analytical tasks. Retail Rocket employees
immediately fell in love with the new solution thanks to its speed. Hadoop
was still our workhorse for calculating recommendations, though.
On the left, we have the source text, each line of which contains the names
of people. The first operation (Split) splits the text line by line, with each
line being independently processed. The second operation (Map) counts the
number of times each name appears in the line. Because the lines are
independent of each other, we can run the operation in parallel on different
machines. The third operation (Shuffle) then separates the names into
groups, and the fourth operation (Reduce) calculates the number of times
each name is mentioned in different lines. At the output, we get the number
of times each name is mentioned in the text. The example we have here
consists of three lines, but the exact same operations would be carried out
for a trillion lines of text.
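Here is a minimal sketch of those four steps in plain Python (the input lines are invented):

from collections import defaultdict

text = ["Anna Ivan Anna", "Ivan Pyotr", "Anna"]

# Split: the text is already a list of independent lines.
# Map: count the names within each line separately
# (because the lines are independent, this step could run on different machines).
mapped = []
for line in text:
    counts = defaultdict(int)
    for name in line.split():
        counts[name] += 1
    mapped.append(counts)

# Shuffle: group the partial counts by name.
groups = defaultdict(list)
for counts in mapped:
    for name, n in counts.items():
        groups[name].append(n)

# Reduce: sum the partial counts for each name.
result = {name: sum(ns) for name, ns in groups.items()}
print(result)  # {'Anna': 3, 'Ivan': 2, 'Pyotr': 1}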
MapReduce is a concept. Hadoop is a piece of software that implements
this concept. Hadoop itself consists of two main components: the HDFS
distributed file system and the YARN resource scheduler.
To users, the HDFS (Hadoop Distributed File System) looks just like a
regular file system with the usual folders and files. The system itself is
located on at least one computer and has two main functions – a name node
and a data node. When the user wants to write a file to the HDFS, the file is
split into blocks (the size of which depends on the system settings) and the
name node returns the data node where the block needs to be saved. The
client sends data to the data node. The data is recorded and then copied to
other nodes (replicated). The default replication factor is 3, which means
that one chunk of data will exist across three data nodes. Once the process
is complete and all the blocks have been written, the name node makes the
relevant entries in its tables (recording which block has been stored where, and which
file it belongs to). This protects us against errors, for example, when a
server goes down. With a replication factor of 3, we can lose two nodes
without it being a problem. And if this does happen, the HDFS will
automatically start to replicate the data and send it to other “live” nodes to
make sure the required replication factor is achieved once again. This is
how we achieve data durability.
The YARN resource scheduler allocates computer power on Hadoop
clusters. This allows us to carry out several tasks in a single cluster
simultaneously. The calculations are typically carried out in the same place
where the data is located – on the same nodes and with the same data. This
saves a lot of time, since data is read from the disk far quicker than it is
copied over the network. When you run a task through YARN, you specify the
resources you will need to perform the calculation: how many machines
(executors) from the cluster; how many processor cores; and how much
RAM. YARN also provides real-time reports on the task’s progress.
I first heard about Hadoop when I visited the Netflix offices in 2011. I
immediately set about learning as much as I could about the service and
watched conferences on YouTube about how to work with it. I decided to
try an experiment and launch Hadoop on my work laptop using Cloudera as
the distribution. The good thing about Hadoop is that it can even be
installed on a laptop – all services will run on it. As your data grows, you
can easily add servers, even cheap ones. That’s exactly what I did when I
started writing the Retail Rocket recommendation engine. I started with a
couple of servers and now, five years later, the Hadoop cluster has grown to
50 machines and holds around 2 petabytes of compressed data.
We wrote the first version of the recommendation engine in two Hadoop
languages – Pig and Hive – before switching to Spark and
then Scala.
Don’t overlook MapReduce principles. They came in handy for me once in
a Kaggle competition. I was dealing with a dataset so large that it didn’t all
fit into the computer’s memory, so I had to write a preprocess operation in
pure Python using the MapReduce approach. It took forever! If I were to do
it today, I would install the Spark framework locally rather than try to
reinvent the wheel. This would work, as MapReduce operations can be
performed both in parallel on different machines and sequentially. The
calculations will still take a lot of time, but at least they’ll get done, and you
won’t be worrying about whether you’ve got enough memory for it.
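For example, a local Spark run of the name count might look roughly like this (a sketch that assumes the pyspark package is installed; the file name is invented):

from pyspark.sql import SparkSession

# "local[*]" runs Spark on all cores of a single machine – no cluster required.
spark = SparkSession.builder.master("local[*]").appName("name-count").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("names.txt")                      # Split: one element per line
counts = (lines.flatMap(lambda line: line.split())    # Map: emit each name
               .map(lambda name: (name, 1))
               .reduceByKey(lambda a, b: a + b))      # Shuffle + Reduce: sum per name

print(counts.collect())
spark.stop()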
Spark
I was introduced to the Spark framework in 2012, when I got hold of some
video recordings of the Strata Data Conference for the Ostrovok.ru
corporate library. The conferences are organized by O’Reilly Media in the
United States. In one of the lectures, I saw Matei Zaharia (the brain behind
Spark) talk about the advantages of Spark over a pure implementation of
MapReduce in Hadoop. The biggest advantage of Spark is that it loads data
into so-called resilient distributed datasets (RDDs) and allows you to work
with it iteratively in memory. Pure Hadoop relies on disk storage – for each
pair of MapReduce operations, data is read from the disk and then stored. If
the algorithm requires more operations, then data will have to be read from
the disk and then saved back every time. What Spark does after having
completed the first operation is save the data into memory. Subsequent
MapReduce operations will work with this array until the program
explicitly commands it to be saved to disk. This is crucial for machine
learning problems, where iterative algorithms are used to find the optimal
solution. This method provides a massive performance boost, sometimes up
to 100 times faster than Hadoop.
The ideas that Matei Zaharia had introduced me to came in handy a few
years later when we set about writing the second version of the Retail
Rocket recommendation engine. Spark version 1.0.0 had just come out. It was
the summer of 2014 when we tried to run Spark on top of Hadoop, and everything
went swimmingly. We managed to boost speed by three to four times during
testing, although we did run into performance issues when dealing with
large numbers of small files. So we wrote a small library that glued them all
together when booting up and that fixed the problem [41] . We’ve been
using Spark on our Hadoop cluster ever since and have absolutely no
regrets.
When we switched from pure Hadoop to “Spark on top of Hadoop,” we had
to get rid of much of our code that had been written in Pig, Hive, Java and
Python – a veritable zoo that was the cause of endless headaches because
we had to be fluent in all of them. And we used a bunch of Python tools
when prototyping machine learning tasks, including IPython + Pyhs2
(Python hive driver) + Pandas + Sklearn. Using Spark meant that we were
able to use a single programming language for prototyping experimental
versions and developing a working version. This was a massive
achievement for our modestly sized team.
Spark supports four programming languages out of the box – Python, Scala,
Java and R via an API. Spark itself is written in Scala, which is why we
chose it. This would allow us to read the Spark source code and even fix
any bugs in it (although we never had to). What’s more, Scala belongs to
the JVM family of programming languages, which is handy when you’re
working with the Hadoop file system directly via an API, since it is written
in Java.
Below is a comparison of programming languages for Spark, where + and –
represent the pros and cons of each language, respectively.
Scala:
+ functional and thus convenient for processing data of any volume
+ native to Spark, which is important if you need to understand how
Spark works from the inside
+ based on JVM, which makes it compatible with Hadoop
+ static type system; the compiler will help you find some of your errors
– difficult to learn; we had to develop our own training programme for
beginners [19]
– Scala developers are generally more expensive than Java and Python
developers
– the language is not as widespread as Java and Python.
Python:
+ popular
+ simple
– dynamic type system; type errors only surface at runtime
– poorer performance compared to Scala
– weaker support for functional programming than Scala.
Java:
+ popular
+ native to Hadoop
+ static type system
– limited support for functional programming (although things got much
better after the release of Java 8).
The decision to use Scala as the main programming language for the second
version of the Retail Rocket recommendation engine was not taken lightly,
as no one knew the language. If I could go back, I would likely give Java 8
and later versions some serious consideration, as Java developers are easier
to find.
Spark is moving towards dataframes and datasets, a trend started by the
pandas library for Python [42] , which just so happens to be the most
popular library for data analysis. It’s easier to write in pandas, but there’s a
catch – the compiler can't check how you work with a dataframe's internal variables, which is
far from ideal for large projects.
OceanofPDF.com
CHAPTER 7. DATA ANALYSIS
TOOLS
As you will recall from previous chapters, classical data analytics is divided
into two stages: 1) searching for hypotheses; and 2) statistical testing of
these hypotheses. In order to formulate hypotheses, we need descriptive
statistics, data visualization and domain knowledge (for example, the
company’s history).
There are a number of software packages that make the analyst’s work with
descriptive statistics and data visualization easier and faster. Let’s look at
some diametrically opposed approaches here. You may well get by with just one of
them, but it is always useful to broaden your horizons – you never know,
maybe an alternative approach will suit you better.
I categorize these tools according to how they interact with the user. This is
somewhat provisional, as certain categories may overlap with one another.
Spreadsheets – Microsoft Excel, Open Office Calc, Google Docs
Sheets.
Notebooks – Jupyter Notebook, Netflix Polynote, Google Colab,
RStudio, etc.
Visual analysis tools – Tableau, Google Data Studio, Yandex Data
Lens, Microsoft Power BI, Amazon QuickSight.
Statistical software – SAS, SPSS, Statistica, Knime, Orange Data
Mining.
Spreadsheets
Spreadsheets are one of the most widely used data analysis tools. My first
experience with a spreadsheet was back in 1997 when I drew up some
tables in Quattro Pro for a geography assignment at school. The teacher was
most impressed! I went to my dad’s work (he was an IT engineer at the
time), typed out the text on a keyboard and scaled country maps on a copier.
In the end, I had a fully printed report, which I guess was something of a
novelty back then, especially in Tver (a town near Moscow). After that, I
worked in Microsoft Excel, and then Google Sheets, both of which made it
easy to collaborate in the cloud. The advantages of spreadsheets are:
Low entry threshold.
Everything is clear and intuitive, such as adding columns or
formulas.
You can analyse data with pivot tables (the most important tool for generating
hypotheses).
It’s easy to create tables.
Notebooks
Notebooks (Fig. 7.1) can be a powerful and flexible tool in the right hands.
They have gained popularity thanks to the widespread use of the R and,
particularly, Python programming languages for data analysis. A notebook
runs as a web service on a server or your computer and consists of cells of
text and program code. The cells can be run in any order, and all data
output (graphs, statistics, error messages) appears under the cell. You can
write text and titles in the cells. You can produce complete research projects
in notebooks if you want. The two most well-known public notebook
services are Google Colab and Kaggle Notebooks (both of which can be used free of charge).
They also have powerful GPU cards, which allow you to perform tasks
using deep learning. Personally, I quite liked the simplicity and power of
the Google Colab service when I was experimenting with creating deepfake
videos.
Fig. 7.1. Jupyter Notebook
Advantages:
Flexibility. There are software libraries for every taste.
Notebooks are easy to run in the cloud, so you don’t have to waste
your computer’s resources.
It’s easy to share and publish your results.
There is support for various programming languages. At Retail
Rocket, I used Jupyter Notebook in Scala.
You can work with any data source that has a driver.
To reproduce a result, you just need to restart the notebook and run all the
cells. Not all tools can do this with such ease. For example, formulas
in spreadsheets have a habit of jumping around or disappearing
altogether. You don’t have this problem with notebooks.
Disadvantages:
I don’t think it’s a good idea to use notebooks as a component of a
working system, although I hear that many companies (even Netflix
[49] ) do just this. Notebooks are built for research, not creating
workflows.
The entry threshold is higher than it is for spreadsheets. At least a
basic knowledge of the chosen programming language is required.
Statistical Software
My introduction to data analysis came through statistical software when I
was hired as an intern at StatSoft. Spreadsheets and visual analysis tools are
not good at statistical analysis, which is a key part of data analysis. Let's
say you observe a difference in the indicators. How do we know whether that
difference is real or just down to chance? We need to calculate its statistical
significance.
Statistical analysis software usually comes in the form of desktop
applications (Fig. 7.3) that perform calculations locally. Data is loaded as
spreadsheets. The software typically has a simple visual ETL (like in
Tableau), as well as a built-in programming language for automating
actions.
Fig. 7.3. STATISTICA
Advantages:
Extensive statistical analysis possibilities. The manuals for these
packages are just as good as data analysis textbooks. The statistical
functions themselves are rigorously tested, unlike the statistical
software that is freely available on the internet.
Good for creating graphs.
Attention to detail, which is important in scientific research.
You can work with data offline.
Disadvantages:
High entry threshold. You need to know what you’re doing and
which statistical criterion to use. Basic knowledge of mathematical
statistics is a must.
Commercial software packages are expensive.
Working With Data in the Cloud
We are living in a time where more and more people are working from
home. Accordingly, an increasing number of tools are moving to the cloud.
I think this has to do with the fact that businesses, and thus data sources, are
starting to put everything on cloud servers – and transferring large
amounts of data back and forth over the internet is no pleasure at all. According to Gartner [46] ,
public cloud services will cover 90% of all data analysis needs by 2022.
Almost all cloud vendors have already developed visual analysis tools:
Google Data Studio, Microsoft Power BI, Amazon Quick Sight, Yandex
DataLens, etc.
Advantages:
Data and analytics are located within the same security perimeter. It
is easy to manage data access. You don’t need to expose yourself to
risks by opening access to your data over the internet.
Data is available within a single cloud network, which speeds up
work.
You can collaborate natively. I assume you have worked with
services such as Google Docs. It’s so much more convenient to
collaborate than it is to work with a standard office suite.
Thin client – everything is done in the browser. You don’t need to
install any programs.
Flexible pricing – the price depends on how often you use the
service and the amount of data.
Maintenance costs are lower.
Disadvantages:
Price. Even if the cloud provides visualization services for free, you
still have to pay for the calculations and data aggregation. It’s
similar to car rental – if you are an active user, sooner or later it will
make better financial sense for you to buy your own car. It’s the
same with cloud services.
All your data is with a single vendor, meaning you will be tied to
that vendor. If we’re talking petabytes of data, it can be extremely
difficult to transfer it all to your servers or another vendor’s cloud.
I mostly like this trend of migrating data and data analysis to cloud services,
as it makes developing analytics systems easier – and often cheaper – than
buying corporate systems.
Pivot Tables
Pivot tables are the best thing to ever happen to exploratory data analysis!
Data scientists that know how to use pivot tables properly will never be out
of a job. Pivot tables save us from having to perform a huge number of
useless data queries when all we need is a tiny shred of data. I’ve already
talked about my own personal interactive data analysis template: select the
data, copy it into spreadsheets, build a pivot table and work with it. This
method has literally saved me years compared to the time I would have
spent if I had used the direct methods of calculating descriptive statistics
and building simple graphs – the bread-and-butter data analysis operations
for any analytical tool. Now, let’s take a step-by-step look at how to work
with pivot tables.
First, you’ve got to get the data ready. The data should be arranged into a
fact table based on point-in-time tables or data change logs (we talked about
these in the chapter on the types of data). If the table uses identifiers that are
incomprehensible to any ordinary person, then these fields should be
decoded by “joining” the reference data to (or “merging” it with) the fact
table. Take the following example. Sales have dropped off and we want to
understand why. Let’s suppose we have plotted a point-in-time table for
orders with the following fields:
The time and date of the order (for example, 10 November 2020,
12:35:02).
Customer type ID (1,2).
Customer status ID in the loyalty programme (1,2,3).
Order ID (2134, 2135, …).
Customer ID (1,2,3,4, ...).
Order amount in dollars (102, 1012, ...).
The resulting table is a fact table – it records the fact of orders being placed.
The data scientist wants to look at the buying habits of different types of
customers and loyalty card holders. The working hypothesis is that this is
where the main reason for the drop in sales can be found. The ID fields are
not readable and are used to normalize the tables in the accounting
database. But we do have reference tables (Tables 7.1 and 7.2) that we can
use to decode them.
Table 7.1. Client type reference table
ID Name
1 Individual
2 Business
Table 7.2. Customer status in the loyalty programme
ID Name
1 VIP client
2 Loyalty card
3 No loyalty card
After joining (or merging) the fact and reference tables, we are left with an
updated table of facts (Table 7.3):
datetime – the time and date of the order (for example, 10 November
2020, 12:35:02).
client_type – the type of customer that placed the order (an
individual or a business).
client_status – the customer’s status in the loyalty programme (VIP,
card holder or non-card holder).
order_id – order ID (2134, 2135, …).
client_id – customer ID (1,2, ...).
amount – order amount in dollars (102, 1012, ...).
Table 7.3. An example of data merging
What’s good about this fact table is that the only ID fields are Order ID and
Customer ID. These are useful fields, as they may be needed to view orders
in greater detail in the internal accounting system. The analyst receives a
sample of the data in this format, loads it into a spreadsheet (say, Microsoft
Excel or Google Sheets), and builds a pivot table. So, let’s analyse the pivot
table.
Pivot tables contain two types of data: dimensions and measures.
Dimensions are represented as a system of coordinates. When I hear the
word “dimensions,” I think of three axes of coordinates emanating from a
single point perpendicular to each other, just like we were taught in
geometry class at school. But there can be more than three dimensions
(axes). Many, many more. You can use them as columns, rows, or pivot
table filters, but they can never be placed in cells. Examples of dimensions:
Date and time
Client type
Client status.
Measures are statistics that will be calculated in the pivot table when you
“rotate” or change dimensions. They are, as a rule, aggregate figures: sums,
averages, distinct count, count, etc. Measures for our task include:
Order amount
Average purchase size
Number of orders (one line per order, no duplicate orders)
Number of unique customers (the number of unique IDs should be
counted here, as a single customer can place several orders, which
will be counted several times).
Order IDs and Customer IDs can be both dimensions (where you can find
statistics on specific orders or customers) and measures (where you can
simply count the number of orders or customers). It depends on the specific
task at hand, both methods work.
For each column, the analyst determines whether the data contained in it is
a dimension or a measure, and what statistics on the measure are needed.
This is all the preparatory work that is needed. The next step is to formulate
hypotheses and identify one or several slices that will either confirm or
refute these hypotheses. We use the term “slice” because of the
multidimensional nature of pivot tables. Imagine a three-dimensional object
with a length, width and height. A knob of butter, for example. You take a
knife and cut the butter to get a slice whose plane is perpendicular to the
axis you are looking at. It’s exactly the same principle with pivot tables,
only you are slicing multidimensional data. There can be several axes
depending on the number of dimensions there are – this is where the
multidimensionality comes from. The dimension perpendicular to which you
make the cut goes into the report filter, and the value you fix there determines
where the cut is made. The dimensions that lie in the plane of the slice will
make up the columns and rows of our table. If the report filter is not used,
all the data will be projected onto our slice through aggregation, which is
selected individually for each measure (sums, averages, counts).
The analyst formulates two hypotheses regarding the drop in sales:
The change in consumer behaviour has been caused by one of the
client types. Client type is one of the dimensions needed to test
this hypothesis.
The change in consumer behaviour has been caused by one of the
loyalty programme groups. Client status is one of the dimensions
needed to test this hypothesis.
As we are dealing with temporal changes, we need another dimension,
namely time. So, we have formulated a hypothesis and carved out the
required data slice. The rest is done by the technology: drag and drop the
required dimensions with the mouse, for example, put the date into columns
and client type into rows. Complete the table with the necessary measures
and check whether the numbers confirm or refute the hypothesis being
tested. Strictly speaking, a suitable statistical criterion should be used to test
the correctness of the hypothesis, although this is rarely done in practice.
Hypotheses can be formulated and tested sequentially. And once you’ve got
enough experience, you’ll be able to formulate them at the subconscious
level. Data scientists will play around with them in order to find the most
probable cause of the problem or reason for the success: make the first
slice, add the dimensions (cross-referencing them with existing ones) and
change the measures.
This type of analysis would take ten times longer without spreadsheets and
visual analysis tools on pivot tables. The analyst would have to program
each slice, for example using the GROUP BY operator in SQL or pivot in
the Python pandas library. Pivot tables allow data scientists to work as fast
as their thoughts allow.
OLAP Cubes
Pivot tables do not reside in spreadsheets only. In fact, they run rather
sluggishly, if at all, in spreadsheets containing massive amounts of data. But
our goal is for everything to work at the speed of the analyst’s thoughts,
right? Software developers try all sorts of tricks to help make this happen.
For example, they place data in a columnar database directly on the user’s
computer (we’ve already talked about the advantages of columnar databases
in the chapter on data warehouses). The second method is to do all the
calculations on servers and give the user access to these servers through the
interface (fat or thin client).
This is how OLAP (On-Line Analytical Processing) cubes were invented.
The history of how OLAP cubes came to be is quite interesting, if only
because our former compatriot Mikhail (Mosha) Pasumansky had a hand in
it. Mosha moved to Israel from St. Petersburg in 1990. It was there that he
wrote the Panorama analytical app, the first version of which came out in
1995. The following year, his company was bought out by Microsoft, which
needed a similar solution for a new version of SQL Server. After integrating
the system in Microsoft software, Mosha started developing a programming
language for working with OLAP cubes called Multidimensional
Expressions (MDX). MDX has become the standard for working with
OLAP cubes and is supported by numerous vendors. Microsoft's OLAP cube
service is now called Analysis Services.
We’ve already looked at how pivot tables work. Now let’s delve into how
OLAP cubes solve the performance problem and manage to calculate pivot
tables so quickly. I've worked a lot with Microsoft OLAP cube
technologies, so I will draw on my experience here. The central link of any
OLAP cube is the fact table, which we looked at through examples of
building pivot tables earlier. There is a small but important difference here:
as a rule, fact tables are not pre-joined with the reference (dimension) tables; the two
are loaded into the cube separately.
To do this, the data in the warehouse needs to be prepared according to the
so-called “star” schema (Fig. 7.4): the fact table is connected by fields
containing IDs (keys) with dimensional tables (reference data), as shown in
the figure.
The general rule here is to
keep all dimensions in separate directories. This is so that they can be
updated independently of the fact table. After preparing the necessary data
in the design program, you need to make a note of which tables are
dimension tables and which are fact tables. You also need to indicate in the
settings which measures need to be calculated. The cube’s preprocessing
involves reading all the data from the warehouse and placing it in special
structures that work quickly in terms of calculating pivot tables. The
dimensions are read and processed first, then the fact table. Things get really
interesting when new data appears. The new data is entered into our “star” (the data
warehouse) and the cube is updated accordingly. All the dimensions
(directories) are read: if new elements have appeared or have been renamed,
everything is updated. Dimension directories for OLAP cubes thus need to
be saved and updated in their original form – you can't delete data from
them. Updating fact tables is more interesting – you can either update and
recalculate the cube entirely, or erase the old data from the fact table, fill it
with completely new data and then update it. It is better to use incremental
refresh here for quicker cube processing times. At Ozon.ru, full cube
processing could take me four days, while incremental updates would take
just 20 minutes.
There are a number of popular options for storing and processing data in
OLAP cubes:
MOLAP – the storage scheme I described above. Data is stored in
special structures which calculate pivot tables extremely quickly.
ROLAP – data is not placed anywhere, rather it is stored in the data
warehouse. The OLAP cubes translate queries from pivot tables into
data warehouse queries and return the results.
HOLAP – data is partly stored in MOLAP, and partly in ROLAP
schemes. This can help reduce any time lags in the cube’s operation.
The cube is updated daily in accordance with the MOLAP scheme,
while new data is held in the ROLAP scheme.
I have always preferred MOLAP for its superior speed, although the
development of columnar databases has made ROLAP a more viable option. This is
because the ROLAP scheme does not require the cube to be carefully
designed (unlike MOLAP), which makes it far easier to provide technical
support for the OLAP cube.
Microsoft Excel spreadsheets are the perfect client for OLAP cubes, as
working with Microsoft cubes is a breeze. Thin clients just don’t provide
the convenience and flexibility that Excel offers: the ability to use MDX,
flexible formulas, and build reports from other reports. Unfortunately, this
functionality is only supported on the Windows operating system; the OS X
(Apple) version of Excel does not support it. You can’t imagine how many
team leaders and IT specialists curse MS developers for this! It means
they’ve got to have separate Windows laptops or remote machines in the
cloud just so they can work with Microsoft cubes. If you ask me, that’s the
biggest disadvantage that Microsoft could have in our field.
My Experience
I’ve worked with a variety of systems in my time, from spreadsheets to
Hadoop/Spark distributed systems with petabytes of data. All I can say is
there's a time for everything. Analytics architecture is an art form that
requires remarkable talent. I’ve talked a lot in this chapter about my own
best practices, which have been tested through years of experience in the
field. No matter what solution you come up with, there are bound to be pros
and cons. And it’s better to know what they are before you create a system,
rather than after.
You’re constantly performing a balancing act between speed and quality,
between the costs now and the costs later. But it's always helpful to think
about the user. The faster and more intuitively the user is able to work with
data, the fewer errors he or she will make, which in turn will enable them to
make decisions more efficiently. I believe it is speed that ensures the
success of a company, even if the quality guarantees are so-so. The one who
has made the most decisions wins.
Analytics systems allow you to navigate the data, but they don’t make
decisions. Humans do. So, why not improve the quality of their lives by
making it easier for them to make decisions?
OceanofPDF.com
CHAPTER 8. MACHINE
LEARNING ALGORITHMS
Types of ML Problems
Classical machine learning can be divided into three types:
supervised learning
unsupervised learning
reinforcement learning.
Supervised learning means that each set of inputs (independent variables or
features) has a value (dependent variable) that the model must predict.
Examples of such tasks are presented in the table below (Table 8.1) [50] .
Table 8.1. Types of ML Problems
There are two main classes of problems here: a regression problem, where
you need to predict an indicator with a continuous scale (for example,
money, probability, etc.); and a classification problem, where you
need to understand the class that an object belongs to (whether or not there
are people in a photograph, whether or not a person is sick, who is in the
photo, etc.). There are also a number of tasks related to language
translation, speech recognition, and geo-positioning, which have become
widespread thanks to deep neural networks.
The concept of “regression” was first put forward by Sir Francis Galton
after he read Darwin’s On the Origin of Species . He decided to research the
relationship between child height and parent height. According to his
findings, children of very tall parents tend to be less tall when they reach
adulthood, and children of extremely short parents tend to be less short
when they reach adulthood. Galton called this reverse growth “regression”
(i.e. to move in the opposite direction) [51] . In 1885, he published his
“Regression towards Mediocrity in Hereditary Stature,” and his concept
would soon be applied to all problems with one-sided stochastic
dependence.
Unsupervised learning implies that there is a pattern in the data, but you
don’t know what it is (there is no dependent variable). It could involve
dividing a dataset into clusters (cluster analysis), searching for anomalies,
autoencoders (for example, to reduce the dimension of the feature space),
principal component analysis, or collaborative filtering (recommendation
systems).
Reinforcement learning is a model of learning through interaction with the
environment. This is a special case of a learning model that involves a
“teacher.” Here, the teacher is not a dataset, but rather a reaction of the
environment to an action we have performed. This kind of learning is often
used in the development of bots for games (not just shooter games, but
chess games too), to control robots, etc. Unlike classic machine learning,
there is no dataset here. The agent (for example, a robot) performs an action
in the environment and receives feedback in the form of a reward. It thus
learns to perform actions that maximize the reward. This method can be
used to teach a robot how to walk or play chess, for example.
Each type of problem in classic ML has its own algorithms. Depending on the
task and the data, the popular scikit-learn library suggests (in its well-known
“choosing the right estimator” cheat sheet) the model that is most likely to produce results. I
used this library when I started studying ML models in Python for Kaggle
competitions.
ML Task Metrics
If there is a non-random pattern in the data, a corresponding model is
selected for the task. And if there is sufficient data, then it is not a
particularly challenging task to train the ML model. The first step is to work
out how to measure the effectiveness of the model. This is crucial: if we
create a model that helps in real life, it must have a clearly defined goal.
And the goal is expressed in the metric. For example, if you want the model
to predict demand for goods, then the goal will be to reduce the discrepancy
with reality (forecast error). If you need to determine whether a given
photograph contains an image of a dog, your goal will be to increase the
percentage of photographs with the correct answer. All Kaggle tasks have a
metric that is used to select the winner.
Ignoring this step is a rookie mistake that even fully fledged analysts still
sometimes make. I can tell you from my experience at Retail Rocket that
choosing the wrong metric can be very costly. This is the foundation of
machine learning. And it’s not as easy to choose the right metric as it may
seem at first. For example, there are a number of COVID-19 throat swab
tests. One of them gives more false-positives (where the patient is not
actually sick), while another gives more false-negatives (which miss
patients who are actually infectious). Which testing system do you choose?
If you choose the first one, then you’ll be forcing many people to self-
isolate, which will in turn affect the economy. If you choose the second one,
then infected people will be allowed to go about their daily lives and spread
the virus. You’ve got to weigh up the advantages and disadvantages. You’ll
run into similar problems when developing recommendation systems: one
algorithm gives more logical recommendations for the user, while the other
gives recommendations that would bring in greater profit for the store.
(This means that the machine learned through weak but important signals
that seem illogical to us, and that its recommendations lead to increased
sales, although this cannot be explained by logic due to the limitations of
the human mind.) Which algorithm do you choose? If you run a business,
you should choose the second. The thing is, though, when you’re trying to
sell this recommendation system, managers who make purchasing decisions
often like the first one better. And that’s easy to understand – a website is
like a storefront and it needs to look appealing. You end up having to
change the first algorithm, which works fine as it is, to ensure greater
profitability and make the recommendations look more logical. Focusing on
a limited number of metrics makes it hard to develop new algorithms.
A metric can be considered the difference between what a model predicts
from the input data (independent variables, or features) and what it actually
is (the dependent variable, or outcome). There are many nuances when it
comes to calculating metrics, but this difference is always key. Typical
metrics for ML tasks depend on what class they are.
For regression, we typically use the mean squared error (MSE). This is the sum
of the squares of the differences between the predicted and actual values
from the dataset divided by the number of examples in the dataset:
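In symbols, with yᵢ the actual value, ŷᵢ the prediction, and n the number of examples in the dataset:
MSE = (1/n) × Σ (yᵢ − ŷᵢ)²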
There are a host of other popular metrics, including RMSE, MAE and R².
For classification tasks, the most popular metrics can be easily obtained
from a misclassification, or confusion, matrix. When I interview a candidate
for a data scientist position, I often ask them to plot such a matrix (see Table
8.2) and derive metrics from it.
Table 8.2. Misclassification matrix
                        Actual: 1 (True)        Actual: 0 (False)
Predicted: 1 (True)     True Positive (TP)      False Positive (FP)
Predicted: 0 (False)    False Negative (FN)     True Negative (TN)
The area under the ROC curve is known as the AUC. If the AUC is close to 0.5, then your
classifier is not really any different from a random coin toss. The closer it is
to 1, the better. This metric is often used in research papers and Kaggle
competitions.
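A small sketch of how these metrics could be computed with scikit-learn (the labels and scores below are invented):

from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                     # actual classes
y_score = [0.9, 0.8, 0.4, 0.6, 0.2, 0.55, 0.7, 0.1]   # predicted probabilities from the model
y_pred = [1 if p >= 0.5 else 0 for p in y_score]      # class labels at a 0.5 threshold

print(confusion_matrix(y_true, y_pred))  # rows are actual classes, columns are predicted classes
print(roc_auc_score(y_true, y_score))    # area under the ROC curve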
Other tasks, such as ranking search results or for recommendations, have
their own performance indicators. You can learn about them in specialist
literature.
So, we’ve got a metric. We can now use it to compare different models to
gauge which is better. Let the learning begin!
This is the idea behind gradient descent: we have a loss function, and from an
arbitrary starting point we move towards its minimum sequentially, step by step.
There are two more versions of this algorithm: stochastic gradient descent
(SGD) and mini-batch gradient descent. The first is used for working with
big data, when just one example from the entire dataset is used for a single
training iteration. The mini-batch version is an alternative
whereby a small subset of the dataset is used instead of a single
example.
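To make the idea concrete, here is a rough sketch of plain gradient descent for a one-variable linear model y = a × x + b, minimizing the MSE (the data and learning rate are invented):

import numpy as np

# Invented data that roughly follows y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

a, b = 0.0, 0.0         # arbitrary starting point
learning_rate = 0.01    # size of each step

for _ in range(5000):
    y_pred = a * x + b
    error = y_pred - y
    # Partial derivatives of the MSE with respect to a and b
    grad_a = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step in the direction that decreases the loss
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

print(a, b)  # should end up close to 2 and 1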
Linear Regression
This is the simplest and most popular regression model. We actually use it
at school when we learn the formula for linear dependence. This is the
formula I was taught in school: y = a × x + b . This is so-called simple linear
regression – it has just one independent variable.
Data scientists typically work with multivariate linear regression, which
uses the following formula:
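y = w₀ + w₁ × x₁ + w₂ × x₂ + … + wₙ × xₙ
(here the xᵢ are the features, the wᵢ are their coefficients, and w₀ is the intercept)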
It consists of the sum of the products of each coefficient, the value of the
corresponding feature and an additional intercept. The result is a straight
line when we are dealing with a single independent variable, and a
hyperplane when we are dealing with N features. When training linear
regression, the hyperplane is constructed in such a way as to minimize its
distance from the points of the dataset – in effect, minimizing the mean squared error described above. The
first thing I ask a person applying for a data scientist position is: “Here are
the results of an experiment. The points are marked on the plane with two
axes. A line has been plotted to approximate these points. How do you
know whether the straight line has been plotted optimally?” This is a good
question for understanding the essence of linear regression.
If the input features are normalized, then a feature's influence on the
dependent variable (and thus on the result) is proportional to
that feature's coefficient. A positive coefficient means that an increase in the
value of the feature increases the value of the dependent variable (positive
correlation). A negative coefficient means there is a negative correlation, or
negative linear dependence.
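A short scikit-learn sketch that fits a multivariate linear regression and reads off the coefficients (the dataset is invented; the features are normalized first so that the coefficients are comparable):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Invented dataset: two features and a target
X = np.array([[1.0, 200], [2.0, 150], [3.0, 120], [4.0, 90], [5.0, 60]])
y = np.array([10.0, 14.1, 17.8, 22.2, 25.9])

X_scaled = StandardScaler().fit_transform(X)  # normalize the features

model = LinearRegression().fit(X_scaled, y)
print(model.intercept_)  # the intercept term
print(model.coef_)       # one coefficient per feature; the sign shows the direction of the correlation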
Logistic Regression
This is the most widely used model for solving binary classification tasks
(tasks with two classes).
Let’s say we have to separate two classes: noughts and crosses. I mark them
on a coordinate grid, with the values of the X1 and X2 features being
plotted along the axes (Fig. 8.5). Clearly, you can draw a straight line
between the noughts and crosses to separate them: the noughts are above
the line, and the crosses are below it. This is how logistic regression works
– it looks for a line or hyperplane that separates classes with minimal error.
In terms of results, it gives the probability that a point belongs to a certain
class. The closer a point is to the dividing surface, the less “confident” the
model is in its choice (probability approaches 0.5). Accordingly, the further
the point is from the surface, the closer the probability is to 0 or 1,
depending on the class. There are two classes in our present task. Thus, if
the probability of belonging to one class is 0.3, then the probability of it
belonging to the second is 1 − 0.3 = 0.7. Probability in logistic regression is
calculated using a sigmoid function (Fig. 8.6).
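The sigmoid itself has a simple form: σ(t) = 1 / (1 + e^(−t)); it squeezes any value of t into the range between 0 and 1.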
Here, t is the value of the familiar linear formula with
coefficients, just as in linear regression. That linear formula is the equation
of the dividing surface I described above.
This is the most popular model both among researchers, who love it for its
simplicity and interpretability (the coefficients are the same as for linear
regression), and among data engineers. Unlike other classifiers, this simple
formula is easy to scale under high loads. And when banner ads follow you around
the internet, it is likely thanks to logistic regression, which has been
used at Criteo, one of the largest retargeting companies in the world [54] .
Decision Trees
Decision trees are right behind linear methods in terms of popularity. They
are a descriptive tool (Fig. 8.8) that can be used for classification and
regression tasks. Some of the best classification algorithms (CatBoost, XGBoost and
Random Forest) are based on decision trees. The method itself is non-linear
and represents rules as conditional statements: “if… then…” A decision
tree consists of internal nodes and leaf nodes. Internal nodes are conditions
on independent variables (rules). Leaf nodes are answers that contain the
probability of belonging to a given class. Answers are obtained by moving
down from the root of the tree. The goal is to get to a leaf and determine the
necessary class.
Decision trees are built on completely different principles than those we
have considered in linear models. My kids like to play “question and
answer.” One player thinks of a word and the others have to guess what that
word is by asking “yes–no” questions. The player who guesses the word
correctly with the fewest number of questions wins. It’s the same with
decision trees: the rules are constructed in such a way as to move from root
to leaf in as few steps as possible.
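A quick scikit-learn sketch that trains a small decision tree and prints its “if… then…” rules (the toy dataset and feature names are invented):

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented features: [age, has_loyalty_card]; target: did the customer buy (1) or not (0)
X = [[25, 0], [32, 1], [47, 1], [51, 0], [62, 1], [23, 1]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Internal nodes are conditions on the features; leaves give the predicted class
print(export_text(tree, feature_names=["age", "has_loyalty_card"]))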
Learning Errors
A correctly chosen model tries to find patterns and generalize them.
Performance metrics allow you to compare different models and training
approaches by way of simple comparison. You have to agree that if you
have two models with forecast errors of 15% and 10%, respectively, then it
is obvious that you should go for the second model. And what happens if
data that was not in the dataset gets into the model during testing? If
training the model has provided us with high-quality generalization, then
everything should be in order and the error negligible. If not, the error can
be extremely large.
Training the model can lead to two types of error:
The model did not notice any patterns (high bias, underfitting).
The model performed an overly complex interpretation – for
example, it identified a quadratic relationship where it is actually a
linear one (high variance, overfitting).
Let me try and demonstrate. Figure 8.8 shows the results of experiments in
the form of dots (remember your physics classes at school). We need to find
patterns, which involves plotting lines describing them. Everything in the
first picture in the figure is good: the straight line adequately describes the
data, and the distances from the dots to the line are small. The model has
identified the pattern correctly. The second clearly shows nonlinear
dependence (for example, quadratic), meaning that the line drawn through
the points is incorrect. What we have here is underfitting – we have made a
mistake in the order of the function. The opposite is true for the third
picture. The model is too complex for the linear dependence that we can see
from the dots. Data outliers have distorted the results. This is a classic
example of overfitting: a simpler model (a linear one, for example) should
have been used.
This is, of course, an artificial scenario, with one independent variable on
the horizontal axis and one dependent variable on the vertical axis. In such
simple conditions, the problem is immediately evident from the graph. But
what if we have several independent variables, say, ten? This is where
validation comes to the rescue.
Validation is used to identify such errors when working with models, just
like with a black box. The simplest approach is to randomly divide the
dataset into two parts: a large part that is used to train the model, and a
small part that is used to test it. The split is usually 80:20. The trick here is
that when the model is put into action, the real error will be similar to the
one we obtained from the test dataset. Another kind of validation involves
dividing the data into three parts rather than two: one in which the model is
trained; a second in which the model’s hyperparameters (settings) are
selected; and a third in which the test score is obtained. In his Machine
Learning Yearning [60] , Andrew Ng sees this as the primary validation
model. Now let’s discuss the diagnostic algorithm itself. Say we have two
figures – the mean squared errors for the training dataset and the test
dataset. Now, let’s compare them:
The test and training errors are almost identical. The error is
minimal and thus satisfactory. Congratulations! The model has been
trained correctly and can be put into operation.
The test error is significantly larger than the training error. You are,
nevertheless, satisfied with the training error. Overfitting is evident
here – the model is too complex for the data.
The training error is high. This is an example of underfitting. Either
the model is too simple for the data or there is insufficient data (in
terms of sheer size, or certain features are missing).
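As a minimal sketch of this diagnostic, the snippet below (Python with scikit-learn; the dataset and model are illustrative assumptions) performs the 80:20 hold-out split described above and prints the training and test errors for comparison:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# 80% of the data for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))

# Compare the two errors as described above: a much larger test error hints at overfitting,
# a high training error hints at underfitting
print(train_error, test_error)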
K-fold cross validation is more complex. It is widely used in high-level
work, scientific research and competitions and involves randomly dividing
the dataset into k (for example, eight) equal parts. We then extract the first
part from the dataset, train the model on the remaining parts, and count the
errors in the training and extracted data (test error). This sequence is
repeated for all the parts. We are left with k errors, which can then be
averaged and compared in the same way as described above.
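A minimal sketch of k-fold cross-validation with scikit-learn, using k = 8 as in the example above (the model and dataset are assumptions for illustration):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Split the data into 8 parts; each part is held out once while the model trains on the rest
scores = cross_val_score(LinearRegression(), X, y, cv=8, scoring="neg_mean_squared_error")

test_errors = -scores  # scikit-learn reports negated errors, so flip the sign back
print(test_errors.mean(), test_errors.std())  # average the k errors, as described above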
The fourth method is early stopping, which applies to models that are
trained iteratively. It requires counting the errors on the training and test
datasets at each iteration step; training is halted when the test error starts to
increase.
The fifth method is regularization. Regularization adds a penalty term to
the error function that the ML model optimizes: the sum of the model's
coefficients multiplied by a regularization coefficient. There are several
types of regularization: L1 – the sum of the absolute values of the
coefficients; L2 – the sum of the squares of the coefficients; and elastic net
– the sum of the L1 and L2 regularizations with separate coefficients. The
goal of regularization is the pessimization of coefficients with large values.
This ensures that no single feature is weighted disproportionately.
Regularization coefficients are known as hyperparameters. They also need
to be selected so that a smaller error is obtained. L2 regularization is the
most popular of these methods.
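A minimal sketch of the three types of regularization for a linear model in scikit-learn; the alpha values (the regularization coefficients, i.e. hyperparameters) are arbitrary assumptions for illustration:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

l1 = Lasso(alpha=0.1).fit(X, y)                     # L1: penalty on the sum of absolute coefficients
l2 = Ridge(alpha=1.0).fit(X, y)                     # L2: penalty on the sum of squared coefficients
en = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # elastic net: a weighted mix of L1 and L2

# Stronger regularization shrinks large coefficients towards zero
print(abs(l1.coef_).max(), abs(l2.coef_).max(), abs(en.coef_).max())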
And last but not least, we have ensemble methods.
Ensemble Methods
The “no free lunch” theorem states that no single algorithm will be optimal
for all tasks. Analysts can spend their time trying model after model in
search of the one that best solves a given problem. But what if we were to combine different
models into one large model? We would have a new algorithm – an
ensemble of algorithms. These can be extremely accurate, even if you use
“weak” algorithms that are only negligibly more reliable than a toss of a
coin. The development of computing power (more memory, increasingly
powerful processors, etc.) has made combining algorithms easy to do.
While there are many ways to combine simple algorithms into ensembles,
we will deal only with the two most popular – bagging and boosting. Bagging
(or bootstrap aggregating) was first put forward by Leo Breiman in 1994.
The idea is to create several datasets, all of which differ slightly from the
original. There are two ways to do this: randomly selecting samples from
the original dataset (sampling); and randomly selecting a subset of features
from the original dataset. These methods are typically combined. The
sampling of data is done with replacement, meaning that the relevant lines
are not deleted from the original dataset. As a result, certain data will be
present in the new dataset several times over, while other data will not be
there at all.
Base algorithms for bagging should be prone to overfitting. One example
here would be deep decision trees with many branches. A basic algorithm is
then trained on each training dataset. The result of the ensemble is obtained
by averaging the training results on all datasets. The most well-known
algorithm is Random Forest, which is easy enough to write yourself,
although there are a number of ready-made implementations that can be
used [56] .
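Here is a minimal Random Forest sketch using scikit-learn as one of those ready-made implementations (the dataset and parameter values are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 200 deep trees, each trained on a bootstrap sample and a random subset of features,
# then averaged (voted over) to produce the ensemble's answer
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", bootstrap=True, random_state=42)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))  # accuracy on the held-out data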
Boosting is built on entirely different principles. It involves using simple
algorithms that are only slightly more reliable than a coin toss, for example
shallow decision trees. This is how it works: the first tree is trained, then the
second tree is trained based on the errors made in the first tree, then the
third tree, and so on until we reach the required level of accuracy. All the
trained trees are assigned a weight that is proportional to their accuracy.
These weights are used when you need a response from the voting model:
the higher the weight, the greater the impact on the result (for example,
AdaBoost).
The most widely used algorithms based on gradient boosting (Gradient
Boosting Decision Tree) are XGBoost [57] , LightGBM [58] from
Microsoft and CatBoost [59] from Yandex. These are the algorithms that win
Kaggle competitions most frequently.
The difference between boosting and bagging is that the former is a serial
process, while the latter is a parallel process. Bagging is thus considered the
faster option. It can be parallelized across cluster nodes or processor cores.
This is important, for example, when you need results quickly. Random
forest is best for this. If you need accuracy, boosting is the way to go, but
you’ll need to spend a long time (days or even weeks) studying it and
selecting the numerous model parameters (hyperparameters). Random
forest is easier to get to grips with out of the box. One of the secondary
functions of these ensembles is that a list of features sorted in order of their
impact (feature importance) is created. This can be useful if you want to use
another model and there are too many features in the dataset.
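And a sketch of boosting over shallow trees, together with the feature-importance list mentioned above. I use scikit-learn's built-in gradient boosting here instead of XGBoost, LightGBM or CatBoost simply to keep the example dependency-free; the parameters are illustrative assumptions:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()

# Many shallow trees trained one after another, each correcting the errors of the previous ones
boosting = GradientBoostingClassifier(n_estimators=300, max_depth=2, learning_rate=0.05, random_state=42)
boosting.fit(data.data, data.target)

# The list of features sorted by their impact on the model (feature importance)
order = np.argsort(boosting.feature_importances_)[::-1]
for i in order[:5]:
    print(data.feature_names[i], round(boosting.feature_importances_[i], 3))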
CHAPTER 9. THE PRACTICE OF
MACHINE LEARNING
ML Competitions
On 2 October 2006, Netflix announced its first ever “Netflix Prize,” with $1
million in prize money up for grabs for whoever could improve the
company’s recommendation system by 10% as measured by RMSE (root
mean squared error). The prize was eventually collected by BellKor's Pragmatic Chaos
team in September 2009. The contest had lasted almost three years due to
the difficulty of the task.
Similar contests have been held by the ACM Special Interest Group on
Knowledge Discovery and Data Mining (ACM SIGKDD) as part of the
KDD Cup programme. A new contest is held every year with its own
organizers, data and rules.
These developments led to the creation of a platform for commercial
machine learning competitions – Kaggle.com. Kaggle was founded by a
team of three in 2010 and sold to Google in 2017. These days, it provides
an abundance of services, but first and foremost is the machine learning
competition, which offers some serious prize money. The competition
format is almost identical to Netflix’s: a company publishes its data and the
rules of participation. On the closing day of the competition, the points of
all entrants are tallied to determine the winner. The winners get prize
money, while the company gets a solution. A description of the solution is
often published on the forum.
Machine learning competitions are a way to improve practical ML skills
and master the creation of features based on datasets. The events are open
for everyone, and the experience generally proves to be valuable. Sounds
great, right? Well, there is a flipside to this. The solutions can’t be used
wholesale; only certain ideas can be taken from them. For example, Netflix
stated [65] that the winning entry consisted of 107 sub-algorithms, of which
only two gave a significant result: matrix factorization (SVD) and restricted
Boltzmann machine (RBM). The company had some difficulty
implementing these two algorithms into its system. The Pareto principle
proved true once again: 80% of the result was produced by 20% of the
effort (the two algorithms). I’ll say it again – the Netflix developers did not
introduce the entire monster as a whole, but took just a few of its elements.
The winning algorithm simply cannot be introduced in its complete form.
It’s too resource-heavy and complex, and it would cost ridiculous amounts
of money to maintain and support it.
This is the main drawback of solutions obtained in competitions like these –
there are no restrictions on computational cost or on the simplicity of the
result. Solutions will often simply not be feasible in production. But I would still urge you to
take part in such competitions, as they can be extremely beneficial. Look
for solutions on forums and repeat them. Learn how to make features. It’s
not easy, but they are the heart of ML. Don’t worry about where you place –
if your solutions produce results which are around 5% worse than the
winner, you’ve done a great job. Even if you’re only just above the median,
this is still good. You’ll take a lot away from the experience.
If I had to choose between two candidates – one who has won multiple
prizes on Kaggle and has dozens of models under their belt, and another
who has only entered two competitions but came up with a problem and
was able to solve and implement it and then use metrics to prove how it
could make money for the company – I would choose the second.
Even if they won’t have to repeat all of these steps at the new job, I can still
see that they are able to take a step back and see the whole picture. And this
means we will be on the same page with the people who will introduce the
final product. The candidate will have no difficulty understanding the
limitations and requirements of collaborating departments.
Artificial Intelligence
Artificial intelligence (AI) is one of those terms that is in vogue at the
moment. This is the first time that I have mentioned AI in this book,
although I use it all the time. I first came across the phrase “data mining” in
the early 2000s when I was still working at StatSoft. It’s really just
marketing jargon that means run-of-the-mill data analysis made up of
several components. It was a running joke in the office that real experts do
all the data mining by cobbling together a couple of algorithms. It wasn’t
long before a new buzz phrase appeared, machine learning , which
specialists liked better because it actually described a new field. Yet another
oft-used term is big data , although the hype around this appears to have
subsided somewhat – it turns out that the technology itself couldn’t live up
to the high hopes that had been placed on it. I’ve been to plenty of ACM
RecSys conferences and I don’t remember ever hearing the expression big
data there, and some of the corporations that take part have massive
amounts of data (Amazon, Google and Netflix, to name but a few).
Companies only use these words for branding purposes and selling their
services to show that they are “with it.” If they don’t, their competitors will
pass them by.
AI has become a hot topic with the advent of deep learning. The principle
of a neuron working as a building block of a neural network has been
borrowed from biology. But you’ve got to admit that this is no reason to see
the neural network as something even close to insects in terms of
intelligence. Right now, the operations carried out by neural networks can
in no way compare with what even the most primitive beings are capable of
(such as synthesizing new life or independent decision-making). If we are
to come close to creating an intelligence that is somewhat similar to that of
an animal, we need to work on synthesizing and training biological neural
networks, rather than electronic ones.
Artificial intelligence is a rather abstract term, in my opinion, so I prefer to
use more specific language like “computer vision.” It is in this area that the
most significant breakthroughs have occurred, thanks to neural networks.
Computer vision is used everywhere today – from tagging people on mobile
phones and social networks to self-driving cars. Governments use it to
perform police functions, and commercial organizations use it to solve their
problems. I personally love it when the marriage between hardware and
software brings practical results. Take the Stingray robot, for example, which
kills sea lice on farmed salmon using computer vision and a laser [73] .
Salmon lice are responsible for the mass death of fish during artificial
breeding. For example, it caused global salmon supplies to dip by 9 percent
in 2016 [74] . The underwater robot solves the problem: once it spots a
parasite on the body of a fish, it destroys it with a laser.
Robotics is another area where huge breakthroughs have been made. The
advancements here have not been limited to the use of computer vision
only. When I visited the MIT Museum in Boston, I noticed that the Boston
Dynamics project can trace its roots back to an MIT lab in the 1980s. Even
then, the researchers at the best university in the world were working on
computer and robotic control. They had a robot jumping on a stick without
falling. Boston Dynamics broke away from MIT in 1992. Now the company
is famous for its robots, which have become YouTube stars in their own
right, garnering record numbers of views. The South Korean company
Hyundai recently purchased Boston Dynamics for $1 billion. To be honest,
it’s at times like this that I just can’t work out what’s going through the heads
of Russia’s super-rich – surely it’s far more worthwhile to invest in projects
that could make you the next Elon Musk than it is to pump money into
football clubs. Projects like Boston Dynamics may still be poorly
commercialized today, but their time will come. Humanity is moving in the
right direction.
Will AI replace people? I believe it will. And it will be businesses that make
this happen. Business follows the self-serving principle of “if there’s a way
to save money, then let’s go for it!” Once, in order to save money, many
Western companies started opening manufacturing facilities in Southeast
Asia, where the cost of labour was much cheaper. Robotization means that
fewer workers are needed per product unit, and then it makes better
financial sense to produce the goods in the country where the products are
sold. One example of this is the creation of the Speedfactory robotic
factories by Adidas, which were opened in Germany and the United States
in 2016 and 2017 [75] . The goal was to bring production closer to the
customer. In 2019, however, the company decided to close these factories.
Despite this setback, the trend is clear for all to see – the robotization of
production will replace more and more people.
Required Data Transformations
Before feeding data into ML models, we need to carry out a number of
important transformations on them:
standardize the data (reduction to a single scale)
remove outliers
prepare categorical variables
work with missing data
sample unbalanced classes.
With linear models, you need to standardize the data, since features are often
presented on different scales. Say you’ve got two features in a dataset – the
price of an apartment ($30,000 to $1,500,000) and its size (20–500 square
metres). These are completely different ranges, which means that the
model’s coefficients lose their physical meaning. It will be impossible to
compare the effect of a given variable on the model. Regularization also
presents a problem, namely, the unnecessary pessimization of the
coefficients. There are different options for standardization. One is to
subtract the mean and divide the result by the standard deviation of the
variable. The output will be a variable with a mean equal to 0 and a
standard deviation equal to 1. Standardization does not affect the error of
a linear model (provided it is trained without regularization). However,
certain methods are sensitive to the scales of variables, for example
principal component analysis (PCA, which we covered in the previous
chapter).
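A minimal sketch of the standardization described above – subtract the mean, divide by the standard deviation – using the apartment price and size features as an assumed example:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: price in dollars and size in square metres
X = np.array([[30_000, 20], [250_000, 60], [900_000, 180], [1_500_000, 500]], dtype=float)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # subtract the mean, divide by the standard deviation

print(X_std.mean(axis=0))  # roughly 0 for each feature
print(X_std.std(axis=0))   # roughly 1 for each feature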
Outliers can also introduce a significant error into the model. The
regression line will look different if you remove the outliers, which is
particularly important when you don’t have much data. Removing outliers
is no easy task. The simplest way is to delete data that lies outside a certain
percentile (for example, the 99th percentile). Figure 9.1 shows an example of
how an outlier can change a straight line: the dotted line is the linear
regression line for the data with outliers, while the solid line shows the data
with the outliers removed. As you can see, the straight line “turns” once the
outliers are removed.
Fig. 9.1. Removing outliers changes behaviour
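A minimal sketch of the simplest approach mentioned above – dropping everything beyond a chosen percentile; the generated data, the injected outliers and the 99th-percentile threshold are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=100, scale=15, size=10_000)
values[:20] = 10_000  # inject a few extreme outliers

threshold = np.percentile(values, 99)  # cut everything above the 99th percentile
trimmed = values[values <= threshold]

print(values.mean(), trimmed.mean())  # the few outliers noticeably shift the mean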
This can be factored in when the first version of a product is created. You
don’t need to shoot for super-precision right off the bat. It’s better to go for
an acceptable level of accuracy at the initial stages. The second method is to
create a solution, implement it in the working system and only then decide
whether the business needs to improve the metrics or not. It might be worth
spending your time on a different product instead of endlessly polishing one
– the customers won’t notice anyway, or appreciate your efforts.
In terms of RFM (recency, frequency, monetary value), the best customer is
the one who regularly places orders for large amounts of money and who
has made a purchase very recently (Fig. 9.3). This fundamental principle
has helped create features that predict
the likelihood of future actions. And it can be extended to other actions (i.e.
not just purchases): the likelihood of getting sick; the likelihood of a user
returning to a website; the likelihood of going to jail; the likelihood of a
person clicking on a banner, etc. I performed reasonably well in a Kaggle
competition using only these variables and a simple linear model. To obtain
better results, I used binary encoding instead of real numbers. Segmentation
can be used as a basis here (I talked about this earlier). You can take either
R or F separately, or RF as a whole.
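A minimal sketch of building RFM features from an order log with pandas; the table, the column names and the reference date are assumptions for illustration:

import pandas as pd

# An assumed order log: one row per purchase
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime(["2021-01-05", "2021-03-01", "2020-11-20",
                                  "2021-02-10", "2021-02-25", "2021-03-05"]),
    "amount": [40.0, 55.0, 20.0, 120.0, 80.0, 95.0],
})

now = pd.Timestamp("2021-03-10")
rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),  # days since the last purchase
    frequency=("order_date", "count"),                       # number of orders
    monetary=("amount", "sum"),                              # total amount spent
)
print(rfm)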
Conclusion
There are two free resources that I would recommend to supplement any
theoretical books you may have on this topic: Andrew Ng’s [60] book on
the practice of machine learning and Google’s Rules of Machine Learning
[72] . They will take your understanding to new levels.
CHAPTER 10. IMPLEMENTING
ML IN REAL LIFE: HYPOTHESES
AND EXPERIMENTS
Every experiment may be said to exist only in order to give the facts a
chance of disproving the null hypothesis.
RONALD FISHER
Hypotheses
A hypothesis is an idea for how to improve a product. It doesn’t really
matter what that product is – a website, product or store. Many companies
now employ product managers, whose job includes creating and
maintaining a list of such hypotheses and prioritizing their implementation.
A list of hypotheses is called a “backlog,” which is an important strategic
element of the company’s development. Just how to come up with
hypotheses and prioritize them is a topic for another book. In a nutshell, this
is how it should ideally look: product managers interact with the market,
work with existing and potential customers, analyse the solutions of their
competitors, and carry out focus groups in order to understand how
profitable a given change (hypothesis) may be for the company. A list of
hypotheses is created on the basis of the results of this research, and the
hypotheses prioritized. Businesses need financial metrics in order to
prioritize hypotheses – the more accurate, the better. But this is not as easy
as it sounds, and it is often a matter of taking a “shot in the dark.” The
biggest commercial successes in history have been revolutionary, not
evolutionary. Just look at the first iPhone.
Prioritizing hypotheses serves the main goal of achieving success as quickly
as possible. Following this logic, ideas that are more likely to produce this
result should be first on this list, right? Well, there are different levels of
complexity when it comes to implementing hypotheses, which means that
you have to take labour intensity and the cost of setting up the necessary
infrastructure (servers, hiring additional personnel, etc.) into account when
prioritizing hypotheses. Let’s say the first hypothesis promises to bring in
around $150,000 per year, and you’ll have to hire two developers and a data
scientist for one month to implement it. Meanwhile, the second hypothesis
promises $25,000 at a cost of five days’ work for two additional staff.
Which hypothesis would you pick first? I always leave these decisions to
management. I can’t give a clear-cut answer here.
What if we give the task of coming up with hypotheses and prioritizing
them to individual departments? On the one hand, this seems like the
perfect solution. After all, not having to go through the IT department will
ensure that things get done quicker, right? But let’s imagine that a company
is a living organism and its strongest department (say, IT) is its arms. The
department has a good list of hypotheses, and the priorities are more on the
mark than those produced by other departments – it has strong arms! Now
let’s imagine a triathlon competition. In the Olympics, you have to swim
1500 m, then get on a bike and ride 40 km, and finish off with a 10 km run.
Strong arms are needed for the first stage, but then you’ll need strong legs
for the next two. If you’ve been skimping on leg day, you’ll lose to the
more well-rounded competitors. Or you might not finish the race at all! It’s
the same in business – you can’t rely on one department. You need to take a
balanced approach. I fell into this trap myself at Retail Rocket, stewing in
my own juices trying to prioritize my hypotheses by myself. Sure, one of
our departments was incredibly strong, but the other teams just couldn’t
keep up with us. If I could go back, I’d insist on collaboration on
everything, including the product and the market.
You can’t test all the hypotheses on your list. Most of them will remain just
that – hypotheses. Don’t fret about it, it’s actually a good thing, as it means
the most profitable ideas will be brought to fruition first. Every
hypothesis eats up resources. We’re not fortunate enough to have infinite
resources, so we can’t test every idea. I’ll say more – nine out of ten
hypotheses don’t pan out. And you often have no idea whether a hypothesis
will produce the desired result until you are well into the testing process. I
believe that it is best to
kill a hypothesis as early as possible – as soon as the first sign that the idea
won’t take off presents itself. Not only will this save you resources (a lot of
resources!), but it will also allow you to move on to the next hypothesis
quicker.
I’ve compared various types of hypothesis and their effects. Evolutionary
hypotheses, where one parameter is slightly optimized, have a less profound
effect than revolutionary hypotheses, where the approach is fundamentally
different. That said, evolutionary hypotheses are more likely to bear fruit.
This is precisely how sample size calculators work: you enter the minimum
detectable difference in the parameter values and the acceptable α and β
errors, and the result is the amount of data that needs to be collected. The
pattern here is simple – the smaller the difference you want
to detect, the more data you will need.
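A sketch of what such a calculator does under the hood, using statsmodels; the baseline conversion rate, the minimum detectable difference and the power value here are assumptions chosen purely for illustration:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# We want to detect an uplift in conversion from 10.0% to 10.5%
effect = proportion_effectsize(0.105, 0.100)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,            # Type I (alpha) error
    power=0.8,             # 1 - beta, i.e. a Type II (beta) error of 0.2
    alternative="two-sided",
)
print(round(n_per_group))  # observations needed in each group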
An alternative to the p -value is the confidence interval – the interval within
which the parameter we are measuring is located with a certain degree of
accuracy. A probability of 95% (α = 0.05) is typically used. If we have two
confidence intervals for the test and control groups, whether or not they
overlap will tell us whether there is a difference between them. The p-value
and confidence interval are two sides of the same coin. The interval is
useful for presenting data on graphs. It is often used in alternative methods
for evaluating A/B tests: Bayesian statistics and bootstrapping.
Bootstrapping
Bootstrapping is one of the most interesting ways to evaluate metrics in A/B
tests. It is one of our favoured methods at Retail Rocket for continuous
parameters such as average purchase value, average cost of goods and
revenue per visitor (RPV).
Bootstrapping [79] works by using multiple samples from the data that are
then used to calculate statistics. The algorithm looks like this [80] :
1. Set the number of samples k that we intend to take from the source
dataset. The number must be at least 100 – the more the better.
2. For each of the k samples, elements are randomly selected with
replacement from the original dataset – the same number of elements as in
the original dataset (to preserve the variation of the parameter [81]).
During this procedure, some elements of the original
dataset will be selected several times over, while others will not be
selected at all.
3. The parameter we need is calculated for each sample.
4. We now have k values that can be used to calculate the confidence
interval or statistical test.
In A/B tests, we work with two groups – a test group and a control group.
Both need their own bootstrap. The required metric in each sample and
group is calculated, as is the difference in metrics between groups. This
gives us k values of the distribution of the difference in the two groups. The
significance of the A/B test is calculated by formulating the null hypothesis
H0: the two samples are the same, so the difference between them is zero.
If the Type 1 error rate is α = 0.05, the test is two-tailed and we need only to
calculate the percentiles (quantiles) for the segment [α / 2, 100% – α / 2],
that is [2.5%, 97.5%]. This is easy to do if we sort our series of k values of
the difference of metrics and determine the value of the percentiles at the
ends. If 0 is located between these values, then the null hypothesis cannot
be rejected; if it is located outside these values, then it must be rejected.
Let’s go back to our example with two containers. We have a sample of
1000 balls from each container. If you remember, we need to determine
whether or not there is a difference in the average diameter of the balls
between the containers. For the bootstrap procedure, we take k = 300
samples for both groups. Then we immediately calculate the average in
each sample and the difference between them. This gives us 300 numbers.
We sort these numbers and then select two values – one at the 2.5th
percentile (2.5% × 300 = 7.5, roughly the eighth value from the bottom)
and one at the 97.5th percentile (97.5% × 300 = 292.5, roughly the 293rd
value from the bottom). If both values are positive, or if both are negative, then the
difference is statistically significant.
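A minimal numpy sketch of this bootstrap procedure, following the two-container example; the generated "diameters" are made-up data for illustration:

import numpy as np

rng = np.random.default_rng(42)

# Made-up samples of 1000 ball diameters from each container
group_a = rng.normal(loc=50.0, scale=2.0, size=1000)
group_b = rng.normal(loc=50.3, scale=2.0, size=1000)

k = 300
diffs = np.empty(k)
for i in range(k):
    # Resample with replacement, the same size as the original sample
    sample_a = rng.choice(group_a, size=group_a.size, replace=True)
    sample_b = rng.choice(group_b, size=group_b.size, replace=True)
    diffs[i] = sample_b.mean() - sample_a.mean()

low, high = np.percentile(diffs, [2.5, 97.5])
# If zero lies outside [low, high], the null hypothesis (no difference) is rejected
print(low, high, "significant" if low > 0 or high < 0 else "not significant")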
The word “bootstrap” comes from the expression “to pull oneself up by the
bootstraps” and is often traced back to The Surprising Adventures of Baron
Munchausen , in which the hero pulls himself out of a swamp by his hair.
These days, bootstrapping refers to this sort of “self-pulling,” where we get
something useful for free.
The advantages of bootstrapping are that it does not depend on the sample
distribution, the only parameter you have to choose is the number of
samples, and you can easily calculate any metric. The disadvantages include
high computational requirements. Working with thousands of samples is
resource-intensive. The third alternative for A/B tests is Bayesian statistics.
Bayesian Statistics
I first learned about the Bayesian approach to A/B tests when I read an
article by Sergei Feldman on the website of one of our competitors,
RichRelevance [82] . I was particularly impressed by the comparison of the
two approaches to formulating the results of A/B tests:
We rejected the null hypothesis that A = B with a p-value of 0.043.
There is an 85% chance that A has a 5% lift over B.
The first belongs to traditional Fisher statistics, while the second belongs to
Bayesian statistics. In the article [82] , Feldman notes the following
drawbacks of p-values for working with hypotheses:
P-values are a difficult concept to grasp and explain. I learned all
about them back in 2002, but I still have to crack open a book every
time to remind myself of what exactly they are.
P-values use a binary approach – you can either reject the null
hypothesis or fail to reject it by comparing the p-value with the
threshold α = 0.05.
Classical mathematical statistics (the frequentist approach) treats
parameters as fixed unknown constants, while Bayesian statistics treats
them as probabilistic quantities [83] . This is somewhat similar to the
difference in the approaches of classical and quantum physics. I personally
prefer the probabilistic approach of Bayesian statistics, as it is clearer and
more natural than the p-value. I was so intrigued by it that I spent ages
looking for a good book on the subject. William Bolstad’s Introduction to
Bayesian Statistics [83] turned out to be just that book. I appreciate a good
book, and in this case, I can call the author a Teacher with a capital T .
Bolstad built an extremely robust system for deriving formulas and
proofs. I read his book from cover to cover, did almost all the exercises
inside and then wrote the first version of the software library for A/B testing
at Retail Rocket. I came upon an interesting fact about Bayesian statistics
when I was reading Antonio Rojo’s book on Ronald Fisher [76] – it turns
out that Bayesian statistics was widely used to evaluate statistical
significance even before Fisher came onto the scene. Proponents of the two
approaches (Fisher’s traditional statistics and Bayesian modelling) continue
to argue about which method is better.
Bayes’ theorem looks like this:
P(A|B) = P(B|A) × P(A) / P(B)
where:
P(A) is a priori information that we have about A before the
experiment. These are our beliefs (perhaps even intuitive beliefs)
before we carry out a given experiment.
P(A|B) is posterior probability, the probability of proposition A after
taking the evidence B into account, which leads to new (posterior)
conclusions.
P(B|A) is the likelihood of event B given that hypothesis A is true.
P(B) is the total probability of event B.
Bayes’ theorem allows you to “reverse cause and effect”: a known fact of
an event can be used to calculate the probability that it was the result of a
given cause.
To estimate the parameters, the formula can be written differently: the
posterior distribution of a parameter is proportional to the prior multiplied
by the likelihood, P(parameter | data) ∝ P(parameter) × P(data | parameter).
A/A Tests
I first heard about A/A tests from DJ Patil. I had never used them before.
A/A tests check the last mile of everything you have done for the test: the
random number generator, the data collection scheme, and the statistical
criterion you have chosen for the metric. The test really does divide the
audience into two parts, but the control and test groups use the same version
of the product. In the end, you should get a valid test that does not reject the
null hypothesis, since the version of the product is the same.
The first thing we need to check is how well the random number generator
that will be used to divide the participants into groups in the test works.
Assignment to groups can be done in two ways: by assigning a random
number, or by hashing information about the subject. When a user visits a
site, their ID number is usually recorded in the cookies. This number is used
to recognize the user when they come back to the site. In A/B tests, ID
numbers are hashed, meaning that they are turned from text into a number,
with the last two or three digits being used to divide users into groups: for
example, users whose IDs end in 00–49 are placed in the control group, and
those with 50–99 are placed in the test group. We use a similar principle in
our Retail Rocket Segmentator project [85] . In an A/A test, the distribution
you get should match the one you configured for the test: if it is set at 50/50,
then this is what you should see at the output. Even a small discrepancy of
3% in the data can jeopardize the entire test. Say you’ve got 100,000 users
for your test and you want to split
them in half but get 48,000 in one group and 52,000 in the other. This
indicates that there’s a problem with the “randomness” of your generator.
You can also test these distributions in simulations when you know the
exact algorithm. However, in my experience, small design nuances that we
don’t know about can lead to “shifts” in distributions. This is why I prefer
A/A tests.
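A minimal sketch of hash-based assignment of user IDs to groups and the A/A-style check on the resulting split. The choice of MD5 and the 50/50 split are assumptions for illustration; the Retail Rocket Segmentator works on a similar principle but is not reproduced here:

import hashlib
from collections import Counter

def assign_group(user_id: str) -> str:
    # Hash the ID and use the last two digits of the resulting number to pick a group
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # a number from 00 to 99
    return "control" if bucket < 50 else "test"

# The observed split should be close to the configured 50/50
counts = Counter(assign_group(f"user-{i}") for i in range(100_000))
print(counts)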
You also need to be careful to make sure that users are included in groups
evenly – we don’t want any displacement along different sections of users.
For example, say you’ve got two groups of users in your test: companies
(10% of the sample) and individuals (90%). After dividing them into
groups, you notice that the ratio has changed – to 7% and 93%,
respectively, in the control group, and to 12% and 88% in the test group.
There are two possible explanations for this. First, there is a pattern in the
assignment of customer identifiers, and this data is used when assigning
groups. Second, the number of companies in the sample is too low. The
second issue is easier to deal with – you need to try and collect more data. If
the observed difference disappears, then everything’s good. If it doesn’t,
then you need to do something about the procedure for assigning users into
groups. Note that such tests are more likely to yield valid results when
you’re using a 50/50 split, rather than something odd like 90/10, where only
10% of users are in the smaller group.
The third thing you need to keep in mind is that no matter what metric you
are looking at, your statistical criterion should show a lack of statistical
significance, as we are showing users the exact same thing. In my
experience, binary (binomial) tests yield faster and more accurate results
than tests that use a continuous scale. Website conversion (the percentage of
visitors who make a purchase) will work better than average purchase price
(average basket size). As I see it, there are two reasons for this: 1)
conversion has a lower variability (there are only two values here – whether
the user made a purchase or not); and 2) there may be outliers in metrics
that use a continuous scale. “Outliers” are rare events, such as when a
customer places an unusually large order, and they will skew the metrics
towards whichever group they land in. Not exactly the result we’re looking for, is
it? This is why we usually cut a small percentage of data “from the top”
(ignore the most expensive orders) until the A/A test is completed. We do it
at Retail Rocket. In theory, you can use the median instead of the mean, as
it is more resistant to outliers.
Experiment Pipeline
We know that companies need to have a list of development hypotheses
ranked in order of importance and managed by business development
managers or product specialists. The first hypothesis on the list is taken and,
if needed, modelled and verified using offline tests. The final step is A/B
testing, followed by post-analysis of the results. The decision is then made
on whether or not to implement the hypothesis.
If you can streamline this process, you’ve got a true experiment pipeline.
Conceptually, it’s similar to an industrial conveyor – a hypothesis goes
through a number of statuses: accepted, modelling, offline testing, online
testing, analysis, rejected, implemented. This is more of a mechanical than a
creative process. I had it packed into Trello columns, where the cards
moved from left to right. This approach allows for scaling experiments, and it
has its own metrics such as “time between statuses,” “work started,”
“rejected/implemented,” etc.
You’ve probably realized by now that it takes a long time for a hypothesis
to move through the experiment pipeline. A/B testing is especially time-
consuming. I’d wager that you’re thinking it would be nice to “kill”
unsuccessful hypotheses before they have even made it halfway through the
pipeline. It’s always good to reject hypotheses as early as possible so you
don’t end up wasting time on a dead-end project. This is how we managed
to reduce the average time for hypotheses to pass through our experiment
pipeline at Retail Rocket from 90 to 45 days.
CHAPTER 11. DATA ETHICS
Data is used everywhere these days. But is this a good thing in terms of our
security? There’s a concept in programming called a “greedy algorithm.” A
greedy algorithm is one that is focused on obtaining immediate short-term
benefits. Commercial companies are typically driven by “greedy”
algorithms and want to extract a profit from everything they possibly can,
including the data that we leave behind, knowingly or otherwise. I’d like to
talk about data ethics in this section. We’ve all experienced it – you’re
talking to someone about a top-loading washing machine, say, and a few
minutes later an ad for these washing machines appears in your social
media feed. Someone’s been eavesdropping on you and used your data,
right? Not exactly! It’s a myth, but the very fact that our movements are
being tracked gives food for thought. Is it legal? And, if it is, is it really
ethical?
German Klimenko talks about his project with one of the banks.
From what we can gather, this is more or less how his
Fastscoring+ system works: if you’ve visited a health website
with a li.ru counter and searched for medicine to treat a serious
illness, then the bank that Klimenko works with won’t give you
a loan – no one is interested in lending money to someone who
is seriously ill.
This really struck a nerve with me. Outraged, I decided to write a post on
my website. If you think about it, German Klimenko gave out this
controversial information in order to promote his project. Yet his is not the
only service that has exactly the same kind of data – the difference is they
keep silent about it. Where’s the guarantee that it won’t be used when
deciding on someone’s creditworthiness?
I think the main reason why user data is abused is because many services on
the internet are free. Almost every single content project and social network
is financed through ads. Nothing is ever totally free – websites and online
services need to pay for servers, cover their employees’ wages, etc. Of
course, websites can run subscription plans whereby users can avoid seeing
ads (YouTube does this). Services that follow an ad monetization model are
tempted to make at least some revenue on user data (because advertising
only brings in so much money).
In addition, data leakage is always a possibility when “free” services that
provide analytical services (such as Google Analytics) are installed on a
website. And I am convinced that they don’t always use customer data for
its intended purpose – I mean, they’ve got to make money on something,
right? Any free service is a trojan horse, just keep that in mind at all times.
Sooner or later, “private” analytics services will be all the rage, but they
won’t come cheap, meaning that many websites won’t be able to afford
them.
Data Ethics
The issue of data ethics can be divided into two parts: ethical standards and
data loss prevention. Both require attention and carry certain risks (not
always legal risks, but certainly reputational risks).
Former Amazon data scientist Andreas Weigend wrote an entire book about
data entitled Data for the People . He has this to say about data
confidentiality:
CHAPTER 12. CHALLENGES AND
STARTUPS
In this chapter, I want to talk about the problems facing modern ecommerce
companies and how to fix them. And I think you might be able to take
something from my experience as a co-founder of Retail Rocket.
Database Marketing
Email has been around for years, and we still use it. Before the internet, we
had mail order, which originated in the United States in the 19th century.
Then there was direct marketing, which involves selling products through
direct communication channels (mail, email, telephone). This is markedly
different from regular advertising, which aims to appeal to the general
consumer rather than a specific target audience. Direct marketing involves
working with clients on an individual level. There are generally two types
of activity in direct marketing: selling to an existing customer base and
building new customer bases. The first is database marketing.
Imagine you’ve got an existing customer database with contact details,
correspondence records, order numbers, etc. You send these people
catalogues in the mail on a regular basis. You’ve been instructed to organize
another mail-out to try and solicit additional orders from customers. The
simplest approach is to send the same thing to everyone – a discount offer,
for example. This is quite expensive to do, as each letter costs money to
mail out and the discount you’re offering will affect the company’s margins.
You’ve also got to keep in mind that some customers will place an order
anyway, with or without a discount. This is where it makes sense to split
customers into groups. There is no point sending a promotional offer to
customers who are likely to order an item even if it isn’t on discount (if
absolutely necessary, you can offer them a small discount on something).
The opposite is true of customers who rarely order products unless they are
on offer, so it is definitely worthwhile sending them a discount. Customers
who have not ordered anything for a while shouldn’t receive any kind of
promotional offer.
A customer scoring mechanism would be useful here. For our sales
promotion, we need to calculate the probability that each customer in our
database will make a purchase. Using this scale, we can then divide
customers into groups, each of which will be offered a unique discount. For
example:
The group of most active customers (probability of making a
purchase > 70%) receives a discount of 3%.
The group of reasonably active customers (probability of making a
purchase 40–70%) receives a discount of 10%.
The group of customers who order items intermittently (probability
of making a purchase 20–40%) receives a coupon worth $10.
The group of customers who haven’t ordered anything for a long
time (probability of making a purchase < 20%) doesn’t receive
anything.
You can compile a model like this using logistic regression with RFM
features, which we covered in the chapter on machine learning. At this
stage, we divide clients into groups and start developing a test plan. Our
model could be wrong despite good metrics obtained from the existing data.
To check this, you need to do an A/B test with a control group. This
involves randomly selecting 20% of the customers in each group and
splitting the resulting group in half. One half of the group will receive a
promo offer – this will be the test group. The second half will receive a
letter only (depending on the sales promotion) – this will be the control
group. This needs to be done with every group, with the possible exception
of the group of customers who haven’t ordered anything for a long time.
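A sketch of the scoring and test-plan step described above: a logistic regression on RFM features produces a purchase probability, customers are bucketed into the discount groups, and 20% of each group is split in half into test and control. Everything here – the generated data, the thresholds, the column names – is an illustrative assumption:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 1_000

# Assumed customer base: RFM features plus a label from a past campaign to train on
customers = pd.DataFrame({
    "recency": rng.integers(1, 365, n),
    "frequency": rng.integers(1, 30, n),
    "monetary": rng.uniform(10, 2_000, n),
})
bought_before = (rng.random(n) < 1 / (1 + np.exp(customers["recency"] / 90 - 1))).astype(int)

model = LogisticRegression(max_iter=1000).fit(customers, bought_before)
customers["p_purchase"] = model.predict_proba(customers)[:, 1]

# Bucket customers by predicted purchase probability, as in the discount scheme above
customers["group"] = pd.cut(customers["p_purchase"],
                            bins=[0, 0.2, 0.4, 0.7, 1.0],
                            labels=["inactive", "intermittent",
                                    "reasonably_active", "most_active"])

# In each group, hold out 20% of customers and split them in half: test vs control
customers["arm"] = "promo"  # everyone else receives the planned promotional offer
for name, group_df in customers.groupby("group", observed=False):
    holdout = group_df.sample(frac=0.2, random_state=42)
    half = len(holdout) // 2
    customers.loc[holdout.index[:half], "arm"] = "test"      # promo offer
    customers.loc[holdout.index[half:], "arm"] = "control"   # letter only

print(customers.groupby(["group", "arm"], observed=False).size())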
Once your plan is ready and you’ve printed all the letters, send them out to
your customers. Have your analysts wait for a while (a month or two)
before they start tallying the results: sales against expenses. You’ll be left
with a table that looks something like this one (Table 12.2) for each group.
Table 12.2. Calculating profits from test mail-outs
It is thus worth offering discounts to your most active and reasonably active
customers, but not anyone else. This is what happened: the coupon for $10
proved popular among the third group of customers, but their average
purchase value was less than that of the control group. This was not enough
to recoup the money spent on the sales promotion. The result (profit) of the
control group turned out to be better. In my experience, customers respond
better to a fixed-amount coupon than they do to a discount, but the average
purchase value is always lower than that of customers using discounts. This
topic is covered in great detail in Drilling Down: Turning Customer Data
into Profits with a Spreadsheet [71] , my favourite book by Jim Novo, and
is also covered in Strategic Database Marketing [110] by Arthur Hughes.
Real-life scenarios can be more complicated: perhaps you are testing
several types of discounts, coupons and gifts on groups at the same time. I
ran similar promotions for Ozon.ru customers who hadn’t bought anything
for a long time and managed to get over $100,000 in additional sales.
Over the years, direct marketing has become synonymous with junk mail,
even though, unlike regular advertising, it does not aim to appeal to the
general consumer.
Traditional direct marketing is not as popular these days, of course, thanks
to the widespread use of email. It’s far more expensive to send out a
physical letter than it is to drop someone an email, no matter how many
cost-cutting measures you employ. And our inboxes have been overflowing
ever since! But let’s stop and think about it for a second: is bombarding us
with emails likely to increase sales? In the short term, yes. But, in the long
term, you end up “eroding” your customer base. You lose the goodwill of
your customers – people get annoyed when they are spammed with emails
all the time. Some will unsubscribe from your mailing list, while others (the
majority) will send your message to their “spam” folder out of exasperation.
This affects the company’s reputation, and any subsequent emails you may
send will be automatically directed to the spam folder. But, more often than
not, people just stop responding to your emails.
At Ozon.ru, I devised a test to find out how frequently we should send
emails to customers, and what the best day for sending these emails was.
This involved sending daily emails to some clients, and weekly emails to
others. It turned out that the optimum scheme was to send out emails once a
week, every Tuesday. It was also at Ozon.ru that we sent out a newsletter
every morning listing the latest book releases. This lasted until we realized
that daily mail-outs were a bit much, so we switched to a weekly newsletter
containing a selection of books. Our support team started to
get letters from disgruntled customers asking why the book
recommendations were coming in less often now. One customer even wrote
that it had become a tradition in their office for everyone to read the
Ozon.ru newsletter together. After that, we added an option to the
subscription – customers could now choose how often they wanted to
receive the emails, daily or weekly. Some people want regular newsletters,
others not so much, and it’s good to take this information into account. For
example, it would be wise to reduce the information pressure on those
customers who have not placed any orders as a result of past mail-outs.
Every email, even if it remains unopened, tries the customer’s patience. And
that patience is likely to overflow sooner or later.
The next step in the evolution of email marketing was trigger emails and
the customer interaction chains they create. A trigger email is a response to
any action or inaction of the customer. Here are some examples:
Sending an email to a customer who has registered on the website
but has not placed an order within the past three days. Statistics
show that the likelihood of a customer placing an order drops by
10% every day.
Sending an email to a customer who has added an item to their cart
but not taken any further action for one hour after that.
Sending an email with links to drill bits to a client who bought a
drill last month. Statistics show that 50% of customers who have
bought a drill will buy drill bits at some point within the next 30
days. Drill bits are consumable items.
The list could go on. So, what we do is we test all options and collect them
into a single chain of interaction with customers. I was first introduced to
such schemes when I was working at Ozon.ru by some French consultants
who worked for PPE Group. Retail Rocket has implemented a similar
system using data on customer interaction with the site.
Startups
Startups were all the rage a few years ago. I’ve spent pretty much my entire
career working with startups, rather than global corporations. And I even
got in on the ground floor in one of them as a co-founder, so I think I’ve got
the right to speak on this topic. What drives people who want to start a new
business? When it comes to startups, professional experience in the field
you want to get into is crucial, as you generally don’t need to attract
massive investments. This is certainly not the case for people who don’t
have the necessary experience. Experience or not, if you’re launching a
startup while trying to hold onto your nine-to-five, you can forget about
weekends and free time. The usual path is to bring the product to the MVP
(Minimum Viable Product) stage and then start actively seeking
investments and pitching the project to potential investors at conferences.
Venture investors look for projects whose value will grow exponentially.
They invest in many startups, of which maybe one will flourish and recoup the
costs. Looking for an investor is an adventure in itself: there are sure to be
some characters who want to make money off you. It happened to me once
where a relatively well-known figure in ecommerce asked for 5% of the
company for setting us up with an investor. Finding a good investor is quite
the achievement. By “good investor,” I mean someone who not only gives
you money on favourable terms, but who also remains hands-off when it
comes to the day-to-day management of your business, and who helps you
out with their connections by finding potential clients, etc. A good investor
can also introduce the founders and managers of a startup to the
management of other portfolio companies. For example, Index Venture (an
investor in Ozon.ru) took us to London and introduced us to such projects
as Betfair, Last.fm and Lovefilm. Investors and founders at Wikimart.ru
arranged trips to California so that we could meet the folks from Color,
Netflix and eBay. You might say this is all hipster posturing and money
down the drain, but, personally, they inspired me to do great things. It’s one
thing to watch a recording of a video conference, and it’s another thing
entirely to talk face-to-face with people from legendary companies and see
that they’re dealing with the same problems that you deal with, just on a
bigger scale.
When negotiating a deal, venture capitalists want to give less and get more,
while the founders want to sell less for as much as possible. The terms of
the deal are what the negotiating is all about, and they form the basis of the
future relationship between the founders and the venture capital fund.
Unfortunately, sometimes they end badly [111] . But this is not always the
case. For example, Retail Rocket has a great relationship with Impulse VC,
which has been with us since the very beginning.
Are there any arguments for going ahead with a project no matter what
happens? There certainly are, and some very compelling ones at that. The
experience of creating a startup gives you an understanding of what is
important and what is not, and far quicker than if you were in a regular
corporate environment. The flip side of this that people don’t talk about is
that you can’t just up and leave your own company like you can a normal job –
you’re tied to it by a golden chain, as it were. It’s a matter of what your
price is: those who leave first get less. I know people who have left startups
before they were sold to strategic investors, losing out on a big payday in
the process. There are many reasons why someone may choose to leave a
startup: fatigue, the desire to move onto other projects, disagreements with
partners, etc. The more people are involved in the management of the
company (investors and co-founders), the more likely you are to run into a
“too many cooks in the kitchen” scenario, where arguments take up more
time than actions.
It’s never a good thing when business turns into politics. But a startup is
different from a family business, for example, because the founder will
inevitably lose his or her clout, especially when the company goes public or
is sold to a strategic investor. If you are so attached to your project that you
just can’t share it, then seeking investment is probably not the way to go.
At the time of writing the previous chapter, one of the companies I had
worked for (Ozon.ru) had gone public, while another (Wikimart.ru) had
folded back in 2016. Ozon.ru was founded in 1998, and I started working
there in 2004. The company was growing by approximately 40% per year at
the time. No one talked about startup culture or anything like that, we just
got on with our jobs. Wikimart.ru was founded by two Stanford graduates
who had come up with the idea while they were still studying and scored
their first investment before they had received their diplomas. The culture in
the company was entirely different. They hired some really smart people,
although none of them had experience in ecommerce. That, in my opinion,
was a mistake. You can’t run a warehouse if you don’t have any experience
running one. This is why Wikimart.ru’s financials were worse than those of
Ozon.ru. The second difference between the two companies was that
Wikimart.ru was financed by the American Tiger Global Management
investment firm, while Ozon.ru was bankrolled by Baring Vostok Capital
Partners, a Russian private equity firm whose international footprint is
much smaller. The economic sanctions against Russia cut off the flow of
Western capital for Wikimart.ru, with Tiger Global Management refusing to
finance the project any further. Third, Wikimart.ru started off as an
alternative to Yandex.Market, aggregating current offers from sellers, before
switching to a delivery model. The market wasn’t ready for this kind of
model at the time, although this is not the case anymore, with a number of
major internet retailers (including Ozon.ru) having adopted it.
Another feature of startups is the high staff turnover. Employee salaries are
one of the biggest overheads of a business. When a company grows, so too
does the number of staff it employs. It’s like cell division, and it can get out
of hand very quickly – when everyone has an assistant, and these assistants
have their own assistants, and so on. This is where the company starts to
lose its focus. Then an economic crisis hits and investors start to demand
that you cut costs to keep profits up, so you inevitably have to let people go.
My first experience with this was in 2008, when Ozon.ru took the decision
to lay off 10% of its employees. There was no one really that I could let go,
so I just removed my department’s job ads from wherever they had been
posted. I remember the Ozon.ru CEO going through the logs from the
building’s turnstiles, which recorded the times everyone came in and left,
and getting rid of those who were spending less time in the office. I was
there at Wikimart.ru too when they were forced to make
mass layoffs after the news came in from Tiger Global Management. Just
like Ozon.ru, they had to trim the workforce by 10%. At Ostrovok.ru,
however, I was among the managers tasked with reducing the budgets of
our departments by 30%, so I let one third of my team go. Commendably,
Ostrovok.ru did everything by the book, as laid out in the Labour Code of
the Russian Federation, meaning that people were given severance
packages. The way I see it, staff cuts are a normal part of the life of a
startup – a way of dealing with teething troubles. Unfortunately, however,
when they start to cut staff costs, people are often fired indiscriminately.
Considering that some departments have far more staff than is really
needed, while others are more conservative when it comes to hiring, the
principle of “cut 10% everywhere” is clearly flawed.
My Personal Experience
I first started raving about startups when I was at Ozon.ru. They were a hot
topic back then. I’d had this idea for an apartment remodeling project
codenamed “jobremont” (remont is the Russian word for “repairs” or
“renovations”). This is how it would work: a user places an order for a
renovation job, triggering an auction among apartment renovation
companies, and then selects the offer that best fits their needs. I wanted
people to be able to log in to the service using their mobile number, rather
than their email address. Mobile phone authentication only became a thing
about five years later. I posted on habr.ru and got some pretty positive
feedback. I even met with potential investors twice. At the second meeting,
it was suggested that I develop something else, as they had already invested
in a similar project called remontnik.ru. I turned the offer down, although I
did go through remontnik.ru when I wanted to remodel my flat. Six years
later, I started working on my new project – a recommendation system for
online stores.
We first decided to develop a recommendation system when I was working
at Ozon.ru. I’d accumulated a sizeable database on the site’s customer
traffic, and like any curious analytics engineer from MIPT or the Faculty of
Mechanics and Mathematics at Moscow State University, I started to think
about how I could use this data. It’s the age-old question: What came first –
the chicken or the egg? The data or the idea to develop a recommendation
system? We, that is the data scientists and engineers at Ozon.ru, were
fascinated by the question. What can I say? We were young and full of
ambition back then. What’s more, we’d heard that 35% of sales on
Amazon.com came from recommendations, and we would obsess over
attaining similar figures ourselves. And when someone who’s good with
tech gets an idea into their head, then there’s no stopping them, especially if
they’re surrounded by likeminded people.
Amazon talked about recommendation services back in its 1999 Annual
Report: the word appears ten times in the document. Online stores weren’t
really a thing in Russia back then. Amazon is such an influential company
that the article “Two Decades of Recommender Systems at Amazon.com”
[112] by Greg Linden – the man who created its recommendation system –
is one of the most cited articles according to Google Scholar.
The meeting we had at Ozon.ru with Andreas Weigend, Amazon’s former
chief scientist, who now teaches at Berkeley and Stanford and provides
consulting services for a number of ecommerce giants, left a lasting
impression on me. He told me that “the last click the user makes will give
you more information about them than you had before.” I couldn’t
get these words out of my head, even though I already knew that
sociodemographic data was far less useful than behavioural data. In his
article “I Search, therefore I am” [100], Weigend said that the search terms
that people use can offer “a revealing view into the mind and the soul of a
person” (“We are what we search for,” “A powerful compression of
people’s lives is encoded in the list of their search queries”). This
information would come in handy later at Retail Rocket when we were
developing “short-term” personal recommendations.
Anyway, the Ozon.ru website did have a recommendation system at the
time. It had been created by the website’s developers, and we decided to
expand its functionality. Of course, Amazon.com served as a great example
of how recommendation systems were implemented. A lot of ideas were
thrown around: for example, including the recommendation’s weighting in
the widget so that potential buyers can see what percentage of people
bought the recommended product. This functionality is not available on
either Ozon.ru or Amazon.com, but I recently found an example of such a
system on the website of thomann.de, a German retailer of musical
instruments and pro audio equipment, when I was looking for an electronic
drum kit.
The following types of recommendations were available:
people often buy x with this product
people who visited this page purchased…
search recommendations
personal recommendations
There was an interesting story with the “people often buy x with this
product” algorithm. Statistically, it didn’t work particularly well. Then one
of our analysts came across Greg Linden’s article “Amazon.com
Recommendations: Item-to-Item Collaborative Filtering” [113] and wrote
his own multi-threaded implementation of the cosine method in C#,
because computing the cosine similarity of customer-interest vectors on an
SQL server is not much fun. After this, I came to believe in the Great
Cosine in the n-dimensional space of vectors of client sessions, a belief that
would serve me well in the future.
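For readers who want to picture the approach, here is a minimal Python
sketch of item-to-item cosine similarity computed over purchase sessions.
The event data and names are invented for illustration; the real Ozon.ru
code was a multi-threaded C# program running over far larger logs.

# A minimal sketch of item-to-item cosine similarity, assuming a simple
# in-memory list of (session_id, product_id) purchase events.
import math
from collections import defaultdict
from itertools import combinations

events = [
    ("s1", "camera"), ("s1", "memory_card"),
    ("s2", "camera"), ("s2", "memory_card"), ("s2", "tripod"),
    ("s3", "camera"), ("s3", "tripod"),
]

# Each product becomes a binary vector over sessions; the cosine of two
# such vectors is the co-occurrence count divided by sqrt(count_a * count_b).
sessions_per_product = defaultdict(set)
for session_id, product_id in events:
    sessions_per_product[product_id].add(session_id)

similarity = {}
for a, b in combinations(sessions_per_product, 2):
    shared = len(sessions_per_product[a] & sessions_per_product[b])
    if shared:
        denom = math.sqrt(len(sessions_per_product[a]) * len(sessions_per_product[b]))
        similarity[(a, b)] = similarity[(b, a)] = shared / denom

def recommend(product_id, top_n=2):
    # Recommendations for a product are simply its most similar neighbours.
    scored = [(other, score) for (p, other), score in similarity.items() if p == product_id]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

print(recommend("camera"))  # e.g. [('memory_card', 0.816...), ('tripod', 0.816...)]

In production the hard part is doing this at scale and keeping the similarity
table fresh, which is exactly why multi-threading, and later Hadoop,
mattered.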
One of the tasks that caused us difficulties was measuring the effectiveness
of recommendation blocks. But we’d already bought the Omniture
SiteCatalyst (now Adobe Analytics) web analytics tool and, with the help of
the merchandising analytics I talked about earlier, we were able to
overcome these issues. By the way, 38% of all cart additions were
attributable to recommendation blocks.
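As a toy illustration of the kind of share I mean, the sketch below simply
counts cart additions tagged with the element that triggered them. The field
names are hypothetical; at Ozon.ru the real numbers came out of
SiteCatalyst’s merchandising reports rather than hand-written code.

# Hypothetical cart-addition events, each tagged with the widget that triggered it.
cart_additions = [
    {"product": "camera", "source": "recommendation_block"},
    {"product": "tripod", "source": "search"},
    {"product": "memory_card", "source": "recommendation_block"},
    {"product": "lens", "source": "catalog"},
]

from_recommendations = sum(1 for e in cart_additions if e["source"] == "recommendation_block")
share = from_recommendations / len(cart_additions)
print(f"{share:.0%} of cart additions came from recommendation blocks")  # 50% in this toy example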
I stopped working on recommendation systems for a while. That was until I
found myself at Wikimart.ru, when a trip to the Netflix offices and a
conversation with Eric Colson about the technologies they use inspired me
to go back and change the technologies we were using completely. Not that
the Wikimart.ru system needed any changes; all my old developments were
working fine on their databases. But, as far as I was concerned, Hadoop was
redefining what was possible back then, scaling computations across as
many as a thousand machines at once. What this meant was that I didn’t need to
rewrite or optimize any algorithms – all I had to do was add more
computers to the cluster. Around two years later, in October 2012, I wrote
the following:
ONLINE STORE RECOMMENDATIONS SERVICE
Aim
To create a simple and fast cloud-based product recommendation engine that can be embedded
on a store’s website without interfering with the internal architecture of that site.
Monetization
Stores can pay using the following scheme, in order of priority:
Pay per order. Merchandising analytics can be used to accurately determine the
likelihood that a site visitor will purchase a recommended product.
The service will collect data on the store’s website using JavaScript trackers. The data is
logged on separate servers.
A separate web service issues recommendations on the website. It is important that the
recommendations include only products that are in stock.
Implementation
Typical implementation should include the installation of JavaScript code on the site:
JavaScript trackers for collecting data. If Google Analytics is installed, then it can track
important events on the site (purchases, cart additions, etc.).
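To give a feel for the serving side described above, here is a minimal
Python sketch: it looks up a precomputed item-to-item similarity table (like
the one sketched earlier) and returns only products that are currently in
stock. The names and data structures are illustrative assumptions, not the
actual Retail Rocket architecture; in production this logic sits behind a web
endpoint that the JavaScript on the store’s pages calls.

# Hypothetical precomputed artefacts: an item-to-item similarity table and a
# stock list kept up to date from the store's product feed.
similar_products = {
    "camera": [("memory_card", 0.82), ("tripod", 0.82), ("camera_bag", 0.40)],
    "tripod": [("camera", 0.82), ("ball_head", 0.35)],
}
in_stock = {"memory_card", "camera_bag", "ball_head"}

def recommendations_for(product_id, top_n=3):
    # Return the highest-scoring similar products that are currently in stock.
    candidates = similar_products.get(product_id, [])
    available = [(p, score) for p, score in candidates if p in in_stock]
    return [p for p, _ in sorted(available, key=lambda pair: pair[1], reverse=True)[:top_n]]

print(recommendations_for("camera"))  # ['memory_card', 'camera_bag']

Filtering by stock at serving time, rather than when the similarity table is
built, keeps the recommendations usable even when the assortment changes
during the day.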
CHAPTER 13. BUILDING A
CAREER
Hamming was referring to research papers, although I think his words can
apply to data analysis as well. There’s nothing stopping you from moving
between areas, from web analytics to machine learning, from analytics to
programming, and vice versa.
If you’ve been feeling uninspired for a while, as if there’s nothing for you to
do at work, then talk to your superior. And if, after talking to your boss, it
becomes clear that nothing’s going to change, then it’s probably time to
leave. This way, you won’t be kicking yourself for staying too long and
losing precious time. After all, we’ve only got so much of it.
When Hamming was asked how much time one should spend reading, he
answered:
It depends upon the field. I will say this about it. There was a
fellow at Bell Labs, a very, very, smart guy. He was always in
the library; he read everything. If you wanted references, you
went to him and he gave you all kinds of references. But in the
middle of forming these theories, I formed a proposition: there
would be no effect named after him in the long run. He is now
retired from Bell Labs and is an Adjunct Professor. He was very
valuable; I’m not questioning that. He wrote some very good
Physical Review articles; but there’s no effect named after him
because he read too much.
You can’t know everything, and reading takes up a lot of time. I love
reading, but I never have enough time for it. Books can sit on the shelf for
years before you finally pick them up and give them a read. Whenever I
have a choice between passive (theoretical) and active (practical) actions, I
go for the latter. Passive actions include reading, watching a video
presentation, taking an online course, etc. I’m hardly a doofus – I’ve got
dozens of certificates from Coursera. Active actions include solving a
problem, completing a project (even if it’s a personal project), etc. Work
tasks interest me more than educational ones. Knowledge and erudition are
important, but skilful application is far more important. And you don’t need
to know a subject inside and out in order to apply that knowledge – 20% is
enough (remember Pareto!). Remember Xavier Amatriain’s advice in the
chapter on machine learning? Just read the introduction of a book on ML,
open up whatever algorithm you want, and start coding. You’ll learn what’s
important and what isn’t through practice. You’ll never be a good musician
if you only know the theory.
Which employee is better: the one who took 20 different courses and did a
lot of study assignments or the one who finished only a few simple projects,
but saw them through to the end, from idea to implementation? In 95% of
cases, I would say the second. In my experience, I have found that there are
two kinds of people: theorists and practitioners. I once hired a person who
knew the theory inside out, thinking that if they knew the theory, they’d be
able to figure out how to do it in practice. Boy, was I wrong!
So, if you don’t need to read a lot, how are you supposed to learn? This is
what Hamming said:
If you read all the time what other people have done, you will
think the way they thought. If you want to think new thoughts
that are different, then do what a lot of creative people do – get
the problem reasonably clear and then refuse to look at any
answers until you’ve thought the problem through carefully
how you would do it, how you could slightly change the
problem to be the correct one. So yes, you need to keep up. You
need to keep up more to find out what the problems are than to
read to find the solutions. The reading is necessary to know
what is going on and what is possible. But reading to get the
solutions does not seem to be the way to do great research. So
I’ll give you two answers. You read; but it is not the amount, it
is the way you read that counts.
EPILOGUE
The aim of this book is to give you some practical advice. And if you are
able to apply at least some of my ideas in your work, then I will consider
this a success.
One last piece of advice: always ask yourself the question, “Am I squeezing
everything I possibly can out of the data?” At some of the places I have
worked (Ozon.ru, Wikimart.ru, Retail Rocket), I was responsible for data
analytics myself. At others (TechnoNICOL, Innova, KupiVIP, Fastlane
Ventures), I provided consulting services, and it was here that I realized that
it is not all about numbers.
To make the best use of numbers, you need, first, to monitor the quality of
the data itself and, second, to manage personnel within the company
effectively, prioritize hypotheses and use the necessary technology.
I have tried to analyse these areas as thoroughly as possible in the relevant
chapters of this book.
We learn about life through trial and error: children experiment more, adults
less. Similarly, companies need to experiment in order to grow. This
includes generating ideas, testing them in practice, obtaining results and
repeating this cycle over and over, even if the result is not as good as you
would have wanted and you are tempted to give up. Don’t be afraid of
failure – it is experimentation that leads to improvements.
Don’t hesitate to get in touch with me through this book’s website
topdatalab.com/book or via email at [email protected]. I’ll be happy
to answer any questions you may have!
ACKNOWLEDGEMENTS
This book is dedicated to my wife Katya and my children Adella and
Albert. Katya was the one who gave me the determination to write the
book, and she was heavily involved in the editing – for which I am eternally
grateful.
I would also like to thank my parents, who raised me during an exceedingly
difficult period in our country’s history. Special thanks go to my dad,
Vladimir, who instilled in me a love of physics.
Thank you to all those I have met on my long journey into data analytics:
Ilya Polezhaev, Pavel Bolshakov and Vladimir Borovikov at StatSoft for the
guidance you gave a young man just starting out his career; former Ozon.ru
CEO Bernard Lukey and my Ozon.ru colleagues Alexander Perchikov,
Alexander Alekhin and Valery Dyachenko for co-writing the
recommendation system with me; Marina Turkina and Irina Kotkina –
working with you was sheer joy; the founders of Wikimart.ru Kamil
Kurmakayev and Maxim Faldin – our meetings in California were a great
inspiration for me; Alexander Anikin – you were cool back then, but now
you’re a stud; and the founders of Ostrovok.ru, Kirill Makharinsky and
Serge Faguet, not to mention Evgeny Kuryshev, Roman Bogatov and Felix
Shpilman – I had a blast working with you and learning about software
development.
I would also like to thank my co-founders at Retail Rocket, Nikolay
Khlebinsky and Andrey Chizh. And a special thanks to Impulse VC venture
fund (Kirill Belov, Grigory Firsov and Evgeny Poshibalov) for believing in
us. To all my coworkers at Retail Rocket, especially my boys Alexander
Anokhin and Artem Noskov – you guys are the best!
I am grateful to my therapist Elena Klyuster, with whom I have been
working for several years now, for helping me discover my boundaries and
true desires. I would also like to thank my swimming coach Andrey Guz for
instilling in me an analytical approach to training. Turns out that it works
for amateurs just as well as it does for professionals.
I would also like to express my gratitude to all my virtual reviewers,
especially Artem Astvatsaturov, Alexander Dmitriev, Arkady Itenberg,
Alexei Pisartsov and Roman Nester for their input on the chapter on ethics.
Thank you to everyone who had a hand in the publication of this book,
especially Alexey Kuzmenko, who helped me find a publisher in no time by
sidestepping all the bureaucratic nonsense.
ABOUT THE AUTHOR
Roman Zykov was born in 1981. After completing his undergraduate
studies in 2004, Roman went on to earn a master’s in Applied Physics and
Mathematics at the Moscow Institute of Physics and Technology (MIPT).
Roman started his career in data science in 2002 as a technical consultant at
StatSoft Russia – the Russian office of the U.S. developer of the
STATISTICA statistical data analysis software. In 2004, he was hired as
head of the analytical department of the Ozon.ru online store, where he
created analytical systems from scratch, including web analytics, database
analytics and management reporting, while also contributing to the
recommendation system.
In 2009, he advised on a number of projects for the Fast Lane Ventures
investment fund and for the gaming industry.
In 2010, Roman was hired to lead the analytics department of the online
retailer Wikimart.ru.
In late 2012, he co-founded RetailRocket.ru, a marketing platform for
online stores. The company is currently the undisputed market leader in
Russia and successfully operates in Chile, the Netherlands, Spain and
several other countries.
Roman ran the blog Analytics in Practice on the now defunct KPIs.ru from
2007, where he evangelized data analysis as it applies to business problems
in ecommerce. He has spoken at numerous industry conferences, including
the Russian Internet Forum, iMetrics and Gec 2014 (with Arkady Volozh of
Yandex), as well as at business conferences in Dublin and London, the U.S.
Embassy (American Center in Moscow), and Sberbank University. He has
also published in PwC Technology Forecast, ToWave, Vedomosti and
Sekret firmy.
In 2016, Roman gave a mini lecture on hypothesis testing at MIT in Boston.
In 2020, he was nominated for a CDO Award.
BIBLIOGRAPHY
Most of the references in this book are provided via hyperlinks. Over time,
some of them will stop working. I have developed a mechanism to ensure
that all the references remain accessible; it is available at
http://topdatalab.com/ref?link=[Reference number]. “Reference
number” corresponds to the number of the respective reference in the text
(for example, for number 23: https://topdatalab.com/ref?link=23). If I learn
that a link or QR code in this book has stopped working, I will restore it as
soon as possible. All the reader has to do is let me know.