Numsense! Data Science For The Layman
ISBN 978-981-11-1068-9
Contents
k-Means Clustering
  Finding Customer Clusters
  Example: Personalities of Movie Fans
  Defining Clusters
  Limitations
Association Rules
  Discovering Purchasing Patterns
  Support, Confidence and Lift
  Example: Mapping Grocery Transactions
  Apriori Algorithm
  Limitations
Decision Tree
  Predicting Survival in a Disaster
  Example: Surviving the Titanic
  Generating a Decision Tree
  Limitations
Random Forest
  Wisdom of the Crowd
  Example: Forecasting Crime
  Ensembles
  Bootstrap Aggregating (Bagging)
  Limitations
In Figure 1a, boundaries separating the two colors are too detailed. Random variations in flower
growth have been mistaken for persistent patterns, which would not be generalizable to new flowers.
This problem is called overfitting.
In Figure 1c, boundaries separating the two colors are too vague. Subtle patterns are overlooked,
resulting in less information to guide predictions. This problem is called underfitting.
If a predictive model is overfitted or underfitted, its predictions might be inaccurate, undermining
its original purpose. What we need is a predictive model like the one in Figure 1b - an ideal fit that
strikes a balance between describing the current distribution of flowers, and being able to predict
where new flowers would grow.
To derive a predictive model that fits our data well, we need to do two things:
Derive multiple models. We could invite several girls to each draw a different boundary to separate the flowers, or we could ask the same girl to draw several different boundaries. Analogously, we could use different prediction algorithms, or simply tweak parts of a single algorithm to get different models. The latter technique, which is what we will cover, is called parameter tuning.
Select the best model. After multiple boundaries have been drawn, we need to see which boundary is best at predicting the colors of new flowers. An ideal model not only fits known data but also generalizes to unknown data. Assessing how well models predict new data is a process known as validation.
Parameter Tuning
An algorithm is composed of parts called parameters, which can be tuned to make the algorithm
work differently. A predictive model can be changed by tweaking its parameters. Table 1 lists
examples of common algorithms which will be covered in this book, as well as their parameters.
Table 1. Examples of algorithms and their parameters.
Algorithm: k-Nearest Neighbors. Parameter: number of nearest neighbors.
For instance, let us look at one of the simplest techniques for classifying objects, called k-nearest
neighbors (k-NN). Suppose we need to classify new data points as either blue or orange. k-NN
works on the idea that a data point would likely have the same color as other data points closest to
it. For example, if a data point is surrounded by 4 blue points and 1 orange point, that data point is
likely a blue point by majority vote (see Figure 2).
Figure 2. A data point is classified by majority votes from its 5 nearest neighbors.
The k in k-NN is a parameter that refers to the number of nearest neighbors to include in the majority
voting process. In the above example, k equals 5. Choosing the right value of k is a process called
parameter tuning, and is critical to prediction accuracy.
If k is too small (Figure 1a), data points would match immediate neighbors only, amplifying errors.
If k is too big (Figure 1c), data points would try to match far-flung neighbors, diluting underlying
patterns. But when k is just right (Figure 1b), data points would reference a suitable number of
neighbors such that errors cancel out to reveal overall trends in the data, which could be used to
make predictions.
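To make the voting concrete, here is a minimal sketch of the majority-vote idea in Python. The points, colors and the new point are made up for illustration; real applications would usually rely on a library implementation such as scikit-learn's.

```python
import numpy as np

# Known points (coordinates) and their colors; values are made up.
known_points = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [3.0, 3.1], [2.9, 2.8]])
known_colors = np.array(["blue", "blue", "blue", "orange", "orange"])

def knn_predict(new_point, k=5):
    # Distances from the new point to every known point
    distances = np.linalg.norm(known_points - new_point, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' colors
    colors, counts = np.unique(known_colors[nearest], return_counts=True)
    return colors[np.argmax(counts)]

print(knn_predict(np.array([1.0, 1.0]), k=5))  # "blue", by 3 votes to 2
```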
While we can tune k based on current data, how can we assess if it would give accurate predictions
for new data?
Validation
Validation is an assessment of how accurate a model is in predicting new data. However, instead of
waiting for new flowers to bloom to assess our model, we could make a sketch of the current flowers,
and randomly assign them into two groups - the first group of flowers, called the training dataset,
would be used to generate and tune a prediction model, while the second group of flowers, called
the test dataset, would be used to test the model’s prediction accuracy. The model which predicts the correct color for the largest number of flowers in the test dataset is taken to have the best parameter k. For the validation process to be effective, flowers should be randomly assigned to the training and test datasets, and should not be concentrated in any one area of the thicket.
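As a rough sketch of this split, the snippet below generates made-up flower positions and colors, holds out a random test set, and compares a few candidate values of k. All data and numbers are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the flower sketch: random positions with a color
# label (0 or 1) that depends loosely on position.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = (X[:, 0] + rng.normal(0, 1, 200) > 5).astype(int)

# Randomly assign flowers to a training set and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Tune k on the training set, then check accuracy on the test set.
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, round(model.score(X_test, y_test), 2))
```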
If the original dataset is small, we may not be able to set aside a test dataset without sacrificing
accuracy from having less data to learn patterns from. Hence, instead of using two separate datasets
for training and testing, there is a technique which lets us use one dataset for both purposes.
Cross-Validation
Cross-validation is a technique that maximizes the availability of data for validating a predictive
model. It does this by dividing the dataset into several segments to test the model repeatedly.
In one iteration, all but one of the segments are used to train a predictive model of a desired parameter
value. For instance, if we think that the optimal value for k is 5, we can use the 5 nearest neighbors
in the training segments to predict the color for each flower in the last test segment. This process is
repeated until each segment is used as the test segment exactly once.
As different segments are used to generate color predictions in each iteration, the resulting predictions will fluctuate. Capturing these variations gives a more robust estimate of actual predictive ability. The final estimate of the model’s prediction accuracy is taken as the average accuracy across all the iterations (see Figure 3).
Figure 3. Cross-validating a dataset by dividing it into 4 segments. The final prediction accuracy is the average
of the 4 results.
If results from cross-validation suggest that our model’s prediction accuracy is low, we can go back
to re-tune the parameters (e.g. value of k), or re-examine if our data is biased.
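A minimal sketch of this procedure, assuming the same kind of made-up flower data: scikit-learn's cross_val_score handles the segmenting and the averaging for each candidate value of k.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Same hypothetical flower data as before.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = (X[:, 0] + rng.normal(0, 1, 200) > 5).astype(int)

# 4-fold cross-validation: each segment is used as the test segment exactly
# once, and the final accuracy estimate is the average across the 4 folds.
for k in (1, 5, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=4)
    print(k, round(scores.mean(), 2))
```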
Regularization
As the complexity of a model’s parameters increases, so too does the overall complexity of the model
itself. This is hard to illustrate with k-NN as it only has one parameter. Instead, we can use regression
analysis as an example.
To make a prediction with regression analysis, a trend line is derived based on predictor variables.
For instance, if we would like to predict a person’s income, we might use age as a predictor. A
plausible trend line might then be a straight upward slope to indicate increasing income as age
increases. However, we could add more predictors into the model, such as IQ scores and parents’
income. As more predictors are added, the model would grow more complex. The resulting trend
might consist of more intricate curves and bumps rather than a simple straight line, thus inflating
the risk of overfitting.
While we can select predictors to include in a regression, we are unable to determine how much
weight a predictor should contribute to deriving the trend line. Instead, the regression algorithm
computes the optimal predictor weights to derive a line that gives the lowest prediction error on
the training dataset. So unlike the case for k-NN, we are sometimes unable to tune a model’s
parameters, as they are governed by its algorithm.
While a regression trend line might predict training data accurately, a complex trend is prone to
overfitting. To keep a model’s complexity in check, a penalty could be imposed on the model, which
increases with the complexity of the model’s parameters. Note that the value of the penalty is itself a
tuning parameter. So in other words, we can introduce a tuning parameter to control parameters that
cannot be tuned. This process is called regularization. By keeping a model’s complexity in check,
we help to maintain its predictive power.
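As a rough illustration, ridge regression is one common regularized form of regression analysis: its penalty, alpha, is itself a tuning parameter that shrinks the predictor weights, and it can be chosen by cross-validation. The data below is made up.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical income data: three predictors (e.g. age, IQ, parents' income).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
income = 3 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 1, 200)

# alpha is the regularization penalty: larger values shrink the predictor
# weights more strongly, keeping the model's complexity in check.
for alpha in (0.01, 1.0, 100.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, income, cv=5)
    print(alpha, round(scores.mean(), 2))
```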
Conclusion
In this chapter, we covered the fundamentals of how to fit a prediction model, which form the basis
for all the techniques we are about to learn in this book. No matter which prediction technique we
use, the processes of parameter tuning, validation and regularization are crucial in ensuring that our
models would be effective in predicting new data.
k-Means Clustering
Finding Customer Clusters
There are patterns in people’s preferences. Take someone who likes the movie 50 First Dates for
instance. It is highly probable that they would like similar chick flicks such as 27 Dresses. Grouping
people by their preferences could help stores better target product advertisements to interested
customers.
However, identifying customer groups is tricky. We do not know 1) how customers should be
grouped, nor 2) how many groups exist. To answer these questions, we will explore a technique
called k-means clustering. This technique can be used to cluster customers or products, where k
represents the number of clusters identified, with each cluster being as distinct from the other clusters as possible.
While this chapter uses the retail sector as an example, clustering is employed in a wide range of
fields. For example, clustering can help to identify biological genotypes, or to pinpoint hot spots of
criminal activity.
With this information, advertisements can be more targeted. If a person likes 50 First Dates, a storekeeper could recommend another movie in the same cluster, such as 27 Dresses. Besides product recommendation, identifying clusters also allows the storekeeper to bundle similar products for effective discounts.
Defining Clusters
In defining clusters, two questions have to be answered:
How many clusters are there?
Figure 2. Scree plot showing a ‘kink’ where the number of clusters equals 3.
A scree plot shows how the dissimilarity between cluster members (i.e. within-cluster scatter)
decreases as the number of clusters increases. If there is only one cluster to which all members belong, within-cluster scatter would be at its maximum. As the number of clusters increases, clusters grow more compact and cluster members become more homogeneous.
An optimal number of clusters is suggested by the kink in a scree plot. In Figure 2, the kink is
where the number of clusters equals 3. At this point, the number of clusters derived can reduce within-cluster scatter to a reasonable degree, beyond which having any more clusters would yield increasingly smaller clusters that would be less distinguishable from each other. This is called the
principle of diminishing marginal returns.
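A minimal sketch of how such a scree (or elbow) plot could be computed, assuming made-up customer data: scikit-learn's inertia_ value plays the role of the within-cluster scatter described above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: two spending scores per customer, drawn around
# three underlying groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 3, 6)])

# Within-cluster scatter (inertia) for increasing numbers of clusters;
# the "kink" in these values suggests a suitable k.
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(model.inertia_, 1))
```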
What is the membership of each cluster?
After determining a suitable number of clusters, we can determine the membership of each cluster.
A good cluster would comprise data points packed closely together. Hence, to check the validity of
a cluster, we could verify how far its members are from their cluster’s center. If a data point is far
from its assigned cluster’s center and closer to that of a neighboring cluster, its membership would
be re-assigned.
However, the positions of cluster centers would be unknown initially, so they are approximated. Data
points would then be assigned to the cluster centers to which they are closest. Next, cluster centers would be re-positioned to lie at the center of their members. Following this, cluster membership would be re-assigned again based on distance. This iterative process is summarized with a 2-cluster
example illustrated in Figure 3.
Step 1: Start by guessing where the central points of each cluster are. Let’s call these pseudo-centers,
since we do not yet know if they are actually at the center of their clusters.
Step 2: Assign each data point to the nearest pseudo-center. By doing so, we have just formed 2
clusters, red and blue.
Step 3: Update the location of pseudo-centers to be in the center of their respective members.
Step 4: Repeat the steps of re-assigning cluster members (Step 2) and re-positioning cluster centers
(Step 3), until there are no more changes to cluster membership. These 4 steps wrap up the process of
determining cluster membership. The same process is used for 3 or more clusters, such as in Figure
1.
Despite our focus on 2-dimensional analysis, clustering can also be done in 3 or more dimensions.
Additional dimensions a storekeeper might want to consider could include a customer’s age or
frequency of visits. While difficult to visualize, we can still rely on computer programs to calculate
distances between data points and cluster centers.
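Continuing the sketch, fitting k-means with the chosen number of clusters returns the final cluster centers and each data point's membership; the data is again made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same hypothetical customer data; now fit the chosen number of clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 3, 6)])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(model.cluster_centers_)   # final positions of the cluster centers
print(model.labels_[:10])       # cluster membership of the first 10 customers
```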
Limitations
Although k-means clustering is a useful tool, it is not without limitations:
Each data point can only be assigned to one cluster. Sometimes a data point might be in the
middle of 2 clusters, having an equal chance of being assigned to either.
Clusters are assumed to be spherical. The iterative process of finding data points closest to a cluster
center is akin to narrowing the cluster’s radius, so that the resulting cluster is a compact sphere. This
might pose a problem if the shape of an actual cluster is, for instance, an ellipse. An elongated cluster
might be truncated, and its members subsumed into a nearby cluster.
Clusters are assumed to be discrete. k-means clustering does not permit clusters to overlap, nor
to be nested within each other.
Instead of coercing each data point into one cluster, more robust clustering techniques compute
probability values indicating how likely each data point might belong to each cluster. This would
help identify non-spherical or overlapping clusters.
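One such technique is a Gaussian mixture model, sketched below on made-up data; predict_proba returns each point's probability of belonging to each cluster rather than a single hard assignment.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Soft clustering: each point gets a probability of belonging to each cluster,
# instead of being forced into exactly one.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.7, size=(50, 2)) for c in (0, 3)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X[:5]).round(2))   # membership probabilities per cluster
```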
Despite the availability of alternative clustering techniques, the strength of the k-means clustering
algorithm lies in its elegant simplicity. A good strategy might be to start with k-means clustering
to get a basic understanding of the data structure, before diving into more advanced methods to
examine areas where k-means clustering falls short.
References
Kosinski, M., Matz, S., Gosling, S., Popov, V. & Stillwell, D. (2015) Facebook as a Social Science
Research Tool: Opportunities, Challenges, Ethical Considerations and Practical Guidelines. American
Psychologist.
Principal Component Analysis
Exploring Nutritional Content of Food
Imagine that you are a nutritionist trying to explore the nutritional content of food. What is the best
way to differentiate food items? By vitamin content? Protein levels? Or perhaps a combination of
both?
Knowing the variables that best differentiate your items has several uses.
The question is, how do we derive the variables that best differentiate items?
Principal Components
Principal Component Analysis (PCA) is a technique that finds underlying variables (known as
principal components) that best differentiate your data points. Principal components are dimensions
along which your data points are most spread out (see Figure 2).
A principal component can be expressed by one or more existing variables. For example, we may use
a single variable – vitamin C – to differentiate food items. Because vitamin C is present in vegetables
but absent in meat, the resulting plot (leftmost column in Figure 3) will differentiate vegetables from
meat, but meat items will be clumped together.
To spread the meat items out, we can use fat content in addition to vitamin C levels, since fat
is present in meat but absent in vegetables. However, fat and vitamin C levels are measured in
different units. To combine the two variables, we first have to normalize them, i.e. shift them onto
a uniform standard scale, which would allow us to calculate a new variable – vitamin C minus fat.
As vitamin C spreads the vegetables upwards, subtracting fat spreads the meats in the opposite direction, downwards. Combining the two variables helps to spread out both vegetable and meat
items (center column in Figure 3).
The spread can be further improved by adding fiber, of which vegetable items have varying levels.
This new variable – (vitamin C + fiber) minus fat – achieves the best data spread yet (rightmost
column in Figure 3).
While in this demonstration we tried to derive principal components by trial-and-error, PCA does
this by systematic computation.
Looking at the nutrient levels of the food items, fat and protein levels seem to move in the same direction as each other, and in the opposite direction from fiber and vitamin C levels. To confirm our hypothesis, we can check for
correlations between the nutrition variables. As expected, there are significant positive correlations
between fat and protein levels (r = 0.56), as well as between fiber and vitamin C levels (r = 0.57).
Therefore, instead of analyzing all 4 nutrition variables, we can combine highly-correlated variables,
leaving just 2 dimensions to consider. This is the same strategy used in PCA – it examines
correlations between variables to reduce the number of dimensions in the dataset. This is why PCA
is called a dimension reduction technique.
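A minimal sketch of this computation, assuming a small made-up nutrition table with fat, protein, fiber and vitamin C columns: the variables are first normalized, and PCA then returns the weights that combine them into principal components.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical nutrition table: columns are fat, protein, fiber, vitamin C.
rng = np.random.default_rng(0)
meat = rng.normal([8, 20, 0.5, 1], 1.0, size=(20, 4))
veg = rng.normal([0.5, 2, 6, 30], 1.0, size=(20, 4))
X = np.vstack([meat, veg])

# Normalize the variables onto a common scale, then find the principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

print(pca.components_[0].round(2))   # weights of each nutrient in PC1
scores = pca.transform(X_scaled)     # each food item's value on each component
```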
Applying PCA to this food dataset results in the principal components displayed in Figure 5.
The numbers represent weights used in combining variables to derive principal components. So instead of combining variables via trial-and-error as in Figure 3, PCA systematically computes the
optimal combinations of variables that best differentiate our items.
For example, to get the top principal component (PC1) value for a particular food item, we add up
the amount of fiber and vitamin C it contains, with slightly more emphasis on fiber, and then from that we subtract the amount of fat and protein it contains, with protein carrying the larger negative weight.
We observe that the top principal component (PC1) summarizes our findings so far – it has paired
fat with protein, and fiber with vitamin C. It also takes into account the inverse relationship between
the pairs. Hence, PC1 likely serves to differentiate meat from vegetables. The second principal
component (PC2) is a combination of two unrelated nutrition variables – fat and vitamin C. It serves
to further differentiate sub-categories within meat (using fat) and vegetables (using vitamin C).
Using the top 2 principal components to plot food items results in the best data spread thus far
(shown in Figure 6).
Meat items (blue) have low PC1 values, and are thus concentrated on the left of the plot, on the
opposite side from vegetable items (orange). Among meats, seafood items (dark blue) have lower
fat content, so they have lower PC2 values and are at the bottom of the plot. Several non-leafy
vegetarian items (dark orange), having lower vitamin C content, also have lower PC2 values and
appear at the bottom.
Choosing the Number of Components. As principal components are derived from existing
variables, the information available to differentiate data points is constrained by the number of
variables you start with. Hence, the above PCA on food items generated 4 principal components,
corresponding to the original number of variables in the dataset.
To keep results simple and generalizable, however, only the first few principal components are selected for visualization and further analysis. Principal components are ordered by their effectiveness
in differentiating data points, with the first principal component doing so to the largest degree. The
number of principal components to shortlist is determined by a scree plot (see Figure 7).
Figure 7. Scree plot showing a ‘kink’ where the number of components equals 2.
A scree plot shows the decreasing effectiveness of subsequent principal components in differentiating data points. A rule of thumb is to use the number of principal components corresponding to the
location of a kink. In the plot above, the kink is located at the second component. This means that
even though having three or more principal components would better differentiate data points, this
extra information may not justify the resulting complexity of the solution. As we can see from the
scree plot, the top 2 principal components already account for about 70% of data spread. Using fewer
principal components to explain the current data sample better ensures that the same components
can be generalized to another data sample.
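In code, the share of data spread captured by each component is what the scree plot visualizes; continuing the made-up nutrition example (the percentages here are illustrative, not the book's):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Same hypothetical nutrition table as before.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([8, 20, 0.5, 1], 1.0, size=(20, 4)),
               rng.normal([0.5, 2, 6, 30], 1.0, size=(20, 4))])

pca = PCA().fit(StandardScaler().fit_transform(X))

# Share of the total data spread captured by each principal component,
# and the cumulative share of the first two components.
print(pca.explained_variance_ratio_.round(2))
print(pca.explained_variance_ratio_[:2].sum().round(2))
```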
Limitations
Maximizing Spread. The main assumption of PCA is that dimensions which reveal the largest
spread among data points are the most useful. However, this may not be true. A popular counter
example is the task of counting pancakes arranged in a stack, with pancake mass representing data
points.
To count the number of pancakes, one pancake is differentiated from the next along the vertical axis
(i.e. height of the stack). However, if the stack is short, PCA would erroneously identify a horizontal
axis (i.e. diameter of the pancakes) as a useful principal component for our task, as it would be the
dimension along which there is the largest spread.
Interpreting Components. Interpretations of generated components have to be inferred, and
sometimes we may struggle to explain the combination of variables in a principal component.
Nonetheless, having prior domain knowledge could help. In our example with food items, prior
knowledge of major food categories helps us to comprehend why nutrition variables are combined
the way they are to form principal components.
Orthogonal Components. One major drawback of PCA is that the principal components it
generates must not overlap in space; such components are known as orthogonal components. This means that
the components are always positioned at 90 degrees to each other. However, this assumption is
restrictive as informative dimensions may not necessarily be orthogonal to each other (see Figure
9).
To resolve this, we can use an alternative technique called Independent Component Analysis (ICA).
ICA allows its components to overlap in space, so they do not need to be orthogonal. Instead, ICA forbids its components from overlapping in the information they contain, aiming to reduce the mutual information shared between components. In other words, ICA’s components are independent, with each component revealing unique information about the data set. Besides overcoming the orthogonality
assumption, ICA also considers other information apart from data spread in determining principal
components. Hence, it is less susceptible to the pancake error.
While ICA may seem superior to PCA, PCA remains one of the most popular techniques for dimension reduction. Hence, knowing how PCA works and its assumptions is useful. When in doubt, one could consider running an ICA to verify and complement results from a PCA.
Association Rules
Discovering Purchasing Patterns
In grocery shopping, each shopper has a distinctive list of things to buy, depending on their needs and preferences. A housewife might buy healthy ingredients for a family dinner, while a bachelor might buy beer and chips. Understanding these buying patterns can help to increase sales in several ways, particularly when a pair of items, X and Y, are frequently bought together.
To uncover how items are associated with each other, we can use
association rules analysis. Besides increasing sales profits, association rules can also be used in other
fields. In medical diagnosis for instance, understanding comorbid symptoms can help to improve
patient care and medicine prescription.
While we may know that certain items are frequently bought together, we need a systematic way
to uncover these associations.
Measure 1: Support. This indicates how popular an itemset is, as measured by the proportion of transactions in which the itemset appears.
Support Threshold. A support threshold could be chosen to identify popular itemsets, such that itemsets with support values above this threshold would be deemed popular. If sales of items that constitute a large proportion of your transactions tend to have a significant impact on your profits, you might consider using that proportion as your support threshold.
Measure 2: Confidence. This indicates how likely item Y is purchased when item X is purchased,
expressed as {X -> Y}. This is measured by the proportion of transactions with item X where item Y
also appears. In Table 1, the confidence of {apple -> beer} is 3 out of 4, or 75%.
One drawback of the confidence measure is that it might misrepresent the importance of an
association. This is because it only accounts for how popular apples are, but not beers. If beers
are also very popular in general, there will be a higher chance that a transaction containing apples
will also contain beers, thus inflating the confidence measure. To account for the base popularity of
both constituent items, we use a third measure called lift.
Measure 3: Lift. This indicates how likely item Y is purchased when item X is purchased, while
controlling for how popular item Y is. For instance, the lift of {apple -> beer} is equal to the confidence
of {apple -> beer} divided by the popularity of {beer}. Hence, the lift of {apple -> beer} equals 1,
which implies no association between items. A lift value greater than 1 means that item Y is likely
to be bought if item X is bought, while a value less than 1 means that item Y is unlikely to be bought
if item X is bought.
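These three measures are simple to compute directly. The sketch below uses a made-up list of transactions rather than the book's Table 1, so the numbers are purely illustrative.

```python
# A minimal sketch of support, confidence and lift on made-up transactions.
transactions = [
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"apple", "beer", "pear"},
    {"milk", "beer"},
]

def support(itemset):
    # Proportion of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # How often Y appears in transactions that contain X.
    return support(x | y) / support(x)

def lift(x, y):
    # Confidence of {X -> Y}, controlling for how popular Y is.
    return confidence(x, y) / support(y)

print(support({"apple", "beer"}))        # 0.6
print(confidence({"apple"}, {"beer"}))   # 0.75
print(lift({"apple"}, {"beer"}))         # 0.9375
```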
Figure 5. Associations between selected items. Visualized using the arulesViz R library.
• If someone bought tea, he is likely to have bought fruit as well, possibly inspiring the
production of fruit-flavored tea
Recall that one drawback of the confidence measure is that it could misrepresent the importance of
an association. To demonstrate this, we picked 3 association rules from the original dataset which
contained beer:
The {beer -> soda} rule had the highest confidence at 20%. However, both beer and soda appeared
frequently across all transactions (see Table 3), so their association could simply be a fluke. This is
confirmed by the lift value of {beer -> soda}, which was 1, implying no association between beer
and soda.
On the other hand, the {beer -> male cosmetics} rule had a low confidence, due to few purchases of
male cosmetics in general. However, whenever someone did buy male cosmetics, he was very likely
to have bought beer as well, as inferred from a high lift value of 2.6. The converse was true for {beer
-> berries}. With a lift value below 1, we might conclude that if someone bought berries, he would
not be likely to have bought beer.
While it is easy to determine the popularity of individual itemsets, a business owner would typically
be more interested in having a complete list of popular itemsets. To get this list, we need to calculate
the support values for every possible configuration of items, and then shortlist the itemsets that have
support values above a chosen threshold.
In a store with just 10 items, the total number of possible configurations to examine would be a
whopping 1023. This number increases exponentially in a store with hundreds of items. Hence, we
need a way to reduce the number of item configurations to consider.
Apriori Algorithm
The apriori principle is a solution to reduce the number of itemsets we need to examine. Put simply,
the apriori principle states that if an itemset is infrequent, then any larger itemset containing it must also be infrequent.
This means that if {beer} was found to be infrequent, we can expect {beer, pizza} to be equally or
even more infrequent. So in consolidating the list of popular itemsets, we need not consider {beer,
pizza}, nor any other itemset configuration that contains beer.
Finding itemsets with high support.
Using the apriori principle, the number of itemsets that have to be examined can be pruned, and the
list of popular itemsets can be obtained in these steps:
Step 0. Start with itemsets containing just a single item, such as {apple} and {pear}.
Step 1. Determine the support for itemsets. Keep the itemsets that meet a minimum support
threshold, and remove itemsets that do not.
Step 2. Using the itemsets you have kept from Step 1, generate all possible itemset configurations.
Step 3. Repeat Steps 1 & 2 until there are no more new itemsets.
Figure 6 shows how the number of possible combinations from 4 items could be significantly pruned
using the apriori principle. When {apple} had low support, it was removed along with all other
itemset configurations that contained apple. This reduced the number of itemsets to consider by
more than half.
Figure 6. Itemsets within the red border would have been pruned.
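A minimal sketch of these steps on made-up transactions: the items and the support threshold are illustrative, and pruning happens because only the itemsets kept at one level are combined to form candidates at the next.

```python
# A level-wise sketch of the apriori idea on made-up transactions.
transactions = [
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"beer", "rice"},
    {"beer", "pear"},
]
min_support = 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = {item for t in transactions for item in t}
# Steps 0 and 1: start with single items and keep those meeting the threshold.
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
all_frequent = list(frequent)

# Steps 2 and 3: combine surviving itemsets into larger candidates and repeat
# until no new itemsets appear.
size = 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == size}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)
    size += 1

print(all_frequent)
```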
Take for example the task of finding high-confidence rules. If the rule {beer, chips -> apple} has low
confidence, all other rules with the same constituent items and with apple on the right-hand side
would have low confidence too. Specifically, the rules {beer -> apple, chips} and {chips -> apple, beer}
would have low confidence as well. As before, lower level candidate item rules can be pruned using
the apriori algorithm, so that fewer candidate rules need to be examined.
Limitations
Computationally Expensive. Even though the apriori algorithm reduces the number of candidate
itemsets to consider, this number could still be significant when store inventories are large or when
the support threshold is low. An alternative solution would be to reduce the number of comparisons
by using advanced data structures to sort candidate itemsets more efficiently.
Spurious Associations. When analyzing a large number of itemsets, associations could happen by
chance. To ensure that associations found are generalizable, they should be validated.
Despite these limitations, association rules remain an intuitive method to identify important
patterns, which works well for datasets of manageable size.
Correlation and Regression
Selecting People to Recruit
Military defense is critical for a country’s survival, and for a military to be strong, it needs to be
led by competent commanders. Commander training is intensive and time-consuming. Hence, it is
important to be accurate in identifying eligible candidates for training.
The question is, what are the traits of a potential commander? One might surmise that a commander’s performance could be gauged by several factors, such as physical fitness and cognitive ability. However, we do not know for sure whether these factors actually predict commander potential, nor which of them matters more.
To answer these questions, we can use regression analysis, a prediction technique. In selecting commanders, we use plausible predictors, such as fitness test scores and IQ scores. The prediction model would combine these predictors, each with its own weight, to produce a score of commander potential.
Regression analysis would then allow us to find out whether fitness and IQ are indeed strong
predictors of commander potential and, if so, which predictor is stronger.
One pitfall is that raw predictor weights depend on the units in which each predictor is measured, so they cannot be compared directly. To avoid this, one can standardize the units of predictor variables,
in a way that is analogous to expressing each variable in terms of percentiles. Converting predictors
to the same unit allows for more accurate comparisons. The weights of standardized predictors are
then called beta weights.
When there is only one predictor, the beta weight of that predictor is also known as a correlation
coefficient, denoted as r. Correlation coefficients range from -1 to 1, and provide two pieces of
information about the correlation:
• Direction. Positive (negative) coefficients imply that correlated variables move in the same
(opposite) direction; as one increases, the other increases (decreases).
• Magnitude. The magnitude of a correlation can be inferred from the absolute value of the
coefficient. The closer the coefficient is to -1 or 1, the stronger the predictor. Since correlation
coefficients indicate the absolute strength of individual predictors, they are a more reliable
way to rank predictors than regression weights.
To summarize, regression analysis is a technique for making predictions using one or more
predictors. It derives a trend line based on a general principle of reducing prediction errors. When
there are multiple predictors, predictors are assigned weights according to their relative predictive
strengths. When there is only one predictor, its correlation coefficient reveals the direction and
strength of relationship with the target variable. Thus, correlation coefficients can also be used to
rank predictors.
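As a rough sketch on made-up recruit data: standardizing the predictors first makes their fitted weights (beta weights) comparable, and with a single predictor the correlation coefficient tells a similar story.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical recruits: fitness scores, IQ scores and a commander-potential rating.
rng = np.random.default_rng(0)
fitness = rng.normal(60, 10, 300)
iq = rng.normal(100, 15, 300)
potential = 0.4 * fitness + 0.2 * iq + rng.normal(0, 5, 300)

# Standardize the predictors so their weights (beta weights) are comparable.
X = StandardScaler().fit_transform(np.column_stack([fitness, iq]))
model = LinearRegression().fit(X, potential)
print(model.coef_.round(2))            # beta weights of fitness and IQ

# With a single predictor, the correlation coefficient gives direction and strength.
print(np.corrcoef(fitness, potential)[0, 1].round(2))
```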
Limitations
After learning about what regression analysis can be used for and how it works, we now examine
its limitations:
Sensitivity to outliers. As the regression analysis accounts for all data points equally, a single
data point with extreme values could skew the trend line significantly. Before running the analysis,
outliers should be identified using scatterplots.
Distorted weights of correlated predictors. As alluded to above, the inclusion of highly-correlated
predictors in a regression model would distort the interpretation of their weights. This problem is
called multicollinearity. To ensure that predictions remain generalizable to new cases, regression
models should be kept simple, using a small set of predictors that are not strongly correlated with each other. Alternatively, more
advanced techniques such as lasso or ridge regression could be used to overcome multicollinearity.
Different combinations of predictors. The type of regression explained in this chapter is called simple
linear regression. The term ‘linear’ means that effects of predictors are summed up, resulting in a
trend line that is straight. However, some trends may be curved. For instance, having an average
Body Mass Index (BMI) is a mark of a fit commander, while BMI values that are too low or too high are undesirable. To test non-linear trends, however, more complex models have to be used.
Finally, you might have heard of the phrase “correlation does not imply causation”. To illustrate this,
suppose that household income was found to be positively correlated with commander potential.
This does not mean that having your spouse switch to a higher-paying job would make you a better
commander. Rather, higher household income could pay for better education opportunities, which
in turn improves cognitive skills required by commanders. Being mindful of how results may be
misinterpreted helps to ensure the accuracy of conclusions.
k-Nearest Neighbors and Anomaly
Detection
Identifying Categories by Chemical Make-Up
Have you ever wondered about the difference between red and white wine?
Some assume that red wine is made from red grapes, and white wine is made from white grapes.
But this is not entirely true, because white wine can also be made from red grapes, though red wine
cannot be made from white grapes.
The main difference between red and white wine is in how the grapes are fermented. To make red
wine, grape juice is fermented along with the grape skin, the source of red pigments. For white wine,
the juice is fermented without the skin. Presence of grape skin in the fermentation process is likely to
affect not only the color of the wine, but also its chemical makeup. This means that, without looking
at the color of the wine, we can possibly infer whether a wine is red or white, based on levels of its
chemical compounds.
To check our hypothesis, we can use a technique called k-Nearest Neighbors (k-NN). k-NN is one
of the simplest methods in machine learning. It classifies a data point based on how its neighbors
are classified. So if we are guessing the color of a particular wine, we can refer to the color of other
wines that have the most similar chemical makeup (i.e. neighbors). Besides classification into groups,
k-NN can also be used to estimate continuous values. To approximate a data point’s value, k-NN
takes the aggregated value of its most similar neighbors.
Figure 1. Levels of chlorides and sulfur dioxide in red wines (red) and white wines (black).
Minerals such as sodium chloride (i.e. table salt) are concentrated in grape skin, hence more of
these get infused into red wines. Grape skin also contains natural anti-oxidants that keep the fruit
fresh. Without it, white wines require more sulfur dioxide, which serves as a preservative. For these
reasons, red wines are clustered at the bottom-right of the plot, while white wines are at the top-left.
To deduce the color of a wine with specified levels of chloride and sulfur dioxide, we can refer to the
known colors of neighboring wines with similar quantities of both chemical compounds. By doing
this for each point in the plot, we can draw decision boundaries distinguishing red from white wines
(see Figure 2).
Figure 2. Predicted classifications of wine color using k-NN. Unknown wines inside the red boundary would be
predicted as red wine, whereas those in the black boundary would be predicted as white wine.
Using these decision boundaries, we can predict a wine’s color with over 98% accuracy. Let us take
a look at how these boundaries are inferred from nearest neighbors.
Figure 3. A data point is classified by majority votes from its 5 nearest neighbors. Here, the unknown point would
be classified as red, since 4 out of 5 neighbors are red.
The k in k-NN is a parameter that refers to the number of nearest neighbors to include in the majority voting process. In the above example, k equals 5. Choosing the right value of k is a
process called parameter tuning, and is critical to prediction accuracy.
If k is too small, data points would match immediate neighbors only, amplifying errors due to
random noise. If k is too large, data points would try to match far-flung neighbors, diluting
underlying patterns. But when k is just right, data points would reference a suitable number of
neighbors such that errors cancel out to reveal subtle trends in the data.
To achieve the best fit and lowest error, the parameter k can be tuned by cross-validation. In our
binary (i.e. two class) classification problem, we can also avoid tied votes by choosing k to be an
odd number.
Apart from classifying data points into groups, k-NN can also be used to predict continuous values, by aggregating the values of the nearest neighbors. Instead of treating all neighbors equally and taking a simple average, the estimate could be improved by taking a weighted average, where we pay more attention to the values of closer neighbors than those further away, since closer neighbors are likely to be more similar to the point we are estimating.
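A minimal sketch of this weighted-average variant, using scikit-learn's k-NN regressor on made-up wine data; setting weights to "distance" gives closer neighbors more influence on the estimate.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical task: estimate a wine's quality score from two chemical levels.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))
quality = 5 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.2, 100)

# weights="distance" takes a weighted average of the neighbors' values,
# paying more attention to closer neighbors than to those further away.
model = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, quality)
print(model.predict([[0.5, 0.5]]).round(2))
```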
Anomaly Detection
k-NN is not limited to merely predicting groups or values of data points. It can also be used in
detecting anomalies. Identifying anomalies can be the end goal in itself, such as in fraud detection.
Anomalies can also lead you to additional insights, such as discovering a predictor you previously
overlooked.
The simplest approach to detecting anomalies is by visualizing the data in a plot. In Figure 2 for
instance, we can see immediately which wines deviate from their clusters. However, simple 2-D
visualizations may not be practical, especially when you have more than 2 predictor variables to
examine. This is where predictive models such as k-NN come in.
As k-NN uses underlying patterns in the data to make predictions, any errors in these predictions
are thus telltale signs of data points which do not conform to overall trends. In fact, by this approach,
any algorithm that generates a predictive model can be used to detect anomalies. For instance, in
regression analysis, an outlier would deviate significantly from the best-fit line.
In our wine data, we can examine misclassifications generated from k-NN analysis to identify
anomalies. For instance, it seems that red wines that get misclassified as white wines tend to have
higher-than-usual sulfur dioxide content. One reason could be the acidity of the wine. Wines with
lower acidity require more preservatives. Learning from this, we might consider accounting for
acidity to improve predictions.
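As a rough sketch of this idea on made-up wine data: points whose k-NN prediction disagrees with their actual label do not conform to the overall trend, and are flagged as candidates for closer inspection.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical wine data: two chemical levels and a color label (0 = red, 1 = white).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
color = (X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 200) > 0).astype(int)

model = KNeighborsClassifier(n_neighbors=5).fit(X, color)
predicted = model.predict(X)

# Wines whose predicted color disagrees with their actual color are potential anomalies.
anomalies = np.where(predicted != color)[0]
print(len(anomalies), "potential anomalies out of", len(color))
```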
While anomalies could be caused by missing predictors, they could also arise due to insufficient data
for training the predictive model. The fewer data points we have, the more difficult it would be to
discern patterns in the data. Hence, it is important to ensure an adequate sample size.
Once anomalies have been identified, they can be removed from the datasets used to train predictive
models. This will reduce noise in the data, thus strengthening the accuracy of predictive models.
Limitations
Although k-NN is simple and effective, there are times when it might not work as well:
Imbalanced classes. If there are multiple classes to be predicted, and the classes differ drastically in
size, data points belonging to the smallest class might be overshadowed by those from bigger classes,
increasing their risk of misclassification. To improve accuracy, we could swap majority voting for weighted voting, whereby the classes of closer neighbors are weighted more heavily than those of neighbors further away.
Excess predictors. If there are too many predictors to consider, it would be computationally
intensive to identify and process nearest neighbors across multiple dimensions. Moreover, some
predictors could be redundant as they do not improve prediction accuracy. To resolve this, we can
use dimension reduction techniques to extract only the most powerful predictors for analysis.
References
Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School
of Information and Computer Science.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T. & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Support Vector Machine
“No” or “Oh no”?
Medical diagnosis is complex. It requires multiple factors to be accounted for, and it can be vulnerable
to the subjective opinions of doctors. Sometimes, the correct diagnosis is not made until it is too
late. A more systematic approach to diagnosing ailments might be to use prediction techniques in
data science, in which whole databases of medical symptoms are systematically combed to derive
prediction models for underlying medical conditions.
In this chapter, we will examine one prediction technique called support vector machine (SVM).
It seeks to derive an optimal boundary to separate data points into two groups (e.g. healthy vs.
unhealthy).
Figure 1. Using SVM to predict the presence of heart disease by deriving patient profiles based on maximum heart
rate during exercise and age. The dark green region represents the predicted profile of healthy adults, while the
gray region represents the predicted profile of heart disease patients. The light green and black points represent
actual data from healthy adults and heart disease patients respectively.
To determine which symptoms predict the presence of heart disease, patients from an American
clinic were made to exercise while their physiological states were recorded. Subsequently, they
underwent imaging scans to determine the presence of heart disease. One of the physiological
indicators measured was the patient’s maximum heart rate attained during exercise. Using this
heart rate data together with a patient’s age, an SVM prediction model is able to tell with over
75% accuracy if someone is suffering from heart disease.
Heart disease patients generally had lower heart rates during exercise compared to others of the
same age, and the disease seemed more prevalent among patients above 55 years old.
While heart rate tended to decrease with age, heart disease patients aged about 60 years old appeared
to have heart rates similar to those of younger healthy adults, as indicated by the abrupt arc in the decision
boundary. If not for SVM’s ability to pick up curved patterns, this phenomenon might have been
overlooked.
Deriving such a curved boundary directly is inherently complex. SVM is favored for its superior computational efficiency, as it uses a method called the kernel trick.
Instead of drawing the curved boundary directly onto the data plane, SVM first projects the data onto
a higher dimension, where the data points can be separated with a simple straight line (see Figure 4).
Straight lines are easier to compute, and when projected back down onto a lower dimension, they
could be translated into curved lines.
Figure 4. A circle of blue points on a 2-D sheet could be delineated by a straight line when projected onto a 3-D
sphere.
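A minimal sketch with scikit-learn, on made-up age and heart-rate values: an RBF kernel lets the fitted boundary curve, which is the kernel trick in practice. The data and the accuracy printed here are illustrative, not the clinic's.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical patients: age, maximum heart rate, and a heart-disease label.
rng = np.random.default_rng(0)
age = rng.uniform(30, 75, 300)
max_hr = 210 - age + rng.normal(0, 15, 300)
X = np.column_stack([age, max_hr])
y = ((age > 55) & (max_hr < 150)).astype(int)

# An RBF kernel allows a curved decision boundary between the two groups.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
print(round(model.score(X, y), 2))   # accuracy on the same (made-up) data
```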
Ease of manipulating data in higher dimensions is one reason why SVM is a popular tool for
analyzing datasets with many variables. Common applications of SVM include decoding genetic
information and evaluating sentiments in text.
Limitations
Although SVM is a versatile and fast prediction tool, it might not work well in certain scenarios:
Small datasets. As SVM relies on support vectors to determine decision boundaries, a small sample
size would mean fewer of such support vectors to accurately position the boundaries.
Multiple groups. SVM is only able to classify 2 groups at a time. Where there are more than two
groups, SVM could be iterated to distinguish each group from the rest using a technique called
multi-class SVM.
Large overlap between groups. SVM classifies a data point based on which side of the decision
boundary it falls on. However, when there is a large overlap between data points from both groups,
a data point near the boundary might be more prone to misclassification. Despite this, SVM does not
give additional information on each data point’s probability of misclassification. One solution might
be to use a data point’s distance from the decision boundary to gauge its classification accuracy.
References
Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School
of Information and Computer Science.
Credits to Robert Detrano, M.D., Ph.D. from V.A. Medical Center, Long Beach and Cleveland Clinic
Foundation for the data on heart disease.
Decision Tree
Predicting Survival in a Disaster
During a disaster, certain groups of people, such as women and children, might be entitled to receive help first, granting them a higher chance of survival. One way to identify these groups is
to use decision trees.
A decision tree predicts your chance of survival by asking a series of questions (see Figure 1). Each
question must only have 2 possible responses, such as “yes” versus “no”. You start at the top question,
called the root node, and move through the tree branches guided by your responses, until you reach
a leaf node. The proportion of survivors at that leaf node would be your predicted chance of survival.
Decision trees have broader applications, such as predicting survival rates in medical diagnosis, identifying who would resign, or detecting fraudulent transactions.
Decision trees are versatile. They can handle questions about categorical groupings (e.g. male vs.
female) or about continuous values (e.g. income). If the question asks about a continuous value, the
values can be split into groups – for instance, comparing values which are “above average” versus
“below average”.
In standard decision trees, there should only be two possible responses, such as “yes” versus “no”.
If we want to test three or more responses (e.g. “yes”, “no”, “sometimes”), we can simply add more
branches down the tree (see Figure 2).
From the result, it seems that you would have a good chance of being rescued from the Titanic if
you were not from a 3rd class cabin, and you were either a male child or a female.
Decision trees are popular because they are easy to interpret. The question is, how is a decision tree
generated?
A decision tree is grown by recursive partitioning: the data points are repeatedly split into two groups using the binary question that makes each resulting group as homogeneous as possible. Splitting stops when a criterion is met, such as:
• Stop when data points at the leaf are all of the same predicted category/value
• Stop when the leaf contains fewer than five data points
• Stop when further branching does not improve homogeneity beyond a minimum threshold
Stopping criteria are selected using cross-validation to ensure that the decision tree can draw
accurate predictions for new data.
As recursive partitioning only uses the best binary questions to grow a decision tree, the presence
of non-significant variables would not affect results. Moreover, binary questions tend to divide data
points around central values, so decision trees are robust against extreme values (i.e. outliers).
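As a rough sketch using scikit-learn's decision tree on made-up Titanic-style data: max_depth and min_samples_leaf act as stopping criteria, and cross-validation helps choose values that generalize.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import cross_val_score

# Hypothetical passengers: sex (0 = male, 1 = female), age, cabin class.
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(0, 2, 500),
                     rng.uniform(1, 70, 500),
                     rng.integers(1, 4, 500)])
survived = (((X[:, 0] == 1) | (X[:, 1] < 10)) & (X[:, 2] < 3)).astype(int)

# min_samples_leaf and max_depth are stopping criteria; cross-validation
# checks how well a tree grown with them predicts unseen data.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
print(round(cross_val_score(tree, X, survived, cv=5).mean(), 2))
print(export_text(tree.fit(X, survived), feature_names=["sex", "age", "class"]))
```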
Limitations
While decision trees are easy to interpret, they have some drawbacks.
Instability. Decision trees are grown by splitting data points into homogeneous groups. However,
a slight change in the data could result in a different tree dictating how data points should be split.
By aiming for the best way to split data points each time, decision trees are prone to overfitting.
Inaccuracy. Using the best binary question to split the data at the start may not lead to the most
accurate predictions. Sometimes, less effective splits used initially may lead to better predictions
subsequently.
To overcome these limitations, we can avoid aiming for the best split each time. Instead, we can
diversify the trees grown. By combining predictions from different trees, we can get more stable
and accurate results.
There are two methods to diversify trees:
• The first method chooses different combinations of binary questions to grow multiple trees,
and then aggregates the predictions from those trees. This technique is called a random forest.
• Instead of choosing binary questions at random, the second method strategically selects binary
questions such that the prediction accuracy for each subsequent tree improves incrementally.
Then, a weighted average of predictions from all trees is taken. This technique is called
gradient boosting.
While random forest and gradient boosting tend to produce more accurate predictions, their
complexity renders the solution harder to visualize. Hence, they are often called “black-boxes”.
On the other hand, decision trees can be visualized easily, allowing us to identify predictors and
their interactions. Interpreting results is important for planning targeted interventions, which is
why decision trees are still a popular tool for analysis.
Random Forest
Wisdom of the Crowd
Can several wrongs make a right?
While counter-intuitive, this is possible, even expected, in some of the best predictive models.
This is because each individual model has its own strengths and weaknesses. As there is only
one correct prediction but many possible wrong predictions, individual models that yield correct
predictions tend to reinforce each other, while wrong predictions cancel each other out. Hence,
combining individual models is one way to improve prediction accuracy. This process is called
ensembling.
One example is a random forest, which is an ensemble of decision trees. To show how a random forest is superior to its constituent trees, we generated 1000 possible decision trees to predict crime in a US city, and then compared these predictions to those from a random forest grown from the same
1000 trees.
Figure 1. Heat map of San Francisco, California, USA, showing frequency of crimes: very low (gray), low (yellow),
moderate (orange), or high (red).
From preliminary analysis, we can see that crime occurred mainly in the boxed area north-east of
the city (Figure 1), so we further examined this area by dividing it into smaller regions measuring
900ft by 700ft (260m by 220m).
To predict where and when a crime might occur, 1000 possible decision trees were generated based
on historical crime and weather data, and these trees were then combined in a random forest. Data
from 2014 to 2015 was used to train the models, while data from 2016 (January to August) was used
to test the models’ accuracy.
It would not be feasible for a lean police force to implement extra security patrols for all areas
predicted to have crime. Hence, we programmed our prediction model to identify only the top 30%
of regions with the highest probability of violent crime occurring each day, so that the police could
focus on enhancing patrols in these areas.
So how well could we possibly predict crime?
The random forest model successfully predicted 72% (almost three quarters) of all violent crimes
that occurred in 2016. This was superior to the average prediction accuracy of its constituent 1000
decision trees, which was 67% (see Figure 2).
Only 12 out of 1000 individual trees yielded an accuracy better than the random forest. In other
words, there is a 99% certainty that predictions from a random forest would be better than those from
an individual decision tree.
Figure 2. Histogram of prediction accuracies of 1000 decision trees compared against the overall result from
combining these trees in a random forest.
Figure 3 shows a sample of the random forest’s predictions over 4 days. Based on our predictions,
the police should allocate more resources to areas coded red, and fewer to areas coded gray. While it
may seem obvious that we need more patrols in areas with historically high crime, the model goes
further to pinpoint crime likelihood in non-red areas. For instance, on Day 4, a crime in a gray area
(lower-right) was correctly predicted despite no violent crimes occurring there in the prior 3 days.
Figure 3. Crime predictions for 4 consecutive days in 2016. Circles denote locations where a violent crime was
predicted to happen. Solid circles denote correct predictions. Crosses denote locations where a violent crime
happened, but was not predicted by the model.
Besides predicting crime, random forest also allows us to see which variables contribute most to its
prediction accuracy. Based on the chart in Figure 4, crime appears to be best forecasted using crime
history, location, day of the year and maximum temperature of the day.
Figure 4. Top variables contributing to the random forest’s accuracy in predicting crime.
We have seen how effective a random forest can be in predicting a complex phenomenon like crime. But how does a random forest work?
Ensembles
A random forest is an ensemble of decision trees; it combines predictions from many different
decision trees. In an ensemble, predictions could be combined either by majority-voting or by taking
averages. Figure 5 shows how an ensemble formed by majority-voting could yield more accurate
predictions than the individual models it is based on.
Figure 5. Example of 3 individual models attempting to predict 10 outputs of either Blue or Red. The correct
predictions are Blue for all 10 outputs. An ensemble formed by majority voting based on the 3 individual models
yields the highest prediction accuracy.
For this effect to work however, models included in the ensemble must not make the same kind
of mistakes. In other words, the models must be uncorrelated. A systematic way to generate
uncorrelated decision trees is to use a technique called bootstrap aggregating (bagging).
In bagging, each decision tree is trained on a random sample of the data drawn with replacement (a
bootstrap sample), so no two trees learn from exactly the same observations. A random forest adds a second
source of randomness, illustrated in Figure 6: among the 9 predictor variables (represented by different
colors), a subset is randomly sampled at each split, and the decision tree algorithm then selects the best
variable from that subset for the split.
By restricting the predictors that can be used at each split, we generate dissimilar trees, which helps to
prevent overfitting. To reduce overfitting further, a random forest can be made more complex by increasing
the number of trees, and the resulting model becomes more generalizable and accurate.
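To make this concrete, here is a rough sketch of how such a model could be fitted in Python with the scikit-learn library. This is not the code from our crime study; the data below is randomly generated as a stand-in for predictors such as crime history, location and temperature, and the parameter choices simply mirror the description above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: 9 predictor variables, as in Figure 6.
X, y = make_classification(n_samples=2000, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=1000,    # combine 1000 decision trees, as in the crime example
    max_features="sqrt",  # randomly sample a subset of predictors at each split
    bootstrap=True,       # train each tree on a bootstrap sample of the data (bagging)
    random_state=0,
)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
print("Variable importances:", forest.feature_importances_)
```

The variable importances printed at the end are the same kind of information shown in Figure 4's chart of top contributing variables.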
Limitations
No model is perfect. Choosing whether to use a random forest model is a trade-off between predictive
power and interpretability of results.
Not interpretable. Random forests are considered black boxes, because they comprise many randomly
generated decision trees and are not guided by clear prediction rules. For example, we would not know
exactly how a random forest model reached its prediction of a crime occurring at a specific place and
time; we only know that a majority of its constituent decision trees decided so. This lack of clarity
about how predictions are made may raise ethical concerns when such models are used in fields like medical
diagnosis.
Nonetheless, random forests are widely used because they are easy to implement, especially when
accuracy of results is more crucial than interpretability.
A/B Testing and Multi-Armed Bandits
Basics of A/B Testing
Imagine that you run an online store, and you wish to put up online advertisements to notify people
of an ongoing sale. Which statement would you use?
Although the two statements mean the same thing, one could be more persuasive than the other.
For instance, you might wish to find out whether it is better to use an exclamation mark to convey
excitement, and whether the numerical figure of “50%” is more compelling than the term “half-price”.
To answer these questions, you could display each version of your ad to 100 people, and then check
how many of the 100 eventually clicked on your ad to visit your store. The ad that garnered more
clicks would likely be more effective in attracting buyers, and thus should be used for the rest of
your advertising campaign. This procedure is known as A/B testing, in which the effectiveness of
ad versions A and B is compared. However, A/B testing comes with two problems:
• Results could be a fluke. A lousier ad could outperform a better ad due to chance, such as
buyers being in a happier mood. To be more certain, we could increase the number of people
we show each ad version to. However, this leads to a second problem.
• Loss of potential revenue. By increasing the number of people to whom we show each ad
version from 100 to 200, we would be displaying the lousier ad to more people, potentially
losing buyers who might have been persuaded by the better ad.
These two problems represent the trade-off in A/B testing between exploration and exploitation. If
you increase the number of people to test your ads on (exploration), you could be more certain about
which is the better ad, but you would lose more people who might have purchased from your site
had they seen the better ad (exploitation).
How do we balance this trade-off?
Epsilon-Decreasing Strategy
While A/B testing completes an exploration phase to identify the better ad before committing the rest of
the campaign to exploiting that ad, we need not wait for the exploration phase to end before we start
exploiting.
If Ad A had garnered more clicks than Ad B amongst the first 100 viewers, we could increase the
exposure of Ad A to 60% of the next 100 viewers, while decreasing that of Ad B to 40%. In other
words, we could start exploiting the initial result showing Ad A's better performance, while continuing
to explore the small possibility that Ad B might still prove more effective. As more evidence tilts in
Ad A's favor, we could progressively show more of Ad A and less of Ad B.
The above is known as an epsilon-decreasing strategy, where epsilon refers to the proportion of time
you spend exploring an alternative ad to make sure it is indeed less effective. We decrease epsilon
as our confidence about which ad is better is reinforced. Hence, this technique falls under a class of
machine learning called reinforcement learning.
Figure 1. An A/B test comprises one exploration phase followed by one exploitation phase, whereas an epsilon-
decreasing strategy intersperses exploration with exploitation, with more exploration at the start and less
towards the end.
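As a rough sketch of how this could look in code: the click rates below are made up, and the decaying schedule for epsilon is an illustrative choice rather than the exact 60/40 split described above.

```python
import random

# Illustrative click-through rates; in reality these are unknown to the advertiser.
TRUE_CLICK_RATE = {"A": 0.05, "B": 0.04}

def show_ad(ad):
    """Simulate whether one viewer clicks the ad."""
    return random.random() < TRUE_CLICK_RATE[ad]

clicks = {"A": 0, "B": 0}
views = {"A": 0, "B": 0}
epsilon = 1.0  # start by exploring both ads equally

for _ in range(10_000):
    if random.random() < epsilon:
        ad = random.choice(["A", "B"])  # explore: pick an ad at random
    else:
        # exploit: show the ad with the better observed click rate so far
        ad = max(views, key=lambda a: clicks[a] / max(views[a], 1))
    views[ad] += 1
    clicks[ad] += show_ad(ad)
    epsilon = max(0.05, epsilon * 0.999)  # gradually explore less

print(views, clicks)  # most views end up going to the better-performing ad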
The exploration-exploitation trade-off also shows up in casinos. As slot machines are often seen as cheating
players out of their money with each pull of the arm, they have been nicknamed one-armed bandits. Having to
choose which slot machine to play is thus called a multi-armed bandit problem. This term now refers to any
problem of how to allocate resources, such as which slot machine to play, which ad to show, which topics to
revise before an exam, or which drug study to fund.
Imagine you have 2 slot machines to choose from, A and B. You have money to play a total of 2000
rounds across both machines. In each round, you pull the arm of one slot machine, which gives you
either $1 or nothing.
Machine A has a 50% chance of payout, while Machine B has a 40% chance of payout. However,
these probabilities of payout are unknown to you.
The question is, what is the best way to play to maximize your winnings?
Total Exploration. If we play each machine randomly, we would get $900 on average.
A/B Testing. If we use A/B testing to explore which slot machine has a higher payout for the first
200 rounds, and then exploit Machine A for the next 1800 rounds, we would get $976 on average.
There is a catch. Since the payout rates for both machines are similar, there is an 8% chance we
might misidentify Machine B as having a higher payout.
To reduce the risk of an identification error, we could extend exploration over 500 rounds. This would
result in a 1% chance of misidentifying the higher-paying machine, but it would also decrease the
resulting winnings to about $963.
Epsilon-decreasing Strategy. If we use an epsilon-decreasing strategy to start exploiting a seem-
ingly better machine as we continue exploring to confirm our guess, we could get $984 on average,
while tolerating a 4% chance of misidentification. We could decrease the risk of identification error
by increasing the rate of exploration (i.e. value of epsilon), but as before, this would decrease our
average winnings.
Total Exploitation. If we had insider knowledge that Machine A had a higher payout, we could
exploit it from the very start and get $1000 on average.
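These averages follow from the machines' payout rates: always playing Machine A gives 2000 × 0.5 = $1000 on average, while random play splits the rounds evenly and gives 1000 × 0.5 + 1000 × 0.4 = $900. For readers who want to experiment, here is a rough simulation of the three feasible strategies. The exploration length and the epsilon schedule are illustrative choices, so the simulated averages will hover around, but not exactly match, the figures above.

```python
import random

P = {"A": 0.5, "B": 0.4}   # true payout probabilities (unknown to the player)
ROUNDS = 2000

def pull(machine):
    """Play one round on a machine; returns $1 or $0."""
    return 1 if random.random() < P[machine] else 0

def total_exploration():
    """Pick a machine at random every round."""
    return sum(pull(random.choice("AB")) for _ in range(ROUNDS))

def ab_testing(explore_rounds=200):
    """Explore both machines equally, then commit to the apparent winner."""
    wins = {"A": 0, "B": 0}
    for _ in range(explore_rounds // 2):
        wins["A"] += pull("A")
        wins["B"] += pull("B")
    best = max(wins, key=wins.get)
    return wins["A"] + wins["B"] + sum(pull(best) for _ in range(ROUNDS - explore_rounds))

def epsilon_decreasing():
    """Exploit the machine that looks best so far, while exploring less and less."""
    wins, plays, total, epsilon = {"A": 0, "B": 0}, {"A": 0, "B": 0}, 0, 1.0
    for _ in range(ROUNDS):
        if random.random() < epsilon:
            machine = random.choice("AB")  # explore
        else:
            machine = max(plays, key=lambda m: wins[m] / max(plays[m], 1))  # exploit
        payout = pull(machine)
        wins[machine] += payout
        plays[machine] += 1
        total += payout
        epsilon = max(0.01, epsilon * 0.995)
    return total

def average(strategy, repeats=1000):
    return sum(strategy() for _ in range(repeats)) / repeats

for name, strategy in [("Random play", total_exploration),
                       ("A/B testing", ab_testing),
                       ("Epsilon-decreasing", epsilon_decreasing)]:
    print(name, round(average(strategy)))
```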
From Figure 3, it is clear that, short of having insider knowledge, an epsilon-decreasing strategy would
yield the highest winnings. Moreover, over a large number of rounds, this strategy is guaranteed to
eventually identify the better machine, thanks to a mathematical property called convergence.
While an epsilon-decreasing strategy seems superior, it relies on several assumptions that make it more
difficult to implement than A/B testing:
1. Payout rate is constant over time. An ad might be popular in the morning but not at night,
while another ad might be moderately popular throughout the day. If we compare both ads
in the morning, we would falsely conclude that the first ad is better.
2. Payout rate is independent of previous plays. If an ad is shown to the same customer
multiple times, he might grow curious and be more likely to click on it. This means that
repeated exploration might be needed to reveal true payouts.
3. Minimal delay between playing a machine and observing the payout. If an ad is sent
via email, potential buyers might take a few days to respond. This would prevent us from
knowing the true results of our exploration immediately, and any exploitation attempts in the
meantime would be based on incomplete information.
Nonetheless, if the 2nd or 3rd assumption is violated for both products being compared, the effects of any
errors could cancel out. For instance, if two ads are both sent via email, any
errors in measuring ad performance would be present for both ads, so comparing the ads against
each other would still be fair.
While the above assumptions also apply to A/B testing, they are harder to address in an epsilon-
decreasing strategy, where more detailed computations for epsilon would be required.