ML Fundamentals
Now we can finally start to talk about modeling. You’ll use your new tools of data wrangling and programming to fit many models and understand how they work. The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the model will capture true “signals” (i.e. patterns generated by the phenomenon of interest) and ignore “noise” (i.e. random variation that you’re not interested in). Before you can start using models on datasets, you need to understand the basics of how models work. The foundations of machine learning come from statistical theory and statistical learning.
Theory Input
The content is based on explanations from Hadley Wickham and
Andriy Burkov.
Modeling Basics in R
A simple model
There are 250 models on this plot, but a lot of them are really bad! We need to find the good models by making precise our intuition that a good model is “close” to the data. We need a way to quantify the distance between the data and a model. Then we can fit the model by finding the values of a1 and a2 that generate the model with the smallest distance from the data.
Infobox
If you want to learn some more basic elements of programming in R, you can always have a look at
the course Business Data Science Basics (password: ds_fund_ws20).
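The grid of candidate models plotted below, together with its dist column, is not constructed in this excerpt. Following the corresponding passage in R for Data Science (which this part is based on), it can be built roughly like this; sim1 is the example dataset from the modelr package:

library(tidyverse)
library(modelr)   # provides the sim1 example dataset

# A candidate model: predictions for intercept a[1] and slope a[2]
model1 <- function(a, data) {
  a[1] + data$x * a[2]
}

# Distance between a model and the data: root-mean-squared deviation
measure_distance <- function(mod, data) {
  diff <- data$y - model1(mod, data)
  sqrt(mean(diff^2))
}

# Distance for a single intercept/slope combination
sim1_dist <- function(a1, a2) {
  measure_distance(c(a1, a2), sim1)
}

# Evaluate a grid of intercept/slope combinations
grid <- expand.grid(
  a1 = seq(-5, 20, length = 25),
  a2 = seq(1, 3, length = 25)
) %>%
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))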
grid %>%
  ggplot(aes(a1, a2)) +
  geom_point(data = filter(grid, rank(dist) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))
When you overlay the best 10 models back on the original data,
they all look pretty good:
ggplot(sim1, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(
    aes(intercept = a1, slope = a2, colour = -dist),
    data = filter(grid, rank(dist) <= 10)
  )
You could imagine iteratively making the grid finer and finer until
you narrowed in on the best model. But there’s a better way to
tackle that problem: a numerical minimization tool called Newton-
Raphson search. The intuition of Newton-Raphson is pretty simple:
you pick a starting point and look around for the steepest slope.
You then ski down that slope a little way, and then repeat again
and again, until you can’t go any lower. In R, we can do that
with optim():
best <- optim(c(0, 0), measure_distance, data = sim1)
best$par
## [1] 4.22 2.05
Don’t worry too much about the details of how optim() works. It’s the intuition that’s important here. If you have a function that defines the distance between a model and a dataset, and an algorithm that can minimise that distance by modifying the parameters of the model, you can find the best model. The neat thing about this approach is that it will work for any family of models that you can write an equation for.
There’s one more approach that we can use for this model,
because it’s a special case of a broader family: linear models. A
linear model has the general form:
y = a_1 + a_2 * x_1 + a_3 * x_2 + ... + a_n * x_(n-1)
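Because this model belongs to the linear family, R’s built-in lm() can fit it directly. A minimal example on the sim1 data, following R for Data Science:

sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
# The intercept and slope should essentially match the optim() result above (about 4.22 and 2.05)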
Machine Learning
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning
Let’s say the problem that you want to solve using supervised
learning is spam detection. You gather the data, for example,
10,000 email messages, each with a label either “spam” or
“not_spam” (you could add those labels manually or pay someone
to do that for you). Now, you have to convert each email message
into a feature vector.
You repeat this feature-extraction procedure (a bag-of-words encoding over a 20,000-word dictionary) for every email message in your collection, which gives you 10,000 feature vectors (each of dimensionality 20,000), each paired with a label (“spam”/“not_spam”).
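To make the idea of a feature vector concrete, here is a toy illustration (not part of the original material) of a bag-of-words encoding in R with a tiny dictionary; in the spam example the dictionary would contain 20,000 words, giving 20,000-dimensional vectors:

# Illustration only: a tiny dictionary instead of 20,000 words
dictionary <- c("money", "free", "meeting", "project", "winner")

email_to_features <- function(text, dictionary) {
  words <- tolower(unlist(strsplit(text, "\\W+")))
  as.integer(dictionary %in% words)   # 1 if the dictionary word occurs in the email, else 0
}

email_to_features("Congratulations, you are a WINNER of free money!", dictionary)
## [1] 1 1 0 0 1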
Now you have machine-readable input data, but the output labels
are still in the form of human-readable text. Some learning
algorithms require transforming labels into numbers. For example,
some algorithms require numbers like 0 (to represent the label
“not_spam”) and 1 (to represent the label “spam”). The algorithm I
use to illustrate supervised learning is called Support Vector
Machine (SVM). This algorithm requires that the positive label (in
our case it’s “spam”) has the numeric value of +1 (one), and the
negative label (“not_spam”) has the value of −1 (minus one).
In an SVM, the decision boundary is the hyperplane

wx − b = 0,

and the predicted label for a feature vector x is

y = sign(wx − b).

After learning the optimal parameter values w* and b*, the model becomes

f(x) = sign(w*x − b*).

During training, the algorithm also requires every example to lie on the correct side of the boundary with a margin of at least 1:

wx_i − b ≥ +1 if y_i = +1,  and  wx_i − b ≤ −1 if y_i = −1.
Infobox
A hyperparameter is a property of a learning algorithm, usually (but not always) having a numerical
value. That value influences the way the algorithm works. Those values aren’t learned by the
algorithm itself from data. They have to be set by the data analyst before running the algorithm.
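For example, in the k-means calls used later in this session, centers (the number of clusters) and nstart (the number of random initialisations) are hyperparameters: you pick them before running the algorithm, and they are not learned from the data.

# centers and nstart are hyperparameters chosen by the analyst
# (numeric_data is a placeholder for whatever table you are clustering)
kmeans(numeric_data, centers = 3, nstart = 100)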
We will be using the SVM algorithm at some later point during this
class. In this session we will start using the k-means algorithm.
Business case
The tools and techniques that we discuss in this section are
complex by nature from a theoretical perspective. Further adding
to complexity is the sheer number of algorithms available. The
goal is to streamline your education and focus on application
since this is where the rubber hits the road. However, for those
that want to learn the theory, I will recommend several resources
that have helped me understand various topics.
I. Clustering
K-Means
Hierarchical Clustering
You can find some information about the data set we are going to
use here. You can download the required dataset from here: Data
K-Means Clustering
# Proportion of each store's total quantity purchased per product
# (this fragment continues a longer data-preparation pipe that is not shown here)
group_by(STORE_NAME) %>%
  mutate(PROP_OF_TOTAL = QUANTITY_PURCHASED / sum(QUANTITY_PURCHASED)) %>%
  ungroup()
1.2: Convert to User-Item Format (or Customer-Product)
# 1.2 Convert to User-Item Format (e.g. Customer-Product) ----
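The code for this conversion step is not included in this excerpt. A minimal sketch using tidyr::pivot_wider(); the product column name PRODUCT_NAME is an assumption, and the cell values are the purchase proportions computed above:

# A sketch only: one row per store, one column per product, proportions as values
customer_product_tbl <- customer_trends_tbl %>%
  select(STORE_NAME, PRODUCT_NAME, PROP_OF_TOTAL) %>%   # PRODUCT_NAME is assumed
  pivot_wider(
    names_from  = PRODUCT_NAME,
    values_from = PROP_OF_TOTAL,
    values_fill = 0
  )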
Step 2: Modeling
2.1: Performing K-Means for customer segmentation
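The kmeans_obj tidied in 2.2 is assumed to come from a call along these lines (a sketch; choosing centers = 3 anticipates the scree-plot result further below):

# Run k-means on the numeric user-item table (drop the store name first)
kmeans_obj <- customer_product_tbl %>%
  select(-STORE_NAME) %>%
  kmeans(centers = 3, nstart = 100)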
2.2: Let’s tidy the kmeans() output with the broom package. We could do this ourselves, but broom makes it a lot easier. broom summarizes key information about models in tidy tibbles and provides three verbs that make it convenient to interact with model objects:
tidy() summarizes information about model components
glance() reports information about the entire model
augment() adds information about observations to a dataset
# 2.2 Tidying a K-Means Object ----
# return the centers information for the kmeans model
broom::tidy(kmeans_obj) %>% glimpse()
# Wrapper so k-means can be mapped over different numbers of centers
# (the opening of this function is missing above; the name kmeans_mapper is an assumption)
kmeans_mapper <- function(centers = 3) {
  customer_product_tbl %>%
    select(-STORE_NAME) %>%
    kmeans(centers = centers, nstart = 100)
}
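The kmeans_mapped_tbl used for the scree plot below, and again in step 3.2, is not created in this excerpt. A sketch of how it might be built with purrr and broom, mapping the wrapper above over 1 to 15 centers (the range is an assumption):

# A sketch only: one k-means model per number of centers, plus its glance() summary
kmeans_mapped_tbl <- tibble(centers = 1:15) %>%
  mutate(k_means = purrr::map(centers, kmeans_mapper)) %>%
  mutate(glance  = purrr::map(k_means, broom::glance))

# For the scree plot, unnest glance so centers and tot.withinss are columns
kmeans_mapped_tbl %>%
  unnest(glance) %>%
  select(centers, tot.withinss)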
# Visualization
# (the data piped into ggplot() here is assumed to be kmeans_mapped_tbl with its
#  glance column unnested, so that centers and tot.withinss are available)
ggplot(aes(centers, tot.withinss)) +
  geom_point(color = "#2DC6D6", size = 4) +
  geom_line(color = "#2DC6D6", size = 1) +
  # Add labels (which are repelled a little)
  ggrepel::geom_label_repel(aes(label = centers), color = "#2DC6D6") +
  # Formatting
  labs(title = "Scree Plot",
       subtitle = "Measures the distance each of the customers is from the closest K-Means center",
       caption = "Conclusion: Based on the Scree Plot, we select 3 clusters to segment the customer base.")
UMAP
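The umap_obj inspected below is assumed to be created from the same user-item table as the k-means model, for example with the umap package (a sketch):

library(umap)

# 2D embedding of the customer-product matrix
umap_obj <- customer_product_tbl %>%
  select(-STORE_NAME) %>%
  as.matrix() %>%
  umap()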
The knn and config elements of the UMAP object are a little beyond the scope of this session. I just want to point you to the layout element, which is just a matrix. This is basically the data we want to plot. Before plotting, we do a little manipulation:
umap_results_tbl <- umap_obj$layout %>%
  as_tibble(.name_repair = "unique") %>% # argument is required to set names in the next step
  set_names(c("x", "y")) %>%
  bind_cols(
    customer_product_tbl %>% select(STORE_NAME)
  )
umap_results_tbl %>%
  ggplot(aes(x, y)) +
  geom_point() +
  ggrepel::geom_label_repel(aes(label = STORE_NAME), size = 3)
3.2 In order to color the plot according to the clusters, we need to add the k-means cluster assignments to the UMAP data.
# Get the k-means model for the third element (3 centers, which we chose based on the scree plot)
kmeans_3_obj <- kmeans_mapped_tbl %>%
  pull(k_means) %>%
  pluck(3)
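The umap_kmeans_3_results_tbl used below is not built in this excerpt. A hedged sketch: take the cluster assignments from kmeans_3_obj via broom::augment() and join them onto the UMAP coordinates by STORE_NAME:

# A sketch only: attach each store's cluster to its UMAP coordinates
umap_kmeans_3_results_tbl <- umap_results_tbl %>%
  left_join(
    broom::augment(kmeans_3_obj, customer_product_tbl) %>%
      select(STORE_NAME, .cluster),
    by = "STORE_NAME"
  )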
umap_kmeans_3_results_tbl %>%
  mutate(label_text = str_glue("Customer: {STORE_NAME}
Cluster: {.cluster}")) %>%
  # The ggplot() call is missing in the original excerpt; a colour mapping on .cluster
  # is assumed here, since scale_color_manual() is used below
  ggplot(aes(x, y, color = .cluster)) +
  # Geometries
  geom_point() +
  ggrepel::geom_label_repel(aes(label = label_text), size = 2, fill = "#282A36") +
  # Formatting
  scale_color_manual(values = c("#2d72d6", "#2dc6d6", "#2dd692")) +
  labs(title = "Customer Segmentation: 2D Projection",
       subtitle = "UMAP 2D Projection with K-Means Cluster Assignment",
       caption = "Conclusion: 3 Customer Segments identified using 2 algorithms") +
  theme(legend.position = "none")
Now that we have a nice looking plot and know which customers
are related, we want to know why some customers are clustered
together and what their preferences are. So we still need to figure
out what each group is buying. The next step is to relate the
clusters to products. We need the information from the
customer_trends_tbl data (like price, category etc.). Instead of
having the data associated with the stores, we want to associate it
with the clusters.
# pipe part 1
# (the remaining steps of this pipe are not shown in this excerpt; a sketch follows below)
cluster_trends_tbl <- customer_trends_tbl %>%
  ...

cluster_trends_tbl
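Since the body of that pipe is omitted, here is a hedged sketch of what it could look like. The join via broom::augment(), the key STORE_NAME, and the re-aggregation are assumptions, not the original code:

# A sketch only: attach the k-means cluster of each store to the trends data,
# then compute purchase proportions per cluster instead of per store
cluster_trends_tbl <- customer_trends_tbl %>%
  left_join(
    broom::augment(kmeans_3_obj, customer_product_tbl) %>%
      select(STORE_NAME, .cluster),
    by = "STORE_NAME"
  ) %>%
  group_by(.cluster) %>%
  mutate(PROP_OF_TOTAL = QUANTITY_PURCHASED / sum(QUANTITY_PURCHASED)) %>%
  ungroup()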
4.6 Analyze Clusters
Filter by cluster and take a look at the best-selling products in each cluster, so you can label the clusters accordingly.
# Cluster 1
cluster_trends_tbl %>%
  filter(.cluster == 1) %>%
  arrange(desc(PROP_OF_TOTAL)) %>%
  mutate(CUM_PROP = cumsum(PROP_OF_TOTAL)) %>%
  View()
# Create a function and do it for each cluster
get_cluster_trends <- function(cluster = 1) {
  cluster_trends_tbl %>%
    filter(.cluster == cluster) %>%
    arrange(desc(PROP_OF_TOTAL)) %>%
    mutate(cum_prop = cumsum(PROP_OF_TOTAL))
}

get_cluster_trends(cluster = 1)
get_cluster_trends(cluster = 2)
get_cluster_trends(cluster = 3)
Unfortunately, the clusters / stores can’t be distinguished easily. The top products are pretty much the same. At first glance I saw mainly the following features for each cluster (your clusters might differ, because there is randomness in the k-means algorithm):
Cluster 1
Branded: yes
Category: Cold cereal
Price: low / medium
Clusters 2 & 3
Branded: yes / no
Category: Cold cereal & Pretzels
Price: low / medium
If you desire, you can then update the labels with the description
for the clusters like this:
# Update Visualization
cluster_label_tbl

umap_kmeans_3_results_tbl %>%
  left_join(cluster_label_tbl) %>%
  mutate(label_text = str_glue("Customer: {STORE_NAME}
Cluster: {.cluster}
{.cluster_label}
")) %>%
  ...
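The cluster_label_tbl printed above is not constructed in this excerpt. A minimal sketch, assuming a simple lookup tibble keyed on .cluster and reusing the tentative descriptions from step 4.6:

# A sketch only: the labels are the cluster descriptions identified above
cluster_label_tbl <- tibble(
  .cluster       = factor(1:3),
  .cluster_label = c(
    "Branded cold cereal, low/medium price",
    "Mixed brands, cold cereal & pretzels, low/medium price",
    "Mixed brands, cold cereal & pretzels, low/medium price"
  )
)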
To do a better customer segmentation, we would have needed
more and better data. But I hope you have now understood the
idea of segmentation and its applications in business.
Clustering
Popular methods
K-Means
Hierarchical Clustering
Uses
Dimensionality Reduction
Popular methods
PCA
UMAP
t-SNE
Uses
Challenge
Company Segmentation with Stock Prices
In your assignment folder you will find an .Rmd file that contains all the instructions and also intermediate results in case you get stuck. You can knit the .Rmd file to an HTML/PDF file by clicking the Knit button.