18CSE397T - Computational Data Analysis Unit - 4: Session - 7: SLO - 1
Introduction
Understanding customer behavior is crucial in any industry. I realized this last
year when my chief marketing officer asked me, “Can you tell me which existing
customers we should target for our new product?”
That was quite a learning curve for me. I quickly realized as a data scientist how
important it is to segment customers so my organization can tailor and build targeted
strategies. This is where the concept of clustering came in ever so handy!
Problems like customer segmentation are often deceptively tricky because there is
no target variable to work with. We are officially in the land of unsupervised
learning, where we need to figure out patterns and structures without a set
outcome in mind. It’s both challenging and thrilling as a data scientist.
Now, there are a few different ways to perform clustering (as you’ll see below). I will
introduce you to one such type in this article – hierarchical clustering.
We will learn what hierarchical clustering is, its advantages over other clustering
algorithms, the different types of hierarchical clustering, and the steps to perform
it. Finally, we will take up a customer segmentation dataset and implement
hierarchical clustering in Python. I love this technique and I’m sure you will too
after this article!
Suppose we want to estimate the count of bikes that will be rented in a city
every day:
Or, let’s say we want to predict whether a person on board the Titanic
survived or not:
In both cases we are given a target variable (count and Survival, respectively)
that we have to predict from a set of predictors, or independent variables
(season, holiday, Sex, Age, etc.). Such problems are called supervised
learning problems.
Let’s look at the figure below to understand this visually:
Our aim, when training the model, is to generate a function that maps the
independent variables to the desired target. Once the model is trained, we can
pass new sets of observations and the model will predict the target for them.
This, in a nutshell, is supervised learning.
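
To make the train-then-predict idea concrete, here is a minimal sketch in plain
Python. The data is hypothetical (a made-up temperature predictor and made-up
rental counts, not the bike dataset mentioned above), and the model is a simple
least-squares line chosen only for illustration. The names fit_line and predict
are our own helpers, not part of any library.

```python
# Hypothetical toy data: one independent variable (daily temperature)
# and the target we want to predict (bikes rented that day).
temperatures = [10.0, 15.0, 20.0, 25.0, 30.0]
rental_counts = [120.0, 170.0, 220.0, 270.0, 320.0]


def fit_line(xs, ys):
    """Train: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict: apply the learned mapping to a new observation."""
    slope, intercept = model
    return slope * x + intercept


# The target values are used during training ...
model = fit_line(temperatures, rental_counts)

# ... and afterwards the model predicts the target for unseen inputs.
new_day_prediction = predict(model, 22.0)
print(new_day_prediction)
```

The key point is that rental_counts, the target, is available while fitting.
That is what makes the problem supervised.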
There might be situations when we do not have any target variable to predict.
Such problems, without any explicit target variable, are known as unsupervised
learning problems. We only have the independent variables and no
target/dependent variable in these problems.
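
A small sketch of what this looks like in practice, again in plain Python. The
spend values are hypothetical, and the grouping rule used here (repeatedly
merging the two closest groups) is just one illustrative choice, in the spirit
of the hierarchical clustering this article goes on to cover. The helper
group_by_proximity is our own name, not a library function.

```python
# Hypothetical toy data: annual spend (in thousands) for six customers.
# Note that there is no target column; we only have the independent variable.
spend = [2.0, 3.0, 2.5, 20.0, 21.0, 19.5]


def group_by_proximity(values, n_groups):
    """Start with every point in its own group, then repeatedly merge
    the two closest groups until only n_groups remain."""
    groups = [[v] for v in values]
    while len(groups) > n_groups:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # Distance between groups = distance of their closest members.
                d = min(abs(a - b) for a in groups[i] for b in groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return [sorted(g) for g in groups]


# The structure (two customer segments) is discovered from the data alone.
segments = sorted(group_by_proximity(spend, 2))
print(segments)
```

Nothing told the code which customers are "low" or "high" spenders. The two
segments emerge purely from how close the observations are to each other.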