DataCamp Customer Segmentation in Python
CUSTOMER SEGMENTATION IN PYTHON
Data pre-processing for k-means clustering
Karolis Urbonas
Head of Data Science, Amazon
Advantages of k-means clustering
One of the most popular unsupervised learning methods
Simple and fast
Works well*
* with certain assumptions about the data
Key k-means assumptions
Symmetric distribution of variables (not skewed)
Variables with the same average values
Variables with the same variance
Skewed variables
Left-skewed
Right-skewed
Skewed variables
Skew removed with logarithmic transformation
Variables on the same scale
datamart_rfm.describe()
K-means assumes equal mean
And equal variance
It's not the case with RFM data
Let's review the concepts
Managing skewed variables
Identifying skewness
Visual analysis of the distribution
If it has a tail - it's skewed
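A visual check can be backed by a number. As a minimal sketch, assuming a small synthetic sample in place of the course's datamart (the `sample` frame and its column names are hypothetical), pandas' `Series.skew()` / `DataFrame.skew()` returns a value near zero for a symmetric distribution and a positive value when the tail is on the right:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course data: synthetic, not real RFM values.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    'symmetric': rng.normal(loc=50, scale=10, size=5000),
    'right_skewed': rng.lognormal(mean=3, sigma=1, size=5000),
})

# Sample skewness per column:
# near 0 for symmetric data, > 0 for a right tail, < 0 for a left tail.
skewness = sample.skew()
print(skewness.round(2))
```

The number complements the plot rather than replacing it: the plot shows where the tail is, the statistic makes it easy to compare columns.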
Exploring distribution of Recency
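import seaborn as sns
from matplotlib import pyplot as plt
# distplot is deprecated since seaborn 0.11; histplot with kde=True draws the same kind of plot
sns.histplot(datamart['Recency'], kde=True)
plt.show()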
Exploring distribution of Frequency
# histplot replaces the deprecated distplot
sns.histplot(datamart['Frequency'], kde=True)
plt.show()
Data transformations to manage skewness
Logarithmic transformation (positive values only)
import numpy as np
frequency_log = np.log(datamart['Frequency'])
# histplot replaces the deprecated distplot
sns.histplot(frequency_log, kde=True)
plt.show()
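The effect of the log transformation can also be checked numerically. The sketch below assumes a hypothetical, synthetic `frequency` series of strictly positive counts (not the course data) and compares skewness before and after the transformation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical purchase counts: strictly positive and right-skewed.
frequency = pd.Series(
    np.ceil(rng.lognormal(mean=2, sigma=1, size=5000)), name='Frequency'
)

# The log is defined here because every value is >= 1.
frequency_log = np.log(frequency)

skew_before = frequency.skew()
skew_after = frequency_log.skew()
print(round(skew_before, 2), round(skew_after, 2))
```

On data like this the skewness after the transformation should be much closer to zero than before.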
Dealing with negative values
Adding a constant before log transformation
Cube root transformation
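Both options can be sketched in a few lines. The `values` series is hypothetical, and choosing the constant so that the minimum becomes 1 is an assumption made for illustration; any constant that makes all values strictly positive would work:

```python
import numpy as np
import pandas as pd

# Hypothetical values that include zero and negatives (made up for illustration).
values = pd.Series([-8.0, -1.0, 0.0, 1.0, 27.0, 1000.0])

# Option 1: add a constant so the minimum becomes 1, then take the log.
shift = 1 - values.min()            # assumption: constant derived from the data
shifted_log = np.log(values + shift)

# Option 2: the cube root is defined for negative numbers and keeps their sign.
cube_root = np.cbrt(values)

print(shifted_log.round(2).tolist())
print(cube_root.round(2).tolist())
```

The shift changes the shape of the result depending on the constant, while the cube root needs no tuning but compresses large values less strongly than the log.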
Let's practice how to identify and manage skewed variables!
Centering and scaling variables
Identifying an issue
datamart_rfm.describe()
Analyze key statistics of the dataset
Compare mean and standard deviation
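To focus the comparison, the two relevant rows can be pulled out of `describe()`. This is a minimal sketch on a hypothetical `rfm` frame with made-up values; the column name `MonetaryValue` is an assumption, not taken from the slides:

```python
import pandas as pd

# Hypothetical RFM-style values; made up for illustration.
rfm = pd.DataFrame({
    'Recency': [10, 45, 200, 3, 90],
    'Frequency': [2, 3, 1, 30, 5],
    'MonetaryValue': [250.0, 40.0, 15.0, 900.0, 75.0],
})

# Keep only the two rows that matter for this check.
summary = rfm.describe().loc[['mean', 'std']]
print(summary.round(2))

# Largest column mean divided by the smallest:
# a ratio far from 1 signals that the variables are on different scales.
mean_ratio = summary.loc['mean'].max() / summary.loc['mean'].min()
print(round(mean_ratio, 1))
```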
Centering variables with different means
K-means works well on variables with the same mean
Centering variables is done by subtracting the average value from each observation
datamart_centered = datamart_rfm - datamart_rfm.mean()
datamart_centered.describe().round(2)
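A quick way to confirm what centering does and does not change, sketched on a hypothetical two-column frame (values are made up): the column means move to zero while the standard deviations stay as they were.

```python
import pandas as pd

# Hypothetical frame standing in for datamart_rfm.
rfm = pd.DataFrame({'Recency': [10, 45, 200, 3, 90],
                    'Frequency': [2, 3, 1, 30, 5]})

# Column means are broadcast across rows and subtracted from each observation.
centered = rfm - rfm.mean()

max_abs_mean = centered.mean().abs().max()              # effectively zero
max_std_change = (centered.std() - rfm.std()).abs().max()  # spread is unchanged
print(max_abs_mean, max_std_change)
```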
Scaling variables with different variance
K-means works better on variables with the same variance / standard deviation
Scaling variables is done by dividing each variable by its standard deviation
datamart_scaled = datamart_rfm / datamart_rfm.std()
datamart_scaled.describe().round(2)
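Scaling alone equalizes the spread but not the averages. The sketch below uses a hypothetical frame with made-up values to show that after dividing by the standard deviation each column has a standard deviation of about 1, while the column means still differ:

```python
import pandas as pd

# Hypothetical frame standing in for datamart_rfm.
rfm = pd.DataFrame({'Recency': [10, 45, 200, 3, 90],
                    'Frequency': [2, 3, 1, 30, 5]})

# pandas .std() uses the sample standard deviation (ddof=1) by default.
scaled = rfm / rfm.std()

scaled_std = scaled.std()     # close to 1 for every column
scaled_mean = scaled.mean()   # not zero, and not equal across columns
print(scaled_std.round(2))
print(scaled_mean.round(2))
```

This is why centering and scaling are usually applied together.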
Combining centering and scaling
Subtract mean and divide by standard deviation manually
Or use a scaler from the scikit-learn library (returns a numpy.ndarray object)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(datamart_rfm)
datamart_normalized = scaler.transform(datamart_rfm)
print('mean: ', datamart_normalized.mean(axis=0).round(2))
print('std: ', datamart_normalized.std(axis=0).round(2))
mean: [-0. -0. 0.]
std: [1. 1. 1.]
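The manual route and the scikit-learn route can be compared side by side. This is a sketch on a hypothetical `rfm` frame (made-up values; the `MonetaryValue` and `CustomerID` names are assumptions). Two details are worth noting: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas defaults to `ddof=1`, and the returned array can be wrapped back into a DataFrame to keep the index and column labels.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for datamart_rfm; values are made up.
rfm = pd.DataFrame(
    {'Recency': [10, 45, 200, 3, 90],
     'Frequency': [2, 3, 1, 30, 5],
     'MonetaryValue': [250.0, 40.0, 15.0, 900.0, 75.0]},
    index=pd.Index([101, 102, 103, 104, 105], name='CustomerID'),
)

# Manual version: ddof=0 matches what StandardScaler computes.
manual = (rfm - rfm.mean()) / rfm.std(ddof=0)

# scikit-learn version, wrapped back into a DataFrame to keep the labels.
scaled_array = StandardScaler().fit_transform(rfm)
scaled = pd.DataFrame(scaled_array, index=rfm.index, columns=rfm.columns)

same_result = np.allclose(manual.to_numpy(), scaled.to_numpy())
print(same_result)
```

With the pandas default (`ddof=1`) the manual result would differ slightly from the scaler's output, which matters mostly on small samples.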
Test different approaches by yourself!
Sequence of structuring pre-processing steps
Why does the sequence matter?
Log transformation only works with positive data
Normalization centers the data at zero, which produces negative values, so a log transformation applied afterwards will not work
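The ordering problem is easy to reproduce. In this sketch the single-column array `x` is hypothetical: every value is positive, so the log works on the raw data, but after standardization the below-average values become negative and their log is NaN.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical strictly positive, right-skewed column.
x = np.array([[1.0], [2.0], [3.0], [5.0], [50.0]])

standardized = StandardScaler().fit_transform(x)
n_negative = int((standardized < 0).sum())

# Log after standardizing: NaN for every negative entry (warnings silenced here).
with np.errstate(invalid='ignore', divide='ignore'):
    log_after = np.log(standardized)
n_nan = int(np.isnan(log_after).sum())

print(n_negative, n_nan)
```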
DataCamp Customer Segmentation in Python
Sequence
1. Unskew the data - log transformation
2. Standardize to the same average values
3. Scale to the same standard deviation
4. Store as a separate array to be used for clustering
Coding the sequence
Unskew the data with log transformation
import numpy as np
datamart_log = np.log(datamart_rfm)
Normalize the variables with StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(datamart_log)
Store it separately for clustering
datamart_normalized = scaler.transform(datamart_log)
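The same sequence can be wrapped in a small helper. This is one possible sketch, not the course's implementation: the function name `preprocess_for_kmeans`, the positivity check, and the sample `rfm` values are all assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess_for_kmeans(df):
    """Log-transform, then center and scale; returns a labeled DataFrame."""
    # Guard (an addition for illustration): the log needs strictly positive input.
    if (df <= 0).to_numpy().any():
        raise ValueError('log transformation requires strictly positive values')
    logged = np.log(df)                               # 1. unskew
    scaled = StandardScaler().fit_transform(logged)   # 2-3. center and scale
    # 4. store separately, keeping index and column labels
    return pd.DataFrame(scaled, index=df.index, columns=df.columns)

# Hypothetical RFM-style values; made up for illustration.
rfm = pd.DataFrame({
    'Recency': [10, 45, 200, 3, 90],
    'Frequency': [2, 3, 1, 30, 5],
    'MonetaryValue': [250.0, 40.0, 15.0, 900.0, 75.0],
})

rfm_normalized = preprocess_for_kmeans(rfm)
print(rfm_normalized.describe().round(2))
```

Keeping the original frame untouched and returning a new one makes it straightforward to attach cluster labels back to the raw RFM values later.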
Practice on RFM data!