Customer Segmentation PDF

The document loads data on online retail transactions from an Excel file into a Pandas dataframe. It then cleans the data by removing null/negative values and transactions from other countries. Recency, frequency and monetary value features are engineered for each customer by calculating time since last purchase, number of transactions and total sales amount. The features are log transformed and scaled for modeling. Visualizations show the distributions of recency days and sales amounts.

Uploaded by

New Mahoutsukai


Load Dependencies and Configuration Settings

In [1]:

import pandas as pd
import datetime
import math
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab

%matplotlib inline

Load and View the Dataset

In [2]:

cs_df = pd.read_excel(io=r'Online Retail.xlsx')

In [3]:

cs_df.head()

Out[3]:

   InvoiceNo  StockCode  Description                          Quantity  InvoiceDate          UnitPrice  CustomerID  Country
0  536365     85123A     WHITE HANGING HEART T-LIGHT HOLDER   6         2010-12-01 08:26:00  2.55       17850.0     United Kingdom
1  536365     71053      WHITE METAL LANTERN                  6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
2  536365     84406B     CREAM CUPID HEARTS COAT HANGER       8         2010-12-01 08:26:00  2.75       17850.0     United Kingdom
3  536365     84029G     KNITTED UNION FLAG HOT WATER BOTTLE  6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
4  536365     84029E     RED WOOLLY HOTTIE WHITE HEART.       6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom

Transactions size

In [4]:
cs_df.shape

Out[4]:

(541909, 8)

Top Sales by Country

In [5]:

cs_df.Country.value_counts().reset_index().head(n=10)

Out[5]:

   index           Country
0  United Kingdom  495478
1  Germany         9495
2  France          8557
3  EIRE            8196
4  Spain           2533
5  Netherlands     2371
6  Belgium         2069
7  Switzerland     2002
8  Portugal        1519
9  Australia       1259

Top Customers Contributing to 10% of Total Transactions

Number of customers

In [6]:
cs_df.CustomerID.unique().shape

Out[6]:
(4373,)

In [7]:
(cs_df.CustomerID.value_counts()/sum(cs_df.CustomerID.value_counts())*100).head(n=13).cumsum()

Out[7]:

17841.0 1.962249
14911.0 3.413228
14096.0 4.673708
12748.0 5.814728
14606.0 6.498553
15311.0 7.110850
14646.0 7.623350
13089.0 8.079807
13263.0 8.492020
14298.0 8.895138
15039.0 9.265809
14156.0 9.614850
18118.0 9.930462
Name: CustomerID, dtype: float64
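
The cumulative share computed above can also be obtained with `value_counts(normalize=True)`. The sketch below is only an illustration: the `toy` frame and its values are made up, not taken from the retail data. It counts how many of the most active customers are needed to cover a chosen share of transaction rows.

```python
import pandas as pd

# Hypothetical toy data: one row per transaction line, as in cs_df
toy = pd.DataFrame({"CustomerID": [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]})

# Share of rows per customer in percent (value_counts sorts descending)
share = toy["CustomerID"].value_counts(normalize=True) * 100

# Running total of the shares, most active customer first
cum_share = share.cumsum()

# Number of top customers needed to cover at least `target` percent of rows
target = 65
n_needed = int((cum_share < target).sum()) + 1

print(cum_share)
print(n_needed)
```

Counting the entries that fall short of the target and adding one gives the position of the first customer whose cumulative share reaches it.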

Analyzing Data Quality Issues

Number of unique items

In [8]:
cs_df.StockCode.unique().shape

Out[8]:
(4070,)

Description of items: there are more unique descriptions than stock codes, so some stock codes must have more than one description

In [9]:
cs_df.Description.unique().shape
Out[9]:

(4224,)

In [10]:

cs_df.dtypes

Out[10]:
InvoiceNo object
StockCode object
Description object
Quantity int64
InvoiceDate datetime64[ns]
UnitPrice float64
CustomerID float64
Country object
dtype: object

In [11]:
cat_des_df = cs_df.groupby(["StockCode","Description"]).count().reset_index()

Stock codes that have more than one description

In [12]:

cat_des_df.StockCode.value_counts()[cat_des_df.StockCode.value_counts()>1].reset_index().head()

Out[12]:

index StockCode

0 20713 8

1 23084 7

2 21830 6

3 85175 6

4 85172 5

Example of one such stockcode

In [14]:
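# Stock codes with more than one description, sorted by number of descriptions
multi_desc_counts = cat_des_df.StockCode.value_counts()
multi_desc_counts = multi_desc_counts[multi_desc_counts > 1]
# Pick one of these codes by position and list the descriptions recorded for it
example_code = multi_desc_counts.index[5]
cs_df[cs_df['StockCode'] == example_code]['Description'].unique()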

Out[14]:
array(['JUMBO BAG VINTAGE CHRISTMAS ', 'came coded as 20713',
'wrongly coded 20713', '20713 wrongly marked', 20713], dtype=object)
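
A more direct way to find such stock codes is to count distinct descriptions per code with `nunique`. This is a minimal sketch on made-up rows (the codes and descriptions in `toy` are illustrative assumptions), not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy rows; codes and descriptions are illustrative only
toy = pd.DataFrame({
    "StockCode": ["A1", "A1", "A1", "B2", "B2", "C3"],
    "Description": ["RED MUG", "RED MUG", "wrongly coded",
                    "BLUE BAG", "BLUE BAG", "LANTERN"],
})

# Count distinct descriptions per stock code
desc_per_code = toy.groupby("StockCode")["Description"].nunique()

# Keep only the codes that map to more than one description
ambiguous = desc_per_code[desc_per_code > 1]

print(ambiguous)
```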

In [15]:
cs_df['invdatetime'] = pd.to_datetime(cs_df.InvoiceDate)

In [16]:
cs_df.Quantity.describe()

Out[16]:

count 541909.000000
mean 9.552250
std 218.081158
min -80995.000000
25% 1.000000
50% 3.000000
75% 10.000000
max 80995.000000
Name: Quantity, dtype: float64

In [17]:

cs_df.UnitPrice.describe()

Out[17]:
count 541909.000000
mean 4.611114
std 96.759853
min -11062.060000
25% 1.250000
50% 2.080000
75% 4.130000
max 38970.000000
Name: UnitPrice, dtype: float64

Data Cleaning
In [18]:
# Separate data for one geography; copy so later column assignments do not act on a view
cs_df = cs_df[cs_df.Country == 'United Kingdom'].copy()

# Separate attribute for total amount
cs_df['amount'] = cs_df.Quantity*cs_df.UnitPrice

# Remove negative or return transactions
cs_df = cs_df[~(cs_df.amount<0)]

# Remove transactions that have no customer ID
cs_df = cs_df[~(cs_df.CustomerID.isnull())].copy()

In [19]:
cs_df.shape

Out[19]:
(354345, 10)
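
The three cleaning filters can also be combined into one boolean mask. The sketch below applies the same conditions to a small made-up frame; the rows in `toy` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy transactions covering each case the cleaning step handles
toy = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "France", "United Kingdom"],
    "Quantity": [6, -2, 3, 1],
    "UnitPrice": [2.5, 4.0, 1.0, 3.0],
    "CustomerID": [17850.0, 17850.0, 12583.0, np.nan],
})

toy["amount"] = toy["Quantity"] * toy["UnitPrice"]

# Keep UK rows with a non-negative amount and a known customer
mask = (
    (toy["Country"] == "United Kingdom")
    & ~(toy["amount"] < 0)
    & toy["CustomerID"].notnull()
)
clean = toy[mask].copy()

print(clean)
```

Only the first row survives: the second is a return, the third is from another country, and the fourth has no customer ID.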

Build Recency Feature

In [20]:
cs_df.InvoiceDate.max()

Out[20]:
Timestamp('2011-12-09 12:49:00')

In [21]:
cs_df.InvoiceDate.min()

Out[21]:
Timestamp('2010-12-01 08:26:00')

In [22]:
refrence_date = cs_df.InvoiceDate.max()
refrence_date = refrence_date + datetime.timedelta(days = 1)

In [23]:
cs_df['days_since_last_purchase'] = refrence_date - cs_df.InvoiceDate
# Whole days as a float; .dt.days is used because pandas 2.x rejects astype('timedelta64[D]')
cs_df['days_since_last_purchase_num'] = cs_df['days_since_last_purchase'].dt.days.astype(float)

Time period of transactions

In [24]:
# Recency is the smallest days-since-purchase value per customer, i.e. the most recent purchase
customer_history_df = cs_df.groupby("CustomerID")['days_since_last_purchase_num'].min().reset_index()
customer_history_df.rename(columns={'days_since_last_purchase_num':'recency'}, inplace=True)
customer_history_df.recency.describe()

Out[24]:

count 3921.000000
mean 92.188472
std 99.528995
min 1.000000
25% 18.000000
50% 51.000000
75% 143.000000
max 374.000000
Name: recency, dtype: float64

In [25]:

customer_history_df.head()

Out[25]:

CustomerID recency

0 12346.0 326.0

1 12747.0 2.0

2 12748.0 1.0

3 12749.0 4.0

4 12820.0 3.0
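
The recency calculation can be traced on a few made-up invoices. In this sketch the `toy` frame, its dates, and the name `reference_date` are illustrative assumptions. The reference date is one day after the latest invoice, and each customer's recency is the smallest whole-day gap to that date.

```python
import pandas as pd

# Hypothetical invoices for two customers
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01 10:00", "2011-12-08 09:00", "2011-11-09 12:00",
    ]),
})

# Reference point: one day after the latest invoice in the data
reference_date = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

# Whole days between each invoice and the reference date
toy["days_since"] = (reference_date - toy["InvoiceDate"]).dt.days

# Recency per customer = days since their most recent invoice
recency = toy.groupby("CustomerID")["days_since"].min()

print(recency)
```

Because the reference date is shifted by one day, the most recent buyer gets a recency of 1 rather than 0, which keeps a later log transform defined.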

In [26]:
customer_history_df.recency.describe()

Out[26]:
count 3921.000000
mean 92.188472
std 99.528995
min 1.000000
25% 18.000000
50% 51.000000
75% 143.000000
max 374.000000
Name: recency, dtype: float64

In [27]:
from scipy.stats import norm

x = customer_history_df.recency
mu = np.mean(customer_history_df.recency)
sigma = math.sqrt(np.var(customer_history_df.recency))
n, bins, patches = plt.hist(x, 1000, facecolor='green', alpha=0.75)
# add a 'best fit' line; mlab.normpdf was removed from matplotlib, so scipy.stats.norm.pdf is used,
# scaled by sample size and bin width so the curve is comparable to the bar counts
y = norm.pdf(bins, mu, sigma) * len(x) * (bins[1] - bins[0])
l = plt.plot(bins, y, 'r--', linewidth=2)
plt.xlabel('Recency in days')
plt.ylabel('Number of customers')
plt.title(r'$\mathrm{Histogram\ of\ sales\ recency}\ $')
plt.grid(True)
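
A normal density integrates to 1, while histogram bars hold counts, so a fitted curve has to be rescaled before it can share an axis with the bars. The sketch below works through that scaling with NumPy only, on a synthetic sample (the sample and its parameters are assumptions, not the recency data).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sample standing in for a numeric feature
x = rng.normal(loc=50.0, scale=10.0, size=10_000)

counts, bins = np.histogram(x, bins=50)
mu, sigma = x.mean(), x.std()

centers = 0.5 * (bins[:-1] + bins[1:])
bin_width = bins[1] - bins[0]

# Normal density evaluated at the bin centers
pdf = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Expected count per bin = sample size * density * bin width
expected = len(x) * pdf * bin_width

print(counts.sum(), round(expected.sum(), 1))
```

The expected counts sum to roughly the sample size, which is a quick check that the scaling is right.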

Build Frequency & Monetary value Features

In [28]:
# Monetary value: total amount spent per customer
customer_monetary_val = cs_df[['CustomerID', 'amount']].groupby("CustomerID").sum().reset_index()
customer_history_df = customer_history_df.merge(customer_monetary_val, how='outer')
# Small offset so that a zero total does not break the log transform applied later
customer_history_df.amount = customer_history_df.amount+0.001
# Frequency: number of transaction rows per customer
customer_freq = cs_df[['CustomerID', 'amount']].groupby("CustomerID").count().reset_index()
customer_freq.rename(columns={'amount':'frequency'},inplace=True)
customer_history_df = customer_history_df.merge(customer_freq, how='outer')

Returns were removed during data cleaning, so these features reflect only a customer's purchases

In [29]:

customer_history_df.head()

Out[29]:

CustomerID recency amount frequency

0 12346.0 326.0 77183.601 1

1 12747.0 2.0 4196.011 103

2 12748.0 1.0 33719.731 4596

3 12749.0 4.0 4090.881 199

4 12820.0 3.0 942.341 59
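
The recency, amount, and frequency columns shown above could also be built in one pass with named aggregation. This is a sketch under assumptions: it needs pandas 0.25 or later, and the `toy` rows are made up.

```python
import pandas as pd

# Hypothetical cleaned transactions with the per-row fields used above
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "amount": [10.0, 5.0, 2.0, 3.0, 4.0],
    "days_since_last_purchase_num": [30, 2, 100, 90, 95],
})

# One groupby produces all three RFM features
rfm = toy.groupby("CustomerID").agg(
    recency=("days_since_last_purchase_num", "min"),   # most recent purchase
    frequency=("amount", "count"),                     # number of rows
    amount=("amount", "sum"),                          # total spend
).reset_index()

print(rfm)
```

A single aggregation avoids the two outer merges and keeps the three features aligned on the same customer index.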

In [30]:

from sklearn import preprocessing

# Log-transform the skewed recency, frequency and monetary features
customer_history_df['recency_log'] = customer_history_df['recency'].apply(math.log)
customer_history_df['frequency_log'] = customer_history_df['frequency'].apply(math.log)
customer_history_df['amount_log'] = customer_history_df['amount'].apply(math.log)
feature_vector = ['amount_log', 'recency_log','frequency_log']
# .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0
X_subset = customer_history_df[feature_vector].values
# Standardize each feature to zero mean and unit variance
scaler = preprocessing.StandardScaler().fit(X_subset)
X_scaled = scaler.transform(X_subset)
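
The effect of the log transform followed by `StandardScaler` can be checked on synthetic data. In this sketch the log-normal sample `raw` is an assumption standing in for the skewed RFM features. It verifies that the scaled columns have zero mean and unit variance, and that the transform can be undone to recover the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical right-skewed, strictly positive features (3 columns)
raw = rng.lognormal(mean=3.0, sigma=1.0, size=(500, 3))

# Log transform reduces the skew; scaling puts the columns on a common scale
X_log = np.log(raw)
scaler = StandardScaler().fit(X_log)
X_scaled = scaler.transform(X_log)

print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))

# Undo scaling, then undo the log, to get back to the original units
X_back = np.exp(scaler.inverse_transform(X_scaled))
```

The same inverse steps apply to cluster centers found in the scaled space, which makes them readable as days, counts, and amounts.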

Visualizing Recency vs Monetary Value (log-transformed)

In [31]:
plt.scatter(customer_history_df.recency_log, customer_history_df.amount_log, alpha=0.5)

Out[31]:
<matplotlib.collections.PathCollection at 0x23fd2d19a90>

Visualizing Monetary Value distribution (log-transformed)

In [32]:
x = customer_history_df.amount_log
n, bins, patches = plt.hist(x, 1000, facecolor='green', alpha=0.75)

plt.xlabel('Log of Sales Amount')

plt.ylabel('Number of customers')
plt.title(r'$\mathrm{Histogram\ of\ Log\ transformed\ Customer\ Monetary\ value}\ $')
plt.grid(True)
#plt.show()

In [33]:

customer_history_df.head()

Out[33]:

CustomerID recency amount frequency recency_log frequency_log amount_log

0 12346.0 326.0 77183.601 1 5.786897 0.000000 11.253942

1 12747.0 2.0 4196.011 103 0.693147 4.634729 8.341890

2 12748.0 1.0 33719.731 4596 0.000000 8.432942 10.425838

3 12749.0 4.0 4090.881 199 1.386294 5.293305 8.316516

4 12820.0 3.0 942.341 59 1.098612 4.077537 6.848367

In [34]:

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(8, 6))

ax = fig.add_subplot(111, projection='3d')

xs =customer_history_df.recency_log
ys = customer_history_df.frequency_log
zs = customer_history_df.amount_log
ax.scatter(xs, ys, zs, s=5)

ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary')

#plt.show()

Out[34]:

<matplotlib.text.Text at 0x23fc2e56828>

Analyze Customer Segments with Clustering

In [34]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

X = X_scaled

cluster_centers = dict()

for n_clusters in range(3,6,2):

    fig, (ax1, ax2) = plt.subplots(1, 2)
    #ax2 = plt.subplot(111, projection='3d')
    fig.set_size_inches(18, 7)
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # Average silhouette score over all customers for this number of clusters
    silhouette_avg = silhouette_score(X, cluster_labels)

    cluster_centers.update({n_clusters :{
        'cluster_center':clusterer.cluster_centers_,
        'silhouette_score':silhouette_avg,
        'labels':cluster_labels}
    })

    # Silhouette value of each individual customer
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        # cm.spectral was removed from matplotlib; nipy_spectral is its replacement
        color = cm.nipy_spectral(float(i) / n_clusters)

        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        y_lower = y_upper + 10 # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    # Columns of X follow feature_vector: 0 = amount_log, 2 = frequency_log
    feature1 = 0
    feature2 = 2
    ax2.scatter(X[:, feature1], X[:, feature2], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    centers = clusterer.cluster_centers_
    ax2.scatter(centers[:, feature1], centers[:, feature2], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[feature1], c[feature2], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature i.e. monetary value")
    ax2.set_ylabel("Feature space for the 2nd feature i.e. frequency")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    #plt.show()
In [35]:

for i in range(3,6,2):
    print("for {} number of clusters".format(i))
    # Undo the scaling, then the log transform, so the centers are in original units
    cent_transformed = scaler.inverse_transform(cluster_centers[i]['cluster_center'])
    print(pd.DataFrame(np.exp(cent_transformed),columns=feature_vector))
    print("Silhouette score for cluster {} is {}". format(i, cluster_centers[i]['silhouette_score']))
    print()

for 3 number of clusters

amount_log recency_log frequency_log
0 843.937271 44.083222 53.920633
1 221.236034 121.766072 10.668661
2 3159.294272 7.196647 177.789098
Silhouette score for cluster 3 is 0.30437444714898737

for 5 number of clusters

amount_log recency_log frequency_log
0 3905.544371 5.627973 214.465989
1 1502.519606 46.880212 92.306262
2 142.867249 126.546751 5.147370
3 408.235418 139.056216 25.530424
4 464.371885 13.386419 29.581298
Silhouette score for cluster 5 is 0.27958641427323727
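
The silhouette scores printed above can be used to compare candidate cluster counts: the count with the highest average score gives the best-separated clusters by this measure. The sketch below runs that comparison on synthetic blobs (the data, the explicit blob centers, and the range of `k` are assumptions chosen for illustration), not on the customer data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated blobs at fixed centers
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=10,
)

# Average silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Candidate with the highest average silhouette score
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(k, round(s, 3))
print("best k:", best_k)
```

On real customer data the scores are usually much closer together, as in the output above, so the choice is often also guided by how interpretable the segments are.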

Assign Cluster Labels

In [36]:

labels = cluster_centers[5]['labels']
customer_history_df['num_cluster5_labels'] = labels
labels = cluster_centers[3]['labels']
customer_history_df['num_cluster3_labels'] = labels

In [37]:
customer_history_df.head()

Out[37]:

CustomerID recency amount frequency recency_log frequency_log amount_log num_cluster5_labels num_cluster3_labels

0 12346.0 326.0 77183.601 1 5.786897 0.000000 11.253942 1 0

1 12747.0 2.0 4196.011 103 0.693147 4.634729 8.341890 0 2

2 12748.0 1.0 33719.731 4596 0.000000 8.432942 10.425838 0 2

3 12749.0 4.0 4090.881 199 1.386294 5.293305 8.316516 0 2

4 12820.0 3.0 942.341 59 1.098612 4.077537 6.848367 4 2
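
Once labels are attached, each segment can be summarized with a `groupby` on the label column. This is a minimal sketch on made-up customers: the values in `toy` are assumptions, and the median is just one reasonable summary statistic for skewed features.

```python
import pandas as pd

# Hypothetical customers with RFM features and a 3-cluster label
toy = pd.DataFrame({
    "recency": [2, 5, 120, 150, 40],
    "frequency": [100, 80, 3, 5, 20],
    "amount": [4000.0, 3500.0, 150.0, 200.0, 900.0],
    "num_cluster3_labels": [2, 2, 1, 1, 0],
})

# Median of each feature per cluster (robust to extreme spenders)
profile = toy.groupby("num_cluster3_labels")[["recency", "frequency", "amount"]].median()

# Cluster sizes, aligned on the cluster label
profile["n_customers"] = toy["num_cluster3_labels"].value_counts().sort_index()

print(profile)
```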

Visualize Segments
In [38]:

import plotly as py
import plotly.graph_objs as go
py.offline.init_notebook_mode()

x_data = ['Cluster 1','Cluster 2','Cluster 3','Cluster 4', 'Cluster 5']

cutoff_quantile = 100
field_to_plot = 'recency'

y0 = customer_history_df[customer_history_df['num_cluster5_labels']==0][field_to_plot].values
y0 = y0[y0<np.percentile(y0, cutoff_quantile)]
y1 = customer_history_df[customer_history_df['num_cluster5_labels']==1][field_to_plot].values
y1 = y1[y1<np.percentile(y1, cutoff_quantile)]
y2 = customer_history_df[customer_history_df['num_cluster5_labels']==2][field_to_plot].values
y2 = y2[y2<np.percentile(y2, cutoff_quantile)]
y3 = customer_history_df[customer_history_df['num_cluster5_labels']==3][field_to_plot].values
y3 = y3[y3<np.percentile(y3, cutoff_quantile)]
y4 = customer_history_df[customer_history_df['num_cluster5_labels']==4][field_to_plot].values
y4 = y4[y4<np.percentile(y4, cutoff_quantile)]
y_data = [y0,y1,y2,y3,y4]

colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)',
          'rgba(255, 65, 54, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']
traces = []

# One box trace per cluster
for xd, yd, cls in zip(x_data, y_data, colors):
    traces.append(go.Box(
        y=yd,
        name=xd,
        boxpoints=False,
        jitter=0.5,
        whiskerwidth=0.2,
        fillcolor=cls,
        marker=dict(
            size=2,
        ),
        line=dict(width=1),
    ))

layout = go.Layout(
    title='Difference in sales {} from cluster to cluster'.format(field_to_plot),
    yaxis=dict(
        autorange=True,
        showgrid=True,
        zeroline=True,
        dtick=50,
        gridcolor='black',
        gridwidth=0.1,
        zerolinecolor='rgb(255, 255, 255)',
        zerolinewidth=2,
    ),
    margin=dict(
        l=40,
        r=30,
        b=80,
        t=100,
    ),
    paper_bgcolor='white',
    plot_bgcolor='white',
    showlegend=False
)

fig = go.Figure(data=traces, layout=layout)

py.offline.iplot(fig)
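
The per-cluster arrays built above follow one pattern: select a cluster's values, then keep those below a percentile cutoff. A small helper can express this once. In the sketch below, the function name `values_by_cluster` and the `toy` frame are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

def values_by_cluster(df, label_col, field, cutoff_quantile):
    """Return one array of `field` values per cluster label,
    keeping only values strictly below the cutoff percentile."""
    out = []
    for label in sorted(df[label_col].unique()):
        y = df.loc[df[label_col] == label, field].values
        out.append(y[y < np.percentile(y, cutoff_quantile)])
    return out

# Hypothetical data: cluster 0 contains one extreme value
toy = pd.DataFrame({
    "label": [0, 0, 0, 0, 0, 1, 1, 1],
    "recency": [1, 2, 3, 4, 100, 10, 20, 30],
})

y_data = values_by_cluster(toy, "label", "recency", 80)
for arr in y_data:
    print(arr)
```

Because the comparison is strict, a cutoff of 100 still drops each cluster's maximum value; a cutoff of 80 trims the upper tail so the boxes are not dominated by a few extreme customers.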
In [39]:
x_data = ['Cluster 1','Cluster 2','Cluster 3','Cluster 4', 'Cluster 5']
cutoff_quantile = 80
field_to_plot = 'amount'
y0 = customer_history_df[customer_history_df['num_cluster5_labels']==0][field_to_plot].values
y0 = y0[y0<np.percentile(y0, cutoff_quantile)]
y1 = customer_history_df[customer_history_df['num_cluster5_labels']==1][field_to_plot].values
y1 = y1[y1<np.percentile(y1, cutoff_quantile)]
y2 = customer_history_df[customer_history_df['num_cluster5_labels']==2][field_to_plot].values
y2 = y2[y2<np.percentile(y2, cutoff_quantile)]
y3 = customer_history_df[customer_history_df['num_cluster5_labels']==3][field_to_plot].values
y3 = y3[y3<np.percentile(y3, cutoff_quantile)]
y4 = customer_history_df[customer_history_df['num_cluster5_labels']==4][field_to_plot].values
y4 = y4[y4<np.percentile(y4, cutoff_quantile)]
y_data = [y0,y1,y2,y3,y4]

colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)',
          'rgba(255, 65, 54, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']
traces = []

# One box trace per cluster
for xd, yd, cls in zip(x_data, y_data, colors):
    traces.append(go.Box(
        y=yd,
        name=xd,
        boxpoints=False,
        jitter=0.5,
        whiskerwidth=0.2,
        fillcolor=cls,
        marker=dict(
            size=2,
        ),
        line=dict(width=1),
    ))

layout = go.Layout(
    title='Difference in sales {} from cluster to cluster'.format(field_to_plot),
    yaxis=dict(
        autorange=True,
        showgrid=True,
        zeroline=True,
        dtick=1000,
        gridcolor='black',
        gridwidth=0.1,
        zerolinecolor='rgb(255, 255, 255)',
        zerolinewidth=2,
    ),
    margin=dict(
        l=40,
        r=30,
        b=80,
        t=100,
    ),
    paper_bgcolor='white',
    plot_bgcolor='white',
    showlegend=False
)

fig = go.Figure(data=traces, layout=layout)

py.offline.iplot(fig)

In [40]:

x_data = ['Cluster 1','Cluster 2','Cluster 3','Cluster 4', 'Cluster 5']

cutoff_quantile = 80
field_to_plot = 'frequency'
y0 = customer_history_df[customer_history_df['num_cluster5_labels']==0][field_to_plot].values
y0 = y0[y0<np.percentile(y0, cutoff_quantile)]
y1 = customer_history_df[customer_history_df['num_cluster5_labels']==1][field_to_plot].values
y1 = y1[y1<np.percentile(y1, cutoff_quantile)]
y2 = customer_history_df[customer_history_df['num_cluster5_labels']==2][field_to_plot].values
y2 = y2[y2<np.percentile(y2, cutoff_quantile)]
y3 = customer_history_df[customer_history_df['num_cluster5_labels']==3][field_to_plot].values
y3 = y3[y3<np.percentile(y3, cutoff_quantile)]
y4 = customer_history_df[customer_history_df['num_cluster5_labels']==4][field_to_plot].values
y4 = y4[y4<np.percentile(y4, cutoff_quantile)]
y_data = [y0,y1,y2,y3,y4]

colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)',
          'rgba(255, 65, 54, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']
traces = []

# One box trace per cluster
for xd, yd, cls in zip(x_data, y_data, colors):
    traces.append(go.Box(
        y=yd,
        name=xd,
        boxpoints=False,
        jitter=0.5,
        whiskerwidth=0.2,
        fillcolor=cls,
        marker=dict(
            size=2,
        ),
        line=dict(width=1),
    ))

layout = go.Layout(
    title='Difference in sales {} from cluster to cluster'.format(field_to_plot),
    yaxis=dict(
        autorange=True,
        showgrid=True,
        zeroline=True,
        dtick=100,
        gridcolor='black',
        gridwidth=0.1,
        zerolinecolor='rgb(255, 255, 255)',
        zerolinewidth=2,
    ),
    margin=dict(
        l=40,
        r=30,
        b=80,
        t=100,
    ),
    paper_bgcolor='white',
    plot_bgcolor='white',
    showlegend=False
)

fig = go.Figure(data=traces, layout=layout)

py.offline.iplot(fig)

In [41]:
x_data = ['Cluster 1','Cluster 2','Cluster 3']
cutoff_quantile = 100
field_to_plot = 'recency'
y0 = customer_history_df[customer_history_df['num_cluster3_labels']==0][field_to_plot].values
y0 = y0[y0<np.percentile(y0, cutoff_quantile)]
y1 = customer_history_df[customer_history_df['num_cluster3_labels']==1][field_to_plot].values
y1 = y1[y1<np.percentile(y1, cutoff_quantile)]
y2 = customer_history_df[customer_history_df['num_cluster3_labels']==2][field_to_plot].values
y2 = y2[y2<np.percentile(y2, cutoff_quantile)]

y_data = [y0,y1,y2]

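colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)',
          'rgba(255, 65, 54, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']
traces = []

# One box trace per cluster
for xd, yd, cls in zip(x_data, y_data, colors):
    traces.append(go.Box(
        y=yd,
        name=xd,
        boxpoints=False,
        jitter=0.5,
        whiskerwidth=0.2,
        fillcolor=cls,
        marker=dict(
            size=2,
        ),
        line=dict(width=1),
    ))

# Define a layout for this plot; otherwise the layout from the previous cell
# (with its title for a different field) would be reused
layout = go.Layout(
    title='Difference in sales {} from cluster to cluster'.format(field_to_plot),
    yaxis=dict(
        dtick=50,
    )
)

fig = go.Figure(data=traces, layout=layout)

py.offline.iplot(fig)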

In [42]:
x_data = ['Cluster 1','Cluster 2','Cluster 3']
cutoff_quantile = 80
field_to_plot = 'amount'
y0 = customer_history_df[customer_history_df['num_cluster3_labels']==0][field_to_plot].values
y0 = y0[y0<np.percentile(y0, cutoff_quantile)]
y1 = customer_history_df[customer_history_df['num_cluster3_labels']==1][field_to_plot].values
y1 = y1[y1<np.percentile(y1, cutoff_quantile)]
y2 = customer_history_df[customer_history_df['num_cluster3_labels']==2][field_to_plot].values
y2 = y2[y2<np.percentile(y2, cutoff_quantile)]

y_data = [y0,y1,y2]

colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)',
          'rgba(255, 65, 54, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']
traces = []

# One box trace per cluster
for xd, yd, cls in zip(x_data, y_data, colors):
    traces.append(go.Box(
        y=yd,
        name=xd,
        boxpoints=False,
        jitter=0.5,
        whiskerwidth=0.2,
        fillcolor=cls,
        marker=dict(
            size=2,
        ),
        line=dict(width=1),
    ))

layout = go.Layout(
    title='Difference in sales {} from cluster to cluster'.format(field_to_plot),
    yaxis=dict(
        dtick=1000,
    )
)

fig = go.Figure(data=traces, layout=layout)

py.offline.iplot(fig)

In [43]:
x_data = ['Cluster 1','Cluster 2','Cluster 3']
cutoff_quantile = 90
field_to_plot = 'frequency'
y0 = customer_history_df[customer_history_df['num_cluster3_labels']==0][field_to_plot].values
y0 = y0[y0<np.percentile(y0, cutoff_quantile)]
y1 = customer_history_df[customer_history_df['num_cluster3_labels']==1][field_to_plot].values
y1 = y1[y1<np.percentile(y1, cutoff_quantile)]
y2 = customer_history_df[customer_history_df['num_cluster3_labels']==2][field_to_plot].values
y2 = y2[y2<np.percentile(y2, cutoff_quantile)]

y_data = [y0,y1,y2]

colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)',
          'rgba(255, 65, 54, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']
traces = []

# One box trace per cluster
for xd, yd, cls in zip(x_data, y_data, colors):
    traces.append(go.Box(
        y=yd,
        name=xd,
        boxpoints=False,
        jitter=0.5,
        whiskerwidth=0.2,
        fillcolor=cls,
        marker=dict(
            size=2,
        ),
        line=dict(width=1),
    ))

layout = go.Layout(
    title='Difference in sales {} from cluster to cluster'.format(field_to_plot),
    yaxis=dict(
        autorange=True,
        showgrid=True,
        zeroline=True,
        dtick=100,
        gridcolor='black',
        gridwidth=0.1,
        zerolinecolor='rgb(255, 255, 255)',
        zerolinewidth=2,
    ),
    margin=dict(
        l=40,
        r=30,
        b=80,
        t=100,
    ),
    paper_bgcolor='white',
    plot_bgcolor='white',
    showlegend=False
)

fig = go.Figure(data=traces, layout=layout)

py.offline.iplot(fig)
