Using Big - Fact Customer Table With Proper Unseen Data - Colab

The document details a project focused on H&M's personalized fashion recommendations through customer segmentation based on transaction history. It involves data processing using Python libraries such as pandas and numpy, and includes merging customer, transaction, and article data. The project utilizes K-Means clustering on a large customer dataset to analyze and record customer transactions effectively.

Uploaded by

Samyukta G

24/10/2024, 23:20 Using Big_Fact Customer Table with proper unseen data - Colab

H&M Personalized Fashion Recommendations

Customer Segmentation Based on Transaction History

Samyukta Gurudanti

Import Statements


!pip install shutup
import shutup
shutup.please()

Requirement already satisfied: shutup in /usr/local/lib/python3.10/dist-packages (0.2.0)

import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
pd.set_option('display.max_columns', None)

articles = pd.read_csv('/content/drive/MyDrive/hm/articles_edited.csv')

trans = pd.read_csv('/content/drive/MyDrive/hm/transactions_train.csv')

trans.to_parquet('transactions_train.parquet')

cust = pd.read_csv('/content/drive/MyDrive/hm/customers.csv')

print('This is the length of customers', len(cust))

This is the length of customers 1371980

Applying K-Means on the Big Customer Fact Table

Merging
Only 788,323 of the 31,788,324 transaction rows are used, to keep the merge tractable.

trans

https://colab.research.google.com/drive/1ms5pe88FOfmAksE3Oc5ZeC32_NXhHk55

t_dat customer_id article_id price sales_channel_id

0 2018-09-20 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... 663713001 0.050831 2

1 2018-09-20 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... 541518023 0.030492 2

2 2018-09-20 00007d2de826758b65a93dd24ce629ed66842531df6699... 505221004 0.015237 2

3 2018-09-20 00007d2de826758b65a93dd24ce629ed66842531df6699... 685687003 0.016932 2

4 2018-09-20 00007d2de826758b65a93dd24ce629ed66842531df6699... 685687004 0.016932 2

... ... ... ... ... ...

31788319 2020-09-22 fff2282977442e327b45d8c89afde25617d00124d0f999... 929511001 0.059305 2

31788320 2020-09-22 fff2282977442e327b45d8c89afde25617d00124d0f999... 891322004 0.042356 2

31788321 2020-09-22 fff380805474b287b05cb2a7507b9a013482f7dd0bce0e... 918325001 0.043203 1

31788322 2020-09-22 fff4d3a8b1f3b60af93e78c30a7cb4cf75edaf2590d3e5... 833459002 0.006763 1

31788323 2020-09-22 fffef3b6b73545df065b521e19f64bf6fe93bfd450ab20... 898573003 0.033881 2

31788324 rows × 5 columns

Randomly sampling 788,323 rows from trans

trans = trans.sample(n=788323, random_state=26)

trans

t_dat customer_id article_id price sales_channel_id

17258814 2019-09-28 ed09247cfe0c4f3ca9620169fb0a9562910a6dfe064521... 757303002 0.024390 2

18385881 2019-10-28 b8a2153e7c93981cd110b6c8ed93236e5b0d27ecc3cc98... 781833003 0.025407 2

10647175 2019-05-23 0e9d42f81879d01a3b2d8d4725585e3b77ba8208405d77... 708845001 0.013542 1

20519357 2019-12-23 2a396024b9f9c8d41d0d7b3b70c63ca80d1c674bcf0e29... 808276001 0.030492 1

29807461 2020-08-02 55f6ab9a67b10abc185893ef274ad49405beb1e1310783... 874669001 0.025407 2

... ... ... ... ... ...

12931122 2019-06-28 f56afb400af35c99fc11573a2708f72b0d32eaa3fe1886... 675069001 0.022864 2

6404429 2019-02-20 86d33d6b953bca611bda0a08715b78243459a0325e6ec7... 617645001 0.016932 2

8629358 2019-04-11 8c51635838424d2340458d6c8373c2850ec4dd1eeea96d... 733419003 0.033881 1

11295723 2019-06-03 780ef00724321a7d5ab06e1f0a3af13321bc9e9472299e... 697091007 0.023729 2

23052666 2020-03-05 1b61f9ee956bc1daef75d6c9750c4ed8d62db6cdd2298e... 571706001 0.019814 2

788323 rows × 5 columns
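Because a fixed random_state is passed to sample, the 788,323-row subset is reproducible: rerunning the cell yields exactly the same rows. A minimal sketch on toy data (df_toy is a hypothetical stand-in, not the H&M transactions table):

```python
import pandas as pd

# toy frame standing in for the much larger transactions table
df_toy = pd.DataFrame({'x': range(100)})

# a fixed random_state makes the subsample reproducible across runs
s1 = df_toy.sample(n=5, random_state=26)
s2 = df_toy.sample(n=5, random_state=26)
print(s1.index.equals(s2.index))  # True
```

This matters here because later cells treat the sample as "the" dataset; without a seed, every rerun would segment a different population.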

merged = pd.merge(trans, cust, on='customer_id', how='left')

merged = pd.merge(merged, articles, on='article_id', how='left')

merged


        t_dat       customer_id                                        article_id  price     sales_channel_id  FN   Active  club_member_status ...
0       2019-09-28  ed09247cfe0c4f3ca9620169fb0a9562910a6dfe064521...  757303002  0.024390  2                 NaN  NaN
1       2019-10-28  b8a2153e7c93981cd110b6c8ed93236e5b0d27ecc3cc98...  781833003  0.025407  2                 1.0  1.0
2       2019-05-23  0e9d42f81879d01a3b2d8d4725585e3b77ba8208405d77...  708845001  0.013542  1                 NaN  NaN
3       2019-12-23  2a396024b9f9c8d41d0d7b3b70c63ca80d1c674bcf0e29...  808276001  0.030492  1                 NaN  NaN
4       2020-08-02  55f6ab9a67b10abc185893ef274ad49405beb1e1310783...  874669001  0.025407  2                 NaN  NaN
...     ...         ...                                                ...        ...       ...               ...  ...
788318  2019-06-28  f56afb400af35c99fc11573a2708f72b0d32eaa3fe1886...  675069001  0.022864  2                 1.0  1.0
788319  2019-02-20  86d33d6b953bca611bda0a08715b78243459a0325e6ec7...  617645001  0.016932  2                 NaN  NaN
788320  2019-04-11  8c51635838424d2340458d6c8373c2850ec4dd1eeea96d...  733419003  0.033881  1                 1.0  1.0
788321  2019-06-03  780ef00724321a7d5ab06e1f0a3af13321bc9e9472299e...  697091007  0.023729  2                 NaN  NaN
788322  2020-03-05  1b61f9ee956bc1daef75d6c9750c4ed8d62db6cdd2298e...  571706001  0.019814  2                 NaN  NaN

788323 rows × 35 columns
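Both joins above are left merges, so every sampled transaction is kept even if its customer or article is missing from the dimension tables (those rows get NaN attributes, visible in FN/Active above). A small sketch with hypothetical toy frames showing how `indicator=True` can verify match coverage:

```python
import pandas as pd

# toy stand-ins for the sampled transactions and the customers table
trans_toy = pd.DataFrame({'customer_id': ['a', 'a', 'b', 'z'],
                          'article_id': [1, 2, 2, 3]})
cust_toy = pd.DataFrame({'customer_id': ['a', 'b'], 'age': [30, 42]})

# indicator=True adds a _merge column flagging rows without a match
merged_toy = pd.merge(trans_toy, cust_toy, on='customer_id',
                      how='left', indicator=True)
unmatched = (merged_toy['_merge'] == 'left_only').sum()
print(unmatched)  # 1: the 'z' transaction found no customer row
```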

merged.columns

Index(['t_dat', 'customer_id', 'article_id', 'price', 'sales_channel_id', 'FN',
       'Active', 'club_member_status', 'fashion_news_frequency', 'age',
       'postal_code', 'product_code', 'prod_name', 'product_type_no',
       'product_type_name', 'product_group_name', 'graphical_appearance_no',
       'graphical_appearance_name', 'colour_group_code', 'colour_group_name',
       'perceived_colour_value_id', 'perceived_colour_value_name',
       'perceived_colour_master_id', 'perceived_colour_master_name',
       'department_no', 'department_name', 'index_code', 'index_name',
       'index_group_no', 'index_group_name', 'section_no', 'section_name',
       'garment_group_no', 'garment_group_name', 'detail_desc'],
      dtype='object')

df = merged.copy()

This is the DataFrame we will be using

df


(output identical to the merged table above: 788323 rows × 35 columns)

Recording each customer's transaction history in a dictionary

import pandas as pd

d_f = df.copy()

customer_article_dict = {}

for index, row in d_f.iterrows():
    customer_id = row['customer_id']
    article_id = row['article_id']

    if customer_id in customer_article_dict:
        customer_article_dict[customer_id].append(article_id)
    else:
        customer_article_dict[customer_id] = [article_id]
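Iterating row by row with iterrows is slow on hundreds of thousands of rows. The same dictionary can be built in one vectorized pass with groupby; a sketch on a toy frame with the same two columns:

```python
import pandas as pd

# toy purchases; the notebook's df has these columns among others
df_toy = pd.DataFrame({'customer_id': ['a', 'b', 'a'],
                       'article_id': [10, 20, 30]})

# one list of article_ids per customer, preserving row order
customer_article_dict = (df_toy.groupby('customer_id')['article_id']
                               .apply(list).to_dict())
print(customer_article_dict)  # {'a': [10, 30], 'b': [20]}
```

groupby keeps the within-group row order, so the resulting lists match what the explicit loop builds.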

Looking at 20 random customers' transactions

import random
keys = list(customer_article_dict.keys())
random_keys = random.sample(keys, k=20)
random_items = {key: customer_article_dict[key] for key in random_keys}
for cust, trans in random_items.items():
    print(cust, trans)
f16e61774fc83dd8495a039fa2eb1d86b30afab274747c5c90b9982ab6f383bd [369796007]
471d47b2208774873206f702a3954d31b04eca6639d30bb773c5e8517427edd8 [871243002]
111dd91c7f827034ba986ed82f3f138874c338a69505ae46cb493a60522d0493 [698718003]
26793843a9b2ecd6768a8c19a56fff34ea6ed9be75c6ef1e2442f8746242bb23 [683379001, 111593001, 562657002]
e453c80e848439dacc86a09aa2cb61e722142b168078d7c1716790da0c98df21 [770345001]
0d407f91c43e757976e9058eefaa3a8e0e580c025ef5bc75c0558eab071ea130 [898889001]
1e4bc1cef79b905c6f1d74d8f9469e47639b7d27a44f058c89cfb22d779fa9fa [589748001, 685816011, 578478001]
4b4dff078de125eefa68caf2d70ab39ecb71d509406788a41f44b34b2c6ce851 [766346001]
2c675900d9aec7959e6bae583e329f57402cbeaabae99501809900b8a1943dec [684055001]
7b388395fff64005f6c1c0d7b2f7024d1201c924ec91314d3ddc8b82b45d0ee0 [723529003]
b33c55db22da946ec5f07d7588cbcbac028dd08c48979e42fe0f25e90a569615 [693380001]
b89a47f6b0df42c739de9b7770e44753df0286037f51a05e751537f6fb4dbf0d [876053002, 675127002, 870970001, 806388003, 779331001]
c2321aae0c867f40c30734b87e23ee3ac1223298de00497bb51fad15b18354cc [666448001, 685813006]
4650389aff8795635ad49026398d17522f10dff2f2d37ad03f51cce7b9796a6f [552346023]
f3edca38f8be7ae124555c321d41b91d5239a004885f016470cd3c028113b37b [774040003]
f66606805c1022b4375e586c198402b81dd574944fe1412cd56d6660d9f1e23e [129085001]
30a1e7dd92bb015297bf5dad51b0a209c455f7485519ce216f2df4d7a667695e [832253003]
24ba9b4ac22f2b3c9d05716c08fc1aaec2d255d3c535f710b13e62f694899c7a [827957002]
df25186a0a1679b95540dfaa3eea0aeee5d56ec4853468be83ae0e4c34b81f36 [714927006, 745475023]
d0167d37f2637b47c3023940a8152d7eeb6fb8c9b7bb25c8bc6dec40f422ac4f [743098002, 811927001]

Checking the distribution of transaction counts per customer

from collections import Counter
import matplotlib.pyplot as plt
length_counts = Counter(len(value) for value in customer_article_dict.values())
plt.bar(length_counts.keys(), length_counts.values())
plt.xlabel('Length of List')
plt.ylabel('Count')
plt.title('Histogram of List Lengths')
plt.show()

df = pd.DataFrame.from_dict(length_counts, orient='index', columns=['Count'])
df.index.name = 'Length of List'
df_sorted = df.sort_values(by='Count', ascending=False)
df_sorted['Percentage'] = (df_sorted['Count'] / df_sorted['Count'].sum()) * 100
df_sorted


Count Percentage

Length of List

1 251985 59.394844

2 89964 21.205221

3 38872 9.162436

4 18888 4.452050

5 9999 2.356843

6 5591 1.317843

7 3302 0.778307

8 1944 0.458216

9 1208 0.284735

10 762 0.179609

11 541 0.127518

12 355 0.083676

13 253 0.059634

14 169 0.039835

15 110 0.025928

16 84 0.019799

17 58 0.013671

18 32 0.007543

19 30 0.007071

22 22 0.005186

21 21 0.004950

20 15 0.003536

23 10 0.002357

27 10 0.002357

25 7 0.001650

24 5 0.001179

26 3 0.000707

30 3 0.000707

29 3 0.000707

28 2 0.000471

32 1 0.000236

46 1 0.000236

38 1 0.000236

31 1 0.000236

37 1 0.000236

36 1 0.000236
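The Counter-plus-from_dict route above works, but the same count/percentage table can come straight from `value_counts`. A sketch on a hypothetical small dictionary (the notebook's has ~424k entries):

```python
import pandas as pd

# toy history dictionary standing in for customer_article_dict
customer_article_dict = {'a': [1], 'b': [2, 3], 'c': [4], 'd': [5, 6, 7]}

lengths = pd.Series([len(v) for v in customer_article_dict.values()])
dist = lengths.value_counts().to_frame('Count')
dist['Percentage'] = dist['Count'] / dist['Count'].sum() * 100
print(dist)  # length 1 occurs twice (50%), lengths 2 and 3 once each
```

Either way, the headline finding stands: roughly 59% of sampled customers have a single transaction, which limits how much per-customer history the clustering can exploit.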

Preprocessing the Data

df = d_f.copy()

df.dtypes

t_dat object
customer_id object
article_id int64
price float64
sales_channel_id int64
FN float64
Active float64
club_member_status object
fashion_news_frequency object

age float64
postal_code object
product_code int64
prod_name object
product_type_no int64
product_type_name object
product_group_name object
graphical_appearance_no int64
graphical_appearance_name object
colour_group_code int64
colour_group_name object
perceived_colour_value_id int64
perceived_colour_value_name object
perceived_colour_master_id int64
perceived_colour_master_name object
department_no int64
department_name object
index_code object
index_name object
index_group_no int64
index_group_name object
section_no int64
section_name object
garment_group_no int64
garment_group_name object
detail_desc object
dtype: object

Handling the NaN values
FN and Active contain many NaNs, so rather than dropping those rows they are filled with 0 below. Rows with NaNs in the remaining columns are dropped, which still leaves about 98% of the data.

df.isna().sum()

t_dat 0
customer_id 0
article_id 0
price 0
sales_channel_id 0
FN 451475
Active 456440
club_member_status 1554
fashion_news_frequency 3572
age 3397
postal_code 0
product_code 0
prod_name 0
product_type_no 0
product_type_name 0
product_group_name 0
graphical_appearance_no 0
graphical_appearance_name 0
colour_group_code 0
colour_group_name 0
perceived_colour_value_id 0
perceived_colour_value_name 0
perceived_colour_master_id 0
perceived_colour_master_name 0
department_no 0
department_name 0
index_code 0
index_name 0
index_group_no 0
index_group_name 0
section_no 0
section_name 0
garment_group_no 0
garment_group_name 0
detail_desc 2877
dtype: int64

Changing the NaN Values in FN and Active to 0

df[['FN', 'Active']] = df[['FN', 'Active']].fillna(0)

df.isna().sum()

t_dat 0
customer_id 0
article_id 0

price 0
sales_channel_id 0
FN 0
Active 0
club_member_status 1554
fashion_news_frequency 3572
age 3397
postal_code 0
product_code 0
prod_name 0
product_type_no 0
product_type_name 0
product_group_name 0
graphical_appearance_no 0
graphical_appearance_name 0
colour_group_code 0
colour_group_name 0
perceived_colour_value_id 0
perceived_colour_value_name 0
perceived_colour_master_id 0
perceived_colour_master_name 0
department_no 0
department_name 0
index_code 0
index_name 0
index_group_no 0
index_group_name 0
section_no 0
section_name 0
garment_group_no 0
garment_group_name 0
detail_desc 2877
dtype: int64

Dropping the remaining NaN rows

df = df.dropna()

df.isna().sum()

t_dat 0
customer_id 0
article_id 0
price 0
sales_channel_id 0
FN 0
Active 0
club_member_status 0
fashion_news_frequency 0
age 0
postal_code 0
product_code 0
prod_name 0
product_type_no 0
product_type_name 0
product_group_name 0
graphical_appearance_no 0
graphical_appearance_name 0
colour_group_code 0
colour_group_name 0
perceived_colour_value_id 0
perceived_colour_value_name 0
perceived_colour_master_id 0
perceived_colour_master_name 0
department_no 0
department_name 0
index_code 0
index_name 0
index_group_no 0
index_group_name 0
section_no 0
section_name 0
garment_group_no 0
garment_group_name 0
detail_desc 0
dtype: int64

Splitting the Date Column

data1 = df.copy()

data1[['year', 'month', 'day']] = data1['t_dat'].str.split('-', expand=True).astype(int)
data1.drop('t_dat', axis=1, inplace=True)
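String splitting works here because t_dat is always 'YYYY-MM-DD'. An equivalent, more robust sketch using `pd.to_datetime` and the `.dt` accessor (toy column, same format):

```python
import pandas as pd

# toy t_dat column in the same 'YYYY-MM-DD' format
data_toy = pd.DataFrame({'t_dat': ['2019-09-28', '2020-03-05']})

dates = pd.to_datetime(data_toy['t_dat'])
data_toy['year'] = dates.dt.year
data_toy['month'] = dates.dt.month
data_toy['day'] = dates.dt.day
data_toy = data_toy.drop(columns='t_dat')
print(data_toy)
```

The datetime route also validates the dates, so a malformed row raises instead of silently producing bad integers.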

data1['club_member_status'].unique()

array(['ACTIVE', 'PRE-CREATE', 'LEFT CLUB'], dtype=object)

Label Encoding Everything

# every column's nunique() is below this threshold, so all columns are factorized
for col in data1.columns:
    if data1[col].nunique() < 999999999:
        data1[col] = pd.factorize(data1[col])[0]
data1

customer_id article_id price sales_channel_id FN Active club_member_status fashion_news_frequency age post

0 0 0 0 0 0 0 0 0 0

1 1 1 1 0 1 1 0 1 1

2 2 2 2 1 0 0 0 0 2

3 3 3 3 1 0 0 0 0 3

4 4 4 1 0 0 0 0 0 4

... ... ... ... ... ... ... ... ... ...

788318 6177 4261 65 0 1 1 0 1 14

788319 417652 37442 5 0 0 0 0 0 22

788320 23321 14001 7 1 1 1 0 1 1

788321 417653 10377 1534 0 0 0 0 0 48

788322 417654 2944 172 0 0 0 0 0 26

777646 rows × 37 columns
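pd.factorize assigns integer codes in order of first appearance, which is what the loop above relies on; a minimal example on the club_member_status values seen earlier:

```python
import pandas as pd

# codes are assigned in order of first appearance
codes, uniques = pd.factorize(['ACTIVE', 'PRE-CREATE', 'ACTIVE', 'LEFT CLUB'])
print(codes)          # [0 1 0 2]
print(list(uniques))  # ['ACTIVE', 'PRE-CREATE', 'LEFT CLUB']
```

One caveat worth noting: because the loop factorizes every column, numeric features such as price and age are also replaced by arbitrary codes, which discards their magnitudes and ordering; for distance-based methods like K-Means that is a meaningful modeling choice, not a no-op.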

data1.dtypes

customer_id int64
article_id int64
price int64
sales_channel_id int64
FN int64
Active int64
club_member_status int64
fashion_news_frequency int64
age int64
postal_code int64
product_code int64
prod_name int64
product_type_no int64
product_type_name int64
product_group_name int64
graphical_appearance_no int64
graphical_appearance_name int64
colour_group_code int64
colour_group_name int64
perceived_colour_value_id int64
perceived_colour_value_name int64
perceived_colour_master_id int64
perceived_colour_master_name int64
department_no int64
department_name int64
index_code int64
index_name int64
index_group_no int64
index_group_name int64
section_no int64
section_name int64
garment_group_no int64
garment_group_name int64
detail_desc int64
year int64
month int64
day int64
dtype: object

Creating a transaction history dictionary on the encoded data


import pandas as pd
d_f = data1.copy()
customer_article_dict = {}
for index, row in d_f.iterrows():
    customer_id = row['customer_id']
    article_id = row['article_id']
    if customer_id in customer_article_dict:
        customer_article_dict[customer_id].append(article_id)
    else:
        customer_article_dict[customer_id] = [article_id]

import random
keys = list(customer_article_dict.keys())
random_keys = random.sample(keys, k=20)
random_items = {key: customer_article_dict[key] for key in random_keys}
for cust, trans in random_items.items():
    print(cust, trans)

54649 [3129]
370033 [9803]
124800 [968, 755, 4333, 4312]
186190 [6560]
7264 [6011, 25051, 26915, 35368]
342395 [17468]
208954 [22247]
258780 [2440, 4327, 19413]
147522 [42768]
276297 [10235]
48806 [24768, 38346]
356411 [47114, 30776]
122661 [1, 5821]
401081 [23172]
281925 [27881, 21381, 12001]
348464 [38617, 1124]
47376 [138]
403457 [3253, 5887]
17151 [12144]
289893 [26221, 42496]

Keeping one row per unique customer

data1

(output identical to the data1 table above: 777646 rows × 37 columns)

len(data1.customer_id.unique())

417655

customer_fact_table = data1.copy()

customer_fact_table = customer_fact_table.drop_duplicates(subset='customer_id', keep='first')

len(customer_fact_table)

417655

customer_fact_table

customer_id article_id price sales_channel_id FN Active club_member_status fashion_news_frequency age post

0 0 0 0 0 0 0 0 0 0

1 1 1 1 0 1 1 0 1 1

2 2 2 2 1 0 0 0 0 2

3 3 3 3 1 0 0 0 0 3

4 4 4 1 0 0 0 0 0 4

... ... ... ... ... ... ... ... ... ...

788314 417650 26848 57 0 1 1 0 1 0

788317 417651 8799 230 1 1 1 0 1 30

788319 417652 37442 5 0 0 0 0 0 22

788321 417653 10377 1534 0 0 0 0 0 48

788322 417654 2944 172 0 0 0 0 0 26

417655 rows × 37 columns

Attaching each customer's full transaction history to their row

customer_fact_table['transactions'] = customer_fact_table['customer_id'].map(customer_article_dict)

len(customer_fact_table)

417655

customer_fact_table

customer_id article_id price sales_channel_id FN Active club_member_status fashion_news_frequency age post

0 0 0 0 0 0 0 0 0 0

1 1 1 1 0 1 1 0 1 1

2 2 2 2 1 0 0 0 0 2

3 3 3 3 1 0 0 0 0 3

4 4 4 1 0 0 0 0 0 4

... ... ... ... ... ... ... ... ... ...

788314 417650 26848 57 0 1 1 0 1 0

788317 417651 8799 230 1 1 1 0 1 30

788319 417652 37442 5 0 0 0 0 0 22

788321 417653 10377 1534 0 0 0 0 0 48

788322 417654 2944 172 0 0 0 0 0 26

417655 rows × 38 columns

customer_fact_table.columns

Index(['customer_id', 'article_id', 'price', 'sales_channel_id', 'FN',
       'Active', 'club_member_status', 'fashion_news_frequency', 'age',
       'postal_code', 'product_code', 'prod_name', 'product_type_no',
       'product_type_name', 'product_group_name', 'graphical_appearance_no',
       'graphical_appearance_name', 'colour_group_code', 'colour_group_name',
       'perceived_colour_value_id', 'perceived_colour_value_name',
       'perceived_colour_master_id', 'perceived_colour_master_name',
       'department_no', 'department_name', 'index_code', 'index_name',
       'index_group_no', 'index_group_name', 'section_no', 'section_name',
       'garment_group_no', 'garment_group_name', 'detail_desc', 'year',
       'month', 'day', 'transactions'],
      dtype='object')

cust = pd.read_csv('/content/drive/MyDrive/hm/customers.csv')

cust.columns

Index(['customer_id', 'FN', 'Active', 'club_member_status',
       'fashion_news_frequency', 'age', 'postal_code'],
      dtype='object')

customer_fact_table = customer_fact_table[list(set(customer_fact_table.columns).intersection(cust.columns)) + ['transactions']]

len(cust.columns)

len(customer_fact_table.columns)

customer_fact_table

customer_id Active club_member_status age postal_code FN fashion_news_frequency transactions

0 0 0 0 0 0 0 0 [0, 1480]

1 1 1 0 1 1 1 1 [1, 16907]

2 2 0 0 2 2 0 0 [2, 373, 11602]

3 3 0 0 3 3 0 0 [3, 28989, 23427]

4 4 0 0 4 4 0 0 [4, 11198]

... ... ... ... ... ... ... ... ...

788314 417650 1 0 0 106992 1 1 [26848]

788317 417651 1 0 30 102943 1 1 [8799]

788319 417652 0 0 22 219200 0 0 [37442]

788321 417653 0 0 48 219201 0 0 [10377]

788322 417654 0 0 26 219202 0 0 [2944]

417655 rows × 8 columns

def split_transactions(df):
    for i in range(1, max(df['transactions'].apply(len)) + 1):
        df['transaction_' + str(i)] = df['transactions'].apply(lambda x: x[i-1] if len(x) >= i else None)
    return df

customer_fact_table = split_transactions(customer_fact_table)
customer_fact_table
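The SettingWithCopyWarning below comes from assigning new columns, one per loop iteration, to a DataFrame that originated as a slice. Building all the transaction_i columns at once and concatenating avoids both the warning and the repeated `apply` passes; a sketch on a toy fact table (`cft` is a hypothetical stand-in):

```python
import pandas as pd

# toy fact table with the list-valued transactions column
cft = pd.DataFrame({'customer_id': [0, 1],
                    'transactions': [[10, 11, 12], [20]]})

# expand every list into transaction_1..transaction_N in a single pass;
# shorter histories are padded with NaN automatically
wide = pd.DataFrame(cft['transactions'].tolist(), index=cft.index)
wide.columns = [f'transaction_{i + 1}' for i in wide.columns]
cft = pd.concat([cft, wide], axis=1)
print(cft.columns.tolist())
```

Working on an explicit `.copy()` of the slice would also silence the warning, but the concat version is additionally much faster when the maximum history length (here up to 46) drives dozens of `apply` passes.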

<ipython-input-82-2f96d4266198>:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df['transaction_' + str(i)] = df['transactions'].apply(lambda x: x[i-1] if len(x) >= i else None)

(the same warning repeats once for each transaction_i column created)


df['transaction_' + str(i)] = df['transactions'].apply(lambda x: x[i-1] if len(x) >= i else None)
<ipython-input-82-2f96d4266198>:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-


df['transaction_' + str(i)] = df['transactions'].apply(lambda x: x[i-1] if len(x) >= i else None)
<ipython-input-82-2f96d4266198>:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-


df['transaction_' + str(i)] = df['transactions'].apply(lambda x: x[i-1] if len(x) >= i else None)
<ipython-input-82-2f96d4266198>:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-


df['transaction_' + str(i)] = df['transactions'].apply(lambda x: x[i-1] if len(x) >= i else None)
<ipython-input-82-2f96d4266198>:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-


df['transaction_' + str(i)] = df['transactions'].apply(lambda x: x[i-1] if len(x) >= i else None)
<ipython-input-82-2f96d4266198>:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-


df['transaction_' + str(i)] = df['transactions'].apply(lambda x: x[i-1] if len(x) >= i else None)
        customer_id  Active  club_member_status  age  postal_code  FN  fashion_news_frequency       transactions  transaction_1
0                 0       0                   0    0            0   0                       0          [0, 1480]              0
1                 1       1                   0    1            1   1                       1         [1, 16907]              1
2                 2       0                   0    2            2   0                       0    [2, 373, 11602]              2
3                 3       0                   0    3            3   0                       0  [3, 28989, 23427]              3
4                 4       0                   0    4            4   0                       0         [4, 11198]              4
...             ...     ...                 ...  ...          ...  ..                     ...                ...            ...
788314       417650       1                   0    0       106992   1                       1            [26848]          26848
788317       417651       1                   0   30       102943   1                       1             [8799]           8799
788319       417652       0                   0   22       219200   0                       0            [37442]          37442
788321       417653       0                   0   48       219201   0                       0            [10377]          10377
788322       417654       0                   0   26       219202   0                       0             [2944]           2944

417655 rows × 54 columns

1 customer_fact_table.drop('transactions', axis = 1, inplace = True)

1 len(customer_fact_table.transaction_2.dropna())

170288

1 len(customer_fact_table)

417655
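The two counts above pin down the share of customers with a recorded second transaction; a quick sketch of the arithmetic (the counts are hard-coded from the cell outputs above):

```python
# Counts taken from the cell outputs above.
customers_with_second_purchase = 170288  # non-null transaction_2 values
customers_total = 417655                 # rows in customer_fact_table

share = customers_with_second_purchase / customers_total * 100
print(f"{share:.2f}% of customers have at least a second transaction")  # 40.77%
```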

keyboard_arrow_down Dropping the Duplicates


1 duplicates = customer_fact_table[customer_fact_table.duplicated(subset=['customer_id'], keep=False)]
2
3 # Display any duplicated customer ids
4 print("Duplicates of the 'customer_id' column:")
5 print(duplicates)

Duplicates of the 'customer_id' column:
Empty DataFrame
Columns: [customer_id, Active, club_member_status, age, postal_code, FN, fashion_news_frequency, transaction_1, transact
Index: []

keyboard_arrow_down 40.77% of the customers have made at least two purchases


1 customer_fact_table

        customer_id  Active  club_member_status  age  postal_code  FN  fashion_news_frequency  transaction_1  transaction_2
0                 0       0                   0    0            0   0                       0              0         1480.0
1                 1       1                   0    1            1   1                       1              1        16907.0
2                 2       0                   0    2            2   0                       0              2          373.0
3                 3       0                   0    3            3   0                       0              3        28989.0
4                 4       0                   0    4            4   0                       0              4        11198.0
...             ...     ...                 ...  ...          ...  ..                     ...            ...            ...
788314       417650       1                   0    0       106992   1                       1          26848            NaN
788317       417651       1                   0   30       102943   1                       1           8799            NaN
788319       417652       0                   0   22       219200   0                       0          37442            NaN
788321       417653       0                   0   48       219201   0                       0          10377            NaN
788322       417654       0                   0   26       219202   0                       0           2944            NaN

417655 rows × 53 columns

keyboard_arrow_down Keeping everything up to the 6th transaction


1 selected_data = customer_fact_table.iloc[:, :13]
2 selected_data

        customer_id  Active  club_member_status  age  postal_code  FN  fashion_news_frequency  transaction_1  transaction_2
0                 0       0                   0    0            0   0                       0              0         1480.0
1                 1       1                   0    1            1   1                       1              1        16907.0
2                 2       0                   0    2            2   0                       0              2          373.0
3                 3       0                   0    3            3   0                       0              3        28989.0
4                 4       0                   0    4            4   0                       0              4        11198.0
...             ...     ...                 ...  ...          ...  ..                     ...            ...            ...
788314       417650       1                   0    0       106992   1                       1          26848            NaN
788317       417651       1                   0   30       102943   1                       1           8799            NaN
788319       417652       0                   0   22       219200   0                       0          37442            NaN
788321       417653       0                   0   48       219201   0                       0          10377            NaN
788322       417654       0                   0   26       219202   0                       0           2944            NaN

417655 rows × 13 columns

1 selected_data = selected_data.fillna(0)

1 selected_data

        customer_id  Active  club_member_status  age  postal_code  FN  fashion_news_frequency  transaction_1  transaction_2
0                 0       0                   0    0            0   0                       0              0         1480.0
1                 1       1                   0    1            1   1                       1              1        16907.0
2                 2       0                   0    2            2   0                       0              2          373.0
3                 3       0                   0    3            3   0                       0              3        28989.0
4                 4       0                   0    4            4   0                       0              4        11198.0
...             ...     ...                 ...  ...          ...  ..                     ...            ...            ...
788314       417650       1                   0    0       106992   1                       1          26848            0.0
788317       417651       1                   0   30       102943   1                       1           8799            0.0
788319       417652       0                   0   22       219200   0                       0          37442            0.0
788321       417653       0                   0   48       219201   0                       0          10377            0.0
788322       417654       0                   0   26       219202   0                       0           2944            0.0

417655 rows × 13 columns

1 #selected_data.to_csv('selected_data.csv', index=False)

keyboard_arrow_down K-Means Algorithm for the selected_data


1 import pandas as pd
2 selected_data = pd.read_csv('/content/selected_data.csv')

1 selected_data

        customer_id  Active  club_member_status  age  postal_code  FN  fashion_news_frequency  transaction_1  transaction_2
0                 0       0                   0    0            0   0                       0              0         1480.0
1                 1       1                   0    1            1   1                       1              1        16907.0
2                 2       0                   0    2            2   0                       0              2          373.0
3                 3       0                   0    3            3   0                       0              3        28989.0
4                 4       0                   0    4            4   0                       0              4        11198.0
...             ...     ...                 ...  ...          ...  ..                     ...            ...            ...
417650       417650       1                   0    0       106992   1                       1          26848            0.0
417651       417651       1                   0   30       102943   1                       1           8799            0.0
417652       417652       0                   0   22       219200   0                       0          37442            0.0
417653       417653       0                   0   48       219201   0                       0          10377            0.0
417654       417654       0                   0   26       219202   0                       0           2944            0.0

417655 rows × 13 columns

keyboard_arrow_down Standard_Scaler
1 from sklearn.preprocessing import StandardScaler
2 scaler = StandardScaler()
3 selected_data = scaler.fit_transform(selected_data)

1 selected_data = pd.DataFrame(selected_data)

1 selected_data

                0         1         2         3         4         5         6         7         8         9        10        11
0       -1.732047 -0.813848 -0.171796 -1.365503 -1.415945 -0.825987 -0.827299 -1.099969 -0.424487 -0.347817 -0.24626 -0.183457 -0.14
1       -1.732038  1.228731 -0.171796 -1.294168 -1.415929  1.210673  1.204452 -1.099900  0.774207 -0.347817 -0.24626 -0.183457 -0.14
2       -1.732030 -0.813848 -0.171796 -1.222834 -1.415913 -0.825987 -0.827299 -1.099830 -0.510502  0.830148 -0.24626 -0.183457 -0.14
3       -1.732022 -0.813848 -0.171796 -1.151499 -1.415897 -0.825987 -0.827299 -1.099760  1.712991  2.030755 -0.24626 -0.183457 -0.14
4       -1.732013 -0.813848 -0.171796 -1.080165 -1.415881 -0.825987 -0.827299 -1.099690  0.330612 -0.347817 -0.24626 -0.183457 -0.14
...           ...       ...       ...       ...       ...       ...       ...       ...       ...       ...      ...       ...
417650   1.732013  1.228731 -0.171796 -1.365503  0.293660  1.210673  1.204452  0.774950 -0.539484 -0.347817 -0.24626 -0.183457 -0.14
417651   1.732022  1.228731 -0.171796  0.774536  0.228961  1.210673  1.204452 -0.485495 -0.539484 -0.347817 -0.24626 -0.183457 -0.14
417652   1.732030 -0.813848 -0.171796  0.203859  2.086610 -0.825987 -0.827299  1.514778 -0.539484 -0.347817 -0.24626 -0.183457 -0.14
417653   1.732038 -0.813848 -0.171796  2.058559  2.086626 -0.825987 -0.827299 -0.375296 -0.539484 -0.347817 -0.24626 -0.183457 -0.14
417654   1.732047 -0.813848 -0.171796  0.489197  2.086642 -0.825987 -0.827299 -0.894376 -0.539484 -0.347817 -0.24626 -0.183457 -0.14

417655 rows × 13 columns

keyboard_arrow_down Split the data into Train and Test


1 from sklearn.model_selection import train_test_split
2 X_train, X_test_unseen = train_test_split(selected_data, test_size=0.2, random_state=42)

1 len(X_train)

334124

1 len(X_test_unseen)

83531
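Note that the scaler above was fit on the full dataset before the split, so the held-out rows leak into the scaling statistics. A leakage-free variant (a sketch, using a synthetic stand-in for the unscaled 13-column `selected_data`) fits the scaler on the training split only and applies it to both splits:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled 13-column selected_data frame.
rng = np.random.default_rng(42)
raw = pd.DataFrame(rng.normal(loc=5.0, scale=3.0, size=(1000, 13)))

train_raw, test_raw = train_test_split(raw, test_size=0.2, random_state=42)

# Fit scaling statistics on the training split only, then apply to both splits.
scaler = StandardScaler().fit(train_raw)
X_train = scaler.transform(train_raw)
X_test_unseen = scaler.transform(test_raw)

print(X_train.shape, X_test_unseen.shape)  # (800, 13) (200, 13)
```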

keyboard_arrow_down K-Means Clustering


1 import pandas as pd
2 import numpy as np
3 import matplotlib.pyplot as plt
4 %matplotlib inline
5 import seaborn as sns
6 pd.options.display.float_format = '{:.2f}'.format
7 import warnings
8 warnings.filterwarnings('ignore')
9 from sklearn.cluster import KMeans
10 from sklearn.metrics import silhouette_score
11 from mpl_toolkits.mplot3d import Axes3D

keyboard_arrow_down Performing PCA


keyboard_arrow_down Checking how many Principal Components are suitable
1 from sklearn.decomposition import PCA
2 import numpy as np
3 pca = PCA(n_components=None)
4 pca.fit(X_train)
5 cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
6 num_components_95 = np.argmax(cumulative_variance_ratio > 0.95) + 1
7 print("Number of components capturing more than 95% of variance:", num_components_95)

Number of components capturing more than 95% of variance: 10

keyboard_arrow_down Doing it for 10 PC


1 import numpy as np
2 import matplotlib.pyplot as plt
3 from mpl_toolkits.mplot3d import Axes3D
4 from sklearn.cluster import KMeans
5 import pandas as pd
6 from sklearn.decomposition import PCA
7 pca = PCA(n_components=10)
8 pca_data = pca.fit_transform(X_train)
9 pca_df = pd.DataFrame(data=pca_data, columns=['PC{}'.format(i) for i in range(1, 11)])
10 pca_df.head()

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10

0 -1.67 -0.87 1.58 -0.96 -0.81 0.22 -1.36 -0.16 0.05 -0.00

1 2.07 -0.46 -0.82 0.08 -0.90 0.04 -1.22 0.18 -0.13 -0.01

2 1.86 -1.11 0.19 -0.57 -0.41 0.72 -1.56 -0.20 0.02 -0.01

3 -1.97 0.49 -1.22 3.23 -0.90 4.61 0.49 -0.08 -0.01 0.00

4 -1.28 0.58 -1.36 0.88 -0.42 -0.89 -0.79 -0.17 -0.01 -0.00

keyboard_arrow_down Plotting an Elbow Graph to check for K


1 import matplotlib.pyplot as plt
2 sse = {}
3 kmax = 20
4 fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
5 plt.subplot(1, 2, 1)
6 for k in range(1, kmax):
7     kmeans = KMeans(n_clusters=k, max_iter=1000).fit(pca_df)
8     sse[k] = kmeans.inertia_


keyboard_arrow_down What is the optimal value for the elbow?


1 sse_values = list(sse.values())
2 sse_diff = np.diff(sse_values)
3 percentage_change = (sse_diff / sse_values[:-1]) * 100
4 elbow_index = np.argmax(percentage_change)
5 plt.plot(list(sse.keys()), list(sse.values()))
6 plt.scatter(elbow_index + 1, sse_values[elbow_index], c='red', label='Elbow Point')
7 plt.title('Elbow Method')
8 plt.xlabel('k: Number of clusters')
9 plt.ylabel('Sum of Squared Error')
10 plt.legend()
11 plt.grid()
12 plt.show()
13 optimal_clusters = elbow_index + 1
14 print(f'Optimal number of clusters: {optimal_clusters}')

Optimal number of clusters: 15
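The percentage-change heuristic above is sensitive to KMeans' random restarts, so a silhouette score is a useful cross-check on k. A minimal sketch on synthetic blobs (standing in for `pca_df`, whose ~334k rows would make full silhouette scoring slow; a subsample would be used in practice):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known structure, standing in for pca_df.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

sil = {}
for k in range(2, 8):  # silhouette is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    sil[k] = silhouette_score(X, labels)

best_k = max(sil, key=sil.get)  # k with the highest mean silhouette
print(best_k, round(sil[best_k], 3))
```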

keyboard_arrow_down Taking K as 15
1 from sklearn.cluster import KMeans
2 kmeans = KMeans(n_clusters=15)
3 cluster_labels = kmeans.fit_predict(pca_df)
4 print(cluster_labels)

[ 0 12 13 ... 5 14 11]
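Since the point of holding out `X_test_unseen` is to handle unseen customers, the fitted PCA and K-Means objects can assign clusters to new rows without refitting. A self-contained sketch (synthetic 13-column data stands in for the scaled splits; in the notebook one would reuse the fitted `pca` and `kmeans` on `X_test_unseen` directly):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 13))        # stands in for the scaled training split
X_test_unseen = rng.normal(size=(100, 13))  # stands in for the held-out split

# Fit PCA and K-Means on training rows only...
pca = PCA(n_components=10).fit(X_train)
kmeans = KMeans(n_clusters=15, n_init=10, random_state=42).fit(pca.transform(X_train))

# ...then project and assign unseen rows with transform/predict, never fit.
unseen_labels = kmeans.predict(pca.transform(X_test_unseen))
print(unseen_labels.shape)  # (100,)
```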

keyboard_arrow_down Countplot for the cluster labels


1 import matplotlib.pyplot as plt
2 import numpy as np
3 unique_labels, label_counts = np.unique(cluster_labels, return_counts=True)
4 plt.bar(unique_labels, label_counts)
5 plt.xlabel('Cluster Label')
6 plt.ylabel('Frequency')
7 plt.title('Frequency Plot of Cluster Labels')
8 plt.xticks(unique_labels)
9 plt.show()

1 pca_df['clusters'] = cluster_labels
2 #pca_df.to_csv('Cust_with_trans.csv')

keyboard_arrow_down Building simple model pipelines


1 X = pca_df.drop('clusters', axis = 1)
2 y = pca_df['clusters']

keyboard_arrow_down Without CV
1 from sklearn.model_selection import train_test_split
2 from sklearn.ensemble import RandomForestClassifier
3 from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
4
5 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
6
7 rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
8 rf_classifier.fit(X_train, y_train)
9
10 y_pred = rf_classifier.predict(X_test)
11
12 accuracy = accuracy_score(y_test, y_pred)
13 precision = precision_score(y_test, y_pred, average='weighted')
14 recall = recall_score(y_test, y_pred, average='weighted')
15 f1 = f1_score(y_test, y_pred, average='weighted')
16 conf_matrix = confusion_matrix(y_test, y_pred)
17
18 print("Accuracy:", accuracy)
19 print("Precision:", precision)
20 print("Recall:", recall)
21 print("F1 Score:", f1)
22 print("Confusion Matrix:")
23 print(conf_matrix)
24

Accuracy: 0.982581369248036
Precision: 0.9826128813113342
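A cross-validated estimate would complement the single hold-out split above. A sketch on synthetic multi-class data (standing in for the PCA features `X` and cluster labels `y` used in this notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 10-feature, 5-class data standing in for (X, y) above.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=8,
                           n_classes=5, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')  # 5-fold accuracy
print(scores.round(3), scores.mean().round(3))
```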
