Task 1 - Data Preparation and Customer Analytics - Jupyter Notebook
Task 1 - Data Preparation and Customer Analytics - Jupyter Notebook
Task 1 :-
♦ I am part of Quantium’s retail analytics team and have been approached by our client, the Categor
y Manager for Chips, who wants to better understand the types of customers who purchase Chips and their
purchasing behaviour within the region.
♦ The insights from my analysis will feed into the supermarket’s strategic plan for the chip catego
ry in the next half year.
Here is task :-
♦ I need to present a strategic recommendation to Julia that is supported by data which she can then
use for the upcoming category review however to do so I need to analyse the data to understand the curr
ent purchasing trends and behaviours. The client is particularly interested in customer segments and th
eir chip purchasing behaviour. Consider what metrics would help describe the customers’ purchasing beha
viour.
• Examine transaction data - check for missing data, anomalies, outliers and clean them
• Examine customer data - similar to above transaction data
• Data analysis and customer segments - create charts and graphs, note trends and insights
• Deep dive into customer segments - determine which segments should be targetted
Importing Dataset
Out[2]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER
Out[3]:
DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY TOT_SALES
2 43605 1 1343 383 61 Smiths Crinkle Cut Chips Chicken 170g 2 2.9
4 43330 2 2426 1038 108 Kettle Tortilla ChpsHny&Jlpno Chili 150g 3 13.8
Data Exploration
In [4]: # Basic Information of dataset(QVI_purchase_behaviour)
purchase_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LYLTY_CARD_NBR 72637 non-null int64
1 LIFESTAGE 72637 non-null object
2 PREMIUM_CUSTOMER 72637 non-null object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE 264836 non-null int64
1 STORE_NBR 264836 non-null int64
2 LYLTY_CARD_NBR 264836 non-null int64
3 TXN_ID 264836 non-null int64
4 PROD_NBR 264836 non-null int64
5 PROD_NAME 264836 non-null object
6 PROD_QTY 264836 non-null int64
7 TOT_SALES 264836 non-null float64
dtypes: float64(1), int64(6), object(1)
memory usage: 16.2+ MB
In [6]: # Statistical Summary of QVI_purchase_behaviour data
purchase_data.describe().T
Out[6]:
count mean std min 25% 50% 75% max
Out[7]:
count mean std min 25% 50% 75% max
In [9]: purchase_data.isnull().sum()
Out[9]: LYLTY_CARD_NBR 0
LIFESTAGE 0
PREMIUM_CUSTOMER 0
dtype: int64
In [10]: ### Checking missing values of QVI_transaction_data
sns.heatmap(transaction_data.isnull())
plt.show()
In [11]: transaction_data.isnull().sum()
Out[11]: DATE 0
STORE_NBR 0
LYLTY_CARD_NBR 0
TXN_ID 0
PROD_NBR 0
PROD_NAME 0
PROD_QTY 0
TOT_SALES 0
dtype: int64
Out[12]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER DATE STORE_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY
Natural Chip
YOUNG
0 1000 Premium 43390 1 1 5 Compny 2
SINGLES/COUPLES
SeaSalt175g
Smiths Crinkle
MIDAGE
2 1343 Budget 43605 1 383 61 Cut Chips 2
SINGLES/COUPLES
Chicken 170g
Smiths Chip
MIDAGE Thinly
3 2373 Budget 43329 2 974 69 5
SINGLES/COUPLES S/Cream&Onion
175g
Kettle Tortilla
MIDAGE
4 2426 Budget 43330 2 1038 108 ChpsHny&Jlpno 3
SINGLES/COUPLES
Chili 150g
•♦• We can see "DATE" column is not in proper format, so we will change it.
In [13]: print(len(merged_data))
print(len(transaction_data))
264836
264836
In [14]: ### Basic Information of merged_data
merged_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 264836 entries, 0 to 264835
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LYLTY_CARD_NBR 264836 non-null int64
1 LIFESTAGE 264836 non-null object
2 PREMIUM_CUSTOMER 264836 non-null object
3 DATE 264836 non-null int64
4 STORE_NBR 264836 non-null int64
5 TXN_ID 264836 non-null int64
6 PROD_NBR 264836 non-null int64
7 PROD_NAME 264836 non-null object
8 PROD_QTY 264836 non-null int64
9 TOT_SALES 264836 non-null float64
dtypes: float64(1), int64(6), object(3)
memory usage: 22.2+ MB
Date column is not in proper format. so, date column should be datetime format
datetime64[ns]
Analyzing the product name column (PROD_NAME) to make sure all items are chips
In [17]: merged_data['PROD_NAME'].unique()
In [19]: word_counts = {}
def count_words(line):
for word in line:
if word not in word_counts:
word_counts[word] = 1
else:
word_counts[word] += 1
split_prods.apply(lambda line: count_words(line))
print(pd.Series(word_counts).sort_values(ascending = False))
Chips 49770
Kettle 41288
Smiths 28860
Salt 27976
Cheese 27890
...
Sunbites 1432
Pc 1431
Garden 1419
NCC 1419
Fries 1418
Length: 198, dtype: int64
In [20]: print("\n ----- Statistical Summary of Merged Data ----- \n")
print(merged_data.describe())
print("\n ----- Basic Information of Merged Data ----- \n")
print(merged_data.info())
PROD_QTY TOT_SALES
count 264836.000000 264836.000000
mean 1.907309 7.304200
std 0.643654 3.083226
min 1.000000 1.500000
25% 2.000000 5.400000
50% 2.000000 7.400000
75% 2.000000 9.200000
max 200.000000 650.000000
<class 'pandas.core.frame.DataFrame'>
Int64Index: 264836 entries, 0 to 264835
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LYLTY_CARD_NBR 264836 non-null int64
1 LIFESTAGE 264836 non-null object
2 PREMIUM_CUSTOMER 264836 non-null object
3 DATE 264836 non-null datetime64[ns]
4 STORE_NBR 264836 non-null int64
5 TXN_ID 264836 non-null int64
6 PROD_NBR 264836 non-null int64
7 PROD_NAME 264836 non-null object
8 PROD_QTY 264836 non-null int64
9 TOT_SALES 264836 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(3)
memory usage: 22.2+ MB
None
In [21]: merged_data["PROD_QTY"].value_counts(bins=4).sort_index()
Out[22]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER DATE STORE_NBR TXN_ID PROD_NBR PROD_NAME PROD_QT
Dorito Corn
2018-
69762 226000 OLDER FAMILIES Premium 226 226201 4 Chp Supreme 20
08-19
380g
Dorito Corn
2019-
69763 226000 OLDER FAMILIES Premium 226 226210 4 Chp Supreme 20
05-20
380g
Pringles
2019-
217237 201060 YOUNG FAMILIES Premium 201 200202 26 Sweet&Spcy
05-18
BBQ 134g
Pringles
YOUNG 2018-
238333 219004 Mainstream 219 218018 25 SourCream
SINGLES/COUPLES 08-14
Onion 134g
Infuzions BBQ
YOUNG 2019- Rib Prawn
238471 261331 Mainstream 261 261111 87
SINGLES/COUPLES 05-19 Crackers
110g
♦ Two outliers of value 200 in PROD_QTY will be removed. Both entries are by the same customer and
will be examined by this customer's transactions.
In [24]: len(merged_data[merged_data["LYLTY_CARD_NBR"]==226000])
Out[24]: 0
In [25]: merged_data["DATE"].describe()
♦ There are 365 days in a year but in the DATE column there are only 364 unique values so one is missing.
In [26]: pd.date_range(start=merged_data["DATE"].min(),
end=merged_data["DATE"].max()).difference(merged_data["DATE"])
♦ Using the difference method we see that 2018-12-25 was a missing date
Out[29]: 2018-12-25 1
2018-11-25 648
2018-10-18 658
2019-06-13 659
2019-06-24 662
Name: DATE, dtype: int64
The day with no transaction is a Christmas Day (25th December). That is when the store is closed. So there is no anomaly in
this.
count 258770.000000
mean 182.324276
std 64.955035
min 70.000000
25% 150.000000
50% 170.000000
75% 175.000000
max 380.000000
Name: 0, dtype: float64
175.0 64929
150.0 41633
134.0 25102
110.0 22387
170.0 19983
165.0 15297
300.0 15166
330.0 12540
380.0 6416
270.0 6285
200.0 4473
135.0 3257
250.0 3169
210.0 3167
90.0 3008
190.0 2995
160.0 2970
220.0 1564
70.0 1507
180.0 1468
125.0 1454
Name: 0, dtype: int64
♦ Some product names are written in more than one way. Example : Dorito and Doritos, Grains and GrnWv
es, Infusions and Ifzns, Natural and NCC, Red and RRD, Smith and Smiths and Snbts and Sunbites.
In [32]: merged_data["PROD_NAME"].str.split()[merged_data["PROD_NAME"].str.split().str[0] == "Red"].value_counts()
In [38]: merged_data.isnull().sum()
Out[38]: LYLTY_CARD_NBR 0
LIFESTAGE 0
PREMIUM_CUSTOMER 0
DATE 0
STORE_NBR 0
TXN_ID 0
PROD_NBR 0
PROD_NAME 0
PROD_QTY 0
TOT_SALES 0
Cleaned_Brand_Names 0
dtype: int64
Questions :-
♦ Who spends the most on chips (total sales), describing customers by lifestage and how premium the
ir general purchasing behaviour is ?
♦ How many customers are in each segment ?
♦ How many chips are bought per customer by segment ?
♦ What is the average chip price by customer segment ?
In [39]: grouped_sales = pd.DataFrame(merged_data.groupby(["LIFESTAGE", "PREMIUM_CUSTOMER"])["TOT_SALES"].agg(["sum", "me
grouped_sales.sort_values(ascending=False, by="sum")
Out[39]:
sum mean
LIFESTAGE PREMIUM_CUSTOMER
Out[40]: 1933115.0000000002
# Custom X axis
plt.yticks(r, names)
plt.ylabel("LIFESTAGE")
plt.xlabel("TOTAL SALES")
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.title("Total Sales per Lifestage")
plt.savefig("lifestage_sales.png", bbox_inches="tight")
# Show graphic
plt.show()
In [43]: stage_agg_prem = merged_data.groupby("LIFESTAGE")["PREMIUM_CUSTOMER"].agg(pd.Series.mode).sort_values()
print("\n ----- Top contributor per LIFESTAGE by PREMIUM category ----- \n")
print(stage_agg_prem)
LIFESTAGE
NEW FAMILIES Budget
OLDER FAMILIES Budget
OLDER SINGLES/COUPLES Budget
YOUNG FAMILIES Budget
MIDAGE SINGLES/COUPLES Mainstream
RETIREES Mainstream
YOUNG SINGLES/COUPLES Mainstream
Name: PREMIUM_CUSTOMER, dtype: object
Out[44]:
LYLTY_CARD_NBR
LIFESTAGE PREMIUM_CUSTOMER
Budget 4929
Premium 4750
Mainstream 849
Premium 588
In [45]: unique_cust.sort_values().plot.barh(figsize=(15,8), color='darkgoldenrod')
plt.grid(color='olive', linestyle='--')
plt.show()
In [46]: # Values of each group
ncust_bars1 = unique_cust[unique_cust.index.get_level_values("PREMIUM_CUSTOMER") == "Budget"]
ncust_bars2 = unique_cust[unique_cust.index.get_level_values("PREMIUM_CUSTOMER") == "Mainstream"]
ncust_bars3 = unique_cust[unique_cust.index.get_level_values("PREMIUM_CUSTOMER") == "Premium"]
# Custom X axis
plt.yticks(r, names)
plt.ylabel("Lifestage", fontsize=15, fontweight='bold', color='darkgoldenrod')
plt.xlabel("Unique Customers", fontsize=15, fontweight='bold', color='darkgoldenrod')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.title("Unique Customers per Lifestage", fontsize=20, fontweight='bold', color='darkgoldenrod')
plt.savefig("lifestage_customers.png", bbox_inches="tight")
# View
plt.show()
The high sales amount by segment "Young Singles/Couples - Mainstream" and "Retirees - Mainstream" are due to their large
number of unique customers, but not for the "Older - Budget" segment. Next we'll analyze if the "Older - Budget" segment
has:
High Frequency of Purchase and Average Sales per Customer compared to the other segment.
In [47]: freq_per_cust = merged_data.groupby(["LYLTY_CARD_NBR", "LIFESTAGE", "PREMIUM_CUSTOMER"]).count()["DATE"]
freq_per_cust.groupby(["LIFESTAGE", "PREMIUM_CUSTOMER"]).agg(["mean", "count"]).sort_values(ascending=False, by=
Out[47]:
mean count
LIFESTAGE PREMIUM_CUSTOMER
•♦• The above table describes the "Average frequency of Purchase per segment" and "Unique custom
er per segment". The top three most frequent purchase is contributed by the "Older Families" lifestage
segment. We can see now that the "Older - Budget" segment contributes to high sales partly because of
the combination of:
High Frequency of Purchase and, Fairly high unique number of customer in the segment
In [48]: grouped_sales.sort_values(ascending=False, by="mean")
Out[48]:
sum mean
LIFESTAGE PREMIUM_CUSTOMER
•♦• Highest average spending per purchase are contributed by the Midage and Young "Singles/Couple
s". The difference between their Mainstream and Non-Mainstream group might seem insignificant (7.6 vs
6.6), but we'll find out by examining if the difference is statistically significant.
1.8542040107534844e-281
Out[49]: True
•♦• P-Value is close to 0. There is a statistically significant difference to the Total Sales betwee
n the "Mainstream Young Midage" segment to the "Budget and Premium Young Midage" segment.
Next, let's look examine what brand of chips the top 3 segments contributing to Total Sales are buying.
In [50]: merged_data.groupby(["LIFESTAGE", "PREMIUM_CUSTOMER"])["Cleaned_Brand_Names"].agg(pd.Series.mode).sort_values()
Kettle 838
Smiths 826
Doritos 570
Name: Cleaned_Brand_Names, dtype: int64
Smiths 1245
Kettle 1211
Doritos 899
Name: Cleaned_Brand_Names, dtype: int64
Kettle 1206
Smiths 986
Doritos 837
Name: Cleaned_Brand_Names, dtype: int64
Kettle 713
Smiths 633
Doritos 533
Name: Cleaned_Brand_Names, dtype: int64
---------- MIDAGE SINGLES/COUPLES - Mainstream ----------
Kettle 2136
Smiths 1337
Doritos 1291
Name: Cleaned_Brand_Names, dtype: int64
Kettle 247
Doritos 167
Pringles 165
Name: Cleaned_Brand_Names, dtype: int64
Kettle 414
Doritos 274
Smiths 254
Name: Cleaned_Brand_Names, dtype: int64
Smiths 1515
Kettle 1512
Doritos 1065
Name: Cleaned_Brand_Names, dtype: int64
---------- OLDER FAMILIES - Budget ----------
Kettle 3320
Smiths 3093
Doritos 2351
Name: Cleaned_Brand_Names, dtype: int64
Kettle 2019
Smiths 1835
Doritos 1449
Name: Cleaned_Brand_Names, dtype: int64
Kettle 2947
Smiths 2042
Doritos 1958
Name: Cleaned_Brand_Names, dtype: int64
---------- OLDER SINGLES/COUPLES - Budget ----------
Kettle 3065
Smiths 2098
Doritos 1954
Name: Cleaned_Brand_Names, dtype: int64
Kettle 2835
Smiths 2180
Doritos 2008
Name: Cleaned_Brand_Names, dtype: int64
Kettle 2216
Smiths 1458
Doritos 1409
Name: Cleaned_Brand_Names, dtype: int64
Kettle 2592
Doritos 1742
Smiths 1679
Name: Cleaned_Brand_Names, dtype: int64
Kettle 3386
Smiths 2476
Doritos 2320
Name: Cleaned_Brand_Names, dtype: int64
---------- YOUNG FAMILIES - Premium ----------
Kettle 1745
Smiths 1442
Doritos 1129
Name: Cleaned_Brand_Names, dtype: int64
Kettle 2743
Smiths 2459
Doritos 1996
Name: Cleaned_Brand_Names, dtype: int64
Kettle 1789
Smiths 1772
Doritos 1309
Name: Cleaned_Brand_Names, dtype: int64
•♦• Every segment had Kettle as the most purchased brand. Every segment except "YOUNG SINGLES/COUPLES M
ainstream" had Smiths as their second most purchased brand. "YOUNG SINGLES/COUPLES Mainstream" had Dori
tos as their second most purchased brand.
In [52]: from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
temp = merged_data.reset_index().rename(columns = {"index": "transaction"})
temp["Segment"] = temp["LIFESTAGE"] + ' - ' + temp['PREMIUM_CUSTOMER']
segment_brand_encode = pd.concat([pd.get_dummies(temp["Segment"]), pd.get_dummies(temp["Cleaned_Brand_Names"])],
frequent_sets = apriori(segment_brand_encode, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_sets, metric="lift", min_threshold=1)
set_temp = temp["Segment"].unique()
rules[rules["antecedents"].apply(lambda x: list(x)).apply(lambda x: x in set_temp)]
C:\Users\Admin\AppData\Local\Programs\Python\Python310\lib\site-packages\mlxtend\frequent_patterns\fpcommon.p
y:111: DeprecationWarning: DataFrames with non-bool types result in worse computationalperformance and their s
upport might be discontinued in the future.Please use a DataFrame with bool type
warnings.warn(
Out[52]:
antecedent consequent
antecedents consequents support confidence lift leverage conviction
support support
1 (OLDER FAMILIES - Budget) (Smiths) 0.087451 0.120162 0.011679 0.133549 1.111409 0.001171 1.015451
(OLDER SINGLES/COUPLES -
3 (Kettle) 0.069504 0.155901 0.011573 0.166513 1.068064 0.000738 1.012731
Budget)
(OLDER SINGLES/COUPLES -
5 (Kettle) 0.067038 0.155901 0.011128 0.165991 1.064716 0.000676 1.012097
Premium)
7 (RETIREES - Mainstream) (Kettle) 0.081055 0.155901 0.012785 0.157738 1.011779 0.000149 1.002180
(YOUNG SINGLES/COUPLES -
8 (Kettle) 0.078744 0.155901 0.014515 0.184329 1.182344 0.002239 1.034852
Mainstream)
•♦• By looking at our a-priori analysis, we can conclude that Kettle is the brand of choice for mos
t segment.
Next, we'll find out the pack size preferences of different segments
In [53]: merged_pack = pd.concat([merged_data, pack_sizes.rename("Pack_Size")], axis=1)
for stage in merged_data["LIFESTAGE"].unique():
for prem in merged_data["PREMIUM_CUSTOMER"].unique():
print("----------",stage, '-', prem,"----------\n")
summary = merged_pack[(merged_pack["LIFESTAGE"] == stage)
& (merged_pack["PREMIUM_CUSTOMER"] == prem)]["Pack_Size"].value_counts().head(3).s
print(summary)
plt.figure()
summary.plot.barh(figsize=(6,2), color='olive')
plt.show()
134.0 537
150.0 961
175.0 1587
Name: Pack_Size, dtype: int64
134.0 832
150.0 1439
175.0 2262
Name: Pack_Size, dtype: int64
---------- YOUNG SINGLES/COUPLES - Mainstream ----------
134.0 2315
150.0 3159
175.0 4928
Name: Pack_Size, dtype: int64
134.0 781
150.0 1285
175.0 2034
Name: Pack_Size, dtype: int64
---------- MIDAGE SINGLES/COUPLES - Budget ----------
134.0 449
150.0 821
175.0 1256
Name: Pack_Size, dtype: int64
134.0 1159
150.0 1819
175.0 2912
Name: Pack_Size, dtype: int64
134.0 165
150.0 245
175.0 371
Name: Pack_Size, dtype: int64
134.0 309
150.0 448
175.0 763
Name: Pack_Size, dtype: int64
134.0 224
150.0 384
175.0 579
Name: Pack_Size, dtype: int64
---------- OLDER FAMILIES - Premium ----------
134.0 1014
150.0 1750
175.0 2747
Name: Pack_Size, dtype: int64
134.0 1996
150.0 3708
175.0 5662
Name: Pack_Size, dtype: int64
---------- OLDER FAMILIES - Mainstream ----------
134.0 1234
150.0 2261
175.0 3489
Name: Pack_Size, dtype: int64
134.0 1744
150.0 2854
175.0 4382
Name: Pack_Size, dtype: int64
---------- OLDER SINGLES/COUPLES - Budget ----------
134.0 1843
150.0 2899
175.0 4535
Name: Pack_Size, dtype: int64
134.0 1720
150.0 2875
175.0 4422
Name: Pack_Size, dtype: int64
134.0 1331
150.0 2015
175.0 3232
Name: Pack_Size, dtype: int64
134.0 1517
150.0 2381
175.0 3768
Name: Pack_Size, dtype: int64
134.0 2103
150.0 3415
175.0 5187
Name: Pack_Size, dtype: int64
---------- YOUNG FAMILIES - Premium ----------
134.0 1007
150.0 1832
175.0 2926
Name: Pack_Size, dtype: int64
134.0 1674
150.0 2981
175.0 4800
Name: Pack_Size, dtype: int64
---------- YOUNG FAMILIES - Mainstream ----------
134.0 1148
150.0 2101
175.0 3087
Name: Pack_Size, dtype: int64
In [54]: (temp.groupby(["LIFESTAGE", "PREMIUM_CUSTOMER"])["PROD_QTY"].sum()
/ temp.groupby(["LIFESTAGE", "PREMIUM_CUSTOMER"])["LYLTY_CARD_NBR"].nunique()).sort_values(ascending=False)
Out[56]: Segment
YOUNG SINGLES/COUPLES - Mainstream 4.071485
MIDAGE SINGLES/COUPLES - Mainstream 4.000101
RETIREES - Budget 3.924883
RETIREES - Premium 3.921323
NEW FAMILIES - Budget 3.919251
NEW FAMILIES - Mainstream 3.916581
OLDER SINGLES/COUPLES - Premium 3.887220
OLDER SINGLES/COUPLES - Budget 3.877022
NEW FAMILIES - Premium 3.871743
RETIREES - Mainstream 3.833343
OLDER SINGLES/COUPLES - Mainstream 3.803800
YOUNG FAMILIES - Budget 3.753659
MIDAGE SINGLES/COUPLES - Premium 3.752915
YOUNG FAMILIES - Premium 3.752402
OLDER FAMILIES - Budget 3.733344
MIDAGE SINGLES/COUPLES - Budget 3.728496
OLDER FAMILIES - Mainstream 3.727383
YOUNG FAMILIES - Mainstream 3.707097
OLDER FAMILIES - Premium 3.704625
YOUNG SINGLES/COUPLES - Premium 3.645518
YOUNG SINGLES/COUPLES - Budget 3.637681
Name: Unit_Price, dtype: float64
In [57]: temp.groupby(["LIFESTAGE", "PREMIUM_CUSTOMER"]).mean()["Unit_Price"].unstack().plot.bar(figsize=(15,4), rot=0)
plt.xlabel("Lifestage", fontsize=15, fontweight='bold', color='darkorange')
plt.legend(loc="center left", bbox_to_anchor=(1,0.5))
plt.show()
In [58]: z = temp.groupby(["Segment", "Cleaned_Brand_Names"]).sum()["TOT_SALES"].sort_values(ascending=False).reset_index
z[z["Segment"] == "YOUNG SINGLES/COUPLES - Mainstream"]
Out[58]:
Segment Cleaned_Brand_Names TOT_SALES
•♦• Young Singles/Couples (Mainstream) has the highest population, followed by Retirees (Mainstream). W
hich explains their high total sales.
•♦• Despite Older Families not having the highest population, they have the highest frequency of purcha
se, which contributes to their high total sales.
•♦• Older Families followed by Young Families has the highest average quantity of chips bought per purc
hase.
•♦• The Mainstream category of the "Young and Midage Singles/Couples" have the highest spending of chip
s per purchase. And the difference to the non-Mainstream "Young and Midage Singles/Couples" are statist
ically significant.
•♦• Chips brand Kettle is dominating every segment as the most purchased brand.
•♦• Observing the 2nd most purchased brand, "Young and Midage Singles/Couples" is the only segment with
a different preference (Doritos) as compared to others' (Smiths).
•♦• Most frequent chip size purchased is 175gr followed by the 150gr chip size for all segments.
Future Recommendations :-
•♦• Older Families: Focus on the Budget segment. Strength: Frequent purchase. We can give promotion
s that encourages more frequency of purchase. Strength: High quantity of chips purchased per visit. We
can give promotions that encourage them to buy more quantity of chips per purchase.
•♦• Young Singles/Couples: Focus on the Mainstream segment. This segment is the only segment that h
ad Doritos as their 2nd most purchased brand (after Kettle). To specifically target this segment it mig
ht be a good idea to collaborate with Doritos merchant to do some branding promotion catered to "Young
Singles/Couples - Mainstream" segment. Strength: Population quantity. We can spend more effort on maki
ng sure our promotions reach them, and it reaches them frequently.
•♦• Retirees: Focus on the Mainstream segment. Strength: Population quantity. Again, since their po
pulation quantity is the contributor to the high total sales, we should spend more effort on making sur
e our promotions reaches as many of them as possible and frequent.
•♦• General: All segments has Kettle as the most frequently purchased brand and 175gr (regardless o
f brand) followed by 150gr as the preferred chip size. When promoting chips in general to all segments
it is good to take advantage of these two points.
•♦•♦•♦•