Walmart Solution PDF
[ ]: !gdown https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/293/original/walmart_data.csv?1641285094
Downloading…
From: https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/293/original/walmart_data.csv?1641285094
To: /content/walmart_data.csv?1641285094
100% 23.0M/23.0M [00:00<00:00, 87.3MB/s]
1. Exploratory Data Analysis
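[ ]: # loading the dataset (note: gdown above saved it as 'walmart_data.csv?1641285094')
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('walmart_data.csv')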
[ ]: df.head()
[ ]: df.tail()
[ ]: df.shape
[ ]: (550068, 10)
[ ]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 550068 non-null int64
1 Product_ID 550068 non-null object
2 Gender 550068 non-null object
3 Age 550068 non-null object
4 Occupation 550068 non-null int64
5 City_Category 550068 non-null object
6 Stay_In_Current_City_Years 550068 non-null object
7 Marital_Status 550068 non-null int64
8 Product_Category 550068 non-null int64
9 Purchase 550068 non-null int64
dtypes: int64(5), object(5)
memory usage: 42.0+ MB
Insights:
From the above analysis, it is clear that the data has a total of 10 features, with a mix of
numeric and alphanumeric values.
Apart from the Purchase column, all the other columns are categorical in nature. We will change the
datatypes of all such columns to category.
Changing the Datatype of Columns:
[ ]: for i in df.columns[:-1]:
    df[i] = df[i].astype('category')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 550068 non-null category
1 Product_ID 550068 non-null category
2 Gender 550068 non-null category
3 Age 550068 non-null category
4 Occupation 550068 non-null category
5 City_Category 550068 non-null category
6 Stay_In_Current_City_Years 550068 non-null category
7 Marital_Status 550068 non-null category
8 Product_Category 550068 non-null category
9 Purchase 550068 non-null int64
dtypes: category(9), int64(1)
memory usage: 10.3 MB
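The drop from 42.0+ MB to 10.3 MB follows from how the category dtype works: each distinct value is stored once, and the rows hold compact integer codes. A minimal sketch on a synthetic column (the data and the name `toy` are illustrative assumptions, not the Walmart file):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a low-cardinality text column (assumed data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"City_Category": rng.choice(["A", "B", "C"], size=100_000)})

# deep=True also counts the string payload, not just the pointers.
bytes_before = toy["City_Category"].memory_usage(deep=True)
bytes_after = toy["City_Category"].astype("category").memory_usage(deep=True)

print(f"before:   {bytes_before / 1e6:.2f} MB")
print(f"category: {bytes_after / 1e6:.2f} MB")
```

The saving is largest for columns with few distinct values; a column such as User_ID, with thousands of categories, benefits less.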
2. Statistical Summary:
a. Statistical summary of category type columns:
[ ]: df.describe(include = 'category')
Insights:
1. User_ID - Among 550,068 transactions there are 5,891 unique user IDs, indicating that the same customers bought multiple products.
2. Product_ID - Among 550,068 transactions there are 3,631 unique products. Product P00265242 is the highest seller, with a maximum of 1,880 transactions.
3. Gender - Out of 550,068 transactions, 414,259 (nearly 75%) were made by male customers, indicating a significant disparity in purchase behavior between males and females during the Black Friday event.
4. Age - There are 7 unique age groups in the dataset. The 26-35 age group has the most transactions (219,587). We will analyse this feature in detail later.
5. Stay_In_Current_City_Years - Customers with 1 year of stay in the current city account for the most transactions (193,821) among all stay durations (0, 1, 2, 3, 4+ years).
6. Marital_Status - 59% of the total transactions were made by unmarried customers and 41% by married customers.
b. Statistical summary of numerical columns:
[ ]: df.describe()
[ ]: Purchase
count 550068.000000
mean 9263.968713
std 5023.065394
min 12.000000
25% 5823.000000
50% 8047.000000
75% 12054.000000
max 23961.000000
c. Duplicate Detection:
[ ]: df.duplicated().value_counts()
[ ]: False 550068
Name: count, dtype: int64
'P0099742', 'P0099842', 'P0099942']
----------------------------------------------------------------------
Unique Values in Gender column are :-
['F', 'M']
Categories (2, object): ['F', 'M']
----------------------------------------------------------------------
Unique Values in Age column are :-
['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']
Categories (7, object): ['0-17', '18-25', '26-35', '36-45', '46-50', '51-55',
'55+']
----------------------------------------------------------------------
Unique Values in Occupation column are :-
[10, 16, 15, 7, 20, …, 18, 5, 14, 13, 6]
Length: 21
Categories (21, int64): [0, 1, 2, 3, …, 17, 18, 19, 20]
----------------------------------------------------------------------
Unique Values in City_Category column are :-
['A', 'C', 'B']
Categories (3, object): ['A', 'B', 'C']
----------------------------------------------------------------------
Unique Values in Stay_In_Current_City_Years column are :-
['2', '4+', '3', '1', '0']
Categories (5, object): ['0', '1', '2', '3', '4+']
----------------------------------------------------------------------
Unique Values in Marital_Status column are :-
[0, 1]
Categories (2, int64): [0, 1]
----------------------------------------------------------------------
Unique Values in Product_Category column are :-
[3, 1, 12, 8, 5, …, 10, 17, 9, 20, 19]
Length: 20
Categories (20, int64): [1, 2, 3, 4, …, 17, 18, 19, 20]
----------------------------------------------------------------------
Unique Values in Purchase column are :-
[ 8370 15200 1422 … 135 123 613]
----------------------------------------------------------------------
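The cell that produced the listing above is not shown. Output of this shape is consistent with a simple loop over the columns; the sketch below is a reconstruction under that assumption, using a hypothetical helper `unique_report` and a toy frame rather than the original code.

```python
import pandas as pd

def unique_report(frame: pd.DataFrame) -> str:
    """Build a per-column listing of unique values, separated by rule lines."""
    lines = []
    for col in frame.columns:
        lines.append(f"Unique Values in {col} column are :-")
        lines.append(str(frame[col].unique()))
        lines.append("-" * 70)
    return "\n".join(lines)

# Toy frame (assumed data) standing in for df.
toy = pd.DataFrame({"Gender": ["F", "M", "M"], "Marital_Status": [0, 1, 0]})
print(unique_report(toy))
```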
Insights:
The dataset does not contain any abnormal values.
We will convert the values 0 and 1 in the Marital_Status column to 'Unmarried' and 'Married' respectively.
[ ]: #replacing the values in marital_status column
df['Marital_Status'] = df['Marital_Status'].replace({0:'Unmarried',1:'Married'})
df['Marital_Status'].unique()
[ ]: ['Unmarried', 'Married']
Categories (2, object): ['Unmarried', 'Married']
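Because Marital_Status is already a category column at this point, the same relabelling can also be done on the categories themselves. Depending on the pandas version, `replace` on a categorical column may emit a deprecation warning, so the following is a hedged alternative on a toy series (assumed data), not a change to the notebook's result:

```python
import pandas as pd

# Toy categorical column (assumed data) mirroring the 0/1 coding of Marital_Status.
status = pd.Series([0, 1, 0, 0, 1], dtype="category")

# rename_categories relabels the categories without touching the underlying codes.
labelled = status.cat.rename_categories({0: "Unmarried", 1: "Married"})

print(labelled.tolist())
print(list(labelled.cat.categories))
```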
d. Missing value Analysis
[ ]: df.isnull().sum()
[ ]: User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category 0
Purchase 0
dtype: int64
ax0 = fig.add_subplot(gs[0,0])
ax0.hist(df['Purchase'],color= '#5C8374',linewidth=0.5,edgecolor='black',bins = 20)
ax1 = fig.add_subplot(gs[1,0])
boxplot = ax1.boxplot(x = df['Purchase'],vert = False,patch_artist = True,widths = 0.5)
# Customize median line
boxplot['medians'][0].set(color='red')
# Customize outlier markers
for flier in boxplot['fliers']:
    flier.set(marker='o', markersize=8, markerfacecolor= "#4b4b4c")
Calculating the Number of Outliers:
As seen above, any purchase amount over 21399 is treated as an outlier. We count the number of
outliers below.
[ ]: len(df.loc[df['Purchase'] > 21399,'Purchase'])
[ ]: 2677
Insights:
Outliers:
There are a total of 2,677 outliers, roughly 0.49% of all purchase amounts. We will not remove
them, since they reflect the broad range of spending behaviors during the
sale and highlight the importance of tailoring marketing strategies to both regular and high-value
customers to maximize revenue.
Distribution:
The data suggests that the middle half of the transactions fall between 5,823 USD and 12,054 USD, with
a median purchase amount of 8,047 USD. The lower limit of 12 USD and the upper limit of
21,399 USD reveal significant variability in customer spending.
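The 21,399 threshold can be related to the quartiles reported by `df.describe()`. A minimal sketch, assuming the boxplot uses the usual 1.5 × IQR whisker rule (the matplotlib default); `iqr_upper_fence` is a hypothetical helper name:

```python
def iqr_upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Upper outlier fence Q3 + k * IQR (Tukey's rule)."""
    return q3 + k * (q3 - q1)

# Quartiles taken from the df.describe() output for Purchase.
q1, q3 = 5823, 12054
fence = iqr_upper_fence(q1, q3)
print(fence)  # 21400.5

# Share of transactions above the whisker, from the reported counts.
outlier_share = 2677 / 550068 * 100
print(round(outlier_share, 2))  # 0.49
```

The fence comes out at 21,400.5; the whisker value of 21,399 is presumably the largest observed purchase at or below that fence.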
3.2 Categorical Variables:
3.2.1 Gender, Marital Status and City Category Distribution:
[ ]: #setting the plot style
fig = plt.figure(figsize = (15,12))
gs = fig.add_gridspec(1,3)
# creating pie chart for gender distribution
ax0 = fig.add_subplot(gs[0,0])
color_map = ["#3A7089", "#4b4b4c"]
ax0.pie(df['Gender'].value_counts().values,labels = df['Gender'].value_counts().index,autopct = '%.1f%%',
plt.show()
Insights:
1. Gender Distribution - Data indicates a significant disparity in purchase behavior between
males and females during the Black Friday event.
2. Marital Status - Given that unmarried customers account for a higher percentage of transactions,
it may be worthwhile to consider specific marketing campaigns or promotions that
appeal to this group.
3. City Category - City B saw the highest number of transactions, followed by City C and City A
respectively.
3.2.2 Customer Age Distribution
[ ]: #setting the plot style
fig = plt.figure(figsize = (15,7))
gs = fig.add_gridspec(1,2,width_ratios=[0.6, 0.4])
# creating bar chart for age distribution
ax0 = fig.add_subplot(gs[0,0])
temp = df['Age'].value_counts()
color_map = ["#3A7089", "#4b4b4c",'#99AEBB','#5C8374','#6F7597','#7A9D54','#9EB384']
ax1 = fig.add_subplot(gs[0,1])
age_info = [['26-35','40%'],['36-45','20%'],['18-25','18%'],['46-50','8%'],['51-55','7%'],['55+','4%'],
            ['0-17','3%']]
color_2d = [["#3A7089",'#FFFFFF'],["#4b4b4c",'#FFFFFF'],['#99AEBB','#FFFFFF'],['#5C8374','#FFFFFF'],['#
['#7A9D54','#FFFFFF'],['#9EB384','#FFFFFF']]
table = ax1.table(cellText = age_info, cellColours=color_2d, cellLoc='center',colLabels =['Age Group','Percent Dist.'],
table.set_fontsize(15)
#removing axis
ax1.axis('off')
#setting title for visual
fig.suptitle('Customer Age Distribution',font = 'serif', size = 18, weight = 'bold')
plt.show()
Insights:
The 26-35 age group represents the largest share of Walmart’s Black Friday transactions, accounting
for 40% of them. This suggests that the young and middle-aged adults are the most active and
interested in shopping for deals and discounts.
The 36-45 and 18-25 age groups are the second and third largest segments, with 20%
and 18% of the transactions respectively. This indicates that Walmart has a diverse customer base that covers different
life stages and preferences.
The 46-50, 51-55, 55+ and 0-17 age groups are the smallest customer segments, each with less than 10%
of the total transactions. This implies that Walmart may need to improve its marketing strategies
and product offerings to attract more customers from these age groups, especially the seniors and
the children.
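The percentage table for the age groups is hard-coded in the plotting cell. As a sketch of how such shares could be derived rather than typed in, `value_counts(normalize=True)` returns fractions directly; the series below is toy data (an assumption), standing in for `df['Age']`:

```python
import pandas as pd

# Toy sample (assumed data); in the notebook this would be df['Age'].
ages = pd.Series(["26-35", "26-35", "36-45", "18-25", "26-35",
                  "36-45", "0-17", "26-35", "18-25", "55+"])

# normalize=True returns fractions instead of raw counts, sorted descending.
share = ages.value_counts(normalize=True).mul(100).round(1)
print(share)
```

Deriving the shares this way keeps the table consistent with the bar chart if the data changes.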
3.2.3 Customer Stay In current City Distribution
[ ]: #setting the plot style
fig = plt.figure(figsize = (15,7))
gs = fig.add_gridspec(1,2,width_ratios=[0.6, 0.4])
# creating bar chart for Customer Stay In current City
ax1 = fig.add_subplot(gs[0,0])
temp = df['Stay_In_Current_City_Years'].value_counts()
color_map = ["#3A7089", "#4b4b4c",'#99AEBB','#5C8374','#6F7597']
ax1.bar(x=temp.index,height = temp.values,color = color_map,zorder = 2,width = 0.6)
ax2 = fig.add_subplot(gs[0,1])
stay_info = [['1','35%'],['2','19%'],['3','17%'],['4+','15%'],['0','14%']]
color_2d = [["#3A7089",'#FFFFFF'],["#4b4b4c",'#FFFFFF'],['#99AEBB','#FFFFFF'],['#5C8374','#FFFFFF'],['#
plt.show()
Insights:
Nearly half of the customers (49%) have stayed in the current city for one year or less. This
suggests that many customers are either new to the city or move frequently, and may have
different preferences and needs than long-term residents. It also suggests that Walmart has a
strong appeal to newcomers who may be looking for affordable and convenient shopping options.
Customers in the 4+ years category (15%) indicate that Walmart also has a base of customers who have
been living in the same city for a long time.
From one year of stay onward, the percentage of customers decreases as the stay in the current city
increases, which suggests that Walmart may benefit from targeting long-term residents with loyalty
programs and promotions.
3.2.4 Top 10 Products and Categories:
Sales snapshot: the top 10 products and product categories that sold the most during the Black Friday
sale.
[ ]: #setting the plot style
fig = plt.figure(figsize = (15,6))
gs = fig.add_gridspec(1,2)
#Top 10 Product_ID Sales
ax = fig.add_subplot(gs[0,0])
temp = df['Product_ID'].value_counts()[0:10]
# reversing the list
temp = temp.iloc[-1:-11:-1]
color_map = ['#99AEBB' for i in range(7)] + ["#3A7089" for i in range(3)]
#creating the plot
ax.barh(y = temp.index,width = temp.values,height = 0.2,color = color_map)
ax.scatter(y = temp.index, x = temp.values, s = 150 , color = color_map )
#removing x-axis
ax.set_xticks([])
#adding label to each bar
for y,x in zip(temp.index,temp.values):
    ax.text( x + 50 , y , x,{'font':'serif', 'size':10,'weight':'bold'},va='center')
Insights:
1. Top 10 Products Sold - The top-selling products during Walmart’s Black Friday sales show
relatively small variation in sales numbers, suggesting that Walmart offers
a variety of products that many different customers like to buy.
2. Top 10 Product Categories - Categories 5, 1 and 8 have significantly outperformed the other
categories, with combined sales of nearly 75% of the total, suggesting a strong preference
for these products among customers.
3.2.5 Top 10 Customer Occupation
Top 10 Occupation of Customer in Black Friday Sales
[ ]: temp = df['Occupation'].value_counts()[0:10]
#setting the plot style
fig,ax = plt.subplots(figsize = (13,6))
color_map = ["#3A7089" for i in range(3)] + ['#99AEBB' for i in range(7)]
#creating the plot
ax.bar(temp.index,temp.values, color = color_map, zorder = 2)
#adding value counts
for x,y in zip(temp.index,temp.values):
    ax.text(x, y + 2000, y,{'font':'serif', 'size':10,'weight':'bold'},va='center',ha = 'center')
#adding title to the visual
ax.set_title('Top 10 Occupation of Customers',
             {'font':'serif', 'size':15,'weight':'bold'})
plt.show()
Insights:
Customers in occupation categories 4, 0 and 7 contributed almost 37% of the total
purchases, suggesting that these occupations have a high demand for Walmart products or services,
or that they have more disposable income to spend on Black Friday.
4. Bivariate Analysis:
4.1 Exploring Purchase Patterns
[ ]: #setting the plot style
fig = plt.figure(figsize = (15,20))
gs = fig.add_gridspec(3,2)
for i,j,k in [(0,0,'Gender'),(0,1,'City_Category'),(1,0,'Marital_Status'),(1,1,'Stay_In_Current_City_Year
    #plot position
    if i <= 1:
        ax0 = fig.add_subplot(gs[i,j])
    else:
        ax0 = fig.add_subplot(gs[i,:])
    #plot
    color_map = ["#3A7089", "#4b4b4c",'#99AEBB','#5C8374','#6F7597','#7A9D54','#9EB384']
    #plot title
    ax0.set_title(f'Purchase Amount Vs {k}',{'font':'serif', 'size':12,'weight':'bold'})
    #customizing axis
    ax0.set_xticklabels(df[k].unique(),fontweight = 'bold',fontsize = 12)
    ax0.set_ylabel('Purchase Amount',fontweight = 'bold',fontsize = 12)
    ax0.set_xlabel('')
plt.show()
Insights:
Across all the variables analysed above, the purchase amount remains relatively stable: the median
purchase amount consistently hovers around 8,000 USD, regardless of the specific variable being
examined.
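The stability of the median can also be checked numerically rather than read off the plots. A minimal sketch using `groupby(...).median()` on toy transactions (assumed data; the notebook would apply the same call to `df` for each categorical column):

```python
import pandas as pd

# Toy transactions (assumed data); the notebook would use df directly.
toy = pd.DataFrame({
    "Gender": ["F", "M", "M", "F", "M", "F"],
    "Purchase": [7900, 8100, 8300, 8000, 7800, 8200],
})

# Median purchase per group; repeat for any other categorical column.
medians = toy.groupby("Gender")["Purchase"].median()
print(medians)
```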
5. Gender vs Purchase Amount:
5.1 Data Visualization:
[ ]: #creating a df for purchase amount vs gender
temp = df.groupby('Gender')['Purchase'].agg(['sum','count']).reset_index()
#calculating the amount in billions
temp['sum_in_billions'] = round(temp['sum'] / 10**9,2)
#calculating percentage distribution of purchase amount
temp['%sum'] = round(temp['sum']/temp['sum'].sum(),3)
#calculating per purchase amount
temp['per_purchase'] = round(temp['sum']/temp['count'])
#renaming the gender
temp['Gender'] = temp['Gender'].replace({'F':'Female','M':'Male'})
temp
    #for gender
    ax.text(temp.loc[i,'%sum']/2 + txt[0],- 0.20 ,f"{temp.loc[i,'Gender']}",
            va = 'center', ha='center',fontsize=14, color='white')
    txt += temp.loc[i,'%sum']
#customizing ticks
ax.set_xticks([])
ax.set_yticks([])
ax.set_xlim(0,1)
#plot title
ax.set_title('Gender-Based Purchase Amount Distribution',{'font':'serif', 'size':15,'weight':'bold'})
ax1 = fig.add_subplot(gs[1,0])
color_map = ["#3A7089", "#4b4b4c"]
#plotting the visual
ax1.bar(temp['Gender'],temp['per_purchase'],color = color_map,zorder = 2,width = 0.3)
shadow = True,colors = color_map,wedgeprops = {'linewidth': 5},textprops={'fontsize': 13, 'color': 'black'})
plt.show()
Insights:
1. Total Sales and Transactions Comparison: The total purchase amount and number of transactions
by male customers were more than three times those of female customers, indicating that male
customers had a more significant impact on the Black Friday sales.
2. Average Transaction Value: The average purchase amount per transaction was slightly higher
for male customers than for female customers ($9438 vs $8735).
3. Distribution of Purchase Amount: As seen above, the purchase amount for both genders
is not normally distributed.
5.2 Confidence Interval Construction: Estimating Average Purchase Amount per
Transaction
1. Step 1 - Building CLT Curve: As seen above, the purchase amount distribution is not normal,
so we rely on the Central Limit Theorem. It states that the distribution of sample means
approaches a normal distribution as the sample size grows, regardless of the underlying
population distribution (provided its variance is finite).
2. Step 2 - Building Confidence Interval: After building the CLT curve, we will create a confidence
interval for the population mean at the 99%, 95% and 90% confidence levels.
Note - We will use different sample sizes of [100,1000,5000,50000].
return interval
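The helper that ends in `return interval` is not shown in full, so its exact method is unknown. The two steps described above can be sketched as follows: draw many samples of a given size with replacement, record each sample mean, and take percentiles of those means as the interval. The function name `bootstrap_mean_ci`, its parameters, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_mean_ci(values, sample_size, confidence=95, n_resamples=1000, seed=0):
    """Percentile confidence interval for the mean from resampled sample means."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Each resample draws `sample_size` observations with replacement.
    means = np.array([
        rng.choice(values, size=sample_size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    tail = (100 - confidence) / 2
    lower, upper = np.percentile(means, [tail, 100 - tail])
    return lower, upper

# Skewed synthetic data (assumed) standing in for the Purchase column.
purchases = np.random.default_rng(1).exponential(scale=9000, size=20_000)
for n in (100, 1000, 5000):
    low, high = bootstrap_mean_ci(purchases, n)
    print(n, round(low), round(high))
```

With larger sample sizes the printed intervals tighten around the mean, which is the behavior the notebook's plots are meant to show.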
[77]: #defining a function for plotting the visual for given confidence interval
def plot(ci):
    #setting the plot style
    fig = plt.figure(figsize = (15,8))
    gs = fig.add_gridspec(2,2)
    #creating separate data frames for each gender
    df_male = df.loc[df['Gender'] == 'M','Purchase']
    df_female = df.loc[df['Gender'] == 'F','Purchase']
    #sample sizes and corresponding plot positions
    sample_sizes = [(100,0,0),(1000,0,1),(5000,1,0),(50000,1,1)]
    #number of samples to be taken from purchase amount
    bootstrap_samples = 20000
    male_samples = {}
    female_samples = {}
    #creating a temporary dataframe for creating kdeplot
    temp_df = pd.DataFrame(data = {'male_means':male_means,'female_means':female_means})
    #plotting kdeplots
    #plot position
    ax = fig.add_subplot(gs[x,y])
    plt.legend()
    plt.show()
    return male_samples,female_samples
[79]: m_samp_95,f_samp_95 = plot(95)
[80]: m_samp_99,f_samp_99 = plot(99)
#plotting the summary
ax = fig.add_subplot(gs[l])
table.set_fontsize(13)
#removing axis
ax.axis('off')
#setting title
ax.set_title(f"{k}% Confidence Interval Summary",{'font':'serif', 'size':14,'weight':'bold'})
Insights:
1. Sample Size: The analysis highlights the importance of sample size in estimating population
parameters. As the sample size increases, the confidence intervals become
narrower and more precise. In business, this implies that larger sample sizes can provide
more reliable insights and estimates.
2. Confidence Intervals: From the above analysis, we can see that, except for the sample size
of 100, the confidence intervals for men and women do not overlap. This means that
there is a statistically significant difference between the average spending per transaction for
men and women within the given samples.
3. Population Average: We are 95% confident that the true population average for males falls
between $9,393 and $9,483, and for females it falls between $8,692 and $8,777.
4. Women spend less: Men tend to spend more money per transaction on average than women,
as the upper bounds of the confidence intervals for men are consistently higher than those
for women across different sample sizes.
5. How can Walmart leverage this conclusion to make changes or improvements?
5.1. Segmentation Opportunities: Walmart can create targeted marketing campaigns, loyalty programs,
or product bundles to cater to the distinct spending behaviors of male and female customers.
This approach may help maximize revenue from each customer segment.
5.2. Pricing Strategies: Based on the average spending per transaction by gender, Walmart
might adjust pricing or discount strategies to incentivize higher spending among male customers
while ensuring competitive pricing for female-oriented products.
Note: Moving forward in our analysis, we will use the 95% confidence level only.
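The non-overlap claim in the insights above can be checked directly from the quoted 95% intervals. A minimal sketch, where `intervals_overlap` is a hypothetical helper and the bounds are the ones reported in the text:

```python
def intervals_overlap(a, b):
    """Return True when closed intervals a = (low, high) and b = (low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# 95% intervals quoted in the insights.
male_ci = (9393, 9483)
female_ci = (8692, 8777)

print(intervals_overlap(male_ci, female_ci))  # False
```

Non-overlapping intervals are a conservative sign of a difference in means; overlapping intervals would not by themselves prove the means are equal.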
6. Marital Status vs Purchase Amount:
6.1. Data Visualization
[84]: #creating a df for purchase amount vs marital status
temp = df.groupby('Marital_Status')['Purchase'].agg(['sum','count']).reset_index()
#inserting the text
txt = 0.0 #running left offset for ax.text(); a float, since a list cannot be incremented by a number
for i in temp.index:
    #for amount
    ax.text(temp.loc[i,'%sum']/2 + txt,0.15,f"${temp.loc[i,'sum_in_billions']} Billion",
    txt += temp.loc[i,'%sum']
#customizing ticks
ax.set_xticks([])
ax.set_yticks([])
ax.set_xlim(0,1)
#plot title
ax.set_title('Marital_Status-Based Purchase Amount Distribution',{'font':'serif', 'size':15,'weight':'bold'})
ax1 = fig.add_subplot(gs[1,0])
color_map = ["#3A7089", "#4b4b4c"]
#plotting the visual
ax1.bar(temp['Marital_Status'],temp['per_purchase'],color = color_map,zorder = 2,width = 0.3)
#adding grid lines
ax1.grid(color = 'black',linestyle = '--', axis = 'y', zorder = 0, dashes = (5,10))
ax = ax3,hue_order = ['Married','Unmarried'])
#removing the axis lines
for s in ['top','left','right']:
    ax3.spines[s].set_visible(False)
plt.show()
Insights:
1. Total Sales and Transactions Comparison: The total purchase amount and number of transactions
by unmarried customers were more than 20% higher than those of married
customers, indicating that unmarried customers had a more significant impact on the Black Friday sales.
2. Average Transaction Value: The average purchase amount per transaction was almost the same
for married and unmarried customers ($9261 vs $9266).
3. Distribution of Purchase Amount: As seen above, the purchase amount for both married and
unmarried customers is not normally distributed.
7. Customer Age vs Purchase Amount:
7.1 Data Visualization
[86]: #creating a df for purchase amount vs age group
temp = df.groupby('Age')['Purchase'].agg(['sum','count']).reset_index()
#calculating the amount in billions
temp['sum_in_billions'] = round(temp['sum'] / 10**9,2)
#calculating percentage distribution of purchase amount
temp['%sum'] = round(temp['sum']/temp['sum'].sum(),3)
#calculating per purchase amount
temp['per_purchase'] = round(temp['sum']/temp['count'])
temp
left += temp.loc[i,'%sum']
#inserting the text
txt = 0.0 #for left parameter in ax.text()
for i in temp.index:
    #for amount
    ax.text(temp.loc[i,'%sum']/2 + txt,0.15,f"{temp.loc[i,'sum_in_billions']}B",
            va = 'center', ha='center',fontsize=14, color='white')
    txt += temp.loc[i,'%sum']
#customizing ticks
ax.set_xticks([])
ax.set_yticks([])
ax.set_xlim(0,1)
#plot title
ax.set_title('Age Group Purchase Amount Distribution',{'font':'serif', 'size':15,'weight':'bold'})
ax1 = fig.add_subplot(gs[1])
#plotting the visual
ax1.bar(temp['Age'],temp['per_purchase'],color = color_map,zorder = 2,width = 0.3)
ax = ax3)
#removing the axis lines
for s in ['top','left','right']:
    ax3.spines[s].set_visible(False)
plt.show()
Insights:
1. Total Sales Comparison: The 26-45 age range accounts for almost 60% of the total sales,
suggesting that Walmart’s Black Friday sales are most popular among these age groups. The
0-17 age group has the lowest sales percentage (2.6%), which is expected as they may not
have as much purchasing power. Understanding their preferences
2. Average Transaction Value: While there is no significant difference in per-purchase spending
among the age groups, the 51-55 age group has a relatively low sales percentage (7.2%) but
the highest per-purchase spending at 9535. Walmart could consider strategies to
attract and retain this high-spending demographic.
3. Distribution of Purchase Amount: As seen above, the purchase amount for all age groups is
not normally distributed.
******