
Final Project

The document outlines a final project conducted by Ituanose Mark Kemba, focusing on analyzing retail sales data to inform strategic business decisions. It details the methodology used for data collection, cleaning, outlier detection, and both descriptive and predictive analyses, including regression models and cluster analysis. The results include various data visualizations and insights into sales trends, customer behavior, and payment method preferences.


Final Project

Ituanose Mark Kemba

2024.06.16
Table Of Contents

1.0. Introduction
1.1. Background
1.2. Scope of Study
2.0. Methodology
2.1. Data Collection
2.2. Data Cleaning
2.3. Outlier Detection
2.3.1. Z-Score Method
2.3.2. IQR Method
2.3.3. Comparison Table of Z-score and IQR Methods
2.4. Descriptive Analysis
2.5. Predictive Analysis
2.5.1. Linear Regression
2.5.2. Multiple Regression
2.5.3. Cluster Analysis
3.0. Results
3.1. Data Visualizations
3.1.1. Monthly Total Sales
3.1.2. Monthly Revenue by Product Category
3.1.3. Revenue Percentage by Product Category
3.1.4. Total Amounts by Payment Methods
3.1.5. Total Amount vs. Discount Applied
3.1.6. Payment Methods Distribution
3.1.7. Monthly Total Revenue
3.2. Predictive Analysis Result
3.2.1. Linear Regression Analysis
3.2.2. Multiple Regression Analysis
3.2.3. Cluster Analysis
4.0. Analysis Questions
4.1. Revenue Contribution by Category
4.2. Seasonal Sales Trends for Books
4.3. Seasonal Sales Trends for Home Décor and Clothing
4.4. Electronic Sales Trends
4.5. Payment Method Preferences
4.6. Impact of Discounts on Sales
4.7. Total Quantity Bought by Customers
5.0. Conclusion
6.0. References

1.0 Introduction

1.1 Background

In the highly competitive retail industry, it is imperative to utilize data to inform strategic

choices. To identify important factors influencing overall sales, this project will investigate and

evaluate sales data from a retail location. The store can create focused advertising campaigns,

manage inventory efficiently, and optimize pricing strategies by being aware of these variables.

CustomerID, Quantity, Price, TransactionDate, Payment Method, Store Location, Product

Category, Discount Applied (%), and Total Amount are among the attributes included in the

collection. Using this data, we hope to derive practical insights that will improve overall

corporate performance.

1.2 Study Scope

This study presents a thorough analysis of retail sales data. To ensure data quality and accuracy, it includes data cleaning and outlier detection using both the Z-score and IQR methods. Descriptive statistical analyses summarize and characterize the main features of the data, while predictive analyses, namely simple and multiple regression models, identify and quantify the important predictors of total sales. To convey findings more effectively, the study uses data visualization tools such as scatter plots, line charts, and pivot tables. Tableau cluster analysis is also used to segment customers and transactions and to examine customers' purchasing habits. The findings are then applied to several analytical questions to support strategic decisions and business optimization.


2.0 Methodology

2.1. Data Collection

We obtained the dataset from a source provided by the professor. It contains essential columns such as Product Category, Quantity, Price, and more, giving a comprehensive view of retail transactions.

2.2. Data Cleaning

Next, we cleaned up our data to ensure accuracy and reliability. We identified and removed any

rows that had missing data. This step is crucial because clean data is the foundation of any

meaningful analysis.
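
The cleaning step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the tool actually used in the project; the sample rows and the helper name drop_incomplete_rows are hypothetical, and only the column names follow the attributes listed in the Background section.

```python
import csv
import io

# Hypothetical sample data; the real dataset is not reproduced here.
SAMPLE_CSV = """CustomerID,Quantity,Price,TotalAmount
101,2,10.0,20.0
102,,15.0,
103,5,8.0,40.0
104,1,,12.0
"""

def drop_incomplete_rows(rows):
    """Keep only rows in which every field is present and non-blank."""
    return [
        row for row in rows
        if all(value is not None and value.strip() != "" for value in row.values())
    ]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
clean_rows = drop_incomplete_rows(rows)
print(f"kept {len(clean_rows)} of {len(rows)} rows")
```

Dropping whole rows is the simplest policy and matches the description above; it is reasonable when only a small share of rows is affected.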

2.3. Outlier Detection

After cleaning, we turned our attention to outlier detection. Outliers are data points that

significantly differ from other observations, potentially skewing our analysis. We used two

methods to detect outliers:

2.3.1. Z-Score Method



The Z-score method helped us pinpoint outliers based on how many standard deviations they

deviate from the mean. This method is effective for identifying extreme values in columns like

Quantity, Price, Discount Applied (%), and Total Amount.
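
As an illustration of the Z-score rule, the sketch below flags values whose Z-score lies beyond ±3, the threshold stated in the Conclusion. The sample Quantity values are hypothetical, and the use of the population standard deviation is an assumption, since the report does not say which form was used.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs((v - mu) / sigma) > threshold]

# Hypothetical Quantity column: nineteen ordinary values and one extreme value.
quantities = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6, 5, 4, 6, 5, 5, 4, 6, 5, 5, 60]
print(zscore_outliers(quantities))
```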

2.3.2. IQR Method

We also employed the Interquartile Range (IQR) method, which focuses on the spread of data

within its quartiles. It's particularly useful for detecting outliers in skewed distributions. We

applied this method to the same columns as the Z-score for a thorough analysis.
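
A comparable sketch for the IQR rule is shown below, using the 1.5 * IQR fences mentioned in the Conclusion. The sample prices are hypothetical, and the "inclusive" quartile convention is an assumption; other tools may compute quartiles slightly differently and therefore place the fences elsewhere.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical Price column with one unusually high value.
prices = [10, 12, 11, 13, 12, 11, 14, 12, 13, 95]
print(iqr_outliers(prices))
```

Because the fences depend on quartiles rather than on the mean and standard deviation, a single extreme value does not widen them, which is why this rule tends to behave better on skewed columns.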

2.3.3. Comparison Table of Z-score and IQR Methods

To give a clear picture, we created a comparison table showing the results from both the Z-score

and IQR methods. This helped us assess which method was more suitable for our dataset and

understand the nature of the outliers we encountered.

2.4. Descriptive Analysis with SQL

With our data cleaned and outliers identified, we moved on to descriptive analytics on all

numeric data, using SQL to calculate the mean, median, standard deviation, mode, and IQR. This

step involved summarizing key statistics of interest, such as:



Fig 2.4

1. Mean:

CustomerID: The mean CustomerID is about 500,473. This tells us the average ID number in

the dataset.

Quantity: The mean quantity purchased per transaction is approximately 5.009 items. This

gives us a typical number of items bought at once.

Price: The mean price per item is around 55.07 currency units. This shows us the average cost

per item purchased.

Discount Applied (%): On average, a discount of about 10.02% is applied to transactions.

This indicates the average reduction in price due to discounts.

Total Amount: The mean total transaction amount is about 248.34 currency units. This

represents the average purchase amount per transaction.

2. Median:

CustomerID: The median CustomerID is 499,679. This means half of the CustomerIDs are

below this value and half are above.

Quantity: The median quantity purchased is 5 items. This shows the middle value in the range

of quantities purchased.

Price: The median price per item is 55.12 currency units. This represents the central price

value in the dataset.

Discount Applied (%): The median discount applied is 10.03%. This indicates the middle value

of discount percentages applied.

Total Amount: The median total transaction amount is 200.35 currency units. This represents

the middle value of transaction amounts.

3. Mode:

CustomerID: The mode (most frequent CustomerID) is 340,516. This ID appears more often

than any other in the dataset.

Quantity: The mode quantity purchased is 7 items. This is the quantity that appears most

frequently in transactions.

Price: No mode is reported for Price (#N/A). This typically means that no price value occurs more than once, so there is no single most frequent price.

Discount Applied (%): No mode is reported for Discount Applied (%) (#N/A). This indicates that no single discount percentage is applied most frequently.

Total Amount: The mode total transaction amount is 101.72 currency units. This is the

transaction amount that occurs most frequently.

4. Standard Deviation:

CustomerID: The standard deviation of CustomerID is 288,461.21. This shows the extent of

variation or spread around the mean CustomerID.

Quantity: The standard deviation of quantity is 2.58. This indicates how much the quantity

values deviate from the mean, showing low variability.

Price: The standard deviation of price is 25.97. This shows the amount of variation in prices,

indicating some items are more expensive than others.

Discount Applied (%): The standard deviation of the discount applied is 5.78. This shows

variability in the discount rates applied to transactions.

Total Amount: The standard deviation of the total transaction amount is 184.55. This indicates

variability in transaction sizes, with some transactions being larger than others.

5. Interquartile Range (IQR):



CustomerID: The IQR for CustomerID is 500,442.5. This represents the range of the middle

50% of CustomerID values, showing the spread of data.

Quantity: The IQR for quantity is 4. This shows the range of the middle 50% of quantity values, indicating consistency in the number of items purchased.

Price: The IQR for the price is 44.91 currency units. This shows the range of the middle 50%

of price values, indicating the spread of item costs.

Discount Applied (%): The IQR for the discount applied is 10.02%. This shows the range of

the middle 50% discount percentage values, indicating how discounts are distributed.

Total Amount: The IQR for the total transaction amount is 266.85 currency units. This shows

the range of the middle 50% of transaction amounts, indicating the spread of purchase sizes.

These statistical measures summarize the central tendency, variability, and distribution of each numeric column, providing insights into customer behavior and transaction patterns.
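
For readers who want to reproduce the five measures above outside SQL, the following sketch computes them for one numeric column with the Python standard library. The sample values and the function name describe are hypothetical; the sample standard deviation and the "inclusive" quartile convention are assumptions, since the report does not state which variants were used.

```python
from statistics import mean, median, multimode, quantiles, stdev

def describe(values):
    """Return mean, median, mode(s), standard deviation and IQR for one column."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    has_repeats = len(set(values)) < len(values)
    return {
        "mean": mean(values),
        "median": median(values),
        # Report no mode when no value repeats (similar to a spreadsheet's #N/A).
        "mode": multimode(values) if has_repeats else None,
        "stdev": stdev(values),  # sample standard deviation (an assumption)
        "iqr": q3 - q1,
    }

# Hypothetical Quantity values.
quantities = [2, 7, 5, 7, 3, 5, 7, 4, 6, 4]
summary = describe(quantities)
for name, value in summary.items():
    print(name, value)
```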

2.5. Predictive Analysis


Next, we applied statistical models to forecast and understand patterns in our data.

2.5.1. Linear Regression

We started with linear regression to explore relationships between variables, particularly how the

Total Amount is influenced by Quantity. This analysis helped us create a predictive model to

estimate sales based on quantities purchased.
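
To make the idea concrete, here is a minimal least-squares sketch of a one-predictor model of the form Total Amount = intercept + slope * Quantity. It is not the tool used in the project (the results section shows spreadsheet-style summary output); the sample values and the function name fit_simple_ols are hypothetical.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x; also returns R squared."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical transactions: quantity bought and total amount paid.
quantity = [1, 2, 3, 4, 5, 6]
total_amount = [52, 95, 160, 190, 260, 295]
slope, intercept, r_squared = fit_simple_ols(quantity, total_amount)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r_squared:.4f}")
```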

2.5.2. Multiple Regression

Moving beyond simple relationships, multiple regression expanded our analysis to include

Quantity, Price, and Discount Applied (%). This model allowed us to delve deeper into factors

affecting total sales amounts and compare its predictive power with linear regression.
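
The same least-squares idea extends to several predictors. The sketch below solves the normal equations (X'X) b = X'y with a small Gaussian elimination so that it needs no external packages; it is an illustration under stated assumptions, not the project's actual procedure. The rows are hypothetical and are generated from known coefficients so the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_multiple_ols(rows, y):
    """Fit y = b0 + b1*x1 + ... + bk*xk; returns [b0, b1, ..., bk]."""
    x = [[1.0] + list(row) for row in rows]  # prepend the intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical (Quantity, Price, Discount %) rows, with totals generated from
# total = 10 + 50*quantity + 4*price - 3*discount so the answer is known.
rows = [(1, 20, 0), (2, 35, 5), (3, 50, 10), (4, 25, 0),
        (5, 60, 15), (6, 40, 5), (7, 30, 20)]
totals = [10 + 50 * q + 4 * p - 3 * d for q, p, d in rows]
coefficients = fit_multiple_ols(rows, totals)
print([round(c, 4) for c in coefficients])
```

Solving the normal equations directly is fine for a handful of predictors; statistical packages generally use more numerically stable decompositions.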

2.5.3. Cluster Analysis using Tableau

Lastly, we used Tableau for cluster analysis to segment customers. By grouping customers based on purchasing behavior (Quantity, Price, Total Amount), we identified distinct customer segments. This insight can guide targeted marketing strategies and personalized customer experiences.
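
The segmentation idea can be sketched outside Tableau with a plain k-means loop. The code below is an illustration only: it assumes k-means with min-max scaling as a stand-in for Tableau's clustering with "Normalized" scaling, and the customer tuples (sum of price, sum of quantity, sum of total amount), the starting centers, and the function names are all hypothetical.

```python
def min_max_scale(points):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in points]

def kmeans(points, centers, iterations=20):
    """Plain k-means (Lloyd's algorithm) starting from the given centers."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical customers: (sum of price, sum of quantity, sum of total amount).
customers = [(80, 7, 560), (78, 8, 620), (76, 7, 530),
             (30, 2, 60), (33, 3, 95), (35, 2, 70)]
scaled = min_max_scale(customers)
centers, clusters = kmeans(scaled, centers=[scaled[0], scaled[3]])
print([len(c) for c in clusters])
```

Scaling matters here because Total Amount is numerically much larger than Quantity; without it, distances would be dominated by a single variable.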



3.0 Results

3.1 Data Visualization

3.1.1 Time series of total sales over time on a monthly basis

Months            Sum of TotalAmount
(blank)           2281.862742
April 2023        62363.07835
April 2024        1913643.087
August 2023       2115917.843
December 2023     2106554.83
February 2024     1997204.319
January 2024      2146268.002
July 2023         2106169.324
June 2023         2008137.229
March 2024        2094156.127
May 2023          2098160.608
November 2023     2062281.273
October 2023      2088945.35
September 2023    2028997.247
Grand Total       24831080.18

3.1.2. Monthly Revenue by Product Category

StoreLocation: (All)

Sum of TotalAmount by Product Category

Months            Books          Clothing       Electronics    Home Decor     (blank)   Grand Total
April 2023        17717.50448    14806.91541    15771.33343    12545.03454              60840.78785
April 2024        479914.3684    476701.1093    442662.0468    479072.0077              1878349.532
August 2023       544896.2248    542699.6667    521294.9141    500389.162               2109279.968
December 2023     539729.6082    538515.7004    519491.8892    527658.0977              2125395.296
February 2024     484880.3218    505551.5903    487694.0937    495028.1115              1973154.117
January 2024      523816.9949    526465.2202    532423.0795    545611.6048              2128316.899
July 2023         546430.9142    521163.701     519402.0994    545553.8031              2132550.518
June 2023         526911.6481    516732.6272    512140.0516    510580.4964              2066364.823
March 2024        540194.9013    527844.2334    521431.0446    518365.7925              2107835.972
May 2023          529484.0756    521442.1304    535326.8631    511676.6026              2097929.672
November 2023     517424.8868    496786.522     527089.2032    509976.685               2051277.297
October 2023      502534.9574    497973.5091    546847.7596    502094.472               2049450.698
September 2023    503236.3982    517822.8393    515059.3245    514216.039               2050334.601
Grand Total       6257172.804    6204505.765    6196633.703    6172767.909              24831080.18

Figure: Total revenue by month and product category (y-axis: Total Amount; x-axis: Month; series: Books, Clothing, Electronics, Home Decor, (blank)).

3.1.3. Revenue Percentage by Product Category

Product Category    Sum of TotalAmount
Books               25.20%
Clothing            24.99%
Electronics         24.96%
Home Decor          24.86%
Grand Total         100.00%

3.1.4. Total Amounts by Payment Methods

Payment Method    Variance of TotalAmount
Cash              33997.64049
Credit Card       33656.27435
Debit Card        34146.93217
PayPal            34442.60253
Grand Total       34060.14074

3.1.5. Total Amount vs. Discount Applied



3.1.6. Payment Methods Distribution

Row Labels     Count of PaymentMethod
Cash           25003
Credit Card    25032
Debit Card     24888
PayPal         25067
Grand Total    99990

3.1.7. Monthly Total Revenue

Months         Sum of TotalAmount
January        2128316.899
February       1973154.117
March          2107835.972
April          1939190.32
May            2097929.672
June           2066364.823
July           2132550.518
August         2109279.968
September      2050334.601
October        2049450.698
November       2051277.297
December       2125395.296
Grand Total    24831080.18

3.2. Predictive Analysis Result

3.2.1. Evaluation of linear regression on the total amount based on quantity

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.69183863
R Square             0.47864069
Adjusted R Square    0.47863547
Standard Error       133.258205
Observations         99990

ANOVA
              df       SS            MS            F             Significance F
Regression    1        1630077593    1630077593    91795.2823    0
Residual      99988    1775561819    17757.7491
Total         99989    3405639412

             Coefficients    Standard Error    t Stat        P-value       Lower 95%     Upper 95%
Intercept    0.4164367       0.92041913        0.45244247    0.65095124    -1.3875735    2.22044688
Quantity     49.4918746      0.16335172        302.977363    0             49.1717073    49.812042

Let's evaluate and explain the regression model based on the provided summary output:

Regression Statistics

1. Multiple R (Correlation Coefficient):

Value: 0.6918

Interpretation: This value represents the correlation between Quantity and Total Amount. The value 0.6918 indicates a strong positive correlation.

2. R Square (Coefficient of Determination):

Value: 0.4786

Interpretation: This value indicates that approximately 47.86% of the variance in Total Amount can be explained by Quantity.

3. Adjusted R Square:

Value: 0.4786

Interpretation: This adjusted value accounts for the number of predictors in the model. Since we only have one predictor, it is close to the R Square value.

4. Standard Error:

Value: 133.2582

Interpretation: This value represents the standard deviation of the regression errors. It indicates

the average distance the observed values fall from the regression line.

5. Observations:

Value: 99,990

Interpretation: This is the number of observations (data points) used in the regression analysis.

ANOVA (Analysis of Variance)

1. Regression:

df: 1 (degrees of freedom for regression)

SS: 1,630,077,593 (sum of squares for regression)

MS: 1,630,077,593 (mean square for regression)

F: 91,795.2823 (F-statistic)

Significance F: 0 (p-value)

Interpretation: The regression model is statistically significant, as indicated by the very low p-

value (0). This means that the relationship between Quantity and Total Amount is highly

unlikely to be due to chance.



2. Residual:

df: 99,988 (degrees of freedom for residuals)

SS: 1,775,561,819 (sum of squares for residuals)

MS: 17,757.7491 (mean square for residuals)

3. Total:

df: 99,989 (total degrees of freedom)

SS:3,405,639,412 (total sum of squares)
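
The figures in the ANOVA table are linked by simple arithmetic, which the short check below reproduces from the sums of squares and degrees of freedom reported above. Only the variable names are introduced here; small differences in the last digits come from the rounding of the reported sums of squares.

```python
# Values copied from the ANOVA table of the simple linear regression.
ss_regression = 1_630_077_593
ss_residual = 1_775_561_819
df_regression, df_residual = 1, 99_988

ss_total = ss_regression + ss_residual           # total sum of squares
r_squared = ss_regression / ss_total             # share of variance explained
ms_regression = ss_regression / df_regression    # mean square for regression
ms_residual = ss_residual / df_residual          # mean square for residuals
f_statistic = ms_regression / ms_residual        # F = MS regression / MS residual
standard_error = ms_residual ** 0.5              # standard error of the regression

print(f"SS total = {ss_total}")
print(f"R Square = {r_squared:.4f}")
print(f"F = {f_statistic:.1f}")
print(f"Standard Error = {standard_error:.3f}")
```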

Coefficients

1. Intercept:

Coefficient: 0.4164

Standard Error: 0.9204

t Stat: 0.4524

P-value: 0.6510

Lower 95%: -1.3876

Upper 95%: 2.2204

Interpretation: The intercept is not statistically significant (p-value > 0.05), indicating that the

intercept may not differ significantly from zero.

2. Quantity:

Coefficient: 49.4919

Standard Error: 0.1634

t Stat: 302.9774

P-value: 0

Lower 95%: 49.1717

Upper 95%: 49.8120

Interpretation: The coefficient for Quantity is statistically significant (p-value < 0.05),

indicating a significant positive relationship between Quantity and Total Amount. For each

additional unit of Quantity, the Total Amount increases by approximately 49.49 units.

Overall Model Evaluation

Model Significance: The model is statistically significant as indicated by the F-test with a

significance F value of 0. This suggests that the model provides a better fit than a model with no

predictors.

Fit: With an R Square of 0.4786, the model explains about 47.86% of the variability in Total Amount. While this is a substantial proportion, 52.14% of the variability is not explained by Quantity alone.

Coefficients: The Quantity coefficient is highly significant, indicating a strong positive

relationship with Total Amount.

Conclusion

The regression analysis suggests a strong positive correlation between Quantity and Total

Amount, with Quantity being a significant predictor of Total Amount. The model explains about

47.86% of the variance in Total Amount, indicating that other factors not included in this model

may also be influencing the total amount. The intercept is not significant, which implies that it

might not have a meaningful contribution to the model.

Let's evaluate and compare the multiple regression results with the previous linear regression:

3.2.2 Evaluation of Multiple Regression Summary:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.94298891
R Square             0.88922808
Adjusted R Square    0.88922476
Standard Error       61.4249158
Observations         99990

ANOVA
              df       SS            MS            F             Significance F
Regression    3        3028390207    1009463402    267547.834    0
Residual      99986    377249205     3773.02028
Total         99989    3405639412

                      Coefficients    Standard Error    t Stat        P-value    Lower 95%     Upper 95%
Intercept             -220.28597      0.6807717         -323.58274    0          -221.62028    -218.95167
Quantity              49.3895844      0.07529704        655.929972    0          49.2420031    49.5371657
Price                 4.51274861      0.00747942        603.355509    0          4.49808904    4.52740818
DiscountApplied(%)    -2.7237693      0.03361059        -81.039011    0          -2.7896457    -2.657893

Regression Statistics:

1. Multiple R (Correlation Coefficient): 0.9430

Interpretation: Indicates a very strong positive correlation between the independent variables

(Quantity, Price, Discount Applied) and the dependent variable (Total Amount).

2. R Square (Coefficient of Determination): 0.8892

Interpretation: About 88.92% of the variance in Total Amount can be explained by the

independent variables (Quantity, Price, Discount Applied).

3. Adjusted R Square: 0.8892

Interpretation: Adjusted R Square is very close to R Square, indicating that the model does not

have too many irrelevant predictors.

4. Standard Error: 61.4249



Interpretation: This value represents the standard deviation of the regression errors, indicating

the average distance that the observed values fall from the regression line.

5. Observations: 99,990

Interpretation: The number of observations (data points) used in the regression analysis.

ANOVA:

1. Regression:

df: 3 (degrees of freedom for regression)

SS: 3,028,390,207 (sum of squares for regression)

MS: 1,009,463,402 (mean square for regression)

F: 267,547.834 (F-statistic)

Significance F: 0 (p-value)

Interpretation: The regression model is statistically significant, as indicated by the very low p-

value (0). This suggests that the relationship between the independent variables and the Total

Amount is highly unlikely to be due to chance.

2. Residual:

df: 99,986 (degrees of freedom for residuals)

SS: 377,249,205 (sum of squares for residuals)

MS: 3,773.0203 (mean square for residuals)



3. Total:

df: 99,989 (total degrees of freedom)

SS: 3,405,639,412 (total sum of squares)

Coefficients:

1. Intercept:

Coefficient: -220.2860

Standard Error: 0.6808

t Stat: -323.5827

P-value: 0

Lower 95%: -221.6203

Upper 95%: -218.9517

Interpretation: The intercept is statistically significant, indicating that it differs significantly

from zero.

2. Quantity:

Coefficient: 49.3896

Standard Error: 0.0753

t Stat: 655.9300

P-value: 0

Lower 95%: 49.2420

Upper 95%: 49.5372



Interpretation: Quantity is a significant predictor of Total Amount.

3. Price:

Coefficient: 4.5127

Standard Error: 0.0075

t Stat: 603.3555

P-value: 0

Lower 95%: 4.4981

Upper 95%: 4.5274

Interpretation: Price is a significant predictor of Total Amount.

4. Discount Applied (%):

Coefficient: -2.7238

Standard Error: 0.0336

t Stat: -81.0390

P-value: 0

Lower 95%: -2.7896

Upper 95%: -2.6579

Interpretation: Discount Applied (%) is a significant predictor of Total Amount.

Comparison with Linear Regression:

1. Model Significance:
Both models are statistically significant, but the multiple regression model explains more variance in Total Amount (R Square of 0.8892) than the simple linear regression model with Quantity as the only predictor (R Square of 0.4786).

2. Predictor Significance:

In the multiple regression model, all predictors (Quantity, Price, and Discount Applied (%)) are statistically significant. Quantity was the sole predictor in the linear regression model, and it was significant.

3. Fit:

The multiple regression model fits the data better than the simple linear regression model, as seen in its higher R Square and Adjusted R Square values.

4. Coefficients:

The coefficient for Quantity in the multiple regression model (49.39) differs slightly from that in the simple linear regression model (49.49). This change indicates that Quantity's effect on Total Amount is adjusted when Price and Discount Applied (%) are taken into account.

The multiple regression model incorporates additional coefficients for Price and Discount Applied (%), which were not considered in the simple linear regression model.

Conclusion:

When compared to the simple linear regression model, the multiple regression model fits the data better and accounts for a larger percentage of the variance in Total Amount. Total Amount can be predicted with considerable accuracy from Quantity, Price, and Discount Applied (%). Total Amount is positively associated with Quantity and Price, and negatively associated with Discount Applied (%).
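
As a worked example of how the fitted multiple regression equation is used, the sketch below plugs the coefficients from the summary output into a prediction function. The function name predict_total is hypothetical. Evaluating it at the mean Quantity, Price, and Discount Applied (%) from Section 2.4 should give a value close to the mean Total Amount of about 248.34, because a least-squares fit with an intercept passes through the point of means.

```python
# Coefficients copied from the multiple regression summary output.
INTERCEPT = -220.28597
B_QUANTITY = 49.3895844
B_PRICE = 4.51274861
B_DISCOUNT = -2.7237693

def predict_total(quantity, price, discount_pct):
    """Predicted Total Amount from the fitted multiple regression equation."""
    return (INTERCEPT
            + B_QUANTITY * quantity
            + B_PRICE * price
            + B_DISCOUNT * discount_pct)

# Mean values reported in the descriptive analysis (Section 2.4).
estimate = predict_total(quantity=5.009, price=55.07, discount_pct=10.02)
print(f"predicted Total Amount at the means: {estimate:.2f}")
```

The large negative intercept only has meaning together with the other terms; it does not imply negative sales on its own, since a transaction with zero quantity and zero price is outside the range of the data.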

3.2.3 Overview of the Clustering Model

Inputs for Clustering

Variables: Customer ID, Sum of Price, Sum of Quantity, Sum of Total Amount
Level of Detail: Customer ID
Scaling: Normalized

Summary Diagnostics

Number of Clusters: 4
Number of Points: 99993
Between-group Sum of Squares: 17282.0
Within-group Sum of Squares: 5763.4
Total Sum of Squares: 23045.0

                                   Centers                                                   Most Common
Clusters         Number of Items   Sum of Price   Sum of Quantity   Sum of Total Amount   Customer ID
Cluster 1        24400             78.257         7.3885            517.28                830710
Cluster 2        25721             76.543         2.9444            199.12                762342
Cluster 3        27598             32.535         6.9257            200.8                 230132
Cluster 4        22274             32.784         2.4131            69.458                778286
Not Clustered    7

Analysis of Variance:

                                                   Model                     Error
Variable               F-statistic   p-value   Sum of Squares   DF   Sum of Squares   DF
Sum of Total Amount    2.547e+04     0.0       3301.0           3    4320.0           99989
Sum of Quantity        2.501e+04     0.0       7801.0           3    1.04e+04         99989
Sum of Price           2.474e+04     0.0       6180.0           3    8327.0           99989

Categorical variables are not included in the Analysis of Variance table.

Objective: The clustering model aims to group customers based on their purchasing behavior

using variables such as total amount spent, quantity purchased, and price per item.

Input Variables

1. Customer ID: Unique identifier for each customer.

2. Sum of Price: Sum of the item prices across each customer's transactions.

3. Sum of Quantity: Total quantity purchased by each customer.

4. Sum of Total Amount: Total amount spent by each customer.

Model Details

Number of Clusters: 4

Number of Points: 99,993 (distinct customers)



Scaling: Normalized (ensures all variables are on the same scale for fair comparison)

Diagnostic Summary

Total Sum of Squares: 23,045.0

Between-group Sum of Squares: 17,282.0

Within-group Sum of Squares: 5,763.4

These metrics indicate how much variance is explained by the clusters (between-group) versus how much variance remains within each cluster (within-group).
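
One way to read these three numbers is as a share of variance explained, computed as the between-group sum of squares divided by the total sum of squares. The short check below uses the values reported above; only the variable names are introduced here.

```python
# Values copied from the clustering summary diagnostics.
between_group_ss = 17282.0
within_group_ss = 5763.4
total_ss = 23045.0

explained_share = between_group_ss / total_ss
print(f"share of variance explained by the clusters: {explained_share:.1%}")

# The two components should add up to the total, apart from rounding.
print(f"between + within = {between_group_ss + within_group_ss:.1f}")
```

A share of roughly three quarters suggests that the four clusters capture most of the variation in the normalized inputs.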

Cluster Characteristics

Each cluster is characterized by its centroid (center) values for the variables:

Cluster 1:

Number of Items: 24400
Center, Sum of Price: 78.257
Center, Sum of Quantity: 7.3885
Center, Sum of Total Amount: 517.28
Most Common Customer ID: 830710

Cluster 2:

Number of Items: 25721
Center, Sum of Price: 76.543
Center, Sum of Quantity: 2.9444
Center, Sum of Total Amount: 199.12
Most Common Customer ID: 762342

Cluster 3:

Number of Items: 27598
Center, Sum of Price: 32.535
Center, Sum of Quantity: 6.9257
Center, Sum of Total Amount: 200.8
Most Common Customer ID: 230132

Cluster 4:

Number of Items: 22274
Center, Sum of Price: 32.784
Center, Sum of Quantity: 2.4131
Center, Sum of Total Amount: 69.458
Most Common Customer ID: 778286

Analysis of Variance (ANOVA)

ANOVA helps understand the significance of each variable in explaining the variance between

clusters:

Sum of Total Amount, Sum of Quantity, and Sum of Price have low p-values (0.0), indicating

they significantly contribute to clustering.



F-statistic values are high, suggesting strong differentiation between clusters based on these

variables.

Interpretation

Cluster 1 includes high-spending customers who buy higher-priced items in relatively large quantities.

Cluster 2 consists of customers who buy higher-priced items in small quantities, resulting in moderate spending.

Cluster 3 consists of customers who buy lower-priced items in larger quantities, with moderate spending similar to Cluster 2.

Cluster 4 includes customers with the lowest spending and quantity.

Conclusion

This clustering model effectively segments customers into meaningful groups based on their

purchasing behavior. It provides insights that can help in tailoring marketing strategies, inventory

management, and customer engagement efforts to different customer segments. Significant

ANOVA results and clear differentiation between clusters on key variables support the model's

effectiveness.
4.0 Analysis Questions

4.1 Revenue Contribution by Category

The Books category contributes the most to total revenue, at 25.20%.

4.2 Seasonal Sales Trends for Books

Books sales were highest in July 2023, with a sum of 546430.9143, and lowest in April 2023, with a sum of 17717.50448.

4.3 Seasonal Sales Trends for Home Decor and Clothing

Home Decor and Clothing experienced their combined peak sales in January 2024, with a combined sum of 1072076.825.

4.4 Electronics Sales Trends

Electronics sales peaked in October 2023, with a sum of 546847.76, and were lowest in April 2023, with a sum of 15771.33343.

4.5 Payment Method Preferences


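By average transaction amount, Debit Card ranks highest (249.21), although the differences between methods are small. By transaction count (Section 3.1.6), PayPal is used most often (25067 transactions). The average Total Amount for each payment method is shown in the pivot table below.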

Payment Method    Average of TotalAmount
Cash              248.1991428
Credit Card       247.8163577
Debit Card        249.2070521
PayPal            248.1251381
Grand Total       248.3356354

4.6 Impact of Discounts on Sales


The correlation coefficient of approximately -0.08781421 suggests a weak negative correlation between Total Amount (or Total Revenue) and Discount Applied. This result can be interpreted as follows:

• Weak Negative Correlation: A correlation coefficient close to zero (in this case, negative but close to zero) indicates only a slight tendency for Total Amount to decrease as Discount Applied increases. The relationship is not strong.

• Interpretation:

  o Impact on Sales: As discounts increase, there is a very slight tendency for Total Amount (or Total Revenue) to decrease. This could mean that while discounts might attract more transactions, they might not significantly increase the total revenue generated per transaction.
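
For reference, a Pearson correlation coefficient of this kind can be computed as shown in the sketch below. The discount and total values are hypothetical and are not taken from the dataset; only the formula is the standard one.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical transactions: discount applied (%) and total amount.
discounts = [0, 5, 10, 15, 20]
totals = [250, 262, 240, 255, 238]
r = pearson_r(discounts, totals)
print(f"r = {r:.4f}")
```

Values near +1 or -1 indicate a strong linear relationship, while values near zero, such as the one reported above, indicate that discount level alone says little about transaction size.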

4.7 Total Quantity Bought by Customers

The peak period for total quantities bought by customers is July 2023, with a total quantity of 43051, and the low period is April 2023, with a total quantity of 1240.

Quantity Changes Towards Year-end:

• Increase Towards Year-end: There is a noticeable increase in quantities bought towards the end of the year, particularly in December 2023 and January 2024. This trend may be influenced by seasonal factors such as year-end sales, holidays, and gift-giving occasions.

5.0 Conclusion

This project aimed to derive insights from a comprehensive retail transaction dataset, featuring

key columns such as Product Category, Quantity, Price, and more. Data cleaning involved

removing rows with missing data and detecting outliers using both the Z-score method (threshold

±3) and the IQR method (range ±1.5 * IQR). Descriptive analysis was conducted using SQL to

generate statistics such as mean, median, mode, standard deviation, and IQR for numeric

columns. Various visualizations were created to analyze trends and distributions, including time

series of total sales over time on a monthly basis, total revenue by month and product category,

and payment method preferences. Predictive analytics included performing linear regression on

total amount based on quantity, as well as multiple regression using quantity, price, and discount

applied. Cluster analysis was conducted using Tableau to segment customers based on quantity,

price, and total amount. These comprehensive analyses provided valuable insights into seasonal

sales trends, the impact of discounts on sales, and customer buying patterns, ultimately informing

better retail management decisions.

6.0 References

Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Pearson Addison Wesley.

Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis (5th ed.). Wiley.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. MIT Press.
