Final Project
2024.06.16
Table Of Contents
1.1 Background
2.0 Methodology
1.0 Introduction
1.1 Background
In the highly competitive retail industry, it is imperative to utilize data to inform strategic
choices. To identify important factors influencing overall sales, this project will investigate and
evaluate sales data from a retail location. The store can create focused advertising campaigns,
manage inventory efficiently, and optimize pricing strategies by being aware of these variables.
Category, Discount Applied (%), and Total Amount are among the attributes included in the
dataset. Using this data, we hope to derive practical insights that will improve overall
business performance.
This study carries out a thorough analysis of retail sales data. To ensure data quality and
accuracy, it includes extensive data cleaning and outlier detection using both Z-score and IQR
approaches. While predictive studies, such as simple and multiple regression
models, are carried out to discover and quantify important predictors of total sales, descriptive
statistical analyses are used to summarize and characterize the significant elements of the data.
To more effectively convey findings, the study makes use of data visualization tools like scatter
plots, line charts, and pivot tables. To further delve into the purchase habits of customers,
Tableau clustering analysis is also employed to segment customers and transactions. These
analyses' findings are applied to several analytical problems, assisting in the formulation of
2.0 Methodology
First things first, we gathered our dataset from the Professor's reliable source. This dataset is
packed with essential columns like Product Category, Quantity, Price, and more, giving us a
Next, we cleaned up our data to ensure accuracy and reliability. We identified and removed any
rows that had missing data. This step is crucial because clean data is the foundation of any
meaningful analysis.
After cleaning, we turned our attention to outlier detection. Outliers are data points that
significantly differ from other observations, potentially skewing our analysis. We used two
The Z-score method helped us pinpoint outliers based on how many standard deviations they
deviate from the mean. This method is effective for identifying extreme values in columns like
We also employed the Interquartile Range (IQR) method, which focuses on the spread of data
within its quartiles. It's particularly useful for detecting outliers in skewed distributions. We
applied this method to the same columns as the Z-score for a thorough analysis.
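The two outlier rules described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually run for the report: the function names and the sample values are hypothetical, and the cut-offs (|z| > 3 and 1.5 * IQR) are the ones stated in the conclusion of this report.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []
    return [v for v in values if abs((v - mean) / stdev) > threshold]

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical transaction amounts: 20 ordinary values plus two unusual ones.
amounts = list(range(100, 300, 10)) + [450, 5000]

print("Z-score outliers:", zscore_outliers(amounts))
print("IQR outliers:", iqr_outliers(amounts))
```

On this made-up sample the IQR rule flags both 450 and 5000, while the Z-score rule flags only 5000, because the extreme value inflates the standard deviation. This is one reason the two methods can disagree on skewed data.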
To give a clear picture, we created a comparison table showing the results from both the Z-score
and IQR methods. This helped us assess which method was more suitable for our dataset and
With our data cleaned and outliers identified, we moved on to descriptive analytics on all
numeric data, using SQL to calculate the mean, median, standard deviation, mode, and IQR. This
Fig 2.4
1. Mean:
CustomerID: The mean CustomerID is about 500,473. This tells us the average ID number in
the dataset.
Quantity: The mean quantity purchased per transaction is approximately 5.009 items. This
Price: The mean price per item is around 55.07 currency units. This shows us the average cost
Total Amount: The mean total transaction amount is about 248.34 currency units. This
2. Median:
CustomerID: The median CustomerID is 499,679. This means half of the CustomerIDs are
Quantity: The median quantity purchased is 5 items. This shows the middle value in the range
of quantities purchased.
Price: The median price per item is 55.12 currency units. This represents the central price
Discount Applied (%): The median discount applied is 10.03%. This indicates the middle value
Total Amount: The median total transaction amount is 200.35 currency units. This represents
3. Mode:
CustomerID: The mode (most frequent CustomerID) is 340,516. This ID appears more often
Quantity: The mode quantity purchased is 7 items. This is the quantity that appears most
frequently in transactions.
Price: There is no mode provided for Price (#N/A). This suggests that prices are evenly
Discount Applied (%): There is no mode provided for Discount Applied (%) (N/A). This
Total Amount: The mode total transaction amount is 101.72 currency units. This is the
4. Standard Deviation:
CustomerID: The standard deviation of CustomerID is 288,461.21. This shows the extent of
Quantity: The standard deviation of quantity is 2.58. This indicates how much the quantity
Price: The standard deviation of price is 25.97. This shows the amount of variation in prices,
Discount Applied (%): The standard deviation of the discount applied is 5.78. This shows
Total Amount: The standard deviation of the total transaction amount is 184.55. This indicates
variability in transaction sizes, with some transactions being larger than others.
5. IQR:
CustomerID: The IQR for CustomerID is 500,442.5. This represents the range of the middle
Quantity: The IQR for quantity is 4. This shows the range of the middle 50% of quantity
Price: The IQR for the price is 44.91 currency units. This shows the range of the middle 50%
Discount Applied (%): The IQR for the discount applied is 10.02%. This shows the range of
the middle 50% discount percentage values, indicating how discounts are distributed.
Total Amount: The IQR for the total transaction amount is 266.85 currency units. This shows
the range of the middle 50% of transaction amounts, indicating the spread of purchase sizes.
These statistical measures help to understand the central tendency, variability, and distribution of
data points in each category, providing insights into customer behavior and transaction patterns.
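As a rough illustration of the five measures listed above, the following Python sketch computes them for a single numeric column. The report itself used SQL for this step, so this is a stand-in rather than the original query; the `describe` helper and the sample quantities are hypothetical.

```python
import statistics

def describe(values):
    """Return mean, median, mode, sample standard deviation, and IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    modes = statistics.multimode(values)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        # Report a mode only when one value is strictly the most frequent;
        # otherwise (for example, all values unique) there is no single mode.
        "mode": modes[0] if len(modes) == 1 else None,
        "stdev": statistics.stdev(values),
        "iqr": q3 - q1,
    }

# Hypothetical Quantity values for eight transactions.
quantities = [1, 3, 5, 5, 7, 7, 7, 9]
summary = describe(quantities)
for name, value in summary.items():
    print(name, value)
```

When every value in a column occurs exactly once, the helper returns no mode, which corresponds to the N/A entries reported for Price and Discount Applied (%).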
These statistics provided us with a snapshot of the distribution and central tendency of our
Now, onto the exciting part—predictive analytics! Here, we applied statistical models to forecast
We started with linear regression to explore relationships between variables, particularly how the
Total Amount is influenced by Quantity. This analysis helped us create a predictive model to
Moving beyond simple relationships, multiple regression expanded our analysis to include
Quantity, Price, and Discount Applied (%). This model allowed us to delve deeper into factors
affecting total sales amounts and compare its predictive power with linear regression.
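To make the simple regression step concrete, here is a minimal least-squares sketch with one predictor. It assumes ordinary least squares, which is what spreadsheet regression tools use; the function name and the six data points are hypothetical and are not taken from the project dataset. Multiple regression applies the same least-squares idea with several predictors at once.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares and return R Square too."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R Square = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical Quantity and Total Amount pairs.
quantity = [1, 2, 3, 4, 5, 6]
total_amount = [60.0, 90.0, 170.0, 180.0, 280.0, 300.0]

intercept, slope, r_square = fit_simple_ols(quantity, total_amount)
print(f"Total Amount = {intercept:.2f} + {slope:.2f} * Quantity, R Square = {r_square:.4f}")
```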
Lastly, we employed Tableau for cluster analysis—a powerful tool for customer segmentation.
identified distinct customer segments. This insight can guide targeted marketing strategies and
3.0 Results
Months            Sum of TotalAmount
(no label)        2281.862742
April 2023        62363.07835
April 2024        1913643.087
August 2023       2115917.843
December 2023     2106554.83
February 2024     1997204.319
January 2024      2146268.002
July 2023         2106169.324
June 2023         2008137.229
March 2024        2094156.127
May 2023          2098160.608
November 2023     2062281.273
October 2023      2088945.35
September 2023    2028997.247
Grand Total       24831080.18
[Figure: Sum of TotalAmount by Month for each Product Category (Books, Clothing, Electronics, Home Decor, (blank)); filter: StoreLocation (All)]
Product Category    Sum of TotalAmount
Books               25.20%
Clothing            24.99%
Electronics         24.96%
Home Decor          24.86%
Grand Total         100.00%
Payment Method    Var of TotalAmount
Cash              33997.64049
Credit Card       33656.27435
Debit Card        34146.93217
PayPal            34442.60253
Grand Total       34060.14074
Row Labels     Count of PaymentMethod
Cash           25003
Credit Card    25032
Debit Card     24888
PayPal         25067
Grand Total    99990
Months         Sum of TotalAmount
January        2128316.899
February       1973154.117
March          2107835.972
April          1939190.32
May            2097929.672
June           2066364.823
July           2132550.518
August         2109279.968
September      2050334.601
October        2049450.698
November       2051277.297
December       2125395.296
Grand Total    24831080.18
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.69183863
R Square             0.47864069
Adjusted R Square    0.47863547
Standard Error       133.258205
Observations         99990

ANOVA
              df    SS         MS         F          Significance F
Regression    1     1630077    1630077    91795.2    0
Let's evaluate and explain the regression model based on the provided summary output:
Regression Statistics
1. Multiple R:
Value: 0.6918
Interpretation: This value represents the correlation between Quantity and Total Amount. The
2. R Square:
Value: 0.4786
Interpretation: This value indicates that approximately 47.86% of the variance in Total Amoun
3. Adjusted R Square:
Value: 0.4786
Interpretation: This adjusted value accounts for the number of predictors in the model. Since
4. Standard Error:
Value: 133.2582
Interpretation: This value represents the standard deviation of the regression errors. It indicates
the average distance the observed values fall from the regression line.
5. Observations:
Value: 99,990
Interpretation: This is the number of observations (data points) used in the regression analysis.
ANOVA
1. Regression:
F: 91,795.2823 (F-statistic)
Significance F: 0 (p-value)
Interpretation: The regression model is statistically significant, as indicated by the very low p-
value (0). This means that the relationship between Quantity and Total Amount is highly
2. Residual:
3. Total:
Coefficients
1. Intercept:
Coefficient: 0.4164
t Stat: 0.4524
P-value: 0.6510
Interpretation: The intercept is not statistically significant (p-value > 0.05), indicating that the
2. Quantity:
Coefficient: 49.4919
t Stat: 302.9774
P-value: 0
Interpretation: The coefficient for Quantity is statistically significant (p-value < 0.05),
indicating a significant positive relationship between Quantity and Total Amount. For each
additional unit of Quantity, the Total Amount increases by approximately 49.49 units.
Model Significance: The model is statistically significant as indicated by the F-test with a
significance F value of 0. This suggests that the model provides a better fit than a model with no
predictors.
Fit: With an R Square of 0.4786, the model explains about 47.86% of the variability in Total
Amount. While this is a substantial proportion, 52.14% of the variability is not explained by
Quantity alone.
Conclusion
The regression analysis suggests a moderately strong positive correlation (Multiple R of 0.6918)
between Quantity and Total Amount, with Quantity being a significant predictor of Total Amount.
The model explains about
47.86% of the variance in Total Amount, indicating that other factors not included in this model
may also be influencing the total amount. The intercept is not significant, which implies that it
Let's evaluate and compare the multiple regression results with the previous linear regression:
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.94298891
R Square             0.88922808
Adjusted R Square    0.88922476
Standard Error       61.4249158
Observations         99990

ANOVA
              df       SS            MS            F             Significance F
Regression    3        3028390207    1009463402    267547.834    0
Residual      99986    377249205     3773.02028
Total         99989    3405639412

                       Coefficients    Standard Error    t Stat        P-value    Lower 95%     Upper 95%     Lower 95.0%
Intercept              -220.28597      0.6807717         -323.58274    0          -221.62028    -218.95167    -221.622
Quantity               49.3895844      0.07529704        655.929972    0          49.2420031    49.5371657    49.24203
Price                  4.51274861      0.00747942        603.355509    0          4.49808904    4.52740818    4.498090
DiscountApplied(%)     -2.7237693      0.03361059        -81.039011    0          -2.7896457    -2.657893     -2.78965
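The regression statistics in the summary output follow directly from the ANOVA sums of squares, and it can be useful to check them by hand. The short Python sketch below does that arithmetic, reading the regression and residual sums of squares from the output above as 3,028,390,207 and 377,249,205. The variable names are ours, and the formulas are the standard least-squares definitions rather than anything specific to the tool that produced the output.

```python
import math

n = 99990                    # observations
k = 3                        # predictors: Quantity, Price, Discount Applied (%)
ss_regression = 3028390207   # regression sum of squares from the ANOVA table
ss_residual = 377249205      # residual sum of squares from the ANOVA table
ss_total = ss_regression + ss_residual

# Share of the variance in Total Amount explained by the model
r_square = ss_regression / ss_total
# Adjusted R Square penalizes the number of predictors
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
# Mean squares and the F-statistic
ms_regression = ss_regression / k
ms_residual = ss_residual / (n - k - 1)
f_statistic = ms_regression / ms_residual
# Standard error of the regression
standard_error = math.sqrt(ms_residual)

print(round(r_square, 4), round(adj_r_square, 4))
print(round(f_statistic, 1), round(standard_error, 4))
```

The computed values agree with the reported R Square (0.8892), Adjusted R Square (0.8892), F (about 267,548), and Standard Error (61.4249) up to rounding.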
Regression Statistics:
1. Multiple R: 0.9430
Interpretation: Indicates a very strong positive correlation between the independent variables
(Quantity, Price, Discount Applied) and the dependent variable (Total Amount).
2. R Square: 0.8892
Interpretation: About 88.92% of the variance in Total Amount can be explained by the
3. Adjusted R Square: 0.8892
Interpretation: Adjusted R Square is very close to R Square, indicating that the model does not
4. Standard Error: 61.4249
Interpretation: This value represents the standard deviation of the regression errors, indicating
the average distance that the observed values fall from the regression line.
5. Observations: 99,990
Interpretation: The number of observations (data points) used in the regression analysis.
ANOVA:
1. Regression:
F: 267,547.834 (F-statistic)
Significance F: 0 (p-value)
Interpretation: The regression model is statistically significant, as indicated by the very low p-
value (0). This suggests that the relationship between the independent variables and the Total
2. Residual:
3. Total:
Coefficients:
1. Intercept:
Coefficient: -220.2860
t Stat: -323.5827
P-value: 0
from zero.
2. Quantity:
Coefficient: 49.3896
t Stat: 655.9300
P-value: 0
3. Price:
Coefficient: 4.5127
t Stat: 603.3555
P-value: 0
4. Discount Applied (%):
Coefficient: -2.7238
t Stat: -81.0390
P-value: 0
1. Model Significance:
- Both models are statistically significant, but the multiple regression model explains more
variance in Total Amount (R Square of 0.8892) compared to the simple linear regression model
2. Predictor Significance:
- In the multiple regression model, all predictors (Quantity, Price, and Discount Applied (%))
Quantity was the sole predictor in the linear regression model, and it was significant.
3. Fit:
- The multiple regression model fits data better than the standard linear regression model, as
4. Coefficients:
The coefficient for Quantity in the multiple regression model is slightly different (49.39)
compared to the simple linear regression model (49.49). This change indicates that Quantity's
effect on Total Amount is adjusted when considering Price and Discount Applied (%).
The multiple regression model incorporates additional coefficients for Price and Discount
Applied (%), which were not considered in the simple linear regression model.
Conclusion:
When compared to the simple linear regression model, the multiple regression model fits the data
better and accounts for a larger percentage of the variance in Total Amount. Total Amount can
be predicted with considerable accuracy by Quantity, Price, and Discount Applied (%). Total
Amount is positively impacted by Quantity and Price, and negatively impacted by Discount
Applied (%).
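To show how the two fitted models are used, the sketch below plugs the coefficients reported above into each regression equation. The function names and the example transaction (5 items at a price of 55.00 with a 10% discount, roughly the dataset averages) are hypothetical choices for illustration.

```python
def predict_simple(quantity):
    """Simple regression: Total Amount from Quantity only."""
    return 0.4164 + 49.4919 * quantity

def predict_multiple(quantity, price, discount_pct):
    """Multiple regression: Total Amount from Quantity, Price, and Discount Applied (%)."""
    return -220.2860 + 49.3896 * quantity + 4.5127 * price - 2.7238 * discount_pct

# Hypothetical transaction near the dataset averages.
simple_estimate = predict_simple(5)
multiple_estimate = predict_multiple(5, 55.0, 10.0)

print(f"Simple model:   {simple_estimate:.2f}")
print(f"Multiple model: {multiple_estimate:.2f}")
```

For an average transaction both models give an estimate close to the mean Total Amount of about 248. The models diverge for transactions whose price or discount is far from average, which is where the extra predictors matter.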
Summary Diagnostics
Number of Clusters: 4
Number of Points: 99993
Between-group Sum of Squares: 17282.0
Within-group Sum of Squares: 5763.4
Total Sum of Squares: 23045.0
                                   Centers                                                  Most Common
Clusters         Number of Items   Sum of Price   Sum of Quantity   Sum of Total Amount   Customer ID
Cluster 1        24400             78.257         7.3885            517.28                830710
Cluster 2        25721             76.543         2.9444            199.12                762342
Cluster 3        27598             32.535         6.9257            200.8                 230132
Cluster 4        22274             32.784         2.4131            69.458                778286
Not Clustered    7

Analysis of Variance:
                                                 Model                  Error
Variable               F-statistic   p-value   Sum of Squares   DF   Sum of Squares   DF
Sum of Total Amount    2.547e+04     0.0       3301.0           3    4320.0           99989
Sum of Quantity        2.501e+04     0.0       7801.0           3    1.04e+04         99989
Sum of Price           2.474e+04     0.0       6180.0           3    8327.0           99989
Objective: The clustering model aims to group customers based on their purchasing behavior
using variables such as total amount spent, quantity purchased, and price per item.
Input Variables
Model Details
Number of Clusters: 4
Scaling: Normalized (ensures all variables are on the same scale for fair comparison)
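Tableau's clustering feature is based on k-means, so the general idea behind these model details can be sketched with a plain k-means loop on normalized inputs. The code below is a simplified stand-in, not Tableau's implementation: the min-max scaling, the random initialization, the fixed iteration count, and the eight sample rows (price, quantity, total amount) are all assumptions made for illustration.

```python
import random

def min_max_normalize(rows):
    """Rescale each column to the range [0, 1] so no variable dominates."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Component-wise mean of a list of points."""
    return [sum(c) / len(points) for c in zip(*points)]

def k_means(points, k, iterations=50, seed=0):
    """Plain k-means: assign points to the nearest center, then recompute centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = mean_point(members)
    return labels, centers

def sums_of_squares(points, labels, centers):
    """Return (between-group, within-group) sums of squares."""
    grand = mean_point(points)
    within = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    between = sum(sq_dist(centers[lab], grand) for lab in labels)
    return between, within

# Hypothetical rows: (price, quantity, total amount)
rows = [
    (78.0, 7, 546.0), (80.0, 8, 640.0), (76.0, 3, 228.0), (75.0, 2, 150.0),
    (32.0, 7, 224.0), (33.0, 6, 198.0), (30.0, 2, 60.0), (35.0, 3, 105.0),
]
points = min_max_normalize(rows)
labels, centers = k_means(points, k=4)
between, within = sums_of_squares(points, labels, centers)
print(labels)
print(round(between, 4), round(within, 4))
```

The between-group and within-group sums of squares printed at the end are the same kind of quantities reported in the Summary Diagnostics; they always add up to the total sum of squares around the overall mean.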
Diagnostic Summary
These metrics indicate how much variance is explained by the clusters (between-group) versus
Cluster Characteristics
Each cluster is characterized by its centroid (center) values for the variables:
Cluster 1:
Cluster 2:
Cluster 3:
Cluster 4:
ANOVA helps understand the significance of each variable in explaining the variance between
clusters:
Sum of Total Amount, Sum of Quantity, and Sum of Price have low p-values (0.0), indicating
F-statistic values are high, suggesting strong differentiation between clusters based on these
variables.
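The reported diagnostics can be checked with a little arithmetic. The sketch below uses only the figures from the Summary Diagnostics and Analysis of Variance output; the helper name is ours, and the F-statistic formula is the usual one-way ANOVA ratio of model mean square to error mean square.

```python
between_ss = 17282.0   # between-group sum of squares
total_ss = 23045.0     # total sum of squares

# Fraction of the total (normalized) variance explained by the four clusters
explained_share = between_ss / total_ss

def f_statistic(model_ss, model_df, error_ss, error_df):
    """One-way ANOVA F: model mean square divided by error mean square."""
    return (model_ss / model_df) / (error_ss / error_df)

f_total_amount = f_statistic(3301.0, 3, 4320.0, 99989)
f_quantity = f_statistic(7801.0, 3, 1.04e4, 99989)
f_price = f_statistic(6180.0, 3, 8327.0, 99989)

print(f"Explained share: {explained_share:.3f}")
print(f"F (Total Amount): {f_total_amount:.0f}")
print(f"F (Quantity):     {f_quantity:.0f}")
print(f"F (Price):        {f_price:.0f}")
```

The clusters account for roughly 75% of the total sum of squares, and the recomputed F-statistics match the reported values to within the rounding of the displayed sums of squares.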
Interpretation
Cluster 1 seems to include high-spending customers who buy relatively large quantities.
Conclusion
This clustering model effectively segments customers into meaningful groups based on their
purchasing behavior. It provides insights that can help in tailoring marketing strategies, inventory
ANOVA results and clear differentiation between clusters on key variables support the model's
effectiveness.
The Books category contributes the most to total revenue, with a 25.20% percentage
July 2023 with a sum of 546430.9143 Books showed higher sales and April 2023 with a sum of
Home Decor and Clothing categories experienced peak sales in January 2024 with 1072076.825.
Electronics show a notable peak in sales in October 2023 with a sum of 546847.76 and April
The preferred method is a Debit Card, and the average of all payment methods is shown in the
Average of
Cash 248.1991428
PayPal 248.1251381
between the Total Amount (or Total Revenue) and Discount Applied. Here’s how you can
Weak Negative Correlation: A correlation coefficient close to zero (in this case,
negative but close to zero) indicates a slight tendency for the Total Amount to decrease
Interpretation:
tendency for Total Amount (or Total Revenue) to decrease. This could mean that
while discounts might attract more transactions, they might not significantly
The peak period for total quantities bought by customers is July 2023, with a total quantity of
43051, and the low period is April 2023, with a total quantity of 1240.
the end of the year, particularly in December 2023 and January 2024. This trend may be
influenced by seasonal factors such as year-end sales, holidays, and gift-giving occasions.
5.0 Conclusion
This project aimed to derive insights from a comprehensive retail transaction dataset, featuring
key columns such as Product Category, Quantity, Price, and more. Data cleaning involved
removing rows with missing data and detecting outliers using both the Z-score method (threshold
±3) and the IQR method (values more than 1.5 * IQR below Q1 or above Q3). Descriptive analysis
was conducted using SQL to
generate statistics such as mean, median, mode, standard deviation, and IQR for numeric
columns. Various visualizations were created to analyze trends and distributions, including a
monthly time series of total sales, total revenue by month and product category,
and payment method preferences. Predictive analytics included performing linear regression on
total amount based on quantity, as well as multiple regression using quantity, price, and discount
applied. Cluster analysis was conducted using Tableau to segment customers based on quantity,
price, and total amount. These comprehensive analyses provided valuable insights into seasonal
sales trends, the impact of discounts on sales, and customer buying patterns, ultimately informing