Retail Market Analysis: Ke Yuan, Yaoxin Liu, Shriyesh Chandra, Rishav Roy New York University
// Add the holiday feature
val newDs3 = newDs2
  .withColumn("is_holiday", isHoliday)

day_of_week and is_weekend: The weekend also has a significant effect on consumer decisions because people generally have more time to shop on weekends. Therefore,

At this stage, we have 9 features for future analysis, which encapsulate key information from the original date feature.

3.2. Prediction

In this problem, the objective is to predict the trends of keywords using several time-related features. However, the
trends at a specific point in time are not always strongly influenced by past trends, making it challenging for traditional time series models to capture the relationship. As a result, a decision tree model is selected, as it can effectively handle the complexity of non-linear relationships and is less dependent on the sequential nature of past data compared to other time series-based machine learning models.

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
\]

The dataset is split into a training set and a testing set, with 80% used for training and 20% for testing. RMSE is employed as the evaluation metric.

val Array(trainingData, testData) =
  data.randomSplit(
    Array(0.8, 0.2), seed = 1234L
  )

A separate machine learning model was trained for each keyword, with each model operating independently. This approach was chosen because there is no significant or direct correlation between different keywords. If a single model were used to perform multi-target predictions for all keywords, it could lead to challenges such as increased model complexity, interference between targets, and ultimately suboptimal prediction performance. By training separate models, we aim to better capture the unique patterns and characteristics of each keyword, improving the overall accuracy and reliability of the predictions.

// Train and evaluate the model
// for each keyword label
keywordLabels.foreach { label =>
  val dt = new DecisionTreeRegressor()
    .setLabelCol(label)
    .setFeaturesCol("features")
  // Make predictions
  val predictions = model
    .transform(testData)
  // Evaluate the model using RMSE
  val evaluator = ...
    .setLabelCol(label)
    .setPredictionCol("prediction")
    .setMetricName("rmse")
  ......
}

3.3. Result

For the results, we calculate the RMSE on the test set and the predicted value for each keyword on 2024-12-31. Below is a portion of the results:

Table 1. Part of results

Keywords              RMSE      Prediction on 2024-12-31
Winter Coat           2.2654    10.0
Swimsuit              2.6990    16.0
Thanksgiving Turkey   21.7228   65.0
Easter Eggs           20.7649   6.1071

The full set of results can be found here.

A comprehensive analysis of the RMSE values for all predictions indicates that the predictive model demonstrates a high level of accuracy in this task. The average RMSE across all test cases is 2.6561, demonstrating the model's reliability and precision in capturing the underlying patterns in the data.

Table 2. RMSE

Min    Max       Mean     Variance
0.0    21.7228   2.6561   11.2749

Keywords with RMSE less than 3.0 and predicted trends value greater than 80.0 are selected for future analysis: soap (trends = 84), shampoo (trends = 92.43), camera (trends = 92), smartphone (trends = 92.9), vacuum (trends = 91), microwave (trends = 90.9) and laptop (trends = 90.1). Figure 3 shows the distribution of predicted values on 2024-12-31.

Figure 3. Distribution of Predicted Values

4. Instacart Market Basket Analysis

Instacart, a grocery order and delivery app with stores in the United States and Canada, provides a user experience where customers get product recommendations based
on their previous orders. Instacart provided transactional data on customer orders over time.

The Instacart data is split and normalized into 6 datasets, each containing key information on orders and products purchased.

1. aisles.csv - details of aisles
2. departments.csv - details of departments
3. products.csv - details of a product. Each product is associated with an aisle and a department
4. orders.csv - order details placed by any user
5. order_products__prior.csv - all product details for any prior order, that is, the order history of every user
6. order_products__train.csv - all product details for a train order, that is, current order data for every user. These data contain only 1 order per user.

5. Remove any null records created from all the joins
6. Remove all the ID columns and the eval_set column from the final dataset

The resultant denormalized dataset is saved in parquet format for future analysis.

4.3. Market Basket Analysis

The cleaned and denormalized data set has the following schema:
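The denormalization described in this section (joining the order, product, aisle and department files, dropping rows with nulls introduced by the joins, removing the ID and eval_set columns, and writing Parquet) could be sketched roughly as follows. This is a minimal illustration on toy in-memory tables, not the project's actual code: the join order, the local SparkSession, the sample rows and the output path are all assumptions, and the column names are assumed to follow the public Instacart CSV headers.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for illustration; on a cluster `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("instacart-denormalize-sketch")
  .getOrCreate()
import spark.implicits._

// Toy stand-ins for the CSV files (real data would come from spark.read.csv).
val aisles = Seq((1, "fresh fruits"), (2, "yogurt")).toDF("aisle_id", "aisle")
val departments = Seq((1, "produce"), (2, "dairy eggs"))
  .toDF("department_id", "department")
val products = Seq((10, "Banana", 1, 1), (11, "Greek Yogurt", 2, 2))
  .toDF("product_id", "product_name", "aisle_id", "department_id")
val orders = Seq((100, 7, "prior", 1, 9), (101, 7, "train", 0, 14))
  .toDF("order_id", "user_id", "eval_set", "order_dow", "order_hour_of_day")
// Product 99 has no match in `products`, so the joins create nulls for it.
val orderProducts = Seq((100, 10, 1), (100, 11, 0), (101, 10, 1), (101, 99, 0))
  .toDF("order_id", "product_id", "reordered")

// Join everything into one wide table.
val joined = orderProducts
  .join(orders, Seq("order_id"), "left")
  .join(products, Seq("product_id"), "left")
  .join(aisles, Seq("aisle_id"), "left")
  .join(departments, Seq("department_id"), "left")

// Drop rows with nulls from the joins, then the ID columns and eval_set.
val idCols = joined.columns.filter(_.endsWith("_id"))
val denormalized = joined
  .na.drop()
  .drop(idCols :+ "eval_set": _*)

denormalized.printSchema()
// Hypothetical output path; uncomment to persist the result.
// denormalized.write.mode("overwrite").parquet("/tmp/instacart_denormalized.parquet")
```

Using left joins followed by `na.drop()` mirrors the "remove null records created from the joins" step; inner joins would give the same rows here but would hide how many records were unmatched.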
product departments/aisles. The analysis has been visualized as a pie chart indicating the percentage of total orders per department/aisle. Since there are over 100 aisles, we pick the top 25 aisles for the pie chart graphic.

Produce and dairy eggs are the most common departments. Fresh fruits and vegetables are the most common product aisles.
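The percentages behind such a pie chart can be computed with a simple group-and-count. The sketch below is an assumption-laden illustration: the `department` column name, the toy rows and the local SparkSession are hypothetical, and the same pattern would apply to an `aisle` column with the top 25 rows kept.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("department-share-sketch")
  .getOrCreate()
import spark.implicits._

// Toy data: one row per ordered product, labelled with its department.
val orderedItems = Seq("produce", "produce", "dairy eggs", "snacks")
  .toDF("department")

val total = orderedItems.count().toDouble

// Share of all ordered items per department, largest first.
val share = orderedItems
  .groupBy("department")
  .count()
  .withColumn("percent", round(col("count") / total * 100, 2))
  .orderBy(desc("count"))

// Keep only the largest groups for plotting (25 in the aisle chart).
val topN = 25
share.limit(topN).show()
```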
5.2. Data Processing

During the processing of this dataset, we aggregate the daily data from multiple areas into a single percentage per season and per year, so that overall trends are reflected on a common time scale. This approach ensures that we maintain both accuracy and proportionality.

val yearlyMaxAvg = yearlyAverages
  .withColumn(
    "MaxAvgPercent",
    greatest(
      col("YearAvgAllSpending"),
      col("YearAvgFoodService"),
      col("YearAvgEntertainment"),
      col("YearAvgMerchandise"),
      col("YearAvgGrocery"),
      col("YearAvgHealth"),
      col("YearAvgTransport"),
      col("YearAvgRetailIncGrocery"),
      col("YearAvgRetailExGrocery")
    )
  )

approximately 300 MB in size. It includes comprehensive fields such as customer demographics, product details, and transaction information, encompassing cities across multiple U.S. states. The dataset served as a foundation for analytical tasks and data exploration, with Zeppelin utilized to perform functions like data transformation, visualization, and aggregation to derive actionable insights. In its initial state, the dataset had many columns that were not necessary for our evaluations. Below is the schema of our DataFrame in its initial state, before any preprocessing:

Figure 10. Schema for our original Walmart Dataset before Pre-Processing Data
sis, such as customer_name, order_id, and product_container, which were marked for removal.

6.2. Pre-Processing of Data

• Data Cleaning: Replaced or removed null and invalid values. For example, missing values in profit, sales, and shipping_cost were set to zero.

• Filtering Invalid States: Used regex to filter out invalid state entries (e.g., those containing digits). After dropping these rows, the dataset's record count reduced from approximately 1,030,000 to 1,016,102.

• Schema Refinement: Verified and updated column data types. Automatic schema inference assigned correct types to columns such as dates, integers (e.g., order_quantity), and strings (e.g., city, state).

6.3.2 Region-Wise Profit Distribution

When grouping by region (East, West, South, Central), the East emerges as the highest in total order quantity (over 8.2 million) and similarly leads in overall profit. The South and Central regions follow behind, indicating potential regional demand differences and logistical variations that might affect profitability. Although the state-wise distribution shows that California has the most orders, the highest profit comes from the East region, while the West region yields roughly half of the East's profit. We conclude that, even though California has many orders and stores, its contribution does not compensate for Walmart's limited presence in the rest of the West region.
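The cleaning steps of Section 6.2 and the regional grouping of Section 6.3.2 can be sketched together as below. This is only an illustration under stated assumptions: the toy rows, the local SparkSession and the exact regex (`\d`, any digit) are hypothetical, since the regex actually used is not shown in the text; the column names follow those mentioned above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("walmart-cleaning-sketch")
  .getOrCreate()
import spark.implicits._

// Toy rows: one missing profit, one missing sales value, one invalid state.
val raw = Seq(
  ("California", "West", Some(10.0), Some(100.0), Some(5.0), 3),
  ("New York", "East", None, Some(80.0), None, 2),
  ("12345", "East", Some(1.0), Some(1.0), Some(1.0), 1),
  ("Texas", "Central", Some(4.0), None, Some(2.0), 4)
).toDF("state", "region", "profit", "sales", "shipping_cost", "order_quantity")

val cleaned = raw
  // Data cleaning: missing monetary values are set to zero.
  .na.fill(0.0, Seq("profit", "sales", "shipping_cost"))
  // Filtering invalid states: drop entries that contain a digit (assumed regex).
  .filter(!col("state").rlike("\\d"))

// Region-wise totals, highest profit first.
val byRegion = cleaned
  .groupBy("region")
  .agg(
    sum("order_quantity").as("total_quantity"),
    sum("profit").as("total_profit")
  )
  .orderBy(desc("total_profit"))

byRegion.show()
```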
in purchasing behavior. This analysis suggests that shoppers in the younger demographic prefer not to shop in person, unlike older shoppers, who have a higher presence at in-person stores than in online shopping and e-commerce.

(a) Customer Age Group Analysis

Monthly ordering patterns show a peak in January (around 2.34 million orders), followed by some variability throughout the year. There is a dip in February, and moderate fluctuations continue in subsequent months. This points to potential seasonality and holiday-driven purchasing trends in Q1 and Q4.

7. Amazon Reviews Dataset

The Amazon Customer Reviews dataset is a massive collection (over 75 GB) of Amazon product reviews. Typically it has been used for pre-training Large Language Models and in sentiment analysis. We will look at the Pure IDs (5-Core) segment of the dataset, which focuses only on the product ratings (out of 5).

A parent ASIN is a unique Amazon Standard Identification Number (ASIN) that groups together products with variations, such as different colors or sizes. The mapping from ASIN to product is obtained from a large JSON file (1.25 GB).

Data Links:

1. Pure IDs (5-Core) - Amazon Reviews'23
2. Hugging Face Mapping Link

3. Data Summary: Using the DataFrame API, calculate simple statistics for the Ratings DataFrame. Here are some stats:
   (a) Ratings range from 1.0 to 5.0
   (b) 50% of the orders in the dataset have a 5.0 rating
   (c) Timestamps are in the Unix timestamp format.
4. Unknown Category: When analysing our DataFrame after the join, we found that unknown was a category with an ASIN of its own.
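The summary statistics in step 3 can be obtained directly from the DataFrame API. The following is a minimal sketch on made-up rows; the column names (`user_id`, `parent_asin`, `rating`, `timestamp`) and the local SparkSession are assumptions, and the timestamps are assumed to be in seconds (a millisecond timestamp would need to be divided by 1000 first).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ratings-summary-sketch")
  .getOrCreate()
import spark.implicits._

// Toy ratings; real rows would be read from the Pure IDs (5-Core) files.
val ratings = Seq(
  ("u1", "B001", 5.0, 1600000000L),
  ("u2", "B001", 4.0, 1600003600L),
  ("u3", "B002", 5.0, 1600007200L),
  ("u4", "B003", 1.0, 1600010800L)
).toDF("user_id", "parent_asin", "rating", "timestamp")

// count, mean, stddev, min and max of the rating column.
ratings.describe("rating").show()

// Fraction of rows with a 5.0 rating.
val fiveStarShare =
  ratings.filter(col("rating") === 5.0).count().toDouble / ratings.count()

// Convert the Unix timestamp (assumed seconds) into a readable string column.
val withTime = ratings
  .withColumn("review_time", from_unixtime(col("timestamp")))

withTime.show(truncate = false)
```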
into one single DataFrame, it helps to create a column called 'source_file' that records the source of each DataFrame row.
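One way to add such a column is to tag each per-file DataFrame with a literal before taking the union. The sketch below is illustrative only: the file names `Books.csv` and `Toys.csv`, the column names and the local SparkSession are hypothetical. When several files are read in a single `spark.read` call, Spark's built-in `input_file_name()` function is an alternative that fills in the path automatically.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("source-file-sketch")
  .getOrCreate()
import spark.implicits._

// Toy per-category DataFrames, each tagged with a hypothetical source file name.
val books = Seq(("u1", "B001", 5.0))
  .toDF("user_id", "parent_asin", "rating")
  .withColumn("source_file", lit("Books.csv"))
val toys = Seq(("u2", "T001", 3.0), ("u3", "T002", 4.0))
  .toDF("user_id", "parent_asin", "rating")
  .withColumn("source_file", lit("Toys.csv"))

// unionByName matches columns by name rather than by position.
val combined = books.unionByName(toys)

combined.groupBy("source_file").count().show()
```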
These are the preprocessing steps taken:
Figure 16. Sample Data from the cleaned Amazon Dataset

We perform data analysis on this cleaned data to identify interesting patterns that would prove useful to the management team at Amazon in fine-tuning their inventory management schedules.

8. Collaborative Analysis

Once we had individually analyzed every dataset, we attempted to collate our findings on the few fields common between the datasets. Here are some interesting findings from our analysis.

8.1. Orders vs Days of the Week

We count the total number of orders for each day of the week. Days are numbered from 0 to 6 with 0 representing
Figure 20. Instacart: Orders vs Day of the week
were made during the weekend and reviews are put up during midweek.
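The per-day counts used in this comparison can be derived from a date column as sketched below. The numbering here is an assumption: Spark's `dayofweek` returns 1 for Sunday through 7 for Saturday, so subtracting 1 yields 0 = Sunday to 6 = Saturday, which may or may not match the convention used in the figures. The `order_date` column and the sample dates are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("orders-by-weekday-sketch")
  .getOrCreate()
import spark.implicits._

// Toy order dates: two Sundays, one Wednesday, one Saturday.
val orders = Seq("2024-12-01", "2024-12-08", "2024-12-04", "2024-12-07")
  .toDF("order_date")
  .withColumn("order_date", to_date(col("order_date")))

// dayofweek is 1 (Sunday) .. 7 (Saturday); shift to 0 .. 6.
val byDow = orders
  .withColumn("dow", dayofweek(col("order_date")) - 1)
  .groupBy("dow")
  .count()
  .orderBy("dow")

byDow.show()
```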
the orders start to peak from November and continue into the new year through January.

Figure 23. Walmart: Order counts vs Month of the Year

8.3. Correlation Between Reviews and Search Trends for Gift Cards

The analysis explores the correlation between reviews for gift cards on Amazon and Google search trends for the keyword "Amazon Giftcard" over the last 5 years.

The first plot (Figure 25a) illustrates the average number of weekly reviews submitted for gift cards on Amazon. The data shows a significant spike in reviews during the last two weeks of December and the first two weeks of January. This pattern likely indicates that many customers leave reviews for gift cards after the winter holiday season.

The second plot (Figure 25b) highlights weekly Google search trends for the keyword "Amazon Giftcard" from January 2020 onwards. It reveals a noticeable peak in searches during the winter holiday season, typically in December, suggesting a higher interest in purchasing gift cards during this period.

This combined analysis demonstrates a correlation between increased searches for gift cards during December and the subsequent reviews on Amazon, predominantly in late December and early January. The trend aligns with consumer behavior during the holiday season.
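One way to quantify a relationship in which reviews trail searches is a lagged Pearson correlation between the two weekly series. The sketch below is not the method used in this analysis, which is presented through plots; the function names and the weekly values are invented purely for illustration.

```scala
// Pearson correlation coefficient of two equally long series.
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  require(x.length == y.length && x.nonEmpty, "series must be non-empty and equally long")
  val mx = x.sum / x.length
  val my = y.sum / y.length
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx = math.sqrt(x.map(a => math.pow(a - mx, 2)).sum)
  val sy = math.sqrt(y.map(b => math.pow(b - my, 2)).sum)
  cov / (sx * sy)
}

// Correlate searches in week t with reviews in week t + lag.
def laggedPearson(search: Seq[Double], reviews: Seq[Double], lag: Int): Double =
  pearson(search.dropRight(lag), reviews.drop(lag))

// Made-up weekly values: searches peak one week before reviews do.
val search  = Seq(10.0, 12.0, 80.0, 100.0, 30.0, 11.0)
val reviews = Seq(5.0, 6.0, 7.0, 40.0, 50.0, 15.0)

val sameWeek    = pearson(search, reviews)
val oneWeekLag  = laggedPearson(search, reviews, 1)
println(f"same week: $sameWeek%.3f, one-week lag: $oneWeekLag%.3f")
```

In this toy series the one-week lag gives a much stronger correlation than the same-week comparison, which is the kind of pattern the December search peak followed by late-December and January reviews would produce.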
9. Conclusions

The results of this project underscore the transformative potential of big data analytics in optimizing retail operations. By integrating and analyzing datasets from diverse sources, such as Amazon reviews, Walmart sales, and Instacart orders, we uncovered actionable insights into consumer behavior, seasonal purchasing trends, and product preferences. These findings enable retailers to enhance inventory management, refine marketing strategies, and improve customer satisfaction.

A key highlight of this analysis is the identification of temporal patterns, such as the surge in gift card searches and reviews during the winter holiday season, and the midweek peak in Amazon review activity. Similarly, the start-of-week concentration of Instacart orders aligns with consumer restocking habits, showcasing how purchasing behaviors vary across platforms and timelines. The monthly analysis revealed consistent spikes in retail activity during the holiday months, particularly in December and January, reflecting seasonal consumer demand.

Furthermore, the correlation analysis between consumer actions and external factors, such as holidays and search trends, highlights opportunities for more targeted and efficient decision-making. By leveraging tools like Apache Spark for scalable data processing and visualization, this project showcases the importance of combining advanced analytics with domain knowledge to derive meaningful insights.

Ultimately, this work demonstrates the value of big data in fostering operational efficiency, improving customer experiences, and driving long-term growth in the retail sector. Future work could explore the integration of additional datasets, such as social media sentiment analysis, or the use of real-time analytics to capture emerging trends, enabling proactive and adaptive decision-making in a dynamic marketplace.

10. Acknowledgement

This work was made possible by data downloaded from the Google Trends API, Kaggle, Amazon Datasets, Data.world and Data.gov.

Special thanks to the NYU Dataproc team for providing us with a distributed computing platform.

Last but not least, we thank Professor Yang Tang for providing us with the skills to analyze big datasets using Apache Spark.

The repository contains detailed notebooks, Scala scripts, and Zeppelin notebooks that demonstrate the methodologies and insights derived from the datasets. It serves as a resource for replicating the results and extending the analysis further.

References

1. Instacart Market Basket Analysis. Kaggle. (n.d.). https://www.kaggle.com/c/instacart-market-basket-analysis/data.
2. Amazon Reviews Dataset. (n.d.). Via Huggingface Datasets - Amazon Reviews'23. https://amazon-reviews-2023.github.io/dataloading/huggingface.html.
3. State of Connecticut - Percent Change in Consumer Spending. Data.gov. (2024, December 13). Available here.
4. Google. Google Trends Data. Accessed December 13, 2024. https://trends.google.com/trends/.
5. Ahmed Mnif. Walmart Retail Dataset. Data.world. Accessed December 13, 2024. https://data.world/ahmedmnif150/walmart-retail-dataset.