
Retail Market Analysis

Ke Yuan1†, Yaoxin Liu1†, Shriyesh Chandra1†, Rishav Roy1†

1 New York University
† These authors contributed equally
{ky2591, yl5739, sc10670, rr4577}@nyu.edu

arXiv:2502.00024v1 [q-fin.GN] 20 Jan 2025

Abstract

This project focuses on analyzing retail market trends using historical sales data, search trends, and customer reviews. By identifying patterns and trending products, the analysis provides actionable insights for retailers to optimize inventory management and marketing strategies, ultimately enhancing customer satisfaction and maximizing revenue.

1. Introduction

In today's competitive retail industry, understanding consumer behavior and market trends is important for success. This project uses data analysis to bridge the gap between what consumers want and what sellers provide. By examining datasets such as Instacart transactions, Amazon reviews, Google Trends data, Percent Change in Consumer Spending, and Walmart sales, the study identifies key patterns and trends influencing purchasing decisions. These insights help retailers make better decisions, ensuring products are available during busy periods, reducing waste, and improving customer satisfaction. This project highlights how analyzing data can lead to more efficient and effective retail operations.

2. Method

We processed and analyzed data collected from various sources, including Amazon Reviews, Walmart, Instacart Market Basket, Percent Change in Spending, and Google Trends, all stored in HDFS (Hadoop Distributed File System). The workflow is divided into three main stages: Storage, Preprocessing, and Analysis/Results.
Storage: Data from diverse sources is stored in HDFS, ensuring scalable and efficient data access for large-scale processing.
Preprocessing: Utilizing Spark SQL, this stage involves:

• Data Cleaning: Removing inconsistencies and ensuring the data is ready for analysis.
• Data Profiling: Understanding the structure and quality of the datasets.
• Data Transformation: Structuring the data into a suitable format, saved as parquet files for optimized storage and access.

Analysis and Results: This stage leverages Spark MLlib for advanced analytics. The normalized dataset is input to machine learning models to derive predictive insights. Results are analyzed statistically and presented through visualizations.
Figure 1 shows the data flow diagram of the project.

Figure 1. Design Diagram
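As a generic illustration of this storage-and-preprocessing flow (the paths, options, and the placeholder cleaning step below are assumptions, not the project's actual configuration):

import org.apache.spark.sql.SparkSession

// Sketch: read a raw CSV from HDFS, apply Spark SQL transformations,
// and persist the cleaned result as parquet for the analysis stage.
val spark = SparkSession.builder().appName("RetailMarketAnalysis").getOrCreate()

val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///user/retail/raw/walmart.csv")   // placeholder path

val cleaned = raw.na.drop()                      // stand-in for the real cleaning steps

cleaned.write.mode("overwrite").parquet("hdfs:///user/retail/clean/walmart.parquet")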

3. Google Trends Data

The cleaned Google Trends dataset contains weekly data for 162 keywords over the past 2 years. These keywords represent broad product categories (e.g., desk, tea, sofa) of commonly consumed everyday items. The trends data are normalized values from 0 to 100.

3.1. Feature Engineering

Currently, the dataset includes only one feature: the date (yyyy-mm-dd). Therefore, additional features need to be extracted before applying a machine learning model to predict trends.

Year, Month and Day: As a first step, the date is separated into three features: year, month, and day, to better represent time. Each feature is represented as an integer.

val newDs = ds
  .withColumn("year", col("date")
    .substr(1, 4).cast("int"))
  .withColumn("month", col("date")
    .substr(6, 2).cast("int"))
  .withColumn("day", col("date")
    .substr(9, 2).cast("int"))

Season: Next, a new feature, season, is added based on the month. As a decision tree model requires numerical encoding for each feature, the season feature is manually encoded as 1, 2, 3, and 4 to represent spring, summer, fall, and winter. One-hot encoding is not used because decision tree models do not require it.

def f_season(m: Int): Int = {
  ......
}

// Register the function as a UDF
val f_season_udf: UserDefinedFunction =
  udf((m: Int) => f_season(m))

// Add the season feature using the UDF
val newDs2 = newDs
  .withColumn("season",
    f_season_udf(col("month").cast("int")))

Is_holiday: Since holidays have a significant effect on consumer decisions, the is_holiday feature is important. We select several major holidays, including Christmas, New Year, Thanksgiving, Black Friday, and Valentine's Day. If the date is near any of these chosen holidays, the is_holiday feature is set to 1, otherwise 0.

// Define a function to mark holidays
val isHoliday = when(
  ...... ,
  1
).otherwise(0)

// Add the holiday feature
val newDs3 = newDs2
  .withColumn("is_holiday", isHoliday)
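For illustration only, a proximity check for a single holiday could be expressed with datediff; the restriction to Christmas and the three-day window below are assumptions, since the project's actual holiday logic is elided above.

import org.apache.spark.sql.functions._

// Illustrative sketch: flag dates within +/- 3 days of Christmas.
// The real holiday list and window width are not shown in the paper.
val christmas = to_date(concat(col("date").substr(1, 4), lit("-12-25")))
val isHolidayExample =
  when(abs(datediff(to_date(col("date")), christmas)) <= 3, 1).otherwise(0)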

day_of_week and is_weekend: The weekend also has a significant effect on consumer decisions because people generally have more time to shop on weekends. Therefore, the is_weekend feature is included in the model. Additionally, the day_of_week feature provides deeper insights and is more precise for analysis, so it is also included in the set of features.

val newDs4 = newDs3
  .withColumn("day_of_week",
    dayofweek(col("date")))
  .withColumn("is_weekend",
    when(col("day_of_week") === 1
      || col("day_of_week") === 7, 1)
    .otherwise(0))

sin_season and cos_season: Both features reflect the cyclical nature of seasons. For example, the transition from December (winter) to January (winter) should not be treated as a large gap, but as a smooth and continuous transition. Using sine and cosine functions helps in encoding this continuity and improves model performance.

Figure 2. Illustration of sin_season and cos_season

val newDs5 = newDs4
  .withColumn("day_of_year",
    dayofyear(col("date")))
  .withColumn("sin_season",
    sin(lit(2 * math.Pi)
      * col("day_of_year") / lit(365)))
  .withColumn("cos_season",
    cos(lit(2 * math.Pi)
      * col("day_of_year") / lit(365)))
  .drop("day_of_year")

At this stage, we have 9 features for future analysis, which encapsulate key information from the original date feature.

3.2. Prediction

In this problem, the objective is to predict the trends of keywords using several time-related features. However, the trends at a specific point in time are not always strongly influenced by past trends, making it challenging for traditional time series models to capture the relationship. As a result, a decision tree model is selected, as it can effectively handle the complexity of non-linear relationships and is less dependent on the sequential nature of past data compared to other time series-based machine learning models.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

The dataset is split into a training set and a testing set, with 80% used for training and 20% for testing. RMSE is employed as the evaluation metric.

val Array(trainingData, testData) =
  data.randomSplit(
    Array(0.8, 0.2), seed = 1234L
  )

An independent machine learning model was trained for each keyword to make predictions, with each model operating independently. This approach was chosen because there is no significant or direct correlation between different keywords. If a single model were used to perform multi-target predictions for all keywords, it could lead to challenges such as increased model complexity, interference between targets, and ultimately suboptimal prediction performance. By training separate models, we aim to better capture the unique patterns and characteristics of each keyword, improving the overall accuracy and reliability of the predictions.

// Train and evaluate the model
// for each keyword label
keywordLabels.foreach { label =>
  val dt = new DecisionTreeRegressor()
    .setLabelCol(label)
    .setFeaturesCol("features")

  // Train the model
  val model = dt.fit(trainingData)

  // Make predictions
  val predictions = model
    .transform(testData)

  // Evaluate the model using RMSE
  val evaluator = ...
    .setLabelCol(label)
    .setPredictionCol("prediction")
    .setMetricName("rmse")

  ......
}
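The construction of the evaluator is elided above; a minimal sketch of how the elided step could be completed with Spark ML's RegressionEvaluator (label and predictions refer to the variables inside the loop body) is:

import org.apache.spark.ml.evaluation.RegressionEvaluator

// Sketch: one way to complete the elided evaluator step in the loop above.
val evaluator = new RegressionEvaluator()
  .setLabelCol(label)
  .setPredictionCol("prediction")
  .setMetricName("rmse")

val rmse = evaluator.evaluate(predictions)
println(s"RMSE for $label: $rmse")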

3.3. Result

For the results, we calculate the RMSE on the test set and the predicted value for each keyword on 2024-12-31. Below is a portion of the results:

Table 1. Part of results
Keywords              RMSE      Prediction on 2024-12-31
Winter Coat           2.2654    10.0
Swimsuit              2.6990    16.0
Thanksgiving Turkey   21.7228   65.0
Easter Eggs           20.7649   6.1071

The full set of results can be found here.
A comprehensive analysis of the RMSE values for all predictions indicates that the predictive model demonstrates a high level of accuracy in this task. The average RMSE across all test cases is 2.6561, demonstrating the model's reliability and precision in capturing the underlying patterns in the data.

Table 2. RMSE
Min    Max       Mean     Variance
0.0    21.7228   2.6561   11.2749

Keywords with RMSE less than 3.0 and predicted trend values greater than 80.0 are selected for future analysis: soap (trends = 84), shampoo (trends = 92.43), camera (trends = 92), smartphone (trends = 92.9), vacuum (trends = 91), microwave (trends = 90.9) and laptop (trends = 90.1). Figure 3 shows the distribution of predicted values on 2024-12-31.

Figure 3. Distribution of Predicted Values

4. Instacart Market Basket Analysis

Instacart, a grocery order and delivery app with stores in the United States and Canada, provides a user experience where customers get product recommendations based on their previous orders. Instacart provided transactional data on customer orders over time.
The Instacart data is split and normalized into 6 datasets, each containing key information on orders and products purchased.

1. aisles.csv - details of aisles
2. departments.csv - details of departments
3. products.csv - details of a product. Each product is associated with an aisle and a department
4. orders.csv - order details placed by each user
5. order_products_prior.csv - all product details for any prior order, that is, the order history of every user
6. order_products_train.csv - all product details for a train order, that is, current order data for every user. These data contain only 1 order per user.

4.1. Data Profiling

Using Spark SQL and the Zeppelin notebook in the NYU Dataproc Cluster, we learn the following information from our dataset.

1. Dataset Size: There are 134 aisles, 21 departments, 50k products, 3.4 million unique orders, 32 million records of order-products history and 1 million unique training records.
2. Null Checks/Missing Values: None of the datasets has null values, except orders.csv - the Orders DataFrame is expected to have null values in the days_since_prior_order column in case there is no previous order.

4.2. Data Pre-processing

The data from the different data frames are merged and joined to convert them into a format that is convenient for further analysis.
These are the pre-processing steps taken (a sketch of these joins follows the list):

1. Join the smaller tables (aisles, departments and products) to create the productDetails dataset.
2. Merge the data from order_products_train.csv and order_products_prior.csv.
3. The Orders table has pre-existing null values in the days_since_prior_order column. These can be filled with -1.
4. Join the Orders and ProductDetails datasets with the merged training dataset to create a single denormalized dataset.
5. Remove any null records created from all the joins.
6. Remove all the ID columns and the eval_set column from the final dataset.

The resultant denormalized dataset is saved in parquet format for future analysis.
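The following is a compact sketch of these steps; the DataFrame names (aisles, departments, products, orders, orderProductsPrior, orderProductsTrain), the ID column names, and the output path are assumptions based on the public Instacart schema rather than the project's actual code.

import org.apache.spark.sql.functions._

// Sketch of the denormalization described above.
val productDetails = products
  .join(aisles, Seq("aisle_id"))
  .join(departments, Seq("department_id"))

val orderProducts = orderProductsTrain.unionByName(orderProductsPrior)

val denormalized = orders
  .na.fill(-1L, Seq("days_since_prior_order"))   // step 3
  .join(orderProducts, Seq("order_id"))          // step 4
  .join(productDetails, Seq("product_id"))
  .na.drop()                                     // step 5
  .drop("order_id", "product_id", "aisle_id", "department_id",
        "user_id", "eval_set")                   // step 6

denormalized.write.mode("overwrite").parquet("instacart_denormalized.parquet")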


4.3. Market Basket Analysis

The cleaned and denormalized dataset has the following schema:

Figure 4. Denormalized Schema for Instacart Data

We perform data analysis on this cleaned data to identify interesting patterns that would prove useful to the management team at Instacart for fine-tuning their product placements and employee work schedules.

4.3.1 Orders vs Hours of the Day

We count the total number of orders for each hour of the day to estimate the number of employees required per shift and ensure efficient staffing. Analysis of the bar graph shows that the distribution follows a traditional bell curve, with minimal activity from 12 am to 6 am. Orders increase during the morning and afternoon hours and gradually decrease as it gets late.

Figure 5. Orders vs Hours of the Day

4.3.2 Frequently ordered product departments and product aisles

We investigate the relationship between the number of items ordered per department/aisle to identify the most profitable product departments/aisles. The analysis has been visualized as a pie chart indicating the percentage of total orders per department/aisle. Since there are over 100 aisles, we pick the top 25 aisles for the pie chart graphic.
Produce and dairy eggs are the most common departments. Fresh fruits and vegetables are the most common product aisles.

Figure 6. Popular Departments

Figure 7. Popular Aisles

4.3.3 Relationship between position in shopping cart and reorder rate

Figure 8. Position in cart vs Reorder rate

There is quite an interesting correlation between the item's position in the shopping cart and its reorder rate. Reorder rate can be defined as:

Reorder Rate = Sum of reordered items / Total number of items

From positions 1 through 50, there is a clear correlation between the reorder rate and the position in the shopping cart.

• Early positions: Higher reorder rates may indicate staple or frequently purchased items.
• Later positions: Lower reorder rates may suggest less critical or experimental purchases.
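As a sketch of how this metric can be computed per cart position, reusing the denormalized DataFrame from the earlier sketch (the add_to_cart_order and reordered column names are assumed from the public Instacart schema, not taken from the paper):

import org.apache.spark.sql.functions._

// Sketch: the mean of the binary `reordered` flag at each cart position
// equals (sum of reordered items) / (total number of items) at that position.
val reorderRateByPosition = denormalized
  .groupBy("add_to_cart_order")
  .agg(avg("reordered").alias("reorder_rate"))
  .orderBy("add_to_cart_order")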

5. Percent Change in Consumer Spending

This dataset measures daily percentage changes in consumer spending across multiple categories, including groceries, health, entertainment, and transportation, over the years 2020 to 2024.

5.1. Overview of the dataset

This dataset includes fields that provide high-level insights into consumer spending patterns, such as percentage changes in categories like food service spending, entertainment spending, grocery spending, and more. Table 3 shows a snippet of the dataset.

Table 3. Part of Percent Change in Consumer Spending Dataset
StateCode   Date         AllSpending   ...
900         2020-01-13   -2.3          ...
2500        2020-01-13   -0.218        ...
...         ...          ...           ...
5.2. Data Processing

During the processing of this dataset, we aggregate the daily data from multiple areas into single seasonal and yearly percentages that reflect the overall trends. This approach ensures that we maintain both accuracy and proportionality.

val yearlyMaxAvg = yearlyAverages
  .withColumn(
    "MaxAvgPercent",
    greatest(
      col("YearAvgAllSpending"),
      col("YearAvgFoodService"),
      col("YearAvgEntertainment"),
      col("YearAvgMerchandise"),
      col("YearAvgGrocery"),
      col("YearAvgHealth"),
      col("YearAvgTransport"),
      col("YearAvgRetailIncGrocery"),
      col("YearAvgRetailExGrocery")
    )
  )
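The yearlyAverages DataFrame used above is not shown in the paper; one plausible construction (the raw DataFrame name spendingDf and the daily column names other than AllSpending are assumptions) is:

import org.apache.spark.sql.functions._

// Sketch: collapse the daily records into per-year averages;
// the remaining category columns follow the same pattern.
val yearlyAverages = spendingDf
  .withColumn("Year", year(to_date(col("Date"))))
  .groupBy("Year")
  .agg(
    avg("AllSpending").alias("YearAvgAllSpending"),
    avg("GrocerySpending").alias("YearAvgGrocery"),
    avg("HealthSpending").alias("YearAvgHealth")
    // ...and so on for the other categories
  )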

5.3. Analysis of Yearly and Economic Trends

Consumer spending patterns vary significantly across seasons and economic conditions. During crises like the COVID-19 pandemic, demand shifted heavily toward essentials such as groceries, reflecting the focus on basic needs. In contrast, stable economic periods saw increased spending on discretionary categories like health, wellness, and entertainment, driven by greater consumer confidence and disposable income.
Figure 9 shows the yearly winner categories by average percentage contribution.

Figure 9. Analysis of percentage change trends

6. Walmart Retail Dataset

The Walmart Retail Dataset, obtained from data.world, spans retail order details from 2019 to 2023 and is approximately 300 MB in size. It includes comprehensive fields such as customer demographics, product details, and transaction information, encompassing cities across multiple U.S. states. The dataset served as a foundation for analytical tasks and data exploration, with Zeppelin utilized to perform functions like data transformation, visualization, and aggregation to derive actionable insights. In its initial state, the dataset had many columns that were not necessary for our evaluations. Figure 10 shows the schema of our DataFrame in its initial state, before any preprocessing was performed.

Figure 10. Schema for our original Walmart Dataset before Pre-Processing Data

6.1. Profiling of Data

Using Spark SQL and the Zeppelin notebook in the NYU Dataproc Cluster, we learn the following information from our dataset:

• Schema and Column Stats: Identified column types (e.g., strings, dates, integers) and counted null vs. non-null values to assess data quality. Discovered that columns like profit, sales, and shipping_cost contained missing data that needed fixing.
• Invalid Entries: Noticed around 18k records in the state column with invalid or malformed data (e.g., numeric or special characters), which required cleaning.
• Unique Values: Found 45 unique states and 18 distinct product subcategories. One subcategory was labeled as an unknown category, indicating possible data entry errors or incomplete records.
• Unnecessary Columns: Identified several columns that were not required for the analysis, such as customer_name, order_id, and product_container, which were marked for removal.
6.2. Pre-Processing of Data

• Data Cleaning: Replaced or removed null and invalid values. For example, missing values in profit, sales, and shipping_cost were set to zero.
• Filtering Invalid States: Used a regex to filter out invalid state entries (e.g., those containing digits). After dropping these rows, the dataset's record count reduced from approximately 1,030,000 to 1,016,102.
• Schema Refinement: Verified and updated column data types. Automatic schema inference assigned correct types to columns such as dates, integers (e.g., order_quantity), and strings (e.g., city, state).
• Removal of Non-Essential Columns: Dropped columns that were not essential for either the overall analysis or collaborative insights, such as order_priority, product_base_margin, and unit_price.

The resultant processed dataset, with only the essential columns, was saved in parquet format for future analysis; a sketch of these cleaning steps follows.
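A condensed sketch of these steps (the input DataFrame name walmartDf, the exact regex, and the output path are illustrative assumptions):

import org.apache.spark.sql.functions._

// Sketch: fill missing numeric values, drop malformed state entries,
// remove non-essential columns, and persist as parquet.
val cleanedWalmart = walmartDf
  .na.fill(0.0, Seq("profit", "sales", "shipping_cost"))
  .filter(!col("state").rlike("[0-9]"))          // keep only states without digits
  .drop("order_priority", "product_base_margin", "unit_price")

cleanedWalmart.write.mode("overwrite").parquet("walmart_cleaned.parquet")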

6.3. Analysis of Walmart Retail Dataset

6.3.1 Statewise Analysis of Sales

From the aggregated results, California and Texas lead in total order quantity, closely followed by MA, New Jersey, and Florida. These top performers also generate correspondingly high profits, suggesting that they represent substantial market share and sales volume within the dataset.

Figure 11. Statewise Analysis of Sales

6.3.2 Region-Wise Profit Distribution

When grouping by region (East, West, South, Central), the East emerges as the highest in total order quantity (over 8.2 million) and similarly leads in overall profit. The South and Central regions follow behind, indicating potential regional demand differences and logistical variations that might affect profitability. Here we see that even though California has the most orders in the state-wise distribution, the most profit is observed in the East Region, while the West Region collects around half the profit of the East. We can conclude that although the number of orders and stores in California is large, its contribution cannot offset Walmart's limited presence across the rest of the West region.

Figure 12. Region-Wise Profit Distribution

6.3.3 Number of Cities per State

States like California (108 unique cities), Texas (102), and MA (101) have the broadest coverage, reinforcing their leading positions in terms of total orders. The large number of cities suggests a widespread presence of Walmart stores and, in turn, a more extensive customer base.

Figure 13. Number of Cities per State

6.3.4 Customer Age Group Analysis

Surprisingly, the 60+ age group accounts for the highest order quantity (over 10.9 million) and the greatest total profit. The 21–40 and 41–60 brackets are fairly close in their total order volumes (~7.4 million each), while the 0–20 segment is notably smaller, pointing to specific demographic trends in purchasing behavior. We can see from this analysis that shoppers in the younger demographic prefer not to shop with commerce brands in person, unlike older shoppers, who have a higher presence at in-person stores compared to online shopping and e-commerce.

(a) Customer Age Group Analysis
(b) Analysis of Profits based on Age Group
Figure 14. Analysis of Order Quantity and Profits Based on Age Group

6.3.5 Monthly Analysis of Order Quantity

Monthly ordering patterns show a peak in January (around 2.34 million orders), followed by some variability throughout the year. There is a dip in February, and moderate fluctuations continue in subsequent months. This points to potential seasonality and holiday-driven purchasing trends in Q1 and Q4.

Figure 15. Monthly Analysis of Order Quantity

7. Amazon Reviews Dataset

The Amazon Customer Reviews is a massive dataset (over 75 GB) of Amazon product reviews. Typically it has been used for pre-training Large Language Models and in sentiment analysis. We will look at the Pure IDs (5-Core) segment of the dataset, which focuses only on the product ratings (out of 5).
A parent ASIN is a unique Amazon Standard Identification Number (ASIN) that groups together products with variations, such as different colors or sizes. The mapping from ASIN to product is obtained from a large JSON file (1.25 GB).
Data Links:
1. Pure IDs (5-Core) - Amazon Reviews'23
2. Hugging Face Mapping Link

In total there are 28 categories of data, each stored in its own zip file. The dataset includes user ratings and product metadata for categories, that is, the parent ASIN key and user ID along with the review timestamp.

7.1. Data Profiling

Using Spark SQL and the Zeppelin Notebook on the NYU Dataproc Cluster, we learn the following information from our dataset.

1. Dataset Size: There are approximately 67 million records in the ratings dataset and approximately 35 million records in the ASIN dataset.
2. Null Checks/Missing Values: None of the datasets have any null values.
3. Data Summary: Using the DataFrame API, we calculate simple statistics for the Ratings DataFrame. Here are some stats:
(a) Ratings range from 1.0 to 5.0
(b) 50% of the orders in the dataset have a 5.0 rating
(c) Timestamps are in the Unix timestamp format.
4. Unknown Category: When analysing our dataframe after the join, we found that unknown was a category with an ASIN of its own.

7.2. Data Pre-processing

The data from the different data-frames are merged and joined to get them into a format that is convenient for further analysis.
There is one pre-processing step that was done while loading the data. Since we load all the ratings CSV files into one single dataframe, it helps to create a column called 'source_file' which refers to the source of the DataFrame row.
These are the preprocessing steps taken (a sketch of these steps follows the list):

1. Join the Ratings and ASIN datasets to create a single denormalized dataset.
2. Replace the nulls generated by the "left outer join" between the two datasets. We only have null values in the 'category' column, which we can replace using the value from the 'source_file' column.
3. Remove the user_id, source_file and parent_asin columns.
4. Convert the UNIX timestamp to a Date format (YYYY-MM-DD) to get a consistent result that can be compared with other datasets as well.
5. Handling the unknown category: This category seemed inconclusive for our analysis and thus we planned on removing it entirely from our final combined dataframe.
6. The resultant dataset was saved in parquet format for future analysis.
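A compact sketch of these steps (the DataFrame names ratingsDf and asinDf, the join key, the timestamp unit, and the output path are assumptions based on the description above, not the project's code):

import org.apache.spark.sql.functions._

// The ratings come from many CSV files; the source_file column could be
// captured at load time, e.g. .withColumn("source_file", input_file_name()).

// Sketch: join ratings with the ASIN-to-category mapping, backfill the
// category from the source file, convert the Unix timestamp (assumed to be
// in seconds), drop the unknown category, and remove unneeded columns.
val combined = ratingsDf
  .join(asinDf, Seq("parent_asin"), "left_outer")
  .withColumn("category", coalesce(col("category"), col("source_file")))
  .withColumn("review_date", to_date(from_unixtime(col("timestamp"))))
  .filter(col("category") =!= "unknown")
  .drop("user_id", "source_file", "parent_asin", "timestamp")

combined.write.mode("overwrite").parquet("amazon_reviews_cleaned.parquet")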

7.3. Data Analysis

The cleaned and merged dataset has the following structure:

Figure 16. Sample Data from the cleaned Amazon Dataset

We perform data analysis on this cleaned data to identify interesting patterns that would prove useful to the management team at Amazon for fine-tuning their inventory management schedules.

7.3.1 Ratings Count

We look at the ratings count for all the products at Amazon. We notice that most of the products are reviewed at 5 stars (100 M). All other ratings put together have a count of around 50 M. This tells us that review ratings (out of 5) are not a good basis for analysis of Amazon products due to their skewed nature.

Figure 17. Ratings vs Ratings count

7.3.2 Category Count

Next, we look at the distribution of ratings across the different categories in the Amazon Dataset. Here we notice that clothing, shoes, jewelry and electronics are among the top most rated categories. A direct correlation can be drawn to the fact that customers mainly purchase clothing items from Amazon in comparison to Health and Household items.

Figure 18. Category vs Ratings Count

7.3.3 Yearly Reviews Count

Looking at the total number of reviews uploaded every year, we find a sudden spike in yearly review counts in 2013, indicating the year when Amazon became really popular as a site for e-commerce activities. We also notice that reviews steadily increased up to 2019, after which they hit a plateau. This can mostly be attributed to the fact that almost all households now hold Amazon accounts.
The data in 2023 is incomplete, which explains the sudden drop in review counts.

Figure 19. Year vs Ratings Count

8. Collaborative Analysis

Once we individually analyzed every dataset, we attempted to collate our findings on the few fields common between the datasets. Here are some interesting findings from our analysis.

8.1. Orders vs Days of the Week

We count the total number of orders for each day of the week. Days are numbered from 0 to 6, with 0 representing Sunday and 6 representing Saturday.
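A generic sketch of this count (ordersDf and its order_date column are placeholders; Spark's dayofweek returns 1 for Sunday, so it is shifted to the 0-6 convention used here):

import org.apache.spark.sql.functions._

// Sketch: count orders per day of week, 0 = Sunday ... 6 = Saturday.
val ordersByDayOfWeek = ordersDf
  .withColumn("day_of_week", dayofweek(col("order_date")) - 1)
  .groupBy("day_of_week")
  .count()
  .orderBy("day_of_week")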
In the Instacart dataset we calculate summaries for all orders and for orders that were reordered. Both show a similar trend, with the start of the week being especially busy as households restock for the week. The counts decrease during midweek and start increasing again during the weekend.

Figure 20. Instacart: Orders vs Day of the week

The analysis holds true when we take a look at the percent change in consumer spending, which is higher on Sundays than on any other day.

Figure 21. Percent Change: Average Change vs Day of the week

Finally, we take a look at the Amazon Reviews Dataset. Here, we notice that the trend has switched, with a majority of the records getting created during midweek and the weekend counts being lower. This resonates with our previous hypothesis if we understand that the Amazon dataset looks at reviews rather than the actual orders. If we were to extrapolate these results, it would indicate that the orders were made during the weekend and reviews are put up during midweek.

Figure 22. Amazon: Review Count vs Day of the Week

8.2. Orders vs Month of the Year

If we were to perform a similar analysis with a different temporal domain, we would get further interesting insights. We can plot the distribution of the order count vs Month of the Year for the Amazon Dataset and the Walmart Dataset. For both of these datasets we notice an interesting insight into the shopping habits of consumers.
In both companies, we notice that shopping and order counts spike during the winter holiday months. While in the case of Walmart the peak is in January, for Amazon the orders start to peak in November and continue into the new year through January.

Figure 23. Walmart: Order counts vs Month of the Year

Figure 24. Amazon: Review Counts vs Month of the Year

8.3. Correlation Between Reviews and Search Trends for Gift Cards

The analysis explores the correlation between reviews for gift cards on Amazon and Google search trends for the keyword "Amazon Giftcard" over the last 5 years.
The first plot (Figure 25a) illustrates the average number of weekly reviews submitted for gift cards on Amazon. The data shows a significant spike in reviews during the last two weeks of December and the first two weeks of January. This pattern likely indicates that many customers leave reviews for gift cards after the winter holiday season.
The second plot (Figure 25b) highlights weekly Google search trends for the keyword "Amazon Giftcard" from January 2020 onwards. It reveals a noticeable peak in searches during the winter holiday season, typically in December, suggesting a higher interest in purchasing gift cards during this period.
This combined analysis demonstrates a correlation between increased searches for gift cards during December and the subsequent reviews on Amazon, predominantly in late December and early January. The trend aligns with consumer behavior during the holiday season.

(a) Weekly reviews for Gift Cards on Amazon.
(b) Google search trends for "Amazon Giftcard."
Figure 25. Correlation analysis between gift card reviews and Google search trends.
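As a sketch of the weekly aggregation behind Figure 25a (the giftCardReviews DataFrame and its review_date column are illustrative assumptions):

import org.apache.spark.sql.functions._

// Sketch: count gift card reviews per ISO week, to compare with the
// weekly Google Trends values for "Amazon Giftcard".
val weeklyGiftCardReviews = giftCardReviews
  .withColumn("year", year(col("review_date")))
  .withColumn("week", weekofyear(col("review_date")))
  .groupBy("year", "week")
  .count()
  .orderBy("year", "week")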

9. Conclusions

The results of this project underscore the transformative potential of big data analytics in optimizing retail operations. By integrating and analyzing datasets from diverse sources, such as Amazon reviews, Walmart sales, and Instacart orders, we uncovered actionable insights into consumer behavior, seasonal purchasing trends, and product preferences. These findings enable retailers to enhance inventory management, refine marketing strategies, and improve customer satisfaction.
A key highlight of this analysis is the identification of temporal patterns, such as the surge in gift card searches and reviews during the winter holiday season, and the midweek peak in Amazon review activity. Similarly, the start-of-week concentration of Instacart orders aligns with consumer restocking habits, showcasing how purchasing behaviors vary across platforms and timelines. The monthly analysis revealed consistent spikes in retail activity during the holiday months, particularly in December and January, reflecting seasonal consumer demand.
Furthermore, the correlation analysis between consumer actions and external factors, such as holidays and search trends, highlights opportunities for more targeted and efficient decision-making. By leveraging tools like Apache Spark for scalable data processing and visualization, this project showcases the importance of combining advanced analytics with domain knowledge to derive meaningful insights.
Ultimately, this work demonstrates the value of big data in fostering operational efficiency, improving customer experiences, and driving long-term growth in the retail sector. Future work could explore the integration of additional datasets, such as social media sentiment analysis, or the use of real-time analytics to capture emerging trends, enabling proactive and adaptive decision-making in a dynamic marketplace.

10. Acknowledgement

This work was made possible by downloading data from the Google Trends API, Kaggle, Amazon Datasets, Data.world and Data.gov.
Special thanks to the NYU Dataproc team for providing us with a distributed computing platform.
Last but not least, we thank Professor Yang Tang for providing us with the skills to analyze big datasets using Apache Spark.

11. Repository Access

The complete implementation for this project, including preprocessing, aggregation, analysis, and visualization code, is available on the GitHub repository:
https://github.com/shrishriyesh/Retail-Market-Analysis
The repository contains detailed notebooks, Scala scripts, and Zeppelin notebooks that demonstrate the methodologies and insights derived from the datasets. It serves as a resource for replicating the results and extending the analysis further.

References

1. Instacart Market Basket Analysis. Kaggle. (n.d.). https://www.kaggle.com/c/instacart-market-basket-analysis/data.
2. Amazon Reviews Dataset. (n.d.). Via Huggingface Datasets - Amazon Reviews'23. https://amazon-reviews-2023.github.io/dataloading/huggingface.html.
3. State of Connecticut - Percent Change in Consumer Spending. Data.gov. (2024, December 13). Available here.
4. Google. Google Trends Data. Accessed December 13, 2024. https://trends.google.com/trends/.
5. Ahmed Mnif. Walmart Retail Dataset. Data.world. Accessed December 13, 2024. https://data.world/ahmedmnif150/walmart-retail-dataset.