
VIETNAM NATIONAL UNIVERSITY

INTERNATIONAL SCHOOL
------------

Predict Amazon product price


SUBJECT : Advanced Information Systems Development
Lecturer : Trương Công Đoàn
Subject code : INS401001
Group H : Hà Vũ Đăng Huy – 20070727
Mai Thị Diệu Linh – 20070739
Trần Hoài An – 20070661
Phạm Thị Ngọc Lương – 20070744
Trần Tuấn Minh – 20070754
Nguyễn Dức Hùng – 20070724

Hanoi, 2023
TABLE OF CONTENTS
Abstract
I. Introduction
II. Methodology
III. Data processing
1. Data Preprocessing
2. EDA
IV. Result and discussion
1. Features Selection
2. Model training & Evaluation
2.1. Linear Regression model
2.2. Apply scaler
2.3. Lasso
2.4. Ridge
3. Discussion
4. Predict
V. Conclusion
VI. REFERENCES
VII. Individual Contribution

Abstract
In the ever-expanding e-commerce landscape, accurate pricing of products plays a pivotal role in
attracting customers and ensuring competitiveness. This research explores the application of
machine learning techniques to predict the prices of products on the Amazon platform. The
objective is to develop a predictive model that leverages various features such as product
attributes, historical pricing data, customer reviews, and seller information.

The study employs a dataset collected from Amazon, comprising a diverse range of products
across categories. Feature engineering techniques are applied to extract relevant information, and
machine learning algorithms, including regression models and ensemble methods, are employed
for price prediction. Additionally, natural language processing (NLP) is utilized to analyze
customer reviews and derive sentiment features that may impact pricing.

The research aims to address challenges associated with pricing variability, seasonality, and the
dynamic nature of e-commerce platforms. The effectiveness of the predictive model is evaluated
through comprehensive performance metrics, and the impact of different features on pricing
accuracy is analyzed. Insights gained from this study can inform pricing strategies for sellers and
contribute to a better understanding of the factors influencing product prices in the online
marketplace.

I. Introduction

In the fast-paced world of e-commerce, pricing strategies are key to the success of online
retailers. Accurately predicting product prices is essential for staying competitive in the market
and attracting and retaining customers. To address this challenge, leveraging advanced machine
learning techniques to forecast Amazon product prices has become an appealing approach for
improving pricing strategies and market positioning. Machine learning algorithms have proven to
be highly effective in analyzing large volumes of data and identifying patterns and trends. By
training these algorithms on historical pricing data and considering various relevant features,
such as product category, brand, customer reviews, and seller reputation, it is possible to build
predictive models that can estimate future prices with a reasonable degree of accuracy.

The advantages of using machine learning for price prediction in the Amazon marketplace are
numerous. First, it allows retailers to dynamically adjust their prices in response to changing
market conditions, such as fluctuations in demand or the introduction of new competitors. This
flexibility enables retailers to optimize their profits and maintain a competitive edge. Accurate
price predictions empower retailers to make informed decisions regarding pricing strategies. By
understanding how different factors influence product prices, retailers can identify which
variables have the greatest impact and adjust their pricing strategies accordingly. For example, if
customer reviews significantly affect the price of electronic gadgets, retailers can focus on
improving customer satisfaction to command higher prices.

Additionally, machine learning models can provide valuable insights into market trends and
customer preferences. By analyzing patterns in pricing data, retailers can gain a deeper
understanding of customer behavior and preferences, enabling them to tailor their product
offerings and marketing campaigns more effectively.

However, building an effective price prediction model for Amazon is not without its challenges.
The dynamic nature of e-commerce platforms, where prices can change frequently and abruptly,
requires models that can adapt quickly to new data. Additionally, incorporating a wide range of
relevant features and ensuring data quality and consistency pose additional hurdles.

Despite these challenges, the potential benefits of accurately predicting Amazon product prices
using machine learning make it a compelling area of research and application. By leveraging the
power of advanced algorithms and harnessing the wealth of data available, retailers can gain a
competitive advantage and optimize their pricing strategies in the ever-evolving e-commerce
landscape.

II. Methodology
The dataset contains the ratings and reviews of more than 1,000 Amazon products, as listed on the
official Amazon website. It has 1,465 rows and 16 attributes. Our first objective is to clean and
transform the data for the EDA process, and then to present the insights we find through
visualization. A null check shows that the dataset is of good quality, as most values are filled
in. However, a few null values remain, and these may affect the analysis results. We then split
the data for training, defining the feature variables (X) and the target variable (y). We
implemented four modelling steps: a Linear Regression model, the same model with a scaler
applied, Lasso, and Ridge.
Linear Regression model: We apply a linear regression model using the scikit-learn library. The
code imports the necessary libraries, splits the dataset into training and test sets, creates a
linear regression model, fits it to the training data, and makes predictions on the test data.
Apply scaler: The code imports StandardScaler from the scikit-learn library and uses it to
standardize the features.
Lasso: The code adds the Lasso regression model from scikit-learn (Lasso) to the existing linear
regression workflow.
Ridge: We use Ridge regression, another linear regression model, with L2 regularization.

III. Data processing

1. Data Preprocessing
Import libraries:

• pandas: provides powerful and flexible data structures designed for working with structured
data (such as data tables and time series). It offers functions for operations such as
filtering, aggregation, and transformation.
• numpy: a core library for scientific computation in Python. It provides an efficient
multidimensional array object and tools to work with these arrays.
• matplotlib: a library for visualizing 2D and 3D data in Python. It helps to create
high-quality charts in various formats.
• seaborn: a Python data visualization library based on matplotlib. It provides a high-level
interface for drawing attractive and informative statistical charts.

The code first imports the pandas library, an open-source Python library that provides
high-performance data structures and data analysis tools. It then reads a CSV file from the given
path into a DataFrame. A DataFrame is a two-dimensional data structure that can hold columns of
different types, and it is the standard data structure in data analysis. The path shows that the
CSV file is stored on Google Drive. The DataFrame df now contains the data from the CSV file, and
we can use it for data analysis and processing.
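The loading step can be sketched as follows. This is only an illustration: the in-memory sample and its values are hypothetical stand-ins for the real file, whose Google Drive path is not reproduced here.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
# In the notebook, the argument to read_csv would be the file path instead.
sample_csv = io.StringIO(
    'product_id,actual_price,rating,rating_count\n'
    'P001,"₹1,099",4.2,"24,269"\n'
    'P002,₹349,4.0,\n'
)

df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # prices and counts are still text until they are cleaned
```

The empty last field in the second row is read as a null value, which is the kind of gap the null check in the next step is meant to reveal.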

There are 2 null values in the rating_count column. Null values can distort the results of
statistical analysis, cause errors when training machine learning models or predicting with
them, and distort data visualizations.

The code first removes every row of the DataFrame df that contains a null value. The inplace=True
parameter applies the change directly to df rather than creating a new copy of the DataFrame. It
then counts the null values in each column of df. After these two lines run, df no longer
contains any null values, and the per-column counts, which should all be zero, confirm this.
This prevents null values from causing errors during analysis or modeling.
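A minimal sketch of this step, using a small hypothetical DataFrame rather than the real data. The extra count taken before dropping rows is an addition for illustration, so that the original number of nulls is also recorded.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing rating_count.
df = pd.DataFrame({
    "product_id": ["P001", "P002", "P003"],
    "rating": [4.2, 4.0, 3.9],
    "rating_count": [24269.0, np.nan, 7928.0],
})

nulls_before = df.isnull().sum()  # null count per column before cleaning
df.dropna(inplace=True)           # drop rows with any null, modifying df itself
nulls_after = df.isnull().sum()   # every column should now report zero

print(nulls_before)
print(nulls_after)
```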

After these three lines run, df no longer contains any duplicate rows, and we know how many
duplicate rows the original data contained. This prevents duplicate rows from distorting the
analysis or modeling.
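The original three lines are not shown, so the following is an assumed reconstruction of a typical duplicate check on a hypothetical sample: count the duplicates, drop them, and count again.

```python
import pandas as pd

# Hypothetical sample in which the third row repeats the first.
df = pd.DataFrame({
    "product_id": ["P001", "P002", "P001"],
    "rating": [4.2, 4.0, 4.2],
})

n_duplicates = df.duplicated().sum()  # rows identical to an earlier row
df.drop_duplicates(inplace=True)      # keep the first occurrence of each row
remaining = df.duplicated().sum()     # should be zero after dropping

print(n_duplicates, remaining)
```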

The dataset contains the ratings and reviews of more than 1,000 Amazon products, as listed on the
official Amazon website. It has 1,465 rows and 16 attributes. The null check shows that the
dataset is of good quality, as most values are filled in. The attributes are:

• product_id - Product ID
• product_name - Name of the Product
• category - Category of the Product
• discounted_price - Discounted Price of the Product
• actual_price - Actual Price of the Product
• discount_percentage - Percentage of Discount for the Product
• rating - Rating of the Product
• rating_count - Number of people who voted for the Amazon rating
• about_product - Description about the Product
• user_id - ID of the user who wrote review for the Product
• user_name - Name of the user who wrote review for the Product
• review_id - ID of the user review
• review_title - Short review
• review_content - Long review
• img_link - Image Link of the Product
• product_link - Official Website Link of the Product

Handling special characters and missing data

• Replace and convert the data type for column 'actual_price': first, replace the '₹' character
with an empty string; then replace the commas with an empty string; finally, convert the
column's data type from string to floating-point number (float64).
• Replace and convert the data type for column 'discounted_price': as with 'actual_price',
remove the '₹' character and the commas, then convert the column's data type from string to
floating-point number (float64).
• Replace and convert the data type for column 'discount_percentage': replace the '%' character
with an empty string, then convert the column's data type from string to floating-point number
(float64).
• Convert percentage to decimal: divide all values in the column by 100.

Remove ',' from the 'rating_count' column, then convert it to type float64.
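The cleaning steps above can be sketched as follows. The column names come from the dataset description; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical raw values in the same text format as the dataset.
df = pd.DataFrame({
    "actual_price": ["₹1,099", "₹349"],
    "discounted_price": ["₹399", "₹199"],
    "discount_percentage": ["64%", "43%"],
    "rating_count": ["24,269", "43,994"],
})

# Strip the currency symbol and thousands separators, then cast to float64.
for col in ["actual_price", "discounted_price"]:
    df[col] = (
        df[col]
        .str.replace("₹", "", regex=False)
        .str.replace(",", "", regex=False)
        .astype("float64")
    )

# Strip the percent sign, cast to float64, and convert to a decimal fraction.
df["discount_percentage"] = (
    df["discount_percentage"].str.replace("%", "", regex=False).astype("float64") / 100
)

# Strip thousands separators from the rating count and cast to float64.
df["rating_count"] = (
    df["rating_count"].str.replace(",", "", regex=False).astype("float64")
)

print(df.dtypes)
```

Passing regex=False treats each pattern as a literal string, so no character is interpreted as a regular expression.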

Replace the '&' character with the string ' & ' to add a space before and after the '&' sign.
Replace strings that lack spaces between words with spaced versions. For example,
'OfficeProducts' is replaced by 'Office Products'.

• Replace the '&' character with the string ' & ' to add a space before and after the '&' sign.
• Replace commas ',' with ', ' to add a space after each comma.
• Replace strings that lack spaces between words with spaced versions. For example,
'HomeAppliances' is replaced by 'Home Appliances'.
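A short sketch of these string replacements. The raw labels below are assumed examples of how unspaced category names might look; only the 'OfficeProducts' and 'HomeAppliances' replacements are taken from the text.

```python
import pandas as pd

# Hypothetical raw labels without spaces around '&' or between words.
main_category = pd.Series(["Computers&Accessories", "OfficeProducts", "Home&Kitchen"])
sub_category = pd.Series(["Kitchen&HomeAppliances", "Accessories&Peripherals"])

main_category = (
    main_category
    .str.replace("&", " & ", regex=False)
    .str.replace("OfficeProducts", "Office Products", regex=False)
)

sub_category = (
    sub_category
    .str.replace("&", " & ", regex=False)
    .str.replace(",", ", ", regex=False)
    .str.replace("HomeAppliances", "Home Appliances", regex=False)
)

print(main_category.tolist())
print(sub_category.tolist())
```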

Checking the values of the rating column reveals one erroneous value.

2. EDA
- Chart
Category:

Electronics, Computers & Accessories, and Home & Kitchen appliances account for most of the
products in this dataset. In general, most products are related to technology and electronics.

The chart shows the distribution of products by main category. "Electronics" has the most
products (526), followed by "Computers & Accessories" (453) and "Home & Kitchen" (448).
"Office Products" and "Musical Instruments" have far fewer, with only 31 and 2 products
respectively. There is a clear disparity in product offering between categories. This may reflect
market demand, with higher demand for electronic products, computers and accessories, and home
appliances than for office products and musical instruments.

"Accessories & Peripherals" is the subcategory with the most products (381), followed by
"Kitchen & Home Appliances" (308). "Home Theater, TV & Video" and "Mobiles & Accessories" have
similar counts, about 161-162 products each. The number of products drops markedly for
subcategories such as "Heating, Cooling & Air Quality" and "Wearable Technology". The smallest
subcategory is "External Devices & Data Storage", with only 18 products. Again, there is a clear
disparity in product supply between subcategories. This may reflect market demand, with higher
demand for accessories and peripherals and for home and kitchen appliances than for other
technology products and external data storage.

Most products have an overall rating between about 3.8 and 4.0.
The highest rating is 5.0, in the Computers & Accessories category.
The lowest rating for a product is 2.0, in the Home & Kitchen category.

This is a horizontal bar chart titled "Rank range by main category". It shows the rating range
for each product category: Computers & Accessories, Electronics, Musical Instruments, Office
Products, Home & Kitchen, Home Improvement, Toys & Games, Automobiles & Motorcycles, and Health &
Personal Care.
Each category has a colored horizontal bar representing the range of ratings received, with
small diamonds marking individual ratings. The x-axis is labeled "Rating" and runs from 2.0 to
5.0. Computers & Accessories reaches the highest rating, 5.0, while Home & Kitchen reaches the
lowest, around 2.0.
Further analysis:
• Rating range: Most products receive an overall rating between 3.8 and 4.0. This suggests
that consumers are generally satisfied with the products they buy. However, some products
receive lower ratings, indicating that there may be quality or customer satisfaction issues
with them.
• Highest and lowest ratings: The highest rating of 5.0 comes from the Computers &
Accessories category, indicating that some products in this category fully satisfy their
buyers. Conversely, the lowest rating for a product, 2.0, comes from the Home & Kitchen
category, which may indicate quality or customer satisfaction issues with that product.
• What the chart shows: The chart gives an overview of product quality based on consumer
ratings in different categories. This helps us understand customer satisfaction with the
products in each category, and so supports better decisions about improving product quality
or focusing on specific product categories.

A heatmap can show the correlation between the numerical characteristics of a product:
discounted price, actual price, discount percentage, rating, and number of ratings. It shows the
strength and direction of the relationship between each pair of characteristics.

Overall, this heatmap gives a useful visual representation of the relationships between the
product characteristics in the dataset. It can be used to understand trends, identify potential
dependencies, and inform subsequent analysis.

The characteristics shown are discounted price, actual price, discount percentage, rating, and
number of ratings. The correlation values range from -0.2 to 1 and are represented by the color
gradient, from brown (negative correlation) to purple (positive correlation), shown on the right
side of the heatmap.
• Discounted price and actual price have the highest positive correlation (0.96): products
with a higher actual price also tend to have a higher discounted price, and vice versa.
• Discount percentage is negatively correlated with both discounted price and actual price:
products with a higher discount percentage tend to have lower discounted and actual prices,
and vice versa.
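The correlation matrix behind such a heatmap can be computed with pandas. The sketch below uses a handful of hypothetical rows, so its values will not match the 0.96 reported above; the resulting matrix could then be passed to a plotting function such as seaborn's heatmap.

```python
import pandas as pd

# Hypothetical numeric sample; the real analysis uses the cleaned dataset.
df = pd.DataFrame({
    "discounted_price": [399.0, 199.0, 329.0, 154.0, 149.0],
    "actual_price": [1099.0, 349.0, 699.0, 399.0, 1000.0],
    "discount_percentage": [0.64, 0.43, 0.53, 0.61, 0.85],
    "rating": [4.2, 4.0, 4.2, 4.1, 3.9],
    "rating_count": [24269.0, 43994.0, 94363.0, 16905.0, 24871.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

print(corr.round(2))
```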

The chart shows that the discounted price rises with the actual price, so the two are
positively correlated. However, some data points do not follow the overall trend; these are
outliers. They may be due to factors such as product quality or brand that also affect product
prices. This should be considered when building a product price prediction model.

Rating vs rating count

Most products have ratings between about 3.75 and 4.38. The rating distribution is left-skewed,
and no product is rated lower than 2.0.
The rating_count range is very wide, from 0 to over 40000, but most products have between 0 and
5000 ratings. The rating_count distribution is strongly right-skewed.
• Rating distribution: The chart on the left depicts the distribution of product ratings.
Most products have ratings ranging from 3.75 to 4.38, indicating that consumers are
generally satisfied with the products they buy. No product is rated lower than 2.0,
indicating that the minimum quality of the products is quite high. The distribution is
skewed to the left, indicating that there are more products with high ratings than with low
ratings.
• Distribution of the number of ratings (rating_count): The chart on the right depicts the
distribution of the number of ratings per product. The rating_count range is very wide,
from 0 to over 40000, but most products have between 0 and 5000 ratings. A few products
are very popular and receive many ratings, while the majority receive far fewer. The
distribution is skewed to the right, indicating that a small number of products have a
very high number of ratings.
• What the charts show: Together, the charts give an overview of product quality (based on
ratings) and product popularity (based on the number of ratings). This helps us understand
customer satisfaction with products as well as the level of engagement with them. The
information can be useful when predicting product prices, as quality and popularity can
affect prices.
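The direction of skew read from the charts can also be checked numerically. This is an illustrative sketch on hypothetical values shaped like the two distributions described above, using the pandas skew method: a negative result indicates left skew and a positive result right skew.

```python
import pandas as pd

# Hypothetical values: ratings bunched near 4 with a low tail,
# rating counts mostly small with a few very large values.
ratings = pd.Series([2.0, 3.5, 3.9, 4.0, 4.1, 4.1, 4.2, 4.3, 4.4])
rating_counts = pd.Series([10, 50, 120, 300, 800, 1500, 4000, 12000, 40000])

rating_skew = ratings.skew()       # negative: tail extends to the left
count_skew = rating_counts.skew()  # positive: tail extends to the right

print(rating_skew, count_skew)
```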

The figure consists of two bar charts showing the number and percentage of products in each
rating score group. More than 75% of products receive a "Satisfied" rating from customers,
indicating that the overall level of customer satisfaction is very high. Only 6 products,
corresponding to 0.4%, receive a 'Dissatisfied' score, indicating that the minimum quality of
the products is quite high. This information is crucial for predicting Amazon product prices:
products with positive reviews tend to sell better and can therefore be listed at a higher
price, whereas products with negative or low reviews may need to be discounted to attract
buyers.
Number of rating scores by main category:

The chart shows the number and percentage of products in each rating score group, broken down by
main category. The "Electronics" category has the highest number of "Very Satisfied" reviews.
This may indicate that products in this category meet or exceed customer expectations, resulting
in high levels of satisfaction. In contrast, "Toys & Games" has the lowest total number of
reviews, possibly due to unsatisfactory product quality or failure to meet customer
expectations. More than 75% of products receive a "Satisfied" rating from customers, indicating
that the overall level of customer satisfaction is very high. Only 6 products, corresponding to
0.4%, receive a 'Dissatisfied' score, indicating that the minimum quality of the products is
quite high. This information is crucial for predicting Amazon product prices: products with
positive reviews tend to sell better and can therefore be listed at a higher price, whereas
products with negative or low reviews may need to be discounted to attract buyers.

The graph shows the number and percentage of products in each rating score group for
subcategories such as "Accessories", "Cameras & Photography", etc. Overall, "Headphones, Earbuds
& Accessories" and "Mobiles & Accessories" have the highest number of "Very Satisfied" reviews,
while "Toys & Games" has the lowest total number of reviews.

actual_price , discounted_price and discount_percentage

The charts show the distributions of actual price, discounted price, and discount percentage.
Both the actual price and the discounted price have a wide range, from 0 to 140000 and from 0 to
78000 respectively, and both are strongly right-skewed, indicating that a few products cost much
more than the majority. The discount percentage is more balanced, with most products discounted
by 50% to 60%. Understanding the price distribution and discount rate helps us determine the
right price for a particular product based on its actual value and the level of market discount.
Analyzing the price distribution also helps us identify products that cost more or less than
most others, which can inform appropriate pricing strategies.

"Computers & Accessories" and "Electronics" have the widest range of discounts, with the
highest discounts up to around 80%. In contrast, "Musical Instruments" has the narrowest range
of discounts, while "Health & Personal Care" has no data on discounts. This information is
crucial in predicting product prices on Amazon. Understanding the price distribution and
discount rate helps us determine the right price for a particular product based on its actual value
and level of market discount. At the same time, analyzing the distribution of prices also helps us
identify which products are priced higher or lower than most other products, thereby devising
appropriate pricing strategies.

Subcategories such as "Accessories & Peripherals" and "Home Theater, TV & Video" have a wide
range of discounts, while subcategories such as "Printers, Inks & Accessories" and "Monitors"
have a narrow range. This indicates that discount strategies differ across product types.
Technology and electronic products often show high volatility in price reductions, while other
products such as household and medical items tend to remain stable. This information is crucial
in predicting product prices on Amazon. Understanding the price distribution and discount rate
helps us determine the right price for a particular product based on its actual value and the
level of market discount.

The "Actual Price Range by Rating Score" chart depicts the distribution of products by actual
price and corresponding review score. The majority of products are concentrated in the low price
range, especially below 20,000 units of currency, and products with a "Satisfied" rating stand
out in this range. Products are sparse in the mid to high price range, with very few items
priced above 80,000 units. Interestingly, items rated "Dissatisfied" and "Very Satisfied" are
found only in the low price range, indicating that both low- and high-quality products can be
found at affordable prices.
The chart supports some important observations about the relationship between price and product
reviews. In particular, not all expensive products are of good quality, and not all cheap
products are of poor quality. This gives us an insight into the product market and how price and
quality interact. It can also aid purchasing decisions that weigh price against product quality.

The chart shows that items across the range of discounted prices receive different review
scores. Most items, regardless of their discounted price, are rated "Satisfied". Items priced
between 0 and 20,000 are concentrated together, with ratings ranging from "Not satisfied" to
"Very satisfied". There are fewer items in the higher price ranges, and they also mostly receive
"Satisfied" reviews. This may indicate that while price is a factor in customer satisfaction,
other factors such as quality or service play an integral role.
Compared with the first chart, the price reduction has significantly changed the price
distribution of the products. This may indicate that a discount strategy can be an effective
tool to enhance customer satisfaction and increase sales. However, excessive discounts can lead
to a decrease in brand value and profitability.

IV. Result and discussion

1. Features Selection
• Check null for value

The output shows that there are no null values in the columns "product_id", "discounted_price",
"actual_price", "discount_percentage", "rating", "main_category" and "sub_category". However,
there are two null values in the "rating_count" column.
The dataset is therefore of good quality, as most of the values are filled in. However, the
remaining null values may affect the data analysis results.

This command replaces all null values in the "rating_count" column with the mode of the column,
that is, the value that appears most often.
Replacing null values with the mode makes the dataset complete. However, this approach may
distort the results of data analysis if the true values of the missing entries differ from the
mode.
For example, if a product actually has 1000 reviews but this value is missing from the dataset,
replacing it with a smaller, incorrect value will reduce the average of the "rating_count"
column.
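Mode imputation as described can be sketched like this. The DataFrame name df1 follows the text, the values are hypothetical, and the exact command used in the original notebook is not shown, so this is one plausible form.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries; 120.0 is the most frequent value.
df1 = pd.DataFrame({"rating_count": [120.0, 120.0, 4500.0, np.nan, np.nan]})

# mode() returns a Series because ties are possible; take the first entry.
mode_value = df1["rating_count"].mode()[0]
df1["rating_count"] = df1["rating_count"].fillna(mode_value)

print(df1["rating_count"].tolist())
```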
• Split training

The code defines two variables, X1 and y1. X1 takes the columns 'rating', 'rating_count' and
'actual_price' of DataFrame df1, and y1 takes the column 'discounted_price' of df1. This
prepares the data for statistical analysis or machine learning by defining the feature variables
(X) and the target variable (y).
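In code, the feature and target definitions might look like the sketch below. The variable and column names follow the text; the four sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned rows standing in for df1.
df1 = pd.DataFrame({
    "rating": [4.2, 4.0, 3.9, 4.1],
    "rating_count": [24269.0, 43994.0, 7928.0, 16905.0],
    "actual_price": [1099.0, 349.0, 1899.0, 399.0],
    "discounted_price": [399.0, 199.0, 199.0, 154.0],
})

X1 = df1[["rating", "rating_count", "actual_price"]]  # feature matrix
y1 = df1["discounted_price"]                          # target vector

print(X1.shape, y1.shape)
```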

2. Model training & Evaluation.

2.1. Linear Regression model

We apply a linear regression model using the scikit-learn library. The code imports the
necessary libraries, splits the dataset into training and test sets, creates a linear regression
model, fits it to the training data, and makes predictions on the test data. Predictions are
stored in y1_pred.
• train_test_split: This function splits the dataset into training and testing sets.
The test_size parameter specifies the proportion of the dataset to include in the test split
(in this case, 20%).
• LinearRegression: This scikit-learn class represents a linear regression model.
• fit: This method trains the linear regression model on the training data (X1_train
and y1_train).
• predict: This method makes predictions on the test data (X1_test), and the
results are stored in the variable y1_pred.
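The workflow in the list above can be sketched as follows. The data are synthetic (generated with an assumed linear relationship plus noise), and random_state is added here only to make the sketch reproducible; neither comes from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df1: discounted_price depends mainly on actual_price.
rng = np.random.default_rng(0)
n = 200
df1 = pd.DataFrame({
    "rating": rng.uniform(2.0, 5.0, n),
    "rating_count": rng.integers(0, 40000, n).astype(float),
    "actual_price": rng.uniform(100.0, 5000.0, n),
})
df1["discounted_price"] = 0.6 * df1["actual_price"] + rng.normal(0.0, 50.0, n)

X1 = df1[["rating", "rating_count", "actual_price"]]
y1 = df1["discounted_price"]

# Hold out 20% of the rows for testing.
X1_train, X1_test, y1_train, y1_test = train_test_split(
    X1, y1, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X1_train, y1_train)     # learn intercept and coefficients
y1_pred = model.predict(X1_test)  # predictions for the held-out rows

print(y1_pred[:5])
```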

It is now calculating and printing the intercept, coefficients, and the R-squared value for your
linear regression model. Here's what each of these values means:

25
 Intercept: The intercept is the predicted value of the dependent variable when all
independent variables are zero. Here, it is approximately -420.15.
 Coefficients: Each coefficient is the change in the dependent variable for a one-unit
change in the corresponding independent variable, with all other variables held
constant. The coefficients are [31.9828022, 0.00124927, 0.6262998] for the features in X1.
 R-squared (R2) Score: R-squared measures how well the linear regression model explains
the variance in the dependent variable. Its maximum is 1, which indicates a perfect fit;
on test data it can fall below 0 if the model predicts worse than the mean of the target.
Here, an R2 score of approximately 0.924 means that the model explains about 92.4% of
the variance in the target variable.
Overall, it seems like the linear regression model is performing well on the test data based on the
high R-squared value.
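To make the R-squared definition concrete, the sketch below computes it by hand as 1 - SS_res / SS_tot. The function name r2_score_manual and the four price pairs are made up for illustration; they are not taken from the dataset.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up prices, for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

# SS_res = 400 and SS_tot = 50000, so the score is about 0.992.
print(r2_score_manual(actual, predicted))
```

A score of 0 would mean the model is no better than predicting the mean price for every product.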

2.2. Apply scaler

This code snippet begins by importing StandardScaler from the scikit-learn library, which is used
to standardize the features. After selecting the relevant features ('rating', 'rating_count', and
'actual_price') and the target variable ('discounted_price') from the dataset, we standardize the
features using the fit_transform method of the StandardScaler class. The dataset is then split
into training and testing sets, with 80% of the data used to train the linear regression model and
the remaining 20% used to test its performance. The linear regression model is fitted to the
training data, and predictions are generated for the test set. Finally, the model's performance is
evaluated using the R2 score, which measures the proportion of variance in the target variable
that the model explains. The resulting R2 score of 0.912 indicates that the model explains
approximately 91.2% of the variance in the discounted prices on the test data.
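A minimal sketch of the scaling workflow described above, under stated assumptions: the data is synthetic, the target relationship is invented, and random_state=42 is an assumed seed. The order of operations (fit_transform first, split second) follows the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for rating, rating_count, actual_price.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([
    rng.uniform(1.0, 5.0, n),         # rating
    rng.uniform(1.0, 50_000.0, n),    # rating_count
    rng.uniform(100.0, 80_000.0, n),  # actual_price
])
y = 0.6 * X[:, 2] + rng.normal(0.0, 500.0, n)  # hypothetical discounted_price

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X2_train, X2_test, y2_train, y2_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

scaled_model = LinearRegression().fit(X2_train, y2_train)
print("R2 on test set:", scaled_model.score(X2_test, y2_test))
```

A model trained on standardized features expects standardized input, so any new row would need to pass through scaler.transform before being given to predict.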

2.3. Lasso

This extended code snippet adds the Lasso regression model from scikit-learn (Lasso) to the
existing linear regression workflow. The Lasso model applies L1 regularization, a penalty term
proportional to the absolute values of the coefficients, which encourages sparsity in the model.
 We instantiate a Lasso model with the regularization parameter (alpha) set to 0.1 and fit
the model to the training data (X2_train, y2_train).
 The trained Lasso model makes predictions on the test set (X2_test), and the predicted
values are stored in lasso_pred.
 The code prints the coefficients of the Lasso model (lasso.coef_) and the R2 score of the
model on the test set (lasso.score(X2_test, y2_test)).
The Lasso coefficients indicate the impact of each feature on the target variable, and the R2
score shows the model's performance. The score of approximately 0.912 shows that the Lasso
model performs similarly to the linear regression model. L1 regularization can shrink some
coefficients to exactly zero, which acts as a form of feature selection. This regularization
technique is particularly useful for high-dimensional datasets where feature selection is
important.
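The Lasso steps above can be sketched as follows. The data is synthetic and the seed is an assumption; alpha=0.1 and the names lasso, lasso_pred, X2_train, and X2_test follow the text.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for rating, rating_count, actual_price.
rng = np.random.default_rng(2)
n = 500
X = np.column_stack([
    rng.uniform(1.0, 5.0, n),
    rng.uniform(1.0, 50_000.0, n),
    rng.uniform(100.0, 80_000.0, n),
])
y = 0.6 * X[:, 2] + rng.normal(0.0, 500.0, n)  # hypothetical discounted_price

X_scaled = StandardScaler().fit_transform(X)
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# scikit-learn's Lasso minimizes (1 / (2 * n)) * ||y - Xw||^2 + alpha * ||w||_1.
lasso = Lasso(alpha=0.1)
lasso.fit(X2_train, y2_train)
lasso_pred = lasso.predict(X2_test)

print("Lasso coefficients:", lasso.coef_)
print("Coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))
print("R2 on test set:", lasso.score(X2_test, y2_test))
```

With only three features and a small alpha, few or no coefficients may be zeroed; increasing alpha strengthens the penalty and makes zero coefficients more likely.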

2.4. Ridge

This portion of the code uses Ridge regression, another linear regression model, but with L2
regularization.
We import the Ridge regression model from scikit-learn. Ridge regression applies L2
regularization, a penalty term proportional to the squared magnitudes of the coefficients, which
helps to mitigate multicollinearity and stabilize the model.
We create a Ridge regression model with the regularization parameter (alpha) set to 0.1. The
alpha parameter controls the strength of the regularization, with higher values leading to
stronger regularization.
We fit the Ridge model to the training data (X2_train, y2_train).
We use the trained Ridge model to make predictions on the test set (X2_test), and the predicted
values are stored in ridge_pred.
We print the R2 score of the Ridge model on the test set (ridge.score(X2_test, y2_test)), which
indicates how well the Ridge model explains the variance in the target variable on the test data.
The Ridge score of approximately 0.912 indicates that the Ridge model performs similarly to the
linear regression and Lasso models. Ridge regression is particularly useful when dealing with
multicollinearity, as it tends to distribute the impact of correlated features more evenly across
them. The choice between Lasso, Ridge, and linear regression depends on the specific
characteristics of the dataset and the desired properties of the model.
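The sketch below follows the Ridge steps above and adds one comparison that is not in the report: a second model with a much larger alpha (1000, chosen arbitrarily) to show how the penalty shrinks the coefficients. The data is synthetic and the seed is an assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for rating, rating_count, actual_price.
rng = np.random.default_rng(4)
n = 500
X = np.column_stack([
    rng.uniform(1.0, 5.0, n),
    rng.uniform(1.0, 50_000.0, n),
    rng.uniform(100.0, 80_000.0, n),
])
y = 0.6 * X[:, 2] + rng.normal(0.0, 500.0, n)  # hypothetical discounted_price

X_scaled = StandardScaler().fit_transform(X)
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# scikit-learn's Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2.
ridge = Ridge(alpha=0.1)
ridge.fit(X2_train, y2_train)
ridge_pred = ridge.predict(X2_test)
print("alpha=0.1  coef:", ridge.coef_, "R2:", ridge.score(X2_test, y2_test))

# Illustrative only: a far stronger penalty pulls the coefficients toward zero.
strong_ridge = Ridge(alpha=1000.0).fit(X2_train, y2_train)
print("alpha=1000 coef:", strong_ridge.coef_, "R2:", strong_ridge.score(X2_test, y2_test))
```

With alpha as small as 0.1 the Ridge fit stays very close to ordinary least squares, which is consistent with the nearly identical scores reported for the scaled, Lasso, and Ridge models.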

3. Discussion
During the analysis, we collected performance metrics for each model so that the models can be
compared and the one that gives the best forecasts can be identified.

The provided results show the performance metrics for different regression models, including
Linear Regression, Scaled Linear Regression, Lasso, and Ridge. Let's interpret the results for
each model:

 Linear Regression:
R2 Score: 0.92395
The R2 score is a measure of how well the linear regression model explains the variance
in the target variable. An R2 score of 0.92395 indicates that approximately 92.4% of the
variance in the discounted prices is explained by the model.
 Mean Squared Error (MSE): 4,490,693.47
MSE measures the average squared difference between the predicted and actual values. A
lower MSE indicates better model performance.
 Root Mean Squared Error (RMSE): 2,119.13
RMSE is the square root of MSE and provides an interpretable scale. It represents the
average magnitude of errors.
 Mean Absolute Error (MAE): 827.08
MAE is the average absolute difference between predicted and actual values. It represents
the average magnitude of errors.
 Mean Absolute Percentage Error (MAPE): 62.70%
MAPE expresses errors as a percentage of the actual values. A lower MAPE indicates
better accuracy.

 Scaled Linear Regression
R2 Score: 0.91248

The R2 score is a measure of how well the model explains the variance in the target
variable. An R2 score of 0.91248 indicates that approximately 91.2% of the variance in
the discounted prices is explained by the scaled linear regression model.

 Mean Squared Error (MSE): 4,562,275.29

MSE measures the average squared difference between the predicted and actual values.
In this case, the average squared difference is approximately 4,562,275.29.

 Root Mean Squared Error (RMSE): 2,135.95

RMSE is the square root of MSE and provides an interpretable scale. It represents the
average magnitude of errors. The RMSE is approximately 2,135.95.

 Mean Absolute Error (MAE): 859.40

MAE is the average absolute difference between predicted and actual values. In this case,
the average absolute difference is approximately 859.40.

 Mean Absolute Percentage Error (MAPE): 61.99%

MAPE expresses errors as a percentage of the actual values. In this case, the average
absolute percentage error is approximately 61.99%.

 Lasso
R2 Score: 0.91248

The R2 score measures the proportion of variance in the target variable that is predictable
from the features. An R2 score of 0.91248 indicates that approximately 91.2% of the
variance in the discounted prices is explained by the Lasso regression model.

 Mean Squared Error (MSE): 4,562,180.60

MSE calculates the average squared difference between predicted and actual values. In
this case, the average squared difference is approximately 4,562,180.60.

 Root Mean Squared Error (RMSE): 2,135.93

RMSE is the square root of MSE and provides an interpretable scale. It represents the
average magnitude of errors. The RMSE is approximately 2,135.93.

 Mean Absolute Error (MAE): 859.38

MAE is the average absolute difference between predicted and actual values. Here, the
average absolute difference is approximately 859.38.

 Mean Absolute Percentage Error (MAPE): 61.98%

MAPE expresses errors as a percentage of the actual values. In this case, the average
absolute percentage error is approximately 61.98%.

 Ridge
R2 Score: 0.91249

The R2 score measures the proportion of variance in the target variable that is predictable
from the features. An R2 score of 0.91249 indicates that approximately 91.2% of the
variance in the discounted prices is explained by the Ridge regression model.

 Mean Squared Error (MSE): 4,561,688.66

MSE calculates the average squared difference between predicted and actual values. In
this case, the average squared difference is approximately 4,561,688.66.

 Root Mean Squared Error (RMSE): 2,135.81

RMSE is the square root of MSE and provides an interpretable scale. It represents the
average magnitude of errors. The RMSE is approximately 2,135.81.

 Mean Absolute Error (MAE): 859.33

MAE is the average absolute difference between predicted and actual values. Here, the
average absolute difference is approximately 859.33.

 Mean Absolute Percentage Error (MAPE): 61.97%

MAPE expresses errors as a percentage of the actual values. In this case, the average
absolute percentage error is approximately 61.97%.

In conclusion, the Linear Regression model has the highest R2 score (0.924) and the lowest MSE,
RMSE, and MAE, indicating the best overall performance among the models.
Scaled Linear Regression, Lasso, and Ridge perform almost identically to one another, with
slightly lower R2 scores (about 0.912) and slightly higher MSE, RMSE, and MAE than the regular
Linear Regression. Their MAPE is marginally lower (about 62.0% versus 62.7%).

It is important to note that the choice between these models should consider both performance and
interpretability. Additionally, the specific characteristics of the dataset and the goals of the
analysis may influence the preferred regression technique.
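The four error metrics discussed above can be computed by hand as in the sketch below. The function name regression_metrics and the three price pairs are made up for illustration; the formulas are the standard definitions of MSE, RMSE, MAE, and MAPE.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and MAPE (in percent) for paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when an actual value is zero; this sketch assumes none are.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape}


# Made-up prices, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because MAPE weights every product equally while R2 and MSE are dominated by large prices, a high MAPE can appear next to a high R2 when relative errors are large on low-priced items.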

4. Predict
The tests of the four models above show that the linear regression model has the best overall
performance. Its R2 score of 0.92 indicates a very high level of fit between the model and the
test data. The R2 score, also known as the coefficient of determination, is often used to evaluate
a model's ability to explain variation in the dependent variable. An R2 value close to 1.0
indicates good agreement between the model's predictions and the actual values. Given this
result, we consider the linear regression model the most suitable choice for predicting
discounted product prices.
 Sample predict

This code makes a prediction using the linear regression model (lr_model). It creates a new
DataFrame (new_data) with specific values for the features ('rating', 'rating_count', and
'actual_price'), then uses the predict method of the model to predict the discounted price for
the new data.
 A new DataFrame new_data is created with a single row of data, representing a product
with a rating of 4.5, a rating count of 100, and an actual price of 150.
 The predict method of the linear regression model (lr_model) is used to predict the
discounted price for the new data. The predicted price is stored in the variable
predicted_price.
 The output "Predicted discounted price: [1024532.9294811]" means that, for the
features provided in the new_data DataFrame (rating: 4.5, rating_count: 100,
actual_price: 150), the linear regression model (lr_model) predicts a discounted price of
approximately 1,024,532.93. This is several thousand times the actual price of 150, so
the result should be treated with caution rather than as a realistic discounted price.
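As a sanity check, the same row can be evaluated by hand with the intercept and coefficients reported for the unscaled linear regression model. The sketch assumes the coefficients are ordered as rating, rating_count, actual_price, which the report does not state explicitly.

```python
# Values as reported for the unscaled linear regression model (rounded in the report).
intercept = -420.15
# Assumed order: rating, rating_count, actual_price.
coefficients = [31.9828022, 0.00124927, 0.6262998]
new_row = [4.5, 100, 150]

# A linear model predicts intercept + sum(coefficient * feature value).
manual_prediction = intercept + sum(c * x for c, x in zip(coefficients, new_row))
print(round(manual_prediction, 2))  # about -182.16
```

The hand calculation gives roughly -182, not 1,024,532.93. One possible explanation, which cannot be confirmed from the text, is that the model used at prediction time was trained on standardized features while new_data was passed in unscaled. The negative value also shows that the unscaled model extrapolates poorly for very cheap products.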

 Predict all data


This code is for creating a scatter plot to visualize the relationship between the actual prices
(y1_test) and the predicted prices (y1_pred) from the linear regression model (lr_model).

 The linear regression model (lr_model) is used to predict prices on the test set (X1_test),
and the predicted values are stored in y1_pred.
 A scatter plot is created with y1_test (actual prices) on the x-axis and y1_pred (predicted
prices) on the y-axis. Each point in the scatter plot represents an actual vs. predicted pair.
 A red diagonal line is added to the plot. This line represents a perfect prediction where
actual and predicted prices are equal (y=x).
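The plot described above can be sketched with matplotlib as follows. The values of y1_test and y1_pred are synthetic placeholders, and the output file name actual_vs_predicted.png is hypothetical; the colors and the y = x reference line follow the description.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic placeholders for the actual and predicted test prices.
rng = np.random.default_rng(3)
y1_test = rng.uniform(0.0, 80_000.0, 200)
y1_pred = y1_test + rng.normal(0.0, 2_000.0, 200)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y1_test, y1_pred, color="green", alpha=0.5, label="Test products")

# Reference line y = x: points on it are predicted exactly.
low = min(y1_test.min(), y1_pred.min())
high = max(y1_test.max(), y1_pred.max())
ax.plot([low, high], [low, high], color="red", label="Perfect prediction (y = x)")

ax.set_xlabel("Actual price")
ax.set_ylabel("Predicted price")
ax.set_title("Actual vs. predicted prices")
ax.legend()
fig.savefig("actual_vs_predicted.png")  # hypothetical output file name
```

Points above the red line are over-predicted and points below it are under-predicted, which makes systematic bias easy to spot.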
Result:

The output is a scatter plot comparing actual and predicted prices. The x-axis represents the
actual price, which ranges from 0 to 80000. The y-axis represents the predicted price, which also
ranges from 0 to 80000. The green points scattered across the chart represent individual data
points from the test set. The red line is the diagonal y = x, on which a predicted price equals
the actual price. Most of the green points are close to the red line, indicating relatively
accurate predictions, although there are some outliers.

V. Conclusion
This research endeavors to contribute to the evolving field of e-commerce pricing strategies by
developing a machine learning-based predictive model tailored to the unique dynamics of the
Amazon platform. The ultimate goal is to empower sellers with actionable insights, enabling
them to make informed pricing decisions and navigate the competitive landscape effectively. As
the project progresses, ongoing refinements to the model and methodologies will be made to
ensure relevance and applicability in the ever-changing e-commerce ecosystem.

VI. References
1. https://www.kaggle.com/code/minhtran003/amazon-sales-clean-eda-and-predict#INTRODUCTION

2. https://aws.amazon.com/vi/forecast/pricing/

3. https://www.tuck.dartmouth.edu/news/articles/predicting-price-changes-on-amazon-marketplace

4. https://www.youtube.com/watch?v=lMCvsok1eoU&themeRefresh=1

VII. Individual Contribution

Member                 Student ID   Contribution

Hà Vũ Đăng Huy         20070727     100%
Mai Thị Diệu Linh      20070739     100%
Trần Hoài An           20070661     100%
Phạm Thị Ngọc Lương    20070744     100%
Trần Tuấn Minh         20070754     100%
Nguyễn Đức Hùng        20070724     100%
