Amit Khilare Used Device Data PM Project
Context
Buying and selling used phones and tablets used to happen on only a handful of online
marketplace sites. But the used and refurbished device market has grown considerably over the past
decade, and an IDC (International Data Corporation) forecast predicts that the used phone market
will be worth $52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to
2023. This growth can be attributed to an uptick in demand for used phones and tablets, which offer
considerable savings compared with new models.
Objective
The rising potential of this comparatively under-the-radar market fuels the need for an ML-
based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a
startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to
analyze the data provided and build a linear regression model to predict the price of a used
phone/tablet and identify factors that significantly influence it.
Data Description
The data contains different attributes of used/refurbished phones and tablets. The data
was collected in the year 2021. The data dictionary is summarized below.
Dataset: used_device_data.csv
Data Dictionary
The attributes include brand_name, os, screen_size, 4g, 5g, main_camera_mp, selfie_camera_mp,
int_memory, ram, battery, weight, days_used, years_since_release, normalized_new_price, and the
target variable, normalized_used_price.
Central Tendency:
A. Mean (Average): Represents the sum of all values in a dataset divided by the total number of
values.
B. Median: The middle value in a dataset that is arranged in ascending order (from the smallest value to
the largest value). If a dataset contains an even number of values, the median of the dataset is the
mean of the two middle values.
C. Mode: Defines the most frequently occurring value in a dataset. In some cases, a dataset may contain
multiple modes, while some datasets may not have any mode at all.
The selection of a central tendency measure depends on the properties of a dataset. For instance, the
mode is the only central tendency measure for categorical data, while the median works best with ordinal data.
Although the mean is regarded as the best measure of central tendency for quantitative data, that is not always
the case. For example, the mean may not work well with quantitative datasets that contain extremely large or
extremely small values, as such extreme values can distort the mean. In those cases, you may consider other measures.
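For illustration, these measures can be computed directly with pandas. The following is a minimal sketch, assuming the dataset has been loaded from used_device_data.csv into a dataframe named df (normalized_used_price is the target column listed in the data dictionary above):
import pandas as pd

df = pd.read_csv("used_device_data.csv")  # load the dataset

# Central tendency of the target variable
print(df["normalized_used_price"].mean())    # mean (average)
print(df["normalized_used_price"].median())  # median (middle value)
print(df["normalized_used_price"].mode())    # mode(s); a column may have several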
Dispersion:
Most measures of dispersion have the same units as the quantity being measured. In other words, if the
measurements are in meters or seconds, so is the measure of dispersion. There are two main types of
dispersion measures in statistics:
A. Absolute Measure of Dispersion:
An absolute measure of dispersion carries the same unit as the original dataset and expresses variation in
terms of the average deviation of observations, as with the mean or standard deviation. Examples include the
range, variance, standard deviation, and quartile deviation.
B. Relative Measure of Dispersion:
A relative measure of dispersion is unit-free and is used to compare the spread of two or more datasets, even
when they are measured in different units. A common example is the coefficient of variation, the ratio of the
standard deviation to the mean.
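A quick sketch of both kinds of measures on the target column (assuming the df dataframe from the previous sketch):
# Absolute measures of dispersion: same units as the data
col = df["normalized_used_price"]
print(col.max() - col.min())  # range
print(col.var())              # variance
print(col.std())              # standard deviation

# Relative measure of dispersion: unit-free coefficient of variation
print(col.std() / col.mean())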
Correlation:
A. Positive Correlation:
Two features (variables) can be positively correlated with each other, meaning that as the value of one
variable increases, the value of the other variable(s) also increases.
B. Negative Correlation:
Two features (variables) can be negatively correlated with each other, meaning that as the value of one
variable increases, the value of the other variable(s) decreases.
C. No Correlation:
Two features (variables) are not correlated with each other, meaning that a change in one variable is not
associated with any consistent change in the other.
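In practice, the pairwise correlations between numeric attributes can be inspected with a correlation matrix and a heatmap. A sketch, again assuming df (numeric_only=True requires a recent pandas version):
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation between all numeric columns
corr = df.corr(numeric_only=True)

# A heatmap makes positive, negative, and near-zero correlations easy to spot
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()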
4) Data Visualization
Data visualization is the graphical representation of information and data. By using visual elements like
charts, graphs, and maps, data visualization techniques provide an accessible way to see and understand
trends, outliers, and patterns in data.
In the modern world of Big Data, data visualization tools and technologies are crucial for analyzing massive
amounts of information and making data-driven decisions. Data visualization is used in many areas, for example:
● To model complex events.
● To visualize phenomena that cannot be observed directly, such as weather patterns, medical conditions, or
mathematical relationships.
1. Distribution Plot
This is one of the best univariate plots for understanding the distribution of data. It is used heavily when
analyzing how the target variable (output) behaves with respect to an independent variable (input). It
combines a probability density function (PDF) and a histogram in a single plot.
2. Line Plot
This plot turns up in nearly every analysis involving two variables. A line plot simply connects the values of a
series of data points with straight lines. It may seem very simple, but it has many applications, not only in
machine learning but in many other areas.
3. Bar Plot
This is one of the most widely used plots, seen not just in data analysis but wherever there is trend analysis in
many fields. Though it may seem simple, it is powerful for analyzing data such as weekly sales figures,
revenue from a product, or the number of visitors to a site on each day of the week.
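The three plot types above map directly onto seaborn/matplotlib calls. A sketch using columns from this dataset (the release_year column is an assumption based on the "price across the years" analysis later in this report):
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Distribution plot: histogram with a density (KDE) curve overlaid
sns.histplot(data=df, x="normalized_used_price", kde=True)
plt.show()

# 2. Line plot: average used price by release year (release_year assumed)
df.groupby("release_year")["normalized_used_price"].mean().plot(kind="line")
plt.show()

# 3. Bar plot: number of devices per operating system
sns.countplot(data=df, x="os")
plt.show()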
Implementation:
Bivariate Analysis
Let's see how the amount of RAM varies across brands.
Create a dataframe of only those devices that offer a large battery and analyze it.
Let's see how the price of used devices varies across the years.
Checking how prices vary for used phones and tablets offering 4G and 5G networks. Sketches for these checks follow below.
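Sketches for the bivariate checks listed above, assuming df from earlier (the battery threshold of 4500 mAh is an illustrative choice, not taken from the original analysis):
import seaborn as sns
import matplotlib.pyplot as plt

# Amount of RAM across brands
sns.barplot(data=df, x="brand_name", y="ram")
plt.xticks(rotation=90)
plt.show()

# Devices offering a large battery (threshold is an assumption)
df_large_battery = df[df["battery"] > 4500]
print(df_large_battery.describe())

# Used price for devices with and without 4G / 5G
sns.boxplot(data=df, x="4g", y="normalized_used_price")
plt.show()
sns.boxplot(data=df, x="5g", y="normalized_used_price")
plt.show()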
Outlier Check
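Outliers in the numeric columns can be eyeballed with boxplots; a minimal sketch (the subplot layout is an arbitrary choice):
import matplotlib.pyplot as plt

# One boxplot per numeric column; points beyond the whiskers are potential outliers
num_cols = df.select_dtypes(include="number").columns
df[num_cols].plot(kind="box", subplots=True, layout=(4, 4), figsize=(15, 12))
plt.tight_layout()
plt.show()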
Data Preparation for Modelling
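A typical preparation pipeline for linear regression on this data, sketched below. The dummy-variable names match those in the model output later in the report; the 70/30 split ratio and random_state are assumptions:
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode categoricals; drop_first avoids perfect multicollinearity
X = pd.get_dummies(df.drop("normalized_used_price", axis=1),
                   columns=["brand_name", "os", "4g", "5g"],
                   drop_first=True)
X = X.astype(float)  # statsmodels expects numeric dtypes
y = df["normalized_used_price"]

# Train-test split (70/30 is an assumption)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)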
Model Building - Linear Regression
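The model itself can be fit with statsmodels OLS. A sketch (the name olsmodel1 for the initial full model is an assumption; x_train_with_const is the name used in the code further below):
import statsmodels.api as sm

# statsmodels OLS needs an explicit intercept column
x_train_with_const = sm.add_constant(x_train)
x_test = sm.add_constant(x_test)

# Fit ordinary least squares on the training data
olsmodel1 = sm.OLS(y_train, x_train_with_const).fit()
print(olsmodel1.summary())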
Checking Linear Regression Assumptions
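Multicollinearity among the predictors is checked with variance inflation factors (VIF). A sketch using statsmodels' variance_inflation_factor (the helper name checking_vif is an assumption); the table below shows the VIFs after dropping main_camera_mp:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def checking_vif(predictors):
    # VIF per column: how much its variance is inflated by the other predictors
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns
    vif["VIF"] = [variance_inflation_factor(predictors.values, i)
                  for i in range(predictors.shape[1])]
    return vif

print(checking_vif(x_train_with_const))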
VIF after dropping main_camera_mp
feature VIF
0 const 227.000703
1 screen_size 8.261309
2 selfie_camera_mp 2.851442
3 int_memory 1.339852
4 ram 2.272497
5 battery 4.018931
6 weight 6.207163
7 days_used 2.544869
8 normalized_new_price 2.713690
9 years_since_release 4.836333
10 brand_name_Alcatel 3.458240
11 brand_name_Apple 11.192808
12 brand_name_Asus 3.646213
13 brand_name_BlackBerry 1.622996
14 brand_name_Celkon 1.868741
15 brand_name_Coolpad 1.573721
16 brand_name_Gionee 2.076641
17 brand_name_Google 1.387899
18 brand_name_HTC 3.459903
19 brand_name_Honor 3.544615
20 brand_name_Huawei 6.395258
21 brand_name_Infinix 1.191416
22 brand_name_Karbonn 1.626941
23 brand_name_LG 5.348720
24 brand_name_Lava 1.824997
25 brand_name_Lenovo 4.702696
26 brand_name_Meizu 2.404332
27 brand_name_Micromax 3.776810
28 brand_name_Microsoft 2.091350
29 brand_name_Motorola 3.465108
30 brand_name_Nokia 3.747557
31 brand_name_OnePlus 1.585266
32 brand_name_Oppo 4.284583
33 brand_name_Others 10.827443
34 brand_name_Panasonic 1.886714
35 brand_name_Realme 1.965466
36 brand_name_Samsung 8.011342
37 brand_name_Sony 2.834760
38 brand_name_Spice 1.637533
39 brand_name_Vivo 3.727480
40 brand_name_XOLO 2.160588
41 brand_name_Xiaomi 4.063138
42 brand_name_ZTE 4.327130
43 os_Others 1.878917
44 os_Windows 1.739946
45 os_iOS 10.026109
46 4g_yes 2.420727
47 5g_yes 1.787727
We will drop the predictor variables having a p-value greater than 0.05, as they do not
significantly impact the target variable.
• But sometimes p-values change after dropping a variable. So, we'll not drop all variables at once.
• Instead, we will do the following:
  • Build a model, check the p-values of the variables, and drop the column with the highest p-value.
  • Create a new model without the dropped feature, check the p-values of the variables, and drop the column with the highest p-value.
  • Repeat the above two steps till there are no columns with p-value > 0.05.
The above process can also be done manually by picking one variable at a time that has a high
p-value, dropping it, and building a model again. But that would be a little tedious, and using a loop,
as sketched below, is more efficient.
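A minimal sketch of such a loop, assuming x_train_with_const and y_train from the earlier steps; it ends with the selected_features list used in the next cell:
import statsmodels.api as sm

selected_features = list(x_train_with_const.columns)
while True:
    model = sm.OLS(y_train, x_train_with_const[selected_features]).fit()
    # p-values of all predictors except the intercept
    pvalues = model.pvalues.drop("const")
    if pvalues.max() <= 0.05:
        break
    # Drop the single worst predictor and refit
    selected_features.remove(pvalues.idxmax())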
# Keep only the significant predictors and refit the model on the reduced set
x_train3 = x_train_with_const[selected_features]
x_test3 = x_test[selected_features]
olsmodel2 = sm.OLS(y_train, x_train3).fit()
print(olsmodel2.summary())
OLS Regression Results
=====================================================================
Dep. Variable: normalized_used_price R-squared: 0.847
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 886.8
Date: Thu, 21 Nov 2024 Prob (F-statistic): 0.00
Time: 21:13:52 Log-Likelihood: 110.96
No. Observations: 2417 AIC: -189.9
Df Residuals: 2401 BIC: -97.27
Df Model: 15
Covariance Type: nonrobust
==================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------
const 1.3715 0.052 26.565 0.000 1.270 1.473
screen_size 0.0291 0.003 8.473 0.000 0.022 0.036
main_camera_mp 0.0234 0.001 16.151 0.000 0.021 0.026
selfie_camera_mp 0.0119 0.001 10.644 0.000 0.010 0.014
int_memory 0.0002 6.66e-05 2.836 0.005 5.83e-05 0.000
ram 0.0293 0.005 5.686 0.000 0.019 0.039
battery -1.46e-05 7.19e-06 -2.030 0.043 -2.87e-05 -4.94e-07
weight 0.0008 0.000 6.200 0.000 0.001 0.001
normalized_new_price 0.4092 0.011 36.760 0.000 0.387 0.431
years_since_release -0.0218 0.004 -5.847 0.000 -0.029 -0.014
brand_name_Celkon -0.1905 0.053 -3.571 0.000 -0.295 -0.086
brand_name_Nokia 0.0823 0.030 2.771 0.006 0.024 0.140
brand_name_Xiaomi 0.0829 0.025 3.296 0.001 0.034 0.132
os_Others -0.0598 0.030 -1.993 0.046 -0.119 -0.001
4g_yes 0.0461 0.015 3.023 0.003 0.016 0.076
5g_yes -0.0900 0.031 -2.862 0.004 -0.152 -0.028
====================================================================
Omnibus: 240.704 Durbin-Watson: 1.993
Prob(Omnibus): 0.000 Jarque-Bera (JB): 674.212
Skew: -0.537 Prob(JB): 3.95e-147
Kurtosis: 5.354 Cond. No. 4.08e+04
TEST FOR LINEARITY AND INDEPENDENCE
We will test for linearity and independence by making a plot of fitted values vs residuals and checking for
patterns.
If there is no pattern, the model is linear and the residuals are independent.
Otherwise, the model shows signs of non-linearity, and the residuals are not independent.
import matplotlib.pyplot as plt
plt.scatter(olsmodel2.fittedvalues, olsmodel2.resid)  # fitted values vs residuals
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()
Final Model Summary
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.849
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 277.1
Date: Thu, 21 Nov 2024 Prob (F-statistic): 0.00
Time: 21:24:26 Log-Likelihood: 125.15
No. Observations: 2417 AIC: -152.3
Df Residuals: 2368 BIC: 131.4
Df Model: 48
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const 1.4172 0.072 19.677 0.000 1.276 1.558
screen_size 0.0295 0.004 8.352 0.000 0.023 0.036
main_camera_mp 0.0232 0.002 14.892 0.000 0.020 0.026
selfie_camera_mp 0.0116 0.001 9.924 0.000 0.009 0.014
int_memory 0.0002 6.76e-05 2.779 0.005 5.53e-05 0.000
ram 0.0305 0.005 5.797 0.000 0.020 0.041
battery -1.665e-05 7.35e-06 -2.266 0.024 -3.11e-05 -2.24e-06
weight 0.0008 0.000 5.946 0.000 0.001 0.001
days_used 3.376e-05 3.07e-05 1.101 0.271 -2.64e-05 9.39e-05
normalized_new_price 0.4104 0.012 33.494 0.000 0.386 0.434
years_since_release -0.0255 0.005 -5.589 0.000 -0.034 -0.017
brand_name_Alcatel -0.0804 0.050 -1.618 0.106 -0.178 0.017
brand_name_Apple -0.0438 0.148 -0.297 0.767 -0.333 0.246
brand_name_Asus 0.0068 0.049 0.138 0.890 -0.090 0.103
brand_name_BlackBerry 0.0312 0.072 0.434 0.664 -0.110 0.172
brand_name_Celkon -0.2372 0.068 -3.485 0.001 -0.371 -0.104
brand_name_Coolpad -0.0308 0.071 -0.434 0.664 -0.170 0.108
brand_name_Gionee -0.0130 0.059 -0.221 0.825 -0.128 0.102
brand_name_Google -0.1192 0.083 -1.442 0.149 -0.281 0.043
brand_name_HTC -0.0403 0.050 -0.805 0.421 -0.138 0.058
brand_name_Honor -0.0478 0.051 -0.941 0.347 -0.147 0.052
brand_name_Huawei -0.0599 0.046 -1.299 0.194 -0.150 0.030
brand_name_Infinix 0.1338 0.113 1.179 0.238 -0.089 0.356
brand_name_Karbonn -0.0592 0.068 -0.867 0.386 -0.193 0.075
brand_name_LG -0.0608 0.047 -1.299 0.194 -0.152 0.031
brand_name_Lava -0.0230 0.063 -0.364 0.716 -0.147 0.101
brand_name_Lenovo -0.0364 0.047 -0.771 0.441 -0.129 0.056
brand_name_Meizu -0.0860 0.056 -1.530 0.126 -0.196 0.024
brand_name_Micromax -0.0645 0.049 -1.315 0.189 -0.161 0.032
brand_name_Microsoft 0.0749 0.082 0.916 0.360 -0.085 0.235
brand_name_Motorola -0.0686 0.051 -1.349 0.177 -0.168 0.031
brand_name_Nokia 0.0380 0.052 0.724 0.469 -0.065 0.141
brand_name_OnePlus -0.0384 0.073 -0.523 0.601 -0.182 0.105
brand_name_Oppo -0.0294 0.049 -0.599 0.549 -0.126 0.067
brand_name_Others -0.0679 0.044 -1.556 0.120 -0.154 0.018
brand_name_Panasonic -0.0427 0.062 -0.690 0.490 -0.164 0.078
brand_name_Realme -0.0359 0.063 -0.568 0.570 -0.160 0.088
brand_name_Samsung -0.0617 0.045 -1.376 0.169 -0.150 0.026
brand_name_Sony -0.0756 0.053 -1.428 0.153 -0.179 0.028
brand_name_Spice -0.0356 0.068 -0.520 0.603 -0.170 0.099
brand_name_Vivo -0.0644 0.050 -1.277 0.202 -0.163 0.034
brand_name_XOLO -0.0781 0.057 -1.362 0.173 -0.191 0.034
brand_name_Xiaomi 0.0325 0.050 0.655 0.513 -0.065 0.130
brand_name_ZTE -0.0460 0.048 -0.952 0.341 -0.141 0.049
os_Others -0.0604 0.033 -1.856 0.064 -0.124 0.003
os_Windows -0.0374 0.043 -0.861 0.389 -0.122 0.048
os_iOS -0.0141 0.148 -0.095 0.924 -0.304 0.276
4g_yes 0.0406 0.016 2.514 0.012 0.009 0.072
5g_yes -0.0916 0.032 -2.846 0.004 -0.155 -0.029
==============================================================================
Omnibus: 234.465 Durbin-Watson: 1.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 630.705
Skew: -0.536 Prob(JB): 1.11e-137
Kurtosis: 5.262 Cond. No. 1.85e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.85e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Test Performance
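These numbers can be reproduced on the test set along the following lines (a sketch, assuming olsmodel2 and the x_test3/y_test objects from the earlier steps; sklearn.metrics is one way to compute them):
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = olsmodel2.predict(x_test3)
mse = mean_squared_error(y_test, y_pred)
print({"MSE": mse,
       "RMSE": np.sqrt(mse),
       "R-squared": r2_score(y_test, y_pred)})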
{'MSE': 0.05715049913984885,
'RMSE': 0.23906170571601143,
'R-squared': 0.8321757391826543}
Conclusion:
We performed exploratory data analysis on the dataset, covering descriptive analysis (central tendency,
dispersion, and correlation between attributes) and data visualization techniques such as bar plots, line plots,
and scatter plots. We then built a linear regression model to predict the normalized price of a used device;
on the test set, the final model achieves an R-squared of about 0.83 with an RMSE of about 0.24.