
__________________________________

Predictive Modelling Project


Report
__________________________________
PGP-DSBA (PGPDSBA.O.AUG24.A)

Prepared by: Amit Khilare


Project Predictive Modelling: Used Device Data

Context

Buying and selling used phones and tablets used to happen on only a handful of online
marketplace sites. But the used and refurbished device market has grown considerably
over the past decade, and an IDC (International Data Corporation) forecast predicts that
the used phone market will be worth $52.7 billion by 2023, with a compound annual growth
rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in
demand for used phones and tablets that offer considerable savings compared with new models.

Refurbished and used devices continue to provide cost-effective alternatives for both
consumers and businesses looking to save money on a purchase. There are plenty of
other benefits associated with the used device market. Used and refurbished devices can be sold
with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such
as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices. Maximizing
the longevity of devices through second-hand trade also reduces their environmental impact and
helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this
segment as consumers cut back on discretionary spending and buy phones and tablets only for
immediate needs.

Objective

The rising potential of this comparatively under-the-radar market fuels the need for an
ML-based solution to develop a dynamic pricing strategy for used and refurbished devices.
ReCell, a startup aiming to tap the potential in this market, has hired you as a data
scientist. They want you to analyze the data provided, build a linear regression model to
predict the price of a used phone/tablet, and identify the factors that significantly
influence it.

Data Description

The data contains the different attributes of used/refurbished phones and tablets. The data
was collected in the year 2021. The detailed data dictionary is given below.
Dataset: used_device_data.csv

Data Dictionary

 brand_name: Name of the manufacturing brand
 os: Operating system the device runs on
 screen_size: Size of the screen in cm
 4g: Whether 4G is available or not
 5g: Whether 5G is available or not
 main_camera_mp: Resolution of the rear camera in megapixels
 selfie_camera_mp: Resolution of the front camera in megapixels
 int_memory: Amount of internal memory (ROM) in GB
 ram: Amount of RAM in GB
 battery: Energy capacity of the device battery in mAh
 weight: Weight of the device in grams
 release_year: Year when the device model was released
 days_used: Number of days the used/refurbished device has been used
 normalized_new_price: Normalized price of a new device of the same model in euros
 normalized_used_price: Normalized price of the used/refurbished device in euros
Theory:
Exploratory Data Visualization
Exploratory Data Analysis (EDA), an approach developed by John Tukey in the 1970s, is the first
step in the data analysis process. In statistics, exploratory data analysis is an approach to analysing data
sets to summarize their main characteristics, often with visual methods.
As the name suggests, it is the step in which we explore the data set. When building a machine learning
model, you need to be sure that your data makes sense. The main aim of exploratory data analysis is to
obtain enough confidence in your data that you are ready to apply a machine learning algorithm.

1) Descriptive analysis - Central tendency


A measure of central tendency is a summary statistic that represents the centre point or typical value
of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the
central location of a distribution. We can think of it as the tendency of data to cluster around a middle value.
In statistics, the three most common measures of central tendency are the mean, median, and mode. Each of
these measures calculates the location of the central point using a different method. Choosing the best
measure of central tendency depends on the type of data we have.

A. Mean (Average): Represents the sum of all values in a dataset divided by the total number of the
values.
B. Median: The middle value in a dataset that is arranged in ascending order (from the smallest value to
the largest value). If a dataset contains an even number of values, the median of the dataset is the
mean of the two middle values.
C. Mode: Defines the most frequently occurring value in a dataset. In some cases, a dataset may contain
multiple modes, while some datasets may not have any mode at all.

The selection of a central tendency measure depends on the properties of a dataset. For instance, the
mode is the only central tendency measure for categorical data, while the median works best with ordinal data.
Although the mean is regarded as the best measure of central tendency for quantitative data, that is not always
the case. For example, the mean may not work well with quantitative datasets that contain extremely large or
extremely small values, as the extreme values may distort it; in such cases, consider other measures.
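A quick sketch of the three measures with pandas; the numbers below are made up for illustration and show how a single extreme value pulls the mean away from the median:

```python
import pandas as pd

# Hypothetical prices; the 95 is an extreme value (outlier)
prices = pd.Series([10, 12, 12, 14, 95])

mean_val = prices.mean()          # (10+12+12+14+95)/5 = 28.6, pulled up by the outlier
median_val = prices.median()      # 12, robust to the outlier
mode_val = prices.mode().iloc[0]  # 12, the most frequent value
```

Here the median and mode describe the typical value far better than the mean, which is distorted by the extreme observation.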

2) Descriptive analysis - Dispersion and Distribution


A distribution tells you how likely certain events are; e.g., for the normal distribution (continuous) you
can talk about the probability of getting a number between 3 and 7, while for a discrete distribution like the
Poisson you can ask how likely you are to get exactly 3.
A measure of dispersion tells you, if you see many events happen, how spread out they are going to be.

Dispersion:
Most measures of dispersion have the same units as the quantity being measured. In other words, if the
measurements are in meters or seconds, so is the measure of dispersion. There are two main types of
dispersion methods in statistics which are:
A. Absolute Measure of Dispersion:
An absolute measure of dispersion has the same unit as the original data set and expresses variation in terms
of the average deviation of observations, such as the standard or mean deviation. Common absolute measures
of dispersion are the range, variance, standard deviation, and quartile deviation.

B. Relative Measure of Dispersion:


The relative measures of dispersion are used to compare the distributions of two or more data sets. These
measures are unitless, which makes comparison across different units possible. Common relative dispersion
measures include the coefficient of variation, the coefficient of range, and the coefficient of quartile deviation.
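A small sketch of an absolute measure (range, standard deviation) versus a relative one (coefficient of variation), using made-up numbers:

```python
import numpy as np

sample = np.array([3.0, 4.0, 4.0, 5.0, 6.0, 8.0])  # hypothetical measurements, e.g. in grams

# Absolute measures: same unit as the data
value_range = sample.max() - sample.min()  # 5.0 (grams)
std_dev = sample.std(ddof=1)               # sample standard deviation (grams)

# Relative measure: unitless, so it can be compared across datasets
coeff_of_variation = std_dev / sample.mean()
```

Because the coefficient of variation has no unit, it lets us compare the spread of, say, weights in grams against prices in euros.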
3) Correlation
Correlation explains how one or more variables are related to each other. These variables can be input
features that are used to forecast our target variable.
Correlation is a statistical technique that determines how one variable moves or changes in relation to
another. It gives us an idea of the degree of the relationship between the two variables. It is a bivariate
analysis measure that describes the association between different variables. In most businesses, it is useful
to express one subject in terms of its relationship with others.

Types of Correlation:
Based on the degree of correlation:

A. Positive Correlation:
Two features (variables) can be positively correlated with each other. It means that when the value of one
variable increases then the value of the other variable(s) also increases.

B. Negative Correlation:
Two features (variables) can be negatively correlated with each other. It means that when the value of one
variable increases then the value of the other variable(s) decreases.

C. No Correlation:
Two features (variables) are not correlated with each other. It means that when the value of one variable
increases or decreases then the value of the other variable(s) doesn’t increase or decrease.
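The three cases can be illustrated with pandas' `DataFrame.corr()` on a tiny made-up frame (the column names below are illustrative, not taken from the project dataset):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "ram": [2, 4, 6, 8],                # increases steadily
    "price": [3.0, 3.5, 4.2, 4.8],      # rises with ram      -> positive correlation
    "days_used": [900, 700, 400, 200],  # falls as price rises -> negative correlation
})

corr = df_demo.corr()  # pairwise Pearson correlation coefficients
print(corr)
```

Values near +1 indicate a strong positive relationship, values near -1 a strong negative one, and values near 0 indicate no linear relationship.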

4) Data Visualization
Data visualization is defined as a graphical representation of information and data.
By using visual elements like charts, graphs, and maps, data visualization techniques provide an accessible way
to see and understand trends, outliers, and patterns in data.
In modern days we have a lot of data in our hands; in the world of Big Data, data visualization tools and
technologies are crucial to analyze massive amounts of information and make data-driven decisions. It is used
in many areas, such as:
● To model complex events.
● To visualize phenomena that cannot be observed directly, such as weather patterns, medical conditions, or
mathematical relationships.

Univariate Analysis Techniques for Data Visualization

1. Distribution Plot
It is one of the best univariate plots for understanding the distribution of data. When we want to analyze
the impact on the target variable (output) with respect to an independent variable (input), we use distribution
plots a lot. This plot combines a probability density function (PDF) and a histogram in a single plot.

Bivariate Analysis Techniques for Data Visualization

1. Line Plot
This is a plot you will see in almost any analysis involving two variables. In a line plot, the values of a
series of data points are connected with straight lines. The plot may seem very simple, but it has many
applications, not only in machine learning but in many other areas.

2. Bar Plot
This is one of the most widely used plots, seen not just in data analysis but wherever there is trend
analysis in many fields. Though it may seem simple, it is powerful for analyzing data such as weekly sales
figures, revenue from a product, or the number of visitors to a site on each day of the week.

Implementation:

1. Import NumPy and Pandas

2. Read the CSV file

3. Statistical summary of the dataset

4. Check for duplicate and missing values

5. Explore the categorical features
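A minimal sketch of these five steps; in the project the data would come from used_device_data.csv, but a tiny stand-in frame with the same kind of columns is used here so the sketch runs on its own:

```python
import numpy as np
import pandas as pd

# 1. Imports done above.
# 2. In the project: df = pd.read_csv("used_device_data.csv")
#    A small stand-in frame is used here instead.
df = pd.DataFrame({
    "brand_name": ["Apple", "Samsung", "Samsung", "Nokia"],
    "os": ["iOS", "Android", "Android", "Android"],
    "ram": [4.0, 6.0, 6.0, np.nan],
    "normalized_used_price": [5.2, 4.8, 4.8, 4.1],
})

# 3. Statistical summary of the numeric columns
print(df.describe().T)

# 4. Duplicate rows and missing values
n_duplicates = df.duplicated().sum()
missing_counts = df.isnull().sum()
print("Duplicate rows:", n_duplicates)
print(missing_counts)

# 5. Explore the categorical features
for col in df.select_dtypes(include="object").columns:
    print(col, "->", sorted(df[col].unique()))
```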


Exploratory Data Analysis
Univariate Analysis
histogram_boxplot for normalized_new_price

histogram_boxplot for screen_size

histogram_boxplot for main_camera_mp

histogram_boxplot for selfie_camera_mp

histogram_boxplot for int_memory

histogram_boxplot for ram

histogram_boxplot for weight

histogram_boxplot for battery

histogram_boxplot for days_used

Labelled bar plot for brand_name against the count
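The histogram_boxplot helper used above is not reproduced in the report; a matplotlib-only sketch of what such a helper typically looks like (a boxplot stacked over a histogram that shares the same x-axis) is:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

def histogram_boxplot(data, feature, bins=15):
    """Boxplot on top, histogram below, sharing the same x-axis."""
    values = data[feature].dropna()
    fig, (ax_box, ax_hist) = plt.subplots(
        nrows=2, sharex=True, gridspec_kw={"height_ratios": (0.25, 0.75)}
    )
    ax_box.boxplot(values, vert=False)
    ax_hist.hist(values, bins=bins)
    ax_hist.axvline(values.mean(), color="green", linestyle="--", label="mean")
    ax_hist.axvline(values.median(), color="black", linestyle="-", label="median")
    ax_hist.legend()
    fig.suptitle(f"histogram_boxplot for {feature}")
    return fig
```

It would be called once per numeric column, e.g. `histogram_boxplot(df, "normalized_new_price")`.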

Bivariate Analysis
Let's see how the amount of RAM varies across brands.

Create a dataframe of only those devices which offer a large battery, and analyze it.

Let's see how the price of used devices varies across release years by plotting the average
used price against the release year.

Let's check how prices vary for used phones and tablets offering 4G and 5G networks.
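These bivariate checks can be sketched as follows; the frame below is a tiny made-up stand-in for the real dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "brand_name": ["Apple", "Apple", "Nokia", "Nokia"],
    "ram": [4.0, 6.0, 2.0, 4.0],
    "release_year": [2018, 2020, 2018, 2020],
    "4g": ["yes", "yes", "no", "yes"],
    "normalized_used_price": [5.5, 6.0, 4.0, 4.5],
})

# Average RAM across brands
ram_by_brand = df.groupby("brand_name")["ram"].mean()

# Average used price over release years, shown as a line plot
price_by_year = df.groupby("release_year")["normalized_used_price"].mean()
price_by_year.plot(marker="o")

# Average price for devices with and without 4G
price_by_4g = df.groupby("4g")["normalized_used_price"].mean()
```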
Outlier Check
Data Preparation for modelling
Model Building - Linear Regression
Checking Linear Regression Assumptions
VIF after dropping main_camera_mp
feature VIF
0 const 227.000703
1 screen_size 8.261309
2 selfie_camera_mp 2.851442
3 int_memory 1.339852
4 ram 2.272497
5 battery 4.018931
6 weight 6.207163
7 days_used 2.544869
8 normalized_new_price 2.713690
9 years_since_release 4.836333
10 brand_name_Alcatel 3.458240
11 brand_name_Apple 11.192808
12 brand_name_Asus 3.646213
13 brand_name_BlackBerry 1.622996
14 brand_name_Celkon 1.868741
15 brand_name_Coolpad 1.573721
16 brand_name_Gionee 2.076641
17 brand_name_Google 1.387899
18 brand_name_HTC 3.459903
19 brand_name_Honor 3.544615
20 brand_name_Huawei 6.395258
21 brand_name_Infinix 1.191416
22 brand_name_Karbonn 1.626941
23 brand_name_LG 5.348720
24 brand_name_Lava 1.824997
25 brand_name_Lenovo 4.702696
26 brand_name_Meizu 2.404332
27 brand_name_Micromax 3.776810
28 brand_name_Microsoft 2.091350
29 brand_name_Motorola 3.465108
30 brand_name_Nokia 3.747557
31 brand_name_OnePlus 1.585266
32 brand_name_Oppo 4.284583
33 brand_name_Others 10.827443
34 brand_name_Panasonic 1.886714
35 brand_name_Realme 1.965466
36 brand_name_Samsung 8.011342
37 brand_name_Sony 2.834760
38 brand_name_Spice 1.637533
39 brand_name_Vivo 3.727480
40 brand_name_XOLO 2.160588
41 brand_name_Xiaomi 4.063138
42 brand_name_ZTE 4.327130
43 os_Others 1.878917
44 os_Windows 1.739946
45 os_iOS 10.026109
46 4g_yes 2.420727
47 5g_yes 1.787727

Dropping high p-value variables

We will drop the predictor variables having a p-value greater than 0.05, as they do
not significantly impact the target variable.
• But p-values can change after a variable is dropped, so we will not drop all
variables at once.
• Instead, we will do the following:
• Build a model, check the p-values of the variables, and drop the
column with the highest p-value.
• Create a new model without the dropped feature, check the
p-values of the variables, and drop the column with the highest p-value.
• Repeat the above two steps until there are no columns with
p-value > 0.05.
This process can also be done manually by picking one high p-value variable at a
time, dropping it, and rebuilding the model, but that would be tedious; using a loop
is more efficient.
x_train3 = x_train_with_const[selected_features]

x_test3 = x_test[selected_features]

# Fit OLS on the updated dataset (no multicollinearity, no insignificant predictors)
olsmodel2 = sm.OLS(y_train, sm.add_constant(x_train3)).fit()

print(olsmodel2.summary())
OLS Regression Results
=====================================================================
Dep. Variable: normalized_used_price R-squared: 0.847
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 886.8
Date: Thu, 21 Nov 2024 Prob (F-statistic): 0.00
Time: 21:13:52 Log-Likelihood: 110.96
No. Observations: 2417 AIC: -189.9
Df Residuals: 2401 BIC: -97.27
Df Model: 15
Covariance Type: nonrobust
==================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------
const 1.3715 0.052 26.565 0.000 1.270 1.473
screen_size 0.0291 0.003 8.473 0.000 0.022 0.036
main_camera_mp 0.0234 0.001 16.151 0.000 0.021 0.026
selfie_camera_mp 0.0119 0.001 10.644 0.000 0.010 0.014
int_memory 0.0002 6.66e-05 2.836 0.005 5.83e-05 0.000
ram 0.0293 0.005 5.686 0.000 0.019 0.039
battery -1.46e-05 7.19e-06 -2.030 0.043 -2.87e-05 -4.94e-07
weight 0.0008 0.000 6.200 0.000 0.001 0.001
normalized_new_price 0.4092 0.011 36.760 0.000 0.387 0.431
years_since_release -0.0218 0.004 -5.847 0.000 -0.029 -0.014
brand_name_Celkon -0.1905 0.053 -3.571 0.000 -0.295 -0.086
brand_name_Nokia 0.0823 0.030 2.771 0.006 0.024 0.140
brand_name_Xiaomi 0.0829 0.025 3.296 0.001 0.034 0.132
os_Others -0.0598 0.030 -1.993 0.046 -0.119 -0.001
4g_yes 0.0461 0.015 3.023 0.003 0.016 0.076
5g_yes -0.0900 0.031 -2.862 0.004 -0.152 -0.028
====================================================================
Omnibus: 240.704 Durbin-Watson: 1.993
Prob(Omnibus): 0.000 Jarque-Bera (JB): 674.212
Skew: -0.537 Prob(JB): 3.95e-147
Kurtosis: 5.354 Cond. No. 4.08e+04
TEST FOR LINEARITY AND INDEPENDENCE
We will test for linearity and independence by making a plot of fitted values vs residuals and checking for
patterns.
If there is no pattern, then we say the model is linear and residuals are independent.
Otherwise, the model is showing signs of non-linearity and residuals are not independent.
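The fitted-values-versus-residuals plot described above can be sketched like this (df_pred is assumed from the project code, and the column name 'fitted' is hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt

def plot_fitted_vs_residuals(df_pred):
    """Scatter of residuals against fitted values; a patternless cloud around
    zero supports linearity and independence of the residuals."""
    fig, ax = plt.subplots()
    ax.scatter(df_pred["fitted"], df_pred["residuals"], alpha=0.5)
    ax.axhline(0, color="red", linestyle="--")
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("Residuals")
    return fig
```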
import matplotlib.pyplot as plt

import scipy.stats as stats

# Q-Q plot of the residuals against the normal distribution (normality check)
stats.probplot(df_pred['residuals'], dist="norm", plot=plt)

plt.show()
Final Model Summary
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.849
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 277.1
Date: Thu, 21 Nov 2024 Prob (F-statistic): 0.00
Time: 21:24:26 Log-Likelihood: 125.15
No. Observations: 2417 AIC: -152.3
Df Residuals: 2368 BIC: 131.4
Df Model: 48
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const 1.4172 0.072 19.677 0.000 1.276 1.558
screen_size 0.0295 0.004 8.352 0.000 0.023 0.036
main_camera_mp 0.0232 0.002 14.892 0.000 0.020 0.026
selfie_camera_mp 0.0116 0.001 9.924 0.000 0.009 0.014
int_memory 0.0002 6.76e-05 2.779 0.005 5.53e-05 0.000
ram 0.0305 0.005 5.797 0.000 0.020 0.041
battery -1.665e-05 7.35e-06 -2.266 0.024 -3.11e-05 -2.24e-06
weight 0.0008 0.000 5.946 0.000 0.001 0.001
days_used 3.376e-05 3.07e-05 1.101 0.271 -2.64e-05 9.39e-05
normalized_new_price 0.4104 0.012 33.494 0.000 0.386 0.434
years_since_release -0.0255 0.005 -5.589 0.000 -0.034 -0.017
brand_name_Alcatel -0.0804 0.050 -1.618 0.106 -0.178 0.017
brand_name_Apple -0.0438 0.148 -0.297 0.767 -0.333 0.246
brand_name_Asus 0.0068 0.049 0.138 0.890 -0.090 0.103
brand_name_BlackBerry 0.0312 0.072 0.434 0.664 -0.110 0.172
brand_name_Celkon -0.2372 0.068 -3.485 0.001 -0.371 -0.104
brand_name_Coolpad -0.0308 0.071 -0.434 0.664 -0.170 0.108
brand_name_Gionee -0.0130 0.059 -0.221 0.825 -0.128 0.102
brand_name_Google -0.1192 0.083 -1.442 0.149 -0.281 0.043
brand_name_HTC -0.0403 0.050 -0.805 0.421 -0.138 0.058
brand_name_Honor -0.0478 0.051 -0.941 0.347 -0.147 0.052
brand_name_Huawei -0.0599 0.046 -1.299 0.194 -0.150 0.030
brand_name_Infinix 0.1338 0.113 1.179 0.238 -0.089 0.356
brand_name_Karbonn -0.0592 0.068 -0.867 0.386 -0.193 0.075
brand_name_LG -0.0608 0.047 -1.299 0.194 -0.152 0.031
brand_name_Lava -0.0230 0.063 -0.364 0.716 -0.147 0.101
brand_name_Lenovo -0.0364 0.047 -0.771 0.441 -0.129 0.056
brand_name_Meizu -0.0860 0.056 -1.530 0.126 -0.196 0.024
brand_name_Micromax -0.0645 0.049 -1.315 0.189 -0.161 0.032
brand_name_Microsoft 0.0749 0.082 0.916 0.360 -0.085 0.235
brand_name_Motorola -0.0686 0.051 -1.349 0.177 -0.168 0.031
brand_name_Nokia 0.0380 0.052 0.724 0.469 -0.065 0.141
brand_name_OnePlus -0.0384 0.073 -0.523 0.601 -0.182 0.105
brand_name_Oppo -0.0294 0.049 -0.599 0.549 -0.126 0.067
brand_name_Others -0.0679 0.044 -1.556 0.120 -0.154 0.018
brand_name_Panasonic -0.0427 0.062 -0.690 0.490 -0.164 0.078
brand_name_Realme -0.0359 0.063 -0.568 0.570 -0.160 0.088
brand_name_Samsung -0.0617 0.045 -1.376 0.169 -0.150 0.026
brand_name_Sony -0.0756 0.053 -1.428 0.153 -0.179 0.028
brand_name_Spice -0.0356 0.068 -0.520 0.603 -0.170 0.099
brand_name_Vivo -0.0644 0.050 -1.277 0.202 -0.163 0.034
brand_name_XOLO -0.0781 0.057 -1.362 0.173 -0.191 0.034
brand_name_Xiaomi 0.0325 0.050 0.655 0.513 -0.065 0.130
brand_name_ZTE -0.0460 0.048 -0.952 0.341 -0.141 0.049
os_Others -0.0604 0.033 -1.856 0.064 -0.124 0.003
os_Windows -0.0374 0.043 -0.861 0.389 -0.122 0.048
os_iOS -0.0141 0.148 -0.095 0.924 -0.304 0.276
4g_yes 0.0406 0.016 2.514 0.012 0.009 0.072
5g_yes -0.0916 0.032 -2.846 0.004 -0.155 -0.029
==============================================================================
Omnibus: 234.465 Durbin-Watson: 1.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 630.705
Skew: -0.536 Prob(JB): 1.11e-137
Kurtosis: 5.262 Cond. No. 1.85e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.85e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Test Performance

{'MSE': 0.05715049913984885,

'RMSE': 0.23906170571601143,

'R-squared': 0.8321757391826543}
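The figures above can be computed with a small helper like the following (the variable names in the usage comment are assumed; in the project the predictions would come from olsmodel2 on the test set):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, and R-squared for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {"MSE": mse, "RMSE": np.sqrt(mse), "R-squared": 1 - ss_res / ss_tot}

# In the project, something like:
# regression_metrics(y_test, olsmodel2.predict(sm.add_constant(x_test3)))
```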

Conclusion:
Hence, we have successfully performed exploratory data analysis on our dataset, covering
descriptive analysis (central tendency and dispersion) and the correlation between attributes. We also
applied different data visualization techniques such as bar plots, line plots, and scatter plots, and built a
linear regression model to predict the normalized used price of a device.
