DMPA RECORD-3-checkpoint - Removed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df1 = pd.read_csv('Training_Data_Set.csv')
df1.head(3)
         ID  Maker    model   Location  Distance Owner Type  manufacture_year  Age of car  engine_displacement  engine_power body_type  Vroom Audit Rating  transmi
0  11100001  skoda  octavia  Ahmedabad       NaN     Second            1964.0          55                 1964         147.0   compact                   8
1  11100002   fiat    panda  Ahmedabad   27750.0      Third            2012.0           7                 1242          51.0       NaN                   6
df = df1.drop(['ID'] , axis = 1)
df.head(3)
   Maker    model   Location  Distance Owner Type  manufacture_year  Age of car  engine_displacement  engine_power body_type  Vroom Audit Rating transmission
0  skoda  octavia  Ahmedabad       NaN     Second            1964.0          55                 1964         147.0   compact                   8          man
1   fiat    panda  Ahmedabad   27750.0      Third            2012.0           7                 1242          51.0       NaN                   6          man
(df.describe())
Distance manufacture_year Age of car engine_displacement engine_power Vroom Audit Rating Price
df.shape
(53515, 16)
MAKER : 8
maserati 38
fiat 1845
hyundai 2240
nissan 5485
bmw 7178
audi 7326
toyota 7840
skoda 21563
Name: Maker, dtype: int64
MODEL : 23
tt 903
juke 955
citigo 1120
q7 1245
roomster 1322
rapid 1409
aygo 1486
avensis 1512
auris 1666
micra 1676
coupe 1710
q3 1736
panda 1769
yeti 1898
x5 1979
q5 2039
i30 2047
x1 2420
x3 2779
qashqai 2854
yaris 3176
superb 3195
octavia 12619
Name: model, dtype: int64
LOCATION : 11
Ahmedabad 4770
Hyderabad 4804
Delhi 4822
Chennai 4834
Mumbai 4860
Pune 4862
Kolkata 4867
Jaipur 4870
Bangalore 4877
Kochi 4969
Coimbatore 4974
Name: Location, dtype: int64
OWNER TYPE : 4
Fourth & Above 13349
Second 13365
Third 13395
First 13406
Name: Owner Type, dtype: int64
BODY_TYPE : 2
van 9
compact 4127
Name: body_type, dtype: int64
TRANSMISSION : 2
auto 16781
man 36734
Name: transmission, dtype: int64
DOOR_COUNT : 7
1 2
6 8
3 185
2 4348
None 7534
5 7630
4 33808
Name: door_count, dtype: int64
SEAT_COUNT : 10
1 1
8 1
9 2
6 23
3 109
2 725
7 852
4 4467
None 8511
5 38824
Name: seat_count, dtype: int64
FUEL_TYPE : 2
petrol 25956
diesel 27559
Name: fuel_type, dtype: int64
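The cell that printed the level counts above is not in the export. A minimal sketch that produces the same kind of listing, assuming it looped over the non-numeric columns and sorted the counts ascending (the `toy` frame and the `summaries` dict are illustrative names, not from the notebook):

```python
import pandas as pd

# Toy stand-in for the notebook's data; values are illustrative only.
toy = pd.DataFrame({
    "Maker": ["skoda", "skoda", "fiat", "bmw", "skoda"],
    "fuel_type": ["petrol", "diesel", "petrol", "diesel", "diesel"],
    "Distance": [1000.0, 2000.0, 1500.0, 3000.0, 2500.0],
})

summaries = {}
# exclude="number" keeps only the categorical (text) columns
for col in toy.select_dtypes(exclude="number").columns:
    counts = toy[col].value_counts().sort_values()  # rarest level first
    summaries[col] = counts
    print(col.upper(), ":", toy[col].nunique())
    print(counts)
```

Sorting ascending puts rare levels (such as the 9 `van` rows above) at the top, where they are easy to spot.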
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53515 entries, 0 to 53514
Data columns (total 67 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Distance 52304 non-null float64
1 manufacture_year 53513 non-null float64
2 Age of car 53515 non-null int64
3 engine_displacement 53515 non-null int64
4 engine_power 52076 non-null float64
5 Vroom Audit Rating 53515 non-null int64
6 Price 53515 non-null float64
7 Maker_bmw 53515 non-null uint8
8 Maker_fiat 53515 non-null uint8
9 Maker_hyundai 53515 non-null uint8
10 Maker_maserati 53515 non-null uint8
11 Maker_nissan 53515 non-null uint8
12 Maker_skoda 53515 non-null uint8
13 Maker_toyota 53515 non-null uint8
14 model_avensis 53515 non-null uint8
15 model_aygo 53515 non-null uint8
16 model_citigo 53515 non-null uint8
17 model_coupe 53515 non-null uint8
18 model_i30 53515 non-null uint8
19 model_juke 53515 non-null uint8
20 model_micra 53515 non-null uint8
21 model_octavia 53515 non-null uint8
22 model_panda 53515 non-null uint8
23 model_q3 53515 non-null uint8
24 model_q5 53515 non-null uint8
25 model_q7 53515 non-null uint8
26 model_qashqai 53515 non-null uint8
27 model_rapid 53515 non-null uint8
28 model_roomster 53515 non-null uint8
29 model_superb 53515 non-null uint8
30 model_tt 53515 non-null uint8
31 model_x1 53515 non-null uint8
32 model_x3 53515 non-null uint8
33 model_x5 53515 non-null uint8
34 model_yaris 53515 non-null uint8
35 model_yeti 53515 non-null uint8
36 Location_Bangalore 53515 non-null uint8
37 Location_Chennai 53515 non-null uint8
38 Location_Coimbatore 53515 non-null uint8
39 Location_Delhi 53515 non-null uint8
40 Location_Hyderabad 53515 non-null uint8
41 Location_Jaipur 53515 non-null uint8
42 Location_Kochi 53515 non-null uint8
43 Location_Kolkata 53515 non-null uint8
44 Location_Mumbai 53515 non-null uint8
45 Location_Pune 53515 non-null uint8
46 Owner Type_Fourth & Above 53515 non-null uint8
47 Owner Type_Second 53515 non-null uint8
48 Owner Type_Third 53515 non-null uint8
49 body_type_van 53515 non-null uint8
50 transmission_man 53515 non-null uint8
51 door_count_2 53515 non-null uint8
52 door_count_3 53515 non-null uint8
53 door_count_4 53515 non-null uint8
54 door_count_5 53515 non-null uint8
55 door_count_6 53515 non-null uint8
56 door_count_None 53515 non-null uint8
57 seat_count_2 53515 non-null uint8
58 seat_count_3 53515 non-null uint8
59 seat_count_4 53515 non-null uint8
60 seat_count_5 53515 non-null uint8
61 seat_count_6 53515 non-null uint8
62 seat_count_7 53515 non-null uint8
63 seat_count_8 53515 non-null uint8
64 seat_count_9 53515 non-null uint8
65 seat_count_None 53515 non-null uint8
66 fuel_type_petrol 53515 non-null uint8
dtypes: float64(4), int64(3), uint8(60)
memory usage: 5.9 MB
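The encoding cell is not in the export, but the column list above (no `Maker_audi`, no `Location_Ahmedabad`, no `transmission_auto`) is consistent with one-hot encoding that drops the first level of each category. A sketch under that assumption, on toy data; `dtype="uint8"` is set explicitly only to mirror the dtypes printed above:

```python
import pandas as pd

# Toy stand-in; the real frame has 16 columns before encoding.
toy = pd.DataFrame({
    "Maker": ["audi", "bmw", "skoda", "audi"],
    "transmission": ["man", "auto", "man", "man"],
    "Price": [1.0, 2.0, 3.0, 4.0],
})

# drop_first=True removes one dummy per category (the alphabetically first
# level), which avoids the dummy-variable trap in a model with an intercept.
encoded = pd.get_dummies(toy, drop_first=True, dtype="uint8")
print(encoded.columns.tolist())
```

The dropped level becomes the reference category: a coefficient on `Maker_bmw` is then read relative to `audi`.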
df.isnull().sum()
Distance 1211
manufacture_year 2
Age of car 0
engine_displacement 0
engine_power 1439
...
seat_count_7 0
seat_count_8 0
seat_count_9 0
seat_count_None 0
fuel_type_petrol 0
Length: 67, dtype: int64
df = df.fillna(df.mean())
df.isnull().sum()
Distance 0
manufacture_year 0
Age of car 0
engine_displacement 0
engine_power 0
..
seat_count_7 0
seat_count_8 0
seat_count_9 0
seat_count_None 0
fuel_type_petrol 0
Length: 67, dtype: int64
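What `df.fillna(df.mean())` does can be checked on a small frame: each missing cell is replaced by the mean of the non-missing values in its own column. The toy values below are illustrative only:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Distance": [100.0, np.nan, 300.0],
    "engine_power": [50.0, 70.0, np.nan],
})

# toy.mean() is a Series of per-column means; fillna matches it by column name
filled = toy.fillna(toy.mean())
print(filled)
```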
df.duplicated().sum()
114
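# .copy() makes df1 an independent frame, so later column assignments
# do not raise SettingWithCopyWarning
df1 = df.drop_duplicates().copy()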
df1.head(2)
Distance manufacture_year Age of car engine_displacement engine_power Vroom Audit Rating Price Maker_bmw Maker_fiat Maker_hyundai ... seat_c
2 rows × 67 columns
sns.heatmap(df1.iloc[:, 0:6].corr(), annot=True)
plt.tight_layout()
plt.show()
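def remove_outlier(col):
    # Returns the IQR fences for a column; it does not modify the data itself
    Q1, Q3 = col.quantile([0.25, 0.75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

# feature_list is defined in a cell that is not shown here
for i in feature_list:
    LL, UL = remove_outlier(df1[i])
    # Cap values outside the fences at the fence value
    df1[i] = np.where(df1[i] > UL, UL, df1[i])
    df1[i] = np.where(df1[i] < LL, LL, df1[i])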
/var/folders/y7/v00504rn5c5dgvzgtlt4bkcr0000gn/T/ipykernel_3490/610677943.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df1[i] = np.where(df1[i] < LL,LL,df1[i])
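The capping step can be sketched on toy data as below. This is an equivalent formulation, not the notebook's exact code: `Series.clip` replaces the two `np.where` calls, and working on an explicit `.copy()` is one way to avoid the SettingWithCopyWarning shown above. The names `iqr_bounds`, `toy` and `capped` are illustrative.

```python
import pandas as pd

def iqr_bounds(col):
    """Return the lower and upper 1.5 * IQR fences of a numeric Series."""
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Toy column with one extreme value
toy = pd.DataFrame({"Distance": [10.0, 12.0, 11.0, 13.0, 500.0]})

capped = toy.copy()  # independent frame, so assignment below is unambiguous
for name in ["Distance"]:
    low, high = iqr_bounds(capped[name])
    # Values outside the fences are set to the fence value (winsorising)
    capped[name] = capped[name].clip(lower=low, upper=high)

print(capped)
```

Here Q1 = 11 and Q3 = 13, so the fences are 8 and 16 and the value 500 is capped at 16. Note that for a 0/1 column whose quartiles are both 0 the fences collapse to 0, so capping is best restricted to continuous columns.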
for i in range(num_plots):
    plt.subplot(rows, cols, i + 1)
    sns.boxplot(y=df1[feature_list[i]], data=df1)
    plt.title('Boxplot of {}'.format(feature_list[i]))
plt.tight_layout()
plt.show()
Pair plot
#sns.pairplot(df1 , diag_kind = 'kde')
import statsmodels.api as sm
X_train=sm.add_constant(X_train)
X_test=sm.add_constant(X_test)
model = sm.OLS(y_train,X_train).fit()
model.summary()
Df Model: 12
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.76e-18. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
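from statsmodels.stats.outliers_influence import variance_inflation_factor

# One VIF per column of the design matrix (including the constant)
vif_series1 = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)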
vif_file = "VIF values: \n\n{}\n".format(vif_series1)
print(vif_file)
/Users/sachinssharma/anaconda3/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1781: RuntimeWarning: invalid value encountered in scalar divide
return 1 - self.ssr/self.centered_tss
VIF values:
const 7.682852e+09
Distance 2.604749e+00
manufacture_year 3.463081e+04
Age of car 3.463332e+04
engine_displacement 4.206017e+00
...
seat_count_7 NaN
seat_count_8 NaN
seat_count_9 NaN
seat_count_None NaN
fuel_type_petrol 1.563505e+00
Length: 67, dtype: float64
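The VIF of a column is 1 / (1 - R²), where R² comes from regressing that column on all the others. A plain NumPy sketch of that definition on synthetic data shows why `manufacture_year` and `Age of car` both score in the tens of thousands: one is almost a linear function of the other. The reference year 2016 and all values below are assumptions for illustration. The same formula also suggests where the NaN entries come from: if a column has zero variance, the total sum of squares is 0 and the ratio is 0/0, which matches the RuntimeWarning above.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the rest."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    ssr = np.sum((target - others @ beta) ** 2)      # residual sum of squares
    tss = np.sum((target - target.mean()) ** 2)      # centered total sum of squares
    r2 = 1.0 - ssr / tss
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
n = 500
year = rng.integers(1990, 2016, size=n).astype(float)
age = 2016.0 - year + rng.normal(scale=0.1, size=n)  # nearly determined by year
power = rng.normal(100.0, 20.0, size=n)              # unrelated to the other two

# Column 0 is the constant, as after sm.add_constant
X = np.column_stack([np.ones(n), year, age, power])
print([vif(X, j) for j in range(1, 4)])
```

Dropping either `year` or `age` from `X` brings the remaining VIFs close to 1.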
my_df = vif_series1.to_frame(name='col_names')
my_df.to_excel('vif.xlsx')
my_df.head(10)
col_names
const 7.682852e+09
Distance 2.604749e+00
manufacture_year 3.463081e+04
engine_displacement 4.206017e+00
engine_power 3.461206e+00
Maker_bmw NaN
Maker_fiat NaN
Maker_hyundai NaN
R-squared: 0.783
Adjusted R-squared: 0.782
0.790-0.783
0.007000000000000006
R-squared: 0.79
Adjusted R-squared: 0.79
0.790-0.79
0.0
R-squared: 0.748
Adjusted R-squared: 0.748
0.790-0.748
0.04200000000000004
R-squared: 0.753
Adjusted R-squared: 0.752
0.790-0.753
0.03700000000000003
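R-squared and adjusted R-squared are nearly identical in every pair above. The standard adjustment formula shows why: with many rows and few predictors the penalty factor (n - 1) / (n - p - 1) is close to 1. A small sketch; the values n = 1000 and p = 12 in the usage line are hypothetical, since the training-set size is not shown in the export:

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical sample size and predictor count, for illustration only
print(adjusted_r2(0.783, n_obs=1000, n_predictors=12))
```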
#As we are observing multicollinearity and no real difference from removing manufacture_year , Age of car , engine_d
# so remove and run regression model
olsmod_5 = sm.OLS(y_train,X_train)
olsres_5 = olsmod_5.fit()
olsres_5.summary()
Df Model: 11
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.76e-18. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
#Removing engine_displacement and Age of car gives 0.783 and 0.745 respectively, so remove only manufactur
Df Model: 9
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.76e-18. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
After dropping the features that cause multicollinearity and the statistically insignificant
ones, our model performance has not dropped sharply. This shows that these variables did not
have much predictive power.
df_pred = pd.DataFrame()
df_pred.head()
# The plot shows no pattern, which indicates a linear relationship. assumed that linearity and independence of predictor
The p-value is less than 0.05, so we reject the null hypothesis: the residuals are not normally distributed according to the Shapiro-Wilk test.
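The test cell itself is missing from the export. A minimal sketch of a Shapiro-Wilk check with `scipy.stats.shapiro`, using a deliberately skewed synthetic array in place of the model residuals (the array and the seed are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Skewed stand-in for the residuals; in the notebook this would be
# something like the difference between actual and fitted values.
residuals = rng.exponential(scale=1.0, size=500) - 1.0

# Null hypothesis: the sample comes from a normal distribution
stat, p_value = stats.shapiro(residuals)
print(stat, p_value)

reject_normality = p_value < 0.05
print("Reject normality:", reject_normality)
```

With tens of thousands of rows, Shapiro-Wilk tends to reject even for small departures from normality, so a Q-Q plot is a useful companion check.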
0.7863209998227594
Since the p-value is greater than 0.05, we cannot reject the null hypothesis, so the residuals can be treated as homoscedastic.
# Dropping columns from the test data that are not in the above analysis, since by dropping both columns,
       const  Distance  Age of car  engine_displacement  engine_power  Maker_skoda  Maker_toyota  model_avensis  model_aygo  model_citigo  ...  seat_c
40755    1.0  155000.0        14.0               1995.0         110.0          0.0           0.0            0.0         0.0           0.0  ...
53337    1.0  199963.0         7.0               1968.0          81.0          1.0           0.0            0.0         0.0           0.0  ...
2 rows × 59 columns
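The cell that trimmed the test data is not shown. One simple way to give the test matrix exactly the columns the final model was fitted on is to index it with the training columns. The frames below are toy stand-ins with a few of the column names from above:

```python
import pandas as pd

# Training matrix after feature removal (toy values)
X_train = pd.DataFrame({
    "const": [1.0, 1.0],
    "Distance": [155000.0, 199963.0],
    "engine_power": [110.0, 81.0],
})

# Test matrix still carrying a column that was dropped from training
X_test = pd.DataFrame({
    "const": [1.0],
    "Distance": [120000.0],
    "manufacture_year": [2010.0],
    "engine_power": [90.0],
})

# Keep only the training columns, in the training order
X_test_aligned = X_test[X_train.columns]
print(X_test_aligned.columns.tolist())
```

Indexing this way raises a `KeyError` if the test data lacks a training column, which is a useful early warning before calling `predict`.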
336013.3899990611
329590.62845699233
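The two numbers above are not labelled in the export. If they are the root mean squared error on the training and test sets (an assumption), they could be computed with a helper like the one below; values of similar size on both sets would suggest the model is not overfitting.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Tiny worked example: errors are 2 and 0, so RMSE = sqrt((4 + 0) / 2)
print(rmse([3.0, 5.0], [1.0, 5.0]))
```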