Task 3: Car Price Prediction Using Machine Learning

Car price prediction using machine learning, implemented in a Jupyter Notebook.

Project Report on Car Price Prediction Using Machine Learning

Submitted by: Mr. Omkar Balwant Jadhav

Contents:

+ Data
+ What Problem We Have and Which Metric to Use?
+ Exploratory Data Analysis
  = Target Variable
  = Numerical Features
  = Categorical Features
+ Model Selection
  = Baseline Model
  = Models with Ridge & Lasso & ElasticNet and KNN
  = Models with Random Forest & Extra Trees & Gradient Boosting & XGBoost
  = Best Model with Hyperparameter Tuning
  = Feature Importance
+ Conclusion

1. Collecting Data

In [1]: import pandas as pd
        data = pd.read_csv("C:\\Users\\Omkar\\Downloads\\CarPrice.csv")
        data

2. Defining the problem statement

In this project, we study car price data, which is present in tabular format, using libraries such as numpy, pandas and matplotlib together with different machine learning algorithms. We study the different columns of the table, try to correlate them with one another, and look for relationships between them, with the goal of predicting a car's price from its attributes.

3. Exploratory Data Analysis

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations. It is good practice to understand the data first and to try to gather as many insights from it as possible. EDA is all about making sense of the data in hand.

In [2]: data.shape
Out[2]: (205, 26)

In [3]: data.head()
Out[3]: [first five rows of the dataset: three alfa-romero cars and two audis, across car_ID, symboling, CarName, fueltype, aspiration, doornumber, carbody, drivewheel and the remaining 18 columns]

In [5]: data['CarName'].value_counts()
Out[5]: [value counts of the 147 distinct CarName entries (Length: 147, dtype: int64), headed by toyota corona, toyota corolla and peugeot 504, with a long tail of names occurring only once]

In [87]: import numpy as np
         import pandas as pd
         import matplotlib.pyplot as plt
         import seaborn as sns
         from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder, PowerTransformer
         from sklearn.model_selection import KFold, cross_val_predict, train_test_split
         from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
         from sklearn.metrics import r2_score, mean_squared_error
         from sklearn.pipeline import make_pipeline
         from sklearn.compose import make_column_transformer
         from sklearn.neighbors import KNeighborsRegressor
         from sklearn.svm import SVR
         from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
         from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

In [26]: df = pd.read_csv("C:\\Users\\Omkar\\Downloads\\CarPrice.csv")
         df.head()
Out[26]: [same first five rows as Out[3]]
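The contents promise a discussion of which metric to use, and throughout the report every model is scored with RMSE and R2. As a convenience, the repeated scoring lines could be factored into one helper; this is a hedged sketch of mine, not a cell from the original notebook:

# Hypothetical helper (my addition): wraps the RMSE/R2 printout
# that the report repeats after every model fit.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(name, y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    print(f'model : {name} and rmse score is : {round(rmse, 2)}, r2 score is {round(r2, 4)}')
    return rmse, r2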
In [27]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   car_ID            205 non-null    int64
 1   symboling         205 non-null    int64
 2   CarName           205 non-null    object
 3   fueltype          205 non-null    object
 4   aspiration        205 non-null    object
 5   doornumber        205 non-null    object
 6   carbody           205 non-null    object
 7   drivewheel        205 non-null    object
 8   enginelocation    205 non-null    object
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64
 14  enginetype        205 non-null    object
 15  cylindernumber    205 non-null    object
 16  enginesize        205 non-null    int64
 17  fuelsystem        205 non-null    object
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64
 22  peakrpm           205 non-null    int64
 23  citympg           205 non-null    int64
 24  highwaympg        205 non-null    int64
 25  price             205 non-null    float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.8+ KB

In [28]: df.duplicated().sum()
Out[28]: 0

In [29]: def missing(df):
             missing_number = df.isnull().sum().sort_values(ascending=False)
             missing_percent = (df.isnull().sum() / df.isnull().count()).sort_values(ascending=False)
             missing_values = pd.concat([missing_number, missing_percent], axis=1,
                                        keys=['Missing Number', 'Missing Percent'])
             return missing_values

         missing(df)

Out[29]: [table with columns Missing Number and Missing Percent: every one of the 26 columns, from car_ID through price, has 0 missing values (0.0 percent)]

In [30]: df.nunique()
Out[30]:
car_ID              205
symboling             6
CarName             147
fueltype              2
aspiration            2
doornumber            2
carbody               5
drivewheel            3
enginelocation        2
wheelbase            53
carlength            75
carwidth             44
carheight            49
curbweight          171
enginetype            7
cylindernumber        7
enginesize           44
fuelsystem            8
boreratio            38
stroke               37
compressionratio     32
horsepower           59
peakrpm              23
citympg              29
highwaympg           30
price               189
dtype: int64

+ There is no zero-variance variable.
+ The car_ID column is a repetition of the index, so I'll drop it.
+ CarName has 147 different entries. I'll check it and try to find a way to reduce the variance.
+ Other than that, there is no problem.
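The "no zero-variance variable" observation above can be verified programmatically; a hedged sketch of mine, not in the original notebook:

# My addition: list columns with at most one distinct value (none expected here).
constant_cols = df.columns[df.nunique() <= 1]
print(list(constant_cols))  # expected: []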
In [31]: df1 = df.copy()

In [32]: df1['CarName'].sample(5)
Out[32]:
bmw z4
isuzu MU-X
honda accord lx
saab 99e
volkswagen model 111
Name: CarName, dtype: object

In [33]: df1['CarName'].unique()
Out[33]: [array of all 147 distinct CarName values, from 'alfa-romero giulia' through 'volvo 246'; it mixes brand and model names and contains several typos, e.g. 'maxda' for mazda, 'porcshce' for porsche, 'toyouta' for toyota, 'vokswagen' and 'vw' for volkswagen, plus inconsistent capitalisation such as 'Nissan' next to 'nissan']

+ I'll use only the brand/make, not the model.
+ I have seen several typos; I'll handle those.
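Before committing to the split-and-replace clean-up in the next cell, a quick preview (a hedged sketch of mine, not in the original notebook) shows the raw first tokens and how often each occurs, which makes the typos listed above easy to spot:

# My addition: count the first word of each CarName;
# 'maxda' and 'mazda' show up as separate brands, for example.
df1['CarName'].str.split().str[0].value_counts()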
In [34]: df1['model'] = [x.split()[0] for x in df1['CarName']]
         df1['model'] = df1['model'].replace({'maxda': 'Mazda', 'mazda': 'Mazda',
                                              'nissan': 'Nissan',
                                              'porcshce': 'Porsche', 'porsche': 'Porsche',
                                              'toyouta': 'Toyota', 'toyota': 'Toyota',
                                              'vokswagen': 'Volkswagen', 'vw': 'Volkswagen',
                                              'volkswagen': 'Volkswagen'})

Let's drop the 'car_ID' and 'CarName' columns.

In [35]: df1 = df1.drop(['car_ID', 'CarName'], axis=1)

In [36]: print(f'We have {df1.shape[0]} instances with the {df1.shape[1]-1} features and 1 output variable')

We have 205 instances with the 24 features and 1 output variable

In [37]: numerical = df1.drop(['price'], axis=1).select_dtypes('number').columns
         categorical = df1.select_dtypes('object').columns
         print(f'Numerical Columns: {df1[numerical].columns}')
         print('\n')
         print(f'Categorical Columns: {df1[categorical].columns}')

Numerical Columns: Index(['symboling', 'wheelbase', 'carlength', 'carwidth', 'carheight',
       'curbweight', 'enginesize', 'boreratio', 'stroke', 'compressionratio',
       'horsepower', 'peakrpm', 'citympg', 'highwaympg'], dtype='object')

Categorical Columns: Index(['fueltype', 'aspiration', 'doornumber', 'carbody', 'drivewheel',
       'enginelocation', 'enginetype', 'cylindernumber', 'fuelsystem', 'model'], dtype='object')

Target Variable

In [38]: df1['price'].describe()
Out[38]:
count      205.000000
mean     13276.710571
std       7988.852332
min       5118.000000
25%       7788.000000
50%      10295.000000
75%      16503.000000
max      45400.000000
Name: price, dtype: float64

In [39]: print(f"Skewness: {df1['price'].skew()}")

Skewness: 1.7776781560914454

In [41]: df1['price'].plot(kind='hist')

[histogram of price: strongly right-skewed, with most cars priced between roughly 5000 and 20000 and a thin tail out to 45000]

+ Even though the target variable has right skewness, I will not make any transformation on it.
+ Let's see the numerical features.

Numerical Features

In [42]: df1[numerical].describe()
Out[42]: [summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for each of the 14 numerical features; all counts are 205]
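The histograms in the next cells show the skew visually; a numeric check across all features (my addition, assuming the numerical index defined above) would motivate the PowerTransformer used later:

# My addition: skewness of every numerical feature, most skewed first.
# Values far from 0 are candidates for a power transform.
df1[numerical].skew().sort_values(ascending=False)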
In [44]: df1[numerical].plot(kind='hist');

[single overlaid histogram of all 14 numerical features on a shared axis]

In [55]: df1[numerical].plot(kind='hist', subplots=True, bins=50)

[grid of per-feature histograms; several features show visible skew]

+ During the modelling process, we can use a power transformer.
+ Let's observe the correlation among the numerical features,
+ and also observe the correlation with the target variable.

In [61]: numerical1 = df1.select_dtypes('number').columns

         matrix = np.triu(df1[numerical1].corr())
         fig, ax = plt.subplots(figsize=(14, 10))
         sns.heatmap(df1[numerical1].corr(), annot=True, fmt='.2f', vmin=-1, vmax=1, mask=matrix, ax=ax)

[lower-triangle correlation heatmap of the numerical features, including price]

+ We have 9 numerical features which have more than .5 correlation with the price variable,
+ which is a good sign for the model's predictive capability, but we still need to see it in practice.
+ From the .9 threshold perspective: highwaympg and citympg have a .97 correlation. We can drop one of them to avoid multicollinearity problems for the linear models.
+ I have observed several highly correlated features below the .9 level.
+ Let's drop 'citympg'.

In [62]: df1 = df1.drop('citympg', axis=1)

Categorical Features

In [63]: df1[categorical].head()
Out[63]: [first five rows of the categorical columns: fueltype, aspiration, doornumber, carbody, drivewheel, enginelocation, enginetype, cylindernumber, fuelsystem, model]

Fuel Type and Price

In [65]: print(df1.groupby('fueltype')['price'].mean().sort_values())
         print()
         df1.groupby('fueltype')['price'].mean().plot(kind='hist', subplots=True)

fueltype
gas       12999.7982
diesel    15838.1500
Name: price, dtype: float64

[histogram of the two group means]

+ Diesel cars are more expensive than cars running on gas.
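The fuel-type cell above begins a groupby-mean-and-plot pattern that repeats for every categorical feature below; a hedged helper of mine, not in the original notebook, could factor it out:

# My addition: mean price per category, low to high, printed and plotted.
def mean_price_by(col):
    means = df1.groupby(col)['price'].mean().sort_values()
    print(means)
    means.plot(kind='bar')  # a bar chart maps categories to means more directly than the histogram used in the notebook
    plt.show()

# usage: mean_price_by('aspiration')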
Aspiration and Price

In [67]: print(df1.groupby('aspiration')['price'].mean().sort_values())
         print()
         df1.groupby('aspiration')['price'].mean().plot(kind='hist', subplots=True)

aspiration
std      12611.270833
turbo    16298.166676
Name: price, dtype: float64

[histogram of the two group means]

+ Turbo aspiration is more expensive than standard aspiration.

CarBody and Price

In [68]: print(df1.groupby('carbody')['price'].mean().sort_values())
         print()
         df1.groupby('carbody')['price'].mean().plot(kind='hist', subplots=True)

carbody
hatchback      10376.652386
wagon          12371.960000
sedan          14344.270833
convertible    21898.500000
hardtop        22208.500000
Name: price, dtype: float64

[histogram of the five group means]

+ Based on price, there are differences among the car bodies.
+ While hatchbacks and wagons are the least expensive ones, hardtops and convertibles are the most expensive ones.

Drivewheel and Price

In [70]: print(df1.groupby('drivewheel')['price'].mean().sort_values())
         print()
         df1.groupby('drivewheel')['price'].mean().plot(kind='hist', subplots=True)

drivewheel
fwd     9239.308333
4wd    11087.463000
rwd    19910.809211
Name: price, dtype: float64

[histogram of the three group means]

+ Rear-wheel-drive cars are the most expensive ones; front-wheel-drive cars are the least expensive ones.

Engine Location and Price

In [72]: print(df1.groupby('enginelocation')['price'].mean().sort_values())
         print()
         df1.groupby('enginelocation')['price'].mean().plot(kind='hist', subplots=True)

enginelocation
front    12961.097361
rear     34528.000000
Name: price, dtype: float64

[histogram of the two group means]

Fuel System and Price

[the groupby output and histogram for fuelsystem are not legible in the source]

+ Our dataset has 8 different fuel systems, and price changes significantly among them.

Model and Price

In [76]: print(df1.groupby('model')['price'].mean().sort_values())
         print()
         df1.groupby('model')['price'].mean().plot(kind='hist', subplots=True, bins=5)

model
chevrolet       6007.000000
dodge           7875.444444
plymouth        7963.428571
honda           8184.692308
subaru          8541.250000
isuzu           8916.500000
mitsubishi      9239.769231
renault         9595.000000
Toyota          9885.812500
Volkswagen     10077.500000
Nissan         10415.666667
Mazda          10652.882353
saab           15223.333333
peugeot        15489.090909
alfa-romero    15498.333333
mercury        16503.000000
audi           17859.166667
volvo          18063.181818
bmw            26118.750000
Porsche        31400.500000
buick          33647.000000
jaguar         34600.000000
Name: price, dtype: float64

[histogram of the 22 brand means]

+ Based on the model (brand), Porsche, Buick and Jaguar are the most expensive ones.
+ Chevrolet is the least expensive.
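Before one-hot encoding in the next cell, a quick sanity check (my addition, not in the original notebook) predicts how wide the encoded frame will be: with drop_first=True, each categorical column contributes nunique - 1 dummy columns.

# My addition: expected number of dummy columns after get_dummies(drop_first=True).
n_dummies = sum(df1[c].nunique() - 1 for c in categorical)
print(n_dummies)  # 50 dummies + 13 remaining numerical features + price = the 64 columns shown below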
Get The Dummies

In [77]: df2 = pd.get_dummies(df1, columns=categorical, drop_first=True)
         df2.head()
Out[77]: [first five rows of the encoded frame: 5 rows x 64 columns, i.e. the 13 remaining numerical features, price, and the dummy columns]

4. Model Selection

+ I'll use a linear regression model as a base model.
+ Then I will use Ridge, Lasso, ElasticNet, KNeighborsRegressor and a Support Vector Machine regressor.
+ Then I will use ensemble models, like Random Forest, Gradient Boosting and Extra Trees.
+ Finally, I will look at the XGBoost regressor.
+ After evaluating the algorithms, we will select our best model.
+ Let's start.

Baseline Model

In [78]: X = df2.drop('price', axis=1)
         y = df2['price']

         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

         model = LinearRegression()
         model.fit(X_train, y_train)
         y_pred = model.predict(X_test)

         print(f'model : {model} and rmse score is : {np.sqrt(mean_squared_error(y_test, y_pred))}, r2 score is {r2_score(y_test, y_pred)}')

model : LinearRegression() and rmse score is : [value garbled in source], r2 score is 0.8985995076954914

+ The baseline model, in our case a linear regression model without any scaling or transformation, did quite a good job.
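The baseline above is scored on a single 70/30 split. KFold is already imported; a hedged sketch of mine (not in the original notebook) of a cross-validated estimate, which is less sensitive to the particular split:

# My addition: 5-fold cross-validated R2 for the baseline model.
from sklearn.model_selection import cross_val_score, KFold

cv = KFold(n_splits=5, shuffle=True, random_state=42)  # random_state is an illustrative choice
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='r2')
print(scores.mean().round(4), scores.std().round(4))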
Ridge & Lasso & ElasticNet & KNN with Scaler and Transformer

In [79]: rmse_test = []
         r2_test = []
         model_names = []

         numerical2 = df2.drop(['price'], axis=1).select_dtypes('number').columns

         X = df2.drop('price', axis=1)
         y = df2['price']
         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

         s = StandardScaler()
         p = PowerTransformer(method='yeo-johnson', standardize=True)

         rr = Ridge()
         las = Lasso()
         el = ElasticNet()
         knn = KNeighborsRegressor()
         models = [rr, las, el, knn]

         for model in models:
             # skew_cols is defined on a page missing from the scan; it holds the names of the skewed numerical columns
             ct = make_column_transformer((s, numerical2), (p, skew_cols.index), remainder='passthrough')
             pipe = make_pipeline(ct, model)
             pipe.fit(X_train, y_train)
             y_pred = pipe.predict(X_test)
             rmse_test.append(round(np.sqrt(mean_squared_error(y_test, y_pred)), 2))
             r2_test.append(round(r2_score(y_test, y_pred), 2))
             print(f'model : {model} and rmse score is : {round(np.sqrt(mean_squared_error(y_test, y_pred)), 2)}, r2 score is {round(r2_score(y_test, y_pred), 2)}')

         model_names = ['Ridge', 'Lasso', 'ElasticNet', 'KNeighbors']
         result_df = pd.DataFrame({'RMSE': rmse_test, 'R2_Test': r2_test}, index=model_names)
         result_df

model : Ridge() and rmse score is : 2423.29, r2 score is 0.92
model : Lasso() and rmse score is : 2329.06, r2 score is 0.92
model : ElasticNet() and rmse score is : 3350.1, r2 score is 0.84
model : KNeighborsRegressor() and rmse score is : 4048.13, r2 score is 0.76

C:\Users\Omkar\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.809e+07, tolerance: 8.716e+05
  model = cd_fast.enet_coordinate_descent(

Out[79]:
               RMSE  R2_Test
Ridge       2423.29     0.92
Lasso       2329.06     0.92
ElasticNet  3350.10     0.84
KNeighbors  4048.13     0.76

+ By using a standard scaler and a power transformer for the skewness,
+ for the linear models we got .92 for R2 and an RMSE of about 2330,
+ which are better scores compared to the baseline model.

Best Model with Hyperparameter Tuning

In [81]: X = df2.drop('price', axis=1)
         y = df2['price']
         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

         rf = RandomForestRegressor(n_estimators=220, random_state=42)
         rf.fit(X_train, y_train)
         y_pred = rf.predict(X_test)
         print(f'rmse score is : {round(np.sqrt(mean_squared_error(y_test, y_pred)), 4)}, r2 score is {round(r2_score(y_test, y_pred), 4)}')

rmse score is : 1975.8483, r2 score is 0.9437

+ With hyperparameter tuning we got a lift:
+ RMSE from 1984.44 down to 1975.8483,
+ R2 from 0.9432 up to 0.9437.

Feature Importance

In [84]: importances = rf.feature_importances_
         feature_names = [f'feature {i}' for i in range(X.shape[1])]

         # what are the scores for the features
         for i in range(len(rf.feature_importances_)):
             if rf.feature_importances_[i] > 0.001:
                 print(f'{X_train.columns[i]} : {round(rf.feature_importances_[i], 3)}')

         print()

         plt.bar([X_train.columns[i] for i in range(len(rf.feature_importances_))], rf.feature_importances_)
         plt.xticks(rotation=90)
         plt.rcParams["figure.figsize"] = (12, 12)
         plt.show()

symboling : 0.002
wheelbase : 0.008
carlength : 0.013
carwidth : 0.026
carheight : 0.004
curbweight : 0.167
enginesize : 0.6
boreratio : 0.005
stroke : 0.003
compressionratio : 0.005
horsepower : 0.028
peakrpm : 0.005
highwaympg : 0.118
enginetype_ohc : 0.001
model_bmw : 0.006

[bar chart of all feature importances]

+ Based on the Random Forest Regressor:
  = enginesize
  = curbweight
  = highwaympg
  = horsepower
  have the biggest importance scores.
+ It is important to note that the Random Forest Regressor gave an importance score bigger than 0.001 to 16 features.
+ The model effectively used 16 out of 63 features to get the best prediction.

Conclusion

We have developed a model to predict car prices.

+ First, we made a detailed exploratory analysis.
+ We decided which metric to use.
+ We analyzed both the target and the features in detail.
+ We transformed categorical variables into numeric ones so we could use them in the model.
+ We transformed numerical variables to reduce skewness and get closer to a normal distribution.
+ We used a pipeline to avoid data leakage.
+ We looked at the results of each model and selected the best one for the problem in hand.
+ We performed hyperparameter tuning of the best model to see the improvement.
+ We looked at the feature importance.

After this point, it is up to you to develop and improve the models.
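The report states the tuned Random Forest result, but the search itself is not visible in the scanned pages. As one way to "develop and improve the models", here is a hedged GridSearchCV sketch of mine; the grid values are illustrative assumptions, not the author's:

# My addition: a grid search along these lines could reproduce the tuning step.
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 160, 220, 280],  # 220 is the value the report settles on
              'max_depth': [None, 10, 20]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=5, scoring='neg_root_mean_squared_error')
search.fit(X_train, y_train)
print(search.best_params_, round(-search.best_score_, 2))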
