
Experiment No 2

Aim : Data Cleaning Techniques.

Theory :

The aim of data cleaning (also known as data cleansing or data scrubbing) is to ensure that the dataset is accurate, complete, consistent, and free from errors before performing any analysis. Raw data collected from various sources often contains inconsistencies, missing values, duplicates, or outliers that can distort or mislead results. Data cleaning techniques address these issues to improve the quality of data, making it suitable for analysis, reporting, and decision-making.

Data cleaning involves several techniques, each aimed at addressing specific issues commonly encountered in raw datasets. Here are some of the primary methods (a brief pandas sketch applying them follows the list):

1. Removing Duplicates: Duplicate entries occur when identical or nearly identical records are recorded multiple times. This can happen due to human error or issues during data collection. Duplicate records can distort analysis by overrepresenting certain values.
   ○ Technique: Identifying duplicate rows based on unique identifiers (like ID numbers or email addresses) and removing them.
2. Handling Missing Data: Missing values occur when data is not recorded for certain observations. Handling missing data is essential because missing values can skew results and introduce bias.
   ○ Techniques:
      i. Deletion: Removing rows with missing values (listwise deletion).
      ii. Imputation: Replacing missing values with estimated values, such as the mean, median, mode, or predicted values from a model.
      iii. Flagging: Adding a special indicator for missing values, so they can be treated separately during analysis.
3. Standardizing Data Formats: Inconsistent data formats can lead to problems when comparing or aggregating data. For example, dates might be recorded as "MM/DD/YYYY" in some records and "DD/MM/YYYY" in others.
   ○ Technique: Converting values to a consistent format, such as standardizing date formats or ensuring that numerical data has consistent decimal places.
4. Identifying and Handling Outliers: Outliers are extreme values that differ significantly from other observations in the dataset. They can indicate errors, data entry mistakes, or actual rare events.
   ○ Techniques:
      i. Statistical methods: Using methods like the Z-score or IQR (interquartile range) to identify outliers.
      ii. Capping: Limiting outlier values to a certain range (winsorization).
      iii. Imputation: Replacing outlier values with more plausible values (mean, median, or predicted values).
5. Correcting Data Inconsistencies: Data collected from multiple sources or individuals may have inconsistencies, such as variations in spelling, abbreviations, or units of measurement.
   ○ Technique: Normalizing data by standardizing text entries (e.g., converting "NY" to "New York") and ensuring consistent measurement units.
6. Data Validation: Data validation ensures that values fall within acceptable ranges or categories. For example, ages might be restricted to valid ranges (0–120 years), or product prices should not be negative.
   ○ Technique: Applying rules or constraints to validate the data, for example a validation rule that enforces age values between 0 and 120.
7. Dealing with Noise: Noise refers to irrelevant or random data that can distort the analysis. Noise may occur due to external factors or errors during data collection.
   ○ Technique: Filtering or smoothing data to reduce noise, such as applying moving averages or using outlier detection methods.
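
The following sketch illustrates how these techniques could be applied with pandas; the DataFrame, column names, and thresholds are hypothetical and only meant to mirror the ideas above, not the dataset used later in this experiment.

import pandas as pd
import numpy as np

# Hypothetical raw data containing the kinds of issues described above
raw = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com", "d@x.com"],
    "age":   [25, 25, 130, np.nan, 41],
    "city":  ["NY", "NY", "New York", "Boston", "boston"],
    "date":  ["2020-01-31", "2020-01-31", "2020-02-15", "2020-03-15", "2020-04-01"],
    "price": [100.0, 100.0, 250.0, -5.0, 10000.0],
})

# 1. Removing duplicates: drop rows that repeat a unique identifier
clean = raw.drop_duplicates(subset="email").copy()

# 2. Handling missing data: flag the gap, then impute the column mean
clean["age_missing"] = clean["age"].isna()
clean["age"] = clean["age"].fillna(clean["age"].mean())

# 3. Standardizing formats: parse date strings and store them in one consistent format
clean["date"] = pd.to_datetime(clean["date"]).dt.strftime("%Y-%m-%d")

# 4. Handling outliers: cap values outside 1.5 * IQR (simple winsorization)
q1, q3 = clean["price"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["price"] = clean["price"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 5. Correcting inconsistencies: normalize text categories to one spelling
clean["city"] = clean["city"].str.title().replace({"Ny": "New York"})

# 6. Data validation: keep only rows whose age falls in an acceptable range
clean = clean[(clean["age"] >= 0) & (clean["age"] <= 120)]

# 7. Dealing with noise: smooth the price series with a short moving average
clean["price_smooth"] = clean["price"].rolling(window=2, min_periods=1).mean()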

Conclusion :
By applying various data cleaning techniques such as handling missing values, removing duplicates, addressing outliers, and standardizing data, data analysts can significantly improve the quality of the dataset, minimizing errors and biases that could affect subsequent analyses.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import warnings
warnings.filterwarnings("ignore")

from google.colab import files
uploaded = files.upload()

df = pd.read_csv("kc_house_data 1.csv")
df.head()

Saving kc_house_data 1.csv to kc_house_data 1 (1).csv
First five rows of the dataset (5 rows × 21 columns, including id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, and grade; the remaining columns were truncated in the original output).

1 print("Size of the data : ", df.shape)

Size of the data : (21613, 21)
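
Before dropping or deriving any columns, the cleaning checks described in the theory can be applied directly; a quick sketch (not part of the original notebook, assuming df as loaded above) is:

# Count exact duplicate rows and missing values per column before any modelling
print("Duplicate rows :", df.duplicated().sum())
print("Missing values per column :")
print(df.isnull().sum())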

df = df.drop("id", axis=1)
df.head()

First five rows after dropping the id column (the remaining 20 columns; output truncated in the original).

# extract the year of sale from the date column
df['yr'] = df['date'].astype(str).str[:4]
# derive the house age at the time of sale
df['age'] = df['yr'].astype(int) - df['yr_built']
# derive the age since renovation (0 for houses that were never renovated)
df['age_rnv'] = 0
df['age_rnv'] = df['yr'][df['yr_renovated'] != 0].astype(int) - df['yr_renovated'][df['yr_renovated'] != 0]
df['age_rnv'] = df['age_rnv'].fillna(0)
df.head()

First five rows with the derived yr, age, and age_rnv columns (5 rows × 23 columns; output truncated in the original).

df = df.drop(["date", "yr_built", "yr_renovated", "yr"], axis=1)
df.head()
First five rows after dropping date, yr_built, yr_renovated, and yr (output truncated in the original).

1 print("==============Correlation of price with other features=================")


2 df.corr().loc["price"][1:].sort_values(ascending = False).index

==============Correlation of price with other features=================


Index(['sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms',
'view', 'sqft_basement', 'bedrooms', 'lat', 'waterfront', 'floors',
'sqft_lot', 'sqft_lot15', 'age_rnv', 'condition', 'long', 'zipcode',
'age'],
dtype='object')

# independent and dependent features
X = df.drop("price", axis=1)
y = df["price"]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=3)
print("Training data size : ", X_train.shape)
print("Test data size : ", X_test.shape)

Training data size : (16209, 18)


Test data size : (5404, 18)

from sklearn.linear_model import LinearRegression

simple_lr = LinearRegression()
simple_lr.fit(X_train[["sqft_living"]], y_train)


LinearRegression()

y_pred = simple_lr.predict(X_test[["sqft_living"]])
y_pred

array([ 404785.59940785, 1220687.94257837,  850587.91062473, ...,
        323475.74390288,  292634.07457341,  522544.70048401])

from sklearn.metrics import r2_score, mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE of simple linear regression model : ", rmse)

RMSE of simple linear regression model : 258870.72627950888

r_squared = r2_score(y_test, y_pred)
print("R-squared of simple linear regression model : ", r_squared)
adjusted_r_squared = 1 - (1-r_squared)*(len(y_test)-1)/(len(y_test)-X_train.shape[1]-1)
print("\nAdjusted R-squared of simple linear regression model : ", adjusted_r_squared)

R-squared of simple linear regression model : 0.49753871219817647

Adjusted R-squared of simple linear regression model : 0.4958591758601203
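
The adjusted R-squared above is computed as 1 - (1 - R^2) * (n - 1) / (n - p - 1), where n is the number of test samples and p the number of predictors; note that the notebook passes X_train.shape[1] (all 18 columns) as p even for this single-feature model, which slightly understates the value. A small helper that bundles the three metrics used repeatedly below could look like this (a sketch, not part of the original notebook):

def evaluate(y_true, y_pred, n_features):
    """Return RMSE, R-squared and adjusted R-squared for a set of predictions."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return rmse, r2, adj_r2

# e.g. evaluate(y_test, y_pred, 1) for the single-feature model above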

reg_algos = pd.DataFrame({'Model': [],
                          'Features used': [],
                          'Root Mean Squared Error (RMSE)': [],
                          'R-squared': [],
                          'Adjusted R-squared': [],
                          })

model_count = reg_algos.shape[0]
reg_algos.loc[model_count] = ["Simple Linear Regression", "Single feature", rmse, r_squared, adjusted_r_squared]
reg_algos

Model Features used Root Mean Squared Error (RMSE) R-squared Adjusted R-squared

0 Simple Linear Regression Single feature 258870.72628 0.497539 0.495859

# Drop rows with NaN values in the selected features
X_train_clean = X_train[['sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms',
                         'view', 'sqft_basement', 'bedrooms']].dropna()

# Make sure the target variable (y_train) also matches the cleaned data
y_train_clean = y_train[X_train_clean.index]

# Fit the model with cleaned data
multiple_1 = LinearRegression()
multiple_1.fit(X_train_clean, y_train_clean)


LinearRegression()

y_pred = multiple_1.predict(X_test[['sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms',
                                    'view', 'sqft_basement', 'bedrooms']])
y_pred

array([ 487887.72731408, 1093680.35567361,  811391.75442649, ...,
        308702.31061097,  311303.84061483,  403007.89990196])

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE of Model 1 using a subset of features : ", rmse)

RMSE of Model 1 using a subset of features : 235594.92553209866

r_squared = r2_score(y_test, y_pred)
print("R-squared of Model 1 using a subset of features : ", r_squared)
adjusted_r_squared = 1 - (1-r_squared)*(len(y_test)-1)/(len(y_test)-X_train.shape[1]-1)
print("\nAdjusted R-squared of Model 1 using a subset of features : ", adjusted_r_squared)

R-squared of Model 1 using a subset of features : 0.5838320926112269

Adjusted R-squared of Model 1 using a subset of features : 0.5824410021129915

# Note: model_count is still 0 here, so this row replaces the Simple Linear Regression
# entry above; recompute model_count = reg_algos.shape[0] first to append a new row instead.
reg_algos.loc[model_count] = ["Multiple Linear Regression", "Subset of features", rmse, r_squared, adjusted_r_squared]
reg_algos

Model Features used Root Mean Squared Error (RMSE) R-squared Adjusted R-squared

0 Multiple Linear Regression Subset of features 235594.925532 0.583832 0.582441

!pip install scikit-learn
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# ... (Your existing code) ...

# Create an imputer to replace NaN with the mean of the column
imputer = SimpleImputer(strategy='mean')

# Fit the imputer to your training data and transform it
X_train_imputed = imputer.fit_transform(X_train)

# Create a new DataFrame with the imputed values
X_train_imputed = pd.DataFrame(X_train_imputed, columns=X_train.columns, index=X_train.index)

# Now fit your model using the imputed data
multiple_with_all = LinearRegression()
multiple_with_all.fit(X_train_imputed, y_train)

# ... (Rest of your code) ...



# Note: strictly, X_test should also be transformed with the fitted imputer before predicting
y_pred = multiple_with_all.predict(X_test)
y_pred

array([ 543287.1663189 , 1257510.98913652,  765837.4843899 , ...,
        123640.085557  ,  276934.94589009,  469213.84911597])

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE of Model 2 using all the features : ", rmse)

RMSE of Model 2 using all the features : 197253.20603379642

r_squared = r2_score(y_test, y_pred)
print("R-squared of Model 2 using all the features : ", r_squared)
adjusted_r_squared = 1 - (1-r_squared)*(len(y_test)-1)/(len(y_test)-X_train.shape[1]-1)
print("\nAdjusted R-squared of Model 2 using all the features : ", adjusted_r_squared)

R-squared of Model 2 using all the features : 0.7082674660890779

Adjusted R-squared of Model 2 using all the features : 0.7072923155578993

model_count = reg_algos.shape[0]
reg_algos.loc[model_count] = ["Multiple Linear Regression", "All features", rmse, r_squared, adjusted_r_squared]
reg_algos

Model Features used Root Mean Squared Error (RMSE) R-squared Adjusted R-squared

0 Multiple Linear Regression Subset of features 235594.925532 0.583832 0.582441

1 Multiple Linear Regression All features 197253.206034 0.708267 0.707292

!pip install mlxtend
import joblib
import sys
sys.modules['sklearn.externals.joblib'] = joblib
from mlxtend.feature_selection import SequentialFeatureSelector as SFS


sfs = SFS(LinearRegression(),
          k_features=(3, 18),
          forward=True,
          floating=False,
          scoring='r2',
          cv=0)

!pip install mlxtend
import joblib
import sys
sys.modules['sklearn.externals.joblib'] = joblib
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')

X_train_imputed = imputer.fit_transform(X_train)

X_train_imputed = pd.DataFrame(X_train_imputed, columns=X_train.columns)

sfs = SFS(LinearRegression(),
          k_features=(3, 18),
          forward=True,
          floating=False,
          scoring='r2',
          cv=0)
sfs.fit(X_train_imputed, y_train)


from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt

fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Forward Selection (w. StdErr)')
plt.grid()
plt.show()

sfs = SFS(LinearRegression(),
          k_features=14,
          forward=True,
          floating=False,
          scoring='r2',
          cv=0)

!pip install mlxtend
import joblib
import sys
sys.modules['sklearn.externals.joblib'] = joblib
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')

X_train_imputed = imputer.fit_transform(X_train)

X_train_imputed = pd.DataFrame(X_train_imputed, columns=X_train.columns)

sfs = SFS(LinearRegression(),
          k_features=(3, 18),
          forward=True,
          floating=False,
          scoring='r2',
          cv=0)
sfs.fit(X_train_imputed, y_train)


sfs.k_feature_names_

('bedrooms',
'bathrooms',
'sqft_living',
'sqft_lot',
'floors',
'waterfront',
'view',
'condition',
'grade',
'sqft_above',
'sqft_basement',
'zipcode',
'lat',
'long',
'sqft_living15',
'sqft_lot15',
'age',
'age_rnv')

multiple_with_selected = LinearRegression()
multiple_with_selected.fit(X_train[['bedrooms', 'bathrooms', 'sqft_living', 'waterfront', 'view',
                                    'condition', 'grade', 'sqft_basement', 'zipcode', 'lat', 'long',
                                    'sqft_living15', 'sqft_lot15', 'age']], y_train)


LinearRegression()

y_pred = multiple_with_selected.predict(X_test[['bedrooms', 'bathrooms', 'sqft_living', 'waterfront', 'view',
                                                'condition', 'grade', 'sqft_basement', 'zipcode', 'lat', 'long',
                                                'sqft_living15', 'sqft_lot15', 'age']])
y_pred

array([ 547845.87229597, 1263717.53465397,  766357.79047055, ...,
        130581.4909028 ,  279560.94999563,  466535.82218084])

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE of Model 3 using all the selected features : ", rmse)

RMSE of Model 3 using all the selected features : 197145.0287554766

r_squared = r2_score(y_test, y_pred)
print("R-squared of Model 3 using all the selected features : ", r_squared)
adjusted_r_squared = 1 - (1-r_squared)*(len(y_test)-1)/(len(y_test)-X_train.shape[1]-1)
print("\nAdjusted R-squared of Model 3 using all the selected features : ", adjusted_r_squared)

R-squared of Model 3 using all the selected features : 0.708587361298175

Adjusted R-squared of Model 3 using all the selected features : 0.7076132800546034

model_count = reg_algos.shape[0]
reg_algos.loc[model_count] = ["Multiple Linear Regression", "Selected features", rmse, r_squared, adjusted_r_squared]
reg_algos

Model Features used Root Mean Squared Error (RMSE) R-squared Adjusted R-squared

0 Multiple Linear Regression Subset of features 235594.925532 0.583832 0.582441

1 Multiple Linear Regression All features 197253.206034 0.708267 0.707292

2 Multiple Linear Regression Selected features 197145.028755 0.708587 0.707613

from sklearn.preprocessing import PolynomialFeatures
from sklearn.impute import SimpleImputer  # Import SimpleImputer

# Create an imputer to replace NaN with the mean of the column
imputer = SimpleImputer(strategy='mean')

# Fit the imputer to your training data and transform it
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)  # Transform X_test as well

# Now apply PolynomialFeatures to the imputed data
polyfeat = PolynomialFeatures(degree=2)
X_trainpoly = polyfeat.fit_transform(X_train_imputed)
X_testpoly = polyfeat.transform(X_test_imputed)  # Use transform for X_test

poly_reg = LinearRegression()
poly_reg.fit(X_trainpoly, y_train)

LinearRegression()

X_trainpoly.shape[1]

190

X_testpoly.shape[1]

190

With degree 2, the 18 original features expand to 190 polynomial terms: 1 bias term, 18 linear terms, 18 squared terms, and 153 pairwise interaction terms.

y_pred = poly_reg.predict(X_testpoly)
y_pred

array([ 501187.18259308, 1521362.3099478 ,  647763.248394  , ...,
        288724.48304937,  288659.51970974,  521409.17547527])

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE of Model 1 with degree 2 : ", rmse)

RMSE of Model 1 with degree 2 : 197145.0287554766

r_squared = r2_score(y_test, y_pred)
print("R-squared of Model 1 with degree 2 : ", r_squared)
adjusted_r_squared = 1 - (1-r_squared)*(len(y_test)-1)/(len(y_test)-X_train.shape[1]-1)
print("\nAdjusted R-squared of Model 1 with degree 2 : ", adjusted_r_squared)

R-squared of Model 1 with degree 2 : 0.708587361298175

Adjusted R-squared of Model 1 with degree 2 : 0.7076132800546034

model_count = reg_algos.shape[0]
reg_algos.loc[model_count] = ["Polynomial Regression", "Selected features", rmse, r_squared, adjusted_r_squared]
