Ads 2
Theory:
The aim of data cleaning (also known as data cleansing or data scrubbing) is to ensure that the dataset is accurate, complete, consistent, and free from errors before performing any analysis. Raw data collected from various sources often contains inconsistencies, missing values, duplicates, or outliers that can distort or mislead results. Data cleaning techniques address these issues to improve the quality of the data, making it suitable for analysis, reporting, and decision-making.
1. Removing Duplicates: Duplicate entries occur when identical or nearly identical records are recorded multiple times. This can happen due to human error or issues during data collection. Duplicate records can distort analysis by overrepresenting certain values.
○ Technique: Identifying duplicate rows based on unique identifiers (like ID numbers or email addresses) and removing them, as sketched below.
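A minimal pandas sketch of this technique; the table and its email identifier column are made up for illustration:

import pandas as pd

# Hypothetical records; "email" serves as the unique identifier.
records = pd.DataFrame({"email": ["a@x.com", "b@x.com", "a@x.com"],
                        "name": ["Ann", "Bob", "Ann"]})

# Keep the first occurrence of each email and drop the rest.
deduped = records.drop_duplicates(subset="email", keep="first")
print(deduped)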
2. Handling Missing Data: Missing values occur when data is not recorded for certain observations. Handling missing data is essential because missing values can skew results and introduce bias.
○ Techniques (sketched in code after this list):
i. Deletion: Removing rows with missing values (listwise deletion).
ii. Imputation: Replacing missing values with estimated values, such as the mean, median, mode, or predicted values from a model.
iii. Flagging: Adding a special indicator for missing values, so they can be treated separately during analysis.
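A short sketch of all three approaches in pandas, on a made-up table with age and income columns:

import numpy as np
import pandas as pd

data = pd.DataFrame({"age": [25, np.nan, 40],
                     "income": [50000, 62000, np.nan]})

dropped = data.dropna()                                # i. listwise deletion
imputed = data.fillna(data.mean(numeric_only=True))    # ii. mean imputation
flagged = data.assign(age_missing=data["age"].isna())  # iii. missingness flag
print(dropped, imputed, flagged, sep="\n\n")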
3. Standardizing Data Formats: Inconsistent data formats can lead to problems when
comparing or aggregating data. For example, dates might be recorded as
"MM/DD/YYYY" in some records and "DD/MM/YYYY" in others.
○ Technique: Converting values to a consistent format, such as standardizing date formats or ensuring that numerical data has consistent decimal places (see the sketch below).
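A possible way to standardize mixed date formats with pandas, assuming each source's format is known:

import pandas as pd

us_dates = pd.Series(["03/25/2021", "12/01/2021"])  # MM/DD/YYYY source
eu_dates = pd.Series(["25/03/2021", "01/12/2021"])  # DD/MM/YYYY source

# Parse each source with its own known format, then render one ISO format.
standardized = pd.concat([pd.to_datetime(us_dates, format="%m/%d/%Y"),
                          pd.to_datetime(eu_dates, format="%d/%m/%Y")],
                         ignore_index=True)
print(standardized.dt.strftime("%Y-%m-%d"))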
4. Identifying and Handling Outliers: Outliers are extreme values that differ significantly from other observations in the dataset. They can indicate errors, data entry mistakes, or actual rare events.
○ Techniques (illustrated after this list):
i. Statistical methods: Using methods like the Z-score or IQR (Interquartile Range) to identify outliers.
ii. Capping: Limiting outlier values to a certain range (winsorization).
iii. Imputation: Replacing outlier values with more plausible values (mean, median, or predicted values).
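A compact sketch of IQR-based detection and capping on a made-up numeric series:

import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]  # i. identification
capped = values.clip(lower, upper)                      # ii. winsorization-style capping
print(outliers.tolist(), capped.tolist())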
5. Correcting Data Inconsistencies: Data collected from multiple sources or individuals may have inconsistencies, such as variations in spelling, abbreviations, or units of measurement.
○ Technique: Normalizing data by standardizing text entries (e.g., converting "NY" to "New York") and ensuring consistent measurement units, as sketched below.
6. Data Validation: Data validation ensures that values fall within acceptable ranges or categories. For example, ages might be restricted to a valid range (0–120 years), or product prices might be required to be non-negative.
○ Technique: Applying rules or constraints to validate the data, for example a validation rule that enforces age values between 0 and 120 (see the sketch below).
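One lightweight way to express such rules with pandas boolean masks (a sketch; dedicated schema-validation libraries offer the same idea with more structure):

import pandas as pd

table = pd.DataFrame({"age": [34, 150, -2], "price": [9.99, 20.0, -5.0]})

valid = table["age"].between(0, 120) & (table["price"] >= 0)
print(table[~valid])  # rows that violate the rules, for review or removal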
7. Dealing with Noise: Noise refers to irrelevant or random data that can distort the
analysis. Noise may occur due to external factors or errors during data collection.
○ Technique: Filtering or smoothing data to reduce noise, such as applying moving averages or using outlier detection methods (see the sketch below).
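A sketch of smoothing a noisy series with a centered rolling mean; the window size of 3 is an arbitrary choice:

import pandas as pd

noisy = pd.Series([10, 14, 9, 30, 11, 13, 10])
smoothed = noisy.rolling(window=3, center=True).mean()
print(smoothed)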
Conclusion:
By applying various data cleaning techniques such as handling missing values, removing duplicates, addressing outliers, and standardizing data, data analysts can significantly improve the quality of a dataset, minimizing errors and biases that could affect subsequent analyses.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import warnings
warnings.filterwarnings("ignore")

# Estimators, metrics, and the feature selector used in the cells below.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Upload the dataset through Colab's file picker, then load it.
from google.colab import files
uploaded = files.upload()

df = pd.read_csv("kc_house_data 1.csv")
df.head()
Saving kc_house_data 1.csv to kc_house_data 1 (1).csv
Output: first five rows of the dataset (5 rows × 21 columns: id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, ..., grade, ...).
df = df.drop("id", axis=1)  # id is only an identifier, not a predictive feature
df.head()
Output: first five rows of the updated DataFrame (5 rows × 23 columns: date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, ..., yr_built, ...).
# Model 1: simple linear regression on sqft_living alone.
simple_lr = LinearRegression()
simple_lr.fit(X_train[["sqft_living"]], y_train)

LinearRegression()
y_pred = simple_lr.predict(X_test[["sqft_living"]])
y_pred
Output: reg_algos comparison table (columns: Model, Features used, Root Mean Squared Error (RMSE), R-squared, Adjusted R-squared).
LinearRegression()
# Append this model's metrics as a new row of the comparison table.
model_count = reg_algos.shape[0]
reg_algos.loc[model_count] = ["Multiple Linear Regression", "All features", rmse, r_squared, adjusted_r_squared]
reg_algos
Output: reg_algos table with the new "Multiple Linear Regression, All features" row appended.
# Sequential forward selection: greedily grow the feature set from 3 up to
# 18 features, scoring candidate sets by R-squared on the training data (cv=0).
sfs = SFS(LinearRegression(),
          k_features=(3, 18),
          forward=True,
          floating=False,
          scoring='r2',
          cv=0)
sfs = sfs.fit(X_train, y_train)
sfs.k_feature_names_
('bedrooms',
'bathrooms',
'sqft_living',
'sqft_lot',
'floors',
'waterfront',
'view',
'condition',
'grade',
'sqft_above',
'sqft_basement',
'zipcode',
'lat',
'long',
'sqft_living15',
'sqft_lot15',
'age',
'age_rnv')
multiple_with_selected = LinearRegression()
multiple_with_selected.fit(X_train[['bedrooms',
                                    'bathrooms',
                                    'sqft_living',
                                    'waterfront',
                                    'view',
                                    'condition',
                                    'grade',
                                    'sqft_basement',
                                    'zipcode',
                                    'lat',
                                    'long',
                                    'sqft_living15',
                                    'sqft_lot15',
                                    'age']], y_train)
LinearRegression()
y_pred = multiple_with_selected.predict(X_test[['bedrooms',
                                                'bathrooms',
                                                'sqft_living',
                                                'waterfront',
                                                'view',
                                                'condition',
                                                'grade',
                                                'sqft_basement',
                                                'zipcode',
                                                'lat',
                                                'long',
                                                'sqft_living15',
                                                'sqft_lot15',
                                                'age']])
y_pred
r_squared = r2_score(y_test, y_pred)
print("R-squared of Model 3 using all the selected features : ", r_squared)

# Penalize by the 14 predictors this model actually uses; X_train.shape[1]
# would overstate the count, since the model was fit on a feature subset.
n_predictors = 14
adjusted_r_squared = 1 - (1 - r_squared) * (len(y_test) - 1) / (len(y_test) - n_predictors - 1)
print("\nAdjusted R-squared of Model 3 using all the selected features : ", adjusted_r_squared)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # used in the table below
model_count = reg_algos.shape[0]
reg_algos.loc[model_count] = ["Multiple Linear Regression", "Selected features", rmse, r_squared, adjusted_r_squared]
reg_algos
Output: reg_algos table with the "Multiple Linear Regression, Selected features" row appended.
LinearRegression()
X_trainpoly.shape[1]
190
X_testpoly.shape[1]
190
y_pred = poly_reg.predict(X_testpoly)
y_pred
# Append the polynomial model's metrics to the comparison table.
model_count = reg_algos.shape[0]
reg_algos.loc[model_count] = ["Polynomial Regression", "Selected features", rmse, r_squared, adjusted_r_squared]