Ads 2
Theory:
The aim of data cleaning (also known as data cleansing or data scrubbing) is to ensure that the dataset is accurate, complete, consistent, and free from errors before performing any analysis. Raw data collected from various sources often contains inconsistencies, missing values, duplicates, or outliers that can distort or mislead results. Data cleaning techniques address these issues to improve the quality of the data, making it suitable for analysis, reporting, and decision-making.
1. Removing Duplicates: Duplicate entries occur when identical or nearly identical records are recorded multiple times. This can happen due to human error or issues during data collection. Duplicate records can distort analysis by overrepresenting certain values.
○ Technique: Identifying duplicate rows based on unique identifiers (like ID numbers or email addresses) and removing them, as sketched below.
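A minimal pandas sketch of this technique; the table and its email identifier column are made up for illustration:

import pandas as pd

# Hypothetical records; "email" serves as the unique identifier.
records = pd.DataFrame({"email": ["a@x.com", "b@x.com", "a@x.com"],
                        "name": ["Ann", "Bob", "Ann"]})

# Keep the first occurrence of each email and drop the rest.
deduped = records.drop_duplicates(subset="email", keep="first")
print(deduped)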
2. Handling Missing Data: Missing values occur when data is not recorded for certain observations. Handling missing data is essential because missing values can skew results and introduce bias.
○ Techniques (sketched in code after this list):
i. Deletion: Removing rows with missing values (listwise deletion).
ii. Imputation: Replacing missing values with estimated values, such as the mean, median, mode, or predicted values from a model.
iii. Flagging: Adding a special indicator for missing values, so they can be treated separately during analysis.
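A short sketch of all three approaches in pandas, on a made-up table with age and income columns:

import numpy as np
import pandas as pd

data = pd.DataFrame({"age": [25, np.nan, 40],
                     "income": [50000, 62000, np.nan]})

dropped = data.dropna()                                # i. listwise deletion
imputed = data.fillna(data.mean(numeric_only=True))    # ii. mean imputation
flagged = data.assign(age_missing=data["age"].isna())  # iii. missingness flag
print(dropped, imputed, flagged, sep="\n\n")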
3. Standardizing Data Formats: Inconsistent data formats can lead to problems when
comparing or aggregating data. For example, dates might be recorded as
"MM/DD/YYYY" in some records and "DD/MM/YYYY" in others.
○ Technique: Converting values to a consistent format, such as standardizing date formats or ensuring that numerical data has consistent decimal places (see the sketch below).
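A possible way to standardize mixed date formats with pandas, assuming each source's format is known:

import pandas as pd

us_dates = pd.Series(["03/25/2021", "12/01/2021"])  # MM/DD/YYYY source
eu_dates = pd.Series(["25/03/2021", "01/12/2021"])  # DD/MM/YYYY source

# Parse each source with its own known format, then render one ISO format.
standardized = pd.concat([pd.to_datetime(us_dates, format="%m/%d/%Y"),
                          pd.to_datetime(eu_dates, format="%d/%m/%Y")],
                         ignore_index=True)
print(standardized.dt.strftime("%Y-%m-%d"))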
4. Identifying and Handling Outliers: Outliers are extreme values that differ significantly from other observations in the dataset. They can indicate errors, data entry mistakes, or actual rare events.
○ Techniques (illustrated after this list):
i. Statistical methods: Using methods like the Z-score or IQR (Interquartile Range) to identify outliers.
ii. Capping: Limiting outlier values to a certain range (winsorization).
iii. Imputation: Replacing outlier values with more plausible values (mean, median, or predicted values).
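A compact sketch of IQR-based detection and capping on a made-up numeric series:

import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]  # i. identification
capped = values.clip(lower, upper)                      # ii. winsorization-style capping
print(outliers.tolist(), capped.tolist())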
5. Correcting Data Inconsistencies: Data collected from multiple sources or individuals may have inconsistencies, such as variations in spelling, abbreviations, or units of measurement.
○ Technique: Normalizing data by standardizing text entries (e.g., converting "NY" to "New York") and ensuring consistent measurement units, as sketched below.
6. Data Validation: Data validation ensures that values fall within acceptable ranges or categories. For example, ages might be restricted to a valid range (0–120 years), or product prices might be required to be non-negative.
○ Technique: Applying rules or constraints to validate the data, for example a validation rule that enforces age values between 0 and 120 (see the sketch below).
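One lightweight way to express such rules with pandas boolean masks (a sketch; dedicated schema-validation libraries offer the same idea with more structure):

import pandas as pd

table = pd.DataFrame({"age": [34, 150, -2], "price": [9.99, 20.0, -5.0]})

valid = table["age"].between(0, 120) & (table["price"] >= 0)
print(table[~valid])  # rows that violate the rules, for review or removal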
7. Dealing with Noise: Noise refers to irrelevant or random data that can distort the
analysis. Noise may occur due to external factors or errors during data collection.
○ Technique: Filtering or smoothing data to reduce noise, such as applying moving averages or using outlier detection methods (see the sketch below).
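A sketch of smoothing a noisy series with a centered rolling mean; the window size of 3 is an arbitrary choice:

import pandas as pd

noisy = pd.Series([10, 14, 9, 30, 11, 13, 10])
smoothed = noisy.rolling(window=3, center=True).mean()
print(smoothed)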
Conclusion:
By applying various data cleaning techniques such as handling missing values, removing duplicates, addressing outliers, and standardizing data, data analysts can significantly improve the quality of a dataset, minimizing errors and biases that could affect subsequent analyses.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import warnings
warnings.filterwarnings("ignore")

# Estimators, metrics, and the feature selector used in the cells below.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Upload the dataset through Colab's file picker, then load it.
from google.colab import files
uploaded = files.upload()

df = pd.read_csv("kc_house_data 1.csv")
df.head()
Saving kc_house_data 1.csv to kc_house_data 1 (1).csv
Output: first five rows of the dataset (5 rows × 21 columns: id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, ..., grade, ...).
df = df.drop("id", axis=1)  # id is only an identifier, not a predictive feature
df.head()
Output: first five rows of the updated DataFrame (5 rows × 23 columns: date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, ..., yr_built, ...).
# Model 1: simple linear regression on sqft_living alone.
simple_lr = LinearRegression()
simple_lr.fit(X_train[["sqft_living"]], y_train)

LinearRegression()
y_pred = simple_lr.predict(X_test[["sqft_living"]])
y_pred
Output: reg_algos comparison table (columns: Model, Features used, Root Mean Squared Error (RMSE), R-squared, Adjusted R-squared).
LinearRegression()
# Append this model's metrics as a new row of the comparison table.
model_count = reg_algos.shape[0]
reg_algos.loc[model_count] = ["Multiple Linear Regression", "All features", rmse, r_squared, adjusted_r_squared]
reg_algos
Output: reg_algos table with the new "Multiple Linear Regression, All features" row appended.
# Sequential forward selection: greedily grow the feature set from 3 up to
# 18 features, scoring candidate sets by R-squared on the training data (cv=0).
sfs = SFS(LinearRegression(),
          k_features=(3, 18),
          forward=True,
          floating=False,
          scoring='r2',
          cv=0)
sfs = sfs.fit(X_train, y_train)
sfs.k_feature_names_
('bedrooms',
'bathrooms',
'sqft_living',
'sqft_lot',
'floors',
'waterfront',
'view',
'condition',
'grade',
'sqft_above',
'sqft_basement',
'zipcode',
'lat',
'long',
'sqft_living15',
'sqft_lot15',
'age',
'age_rnv')
multiple_with_selected = LinearRegression()
multiple_with_selected.fit(X_train[['bedrooms',
                                    'bathrooms',
                                    'sqft_living',
                                    'waterfront',
                                    'view',
                                    'condition',
                                    'grade',
                                    'sqft_basement',
                                    'zipcode',
                                    'lat',
                                    'long',
                                    'sqft_living15',
                                    'sqft_lot15',
                                    'age']], y_train)
LinearRegression()
y_pred = multiple_with_selected.predict(X_test[['bedrooms',
                                                'bathrooms',
                                                'sqft_living',
                                                'waterfront',
                                                'view',
                                                'condition',
                                                'grade',
                                                'sqft_basement',
                                                'zipcode',
                                                'lat',
                                                'long',
                                                'sqft_living15',
                                                'sqft_lot15',
                                                'age']])
y_pred
r_squared = r2_score(y_test, y_pred)
print("R-squared of Model 3 using all the selected features : ", r_squared)

# Penalize by the 14 predictors this model actually uses; X_train.shape[1]
# would overstate the count, since the model was fit on a feature subset.
n_predictors = 14
adjusted_r_squared = 1 - (1 - r_squared) * (len(y_test) - 1) / (len(y_test) - n_predictors - 1)
print("\nAdjusted R-squared of Model 3 using all the selected features : ", adjusted_r_squared)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # used in the table below
model_count = reg_algos.shape[0]
reg_algos.loc[model_count] = ["Multiple Linear Regression", "Selected features", rmse, r_squared, adjusted_r_squared]
reg_algos
Output: reg_algos table with the "Multiple Linear Regression, Selected features" row appended.
LinearRegression()
X_trainpoly.shape[1]
190
X_testpoly.shape[1]
190
y_pred = poly_reg.predict(X_testpoly)
y_pred
# Append the polynomial model's metrics to the comparison table.
model_count = reg_algos.shape[0]
reg_algos.loc[model_count] = ["Polynomial Regression", "Selected features", rmse, r_squared, adjusted_r_squared]