Ex No: 3
Date: 28.03.2024
NORMALIZATION:
Normalization is a data preprocessing technique that scales numeric values in a dataset to a standard range, ensuring all features contribute comparably to analysis. It prevents features with large ranges from dominating the model. Two variants are used below: z-score and min-max normalization.
Z-Score Normalization:
In our dataset, z-score normalization was applied first, transforming the data to have a mean of 0 and a standard deviation of 1. This was done for the columns 'RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS', 'LAND SQUARE FEET', 'GROSS SQUARE FEET', and 'SALE PRICE'. This scaling makes these columns directly comparable for machine learning analysis.
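Formally, each value x in a column is replaced by z = (x − μ) / σ, where μ is the column mean and σ is its standard deviation; values above the mean become positive and values below it become negative.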
Code:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
columns_to_normalize = ['RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS',
                        'LAND SQUARE FEET', 'GROSS SQUARE FEET', 'SALE PRICE']
# Z-score normalization: subtract the mean and divide by the standard deviation
scaler = StandardScaler()
df_z_score_normalized = df.copy()
df_z_score_normalized[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])
print("Z-score normalized DataFrame:")
df_z_score_normalized[columns_to_normalize].head()
Output:
Min-Max Normalization:
To further preprocess the data, min-max normalization was applied using scikit-learn's MinMaxScaler. This technique scales numeric values to a fixed range, typically between 0 and 1. The same numeric columns ('RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS', 'LAND SQUARE FEET', 'GROSS SQUARE FEET', and 'SALE PRICE') were transformed using this method. Min-max normalization ensures that these columns are on a consistent scale, making them suitable for machine learning analysis. The resulting min-max normalized DataFrame is displayed to observe the impact of this scaling.
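Formally, each value x is mapped to x' = (x − min) / (max − min), so the smallest value in a column becomes 0 and the largest becomes 1.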
Code:
# Min-max normalization: rescale each column to the [0, 1] range
scaler = MinMaxScaler()
df_min_max_normalized = df.copy()
df_min_max_normalized[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])
print("Min-Max normalized DataFrame:")
df_min_max_normalized[columns_to_normalize].head()
Output:
Histogram:
Code:
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 5))
for idx, column in enumerate(columns_to_normalize, start=1):
    plt.subplot(1, len(columns_to_normalize), idx)
    plt.hist(df[column], bins=20, color='skyblue', edgecolor='black')
    plt.xlabel(column)
    plt.ylabel('frequency')
    plt.title('histogram of ' + column.lower())
plt.suptitle('DATA BEFORE NORMALIZATION')
plt.tight_layout()
plt.show()
# Apply min-max scaling in place, then re-plot the same histograms
scaler = MinMaxScaler()
df[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])
plt.figure(figsize=(15, 5))
for idx, column in enumerate(columns_to_normalize, start=1):
    plt.subplot(1, len(columns_to_normalize), idx)
    plt.hist(df[column], bins=20, color='skyblue', edgecolor='black')
    plt.xlabel(column)
    plt.ylabel('frequency')
    plt.title('histogram of ' + column.lower())
plt.suptitle('DATA AFTER MinMax NORMALIZATION')
plt.tight_layout()
plt.show()
Output:
Lineplot:
Code:
fig, axes = plt.subplots(nrows=len(columns_to_normalize), ncols=2, figsize=(12, 10))
for i, column in enumerate(columns_to_normalize):
    # Line plot before normalization
    df[[column]].plot(ax=axes[i, 0])
    axes[i, 0].set_title(f'{column} (Before Normalization)')
    axes[i, 0].set_ylabel('Value')
    # Min-max normalization computed manually for comparison
    min_val = df[column].min()
    max_val = df[column].max()
    normalized_data = (df[column] - min_val) / (max_val - min_val)
    normalized_data.plot(ax=axes[i, 1], color='r')
    axes[i, 1].set_title(f'{column} (After Min-Max Normalization)')
    axes[i, 1].set_ylabel('Value')
plt.tight_layout()
fig.suptitle("Min-Max Normalization", fontsize=16, y=1.05)
plt.show()
Output:
Histogram:
Code:
# (assumes df again holds the original, unscaled data at this point)
plt.figure(figsize=(15, 5))
for idx, column in enumerate(columns_to_normalize, start=1):
    plt.subplot(1, len(columns_to_normalize), idx)
    plt.hist(df[column], bins=20, color='skyblue', edgecolor='black')
    plt.xlabel(column)
    plt.ylabel('frequency')
    plt.title('histogram of ' + column.lower())
plt.suptitle('DATA BEFORE NORMALIZATION')
plt.tight_layout()
plt.show()
# Apply z-score scaling in place, then re-plot the same histograms
scaler = StandardScaler()
df[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])
plt.figure(figsize=(15, 5))
for idx, column in enumerate(columns_to_normalize, start=1):
    plt.subplot(1, len(columns_to_normalize), idx)
    plt.hist(df[column], bins=20, color='skyblue', edgecolor='black')
    plt.xlabel(column)
    plt.ylabel('frequency')
    plt.title('histogram of ' + column.lower())
plt.suptitle('DATA AFTER Z-SCORE NORMALIZATION')
plt.tight_layout()
plt.show()
Output:
Lineplot:
Code:
scaler = StandardScaler()
df_z_score_normalized = df.copy()
df_z_score_normalized[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])
fig, axes = plt.subplots(nrows=len(columns_to_normalize), ncols=2, figsize=(12, 10))
for i, column in enumerate(columns_to_normalize):
    # Line plot before normalization
    df[column].plot(ax=axes[i, 0])
    axes[i, 0].set_title(f'{column} (Before Normalization)')
    axes[i, 0].set_ylabel('Value')
    # Line plot after z-score normalization
    df_z_score_normalized[column].plot(ax=axes[i, 1], color='r')
    axes[i, 1].set_title(f'{column} (After Z-score Normalization)')
    axes[i, 1].set_ylabel('Value')
plt.tight_layout()
fig.suptitle("Z-score Normalization Line Plots", fontsize=16, y=1.05)
plt.show()
Output:
BINNING:
Binning is a data preprocessing technique used to transform continuous numerical data into discrete bins
or categories. It involves grouping numerical values into intervals or ranges, which can be useful for data
analysis and visualization tasks.
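For equal-width binning with k bins, each bin covers (max − min) / k units of the value range; this bin width is computed explicitly for the 'SALE_PRICE_IN_M' column below.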
Code:
# Distribution of sale price (in millions) before binning
df['SALE_PRICE_IN_M'].plot.hist(bins=10)
Output:
Code:
max_range = max(df['SALE_PRICE_IN_M'])
min_range = min(df['SALE_PRICE_IN_M'])
value_range = max_range - min_range  # avoid shadowing the built-in range()
bins = 3
bin_width = value_range / bins
print(max_range)
print(min_range)
print(value_range)
print(bin_width)
Output:
4.875
0.065789
4.809211
1.6030703333333334
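As a quick check, (4.875 − 0.065789) / 3 = 4.809211 / 3 ≈ 1.6031, which matches the printed bin width.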
Equal width binning:
The following code snippet demonstrates the process of equal-width binning using the Pandas library
(pd.cut function) and visualizes the resulting bins with a histogram using Matplotlib. This technique helps in
understanding how continuous data in a specific column is segmented into distinct categories of equal width.
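For intuition, here is a small toy example (illustrative values only, not from the dataset) of how pd.cut forms equal-width intervals:
import pandas as pd
# Two equal-width bins over the toy values 1, 5, 9; pandas widens the
# lowest edge slightly (0.1% of the range) so the minimum is included
pd.cut([1, 5, 9], bins=2)
# -> [(0.992, 5.0], (0.992, 5.0], (5.0, 9.0]]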
Borough:
Code:
column_to_bin = 'BOROUGH'
num_bins = 5
# pd.cut splits the value range into equal-width intervals
binned = pd.cut(df[column_to_bin], bins=num_bins, precision=2)
plt.figure(figsize=(10, 6))
# Bar chart of how many rows fall into each bin
binned.value_counts().sort_index().plot(kind='bar', edgecolor='black')
plt.title('Equal-width binning of ' + column_to_bin)
plt.ylabel('Count')
plt.show()
Output:
Sale price:
Code:
custom_labels = ['Low', 'Medium', 'High']
columns_to_bin = ['SALE_PRICE_IN_M']
num_bins = 3
# Iterate through each column and perform equal-width binning
for idx, column in enumerate(columns_to_bin, start=1):
    # Perform equal-width binning with descriptive labels
    bins = pd.cut(df[column], bins=num_bins, labels=custom_labels, include_lowest=True)
    # Count the number of data points in each bin
    bin_counts = bins.value_counts().sort_index()
    # Plot the binned data as a bar chart
    bin_counts.plot(kind='bar', edgecolor='black')
    plt.xlabel(column)
    plt.ylabel('Count')
    plt.title(f'Equal-width binning of {column}')
plt.show()
Output:
Custom binning:
The following code snippet demonstrates custom binning applied to the 'YEAR BUILT' column in a
dataset. Custom binning involves manually defining intervals or bins to group data based on specific criteria. In
this case, construction years are categorized into custom-defined intervals such as [1875-1900], [1900-1925],
and so forth.
Code:
custom_bins = [1875, 1900, 1925, 1950, 1975, 2000, 2024]
column_to_bin = 'YEAR BUILT'
# Group construction years into the manually defined intervals
year_bins = pd.cut(df[column_to_bin], bins=custom_bins)
year_bins.value_counts().sort_index().plot(kind='bar', edgecolor='black')
plt.show()
Output:
Sampling:
Sampling is a data preprocessing technique used to select a subset of data points from a larger dataset,
often to reduce computational complexity or to ensure representative training and testing sets.
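Four strategies are demonstrated below: random, systematic, stratified, and proportionate sampling.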
Random sampling:
Random sampling is a fundamental technique in statistics and data analysis, crucial for obtaining representative subsets from larger datasets. The following code snippet randomly selects 10,000 data points from the DataFrame df using the sample method. This random selection provides a representative subset for analysis and modeling purposes, and the random_state=42 parameter ensures the sample is reproducible. The line random_sample.head() displays the first few rows of the randomly sampled data.
Code:
sample_size = 10000
random_sample = df.sample(n=sample_size, random_state=42)  # reproducible random subset
random_sample.head()
Output:
Systematic sampling:
Systematic sampling involves selecting data points at regular intervals from an ordered dataset. In the
provided code snippet, a step size is calculated to create a systematic sample of approximately 10,000 data
points from the DataFrame df. This systematic sampling method ensures a structured and representative subset
of data for analysis.
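For a DataFrame with N rows, step = N // 10000 (integer division), so keeping every step-th row returns roughly 10,000 rows.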
Code:
# Keep every step-th row to obtain roughly 10,000 data points
step = int(len(df) / 10000)
systematic_sample = df.iloc[::step]
systematic_sample
Output:
Stratified Sampling:
The following code snippet performs stratified sampling using stratified k-fold cross-validation, encodes categorical data with LabelEncoder, separates the features from the target variable 'BOROUGH', and trains a KNN classifier. It then predicts 'BOROUGH' for the test data, computes the classifier's accuracy, and visualizes the target variable's distribution before and after sampling.
Code:
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
mixed_cols = df.select_dtypes(include=['object']).columns.tolist()
df[mixed_cols] = df[mixed_cols].astype(str)
label_encoder = LabelEncoder()
for col in mixed_cols:
    df[col] = label_encoder.fit_transform(df[col])
X = df.drop('BOROUGH', axis=1) # Features
y = df['BOROUGH'] # Target variable
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_index, test_index = next(skf.split(X, y))
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
sampled_data = pd.concat([X_test, y_test], axis=1)  # the stratified sample
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
stage_counts = y.value_counts()
axes[0].pie(stage_counts, labels=stage_counts.index, autopct='%1.1f%%', startangle=90)
axes[0].set_title('Pie Chart of BOROUGH in Unsampled Data')
stage_counts_sampled = sampled_data['BOROUGH'].value_counts()  # distribution in the stratified sample
axes[1].pie(stage_counts_sampled, labels=stage_counts_sampled.index, autopct='%1.1f%%', startangle=90)
axes[1].set_title('Pie Chart of BOROUGH after Stratified Sampling')
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of KNN classifier: {accuracy:.2f}")
Output:
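Note: when only one stratified split is needed (rather than k folds), scikit-learn's train_test_split can produce it directly. A minimal sketch, assuming X and y as defined above:
from sklearn.model_selection import train_test_split
# stratify=y keeps the BOROUGH class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)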
Proportionate Sampling:
The following code snippet compares the distribution of 'BOROUGH' categories in two scenarios: the original (unsampled) dataset and after applying proportionate sampling. The 'BOROUGH' column represents different borough categories, and the pie charts visualize the percentage of buildings within each borough category. The first pie chart displays the distribution of 'BOROUGH' in the original dataset, providing an initial understanding of how buildings are distributed across the boroughs. The second pie chart illustrates the distribution of 'BOROUGH' after proportionate sampling, where each borough category is sampled in proportion to its representation in the original dataset.
Code:
# Distribution of BOROUGH before sampling
stage_counts = df['BOROUGH'].value_counts()
plt.pie(stage_counts, labels=stage_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Pie Chart of BOROUGH in unsampled data')
plt.legend()
plt.show()
# Sample 60% of the rows within each borough, preserving proportions
sampled_data2 = df.groupby('BOROUGH', group_keys=False).apply(lambda x: x.sample(frac=0.6))
stage_counts = sampled_data2['BOROUGH'].value_counts()
plt.pie(stage_counts, labels=stage_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Pie Chart of BOROUGH after Proportionate Sampling')
plt.legend()
plt.show()
Output:
RESULT:
The data has been prepared using normalization, binning, and sampling.