
Register No: 2022510020

Ex no : 3
Date : 28.03.2024

Data Preparation for Exploration using Normalization, Binning and Sampling Methods
AIM:
To prepare our dataset “NYC Property Sales” for exploration using normalization, binning and sampling
methods.

NORMALIZATION:
Normalization is a data preprocessing technique that scales numeric values in a dataset to a standard range, ensuring all features contribute equally to analysis. It prevents features with large magnitudes from dominating the model.
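As a quick illustration (a minimal sketch on made-up numbers, not taken from the NYC dataset), the two scalings used in this exercise can be written out directly:

import numpy as np

x = np.array([100.0, 200.0, 400.0, 800.0])  # toy values for illustration only

# Z-score: centre on the mean, scale by the standard deviation
z = (x - x.mean()) / x.std()

# Min-Max: rescale linearly into [0, 1]
m = (x - x.min()) / (x.max() - x.min())

print(z)  # mean ~0, standard deviation ~1
print(m)  # 0.0 at the minimum, 1.0 at the maximum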

Z-Score Normalization:
Z-score normalization was applied first, transforming the data so that each column has a mean of 0 and a standard deviation of 1. This was done for the columns 'RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS', 'LAND SQUARE FEET', 'GROSS SQUARE FEET', and 'SALE PRICE'. Scaling these columns to a common distribution makes them directly comparable for machine learning analysis.
Code:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
columns_to_normalize = ['RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS', 'LAND SQUARE FEET', 'GROSS SQUARE FEET', 'SALE PRICE']
scaler = StandardScaler()
df_z_score_normalized = df.copy()
df_z_score_normalized[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])
print("Z-score normalized DataFrame:")
df_z_score_normalized[columns_to_normalize].head()
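
Before looking at the output, a quick sanity check (not part of the original run) can confirm the transformation: after Z-score scaling, each column's mean should be approximately 0 and its standard deviation approximately 1.

print(df_z_score_normalized[columns_to_normalize].mean().round(6))  # ~0
print(df_z_score_normalized[columns_to_normalize].std().round(6))   # ~1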

Output:


Min-Max Normalization:
To further preprocess the data, Min-Max normalization was applied using Scikit-learn's MinMaxScaler. This technique scales numeric values in a dataset to a specific range, typically between 0 and 1. The same numeric columns ('RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS', 'LAND SQUARE FEET', 'GROSS SQUARE FEET', and 'SALE PRICE') were transformed using this method. Min-Max normalization ensures that these columns are on a consistent scale, making them suitable for machine learning analysis. The resulting Min-Max normalized DataFrame is displayed to observe the impact of this scaling.
Code:
scaler = MinMaxScaler()
df_min_max_normalized = df.copy()
df_min_max_normalized[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])
print("Min-Max normalized DataFrame:")
df_min_max_normalized[columns_to_normalize].head()
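
Similarly, a quick check (not part of the original run) confirms that each Min-Max scaled column now spans exactly [0, 1]:

print(df_min_max_normalized[columns_to_normalize].min())  # all 0.0
print(df_min_max_normalized[columns_to_normalize].max())  # all 1.0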
Output:

Visualization of Min-Max Normalization Effects on Dataset Columns:

To visually compare the effects of Min-Max normalization on different columns, histograms and line plots are created before and after normalization using Matplotlib. For each selected numeric column, two plots are produced: one for the original data and another for the data after Min-Max normalization.
Histogram:
Code:
import matplotlib.pyplot as plt

columns_to_normalize = ['LAND SQUARE FEET', 'GROSS SQUARE FEET', 'SALE_PRICE_IN_M']
plt.figure(figsize=(15, 5))
for idx, column in enumerate(columns_to_normalize, start=1):
    plt.subplot(1, len(columns_to_normalize), idx)
    plt.hist(df[column], bins=20, color='skyblue', edgecolor='black')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.title('Histogram of ' + column.lower())
plt.suptitle('DATA BEFORE NORMALIZATION')
plt.tight_layout()
plt.show()

# Note: this scales the columns in place, overwriting the original values in df
scaler = MinMaxScaler()
df[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])

plt.figure(figsize=(15, 5))
for idx, column in enumerate(columns_to_normalize, start=1):
    plt.subplot(1, len(columns_to_normalize), idx)
    plt.hist(df[column], bins=20, color='skyblue', edgecolor='black')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.title('Histogram of ' + column.lower())
plt.suptitle('DATA AFTER MIN-MAX NORMALIZATION')
plt.tight_layout()
plt.show()
Output:


Lineplot:
Code:
fig, axes = plt.subplots(nrows=len(columns_to_normalize), ncols=2, figsize=(12, 10))
for i, column in enumerate(columns_to_normalize):
    # Line plot before normalization
    df[[column]].plot(ax=axes[i, 0])
    axes[i, 0].set_title(f'{column} (Before Normalization)')
    axes[i, 0].set_ylabel('Value')
    # Min-Max Normalization
    min_val = df[column].min()
    max_val = df[column].max()
    normalized_data = (df[column] - min_val) / (max_val - min_val)
    normalized_data.plot(ax=axes[i, 1], color='r')
    axes[i, 1].set_title(f'{column} (After Min-Max Normalization)')
    axes[i, 1].set_ylabel('Value')
plt.tight_layout()
fig.suptitle("Min-Max Normalization", fontsize=16, y=1.05)
plt.show()
Output:


Visualization of Z-Score Normalization Effects on Dataset Columns:

To visually compare the effects of Z-score normalization on different columns, histograms and line plots are created before and after normalization using Matplotlib. For each numeric column in the dataset, two plots are generated: one displaying the original data and another showing the data after Z-score normalization.
Histogram:
Code:
columns_to_normalize = ['LAND SQUARE FEET', 'GROSS SQUARE FEET', 'SALE_PRICE_IN_M']
plt.figure(figsize=(15, 5))
for idx, column in enumerate(columns_to_normalize, start=1):
    plt.subplot(1, len(columns_to_normalize), idx)
    plt.hist(df[column], bins=20, color='skyblue', edgecolor='black')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.title('Histogram of ' + column.lower())
plt.suptitle('DATA BEFORE NORMALIZATION')
plt.tight_layout()
plt.show()

scaler = StandardScaler()
df[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])

plt.figure(figsize=(15, 5))
for idx, column in enumerate(columns_to_normalize, start=1):
    plt.subplot(1, len(columns_to_normalize), idx)
    plt.hist(df[column], bins=20, color='skyblue', edgecolor='black')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.title('Histogram of ' + column.lower())
plt.suptitle('DATA AFTER Z-SCORE NORMALIZATION')
plt.tight_layout()
plt.show()

Output:


Lineplot:
Code:
scaler = StandardScaler()
df_z_score_normalized = df.copy()
df_z_score_normalized[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])
fig, axes = plt.subplots(nrows=len(columns_to_normalize), ncols=2, figsize=(12, 10))
for i, column in enumerate(columns_to_normalize):
    # Line plot before normalization
    df[column].plot(ax=axes[i, 0])
    axes[i, 0].set_title(f'{column} (Before Normalization)')
    axes[i, 0].set_ylabel('Value')
    # Line plot after Z-score normalization
    df_z_score_normalized[column].plot(ax=axes[i, 1], color='r')
    axes[i, 1].set_title(f'{column} (After Z-score Normalization)')
    axes[i, 1].set_ylabel('Value')
plt.tight_layout()
fig.suptitle("Z-score Normalization Line Plots", fontsize=16, y=1.05)
plt.show()

Output:


BINNING:
Binning is a data preprocessing technique used to transform continuous numerical data into discrete bins
or categories. It involves grouping numerical values into intervals or ranges, which can be useful for data
analysis and visualization tasks.
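As a toy preview (a minimal sketch with made-up values, independent of the dataset), pd.cut performs both the equal-width and custom-edge binning used below:

import pandas as pd

s = pd.Series([1, 2, 5, 7, 9, 15, 30])  # toy values

# Equal-width binning: pandas splits the range into 3 intervals of equal width
print(pd.cut(s, bins=3).value_counts().sort_index())

# Custom binning: interval edges defined by hand
print(pd.cut(s, bins=[0, 5, 10, 30]).value_counts().sort_index())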
Code:
df['SALE_PRICE_IN_M'].plot.hist(bins=10)
Output:


Code:
maxrange = df['SALE_PRICE_IN_M'].max()
minrange = df['SALE_PRICE_IN_M'].min()
value_range = maxrange - minrange  # renamed so the built-in range() is not shadowed
bins = 3
binwidth = value_range / bins
print(maxrange)
print(minrange)
print(value_range)
print(binwidth)
Output:
4.875
0.065789
4.809211
1.6030703333333334
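
For reference, the bin width computed above implies explicit bin edges, which can be derived directly (a small illustrative sketch using the variables defined above):

import numpy as np

edges = np.linspace(minrange, maxrange, bins + 1)
print(edges)  # approximately [0.0658, 1.6689, 3.2719, 4.875]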
Equal width binning:
The following code snippet demonstrates the process of equal-width binning using the Pandas library
(pd.cut function) and visualizes the resulting bins with a histogram using Matplotlib. This technique helps in
understanding how continuous data in a specific column is segmented into distinct categories of equal width.
Borough:
Code:
column_to_bin = 'BOROUGH'
num_bins = 5  # Increase the number of bins
bin_edges = pd.cut(df[column_to_bin], bins=num_bins, precision=2)
plt.figure(figsize=(10, 6))
plt.hist(df[column_to_bin], bins=num_bins, edgecolor='black')
plt.xlabel(column_to_bin)
plt.ylabel('Frequency')
plt.title('Histogram of BOROUGH with Equal-Width Binning')
plt.grid(True)
plt.show()

Output:

Sale price:
Code:
custom_labels = ['Low', 'Medium', 'High']
columns_to_bin = ['SALE_PRICE_IN_M']
num_bins = 3
# Iterate through each column and perform equal-width binning
for idx, column in enumerate(columns_to_bin, start=1):
    # Perform equal-width binning
    bins = pd.cut(df[column], bins=num_bins, labels=custom_labels, include_lowest=True)
    # Count the number of data points in each bin
    bin_counts = bins.value_counts().sort_index()
    # Plot the histogram of binned data
    plt.subplot(1, len(columns_to_bin), idx)
    plt.bar(bin_counts.index.astype(str), bin_counts.values, color='skyblue')
    # Add labels and title
    plt.xlabel('Sale price')
    plt.ylabel('Frequency')
    plt.title('Binned Histogram of ' + column)
plt.tight_layout()
plt.show()

Output:

Custom binning:
The following code snippet demonstrates custom binning applied to the 'YEAR BUILT' column in a
dataset. Custom binning involves manually defining intervals or bins to group data based on specific criteria. In
this case, construction years are categorized into custom-defined intervals such as [1875-1900], [1900-1925],
and so forth.
Code:
custom_bins = [1875, 1900, 1925, 1950, 1975, 2000, 2024]
column_to_bin = 'YEAR BUILT'
plt.figure(figsize=(10, 6))
# The manually defined edges can be passed straight to plt.hist as bin boundaries
plt.hist(df[column_to_bin], bins=custom_bins, edgecolor='black')
plt.xlabel(column_to_bin)
plt.ylabel('Frequency')
plt.title('Histogram of YEAR BUILT with Custom Binning')
plt.grid(True)
plt.show()

Output:

Custom binning on Sale Price:
The code snippet performs custom binning on the 'SALE PRICE' data, dividing it into specific price
ranges. It then visualizes the distribution of buildings across these price bins using a bar plot. This visualization
helps to understand how the number of buildings varies across different sale price ranges, providing insights
into the distribution pattern of sale prices within the dataset.
Code:
sale_price_bins = [0, 100000, 500000, 1000000, 5000000, float('inf')]
df['SALE PRICE BINNED'] = pd.cut(df['SALE PRICE'], bins=sale_price_bins)
sale_price_counts = df['SALE PRICE BINNED'].value_counts().sort_index()
plt.figure(figsize=(10, 6))
sale_price_counts.plot(kind='bar', color='lightblue')
plt.title('Distribution of Buildings by Sale Price')
plt.xlabel('Sale Price Bins')
plt.ylabel('Number of Buildings')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

Output:

Sampling:
Sampling is a data preprocessing technique used to select a subset of data points from a larger dataset,
often to reduce computational complexity or to ensure representative training and testing sets.
Random sampling:

Random sampling is a fundamental technique in statistics and data analysis, crucial for obtaining representative subsets from larger datasets. The following code snippet randomly selects 10,000 data points from the DataFrame df using the sample method. The random_state=42 parameter ensures reproducibility of the random sample, and random_sample.head() displays the first few rows of the randomly sampled data.
Code:
sample_size = 10000
random_sample = df.sample(n=sample_size, random_state=42)
random_sample.head()
Output:

Systematic sampling:
Systematic sampling involves selecting data points at regular intervals from an ordered dataset. In the
provided code snippet, a step size is calculated to create a systematic sample of approximately 10,000 data
points from the DataFrame df. This systematic sampling method ensures a structured and representative subset
of data for analysis.

Code:
step = len(df) // 10000  # interval between selected rows
systematic_sample = df.iloc[::step]
systematic_sample
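
As a quick check (not in the original output), the resulting sample size should be close to the 10,000 target; it can differ slightly because len(df) is usually not an exact multiple of the step.

print(step)                    # interval between selected rows
print(len(systematic_sample))  # approximately 10000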
Output:


Stratified Sampling:
The following code snippet performs stratified sampling using Stratified K-Fold cross-validation. Categorical columns are encoded with LabelEncoder, the features are separated from the target variable 'BOROUGH', and a KNN classifier is trained on one stratified fold. The classifier then predicts 'BOROUGH' for the held-out test fold (which serves as the stratified sample), its accuracy is computed, and the target variable's distribution is visualized before and after sampling.
Code:
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

# Encode object (string) columns so KNN can treat them as numeric features
mixed_cols = df.select_dtypes(include=['object']).columns.tolist()
df[mixed_cols] = df[mixed_cols].astype(str)
label_encoder = LabelEncoder()
for col in mixed_cols:
    df[col] = label_encoder.fit_transform(df[col])

X = df.drop('BOROUGH', axis=1)  # Features
y = df['BOROUGH']               # Target variable

# Take the first stratified fold as the train/test split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_index, test_index = next(skf.split(X, y))
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# The stratified test fold is the sampled subset
sampled_data = pd.concat([X_test, y_test], axis=1)

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
stage_counts = y.value_counts()
axes[0].pie(stage_counts, labels=stage_counts.index, autopct='%1.1f%%', startangle=90)
axes[0].set_title('Pie Chart of BOROUGH in Unsampled Data')
stage_counts_sampled = sampled_data['BOROUGH'].value_counts()
axes[1].pie(stage_counts_sampled, labels=stage_counts_sampled.index, autopct='%1.1f%%', startangle=90)
axes[1].set_title('Pie Chart of BOROUGH after Stratified Sampling')
plt.show()

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of KNN classifier: {accuracy:.2f}")
Output:

Proportionate Sampling:
The following code snippet compares the distribution of 'BOROUGH' categories in two scenarios: the original (unsampled) dataset and the dataset after proportionate sampling. The 'BOROUGH' column holds the borough categories, and the pie charts visualize the percentage of buildings within each category. The first pie chart displays the distribution of 'BOROUGH' in the original dataset, giving an initial view of how buildings are spread across boroughs. The second pie chart shows the distribution after proportionate sampling, where each borough is sampled in proportion to its representation in the original dataset.
Code:
stage_counts = df['BOROUGH'].value_counts()
plt.pie(stage_counts, labels=stage_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Pie Chart of BOROUGH in Unsampled Data')
plt.legend()
plt.show()

# Sample 60% of the rows within each borough, preserving the original proportions
sampled_data2 = df.groupby('BOROUGH', group_keys=False).apply(lambda x: x.sample(frac=0.6))
stage_counts = sampled_data2['BOROUGH'].value_counts()
plt.pie(stage_counts, labels=stage_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Pie Chart of BOROUGH after Proportionate Sampling')
plt.legend()
plt.show()
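
A quick check (not part of the original output) confirms that proportionate sampling preserves the borough shares:

# Relative frequencies should match to within sampling granularity
print(df['BOROUGH'].value_counts(normalize=True).round(3))
print(sampled_data2['BOROUGH'].value_counts(normalize=True).round(3))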
Output:

RESULT:
The NYC Property Sales dataset has been prepared for exploration using normalization, binning and sampling methods.