
NAME : P KOUSHIK REDDY

ROLL NO : 12212161

Note : WEKA doesn't work on my laptop, hence I used Jupyter Notebook; it gives similar results.

Ex. 5 Select a dataset which comprises numeric attributes of varying range. Apply different normalization techniques, viz. Min-Max normalization, Z-Score normalization, and Decimal Scaling, on your dataset. Further, discretize the numeric attributes using the Binning and Histogram-analysis methods. Analyze the effect of the different techniques on the dataset in terms of the type of attributes, statistical parameters such as central tendency and dispersion, and the change in aptness of proximity metrics. (Later on, this exercise will be extended in combination with a clustering and/or classification and/or association technique.)
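
For reference, the three normalization rules reduce to one-line formulas. A minimal sketch on a single toy attribute (the values below are arbitrary, chosen only for illustration):

import numpy as np
values = np.array([20.0, 35.0, 50.0, 80.0])
minmax = (values - values.min()) / (values.max() - values.min())  # maps onto [0, 1]
zscore = (values - values.mean()) / values.std()                  # mean 0, std 1
decimal = values / 10**np.ceil(np.log10(np.abs(values).max()))    # here /100 -> 0.20 .. 0.80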

CODE : The comments below mark each step followed:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans

# Load the Iris dataset (or replace this with your dataset)
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Step 1: Display the first few rows of the dataset
print("Original Data:")
print(df.head())

# Step 2: Apply Min-Max Normalization
scaler_minmax = MinMaxScaler()
df_minmax = pd.DataFrame(scaler_minmax.fit_transform(df), columns=df.columns)
print("\nMin-Max Normalized Data:")
print(df_minmax.head())

# Step 3: Apply Z-Score Normalization
scaler_zscore = StandardScaler()
df_zscore = pd.DataFrame(scaler_zscore.fit_transform(df), columns=df.columns)
print("\nZ-Score Normalized Data:")
print(df_zscore.head())

# Step 4: Apply Decimal Scaling
df_decimal = df.copy()
for column in df_decimal.columns:
    max_val = df_decimal[column].abs().max()
    df_decimal[column] = df_decimal[column] / 10**np.ceil(np.log10(max_val))
print("\nDecimal Scaled Data:")
print(df_decimal.head())
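
# How the divisor is chosen above: it is the smallest power of ten that is at
# least the column's largest absolute value. For sepal length in Iris the
# maximum is 7.9, so ceil(log10(7.9)) = 1 and every value is divided by
# 10**1 = 10, mapping the column into (0, 1]. (Caveat: if a column's maximum
# were an exact power of ten, e.g. 10.0, this formula would scale it to
# exactly 1.0, whereas textbook decimal scaling requires |v'| < 1.)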

# Step 5: Discretize the Numeric Attributes using Binning
df_binned = df.copy()
for column in df.columns:
    df_binned[column + "_binned"] = pd.cut(df_binned[column], bins=3,
                                           labels=["Low", "Medium", "High"])
print("\nData After Binning:")
print(df_binned.head())
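
# pd.cut splits each attribute's range into three equal-width intervals, so
# bin membership depends on the range, not on frequency. An equal-frequency
# alternative (a sketch, if preferred) would be
# pd.qcut(df[column], q=3, labels=["Low", "Medium", "High"]),
# which places roughly 50 of the 150 Iris flowers in each bin per attribute.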

# Step 6: Histogram Analysis of Original and Normalized Data
# (DataFrame.hist creates its own figure, so figsize is passed directly
# rather than calling plt.figure first, which would leave blank figures.)
df.hist(bins=10, figsize=(14, 10), color='skyblue', edgecolor='black', alpha=0.7)
plt.suptitle("Histogram Analysis of Original Data")
plt.show()

df_minmax.hist(bins=10, figsize=(14, 10), color='lightgreen', edgecolor='black', alpha=0.7)
plt.suptitle("Histogram Analysis of Min-Max Normalized Data")
plt.show()

df_zscore.hist(bins=10, figsize=(14, 10), color='salmon', edgecolor='black', alpha=0.7)
plt.suptitle("Histogram Analysis of Z-Score Normalized Data")
plt.show()

df_decimal.hist(bins=10, figsize=(14, 10), color='lightcoral', edgecolor='black', alpha=0.7)
plt.suptitle("Histogram Analysis of Decimal Scaled Data")
plt.show()

# Step 7: Analyze Effect on Central Tendency and Dispersion
print("\nOriginal Data Summary:")
print(df.describe())

print("\nMin-Max Normalized Data Summary:")
print(df_minmax.describe())

print("\nZ-Score Normalized Data Summary:")
print(df_zscore.describe())

print("\nDecimal Scaled Data Summary:")
print(df_decimal.describe())
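
# What the summaries should show: min-max forces every column into [0, 1];
# z-scoring gives each column mean 0 and (population) standard deviation 1;
# decimal scaling divides by a constant, so relative dispersion is unchanged.
# A quick sanity check of those claims:
assert np.allclose(df_minmax.min(), 0) and np.allclose(df_minmax.max(), 1)
assert np.allclose(df_zscore.mean(), 0)
assert np.allclose(df_zscore.std(ddof=0), 1)  # StandardScaler divides by the population std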

# Step 8: Scatter Plot of Original Data (to visually inspect clustering potential)
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x=df.columns[0], y=df.columns[1], s=100)
plt.title('Scatter Plot of Original Data')
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
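
# The exercise also asks how normalization changes the aptness of proximity
# metrics. A minimal sketch of that comparison, using Euclidean distance via
# scipy.spatial.distance.pdist: on the raw data, the attributes with the
# widest ranges (petal length spans 1.0-6.9 cm) dominate the distances, while
# after min-max normalization all four attributes contribute on the same
# [0, 1] scale.
from scipy.spatial.distance import pdist
dist_raw = pdist(df.values)          # pairwise Euclidean distances, raw scale
dist_norm = pdist(df_minmax.values)  # same pairs after min-max normalization
print("\nMean pairwise distance (raw):", dist_raw.mean())
print("Mean pairwise distance (min-max):", dist_norm.mean())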

# Step 9: Dendrogram for Hierarchical Clustering
linked = linkage(df, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Samples')
plt.ylabel('Euclidean distances')
plt.show()

# Step 10: Elbow Method to Determine Optimal Number of Clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)  # explicit n_init avoids version-dependent warnings
    kmeans.fit(df)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.show()
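
Once the elbow is read off the plot, the chosen k can be used for a final clustering. A minimal sketch, assuming the elbow lands at k = 3 (Iris does contain three species):

kmeans_final = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans_final.fit_predict(df)
print("Cluster sizes:", np.bincount(labels))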

The output screenshots are pasted below.
