Unsupervised ML

The document outlines a series of data processing and clustering techniques applied to the Iris dataset, including handling missing values, standardization, and the use of DBSCAN for clustering. It also covers the application of Principal Component Analysis (PCA) to reduce dimensionality and visualize the data, along with the explanation of variance captured by the principal components. Additionally, it briefly mentions Independent Component Analysis (ICA) for separating signals in noisy data.


from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load the dataset


iris = load_iris()

# Create a DataFrame from the dataset


iris_df = pd.DataFrame(data = iris.data, columns = iris.feature_names)

# Let's assume some missing values for our imputer example


iris_df.iloc[5:10, 2:] = None
print(iris_df.head(11))
# Use SimpleImputer to fill in the missing values
imputer = SimpleImputer(strategy='most_frequent')
# fit_transform() both learns the fill values and applies them;
# calling .fit() alone would only learn them without filling the data
iris_df_imputed = imputer.fit_transform(iris_df)

# Now, let's standardize the imputed dataset using StandardScaler


scaler = StandardScaler()
iris_df_standardized = scaler.fit_transform(iris_df_imputed)

# Convert the ndarray back to pandas DataFrame


iris_df_standardized = pd.DataFrame(data=iris_df_standardized,
columns=iris.feature_names)

# Print the first 10 rows of the final, standardized dataset


print(iris_df_standardized.head(10))
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
5 5.4 3.9 NaN NaN
6 4.6 3.4 NaN NaN
7 5.0 3.4 NaN NaN
8 4.4 2.9 NaN NaN
9 4.9 3.1 NaN NaN
10 5.4 3.7 1.5 0.2
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 -0.900681 1.019004 -1.335106 -1.311121
1 -1.143017 -0.131979 -1.335106 -1.311121
2 -1.385353 0.328414 -1.391807 -1.311121
3 -1.506521 0.098217 -1.278406 -1.311121
4 -1.021849 1.249201 -1.335106 -1.311121
5 -0.537178 1.939791 -1.335106 -1.311121
6 -1.506521 0.788808 -1.335106 -1.311121
7 -1.021849 0.788808 -1.335106 -1.311121
8 -1.748856 -0.362176 -1.335106 -1.311121
9 -1.143017 0.098217 -1.335106 -1.311121
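
As a quick sanity check, it can help to confirm the missing values before imputation and verify that none remain afterwards. A minimal sketch, continuing from the iris_df, imputer and iris_df_imputed objects defined above:

import numpy as np

# Missing values per column before imputation (five NaNs in each petal column)
print(iris_df.isnull().sum())

# The imputed array should contain no NaNs at all
print(np.isnan(iris_df_imputed).sum())   # expected: 0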

# Import the necessary libraries


from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Load Iris dataset


iris = load_iris()
X = iris.data

# Standardize the features


X = StandardScaler().fit_transform(X)

# Create a figure with two subplots for side-by-side comparison


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Experiment 1: DBSCAN with eps=0.3, min_samples=5


db1 = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels1 = db1.labels_
n_clusters_1 = len(set(labels1)) - (1 if -1 in labels1 else 0)
ax1.scatter(X[:, 2], X[:, 3], c=labels1, cmap='viridis')
ax1.set_title(f"Experiment 1: eps=0.3, min_samples=5\nClusters: {n_clusters_1}")
ax1.set_xlabel("Scaled Petal length")
ax1.set_ylabel("Scaled Petal width")

# Experiment 2: DBSCAN with eps=0.6, min_samples=10


db2 = DBSCAN(eps=0.6, min_samples=10).fit(X)
labels2 = db2.labels_
n_clusters_2 = len(set(labels2)) - (1 if -1 in labels2 else 0)
ax2.scatter(X[:, 2], X[:, 3], c=labels2, cmap='viridis')
ax2.set_title(f"Experiment 2: eps=0.6, min_samples=10\nClusters: {n_clusters_2}")
ax2.set_xlabel("Scaled Petal length")
ax2.set_ylabel("Scaled Petal width")

# Display the plots side by side


plt.suptitle("Comparison of DBSCAN Clustering of Iris Dataset")
plt.show()

# Import necessary libraries


from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Load Iris dataset


iris = load_iris()
X = iris.data

# Standardize the features


X = StandardScaler().fit_transform(X)

# Define DBSCAN parameters for experiments


params = [
{'eps': 0.3, 'min_samples': 5},
{'eps': 0.5, 'min_samples': 5},
{'eps': 0.3, 'min_samples': 10},
{'eps': 0.5, 'min_samples': 10}
]

# Set up subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

# Run experiments with different DBSCAN settings


for i, param in enumerate(params):
    # Create and fit DBSCAN model
    db = DBSCAN(eps=param['eps'], min_samples=param['min_samples'])
    labels = db.fit_predict(X)

    # Calculate the number of clusters (ignoring noise)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

    # Plot the results
    axes[i].scatter(X[:, 2], X[:, 3], c=labels, cmap='viridis', edgecolor='k')
    axes[i].set_title(f"eps={param['eps']}, min_samples={param['min_samples']}\nClusters: {n_clusters}")
    axes[i].set_xlabel("Scaled Petal length")
    axes[i].set_ylabel("Scaled Petal width")

# Overall title and layout adjustments


fig.suptitle("DBSCAN Clustering of Iris Dataset with Varying Parameters (Noise in Black)", fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

# Import the necessary libraries


from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset and extract the petal length and petal width features
iris = load_iris()
X = iris.data[:, 2:]

# Standardize the features


X = StandardScaler().fit_transform(X)

# Create DBSCAN model


db = DBSCAN(eps=0.5, min_samples=5)
model = db.fit(X)
# Retrieve labels (-1 indicates noise points)
labels = model.labels_

# Visualize the DBSCAN clusters using only the petal length and width features
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')

# Setting title, and x and y labels


plt.title("DBSCAN Clustering of Iris Dataset")
plt.xlabel("Scaled Petal Length")
plt.ylabel("Scaled Petal Width")
plt.show()

from sklearn.cluster import DBSCAN


from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np

# Load Iris dataset


iris = load_iris()
X = iris.data

# Create DBSCAN model


dbscan = DBSCAN(eps=0.4, min_samples=7)
dbscan.fit(X)

# Retrieve labels (-1 indicates noise points)


labels = dbscan.labels_
print(labels)
# Count the number of noise points in the model and print it
n_noise = list(labels).count(-1)
print('Number of noise points:', n_noise)

# Calculate the number of clusters (ignoring noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('Number of clusters:', n_clusters)

# Compute and print the silhouette score
print('Silhouette score:', silhouette_score(X, labels))
# Visualize the DBSCAN clusters
fig, ax = plt.subplots()
scatter=ax.scatter(X[:, 0],X[:, 1],c=labels)
# Produce a legend with the unique colors from the scatter
legend1 = ax.legend(*scatter.legend_elements(), title='Clusters')
ax.add_artist(legend1)

plt.title('DBSCAN Clustering of Iris Dataset')


plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')

plt.show()

Principal Component Analysis (PCA) transforms the data so that the original
features are replaced with new ones, called principal components.
● These components are linear combinations of the original features.
● They are constructed to be uncorrelated with each other (orthogonal).
● They are ordered by decreasing amount of variance (information) captured
from the data.

PCA is widely adopted for reducing the dimensionality of image datasets,
visualizing high-dimensional data, eliminating redundant features, and working
with inherently high-dimensional datasets such as gene expression data or
social network data.
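
Before the step-by-step walkthrough, here is a minimal sketch of how these ideas surface in scikit-learn, assuming the standardized Iris features: each row of pca.components_ holds the weights of one linear combination of the original features, explained_variance_ratio_ shows the decreasing share of variance, and the near-zero off-diagonal dot products confirm the components are orthogonal.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit PCA on the standardized Iris features
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

# Each row is the weight vector of one principal component
print(np.round(pca.components_, 2))

# Decreasing share of variance captured by each component
print(np.round(pca.explained_variance_ratio_, 2))

# Orthogonality check: the matrix of pairwise dot products is ~identity
print(np.round(pca.components_ @ pca.components_.T, 2))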

Step 1: Load the Iris Dataset

The Iris dataset has 150 samples with 4 features: sepal length, sepal width, petal
length, and petal width. The target labels (y) represent the species: Setosa (0),
Versicolor (1), and Virginica (2).

iris = load_iris()
X = iris.data
y = iris.target

Here's a quick look at the first 5 samples of X and y:

[[5.1 3.5 1.4 0.2]
 [4.9 3.0 1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.0 3.6 1.4 0.2]]

Target labels (y): [0, 0, 0, 0, 0]

Step 2: Inflate the "Sepal Length" Feature

The code artificially scales up the sepal length feature by multiplying it by 100. This step
is useful to see how PCA handles features with vastly different scales.

X[:, 0] = X[:, 0] * 100

After scaling, the first 5 samples look like this:

[[510.0 3.5 1.4 0.2]
 [490.0 3.0 1.4 0.2]
 [470.0 3.2 1.3 0.2]
 [460.0 3.1 1.5 0.2]
 [500.0 3.6 1.4 0.2]]
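
As a small side sketch (an illustration, not part of the main walkthrough), applying PCA directly to this inflated, unscaled data shows why scaling matters: the first component is dominated by the high-variance sepal length feature and explains almost all of the variance by itself.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Inflate sepal length and apply PCA without standardizing first
X_raw = load_iris().data.copy()
X_raw[:, 0] = X_raw[:, 0] * 100

pca_unscaled = PCA(n_components=2).fit(X_raw)
# The first ratio is close to 1.0: the inflated feature dominates the projection
print(pca_unscaled.explained_variance_ratio_)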

Step 3: Standardize the Features

Since PCA is sensitive to the scale of data, scaling the dataset ensures that each feature
has a mean of 0 and a standard deviation of 1.

X = StandardScaler().fit_transform(X)

After scaling, the first 5 samples look like this (approximate values):

[[-0.90  1.02 -1.34 -1.31]
 [-1.14 -0.13 -1.34 -1.31]
 [-1.39  0.33 -1.39 -1.31]
 [-1.51  0.10 -1.28 -1.31]
 [-1.02  1.25 -1.34 -1.31]]

Note that multiplying a feature by a constant does not change its standardized
values, since StandardScaler divides by the (equally inflated) standard deviation.

Step 4: Apply PCA

PCA reduces the dataset to 2 principal components, capturing the most variance in the
data.

pca = PCA(n_components=2)
X_r1 = pca.fit_transform(X)

After applying PCA, the data is reduced to two dimensions (principal components). The first
5 samples of the transformed data (X_r1) look like this:

[[-2.66 0.48]
[-2.54 -0.66]
[-2.74 -0.31]
[-2.76 -0.56]
[-2.57 0.79]]

Step 5: Variance Explained by Each Component

PCA provides a measure of how much variance each component explains. This is useful to
understand how much information is retained in each principal component.

print('explained variance ratio:', pca.explained_variance_ratio_)

For example, the output might look like:

explained variance ratio: [0.72, 0.23]

This means that the first principal component explains 72% of the variance in the dataset,
and the second explains 23%, totaling 95%.
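
As a related side sketch (separate from the full script that follows): the cumulative sum of the ratios shows how quickly variance accumulates, and passing a float to n_components asks PCA to keep just enough components to reach that fraction of variance.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Per-component and cumulative explained variance ratios
pca_full = PCA().fit(X)
print(np.round(pca_full.explained_variance_ratio_, 3))
print(np.round(np.cumsum(pca_full.explained_variance_ratio_), 3))

# A float n_components keeps the smallest number of components reaching that ratio
pca_95 = PCA(n_components=0.95).fit(X)
print(pca_95.n_components_)   # 2 components are enough for ~95% of the variance

The complete script for the walkthrough above follows.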
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import numpy as np
np.set_printoptions(suppress=True)

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target
print(np.column_stack((X,y)))
# Inflate 'sepal length (cm)'
X[:, 0] = X[:, 0] * 100

X=StandardScaler().fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_r1 = pca.fit_transform(X)
print(X_r1)
# Percentage of variance explained by each component
print('explained variance ratio : %s' % str(pca.explained_variance_ratio_))

# Plot PCA
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2

# y == 0 produces a boolean array like [True, True, False, False, ...]
# Applying this boolean mask to X_r1 keeps only the rows of X_r1 where y is 0
for color, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
    plt.scatter(X_r1[y == i, 0], X_r1[y == i, 1], color=color, alpha=.8,
                lw=lw, label=target_name)

# In unsupervised learning we typically don't use the target labels to train
# the model, since the goal is to find patterns or groupings in the data
# without prior knowledge of the labels. Here the targets are used for
# visualization only; they do not influence the PCA transformation.
# This single call does the same thing:
# plt.scatter(X_r1[:, 0], X_r1[:, 1], c=iris.target, cmap=plt.cm.Set1, edgecolor='k')
# And this is even simpler:
# plt.scatter(X_r1[:, 0], X_r1[:, 1], c=iris.target, edgecolor='k')

plt.legend(loc='best', shadow=False, scatterpoints=1)


plt.title('PCA example')
plt.show()
In the PCA plot the three species collapse into two visible clusters: Setosa is
well separated, while Versicolor and Virginica overlap in the same cluster.

Simpler code
# Importing necessary libraries
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Loading the Iris dataset


iris = load_iris()

# TODO: create a Pandas DataFrame from the Iris dataset, and specify features and target
x=iris.data
y=iris.target
# TODO: Standardize the Iris dataset
x=StandardScaler().fit_transform(x)
# TODO: Perform PCA on standardized data
x=PCA(n_components=2).fit_transform(x)
# TODO: Make a DataFrame of the transformed data (principal components)
x = pd.DataFrame(data=x, columns=['pc1','pc2'])
# TODO: Plot the transformed data using Matplotlib's scatter plot
plt.scatter(x['pc1'], x['pc2'], c=iris.target)
plt.show()

ICA is a technique you can use to untangle mixed variables in a dataset or to
extract meaningful signals from noisy data, such as separating individual
voices from the background chatter at a party (the classic "cocktail party"
problem).

ICA is a computational method for separating a multivariate signal into
additive subcomponents, assuming the mutual statistical independence of the
non-Gaussian source signals. If you are familiar with Principal Component
Analysis (PCA), it's worth noting that ICA is quite similar. However, while
PCA identifies components that maximize variance and are statistically
uncorrelated, ICA further requires the components to be statistically
independent. This additional requirement makes ICA more powerful than PCA in
many applications because it can recover non-Gaussian independent components.
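
To make the cocktail-party idea concrete, here is a minimal sketch using synthetic signals rather than the Iris data (the sources, the mixing matrix A, and the names S, X_mixed and S_estimated are all illustrative): two independent non-Gaussian signals are mixed together, and FastICA recovers them from the mixtures alone.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import FastICA

rng = np.random.RandomState(42)
t = np.linspace(0, 8, 2000)

# Two independent, non-Gaussian source signals with a little observation noise
s1 = np.sin(2 * t)                    # sinusoidal source
s2 = np.sign(np.sin(3 * t))           # square-wave source
S = np.c_[s1, s2] + 0.1 * rng.normal(size=(2000, 2))

# Mix the sources with a mixing matrix that ICA does not get to see
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])
X_mixed = S @ A.T

# Recover the independent components from the observed mixtures
ica = FastICA(n_components=2, whiten='unit-variance', random_state=42)
S_estimated = ica.fit_transform(X_mixed)

# Plot the true sources, the mixtures, and the ICA estimates
fig, axes = plt.subplots(3, 1, figsize=(8, 6), sharex=True)
for ax, data, title in zip(axes, [S, X_mixed, S_estimated],
                           ['True sources', 'Observed mixtures', 'ICA-recovered signals']):
    ax.plot(data)
    ax.set_title(title)
plt.tight_layout()
plt.show()

The recovered signals may come back in a different order and with different scales or signs than the originals; ICA cannot determine those, only the waveforms themselves.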

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import FastICA

# Load the iris dataset


iris = datasets.load_iris()
X = iris.data

# Standardize the dataset


X = (X - X.mean(axis=0)) / X.std(axis=0)

# Compute Fast Independent Components Analysis (FastICA)


ica = FastICA(n_components=2, whiten='unit-variance')
X_transformed = ica.fit_transform(X)

# Visualize the first independent component against the second


plt.figure(figsize=(10, 6))
plt.scatter(X_transformed[:, 0], X_transformed[:, 1], c=iris.target,
            edgecolor='k')

for color, i, target_name in zip(['blue', 'red', 'green'], [0, 1, 2],
                                 iris.target_names):
    plt.scatter(X_transformed[iris.target == i, 0],
                X_transformed[iris.target == i, 1],
                alpha=0.8,
                color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('Visualization of the first two independent components after ICA')
plt.grid(True)
plt.show()

The single scatter call below is often the simplest option, since the 150 rows
of X_transformed line up one-to-one with the 150 target labels:
plt.scatter(X_transformed[:, 0], X_transformed[:, 1], c=iris.target,
            edgecolor='k')

t-SNE
This advanced technique offers an impressive way to visualize
high-dimensional data by minimizing the divergence between two probability
distributions: one defined over pairs of points in the high-dimensional
space and one over the corresponding points in the low-dimensional space.

Simply put, t-SNE maps high-dimensional data points to a lower-dimensional
space (2D or 3D). Fascinatingly, it keeps similar data points close together
and dissimilar data points far apart in this lower-dimensional space. Neat,
right?

from sklearn.manifold import TSNE


from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

tSNE_model = TSNE(n_components=2, perplexity=30, learning_rate=200,
                  random_state=123)
X_2d = tSNE_model.fit_transform(X)

fig, ax = plt.subplots(figsize=(6, 5))


scatter = plt.scatter(X_2d[:,0],X_2d[:,1], c=iris.target)
legend1 = ax.legend(*scatter.legend_elements(), title="Types")
ax.add_artist(legend1)
plt.grid(True)
plt.title('t-SNE of Iris Dataset:')
plt.show()
Simpler code
# Import required libraries
from sklearn import datasets
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

tsne=TSNE(n_components=2,perplexity=30,learning_rate=200,random_state=42)
x=tsne.fit_transform(X)
plt.scatter(x[:,0],x[:,1],c=y)
plt.show()
K-Means vs. DBSCAN
K-means and DBSCAN, as clustering techniques, can be
compared across several parameters:

● Cluster Quality: K-means excels at creating spherical, similarly sized
clusters, while DBSCAN outperforms it in forming clusters of varying shapes
and sizes.
● Scalability: While K-means scales easily to large datasets, DBSCAN often
requires additional computational resources as the number of dimensions
increases.
● Tolerance to Noise: DBSCAN identifies and handles noise and outliers
effectively, giving it an advantage over K-means, which often absorbs noisy
points into clusters.
● Working with Different Densities: DBSCAN adapts well to a density-based
definition of a cluster, while K-means might struggle with clusters of
varying densities.
● Interpretability: K-means provides intuitive and easy-to-interpret results,
while DBSCAN's results may be slightly harder to interpret.

The above comparison between K-means and DBSCAN makes it easier to decide
which method meets your specific requirements. For instance, if your data
contains noise or calls for flexible cluster shapes, DBSCAN may be the more
suitable choice.
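
To see these differences in practice, here is a small sketch on scikit-learn's make_moons data, a synthetic example chosen because its two clusters are non-spherical and noisy (the eps and min_samples values are illustrative): K-means splits the moons with a roughly straight boundary, while DBSCAN follows their shapes and marks outliers as noise (-1).

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Two interleaving half-moon clusters with some noise
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

# Cluster the same data with both algorithms
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Plot the two results side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(X[:, 0], X[:, 1], c=km_labels, cmap='viridis')
ax1.set_title('K-means (k=2)')
ax2.scatter(X[:, 0], X[:, 1], c=db_labels, cmap='viridis')
ax2.set_title('DBSCAN (eps=0.3, min_samples=5)')
plt.suptitle('K-means vs. DBSCAN on non-spherical clusters')
plt.show()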

PCA vs. ICA vs. t-SNE

These techniques can be distinguished based on the following criteria:

● Explained Variance: PCA directly measures the retained variance in the
transformed data, while ICA and t-SNE do not offer as explicit a measure.
● Computational Efficiency: PCA demands fewer computing resources than ICA
and t-SNE.
● Interpretability: PCA and ICA produce encoded dimensions that are
interpretable, unlike the reduced dimensions in t-SNE, which aren't directly
interpretable.
● Modelling Technique: While all three function as unsupervised techniques
and are independent of any labels, they aim to achieve different things. PCA
looks for the greatest variance, ICA looks for statistical independence, and
t-SNE makes probability distributions in different dimensions as similar as
possible (the sketch below shows the three embeddings side by side).
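
A minimal side-by-side sketch (an illustration, separate from the K-means example that follows): the same standardized Iris data reduced to two dimensions with each of the three techniques, so the embeddings can be compared visually.

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)
y = iris.target

# Compute a 2D embedding of the same data with each technique
embeddings = {
    'PCA': PCA(n_components=2).fit_transform(X),
    'ICA': FastICA(n_components=2, whiten='unit-variance',
                   random_state=42).fit_transform(X),
    't-SNE': TSNE(n_components=2, perplexity=30,
                  random_state=42).fit_transform(X),
}

# Plot the three embeddings next to each other, colored by species
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, emb) in zip(axes, embeddings.items()):
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap='viridis', edgecolor='k')
    ax.set_title(name)
plt.suptitle('PCA vs. ICA vs. t-SNE on the Iris dataset')
plt.show()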

# Import required libraries


from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn import datasets
import matplotlib.pyplot as plt

# Load Iris dataset


iris = datasets.load_iris()
# Apply PCA to reduce dimensions to 2
pca = PCA(n_components=2).fit_transform(iris.data)

# Apply K-means with 3 clusters, matching the number of iris species
km = KMeans(n_clusters=3)
km.fit(pca)

# Plot the PCA and K-means results
plt.scatter(pca[:, 0], pca[:, 1], c=km.labels_)
plt.title('Clustering Iris Dataset using K-means and PCA')
plt.show()
