Unsupervised ML
# Imports used in this snippet
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Set up subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()
# Load the Iris dataset and extract the petal length and petal width features
iris = load_iris()
X = iris.data[:, 2:]
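The labels used in the scatter plot below come from a DBSCAN fit that is not shown in this excerpt; a minimal sketch, assuming illustrative parameter values (eps=0.5, min_samples=5), not necessarily those used originally:
from sklearn.cluster import DBSCAN
# Cluster the two petal features with DBSCAN; eps and min_samples are assumed values
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)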
# Visualize the DBSCAN clusters using only the petal length and width features
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.show()
Principal Component Analysis (PCA) transforms the data so that the original features are replaced with new ones, called principal components.
● These components are linear combinations of the original features.
● They are constructed to be uncorrelated with each other (orthogonal).
● They are ordered by the amount of variance (information) they capture from the data, from most to least (see the sketch after this list).
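A minimal sketch of these three properties on the Iris data, using scikit-learn's PCA (the orthogonality check via a dot product is an illustrative addition, not part of the original material):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X)
# Each row of components_ holds the weights of one principal component,
# i.e. a linear combination of the four original features
print(pca.components_)
# The dot product of the two components is ~0, confirming orthogonality
print(np.dot(pca.components_[0], pca.components_[1]))
# Ratios are sorted from the most to the least variance explained
print(pca.explained_variance_ratio_)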
The Iris dataset has 150 samples with 4 features: sepal length, sepal width, petal
length, and petal width. The target labels (y) represent the species: Setosa (0),
Versicolor (1), and Virginica (2).
iris = load_iris()
X = iris.data
y = iris.target
The code artificially scales up the sepal length feature by multiplying it by 100. This step
is useful to see how PCA handles features with vastly different scales.
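The scaling step itself is not reproduced in this excerpt; a minimal sketch of the idea, assuming sepal length is column 0 of X:
# Artificially inflate the sepal length feature (column 0) by a factor of 100
X[:, 0] = X[:, 0] * 100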
Since PCA is sensitive to the scale of the data, standardizing the dataset ensures that each feature has a mean of 0 and a standard deviation of 1.
X = StandardScaler().fit_transform(X)
After scaling, the first 5 samples look like this (approximate values):
PCA reduces the dataset to 2 principal components, capturing the most variance in the
data.
pca = PCA(n_components=2)
X_r1 = pca.fit_transform(X)
After applying PCA, the data is reduced to two dimensions (principal components). The first
5 samples of the transformed data (X_r1) look like this:
[[-2.66 0.48]
[-2.54 -0.66]
[-2.74 -0.31]
[-2.76 -0.56]
[-2.57 0.79]]
PCA provides a measure of how much variance each component explains. This is useful to
understand how much information is retained in each principal component.
print('explained variance ratio:', pca.explained_variance_ratio_)
explained variance ratio: [0.72, 0.23]
This means that the first principal component explains 72% of the variance in the dataset,
and the second explains 23%, totaling 95%.
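The running total can be computed by accumulating the ratios; a small sketch reusing the fitted pca object from above:
import numpy as np
# Cumulative explained variance: how much information the first k components retain
print(np.cumsum(pca.explained_variance_ratio_))  # roughly [0.72, 0.95] for the ratios above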
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import numpy as np
np.set_printoptions(suppress=True)
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
# Standardize the features so each has zero mean and unit variance
X = StandardScaler().fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_r1 = pca.fit_transform(X)
print(X_r1)
# Percentage of variance explained for each component
print('explained variance ratio : %s' % str(pca.explained_variance_ratio_))
# Plot PCA
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
# Scatter each species in its own color, labelled with its name
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r1[y == i, 0], X_r1[y == i, 1], color=color, lw=lw, label=target_name)
plt.legend()
plt.show()
# TODO: create a Pandas DataFrame from the Iris dataset, and specify features and target
x = iris.data
y = iris.target
# TODO: Standardize the Iris dataset
x = StandardScaler().fit_transform(x)
# TODO: Perform PCA on standardized data
x = PCA(n_components=2).fit_transform(x)
# TODO: Make a DataFrame of the transformed data (principal components)
x = pd.DataFrame(data=x, columns=['pc1', 'pc2'])
# TODO: Plot the transformed data using Matplotlib's scatter plot
plt.scatter(x['pc1'], x['pc2'], c=iris.target)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import FastICA
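The scatter plot below uses X_transformed, which is produced by a FastICA fit not shown in this excerpt; a minimal sketch, assuming two components (to match the two plotted axes) and an arbitrary random_state:
# Load the Iris data and project it onto two independent components
# (n_components=2 and random_state=0 are assumed, illustrative values)
iris = datasets.load_iris()
X = iris.data
X_transformed = FastICA(n_components=2, random_state=0).fit_transform(X)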
Coloring the points with iris.target works here because X_transformed and the target array both have 150 entries, so each point is mapped to its corresponding class.
plt.scatter(X_transformed[:, 0], X_transformed[:, 1], c=iris.target, edgecolor='k')
plt.show()
t-SNE
This advanced technique visualizes high-dimensional data by minimizing the divergence between two distributions: one modeling pairwise similarities of the points in the original high-dimensional space and one modeling them in the corresponding low-dimensional embedding.
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42)
x = tsne.fit_transform(X)
plt.scatter(x[:, 0], x[:, 1], c=y)
plt.show()
K-Means vs. DBSCAN
K-Means and DBSCAN, as clustering techniques, can be compared along several criteria:
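A minimal sketch contrasting the two algorithms on the Iris petal features (n_clusters=3, eps=0.5, and min_samples=5 are illustrative assumptions):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import load_iris
X = load_iris().data[:, 2:]  # petal length and petal width
# K-Means needs the number of clusters up front and assigns every point to a cluster
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
# DBSCAN infers the number of clusters from density and labels outliers as -1
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis')
ax1.set_title('K-Means (k=3)')
ax2.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='viridis')
ax2.set_title('DBSCAN (eps=0.5, min_samples=5)')
plt.show()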