
Artificial Intelligence

Detecting Patterns with Unsupervised Learning

Unsupervised Learning

• Building machine learning models without using labeled training data


• Applications: market segmentation, stock markets, natural language processing,
and computer vision
• A large quantity of data exists without labels and needs to be categorized in some way
• This is the perfect use case for unsupervised learning
• Unsupervised learning algorithms attempt to classify data into subgroups within a given dataset using some similarity metric
• When we have a dataset without any labels, we assume that the data is generated by latent variables that govern the distribution in some way
• The process of learning can then proceed in a hierarchical manner, starting from the individual data points
• We can build deeper levels of representation of the data by finding natural clusters of similar points, and obtain signal and insight by classifying and segmenting the data
• Let's see some of the ways in which data can be classified using unsupervised
learning.
Clustering Data with the K-Means Algorithm

• One of the most popular unsupervised learning techniques; it analyzes data and finds clusters (subgroups) using a similarity measure such as the Euclidean distance
• The similarity measure can be used to estimate the tightness of a cluster
• Clustering is the process of organizing data into subgroups whose elements are similar to each other
• The goal of the algorithm is to identify the intrinsic properties of data points that make them belong to the
same subgroup
• There is no universal similarity metric that works in all cases
• For example, we might be interested in finding the representative data point for each subgroup, or we
might be interested in finding the outliers in the data
• Depending on the situation, different metrics might be more appropriate than others
• The K-Means algorithm is a well-known algorithm for clustering data.
• The data is segmented into K subgroups using various data attributes.
• The number of clusters is fixed, and the data is classified based on that number.
• The main idea here is that we need to update the locations of the centroids with each iteration.
• A centroid is the location representing the center of the cluster.
• We continue iterating until we have placed the centroids at their optimal locations.
• We can see that the initial placement of centroids plays an important role in the algorithm.
• These centroids should be placed in a clever manner, because this directly impacts the results.
• A good strategy is to place them as far away from each other as possible.
Clustering Data with the K-Means Algorithm

• The basic K-Means algorithm places these centroids randomly, whereas K-Means++ chooses these points algorithmically from the input data points.
• It tries to place the initial centroids far from each other so that they converge quickly.
• We then go through the training dataset and assign each data point to the closest centroid.
• Once we go through the entire dataset, the first iteration is over. The points have
been grouped based on the initialized centroids.
• The location of the centroids is recalculated based on the new clusters that were obtained at the end of the
first iteration.
• Once a new set of K centroids is obtained, the process is repeated.
• We iterate through the dataset and assign each point to the closest centroid.
• As the steps keep on getting repeated, the centroids keep moving to their
equilibrium position.
• After a certain number of iterations, the centroids do not change their locations anymore.
• The centroids converge to a final location.
• These K centroids are the values that will be used for inference.
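• The assignment-and-update loop described above can be sketched in a few lines of NumPy. This is an illustrative sketch only (the next slide uses scikit-learn's KMeans instead); the function name kmeans_iteration is made up for this slide, and it assumes no cluster ends up empty.

import numpy as np

def kmeans_iteration(X, centroids):
    # Assign each data point to its closest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Recompute each centroid as the mean of the points assigned to it
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    return new_centroids, labels

• Repeating this step until the centroids stop moving gives the converged centroids used for inference.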
Application of K-Means Clustering on Two-Dimensional Data

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import metrics
X = np.loadtxt('data_clustering.txt', delimiter=',') # Load input data
num_clusters = 5
plt.figure() # Plot input data
plt.scatter(X[:,0], X[:,1], marker='o', facecolors='none', edgecolors='black', s=80)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
plt.title('Input data')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
kmeans = KMeans(init='k-means++', n_clusters=num_clusters, n_init=10)
kmeans.fit(X) # Train the KMeans clustering model
step_size = 0.01 # Step size of the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
x_vals, y_vals = np.meshgrid(np.arange(x_min, x_max, step_size),np.arange(y_min, y_max, step_size))
output = kmeans.predict(np.c_[x_vals.ravel(), y_vals.ravel()])
Plot All Output Values and Color Each Region

output = output.reshape(x_vals.shape)
plt.figure()
plt.clf()
plt.imshow(output, interpolation='nearest',extent=(x_vals.min(), x_vals.max(),y_vals.min(), y_vals.max()),
cmap=plt.cm.Paired, aspect='auto', origin='lower')
plt.scatter(X[:,0], X[:,1], marker='o', facecolors='none',edgecolors='black', s=80)
cluster_centers = kmeans.cluster_centers_
plt.scatter(cluster_centers[:,0], cluster_centers[:,1],
marker='o', s=210, linewidths=4, color='black', zorder=12, facecolors='black')
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
plt.title('Boundaries of clusters')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

Figures: visualization of the input data and the K-Means cluster boundaries


Estimating the Number of Clusters with the Mean Shift Algorithm

• Mean Shift is a powerful nonparametric algorithm used in unsupervised learning for clustering because
it does not make any assumptions about the underlying distributions
• Mean Shift finds a lot of applications in fields such as object tracking and real-time data analysis
• In the Mean Shift algorithm, the whole feature space is considered as a probability
density function.
• We start with the training dataset and assume that it has been sampled from a probability density
function
• In this framework, the clusters correspond to the local maxima of the underlying distribution.
• If there are K clusters, then there are K peaks in the underlying data distribution and Mean Shift will
identify those peaks
• The goal of Mean Shift is to identify the location of centroids
• For each data point in the training dataset, it defines a window around it
• It then computes the centroid for this window and updates the location to this new centroid
• It then repeats the process for this new location by defining a window around it
• As we keep doing this, we move closer to the peak of the cluster
• Each data point will move towards the cluster it belongs to
• The movement is towards a region of higher density
• The centroids (also called means) keep on getting shifted towards the peaks of each cluster
• The algorithm gets its name from the fact that the means keep getting shifted
• The shift continues to happen until the algorithm converges, at which stage the centroids don't move
anymore.
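• As a rough illustration of the window update just described, here is a minimal sketch of a single Mean Shift step with a flat (window) kernel. It is a simplified sketch under those assumptions, not the scikit-learn implementation used on the next slide, and mean_shift_step is a made-up name.

import numpy as np

def mean_shift_step(location, X, bandwidth):
    # Collect the data points that fall inside the window around the current location
    in_window = X[np.linalg.norm(X - location, axis=1) < bandwidth]
    # Shift the location to the mean (centroid) of those points
    return in_window.mean(axis=0)

• Repeating this step for each starting point moves it towards a density peak; points that converge to the same peak form one cluster.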
Estimating the Number of Clusters with the Mean Shift Algorithm

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift, estimate_bandwidth
from itertools import cycle
X = np.loadtxt('data_clustering.txt', delimiter=',') # Load data from input file
bandwidth_X = estimate_bandwidth(X, quantile=0.1, n_samples=len(X)) # Estimate the bandwidth of X
meanshift_model = MeanShift(bandwidth=bandwidth_X, bin_seeding=True) # Cluster data with MeanShift
meanshift_model.fit(X)
cluster_centers = meanshift_model.cluster_centers_
print('\nCenters of clusters:\n', cluster_centers)
labels = meanshift_model.labels_ # Estimate the number of clusters
num_clusters = len(np.unique(labels))
print("\nNumber of clusters in input data =", num_clusters)
plt.figure() # Plot the points and cluster centers
markers = 'o*xvs'
for i, marker in zip(range(num_clusters), markers):
    # Plot points belonging to the current cluster
    plt.scatter(X[labels==i, 0], X[labels==i, 1], marker=marker, color='black')
    # Plot the cluster center
    cluster_center = cluster_centers[i]
    plt.plot(cluster_center[0], cluster_center[1], marker='o', markerfacecolor='black',
             markeredgecolor='black', markersize=15)
plt.title('Clusters')
plt.show()
Estimating the Number of Clusters with the Mean Shift Algorithm
Estimating the Quality of Clustering with Silhouette Scores
• If data is naturally organized into several distinct clusters, then it is easy to visually examine it and draw some
inferences
• This is rarely the case in the real world, unfortunately
• Data in the real world is huge and messy. So, we need a way to quantify the quality of the clustering
• Silhouette refers to a method used to check the consistency of clusters in data
• It gives an estimate of how well each data point fits with its cluster
• The silhouette score is a metric that measures the similarity of a data point to its own cluster, as compared to other
clusters
• The silhouette score works with any similarity metric
• For each data point, the silhouette score is computed using the following formula:
silhouette score = (p – q) / max(p, q)
• Here, p is the mean distance to the points in the nearest cluster that the data point is not a part of, and q is the mean
intra-cluster distance to all the points in its own cluster
• The value of the silhouette score lies between -1 and 1
• A score closer to 1 indicates that the data point is very similar to the other data points in its cluster, whereas a score closer to -1 indicates that it is not similar to the other data points in its cluster
• One way to think about it is if there are too many points with negative silhouette scores, then there may be
too few or too many clusters in the data
• We need to run the clustering algorithm again to find the optimal number of clusters
• Ideally, we want to have a high positive value
• Depending on the business problem, we do not always need to optimize for the highest possible value, but in general, a silhouette score close to 1 indicates that the data clustered nicely
• If the scores are close to -1, it indicates that the variable that we are using to classify is noisy and does not
contain much of a signal
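• As a small worked example of the formula above (the distances here are made up purely for illustration):

p = 4.0   # mean distance to the points of the nearest cluster this point does not belong to
q = 1.5   # mean distance to the points of this point's own cluster
score = (p - q) / max(p, q)
print(score)  # 0.625 -> the point fits its own cluster reasonably well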
Estimating the Quality of Clustering with Silhouette Scores
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import KMeans
X = np.loadtxt('data_quality.txt', delimiter=',') # Load data from input file
scores = [] # Initialize variables
values = np.arange(2, 10)
for num_clusters in values: # Iterate through the defined range
    kmeans = KMeans(init='k-means++', n_clusters=num_clusters, n_init=10)
    kmeans.fit(X)
    score = metrics.silhouette_score(X, kmeans.labels_, metric='euclidean', sample_size=len(X))
    print("\nNumber of clusters =", num_clusters)
    print("Silhouette score =", score)
    scores.append(score)
plt.figure()
plt.bar(values, scores, width=0.7, color='black', align='center')
plt.title('Silhouette score vs number of clusters')
num_clusters = np.argmax(scores) + values[0] # Extract best score and optimal number of clusters
print('\nOptimal number of clusters =', num_clusters)
plt.figure()
plt.scatter(X[:,0], X[:,1], color='black', s=80, marker='o', facecolors='none')
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
plt.title('Input data')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
Estimating the Quality of Clustering with Silhouette Scores
Gaussian Mixture Models
• A Mixture Model is a type of probability density model where it is assumed that the data is governed by several
component distributions.
• If these distributions are Gaussian, then the model becomes a Gaussian Mixture Model
• These component distributions are combined in order to provide a multi-modal density function, which becomes a
mixture model
– We want to model the shopping habits of all the people in South America.
– One way to do it would be to model the whole continent and fit everything into a single model, but
people in different countries shop differently
– We therefore need to understand how people in individual countries shop and how they behave
– To get a good representative model, we need to account for all the variations within the continent
• In this case, we can use mixture models to model the shopping habits of individual countries and then combine all of
them into a Mixture Model
• This way, nuances in the underlying behavior of individual countries are not missed; by not enforcing a single model on all of the countries, a more accurate model is created
• An interesting point to note is that mixture models are semi-parametric, which means that they are partially
dependent on a set of predefined functions.
• They can provide greater precision and flexibility in modeling the underlying distributions of the data.
• They can smooth the gaps that result from having sparse data
• Once the function is defined, the mixture model goes from being semi-parametric to parametric.
• Hence a GMM is a parametric model represented as a weighted summation of component Gaussian functions.
• We assume that the data is being generated by a set of Gaussian models that are combined in some way
• GMMs are very powerful and are used in many fields.
• The parameters of the GMM are estimated from training data using algorithms like Expectation–Maximization
(EM) or Maximum A-Posteriori (MAP) estimation
• Applications include image database retrieval, modeling stock market fluctuations, biometric verification
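• To make the "weighted summation of component Gaussians" concrete, here is a minimal one-dimensional sketch. The weights, means, and standard deviations are made-up illustrative values, not estimates from data, and SciPy is assumed to be available.

import numpy as np
from scipy.stats import norm

weights = np.array([0.3, 0.7])   # mixing weights, must sum to 1
means = np.array([-2.0, 3.0])
stds = np.array([0.8, 1.5])

def gmm_density(x):
    # p(x) = sum_k w_k * N(x | mu_k, sigma_k^2)
    return sum(w * norm.pdf(x, loc=m, scale=s) for w, m, s in zip(weights, means, stds))

print(gmm_density(0.0))

• In practice the weights, means, and covariances are not fixed by hand but estimated from training data (for example with EM), as the next slide does with scikit-learn's GaussianMixture.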
Building a Classifier Based on Gaussian Mixture Models
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import patches
from sklearn import datasets
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
iris = datasets.load_iris() # Load the iris dataset
X, y = datasets.load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5) # Stratified splitter (set up here but not used below)
skf.get_n_splits(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0) # 60/40 split
num_classes = len(np.unique(y_train))
# Build the GMM classifier; initialize each component's mean with the per-class mean
# of the training data (means_init is used so fit() takes these initial means into account)
classifier = GaussianMixture(n_components=num_classes, covariance_type='full',
    init_params='kmeans', max_iter=20,
    means_init=np.array([X_train[y_train == i].mean(axis=0) for i in range(num_classes)]))
classifier.fit(X_train)
plt.figure()
colors = 'bgr'
for i, color in enumerate(colors):
    # Eigen-decomposition of the 2x2 covariance sub-matrix of the first two features
    eigenvalues, eigenvectors = np.linalg.eigh(classifier.covariances_[i][:2, :2])
    # Compute the orientation of the ellipse
    norm_vec = eigenvectors[0] / np.linalg.norm(eigenvectors[0])
    angle = np.arctan2(norm_vec[1], norm_vec[0])
    angle = 180 * angle / np.pi
    # Scale the eigenvalues so the ellipse is visible on the plot
    scaling_factor = 8
    eigenvalues *= scaling_factor
Building a Classifier Based on Gaussian Mixture Models (continued)

    # Draw the ellipse for this component (angle is passed as a keyword argument)
    ellipse = patches.Ellipse(classifier.means_[i, :2], eigenvalues[0], eigenvalues[1],
        angle=180 + angle, color=color)
    axis_handle = plt.subplot(1, 1, 1)
    ellipse.set_clip_box(axis_handle.bbox)
    ellipse.set_alpha(0.6)
    axis_handle.add_artist(ellipse)

colors = 'bgr' # Plot the data
for i, color in enumerate(colors):
    cur_data = iris.data[iris.target == i]
    plt.scatter(cur_data[:,0], cur_data[:,1], marker='o', facecolors='none',
        edgecolors='black', s=40, label=iris.target_names[i])
    test_data = X_test[y_test == i]
    plt.scatter(test_data[:,0], test_data[:,1], marker='s',
        facecolors='black', edgecolors='black', s=40,
        label=iris.target_names[i])
y_train_pred = classifier.predict(X_train)
accuracy_training = np.mean(y_train_pred.ravel() == y_train.ravel()) *100
print('Accuracy on training data =', accuracy_training)
y_test_pred = classifier.predict(X_test)
accuracy_testing = np.mean(y_test_pred.ravel() == y_test.ravel()) *100
print('Accuracy on testing data =', accuracy_testing)
plt.title('GMM classifier')
plt.xticks(())
plt.yticks(())
plt.show()
Building a Classifier Based on Gaussian Mixture Models (Results)

• Accuracy on training data = 87.5


• Accuracy on testing data = 86.6666666667
Finding Subgroups in the Stock Market Using Affinity Propagation

• Affinity Propagation is a clustering algorithm that doesn't require the number of clusters to be specified beforehand
• Because of its generic nature and simplicity of implementation, it has found a lot of applications in many
fields.
• It finds out representative clusters, called exemplars, using message passing
• It starts by specifying the measures of similarity that need to be considered
• It simultaneously considers all training data points as potential exemplars
• It then passes messages between the data points until it finds a set of exemplars
• The message passing happens in two alternate steps, called responsibility and
availability.
• Responsibility refers to the message sent from members of the cluster to candidate exemplars, indicating
how well suited the data point would be as a member of this exemplar's cluster
• Availability refers to the message sent from candidate exemplars to potential members of the cluster,
indicating how well suited it would be as an exemplar
• It keeps doing this until the algorithm converges on an optimal set of exemplars
• There is also a parameter called preference, which controls the number of exemplars
that will be found
• If a high value is chosen, it will cause the algorithm to find too many clusters
• If a low value is chosen, it will lead to a small number of clusters
• An optimal value would be the median similarity between the points.
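• A minimal sketch of how the preference parameter can be passed to scikit-learn's AffinityPropagation, using toy random data rather than the stock example that follows (the data and the value -10 are made up for illustration):

import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.random.RandomState(0).randn(60, 2)
# A more negative preference yields fewer exemplars; if left unset,
# scikit-learn uses the median of the pairwise similarities
model = AffinityPropagation(preference=-10, random_state=0).fit(X)
print('Number of clusters found:', len(model.cluster_centers_indices_))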
Finding Subgroups in the Stock Market Using Affinity Propagation
import datetime
import json
import numpy as np
import matplotlib.pyplot as plt
from sklearn import covariance, cluster
import yfinance as yf
input_file = 'company_symbol_mapping.json'
with open(input_file, 'r') as f: company_symbols_map = json.loads(f.read())
symbols, names = np.array(list(company_symbols_map.items())).T
start_date = datetime.datetime(2019, 1, 1) # Load the historical stock quotes
end_date = datetime.datetime(2019, 1, 31)
quotes = [yf.Ticker(symbol).history(start=start_date, end=end_date) for symbol in symbols]
opening_quotes = np.array([quote.Open for quote in quotes]).astype(float) # np.float is removed in recent NumPy
closing_quotes = np.array([quote.Close for quote in quotes]).astype(float)
quotes_diff = closing_quotes - opening_quotes
X = quotes_diff.copy().T
X /= X.std(axis=0)
edge_model = covariance.GraphicalLassoCV() # GraphLassoCV was renamed to GraphicalLassoCV in newer scikit-learn
with np.errstate(invalid='ignore'):
    edge_model.fit(X)
_, labels = cluster.affinity_propagation(edge_model.covariance_)
num_labels = labels.max()
print('\nClustering of stocks based on difference in opening and closing quotes:\n')
for i in range(num_labels + 1):
    print("Cluster", i+1, "==>", ', '.join(names[labels == i]))
Finding Subgroups in the Stock Market Using Affinity Propagation (Results)
Segmenting the Market Based on Shopping Patterns

import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift, estimate_bandwidth
input_file = 'sales.csv'
file_reader = csv.reader(open(input_file, 'r'), delimiter=',')
X = []
for count, row in enumerate(file_reader):
    if not count:
        names = row[1:]
        continue
    X.append([float(x) for x in row[1:]])
X = np.array(X) # Convert to numpy array
bandwidth = estimate_bandwidth(X, quantile=0.8, n_samples=len(X))
meanshift_model = MeanShift(bandwidth=bandwidth, bin_seeding=True)
meanshift_model.fit(X)
labels = meanshift_model.labels_
cluster_centers = meanshift_model.cluster_centers_
num_clusters = len(np.unique(labels))
print("\nNumber of clusters in input data =", num_clusters)
print("\nCenters of clusters:")
print('\t'.join([name[:3] for name in names]))
for cluster_center in cluster_centers:
    print('\t'.join([str(int(x)) for x in cluster_center]))
Segmenting the Market Based on Shopping Patterns

cluster_centers_2d = cluster_centers[:, 1:3] # Extract two features of the cluster centers for 2D visualization
plt.figure() # Plot the cluster centers
plt.scatter(cluster_centers_2d[:,0], cluster_centers_2d[:,1],s=120, edgecolors='black', facecolors='none')
offset = 0.25
plt.xlim(cluster_centers_2d[:,0].min() - offset * cluster_centers_2d[:,0].ptp(),
         cluster_centers_2d[:,0].max() + offset * cluster_centers_2d[:,0].ptp())
plt.ylim(cluster_centers_2d[:,1].min() - offset * cluster_centers_2d[:,1].ptp(),
         cluster_centers_2d[:,1].max() + offset * cluster_centers_2d[:,1].ptp())
plt.title('Centers of 2D clusters')
plt.show()
