DSBA Master Codebook - Unsupervised Learning
[email protected]
18XHT46RCY
Codebook
Data Science is the art and science of solving real-world problems and making data-driven decisions. It involves an amalgamation of three aspects, and a good data scientist has expertise in all three of them.
A lack of coding expertise should not become an impediment on your Data Science journey. With consistent effort, you can become fairly proficient in coding over time. This Codebook is intended to help you become comfortable with the finer nuances of Python, and it can serve as a handy reference for anything related to data science code throughout the program journey and beyond.
Please keep in mind that there is no one right way to write code to achieve an intended outcome. There can be multiple ways of doing things in Python; the examples presented in this document use just one of the possible approaches. Please explore different ways of performing the same task on your own.
[email protected]
18XHT46RCY
1
This file is meant for personal use by [email protected] only.
Sharing or publishing the contents in part or full is liable for legal action.
Contents
PREFACE
Clustering
Partition Clustering: K-Means
Hierarchical Clustering: Agglomerative
Dimensionality Reduction Techniques
[email protected]
18XHT46RCY
Table of Figures
Figure 1: A Dendrogram
Unsupervised Learning
Clustering
Partition Clustering: K-Means
The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as
the inertia or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to be specified. It scales
well to a large number of samples and has been used across a large range of application areas in many different fields.
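The inertia referred to above is the within-cluster sum of squared distances between each sample and the centroid of its cluster:

\[ \sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right) \]

where the \mu_j are the centroids of the clusters C.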
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
kmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])
Source: scikit-learn
Hierarchical Clustering: Agglomerative
Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them
successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers
all the samples, the leaves being the clusters with only one sample.
The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its
own cluster, and clusters are successively merged together. The linkage criteria determine the metric used for the merge strategy.
from sklearn.cluster import AgglomerativeClustering
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
clustering = AgglomerativeClustering().fit(X)
clustering.labels_
array([1, 1, 1, 0, 0, 0])
Source: scikit-learn
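As a quick illustration of the linkage criterion, here is a minimal sketch using 'complete' linkage on the same data; scikit-learn also supports 'ward' (the default), 'average', and 'single'.
from sklearn.cluster import AgglomerativeClustering
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
# 'complete' linkage merges the two clusters with the smallest maximum pairwise distance
clustering = AgglomerativeClustering(n_clusters=2, linkage='complete').fit(X)
clustering.labels_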
Dendrogram
The hierarchical clustering can be plotted as a dendrogram. The dendrogram illustrates how each cluster is composed by drawing a U-shaped link between a non-singleton cluster and its children. The top of the U-link indicates a cluster merge, and the two legs of the U-link indicate which clusters were merged. The length of the two legs of the U-link represents the distance between the child clusters; it is also the cophenetic distance between original observations in the two children clusters.
from scipy.cluster import hierarchy
import numpy as np
import matplotlib.pyplot as plt
# condensed distance matrix between observations (values from the scipy example)
ytdist = np.array([662., 877., 255., 412., 996., 295., 468., 268., 400.,
                   754., 564., 138., 219., 869., 669.])
Z = hierarchy.linkage(ytdist, 'single')
plt.figure()
dn = hierarchy.dendrogram(Z)
[email protected]
18XHT46RCY
Figure 1: A Dendrogram
Source: scipy
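Since the description above mentions the cophenetic distance, here is a minimal sketch computing it from the same linkage matrix with scipy's cophenet function (Z and ytdist as defined above):
from scipy.cluster.hierarchy import cophenet
# c is the cophenetic correlation coefficient; coph_dists holds the
# cophenetic distances between the original observations in condensed form
c, coph_dists = cophenet(Z, ytdist)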
Dimensionality Reduction Techniques
Principal Component Analysis (PCA)
PCA performs linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower-dimensional space. The input data is centered but not scaled for each feature before applying the SVD.
Method 1: (using the scikit-learn library)
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
PCA(n_components=2)
print(pca.explained_variance_ratio_)
[0.99244289 0.00755711]
print(pca.singular_values_)
[email protected]
[6.30061232 0.54980396]
18XHT46RCY
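The explained_variance_ratio_ output above is often used to decide how many components to retain. scikit-learn can also do this automatically: passing a float between 0 and 1 as n_components keeps just enough components to explain that fraction of the variance. A minimal sketch on the same X:
from sklearn.decomposition import PCA
# a float n_components asks for enough components to explain 95% of the
# variance; this option requires the 'full' SVD solver
pca = PCA(n_components=0.95, svd_solver='full')
pca.fit(X)
pca.n_components_   # 1 here, since the first component already explains ~99%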
Method 2: (using the statsmodels library)
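The code for this method does not appear in this section; a minimal sketch of the same decomposition, assuming statsmodels' statsmodels.multivariate.pca.PCA class, could look like this:
import numpy as np
from statsmodels.multivariate.pca import PCA
X = np.array([[-1., -1.], [-2., -1.], [-3., -2.], [1., 1.], [2., 1.], [3., 2.]])
# demean-only preprocessing, assumed here to mirror scikit-learn's centering behaviour
pc = PCA(X, ncomp=2, standardize=False, demean=True)
pc.factors      # projected scores of the observations on the components
pc.eigenvals    # eigenvalues, proportional to the variance explained by each component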
Linear Discriminant Analysis (LDA)
LDA can be used to perform supervised dimensionality reduction, by projecting the input data to a linear subspace consisting of the
directions which maximize the separation between classes. The dimension of the output is necessarily less than the number of
classes, so this is, in general, a rather strong dimensionality reduction, and only makes sense in a multiclass setting.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X = np.array([[-1, -1, 2], [-2, -1, -1], [-3, -2, -3],
              [1, 1, -2], [2, 1, -3], [3, 2, -2]])
y = np.array([1, 1, 1, 2, 2, 2])
lda = LinearDiscriminantAnalysis(n_components=1)
reduced = lda.fit_transform(X, y)
reduced
array([[-3.98646358],
[-2.84747399],
[-3.70171618],
[ 2.27797919],
[ 3.41696878],
[ 4.84070578]])
Source: scikit-learn
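Because LinearDiscriminantAnalysis is also a classifier, the same fitted object can label new observations directly; a brief sketch continuing the example above:
# predict the class of a new observation; a point close to the first
# group's region should come back labelled 1
lda.predict([[-1, -1, 1]])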
[email protected]
18XHT46RCY
6
This file is meant for personal use by [email protected] only.
Sharing or publishing the contents in part or full is liable for legal action.