3. Unsupervised Learning
This kind of analysis, which lets us discover and consolidate patterns, is called unsupervised because
we do not know in advance what groups exist in the data or which group any individual
observation belongs to. In this case, we say that the data is unlabelled. The most common unsupervised
learning method is clustering, which discovers patterns by grouping similar samples.
K-Means is one of the most widely used clustering algorithms. Starting from an initial set of k centres, it alternates between two steps:
1. for each centre we identify the subset of training points (its cluster) that is closer to it than to
any other centre;
2. the mean of each feature over the data points in each cluster is computed, and the
corresponding vector of means becomes the new centre for that cluster.
These two steps are iterated until the centres no longer move or the assignments no longer
change. Then, a new point x can be assigned to the cluster of the closest prototype.
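To make these two steps concrete, here is a minimal NumPy sketch of a single K-Means iteration. It is not part of the original module: it assumes X is an n-by-d data matrix, centres is a k-by-d array of current centres, and that no cluster ever ends up empty.
PYTHON
import numpy as np

def kmeans_step(X, centres):
    # Step 1: assign each point to its closest centre.
    # distances is an (n, k) matrix of point-to-centre distances.
    distances = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    assignment = distances.argmin(axis=1)
    # Step 2: move each centre to the mean of the points assigned to it
    # (assumes every cluster keeps at least one point).
    new_centres = np.array([X[assignment == k].mean(axis=0)
                            for k in range(len(centres))])
    return new_centres, assignment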
PYTHON
from sklearn.cluster import KMeans

# Apply k-means with 2 clusters using a subset of the features
# (mean_spent and max_spent)
Xsub = X[:, 1:3]
n_clusters = 2
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(Xsub)  # (1)
# use the fitted model to predict what the cluster of each customer should be
cluster_assignment = kmeans.predict(Xsub)  # (2)
cluster_assignment
1. The method fit runs the K-Means algorithm on the data that we pass to it.
2. The method predict returns a cluster label for each sample in the data.
PYTHON
from plotly.offline import iplot
from plotly.graph_objs import Scatter, Box, Histogram, Layout, Figure

# Visualise the clusters using a scatter plot
# (column 0 of Xsub is mean_spent, column 1 is max_spent)
data = [
    Scatter(
        x = Xsub[cluster_assignment == i, 0],
        y = Xsub[cluster_assignment == i, 1],
        mode = 'markers',
        name = 'cluster ' + str(i)
    ) for i in range(n_clusters)
]
layout = Layout(
    xaxis = dict(title = 'mean_spent'),
    yaxis = dict(title = 'max_spent'),
    height = 600,
)
fig = Figure(data = data, layout = layout)
iplot(fig)
The separation between the two clusters is clean: they can be separated with a straight line.
One cluster contains customers with low spending and the other customers with high spending.
PYTHON
# Apply k-means with 2 clusters using all the features
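# (The body of this block was missing from the captured page; a minimal
#  reconstruction assuming the same sklearn API as above and that X
#  holds all the features.)
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(X)
cluster_assignment = kmeans.predict(X)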
PYTHON
# Adapt the visualisation code accordingly
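# (A sketch of one possible adaptation, not the original solution:
#  re-plot mean_spent (column 1 of X) against max_spent (column 2),
#  coloured by the clusters found using all the features.)
data = [
    Scatter(
        x = X[cluster_assignment == i, 1],
        y = X[cluster_assignment == i, 2],
        mode = 'markers',
        name = 'cluster ' + str(i)
    ) for i in range(n_clusters)
]
layout = Layout(
    xaxis = dict(title = 'mean_spent'),
    yaxis = dict(title = 'max_spent'),
    height = 600,
)
fig = Figure(data = data, layout = layout)
iplot(fig)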
The result is now different. The first cluster contains customers whose maximum spending is
close to their mean spending, and the second contains customers whose maximum spending is
far from their mean spending. In this way we can tell apart customers who might be willing to
buy items that cost more than their average spending.
PYTHON
# Compare expenditure between clusters
feat = 1  # column index of 'mean_spent' in X
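# (The rest of this block was lost to the page break; a minimal
#  reconstruction that tabulates the per-cluster mean of every feature,
#  assuming pandas is available. Columns 1 and 2 of X are mean_spent
#  and max_spent; the remaining column names are not given in the text.)
import pandas as pd
df = pd.DataFrame(X)
df['cluster'] = cluster_assignment
df.groupby('cluster').mean()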
PYTHON
# Create a boxplot of the two clusters for 'mean_spent'
data = [
    Box(
        y = X[cluster_assignment == i, feat],
        name = 'cluster ' + str(i),
    ) for i in range(n_clusters)
]
layout = Layout(
    xaxis = dict(title = "Clusters"),
    yaxis = dict(title = "Value"),
    showlegend = False
)
fig = Figure(data = data, layout = layout)
iplot(fig)
PYTHON
# Compare mean expenditure with a histogram
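# (The body of this block was missing from the captured page; a minimal
#  reconstruction overlaying the per-cluster distributions of the
#  'mean_spent' feature with plotly's Histogram trace.)
data = [
    Histogram(
        x = X[cluster_assignment == i, feat],
        opacity = 0.75,
        name = 'cluster ' + str(i)
    ) for i in range(n_clusters)
]
layout = Layout(
    barmode = 'overlay',
    xaxis = dict(title = 'mean_spent'),
    yaxis = dict(title = 'count')
)
fig = Figure(data = data, layout = layout)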
iplot(fig)
Here we note how the distribution of mean expenditure differs between the two clusters.
PYTHON
# Compare the centroids
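# (Body missing from the captured page; the fitted sklearn model exposes
#  the learned centres as a k-by-d array, one row per cluster.)
kmeans.cluster_centers_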
We can see that the centres coincide with the means of each cluster in the table above.
PYTHON
# Compute the silhouette score
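# (Body missing from the captured page; a minimal sketch using sklearn's
#  silhouette_score, which averages the per-sample silhouette coefficient.
#  Values close to 1 indicate compact, well-separated clusters.)
from sklearn.metrics import silhouette_score
silhouette_score(X, cluster_assignment)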
Pros and cons of K-Means
Pros:
easy to understand
any unseen point can be assigned to the cluster with the closest mean to the point
Cons:
all the points are assigned to a cluster
clusters are affected by noise
Comparison of algorithms
The chart below shows the characteristics of different clustering algorithms implemented in
sklearn on simple 2D datasets.
Here we note that K-Means works well for globular clusters, but it does not produce good
results on clusters with ring or half-moon shapes. Linkage-based (agglomerative) clustering
and DBSCAN, by contrast, are able to deal with these kinds of cluster shapes.
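As a quick illustration of this point (not part of the original module), the sketch below clusters a noisy half-moon dataset with both K-Means and DBSCAN; the eps value is a hand-picked assumption.
PYTHON
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaved half-moon shaped clusters
X_moons, _ = make_moons(n_samples=300, noise=0.05)

# K-Means cuts the moons with a straight boundary...
km_labels = KMeans(n_clusters=2).fit_predict(X_moons)
# ...while density-based DBSCAN recovers each moon as one cluster
db_labels = DBSCAN(eps=0.3).fit_predict(X_moons)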
Wrap up of Module 3
Clustering is an unsupervised way to generate groups out of your data
Some clustering algorithms, like DBSCAN, have an embedded outlier detection mechanism
Silhouette score can be used to measure how compact the clusters are