Exercise#9 Instructions 2021
Clustering
In this exercise, we will do the following:
Explore a dataset
Visualize the clusters using matplotlib and seaborn
Build a clustering model using the K-means clustering algorithm
Note: You must complete step 10 and present the results to your professor before you leave
the lab to earn any grades for this lab.
Pre-requisites:
1- Install Anaconda
2- We will be using several public datasets. They are available at https://goo.gl/zjS4C6 under a
folder named "Datasets for Predictive Modelling with Python". The datasets are organized in the order of
the textbook chapters (Python: Advanced Predictive Analytics); the chapter 7 files are required for this exercise.
The code follows. Make sure you update the path to the folder where you placed the
files, and update the data frame name correctly:
import os
import pandas as pd
# Build the full path to the data file
path = "C:/A_COMP309/data/"
filename = 'wine.csv'
fullpath = os.path.join(path, filename)
# The file uses a semicolon as the field separator
data_viji_wine = pd.read_csv(fullpath, sep=';')
print(data_viji_wine)
# Show up to 15 columns when printing a data frame
pd.set_option('display.max_columns', 15)
# First five rows
print(data_viji_wine.head())
# Column names, dimensions, summary statistics and data types
print(data_viji_wine.columns.values)
print(data_viji_wine.shape)
print(data_viji_wine.describe())
print(data_viji_wine.dtypes)
# Number of samples for each quality score
print(data_viji_wine['quality'].value_counts())
# number_quality=data_viji_wine['quality'].value_counts()
# print("number of items ",number_quality)
# Distinct quality scores
print(data_viji_wine['quality'].unique())
# Mean of every attribute for each quality score
print(data_viji_wine.groupby('quality').mean())
Some observations:
The lower the volatile acidity and chlorides, the higher the wine quality
The higher the sulphates and citric acid content, the higher the wine quality
The density and pH do not vary much across the wine quality levels
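One way to check observations like these numerically is to look at how each attribute correlates with quality. The sketch below is illustrative only: the helper name quality_correlations and the small toy frame are made up for this example (the values are not taken from wine.csv). In the lab you would pass your own data frame, e.g. data_viji_wine, instead.

```python
import pandas as pd

def quality_correlations(frame, target='quality'):
    """Return the Pearson correlation of every other column with `target`, sorted ascending."""
    corr = frame.corr()[target].drop(target)
    return corr.sort_values()

# Toy values for illustration only (NOT taken from wine.csv)
toy = pd.DataFrame({
    'volatile acidity': [0.9, 0.7, 0.6, 0.4, 0.3],
    'sulphates':        [0.4, 0.5, 0.6, 0.7, 0.8],
    'quality':          [3, 4, 5, 6, 7],
})

print(quality_correlations(toy))
```

Negative values mean the attribute tends to fall as quality rises; values near zero correspond to attributes that barely change across quality levels.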
3- Plot a histogram to see the number of wine samples in each quality type.
Following is the code, make sure you update the data frame name correctly:
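The code for this step is not reproduced in this handout. A minimal sketch, assuming a matplotlib bar chart of the per-quality counts is what is wanted (quality is a discrete score, so one bar per score acts as the histogram). The helper name plot_quality_counts and the toy frame are hypothetical; in the lab you would pass your own data frame.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_quality_counts(frame, column='quality'):
    """Draw one bar per quality score showing how many samples have that score."""
    counts = frame[column].value_counts().sort_index()
    fig, ax = plt.subplots(figsize=(8, 5))
    # Convert the scores to strings so each score gets its own evenly spaced bar
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_xlabel('Wine quality')
    ax.set_ylabel('Number of samples')
    ax.set_title('Wine samples per quality score')
    return counts, ax

# Toy values for illustration only (NOT taken from wine.csv)
toy = pd.DataFrame({'quality': [5, 5, 6, 6, 6, 7]})
counts, ax = plot_quality_counts(toy)
print(counts)
```

If the figure does not appear automatically in your environment, call plt.show() after the function.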
4- Use the seaborn library to generate different plots (histograms, pairplots, heatmaps, etc.) and
investigate the correlations.
Following are the code snippets, make sure you update the data frame name correctly:
# Use the seaborn library to generate different plots
import seaborn as sns
import matplotlib.pyplot as plt
# Histogram of quality with a density curve
sns.distplot(data_viji_wine['quality'])
# Plot only the density function (with a rug of the individual observations)
sns.distplot(data_viji_wine['quality'], rug=True, hist=False, color='g')
# Change the direction of the plot
sns.distplot(data_viji_wine['quality'], rug=True, hist=False, vertical=True)
# Check all pairwise correlations; this takes longer to execute
sns.pairplot(data_viji_wine)
# Subset three columns (x) and two columns (y)
x = data_viji_wine[['fixed acidity', 'chlorides', 'pH']]
y = data_viji_wine[['chlorides', 'pH']]
# Check the correlations within the three-column subset
sns.pairplot(x)
# Generate heatmaps
sns.heatmap(data_viji_wine[['fixed acidity']])
sns.heatmap(x)
sns.heatmap(x.corr())
sns.heatmap(x.corr(), annot=True)
plt.figure(figsize=(10, 9))
# linewidths is the heatmap parameter for the width of the lines between cells
sns.heatmap(x.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
# Line plot, two variables
plt.figure(figsize=(20, 9))
sns.lineplot(data=y)
sns.lineplot(data=y, x='chlorides', y='pH')
# Line plot, three variables
sns.lineplot(data=x)
5- Normalize the data before clustering.
Following is the code, make sure you update the data frame name correctly:
# Normalize the data (min-max scaling of every column to the range 0 to 1) in order to apply clustering
data_viji_wine_norm = (data_viji_wine - data_viji_wine.min()) / (data_viji_wine.max() - data_viji_wine.min())
print(data_viji_wine_norm.head())
The output should look like this
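The expected output is not reproduced in this handout. As a sanity check, min-max scaling maps the minimum of every column to 0 and the maximum to 1. A minimal sketch on a toy frame (the values are made up for illustration and are not taken from wine.csv):

```python
import pandas as pd

# Toy values for illustration only
toy = pd.DataFrame({'pH': [3.0, 3.2, 3.4], 'alcohol': [9.0, 10.0, 13.0]})

# Same formula as in the lab code: (x - min) / (max - min), applied column by column
toy_norm = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_norm)
# Every column should now have minimum 0 and maximum 1
print(toy_norm.min().tolist(), toy_norm.max().tolist())
```

Running the same two checks on your own normalized data frame is a quick way to confirm that the scaling worked.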
6- Generate some additional plots for the normalized data:
Following is the code, make sure you update model name correctly:
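The code for this step is not reproduced in this handout. One possible sketch, assuming box plots and a correlation matrix of the normalized columns are the kind of plots intended. The helper name plot_normalized_overview and the random toy data are hypothetical; in the lab you would pass your normalized data frame.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_normalized_overview(frame_norm):
    """Box plot of every normalized column next to a correlation matrix image."""
    fig, (ax_box, ax_corr) = plt.subplots(1, 2, figsize=(14, 6))
    # After min-max scaling all columns share the 0-1 range, so one axis suits them all
    frame_norm.plot(kind='box', ax=ax_box, rot=90)
    ax_box.set_title('Normalized attributes')
    # Correlation matrix drawn with plain matplotlib
    corr = frame_norm.corr()
    image = ax_corr.imshow(corr.values, cmap='coolwarm', vmin=-1, vmax=1)
    ax_corr.set_xticks(range(len(corr.columns)))
    ax_corr.set_xticklabels(corr.columns, rotation=90)
    ax_corr.set_yticks(range(len(corr.columns)))
    ax_corr.set_yticklabels(corr.columns)
    fig.colorbar(image, ax=ax_corr)
    return ax_box, ax_corr

# Random stand-in for the normalized data (illustration only)
rng = np.random.default_rng(0)
toy_norm = pd.DataFrame(rng.random((30, 3)), columns=['pH', 'chlorides', 'alcohol'])
ax_box, ax_corr = plot_normalized_overview(toy_norm)
```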
7- Cluster the data (observations) into 6 clusters using k-means clustering algorithm.
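The model-fitting code is not reproduced in this handout. The attributes used in the following step (labels_, cluster_centers_, inertia_) match scikit-learn's KMeans, so the sketch below assumes that class. The helper name fit_kmeans, the n_init and random_state values, and the random toy data are illustrative choices, not part of the original exercise.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def fit_kmeans(frame_norm, n_clusters=6, random_state=0):
    """Fit k-means on the normalized data and return the fitted model."""
    # n_init=10 restarts the algorithm from 10 initializations and keeps the best run
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    model.fit(frame_norm)
    return model

# Random stand-in for the normalized data (illustration only)
rng = np.random.default_rng(0)
toy_norm = pd.DataFrame(rng.random((60, 4)), columns=['a', 'b', 'c', 'd'])

toy_model = fit_kmeans(toy_norm, n_clusters=6)
print(toy_model.labels_[:10])
```

In the lab, the equivalent call would be along the lines of model = fit_kmeans(data_viji_wine_norm, n_clusters=6), using your own data frame name.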
8- Following is the code, make sure you update model name correctly:
# Cluster label assigned to each record
print(model.labels_)
# Append the clusters to each record on the dataframe, i.e. add a new column for clusters
md = pd.Series(model.labels_)
data_viji_wine_norm['clust'] = md
print(data_viji_wine_norm.head(10))
# Find the final centroid of each cluster
print(model.cluster_centers_)
# Calculate the J-score. The J-score is the sum of the squared distances
# between each point and the centroid of the cluster it is assigned to.
# For an efficient clustering, the J-score should be as low as possible.
print(model.inertia_)
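To make the J-score definition concrete, it can be recomputed by hand from the labels and centroids and compared with inertia_. This is a sketch on random stand-in data, assuming scikit-learn's KMeans; the helper name j_score is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def j_score(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())

# Random stand-in data (illustration only)
rng = np.random.default_rng(0)
points = rng.random((60, 4))

toy_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
manual = j_score(points, toy_model.labels_, toy_model.cluster_centers_)
# The two numbers should agree up to floating-point rounding
print(manual, toy_model.inertia_)
```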
10- Re-cluster the data into three clusters and check the results. Show the results to your professor.
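One thing to watch when re-clustering: by this point the normalized data frame also holds the clust column added earlier, and that column should be excluded before fitting again. A hedged sketch, again assuming scikit-learn's KMeans; the helper name recluster and the random toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def recluster(frame_norm, n_clusters, label_column='clust'):
    """Fit k-means on the feature columns only, ignoring any earlier cluster labels."""
    features = frame_norm.drop(columns=[label_column], errors='ignore')
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    model.fit(features)
    return model

# Random stand-in for the normalized data, including an old label column
rng = np.random.default_rng(0)
toy_norm = pd.DataFrame(rng.random((60, 4)), columns=['a', 'b', 'c', 'd'])
toy_norm['clust'] = rng.integers(0, 6, size=60)

model_3 = recluster(toy_norm, n_clusters=3)
model_6 = recluster(toy_norm, n_clusters=6)
print(model_3.inertia_, model_6.inertia_)
```

When comparing the two results, keep in mind that the J-score generally decreases as the number of clusters grows, so a lower value for six clusters does not by itself mean that six is the better choice.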