
Exercise#9 Instructions 2021

Uploaded by

laylaydeanne

Week 10 Interactive Exercise#9

Clustering
In this exercise, we will do the following:

• Explore a dataset
• Visualize the clusters using matplotlib and seaborn
• Build a clustering model using K-means clustering algorithm

Note: Complete step 10 and present the results to your professor before you leave the lab;
otherwise you will not earn any marks for this lab.

Pre-requisites:
1- Install Anaconda.
2- We will be using several public datasets. They are available at https://goo.gl/zjS4C6 under a
folder named "Datasets for Predictive Modelling with Python". The datasets are organized in the order of
the textbook chapters (Python: Advanced Predictive Analytics); the chapter 7 files are required for this exercise.

Steps for exploring the data and building a K-means clustering model:


1- Open your Spyder IDE.
2- Load the 'wine.csv' file into a dataframe named data_firstname_wine, where firstname is your
first name, and carry out the following activities:
a. Display the column names
b. Display the shape of the dataframe, i.e. the number of rows and the number of columns
c. Display the main statistics of the data
d. Display the types of the columns
e. Display the first five records
f. Find the unique values of the quality attribute
g. Find the mean of the various chemical compositions across samples for the different
groups of wine quality

Following is the code. Make sure you update the path to the location where you placed the
files and update the dataframe name:
import os
import pandas as pd

path = "C:/A_COMP309/data/"
filename = 'wine.csv'
fullpath = os.path.join(path, filename)
# The file is read with ';' as the field separator
data_viji_wine = pd.read_csv(fullpath, sep=';')
# Show up to 15 columns when printing
pd.set_option('display.max_columns', 15)
# a. Column names
print(data_viji_wine.columns.values)
# b. Shape (rows, columns)
print(data_viji_wine.shape)
# c. Main statistics
print(data_viji_wine.describe())
# d. Column types
print(data_viji_wine.dtypes)
# e. First five records
print(data_viji_wine.head(5))
# f. Unique values of quality, and the number of samples per value
print(data_viji_wine['quality'].unique())
print(data_viji_wine['quality'].value_counts())
# g. Mean of each chemical property per quality group
print(data_viji_wine.groupby('quality').mean())

Some observations:
• The lower the volatile acidity and chlorides, the higher the wine quality
• The higher the sulphates and citric acid content, the higher the wine quality
• The density and pH don't vary much across the wine quality groups
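
One way to check observations like these is to look at how each column correlates with quality. The sketch below uses a tiny made-up dataframe (toy_wine and its values are invented for illustration and are not taken from wine.csv); with the real data you would apply the same calls to your own dataframe.

```python
import pandas as pd

# Toy data: the values are invented for illustration only.
toy_wine = pd.DataFrame({
    "volatile acidity": [0.70, 0.60, 0.50, 0.40, 0.30, 0.28],
    "sulphates":        [0.45, 0.50, 0.60, 0.65, 0.75, 0.80],
    "quality":          [3, 4, 5, 6, 7, 8],
})

# Pearson correlation of every column with quality, strongest positive first.
corr_with_quality = (
    toy_wine.corr()["quality"].drop("quality").sort_values(ascending=False)
)
print(corr_with_quality)
```

A negative value means the property tends to fall as quality rises, and a positive value means it tends to rise with quality.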

3- Plot a histogram to see the number of wine samples in each quality type.
Following is the code. Make sure you update the dataframe name correctly:

import matplotlib.pyplot as plt


plt.hist(data_viji_wine['quality'])

4- Use the seaborn library to generate different plots (histograms, pairplots, heatmaps, etc.) and
investigate the correlations.
Following are the code snippets. Make sure you update the dataframe name correctly:
# Use the seaborn library to generate different plots
# Note: distplot is deprecated since seaborn 0.11; newer versions offer histplot and kdeplot instead
import seaborn as sns
sns.distplot(data_viji_wine['quality'])
# Plot only the density function (with a rug plot)
sns.distplot(data_viji_wine['quality'], rug=True, hist=False, color='g')
# Change the direction of the plot
sns.distplot(data_viji_wine['quality'], rug=True, hist=False, vertical=True)
# Check all pairwise correlations. This takes longer to execute
sns.pairplot(data_viji_wine)
# Subset three columns (x) and two columns (y)
x=data_viji_wine[['fixed acidity','chlorides','pH']]
y=data_viji_wine[['chlorides','pH']]
# Check the correlations for the subset
sns.pairplot(x)

# Generate heatmaps
sns.heatmap(data_viji_wine[['fixed acidity']])
sns.heatmap(x)
sns.heatmap(x.corr())
sns.heatmap(x.corr(),annot=True)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,9))
sns.heatmap(x.corr(),annot=True, cmap='coolwarm',linewidths=0.5)
# Line plot of two variables
plt.figure(figsize=(20,9))
sns.lineplot(data=y)
sns.lineplot(data=y,x='chlorides',y='pH')
# Line plot of three variables
sns.lineplot(data=x)


5- Normalize the data in order to apply clustering. Each column is rescaled to the range 0 to 1
using min-max normalization:

x_norm = (x - min(x)) / (max(x) - min(x))

Following is the code. Make sure you update the dataframe name correctly:
# Normalize the data in order to apply clustering
data_viji_wine_norm = (data_viji_wine - data_viji_wine.min()) / (data_viji_wine.max() - data_viji_wine.min())
print(data_viji_wine_norm.head())
The output should look like this
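
As a quick sanity check of the normalization, the following minimal sketch applies the same expression to a small made-up dataframe (toy and its values are hypothetical, not from wine.csv) and confirms that every column ends up between 0 and 1.

```python
import pandas as pd

# Toy data: the values are invented for illustration only.
toy = pd.DataFrame({
    "pH":        [3.0, 3.2, 3.6],
    "chlorides": [0.05, 0.10, 0.25],
})

# Min-max normalization, column by column.
toy_norm = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_norm)
print(toy_norm.min())  # smallest value in each column after scaling
print(toy_norm.max())  # largest value in each column after scaling
```

Note that a column whose values are all identical would come out as NaN, because max minus min is zero for that column.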
6- Generate some additional plots for the normalized data.
Following is the code. Make sure you update the dataframe name correctly:

# Check some plots after normalizing the data
x1=data_viji_wine_norm[['fixed acidity','chlorides','pH']]
y1=data_viji_wine_norm[['chlorides','pH']]
sns.lineplot(data=y1)
sns.lineplot(data=x1)
sns.lineplot(data=y1,x='chlorides',y='pH')

7- Cluster the data (observations) into 6 clusters using the K-means clustering algorithm.
8- Following is the code. Make sure you update the model name correctly:

from sklearn.cluster import KMeans


#from sklearn import datasets
model=KMeans(n_clusters=6)
model.fit(data_viji_wine_norm)
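
K-means starts from randomly chosen centroids, so repeated runs can label the clusters differently. The sketch below shows the same fit on two well-separated groups of made-up points. The random_state and n_init arguments are additions for reproducibility and are not part of the exercise code; points and toy_model are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated groups of 20 points each (invented values).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(20, 2)),
    rng.normal(loc=1.0, scale=0.05, size=(20, 2)),
])

# random_state fixes the initial centroids so the run is repeatable;
# n_init=10 keeps the best of 10 random starts.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=42)
toy_model.fit(points)

print(toy_model.labels_)           # cluster index assigned to each point
print(toy_model.cluster_centers_)  # one row per cluster centroid
```

The cluster numbers themselves carry no meaning; only the grouping of the points does.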

9- Check the results as follows:

a. Print the model labels
b. Append the clusters to each record in the dataframe, i.e. add a new column for the clusters
c. Find the final centroid of each cluster
d. Calculate the J-score. The J-score is the sum of the squared distances
between each point and the centroid of the cluster it is assigned to. For an efficient clustering,
the J-score should be as low as possible.
e. Plot a histogram of the clusters variable to get an idea of the number of observations in
each cluster.
Following is the code. Make sure you update the model name correctly:

# a. Print the model labels
print(model.labels_)
# b. Append the clusters to each record in the dataframe, i.e. add a new column for the clusters
md=pd.Series(model.labels_)
data_viji_wine_norm['clust']=md
print(data_viji_wine_norm.head(10))
# c. Find the final centroid of each cluster
print(model.cluster_centers_)

# d. Calculate the J-score: the sum of the squared distances between each point
# and the centroid of its cluster. For an efficient clustering it should be as low as possible.
print(model.inertia_)

# e. Plot a histogram of the clusters
import matplotlib.pyplot as plt

plt.hist(data_viji_wine_norm['clust'])
plt.title('Histogram of Clusters')
plt.xlabel('Cluster')
plt.ylabel('Frequency')
plt.show()
# Scatter plots of pH and chlorides against the cluster label
plt.scatter(data_viji_wine_norm['clust'],data_viji_wine_norm['pH'])
plt.scatter(data_viji_wine_norm['clust'],data_viji_wine_norm['chlorides'])
plt.show()
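
To make the J-score definition concrete, this minimal sketch computes it by hand on four made-up points and compares the result with the inertia_ value reported by scikit-learn. The points, toy_model and j_score names are illustrative, and n_init and random_state are added here only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two obvious pairs of points (invented values).
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned_centroids = toy_model.cluster_centers_[toy_model.labels_]
j_score = ((points - assigned_centroids) ** 2).sum()

print(j_score)             # J-score computed by hand
print(toy_model.inertia_)  # the same quantity as reported by scikit-learn
```

Each point here lies 0.5 away from its centroid, so each contributes 0.25 and the total is 1.0.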

10- Re-cluster the data into three clusters and check the results. Show the results to your professor.
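
Because step 9 adds a clust column to the normalized dataframe, refitting on that dataframe as-is would treat the old cluster labels as a feature. The following is one possible sketch of the re-clustering on made-up data, not the required solution: toy_norm, features and model3 are hypothetical names, the values are random, and n_init and random_state are additions for repeatability.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the normalized dataframe (random values in [0, 1)).
rng = np.random.default_rng(1)
toy_norm = pd.DataFrame(
    rng.random((30, 3)), columns=["fixed acidity", "chlorides", "pH"]
)
toy_norm["clust"] = 0  # stand-in for labels left over from an earlier run

# Drop the old labels so that only the measurements are clustered.
features = toy_norm.drop(columns=["clust"])
model3 = KMeans(n_clusters=3, n_init=10, random_state=42).fit(features)
toy_norm["clust"] = model3.labels_

print(toy_norm["clust"].value_counts().sort_index())  # observations per cluster
print(model3.cluster_centers_)                        # one centroid per cluster
print(model3.inertia_)                                # J-score for three clusters
```

The J-score normally rises as the number of clusters falls, so compare the three-cluster result with the six-cluster one on the same features before drawing conclusions.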
