Pattern Recognition

The document contains a lab file for a Pattern Recognition course. It lists 23 experiments covering topics like reading CSV and image files, data pre-processing, scatter plots, implementing Bayes theorem, naive Bayes classification, correlation, covariance, PCA, SVD, feature selection, KNN classification, SVM classification, decision trees, neural networks, and different clustering algorithms. It provides code examples to implement reading files, Bayes theorem, and naive Bayes classification on a heart disease dataset.

Uploaded by

Aryan Attri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
147 views26 pages

Pattern Recognition

The document contains a lab file for a Pattern Recognition course. It lists 23 experiments covering topics like reading CSV and image files, data pre-processing, scatter plots, implementing Bayes theorem, naive Bayes classification, correlation, covariance, PCA, SVD, feature selection, KNN classification, SVM classification, decision trees, neural networks, and different clustering algorithms. It provides code examples to implement reading files, Bayes theorem, and naive Bayes classification on a heart disease dataset.

Uploaded by

Aryan Attri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 26

DEPARTMENT OF COMPUTER SCIENCE & TECHNOLOGY

LAB FILE
FOR
SUBJECT: PATTERN RECOGNITION LAB
SUBJECT CODE: CSP650

By Aryan Attri, 2017008749
SHARDA UNIVERSITY, GREATER NOIDA

List of Experiments:
1. WAP to read a CSV file and an image file
2. WAP for pre-processing (cleaning of data and data redundancy) of data
3. Draw scatter plots and histograms from a multivariate data set
4. WAP to implement Bayes' theorem in Python
5. Implement naive Bayes on the Heart Disease data set
6. WAP to find correlation and covariance between features of data
7. WAP to implement PCA
8. WAP to implement SVD
9. WAP to implement univariate feature selection for the Wine Quality data set
10. Write a program to implement the k-Nearest Neighbour algorithm to classify the Iris data set
11. Write a program to implement the Support Vector Machine algorithm to classify the Heart Disease data set
12. Write a program to implement the Decision Tree algorithm to classify the Wine Quality data set
13. Write a program to implement the Artificial Neural Network algorithm to classify the Iris data set
14. Implement K-Means clustering on the Netflix Cluster290 data set
15. Implement hierarchical clustering
16. Implement graph clustering
17.
18.
19.
20.
21.
22.
23.

Task 1: WAP to read a CSV file and an image file


Reading CSV file
# importing the csv module
import csv

# csv file name
filename = "aapl.csv"

# initializing the titles and rows list
fields = []
rows = []

# reading the csv file
with open(filename, 'r') as csvfile:
    # creating a csv reader object
    csvreader = csv.reader(csvfile)
    # extracting field names through the first row
    fields = next(csvreader)
    # extracting each data row one by one
    for row in csvreader:
        rows.append(row)
    # get total number of rows
    print("Total no. of rows: %d" % (csvreader.line_num))

# printing the field names
print('Field names are: ' + ', '.join(field for field in fields))

# printing the first 5 rows
print('\nFirst 5 rows are:\n')
for row in rows[:5]:
    # printing each column of the row in a fixed-width field
    for col in row:
        print("%10s" % col, end=' ')
    print('\n')

Reading Image File


# Python program to read image using OpenCV
# importing OpenCV(cv2) module
import cv2
# Save image in set directory
# Read RGB image
img = cv2.imread('g4g.png')
# Output img with window name as 'image'
cv2.imshow('image', img)
# Maintain output window until
# user presses a key
cv2.waitKey(0)
# Destroying present windows on screen
cv2.destroyAllWindows()

Task 2: WAP for pre-processing (cleaning of data and data redundancy) of data
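A minimal pre-processing sketch with pandas is given below; the input file name "data.csv" and its contents are assumptions for illustration only, not one of the lab's own data sets.

# a minimal data-cleaning sketch (assumed input file "data.csv")
import pandas as pd

df = pd.read_csv("data.csv")

# remove exact duplicate rows (data redundancy)
df = df.drop_duplicates()

# fill missing numeric values with the column mean
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# drop any rows that still contain missing values (e.g. in text columns)
df = df.dropna()
print(df.info())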
Task 3: Draw Scatter Plot and Histograms from Multivariate data set
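One way to draw both plot types from a multivariate data set is sketched below, assuming the same Iris CSV that Task 10 later uses as input.

# scatter plots and histograms from a multivariate data set (sketch, using the Iris data)
import pandas as pd
import matplotlib.pyplot as plt

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
df = pd.read_csv(url, names=names)

# scatter-plot matrix of all numeric feature pairs
pd.plotting.scatter_matrix(df, figsize=(8, 8))
plt.show()

# histogram of every numeric feature
df.hist(bins=20, figsize=(8, 8))
plt.show()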
Task 4: WAP to implement Bayes' theorem in Python
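Bayes' theorem states that P(A|B) = P(B|A) * P(A) / P(B). As a minimal sketch, the rule on its own can be written as a small function with made-up example probabilities; the full listing that follows applies the same rule per class, with Gaussian likelihoods estimated from the training data.

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
def bayes_theorem(p_a, p_b_given_a, p_b):
    return (p_b_given_a * p_a) / p_b

# assumed example values: P(A) = 0.01, P(B|A) = 0.9, P(B) = 0.05
print(bayes_theorem(0.01, 0.9, 0.05))  # -> 0.18 (up to floating-point rounding)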
# Importing libraries
import math
import random
import csv

# the categorical class names are changed to numeric data
# eg: yes and no encoded to 1 and 0
def encode_class(mydata):
    classes = []
    for i in range(len(mydata)):
        if mydata[i][-1] not in classes:
            classes.append(mydata[i][-1])
    for i in range(len(classes)):
        for j in range(len(mydata)):
            if mydata[j][-1] == classes[i]:
                mydata[j][-1] = i
    return mydata

# Splitting the data
def splitting(mydata, ratio):
    train_num = int(len(mydata) * ratio)
    train = []
    # initially the test set will have all of the dataset
    test = list(mydata)
    while len(train) < train_num:
        # index generated randomly from range 0
        # to length of the test set
        index = random.randrange(len(test))
        # from the test set, pop data rows and put them in train
        train.append(test.pop(index))
    return train, test

# Group the data rows under each class, yes or
# no, in a dictionary eg: dict[yes] and dict[no]
def groupUnderClass(mydata):
    dict = {}
    for i in range(len(mydata)):
        if (mydata[i][-1] not in dict):
            dict[mydata[i][-1]] = []
        dict[mydata[i][-1]].append(mydata[i])
    return dict

# Calculating Mean
def mean(numbers):
    return sum(numbers) / float(len(numbers))

# Calculating Standard Deviation
def std_dev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def MeanAndStdDev(mydata):
    info = [(mean(attribute), std_dev(attribute)) for attribute in zip(*mydata)]
    # eg: list = [[a, b, c], [m, n, o], [x, y, z]]
    # here mean of 1st attribute = (a + m + x)/3, mean of 2nd attribute = (b + n + y)/3
    # delete the summary of the last column (the class label)
    del info[-1]
    return info

# find Mean and Standard Deviation under each class
def MeanAndStdDevForClass(mydata):
    info = {}
    dict = groupUnderClass(mydata)
    for classValue, instances in dict.items():
        info[classValue] = MeanAndStdDev(instances)
    return info

# Calculate Gaussian Probability Density Function
def calculateGaussianProbability(x, mean, stdev):
    expo = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * expo

# Calculate Class Probabilities
def calculateClassProbabilities(info, test):
    probabilities = {}
    for classValue, classSummaries in info.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, std_dev = classSummaries[i]
            x = test[i]
            probabilities[classValue] *= calculateGaussianProbability(x, mean, std_dev)
    return probabilities

# Make prediction - the class with the highest probability is the prediction
def predict(info, test):
    probabilities = calculateClassProbabilities(info, test)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

# returns predictions for a set of examples
def getPredictions(info, test):
    predictions = []
    for i in range(len(test)):
        result = predict(info, test[i])
        predictions.append(result)
    return predictions

# Accuracy score
def accuracy_rate(test, predictions):
    correct = 0
    for i in range(len(test)):
        if test[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(test))) * 100.0

# driver code

# add the data path on your system
filename = r'E:\user\MACHINE LEARNING\machine learning algos\Naive bayes\filedata.csv'

# load the file and store it in the mydata list
mydata = csv.reader(open(filename, "rt"))
mydata = list(mydata)
mydata = encode_class(mydata)
for i in range(len(mydata)):
    mydata[i] = [float(x) for x in mydata[i]]

# split ratio = 0.7
# 70% of the data is training data and 30% is test data
ratio = 0.7
train_data, test_data = splitting(mydata, ratio)
print('Total number of examples are: ', len(mydata))
print('Out of these, training examples are: ', len(train_data))
print("Test examples are: ", len(test_data))

# prepare model
info = MeanAndStdDevForClass(train_data)

# test model
predictions = getPredictions(info, test_data)
accuracy = accuracy_rate(test_data, predictions)
print("Accuracy of your model is: ", accuracy)

Task 5: Implement naive Bayes on the Heart Disease data set


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# visualization tools
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pp
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
df = pd.read_csv("../input/heart-disease-uci/heart.csv")
df.sample(5)
df.info()
report = pp.ProfileReport(df)

report.to_file("report.html")

report
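The listing above loads and profiles the data; a minimal Gaussian naive Bayes classification step with scikit-learn, assuming the usual 'target' label column of the UCI heart.csv file, could look like this:

# a minimal naive Bayes step (sketch; assumes the 'target' column of heart.csv)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, nb.predict(X_test)))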
Task 6: WAP to find correlation and covariance between features of data

Correlation between features of data
The Pearson correlation coefficient (named after Karl Pearson) can be used to summarize the strength of the linear relationship between two data samples.
Pearson's correlation coefficient is calculated as the covariance of the two variables divided by the product of the standard deviations of the two samples; it is the normalization of the covariance that gives an interpretable score:
Pearson's correlation coefficient = covariance(X, Y) / (stdv(X) * stdv(Y))
We can calculate the correlation between the two variables in our test problem. The complete example is listed below.
# calculate the Pearson's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate Pearson's correlation
corr, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % corr)

covariance between features of data


Syntax: numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None)
# Python code to demonstrate the
# use of numpy.cov
import numpy as np

x = np.array([[0, 3, 4], [1, 2, 4], [3, 4, 5]])


print("Shape of array:\n", np.shape(x))
print("Covarinace matrix of x:\n", np.cov(x))

Task 7: WAP to implement PCA

Principal Component Analysis (PCA) is a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables.
Each principal component is chosen so that it describes most of the still-available variance, and all principal components are orthogonal to each other. Among the principal components, the first has the maximum variance.
Uses of PCA:
• It is used to find inter-relations between variables in the data.
• It is used to interpret and visualize data.
• As the number of variables decreases, it makes further analysis simpler.
• It is often used to visualize genetic distance and relatedness between populations.
PCA is performed on a square symmetric matrix, which can be a pure sums-of-squares-and-cross-products matrix, a covariance matrix, or a correlation matrix. A correlation matrix is used if the individual variances differ greatly.
# importing required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# importing or loading the dataset
dataset = pd.read_csv('wines.csv')
# distributing the dataset into two components X and Y
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values
# Splitting the X and Y into the
# Training set and Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# performing preprocessing part
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying PCA function on training
# and testing set of X component
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
# Fitting Logistic Regression To the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the test set result using
# predict function under LogisticRegression
y_pred = classifier.predict(X_test)
# making confusion matrix between
# test set of Y and predicted value.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Predicting the training set
# result through scatter plot
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1,
stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
cmap = ListedColormap(('yellow', 'white', 'aquamarine')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1') # for Xlabel
plt.ylabel('PC2') # for Ylabel
plt.legend() # to show legend
# show scatter plot
plt.show()
# Visualising the Test set results through scatter plot
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1,
stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
cmap = ListedColormap(('yellow', 'white', 'aquamarine')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
# title for scatter plot
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1') # for Xlabel
plt.ylabel('PC2') # for Ylabel
plt.legend()
# show scatter plot
plt.show()

Task 8: WAP to implement SVD

# get the image from https://cdn.pixabay.com/photo/2017/03/27/16/50/beach-2179624_960_720.jpg
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2

# read the image in grayscale
img = cv2.imread('beach-2179624_960_720.jpg', 0)

# obtain the SVD
U, S, V = np.linalg.svd(img)

# inspect shapes of the matrices
print(U.shape, S.shape, V.shape)

# plot reconstructions with different numbers of singular components
comps = [638, 500, 400, 300, 200, 100]

plt.figure(figsize = (16, 8))
for i in range(6):
    low_rank = U[:, :comps[i]] @ np.diag(S[:comps[i]]) @ V[:comps[i], :]
    if i == 0:
        plt.subplot(2, 3, i + 1), plt.imshow(low_rank, cmap = 'gray'), plt.axis('off'), plt.title("Original Image with n_components = " + str(comps[i]))
    else:
        plt.subplot(2, 3, i + 1), plt.imshow(low_rank, cmap = 'gray'), plt.axis('off'), plt.title("n_components = " + str(comps[i]))
plt.show()
Task 9: WAP to implement univariate feature selection for the Wine Quality data set

Here we will predict the quality of wine on the basis of the given features. We use the Wine Quality dataset from Kaggle, which contains the fundamental features that affect the quality of the wine. Using machine-learning models and classification techniques, we will predict whether the quality of a wine is good or bad.
In this dataset the classes are ordered but not balanced, and the two wine types (red and white) are not equally represented.
These are the names of the features in the dataset:
1. type
2. fixed acidity
3. volatile acidity
4. citric acid
5. residual sugar
6. chlorides
7. free sulfur dioxide
8. total sulfur dioxide
9. density
10. pH
11. sulphates
12. alcohol
13. quality

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

# loading the data
Dataframe = pd.read_csv(r'D:\xdatasets\winequalityN.csv')

# show rows and columns
Dataframe.head()

# getting info
Dataframe.info()
Dataframe.describe()

# null value check
Dataframe.isnull().sum()

# plot pairplot
sb.pairplot(Dataframe)
# show graph
plt.show()

# plot histograms
Dataframe.hist(bins=20, figsize=(10, 10))
plt.show()

# bar plot of alcohol against quality
plt.figure(figsize=[15, 6])
plt.bar(Dataframe['quality'], Dataframe['alcohol'])
plt.xlabel('quality')
plt.ylabel('alcohol')
plt.show()

# correlation by visualization
plt.figure(figsize=[18, 7])
# plot correlation heatmap
sb.heatmap(Dataframe.corr(), annot=True)
plt.show()
colm = []
corr = Dataframe.corr()
# loop over columns
for i in range(len(corr.keys())):
    # loop over rows below the diagonal
    for j in range(i):
        if abs(corr.iloc[i, j]) > 0.7:
            colm.append(corr.columns[i])

# drop the highly correlated column
new_df = Dataframe.drop('total sulfur dioxide', axis=1)
new_df.update(new_df.fillna(new_df.mean(numeric_only=True)))

# categorical columns
cat = new_df.select_dtypes(include='O')

# create dummies of categorical columns
df_dummies = pd.get_dummies(new_df, drop_first=True)
print(df_dummies)

# binary target: quality >= 7 counts as "best quality"
df_dummies['best quality'] = [1 if x >= 7 else 0 for x in Dataframe.quality]
print(df_dummies)
# import libraries
from sklearn.model_selection import train_test_split

# independent variables
x = df_dummies.drop(['quality', 'best quality'], axis=1)
# dependent variable
y = df_dummies['best quality']

# creating train/test splits
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=40)

# import libraries
from sklearn.preprocessing import MinMaxScaler

# creating the scaler
norm = MinMaxScaler()
# fit the scaler on the training data
norm_fit = norm.fit(xtrain)
# transformation of training data
scal_xtrain = norm_fit.transform(xtrain)
# transformation of testing data
scal_xtest = norm_fit.transform(xtest)
print(scal_xtrain)
# import libraries
from sklearn.ensemble import RandomForestClassifier
# for error checking
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report

# create the model
rnd = RandomForestClassifier()
# fit the model on the scaled training data
fit_rnd = rnd.fit(scal_xtrain, ytrain)

# checking the accuracy score
rnd_score = rnd.score(scal_xtest, ytest)
print('score of model is : ', rnd_score)
print('.................................')
print('calculating the error')

# predictions on the test set
x_predict = list(rnd.predict(scal_xtest))

# checking mean squared error
MSE = mean_squared_error(ytest, x_predict)
# checking root mean squared error
RMSE = np.sqrt(MSE)
print('mean squared error is : ', MSE)
print('root mean squared error is : ', RMSE)
print(classification_report(ytest, x_predict))

# compare predictions with the original labels
df_compare = {'predicted': x_predict, 'original': ytest}
pd.DataFrame(df_compare).head(10)
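The pipeline above filters features by pairwise correlation and fits a random forest. A univariate feature selection step, as named in the task title, can be sketched with scikit-learn's SelectKBest, reusing the x and y variables defined above (k=5 is an arbitrary choice for illustration):

# univariate feature selection sketch (assumes x and y from the cells above)
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=5)
x_selected = selector.fit_transform(x, y)

# names and ANOVA F-scores of the selected columns
mask = selector.get_support()
for name, score in zip(x.columns[mask], selector.scores_[mask]):
    print(name, round(score, 2))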

Task 10: Write a program to implement the k-Nearest Neighbour algorithm to classify the Iris data set

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read dataset to pandas dataframe


dataset = pd.read_csv(url, names=names)
dataset.head()
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
error = []

# Calculating error for K values between 1 and 40


for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
Task 11: Write a program to implement the Support Vector Machine algorithm to classify the Heart Disease data set

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import LinearSVC, SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
import seaborn as sn
import pandas as pd

signals = pd.read_csv("C:\\Users\\monis\\Downloads\\DS1_signals.csv", header=None)
labels = pd.read_csv("C:\\Users\\monis\\Downloads\\DS1_labels.csv", header=None)
print("*"*50)
print("Signals Info:")
print("*"*50)
print(signals.info())
print("*"*50)
print("Labels Info:")
print("*"*50)
print(labels.info())
print("*"*50)
signals.head()
print("Column Number of NaN's")
for col in signals.columns:
    if signals[col].isnull().sum() > 0:
        print(col, signals[col].isnull().sum())
joined_data = signals.join(labels, rsuffix="_signals", lsuffix="_labels")
joined_data.columns = [i for i in range(180)]+['class']
cor_mat=joined_data.corr()
print('*'*50)
print('Top 10 high positively correlated features')
print('*'*50)
print(cor_mat['class'].sort_values(ascending=False).head(10))
print('*'*50)
print('Top 10 high negatively correlated features')
print('*'*50)
print(cor_mat['class'].sort_values().head(10))
%matplotlib inline
from pandas.plotting import scatter_matrix
features = [79,80,78,77]
scatter_matrix(joined_data[features], figsize=(20,15), c =joined_data['class'], alpha=0.5);
print('-'*20)
print('Class\t %')
print('-'*20)
print(joined_data['class'].value_counts()/len(joined_data))
joined_data.hist('class');
print('-'*20)
from sklearn.model_selection import StratifiedShuffleSplit
split1 = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split1.split(joined_data, joined_data['class']):
    train_set = joined_data.loc[train_index]
    test_set = joined_data.loc[test_index]

features_train = train_set.drop('class', axis=1)
labels_train = train_set['class']
scaler = StandardScaler()
std_features = scaler.fit_transform(features_train)

svc_param_grid = {'C': [10], 'gamma': [0.1, 1, 10]}
svc = SVC(kernel='rbf', decision_function_shape='ovo', random_state=42, max_iter=500)
svc_grid_search = GridSearchCV(svc, svc_param_grid, cv=3, scoring="f1_macro")
svc_grid_search.fit(std_features, labels_train)
train_accuracy = svc_grid_search.best_score_
print('Model\t\tBest params\t\tBest score')
print("-"*50)
print("SVC\t\t", svc_grid_search.best_params_, train_accuracy)

features_test = test_set.drop('class', axis=1)
labels_test = test_set['class']
std_features = scaler.fit_transform(features_test)
svc_grid_search.fit(std_features, labels_test)
test_accuracy = svc_grid_search.best_score_
print('Model\t\tBest params\t\tBest score')
print("-"*50)
print("SVC\t\t", svc_grid_search.best_params_, test_accuracy)
print("Train Accuracy : " + str(train_accuracy))
print("Test Accuracy : " + str(test_accuracy))

Task 12: Write a program to implement the Decision Tree algorithm to classify the Wine Quality data set
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)
pima.head()
#split dataset in features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable
# Split dataset into training set and test set (70% training and 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# Create Decision Tree classifer object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifer


clf = clf.fit(X_train,y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six has been removed in recent scikit-learn; io.StringIO works with pydotplus
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
filled=True, rounded=True,
special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
# Create Decision Tree classifer object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train Decision Tree Classifer


clf = clf.fit(X_train,y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
from io import StringIO  # see note above: io.StringIO replaces sklearn.externals.six
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
filled=True, rounded=True,
special_characters=True, feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())

Task 13: Write a program to implement the Artificial Neural Network algorithm to classify the Iris data set
import keras #library for neural network
import pandas as pd #loading data in table form
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import normalize #machine learning algorithm library
#Reading data
data=pd.read_csv("../input/Iris.csv")
print("Describing the data: ",data.describe())
print("Info of the data:",data.info())
print("10 first samples of the dataset:",data.head(10))
print("10 last samples of the dataset:",data.tail(10))
sns.lmplot('SepalLengthCm', 'SepalWidthCm',
data=data,
fit_reg=False,
hue="Species",
scatter_kws={"marker": "D",
"s": 50})
plt.title('SepalLength vs SepalWidth')

sns.lmplot('PetalLengthCm', 'PetalWidthCm',
data=data,
fit_reg=False,
hue="Species",
scatter_kws={"marker": "D",
"s": 50})
plt.title('PetalLength vs PetalWidth')

sns.lmplot('SepalLengthCm', 'PetalLengthCm',
data=data,
fit_reg=False,
hue="Species",
scatter_kws={"marker": "D",
"s": 50})
plt.title('SepalLength vs PetalLength')

sns.lmplot('SepalWidthCm', 'PetalWidthCm',
data=data,
fit_reg=False,
hue="Species",
scatter_kws={"marker": "D",
"s": 50})
plt.title('SepalWidth vs PetalWidth')
plt.show()
print(data["Species"].unique())
data.loc[data["Species"]=="Iris-setosa","Species"]=0
data.loc[data["Species"]=="Iris-versicolor","Species"]=1
data.loc[data["Species"]=="Iris-virginica","Species"]=2
print(data.head())
data=data.iloc[np.random.permutation(len(data))]
print(data.head())
X=data.iloc[:,1:5].values
y=data.iloc[:,5].values

print("Shape of X",X.shape)
print("Shape of y",y.shape)
print("Examples of X\n",X[:3])
print("Examples of y\n",y[:3])
X_normalized=normalize(X,axis=0)
print("Examples of X_normalised\n",X_normalized[:3])
#Creating train,test and validation data
'''
80% -- train data
20% -- test data
'''
total_length=len(data)
train_length=int(0.8*total_length)
test_length=int(0.2*total_length)
X_train=X_normalized[:train_length]
X_test=X_normalized[train_length:]
y_train=y[:train_length]
y_test=y[train_length:]

print("Length of train set x:",X_train.shape[0],"y:",y_train.shape[0])


print("Length of test set x:",X_test.shape[0],"y:",y_test.shape[0])
#Neural network module
from keras.models import Sequential
from keras.layers import Dense,Activation,Dropout
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
#Change the label to one hot vector
'''
[0]--->[1 0 0]
[1]--->[0 1 0]
[2]--->[0 0 1]
'''
y_train=np_utils.to_categorical(y_train,num_classes=3)
y_test=np_utils.to_categorical(y_test,num_classes=3)
print("Shape of y_train",y_train.shape)
print("Shape of y_test",y_test.shape)
model=Sequential()
model.add(Dense(1000,input_dim=4,activation='relu'))
model.add(Dense(500,activation='relu'))
model.add(Dense(300,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(3,activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=20,epochs=10,verbose=1)
prediction=model.predict(X_test)
length=len(prediction)
y_label=np.argmax(y_test,axis=1)
predict_label=np.argmax(prediction,axis=1)

accuracy = np.sum(y_label == predict_label) / length * 100
print("Accuracy of the dataset", accuracy)

Lab Task 14: Implement K-Mean Clustering on Netflix Cluster290


#import libraries
import pandas as pd
import numpy as np
import random as rd
import matplotlib.pyplot as plt
data = pd.read_csv('clustering.csv')
data.head()
X = data[["LoanAmount","ApplicantIncome"]]
#Visualise data points
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.xlabel('AnnualIncome')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()
# Step 1 and 2 - Choose the number of clusters (k) and select random centroid for each cluster

#number of clusters
K=3

# Select random observation as centroids


Centroids = (X.sample(n=K))
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.scatter(Centroids["ApplicantIncome"],Centroids["LoanAmount"],c='red')
plt.xlabel('AnnualIncome')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()
# Step 3 - Assign all the points to the closest cluster centroid
# Step 4 - Recompute centroids of newly formed clusters
# Step 5 - Repeat step 3 and 4

diff = 1
j = 0

while (diff != 0):
    XD = X
    i = 1
    for index1, row_c in Centroids.iterrows():
        ED = []
        for index2, row_d in XD.iterrows():
            d1 = (row_c["ApplicantIncome"] - row_d["ApplicantIncome"])**2
            d2 = (row_c["LoanAmount"] - row_d["LoanAmount"])**2
            d = np.sqrt(d1 + d2)
            ED.append(d)
        X[i] = ED
        i = i + 1

    C = []
    for index, row in X.iterrows():
        min_dist = row[1]
        pos = 1
        for i in range(K):
            if row[i + 1] < min_dist:
                min_dist = row[i + 1]
                pos = i + 1
        C.append(pos)
    X["Cluster"] = C
    Centroids_new = X.groupby(["Cluster"]).mean()[["LoanAmount", "ApplicantIncome"]]
    if j == 0:
        diff = 1
        j = j + 1
    else:
        diff = ((Centroids_new['LoanAmount'] - Centroids['LoanAmount']).sum() +
                (Centroids_new['ApplicantIncome'] - Centroids['ApplicantIncome']).sum())
        print(diff.sum())
    Centroids = X.groupby(["Cluster"]).mean()[["LoanAmount", "ApplicantIncome"]]

color = ['blue', 'green', 'cyan']
for k in range(K):
    data = X[X["Cluster"] == k + 1]
    plt.scatter(data["ApplicantIncome"], data["LoanAmount"], c=color[k])
plt.scatter(Centroids["ApplicantIncome"], Centroids["LoanAmount"], c='red')
plt.xlabel('Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()
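For comparison, the same clustering can be obtained with scikit-learn's KMeans in a few lines (a sketch that re-reads clustering.csv and uses the same two columns):

# sketch: the same K-means clustering via scikit-learn
from sklearn.cluster import KMeans

df_km = pd.read_csv('clustering.csv')
km = KMeans(n_clusters=3, n_init=10, random_state=0)
df_km["Cluster"] = km.fit_predict(df_km[["LoanAmount", "ApplicantIncome"]])
print(km.cluster_centers_)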

Lab Task 15: Implement Hierarchical Clustering


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('Wholesale customers data.csv')
data.head()
from sklearn.preprocessing import normalize
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
data_scaled.head()
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
plt.axhline(y=6, color='r', linestyle='--')
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
cluster.fit_predict(data_scaled)
plt.figure(figsize=(10, 7))
plt.scatter(data_scaled['Milk'], data_scaled['Grocery'], c=cluster.labels_)

Lab Task 16:Implement Graph Clustering


import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edges_from([('A', 'B'), ('A', 'K'), ('B', 'K'), ('A', 'C'),
('B', 'C'), ('C', 'F'), ('F', 'G'), ('C', 'E'),
('E', 'F'), ('E', 'D'), ('E', 'H'), ('H', 'I'), ('I', 'J')])

plt.figure(figsize =(9, 9))


nx.draw_networkx(G, with_labels = True, node_color ='green')

print(nx.shortest_path(G, 'A'))
# returns dictionary of shortest paths from A to all other nodes

print(dict(nx.shortest_path_length(G, 'A')))
# returns dictionary of shortest path lengths from A to all other nodes

print(nx.shortest_path(G, 'A', 'G'))


# returns a shortest path from node A to G

print(nx.shortest_path_length(G, 'A', 'G'))


# returns length of shortest path from node A to G

print(list(nx.all_simple_paths(G, 'A', 'J')))


# returns list of all paths from node A to J
print(nx.average_shortest_path_length(G))
# returns average of shortest paths between all possible pairs of nodes
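The listing above builds and inspects the graph; to actually cluster it, one common option in networkx is greedy modularity maximisation, sketched below on the same graph G:

# sketch: graph clustering via greedy modularity communities
from networkx.algorithms.community import greedy_modularity_communities

communities = greedy_modularity_communities(G)
for i, community in enumerate(communities):
    print("Cluster", i + 1, ":", sorted(community))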
