Pattern Recognition
LAB FILE
List of Experiments:
1. WAP to read a CSV file and an image file.
2. WAP for pre-processing of data (cleaning of data and removing data redundancy).
3. Draw scatter plots and histograms from a multivariate data set.
4. WAP to implement Bayes' theorem in Python.
5. Implement Naive Bayes on the Heart Disease data set.
6. WAP to find correlation and covariance between features of data.
7. WAP to implement PCA.
9. WAP to implement univariate feature selection for the Wine Quality data set.
10. Write a program to implement the k-Nearest Neighbour algorithm to classify the Iris data set.
11. Write a program to implement the Support Vector Machine algorithm to classify the Heart Disease data set.
12. Write a program to implement the Decision Tree algorithm to classify the Wine Quality data set.
13. Write a program to implement an Artificial Neural Network algorithm to classify the Iris data set.
import math

# Calculating mean
def mean(numbers):
    return sum(numbers) / float(len(numbers))

# Calculating standard deviation
def std_dev(numbers):
    avg = mean(numbers)
    variance = sum([(x - avg) ** 2 for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

# Summarise each attribute (column) of the data as (mean, std dev)
def MeanAndStdDev(mydata):
    info = [(mean(attribute), std_dev(attribute)) for attribute in zip(*mydata)]
    # e.g. mydata = [[a, b, c], [m, n, o], [x, y, z]]
    # mean of 1st attribute = (a + m + x) / 3, mean of 2nd attribute = (b + n + y) / 3
    # delete the summary of the last column, which holds the class label
    del info[-1]
    return info
# Accuracy score
def accuracy_rate(test, predictions):
    correct = 0
    for i in range(len(test)):
        if test[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(test))) * 100.0
# driver code
# prepare the model (MeanAndStdDevForClass and getPredictions are part of the
# full Naive Bayes program and are not reproduced in this listing)
info = MeanAndStdDevForClass(train_data)
# test the model
predictions = getPredictions(info, test_data)
accuracy = accuracy_rate(test_data, predictions)
print("Accuracy of your model is: ", accuracy)
# visualization tools
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pp  # renamed ydata_profiling in newer releases
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# list the input files available in the Kaggle environment
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# load the Heart Disease data set
df = pd.read_csv("../input/heart-disease-uci/heart.csv")
df.sample(5)
df.info()

# generate an automated profiling report of the data
report = pp.ProfileReport(df)
report.to_file("report.html")
report
Task 6: WAP to find correlation and covariance between features of data
Correlation between features of data
The Pearson correlation coefficient (named for Karl Pearson) can be used to summarize the strength of the linear relationship between two data samples.
Pearson's correlation coefficient is calculated as the covariance of the two variables divided by the product of the standard deviation of each data sample. It is the normalization of the covariance between the two variables, giving an interpretable score:
Pearson's correlation coefficient = covariance(X, Y) / (stdv(X) * stdv(Y))
We can calculate the correlation between the two variables in our test problem. The complete example is listed below.
# calculate the Pearson's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate Pearson's correlation
corr, _ = pearsonr(data1, data2)
print("Pearson's correlation: %.3f" % corr)
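The task also asks for the covariance itself, which the listing above does not print. A short sketch, reusing the same synthetic data, computes it with NumPy's cov() function:

# calculate the covariance between two variables
from numpy import cov
from numpy.random import randn, seed
seed(1)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# cov() returns the 2x2 covariance matrix; cov(X, Y) is the off-diagonal entry
covariance = cov(data1, data2)
print(covariance)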
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2

# read the image in grayscale so it forms a 2-D matrix
# ('image.jpg' is a placeholder for any local image file)
img = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)

# obtain the singular value decomposition of the image matrix
U, S, V = np.linalg.svd(img)
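The SVD can then be used for PCA-style dimensionality reduction: keeping only the top k singular values gives a low-rank approximation of the image. A minimal sketch (the choice k = 50 is illustrative):

# reconstruct the image from the top k singular values
k = 50
approx = U[:, :k] @ np.diag(S[:k]) @ V[:k, :]
plt.imshow(approx, cmap='gray')
plt.title('Rank-%d approximation' % k)
plt.show()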
Here we will predict the quality of wine on the basis of the given features. We use the Wine Quality dataset from Kaggle, which contains the fundamental features responsible for affecting the quality of the wine. Using several machine learning models, we will predict the quality of the wine. Here we deal only with the white wine type, and we use classification techniques to check the quality of the wine, i.e. whether it is good or bad.
In this dataset the classes are ordered but not balanced: red wine instances are present at a high rate, and white wine instances are fewer than red.
These are the names of the features in the dataset:
1. type
2. fixed acidity
3. volatile acidity
4. citric acid
5. residual sugar
6. chlorides
7. free sulfur dioxide
8. total sulfur dioxide
9. density
10. pH
11. sulphates
12. alcohol
13. quality
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
# loading the data (path to a local copy of winequalityN.csv)
Dataframe = pd.read_csv(r'D:\xdatasets\winequalityN.csv')
# show rows and columns
Dataframe.head()
# getting info.
Dataframe.info()
Dataframe.describe()
# null value check
Dataframe.isnull().sum()
# plot pairplot
sb.pairplot(Dataframe)
#show graph
plt.show()
# plot histograms of each feature
Dataframe.hist(bins=20, figsize=(10, 10))
plt.show()

# bar plot of alcohol content against quality
plt.figure(figsize=[15, 6])
plt.bar(Dataframe['quality'], Dataframe['alcohol'])
plt.xlabel('quality')
plt.ylabel('alcohol')
plt.show()
# correlation by visualization
plt.figure(figsize=[18, 7])
# plot the correlation heatmap
sb.heatmap(Dataframe.corr(), annot=True)
plt.show()

# collect columns that are highly correlated (|corr| > 0.7) with an earlier column
corr_matrix = Dataframe.corr()
colm = []
# loop over columns
for i in range(len(corr_matrix.keys())):
    # loop over the rows below the diagonal
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > 0.7:
            colm.append(corr_matrix.columns[i])
# drop the highly correlated column found above
new_df = Dataframe.drop('total sulfur dioxide', axis=1)
# fill missing values with the column means
new_df.update(new_df.fillna(new_df.mean()))
# select the categorical columns
cat = new_df.select_dtypes(include='O')
# create dummies of the categorical columns
df_dummies = pd.get_dummies(new_df, drop_first=True)
print(df_dummies)
# label wines with quality >= 7 as best quality (1), the rest as 0
df_dummies['best quality'] = [1 if x >= 7 else 0 for x in Dataframe.quality]
print(df_dummies)
# import libraries (train_test_split lives in sklearn.model_selection)
from sklearn.model_selection import train_test_split
# independent variables
x = df_dummies.drop(['quality','best quality'],axis=1)
# dependent variable
y = df_dummies['best quality']
# creating train test splits
xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size=0.2,random_state=40)
# import libraries
from sklearn.preprocessing import MinMaxScaler
# create the scaler
norm = MinMaxScaler()
# fit the scaler on the training data only
norm_fit = norm.fit(xtrain)
# transformation of the training data
scal_xtrain = norm_fit.transform(xtrain)
# transformation of the testing data
scal_xtest = norm_fit.transform(xtest)
print(scal_xtrain)
# import libraries
from sklearn.ensemble import RandomForestClassifier
# for error checking
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report
# create the model
rnd = RandomForestClassifier()
# fit the model on the scaled training data
fit_rnd = rnd.fit(scal_xtrain, ytrain)
# checking the accuracy score
rnd_score = rnd.score(scal_xtest, ytest)
print('score of model is : ', rnd_score)
print('.................................')
print('calculating the error')
# predict on the scaled test data
x_predict = list(rnd.predict(scal_xtest))
# checking mean squared error
MSE = mean_squared_error(ytest, x_predict)
# checking root mean squared error
RMSE = np.sqrt(MSE)
print('mean squared error is : ', MSE)
print('root mean squared error is : ', RMSE)
print(classification_report(ytest, x_predict))
# compare the predictions with the original labels
results = {'predicted': x_predict, 'original': ytest}
pd.DataFrame(results).head(10)
# feature scaling; StandardScaler is assumed here since the original listing
# uses `scaler` without defining it (X_train/X_test come from a prior split)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# k-Nearest Neighbour classifier with k = 5
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# evaluate the classifier
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# error rates for different values of k (filled in by the sketch below)
error = []
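The dangling error list above suggests the usual search over k; a plausible sketch, assuming the train/test split used by the classifier above:

import numpy as np
import matplotlib.pyplot as plt
# try k = 1..39 and record the mean error on the test set
for k in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    pred_k = knn.predict(X_test)
    error.append(np.mean(pred_k != y_test))
plt.plot(range(1, 40), error, marker='o')
plt.xlabel('k')
plt.ylabel('mean error')
plt.show()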
# Decision Tree visualization (feature_cols and the X_train/y_train split are
# assumed to be defined earlier in the full program)
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from io import StringIO
from IPython.display import Image
import pydotplus

# Create Decision Tree classifier object and fit it before exporting
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf = clf.fit(X_train, y_train)

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
sns.lmplot(x='PetalLengthCm', y='PetalWidthCm',
           data=data,
           fit_reg=False,
           hue="Species",
           scatter_kws={"marker": "D", "s": 50})
plt.title('PetalLength vs PetalWidth')

sns.lmplot(x='SepalLengthCm', y='PetalLengthCm',
           data=data,
           fit_reg=False,
           hue="Species",
           scatter_kws={"marker": "D", "s": 50})
plt.title('SepalLength vs PetalLength')

sns.lmplot(x='SepalWidthCm', y='PetalWidthCm',
           data=data,
           fit_reg=False,
           hue="Species",
           scatter_kws={"marker": "D", "s": 50})
plt.title('SepalWidth vs PetalWidth')
plt.show()
print(data["Species"].unique())
data.loc[data["Species"]=="Iris-setosa","Species"]=0
data.loc[data["Species"]=="Iris-versicolor","Species"]=1
data.loc[data["Species"]=="Iris-virginica","Species"]=2
print(data.head())
data=data.iloc[np.random.permutation(len(data))]
print(data.head())
X=data.iloc[:,1:5].values
y=data.iloc[:,5].values
print("Shape of X",X.shape)
print("Shape of y",y.shape)
print("Examples of X\n",X[:3])
print("Examples of y\n",y[:3])
from sklearn.preprocessing import normalize
X_normalized = normalize(X, axis=0)
print("Examples of X_normalised\n", X_normalized[:3])
#Creating train,test and validation data
'''
80% -- train data
20% -- test data
'''
total_length=len(data)
train_length=int(0.8*total_length)
test_length=int(0.2*total_length)
X_train=X_normalized[:train_length]
X_test=X_normalized[train_length:]
y_train=y[:train_length]
y_test=y[train_length:]
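The network itself is not reproduced in this listing. A minimal dense-network sketch for the three iris classes, assuming Keras is available (the layer sizes and epoch count are illustrative, not the original choices):

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# one-hot encode the integer class labels (0, 1, 2)
y_train_cat = to_categorical(y_train.astype(int), num_classes=3)

model = Sequential()
model.add(Dense(10, input_dim=4, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train_cat, epochs=100, batch_size=10, verbose=0)

# quantities used by the accuracy computation below
predict_label = np.argmax(model.predict(X_test), axis=1)
y_label = y_test.astype(int)
length = len(y_test)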
accuracy = np.sum(y_label == predict_label) / length * 100
print("Accuracy of the dataset", accuracy)
# number of clusters
K = 3
# initial centroids: K randomly sampled points (assumed initialisation;
# not shown in the original listing)
Centroids = X.sample(n=K)

diff = 1
j = 0
while diff != 0:
    XD = X
    i = 1
    # distance of every point from each centroid
    for index1, row_c in Centroids.iterrows():
        ED = []
        for index2, row_d in XD.iterrows():
            d1 = (row_c["ApplicantIncome"] - row_d["ApplicantIncome"]) ** 2
            d2 = (row_c["LoanAmount"] - row_d["LoanAmount"]) ** 2
            d = np.sqrt(d1 + d2)
            ED.append(d)
        X[i] = ED
        i = i + 1

    # assign each point to its nearest centroid
    C = []
    for index, row in X.iterrows():
        min_dist = row[1]
        pos = 1
        for i in range(K):
            if row[i + 1] < min_dist:
                min_dist = row[i + 1]
                pos = i + 1
        C.append(pos)
    X["Cluster"] = C

    # recompute the centroids as the mean of each cluster
    Centroids_new = X.groupby(["Cluster"]).mean()[["LoanAmount", "ApplicantIncome"]]
    if j == 0:
        diff = 1
        j = j + 1
    else:
        diff = ((Centroids_new['LoanAmount'] - Centroids['LoanAmount']).sum()
                + (Centroids_new['ApplicantIncome'] - Centroids['ApplicantIncome']).sum())
        print(diff.sum())
    Centroids = X.groupby(["Cluster"]).mean()[["LoanAmount", "ApplicantIncome"]]

# plot the clusters and the final centroids
color = ['blue', 'green', 'cyan']
for k in range(K):
    data = X[X["Cluster"] == k + 1]
    plt.scatter(data["ApplicantIncome"], data["LoanAmount"], c=color[k])
plt.scatter(Centroids["ApplicantIncome"], Centroids["LoanAmount"], c='red')
plt.xlabel('Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()
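For comparison, scikit-learn's KMeans performs the same clustering in a few lines. A sketch, assuming X still holds the ApplicantIncome and LoanAmount columns:

from sklearn.cluster import KMeans
# cluster the two features into K = 3 groups
km = KMeans(n_clusters=3, random_state=0)
labels = km.fit_predict(X[["ApplicantIncome", "LoanAmount"]])
print(km.cluster_centers_)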
import networkx as nx

G = nx.Graph()
G.add_edges_from([('A', 'B'), ('A', 'K'), ('B', 'K'), ('A', 'C'),
                  ('B', 'C'), ('C', 'F'), ('F', 'G'), ('C', 'E'),
                  ('E', 'F'), ('E', 'D'), ('E', 'H'), ('H', 'I'), ('I', 'J')])

# returns a dictionary of shortest paths from A to all other nodes
print(nx.shortest_path(G, 'A'))
# returns a dictionary of shortest path lengths from A to all other nodes
print(dict(nx.shortest_path_length(G, 'A')))
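Giving both a source and a target returns a single path instead of a dictionary:

# shortest path and its length between two specific nodes
print(nx.shortest_path(G, 'A', 'J'))
print(nx.shortest_path_length(G, 'A', 'J'))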