
Mount Carmel College,

Autonomous Bengaluru

Department of Computer Science

(Affiliated to Bengaluru Central University)

RECORD OF PRACTICAL WORK

YEAR: 2020-2021

NAME: Sreesha Chakraborty

REGISTER NUMBER: M19CS05

CLASS: III Semester MSc. Computer Science

SUBJECT: Machine Learning Lab Record


Machine Learning Lab

I N D E X

Sl. No    Title

PART - A

1. Exploratory Data Analysis

2. Model building

3. Supervised Learning

   a. Support Vector Machine (SVM)

   b. Decision Tree

   c. K – Nearest Neighbors (KNN)

   d. Random Forest


PART - A


Program 1 Exploratory Data Analysis

Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Import Dataset
A = pd.read_csv('pima-indians-diabetes_csv.csv.csv')
A

A = pd.read_csv('pima-indians-diabetes_csv.csv.csv',header=None)
A


Add header to the dataset

A.columns = ['Pregnant','Glucose','BP','ST','Insulin','BMI','Diabetes','Age','Class']
A

A.head()

A.tail()


Basic Data Exploration

print('Dataset Type')
print(type(A))

print('Dataset Dimension')
print(A.shape)

print(A.info())

Accessing the rows and columns
Accessing the columns
First Approach - accessing through column index

A.iloc[:,3:5]

Second Approach - accessing columns using attributes / header

A['ST']

A.loc[:,'ST']


Accessing the rows


First Approach - accessing through row index

A.iloc[1,:]

Second Approach - accessing rows using attributes / header

A.loc[1,:] #the row labels are same as the index value

Dataset Statistics
print('Mean Age')
print(A['Age'].mean())#A.iloc[s:e,ci]
print('Median Age')
print(A['Age'].median())
print('Mode of the column Age')
print(A['Age'].mode())


print('Standard Deviation of Age')


print(A['Age'].std())
print('Variance of Age')
print(A['Age'].var())

print('Count for the column Age')


print(A['Age'].count()) #does not include null values

print('Unique values for the column Age')


print(A['Age'].unique())

print('Maximum Age')
print(A['Age'].max())
print('Minimum Age')
print(A['Age'].min())


print('Count of every element in the column Age')


print(A['Age'].value_counts())

A['Class'].value_counts()


A['Pregnant'].value_counts()

Summary Statistics

print('Summary statistics of entire dataset')


print(A.describe())

Summary Statistics for one attribute of the dataset

A['BMI'].describe()


Dataset Visualization
Histogram

plt.hist(A['Age'],bins=10) #Matplotlib Library

A['Age'].hist(bins=10) # Pandas Library

A_int=A.select_dtypes(include=['int64']) # group dataset based on the datatype

A_int.head()


A_int.hist(figsize=(16,20),bins=50)


A.hist(figsize=(16,20),bins=5)


Bar Chart

# Matplotlib Library
x = A['Class'].unique()
y = A['Class'].value_counts()
plt.bar(x,y)

Bar chart using Pandas Library

y.plot(kind='bar')

A['Pregnant'].value_counts().plot(kind='bar',color='purple')


Pie Chart

y.plot(kind='pie')

A['Pregnant'].value_counts().plot(kind='pie', figsize=(10,10))


Scatter Plot

plt.scatter(A['Age'],A['Glucose']) # Matplotlib Library

A.plot.scatter(x='Age',y='Glucose') #Pandas Library

import pandas as pd
import numpy as np

df=pd.read_csv('auto-mpg.csv')
df.head()


df.shape

df

df['mpg'].max()

df.info()


df=pd.read_csv('auto-mpg_1.csv')
df.head()

df.info()

df.isnull().sum()


Replacing the null values in numerical features

median1 = df['mpg'].median()
median2 = df['cylinders'].median()
median3 = df['displacement'].median()
median4 = df['weight'].median()
median5 = df['acceleration'].median()

df['mpg'].replace(np.nan,median1,inplace=True)
df['cylinders'].replace(np.nan,median2,inplace=True)
df['displacement'].replace(np.nan,median3,inplace=True)
df['weight'].replace(np.nan,median4,inplace=True)
df['acceleration'].replace(np.nan,median5,inplace=True)

df.info()


Replacing the null values in categorical features


mode1 = df['horsepower'].mode().values[0]  # mode() returns an array of modal values, so take the first one

df['horsepower'].replace(np.nan,mode1,inplace=True)

df.info()

type(mode1)

Handling missing values through interpolation


The missing values in the continuous variables are replaced by values estimated from the neighbouring observations.

df1 = df.interpolate()  # fills missing values in the non-categorical (numeric) columns


df1.info()


Handling the duplicate values

duplicate = df.duplicated()
duplicate

duplicate.sum()

df_1 = pd.read_csv('auto-mpg_1.csv')

duplicate = df_1.duplicated()
duplicate.sum()

df_1.shape

duplicate.shape

type(duplicate.values[0])


Drop the duplicate rows

df_1.drop_duplicates(inplace=True)

duplicate=df_1.duplicated()

duplicate.sum()

df_1.shape

Handling the outliers


A box plot is used to identify the outliers

df_1.boxplot(column=['mpg','acceleration'])


df_1.boxplot(column=['displacement'])

df_1.boxplot(column=['weight'])

Calculating Q1, Q2 and Q3 values


Q1 = np.percentile(df['acceleration'],25,interpolation='midpoint')
Q2 = np.percentile(df['acceleration'],50,interpolation='midpoint')
Q3 = np.percentile(df['acceleration'],75,interpolation='midpoint')

print('Q1 =',Q1)
print('Q2 =',Q2)
print('Q3 =',Q3)


Handling the outliers


arr = np.array([-18,0.3,0.7,2,-1,8,0.3,18,2,0.8,0.2,28,8,0.09])
arr
d = pd.DataFrame(arr)
d
d.boxplot()

Q1 = np.percentile(d,25)
Q2 = np.percentile(d,50)
Q3 = np.percentile(d,75)

print('Q1=',Q1)
print('Q2=',Q2)
print('Q3=',Q3)

IQR = Q3 - Q1
print('Interquartile range is', IQR)
low_lim = Q1 - 1.5 * IQR
up_lim = Q3 + 1.5 * IQR
print('low_limit is',low_lim)
print('up_limit is',up_lim)


outlier = []
for x in d[0]:
    if (x > up_lim) or (x < low_lim):
        outlier.append(x)
print('outlier in the dataset is', outlier)

Identifying the outliers in autompg dataset

df.boxplot(column=['mpg'])

Calculating Q1, Q2 and Q3 values

Q1 = np.percentile(df['mpg'],25)
Q2 = np.percentile(df['mpg'],50)
Q3 = np.percentile(df['mpg'],75)

print('Q1=',Q1)
print('Q2=',Q2)
print('Q3=',Q3)


IQR = Q3 - Q1
print('Interquartile range is', IQR)
low_lim = Q1 - 1.5 * IQR
up_lim = Q3 + 1.5 * IQR
print('low_limit is',low_lim)
print('up_limit is',up_lim)

outlier = []
for x in df['mpg']:
    if (x > up_lim) or (x < low_lim):
        outlier.append(x)
print('outlier in the dataset is', outlier)

Handling the outliers – Trimming

df_trim = df[(df['mpg']>low_lim) & (df['mpg']<up_lim)]


(df['mpg']>low_lim) & (df['mpg']<up_lim)


df.info()

df_trim.info()

Handling the outliers – Capping

df_cap = df.copy()
df_cap['mpg'] = np.where(df_cap['mpg']>up_lim, up_lim, df_cap['mpg'])
df_cap['mpg'] = np.where(df_cap['mpg']<low_lim, low_lim, df_cap['mpg'])

outlier = []

for x in df['mpg']:
    if (x > up_lim) or (x < low_lim):
        outlier.append(x)
print('outlier in the dataset is', outlier)


outlier = []

for x in df_cap['mpg']:
    if (x > up_lim) or (x < low_lim):
        outlier.append(x)

print('outlier in the dataset is', outlier)

df_cap.info()

Feature Scaling

1. Min Max Normalization


2. Z Score Normalization

1. Min Max Normalization

import numpy as np
X = np.array([1,4,6.7,8,10.9])
norm_X = (X - np.min(X))/(np.max(X)-np.min(X))


print(X)
print(norm_X)

np.max(norm_X)

Min Max Normalization on Diabetes Dataset

import pandas as pd
import numpy as np
diabetes = pd.read_csv('pima-indians-diabetes.csv.csv')
diabetes.head()

from sklearn.preprocessing import MinMaxScaler

S = MinMaxScaler()
diabetes_scaled = S.fit_transform(diabetes)
diabetes_scaled

np.max(diabetes_scaled)


2. Z- Score Normalization

X = np.array([1.4,6.7,8,10.9])

X_Znorm = (X-np.mean(X))/np.std(X)
X_Znorm

np.max(X_Znorm)

Z - Score Normalization on Diabetes Dataset

from sklearn.preprocessing import StandardScaler


Stdscaler = StandardScaler()

diabetes_Zscore = Stdscaler.fit_transform(diabetes)
diabetes_Zscore

np.min(diabetes_Zscore)

np.max(diabetes_Zscore)


Program 2 Model building

Import Libraries

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

Import Iris Dataset

iris = datasets.load_iris()
type(iris)

Analyze the dataset

iris


Convert to Pandas Data frame

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
# user-defined column names could also be passed instead, e.g. columns=['sl','sw','pl','pw']

df.head()

df.shape

#add a new column

df['target'] = iris.target


df.tail()

Dataset Exploration

df.info()

df.describe()


df.boxplot()

df.boxplot(column='sepal width (cm)')

df.groupby('target').size()


Split the dataset into train and test data

X = df.drop(['target'], axis=1)
print(X)

Y = df['target']
Y

train = 70% and test = 30%

stratify=Y --> splits the dataset in a balanced manner with respect to all class labels
random_state=4 --> 4 is the seed used for reproducible random splitting

xtrain,xtest,ytrain,ytest = train_test_split(X,Y,test_size=0.3,random_state=4,stratify=Y)
[xtrain.shape,xtest.shape,ytrain.shape,ytest.shape]


ytrain.value_counts()

ytest.value_counts()

105/3

Z - Score Standardization

xtrain


ytrain

S = StandardScaler()
xtrain = S.fit_transform(xtrain)
xtest = S.fit_transform(xtest)

xtest

np.min(xtest)

np.max(xtest)


Model Creation
model = LogisticRegression(solver='liblinear',random_state=0)

model

Train the data

model.fit(xtrain,ytrain)

Testing the test dataset

ytest[1]

y_pred = model.predict(xtest)

y_pred #prediction values

ytest[1]


y_pred[1]

Confusion Matrix

cm = metrics.confusion_matrix(ytest,y_pred)
print(cm)

Classification Accuracy

Acc = (cm[0,0]+cm[1,1]+cm[2,2])/np.sum(cm)
print(Acc)

Accuracy = metrics.accuracy_score(ytest,y_pred)
print(Accuracy)

Impact of different training and testing ratios and feature scaling on classification accuracy

xtrain,xtest,ytrain,ytest = train_test_split(X,Y,test_size=0.1,random_state=4,stratify=Y)
[xtrain.shape,xtest.shape,ytrain.shape,ytest.shape]


ytrain.value_counts()

ytest.value_counts()

#S = StandardScaler()
S = MinMaxScaler()
xtrain = S.fit_transform(xtrain)
xtest = S.fit_transform(xtest)

model.fit(xtrain,ytrain)

y_pred = model.predict(xtest)

cm = metrics.confusion_matrix(ytest,y_pred)
print(cm)

Accuracy = metrics.accuracy_score(ytest,y_pred)
print(Accuracy)


60:40 split – Standard scaler: 93.33%, Min-Max: 85.00%
70:30 split – Standard scaler: 91.11%, Min-Max: 86.66%
80:20 split – Standard scaler: 96.66%, Min-Max: 90.00%
90:10 split – Standard scaler: 93.33%, Min-Max: 86.66%

Impact of outliers on classification accuracy


df.boxplot()

df.boxplot(column = 'sepal width (cm)')

Identify the outliers


Q1 = np.percentile(df['sepal width (cm)'],25)
Q2 = np.percentile(df['sepal width (cm)'],50)
Q3 = np.percentile(df['sepal width (cm)'],75)


IQR = Q3-Q1
low_limit = Q1-1.5*IQR
up_limit = Q3+1.5*IQR

Outliers Trimming

df_trim = df[(df['sepal width (cm)'] > low_limit) & (df['sepal width (cm)'] < up_limit)]
df_trim.shape

X = df_trim.drop(['target'],axis=1)
Y = df_trim['target']
xtrain,xtest,ytrain,ytest = train_test_split(X,Y,test_size=0.3,random_state=4,stratify=Y)
[xtrain.shape,xtest.shape,ytrain.shape,ytest.shape]

ytrain.value_counts()

ytest.value_counts()


S = StandardScaler()
xtrain = S.fit_transform(xtrain)
xtest = S.fit_transform(xtest)

model = LogisticRegression(solver='liblinear',random_state=0)

model.fit(xtrain,ytrain)

y_pred = model.predict(xtest)

cm = metrics.confusion_matrix(ytest,y_pred)
print(cm)

Accuracy = metrics.accuracy_score(ytest,y_pred)
print(Accuracy)

Outlier Capping

df_cap = df.copy()

df_cap['sepal width (cm)'] = np.where(df_cap['sepal width (cm)'] > up_limit, up_limit, df_cap['sepal width (cm)'])

df_cap['sepal width (cm)'] = np.where(df_cap['sepal width (cm)'] < low_limit, low_limit, df_cap['sepal width (cm)'])


df_cap.shape

X = df_cap.drop(['target'],axis=1)
Y = df_cap['target']
xtrain,xtest,ytrain,ytest = train_test_split(X,Y,test_size=0.3,random_state=4,stratify=Y)
[xtrain.shape,xtest.shape,ytrain.shape,ytest.shape]

ytrain.value_counts()

ytest.value_counts()

S = StandardScaler()

xtrain = S.fit_transform(xtrain)

xtest = S.fit_transform(xtest)

model = LogisticRegression(solver='liblinear',random_state=0)


model.fit(xtrain,ytrain)

y_pred = model.predict(xtest)

cm = metrics.confusion_matrix(ytest,y_pred)

print(cm)

Accuracy = metrics.accuracy_score(ytest,y_pred)

print(Accuracy)

With outliers: Accuracy = 91.11%
After trimming the outliers: Accuracy = 95.45%
After capping the outliers: Accuracy = 91.11%


Program 3 a. Supervised Learning – Support Vector Machine (SVM)

Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm, metrics
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Load Dataset
data = pd.read_csv('diabetes (1).csv')
data.head()

Separating features and label for training


X = data.drop(['Outcome'],axis=1)
X.head()


Y = data['Outcome']
[X.shape, Y.shape]

Y.value_counts()

Split the dataset for training and testing


xtrain,xtest,ytrain,ytest = train_test_split(X,Y,test_size=0.3,random_state=3,stratify=Y)
[xtrain.shape, ytrain.shape, xtest.shape, ytest.shape]

ytest.value_counts()

Feature Scaling - Z Score Normalization


S = StandardScaler()
xtrain = S.fit_transform(xtrain)
xtest = S.fit_transform(xtest)

Model Creation and performance checking


model = svm.SVC(kernel='linear')

Train the data


model.fit(xtrain,ytrain)


Test the classifier on test dataset


y_pred = model.predict(xtest)

Print confusion matrix


cm = metrics.confusion_matrix(ytest,y_pred)
print(cm)

Calculation of accuracy
accuracy = metrics.accuracy_score(ytest,y_pred)
print(accuracy)

Calculation of sensitivity and specificity


Row 1: Negative
Row 2: Positive

For sensitivity calculation, consider the row corresponding to disease positive i.e., Row 2 in this
case

sensitivity = cm[1,1]/(cm[1,0]+cm[1,1])
print(sensitivity)

For specificity calculation, consider the row corresponding to disease negative i.e., Row 1 in this
case

specificity = cm[0,0]/(cm[0,0]+cm[0,1])
print(specificity)


y_score = model.decision_function(xtest)
ytest[0]

y_pred[0]

y_score[0]

np.max(y_score)

np.min(y_score)

Plot the ROC Curve


fpr,tpr,_=roc_curve(ytest,y_score)
roc_auc = auc(fpr,tpr)

roc_auc


plt.plot(fpr,tpr, color='darkorange', label='Area Under the Curve = %0.2f'%roc_auc)


plt.plot([0,1],[0,1],color='navy', linestyle='--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')

Compare the performance of SVM Classifier for different Kernels


kernels = ('linear','poly','rbf')
accuracies = []
fpr = dict()
tpr = dict()
roc_auc = dict()
for i,k in enumerate(kernels):
model = svm.SVC(kernel = k)
model.fit(xtrain,ytrain)
ypredict = model.predict(xtest)
acc = metrics.accuracy_score(ytest,ypredict)
accuracies.append(acc)
print('{}% accuracy is obtained from model with kernel {}'.format(acc*100,k))


y_score = model.decision_function(xtest)
fpr[i],tpr[i],_ = metrics.roc_curve(ytest,y_score)
roc_auc[i] = metrics.auc(fpr[i], tpr[i])

ROC Plot for Comparison


plt.plot(fpr[0],tpr[0], color='darkorange', label='Linear Kernel, AUC = %0.2f'%roc_auc[0])
plt.plot(fpr[1],tpr[1], color='green', label='Polynomial Kernel, AUC = %0.2f'%roc_auc[1])
plt.plot(fpr[2],tpr[2], color='purple', label='RBF Kernel, AUC = %0.2f'%roc_auc[2])
plt.plot([0,1],[0,1],color='navy', linestyle='--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.title('Receiver Operating Characteristics')


Program 3 b. Supervised Learning – Decision Tree

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics, datasets

iris = datasets.load_iris()

iris

iris.feature_names


Split X and Y for classification

X = iris.data
Y = iris.target

X.shape

Splitting dataset for Training and Testing


x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.3, random_state=1, stratify=Y)

Create decision tree model


clf = DecisionTreeClassifier(criterion='entropy')

Train the classifier


clf.fit(x_train,y_train)

y_pred = clf.predict(x_test)

Check classifier performance


print('Accuracy: ',metrics.accuracy_score(y_pred, y_test))

from sklearn.tree import export_graphviz  # exports the tree structure in Graphviz DOT format

from six import StringIO

from IPython.display import Image

import pydotplus


dot_data = StringIO()  # holds the DOT-format description of the tree

export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=iris.feature_names,
                class_names=iris.target_names)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

graph.write_png('iris_entropy.png')

Image(graph.create_png())


Program 3 c. Supervised Learning – K – Nearest Neighbors (KNN)

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

from sklearn import metrics

X,Y = make_classification(n_samples=200, n_features=8, n_informative=8,
                          n_redundant=0, n_repeated=0, n_classes=2, random_state=14)

X[0]

Standard Scaling
SC = StandardScaler()

X = SC.fit_transform(X)  # assign the scaled features back to X so the split uses scaled data


X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=10)

print(X_train.shape,X_test.shape,Y_train.shape,Y_test.shape)

pd.DataFrame(data = Y_train).value_counts()

Create a KNN Classifier


knn = KNeighborsClassifier(n_neighbors=13)

Train and Test the Model


knn.fit(X_train,Y_train)

y_pred = knn.predict(X_test)

acc = metrics.accuracy_score(Y_test,y_pred)

print(acc)

Plot error curves to find optimal value of K


error_train=[]

error_test=[]

x=[1,3,5,7,9,11,13,15]

for k in x:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train,Y_train)
    y_pred1 = knn.predict(X_train)
    y_pred2 = knn.predict(X_test)
    acc = metrics.accuracy_score(Y_test,y_pred2)
    error_train.append(np.mean(Y_train != y_pred1))
    error_test.append(np.mean(Y_test != y_pred2))
    print("Neighbors:{} and Accuracy = {}".format(k,acc))

plt.plot(x,error_train,label='Train')

plt.plot(x,error_test,label='Test')

plt.xlabel('K Value')

plt.ylabel('Error')

plt.legend()

plt.title("Error Curve")


Select prominent features using PCA


from sklearn.decomposition import PCA

pca=PCA(n_components=4)#select 4 prominent features

principalcomponents=pca.fit_transform(X)

pc=pd.DataFrame(data=principalcomponents,columns=['PC1','PC2','PC3','PC4'])

X_train,X_test,Y_train,Y_test = train_test_split(principalcomponents, Y, test_size=0.2,
                                                 random_state=10)

knn=KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train,Y_train)

y_pred=knn.predict(X_test)

acc=metrics.accuracy_score(Y_test,y_pred)

print(acc)


Program 3 d. Supervised Learning – Random Forest

Random Forest Classifier


from sklearn import datasets

import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics

Load the dataset - Breast Cancer Dataset


cancer = datasets.load_breast_cancer()

cancer


cancer.data

cancer.target


cancer.feature_names

Convert cancer dataset to Data Frame


data = pd.DataFrame(cancer.data,columns=cancer.feature_names)

data['target']=cancer.target

data.head()

Split the data for X and Y (Choose either method 1 or method 2)


X=data.drop(['target'],axis=1) # method 1

X=cancer.data # method 2

Y=data['target'] # method 1

Y=cancer.target # method 2

Split the dataset for training and testing


X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=3,stratify=Y)


Build the model


clf=RandomForestClassifier(n_estimators=5) # n_estimators--> no. of decision trees

Train and test the classifier


clf.fit(X_train,Y_train)

Y_pred=clf.predict(X_test)

Classifier performance
print(metrics.confusion_matrix(Y_test,Y_pred))

print(metrics.accuracy_score(Y_test,Y_pred))


Feature Importance
fea_imp = pd.Series(clf.feature_importances_, index=cancer.feature_names).sort_values(ascending=False)

fea_imp

Choose highly significant features


X1 = data[['worst concave points','worst area','mean concave points','worst perimeter',
           'worst radius','mean perimeter','mean radius','mean concavity','mean area']]

Y1 = data['target']

X_train1,X_test1,Y_train1,Y_test1 = train_test_split(X1,Y1,test_size=0.3,random_state=3,stratify=Y)


Check the classifier performance for significant features alone


clf1=RandomForestClassifier(n_estimators=5) # n_estimators--> no. of decision trees
clf1.fit(X_train1,Y_train1)
Y_pred1=clf1.predict(X_test1)
print(metrics.accuracy_score(Y_test1,Y_pred1))

Random Forest Regression


import numpy as np

import pandas as pd

from sklearn.datasets import make_regression

from sklearn.model_selection import train_test_split

from sklearn import metrics

from sklearn.ensemble import RandomForestRegressor

Create a dataset
X,Y = make_regression(n_samples=100, n_features=20, n_informative=15, noise=0.1, random_state=2)

Summary of the dataset


print(X.shape,Y.shape)

print(Y[:10])


Create test dataset


X_test,Y_test = make_regression(n_samples=10, n_features=20, n_informative=15, noise=0.1, random_state=2)

Create a Model
model=RandomForestRegressor(n_estimators=10)

Train and test the regression model


model.fit(X,Y)

y_pred=model.predict(X_test)

Y_test

mse = metrics.mean_squared_error(y_pred,Y_test)

print(mse)
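
Program 4 a. Unsupervised Learning – K-Means Clustering

The cells below assume a customers dataset with 'Age' and 'Income($)' columns, an initial three-cluster K-Means fit, and the cluster-wise frames df1, df2 and df3. A minimal sketch of that assumed setup (the file name is hypothetical):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv('income.csv')   # hypothetical file name; expected to contain 'Age' and 'Income($)'
df.head()

km = KMeans(n_clusters=3)
y_pred = km.fit_predict(df[['Age','Income($)']])
df['cluster'] = y_pred           # attach the predicted cluster label to each row

df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]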


print(df1.head())

print(df2.head())

print(df3.head())

plt.scatter(df1.Age,df1['Income($)'], color='red')

plt.scatter(df2.Age,df2['Income($)'], color='green')

plt.scatter(df3.Age,df3['Income($)'], color='blue')

The data points are not clustered properly because the x and y axis ranges differ drastically. This improper clustering can be overcome by feature scaling.


from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()


df['Age']=scaler.fit_transform(df[['Age']])

df['Income($)']=scaler.fit_transform(df[['Income($)']])

df.head()

df.drop(['cluster'],axis=1)


km = KMeans(n_clusters=4)

y_pred = km.fit_predict(df[['Age','Income($)']])

df['cluster'] = y_pred

df1 = df[df.cluster==0]

df2 = df[df.cluster==1]

df3 = df[df.cluster==2]

df4 = df[df.cluster==3]

km.cluster_centers_

plt.scatter(df1.Age,df1['Income($)'], color='red')
plt.scatter(df2.Age,df2['Income($)'], color='green')
plt.scatter(df3.Age,df3['Income($)'], color='blue')
plt.scatter(df4.Age,df4['Income($)'], color='pink')
plt.scatter(km.cluster_centers_[:,0],
            km.cluster_centers_[:,1], marker='*', s=200, color='purple', label='centroid')
plt.legend()


Find optimal value of K using Elbow Curve


sse=[]
k_rng=range(1,12)
for k in k_rng:
km = KMeans(n_clusters=k)
km.fit_predict(df[['Age','Income($)']])
sse.append(km.inertia_)

sse

plt.xlabel('K')
plt.ylabel('Sum of squared errors (inertia)')
plt.plot(k_rng,sse)
plt.grid()


Find optimal value of K using Silhouette score


from sklearn.metrics import silhouette_score

s_score=[]
k_rng=range(2,12)
for k in k_rng:
km = KMeans(n_clusters=k)
km.fit_predict(df[['Age','Income($)']])
S = silhouette_score(df[['Age','Income($)']],km.labels_)
print("k:{} and Silhouette Score:{}".format(k,S))
s_score.append(S)

plt.xlabel('K')
plt.ylabel('Silhouette score')
plt.plot(k_rng,s_score)
plt.grid()


22/10/2021
Program 4 b. Unsupervised Learning – Hierarchical Clustering

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

X, Y_true = make_blobs(n_samples=300, n_features=2, centers=4, cluster_std=0.8, random_state=0)

X[0:5,:]

plt.scatter(X[:,0],X[:,1])


Create a dendrogram representation


plt.figure(figsize=(10,10))
dendogram = sch.dendrogram(sch.linkage(X,method='ward'))

Create an agglomerative clustering model


model = AgglomerativeClustering(n_clusters=4, affinity='euclidean',linkage='ward')
model.fit(X)
labels=model.labels_

plt.scatter(X[:,0],X[:,1],c=labels,cmap='rainbow')


Agglomerative Clustering with real world data


df = pd.read_csv('Mall_Customers.csv')

df.head()

df.info()

Drop Customer ID and Gender


X = df.drop(['CustomerID','Genre'], axis=1)
X.head()


X_arr = np.array(X)

plt.figure(figsize=(10,10))
dend = sch.dendrogram(sch.linkage(X,method='ward'))

model = AgglomerativeClustering(n_clusters=3, affinity='euclidean',linkage='ward')


model.fit(X_arr)
labels=model.labels_

plt.scatter(X_arr[:,0],X_arr[:,1],c=labels,cmap='rainbow')


Perform clustering with feature normalization/ feature standardization

from sklearn.preprocessing import MinMaxScaler

S = MinMaxScaler()
X_sca = S.fit_transform(X_arr)

dend = sch.dendrogram(sch.linkage(X_sca,method='ward'))

model = AgglomerativeClustering(n_clusters=2, affinity='euclidean',linkage='ward')


model.fit(X_sca)
labels=model.labels_

plt.scatter(X_sca[:,1],X_sca[:,2],c=labels,cmap='rainbow')


Standard Scaling
from sklearn.preprocessing import StandardScaler

S = StandardScaler()
X_sca = S.fit_transform(X_arr)

dend = sch.dendrogram(sch.linkage(X_sca,method='ward'))

model = AgglomerativeClustering(n_clusters=2, affinity='euclidean',linkage='ward')


model.fit(X_sca)
labels=model.labels_

plt.scatter(X_sca[:,1],X_sca[:,2],c=labels,cmap='rainbow')


model = AgglomerativeClustering(n_clusters=4, affinity='euclidean',linkage='ward')


model.fit(X_sca)
labels=model.labels_

plt.scatter(X_sca[:,1],X_sca[:,2],c=labels,cmap='rainbow')


04/11/2021
Program 5 Neural Network Model

import sklearn.datasets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

Data = pd.read_csv('diabetes (1).csv')

Data.head()

Data.tail(10)


X = Data.drop('Outcome',axis=1)
Y = Data['Outcome']

Y.head()

S = MinMaxScaler()
X = S.fit_transform(X)

X[0,:]

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,stratify=Y,random_state=8)

X_train.shape


model = MLPClassifier(hidden_layer_sizes=(15,50,20,23,20,50), activation='tanh', solver='adam',
                      batch_size='auto', max_iter=1000)

model.fit(X_train,Y_train)

y_pred = model.predict(X_test)

cm = metrics.confusion_matrix(Y_test,y_pred)
print(cm)

print(metrics.accuracy_score(Y_test,y_pred))


PART - B


HOUSE PRICE PREDICTION

INTRODUCTION
A house is one of the most essential needs in human life, along with other fundamental needs such as food and water. Demand for houses has grown rapidly over the years as people's living standards have improved. While some people treat a house as an investment or a property, most people around the world buy a house as their shelter or their livelihood.

House price prediction can be done using many prediction models (machine learning models) such as support vector regression, linear regression, and more. There are many benefits that home buyers, property investors and house builders can reap from a house-price model. Such a model provides information and knowledge, such as the valuation of house prices in the present market, which helps them determine a suitable house price.

The three performance metrics, mean squared error (MSE), root mean squared error (RMSE) and mean absolute percentage error (MAPE), obtained with the following algorithms also unambiguously outperform those of SVM.

Linear Regression with Python

Linear Regression is one of the simplest algorithms in machine learning, and it can be trained in different ways. In this notebook we will cover the following linear algorithms:

1. Linear Regression
2. Robust Regression
3. Ridge Regression
4. LASSO Regression
5. Elastic Net
6. Polynomial Regression
7. Stochastic Gradient Descent
8. Artificial Neural Network


Information about Data set


We are going to use the USA_Housing dataset. Since house price is a continuous variable, this is a regression problem. The data contains the following columns:

• 'Avg. Area Income': average income of residents of the city the house is located in
• 'Avg. Area House Age': average age of houses in the same city
• 'Avg. Area Number of Rooms': average number of rooms for houses in the same city
• 'Avg. Area Number of Bedrooms': average number of bedrooms for houses in the same city
• 'Area Population': population of the city the house is located in
• 'Price': price that the house sold at
• 'Address': address of the house


Code Snippet:
Exploratory Data Analysis (EDA)
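
A minimal sketch of the exploratory steps, assuming the dataset file is named 'USA_Housing.csv' (the file name is an assumption):

import pandas as pd
import matplotlib.pyplot as plt

USAhousing = pd.read_csv('USA_Housing.csv')   # file name assumed
USAhousing.head()
USAhousing.info()
USAhousing.describe()
USAhousing.isnull().sum()                     # check for missing values

USAhousing.hist(figsize=(12,10), bins=30)     # distribution of every numeric column
plt.scatter(USAhousing['Avg. Area Income'], USAhousing['Price'])   # income vs price
plt.xlabel('Avg. Area Income')
plt.ylabel('Price')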


Training a Linear Regression Model
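
A minimal sketch of this step, following the 70/30 split and StandardScaler described under Modeling steps (the random_state value is an assumption):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = USAhousing[['Avg. Area Income','Avg. Area House Age','Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms','Area Population']]
y = USAhousing['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)      # fit the scaler on the training data only
X_test = scaler.transform(X_test)            # reuse the fitted scaler on the test data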


Linear Regression
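
A minimal sketch of fitting and scoring an ordinary linear regression on the split above:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)

print('MAE :', metrics.mean_absolute_error(y_test, y_pred))
print('MSE :', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2  :', metrics.r2_score(y_test, y_pred))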


Robust Regression

Robust regression is a form of regression analysis designed to overcome some limitations of


traditional parametric and non-parametric methods. Robust regression methods are designed to be
not overly affected by violations of assumptions by the underlying data-generating process.
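
A minimal sketch using HuberRegressor as the robust estimator, continuing from the split above (the choice of estimator and its epsilon value are assumptions):

from sklearn.linear_model import HuberRegressor

huber = HuberRegressor(epsilon=1.35)         # epsilon controls how strongly outliers are down-weighted
huber.fit(X_train, y_train)
y_pred = huber.predict(X_test)

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2  :', metrics.r2_score(y_test, y_pred))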


Ridge Regression

Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a
penalty on the size of coefficients.
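
A minimal sketch, continuing from the split above (the alpha value is an assumption):

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)                     # alpha is the strength of the L2 penalty on the coefficients
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2  :', metrics.r2_score(y_test, y_pred))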


LASSO Regression
A linear model that estimates sparse coefficients.

Mathematically, it consists of a linear model trained with an ℓ1 prior as regularizer. The objective function to minimize is:

    min_w  (1 / (2 * n_samples)) * ||X w - y||_2^2  +  alpha * ||w||_1
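
A minimal sketch, continuing from the split above (the alpha value is an assumption):

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)                     # alpha scales the L1 penalty term in the objective above
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2  :', metrics.r2_score(y_test, y_pred))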


Elastic Net

A linear regression model trained with both L1 and L2 priors as regularizer.

This combination allows for learning a sparse model where only a few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge.
Elastic-net is useful when there are multiple features which are correlated with one another: Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.
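
A minimal sketch, continuing from the split above (the alpha and l1_ratio values are assumptions):

from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # l1_ratio mixes the L1 and L2 penalties
enet.fit(X_train, y_train)
y_pred = enet.predict(X_test)

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2  :', metrics.r2_score(y_test, y_pred))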


Polynomial Regression

One common pattern within machine learning is to use linear models trained on nonlinear functions
of the data. This approach maintains the generally fast performance of linear methods, while
allowing them to fit a much wider range of data.
For example, a simple linear regression can be extended by constructing polynomial features from
the coefficients.
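
A minimal sketch, assuming degree-2 polynomial features fed into an ordinary linear regression:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_train, y_train)
y_pred = poly_model.predict(X_test)

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2  :', metrics.r2_score(y_test, y_pred))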


Random Forest Regressor

With different estimators
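
A minimal sketch that compares a few values of n_estimators, continuing from the split above (the particular values are assumptions):

from sklearn.ensemble import RandomForestRegressor

for n in (10, 50, 100):                      # numbers of trees to compare
    rf = RandomForestRegressor(n_estimators=n, random_state=3)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
    print('n_estimators = {:3d}  RMSE = {:.2f}  R2 = {:.3f}'.format(n, rmse, metrics.r2_score(y_test, y_pred)))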


Support Vector Machine
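
A minimal sketch using the SVR estimator with an RBF kernel, continuing from the split above (the kernel and C values are assumptions):

from sklearn.svm import SVR

svr = SVR(kernel='rbf', C=100000)            # a large C is assumed because the target (Price) has a very large scale
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2  :', metrics.r2_score(y_test, y_pred))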


Models Comparison


Modeling steps

1. Exploratory Data Analysis: analyze and investigate data sets and summarize their
main characteristics, often employing data visualization methods.

2. Train test split: we separate 70% of our dataset for training the model and 30% of the dataset
for testing the model

3. Scale the data: using StandardScaler, we will scale the data so that every feature has zero mean and unit variance

4. Create a predictions data frame: generate a data frame that includes the actual price of the
house captured in our test set and the predicted results from our model so that we can quantify
our success

5. Score the models: compute the root mean squared error (RMSE), mean absolute error (MAE), mean squared error (MSE), R2 score and cross-validation score of our predictions to compare the performance of our models

Comparing Models

To compare model performance, we will look at root mean squared error (RMSE) and mean absolute
error (MAE). These measurements are both commonly used for comparing model performance, but
they have slightly different intuition and mathematical meaning.

• MAE: the mean absolute error tells us, on average, how far our predictions are from the true value. In this case, all errors receive the same weight.

• RMSE: we calculate RMSE by taking the square root of the mean of the squared errors. Because the errors are squared, larger errors have a greater impact on the overall error, while smaller errors do not carry as much weight.

We use get_scores to calculate the RMSE and MAE scores for each model
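
A sketch of what such a get_scores helper could look like; the name is taken from the sentence above, but the signature and body here are assumptions:

import numpy as np
import pandas as pd
from sklearn import metrics

def get_scores(name, y_true, y_predicted):
    # hypothetical helper: returns one row of the model-comparison table
    mae = metrics.mean_absolute_error(y_true, y_predicted)
    mse = metrics.mean_squared_error(y_true, y_predicted)
    rmse = np.sqrt(mse)
    r2 = metrics.r2_score(y_true, y_predicted)
    return {'Model': name, 'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2 Square': r2}

results = pd.DataFrame([
    get_scores('Linear Regression', y_test, lin_reg.predict(X_test)),
    get_scores('Ridge Regression', y_test, ridge.predict(X_test)),
    get_scores('Lasso Regression', y_test, lasso.predict(X_test)),
])
results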


CONCLUSION
To predict house prices, simple to advanced machine learning algorithms have been implemented, such as Linear Regression, Ridge Regression, SVM Regressor, Polynomial Regressor, Elastic Net Regressor, Lasso Regressor, Robust Regressor and Random Forest. It has been observed that all the regressors except the Random Forest Regressor yield an efficiency of 90% and above.
The outcome of these machine learning algorithms will help to select the most suitable demand prediction algorithm.
