
LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING

(AUTONOMOUS)
Department of Computer Science & Engineering

23DS61 – MACHINE LEARNING LAB Manual


R23-M.Tech Data Science

Index
Name of the Module

1. Basic statistical functions for data exploration.
2. Data Visualization: Box plot, scatter plot, histograms.
3. Data Pre-processing: Handling missing values, outliers, normalization, scaling.
4. Principal Component Analysis (PCA)
5. Singular Value Decomposition (SVD)
6. Linear Discriminant Analysis (LDA)
7. Regression Analysis: Linear Regression, Logistic Regression, Polynomial Regression.
8. Regularized Regression
9. K-Nearest Neighbour (KNN) Classifier.
10. Support Vector Machines (SVMs)
11. Random Forest model
12. AdaBoost Classifier and XGBoost
Module 1
Aim:
Basic statistical functions for data exploration
Attributes of the dataset:
1. Alcohol: the amount of alcohol in the wine
2. Volatile acidity: the amount of acetic acid in the wine; high levels lead to an unpleasant vinegar taste
3. Sulphates: a wine additive that contributes to SO2 levels and acts as an antimicrobial and antioxidant
4. Citric acid: acts as a preservative to increase acidity (small quantities add freshness and flavour to wines)
5. Total sulphur dioxide: the amount of free and bound forms of SO2
6. Density: sweeter wines have a higher density
7. Chlorides: the amount of salt in the wine
8. Fixed acidity: non-volatile acids that do not evaporate readily
9. pH: the level of acidity
10. Free sulphur dioxide: prevents microbial growth and the oxidation of wine
11. Residual sugar: the amount of sugar remaining after fermentation stops; the key is a balance between sweetness and sourness (wines with more than 45 g/L are considered sweet)

Program:

#load the dataset:

import pandas as pd
df=pd.read_csv('wr.csv')
print(df)

output:

Info and description of dataset:
df.describe()
Output:

Retrieving the data types of attributes:

df.dtypes

Output:

Maximum and Minimum values of attributes:


df.min()
Output:

df.max()
Output:

Sum, mean, standard deviation and variance of a particular attribute:

Standard deviation:
df['chlorides'].std()
Output:
0.0470653020100901
Mean:
df['chlorides'].mean()
Output:
0.08746654158849279

Variance:
df['chlorides'].var()
Output:
0.0022151426533009912
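
Sum (the section heading also mentions the sum; following the same pattern as the calls above):
df['chlorides'].sum()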

To check whether the values are null:


df.isnull()
Output:

Converting dataset values into an array:


df.to_numpy()
Output:

Filtering the data:

df[df.chlorides>1]

OUTPUT:

Module 2
Aim:
Data Visualization: Box plot, scatter plot, histogram

Description:
A box plot (also known as a whisker plot) displays a summary of a set of data values: minimum, first quartile, median, third quartile and maximum. A box is drawn from the first quartile to the third quartile, and a vertical line through the box marks the median.

Program:
#Boxplot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("/content/drive/MyDrive/diabetes.csv")
#creation of boxplot with dataset
df.boxplot(figsize = (5,5))

Output:

#boxplot without dataset:

np.random.seed(10)
data = np.random.normal(100, 20, 200)
fig = plt.figure(figsize =(10, 7))
plt.boxplot(data)
plt.show()

output:

Boxplot with particular attribute:

#creating boxplot with Quantiles

Q1=df['Insulin'].quantile(0.25)
Q2=df['Insulin'].quantile(0.50)
Q3=df['Insulin'].quantile(0.75)
IQR=Q3-Q1
LowestQuartile=Q1-(1.5*IQR)
HighestQuartile=Q3+(1.5*IQR)
print("first Quantile is :",Q1)
print("second Quantile is :",Q2)
print("third Quantile is :",Q3)
print("IQR is:",IQR)
print("LowestQuartile is:",LowestQuartile)
print("HighestQuartile is:",HighestQuartile)
df.boxplot(column="Insulin")

output:

Model performance:

TP=int(input("enter True Positive Value"))


TN=int(input("enter True Negative Value"))
FP=int(input("enter False Positive Value"))
FN=int(input("enter False Negative Value"))
acc=(TP+TN)/(TP+TN+FP+FN)
err=(FP+FN)/(TP+TN+FP+FN)
sen=(TP)/(TP+FN)
spes=(TN)/(TN+FP)
prec=(TP)/(TP+FP)
f1=(2*prec*sen)/(prec+sen)
print("Accuracy:",acc)
print("Errorrate:",err)
print("Sensitivity:",sen)
print("Specificity:",spes)
print("Precision",prec)
print("f1-measure:",f1)

Output:
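For example, entering TP=50, TN=40, FP=5 and FN=5 gives accuracy 0.9, error rate 0.1, sensitivity ≈ 0.909, specificity ≈ 0.889, precision ≈ 0.909 and F1-measure ≈ 0.909.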

Program:

#scatterplot without dataset:

import matplotlib.pyplot as plt


x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 100, 86, 103, 87, 94, 78, 77, 85, 86]
plt.scatter(x, y, c ="blue")
plt.show()

Output:

Program:

#Scatter plot with dataset:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("/content/drive/MyDrive/diabetes.csv")
df.plot.scatter(x="Pregnancies",y="Insulin",s=20,color="violet")
df.plot.scatter(x="Glucose",y="BMI",s=20,color="indigo")
plt.title('Patients Pregnancies and insulin levels', fontsize = 20)
Output:

Program & output:

Histogram:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("/content/drive/MyDrive/diabetes.csv")
df.hist()

Histogram with particular attribute:


df['BloodPressure'].plot(kind='hist', color="purple", bins=5)
df.hist("BMI",color="green")

output:

Module 3
Aim :
Data Preprocessing: Handling missing values, outliers, normalization, Scaling

Description:
Data preprocessing is essential before the data is actually used. It is the process of turning raw data into a clean data set: the dataset is checked for missing values, noisy data and other inconsistencies before it is fed to an algorithm. Data must also be in a format appropriate for ML; for example, if the algorithm processes only numeric data, a class labelled "malignant" or "benign" must be replaced by "0" or "1". Data transformation and feature extraction are used to improve the performance of classifiers, so that a classification algorithm can produce a meaningful diagnosis. Only relevant features are selected and extracted for the particular disease; for example, a cancer patient may also have diabetes, so it is essential to separate the features related to cancer from those related to diabetes. An unsupervised learning algorithm such as PCA is a familiar choice for feature extraction, while supervised learning is appropriate for classification and predictive modelling.

Program :

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("/content/drive/MyDrive/diabetes.csv")

Finding null values and deleting the columns with missing data

Deleting the row with missing data & filling the missing values
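
The code for these two steps appears only as screenshots in the original manual; a minimal sketch of the operations the headings describe, assuming the diabetes DataFrame df loaded above:

# count null values per column
print(df.isnull().sum())
# drop columns that contain any missing data
df_cols_dropped = df.dropna(axis=1)
# drop rows that contain any missing data
df_rows_dropped = df.dropna(axis=0)
# alternatively, fill missing values with the column mean instead of dropping
df_filled = df.fillna(df.mean())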

#NORMALISATION
# importing packages
import pandas as pd
import matplotlib.pyplot as plt
# create data
df = pd.DataFrame([
    [180000, 110, 18.9, 1400],
    [360000, 905, 23.4, 1800],
    [230000, 230, 14.0, 1300],
    [60000, 450, 13.5, 1500]],
    columns=['Col A', 'Col B', 'Col C', 'Col D'])
# view data
display(df)

Output :

df.plot(kind = 'bar')
# copy the data
df_max_scaled = df.copy()
# apply normalization techniques
for column in df_max_scaled.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()
# view normalized data
display(df_max_scaled)
df_max_scaled.plot(kind = 'bar')

Output :

# copy the data


df_min_max_scaled = df.copy()
# apply normalization techniques
for column in df_min_max_scaled.columns:
    df_min_max_scaled[column] = (df_min_max_scaled[column] - df_min_max_scaled[column].min()) / (df_min_max_scaled[column].max() - df_min_max_scaled[column].min())
# view normalized data
print(df_min_max_scaled)
import matplotlib.pyplot as plt
df_min_max_scaled.plot(kind = 'bar')

Output:

Scaling :
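
The scaling example is likewise shown only as a screenshot in the original; a minimal sketch using scikit-learn's StandardScaler on the small DataFrame created above (an assumption, since the original code is not visible):

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# standardize each column to zero mean and unit variance
df_standardized = pd.DataFrame(sc.fit_transform(df), columns=df.columns)
print(df_standardized)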

Module 4
Aim:
Principal Component Analysis (PCA)

Description:
A social network dataset is a dataset containing the structural information of a social network. In the general
case, a social network dataset consists of persons connected by edges. Social network datasets can represent
friendship relationships or may be extracted from a social networking Web site.

Attributes are:
User ID
Gender
Age
Estimated Salary
Purchased

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.

Generally this is called a data reduction technique. A useful property of PCA is that you can choose the number of dimensions, or principal components, to keep in the transformed result.

Program:
#Importing of the dataset and slicing it into independent and dependent variables
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
dataset = pd.read_csv('/content/drive/MyDrive/Social_Network_Ads.csv')
dataset

Output:

Program:

Read and explore data:


X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

#Encoding of the data using label encoder


from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X[:,0] = le.fit_transform(X[:,0])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

#Feature scaling of the training and test sets of independent variables to bring the values into a comparable range
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train

Output:

Program:
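
The step that trains the classifier used below appears only as a screenshot in the original; a minimal sketch assuming a logistic-regression classifier fitted on the scaled training data:

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)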

#prediction of y
y_pred = classifier.predict(X_test)
y_pred

Output:

#evaluation of model using confusion matrix and accuracy

from sklearn.metrics import confusion_matrix,accuracy_score


cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test,y_pred)
cm

Program:

Implementation of PCA

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, precision_recall_curve, f1_score
from sklearn import metrics
from sklearn.decomposition import PCA
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
from sklearn.decomposition import PCA

pca = PCA(n_components = 2)

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

explained_variance = pca.explained_variance_ratio_
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn_classifier.fit(X_train, y_train)
y_pred_knn = knn_classifier.predict(X_test)
accuracy=accuracy_score(y_test, y_pred_knn)
precision = precision_score(y_test, y_pred_knn)
recall = recall_score(y_test, y_pred_knn)
specificity = metrics.recall_score(y_test, y_pred_knn, pos_label=0)
f=f1_score(y_test,y_pred_knn)
e=(1-accuracy)
print('Accuracy: ',accuracy)
print('Precision: ',precision)
print('Error:',e)
print('Recall: ',recall)
print('F1score: ',f)
print('Specificity',specificity)

Output:

Module 5
Aim :

Singular Value Decomposition (SVD)

Description:

The Singular Value Decomposition (SVD) of a matrix is a factorization of that matrix into three matrices. It
has some interesting algebraic properties and conveys important geometrical and theoretical insights about
linear transformations. It also has some important applications in data science.

Mathematics behind SVD

The SVD of an m x n matrix A is given by the formula:
A = U W V^T

 U: an m x n matrix whose columns are the orthonormal eigenvectors of A A^T.
 V^T: the transpose of an n x n matrix whose columns are the orthonormal eigenvectors of A^T A.
 W: an n x n diagonal matrix of the singular values, which are the square roots of the eigenvalues of A^T A.

Program :

from numpy import array


from scipy.linalg import svd
A = array([[-4,-7], [1, 4]])
print(A)
U, s, V = svd(A)
print("value of U:")
print('----------------')
print(U)
print("value of sigma(s):")
print('-----------------')
print(s)
print("value of v:")
print('-----------------')
print(V)
Output :
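The printed output appears as an image in the original; as a quick sanity check, the three factors can be recombined to recover A (a sketch; note that scipy returns the third factor already transposed):

import numpy as np

# rebuild A from the factors: U * diag(s) * V_T should reproduce the original matrix
A_reconstructed = U @ np.diag(s) @ V
print(A_reconstructed)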

Program:
Singular Value Decomposition on Image:

import numpy as np
import matplotlib.pyplot as plt
from skimage import data
from skimage.color import rgb2gray
from scipy.linalg import svd

cat = data.chelsea()
plt.imshow(cat)
# convert to grayscale
gray_cat = rgb2gray(cat)

# calculate the SVD and plot the image


U,S,V_T = svd(gray_cat, full_matrices=False)
S = np.diag(S)
fig, ax = plt.subplots(5, 2, figsize=(8, 20))

curr_fig=0
for r in [5, 10, 70, 100, 200]:
    # rank-r approximation of the image
    cat_approx = U[:, :r] @ S[0:r, :r] @ V_T[:r, :]
    ax[curr_fig][0].imshow(256 - cat_approx)
    ax[curr_fig][0].set_title("k = " + str(r))
    ax[curr_fig, 0].axis('off')
    ax[curr_fig][1].set_title("Original Image")
    ax[curr_fig][1].imshow(gray_cat)
    ax[curr_fig, 1].axis('off')
    curr_fig += 1
plt.show()

Output:

Module 6

Aim :
Linear Discriminant Analysis (LDA)

Description:

Linear Discriminant Analysis (also called Normal Discriminant Analysis or Discriminant Function Analysis) is a dimensionality reduction technique that is commonly used for supervised classification problems. It is used for modelling differences between groups, i.e. separating two or more classes, by projecting features from a higher-dimensional space into a lower-dimensional space.

Program:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('/content/drive/MyDrive/Wine.csv')
dataset

Output:

X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

Splitting of data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Applying LDA
#Apply LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
X_train

#fitting logistic regression


from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
#predict the test results
y_pred = classifier.predict(X_test)
y_pred

#accuracy by confusion matrix

from sklearn.metrics import confusion_matrix, accuracy_score


cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test,y_pred)

#visualization of test results:

from matplotlib.colors import ListedColormap

X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.legend()
plt.show()

Module 7
Aim :
Regression Analysis: Linear regression, Logistic regression, Polynomial
regression

Description:
Regression is a technique for investigating the relationship between independent variables or features and a
dependent variable or outcome. It's used as a method for predictive modelling in machine learning, in which
an algorithm is used to predict continuous outcomes.

Program:
#LINEAR REGRESSION (fitted on degree-4 polynomial features of salinity)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

data = pd.read_csv('bottle.csv',nrows=1000)
data['Salnty']=data['Salnty'].fillna(value=data['Salnty'].mean())
data['T_degC']=data['T_degC'].fillna(value=data['T_degC'].mean())
x=data[['Salnty']]
y=data['T_degC']

pf1=PolynomialFeatures(degree=4)
x1=pf1.fit_transform(x)
regr=LinearRegression()
regr.fit(x1,y)
y_pred=regr.predict(x1)

R_square = r2_score(y,y_pred)
print('Coefficient of Determination:', R_square)
ch='y'
while(ch=='y' or ch=='Y'):
    sal=float(input("Enter Salinity to Predict :"))
    sal1=pf1.fit_transform([[sal]])
    p=regr.predict(sal1)
    print("\nTemperature is ",p)
    ch=input("Enter y to calculate more : ")

Output:
Coefficient of Determination: 0.7838361038646351
Enter Salinity to Predict :32.45
Temperature is [6.07079706]
Enter y to calculate more : n

#LOGISTIC REGRESSION (the code below fits a plain LinearRegression on a train/test split; a logistic-regression sketch follows after the output)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error,mean_squared_error

data=pd.read_csv("bottle.csv",nrows=100)
data['Salnty'] = data['Salnty'].fillna(value=data['Salnty'].mean())
data['T_degC'] = data['T_degC'].fillna(value=data['T_degC'].mean())
x=data['Salnty']
y=data['T_degC']

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=3/10,random_state=0)
#Converting int 2-D arrays
x_train=x_train.to_numpy().reshape(-1, 1)
x_test=x_test.to_numpy().reshape(-1, 1)
y_train=y_train.to_numpy().reshape(-1, 1)
y_test=y_test.to_numpy().reshape(-1, 1)
reg=LinearRegression()

reg.fit(x_train,y_train)
# M and C values
print("Intercept (C) : ",reg.intercept_)
print("Slope (M) : ",reg.coef_)
#Predection of testing sets
y_pred=reg.predict(x_test)
x_pred=reg.predict(x_train)

print('Mean Absolute Error : ',mean_absolute_error(y_test,y_pred))


print('Mean Squared Error : ',mean_squared_error(y_test,y_pred))
print('Root MeanSquared Error : ',np.sqrt(mean_squared_error(y_test,y_pred)))

Output:
Intercept (C) : [131.27879866]
Slope (M) : [[-3.67906099]]
Mean Absolute Error : 1.0035873375117492
Mean Squared Error : 1.4170202202564186
Root MeanSquared Error : 1.1903865843735044
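
Despite its heading, the block above fits an ordinary LinearRegression model and reports regression errors. A minimal logistic-regression sketch on a binary target (assuming the diabetes.csv file used in later modules, with its Outcome column and two illustrative features) would look like:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv('diabetes.csv')
x = df[['Glucose', 'BMI']]      # two illustrative features
y = df['Outcome']               # binary target: diabetic or not
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
model = LogisticRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))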

POLYNOMIAL REGRESSION
import pandas as pd
import numpy as np
import tkinter
from tkinter import *
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def polyregr():
    # fit a degree-4 polynomial regression of temperature on salinity
    data = pd.read_csv('bottle.csv', nrows=1000)
    data['Salnty'] = data['Salnty'].fillna(value=data['Salnty'].mean())
    data['T_degC'] = data['T_degC'].fillna(value=data['T_degC'].mean())
    x = data[['Salnty']]
    y = data['T_degC']
    pf1 = PolynomialFeatures(degree=4)
    x1 = pf1.fit_transform(x)
    regr = LinearRegression()
    regr.fit(x1, y)
    # predict for the salinity value typed in the entry box
    v = entry.get()
    pred = np.array([[v]], dtype=float)
    p = regr.predict(pf1.fit_transform(pred))
    t1.delete(1.0, END)
    t1.insert(END, p[0])

root = Tk()
root.geometry("1000x200")
root.configure(background='black')
NameLb = Label(root, text="ENTER SALINITY:", fg="White", bg="Black")
NameLb.config(font=("Times", 20, "bold"))
NameLb.grid(row=6, column=1, pady=20, sticky=W)
entry = Entry(root, width=40)
entry.grid(row=6, column=2)
dst = Button(root, text="PREDICT", command=polyregr, fg="Red", bg="Black")
dst.config(font=("Times", 15, "bold"))
dst.grid(row=12, column=2, padx=10)
NameLb = Label(root, text="THE PREDICTED TEMPERATURE IS:", fg="White", bg="Black")
NameLb.config(font=("Times", 20, "bold"))
NameLb.grid(row=10, column=1, pady=20, sticky=W)
t1 = Text(root, height=1, width=40, bg="Black", fg="White")
t1.config(font=("arial", 15, "bold"))
t1.grid(row=10, column=2, padx=10)
root.mainloop()
Output:

Module 8
AIM:
Regularized Regression
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from statistics import mean
data = pd.read_csv('/content/drive/MyDrive/auto-mpg.csv')
data

Output:

Evaluating the model:
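
The model-fitting and evaluation code appears only as screenshots in the original. A minimal sketch continuing from the program above (the column name 'mpg' as target, and the alpha values, are assumptions for illustration):

# keep only numeric columns and drop rows with missing values
num = data.apply(pd.to_numeric, errors='coerce')
num = num.dropna(axis=1, how='all').dropna()
X = num.drop(columns=['mpg'])   # 'mpg' as the target is an assumption about auto-mpg.csv
y = num['mpg']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# ordinary linear regression as a baseline
lin = LinearRegression().fit(X_train, y_train)
print("Linear R^2:", lin.score(X_test, y_test))

# Ridge (L2 penalty) and Lasso (L1 penalty) shrink the coefficients toward zero
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print("Ridge R^2:", ridge.score(X_test, y_test))

lasso = Lasso(alpha=0.1).fit(X_train, y_train)
print("Lasso R^2:", lasso.score(X_test, y_test))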

Module 9
AIM:
K-Nearest Neighbour (kNN) Classifier

DESCRIPTION:
The k-nearest neighbours (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. It is easy to implement and understand, but has the major drawback of becoming significantly slower as the size of the data in use grows.
KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression).

PROGRAM:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score
df = pd.read_csv("/data.csv")
X = df.iloc[:, [0, 3]].values
y = df.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
classifier = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
#confusion matrix
confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1score = f1_score(y_test, y_pred)
print('Accuracy of the model:', accuracy)
print('precision of the model:', precision)
print('Recall of the model:', recall)
print('f1_score of the model:', f1score)
tp = confusion_matrix[0, 0]
fp = confusion_matrix[0, 1]
fn = confusion_matrix[1, 0]
tn = confusion_matrix[1, 1]
senstivity = tp / (tp + fn)
print('Sensitivity:', senstivity * 100)
specificity = tn / (fp + tn)
print('Specificity:', specificity * 100)

OUTPUT
[[67 12]
[20 21]]
Accuracy of the model: 0.7333333333333333
precision of the model: 0.6363636363636364
Recall of the model: 0.5121951219512195
f1_score of the model: 0.5675675675675675
Sensitivity: 77.01149425287356
Specificity: 63.63636363636363

KNN with Label Encoding and Scaling


Dataset: Social Network Ads (whether a user purchased or not)
Features: User ID, Gender, Age, Estimated Salary, Purchased
Selected X features: User ID, Gender, Age
Target y: Purchased

PROGRAM:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score
df = pd.read_csv("/data.csv")
X = df.iloc[:, [0, 2]].values
y = df.iloc[:, 4].values
#label encoding
le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
#Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
classifier = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
#confusion matrix
confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1score = f1_score(y_test, y_pred)
print('Accuracy of the model:', accuracy)
print('precision of the model:', precision)
print('Recall of the model:', recall)
print('f1_score of the model:', f1score)
tp = confusion_matrix[0, 0]
fp = confusion_matrix[0, 1]
fn = confusion_matrix[1, 0]
tn = confusion_matrix[1, 1]
senstivity = tp / (tp + fn)
print('Sensitivity:', senstivity * 100)
specificity = tn / (fp + tn)
print('Specificity:', specificity * 100)

OUTPUT
[[67 12]
[13 28]]
Accuracy of the model: 0.7916666666666666
precision of the model: 0.7
Recall of the model: 0.6829268292682927
f1_score of the model: 0.6913580246913581
Sensitivity: 83.75
Specificity: 70.0

KNN with Principal Component Analysis


PROGRAM
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score
df = pd.read_csv("/data.csv")
X = df.iloc[:, [0, 2]].values
y = df.iloc[:, 4].values
#Principal Component Analysis
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data=principalComponents, columns=["pc1", "pc2"])
finalDf = pd.concat([principalDf, df[['Purchased']]], axis=1)
finalDf = pd.DataFrame(finalDf)
#print(finalDf)
X = finalDf[['pc1', 'pc2']].values
y = finalDf['Purchased'].values

#Label Encoding and Scaling
le = LabelEncoder()
y = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#kNN classifier
classifier = KNeighborsClassifier(n_neighbors=10, metric='euclidean')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
#Confusion Matrix
confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)
accuracy = accuracy_score(y_test, y_pred) * 100
precision = precision_score(y_test, y_pred) * 100
recall = recall_score(y_test, y_pred) * 100
f1_measure = f1_score(y_test, y_pred) * 100
print('Accuracy of the model:', accuracy)
print('Precision of the model:', precision)
print('Recall of the model:', recall)
print('F1 Measure of the model:', f1_measure)
tp = confusion_matrix[0, 0]
fp = confusion_matrix[0, 1]
fn = confusion_matrix[1, 0]
tn = confusion_matrix[1, 1]
senstivity = tp / (tp + fn)
print('Sensitivity:', senstivity * 100)
specificity = tn / (fp + tn)
print('Specificity:', specificity * 100)

OUTPUT
[[75 4]
[13 28]]
Accuracy of the model: 85.83333333333333
Precision of the model: 87.5
Recall of the model: 68.29268292682927
F1 Measure of the model: 76.7123287671233
Sensitivity: 85.22727272727273

Specificity: 87.5

Module 10
AIM:
Support Vector Machines (SVMs)

DESCRIPTION:
Support vector machines (SVMs) are a set of supervised learning methods used
for classification, regression and outliers detection.
The advantages of support vector machines are:
 Effective in high dimensional spaces.
 Still effective in cases where number of dimensions is greater than the number of samples.

PROGRAM:
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix,accuracy_score

data=pd.read_csv('Social_Network_Ads.csv')
x=data.iloc[:,[2,3]].values
y=data.iloc[:,4].values

x_train,x_test,y_train,y_test=tts(x,y,test_size=0.25,random_state=0)

sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.fit_transform(x_test)

model=SVC(kernel='rbf',random_state=0)
# 'rbf' is chosen because it is nonlinear and gives better results than a linear kernel on this data
model.fit(x_train,y_train)
y_pred=model.predict(x_test)

cm=confusion_matrix(y_test,y_pred)
print("Confusion Matrix : \n",cm)
print("\nAccuracy Score :",accuracy_score(y_test,y_pred)*100)

OUTPUT:
Confusion Matrix :
[[64 4]
[ 3 29]]

Accuracy Score : 93.0
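
The comment in the program claims the 'rbf' kernel beats a linear one on this data; a quick way to check that claim (a sketch reusing the split and scaled data from above):

# compare accuracy across a few kernels on the same train/test split
for kern in ['linear', 'poly', 'rbf']:
    m = SVC(kernel=kern, random_state=0)
    m.fit(x_train, y_train)
    pred = m.predict(x_test)
    print(kern, "accuracy:", accuracy_score(y_test, pred) * 100)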

Module 11
AIM:
Random Forest model

DESCRIPTION:
The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.

PROGRAM:
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.ensemble import RandomForestClassifier

data=pd.read_csv("Social_Network_Ads.csv")
x=data.iloc[:,[2,3]].values
y=data.iloc[:,4].values

x_train,x_test,y_train,y_test=tts(x,y,test_size=0.3,random_state=0)

sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.fit_transform(x_test)

forest=RandomForestClassifier(criterion='gini',n_estimators=10)

forest.fit(x_train, y_train)

y_pred = forest.predict(x_test)
cm=confusion_matrix(y_test,y_pred)
print("Confusion Matrix : \n",cm)
print("\nAccuracy Score :",accuracy_score(y_test,y_pred)*100)

OUTPUT:
Confusion Matrix :
[[72 7]
[ 5 36]]

Accuracy Score : 90.0
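
Since the description mentions feature randomness, it can be instructive to look at how much each of the two selected columns (Age and Estimated Salary, per the attribute list in Module 4) contributes; a short sketch using the fitted forest above:

# impurity-based importance of each input column (the values sum to 1)
for name, score in zip(['Age', 'EstimatedSalary'], forest.feature_importances_):
    print(name, ":", round(score, 3))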

Module 12
AIM:
AdaBoost Classifier and XGBoost

DESCRIPTION:

AdaBoost:
AdaBoost, short for Adaptive Boosting, is a statistical classification meta-algorithm formulated
by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used
in conjunction with many other types of learning algorithms to improve performance. The output of the
other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output
of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be
generalized to multiple classes or bounded intervals on the real line.

XGBoost:
XGBoost is an optimized Gradient Boosting machine learning library. It was originally written in C++ but has APIs in several other languages. The core XGBoost algorithm is parallelizable, i.e. it performs parallelization within a single tree.

PROGRAM FOR ADABOOST:


import numpy as nm
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.model_selection import train_test_split
df= pd.read_csv('diabetes.csv')
x=df[['Age','Glucose']]
y=df['Outcome']
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.fit_transform(x_test)
abc = AdaBoostClassifier(n_estimators=50, learning_rate=1)
model = abc.fit(x_train, y_train)
y_pred = model.predict(x_test)
Accuracy =accuracy_score(y_test,y_pred)
cm=confusion_matrix(y_test,y_pred)
print("Confusion Matrix : \n",cm)
print("Accuracy:",Accuracy*100)
OUTPUT:
Confusion Matrix :
[[112 18]
[ 32 30]]
Accuracy: 73.95833333333334

PROGRAM FOR XGBOOST:
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix,accuracy_score

df= pd.read_csv('diabetes.csv')
x=df[['Age','Glucose']]
y=df['Outcome']

x_train, x_test, y_train, y_test= tts(x, y, test_size= 0.25, random_state=0)

sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.fit_transform(x_test)

model = XGBClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

cm=confusion_matrix(y_test,y_pred)
print("Confusion Matrix : \n",cm)
Accuracy =accuracy_score(y_test,y_pred)
print("Accuracy:",Accuracy*100)
OUTPUT:
Confusion Matrix :
[[103 27]
[ 30 32]]
Accuracy: 70.3125
