Machine Learning Lab Manual
(AUTONOMOUS)
Department of Computer Science & Engineering
Module 1
Aim:
Basic statistical functions for data exploration
Attributes of the dataset (11 in total):
1. Alcohol: the amount of alcohol in wine
2. Volatile acidity: the amount of acetic acid in wine; high levels lead to an unpleasant vinegar taste
3. Sulphates: a wine additive that contributes to SO2 levels and acts as an antimicrobial and antioxidant
4. Citric Acid: acts as a preservative to increase acidity (small quantities add freshness and flavour to wines)
5. Total Sulphur Dioxide: is the amount of free + bound forms of SO2
6. Density: sweeter wines have a higher density
7. Chlorides: the amount of salt in the wine
8. Fixed acidity: non-volatile acids that do not evaporate readily
9. pH: the level of acidity
10. Free Sulphur Dioxide: it prevents microbial growth and the oxidation of wine
11. Residual sugar: the amount of sugar remaining after fermentation stops. The key is a good balance
between sweetness and sourness (wines with more than 45 g/L residual sugar are considered sweet)
Program:
import pandas as pd
# load the wine-quality dataset and display it
df=pd.read_csv('wr.csv')
print(df)
Output:
Info and description of dataset:
df.describe()
Output:
df.dtypes
Output:
Maximum of each attribute:
df.max()
Output:
Standard deviation:
df['chlorides'].std()
Output:
0.0470653020100901
Mean:
df['chlorides'].mean()
Output:
0.08746654158849279
Variance:
df['chlorides'].var()
Output:
0.0022151426533009912
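As a quick consistency check, the standard deviation printed above is the square root of this variance; a small sketch using the printed variance value confirms it:
import math
# the square root of the variance should reproduce the standard deviation printed earlier
print(math.sqrt(0.0022151426533009912))   # ≈ 0.0470653, matching df['chlorides'].std()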
Rows with chlorides greater than 1:
df[df.chlorides>1]
Output:
Module 2
Aim:
Data Visualization: Box plot, scatter plot, histogram
Description:
A box plot, also known as a whisker plot, displays a five-number summary of a set of data values: minimum,
first quartile, median, third quartile and maximum. A box is drawn from the first quartile to the third
quartile, and a vertical line through the box marks the median.
Program:
#Boxplot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("/content/drive/MyDrive/diabetes.csv")
#creation of boxplot with dataset
df.boxplot(figsize = (5,5))
Output:
# box plot of normally distributed random data
np.random.seed(10)
data = np.random.normal(100, 20, 200)
fig = plt.figure(figsize=(10, 7))
plt.boxplot(data)
plt.show()
Output:
Q1=df['Insulin'].quantile(0.25)
Q2=df['Insulin'].quantile(0.50)
Q3=df['Insulin'].quantile(0.75)
IQR=Q3-Q1
# the 1.5*IQR fences below Q1 and above Q3 mark the usual outlier (whisker) limits
LowestQuartile=Q1-(1.5*IQR)
HighestQuartile=Q3+(1.5*IQR)
print("First Quartile is :",Q1)
print("Second Quartile (median) is :",Q2)
print("Third Quartile is :",Q3)
print("IQR is:",IQR)
print("Lower fence is:",LowestQuartile)
print("Upper fence is:",HighestQuartile)
df.boxplot(column="Insulin")
Output:
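The fence values computed above can also be used to drop the outliers before redrawing the box plot; a minimal sketch continuing from the same DataFrame and variables:
# keep only the rows whose Insulin value lies inside the 1.5*IQR fences
df_no_outliers = df[(df['Insulin'] >= LowestQuartile) & (df['Insulin'] <= HighestQuartile)]
df_no_outliers.boxplot(column="Insulin")
plt.show()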
Scatter plot:
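A minimal scatter-plot sketch on the same diabetes dataset (the choice of the Glucose and Insulin columns is illustrative):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("/content/drive/MyDrive/diabetes.csv")
# scatter plot of two numeric attributes, coloured by the class label
plt.scatter(df['Glucose'], df['Insulin'], c=df['Outcome'], cmap='coolwarm', s=10)
plt.xlabel('Glucose')
plt.ylabel('Insulin')
plt.title('Glucose vs Insulin')
plt.show()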
Histogram:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("/content/drive/MyDrive/diabetes.csv")
df.hist()
Output:
Module 3
Aim :
Data Preprocessing: Handling missing values, outliers, normalization, Scaling
Description:
Data preprocessing is essential before the data is actually used. It is the process of turning raw data into a
clean data set: the dataset is checked for missing values, noisy data and other inconsistencies before it is
passed to the algorithm. The data must also be in a format appropriate for ML; for example, if the algorithm
processes only numeric data, then a class labelled “malignant” or “benign” must be replaced by “0” or “1”.
Data transformation and feature extraction are used to improve the performance of classifiers, so that a
classification algorithm can produce a meaningful diagnosis. Only the features relevant to the particular
disease are selected and extracted; for example, a cancer patient may also have diabetes, so it is essential to
separate the features related to cancer from those related to diabetes. An unsupervised learning algorithm
such as PCA is a familiar choice for feature extraction, while supervised learning is appropriate for
classification and predictive modeling.
Program :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("/content/drive/MyDrive/diabetes.csv")
Finding null values and deleting the columns with missing data
Deleting the row with missing data & filling the missing values
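A minimal sketch of the usual pandas calls for these two steps, continuing with the df loaded above:
# count missing values in every column
print(df.isnull().sum())
# delete the columns that contain missing data
df_cols_dropped = df.dropna(axis=1)
# delete the rows that contain missing data
df_rows_dropped = df.dropna(axis=0)
# alternatively, fill the missing values with the column mean
df_filled = df.fillna(df.mean(numeric_only=True))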
#NORMALISATION
# importing packages
import pandas as pd
import matplotlib.pyplot as plt
# create data
df = pd.DataFrame([
    [180000, 110, 18.9, 1400],
    [360000, 905, 23.4, 1800],
    [230000, 230, 14.0, 1300],
    [60000, 450, 13.5, 1500]],
    columns=['Col A', 'Col B', 'Col C', 'Col D'])   # column names restored for illustration
# display the raw data
display(df)
Output :
df.plot(kind = 'bar')
# copy the data
df_max_scaled = df.copy()
# apply normalization techniques
for column in df_max_scaled.columns:
df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()
# view normalized data
display(df_max_scaled)
df_max_scaled.plot(kind = 'bar')
Output :
Output:
Scaling :
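A minimal scaling sketch using scikit-learn's scalers on the same diabetes DataFrame (the choice of scalers is illustrative):
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/diabetes.csv")
# standardisation: every column rescaled to zero mean and unit variance
std_scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
# min-max scaling: every column squeezed into the range [0, 1]
minmax_scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(std_scaled.head())
print(minmax_scaled.head())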
Module 4
Aim:
Principal Component Analysis (PCA)
Description:
A social network dataset contains the structural information of a social network: in the general case it
consists of persons connected by edges. Such datasets can represent friendship relationships or may be
extracted from a social networking web site.
Attributes are:
User ID
Gender
Age
Estimated Salary
Purchased
Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.
Generally this is called a data reduction technique. A property of PCA is that you can choose the number of
dimensions or principal components in the transformed result.
Program:
#Importing of the dataset and slicing it into independent and dependent variables
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
dataset = pd.read_csv('/content/drive/MyDrive/Social_Network_Ads.csv')
dataset
Output:
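The slicing and train/test split mentioned in the comment above can be sketched as follows (the feature columns Age and EstimatedSalary, the target Purchased, and the 25% test size are assumptions consistent with the later modules):
# independent variables (Age, EstimatedSalary) and dependent variable (Purchased)
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)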
Program:
#Feature scaling of the training and test sets of independent variables to bring the values into a comparable range
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train
Output:
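The classifier used for the prediction below is not shown; a sketch assuming a logistic-regression model (any scikit-learn classifier would be fitted the same way):
from sklearn.linear_model import LogisticRegression
# fit a classifier on the scaled training data (the specific model is an assumption)
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)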
Program:
#prediction of y
y_pred = classifier.predict(X_test)
y_pred
Output:
Program:
#Implementation of PCA, followed by a kNN classifier on the reduced features
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn import metrics
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn_classifier.fit(X_train, y_train)
y_pred_knn = knn_classifier.predict(X_test)
accuracy=accuracy_score(y_test, y_pred_knn)
precision = precision_score(y_test, y_pred_knn)
recall = recall_score(y_test, y_pred_knn)
specificity = metrics.recall_score(y_test, y_pred_knn, pos_label=0)
f=f1_score(y_test,y_pred_knn)
e=(1-accuracy)
print('Accuracy: ',accuracy)
print('Precision: ',precision)
print('Error:',e)
print('Recall: ',recall)
print('F1score: ',f)
print('Specificity',specificity)
Output:
Module 5
Aim :
Singular Value Decomposition (SVD)
Description:
The Singular Value Decomposition (SVD) of a matrix is a factorization of that matrix into three matrices. It
has some interesting algebraic properties and conveys important geometrical and theoretical insights about
linear transformations. It also has some important applications in data science.
Program :
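A minimal NumPy/SciPy sketch of the SVD of a small matrix (the example matrix is illustrative):
import numpy as np
from scipy.linalg import svd
# SVD factorizes A into U * diag(s) * V^T
A = np.array([[3, 2, 2],
              [2, 3, -2]])
U, s, V_T = svd(A)
print("U:\n", U)
print("Singular values:", s)
print("V^T:\n", V_T)
# rebuild A from the factors to verify the decomposition
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)
print("Reconstructed A:\n", U @ S @ V_T)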
Program:
Singular Value Decomposition on Image:
import numpy as np
import matplotlib.pyplot as plt
from skimage import data
from skimage.color import rgb2gray
from scipy.linalg import svd
cat = data.chelsea()
plt.imshow(cat)
# convert to grayscale
gray_cat = rgb2gray(cat)
# compute the SVD of the grayscale image
U, s, V_T = svd(gray_cat, full_matrices=False)
S = np.diag(s)          # diagonal matrix of singular values
fig, ax = plt.subplots(5, 2, figsize=(8, 20))
curr_fig = 0
for r in [5, 10, 70, 100, 200]:
    # rank-r approximation built from the first r singular values/vectors
    cat_approx = U[:, :r] @ S[0:r, :r] @ V_T[:r, :]
    ax[curr_fig][0].imshow(256 - cat_approx)
    ax[curr_fig][0].set_title("k = " + str(r))
    ax[curr_fig, 0].axis('off')
    ax[curr_fig][1].set_title("Original Image")
    ax[curr_fig][1].imshow(gray_cat)
    ax[curr_fig, 1].axis('off')
    curr_fig += 1
plt.show()
Output:
Module 6
Aim :
Linear Discriminant Analysis (LDA)
Description:
Linear Discriminant Analysis (LDA) is a supervised dimensionality-reduction technique. It projects the data
onto a small number of axes (linear discriminants) chosen to maximise the separation between the class
labels, and is commonly used as a preprocessing step before classification.
Program:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('/content/drive/MyDrive/Wine.csv')
dataset
Output:
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values
Splitting of data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Applying LDA
#Apply LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
X_train
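A sketch of the usual next step, fitting a classifier on the two LDA components and checking its accuracy (the choice of logistic regression is an assumption):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
# train on the LDA-transformed features and evaluate on the test split
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Confusion Matrix :\n", confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))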
Module 7
Aim :
Regression Analysis: Linear regression, Logistic regression, Polynomial regression
Description:
Regression is a technique for investigating the relationship between independent variables or features and a
dependent variable or outcome. It's used as a method for predictive modelling in machine learning, in which
an algorithm is used to predict continuous outcomes.
Program:
#LINEAR REGRESSION (the model is fitted on degree-4 polynomial features of salinity)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
data = pd.read_csv('bottle.csv',nrows=1000)
data['Salnty']=data['Salnty'].fillna(value=data['Salnty'].mean())
data['T_degC']=data['T_degC'].fillna(value=data['T_degC'].mean())
x=data[['Salnty']]
y=data['T_degC']
pf1=PolynomialFeatures(degree=4)
x1=pf1.fit_transform(x)
regr=LinearRegression()
regr.fit(x1,y)
y_pred=regr.predict(x1)
R_square = r2_score(y,y_pred)
print('Coefficient of Determination:', R_square)
ch='y'
while(ch=='y' or ch=='Y'):
    sal=float(input("Enter Salinity to Predict :"))
    sal1=pf1.fit_transform([[sal]])
    p=regr.predict(sal1)
    print("\nTemperature is ",p)
    ch=input("Enter y to calculate more : ")
Output:
Coefficient of Determination: 0.7838361038646351
Enter Salinity to Predict :32.45
Temperature is [6.07079706]
Enter y to calculate more : n
#LOGISTIC REGRESSION
# note: the code below actually fits an ordinary least-squares line (LinearRegression);
# a logistic-regression sketch is given after the output
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error,mean_squared_error
data=pd.read_csv("bottle.csv",nrows=100)
data['Salnty'] = data['Salnty'].fillna(value=data['Salnty'].mean())
data['T_degC'] = data['T_degC'].fillna(value=data['T_degC'].mean())
x=data['Salnty']
y=data['T_degC']
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=3/10,random_state=0)
#Converting into 2-D arrays
x_train=x_train.to_numpy().reshape(-1, 1)
x_test=x_test.to_numpy().reshape(-1, 1)
y_train=y_train.to_numpy().reshape(-1, 1)
y_test=y_test.to_numpy().reshape(-1, 1)
reg=LinearRegression()
reg.fit(x_train,y_train)
# M and C values
print("Intercept (C) : ",reg.intercept_)
print("Slope (M) : ",reg.coef_)
#Prediction on the test and training sets
y_pred=reg.predict(x_test)
x_pred=reg.predict(x_train)
#Error metrics on the test set
print("Mean Absolute Error : ",mean_absolute_error(y_test,y_pred))
print("Mean Squared Error : ",mean_squared_error(y_test,y_pred))
print("Root MeanSquared Error : ",np.sqrt(mean_squared_error(y_test,y_pred)))
Output:
Intercept (C) : [131.27879866]
Slope (M) : [[-3.67906099]]
Mean Absolute Error : 1.0035873375117492
Mean Squared Error : 1.4170202202564186
Root MeanSquared Error : 1.1903865843735044
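The block above, despite its heading, fits an ordinary least-squares line. A minimal sketch of actual logistic regression on a binary target, using the diabetes dataset from the other modules (the feature columns are illustrative):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
data = pd.read_csv("diabetes.csv")
x = data[['Glucose', 'Age']]
y = data['Outcome']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
# logistic regression models the probability of the positive class
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("Confusion Matrix :\n", confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))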
#POLYNOMIAL REGRESSION (with a simple Tkinter GUI)
import pandas as pd
import numpy as np
import tkinter
from tkinter import *
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
def polyregr():
    # fit a degree-4 polynomial regression of temperature on salinity
    data = pd.read_csv('bottle.csv', nrows=1000)
    data['Salnty'] = data['Salnty'].fillna(value=data['Salnty'].mean())
    data['T_degC'] = data['T_degC'].fillna(value=data['T_degC'].mean())
    x = data[['Salnty']]
    y = data['T_degC']
    pf1 = PolynomialFeatures(degree=4)
    x1 = pf1.fit_transform(x)
    regr = LinearRegression()
    regr.fit(x1, y)
    # predict the temperature for the salinity entered in the GUI
    v = entry.get()
    pred = np.array([[v]], dtype=float)
    p = regr.predict(pf1.fit_transform(pred))
    t1.delete(1.0, END)
    t1.insert(END, p[0])
root =Tk()
root.geometry("1000x200")
root.configure(background='black')
NameLb = Label(root, text="ENTER SALINITY:", fg="White",bg="Black")
NameLb.config(font=("Times",20,"bold"))
NameLb.grid(row=6, column=1, pady=20, sticky=W)
entry= Entry(root,width=40)
entry.grid(row=6,column=2)
dst = Button(root, text="PREDICT", command=polyregr,fg="Red",bg="Black")
dst.config(font=("Times",15,"bold"))
dst.grid(row=12, column=2,padx=10)
NameLb = Label(root, text="THE PREDICTED TEMPERATURE IS:", fg="White",bg="Black")
NameLb.config(font=("Times",20,"bold"))
NameLb.grid(row=10, column=1, pady=20, sticky=W)
t1 = Text(root, height=1, width=40,bg="Black",fg="White")
t1.config(font=("arial",15,"bold"))
t1.grid(row=10, column=2, padx=10)
root.mainloop()
Output:
Module 8
AIM:
Regularized Regression
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from statistics import mean
data = pd.read_csv('/content/drive/MyDrive/auto-mpg.csv')
data
Output:
Evaluating the model:
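A sketch of fitting and evaluating Ridge and Lasso on the mpg data, using the imports above (the feature and target column names are assumptions about the auto-mpg file):
# assumed numeric feature columns and target column
X = data[['cylinders', 'displacement', 'weight', 'acceleration']]
y = data['mpg']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# plain linear regression as a baseline
lin = LinearRegression().fit(X_train, y_train)
print("Linear R^2 :", lin.score(X_test, y_test))
# Ridge (L2 penalty) and Lasso (L1 penalty) shrink the coefficients to reduce overfitting
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print("Ridge R^2  :", ridge.score(X_test, y_test))
print("Ridge CV   :", mean(cross_val_score(ridge, X, y, cv=5)))
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
print("Lasso R^2  :", lasso.score(X_test, y_test))
print("Lasso CV   :", mean(cross_val_score(lasso, X, y, cv=5)))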
Module 9
AIM:
K-Nearest Neighbour (kNN) Classifier
DESCRIPTION:
The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be
used to solve both classification and regression problems. It is easy to implement and understand, but has the
major drawback of becoming significantly slower as the size of the data in use grows.
KNN works by finding the distances between a query and all the examples in the data, selecting the specified
number of examples (K) closest to the query, and then voting for the most frequent label (in the case of
classification) or averaging the labels (in the case of regression).
PROGRAM:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score
df = pd.read_csv("/data.csv")
X = df.iloc[:, [0,3]].values
y = df.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
classifier = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
#confusion matrix
confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)
accuracy = accuracy_score(y_test, y_pred)
recall=recall_score(y_test,y_pred)
precision = precision_score(y_test, y_pred)
f1score=f1_score(y_test,y_pred)
print('Accuracy of the model:',accuracy)
print('precision of the model:',precision)
print('Recall of the model:',recall)
print('f1_score of the model:',f1score)
# note: with sklearn's default label ordering [0, 1], confusion_matrix[0, 0] counts class 0
# (true negatives if 1 is the positive class); the tp/tn names below follow the manual's convention
tp=confusion_matrix[0,0]
fp=confusion_matrix[0,1]
fn=confusion_matrix[1,0]
tn=confusion_matrix[1,1]
sensitivity=tp/(tp+fn)
print('Sensitivity:',sensitivity*100)
specificity=tn/(fp+tn)
print('Specificity:',specificity*100)
OUTPUT
[[67 12]
[20 21]]
Accuracy of the model: 0.7333333333333333
precision of the model: 0.6363636363636364
Recall of the model: 0.5121951219512195
f1_score of the model: 0.5675675675675675
Sensitivity: 77.01149425287356
Specificity: 63.63636363636363
PROGRAM:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score
df = pd.read_csv("/data.csv")
X= df.iloc[:, [0,2]].values
y= df.iloc[:, 4].values
#label encoding of the class labels
le = LabelEncoder()
y = le.fit_transform(y)
# the train/test split, kNN fitting and metric calculations (not reproduced here) follow the previous program
OUTPUT
[[67 12]
[13 28]]
Accuracy of the model: 0.7916666666666666
precision of the model: 0.7
Recall of the model: 0.6829268292682927
f1_score of the model: 0.6913580246913581
Sensitivity: 83.75
Specificity: 70.0
#Label Encoding and Scaling
le = LabelEncoder()
y = le.fit_transform(y)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0)
sc= StandardScaler()
X_train= sc.fit_transform(X_train)
X_test= sc.transform(X_test)
#kNN classifier
classifier = KNeighborsClassifier(n_neighbors=10,metric='euclidean')
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)
#Confusion Matrix
confusion_matrix=confusion_matrix(y_test,y_pred)
print(confusion_matrix)
accuracy=accuracy_score(y_test,y_pred)*100
precision=precision_score(y_test,y_pred)*100
recall=recall_score(y_test,y_pred)*100
f1_measure=f1_score(y_test,y_pred)*100
print('Accuracy of the model:',accuracy)
print('Precision of the model:',precision)
print('Recall of the model:',recall)
print('F1 Measure of the model:',f1_measure)
tp=confusion_matrix[0,0]
fp=confusion_matrix[0,1]
fn=confusion_matrix[1,0]
tn=confusion_matrix[1,1]
sensitivity=tp/(tp+fn)
print('Sensitivity:',sensitivity*100)
specificity=tn/(fp+tn)
print('Specificity:',specificity*100)
OUTPUT
[[75 4]
[13 28]]
Accuracy of the model: 85.83333333333333
Precision of the model: 87.5
Recall of the model: 68.29268292682927
F1 Measure of the model: 76.7123287671233
Sensitivity: 85.22727272727273
Specificity: 87.5
Module 10
AIM:
Support Vector Machines (SVMs)
DESCRIPTION:
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression
and outlier detection.
The advantages of support vector machines are:
Effective in high-dimensional spaces.
Still effective in cases where the number of dimensions is greater than the number of samples.
PROGRAM:
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix,accuracy_score
data=pd.read_csv('Social_Network_Ads.csv')
x=data.iloc[:,[2,3]].values
y=data.iloc[:,4].values
x_train,x_test,y_train,y_test=tts(x,y,test_size=0.25,random_state=0)
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)   # transform (not refit) the test data with the scaler fitted on the training data
model=SVC(kernel='rbf',random_state=0)
# Why ‘rbf’, because it is nonlinear and gives better results as compared to linear
model.fit(x_train,y_train)
y_pred=model.predict(x_test)
cm=confusion_matrix(y_test,y_pred)
print("Confusion Matrix : \n",cm)
print("\nAccuracy Score :",accuracy_score(y_test,y_pred)*100)
OUTPUT:
Confusion Matrix :
[[64 4]
[ 3 29]]
Accuracy Score : 93.0
Module 11
AIM:
Random Forest model
DESCRIPTION:
The random forest is a classification algorithm consisting of many decision trees. It uses bagging
and feature randomness when building each individual tree, trying to create an uncorrelated forest of trees
whose prediction by committee is more accurate than that of any individual tree.
PROGRAM:
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.ensemble import RandomForestClassifier
data=pd.read_csv("Social_Network_Ads.csv")
x=data.iloc[:,[2,3]].values
y=data.iloc[:,4].values
x_train,x_test,y_train,y_test=tts(x,y,test_size=0.3,random_state=0)
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)   # transform (not refit) the test data with the scaler fitted on the training data
forest=RandomForestClassifier(criterion='gini',n_estimators=10)
forest.fit(x_train, y_train)
y_pred = forest.predict(x_test)
cm=confusion_matrix(y_test,y_pred)
print("Confusion Matrix : \n",cm)
print("\nAccuracy Score :",accuracy_score(y_test,y_pred)*100)
OUTPUT:
Confusion Matrix :
[[72 7]
[ 5 36]]
Accuracy Score : 90.0
Module 12
AIM:
AdaBoost Classifier and XGBoost
DESCRIPTION:
AdaBoost:
AdaBoost, short for Adaptive Boosting, is a statistical classification meta-algorithm formulated
by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used
in conjunction with many other types of learning algorithms to improve performance. The output of the
other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output
of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be
generalized to multiple classes or bounded intervals on the real line.
XGBoost:
XGBoost is an optimized gradient boosting machine learning library. It was originally written in C++ but
has APIs in several other languages. The core XGBoost algorithm is parallelizable, i.e. it can parallelize
work within a single tree.
PROGRAM FOR XGBOOST:
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix,accuracy_score
df= pd.read_csv('diabetes.csv')
x=df[['Age','Glucose']]
y=df['Outcome']
# train/test split (assumed parameters; a 25% test size matches the 192 test samples in the output below)
x_train,x_test,y_train,y_test=tts(x,y,test_size=0.25,random_state=0)
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)   # transform (not refit) the test data with the scaler fitted on the training data
model = XGBClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
cm=confusion_matrix(y_test,y_pred)
print("Confusion Matrix : \n",cm)
Accuracy =accuracy_score(y_test,y_pred)
print("Accuracy:",Accuracy*100)
OUTPUT:
Confusion Matrix :
[[103 27]
[ 30 32]]
Accuracy: 70.3125
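The AIM also names the AdaBoost classifier; a minimal AdaBoost sketch on the same diabetes data (the split parameters are assumptions):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
df = pd.read_csv('diabetes.csv')
x = df[['Age', 'Glucose']]
y = df['Outcome']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
# AdaBoost combines many weak learners (decision stumps by default) into a weighted vote
model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("Confusion Matrix :\n", confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred) * 100)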