0% found this document useful (0 votes)
74 views30 pages

ML Lab Programs..Manual 02072022

Uploaded by

Mogili siva
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
74 views30 pages

ML Lab Programs..Manual 02072022

Uploaded by

Mogili siva
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 30

JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD

B.Tech. in COMPUTER SCIENCE AND ENGINEERING COURSE STRUCTURE & SYLLABUS (R18)
Applicable From 2019-20 Admitted Batch
CS604PC: MACHINE LEARNING LAB
Syllabus
III Year B.Tech. CSE II-Sem L T P C
0 0 3 1.5
Course Objective: The objective of this lab is to get an overview of the various machine learning techniques and can able
to demonstrate them using python.
Course Outcomes: After the completion of the course the student can able to:
1. understand complexity of Machine Learning algorithms and their limitations;
2. understand modern notions in data analysis-oriented computing;
3. be capable of confidently applying common Machine Learning algorithms in practice and implementing their own;
4. Be capable of performing experiments in Machine Learning using real-world data.
List of Experiments
1. The probability that it is Friday and that a student is absent is 3 %. Since there are 5 school days in a week,
the probability that it is Friday is 20 %. What is the probability that a student is absent given that today is
Friday? Apply Bayes’ rule in python to get the result. (Ans: 15%)
2. Extract the data from database using python
3. Implement k-nearest neighbors classification using python
4. Given the following data, which specify classifications for nine combinations of VAR1 and VAR2 predict a
classification for a case where VAR1=0.906 and VAR2=0.606, using the result of k- means clustering with
3 means (i.e., 3 centroids)
VAR1 VAR2 CLASS
1.713 1.586 0
0.180 1.786 1
0.353 1.240 1
0.940 1.566 0
1.486 0.759 1
1.266 1.106 0
1.540 0.419 1
0.459 1.799 1
0.773 0.186 1
5. The following training examples map descriptions of individuals onto high, medium and low credit-
worthiness.
medium skiing design single twenties no -> highRisk high golf
trading married forties yes -> lowRisk
low speedway transport married thirties yes -> medRisk medium
football banking single thirties yes -> lowRisk high flying media
married fifties yes -> highRisk
low football security single twenties no -> medRisk medium
golf media single thirties yes -> medRisk medium golf
transport married forties yes -> lowRisk high skiing banking
single thirties yes -> highRisk low golf unemployed married
forties yes -> highRisk
Input attributes are (from left to right) income, recreation, job, status, age-group, home-owner. Find the unconditional
probability of `golf' and the conditional probability of `single' given `medRisk' in the dataset?
6. Implement linear regression using python.
7. Implement Naïve Bayes theorem to classify the English text
8. Implement an algorithm to demonstrate the significance of genetic algorithm
9. Implement the finite words classification system using Back-propagation algorithm
***
# ML LAB PROGRAM #1:
The probability that it is Friday and that a student is absent is 3 %. Since there are 5 school
days in a week, the probability that it is Friday is 20 %. What is the probability that a student
is absent given that today is Friday? Apply Bayes’ rule in python to get the result. (Ans: 15%)
Short Notes on Conditional Probability & Bayes Theorem:

Conditional Probability:
The conditional probability of the event is the probability that the event will occur, provided the information that an
event A has already occurred. This probability can be written as P(B|A), notation signifies the probability of B given A.

Conditional probability is the probability that an event has occurred, considering some additional information about
the outcomes of an experiment.

Mathematically, if the events A and B are not independent events, then the probability of the interaction of A and B (the
probability of occurrence of both events) is then given by:
P( A  B) = P( A)P(B | A) .
Bayes Theorem:
Bayes’ theorem is a way to figure out conditional probability. Conditional probability is the probability of an event
happening, given that it has some relationship to one or more other events

Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the posterior
probability P(hlD), from the prior probability P(h), together with P(D) and P(D(h).

P ( D | h) P ( h)
P(h | D) =
P( D)

"""
Solution:
A: Student is absent
F: It is Friday
P(A∩F) = 0.03 and P(F)=0.2. The Problem is to find P(A|F)
By the Bayes Conditional Probability P(A|F)=P(A∩F)/P(F)
P(A|F)=0.03/0.02=0.15
"""
#Source Code

def bayes_theorem(p_f, p_a_and_f):


p_a_given_f = (p_a_and_f)/ p_f
return p_a_given_f

# P(A∩F)
p_a_and_f = 0.03
# P(F)
p_f = 0.2

# calculate P(A|F)
result = bayes_theorem(p_f, p_a_and_f)
# summarize
print('Probability that a student is absent on Friday is: P(A|F)= %.2f%%' %
(result*100))

OUTPUT: # ML LAB PROGRAM #1:


Probability that a student is absent on Friday is: P(A|F)= 15.00%
Source Code File:

bayesTherm1.py
(select, copy and paste in destination folder and execute)
# ML LAB PROGRAM #2:
2. Extract the data from database using python
Procedure:
You’ll learn the following MySQL SELECT operations from Python using a ‘MySQL
Connector Python’ module.
Execute the SELECT query and process the result set returned by the query in Python.
Use Python variables in a where clause of a SELECT query to pass dynamic values.
Use fetchall(), fetchmany(), and fetchone() methods of a cursor class to fetch all or limited rows from a
table.
Python Select from MySQL Table
This article demonstrates how to select rows of a MySQL table in Python.
You’ll learn the following MySQL SELECT operations from Python using a ‘MySQL Connector
Python’ module
Execute the SELECT query and process the result set returned by the querying Python.
Use Python variables in a where clause of a SELECT query to pass dynamic values.
Use fetchall(),fetchmany(), and fetchone() methods of a cursor class to fetch all or limited rows from
atable.
Steps to fetch rows from a MySQL database table Follow these steps:
SOURCE CODE:
# -*- coding: utf-8 -*-
"""
Created on Wed Apr 20 09:11:24 2022
@author: RKCSE
mysql db connectivity
"""
"""
Install MySQL Driver
Python needs a MySQL driver to access the MySQL database.
In this tutorial we will use the driver "MySQL Connector".
We recommend that you use PIP to install "MySQL Connector".
PIP is most likely already installed in your Python environment.
Navigate your command line to the location of PIP, and type the following:
Download and install "MySQL Connector":
# pip install mysql-connector-python
"""
pip install mysql-connector-python

import mysql.connector as mysql

db = mysql.connect(
host = "localhost",
charset ='utf8',
user = "root",
passwd = "2009"
)
# create a database named "mydatabase":
cursor = db.cursor()
cursor.execute("CREATE DATABASE vgnt;")
cursor.execute("SHOW DATABASES")
for x in cursor:
print(x)

cursor.execute("USE vgnt")
cursor.execute("CREATE TABLE customers (name VARCHAR(255), address VARCHAR(255))")

sql = "INSERT INTO customers (name, address) VALUES (%s, %s)"


val = ("John", "Highway 21")
cursor.execute(sql, val)
db.commit()
print(cursor.rowcount, "record inserted.")
sql = "INSERT INTO customers (name, address) VALUES (%s, %s)"
val = [
('Peter', 'Lowstreet 4'),
('Amy', 'Apple st 652'),
('Hannah', 'Mountain 21'),
('Michael', 'Valley 345'),
('Sandy', 'Ocean blvd 2'),
('Betty', 'Green Grass 1'),
('Richard', 'Sky st 331'),
('Susan', 'One way 98'),
('Vicky', 'Yellow Garden 2'),
('Ben', 'Park Lane 38'),
('William', 'Central st 954'),
('Chuck', 'Main Road 989'),
('Viola', 'Sideway 1633')
]
cursor.executemany(sql, val)
db.commit()
print(cursor.rowcount, "was inserted.")
cursor.execute("select * from customers")
for x in cursor:
print(x)
sql = "INSERT INTO customers (name, address) VALUES (%s, %s)"
val = ("Michelle", "Blue Village")
cursor.execute(sql, val)
db.commit()
print("1 record inserted, ID:", cursor.lastrowid)
# Select records where the address contains the word "way":
sql = "SELECT * FROM customers WHERE address ='Park Lane 38'"
cursor.execute(sql)

myresult = cursor.fetchall()

for x in myresult:
print(x)

sql = "SELECT * FROM customers WHERE address LIKE '%way%'"


cursor.execute(sql)
result = cursor.fetchall()

for x in result:
print(x)
# Sort the result alphabetically by name: result:
sql = "SELECT * FROM customers ORDER BY name"
cursor.execute(sql)
for x in cursor:
print(x)

# Delete any record where the address is "Mountain 21":


sql = "DELETE FROM customers WHERE address = 'Mountain 21'"
cursor.execute(sql)
db.commit()
print(cursor.rowcount, "record(s) deleted")

cursor.execute("show tables")
for x in cursor:
print(x)

cursor.execute("select * from customers")


for x in cursor:
print(x)

# Delete the table "customers":


sql = "DROP TABLE customers"
cursor.execute(sql)

cursor.execute("show tables")
for x in cursor:
print(x)

Output: ML LAB PROGRAM #2:


OUTPUT OF THE ABOVE SAME PROGRAM IN TEXT:
Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 7.29.0 -- An enhanced Interactive Python.

import mysql.connector as mysql

db = mysql.connect(
host = "localhost",
charset ='utf8',
user = "root",
passwd = "1234"
)

cursor = db.cursor()
#"CREATE DATABASE vgnt;”
#"SHOW DATABASES"

('information_schema',)
('mysql',)
('performance_schema',)
('sakila',)
('sys',)
('vgnt',)
('world',)

"""
"CREATE TABLE customers (name VARCHAR(255), address VARCHAR(255)) "
sql = "INSERT INTO customers (name, address) VALUES (%s, %s)"
val = ("John", "Highway 21")
cursor.execute(sql, val)
db.commit()
print(cursor.rowcount, "record inserted.")
"""
1 record inserted.
"""
sql = "INSERT INTO customers (name, address) VALUES (%s, %s)"
val = [
('Peter', 'Lowstreet 4'),
('Amy', 'Apple st 652'),
('Hannah', 'Mountain 21'),
('Michael', 'Valley 345'),
('Sandy', 'Ocean blvd 2'),
('Betty', 'Green Grass 1'),
('Richard', 'Sky st 331'),
('Susan', 'One way 98'),
('Vicky', 'Yellow Garden 2'),
('Ben', 'Park Lane 38'),
('William', 'Central st 954'),
('Chuck', 'Main Road 989'),
('Viola', 'Sideway 1633')
]
"""

13 was inserted.
('John', 'Highway 21')
('Peter', 'Lowstreet 4')
('Amy', 'Apple st 652')
('Hannah', 'Mountain 21')
('Michael', 'Valley 345')
('Sandy', 'Ocean blvd 2')
('Betty', 'Green Grass 1')
('Richard', 'Sky st 331')
('Susan', 'One way 98')
('Vicky', 'Yellow Garden 2')
('Ben', 'Park Lane 38')
('William', 'Central st 954')
('Chuck', 'Main Road 989')
('Viola', 'Sideway 1633')

"""
sql = "SELECT * FROM customers WHERE address LIKE '%way%'"
('John', 'Highway 21')
('Susan', 'One way 98')
('Viola', 'Sideway 1633')

sql = "SELECT * FROM customers ORDER BY name"

# Delete any record where the address is "Mountain 21":


sql = "DELETE FROM customers WHERE address = 'Mountain 21'"
cursor.execute(sql)
db.commit()
print(cursor.rowcount, "record(s) deleted")

cursor.execute("show tables")
for x in cursor:
print(x)
"""
cursor.execute("select * from customers")
for x in cursor:
print(x)
"""

('Amy', 'Apple st 652')


('Ben', 'Park Lane 38')
('Betty', 'Green Grass 1')
('Chuck', 'Main Road 989')
('Hannah', 'Mountain 21')
('John', 'Highway 21')
('Michael', 'Valley 345')
('Michelle', 'Blue Village')
('Peter', 'Lowstreet 4')
('Richard', 'Sky st 331')
('Sandy', 'Ocean blvd 2')
('Susan', 'One way 98')
('Vicky', 'Yellow Garden 2')
('Viola', 'Sideway 1633')
('William', 'Central st 954')
1 record(s) deleted
('customers',)
('John', 'Highway 21')
('Peter', 'Lowstreet 4')
('Amy', 'Apple st 652')
('Michael', 'Valley 345')
('Sandy', 'Ocean blvd 2')
('Betty', 'Green Grass 1')
('Richard', 'Sky st 331')
('Susan', 'One way 98')
('Vicky', 'Yellow Garden 2')
('Ben', 'Park Lane 38')
('William', 'Central st 954')
('Chuck', 'Main Road 989')
('Viola', 'Sideway 1633')
('Michelle', 'Blue Village')
OUTPUT: (screen shots)
#Source Code File
(Select, Copy and Paste in Destination Folder)

pymysql.py
# ML LAB PROGRAM #5:
"""
The following training examples map descriptions of individuals onto high, medium and
low credit-worthiness.
#headers/attributes
income, recreation, job, status, age-group, home-owner
medium,skiing,design,single,twenties,no,highRisk
high,golf,trading,married,forties,yes, lowRisk
low,speedway,transport,married,thirties,yes,medRisk
medium,football,banking,single,thirties,yes,lowRisk
high,flying,media,married,fifties,yes,highRisk
low,football,security,single,twenties,no,medRisk
medium,golf,media,single,thirties,yes,medRisk
medium,golf,transport,married,forties,yes,lowRisk
high,skiing,banking,single,thirties,yes,highRisk
low,golf,unemployed,married,forties,yes,highRisk

#DATA #data has header


Find the unconditional probability of `golf' and
the conditional probability of `single' given `medRisk' in the dataset?
"""
Conditional Probability:
The conditional probability of the event is the probability that the event will occur, provided the information
that an event A has already occurred. This probability can be written as P(B|A), notation signifies the
probability of B given A.

Conditional probability is the probability that an event has occurred, considering some additional
information about the outcomes of an experiment.

Mathematically, if the events A and B are not independent events, then the probability of the interaction of
A and B (the probability of occurrence of both events) is then given by:
P( A  B) = P( A)P(B | A) .
Bayes Theorem:
Bayes’ theorem is a way to figure out conditional probability. Conditional probability is the probability of an
event happening, given that it has some relationship to one or more other events

Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the
posterior probability P(hlD), from the prior probability P(h), together with P(D) and P(D(h).

P ( D | h) P ( h)
P(h | D) =
P( D)

#Source Code
# Task-1: Unconditional Probability of Golf
totalRecords=10
numberGolfRecreation=4
probGolf=numberGolfRecreation/totalRecords
print("Unconditio nal probability of golf: =",probGolf)

#Task-2: Conditional Probability of 'Single' given 'medRisk'

# Conditional bayes Formula


# P(single|medRisk) = P(medRisk|single) P(single)/P(medRisk)

# since P(A|B) = P(A ∩ B)/P(B)


# P(medRisk|single) = P(medRisk ∩ single)/P(single)
# P(single)
numberSingle =5
probSingle = numberSingle/totalRecords #5/10
print("P(Single): ",probSingle)

# P(medRisk ∩ single)
numberMedRiskandSingle=2 #n(medRisk ∩ single)
probMedRiskandSingle=numberMedRiskandSingle/totalRecords #2/10
print("P(medRisk ∩ single): ",probMedRiskandSingle)

# P(medRisk|single)
probMedRiskgivenSingle=probMedRiskandSingle/probSingle
print("P(medRisk|single): ",probMedRiskgivenSingle)

numberMedRisk=3
probMedRisk=numberMedRisk/totalRecords
print("P(MedRisk): ",probMedRisk)

probSingleGivenMedRisk=probMedRiskgivenSingle*probSingle/probMedRisk
print("P(Single|medRisk): ",probSingleGivenMedRisk)

OUTPUT: # ML LAB PROGRAM #5:

Unconditio nal probability of golf: = 0.4


P(Single): 0.5
P(medRisk ∩ single): 0.2
P(medRisk|single): 0.4
P(MedRisk): 0.3
P(Single|medRisk): 0.6666666666666667

Source Code File:


(select, copy and paste in destination folder and execute)

golfCondiProb.py golfCondiProbData
.csv
# ML LAB PROGRAM #6:
6. Implement linear regression using python.
Liner Regression:
Linear regression is a statistical method for modeling relationships between a dependent
variable with a given set of independent variables.
simple linear regression we want to model our data as follows:
y = B0 + B1 * x
we know and B0 and B1 are coefficients that we need to estimate that move the line
around. It is used in estimating exactly how much of y will change, when x changes a
certain amount. The line of best fit is nothing but the line that best expresses the
relationship between the data points.
Simple regression is great, because rather than having to search for values by trial and
error or calculate them analytically using more advanced linear algebra, we can estimate
them directly from our data.

# SOURCE CODE
# import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('salary_data.csv')
X = dataset.iloc[:, :-1].values #get a copy of dataset exclude last column
y = dataset.iloc[:, 1].values #get array of dataset in column 1st (1: index)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3,
random_state=0)

# Fitting Simple Linear Regression to the Training set


from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Visualizing the Training set results
viz_train = plt
viz_train.scatter(X_train, y_train, color='red')
viz_train.plot(X_train, regressor.predict(X_train), color='blue')
viz_train.title('Salary VS Experience (Training set)')
viz_train.xlabel('Year of Experience')
viz_train.ylabel('Salary')
viz_train.show()

# Visualizing the Test set results


viz_test = plt
viz_test.scatter(X_test, y_test, color='red')
viz_test.plot(X_train, regressor.predict(X_train), color='blue')
viz_test.title('Salary VS Experience (Test set)')
viz_test.xlabel('Year of Experience')
viz_test.ylabel('Salary')
viz_test.show()

# Predicting the result of 5 Years Experience


rows, cols = (1, 1)
arr = [[5 for i in range(cols)] for j in range(rows)]
print(arr)
y_pred = regressor.predict(arr)
print(y_pred)
s=y_pred
print("Salary for 5 years of Experience= ")
print(s)

# Predicting the result of X_test


print("Predicting the restults for X_test Records")
y_pred = regressor.predict(X_test)
print(X_test,y_pred)

Output: ML LAB PROGRAM #6:


Salary for 5 years of Experience=
[73545.90445964]
Predicting the restults for X_test Records
[[ 1.5]
[10.3]
[ 4.1]
[ 3.9]
[ 9.5]
[ 8.7]
[ 9.6]
[ 4. ]
[ 5.3]
[ 7.9]] [ 40835.10590871 123079.39940819 65134.55626083 63265.36777221
115602.64545369 108125.8914992 116537.23969801 64199.96201652
76349.68719258 100649.1375447 ]

Source Code file (select, copy and paste in a folder)


linearRegression1.p salary_data.csv
y

# ML LAB PROGRAM #7:


# Implement Naïve Bayes theorem to classify the English text
Bayes Theorem:
Bayes’ theorem is a way to figure out conditional probability. Conditional probability is the probability of an
event happening, given that it has some relationship to one or more other events

Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the
posterior probability P(hlD), from the prior probability P(h), together with P(D) and P(D(h).

P ( D | h) P ( h)
P(h | D) =
P( D)

Scikit-learn's CountVectorizer is used to convert a collection of text documents to a vector


of term/token counts.
It also enables the pre-processing of text data prior to generating the vector representation. This
functionality makes it a highly flexible feature representation module for text.

fit_transform():
This method performs fit and transform on the input data at a single time and converts the data points. If
we use fit and transform separate when we need both then it will decrease the efficiency of the model so we
use fit_transform() which will do both the work.
sklearn.naive_bayes.MultinomialNB():
Naïve Bayes classifier. It assumes that the features are drawn from a simple Multinomial
distribution. The Scikit-learn provides sklearn.naive_bayes.MultinomialNB to implement the
Multinomial Naïve Bayes algorithm for classification.

#SOURCE CODE
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

msg = pd.read_csv('document.csv', names=['message', 'label'])


print(msg)
print("Total Instances of Dataset: ", msg.shape[0])
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})
X = msg.message; print("X",X)
y = msg.labelnum
for i in y: print(i)

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

print(Xtrain)
Xtrain.shape[0]
print(ytrain)
ytrain.shape[0]

cv = CountVectorizer()
Xtrain_dm = cv.fit_transform(Xtrain)
print("Xtrain_dm\n",Xtrain_dm)
cv.vocabulary_

Xtest_dm = cv.transform(Xtest)
df = pd.DataFrame(Xtrain_dm.toarray(),columns=cv.get_feature_names())
print(df)

clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)
print('Accuracy Metrics:')

print('Accuracy: ', accuracy_score(ytest, pred))


print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
print('Confusion Matrix: \n', confusion_matrix(ytest, pred))

OUTPUT: # ML LAB PROGRAM 7:


Total Instances of Dataset: 18
out[17]: 13

df = pd.DataFrame(Xtrain_dm.toarray(),columns=cv.get_feature_names())

print(df)
am amazingplace an and awesome bad ... we went what will with work
0 0 0 1 0 1 0 ... 0 0 0 0 0 0
1 0 0 0 0 0 1 ... 0 0 0 0 0 0
2 0 0 0 0 0 0 ... 0 0 0 0 1 0
3 0 1 1 0 0 0 ... 0 0 0 0 0 0
4 0 0 0 0 0 0 ... 0 0 0 0 0 1
5 0 0 0 0 0 0 ... 1 0 0 1 0 0
6 0 0 0 0 0 0 ... 0 0 0 0 0 0
7 0 0 0 0 0 0 ... 0 0 0 0 0 0
8 0 0 0 0 0 0 ... 0 1 0 0 0 0
9 0 0 1 0 1 0 ... 0 0 1 0 0 0
10 0 0 0 0 0 0 ... 0 0 0 0 0 0
11 1 0 0 1 0 0 ... 0 0 0 0 0 0
12 0 0 0 0 0 0 ... 0 0 0 0 0 0

[13 rows x 43 columns]

clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)

print('Accuracy Metrics:')

print('Accuracy: ', accuracy_score(ytest, pred))


print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
print('Confusion Matrix: \n', confusion_matrix(ytest, pred))
Accuracy Metrics:
Accuracy: 0.8
Recall: 1.0
Precision: 0.6666666666666666
Confusion Matrix:
[[2 1]
[0 2]]
"""
Source Code file (select, copy and paste in a folder)

textclassify1.py textclassify1Data.cs
v

# textClassify1Data.csv

I love this sandwich,pos


This is an amazingplace,pos
I feel very good about these foods,pos
This is my best work,pos
What an awesome view,pos
I do not like this restaurant,neg
I am tired of this stuff,neg
I can't deal with this,neg
He is my sworn enemy,neg
My boss is horrible,neg
This is an awesome place,pos
I do not like the taste of this juice,neg
I love to dance,pos
I am sick and tired of this place,neg
What a great holiday,pos
That is a bad locality to stay,neg
We will have good fun tomorrow,pos
I went to my enemy's house today,neg
# ML LAB PROGRAM #3:
3. Implement k-nearest neighbors classification using python
K Nearest Neighbor Algorithm:
DESCRIPTION:
This algorithm is used to solve the classification model problems. K-nearest neighbor or K-NN algorithm basically
creates an imaginary boundary to classify the data. When new data points come in, the algorithm will try to predict
that to the nearest of the boundary line.

KNN MODEL REPRESENTATION:


Efficient implementations can store the data using complex data structures like k-d trees to make look-upand matching
of new patterns during prediction efficient. Because the entire training dataset is stored, you may want to think carefully
about the consistency of yourtraining data.
The k-nearest neighbor algorithm is imported from the scikit-learn package.
• Create feature and target variables.
• Split data into training and test data.
• Generate a k-NN model using neighbors value.
• Train or fit the data into the model.
• Predict the future.

SOURCE CODE:
ML LAB PROGRAM #3
3. Implement k-nearest neighbors classification using python
"""
# Import necessarymodules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
"""

# Loading data
irisData = load_iris()
# Createfeature and target arrays
X = irisData.data
y = irisData.target
# print(X); print(y)
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.35, random_state=40)
# X_train, X_test, y_train, y_test = train_test_split(X, y)
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
# Predict on dataset X_test, y_test
y_pred=knn.predict(X_test)
# numpy. array_equiv(y_test,y_pred)
print("\n",y_pred)
print("\n",y_test)
y_test==y_pred
np.sum(y_test!=y_pred)

OUTPUT:
print("\n",y_pred)
[0 2 0 1 2 2 0 1 0 0 0 2 1 1 0 1 1 1 2 2 0 2 1 2 2 2 0 0 1 0 2 0 1 0 2 2 2
2 0 0 1 1 1 2 2 0 0 2 0 1 2 0 1]
print("\n",y_test)
[0 2 0 1 2 2 0 1 0 0 0 2 1 1 0 1 1 1 1 2 0 2 1 2 2 2 0 0 1 0 2 0 1 0 2 2 2
2 0 0 1 1 1 2 2 0 0 2 0 1 2 0 1]
y_test==y_pred
Out[22]:
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
False, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True])
np.sum(y_test!=y_pred)
Out[23]: 1
# New Test Cases for prediction
newX=([[5.6, 4.6, 1.6, 0.6]])
knn.predict(newX)
Out[26]: array([0])
newX=([[6.6, 2.6, 4.6, 1.6]])
knn.predict(newX)
Out[28]: array([1])
newX=([[5.6, 3.6, 5.6, 2.6]])
knn.predict(newX)
Out[30]: array([2])

Source Code file (select, copy and paste in a folder)

knn.py

VIVA QUESTIONS & ANSWERS ON KNN:


1. What is the KNN Algorithm?
KNN (K-nearest neighbors) is a supervised learning and non-parametric algorithm that can be used to solve both
classification and regression problem statements.
It uses data in which there is a target column present i.e, labeled data to model a function to produce an output for the
unseen data. It uses the Euclidean distance formula to compute the distance between the data points for classification
or prediction.
The main objective of this algorithm is that similar data points must be close to each other so it uses the distance to
calculate the similar points that are close to each other.
2. Why is KNN a non-parametric Algorithm?
The term “non-parametric” refers to not making any assumptions on the underlying data distribution. These methods
do not have any fixed numbers of parameters in the model.
Similarly, in KNN, the model parameters grow with the training data by considering each training case as a parameter
of the model. So, KNN is a non-parametric algorithm.
3. What is “K” in the KNN Algorithm?
K represents the number of nearest neighbors you want to select to predict the class of a given item, which is coming
as an unseen dataset for the model.
4. Why is the odd value of “K” preferred over even values in the KNN Algorithm?
The odd value of K should be preferred over even values in order to ensure that there are no ties in the voting. If the
square root of a number of data points is even, then add or subtract 1 to it to make it odd.
5. How does the KNN algorithm make the predictions on the unseen dataset?
The following operations have happened during each iteration of the algorithm. For each of the unseen or test data
point, the KNN classifier must:
Step-1: Calculate the distances of test point to all points in the training set and store them
Step-2: Sort the calculated distances in increasing order Step-3: Store the K nearest points from our training dataset
Step-4: Calculate the proportions of each class
Step-5: Assign the class with the highest proportion
# ML LAB PROGRAM #4:
4. Given the following data, which specify classifications for nine combinations
of VAR1 and VAR2 predict a classification for a case where VAR1=0.906 and
VAR2=0.606, using the result of k- means clustering with 3 means (i.e., 3
centroids)
VAR1 VAR2 CLASS
1.713 1.586 0
0.180 1.786 1
0.353 1.240 1
0.940 1.566 0
1.486 0.759 1
1.266 1.106 0
1.540 0.419 1
0.459 1.799 1
0.773 0.186 1
K Means Algorithm:
• K Means algorithm is a centroid-based clustering (unsupervised) technique. This technique
groups the dataset into k different clusters having an almost equal number of points. Each of
the clusters has a centroid point which represents the mean of the data points lying in that
cluster.
• The idea of the K-Means algorithm is to find k-centroid points and every point in the dataset will
belong to either of the k-sets having minimum Euclidean distance.
• K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science. In this topic, we will learn what is K-means
clustering algorithm, how the algorithm works, along with the Python implementation of k-
means clustering.
• It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim
of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.
• The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of
clusters, and repeats the process until it does not find the best clusters. The value of k should
be predetermined in this algorithm.

• The k-means clustering algorithm mainly performs two tasks:


o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
SOURCE CODE:
"""
Created on Wed Jun 22 11:43:43 2022

@author: krish
"""
"""
Given the following data, which specify classifications for nine combinations
of VAR1 and VAR2.
Predict a classification for a case where VAR1=0.906 and VAR2=0.606,
using the result of k- means clustering with 3 means (i.e., 3 centroids)

VAR1 VAR2 CLASS


1.713 1.586 0
0.180 1.786 1
0.353 1.240 1
0.940 1.566 0
1.486 0.759 1
1.266 1.106 0
1.540 0.419 1
0.459 1.799 1
0.773 0.186 1

"""
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1.713, 1.586], [0.180, 1.786], [0.353, 1.240],[0.940, 1.566],


[1.486, 0.759], [1.266, 1.106],[1.540, 0.419],[0.459, 1.799],[0.773, 0.186]])
y=np.array([0, 1, 1, 0, 1, 0, 1, 1, 1])
kmeans = KMeans(n_clusters=3, random_state=0).fit(X,y)

kmeans
# Predict a classification for a case where VAR1=0.906 and VAR2=0.606
kmeans.predict([[0.906, 0.606]]) #check the 2nd var set Cluster #0

# Few More
kmeans.predict([[0.180, 1.786]]) #check the 2nd var set Cluster #1
kmeans.predict([[0.940, 1.566]]) #check the 2nd var set Cluster #2
kmeans.predict([[1.486, 0.759]]) #check the 2nd var set Cluster #0

"""
OUTPUT:
kmeans = KMeans(n_clusters=3, random_state=0).fit(X,y)

kmeans
Out[69]: KMeans(n_clusters=3, random_state=0)
# Predict a classification for a case where VAR1=0.906 and VAR2=0.606
kmeans.predict([[0.906, 0.606]]) #check the 2nd var set Cluster #0
Out[71]: array([0])
# Few More
kmeans.predict([[0.180, 1.786]]) #check the 2nd var set Cluster #1
Out[73]: array([1])
kmeans.predict([[0.940, 1.566]]) #check the 2nd var set Cluster #2
Out[74]: array([2])
kmeans.predict([[1.486, 0.759]]) #check the 2nd var set Cluster #0
Out[75]: array([0])

"""

Source Code file (select, copy and paste in a folder)


kmeans.py

VIVA Qns:

1. Is Feature Scaling required for the K means Algorithm?


Yes, K-Means typically needs to have some form of normalization done on the datasets to work properly since it
is sensitive to both the mean and variance of the datasets.
For performing feature scaling, generally, StandardScaler is recommended, but depending on the specific use cases,
other techniques might be more suitable as well.

For Example, let’s have 2 variables, named age and salary where age is in the range of 20 to 60 and salary is in
the range of 100-150K, since scales of these variables are different so when these variables are substituted in the
euclidean distance formula, then the variable which is on the large scale suppresses the variable which is on the
smaller scale. So, the impact of age will not be captured very clearly. Hence, you have to scale the variables to the
same range using Standard Scaler, Min-Max Scaler, etc.

2. Which metrics can you use to find the accuracy of the K means Algorithm?
There does not exist a correct answer to this question as k means being an unsupervised learning technique does
not discuss anything about the output column. As a result, one can not get the accuracy number or values from the
algorithm directly.

3. What are the advantages and disadvantages of the K means Algorithm?


Advantages:

• Easy to understand and implement.


• Computationally efficient for both training and prediction.
• Guaranteed convergence.
Disadvantages:

• We need to provide the number of clusters as an input variable to the algorithm.


• It is very sensitive to the initialization process.
• Good at clustering when we are dealing with spherical cluster shapes, but it will perform poorly when dealing
with more complicated shapes.
• Due to the leveraging of the Euclidean distance function, it is sensitive to outliers.

4. What are the ways to avoid the problem of initialization sensitivity in the K means Algorithm?
There are two ways to avoid the problem of initialization sensitivity:
• Repeat K means: It basically repeats the algorithm again and again along with initializing thecentroids
followed by picking up the cluster which results in the small intra-cluster distance and large inter
cluster distance.
• K Means++: It is a smart centroid initialization technique.
• Amongst the above two techniques, K-Means++ is the best approach.
# ML LAB PROGRAM #9:
Implement the finite words classification system using Back-propagation
algorithm.

The Back propagation algorithm is a supervised learning method for multilayer feed- forward networks from
the field of Artificial Neural Networks.
The principle of the backpropagation approach is to model a given function by modifying internal weightings
of input signals to produce an expected output signal. The system is trained using a supervised learning
method, where the error between the system’s output and a known expected output is presented to the
system and used to modify its internal state.
The main three layers are:
1. Input layer
2. Hidden layer
3. Output layer
Each layer has its own way of working and its own way to take action such that we are able to get the desired
results and correlate these scenarios to our conditions.
Source Code:
"""
Back propagation algorithm
"""
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

msg = pd.read_csv('backpropDoc.csv', names=['message', 'label'])


print("Total Instances of Dataset: ", msg.shape[0])
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})
X = msg.message
y = msg.labelnum
X,y
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtrain_dm
Xtest_dm = count_v.transform(Xtest)
df = pd.DataFrame(Xtrain_dm.toarray(),columns=count_v.get_feature_names())
df
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)
print('Accuracy Metrics:')
print('Accuracy: ', accuracy_score(ytest, pred))
print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
print('Confusion Matrix: \n', confusion_matrix(ytest, pred))

"""
# document.csv:
I love this sandwich,pos
This is an amazingplace,pos
I feel very good about these beers,pos
This is my best work,pos
What an awesome view,pos
I do not like this restaurant,neg
I am tired of this stuff,neg
I can't deal with this,neg He ismy sworn enemy,neg
My boss is horrible,neg
This is an awesome place,pos
I do not like the taste of this juice,neg
I love to dance,pos
I am sick and tired of this place,neg
What a great holiday,pos
That is a bad locality to stay,neg
We will have good fun tomorrow,pos
I went to my enemy's house today,neg
"""

"""
Just for Ref only:
sklearn.neural_network.MLPClassifier:
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2), random_state=1)
This model optimizes the log-loss function using LBFGS or stochastic gradient descent. (New in
version 0.18.)
Activation function for the hidden layer: activation{‘identity’, ‘logistic’, ‘tanh’, ‘relu’},
default=’relu’

sklearn.neural_network.CountVectorizer():
Convert a collection of text documents to a matrix of token counts.
Tolerance for the optimization. When the loss or score is not improving by at least tol for
n_iter_no_change consecutive iterations, unless learning_rate is set to ‘adaptive’, convergence
is considered to be reached and training stops.

tolfloat, default=1e-4:
Tolerance for the optimization. When the loss or score is not improving by at least tol for
n_iter_no_change consecutive iterations, unless learning_rate is set to ‘adaptive’, convergence
is considered to be reached and training stops.
"""

Output:
Total Instances of Dataset: 18
Accuracy Metrics:
Accuracy: 0.8
Recall: 0.6666666666666666
Precision: 1.0
Confusion Matrix:
[[2 0]
[1 2]]

Source Code file (select, copy and paste in a folder)

backprop.py backpropDoc.csv
# ML LAB PROGRAM #8:
Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to
the larger part of evolutionary algorithms. Genetic algorithms are based on the ideas
of natural selection and genetics. They are commonly used to generate high-quality
solutions for optimization problems and search problems. Genetic algorithms simulate
the process of natural selection which means those species who can adapt to changes
in their environment are able to survive and reproduce and go to next generation.
Operators of Genetic Algorithm:
Once the initial generation is created, the algorithm evolve the generation using
following operators –
1. Selection Operator: The idea is to give preference to the individuals with
good fitness scores and allow them to pass there genes to the successive
generations.
2. Crossover Operator: This represents mating between individuals. Two
individuals are selected using selection operator and crossover sites are
chosen randomly. Then the genes at these crossover sites are exchanged
thus creating a completely new individual (offspring).
3. Mutation Operator: The key idea is to insert random genes in offspring to
maintain the diversity in population to avoid the premature convergence

Source Code:
import matplotlib.pyplot
import numpy
def cal_pop_fitness(equation_inputs, pop):
# Calculating the fitness value of each solution in the current population.

# The fitness function calulates the sum of products between each input and its
corresponding weight.
fitness = numpy.sum(pop*equation_inputs, axis=1)
return fitness

def select_mating_pool(pop, fitness, num_parents):


# Selecting the best individuals in the current generation as parents for
producing the offspring of the next generation.
parents = numpy.empty((num_parents, pop.shape[1]))
for parent_num in range(num_parents):
max_fitness_idx = numpy.where(fitness == numpy.max(fitness))
max_fitness_idx = max_fitness_idx[0][0]
parents[parent_num, :] = pop[max_fitness_idx, :]
fitness[max_fitness_idx] = -99999999999
return parents
def crossover(parents, offspring_size):
offspring = numpy.empty(offspring_size)
# The point at which crossover takes place between two parents.
# Usually, it is at the center. crossover_point = numpy.uint8(offspring_size[1]/2)
for k in range(offspring_size[0]):
# Index of the first parent to mate. parent1_idx = k%parents.shape[0]
# Index of the second parent to mate. parent2_idx = (k+1)%parents.shape[0]
# The new offspring will have its first half of its genes taken from the first parent.
offspring[k, 0:crossover_point] = parents[parent1_idx, 0:crossover_point]
# The new offspring will have its second half of its genes taken from the second
parent. offspring[k, crossover_point:] = parents[parent2_idx, crossover_point:]
return offspring

def mutation(offspring_crossover, num_mutations=1):


mutations_counter = numpy.uint8(offspring_crossover.shape[1] / num_mutations)

# Mutation changes a number of genes as defined by the num_mutations argument. The


changes are random.
for idx in range(offspring_crossover.shape[0]):
gene_idx = mutations_counter - 1
for mutation_num in range(num_mutations): # The random value to be added to the
gene.
random_value = numpy.random.uniform(-1.0, 1.0, 1)
offspring_crossover[idx, gene_idx] = offspring_crossover[idx, gene_idx] +
random_value
gene_idx = gene_idx + mutations_counter
return offspring_crossover

"""
The y=target is to maximize this equation ASAP: y = w1x1+w2x2+w3x3+w4x4+w5x5+6wx6
where (x1,x2,x3,x4,x5,x6)=(4,-2,3.5,5,- 11,-4.7)
What are the best values for the 6 weights w1 to w6?
We are going to use the genetic algorithm for the best possible values after a number
of generations.
"""
# Inputs of the equation.
equation_inputs = [4,-2,3.5,5,-11,-4.7]

# Number of the weights we are looking to optimize.


num_weights = len(equation_inputs)

"""
Genetic algorithm parameters: Mating pool size Population size
"""
sol_per_pop = 8
num_parents_mating = 4
# Defining the population size.
# The population will have sol_per_pop chromosome where each chromosome has
num_weights genes.
pop_size = (sol_per_pop,num_weights)
#Creating the initial population.

new_population = numpy.random.uniform(low=-4.0, high=4.0, size=pop_size)


print(new_population)
"""
new_population[0, :] = [2.4, 0.7, 8, -2, 5, 1.1]
new_population[1, :] = [-0.4, 2.7, 5, -1, 7, 0.1]
new_population[2, :] = [-1, 2, 2, -3, 2,0.9]
new_population[3, :]=[4,7, 12, 6.1, 1.4, -4]
new_population[4, :] = [3.1, 4, 0, 2.4, 4.8,0]
new_population[5, :] = [-2, 3, -7,6, 3, 3]
"""
best_outputs = []
num_generations = 1000
for generation in range(num_generations):
print("Generation : ", generation)
# Measuring the fitness of each chromosome in the population.
fitness = cal_pop_fitness(equation_inputs, new_population)
print("Fitness")
print(fitness)
best_outputs.append(numpy.max(numpy.sum(new_population*equation_inputs,axis=1)))
# The best result in the current iteration.
print("Best result : ", numpy.max(numpy.sum(new_population*equation_inputs, axis=1)))

# Selecting the best parents in the population for mating.


parents = select_mating_pool(new_population, fitness, num_parents_mating)
print("Parents ")
print(parents)

# Generating next generation usingcrossover.


offspring_crossover = crossover(parents, offspring_size=(pop_size[0]-parents.shape[0],
num_weights))
print("Crossover")
print(offspring_crossover)
# Adding some variations to the offspring using mutation.
offspring_mutation = mutation(offspring_crossover, num_mutations=2)
print("Mutation")
print(offspring_mutation)
# Creating the new population based on the parents and offspring.
new_population[0:parents.shape[0], :] = parents new_population[parents.shape[0]:, :] =
offspring_mutation
# Getting the best solution after iterating finishing all generations. #At first, the
fitness is calculated for each solution in the final generation. fitness =
cal_pop_fitness(equation_inputs, new_population)
# Then return the index of that solution corresponding to the best fitness.
best_match_idx = numpy.where(fitness == numpy.max(fitness))
print("Best solution : ",new_population[best_match_idx, :])
print("Best solution fitness : ", fitness[best_match_idx])

Output:
Generation : 528
Generation : 529
.
.
.
Generation : 997
Generation : 998
Generation : 999
Fitness
[ 23.77600938 -38.72547264 -3.8947447 37.10753127 36.47088852 -45.89452384 -
23.09632211 -39.97632958]
Best result : 37.107531271257386
Parents
[[ 1.08870518 3.53573945 2.059628 -2.23014052 -2.93675387 -2.43870245]
[ 2.73488817 -0.27533366 0.43937282 1.77062771 -2.97782039 3.86517002]
[-1.44232977 -1.85501121 -1.7915992 0.5713773 -2.70042871 0.09695724]
[ 1.14702539 -2.02199422 -0.58553948 2.15263606 2.60080023 -1.56769697]]
Crossover
[[1.08870518 3.53573945 2.059628 2.23014052 2.93675387 2.43870245]
[2.73488817 0.27533366 0.43937282 1.77062771 2.97782039 3.86517002]
[1.44232977 1.85501121 1.7915992 0.5713773 2.70042871 0.09695724]
[1.14702539 2.02199422 0.58553948 2.15263606 2.60080023 1.56769697]]
Mutation
[[1.08870518 3.53573945 2.059628 2.23014052 2.93675387 2.43870245]
[2.73488817 0.27533366 0.43937282 1.77062771 2.97782039 3.86517002]
[1.44232977 1.85501121 1.7915992 0.5713773 2.70042871 0.09695724]
[1.14702539 2.02199422 1.43349451 2.15263606 2.60080023 1.56769697]]
Best solution : [[[ 3.50731352 1.70379643 -1.13092437 -3.82269723 1.51092617
-1.27104807]]]
Best solution fitness : [-23.09632211]

Source Code file (select, copy and paste in a folder)

ga.py
# Additional Programs

# Additional Program for Linear Regression

linearRegression2.p
y

# Additional Program for Bayes Theorem:

bayesTherm2.py

# Additional Programs for Naïve Bayes Classification:

textclassify2.py textclassify2Data.cs textclassify3.py textclassify3data.txt


v

You might also like