21ai66 ML Lab Manual
BELGAUM
LABORATORY MANUAL
2023-2024
COMPILED BY
Dr. S Anu Pallavi
Assistant Professor Grade-I
Vision & Mission of the Institute
Acharya Institute of Technology, committed to the cause of value-based education in all disciplines,
envisions itself as a fountainhead of innovative human enterprise, with inspirational initiatives for
Academic Excellence.
Acharya Institute of Technology strives to provide excellent academic ambiance to the students for
achieving global standards of technical education, foster intellectual and personal development,
meaningful research and ethical service to sustainable societal needs.
PROGRAM NO.1
For a given set of training data examples stored in a .CSV file, implement and demonstrate the
Find-S algorithm to output the most specific hypothesis that is consistent with the training
examples.
Aim:
To implement the Find-S algorithm to learn a maximally specific hypothesis from a given
set of training data examples. The Find-S algorithm produces a hypothesis that is consistent with all
positive training examples while being as specific as possible.
Objective:
The objective of the Find-S algorithm is to find the most specific hypothesis that fits all
positive examples in the training dataset. This specific hypothesis is then used to make predictions on
new unseen examples. By maximizing specificity, the algorithm aims to generalize well to unseen
data while avoiding overfitting.
Training Examples:
Python code:
import pandas as pd

df = pd.read_csv('/content/ML LAB/enjoysport.csv')
a = df.values.tolist()            # each row is one training example
print(df)

n = len(a[0]) - 1                 # number of attributes (the last column is the label)
S = ['0'] * n                     # most specific hypothesis
print("Initial hypothesis:", S)
print("FIND S ALGORITHM")

S = a[0][:-1]                     # seed S with the first training example
for i in range(len(a)):
    if a[i][n] == "yes":          # generalise S only on positive examples
        for j in range(n):
            if a[i][j] != S[j]:
                S[j] = '?'
    print("\nTraining example no {0}, Hypothesis is: {1}".format(i + 1, S))

print("\nMaximally specific hypothesis is:", S)
Output:
Maximally specific hypothesis is: ['sunny', 'warm', '?', 'strong', '?', '?']
PROGRAM NO. 2
For a given set of training data examples stored in a .CSV file, implement and demonstrate the
Candidate-Elimination algorithm to output a description of the set of all hypotheses consistent
with the training examples.
Aim:
To implement the Candidate-Elimination algorithm for machine learning classification tasks
using a given set of training data examples stored in a CSV file.
Objective:
The objective of this exercise is to implement the Candidate-Elimination algorithm to learn a
hypothesis space consistent with the training data by iteratively refining the version space through
analysis of each training example.
Training Examples:
Python Code:
import pandas as pd

df = pd.read_csv('/content/ML LAB/enjoysport.csv')
a = df.values.tolist()
print(df)

n = len(a[0]) - 1
print("\nThe initial value of hypothesis:")
s = ['0'] * n                 # most specific hypothesis
g = ['?'] * n                 # most general hypothesis
print("\nThe most specific hypothesis S0 :", s)
print("\nThe most general hypothesis G0 :", g)

s = a[0][:-1]                 # seed S with the first training example
temp = []                     # candidate general hypotheses (the G boundary)
print("\nCandidate Elimination algorithm\n")
for i in range(len(a)):
    if a[i][n] == "yes":      # Use "Positive" for manufacture.csv
        # positive example: generalise S ...
        for j in range(n):
            if a[i][j] != s[j]:
                s[j] = '?'
        # ... and remove general hypotheses that no longer cover S
        for j in range(n):
            for k in range(len(temp) - 1, -1, -1):   # iterate backwards so deletion is safe
                if temp[k][j] != '?' and temp[k][j] != s[j]:
                    del temp[k]
    else:
        # negative example: specialise G with hypotheses that exclude it
        for j in range(n):
            if s[j] != a[i][j] and s[j] != '?':
                h = ['?'] * n
                h[j] = s[j]
                if h not in temp:
                    temp.append(h)
    print("Training example no {0}:  S = {1}  G = {2}".format(i + 1, s, temp if temp else [g]))
Output:
The most specific hypothesis S0 : ['0', '0', '0', '0', '0', '0']
The most general hypothesis G0 : ['?', '?', '?', '?', '?', '?']
PROGRAM NO.3
Write a program to demonstrate the working of the decision tree-based ID3 algorithm. Use an
appropriate data set for building the decision tree and apply this knowledge to classify a new
sample.
Aim:
To demonstrate the working of the decision tree-based ID3 algorithm. This algorithm is used
for building a decision tree from a given dataset, where each node in the tree represents an attribute,
and each branch represents a possible value of that attribute. The decision tree is built by recursively
selecting the best attribute to split the data based on the information gain criterion.
Objective:
This experiment provides a comprehensive guide for understanding and implementing
decision tree learning with the ID3 algorithm, covering data loading, preprocessing, tree construction,
visualization, and classification tasks.
Entropy:
Entropy measures the impurity of a collection of examples.
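For a sample S containing a fraction p+ of positive and p- of negative examples, and a candidate attribute A with values v, ID3 uses:

$$\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}, \qquad \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

The attribute with the highest gain becomes the next decision node.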
Training Dataset Example:
Python code:
import pandas as pd
from math import log
from pprint import pprint

def entropy(p, n):
    # impurity of a subset with p positive and n negative examples
    return sum(-c / (p + n) * log(c / (p + n), 2) for c in (p, n) if c)

def build(data, attr_names):
    pos = len([x for x in data if x[-1] == 'Yes'])
    neg, sz = len(data) - pos, len(data[0]) - 1
    if neg == 0 or pos == 0:                        # pure subset -> leaf node
        return 'Yes' if neg == 0 else 'No'
    def gain(col):                                  # information gain of splitting on col
        subs = [[x for x in data if x[col] == v] for v in set(x[col] for x in data)]
        return entropy(pos, neg) - sum(
            len(s) / len(data) * entropy(len([x for x in s if x[-1] == 'Yes']),
                                         len([x for x in s if x[-1] == 'No'])) for s in subs)
    best = max(range(sz), key=gain)                 # attribute with the highest gain
    tree = {attr_names[best]: {}}
    for v in set(x[best] for x in data):
        sub = [x[:best] + x[best + 1:] for x in data if x[best] == v]
        tree[attr_names[best]][v] = build(sub, attr_names[:best] + attr_names[best + 1:])
    return tree

# Load data
df = pd.read_csv('/content/ML LAB/tennis.csv')
data = df.values.tolist()
attr_names = df.columns.values.tolist()
tree = build(data, attr_names)
pprint(tree)
Output:
{'Outlook': {'Overcast': 'Yes',
'Rainy': {'Windy': {'Strong': 'No', 'Weak': 'Yes'}},
'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}}}}
Classified as: Yes
Method 2:
Python code:
import pandas as pd
import math

def load_csv(file_path):
    # return the rows as a list of lists and the column names
    df = pd.read_csv(file_path)
    return df.values.tolist(), df.columns.values.tolist()

class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []
        self.answer = ""

def subtables(data, col, delete):
    # group the rows into sub-tables, one per distinct value of the chosen column
    attr = list(set(row[col] for row in data))
    counts = [0] * len(attr)
    r, c = len(data), len(data[0])
    dic = {}
    for x in range(len(attr)):
        for y in range(r):
            if data[y][col] == attr[x]:
                counts[x] += 1
    for x in range(len(attr)):
        dic[attr[x]] = [[0 for i in range(c)] for j in range(counts[x])]
        pos = 0
        for y in range(r):
            if data[y][col] == attr[x]:
                row = data[y][:]          # copy the row so the original table is not modified
                if delete:
                    del row[col]
                dic[attr[x]][pos] = row
                pos += 1
    return attr, dic

def entropy(S):
    attr = list(set(S))
    if len(attr) == 1:
        return 0
    counts = [0, 0]
    for i in range(2):
        counts[i] = sum([1 for x in S if attr[i] == x]) / (len(S) * 1.0)
    sums = 0
    for cnt in counts:
        sums += -1 * cnt * math.log(cnt, 2)
    return sums

def compute_gain(data, col):
    # information gain of splitting the table on the given column
    attr, dic = subtables(data, col, delete=False)
    total_entropy = entropy([row[-1] for row in data])
    for x in range(len(attr)):
        ratio = len(dic[attr[x]]) / (len(data) * 1.0)
        total_entropy -= ratio * entropy([row[-1] for row in dic[attr[x]]])
    return total_entropy

def build_tree(data, features):
    lastcol = [row[-1] for row in data]
    if len(set(lastcol)) == 1:            # pure subset becomes a leaf node
        node = Node("")
        node.answer = lastcol[0]
        return node
    n = len(data[0]) - 1
    gains = [0] * n
    for col in range(n):
        gains[col] = compute_gain(data, col)
    split = gains.index(max(gains))       # attribute with the highest information gain
    node = Node(features[split])
    fea = features[:split] + features[split + 1:]
    attr, dic = subtables(data, split, delete=True)
    for x in range(len(attr)):
        child = build_tree(dic[attr[x]], fea)
        node.children.append((attr[x], child))
    return node

def print_tree(node, level):
    if node.answer != "":
        print("   " * level, node.answer)
        return
    print("   " * level, node.attribute)
    for value, n in node.children:
        print("   " * (level + 1), value)
        print_tree(n, level + 2)

def classify(node, x_test, features):
    if node.answer != "":
        print("The label for test instance:", node.answer)
        return
    pos = features.index(node.attribute)
    for value, n in node.children:
        if x_test[pos] == value:
            classify(n, x_test, features)

# Main program
data, features = load_csv('/content/ML LAB/tennis.csv')   # replace with the CSV used for this method
node1 = build_tree(data, features)
print("The decision tree for the dataset using ID3 algorithm is:")
print_tree(node1, 0)
for x_test in [['rain', 'cool', 'normal', 'strong'],
               ['sunny', 'mild', 'normal', 'strong']]:
    print("The test instance:", x_test)
    classify(node1, x_test, features)
Output:
The decision tree for the dataset using ID3 algorithm is:
Outlook
rain
0
sunny
0
overcast
yes
The test instance: ['rain', 'cool', 'normal', 'strong']
The label for test instance: 0
The test instance: ['sunny', 'mild', 'normal', 'strong']
The label for test instance: 0
PROGRAM NO. 4
Build an Artificial Neural Network by implementing the Backpropagation algorithm and test
the same using appropriate data sets.
Aim:
To build an Artificial Neural Network (ANN) using the Backpropagation algorithm and test
its performance using appropriate datasets.
Objective:
The objective of this experiment is to implement an Artificial Neural Network (ANN) using
the Backpropagation algorithm. Through this exercise, participants will gain practical experience in
preprocessing data, initializing network parameters, conducting forward and backward propagation,
updating weights and biases, and evaluating the network's performance.
2. Initialize the weights (wh, wout) and biases (bh, bout) with small random
values.
3. Forward Propagation:
Calculate the weighted sum of inputs to the hidden layer and apply the activation function
to get the output of the hidden layer.
Calculate the weighted sum of inputs to the output layer and apply the activation
function to get the final output.
4. Error Calculation:
Compute the error at the output layer as the difference between the actual and the
predicted output.
5. Backpropagation:
Compute the gradient of the output layer using the derivative of the sigmoid
function.
Propagate the error back to the hidden layer.
Compute the gradient of the hidden layer using the derivative of the sigmoid function.
6. Weight Update:
Update the output-layer weights (wout) using the gradient descent algorithm.
Update the hidden-layer weights (wh) using the gradient descent algorithm.
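Steps 3 and 5 use the logistic (sigmoid) activation and its derivative:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$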
Training example:
Example   Sleep   Study   Expected % in Exams
   1        2       9            92
   2        1       5            86
   3        3       6            89

After normalization (each input column is divided by its maximum and the expected % by 100):

Example   Sleep              Study              Expected % in Exams
   1      2/3 = 0.66666667   9/9 = 1            0.92
   2      1/3 = 0.33333333   5/9 = 0.55555556   0.86
   3      3/3 = 1            6/9 = 0.66666667   0.89
Python Code:
import numpy as np

X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)  # hours of sleep, hours of study
y = np.array(([92], [86], [89]), dtype=float)        # expected % in exams
X = X / np.amax(X, axis=0)   # normalize each input column by its maximum
y = y / 100                  # scale the target to [0, 1]

# Sigmoid Function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid (its argument is already sigmoid(x))
def derivatives_sigmoid(x):
    return x * (1 - x)

# Variable initialization
epoch = 5000  # Setting training iterations
lr = 0.1      # Setting learning rate
inputlayer_neurons = 2   # number of features in data set
hiddenlayer_neurons = 3  # number of hidden layer neurons
output_neurons = 1       # number of neurons at output layer

# Weight and bias initialization with small random values
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

for i in range(epoch):
    # Forward Propagation
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)

    # Backpropagation
    EO = y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)
    d_hiddenlayer = EH * hiddengrad

    # Update weights
    wout += hlayer_act.T.dot(d_output) * lr
    wh += X.T.dot(d_hiddenlayer) * lr

print("Input:\n" + str(X))
print("Actual Output:\n" + str(y))
print("Predicted Output:\n", output)
Output:
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.89401359]
[0.88228311]
[0.89391465]]
PROGRAM NO.5
Write a program to implement the naive Bayesian classifier for a sample training data set stored
as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
Aim:
To implement a naive Bayesian classifier using a sample training dataset stored as a CSV
file and compute the accuracy of the classifier using test datasets.
Objective:
The objective of this program is to implement a naive Bayesian classifier for a sample training
dataset stored as a CSV file. To achieve this, the first step involves loading the training dataset from
the CSV file. Following that, the data is preprocessed by encoding categorical variables into
numerical values using LabelEncoder. Subsequently, the dataset is split into train and test sets to
facilitate model evaluation. The Gaussian Naive Bayes classifier is then trained using the training
data. Next, the classifier is used to predict the classes for the test data. Finally, the accuracy of the
classifier is calculated by comparing the predicted classes with the actual classes. This evaluation
metric provides insight into the performance of the classifier.
1. Load the dataset from a CSV file and convert the loaded data to numerical format.
2. Split the dataset into training and testing sets based on a specified split ratio.
3. Separate by class: group the instances in the dataset by their class labels.
4. Calculate the mean and standard deviation for each attribute in the dataset grouped by class.
5. Summarize the dataset statistics (mean and standard deviation) for each class.
6. Calculate the class-conditional probability of each attribute using the Gaussian probability density function (see the formula after this list).
7. For each class, calculate the product of the class conditional probabilities for all attributes.
8. Predict the class label for a given instance by selecting the class with the highest probability.
9. Generate predictions for all instances in the test set using the previously trained model.
10. Compare the predicted class labels with the actual class labels in the test set and calculate
the classification accuracy.
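The Gaussian density of step 6 and the per-class product of step 7 can be written as

$$P(x \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c}}\exp\!\left(-\frac{(x - \mu_{c})^{2}}{2\sigma_{c}^{2}}\right), \qquad P(\mathbf{x} \mid c) = \prod_{i} P(x_i \mid c),$$

where the mean and standard deviation of each attribute are computed from the training rows of class c.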
Python Code:
import csv
import random
import math
def loadcsv(filename):
    with open(filename, "r") as file:
        lines = csv.reader(file)
        dataset = list(lines)
    headers = dataset[0]       # Extract column headers
    dataset = dataset[1:]      # Exclude the first row (column headers)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return headers, dataset

def separatebyclass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries
def summarizebyclass(dataset):
    separated = separatebyclass(dataset)
    summaries = {}
    for classvalue, instances in separated.items():
        summaries[classvalue] = summarize(instances)
    return summaries

def splitdataset(dataset, splitratio):
    # randomly move splitratio of the rows into the training set
    trainsize = int(len(dataset) * splitratio)
    trainset = []
    copy = list(dataset)
    while len(trainset) < trainsize:
        index = random.randrange(len(copy))
        trainset.append(copy.pop(index))
    return [trainset, copy]

def calculateprobability(x, mean, stdev):
    # Gaussian probability density function
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

def calculateclassprobabilities(summaries, inputvector):
    # product of the attribute likelihoods for every class
    probabilities = {}
    for classvalue, classsummaries in summaries.items():
        probabilities[classvalue] = 1
        for i in range(len(classsummaries)):
            mean_, stdev_ = classsummaries[i]
            x = inputvector[i]
            probabilities[classvalue] *= calculateprobability(x, mean_, stdev_)
    return probabilities

def predict(summaries, inputvector):
    # pick the class with the highest probability
    probabilities = calculateclassprobabilities(summaries, inputvector)
    bestlabel, bestprob = None, -1
    for classvalue, probability in probabilities.items():
        if bestlabel is None or probability > bestprob:
            bestprob = probability
            bestlabel = classvalue
    return bestlabel

def getpredictions(summaries, testset):
    return [predict(summaries, testset[i]) for i in range(len(testset))]

def getaccuracy(testset, predictions):
    correct = 0
    for i in range(len(testset)):
        if testset[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testset))) * 100.0
def main():
    filename = '/content/ml lab sample/pima_indian.csv'
    splitratio = 0.67
    headers, dataset = loadcsv(filename)
    trainingset, testset = splitdataset(dataset, splitratio)
    print('Split {0} rows into train={1} and test={2} rows'.format(
        len(dataset), len(trainingset), len(testset)))
    summaries = summarizebyclass(trainingset)
    predictions = getpredictions(summaries, testset)
    accuracy = getaccuracy(testset, predictions)
    print('Accuracy of the classifier is: {0}%'.format(accuracy))

main()
Output:
PROGRAM NO. 6:
Write a program to construct a Bayesian network considering medical data. Use this model to
demonstrate the diagnosis of heart patients using a standard Heart Disease Data Set. You can
use Python ML library classes/API.
Aim:
To implement a Bayesian Network model using the pgmpy library to predict heart disease
based on patient data.
Objective:
This project aims to analyze a heart disease dataset by building a Bayesian Network. We
structure the network based on known connections between patient attributes and heart disease. Then,
we train the model using Maximum Likelihood Estimation. By employing Variable Elimination, we
make predictions about heart disease from the data provided. Finally, we evaluate how accurately the
model predicts heart disease.
1. The heart disease dataset is loaded from a CSV file named 'datasetheart.csv', with appropriate
column names assigned.
2. Based on the dataset's attributes and domain knowledge, a Bayesian Network structure is defined
using the `BayesianModel` class from `pgmpy.models`.
3. The model is trained with the `fit` method, using Maximum Likelihood Estimation as the
estimator.
4. Variable Elimination is employed for inference. The `query` method is used to predict the
'RESULT' variable given the evidence 'C=2'.
5. The inference results are printed, providing the probabilities of 'RESULT' given the evidence
'C=2'.
Python Code:
!pip install pgmpy
import pandas as pd
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

data = pd.read_csv('/content/ML LAB/datasetheart.csv',
                   names=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
                          'K', 'L', 'M', 'RESULT'])
print(data.head(5))
print(data.tail(5))

model = BayesianModel([("A", "B"), ("B", "C"), ("C", "D"), ("D", "RESULT")])
model.fit(data, estimator=MaximumLikelihoodEstimator)

# Inference with Variable Elimination: P(RESULT | C = 2)
infer = VariableElimination(model)
result = infer.query(variables=['RESULT'], evidence={'C': 2})
print(result)
Output:
A B C D E F G H I J K L M RESULT
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1
A B C D E F G H I J K L M RESULT
298 57 0 0 140 241 0 1 123 1 0.2 1 0 3 0
299 45 1 3 110 264 0 1 132 0 1.2 1 0 3 0
300 68 1 0 144 193 1 1 141 0 3.4 1 2 3 0
301 57 1 0 130 131 0 1 115 1 1.2 1 1 3 0
302 57 0 1 130 236 0 0 174 0 0.0 1 1 2 0
+-----------+---------------+
| RESULT | phi(RESULT) |
+===========+===============+
| RESULT(0) | 0.3893 |
+-----------+---------------+
| RESULT(1) | 0.6107 |
+-----------+---------------+
PROGRAM NO. 7:
Apply the EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set for
clustering using the k-means algorithm. Compare the results of these two algorithms and
comment on the quality of clustering. You can add Python ML library classes/API to the
program.
Aim:
To compare the clustering results obtained from the K-Means algorithm and the Gaussian
Mixture Model (GMM) using the Expectation-Maximization (EM) algorithm on a given dataset.
Objective:
The objective is to analyze and compare the performance of K-Means and Gaussian Mixture
Model (GMM) clustering algorithms on a dataset. Firstly, the dataset stored in a .CSV file is loaded
and preprocessed if required. Then, both algorithms are implemented to cluster the data into three
groups. Visualizations of the clustering results are created using scatter plots. Silhouette scores are
calculated for each clustering method to assess the quality of clustering. Finally, by comparing the
silhouette scores of K-Means and GMM, the algorithm that demonstrates better performance for the
given dataset is determined.
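The silhouette score used for the comparison is the mean, over all points, of the per-point coefficient

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$

where a(i) is the average distance from point i to the other points in its own cluster and b(i) is the average distance to the points of the nearest other cluster; values close to 1 indicate well-separated clusters.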
1. Load the dataset from the .CSV file and preprocess it if required.
2. Implement the K-Means algorithm:
a. Initialize the K-Means model with the desired number of clusters (k=3).
b. Fit the model to the dataset.
c. Obtain the cluster labels assigned by K-Means.
3. Visualize the K-Means clustering results using scatter plots.
4. Implement the Gaussian Mixture Model (GMM) using the EM algorithm:
a. Initialize the GMM model with the desired number of components (n=3).
b. Fit the model to the dataset.
c. Predict the cluster labels using the trained GMM model.
5. Visualize the GMM clustering results using scatter plots.
6. Calculate the silhouette scores for both clustering methods.
7. Print the silhouette scores of K-Means and GMM.
8. Compare the silhouette scores to determine the quality of clustering.
Python code:
plt.show()
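A minimal sketch of the K-Means vs. GMM comparison described in the steps above, using the Iris measurements from scikit-learn as a stand-in for the lab's .CSV file (the actual dataset, file path, and preprocessing may differ):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Stand-in for the lab's .CSV file: the four Iris measurements
X = load_iris().data

# K-Means with k = 3
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
km_labels = kmeans.fit_predict(X)

# Gaussian Mixture Model (EM) with 3 components
gmm = GaussianMixture(n_components=3, random_state=0)
gmm_labels = gmm.fit(X).predict(X)

# Scatter plots of the first two features, coloured by cluster
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=km_labels)
axes[0].set_title('K-Means clustering')
axes[1].scatter(X[:, 0], X[:, 1], c=gmm_labels)
axes[1].set_title('GMM (EM) clustering')
plt.show()

# Silhouette scores: higher means better-separated clusters
print('Silhouette score (K-Means):', silhouette_score(X, km_labels))
print('Silhouette score (GMM):    ', silhouette_score(X, gmm_labels))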
Output:
PROGRAM 8:
Write a program to implement the k-Nearest Neighbour algorithm to classify the iris data set.
Print both correct and wrong predictions. Java/Python ML library classes can be used for this
problem.
Aim:
To implement the K-Nearest Neighbors (KNN) algorithm to classify instances in the Iris
dataset and evaluate its performance using accuracy metrics.
Objective:
The objective of this program is to use the K-Nearest Neighbors (KNN) algorithm to
classify Iris flowers into their respective species based on features like sepal length, sepal width,
petal length, and petal width.
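The classifier assigns each test flower the majority class among its k nearest training flowers, where "nearest" is usually measured with the Euclidean distance over the four features:

$$d(a, b) = \sqrt{\sum_{i=1}^{4}(a_i - b_i)^2}$$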
Python code:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import datasets

# Iris Plants Dataset: 150 instances (50 in each of three classes),
# 4 numeric predictive attributes and the class label
iris = datasets.load_iris()

# The x variable contains the first four columns of the dataset (i.e. the
# attributes) while y contains the labels
x = iris.data
y = iris.target

# Print the column names and data
print('sepal-length', 'sepal-width', 'petal-length', 'petal-width')
print(x)
print('class: 0-Iris-Setosa, 1-Iris-Versicolour, 2-Iris-Virginica')
print(y)

# Splits the dataset into 70% train data and 30% test data: of the total
# 150 records, the training set contains 105 records and the test set 45
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Train the k-Nearest Neighbour classifier (k = 5) and predict the test set
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# Print correct and wrong predictions
for actual, predicted in zip(y_test, y_pred):
    print('Correct' if actual == predicted else 'Wrong',
          '- actual:', actual, ', predicted:', predicted)

print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))
print('Accuracy Metrics')
print(classification_report(y_test, y_pred))
Output:
class: 0-Iris-Setosa, 1-Iris-Versicolour, 2-Iris-Virginica
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
Confusion Matrix
[[18 0 0]
[ 0 10 2]
[ 0 0 15]]
Accuracy Metrics
precision recall f1-score support
accuracy 0.96 45
macro avg 0.96 0.94 0.95 45
weighted avg 0.96 0.96 0.95 45
PROGRAM 9:
Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select the appropriate data set for your experiment and draw graphs.
Aim:
This experiment aims to implement the non-parametric Locally Weighted Regression (LWR)
algorithm to fit data points and visualize the regression curve.
Objective:
Generate synthetic data points using a predefined function and add noise to mimic real-world
scenarios.
Implement the Locally Weighted Regression algorithm to fit the data points.
Visualize the regression curve along with the original data points for different values of
bandwidth (tau) to observe its effect on the regression curve.
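At each query point x0, the algorithm fits a separate weighted least-squares line; every training point x receives a Gaussian weight that decays with its distance from x0, and the bandwidth tau controls how quickly the weights fall off:

$$w(x, x_0) = \exp\!\left(-\frac{(x - x_0)^2}{2\tau^2}\right), \qquad \hat{\beta}(x_0) = (X^{\top} W X)^{-1} X^{\top} W y, \qquad \hat{y}(x_0) = x_0^{\top}\hat{\beta}(x_0)$$

where W is the diagonal matrix of the weights.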
Python code:
import numpy as np
import matplotlib.pyplot as plt

# Generating Data
x = np.linspace(-5, 5, 1000)
y = np.log(np.abs((x ** 2) - 1) + 0.5)
x = x + np.random.normal(scale=0.05, size=1000)   # add noise to the inputs
plt.scatter(x, y, alpha=0.3)
plt.show()

def radial_kernel(x0, x, tau):
    # Gaussian weights: points close to x0 get weights near 1
    return np.exp(np.sum((x - x0) ** 2, axis=1) / (-2 * tau ** 2))

def local_regression(x0, x, y, tau):
    # weighted least-squares fit around the query point x0
    x0 = np.r_[1, x0]                 # add the bias term
    x = np.c_[np.ones(len(x)), x]
    xw = x.T * radial_kernel(x0, x, tau)
    beta = np.linalg.pinv(xw @ x) @ xw @ y
    return x0 @ beta

# Plotting Function
def plot_lr(tau):
    domain = np.linspace(-5, 5, num=300)
    pred = [local_regression(x0, x, y, tau) for x0 in domain]
    plt.scatter(x, y, alpha=0.3)
    plt.plot(domain, pred, color="blue")
    plt.show()
    return plt

plot_lr(1.0)   # example bandwidths: a larger tau gives a smoother curve
plot_lr(0.1)   # a smaller tau makes the curve follow the data more closely
Output:
PROGRAM 10:
Aim:
To demonstrate the application of the Support Vector Machine (SVM) classifier on
suitable datasets to perform classification tasks.
Objective:
The objective of the program is to illustrate the practical implementation of the Support
Vector Machine (SVM) classifier for performing classification tasks on a suitable dataset.
The SVM decision function is f(x) = w · x + b, where w represents the weights (coefficients)
of the features, x is the input feature vector, and b is the bias term.
With a linear kernel, K(xi, xj) = xi · xj; this kernel computes the dot product between the
feature vectors xi and xj in the original feature space.
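During training (step 6 below), the classifier chooses w and b to maximise the margin between the classes, which can be written as the constrained optimisation

$$\min_{w,\,b}\;\tfrac{1}{2}\lVert w\rVert^{2} \quad\text{subject to}\quad y_i\,(w\cdot x_i + b) \ge 1 \;\;\text{for all training points } (x_i, y_i),\; y_i\in\{-1,+1\}.$$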
6. Fit the SVM classifier to the training data using the `fit()` method.
If 𝑓(𝑥) is positive, the sample is assigned to one class, and if it is negative, it is assigned to
another class.
7. Use the trained classifier to predict labels for the test data using the `predict()` method.
8. Evaluate Performance:
a. Calculate the accuracy of the classifier using `accuracy_score()` from scikit-learn.
b. Generate a classification report using `classification_report()` to get precision,
recall, F1-score, etc.
9. Visualize Decision Boundary:
a. Define a function to plot the decision boundary of the classifier.
b. Use meshgrid to create a grid of points and predict the class for each point.
c. Plot the decision boundary along with the data points using Matplotlib.
10. Show the decision boundary plots for visualization.
Python code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
plt.ylabel('Feature 2')
plt.title(title)
plt.show()
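A minimal end-to-end sketch of the procedure above. It assumes the Iris dataset restricted to its first two features (so the decision boundary can be drawn in 2-D), a linear kernel, and a 70/30 split; the manual's actual dataset, kernel, and resulting accuracy may differ:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris data; keep only the first two features so the
# decision boundary can be drawn in two dimensions
iris = load_iris()
X, y = iris.data[:, :2], iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train a linear-kernel SVM
clf = SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))

def plot_decision_boundary(clf, X, y, title):
    # Predict the class over a grid of points to shade the decision regions
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
    plt.show()

plot_decision_boundary(clf, X_test, y_test, 'SVM decision boundary (test data)')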
Output:
Accuracy: 1.0
Classification Report:
precision recall f1-score support
accuracy 1.00 45
macro avg 1.00 1.00 1.00 45
weighted avg 1.00 1.00 1.00 45