PR Final File
NumPy - NumPy stands for Numerical Python. It is a Python library for working with
arrays, and it also includes tools for linear algebra, the Fourier transform, and
matrices. Travis Oliphant created NumPy in 2005; it is an open-source project that can
be used freely. Python lists can serve the purpose of arrays, but they are slow to
process, so NumPy provides an array object that can be up to 50x faster than
traditional Python lists.
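The speed-up comes from vectorised operations running in compiled code instead of an interpreted loop. The timing sketch below is not part of the original practical and the exact factor depends on the machine and array size; it is only meant to illustrate the claim.
import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

# Element-wise multiplication with a plain Python list (interpreted loop)
start = time.perf_counter()
list_result = [x * 2 for x in py_list]
list_time = time.perf_counter() - start

# The same operation as a single vectorised NumPy call (compiled loop)
start = time.perf_counter()
arr_result = np_arr * 2
arr_time = time.perf_counter() - start

print(f"list: {list_time:.4f}s, numpy: {arr_time:.4f}s, speed-up ~{list_time / arr_time:.0f}x")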
Code-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [50]:
list1 = [[1,2,3],[4,5,6]]
In [51]:
#converting to numpy array
array1 = np.array(list1)
array1
Output:
array([[1, 2, 3],
[4, 5, 6]])
In [52]:
#Mathematical operation performed on all values of numpy arrays
toffee = np.array([5,8,3,6])
print(toffee - 2)
Output:
[3 6 1 4]
In [53]:
#Slicing
arr = np.array([1, 2, 3, 4, 5, 6, 7])
arr[1:5]
Output:
array([2, 3, 4, 5])
In [54]:
#Datatype of the array
arr.dtype
Output:
dtype('int32')
In [55]:
#Change data type from float to integer
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')
print(newarr)
print(newarr.dtype)
Output:
[1 2 3]
int32
In [56]:
#Shape of array
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('shape of array :', arr.shape)
Output:
[[[[[1 2 3 4]]]]]
shape of array : (1, 1, 1, 1, 4)
In [57]:
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passengers': [2, 6, 4]
}
In [58]:
#Series
info = np.array(['S','a','u','r','a','b','h'])
a = pd.Series(info)
a
Out[58]:
0 S
1 a
2 u
3 r
4 a
5 b
6 h
dtype: object
In [59]:
cardf = pd.DataFrame(mydataset)
cardf
Output:
    cars  passengers
0    BMW           2
1  Volvo           6
2   Ford           4
In [60]:
a=pd.Series(mydataset)
a
Output:
cars [BMW, Volvo, Ford]
passengers [2, 6, 4]
dtype: object
In [61]:
#Converting series to dataframe
a.to_frame()
Output:
                             0
cars        [BMW, Volvo, Ford]
passengers            [2, 6, 4]
In [62]:
#Unique values in dataframe
pd.unique(pd.Series([3, 1, 1, 2, 9, 7]))
Output:
array([3, 1, 2, 9, 7], dtype=int64)
In [63]:
#Append (concatenate) 2 dataframes
cardf2 = pd.DataFrame({"year":[2022, 2019, 2020],
"model":['zx', 'vls', 'b+'],
"enginecc":[3811, 1200, 4522]})
df = pd.concat([cardf, cardf2], ignore_index=True)  # DataFrame.append() was removed in newer pandas
df
Output:
    cars  passengers    year model  enginecc
0    BMW         2.0     NaN   NaN       NaN
1  Volvo         6.0     NaN   NaN       NaN
2   Ford         4.0     NaN   NaN       NaN
3    NaN         NaN  2022.0    zx    3811.0
4    NaN         NaN  2019.0   vls    1200.0
5    NaN         NaN  2020.0    b+    4522.0
In [64]:
#Count
df.count()
Output:
cars 3
passengers 3
year 3
model 3
enginecc 3
dtype: int64
In [65]:
#Iterating Dataframe
for row_index, row in cardf.iterrows():
    print(row_index, row)
Output:
0 cars BMW
passengers 2
Name: 0, dtype: object
1 cars Volvo
passengers 6
Name: 1, dtype: object
2 cars Ford
passengers 4
Name: 2, dtype: object
In [66]:
#Return top elements of df
df.head()
Output:
In [67]:
df.tail()
Output:
In [68]:
#Calculating statistical data like percentiles, mean, etc.
df.describe()
Output:
In [69]:
#Reading CSV files
df = pd.read_csv('iris.csv')
print(df)
Output:
[print(df) output: the full 150 x 6 iris table with columns Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species]
In [70]:
# car and Price are assumed here to be parallel lists of car names and prices
# (their original definitions were not captured; the values below are illustrative)
car = ["BMW", "Volvo", "Ford"]
Price = [60000, 45000, 30000]
plt.figure(figsize=(9,3))
plt.subplot(131)
plt.bar(car, Price)
plt.subplot(132)
plt.scatter(car, Price)
plt.subplot(133)
plt.plot(car, Price)
plt.suptitle('Plot')
plt.show()
Output:
In [71]:
y = np.array([35, 25])
mylabels = ["Melanoma", "Benign"]
plt.pie(y, labels = mylabels)
plt.show()
Output:
Learning Outcome:
1. To be able to perform basic operations using the NumPy, pandas and Matplotlib
libraries of Python.
2. To successfully use built-in tools for data manipulation, data analysis and data
visualization.
Experiment 2
Code:
import pandas as pd
df = pd.read_csv('iris.csv')
df.head()
Output:
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
In [2]:
df.shape
Output:
(150, 6)
In [3]:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
In [5]:
#Data Summarization
df.describe()
Output:
               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000
In [6]:
#Checking missing values
df.isnull().sum()
Output:
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
In [7]:
#Dropping duplicates based on Species (keeps the first row of each species)
data = df.drop_duplicates(subset="Species")
data
Output:
      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm          Species
0      1            5.1           3.5            1.4           0.2      Iris-setosa
50    51            7.0           3.2            4.7           1.4  Iris-versicolor
100  101            6.3           3.3            6.0           2.5   Iris-virginica
In [8]:
df.value_counts("Species")
Output:
Species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
dtype: int64
In [9]:
#Pearson correlation between the numeric columns of the deduplicated data
data.corr(method='pearson', numeric_only=True)
Output:
                     Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
Id             1.000000       0.624413     -0.654654       0.969909      0.999685
SepalLengthCm  0.624413       1.000000     -0.999226       0.795795      0.643817
SepalWidthCm  -0.654654      -0.999226      1.000000      -0.818999     -0.673417
PetalLengthCm  0.969909       0.795795     -0.818999       1.000000      0.975713
PetalWidthCm   0.999685       0.643817     -0.673417       0.975713      1.000000
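As a quick cross-check (a small sketch that is not part of the original notebook), the Pearson coefficient for a single pair of columns can also be computed directly with NumPy and should match the corresponding entry of the matrix above:
import numpy as np
r = np.corrcoef(data['PetalLengthCm'], data['PetalWidthCm'])[0, 1]
print(r)  # expected to be close to 0.975713, the PetalLengthCm / PetalWidthCm entry above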
Learning Outcomes:
1. To be able to perform pre-processing by checking for null values, missing values,
etc. in the data.
2. To be able to extract useful information from the data with the help of the various
statistical methods available.
Experiment 3
The iris dataset contains 150 observations, with 50 observations for each of the three
species of iris. Each observation includes measurements for the four attributes named
sepal length, sepal width, petal length, and petal width.
Code:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
In [17]:
df = pd.read_csv('iris.csv')
df.head()
Output:
In [18]:
df.describe()
Output:
               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000
In [19]:
#basic info about datatype
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
In [20]:
#to display no of samples in each class
df['Species'].value_counts()
Output:
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: Species, dtype: int64
In [21]:
#check null values
df.isnull().sum()
Output:
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
Data Visualization
In [22]:
# Visualize the data attributes with histograms
df['SepalLengthCm'].hist()
Output:
In [23]:
df['SepalWidthCm'].hist()
Output:
In [24]:
df['PetalLengthCm'].hist()
Output:
In [25]:
df['PetalWidthCm'].hist()
Output:
In [28]:
sns.boxplot(data=df,x='Species',y='PetalLengthCm')
Output:
In [30]:
sns.boxplot(data=df,x='Species',y='PetalWidthCm')
Output:
In [31]:
sns.boxplot(data=df,x='Species',y='SepalLengthCm')
Output:
In [33]:
sns.boxplot(data=df,x='Species',y='SepalWidthCm')
Output:
In [36]:
snsdata = df.drop(['Id'], axis=1)
sns.pairplot(snsdata,hue='Species',height=3)
plt.show()
Learning Outcomes:
1. To be able to use built-in tools for data visualization.
2. To be able to extract useful information from the data with the help of data
visualization.
Experiment 4
K in KNN is a parameter that refers to the number of nearest neighbours considered in
the majority-voting step: a new point is assigned the class that is most common among
its K closest training points.
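For comparison, here is a minimal sketch using scikit-learn's KNeighborsClassifier as a sanity check (it is not the from-scratch implementation below; it assumes the same Iris.csv file, and the test point values are hypothetical):
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv('Iris.csv')
X = data.drop(['Id', 'Species'], axis=1)
y = data['Species']

# Fit a 3-nearest-neighbour classifier and predict one hypothetical test point
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
test_point = pd.DataFrame([[7.2, 3.6, 5.1, 2.5]], columns=X.columns)
print(clf.predict(test_point))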
Code:
import pandas as pd
import numpy as np
import math
import operator
# Importing data
data = pd.read_csv('Iris.csv')
print(data.head(5))
#euclidean distance between two data points
def euclideanDistance(data1, data2, length):
    distance = 0
    for x in range(length):
        distance += np.square(data1[x] - data2[x])
    return np.sqrt(distance)
#KNN model
def knn(trainingSet, testInstance, k):
    distances = {}
    length = testInstance.shape[1]
    # Use only the four measurement columns for the distance computation
    features = trainingSet.drop(['Id', 'Species'], axis=1).values
    # Calculating euclidean distance between each row of training data and test data
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance.values[0], features[x], length)
        distances[x] = dist
    # Sort the training rows by their distance to the test instance
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))
    # Collect the indices of the k nearest neighbours
    neighbors = []
    for x in range(k):
        neighbors.append(sorted_d[x][0])
    # Majority vote over the neighbours' species labels
    classVotes = {}
    for x in range(len(neighbors)):
        response = trainingSet.iloc[neighbors[x]]['Species']
        classVotes[response] = classVotes.get(response, 0) + 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0], neighbors

# A sample test point (hypothetical values): sepal length, sepal width, petal length, petal width
test = pd.DataFrame([[7.2, 3.6, 5.1, 2.5]])
#k=1 can be tried as well
k=3
result, neigh = knn(data, test, k)
print(f'k={k}: predicted class = {result}, neighbours = {neigh}')
Output:
Learning outcome:
1. Understanding of lazy learner algorithm and how to implement them.
2. Understanding of different distances metric which can be used to establish relation
between different sets of points.
Experiment 5
Aim: To implement k means clustering.
Theory:
K-Means clustering is an unsupervised learning algorithm. There is no labeled data
for this clustering, unlike in supervised learning. K-Means performs the division of
objects into clusters that share similarities and are dissimilar to the objects belonging
to another cluster. The term ‘K’ is a number. For example, K = 2 refers to two clusters.
It is an iterative process: each data point is assigned to a group, and the data points
gradually get clustered based on similar features. The objective is to minimize the sum
of distances between the data points and their cluster centroids, in order to identify
the correct group for each data point.
We divide a data space into K clusters and assign a mean value to each. The data
points are placed in the clusters closest to the mean value of that cluster. There are
several distance metrics available that can be used to calculate the distance.
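In symbols, the quantity being minimised is the within-cluster sum of squared distances (the standard K-Means formulation, stated here for reference):
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
where \mu_k is the centroid of cluster C_k. The assignment and update steps below repeat until this objective stops decreasing (or a maximum number of iterations is reached).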
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [34]:
iris = pd.read_csv('Iris.csv')
mapping = {
'Iris-setosa' : 0,
'Iris-versicolor' : 1,
'Iris-virginica' : 2
}
y = iris.Species.replace(mapping)# Output values
# y=iris['Species']
X = iris.drop(['Id','Species'],axis=1).values
print(y.shape)
(150,)
In [35]:
clusters=len(np.unique(y))
In [36]:
def euclidean_dis(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))
In [37]:
from collections import defaultdict
class KMeans:
    def __init__(self, data, k, max_ite):
        self.data = data
        self.k = k
        self.max_ite = max_ite
    def predict(self):
        centroids = defaultdict(int)
        K = self.k
        max_iter = self.max_ite
        # Initialise the centroids with one sample from each species block (rows 0, 50, 100)
        j = 0
        for i in range(K):
            print(self.data[j])
            centroids[i] = self.data[j]
            j = j + 50
        for i in range(max_iter):
            classes = defaultdict(list)
            prediction = []
            # Assignment step: each point goes to the cluster of its nearest centroid
            for point in self.data:
                dists = [euclidean_dis(point, centroids[c]) for c in range(K)]
                cluster = int(np.argmin(dists))
                classes[cluster].append(point)
                prediction.append(cluster)
            # Update step: move each centroid to the mean of its assigned points
            for c in range(K):
                if len(classes[c]) > 0:
                    centroids[c] = np.mean(classes[c], axis=0)
        return classes, centroids, prediction
kmeans = KMeans(X, clusters, 100)
classes, centroids, prediction = kmeans.predict()
# print(prediction)
for i in range(0, 3):
    classes[i] = np.array(classes[i]).tolist()
for i in range(0, 3):
    print(len(classes[i]))
print(centroids)
accuracy = np.sum(y == prediction) / len(y)
print(f'Accuracy is: {accuracy}')
Output:
[5.1 3.5 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.3 3.3 6. 2.5]
50
62
38
defaultdict(<class 'int'>, {0: array([5.006, 3.418, 1.464, 0.244]), 1: array([5.
9016129 , 2.7483871 , 4.39354839, 1.43387097]), 2: array([6.85 , 3.07368421, 5.
74210526, 2.07105263])})
Accuracy is: 0.8933333333333333
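The clustering above can be cross-checked with scikit-learn's KMeans (a minimal sketch, not part of the original experiment; cluster labels may come out permuted relative to the species encoding, so only the cluster sizes are compared):
from sklearn.cluster import KMeans as SKKMeans

sk = SKKMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Compare the cluster sizes with the 50 / 62 / 38 split obtained above
print(np.bincount(sk.labels_))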
Learning outcome:
1. Understanding of an unsupervised learning algorithm and unlabeled data.
2. Understanding of how clusters are created and how data points are assigned to
different clusters based on distance metrics.
Experiment 6
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm
iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each
step.
Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision
tree. In simple words, the top-down approach means that we start building the tree
from the top, and the greedy approach means that at each iteration we select the best
feature at the present moment to create a node. ID3 uses Information Gain to find the
best feature at each iteration.
Information Gain calculates the reduction in entropy and measures how well a
given feature separates or classifies the target classes. The feature with the highest
Information Gain is selected as the best one.
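As a tiny worked example (a sketch that is not part of the original code, using a made-up split), entropy and information gain can be computed directly from class counts:
import math

def H(pos, neg):
    # Entropy of a set with `pos` positive and `neg` negative examples
    if pos == 0 or neg == 0:
        return 0.0
    p, n = pos / (pos + neg), neg / (pos + neg)
    return -(p * math.log(p, 2) + n * math.log(n, 2))

# Parent node with 9 "yes" and 5 "no"; a hypothetical attribute splits it
# into branches with (6 yes, 2 no) and (3 yes, 3 no)
parent = H(9, 5)
gain = parent - (8 / 14) * H(6, 2) - (6 / 14) * H(3, 3)
print(round(parent, 3), round(gain, 3))  # roughly 0.940 and 0.048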
Code:
import numpy as np
import pandas as pd
import math
df=pd.read_csv('Outlook.csv')
features = [feat for feat in df]
features.remove("answer")
class Node:
    def __init__(self):
        self.children = []
        self.value = ""
        self.isLeaf = False
        self.pred = ""
def entropy(val):
    pos = 0.0
    neg = 0.0
    for _, row in val.iterrows():
        if row["answer"] == "yes":
            pos += 1
        else:
            neg += 1
    if pos == 0.0 or neg == 0.0:
        return 0.0
    else:
        p = pos / (pos + neg)
        n = neg / (pos + neg)
        return -(p * math.log(p, 2) + n * math.log(n, 2))
def info_gain(data, attr):
    uniq = np.unique(data[attr])
    gain = entropy(data)
    for u in uniq:
        subdata = data[data[attr] == u]
        sub_e = entropy(subdata)
        gain -= (float(len(subdata)) / float(len(data))) * sub_e
    return gain
def ID3(data, attributes):
    root = Node()
    max_gain = 0
    max_feature = ""
    # Pick the attribute with the highest information gain
    for feature in attributes:
        gain = info_gain(data, feature)
        # print(gain)
        if gain > max_gain:
            max_gain = gain
            max_feature = feature
    root.value = max_feature
    uniq = np.unique(data[max_feature])
    for u in uniq:
        subdata = data[data[max_feature] == u]
        if entropy(subdata) == 0.0:
            newNode = Node()
            newNode.isLeaf = True
            newNode.value = u
            newNode.pred = np.unique(subdata["answer"])
            root.children.append(newNode)
        else:
            dummyNode = Node()
            dummyNode.value = u
            new_attrs = attributes.copy()
            new_attrs.remove(max_feature)
            child = ID3(subdata, new_attrs)
            dummyNode.children.append(child)
            root.children.append(dummyNode)
    return root
def printTree(root: Node, depth=0):
    for i in range(depth):
        print("\t", end="")
    print(root.value, end="")
    if root.isLeaf:
        print(" -> ", root.pred)
    print()
    for child in root.children:
        printTree(child, depth + 1)
root=ID3(df, features)
print("---------------------------------------------------------------------------------")
printTree(root)
Output:
Learning Outcomes:
1. To be able to understand different metrics used in ID3 decision tree algorithm like
entropy, information gain, etc.
CART is an umbrella term that refers to the following types of decision trees:
Classification trees: used when the target variable is categorical; the tree finds the
"class" into which the target variable is most likely to fall.
Regression trees: used to forecast the value of a continuous target variable.
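CART chooses splits with the Gini index. As a small illustration (not from the original code; the counts are made up), the Gini impurity falls as a node becomes purer:
def gini_impurity(pos, neg):
    # Gini impurity 1 - p^2 - n^2 for a node with `pos` positive and `neg` negative examples
    total = pos + neg
    p, n = pos / total, neg / total
    return 1 - p**2 - n**2

print(gini_impurity(5, 5))   # 0.5  -> maximally mixed node
print(gini_impurity(9, 1))   # 0.18 -> mostly one class
print(gini_impurity(10, 0))  # 0.0  -> pure node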
Code:
import numpy as np
import pandas as pd
import math
df=pd.read_csv('Outlook.csv')
features = [feat for feat in df]
features.remove("answer")
class Node:
    def __init__(self):
        self.children = []
        self.value = ""
        self.isLeaf = False
        self.pred = ""
def gini(val):
    pos = 0.0
    neg = 0.0
    for _, row in val.iterrows():
        if row["answer"] == "yes":
            pos += 1
        else:
            neg += 1
    if pos == 0.0 or neg == 0.0:
        return 0.0
    else:
        p = (pos / (pos + neg))**2
        n = (neg / (pos + neg))**2
        return 1 - p - n
def gini_idx(data, attr):
    uniq = np.unique(data[attr])
    gini_idx = 0
    for u in uniq:
        subdata = data[data[attr] == u]
        sub_g = gini(subdata)
        gini_idx += (float(len(subdata)) / float(len(data))) * sub_g
    return gini_idx
def CART(data, attributes):
    root = Node()
    min_gini = 1.0
    max_feature = ""
    # Pick the attribute whose split gives the lowest Gini index
    for feature in attributes:
        g = gini_idx(data, feature)
        if g < min_gini:
            min_gini = g
            max_feature = feature
    root.value = max_feature
    uniq = np.unique(data[max_feature])
    for u in uniq:
        subdata = data[data[max_feature] == u]
        if gini(subdata) == 0.0:
            newNode = Node()
            newNode.isLeaf = True
            newNode.value = u
            newNode.pred = np.unique(subdata["answer"])
            root.children.append(newNode)
        else:
            dummyNode = Node()
            dummyNode.value = u
            new_attrs = attributes.copy()
            new_attrs.remove(max_feature)
            child = CART(subdata, new_attrs)
            dummyNode.children.append(child)
            root.children.append(dummyNode)
    return root
root=CART(df, features)
print("---------------------------------------------------------------------------------")
print(f"decision tree with root {root.value} is:")
printTree(root)
Output:
Learning Outcomes:
1. To be able to understand the Gini index used by the CART algorithm for selecting
splits, and how it differs from the entropy-based criterion used in ID3.
Code:
import numpy as np
import pandas as pd
import math
df=pd.read_csv('Outlook.csv')
features = [feat for feat in df]
features.remove("answer")
class Node:
    def __init__(self):
        self.children = []
        self.value = ""
        self.isLeaf = False
        self.pred = ""
def entropy(val):
    pos = 0.0
    neg = 0.0
    for _, row in val.iterrows():
        if row["answer"] == "yes":
            pos += 1
        else:
            neg += 1
    if pos == 0.0 or neg == 0.0:
        return 0.0
    else:
        p = pos / (pos + neg)
        n = neg / (pos + neg)
        return -(p * math.log(p, 2) + n * math.log(n, 2))
# splitInfo was called but not defined in the captured code; this is the standard
# split-information term -(|Sv|/|S|) * log2(|Sv|/|S|) for one branch of the split
def splitInfo(subdata, data):
    ratio = float(len(subdata)) / float(len(data))
    return -ratio * math.log(ratio, 2)

def gain_ratio(data, attr):
    splitinfo = 0
    uniq = np.unique(data[attr])
    gain = entropy(data)
    for u in uniq:
        subdata = data[data[attr] == u]
        splitinfo = splitinfo + splitInfo(subdata, data)
        sub_e = entropy(subdata)
        gain -= (float(len(subdata)) / float(len(data))) * sub_e
    print(gain / splitinfo)
    return gain / splitinfo
def C45(data, attributes):
    root = Node()
    max_gain = 0
    max_feature = ""
    # Pick the attribute with the highest gain ratio
    for feature in attributes:
        g = gain_ratio(data, feature)
        if g > max_gain:
            max_gain = g
            max_feature = feature
    root.value = max_feature
    uniq = np.unique(data[max_feature])
    for u in uniq:
        subdata = data[data[max_feature] == u]
        if entropy(subdata) == 0.0:
            newNode = Node()
            newNode.isLeaf = True
            newNode.value = u
            newNode.pred = np.unique(subdata["answer"])
            root.children.append(newNode)
        else:
            dummyNode = Node()
            dummyNode.value = u
            new_attrs = attributes.copy()
            new_attrs.remove(max_feature)
            child = C45(subdata, new_attrs)
            dummyNode.children.append(child)
            root.children.append(dummyNode)
    return root
def printTree(root: Node, depth=0):
    for i in range(depth):
        print("\t", end="")
    print(root.value, end="")
    if root.isLeaf:
        print(" -> ", root.pred)
    print()
    for child in root.children:
        printTree(child, depth + 1)
root=C45(df, features)
print("---------------------------------------------------------------------------------")
print(f"decision tree with root {root.value} is:")
printTree(root)
Output:
Learning Outcomes:
1. To be able to understand the different metrics used in the C4.5 decision tree
algorithm, such as gain ratio, and to understand the improvement of C4.5 over ID3.
Code:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Define activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def relu(x):
    return np.maximum(0, x)
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
# Define the neural network class
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))
    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = relu(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.probs = softmax(self.z2)
        return self.probs
    def backward(self, X, y, learning_rate):
        m = len(X)
        delta3 = self.probs
        delta3[range(m), y] -= 1
        dW2 = np.dot(self.a1.T, delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = np.dot(delta3, self.W2.T) * (self.a1 > 0)
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0, keepdims=True)
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
    def train(self, X, y, learning_rate, epochs):
        for i in range(epochs):
            # Forward pass
            probs = self.forward(X)
            # Backward pass
            self.backward(X, y, learning_rate)
            # Print loss every 100 epochs
            if i % 100 == 0:
                loss = self.loss(X, y)
                print(f"Epoch {i}: Loss = {loss:.4f}")
    def predict(self, X):
        probs = self.forward(X)
        return np.argmax(probs, axis=1)
    def loss(self, X, y):
        probs = self.forward(X)
        correct_probs = probs[range(len(X)), y]
        data_loss = np.sum(-np.log(correct_probs))
        return data_loss / len(X)
# Load the Iris dataset and split it into training and test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
# Create and train the neural network
nn = NeuralNetwork(input_size=4, hidden_size=10, output_size=3)
nn.train(X_train, y_train, learning_rate=0.001, epochs=1000)
# Evaluate the neural network on the test set
predictions = nn.predict(X_test)
print(f'Test values are {y_test}')
print(f'Predicted values are {predictions}')
accuracy = np.sum(y_test == predictions) / len(y_test)
print(f'Accuracy is: {accuracy}')
Output:
Epoch 0: Loss = 1.0983
Epoch 100: Loss = 0.3449
Epoch 200: Loss = 0.2857
Epoch 300: Loss = 0.2326
Epoch 400: Loss = 0.0841
Epoch 500: Loss = 0.0738
Epoch 600: Loss = 0.0773
Epoch 700: Loss = 0.0698
Epoch 800: Loss = 0.0666
Epoch 900: Loss = 0.0643
Test values are [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
Predicted values are [1 0 2 1 1 0 1 2 2 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
Accuracy is: 0.9666666666666667
Learning Outcomes:
1. To be able to gain a basic understanding of neural networks and how they work.
2. To be able to build a neural network from scratch, covering all operations such as
forward propagation, backward propagation and calculating the loss function.
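One detail worth calling out: backward() above relies on the standard softmax-plus-cross-entropy shortcut, where the gradient of the loss with respect to the pre-softmax scores is simply the predicted probabilities minus the one-hot targets. The small numerical check below (a sketch, independent of the class above, with made-up scores and labels) verifies that identity:
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def cross_entropy(z, y):
    probs = softmax(z)
    return -np.mean(np.log(probs[range(len(z)), y]))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))          # pre-softmax scores for 4 samples, 3 classes
y = np.array([0, 2, 1, 1])           # integer class labels

# Analytical gradient: (probs - one_hot(y)) / m, since the loss is a mean over samples
probs = softmax(z)
probs[range(len(z)), y] -= 1
analytical = probs / len(z)

# Numerical gradient by central differences
eps = 1e-6
numerical = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        zp, zm = z.copy(), z.copy()
        zp[i, j] += eps
        zm[i, j] -= eps
        numerical[i, j] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

print(np.max(np.abs(analytical - numerical)))  # should be very small (around 1e-9)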