
DELHI TECHNOLOGICAL UNIVERSITY
(Formerly Delhi College of Engineering)
Shahbad Daulatpur, Bawana Road, Delhi-110042

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Pattern Recognition Lab File

Subject Code: AI5304

Submitted To:                              Submitted By:
Dr. Aruna Bhat                             Saurabh Singh
Department of Computer Science             M.Tech AFI
and Engineering                            2K22/AFI/20
INDEX

S.No.  Topic                                                                        Date   Signature
1      To study the NumPy, Pandas and Matplotlib libraries in Python.
2      To perform data preprocessing and data summarization on the Iris dataset.
3      To perform data preprocessing and data visualization on the Iris dataset.
4      To implement data classification using KNN.
5      To implement k-means clustering.
6      To implement a decision tree using the ID3 algorithm.
7      To implement a decision tree using the CART algorithm.
8      To implement a decision tree using the C4.5 algorithm.
9      To implement a multi-layer neural network.
10     Project: Develop a framework for skin disease detection using knowledge distillation.
Experiment 1

Aim: To study the NumPy, Pandas and Matplotlib libraries in Python.


Theory:

• Numpy- NumPy stands for Numerical Python. NumPy is a Python package for
working with arrays, and it also includes tools for linear algebra, the Fourier
transform, and matrices. Travis Oliphant developed NumPy in 2005. It is an
open-source project that can be used freely. Python lists can serve the purpose of
arrays, but they are slow to process; NumPy provides an array object that is up to
50x faster than traditional Python lists (a small timing sketch follows this list).

• Pandas- Pandas is an open-source library that provides high-performance data
manipulation in Python. The name Pandas is derived from "Panel Data", an
econometrics term for multidimensional data. It is used for data analysis in
Python and was developed by Wes McKinney in 2008. Data analysis requires a
lot of processing, such as restructuring, cleaning and merging, and Pandas is
preferred for this work because it is fast, simple and more expressive than many
other tools.

• Matplotlib- Matplotlib is the basic visualization and plotting library of the
Python programming language. It is a powerful tool that can create many kinds of
visualizations, including line plots, scatter plots, histograms, bar charts, pie charts
and box plots, and it also supports 3-dimensional plotting.
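
As an illustration of the speed claim above, here is a minimal, hedged timing sketch; the exact
speed-up depends on the machine, the array size and the operation, so the numbers are only
indicative:

import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

start = time.perf_counter()
_ = [x * 2 for x in py_list]      # element-wise doubling with a plain Python list
list_time = time.perf_counter() - start

start = time.perf_counter()
_ = np_arr * 2                    # the same operation vectorised with NumPy
numpy_time = time.perf_counter() - start

print(f"list: {list_time:.4f}s, numpy: {numpy_time:.4f}s")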

Code-

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [50]:
list1 = [[1,2,3],[4,5,6]]
In [51]:
#converting to numpy array
array1 = np.array(list1)
array1
Output:
array([[1, 2, 3],
[4, 5, 6]])
In [52]:
#Mathematical operation performed on all values of numpy arrays
toffee = np.array([5,8,3,6])
print(toffee - 2)
Output:
[3 6 1 4]
In [53]:
#Slicing
arr = np.array([1, 2, 3, 4, 5, 6, 7])
arr[1:5]
Output:
array([2, 3, 4, 5])
In [54]:
#Datatype of the array
arr.dtype
Output:
dtype('int32')
In [55]:
#Change data type from float to integer
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')
print(newarr)
print(newarr.dtype)
Output:
[1 2 3]
int32
In [56]:
#Shape of array
arr = np.array([1, 2, 3, 4], ndmin=5)

print(arr)
print('shape of array :', arr.shape)
Output:

[[[[[1 2 3 4]]]]]
shape of array : (1, 1, 1, 1, 4)
In [57]:
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passengers': [2, 6, 4]
}
In [58]:
#Series
info = np.array(['S','a','u','r','a','b','h'])
a = pd.Series(info)
a
Out[58]:
0 S
1 a
2 u
3 r
4 a
5 b
6 h
dtype: object
In [59]:
cardf = pd.DataFrame(mydataset)
cardf
Output:
    cars  passengers
0    BMW           2
1  Volvo           6
2   Ford           4

In [60]:
a=pd.Series(mydataset)
a
Output:
cars [BMW, Volvo, Ford]
passengers [2, 6, 4]
dtype: object
In [61]:
#Converting series to dataframe
a.to_frame()
Output:

cars [BMW, Volvo, Ford]

passengers [2, 6, 4]

In [62]:
#Unique values in dataframe
pd.unique(pd.Series([3, 1, 1, 2, 9, 7]))

Output:
array([3, 1, 2, 9, 7], dtype=int64)
In [63]:
#Append 2 dataframe
cardf2 = pd.DataFrame({"year": [2022, 2019, 2020],
                       "model": ['zx', 'vls', 'b+'],
                       "enginecc": [3811, 1200, 4522]})
# DataFrame.append was removed in recent pandas versions; pd.concat gives the same result
df = pd.concat([cardf, cardf2], ignore_index=True)
df
Output:
cars passengers year model enginecc

0 BMW 2.0 NaN NaN NaN

1 Volvo 6.0 NaN NaN NaN

2 Ford 4.0 NaN NaN NaN

3 NaN NaN 2022.0 zx 3811.0

4 NaN NaN 2019.0 vls 1200.0

5 NaN NaN 2020.0 b+ 4522.0

In [64]:
#Count
df.count()
Output:
cars 3
passengers 3
year 3
model 3
enginecc 3
dtype: int64
In [65]:
#Iterating Dataframe
for row_index, row in cardf.iterrows():
    print(row_index, row)
Output:
0 cars BMW
passengers 2
Name: 0, dtype: object
1 cars Volvo
passengers 6
Name: 1, dtype: object
2 cars Ford
passengers 4
Name: 2, dtype: object
In [66]:
#Return top elements of df
df.head()
Output:

cars passengers year model enginecc

0 BMW 2.0 NaN NaN NaN

1 Volvo 6.0 NaN NaN NaN

2 Ford 4.0 NaN NaN NaN

3 NaN NaN 2022.0 zx 3811.0

4 NaN NaN 2019.0 vls 1200.0

In [67]:
df.tail()
Output:

cars passengers year model enginecc

1 Volvo 6.0 NaN NaN NaN

2 Ford 4.0 NaN NaN NaN

3 NaN NaN 2022.0 zx 3811.0

4 NaN NaN 2019.0 vls 1200.0

5 NaN NaN 2020.0 b+ 4522.0

In [68]:
#calculating statistical data like percentile, mean,etc
df.describe()
Output:
       passengers         year     enginecc
count         3.0     3.000000     3.000000
mean          4.0  2020.333333  3177.666667
std           2.0     1.527525  1749.215348
min           2.0  2019.000000  1200.000000
25%           3.0  2019.500000  2505.500000
50%           4.0  2020.000000  3811.000000
75%           5.0  2021.000000  4166.500000
max           6.0  2022.000000  4522.000000

In [69]:
#Reading CSV files
df = pd.read_csv('iris.csv')
print(df)
Output:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm

0 1 5.1 3.5 1.4 0.2


1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica

[150 rows x 6 columns]


In [70]:
#Plots using Matplotlib
car = ['BMW', 'Volvo', 'Ford']
Price= [37,70,48]

plt.figure(figsize=(9,3))

plt.subplot(131)
plt.bar(car, Price)
plt.subplot(132)
plt.scatter(car, Price)
plt.subplot(133)
plt.plot(car, Price)
plt.suptitle('Plot')
plt.show()

Output:
In [71]:
y = np.array([35, 25])
mylabels = ["Melanoma", "Benign"]

plt.pie(y, labels = mylabels)
plt.show()
Output:

Learning Outcome:
1. To be able to perform basic operations using numpy, pandas and matplotlib
libraries of python.
2. Successfully use inbuilt tools for the purpose of data manipulation, data analysis
and data visualization.
Experiment 2

Aim: To perform data preprocessing and data summarization on the Iris dataset.


Theory:
• Data pre-processing- Data pre-processing is the process of preparing raw data
and making it suitable for a machine learning model. It is the first and crucial step
when creating a machine learning model. Real-world data generally contains
noise and missing values, and may be in an unusable format that cannot be used
directly by machine learning models. Data pre-processing covers the tasks needed
to clean the data and make it suitable for a model, which also increases the
accuracy and efficiency of the model (a short sketch of typical cleaning steps
follows this list).

• Data Summarization- Data summarization in machine learning refers to the
process of extracting important information from large datasets and presenting it
in a concise and meaningful way. This can be achieved using various techniques
such as statistical methods.
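
The Iris data used below turns out to be complete, so the pre-processing in this experiment
reduces to inspection. As a general, hedged sketch of the cleaning steps mentioned above
(illustrative only; the SpeciesCode column is introduced here purely for the example), missing-value
imputation and label encoding typically look like this:

import pandas as pd

df = pd.read_csv('Iris.csv')

# Fill missing numeric values with the column mean (a no-op for Iris, which has no gaps)
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Encode the categorical target as integer codes for downstream models (illustrative new column)
df['SpeciesCode'] = df['Species'].astype('category').cat.codes
print(df[['Species', 'SpeciesCode']].drop_duplicates())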

Code:
import pandas as pd

# Reading the CSV file


df = pd.read_csv("Iris.csv")

# Printing top 5 rows


df.head()
Output:
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

In [2]:
df.shape
Output:
(150, 6)

In [3]:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
In [5]:
#Data Summarization
df.describe()
Output:
               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000

In [6]:
#Checking missing values
df.isnull().sum()

Output:
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
In [7]:
#Dropping duplicate Species values (keeps the first row of each species)
data = df.drop_duplicates(subset ="Species",)
data
Output:
      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm          Species
0      1            5.1           3.5            1.4           0.2      Iris-setosa
50    51            7.0           3.2            4.7           1.4  Iris-versicolor
100  101            6.3           3.3            6.0           2.5   Iris-virginica

In [8]:
df.value_counts("Species")
Output:
Species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
dtype: int64
In [9]:
data.corr(method='pearson')
Output:
                     Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
Id             1.000000       0.624413     -0.654654       0.969909      0.999685
SepalLengthCm  0.624413       1.000000     -0.999226       0.795795      0.643817
SepalWidthCm  -0.654654      -0.999226      1.000000      -0.818999     -0.673417
PetalLengthCm  0.969909       0.795795     -0.818999       1.000000      0.975713
PetalWidthCm   0.999685       0.643817     -0.673417       0.975713      1.000000

Learning Outcomes:
1. To be able to perform pre-processing by checking for null values, missing values, etc.
in the data.
2. To be able to extract useful information from the data with the help of the various
statistical methods available.
Experiment 3

Aim: To perform data visualization on iris dataset.


Theory:
Data visualization- Data visualization is the graphical representation of quantitative
information and data using visual elements like graphs, charts, and maps. Data
visualization converts large and small data sets into visuals that are easy for humans to
understand and process. Data visualization tools provide accessible ways to understand
outliers, patterns, and trends in the data.

The Iris dataset contains 150 observations, with 50 observations for each of the three
species of iris. Each observation includes measurements for four attributes: sepal
length, sepal width, petal length, and petal width.

Code:

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
In [17]:
df = pd.read_csv('iris.csv')
df.head()
Output:
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

In [18]:

#to describe the stats about the data


df.describe()
Output:
               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000

In [19]:
#basic info about datatype
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
In [20]:
#to display no of samples in each class
df['Species'].value_counts()
Output:
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: Species, dtype: int64
In [21]:
#check null values
df.isnull().sum()
Output:
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64

Data Visualization
In [22]:
# Visualize the data attributes with histograms
df['SepalLengthCm'].hist()
Output:

In [23]:
df['SepalWidthCm'].hist()
Output:

In [24]:
df['PetalLengthCm'].hist()
Output:
In [25]:
df['PetalWidthCm'].hist()
Output:

In [28]:
sns.boxplot(data=df,x='Species',y='PetalLengthCm')
Output:
In [30]:
sns.boxplot(data=df,x='Species',y='PetalWidthCm')
Output:

In [31]:
sns.boxplot(data=df,x='Species',y='SepalLengthCm')
Output:
In [33]:
sns.boxplot(data=df,x='Species',y='SepalWidthCm')
Output:

In [36]:
snsdata = df.drop(['Id'], axis=1)
sns.pairplot(snsdata,hue='Species',height=3)
plt.show()
Learning Outcomes:
1. To be able to use inbuilt tools for data visualization.
2. To be able to extract useful information from the data with the help of data
visualization.
Experiment 4

Aim: To implement data classification using KNN.


Theory:
K-Nearest Neighbors is one of the simplest supervised machine learning algorithms
used for classification. It classifies a data point based on its neighbors’ classifications.
It stores all available cases and classifies new cases based on similar features.
For example, a KNN algorithm can be used to predict whether a glass of wine is red or
white, using variables such as sulphur dioxide and chloride levels.

K in KNN is a parameter that refers to the number of nearest neighbors in the majority
voting process.
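
As a cross-check of the from-scratch implementation below, the same kind of prediction can be
obtained with scikit-learn's KNeighborsClassifier. This is a minimal sketch, assuming the same
Iris.csv column layout used in the code below; results should broadly agree, though feature
handling and tie-breaking may differ:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv('Iris.csv')
X = data[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = data['Species']

# Fit a 3-nearest-neighbour classifier and classify one new flower
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
print(model.predict([[7.2, 3.6, 5.1, 2.5]]))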
Code:
import pandas as pd
import numpy as np
import math
import operator

# Importing data
data = pd.read_csv('Iris.csv')

print(data.head(5))
#euclidean distance between two data points
def euclideanDistance(data1, data2, length):
    distance = 0
    for x in range(length):
        distance += np.square(data1[x] - data2[x])
    return np.sqrt(distance)

#KNN model
def knn(trainingSet, testInstance, k):

    distances = {}
    length = testInstance.shape[1]

    # Calculating euclidean distance between each row of training data and the test data
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet.iloc[x], length)
        distances[x] = dist[0]

    #Sorting on the basis of distance
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))

    neighbors = []
    for x in range(k):
        neighbors.append(sorted_d[x][0])

    #Calculating the most freq class in the neighbors
    classVotes = {}
    for x in range(len(neighbors)):
        response = trainingSet.iloc[neighbors[x]][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1

    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return (sortedVotes[0][0], neighbors)

testSet = [[7.2, 3.6, 5.1, 2.5]]
test = pd.DataFrame(testSet)

#k=1
print('\n\nWith 1 Nearest Neighbour \n\n')
k = 1
result, neigh = knn(data, test, k)
print('\nPredicted Class of the datapoint = ', result)
print('\nNearest Neighbour of the datapoints = ', neigh)

#k=3
print('\n\nWith 3 Nearest Neighbours\n\n')
k = 3
result, neigh = knn(data, test, k)
print('\nPredicted class of the datapoint = ', result)
print('\nNearest Neighbours of the datapoints = ', neigh)

#k=5
print('\n\nWith 5 Nearest Neighbours\n\n')
k = 5
result, neigh = knn(data, test, k)
print('\nPredicted class of the datapoint = ', result)
print('\nNearest Neighbours of the datapoints = ', neigh)

Output:

Learning outcome:
1. Understanding of lazy-learner algorithms and how to implement them.
2. Understanding of the different distance metrics that can be used to establish relations
between different sets of points.
Experiment 5
Aim: To implement k means clustering.
Theory:
K-Means clustering is an unsupervised learning algorithm. There is no labeled data
for this clustering, unlike in supervised learning. K-Means performs the division of
objects into clusters that share similarities and are dissimilar to the objects belonging
to another cluster. The term ‘K’ is a number. For example, K = 2 refers to two clusters.

It is an iterative process in which each data point is assigned to a group, and the data
points gradually become clustered based on similar features. The objective is to minimize
the sum of distances between the data points and their cluster centroid, so as to identify
the correct group for each data point.

We divide a data space into K clusters and assign a mean value to each. The data
points are placed in the clusters closest to the mean value of that cluster. There are
several distance metrics available that can be used to calculate the distance.
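
Formally, with clusters C₁, …, Cₖ and centroids μ₁, …, μₖ, K-Means minimizes the within-cluster
sum of squared distances

J = ∑ₖ ∑ (x ∈ Cₖ) ||x − μₖ||²

Each iteration repeats two steps: assign every point to its nearest centroid, then recompute each
centroid as the mean of its assigned points, μₖ = (1 / |Cₖ|) ∑ (x ∈ Cₖ) x. This is exactly what the
predict method in the code below does.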

Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [34]:
iris = pd.read_csv('Iris.csv')

mapping = {
'Iris-setosa' : 0,
'Iris-versicolor' : 1,
'Iris-virginica' : 2
}
y = iris.Species.replace(mapping)# Output values
# y=iris['Species']
X = iris.drop(['Id','Species'],axis=1).values
print(y.shape)
(150,)
In [35]:
clusters=len(np.unique(y))
In [36]:
def euclidean_dis(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))
In [37]:
from collections import defaultdict

class KMeans:

    def __init__(self, data, k, max_ite):
        self.data = data
        self.k = k
        self.max_ite = max_ite

    def predict(self):

        centroids = defaultdict(int)

        K = self.k
        max_iter = self.max_ite
        j = 0

        # Initialise the centroids with one sample from each species (rows 0, 50, 100)
        for i in range(3):
            print(self.data[j])
            centroids[i] = self.data[j]
            j = j + 50

        for i in range(max_iter):
            classes = defaultdict(list)
            prediction = []

            for key in range(K):
                classes[key] = []

            # Assign every data point to its nearest centroid
            for datapoint in self.data:
                distance = []
                for j in range(K):
                    dis = euclidean_dis(datapoint, centroids[j])
                    distance.append(dis)
                mindis = min(distance)
                index = distance.index(mindis)
                prediction.append(index)
                classes[index].append(datapoint)

            old_centroid = dict(centroids)

            # Recompute each centroid as the mean of the points assigned to it
            for t in range(K):
                class_ = classes[t]
                new_centroid = np.mean(class_, axis=0)
                centroids[t] = new_centroid

        return classes, centroids, prediction
In [46]:

kmeans = KMeans(X, clusters, 100)

classes, centroids, prediction = kmeans.predict()

for i in range(0, 3):
    classes[i] = np.array(classes[i]).tolist()

for i in range(0, 3):
    print(len(classes[i]))

print(centroids)
accuracy = np.sum(y == prediction) / len(y)
print(f'Accuracy is: {accuracy}')

Output:
[5.1 3.5 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.3 3.3 6. 2.5]
50
62
38
defaultdict(<class 'int'>, {0: array([5.006, 3.418, 1.464, 0.244]), 1: array([5.9016129, 2.7483871, 4.39354839, 1.43387097]), 2: array([6.85, 3.07368421, 5.74210526, 2.07105263])})
Accuracy is: 0.8933333333333333

Learning outcome:
1. Understanding of unsupervised learning algorithms and unlabeled data.
2. Understanding of how clusters are created and data points are assigned to different
clusters based on distance metrics.
Experiment 6

Aim: To implement decision tree using ID3 algorithm.


Theory:

ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm
iteratively (repeatedly) dichotomizes(divides) features into two or more groups at each
step.

Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision
tree. In simple words, the top-down approach means that we start building the tree
from the top, and the greedy approach means that at each iteration we select the best
feature at the present moment to create a node. ID3 uses Information Gain to find the
best feature at each iteration.

Information Gain calculates the reduction in the entropy and measures how well a
given feature separates or classifies the target classes. The feature with the highest
Information Gain is selected as the best one.

IG(S, A) = Entropy(S) - ∑((|Sᵥ| / |S|) * Entropy(Sᵥ))
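
As a small worked example (assuming Outlook.csv follows the classic 14-row play-tennis weather
data, with 9 "yes" and 5 "no" answers; the numbers below only hold under that assumption), the
entropy and the information gain of the outlook split can be computed directly:

import math

def entropy(pos, neg):
    # Binary entropy of a yes/no split; zero when one class is absent
    if pos == 0 or neg == 0:
        return 0.0
    p, n = pos / (pos + neg), neg / (pos + neg)
    return -(p * math.log2(p) + n * math.log2(n))

total = entropy(9, 5)                      # ~0.940 bits for the full dataset
# outlook = sunny (2 yes / 3 no), overcast (4 / 0), rain (3 / 2)
weighted = (5/14) * entropy(2, 3) + (4/14) * entropy(4, 0) + (5/14) * entropy(3, 2)
print(f"IG(S, outlook) = {total - weighted:.3f}")   # ~0.247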

Code:

import numpy as np
import pandas as pd
import math

df = pd.read_csv('Outlook.csv')

features = [feat for feat in df]
features.remove("answer")

class Node:

    def __init__(self):
        self.children = []
        self.value = ""
        self.isLeaf = False
        self.pred = ""

def entropy(val):
    pos = 0.0
    neg = 0.0
    for _, row in val.iterrows():
        if row["answer"] == "yes":
            pos += 1
        else:
            neg += 1
    if pos == 0.0 or neg == 0.0:
        return 0.0
    else:
        p = pos / (pos + neg)
        n = neg / (pos + neg)
        return -(p * math.log(p, 2) + n * math.log(n, 2))

def info_gain(data, attr):
    uniq = np.unique(data[attr])
    gain = entropy(data)
    for u in uniq:
        subdata = data[data[attr] == u]
        sub_e = entropy(subdata)
        gain -= (float(len(subdata)) / float(len(data))) * sub_e
    return gain

def ID3(data, atrributes):
    root = Node()
    max_gain = 0
    max_feature = ""

    # Choose the feature with the highest information gain
    for feature in features:
        gain = info_gain(data, feature)
        if gain > max_gain:
            max_gain = gain
            max_feature = feature
    root.value = max_feature

    uniq = np.unique(data[max_feature])
    for u in uniq:
        subdata = data[data[max_feature] == u]
        if entropy(subdata) == 0.0:
            # Pure subset: create a leaf node with the predicted answer
            newNode = Node()
            newNode.isLeaf = True
            newNode.value = u
            newNode.pred = np.unique(subdata["answer"])
            root.children.append(newNode)
        else:
            # Impure subset: recurse on the remaining attributes
            dummyNode = Node()
            dummyNode.value = u
            new_attrs = atrributes.copy()
            new_attrs.remove(max_feature)
            child = ID3(subdata, new_attrs)
            dummyNode.children.append(child)
            root.children.append(dummyNode)

    return root

def printTree(root: Node, depth=0):
    for i in range(depth):
        print("\t", end="")
    print(root.value, end="")
    if root.isLeaf:
        print(" -> ", root.pred)
    print()
    for child in root.children:
        printTree(child, depth + 1)

root = ID3(df, features)
print("---------------------------------------------------------------------------------")
print(f"decision tree with root {root.value} is:")
printTree(root)

Output:
Learning Outcomes:

1. To be able to understand the different metrics used in the ID3 decision tree algorithm,
such as entropy and information gain.

2. To be able to perform classification of data with the help of ID3 algorithm.


Experiment 7

Aim: To implement decision tree using CART algorithm.


Theory:
The CART algorithm is a type of classification algorithm that builds a decision tree on the
basis of the Gini impurity index. It is a basic machine learning algorithm and covers a wide
variety of use cases. The statistician Leo Breiman coined the phrase to describe decision tree
algorithms that may be used for classification or regression predictive modeling problems.

CART is an umbrella term that refers to the following types of decision trees:

• Classification Trees: used when the target variable is categorical; the tree identifies the
"class" into which the target variable is most likely to fall.
• Regression Trees: used to forecast the value of a continuous target variable.
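
For a yes/no subset S with class proportions p and n, the Gini impurity used by the code below is

Gini(S) = 1 − p² − n²

and a candidate attribute A is scored by the size-weighted impurity of the partitions it creates:

GiniIndex(S, A) = ∑((|Sᵥ| / |S|) * Gini(Sᵥ))

The attribute with the lowest weighted Gini index is chosen as the split. For example, a subset with
2 "yes" and 3 "no" rows has Gini = 1 − (2/5)² − (3/5)² = 0.48.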

Code:
import numpy as np
import pandas as pd
import math

df = pd.read_csv('Outlook.csv')
features = [feat for feat in df]
features.remove("answer")

class Node:
    def __init__(self):
        self.children = []
        self.value = ""
        self.isLeaf = False
        self.pred = ""

def gini(val):
    pos = 0.0
    neg = 0.0
    for _, row in val.iterrows():
        if row["answer"] == "yes":
            pos += 1
        else:
            neg += 1
    if pos == 0.0 or neg == 0.0:
        return 0.0
    else:
        p = (pos / (pos + neg))**2
        n = (neg / (pos + neg))**2
        return 1 - p - n

def gini_idx(data, attr):
    uniq = np.unique(data[attr])
    gini_idx = 0
    for u in uniq:
        subdata = data[data[attr] == u]
        sub_g = gini(subdata)
        gini_idx += (float(len(subdata)) / float(len(data))) * sub_g
    return gini_idx

def CART(data, atrributes):
    root = Node()
    min_gini = 10000000
    max_feature = ""
    for feature in features:
        gini_val = gini_idx(data, feature)
        if gini_val < min_gini:
            min_gini = gini_val
            max_feature = feature
    root.value = max_feature

    uniq = np.unique(data[max_feature])
    for u in uniq:
        subdata = data[data[max_feature] == u]
        if gini(subdata) == 0.0:
            newNode = Node()
            newNode.isLeaf = True
            newNode.value = u
            newNode.pred = np.unique(subdata["answer"])
            root.children.append(newNode)
        else:
            dummyNode = Node()
            dummyNode.value = u
            new_attrs = atrributes.copy()
            new_attrs.remove(max_feature)
            child = CART(subdata, new_attrs)
            dummyNode.children.append(child)
            root.children.append(dummyNode)

    return root

def printTree(root: Node, depth=0):
    for i in range(depth):
        print("\t", end="")
    print(root.value, end="")
    if root.isLeaf:
        print(" -> ", root.pred)
    print()
    for child in root.children:
        printTree(child, depth + 1)

root = CART(df, features)
print("---------------------------------------------------------------------------------")
print(f"decision tree with root {root.value} is:")
printTree(root)

Output:

Learning Outcomes:

1. To be able to understand the different metrics used in the CART decision tree algorithm,
such as the Gini index, and to understand the differences between the various types of
decision tree algorithms.

2. To be able to perform classification of data with the help of the CART algorithm.


Experiment 8

Aim: To implement decision tree using C4.5 algorithm.


Theory:
The C4.5 algorithm is a decision tree algorithm developed by Ross Quinlan. It is an
extension of the ID3 algorithm, and it uses the same basic approach of recursively
splitting the data set based on the attribute that provides the highest information gain.
However, C4.5 introduces several improvements over ID3, making it a more powerful
and flexible algorithm.
One of the main improvements of C4.5 over ID3 is that it handles both continuous
and discrete attributes. In ID3, only discrete attributes can be used, which limits the
types of data that can be processed. For a continuous attribute, C4.5 chooses a
threshold on the attribute's values and splits the records into those above and below
that threshold, effectively treating the attribute as binary at that node.
Another improvement of C4.5 is that it uses a statistical method called gain ratio to
select the splitting attribute, instead of information gain used in ID3. Gain ratio takes
into account the number of values a given attribute can take, and adjusts the
information gain accordingly to avoid overestimating the importance of attributes
with many values.
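
Concretely, the gain ratio used in the code below normalizes the information gain by the split
information of the attribute:

SplitInfo(S, A) = -∑((|Sᵥ| / |S|) * log₂(|Sᵥ| / |S|))

GainRatio(S, A) = IG(S, A) / SplitInfo(S, A)

An attribute that fragments the data into many small subsets receives a large SplitInfo and
therefore a correspondingly smaller gain ratio.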

Code:
import numpy as np
import pandas as pd
import math

df = pd.read_csv('Outlook.csv')
features = [feat for feat in df]
features.remove("answer")

class Node:
    def __init__(self):
        self.children = []
        self.value = ""
        self.isLeaf = False
        self.pred = ""

def entropy(val):
    pos = 0.0
    neg = 0.0
    for _, row in val.iterrows():
        if row["answer"] == "yes":
            pos += 1
        else:
            neg += 1
    if pos == 0.0 or neg == 0.0:
        return 0.0
    else:
        p = pos / (pos + neg)
        n = neg / (pos + neg)
        return -(p * math.log(p, 2) + n * math.log(n, 2))

def splitInfo(val, data):
    count = 0
    shape = data['outlook'].shape
    for _, row in val.iterrows():
        count += 1
    if count == 0.0:
        return 0.0
    else:
        p = count / shape[0]
        return -p * math.log(p, 2)

def gain_ratio(data, attr):
    splitinfo = 0
    uniq = np.unique(data[attr])
    gain = entropy(data)
    for u in uniq:
        subdata = data[data[attr] == u]
        splitinfo = splitinfo + splitInfo(subdata, data)
        sub_e = entropy(subdata)
        gain -= (float(len(subdata)) / float(len(data))) * sub_e
    print(gain / splitinfo)
    return gain / splitinfo

def C45(data, atrributes):
    root = Node()
    max_gain = 0
    max_feature = ""
    for feature in atrributes:
        gain = gain_ratio(data, feature)
        if gain > max_gain:
            max_gain = gain
            max_feature = feature
    root.value = max_feature

    uniq = np.unique(data[max_feature])
    for u in uniq:
        subdata = data[data[max_feature] == u]
        if entropy(subdata) == 0.0:
            newNode = Node()
            newNode.isLeaf = True
            newNode.value = u
            newNode.pred = np.unique(subdata["answer"])
            root.children.append(newNode)
        else:
            dummyNode = Node()
            dummyNode.value = u
            new_attrs = atrributes.copy()
            new_attrs.remove(max_feature)
            child = C45(subdata, new_attrs)
            dummyNode.children.append(child)
            root.children.append(dummyNode)
    return root

def printTree(root: Node, depth=0):
    for i in range(depth):
        print("\t", end="")
    print(root.value, end="")
    if root.isLeaf:
        print(" -> ", root.pred)
    print()
    for child in root.children:
        printTree(child, depth + 1)

root = C45(df, features)
print("---------------------------------------------------------------------------------")
print(f"decision tree with root {root.value} is:")
printTree(root)

Output:

Learning Outcomes:

1. To be able to understand the different metrics used in the C4.5 decision tree algorithm,
such as the gain ratio, and to understand the improvements of C4.5 over the ID3 algorithm.

2. To be able to perform classification of data with the help of C4.5 algorithm.


Experiment 9
Aim: To implement a multi-layer neural network.
Theory:
A neural network is a computational model inspired by the structure and function of the human brain. It
consists of interconnected nodes, called neurons, organized into layers. Each neuron receives one or
more inputs, processes them, and generates an output signal that can be passed to other neurons in the
network. The strength of the connections between neurons, called weights, determines how much each
neuron's input affects its output.
Neural networks are used for a variety of tasks, such as image and speech recognition, natural language
processing, and predictive modeling. They can learn to recognize patterns and relationships in large
datasets by adjusting their weights through a process called training, which involves presenting the
network with input data and adjusting the weights to minimize the difference between the predicted
outputs and the actual outputs.
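
For the two-layer network implemented below, one forward pass and the training loss can be
written compactly as

z₁ = XW₁ + b₁,   a₁ = ReLU(z₁)
z₂ = a₁W₂ + b₂,   ŷ = softmax(z₂)
Loss = -(1/m) ∑ log ŷᵢ[yᵢ]

where m is the number of samples and yᵢ is the true class index of sample i. The backward pass
adjusts W₁, b₁, W₂ and b₂ in the direction that reduces this cross-entropy loss.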

Code:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Define activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

# Define the neural network class
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = relu(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.probs = softmax(self.z2)
        return self.probs

    def backward(self, X, y, learning_rate):
        m = len(X)
        delta3 = self.probs
        delta3[range(m), y] -= 1
        dW2 = np.dot(self.a1.T, delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = np.dot(delta3, self.W2.T) * (self.a1 > 0)
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0, keepdims=True)
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1

    def train(self, X, y, learning_rate, epochs):
        for i in range(epochs):
            # Forward pass
            probs = self.forward(X)

            # Backward pass
            self.backward(X, y, learning_rate)

            # Print loss every 100 epochs
            if i % 100 == 0:
                loss = self.loss(X, y)
                print(f"Epoch {i}: Loss = {loss:.4f}")

    def predict(self, X):
        probs = self.forward(X)
        return np.argmax(probs, axis=1)

    def loss(self, X, y):
        probs = self.forward(X)
        correct_probs = probs[range(len(X)), y]
        data_loss = np.sum(-np.log(correct_probs))
        return data_loss / len(X)

# Load the Iris dataset and split it into training and test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Create and train the neural network
nn = NeuralNetwork(input_size=4, hidden_size=10, output_size=3)
nn.train(X_train, y_train, learning_rate=0.001, epochs=1000)

# Evaluate the neural network on the test set
predictions = nn.predict(X_test)
print(f'Test values are {y_test}')
print(f'Predicted values are {predictions}')
accuracy = np.sum(y_test == predictions) / len(y_test)
print(f'Accuracy is: {accuracy}')

Output:
Epoch 0: Loss = 1.0983
Epoch 100: Loss = 0.3449
Epoch 200: Loss = 0.2857
Epoch 300: Loss = 0.2326
Epoch 400: Loss = 0.0841
Epoch 500: Loss = 0.0738
Epoch 600: Loss = 0.0773
Epoch 700: Loss = 0.0698
Epoch 800: Loss = 0.0666
Epoch 900: Loss = 0.0643
Test values are [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
Predicted values are [1 0 2 1 1 0 1 2 2 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
Accuracy is: 0.9666666666666667

Learning Outcomes:
1. To be able to have basic understanding of neural networks and how they work.
2. To be able to build neural network from scratch considering all operations like
forward propagation, backward propagation and calculating loss function.
