PR Final File
NumPy - NumPy stands for Numerical Python. It is a Python library for working with
arrays, and it also includes tools for linear algebra, the Fourier transform, and
matrices. Travis Oliphant created NumPy in 2005; it is an open-source project that can
be used freely. Python lists can serve the purpose of arrays, but they are slow to
process, so NumPy provides an array object that can be up to 50x faster than
traditional Python lists.
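The speed-up comes from vectorised operations running in compiled code instead of an interpreted loop. The timing sketch below is not part of the original practical and the exact factor depends on the machine and array size; it is only meant to illustrate the claim.
import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

# Element-wise multiplication with a plain Python list (interpreted loop)
start = time.perf_counter()
list_result = [x * 2 for x in py_list]
list_time = time.perf_counter() - start

# The same operation as a single vectorised NumPy call (compiled loop)
start = time.perf_counter()
arr_result = np_arr * 2
arr_time = time.perf_counter() - start

print(f"list: {list_time:.4f}s, numpy: {arr_time:.4f}s, speed-up ~{list_time / arr_time:.0f}x")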
Code-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [50]:
list1 = [[1,2,3],[4,5,6]]
In [51]:
#converting to numpy array
array1 = np.array(list1)
array1
Output:
array([[1, 2, 3],
[4, 5, 6]])
In [52]:
#Mathematical operation performed on all values of numpy arrays
toffee = np.array([5,8,3,6])
print(toffee - 2)
Output:
[3 6 1 4]
In [53]:
#Slicing
arr = np.array([1, 2, 3, 4, 5, 6, 7])
arr[1:5]
Output:
array([2, 3, 4, 5])
In [54]:
#Datatype of the array
arr.dtype
Output:
dtype('int32')
In [55]:
#Change data type from float to integer
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')
print(newarr)
print(newarr.dtype)
Output:
[1 2 3]
int32
In [56]:
#Shape of array
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('shape of array :', arr.shape)
Output:
[[[[[1 2 3 4]]]]]
shape of array : (1, 1, 1, 1, 4)
In [57]:
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passengers': [2, 6, 4]
}
In [58]:
#Series
info = np.array(['S','a','u','r','a','b','h'])
a = pd.Series(info)
a
Out[58]:
0 S
1 a
2 u
3 r
4 a
5 b
6 h
dtype: object
In [59]:
cardf = pd.DataFrame(mydataset)
cardf
Output:
    cars  passengers
0    BMW           2
1  Volvo           6
2   Ford           4
In [60]:
a=pd.Series(mydataset)
a
Output:
cars [BMW, Volvo, Ford]
passengers [2, 6, 4]
dtype: object
In [61]:
#Converting series to dataframe
a.to_frame()
Output:
                             0
cars        [BMW, Volvo, Ford]
passengers            [2, 6, 4]
In [62]:
#Unique values in dataframe
pd.unique(pd.Series([3, 1, 1, 2, 9, 7]))
Output:
array([3, 1, 2, 9, 7], dtype=int64)
In [63]:
#Append (concatenate) 2 dataframes
cardf2 = pd.DataFrame({"year":[2022, 2019, 2020],
"model":['zx', 'vls', 'b+'],
"enginecc":[3811, 1200, 4522]})
df = pd.concat([cardf, cardf2], ignore_index=True)  # DataFrame.append() was removed in newer pandas
df
Output:
    cars  passengers    year model  enginecc
0    BMW         2.0     NaN   NaN       NaN
1  Volvo         6.0     NaN   NaN       NaN
2   Ford         4.0     NaN   NaN       NaN
3    NaN         NaN  2022.0    zx    3811.0
4    NaN         NaN  2019.0   vls    1200.0
5    NaN         NaN  2020.0    b+    4522.0
In [64]:
#Count
df.count()
Output:
cars 3
passengers 3
year 3
model 3
enginecc 3
dtype: int64
In [65]:
#Iterating Dataframe
for row_index, row in cardf.iterrows():
    print(row_index, row)
Output:
0 cars BMW
passengers 2
Name: 0, dtype: object
1 cars Volvo
passengers 6
Name: 1, dtype: object
2 cars Ford
passengers 4
Name: 2, dtype: object
In [66]:
#Return top elements of df
df.head()
Output:
In [67]:
df.tail()
Output:
In [68]:
#Calculating statistical data like percentiles, mean, etc.
df.describe()
Output:
In [69]:
#Reading CSV files
df = pd.read_csv('iris.csv')
print(df)
Output:
[print(df) output: the full 150 x 6 iris table with columns Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species]
In [70]:
# car and Price are assumed here to be parallel lists of car names and prices
# (their original definitions were not captured; the values below are illustrative)
car = ["BMW", "Volvo", "Ford"]
Price = [60000, 45000, 30000]
plt.figure(figsize=(9,3))
plt.subplot(131)
plt.bar(car, Price)
plt.subplot(132)
plt.scatter(car, Price)
plt.subplot(133)
plt.plot(car, Price)
plt.suptitle('Plot')
plt.show()
Output:
In [71]:
y = np.array([35, 25])
mylabels = ["Melanoma", "Benign"]
plt.pie(y, labels = mylabels)
plt.show()
Output:
Learning Outcome:
1. To be able to perform basic operations using the NumPy, pandas and Matplotlib
libraries of Python.
2. To successfully use built-in tools for data manipulation, data analysis and data
visualization.
Experiment 2
Code:
import pandas as pd
df = pd.read_csv('iris.csv')
df.head()
Output:
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
In [2]:
df.shape
Output:
(150, 6)
In [3]:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
In [5]:
#Data Summarization
df.describe()
Output:
               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000
In [6]:
#Checking missing values
df.isnull().sum()
Output:
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
In [7]:
#Dropping duplicates based on Species (keeps the first row of each species)
data = df.drop_duplicates(subset="Species")
data
Output:
      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm          Species
0      1            5.1           3.5            1.4           0.2      Iris-setosa
50    51            7.0           3.2            4.7           1.4  Iris-versicolor
100  101            6.3           3.3            6.0           2.5   Iris-virginica
In [8]:
df.value_counts("Species")
Output:
Species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
dtype: int64
In [9]:
#Pearson correlation between the numeric columns of the deduplicated data
data.corr(method='pearson', numeric_only=True)
Output:
                     Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
Id             1.000000       0.624413     -0.654654       0.969909      0.999685
SepalLengthCm  0.624413       1.000000     -0.999226       0.795795      0.643817
SepalWidthCm  -0.654654      -0.999226      1.000000      -0.818999     -0.673417
PetalLengthCm  0.969909       0.795795     -0.818999       1.000000      0.975713
PetalWidthCm   0.999685       0.643817     -0.673417       0.975713      1.000000
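As a quick cross-check (a small sketch that is not part of the original notebook), the Pearson coefficient for a single pair of columns can also be computed directly with NumPy and should match the corresponding entry of the matrix above:
import numpy as np
r = np.corrcoef(data['PetalLengthCm'], data['PetalWidthCm'])[0, 1]
print(r)  # expected to be close to 0.975713, the PetalLengthCm / PetalWidthCm entry above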
Learning Outcomes:
1. To be able to perform pre-processing by checking for null values, missing values,
etc. in the data.
2. To be able to extract useful information from the data with the help of the various
statistical methods available.
Experiment 3
The iris dataset contains 150 observations, with 50 observations for each of the three
species of iris. Each observation includes measurements for the four attributes named
sepal length, sepal width, petal length, and petal width.
Code:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
In [17]:
df = pd.read_csv('iris.csv')
df.head()
Output:
In [18]:
df.describe()
Output:
               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000
In [19]:
#basic info about datatype
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
In [20]:
#to display no of samples in each class
df['Species'].value_counts()
Output:
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: Species, dtype: int64
In [21]:
#check null values
df.isnull().sum()
Output:
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
Data Visualization
In [22]:
# Visualize the data attributes with histograms
df['SepalLengthCm'].hist()
Output:
In [23]:
df['SepalWidthCm'].hist()
Output:
In [24]:
df['PetalLengthCm'].hist()
Output:
In [25]:
df['PetalWidthCm'].hist()
Output:
In [28]:
sns.boxplot(data=df,x='Species',y='PetalLengthCm')
Output:
In [30]:
sns.boxplot(data=df,x='Species',y='PetalWidthCm')
Output:
In [31]:
sns.boxplot(data=df,x='Species',y='SepalLengthCm')
Output:
In [33]:
sns.boxplot(data=df,x='Species',y='SepalWidthCm')
Output:
In [36]:
snsdata = df.drop(['Id'], axis=1)
sns.pairplot(snsdata,hue='Species',height=3)
plt.show()
Learning Outcomes:
1. To be able to use built-in tools for data visualization.
2. To be able to extract useful information from the data with the help of data
visualization.
Experiment 4
K in KNN is a parameter that refers to the number of nearest neighbours considered in
the majority-voting step: a new point is assigned the class that is most common among
its K closest training points.
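For comparison, here is a minimal sketch using scikit-learn's KNeighborsClassifier as a sanity check (it is not the from-scratch implementation below; it assumes the same Iris.csv file, and the test point values are hypothetical):
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv('Iris.csv')
X = data.drop(['Id', 'Species'], axis=1)
y = data['Species']

# Fit a 3-nearest-neighbour classifier and predict one hypothetical test point
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
test_point = pd.DataFrame([[7.2, 3.6, 5.1, 2.5]], columns=X.columns)
print(clf.predict(test_point))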
Code:
import pandas as pd
import numpy as np
import math
import operator
# Importing data
data = pd.read_csv('Iris.csv')
print(data.head(5))
#euclidean distance between two data points
def euclideanDistance(data1, data2, length):
    distance = 0
    for x in range(length):
        distance += np.square(data1[x] - data2[x])
    return np.sqrt(distance)
#KNN model
def knn(trainingSet, testInstance, k):
    distances = {}
    length = testInstance.shape[1]
    # Use only the four measurement columns for the distance computation
    features = trainingSet.drop(['Id', 'Species'], axis=1).values
    # Calculating euclidean distance between each row of training data and test data
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance.values[0], features[x], length)
        distances[x] = dist
    # Sort the training rows by their distance to the test instance
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))
    # Collect the indices of the k nearest neighbours
    neighbors = []
    for x in range(k):
        neighbors.append(sorted_d[x][0])
    # Majority vote over the neighbours' species labels
    classVotes = {}
    for x in range(len(neighbors)):
        response = trainingSet.iloc[neighbors[x]]['Species']
        classVotes[response] = classVotes.get(response, 0) + 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0], neighbors

# A sample test point (hypothetical values): sepal length, sepal width, petal length, petal width
test = pd.DataFrame([[7.2, 3.6, 5.1, 2.5]])
#k=1 can be tried as well
k=3
result, neigh = knn(data, test, k)
print(f'k={k}: predicted class = {result}, neighbours = {neigh}')
Output:
Learning outcome:
1. Understanding of lazy learner algorithm and how to implement them.
2. Understanding of different distances metric which can be used to establish relation
between different sets of points.
Experiment 5
Aim: To implement k means clustering.
Theory:
K-Means clustering is an unsupervised learning algorithm. There is no labeled data
for this clustering, unlike in supervised learning. K-Means performs the division of
objects into clusters that share similarities and are dissimilar to the objects belonging
to another cluster. The term ‘K’ is a number. For example, K = 2 refers to two clusters.
It is an iterative process: each data point is assigned to a group, and the data points
gradually get clustered based on similar features. The objective is to minimize the sum
of distances between the data points and their cluster centroids, in order to identify
the correct group for each data point.
We divide a data space into K clusters and assign a mean value to each. The data
points are placed in the clusters closest to the mean value of that cluster. There are
several distance metrics available that can be used to calculate the distance.
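In symbols, the quantity being minimised is the within-cluster sum of squared distances (the standard K-Means formulation, stated here for reference):
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
where \mu_k is the centroid of cluster C_k. The assignment and update steps below repeat until this objective stops decreasing (or a maximum number of iterations is reached).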
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [34]:
iris = pd.read_csv('Iris.csv')
mapping = {
'Iris-setosa' : 0,
'Iris-versicolor' : 1,
'Iris-virginica' : 2
}
y = iris.Species.replace(mapping)# Output values
# y=iris['Species']
X = iris.drop(['Id','Species'],axis=1).values
print(y.shape)
(150,)
In [35]:
clusters=len(np.unique(y))
In [36]:
def euclidean_dis(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))
In [37]:
from collections import defaultdict
class KMeans:
    def __init__(self, data, k, max_ite):
        self.data = data
        self.k = k
        self.max_ite = max_ite
    def predict(self):
        centroids = defaultdict(int)
        K = self.k
        max_iter = self.max_ite
        # Initialise the centroids with one sample from each species block (rows 0, 50, 100)
        j = 0
        for i in range(K):
            print(self.data[j])
            centroids[i] = self.data[j]
            j = j + 50
        for i in range(max_iter):
            classes = defaultdict(list)
            prediction = []
            # Assignment step: each point goes to the cluster of its nearest centroid
            for point in self.data:
                dists = [euclidean_dis(point, centroids[c]) for c in range(K)]
                cluster = int(np.argmin(dists))
                classes[cluster].append(point)
                prediction.append(cluster)
            # Update step: move each centroid to the mean of its assigned points
            for c in range(K):
                if len(classes[c]) > 0:
                    centroids[c] = np.mean(classes[c], axis=0)
        return classes, centroids, prediction
kmeans = KMeans(X, clusters, 100)
classes, centroids, prediction = kmeans.predict()
# print(prediction)
for i in range(0, 3):
    classes[i] = np.array(classes[i]).tolist()
for i in range(0, 3):
    print(len(classes[i]))
print(centroids)
accuracy = np.sum(y == prediction) / len(y)
print(f'Accuracy is: {accuracy}')
Output:
[5.1 3.5 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.3 3.3 6. 2.5]
50
62
38
defaultdict(<class 'int'>, {0: array([5.006, 3.418, 1.464, 0.244]), 1: array([5.
9016129 , 2.7483871 , 4.39354839, 1.43387097]), 2: array([6.85 , 3.07368421, 5.
74210526, 2.07105263])})
Accuracy is: 0.8933333333333333
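The clustering above can be cross-checked with scikit-learn's KMeans (a minimal sketch, not part of the original experiment; cluster labels may come out permuted relative to the species encoding, so only the cluster sizes are compared):
from sklearn.cluster import KMeans as SKKMeans

sk = SKKMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Compare the cluster sizes with the 50 / 62 / 38 split obtained above
print(np.bincount(sk.labels_))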
Learning outcome:
1. Understanding of an unsupervised learning algorithm and unlabeled data.
2. Understanding of how clusters are created and how data points are assigned to
different clusters based on distance metrics.
Experiment 6
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm
iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each
step.
Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision
tree. In simple words, the top-down approach means that we start building the tree
from the top, and the greedy approach means that at each iteration we select the best
feature at the present moment to create a node. ID3 uses Information Gain to find the
best feature at each iteration.
Information Gain calculates the reduction in entropy and measures how well a
given feature separates or classifies the target classes. The feature with the highest
Information Gain is selected as the best one.
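As a tiny worked example (a sketch that is not part of the original code, using a made-up split), entropy and information gain can be computed directly from class counts:
import math

def H(pos, neg):
    # Entropy of a set with `pos` positive and `neg` negative examples
    if pos == 0 or neg == 0:
        return 0.0
    p, n = pos / (pos + neg), neg / (pos + neg)
    return -(p * math.log(p, 2) + n * math.log(n, 2))

# Parent node with 9 "yes" and 5 "no"; a hypothetical attribute splits it
# into branches with (6 yes, 2 no) and (3 yes, 3 no)
parent = H(9, 5)
gain = parent - (8 / 14) * H(6, 2) - (6 / 14) * H(3, 3)
print(round(parent, 3), round(gain, 3))  # roughly 0.940 and 0.048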
Code:
import numpy as np
import pandas as pd
import math
df=pd.read_csv('Outlook.csv')
features = [feat for feat in df]
features.remove("answer")
class Node:
    def __init__(self):
        self.children = []
        self.value = ""
        self.isLeaf = False
        self.pred = ""
def entropy(val):
    pos = 0.0
    neg = 0.0
    for _, row in val.iterrows():
        if row["answer"] == "yes":
            pos += 1
        else:
            neg += 1
    if pos == 0.0 or neg == 0.0:
        return 0.0
    else:
        p = pos / (pos + neg)
        n = neg / (pos + neg)
        return -(p * math.log(p, 2) + n * math.log(n, 2))
def info_gain(data, attr):
    uniq = np.unique(data[attr])
    gain = entropy(data)
    for u in uniq:
        subdata = data[data[attr] == u]
        sub_e = entropy(subdata)
        gain -= (float(len(subdata)) / float(len(data))) * sub_e
    return gain
def ID3(data, attributes):
    root = Node()
    max_gain = 0
    max_feature = ""
    # Pick the attribute with the highest information gain
    for feature in attributes:
        gain = info_gain(data, feature)
        # print(gain)
        if gain > max_gain:
            max_gain = gain
            max_feature = feature
    root.value = max_feature
    uniq = np.unique(data[max_feature])
    for u in uniq:
        subdata = data[data[max_feature] == u]
        if entropy(subdata) == 0.0:
            newNode = Node()
            newNode.isLeaf = True
            newNode.value = u
            newNode.pred = np.unique(subdata["answer"])
            root.children.append(newNode)
        else:
            dummyNode = Node()
            dummyNode.value = u
            new_attrs = attributes.copy()
            new_attrs.remove(max_feature)
            child = ID3(subdata, new_attrs)
            dummyNode.children.append(child)
            root.children.append(dummyNode)
    return root
def printTree(root: Node, depth=0):
    for i in range(depth):
        print("\t", end="")
    print(root.value, end="")
    if root.isLeaf:
        print(" -> ", root.pred)
    print()
    for child in root.children:
        printTree(child, depth + 1)
root=ID3(df, features)
print("---------------------------------------------------------------------------------")
printTree(root)
Output:
Learning Outcomes:
1. To be able to understand different metrics used in ID3 decision tree algorithm like
entropy, information gain, etc.
CART is an umbrella term that refers to the following types of decision trees:
Classification trees: used when the target variable is categorical; the tree finds the
"class" into which the target variable is most likely to fall.
Regression trees: used to forecast the value of a continuous target variable.
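CART chooses splits with the Gini index. As a small illustration (not from the original code; the counts are made up), the Gini impurity falls as a node becomes purer:
def gini_impurity(pos, neg):
    # Gini impurity 1 - p^2 - n^2 for a node with `pos` positive and `neg` negative examples
    total = pos + neg
    p, n = pos / total, neg / total
    return 1 - p**2 - n**2

print(gini_impurity(5, 5))   # 0.5  -> maximally mixed node
print(gini_impurity(9, 1))   # 0.18 -> mostly one class
print(gini_impurity(10, 0))  # 0.0  -> pure node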
Code:
import numpy as np
import pandas as pd
import math
df=pd.read_csv('Outlook.csv')
features = [feat for feat in df]
features.remove("answer")
class Node:
    def __init__(self):
        self.children = []
        self.value = ""
        self.isLeaf = False
        self.pred = ""
def gini(val):
    pos = 0.0
    neg = 0.0
    for _, row in val.iterrows():
        if row["answer"] == "yes":
            pos += 1
        else:
            neg += 1
    if pos == 0.0 or neg == 0.0:
        return 0.0
    else:
        p = (pos / (pos + neg))**2
        n = (neg / (pos + neg))**2
        return 1 - p - n
def gini_idx(data, attr):
    uniq = np.unique(data[attr])
    gini_idx = 0
    for u in uniq:
        subdata = data[data[attr] == u]
        sub_g = gini(subdata)
        gini_idx += (float(len(subdata)) / float(len(data))) * sub_g
    return gini_idx
def CART(data, attributes):
    root = Node()
    min_gini = 1.0
    max_feature = ""
    # Pick the attribute whose split gives the lowest Gini index
    for feature in attributes:
        g = gini_idx(data, feature)
        if g < min_gini:
            min_gini = g
            max_feature = feature
    root.value = max_feature
    uniq = np.unique(data[max_feature])
    for u in uniq:
        subdata = data[data[max_feature] == u]
        if gini(subdata) == 0.0:
            newNode = Node()
            newNode.isLeaf = True
            newNode.value = u
            newNode.pred = np.unique(subdata["answer"])
            root.children.append(newNode)
        else:
            dummyNode = Node()
            dummyNode.value = u
            new_attrs = attributes.copy()
            new_attrs.remove(max_feature)
            child = CART(subdata, new_attrs)
            dummyNode.children.append(child)
            root.children.append(dummyNode)
    return root
root=CART(df, features)
print("---------------------------------------------------------------------------------")
print(f"decision tree with root {root.value} is:")
printTree(root)
Output:
Learning Outcomes:
1. To be able to understand the Gini index used by the CART algorithm for selecting
splits, and how it differs from the entropy-based criterion used in ID3.
Code:
import numpy as np
import pandas as pd
import math
df=pd.read_csv('Outlook.csv')
features = [feat for feat in df]
features.remove("answer")
class Node:
    def __init__(self):
        self.children = []
        self.value = ""
        self.isLeaf = False
        self.pred = ""
def entropy(val):
    pos = 0.0
    neg = 0.0
    for _, row in val.iterrows():
        if row["answer"] == "yes":
            pos += 1
        else:
            neg += 1
    if pos == 0.0 or neg == 0.0:
        return 0.0
    else:
        p = pos / (pos + neg)
        n = neg / (pos + neg)
        return -(p * math.log(p, 2) + n * math.log(n, 2))
# splitInfo was called but not defined in the captured code; this is the standard
# split-information term -(|Sv|/|S|) * log2(|Sv|/|S|) for one branch of the split
def splitInfo(subdata, data):
    ratio = float(len(subdata)) / float(len(data))
    return -ratio * math.log(ratio, 2)

def gain_ratio(data, attr):
    splitinfo = 0
    uniq = np.unique(data[attr])
    gain = entropy(data)
    for u in uniq:
        subdata = data[data[attr] == u]
        splitinfo = splitinfo + splitInfo(subdata, data)
        sub_e = entropy(subdata)
        gain -= (float(len(subdata)) / float(len(data))) * sub_e
    print(gain / splitinfo)
    return gain / splitinfo
def C45(data, attributes):
    root = Node()
    max_gain = 0
    max_feature = ""
    # Pick the attribute with the highest gain ratio
    for feature in attributes:
        g = gain_ratio(data, feature)
        if g > max_gain:
            max_gain = g
            max_feature = feature
    root.value = max_feature
    uniq = np.unique(data[max_feature])
    for u in uniq:
        subdata = data[data[max_feature] == u]
        if entropy(subdata) == 0.0:
            newNode = Node()
            newNode.isLeaf = True
            newNode.value = u
            newNode.pred = np.unique(subdata["answer"])
            root.children.append(newNode)
        else:
            dummyNode = Node()
            dummyNode.value = u
            new_attrs = attributes.copy()
            new_attrs.remove(max_feature)
            child = C45(subdata, new_attrs)
            dummyNode.children.append(child)
            root.children.append(dummyNode)
    return root
def printTree(root: Node, depth=0):
    for i in range(depth):
        print("\t", end="")
    print(root.value, end="")
    if root.isLeaf:
        print(" -> ", root.pred)
    print()
    for child in root.children:
        printTree(child, depth + 1)
root=C45(df, features)
print("---------------------------------------------------------------------------------")
print(f"decision tree with root {root.value} is:")
printTree(root)
Output:
Learning Outcomes:
1. To be able to understand the different metrics used in the C4.5 decision tree
algorithm, such as gain ratio, and to understand the improvement of C4.5 over ID3.
Code:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Define activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def relu(x):
    return np.maximum(0, x)
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
# Define the neural network class
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))
    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = relu(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.probs = softmax(self.z2)
        return self.probs
    def backward(self, X, y, learning_rate):
        m = len(X)
        delta3 = self.probs
        delta3[range(m), y] -= 1
        dW2 = np.dot(self.a1.T, delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = np.dot(delta3, self.W2.T) * (self.a1 > 0)
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0, keepdims=True)
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
    def train(self, X, y, learning_rate, epochs):
        for i in range(epochs):
            # Forward pass
            probs = self.forward(X)
            # Backward pass
            self.backward(X, y, learning_rate)
            # Print loss every 100 epochs
            if i % 100 == 0:
                loss = self.loss(X, y)
                print(f"Epoch {i}: Loss = {loss:.4f}")
    def predict(self, X):
        probs = self.forward(X)
        return np.argmax(probs, axis=1)
    def loss(self, X, y):
        probs = self.forward(X)
        correct_probs = probs[range(len(X)), y]
        data_loss = np.sum(-np.log(correct_probs))
        return data_loss / len(X)
# Load the Iris dataset and split it into training and test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
# Create and train the neural network
nn = NeuralNetwork(input_size=4, hidden_size=10, output_size=3)
nn.train(X_train, y_train, learning_rate=0.001, epochs=1000)
# Evaluate the neural network on the test set
predictions = nn.predict(X_test)
print(f'Test values are {y_test}')
print(f'Predicted values are {predictions}')
accuracy = np.sum(y_test == predictions) / len(y_test)
print(f'Accuracy is: {accuracy}')
Output:
Epoch 0: Loss = 1.0983
Epoch 100: Loss = 0.3449
Epoch 200: Loss = 0.2857
Epoch 300: Loss = 0.2326
Epoch 400: Loss = 0.0841
Epoch 500: Loss = 0.0738
Epoch 600: Loss = 0.0773
Epoch 700: Loss = 0.0698
Epoch 800: Loss = 0.0666
Epoch 900: Loss = 0.0643
Test values are [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
Predicted values are [1 0 2 1 1 0 1 2 2 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
Accuracy is: 0.9666666666666667
Learning Outcomes:
1. To be able to gain a basic understanding of neural networks and how they work.
2. To be able to build a neural network from scratch, covering all operations such as
forward propagation, backward propagation and calculating the loss function.
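One detail worth calling out: backward() above relies on the standard softmax-plus-cross-entropy shortcut, where the gradient of the loss with respect to the pre-softmax scores is simply the predicted probabilities minus the one-hot targets. The small numerical check below (a sketch, independent of the class above, with made-up scores and labels) verifies that identity:
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def cross_entropy(z, y):
    probs = softmax(z)
    return -np.mean(np.log(probs[range(len(z)), y]))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))          # pre-softmax scores for 4 samples, 3 classes
y = np.array([0, 2, 1, 1])           # integer class labels

# Analytical gradient: (probs - one_hot(y)) / m, since the loss is a mean over samples
probs = softmax(z)
probs[range(len(z)), y] -= 1
analytical = probs / len(z)

# Numerical gradient by central differences
eps = 1e-6
numerical = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        zp, zm = z.copy(), z.copy()
        zp[i, j] += eps
        zm[i, j] -= eps
        numerical[i, j] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

print(np.max(np.abs(analytical - numerical)))  # should be very small (around 1e-9)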