R20 III-II ML Lab Manual

Experiment-1: Implement and demonstrate the FIND-S algorithm for finding the most
specific hypothesis based on a given set of training data samples. Read the training data
from a .CSV file.

FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
       For each attribute constraint ai in h
           If the constraint ai is satisfied by x
           Then do nothing
           Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h

Training Examples:

Example Sky AirTemp Humidity Wind Water Forecast EnjoySport

1 Sunny Warm Normal Strong Warm Same Yes

2 Sunny Warm High Strong Warm Same Yes

3 Rainy Cold High Strong Warm Change No

4 Sunny Warm High Strong Cool Change Yes
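Tracing FIND-S on these examples (only the positive instances 1, 2 and 4 are used): after example 1, h = [Sunny, Warm, Normal, Strong, Warm, Same]; example 2 generalizes Humidity, giving h = [Sunny, Warm, ?, Strong, Warm, Same]; example 4 generalizes Water and Forecast, giving the final h = [Sunny, Warm, ?, Strong, ?, ?].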

Program:

import csv

a = []
with open('enjoysport.csv', 'r') as csvfile:
    for row in csv.reader(csvfile):
        a.append(row)
print(a)

print("\nThe total number of training instances are:", len(a))

num_attribute = len(a[0]) - 1

print("\nThe initial hypothesis is:")
hypothesis = ['0'] * num_attribute
print(hypothesis)

# generalize the hypothesis over every positive ('yes') instance
for i in range(0, len(a)):
    if a[i][num_attribute] == 'yes':
        for j in range(0, num_attribute):
            if hypothesis[j] == '0' or hypothesis[j] == a[i][j]:
                hypothesis[j] = a[i][j]
            else:
                hypothesis[j] = '?'
    print("\nThe hypothesis for the training instance {} is:\n".format(i + 1), hypothesis)

print("\nThe maximally specific hypothesis for the training instances is:")
print(hypothesis)

Output:
Experiment:2. For a given set of training data examples stored in a .CSV file,
implement and demonstrate the Candidate-Elimination algorithm to output a
description of the set of all hypotheses consistent with the training examples.

CANDIDATE-ELIMINATION Learning Algorithm

The CANDIDATE-ELIMINATION algorithm computes the version space containing all
hypotheses from H that are consistent with an observed sequence of training examples.

Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
• If d is a positive example
    • Remove from G any hypothesis inconsistent with d
    • For each hypothesis s in S that is not consistent with d
        • Remove s from S
        • Add to S all minimal generalizations h of s such that
          h is consistent with d, and some member of G is more general than h
        • Remove from S any hypothesis that is more general than another hypothesis in S
• If d is a negative example
    • Remove from S any hypothesis inconsistent with d
    • For each hypothesis g in G that is not consistent with d
        • Remove g from G
        • Add to G all minimal specializations h of g such that
          h is consistent with d, and some member of S is more specific than h
        • Remove from G any hypothesis that is less general than another hypothesis in G
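Applied to the EnjoySport training examples below, these steps converge to the boundary sets S = [Sunny, Warm, ?, Strong, ?, ?] and G = {[Sunny, ?, ?, ?, ?, ?], [?, Warm, ?, ?, ?, ?]}, which is exactly the final output printed by the program that follows.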

Training Examples:

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes
Program:

import numpy as np
import pandas as pd

data = pd.DataFrame(data=pd.read_csv('enjoysport.csv'))
concepts = np.array(data.iloc[:, 0:-1])
print(concepts)
target = np.array(data.iloc[:, -1])
print(target)

def learn(concepts, target):
    specific_h = concepts[0].copy()
    print("initialization of specific_h and general_h")
    print(specific_h)
    general_h = [["?" for i in range(len(specific_h))] for i in range(len(specific_h))]
    print(general_h)
    for i, h in enumerate(concepts):
        if target[i] == "yes":
            # positive example: generalize specific_h where it disagrees
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    specific_h[x] = '?'
                    general_h[x][x] = '?'
        if target[i] == "no":
            # negative example: specialize general_h using specific_h
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    general_h[x][x] = specific_h[x]
                else:
                    general_h[x][x] = '?'
        print(" steps of Candidate Elimination Algorithm", i + 1)
        print(specific_h)
        print(general_h)
    # drop the fully general rows left over in general_h
    indices = [i for i, val in enumerate(general_h) if val == ['?', '?', '?', '?', '?', '?']]
    for i in indices:
        general_h.remove(['?', '?', '?', '?', '?', '?'])
    return specific_h, general_h

s_final, g_final = learn(concepts, target)
print("Final Specific_h:", s_final, sep="\n")
print("Final General_h:", g_final, sep="\n")
Output:

Final Specific_h:
['sunny' 'warm' '?' 'strong' '?' '?']

Final General_h:
[['sunny', '?', '?', '?', '?', '?'],
['?', 'warm', '?', '?', '?', '?']]
Experiment:3.Write a program to demonstrate the working of the decision tree
based ID3 algorithm. Use an appropriate data set for building the decision tree and
apply this knowledge to classify a new sample.

ID3 Algorithm

ID3(Examples, Target_attribute, Attributes)

Examples are the training examples. Target_attribute is the attribute whose value is
to be predicted by the tree. Attributes is a list of other attributes that may be tested
by the learned decision tree. Returns a decision tree that correctly classifies the given
Examples.

 Create a Root node for the tree


 If all Examples are positive, Return the single-node tree Root, with label = +
 If all Examples are negative, Return the single-node tree Root, with label = -
 If Attributes is empty, Return the single-node tree Root, with label = most common
value of Target_attribute in Examples

 Otherwise Begin
 A ← the attribute from Attributes that best* classifies Examples
 The decision attribute for Root ← A
 For each possible value, vi, of A,
 Add a new tree branch below Root, corresponding to the test A = vi
 Let Examples_vi be the subset of Examples that have value vi for A
 If Examples_vi is empty
 Then below this new branch add a leaf node with label = most common
value of Target_attribute in Examples
 Else below this new branch add the subtree
ID3(Examples_vi, Target_attribute, Attributes – {A})
 End
 Return Root

* The best attribute is the one with highest information gain


ENTROPY:
Entropy measures the impurity of a collection of examples S:

    Entropy(S) = −p₊ log₂ p₊ − p₋ log₂ p₋

where p₊ is the proportion of positive examples in S and p₋ is the proportion of
negative examples in S.

INFORMATION GAIN:

 Information gain is the expected reduction in entropy caused by partitioning
the examples according to an attribute.
 The information gain, Gain(S, A), of an attribute A relative to a collection of
examples S, is defined as

    Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sᵥ| / |S|) · Entropy(Sᵥ)

where Values(A) is the set of all possible values of A and Sᵥ is the subset of S for
which attribute A has value v.
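As a quick, self-contained illustration of these two formulas (separate from the lab program below), the following sketch computes Entropy(S) for the PlayTennis labels of the training dataset that follows, and Gain(S, Outlook):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

# PlayTennis labels and Outlook values for days D1..D14 (taken from the table below)
play = ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
outlook = ['Sunny','Sunny','Overcast','Rain','Rain','Rain','Overcast',
           'Sunny','Sunny','Rain','Sunny','Overcast','Overcast','Rain']

print(round(entropy(play), 3))   # Entropy(S) = 0.94  (9 Yes, 5 No)

# Gain(S, Outlook) = Entropy(S) - sum over v of |Sv|/|S| * Entropy(Sv)
gain = entropy(play) - sum(
    (outlook.count(v) / len(play)) * entropy([p for o, p in zip(outlook, play) if o == v])
    for v in set(outlook))
print(round(gain, 3))            # Gain(S, Outlook) = 0.247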

Training Dataset:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Test Dataset:

Day  Outlook  Temperature  Humidity  Wind
T1   Rain     Cool         Normal    Strong
T2   Sunny    Mild         Normal    Strong
Program:

import math
import csv

# helper: read a CSV file, returning the data rows and the header row
def load_csv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    headers = dataset.pop(0)
    return dataset, headers

# helper: a tree node with an attribute name, children and (for leaves) an answer
class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []
        self.answer = ""

# helper: split the data on the values of the given column
def subtables(data, col, delete):
    dic = {}
    coldata = [row[col] for row in data]
    attr = list(set(coldata))
    for k in attr:
        dic[k] = []
    for y in range(len(data)):
        key = data[y][col]
        if delete:
            del data[y][col]
        dic[key].append(data[y])
    return attr, dic

def entropy(S):
    attr = list(set(S))
    if len(attr) == 1:
        return 0
    counts = [0, 0]
    for i in range(2):
        counts[i] = sum([1 for x in S if attr[i] == x]) / (len(S) * 1.0)
    sums = 0
    for cnt in counts:
        sums += -1 * cnt * math.log(cnt, 2)
    return sums

def compute_gain(data, col):
    attr, dic = subtables(data, col, delete=False)
    total_size = len(data)
    entropies = [0] * len(attr)
    ratio = [0] * len(attr)
    total_entropy = entropy([row[-1] for row in data])
    for x in range(len(attr)):
        ratio[x] = len(dic[attr[x]]) / (total_size * 1.0)
        entropies[x] = entropy([row[-1] for row in dic[attr[x]]])
        total_entropy -= ratio[x] * entropies[x]
    return total_entropy

def build_tree(data, features):
    lastcol = [row[-1] for row in data]
    if len(set(lastcol)) == 1:
        node = Node("")
        node.answer = lastcol[0]
        return node
    n = len(data[0]) - 1
    gains = [0] * n
    for col in range(n):
        gains[col] = compute_gain(data, col)
    split = gains.index(max(gains))
    node = Node(features[split])
    fea = features[:split] + features[split + 1:]
    attr, dic = subtables(data, split, delete=True)
    for x in range(len(attr)):
        child = build_tree(dic[attr[x]], fea)
        node.children.append((attr[x], child))
    return node

def print_tree(node, level):
    if node.answer != "":
        print("  " * level, node.answer)
        return
    print("  " * level, node.attribute)
    for value, n in node.children:
        print("  " * (level + 1), value)
        print_tree(n, level + 2)

def classify(node, x_test, features):
    if node.answer != "":
        print(node.answer)
        return
    pos = features.index(node.attribute)
    for value, n in node.children:
        if x_test[pos] == value:
            classify(n, x_test, features)

'''Main program'''
dataset, features = load_csv("data3.csv")
node1 = build_tree(dataset, features)
print("The decision tree for the dataset using ID3 algorithm is")
print_tree(node1, 0)
testdata, features = load_csv("data3_test.csv")
for xtest in testdata:
    print("The test instance:", xtest)
    print("The label for test instance:", end=" ")
    classify(node1, xtest, features)
Experiment-4: Exercises to solve the real-world problems using the following
machine learning methods: a) Linear Regression b) Logistic Regression c) Binary
Classifier
a) Linear Regression
 Linear regression is one of the easiest and most popular machine learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such as
sales, salary, age, product price, etc.
 The linear regression algorithm shows a linear relationship between a dependent
variable (y) and one or more independent variables (x), hence the name linear
regression. Since linear regression shows a linear relationship, it finds how the
value of the dependent variable changes according to the value of the
independent variable.
The linear regression model provides a sloped straight line, y = a0 + a1·x, representing the
relationship between the variables.

Program:
# importing the dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('Salary_Data.csv')
dataset.head()

# data preprocessing
X = dataset.iloc[:, :-1].values  # independent variable array
y = dataset.iloc[:, 1].values    # dependent variable vector

# splitting the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# fitting the regression model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)  # actually produces the linear eqn for the data

# predicting the test set results
y_pred = regressor.predict(X_test)
y_pred
y_test

# visualizing the results
# plot for the TRAIN set
plt.scatter(X_train, y_train, color='red')                   # plotting the observations
plt.plot(X_train, regressor.predict(X_train), color='blue')  # plotting the regression line
plt.title("Salary vs Experience (Training set)")  # stating the title of the graph
plt.xlabel("Years of experience")  # adding the name of the x-axis
plt.ylabel("Salaries")             # adding the name of the y-axis
plt.show()                         # specifies end of graph

# plot for the TEST set
plt.scatter(X_test, y_test, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')  # plotting the regression line
plt.title("Salary vs Experience (Testing set)")
plt.xlabel("Years of experience")
plt.ylabel("Salaries")
plt.show()

Output:

b) Logistic regression:
It is a Machine Learning classification algorithm that is used to predict the probability of a
categorical dependent variable. In logistic regression, the dependent variable is a binary variable that
contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic
regression model predicts P(Y=1) as a function of X.
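In symbols, the model passes a linear combination of the inputs through the sigmoid (logistic) function, which squashes any real value into the (0, 1) probability range:

    P(Y=1 | X) = 1 / (1 + e^−(β0 + β1X))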
Logistic Regression – The Python Way
To do this, we shall first explore our dataset using Exploratory Data Analysis (EDA) and then
implement logistic regression and finally interpret the odds:
Program:

# Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('user_data.csv')

# Extracting Independent and dependent Variable
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

# Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

# Predicting the test set result
y_pred = classifier.predict(x_test)

# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

# Visualizing the test set result
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

c) BINARY CLASSIFIER:
In machine learning, binary classification is a supervised learning task that categorizes new
observations into one of two classes (conventionally labelled 0 and 1).
Example
In a medical diagnosis, a binary classifier for a specific disease could take a patient's symptoms as
input features and predict whether the patient is healthy or has the disease. The possible outcomes of
the diagnosis are positive and negative.
Evaluation of binary classifiers
If the model successfully predicts a patient as positive, this case is called True Positive (TP). If
the model successfully predicts a patient as negative, this is called True Negative (TN). The binary
classifier may misdiagnose some patients as well. If a diseased patient is classified as healthy by a
negative test result, this error is called False Negative (FN). Similarly, if a healthy patient is
classified as diseased by a positive test result, this error is called False Positive (FP).
We can evaluate a binary classifier based on the following parameters:
True Positive (TP): The patient is diseased and the model predicts "diseased"
False Positive (FP): The patient is healthy but the model predicts "diseased"
True Negative (TN): The patient is healthy and the model predicts "healthy"
False Negative (FN): The patient is diseased and the model predicts "healthy"
After obtaining these values, we can compute the accuracy score of the binary classifier as follows:

    Accuracy = (TP + TN) / (TP + FP + TN + FN)

These four parameters are conventionally arranged in a confusion matrix:

                     Predicted "diseased"   Predicted "healthy"
Actually diseased            TP                    FN
Actually healthy             FP                    TN

In machine learning, many methods utilize binary classification. The most common are:
1. Support Vector Machines
2. Naive Bayes
3. Nearest Neighbor
4. Decision Trees
5. Logistic Regression
6. Neural Networks

The following Python example will demonstrate using binary classification in a logistic regression
problem.
A Python example for binary classification
For our data, we will use the breast cancer dataset from scikit-learn. This dataset contains tumor
observations and corresponding labels for whether the tumor was malignant or benign.
First, we'll import a few libraries and then load the data. When loading the data, we'll specify
as_frame=True so we can work with pandas objects.
Program:

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer(as_frame=True)
dataset['data'].head()
dataset['target'].head()
dataset['target'].value_counts()

X = dataset['data']
y = dataset['target']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# scale the features: fit the scaler on the training data only,
# then apply the same transformation to the test data
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# fit a single logistic regression model and evaluate its confusion matrix
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix
TN, FP, FN, TP = confusion_matrix(y_test, predictions).ravel()

print('True Positive(TP)  = ', TP)
print('False Positive(FP) = ', FP)
print('True Negative(TN)  = ', TN)
print('False Negative(FN) = ', FN)

accuracy = (TP + TN) / (TP + FP + TN + FN)
print('Accuracy of the binary classifier = {:0.3f}'.format(accuracy))

# now compare several binary classifiers on the same split
models = {}

# Logistic Regression
models['Logistic Regression'] = LogisticRegression()

# Support Vector Machines
from sklearn.svm import LinearSVC
models['Support Vector Machines'] = LinearSVC()

# Decision Trees
from sklearn.tree import DecisionTreeClassifier
models['Decision Trees'] = DecisionTreeClassifier()

# Random Forest
from sklearn.ensemble import RandomForestClassifier
models['Random Forest'] = RandomForestClassifier()

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
models['Naive Bayes'] = GaussianNB()

# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
models['K-Nearest Neighbor'] = KNeighborsClassifier()

from sklearn.metrics import accuracy_score, precision_score, recall_score

accuracy, precision, recall = {}, {}, {}

for key in models.keys():
    # Fit the classifier
    models[key].fit(X_train, y_train)

    # Make predictions
    predictions = models[key].predict(X_test)

    # Calculate metrics
    accuracy[key] = accuracy_score(y_test, predictions)
    precision[key] = precision_score(y_test, predictions)
    recall[key] = recall_score(y_test, predictions)

import pandas as pd

df_model = pd.DataFrame(index=models.keys(), columns=['Accuracy', 'Precision', 'Recall'])
df_model['Accuracy'] = accuracy.values()
df_model['Precision'] = precision.values()
df_model['Recall'] = recall.values()
df_model

ax = df_model.plot.barh()
ax.legend(
    ncol=len(models.keys()),
    bbox_to_anchor=(0, 1),
    loc='lower left',
    prop={'size': 14}
)
plt.tight_layout()

Output:
Experiment-5: Develop a program for Bias, Variance, Remove duplicates , Cross
Validation
a)Bias-Variance Trade-off
The bias and the variance of a model’s performance are connected.
Ideally, we would prefer a model with low bias and low variance, although in practice this is
very challenging. In fact, this could be described as the goal of applied machine learning for a
given predictive modeling problem.
Reducing the bias can easily be achieved by increasing the variance. Conversely, reducing the
variance can easily be achieved by increasing the bias.
This relationship is generally referred to as the bias-variance trade-off. It is a conceptual
framework for thinking about how to choose models and model configuration.
We can choose a model based on its bias or variance. Simple models, such as linear regression
and logistic regression, generally have a high bias and a low variance. Complex models, such as
random forest, generally have a low bias but a high variance.
We may also choose model configurations based on their effect on the bias and variance of the
model. The k hyperparameter in k-nearest neighbors controls the bias-variance trade-off. Small
values, such as k=1, result in a low bias and a high variance, whereas large k values, such as
k=21, result in a high bias and a low variance.
High bias is not always bad, nor is high variance, but they can lead to poor results.
We often must test a suite of different models and model configurations in order to discover what
works best for a given dataset. A model with a large bias may be too rigid and underfit the
problem. Conversely, a large variance may overfit the problem.
We may decide to increase the bias or the variance as long as it decreases the overall estimate of
model error.
Calculate the Bias and Variance
In a real-life situation in which f is unobserved, it is generally not possible to explicitly compute
the test MSE, bias, or variance for a statistical learning method. Nevertheless, one should always
keep the bias-variance trade-off in mind.
Even though the bias-variance trade-off is a conceptual tool, we can estimate it in some cases.
The mlxtend library by Sebastian Raschka provides the bias_variance_decomp() function that
can estimate the bias and variance for a model over multiple bootstrap samples.
First, you must install the mlxtend library; for example:

    pip install mlxtend

Running the example below reports the estimated error as well as the estimated bias and
variance for the model.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure,
or differences in numerical precision. Consider running the example a few times and comparing the
average outcome.
In this case, we can see that the model has a high bias and a low variance. This is to be expected
given that we are using a linear regression model. We can also see that the sum of the estimated
bias and variance equals the estimated error of the model, e.g. 20.726 + 1.761 = 22.487.

Program:

# estimate the bias and variance for a regression model
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.evaluate import bias_variance_decomp

# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)

# separate into inputs and outputs
data = dataframe.values
X, y = data[:, :-1], data[:, -1]

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# define the model
model = LinearRegression()

# estimate bias and variance over 200 bootstrap rounds
mse, bias, var = bias_variance_decomp(model, X_train, y_train, X_test, y_test,
                                      loss='mse', num_rounds=200, random_seed=1)

# summarize results
print('MSE: %.3f' % mse)
print('Bias: %.3f' % bias)
print('Variance: %.3f' % var)
Output:
b)Remove Duplicates

When an element occurs more than once in a list, we refer to it as a duplicate.

We will learn different methods of removing these duplicates from a list in Python.

1. The Basic Approach


2. Using List Comprehension
3. Using Set()
4. Using enumerate()
5. Using OrderedDict

Let us discuss each one of them in detail.

The Basic Approach

In the first method, we will discuss the basic approach of removing duplicates from the
list using Python.

How to Remove Duplicates From a Python list?

In Python, there are numerous methods for removing duplicates from a list. To remove
duplicates from a given list, you can make use of these methods and get your work done.
Let's take a look at them:

Method 1 - Naive Method

To remove duplicates from a list in Python, iterate through the elements of the list and
store the first occurrence of an element in a temporary list while ignoring any other
occurrences of that element.

The basic approach is implemented in the naive method by:

 Using a for-loop to traverse the list
 If an element does not already exist in a temporary list, it is added to it
 Finally, the temporary list is assigned to the main list
Example:
Program:

# removing duplicates from the list using the naive method

# initializing list
sam_list = [11, 13, 15, 16, 13, 15, 16, 11]
print("The list is: " + str(sam_list))

# remove duplicates from list
result = []
for i in sam_list:
    if i not in result:
        result.append(i)

# printing list after removal
print("The list after removing duplicates : " + str(result))

Output:

The list is: [11, 13, 15, 16, 13, 15, 16, 11]

The list after removing duplicates: [11, 13, 15, 16]
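For comparison, the Set() approach listed earlier does the same in a single expression. Note that set() does not preserve the original order, while dict.fromkeys() (the modern equivalent of the OrderedDict approach) does:

sam_list = [11, 13, 15, 16, 13, 15, 16, 11]
print(list(set(sam_list)))            # duplicates removed, order not guaranteed
print(list(dict.fromkeys(sam_list)))  # [11, 13, 15, 16], original order preserved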


Method 2 - The Pandas drop_duplicates() method helps in removing duplicates from the
Pandas DataFrame in Python.

Syntax of df.drop_duplicates()

Syntax: DataFrame.drop_duplicates(subset=None, keep=’first’, inplace=False)

Parameters:

 subset: Subset takes a column or list of column label. It’s default value is
none. After passing columns, it will consider them only for duplicates.
 keep: keep is to control how to consider duplicate value. It has only three
distinct value and default is ‘first’.
 If ‘first‘, it considers first value as unique and rest of the same
values as duplicate.
 If ‘last‘, it considers last value as unique and rest of the same values
as duplicate.
 If False, it considers all of the same values as duplicates
 inplace: Boolean values, removes rows with duplicates if True.

Return type: DataFrame with removed duplicate rows depending on Arguments


passed.

Example 1: Removing rows with the same Employee_ID

In the following example, rows having the same Employee_ID are removed and a new
data frame is returned.
Program:

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("C:\\Users\\sys-08\\Desktop\\Employee.csv")

# sorting by Employee_ID
data.sort_values("Employee_ID", inplace=True)

# dropping ALL duplicate values
data.drop_duplicates(subset="Employee_ID", keep=False, inplace=True)

# displaying data
print(data)

Output:
As shown in the output, the rows with duplicate Employee_ID values are removed from the data frame.

c)Cross Validation

Cross-validation is a technique for validating the model efficiency by training it on the subset of
input data and testing on previously unseen subset of the input data. We can also say that it is a
technique to check how a statistical model generalizes to an independent dataset.

In machine learning, there is always a need to test the stability of the model; we cannot
judge a model fitted on the training dataset by that same training dataset alone. For this
purpose, we reserve a particular sample of the dataset which was not part of the training
dataset. After that, we test our model on that sample before deployment, and this complete
process comes under cross-validation. This is something different from the general train-test split.

Hence the basic steps of cross-validation are:


 Reserve a subset of the dataset as a validation set.
 Provide the training to the model using the training dataset.
 Now, evaluate model performance using the validation set. If the model performs well
with the validation set, perform the further step, else check for the issues.

Methods used for Cross-Validation

1. Hold Out Cross-validation

2. K-Fold cross-validation

3. Stratified K-Fold cross-validation

4. Leave P Out Cross-validation

5. Leave One Out Cross-validation

6. Monte Carlo (Shuffle-Split)

7. Time Series ( Rolling cross-validation)

K-Fold Cross-Validation

In K-Fold cross-validation, the whole dataset is partitioned into K parts of equal
size. Each partition is called a "fold". So, as we have K parts, we call it K-Folds. One fold is
used as a validation set and the remaining K-1 folds are used as the training set.

The technique is repeated K times until each fold has been used as a validation set and the
remaining folds as the training set.

The final accuracy of the model is computed by taking the mean accuracy over the k models'
validation data, as the formula below shows.
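Expressed as a formula, with Accuracy_i denoting the accuracy measured on fold i:

    Final Accuracy = (1/k) · Σ_{i=1..k} Accuracy_i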

Pros:

1. The whole dataset is used as both a training set and a validation set.

Cons:

1. Not to be used for imbalanced datasets: As discussed for HoldOut cross-validation, in
K-Fold validation too it may happen that the training set contains no samples of class "1"
and only samples of class "0", while the validation set contains the samples of class "1".

2. Not suitable for Time Series data: For Time Series data the order of the samples matters,
but in K-Fold Cross-Validation samples are selected in random order.

Program:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data
Y = iris.target

logreg = LogisticRegression()
kf = KFold(n_splits=5)
score = cross_val_score(logreg, X, Y, cv=kf)

print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))

Output:
Experiment-6: Write a program to implement Categorical Encoding, One-hot
Encoding

One-hot encoding in Python

Most of the existing machine learning algorithms cannot be executed on categorical data. Instead,
the categorical data needs to first be converted to numerical data. One-hot encoding is one of the
techniques used to perform this conversion. This method is mostly used when deep learning
techniques are to be applied to sequential classification problems.

One-hot encoding is essentially the representation of categorical variables as binary vectors.


These categorical values are first mapped to integer values. Each integer value is then represented
as a binary vector that is all 0s (except the index of the integer which is marked as 1).


Manual one-hot encoding

Have a look at the example below which manually converts the categorical list of colors to a
numerical list using one-hot encoding:
Program:
import numpy as np

### Categorical data to be converted to numeric data
colors = ["red", "green", "yellow", "red", "blue"]

### Universal list of colors
total_colors = ["red", "green", "blue", "black", "yellow"]

### map each color to an integer
mapping = {}
for x in range(len(total_colors)):
    mapping[total_colors[x]] = x

### build one binary vector per colour, with a 1 at the mapped index
one_hot_encode = []
for c in colors:
    arr = list(np.zeros(len(total_colors), dtype=int))
    arr[mapping[c]] = 1
    one_hot_encode.append(arr)

print(one_hot_encode)

OUTPUT:


[[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
One-hot encoding using scikit-learn

Take a look at the example below. It uses the scikit-learn library to perform one-hot
encoding:

Program:

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

### Categorical data to be converted to numeric data
colors = ["red", "green", "yellow", "red", "blue"]

### integer mapping using LabelEncoder
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(colors)
print(integer_encoded)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

### One hot encoding
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

OUTPUT:
[2 1 3 2 0]

[[0. 0. 1. 0.]

[0. 1. 0. 0.]

[0. 0. 0. 1.]

[0. 0. 1. 0.]

[1. 0. 0. 0.]]
Experiment-7: Build an Artificial Neural Network by implementing the Back
propagation algorithm and test the same using appropriate data sets.
Program:

import numpy as np

X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X / np.amax(X, axis=0)  # normalize X column-wise (maximum of X array longitudinally)
y = y / 100

# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid function
def derivatives_sigmoid(x):
    return x * (1 - x)

# Variable initialization
epoch = 7000              # setting training iterations
lr = 0.1                  # setting learning rate
inputlayer_neurons = 2    # number of features in the data set
hiddenlayer_neurons = 3   # number of hidden layer neurons
output_neurons = 1        # number of neurons at the output layer

# weight and bias initialization: draws numbers uniformly at random of dim x*y
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

for i in range(epoch):
    # Forward Propagation
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)

    # Backpropagation
    EO = y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)  # how much hidden layer wts contributed to error
    d_hiddenlayer = EH * hiddengrad

    # dot product of next-layer error and current-layer output
    wout += hlayer_act.T.dot(d_output) * lr
    bout += np.sum(d_output, axis=0, keepdims=True) * lr
    wh += X.T.dot(d_hiddenlayer) * lr
    # bh += np.sum(d_hiddenlayer, axis=0, keepdims=True) * lr

print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)

Output:
Input:
[[ 0.66666667 1. ]
[ 0.33333333 0.55555556]
[ 1. 0.66666667]]

Actual Output:
[[0.92]
 [0.86]
 [0.89]]
Predicted Output:
[[0.89559591]
 [0.88142069]
 [0.8928407 ]]
Experiment-8: Write a program to implement k-Nearest Neighbor algorithm to
classify the iris data set. Print both correct and wrong predictions.

Principle: points (documents) that are close in the feature space belong to the same class.

Distance Metrics
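For example, the Euclidean distance between two n-dimensional points x and y is:

    d(x, y) = √( Σ_{i=1..n} (xᵢ − yᵢ)² )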
K-Nearest-Neighbour Algorithm:
1. Load the data
2. Initialize the value of k
3. For getting the predicted class, iterate from 1 to the total number of training data points:
   1. Calculate the distance between the test data and each row of training data. Here we
      use Euclidean distance as our distance metric since it is the most popular method;
      other metrics such as Chebyshev, cosine, etc. can also be used.
   2. Sort the calculated distances in ascending order based on distance values.
   3. Get the top k rows from the sorted array.
   4. Get the most frequent class of these rows, i.e. the labels of the selected K entries.
   5. Return the predicted class:
       If regression, return the mean of the K labels
       If classification, return the mode of the K labels
Confusion matrix:
Note,
• Class 1 : Positive
• Class 2 : Negative
• Positive (P) : Observation is positive (for example: is an apple).
• Negative (N) : Observation is not positive (for example: is not an
apple).
• True Positive (TP) : Observation is positive, and is predicted to be positive.
• False Negative (FN) : Observation is positive, but is predicted negative. (Also
known as a "Type II error.")
• True Negative (TN) : Observation is negative, and is predicted to be negative.
• False Positive (FP) : Observation is negative, but is predicted positive. (Also
known as a "Type I error.")

Example:

Consider a classifier tested on n = 165 instances, with the following confusion matrix:

                Predicted: No   Predicted: Yes
Actual: No          TN = 50         FP = 10
Actual: Yes         FN = 5          TP = 100

Accuracy:
 Overall, how often is the classifier correct? (TP+TN)/total = (100+50)/165 = 0.91
 Misclassification Rate: Overall, how often is it wrong? (FP+FN)/total = (10+5)/165 = 0.09,
equivalent to 1 minus Accuracy; also known as "Error Rate".
 True Positive Rate: When it's actually yes, how often does it predict yes?
TP/actual yes = 100/105 = 0.95; also known as "Sensitivity" or "Recall".
 False Positive Rate: When it's actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17
 True Negative Rate: When it's actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83, equivalent to 1 minus the False Positive Rate; also
known as "Specificity".
 Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91
 Prevalence: How often does the yes condition actually occur in our
sample? actual yes/total = 105/165 = 0.64
Program:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import pandas as pd

dataset = pd.read_csv("iris.csv")
X = dataset.iloc[:, :-1].values   # feature columns
y = dataset.iloc[:, -1].values    # class label column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25)

classifier = KNeighborsClassifier(n_neighbors=8, p=3, metric='euclidean')
classifier.fit(X_train, y_train)

# predict the test results
y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix is as follows\n', cm)
print('Accuracy Metrics')
print(classification_report(y_test, y_pred))
print(" correct prediction", accuracy_score(y_test, y_pred))
print(" wrong prediction", (1 - accuracy_score(y_test, y_pred)))

Output:
Confusion matrix is as follows
[[13  0  0]
 [ 0 15  1]
 [ 0  0  9]]
Accuracy Metrics
                  precision    recall  f1-score   support

Iris-setosa            1.00      1.00      1.00        13
Iris-versicolor        1.00      0.94      0.97        16
Iris-virginica         0.90      1.00      0.95         9

avg / total            0.98      0.97      0.97        38

correct prediction 0.9736842105263158
wrong prediction 0.02631578947368418
Experiment-9: Implement the non-parametric Locally Weighted Regression
algorithm in order to fit data points. Select appropriate data set for your experiment
and draw graphs.

• Regression is a technique from statistics that is used to predict values of a
desired target quantity when the target quantity is continuous.
• In regression, we seek to identify (or estimate) a continuous variable y
associated with a given input vector x.
• y is called the dependent variable.
• x is called the independent variable.

Loess/Lowess Regression: Loess regression is a nonparametric technique that uses
local weighted regression to fit a smooth curve through points in a scatter plot.

Lowess Algorithm: Locally weighted regression is a very powerful non-parametric
model used in statistical learning. Given a dataset X, y, we attempt to find a model
parameter β(x) that minimizes the residual sum of weighted squared errors. The weights
are given by a kernel function (k or w) which can be chosen arbitrarily.

Locally Weighted Regression Algorithm:

1. Read the given data sample to X and the curve (linear or non-linear) to Y
2. Set the value for the smoothening parameter or free parameter, say τ
3. Set the bias/point of interest x0, which is a subset of X
4. Determine the weight matrix W using the Gaussian kernel, placed on the diagonal:

       w(x, x0) = exp( −(x − x0)² / (2τ²) )

5. Determine the value of the model term parameter β using:

       β = (XᵀWX)⁻¹ XᵀWy

6. Prediction = x0 · β

Program:
from numpy import *
import matplotlib.pyplot as plt
import pandas as pd

def kernel(point, xmat, k):
    m, n = shape(xmat)
    weights = mat(eye(m))
    for j in range(m):
        diff = point - X[j]
        weights[j, j] = exp(diff * diff.T / (-2.0 * k**2))
    return weights

def localWeight(point, xmat, ymat, k):
    wei = kernel(point, xmat, k)
    W = (X.T * (wei * X)).I * (X.T * (wei * ymat.T))
    return W

def localWeightRegression(xmat, ymat, k):
    m, n = shape(xmat)
    ypred = zeros(m)
    for i in range(m):
        ypred[i] = xmat[i] * localWeight(xmat[i], xmat, ymat, k)
    return ypred

# load data points
data = pd.read_csv('tips.csv')
bill = array(data.total_bill)
tip = array(data.tip)

# prepare the design matrix: add a column of ones to bill
mbill = mat(bill)
mtip = mat(tip)
m = shape(mbill)[1]
one = mat(ones(m))
X = hstack((one.T, mbill.T))

# set the smoothing parameter k here
ypred = localWeightRegression(X, mtip, 0.2)
SortIndex = X[:, 1].argsort(0)
xsort = X[SortIndex][:, 0]

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(bill, tip, color='green')
ax.plot(xsort[:, 1], ypred[SortIndex], color='red', linewidth=5)
plt.xlabel('Total bill')
plt.ylabel('Tip')
plt.show()

Output:

Dataset: tips.csv (256 rows)
Experiment:10. Assuming a set of documents that need to be classified, use the naïve
Bayesian Classifier model to perform this task. Built-in Java classes/API can be used to
write the program. Calculate the accuracy, precision, and recall for your data set.

Naive Bayes algorithms for learning and classifying text

LEARN_NAIVE_BAYES_TEXT (Examples, V)
Examples is a set of text documents along with their target values. V is the set of all possible target
values. This function learns the probability terms P(wk|vj), describing the probability that a
randomly drawn word from a document in class vj will be the English word wk. It also learns the
class prior probabilities P(vj).

1. collect all words, punctuation, and other tokens that occur in Examples
 Vocabulary ← the set of all distinct words and other tokens occurring in any text
document from Examples

2. calculate the required P(vj) and P(wk|vj) probability terms

 For each target value vj in V do


 docsj ← the subset of documents from Examples for which the target value is vj
 P(vj) ← | docsj | / |Examples|
 Textj ← a single document created by concatenating all members of docsj
 n ← total number of distinct word positions in Textj
 for each word wk in Vocabulary
 nk ← number of times word wk occurs in Textj
 P(wk|vj) ← ( nk + 1) / (n + | Vocabulary| )

CLASSIFY_NAIVE_BAYES_TEXT (Doc)

Return the estimated target value for the document Doc. ai denotes the word found in the ith
position within Doc.

 positions ← all word positions in Doc that contain tokens found in Vocabulary
 Return vNB, where

       vNB = argmax_{vj ∈ V} P(vj) · Π_{i ∈ positions} P(ai | vj)
Data set:

Text Documents Label


1 I love this sandwich pos
2 This is an amazing place pos
3 I feel very good about these beers pos
4 This is my best work pos
5 What an awesome view pos
6 I do not like this restaurant neg
7 I am tired of this stuff neg
8 I can't deal with this neg
9 He is my sworn enemy neg
10 My boss is horrible neg
11 This is an awesome place pos
12 I do not like the taste of this juice neg
13 I love to dance pos
14 I am sick and tired of this place neg
15 What a great holiday pos
16 That is a bad locality to stay neg
17 We will have good fun tomorrow pos
18 I went to my enemy's house today neg
Program:

import pandas as pd

msg = pd.read_csv('naivetext.csv', names=['message', 'label'])
print('The dimensions of the dataset', msg.shape)

msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})
X = msg.message
y = msg.labelnum
print(X)
print(y)

# splitting the dataset into train and test data
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, y)

print('\n The total number of Training Data :', ytrain.shape)
print('\n The total number of Test Data :', ytest.shape)

# output of the count vectoriser is a sparse matrix
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain)
xtest_dtm = count_vect.transform(xtest)
print('\n The words or Tokens in the text documents \n')
print(count_vect.get_feature_names())

df = pd.DataFrame(xtrain_dtm.toarray(), columns=count_vect.get_feature_names())

# Training Naive Bayes (NB) classifier on training data.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(xtrain_dtm, ytrain)
predicted = clf.predict(xtest_dtm)

# printing accuracy, Confusion matrix, Precision and Recall
from sklearn import metrics
print('\n Accuracy of the classifier is', metrics.accuracy_score(ytest, predicted))
print('\n Confusion matrix')
print(metrics.confusion_matrix(ytest, predicted))
print('\n The value of Precision', metrics.precision_score(ytest, predicted))
print('\n The value of Recall', metrics.recall_score(ytest, predicted))

Output:

The dimensions of the dataset (18, 2)


0 I love this sandwich
1 This is an amazing place
2 I feel very good about these beers
3 This is my best work
4 What an awesome view
5 I do not like this restaurant
6 I am tired of this stuff
7 I can't deal with this
8 He is my sworn enemy
9 My boss is horrible
10 This is an awesome place
11 I do not like the taste of this juice
12 I love to dance
13 I am sick and tired of this place
14 What a great holiday
15 That is a bad locality to stay
16 We will have good fun tomorrow
17 I went to my enemy's house today
Name: message, dtype: object
0 1
1 1
2 1
3 1
4 1
5 0
6 0
7 0
8 0
9 0
10 1
11 0
12 1
13 0
14 1
15 0
16 1
17 0
Name: labelnum, dtype: int64

The total number of Training Data: (13,)


The total number of Test Data: (5,)

The words or Tokens in the text documents


['about', 'am', 'amazing', 'an', 'and', 'awesome', 'beers', 'best', 'can', 'deal', 'do', 'enemy', 'feel',
'fun', 'good', 'great', 'have', 'he', 'holiday', 'house', 'is', 'like', 'love', 'my', 'not', 'of', 'place',
'restaurant', 'sandwich', 'sick', 'sworn', 'these', 'this', 'tired', 'to', 'today', 'tomorrow', 'very',
'view', 'we', 'went', 'what', 'will', 'with', 'work']

Accuracy of the classifier is 0.8


Confusion matrix
[[2 1]
[0 2]]
The value of Precision 0.6666666666666666
The value of Recall 1.0
Basic knowledge

Confusion Matrix

True positives: data points labelled as positive that are actually positive
False positives: data points labelled as positive that are actually negative
True negatives: data points labelled as negative that are actually negative
False negatives: data points labelled as negative that are actually positive
Accuracy: how often is the classifier correct? Accuracy = (TP + TN) / total.


Example: Movie Review

Doc Text Class

1 I loved the movie +


2 I hated the movie -
3 a great movie. good movie +
4 poor acting -
5 great acting. good movie +

Unique word
< I, loved, the, movie, hated, a, great, good, poor, acting>

Word counts per document:

Doc   I  loved  the  movie  hated  a  great  good  poor  acting  Class
1     1    1     1     1    -      -    -     -     -      -       +
2     1    -     1     1    1      -    -     -     -      -       -
3     -    -     -     2    -      1    1     1     -      -       +
4     -    -     -     -    -      -    -     -     1      1       -
5     -    -     -     1    -      -    1     1     -      1       +

Documents with class +:

Doc   I  loved  the  movie  hated  a  great  good  poor  acting  Class
1     1    1     1     1    -      -    -     -     -      -       +
3     -    -     -     2    -      1    1     1     -      -       +
5     -    -     -     1    -      -    1     1     -      1       +

P(+) = 3/5 = 0.6

P(I | +) = (1+1)/(14+10) = 0.0833          P(a | +) = (1+1)/(14+10) = 0.0833
P(loved | +) = (1+1)/(14+10) = 0.0833      P(great | +) = (2+1)/(14+10) = 0.125
P(the | +) = (1+1)/(14+10) = 0.0833        P(good | +) = (2+1)/(14+10) = 0.125
P(movie | +) = (4+1)/(14+10) = 0.2083      P(poor | +) = (0+1)/(14+10) = 0.0416
P(hated | +) = (0+1)/(14+10) = 0.0416      P(acting | +) = (1+1)/(14+10) = 0.0833

Documents with class −:

Doc   I  loved  the  movie  hated  a  great  good  poor  acting  Class
2     1    -     1     1    1      -    -     -     -      -       -
4     -    -     -     -    -      -    -     -     1      1       -

P(−) = 2/5 = 0.4

P(I | −) = (1+1)/(6+10) = 0.125            P(a | −) = (0+1)/(6+10) = 0.0625
P(loved | −) = (0+1)/(6+10) = 0.0625       P(great | −) = (0+1)/(6+10) = 0.0625
P(the | −) = (1+1)/(6+10) = 0.125          P(good | −) = (0+1)/(6+10) = 0.0625
P(movie | −) = (1+1)/(6+10) = 0.125        P(poor | −) = (1+1)/(6+10) = 0.125
P(hated | −) = (1+1)/(6+10) = 0.125        P(acting | −) = (1+1)/(6+10) = 0.125
Let’s classify the new document

I hated the poor acting


If vj = +, then
= P(+) P(I | +) P(hated | +) P(the | +) P(poor | +) P(acting | +)
= 0.6 * 0.0833 * 0.0416 * 0.0833 * 0.0416 * 0.0833
= 6.03 × 10⁻⁷

If vj = −, then
= P(−) P(I | −) P(hated | −) P(the | −) P(poor | −) P(acting | −)
= 0.4 * 0.125 * 0.125 * 0.125 * 0.125 * 0.125
= 1.22 × 10⁻⁵

Since 1.22 × 10⁻⁵ > 6.03 × 10⁻⁷, the new document belongs to the ( − ) class.
Experiment:11. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the
same data set for clustering using k-Means algorithm. Compare the results of these
two algorithms and comment on the quality of clustering. You can add Java/Python
ML library classes/API in the program.

Expectation-Maximization (EM) Algorithm :

The EM algorithm is a latent variable method for finding the local maximum likelihood
parameters of a statistical model, proposed by Arthur Dempster, Nan Laird, and Donald Rubin in
1977. The EM (Expectation-Maximization) algorithm is one of the most commonly used approaches in
machine learning to obtain maximum likelihood estimates for models whose variables are sometimes
observable and sometimes not; it is also applicable to unobserved (latent) data. It has various
real-world applications in statistics, including obtaining the mode of the posterior marginal
distribution of parameters in machine learning and data mining applications.
Key Points:
It is known as the latent variable model to determine MLE and MAP parameters for latent variables.
It is used to predict values of parameters in instances where data is missing or unobservable for
learning, and this is done until convergence of the values occurs.
EM Algorithm
The EM algorithm is the combination of various unsupervised ML algorithms, such as the k-means
clustering algorithm. Being an iterative approach, it consists of two modes. In the first mode, we
estimate the missing or latent variables. Hence it is referred to as the Expectation/estimation step (E-
step). Further, the other mode is used to optimize the parameters of the models so that it can explain
the data more clearly. The second mode is known as the maximization-step or M-step.

 Expectation step (E - step): It involves the estimation (guess) of all missing values in the
dataset so that after completing this step, there should not be any missing value.
 Maximization step (M - step): This step involves the use of estimated data in the E-step and
updating the parameters.

 Repeat E-step and M-step until the convergence of the values occurs.
 The primary goal of the EM algorithm is to use the available observed data of the dataset to
estimate the missing data of the latent variables and then use that data to update the values of
the parameters in the M-step.
What is Convergence in the EM algorithm?
Convergence here carries its intuitive probabilistic meaning: if two random variables have
only a very small difference in their probability, they are said to have converged. In other
words, whenever the values of the given variables stop changing from one iteration to the
next, we call it convergence.

Steps in EM Algorithm
The EM algorithm is completed mainly in 4 steps, which include Initialization Step, Expectation
Step, Maximization Step, and convergence Step. These steps are explained as follows:

1st Step: The very first step is to initialize the parameter values. Further, the system is
provided with incomplete observed data with the assumption that data is obtained from a
specific model.
2nd Step: This step is known as Expectation or E-Step, which is used to estimate or guess
the values of the missing or incomplete data using the observed data. Further, E-step
primarily updates the variables.
3rd Step: This step is known as Maximization or M-step, where we use complete data
obtained from the 2nd step to update the parameter values. Further, M-step primarily
updates the hypothesis.
4th step: The last step is to check if the values of latent variables are converging or not. If it
gets "yes", then stop the process; else, repeat the process from step 2 until the convergence
occurs.
PROGRAM:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
import sklearn.metrics as sm
import pandas as pd
import numpy as np

iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
X.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']
y = pd.DataFrame(iris.target)
y.columns = ['Targets']

model = KMeans(n_clusters=3)
model.fit(X)

plt.figure(figsize=(14, 7))
colormap = np.array(['red', 'lime', 'black'])

# Plot the Original Classifications
plt.subplot(1, 2, 1)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y.Targets], s=40)
plt.title('Real Classification')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

# Plot the Model's Classifications
plt.subplot(1, 2, 2)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[model.labels_], s=40)
plt.title('K Mean Classification')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

print('The accuracy score of K-Mean: ', sm.accuracy_score(y, model.labels_))
print('The Confusion matrix of K-Mean: ', sm.confusion_matrix(y, model.labels_))

# standardize the features before fitting the Gaussian mixture
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
xsa = scaler.transform(X)
xs = pd.DataFrame(xsa, columns=X.columns)
# xs.sample(5)

from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3)
gmm.fit(xs)
y_gmm = gmm.predict(xs)
# y_cluster_gmm

plt.subplot(2, 2, 3)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y_gmm], s=40)
plt.title('GMM Classification')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

print('The accuracy score of EM: ', sm.accuracy_score(y, y_gmm))
print('The Confusion matrix of EM: ', sm.confusion_matrix(y, y_gmm))
OUTPUT:

The accuracy score of K-Mean: 0.24

The Confusion matrix of K-Mean: [[ 0 50  0]
 [48  0  2]
 [14  0 36]]

The accuracy score of EM: 0.03333333333333333

The Confusion matrix of EM: [[ 0  0 50]
 [45  5  0]
 [ 0 50  0]]
Experiment-12: Exploratory Data Analysis for Classification using Pandas or Matplotlib.

Exploratory Data Analysis – EDA


EDA is a process within data analysis used for gaining a better understanding of data aspects like:
 the main features of the data
 the variables and relationships that hold between them
 identifying which variables are important for our problem

We shall look at various aspects of exploratory data analysis:

 EDA is applied to investigate the data and summarize the key insights.
 It will give you a basic understanding of your data: its distribution, null values, and much more.
 You can explore data either using graphs or through some Python functions.
 There are two types of analysis, univariate and bivariate. In univariate analysis you analyze a
single attribute; in bivariate analysis you analyze an attribute against the target attribute.
 In the non-graphical approach, you use functions such as shape, summary, describe, isnull, info,
datatypes and more.
 In the graphical approach, you use plots such as scatter, box, bar, density and correlation plots.

Program:

The dataset we'll be using is the load_wine dataset from sklearn.datasets, which you can
import in Python as:
# data manipulation
import pandas as pd
import numpy as np
# data viz
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
# apply some cool styling
plt.style.use("ggplot")
rcParams['figure.figsize'] = (12, 6)

# use sklearn to import a dataset


from sklearn.datasets import load_wine
# load the dataset
wine = load_wine()

# convert the dataset to a Pandas dataframe


df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
# create the target column
df["target"] = wine.target
df.head()
df.tail()
df.shape
df.describe()
df.info()
df.duplicated().sum()
#Boxplot

df[['alcohol']].boxplot()
df.corr()
#Correlation plot

sns.heatmap(df.corr())

df.rename(columns={"od280/od315_of_diluted_wines":
"protein_concentration"}, inplace=True)
df.target.value_counts()
df.target.value_counts(normalize=True)
df.target.value_counts().plot(kind="bar")
plt.title("Value counts of the target variable")
plt.xlabel("Wine type")
plt.xticks(rotation=0)
plt.ylabel("Count")
plt.show()
df.magnesium.describe()
print(f"Skewness: {df['magnesium'].skew()}")
print(f"Kurtosis: {df['magnesium'].kurt()}")
sns.pairplot(df)
Output:

Boxplot for the alcohol attribute


Heatmap
Experiment-13: Write a program to construct a Bayesian network considering medical
data. Use this model to demonstrate the diagnosis of heart patients using standard Heart
Disease Data Set. You can use Java/Python ML library classes/API

A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional
dependency, and each node corresponds to a unique random variable.
A Bayesian network consists of two major parts: a directed acyclic graph and a set of conditional
probability distributions.
 The directed acyclic graph is a set of random variables represented by nodes.
 The conditional probability distribution of a node (random variable) is defined for every
possible outcome of the preceding causal node(s).

For illustration, consider the following example. Suppose we attempt to turn on our computer, but
the computer does not start (observation/evidence). We would like to know which of the possible
causes of computer failure is more likely. In this simplified illustration, we assume only two
possible causes of this misfortune: electricity failure and computer malfunction.
The corresponding directed acyclic graph is depicted in below figure.

Fig: Directed acyclic graph representing two independent possible causes of a computer failure.

The goal is to calculate the posterior conditional probability distribution of each of the possible
unobserved causes given the observed evidence, i.e. P [Cause | Evidence].
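
A minimal pgmpy sketch of this two-cause illustration (pgmpy is the library used in the program below; the node names and probabilities here are purely illustrative, and newer pgmpy versions call the model class BayesianNetwork rather than BayesianModel):

from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# two independent causes with one common effect
model = BayesianModel([('Electricity', 'Computer'), ('Malfunction', 'Computer')])

# illustrative priors: state 0 = ok, state 1 = failure/present
cpd_e = TabularCPD('Electricity', 2, [[0.9], [0.1]])
cpd_m = TabularCPD('Malfunction', 2, [[0.95], [0.05]])
# the computer starts only if the electricity is ok and there is no malfunction
cpd_c = TabularCPD('Computer', 2,
                   [[1.0, 0.0, 0.0, 0.0],   # state 0: starts
                    [0.0, 1.0, 1.0, 1.0]],  # state 1: does not start
                   evidence=['Electricity', 'Malfunction'], evidence_card=[2, 2])
model.add_cpds(cpd_e, cpd_m, cpd_c)

# P[Cause | Evidence]: posterior of each cause given that the computer does not start
infer = VariableElimination(model)
print(infer.query(variables=['Electricity'], evidence={'Computer': 1}))
print(infer.query(variables=['Malfunction'], evidence={'Computer': 1}))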
Data Set:
Title: Heart Disease Databases
The Cleveland database contains 76 attributes, but all published experiments refer to using a subset
of 14 of them. In particular, the Cleveland database is the only one that has been used by ML
researchers to this date. The "Heartdisease" field refers to the presence of heart disease in the
patient. It is integer valued from 0 (no presence) to 4.
Database: 0 1 2 3 4 Total
Cleveland: 164 55 36 35 13 303

Attribute Information:
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
 Value 1: typical angina
 Value 2: atypical angina
 Value 3: non-anginal pain
 Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholestoral in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
 Value 0: normal
 Value 1: having ST-T wave abnormality (T wave inversions and/or ST
elevationor depression of > 0.05 mV)
 Value 2: showing probable or definite left ventricular hypertrophy
by Estes'criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak = ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
 Value 1: upsloping
 Value 2: flat
 Value 3: downsloping
12. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
13. Heartdisease: It is integer valued from 0 (no presence) to 4.

Some instance from the dataset:

age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal Heartdisease
63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
62 0 4 140 268 0 2 160 0 3.6 3 2 3 3
60 1 4 130 206 0 2 132 1 2.4 2 2 7 4

Program:
import numpy as np
import pandas as pd
import csv
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.models import BayesianModel
from pgmpy.inference import VariableElimination

#read Cleveland Heart Disease data


heartDisease = pd.read_csv('heart.csv')
heartDisease = heartDisease.replace('?', np.nan)

#display the data


print('Sample instances from the dataset are given below')
print(heartDisease.head())

#display the attribute names and datatypes

print('\n Attributes and datatypes')
print(heartDisease.dtypes)

#create the Bayesian network model
model = BayesianModel([('age','heartdisease'),('sex','heartdisease'),
                       ('exang','heartdisease'),('cp','heartdisease'),
                       ('heartdisease','restecg'),('heartdisease','chol')])

#learning CPDs using Maximum Likelihood Estimators

print('\n Learning CPD using Maximum likelihood estimators')
model.fit(heartDisease, estimator=MaximumLikelihoodEstimator)

#inferencing with the Bayesian network

print('\n Inferencing with Bayesian Network:')
HeartDiseasetest_infer = VariableElimination(model)

#computing the probability of HeartDisease given restecg

print('\n 1.Probability of HeartDisease given evidence= restecg :1')
q1 = HeartDiseasetest_infer.query(variables=['heartdisease'], evidence={'restecg':1})
print(q1)

#computing the probability of HeartDisease given cp

print('\n 2.Probability of HeartDisease given evidence= cp:2 ')
q2 = HeartDiseasetest_infer.query(variables=['heartdisease'], evidence={'cp':2})
print(q2)

Output:
Experiment-14: Write a program to Implement Support Vector Machines

Support Vector Machine


A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of
performing linear or nonlinear classification, regression, and even outlier detection. With this tutorial, we
learn about the support vector machine technique and how to use it in scikit-learn. We will also discover
the Principal Component Analysis and its implementation with scikit-learn.

Another simple approach that any machine learning expert should know about is the support vector
machine. Many people prefer the support vector machine because it produces high accuracy while using
less computing power. SVM (Support Vector Machine) can be used for both regression and classification;
however, it is most widely applied to classification problems.

What is a Support Vector Machine?


The objective of the support vector machine algorithm is to find a hyperplane in N-dimensional space(N
— the number of features) that distinctly classifies the data points.

There are numerous hyper-planes from which to choose to split the two kinds of data points. Our goal is to
discover a plane with the greatest margin, or the greatest distance between data points from both classes.
Maximizing the margin distance adds some reinforcement, making it easier to classify future data points.
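
As a toy sketch of that decision rule: a linear SVM learns a weight vector w and a bias b, and classifies a point by which side of the hyperplane w . x + b = 0 it falls on (the numbers below are illustrative, not learned from any dataset):

import numpy as np

w = np.array([0.5, -1.2])   # illustrative weight vector (normally learned by the SVM)
b = 0.3                     # illustrative bias term
x = np.array([1.0, 0.8])    # a new data point to classify
print(np.sign(w @ x + b))   # -1.0 here: the side of the hyperplane x falls on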

Hyper-planes and Support Vectors


Hyper-planes are decision-making boundaries that help in data classification. Different classes can be
assigned to data points on either side of the hyperplane. The hyperplane’s dimension is also determined by
the number of features. If there are only two input characteristics, the hyperplane is simply a line. The
hyperplane becomes a two-dimensional plane when the number of input features reaches three. When the
number of features exceeds three, it becomes impossible to imagine.

Support vectors are data points that are closer to the hyperplane and have an influence on the hyperplane’s
position and orientation. We increase the classifier’s margin by using these support vectors. The
hyperplane’s position will be altered if the support vectors are deleted. These are the points that will assist
us in constructing our SVM.
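
A fitted scikit-learn SVC exposes these support vectors directly. A quick sketch (classifier refers to the fitted SVC from the program below):

# after classifier.fit(X_Train, Y_Train) in the program below has run:
print(classifier.support_vectors_)              # coordinates of the support vectors
print(classifier.n_support_)                    # number of support vectors per class
print(classifier.coef_, classifier.intercept_)  # w and b (available for a linear kernel)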

SVM Implementation in Python


We will use a support vector machine to predict whether a user purchases a product, based on age and
estimated salary, using the Social_Network_Ads dataset loaded in the program below.

Program:

# Support Vector Machine


# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset

datasets = pd.read_csv('C:\\Users\\Reddy\\Downloads\\Social_Network_Ads.csv')
X = datasets.iloc[:, [2, 3]].values
Y = datasets.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_Train = sc_X.fit_transform(X_Train)
X_Test = sc_X.transform(X_Test)

# Fitting the classifier to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_Train, Y_Train)

# Predicting the test set results


Y_Pred = classifier.predict(X_Test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_Test, Y_Pred)

# Visualising the Training set results


from matplotlib.colors import ListedColormap
X_Set, Y_Set = X_Train, Y_Train
X1, X2 = np.meshgrid(np.arange(start = X_Set[:, 0].min() - 1, stop = X_Set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_Set[:, 1].min() - 1, stop = X_Set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(Y_Set)):
    plt.scatter(X_Set[Y_Set == j, 0], X_Set[Y_Set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Support Vector Machine (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_Set, Y_Set = X_Test, Y_Test
X1, X2 = np.meshgrid(np.arange(start = X_Set[:, 0].min() - 1, stop = X_Set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_Set[:, 1].min() - 1, stop = X_Set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(Y_Set)):
    plt.scatter(X_Set[Y_Set == j, 0], X_Set[Y_Set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)

plt.title('Support Vector Machine (Test set)')


plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

OUTPUT:
Experiment-15: Write a program to Implement Principal Component Analysis

Principal Component Analysis (PCA) is an algebraic technique for converting a set of
observations of possibly correlated variables into a set of values of linearly uncorrelated
variables.

The principal components are chosen to describe most of the available variance in the
data, and all principal components are orthogonal to each other. Among all the
principal components, the first principal component always has the maximum variance.

Different Uses of Principal Component Analysis:

o PCA can be used for finding interrelations between various variables in the data.

o PCA can be used for interpreting and visualizing the data sets.

o PCA can also be used for visualizing genetic distance and connection between
populations.

o PCA also makes analysis simple with the decrease in the number of variables.

Principal component analysis is usually performed on a square symmetric matrix, which can
be a pure sums-of-squares-and-cross-products matrix, a correlation matrix or a covariance
matrix. The correlation matrix is used if there is a major difference in the individual variances.

What are the Objectives of Principal Component Analysis?

The basic objectives of PCA are as follows:

o PCA is a non-dependent method that can be used for reducing the attribute space from a
larger number of variables to a smaller number of factors.
o It is a dimension-reducing technique, but with no assurance that the reduced
dimensions will be interpretable.
o In PCA, the main job is selecting a subset of variables from a larger set,
depending on which original variables have the highest correlation with the principal
component.

Principal Axis Method: Principal Component Analysis searches for the linear combination of
variables that extracts the maximum variance from the variables. Once that is done, PCA
moves on to another linear combination which explains the maximum ratio of the
remaining variance, leading to orthogonal factors. This method is used for analysing the
total variance in the variables of the set.

Eigen Vector: It is a nonzero vector that remains parallel to itself after being multiplied by the
matrix. Suppose 'V' is an eigen vector of dimension R of a matrix K with dimension R * R, so
that KV and V are parallel. Then the user has to solve KV = PV, where both V and P are
unknown, to find the eigen vector and eigen value.

Eigen Value: It is also known as a "characteristic root" in PCA. It measures the variance in all
the variables of the set that is accounted for by that factor. The proportion of the eigen value
is the ratio of the descriptive importance of the factor with respect to the variables. If the
factor is low, then it contributes less to the description of the variables.
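
A minimal NumPy sketch of the KV = PV relation (the data matrix here is random and purely illustrative):

import numpy as np

X = np.random.rand(100, 3)        # any data matrix
K = np.cov(X, rowvar=False)       # 3 x 3 covariance matrix (square and symmetric)
P, V = np.linalg.eigh(K)          # eigen values P and eigen vectors V (as columns)
v, p = V[:, -1], P[-1]            # the eigen pair with the largest eigen value
print(np.allclose(K @ v, p * v))  # True: K @ v stays parallel to v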

Now, we will discuss Principal Component Analysis with Python.

Following are the steps for using PCA with Python:

o Here, we will use the iris dataset from sklearn.datasets, loaded with load_iris in the program below.

PROGRAM:

from sklearn.datasets import load_iris


from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# Load iris dataset


irisdata = load_iris()
X, y = irisdata['data'], irisdata['target']
plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=y)
plt.xlabel(irisdata["feature_names"][0])
plt.ylabel(irisdata["feature_names"][1])
plt.title("Two features from the iris dataset")
plt.show()

# Show the principal components


pca = PCA().fit(X)
print("Principal components:")
print(pca.components_)

# Show any three features


fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(projection='3d')
ax.scatter(X[:,1], X[:,2], X[:,3], c=y)
ax.set_xlabel(irisdata["feature_names"][1])
ax.set_ylabel(irisdata ["feature_names"][2])
ax.set_zlabel(irisdata ["feature_names"][3])
ax.set_title("Three particular features of the iris dataset")
plt.show()

# Remove PC1
Xmean = X - X.mean(axis=0)
value = Xmean @ pca.components_[0]
pc1 = value.reshape(-1,1) @ pca.components_[0].reshape(1,-1)
Xremove = X - pc1
plt.figure(figsize=(8,6))
plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
plt.xlabel(irisdata["feature_names"][0])
plt.ylabel(irisdata["feature_names"][1])
plt.title("Two features from the iris dataset after removing PC1")

plt.show()

# Remove PC2
Xmean = X - X.mean(axis=0)
value = Xmean @ pca.components_[1]
pc2 = value.reshape(-1,1) @ pca.components_[1].reshape(1,-1)
Xremove = Xremove - pc2
plt.figure(figsize=(8,6))
plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
plt.xlabel(irisdata["feature_names"][0])
plt.ylabel(irisdata["feature_names"][1])
plt.title("Two features from the iris dataset after removing PC1
and PC2")
plt.show()

# Remove PC3
Xmean = X - X.mean(axis=0)
value = Xmean @ pca.components_[2]
pc3 = value.reshape(-1,1) @ pca.components_[2].reshape(1,-1)
Xremove = Xremove - pc3
plt.figure(figsize=(8,6))
plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
plt.xlabel(irisdata["feature_names"][0])
plt.ylabel(irisdata["feature_names"][1])
plt.title("Two features from the iris dataset after removing PC1
to PC3")
plt.show()

# Print the explained variance ratio


print("Explainedd variance ratios:")
print(pca.explained_variance_ratio_)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.33)

# Run classifier on all features


clf = SVC(kernel="linear", gamma='auto').fit(X_train, y_train)
print("Using all features, accuracy: ", clf.score(X_test, y_test))
print("Using all features, F1: ", f1_score(y_test,
clf.predict(X_test), average="macro"))

# Run classifier on PC1


mean = X_train.mean(axis=0)
X_train2 = X_train - mean
X_train2 = (X_train2 @ pca.components_[0]).reshape(-1,1)
clf = SVC(kernel="linear", gamma='auto').fit(X_train2, y_train)
X_test2 = X_test - mean
X_test2 = (X_test2 @ pca.components_[0]).reshape(-1,1)
print("Using PC1, accuracy: ", clf.score(X_test2, y_test))
print("Using PC1, F1: ", f1_score(y_test, clf.predict(X_test2),
average="macro"))

OUTPUT:
Principal components:
[[ 0.36138659 -0.08452251 0.85667061 0.3582892 ]
[ 0.65658877 0.73016143 -0.17337266 -0.07548102]
[-0.58202985 0.59791083 0.07623608 0.54583143]
[-0.31548719 0.3197231 0.47983899 -0.75365743]]
Explained variance ratios:
[0.92461872 0.05306648 0.01710261 0.00521218]
Using all features, accuracy: 0.96
Using all features, F1: 0.9595238095238096
Using PC1, accuracy: 0.94
Using PC1, F1: 0.9398762157382846
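
The ratios above also explain why the PC1-only classifier comes so close to the full-feature one: the first component alone accounts for about 92% of the variance. A one-line check of the cumulative view (using the pca object fitted in the program):

import numpy as np
print(np.cumsum(pca.explained_variance_ratio_))  # approx. [0.9246 0.9777 0.9948 1.0]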
