Jntuk R20 ML MANUAL
LABORATORY MANUAL
FOR
Institute Vision:
To emerge as an acclaimed center of learning that provides value-based technical education for the
holistic development of students
Institute Mission:
• Undertake activities that provide value-based knowledge in Science, Engineering, and Technology
Department Vision:
To evolve into a Centre of learning that imparts quality education in Computer Science and
Engineering to produce highly competent professionals.
Department Mission:
Impart computing and technical skills with an emphasis on professional competency and human
values.
Enrich the learning aptitude to face the dynamic environment of the Computer Industry.
Enhance the analytical and problem-solving capability through contests and technical
seminars.
GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING FOR WOMEN
(Approved by AICTE New Delhi & Affiliated to JNTUK, Kakinada)
LAB RUBRICS
Internals: Day to Day Performance
Attendance (1)
• Attended and completed on the same day
• Attended and partially completed on the same day
• Attended but completed in the extra lab
• Not attended but completed in the extra lab
Understanding of the Experiment (2)
• Complete understanding of the experiment with learning objectives
• Partial understanding of the experiment with learning objectives
• Most of the experiment misunderstood
• Complete misunderstanding of the experiment
Implementation with result analysis (5)
• Complete implementation with result analysis and interpretation
• Complete implementation with result analysis only
• Complete implementation with result analysis and interpretation in extra lab
• Complete implementation with result analysis only in extra lab
Observation submission (2)
• Submission of the observation on time
• Submission of the observation almost on time
• Submission of the observation immediately after the extra lab
• Submission of the observation after the extra lab
Internals: Record
Comprehensiveness & Legible (3)
• Write all the elements of the experiments which can be easily readable
• Write all the elements of the experiments with poor handwriting
• Some elements are missing but presented clearly
• Some elements are missing and poor handwriting
Timely Submission (2)
• Submission of the record on time
• Submission of the record almost on time
• Submission of the record immediately after the extra lab
• Submission of the record after the extra lab
INDEX
SNO NAME OF THE EXPERIMENT
MACHINE LEARNING LAB
1. Implement and demonstrate the FIND-S algorithm for finding the most specific
hypothesis based on a given set of training data samples. Read the training data from a
.CSV file.
2. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the
set of all hypotheses consistent with the training examples.
3. Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to
classify a new sample.
4. Exercises to solve the real-world problems using the following machine learning
methods: a) Linear Regression b) Logistic Regression c) Binary Classifier
5. Develop a program for Bias, Variance, Remove duplicates, Cross Validation.
6. Write a program to implement Categorical Encoding, One-hot Encoding.
7. Build an Artificial Neural Network by implementing the Back propagation algorithm and
test the same using appropriate data sets.
8. Write a program to implement k-Nearest Neighbor algorithm to classify the iris data set.
Print both correct and wrong predictions.
9. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.
10. Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier
model to perform this task. Built-in Java classes/API can be used to write the program.
Calculate the accuracy, precision, and recall for your data set.
11. Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data set for
clustering using k-Means algorithm. Compare the results of these two algorithms and
comment on the quality of clustering. You can add Java/Python ML library classes/API
in the program.
12. Exploratory Data Analysis for Classification using Pandas or Matplotlib.
13. Write a Python program to construct a Bayesian network considering medical data. Use
this model to demonstrate the diagnosis of heart patients using standard Heart Disease
Data Set.
14. Write a program to implement Support Vector Machines.
Experiment-1:
AIM: Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis
based on a given set of training data samples. Read the training data from a .CSV file.
Theory:
Find-S Algorithm:
1. Load the data set
2. Initialize h to the most specific hypothesis in H
3. For each positive training instance x
       For each attribute constraint ai in h
           If the constraint ai in h is satisfied by x then do nothing
           Else replace ai in h by the next more general constraint that is satisfied by x
4. Output hypothesis h
Write a Python program that takes input from a CSV file (e.g., EnjoySport.csv) and prints all the
instances, then only the positive instances that influence Find-S, the specific hypothesis after
considering each positive instance, and the final hypothesis.
Example:
Input:
'Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same',True
'Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same',True
'Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change',False
'Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change',True
Sample Code:
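import csv

# Read the training examples; the last column is the class label
with open('EnjoySport.csv', newline='') as csvfile:
    readcsv = [[value.strip(" '") for value in row] for row in csv.reader(csvfile) if row]

data = []
print("\nThe given training examples are:")
for row in readcsv:
    print(row)
    # Find-S uses only the positive examples (label Yes/True)
    if row[-1].upper() in ("YES", "TRUE"):
        data.append(row)

print("\nThe positive examples are:")
for row in data:
    print(row)

d = len(data[0]) - 1    # number of attributes, excluding the label
hypo = ['%'] * d        # '%' stands for the most specific constraint
print("\nThe steps of the Find-S algorithm are:\n", hypo)
for example in data:
    for k in range(d):
        if hypo[k] == '%':
            hypo[k] = example[k]    # first positive example: copy its value
        elif hypo[k] != example[k]:
            hypo[k] = '?'           # generalize a constraint that the example violates
    print(hypo)

print("\nThe maximally specific Find-S hypothesis for the given training examples is:")
print(hypo)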
Output:
The maximally specific Find-s hypothesis for the given training examples is :
['Sunny', 'Warm', '?', 'Strong', '?', '?']
Experiment-2:
AIM: For a given set of training data examples stored in a .CSV file, implement and demonstrate the
Candidate-Elimination algorithm to output a description of the set of all hypotheses consistent with
the training examples.
Theory:
Candidate-Elimination Algorithm:
Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
    If d is a positive example
        Remove from G any hypothesis inconsistent with d
        For each hypothesis s in S that is not consistent with d
            Remove s from S
            Add to S all minimal generalizations h of s such that
                h is consistent with d, and some member of G is more general than h
            Remove from S any hypothesis that is more general than another hypothesis in S
    If d is a negative example
        Remove from S any hypothesis inconsistent with d
        For each hypothesis g in G that is not consistent with d
            Remove g from G
            Add to G all minimal specializations h of g such that
                h is consistent with d, and some member of S is more specific than h
            Remove from G any hypothesis that is less general than another hypothesis in G
Output the hypotheses S and G.
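The algorithm above relies on two relations: a hypothesis being consistent with an example, and one hypothesis being more general than another. The following is a minimal sketch of both, assuming hypotheses and instances are tuples of attribute values in which '?' matches any value. The function names covers, consistent and more_general_or_equal are illustrative and not part of any library.

```python
# Hypotheses and instances are tuples of attribute values; '?' matches any value.
def covers(hypothesis, instance):
    """Return True if the hypothesis classifies the instance as positive."""
    return all(h == '?' or h == v for h, v in zip(hypothesis, instance))

def consistent(hypothesis, instance, label):
    """A hypothesis is consistent with an example if it predicts its label."""
    return covers(hypothesis, instance) == label

def more_general_or_equal(h1, h2):
    """Return True if h1 covers every instance that h2 covers."""
    return all(a == '?' or a == b for a, b in zip(h1, h2))

s = ('Sunny', 'Warm', '?', 'Strong', '?', '?')
g = ('Sunny', '?', '?', '?', '?', '?')
negative = ('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change')

print(consistent(s, negative, False))   # s rejects the negative example
print(more_general_or_equal(g, s))      # g is at least as general as s
```

With the EnjoySport examples, every member of the final G should be more general than the final S, which is a quick sanity check on a program's output.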
Write a Python program to take input from a CSV file (e.g., EnjoySport.csv) and apply the candidate-
elimination algorithm to find S and G.
Example:
Input:
'Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same',True
'Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same',True
'Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change',False
'Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change',True
Output:
Final Specific h: [['Sunny', 'Warm', '?', 'Strong', '?', '?']]
Final General h: [['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]
Sample Code:
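import numpy as np
import pandas as pd

# Assumes EnjoySport.csv has a header row and the class label in its last column
data = pd.read_csv('EnjoySport.csv')
concepts = np.array(data.iloc[:, :-1])
target = np.array(data.iloc[:, -1])

def learn(concepts, target):
    '''
    learn() function implements the learning method of the Candidate elimination
    algorithm.
    Arguments:
        concepts - an array with all the features
        target - an array with corresponding output values
    '''
    # Start from the first example (assumed positive) and the most general boundary
    specific_h = concepts[0].copy()
    n = len(specific_h)
    general_h = [['?' for _ in range(n)] for _ in range(n)]
    for i, h in enumerate(concepts):
        positive = str(target[i]).strip().upper() in ('YES', 'TRUE')
        for x in range(n):
            if positive and h[x] != specific_h[x]:
                # generalize S and drop the matching constraint from G
                specific_h[x] = '?'
                general_h[x][x] = '?'
            elif not positive:
                # specialize G on attributes where the negative example differs from S
                general_h[x][x] = specific_h[x] if h[x] != specific_h[x] else '?'
    # remove the rows of general_h that are unchanged (all '?')
    general_h = [g for g in general_h if g != ['?'] * n]
    # Return final values
    return specific_h, general_h

s_final, g_final = learn(concepts, target)
print("Final Specific_h:", s_final, sep="\n")
print("Final General_h:", g_final, sep="\n")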
Output:
Final Specific_h:
['Sunny' 'Warm' '?' 'Strong' '?' '?']
Final General_h:
[['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]
Experiment-3:
AIM: Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an
appropriate data set for building the decision tree and apply this knowledge to classify a new sample.
Theory:
ID3 Algorithm:
Create a Root node for the tree
If all Examples are positive, Return the single-node tree Root, with label = +
If all Examples are negative, Return the single-node tree Root, with label = -
If Attributes is empty, Return the single-node tree Root, with label = most common value of
Target_attribute in Examples
Otherwise Begin
    A ← the attribute from Attributes that best* classifies Examples
    The decision attribute for Root ← A
    For each possible value, vi, of A,
        Add a new tree branch below Root, corresponding to the test A = vi
        Let Examples_vi be the subset of Examples that have value vi for A
        If Examples_vi is empty
            Then below this new branch add a leaf node with label = most common value of
            Target_attribute in Examples
            Else below this new branch add the subtree ID3(Examples_vi, Target_attribute,
            Attributes – {A})
End
Return Root
* The best attribute is the one with the highest information gain.
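The choice of the attribute that best classifies the examples is usually made with information gain, the reduction in entropy obtained by splitting on an attribute. The sketch below computes both quantities in plain Python. The toy rows (Outlook, Windy, Play) are hypothetical and chosen only so that one attribute separates the classes perfectly and the other not at all.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute_index):
    """Entropy reduction from splitting rows on one attribute.
    Each row is a tuple whose last element is the class label."""
    base = entropy([row[-1] for row in rows])
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute_index], []).append(row[-1])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return base - remainder

# Hypothetical toy data: (Outlook, Windy, Play)
rows = [
    ('Sunny', 'No', 'No'), ('Sunny', 'Yes', 'No'),
    ('Rainy', 'No', 'Yes'), ('Rainy', 'Yes', 'Yes'),
]
gains = [information_gain(rows, i) for i in range(2)]
best = gains.index(max(gains))   # index of the attribute ID3 would test first
print(gains, best)
```

Here Outlook determines the label completely, so its gain equals the full entropy of the data, while Windy leaves the class mix unchanged and has zero gain.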
Write a Python program to import an appropriate dataset, split it into training and test sets, apply ID3
algorithm to build a decision tree fitting the training set. Find out its accuracy on the training set as
well as the test set.
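Sample Code:
import math
import csv
from collections import Counter

def read_data(filename):
    # The first row holds the attribute names; the last column is the target
    with open(filename, 'r') as csvfile:
        datareader = csv.reader(csvfile, delimiter=',')
        metadata = next(datareader)
        traindata = [row for row in datareader if row]
    return metadata, traindata

class Node:
    def __init__(self, attribute):
        self.attribute = attribute   # attribute tested at this node ("" for a leaf)
        self.children = []           # list of (attribute value, child Node) pairs
        self.answer = ""             # class label stored at a leaf

def subtables(data, col):
    # Group the rows by their value in column col
    groups = {}
    for row in data:
        groups.setdefault(row[col], []).append(row)
    return groups

def entropy(S):
    # S is a list of class labels
    total = len(S)
    return -sum((c / total) * math.log(c / total, 2) for c in Counter(S).values())

def gain(data, col):
    # Information gain obtained by splitting data on column col
    total_entropy = entropy([row[-1] for row in data])
    for subset in subtables(data, col).values():
        ratio = len(subset) / len(data)
        total_entropy -= ratio * entropy([row[-1] for row in subset])
    return total_entropy

def create_node(data, metadata):
    labels = [row[-1] for row in data]
    # Leaf: all examples agree, or no attributes remain (use the majority label)
    if len(set(labels)) == 1 or len(metadata) == 1:
        node = Node("")
        node.answer = Counter(labels).most_common(1)[0][0]
        return node
    gains = [gain(data, col) for col in range(len(metadata) - 1)]
    split = gains.index(max(gains))
    node = Node(metadata[split])
    remaining = metadata[:split] + metadata[split + 1:]
    for value, subset in subtables(data, split).items():
        # drop the attribute that was just tested before recursing
        subset = [row[:split] + row[split + 1:] for row in subset]
        node.children.append((value, create_node(subset, remaining)))
    return node

def empty(size):
    return "   " * size

def print_tree(node, level):
    if node.answer != "":
        print(empty(level), node.answer)
        return
    print(empty(level), node.attribute)
    for value, child in node.children:
        print(empty(level + 1), value)
        print_tree(child, level + 2)

# Assumed input file: a CSV with a header row and the class label in the last column
metadata, traindata = read_data("tennis.csv")
node = create_node(traindata, metadata)
print_tree(node, 0)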
Output:
for input [1, 0, 0, 0], we obtain Yes
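Experiment-4:
AIM: Exercises to solve the real-world problems using the following machine learning methods:
a) Linear Regression b) Logistic Regression c) Binary Classifier
Theory: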
While there are many types of regression analysis, at their core they all examine the influence of one
or more independent variables on a dependent variable. Linear regression is usually among the first
topics people study when learning predictive modelling. In this technique, the dependent variable is
continuous, the independent variable(s) can be continuous or discrete, and the regression line is
linear. Linear regression establishes a relationship between a dependent variable (Y) and one or more
independent variables (X) using a best-fit straight line (also known as the regression line). It is
represented by the equation Y = β0 + β1x1 + … + βrxr. In other words, linear regression predicts the
target variable as a linear (or weighted) combination of the input variables. This equation can be
used to predict the value of the dependent variable given the independent variable(s).
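For a single independent variable, the coefficients of the best-fit line have a closed-form least-squares solution. The sketch below illustrates it in plain Python. The function name and the noise-free data generated from y = 1 + 2x are assumptions made only for this illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data generated from y = 1 + 2x (no noise)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_linear_regression(xs, ys)
prediction = b0 + b1 * 5.0   # predicted value of Y at x = 5
print(b0, b1, prediction)
```

With several independent variables the same idea applies, but the coefficients are normally obtained with a linear algebra routine or a library such as scikit-learn.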
Logistic regression, despite its name, is a classification algorithm rather than a regression
algorithm. Based on a given set of independent variables, it is used to estimate a discrete value
(0 or 1, yes/no, true/false). It measures the relationship between the categorical dependent variable
and one or more independent variables by estimating the probability of occurrence of an event using
the logistic function, i.e., Y = 1 / (1 + e^(-(β0 + β1x1 + … + βrxr))).
Since logistic regression can be seen as a binary classifier, it is enough for students to implement
linear and logistic regression with appropriate data sets for this experiment.
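The logistic function turns the linear score β0 + β1x1 + … + βrxr into a probability, which is then thresholded to obtain a class. The following sketch shows only this prediction step. The coefficient values are hypothetical, and the 0.5 threshold is a common default rather than something fixed by the method.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, betas):
    """P(Y = 1 | x) for a logistic model with intercept beta0 and weights betas."""
    score = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(score)

def predict(x, beta0, betas, threshold=0.5):
    """Binary decision obtained by thresholding the probability."""
    return 1 if predict_proba(x, beta0, betas) >= threshold else 0

# Hypothetical coefficients for a model with two features
beta0, betas = -1.0, [2.0, -0.5]
print(predict_proba([1.0, 2.0], beta0, betas))   # score is 0, so the probability is 0.5
print(predict([3.0, 0.0], beta0, betas))         # score is 5, so the class is 1
```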
Sample Code:
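# Importing libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings( "ignore" )

# Logistic regression trained with batch gradient descent
class LogitRegression() :
    def __init__( self, learning_rate, iterations ) :
        self.learning_rate = learning_rate
        self.iterations = iterations

    # Function for model training
    def fit( self, X, Y ) :
        # m = number of training examples, n = number of features
        self.m, self.n = X.shape
        self.W = np.zeros( self.n )
        self.b = 0
        self.X = X
        self.Y = Y
        for i in range( self.iterations ) :
            self.update_weights()
        return self

    # Helper function to update weights in gradient descent
    def update_weights( self ) :
        A = 1 / ( 1 + np.exp( - ( self.X.dot( self.W ) + self.b ) ) )
        # calculate gradients
        tmp = ( A - self.Y.T )
        tmp = np.reshape( tmp, self.m )
        dW = np.dot( self.X.T, tmp ) / self.m
        db = np.sum( tmp ) / self.m
        # update weights
        self.W = self.W - self.learning_rate * dW
        self.b = self.b - self.learning_rate * db
        return self

    # Hypothetical function h( x )
    def predict( self, X ) :
        Z = 1 / ( 1 + np.exp( - ( X.dot( self.W ) + self.b ) ) )
        return np.where( Z > 0.5, 1, 0 )

# Driver code
def main() :
    # Importing dataset (assumes the class label is the last column)
    df = pd.read_csv( "diabetes.csv" )
    X = df.iloc[:, :-1].values
    Y = df.iloc[:, -1].values
    # Splitting dataset into train and test set (assumed split ratio)
    X_train, X_test, Y_train, Y_test = train_test_split( X, Y, test_size = 1/3, random_state = 0 )
    # Model training
    model = LogitRegression( learning_rate = 0.01, iterations = 1000 )
    model.fit( X_train, Y_train )
    model1 = LogisticRegression()
    model1.fit( X_train, Y_train )
    # Prediction on test set
    Y_pred = model.predict( X_test )
    Y_pred1 = model1.predict( X_test )
    # measure performance
    correctly_classified = np.sum( Y_test == Y_pred )
    correctly_classified1 = np.sum( Y_test == Y_pred1 )
    count = np.size( Y_pred )
    print( "Accuracy on test set by our model : ", ( correctly_classified / count ) * 100 )
    print( "Accuracy on test set by sklearn model : ", ( correctly_classified1 / count ) * 100 )

if __name__ == "__main__" :
    main()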
Output:
Accuracy on test set by our model : 58.333333333333336
Accuracy on test set by sklearn model : 61.111111111111114
Experiment-5:
AIM: Develop a program for Bias, Variance, Remove duplicates, Cross Validation
Theory:
The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause
an algorithm to miss the relevant relations between features and target outputs (underfitting). The
variance is an error from sensitivity to small fluctuations in the training set. High variance may result
from an algorithm modeling the random noise in the training data (overfitting). The bias–variance
tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both
accurately captures the regularities in its training data and generalizes well to unseen data.
Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods
may be able to represent their training set well but are at risk of overfitting to noisy or
unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models
that may fail to capture important regularities (i.e. underfit) in the data. The bias–variance
decomposition is a way of analyzing a learning algorithm's expected generalization error with respect
to a particular model. The following diagram illustrates the bias–variance tradeoff.
Preparing a dataset before designing a machine learning model is an important task for the data
scientist. When you gather a dataset for building a machine learning model, you may find some
instances repeated several times. It is important to remove such duplicates from the dataset to
maintain accuracy and to avoid misleading statistics.
Cross-validation is a technique for evaluating a machine learning model and testing its performance.
CV is commonly used in applied ML tasks. It can be used to estimate the test error associated with a
given statistical learning method in order to evaluate its performance, or to select the appropriate level
of flexibility.
In this experiment, students need to take a learning model and an appropriate data set, remove
duplicates in the data set, fit a model, measure bias and variance components of the error rate, and
fine-tune the parameters using cross validation. They may use built-in APIs if needed.
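The duplicate-removal and cross-validation steps can be sketched without any library. The example below assumes rows are (features, label) tuples and uses a hypothetical majority-class baseline as the learner. The helper names are illustrative, and the folds are contiguous and unshuffled to keep the sketch deterministic, whereas in practice the rows are usually shuffled first.

```python
from collections import Counter

def remove_duplicates(rows):
    """Drop repeated rows while keeping the first occurrence of each."""
    return list(dict.fromkeys(rows))

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(rows, k, fit, predict):
    """Average accuracy over k folds; rows are (features, label) tuples."""
    scores = []
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train = [r for i, r in enumerate(rows) if i not in held_out]
        test = [rows[i] for i in fold]
        model = fit(train)
        correct = sum(predict(model, x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / k

# Hypothetical baseline learner: always predict the majority training label
def fit_majority(train):
    return Counter(label for _, label in train).most_common(1)[0][0]

def predict_majority(model, x):
    return model

# Hypothetical data containing two repeated rows
rows = [((1,), 'a'), ((1,), 'a'), ((2,), 'a'), ((3,), 'b'),
        ((4,), 'a'), ((3,), 'b'), ((5,), 'a'), ((6,), 'a')]
unique_rows = remove_duplicates(rows)
accuracy = cross_validate(unique_rows, 3, fit_majority, predict_majority)
print(len(rows), len(unique_rows), accuracy)
```

Any learner with a fit and a predict function can be substituted for the baseline, and the averaged score can then be compared across parameter settings.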
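Sample Code:
from mlxtend.evaluate import bias_variance_decomp
from mlxtend.data import iris_data, boston_housing_data
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor

def report(model, X_train, y_train, X_test, y_test, loss):
    # Decompose the expected loss of the model into bias and variance
    avg_loss, avg_bias, avg_var = bias_variance_decomp(
        model, X_train, y_train, X_test, y_test, loss=loss, random_seed=123)
    print('Average expected loss: %.3f' % avg_loss)
    print('Average bias: %.3f' % avg_bias)
    print('Average variance: %.3f' % avg_var)

# Classification: a single decision tree versus bagged trees (0-1 loss)
X, y = iris_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True,
                                                    stratify=y)
tree = DecisionTreeClassifier(random_state=123)
report(tree, X_train, y_train, X_test, y_test, '0-1_loss')
# the base estimator is passed positionally because its keyword name differs across sklearn versions
bag = BaggingClassifier(tree, n_estimators=100, random_state=123)
report(bag, X_train, y_train, X_test, y_test, '0-1_loss')

# Regression: a single decision tree versus bagged trees (mean squared error)
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)
tree = DecisionTreeRegressor(random_state=123)
report(tree, X_train, y_train, X_test, y_test, 'mse')
bag = BaggingRegressor(tree, n_estimators=100, random_state=123)
report(bag, X_train, y_train, X_test, y_test, 'mse')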
Output:
Average expected loss: 18.620
Average bias: 15.461
Average variance: 3.159
Experiment-6:
AIM: Write a program to implement Categorical Encoding, One-hot Encoding.
Theory:
Data preparation is a mandatory step before modelling, and encoding categorical data is one of its
crucial tasks. Most real-life data contain categorical string values, whereas most machine learning
models work with numerical values only, because the mathematical operations they perform are
defined on numbers (integers or floats), not on strings. Encoding categorical data is the process of
converting categorical values into a numeric format so that the converted data can be provided to
the models to make and improve predictions.
In this experiment, students need to implement popular categorical encoding techniques like:
One-Hot Encoding
Label Encoding (or Ordinal Encoding)
Binary Encoding
Base-N Encoding
Hash Encoding
Target Encoding
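Two of the techniques listed above, label encoding and binary encoding, can be illustrated in a few lines of plain Python. This is a minimal sketch under the assumption that codes are assigned in sorted category order starting from 0. Library implementations may number the categories differently, and the helper names are illustrative.

```python
def label_encode(values):
    """Map each distinct category to an integer code (sorted order)."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def binary_encode(codes):
    """Write each integer code as a fixed-width list of binary digits."""
    width = max(1, max(codes).bit_length())
    return [[(code >> bit) & 1 for bit in reversed(range(width))] for code in codes]

colors = ["red", "green", "yellow", "red", "blue"]
codes, mapping = label_encode(colors)
bits = binary_encode(codes)
print(mapping)
print(codes)
print(bits)
```

Binary encoding needs only about log2 of the number of categories as columns, which is why it is sometimes preferred over one-hot encoding for high-cardinality features.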
Sample Code:
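import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# illustrative bridge types (assumed sample data)
bridge_types = ('Arch', 'Beam', 'Truss', 'Cantilever', 'Tied Arch', 'Suspension', 'Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
# label-encode the categories as integer codes
bridge_df['Bridge_Types_Cat'] = bridge_df['Bridge_Types'].astype('category').cat.codes
# creating instance of one-hot-encoder
enc = OneHotEncoder(handle_unknown='ignore')
# passing bridge-types-cat column (label encoded values of bridge_types)
enc_df = pd.DataFrame(enc.fit_transform(bridge_df[['Bridge_Types_Cat']]).toarray())
# merge with main df bridge_df on key values
bridge_df = bridge_df.join(enc_df)
print(bridge_df)

### Categorical data to be converted to numeric data
colors = ["red", "green", "yellow", "red", "blue"]
# distinct categories and the column index assigned to each
total_colors = sorted(set(colors))
mapping = {c: i for i, c in enumerate(total_colors)}
one_hot_encode = []
for c in colors:
    arr = list(np.zeros(len(total_colors), dtype = int))
    arr[mapping[c]] = 1
    one_hot_encode.append(arr)
print(one_hot_encode)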
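Experiment-7:
AIM: Build an Artificial Neural Network by implementing the Back propagation algorithm and test the
same using appropriate data sets.
Theory: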
The Back propagation algorithm is a supervised learning method for multilayer feedforward networks
from the field of Artificial Neural Networks.
Feed-forward neural networks are inspired by the information processing of one or more neural cells,
called neurons. A neuron accepts input signals via its dendrites, which pass the electrical signal down
to the cell body. The axon carries the signal out to synapses, which are the connections of a cell’s
axon to other cells’ dendrites.
The principle of the back propagation approach is to model a given function by modifying internal
weightings of input signals to produce an expected output signal. The system is trained using a
supervised learning method, where the error between the system’s output and a known expected
output is presented to the system and used to modify its internal state.
Technically, the back propagation algorithm is a method for training the weights in a multilayer feed-
forward neural network. As such, it requires a network structure to be defined of one or more layers
where one layer is fully connected to the next layer. Back propagation can be used for both
classification and regression problems.
Students need to build a neural network by implementing the back propagation algorithm and test
the same using an appropriate data set.
Sample Code:
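import numpy as np
# raw data chosen to reproduce the scaled values shown in the output below (assumed)
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
# scale units
X = X/np.amax(X, axis=0) # maximum of X array
y = y/100 # max test score is 100
class Neural_Network(object):
    def __init__(self):
        # Parameters
        self.inputSize = 2
        self.outputSize = 1
        self.hiddenSize = 3
        # Weights
        self.W1 = np.random.randn(self.inputSize, self.hiddenSize) # (2x3) weight matrix from input to hidden layer
        self.W2 = np.random.randn(self.hiddenSize, self.outputSize) # (3x1) weight matrix from hidden to output layer
    def sigmoid(self, s):
        return 1/(1+np.exp(-s))
    def forward(self, X):
        # forward propagation through the hidden layer to the output
        self.z2 = self.sigmoid(np.dot(X, self.W1))
        return self.sigmoid(np.dot(self.z2, self.W2))
    def backward(self, X, y, o):
        # propagate the output error back and adjust both weight matrices
        o_delta = (y - o) * o * (1 - o)
        z2_delta = o_delta.dot(self.W2.T) * self.z2 * (1 - self.z2)
        self.W1 += X.T.dot(z2_delta)
        self.W2 += self.z2.T.dot(o_delta)
NN = Neural_Network()
for i in range(1000): # number of training passes (assumed)
    NN.backward(X, y, NN.forward(X))
print("Input:\n", X)
print("Actual Output:\n", y)
print("Predicted Output:\n", NN.forward(X))
print("Loss:\n", np.mean(np.square(y - NN.forward(X))))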
Output:
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.90907296]
[0.85841616]
[0.90140598]]
Loss:
8.400178305772788e-05
Experiment-8:
AIM: Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set. Print
both correct and wrong predictions.
Theory:
The k-nearest neighbours algorithm, also known as k-NN, is a non-parametric, supervised learning
classifier, which uses proximity to make classifications or predictions about an individual data point.
While it can be used for either regression or classification problems, it is typically used as a
classification algorithm, working off the assumption that similar points can be found near one another.
For classification problems, a class label is assigned on the basis of a majority vote, i.e. the label
that is most frequently represented around a given data point is used. Regression problems use a
similar concept, but in this case the average of the values of the k nearest neighbors is taken to
make a prediction.
The k-NN algorithm is also part of a family of “lazy learning” models, meaning that it only stores
a training dataset instead of undergoing a training stage. This also means that
all the computation occurs when a classification or prediction is being made. Since it heavily relies on
memory to store all its training data, it is also referred to as an instance-based or memory-based
learning method.
K-NN algorithm
Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array of data points arr[].
2. for i=0 to m-1: Calculate Euclidean distance d(arr[i], p).
3. Sort the training data points in ascending order based on distance values.
4. Get top k rows from the sorted array.
5. Get the most frequent class of these rows.
6. Return the predicted class.
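The six steps above can be sketched directly in plain Python (math.dist requires Python 3.8 or later). The training points, labels and query points below are hypothetical two-dimensional data, not the iris data set, and the third query is deliberately placed so that the prediction disagrees with its assumed target.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, p, k):
    """Classify point p by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from p to every training point
    distances = [(math.dist(x, p), label)
                 for x, label in zip(train_points, train_labels)]
    # Steps 3-4: sort by distance and keep the k closest
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 5-6: return the most frequent label among them
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 2-D training data with two classes
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ['A', 'A', 'A', 'B', 'B', 'B']

queries = [((1.1, 1.0), 'A'), ((5.1, 5.0), 'B'), ((3.2, 3.2), 'A')]
for p, target in queries:
    predicted = knn_predict(train_points, train_labels, p, k=3)
    status = "correct" if predicted == target else "wrong"
    print("TARGET=", target, "PREDICTED=", predicted, status)
```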
Students need to implement the k-NN algorithm and test the same to classify the iris data set. Print
both correct and wrong predictions.
Sample Code:
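import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(dataset["data"], dataset["target"], random_state=0)
kn = KNeighborsClassifier(n_neighbors=1)  # value of k is an assumption
kn.fit(X_train, y_train)
for i in range(len(X_test)):
    x_new = np.array([X_test[i]])
    prediction = kn.predict(x_new)
    print("TARGET=", y_test[i], dataset["target_names"][y_test[i]], "PREDICTED=", prediction, dataset["target_names"][prediction])
print(kn.score(X_test, y_test))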
Output:
TARGET= 2 virginica PREDICTED= [2] ['virginica']
TARGET= 1 versicolor PREDICTED= [1] ['versicolor']
TARGET= 0 setosa PREDICTED= [0] ['setosa']
TARGET= 1 versicolor PREDICTED= [2] ['virginica']
0.9736842105263158
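Experiment-9:
AIM: Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points.
Select appropriate data set for your experiment and draw graphs.
Theory: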
Model-based methods, such as neural networks and the mixture of Gaussians, use the data to build a
parameterized model. After training, the model is used for predictions and the data are generally
discarded. In contrast, "memory-based" methods are non-parametric approaches that explicitly retain
the training data, and use it each time a prediction needs to be made.
The disadvantage of global methods (like regular regression) is that sometimes no parameter values
can provide a sufficiently good approximation. An alternative to global function approximation is
locally weighted regression (LWR). The basic idea behind LWR is that instead of building a global
model for the whole function space, for each point of interest a local model is created based on
neighbouring data of the query point. For this purpose, each data point is assigned a weighting factor
which expresses the influence of the data point on the prediction. In general, data points in the
close neighbourhood of the current query point receive a higher weight than data points which are
far away. LWR is also called lazy learning because the processing of the training data is deferred
until a query point needs to be answered. This approach can make LWR an accurate function
approximation method to which it is easy to add new training points.
Students need to implement the non-parametric LWR algorithm on an appropriate data set and draw
graphs.
Sample Code:
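import math
import numpy as np

def lowess(x, y, f, iterations):
    # Robust locally weighted linear fits; f is the fraction of points used per fit
    n = len(x)
    r = int(math.ceil(f * n))
    h = np.array([np.sort(np.abs(x - x[i]))[r] for i in range(n)])
    # tricube weights: column i holds the weights used when fitting at x[i]
    w = (1 - np.clip(np.abs((x[:, None] - x[None, :]) / h), 0, 1) ** 3) ** 3
    yest, delta = np.zeros(n), np.ones(n)
    for _ in range(iterations):
        for i in range(n):
            wi = delta * w[:, i]
            beta = np.polyfit(x, y, 1, w=np.sqrt(wi))  # weighted least-squares line
            yest[i] = np.polyval(beta, x[i])
        residuals = y - yest
        s = np.median(np.abs(residuals))
        # down-weight points with large residuals in the next pass
        delta = (1 - np.clip(residuals / (6.0 * s), -1, 1) ** 2) ** 2
    return yest

n = 100
x = np.linspace(0, 2 * math.pi, n)
y = np.sin(x) + 0.3 * np.random.randn(n)
f = 0.25
iterations = 3
yest = lowess(x, y, f, iterations)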
Output:
Experiment-10:
AIM: Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier
model to perform this task. Built-in Java classes/API can be used to write the program. Calculate the
accuracy, precision, and recall for your data set.
Theory:
Naive Bayes is a simple and powerful algorithm for predictive modeling. The model comprises two
types of probabilities that can be calculated directly from the training data: (i) the prior
probability of each class and (ii) the conditional probability of each input value x given each class.
Once calculated, the probability model can be used to make predictions for new data using Bayes'
theorem. Naive Bayes is called naive because it assumes that the input features are independent of
one another given the class. This is a strong assumption and unrealistic for real data; however, the
technique has proved very effective on a large range of complex problems. The idea behind naive
Bayes classification is to classify the data by maximizing P(O | Ci)P(Ci), which by Bayes' theorem
is proportional to the posterior probability P(Ci | O) (where O is the object or tuple in a dataset
and “i” is an index of the class).
LEARN_NAIVE_BAYES_TEXT (Examples, V)
Examples is a set of text documents along with their target values. V is the set of all possible target
values. This function learns the probability terms P(wk |vj), describing the probability that a randomly
drawn word from a document in class vj will be the English word wk. It also learns the class prior
probabilities P(vj).
1. collect all words, punctuation, and other tokens that occur in Examples
    • Vocabulary ← the set of all distinct words and other tokens occurring in any text document
      from Examples
2. calculate the required P(vj) and P(wk|vj) probability terms
    • For each target value vj in V do
        • docsj ← the subset of documents from Examples for which the target value is vj
        • P(vj) ← |docsj| / |Examples|
        • Textj ← a single document created by concatenating all members of docsj
        • n ← total number of distinct word positions in Textj
        • for each word wk in Vocabulary
            • nk ← number of times word wk occurs in Textj
            • P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
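The learning procedure above, together with a matching classification step, can be sketched in plain Python. This is a simplified illustration that assumes whitespace tokenization and lower-casing. The function names mirror the pseudocode but are otherwise hypothetical, as is the four-document toy corpus.

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples, target_values):
    """examples: list of (text, target) pairs.
    Returns the vocabulary, the class priors and the word probabilities."""
    vocabulary = {w for text, _ in examples for w in text.lower().split()}
    priors, word_probs = {}, {}
    for v in target_values:
        docs = [text for text, target in examples if target == v]
        priors[v] = len(docs) / len(examples)
        words = " ".join(docs).lower().split()   # Text_j
        n = len(words)                           # word positions in Text_j
        counts = Counter(words)
        # Laplace-smoothed estimate (n_k + 1) / (n + |Vocabulary|)
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, priors, word_probs

def classify_naive_bayes_text(text, vocabulary, priors, word_probs):
    """Return the target value with the highest posterior (computed in log space)."""
    words = [w for w in text.lower().split() if w in vocabulary]
    scores = {v: math.log(priors[v]) + sum(math.log(word_probs[v][w]) for w in words)
              for v in priors}
    return max(scores, key=scores.get)

# Hypothetical toy corpus
examples = [
    ("I love this movie", "pos"),
    ("what a great film", "pos"),
    ("I hate this movie", "neg"),
    ("what a boring film", "neg"),
]
model = learn_naive_bayes_text(examples, ["pos", "neg"])
print(classify_naive_bayes_text("a great movie", *model))
print(classify_naive_bayes_text("boring and I hate it", *model))
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long documents, and words outside the vocabulary are simply ignored.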
Students need to implement the naïve Bayesian classification algorithm and apply it to a dataset of
text documents to classify them into various categories. They also need to calculate the accuracy,
precision, and recall for the data set.
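Sample Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix
# assumed input file with a 'message' column and a numeric 'labelnum' column
msg = pd.read_csv('document.csv')
X = msg.message
y = msg.labelnum
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
# build the document-term matrices
count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)
df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names_out())
print(df[0:5])
# train the classifier and evaluate it on the test documents
pred = MultinomialNB().fit(Xtrain_dm, ytrain).predict(Xtest_dm)
print('Accuracy Metrics:\nAccuracy:', accuracy_score(ytest, pred))
print('Recall:', recall_score(ytest, pred))
print('Precision:', precision_score(ytest, pred))
print('Confusion Matrix:\n', confusion_matrix(ytest, pred))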
Output:
Accuracy Metrics:
Accuracy: 0.6
Recall: 0.5
Precision: 1.0
Confusion Matrix:
[[1 0]
[2 2]]
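Experiment-11:
AIM: Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data set for clustering
using k-Means algorithm. Compare the results of these two algorithms and comment on the quality of
clustering. You can add Java/Python ML library classes/API in the program.
Theory: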
Clustering is an unsupervised learning technique that separates data of similar nature. It aims to find a
structure (intrinsic grouping) in a collection of unlabelled data. A cluster is therefore a collection of
objects which are ‘similar’ between each other and are ‘dissimilar’ to the objects belonging to other
clusters. Two representatives of the clustering algorithms are the K-means algorithm and the
expectation maximization (EM) algorithm. The K-means algorithm uses Euclidean distance while EM
uses statistical methods.
K-means clustering:
Input: The number of k and a database containing n objects.
Output: A set of k-clusters that minimize the squared-error criterion.
1. arbitrarily choose k objects as the initial cluster centres;
2. repeat;
a. (re)assign each object to the cluster to which the object is the most similar based on the
mean value of the objects in the cluster;
b. update the cluster mean, i.e. calculate the mean value of the object for each cluster;
until no change.
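The K-means steps above can be sketched in plain Python as follows (math.dist requires Python 3.8 or later). The six data points are hypothetical, and taking the first k points as the initial centres is just one way of making the arbitrary choice in step 1.

```python
import math

def k_means(points, centres, max_iterations=100):
    """Lloyd's algorithm: returns the final centres and the cluster index of each point."""
    assignment = None
    for _ in range(max_iterations):
        # (a) assign every point to its nearest centre
        new_assignment = [
            min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        if new_assignment == assignment:   # until no change
            break
        assignment = new_assignment
        # (b) recompute each centre as the mean of its members
        for j in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                    # keep the old centre if a cluster is empty
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, assignment

# Hypothetical 2-D data; the first k points serve as the arbitrary initial centres
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centres, assignment = k_means(points, centres=list(points[:2]))
print(centres)
print(assignment)
```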
EM clustering:
Input: Cluster number k, a database, stopping tolerance.
Output: A set of k-clusters with weight that maximize log-likelihood function.
1. Expectation step: For each database record x, compute the membership probability of x in
each cluster h = 1,…, k.
2. Maximization step: Update mixture model parameter (probability weight).
3. Stopping criteria: If stopping criteria are satisfied stop, else set j = j +1 and go to (1).
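The EM steps above can be illustrated with a one-dimensional Gaussian mixture. The sketch below is a simplified instance under stated assumptions: one feature, k Gaussian components, starting parameters supplied by the caller, and a small variance floor to avoid degenerate clusters. The function names and the eight data values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, tolerance=1e-6, max_iterations=200):
    """EM for a one-dimensional Gaussian mixture; returns the fitted parameters
    and the membership probabilities of every record."""
    k, previous = len(means), None
    for _ in range(max_iterations):
        # E-step: membership probability of each record x in each cluster h
        resp, log_likelihood = [], 0.0
        for x in data:
            joint = [weights[h] * gaussian_pdf(x, means[h], variances[h]) for h in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
            log_likelihood += math.log(total)
        # M-step: re-estimate weights, means and variances from the memberships
        for h in range(k):
            nh = sum(r[h] for r in resp)
            weights[h] = nh / len(data)
            means[h] = sum(r[h] * x for r, x in zip(resp, data)) / nh
            spread = sum(r[h] * (x - means[h]) ** 2 for r, x in zip(resp, data)) / nh
            variances[h] = max(spread, 1e-6)
        # Stopping criterion: change in log-likelihood below the tolerance
        if previous is not None and abs(log_likelihood - previous) < tolerance:
            break
        previous = log_likelihood
    return means, variances, weights, resp

# Hypothetical data drawn around two centres
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
means, variances, weights, resp = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
labels = [max(range(2), key=lambda h: r[h]) for r in resp]
print(means, weights, labels)
```

Unlike K-means, each record keeps a probability of belonging to every cluster, and a hard label is obtained only at the end by taking the most probable cluster.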
Students need to implement the k-Means clustering algorithm and the EM algorithm to cluster a Heart
Disease Data Set, compare the results of these two algorithms and comment on the quality of
clustering.
Sample Code:
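import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, preprocessing
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score, confusion_matrix

iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
X.columns = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']
y = pd.DataFrame(iris.target)
y.columns = ['Targets']
# K-Means clustering
model = KMeans(n_clusters=3)
model.fit(X)
# cluster indices are arbitrary, so these scores depend on how the clusters happen to be numbered
print('The accuracy score of K-Mean:', accuracy_score(y.Targets, model.labels_))
print('The Confusion matrix of K-Mean:', confusion_matrix(y.Targets, model.labels_))
# EM clustering with a Gaussian mixture on standardized features
xs = pd.DataFrame(preprocessing.StandardScaler().fit_transform(X), columns=X.columns)
gmm = GaussianMixture(n_components=3)
gmm.fit(xs)
y_gmm = gmm.predict(xs)
print('The accuracy score of EM:', accuracy_score(y.Targets, y_gmm))
print('The Confusion matrix of EM:', confusion_matrix(y.Targets, y_gmm))
plt.figure(figsize=(14,7))
colormap = np.array(['red', 'lime', 'black'])
plt.subplot(2, 2, 3)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y_gmm], s=40)
plt.title('GMM Classification')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()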
Output:
The accuracy score of K-Mean: 0.24
The Confusion matrix of K-Mean: [[ 0 50 0]
[48 0 2]
[14 0 36]]
The accuracy score of EM: 0.0
The Confusion matrix of EM: [[ 0 50 0]
[ 5 0 45]
[50 0 0]]
Experiment-12:
AIM: Exploratory Data Analysis for Classification using Pandas or Matplotlib.
Theory:
Exploratory Data Analysis (EDA) is one of the first steps in any machine learning project. It is a
technique for analyzing data using visual techniques. With it, we can obtain a detailed statistical
summary of the data, deal with duplicate values and outliers, and spot trends or patterns present in
the dataset. Some of the well-known aspects of EDA are:
1. Getting a quick statistical summary of the dataset
2. Checking Missing Values
3. Checking Duplicates
4. Handling Correlations
5. Data Visualization
6. Handling Outliers
7. Handling Nans
8. Feature selection
In this experiment, students need to explore and demonstrate various built-in methods for EDA
available in Pandas and Matplotlib.
Sample Code:
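import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Reading the CSV file
df = pd.read_csv("Iris.csv")
# Statistical summary of the numeric columns
print(df.describe())
# Scatter plot of sepal length against sepal width, coloured by species
sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm',
                hue='Species', data=df)
plt.show()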
Output:
Experiment-13:
AIM: Write a Python program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using standard Heart Disease Data Set
Theory:
A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional
dependency, and each node corresponds to a unique random variable.
A Bayesian network consists of two major parts: a directed acyclic graph and a set of conditional
probability distributions.
The directed acyclic graph is a set of random variables represented by nodes.
The conditional probability distribution of a node (random variable) is defined for every
possible outcome of the preceding causal node(s).
For illustration, consider the following example. Suppose we attempt to turn on our computer, but the
computer does not start (observation/evidence). We would like to know which of the possible causes
of computer failure is more likely. In this simplified illustration, we assume only two possible causes
of this misfortune: electricity failure and computer malfunction. The corresponding directed acyclic
graph is depicted in the figure below.
The goal is to calculate the posterior conditional probability distribution of each of the possible
unobserved causes given the observed evidence, i.e., P[Cause | Evidence].
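The posterior for the computer example can be worked out by direct enumeration. The sketch below assumes hypothetical prior and conditional probabilities (the text gives no numbers, so every value in the three tables is an illustrative assumption) and applies Bayes' rule to obtain P[Cause | Evidence] for each of the two causes.

```python
from itertools import product

# Hypothetical priors (assumed for illustration only)
p_elec = {True: 0.1, False: 0.9}   # P(electricity failure)
p_malf = {True: 0.2, False: 0.8}   # P(computer malfunction)

# Hypothetical conditional probability table:
# P(computer fails | electricity failure, computer malfunction)
p_fail = {
    (False, False): 0.0,
    (False, True): 0.5,
    (True, False): 1.0,
    (True, True): 1.0,
}


def joint(e, m):
    """Joint probability P(E=e, M=m, computer fails)."""
    return p_elec[e] * p_malf[m] * p_fail[(e, m)]


# P(Evidence): marginalize the joint over both unobserved causes
p_evidence = sum(joint(e, m) for e, m in product([True, False], repeat=2))

# P(Cause | Evidence) = P(Cause, Evidence) / P(Evidence)
post_elec = sum(joint(True, m) for m in (True, False)) / p_evidence
post_malf = sum(joint(e, True) for e in (True, False)) / p_evidence

print(f"P(computer fails)                     = {p_evidence:.4f}")
print(f"P(electricity failure | fails)        = {post_elec:.4f}")
print(f"P(computer malfunction | fails)       = {post_malf:.4f}")
```

A library such as pgmpy performs the same kind of inference on larger networks, where enumerating every combination by hand is no longer practical.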
Students need to write a Python program to construct a Bayesian network for the Cleveland Heart
Disease Data Set using the pgmpy API and demonstrate the diagnosis of heart patients using the
constructed Bayesian network.
Sample Code:
import numpy as np
from urllib.request import urlopen
import urllib
import matplotlib.pyplot as plt # Visuals
import seaborn as sns
import sklearn as skl
import pandas as pd
# Note: the variable name says Cleveland, but the file name in this URL
# refers to the Hungarian subset of the UCI heart-disease data.
Cleveland_data_URL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data'
Output:
Theory:
The support vector machine (SVM) is a supervised learning algorithm that belongs in every machine
learning practitioner's toolkit. It is popular because it often achieves good accuracy with modest
computation. SVM can be used for both regression and classification tasks, but it is most widely
used for classification.
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional
space (N = the number of features) that distinctly classifies the data points. To separate the two
classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to
find the plane that has the maximum margin, i.e., the maximum distance to the nearest data points of
both classes. Maximizing the margin provides some robustness so that future data points can
be classified with more confidence.
Support vectors are the data points closest to the hyperplane; they determine its position and
orientation. Using these support vectors, we maximize the margin of the classifier.
Removing the support vectors would change the position of the hyperplane. These are the points that
help us build our SVM.
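The margin and support vectors described above can be inspected on a tiny example. The sketch below fits scikit-learn's SVC with a linear kernel to six hypothetical 2-D points (the data and the large C value, used here to approximate a hard margin, are assumptions for illustration) and reads off the hyperplane w·x + b = 0, the support vectors, and the margin width 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: two classes of three points each
X_toy = np.array([[1, 1], [2, 1], [1, 2],
                  [4, 4], [5, 4], [4, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C penalizes margin violations heavily (approximately hard margin)
clf = SVC(kernel="linear", C=1e6)
clf.fit(X_toy, y_toy)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
```

Only the points nearest the boundary appear in `support_vectors_`; moving any of the other points slightly would leave the fitted hyperplane unchanged.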
Students need to implement a support vector machine in Python for the breast cancer dataset and
demonstrate the use of SVM in predicting whether the cancer diagnosis is benign or malignant based
on several observations/features.
Sample Code:
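import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline
sns.set_style('whitegrid')
# Assumption: df is built from scikit-learn's bundled breast cancer dataset,
# which matches the (569, 30) feature shape reported in the output
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer.data, cancer.target],
                  columns=np.append(cancer.feature_names, 'target'))
X = df.drop('target', axis=1)
y = df.target
# Preprocessing pipeline: min-max scaling followed by standardization
pipeline = Pipeline([
    ('min_max_scaler', MinMaxScaler()),
    ('std_scaler', StandardScaler())
])
# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)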
Output:
'X' shape: (569, 30)
'y' shape: (569,)
Train Result:
================================================
CLASSIFICATION REPORT:
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    0.983471    0.891697  0.919598    0.937584      0.926054
recall       0.798658    0.991968  0.919598    0.895313      0.919598
f1-score     0.881481    0.939163  0.919598    0.910322      0.917569
support    149.000000  249.000000  0.919598  398.000000    398.000000
Confusion Matrix:
[[119  30]
 [  2 247]]
Test Result:
================================================
Accuracy Score: 95.32%
CLASSIFICATION REPORT:
                 0.0         1.0  accuracy   macro avg  weighted avg
precision   1.000000    0.931034  0.953216    0.965517      0.956443
recall      0.873016    1.000000  0.953216    0.936508      0.953216
f1-score    0.932203    0.964286  0.953216    0.948245      0.952466
support    63.000000  108.000000  0.953216  171.000000    171.000000
Confusion Matrix:
[[ 55   8]
 [  0 108]]
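The figures in the test classification report follow directly from the test confusion matrix. The sketch below recomputes them in plain Python, assuming the scikit-learn convention that rows are true classes and columns are predicted classes; the helper names are illustrative.

```python
# Test confusion matrix from the output above: rows = true, columns = predicted
cm = [[55, 8],
      [0, 108]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def precision(k):
    """Correct predictions of class k divided by all predictions of class k."""
    return cm[k][k] / sum(cm[r][k] for r in range(2))


def recall(k):
    """Correct predictions of class k divided by all true members of class k."""
    return cm[k][k] / sum(cm[k])


def f1(k):
    """Harmonic mean of precision and recall for class k."""
    p, r = precision(k), recall(k)
    return 2 * p * r / (p + r)


print(f"accuracy = {accuracy:.6f}")
for k in (0, 1):
    print(f"class {k}: precision = {precision(k):.6f}, "
          f"recall = {recall(k):.6f}, f1 = {f1(k):.6f}")
```

For example, class 0 recall is 55 / (55 + 8), which rounds to the 0.873016 shown in the report.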
Theory:
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation
to convert a set of observations of possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components. The number of principal components is less than
or equal to the smaller of the number of original variables and the number of observations. This
transformation is defined in such a way that the first principal component has the largest possible
variance (that is, accounts for as much of the variability in the data as possible), and each succeeding
component in turn has the highest variance possible under the constraint that it is orthogonal to
preceding components.
PCA algorithm:
Step 1: Calculate the mean of each variable
Step 2: Calculate the covariance matrix
Step 3: Compute the eigenvalues of the covariance matrix
Step 4: Compute the corresponding eigenvectors
Step 5: Compute the first principal components
Step 6: Interpret the geometrical meaning of the first principal components
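Steps 1 to 5 can be sketched with NumPy alone. The example below uses hypothetical, randomly generated 2-D data (the seed, sample size, and noise level are assumptions for illustration) and is not the scikit-learn PCA class used in the sample code; it only shows what each step computes.

```python
import numpy as np

# Hypothetical correlated 2-D data: the second feature is roughly 2x the first
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=200)])

# Step 1: mean of each variable, then center the data
mean = data.mean(axis=0)
centered = data - mean

# Step 2: covariance matrix (variables are in columns)
cov = np.cov(centered, rowvar=False)

# Steps 3 and 4: eigenvalues and eigenvectors of the symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort from largest to smallest eigenvalue (eigh returns ascending order)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project the centered data onto the first principal component
pc1 = centered @ eigvecs[:, 0]

# Fraction of the total variance captured by the first component
explained = eigvals[0] / eigvals.sum()
print("Eigenvalues:", eigvals)
print("First principal direction:", eigvecs[:, 0])
print("Explained variance ratio of PC1:", explained)
```

Geometrically (Step 6), the first eigenvector is the direction along which the centered data varies most, and the variance of the projected values equals the largest eigenvalue.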
Students need to implement the PCA algorithm in Python for the breast_cancer dataset, fit an SVM on
the transformed dataset, and print the confusion matrix, precision, and recall metrics.
Sample Code:
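from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Assumption: X and y come from scikit-learn's bundled breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
# Standardize the features (statistics are fitted on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Project onto the first three principal components
pca = PCA(n_components=3)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
# Standardize the principal-component scores again, as in the original listing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# best_params comes from a grid search that is not shown in this excerpt; the
# values below are the "Best params" reported in the sample output
best_params = {'C': 0.5, 'gamma': 0.1, 'kernel': 'rbf'}
svm_clf = SVC(**best_params)
svm_clf.fit(X_train, y_train)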
Sample Output:
Fitting 5 folds for each of 126 candidates, totalling 630 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best params: {'C': 0.5, 'gamma': 0.1, 'kernel': 'rbf'}
Train Result:
================================================
Accuracy Score: 95.98%
CLASSIFICATION REPORT:
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    0.992593    0.942966  0.959799    0.967779      0.961545
recall       0.899329    0.995984  0.959799    0.947656      0.959799
f1-score     0.943662    0.968750  0.959799    0.956206      0.959358
support    149.000000  249.000000  0.959799  398.000000    398.000000
Confusion Matrix:
[[134  15]
 [  1 248]]
Test Result:
================================================
Accuracy Score: 93.57%
Confusion Matrix:
[[ 55   8]
 [  3 105]]