R20 III-II ML Lab Manual
specific hypothesis based on a given set of training data samples. Read the training data
from a .CSV file.
FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   For each attribute constraint ai in h
      If the constraint ai is satisfied by x
      Then do nothing
      Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
Training Examples:
Program:
import csv
a = []
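The listing above is only a fragment (it imports csv and creates the empty list a). A minimal sketch of how the FIND-S steps above could be completed is shown below; the file name enjoysport.csv and the assumption that the last column holds the yes/no target label are illustrative, not taken from the source.

import csv

# hypothetical file name; the last column is assumed to hold the target (yes/no)
with open('enjoysport.csv', 'r') as f:
    a = [row for row in csv.reader(f)]

num_attributes = len(a[0]) - 1
hypothesis = ['0'] * num_attributes      # the most specific hypothesis

for row in a:
    if row[-1].lower() == 'yes':         # consider only positive training instances
        for i in range(num_attributes):
            if hypothesis[i] == '0':
                hypothesis[i] = row[i]   # first positive example initialises h
            elif hypothesis[i] != row[i]:
                hypothesis[i] = '?'      # generalise the violated constraint
print("Final specific hypothesis:", hypothesis)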
Output:
Experiment:2. For a given set of training data examples stored in a .CSV file,
implement and demonstrate the Candidate-Elimination algorithm to output a
description of the set of all hypotheses consistent with the training examples.
• If d is a negative example
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
• Remove g from G
• Add to G all minimal specializations h of g such that
• h is consistent with d, and some member of S is more specific than h
• Remove from G any hypothesis that is less general than another hypothesis in G
CANDIDATE-ELIMINATION algorithm using version spaces
Training Examples:
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
import numpy as np
import pandas as pd
data = pd.DataFrame(data=pd.read_csv('enjoysport.csv'))
concepts = np.array(data.iloc[:,0:-1])
print(concepts)
target = np.array(data.iloc[:,-1])
print(target)
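The listing above only loads the concepts and target arrays. One way the Candidate-Elimination step could be completed is sketched below; the function name learn() and the check for a 'yes' target label are assumptions chosen to match the data and output shown here, not code taken from the source.

def learn(concepts, target):
    # initialise S to the first (assumed positive) example and G to the most general hypotheses
    specific_h = concepts[0].copy()
    general_h = [['?' for _ in range(len(specific_h))] for _ in range(len(specific_h))]

    for i, h in enumerate(concepts):
        if target[i] == "yes":                       # positive example: generalise S
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    specific_h[x] = '?'
                    general_h[x][x] = '?'
        else:                                        # negative example: specialise G
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    general_h[x][x] = specific_h[x]
                else:
                    general_h[x][x] = '?'

    # drop hypotheses in G that remained fully general
    general_h = [h for h in general_h if h != ['?'] * len(specific_h)]
    return specific_h, general_h

s_final, g_final = learn(concepts, target)
print("Final Specific_h:", s_final)
print("Final General_h:", g_final)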
Final Specific_h:
['sunny' 'warm' '?' 'strong' '?' '?']
Final General_h:
[['sunny', '?', '?', '?', '?', '?'],
['?', 'warm', '?', '?', '?', '?']]
Experiment:3.Write a program to demonstrate the working of the decision tree
based ID3 algorithm. Use an appropriate data set for building the decision tree and
apply this knowledge to classify a new sample.
ID3 Algorithm
Examples are the training examples. Target_attribute is the attribute whose value is
to be predicted by the tree. Attributes is a list of other attributes that may be tested
by the learned decision tree. Returns a decision tree that correctly classifies the given
Examples.
Otherwise Begin
A ← the attribute from Attributes that best* classifies Examples
The decision attribute for Root ← A
For each possible value, vi, of A,
   Add a new tree branch below Root, corresponding to the test A = vi
   Let Examples_vi be the subset of Examples that have value vi for A
   If Examples_vi is empty
      Then below this new branch add a leaf node with label = the most common value of Target_attribute in Examples
      Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes – {A})
End
Return Root
INFORMATION GAIN:
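For reference, ID3 chooses the attribute with the highest information gain, defined in terms of entropy as:

Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)

where p_i is the proportion of examples in S belonging to class i and S_v is the subset of S having value v for attribute A.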
Training Dataset:
Test Dataset:
# excerpt from the entropy helper: S holds the class labels and attr the distinct label values
counts = [0, 0]
for i in range(2):
    counts[i] = sum([1 for x in S if attr[i] == x]) / (len(S) * 1.0)
sums = 0
for cnt in counts:
    sums += -1 * cnt * math.log(cnt, 2)
return sums

# excerpt from the gain computation
total_size = len(data)
entropies = [0] * len(attr)
ratio = [0] * len(attr)

# excerpt from build_tree: recurse into the subtable for each value of the splitting attribute
attr, dic = subtables(data, split, delete=True)
for x in range(len(attr)):
    child = build_tree(dic[attr[x]], fea)
    node.children.append((attr[x], child))
return node

def print_tree(node, level):
    if node.answer != "":
        print(" " * level, node.answer)
        return
    print(" " * level, node.attribute)
    for value, n in node.children:
        print(" " * (level + 1), value)
        print_tree(n, level + 2)

def classify(node, x_test, features):
    if node.answer != "":
        print(node.answer)
        return
    pos = features.index(node.attribute)
    for value, n in node.children:
        if x_test[pos] == value:
            classify(n, x_test, features)
a) Linear Regression:
Program:
# importing the dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('Salary_Data.csv')
dataset.head()
# data preprocessing
X = dataset.iloc[:, :-1].values  # independent variable array
y = dataset.iloc[:, 1].values    # dependent variable vector
# splitting the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
# fitting the regression model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)  # actually produces the linear equation for the data
# predicting the test set results
y_pred = regressor.predict(X_test)
y_pred
y_test
# visualizing the results
# plot for the TRAINING set
plt.scatter(X_train, y_train, color='red')  # plotting the observation points
plt.plot(X_train, regressor.predict(X_train), color='blue')  # plotting the regression line
plt.title("Salary vs Experience (Training set)")  # stating the title of the graph
plt.xlabel("Years of experience")  # adding the name of the x-axis
plt.ylabel("Salaries")  # adding the name of the y-axis
plt.show()  # specifies end of graph
Output:
b) Logistic regression:
It is a Machine Learning classification algorithm that is used to predict the probability of a
categorical dependent variable. In logistic regression, the dependent variable is a binary variable that
contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic
regression model predicts P(Y=1) as a function of X.
Logistic Regression – The Python Way
To do this, we shall first explore our dataset using Exploratory Data Analysis (EDA) and then
implement logistic regression and finally interpret the odds:
Program:
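The source does not include a program body for this part, so a minimal, self-contained sketch is given below. The breast cancer dataset, the 25% test split, and the feature scaling are illustrative choices, not taken from the source; the final lines show one way to "interpret the odds" via exponentiated coefficients.

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# illustrative binary-labelled dataset
data = load_breast_cancer(as_frame=True)
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# scale the features so the solver converges quickly
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

clf = LogisticRegression()
clf.fit(X_train_s, y_train)
y_pred = clf.predict(X_test_s)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# odds ratio per one standard deviation increase in each (scaled) feature
odds_ratios = pd.Series(np.exp(clf.coef_[0]), index=X.columns)
print(odds_ratios.sort_values(ascending=False).head())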
c) BINARY CLASSIFIER:
In machine learning, binary classification is a supervised learning task that categorizes new observations into one of two classes.
The following are a few binary classification applications, where the 0 and 1 columns are two
possible classes for each observation:
Example
In a medical diagnosis, a binary classifier for a specific disease could take a patient's symptoms as
input features and predict whether the patient is healthy or has the disease. The possible outcomes of
the diagnosis are positive and negative.
Evaluation of binary classifiers
If the model successfully predicts the patients as positive, this case is called True Positive (TP). If
the model successfully predicts patients as negative, this is called True Negative (TN). The binary
classifier may misdiagnose some patients as well. If a diseased patient is classified as healthy by a
negative test result, this error is called False Negative (FN). Similarly, If a healthy patient is
classified as diseased by a positive test result, this error is called False Positive(FP).
We can evaluate a binary classifier based on the following parameters:
True Positive (TP): The patient is diseased and the model predicts "diseased"
False Positive (FP): The patient is healthy but the model predicts "diseased"
True Negative (TN): The patient is healthy and the model predicts "healthy"
False Negative (FN): The patient is diseased and the model predicts "healthy"
After obtaining these values, we can compute the accuracy score of the binary classifier as follows:
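Accuracy = (TP + TN) / (TP + TN + FP + FN)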
The following is a confusion matrix, which represents the above parameters:
In machine learning, many methods utilize binary classification. The most common are:
1. Support Vector Machines
2. Naive Bayes
3. Nearest Neighbor
4. Decision Trees
5. Logistic Regression
6. Neural Networks
The following Python example will demonstrate using binary classification in a logistic regression
problem.
A Python example for binary classification
For our data, we will use the breast cancer dataset from scikit-learn. This dataset contains tumor
observations and corresponding labels for whether the tumor was malignant or benign.
First, we'll import a few libraries and then load the data. When loading the data, we'll specify as_frame=True so we can work with pandas objects.
Program:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer(as_frame=True)
dataset['data'].head()
dataset['target'].head()
dataset['target'].value_counts()
X = dataset['data']
y = dataset['target']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
ss_train = StandardScaler()
X_train = ss_train.fit_transform(X_train)
# scale the test set with the scaler fitted on the training set
X_test = ss_train.transform(X_test)
models = {}
from sklearn.metrics import confusion_matrix
# (run after a model has been fitted and `predictions` computed below)
cm = confusion_matrix(y_test, predictions)
TN, FP, FN, TP = confusion_matrix(y_test, predictions).ravel()
# Logistic Regression
from sklearn.linear_model import LogisticRegression
models['Logistic Regression'] = LogisticRegression()
# Decision Trees
from sklearn.tree import DecisionTreeClassifier
models['Decision Trees'] = DecisionTreeClassifier()
# Random Forest
from sklearn.ensemble import RandomForestClassifier
models['Random Forest'] = RandomForestClassifier()
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
models['Naive Bayes'] = GaussianNB()
# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
models['K-Nearest Neighbor'] = KNeighborsClassifier()
from sklearn.metrics import accuracy_score, precision_score, recall_score

accuracy, precision, recall = {}, {}, {}
for key in models:
    # Fit each classifier and make predictions on the test set
    models[key].fit(X_train, y_train)
    predictions = models[key].predict(X_test)
    # Calculate metrics (true labels first, predictions second)
    accuracy[key] = accuracy_score(y_test, predictions)
    precision[key] = precision_score(y_test, predictions)
    recall[key] = recall_score(y_test, predictions)
import pandas as pd
df_model = pd.DataFrame(index=models.keys(), columns=['Accuracy', 'Precision', 'Recall'])
df_model['Accuracy'] = accuracy.values()
df_model['Precision'] = precision.values()
df_model['Recall'] = recall.values()
df_model
ax = df_model.plot.barh()
ax.legend(
ncol=len(models.keys()),
bbox_to_anchor=(0, 1),
loc='lower left',
prop={'size': 14}
)
import matplotlib.pyplot as plt
plt.tight_layout()
Output:
Experiment-5: Develop a program for Bias, Variance, Remove Duplicates, and Cross Validation
a) Bias-Variance Trade-off
The bias and the variance of a model’s performance are connected.
Ideally, we would prefer a model with low bias and low variance, although in practice this is
very challenging. In fact, this could be described as the goal of applied machine learning for a
given predictive modeling problem.
Reducing the bias can easily be achieved by increasing the variance. Conversely, reducing the
variance can easily be achieved by increasing the bias.
This relationship is generally referred to as the bias-variance trade-off. It is a conceptual
framework for thinking about how to choose models and model configuration.
We can choose a model based on its bias or variance. Simple models, such as linear regression
and logistic regression, generally have a high bias and a low variance. Complex models, such as
random forest, generally have a low bias but a high variance.
We may also choose model configurations based on their effect on the bias and variance of the
model. The k hyperparameter in k-nearest neighbors controls the bias-variance trade-off. Small
values, such as k=1, result in a low bias and a high variance, whereas large k values, such as
k=21, result in a high bias and a low variance.
High bias is not always bad, nor is high variance, but they can lead to poor results.
We often must test a suite of different models and model configurations in order to discover what
works best for a given dataset. A model with a large bias may be too rigid and underfit the
problem. Conversely, a model with a large variance may be too flexible and overfit the problem.
We may decide to increase the bias or the variance as long as it decreases the overall estimate of
model error.
Calculate the Bias and Variance
In a real-life situation in which f is unobserved, it is generally not possible to explicitly compute
the test MSE, bias, or variance for a statistical learning method. Nevertheless, one should always
keep the bias-variance trade-off in mind.
Even though the bias-variance trade-off is a conceptual tool, we can estimate it in some cases.
The mlxtend library by Sebastian Raschka provides the bias_variance_decomp() function that
can estimate the bias and variance for a model over multiple bootstrap samples.
First, you must install the mlxtend library; for example:
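pip install mlxtend

A sketch of such an example is given below. The California housing data, the train/test split, and the number of bootstrap rounds are illustrative choices (the source does not show which dataset was used); bias_variance_decomp() is called with the mean-squared-error loss.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.evaluate import bias_variance_decomp

# load a regression dataset (illustrative choice)
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

model = LinearRegression()
# decompose the expected loss into bias and variance over repeated bootstrap samples
mse, bias, var = bias_variance_decomp(model, X_train, y_train, X_test, y_test,
                                      loss='mse', num_rounds=200, random_seed=1)
print('MSE: %.3f' % mse)
print('Bias: %.3f' % bias)
print('Variance: %.3f' % var)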
Running the example reports the estimated error as well as the estimated bias and variance for the
model error.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure,
or differences in numerical precision. Consider running the example a few times and compare the
average outcome.
In this case, we can see that the model has a high bias and a low variance. This is to be expected
given that we are using a linear regression model. We can also see that the sum of the estimated
bias and variance equals the estimated error of the model, e.g. 20.726 + 1.761 = 22.487.
b) Remove Duplicates
In this part, we will learn different methods of removing duplicates from a list in Python.
Basic Approach
In the first method, we will discuss the basic approach of removing duplicates from the list using Python.
In Python, there are numerous methods for removing duplicates from a list. To remove
duplicates from a given list, you can make use of these methods and get your work done.
Let's take a look at them:
To remove duplicates from a list in Python, iterate through the elements of the list and
store the first occurrence of an element in a temporary list while ignoring any other
occurrences of that element.
sam_list = [11, 13, 15, 16, 13, 15, 16, 11]
print("The list is: " + str(sam_list))
# remove duplicates while keeping the first occurrence of each element
result = []
for i in sam_list:
    if i not in result:
        result.append(i)
print("The list after removing duplicates: " + str(result))
Output:
The list is: [11, 13, 15, 16, 13, 15, 16, 11]
Syntax of df.drop_duplicates()
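df.drop_duplicates(subset=None, keep='first', inplace=False)
(these are the commonly used parameters; newer pandas versions accept a few additional ones such as ignore_index)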
Parameters:
subset: takes a column label or a list of column labels. Its default value is None. If columns are passed, only those columns are considered when identifying duplicates.
keep: controls how duplicate values are treated. It has three possible values, and the default is 'first'.
    If 'first', it treats the first occurrence as unique and the remaining identical values as duplicates.
    If 'last', it treats the last occurrence as unique and the remaining identical values as duplicates.
    If False, it treats all identical values as duplicates.
inplace: Boolean value; if True, rows with duplicates are removed from the DataFrame in place.
In the following example, rows having the same First Name are removed and a new data frame is returned.
Program:
data = pd.read_csv("C:\\Users\\sys-
08\\Desktop\\Employee.csv")
# displaying data
Output:
As shown in the image, the rows with the same names were removed from a data frame.
c) Cross Validation
Cross-validation is a technique for validating the model efficiency by training it on the subset of
input data and testing on previously unseen subset of the input data. We can also say that it is a
technique to check how a statistical model generalizes to an independent dataset.
In machine learning, there is always the need to test the stability of the model, and we cannot judge
that based only on the training dataset. For this purpose, we reserve a particular sample of the dataset
which was not part of the training dataset, and we test our model on that sample before deployment.
This complete process comes under cross-validation, and it is different from the general train-test split.
2. K-Fold Cross-Validation
In K-Fold cross-validation, the whole dataset is partitioned into K parts of equal size. Each partition
is called a "fold"; since there are K parts, we call it K-Folds. One fold is used as the validation set
and the remaining K-1 folds are used as the training set.
The technique is repeated K times, so that each fold is used once as the validation set while the
remaining folds form the training set.
The final accuracy of the model is computed by taking the mean of the accuracies obtained on the K
validation folds.
Pros:
1. The whole dataset is used as both a training set and a validation set.
Cons:
1. Not to be used for imbalanced datasets: As discussed for HoldOut cross-validation, in K-Fold
validation too it may happen that the training folds contain no samples of class "1" and only samples
of class "0", while the validation fold contains the samples of class "1".
2. Not suitable for Time Series data: For time series data the order of the samples matters, but in
K-Fold cross-validation the samples are selected in random order.
Program:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score,KFold
from sklearn.linear_model import LogisticRegression
iris = load_iris()
X = iris.data
Y = iris.target
logreg = LogisticRegression()
kf = KFold(n_splits=5)
score = cross_val_score(logreg, X, Y, cv=kf)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))
Output:
Experiment-6: Write a program to implement Categorical Encoding, One-hot
Encoding
Most of the existing machine learning algorithms cannot be executed on categorical data. Instead,
the categorical data needs to first be converted to numerical data. One-hot encoding is one of the
techniques used to perform this conversion. This method is mostly used when deep learning
techniques are to be applied to sequential classification problems.
Have a look at the example below which manually converts the categorical list of colors to a
numerical list using one-hot encoding:
Program:
import numpy as np

# list of colors: `colors` holds the observed values and `total_colors` the distinct categories
# (the actual colour values are not shown in the source listing)
mapping = {}
for x in range(len(total_colors)):
    mapping[total_colors[x]] = x

one_hot_encode = []
for c in colors:
    arr = list(np.zeros(len(total_colors), dtype=int))
    arr[mapping[c]] = 1
    one_hot_encode.append(arr)
print(one_hot_encode)
OUTPUT:
[[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
One-hot encoding using scikit-learn
Take a look at the example below. It uses the scikit-learn library to perform one-hot encoding:
Program:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(colors)
print(integer_encoded)

integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = OneHotEncoder().fit_transform(integer_encoded).toarray()
print(onehot_encoded)
OUTPUT:
[2 1 3 2 0]
[[0. 0. 1. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]]
Experiment-7: Build an Artificial Neural Network by implementing the Back
propagation algorithm and test the same using appropriate data sets.
Program:
import numpy as np
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X/np.amax(X, axis=0)   # maximum of X array longitudinally (column-wise normalisation)
y = y/100

# Sigmoid function
def sigmoid(x):
    return 1/(1 + np.exp(-x))

# Derivative of the sigmoid function
def derivatives_sigmoid(x):
    return x * (1 - x)

# Variable initialization
epoch = 7000                # setting training iterations
lr = 0.1                    # setting learning rate
inputlayer_neurons = 2      # number of features in the data set
hiddenlayer_neurons = 3     # number of hidden layer neurons
output_neurons = 1          # number of neurons at the output layer

# Weight and bias initialization: draws numbers uniformly at random with the given dimensions
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

for i in range(epoch):
    # Forward propagation
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)

    # Backpropagation
    EO = y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)   # how much the hidden layer weights contributed to the error
    d_hiddenlayer = EH * hiddengrad

    # Weight updates: dot product of next-layer error and current-layer output
    wout += hlayer_act.T.dot(d_output) * lr
    bout += np.sum(d_output, axis=0, keepdims=True) * lr
    wh += X.T.dot(d_hiddenlayer) * lr
    # bh += np.sum(d_hiddenlayer, axis=0, keepdims=True) * lr

print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)
Output:
Input:
[[ 0.66666667 1. ]
[ 0.33333333 0.55555556]
[ 1. 0.66666667]]
Distance Metrics
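For reference, the Euclidean distance between a test point x and a training point y with n features is

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}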
K-Nearest-Neighbour Algorithm:
1. Load the data
2. Initialize the value of k
3. For getting the predicted class, iterate from 1 to the total number of training data points:
   1. Calculate the distance between the test data and each row of training data. Here we will use
      Euclidean distance as our distance metric since it is the most popular method. Other metrics
      that can be used are Chebyshev, cosine, etc.
   2. Sort the calculated distances in ascending order based on distance values.
   3. Get the top k rows from the sorted array.
   4. Get the most frequent class of these rows, i.e. get the labels of the selected K entries.
   5. Return the predicted class:
      If regression, return the mean of the K labels.
      If classification, return the mode of the K labels.
Confusion matrix:
Note,
• Class 1 : Positive
• Class 2 : Negative
• Positive (P) : Observation is positive (for example: is an apple).
• Negative (N) : Observation is not positive (for example: is not an apple).
• True Positive (TP) : Observation is positive, and is predicted to be positive.
• False Negative (FN) : Observation is positive, but is predicted negative. (Also known as a "Type II error.")
• True Negative (TN) : Observation is negative, and is predicted to be negative.
• False Positive (FP) : Observation is negative, but is predicted positive. (Also known as a "Type I error.")
Example :
Accuracy: Overall, how often is the classifier correct?
(TP+TN)/total = (100+50)/165 = 0.91
Misclassification Rate: Overall, how often is it wrong?
(FP+FN)/total = (10+5)/165 = 0.09; equivalent to 1 minus Accuracy; also known as "Error Rate"
True Positive Rate: When it's actually yes, how often does it predict yes?
TP/actual yes = 100/105 = 0.95; also known as "Sensitivity" or "Recall"
False Positive Rate: When it's actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17
True Negative Rate: When it's actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83; equivalent to 1 minus False Positive Rate; also known as "Specificity"
Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91
Prevalence: How often does the yes condition actually occur in our sample?
actual yes/total = 105/165 = 0.64
Program:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

dataset = pd.read_csv("iris.csv")
# feature columns and class label column (the label is assumed to be the last column of iris.csv)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25)
classifier = KNeighborsClassifier(n_neighbors=8, p=3, metric='euclidean')
classifier.fit(X_train, y_train)
# predict the test results
y_pred = classifier.predict(X_test)
Output :
Confusion matrix is as follows:
[[13  0  0]
 [ 0 15  1]
 [ 0  0  9]]
Accuracy Metrics
precision recall f1-score support
6. Prediction = x0*β
Program:
from numpy import *
import operator
from os import listdir
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy.linalg
from scipy.stats.stats import pearsonr

def kernel(point, xmat, k):
    m, n = shape(xmat)
    weights = mat(eye((m)))
    for j in range(m):
        diff = point - X[j]
        weights[j, j] = exp(diff * diff.T / (-2.0 * k**2))
    return weights

def localWeight(point, xmat, ymat, k):
    wei = kernel(point, xmat, k)
    W = (X.T * (wei * X)).I * (X.T * (wei * ymat.T))
    return W

def localWeightRegression(xmat, ymat, k):
    m, n = shape(xmat)
    ypred = zeros(m)
    for i in range(m):
        ypred[i] = xmat[i] * localWeight(xmat[i], xmat, ymat, k)
    return ypred

# Note: the lines that read Tips.csv into bill/tip and build xmat, ymat, xsort and SortIndex
# are not shown in the source listing.
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(bill, tip, color='green')
ax.plot(xsort[:, 1], ypred[SortIndex], color='red', linewidth=5)
plt.xlabel('Total bill')
plt.ylabel('Tip')
plt.show()
Output:
Dataset: Tips.csv (256 rows)
Experiment:10. Assuming a set of documents that need to be classified, use the naïve
Bayesian Classifier model to perform this task. Built-in Java classes/API can be used to
write the program. Calculate the accuracy, precision, and recall for your data set.
LEARN_NAIVE_BAYES_TEXT (Examples, V)
Examples is a set of text documents along with their target values. V is the set of all possible target
values. This function learns the probability terms P(wk | vj), describing the probability that a
randomly drawn word from a document in class vj will be the English word wk. It also learns the
class prior probabilities P(vj).
1. Collect all words, punctuation, and other tokens that occur in Examples:
Vocabulary ← the set of all distinct words and other tokens occurring in any text document from Examples
CLASSIFY_NAIVE_BAYES_TEXT (Doc)
Return the estimated target value for the document Doc. ai denotes the word found in the ith
position within Doc.
positions ← all word positions in Doc that contain tokens found in Vocabulary
Return VNB, where
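v_{NB} = \arg\max_{v_j \in V} \; P(v_j) \prod_{i \in positions} P(a_i \mid v_j)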
Data set:
import pandas as pd
msg=pd.read_csv('naivetext.csv',names=['message','label'])
msg['labelnum']=msg.label.map({'pos':1,'neg':0})
X=msg.message
y=msg.labelnum
print(X)
print(y)
df = pd.DataFrame(xtrain_dtm.toarray(), columns=count_vect.get_feature_names())
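The split and vectorisation steps that the df = pd.DataFrame(...) line above relies on are not shown in the listing; a sketch of those missing steps together with the classifier and the metrics asked for in the experiment is given below. The variable names are chosen to match the fragment above, and the default train/test split is an assumption.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# split the documents and labels, then build the document-term matrices
xtrain, xtest, ytrain, ytest = train_test_split(X, y)
count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain)
xtest_dtm = count_vect.transform(xtest)

# fit a multinomial naive Bayes classifier and predict the test documents
clf = MultinomialNB().fit(xtrain_dtm, ytrain)
predicted = clf.predict(xtest_dtm)

print('Accuracy:', metrics.accuracy_score(ytest, predicted))
print('Precision:', metrics.precision_score(ytest, predicted))
print('Recall:', metrics.recall_score(ytest, predicted))
print('Confusion matrix:\n', metrics.confusion_matrix(ytest, predicted))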
Output:
Confusion Matrix
True positives: data points labelled as positive that are actually positive
False positives: data points labelled as positive that are actually negative
True negatives: data points labelled as negative that are actually negative
False negatives: data points labelled as negative that are actually positive
Example:
Unique word
< I, loved, the, movie, hated, a, great, good, poor, acting>
Doc I loved the movie hated a great good poor acting Class
1 1 1 1 1 +
2 1 1 1 1 -
3 2 1 1 1 +
4 1 1 -
5 1 1 1 1 +
Doc I loved the movie hated a great good poor acting Class
1 1 1 1 1 +
3 2 1 1 1 +
5 1 1 1 1 +
P(+) = 3/5 = 0.6

P(I | +) = (1+1)/(14+10) = 0.0833        P(a | +) = (1+1)/(14+10) = 0.0833
P(loved | +) = (1+1)/(14+10) = 0.0833    P(great | +) = (2+1)/(14+10) = 0.125
P(the | +) = (1+1)/(14+10) = 0.0833      P(good | +) = (2+1)/(14+10) = 0.125
P(movie | +) = (4+1)/(14+10) = 0.2083    P(poor | +) = (0+1)/(14+10) = 0.0416
P(hated | +) = (0+1)/(14+10) = 0.0416    P(acting | +) = (1+1)/(14+10) = 0.0833
Doc I loved the movie hated a great good poor acting Class
2 1 1 1 1 -
4 1 1 -
P(−) = 2/5 = 0.4

P(I | −) = (1+1)/(6+10) = 0.125        P(a | −) = (0+1)/(6+10) = 0.0625
P(loved | −) = (0+1)/(6+10) = 0.0625   P(great | −) = (0+1)/(6+10) = 0.0625
P(the | −) = (1+1)/(6+10) = 0.125      P(good | −) = (0+1)/(6+10) = 0.0625
P(movie | −) = (1+1)/(6+10) = 0.125    P(poor | −) = (1+1)/(6+10) = 0.125
P(hated | −) = (1+1)/(6+10) = 0.125    P(acting | −) = (1+1)/(6+10) = 0.125
Let’s classify the new document
The EM (Expectation-Maximization) algorithm is a latent variable method for finding local maximum
likelihood parameters of a statistical model; it was proposed by Arthur Dempster, Nan Laird, and
Donald Rubin in 1977. It is one of the most commonly used techniques in machine learning for
obtaining maximum likelihood estimates of variables that are sometimes observable and sometimes not,
and it is also applicable to unobserved data, often called latent variables. It has various real-world
applications in statistics, including obtaining the mode of the posterior marginal distribution of
parameters in machine learning and data mining applications.
Key Points:
It is known as the latent variable model to determine MLE and MAP parameters for latent variables.
It is used to predict values of parameters in instances where data is missing or unobservable for
learning, and this is done until convergence of the values occurs.
EM Algorithm
The EM algorithm underlies several unsupervised ML algorithms, such as the k-means clustering
algorithm. Being an iterative approach, it alternates between two modes. In the first mode, we
estimate the missing or latent variables; hence it is referred to as the expectation/estimation step
(E-step). The other mode is used to optimize the parameters of the model so that it can explain
the data more clearly; it is known as the maximization step (M-step).
Expectation step (E - step): It involves the estimation (guess) of all missing values in the
dataset so that after completing this step, there should not be any missing value.
Maximization step (M - step): This step involves the use of estimated data in the E-step and
updating the parameters.
Repeat E-step and M-step until the convergence of the values occurs.
The primary goal of the EM algorithm is to use the available observed data of the dataset to
estimate the missing data of the latent variables and then use that data to update the values of
the parameters in the M-step.
What is Convergence in the EM algorithm?
Convergence here has the intuitive probabilistic meaning: if two random variables have a very small
difference in their probability, they are said to have converged. In other words, whenever the values
of the given variables stop changing from one iteration to the next, we call it convergence.
Steps in EM Algorithm
The EM algorithm is completed mainly in 4 steps, which include Initialization Step, Expectation
Step, Maximization Step, and convergence Step. These steps are explained as follows:
1st Step: The very first step is to initialize the parameter values. Further, the system is
provided with incomplete observed data with the assumption that data is obtained from a
specific model.
2nd Step: This step is known as Expectation or E-Step, which is used to estimate or guess
the values of the missing or incomplete data using the observed data. Further, E-step
primarily updates the variables.
3rd Step: This step is known as Maximization or M-step, where we use complete data
obtained from the 2nd step to update the parameter values. Further, M-step primarily
updates the hypothesis.
4th step: The last step is to check if the values of latent variables are converging or not. If it
gets "yes", then stop the process; else, repeat the process from step 2 until the convergence
occurs.
PROGRAM:
from sklearn import datasets
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
X.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']
y = pd.DataFrame(iris.target)
y.columns = ['Targets']
model = KMeans(n_clusters=3)
model.fit(X)
plt.figure(figsize=(14,7))
colormap = np.array(['red', 'lime', 'black'])
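The listing above sets up K-Means clustering only. Since this experiment concerns EM, a Gaussian mixture model, which scikit-learn fits with the EM algorithm, could be added for comparison; the scaling step and the plotted columns below reuse the names defined above, while the plot details are an assumption.

from sklearn.mixture import GaussianMixture
from sklearn import preprocessing

# standardise the features before fitting the mixture model
scaler = preprocessing.StandardScaler()
xs = scaler.fit_transform(X)

# GaussianMixture parameters are estimated with the EM algorithm
gmm = GaussianMixture(n_components=3)
gmm.fit(xs)
gmm_y = gmm.predict(xs)

plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[gmm_y], s=40)
plt.title('GMM (EM) clustering')
plt.show()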
EDA is applied to investigate the data and summarize the key insights. It gives you a basic understanding of your data: its distribution, null values, and much more. You can explore the data either using graphs or through some Python functions.
There are two types of analysis, univariate and bivariate. In univariate analysis, you analyze a single attribute; in bivariate analysis, you analyze an attribute against the target attribute.
In the non-graphical approach, you use functions such as shape, summary, describe, isnull, info, datatypes and more. In the graphical approach, you use plots such as scatter, box, bar, density and correlation plots.
Program:
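The listing below operates on a DataFrame named df; judging by the column name od280/od315_of_diluted_wines used further down, the wine dataset from scikit-learn is being analysed, so a plausible setup (an assumption, not shown in the source) is:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

# load the wine data together with the target column into a single DataFrame
wine = load_wine(as_frame=True)
df = wine.frame
df.head()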
df[['alcohol']].boxplot()
df.corr()
#Correlation plot
sns.heatmap(df.corr())
df.rename(columns={"od280/od315_of_diluted_wines": "protein_concentration"}, inplace=True)
df.target.value_counts()
df.target.value_counts(normalize=True)
df.target.value_counts().plot(kind="bar")
plt.title("Value counts of the target variable")
plt.xlabel("Wine type")
plt.xticks(rotation=0)
plt.ylabel("Count")
plt.show()
df.magnesium.describe()
print(f"Skewness: {df['magnesium'].skew()}")
print(f"Kurtosis: {df['magnesium'].kurt()}")
sns.pairplot(df)
Output:
A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional
dependency, and each node corresponds to a unique random variable.
A Bayesian network consists of two major parts: a directed acyclic graph and a set of conditional
probability distributions.
The directed acyclic graph is a set of random variables represented by nodes.
The conditional probability distribution of a node (random variable) is defined for every
possible outcome of the preceding causal node(s).
For illustration, consider the following example. Suppose we attempt to turn on our computer, but
the computer does not start (observation/evidence). We would like to know which of the possible
causes of computer failure is more likely. In this simplified illustration, we assume only two
possible causes of this misfortune: electricity failure and computer malfunction.
The corresponding directed acyclic graph is depicted in below figure.
Fig: Directed acyclic graph representing two independent possible causes of a computer failure.
The goal is to calculate the posterior conditional probability distribution of each of the possible
unobserved causes given the observed evidence, i.e. P [Cause | Evidence].
Data Set:
Title: Heart Disease Databases
The Cleveland database contains 76 attributes, but all published experiments refer to using a subset
of 14 of them. In particular, the Cleveland database is the only one that has been used by ML
researchers to date. The "Heartdisease" field refers to the presence of heart disease in the
patient. It is integer valued from 0 (no presence) to 4.
Database: 0 1 2 3 4 Total
Cleveland: 164 55 36 35 13 303
Attribute Information:
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain
Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholestoral in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
Value 0: normal
Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak = ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
Value 1: upsloping
Value 2: flat
Value 3: downsloping
12. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
13. Heartdisease: It is integer valued from 0 (no presence) to 4.
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal Heartdisease
63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
62 0 4 140 268 0 2 160 0 3.6 3 2 3 3
60 1 4 130 206 0 2 132 1 2.4 2 2 7 4
Program:
import numpy as np
import pandas as pd
import csv
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.models import BayesianModel
from pgmpy.inference import VariableElimination
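The listing above contains only the imports. A sketch of how the model could be defined, fitted, and queried is given below; the CSV file name, the chosen network structure, and the evidence value are illustrative assumptions, not taken from the source.

# read the Cleveland heart-disease data (file name assumed)
heartDisease = pd.read_csv('heart.csv')
heartDisease = heartDisease.replace('?', np.nan)

# an illustrative network structure over a few of the attributes
model = BayesianModel([('age', 'Heartdisease'), ('sex', 'Heartdisease'),
                       ('exang', 'Heartdisease'), ('cp', 'Heartdisease'),
                       ('Heartdisease', 'restecg'), ('Heartdisease', 'chol')])

# learn the conditional probability tables by maximum likelihood estimation
model.fit(heartDisease, estimator=MaximumLikelihoodEstimator)

# posterior P(Heartdisease | evidence) by variable elimination
infer = VariableElimination(model)
q = infer.query(variables=['Heartdisease'], evidence={'restecg': 1})
print(q)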
Output:
Experiment-14: Write a program to Implement Support Vector Machines
Another simple approach that any machine learning expert should know about is the support vector
machine. Many people prefer the support vector machine because it produces great accuracy while using
less computing power. SVM (Support Vector Machine) can be used for both regression and classification.
However, it is widely applied to classification problems.
There are numerous hyper-planes from which to choose to split the two kinds of data points. Our goal is to
discover a plane with the greatest margin, or the greatest distance between data points from both classes.
Maximizing the margin distance adds some reinforcement, making it easier to classify future data points.
Hyper-planes are decision-making boundaries that help in data classification. Different classes can be
assigned to data points on either side of the hyperplane. The hyperplane’s dimension is also determined by
the number of features. If there are only two input characteristics, the hyperplane is simply a line. The
hyperplane becomes a two-dimensional plane when the number of input features reaches three. When the
number of features exceeds three, it becomes impossible to imagine.
Support vectors are data points that are closer to the hyperplane and have an influence on the hyperplane’s
position and orientation. We increase the classifier’s margin by using these support vectors. The
hyperplane’s position will be altered if the support vectors are deleted. These are the points that will assist
us in constructing our SVM.
Program:
import pandas as pd
import numpy as np
datasets = pd.read_csv('C:\\Users\\Reddy\\Downloads\\Social_Network_Ads.csv')
X = datasets.iloc[:, [2, 3]].values
Y = datasets.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_Train = sc_X.fit_transform(X_Train)
X_Test = sc_X.transform(X_Test)

# Fitting the classifier into the Training set
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_Train, Y_Train)

# Visualising the classes (the code that builds X_Set and Y_Set from one of the splits is not shown in the source)
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
for i, j in enumerate(np.unique(Y_Set)):
    plt.scatter(X_Set[Y_Set == j, 0], X_Set[Y_Set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
OUTPUT:
Experiment-15: Write a program to Implement Principal Component Analysis
Principal Component Analysis (PCA): is an algebraic technique for converting a set of
observations of possibly correlated variables into a set of values of linearly uncorrelated
variables.
All principal components are chosen to describe most of the available variance in the
variables, and all principal components are orthogonal to each other. Among all the
principal components, the first principal component always has the maximum variance.
o PCA can be used for finding interrelations between various variables in the data.
o PCA can be used for interpreting and visualizing the data sets.
o PCA can also be used for visualizing genetic distance and connection between
populations.
o PCA also makes analysis simple with the decrease in the number of variables.
Principal component analysis is usually executed on a square symmetric matrix, which can be a
pure sums-of-squares-and-cross-products matrix, a correlation matrix, or a covariance matrix.
The correlation matrix is used if there is a major difference in the individual variances.
o PCA is an unsupervised (non-dependent) method that can be used for reducing the attribute
space from a larger number of variables to a smaller number of factors.
o It is a dimension-reducing technique, but with no assurance that the resulting dimensions
will be interpretable.
o In PCA, the main job is selecting a subset of variables from a larger set, depending on which
original variables have the highest correlation with the principal component.
Principal Axis Method: Principal Component Analysis searches for the linear combination of
the variables that extracts the maximum variance from the variables. Once PCA has extracted it,
it moves on to another linear combination which explains the maximum proportion of the remaining
variance, which leads to orthogonal factors of the sets. This method is used for analysing the
total variance in the variables of the set.
Eigen Vector: It is a nonzero vector that remains parallel (up to scaling) after being multiplied
by the matrix. Suppose 'V' is an eigenvector of dimension R of a matrix K with dimension R * R,
so that KV and V are parallel. Then the user has to solve KV = PV, where both V and P are unknown,
to obtain the eigenvector and eigenvalue.
Eigen Value: Eigenvalues are also known as "characteristic roots" in PCA. An eigenvalue measures
the variance in all the variables of the set that is accounted for by that factor. The proportion
of the eigenvalue is the ratio of the descriptive importance of the factor with respect to the
variables. If the factor is low, then it contributes less to the description of the variables.
PROGRAM:
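The listing below assumes that X, y, pca and irisdata have already been defined. A plausible setup that matches the printed output (loading the iris data and fitting PCA on all four features) is sketched here as an assumption:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# load the iris data and fit PCA on all four features
irisdata = load_iris()
X, y = irisdata['data'], irisdata['target']

pca = PCA().fit(X)
print("Principal components:")
print(pca.components_)
print("Explained variance ratios:")
print(pca.explained_variance_ratio_)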
# Remove PC1
Xmean = X - X.mean(axis=0)
value = Xmean @ pca.components_[0]
pc1 = value.reshape(-1,1) @ pca.components_[0].reshape(1,-1)
Xremove = X - pc1
plt.figure(figsize=(8,6))
plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
plt.xlabel(irisdata["feature_names"][0])
plt.ylabel(irisdata["feature_names"][1])
plt.title("Two features from the iris dataset after removing PC1")
plt.show()
# Remove PC2
Xmean = X - X.mean(axis=0)
value = Xmean @ pca.components_[1]
pc2 = value.reshape(-1,1) @ pca.components_[1].reshape(1,-1)
Xremove = Xremove - pc2
plt.figure(figsize=(8,6))
plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
plt.xlabel(irisdata["feature_names"][0])
plt.ylabel(irisdata["feature_names"][1])
plt.title("Two features from the iris dataset after removing PC1
and PC2")
plt.show()
# Remove PC3
Xmean = X - X.mean(axis=0)
value = Xmean @ pca.components_[2]
pc3 = value.reshape(-1,1) @ pca.components_[2].reshape(1,-1)
Xremove = Xremove - pc3
plt.figure(figsize=(8,6))
plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
plt.xlabel(irisdata["feature_names"][0])
plt.ylabel(irisdata["feature_names"][1])
plt.title("Two features from the iris dataset after removing PC1
to PC3")
plt.show()
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
OUTPUT:
Principal components:
[[ 0.36138659 -0.08452251 0.85667061 0.3582892 ]
[ 0.65658877 0.73016143 -0.17337266 -0.07548102]
[-0.58202985 0.59791083 0.07623608 0.54583143]
[-0.31548719 0.3197231 0.47983899 -0.75365743]]
Explained variance ratios:
[0.92461872 0.05306648 0.01710261 0.00521218]
Using all features, accuracy: 0.96
Using all features, F1: 0.9595238095238096
Using PC1, accuracy: 0.94
Using PC1, F1: 0.9398762157382846