
Maharishi Arvind Institute Of Engineering And Technology, Jaipur

Department of Computer Science & Engineering

Laboratory Manual

Lab Name: - 6CS4-22: Machine Learning Lab

For VI Semester
Experiment No.1
Implement and demonstrate the FIND-S algorithm for finding the
most specific hypothesis based on a given set of training data
samples. Read the training data from a .CSV file.

Ans:- Find-S Algorithm

• Finds the most specific hypothesis that fits the training data.
• Considers only positive examples.
Terms Used
• Concept learning: the task of learning a concept from training data (the machine learns from the train data).
• General Hypothesis: does not specify any feature values.
  G = {'?', '?', '?', '?', ...} - one '?' per attribute
• Specific Hypothesis: specifies feature values (specific features).
  S = {'pi', 'pi', 'pi', ...} - the number of pi depends on the number of attributes
Algorithm Concept
Step 1: Load the data set.
Step 2: Initialize the most specific hypothesis.
Step 3: For each positive example:
    if attribute_value == hypothesis_value:
        do nothing
    else:
        replace the attribute value with '?' (i.e. generalize it)
Step 4: Output the hypothesis h.
Implementation of Find-S Algorithm
Example
#Importing Important libraries
import pandas as pd
import numpy as np
#Reading and Printing Data-set
data = pd.read_csv("find_s.csv")
print(data)
Output

#Defining Concepts (features)


concept=np.array(data)[:,:-1]
print(concept)
#Defining the target (our positive and negative example labels)
target=np.array(data)[:,-1]
print(target)
Output

Note:- Defining Target and Concepts (Features) to train the model


#Defining Model (Find-S algorithm concept)
def func(concept, target):
    #Initialize the specific hypothesis with the first positive example
    for i, val in enumerate(target):
        if val == "Yes":
            specific_hypothesis = concept[i].copy()
            print(specific_hypothesis)
            break
    count = 0
    for i, val in enumerate(concept):
        count = count + 1
        print("concept=", concept[i])
        print("count=", count)
        if target[i] == "Yes":
            print("target[i]=", target[i])
            #print(len(specific_hypothesis))
            for x in range(len(specific_hypothesis)):
                print("x=", x)
                if val[x] != specific_hypothesis[x]:
                    #print("val[x]=", val[x])
                    specific_hypothesis[x] = '?'
                    print("specific_hypothesis[x]=", specific_hypothesis)
        else:
            pass
    print(specific_hypothesis)
    return specific_hypothesis
Note:
1. enumerate(iterable, start=0) returns each value together with its index.
2. count is used here only to show which iteration of the loop is running, to avoid confusion.
3. The first for loop initializes the specific hypothesis from the first positive ("Yes") example; negative examples are ignored.
4. In the second loop, if an attribute value of a positive example matches the current hypothesis value, nothing is changed.
5. If they do not match, the attribute is generalized, i.e. that position in the hypothesis is replaced with a question mark '?'.
#Calling the func function
print("Final Specific Hypothesis ", func(concept,target))
Output
['Morning' 'Sunny' 'Warm' 'Yes' 'Mild' 'Strong']
concept= ['Morning' 'Sunny' 'Warm' 'Yes' 'Mild' 'Strong']
count= 1
target[i]= Yes
x= 0
x= 1
x= 2
x= 3
x= 4
x= 5
concept= ['Evening' 'Rainy' 'cold' 'No' 'Mild' 'Normal']
count= 2
concept= ['Evening' 'Sunny' 'Mod' 'Yes' 'Normal' 'Normal']
count= 3
target[i]= Yes
x= 0
specific_hypothesis[x]= ['?' 'Sunny' 'Warm' 'Yes' 'Mild' 'Strong']
x= 1
x= 2
specific_hypothesis[x]= ['?' 'Sunny' '?' 'Yes' 'Mild' 'Strong']
x= 3
x= 4
specific_hypothesis[x]= ['?' 'Sunny' '?' 'Yes' '?' 'Strong']
x= 5
specific_hypothesis[x]= ['?' 'Sunny' '?' 'Yes' '?' '?']
concept= ['Evening' 'Sunny' 'cold' 'No' 'High' 'Strong']
count= 4
target[i]= Yes
x= 0
specific_hypothesis[x]= ['?' 'Sunny' '?' 'Yes' '?' '?']
x= 1
x= 2
specific_hypothesis[x]= ['?' 'Sunny' '?' 'Yes' '?' '?']
x= 3
specific_hypothesis[x]= ['?' 'Sunny' '?' '?' '?' '?']
x= 4
specific_hypothesis[x]= ['?' 'Sunny' '?' '?' '?' '?']
x= 5
specific_hypothesis[x]= ['?' 'Sunny' '?' '?' '?' '?']
['?' 'Sunny' '?' '?' '?' '?']
Final Specific Hypothesis ['?' 'Sunny' '?' '?' '?' '?']
Experiment No. 2

For a given set of training data examples stored in a .CSV file, implement and demonstrate the Candidate-Elimination algorithm to output a description of the set of all hypotheses consistent with the training examples.

Ans:- Candidate Elimination

• It can be considered an extended form of the Find-S algorithm.
• It considers both positive and negative examples.
• Positive examples are handled as in the Find-S algorithm (the specific hypothesis is generalized).
• Negative examples are used to make the general hypothesis more specific.
Terms Used
• Concept learning: the task of learning a concept from training data (the machine learns from the train data).
• General Hypothesis: does not specify any feature values.
  G = {'?', '?', '?', '?', ...} - one '?' per attribute
• Specific Hypothesis: specifies feature values (specific features).
  S = {'pi', 'pi', 'pi', ...} - the number of pi depends on the number of attributes.
• Version Space: lies between the general hypothesis and the specific hypothesis. It is not just a single hypothesis but the set of all hypotheses consistent with the training data-set.
Algorithm Concept:
Step 1: Load the data set.
Step 2: Initialize the general hypothesis and the specific hypothesis.
Step 3: For each training example:
Step 4: If the example is positive:
    if attribute_value == hypothesis_value:
        do nothing
    else:
        replace the attribute value with '?' (i.e. generalize it)
Step 5: If the example is negative:
    make the general hypothesis more specific.
Implementation of Candidate-Elimination Algorithm
#Importing Important Libraries
import numpy as np
import pandas as pd
data = pd.DataFrame(data=pd.read_csv('CE.csv'))
print(data)
Output
concepts = np.array(data.iloc[:,0:-1])
target = np.array(data.iloc[:,-1])
print(target)
print(concepts)
Output

Note:- Defining Target and Concepts (Features) to train the model


#Defining Model (Candidate Elimination algorithm concepts)
def learn(concepts, target):
    specific_h = concepts[0].copy()
    print("Initialization of specific_h and general_h")
    print("specific_h: ", specific_h)
    general_h = [["?" for i in range(len(specific_h))] for i in range(len(specific_h))]
    print("general_h: ", general_h)
    print("concepts: ", concepts)
    for i, h in enumerate(concepts):
        if target[i] == "yes":
            #Positive example: generalize the specific hypothesis
            for x in range(len(specific_h)):
                #print("h[x]", h[x])
                if h[x] != specific_h[x]:
                    specific_h[x] = '?'
                    general_h[x][x] = '?'
        if target[i] == "no":
            #Negative example: specialize the general hypothesis
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    general_h[x][x] = specific_h[x]
                else:
                    general_h[x][x] = '?'
        print("\nSteps of Candidate Elimination Algorithm: ", i+1)
        print("Specific_h: ", i+1)
        print(specific_h, "\n")
        print("general_h :", i+1)
        print(general_h)
    #Drop the fully general rows from general_h
    indices = [i for i, val in enumerate(general_h) if val == ['?', '?', '?', '?', '?', '?']]
    print("\nIndices", indices)
    for i in indices:
        general_h.remove(['?', '?', '?', '?', '?', '?'])
    return specific_h, general_h
s_final, g_final = learn(concepts, target)
print("\nFinal Specific_h:", s_final, sep="\n")
print("Final General_h:", g_final, sep="\n")
s_final,g_final = learn(concepts, target)
print("\nFinal Specific_h:", s_final, sep="\n")
print("Final General_h:", g_final, sep="\n")
Output
Experiment No. 3
Write a program to demonstrate the working of the decision tree
based ID3 algorithm. Use an appropriate data set for building the
decision tree and apply this knowledge to classify a new sample

Ans:-
Decision Tree
• A decision tree is basically an inverted tree, with each internal node representing a feature (attribute).
• The leaf nodes represent the output (class labels).
• Except for the leaf nodes, the remaining nodes act as decision-making nodes.
Algorithms
• CART (Gini Index)
• ID3 (Entropy, Information Gain)
Note:- Here we will understand the ID3 algorithm.
Algorithm Concepts
1. To understand the concept, we take an example: assume we have a play-tennis data set (the one read from DS.csv below).
2. Based on this data, we have to predict whether we can play on a given day or not.
3. We have four attributes in the data-set. How do we decide which attribute to put at the root node?
4. For this, we calculate the information gain of all the attributes (features); the attribute with the maximum information gain becomes the root node.
Step 1: Creating a root node
Entropy (entropy of the whole data-set):
Entropy(S) = -(p/(p+n))*log2(p/(p+n)) - (n/(p+n))*log2(n/(p+n))
p - the number of positive examples
n - the number of negative examples
Step 2: For every attribute (feature)
Average information of a particular attribute:
I(Attribute) = sum over the attribute's values of ((pi + ni)/(p + n)) * Entropy(value)
pi - the number of positive examples for a particular value of the attribute
ni - the number of negative examples for a particular value of the attribute
Entropy(value) - the entropy of that subset, calculated the same way as for the whole data-set
Information Gain:
Gain = Entropy(S) - I(Attribute)
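For intuition, here is a minimal worked sketch with made-up counts (not taken from the data-set used below): suppose the whole set has 9 positive and 5 negative examples, and a hypothetical attribute Wind splits them into Weak (6 positive, 2 negative) and Strong (3 positive, 3 negative).
import math
def entropy(p, n):
    #Entropy of a subset with p positive and n negative examples
    total = p + n
    return -sum(c/total * math.log2(c/total) for c in (p, n) if c > 0)
E_S = entropy(9, 5)                                    #entropy of the whole set, about 0.940
I_wind = (8/14)*entropy(6, 2) + (6/14)*entropy(3, 3)   #average information of Wind, about 0.892
gain_wind = E_S - I_wind                               #information gain of Wind, about 0.048
print(round(E_S, 3), round(I_wind, 3), round(gain_wind, 3))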
Step 3: Stopping conditions
1. If all examples are positive, return the single-node tree with label = +.
2. If all examples are negative, return the single-node tree with label = -.
3. If the attribute list is empty, return a single-node tree with the most common label.
Step 4: Pick the highest-gain attribute
1. The attribute with the highest information gain becomes a node; the data is grouped by its values and each group is processed in the same way as for the parent (root) node.
2. Again, the feature with the maximum information gain becomes the next node, and this process continues until we reach a leaf node.
Step 5: Repeat until we get the final (leaf) nodes.
Let's take a look at how our tree will look.

Implementation of Decision-Tree (ID3) Algorithm


#Importing important libraries
import pandas as pd
from pandas import DataFrame
#Reading Dataset
df_tennis = pd.read_csv('DS.csv')
print( df_tennis)
Output

Calculating Entropy of Whole Data-set


#Function to calculate the final entropy from a list of probabilities
def entropy(probs):
    import math
    return sum([-prob*math.log(prob, 2) for prob in probs])
#Function to calculate the probabilities of positive and negative examples
def entropy_of_list(a_list):
    from collections import Counter
    cnt = Counter(x for x in a_list)  #Count the positive and negative examples
    num_instances = len(a_list)
    #Calculate the probabilities required for the entropy formula
    probs = [x / num_instances for x in cnt.values()]
    #Call the entropy function for the final entropy
    return entropy(probs)
total_entropy = entropy_of_list(df_tennis['PT'])
print("\n Total Entropy of PlayTennis Data Set:", total_entropy)
Output

Calculate Information Gain for each Attribute


#Defining the Information Gain function
def information_gain(df, split_attribute_name, target_attribute_name, trace=0):
    print("Information Gain Calculation of ", split_attribute_name)
    print("target_attribute_name", target_attribute_name)
    #Group the rows by the values of the current attribute
    df_split = df.groupby(split_attribute_name)
    for name, group in df_split:
        print("Name: ", name)
        print("Group: ", group)
    nobs = len(df.index) * 1.0
    print("NOBS", nobs)
    #Entropy of each attribute value and its probability (the weighting part of the formula)
    df_agg_ent = df_split.agg({target_attribute_name: [entropy_of_list, lambda x: len(x)/nobs]})[target_attribute_name]
    df_agg_ent.columns = ['Entropy', 'Prob1']  #Rename the aggregated columns so they can be used below
    print("df_agg_ent", df_agg_ent)
    #Calculate the information gain
    avg_info = sum(df_agg_ent['Entropy'] * df_agg_ent['Prob1'])
    old_entropy = entropy_of_list(df[target_attribute_name])
    return old_entropy - avg_info
print('Info-gain for Outlook is :' + str(information_gain(df_tennis, 'Outlook', 'PT')), "\n")
Output
Note
In the same way, we calculate the information gain of the remaining attributes; the attribute with the highest information gain is chosen as the best attribute.
Defining ID3 Algorithm
#Defining the ID3 algorithm function
def id3(df, target_attribute_name, attribute_names, default_class=None):
    #Counting the total number of yes and no classes (positive and negative examples)
    from collections import Counter
    cnt = Counter(x for x in df[target_attribute_name])
    #If all remaining examples have the same class, return that class
    if len(cnt) == 1:
        return next(iter(cnt))
    #Return the default class for an empty data set or an empty attribute list
    elif df.empty or (not attribute_names):
        return default_class
    else:
        default_class = max(cnt.keys())
        print("attribute_names:", attribute_names)
        gainz = [information_gain(df, attr, target_attribute_name) for attr in attribute_names]
        #Separate the attribute with the maximum information gain
        index_of_max = gainz.index(max(gainz))     #Index of the best attribute
        best_attr = attribute_names[index_of_max]  #Choosing the best attribute
        #The tree is initially an empty dictionary
        tree = {best_attr: {}}  #Initiate the tree with the best attribute as a node
        remaining_attribute_names = [i for i in attribute_names if i != best_attr]
        #Recursively build a subtree for every value of the best attribute
        for attr_val, data_subset in df.groupby(best_attr):
            subtree = id3(data_subset,
                          target_attribute_name,
                          remaining_attribute_names,
                          default_class)
            tree[best_attr][attr_val] = subtree
        return tree
#Get the predictor names (all columns except the class column 'PT')
attribute_names = list(df_tennis.columns)
print("List of Attributes:", attribute_names)
attribute_names.remove('PT')  #Remove the class attribute
print("Predicting Attributes:", attribute_names)
Output

# Run Algorithm (Calling ID3 function)


from pprint import pprint
tree = id3(df_tennis,'PT',attribute_names)
print("\n\nThe Resultant Decision Tree is :\n")
pprint(tree)
attribute = next(iter(tree))
print("Best Attribute :\n",attribute)
print("Tree Keys:\n",tree[attribute].keys())
Note:- The pprint module provides a capability to pretty-print arbitrary Python data structures in a well-formatted and more readable way.
Note:- After running the algorithm the output will be very large, because the information gain function (which the ID3 algorithm requires) is also called and prints its own trace.
Note:- Here only the tree is shown; to see the rest of the process, run the code and inspect the output.

ACCURACY
#Defining a function that classifies an instance using the tree (used to compute accuracy)
def classify(instance, tree, default=None):
    attribute = next(iter(tree))
    print("Key:", tree.keys())
    print("Attribute:", attribute)
    print("Instance of Attribute :", instance[attribute], attribute)
    if instance[attribute] in tree[attribute].keys():
        result = tree[attribute][instance[attribute]]
        print("Instance Attribute:", instance[attribute], "TreeKeys :", tree[attribute].keys())
        if isinstance(result, dict):
            #The branch leads to a subtree, so classify recursively
            return classify(instance, result)
        else:
            #The branch leads to a leaf, so return the label
            return result
    else:
        return default
#Classify every row of the data-set with the learned tree and measure the training accuracy
df_tennis['predicted'] = df_tennis.apply(classify, axis=1, args=(tree, 'No'))
print(df_tennis['predicted'])
print('\n Accuracy is:\n' + str(sum(df_tennis['PT'] == df_tennis['predicted']) / (1.0*len(df_tennis.index))))
df_tennis[['PT', 'predicted']]
#Train on part of the data and test on the last four rows
training_data = df_tennis.iloc[1:-4]
test_data = df_tennis.iloc[-4:]
train_tree = id3(training_data, 'PT', attribute_names)
test_data['predicted2'] = test_data.apply(classify, axis=1, args=(train_tree, 'Yes'))
print('\n\n Accuracy is : ' + str(sum(test_data['PT'] == test_data['predicted2']) / (1.0*len(test_data.index))))
Output
Experiment No. 4

Build an Artificial Neural Network by implementing the Backpropagation algorithm and test the same using appropriate data sets.

Ans:- What is Computer Vision (CV)?

1. Computer vision is an interdisciplinary scientific field that deals with how computers can be made to gain high-level understanding from digital images or videos.
2. From the perspective of engineering, it seeks to automate tasks that the human visual system can do.
3. For example, when we upload a photo on Facebook, Facebook provides a feature called auto-tag that suggests the people to tag in the photo; this is possible because of computer vision.
Now the question is: how does the computer recognize images?
1. First of all, what is an image? An image is a collection (combination) of pixels.
2. A pixel is the smallest unit of an image.
3. A computer sees an image as numbers. Each pixel has some number of channels: a grayscale image has only one channel.
4. A colored image contains three channels, called red, green and blue, commonly known as RGB.
5. Each pixel value lies between 0 and 255; for better understanding, take a look at the picture below.
6. Image size A x B x 3 means the image has A rows, B columns and 3 RGB channels; image size A x B means a grayscale image with A rows and B columns.

7. But how does a computer recognize what is in the picture (a cat, a dog, a human, etc.)? This is where the concept of machine learning comes in; a small sketch of the image shapes described above follows.
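As a quick illustration of points 3-6 above, here is a minimal sketch (with made-up array sizes, not part of the original code) showing how grayscale and color images differ in shape:
import numpy as np
gray = np.zeros((4, 5), dtype=np.uint8)      #grayscale image: 4 rows x 5 columns, one channel
color = np.zeros((4, 5, 3), dtype=np.uint8)  #color image: 4 rows x 5 columns x 3 channels (RGB)
print(gray.shape)   # (4, 5)
print(color.shape)  # (4, 5, 3)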
Face Detection Using OpenCv
Problem: We have an image and we have to detect faces in that
image.
OpenCv: OpenCv is a python library to solve the problems of
computer vision.
Let's understand it with an example
import cv2 # Import OpenCv
from matplotlib import pyplot as plt #To plot the image
image = cv2.imread('bts.jpg') #Reading image
#converting into grayscale image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
print(image.shape)
print(gray.shape)
print(gray)#printing gray image
Output

Note:
1. First we imported the OpenCV module, then the matplotlib library to plot and show the picture.
2. After that we read the image; our image was a colored image, which we then converted into a grayscale image.
3. You can see the difference between the grayscale image and the colored image from the output.
4. In OpenCV every image is loaded as a NumPy array, which is why we were able to see their shapes.
5. When we printed the gray image we got a 2D array; if you print a color image in the same way you will get a 3D array.
#Create a Cascade Classifier; it contains all the face features
haar_face_cascade = cv2.CascadeClassifier('face.xml')
#Search for the co-ordinates of faces
faces = haar_face_cascade.detectMultiScale(image, scaleFactor=1.06, minNeighbors=5)
print("face found", len(faces))
#Draw a rectangle outline on each face
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
plt.imshow(image)  #show the image
plt.axis('off')
plt.show()
Note: (let's understand what we are doing in this code)
1. We first created a Cascade classifier, which loads all the face features from "face.xml".
2. The smaller the value of scaleFactor, the greater the accuracy.
3. Since our image is a NumPy array, detectMultiScale searches it for the coordinates of the faces we want to detect.
4. After that we draw a rectangular frame on each face; (0, 255, 0) is the frame color and 2 is the width of the frame.
Output
Experiment No.5
Write a program to implement the naïve Bayesian classifier for a
sample training data set stored as a .CSV file. Compute the
accuracy of the classifier, considering few test data sets.

Naive Bayes Classifier

• It is a supervised learning algorithm used for classification, based on Bayes' Theorem.
• NBC is not just one algorithm but a family of algorithms that all work on the same concept, Bayes' Theorem.
Industrial Use of Naive Bayes Classifier
1. News Categorization
2. Spam filtering
3. Object and face recognition.
4. Medical Diagnosis
5. Weather Prediction etc..
Types of Naive Bayes Classifier
There are three types of naive Bayes classifier:
1. Gaussian
2. Multinomial
3. Bernoulli
Bayes' Theorem
NBC works entirely on Bayes' theorem. Let's see what Bayes' theorem is.
P(H/E) = P(E/H) * P(H) / P(E)
• H - Hypothesis, E - Event / Evidence
• Bayes' theorem works on conditional probability.
• Given that the event has happened (the evidence is true), we calculate the probability of the hypothesis given that event.
• In other words, the chance of H happening when the event E has happened.
P(H) - the prior probability: the probability of H before E is observed.
P(H/E) - the posterior probability: the probability of H after the event E is observed.
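As a small numeric illustration with made-up probabilities (not taken from the wine data used below):
p_h, p_e_given_h, p_e = 0.3, 0.8, 0.4   #hypothetical prior, likelihood and evidence
p_h_given_e = p_e_given_h * p_h / p_e   #Bayes' theorem: P(H/E) = P(E/H)*P(H)/P(E)
print(round(p_h_given_e, 2))            # 0.6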
Note: As the question asks us to implement a naive Bayes classifier on a .csv file, here we will use the naive Bayes classifier on the wine data-set.
Wine Dataset Description
• The wine dataset contains the results of a chemical analysis of wines grown in a specific area of Italy.
• It contains a total of 178 samples, with 13 chemical measurements (features) recorded for each sample.
• It contains three classes (our target), with no missing values.
Implementation of Algorithm
#Import important libraries
import numpy as np
import pandas as pd
#Import dataset
from sklearn import datasets
#Load dataset
wine = datasets.load_wine()
#print(wine)#if you want to see the data you can print data
Note: Here we have just loaded the data. You can download the data and load it, or load it directly from sklearn. The loaded data is in the form of a dictionary; you can print it and see.
#print the names of the 13 features
print ("Features: ", wine.feature_names)
#print the label type of wine
print ("Labels: ", wine.target_names)
Output
Note: Here we have seen our target and our features name by
printing it, with this data we will train our data.
X = pd.DataFrame(wine['data'])
print(X.head())
print(wine.data.shape)
#print the wine labels (0: Class_0, 1: Class_1, 2: Class_2)
y = wine.target
print(y)
Output

Note:
1. Here we printed the first five samples of our 13 features.
2. On the basis of these features, the wines are divided into three classes: 0, 1, 2.
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.30, random_state=109)
Note: We split our data into training data and testing data: 70% training data and 30% testing data. The model learns from the training data, and on the testing data we can see how much the model has learned.

#Import Gaussian Naive Bayes model


from sklearn.naive_bayes import GaussianNB
#Create a Gaussian Classifier
gnb = GaussianNB()
#Train the model using the training sets
gnb.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = gnb.predict(X_test)
print(y_pred)
Output
Note: We have used the Gaussian model here, and then tested it with the test data.
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
#confusion matrix
from sklearn.metrics import confusion_matrix
cm=np.array(confusion_matrix(y_test,y_pred))
cm
Output

Note:
1. To check how good our model is, we have obtained the accuracy of our model.
2. Here we calculated both the confusion matrix and the accuracy.
3. We can see from the confusion matrix that our model predicted a total of 5 values wrong, and the rest are correct predictions.
Experiment No.6

Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy, precision, and recall for your data set.

Naive Bayes Classifier

• It is a supervised learning algorithm used for classification, based on Bayes' Theorem.
• NBC is not just one algorithm but a family of algorithms that all work on the same concept, Bayes' Theorem.
Industrial Use of Naive Bayes Classifier
1. News Categorization
2. Spam filtering
3. Object and face recognition.
4. Medical Diagnosis
5. Weather Prediction etc..
Types of Naive Bayes Classifier
There are three types of naive Bayes classifier:
1. Gaussian
2. Multinomial
3. Bernoulli
Bayes' Theorem
NBC works entirely on Bayes' theorem. Let's see what Bayes' theorem is.
P(H/E) = P(E/H) * P(H) / P(E)
• H - Hypothesis, E - Event / Evidence
• Bayes' theorem works on conditional probability.
• Given that the event has happened (the evidence is true), we calculate the probability of the hypothesis given that event.
• In other words, the chance of H happening when the event E has happened.
P(H) - the prior probability: the probability of H before E is observed.
P(H/E) - the posterior probability: the probability of H after the event E is observed.
Note: Now we will see how to use NBC for text classification; for this it is necessary to understand the process first.
1. First we train the model with the data we already have. In the case of text, we have a data-set in which the texts are already placed in some defined categories.
2. After that we take any new text and find out which category it belongs to.
Let's understand it with an example
Import Data-set
1. The dataset that we are going to use here is the 20 Newsgroups dataset (fetch_20newsgroups).
2. You can load the dataset in two ways: either load it directly from sklearn or download it.
3. There are two folders in it: the first folder is our train dataset, the second folder is our test dataset.
4. There are many categories in this data, but here we will take only a few categories, which are given in the categories variable in the code below.
import sklearn.datasets as skd
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
news_train = skd.load_files('C:\\Users\\apex\\Downloads\\20news-bydate-train', categories=categories, encoding='ISO-8859-1')
news_test = skd.load_files('C:\\Users\\apex\\Downloads\\20news-bydate-test', categories=categories, encoding='ISO-8859-1')
Note: After loading the data, the variables in the data are stored in the form of a dictionary.
print(news_train.keys())
print(news_train['target_names'])
Output
Note: As we can see, we have four classes (target_names), and our data (the texts) is stored as the value of the dictionary key 'data'.
Counting Words
1. Why do we need a word count?
2. To train our model, we must know which category each particular word has appeared in.
3. To calculate the probability, we need to know the count of a particular word, the total word count, etc.
• For example: we have a category called education, and a text in that category is "education is basic right of every Indian".
• Now we have to find out the probability of the word "Indian" in the category education.
• So we calculate P(Indian/education), i.e. the probability of the word "Indian" given the category education; for that we must know how many times the particular word appears in that particular category.
1. A human can do this work with his mind and sense, but the machine gives a unique numeric value to each unique word and then counts how often that particular numeric value appears in each category.
2. Here we have the CountVectorizer function to do this.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_tf = count_vect.fit_transform(news_train.data)  #Assign each unique word an index and count it
print(X_train_tf)
X_train_tf.shape
Output

Note: 2257 represents the number of rows (samples), 35788 the number of columns (unique words/features), and 0, 1, 2, 3 are the classes.
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_tf)
X_train_tfidf.shape
Output
(2257, 35788) # shows that there is no change in number of words
Note: What TfidfTransformer basically does is weight the words used in the classification.
For example, the word "the" appears many times in almost any text, so TfidfTransformer works out how much such a word actually contributes to the classification.
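As a small sketch of this idea, using a made-up three-document corpus and TfidfVectorizer (which combines CountVectorizer and TfidfTransformer in one step; get_feature_names_out assumes a reasonably recent scikit-learn):
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["the cat sat", "the dog ran", "the cat ran"]   #toy corpus: "the" appears in every document
vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()[0]         #TF-IDF weights for the first document
print(dict(zip(vec.get_feature_names_out(), weights.round(2))))
#"the" gets the lowest weight of the three words in the first document, because it appears everywhere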
Note: We have three types of naive Bayes classifier; here we are using the Multinomial classifier.
Train The Model
from sklearn.naive_bayes import MultinomialNB
clf= MultinomialNB().fit(X_train_tfidf, news_train.target)
Note:
1. With the MultinomialNB classifier we train our model.
2. While training the model we pass two parameters: first our transformed data, and second the categories (classes) into which we classify our text.
Test The Model
1. Before testing, we also have to transform our test data in the same way as the train data.
2. If we do not do this, our model will not recognize the data and the chance of error will increase.

X_test_tf=count_vect.transform(news_test.data)
X_test_tfidf=tfidf_transformer.transform(X_test_tf)
predicted=clf.predict(X_test_tfidf)
predicted
Output
Note: It classifies our test data into the classes we have; you can also use any text other than the test data.
ACCURACY
Let's check how correct our model is.
from sklearn import metrics
from sklearn.metrics import accuracy_score
print("Accuracy",accuracy_score(news_test.target,predicted))
print(metrics.confusion_matrix(news_test.target,predicted))
Output

Note: You can see that here we have the accuracy and the confusion matrix.
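The experiment also asks for precision and recall; a minimal sketch (reusing news_test and predicted from above) with scikit-learn's classification_report:
from sklearn.metrics import classification_report
#Per-class precision, recall and F1-score for the four chosen categories
print(classification_report(news_test.target, predicted, target_names=news_test.target_names))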
Experiment No.7
