EXPERIMENT-6

Aim: Demonstrate Term Frequency – Inverse Document Frequency (TF-IDF) using Python.
Description: Term Frequency – Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (the corpus). A text vectorization process transforms the words of a document into importance scores: TF-IDF scores a word by multiplying the word's Term Frequency (TF) by its Inverse Document Frequency (IDF).
Term Frequency (TF):
Term Frequency measures how often a term (word) occurs within a document.
It is calculated as the ratio of the number of times the term appears in the document to the total number of terms in the document.
Formula: TF = Number of times the term appears in the document / Total number of terms in the document
Inverse Document Frequency (IDF):
Inverse Document Frequency measures how important a term is across the collection of documents.
It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term, often with smoothing to avoid division by zero.
Formula: IDF = log(Total number of documents in the corpus / Number of documents containing the term)
TF-IDF Score:
The TF-IDF score of a term in a document combines both TF and IDF.
It is calculated by multiplying the TF of the term in the document by the IDF of the term across the entire corpus.
Formula: TF-IDF = TF × IDF
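For example, in the three-document corpus used in the program below, the word "most" occurs once among the 11 words of the first document and appears in only one of the three documents, so TF = 1/11 ≈ 0.0909, IDF = log10(3/1) ≈ 0.4771, and TF-IDF ≈ 0.0909 × 0.4771 ≈ 0.0434, matching the value 0.043375 in the output.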

TF-IDF is useful in many natural language processing applications:

1. Search engines use it to rank the relevance of a document to a query.
2. Text classification, text summarization, and topic modelling.

Program:
import pandas as pd
import numpy as np

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyse data']

# build the vocabulary of the corpus
words_set = set()
for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))
print('Number of words in the corpus:', len(words_set))
print('The words in the corpus: \n', words_set)

n_docs = len(corpus)
n_words_set = len(words_set)

# term frequency: fraction of a document's words that are the given term
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))
for i in range(n_docs):
    words = corpus[i].split(' ')
    for w in words:
        df_tf.loc[i, w] = df_tf.loc[i, w] + (1 / len(words))
df_tf

# inverse document frequency: log10(number of documents / documents containing the term)
print("IDF of: ")
idf = {}
for w in words_set:
    k = 0
    for i in range(n_docs):
        if w in corpus[i].split():
            k += 1
    idf[w] = np.log10(n_docs / k)
    print(f'{w:>15}: {idf[w]:>10}')

df_tf_idf = df_tf.copy()

# multiply TF by IDF to obtain the TF-IDF score of each term in each document
for w in words_set:
    for i in range(n_docs):
        df_tf_idf.loc[i, w] = df_tf.loc[i, w] * idf[w]
print(df_tf_idf)

Output:
Number of words in the corpus: 14
The words in the corpus:
{'most', 'the', 'science', 'this', 'analyse', 'fields', 'important', 'courses', 'scientists', 'one', 'of', 'data', 'is', 'best'}

IDF of:
           most: 0.47712125471966244
            the: 0.17609125905568124
        science: 0.17609125905568124
           this: 0.47712125471966244
        analyse: 0.47712125471966244
         fields: 0.47712125471966244
      important: 0.47712125471966244
        courses: 0.47712125471966244
     scientists: 0.47712125471966244
            one: 0.17609125905568124
             of: 0.17609125905568124
           data: 0.0
             is: 0.17609125905568124
           best: 0.47712125471966244

       most       the   science      this  analyse    fields  important
0  0.043375  0.016008  0.032017  0.000000  0.00000  0.043375   0.043375
1  0.000000  0.019566  0.019566  0.053013  0.00000  0.000000   0.000000
2  0.000000  0.000000  0.000000  0.000000  0.11928  0.000000   0.000000

    courses  scientists       one        of  data        is      best
0  0.000000     0.00000  0.016008  0.032017   0.0  0.016008  0.000000
1  0.053013     0.00000  0.019566  0.019566   0.0  0.019566  0.053013
2  0.000000     0.11928  0.000000  0.000000   0.0  0.000000  0.000000

EXPERIMENT-7

Aim: Demonstrate word embeddings using word2vec.

Description: Word embedding is one of the most important techniques in natural language processing (NLP), where words are mapped to vectors of real numbers. Word embeddings can capture the meaning of a word in a document, semantic and syntactic similarity, and relations with other words, and they are widely used in recommender systems and text classification. Word2vec is one of the most popular techniques for learning word embeddings using a two-layer neural network: its input is a text corpus and its output is a set of vectors. Word embedding via word2vec makes natural language computer-readable, so that mathematical operations on words can be used to detect their similarities. A well-trained set of word vectors places similar words close to each other in the vector space; for instance, the words women, men, and human might cluster in one corner, while yellow, red, and blue cluster together in another. Gensim is an open-source Python library for natural language processing, developed and maintained by the Czech NLP researcher Radim Řehůřek. The Gensim library enables us to develop word embeddings by training our own word2vec models on a custom corpus with either the CBOW or skip-gram algorithm.

Train the gensim word2vec model:

model = Word2Vec(sent, min_count=1, vector_size=50, workers=3, window=3, sg=1)

vector_size: The number of dimensions of the embeddings; the default is 100. (In gensim versions before 4.0 this parameter was called size.)

window: The maximum distance between a target word and the words around it. The default window is 5.

min_count: The minimum count of words to consider when training the model; words that occur fewer times than this are ignored. The default min_count is 5.

workers: The number of worker threads used during training; the default is 3.

sg: The training algorithm, either CBOW (0) or skip-gram (1). The default training algorithm is CBOW.

After training the word2vec model, obtain the word embeddings from the trained model and, finally, print the model.
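As a quick illustration of these parameters, the sketch below trains a model on an assumed two-sentence toy corpus (gensim 4.x API) and queries it for a word vector and the most similar words:

from gensim.models import Word2Vec

# assumed toy corpus, for illustration only
sent = [['data', 'science', 'is', 'fun'],
        ['machine', 'learning', 'uses', 'data']]

model = Word2Vec(sent, min_count=1, vector_size=50, workers=3, window=3, sg=1)
print(model.wv['data'])                       # the 50-dimensional vector for 'data'
print(model.wv.most_similar('data', topn=3))  # nearest words by cosine similarity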


Program:
from gensim.models import Word2Vec

# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# train the model
model = Word2Vec(sentences, min_count=1)
print(model)

# summarize the vocabulary


words = list(model.wv.key_to_index)
print(words)

# access the vector for one word
print(model.wv['is'])

# save the model and load it back
model.save('model.bin')
new_model = Word2Vec.load('model.bin')
print(new_model)

Output:
Word2Vec(vocab=14, size=100, alpha=0.025)
['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']
[ 5.9599371e-04  3.6903401e-03  2.2744297e-03  5.7322328e-04
 -4.7999555e-03  4.1460539e-03  3.6190548e-03  4.4815554e-03
 -9.4492309e-04 -2.3332548e-03 -7.7754230e-04 -2.0325035e-03
 -4.9208495e-05 -3.8984963e-03  2.2744499e-03  1.9393873e-03
  1.0208354e-03  2.7080898e-03  1.9608904e-03  1.0961948e-03
  ... ]

EXPERIMENT-8

Aim: Implement text classification using the Naive Bayes classifier and the TextBlob library.
Description: Text classifiers are systems that classify texts, i.e., assign them to different classes.

TextBlob is a Python library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more.
Step 1: Install textblob.
Step 2: Download the data files that textblob and nltk use for this functionality.
Step 3: Train the classifier based on the Naive Bayes classifier.
Step 4: Test the data using the classifier to get your text classified.
Step 5: Calculate the accuracy of the classifier.
Program:
!pip install textblob
import nltk
nltk.download('punkt')

train = [('What an amazing weather.', 'pos'),
         ('this is an amazing idea!', 'pos'),
         ('I feel very good about these ideas.', 'pos'),
         ('this is my best performance.', 'pos'),
         ("what an awesome view", 'pos'),
         ('I do not like this place', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with all this tension", 'neg'),
         ('he is my sworn enemy!', 'neg'),
         ('my friends is horrible.', 'neg')]

test = [('the food was great.', 'pos'),
        ('I do not want to live anymore', 'neg'),
        ("I ain't feeling dandy today.", 'neg'),
        ("I feel amazing!", 'pos'),
        ('Ramesh is a friend of mine.', 'pos'),
        ("I can't believe I'm doing this.", 'neg')]

from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)
print(cl.classify("This is an amazing library!"))
print(cl.accuracy(test))
print(cl.classify("my friends is tension"))
print(cl.accuracy(test))
cl.show_informative_features(4)

Output:
pos
0.8333333333333334
neg
0.8333333333333334
Most Informative Features
contains(I) = True neg : pos = 2.3 : 1.0
contains(an) = False neg : pos = 2.2 : 1.0
contains(I) = False pos : neg = 1.8 : 1.0
contains(my) = True neg : pos = 1.7 : 1.0
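Beyond a hard label, the trained classifier can also report per-class probabilities. The short sketch below is an illustrative addition (not part of the original program) that calls TextBlob's prob_classify on the classifier cl trained above:

# illustrative sketch: per-class probabilities from the classifier trained above
prob_dist = cl.prob_classify("This is an amazing library!")
print(prob_dist.max())                  # most probable label, e.g. 'pos'
print(round(prob_dist.prob('pos'), 2))  # probability assigned to 'pos'
print(round(prob_dist.prob('neg'), 2))  # probability assigned to 'neg'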

EXPERIMENT-9

Aim: Apply support vector machine for text classification.

Description:
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems, although it is mostly used for classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that best separates the classes. Support vectors are the coordinates of the individual observations that lie closest to this boundary, and the SVM classifier is the frontier (hyperplane/line) that best segregates the two classes.
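The program below visualizes SVM decision boundaries on the iris data; for the text-classification aim itself, a minimal sketch (with an assumed toy corpus, using scikit-learn's TfidfVectorizer and LinearSVC) could look like this:

# assumed sketch of SVM text classification: TF-IDF features feeding a linear SVM
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["this movie was great", "what a wonderful film",
         "this movie was terrible", "an awful waste of time"]
labels = ["pos", "pos", "neg", "neg"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())  # vectorize, then classify
clf.fit(texts, labels)
print(clf.predict(["a truly wonderful movie"]))      # likely ['pos'] on this toy data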

Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# load the iris data and keep only the first two features (sepal length, sepal width)
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

C = 1.0  # SVM regularization parameter

# linear-kernel SVC
svc = svm.SVC(kernel='linear', C=C, gamma='auto').fit(X, y)

# build a mesh over the feature space for plotting the decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100  # mesh step size
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with linear kernel')
plt.show()

# RBF-kernel SVC on the same two features
svc = svm.SVC(kernel='rbf', C=C, gamma='auto').fit(X, y)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with RBF kernel')
plt.show()

Output:
[Plots: the iris samples in the sepal length/width plane with the SVC decision regions for the linear and RBF kernels]
EXPERIMENT-10

Aim: Convert text to vectors (using term frequency) and apply cosine similarity to measure the closeness between two texts.

Description: Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. Cosine similarity is used to determine how similar two words or sentences are; it is applied in sentiment analysis and text comparison, and it is used by many popular packages such as word2vec.
The cosine is computed from the dot product of the two vectors; the dot product is also called the scalar product because it yields a scalar result.
For example, for Vector(A) = [5, 0, 2] and Vector(B) = [2, 5, 0], the dot product A · B = 5*2 + 0*5 + 2*0 = 10 + 0 + 0 = 10.
The smaller the angle between two document vectors, the more similar the documents: the cosine increases as the angle decreases, since cos 0 = 1 and cos 90 = 0.
The first step is to convert the documents/sentences/words into feature vectors; useful methods for feature extraction are (i) Bag of Words and (ii) TF-IDF.
Bag of Words counts the unique words in the documents and the frequency of each word; scikit-learn's CountVectorizer extracts Bag of Words features.
The TF-IDF score of a word ranks its importance in a document: tfidf(w) = tf(w) * idf(w), where tf(w) = number of times the word appears in the document / total number of words in the document, and idf(w) = log(number of documents / number of documents that contain the word w).
Use scikit-learn's cosine_similarity function to compare the first document (Document 0) with the other documents in the corpus.
In the resulting array of cosine similarities, the first element is 1 because Document 0 is compared with itself; the next elements (0.08619387, 0, 0) are the similarities of Document 0 with Documents 1, 2, and 3.
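As a check of the formula itself, a minimal NumPy sketch computes the cosine similarity of the two example vectors above:

import numpy as np

A = np.array([5, 0, 2])
B = np.array([2, 5, 0])

dot = np.dot(A, B)                                       # 5*2 + 0*5 + 2*0 = 10
cos_sim = dot / (np.linalg.norm(A) * np.linalg.norm(B))  # 10 / (sqrt(29) * sqrt(29))
print(cos_sim)                                           # ~0.3448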

Program:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

count_vect = CountVectorizer()
Document1 = "Aditya Engineering College situated at Surampalem"
Document2 = "Engineering Colleges offer computer science courses in MCA AIML CSE IT departments"
Document3 = "Computer science students have opprtunities in IT sector"
Document4 = "IT sector hire students with skills in computer science"
corpus = [Document1, Document2, Document3, Document4]

# Bag-of-Words counts
X_train_counts = count_vect.fit_transform(corpus)
df1 = pd.DataFrame(X_train_counts.toarray(),
                   columns=count_vect.get_feature_names_out(),
                   index=['Document 0', 'Document 1', 'Document 2', 'Document 3'])

# TF-IDF features
vectorizer = TfidfVectorizer()
trsfm = vectorizer.fit_transform(corpus)
df2 = pd.DataFrame(trsfm.toarray(),
                   columns=vectorizer.get_feature_names_out(),
                   index=['Document 0', 'Document 1', 'Document 2', 'Document 3'])

print(df1)
print(df2)

# cosine similarity of the first three documents against the whole corpus
cosine_similarity(trsfm[0:3], trsfm)

Output:
            aditya  aiml  at  college  colleges  computer  courses  cse
Document 0       1     0   1        1         0         0        0    0
Document 1       0     1   0        0         1         1        1    1
Document 2       0     0   0        0         0         1        0    0
Document 3       0     0   0        0         0         1        0    0

            departments  engineering  ...  mca  offer  opprtunities  science
Document 0            0            1  ...    0      0             0        0
Document 1            1            1  ...    1      1             0        1
Document 2            0            0  ...    0      0             1        1
Document 3            0            0  ...    0      0             0        1

            sector  situated  skills  students  surampalem  with
Document 0       0         1       0         0           1     0
Document 1       0         0       0         0           0     0
Document 2       1         0       0         1           0     0
Document 3       1         0       1         1           0     1

[4 rows x 24 columns]

              aditya      aiml        at   college  colleges  computer
Document 0  0.421765  0.000000  0.421765  0.421765  0.000000  0.000000
Document 1  0.000000  0.328776  0.000000  0.000000  0.328776  0.209853
Document 2  0.000000  0.000000  0.000000  0.000000  0.000000  0.289152
Document 3  0.000000  0.000000  0.000000  0.000000  0.000000  0.263386

             courses       cse  departments  engineering  ...       mca
Document 0  0.000000  0.000000     0.000000     0.332524  ...  0.000000
Document 1  0.328776  0.328776     0.328776     0.259211  ...  0.328776
Document 2  0.000000  0.000000     0.000000     0.000000  ...  0.000000
Document 3  0.000000  0.000000     0.000000     0.000000  ...  0.000000

               offer  opprtunities   science    sector  situated    skills
Document 0  0.000000      0.000000  0.000000  0.000000  0.421765  0.000000
Document 1  0.328776      0.000000  0.209853  0.000000  0.000000  0.000000
Document 2  0.000000      0.453012  0.289152  0.357160  0.000000  0.000000
Document 3  0.000000      0.000000  0.263386  0.325334  0.000000  0.412645

            students  surampalem      with
Document 0  0.000000    0.421765  0.000000
Document 1  0.000000    0.000000  0.000000
Document 2  0.357160    0.000000  0.000000
Document 3  0.325334    0.000000  0.412645

[4 rows x 24 columns]

array([[1.        , 0.08619387, 0.        , 0.        ],
       [0.08619387, 1.        , 0.24271786, 0.22108976],
       [0.        , 0.24271786, 1.        , 0.53702605]])