EXPERIMENT-6
Aim: Demonstrate Term Frequency – Inverse Document Frequency (TF-IDF) using Python
Description: Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (the corpus). A text vectorization process transforms the words of a document into importance scores: TF-IDF vectorizes/scores a word by multiplying the word's Term Frequency (TF) with its Inverse Document Frequency (IDF).
Term Frequency (TF):
Term Frequency measures the frequency of a term (word) within a document.
It is calculated as the ratio of the number of times a term appears in a document to the total number of
terms in the document.
Formula: TF = Number of times the term appears in the document / Total number of terms in the document
Inverse Document Frequency (IDF):
Inverse Document Frequency measures the importance of a term across a collection of documents.
It is calculated as the logarithm of the ratio of the total number of documents to the number of
documents containing the term, with smoothing to avoid division by zero.
Formula: IDF = log(Total number of documents in the corpus / Number of documents containing the term)
TF-IDF Score:
The TF-IDF score for a term in a document combines both TF and IDF.
It is calculated by multiplying the TF of the term in the document by the IDF of the term across the
entire corpus.
Formula: TF-IDF = TF×IDF
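The program listing for this experiment is not reproduced here, so the following is only a minimal sketch of how TF-IDF can be demonstrated in plain Python. The three-document corpus is reconstructed from the vocabulary and IDF values in the output below (which use a base-10 logarithm) and should be treated as an approximation of the original data.

import math

# corpus reconstructed from the output below; treat it as an approximation
corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyse data']
docs = [sentence.split() for sentence in corpus]
words = set(word for doc in docs for word in doc)
print('Number of words in the corpus:', len(words))
print(words)

def tf(word, doc):
    # term frequency: occurrences of the word / total terms in the document
    return doc.count(word) / len(doc)

def idf(word):
    # inverse document frequency: log10(total documents / documents containing the word)
    containing = sum(1 for doc in docs if word in doc)
    return math.log10(len(docs) / containing)

print('IDF of: ')
for word in words:
    print(f'{word}: {idf(word)}')

# TF-IDF of a word in a document = TF * IDF
print('TF-IDF of "science" in document 0:', tf('science', docs[0]) * idf('science'))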
Output:
Number of words in the corpus: 14
{'most', 'the', 'science', 'this', 'analyse', 'fields', 'important', 'courses', 'scientists', 'one', 'of', 'data', 'is', 'best'}
IDF of:
most: 0.47712125471966244
the: 0.17609125905568124
science: 0.17609125905568124
this: 0.47712125471966244
analyse: 0.47712125471966244
fields: 0.47712125471966244
important: 0.47712125471966244
courses: 0.47712125471966244
scientists: 0.47712125471966244
one: 0.17609125905568124
of: 0.17609125905568124
data: 0.0
is: 0.17609125905568124
best: 0.47712125471966244
Aim: Demonstrate word embeddings using the Word2Vec model with the gensim library.
Description: Word embedding is one of the most important techniques in natural language processing (NLP), where words are mapped to vectors of real numbers. Word embedding is capable of capturing the meaning of a word in a document, semantic and syntactic similarity, and its relation with other words. It has also been widely used for recommender systems and text classification. Word2vec is one of the most popular techniques to learn word embeddings using a two-layer neural network. Its input is a text corpus and its output is a set of vectors. Word embedding via word2vec makes natural language computer-readable, so that mathematical operations on words can be used to detect their similarities. A well-trained set of word vectors will place similar words close to each other in that space. For instance, the words women, men, and human might cluster in one corner, while yellow, red, and blue cluster together in another. gensim is an open-source Python library for natural language processing; it was developed and is maintained by the Czech natural language processing researcher Radim Řehůřek. The gensim library enables us to develop word embeddings by training our own word2vec models on a custom corpus with either the CBOW or the skip-gram algorithm.
size: The number of dimensions of the embeddings and the default is 100.
window: The maximum distance between a target word and words around the target word. The default
window is 5.
min_count: The minimum count of words to consider when training the model; words with occurrence
less than this count will be ignored. The default for min_count is 5.
workers: The number of worker threads used during training; the default is 3.
sg: The training algorithm, either CBOW (0) or skip-gram (1). The default training algorithm is CBOW.
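For illustration, a Word2Vec call that sets all of these parameters explicitly might look like the following line (gensim 3.x parameter names, matching the output later in this experiment; gensim 4.x renames size to vector_size), where sentences is the tokenized training corpus:

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=3, sg=0)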
After training the word2vec model, obtain the word embeddings from the trained model and, finally, print the model.
Program:
from gensim.models import Word2Vec
# training corpus, reconstructed here from the vocabulary printed in the output below
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'], ['and', 'the', 'final', 'sentence']]
model = Word2Vec(sentences, min_count=1)  # train the model; min_count=1 keeps every word
print(model)
# summarizing vocabulary
words = list(model.wv.key_to_index)
print(words)
print(model.wv['is'])  # embedding vector for the word 'is'
model.save('model.bin')  # save the trained model
new_model = Word2Vec.load('model.bin')  # reload the saved model
print(new_model)
Output:
Word2Vec(vocab=14,size=100,alpha=0.025)
['this','is','the','first','sentence','for','word2vec','second','yet','another','one','more','and','final']
[ 5.9599371e-04  3.6903401e-03  2.2744297e-03  5.7322328e-04
 -4.7999555e-03  4.1460539e-03  3.6190548e-03  4.4815554e-03
 -9.4492309e-04 -2.3332548e-03 -7.7754230e-04 -2.0325035e-03
 -4.9208495e-05 -3.8984963e-03  2.2744499e-03  1.9393873e-03
  1.0208354e-03  2.7080898e-03  1.9608904e-03  1.0961948e-03
Aim: Implement text classification using the Naïve Bayes classifier and the TextBlob library.
Description: Text classifiers are systems that classify your texts and divide them into different classes. TextBlob is a Python library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more.
Step 1: Install textblob.
Step 2: Download the data files that TextBlob and NLTK need for their functionality.
Step 3: Train the classifier based on the Naive Bayes classifier (see the sketch below).
Step 4: Test the data using the classifier to get your text classified.
Step 5: Calculate the accuracy of the classifier.
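These steps map directly onto the TextBlob classifier API. The sketch below illustrates steps 3 to 5 end to end; the training sentences are taken from the recorded program, while the test sentences are illustrative placeholders (the original test set is not shown). The recorded program follows.

import nltk
nltk.download('punkt')                      # tokenizer data used by TextBlob
from textblob.classifiers import NaiveBayesClassifier

# Step 3: train on labelled sentences (same format as the train list in the program below)
train = [('What an amazing weather.', 'pos'),
         ('this is an amazing idea!', 'pos'),
         ('I do not like this place', 'neg')]
cl = NaiveBayesClassifier(train)

# Step 4: classify a new sentence
print(cl.classify('this view is awesome'))

# Step 5: accuracy on a labelled test set (placeholder sentences, not the original lab data)
test = [('What a great idea!', 'pos'),
        ('I do not like these ideas', 'neg')]
print(cl.accuracy(test))
cl.show_informative_features(4)             # most informative features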
Program:
!pip install textblob
import nltk
nltk.download('punkt')
from textblob.classifiers import NaiveBayesClassifier
train = [('What an amazing weather.', 'pos'),
('this is an amazing idea!', 'pos'),
('I feel very good about these ideas.', 'pos'),
('this is my best performance.', 'pos'),
("what an awesome view", 'pos'),
('I do not like this place', 'neg')]
# the remaining training examples and the test/accuracy code are cut off in the source
cl = NaiveBayesClassifier(train)  # train the Naive Bayes classifier
cl.show_informative_features(4)   # most informative features
Output:
pos
0.8333333333333334
neg
0.8333333333333334
Most Informative Features
contains(I) = True neg : pos = 2.3 : 1.0
contains(an) = False neg : pos = 2.2 : 1.0
contains(I) = False pos : neg = 1.8 : 1.0
contains(my) = True neg : pos = 1.7 : 1.0
Aim: Demonstrate classification using a Support Vector Machine (SVM) with scikit-learn.
Description: Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges; however, it is mostly used in classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that best differentiates the two classes. Support vectors are the coordinates of individual observations. The SVM classifier is the frontier (hyperplane/line) that best segregates the classes.
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
iris = datasets.load_iris()
X = iris.data[:, :2]  # use only the first two features (sepal length and width)
y = iris.target
C = 1.0  # SVM regularization parameter
# the fitting and plotting lines are missing in the source; a linear-kernel SVC is a reasonable reconstruction
clf = svm.SVC(kernel='linear', C=C).fit(X, y)
xx, yy = np.meshgrid(np.arange(X[:, 0].min() - 1, X[:, 0].max() + 1, 0.02),
                     np.arange(X[:, 1].min() - 1, X[:, 1].max() + 1, 0.02))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)                   # decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')   # training points
plt.show()
Output:
Aim: Convert text to vectors (using term frequency) and apply cosine similarity to measure the closeness between two texts.
Description: Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. Cosine similarity determines how similar two words or sentences are; it can be used for sentiment analysis and text comparison, and it is used by many popular packages such as word2vec.
The dot product is also called the scalar product because the dot product of two vectors gives a scalar result.
For example, Vector(A) = [5, 0, 2] and Vector(B) = [2, 5, 0].
The dot product Vector(A) · Vector(B) = 5*2 + 0*5 + 2*0 = 10 + 0 + 0 = 10.
The smaller the angle between the two vectors, the more similar the documents; the cosine of the angle increases as the angle decreases, since cos 0 = 1 and cos 90 = 0.
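Continuing this example, a short sketch (using NumPy, which the original text does not require) that computes the full cosine similarity rather than just the dot product:

import numpy as np

a = np.array([5, 0, 2])
b = np.array([2, 5, 0])
# cosine similarity = dot(a, b) / (|a| * |b|) = 10 / (sqrt(29) * sqrt(29)) ≈ 0.345
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)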
The first step is to compute the cosine similarity between the documents. To do this, convert the documents/sentences/words into feature vectors first. Useful methods for feature extraction are i) Bag of Words and ii) TF-IDF.
Bag of Words counts the unique words in the documents and the frequency of each word. Scikit-learn's CountVectorizer extracts the Bag of Words features.
The TF-IDF score of a word ranks its importance in a document: tfidf(w) = tf(w) * idf(w), where tf(w) = Number of times the word appears in a document / Total number of words in the document, and idf(w) = log(Number of documents / Number of documents that contain the word w). Use scikit-learn's cosine_similarity function to compare the first document, i.e. Document 0, with the other documents in the corpus.
The output shows the cosine similarities of Document 0 compared with the other documents in the corpus. The first element in the array is 1, which means Document 0 is compared with itself; the next elements, 0.08619387, 0, 0, are the similarities of Document 0 compared with Documents 1, 2, and 3.
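A minimal sketch of this pipeline is shown below; the four sentences are illustrative placeholders (not the original corpus), so the resulting matrices and similarity values will differ from the recorded output. The recorded program and output follow.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# placeholder corpus; the original lab used its own four documents
documents = ['the college offers computer science courses',
             'aiml and cse are courses at the college',
             'computer courses are popular with students',
             'students at the college study computer science']
index = ['Document 0', 'Document 1', 'Document 2', 'Document 3']

# Bag of Words counts
count_vec = CountVectorizer()
bow = count_vec.fit_transform(documents)
print(pd.DataFrame(bow.toarray(), columns=count_vec.get_feature_names_out(), index=index))

# TF-IDF weights
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(documents)
print(pd.DataFrame(tfidf.toarray(), columns=tfidf_vec.get_feature_names_out(), index=index))

# cosine similarity of Document 0 with every document in the corpus
print(cosine_similarity(tfidf[0:1], tfidf))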
Program:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity  # implied by the description above
import pandas as pd                                      # the recorded output is printed as DataFrames
# the rest of the program is cut off in the source; see the sketch above
Output:
aditya aiml at college colleges computer courses cse \
Document 0 1 0 1 1 0 0 0 0
Document 1 0 1 0 0 1 1 1 1
Document 2 0 0 0 0 0 1 0 0
Document 3 0 0 0 0 0 1 0 0
[4 rows x 24 columns]
aditya aiml at college colleges computer \
Document 0 0.421765 0.000000 0.421765 0.421765 0.000000 0.000000
Document 1 0.000000 0.328776 0.000000 0.000000 0.328776 0.209853
Document 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.289152