Group A
Assignment No: 7
----------------------------------------------------------------------------------------------------------------
Title of the Assignment:
1. Extract Sample document and apply following document preprocessing methods:
Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization.
2. Create representation of document by calculating Term Frequency and Inverse Document
Frequency.
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform text analysis using the TF-IDF algorithm.
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basics of Python programming
2. Basics of the English language
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Basic concepts of Text Analytics
2. Text Analysis Operations using natural language toolkit
3. Text Analysis Model using TF-IDF.
4. Bag of Words (BoW)
---------------------------------------------------------------------------------------------------------------
1. Basic concepts of Text Analytics
Text mining is also referred to as text analytics. Text mining is the process of exploring sizable textual data and finding patterns. Text mining processes the text itself, while NLP processes the underlying metadata. Finding frequency counts of words, the length of sentences, and the presence or absence of specific words is known as text mining. Natural language processing is one of the components of text mining. NLP helps with identifying sentiment, finding entities in a sentence, and determining the category of a blog or article. Text mining provides preprocessed data for text analytics, in which statistical and machine learning algorithms are used to classify information.
2. Text Analysis Operations using natural language toolkit
NLTK provides the sent_tokenize() method to split text into sentences and the word_tokenize() method to split text into words; both are demonstrated in the algorithm below.
Lemmatization Vs Stemming
A stemming algorithm works by cutting the suffix from the word; in a broader sense, it cuts either the beginning or the end of the word. Lemmatization, in contrast, performs a morphological analysis of the word: dictionaries are required to look up the proper base form (lemma) of the word.
Stemming is a general operation while lemmatization is an intelligent operation in which the proper form is looked up in a dictionary. Hence, lemmatization helps in forming better machine learning features.
Example: stemming reduces "studies" to "studi", whereas lemmatization returns the dictionary form "study". A short sketch follows below.
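A minimal NLTK sketch of this difference, assuming a few illustrative words (not taken from the original text):

from nltk.stem import PorterStemmer, WordNetLemmatizer

# Illustrative sample words (assumed for this sketch)
words = ["studies", "studying", "leaves", "played"]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in words:
    # Stemming chops suffixes mechanically; lemmatization looks up the dictionary form
    print(w, "-> stem:", stemmer.stem(w), "| lemma:", lemmatizer.lemmatize(w))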
3. Text Analysis Model using TF-IDF
● Term Frequency (TF)
Term frequency measures how often a word occurs in a document:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)
The initial step is to make a vocabulary of unique words and calculate TF for each document. TF will be higher for words that frequently appear in a document and lower for rare words in a document.
● Inverse Document Frequency (IDF)
IDF is a measure of the importance of a word. Term frequency (TF) does not consider the importance of words: words such as 'of', 'and', etc. can be the most frequent yet carry little significance. IDF provides a weightage to each word based on how rarely it occurs across the corpus D:
IDF(t) = ln(total number of documents in D / number of documents containing term t)
After applying TF-IDF, the text in documents A and B can be represented as TF-IDF vectors of dimension equal to the vocabulary size. The value corresponding to each word represents the importance of that word in a particular document.
TF-IDF is the product of TF and IDF. Since TF values lie between 0 and 1, not using ln can result in very high IDF values for rare words, thereby letting IDF dominate the TF-IDF score. We don't want that, and therefore we use ln so that IDF does not completely dominate TF-IDF.
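A small worked sketch of these formulas in plain Python, assuming two illustrative documents (not from the original text):

import math

# Two illustrative documents (assumed for this sketch)
docs = ["the cat sat on the mat", "the dog sat on the log"]
tokenized = [d.split() for d in docs]
vocab = set(w for doc in tokenized for w in doc)

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by the document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Inverse document frequency with the natural log, as discussed above
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

for term in sorted(vocab):
    print(term, [round(tf(term, doc) * idf(term, tokenized), 3) for doc in tokenized])

Words that appear in every document (such as "the") get an IDF of ln(1) = 0, so their TF-IDF score is 0, which is exactly the damping effect described above.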
● Disadvantage of TF-IDF
It is unable to capture the semantics. For example, "funny" and "humorous" are synonyms, but TF-IDF does not capture that. Moreover, TF-IDF can be computationally expensive if the vocabulary is vast.
4. Bag of Words (BoW)
Machine learning algorithms cannot work with raw text directly. Rather, the text must be
converted into vectors of numbers. In natural language processing, a common technique
for extracting features from text is to place all of the words that occur in the text in a
bucket. This approach is called a bag of words model or BoW for short. It’s referred to
as a “bag” of words because any information about the structure of the sentence is lost.
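A minimal bag-of-words sketch in plain Python, assuming two illustrative sentences (not from the original text):

# Illustrative sentences (assumed for this sketch)
sentences = ["the cat sat on the mat", "the dog chased the cat"]

# Vocabulary of unique words across all sentences
vocab = sorted(set(word for s in sentences for word in s.split()))

for s in sentences:
    words = s.split()
    # Count how often each vocabulary word occurs; word order is discarded,
    # which is why this representation is called a "bag" of words
    print([words.count(v) for v in vocab])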
Algorithm for Tokenization, POS Tagging, stop words removal, Stemming and
Lemmatization:
Step 1: Download the required packages
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
Step 2: Initialize the text
text= "Tokenization is the first step in text analytics. The
process of breaking down a text paragraph into smaller chunks
such as words or sentences is called Tokenization."
Step 3: Perform Tokenization
#Sentence Tokenization
from nltk.tokenize import sent_tokenize
tokenized_text= sent_tokenize(text)
print(tokenized_text)
#Word Tokenization
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)
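The listing jumps from Step 3 to Step 5; the remaining preprocessing operations named in the title (POS tagging, stop word removal, stemming, and lemmatization) can be sketched as a possible Step 4 using the packages downloaded in Step 1 (variable names other than tokenized_word are assumptions):

from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# POS tagging: label each word token with its part of speech
pos_tags = pos_tag(tokenized_word)
print(pos_tags)

# Stop word removal: drop common English words such as "is", "the", "in"
stop_words = set(stopwords.words("english"))
filtered_words = [w for w in tokenized_word if w.lower() not in stop_words]
print(filtered_words)

# Stemming: crudely strip suffixes from each word
ps = PorterStemmer()
print([ps.stem(w) for w in filtered_words])

# Lemmatization: look up the dictionary base form of each word
lem = WordNetLemmatizer()
print([lem.lemmatize(w) for w in filtered_words])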
Step 5: Create a dictionary of words and their occurrence for each document in the
corpus
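The code below uses bagOfWordsA, bagOfWordsB, and uniqueWords, which are not defined in the listing above. A minimal setup sketch, reusing the two documents from Assignment Question 2 as examples:

documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the Sun'

# Split each document into a list of words (its "bag of words")
bagOfWordsA = documentA.split(' ')
bagOfWordsB = documentB.split(' ')

# Vocabulary: every unique word occurring in either document
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))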
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
    numOfWordsA[word] += 1

numOfWordsB = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsB:
    numOfWordsB[word] += 1
Step 6: Compute the term frequency for each of our documents.
def computeTF(wordDict, bagOfWords):
    tfDict = {}
    bagOfWordsCount = len(bagOfWords)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bagOfWordsCount)
    return tfDict

tfA = computeTF(numOfWordsA, bagOfWordsA)
tfB = computeTF(numOfWordsB, bagOfWordsB)
Step 7: Compute the Inverse Document Frequency.
def computeIDF(documents):
    import math
    N = len(documents)
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    # Count how many documents contain each word
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1
    # Convert document counts into IDF scores using the natural log
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict

idfs = computeIDF([numOfWordsA, numOfWordsB])
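The final step (combining TF and IDF into the TF-IDF scores) is not shown above; a possible continuation, with the helper name computeTFIDF assumed:

def computeTFIDF(tfBagOfWords, idfs):
    # TF-IDF is the product of a word's TF in a document and its corpus-level IDF
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]
    return tfidf

tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)
print(tfidfA)
print(tfidfB)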
Conclusion:
In this way, we have performed text data analysis using the TF-IDF algorithm.
Assignment Questions:
1) Perform stemming on text = "studies studying cries cry". Compare the results with those generated by lemmatization. Comment on how stemming and lemmatization differ from each other.
2) Write Python code to remove stop words from the documents below, convert the documents into lowercase, and calculate the TF, IDF, and TF-IDF score for each document.
documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the Sun'