Sivasri NLP Lab
NATURAL LANGUAGE PROCESSING & APPLICATIONS 21EC3082
S.No | Date | Experiment Name | Pre-Lab (10M) | In-Lab (25M): Program/Procedure (5M), Data and Results (10M), Analysis & Inference (10M) | Post-Lab (10M) | Viva Voce (5M) | Total (50M) | Faculty Signature
1. Introductory Session -NA-
2. Tokenization_of_text #1
3. Text_2_Sequences #2
4. One_Hot_Encoding #3
5. Vectorization_of_texts #4
6. Databases_how_to_Use #5
7. Parsing_nltk_toolbox #6
8. TF_Testing_fail #7
9. IDF_Why #8
10. TFIDF_Vectorization #9
11. TF_IDF_Failure_meaning #10
12. Distance_Metrics #11
Aim/Objective:
The aim is to compare and evaluate different tokenization techniques and libraries, such as NLTK, spaCy, and TensorFlow, to determine their effectiveness in handling various types of text data.
Description:
Tokenization is the first step in any NLP pipeline. This experiment explores how tokenization with NLTK, spaCy, and TensorFlow can be integrated into a broader NLP workflow or used as a preprocessing step for tasks such as sentiment analysis, machine translation, named entity recognition, and text summarization. The focus is on understanding the impact of tokenization choices on downstream model performance and on comparing the performance characteristics of tokenization in NLTK and TensorFlow.
Pre-Requisites:
1. https://pip.pypa.io/en/stable/installation/
2. https://packaging.python.org/en/latest/tutorials/installing-packages/
3. https://pypi.org/project/nltk/
4. https://www.tensorflow.org/install/pip
5. https://spacy.io/usage
6. https://pypi.org/project/gensim/
Pre-Lab:
This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions
which help the student to understand the Program/Experiment that must be performed in the
Laboratory Session.
In-Lab:
1. Apply tokenization methods from the NLTK library on 5-line text data available in NLTK.
2. Apply tokenization methods from the TensorFlow (TF) library on 5-line text data available in NLTK.
3. Draw comparisons based on their text-handling capabilities.
Procedure/Program:
1. import nltk
from nltk.tokenize import word_tokenize, sent_tokenize, TreebankWordTokenizer
from nltk.tokenize import wordpunct_tokenize, TweetTokenizer
nltk.download('punkt')  # models required by word_tokenize / sent_tokenize
text = 'I love NLP class, fear! #Hope.Grade %10.0% & @job'
print(text.split(','))                          # naive comma split, for contrast
print(word_tokenize(text))                      # standard word tokenizer
print(TreebankWordTokenizer().tokenize(text))   # Penn Treebank rules
print(wordpunct_tokenize(text))                 # splits on punctuation characters
print(TweetTokenizer().tokenize(text))          # keeps #hashtags and @mentions intact
2. import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
text = [
    'I love NLP class, fear! #Hope.Grade %10.0% & @job'
]
tokenizer = Tokenizer(num_words=100)       # keep at most the 100 most frequent words
tokenizer.fit_on_texts(text)               # build the word index from the corpus
print(tokenizer.get_config())              # word counts, word index and filter settings
print(tokenizer.texts_to_sequences(text))  # the same text as a sequence of integer IDs
3.
Comparison
1. NLTK:
NLTK provides specialized tokenization functions for both sentence and word tokenization. It is a comprehensive NLP library with a wide range of text-processing tools and resources. NLTK is easy to use and widely adopted in the NLP community for research and education.
2. TensorFlow:
TensorFlow offers the TextVectorization layer, which can tokenize text as part of a deep learning pipeline. It is integrated into TensorFlow's ecosystem, making it suitable for building deep learning models for NLP. TensorFlow is more suitable for scenarios where tokenization is part of a broader deep learning workflow.
1. ['I', 'love', 'NLP', 'class', ',', 'fear', '!', '#Hope', '.', 'Grade', '%', '10.0', '%', '&', '@job']
Ultimately, the choice between NLTK and TensorFlow for tokenization depends on your
specific project requirements, your familiarity with the libraries, and the scale of your NLP
task. Both libraries have their strengths and are valuable tools in the field of Natural
Language Processing.
1. What is tokenization?
A) The process of breaking down a text or a sequence of characters into smaller units, known as tokens.
2. Based on your experiments, which tokenizer API is the best?
A) For small-scale NLP projects and educational purposes, NLTK is a good choice. For larger-scale NLP projects, TensorFlow is a good choice.
3. How do NLTK and TensorFlow handle tokenization for different languages?
A) NLTK: multilingual tokenization, language identification, customization, and word segmentation (for languages such as Chinese and Japanese).
TensorFlow: the TextVectorization layer, character-level tokenization, customization, and pre-trained models.
4. List the metrics used to evaluate tokenization techniques.
A) Token accuracy, sentence accuracy, F1-score, BLEU (Bilingual Evaluation Understudy), perplexity, edit distance, speed and efficiency, domain-specific metrics, error analysis, and human evaluation.
5. Can you tokenize multiple text documents simultaneously using TensorFlow?
A) Yes. TensorFlow provides the TextVectorization layer, which is a powerful tool for tokenization and can be adapted to handle multiple text documents at once, as the sketch below illustrates.
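To make the last answer concrete, here is a minimal sketch (assuming TensorFlow 2.x; the three documents are only illustrative) of tokenizing several documents in a single call with the TextVectorization layer:
import tensorflow as tf

docs = tf.constant([
    "I love NLP class.",
    "Tokenization splits text into tokens.",
    "TensorFlow can vectorize many documents at once.",
])
# TextVectorization lower-cases, strips punctuation and splits on whitespace by default
vectorizer = tf.keras.layers.TextVectorization(max_tokens=100, output_sequence_length=8)
vectorizer.adapt(docs)               # build the vocabulary from all documents
print(vectorizer.get_vocabulary())   # index 0 = padding, index 1 = out-of-vocabulary
print(vectorizer(docs).numpy())      # every document tokenized and padded in one call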
Post-Lab:
1. Try tokenization in the spaCy library and compare it with NLTK and TensorFlow.
2. Try tokenization on the large corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
Procedure/Program:
This Section is meant for the student to Write the program/Procedure for Experiment
1. import spacy
# Load the small English pipeline (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
# Sample text
text = ("Tokenization is an important NLP task. "
        "It involves splitting text into words, subwords, or characters.")
# Tokenize using spaCy
doc = nlp(text)
# Extract tokens
spacy_tokens = [token.text for token in doc]
print(spacy_tokens)
Comparison:
spaCy provides very detailed tokenization, including identifying parts of speech, named entities, and more. It is excellent for fine-grained linguistic analysis.
NLTK offers a simple and effective word tokenization method. It is easy to use and suitable for basic tokenization needs.
The choice of tokenization library depends on your specific NLP task, requirements, and
existing infrastructure. All three libraries are valuable in their own right and excel in
different use cases.
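As an illustration of the fine-grained analysis mentioned above, a minimal spaCy sketch (assuming the en_core_web_sm model is installed; the sentence is only an example):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
# Each token carries part-of-speech and entity annotations alongside its text
for token in doc:
    print(token.text, token.pos_, token.ent_type_)
# Named entities grouped as spans
print([(ent.text, ent.label_) for ent in doc.ents])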
Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.
Aim/Objective:
The aim is to evaluate different techniques or libraries, such as NLTK, SpaCy, and TensorFlow, to
determine their effectiveness in converting text to a sequence of numbers.
Description:
Converting text to sequences maps each token to an integer ID so that sentences become arrays of numbers; because sentences differ in length, the resulting sequences are padded to a common length before being fed to ML models.
Pre-Requisites:
1. https://pip.pypa.io/en/stable/installation/
2. https://packaging.python.org/en/latest/tutorials/installing-packages/
3. https://pypi.org/project/nltk/
4. https://www.tensorflow.org/install/pip
5. https://spacy.io/usage
6. https://pypi.org/project/gensim/
Pre-Lab:
This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions
which help the student to understand the Program/Experiment that must be performed in the
Laboratory Session.
3. Are all sentences in the text considered to have the same length? If not, what did you do?
A) No, I used a padding technique to equalize the lengths.
In-Lab:
1. Apply tokenization and convert a sequence of sentences in the NLTK library to a sequence
of numbers.
2. Convert a 10-sentence dataset with sentences of different lengths into a number array of equal size for ML model training.
Procedure/Program:
This Section is meant for the student to Write the program/Procedure for the Experiment
1) import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
# Sample sequence of sentences
text = """
This is the first sentence. Here is the second one. And this is the third sentence.
"""
# Split into sentences, then into words
tokenized_sentences = [word_tokenize(s) for s in sent_tokenize(text)]
# Convert words to numbers (you can assign numeric IDs or use word embeddings).
# For simplicity, we use the position of each word in its sentence as its number.
word_ids = [list(range(len(words))) for words in tokenized_sentences]
print(tokenized_sentences)
print(word_ids)
2) import tensorflow as tf
import numpy as np
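The imports above are only the start of program 2; a minimal sketch of the remaining steps (the ten sentences here are placeholders, not necessarily the ones used to produce the output below):
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Placeholder 10-sentence dataset with sentences of different lengths
sentences = [
    "This is the first sentence.", "Here is the second one.",
    "Short one.", "A third sentence.", "This one is a little bit longer.",
    "Tiny text.", "A fourth.", "Short and sweet.", "Another example here.",
    "This is the last sentence.",
]
tokenizer = Tokenizer()                 # map each word to an integer ID
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
# Pad with zeros on the right so every row has the same length
padded = pad_sequences(sequences, maxlen=10, padding="post")
print(padded)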
2) [[ 6 8 3 19 0 0 0 0 0 0]
[ 7 4 3 9 0 0 0 0 0 0]
[ 3 2 0 0 0 0 0 0 0 0]
[ 1 9 8 0 0 0 0 0 0 0]
[ 8 3 2 12 4 5 9 0 0 0]
[ 3 2 0 0 0 0 0 0 0 0]
[ 1 9 0 0 0 0 0 0 0 0]
[ 3 2 7 12 0 0 0 0 0 0]
[ 6 8 3 1 5 0 0 0 0 0]]
Post-Lab:
import re
In the experiment where we tried the normalization of converted numbers from text data,
we applied two types of normalization: scaling to a specific range and converting numbers
to a common format. Let's analyze the key points and provide some insights:
Scaling Numbers to a Specific Range:
Purpose: Scaling numbers to a specific range, such as [0, 1], is useful when you want to ensure that all numbers have similar magnitudes, making them comparable.
Method: We used a regular expression to extract numbers from the text, converted them to
floats, and then applied a scaling function to normalize them.
Output: The numbers were scaled to the range [0, 1] based on their relative magnitudes
within the text.
Converting Numbers to a Common Format:
Purpose: Converting numbers to a common format, such as integers, can be helpful when
you want to work with numbers in a specific way, like for counting or indexing.
Method: We used a regular expression to extract numbers from the text and then converted
them to integers.
Output: The numbers were converted to integers, making them suitable for integer-based
operations.
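A minimal sketch of both normalizations (the sample string is illustrative; scaling uses min-max normalization):
import re

text = "Product A costs 250 units, product B costs 1000 units, and product C costs 75 units."
# Extract all numbers from the text as floats
numbers = [float(n) for n in re.findall(r"\d+\.?\d*", text)]
# Scaling to [0, 1] with min-max normalization
lo, hi = min(numbers), max(numbers)
scaled = [(n - lo) / (hi - lo) for n in numbers]
# Converting to a common format: plain integers
as_ints = [int(n) for n in numbers]
print(numbers)   # [250.0, 1000.0, 75.0]
print(scaled)    # approximately [0.189, 1.0, 0.0]
print(as_ints)   # [250, 1000, 75]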
Inference:
Normalization of numbers extracted from text data can be valuable in data preprocessing
and analysis to ensure that the numbers are in a consistent and comparable format.
Scaling to a specific range, such as [0, 1], is beneficial when you want to maintain the relative
relationships between numbers but standardize their magnitudes.
Converting numbers to a common format, such as integers, can simplify further data
processing and calculations.
Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.
Aim/Objective:
The aim is to convert text into numbers and then encode those numbers as one-hot encodings for downstream NLP tasks using NLTK, spaCy, and TensorFlow.
Description:
One hot encoding of text data is a process of transforming categorical data, such as words or symbols,
into numerical data that can be used by machine learning models. It involves creating a binary
vector for each categorical value, where only one element is 1 and the rest are 0. The length of the
vector is equal to the number of unique categories in the data. One hot encoding allows the
representation of categorical data as multidimensional binary vectors that can be fed to models
that require numerical input.
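As a concrete illustration (the vocabulary and sentence are chosen only for the example), a 5-word vocabulary gives 5-dimensional one-hot vectors:
# Toy vocabulary: each unique word gets one position in the binary vector
vocab = ["i", "love", "nlp", "class", "fear"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("nlp"))                              # [0, 0, 1, 0, 0]
print([one_hot(w) for w in "i love nlp".split()])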
Pre-Requisites:
1. https://pip.pypa.io/en/stable/installation/
2. https://packaging.python.org/en/latest/tutorials/installing-packages/
3. https://pypi.org/project/nltk/
4. https://www.tensorflow.org/install/pip
5. https://spacy.io/usage
6. https://pypi.org/project/gensim/
Pre-Lab:
This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions
which help the student to understand the Program/Experiment that must be performed in the
Laboratory Session.
In-Lab:
1. Apply one-hot encoding: convert a sequence of sentences in the NLTK library to a sequence of numbers and then to OHE.
2. Convert a 10-sentence dataset with sentences of different lengths into an OHE array of equal size for ML model training.
Procedure/Program:
1) import nltk
from nltk import word_tokenize
import pandas as pd
nltk.download('punkt')
# Sample sentences
sentences = [
    "This is the first sentence.",
    "Here's the second sentence.",
    "And this is the third sentence.",
]
# Tokenize and map each word to an integer ID
tokens = [word_tokenize(s.lower()) for s in sentences]
vocab = sorted({w for sent in tokens for w in sent})
word_to_id = {w: i for i, w in enumerate(vocab)}
sequences = [[word_to_id[w] for w in sent] for sent in tokens]
print(sequences)
# One-hot encode every token occurrence against the vocabulary
ohe = pd.get_dummies(pd.Series([w for sent in tokens for w in sent]))
print(ohe)
2) import nltk
# Sample sentences
sentences = [
    "This is the first sentence.",
    "Here's the second sentence.",
    "And this is the third sentence.",
    "A short one.",
    "Another short one.",
    "This is a longer sentence with more words.",
    "Yet another example sentence.",
    "A very short sentence.",
    "A medium-length sentence goes here.",
    "This is the last sentence in the dataset.",
]
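One possible way to finish task 2 (a sketch using Keras utilities; the choice of maxlen=10 is an assumption):
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)                    # reuse the 10 sentences above
sequences = tokenizer.texts_to_sequences(sentences)  # words -> integer IDs
padded = pad_sequences(sequences, maxlen=10, padding="post")
# One extra class for the padding ID 0; result shape: (10, 10, vocab_size + 1)
ohe = to_categorical(padded, num_classes=len(tokenizer.word_index) + 1)
print(ohe.shape)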
Post-Lab:
1. Try using the OHE data for training a simple neural network model.
2. Try text-to-OHE on the large corpus dataset given below and train an ANN model.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
Procedure/Program:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# A minimal model over 4-dimensional OHE inputs (e.g. Cat/Dog/Bird/Fish categories)
model = Sequential([
    Dense(8, activation="relu", input_shape=(4,)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")
# (training on the OHE data and labels would go here)
# Make predictions (you can replace this with your own test data)
test_data = np.array([[1, 0, 0, 0]])  # OHE for a Cat
predictions = model.predict(test_data)
print(predictions)
In the experiment where we tried using one-hot encoding (OHE) data for training a simple
neural network model, we utilized a basic neural network to demonstrate the concept.
Here's an analysis and inference:
Analysis:
Model Architecture: We created a simple neural network model with one hidden layer and
one output layer. The choice of model architecture can vary based on the specific problem
and dataset. In practice, you may need to adjust the model's complexity and depth
according to the complexity of the data.
Loss Function and Optimization: We used mean squared error (MSE) as the loss function
and the Adam optimizer for model training. The choice of loss function and optimizer
depends on the nature of your problem (e.g., regression or classification) and should be
chosen accordingly.
Training: The model was trained using the provided OHE data and corresponding labels.
During training, the loss decreased with each epoch, which is a common pattern in training
neural networks.
Inference:
Predictions: After training, the model can be used to make predictions for new data. In our
example, we made predictions for a new OHE data point (test_data).
Output: The output of the model predictions will be a numerical value (e.g., regression
output) or class probabilities (e.g., classification output). The specific output depends on
the nature of your problem.
Aim/Objective:
The aim is to convert text into vectors by computing term frequencies and create a corpus.
Description:
The objective is to convert text to a sequence of numbers using a term-frequency (TF) vectorizer. The primary goal of this conversion is to represent textual data in a numerical format that machine learning models can process effectively, enabling the application of machine learning and NLP techniques to text.
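A minimal sketch of term-frequency vectorization (using scikit-learn's CountVectorizer purely as an illustration; the in-lab programs below use NLTK's FreqDist instead):
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "this is the first sentence",
    "this sentence is the second sentence",
]
vectorizer = CountVectorizer()                 # counts how often each term occurs
tf_matrix = vectorizer.fit_transform(corpus)   # sparse document-term matrix
print(vectorizer.get_feature_names_out())      # the learned vocabulary
print(tf_matrix.toarray())                     # raw term frequencies per document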
Pre-Requisites:
1. https://pip.pypa.io/en/stable/installation/
2. https://packaging.python.org/en/latest/tutorials/installing-packages/
3. https://pypi.org/project/nltk/
4. https://www.tensorflow.org/install/pip
5. https://spacy.io/usage
6. https://pypi.org/project/gensim/
Pre-Lab:
This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions
which help the student to understand the Program/Experiment that must be performed in the
Laboratory Session.
In-Lab:
1. Apply tokenization and convert a sequence of sentences in the NLTK library to a sequence
of numbers. Use those sequences and calculate term frequencies for representing text
data on a small corpus.
2. Convert a 10-sentence dataset with sentences of different lengths into TF representations and compare them with OHE.
Procedure/Program:
1) import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
nltk.download('punkt'); nltk.download('stopwords')
# Sample sentences
sentences = [
    "This is the first sentence. It contains some words.",
]
# Tokenize, drop stopwords, and count term frequencies
words = [w.lower() for s in sentences for w in word_tokenize(s) if w.isalpha()]
words = [w for w in words if w not in stopwords.words('english')]
freq = FreqDist(words)
print(freq.most_common())  # term frequencies over the small corpus
2) import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import pandas as pd
# Sample sentences
sentences = [
    "This is the first sentence.",
    "Here's the second one.",
    "And this is the third sentence.",
    "A short one.",
    # ... (remaining sentences of the 10-sentence dataset)
]
2)
Tokenization is a crucial preprocessing step in NLP for breaking down text into analyzable
units.
The choice between TF and OHE depends on the specific task and the type of information
needed for analysis.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is an example sentence.")
for token in doc:
    print(f"{token.text}: {token.vector}")
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Small labelled sample (0 = negative, 1 = positive); replace with your own dataset
texts = ["I love this product", "This is terrible", "Absolutely wonderful", "Not good at all"]
labels = np.array([1, 0, 1, 0])
# Tokenize and pad the training texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, maxlen=10, padding="post", truncating="post")
# Build an ANN model
model = Sequential()
model.add(Embedding(input_dim=len(word_index) + 1, output_dim=16, input_length=10))
model.add(GlobalAveragePooling1D())
model.add(Dense(16, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(padded_sequences, labels, epochs=10, verbose=0)
# Make predictions
test_texts = ["A new positive sentence.", "A new negative sentence."]
test_sequences = tokenizer.texts_to_sequences(test_texts)
test_padded_sequences = pad_sequences(test_sequences, maxlen=10, padding="post", truncating="post")
print(model.predict(test_padded_sequences))
Analysis:
Data Preparation: We started with a small sample of text data along with
corresponding binary labels (0 for negative, 1 for positive). In practice, you would use a
larger and more diverse dataset specific to your NLP task.
Text Preprocessing: Text preprocessing is a crucial step. We tokenized the text using the
Tokenizer class and converted it into numerical sequences. Additionally, we used padding
to ensure uniform sequence lengths. Proper preprocessing is essential to feed text data
into neural networks effectively.
Model Architecture: We built a simple ANN model for text classification. The model
included an embedding layer to convert words into dense vectors, a global average
pooling layer to aggregate word embeddings, and two dense layers. This architecture is a
starting point and can be adjusted based on the complexity of the task.
Loss Function and Optimization: We compiled the model with binary cross-entropy loss,
which is common for binary classification tasks. The Adam optimizer was used for model
training. The choice of loss function and optimizer may vary depending on the specific
problem.
Training: We trained the model on the transformed text data. In this small example, we
used a limited number of epochs (10). In real-world scenarios, you would typically train the
model for many more epochs to achieve better performance.
Inferences:
Start with Simple Models: In this experiment, we used a simple ANN model as a
starting point. It's often a good practice to begin with a basic architecture and
gradually increase complexity based on the performance on validation data.
Model Tuning: Depending on your specific NLP task, you may need to adjust the
model architecture, hyperparameters (e.g., learning rate, batch size), and use
techniques like dropout and regularization to improve performance.
Scaling: For more complex NLP tasks, you may consider using pre-trained models like
BERT, GPT, or their variants, which have achieved state-of-the-art performance on
various NLP benchmarks.
Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.
Aim/Objective:
The aim is to use the online resources of text data to test NLP applications.
Description:
A text corpus is a large and structured collection of texts, typically stored in a digital format, that
serves as a linguistic resource for language analysis and research. It consists of a diverse range of
written or spoken texts from various sources and domains, such as books, articles, newspapers,
websites, social media, conversations, and more.
Pre-Requisites:
1. https://pip.pypa.io/en/stable/installation/
2. https://packaging.python.org/en/latest/tutorials/installing-packages/
3. https://pypi.org/project/nltk/
4. https://www.tensorflow.org/install/pip
5. https://spacy.io/usage
6. https://pypi.org/project/gensim/
Pre-Lab:
1. How can I create a text corpus from a collection of documents using Python?
A) Collect Your Documents
Read and Extract Text
Preprocess Text Data
Organize into a Dataset
Save the Corpus
2. What Python libraries can I use to tokenize and preprocess text data for corpus creation?
A) NLTK (Natural Language Toolkit)
spaCy
TextBlob
scikit-learn
Gensim
Pattern
Hugging Face Transformers (for deep learning)
3. How can I handle different file formats (e.g., PDF, Word documents) when building a text corpus?
A) Adapt the code examples to your specific use case and file paths. Additionally, consider handling errors and exceptions that may occur when working with different file formats to ensure the robustness of your corpus creation process.
4. What are the steps involved in cleaning and preprocessing text data for corpus creation?
A) Text lowercasing, tokenization, stop word removal, special character and number removal, stemming or lemmatization, handling contractions and abbreviations, removing HTML tags and URLs, handling missing data, and custom cleaning steps.
5. How can I remove stopwords and punctuation from text documents when creating a
corpus in Python?
A) filtered_words = [token.text for token in doc if not token.is_stop and token.text not in
string.punctuation]
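To make the last answer self-contained, a minimal sketch (assuming the en_core_web_sm spaCy model is installed; the sample text is illustrative):
import string
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a small, illustrative document: it has stopwords and punctuation!")
# Keep tokens that are neither stopwords nor punctuation characters
filtered_words = [token.text for token in doc
                  if not token.is_stop and token.text not in string.punctuation]
print(filtered_words)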
In-Lab:
1. From the NLTK library, download and apply the WordNet package from the built-in corpora. Extract the requirements from a text dataset and tokenize the text.
2. From spaCy, use the en_core_web_sm (English, small) model and tokenize the text.
Procedure/Program:
1) import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('wordnet'); nltk.download('punkt')

def extract_requirements(sentence):
    requirements = []
    words = word_tokenize(sentence.lower())
    for word in words:
        # Check if the word is a synonym of "requirement"
        if wordnet.synsets(word, pos=wordnet.NOUN) and \
           word in wordnet.synsets('requirement', pos=wordnet.NOUN)[0].lemma_names():
            requirements.append(word)
    return requirements

print(extract_requirements("The key requirement of this dataset is clean text."))
2) import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is an example sentence. Tokenize me, please.")
tokens = [token.text for token in doc]
# Print the tokens
print(tokens)
1)
2) ['This', 'is', 'an', 'example', 'sentence', '.', 'Tokenize', 'me', ',', 'please', '.']
1. What types of text resources does NLTK provide?
A) Corpora
Lexical Resources
Sample Texts
Non-English Corpora
Custom Corpora
2. Can you give an example of a text dataset available in NLTK?
A) The Treebank corpus, the Webtext corpus, and WordNet.
3. How can you access and explore the content of a text dataset in NLTK?
A) nltk.download('gutenberg')
from nltk.corpus import gutenberg
text_ids = gutenberg.fileids()  # get the list of text IDs
print(text_ids)
Post-Lab:
1. Try to encode the WordNet text into TF vectors and OHE. Measure the memory occupied by each representation.
2. Try to find some text datasets available online and load them into your current program.
Procedure/Program:
import sys
import nltk
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
nltk.download('wordnet')
# Build a small corpus from WordNet synset definitions (an illustrative choice)
wordnet_text = [syn.definition() for syn in list(wordnet.all_synsets())[:1000]]
# Create TF vectors
tf_vectorizer = CountVectorizer()
tf_matrix = tf_vectorizer.fit_transform(wordnet_text)
print("TF (sparse) size in bytes:", tf_matrix.data.nbytes)
# Dense one-hot style matrix for comparison
ohe_matrix = (tf_matrix.toarray() > 0).astype(np.uint8)
print("OHE (dense) size in bytes:", ohe_matrix.nbytes)
In the experiment where we encoded WordNet text into TF vectors (Term Frequency)
and One-Hot Encoding (OHE) and measured the corpus size occupied by these encodings in
memory, here are the analysis and inferences:
Analysis:
Encoding Techniques: We applied two common text encoding techniques, TF vectors and
OHE, to represent WordNet text data. These techniques are fundamental for preparing text data for
various Natural Language Processing (NLP) tasks.
Inferences:
Use Case Considerations: The choice between TF vectors and OHE depends on the
specific use case and the nature of the data. TF vectors are commonly used for text classification
and information retrieval tasks where term frequency information is essential, and memory
efficiency is a concern. OHE may be necessary for tasks that require binary input, such as some
deep learning models.
Vocabulary Size: The memory requirements of OHE can increase significantly with the size
of the vocabulary. If you have a large corpus with many unique terms, OHE may become
impractical due to its high memory consumption.
Library Efficiency: Utilizing libraries like scikit-learn for TF vectorization and NumPy for
OHE matrix creation is a good practice. These libraries are optimized for memory efficiency and
provide efficient implementations for common encoding tasks.
Trade-Offs: The choice of encoding method often involves trade-offs between memory
efficiency, computational complexity, and the specific requirements of your NLP task. It's
important to consider these factors when selecting an encoding technique.
Scaling: For large-scale NLP tasks and extensive corpora, advanced techniques like
word embeddings (e.g., Word2Vec, GloVe) or transformer-based models (e.g., BERT) are
preferred, as they provide memory-efficient and semantically rich representations.
Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.