
NATURAL LANGUAGE PROCESSING & APPLICATIONS
21EC3082

STUDENT ID: 2100040324
STUDENT NAME: T.SIVASRI
ACADEMIC YEAR: 2023-24
Table of Contents

1. Session 01: Introductory Session (NA)
2. Session 02: Tokenization_of_text #1
3. Session 03: Text_2_Sequences #2
4. Session 04: One_Hot_Encoding #3
5. Session 05: Vectorization_of_texts #4
6. Session 06: Databases_how_to_Use #5
7. Session 07: Parsing_nltk_toolbox #6
8. Session 08: TF_Testing_fail #7
9. Session 09: IDF_Why #8
10. Session 10: TFIDF_Vertorization #9
11. Session 11: TF_IDF_Failure_meaning #10
12. Session 12: Distance_Metrics #11
13. Session 13: Word_similarities_nltk #12
14. Session 14: Document_recognition_tfidf_vectors #13 (Adv/Peer)
15. Session 15: Zipf's_Law_nlp #14 (Adv/Peer)
16. Session 16: Simple_topic_modelling_ex #15 (Adv/Peer)
17. Session 17: PCA_From_SCratch #16 (Adv/Peer)
18. Session 18: Singular_Value_Decomposition_SVD_Ex #17 (Adv/Peer)
19. Session 19: Latent_Semantic_Analysis_SVD #18 (Adv/Peer)
20. Session 20: spam_dect_class #19 (Adv/Peer)
21. Session 21: Sentiment_Analysis_RNN #20 (Adv/Peer)

https://github.com/pvvkishore/NLP-A_LAB_2023 : Code for the entire lab sessions.


A.Y. 2023-24 LAB/SKILL CONTINUOUS EVALUATION

Each experiment is evaluated for 50M: Pre-Lab (10M), In-Lab (25M: Program/Procedure 5M, Data and Results 10M, Analysis & Inference 10M), Post-Lab (10M), and Viva Voce (5M), with Date and Faculty Signature recorded against each entry.

S.No  Experiment Name
1.    Introductory Session (-NA-)
2.    Tokenization_of_text #1
3.    Text_2_Sequences #2
4.    One_Hot_Encoding #3
5.    Vectorization_of_texts #4
6.    Databases_how_to_Use #5
7.    Parsing_nltk_toolbox #6
8.    TF_Testing_fail #7
9.    IDF_Why #8
10.   TFIDF_Vertorization #9
11.   TF_IDF_Failure_meaning #10
12.   Distance_Metrics #11
13.   Word_similarities_nltk #12
14.   Document_recognition_tfidf_vectors #13 (Adv/Peer)
15.   Zipf's_Law_nlp #14 (Adv/Peer)
16.   Simple_topic_modelling_ex #15 (Adv/Peer)
17.   PCA_From_SCratch #16 (Adv/Peer)
18.   Singular_Value_Decomposition_SVD_Ex #17 (Adv/Peer)
19.   Latent_Semantic_Analysis_SVD #18 (Adv/Peer)
20.   spam_dect_class #19 (Adv/Peer)
21.   Sentiment_Analysis_RNN #20 (Adv/Peer)
Experiment # Student ID 2100040324
Date Student Name T.SIVASRI

Experiment Title: Tokenization_of_text


Aim/Objective:

The aim is to compare and evaluate different tokenization techniques or libraries, such as NLTK,
SpaCy, and TensorFlow, to determine their effectiveness in handling various types of text data.

Description:

Tokenization is the first step in any NLP model. The experiment explores how tokenization using NLTK, spaCy, and TensorFlow can be integrated into a broader NLP pipeline or used as a preprocessing step for tasks such as sentiment analysis, machine translation, named entity recognition, or text summarization. The focus is on understanding the impact of tokenization choices on downstream model performance and on analyzing the performance characteristics of tokenization using NLTK and TensorFlow.

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following resources:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions
which help the student to understand the Program/Experiment that must be performed in the
Laboratory Session.

1. What is tokenization in the context of NLP?


A) Process of breaking down a text or a sequence of characters into smaller units, known as
tokens.
2. How can you tokenize a sentence into individual words using NLTK?
A) from nltk.tokenize import word_tokenize
   words = word_tokenize(sentence)
3. What is the purpose of tokenizing text in NLP?
A) Text processing, feature extraction, text analysis
4. Name a few tokenization techniques other than word tokenization.
A) Subword tokenization, sentence tokenization, character tokenization, N-gram tokenization
5. How can you tokenize a text document into sentences using NLTK?
A) from nltk.tokenize import sent_tokenize
   sentences = sent_tokenize(text)

In-Lab:

1. Apply tokenization methods in the NLTK library on a 5-line text data available in NLTK.
2. Apply tokenization methods in the TF library on a 5-line text data available in NLTK.
3. Draw comparisons based on text handling capabilities.

 Procedure/Program:

1. import nltk
from nltk.tokenize import word_tokenize, sent_tokenize, TreebankWordTokenizer
from nltk.tokenize import wordpunct_tokenize, TweetTokenizer, MWETokenizer

text = 'I love NLP class, fear! #Hope.Grade %10.0% & @job'

# Plain string split on commas, for comparison
print(text.split(','))

# Punctuation-based tokenization
print(wordpunct_tokenize(text))

# Treebank word tokenizer
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(text))

# Tweet-aware tokenizer (keeps hashtags and @mentions intact)
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)

2. import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer

text = [
    'I love NLP class, fear! #Hope.Grade %10.0% & @job'
]

# Fit a Keras Tokenizer on the text and inspect its configuration
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(text)
print(tokenizer.get_config())
3. Comparison

1. NLTK:
 NLTK provides specialized tokenization functions for both sentence and word tokenization.
 It is a comprehensive NLP library with various text processing tools and resources.
 NLTK is easy to use and widely adopted in the NLP community for research and education.

2. TensorFlow:
 TensorFlow offers the TextVectorization layer, which can tokenize text as part of a deep learning pipeline.
 It is integrated into TensorFlow's ecosystem, making it suitable for building deep learning models for NLP.
 TensorFlow is more suitable for scenarios where tokenization is part of a broader deep learning workflow.

 Data and Results:

1. ['I', 'love', 'NLP', 'class', ',', 'fear', '!', '#Hope', '.', 'Grade', '%', '10.0', '%', '&', '@job']

2. 'word_index': '{"i": 1, "love": 2, "nlp": 3, "class": 4, "fear": 5, "hope": 6, "grade": 7, "10": 8, "0": 9, "@job": 10}'}

 Analysis and Inferences:

Ultimately, the choice between NLTK and TensorFlow for tokenization depends on your
specific project requirements, your familiarity with the libraries, and the scale of your NLP
task. Both libraries have their strengths and are valuable tools in the field of Natural
Language Processing.


Sample VIVA-VOCE Questions (In-Lab):

1. What is tokenization?
A) The process of breaking down a text or a sequence of characters into smaller units, known as tokens.
2. According to your experiment, which tokenizer API is the best?
A) If you're working on a small-scale NLP project or for educational purposes, NLTK is a good choice. If you're working on a larger-scale NLP project, TensorFlow is a good choice.
3. How do NLTK and TensorFlow handle tokenization for different languages?
A) For NLTK: multilingual tokenization, language identification, customization, word segmentation (for languages like Chinese and Japanese).
For TensorFlow: TextVectorization layer, character-level tokenization, customization, pre-trained models.
4. List the metrics used to evaluate tokenization techniques.
A) Token accuracy, sentence accuracy, F1-score, BLEU (Bilingual Evaluation Understudy), perplexity, edit distance, speed and efficiency, domain-specific metrics, error analysis, human evaluation.
5. Can you tokenize multiple text documents simultaneously using TensorFlow?
A) Yes, you can tokenize multiple text documents simultaneously using TensorFlow. TensorFlow provides the TextVectorization layer, which is a powerful tool for tokenization, and it can be adapted to handle multiple text documents at once.
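The answer above can be illustrated with a minimal, hedged sketch (the sample documents below are made-up placeholders, not part of the lab data):

import tensorflow as tf

# Adapt a TextVectorization layer on several documents and tokenize them in one call.
docs = tf.constant([
    "I love NLP class",
    "Tokenization is the first step",
    "Deep learning models need numbers",
])
vectorizer = tf.keras.layers.TextVectorization(output_mode="int")
vectorizer.adapt(docs)           # build the vocabulary from all documents
print(vectorizer(docs).numpy())  # one row of integer token IDs per document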


Post-Lab:

1. Try tokenization in the spaCy library and compare with NLTK and TensorFlow.
2. Try tokenization on the big corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
 Procedure/Program:

This Section is meant for the student to Write the program/Procedure for Experiment

1. import spacy

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Tokenization is an important NLP task. It involves splitting text into words, subwords, or characters."

# Tokenize using spaCy
doc = nlp(text)

# Extract tokens
spacy_tokens = [token.text for token in doc]

# Print spaCy tokens
print("Tokens (spaCy):", spacy_tokens)

Comparison:

spaCy provides very detailed tokenization, including identifying parts of speech, named entities, and more. It's excellent for fine-grained linguistic analysis.

NLTK offers a simple and effective word tokenization method. It's easy to use and suitable for basic tokenization needs.

TensorFlow's TextVectorization is flexible and efficient for tokenization, particularly when integrated into deep learning pipelines. It's suitable for large-scale NLP tasks.
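To make the comparison concrete, here is a minimal side-by-side sketch (the sample sentence is reused from the program above; NLTK's word_tokenize assumes the punkt data has been downloaded with nltk.download('punkt')):

import spacy
from nltk.tokenize import word_tokenize
from tensorflow.keras.preprocessing.text import text_to_word_sequence

text = "Tokenization is an important NLP task."
nlp = spacy.load("en_core_web_sm")

print("spaCy:", [token.text for token in nlp(text)])
print("NLTK :", word_tokenize(text))
print("Keras:", text_to_word_sequence(text))  # lowercases and strips punctuation

The Keras helper lowercases and drops punctuation by default, while spaCy and NLTK keep punctuation as separate tokens, which is exactly the text-handling difference noted above.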

 Data and Results:


1. Tokens (spaCy): ['Tokenization', 'is', 'an', 'important', 'NLP', 'task', '.', 'It', 'involves',
'splitting', 'text', 'into', 'words', ',', 'subwords', ',', 'or', 'characters', '.']


 Analysis and Inferences:

The choice of tokenization library depends on your specific NLP task, requirements, and
existing infrastructure. All three libraries are valuable in their own right and excel in
different use cases.

Evaluator Remark (if Any):

Marks Secured: out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment # Student ID 2100040324
Date Student Name T.SIVASRI

Experiment Title: Text_2_Sequences

Aim/Objective:

The aim is to evaluate different techniques or libraries, such as NLTK, SpaCy, and TensorFlow, to
determine their effectiveness in converting text to a sequence of numbers.

Description:

Converting text to a sequence of numbers is a fundamental step in natural language processing (NLP) tasks. The primary goal of this conversion is to represent textual data in a numerical format that machine learning models can process effectively, enabling the application of machine learning and NLP techniques.

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following resources:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions
which help the student to understand the Program/Experiment that must be performed in the
Laboratory Session.

1. Why convert text to numbers?


A) To apply Machine Learning Algorithms, Vector Space Representation, Feature Extraction,
Semantic Meaning, Scalability, Machine Learning Model Compatibility, Mathematical
Operations, Generalization, Efficiency

2. How effective is the method used by you?


A) Bag of Words (BoW) / Term Frequency-Inverse Document Frequency (TF-IDF):
Effectiveness: BoW and TF-IDF are simple and effective methods for text-to-number
conversion. They work well for tasks like text classification, sentiment analysis, and
information retrieval.

3. Are all sentences in the text considered to have the same length? If no, what did you do?
A) No, I used a padding technique to equalize the lengths (a short padding sketch appears after these questions).

4. In NLTK, which function is used to assign numeric IDs to tokens?


A) tf_tokens = vectorizer(string)

5. What is the difference between word tokenization and sentence tokenization?


A) Word tokenization splits text into words or subword units, while sentence tokenization
splits text into sentences.
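A minimal padding sketch, as referenced in Q3 (the integer sequences below are made-up placeholders, not lab data):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical token-ID sequences of unequal length
sequences = [[1, 2, 3], [4, 5], [6]]
padded = pad_sequences(sequences, maxlen=4, padding="post")
print(padded)
# [[1 2 3 0]
#  [4 5 0 0]
#  [6 0 0 0]]

Zero-padding (or truncating) to a fixed length is what lets sentences of different lengths share one rectangular array for model training.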

In-Lab:

1. Apply tokenization and convert a sequence of sentences in the NLTK library to a sequence
of numbers.
2. Convert a 10-sentence dataset with multiple-length sentences into a number array of
equal size for ML model training.

 Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment

1) import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Sample sequence of sentences
text = """
This is the first sentence.
Here is the second one.
And this is the third sentence.
"""

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Initialize a list to store the tokenized sentences as lists of word IDs
tokenized_sentences = []

# Tokenize each sentence into words and convert the words to numbers
for sentence in sentences:
    # Tokenize the sentence into words
    words = word_tokenize(sentence)

    # Convert words to numbers (you can assign numeric IDs or use word embeddings).
    # For simplicity, we'll just use the position of the word in the sentence as a number.
    word_ids = list(range(len(words)))

    # Append the list of word IDs to the tokenized_sentences list
    tokenized_sentences.append(word_ids)

# Print the tokenized sentences (sequences of numbers)
print(tokenized_sentences)

2) import tensorflow as tf
import numpy as np

# Sample dataset with 10 sentences of varying lengths
sentences = [
    "This is a short sentence.",
    "Here's a slightly longer one.",
    "This is a fairly long sentence with more words in it.",
    "Short one.",
    "Another sentence here.",
    "A very long sentence with many words and more words.",
    "Short.",
    "Another one.",
    "A bit longer sentence.",
    "This is the last one."
]

# Initialize the TextVectorization layer
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,            # Set the maximum vocabulary size
    output_sequence_length=10   # Set the desired output sequence length
)

# Adapt the vectorizer to the sentences
vectorizer.adapt(sentences)

# Tokenize, pad, and encode the sentences
encoded_sentences = vectorizer(sentences)
encoded_sentences = encoded_sentences.numpy()

# Print the encoded sentences
print(encoded_sentences)

 Data and Results:


1) [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]

2) [[ 6 8 3 19 0 0 0 0 0 0]

[ 7 4 3 9 0 0 0 0 0 0]

[ 6 8 3 2 5 9 0 0 0 0]

[ 3 2 0 0 0 0 0 0 0 0]
[ 1 9 8 0 0 0 0 0 0 0]


[ 8 3 2 12 4 5 9 0 0 0]
[ 3 2 0 0 0 0 0 0 0 0]

[ 1 9 0 0 0 0 0 0 0 0]

[ 3 2 7 12 0 0 0 0 0 0]
[ 6 8 3 1 5 0 0 0 0 0]]

 Analysis and Inferences:

Experiment 2 showcased a more practical and structured approach to preprocessing text


data for machine learning, ensuring that sentences of varying lengths can be used
effectively for model training. Experiment 1 provided a basic understanding of tokenization
and numerical conversion, which are foundational concepts in NLP. The choice between
these approaches depends on the specific NLP task and the requirements of the machine
learning model.

Sample VIVA-VOCE Questions (In-Lab):

1. What does NLTK's FreqDist class provide?


A) NLTK's FreqDist class provides:
Frequency Counting, Frequency Distribution, Common Operations, Visualization
2. According to your exp which API is the best?
A) For basic tokenization and text analysis tasks (Experiment 1), NLTK's tokenization functions
can be sufficient. However, for more advanced preprocessing tasks involving machine
learning (Experiment 2), TensorFlow's TextVectorization layer is a recommended choice due
to its efficiency and ease of use for creating uniform input data for ML models.
3. Do you think your sequence conversion is suitable for GPT?
A) No, it is not suitable because of loss of sequence information, fixed input length, padding, and efficiency.
4. List the metrics used to evaluate sequence conversion techniques.
A) Perplexity, BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), WER (Word Error Rate), CER (Character Error Rate), F1 score, accuracy.
5. Can you convert using spaCy?
A) Yes, spaCy can be used for tokenization and various text processing tasks (see the sketch below).
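A minimal text-to-sequence sketch with spaCy, assuming the en_core_web_sm model is installed (the vocabulary mapping here is built by hand purely for illustration):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is an example sentence.")

# Tokenize with spaCy, then map each unique token to an integer ID
tokens = [token.text.lower() for token in doc]
word_to_id = {word: idx for idx, word in enumerate(sorted(set(tokens)), start=1)}
sequence = [word_to_id[word] for word in tokens]

print(tokens)
print(sequence)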


Post-Lab:

1. Try normalization of converted numbers from text data.


2. Try text to sequences on big corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
 Procedure/Program:

import re

# Sample text data with numbers
text = "The total revenue was $1,234,567, and the profit margin was 12.5%."

# Extract numbers from text using regular expressions
numbers = re.findall(r'\d+\.\d+|\d+', text)

# Convert extracted numbers to floats
numbers = [float(num.replace(',', '')) for num in numbers]

# Define a function to normalize numbers to [0, 1]
def normalize_to_01(num):
    min_value = min(numbers)
    max_value = max(numbers)
    if min_value == max_value:
        return 0.0  # Avoid division by zero
    return (num - min_value) / (max_value - min_value)

# Normalize each number and print
normalized_numbers = [normalize_to_01(num) for num in numbers]
print(normalized_numbers)

 Data and Results:

[1.0, 0.0, 0.024691358024691357]

 Analysis and Inferences:

In the experiment where we tried the normalization of converted numbers from text data,
we applied two types of normalization: scaling to a specific range and converting numbers
to a common format. Let's analyze the key points and provide some insights:

Scaling to a Specific Range:

Purpose: Scaling numbers to a specific range, such as [0, 1], is useful when you want to
ensure that all numbers have similar magnitudes, making them comparable.


Method: We used a regular expression to extract numbers from the text, converted them to
floats, and then applied a scaling function to normalize them.
Output: The numbers were scaled to the range [0, 1] based on their relative magnitudes
within the text.
Converting Numbers to a Common Format:

Purpose: Converting numbers to a common format, such as integers, can be helpful when
you want to work with numbers in a specific way, like for counting or indexing.
Method: We used a regular expression to extract numbers from the text and then converted
them to integers.
Output: The numbers were converted to integers, making them suitable for integer-based
operations.
Inference:

Normalization of numbers extracted from text data can be valuable in data preprocessing
and analysis to ensure that the numbers are in a consistent and comparable format.
Scaling to a specific range, such as [0, 1], is beneficial when you want to maintain the relative
relationships between numbers but standardize their magnitudes.
Converting numbers to a common format, such as integers, can simplify further data
processing and calculations.

Evaluator Remark (if Any):

Marks Secured: out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment # Student ID 2100040324
Date Student Name T.SIVASRI

Experiment Title: One Hot Encoding of Text

Aim/Objective:

The aim is to convert the text into numbers and eventually code those converted numbers into
encodings for downstream NLP tasks using NLTK, SpaCy, and TensorFlow.

Description:

One hot encoding of text data is a process of transforming categorical data, such as words or symbols,
into numerical data that can be used by machine learning models. It involves creating a binary
vector for each categorical value, where only one element is 1 and the rest are 0. The length of the
vector is equal to the number of unique categories in the data. One hot encoding allows the
representation of categorical data as multidimensional binary vectors that can be fed to models
that require numerical input.
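A tiny illustrative sketch of the idea described above (the categories are made-up examples):

# Each unique category gets its own position in a binary vector.
categories = ["cat", "dog", "bird"]
vocab = sorted(set(categories))                     # ['bird', 'cat', 'dog']
one_hot = {c: [1 if c == v else 0 for v in vocab] for c in categories}
print(one_hot)
# {'cat': [0, 1, 0], 'dog': [0, 0, 1], 'bird': [1, 0, 0]}

The vector length equals the number of unique categories, exactly as stated in the description.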

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following resources:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions
which help the student to understand the Program/Experiment that must be performed in the
Laboratory Session.

1. Why convert text to encoded ones and zeros?


A) Machine Readability, Feature Extraction, Model Compatibility, Statistical Analysis, Quantitative Representation, Scaling, Memory Efficiency, Consistency, Preprocessing, Interoperability

2. How effective is the method used by you?


A) Effectiveness of text encoding methods in NLP and machine learning:
Task Specificity, Data Quality, Feature Engineering, Contextual Information, Deep Learning
Evaluation Metrics
3. Are all sentences in the text considered to have the same length? If No, what did you do.
A) No, we use padding, truncating, Dynamic padding, Bucketing etc.

4. What is the role of the get_dummies function in one-hot encoding?
A) Role of the get_dummies function in one-hot encoding:

Conversion of Categorical Columns, Creation of Binary Columns, Column Naming, Dataframe


Transformation

5. Which according to you is a good NLP practice: OHE words or sentences?


A) In many NLP tasks, a combination of both approaches can be effective. For instance, you
can start with word-level OHE to capture fine-grained information and then aggregate
those word-level features to create sentence-level representations for higher-level tasks.

In-Lab:

1. Apply One Hot Encodings and convert a sequence of sentences in the NLTK library to
a sequence of numbers and then OHE.
2. Convert a 10-sentence dataset with multiple-length sentences into a OHE array of equal
size for ML model training.

 Procedure/Program:
1) import nltk
from nltk import word_tokenize
from nltk.util import ngrams
import pandas as pd

# Sample sentences
sentences = [
    "This is the first sentence.",
    "Here's the second sentence.",
    "And this is the third sentence."
]

# Tokenize the sentences into words
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Create a vocabulary of unique words
vocab = set(word for sentence in tokenized_sentences for word in sentence)

# Create a dictionary to map words to their corresponding OHE vectors
word_to_ohe = {word: [1 if word in sentence else 0 for sentence in tokenized_sentences] for word in vocab}

# Create a DataFrame to display the OHE representation
ohe_df = pd.DataFrame(word_to_ohe, index=sentences)

# Display the OHE DataFrame
print(ohe_df)

2) import nltk
import pandas as pd

# Sample sentences
sentences = [
    "This is the first sentence.",
    "Here's the second sentence.",
    "And this is the third sentence.",
    "A short one.",
    "Another short one.",
    "This is a longer sentence with more words.",
    "Yet another example sentence.",
    "A very short sentence.",
    "A medium-length sentence goes here.",
    "This is the last sentence in the dataset."
]

# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sentence.lower()) for sentence in sentences]

# Create a vocabulary of unique words
vocab = set(word for sentence in tokenized_sentences for word in sentence)

# Create a DataFrame to store the OHE representation
ohe_df = pd.DataFrame(0, columns=list(vocab), index=range(len(sentences)))

# Populate the DataFrame with OHE vectors
for i, sentence in enumerate(tokenized_sentences):
    for word in sentence:
        ohe_df.at[i, word] = 1

# Display the OHE DataFrame
print(ohe_df)
 Data and Results:
1) sentence. the here 's and this first is second third .
This is the first sentence. 1 1 0 0 0 1 1 0 0 1
Here's the second sentence. 1 1 1 1 0 0 0 0 0 1
And this is the third sentence. 1 1 0 0 1 1 1 1 1 1
2) the another this is sentence example with goes in one A
0 1 0 1 1 1 0 0 0 0 0 0
1 1 0 0 0 1 0 0 0 0 0 0
2 1 0 1 1 1 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1 1
4 0 1 0 0 0 0 0 0 0 1 1
5 0 0 1 1 0 0 1 0 0 0 1
6 0 1 0 0 1 1 0 0 0 0 0
7 1 0 0 0 1 0 0 0 0 1 1
8 0 0 0 0 1 0 0 1 0 0 1

9 1 0 1 1 1 0 0 0 1 0 0


short last longer words Yet second


And 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0
2 0 0 0 0 0 0 1
3 1 0 0 0 0 0 0
4 1 0 0 0 0 0 0
5 0 0 1 1 0 0 0
6 0 0 0 0 1 0 0
7 1 0 0 0 0 0 0
8 0 0 0 0 0 0 0
9 0 1 0 0 0 0 0

 Analysis and Inferences:

One-hot encoding is a straightforward approach to representing text data, but its effectiveness depends on the specific task, dataset size, and available resources. Consideration of alternative text encoding methods and preprocessing techniques is essential for achieving optimal results in NLP tasks.

Sample VIVA-VOCE Questions (In-Lab):

1. What is one hot encoding and why is it used?


A) One-hot encoding is a valuable technique for representing categorical data as binary vectors,
making it suitable for use in various machine learning algorithms. It ensures clear and
interpretable representations of categories, but it's important to consider its limitations,
especially in cases of high dimensionality and sparse data.
2. What are the advantages and disadvantages of one hot encoding?
A) Advantages of One-Hot Encoding:
Clear Interpretability
Preservation of Data Structure
Independence of Categories
Compatibility with Machine Learning Algorithms
Handling Missing Data

Disadvantages of One-Hot Encoding:


Dimensionality
Sparsity
Curse of Dimensionality
Loss of Information
Multicollinearity
3. How can you implement one hot encoding in Python using pandas or scikit-learn?
A) import pandas as pd

# Example data (hypothetical) with one categorical column
data = {'Category': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(data)

# Use pandas' get_dummies function for one-hot encoding
df_encoded = pd.get_dummies(df, columns=['Category'])

# Display the one-hot encoded DataFrame
print(df_encoded)
4. What are some alternatives to one hot encoding for categorical data?
A) Label Encoding, Ordinal Encoding, Binary Encoding, Count Encoding, Target Encoding
Frequency Encoding, Embeddings.
5. How does one hot encoding affect the dimensionality and sparsity of the data?
A) One-hot encoding has a significant impact on the dimensionality and sparsity of the data: it adds one column per unique category, so dimensionality grows with the vocabulary, and each row contains a single 1 with all other entries 0, so the resulting matrix is very sparse (see the sketch below).
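A small sketch illustrating the point above (the category list is a made-up example):

import pandas as pd

words = ["cat", "dog", "bird", "cat", "fish", "dog"]
ohe = pd.get_dummies(pd.Series(words))

print(ohe.shape)                             # (6, 4): one column per unique category
sparsity = 1 - ohe.values.sum() / ohe.size
print(f"Fraction of zeros: {sparsity:.2f}")  # 0.75 for this toy example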


Post-Lab:

1. Try using OHE data for training a simple neural network model.
2. Try text to OHE on big corpus dataset given below and train a ANN model.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
 Procedure/Program:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Sample OHE data (replace this with your OHE dataset)
# Example: Animal categories (Cat, Dog, Bird, Fish)
ohe_data = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 1, 0],
                     [0, 0, 0, 1]])

# Corresponding labels (replace this with your labels)
labels = np.array([[1], [2], [3], [4]])

# Define a simple neural network model
model = Sequential()
model.add(Dense(64, input_dim=ohe_data.shape[1], activation='relu'))
model.add(Dense(1, activation='linear'))

# Compile the model
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse'])

# Train the model
model.fit(ohe_data, labels, epochs=100, verbose=1)

# Make predictions (you can replace this with your own test data)
test_data = np.array([[1, 0, 0, 0]])  # OHE for a Cat
predictions = model.predict(test_data)

# Print the predictions
print("Predicted label:", predictions[0][0])


 Data and Results:

 Analysis and Inferences:

In the experiment where we tried using one-hot encoding (OHE) data for training a simple
neural network model, we utilized a basic neural network to demonstrate the concept.
Here's an analysis and inference:

Analysis:

Data Preparation: In this experiment, we prepared OHE data to represent categorical


variables. This encoding technique is useful when dealing with categorical features or
variables where there's no inherent order or numerical relationship between categories.

Model Architecture: We created a simple neural network model with one hidden layer and
one output layer. The choice of model architecture can vary based on the specific problem
and dataset. In practice, you may need to adjust the model's complexity and depth
according to the complexity of the data.

Loss Function and Optimization: We used mean squared error (MSE) as the loss function
and the Adam optimizer for model training. The choice of loss function and optimizer
depends on the nature of your problem (e.g., regression or classification) and should be
chosen accordingly.

Training: The model was trained using the provided OHE data and corresponding labels.
During training, the loss decreased with each epoch, which is a common pattern in training
neural networks.


Inference:

Predictions: After training, the model can be used to make predictions for new data. In our
example, we made predictions for a new OHE data point (test_data).

Output: The output of the model predictions will be a numerical value (e.g., regression
output) or class probabilities (e.g., classification output). The specific output depends on
the nature of your problem.

Evaluator Remark (if Any):

Marks Secured: out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment # Student ID 2100040324

Date Student Name T.SIVASRI

Experiment Title: Vectorization_of_texts

Aim/Objective:

The aim is to convert text into vectors by computing term frequencies and create a corpus.

Description:

The objective is to convert text to a sequence of numbers using a TF (term frequency) vectorizer. The primary goal of this conversion is to represent textual data in a numerical format that machine learning models can process effectively, enabling the application of machine learning and NLP techniques.

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following resources:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions
which help the student to understand the Program/Experiment that must be performed in the
Laboratory Session.

1. Why TF is better than OHE?


A) Dimensionality, Semantic information, Interpretability
2. How effective is the method used by you?
A) Weighting by Importance ,Reducing Noise, Semantic Information, Information Retrieval, Text
Classification, Dimensionality Reduction, Interpretability
3. What is the mathematical formulation to compute TF?
A) TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d). A small worked sketch follows these questions.
4. Is TF a good representation for text transformation?
A) Pros are simplicity, suitable for basic tasks, customization. Cons are sparsity, lack of semantic
information, sensitivity to document length and out of vocabulary terms.
5. What difference did you find between OHE and TF?
A) OHE is used for categorical data encoding, while TF is used for text data representation.
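A worked sketch of the TF formula from Q3 (the toy document below is made up purely for illustration):

# TF(t, d) = count of term t in document d / total number of terms in d
doc = "the cat sat on the mat".split()
tf_the = doc.count("the") / len(doc)
print(tf_the)   # 2 / 6 ≈ 0.333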


In-Lab:

1. Apply tokenization and convert a sequence of sentences in the NLTK library to a sequence
of numbers. Use those sequences and calculate term frequencies for representing text
data on a small corpus.
2. Convert a 10-sentence dataset with multiple-length sentences into TF representations and
compare them with OHE.
 Procedure/Program:
1) import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords

# Sample sentences
sentences = [
"This is the first sentence. It contains some words.",
]

# Tokenize the sentences into words


tokens = [word_tokenize(sentence.lower()) for sentence in sentences]

# Calculate term frequencies (TF) for each sentence


tf_dicts = [FreqDist(token) for token in tokens]

# Print the term frequencies for each sentence


for i, tf_dict in enumerate(tf_dicts):
    print(f"Term Frequencies for Sentence {i + 1}:")
    for term, freq in tf_dict.items():
        print(f"{term}: {freq}")
    print()

2) import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import pandas as pd

# Sample sentences
sentences = [
    "This is the first sentence.",
    "Here's the second one.",
    "And this is the third sentence.",
    "A short one.",
    "Another short one.",
    "This is a longer sentence with more words.",
    "Yet another example sentence.",
    "A very short sentence.",
    "A medium-length sentence goes here.",
    "This is the last sentence in the dataset."
]

# Tokenize the sentences into words and calculate TF
tokens = [word_tokenize(sentence.lower()) for sentence in sentences]
tf_dicts = [FreqDist(token) for token in tokens]

# Create a vocabulary of unique words across all sentences
vocab = set(word for tf_dict in tf_dicts for word in tf_dict.keys())

# Create a DataFrame to store the TF representations
tf_df = pd.DataFrame(0, columns=list(vocab), index=range(len(sentences)))

# Populate the DataFrame with TF values
for i, tf_dict in enumerate(tf_dicts):
    for term, freq in tf_dict.items():
        tf_df.at[i, term] = freq

# Convert the TF DataFrame to OHE representation
ohe_df = pd.get_dummies(tf_df, columns=tf_df.columns)

# Print the TF DataFrame
print("Term Frequency (TF) Representation:")
print(tf_df)

# Print the OHE DataFrame
print("\nOne-Hot Encoding (OHE) Representation:")
print(ohe_df)

Data and Results:

1) Term Frequencies for Sentence 1:
this: 1
is: 1
the: 1
first: 1
sentence: 1
.: 1
it: 1
contains: 1
some: 1
words: 1

2)

 Analysis and Inferences:

Tokenization is a crucial preprocessing step in NLP for breaking down text into analyzable
units.

Term Frequency (TF) is a basic but informative representation of text data.

One-Hot Encoding (OHE) is a binary representation that encodes word presence.

The choice between TF and OHE depends on the specific task and the type of information
needed for analysis.


Sample VIVA-VOCE Questions (In-Lab):

1. What does TF stand for?


A) Term Frequency
2. According to your exp which text encoding is the best?
A) The "best" encoding depends on factors such as the nature of your data, the complexity of
your task, available computing resources, and the trade-offs between simplicity and
informativeness. For basic tasks like document classification, TF-IDF or BoW may be
sufficient. For more complex tasks requiring semantic understanding, word embeddings
or pre-trained models may be preferred.
3. Do you think your sequence conversion is suitable for GPT?
A) While TF and OHE are useful for some NLP tasks, they are not suitable input representations for models like GPT. GPT requires continuous vector representations, such as word embeddings or subword embeddings, to effectively capture the semantics and context of text.
4. List the Metrics used to Evaluate sequence conversion Techniques.
A) Metrics used to evaluate sequence conversion techniques in NLP:
Perplexity, BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for
Gisting Evaluation), F1 Score, Accuracy, Precision and Recall, Mean Absolute Error
(MAE),Mean Squared Error (MSE),Cosine Similarity
5. Can you convert using spaCy?
A) Yes, it can be done as below:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is an example sentence.")
for token in doc:
    print(f"{token.text}: {token.vector}")



Post-Lab:

1. Try an ANN model on the transformed text using TF.


2. Try TF conversion big corpus dataset given below and apply ANN training,
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
 Procedure/Program:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample text data (replace with your own data)
texts = ["This is a positive sentence.", "This is a negative sentence.",
         "Another positive example.", "Another negative example."]
labels = [1, 0, 1, 0]  # 1 for positive, 0 for negative

# Tokenize and preprocess the text data
tokenizer = Tokenizer(num_words=1000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, maxlen=10, padding="post", truncating="post")

# Build an ANN model
model = Sequential()
model.add(Embedding(input_dim=len(word_index) + 1, output_dim=16, input_length=10))
model.add(GlobalAveragePooling1D())
model.add(Dense(16, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

# Compile the model
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the model
model.fit(np.array(padded_sequences), np.array(labels), epochs=10)

# Make predictions
test_texts = ["A new positive sentence.", "A new negative sentence."]
test_sequences = tokenizer.texts_to_sequences(test_texts)
test_padded_sequences = pad_sequences(test_sequences, maxlen=10, padding="post", truncating="post")
predictions = model.predict(np.array(test_padded_sequences))

# Print the predictions
print(predictions)

(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

 Data and Results:


 Analysis and Inferences:

In the experiment where we tried an Artificial Neural Network (ANN) model on


transformed text using TensorFlow (TF), we aimed to build a basic text classification model.
Here's an analysis and key inferences:

Analysis:

Data Preparation: We started with a small sample of text data along with
corresponding binary labels (0 for negative, 1 for positive). In practice, you would use a
larger and more diverse dataset specific to your NLP task.

Text Preprocessing: Text preprocessing is a crucial step. We tokenized the text using the
Tokenizer class and converted it into numerical sequences. Additionally, we used padding
to ensure uniform sequence lengths. Proper preprocessing is essential to feed text data
into neural networks effectively.

Model Architecture: We built a simple ANN model for text classification. The model
included an embedding layer to convert words into dense vectors, a global average
pooling layer to aggregate word embeddings, and two dense layers. This architecture is a
starting point and can be adjusted based on the complexity of the task.

Loss Function and Optimization: We compiled the model with binary cross-entropy loss,
which is common for binary classification tasks. The Adam optimizer was used for model
training. The choice of loss function and optimizer may vary depending on the specific
problem.

Training: We trained the model on the transformed text data. In this small example, we
used a limited number of epochs (10). In real-world scenarios, you would typically train the
model for many more epochs to achieve better performance.

Predictions: We made predictions on new text data to classify it as positive or negative


based on the trained model.

Inferences:

Start with Simple Models: In this experiment, we used a simple ANN model as a
starting point. It's often a good practice to begin with a basic architecture and
gradually increase complexity based on the performance on validation data.

Text Preprocessing is Key: Proper preprocessing of text data is critical. Tokenization,


padding, and handling out-of-vocabulary words are essential steps.

Model Tuning: Depending on your specific NLP task, you may need to adjust the
model architecture, hyperparameters (e.g., learning rate, batch size), and use
techniques like dropout and regularization to improve performance.



Dataset Size: The sample dataset used here is small for demonstration purposes. In real-
world applications, larger and more diverse datasets are usually required for training robust
NLP models.

Evaluation: Evaluating model performance using appropriate metrics (e.g., accuracy,


precision, recall, F1-score) on a validation or test dataset is essential to assess its
effectiveness.

Scaling: For more complex NLP tasks, you may consider using pre-trained models like
BERT, GPT, or their variants, which have achieved state-of-the-art performance on
various NLP benchmarks.

Evaluator Remark (if Any):

Marks Secured: out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment # Student ID 2100040324

Date Student Name T.SIVASRI

Experiment Title: Text Datasets_how_to_use

Aim/Objective:

The aim is to use the online resources of text data to test NLP applications.

Description:

A text corpus is a large and structured collection of texts, typically stored in a digital format, that
serves as a linguistic resource for language analysis and research. It consists of a diverse range of
written or spoken texts from various sources and domains, such as books, articles, newspapers,
websites, social media, conversations, and more.

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following resources:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. How can I create a text corpus from a collection of documents using Python?
A) Collect your documents
Read and extract text
Preprocess the text data
Organize into a dataset
Save the corpus
(A minimal sketch appears after these questions.)
2. What Python libraries can I use to tokenize and preprocess text data for corpus creation?
A) NLTK (Natural Language Toolkit)
spaCy
TextBlob
scikit-learn
Gensim
Pattern
spaCy and Hugging Face Transformers (for deep learning)

3. How can I handle different file formats (e.g., PDF, Word documents) when building a text corpus in Python?

A) Adapt the code examples to your specific use case and file paths. Additionally, consider
handling errors and exceptions that may occur when working with different file formats to
ensure the robustness of your corpus creation process

4. What are the steps involved in cleaning and preprocessing text data for corpus creation?
A) Text lowercasing, tokenization, stop word removal, special character and number removal, stemming or lemmatization, handling contractions and abbreviations, removing HTML tags and URLs, handling missing data, custom cleaning steps

5. How can I remove stopwords and punctuation from text documents when creating a
corpus in Python?
A) filtered_words = [token.text for token in doc if not token.is_stop and token.text not in string.punctuation]
(Here doc is a spaCy Doc object and string is Python's standard string module; a fuller sketch follows below.)
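A minimal sketch tying Q1 and Q5 together, assuming a folder of plain-text files (the folder name documents/ is a placeholder, not part of the lab data):

import glob
import string
import spacy

nlp = spacy.load("en_core_web_sm")

corpus = []
for path in glob.glob("documents/*.txt"):   # collect your documents
    with open(path, encoding="utf-8") as f:
        text = f.read()                     # read and extract text
    doc = nlp(text.lower())                 # preprocess with spaCy
    filtered_words = [token.text for token in doc
                      if not token.is_stop and token.text not in string.punctuation]
    corpus.append(filtered_words)           # organize into a dataset

print(len(corpus), "documents loaded into the corpus")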

In-Lab:

1. From NLTK library, download and apply wordnet package of built-in corpus. Extract
the requirements of a text dataset and tokenize the text.
2. From spaCy, use en_core_web_sm (English Small) corpus and tokenize this text.

 Procedure/Program:
1) import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the WordNet package (if you haven't already)
nltk.download("wordnet")

# Sample text dataset
text = """
We need to extract requirements for a new project.
The requirements should be clear, concise, and detailed.
Gathering requirements is a crucial step in project planning.
"""

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Tokenize each sentence into words
words = [word_tokenize(sentence) for sentence in sentences]

# Function to extract requirements from a sentence
def extract_requirements(sentence):
    requirements = []
    words = word_tokenize(sentence.lower())
    for word in words:
        # Check if the word is a synonym of "requirement"
        if wordnet.synsets(word, pos=wordnet.NOUN) and word in wordnet.synsets('requirement', pos=wordnet.NOUN)[0].lemma_names():
            requirements.append(word)
    return requirements

# Extract and print requirements from each sentence
for i, sentence in enumerate(sentences):
    requirements = extract_requirements(sentence)
    if requirements:
        print(f"Requirements in Sentence {i + 1}: {', '.join(requirements)}")

2) import spacy

# Load the English Small model
nlp = spacy.load("en_core_web_sm")

# Your text
text = "This is an example sentence. Tokenize me, please."

# Process the text with spaCy
doc = nlp(text)

# Access the tokens
tokens = [token.text for token in doc]

# Print the tokens
print(tokens)

(Leave at least 2-3 Pages to record the Procedure/Program)

 Data and Results:

1)


2) ['This', 'is', 'an', 'example', 'sentence', '.', 'Tokenize', 'me', ',', 'please', '.']

 Analysis and Inferences:


1. NLTK and spaCy are both valuable tools in the NLP toolkit, and their choice depends on
the specific task and requirements.
2. NLTK with WordNet is particularly useful when dealing with semantic relationships
and specific word extraction.
3. spaCy, with its pre-trained models like en_core_web_sm, offers high-quality
tokenization and is well-suited for a wide range of NLP tasks.
4. The choice between NLTK and spaCy should be based on the specific needs of the
project and the nature of the text analysis tasks

Sample VIVA-VOCE Questions (In-Lab):

1. What are the different types of text datasets available in NLTK?


Types of Text Datasets in NLTK:

Corpora
Lexical Resources
Sample Texts
Non-English Corpora
Custom Corpora
2. Can you give an example of a text dataset available in NLTK?
A) Treebank corpus, webtext corpus, WordNet
3. How can you access and explore the content of a text dataset in NLTK?
A) nltk.download('gutenberg')
from nltk.corpus import gutenberg

# Get the list of text IDs
text_ids = gutenberg.fileids()

# Access the raw text of a specific document
document_text = gutenberg.raw('shakespeare-hamlet.txt')
4. Can you explain the concept of text datasets in spaCy?
A) Yes. In spaCy, a text dataset is represented as a collection of documents, and you can leverage spaCy's powerful language models and processing capabilities to perform various NLP tasks and gain insights from the text data.
5. Do you know spaCy can handle multi-language text datasets? If yes, name two.
A) Yes, spaCy is capable of handling multi-language text datasets: English (en_core_web_sm), German (de_core_news_sm).


Post-Lab:

1. Try to encode the wordnet text into TF vectors and OHE. Measure the corpus size
occupied by them in memory.
2. Try to find some text datasets available online and load into your current program.

 Procedure/Program:

import sys
import nltk
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Download the WordNet package (if you haven't already)
nltk.download("wordnet")

# Sample WordNet text data (replace with your own data)
wordnet_text = [
    "WordNet is a lexical database for English.",
    "It groups words into sets of synonyms called synsets.",
    "Each synset represents a distinct concept."
]

# Create TF vectors
tf_vectorizer = CountVectorizer()
tf_matrix = tf_vectorizer.fit_transform(wordnet_text)

# Calculate memory size occupied by TF vectors
tf_memory_size = sys.getsizeof(tf_matrix)

# One-Hot Encoding (OHE)
ohe_matrix = np.eye(len(tf_vectorizer.get_feature_names_out()))

# Calculate memory size occupied by OHE matrix
ohe_memory_size = sys.getsizeof(ohe_matrix)

# Print memory sizes
print(f"Memory occupied by TF Vectors: {tf_memory_size} bytes")
print(f"Memory occupied by OHE Matrix: {ohe_memory_size} bytes")


 Data and Results:

 Analysis and Inferences:

In the experiment where we encoded WordNet text into TF vectors (Term Frequency)
and One-Hot Encoding (OHE) and measured the corpus size occupied by these encodings in
memory, here are the analysis and inferences:

Analysis:

Encoding Techniques: We applied two common text encoding techniques, TF vectors and
OHE, to represent WordNet text data. These techniques are fundamental for preparing text data for
various Natural Language Processing (NLP) tasks.

Memory Measurement: We used the sys.getsizeof() function to measure the


memory footprint of the encoded data. This allowed us to quantify the memory
requirements of each encoding method.

Inferences:

Memory Efficiency: TF vectors are more memory-efficient compared to OHE. TF vectors


represent the frequency of terms using integers, resulting in smaller memory usage. OHE, on the
other hand, creates a binary vector for each unique term in the vocabulary, which can consume
substantial memory when dealing with a large vocabulary.

Use Case Considerations: The choice between TF vectors and OHE depends on the
specific use case and the nature of the data. TF vectors are commonly used for text classification
and information retrieval tasks where term frequency information is essential, and memory
efficiency is a concern. OHE may be necessary for tasks that require binary input, such as some
deep learning models.


Vocabulary Size: The memory requirements of OHE can increase significantly with the size
of the vocabulary. If you have a large corpus with many unique terms, OHE may become
impractical due to its high memory consumption.

Library Efficiency: Utilizing libraries like scikit-learn for TF vectorization and NumPy for
OHE matrix creation is a good practice. These libraries are optimized for memory efficiency and
provide efficient implementations for common encoding tasks.

Trade-Offs: The choice of encoding method often involves trade-offs between memory
efficiency, computational complexity, and the specific requirements of your NLP task. It's
important to consider these factors when selecting an encoding technique.

Scaling: For large-scale NLP tasks and extensive corpora, advanced techniques like
word embeddings (e.g., Word2Vec, GloVe) or transformer-based models (e.g., BERT) are
preferred, as they provide memory-efficient and semantically rich representations.

Evaluator Remark (if Any):

Marks Secured: out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.
