NLP - Short Assignments
Assignment 1:
Title:
Tokenization, Stemming, and Lemmatization using the NLTK Library
Objectives:
- To explore different tokenization techniques available in the NLTK library and compare their effectiveness in terms of accuracy and speed.
- To apply stemming and lemmatization to reduce words to their root form and perform a comparison.
Pre-requisites:
- Familiarity with the Python programming language and the NLTK library.
Sample Sentence:
"I am trying to learn Natural Language Processing using the NLTK library. NLTK is a
Theory:
Tokenization is the process of breaking a text into individual words or phrases, also
known as tokens. There are several tokenization techniques available in the NLTK
library, such as whitespace, punctuation-based, Treebank, Tweet, and MWE (multi-word expression) tokenizers.
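As a minimal sketch (assuming NLTK is installed and its 'punkt' tokenizer data has been downloaded), the tokenizers can be compared on the sample sentence:

import nltk
nltk.download('punkt', quiet=True)  # tokenizer models used by word_tokenize
from nltk.tokenize import (word_tokenize, WhitespaceTokenizer, WordPunctTokenizer,
                           TreebankWordTokenizer, TweetTokenizer, MWETokenizer)

sentence = "I am trying to learn Natural Language Processing using the NLTK library."

print(WhitespaceTokenizer().tokenize(sentence))    # splits on whitespace only
print(WordPunctTokenizer().tokenize(sentence))     # splits punctuation into separate tokens
print(TreebankWordTokenizer().tokenize(sentence))  # Penn Treebank conventions
print(TweetTokenizer().tokenize(sentence))         # tuned for social-media text
# MWETokenizer re-merges listed multi-word expressions into single tokens
mwe = MWETokenizer([("Natural", "Language", "Processing")], separator="_")
print(mwe.tokenize(word_tokenize(sentence)))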
Stemming is the process of reducing a word to its root form. Porter Stemmer and
Snowball Stemmer are two widely used stemming algorithms in the NLTK library. While
both strip suffixes using hand-written rules, the Snowball Stemmer is an
improvement over the Porter Stemmer algorithm and provides better results.
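A minimal sketch comparing the two stemmers on a few sample words:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball requires a language name

for word in ["trying", "generously", "libraries", "processing"]:
    print(word, "->", porter.stem(word), "/", snowball.stem(word))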
Lemmatization is the process of reducing a word to its base or dictionary form, known
as a lemma. It uses a dictionary to map words to their base form, which makes it more
accurate than stemming, though typically slower.
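A minimal sketch (assuming the 'wordnet' data has been downloaded) using NLTK's WordNet lemmatizer, one common choice for lemmatization:

import nltk
nltk.download('wordnet', quiet=True)  # dictionary the lemmatizer looks words up in
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("libraries"))         # noun by default -> library
print(lemmatizer.lemmatize("trying", pos="v"))   # treated as a verb -> try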
Conclusion:
We have explored different tokenization techniques available in the NLTK library and
compared their effectiveness in terms of accuracy and speed. We have also applied
stemming and lemmatization to reduce words to their root form. Finally, we have
compared the results of stemming and lemmatization and observed the differences
between the two approaches.
Assignment 2:
Title:
Bag-of-Words, TF-IDF, and Word2Vec Embeddings on the Car Dataset
Objectives:
- To perform the bag-of-words approach (count occurrence and normalized count occurrence) on the dataset.
- To calculate TF-IDF scores for the words in the dataset.
- To create word embeddings using the Word2Vec model and analyze the results.
Pre-requisites:
- Familiarity with the Python programming language and its libraries such as NLTK, scikit-learn, pandas, and Gensim.
Dataset:
The dataset to be used for this assignment is the Car Dataset from Kaggle, which
contains information about cars, including their make, model, year, mileage, fuel type,
and more.
Theory:
The bag-of-words approach represents each document as
a bag of words, ignoring the order and context of the words. We will count the
raw occurrence and the normalized count occurrence of words in the Car Dataset.
TF-IDF (Term Frequency-Inverse Document Frequency) weighs each word by how often
it appears in a document and how rare it is across the corpus, i.e.
tf-idf(t, d) = tf(t, d) * log(N / df(t)), where N is the number of documents and
df(t) is the number of documents containing term t. We
will calculate TF-IDF scores for the words in the Car Dataset.
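A minimal sketch using scikit-learn; the toy documents below are placeholders standing in for text drawn from the Car Dataset:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["maruti swift petrol manual",
        "hyundai i20 petrol manual",
        "maruti swift diesel automatic"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()         # raw count occurrence
normalized = counts / counts.sum(axis=1, keepdims=True)  # normalized count occurrence

tfidf = TfidfVectorizer().fit_transform(docs).toarray()  # TF-IDF scores

print(count_vec.get_feature_names_out())
print(counts)
print(normalized.round(2))
print(tfidf.round(2))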
Word2Vec is a shallow neural network model that learns dense vector representations
(embeddings) of words from the contexts in which they appear, so that words occurring
in similar contexts receive similar vectors. We will create
Word2Vec embeddings for the Car Dataset and analyze the results.
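A similar sketch with Gensim's Word2Vec implementation; the tokenized toy sentences again stand in for rows of the dataset, and the hyperparameter values are illustrative choices:

from gensim.models import Word2Vec

sentences = [["maruti", "swift", "petrol"],
             ["hyundai", "i20", "petrol"],
             ["maruti", "swift", "diesel"]]

# vector_size: embedding dimension; window: context size; min_count=1 keeps rare words
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["swift"][:5])                    # first few dimensions of one embedding
print(model.wv.most_similar("petrol", topn=2))  # nearest neighbours in vector space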
Conclusion:
We have explored different techniques for analyzing text data in the Car Dataset. We
have applied the bag-of-words approach to obtain the raw and normalized count
occurrences of words in the dataset, and calculated TF-IDF scores for the words.
Finally, we have created Word2Vec embeddings for the dataset and analyzed the
results.
Assignment 3:
Title:
Text Cleaning, Lemmatization, Stop Word Removal, Label Encoding, and TF-IDF Representation on the News Dataset
Objectives:
- To perform text cleaning, lemmatization, and stop word removal on the News Dataset.
- To apply label encoding to the article categories.
- To create a TF-IDF representation of the text data and save the outputs of each step.
Pre-requisites:
- Familiarity with the Python programming language and its libraries such as NLTK, pandas, and scikit-learn.
Dataset:
The dataset to be used for this assignment is the News Dataset available on the
https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle.
This dataset contains news articles labeled with their respective
categories.
Theory:
Text Cleaning involves removing noise, unwanted characters, and unnecessary words
from the text data. We will perform text cleaning on the News Dataset.
Stop Word Removal involves removing common words that do not carry much meaning
from the text data. We will remove stop words from the text using any suitable method.
Lemmatization reduces each remaining word to its base or dictionary form, Label
Encoding converts the categorical article labels into numeric values, and a TF-IDF
representation turns the cleaned text into weighted feature vectors. The sketch below
puts these steps together.
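A minimal end-to-end sketch of the pipeline; the column names 'News' and 'Category' are assumptions about the pickle's layout, and the file is assumed to have been downloaded locally:

import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

df = pd.read_pickle("News_dataset.pickle")  # assumed local copy of the dataset

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())              # text cleaning
    tokens = [lemmatizer.lemmatize(t) for t in text.split()]   # lemmatization
    return " ".join(t for t in tokens if t not in stop_words)  # stop word removal

df["clean"] = df["News"].apply(preprocess)                  # 'News' column is assumed
df["label"] = LabelEncoder().fit_transform(df["Category"])  # label encoding

X = TfidfVectorizer(max_features=5000).fit_transform(df["clean"])  # TF-IDF features

df.to_pickle("news_preprocessed.pickle")  # save the preprocessed outputs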
Conclusion:
We have performed various preprocessing steps on the News Dataset, including text
cleaning, lemmatization, stop word removal, and label encoding. We have also created a
TF-IDF representation of the text data, which is an important step in
preparing text data for various NLP applications. Finally, we have saved the outputs of
each preprocessing step for further use.
Assignment 4:
Title:
Building a Transformer from Scratch using PyTorch
Pre-requisites:
- Understanding of deep learning concepts and algorithms.
- Familiarity with the PyTorch library and its components, such as nn, optim, and DataLoader.
Dataset:
We can use any text classification dataset, such as the IMDB movie review dataset or any other labeled text corpus.
Theory:
The Transformer is a type of neural network architecture that was introduced in the
paper "Attention Is All You Need" (Vaswani et al., 2017). Its key components are
self-attention layers and position-wise feed-forward networks. Self-attention layers
compute the attention between the input sequence and itself, while Position-wise
Feed-Forward Networks apply the same fully connected transformation independently
at each position. We will implement each
of these components and combine them to form a complete model. We will then train
the model on a text classification task.
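A minimal sketch of a single encoder block built from these two components (single-head attention for brevity; the dimensions are illustrative choices, not from the assignment):

import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        weights = torch.softmax(scores, dim=-1)  # attention of the sequence over itself
        return weights @ v

class EncoderBlock(nn.Module):
    """Self-attention + position-wise feed-forward, each with residual + LayerNorm."""
    def __init__(self, d_model=64, d_ff=256):
        super().__init__()
        self.attn = SelfAttention(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x))    # residual connection around attention
        return self.norm2(x + self.ffn(x))  # residual around the feed-forward net

x = torch.randn(2, 10, 64)      # (batch, seq_len, d_model)
print(EncoderBlock()(x).shape)  # torch.Size([2, 10, 64])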
Conclusion:
We have explored the architecture of a Transformer and its key components, including
self-attention layers and position-wise feed-forward networks.
We have implemented these components from scratch using PyTorch and trained the
model on a text classification task. We have also analyzed the performance of the
model and interpreted the results. Building a Transformer from scratch is a challenging
but rewarding task that can enhance our understanding of deep learning and NLP.
Assignment 5:
Title:
Morphology: Analyzing Word Structure with Add-Delete Tables
Objectives:
- To understand the concept of morphology and how words are built up from smaller
meaning-bearing units.
- To learn about the different types of morphemes, including free and bound
morphemes.
- To use add-delete tables as a tool for analyzing the morphological structure of words.
Pre-requisites:
- Basic understanding of English grammar and word formation.
Theory:
Morphology is the study of the structure and form of words, including how they are
built up from smaller meaning-bearing units called morphemes. There are two types of
morphemes: free morphemes, which can stand alone as words, and bound morphemes,
which cannot stand alone and must attach to other morphemes, such as the affixes
"un-" and "-ness".
Add-delete tables are a tool used in morphology to analyze the morphological structure
of words. These tables show how words can be built up from smaller morphemes by
adding or deleting affixes. The table is divided into three columns: the stem, the affix,
and the resulting word.
To use add-delete tables, we start with a stem, which is the base form of a word. We
then add prefixes or suffixes to the stem to create new words. We can also delete
affixes to derive new words or analyze the morphological structure of existing words.
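As an illustration, a small sketch that builds such a table programmatically; the example words and affixes are hypothetical choices, and only affixes that attach without spelling changes are used:

def add_affix(stem, affix):
    """Attach an affix; 'un-' marks a prefix, '-ness' marks a suffix."""
    return affix[:-1] + stem if affix.endswith("-") else stem + affix[1:]

def delete_affix(word, affix):
    """Strip an affix to recover the stem."""
    return word[len(affix) - 1:] if affix.endswith("-") else word[:-(len(affix) - 1)]

print(f"{'Stem':<8}{'Affix':<8}{'Resulting Word'}")
for stem, affix in [("kind", "-ness"), ("do", "un-"), ("care", "-ful")]:
    print(f"{stem:<8}{affix:<8}{add_affix(stem, affix)}")

print(delete_affix("undo", "un-"))  # deleting the prefix recovers the stem: 'do'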
Conclusion:
We have explored the concept of morphology and how words are built up from smaller
meaning-bearing units called morphemes. We have learned about the different types of
morphemes, including free and bound morphemes, and how they are used to create
words. We have also used add-delete tables as a tool for analyzing the morphological
structure of words.