NLP - Short Assignments

This document presents five short assignments on natural language processing techniques, covering tokenization, stemming, lemmatization, bag-of-words modeling, TF-IDF, word embeddings, text classification using Transformers, and morphological analysis using add-delete tables.


Assignment 1:

Title:

Tokenization and Stemming Techniques using NLTK

Objectives:

- To perform tokenization on sample sentences using various techniques available in the NLTK library, including whitespace, punctuation-based, Treebank, Tweet, and MWE tokenization.

- To compare the effectiveness of the different tokenization techniques in terms of accuracy and speed.

- To apply the Porter Stemmer and Snowball Stemmer to the tokenized sentences to reduce words to their root form.

- To apply lemmatization to the same set of tokenized sentences for comparison.

Pre-requisites:

- Basic knowledge of Natural Language Processing (NLP) concepts

- Familiarity with the Python programming language and the NLTK library

Sample Sentence:

"I am trying to learn Natural Language Processing using the NLTK library. NLTK is a

powerful tool for working with human language data."

Theory:

Tokenization is the process of breaking a text into individual words or phrases, also known as tokens. There are several tokenization techniques available in the NLTK library, including whitespace, punctuation-based, Treebank, Tweet, and MWE tokenization. Each technique has its own advantages and disadvantages, and the choice of technique depends on the specific requirements of the NLP task.
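
A minimal sketch of these tokenizers applied to the sample sentence above (NLTK is assumed to be installed):

    from nltk.tokenize import (WhitespaceTokenizer, WordPunctTokenizer,
                               TreebankWordTokenizer, TweetTokenizer, MWETokenizer)

    text = ("I am trying to learn Natural Language Processing using the NLTK library. "
            "NLTK is a powerful tool for working with human language data.")

    # Whitespace tokenization: split on spaces, tabs, and newlines only.
    print(WhitespaceTokenizer().tokenize(text))

    # Punctuation-based tokenization: punctuation becomes separate tokens.
    print(WordPunctTokenizer().tokenize(text))

    # Treebank tokenization: Penn Treebank conventions (contractions, punctuation).
    print(TreebankWordTokenizer().tokenize(text))

    # Tweet tokenization: robust to hashtags, mentions, and emoticons.
    print(TweetTokenizer().tokenize(text))

    # MWE tokenization: merge a predefined multi-word expression into one token.
    mwe = MWETokenizer([("Natural", "Language", "Processing")], separator="_")
    print(mwe.tokenize(text.split()))

Timing each call (for example with the time module) over a larger text sample is one simple way to compare the tokenizers' speed.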

Stemming is the process of reducing a word to its root form. The Porter Stemmer and Snowball Stemmer are two widely used stemming algorithms in the NLTK library. The Porter Stemmer is based on a set of suffix-stripping rules and heuristics, while the Snowball Stemmer (also known as Porter2) refines those rules, generally produces more consistent stems, and supports several languages besides English.
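
A short sketch comparing both stemmers on a handful of tokens:

    from nltk.stem import PorterStemmer, SnowballStemmer

    tokens = ["trying", "studies", "easily", "powerful", "working"]

    porter = PorterStemmer()
    snowball = SnowballStemmer("english")

    # Compare the two algorithms token by token.
    for token in tokens:
        print(token, "->", porter.stem(token), "/", snowball.stem(token))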

Lemmatization is the process of reducing a word to its base or dictionary form, known as the lemma. It uses a dictionary (WordNet, in NLTK's case) to map words to their base forms, which generally makes it more accurate than stemming.
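
A minimal sketch using NLTK's WordNet lemmatizer (the WordNet data must be downloaded once):

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet")  # one-time download of the WordNet dictionary

    lemmatizer = WordNetLemmatizer()

    # Passing a part-of-speech tag ("v" for verb) improves the dictionary lookup.
    print(lemmatizer.lemmatize("studies", pos="v"))   # -> study
    print(lemmatizer.lemmatize("working", pos="v"))   # -> work
    print(lemmatizer.lemmatize("languages"))          # noun by default -> language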

Conclusion:

We have explored the different tokenization techniques available in the NLTK library and compared their effectiveness in terms of accuracy and speed. We have also applied the Porter Stemmer and Snowball Stemmer to the tokenized sentences to reduce words to their root form. Finally, we have compared the results of stemming and lemmatization on the same set of tokenized sentences.

Assignment 2:

Title:

Bag-of-Words, TF-IDF and Word2Vec Embeddings on Car Dataset

Objectives:

- To apply a bag-of-words approach to the Car Dataset by computing the raw count occurrence and the normalized count occurrence of words in the dataset.

- To calculate TF-IDF scores for the words in the dataset.

- To create word embeddings using the Word2Vec model and analyze the results.

Pre-requisites:

- Basic knowledge of Natural Language Processing (NLP) concepts.

- Familiarity with the Python programming language and its libraries such as NLTK, Pandas, and Gensim.

Dataset:

The dataset to be used for this assignment is the Car Dataset from Kaggle, which contains information about cars, including their make, model, year, mileage, fuel type, and more.

Theory:

The Bag-of-Words approach is a common NLP technique that represents a document as a bag of its words, ignoring the order and context of the words. We will count the occurrence and normalized count occurrence of words in the dataset.
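
A minimal sketch of the raw and normalized counts; the file name, the text column ("name"), and the use of scikit-learn's CountVectorizer are assumptions made for illustration:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    df = pd.read_csv("car_dataset.csv")          # file name is an assumption
    docs = df["name"].astype(str).tolist()       # hypothetical text column

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)      # raw count occurrence

    # Normalized count occurrence: divide each row by its total token count.
    X = counts.toarray()
    normalized = X / X.sum(axis=1, keepdims=True)

    print(vectorizer.get_feature_names_out()[:10])
    print(X[:2])
    print(normalized[:2])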

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document in a collection. It weights the frequency of a word within a document against how common the word is across the entire collection, so words that are frequent in one document but rare elsewhere receive high scores. We will calculate TF-IDF scores for the words in the Car Dataset.
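
A possible sketch using scikit-learn's TfidfVectorizer (an assumption; any TF-IDF implementation works), reusing the docs list built above:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer()
    scores = tfidf.fit_transform(docs)

    # Show the five highest-scoring terms of the first document.
    row = scores[0].toarray().ravel()
    terms = tfidf.get_feature_names_out()
    top = np.argsort(row)[::-1][:5]
    print([(terms[i], round(float(row[i]), 3)) for i in top])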

Word2Vec is a neural network-based approach for creating word embeddings: dense vector representations of words in a continuous vector space, in which words that occur in similar contexts end up with similar vectors. We will create Word2Vec embeddings for the Car Dataset and analyze the results.
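
A small sketch with Gensim's Word2Vec, again reusing docs from above; the hyperparameter values are illustrative choices, not requirements:

    from gensim.models import Word2Vec

    # Tokenize each document into lowercase words.
    sentences = [doc.lower().split() for doc in docs]

    # Train a small skip-gram model (sg=1); vector_size, window, epochs are illustrative.
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1, epochs=10)

    some_word = model.wv.index_to_key[0]             # a word guaranteed to be in the vocabulary
    print(model.wv[some_word][:10])                  # first 10 dimensions of its vector
    print(model.wv.most_similar(some_word, topn=5))  # nearest neighbours in embedding space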

Conclusion:

We have explored different techniques for analyzing text data in the Car Dataset. We have performed a bag-of-words approach to count the occurrence and normalized count occurrence of words in the dataset, as well as calculated TF-IDF scores for the words. Finally, we have created Word2Vec embeddings for the dataset and analyzed the results.

Assignment 3:

Title:

Text Cleaning, Lemmatization, Stop Word Removal, Label Encoding, and TF-IDF Representation on News Dataset

Objectives:

- To perform text cleaning on the News Dataset.

- To perform lemmatization on the cleaned text using any method.

- To remove stop words from the text using any method.

- To perform label encoding on the target variable of the dataset.

- To create a TF-IDF representation of the preprocessed text.

- To save the outputs of the preprocessing steps.

Pre-requisites:

- Basic knowledge of Natural Language Processing (NLP) concepts.

- Familiarity with the Python programming language and its libraries such as NLTK, Pandas, and Scikit-learn.

Dataset:

The dataset to be used for this assignment is the News Dataset available in the following GitHub repository: https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle. This dataset contains news articles labeled with their respective categories.

Theory:

Text Cleaning involves removing noise, unwanted characters, and unnecessary words from the text data. We will perform text cleaning on the News Dataset.

Lemmatization is the process of reducing words to their base or dictionary form. We will perform lemmatization on the cleaned text using any suitable method.

Stop Word Removal involves removing common words that do not carry much meaning from the text data. We will remove stop words from the text using any suitable method.
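
A possible sketch of these three steps combined, assuming the articles are held in a pandas column named "text" (a hypothetical column name) inside the pickled dataset:

    import re
    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords")
    nltk.download("wordnet")

    df = pd.read_pickle("News_dataset.pickle")       # dataset file from the repository above
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        # Text cleaning: lowercase and keep only letters and spaces.
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        # Stop word removal and lemmatization in one pass over the tokens.
        return " ".join(lemmatizer.lemmatize(tok) for tok in text.split()
                        if tok not in stop_words)

    df["clean_text"] = df["text"].astype(str).apply(preprocess)   # "text" column is assumed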

Label Encoding is the process of converting categorical variables into numerical format. We will perform label encoding on the target variable of the dataset.

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document in a collection. We will create a TF-IDF representation of the preprocessed text.
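
A short sketch of label encoding, TF-IDF, and saving the outputs, continuing from the preprocessing sketch above ("category" is an assumed name for the target column):

    import pickle
    from sklearn.preprocessing import LabelEncoder
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Label encoding: map category names to integer codes.
    encoder = LabelEncoder()
    y = encoder.fit_transform(df["category"])

    # TF-IDF representation of the preprocessed text from the previous step.
    vectorizer = TfidfVectorizer(max_features=5000)
    X = vectorizer.fit_transform(df["clean_text"])

    # Save the outputs of the preprocessing steps for later use.
    with open("preprocessed_news.pickle", "wb") as f:
        pickle.dump({"X": X, "y": y, "vectorizer": vectorizer, "encoder": encoder}, f)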

Conclusion:

We have performed various preprocessing steps on the News Dataset, including text cleaning, lemmatization, stop word removal, and label encoding. We have also created a TF-IDF representation of the preprocessed text. These steps are essential in preparing text data for various NLP applications. Finally, we have saved the outputs of the preprocessing steps for future use.

Assignment 4:

Title:

Building a Transformer from Scratch using the PyTorch Library


Objectives:

- To understand the architecture of a Transformer.

- To implement the key components of a Transformer, including Multi-Head Attention, the Position-wise Feedforward Network, and Layer Normalization.

- To train and evaluate the Transformer model on a text classification task.

- To analyze the performance of the model and interpret the results.

Pre-requisites:

- Knowledge of deep learning concepts, including neural networks and optimization algorithms.

- Familiarity with the PyTorch library and its modules, such as nn, optim, and DataLoader.

- Understanding of NLP concepts, such as tokenization, padding, and embedding.

Dataset:

We can use any text classification dataset, such as the IMDB movie review dataset or the AG News dataset.

Theory:

The Transformer is a neural network architecture introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017). Instead of recurrence, it relies entirely on self-attention mechanisms to process sequential data such as text or speech.

The key components of a Transformer are Multi-Head Attention, the Position-wise Feedforward Network, and Layer Normalization. Multi-Head Attention computes attention between the input sequence and itself (self-attention) across several learned projections in parallel, the Position-wise Feedforward Network transforms the attention outputs at each position independently, and Layer Normalization normalizes the outputs of each sub-layer to stabilize training.
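
A compact sketch of one encoder block assembled from these components. For brevity it uses PyTorch's built-in nn.MultiheadAttention for the attention sub-layer, so it is a starting point rather than a fully from-scratch implementation; the dimensions are illustrative:

    import torch
    import torch.nn as nn

    class TransformerEncoderBlock(nn.Module):
        def __init__(self, d_model=128, num_heads=4, d_ff=512, dropout=0.1):
            super().__init__()
            # Multi-Head Attention over the input sequence (self-attention).
            self.attn = nn.MultiheadAttention(d_model, num_heads,
                                              dropout=dropout, batch_first=True)
            # Position-wise Feedforward Network applied to every position.
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
            # Layer Normalization after each sub-layer (post-norm, as in the paper).
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, key_padding_mask=None):
            attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
            x = self.norm1(x + self.dropout(attn_out))    # residual + LayerNorm
            x = self.norm2(x + self.dropout(self.ffn(x)))
            return x

    # Sanity check: a batch of 2 sequences, 16 tokens each, embedding size 128.
    block = TransformerEncoderBlock()
    print(block(torch.randn(2, 16, 128)).shape)   # torch.Size([2, 16, 128])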


To implement the Transformer from scratch using PyTorch, we will need to define each of these components and combine them to form a complete model. We will then train and evaluate the model on a text classification task.
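
A minimal outline of the classification model and one training step, building on the encoder block sketched above; the tokenization and DataLoader setup are omitted and all names and sizes here are illustrative:

    class TransformerClassifier(nn.Module):
        def __init__(self, vocab_size, num_classes, d_model=128, max_len=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.pos = nn.Embedding(max_len, d_model)        # learned position embeddings
            self.encoder = TransformerEncoderBlock(d_model)  # block defined in the sketch above
            self.fc = nn.Linear(d_model, num_classes)

        def forward(self, ids):
            positions = torch.arange(ids.size(1), device=ids.device)
            x = self.embed(ids) + self.pos(positions)
            x = self.encoder(x)
            return self.fc(x.mean(dim=1))        # mean-pool over tokens, then classify

    model = TransformerClassifier(vocab_size=20000, num_classes=2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    # One training step per batch of (token_ids, labels) yielded by a DataLoader.
    def train_step(ids, labels):
        optimizer.zero_grad()
        loss = criterion(model(ids), labels)
        loss.backward()
        optimizer.step()
        return loss.item()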

Conclusion:

We have explored the architecture of a Transformer and its key components, including Multi-Head Attention, the Position-wise Feedforward Network, and Layer Normalization. We have implemented these components from scratch using PyTorch and trained the model on a text classification task. We have also analyzed the performance of the model and interpreted the results. Building a Transformer from scratch is a challenging but rewarding task that can enhance our understanding of deep learning and NLP.

Assignment 5:

Title:

Understanding Morphology Using Add-Delete Tables

Objectives:

- To understand the concept of morphology and how words are built up from smaller meaning-bearing units.

- To learn about the different types of morphemes, including free and bound morphemes.

- To use add-delete tables as a tool for analyzing the morphological structure of words.

Pre-requisites:

- Basic knowledge of linguistics and grammar.

- Familiarity with the concept of words and their structures.

- Understanding of the difference between morphemes and phonemes.

Theory:

Morphology is the study of the structure and form of words, including how they are built up from smaller meaning-bearing units called morphemes. There are two types of morphemes: free morphemes, which can stand alone as words, and bound morphemes, which must be attached to other morphemes to create words.

Add-delete tables are a tool used in morphology to analyze the morphological structure of words. These tables show how words can be built up from smaller morphemes by adding or deleting affixes. The table is divided into columns for the stem, the affix that is added or deleted, and the resulting word.

To use add-delete tables, we start with a stem, which is the base form of a word. We then add prefixes or suffixes to the stem to create new words. We can also delete affixes to derive new words or to analyze the morphological structure of existing words.
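
A small illustration of an add-delete table as a data structure. Each row records the string to delete and the string to add, a slight elaboration of the column layout described above; the English examples are invented for illustration:

    # Each row: (surface word, characters to delete, characters to add, expected result).
    add_delete_table = [
        ("carries", "ies", "y",     "carry"),       # delete "-ies", add "-y"
        ("playing", "ing", "",      "play"),        # delete the suffix "-ing"
        ("happy",   "y",   "iness", "happiness"),   # delete "-y", add "-iness"
    ]

    def apply_rule(word, delete, add):
        # Strip the deleted ending (if present) and append the added string.
        if delete and word.endswith(delete):
            word = word[: -len(delete)]
        return word + add

    for word, delete, add, expected in add_delete_table:
        print(word, "->", apply_rule(word, delete, add), "(expected:", expected + ")")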

Conclusion:

We have explored the concept of morphology and how words are built up from smaller meaning-bearing units called morphemes. We have learned about the different types of morphemes, including free and bound morphemes, and how they are used to create words. We have also used add-delete tables as a tool for analyzing the morphological structure of words. By studying morphology, we can gain a deeper understanding of the structure and meaning of language.
