NLP - Short Assignments

This document presents five short assignments on natural language processing techniques, covering tokenization, stemming, lemmatization, bag-of-words modeling, TF-IDF, word embeddings, text classification using Transformers, and morphological analysis using add-delete tables.


Assignment 1:

Title:

Tokenization and Stemming Techniques using NLTK

Objectives:

- To perform tokenization on sample sentences using various techniques available in the NLTK library, including whitespace, punctuation-based, Treebank, Tweet, and MWE tokenization.

- To compare the effectiveness of the different tokenization techniques in terms of accuracy and speed.

- To apply the Porter Stemmer and Snowball Stemmer to the tokenized sentences to reduce words to their root form.

- To apply lemmatization to the same set of tokenized sentences for comparison.

Pre-requisites:

- Basic knowledge of Natural Language Processing (NLP) concepts

- Familiarity with the Python programming language and the NLTK library

Sample Sentence:

"I am trying to learn Natural Language Processing using the NLTK library. NLTK is a

powerful tool for working with human language data."

Theory:

Tokenization is the process of breaking a text into individual words or phrases, also known as tokens. There are several tokenization techniques available in the NLTK library, including whitespace, punctuation-based, Treebank, Tweet, and MWE tokenization. Each technique has its own advantages and disadvantages, and the choice of technique depends on the specific requirements of the NLP task.
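
A minimal sketch of these tokenizers applied to the sample sentence above (NLTK is assumed to be installed):

    from nltk.tokenize import (WhitespaceTokenizer, WordPunctTokenizer,
                               TreebankWordTokenizer, TweetTokenizer, MWETokenizer)

    text = ("I am trying to learn Natural Language Processing using the NLTK library. "
            "NLTK is a powerful tool for working with human language data.")

    # Whitespace tokenization: split on spaces, tabs, and newlines only.
    print(WhitespaceTokenizer().tokenize(text))

    # Punctuation-based tokenization: punctuation becomes separate tokens.
    print(WordPunctTokenizer().tokenize(text))

    # Treebank tokenization: Penn Treebank conventions (contractions, punctuation).
    print(TreebankWordTokenizer().tokenize(text))

    # Tweet tokenization: robust to hashtags, mentions, and emoticons.
    print(TweetTokenizer().tokenize(text))

    # MWE tokenization: merge a predefined multi-word expression into one token.
    mwe = MWETokenizer([("Natural", "Language", "Processing")], separator="_")
    print(mwe.tokenize(text.split()))

Timing each call (for example with the time module) over a larger text sample is one simple way to compare the tokenizers' speed.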

Stemming is the process of reducing a word to its root form. The Porter Stemmer and Snowball Stemmer are two widely used stemming algorithms in the NLTK library. The Porter Stemmer is based on a set of suffix-stripping rules and heuristics, while the Snowball Stemmer (also known as Porter2) refines those rules, generally produces more consistent stems, and supports several languages besides English.
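
A short sketch comparing both stemmers on a handful of tokens:

    from nltk.stem import PorterStemmer, SnowballStemmer

    tokens = ["trying", "studies", "easily", "powerful", "working"]

    porter = PorterStemmer()
    snowball = SnowballStemmer("english")

    # Compare the two algorithms token by token.
    for token in tokens:
        print(token, "->", porter.stem(token), "/", snowball.stem(token))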

Lemmatization is the process of reducing a word to its base or dictionary form, known as the lemma. It uses a dictionary (WordNet, in NLTK's case) to map words to their base forms, which generally makes it more accurate than stemming.
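
A minimal sketch using NLTK's WordNet lemmatizer (the WordNet data must be downloaded once):

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet")  # one-time download of the WordNet dictionary

    lemmatizer = WordNetLemmatizer()

    # Passing a part-of-speech tag ("v" for verb) improves the dictionary lookup.
    print(lemmatizer.lemmatize("studies", pos="v"))   # -> study
    print(lemmatizer.lemmatize("working", pos="v"))   # -> work
    print(lemmatizer.lemmatize("languages"))          # noun by default -> language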

Conclusion:

We have explored the different tokenization techniques available in the NLTK library and compared their effectiveness in terms of accuracy and speed. We have also applied the Porter Stemmer and Snowball Stemmer to the tokenized sentences to reduce words to their root form. Finally, we have compared the results of stemming and lemmatization on the same set of tokenized sentences.

Assignment 2:

Title:

Bag-of-Words, TF-IDF and Word2Vec Embeddings on Car Dataset

Objectives:

- To apply a bag-of-words approach to the Car Dataset by computing the raw count occurrence and the normalized count occurrence of words in the dataset.

- To calculate TF-IDF scores for the words in the dataset.

- To create word embeddings using the Word2Vec model and analyze the results.

Pre-requisites:

- Basic knowledge of Natural Language Processing (NLP) concepts.

- Familiarity with the Python programming language and its libraries such as NLTK, Pandas, and Gensim.

Dataset:

The dataset to be used for this assignment is the Car Dataset from Kaggle, which contains information about cars, including their make, model, year, mileage, fuel type, and more.

Theory:

The Bag-of-Words approach is a common NLP technique that represents a document as a bag of its words, ignoring the order and context of the words. We will count the occurrence and normalized count occurrence of words in the dataset.
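
A minimal sketch of the raw and normalized counts; the file name, the text column ("name"), and the use of scikit-learn's CountVectorizer are assumptions made for illustration:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    df = pd.read_csv("car_dataset.csv")          # file name is an assumption
    docs = df["name"].astype(str).tolist()       # hypothetical text column

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)      # raw count occurrence

    # Normalized count occurrence: divide each row by its total token count.
    X = counts.toarray()
    normalized = X / X.sum(axis=1, keepdims=True)

    print(vectorizer.get_feature_names_out()[:10])
    print(X[:2])
    print(normalized[:2])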

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document in a collection. It weights the frequency of a word within a document against how common the word is across the entire collection, so words that are frequent in one document but rare elsewhere receive high scores. We will calculate TF-IDF scores for the words in the Car Dataset.
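
A possible sketch using scikit-learn's TfidfVectorizer (an assumption; any TF-IDF implementation works), reusing the docs list built above:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer()
    scores = tfidf.fit_transform(docs)

    # Show the five highest-scoring terms of the first document.
    row = scores[0].toarray().ravel()
    terms = tfidf.get_feature_names_out()
    top = np.argsort(row)[::-1][:5]
    print([(terms[i], round(float(row[i]), 3)) for i in top])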

Word2Vec is a neural network-based approach for creating word embeddings: dense vector representations of words in a continuous vector space, in which words that occur in similar contexts end up with similar vectors. We will create Word2Vec embeddings for the Car Dataset and analyze the results.
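
A small sketch with Gensim's Word2Vec, again reusing docs from above; the hyperparameter values are illustrative choices, not requirements:

    from gensim.models import Word2Vec

    # Tokenize each document into lowercase words.
    sentences = [doc.lower().split() for doc in docs]

    # Train a small skip-gram model (sg=1); vector_size, window, epochs are illustrative.
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1, epochs=10)

    some_word = model.wv.index_to_key[0]             # a word guaranteed to be in the vocabulary
    print(model.wv[some_word][:10])                  # first 10 dimensions of its vector
    print(model.wv.most_similar(some_word, topn=5))  # nearest neighbours in embedding space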

Conclusion:

We have explored different techniques for analyzing text data in the Car Dataset. We have performed a bag-of-words approach to count the occurrence and normalized count occurrence of words in the dataset, as well as calculated TF-IDF scores for the words. Finally, we have created Word2Vec embeddings for the dataset and analyzed the results.

Assignment 3:

Title:

Text Cleaning, Lemmatization, Stop Word Removal, Label Encoding, and TF-IDF Representation on News Dataset

Objectives:

- To perform text cleaning on the News Dataset.

- To perform lemmatization on the cleaned text using any method.

- To remove stop words from the text using any method.

- To perform label encoding on the target variable of the dataset.

- To create a TF-IDF representation of the preprocessed text.

- To save the outputs of the preprocessing steps.

Pre-requisites:

- Basic knowledge of Natural Language Processing (NLP) concepts.

- Familiarity with the Python programming language and its libraries such as NLTK, Pandas, and Scikit-learn.

Dataset:

The dataset to be used for this assignment is the News Dataset available in the following GitHub repository: https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle. This dataset contains news articles labeled with their respective categories.

Theory:

Text Cleaning involves removing noise, unwanted characters, and unnecessary words from the text data. We will perform text cleaning on the News Dataset.

Lemmatization is the process of reducing words to their base or dictionary form. We will perform lemmatization on the cleaned text using any suitable method.

Stop Word Removal involves removing common words that do not carry much meaning from the text data. We will remove stop words from the text using any suitable method.
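
A possible sketch of these three steps combined, assuming the articles are held in a pandas column named "text" (a hypothetical column name) inside the pickled dataset:

    import re
    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords")
    nltk.download("wordnet")

    df = pd.read_pickle("News_dataset.pickle")       # dataset file from the repository above
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        # Text cleaning: lowercase and keep only letters and spaces.
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        # Stop word removal and lemmatization in one pass over the tokens.
        return " ".join(lemmatizer.lemmatize(tok) for tok in text.split()
                        if tok not in stop_words)

    df["clean_text"] = df["text"].astype(str).apply(preprocess)   # "text" column is assumed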

Label Encoding is the process of converting categorical variables into numerical format. We will perform label encoding on the target variable of the dataset.

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document in a collection. We will create a TF-IDF representation of the preprocessed text.
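
A short sketch of label encoding, TF-IDF, and saving the outputs, continuing from the preprocessing sketch above ("category" is an assumed name for the target column):

    import pickle
    from sklearn.preprocessing import LabelEncoder
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Label encoding: map category names to integer codes.
    encoder = LabelEncoder()
    y = encoder.fit_transform(df["category"])

    # TF-IDF representation of the preprocessed text from the previous step.
    vectorizer = TfidfVectorizer(max_features=5000)
    X = vectorizer.fit_transform(df["clean_text"])

    # Save the outputs of the preprocessing steps for later use.
    with open("preprocessed_news.pickle", "wb") as f:
        pickle.dump({"X": X, "y": y, "vectorizer": vectorizer, "encoder": encoder}, f)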

Conclusion:

We have performed various preprocessing steps on the News Dataset, including text cleaning, lemmatization, stop word removal, and label encoding. We have also created a TF-IDF representation of the preprocessed text. These steps are essential in preparing text data for various NLP applications. Finally, we have saved the outputs of the preprocessing steps for future use.

Assignment 4:

Title:

Building a Transformer from Scratch using the PyTorch Library


Objectives:

- To understand the architecture of a Transformer.

- To implement the key components of a Transformer, including Multi-Head Attention, the Position-wise Feedforward Network, and Layer Normalization.

- To train and evaluate the Transformer model on a text classification task.

- To analyze the performance of the model and interpret the results.

Pre-requisites:

- Knowledge of deep learning concepts, including neural networks and optimization algorithms.

- Familiarity with the PyTorch library and its modules, such as nn, optim, and DataLoader.

- Understanding of NLP concepts, such as tokenization, padding, and embedding.

Dataset:

We can use any text classification dataset, such as the IMDB movie review dataset or the AG News dataset.

Theory:

The Transformer is a neural network architecture introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017). Instead of recurrence, it relies entirely on self-attention mechanisms to process sequential data such as text or speech.

The key components of a Transformer are Multi-Head Attention, the Position-wise Feedforward Network, and Layer Normalization. Multi-Head Attention computes attention between the input sequence and itself (self-attention) across several learned projections in parallel, the Position-wise Feedforward Network transforms the attention outputs at each position independently, and Layer Normalization normalizes the outputs of each sub-layer to stabilize training.
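
A compact sketch of one encoder block assembled from these components. For brevity it uses PyTorch's built-in nn.MultiheadAttention for the attention sub-layer, so it is a starting point rather than a fully from-scratch implementation; the dimensions are illustrative:

    import torch
    import torch.nn as nn

    class TransformerEncoderBlock(nn.Module):
        def __init__(self, d_model=128, num_heads=4, d_ff=512, dropout=0.1):
            super().__init__()
            # Multi-Head Attention over the input sequence (self-attention).
            self.attn = nn.MultiheadAttention(d_model, num_heads,
                                              dropout=dropout, batch_first=True)
            # Position-wise Feedforward Network applied to every position.
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
            # Layer Normalization after each sub-layer (post-norm, as in the paper).
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, key_padding_mask=None):
            attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
            x = self.norm1(x + self.dropout(attn_out))    # residual + LayerNorm
            x = self.norm2(x + self.dropout(self.ffn(x)))
            return x

    # Sanity check: a batch of 2 sequences, 16 tokens each, embedding size 128.
    block = TransformerEncoderBlock()
    print(block(torch.randn(2, 16, 128)).shape)   # torch.Size([2, 16, 128])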


To implement the Transformer from scratch using PyTorch, we will need to define each of these components and combine them to form a complete model. We will then train and evaluate the model on a text classification task.
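
A minimal outline of the classification model and one training step, building on the encoder block sketched above; the tokenization and DataLoader setup are omitted and all names and sizes here are illustrative:

    class TransformerClassifier(nn.Module):
        def __init__(self, vocab_size, num_classes, d_model=128, max_len=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.pos = nn.Embedding(max_len, d_model)        # learned position embeddings
            self.encoder = TransformerEncoderBlock(d_model)  # block defined in the sketch above
            self.fc = nn.Linear(d_model, num_classes)

        def forward(self, ids):
            positions = torch.arange(ids.size(1), device=ids.device)
            x = self.embed(ids) + self.pos(positions)
            x = self.encoder(x)
            return self.fc(x.mean(dim=1))        # mean-pool over tokens, then classify

    model = TransformerClassifier(vocab_size=20000, num_classes=2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    # One training step per batch of (token_ids, labels) yielded by a DataLoader.
    def train_step(ids, labels):
        optimizer.zero_grad()
        loss = criterion(model(ids), labels)
        loss.backward()
        optimizer.step()
        return loss.item()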

Conclusion:

We have explored the architecture of a Transformer and its key components, including Multi-Head Attention, the Position-wise Feedforward Network, and Layer Normalization. We have implemented these components from scratch using PyTorch and trained the model on a text classification task. We have also analyzed the performance of the model and interpreted the results. Building a Transformer from scratch is a challenging but rewarding task that can enhance our understanding of deep learning and NLP.

Assignment 5:

Title:

Understanding Morphology Using Add-Delete Tables

Objectives:

- To understand the concept of morphology and how words are built up from smaller meaning-bearing units.

- To learn about the different types of morphemes, including free and bound morphemes.

- To use add-delete tables as a tool for analyzing the morphological structure of words.

Pre-requisites:

- Basic knowledge of linguistics and grammar.

- Familiarity with the concept of words and their structures.

- Understanding of the difference between morphemes and phonemes.

Theory:

Morphology is the study of the structure and form of words, including how they are built up from smaller meaning-bearing units called morphemes. There are two types of morphemes: free morphemes, which can stand alone as words, and bound morphemes, which must be attached to other morphemes to create words.

Add-delete tables are a tool used in morphology to analyze the morphological structure of words. These tables show how words can be built up from smaller morphemes by adding or deleting affixes. The table is divided into columns for the stem, the affix that is added or deleted, and the resulting word.

To use add-delete tables, we start with a stem, which is the base form of a word. We then add prefixes or suffixes to the stem to create new words. We can also delete affixes to derive new words or to analyze the morphological structure of existing words.
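
A small illustration of an add-delete table as a data structure. Each row records the string to delete and the string to add, a slight elaboration of the column layout described above; the English examples are invented for illustration:

    # Each row: (surface word, characters to delete, characters to add, expected result).
    add_delete_table = [
        ("carries", "ies", "y",     "carry"),       # delete "-ies", add "-y"
        ("playing", "ing", "",      "play"),        # delete the suffix "-ing"
        ("happy",   "y",   "iness", "happiness"),   # delete "-y", add "-iness"
    ]

    def apply_rule(word, delete, add):
        # Strip the deleted ending (if present) and append the added string.
        if delete and word.endswith(delete):
            word = word[: -len(delete)]
        return word + add

    for word, delete, add, expected in add_delete_table:
        print(word, "->", apply_rule(word, delete, add), "(expected:", expected + ")")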

Conclusion:

We have explored the concept of morphology and how words are built up from smaller meaning-bearing units called morphemes. We have learned about the different types of morphemes, including free and bound morphemes, and how they are used to create words. We have also used add-delete tables as a tool for analyzing the morphological structure of words. By studying morphology, we can gain a deeper understanding of the structure and meaning of language.
