NLP Experiment 2
Aim: Write a Program to perform Tokenization and Filtration & Script Validation.
Theory:
Tokenization: Tokenization is one of the most common tasks when it comes to working with
text data. Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text
document into smaller units, such as individual words or terms. Each of these smaller units is
called a token.
The tokens could be words, numbers, or punctuation marks. In tokenization, smaller units
are created by locating word boundaries. Wait – what are word boundaries?
These are the ending point of a word and the beginning of the next word. These tokens are
considered a first step for stemming and lemmatization.
Before processing a natural language, we need to identify the words that constitute a string
of characters. That is why tokenization is the most basic step in working with NLP (text data).
This is important because the meaning of the text can easily be interpreted by analyzing the
words present in the text.
Let’s take an example. Consider the string below:
“This is a cat.”
What do you think will happen after we perform tokenization on this string? We get [‘This’,
‘is’, ‘a’, ‘cat’]. There are numerous uses for doing this. We can use this tokenized form to:
• Count the number of words in the text
• Count the frequency of a word, that is, the number of times a particular word is present
And so on. We can extract a lot more information from the tokens, as the short sketch below
shows. For now, let’s dive into the different methods of performing tokenization in NLP.
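A minimal sketch of these two uses (token count and word frequency), using Python’s built-in
collections.Counter:

from collections import Counter

tokens = "the cat sat on the mat".split()
print(len(tokens))      # 6 -> number of tokens in the text
print(Counter(tokens))  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})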
Filtering:
Filtering is the process of removing stop words or any unnecessary data from the sentence. With
our text split into a stream of tokens, we have the ability to start filtering out any tokens that we
might not find helpful for our application. In the Text Pre-processing tool, we currently have the
option to filter out digit, punctuation, and stop word tokens (we address stop words in the next
section).
Stop words are words that don't necessarily add meaning to a body of text but are necessary for
the text to be grammatically correct. For example, words like "the" or "a" provide little
information on their own, but are important for the sentence to read naturally.
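As a small check (assuming NLTK’s English stop-word list, which requires a one-time
nltk.download('stopwords')), we can confirm that words like "the" and "a" are treated as stop
words:

from nltk.corpus import stopwords

# NLTK's English stop-word list; exact contents vary slightly across NLTK versions
stop_words = set(stopwords.words('english'))
print('the' in stop_words, 'a' in stop_words)  # True True
print('cat' in stop_words)                     # False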
● Python’s split function: This is the most basic one, and it returns a list of strings after
splitting the string based on a specific separator. The separator can be changed as needed.
● Sentence Tokenization: Here, the structure of the sentence is analyzed. As we know, a
sentence ends with a period (.); therefore, the period can be used as a separator.
● Word Tokenizer: It works similarly to a sentence tokenizer. Here the text is split into tokens
using a space (‘ ’) as the separator. If we pass no parameter, it splits on whitespace by default.
A short sketch of these split-based approaches is shown below.
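A minimal sketch of these split-based approaches, using only Python’s built-in str.split:

text = "This is a cat. That is a dog."

# Sentence tokenization: use the period as the separator
sentences = [s.strip() for s in text.split('.') if s.strip()]
print(sentences)  # ['This is a cat', 'That is a dog']

# Word tokenization: split() with no argument splits on whitespace by default
words = text.split()
print(words)  # ['This', 'is', 'a', 'cat.', 'That', 'is', 'a', 'dog.']

Note that plain split() keeps punctuation attached to words (e.g. 'cat.'), which is why the NLTK
word_tokenize used in the code below is often preferred.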
Tokenization:
Code:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models, needed once

def tokenize_script(text):
    # Split the text into word and punctuation tokens
    tokens = word_tokenize(text)
    return tokens

print(tokenize_script("This is a cat."))
Output:
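For the sample sentence above, this prints:
['This', 'is', 'a', 'cat', '.']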
Filtration:
Code:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stop-word lists

def filter_text(text):
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    # Keep tokens that are not stop words and are longer than one character
    filtered_text = [token for token in tokens
                     if token.lower() not in stop_words and len(token) > 1]
    return filtered_text

print(filter_text("This is a cat."))
Output:
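For the sample sentence above, the stop words and the single-character period are removed,
leaving:
['cat']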
Script Validation:
Code:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

def validate_script(script):
    tokens = word_tokenize(script)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens
                       if token.lower() not in stop_words and len(token) > 1]
    min_num_words = 5
    # A script is considered valid if enough content words remain after filtering
    return len(filtered_tokens) >= min_num_words

# Sample scripts, assumed here for illustration (the originals were not shown)
script1 = "Natural language processing enables machines to understand human language."
script2 = "This is short."
script3 = "Tokenization, filtration, and validation are common preprocessing steps."

print(validate_script(script1))
print(validate_script(script2))
print(validate_script(script3))
Output:
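With the sample scripts assumed above, this prints:
True
False
True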
Conclusion:
We have thus examined the ideas of tokenization and filtration. Each word in the statement was
tokenized, and during filtration the stop words and single-character tokens were then removed
from the statement.