Ccs369-Lab Ex 3,4,5

The document outlines procedures for accessing text corpora using NLTK in Python, including installation, corpus downloading, and basic text analysis techniques. It also describes a function to identify the 50 most frequently occurring words in a text, excluding stop words, and provides an implementation of the Word2Vec model. The programs and algorithms are presented step-by-step, along with example outputs for verification.


EX.NO:03 ACCESSING TEXT CORPORA USING NLTK IN PYTHON

AIM:

To access text corpora using NLTK in Python.

ALGORITHM:

STEP:1 Install NLTK.


STEP:2 Import the NLTK library.
STEP:3 Download the required corpus.
STEP:4 Load the corpus.
STEP:5 Access the corpus data.
STEP:6 Tokenize the text data.
STEP:7 Perform basic analysis, such as frequency distribution.
STEP:8 Visualize the analysis results.
STEP:9 Apply advanced text processing, such as POS tagging.
STEP:10 Apply a model or algorithm, such as named entity recognition.

PROGRAM:

1. Install NLTK:
pip install nltk

2. Import NLTK:
import nltk

3. Download Corpora:
nltk.download('gutenberg')

OUTPUT:
True

4. Access a Corpus:
from nltk.corpus import gutenberg
print(gutenberg.fileids())

text=gutenberg.raw('austen-emma.txt')
print(text[:1000])

OUTPUT:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt',
'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-
parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-
caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-
leaves.txt']
[Emma by Jane Austen 1816]
VOLUME I
CHAPTER I
Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.
She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period. Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.
Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma. Between _them_ it was more the intimacy
of sisters. Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness o
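Steps 6 and 7 of the algorithm (tokenization and a frequency distribution) are not shown in the program above; the sketch below applies NLTK's FreqDist to the opening sentence of Emma. A plain split()-based tokenizer is used here as a simplification so the snippet needs no extra downloads; the word_tokenize function used later in this manual is the fuller approach.

```python
from nltk import FreqDist

# Tokenize a short excerpt by whitespace, stripping trailing punctuation,
# then count word frequencies with NLTK's FreqDist (a Counter subclass).
sample = ("Emma Woodhouse, handsome, clever, and rich, with a comfortable "
          "home and happy disposition, seemed to unite some of the best "
          "blessings of existence")
words = [w.strip('.,;').lower() for w in sample.split()]
fdist = FreqDist(words)
print(fdist.most_common(3))  # 'and' and 'of' each occur twice
```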

5. Downloading:
import nltk
nltk.download('brown')

OUTPUT:
True

6. Working with Other Corpora:
from nltk.corpus import brown
print(brown.categories())
news_text=brown.raw(categories='news')
print(news_text[:1000])

OUTPUT:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor',
'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr


an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn
produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns
took/vbd place/nn ./.

The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns


that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd
over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn
and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at
manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.

The/at September-October/np term/nn jury/nn had/hvd been/ben


charged/vbn by/in Fulton/np-tl Superior/jj-tl Court/nn-tl Judge/nn-tl
Durwood/np Pye/np to/to investigate/vb reports/nns of/in possible/jj ``/``
irregularities/nns ''/'' in/in the/at hard-fought/jj primary/nn which/wdt was/bedz
won/vbn by/in Mayor-nominate/nn-tl Ivan/np Allen/np Jr./

7. Explore Other Resources:


from nltk.corpus import stopwords
stop_words=stopwords.words('english')
print(stop_words[:20])

OUTPUT:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']

RESULT:

Thus the Python program for accessing text corpora using NLTK was
executed successfully and the output is verified.

EX.NO:04 WRITE A FUNCTION THAT FINDS THE 50 MOST FREQUENTLY
OCCURRING WORDS OF A TEXT THAT ARE NOT STOP WORDS

AIM:

To write a function that finds the 50 most frequently occurring words of a
text that are not stop words.

ALGORITHM:

STEP:1 Accept or read the text input.


STEP:2 Load a predefined list of stop words.
STEP:3 Convert the entire text to lowercase to ensure uniformity.
STEP:4 Split the text into individual words.
STEP:5 Remove words that are found in the stop words list from the list of
tokenized words.
STEP:6 Count the occurrences of each word in the filtered list.
STEP:7 Sort the words based on their frequency in descending order.
STEP:8 Select the top 50 words from the sorted list.
STEP:9 Format the result as a list of tuples or a similar structure where each
entry consists of a word and its frequency.
STEP:10 Return or display the list of the top 50 most frequent words and their
counts.

PROGRAM:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import string
nltk.download('stopwords')
nltk.download('punkt')
def get_most_frequent_words(text, num_words=50):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha()]
    filtered_words = [word for word in words if word not in stop_words]
    word_counts = Counter(filtered_words)
    most_common_words = word_counts.most_common(num_words)
    return most_common_words

if __name__ == "__main__":
    example_text = """Everything we see around us constitutes nature, including
    the sun, the moon, trees, flowers, fruits, human beings, birds, animals, etc. In
    nature, everyone depends on one another to keep the ecosystem healthy. For
    survival, every creature is interrelated and reliant on one another. Humans, for
    example, rely on nature for their survival, and nature provides us with oxygen,
    food, water, shelter, medicines, and clothing, among other things."""
    top_words = get_most_frequent_words(example_text)
    print(top_words)

OUTPUT:

[('nature', 4), ('us', 2), ('one', 2), ('another', 2), ('survival', 2), ('everything', 1),
('see', 1), ('around', 1), ('constitutes', 1), ('including', 1), ('sun', 1), ('moon', 1),
('trees', 1), ('flowers', 1), ('fruits', 1), ('human', 1), ('beings', 1), ('birds', 1),
('animals', 1), ('etc', 1), ('everyone', 1), ('depends', 1), ('keep', 1), ('ecosystem', 1),
('healthy', 1), ('every', 1), ('creature', 1), ('interrelated', 1), ('reliant', 1), ('humans',
1), ('example', 1), ('rely', 1), ('provides', 1), ('oxygen', 1), ('food', 1), ('water', 1),
('shelter', 1), ('medicines', 1), ('clothing', 1), ('among', 1), ('things', 1)]
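The num_words parameter lets the same function return a shorter list. The self-contained check below re-states the function in simplified form, substituting str.split() for word_tokenize and a small hand-written stop list for NLTK's (both are assumptions made to keep the snippet dependency-free):

```python
from collections import Counter

# Hand-written stop list standing in for stopwords.words('english').
STOP = {'we', 'see', 'us', 'the', 'in', 'on', 'for', 'and', 'is', 'of', 'to', 'a'}

def top_words(text, num_words=50):
    # Lowercase, keep alphabetic tokens, drop stop words, count.
    words = [w.lower() for w in text.split() if w.isalpha()]
    filtered = [w for w in words if w not in STOP]
    return Counter(filtered).most_common(num_words)

print(top_words("Nature gives and nature takes for nature is balance",
                num_words=2))
# → [('nature', 3), ('gives', 1)]
```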

RESULT:

Thus the Python program to write a function that finds the 50 most
frequently occurring words of a text that are not stop words was executed
successfully and the output is verified.

EX.NO:05 IMPLEMENT THE WORD2VEC MODEL

AIM:

To write a Python program to implement the Word2Vec model.

ALGORITHM:

STEP:1 Collect and preprocess text data.


STEP:2 Build vocabulary.
STEP:3 Generate training data.
STEP:4 Initialize model parameters.
STEP:5 Define model architecture.
STEP:6 Set up loss function.
STEP:7 Choose optimization algorithm.
STEP:8 Train the model.
STEP:9 Extract word embeddings.
STEP:10 Evaluate and use embeddings.

PROGRAM:

pip install gensim nltk


from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
sentences = [
simple_preprocess("This is the first document."),
simple_preprocess("This document is the second document."),
simple_preprocess("And this is the third one."),
simple_preprocess("Is this the first document?")
]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)
word = 'document'
embedding = model.wv[word]
print(f"Embedding for '{word}':\n{embedding}")

OUTPUT:

Embedding for 'document':


[-1.0724545e-03 4.7286271e-04 1.0206699e-02 1.8018546e-02
-1.8605899e-02 -1.4233618e-02 1.2917745e-02 1.7945977e-02
-1.0030856e-02 -7.5267432e-03 1.4761009e-02 -3.0669428e-03
-9.0732267e-03 1.3108104e-02 -9.7203208e-03 -3.6320353e-03
5.7531595e-03 1.9837476e-03 -1.6570430e-02 -1.8897636e-02
1.4623532e-02 1.0140524e-02 1.3515387e-02 1.5257311e-03
1.2701781e-02 -6.8107317e-03 -1.8928028e-03 1.1537147e-02
-1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
1.6154874e-02 -1.1861792e-02 9.0324880e-05 -9.5074680e-03
-1.9207101e-02 1.0014586e-02 -1.7519170e-02 -8.7836506e-03
-7.0199967e-05 -5.9236289e-04 -1.5322480e-02 1.9229487e-02
9.9641159e-03 1.8466286e-02]

RESULT:

Thus the Python program to implement the Word2Vec model was
executed successfully and the output is verified.

