Ccs369-Lab Ex 3,4,5

The document outlines procedures for accessing text corpora using NLTK in Python, including installation, corpus downloading, and basic text analysis techniques. It also describes a function to identify the 50 most frequently occurring words in a text, excluding stop words, and provides an implementation of the Word2Vec model. The programs and algorithms are presented step-by-step, along with example outputs for verification.


EX.NO:03 ACCESSING TEXT CORPORA USING NLTK IN PYTHON

AIM:

To access text corpora using NLTK in Python.

ALGORITHM:

STEP:1 Install NLTK.


STEP:2 Import the NLTK library.
STEP:3 Download the required corpus.
STEP:4 Load the corpus.
STEP:5 Access the corpus data.
STEP:6 Tokenize the text data.
STEP:7 Perform basic analysis, such as frequency distribution.
STEP:8 Visualize the analysis results.
STEP:9 Apply advanced text processing, such as POS tagging.
STEP:10 Apply a model or algorithm, such as named entity recognition.

PROGRAM:

1. Install NLTK:
pip install nltk

2. Import NLTK:
import nltk

3. Download Corpora:
nltk.download('gutenberg')

OUTPUT:
True

4. Access a Corpus:
from nltk.corpus import gutenberg
print(gutenberg.fileids())

text=gutenberg.raw('austen-emma.txt')
print(text[:1000])

OUTPUT:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt',
'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-
parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-
caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-
leaves.txt']
[Emma by Jane Austen 1816]
VOLUME I
CHAPTER I
Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.
She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period. Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.
Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma. Between _them_ it was more the intimacy
of sisters. Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness o
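Steps 6 and 7 of the algorithm (tokenization and a frequency distribution) are not shown in the program above; the sketch below applies NLTK's FreqDist to the opening sentence of Emma. A plain split()-based tokenizer is used here as a simplification so the snippet needs no extra downloads; the word_tokenize function used later in this manual is the fuller approach.

```python
from nltk import FreqDist

# Tokenize a short excerpt by whitespace, stripping trailing punctuation,
# then count word frequencies with NLTK's FreqDist (a Counter subclass).
sample = ("Emma Woodhouse, handsome, clever, and rich, with a comfortable "
          "home and happy disposition, seemed to unite some of the best "
          "blessings of existence")
words = [w.strip('.,;').lower() for w in sample.split()]
fdist = FreqDist(words)
print(fdist.most_common(3))  # 'and' and 'of' each occur twice
```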

5. Downloading:
import nltk
nltk.download('brown')

OUTPUT:
True

6. Working with Other Corpora:
from nltk.corpus import brown
print(brown.categories())
news_text=brown.raw(categories='news')
print(news_text[:1000])

OUTPUT:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor',
'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr


an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn
produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns
took/vbd place/nn ./.

The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns


that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd
over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn
and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at
manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.

The/at September-October/np term/nn jury/nn had/hvd been/ben


charged/vbn by/in Fulton/np-tl Superior/jj-tl Court/nn-tl Judge/nn-tl
Durwood/np Pye/np to/to investigate/vb reports/nns of/in possible/jj ``/``
irregularities/nns ''/'' in/in the/at hard-fought/jj primary/nn which/wdt was/bedz
won/vbn by/in Mayor-nominate/nn-tl Ivan/np Allen/np Jr./

7. Explore Other Resources:


from nltk.corpus import stopwords
stop_words=stopwords.words('english')
print(stop_words[:20])

OUTPUT:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']

RESULT:

Thus the Python program for accessing text corpora using NLTK was
executed successfully and the output is verified.

EX.NO:04 WRITE A FUNCTION THAT FINDS THE 50 MOST FREQUENTLY
OCCURRING WORDS OF A TEXT THAT ARE NOT STOP WORDS

AIM:

To write a function that finds the 50 most frequently occurring words of a
text that are not stop words.

ALGORITHM:

STEP:1 Accept or read the text input.


STEP:2 Load a predefined list of stop words.
STEP:3 Convert the entire text to lowercase to ensure uniformity.
STEP:4 Split the text into individual words.
STEP:5 Remove words that are found in the stop words list from the list of
tokenized words.
STEP:6 Count the occurrences of each word in the filtered list.
STEP:7 Sort the words based on their frequency in descending order.
STEP:8 Select the top 50 words from the sorted list.
STEP:9 Format the result as a list of tuples or a similar structure where each
entry consists of a word and its frequency.
STEP:10 Return or display the list of the top 50 most frequent words and their
counts.

PROGRAM:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import string
nltk.download('stopwords')
nltk.download('punkt')
def get_most_frequent_words(text, num_words=50):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha()]
    filtered_words = [word for word in words if word not in stop_words]
    word_counts = Counter(filtered_words)
    most_common_words = word_counts.most_common(num_words)
    return most_common_words

if __name__ == "__main__":
    example_text = """Everything we see around us constitutes nature, including
    the sun, the moon, trees, flowers, fruits, human beings, birds, animals, etc. In
    nature, everyone depends on one another to keep the ecosystem healthy. For
    survival, every creature is interrelated and reliant on one another. Humans, for
    example, rely on nature for their survival, and nature provides us with oxygen,
    food, water, shelter, medicines, and clothing, among other things."""
    top_words = get_most_frequent_words(example_text)
    print(top_words)

OUTPUT:

[('nature', 4), ('us', 2), ('one', 2), ('another', 2), ('survival', 2), ('everything', 1),
('see', 1), ('around', 1), ('constitutes', 1), ('including', 1), ('sun', 1), ('moon', 1),
('trees', 1), ('flowers', 1), ('fruits', 1), ('human', 1), ('beings', 1), ('birds', 1),
('animals', 1), ('etc', 1), ('everyone', 1), ('depends', 1), ('keep', 1), ('ecosystem', 1),
('healthy', 1), ('every', 1), ('creature', 1), ('interrelated', 1), ('reliant', 1), ('humans',
1), ('example', 1), ('rely', 1), ('provides', 1), ('oxygen', 1), ('food', 1), ('water', 1),
('shelter', 1), ('medicines', 1), ('clothing', 1), ('among', 1), ('things', 1)]
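The num_words parameter lets the same function return a shorter list. The self-contained check below re-states the function in simplified form, substituting str.split() for word_tokenize and a small hand-written stop list for NLTK's (both are assumptions made to keep the snippet dependency-free):

```python
from collections import Counter

# Hand-written stop list standing in for stopwords.words('english').
STOP = {'we', 'see', 'us', 'the', 'in', 'on', 'for', 'and', 'is', 'of', 'to', 'a'}

def top_words(text, num_words=50):
    # Lowercase, keep alphabetic tokens, drop stop words, count.
    words = [w.lower() for w in text.split() if w.isalpha()]
    filtered = [w for w in words if w not in STOP]
    return Counter(filtered).most_common(num_words)

print(top_words("Nature gives and nature takes for nature is balance",
                num_words=2))
# → [('nature', 3), ('gives', 1)]
```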

RESULT:

Thus the Python program to write a function that finds the 50 most
frequently occurring words of a text that are not stop words was executed
successfully and the output is verified.

EX.NO:05 IMPLEMENT THE WORD2VEC MODEL

AIM:

To write a Python program to implement the Word2Vec model.

ALGORITHM:

STEP:1 Collect and preprocess text data.


STEP:2 Build vocabulary.
STEP:3 Generate training data.
STEP:4 Initialize model parameters.
STEP:5 Define model architecture.
STEP:6 Set up loss function.
STEP:7 Choose optimization algorithm.
STEP:8 Train the model.
STEP:9 Extract word embeddings.
STEP:10 Evaluate and use embeddings.

PROGRAM:

pip install gensim nltk


from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
sentences = [
simple_preprocess("This is the first document."),
simple_preprocess("This document is the second document."),
simple_preprocess("And this is the third one."),
simple_preprocess("Is this the first document?")
]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)
word = 'document'
embedding = model.wv[word]
print(f"Embedding for '{word}':\n{embedding}")

OUTPUT:

Embedding for 'document':


[-1.0724545e-03 4.7286271e-04 1.0206699e-02 1.8018546e-02
-1.8605899e-02 -1.4233618e-02 1.2917745e-02 1.7945977e-02
-1.0030856e-02 -7.5267432e-03 1.4761009e-02 -3.0669428e-03
-9.0732267e-03 1.3108104e-02 -9.7203208e-03 -3.6320353e-03
5.7531595e-03 1.9837476e-03 -1.6570430e-02 -1.8897636e-02
1.4623532e-02 1.0140524e-02 1.3515387e-02 1.5257311e-03
1.2701781e-02 -6.8107317e-03 -1.8928028e-03 1.1537147e-02
-1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
1.6154874e-02 -1.1861792e-02 9.0324880e-05 -9.5074680e-03
-1.9207101e-02 1.0014586e-02 -1.7519170e-02 -8.7836506e-03
-7.0199967e-05 -5.9236289e-04 -1.5322480e-02 1.9229487e-02
9.9641159e-03 1.8466286e-02]

RESULT:

Thus the Python program to implement the Word2Vec model was
executed successfully and the output is verified.

