TSA Lab Record - CSE
PRINCE SHRI VENKATESHWARA
PADMAVATHY ENGINEERING COLLEGE
(An Autonomous Institution)
Mambakkam-Medavakkam Main Road,
Ponmar, Chennai - 600 127
Academic Year: 2023–2024
Register Number :
Year/ Semester :
PRINCE SHRI VENKATESHWARA
PADMAVATHY ENGINEERING COLLEGE
(An Autonomous Institution)
BONAFIDE CERTIFICATE
Name : ………………………………………………
Register No : ………………………………………………
Semester : ………………………………………………
Branch : ………………………………………………
Certified that this is a bonafide record of the work done by the above student in the
CCS369 - Text and Speech Analysis Laboratory during the year 2023–2024.
To be a prominent institution for technical education and research, meeting global challenges and the
needs of society.
To nurture in the students professional and ethical values, and to instill in them a spirit
of innovation and entrepreneurship.
To encourage in the students a desire for higher learning and research, and to equip them to
face global challenges.
To provide opportunities for students to acquire the additional skills needed to make them
industry-ready.
2. Problem analysis: Identify, formulate, review research literature, and analyse complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design system
components or processes that meet the specified needs with appropriate consideration for the public
health and safety, and the cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to
provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modelling to complex engineering activities with an
understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal and cultural issues, and the consequent responsibilities relevant to the professional
engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.
9. Individual and teamwork: Function effectively as an individual, and as a member or leader in diverse
teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and
design documentation, make effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one’s own work, as a member and leader in a team, to manage
projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
PEO 1: Acquire logical and analytical skills with a solid foundation in the core areas of Computer
Science & Engineering.
PEO 2: Continually engage in learning new technologies and actively contribute to academia,
Research & Development, and society.
PEO 3: Enrich the passion for higher studies, research and a successful career in the software industry.
PSO 2: Ability to employ modern tools for analyzing data and networks while building a career as a
software professional, researcher or entrepreneur with a zeal for higher studies.
INSTRUCTIONS TO STUDENTS
Before entering the lab, the student should carry the following things (MANDATORY)
Identity card issued by the college.
Class notes
Lab observation book
Lab Manual
Lab Record
Students must sign in and sign out in the register provided when attending the lab session,
without fail.
Come to the laboratory on time. Students who are more than 15 minutes late will not be allowed
to attend the lab.
Students must maintain 100% attendance in the lab; otherwise, strict action will be taken.
All students must follow a Dress Code while in the laboratory
Food and drinks are NOT allowed.
All bags must be left at the indicated place.
Refer to the lab staff if you need any help in using the lab.
Respect the laboratory and its other users.
Workspace must be kept clean and tidy after the experiment is completed.
Read the Manual carefully before coming to the laboratory and be sure about what you
are supposed to do.
Do the experiments as per the instructions given in the manual.
Copy all the programs taught in class into the observation book before attending the lab
session.
Students are not supposed to use floppy disks or pen drives without the permission of the lab
in-charge.
Lab records need to be submitted on or before the date of submission.
Syllabus
1. Create Regular expressions in Python for detecting word patterns and tokenizing text
2. Getting started with Python and NLTK - Searching Text, Counting Vocabulary, Frequency
Distribution, Collocations, Bigrams
3. Accessing Text Corpora using NLTK in Python
4. Write a function that finds the 50 most frequently occurring words of a text that are not stop
words.
5. Implement the Word2Vec model
6. Use a transformer for implementing classification
7. Design a chatbot with a simple dialog system
8. Convert text to speech and find accuracy
9. Design a speech recognition system and find the error rate
COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO1: Explain existing and emerging deep learning architectures for text and speech processing
CO2: Apply deep learning techniques for NLP tasks, language modelling and machine translation
CO3: Explain coreference and coherence for text processing
CO4: Build question-answering systems, chatbots and dialogue systems
CO5: Apply deep learning models for building speech recognition and text-to-speech systems
Mapping of Course Outcomes with the POs and PSOs
CO/PO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
CO1 2 2 1 2 2 1 2 1 1
CO2 2 2 1 2 2 1 2 2 1
CO3 3 3 1 3 3 2 2 3 2
CO4 3 3 1 3 3 2 2 3 2
CO5 3 3 2 3 1 2 2 3 2
Additional Experiments
TABLE OF CONTENTS
EX NO:1
DATE: CREATE REGULAR EXPRESSIONS IN PYTHON FOR DETECTING WORD PATTERNS AND TOKENIZING TEXT
AIM:
Write a program to tokenize text into individual words and sentence endings using regular
expressions in Python.
ALGORITHM:
PROGRAM:
import re
print('Range',re.search(r'[a-zA-Z]', 'x'))
Range <_sre.SRE_Match object; span=(0, 1), match='x'>
import re
print(re.search(r'[^a-z]', 'c'))
None
import re
print(re.search(r'G[^e]', 'Geeks'))
None
import re
print(re.search(r'G[e]', 'Geeks'))
<_sre.SRE_Match object; span=(0, 2), match='Ge'>
import re
print('Geeks:', re.search(r'\bGeeks\b', 'Geeks'))
import re
# Beginning of String
match = re.search(r'^Geek', 'Campus Geek of the month')
print('Beg. of String:', match)
match = re.search(r'^Geek', 'Geek of the month')
print('Beg. of String:', match)
# End of String
match = re.search(r'Geeks$', 'Compute science portal-GeeksforGeeks')
print('End of String:', match)
Beg. of String: None
Beg. of String: <_sre.SRE_Match object; span=(0, 4), match='Geek'>
End of String: <_sre.SRE_Match object; span=(31, 36), match='Geeks'>
import re
print('Any Character', re.search(r'p.th.n', 'python 3'))
Any Character <_sre.SRE_Match object; span=(0, 6), match='python'>
import re
print('Color',re.search(r'colou?r', 'color'))
print('Colour',re.search(r'colou?r', 'colour'))
Color <_sre.SRE_Match object; span=(0, 5), match='color'>
Colour <_sre.SRE_Match object; span=(0, 6), match='colour'>
import re
print('Date{mm-dd-yyyy}:', re.search(r'[\d]{2}-[\d]{2}-[\d]{4}', '18-08-2020'))
Date{mm-dd-yyyy}: <_sre.SRE_Match object; span=(0, 10), match='18-08-2020'>
import re
print('Three Digit:', re.search(r'[\d]{3,4}', '189'))
print('Four Digit:', re.search(r'[\d]{3,4}', '2145'))
Three Digit: <_sre.SRE_Match object; span=(0, 3), match='189'>
Four Digit: <_sre.SRE_Match object; span=(0, 4), match='2145'>
import re
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'
# find all runs of digits in the string
result = re.findall(pattern, string)
print(result)
import re
# multiline string
string = 'abc 12\
de 23 \n f45 6'
# matches all whitespace characters
pattern = '\s+'
replace = ''
new_string = re.sub(r'\s+', replace, string, 1)
print(new_string)
# Output:
# abc12de 23
# f45 6
abc12de 23
f45 6
import re
string = "Python is fun"
# check if 'Python' is at the beginning
match = re.search('\APython', string)
if match:
print("pattern found inside the string")
else:
print("pattern not found")
# Output: pattern found inside the string
pattern found inside the string
import re
string = '39801 356, 2102 1111'
# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'
match = re.search(pattern, string)
print(match.group() if match else 'No match')
Viva Questions:
1. How can I detect all words containing only letters in a text using regular expressions in Python?
To detect words containing only letters, you can use the regular expression \b[A-Za-z]+\b.
3. How can I identify words with a specific pattern like having exactly 3 vowels in a text using regular
expressions in Python?
To identify words with exactly 3 vowels, you can use the regular expression
\b(?=[^aeiouAEIOU]*[aeiouAEIOU][^aeiouAEIOU]*[aeiouAEIOU][^aeiouAEIOU]*[aeiouAEIOU])\w+\b.
4. How can I tokenize a text while ignoring punctuation using regular expressions in Python?
To tokenize a text while ignoring punctuation, you can use the regular expression \b\w+\b,
which matches word characters (\w) and boundaries (\b), effectively ignoring punctuation.
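For instance, a minimal usage sketch that tokenizes words and sentence endings, matching the aim of this exercise (the sample sentence is hypothetical):
import re
text = "Regular expressions are powerful. They tokenize text! Do you agree?"
words = re.findall(r'\b\w+\b', text)          # word tokens, punctuation ignored
sentences = re.split(r'(?<=[.!?])\s+', text)  # split after sentence-ending punctuation
print(words)
print(sentences)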
5. How do I detect and extract email addresses from a given text using regular expressions in Python?
To detect and extract email addresses from text, you can use the regular expression
[\w\.-]+@[a-zA-Z\d\.-]+\.[a-zA-Z]{2,}, which matches the common pattern for email addresses.
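A short usage sketch of this pattern (the sample addresses are hypothetical):
import re
sample = "Contact admin@example.com or support@mail.example.org for help."
emails = re.findall(r'[\w\.-]+@[a-zA-Z\d\.-]+\.[a-zA-Z]{2,}', sample)
print(emails)  # ['admin@example.com', 'support@mail.example.org']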
RESULT:
Thus the program executed successfully for Regular Expressions.
EX NO:2
DATE: GETTING STARTED WITH PYTHON AND NLTK - SEARCHING TEXT, COUNTING VOCABULARY, FREQUENCY DISTRIBUTION, COLLOCATIONS, BIGRAMS
AIM:
Write a code to work with Python and NLTK (Natural Language Toolkit) for tasks like searching text, counting
vocabulary, frequency distribution, collocations, and bigrams.
ALGORITHM:
PROGRAM:
# using gensim
documents = [u"Chennai Super Kings club defeat local rivals Rajasthan Royals this weekend.",
u"Weekend cricket frenzy takes over India.",
u"Machine Learning is one of the hottest areas in CSE.",
u"London football club bid to move to Wembley stadium.",
u"Chennai Super Kings bid 5 crores for striker Dhoni.",
u"Financial troubles result in loss of millions for bank.",
u"New York Syndicate Union bank files for bankruptcy after financial losses.",
u"Chennai football club is taken over by oil millionaire from Maharashtra.",
u"Banking on finances not working for Russia."]
import spacy
nlp = spacy.load('en_core_web_sm')
texts = []
for document in documents:
text = []
doc = nlp(document)
for w in doc:
if not w.is_stop and not w.is_punct and not w.like_num:
text.append(w.lemma_)
texts.append(text)
print(texts)
from gensim import corpora

dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)
{'Chennai': 0, 'Rajasthan': 1, 'Royals': 2, 'Super': 3, 'club': 4, 'defeat': 5, 'king': 6, 'local': 7, 'rival': 8, 'weekend': 9,
'India': 10, 'Weekend': 11, 'cricket': 12, 'frenzy': 13, 'take': 14, 'CSE': 15, 'Learning': 16, 'Machine': 17, 'area': 18,
'hot': 19, 'London': 20, 'Wembley': 21, 'bid': 22, 'football': 23, 'stadium': 24, 'Dhoni': 25, 'crore': 26, 'striker': 27,
'bank': 28, 'financial': 29, 'loss': 30, 'million': 31, 'result': 32, 'trouble': 33, 'New': 34, 'Syndicate': 35, 'Union': 36,
'York': 37, 'bankruptcy': 38, 'file': 39, 'Maharashtra': 40, 'millionaire': 41, 'oil': 42, 'Russia': 43, 'banking': 44,
'finance': 45, 'work': 46}
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)], [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1)],
[(15, 1), (16, 1), (17, 1), (18, 1), (19, 1)], [(4, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1)], [(0, 1), (3, 1), (6, 1),
(22, 1), (25, 1), (26, 1), (27, 1)], [(28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)], [(28, 1), (29, 1), (30, 1), (34, 1),
(35, 1), (36, 1), (37, 1), (38, 1), (39, 1)], [(0, 1), (4, 1), (14, 1), (23, 1), (40, 1), (41, 1), (42, 1)], [(43, 1), (44, 1),
(45, 1), (46, 1)]]
print(dictionary)
# TF-IDF representation
from gensim import models
tfidf = models.TfidfModel(corpus)
# print the TF-IDF weights of every document (earlier documents are truncated in this record)
for document in tfidf[corpus]:
    print(document)
(39, 0.3674642015582995)]
[(0, 0.23736497937484743), (4, 0.23736497937484743), (14, 0.3249693308062283), (23, 0.3249693308062283),
(40, 0.47472995874969487), (41, 0.47472995874969487), (42, 0.47472995874969487)]
[(43, 0.5), (44, 0.5), (45, 0.5), (46, 0.5)]
import gensim
bigram = gensim.models.Phrases(texts)
print(bigram)
[['Chennai', 'Super', 'king', 'club', 'defeat', 'local', 'rival', 'Rajasthan', 'Royals', 'weekend'], ['Weekend', 'cricket',
'frenzy', 'take', 'India'], ['Machine', 'Learning', 'hot', 'area', 'CSE'], ['London', 'football', 'club', 'bid', 'Wembley',
'stadium'],
['Chennai', 'Super', 'king', 'bid', 'crore', 'striker', 'Dhoni'], ['financial', 'trouble', 'result', 'loss', 'million', 'bank'],
['New', 'York', 'Syndicate', 'Union', 'bank', 'file', 'bankruptcy', 'financial', 'loss'], ['Chennai', 'football', 'club', 'take',
'oil', 'millionaire', 'Maharashtra'], ['banking', 'finance', 'work', 'Russia']]
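The recorded program above uses spaCy and gensim; the NLTK operations named in the aim - searching text, counting vocabulary, frequency distribution, collocations and bigrams - can be sketched as follows (a minimal sketch assuming the nltk package and its 'punkt' tokenizer data are installed; the sample sentences are taken from the documents list above):
import nltk
from nltk import FreqDist, bigrams
from nltk.text import Text

raw = ("Chennai Super Kings club defeat local rivals Rajasthan Royals this weekend. "
       "Weekend cricket frenzy takes over India.")
tokens = nltk.word_tokenize(raw)        # tokenize the raw string
text = Text(tokens)                     # wrap tokens for searching
text.concordance("weekend")             # searching text
print(len(set(tokens)))                 # counting vocabulary (number of distinct tokens)
fdist = FreqDist(tokens)                # frequency distribution
print(fdist.most_common(5))
print(list(bigrams(tokens))[:5])        # bigrams
text.collocations()                     # collocations (may print nothing for such a small sample)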
Viva Questions:
1. How can I search for specific words or phrases in a text using NLTK in Python?
To search for specific words or phrases in a text using NLTK, you can use the concordance() method.
For example, if text is your NLTK Text object, you can use text.concordance("word") to find all
occurrences of "word" in the text.
3. How can I generate a frequency distribution of words in a text using NLTK in Python?
To generate a frequency distribution of words in a text using NLTK, you can use the FreqDist class.
For example:
from nltk import FreqDist
text = ["apple", "banana", "apple", "orange", "banana"]
freq_dist = FreqDist(text)
print(freq_dist.most_common())
5. How can I identify bigrams (pairs of consecutive words) in a text using NLTK in Python?
To identify bigrams (pairs of consecutive words) in a text using NLTK, you can use the bigrams() function.
For example:
from nltk import bigrams
text = ["apple", "banana", "orange", "kiwi"]
text_bigrams = list(bigrams(text))
print(text_bigrams)
RESULT:
Thus the program executed successfully.
EX NO:3
DATE: ACCESSING TEXT CORPORA USING NLTK IN PYTHON
AIM:
Write a program to generate a word cloud from a given text dataset, visualizing the most frequent words in the text
with larger font sizes.
ALGORITHM:
1. Import Libraries: Import necessary libraries including wordcloud, matplotlib, and optionally NLTK for text
preprocessing.
2. Load Text Data:
Load the text dataset you want to analyze. This could be from a file, a website, or any other source.
3. Preprocess Text (Optional):
Preprocess the text data if necessary, such as:
Converting text to lowercase.
Removing stopwords (common words like "the", "and", etc.).
Removing punctuation.
Tokenizing the text into words.
4. Generate Word Frequencies:
Count the frequency of each word in the preprocessed text.
5. Create Word Cloud:
Create a word cloud object.
Generate the word cloud using the word frequencies.
Customize the appearance of the word cloud if desired (e.g., colors, fonts, etc.).
6. Display Word Cloud:
Display the generated word cloud using matplotlib or any other suitable library for visualization.
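A minimal end-to-end sketch of the steps above (assumes the wordcloud and matplotlib packages are installed; the sample string is hypothetical):
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

sample = "NLTK gives easy access to many text corpora such as Gutenberg, Brown and WordNet."
wc = WordCloud(stopwords=STOPWORDS, background_color="white").generate(sample)  # word frequencies -> cloud
plt.figure(figsize=(16, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()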
PROGRAM:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = "King Krishnadevaraya loved horses and had the best collection of horse breeds in the Kingdom. Well, one day, a
trader came to the King and told him that he had brought with him a horse of the best breed in Arabia. He invited the King
to inspect the horse. King Krishnadevaraya loved the horse; so the trader said that the King could buy this one
and that he had two more like this one, back in Arabia that he would go back to get. The King loved the horse so much
that he had to have the other two as well. He paid the trader 5000 gold coins in advance. The trader promised that he
would
return within two days with the other horses.Two days turned into two weeks, and still, there was no sign of the trader
and the two horses. One evening, to ease his mind, the King went on a stroll in his garden. There he spotted Tenali
Raman writing down something on a piece of paper. Curious, the King asked Tenali what he was jotting down.Tenali
Raman was hesitant, but after further questioning, he showed the King the paper. On the paper was a list of names, the
King’s being at the top of the list. Tenali said these were the names of the biggest fools in the Vijayanagara Kingdom!
As expected, the King was furious that his name was at the top and asked Tenali Raman for an explanation. Tenali
referred to the horse story, saying the King was a fool to believe that the trader, a stranger, would return after receiving
5000 gold coins.Countering his argument, the King then asked, what happens if/when the trader does come back? In true
Tenali humour, he replied saying, in that case, the trader would be a bigger fool, and his name would replace the King’s
on the list!"
print(text)
King Krishnadevaraya loved horses and had the best collection of horse breeds in the Kingdom. Well, one day, a trader
came to the King and told him that he had brought with him a horse of the best breed in Arabia.He invited the King to
inspect the horse. King Krishnadevaraya loved the horse; so the trader said that the King could buy this one and that he
had two more like this one, back in Arabia that he would go back to get. The King loved the horse so much that he had to
have the other two as well. He paid the trader 5000 gold coins in advance. The trader promised that he would return
within two days with the other horses.Two days turned into two weeks, and still, there was no sign of the trader and the
two horses. One evening, to ease his mind, the King went on a stroll in his garden. There he spotted Tenali Raman
writing down something on a piece of paper. Curious, the King asked Tenali what he was jotting down.Tenali Raman
was hesitant, but after further questioning, he showed the King the paper. On the paper was a list of names, the King’s
being at the top
of the list. Tenali said these were the names of the biggest fools in the Vijayanagara Kingdom!As expected, the King was
furious that his name was at the top and asked Tenali Raman for an explanation. Tenali referred to the horse story, saying
the King was a fool to believe that the trader, a stranger, would return after receiving 5000 gold coins.Countering his
argument, the King then asked, what happens if/when the trader does come back? In true Tenali humour, he replied
saying,
in that case, the trader would be a bigger fool, and his name would replace the King’s on the list!
wordcloud = WordCloud().generate(text)
plt.figure(figsize = (16,6))
plt.imshow(wordcloud)
<matplotlib.image.AxesImage at 0x7d4e144f84c0>
Viva Questions:
1. What is NLTK, and how does it facilitate text corpus access in Python?
NLTK, or the Natural Language Toolkit, is a powerful library for natural language processing in Python. It
provides a wide range of tools and resources, including access to various text corpora. NLTK simplifies text
corpus access by offering a unified interface to access and manipulate different corpora seamlessly.
4. Can you explain how to access the Brown corpus in NLTK and retrieve sample text from it?
Certainly. To access the Brown corpus in NLTK and retrieve sample text, you can use the following code:
import nltk
nltk.download('brown')
from nltk.corpus import brown
# Access sample text from the Brown corpus
sample_text = brown.raw(categories='news')[:200]
print(sample_text)
5. How do you access the WordNet corpus in NLTK, and what is its significance in natural language processing?
You can access the WordNet corpus in NLTK by using the following Python code:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet
WordNet is a lexical database for the English language, which is extensively used in natural language processing
tasks such as word sense disambiguation, synonym detection, and semantic analysis. Accessing WordNet through NLTK
provides access to a vast collection of words organized into synonym sets, along with their semantic relationships,
making it a valuable resource for NLP applications.
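A short example of querying WordNet once the data is downloaded (the word "program" is an arbitrary choice):
from nltk.corpus import wordnet
syns = wordnet.synsets("program")
print(syns[0].name())         # identifier of the first sense, e.g. 'plan.n.01'
print(syns[0].definition())   # dictionary gloss of that sense
print(syns[0].lemma_names())  # synonyms grouped in the same synset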
RESULT:
Thus the program for word cloud executed successfully.
EX NO:4
DATE: FIND THE 50 MOST FREQUENTLY OCCURRING WORDS OF A TEXT THAT ARE NOT STOP WORDS
AIM:
Write a code to find the 50 most frequently occurring words in a text while excluding common stop words.
ALGORITHM:
1. Import Libraries: Import NLTK to access its stopwords list and for text preprocessing.
2. Define the Function:
Define a function named top_frequent_words that takes a text string as input.
3. Preprocess the Text:
Tokenize the text into words.
Convert words to lowercase.
Remove punctuation.
Remove stop words.
4. Count Word Frequencies:
Count the frequency of each word using a dictionary.
5. Sort and Retrieve Top Words:
Sort the word frequencies dictionary by values in descending order.
Retrieve the top 50 words from the sorted list.
6. Return Results:
Return the list of top 50 words.
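A minimal sketch of the function described above (assumes NLTK's 'punkt' and 'stopwords' data have been downloaded; the file name in the usage comment is hypothetical):
import nltk
from nltk.corpus import stopwords
from nltk import FreqDist, word_tokenize

def top_frequent_words(text, n=50):
    stop_words = set(stopwords.words('english'))
    tokens = [w.lower() for w in word_tokenize(text)]                    # tokenize and lowercase
    words = [w for w in tokens if w.isalpha() and w not in stop_words]   # drop punctuation and stop words
    return FreqDist(words).most_common(n)

# Example usage: print(top_frequent_words(open('story.txt').read()))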
PROGRAM:
import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
nltk.download('webtext')
wt_words = webtext.words('d:\\Ranga\\testing.txt')
data_analysis = nltk.FreqDist(wt_words)
[nltk_data] Downloading package webtext to
[nltk_data] C:\Users\ifsrk\AppData\Roaming\nltk_data...
[nltk_data] Package webtext is already up-to-date!
# Keep only the words that are longer than three characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])
does: 1
down: 2
ease: 1
evening: 1
expected: 1
explanation: 1
fool: 2
fools: 1
furious: 1
further: 1
garden: 1
gold: 2
happens: 1
have: 1
hesitant: 1
horse: 6
horses: 3
humour: 1
inspect: 1
into: 1
invited: 1
jotting: 1
like: 1
list: 3
loved: 3
mind: 1
more: 1
much: 1
name: 2
names: 2
other: 2
paid: 1
paper: 3
piece: 1
promised: 1
questioning: 1
receiving: 1
referred: 1
replace: 1
replied: 1
return: 2
said: 2
saying: 2
showed: 1
sign: 1
something: 1
spotted: 1
still: 1
story: 1
stranger: 1
stroll: 1
that: 9
then: 1
there: 1
these: 1
this: 2
told: 1
trader: 8
true: 1
turned: 1
weeks: 1
well: 1
went: 1
were: 1
what: 2
when: 1
with: 2
within: 1
would: 5
writing: 1
data_analysis = nltk.FreqDist(filter_words)
data_analysis.plot(25, cumulative=False)
<AxesSubplot:xlabel='Samples', ylabel='Counts'>
Viva Questions:
1. What is the purpose of the function you've written?
The purpose of the function is to find the 50 most frequently occurring words in a text while excluding
common stop words, which are often function words and have little semantic meaning.
2. How do you handle the task of finding the most frequent words in the function?
The function tokenizes the text into words, removes any punctuation, converts the words to lowercase,
and filters out stop words. Then, it uses NLTK's FreqDist to calculate the frequency distribution of the remaining
words and returns the 50 most common ones.
3. How do you ensure that stop words are excluded from the analysis?
NLTK provides a predefined set of stop words for various languages. The function utilizes this set to filter
out stop words from the text before calculating word frequencies.
4. Can you explain the significance of excluding stop words in natural language processing tasks?
Stop words are commonly occurring words in a language (e.g., "the", "is", "in") that often do not carry
significant meaning in text analysis tasks. By excluding stop words, we focus on content-bearing words,
improving the accuracy and relevance of our analysis, such as text summarization, sentiment analysis, or topic
modeling.
RESULT:
Thus the program executed successfully.
EX NO:5
DATE: IMPLEMENT THE WORD2VEC MODEL
AIM:
Write a code to demonstrate different methods for encoding categorical data into numerical representations, specifically
focusing on one-hot encoding.
ALGORITHM:
PROGRAM:
import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
import numpy as np
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
import numpy as np
# Print the one-hot encoded vectors for each word in each sentence
for i in range(len(sentences)):
print(f"Sentences {i + 1}:")
for j in range(len(vectors[i])):
print(f"{sentences[i].split()[j]}: {vectors[i][j]}")
Sentences 1:
The: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
cat: [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
sat: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
on: [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
the: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
mat.: [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
Sentences 2:
The: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
dog: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
chased: [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
the: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
cat.: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
Sentences 3:
The: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
mat: [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
was: [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
soft: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
and: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
fluffy.: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
!pip install sklearn
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# define example
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
print("cold-0, hot-1, warm-2")
values = array(data)
print(values)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print(inverted)
cold-0, hot-1, warm-2
['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']
[0 0 2 0 1 1 2 0 2 1]
[[1. 0. 0.]
[1. 0. 0.]
[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]
[0. 1. 0.]
[0. 0. 1.]
[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]]
['cold']
Viva Questions:
1. What is the Word2Vec model?
Word2Vec is a popular word embedding technique used to represent words as dense vectors in a
continuous vector space. It is based on the distributional hypothesis, which states that words appearing in
similar contexts tend to have similar meanings. Word2Vec learns vector representations by training on large
corpora, capturing semantic relationships between words.
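A minimal gensim Word2Vec sketch of the idea described above (the toy sentences are hypothetical; assumes gensim 4.x is installed):
from gensim.models import Word2Vec

sentences = [["chennai", "super", "kings", "win", "the", "match"],
             ["machine", "learning", "is", "a", "hot", "area"],
             ["the", "club", "wins", "the", "football", "match"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1 selects skip-gram
print(model.wv["match"][:5])                    # dense vector learned for a word
print(model.wv.most_similar("match", topn=3))   # nearest neighbours in the embedding space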
RESULT:
Thus the program executed successfully.
AIM:
Write a program to build a language translator GUI using the translate library and Tkinter, creating a user-friendly
interface where users can input text in one language and get the translation in another language interactively.
ALGORITHM:
1. Import Libraries: Import necessary libraries including translator for translation and TKinter for GUI development.
2. Create GUI Layout:
Create a TKinter window (Tk()).
Design the layout with labels, text entry widgets, buttons, etc., to accept user input and display translated
text.
3. Define Translation Function:
Define a function to handle translation:
Get the input text from the text entry widget.
Use the translator library to translate the text from the source language to the target language.
Display the translated text in the GUI.
4. Event Handling:
Bind the translation function to an event, such as clicking a button.
5. Main Event Loop:
Start the main event loop to run the GUI application.
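A hedged sketch of the complete flow described above (widget names and the use of language codes such as 'en' and 'hi' are assumptions; support for full language names varies across versions of the translate library):
from tkinter import *
from translate import Translator

def do_translate():
    # translate the entered text from the chosen source language to the chosen target language
    translator = Translator(from_lang=src_choice.get(), to_lang=dst_choice.get())
    translated.set(translator.translate(text_entry.get()))

root = Tk()
root.title("Language Translator with GUI")
src_choice, dst_choice, translated = StringVar(value='en'), StringVar(value='hi'), StringVar()

Label(root, text="Enter text:").pack()
text_entry = Entry(root, width=40)
text_entry.pack()
OptionMenu(root, src_choice, 'en', 'hi', 'ta', 'de', 'es').pack()   # source language
OptionMenu(root, dst_choice, 'en', 'hi', 'ta', 'de', 'es').pack()   # target language
Button(root, text="Translate", command=do_translate).pack()
Label(root, textvariable=translated, wraplength=350).pack()
root.mainloop()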
PROGRAM:
!pip install tk
# installing tkinter
!pip install translate
# installing the translate library
import tkinter
from tkinter import *
from translate import Translator
tkinter.TkVersion
8.6
Screen = Tk()
Screen.title("Language Translator with GUI")
InputLanguageChoice = StringVar()
TranslateLanguageChoice = StringVar()
LanguageChoices = {'Hindi','English','Tamil','German','Spanish'}
InputLanguageChoice.set('English')
TranslateLanguageChoice.set('Hindi')
mainloop()
Viva Questions:
1. What is the Translator library in Python?
The Translator library is a simple yet powerful tool for language translation in Python. It utilizes Google
Translate's API to provide quick and accurate translations between various languages.
2. How can you integrate the Translator library into a Python application?
To integrate the Translator library into a Python application, you first need to install it using pip: pip install
translate Once installed, you can import and use the translate module in your Python code to perform
translations.
4. How do you create a simple GUI translator using Tkinter and the Translator library?
You can create a simple GUI translator by designing a graphical interface with Tkinter, including input and
output text fields, language selection options, and a translation button. Then, you can use the Translator library
to handle the translation process when the button is clicked.
5. What are some potential improvements or additional features you could add to this GUI translator?
Some potential improvements or additional features for this GUI translator include:
Adding support for detecting the input language automatically.
Implementing error handling for cases such as network issues or invalid input.
Enhancing the user interface with styling, icons, or additional widgets.
Supporting translations for more languages by expanding the language selection options.
Allowing users to save or copy translated text for future use.
RESULT:
Thus the program executed successfully.
EX NO:7
DATE: DESIGN A CHATBOT WITH A SIMPLE DIALOG SYSTEM
AIM:
Write a program to design a chatbot with a simple dialog system, creating a conversational agent that can interact
with users, understand their queries, and provide appropriate responses based on predefined rules or patterns.
ALGORITHM:
1. Initialize Chatbot:
Start by initializing the chatbot, greet the user, and provide an initial prompt or question.
2. Accept User Input:
Accept user input as text.
3. Process User Input:
Preprocess the user input if necessary (e.g., convert to lowercase, remove punctuation).
4. Determine Intent:
Determine the intent of the user input based on keywords or patterns.
This can be done using rule-based methods like keyword matching or regular expressions.
5. Generate Response:
Based on the detected intent, select an appropriate response from a predefined set of responses.
Responses can be stored in a dictionary or any other suitable data structure.
6. Display Response:
Display the generated response to the user.
7. Continue Conversation:
Optionally, continue the conversation by prompting the user for additional input or providing further
information.
8. End Chat:
End the conversation if the user says goodbye or after a certain number of interactions.
PROGRAM:
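The program listing is missing from this record; a minimal rule-based sketch of the algorithm above might look like this (all intents and responses are hypothetical):
import re

responses = {
    r"\b(hi|hello|hey)\b": "Hello! How can I help you today?",
    r"\b(timetable|schedule)\b": "The lab runs every Wednesday from 9 AM to 12 noon.",
    r"\bthank\w*\b": "You're welcome!",
}

def reply(user_input):
    text = user_input.lower()                      # preprocess: lowercase
    for pattern, answer in responses.items():      # intent detection by keyword/regex matching
        if re.search(pattern, text):
            return answer
    return "Sorry, I did not understand that."     # fallback response

print("Bot: Hi! Type 'bye' to exit.")
while True:
    user = input("You: ")
    if user.strip().lower() in ("bye", "goodbye"):
        print("Bot: Goodbye!")
        break
    print("Bot:", reply(user))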
Viva Questions:
1. What is a chatbot, and how does it function in a dialog system?
A chatbot is an AI-powered software program designed to interact with users through text or voice-based
conversations. In a dialog system, a chatbot engages users in a structured or unstructured conversation,
responding to user queries or providing assistance based on predefined rules or machine learning algorithms.
2. How does a simple dialog system differ from more complex conversational AI systems?
A simple dialog system typically operates based on predefined rules or patterns, providing responses
according to programmed logic. In contrast, more complex conversational AI systems utilize natural language
processing (NLP) and machine learning techniques to understand and generate human-like responses, often
incorporating context and learning from interactions.
3. What are the key components of designing a chatbot with a simple dialog system?
The key components of designing a chatbot with a simple dialog system include:
Intent recognition: Identifying the user's intention or query based on input text.
Response generation: Generating appropriate responses based on recognized intents and predefined rules or
patterns.
Dialog flow management: Managing the flow of conversation, including handling user inputs, prompting for
clarification, and providing responses.
4. How do you handle user inputs and generate responses in a simple dialog system?
In a simple dialog system, user inputs are typically matched to predefined intents or patterns using
techniques like keyword matching or regular expressions. Once the intent is recognized, corresponding
responses are generated from a predefined set of responses or through templated responses based on input.
5. Can you provide an example of a simple chatbot dialog system and how it operates?
Sure. Let's consider a weather chatbot as an example. The chatbot recognizes intents such as "get
weather forecast" or "check weather," and responds with the current weather conditions or a weather forecast
for the specified location. It operates by matching user inputs to predefined intents and retrieving relevant
weather information based on the recognized intent.
RESULT:
Thus the program executed successfully.
EX NO:8
DATE: TEXT TO SPEECH AND FIND ACCURACY
AIM:
Write a program to create a function that takes audio input from the microphone, converts it to text using speech
recognition, and prints the recognized text.
ALGORITHM:
1. Initialize Recognizer: Import the speech_recognition library and initialize the recognizer.
2. Define Speech-to-Text Function: Define a function that takes an audio input and converts it to text using speech
recognition. This function will also print the recognized text.
3. Adjust for Ambient Noise: Adjust the recognizer for ambient noise to improve recognition accuracy.
4. Listen for User Input: Use the microphone as the audio source and listen for user input.
5. Recognize Audio: Use the recognizer to recognize the audio input and convert it to text.
6. Print Recognized Text: Print the recognized text.
7. Speak Recognized Text (Optional): Optionally, use text-to-speech to speak the recognized text.
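A minimal sketch of these steps (assumes the speech_recognition, pyaudio and pyttsx3 packages are installed and a working microphone; recognize_google() needs an internet connection):
import speech_recognition as sr
import pyttsx3

def speak_text(command):
    # optional step 7: speak the recognized text back to the user
    engine = pyttsx3.init()
    engine.say(command)
    engine.runAndWait()

r = sr.Recognizer()
try:
    with sr.Microphone() as source:
        r.adjust_for_ambient_noise(source, duration=0.2)   # step 3: adjust for ambient noise
        audio = r.listen(source)                           # step 4: listen for user input
        my_text = r.recognize_google(audio).lower()        # step 5: recognize the audio
        print("You said:", my_text)                        # step 6: print the recognized text
        speak_text(my_text)
except sr.RequestError as e:
    print("Could not request results:", e)
except sr.UnknownValueError:
    print("Speech was not recognized")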
PROGRAM:
import speech_recognition as sr
sr.__version__
'3.9.0'
import pyttsx3
import pyaudio  # PyAudio provides microphone access for the SpeechRecognition library
# Let us create a function that takes in the audio as input and converts it to text.
# Now, use the microphone to get audio input from the user in real-time, recognize it, and
# print it in text.
# Wait for a second to let the recognizer adjust the energy threshold based on
# the surrounding noise level
r.adjust_for_ambient_noise(source2, duration=0.2)
MyText = r.recognize_google(audio2)
MyText = MyText.lower()
3rd year cse classroom location i mean the room number please
Viva Questions:
1. What is the process of converting text to speech, and what role does accuracy play in this process?
Converting text to speech involves synthesizing natural-sounding speech from written text. Accuracy in text-to-
speech conversion refers to the fidelity of the synthesized speech to the original text. Higher accuracy means that the
synthesized speech closely matches the intended text, resulting in a more natural and understandable output.
3. What are some factors that can affect the accuracy of text-to-speech conversion?
Several factors can affect the accuracy of text-to-speech conversion, including:
Quality of the text preprocessing: Proper handling of punctuation, capitalization, and special characters can
improve accuracy.
Speech synthesis engine: The underlying technology and algorithms used for speech synthesis can impact
accuracy.
Linguistic characteristics: Complex sentence structures, ambiguous pronunciation, or uncommon words may
affect accuracy.
Voice quality: The naturalness and clarity of the synthesized voice can influence perceived accuracy.
RESULT:
Thus the program executed successfully.
EX NO:9
DATE: DESIGN A SPEECH RECOGNITION SYSTEM AND FIND THE ERROR RATE
AIM:
Write a program to design a speech recognition system and calculate the error rate to evaluate its performance. The error
rate indicates the accuracy of the system in converting spoken words into text.
ALGORITHM:
1. Collect audio data and corresponding transcriptions.
2. Preprocess the audio data.
3. Extract features from the audio.
4. Train a speech recognition model.
5. Evaluate the model on a separate dataset.
6. Calculate the error rate (e.g., WER, CER).
7. Display the error rate.
8. Analyze and improve the system as needed.
PROGRAM:
import speech_recognition as sr
sr.__version__
'3.9.0'
'C:\\Users\\thaanya\\NLP-PRACTICALS'
r = sr.Recognizer()
hindi = sr.AudioFile('sai_record.wav')
with hindi as source:
audio = r.record(source)
type(audio)
speech_recognition.AudioData
# You can now invoke recognize_google() to attempt to recognize any speech in the audio.
# Depending on your internet connection speed, you may have to wait several seconds before
# seeing the result.
r.recognize_google(audio)
result2:
{ 'alternative': [ { 'transcript': 'Santhosh is a Bhagwan santhosha I am '
'Anaconda'},
{ 'transcript': 'Santosh is a Bhagwan santhosha I am '
'Anaconda'}],
'final': True}
type(audio1)
speech_recognition.AudioData
Viva Questions:
1.What is a speech recognition system, and how does it function?
A speech recognition system is a technology that converts spoken language into text or commands. It operates by
analyzing audio input, extracting features such as spectrograms or MFCCs (Mel-Frequency Cepstral Coefficients)
, and applying machine learning algorithms or neural networks to recognize and transcribe spoken words into text.
3. How can you evaluate the error rate of a speech recognition system?
The error rate of a speech recognition system can be evaluated using objective metrics such as Word Error Rate
(WER) or Character Error Rate (CER). WER measures the percentage of words in the recognized output that differ
from the ground truth, while CER calculates the percentage of characters that are incorrectly transcribed. Lower error
rates indicate higher accuracy in speech recognition.
4. What factors can contribute to errors in speech recognition systems?
Several factors can contribute to errors in speech recognition systems, including:
Background noise: Environmental noise can interfere with audio signals, leading to inaccuracies in recognition.
Speaker variability: Differences in accent, pronunciation, or speech patterns among speakers can pose challenges for
accurate recognition.
Vocabulary complexity: Uncommon words, technical terms, or ambiguous pronunciations may be more prone to
recognition errors.
Speech rate: Rapid speech or unclear articulation can increase the likelihood of recognition errors.
System limitations: The quality of the acoustic model, language model, or training data can affect the system's
ability to accurately recognize speech.
5. Can you explain how to calculate the Word Error Rate (WER) in practice?
To calculate the Word Error Rate (WER) in practice, you compare the recognized text output generated by the
speech recognition system to the corresponding ground truth or reference transcript. Count the total number of
insertions (extra words), deletions (missed words), and substitutions (mismatched words) required to transform the
recognized text into the reference transcript. Then, divide the total number of errors by the total number of words
in the reference transcript to obtain the WER as a percentage.
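A small sketch of this calculation using edit distance over words (the reference and hypothesis sentences are hypothetical):
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33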
RESULT:
Thus the program executed successfully.
ADDITIONAL EXPERIMENTS
EX NO:10
DATE: STEMMING AND LEMMATIZATION USING NLTK
AIM:
Write a program for stemming and lemmatization using NLTK (Natural Language Toolkit) to normalize words in
text by reducing them to their base or root forms.
ALGORITHM:
1. Tokenization: Break down the input text into individual words or tokens.
2. Stemming:
Apply a stemming algorithm, such as the Porter Stemmer, to each token.
Remove common suffixes to obtain the root form of each word.
3. Lemmatization:
Tag each token with its part of speech (optional but recommended).
Apply lemmatization using NLTK's WordNet Lemmatizer, considering the part of speech tag.
Retrieve the base or dictionary form of each word (lemma).
4. Post-Processing (Optional):
Remove any non-word tokens resulting from stemming or lemmatization.
Convert all tokens to lowercase for consistency.
5. Output:
Return the stemmed or lemmatized tokens as the normalized text.
PROGRAM:
raw
'My name is Maximus Decimus Meridius, commander of the armies of the north, General of the Felix legions and l
oyal servant to the true emperor, Marcus Aurelius. Father to a murdered son, husband to a murdered wife. And I will
have my vengeance, in this life or the next.'
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = word_tokenize(raw)

porter = PorterStemmer()
stems = [porter.stem(t) for t in tokens]
print(stems)
['My', 'name', 'is', 'maximu', 'decimu', 'meridiu', ',', 'command', 'of', 'the', 'armi', 'of', 'the', 'north', ',', 'gener', 'of', 'the',
'felix', 'legion', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'marcu', 'aureliu', '.', 'father', 'to', 'a', 'murder', 'son',
',', 'husband', 'to', 'a', 'murder', 'wife', '.', 'and', 'I', 'will', 'have', 'my', 'vengeanc', ',', 'in', 'thi', 'life', 'or', 'the', 'next', '.']
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]
print(lemmas)
['My', 'name', 'is', 'Maximus', 'Decimus', 'Meridius', ',', 'commander', 'of', 'the', 'army', 'of', 'the', 'north', ',', 'General', 'of',
'the', 'Felix', 'legion', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'Marcus', 'Aurelius', '.', 'Father', 'to', 'a',
'murdered', 'son', ',', 'husband', 'to', 'a', 'murdered', 'wife', '.', 'And', 'I', 'will', 'have', 'my', 'vengeance', ',', 'in', 'this', 'life',
'or', 'the', 'next', '.']
Viva Questions:
1.What is stemming, and how does it differ from lemmatization?
Stemming and lemmatization are techniques used in natural language processing to reduce words to their root or
base form. Stemming involves removing suffixes or prefixes from words to extract the stem, whereas lemmatization
involves reducing words to their dictionary form, known as lemma, by considering the context and meaning of the word.
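A short contrast of the two techniques on sample words (assumes NLTK's 'wordnet' data has been downloaded; the word list is arbitrary):
from nltk.stem import PorterStemmer, WordNetLemmatizer
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for w in ["studies", "running", "better"]:
    # stemming chops suffixes; lemmatization maps to a dictionary form using the given part of speech
    print(w, porter.stem(w), lemmatizer.lemmatize(w, pos="v"))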
4.What are some considerations when choosing between stemming and lemmatization?
When choosing between stemming and lemmatization, consider factors such as the application domain, desired
level of linguistic accuracy, and computational resources. Stemming is faster and simpler but may produce non-words or
incorrect stems. Lemmatization, on the other hand, provides more accurate results by considering word meanings but
requires more computational resources.
5.Can you explain the importance of stemming and lemmatization in natural language processing tasks?
Stemming and lemmatization are crucial preprocessing steps in natural language processing tasks such as text
classification, information retrieval, and sentiment analysis. By reducing words to their base forms, stemming and
lemmatization help normalize text data, reduce vocabulary size, and improve the accuracy of downstream NLP tasks by
ensuring that variations of words are treated as the same token.
RESULT:
Thus the program executed successfully.
EX NO:11
DATE: NLP AUTO COMPLETE PROGRAM
AIM:
Write an NLP autocomplete program to predict the next word or phrase a user is likely to input based
on the context of their current input.
ALGORITHM:
1. Tokenization: Break down the input text into individual words or phrases, removing punctuation and converting
everything to lowercase for consistency.
2. Frequency Count: Count the frequency of occurrence for each token in the dataset to understand their likelihood.
3. Prediction: Given a partial input from the user, predict the next word or phrase based on the most frequent
continuation from the dataset.
4. Ranking: If there are multiple predictions, rank them based on their frequency of occurrence and present the most
frequent ones to the user.
5. User Interface Integration: Present the predicted words or phrases to the user in the application's interface,
such as a dropdown menu or a list of suggestions.
6. Feedback Loop (Optional): Incorporate a feedback mechanism where the user's selections are used to refine
future predictions.
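A minimal frequency-based sketch of steps 1-4 above (the tiny corpus is hypothetical):
from collections import Counter, defaultdict

corpus = "i like nlp i like python i love nlp".split()
next_words = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_words[prev][nxt] += 1            # frequency count of observed continuations

def suggest(word, k=3):
    # rank candidate continuations by frequency and return the top k
    return [w for w, _ in next_words[word.lower()].most_common(k)]

print(suggest("i"))   # ['like', 'love']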
PROGRAM:
# import pyreadline3
# we can also implement autocomplete using the
# python readline packages
# install the packages first
greetings = ['Hello How Are You', 'I like students','Best Wishes','Thanks and Regards', 'United States of America',
"Prince Shri Venkateswara Padmavathy Engineering College"]
greetings
'Best Wishes',
'Thanks and Regards',
'United States of America',
'Prince Shri Venkateswara Padmavathy Engineering College']
from tkinter import *
# AutocompleteEntry is assumed to come from the ttkwidgets package (pip install ttkwidgets)
from ttkwidgets.autocomplete import AutocompleteEntry

ws = Tk()
ws.title('Python-AUTO-COMPLETE-example')
ws.geometry('400x300')
ws.config(bg='#f25252')
frame = Frame(ws, bg='#f25252')
frame.pack(expand=True)
Label(
frame,
bg='#f25252',
font = ('Times',21),
text='MY MESSAGES'
).pack()
entry = AutocompleteEntry(
frame,
width=30,
font=('Times', 27),
completevalues=greetings
)
entry.pack()
ws.mainloop()
Viva Questions:
1.What is an NLP autocomplete program, and how does it function?
An NLP autocomplete program is a tool that suggests word or phrase completions based on partial input provided
by the user. It leverages natural language processing techniques to analyze the context of the input text and predict the
most likely completions. The program utilizes language models trained on large corpora to generate accurate
suggestions.
2.How does an NLP autocomplete program differ from traditional autocomplete systems?
Traditional autocomplete systems typically rely on simple prefix matching or frequency-based approaches to
suggest completions. In contrast, an NLP autocomplete program utilizes advanced NLP techniques such as language
modeling, contextual analysis, and machine learning algorithms to generate more accurate and contextually relevant
suggestions.
3. What are the main components of designing an NLP autocomplete program?
The main components of designing an NLP autocomplete program include:
Data preprocessing: Tokenization, normalization, and filtering to prepare the input text data.
Language modeling: Building a language model that captures the statistical properties of natural language to
predict next words or phrases.
Prediction mechanism: Implementing algorithms or techniques to generate autocomplete suggestions based on the
input text and the language model.
User interface: Designing an intuitive interface to interact with users and display autocomplete suggestions in
real-time.
4.Can you explain the importance of data quality and size in an NLP autocomplete program?
Data quality and size play a crucial role in the accuracy and effectiveness of an NLP autocomplete program.
High-quality and diverse training data enable the program to learn robust language patterns and generate more accurate
predictions. Additionally, a larger corpus provides a broader context for understanding and predicting user input,
resulting in more relevant autocomplete suggestions.
RESULT:
Thus the program executed successfully.