Text and Speech Analysis (TSA) Lab Record - CSE


PRINCE SHRI VENKATESHWARA
PADMAVATHY ENGINEERING COLLEGE
(An Autonomous Institution)

Mambakkam-Medavakkam Main Road,
Ponmar, Chennai - 600127

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CCS369-TEXT AND SPEECH ANALYSIS LABORATORY


(BE - CSE, VI SEMESTER)

Academic Year: 2023 - 2024

Name of the Student :

Register Number :

Year/ Semester :
PRINCE SHRI VENKATESHWARA
PADMAVATHY ENGINEERING COLLEGE
(An Autonomous Institution)

BONAFIDE CERTIFICATE

Name : ………………………………………………
Register No : ………………………………………………
Semester : ………………………………………………
Branch : ………………………………………………

Certified that this is a Bonafide Record of the work done by the above student in the
CCS369 - Text and Speech Analysis Laboratory during the year 2023 - 2024.

Signature of Faculty In-Charge Signature of Principal

Submitted for Practical Examination held on……………………….

Internal Examiner External Examiner


VISION OF THE INSTITUTE

To be a prominent institution for technical education and research, meeting global
challenges and societal needs.

MISSION OF THE INSTITUTE


 To develop the needed resources and infrastructure, and to establish a conducive
ambience for the teaching-learning process.

 To nurture in the students professional and ethical values, and to instill in them a spirit
of innovation and entrepreneurship.

 To encourage in the students a desire for higher learning and research, and to equip them to
face global challenges.

 To provide opportunities for students to get the needed additional skills to make them
industry ready.

 To interact with industries and other organizations to facilitate transfer of knowledge


and know-how.

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

VISION OF THE DEPARTMENT


To be a nationally preferred department in quality education, producing globally competent
professionals through research and technical skills while inculcating moral values and addressing societal demands.

MISSION OF THE DEPARTMENT


 To provide facilities and expertise by means of interactive teaching and participatory learning in
correlation with industrial needs.
 To enrich the students with needed skills and practical exposure, enabling them to become socially
responsible, ethical, and competitive professionals.
 To inculcate the spirit of moral values to make the students successful in profession, higher studies, and
entrepreneurship.
PROGRAM OUTCOMES (POs)

1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering


fundamentals, and an engineering specialization to the solution of complex engineering problems.

2. Problem analysis: Identify, formulate, review research literature, and analyse complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.

3. Design/development of solutions: Design solutions for complex engineering problems and design system
components or processes that meet the specified needs with appropriate consideration for the public
health and safety, and the cultural, societal, and environmental considerations.

4. Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to
provide valid conclusions.

5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modelling to complex engineering activities with an
understanding of the limitations.

6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal and cultural issues, and the consequent responsibilities relevant to the professional
engineering practice.

7. Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.

8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.

9. Individual and teamwork: Function effectively as an individual, and as a member or leader in diverse
teams, and in multidisciplinary settings.

10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and
design documentation, make effective presentations, and give and receive clear instructions.

11. Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one’s own work, as a member and leader in a team, to manage
projects and in multidisciplinary environments.

12. Life-long learning: Recognize the need for and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.

PROGRAMME EDUCATIONAL OBJECTIVES (PEOs):

PEO 1: Acquire logical and analytical skills with a solid foundation in core areas of Computer
Science & Engineering.

PEO 2: Enduringly engage in learning new technologies and actively contribute to academia,
Research & Development, and society.

PEO 3: Enrich the passion for higher studies, research, and a successful career in software industries,
or take up entrepreneurial endeavours.


PROGRAMME SPECIFIC OUTCOMES (PSOs):
PSO 1: Ability to apply advanced programming techniques to solve contemporary issues using
Internet of Things, data science and analytics, Artificial Intelligence and Machine Learning.

PSO 2: Ability to employ modern tools for analyzing data and networks, building their career as
software professionals, researchers, and entrepreneurs with a zeal for higher studies.
INSTRUCTIONS TO STUDENTS

Before entering the lab, the student should carry the following things (MANDATORY)
 Identity card issued by the college.
 Class notes
 Lab observation book
 Lab Manual
 Lab Record
 Student must sign in and sign out in the register provided when attending the lab session
without fail.
 Come to the laboratory on time. Students who are more than 15 minutes late will not be allowed
to attend the lab.
 Students need to maintain 100% attendance in the lab; otherwise, strict action will be taken.
 All students must follow the dress code while in the laboratory.
 Food and drinks are NOT allowed.
 All bags must be left at the indicated place.
 Refer to the lab staff if you need any help in using the lab.
 Respect the laboratory and its other users.
 The workspace must be kept clean and tidy after the experiment is completed.
 Read the manual carefully before coming to the laboratory and be sure about what you
are supposed to do.
 Do the experiments as per the instructions given in the manual.
 Copy all the programs taught in class into the observation book before attending the lab
session.
 Students are not supposed to use floppy disks or pen drives without the permission of the lab in-charge.
 Lab records need to be submitted on or before the date of submission.
Syllabus

PRACTICAL EXPERIMENTS                                                          30 PERIODS

1. Create Regular expressions in Python for detecting word patterns and tokenizing text
2. Getting started with Python and NLTK - Searching Text, Counting Vocabulary, Frequency
Distribution, Collocations, Bigrams
3. Accessing Text Corpora using NLTK in Python
4. Write a function that finds the 50 most frequently occurring words of a text that are not stop
words.
5. Implement the Word2Vec model
6. Use a transformer for implementing classification
7. Design a chatbot with a simple dialog system
8. Convert text to speech and find accuracy
9. Design a speech recognition system and find the error rate

COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO1: Explain existing and emerging deep learning architectures for text and speech processing
CO2: Apply deep learning techniques for NLP tasks, language modelling and machine translation
CO3: Explain coreference and coherence for text processing
CO4: Build question-answering systems, chatbots and dialogue systems
CO5: Apply deep learning models for building speech recognition and text-to-speech systems
Mapping of Course Outcomes with the POs and PSOs

CO/PO  PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2
CO1 2 2 1 2 2 1 2 1 1
CO2 2 2 1 2 2 1 2 2 1
CO3 3 3 1 3 3 2 2 3 2
CO4 3 3 1 3 3 2 2 3 2
CO5 3 3 2 3 1 2 2 3 2

1 - low, 2 - medium, 3 - high, ‘-' - no correlation


TABLE OF CONTENTS

S.NO   DATE   EXPERIMENT TITLE                                                          PAGE NO.   MARKS   SIGN

1   Create Regular expressions in Python for detecting word patterns and tokenizing text
2   Getting started with Python and NLTK - Searching Text, Counting Vocabulary, Frequency Distribution, Collocations, Bigrams
3   Accessing Text Corpora using NLTK in Python
4   Write a function that finds the 50 most frequently occurring words of a text that are not stop words
5   Implement the Word2Vec model
6   Building language translator (GUI) using translator library and TK
7   Design a chatbot with a simple dialog system
8   Convert text to speech and find accuracy
9   Design a speech recognition system and find the error rate

Additional Experiments

10  Stemming and Lemmatization Using NLTK
11  NLP Auto complete Program

EX NO: 1    CREATE REGULAR EXPRESSIONS IN PYTHON FOR DETECTING WORD
            PATTERNS AND TOKENIZING TEXT
DATE:

AIM:
Write a program to tokenize text into individual words and sentence endings using regular
expressions in Python.

ALGORITHM:

1. Define regular expression patterns for word and sentence tokenization.
2. Use re.findall() to find all matches of these patterns in the text.
3. Combine the matches into a list of tokens.
4. Return the list of tokens.

PROGRAM:

# practicals using regular expressions


# practicals using string processing and functions
import re
print(re.findall(r'[Gg]eeks', 'GeeksforGeeks: A computer science portal for geeks'))
['Geeks', 'Geeks', 'geeks']

import re
print('Range',re.search(r'[a-zA-Z]', 'x'))
Range <_sre.SRE_Match object; span=(0, 1), match='x'>

import re
print(re.search(r'[^a-z]', 'c'))
None

import re
print(re.search(r'G[^e]', 'Geeks'))
None

import re
print(re.search(r'G[e]', 'Geeks'))
<_sre.SRE_Match object; span=(0, 2), match='Ge'>

import re
print('Geeks:', re.search(r'\bGeeks\b', 'Geeks'))

print('GeeksforGeeks:', re.search(r'\bGeeks\b', 'GeeksforGeeks'))


Geeks: <_sre.SRE_Match object; span=(0, 5), match='Geeks'>
GeeksforGeeks: None

import re
# Beginning of String
match = re.search(r'^Geek', 'Campus Geek of the month')
print('Beg. of String:', match)
match = re.search(r'^Geek', 'Geek of the month')
print('Beg. of String:', match)
# End of String
match = re.search(r'Geeks$', 'Compute science portal-GeeksforGeeks')
print('End of String:', match)
Beg. of String: None
Beg. of String: <_sre.SRE_Match object; span=(0, 4), match='Geek'>
End of String: <_sre.SRE_Match object; span=(31, 36), match='Geeks'>

import re
print('Any Character', re.search(r'p.th.n', 'python 3'))
Any Character <_sre.SRE_Match object; span=(0, 6), match='python'>

import re
print('Color',re.search(r'colou?r', 'color'))
print('Colour',re.search(r'colou?r', 'colour'))
Color <_sre.SRE_Match object; span=(0, 5), match='color'>
Colour <_sre.SRE_Match object; span=(0, 6), match='colour'>

import re
print('Date{mm-dd-yyyy}:', re.search(r'[\d]{2}-[\d]{2}-[\d]{4}', '18-08-2020'))
Date{mm-dd-yyyy}: <_sre.SRE_Match object; span=(0, 10), match='18-08-2020'>

import re
print('Three Digit:', re.search(r'[\d]{3,4}', '189'))
print('Four Digit:', re.search(r'[\d]{3,4}', '2145'))
Three Digit: <_sre.SRE_Match object; span=(0, 3), match='189'>
Four Digit: <_sre.SRE_Match object; span=(0, 4), match='2145'>

# Program to extract numbers from a string


import re
string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'
result = re.findall(pattern, string)
print(result)
# Output: ['12', '89', '34']
['12', '89', '34']

import re
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string)


print(result)
# Output: ['Twelve:', ' Eighty nine:', '.']
['Twelve:', ' Eighty nine:', '.']
# Program to remove all whitespaces

import re
# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters


pattern = r'\s+'
# empty string
replace = ''
new_string = re.sub(pattern, replace, string)
print(new_string)
# Output: abc12de23f456
abc12de23f456

import re
# multiline string
string = 'abc 12\
de 23 \n f45 6'
# matches all whitespace characters
pattern = r'\s+'
replace = ''
new_string = re.sub(r'\s+', replace, string, 1)
print(new_string)
# Output:
# abc12de 23
# f45 6
abc12de 23
f45 6

import re
string = "Python is fun"
# check if 'Python' is at the beginning
match = re.search(r'\APython', string)
if match:
print("pattern found inside the string")
else:
print("pattern not found")
# Output: pattern found inside the string
pattern found inside the string

import re
string = '39801 356, 2102 1111'
# Three digit number followed by space followed by two digit number

pattern = r'(\d{3}) (\d{2})'


# match variable contains a Match object.
match = re.search(pattern, string)
if match:
print(match.group())
else:
print("pattern not found")
# Output: 801 35
801 35
match.group(1)
'801'
match.group(2)
'35'

Viva Questions:
1. How can I detect all words containing only letters in a text using regular expressions in Python?
To detect words containing only letters, you can use the regular expression \b[A-Za-z]+\b.

2. How do I tokenize a text into words using regular expressions in Python?


To tokenize a text into words, you can use the regular expression \b\w+\b.

3. How can I identify words with a specific pattern like having exactly 3 vowels in a text using regular
expressions in Python?
To identify words with exactly 3 vowels, you can use a regular expression such as
\b(?:[^\WaeiouAEIOU]*[aeiouAEIOU]){3}[^\WaeiouAEIOU]*\b, which matches words built from
non-vowel word characters around exactly three vowels.
4. How can I tokenize a text while ignoring punctuation using regular expressions in Python?
To tokenize a text while ignoring punctuation, you can use the regular expression \b\w+\b,
which matches word characters (\w) and boundaries (\b), effectively ignoring punctuation.

5. How do I detect and extract email addresses from a given text using regular expressions in Python?
To detect and extract email addresses from text, you can use the regular expression
[\w\.-]+@[a-zA-Z\d\.-]+\.[a-zA-Z]{2,}, which matches the common pattern for email addresses.
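For instance, a short sketch using that pattern (the sample addresses are illustrative):

import re

text = "Contact us at admin@example.com or helpdesk@dept.example.org for queries."
emails = re.findall(r'[\w\.-]+@[a-zA-Z\d\.-]+\.[a-zA-Z]{2,}', text)
print(emails)
# Expected output: ['admin@example.com', 'helpdesk@dept.example.org']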

RESULT:
Thus the program executed successfully for Regular Expressions.

EX NO: 2    GETTING STARTED WITH PYTHON AND NLTK - SEARCHING TEXT,
            COUNTING VOCABULARY, FREQUENCY DISTRIBUTION, COLLOCATIONS,
            BIGRAMS
DATE:

AIM:

Write a code to work with Python and NLTK (Natural Language Toolkit) for tasks like searching text, counting
vocabulary, frequency distribution, collocations, and bigrams.

ALGORITHM:

1. Import Necessary Libraries: Import the required libraries, including NLTK.


2. Download NLTK Data: Download essential NLTK data such as tokenizers, stopwords, etc.
3. Load and Preprocess Text:
 Load the text data you want to analyze.
 Preprocess the text as necessary, including:
 Converting text to lowercase.
 Tokenizing the text into words.
 Removing punctuation and non-alphanumeric characters.
 Optionally removing stopwords (common words like "the", "and", etc.).
4. Perform Text Analysis:
 Searching Text:
 Define the word or pattern you want to search for.
 Use string search functions or regular expressions to find occurrences of the word/pattern in the
text.
 Counting Vocabulary:
 Compute the number of unique words in the text.
 Frequency Distribution:
 Calculate the frequency of each word in the text.
 Collocations:
 Identify frequent word combinations (collocations) that often appear together in the text.
 Bigrams:
 Extract pairs of consecutive words in the text.
5. Display Results:
 Display the results of each analysis task:
 Search results (if any).
 Vocabulary count.
 Top N most frequent words.
 Identified collocations.
 Extracted bigrams.
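The recorded program below uses gensim and spaCy; a minimal NLTK-only sketch of the five tasks named in the title (the sample sentence and download calls are illustrative and may vary with the NLTK version):

import nltk
from nltk import FreqDist, Text, bigrams
nltk.download('punkt')

raw = "The quick brown fox jumps over the lazy dog. The dog sleeps."
words = nltk.word_tokenize(raw.lower())

text = Text(words)
text.concordance("dog")                 # searching text
print(len(set(words)))                  # counting vocabulary (number of unique tokens)
print(FreqDist(words).most_common(5))   # frequency distribution
text.collocations()                     # collocations (needs a larger text to print results)
print(list(bigrams(words)))             # bigrams (pairs of consecutive words)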

PROGRAM:

# installing gensim, spaCy packages first


!pip install gensim
!pip install spacy

# using gensim

from gensim import corpora

documents = [u"Chennai Super Kings club defeat local rivals Rajasthan Royals this weekend.",
u"Weekend cricket frenzy takes over India.",
u"Machine Learning is one of the hottest areas in CSE.",
u"London football club bid to move to Wembley stadium.",
u"Chennai Super Kings bid 5 crores for striker Dhoni.",
u"Financial troubles result in loss of millions for bank.",
u"New York Syndicate Union bank files for bankruptcy after financial losses.",
u"Chennai football club is taken over by oil millionaire from Maharashtra.",
u"Banking on finances not working for Russia."]

import spacy
nlp = spacy.load('en_core_web_sm')
texts = []
for document in documents:
    text = []
    doc = nlp(document)
    for w in doc:
        if not w.is_stop and not w.is_punct and not w.like_num:
            text.append(w.lemma_)
    texts.append(text)
print(texts)

dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)

{'Chennai': 0, 'Rajasthan': 1, 'Royals': 2, 'Super': 3, 'club': 4, 'defeat': 5, 'king': 6, 'local': 7, 'rival': 8, 'weekend': 9,
'India': 10, 'Weekend': 11, 'cricket': 12, 'frenzy': 13, 'take': 14, 'CSE': 15, 'Learning': 16, 'Machine': 17, 'area': 18,

'hot': 19, 'London': 20, 'Wembley': 21, 'bid': 22, 'football': 23, 'stadium': 24, 'Dhoni': 25, 'crore': 26, 'striker': 27,
'bank': 28, 'financial': 29, 'loss': 30, 'million': 31, 'result': 32, 'trouble': 33, 'New': 34, 'Syndicate': 35, 'Union': 36,
'York': 37, 'bankruptcy': 38, 'file': 39, 'Maharashtra': 40, 'millionaire': 41, 'oil': 42, 'Russia': 43, 'banking': 44,
'finance': 45, 'work': 46}
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)], [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1)],
[(15, 1), (16, 1), (17, 1), (18, 1), (19, 1)], [(4, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1)], [(0, 1), (3, 1), (6, 1),
(22, 1), (25, 1), (26, 1), (27, 1)], [(28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)], [(28, 1), (29, 1), (30, 1), (34, 1),
(35, 1), (36, 1), (37, 1), (38, 1), (39, 1)], [(0, 1), (4, 1), (14, 1), (23, 1), (40, 1), (41, 1), (42, 1)], [(43, 1), (44, 1),
(45, 1), (46, 1)]]

print(dictionary)

Dictionary<47 unique tokens: ['Chennai', 'Rajasthan', 'Royals', 'Super', 'club']...>

# TF-IDF representation
from gensim import models
tfidf = models.TfidfModel(corpus)

for document in tfidf[corpus]:
    print(document)

[(0, 0.18334368469568632), (1, 0.36668736939137264), (2, 0.36668736939137264), (3, 0.25101038358744027),


(4, 0.18334368469568632), (5, 0.36668736939137264), (6, 0.25101038358744027), (7, 0.36668736939137264),
(8, 0.36668736939137264), (9, 0.36668736939137264)]
[(10, 0.4730584737401374), (11, 0.4730584737401374), (12, 0.4730584737401374), (13, 0.4730584737401374),
(14, 0.32382514060926004)]
[(15, 0.4472135954999579), (16, 0.4472135954999579), (17, 0.4472135954999579), (18, 0.4472135954999579),
(19, 0.4472135954999579)]
[(4, 0.24434832234965204), (20, 0.4886966446993041), (21, 0.4886966446993041), (22, 0.33453001789363906),
(23, 0.33453001789363906), (24, 0.4886966446993041)]
[(0, 0.2317258471482127), (3, 0.3172489626590664), (6, 0.3172489626590664), (22, 0.3172489626590664), (25,
0.4634516942964254), (26, 0.4634516942964254), (27, 0.4634516942964254)]
[(28, 0.3261257358488856), (29, 0.3261257358488856), (30, 0.3261257358488856), (31, 0.47641928776064074),
(32, 0.47641928776064074), (33, 0.47641928776064074)]
[(28, 0.25154215249054435), (29, 0.25154215249054435), (30, 0.25154215249054435), (34, 0.3674642015582995),
(35, 0.3674642015582995), (36, 0.3674642015582995), (37, 0.3674642015582995), (38, 0.3674642015582995),

(39, 0.3674642015582995)]
[(0, 0.23736497937484743), (4, 0.23736497937484743), (14, 0.3249693308062283), (23, 0.3249693308062283),
(40, 0.47472995874969487), (41, 0.47472995874969487), (42, 0.47472995874969487)]
[(43, 0.5), (44, 0.5), (45, 0.5), (46, 0.5)]

import gensim
bigram = gensim.models.Phrases(texts)
print(bigram)

Phrases<94 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>

texts = [bigram[line] for line in texts]


print(texts)

[['Chennai', 'Super', 'king', 'club', 'defeat', 'local', 'rival', 'Rajasthan', 'Royals', 'weekend'], ['Weekend', 'cricket',
'frenzy', 'take', 'India'], ['Machine', 'Learning', 'hot', 'area', 'CSE'], ['London', 'football', 'club', 'bid', 'Wembley',
'stadium'],
['Chennai', 'Super', 'king', 'bid', 'crore', 'striker', 'Dhoni'], ['financial', 'trouble', 'result', 'loss', 'million', 'bank'],
['New', 'York', 'Syndicate', 'Union', 'bank', 'file', 'bankruptcy', 'financial', 'loss'], ['Chennai', 'football', 'club', 'take',
'oil', 'millionaire', 'Maharashtra'], ['banking', 'finance', 'work', 'Russia']]

Viva Questions:
1. How can I search for specific words or phrases in a text using NLTK in Python?

To search for specific words or phrases in a text using NLTK, you can use the concordance() method.
For example, if text is your NLTK Text object, you can use text.concordance("word") to find all
occurrences of "word" in the text.

2. How do I count the unique words in a text using NLTK in Python?


To count the unique words in a text using NLTK, you can create a frequency distribution using FreqDist
and then count the number of unique words using the len() function.
For example:
from nltk import FreqDist
text = ["apple", "banana", "apple", "orange", "banana"]
freq_dist = FreqDist(text)
unique_words_count = len(freq_dist)
print(unique_words_count)

3. How can I generate a frequency distribution of words in a text using NLTK in Python?
To generate a frequency distribution of words in a text using NLTK, you can use the FreqDist class.
For example:
from nltk import FreqDist
text = ["apple", "banana", "apple", "orange", "banana"]
freq_dist = FreqDist(text)
print(freq_dist.most_common())

4. How do I find collocations in a text using NLTK in Python?


To find collocations (sequences of words that often appear together) in a text using NLTK, you can
use the collocations() method. For example:
from nltk import Text
text = Text(["apple", "banana", "apple", "orange", "banana"])
text.collocations()

5. How can I identify bigrams (pairs of consecutive words) in a text using NLTK in Python?
To identify bigrams (pairs of consecutive words) in a text using NLTK, you can use the bigrams() function.
For example:
from nltk import bigrams
text = ["apple", "banana", "orange", "kiwi"]
text_bigrams = list(bigrams(text))
print(text_bigrams)

RESULT:
Thus the program executed successfully.

EX NO: 3    ACCESSING TEXT CORPORA USING NLTK IN PYTHON
DATE:

AIM:

Write a program to generate a word cloud from a given text dataset, visualizing the most frequent words in the text
with larger font sizes.

ALGORITHM:

1. Import Libraries: Import necessary libraries including wordcloud, matplotlib, and optionally NLTK for text
preprocessing.
2. Load Text Data:
 Load the text dataset you want to analyze. This could be from a file, a website, or any other source.
3. Preprocess Text (Optional):
 Preprocess the text data if necessary, such as:
 Converting text to lowercase.
 Removing stopwords (common words like "the", "and", etc.).
 Removing punctuation.
 Tokenizing the text into words.
4. Generate Word Frequencies:
 Count the frequency of each word in the preprocessed text.
5. Create Word Cloud:
 Create a word cloud object.
 Generate the word cloud using the word frequencies.
 Customize the appearance of the word cloud if desired (e.g., colors, fonts, etc.).
6. Display Word Cloud:
 Display the generated word cloud using matplotlib or any other suitable library for visualization.
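The recorded program below builds the word cloud from an inline story; a minimal sketch that instead pulls the text from an NLTK corpus (the gutenberg corpus and the austen-emma.txt fileid are illustrative choices) could look like this:

import nltk
from nltk.corpus import gutenberg
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download('gutenberg')

# Raw text of one file from the Gutenberg corpus
raw_text = gutenberg.raw('austen-emma.txt')

wordcloud = WordCloud().generate(raw_text)
plt.figure(figsize=(16, 6))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()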

PROGRAM:

! pip install wordcloud


from wordcloud import WordCloud
import matplotlib.pyplot as plt
text = "King Krishnadevaraya loved horses and had the best collection of horse breeds in the Kingdom. Well, one day,
a trader came to the King and told him that he had brought with him a horse of the best breed in Arabia.He invited
the King to inspect the horse. King Krishnadevaraya loved the horse; so the trader said that the King could buy this one

and that he had two more like this one, back in Arabia that he would go back to get. The King loved the horse so much
that he had to have the other two as well. He paid the trader 5000 gold coins in advance. The trader promised that he
would
return within two days with the other horses.Two days turned into two weeks, and still, there was no sign of the trader
and the two horses. One evening, to ease his mind, the King went on a stroll in his garden. There he spotted Tenali
Raman writing down something on a piece of paper. Curious, the King asked Tenali what he was jotting down.Tenali
Raman was hesitant, but after further questioning, he showed the King the paper. On the paper was a list of names, the
King’s being at the top of the list. Tenali said these were the names of the biggest fools in the Vijayanagara Kingdom!
As expected, the King was furious that his name was at the top and asked Tenali Raman for an explanation. Tenali
referred to the horse story, saying the King was a fool to believe that the trader, a stranger, would return after receiving
5000 gold coins.Countering his argument, the King then asked, what happens if/when the trader does come back? In true
Tenali humour, he replied saying, in that case, the trader would be a bigger fool, and his name would replace the King’s
on the list!"
print(text)

King Krishnadevaraya loved horses and had the best collection of horse breeds in the Kingdom. Well, one day, a trader
came to the King and told him that he had brought with him a horse of the best breed in Arabia.He invited the King to
inspect the horse. King Krishnadevaraya loved the horse; so the trader said that the King could buy this one and that he
had two more like this one, back in Arabia that he would go back to get. The King loved the horse so much that he had to
have the other two as well. He paid the trader 5000 gold coins in advance. The trader promised that he would return
within two days with the other horses.Two days turned into two weeks, and still, there was no sign of the trader and the
two horses. One evening, to ease his mind, the King went on a stroll in his garden. There he spotted Tenali Raman
writing down something on a piece of paper. Curious, the King asked Tenali what he was jotting down.Tenali Raman
was hesitant, but after further questioning, he showed the King the paper. On the paper was a list of names, the King’s
being at the top
of the list. Tenali said these were the names of the biggest fools in the Vijayanagara Kingdom!As expected, the King was
furious that his name was at the top and asked Tenali Raman for an explanation. Tenali referred to the horse story, saying
the King was a fool to believe that the trader, a stranger, would return after receiving 5000 gold coins.Countering his
argument, the King then asked, what happens if/when the trader does come back? In true Tenali humour, he replied
saying,
in that case, the trader would be a bigger fool, and his name would replace the King’s on the list!

wordcloud = WordCloud().generate(text)
plt.figure(figsize = (16,6))
plt.imshow(wordcloud)
<matplotlib.image.AxesImage at 0x7d4e144f84c0>

Viva Questions:
1. What is NLTK, and how does it facilitate text corpus access in Python?
NLTK, or the Natural Language Toolkit, is a powerful library for natural language processing in Python. It
provides a wide range of tools and resources, including access to various text corpora. NLTK simplifies text
corpus access by offering a unified interface to access and manipulate different corpora seamlessly.

2. How can you check the available text corpora in NLTK?


You can check the available text corpora in NLTK by using the fileids() function from the nltk.corpus
module. This function returns a list of identifiers for all the available corpora.

3. How do you download and access the Gutenberg corpus in NLTK?


You can download and access the Gutenberg corpus in NLTK by using the following Python code:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

4. Can you explain how to access the Brown corpus in NLTK and retrieve sample text from it?
Certainly. To access the Brown corpus in NLTK and retrieve sample text, you can use the following code:
import nltk
nltk.download('brown')
from nltk.corpus import brown
# Access sample text from the Brown corpus
sample_text = brown.raw(categories='news')[:200]
print(sample_text)

5. How do you access the WordNet corpus in NLTK, and what is its significance in natural language processing?
You can access the WordNet corpus in NLTK by using the following Python code:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet
WordNet is a lexical database for the
English language, which is extensively used in natural language processing tasks such as word sense
disambiguation, synonym detection, and semantic analysis. Accessing WordNet through NLTK provides access to
a vast collection of words organized into synonym sets, along with their semantic relationships, making it a
valuable resource for NLP applications.

RESULT:
Thus the program for word cloud executed successfully.

EX NO: 4    WRITE A FUNCTION THAT FINDS THE 50 MOST FREQUENTLY OCCURRING
            WORDS OF A TEXT THAT ARE NOT STOP WORDS
DATE:

AIM:

Write a code to find the 50 most frequently occurring words in a text while excluding common stop words.

ALGORITHM:

1. Import Libraries: Import NLTK to access its stopwords list and for text preprocessing.
2. Define the Function:
 Define a function named top_frequent_words that takes a text string as input.
3. Preprocess the Text:
 Tokenize the text into words.
 Convert words to lowercase.
 Remove punctuation.
 Remove stop words.
4. Count Word Frequencies:
 Count the frequency of each word using a dictionary.
5. Sort and Retrieve Top Words:
 Sort the word frequencies dictionary by values in descending order.
 Retrieve the top 50 words from the sorted list.
6. Return Results:
 Return the list of top 50 words.
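A minimal sketch of the function described above (the name top_frequent_words, the download calls, and the default n=50 follow the algorithm and are illustrative):

import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

def top_frequent_words(text, n=50):
    # Step 3: tokenize, lowercase, drop punctuation/non-alphabetic tokens and stop words
    stop_words = set(stopwords.words('english'))
    words = [w.lower() for w in nltk.word_tokenize(text)]
    words = [w for w in words if w.isalpha() and w not in stop_words]
    # Steps 4-6: count frequencies and return the n most common words
    freq_dist = nltk.FreqDist(words)
    return [w for w, _ in freq_dist.most_common(n)]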

PROGRAM:

import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
nltk.download('webtext')
wt_words = webtext.words('d:\\Ranga\\testing.txt')
data_analysis = nltk.FreqDist(wt_words)
[nltk_data] Downloading package webtext to
[nltk_data] C:\Users\ifsrk\AppData\Roaming\nltk_data...
[nltk_data] Package webtext is already up-to-date!

# Keep only the words that are longer than 3 characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))
5000: 2
Arabia: 2
Countering: 1
Curious: 1
King: 12
Kingdom: 2
King’: 2
Krishnadevaraya: 2
Raman: 3
Tenali: 7
There: 1
Vijayanagara: 1
Well: 1
advance: 1
after: 2
argument: 1
asked: 3
back: 3
being: 1
believe: 1
best: 2
bigger: 1
biggest: 1
breed: 1
breeds: 1
brought: 1
came: 1
case: 1
coins: 2
collection: 1
come: 1
could: 1
days: 2

does: 1
down: 2
ease: 1
evening: 1
expected: 1
explanation: 1
fool: 2
fools: 1
furious: 1
further: 1
garden: 1
gold: 2
happens: 1
have: 1
hesitant: 1
horse: 6
horses: 3
humour: 1
inspect: 1
into: 1
invited: 1
jotting: 1
like: 1
list: 3
loved: 3
mind: 1
more: 1
much: 1
name: 2
names: 2
other: 2
paid: 1
paper: 3
piece: 1
promised: 1
questioning: 1
receiving: 1
referred: 1

replace: 1
replied: 1
return: 2
said: 2
saying: 2
showed: 1
sign: 1
something: 1
spotted: 1
still: 1
story: 1
stranger: 1
stroll: 1
that: 9
then: 1
there: 1
these: 1
this: 2
told: 1
trader: 8
true: 1
turned: 1
weeks: 1
well: 1
went: 1
were: 1
what: 2
when: 1
with: 2
within: 1
would: 5
writing: 1
data_analysis = nltk.FreqDist(filter_words)

data_analysis.plot(25, cumulative=False)

<AxesSubplot:xlabel='Samples', ylabel='Counts'>

Viva Questions:
1. What is the purpose of the function you've written?
The purpose of the function is to find the 50 most frequently occurring words in a text while excluding
common stop words, which are often function words and have little semantic meaning.

2. How do you handle the task of finding the most frequent words in the function?
The function tokenizes the text into words, removes any punctuation, converts the words to lowercase,
and filters out stop words. Then, it uses NLTK's FreqDist to calculate the frequency distribution of the remaining
words and returns the 50 most common ones.

3. How do you ensure that stop words are excluded from the analysis?
NLTK provides a predefined set of stop words for various languages. The function utilizes this set to filter
out stop words from the text before calculating word frequencies.

4. Can you explain the significance of excluding stop words in natural language processing tasks?
Stop words are commonly occurring words in a language (e.g., "the", "is", "in") that often do not carry
significant meaning in text analysis tasks. By excluding stop words, we focus on content-bearing words,
improving the accuracy and relevance of our analysis, such as text summarization, sentiment analysis, or topic
modeling.

5. How would you use this function in a practical scenario?


In a practical scenario, you could apply this function to analyze large volumes of text, such as articles,
books, or social media posts. By identifying the most frequent non-stop words, you gain insights into the central
themes, topics, or key terms present in the text, which can inform further analysis or summarization.

RESULT:
Thus the program executed successfully.

EX NO: 5    IMPLEMENT THE WORD2VEC MODEL
DATE:

AIM:

Write a code to demonstrate different methods for encoding categorical data into numerical representations, specifically
focusing on one-hot encoding.

ALGORITHM:

1. Import Libraries: Import necessary libraries such as NumPy and scikit-learn.


2. Define Example Data:
 Define a dataset containing categorical values.
3. Integer Encoding:
 Use LabelEncoder from scikit-learn to transform categorical labels into numerical representations.
 Print the encoded values.
4. Binary Encoding (One-Hot Encoding):
 Use OneHotEncoder from scikit-learn to perform one-hot encoding on the integer-encoded labels.
 Print the one-hot encoded values.
5. Inversion:
 Invert the one-hot encoded values back to categorical labels using inverse_transform from LabelEncoder.
 Print the inverted value to verify the inversion process.
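The recorded program below demonstrates one-hot encoding; a minimal Word2Vec sketch using gensim (assuming gensim 4.x, with illustrative sentences and hyperparameters) could look like this:

from gensim.models import Word2Vec

# Tokenized corpus: one list of words per sentence
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "mat", "was", "soft", "and", "fluffy"],
]

# sg=0 selects CBOW, sg=1 selects Skip-gram; min_count=1 keeps every word in this tiny corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"])                # dense vector learned for "cat"
print(model.wv.most_similar("cat"))   # nearest words by cosine similarity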

PROGRAM:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
import numpy as np

# Define the corpus of text


corpus = [
"The quick brown fox jumped over the lazy dog.",
"She sells seashells by the seashore.",
"Peter Piper picked a peck of pickled peppers."
]

# Create a set of unique words in the corpus


unique_words = set()
for sentence in corpus:
    for word in sentence.split():
        unique_words.add(word.lower())

# Create a dictionary to map each


# unique word to an index
word_to_index = {}
for i, word in enumerate(unique_words):
    word_to_index[word] = i

# Create one-hot encoded vectors for


# each word in the corpus
one_hot_vectors = []
for sentence in corpus:
    sentence_vectors = []
    for word in sentence.split():
        vector = np.zeros(len(unique_words))
        vector[word_to_index[word.lower()]] = 1
        sentence_vectors.append(vector)
    one_hot_vectors.append(sentence_vectors)

# Print the one-hot encoded vectors


# for the first sentence
print("One-hot encoded vectors for the first sentence:")
for vector in one_hot_vectors[0]:
    print(vector)
One-hot encoded vectors for the first sentence:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]

[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
import numpy as np

# Define the sentences


sentences = [
'The cat sat on the mat.',
'The dog chased the cat.',
'The mat was soft and fluffy.'
]

# Create a vocabulary set


vocab = set()
for sentence in sentences:
    words = sentence.lower().split()
    for word in words:
        vocab.add(word)

# Create a dictionary to map words to integers


word_to_int = {word: i for i, word in enumerate(vocab)}

# Create a binary vector for each word in each sentence


vectors = []
for sentence in sentences:
    words = sentence.lower().split()
    sentence_vectors = []
    for word in words:
        binary_vector = np.zeros(len(vocab))
        binary_vector[word_to_int[word]] = 1
        sentence_vectors.append(binary_vector)
    vectors.append(sentence_vectors)

# Print the one-hot encoded vectors for each word in each sentence

for i in range(len(sentences)):
    print(f"Sentences {i + 1}:")
    for j in range(len(vectors[i])):
        print(f"{sentences[i].split()[j]}: {vectors[i][j]}")
Sentences 1:
The: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
cat: [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
sat: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
on: [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
the: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
mat.: [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
Sentences 2:
The: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
dog: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
chased: [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
the: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
cat.: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
Sentences 3:
The: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
mat: [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
was: [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
soft: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
and: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
fluffy.: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
!pip install scikit-learn
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# define example
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
print("cold-0, hot-1, warm-2")
values = array(data)
print(values)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print(inverted)
cold-0, hot-1, warm-2
['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']
[0 0 2 0 1 1 2 0 2 1]
[[1. 0. 0.]
[1. 0. 0.]
[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]
[0. 1. 0.]
[0. 0. 1.]
[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]]
['cold']

Viva Questions:
1. What is the Word2Vec model?
Word2Vec is a popular word embedding technique used to represent words as dense vectors in a
continuous vector space. It is based on the distributional hypothesis, which states that words appearing in
similar contexts tend to have similar meanings. Word2Vec learns vector representations by training on large
corpora, capturing semantic relationships between words.

2. How does the Word2Vec model work?


Word2Vec operates on the principle that words with similar meanings occur in similar contexts. It learns
vector representations by training on a large corpus of text. There are two main architectures for Word2Vec:
Continuous Bag of Words (CBOW) and Skip-gram. CBOW predicts the current word given a context, while Skip-
gram predicts surrounding words given a central word. These architectures are trained using neural networks to
optimize the similarity between word vectors based on their co-occurrence patterns.

3. What are the main components of implementing the Word2Vec model?


The main components of implementing the Word2Vec model include:
1. Text preprocessing: Tokenizing text into words, removing punctuation, and lowercasing.
2. Building a vocabulary: Creating a vocabulary from the corpus and assigning a unique index to each word.
3. Training the Word2Vec model: Training the Word2Vec model using the corpus, choosing an appropriate
architecture (CBOW or Skip-gram) and hyperparameters.
4. Obtaining word vectors: Extracting word vectors from the trained model for downstream tasks.

4. How can you train a Word2Vec model in Python?


In Python, you can train a Word2Vec model using the Word2Vec class provided by the gensim library.
First, preprocess your text data and tokenize it into words. Then, pass the tokenized text to the Word2Vec class,
specifying parameters such as the architecture (CBOW or Skip-gram), vector size, window size, etc. Finally, train
the model using the train() method.

5. What are some applications of Word2Vec in natural language processing?


Word2Vec has various applications in natural language processing, including:
 Document classification and clustering
 Information retrieval and search engines
 Named entity recognition and part-of-speech tagging
 Sentiment analysis and text classification
 Machine translation and language generation

RESULT:
Thus the program executed successfully.

EX NO: 6    BUILDING LANGUAGE TRANSLATOR (GUI) USING TRANSLATOR
            LIBRARY AND TK
DATE:

AIM:

Write a program to build a language translator GUI using the translator library and Tkinter (TK), creating a user-
friendly interface where users can input text in one language and get the translation in another language interactively.

ALGORITHM:

1. Import Libraries: Import necessary libraries including translator for translation and TKinter for GUI development.
2. Create GUI Layout:
 Create a TKinter window (Tk()).
 Design the layout with labels, text entry widgets, buttons, etc., to accept user input and display translated
text.
3. Define Translation Function:
 Define a function to handle translation:
 Get the input text from the text entry widget.
 Use the translator library to translate the text from the source language to the target language.
 Display the translated text in the GUI.
4. Event Handling:
 Bind the translation function to an event, such as clicking a button.
5. Main Event Loop:
 Start the main event loop to run the GUI application.

PROGRAM:

!pip install tk
# installing tk inter
! pip install translate
# installing translator libraries
import tkinter
from tkinter import *
from translate import Translator

tkinter.TkVersion

8.6

Screen = Tk()
Screen.title("Language Translator with GUI")

InputLanguageChoice = StringVar()
TranslateLanguageChoice = StringVar()

LanguageChoices = {'Hindi','English','Tamil','German','Spanish'}
InputLanguageChoice.set('English')
TranslateLanguageChoice.set('Hindi')

#creating a function for translating using translate package


def Translate():
    translator = Translator(from_lang=InputLanguageChoice.get(), to_lang=TranslateLanguageChoice.get())
    Translation = translator.translate(TextVar.get())
    OutputVar.set(Translation)

#choice for input language


InputLanguageChoiceMenu = OptionMenu(Screen,InputLanguageChoice,*LanguageChoices)
Label(Screen,text="Choose a Language").grid(row=0,column=1)
InputLanguageChoiceMenu.grid(row=1,column=1)

#choice in which the language is to be translated


NewLanguageChoiceMenu = OptionMenu(Screen,TranslateLanguageChoice,*LanguageChoices)
Label(Screen,text="Translated Language").grid(row=0,column=2)
NewLanguageChoiceMenu.grid(row=1,column=2)

Label(Screen,text="Enter Text").grid(row=2,column =0)


TextVar = StringVar()
TextBox = Entry(Screen,textvariable=TextVar).grid(row=2,column = 1)

Label(Screen,text="Output Text").grid(row=2,column =2)


OutputVar = StringVar()
TextBox = Entry(Screen,textvariable=OutputVar).grid(row=2,column = 3)

#Button for calling function


B = Button(Screen,text="Translate",command=Translate, relief = GROOVE).grid(row=3,column=1,columnspan = 3)

mainloop()

Viva Questions:
1. What is the Translator library in Python?
The Translator library is a simple yet powerful tool for language translation in Python. It utilizes Google
Translate's API to provide quick and accurate translations between various languages.

2. How can you integrate the Translator library into a Python application?
To integrate the Translator library into a Python application, you first need to install it using pip:
pip install translate
Once installed, you can import and use the translate module in your Python code to perform translations.

3. What is Tkinter, and how does it contribute to building a GUI in Python?


Tkinter is the standard GUI (Graphical User Interface) toolkit for Python. It provides a set of modules for
creating GUI applications with ease. Tkinter is simple to use, comes pre-installed with Python, and is cross-
platform, making it an excellent choice for building interactive applications.

4. How do you create a simple GUI translator using Tkinter and the Translator library?
You can create a simple GUI translator by designing a graphical interface with Tkinter, including input and
output text fields, language selection options, and a translation button. Then, you can use the Translator library
to handle the translation process when the button is clicked.

5. What are some potential improvements or additional features you could add to this GUI translator?
Some potential improvements or additional features for this GUI translator include:
 Adding support for detecting the input language automatically.
 Implementing error handling for cases such as network issues or invalid input.
 Enhancing the user interface with styling, icons, or additional widgets.
 Supporting translations for more languages by expanding the language selection options.
 Allowing users to save or copy translated text for future use.

RESULT:

Thus the program executed successfully.



EX NO: 7    DESIGN A CHATBOT WITH A SIMPLE DIALOG SYSTEM
DATE:

AIM:

Write a program to design a chatbot with a simple dialog system: a conversational agent that can interact
with users, understand their queries, and provide appropriate responses based on predefined rules or patterns.

ALGORITHM:
1. Initialize Chatbot:
 Start by initializing the chatbot, greet the user, and provide an initial prompt or question.
2. Accept User Input:
 Accept user input as text.
3. Process User Input:
 Preprocess the user input if necessary (e.g., convert to lowercase, remove punctuation).
4. Determine Intent:
 Determine the intent of the user input based on keywords or patterns.
 This can be done using rule-based methods like keyword matching or regular expressions.
5. Generate Response:
 Based on the detected intent, select an appropriate response from a predefined set of responses.
 Responses can be stored in a dictionary or any other suitable data structure.
6. Display Response:
 Display the generated response to the user.
7. Continue Conversation:
 Optionally, continue the conversation by prompting the user for additional input or providing further
information.
8. End Chat:
 End the conversation if the user says goodbye or after a certain number of interactions.
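A minimal sketch of steps 2-8 as a conversation loop (the exit keywords and prompts are illustrative); it can be called as chat(qa_pairs) after defining the dictionary used in the recorded program below:

def chat(qa_dict):
    print("Hello! Ask me a question (type 'bye' to quit).")
    while True:
        user_input = input("You: ").strip()
        # Step 8: end the chat when the user says goodbye
        if user_input.lower() in ("bye", "goodbye", "exit"):
            print("Bot: Goodbye!")
            break
        # Steps 4-6: look up the intent and display the matching response
        print("Bot:", qa_dict.get(user_input, "Sorry, I don't have an answer to that question."))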

PROGRAM:

# Simple version 1 Question and Answering System


# Define a dictionary of question-answer pairs
qa_pairs = {
    "What is your name?": "My name is venkat.",
    "How old are you?": "I am 52 years.",
    "Where are you from?": "Thirukkoilur.",
    "What is the capital of Tamilnadu state?": "Chennai!",
    "How can I help you?": "You can ask me questions or seek information.",
    "What is the weather today?": "I'm sorry, I don't have real-time access to weather data."
}

# Function to find the answer to a question


def answer_question(question, qa_dict):
    return qa_dict.get(question, "Sorry, I don't have an answer to that question.")

# Get user input


user_question = input("Ask a question: ")

# Get the answer


answer = answer_question(user_question, qa_pairs)
print(answer)

Ask a question: How old are you?


I am 52 years.

Viva Questions:
1. What is a chatbot, and how does it function in a dialog system?
A chatbot is an AI-powered software program designed to interact with users through text or voice-based
conversations. In a dialog system, a chatbot engages users in a structured or unstructured conversation,
responding to user queries or providing assistance based on predefined rules or machine learning algorithms.

2. How does a simple dialog system differ from more complex conversational AI systems?
A simple dialog system typically operates based on predefined rules or patterns, providing responses
according to programmed logic. In contrast, more complex conversational AI systems utilize natural language
processing (NLP) and machine learning techniques to understand and generate human-like responses, often
incorporating context and learning from interactions.

3. What are the key components of designing a chatbot with a simple dialog system?
The key components of designing a chatbot with a simple dialog system include:
 Intent recognition: Identifying the user's intention or query based on input text.
 Response generation: Generating appropriate responses based on recognized intents and predefined rules or
patterns.
 Dialog flow management: Managing the flow of conversation, including handling user inputs, prompting for
clarification, and providing responses.

4. How do you handle user inputs and generate responses in a simple dialog system?
In a simple dialog system, user inputs are typically matched to predefined intents or patterns using
techniques like keyword matching or regular expressions. Once the intent is recognized, corresponding
responses are generated from a predefined set of responses or through templated responses based on input.

5. Can you provide an example of a simple chatbot dialog system and how it operates?
Sure. Let's consider a weather chatbot as an example. The chatbot recognizes intents such as "get
weather forecast" or "check weather," and responds with the current weather conditions or a weather forecast
for the specified location. It operates by matching user inputs to predefined intents and retrieving relevant
weather information based on the recognized intent.

RESULT:
Thus the program executed successfully.

EX NO: 8    CONVERT TEXT TO SPEECH AND FIND ACCURACY
DATE:

AIM:

Write a program to create a function that takes audio input from the microphone, converts it to text using speech
recognition, and prints the recognized text.

ALGORITHM:

1. Initialize Recognizer: Import the speech_recognition library and initialize the recognizer.
2. Define Speech-to-Text Function: Define a function that takes an audio input and converts it to text using speech
recognition. This function will also print the recognized text.
3. Adjust for Ambient Noise: Adjust the recognizer for ambient noise to improve recognition accuracy.
4. Listen for User Input: Use the microphone as the audio source and listen for user input.
5. Recognize Audio: Use the recognizer to recognize the audio input and convert it to text.
6. Print Recognized Text: Print the recognized text.
7. Speak Recognized Text (Optional): Optionally, use text-to-speech to speak the recognized text.
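The recorded program captures speech and speaks the recognized text back; a minimal sketch for the "find accuracy" part of the title (the reference sentence is illustrative, the recognized text is taken from the recorded output), comparing the recognized text against what was actually spoken:

from difflib import SequenceMatcher

def text_accuracy(reference, recognized):
    # Ratio of matching characters between the spoken reference and the recognized text (0.0 to 1.0)
    return SequenceMatcher(None, reference.lower(), recognized.lower()).ratio()

reference = "third year cse classroom location i mean the room number please"
recognized = "3rd year cse classroom location i mean the room number please"
print(f"Accuracy: {text_accuracy(reference, recognized):.2%}")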

PROGRAM:

# !pip install SpeechRecognition


# Once installed, you should verify the installation by opening an interpreter session and typing:

import speech_recognition as sr
sr.__version__

'3.9.0'

! pipwin install pyaudio

'pipwin' is not recognized as an internal or external command,


operable program or batch file.

!pip install pyttsx3


!pip install PyAudio
# from anaconda prompt
# !pipwin install pyaudio

import pyttsx3
import pyaudio

# Let us create a function that takes in the audio as input and converts it to text.

# Initialize the recognizer


r = sr.Recognizer()

# Function to convert text to speech


def SpeakText(command):
    # Initialize the engine
    engine = pyttsx3.init()
    engine.say(command)
    engine.runAndWait()

# Now, use the microphone to get audio input from the user in real-time, recognize it, and
# print it in text.

# use the microphone as source for input


with sr.Microphone() as source2:

    # Wait for a moment to let the recognizer adjust the energy threshold based on
    # the surrounding noise level
    r.adjust_for_ambient_noise(source2, duration=0.2)

    # Listen for user input
    print("Please say something")
    audio2 = r.listen(source2)

    # Using Google to recognize audio
    MyText = r.recognize_google(audio2)
    MyText = MyText.lower()

    print("Did you say " + MyText)

    SpeakText(MyText)

Please say something


result2:

{ 'alternative': [ { 'transcript': '3rd year CSE classroom location I '


'mean the room number please'},
{ 'transcript': '3rd year CSE classroom location I '
'mean room number please'},
{ 'transcript': '3rd year CSE classroom location I am '
'in the room number please'},
{ 'transcript': '3rd year CSE classroom location in '
'the room number please'},
{ 'transcript': '3rd year CSE classroom location am '
'in the room number please'}],
'final': True}
Did you say 3rd year cse classroom location i mean the room number please
print(MyText)

3rd year cse classroom location i mean the room number please

Viva Questions:
1. What is the process of converting text to speech, and what role does accuracy play in this process?
Converting text to speech involves synthesizing natural-sounding speech from written text. Accuracy in text-to-
speech conversion refers to the fidelity of the synthesized speech to the original text. Higher accuracy means that the
synthesized speech closely matches the intended text, resulting in a more natural and understandable output.

2. How can you evaluate the accuracy of text-to-speech conversion?


The accuracy of text-to-speech conversion can be evaluated subjectively through human perception by listening
to the synthesized speech and comparing it to the original text. Additionally, objective metrics such as word error rate
(WER), phoneme error rate (PER), or prosody evaluation can provide quantitative measures of accuracy.

3. What are some factors that can affect the accuracy of text-to-speech conversion?
Several factors can affect the accuracy of text-to-speech conversion, including:
 Quality of the text preprocessing: Proper handling of punctuation, capitalization, and special characters can
improve accuracy.
 Speech synthesis engine: The underlying technology and algorithms used for speech synthesis can impact
accuracy.
 Linguistic characteristics: Complex sentence structures, ambiguous pronunciation, or uncommon words may
affect accuracy.
 Voice quality: The naturalness and clarity of the synthesized voice can influence perceived accuracy.

4.Can you explain the importance of accuracy in text-to-speech applications?


Accuracy is crucial in text-to-speech applications because it directly impacts the intelligibility and naturalness of
the synthesized speech. High accuracy ensures that the synthesized output effectively conveys the intended message of
the original text, enhancing user comprehension and user experience in various applications such as assistive technology,
navigation systems, virtual assistants, and educational tools.

5.How can you measure accuracy in a text-to-speech system in practice?


In practice, accuracy in text-to-speech systems can be measured through a combination of subjective human
evaluation and objective metrics. Subjective evaluation involves gathering feedback from human listeners who assess the
naturalness and intelligibility of the synthesized speech. Objective metrics such as WER, PER, or prosody evaluation
algorithms provide quantitative measures of accuracy based on the comparison between the synthesized speech and the
original text.

RESULT:
Thus the program executed successfully.

EX NO:9
DATE: DESIGN A SPEECH RECOGNITION SYSTEM AND FIND THE ERROR RATE

AIM:

Write a program to design a speech recognition system and calculate the error rate to evaluate its performance. The error
rate indicates the accuracy of the system in converting spoken words into text.

ALGORITHM:
1. Collect audio data and corresponding transcriptions.
2. Preprocess the audio data.
3. Extract features from the audio.
4. Train a speech recognition model.
5. Evaluate the model on a separate dataset.
6. Calculate the error rate (e.g., WER, CER).
7. Display the error rate.
8. Analyze and improve the system as needed.

PROGRAM:

!pip install SpeechRecognition


# Once installed, you should verify the installation by opening an interpreter session and typing:

import speech_recognition as sr
sr.__version__

'3.9.0'

# Creating a Recognizer instance is easy.


r = sr.Recognizer()
import os
os.getcwd()

'C:\\Users\\thaanya\\NLP-PRACTICALS'

# Using record() to Capture Data From a File

# Type the following to process the contents of the "sai_record.wav" file:

hindi = sr.AudioFile('sai_record.wav')
with hindi as source:
    audio = r.record(source)

type(audio)

speech_recognition.AudioData

# You can now invoke recognize_google() to attempt to recognize any speech in the audio.
# Depending on your internet connection speed, you may have to wait several seconds before
# seeing the result.

r.recognize_google(audio)

result2:
{ 'alternative': [ { 'transcript': 'Santhosh is a Bhagwan santhosha I am '
'Anaconda'},
{ 'transcript': 'Santosh is a Bhagwan santhosha I am '
'Anaconda'}],
'final': True}

'Santhosh is a Bhagwan santhosha I am Anaconda'

type(audio)

speech_recognition.AudioData
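The listing above stops at transcription and does not yet compute the error rate called for in the aim. A minimal sketch of that final step is given below; it assumes the jiwer package is installed and uses a hand-written reference transcript (the reference string here is an assumed ground truth for illustration, not part of the original recording's metadata):

! pip install jiwer

import jiwer

# Assumed ground-truth transcript of 'sai_record.wav' (hypothetical reference)
reference = "santhosh is a bhagwan santhosha i am anaconda"

# Recognizer output from the cell above, lower-cased for a fair comparison
hypothesis = r.recognize_google(audio).lower()

wer = jiwer.wer(reference, hypothesis)
print("Word Error Rate: {:.2%}".format(wer))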

Viva Questions:
1.What is a speech recognition system, and how does it function?
A speech recognition system is a technology that converts spoken language into text or commands. It operates by
analyzing audio input, extracting features such as spectrograms or MFCCs (Mel-Frequency Cepstral Coefficients),
and applying machine learning algorithms or neural networks to recognize and transcribe spoken words into text.

2.What role does error rate play in speech recognition systems?


Error rate in speech recognition systems quantifies the accuracy of transcribing spoken words into text. It measures the
discrepancy between the recognized text output and the ground truth, providing insight into the system's performance and
reliability.

3. How can you evaluate the error rate of a speech recognition system?
The error rate of a speech recognition system can be evaluated using objective metrics such as Word Error Rate
(WER) or Character Error Rate (CER). WER measures the percentage of words in the recognized output that differ
from the ground truth, while CER calculates the percentage of characters that are incorrectly transcribed. Lower error
rates indicate higher accuracy in speech recognition.

4. What factors can contribute to errors in speech recognition systems?
Several factors can contribute to errors in speech recognition systems, including:
 Background noise: Environmental noise can interfere with audio signals, leading to inaccuracies in recognition.
 Speaker variability: Differences in accent, pronunciation, or speech patterns among speakers can pose challenges for accurate recognition.
 Vocabulary complexity: Uncommon words, technical terms, or ambiguous pronunciations may be more prone to recognition errors.
 Speech rate: Rapid speech or unclear articulation can increase the likelihood of recognition errors.
 System limitations: The quality of the acoustic model, language model, or training data can affect the system's ability to accurately recognize speech.

5. Can you explain how to calculate the Word Error Rate (WER) in practice?
To calculate the Word Error Rate (WER) in practice, you compare the recognized text output generated by the
speech recognition system to the corresponding ground truth or reference transcript. Count the total number of
insertions (extra words), deletions (missed words), and substitutions (mismatched words) required to transform the
recognized text into the reference transcript. Then, divide the total number of errors by the total number of words
in the reference transcript to obtain the WER as a percentage.
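To make the counting concrete, a small self-contained sketch of this calculation (a standard edit-distance formulation, not taken from the lab listing) is shown below:

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / number of reference words
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical reference and recognizer output for illustration
print(word_error_rate("third year cse classroom location", "third year cse class location"))  # 0.2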

RESULT:
Thus the program executed successfully.

ADDITIONAL EXPERIMENTS

EX NO:10
DATE: STEMMING AND LEMMATIZATION USING NLTK

AIM:

Write a program for stemming and lemmatization using NLTK (Natural Language Toolkit) to normalize words in
text by reducing them to their base or root forms.

ALGORITHM:

1. Tokenization: Break down the input text into individual words or tokens.
2. Stemming:
 Apply a stemming algorithm, such as the Porter Stemmer, to each token.
 Remove common suffixes to obtain the root form of each word.
3. Lemmatization:
 Tag each token with its part of speech (optional but recommended).
 Apply lemmatization using NLTK's WordNet Lemmatizer, considering the part of speech tag.
 Retrieve the base or dictionary form of each word (lemma).
4. Post-Processing (Optional):
 Remove any non-word tokens resulting from stemming or lemmatization.
 Convert all tokens to lowercase for consistency.
5. Output:
 Return the stemmed or lemmatized tokens as the normalized text.

PROGRAM:

! pip install nltk

import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk import word_tokenize, PorterStemmer, WordNetLemmatizer
raw = "My name is Maximus Decimus Meridius, commander of the armies of the north, General of the Felix legions and
loyal servant to the true emperor, Marcus Aurelius. Father to a murdered son, husband to a murdered wife. And I will
have my vengeance, in this life or the next."
tokens = word_tokenize(raw)

raw

'My name is Maximus Decimus Meridius, commander of the armies of the north, General of the Felix legions and
loyal servant to the true emperor, Marcus Aurelius. Father to a murdered son, husband to a murdered wife. And I will
have my vengeance, in this life or the next.'

porter = PorterStemmer()
stems = [porter.stem(t) for t in tokens]
print(stems)

['My', 'name', 'is', 'maximu', 'decimu', 'meridiu', ',', 'command', 'of', 'the', 'armi', 'of', 'the', 'north', ',', 'gener', 'of', 'the',
'felix', 'legion', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'marcu', 'aureliu', '.', 'father', 'to', 'a', 'murder', 'son',
',', 'husband', 'to', 'a', 'murder', 'wife', '.', 'and', 'I', 'will', 'have', 'my', 'vengeanc', ',', 'in', 'thi', 'life', 'or', 'the', 'next', '.']

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]
print(lemmas)

['My', 'name', 'is', 'Maximus', 'Decimus', 'Meridius', ',', 'commander', 'of', 'the', 'army', 'of', 'the', 'north', ',', 'General', 'of',
'the', 'Felix', 'legion', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'Marcus', 'Aurelius', '.', 'Father', 'to', 'a',
'murdered', 'son', ',', 'husband', 'to', 'a', 'murdered', 'wife', '.', 'And', 'I', 'will', 'have', 'my', 'vengeance', ',', 'in', 'this', 'life',
'or', 'the', 'next', '.']
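Step 3 of the algorithm recommends tagging each token with its part of speech before lemmatizing, which the listing above omits. A minimal sketch of that variant is given below; the to_wordnet_pos() helper is an illustrative assumption that maps Penn Treebank tags to the constants WordNetLemmatizer expects:

import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (JJ, VB, RB, NN, ...) to WordNet POS constants
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default to noun

lemmatizer = WordNetLemmatizer()
tagged = pos_tag(word_tokenize("Father to a murdered son, husband to a murdered wife"))
lemmas = [lemmatizer.lemmatize(tok, to_wordnet_pos(tag)) for tok, tag in tagged]
print(lemmas)  # 'murdered' is now lemmatized as the verb 'murder'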

Viva Questions:
1.What is stemming, and how does it differ from lemmatization?
Stemming and lemmatization are techniques used in natural language processing to reduce words to their root or
base form. Stemming involves removing suffixes or prefixes from words to extract the stem, whereas lemmatization
involves reducing words to their dictionary form, known as lemma, by considering the context and meaning of the word.

2.How do you perform stemming using NLTK in Python?


In NLTK, stemming can be performed using various algorithms such as the Porter Stemmer or the Lancaster
Stemmer. Here's how you can perform stemming using NLTK's Porter Stemmer:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
word = "running"
stemmed_word = stemmer.stem(word)
print(stemmed_word)  # Output: run

3.What is lemmatization, and how can you achieve it with NLTK?


Lemmatization is the process of reducing words to their base or dictionary form (lemma). In NLTK,
lemmatization is achieved using WordNet's morphological database. Here's an example of lemmatization using NLTK:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
word = "running"
lemma = lemmatizer.lemmatize(word, pos='v')  # Specify the part of speech (verb in this case)
print(lemma)  # Output: run

4.What are some considerations when choosing between stemming and lemmatization?
When choosing between stemming and lemmatization, consider factors such as the application domain, desired
level of linguistic accuracy, and computational resources. Stemming is faster and simpler but may produce non-words or
incorrect stems. Lemmatization, on the other hand, provides more accurate results by considering word meanings but
requires more computational resources.

5.Can you explain the importance of stemming and lemmatization in natural language processing tasks?
Stemming and lemmatization are crucial preprocessing steps in natural language processing tasks such as text
classification, information retrieval, and sentiment analysis. By reducing words to their base forms, stemming and
lemmatization help normalize text data, reduce vocabulary size, and improve the accuracy of downstream NLP tasks by
ensuring that variations of words are treated as the same token.

RESULT:
Thus the program executed successfully.

EX NO:11
DATE: NLP AUTO COMPLETE PROGRAM

AIM:

Write an NLP autocomplete program to predict the next word or phrase a user is likely to input based
on the context of their current input.

ALGORITHM:

1. Tokenization: Break down the input text into individual words or phrases, removing punctuation and converting
everything to lowercase for consistency.
2. Frequency Count: Count the frequency of occurrence for each token in the dataset to understand their likelihood.
3. Prediction: Given a partial input from the user, predict the next word or phrase based on the most frequent
continuation from the dataset.
4. Ranking: If there are multiple predictions, rank them based on their frequency of occurrence and present the most
frequent ones to the user.
5. User Interface Integration: Present the predicted words or phrases to the user in the application's interface,
such as a dropdown menu or a list of suggestions.
6. Feedback Loop (Optional): Incorporate a feedback mechanism where the user's selections are used to refine
future predictions.

PROGRAM:

# import pyreadline3
# We can also implement the autocomplete using
# the Python readline packages.
# Install the packages first.

! pip install ttkwidgets


! pip install pillow
! pip install tk

from ttkwidgets.autocomplete import AutocompleteEntry


from tkinter import *

greetings = ['Hello How Are You', 'I like students','Best Wishes','Thanks and Regards', 'United States of America',
"Prince Shri Venkateswara Padmavathy Engineering College"]

greetings

['Hello How Are You',


'I like students',

'Best Wishes',
'Thanks and Regards',
'United States of America',
'Prince Shri Venkateswara Padmavathy Engineering College']

ws = Tk()
ws.title('Python-AUTO-COMPLETE-example')
ws.geometry('400x300')
ws.config(bg='#f25252')
frame = Frame(ws, bg='#f25252')
frame.pack(expand=True)

Label(
    frame,
    bg='#f25252',
    font=('Times', 21),
    text='MY MESSAGES'
).pack()

entry = AutocompleteEntry(
    frame,
    width=30,
    font=('Times', 27),
    completevalues=greetings
)
entry.pack()

ws.mainloop()
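The GUI above relies on the widget's built-in prefix matching over the completevalues list. To illustrate the frequency-based prediction described in steps 2 to 4 of the algorithm, a small hedged sketch using a hypothetical toy corpus is given below:

from collections import Counter, defaultdict

# Hypothetical toy corpus; in practice this would be a much larger text collection
corpus = "the students like the college the students like the library"
tokens = corpus.lower().split()

# Count how often each word follows each preceding word (bigram counts)
following = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    following[prev][nxt] += 1

def suggest(prev_word, k=3):
    # Rank candidate continuations by frequency and return the top k
    return [word for word, _ in following[prev_word].most_common(k)]

print(suggest("the"))       # ['students', 'college', 'library']
print(suggest("students"))  # ['like']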

Viva Questions:
1.What is an NLP autocomplete program, and how does it function?
An NLP autocomplete program is a tool that suggests word or phrase completions based on partial input provided
by the user. It leverages natural language processing techniques to analyze the context of the input text and predict the
most likely completions. The program utilizes language models trained on large corpora to generate accurate
suggestions.

2.How does an NLP autocomplete program differ from traditional autocomplete systems?
Traditional autocomplete systems typically rely on simple prefix matching or frequency-based approaches to
suggest completions. In contrast, an NLP autocomplete program utilizes advanced NLP techniques such as language
modeling, contextual analysis, and machine learning algorithms to generate more accurate and contextually relevant
suggestions.

3. What are the main components of designing an NLP autocomplete program?
The main components of designing an NLP autocomplete program include:
 Data preprocessing: Tokenization, normalization, and filtering to prepare the input text data.
 Language modeling: Building a language model that captures the statistical properties of natural language to
predict next words or phrases.
 Prediction mechanism: Implementing algorithms or techniques to generate autocomplete suggestions based on the
input text and the language model.
 User interface: Designing an intuitive interface to interact with users and display autocomplete suggestions in
real-time.

4.Can you explain the importance of data quality and size in an NLP autocomplete program?
Data quality and size play a crucial role in the accuracy and effectiveness of an NLP autocomplete program.
High-quality and diverse training data enable the program to learn robust language patterns and generate more accurate
predictions. Additionally, a larger corpus provides a broader context for understanding and predicting user input,
resulting in more relevant autocomplete suggestions.

5.How do you evaluate the performance of an NLP autocomplete program?


The performance of an NLP autocomplete program can be evaluated using metrics such as accuracy, precision,
recall, and user satisfaction. Accuracy measures the correctness of autocomplete suggestions, while precision and recall
assess the relevance of suggestions compared to the user's actual input. User satisfaction can be gauged through user
feedback, surveys, or usability testing to determine the program's effectiveness in assisting users with their tasks.
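As a rough illustration of the accuracy measure mentioned above, held-out (context, next-word) pairs can be checked against the generated suggestions. The sketch below reuses the suggest() helper from the earlier bigram example and uses made-up test pairs purely for illustration:

# Hypothetical held-out (previous word, actual next word) pairs
test_pairs = [("the", "students"), ("students", "like"), ("the", "cafeteria")]

hits = sum(1 for context, actual in test_pairs if actual in suggest(context, k=3))
print("Top-3 hit rate: {:.2%}".format(hits / len(test_pairs)))  # 66.67% on these pairs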

RESULT:
Thus the program executed successfully.
