
V- SEM/III B.E. CSE Prepared By: R. Reshma/AP/CSE

CCS369- TEXT AND SPEECH ANALYSIS


LECTURE NOTES
UNIT I NATURAL LANGUAGE BASICS 6
Foundations of natural language processing – Language Syntax and Structure- Text Preprocessing and Wrangling – Text
tokenization – Stemming – Lemmatization – Removing stop-words – Feature Engineering for Text representation – Bag
of Words model- Bag of N-Grams model – TF-IDF model

INTRODUCTION
Artificial intelligence (AI) integration has revolutionized various industries, and now it is transforming the realm of
human behavior research. This integration marks a significant milestone in the data collection and analysis endeavors,
enabling users to unlock deeper insights from spoken language and empower researchers and analysts with enhanced
capabilities for understanding and interpreting human communication. Human interactions are a critical part of many
organizations. Many organizations analyze speech and text via natural language processing (NLP) and link the results to insights
and automation such as text classification, information extraction, and sentiment analysis.
In business intelligence, speech and text analytics enable us to gain insights into customer-agent conversations through
sentiment analysis, and topic trends. These insights highlight areas of improvement, recognition, and concern, to better
understand and serve customers and employees. Speech and text analytics is a set of features that uses natural language
processing (NLP) to provide automated analysis of 100% of interactions, giving deep insight into customer-agent
conversations. It includes transcribing voice interactions, analyzing customer sentiment, spotting topics, and creating
meaning from otherwise unstructured data.
FOUNDATIONS OF NATURAL LANGUAGE PROCESSING
Natural Language Processing (NLP) is the field concerned with enabling computers to understand and produce meaningful phrases
and sentences in natural language. Natural Language Processing comprises Natural Language Understanding (NLU) and Natural Language
Generation (NLG). NLU takes language input and maps it into a useful representation of its meaning, supporting tasks such as
information extraction, information retrieval, and sentiment analysis. NLG takes structured data or meaning representations and
produces natural language output. NLP can be thought of as an intersection of Linguistics, Computer Science and
Artificial Intelligence that helps computers understand, interpret and manipulate human language.

Fig. NLP Overview


Since its early days, there has been an immense amount of study and development in the field of Natural Language
Processing. Today NLP is one of the most in-demand and promising fields of Artificial Intelligence!
There are two main parts to Natural Language Processing:

1. Data Preprocessing
2. Algorithm Development


Applications:
• Machine Translation
• Information Retrieval
• Question Answering
• Dialogue Systems
• Information Extraction
• Summarization
• Sentiment Analysis
• ...

Core technologies:
• Language modeling
• Part-of-speech tagging
• Syntactic parsing
• Named-entity recognition
• Coreference resolution
• Word sense disambiguation
• Semantic Role Labelling
• ...

In Natural Language Processing, machine learning training algorithms study millions of examples of text — words,
sentences, and paragraphs — written by humans. By studying the samples, the training algorithms gain an understanding
of the “context” of human speech, writing, and other modes of communication. This training helps NLP software to
differentiate between the meanings of various texts. The five phases of NLP are lexical (or morphological) analysis, syntax
analysis (parsing), semantic analysis, discourse integration, and pragmatic analysis. Some well-known application areas of NLP are Optical
Character Recognition (OCR), Speech Recognition, Machine Translation, and Chatbots.

Fig: Five Phases of NLP


Phase I: Lexical or morphological analysis

The first phase of NLP is word structure analysis, which is referred to as lexical or morphological analysis. A lexicon is
a collection of words and phrases in a given language, and lexical analysis is the process of splitting text into its
components – paragraphs, phrases, words, or characters – based on the parameters the user sets.

Similarly, morphological analysis is the process of identifying the morphemes of a word. A morpheme is a basic unit of
language construction: a small element of a word that carries meaning. A morpheme can be either a free
morpheme (e.g. walk) or a bound morpheme (e.g. -ing, -ed); the difference between the two is that the latter
cannot stand on its own to produce a word with meaning and must be attached to a free morpheme to carry meaning.

In search engine optimization (SEO), lexical or morphological analysis helps guide web searching. For instance, when
doing on-page analysis, you can perform lexical and morphological analysis to understand how often the target keywords
are used in their core form (as free morphemes, or when in composition with bound morphemes). This type of analysis
can ensure that you have an accurate understanding of the different variations of the morphemes that are used.


Morphological analysis can also be applied in transcription and translation projects, so it can be very useful in content
repurposing projects, international SEO, and linguistic analysis.

Phase II: Syntax analysis (parsing)

Syntax analysis is the second phase of natural language processing. Syntax analysis or parsing is the process of checking
grammar and word arrangement – overall, the identification of relationships between words and whether those relationships
make sense. The process involves examining all words and phrases in a sentence and the structures between them.

As part of the process, a visualisation of the syntactic relationships is built, referred to as a syntax tree (similar to a
knowledge graph). This process checks that the structure, order, and grammar of sentences make sense, given
the words and phrases that make up those sentences. Syntax analysis also involves tagging words and phrases
with POS tags. There are two common approaches to constructing the syntax tree – top-down and
bottom-up; both check whether the input forms a valid sentence and reject it otherwise.

Syntax analysis can be beneficial for SEO in several ways:


 Programmatic SEO: Checking whether the produced content makes sense, especially when producing content at
scale using an automated or semi-automated approach.
 Semantic analysis: Once syntax analysis has been conducted, semantic analysis becomes easier, as does uncovering
the relationships between the different entities recognized in the content.

Phase III: Semantic analysis

Semantic analysis is the third stage in NLP, when an analysis is performed to understand the meaning in a statement. This
type of analysis is focused on uncovering the definitions of words, phrases, and sentences and identifying whether the way
words are organized in a sentence makes sense semantically.
This task is performed by mapping the syntactic structure, and checking for logic in the presented relationships between
entities, words, phrases, and sentences in the text. There are several important functions of semantic analysis that
allow for natural language understanding:
 To ensure that the data types are used in a way that’s consistent with their definition.
 To ensure that the flow of the text is consistent.
 Identification of synonyms, antonyms, homonyms, and other lexical items.
 Overall word sense disambiguation.
 Relationship extraction from the different entities identified from the text.
There are several things you can utilise semantic analysis for in SEO. Here are some examples:
 Topic modeling and classification – sort your page content into topics (predefined or modelled by an algorithm).
You can then use this for ML-enabled internal linking, where you link pages together on your website using the
identified topics. Topic modeling can also be used for classifying first-party collected data such as customer
service tickets, or feedback users left on your articles or videos in free form (i.e. comments).
 Entity analysis, sentiment analysis, and intent classification – You can use this type of analysis to perform
sentiment analysis and identify intent expressed in the content analysed. Entity identification and sentiment
analysis are separate tasks; both can be done on things like keywords, titles, meta descriptions, and page content,
but they work best when analysing data like comments, feedback forms, or customer service or social media
interactions. Intent classification can be done on user queries (in keyword research or traffic analysis), but can
also be done in analysis of customer service interactions.


Phase IV: Discourse integration


Discourse integration is the fourth phase in NLP, and simply means contextualisation. Discourse integration is the analysis
and identification of the larger context for any smaller part of natural language structure (e.g. a phrase, word or sentence).
During this phase, it’s important to ensure that each phrase, word, and entity is mentioned within the
appropriate context. This analysis involves considering not only sentence structure and semantics, but also sentence
combination and the meaning of the text as a whole. In other words, when analyzing the structure of a text, sentences are
broken up and analyzed, and are also considered in the context of the sentences that precede and follow them, and the
impact they have on the structure of the text. Some common tasks in this phase include: information extraction,
conversation analysis, text summarisation, and discourse analysis.
Here are some complexities of natural language understanding introduced during this phase:

 Understanding of the expressed motivations within the text, and its underlying meaning.

 Understanding of the relationships between entities and topics mentioned, thematic understanding, and
interactions analysis.

 Understanding the social and historical context of entities mentioned.

Discourse integration and analysis can be used in SEO to ensure that appropriate tense is used, that the relationships
expressed in the text make logical sense, and that there is overall coherency in the text analysed. This can be especially
useful for programmatic SEO initiatives or text generation at scale. The analysis can also be used as part of international
SEO localization, translation, or transcription tasks on big corpuses of data.
There are some research efforts to incorporate discourse analysis into systems that detect hate speech (or in the SEO space
for things like content and comment moderation), with this technology being aimed at uncovering intention behind text by
aligning the expression with meaning, derived from other texts. This means that, theoretically, discourse analysis can also
be used for modeling of user intent (e.g search intent or purchase intent) and detection of such notions in texts.
Phase V: Pragmatic analysis
Pragmatic analysis is the fifth and final phase of natural language processing. As the final stage, pragmatic analysis
extrapolates and incorporates the learnings from all other, preceding phases of NLP. Pragmatic analysis involves the
process of abstracting or extracting meaning from the use of language, and translating a text, using the gathered
knowledge from all other NLP steps performed beforehand.
Here are some complexities that are introduced during this phase
 Information extraction, enabling advanced text understanding functions such as question-answering.
 Meaning extraction, which allows for programs to break down definitions or documentation into a more
accessible language.
 Understanding of the meaning of the words, and context, in which they are used, which enables conversational
functions between machine and human (e.g. chatbots).
Pragmatic analysis has multiple applications in SEO. One of the most straightforward ones is programmatic SEO and
automated content generation. This type of analysis can also be used for generating FAQ sections on your product, using
textual analysis of product documentation, or even capitalizing on the ‘People Also Ask’ featured snippets by adding an
automatically-generated FAQ section for each page you produce on your site.
LANGUAGE SYNTAX AND STRUCTURE


For any language, syntax and structure usually go hand in hand: a set of specific rules, conventions, and principles
governs the way words are combined into phrases, phrases get combined into clauses, and clauses get combined into
sentences. We will be talking specifically about the English language syntax and structure in this section. In English,
words usually combine together to form other constituent units. These constituents include words, phrases, clauses, and
sentences. Considering the sentence, “The brown fox is quick and he is jumping over the lazy dog”, it is made of a bunch
of words, and just looking at the words by themselves doesn’t tell us much.

Fig. A bunch of unordered words don’t convey much information


Knowledge about the structure and syntax of the language is helpful in many areas like text processing, annotation, and
parsing for further operations such as text classification or summarization. Typical parsing techniques for understanding
text syntax are mentioned below.
 Parts of Speech (POS) Tagging
 Shallow Parsing or Chunking
 Constituency Parsing
 Dependency Parsing
We will be looking at all of these techniques in subsequent sections. Considering the previous example sentence “The
brown fox is quick and he is jumping over the lazy dog”, if we were to annotate it using basic POS tags, it would look
like the following figure.

Fig. POS tagging for a sentence


Thus, a sentence typically follows a hierarchical structure consisting of the following components:
sentence → clauses → phrases → words
Tagging Parts of Speech
Parts of speech (POS) are specific lexical categories to which words are assigned, based on their syntactic context and
role. Usually, words can fall into one of the following major categories.
 N(oun): This usually denotes words that depict some object or entity, which may be living or nonliving. Some
examples would be fox , dog , book , and so on. The POS tag symbol for nouns is N.
 V(erb): Verbs are words that are used to describe certain actions, states, or occurrences. There are a wide variety
of further subcategories, such as auxiliary, reflexive, and transitive verbs (and many more). Some typical
examples of verbs would be running , jumping , read , and write . The POS tag symbol for verbs is V.


 Adj(ective): Adjectives are words used to describe or qualify other words, typically nouns and noun phrases. The
phrase beautiful flower has the noun (N) flower which is described or qualified using the adjective (ADJ)
beautiful . The POS tag symbol for adjectives is ADJ .
 Adv(erb): Adverbs usually act as modifiers for other words including nouns, adjectives, verbs, or other adverbs.
The phrase very beautiful flower has the adverb (ADV) very , which modifies the adjective (ADJ) beautiful ,
indicating the degree to which the flower is beautiful. The POS tag symbol for adverbs is ADV.
Besides these four major categories of parts of speech , there are other categories that occur frequently in the English
language. These include pronouns, prepositions, interjections, conjunctions, determiners, and many others. Furthermore,
each POS tag like the noun (N) can be further subdivided into categories like singular nouns (NN), singular proper
nouns(NNP), and plural nouns (NNS).
The process of classifying and labeling POS tags for words is called parts of speech tagging or POS tagging. POS tags are
used to annotate words and depict their POS, which is really helpful to perform specific analysis, such as narrowing down
upon nouns and seeing which ones are the most prominent, word sense disambiguation, and grammar analysis.
Let us consider both nltk and spacy which usually use the Penn Treebank notation for POS tagging. NLTK and spaCy
are two of the most popular Natural Language Processing (NLP) tools available in Python. You can build chatbots,
automatic summarizers, and entity extraction engines with either of these libraries. While both can theoretically
accomplish any NLP task, each one excels in certain scenarios. The Penn Treebank, or PTB for short, is a dataset
maintained by the University of Pennsylvania.

# Assumes a pandas DataFrame `news_df` with 'full_text' and 'news_headline' columns and a
# custom normalize_corpus() helper (defined elsewhere in the accompanying pre-processing code).
import nltk
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')   # load spaCy's small English model

# create a basic pre-processed corpus, don't lowercase to get POS context
corpus = normalize_corpus(news_df['full_text'], text_lower_case=False,
                          text_lemmatization=False, special_char_removal=False)

# demo for POS tagging for a sample news headline
sentence = str(news_df.iloc[1].news_headline)
sentence_nlp = nlp(sentence)

# POS tagging with spaCy
spacy_pos_tagged = [(word, word.tag_, word.pos_) for word in sentence_nlp]
pd.DataFrame(spacy_pos_tagged, columns=['Word', 'POS tag', 'Tag type'])

# POS tagging with nltk (requires the 'averaged_perceptron_tagger' resource)
nltk_pos_tagged = nltk.pos_tag(sentence.split())
pd.DataFrame(nltk_pos_tagged, columns=['Word', 'POS tag'])

Fig. Python code & Output of POS tagging a news headline


We can see that each of these libraries treats tokens in its own way and assigns specific tags to them. Based on what we
see, spacy seems to be doing slightly better than nltk.
Shallow Parsing or Chunking
Based on the hierarchy we depicted earlier, groups of words make up phrases. There are five major categories of phrases:
 Noun phrase (NP): These are phrases where a noun acts as the head word. Noun phrases act as a subject or
object to a verb.
 Verb phrase (VP): These phrases are lexical units that have a verb acting as the head word. Usually, there are
two forms of verb phrases. One form has the verb components as well as other entities such as nouns, adjectives,
or adverbs as parts of the object.
 Adjective phrase (ADJP): These are phrases with an adjective as the head word. Their main role is to describe or
qualify nouns and pronouns in a sentence, and they will be either placed before or after the noun or pronoun.
 Adverb phrase (ADVP): These phrases act like adverbs since the adverb acts as the head word in the phrase.
Adverb phrases are used as modifiers for nouns, verbs, or adverbs themselves by providing further details that
describe or qualify them.
 Prepositional phrase (PP): These phrases usually contain a preposition as the head word and other lexical
components like nouns, pronouns, and so on. These act like an adjective or adverb describing other words or
phrases.
Shallow parsing, also known as light parsing or chunking, is a popular natural language processing technique of analyzing
the structure of a sentence to break it down into its smallest constituents (which are tokens such as words) and group them
together into higher-level phrases. The output includes the POS tags as well as the phrase chunks of a sentence.

Fig. An example of shallow parsing depicting higher level phrase annotations
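The following is a minimal sketch of shallow parsing with NLTK's RegexpParser, assuming the standard NLTK tokenizer and tagger resources have been downloaded; the chunk grammar here covers only simple noun phrases and is not the full grammar a production chunker would use.

import nltk

# the example sentence used throughout this section
sentence = "The brown fox is quick and he is jumping over the lazy dog"

# POS-tag the tokens first (requires the 'punkt' and 'averaged_perceptron_tagger' resources)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# a simple chunk grammar: a noun phrase (NP) is an optional determiner,
# followed by any number of adjectives, followed by a noun
grammar = "NP: {<DT>?<JJ>*<NN.*>}"
chunk_parser = nltk.RegexpParser(grammar)
print(chunk_parser.parse(tagged))
# prints a shallow parse tree with NP chunks such as (NP The/DT brown/JJ fox/NN)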

Constituency Parsing
Constituent-based grammars are used to analyze and determine the constituents of a sentence. These grammars can be
used to model or represent the internal structure of sentences in terms of a hierarchically ordered structure of their
constituents. Every word usually belongs to a specific lexical category and forms the head word of
different phrases. These phrases are formed based on rules called phrase structure rules.
Phrase structure rules form the core of constituency grammars, because they talk about syntax and rules that govern the
hierarchy and ordering of the various constituents in the sentences. These rules cater to two things primarily.
 They determine what words are used to construct the phrases or constituents.
 They determine how we need to order these constituents together.
The generic representation of a phrase structure rule is S → AB , which depicts that the structure S consists of
constituents A and B , and the ordering is A followed by B . While there are several rules (refer to Chapter 1, Page 19:


Text Analytics with Python, if you want to dive deeper), the most important rule describes how to divide a sentence or a
clause. The phrase structure rule denotes a binary division for a sentence or a clause as S → NP VP where S is the
sentence or clause, and it is divided into the subject, denoted by the noun phrase ( NP) and the predicate, denoted by the
verb phrase (VP).
A constituency parser can be built based on such grammars/rules, which are usually collectively available as context-free
grammar (CFG) or phrase-structured grammar. The parser will process input sentences according to these rules, and help
in building a parse tree.

Fig. An example of constituency parsing showing a nested hierarchical structure
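As an illustration of how phrase structure rules drive a constituency parser, here is a minimal sketch using NLTK's CFG and ChartParser; the toy grammar below is an assumption written only to cover the fragment “the brown fox is quick”, not a complete English grammar.

import nltk

# a toy context-free grammar with a handful of phrase structure rules
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det Adj N
VP -> V Adj
Det -> 'the'
Adj -> 'brown' | 'quick'
N -> 'fox'
V -> 'is'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the brown fox is quick".split()):
    print(tree)
# (S (NP (Det the) (Adj brown) (N fox)) (VP (V is) (Adj quick)))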

Dependency Parsing
In dependency parsing, we try to use dependency-based grammars to analyze and infer both structure and semantic
dependencies and relationships between tokens in a sentence. The basic principle behind a dependency grammar is that in
any sentence in the language, all words except one, have some relationship or dependency on other words in the sentence.
The word that has no dependency is called the root of the sentence. The verb is taken as the root of the sentence in most
cases. All the other words are directly or indirectly linked to the root verb using links, which are the dependencies.
Considering the sentence “The brown fox is quick and he is jumping over the lazy dog”, if we wanted to draw the
dependency syntax tree for this, we would have the structure


Fig. A dependency parse tree for a sentence


These dependency relationships each have their own meaning and are a part of a list of universal dependency types.
Some of the dependencies are as follows:
 The dependency tag det is pretty intuitive — it denotes the determiner relationship between a nominal head and the
determiner. Usually, the word with POS tag DET will also have the det dependency tag relation. Examples
include fox → the and dog → the.
 The dependency tag amod stands for adjectival modifier and stands for any adjective that modifies the meaning of
a noun. Examples include fox → brown and dog → lazy.
 The dependency tag nsubj stands for an entity that acts as a subject or agent in a clause. Examples
include is → fox and jumping → he.
 The dependencies cc and conj have more to do with linkages related to words connected by coordinating
conjunctions . Examples include is → and and is → jumping.
 The dependency tag aux indicates the auxiliary or secondary verb in the clause. Example: jumping → is.
 The dependency tag acomp stands for adjective complement and acts as the complement or object to a verb in the
sentence. Example: is → quick
 The dependency tag prep denotes a prepositional modifier, which usually modifies the meaning of a noun, verb,
adjective, or preposition. Usually, this representation is used for prepositions having a noun or noun phrase
complement. Example: jumping → over.
 The dependency tag pobj is used to denote the object of a preposition. This is usually the head of a noun phrase
following a preposition in the sentence. Example: over → dog.
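A minimal sketch of obtaining the dependency relations listed above with spaCy is shown below, assuming the small English model en_core_web_sm is installed; the exact labels can vary slightly between model versions.

import spacy

nlp = spacy.load("en_core_web_sm")    # assumes the small English model has been installed
doc = nlp("The brown fox is quick and he is jumping over the lazy dog")

# print each token with its dependency label and the head it attaches to
for token in doc:
    print(f"{token.text:<10} {token.dep_:<8} head: {token.head.text}")
# expected relations include det (the -> fox), amod (brown -> fox), nsubj (fox -> is),
# prep (over -> jumping) and pobj (dog -> over); exact labels may vary by model version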
TEXT PREPROCESSING OR WRANGLING
Text preprocessing or wrangling is the process of cleaning text data and making it ready to feed to a model. Text data
contains noise in various forms, such as emoticons, punctuation, and text in different cases. When we talk about human
language, there are many different ways to say the same thing, and that is not the only problem we have to deal with:
machines do not understand words, they need numbers, so we need to convert text to numbers in an efficient manner.


Techniques to perform text preprocessing or wrangling are as follows:


Contraction Mapping/ Expanding Contractions: Contractions are a shortened version of words or a group of words,
quite common in both spoken and written language. In English, they are quite common, such as I will to I’ll, I have to I’ve
, do not to don’t, etc. Mapping these contractions to their expanded form helps in text standardization.
Tokenization: Tokenization is the process of separating a piece of text into smaller units called tokens. Given a document,
tokens can be sentences, words, subwords, or even characters depending on the application.
Noise cleaning: Special characters and symbols contribute to extra noise in unstructured text. Using regular expressions
to remove them or using tokenizers, which do the pre-processing step of removing punctuation marks and other special
characters, is recommended.
Spell-checking: Documents in a corpus are prone to spelling errors; In order to make the text clean for the subsequent
processing, it is a good practice to run a spell checker and fix the spelling errors before moving on to the next steps.
Stopwords Removal: Stop words are those words which are very common and often less significant. Hence, removing
these is a pre-processing step as well. This can be done explicitly by retaining only those words in the document which are
not in the list of stop words or by specifying the stop word list as an argument in CountVectorizer or TfidfVectorizer
methods when getting Bag-of-Words(BoW)/TF-IDF scores for the corpus of text documents.
Stemming/Lemmatization: Both stemming and lemmatization are methods to reduce words to their base form. While
stemming follows certain rules to truncate words to their base form, often resulting in words that are not
lexicographically correct, lemmatization always results in base forms that are lexicographically correct. However,
stemming is a lot faster than lemmatization. Hence, whether to stem or lemmatize depends on whether the application needs
quick pre-processing or requires more accurate base forms. (A short sketch of contraction expansion and noise cleaning follows this list.)
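Below is a minimal sketch of two of these steps, contraction expansion and noise cleaning, using plain regular expressions; the tiny CONTRACTION_MAP dictionary is a hypothetical example, and a real pipeline would use a much larger map or a dedicated library.

import re

# a tiny, purely illustrative contraction map; a real pipeline would use a much larger dictionary
CONTRACTION_MAP = {"i'll": "i will", "i've": "i have", "don't": "do not", "can't": "cannot"}

def expand_contractions(text, contraction_map=CONTRACTION_MAP):
    # replace every known contraction with its expanded form (case-insensitive)
    for contraction, expanded in contraction_map.items():
        text = re.sub(re.escape(contraction), expanded, text, flags=re.IGNORECASE)
    return text

def remove_special_characters(text):
    # noise cleaning: keep only letters, digits, and whitespace
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)

text = "I'll admit, I don't like noisy text!!!"
print(remove_special_characters(expand_contractions(text)))
# -> i will admit i do not like noisy text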
TOKENIZATION

Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP
methods like Count Vectorizer and Advanced Deep Learning-based architectures like Transformers. As tokens are the
building blocks of Natural Language, the most common way of processing the raw text happens at the token level.
Tokens are the building blocks of Natural Language.
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words,
characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-
gram characters) tokenization.

For example, consider the sentence: “Never give up”.

The most common way of forming tokens is based on space. Assuming space as a delimiter, the tokenization of the
sentence results in 3 tokens – Never-give-up. As each token is a word, it becomes an example of Word tokenization.
Similarly, tokens can be either characters or sub-words. For example, let us consider “smarter”:

1. Character tokens: s-m-a-r-t-e-r


2. Sub-word tokens: smart-er

Here, tokenization is performed on the corpus to obtain tokens. These tokens are then used to prepare a
vocabulary. Vocabulary refers to the set of unique tokens in the corpus. Remember that a vocabulary can be constructed by
considering each unique token in the corpus or by considering the top K frequently occurring words.
Creating Vocabulary is the ultimate goal of Tokenization.
One of the simplest hacks to boost the performance of the NLP model is to create a vocabulary out of top K frequently
occurring words.
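A minimal sketch of the three tokenization levels and of building a vocabulary from a toy corpus is shown below; the subword split is hand-picked purely for illustration, since real subword splits are learned by algorithms such as BPE (discussed later).

sentence = "Never give up"

# word tokens: split on whitespace (the simplest word tokenizer)
word_tokens = sentence.split()                      # ['Never', 'give', 'up']

# character tokens for a single word
char_tokens = list("smarter")                       # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# subword tokens: shown here as a hand-picked split; in practice an algorithm
# such as Byte Pair Encoding learns these splits from the corpus
subword_tokens = ["smart", "er"]

# vocabulary: the set of unique word tokens seen in a (toy) corpus
corpus = ["Never give up", "never stop learning"]
vocabulary = sorted({token for text in corpus for token in text.split()})
print(word_tokens)
print(char_tokens)
print(subword_tokens)
print(vocabulary)                                   # ['Never', 'give', 'learning', 'never', 'stop', 'up']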


Now, let’s understand the usage of the vocabulary in Traditional and Advanced Deep Learning-based NLP methods.
Traditional NLP approaches such as Count Vectorizer and TF-IDF use vocabulary as features. Each word in the
vocabulary is treated as a unique feature:

Fig. Traditional NLP: Count Vectorizer


• In Advanced Deep Learning-based NLP architectures, vocabulary is used to create the tokenized input sentences.
Finally, the tokens of these sentences are passed as inputs to the model
As discussed earlier, tokenization can be performed on word, character, or subword level. It’s a common question – which
Tokenization should we use while solving an NLP task? Let’s address this question here.
Word Tokenization
Word tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based
on a certain delimiter. Depending upon the delimiter, different word-level tokens are formed. Pretrained word embeddings
such as Word2Vec and GloVe operate at the word-token level.
Drawbacks of Word Tokenization
One of the major issues with word tokens is dealing with Out Of Vocabulary (OOV) words. OOV words refer to the new
words which are encountered at testing time. These new words do not exist in the vocabulary. Hence, these methods fail in
handling Out-of-Vocabulary (OOV) words.
• A small trick can rescue word tokenizers from OOV words. The trick is to form the vocabulary with the Top K
Frequent Words and replace the rare words in training data with unknown tokens (UNK). This helps the model to learn the
representation of OOV words in terms of UNK tokens
• So, during test time, any word that is not present in the vocabulary will be mapped to a UNK token. This is how
we can tackle the problem of OOV in word tokenizers.
• The problem with this approach is that the entire information of the word is lost as we are mapping OOV to UNK
tokens. The structure of the word might be helpful in representing the word accurately. And another issue is that every
OOV word gets the same representation
Another issue with word tokens is connected to the size of the vocabulary. Generally, pre-trained models are trained on a
large volume of the text corpus. So, just imagine building the vocabulary with all the unique words in such a large corpus.
This explodes the vocabulary!
Character Tokenization
Character tokenization splits a piece of text into a set of characters. It overcomes the drawbacks we saw above with word
tokenization.
• Character Tokenizers handles OOV words coherently by preserving the information of the word. It breaks down
the OOV word into characters and represents the word in terms of these characters


• It also limits the size of the vocabulary. Want to take a guess at the size of the vocabulary? Around 26 for lowercase
English letters, since the vocabulary only needs to contain the unique set of characters (slightly more once digits, punctuation, and casing are included)
Drawbacks of Character Tokenization
Character tokens solve the OOV problem but the length of the input and output sentences increases rapidly as we are
representing a sentence as a sequence of characters. As a result, it becomes challenging to learn the relationship between
the characters to form meaningful words. This brings us to another tokenization known as Subword Tokenization which is
in between a Word and Character tokenization.
Subword Tokenization
Subword Tokenization splits the piece of text into subwords (or n-gram characters). For example, words like lower can be
segmented as low-er, smartest as smart-est, and so on.
Transformer-based models – the SOTA in NLP – rely on subword tokenization algorithms for preparing vocabulary.
Now, we will discuss one of the most popular subword tokenization algorithms, known as Byte Pair Encoding (BPE).
Byte Pair Encoding (BPE)
Byte Pair Encoding (BPE) is a widely used tokenization method among transformer-based models. BPE addresses the
issues of Word and Character Tokenizers:
• BPE tackles OOV effectively. It segments OOV as subwords and represents the word in terms of these subwords
• The length of input and output sentences after BPE are shorter compared to character tokenization
BPE is a word segmentation algorithm that merges the most frequently occurring character or character sequences
iteratively. Here is a step by step guide to learn BPE.
Steps to learn BPE
1. Split the words in the corpus into characters after appending </w>
2. Initialize the vocabulary with unique characters in the corpus
3. Compute the frequency of a pair of characters or character sequences in corpus
4. Merge the most frequent pair in corpus
5. Save the best pair to the vocabulary
6. Repeat steps 3 to 5 for a certain number of iterations

We will understand the steps with an example.


Consider a corpus:

1a) Append the end of the word (say </w>) symbol to every word in the corpus:


1b) Tokenize words in a corpus into characters:

2. Initialize the vocabulary:

Iteration 1:
3. Compute frequency:

4. Merge the most frequent pair:


5. Save the best pair:

Repeat steps 3-5 for every iteration from now. Let me illustrate for one more iteration.
Iteration 2:
3. Compute frequency:

4. Merge the most frequent pair:

5. Save the best pair:

After 10 iterations, the BPE merge operations look like:


Applying BPE to OOV words


Here is a step-by-step procedure for representing OOV words:
1. Split the OOV word into characters after appending </w>
2. Compute pair of characters or character sequences in a word
3. Select the pairs present in the learned operations
4. Merge the most frequent pair
5. Repeat steps 2–4 until no further merging is possible
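The following is a minimal sketch of the BPE learning loop described in the steps above, run on a small toy corpus (an assumption for illustration; it is not the corpus from the figures). It learns merge operations by repeatedly merging the most frequent adjacent symbol pair.

import re
from collections import Counter

def get_pair_stats(vocab):
    # count how often each adjacent symbol pair occurs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # merge every occurrence of the chosen pair into a single symbol,
    # matching whole symbols only (not parts of longer symbols)
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# toy corpus: each word is split into characters with the end-of-word marker </w>
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):                       # number of merge operations to learn
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)      # most frequent pair (steps 3 and 4)
    vocab = merge_pair(best, vocab)
    merges.append(best)                   # step 5: save the best pair

print(merges)   # learned merges, e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ...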

STEMMING
Stemming is the process of reducing the morphological variants of a word to their root/base form. Stemming programs are commonly
referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words “chocolates”, “chocolatey”,
“choco” to the root word “chocolate”, and “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”. Stemming is
an important part of the pipelining process in natural language processing. The input to the stemmer is tokenized words.
How do we get these tokenized words? Well, tokenization involves breaking down the document into different words.
Stemming is a natural language processing technique that is used to reduce words to their base form, also known as the
root form. The process of stemming is used to normalize text and make it easier to process. It is an important step in text
pre-processing, and it is commonly used in information retrieval and text mining applications. There are several different
algorithms for stemming as follows:
 Porter stemmer
 Snowball stemmer
 Lancaster stemmer.
The Porter stemmer is the most widely used algorithm, and it is based on a set of heuristics that are used to remove
common suffixes from words. The Snowball stemmer is a more advanced algorithm that is based on the Porter stemmer,
but it also supports several other languages in addition to English. The Lancaster stemmer is a more aggressive stemmer
and it is less accurate than the Porter stemmer and Snowball stemmer.

Stemming can be useful for several natural language processing tasks such as text classification, information retrieval, and
text summarization. However, stemming can also have some negative effects such as reducing the readability of the text,
and it may not always produce the correct root form of a word. It is important to note that stemming is different from
Lemmatization. Lemmatization is the process of reducing a word to its base form, but unlike stemming, it takes into
account the context of the word, and it produces a valid word, unlike stemming which can produce a non-word as the root
form.
Some more examples stemming from the root word "like" include:

->"likes"
->"liked"
->"likely"
->"liking"

Errors in Stemming:


There are mainly two errors in stemming –


 over-stemming
 under-stemming
Over-stemming occurs when two words that should have different stems are reduced to the same root. Over-stemming can
also be regarded as a false positive. Over-stemming is a problem that can occur when using stemming algorithms in natural
language processing. It refers to the situation where a stemmer produces a root form that is not a valid word or is not the
correct root form of a word. This can happen when the stemmer is too aggressive in removing suffixes or when it does not
consider the context of the word.
Over-stemming can lead to a loss of meaning and make the text less readable. For example, the word “arguing” may be
stemmed to “argu,” which is not a valid word and does not convey the same meaning as the original word. Similarly, the
word “running” may be stemmed to “run,” which is the base form of the word but it does not convey the meaning of the
original word.
To avoid over-stemming, it is important to use a stemmer that is appropriate for the task and language. It is also important
to test the stemmer on a sample of text to ensure that it is producing valid root forms. In some cases, using a lemmatizer
instead of a stemmer may be a better solution as it takes into account the context of the word, making it less prone to
errors. Another approach to this problem is to use techniques like semantic role labeling, sentiment analysis, context-based
information, etc. that help to understand the context of the text and make the stemming process more precise.
Under-stemming occurs when two words that should be reduced to the same root are instead stemmed to different roots.
Under-stemming can be interpreted as a false negative. Under-stemming is a problem that can occur when using stemming
algorithms in natural language processing. It refers to the situation where a stemmer does not reduce a word far enough, so
related words are not mapped to the same base form. This can happen when the stemmer is not aggressive enough in
removing suffixes or when it is not designed for the specific task or language.
Under-stemming can lead to a loss of information and make it more difficult to analyze text. For example, the words
“data” and “datum” may be stemmed to “dat” and “datu” respectively, so the relationship between the two is lost.
Similarly, “running” may be reduced to “run” while “runner” is left unchanged, so the two related words are not grouped
together.
To avoid under-stemming, it is important to use a stemmer that is appropriate for the task and language. It is also
important to test the stemmer on a sample of text to ensure that it is producing the correct root forms. In some cases, using
a lemmatizer instead of a stemmer may be a better solution as it takes into account the context of the word, making it less
prone to errors. Another approach to this problem is to use techniques like semantic role labeling, sentiment analysis,
context-based information, etc. that help to understand the context of the text and make the stemming process more
precise.

Applications of stemming:
Stemming is used in information retrieval systems like search engines. It is used to determine domain vocabularies in
domain analysis, to index documents for search, and to map documents to common subjects. Sentiment analysis, which
examines reviews and comments made by different users about anything, is frequently used for product analysis, such as
for online retail stores; stemming is applied there as a text-preparation step before the text is interpreted.
A method of group analysis used on textual materials is called document clustering (also known as text clustering).
Important uses of it include subject extraction, automatic document structuring, and quick information retrieval.
Fun Fact: Google search adopted a word stemming in 2003. Previously a search for “fish” would not have returned
“fishing” or “fishes”.


Some Stemming algorithms are:


Porter’s Stemmer algorithm
It is one of the most popular stemming methods proposed in 1980. It is based on the idea that the suffixes in the English
language are made up of a combination of smaller and simpler suffixes. This stemmer is known for its speed and
simplicity. The main applications of the Porter stemmer include data mining and information retrieval. However, its
application is limited to English words. Also, a group of words may be mapped onto the same stem, and the output
stem is not necessarily a meaningful word. Its rules are fairly lengthy, and it is one of the oldest stemmers in wide use.
Example: EED -> EE means “if the word has at least one vowel and consonant plus EED ending, change the
ending to EE”, as ‘agreed’ becomes ‘agree’.
Advantage: It produces the best output as compared to other stemmers and it has less error rate.
Limitation: Morphological variants produced are not always real words.
Lovins Stemmer
It was proposed by Lovins in 1968. It removes the longest suffix from a word, then the stem is recoded to convert it into a
valid word.
Example: sitting -> sitt -> sit
Advantage: It is fast and handles irregular plurals like 'teeth' and 'tooth', etc.
Limitation: It requires a large amount of data (a long suffix list) and frequently fails to form valid words from the stem.
Dawson Stemmer
It is an extension of Lovins stemmer in which suffixes are stored in the reversed order indexed by their length and last
letter.
Advantage: It is fast in execution and covers more suffixes.
Limitation: It is very complex to implement.
Krovetz Stemmer
It was proposed in 1993 by Robert Krovetz. Following are the steps:
1) Convert the plural form of a word to its singular form.
2) Convert the past tense of a word to its present tense and remove the suffix ‘ing’.
Example: ‘children’ -> ‘child’
Advantage: It is light in nature and can be used as pre-stemmer for other stemmers.
Limitation: It is inefficient in case of large documents.
Xerox Stemmer
Example:
‘children’ -> ‘child’
‘understood’ -> ‘understand’
‘whom’ -> ‘who’
‘best’ -> ‘good’


N-Gram Stemmer
An n-gram is a set of n consecutive characters extracted from a word in which similar words will have a high proportion
of n-grams in common.
Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S*
Advantage: It is based on string comparisons and it is language independent.
Limitation: It requires space to create and index the n-grams and it is not time efficient.
Snowball Stemmer:
When compared to the Porter stemmer, the Snowball stemmer can also handle non-English words. Since it supports other
languages, the Snowball stemmer can be called a multi-lingual stemmer. The Snowball stemmers are also imported from
the nltk package. This stemmer is based on a programming language called ‘Snowball’ that processes small strings, and it
is one of the most widely used stemmers. The Snowball stemmer is somewhat more aggressive than the Porter stemmer and
is also referred to as the Porter2 stemmer. Because of the improvements added when compared to the Porter stemmer, the
Snowball stemmer has greater computational speed.
Lancaster Stemmer:
The Lancaster stemmer is more aggressive and dynamic compared to the other two stemmers. It is
faster, but the algorithm can be confusing when dealing with small words, and it is not as efficient as the Snowball
stemmer. The Lancaster stemmer saves its rules externally and basically uses an iterative algorithm. Lancaster stemming
is straightforward, although it often produces results with excessive stemming; over-stemming renders stems non-
linguistic or meaningless.
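A minimal sketch comparing the three stemmers using NLTK's implementations is shown below; the exact stems may differ slightly between NLTK versions.

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")      # Snowball also supports several other languages
lancaster = LancasterStemmer()

words = ["chocolates", "retrieval", "running", "arguing", "likely"]
for word in words:
    print(f"{word:>12}  porter={porter.stem(word):<10} "
          f"snowball={snowball.stem(word):<10} lancaster={lancaster.stem(word)}")
# 'running' typically reduces to 'run', while 'arguing' becomes the non-word stem 'argu';
# Lancaster is usually the most aggressive of the three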
LEMMATIZATION
Lemmatization is a text pre-processing technique used in natural language processing (NLP) models to break a word
down to its root meaning to identify similarities. For example, a lemmatization algorithm would reduce the word better to
its root word, or lemma, good. In stemming, a part of the word is just chopped off at the tail end to arrive at the stem of
the word. There are different algorithms used to find out how many characters have to be chopped off, but the algorithms
don’t actually know the meaning of the word in the language it belongs to. In lemmatization, the algorithms do have this
knowledge. In fact, you can even say that these algorithms refer to a dictionary to understand the meaning of the word
before reducing it to its root word, or lemma. So, a lemmatization algorithm would know that the word better is
derived from the word good, and hence, the lemma is good. But a stemming algorithm wouldn’t be able to do
the same. There could be over-stemming or under-stemming, and the word better could be reduced to either bet,
or bett, or just retained as better. But there is no way in stemming that can reduce better to its root word good.
This is the difference between stemming and lemmatization.

Fig. Stemming vs Lemmatization


Lemmatization gives more context to chatbot conversations as it recognizes words based on their exact and contextual
meaning. On the other hand, lemmatization is a time-consuming and slow process. The obvious advantage of
lemmatization is that it is more accurate than stemming. So, if you’re dealing with an NLP application such as a chat bot
or a virtual assistant, where understanding the meaning of the dialogue is crucial, lemmatization would be useful. But this
accuracy comes at a cost. Because lemmatization involves deriving the meaning of a word from something like a
dictionary, it’s very time-consuming. So, most lemmatization algorithms are slower compared to their stemming
counterparts. There is also a computation overhead for lemmatization, however, in most machine-learning problems,
computational resources are rarely a cause of concern.
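A minimal sketch contrasting lemmatization and stemming with NLTK is shown below, assuming the WordNet corpus has been downloaded; note that the WordNet lemmatizer needs a POS hint to map better to good.

from nltk.stem import WordNetLemmatizer, PorterStemmer

# requires the WordNet corpus: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# the lemmatizer needs a POS hint: 'a' = adjective, 'v' = verb, 'n' = noun (default)
print(lemmatizer.lemmatize("better", pos="a"))    # -> good
print(lemmatizer.lemmatize("running", pos="v"))   # -> run
print(stemmer.stem("better"))                     # -> better (stemming cannot recover 'good')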
REMOVING STOP-WORDS
The words which are generally filtered out before processing natural language are called stop words. These are actually
the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc.) and do not add much
information to the text. Examples of a few stop words in English are “the”, “a”, “an”, “so”, and “what”. Stop words are
available in abundance in any human language. By removing these words, we remove low-level information from our
text in order to give more focus to the important information. In other words, we can say that the removal of such words
does not have any negative consequences on the model we train for our task.
Removal of stop words definitely reduces the dataset size and thus reduces the training time due to the fewer number of
tokens involved in the training.
We do not always remove the stop words. The removal of stop words is highly dependent on the task we are performing
and the goal we want to achieve. For example, if we are training a model that can perform the sentiment analysis task, we
might not remove the stop words.
Movie review: “The movie was not good at all.”
Text after removal of stop words: “movie good”
We can clearly see that the review for the movie was negative. However, after the removal of stop words, the review
became positive, which is not the reality. Thus, the removal of stop words can be problematic here.
Tasks like text classification do not generally need stop words as the other words present in the dataset are more important
and give the general idea of the text. So, we generally remove stop words in such tasks.
In a nutshell, NLP has a lot of tasks that cannot be accomplished properly after the removal of stop words. So, think
before performing this step. The catch here is that no rule is universal and no stop word list is universal. A list not
conveying any important information for one task can convey a lot of information for another task.
Word of caution: Before removing stop words, research a bit about your task and the problem you are trying to solve,
and then make your decision.

Fig. Stop Word Removal


Next comes a very important question: why should we remove stop words from the text? There are two main
reasons for that:
1. They provide no meaningful information, especially if we are building a text classification model. Therefore,
we have to remove stop words from our dataset.


2. As the frequency of stop words is very high, removing them from the corpus results in much smaller data in
terms of size. The reduced size results in faster computations on text data, and the text classification model needs to
deal with a smaller number of features, resulting in a more robust model.
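A minimal sketch of stop-word removal with NLTK's built-in English stop word list, applied to the movie-review example above (assuming the stopwords corpus has been downloaded):

from nltk.corpus import stopwords

# requires: nltk.download('stopwords')
stop_words = set(stopwords.words("english"))

text = "The movie was not good at all"
tokens = text.lower().split()
filtered = [token for token in tokens if token not in stop_words]
print(filtered)
# -> ['movie', 'good']; note that 'not' is in NLTK's default list, which is exactly why
# sentiment-analysis tasks often keep stop words or use a customized list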

FEATURE ENGINEERING FOR TEXT REPRESENTATION


Feature engineering is one of the most important steps in machine learning. It is the process of using domain knowledge
of the data to create features that make machine learning algorithms work. Think of a machine learning algorithm as a
learning child: the more accurate the information you provide, the better it will be able to interpret that information.
Focusing first on our data will give us better results than focusing only on models. Feature engineering helps us to create
better data, which helps the model understand it well and provide reasonable results.
NLP is a subfield of artificial intelligence in which we study human interaction with machines using natural languages.
To understand a natural language, you need to understand how we write a sentence and how we express our thoughts using
different words, signs, and special characters; basically, we should understand the context of a sentence to interpret its
meaning.
Extracting Features from Text
In this section, we will learn about common feature extraction techniques and methods. We’ll also talk about when to use
them and some challenges we might face implementing those techniques. Feature extraction methods can be divided into
3 major categories, basic, statistical, and advanced/vectorized.
Basic Methods
These feature extraction methods are based on various concepts from NLP and linguistics. They are some of the oldest
methods, but they can still be very reliable and are used frequently in many areas.
 Parsing
 PoS Tagging
 Name Entity Recognition (NER)
 Bag of Words (BoW)
Statistical Methods
These are somewhat more advanced feature extraction methods that use concepts from statistics and probability to extract
features from text data.
 Term Frequency-Inverse Document Frequency (TF-IDF)

Advanced Methods
These methods can also be called vectorized methods, as they aim to map a word, sentence, or document to a fixed-length
vector of real numbers. The goal of these methods is to extract semantics from a piece of text, both lexical and
distributional. Lexical semantics is the meaning reflected by the words themselves, whereas distributional semantics refers
to finding meaning based on how words are distributed in a corpus.
 Word2Vec
 GloVe: Global Vector for word representation


Fig. Word2Vec vs GloVe
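A minimal sketch of training word vectors on a toy corpus with gensim's Word2Vec is shown below; the corpus and hyperparameters are illustrative assumptions, and the gensim 4.x API (vector_size) is assumed (older versions use size).

from gensim.models import Word2Vec

# a tiny toy corpus of tokenized sentences; real embeddings need far more data
sentences = [
    ["the", "brown", "fox", "is", "quick"],
    ["the", "lazy", "dog", "is", "sleeping"],
    ["the", "fox", "jumps", "over", "the", "dog"],
]

# vector_size = embedding dimension, window = context size, sg=1 selects skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["fox"].shape)                 # a 50-dimensional dense vector for 'fox'
print(model.wv.most_similar("fox", topn=2))  # nearest neighbours (not meaningful on a toy corpus)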
BAG OF WORDS MODEL
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval
(IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding
grammar and even word order but keeping multiplicity. A bag-of-words model, or BoW for short, is a way of extracting
features from text for use in modelling, such as with machine learning algorithms.
The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two
things:
1. A vocabulary of known words.
2. A measure of the presence of known words.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded.
The model is only concerned with whether known words occur in the document, not where in the document. The intuition
is that documents are similar if they have similar content. Further, that from the content alone we can learn something
about the meaning of the document. The bag-of-words can be as simple or complex as you like. The complexity comes
both in deciding how to design the vocabulary of known words (or tokens) and how to score the presence of known
words.
One of the biggest problems with text is that it is messy and unstructured, and machine learning algorithms prefer
structured, well defined fixed-length inputs and by using the Bag-of-Words technique we can convert variable-length texts
into a fixed-length vector.

Also, at a much granular level, the machine learning models work with numerical data rather than textual data. So to be
more specific, by using the bag-of-words (BoW) technique, we convert a text into its equivalent vector of numbers.


Let us see an example of how the bag of words technique converts text into vectors
Example (1) without preprocessing:
Sentence 1: “Welcome to Great Learning, Now start learning”
Sentence 2: “Learning is a good practice”

Tokens of Sentence 1: Welcome, to, Great, Learning, ',', Now, start, learning

Tokens of Sentence 2: Learning, is, a, good, practice

Step 1: Go through all the words in the above text and make a list of all of the words in the model vocabulary.
 Welcome
 To
 Great
 Learning
 ,
 Now
 start
 learning
 is
 a
 good
 practice
Note that the words ‘Learning’ and ‘learning’ are not the same here because of the difference in their cases, and hence both
appear in the list. Also note that the comma ‘,’ is taken into the list. Because the vocabulary has 12 words, we can use
a fixed-length document representation of 12, with one position in the vector to score each word.

The scoring method we use here is to count the presence of each word and mark 0 for absence. This scoring method is
used more generally.
The scoring of sentence 1 would look as follows:
Word Frequency

Welcome 1

to 1

Great 1

Learning 1


, 1

Now 1

start 1

learning 1

is 0

a 0

good 0

practice 0
Writing the above frequencies in the vector form
Sentence 1 ➝ [ 1,1,1,1,1,1,1,1,0,0,0,0 ]
Now for sentence 2, the scoring would like:

Word Frequency

Welcome 0

to 0

Great 0

Learning 1

, 0

Now 0

start 0

learning 0

is 1

a 1

good 1

practice 1
Similarly, writing the above frequencies in the vector form
Sentence 2 ➝ [ 0,0,0,1,0,0,0,0,1,1,1,1 ]

Sentence    Welcome  to  Great  Learning  ,  Now  start  learning  is  a  good  practice

Sentence1      1      1    1       1      1   1     1       1      0   0    0      0

Sentence2      0      0    0       1      0   0     0       0      1   1    1      1


But is this the best way to perform a bag of words? The above example was not the best way to use a
bag of words. The words ‘Learning’ and ‘learning’, although having the same meaning, are counted separately. Also, a
comma ‘,’, which does not convey any information, is included in the vocabulary.

Let us make some changes and see how we can use bag of words in a more effective way.

Example(2) with preprocessing:

Sentence 1: ”Welcome to Great Learning, Now start learning”

Sentence 2: “Learning is a good practice”

Step 1: Convert the above sentences in lower case as the case of the word does not hold any information.

Step 2: Remove special characters and stopwords from the text. Stopwords are words that do not contain much
information about the text, such as ‘is’, ‘a’, ‘the’, and many more.

After applying the above steps, the sentences are changed to

Sentence 1: ”welcome great learning now start learning”

Sentence 2: “learning good practice”

Although the above sentences do not make much sense, the maximum information is contained in these words only.

Step 3: Go through all the words in the above text and make a list of all of the words in our model vocabulary.
 welcome
 great
 learning
 now
 start
 good
 practice
Now as the vocabulary has only 7 words, we can use a fixed-length document-representation of 7, with one position in the
vector to score each word.

The scoring method we use here is the same as the one used in the previous example. For sentence 1, the count of words is as
follows:

Word Frequency

welcome 1

great 1

learning 2


now 1

start 1

good 0

practice 0
Writing the above frequencies in the vector

Sentence 1 ➝ [ 1,1,2,1,1,0,0 ]

Now for sentence 2, the scoring would be like

Word Frequency

welcome 0

great 0

learning 1

now 0

start 0

good 1

practice 1
Similarly, writing the above frequencies in the vector form

Sentence 2 ➝ [ 0,0,1,0,0,1,1 ]

Sentence welcome great learning now start good practice

Sentence1 1 1 2 1 1 0 0

Sentence2 0 0 1 0 0 1 1

The approach used in example two is the one that is generally used in the Bag-of-Words technique, the reason being that
the datasets used in machine learning are tremendously large and can contain a vocabulary of a few thousand or even
millions of words. Hence, preprocessing the text before using bag-of-words is a better way to go. There are various
preprocessing steps, such as lowercasing, stop-word removal, stemming and lemmatization, that can further increase the
performance of Bag-of-Words.

In the examples above, we used all the words from the vocabulary to form a vector, which is neither a practical nor the best
way to implement the BoW model. In practice, only a few words from the vocabulary, preferably the most common
words, are used to form the vector.
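
As a concrete illustration, the preprocessed example above can be reproduced with scikit-learn's CountVectorizer (a minimal sketch, assuming scikit-learn is available; note that the vectorizer sorts its vocabulary alphabetically, so the column order differs from the hand-worked table above):

# Sketch: Bag-of-Words vectors for the two preprocessed sentences (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "welcome great learning now start learning",   # sentence 1 (preprocessed)
    "learning good practice",                      # sentence 2 (preprocessed)
]

vectorizer = CountVectorizer()            # builds the vocabulary and counts terms
bow = vectorizer.fit_transform(corpus)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())
# ['good' 'great' 'learning' 'now' 'practice' 'start' 'welcome']
print(bow.toarray())
# [[0 1 2 1 0 1 1]
#  [1 0 1 0 1 0 0]]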


Limitations of Bag-of-Words
The bag-of-words model is very simple to understand and implement and offers a lot of flexibility for customization on
your specific text data. It has been used with great success on prediction problems like language modeling and
documentation classification.

Nevertheless, it suffers from some shortcomings, such as:


 Vocabulary: The vocabulary requires careful design, most specifically in order to manage the size, which impacts
the sparsity of the document representations.
 Sparsity: Sparse representations are harder to model both for computational reasons (space and time complexity)
and also for information reasons, where the challenge is for the models to harness so little information in such a
large representational space.
 Meaning: Discarding word order ignores the context, and in turn meaning of words in the document (semantics).
Context and meaning can offer a lot to the model, that if modeled could tell the difference between the same
words differently arranged (“this is interesting” vs “is this interesting”), synonyms (“old bike” vs “used bike”),
and much more.

BAG OF N-GRAMS MODEL

A bag-of-n-grams model is a way to represent a document, similar to a bag-of-words model. A
bag-of-n-grams model represents a text document as an unordered collection of its n-grams.

For example, let’s use the following phrase and divide it into bi-grams (n=2).
“James is the best person ever.”

becomes

 <start>James
 James is
 is the
 the best
 best person
 person ever.
 ever.<end>

In a typical bag-of-n-grams model, these 7 bigrams would be a sample from a large number of bigrams observed in a
corpus. Then “James is the best person ever.” would be encoded in a representation showing which of the corpus’s
bigrams were observed in the sentence. A bag-of-n-grams model has the simplicity of the bag-of-words model but allows
the preservation of more word locality information.
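
A short sketch of how a bag-of-bigrams representation can be obtained with scikit-learn's CountVectorizer (assumed available); note that CountVectorizer does not add <start>/<end> padding tokens and strips punctuation, so only the interior bigrams are produced:

# Sketch: bag-of-bigrams with scikit-learn's CountVectorizer (ngram_range=(2, 2)).
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["James is the best person ever."]

vectorizer = CountVectorizer(ngram_range=(2, 2))  # bigrams only
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['best person' 'is the' 'james is' 'person ever' 'the best']
print(counts.toarray())   # [[1 1 1 1 1]]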

TF-IDF MODEL

TF-IDF stands for Term Frequency-Inverse Document Frequency. It can be defined as the calculation of how
relevant a word in a series or corpus is to a text. A word's importance increases proportionally to the number of times it
appears in the text, but this is offset by the word's frequency in the corpus (data set).

Terminologies:

 Term Frequency: In a document d, the term frequency tf(t, d) represents the number of instances of a given term t. Therefore,
we can see that a term becomes more relevant the more often it appears in the text, which is rational. Since the ordering of
terms is not significant, we can use a vector to describe the text in the bag-of-terms model. For each specific term
in the document, there is an entry with the value being the term frequency.


The weight of a term that occurs in a document is simply proportional to the term frequency. The normalized term frequency is computed as tf(t, d) = (number of times t appears in d) / (total number of terms in d).

 Document Frequency: This measures the importance of a term, in a way very similar to TF, but across the whole corpus
collection. The only difference is that in a document d, tf is the frequency counter for a term t, while df is the
number of documents in the document set N in which the term t occurs. In other words, DF is the number of documents in which the
word is present.

 Inverse Document Frequency: Mainly, it tests how relevant the word is. The key aim of a search is to locate
the appropriate records that fit the query. Since tf considers all terms equally significant, term frequencies alone
cannot be used to measure the weight of a term in a document. First, find the document
frequency of a term t by counting the number of documents containing the term:

df(t) = number of documents in the collection that contain t

Term frequency is the number of instances of a term in a single document only, whereas document frequency is
the number of separate documents in which the term appears, so it depends on the entire corpus. Now let us look at the
definition of inverse document frequency. The IDF of a term is the number of documents in the corpus divided
by the document frequency of the term.

The more common a word is, the less significant it is supposed to be, but using the raw ratio N / df(t) directly seems too
harsh. We therefore take the logarithm (with base 2) of the inverse document frequency. So the idf of the term t becomes:

idf(t) = log( N / df(t) )

 Computation: TF-IDF is one of the best metrics to determine how significant a term is to a text in a series or a
corpus. TF-IDF is a weighting system that assigns a weight to each word in a document based on its term
frequency (TF) and inverse document frequency (IDF). The words with higher weight scores are
deemed to be more significant.

Usually, the TF-IDF weight consists of two terms:

1. Normalized Term Frequency

2. Inverse Document Frequency

and the final weight is their product: tf-idf(t, d) = tf(t, d) × idf(t).

Numerical Example

Imagine a term appears 20 times in a document that contains a total of 100 words. The Term Frequency (TF) of that term can be
calculated as follows:

TF = 20 / 100 = 0.2

Assume a collection of related documents contains 10,000 documents. If 100 documents out of the 10,000 documents contain
the term, the Inverse Document Frequency (IDF) of the term can be calculated as follows:

IDF = log( 10,000 / 100 ) = log( 100 ), which equals 2 with base-10 logarithms

Using these two quantities, we can calculate the TF-IDF score of the term for the document:

TF-IDF = TF × IDF = 0.2 × 2 = 0.4
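
The same arithmetic can be checked with a few lines of Python (a minimal sketch; base-10 logarithms are assumed, as in the worked example above):

# Sketch: verifying the TF-IDF arithmetic of the example above (base-10 log assumed).
import math

tf = 20 / 100                       # term appears 20 times in a 100-word document
idf = math.log10(10_000 / 100)      # 100 of the 10,000 documents contain the term
tf_idf = tf * idf

print(tf, idf, tf_idf)              # 0.2  2.0  0.4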

**********


CCS369- TEXT AND SPEECH ANALYSIS


LECTURE NOTES
UNIT II TEXT CLASSIFICATION 6
Vector Semantics and Embeddings -Word Embeddings - Word2Vec model – Glove model – FastText model – Overview
of Deep Learning models – RNN – Transformers – Overview of Text summarization and Topic Models

INTRODUCTION
Text classification is a machine learning technique that assigns a set of predefined categories to open-ended text. Text
classifiers can be used to organize, structure, and categorize pretty much any kind of text – from documents, medical
studies, and files, and all over the web. For example, new articles can be organized by topics; support tickets can be
organized by urgency; chat conversations can be organized by language; brand mentions can be organized by sentiment;
and so on.

Text classification is one of the fundamental tasks in natural language processing with broad applications such as
sentiment analysis, topic labeling, spam detection, and intent detection. Here’s an example of how it works:

“The user interface is quite straightforward and easy to use.”

A text classifier can take this phrase as input, analyze its content, and then automatically assign relevant tags, such as UI
and Easy To Use.

Fig. Basic Flow of Text Classification

Some Examples of Text Classification:


 Sentiment Analysis.
 Language Detection.
 Fraud, Profanity & Online Abuse Detection.
 Detecting Trends in Customer Feedback.
 Urgency Detection in Customer Support.

Why is Text Classification Important?


It’s estimated that around 80% of all information is unstructured, with text being one of the most common types of
unstructured data. Because of the messy nature of text, analyzing, understanding, organizing, and sorting through text data
is hard and time-consuming, so most companies fail to use it to its full potential. This is where text classification with
machine learning comes in. Using text classifiers, companies can automatically structure all manner of relevant text, from
emails, legal documents, social media, chatbots, surveys, and more in a fast and cost-effective way. This allows companies
to save time analyzing text data, automate business processes, and make data-driven business decisions.

Why use machine learning text classification?


Some of the top reasons:
 Scalability: Manually analyzing and organizing is slow and much less accurate. Machine learning can
automatically analyze millions of surveys, comments, emails, etc., at a fraction of the cost, often in just a few
minutes. Text classification tools are scalable to any business need, large or small.
 Real-time analysis: There are critical situations that companies need to identify as soon as possible and take
immediate action (e.g., PR crises on social media). Machine learning text classification can follow your brand
mentions constantly and in real-time, so you'll identify critical information and be able to take action right away.
 Consistent criteria: Human annotators make mistakes when classifying text data due to distractions, fatigue, and
boredom, and human subjectivity creates inconsistent criteria. Machine learning, on the other hand, applies the
same lens and criteria to all data and results. Once a text classification model is properly trained, it performs with a
consistent level of accuracy.

 We can perform text classification in two ways: manual or automatic.

 Manual text classification involves a human annotator, who interprets the content of text and categorizes it
accordingly. This method can deliver good results but it’s time-consuming and expensive.

 Automatic text classification applies machine learning, natural language processing (NLP), and other AI-guided
techniques to automatically classify text in a faster, more cost-effective, and more accurate manner.

 There are many approaches to automatic text classification, but they all fall under three types of systems:

 Rule-based systems
 Machine learning-based systems
 Hybrid systems

 Rule-based systems
 Rule-based approaches classify text into organized groups by using a set of handcrafted linguistic rules. These
rules instruct the system to use semantically relevant elements of a text to identify relevant categories based on its content.
Each rule consists of an antecedent or pattern and a predicted category.

 Example: Say that you want to classify news articles into two groups: Sports and Politics. First, you’ll need to
define two lists of words that characterize each group (e.g., words related to sports such as football, basketball, LeBron
James, etc., and words related to politics, such as Donald Trump, Hillary Clinton, Putin, etc.). Next, when you want to
classify a new incoming text, you’ll need to count the number of sport-related words that appear in the text and do the
same for politics- related words. If the number of sports-related word appearances is greater than the politics-related word
count, then the text is classified as Sports and vice versa. For example, this rule-based system will classify the headline
“When is LeBron James' first game with the Lakers?” as Sports because it counted one sports-related term (LeBron
James) and it didn’t count any politics-related terms.

 Rule-based systems are human comprehensible and can be improved over time. But this approach has some
disadvantages. For starters, these systems require deep knowledge of the domain. They are also time-consuming, since
generating rules for a complex system can be quite challenging and usually requires a lot of analysis and testing.
Rule-based systems are also difficult to maintain and don't scale well, given that adding new rules can affect the results of the
pre-existing rules.
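
A toy sketch of such a keyword-counting rule for the Sports/Politics example is shown below (the word lists and the tie-breaking rule are illustrative assumptions, not a complete rule set):

# Toy sketch of the rule-based Sports-vs-Politics classifier described above.
SPORTS_WORDS = {"football", "basketball", "lebron", "james", "game", "lakers"}
POLITICS_WORDS = {"trump", "clinton", "putin", "election", "senate"}

def classify(text: str) -> str:
    tokens = text.lower().replace("?", " ").replace("'", " ").split()
    sports_hits = sum(t in SPORTS_WORDS for t in tokens)
    politics_hits = sum(t in POLITICS_WORDS for t in tokens)
    # illustrative tie-break: default to Politics when counts are equal
    return "Sports" if sports_hits > politics_hits else "Politics"

print(classify("When is LeBron James' first game with the Lakers?"))  # Sports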

 Machine learning-based systems
 Instead of relying on manually crafted rules, machine learning text classification learns to make classifications
based on past observations. By using pre-labeled examples as training data, machine learning algorithms can learn the
different associations between pieces of text, and that a particular output (i.e., tags) is expected for a particular input (i.e.,
text). A “tag” is the pre-determined classification or category that any given text could fall into.

 The first step towards training a machine learning NLP classifier is feature extraction: a method used to transform
each text into a numerical representation in the form of a vector. One of the most frequently used approaches is the bag of
words, where a vector represents the frequency of a word in a predefined dictionary of words.


 For example, if we have defined our dictionary to have the following words {This, is, the, not, awesome, bad,
basketball}, and we wanted to vectorize the text “This is awesome,” we would have the following vector representation of
that text: (1, 1, 0, 0, 1, 0, 0). Then, the machine learning algorithm is fed with training data that consists of pairs of feature
sets (vectors for each text example) and tags (e.g. sports, politics) to produce a classification model:


 Fig. Training process in Text Classification

 Once it’s trained with enough training samples, the machine learning model can begin to make accurate
predictions. The same feature extractor is used to transform unseen text to feature sets, which can be fed into the
classification model to get predictions on tags (e.g., sports, politics):


 Fig. Prediction process in Text Classification
 Text classification with machine learning is usually much more accurate than human-crafted rule systems,
especially on complex NLP classification tasks. Also, classifiers with machine learning are easier to maintain and you
can always tag new examples to learn new tasks.

 Machine Learning Text Classification Algorithms
 Some of the most popular text classification algorithms include the Naive Bayes family of algorithms, support
vector machines (SVM), and deep learning.
 Naive Bayes
 The Naive Bayes family of statistical algorithms are some of the most used algorithms in text classification and text
analysis, overall. One of the members of that family is Multinomial Naive Bayes (MNB) with a huge advantage, that you
can get really good results even when your dataset isn’t very large (~ a couple of thousand tagged samples) and


computational resources are scarce. Naive Bayes is based on Bayes’s Theorem, which helps us compute the conditional
probabilities of


 the occurrence of two events, based on the probabilities of the occurrence of each individual event. So we’re
calculating the probability of each tag for a given text, and then outputting the tag with the highest probability.

 Naive Bayes formula: P(A | B) = P(B | A) × P(A) / P(B)


 The probability of A, if B is true, is equal to the probability of B, if A is true, times the probability of A being true,
divided by the probability of B being true. This means that any vector that represents a text will have to contain
information about the probabilities of the appearance of certain words within the texts of a given category so that the
algorithm can compute the likelihood of that text belonging to the category.
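
A minimal sketch of a Multinomial Naive Bayes text classifier built on bag-of-words counts with scikit-learn (the tiny training set and tags are purely illustrative):

# Sketch: bag-of-words + Multinomial Naive Bayes text classifier (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "LeBron James scored 30 points in the basketball game",
    "The team won the football match on Saturday",
    "The senate passed the new election bill",
    "The president met foreign leaders to discuss policy",
]
train_tags = ["Sports", "Sports", "Politics", "Politics"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_tags)

print(model.predict(["When is the next basketball game?"]))  # ['Sports'] expected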

 Support Vector Machines
 Support Vector Machines (SVM) is another powerful text classification machine learning algorithm because like
Naive Bayes, SVM doesn’t need much training data to start providing accurate results. SVM does, however, require more
computational resources than Naive Bayes, but the results are even faster and more accurate. In short, SVM draws a line
or “hyperplane” that divides a space into two subspaces. One subspace contains vectors (tags) that belong to a group, and
another subspace contains vectors that do not belong to that group.


 Fig. Optimal SVM Hyperplane

 The optimal hyperplane is the one with the largest margin between the two groups of tags. In two dimensions it looks like
the figure above: the vectors are representations of your training texts, and a group is a tag you have tagged your texts with. As
data gets more complex, it may not be possible to separate the vectors/tags into only two categories with a simple straight line.

 Deep Learning
 Deep learning is a set of algorithms and techniques inspired by how the human brain works, called neural
networks. Deep learning architectures offer huge benefits for text classification because they perform at super high
accuracy with lower- level engineering and computation. The two main deep learning architectures for text classification
are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). Deep learning is hierarchical machine
learning, using multiple algorithms in a progressive chain of events. It’s similar to how the human brain works when
making decisions, using different techniques simultaneously to process huge amounts of data.

 Deep learning algorithms do require much more training data than traditional machine learning algorithms (at least
millions of tagged examples). However, unlike traditional machine learning algorithms such as SVM or Naive Bayes, they
don't hit a ceiling on how much they can learn from training data: deep learning classifiers continue to get better the more
data you feed them. Deep learning techniques like Word2Vec or GloVe are also used to obtain better vector
representations for words and improve the accuracy of classifiers trained with traditional machine learning algorithms.


 Hybrid Systems
 Hybrid systems combine a machine learning-trained base classifier with a rule-based system, used to further
improve the results. These hybrid systems can be easily fine-tuned by adding specific rules for those conflicting tags that
haven’t been correctly modeled by the base classifier.

 VECTOR SEMANTICS AND EMBEDDINGS

 Vector semantics is the standard way to represent word meaning in NLP, helping us model many
of the aspects of word meaning. The idea of vector semantics is to represent a word as a point in a multidimensional
semantic space that is derived from the distributions of its word neighbors. Vectors for representing words are
called embeddings (although the term is sometimes more strictly applied only to dense vectors like word2vec). Vector
semantics defines semantics and interprets word meaning to explain features such as word similarity. Its central idea is:
two words are similar if they have similar word contexts.

 In its current form, the vector model draws its inspiration from the linguistic and philosophical work of the 1950s.
Vector semantics represents a word in a multi-dimensional vector space. The vector model is also called embeddings, due to the
fact that a word is embedded in a particular vector space. The vector model offers many advantages in NLP. For example,
in sentiment analysis, it sets up a decision boundary and predicts whether the sentiment is positive or negative (a binary
classification). Another key practical advantage of vector semantics is that it can learn automatically from text without
complex labeling or supervision. As a result of these advantages, vector semantics has become a de-facto standard for
NLP applications such as Sentiment Analysis, Named Entity Recognition (NER), topic modeling, and so on.
 WORD EMBEDDINGS
 It is an approach for representing words and documents. Word Embedding or Word Vector is a numeric vector
input that represents a word in a lower-dimensional space. It allows words with similar meaning to have a similar
representation. They can also approximate meaning. A word vector with 50 values can represent 50 unique features.
 Features: Anything that relates words to one another. Eg: Age, Sports, Fitness, Employed etc. Each word vector
has values corresponding to these features.

 Goal of Word Embeddings
 To reduce dimensionality
 To use a word to predict the words around it
 Inter word semantics must be captured
How are Word Embeddings used?
 They are used as input to machine learning models.
 Take the words —-> Give their numeric representation —-> Use in training or inference
 To represent or visualize any underlying patterns of usage in the corpus that was used to train them.

 Implementations of Word Embeddings:

 Word Embeddings are a method of extracting features out of text so that we can input those features into a machine
learning model to work with text data. They try to preserve syntactical and semantic information. The methods such as
Bag of Words(BOW), CountVectorizer and TFIDF rely on the word count in a sentence but do not save any syntactical or
semantic information. In these algorithms, the size of the vector is the number of elements in the vocabulary. We can get a
sparse matrix if most of the elements are zero. Large input vectors will mean a huge number of weights which will result
in high computation required for training. Word Embeddings give a solution to these problems.
 Let's take an example to understand how a word vector is generated, using emoticons that are most frequently
used in certain conditions: we transform each emoji into a vector, and the conditions become our features, in the order
[happy, sad, excited, sick]. An emoji used in happy and excited contexts then gets the vector [1, 0, 1, 0], one used in sad
and sick contexts gets [0, 1, 0, 1], one used in excited and sick contexts gets [0, 0, 1, 1], and so on.

 In a similar way, we can create word vectors for different words as well on the basis of given features. The words
with similar vectors are most likely to have the same meaning or are used to convey the same sentiment. There are two
different approaches for getting Word Embeddings:

1) Word2Vec:
 In Word2Vec every word is assigned a vector. We start with either a random vector or one-hot vector.
 One-Hot vector: A representation where only one bit in a vector is 1. If there are 500 words in the corpus then the
vector length will be 500. After assigning vectors to each word we take a window size and iterate through the entire
corpus. While we do this there are two neural embedding methods which are used:

Continuous Bag of Words (CBOW)
 In this model, we try to fit the neighboring words in the window to the central word, i.e. predict the central word from its context.

 CBOW Architecture.

 This architecture is very similar to a feed-forward neural network. This model architecture essentially
tries to predict a target word from a list of context words.

 The intuition behind this model is quite simple: given a phrase "Have a great day", we will choose our
target word to be “a” and our context words to be [“have”, “great”, “day”]. What this model will do is take
the distributed representations of the context words to try and predict the target word.


 The English language contains almost 1.2 million words, making it impossible to include so many words
in our example. So we'll consider a small example in which we have only four words, i.e. live, home, they and at.
For simplicity, we will consider that the corpus contains only one sentence, that being, ‘They live at home’.

 First, we convert each word into a one-hot encoding form. Also, we will not consider all the words in the
sentence but will only take certain words that are in a window. For example, for a window size equal to three, we
only consider three words in a sentence. The middle word is to be predicted and the surrounding two words are
fed into the neural network as context. The window is then slid and the process is repeated.

 Finally, after training the network repeatedly by sliding the window as shown above, we get weights
which we use to get the embeddings as shown below.

 Usually, we take a window size of around 8-10 words and have a vector size of 300.


Skip Gram
 In this model, we try to make the central word closer to the neighboring words. It is the complete opposite of the
CBOW model. It is shown that this method produces more meaningful embeddings.

 After applying the above neural embedding methods we get trained vectors of each word after many iterations
through the corpus. These trained vectors preserve syntactical or semantic information and are converted to lower
dimensions. The vectors with similar meaning or semantic information are placed close to each other in space.

 The Skip-gram model architecture usually tries to achieve the reverse of what the CBOW model does. It
tries to predict the source context words (surrounding words) given a target word (the centre word)
 The working of the skip-gram model is quite similar to the CBOW but there is just a difference in the
architecture of its neural network and the way the weight matrix is generated as shown in the figure below:


 After obtaining the weight matrix, the steps to get word embedding is same as CBOW.


 So now, which one of the two algorithms should we use for implementing word2vec? It turns out that for a large
corpus with higher dimensions, it is better to use skip-gram, but it is slower to train, whereas CBOW is better for a
small corpus and is faster to train.
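
A short sketch with the gensim library (assumed installed) showing how the sg flag switches between CBOW and skip-gram; the toy corpus is illustrative only:

# Sketch: training tiny Word2Vec models with gensim; sg=0 -> CBOW, sg=1 -> skip-gram.
from gensim.models import Word2Vec

sentences = [
    ["they", "live", "at", "home"],
    ["have", "a", "great", "day"],
    ["it", "is", "a", "nice", "evening"],
]

cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(cbow.wv["home"][:5])                 # first 5 dimensions of the 'home' vector
print(skipgram.wv.most_similar("home"))    # nearest neighbours in the toy corpus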

2) GloVe:
 This is another method for creating word embeddings. In this method, we take the corpus and iterate through it and
get the co-occurrence of each word with other words in the corpus. We get a co-occurrence matrix through this. The words
which occur next to each other get a value of 1, if they are one word apart then 1/2, if two words apart then 1/3 and so on.
 Let us take an example to understand how the matrix is created. We have a small corpus:

Corpus:
It is a nice evening.
Good Evening!
Is it a nice evening?

            it          is          a           nice      evening   good
it          0
is          1+1         0
a           1/2+1       1+1/2       0
nice        1/3+1/2     1/2+1/3     1+1         0
evening     1/4+1/3     1/3+1/4     1/2+1/2     1+1       0
good        0           0           0           0         1         0
 The upper half of the matrix will be a reflection of the lower half. We can consider a window frame as well to
calculate the co-occurrences by shifting the frame till the end of the corpus. This helps gather information about the
context in which the word is used. Initially, the vectors for each word is assigned randomly. Then we take two pairs of
vectors and see how close they are to each other in space. If they occur together more often or have a higher value in the
co-occurrence matrix and are far apart in space then they are brought close to each other. If they are close to each other but
are rarely or not frequently used together then they are moved further apart in space.
 After many iterations of the above process, we’ll get a vector space representation that approximates the
information from the co-occurrence matrix. The performance of GloVe is better than Word2Vec in terms of both semantic
and syntactic capturing.

 Pre-trained Word Embedding Models:
 People generally use pre-trained models for word embeddings. Few of them are:
 SpaCy
 fastText
 Flair etc.
 Common Errors made:
 You need to use the exact same pipeline during deploying your model as were used to create the training data for
the word embedding. If you use a different tokenizer or different method of handling white space, punctuation etc.
you might end up with incompatible inputs.


 Words in your input that don't have a pre-trained vector are known as Out-of-Vocabulary (OOV)
words. What you can do is replace those words with “UNK”, which means unknown, and then handle them
separately.
 Dimension mis-match: Vectors can be of many lengths. If you train a model with vectors of length say 400 and
then try to apply vectors of length 1000 at inference time, you will run into errors. So make sure to use the same
dimensions throughout.


 Benefits of using Word Embeddings:


 It is much faster to train than hand-built models like WordNet (which uses graph embeddings)
 Almost all modern NLP applications start with an embedding layer
 It Stores an approximation of meaning
 Drawbacks of Word Embeddings:
 It can be memory intensive
 It is corpus dependent. Any underlying bias will have an effect on your model
 It cannot distinguish between homophones. Eg: brake/break, cell/sell, weather/whether etc.

 WORD2VEC MODEL IN DETAIL

 Word embeddings is a technique where individual words are transformed into a numerical representation
of the word (a vector). Where each word is mapped to one vector, this vector is then learned in a way which
resembles a neural network. The vectors try to capture various characteristics of that word with regard to the
overall text. These characteristics can include the semantic relationship of the word, definitions, context, etc.
With these numerical representations, you can do many things like identify similarity or dissimilarities between
words. Clearly, these are integral as inputs to various aspects of machine learning. A machine cannot process
text in its raw form, thus converting the text into an embedding will allow users to feed the embedding to classic
machine learning models. The simplest embedding would be a one-hot encoding of text data where each vector
would be mapped to a category.

 For example:
have = [1, 0, 0, 0, 0, 0, ... 0]
a    = [0, 1, 0, 0, 0, 0, ... 0]
good = [0, 0, 1, 0, 0, 0, ... 0]
day  = [0, 0, 0, 1, 0, 0, ... 0]
...

 However, there are multiple limitations of simple embeddings such as this, as they do not capture
characteristics of the word, and they can be quite large depending on the size of the corpus.

 Word2Vec Architecture
 The effectiveness of Word2Vec comes from its ability to group together vectors of similar words. Given a
large enough dataset, Word2Vec can make strong estimates about a word’s meaning based on their occurrences
in the text. These estimates yield word associations with other words in the corpus. For example, words like
“King” and “Queen” would be very similar to one another. When conducting algebraic operations on word
embeddings you can find a close approximation of word similarities. For example, the 2-dimensional
embedding vector of "king"
- the 2-dimensional embedding vector of "man" + the 2-dimensional embedding vector of "woman" yielded a
vector which is very close to the embedding vector of "queen". Note, that the values below were chosen
arbitrarily.

 King − Man + Woman = Queen
[5, 3] − [2, 1] + [3, 2] = [6, 4]


 Graphical representation of a Word2Vec embedding (King and Queen are close to each other in
position)

 There are two main architectures that yield the success of word2vec.
 Skip-gram
 CBOW architectures. (Refer Previous Section)

 Continuous Skip-Gram Model



 The skip-gram model is a simple neural network with one hidden layer trained in order to predict the
probability of a given word being present when an input word is present. Intuitively, you can imagine the skip-
gram model being the opposite of the CBOW model. In this architecture, it takes the current word as an input
and tries to accurately predict the words before and after this current word. This model essentially tries to learn
and predict the context words around the specified input word. Based on experiments assessing the accuracy of
this model it was found that the prediction quality improves given a large range of word vectors, however it also
increases the computational complexity. The process can be described visually as seen below.

Training Data Generation Model


 Example of generating training data for the skip-gram model with a window size of 3.
 As seen above, given some corpus of text, a target word is selected over some rolling window. The
training data consists of pairwise combinations of that target word and all other words in the window. This is the
resulting training data for the neural network. Once the model is trained, we can essentially yield a probability
of a word being a context word for a given target. The following image below represents the architecture of the
neural network for the skip-gram model.



 Skip-Gram Model architecture



 A corpus can be represented as a vector of size N, where each element in N corresponds to a word in the
corpus. During the training process, we have a pair of target and context words, the input array will have 0 in all
elements except for the target word. The target word will be equal to 1. The hidden layer will learn the
embedding representation of each word, yielding a d-dimensional embedding space. The output layer is a dense
layer with a softmax activation function. The output layer will essentially yield a vector of the same size as the
input, each element in the vector will consist of a probability. This probability indicates the similarity between
the target word and the associated word in the corpus.
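
The (target, context) pair generation described above can be sketched in a few lines of Python (the window size and sentence are illustrative assumptions):

# Sketch: generating (target, context) pairs for skip-gram training with a rolling window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "they live at home".split()
print(skipgram_pairs(tokens, window=1))
# [('they', 'live'), ('live', 'they'), ('live', 'at'), ('at', 'live'),
#  ('at', 'home'), ('home', 'at')]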

 GLOVE MODEL

 GloVe , coined from Global Vectors, is an unsupervised learning algorithm for obtaining vector representations for
words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting
representations showcase interesting linear substructures of the word vector space. It is a model for distributed word
representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. This is
achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.

 The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised
methods for learning word representations, and although many such methods now exist, the question still remains as to
how meaning is generated from these statistics, and how the resulting word vectors might represent that meaning. In this
section, we shed some light on this question. We use our insights to construct a new model for word representation which
we call GloVe, for Global Vectors, because the global corpus statistics are captured directly by the model.

 GloVe model combines the advantages of the two major model families in the literature:
• global matrix factorization
• local context window methods

 This model efficiently leverages statistical information by training only on the nonzero elements in a word-word
cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.

 Matrix Factorization Methods: methods that reduce a matrix into constituent parts that make it easier to
calculate more complex matrix operations .
 Shallow Window-Based Methods: Another approach is to learn word representations that aid in making
predictions within local context windows.
 Refer to the previous section for an example.
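
In practice, pre-trained GloVe vectors are often loaded rather than trained from scratch; a minimal sketch using gensim's downloader API is shown below (the model name "glove-wiki-gigaword-50" is one of the vector sets distributed through gensim-data, and downloading it requires an internet connection):

# Sketch: loading pre-trained GloVe vectors with gensim's downloader
# (needs gensim plus an internet connection the first time it runs).
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")     # 50-dimensional GloVe vectors

print(glove["king"][:5])                       # first 5 dimensions of 'king'
print(glove.most_similar("king", topn=3))      # nearest neighbours by cosine similarity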


 FASTTEXT MODEL

 This model allows training word embeddings from a training corpus with the additional ability to obtain word
vectors for out-of-vocabulary words. FastText is an open-source, free, lightweight library that allows users to learn text
representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit
on mobile devices. fastText embeddings exploit subword information to construct word embeddings. FastText is more
stable than Word2Vec architecture. In FastText, each word is represented as the average of the vector representation of its
character n- grams along with the word itself. So, the word embedding for the word 'equal' can be given as the sum of all
vector representations of all of its character n-gram and the word itself.

 Word embedding techniques like word2vec and GloVe provide distinct vector representations for the words in the
vocabulary. This leads to ignorance of the internal structure of the language. This is a limitation for morphologically rich
language as it ignores the syntactic relation of the words. As many word formations follow the rules in morphologically
rich languages, it is possible to improve vector representations for these languages by using character-level information.
 To improve vector representation for morphologically rich language, FastText provides embeddings for character
n-grams, representing words as the average of these embeddings. It is an extension of the word2vec model. Word2Vec
model provides embedding to the words, whereas fastText provides embeddings to the character n-grams. Like the
word2vec model, fastText uses CBOW and Skip-gram to compute the vectors.

 FastText can also handle out-of-vocabulary words, i.e., the fast text can find the word embeddings that are not
present at the time of training.

 Out-of-vocabulary (OOV) words are words that do not occur while training the data and are not present in the
model’s vocabulary. Word embedding models like word2vec or GloVe cannot provide embeddings for the OOV words
because they provide embeddings for words; hence, if a new word occurs, it cannot provide embedding. Since FastText
provides embeddings for character n-grams, it can provide embeddings for OOV words. If an OOV word occurs, then
fastText provides embedding for that word by embedding its character n-gram.

 Understanding the Working of FastText


 In FastText, each word is represented as the average of the vector representation of its character n-grams along
with the word itself.
 Consider the word “equal” and n = 3, then the word will be represented by character n-grams:

 < eq, equ, qua, ual, al > and < equal >

 So, the word embedding for the word ‘equal’ can be given as the sum of all vector representations of all of its
character n- gram and the word itself.

 Continuous Bag Of Words (CBOW) for FastText:
 In the Continuous Bag Of Words (CBOW), we take the context of the target word as input and predict the word
that occurs in the context.

 For example, in the sentence “ I want to learn FastText.” In this sentence, the words “I,” “want,” “to,” and
“FastText” are given as input, and the model predicts “learn” as output.

 All the input and output data are in the same dimension and have one-hot encoding. It uses a neural network for
training. The neural network has an input layer, a hidden layer, and an output layer. Figure 1.2 shows the working of
CBOW for FastText.





 Skip-gram for FastText
 Skip-gram works like CBOW, but the input is the target word, and the model predicts the context of the given the
word. It also uses neural networks for training. Figure 1.3 shows the working of Skip-gram.

 Highlighting the Difference: Word2Vec vs. FastText



 FastText can be viewed as an extension to word2vec. Some of the significant differences between word2vec and
fastText are as follows:
 Word2Vec works on the word level, while fastText works on the character n-grams.
 Word2Vec cannot provide embeddings for out-of-vocabulary words, while fastText can provide embeddings for
OOV words.
 FastText can provide better embeddings for morphologically rich languages compared to word2vec.
 FastText uses the hierarchical classifier to train the model; hence it is faster than word2vec.
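
A sketch with gensim's FastText implementation showing that a vector can still be produced for a word that never appeared in training, because it is assembled from character n-grams (the toy corpus and parameters are illustrative assumptions):

# Sketch: FastText with gensim; character n-grams allow vectors for OOV words.
from gensim.models import FastText

sentences = [
    ["equal", "rights", "for", "all"],
    ["learning", "is", "a", "good", "practice"],
]

model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

print("equal" in model.wv.key_to_index)      # True  - in the training vocabulary
print("equally" in model.wv.key_to_index)    # False - out of vocabulary
print(model.wv["equally"][:5])               # still gets a vector built from its n-grams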


 OVERVIEW OF DEEP LEARNING MODELS – RNN


 Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn
by example. Deep learning is a key technology behind driverless cars, enabling them to recognize a stop sign, or to
distinguish a pedestrian from a lamppost. It is the key to voice control in consumer devices like phones, tablets, TVs, and
hands-free speakers. Deep learning is getting lots of attention lately and for good reason. It’s achieving results that were
not possible before.

 In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound. Deep
learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained
by using a large set of labeled data and neural network architectures that contain many layers.

 While deep learning was first theorized in the 1980s, there are two main reasons it has only recently become useful:
1. Deep learning requires large amounts of labeled data. For example, driverless car development requires millions
of images and thousands of hours of video.
2. Deep learning requires substantial computing power. High-performance GPUs have a parallel architecture that is
efficient for deep learning. When combined with clusters or cloud computing, this enables development teams to
reduce training time for a deep learning network from weeks to hours or less.
 Deep learning applications are used in industries from automated driving to medical devices.

 Automated Driving: Automotive researchers are using deep learning to automatically detect objects such as stop
signs and traffic lights. In addition, deep learning is used to detect pedestrians, which helps decrease accidents.
 Aerospace and Defense: Deep learning is used to identify objects from satellites that locate areas of interest, and
identify safe or unsafe zones for troops.
 Medical Research: Cancer researchers are using deep learning to automatically detect cancer cells. Teams at
UCLA built an advanced microscope that yields a high-dimensional data set used to train a deep learning
application to accurately identify cancer cells.
 Industrial Automation: Deep learning is helping to improve worker safety around heavy machinery by
automatically detecting when people or objects are within an unsafe distance of machines.
 Electronics: Deep learning is being used in automated hearing and speech translation. For example, home
assistance devices that respond to your voice and know your preferences are powered by deep learning
applications.

 RNN
 Recurrent Neural Network (RNN) is a type of Neural Network where the output from the previous step is fed as
input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in
cases when it is required to predict the next word of a sentence, the previous words are required and hence there is a need
to remember the previous words. Thus, RNN came into existence, which solved this issue with the help of a Hidden
Layer. The main and most important feature of RNN is its Hidden state, which remembers some information about a
sequence. The state is also referred to as Memory State since it remembers the previous input to the network. It uses the
same parameters for each input as it performs the same task on all the inputs or hidden layers to produce the output. This
reduces the complexity of parameters, unlike other neural networks.


 Recurrent neural network


 An RNN treats each word of a sentence as a separate input occurring at time ‘t’ and also uses the activation value at
‘t-1’ as an input in addition to the input at time ‘t’. The diagram below shows a detailed structure of an RNN
architecture. The architecture described above is also called a many-to-many architecture with (Tx = Ty), i.e. the number
of inputs equals the number of outputs. Such a structure is quite useful in sequence modelling.

 Apart from the architecture mentioned above there are three other types of architectures of RNN which are
commonly used.
1. Many to One RNN : Many to one architecture refers to an RNN architecture where many inputs (Tx) are used to
give one output (Ty). A suitable example for using such an architecture will be a classification task.

 RNNs are a very important variant of neural networks, heavily used in Natural Language Processing.

 Conceptually they differ from a standard neural network, as the standard input to an RNN is a word instead of
the entire sample, as in the case of a standard neural network. This gives the network the flexibility to work with varying
lengths of sentences, something which cannot be achieved in a standard neural network due to its fixed structure. It
also provides the additional advantage of sharing features learned across different positions of text, which cannot be
obtained in a standard neural network.



 In the image above H represents the output of the activation function.

2. One to Many RNN: One to Many architecture refers to a situation where an RNN generates a series of output values
based on a single input value. A prime example of such an architecture is a music generation task, where the
input is a genre or the first note.


3. Many to Many Architecture (Tx not equals Ty): This architecture refers to where many inputs are read to produce
many outputs, where the length of inputs is not equal to the length of outputs. A prime example for using such an
architecture is machine translation tasks.



 Encoder refers to the part of the network which reads the sentence to be translated, and, Decoder is the
part of the network which translates the sentence into desired language.

 Limitations of RNN
 Apart from all of its usefulness, RNNs do have certain limitations, the major ones being:
1. The examples of RNN architecture stated above can capture the dependencies in only one direction of the language.
Basically, in the case of Natural Language Processing, it assumes that the word coming after has no effect on the
meaning of the word coming before. With our experience of languages, we know that this is certainly not true.
2. RNNs are also not very good at capturing long-term dependencies, and the problem of vanishing gradients
resurfaces in RNNs.

 RNNs are ideal for solving problems where the sequence is more important than the individual items themselves.
 An RNN is essentially a fully connected neural network that contains a refactoring of some of its layers into a loop.
That loop is typically an iteration over the addition or concatenation of two inputs, a matrix multiplication and a non-linear
function.
 Among the text usages, the following tasks are among those RNNs perform well at:
 Sequence labelling
 Natural Language Processing (NLP) text classification
 Natural Language Processing (NLP) text generation
 Other tasks that RNNs are effective at solving are time series predictions or other sequence predictions that aren’t
image or tabular-based.
 There have been several highlighted and controversial reports in the media over advances in text generation, such as
OpenAI's GPT-2 model. In many cases the generated text is indistinguishable from text written by humans.

 RNNs effectively have an internal memory that allows the previous inputs to affect the subsequent predictions. It’s
much easier to predict the next word in a sentence with more accuracy, if you know what the previous words were. Often
with tasks well suited to RNNs, the sequence of the items is as or more important than the previous item in the sequence.
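
A compact sketch of a many-to-one RNN text classifier in Keras is given below (this assumes TensorFlow is installed; the vocabulary size, sequence length and the preparation of padded token-id sequences are assumptions and are not shown):

# Sketch: a many-to-one RNN text classifier in Keras (TensorFlow assumed installed).
import tensorflow as tf

vocab_size, seq_len, embed_dim = 10_000, 100, 64   # assumed sizes for illustration

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),       # token ids -> dense vectors
    tf.keras.layers.SimpleRNN(64),                           # hidden state carries sequence memory
    tf.keras.layers.Dense(1, activation="sigmoid"),          # single output, e.g. positive/negative
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.build(input_shape=(None, seq_len))
model.summary()
# model.fit(x_train, y_train, ...) would follow once padded token-id sequences are prepared.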

 Sequence-to-Sequence Models: TRANSFORMERS (Translate one language into another language)
 Sequence-to-sequence (seq2seq) models in NLP are used to convert sequences of Type A to sequences of Type
B. For example, translation of English sentences to German sentences is a sequence-to-sequence task.

 Recurrent Neural Network (RNN) based sequence-to-sequence models have garnered a lot of traction ever
since they were introduced in 2014. Most of the data in the current world are in the form of sequences – it can be a
number sequence, text sequence, a video frame sequence or an audio sequence.
 The performance of these seq2seq models was further enhanced with the addition of the Attention Mechanism in
2015, an indication of how quickly advancements in NLP have been happening over the last few years.
 These sequence-to-sequence models are pretty versatile and they are used in a variety of NLP tasks, such as:
 Machine Translation
 Text Summarization
 Speech Recognition
 Question-Answering System, and so on


 RNN based Sequence-to-Sequence Model


 Let’s take a simple example of a sequence-to-sequence model. Check out the below illustration:


 German to English Translation using seq2seq
 The above seq2seq model is converting a German phrase to its English counterpart. Let’s break it down:
 Both Encoder and Decoder are RNNs
 At every time step in the Encoder, the RNN takes a word vector (xi) from the input sequence and a hidden state
(Hi) from the previous time step
 The hidden state is updated at each time step
 The hidden state from the last unit is known as the context vector. This contains information about the input
sequence
 This context vector is then passed to the decoder and it is then used to generate the target sequence (English phrase)
 If we use the Attention mechanism, then the weighted sum of the hidden states is passed as the context vector
to the decoder (a minimal code sketch of this encoder-decoder setup follows the list)
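 Below is a minimal PyTorch sketch of the RNN encoder-decoder idea summarized in the list above. The vocabulary
sizes, dimensions, start-token id and the greedy decoding loop are illustrative assumptions, not the configuration of any
particular published model.

import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 5000, 5000, 128, 256

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
    def forward(self, src):                       # src: (batch, src_len)
        _, hidden = self.rnn(self.embed(src))
        return hidden                              # context vector = last hidden state

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)
    def forward(self, tgt_token, hidden):          # one target time step at a time
        output, hidden = self.rnn(self.embed(tgt_token), hidden)
        return self.out(output), hidden

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (1, 7))          # a toy source sentence of 7 token ids
context = encoder(src)                             # encode the whole input sequence
token, hidden = torch.tensor([[1]]), context       # assumed start-of-sentence token id = 1
for _ in range(10):                                # greedy decoding of up to 10 target tokens
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1)                  # feed the predicted word back in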
 Challenges
 Despite being so good at what it does, there are certain limitations of seq-2-seq models with attention:
 Dealing with long-range dependencies is still challenging
 The sequential nature of the model architecture prevents parallelization. These challenges are addressed by
Google Brain’s Transformer concept
 RNNs can remember important things about the input they have received, which allows them to be very precise in
predicting the next outcome. This is why they are preferred for sequential data. Examples of sequence data include time
series, speech, text, financial data, audio, video, weather, and many more. Although RNNs were the state-of-the-art
algorithm for dealing with sequential data, they come with their own drawbacks: because of the complexity of the
algorithm, the network is quite slow to train, and with a huge number of dimensions the training becomes long and
difficult.

 TRANSFORMERS

 Attention models/Transformers are the most exciting models being studied in NLP research today, but they can be
a bit challenging to grasp – the pedagogy is all over the place. This is both a bad thing (it can be confusing to hear
different versions) and in some ways a good thing (the field is rapidly evolving, there is a lot of space to improve).

 Transformer

 Internally, the Transformer has a similar kind of architecture as the previous models above. But the Transformer
consists of six encoders and six decoders.


 Each encoder is very similar to each other. All encoders have the same architecture. Decoders share the same
property, i.e. they are also very similar to each other. Each encoder consists of two layers: Self-attention and a feed
Forward Neural Network.

 The encoder’s inputs first flow through a self-attention layer. It helps the encoder look at other words in the input
sentence as it encodes a specific word. The decoder has both those layers, but between them is an attention layer that
helps the decoder focus on relevant parts of the input sentence.

 Self-Attention
 Let’s start to look at the various vectors/tensors and how they flow between these components to turn the input
of a trained model into an output. As is the case in NLP applications in general, we begin by turning each input word
into a vector using an embedding algorithm.



 Each word is embedded into a vector of size 512. We will represent those vectors with these simple boxes. The
embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they
receive a list of vectors each of the size 512.

 In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the
encoder that’s directly below. After embedding the words in our input sequence, each of them flows through each of the
two layers of the encoder.

 Here we begin to see one key property of the Transformer, which is that the word in each position flows through
its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward
layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing
through the feed-forward layer.

 Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the
encoder.

 Self-Attention
 Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually
implemented — using matrices.

 Figuring out the relation of words within a sentence and giving the right
attention to it.

 The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in
this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector.
These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

 Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64,
while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is
an architecture choice to make the computation of multiheaded attention (mostly) constant.


 Multiplying x1 by the WQ weight matrix produces q1, the “query” vector associated with that word. We end up
creating a “query”, a “key”, and a “value” projection of each word in the input sentence.

 What are the “query”, “key”, and “value” vectors?
 They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading
how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors
plays.

 The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the
first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score
determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

 The score is calculated by taking the dot product of the query vector with the key vector of the respective word
we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot
product of q1 and k1. The second score would be the dot product of q1 and k2.


 The third and fourth steps are to divide the scores by 8 (the square root of the key-vector dimension used in this
example, which is 64); this leads to more stable gradients (there could be other possible values here, but this is the
default). Then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add
up to 1.


 This softmax score determines how much each word will be expressed at this position. Clearly the word
at this position will have the highest softmax score, but sometimes it's useful to attend to another word that is relevant to
the current word.

 The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition
here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them
by tiny numbers like 0.001, for example). The sixth step is to sum up the weighted value vectors. This produces the
output of the self-attention layer at this position (for the first word).

 That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward
neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing.
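 The six steps above can be written in a few lines of matrix code. Below is a minimal NumPy sketch, assuming a toy
sentence of three words, the dimensions used in the text (embedding size 512, query/key/value size 64), and random
stand-in weight matrices rather than trained ones.

import numpy as np

np.random.seed(0)
seq_len, d_model, d_k = 3, 512, 64

X = np.random.randn(seq_len, d_model)                   # one embedding row per word
W_Q = np.random.randn(d_model, d_k) / np.sqrt(d_model)  # stand-ins for trained projections
W_K = np.random.randn(d_model, d_k) / np.sqrt(d_model)
W_V = np.random.randn(d_model, d_k) / np.sqrt(d_model)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                     # step 1: query, key, value vectors

scores = Q @ K.T / np.sqrt(d_k)                         # steps 2-3: dot products scaled by 8
scores -= scores.max(axis=-1, keepdims=True)            # numerical stability for softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # step 4: softmax
output = weights @ V                                    # steps 5-6: weighted sum of values

print(output.shape)                                     # (3, 64): one attention output per word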

 Multihead attention
 There are a few other details that make them work better. For example, instead of only paying attention to each
other in one dimension, Transformers use the concept of Multihead attention. The idea behind it is that whenever you are
translating a word, you may pay different attention to each word based on the type of question that you are asking. The
images below show what that means. For example, whenever you are translating “kicked” in the sentence “I kicked the
ball”, you may ask “Who kicked”. Depending on the answer, the translation of the word to another language can change.
Or ask other questions, like “Did what?”, etc…





 Positional Encoding
 Another important step on the Transformer is to add positional encoding when encoding each word.
Encoding the position of each word is relevant since the position of each word is relevant to the translation.
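 A common choice is the sinusoidal positional encoding from the original Transformer paper, which is simply added to
the word embeddings before the first encoder layer. A short sketch, assuming that scheme:

import numpy as np

def positional_encoding(max_len, d_model):
    # Even dimensions use sine, odd dimensions use cosine, at different frequencies.
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

embeddings = np.random.randn(10, 512)                   # 10 words, embedding size 512
encoder_input = embeddings + positional_encoding(10, 512)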

 OVERVIEW OF TEXT SUMMARIZATION AND TOPIC MODELS

 Text summarization is the process of creating a concise and accurate representation of the main points and
information in a document. Topic modelling can help you generate summaries by extracting the most relevant and salient
topics and words from the document. Text summarization refers to the technique of shortening long pieces of text. The
intention is to create a coherent and fluent summary having only the main points outlined in the document.

 Automatic text summarization is a common problem in machine learning and natural language processing (NLP).
In general, text summarization technique has proved to be critical in quickly and accurately summarizing voluminous
texts, something which could be expensive and time consuming if done without machines.

 There are two main types of how to summarize text in NLP:


 Extraction-based summarization
 Abstraction-based summarization

 Extraction-based summarization
 The extractive text summarization technique involves pulling keyphrases from the source document and
combining them to make a summary. The extraction is made according to the defined metric without making any changes
to the texts.

 Here is an example:
 Source text: Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. In the city, Mary gave
birth to a child named Jesus.

 Extractive summary: Joseph and Mary attend event Jerusalem. Mary birth Jesus.

 As you can see above, the words in bold have been extracted and joined to create a summary — although
sometimes the summary can be grammatically strange.

 Abstraction-based summarization
 The abstraction technique entails paraphrasing and shortening parts of the source document. When abstraction is
applied for text summarization in deep learning problems, it can overcome the grammar inconsistencies of the extractive
method. The abstractive text summarization algorithms create new phrases and sentences that relay the most useful
information from the original text — just like humans do.

 Therefore, abstraction performs better than extraction. However, the text summarization algorithms required to do
abstraction are more difficult to develop; that’s why the use of extraction is still popular.

 Here is an example:
 Abstractive summary: Joseph and Mary came to Jerusalem where Jesus was born.

 How does a text summarization algorithm work?
 Usually, text summarization in NLP is treated as a supervised machine learning problem (where future
outcomes are predicted based on provided data).

 Typically, here is how the extraction-based approach to summarizing texts can work (a simpler unsupervised sketch follows these steps):

1. Introduce a method to extract the merited keyphrases from the source document. For example, you can use
part- of-speech tagging, word sequences, or other linguistic patterns to identify the keyphrases.

2. Gather text documents with positively-labeled keyphrases. The keyphrases should be compatible to the
stipulated extraction technique. To increase accuracy, you can also create negatively-labeled keyphrases.

3. Train a binary machine learning classifier to make the text summarization. Some of the features you can use
include:
 Length of the keyphrase
 Frequency of the keyphrase
 The most recurring word in the keyphrase
 Number of characters in the keyphrase
4. Finally, in the test phase, create all the keyphrase words and sentences and carry out classification for them.
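 A much simpler, unsupervised variant of the extractive idea is to score each sentence by the frequency of its words
and keep the top-scoring sentences. The sketch below illustrates this; it is not the supervised keyphrase classifier
described in the steps above.

import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    # Score each sentence by the total corpus frequency of its words.
    scores = {s: sum(freq[w] for w in re.findall(r'[a-z]+', s.lower())) for s in sentences}
    top = sorted(sentences, key=scores.get, reverse=True)[:num_sentences]
    return ' '.join(s for s in sentences if s in top)    # keep original sentence order

text = ("Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. "
        "In the city, Mary gave birth to a child named Jesus.")
print(extractive_summary(text, num_sentences=1))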

 Topic Modeling
 Topic modeling is a technique that can help you discover the main themes and concepts in a large collection of text
documents. It can also help you summarize, classify, or cluster the documents based on their topics. In this article, you
will learn how to use topic modeling for these tasks and what are some of the common algorithms and tools that you can
apply.


 Topic modeling is a form of unsupervised learning that aims to find hidden patterns and structures in the text data.
It assumes that each document is composed of a mixture of topics, and each topic is a distribution of words that
represent a specific

 subject or idea. For example, a document about sports might have topics such as soccer, basketball, and fitness.
Topic modeling can help you identify these topics and their proportions in each document. Topic modeling can help you
generate summaries by extracting the most relevant and salient topics and words from the document for text
summarization. You can then use these topics and words to construct a summary that captures the essence and meaning of
the document.

 Topic modeling is a collection of text-mining techniques that uses statistical and machine learning models to
automatically discover hidden abstract topics in a collection of documents.
 Topic modeling is also an amalgamation of a set of unsupervised techniques that’s capable of detecting word and
phrase patterns within documents and automatically cluster word groups and similar expressions helping in best
representing a set of documents.

 There are many cases where humans or machines generate a huge amount of text over time and it is not prudent
nor possible to go through the entire text for gaining an understanding of what is important or to come to an opinion of the
entire process of generating the data.
 In such cases, NLP algorithms and in particular topic modeling are useful to extract a summary of the underlying
text and discover important contexts from the text.

 Topic modeling is the method of extracting needed attributes from a bag of words. This is critical because each
word in the corpus is treated as a feature in NLP. As a result, feature reduction allows us to focus on the relevant material
rather than wasting time sifting through all of the data's text.

 There are many different topic modeling algorithms and tools available for text analysis projects. Popular methods
include Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and Latent Semantic Analysis
(LSA). Common tools used to apply these algorithms include Gensim, a Python library providing implementations of
LDA, NMF, and other topic modeling methods; Scikit-learn, a Python library providing implementations of NMF, LSA,
and other machine learning methods; and MALLET, a Java-based toolkit providing implementations of LDA, NMF, and
other topic modeling methods. These tools offer various utilities and functionalities for preprocessing, evaluation,
visualization, data manipulation, feature extraction, model selection, and performance metrics.

 Working and Methods of Topic Modeling:



 To infer subjects from unstructured data, topic modeling includes counting words and grouping similar word
patterns. Suppose we are a software firm interested in learning what consumers have to say about specific elements of
our product, we would need to use a topic modeling algorithm to examine our comments instead of spending hours trying
to figure out which messages are talking about our topics of interest.

 A topic model groups feedback that is comparable, as well as phrases and expressions that appear most frequently,
by recognizing patterns such as word frequency and distance between words. We may rapidly infer what each group of
texts is about using this information.

 Five algorithms are particularly used for topic modeling. We are going to learn about the methods, taking
help from OpenGenus.

1. Latent Dirichlet Allocation (LDA):

 The statistical and graphical concept of Latent Dirichlet Allocation is used to find correlations between many
documents in a corpus. The greatest likelihood estimate from the entire corpus of text is obtained using the Variational
Expectation Maximization (VEM) technique. Traditionally this is solved by selecting the top few words from a bag of
words, but such a word list on its own carries little meaning. According to this approach, each document may be
represented by a probabilistic distribution of subjects, and each topic can be defined by a probabilistic distribution of
words. As a result, we have a much better picture of how the topics are related.

 Example of LDA (Source)

 Consider the following scenario: you have a corpus of 1000 documents. The bag of words is made up of 1000
common words after preprocessing the corpus. We can determine the subjects that are relevant to each document using
LDA.

 The extraction of data from a corpus of data is therefore made straightforward. The upper level represents the
documents, the middle level represents the produced themes, and the bottom level represents the words in the diagram
above.

 As a result, the rule indicates that a text is represented as a distribution of themes, and topics are described as a
distribution of words.
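 A minimal sketch of this document-topic / topic-word view using scikit-learn's LatentDirichletAllocation; the toy
corpus and the choice of two topics are illustrative assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the football match with a late goal",
    "the player scored in the basketball game",
    "the election results were announced by the government",
    "parliament passed a new law after the election",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                    # bag-of-words document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                     # per-document topic distribution

terms = vectorizer.get_feature_names_out()
for k, component in enumerate(lda.components_):       # per-topic word distribution
    print("Topic", k, ":", [terms[i] for i in component.argsort()[-5:]])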

2. Non Negative Matrix Factorization (NMF):

 NMF is a matrix factorization method that ensures the non-negative elements of the factorized matrices. Consider
the document-term matrix produced after deleting stopwords from a corpus. The term-topic matrix and the topic-
document matrix are two matrices that may be factored out of the matrix.

 Matrix factorization may be accomplished using a variety of optimization methods. NMF may be performed more
quickly and effectively using Hierarchical Alternating Least Square. The factorization takes place in this case by updating
one column at a time while leaving the other columns unchanged.
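 The same idea with scikit-learn's NMF: the document-term matrix is factored into a topic-document matrix W and a
term-topic matrix H. The toy corpus, the TF-IDF weighting and the number of topics are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "the team won the football match with a late goal",
    "the player scored in the basketball game",
    "the election results were announced by the government",
    "parliament passed a new law after the election",
]

tfidf = TfidfVectorizer(stop_words="english")
A = tfidf.fit_transform(docs)           # document-term matrix (stopwords removed)

nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(A)                # documents x topics
H = nmf.components_                     # topics x terms

terms = tfidf.get_feature_names_out()
for k, row in enumerate(H):
    print("Topic", k, ":", [terms[i] for i in row.argsort()[-5:]])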

3. Latent Semantic Analysis (LSA):

 Latent Semantic Analysis is another unsupervised learning approach for extracting relationships between words in a
large number of documents. This assists us in selecting the appropriate documents.

 It essentially serves as a dimensionality reduction tool for the massive corpus of text data, in which extraneous data
adds noise to the process of extracting the proper insights from the data.
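 Since LSA amounts to a truncated SVD of the (typically TF-IDF weighted) document-term matrix, a minimal sketch
with scikit-learn's TruncatedSVD looks as follows; the corpus and the two latent dimensions are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "cats and dogs are popular pets",
    "dogs chase cats around the garden",
    "stock prices fell sharply on the market",
    "investors sold shares as the market dropped",
]

A = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)    # keep 2 latent dimensions
doc_vectors = lsa.fit_transform(A)                    # documents in the reduced topic space
print(doc_vectors.shape)                              # (4, 2)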

4. Parallel Latent Dirichlet Allocation:

 Partially Labeled Dirichlet Allocation is another name for it. The model implies that there are a total of n labels,
each of which is associated with a different subject in the corpus.

 Then, similar to the LDA, the individual themes are represented as the probability distribution of the entire corpus.
Optionally, each document might be allocated a global subject, resulting in l global topics, where l is the number of
individual documents in the corpus.

 The technique also assumes that every subject in the corpus has just one label. In comparison to the other
approaches, this procedure is highly rapid and exact because the labels are supplied before creating the model.

5. Pachinko Allocation Model (PAM):

 The Pachinko Allocation Model (PAM) is a more advanced version of the Latent Dirichlet Allocation Model. The
LDA model identifies themes based on thematic correlations between words in the corpus, bringing out the correlation
between words. PAM, on the other hand, makes do by modeling the correlation between the produced themes. Because it
additionally considers the link between subjects, this model has more ability in determining the semantic relationship
precisely. Pachinko is a popular Japanese game, and the model is named for it. To explore the association between
themes, the model uses Directed Acyclic Graphs (DAGs).

 ***************

CCS369 TEXT AND SPEECH ANALYSIS

UNIT III QUESTION ANSWERING AND DIALOGUE SYSTEMS


Information retrieval – IR-based question answering – knowledge-based question answering –
language models for QA – classic QA models – chatbots – Design of dialogue systems -–
evaluating dialogue systems.

Information retrieval:
What is text information retrieval?
• Text retrieval is to return relevant textual documents from a given collection, according to
users' information needs as declared in a query.
• The main difference from database retrieval concerns the nature of the information: unstructured
text rather than structured records.

Information Retrieval (IR) can be defined as a software program that deals with the
organization, storage, retrieval, and evaluation of information from document repositories,
particularly textual information. Information retrieval is the activity of obtaining material,
usually documents of an unstructured nature (typically text), that satisfies an information
need from within large collections stored on computers.

Examples:
Vector-space, Boolean and Probabilistic IR models. In this system, the retrieval of information
depends on documents containing the defined set of queries.

What is an IR Model?
An Information Retrieval (IR) model selects and ranks the document that is required by the
user or the user has asked for in the form of a query. The documents and the queries are
represented in a similar manner, so that document selection and ranking can be formalized by
a matching function that returns a retrieval status value (RSV) for each document in the
collection. Many of the Information Retrieval systems represent document contents by a set of

descriptors, called terms, belonging to a vocabulary V. An IR model determines the query-


document matching function according to four main approaches:

Components of Information Retrieval/ IR Model


 Acquisition: In this step, the selection of documents and other objects from various web
resources that consist of text-based documents takes place. The required data is collected
by web crawlers and stored in the database.
 Representation: It consists of indexing that contains free-text terms, controlled
vocabulary, manual & automatic techniques as well. example: Abstracting contains

summarizing and Bibliographic description that contains author, title, sources, data, and
metadata.

 File Organization: There are two types of file organization methods. i.e. Sequential: It
contains documents by document data. Inverted: It contains term by term, list of records
under each term. Combination of both.
 Query: An IR process starts when a user enters a query into the system. Queries are formal
statements of information needs, for example, search strings in web search engines. In
information retrieval, a query does not uniquely identify a single object in the collection.
Instead, several objects may match the query, perhaps with different degrees of relevancy.

Difference Between Information Retrieval and Data Retrieval

1. Information Retrieval deals with the organization, storage, retrieval, and evaluation of
information from document repositories, particularly textual information. Data Retrieval deals
with obtaining data from a database management system such as an ODBMS; it is the process of
identifying and retrieving data from the database based on the query provided by the user or
application.
2. Information Retrieval retrieves information about a subject; Data Retrieval determines the
keywords in the user query and retrieves the data.
3. In Information Retrieval, small errors are likely to go unnoticed; in Data Retrieval, a single
error object means total failure.
4. Information Retrieval content is not always well structured and is semantically ambiguous;
Data Retrieval works on data with a well-defined structure and semantics.
5. Information Retrieval does not provide a solution to the user of the database system; Data
Retrieval provides solutions to the user of the database system.
6. Information Retrieval results are approximate matches; Data Retrieval results are exact
matches.
7. Information Retrieval results are ordered by relevance; Data Retrieval results are not ordered
by relevance.
8. Information Retrieval follows a probabilistic model; Data Retrieval follows a deterministic
model.


The User Task: The information first is supposed to be translated into a query by the user. In
the information retrieval system, there is a set of words that convey the semantics of the
information that is required whereas, in a data retrieval system, a query expression is used to
convey the constraints which are satisfied by the objects.

 Logical View of the Documents: A long time ago, documents were represented through a
set of index terms or keywords. Nowadays, modern computers represent documents by a
full set of words which reduces the set of representative keywords. This can be done by
eliminating stopwords i.e. articles and connectives. These operations are text operations.
These text operations reduce the complexity of the document representation from full text
to set of index terms.

Past, Present, and Future of Information Retrieval


1. Early Developments: As there was an increase in the need for a lot of information, it
became necessary to build data structures to get faster access. The index is the data structure
for faster retrieval of information. Over centuries manual categorization of hierarchies was
done for indexes.
2. Information Retrieval In Libraries: Libraries were the first to adopt IR systems for
information retrieval. In first-generation, it consisted, automation of previous technologies,
and the search was based on author name and title. In the second generation, it included
searching by subject heading, keywords, etc. In the third generation, it consisted of graphical
interfaces, electronic forms, hypertext features, etc.

3. The Web and Digital Libraries: It is cheaper than various sources of information, it
provides greater access to networks due to digital communication and it gives free access to
publish on a larger medium.
Advantages of Information Retrieval
1. Efficient Access: Information retrieval techniques make it possible for users to easily
locate and retrieve vast amounts of data or information.
2. Personalization of Results: User profiling and personalization techniques are used in
information retrieval models to tailor search results to individual preferences and behaviors.
3. Scalability: Information retrieval models are capable of handling increasing data volumes.
4. Precision: These systems can provide highly accurate and relevant search results, reducing
the likelihood of irrelevant information appearing in search results.

Disadvantages of Information Retrieval

1. Information Overload: When a lot of information is available, users often face


information overload, making it difficult to find the most useful and relevant material.
2. Lack of Context: Information retrieval systems may fail to understand the context of a
user’s query, potentially leading to inaccurate results.
3. Privacy and Security Concerns: As information retrieval systems often access sensitive
user data, they can raise privacy and security concerns.
4. Maintenance Challenges: Keeping these systems up-to-date and effective requires
ongoing efforts, including regular updates, data cleaning, and algorithm adjustments.
5. Bias and fairness: Ensuring that information retrieval systems do not exhibit biases and
provide fair and unbiased results is a crucial challenge, especially in contexts like web search
engines and recommendation systems.

IR-based question answering:

What is IR based question answering?


IR-based Factoid Question Answering. The goal of information retrieval based question
answering is to answer a user's question by finding short text segments on the web or some
other collection of documents.

What is a question-answering System?


Question answering (QA) is a field of natural language processing (NLP) and artificial
intelligence (AI) that aims to develop systems that can understand and answer questions posed
in natural language.
How does a natural language question-answering system work?
A natural language question-answering (QA) system is a computer program that automatically
answers questions using NLP. The basic process of a natural language QA system includes the
following steps:
1. Text pre-processing: The question is pre-processed to remove irrelevant information and
standardise the text’s format. This step includes tokenisation, lemmatisation, and stop-word
removal, among others.
2. Question understanding: The pre-processed question is analysed to extract the relevant
entities and concepts and to identify the type of question being asked. This step can be done
using natural language processing (NLP) techniques such as named entity
recognition, dependency parsing, and part-of-speech tagging.
3. Information retrieval: The question is used to search a database or corpus of text to retrieve
the most relevant information. This can be done using information retrieval techniques such as
keyword search or semantic search.
4. Answer generation: The retrieved information is analysed to extract the specific answer to the
question. This can be done using various techniques, such as machine learning algorithms, rule-
based systems, or a combination.
5. Ranking: The extracted answers are ranked based on relevance and confidence score.

Types of question answering system

1. Information retrieval-based QA
Information retrieval-based question answering (QA) is a method of automatically answering
questions by searching for relevant documents or passages that contain the answer. This
approach uses information retrieval techniques, such as keyword or semantic search, to identify
the documents or passages most likely to hold the answer to a given question.

2. Knowledge-based QA
Knowledge-based question answering (QA) automatically answers questions using a knowledge
base, such as a database or ontology, to retrieve the relevant information. This strategy’s
foundation is that searching for a structured knowledge base for a question can yield the answer.
Knowledge-based QA systems are generally more accurate and reliable than other QA
approaches based on structured and well-curated knowledge.

3. Generative QA
Generative question answering (QA) automatically answers questions using a generative model,
such as a neural network, to generate a natural language answer to a given question.

This method is based on the idea that a machine can be taught to understand and create text in
natural language to provide a correct answer in terms of grammar and meaning.
4. Hybrid QA
Hybrid question answering (QA) automatically answers questions by combining multiple QA
approaches, such as information retrieval-based, knowledge-based, and generative QA. This
approach is based on the idea that different QA approaches have their strengths and weaknesses,
and by combining them, the overall performance of the QA system can be improved.
5. Rule-based QA
Rule-based question answering (QA) automatically answers questions using a predefined set of
rules based on keywords or patterns in the question. This approach is based on the idea that
many questions can be answered by matching the question to a set of predefined rules or
templates.

Applications:

 Customer service
 Search engines
 Healthcare
 Education
 Finance
 E-commerce
 Voice assistants
 Chatbots
 Virtual assistants
 Business intelligence

Tools:

 TensorFlow
 BERT
 GPT-3
 Hugging Face
 SpaCy
 NLTK
 OpenNLP

knowledge-based question answering:

What is knowledge based question answering?


Knowledge-based question answering (KBQA) is the task of finding answers to questions by
processing a structured knowledge base (KB). A KB consists of a set of entities E, a set of
relations R, and a set of literals S.

Knowledge-based question answering (KBQA) in text and speech analysis involves using
structured knowledge bases or ontologies to answer questions posed in natural language.
This approach contrasts with traditional information retrieval systems, which primarily
match keywords or phrases to documents. Here's an overview of how KBQA works:

Knowledge Representation: KBQA systems rely on structured knowledge


representations such as ontologies, knowledge graphs, or semantic networks. These
representations capture entities, their attributes, relationships, and hierarchies in a
formalized manner.

Natural Language Understanding: The system analyzes the natural language question
to understand its meaning, including entity mentions, relationships, and constraints
implied by the question. Techniques such as part-of-speech tagging, named entity
recognition, dependency parsing, and semantic role labeling are often used.

Query Formulation: Based on the understanding of the question, the system formulates
a structured query that can be executed against the knowledge base. This query typically
involves selecting relevant entities, properties, and relationships to retrieve the desired
information.

Knowledge Base Querying: The formulated query is executed against the knowledge
base to retrieve relevant information. This process may involve querying a structured
database, a knowledge graph, or accessing external sources such as linked data on the
web.

Answer Generation: Once the relevant information is retrieved from the knowledge
base, it is processed to generate a natural language answer that directly addresses the
user's question. This may involve aggregating and summarizing information, as well as
ensuring that the answer is fluent and grammatically correct.

Response Presentation: Finally, the generated answer is presented to the user through
the appropriate interface, whether it's a text-based response in a chatbot interface or
synthesized speech in a voice-based interaction.

KBQA systems can vary in complexity and sophistication, ranging from simple rule-
based approaches to more advanced systems leveraging machine learning and natural
language processing techniques.

language models for QA:


These models can predict any word in a sentence or body of text by using every other
word in the text. Examining text bidirectionally increases result accuracy. This type is
often used in machine learning models and speech generation applications.

What is language model in speech?


Language models rely on acoustic models to convert analog speech waves into digital
and discrete phonemes that form the building blocks of words.

Large language Models

Challenges with Language Modeling?


Formal languages (like a programming language) are precisely defined. All the words
and their usage is predefined in the system. Anyone who knows a specific programming
language can understand what’s written without any formal specification.

Machines only understand the language of numbers. For creating language models, it is
necessary to convert all the words into a sequence of numbers. For the modellers, this is
known as encodings.

How does Language Model Works?

Language Models determine the probability of the next word by analyzing the text in
data. These models interpret the data by feeding it through algorithms.

The algorithms are responsible for creating rules for the context in natural language. The
models are prepared for the prediction of words by learning the features and
characteristics of a language. With this learning, the model prepares itself for
understanding phrases and predicting the next words in sentences.

For training a language model, a number of probabilistic approaches are used. These
approaches vary on the basis of the purpose for which a language model is created. The
amount of text data to be analyzed and the math applied for analysis makes a difference
in the approach followed for creating and training a language model.

For example, a language model used for predicting the next word in a search query will
be absolutely different from those used in predicting the next word in a long document
(such as Google Docs). The approach followed to train the model would be unique in
both cases.

Types of Language Models:

There are primarily two types of language models:

1. Statistical Language Models


Statistical models include the development of probabilistic models that are able to predict
the next word in the sequence, given the words that precede it. A number of statistical
language models are in use already.
Let’s take a look at some of those popular models:
 N-Gram
 Unigram
 Bidirectional
 Exponential
 Continuous Space
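
As a concrete instance of the N-gram family listed above, the sketch below estimates bigram
probabilities P(next word | previous word) from raw counts; the toy corpus is an illustrative assumption.

from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def bigram_prob(prev, word):
    # Maximum-likelihood estimate: count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))   # "the cat" occurs 2 times, "the" occurs 4 times -> 0.5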

2. Neural Language Models

These language models are based on neural networks and are often considered as an advanced
approach to execute NLP tasks. Neural language models overcome the shortcomings of
classical models such as n-gram and are used for complex tasks such as speech recognition or
machine translation.

Some Common Examples of Language Models:

1. Speech Recognition
2. Machine Translation
3. Sentiment Analysis
4. Text Suggestions
5. Parsing Tools
6. Text Classification
7. Dialog Systems and Creative Writing
8. Text Summarization

Modern Questions Answering System

How to Train a Question-Answering Machine Learning Model

Common Challenges in NLP Language Models:

1) Long-Term Dependency
2) Low-Resource Languages
3) Sarcasm and Irony
4) Handling Noisy Text
5) Contextual Ambiguity

classic QA models:
Classic question-answering (QA) models in text and speech analysis have evolved
over the years. Here are some of the classic models:

1.Information Retrieval (IR) Models: These models are based on retrieving


relevant documents or passages from a collection in response to a query. Classic IR
models include:

Vector Space Model (VSM): Represents documents and queries as vectors in a high-
dimensional space and computes similarity scores between them.

Term Frequency-Inverse Document Frequency (TF-IDF): Measures the importance of a


term in a document relative to a corpus.
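
A minimal sketch of this classic retrieval idea applied to question answering: represent candidate
passages and the question as TF-IDF vectors and return the passage with the highest cosine similarity.
The passages and the question are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Paris is the capital and largest city of France.",
    "The Amazon is the largest rainforest in the world.",
    "Mount Everest is the highest mountain above sea level.",
]
question = "What is the capital of France?"

vectorizer = TfidfVectorizer(stop_words="english")
P = vectorizer.fit_transform(passages)         # passage vectors
q = vectorizer.transform([question])           # question vector in the same space

scores = cosine_similarity(q, P)[0]            # one retrieval score per passage
best = scores.argmax()
print(passages[best], scores[best])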

Web based Questions and Answering Models

2.Rule-based QA Systems: These systems rely on handcrafted rules to parse questions and
retrieve relevant information from structured or semi-structured data sources. Classic examples
include:

ELIZA: A rule-based natural language processing program that simulates a conversation by


following patterns and rules.

ALICE: Another early chatbot that uses pattern matching and predefined responses.

3.Statistical QA Models: These models utilize statistical techniques to analyze text and
generate answers. Classic examples include:

IBM Watson: Utilizes a combination of statistical techniques, natural language processing, and
machine learning to understand and answer questions.

DeepQA: The architecture behind IBM Watson, which combines various algorithms and
techniques for question answering.

4.Neural QA Models: These models leverage neural networks to understand and answer
questions. Classic examples include:

Memory Networks: Models designed to store and retrieve information from memory, useful
for tasks like question answering.

Attention Mechanisms: Mechanisms that allow neural networks to focus on relevant parts of
the input, improving performance in QA tasks.

Transformer-based Models: Models like BERT (Bidirectional Encoder Representations from


Transformers) and GPT (Generative Pre-trained Transformer) have shown significant
advancements in QA tasks.

Questions and Answering – NLP Projects

5.Graph-based QA Models: These models represent text or knowledge as graphs and perform
reasoning over them to answer questions. Classic examples include:

Knowledge Graphs: Represent structured knowledge as graphs and perform graph-based


reasoning to answer questions.

Graph Neural Networks (GNNs): Neural networks designed to operate on graph-structured


data, which can be used for QA tasks involving graph representations.

Each of these classic models has its strengths and weaknesses, and modern QA systems often
combine multiple approaches for improved performance.

What are the uses of question answering system?

Question answering is commonly used to build conversational client applications, which


include social media applications, chat bots, and speech-enabled desktop applications.

What are the 5 applications of NLP?


NLP business applications come in different forms and are so common these days. For
example, spell checkers, online search, translators, voice assistants, spam filters, and
autocorrect are all NLP applications.

Chatbots:
The role of chatbots in NLP lies in their ability to understand and respond to natural
language input from users. This means that rather than relying on specific commands or
keywords like traditional computer programs, chatbots can process human-like questions and
responses.

What are the main types of chatbots?

Depending on their capabilities, chatbots can be simple, intelligent, and hybrid.

1.Simple bots are quite basic tools that rely on natural language processing. They can
understand and respond to human queries with certain actions that are based on keywords and
phrases. This type of bots has a defined rule-based decision tree (or RBDT), which helps users
find needed information. FAQ chatbot is a perfect example of a simple bot.

2.Intelligent chatbots, which are also known as virtual assistants or virtual agents, are powered
by artificial intelligence and are much more complicated than simple chatbots. They can
understand human written and oral language and, which is more important, the context behind
it.

3.Hybrid chatbots are bots that are partially automated, meaning that they lead conversations
until a human interaction is required. They might have the same functionality as simple bots,
but a user can opt for a person when needed.

Chatbots can be powerful tools in text and speech analysis due to their ability to process large
amounts of data quickly and efficiently. Here's how they're used:

1.Text Analysis: Chatbots can analyze text data to extract valuable insights such as sentiment
analysis, topic modeling, keyword extraction, and named entity recognition. They can

understand the context of the conversation and provide relevant responses or take appropriate
actions based on the analysis.

2.Speech Recognition: With advancements in natural language processing (NLP) and speech
recognition technology, chatbots can transcribe spoken language into text. This text data can
then be further analyzed using text analysis techniques mentioned above.

3.Sentiment Analysis: Chatbots can analyze the sentiment expressed in text or speech, helping
businesses gauge customer satisfaction, detect issues, or monitor public opinion about their
products or services.

4.Customer Support: Chatbots are commonly used in customer support to analyze customer
queries and provide appropriate responses. They can understand the intent behind the
customer's message and either provide a solution or escalate the query to a human agent if
necessary.

Educational AI Chatbots for Content and Languages Integrated Learning

5.Market Research: Chatbots can be deployed to gather and analyze textual data from social
media, forums, or surveys to understand consumer preferences, trends, and feedback on
products or services.

6.Language Translation: Chatbots equipped with language translation capabilities can analyze
and translate text or speech from one language to another, facilitating communication across
linguistic barriers.

7.Personalization: By analyzing user interactions and preferences, chatbots can personalize


responses and recommendations, improving user experience and engagement.

Chatbots Terminologies:

 Quick reply
 Hybrid Chat
 Intent
 Sentiment analysis
 Compulsory input
 Optional input
 Decision trees

Benefits of commercial chatbots:


 Help customers find what they need much faster
 Can easily substitute a seller
 Always available at your customers’ fingertips

Design of dialogue systems:


A Dialogue System is a system which interacts with humans in natural language. At present
many universities are developing dialogue systems in their regional languages. Dialogue
systems, also known as conversational agents or chatbots, are designed to interact with users in
a natural and human-like manner. They can be implemented in various forms, including text-
based interfaces like messaging apps or speech-based interfaces like virtual assistants.

Architecture of spoken dialogue systems

What is speech dialog system?

A spoken dialog system (SDS) is a computer system able to converse with a human with voice.
It has two essential components that do not exist in a written text dialog system: a speech
recognizer and a text-to-speech module (written text dialog systems usually use other input
systems provided by an OS).

What is an example of a dialogue system?

Examples of dialogue systems in action include chatbots, food ordering apps, website AI
assistants, automated customer support service, self-checkout systems, etc.

What are types of dialogue systems?


 Rule-based systems,
 Statistical systems,
 Neural networks

Components of Dialogue System:

A Dialogue system has mainly seven components, among them the Response Generator. The key
components are described below:


1.Natural Language Understanding (NLU):

NLU is crucial for dialogue systems to comprehend user inputs accurately. It involves tasks
such as intent classification, entity recognition, and sentiment analysis.

In text analysis, techniques like natural language processing (NLP) and machine learning
models are used to parse and understand the meaning of user messages.

In speech analysis, automatic speech recognition (ASR) systems convert spoken language into
text, which is then processed using NLU techniques.
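
As an illustration of the intent-classification part of NLU, the sketch below trains a tiny TF-IDF plus
logistic-regression classifier with scikit-learn; the intents and training utterances are toy assumptions,
not a production NLU model.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = ["book a table for two", "reserve a table tonight",
              "what is the weather today", "will it rain tomorrow"]
intents = ["book_table", "book_table", "get_weather", "get_weather"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(utterances, intents)                                # train the intent classifier
print(clf.predict(["is it going to rain this evening"]))    # expected intent: get_weather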

2.Dialogue Management:

Dialogue management is responsible for determining the system's response based on the user's
input and the current context of the conversation.

In text-based systems, dialogue management often involves maintaining a conversation state,


tracking the dialogue history, and selecting appropriate responses using rule-based systems or
machine learning algorithms.

In speech-based systems, dialogue management may also incorporate speech recognition results
and handle interruptions or errors in speech input.

3.Response Generation:

Response generation involves creating human-like responses to user inputs. This can be
achieved using templates, rule-based systems, or machine learning models like neural networks.

In text-based systems, response generation may involve generating text using language
generation techniques such as neural language models (e.g., GPT).

In speech-based systems, text-to-speech (TTS) synthesis is used to convert textual responses


into spoken language.

4.User Experience (UX) Design:

UX design focuses on creating a smooth and intuitive interaction between users and dialogue
systems.

In text-based systems, UX design includes considerations such as message formatting, response


timing, and error handling.

In speech-based systems, UX design involves designing voice prompts, handling interruptions


gracefully, and providing feedback through speech.

5.Feedback and Adaptation:

Dialogue systems should be able to learn and adapt based on user feedback to improve their
performance over time.

Techniques such as reinforcement learning can be used to optimize dialogue policies based on
user interactions.

In text-based systems, sentiment analysis can be used to gauge user satisfaction and adjust
system behavior accordingly.

In speech-based systems, user feedback can be collected through voice commands or post-
interaction surveys.

6.Multi-Modality:

Some dialogue systems incorporate both text and speech modalities to provide a more versatile
user experience.

Multi-modal systems must seamlessly integrate text and speech processing components while
maintaining consistency across modalities.

7.Privacy and Security:

Dialogue systems often handle sensitive information, so ensuring privacy and security is
paramount.

Techniques such as end-to-end encryption and secure data handling practices should be
implemented to protect user data.

Dialog Design

Classification of Dialogue System:

Evaluating dialogue systems:

What is the dialogue system architecture?


While the architecture of Dialogue Systems can vary, they typically follow the same sequence
of phases: Input Recognition, Natural Language Understanding, Dialogue Management,
Response Generation, and Output Rendering.

Evaluating dialogue systems, whether they operate through text or speech analysis, involves
assessing various aspects of their performance, including accuracy, effectiveness, user
satisfaction, and scalability. Here are some common evaluation metrics and methodologies for
both text and speech-based dialogue systems:

Survey on evaluation methods for Dialog Systems

Text-Based Dialogue Systems:

Accuracy Metrics:

Intent Classification Accuracy: Measures how accurately the system classifies user intents
based on their input.

Entity Recognition F1 Score: Evaluates the system's ability to correctly identify and extract
entities from user messages.

Response Generation Quality: Assess the coherence, relevance, and grammatical correctness of
the generated responses using metrics like BLEU, ROUGE, or human judgment.

Effectiveness Metrics:

Task Completion Rate: Determines the percentage of user queries or tasks successfully
completed by the system without errors.

Response Latency: Measures the time taken by the system to respond to user inputs, aiming for
low latency to improve user experience.

User Satisfaction:

User Surveys: Collect feedback from users through surveys to assess their satisfaction with the
dialogue system's performance, usability, and helpfulness.

User Ratings: Users can rate their interactions with the system on a scale, providing quantitative
feedback on their satisfaction levels.

Error Analysis: Analyze common errors made by the system, such as misclassification of


intents, incorrect entity recognition, or nonsensical responses, to identify areas for
improvement.

Robustness and Adaptability:

Evaluate how well the system handles variations in user input, including typos, slang, or
ambiguous language.


Speech-Based Dialogue Systems:

Speech Recognition Accuracy:

Word Error Rate (WER): Measures the accuracy of the system's speech recognition component
by comparing the transcribed text with the ground truth.

Phoneme Error Rate (PER): Evaluates the accuracy of phoneme-level transcription in speech
recognition.
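
WER is the word-level edit distance (substitutions, insertions and deletions) between the recognized
text and the reference transcript, divided by the number of reference words. A minimal sketch:

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))   # 1 deletion / 6 words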

Naturalness of Speech Synthesis:

Mean Opinion Score (MOS): Collect subjective ratings from human listeners on the
naturalness, intelligibility, and overall quality of synthesized speech.

Task Completion Rate:

Similar to text-based systems, assess the percentage of user queries or tasks successfully
completed by the system through spoken interactions.

Speech Interaction Latency:

Measure the time taken by the system to process spoken input, recognize intents, generate
responses, and synthesize speech, aiming for minimal latency.

Noise Robustness:

Evaluate the system's performance in noisy environments by introducing background noise and
assessing its impact on speech recognition accuracy and speech synthesis intelligibility.

User Experience in Speech-Based Interactions: Conduct user studies to gather feedback on


the ease of use, naturalness, and effectiveness of speech-based interactions with the system.

Multimodal Integration:

Assess the effectiveness of integrating speech recognition and synthesis with other modalities,
such as text-based input and output, to provide a seamless user experience across multiple
channels.

Dialog Management and Language Generation

UNIT-4
TEXT-TO-SPEECH ANALYSIS

Overview. Text normalization. Letter-to-sound. Prosody, Evaluation. Signal


processing - Concatenative and parametric approaches, WaveNet and other
deep learning-based TTS systems

Introduction:

Natural Language Processing (NLP) has become a forefront in Artificial Intelligence,


evident from breakthroughs like GPT-3 and Google's trillion-parameter AI language
model. To delve into NLP, it's essential to start with the basics. Text normalization is a
crucial aspect that involves reducing the randomness of text to a predefined standard,
enhancing efficiency by dealing with a more consistent set of information.

Why Text Normalization?

The primary goal of text normalization, achieved through techniques like stemming and
lemmatization, is to bring diverse linguistic forms closer to a common base form. This
process minimizes variations and aids machines in better understanding and processing
human language.

Example:

Consider the sentence by Jaron Lanier, and how text normalization can be applied to it:

Original Sentence:

“It would be unfair to demand that people cease pirating files when those same people
aren't paid for their participation in very lucrative network schemes..."

Expanding Contractions:

Contractions like "it'll" are expanded to their full forms using a dictionary of
contractions and regular expressions. For example, "it'll" becomes "it will."

CODE:

import re

# A small illustrative contractions dictionary; extend it as needed.
contractions_dict = {"aren't": "are not", "it'll": "it will", "can't": "cannot"}
contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))

def expand_contractions(s, contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)

# 'sentence' is assumed to hold the original text from the example above.
sentence = expand_contractions(sentence)
print(sentence)

Tokenization:

The sentence is segmented into words and sentences using tokenization. This step
involves breaking down the text into smaller units called tokens. For instance, "cease"
and "pirating" become separate tokens.
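
A minimal tokenization sketch using NLTK is shown below (this assumes the nltk package is installed and its 'punkt' tokenizer models can be downloaded):

import nltk
nltk.download('punkt', quiet=True)   # one-time download of the tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = ("It would be unfair to demand that people cease pirating files. "
        "Those same people are not paid for their participation.")
print(sent_tokenize(text))   # list of sentence strings
print(word_tokenize(text))   # ['It', 'would', 'be', 'unfair', ...]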

Removing Punctuations:

Punctuation removal ensures that only alphabetic words are retained, contributing to
text standardization. This step helps in streamlining the dataset for further analysis.

Stemming:
The application of stemming involves reducing words to their word stem or root form.
Porter's algorithm is a common approach, but it may lead to over-stemming or under-
stemming.

Over-stemming: a much larger part of a word is chopped off than is required, which in
turn leads to different words being incorrectly reduced to the same root word or stem.
For example, the words “university” and “universe” both get reduced to “univers”.

Under-stemming: occurs when two or more words that should be reduced to the same
root word are wrongly reduced to different root words. For example, the words “data”
and “datum” get reduced to “dat” and “datu” respectively, instead of the same stem “dat”.


An improvement is demonstrated with the Snowball Stemmer for more accurate


stemming.

The Snowball Stemmer is an algorithm for stemming words, aiming to reduce them to
their base or root form. It is an extension of the Porter Stemmer algorithm and was
developed by Martin Porter. The Snowball Stemmer is designed to be more aggressive
and efficient in stemming words in various languages.
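
As an illustration (assuming NLTK is installed), the short sketch below compares the Porter and Snowball stemmers on a few sample words; the two algorithms can return different stems for the same word:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

for word in ['pirating', 'university', 'universe', 'generously', 'fairly']:
    # Print both stems side by side to see where the algorithms differ.
    print(word, '->', porter.stem(word), '|', snowball.stem(word))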

Lemmatization:
Unlike stemming, lemmatization reduces words to their base form, ensuring the root
word belongs to the language. The WordNet lemmatizer is applied, and performance can
be enhanced further by incorporating parts-of-speech (POS) tagging.
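
A short sketch of lemmatization with NLTK's WordNet lemmatizer is given below (assuming the 'wordnet' corpus has been downloaded); supplying a POS tag changes the result because the lemmatizer assumes nouns by default:

import nltk
nltk.download('wordnet', quiet=True)
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('files'))                       # 'file'  (default POS is noun)
print(lemmatizer.lemmatize('pirating', pos=wordnet.VERB))  # 'pirate'
print(lemmatizer.lemmatize('better', pos=wordnet.ADJ))     # 'good'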


LETTER TO SOUND:
"Letter-to-sound" conversion refers to the process of converting written or typed text,
specifically the letters or characters, into their corresponding sounds or phonetic
representations. This conversion is essential in various applications, such as speech
synthesis, where a computer-generated voice needs to pronounce words accurately.

To make the synthesized output more authentic, these systems analyse the sound of each
letter in context and shape the output tones to resemble human speech. To achieve this, they do the following:

Context Sensitivity: The pronunciation of a word can depend on its context within a
sentence. Therefore, advanced letter-to-sound systems take contextual information into
account to enhance accuracy.

Language Variations: Different languages and dialects have unique pronunciation


rules, making letter-to-sound conversion language-specific. Systems may need
adaptations for specific linguistic characteristics.

Two different kinds of analysis are used for this conversion: rule-based approaches and
statistical approaches.

Rule-Based Approaches:

Rule-based approaches in text-to-speech (TTS) systems rely on predefined linguistic


rules to convert text into phonetic representations. These rules are designed based on
the principles of phonology, orthography, and morphology of the target language. The
rules dictate how letters or graphemes in the input text are mapped to their
corresponding phonemes, considering factors such as context and syllable structure.


Rule Sets:

Rule sets are developed by linguists and phonologists based on the analysis of the target
language's phonological and orthographic characteristics. These rule sets encompass
various pronunciation patterns and phonological rules governing the language.
Linguistic principles, such as phonotactics (permissible phoneme sequences), allophony
(variation of phonemes in different contexts), and syllable structure, inform the
development of rule sets.

Examples:

Examples of common rules in rule-based approaches include:

Vowel Pronunciation: Rules dictating the pronunciation of vowels based on their


position in a word or adjacent consonants (e.g., the pronunciation of "a" in "cat" vs.
"car").

Consonant Clusters: Rules governing the pronunciation of consonant clusters,


including assimilation or deletion of certain consonants (e.g., the pronunciation of "kn"
in "know").

Silent Letters: Rules specifying the pronunciation or suppression of silent letters in


words (e.g., the silent "e" in "cake").
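
The toy sketch below gives the flavour of a rule-based letter-to-sound converter. The rule list and phoneme symbols are purely hypothetical and cover only a handful of patterns; a real rule set written by phonologists is far larger and context-sensitive:

# Toy rule-based letter-to-sound sketch (hypothetical rules, tiny coverage).
RULES = [('kn', 'n'), ('ph', 'f'), ('ch', 'ch'), ('sh', 'sh'),
         ('a', 'ae'), ('e', 'eh'), ('i', 'ih'), ('o', 'aa'), ('u', 'ah')]

def letter_to_sound(word):
    word = word.lower()
    if word.endswith('e'):           # crude silent-e rule
        word = word[:-1]
    phones, i = [], 0
    while i < len(word):
        for grapheme, phone in RULES:        # digraphs are listed first
            if word.startswith(grapheme, i):
                phones.append(phone)
                i += len(grapheme)
                break
        else:                         # default: the letter maps to itself
            phones.append(word[i])
            i += 1
    return phones

print(letter_to_sound('know'))   # ['n', 'aa', 'w']
print(letter_to_sound('cake'))   # ['c', 'ae', 'k']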

Advantages:

Transparency: Rule-based approaches offer transparency in the pronunciation process,


as the rules are explicitly defined and can be understood and modified by linguists and
developers.

Interpretability: The rules used in rule-based systems are interpretable, allowing for
easy debugging and customization based on linguistic knowledge.

Control: Developers have control over the pronunciation process, enabling fine-tuning
of the system to produce desired phonetic outputs.

Limitations:

Handling Irregularities: Rule-based approaches may struggle to handle irregularities


and exceptions in language, such as irregular verb conjugations or irregular phonetic
patterns.

Complexity: Developing comprehensive rule sets for complex languages can be


challenging and labor-intensive, requiring extensive linguistic expertise and resources.


Adaptability: Rule-based systems may lack adaptability to variations in pronunciation


across different dialects or speech styles, as they rely on fixed rule sets that may not
capture all linguistic nuances.

PROSODY:
Prosody refers to the rhythm, pitch, loudness, and intonation patterns of speech. It plays
a crucial role in conveying meaning, emotions, and the speaker's attitude. Prosody
encompasses various elements that contribute to the melodic and rhythmic aspects of
spoken language.

Here are some key aspects and the importance of prosody:

Pitch:

• Pitch refers to the perceived frequency of a speaker's voice. It can be high or low.
• Pitch variations can indicate emphasis, emotion, or changes in meaning. For
example, rising pitch at the end of a sentence can turn a statement into a question.

Rhythm:

• Rhythm is the pattern of stressed and unstressed syllables in speech.


• Rhythmic patterns contribute to the flow and naturalness of speech. Changes in
rhythm can convey emphasis or highlight specific words.

Loudness:

• Loudness refers to the volume or intensity of the speaker's voice.


• Changes in loudness can convey emotions such as excitement, anger, or emphasis
on certain words or phrases.

Tempo:

• Tempo is the speed at which speech is delivered. Variations in tempo can convey
energy, urgency, or a more relaxed mood.


• For instance, a faster tempo might indicate excitement, while a slower tempo could
express sadness.

Intonation:

• Intonation refers to the rising and falling patterns of pitch in connected speech.
• Intonation patterns can convey information about sentence type (statement,
question, command) and the speaker's emotional state.
• They also help listeners interpret the speaker's intended meaning.

Pauses:

• Pauses involve brief breaks in speech.


• Well-placed pauses contribute to the naturalness of speech and help listeners
process information. Pauses can also emphasize certain words or ideas.

Voice Quality:

• Voice quality relates to the characteristics of the speaker's voice, such as breathiness
or roughness.
• Voice quality can convey the speaker's emotional state and add nuance to the
message.

Emotional Expression:

Prosody is a powerful tool for expressing emotions. It can add warmth, enthusiasm, or
seriousness to the spoken words, making communication more engaging and effective.

Semantic Emphasis:

• Prosody helps convey the intended meaning of a sentence.


• By emphasizing certain words through pitch, loudness, or duration, the speaker
can guide the listener's understanding.

Prosody Evaluation in Speech Processing:


Evaluation metrics:

MOS:

A Mean Opinion Score (MOS) serves as a quantitative measure of the overall quality of a
particular event or experience, commonly used in telecommunications to assess the
quality of voice and video sessions. Traditionally, MOS ratings range from 1 (poor) to 5
(excellent), derived as averages from individual parameters scored by human observers
or approximated by objective measurement methods.

Traditionally, a MOS was obtained by having human subjects listen to or watch the
material and give their opinions. Today, automated methods approximate these human
ratings. Standards and methods, such as ITU-T guidelines, define how quality is scored
for things like phone calls and video.

The commonly employed Absolute Category Rating (ACR) scale, ranging from 1 to 5,
classifies quality levels as Excellent (5), Good (4), Fair (3), Poor (2), and Bad (1). An
MOS of approximately 4.3 to 4.5 is deemed excellent, while quality becomes
unacceptable below a MOS of around 3.5.

Low MOS ratings in video and voice calls may result from various factors along the
transmission chain, including hardware and software issues, network-related
impairments such as jitter, latency, and packet loss, which significantly impact
perceived call quality.

PESQ (Perceptual Evaluation of Speech Quality):

‘Good’ and ‘Bad’ Audio:

Typically, PESQ scores are categorized into six bands. The bands below are based on
audio quality tests conducted by Cyara on customers’ international contact numbers:

1.00 – 1.99: No meaning understood with any feasible effort
2.00 – 2.39: Considerable effort required
2.40 – 2.79: Moderate effort required
2.80 – 3.29: Attention necessary; a small amount of effort required
3.30 – 3.79: Attention necessary; no appreciable effort required
3.80 – 4.50: Complete relaxation possible; no effort required

It’s easy to see how frustration can lead a customer to abandon a call when they
encounter a low audio quality score. In such cases, a productive conversation between
both parties becomes virtually impossible. For conversations falling within the range of
2.00 to 2.79, there might be a slight improvement, but it’s also likely to include phrases
such as, “Could you repeat that?” or “Sorry, I can’t hear you”. Typically, this leads to
significant delays in resolving issues and, eventually, customer frustration that might
lead to call abandonment.

It’s worth remembering, of course, that ‘good audio quality’ can vary from one country
to the next and even from one carrier line or contact number to the next. What may be
considered as acceptable in the United States could be deemed unachievable in Brazil.
This is where Cyara’s in-country benchmarks add value. You have the flexibility to create


the dialing patterns that align with your business needs, dialing from within the country
where you want to measure quality.

By proactively assessing your audio quality and benchmarking over time, you can make
more informed decisions. You can also determine which telecommunications providers
to choose and how best to route your calls. This approach enables you to adapt and
optimize your telecommunications approach to meet the distinct needs and expectations
of each region in which you operate.

ToBI:

The ToBI (Tones and Break Indices) framework is particularly relevant in the field of
speech technology and natural language processing.

Text Speech Analysis involves the examination and interpretation of various linguistic
features present in spoken language, including prosody (intonation, rhythm, and stress
patterns), phonetics, syntax, semantics, and pragmatics. The ToBI framework, as
mentioned earlier, provides a standardized system for annotating and modeling
prosodic features in spoken language.

In TSA applications, such as text-to-speech synthesis (TTS), the ToBI framework or


similar models may be used to analyze and generate natural-sounding prosody in
synthesized speech. By incorporating prosodic annotations derived from ToBI or related
frameworks, TTS systems can better capture the intonation, pitch contours, and
rhythmic patterns of human speech, resulting in more expressive and intelligible
synthesized output.

Although ToBI is primarily an annotation standard rather than a synthesis component,
its principles and methodologies for prosodic analysis are applied in TSA-related tasks
such as speech recognition, sentiment analysis, dialogue systems, and other text and
speech processing applications.

Signal Processing in Prosody Evaluation:


1. Pitch Detection:

Signal Processing Techniques: Pitch detection involves the application of algorithms


such as Autocorrelation, Harmonic Product Spectrum, and YIN algorithm to extract the
fundamental frequency (F0) of speech.


Application: These techniques are crucial for accurately identifying and analyzing pitch
variations in speech.
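
A minimal autocorrelation-based pitch estimate is sketched below (assuming NumPy; production systems typically use more robust estimators such as YIN):

import numpy as np

def estimate_f0(frame, sr, fmin=50, fmax=400):
    """Very simple autocorrelation-based pitch estimate for one frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]   # lags 0..N-1
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])                  # strongest periodicity
    return sr / lag

sr = 16000
t = np.arange(0, 0.03, 1 / sr)                 # one 30 ms frame
frame = np.sin(2 * np.pi * 220 * t)            # synthetic 220 Hz tone
print(round(estimate_f0(frame, sr), 1))        # close to 220.0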

2. Spectral Analysis:
Signal Processing Techniques: Spectral analysis, including Fourier Transform and
Short-Time Fourier Transform (STFT), is employed to examine the spectral components
of speech.

Application: This helps extract features related to voice quality, pitch, and other acoustic
characteristics.

3. Waveform Analysis:
Signal Processing Techniques: Time-domain waveform analysis involves examining the
characteristics of speech waveforms, including loudness and duration.

Application: This analysis provides insights into the temporal aspects of speech.

4. Prosody Modeling:
Signal Processing Techniques: Prosody modeling often employs techniques such as
Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) to capture
dynamic patterns of pitch, duration, and intensity.

Application: These models are essential for understanding and synthesizing prosodic
features.

5. Machine Learning Approaches:


Signal Processing Techniques: Machine learning approaches, including supervised
learning algorithms like Support Vector Machines and Neural Networks, are used for
modeling prosody.

Application: These models can recognize emotional prosody or predict prosodic


features, enhancing the overall understanding of speech.

6. Signal Synthesis and Modification:


Signal Processing Techniques: Signal synthesis and modification techniques involve
processes such as pitch shifting and time-stretching to modify prosodic features for
expressive speech synthesis.

Application: These techniques are valuable for creating natural and expressive synthetic
speech.


Concatenative and parametric approaches:

Concatenative TTS
Concatenative TTS relies on high-quality audio clips recordings, which are combined
together to form the speech. At the first step voice actors are recorded saying a range of
speech units, from whole sentences to syllables that are further labeled and segmented by
linguistic units from phones to phrases and sentences forming a huge database. During
speech synthesis, a Text-to-Speech engine searches such database for speech units that
match the input text, concatenates them together and produces an audio file.

Speech Database Collection:

The navigation app has a large database of recorded speech units, including phonemes,
diphones, and words or phrases related to navigation instructions (e.g., "Turn left",

"Continue straight ahead", "Exit on the right").

These recordings are performed by professional voice actors and cover a wide range of

phonetic combinations, prosodic variations, and speaker characteristics.

Text Analysis:

As you navigate, the app analyzes the route and upcoming maneuvers to determine the

appropriate navigation instructions.

The text is broken down into smaller linguistic units, such as phonemes, syllables, or
words, based on the granularity of the speech database.

Unit Selection:

The navigation app selects appropriate speech units from the database based on the

analyzed route and upcoming maneuvers.

It considers factors such as the complexity of the maneuver, road conditions, and traffic
information when selecting units.

Unit Concatenation:


The selected speech units are concatenated together to form the synthesized speech

output for navigation instructions.

The app ensures smooth transitions between adjacent units to maintain naturalness and

fluency in the synthesized speech.

Prosody Generation:

Prosodic features, such as pitch contour, duration, and intensity variations, are

incorporated into the synthesized speech to match the intended linguistic and emotional

expression of the navigation instructions.

The prosody of the concatenated speech units is adjusted dynamically based on the

urgency and importance of the navigation instructions.

Post-processing:

Any additional processing or modifications, such as filtering or normalization, are

applied to the synthesized speech output to ensure quality and intelligibility.

The synthesized speech is evaluated to ensure it meets the desired standards for clarity

and naturalness.


Output Generation:

The final synthesized speech output, providing turn-by-turn navigation instructions, is

produced in the desired audio format.

The synthesized speech is integrated into the navigation app for playback, allowing

drivers to receive voice guidance as they navigate to their destination.

Pros

- High quality of audio in terms of intelligibility;

- Possibility to preserve the original actor’s voice;

Cons

- Such systems are very time consuming because they require huge databases, and hard-

coding the combination to form these words;

- The resulting speech may sound less natural and emotionless, because it is nearly

impossible to get the audio recordings of all possible words spoken in all possible

combinations of emotions, prosody, stress, etc.

Parametric approaches:

Parametric approaches in text-to-speech (TTS) synthesis involve generating speech


using mathematical models and parameters that describe the characteristics of speech,
such as pitch, duration, and spectral envelope.

Text Analysis:

Analyze the input text to extract linguistic features, such as phonemes, prosodic cues,
and contextual information.

Tokenize the text into smaller linguistic units, such as words, syllables, or phonemes,
depending on the granularity of the synthesis model.

Example: Consider the input text "The quick brown fox jumps over the lazy dog."

Tokenize the text into words: ["The", "quick", "brown", "fox", "jumps", "over", "the",
"lazy", "dog"].


Feature Extraction:

Extract relevant linguistic features from the analyzed text, including phonetic content,
stress patterns, syntactic structure, and semantic information.

Use text analysis techniques, such as part-of-speech tagging, language modeling, and
phonetic transcription, to infer linguistic features from the input text.

Example: Analyze the word "jumps" from the input text.

Extract phonetic content (/dʒʌmps/), stress pattern (stressed-unstressed), and part-of-


speech tag (verb) for the word "jumps".

Parameter Generation:

Map the extracted linguistic features to speech parameters that describe the
characteristics of the synthesized speech, such as pitch, duration, and spectral envelope.

Train statistical models, generative models, or signal processing algorithms to generate


speech parameters based on the extracted linguistic features.

Adapt the synthesis model to account for variations in speaker characteristics, speaking
styles, and linguistic contexts.

Example: We figure out things like how high or low its pitch should be, how long it
should last, and what it should sound like.

Synthesis:

Use the generated speech parameters to synthesize speech waveforms that match the
desired characteristics of the input text.

Apply signal processing techniques, such as filtering, modulation, and envelope shaping,
to manipulate the speech parameters and produce natural-sounding speech.

Control the synthesis process to ensure appropriate prosody, timing, and intonation in
the synthesized speech output.

Example: For the word "jumps", we adjust things like pitch (making it go up and down),
duration (how long it lasts), and its overall sound to match what we decided earlier.

Post-processing:


Apply any additional processing or modifications to the synthesized speech output, such
as dynamic range compression, equalization, or noise reduction.

Adjust the synthesized speech output to meet quality and intelligibility standards, and
make any necessary enhancements or corrections.

Example: After creating the synthesized speech, we tweak it to sound better: we make
sure it's not too loud or too quiet (dynamic range compression), adjust the sound to
make it clearer and easier to understand (equalization), and remove any background
noise (noise reduction) to make it sound cleaner.

Evaluation:

Evaluate the quality and naturalness of the synthesized speech output using subjective
and/or objective measures.

Collect feedback from listeners or use automated evaluation metrics to assess the
performance of the parametric synthesis model.

Fine-tune the synthesis model based on evaluation results and user feedback to improve
the quality of the synthesized speech output.

Example: People listen to it to see if it sounds natural and clear (subjective evaluation),
and we also use machines to measure things like pitch and sound quality (objective
evaluation).

Comparison:

Speech Generation Method
  Parametric approach: generates speech based on mathematical models and parameters.
  Concatenative approach: generates speech by concatenating pre-recorded speech units.

Data Requirements
  Parametric approach: requires less storage space as it relies on synthesizing speech from parameters.
  Concatenative approach: requires a large database of recorded speech units.

Flexibility
  Parametric approach: offers flexibility in controlling various speech characteristics.
  Concatenative approach: limited flexibility, as it relies on the available recorded speech.

Naturalness
  Parametric approach: may struggle to capture naturalness compared to concatenative approaches.
  Concatenative approach: often produces highly natural-sounding speech due to using recorded speech.

Development Complexity
  Parametric approach: can be complex and computationally intensive to develop and optimize parametric models.
  Concatenative approach: generally simpler to implement and requires less computational resources.

Adaptability
  Parametric approach: adaptable to multiple speakers and languages with appropriate parameterization.
  Concatenative approach: limited adaptability to different speakers and languages.

Storage Requirements
  Parametric approach: requires less storage space compared to concatenative synthesis.
  Concatenative approach: requires significant storage space for the database of recorded speech units.

Introduction to advanced TTS systems:


WaveNet is a deep generative model for speech synthesis developed by DeepMind, a
research company under Alphabet Inc. WaveNet is known for its ability to generate
high-quality, natural-sounding speech waveforms directly from input text.

WaveNet is based on a deep neural network architecture known as a dilated


convolutional neural network (CNN).

The network consists of many layers of dilated convolutions, which allow the network to
capture long-range dependencies in the input data while maintaining computational
efficiency.

WaveNet uses a causal convolutional structure, where each output depends only on
previous input values, making it suitable for sequential data generation tasks like speech
synthesis.
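
A minimal NumPy illustration of a stack of causal dilated convolutions, the building block described above, is sketched below; this is only an illustration of the idea, not the actual WaveNet, which additionally uses gated activations, residual connections, and a softmax over quantized sample values:

import numpy as np

def causal_dilated_conv(x, weights, dilation):
    # output[t] depends only on x[t], x[t - dilation], x[t - 2*dilation], ...
    y = np.zeros_like(x)
    for t in range(len(x)):
        for i, w in enumerate(weights):
            idx = t - i * dilation
            if idx >= 0:
                y[t] += w * x[idx]
    return y

x = np.random.randn(32)          # a short dummy "waveform"
h = x
for dilation in [1, 2, 4, 8]:    # doubling dilations grow the receptive field quickly
    h = np.tanh(causal_dilated_conv(h, weights=np.array([0.5, 0.5]), dilation=dilation))
print(h.shape)                   # (32,) -- each output now sees up to 16 past samples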


Researchers typically avoided modeling raw audio directly because it changes very quickly,
with a lot of information packed into each second.

For example, there can be 16,000 or more samples per second. Building a model that
predicts each of these samples based on all the previous ones is very hard.

However, DeepMind's earlier PixelRNN and PixelCNN models showed that it was possible to
generate detailed images one tiny piece at a time, and even one colour channel at a time.

This success encouraged the researchers to adapt the two-dimensional PixelNets to
one-dimensional audio data, which led to WaveNet.

Enhancing speech synthesis.

A WaveNet is structured as a fully convolutional neural network.

In WaveNet, the convolutional layers have different dilation factors, allowing the
network to cover thousands of time steps and capture complex patterns in the input
audio data.

During training, real audio waveforms from human speakers are used as input
sequences. After training, the network can generate synthetic speech by sampling from
the probability distribution computed by the network at each step.

This sampled value is then fed back into the input, and the process repeats to generate
the next sample. While this step-by-step sampling approach is computationally
expensive, it is crucial for generating realistic-sounding audio with complex patterns.


DeepMind trained WaveNet on Google's TTS datasets to assess its performance. WaveNet's
quality, rated from 1 to 5, was compared with Google's best parametric and concatenative
TTS systems and with human speech using Mean Opinion Scores (MOS). MOS are standard
subjective tests of sound quality, obtained from blind tests with human subjects,
consisting of over 500 ratings on 100 test sentences.

WaveNets significantly reduce the gap between state-of-the-art systems and human-
level performance by over 50% for both US English and Mandarin Chinese.

Given that Google's current TTS systems are considered some of the best worldwide for
both Chinese and English, improving upon them with a single model represents a
significant accomplishment.

Text input conditioning:

To use WaveNet for text-to-speech, we need to provide it with the text we want it to
speak. We transform the text into a sequence of linguistic and phonetic features,
containing information about phonemes, syllables, words, etc.

These features are then fed into WaveNet. As a result, WaveNet's predictions are
conditioned not only on previous audio samples but also on the text input.


If we train WaveNet without the text sequence, it can still generate speech. However, in
this case, it has to make up what to say, lacking the contextual information provided by
the text input.

Other Deep learning-based TTS systems


Introduction to Tacotron 2:

Tacotron 2 is a neural network architecture developed by Google (with a widely used
open-source implementation released by NVIDIA) for generating natural, human-like
speech from text inputs.

Working:

Input Representation:

The input to Tacotron is a sequence of text characters or phonemes representing the


desired speech output. This input sequence is typically encoded using techniques like
one-hot encoding or embedding to convert characters or phonemes into numerical
representations.

Encoder:

The encoded text sequence is fed into an encoder network. The encoder's role is to
process the input text and capture its contextual information, such as linguistic features
and syntactic structure.

The encoder network often consists of convolutional or recurrent layers, such as Long
Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells, to capture temporal
dependencies in the input sequence.

Decoder:

The output of the encoder serves as the initial hidden state of the decoder network. The
decoder's task is to generate a spectrogram representation of the speech signal based on
the encoded text information.

The decoder is typically implemented using recurrent layers, such as LSTM or GRU
cells, and it operates autoregressively, generating one spectrogram frame at a time.

At each time step, the decoder attends to relevant parts of the encoded text sequence
using an attention mechanism. This allows the decoder to focus on different portions of
the input text dynamically as it generates the speech output.


Post-processing:

The spectrogram output from the decoder represents the spectral characteristics of the
speech signal over time. This spectrogram is then passed through a post-processing
stage to convert it into a time-domain waveform.

Techniques such as Griffin-Lim algorithm or neural vocoders like WaveNet or


WaveGlow may be used for waveform synthesis, transforming the spectrogram into a
natural-sounding speech waveform.

Training:

Tacotron is trained in a supervised manner using pairs of input text and corresponding
speech audio data.

During training, the model learns to minimize the difference between the predicted
spectrogram and the ground truth spectrogram derived from the audio data.

Training typically involves optimizing a loss function, such as mean squared error
(MSE) or a combination of spectrogram-based losses and adversarial losses, through
techniques like gradient descent.


Evaluation and Inference:

Once trained, the Tacotron model can be used for inference, where it takes a text input
and generates the corresponding speech waveform.

During inference, the model utilizes the learned parameters to predict the spectrogram
representation of the speech signal based on the input text.

The predicted spectrogram is then converted into a waveform using the same post-
processing techniques employed during training.

UNIT V AUTOMATIC SPEECH RECOGNITION 6
Speech recognition: Acoustic modelling – Feature Extraction - HMM, HMM-DNN systems

SPEECH RECOGNITION
Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or
speech-to-text, is a capability which enables a program to process human speech into a written format. While
it’s commonly confused with voice recognition, speech recognition focuses on the translation of speech from a
verbal format to a text one whereas voice recognition just seeks to identify an individual user’s voice.
Speech recognition, or speech-to-text, is the ability of a machine or program to identify words spoken aloud and
convert them into readable text. Rudimentary speech recognition software has a limited vocabulary and may
only identify words and phrases when spoken clearly. More sophisticated software can handle natural speech,
different accents and various languages.
Speech recognition uses a broad array of research in computer science, linguistics and computer engineering.
Many modern devices and text-focused programs have speech recognition functions in them to allow for easier
or hands-free use of a device.
Speech recognition and voice recognition are two different technologies and should not be confused:
 Speech recognition is used to identify words in spoken language.
 Voice recognition is a biometric technology for identifying an individual's voice.

How does speech recognition work?


Speech recognition systems use computer algorithms to process and interpret spoken words and convert them
into text. A software program turns the sound a microphone records into written language that computers and
humans can understand, following these four steps:
1. analyze the audio;
2. break it into parts;
3. digitize it into a computer-readable format; and
4. use an algorithm to match it to the most suitable text representation.
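
As a quick illustration of these steps in practice, the snippet below uses the third-party SpeechRecognition package (assuming it is installed and that a recording named sample.wav exists; the Google Web Speech backend also needs an internet connection):

import speech_recognition as sr   # pip install SpeechRecognition

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:     # hypothetical recorded speech file
    audio = recognizer.record(source)          # read the whole file into memory

try:
    print(recognizer.recognize_google(audio))  # returns the transcription as text
except sr.UnknownValueError:
    print("Speech was unintelligible")
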
Speech recognition software must adapt to the highly variable and context-specific nature of human speech. The
software algorithms that process and organize audio into text are trained on different speech patterns, speaking
styles, languages, dialects, accents and phrasings. The software also separates spoken audio from background
noise that often accompanies the signal.
To meet these requirements, speech recognition systems use two types of models:
 Acoustic models. These represent the relationship between linguistic units of speech and audio signals.
 Language models. Here, sounds are matched with word sequences to distinguish between words that
sound similar.

What applications is speech recognition used for?


Speech recognition systems have quite a few applications. Here is a sampling of them.
Mobile devices. Smartphones use voice commands for call routing, speech-to-text processing, voice dialing and
voice search. Users can respond to a text without looking at their devices. On Apple iPhones, speech
recognition powers the keyboard and Siri, the virtual assistant. Functionality is available in secondary

languages, too. Speech recognition can also be found in word processing applications like Microsoft Word,
where users can dictate words to be turned into text.
Education. Speech recognition software is used in language instruction. The software hears the user's speech
and offers help with pronunciation.
Customer service. Automated voice assistants listen to customer queries and provide helpful resources.
Healthcare applications. Doctors can use speech recognition software to transcribe notes in real time into
healthcare records.
Disability assistance. Speech recognition software can translate spoken words into text using closed captions to
enable a person with hearing loss to understand what others are saying. Speech recognition can also enable
those with limited use of their hands to work with computers, using voice commands instead of typing.
Court reporting. Software can be used to transcribe courtroom proceedings, precluding the need for human
transcribers.
Emotion recognition. This technology can analyze certain vocal characteristics to determine what emotion the
speaker is feeling. Paired with sentiment analysis, this can reveal how someone feels about a product or service.
Hands-free communication. Drivers use voice control for hands-free communication, controlling phones,
radios and global positioning systems, for instance.

What are the features of speech recognition systems?


Good speech recognition programs let users customize them to their needs. The features that enable this
include:

 Language weighting. This feature tells the algorithm to give special attention to certain words, such
as those spoken frequently or that are unique to the conversation or subject. For example, the
software can be trained to listen for specific product references.
 Acoustic training. The software tunes out ambient noise that pollutes spoken audio. Software
programs with acoustic training can distinguish speaking style, pace and volume amid the din of
many people speaking in an office.
 Speaker labeling. This capability enables a program to label individual participants and identify
their specific contributions to a conversation.
 Profanity filtering. Here, the software filters out undesirable words and language.
What are the different speech recognition algorithms?
The power behind speech recognition features comes from a set of algorithms and technologies. They include
the following:

 Hidden Markov model. HMMs are used in autonomous systems where a state is partially
observable or when all of the information necessary to make a decision is not immediately available
to the sensor (in speech recognition's case, a microphone). An example of this is in acoustic
modeling, where a program must match linguistic units to audio signals using statistical probability.
 Natural language processing. NLP eases and accelerates the speech recognition process.
 N-grams. This simple approach to language models creates a probability distribution for a sequence.
An example would be an algorithm that looks at the last few words spoken, approximates the history
of the sample of speech and uses that to determine the probability of the next word or phrase that
will be spoken.


 Artificial intelligence. AI and machine learning methods like deep learning and neural networks are
common in advanced speech recognition software. These systems use grammar, structure, syntax
and composition of audio and voice signals to process speech. Machine learning systems gain
knowledge with each use, making them well suited for nuances like accents.
What are the advantages of speech recognition?
There are several advantages to using speech recognition software, including the following:
 Machine-to-human communication. The technology enables electronic devices to communicate with
humans in natural language or conversational speech.
 Readily accessible. This software is frequently installed in computers and mobile devices, making it
accessible.
 Easy to use. Well-designed software is straightforward to operate and often runs in the background.
 Continuous, automatic improvement. Speech recognition systems that incorporate AI become more
effective and easier to use over time. As systems complete speech recognition tasks, they generate more
data about human speech and get better at what they do.
What are the disadvantages of speech recognition?
While convenient, speech recognition technology still has a few issues to work through. Limitations include:
 Inconsistent performance. The systems may be unable to capture words accurately because of
variations in pronunciation, lack of support for some languages and inability to sort through background
noise. Ambient noise can be especially challenging. Acoustic training can help filter it out, but these
programs aren't perfect. Sometimes it's impossible to isolate the human voice.
 Speed. Some speech recognition programs take time to deploy and master. The speech processing may
feel relatively slow.
 Source file issues. Speech recognition success depends on the recording equipment used, not just the
software.

Acoustic Modelling
Acoustic modelling of speech typically refers to the process of establishing statistical representations for the
feature vector sequences computed from the speech waveform. The Hidden Markov Model (HMM) is one of the
most common types of acoustic models. Modern speech recognition systems use both an acoustic model and
a language model to represent the statistical properties of speech. The acoustic model models the
relationship between the audio signal and the phonetic units in the language. The language model is responsible
for modeling the word sequences in the language. These two models are combined to get the top-ranked word
sequences corresponding to a given audio segment.
Speech audio characteristics

Audio can be encoded at different sampling rates (i.e. samples per second – the most common being: 8, 16, 32,
44.1, 48, and 96 kHz), and different bits per sample (the most common being: 8-bits, 16-bits, 24-bits or 32-bits).
Speech recognition engines work best if the acoustic model they use was trained with speech audio which was
recorded at the same sampling rate/bits per sample as the speech being recognized.

Telephony-based speech recognition

The limiting factor for telephony based speech recognition is the bandwidth at which speech can be transmitted.
For example, a standard land-line telephone only has a bandwidth of 64 kbit/s at a sampling rate of 8 kHz and 8-
bits per sample (8000 samples per second * 8-bits per sample = 64000 bit/s). Therefore, for telephony based
speech recognition, acoustic models should be trained with 8 kHz/8-bit speech audio files.


In the case of Voice over IP, the codec determines the sampling rate/bits per sample of speech transmission.
Codecs with a higher sampling rate/bits per sample for speech transmission (which improve the sound quality)
necessitate acoustic models trained with audio data that matches that sampling rate/bits per sample.

Desktop-based speech recognition

For speech recognition on a standard desktop PC, the limiting factor is the sound card. Most sound cards today
can record at sampling rates of between 16 kHz-48 kHz of audio, with bit rates of 8 to 16-bits per sample, and
playback at up to 96 kHz.

As a general rule, a speech recognition engine works better with acoustic models trained with speech audio data
recorded at higher sampling rates/bits per sample. But using audio with too high a sampling rate/bits per sample
can slow the recognition engine down. A compromise is needed. Thus for desktop speech recognition, the
current standard is acoustic models trained with speech audio data recorded at sampling rates of 16 kHz/16bits
per sample.

Acoustic modeling is a crucial component in the field of automatic speech recognition (ASR) and various other
applications involving spoken language processing. It is the process of creating a statistical representation of the
relationship between acoustic features and phonemes, words, or other linguistic units in a spoken language.
Acoustic models play a central role in converting spoken language into text and are a key part of the larger ASR
system. Here's how acoustic modeling works:
1. Feature Extraction: The process starts with capturing audio input, which is typically sampled at a high
rate. Feature extraction is performed to convert this raw audio into a more compact and informative
representation. Common acoustic features include Mel-frequency cepstral coefficients (MFCCs) or filterbank
energies. These features capture the spectral characteristics of the audio signal over time.
2. Training Data: Acoustic modeling requires a significant amount of training data, typically consisting of
transcribed audio recordings. This data is used to establish statistical patterns between acoustic features and the
corresponding linguistic units (e.g., phonemes, words).
3. Phoneme or State Modeling: In traditional Hidden Markov Models (HMMs), which have been widely
used in ASR, the acoustic modeling process involves modeling phonemes or states. An HMM represents a
sequence of states, each associated with a specific acoustic observation probability distribution. These states
correspond to phonemes or sub-phonetic units.
4. Building Gaussian Mixture Models (GMMs): For each state or phoneme, a Gaussian Mixture Model
(GMM) is constructed. GMMs are a set of Gaussian distributions that model the likelihood of observing
specific acoustic features given a phoneme or state. These GMMs capture the variation in acoustic features
associated with each phoneme (a small scikit-learn sketch of this step appears at the end of this section).
5. Training the Models: During training, the GMM parameters are estimated to maximize the likelihood
of the observed acoustic features given the transcribed training data. This training process adjusts the means and
covariances of the Gaussian components to fit the observed acoustic data.
6. Decoding: When transcribing new, unseen audio, the acoustic model is used in combination with
language and pronunciation models. The ASR system uses these models to search for the most likely sequence
of phonemes or words that best matches the observed acoustic features. Decoding algorithms like the Viterbi
algorithm are commonly used for this task.
7. Integration: The output of the acoustic model is combined with language and pronunciation models to
generate a final transcription or understanding of the spoken input.
Modern ASR systems have evolved beyond HMM-based approaches, with deep learning techniques, such as
deep neural networks (DNNs) and recurrent neural networks (RNNs), becoming more prevalent in acoustic
modeling. Deep learning models can directly map acoustic features to phonemes or words, bypassing the need

for GMMs and HMMs. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often
used for this purpose. These deep learning models have significantly improved the accuracy of ASR systems,
making them more robust to various accents, noise, and speaking styles.
In summary, acoustic modeling is a crucial step in automatic speech recognition, responsible for establishing
the statistical relationship between acoustic features and linguistic units. This process enables the conversion of
spoken language into text, and advances in deep learning techniques have greatly improved the accuracy and
efficiency of acoustic models in ASR systems.
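
The sketch below illustrates the GMM part of classic GMM-HMM acoustic modeling using scikit-learn. The 13-dimensional "MFCC" frames are synthetic stand-ins for real features, and the two phoneme labels are hypothetical:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
frames_aa = rng.normal(loc=0.0, scale=1.0, size=(500, 13))   # fake frames for phoneme "aa"
frames_iy = rng.normal(loc=2.0, scale=1.0, size=(500, 13))   # fake frames for phoneme "iy"

# One diagonal-covariance GMM per phoneme, as in classic GMM-HMM systems.
gmm_aa = GaussianMixture(n_components=4, covariance_type='diag').fit(frames_aa)
gmm_iy = GaussianMixture(n_components=4, covariance_type='diag').fit(frames_iy)

# Acoustic score of a new frame: log-likelihood under each phoneme's model.
frame = rng.normal(loc=2.0, scale=1.0, size=(1, 13))
print(gmm_aa.score_samples(frame), gmm_iy.score_samples(frame))   # "iy" should score higher
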
Feature Extraction
Feature extraction is a fundamental step in acoustic modeling for tasks like automatic speech recognition (ASR)
and speaker identification. Its primary goal is to convert the raw audio signal into a more compact and
informative representation that captures relevant acoustic characteristics. The choice of acoustic features greatly
impacts the performance of the acoustic model. Here are some common techniques for feature extraction in
acoustic modeling:
1. Mel-Frequency Cepstral Coefficients (MFCCs): MFCCs are one of the most widely used acoustic
features in ASR. They mimic the human auditory system's sensitivity to different frequencies. The MFCC
extraction process typically involves the following steps (a short librosa sketch appears at the end of this section):
- Pre-emphasis: Boosts high-frequency components to compensate for the muffled low frequencies in speech.
- Framing: The audio signal is divided into short overlapping frames, often around 20-30 milliseconds in
duration.
- Windowing: Each frame is multiplied by a windowing function (e.g., Hamming window) to reduce spectral
leakage.
- Fast Fourier Transform (FFT): The power spectrum of each frame is computed using the FFT.
- Mel-filterbank: A set of triangular filters on the Mel-scale is applied to the power spectrum. The resulting
filterbank energies capture the distribution of energy in different frequency bands.
- Logarithm: The logarithm of filterbank energies is taken to simulate the human perception of loudness.
- Discrete Cosine Transform (DCT): DCT is applied to decorrelate the log filterbank energies and produce a
set of MFCC coefficients.
2. Filterbank Energies: These are similar to the intermediate step of MFCC computation but without the
logarithm and DCT steps. Filterbank energies are a set of values that represent the energy in different
frequency bands over time. They are often used in conjunction with MFCCs or as a simpler alternative when the
benefits of MFCCs are not required.
3. Spectrogram: The spectrogram is a visual representation of the spectrum of frequencies in the audio
signal over time. It is often used as a feature for tasks that benefit from a time-frequency representation, such as
music genre classification and environmental sound recognition.
4. Pitch and Fundamental Frequency (F0): Extracting pitch information can be important for certain
applications. Pitch is the perceived frequency of a sound and is often associated with prosody and intonation in
speech.
5. Linear Predictive Coding (LPC): LPC analysis models the speech signal as the output of an all-pole
filter and extracts coefficients that represent the vocal tract's resonances. LPC features are used in speech coding
and sometimes ASR.
6. Perceptual Linear Prediction (PLP) Cepstral Coefficients: PLP is an alternative to MFCCs that
incorporates psychoacoustic principles, modeling the human auditory system's response more closely.


7. Deep Learning-Based Features: In recent years, deep neural networks have been used to learn features
directly from the raw waveform. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs)
can be used to capture high-level representations from audio data.
8. Gammatone Filters: These are designed to more closely mimic the response of the human auditory
system to different frequencies.
The choice of feature extraction method depends on the specific task and the characteristics of the data. For
ASR, MFCCs and filterbank energies are the most commonly used features. However, as deep learning
techniques become more prevalent in acoustic modeling, end-to-end systems that operate directly on raw audio
data are gaining popularity, and feature extraction is becoming integrated into the model architecture.
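
A short feature-extraction sketch with the librosa library is shown below (assuming librosa is installed and a recording named speech.wav exists; the window and hop sizes correspond to roughly 25 ms frames with a 10 ms shift at 16 kHz):

import librosa

y, sr = librosa.load("speech.wav", sr=16000)           # hypothetical speech recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
print(mfcc.shape)                                      # (13, number_of_frames)

# Log-mel filterbank energies, the intermediate representation described above.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)
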
In signal processing, a filter bank (or filterbank) is an array of bandpass filters that separates the input signal
into multiple components, each one carrying a single frequency sub-band of the original signal.

HMM

A Hidden Markov Model (HMM) is a statistical model used for modeling sequential data, where the underlying
system is assumed to be a Markov process with hidden states. HMMs are widely used in various fields,
including speech recognition, natural language processing, bioinformatics, and more.

Let's delve into the details of Hidden Markov Models:

1. Markov Process:
 A Markov process, also known as a Markov chain, is a stochastic model that describes a system's
transitions from one state to another over discrete time steps.
 In a simple Markov chain, the future state of the system depends only on the current state and is
independent of all previous states. This property is called the Markov property.
2. Hidden States:
 In an HMM, there are two sets of states: hidden states and observable states.
 Hidden states represent the unobservable underlying structure of the system. They are
responsible for generating the observable data.
3. Observable Data:


 Observable states are the data that can be directly measured or observed.
 For example, in speech recognition, the hidden states might represent phonemes, while the
observable data are the audio signals.
4. State Transitions:
 An HMM defines the probabilities of transitioning from one hidden state to another. These
transition probabilities are often represented by a transition matrix.
 Transition probabilities can be time-dependent (time-inhomogeneous) or time-independent
(time-homogeneous).
5. Emission Probabilities:
 Emission probabilities specify the likelihood of emitting observable data from a particular hidden
state.
 In the context of speech recognition, these probabilities represent the likelihood of generating a
certain audio signal given the hidden state (e.g., a phoneme).
6. Initialization Probabilities:
 An HMM typically includes initial probabilities, which represent the probability distribution over
the initial hidden states at the start of the sequence.
7. Observations and Inference:
 Given a sequence of observations (observable data), the goal is to infer the most likely sequence
of hidden states.
 This is typically done using algorithms like the Viterbi algorithm, which finds the most probable
sequence of hidden states that generated the observations.
8. Learning HMM Parameters:
 Training an HMM involves estimating its parameters, including transition probabilities, emission
probabilities, and initial state probabilities.
 This can be done using methods like the Baum-Welch algorithm, which is a variant of the
Expectation-Maximization (EM) algorithm.
9. Applications:
 HMMs have a wide range of applications, such as speech recognition, where they can model
phonemes, natural language processing for part-of-speech tagging, bioinformatics for gene
prediction, and more.
10. Limitations:
 HMMs assume that the system is a first-order Markov process, which means it depends only on
the current state. More complex dependencies might require more advanced models.
 HMMs are also sensitive to their initial parameter estimates and might get stuck in local optima
during training.

In summary, Hidden Markov Models are a powerful tool for modeling and analyzing sequential data with
hidden structure. They are used in a variety of fields to uncover underlying patterns and make predictions based
on observed data.
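
A minimal NumPy sketch of Viterbi decoding on a toy HMM follows; the states, symbols, and probabilities are hypothetical and chosen only to illustrate the algorithm:

import numpy as np

# Toy HMM: 2 hidden states, 3 observable symbols (all probabilities hypothetical).
start = np.array([0.6, 0.4])                      # initial state probabilities
trans = np.array([[0.7, 0.3],                     # transition matrix A
                  [0.4, 0.6]])
emit  = np.array([[0.5, 0.4, 0.1],                # emission matrix B
                  [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2]                                # observed symbol indices

def viterbi(obs, start, trans, emit):
    delta = np.log(start) + np.log(emit[:, obs[0]])
    psi = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(trans)   # scores[i, j]: best path ending i -> j
        psi.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + np.log(emit[:, o])
    path = [int(delta.argmax())]
    for back in reversed(psi):                    # backtrack the best state sequence
        path.append(int(back[path[-1]]))
    return path[::-1]

print(viterbi(obs, start, trans, emit))           # [0, 0, 1, 1]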

HMM-DNN
The hybrid HMM-DNN approach in speech recognition makes use of the strong learning
power of the DNN and the sequential modelling ability of the HMM. Because a DNN accepts only fixed-size
inputs, it is difficult for it to handle speech signals directly, since they are variable-length, time-varying signals.
In this approach the HMM deals with the dynamic characteristics of the speech signal while the DNN is
responsible for the observation probabilities. Given the acoustic observations, each output neuron of the DNN is
trained to estimate the posterior probability of a continuous-density HMM state. A DNN trained only in the
usual traditional supervised manner does not always produce good results and can be difficult to bring to an
optimal point. When a set of data is given as input, importance should be given to extracting a variety of data
rather than just a large quantity, because a good classification can later be made from this data.
DNN-HMM systems, also known as Deep Neural Network-Hidden Markov Model systems, are a type of
technology used in automatic speech recognition (ASR) and other sequential data modeling tasks. These
systems combine deep neural networks (DNNs) with Hidden Markov Models (HMMs) to improve the accuracy
and robustness of speech recognition and other related applications. Here's a detailed explanation of DNN-
HMM systems:

1. Hidden Markov Models (HMMs):

- As discussed earlier, HMMs are probabilistic models that describe the temporal evolution of a system. In
ASR, they are used to model the sequence of phonemes or subword units that make up spoken language.

2. Deep Neural Networks (DNNs):

- DNNs are a type of artificial neural network with multiple layers (hence "deep"). They have shown great
success in various machine learning tasks, including image recognition, natural language processing, and speech
processing.
- In the context of DNN-HMM systems, DNNs are used for acoustic modeling. They replace the traditional
Gaussian Mixture Models (GMMs) that were used for modeling acoustic features in older HMM systems.

3. Acoustic Modeling:
- Acoustic modeling in ASR is the process of estimating the likelihood of observing a given acoustic feature
(e.g., a frame of audio) given a particular state in the HMM.
- In DNN-HMM systems, DNNs are used to model this likelihood. They take acoustic features as input and
produce the probability distribution over the set of states.

4. Phoneme or Subword Modeling:
- DNN-HMM systems typically model phonemes, context-dependent phonemes, or subword units (e.g.,
triphones) as the hidden states in HMMs.
- The DNNs are trained to predict which phoneme or subword unit corresponds to a given acoustic frame,
given the surrounding context.

5. Training:
- DNN-HMM systems are trained using large datasets of transcribed speech. The DNNs are trained to
minimize the error between their predicted state probabilities and the true state labels in the training data.
- DNNs can be trained using supervised learning techniques, and backpropagation is used to update the
model's weights.


6. Integration with HMMs:

- The DNN-generated state probabilities are integrated into the HMM framework. This is often done by
incorporating the DNN as an emission probability model in the HMM.

7. Decoding:
- During the decoding phase, DNN-HMM systems use algorithms like the Viterbi algorithm to find the most
likely sequence of hidden states (phonemes or subword units) that best explain the observed acoustic features.

8. Benefits:
- DNN-HMM systems have significantly improved ASR accuracy, especially in challenging environments
with background noise and variations in speech.
- They capture complex acoustic patterns and can model a wide range of speakers and accents effectively.

9. Challenges:
- Training deep neural networks requires large amounts of labeled data and significant computational
resources.
- DNN-HMM systems can be complex to design and optimize, and there's a risk of overfitting the model to
the training data.

DNN-HMM systems have been a major breakthrough in ASR technology and have significantly improved the
accuracy of speech recognition systems, making them more practical for real-world applications, including
voice assistants, transcription services, and more.
