Tsa-Unit-1 To 5 Notes
INTRODUCTION
Artificial intelligence (AI) integration has revolutionized various industries, and now it is transforming the realm of
human behavior research. This integration marks a significant milestone in the data collection and analysis endeavors,
enabling users to unlock deeper insights from spoken language and empower researchers and analysts with enhanced
capabilities for understanding and interpreting human communication. Human interactions are a critical part of many
organizations. Many organizations analyze speech or text via natural language processing (NLP) and link them to insights
and automation such as text categorization, text classification, information extraction, etc.
In business intelligence, speech and text analytics enable us to gain insights into customer-agent conversations through
sentiment analysis, and topic trends. These insights highlight areas of improvement, recognition, and concern, to better
understand and serve customers and employees. Speech and text analytics is a set of features that uses natural language
processing (NLP) to provide automated analysis of an interaction's content on 100% of interactions, giving deep insight
into customer-agent conversations. This includes transcribing voice interactions, analysing customer sentiment, spotting
topics, and creating meaning from otherwise unstructured data.
FOUNDATIONS OF NATURAL LANGUAGE PROCESSING
Natural Language Processing (NLP) is the process of analyzing and producing meaningful phrases and sentences in the form of natural
language. Natural Language Processing comprises Natural Language Understanding (NLU) and Natural Language
Generation (NLG). NLU maps the language input into useful representations and supports tasks such as information extraction and
retrieval and sentiment analysis, while NLG takes data and produces natural language output from it. NLP can be thought of as an
intersection of Linguistics, Computer Science and Artificial Intelligence that helps computers understand, interpret and manipulate human language.
NLP generally involves two main phases:
1. Data Preprocessing
2. Algorithm Development
In Natural Language Processing, machine learning training algorithms study millions of examples of text — words,
sentences, and paragraphs — written by humans. By studying the samples, the training algorithms gain an understanding
of the “context” of human speech, writing, and other modes of communication. This training helps NLP software to
differentiate between the meanings of various texts. The five phases of NLP involve lexical (structure) analysis, parsing,
semantic analysis, discourse integration, and pragmatic analysis. Some well-known application areas of NLP are Optical
Character Recognition (OCR), Speech Recognition, Machine Translation, and Chatbots.
The first phase of NLP is word structure analysis, which is referred to as lexical or morphological analysis. A lexicon is
defined as a collection of words and phrases in a given language, with the analysis of this collection being the process of
splitting the lexicon into components, based on what the user sets as parameters – paragraphs, phrases, words, or
characters.
Similarly, morphological analysis is the process of identifying the morphemes of a word. A morpheme is a basic unit of
English language construction: a small element of a word that carries meaning. These can be either a free
morpheme (e.g. walk) or a bound morpheme (e.g. -ing, -ed), the difference between the two being that the latter
cannot stand on its own to produce a word with meaning and must be attached to a free morpheme to add meaning.
In search engine optimization (SEO), lexical or morphological analysis helps guide web searching. For instance, when
doing on-page analysis, you can perform lexical and morphological analysis to understand how often the target keywords
are used in their core form (as free morphemes, or when in composition with bound morphemes). This type of analysis
can ensure that you have an accurate understanding of the different variations of the morphemes that are used.
Morphological analysis can also be applied in transcription and translation projects, so can be very useful in content
repurposing projects, and international SEO and linguistic analysis.
Syntax Analysis is the second phase of natural language processing. Syntax analysis or parsing is the process of checking
grammar and word arrangement – overall, the identification of relationships between words and whether those relationships
make sense. The process involves the examination of all words and phrases in a sentence and of the structures between them.
As part of the process, a visualisation of the syntactic relationships is built, referred to as a syntax tree (similar to a
knowledge graph). This process ensures that the structure, order, and grammar of sentences make sense, when
considering the words and phrases that make up those sentences. Syntax analysis also involves tagging words and phrases
with POS tags. There are two common methods to construct the syntax tree – top-down and bottom-up; both check
whether the input forms a valid sentence and reject it otherwise.
Semantic analysis is the third stage in NLP, when an analysis is performed to understand the meaning in a statement. This
type of analysis is focused on uncovering the definitions of words, phrases, and sentences and identifying whether the way
words are organized in a sentence makes sense semantically.
This task is performed by mapping the syntactic structure, and checking for logic in the presented relationships between
entities, words, phrases, and sentences in the text. There are a couple of important functions of semantic analysis, which
allow for natural language understanding:
To ensure that the data types are used in a way that’s consistent with their definition.
To ensure that the flow of the text is consistent.
Identification of synonyms, antonyms, homonyms, and other lexical items.
Overall word sense disambiguation.
Relationship extraction from the different entities identified from the text.
There are several things you can utilise semantic analysis for in SEO. Here are some examples:
Topic modeling and classification – sort your page content into topics (predefined or modelled by an algorithm).
You can then use this for ML-enabled internal linking, where you link pages together on your website using the
identified topics. Topic modeling can also be used for classifying first-party collected data such as customer
service tickets, or feedback users left on your articles or videos in free form (i.e. comments).
Entity analysis, sentiment analysis, and intent classification – You can use this type of analysis to perform
sentiment analysis and identify intent expressed in the content analysed. Entity identification and sentiment
analysis are separate tasks, and both can be done on things like keywords, titles, meta descriptions, and page content,
but work best when analysing data like comments, feedback forms, or customer service or social media
interactions. Intent classification can be done on user queries (in keyword research or traffic analysis), but can
also be done in analysis of customer service interactions.
Understanding of the expressed motivations within the text, and its underlying meaning.
Understanding of the relationships between entities and topics mentioned, thematic understanding, and
interactions analysis.
Phase IV of NLP is discourse integration, which considers how sentences relate to one another. In SEO, discourse integration
and analysis can be used to ensure that appropriate tense is used, that the relationships
expressed in the text make logical sense, and that there is overall coherency in the text analysed. This can be especially
useful for programmatic SEO initiatives or text generation at scale. The analysis can also be used as part of international
SEO localization, translation, or transcription tasks on big corpuses of data.
There are some research efforts to incorporate discourse analysis into systems that detect hate speech (or in the SEO space
for things like content and comment moderation), with this technology being aimed at uncovering intention behind text by
aligning the expression with meaning, derived from other texts. This means that, theoretically, discourse analysis can also
be used for modeling of user intent (e.g. search intent or purchase intent) and detection of such notions in texts.
Phase V: Pragmatic analysis
Pragmatic analysis is the fifth and final phase of natural language processing. As the final stage, pragmatic analysis
extrapolates and incorporates the learnings from all other, preceding phases of NLP. Pragmatic analysis involves the
process of abstracting or extracting meaning from the use of language, and translating a text, using the gathered
knowledge from all other NLP steps performed beforehand.
Here are some capabilities that are introduced during this phase:
Information extraction, enabling advanced text-understanding functions such as question-answering.
Meaning extraction, which allows for programs to break down definitions or documentation into a more
accessible language.
Understanding of the meaning of the words, and context, in which they are used, which enables conversational
functions between machine and human (e.g. chatbots).
Pragmatic analysis has multiple applications in SEO. One of the most straightforward ones is programmatic SEO and
automated content generation. This type of analysis can also be used for generating FAQ sections on your product, using
textual analysis of product documentation, or even capitalizing on the ‘People Also Ask’ featured snippets by adding an
automatically-generated FAQ section for each page you produce on your site.
LANGUAGE SYNTAX AND STRUCTURE
For any language, syntax and structure usually go hand in hand: a set of specific rules, conventions, and principles
governs the way words are combined into phrases, phrases get combined into clauses, and clauses get combined into
sentences. We will be talking specifically about English language syntax and structure in this section. In English,
words usually combine to form other constituent units. These constituents include words, phrases, clauses, and
sentences. Consider the sentence "The brown fox is quick and he is jumping over the lazy dog"; it is made of a bunch
of words, and just looking at the words by themselves doesn't tell us much.
N(oun): Nouns usually denote people, places, things, or concepts, e.g. fox, dog, flower. The POS tag symbol for nouns is N.
V(erb): Verbs are words used to describe actions, states, or occurrences, e.g. is, jumping, walk. The POS tag symbol for verbs is V.
Adj(ective): Adjectives are words used to describe or qualify other words, typically nouns and noun phrases. The
phrase beautiful flower has the noun (N) flower which is described or qualified using the adjective (ADJ)
beautiful . The POS tag symbol for adjectives is ADJ .
Adv(erb): Adverbs usually act as modifiers for other words including nouns, adjectives, verbs, or other adverbs.
The phrase very beautiful flower has the adverb (ADV) very , which modifies the adjective (ADJ) beautiful ,
indicating the degree to which the flower is beautiful. The POS tag symbol for adverbs is ADV.
Besides these four major categories of parts of speech , there are other categories that occur frequently in the English
language. These include pronouns, prepositions, interjections, conjunctions, determiners, and many others. Furthermore,
each POS tag like the noun (N) can be further subdivided into categories like singular nouns (NN), singular proper
nouns(NNP), and plural nouns (NNS).
The process of classifying and labeling POS tags for words is called parts of speech tagging or POS tagging. POS tags are
used to annotate words and depict their POS, which is really helpful to perform specific analysis, such as narrowing down
upon nouns and seeing which ones are the most prominent, word sense disambiguation, and grammar analysis.
Let us consider both nltk and spacy which usually use the Penn Treebank notation for POS tagging. NLTK and spaCy
are two of the most popular Natural Language Processing (NLP) tools available in Python. You can build chatbots,
automatic summarizers, and entity extraction engines with either of these libraries. While both can theoretically
accomplish any NLP task, each one excels in certain scenarios. The Penn Treebank, or PTB for short, is a dataset
maintained by the University of Pennsylvania.
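As a concrete illustration, here is a minimal sketch of tagging the example sentence with both libraries; it assumes the nltk data packages 'punkt' and 'averaged_perceptron_tagger' and the spaCy model 'en_core_web_sm' are installed.

# A minimal POS-tagging sketch with nltk and spaCy (both use Penn Treebank-style tags).
import nltk
import spacy

sentence = "The brown fox is quick and he is jumping over the lazy dog"

# NLTK: tokenize first, then tag with the Penn Treebank tagset
nltk_tags = nltk.pos_tag(nltk.word_tokenize(sentence))
print(nltk_tags)

# spaCy: tags are produced as part of the processing pipeline
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
print([(token.text, token.tag_, token.pos_) for token in doc])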
We can see that each of these libraries treats tokens in its own way and assigns specific tags to them. Based on what we
see, spaCy seems to be doing slightly better than nltk.
Shallow Parsing or Chunking
Based on the hierarchy we depicted earlier, groups of words make up phrases. There are five major categories of phrases:
Noun phrase (NP): These are phrases where a noun acts as the head word. Noun phrases act as a subject or
object to a verb.
Verb phrase (VP): These phrases are lexical units that have a verb acting as the head word. Usually, there are
two forms of verb phrases. One form has the verb components as well as other entities such as nouns, adjectives,
or adverbs as parts of the object.
Adjective phrase (ADJP): These are phrases with an adjective as the head word. Their main role is to describe or
qualify nouns and pronouns in a sentence, and they will be either placed before or after the noun or pronoun.
Adverb phrase (ADVP): These phrases act like adverbs since the adverb acts as the head word in the phrase.
Adverb phrases are used as modifiers for nouns, verbs, or adverbs themselves by providing further details that
describe or qualify them.
Prepositional phrase (PP): These phrases usually contain a preposition as the head word and other lexical
components like nouns, pronouns, and so on. These act like an adjective or adverb describing other words or
phrases.
Shallow parsing, also known as light parsing or chunking, is a popular natural language processing technique of analyzing
the structure of a sentence to break it down into its smallest constituents (which are tokens such as words) and group them
together into higher-level phrases. The output includes the POS tags as well as the phrases chunked from a sentence.
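A minimal sketch of shallow parsing with NLTK's RegexpParser is shown below; the single noun-phrase chunk rule is an illustrative assumption, not a standard grammar.

# Shallow parsing (chunking) sketch: group POS-tagged tokens into noun-phrase chunks.
import nltk

sentence = "The brown fox is quick and he is jumping over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# NP chunk rule: an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN.*>}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))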
Constituency Parsing
Constituent-based grammars are used to analyze and determine the constituents of a sentence. These grammars can be
used to model or represent the internal structure of sentences in terms of a hierarchically ordered structure of their
constituents. Each word usually belongs to a specific lexical category and forms the head word of
different phrases. These phrases are formed based on rules called phrase structure rules.
Phrase structure rules form the core of constituency grammars, because they talk about syntax and rules that govern the
hierarchy and ordering of the various constituents in the sentences. These rules cater to two things primarily.
They determine what words are used to construct the phrases or constituents.
They determine how we need to order these constituents together.
The generic representation of a phrase structure rule is S → AB , which depicts that the structure S consists of
constituents A and B , and the ordering is A followed by B . While there are several rules (refer to Chapter 1, Page 19:
Text Analytics with Python, if you want to dive deeper), the most important rule describes how to divide a sentence or a
clause. The phrase structure rule denotes a binary division for a sentence or a clause as S → NP VP where S is the
sentence or clause, and it is divided into the subject, denoted by the noun phrase ( NP) and the predicate, denoted by the
verb phrase (VP).
A constituency parser can be built based on such grammars/rules, which are usually collectively available as context-free
grammar (CFG) or phrase-structured grammar. The parser will process input sentences according to these rules, and help
in building a parse tree.
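To make the idea concrete, here is a minimal sketch of a constituency parser built from a toy context-free grammar in NLTK; the grammar and the shortened sentence are illustrative assumptions.

# Constituency parsing sketch with a hand-written toy CFG.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT JJ NN
VP -> VBZ ADJP
ADJP -> JJ
DT -> 'The'
JJ -> 'brown' | 'quick'
NN -> 'fox'
VBZ -> 'is'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("The brown fox is quick".split()):
    tree.pretty_print()  # prints the hierarchically ordered constituents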
Dependency Parsing
In dependency parsing, we try to use dependency-based grammars to analyze and infer both structure and semantic
dependencies and relationships between tokens in a sentence. The basic principle behind a dependency grammar is that in
any sentence in the language, all words except one, have some relationship or dependency on other words in the sentence.
The word that has no dependency is called the root of the sentence. The verb is taken as the root of the sentence in most
cases. All the other words are directly or indirectly linked to the root verb using links, which are the dependencies.
Considering the sentence "The brown fox is quick and he is jumping over the lazy dog", if we wanted to draw the
dependency syntax tree for it, we would have a structure in which every word is linked, directly or indirectly, to the root verb, as in the sketch below.
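A minimal sketch of producing those dependencies with spaCy (assuming the 'en_core_web_sm' model is installed) could look like this:

# Dependency parsing sketch: print each token, its dependency label, and its head.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The brown fox is quick and he is jumping over the lazy dog")

for token in doc:
    # the root token is its own head (token.head is token)
    print(f"{token.text:10} --{token.dep_:>6}--> {token.head.text}")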
Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP
methods like Count Vectorizer and Advanced Deep Learning-based architectures like Transformers. As tokens are the
building blocks of Natural Language, the most common way of processing the raw text happens at the token level.
Tokens are the building blocks of Natural Language.
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words,
characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-
gram characters) tokenization.
The most common way of forming tokens is based on space. Consider the sentence "Never give up". Assuming space as a
delimiter, the tokenization of the sentence results in 3 tokens – Never, give, up. As each token is a word, this becomes an
example of Word tokenization. Similarly, tokens can be either characters or subwords. For example, the word "smarter" can
be split into the character tokens s-m-a-r-t-e-r or into the subword tokens smart-er.
Here, Tokenization is performed on the corpus to obtain tokens. The following tokens are then used to prepare a
vocabulary. Vocabulary refers to the set of unique tokens in the corpus. Remember that vocabulary can be constructed by
considering each unique token in the corpus or by considering the top K Frequently Occurring Words.
Creating Vocabulary is the ultimate goal of Tokenization.
One of the simplest hacks to boost the performance of the NLP model is to create a vocabulary out of top K frequently
occurring words.
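As a minimal sketch (with an illustrative toy corpus of my own), word tokenization, character tokenization, and vocabulary construction look like this:

# Word tokens, character tokens, and a vocabulary of unique tokens.
sentence = "Never give up"

word_tokens = sentence.split()      # whitespace as the delimiter
char_tokens = list("smarter")       # character tokenization of a single word
print(word_tokens)                  # ['Never', 'give', 'up']
print(char_tokens)                  # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# Vocabulary: the set of unique tokens across a (tiny) corpus
corpus = ["never give up", "give it your best", "never stop learning"]
vocabulary = sorted({token for text in corpus for token in text.split()})
print(vocabulary)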
Now, let’s understand the usage of the vocabulary in Traditional and Advanced Deep Learning-based NLP methods.
Traditional NLP approaches such as Count Vectorizer and TF-IDF use the vocabulary as features: each word in the
vocabulary is treated as a unique feature. Character Tokenization, by contrast, splits a piece of text into individual characters;
it handles out-of-vocabulary (OOV) words and also limits the size of the vocabulary. Want to take a guess at the size of the
vocabulary? 26, since the vocabulary then contains only the unique set of characters.
Drawbacks of Character Tokenization
Character tokens solve the OOV problem but the length of the input and output sentences increases rapidly as we are
representing a sentence as a sequence of characters. As a result, it becomes challenging to learn the relationship between
the characters to form meaningful words. This brings us to another tokenization known as Subword Tokenization which is
in between a Word and Character tokenization.
Subword Tokenization
Subword Tokenization splits the piece of text into subwords (or n-gram characters). For example, words like lower can be
segmented as low-er, smartest as smart-est, and so on.
Transformer-based models – the SOTA in NLP – rely on Subword Tokenization algorithms for preparing their vocabulary.
Now, we will discuss one of the most popular Subword Tokenization algorithms, known as Byte Pair Encoding (BPE).
Byte Pair Encoding (BPE)
Byte Pair Encoding (BPE) is a widely used tokenization method among transformer-based models. BPE addresses the
issues of Word and Character Tokenizers:
• BPE tackles OOV effectively. It segments OOV as subwords and represents the word in terms of these subwords
• The length of input and output sentences after BPE are shorter compared to character tokenization
BPE is a word segmentation algorithm that merges the most frequently occurring character or character sequences
iteratively. Here is a step by step guide to learn BPE.
Steps to learn BPE
1. Split the words in the corpus into characters after appending </w>
2. Initialize the vocabulary with unique characters in the corpus
3. Compute the frequency of a pair of characters or character sequences in corpus
4. Merge the most frequent pair in corpus
5. Save the best pair to the vocabulary
6. Repeat steps 3 to 5 for a certain number of iterations
For example, we first append the end-of-word symbol </w> to every word in the corpus (step 1) and initialize the
vocabulary with the unique characters (step 2). In each iteration we then compute the frequency of every adjacent pair of
characters or character sequences (step 3), merge the most frequent pair throughout the corpus (step 4), and add the merged
pair to the vocabulary (step 5), repeating steps 3 to 5 for the chosen number of iterations. A worked sketch of this loop is
given below.
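Since the worked tables are not reproduced here, the sketch below implements the merge loop on an illustrative toy corpus; the word frequencies are assumptions, not the ones from the original example.

# Byte Pair Encoding sketch: learn merges by repeatedly joining the most frequent pair.
from collections import Counter

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # word -> frequency (assumed)

# Step 1: split each word into characters and append the end-of-word symbol </w>
vocab = {" ".join(word) + " </w>": freq for word, freq in corpus.items()}

def pair_counts(vocab):
    """Step 3: count how often each adjacent symbol pair occurs in the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge(pair, vocab):
    """Steps 4-5: merge the best pair everywhere (a production version would use
    a boundary-aware regex instead of a plain string replace)."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Step 6: repeat for a fixed number of iterations
for _ in range(10):
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge(best, vocab)
    print("merged:", best)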
STEMMING
Stemming is the process of reducing the morphological variants of a word to a common root/base word. Stemming programs are commonly
referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words "chocolates", "chocolatey",
"choco" to the root word "chocolate", and "retrieval", "retrieved", "retrieves" to the stem "retrieve". Stemming is
an important part of the pipelining process in Natural language processing. The input to the stemmer is tokenized words.
How do we get these tokenized words? Well, tokenization involves breaking down the document into different words.
Stemming is a natural language processing technique that is used to reduce words to their base form, also known as the
root form. The process of stemming is used to normalize text and make it easier to process. It is an important step in text
pre-processing, and it is commonly used in information retrieval and text mining applications. There are several different
algorithms for stemming as follows:
Porter stemmer
Snowball stemmer
Lancaster stemmer.
The Porter stemmer is the most widely used algorithm, and it is based on a set of heuristics that are used to remove
common suffixes from words. The Snowball stemmer is a more advanced algorithm that is based on the Porter stemmer,
but it also supports several other languages in addition to English. The Lancaster stemmer is a more aggressive stemmer
and it is less accurate than the Porter stemmer and Snowball stemmer.
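A minimal sketch comparing the three stemmers with NLTK:

# Compare the Porter, Snowball (English), and Lancaster stemmers on a few words.
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for word in ["chocolates", "retrieval", "likely", "happily"]:
    print(word, "->", porter.stem(word), snowball.stem(word), lancaster.stem(word))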
Stemming can be useful for several natural language processing tasks such as text classification, information retrieval, and
text summarization. However, stemming can also have some negative effects such as reducing the readability of the text,
and it may not always produce the correct root form of a word. It is important to note that stemming is different from
Lemmatization. Lemmatization is the process of reducing a word to its base form, but unlike stemming, it takes into
account the context of the word, and it produces a valid word, unlike stemming which can produce a non-word as the root
form.
Some more examples of word forms that reduce to the root word "like" include:
->"likes"
->"liked"
->"likely"
->"liking"
Errors in Stemming: There are mainly two errors in stemming – over-stemming, where two words with different meanings are reduced to the same root, and under-stemming, where two words that should be reduced to the same root are not.
Applications of stemming:
Stemming is used in information retrieval systems like search engines. It is used to determine domain vocabularies in
domain analysis, to display search results by indexing while documents are evolving, and to map documents
to common subjects. Sentiment Analysis, which examines reviews and comments made by different users
about anything, is frequently used for product analysis, such as for online retail stores; here stemming
is applied as a text-preparation step before the text is interpreted.
A method of group analysis used on textual materials is called document clustering (also known as text clustering).
Important uses of it include subject extraction, automatic document structuring, and quick information retrieval.
Fun Fact: Google search adopted word stemming in 2003. Previously a search for “fish” would not have returned
“fishing” or “fishes”.
N-Gram Stemmer
An n-gram is a set of n consecutive characters extracted from a word in which similar words will have a high proportion
of n-grams in common.
Example: ‘INTRODUCTIONS’ for n=2 becomes: *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S* (where * marks a word boundary).
Advantage: It is based on simple string comparisons and is largely language independent.
Limitation: It requires space to create and index the n-grams and it is not time efficient.
Snowball Stemmer:
When compared to the Porter Stemmer, the Snowball Stemmer can map non-English words too. Since it supports other
languages the Snowball Stemmers can be called a multi-lingual stemmer. The Snowball stemmers are also imported from
the nltk package. This stemmer is based on a programming language called ‘Snowball’ that processes small strings and is
one of the most widely used stemmers. The Snowball stemmer is more aggressive than the Porter Stemmer and is also referred to
as the Porter2 Stemmer. Because of the improvements added compared to the Porter Stemmer, the Snowball stemmer has
greater computational speed.
Lancaster Stemmer:
The Lancaster stemmer is more aggressive and dynamic compared to the other two stemmers. It is
faster, but the algorithm can be confusing when dealing with small words, and it is not as efficient as the Snowball
stemmer. The Lancaster stemmer saves its rules externally and basically uses an iterative algorithm. It
is straightforward, although it often produces results with excessive stemming. Over-stemming renders stems non-
linguistic or meaningless.
LEMMATIZATION
Lemmatization is a text pre-processing technique used in natural language processing (NLP) models to break a word
down to its root meaning to identify similarities. For example, a lemmatization algorithm would reduce the word better to
its root word, or lemma, good. In stemming, a part of the word is just chopped off at the tail end to arrive at the stem of
the word. There are different algorithms used to find out how many characters have to be chopped off, but the algorithms
don’t actually know the meaning of the word in the language it belongs to. In lemmatization, the algorithms do have this
knowledge. In fact, you can even say that these algorithms refer to a dictionary to understand the meaning of the word
before reducing it to its root word, or lemma. So, a lemmatization algorithm would know that the word better is
derived from the word good, and hence, the lemma is good. But a stemming algorithm wouldn’t be able to do
the same. There could be over-stemming or under-stemming, and the word better could be reduced to either bet,
or bett, or just retained as better. But there is no way in stemming that can reduce better to its root word good.
This is the difference between stemming and lemmatization.
Lemmatization gives more context to chatbot conversations as it recognizes words based on their exact and contextual
meaning. On the other hand, lemmatization is a time-consuming and slow process. The obvious advantage of
lemmatization is that it is more accurate than stemming. So, if you’re dealing with an NLP application such as a chat bot
or a virtual assistant, where understanding the meaning of the dialogue is crucial, lemmatization would be useful. But this
accuracy comes at a cost. Because lemmatization involves deriving the meaning of a word from something like a
dictionary, it’s very time-consuming. So, most lemmatization algorithms are slower compared to their stemming
counterparts. There is also a computation overhead for lemmatization, however, in most machine-learning problems,
computational resources are rarely a cause of concern.
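A minimal sketch of lemmatization with NLTK's WordNet lemmatizer (the 'wordnet' data package must be downloaded; passing a part-of-speech hint improves the result):

# Lemmatization sketch: unlike a stemmer, the lemmatizer returns valid dictionary words.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))   # adjective -> 'good'
print(lemmatizer.lemmatize("running", pos="v"))  # verb      -> 'run'
print(lemmatizer.lemmatize("mice"))              # noun      -> 'mouse'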
REMOVING STOP-WORDS
The words which are generally filtered out before processing a natural language text are called stop words. These are actually
the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc.) and do not add much
information to the text. Examples of a few stop words in English are “the”, “a”, “an”, “so”, “what”. Stop words are
available in abundance in any human language. By removing these words, we remove the low-level information from our
text in order to give more focus to the important information. In other words, we can say that the removal of such words
does not have any negative consequences on the model we train for our task.
Removal of stop words definitely reduces the dataset size and thus reduces the training time due to the fewer number of
tokens involved in the training.
We do not always remove the stop words. The removal of stop words is highly dependent on the task we are performing
and the goal we want to achieve. For example, if we are training a model that can perform the sentiment analysis task, we
might not remove the stop words.
Movie review: “The movie was not good at all.”
Text after removal of stop words: “movie good”
We can clearly see that the review for the movie was negative. However, after the removal of stop words, the review
became positive, which is not the reality. Thus, the removal of stop words can be problematic here.
Tasks like text classification do not generally need stop words as the other words present in the dataset are more important
and give the general idea of the text. So, we generally remove stop words in such tasks.
In a nutshell, NLP has a lot of tasks that cannot be accomplished properly after the removal of stop words. So, think
before performing this step. The catch here is that no rule is universal and no stop words list is universal. A list not
conveying any important information to one task can convey a lot of information to the other task.
Word of caution: Before removing stop words, research a bit about your task and the problem you are trying to solve,
and then make your decision.
As the frequency of stop words is very high, removing them from the corpus results in much smaller data in
terms of size. The reduced size results in faster computations on text data, and the text classification model needs to
deal with a smaller number of features, resulting in a more robust model.
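A minimal sketch of stop-word removal with NLTK's English stop-word list, using the movie review from above (note how dropping 'not' flips the apparent sentiment):

# Stop-word removal sketch; assumes the 'stopwords' and 'punkt' NLTK data are downloaded.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

review = "The movie was not good at all."
stop_words = set(stopwords.words("english"))

filtered = [w for w in word_tokenize(review) if w.lower() not in stop_words]
print(filtered)  # 'not' is in the stop-word list, so the negation is lost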
Advanced Methods
These methods can also be called vectorized methods as they aim to map a word, sentence, document to a fixed-length
vector of real numbers. The goal of this method is to extract semantics from a piece of text, both lexical and distributional.
Lexical semantics is just the meaning reflected by the words whereas distributional semantics refers to finding meaning
based on various distributions in a corpus.
Word2Vec
GloVe: Global Vector for word representation
Fig. Word2Vec vs GloVe
BAG OF WORDS MODEL
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval
(IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding
grammar and even word order but keeping multiplicity. A bag-of-words model, or BoW for short, is a way of extracting
features from text for use in modelling, such as with machine learning algorithms.
The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two
things:
1. A vocabulary of known words.
2. A measure of the presence of known words.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded.
The model is only concerned with whether known words occur in the document, not where in the document. The intuition
is that documents are similar if they have similar content. Further, that from the content alone we can learn something
about the meaning of the document. The bag-of-words can be as simple or complex as you like. The complexity comes
both in deciding how to design the vocabulary of known words (or tokens) and how to score the presence of known
words.
One of the biggest problems with text is that it is messy and unstructured, and machine learning algorithms prefer
structured, well defined fixed-length inputs and by using the Bag-of-Words technique we can convert variable-length texts
into a fixed-length vector.
Also, at a more granular level, machine learning models work with numerical data rather than textual data. So, to be
more specific, by using the bag-of-words (BoW) technique, we convert a text into its equivalent vector of numbers.
Let us see an example of how the bag of words technique converts text into vectors
Example (1) without preprocessing:
Sentence 1: “Welcome to Great Learning, Now start learning”
Sentence 2: “Learning is a good practice”
Tokens of Sentence 1: Welcome, to, Great, Learning, ",", Now, start, learning
Tokens of Sentence 2: Learning, is, a, good, practice
Step 1: Go through all the words in the above text and make a list of all of the words in the model vocabulary.
Welcome
To
Great
Learning
,
Now
start
learning
is
a
good
practice
Note that the words ‘Learning’ and ‘ learning’ are not the same here because of the difference in their cases and hence are
repeated. Also, note that a comma ‘ , ’ is also taken in the list. Because we know the vocabulary has 12 words, we can use
a fixed-length document-representation of 12, with one position in the vector to score each word.
The scoring method we use here is to count the presence of each word and mark 0 for absence. This scoring method is
used more generally.
The scoring of sentence 1 would look as follows:
Word Frequency
Welcome 1
to 1
Great 1
Learning 1
, 1
Now 1
start 1
learning 1
is 0
a 0
good 0
practice 0
Writing the above frequencies in the vector
Sentence 1 ➝ [ 1,1,1,1,1,1,1,1,0,0,0,0 ]
Now for sentence 2, the scoring would look like:
Word Frequency
Welcome 0
to 0
Great 0
Learning 1
, 0
Now 0
start 0
learning 0
is 1
a 1
good 1
practice 1
Similarly, writing the above frequencies in the vector form
Sentence 2 ➝ [ 0,0,0,1,0,0,0,0,1,1,1,1 ]
Sentence Welcome to Great Learning , Now start learning is a good practice
Sentence1 1 1 1 1 1 1 1 1 0 0 0 0
Sentence2 0 0 0 1 0 0 0 0 1 1 1 1
But is this the best way to perform a bag of words? The above example was not the best example of how to use a
bag of words. The words Learning and learning, although having the same meaning, are taken twice. Also, a
comma ",", which does not convey any information, is also included in the vocabulary.
Let us make some changes and see how we can use the bag of words in a more effective way.
Step 1: Convert the above sentences to lower case, as the case of a word does not hold any information.
Step 2: Remove special characters and stopwords from the text. Stopwords are words that do not contain much
information about the text, like ‘is’, ‘a’, ‘the’ and many more. After these two steps, the sentences become
"welcome great learning now start learning" and "learning good practice".
Although the above sentences do not make much sense, the maximum information is contained in these words only.
Step 3: Go through all the words in the above text and make a list of all of the words in our model vocabulary.
welcome
great
learning
now
start
good
practice
Now as the vocabulary has only 7 words, we can use a fixed-length document-representation of 7, with one position in the
vector to score each word.
The scoring method we use here is the same as the one used in the previous example. For sentence 1, the count of words is as
follows:
Word Frequency
welcome 1
great 1
learning 2
now 1
start 1
good 0
practice 0
Writing the above frequencies in the vector
Sentence 1 ➝ [ 1,1,2,1,1,0,0 ]
For sentence 2, the count of words is as follows:
Word Frequency
welcome 0
great 0
learning 1
now 0
start 0
good 1
practice 1
Similarly, writing the above frequencies in the vector form
Sentence 2 ➝ [ 0,0,1,0,0,1,1 ]
Sentence welcome great learning now start good practice
Sentence1 1 1 2 1 1 0 0
Sentence2 0 0 1 0 0 1 1
The approach used in example two is the one that is generally used in the Bag-of-Words technique, the reason being that
the datasets used in Machine learning are tremendously large and can contain vocabulary of a few thousand or even
millions of words. Hence, preprocessing the text before using bag-of-words is a better way to go. There are various
preprocessing steps, such as the ones discussed earlier (lower-casing, stop-word removal, stemming or lemmatization),
that can increase the performance of Bag-of-Words.
In the examples above we use all the words from vocabulary to form a vector, which is neither a practical way nor the best
way to implement the BoW model. In practice, only a few words from the vocabulary, more preferably the most common
words are used to form the vector.
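The whole worked example above can be reproduced (approximately) with scikit-learn's CountVectorizer; this is a minimal sketch, and the built-in English stop-word list may differ slightly from the hand-picked one, so the exact vocabulary can vary.

# Bag-of-words sketch: lowercasing and punctuation removal happen automatically,
# stop words are dropped explicitly, and each sentence becomes a fixed-length count vector.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Welcome to Great Learning, Now start learning",
    "Learning is a good practice",
]

# max_features could additionally restrict the vocabulary to the top K frequent words
vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # the count vectors for the two sentences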
Limitations of Bag-of-Words
The bag-of-words model is very simple to understand and implement and offers a lot of flexibility for customization on
your specific text data. It has been used with great success on prediction problems like language modeling and
document classification. However, because any information about word order is discarded, the model loses the context and
meaning that word order carries. A simple extension is the bag-of-n-grams model, which uses sequences of n consecutive
tokens (n-grams) as the vocabulary items instead of single words.
For example, let’s use the following phrase and divide it into bi-grams (n=2).
“James is the best person ever.”
becomes
<start>James
James is
is the
the best
best person
person ever.
ever.<end>
In a typical bag-of-n-grams model, these bigrams would be a sample from a large number of bigrams observed in a
corpus. And then "James is the best person ever." would be encoded in a representation showing which of the corpus’s
bigrams were observed in the sentence. A bag-of-n-grams model has the simplicity of the bag-of-words model but allows
the preservation of more word locality information.
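A minimal sketch of generating those bigrams (with the <start> and <end> boundary markers added as assumptions) is:

# Build n-grams by sliding a window of size n over the token list.
def ngrams(tokens, n=2):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["<start>"] + "James is the best person ever.".split() + ["<end>"]
print(ngrams(tokens, 2))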
TF-IDF MODEL
TF-IDF stands for Term Frequency Inverse Document Frequency of records. It can be defined as the calculation of how
relevant a word in a series or corpus is to a text. The meaning increases proportionally to the number of times in the text a
word appears but is compensated by the word frequency in the corpus (data-set).
Terminologies:
Term Frequency: In a document d, the term frequency tf(t, d) represents the number of instances of a given word t. Therefore,
we can see that a word becomes more relevant the more often it appears in the text, which is rational. Since the ordering of
terms is not significant, we can use a vector to describe the text in the bag-of-terms model. For each specific term
in the document, there is an entry with the value being the term frequency.
The weight of a term that occurs in a document is simply proportional to the term frequency, often normalized by the
document length: tf(t, d) = (count of t in d) / (total number of words in d).
Document Frequency: This tests the meaning of the term, in a way very similar to TF, but over the whole corpus
collection. The only difference is that in a document d, tf is the frequency counter for a term t, while df(t) is the
number of documents, out of the N documents in the collection, in which the term t occurs. In other words, DF is the
number of documents in which the word is present.
Inverse Document Frequency: Mainly, it tests how relevant the word is. The key aim of a search is to locate
the relevant records that fit the demand. Since df considers all terms equally significant, it is not
possible to use document frequencies alone to measure the weight of a term in a document. First, find the document
frequency of a term t by counting the number of documents containing the term: df(t) = occurrence of t in N documents.
Term frequency is the number of instances of a term in a single document only, whereas document frequency is
the number of separate documents in which the term appears, so it depends on the entire corpus. Now let’s look at the
definition of inverse document frequency. The IDF of a term is the total number of documents in the corpus divided
by the document frequency of the term: idf(t) = N / df(t).
A more common word is supposed to be considered less significant, but this raw factor seems too
harsh, so we take the logarithm of the inverse document frequency. The idf of the term t then becomes: idf(t) = log(N / df(t)).
Computation: TF-IDF is one of the best metrics to determine how significant a term is to a text in a series or a
corpus. TF-IDF is a weighting system that assigns a weight to each word in a document based on its term
frequency (TF) and inverse document frequency (IDF): tf-idf(t, d) = tf(t, d) × idf(t). The words with higher weights are
deemed to be more significant.
Numerical Example
Imagine a term appears 20 times in a document that contains a total of 100 words. The Term Frequency (TF) of that term can be
calculated as follows: tf = 20 / 100 = 0.2.
Assume the collection contains 10,000 documents and 100 of them contain the term. The Inverse Document Frequency (IDF)
of the term can be calculated as follows: idf = log(10,000 / 100) = log(100) = 2 (using a base-10 logarithm).
Using these two quantities, we can calculate the TF-IDF score of the term for the document: tf-idf = 0.2 × 2 = 0.4.
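A minimal sketch with scikit-learn's TfidfVectorizer is given below; note that scikit-learn uses a smoothed IDF by default, so its numbers will not match the hand computation above exactly.

# TF-IDF sketch: fit on a tiny corpus and inspect the resulting weights.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the brown fox is quick",
    "the lazy dog is sleeping",
    "the quick fox jumps over the lazy dog",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))  # words appearing in every document get a lower idf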
**********
INTRODUCTION
Text classification is a machine learning technique that assigns a set of predefined categories to open-ended text. Text
classifiers can be used to organize, structure, and categorize pretty much any kind of text – from documents, medical
studies, and files, and all over the web. For example, news articles can be organized by topics; support tickets can be
organized by urgency; chat conversations can be organized by language; brand mentions can be organized by sentiment;
and so on.
Text classification is one of the fundamental tasks in natural language processing with broad applications such as
sentiment analysis, topic labeling, spam detection, and intent detection. Here’s an example of how it works:
A text classifier can take a phrase as input – say, a short user review praising a product’s interface – analyze its content,
and then automatically assign relevant tags, such as UI and Easy To Use.
Real-time analysis: There are critical situations that companies need to identify as soon as possible and take
immediate action (e.g., PR crises on social media). Machine learning text classification can follow your brand
mentions constantly and in real-time, so you'll identify critical information and be able to take action right away.
Consistent criteria: Human annotators make mistakes when classifying text data due to distractions, fatigue, and
boredom, and human subjectivity creates inconsistent criteria. Machine learning, on the other hand, applies the
same lens and criteria to all data and results. Once a text classification model is properly trained it performs with
unsurpassed accuracy.
We can perform text classification in two ways: manual or automatic.
Manual text classification involves a human annotator, who interprets the content of text and categorizes it
accordingly. This method can deliver good results but it’s time-consuming and expensive.
Automatic text classification applies machine learning, natural language processing (NLP), and other AI-guided
techniques to automatically classify text in a faster, more cost-effective, and more accurate manner.
There are many approaches to automatic text classification, but they all fall under three types of systems:
Rule-based systems
Machine learning-based systems
Hybrid systems
Rule-based systems
Rule-based approaches classify text into organized groups by using a set of handcrafted linguistic rules. These
rules instruct the system to use semantically relevant elements of a text to identify relevant categories based on its content.
Each rule consists of an antecedent or pattern and a predicted category.
Example: Say that you want to classify news articles into two groups: Sports and Politics. First, you’ll need to
define two lists of words that characterize each group (e.g., words related to sports such as football, basketball, LeBron
James, etc., and words related to politics, such as Donald Trump, Hillary Clinton, Putin, etc.). Next, when you want to
classify a new incoming text, you’ll need to count the number of sport-related words that appear in the text and do the
same for politics- related words. If the number of sports-related word appearances is greater than the politics-related word
count, then the text is classified as Sports and vice versa. For example, this rule-based system will classify the headline
“When is LeBron James' first game with the Lakers?” as Sports because it counted one sports-related term (LeBron
James) and it didn’t count any politics-related terms.
Rule-based systems are human comprehensible and can be improved over time. But this approach has some
disadvantages. For starters, these systems require deep knowledge of the domain. They are also time-consuming, since
generating rules for a complex system can be quite challenging and usually requires a lot of analysis and testing. Rule-
based systems are also difficult to maintain and don’t scale well given that adding new rules can affect the results of the
pre-existing rules.
Machine learning-based systems
Instead of relying on manually crafted rules, machine learning text classification learns to make classifications
based on past observations. By using pre-labeled examples as training data, machine learning algorithms can learn the
different associations between pieces of text, and that a particular output (i.e., tags) is expected for a particular input (i.e.,
text). A “tag” is the pre-determined classification or category that any given text could fall into.
The first step towards training a machine learning NLP classifier is feature extraction: a method used to transform
each text into a numerical representation in the form of a vector. One of the most frequently used approaches is the bag of
words, where a vector represents the frequency of a word in a predefined dictionary of words.
For example, if we have defined our dictionary to have the following words {This, is, the, not, awesome, bad,
basketball}, and we wanted to vectorize the text “This is awesome,” we would have the following vector representation of
that text: (1, 1, 0, 0, 1, 0, 0). Then, the machine learning algorithm is fed with training data that consists of pairs of feature
sets (vectors for each text example) and tags (e.g. sports, politics) to produce a classification model:
Fig. Training process in Text Classification
Once it’s trained with enough training samples, the machine learning model can begin to make accurate
predictions. The same feature extractor is used to transform unseen text to feature sets, which can be fed into the
classification model to get predictions on tags (e.g., sports, politics):
Fig. Prediction process in Text Classification
Text classification with machine learning is usually much more accurate than human-crafted rule systems,
especially on complex NLP classification tasks. Also, classifiers with machine learning are easier to maintain and you
can always tag new examples to learn new tasks.
Machine Learning Text Classification Algorithms
Some of the most popular text classification algorithms include the Naive Bayes family of algorithms, support
vector machines (SVM), and deep learning.
Naive Bayes
The Naive Bayes family of statistical algorithms are some of the most used algorithms in text classification and text
analysis, overall. One of the members of that family is Multinomial Naive Bayes (MNB) with a huge advantage, that you
can get really good results even when your dataset isn’t very large (~ a couple of thousand tagged samples) and
computational resources are scarce. Naive Bayes is based on Bayes’s Theorem, which helps us compute the conditional
probabilities of
the occurrence of two events, based on the probabilities of the occurrence of each individual event. So we’re
calculating the probability of each tag for a given text, and then outputting the tag with the highest probability.
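As a minimal sketch (with a tiny, made-up labelled dataset), a bag-of-words Multinomial Naive Bayes classifier can be built with scikit-learn:

# Text classification sketch: vectorize with bag-of-words, then fit Multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "When is LeBron James' first game with the Lakers?",
    "The match ended with a last minute basketball three pointer",
    "The senate passed the new election bill today",
    "The president met foreign ministers to discuss the treaty",
]
labels = ["Sports", "Sports", "Politics", "Politics"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["Who scored in the basketball game last night?"]))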
Support Vector Machines (SVM)
A Support Vector Machine draws a line, or hyperplane, that divides the vector space into two subspaces: one for vectors that
belong to a given tag and one for vectors that do not.
Fig. Optimal SVM Hyperplane
The optimal hyperplane is the one with the largest distance (margin) between each tag. In two dimensions it looks like the
figure above: those vectors are representations of your training texts, and each group is a tag you have tagged your texts
with. As data gets more complex, it may not be possible to separate the vectors/tags with a single straight line, and a more
complex (non-linear) boundary is needed.
Deep Learning
Deep learning is a set of algorithms and techniques inspired by how the human brain works, called neural
networks. Deep learning architectures offer huge benefits for text classification because they perform at super high
accuracy with lower- level engineering and computation. The two main deep learning architectures for text classification
are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). Deep learning is hierarchical machine
learning, using multiple algorithms in a progressive chain of events. It’s similar to how the human brain works when
making decisions, using different techniques simultaneously to process huge amounts of data.
Deep learning algorithms do require much more training data than traditional machine learning algorithms (at least
millions of tagged examples). However, unlike traditional machine learning algorithms such as SVM and Naive Bayes, they
don’t hit a ceiling when learning from training data: deep learning classifiers continue to get better the more data you feed
them with. Deep learning algorithms, like Word2Vec or GloVe, are also used in order to obtain better vector
representations for words and improve the accuracy of classifiers trained with traditional machine learning algorithms.
Hybrid Systems
Hybrid systems combine a machine learning-trained base classifier with a rule-based system, used to further
improve the results. These hybrid systems can be easily fine-tuned by adding specific rules for those conflicting tags that
haven’t been correctly modeled by the base classifier.
VECTOR SEMANTICS AND EMBEDDINGS
Vector semantics is the standard way to represent word meaning in NLP, helping us model many
of the aspects of word meaning. The idea of vector semantics is to represent a word as a point in a multidimensional
semantic space that is derived from the distributions of its word neighbors. Vectors for representing words are
called embeddings (although the term is sometimes more strictly applied only to dense vectors like word2vec). Vector
Semantics defines semantics & interprets word meaning to explain features such as word similarity. Its central idea is:
Two words are similar if they have similar word contexts.
In its current form, the vector model inspires its working from the linguistic and philosophical work of the 1950s.
Vector semantics represents a word in multi-dimensional vector space. Vector model is also called Embeddings, due to the
fact that a word is embedded in a particular vector space. The vector model offers many advantages in NLP. For example,
in sentiment analysis, it sets up a decision boundary and predicts whether the sentiment is positive or negative (a binomial
classification). Another key practical advantage of vector semantics is that it can learn automatically from text without
complex labeling or supervision. As a result of these advantages, vector semantics has become a de-facto standard for
NLP applications such as Sentiment Analysis, Named Entity Recognition (NER), topic modeling, and so on.
WORD EMBEDDINGS
It is an approach for representing words and documents. Word Embedding or Word Vector is a numeric vector
input that represents a word in a lower-dimensional space. It allows words with similar meaning to have a similar
representation. They can also approximate meaning. A word vector with 50 values can represent 50 unique features.
Features: Anything that relates words to one another. Eg: Age, Sports, Fitness, Employed etc. Each word vector
has values corresponding to these features.
Goal of Word Embeddings
To reduce dimensionality
To use a word to predict the words around it
Inter word semantics must be captured
How are Word Embeddings used?
They are used as input to machine learning models.
Take the words —-> Give their numeric representation —-> Use in training or inference
To represent or visualize any underlying patterns of usage in the corpus that was used to train them.
Implementations of Word Embeddings:
Word Embeddings are a method of extracting features out of text so that we can input those features into a machine
learning model to work with text data. They try to preserve syntactical and semantic information. The methods such as
Bag of Words(BOW), CountVectorizer and TFIDF rely on the word count in a sentence but do not save any syntactical or
semantic information. In these algorithms, the size of the vector is the number of elements in the vocabulary, so we end up
with a sparse matrix in which most of the elements are zero. Large input vectors mean a huge number of weights, which
results in high computation being required for training. Word Embeddings give a solution to these problems.
Let’s take an example to understand how word vector is generated by taking emoticons which are most frequently
used in certain conditions and transform each emoji into a vector and the conditions will be our features.
Suppose the features are [happy, sad, excited, sick] and we score each emoji against these features. The emoji vectors then look like:
emoji 1 = [1, 0, 1, 0]
emoji 2 = [0, 1, 0, 1]
emoji 3 = [0, 0, 1, 1]
.....
In a similar way, we can create word vectors for different words as well on the basis of given features. The words
with similar vectors are most likely to have the same meaning or are used to convey the same sentiment. There are two
different approaches for getting Word Embeddings:
1) Word2Vec:
In Word2Vec every word is assigned a vector. We start with either a random vector or one-hot vector.
One-Hot vector: A representation where only one bit in a vector is 1. If there are 500 words in the corpus then the
vector length will be 500. After assigning vectors to each word we take a window size and iterate through the entire
corpus. While we do this there are two neural embedding methods which are used:
Continuous Bag of Words (CBOW)
In this model what we do is we try to fit the neighboring words in the window to the central word.
CBOW Architecture.
This architecture is very similar to a feed-forward neural network. This model architecture essentially
tries to predict a target word from a list of context words.
The intuition behind this model is quite simple: given a phrase "Have a great day", we will choose our
target word to be “a” and our context words to be [“have”, “great”, “day”]. What this model will do is take
the distributed representations of the context words to try and predict the target word.
The English language contains almost 1.2 million words, making it impossible to include so many words
in our example. So I’ll consider a small example in which we have only four words, i.e. live, home, they and at.
For simplicity, we will consider that the corpus contains only one sentence, that being ‘They live at home’.
First, we convert each word into a one-hot encoded form. Also, we’ll not consider all the words in the
sentence but will only take certain words that are in a window. For example, for a window size equal to three, we
only consider three words in a sentence. The middle word is to be predicted and the surrounding two words are
fed into the neural network as context. The window is then slid and the process is repeated again.
Finally, after training the network repeatedly by sliding the window as shown above, we get weights
which we use to get the embeddings as shown below.
Usually, we take a window size of around 8-10 words and a vector size of 300.
Skip Gram
In this model, we try to make the central word closer to the neighboring words. It is the complete opposite of the
CBOW model. It is shown that this method produces more meaningful embeddings.
After applying the above neural embedding methods we get trained vectors of each word after many iterations
through the corpus. These trained vectors preserve syntactical or semantic information and are converted to lower
dimensions. The vectors with similar meaning or semantic information are placed close to each other in space.
The Skip-gram model architecture usually tries to achieve the reverse of what the CBOW model does. It
tries to predict the source context words (surrounding words) given a target word (the centre word).
The working of the skip-gram model is quite similar to the CBOW but there is just a difference in the
architecture of its neural network and the way the weight matrix is generated as shown in the figure below:
After obtaining the weight matrix, the steps to get word embedding is same as CBOW.
So which one of the two algorithms should we use for implementing word2vec? It turns out that for a large
corpus with higher dimensions, it is better to use skip-gram, although it is slower to train, whereas CBOW works
better for a smaller corpus and is faster to train.
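A minimal sketch of training both variants with gensim (the tiny corpus is only illustrative; real training needs a much larger corpus):

# word2vec sketch with gensim: sg=0 trains CBOW, sg=1 trains skip-gram.
from gensim.models import Word2Vec

corpus = [
    ["they", "live", "at", "home"],
    ["the", "brown", "fox", "is", "quick"],
    ["he", "is", "jumping", "over", "the", "lazy", "dog"],
]

cbow = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["home"][:5])               # first few dimensions of a word vector
print(skipgram.wv.most_similar("home"))  # nearest neighbours in the embedding space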
2) GloVe:
This is another method for creating word embeddings. In this method, we take the corpus and iterate through it and
get the co-occurrence of each word with other words in the corpus. We get a co-occurrence matrix through this. The words
which occur next to each other get a value of 1, if they are one word apart then 1/2, if two words apart then 1/3 and so on.
Let us take an example to understand how the matrix is created. We have a small corpus:
Corpus:
It is a nice evening. Good Evening! Is it a nice evening?
The lower half of the co-occurrence matrix then looks as follows (rows and columns in the order it, is, a, nice, evening, good):

            it        is        a         nice      evening   good
it          0
is          1+1       0
a           1/2+1     1+1/2     0
nice        1/3+1/2   1/2+1/3   1+1       0
evening     1/4+1/3   1/3+1/4   1/2+1/2   1+1       0
good        0         0         0         0         1         0
The upper half of the matrix will be a reflection of the lower half. We can consider a window frame as well to
calculate the co-occurrences by shifting the frame till the end of the corpus. This helps gather information about the
context in which the word is used. Initially, the vectors for each word are assigned randomly. Then we take a pair of
vectors and see how close they are to each other in space. If the corresponding words occur together more often, i.e. have a
higher value in the co-occurrence matrix, but are far apart in space, they are brought closer to each other. If they are close to
each other in space but are rarely used together, they are moved further apart.
After many iterations of the above process, we’ll get a vector space representation that approximates the
information from the co-occurrence matrix. The performance of GloVe is better than Word2Vec in terms of both semantic
and syntactic capturing.
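The weighted counting described above can be sketched in a few lines of Python (this only builds the co-occurrence statistics; it is not the full GloVe training procedure, and the corpus and window size are illustrative):

from collections import defaultdict

corpus = ["it is a nice evening", "good evening", "is it a nice evening"]
window = 4
cooc = defaultdict(float)

for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        # words d positions apart contribute 1/d to the pair's count
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pair = tuple(sorted((word, tokens[j])))
            cooc[pair] += 1.0 / (j - i)

print(cooc[("evening", "nice")])   # 2.0, i.e. 1 + 1 as in the matrix above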
Pre-trained Word Embedding Models:
People generally use pre-trained models for word embeddings. A few of them are:
spaCy
fastText
Flair, etc.
Common Errors made:
You need to use the exact same preprocessing pipeline when deploying your model as was used to create the training
data for the word embeddings. If you use a different tokenizer or a different method of handling white space, punctuation,
etc., you might end up with incompatible inputs.
Words in your input that don't have a pre-trained vector are known as Out-of-Vocabulary (OOV) words.
What you can do is replace those words with a special "UNK" token, meaning unknown, and then handle them
separately (see the sketch after this list).
Dimension mismatch: Vectors can be of many lengths. If you train a model with vectors of length, say, 400 and
then try to apply vectors of length 1000 at inference time, you will run into errors. So make sure to use the same
dimensions throughout.
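A minimal sketch of the OOV and dimension points above, assuming a hypothetical lookup table `pretrained` of word vectors loaded elsewhere:

import numpy as np

# Hypothetical pre-trained table; all vectors share one fixed dimension (here 300)
pretrained = {"they": np.ones(300), "live": np.ones(300), "UNK": np.zeros(300)}

def embed(tokens, table):
    # unseen words fall back to the special "UNK" vector
    return [table.get(t, table["UNK"]) for t in tokens]

vectors = embed(["they", "live", "at", "home"], pretrained)
print(len(vectors), vectors[2].shape)   # "at" and "home" map to UNK, still 300-d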
Graphical Representation of a Word2Vec Embedding (King and Queen are close to each other in position)
There are two main architectures that yield the success of word2vec.
Skip-gram
CBOW architectures. (Refer Previous Section)
FASTTEXT MODEL
This model allows training word embeddings from a training corpus with the additional ability to obtain word
vectors for out-of-vocabulary words. FastText is an open-source, free, lightweight library that allows users to learn text
representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit
on mobile devices. fastText embeddings exploit subword information to construct word embeddings. FastText is more
stable than the Word2Vec architecture. In FastText, each word is represented by the vector representations of its
character n-grams along with the word itself. So, the word embedding for the word 'equal' can be given as the sum of the
vector representations of all of its character n-grams and the word itself.
Word embedding techniques like word2vec and GloVe provide a distinct vector representation for each word in the
vocabulary. This ignores the internal structure of words, which is a limitation for morphologically rich languages because
the syntactic relations between word forms are lost. Since many word formations follow regular rules in morphologically
rich languages, it is possible to improve vector representations for these languages by using character-level information.
To improve vector representation for morphologically rich language, FastText provides embeddings for character
n-grams, representing words as the average of these embeddings. It is an extension of the word2vec model. Word2Vec
model provides embedding to the words, whereas fastText provides embeddings to the character n-grams. Like the
word2vec model, fastText uses CBOW and Skip-gram to compute the vectors.
FastText can also handle out-of-vocabulary words, i.e., fastText can find word embeddings for words that were not
present at the time of training.
Out-of-vocabulary (OOV) words are words that do not occur while training the data and are not present in the
model’s vocabulary. Word embedding models like word2vec or GloVe cannot provide embeddings for the OOV words
because they provide embeddings for words; hence, if a new word occurs, it cannot provide embedding. Since FastText
provides embeddings for character n-grams, it can provide embeddings for OOV words. If an OOV word occurs, then
fastText provides embedding for that word by embedding its character n-gram.
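A minimal sketch using Gensim's FastText implementation (assuming Gensim is installed; the tiny corpus is illustrative): because vectors are built from character n-grams, a word never seen in training still receives an embedding.

from gensim.models import FastText

sentences = [["they", "live", "at", "home"],
             ["equal", "rights", "for", "all"]]
model = FastText(sentences, vector_size=100, window=3, min_count=1, min_n=3, max_n=5)

print("equally" in model.wv.key_to_index)   # False: the word is out of vocabulary
print(model.wv["equally"][:5])              # still gets a vector from its n-grams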
Skip-gram for FastText
Skip-gram works like CBOW, but the input is the target word, and the model predicts the context of the given
word. It also uses neural networks for training. Figure 1.3 shows the working of Skip-gram.
RNN
Recurrent Neural Network (RNN) is a type of Neural Network where the output from the previous step is fed as
input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in
cases when it is required to predict the next word of a sentence, the previous words are required and hence there is a need
to remember the previous words. Thus, RNN came into existence, which solved this issue with the help of a Hidden
Layer. The main and most important feature of RNN is its Hidden state, which remembers some information about a
sequence. The state is also referred to as Memory State since it remembers the previous input to the network. It uses the
same parameters for each input as it performs the same task on all the inputs or hidden layers to produce the output. This
reduces the complexity of parameters, unlike other neural networks.
An RNN treats each word of a sentence as a separate input occurring at time 't' and also uses the activation value from
time 't-1' as an input, in addition to the input at time 't'. The diagram below shows the detailed structure of an RNN
architecture. This architecture is also called a many-to-many architecture with Tx = Ty, i.e. the number of inputs equals
the number of outputs. Such a structure is quite useful in sequence modelling.
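The recurrence itself can be sketched in a few lines of NumPy (toy sizes; randomly initialised weights stand in for trained parameters): the same weight matrices are reused at every time step, and the hidden state carries information forward.

import numpy as np

hidden_size, input_size = 4, 3
Wx = np.random.randn(hidden_size, input_size) * 0.1   # input-to-hidden weights
Wh = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # new hidden state combines the current input with the previous state
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

h = np.zeros(hidden_size)
for x_t in np.random.randn(5, input_size):   # a sequence of 5 word vectors
    h = rnn_step(x_t, h)                     # same parameters at every step
print(h)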
Apart from the architecture mentioned above there are three other types of architectures of RNN which are
commonly used.
1. Many to One RNN : Many to one architecture refers to an RNN architecture where many inputs (Tx) are used to
give one output (Ty). A suitable example for using such an architecture will be a classification task.
RNNs are a very important variant of neural networks, heavily used in Natural Language Processing.
Conceptually they differ from a standard neural network in that the standard input to an RNN is a word instead of the
entire sample, as is the case in a standard neural network. This gives the network the flexibility to work with varying
sentence lengths, something which cannot be achieved in a standard neural network due to its fixed structure. It
also provides the additional advantage of sharing features learned across different positions of the text, which cannot be
obtained in a standard neural network.
In the image above H represents the output of the activation function.
2. One to Many RNN: One to Many architecture refers to a situation where an RNN generates a series of output values
based on a single input value. A prime example of such an architecture is a music generation task, where the
input is a genre or the first note.
3. Many to Many Architecture (Tx not equals Ty): This architecture refers to where many inputs are read to produce
many outputs, where the length of inputs is not equal to the length of outputs. A prime example for using such an
architecture is machine translation tasks.
Encoder refers to the part of the network which reads the sentence to be translated, and, Decoder is the
part of the network which translates the sentence into desired language.
Limitations of RNN
For all its usefulness, the RNN does have certain limitations, the major ones being:
1. The RNN architectures stated above capture dependencies in only one direction of the language.
In the case of Natural Language Processing, this basically assumes that a word coming later has no effect on the
meaning of the words coming before it. From our experience of language, we know that this is certainly not true.
2. RNNs are also not very good at capturing long-term dependencies, and the problem of vanishing gradients
resurfaces in RNNs.
RNNs are ideal for solving problems where the sequence is more important than the individual items themselves.
An RNN is essentially a fully connected neural network that contains a refactoring of some of its layers into a loop.
That loop is typically an iteration over the addition or concatenation of two inputs, a matrix multiplication and a non-linear
function.
Among the text usages, the following tasks are among those RNNs perform well at:
Sequence labelling
Natural Language Processing (NLP) text classification
Natural Language Processing (NLP) text generation
Other tasks that RNNs are effective at solving are time series predictions or other sequence predictions that aren’t
image or tabular-based.
There have been several prominent and controversial reports in the media about advances in text generation,
notably OpenAI's GPT-2 model. In many cases the generated text is indistinguishable from text written by humans.
RNNs effectively have an internal memory that allows the previous inputs to affect the subsequent predictions. It’s
much easier to predict the next word in a sentence with more accuracy, if you know what the previous words were. Often
with tasks well suited to RNNs, the sequence of the items is as or more important than the previous item in the sequence.
Sequence-to-Sequence Models: TRANSFORMERS (Translate one language into another language)
Sequence-to-sequence (seq2seq) models in NLP are used to convert sequences of Type A to sequences of Type
B. For example, translation of English sentences to German sentences is a sequence-to-sequence task.
Recurrent Neural Network (RNN) based sequence-to-sequence models have garnered a lot of traction ever
since they were introduced in 2014. Most of the data in the current world are in the form of sequences – it can be a
number sequence, text sequence, a video frame sequence or an audio sequence.
The performance of these seq2seq models was further enhanced with the addition of the Attention Mechanism in
2015. How quickly advancements in NLP have been happening in the last 5 years – incredible!
These sequence-to-sequence models are pretty versatile and they are used in a variety of NLP tasks, such as:
Machine Translation
Text Summarization
Speech Recognition
Question-Answering System, and so on
German to English Translation using seq2seq
The above seq2seq model is converting a German phrase to its English counterpart. Let’s break it down:
Both Encoder and Decoder are RNNs
At every time step in the Encoder, the RNN takes a word vector (xi) from the input sequence and a hidden state
(Hi) from the previous time step
The hidden state is updated at each time step
The hidden state from the last unit is known as the context vector. This contains information about the input
sequence
This context vector is then passed to the decoder and it is then used to generate the target sequence (English phrase)
If we use the Attention mechanism, then a weighted sum of the hidden states is passed as the context vector
to the decoder (see the sketch below)
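A toy NumPy sketch of that attention step (random values stand in for real encoder hidden states and the decoder state): alignment scores are turned into weights with a softmax, and the context vector is the weighted sum of the encoder hidden states.

import numpy as np

H = np.random.randn(6, 128)   # encoder hidden states for 6 source words
s = np.random.randn(128)      # current decoder state (illustrative)

scores = H @ s                                   # one alignment score per word
alpha = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights
context = alpha @ H                              # weighted sum of hidden states
print(alpha.round(2), context.shape)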
Challenges
Despite being so good at what it does, there are certain limitations of seq-2-seq models with attention:
Dealing with long-range dependencies is still challenging
The sequential nature of the model architecture prevents parallelization. These challenges are addressed by
Google Brain’s Transformer concept
RNNs can remember important things about the input they have received, which allows them to be very precise in
predicting the next outcome. This is why they are preferred for sequential data; examples of sequence data include time
series, speech, text, financial data, audio, video, weather, and many more. Although the RNN was the state-of-the-art
algorithm for dealing with sequential data, it comes with its own drawbacks: because of the complexity of the algorithm
the network is quite slow to train, and with a huge number of dimensions the training becomes very long and difficult.
TRANSFORMERS
Attention models/Transformers are the most exciting models being studied in NLP research today, but they can be
a bit challenging to grasp – the pedagogy is all over the place. This is both a bad thing (it can be confusing to hear
different versions) and in some ways a good thing (the field is rapidly evolving, there is a lot of space to improve).
Transformer
Internally, the Transformer has a similar kind of architecture as the previous models above. But the Transformer
consists of six encoders and six decoders.
All encoders are very similar to one another and share the same architecture. The decoders share the same
property, i.e. they are also very similar to one another. Each encoder consists of two layers: a self-attention layer and a
feed-forward neural network.
The encoder’s inputs first flow through a self-attention layer. It helps the encoder look at other words in the input
sentence as it encodes a specific word. The decoder has both those layers, but between them is an attention layer that
helps the decoder focus on relevant parts of the input sentence.
Self-Attention
Let’s start to look at the various vectors/tensors and how they flow between these components to turn the input
of a trained model into an output. As is the case in NLP applications in general, we begin by turning each input word
into a vector using an embedding algorithm.
Each word is embedded into a vector of size 512. We will represent those vectors with these simple boxes. The
embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they
receive a list of vectors each of the size 512.
In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the
encoder that’s directly below. After embedding the words in our input sequence, each of them flows through each of the
two layers of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each position flows through
its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward
layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing
through the feed-forward layer.
Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the
encoder.
Self-Attention
Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually
implemented — using matrices.
Figuring out the relation of words within a sentence and giving the right
attention to it.
The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in
this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector.
These vectors are created by multiplying the embedding by three matrices that we trained during the training process.
Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64,
while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is
an architecture choice to make the computation of multiheaded attention (mostly) constant.
Multiplying x1 by the WQ weight matrix produces q1, the “query” vector associated with that word. We end up
creating a “query”, a “key”, and a “value” projection of each word in the input sentence.
What are the “query”, “key”, and “value” vectors?
They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading
how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors
plays.
The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the
first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score
determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
The score is calculated by taking the dot product of the query vector with the key vector of the respective word
we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot
product of q1 and k1. The second score would be the dot product of q1 and k2.
The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in
this example, which is 64; this leads to more stable gradients, and although other values are possible, this is the
default), then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add
up to 1.
This softmax score determines how much each word will be expressed at this position. Clearly the word
at this position will have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to
the current word.
The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition
here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them
by tiny numbers like 0.001, for example). The sixth step is to sum up the weighted value vectors. This produces the
output of the self-attention layer at this position (for the first word).
That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward
neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing.
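A NumPy sketch of those six steps in matrix form, using the sizes quoted above (d_model = 512 for the embeddings, d_k = 64 for the projections; random matrices stand in for trained weights):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_model, d_k = 2, 512, 64            # 2 input words
X = np.random.randn(n, d_model)         # their embeddings
WQ, WK, WV = (np.random.randn(d_model, d_k) * 0.01 for _ in range(3))

Q, K, V = X @ WQ, X @ WK, X @ WV        # step 1: query, key and value vectors
scores = Q @ K.T / np.sqrt(d_k)         # steps 2-3: dot products, divide by 8
weights = softmax(scores)               # step 4: softmax over each row
Z = weights @ V                         # steps 5-6: weighted sum of values
print(Z.shape)                          # (2, 64): one output vector per word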
Multihead attention
There are a few other details that make them work better. For example, instead of only paying attention to each
other in one dimension, Transformers use the concept of Multihead attention. The idea behind it is that whenever you are
translating a word, you may pay different attention to each word based on the type of question that you are asking. The
images below show what that means. For example, whenever you are translating “kicked” in the sentence “I kicked the
ball”, you may ask “Who kicked”. Depending on the answer, the translation of the word to another language can change.
Or ask other questions, like “Did what?”, etc…
Positional Encoding
Another important step in the Transformer is to add a positional encoding when encoding each word, since the
position of each word in the sequence is relevant to the translation.
OVERVIEW OF TEXT SUMMARIZATION AND TOPIC MODELS
Text summarization is the process of creating a concise and accurate representation of the main points and
information in a document. Topic modelling can help you generate summaries by extracting the most relevant and salient
topics and words from the document. Text summarization refers to the technique of shortening long pieces of text. The
intention is to create a coherent and fluent summary having only the main points outlined in the document.
Automatic text summarization is a common problem in machine learning and natural language processing (NLP).
In general, text summarization techniques have proved to be critical in quickly and accurately summarizing voluminous
texts, something which would be expensive and time-consuming if done without machines.
Topic modeling is a form of unsupervised learning that aims to find hidden patterns and structures in the text data.
It assumes that each document is composed of a mixture of topics, and each topic is a distribution of words that
represent a specific
subject or idea. For example, a document about sports might have topics such as soccer, basketball, and fitness.
Topic modeling can help you identify these topics and their proportions in each document. Topic modeling can help you
generate summaries by extracting the most relevant and salient topics and words from the document for text
summarization. You can then use these topics and words to construct a summary that captures the essence and meaning of
the document.
Topic modeling is a collection of text-mining techniques that uses statistical and machine learning models to
automatically discover hidden abstract topics in a collection of documents.
Topic modeling is also an amalgamation of unsupervised techniques capable of detecting word and
phrase patterns within documents and automatically clustering word groups and similar expressions that best
represent a set of documents.
There are many cases where humans or machines generate a huge amount of text over time, and it is neither prudent
nor possible to go through the entire text to understand what is important or to form an opinion about the process that
generated the data.
In such cases, NLP algorithms and in particular topic modeling are useful to extract a summary of the underlying
text and discover important contexts from the text.
Topic modeling is the method of extracting needed attributes from a bag of words. This is critical because each
word in the corpus is treated as a feature in NLP. As a result, feature reduction allows us to focus on the relevant material
rather than wasting time sifting through all of the data's text.
There are many different topic modeling algorithms and tools available for text analysis projects. Popular methods
include Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and Latent Semantic Analysis
(LSA). Common tools used to apply these algorithms include Gensim, a Python library providing implementations of
LDA, NMF, and other topic modeling methods; Scikit-learn, a Python library providing implementations of NMF, LSA,
and other machine learning methods; and MALLET, a Java-based toolkit providing implementations of LDA, NMF, and
other topic modeling methods. These tools offer various utilities and functionalities for preprocessing, evaluation,
visualization, data manipulation, feature extraction, model selection, and performance metrics.
1. Latent Dirichlet Allocation (LDA):
Consider the following scenario: you have a corpus of 1000 documents. The bag of words is made up of 1000
common words after preprocessing the corpus. We can determine the subjects that are relevant to each document using
LDA.
The extraction of data from a corpus of data is therefore made straightforward. The upper level represents the
documents, the middle level represents the produced themes, and the bottom level represents the words in the diagram
above.
As a result, the rule is that a document is represented as a distribution of topics, and each topic is described as a
distribution of words.
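A minimal LDA sketch using Gensim, one of the tools mentioned above (the three-document corpus and the num_topics value are purely illustrative):

from gensim import corpora, models

docs = [["soccer", "goal", "league", "match"],
        ["election", "vote", "party", "policy"],
        ["basketball", "match", "score", "league"]]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
bow_corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)                          # topics as word distributions
print(lda.get_document_topics(bow_corpus[0]))       # topic mixture of document 0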
2. Non Negative Matrix Factorization (NMF):
NMF is a matrix factorization method that ensures the non-negative elements of the factorized matrices. Consider
the document-term matrix produced after deleting stopwords from a corpus. The term-topic matrix and the topic-
document matrix are two matrices that may be factored out of the matrix.
Matrix factorization may be accomplished using a variety of optimization methods. NMF may be performed more
quickly and effectively using Hierarchical Alternating Least Square. The factorization takes place in this case by updating
one column at a time while leaving the other columns unchanged.
3. Latent Semantic Analysis (LSA):
Latent Semantic Analysis is another unsupervised learning approach for extracting relationships between words in a
large number of documents. This assists us in selecting the appropriate documents.
It essentially serves as a dimensionality reduction tool for a massive corpus of text data, since extraneous data adds
noise to the process of extracting the proper insights from the data.
4. Parallel Latent Dirichlet Allocation:
Partially Labeled Dirichlet Allocation is another name for it. The model implies that there are a total of n labels,
each of which is associated with a different subject in the corpus.
Then, similar to the LDA, the individual themes are represented as the probability distribution of the entire corpus.
Optionally, each document might be allocated a global subject, resulting in l global topics, where l is the number of
individual documents in the corpus.
The technique also assumes that every subject in the corpus has just one label. In comparison to the other
approaches, this procedure is highly rapid and exact because the labels are supplied before creating the model.
5. Pachinko Allocation Model (PAM):
The Pachinko Allocation Model (PAM) is a more advanced version of the Latent Dirichlet Allocation model. The
LDA model identifies themes based on thematic correlations between words in the corpus, bringing out the correlation
between words. PAM goes further by also modeling the correlation between the discovered topics. Because it
additionally considers the links between topics, this model has more ability to determine semantic relationships
precisely. The model is named after Pachinko, a popular Japanese game. To explore the associations between
topics, the model uses Directed Acyclic Graphs (DAGs).
***************
Information retrieval:
What is text information retrieval?
• Text retrieval is to return relevant textual documents from a given collection, according to
users' information needs as declared in a query.
• The main differences from database retrieval concern the nature of the information: unstructured text,
as opposed to structured database records.
Information Retrieval (IR) can be defined as a software program that deals with the
organization, storage, retrieval, and evaluation of information from document repositories,
particularly textual information. Information Retrieval is the activity of obtaining material that
can usually be documented on an unstructured nature i.e. usually text which satisfies an
information need from within large collections which is stored on computers..
Examples:
Vector-space, Boolean and Probabilistic IR models. In this system, the retrieval of information
depends on documents containing the defined set of queries.
What is an IR Model?
An Information Retrieval (IR) model selects and ranks the document that is required by the
user or the user has asked for in the form of a query. The documents and the queries are
represented in a similar manner, so that document selection and ranking can be formalized by
a matching function that returns a retrieval status value (RSV) for each document in the
collection. Many Information Retrieval systems represent document contents by a set of
summarizing keywords and a bibliographic description that contains the author, title, sources,
data, and metadata.
Information Retrieval vs. Data Retrieval:
Information retrieval deals with the organization, storage, retrieval, and evaluation of information
from document repositories, particularly textual information; data retrieval deals with obtaining data
from a database management system such as an ODBMS, i.e. the process of identifying and retrieving
data from the database based on a query provided by the user or an application.
Information retrieval retrieves information about a subject; data retrieval determines the keywords in
the user query and retrieves the matching data.
In information retrieval, small errors are likely to go unnoticed; in data retrieval, a single erroneous
object means total failure.
Retrieved information is not always well structured and is semantically ambiguous; retrieved data has
a well-defined structure and semantics.
Information retrieval does not provide an exact solution to the user of the database system; data
retrieval provides solutions to the user of the database system.
The results obtained in information retrieval are approximate matches; the results obtained in data
retrieval are exact matches.
Logical View of the Documents: A long time ago, documents were represented through a
set of index terms or keywords. Nowadays, modern computers can represent documents by their
full set of words, and the set of representative keywords is then reduced by text operations,
for example by eliminating stopwords (i.e. articles and connectives). These text operations
reduce the complexity of the document representation from full text to a set of index terms.
3. The Web and Digital Libraries: It is cheaper than various sources of information, it
provides greater access to networks due to digital communication and it gives free access to
publish on a larger medium.
Advantages of Information Retrieval
1. Efficient Access: Information retrieval techniques make it possible for users to easily
locate and retrieve vast amounts of data or information.
2. Personalization of Results: User profiling and personalization techniques are used in
information retrieval models to tailor search results to individual preferences and behaviors.
3. Scalability: Information retrieval models are capable of handling increasing data volumes.
4. Precision: These systems can provide highly accurate and relevant search results, reducing
the likelihood of irrelevant information appearing in search results.
1. Information retrieval-based QA
Information retrieval-based question answering (QA) is a method of automatically answering
questions by searching for relevant documents or passages that contain the answer. This
approach uses information retrieval techniques, such as keyword or semantic search, to identify
the documents or passages most likely to hold the answer to a given question.
2. Knowledge-based QA
Knowledge-based question answering (QA) automatically answers questions using a knowledge
base, such as a database or ontology, to retrieve the relevant information. This strategy's
foundation is that searching a structured knowledge base for a question can yield the answer.
Knowledge-based QA systems are generally more accurate and reliable than other QA
approaches because they are based on structured and well-curated knowledge.
3. Generative QA
Generative question answering (QA) automatically answers questions using a generative model,
such as a neural network, to generate a natural language answer to a given question.
This method is based on the idea that a machine can be taught to understand and create text in
natural language to provide a correct answer in terms of grammar and meaning.
4. Hybrid QA
Hybrid question answering (QA) automatically answers questions by combining multiple QA
approaches, such as information retrieval-based, knowledge-based, and generative QA. This
approach is based on the idea that different QA approaches have their strengths and weaknesses,
and by combining them, the overall performance of the QA system can be improved.
5. Rule-based QA
Rule-based question answering (QA) automatically answers questions using a predefined set of
rules based on keywords or patterns in the question. This approach is based on the idea that
many questions can be answered by matching the question to a set of predefined rules or
templates.
Knowledge-based question answering (KBQA) in text and speech analysis involves using
structured knowledge bases or ontologies to answer questions posed in natural language.
This approach contrasts with traditional information retrieval systems, which primarily
match keywords or phrases to documents. Here's an overview of how KBQA works:
Natural Language Understanding: The system analyzes the natural language question
to understand its meaning, including entity mentions, relationships, and constraints
implied by the question. Techniques such as part-of-speech tagging, named entity
recognition, dependency parsing, and semantic role labeling are often used.
Query Formulation: Based on the understanding of the question, the system formulates
a structured query that can be executed against the knowledge base. This query typically
involves selecting relevant entities, properties, and relationships to retrieve the desired
information.
Knowledge Base Querying: The formulated query is executed against the knowledge
base to retrieve relevant information. This process may involve querying a structured
database, a knowledge graph, or accessing external sources such as linked data on the
web.
Answer Generation: Once the relevant information is retrieved from the knowledge
base, it is processed to generate a natural language answer that directly addresses the
user's question. This may involve aggregating and summarizing information, as well as
ensuring that the answer is fluent and grammatically correct.
Response Presentation: Finally, the generated answer is presented to the user through
the appropriate interface, whether it's a text-based response in a chatbot interface or
synthesized speech in a voice-based interaction.
KBQA systems can vary in complexity and sophistication, ranging from simple rule-
based approaches to more advanced systems leveraging machine learning and natural
language processing techniques.
Machines only understand the language of numbers. For creating language models, it is
necessary to convert all the words into a sequence of numbers. For the modellers, this is
known as encodings.
Language Models determine the probability of the next word by analyzing the text in
data. These models interpret the data by feeding it through algorithms.
The algorithms are responsible for creating rules for the context in natural language. The
models are prepared for the prediction of words by learning the features and
characteristics of a language. With this learning, the model prepares itself for
understanding phrases and predicting the next words in sentences.
For training a language model, a number of probabilistic approaches are used. These
approaches vary on the basis of the purpose for which a language model is created. The
amount of text data to be analyzed and the math applied for analysis makes a difference
in the approach followed for creating and training a language model.
For example, a language model used for predicting the next word in a search query will
be absolutely different from those used in predicting the next word in a long document
(such as Google Docs). The approach followed to train the model would be unique in
both cases.
These language models are based on neural networks and are often considered as an advanced
approach to execute NLP tasks. Neural language models overcome the shortcomings of
classical models such as n-gram and are used for complex tasks such as speech recognition or
machine translation.
Applications of language models include:
1. Speech Recognition
2. Machine Translation
3. Sentiment Analysis
4. Text Suggestions
5. Parsing Tools
6. Text Classification
7. Dialog Systems and Creative Writing
8. Text Summarization
Challenges in language modelling include:
1) Long-Term Dependency
2) Low-Resource Languages
3) Sarcasm and Irony
4) Handling Noisy Text
5) Contextual Ambiguity
Classic QA models:
Classic question-answering (QA) models in text and speech analysis have evolved
over the years. Here are some of the classic models:
1.Vector Space Model (VSM): Represents documents and queries as vectors in a high-
dimensional space and computes similarity scores between them (a TF-IDF sketch of this idea
appears after this list).
2.Rule-based QA Systems: These systems rely on handcrafted rules to parse questions and
retrieve relevant information from structured or semi-structured data sources. Classic examples
include:
ALICE: An early chatbot that uses pattern matching and predefined responses.
3.Statistical QA Models: These models utilize statistical techniques to analyze text and
generate answers. Classic examples include:
IBM Watson: Utilizes a combination of statistical techniques, natural language processing, and
machine learning to understand and answer questions.
DeepQA: The architecture behind IBM Watson, which combines various algorithms and
techniques for question answering.
4.Neural QA Models: These models leverage neural networks to understand and answer
questions. Classic examples include:
Memory Networks: Models designed to store and retrieve information from memory, useful
for tasks like question answering.
Attention Mechanisms: Mechanisms that allow neural networks to focus on relevant parts of
the input, improving performance in QA tasks.
5.Graph-based QA Models: These models represent text or knowledge as graphs and perform
reasoning over them to answer questions.
Each of these classic models has its strengths and weaknesses, and modern QA systems often
combine multiple approaches for improved performance.
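Following up on the VSM entry above, here is a hedged scikit-learn sketch of the vector space idea: documents and the query become TF-IDF vectors, and cosine similarity ranks the documents (the corpus and query are illustrative).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["information retrieval returns relevant documents",
        "databases return exact matches for structured queries",
        "text retrieval ranks documents for a user query"]
query = ["retrieve relevant documents for a query"]

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)    # document vectors
q = vectorizer.transform(query)       # query vector in the same space

scores = cosine_similarity(q, D).ravel()
for i in scores.argsort()[::-1]:      # rank documents by similarity
    print(round(float(scores[i]), 2), docs[i])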
Chatbots:
The role of chatbots in NLP lies in their ability to understand and respond to natural
language input from users. This means that rather than relying on specific commands or
keywords like traditional computer programs, chatbots can process human-like questions and
responses.
1.Simple bots are quite basic tools that rely on natural language processing. They can
understand and respond to human queries with certain actions that are based on keywords and
phrases. This type of bot has a defined rule-based decision tree (RBDT), which helps users
find the information they need. An FAQ chatbot is a perfect example of a simple bot (a toy
rule-based sketch follows this list).
2.Intelligent chatbots, which are also known as virtual assistants or virtual agents, are powered
by artificial intelligence and are much more complicated than simple chatbots. They can
understand human written and spoken language and, more importantly, the context behind it.
3.Hybrid chatbots are bots that are partially automated, meaning that they lead conversations
until a human interaction is required. They might have the same functionality as simple bots,
but a user can opt for a person when needed.
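A toy sketch of the simple, rule-based FAQ bot mentioned above (the keyword rules and canned answers are illustrative only):

# keyword rules -> canned answers, the essence of a rule-based decision tree
RULES = {
    ("price", "cost"): "Our basic plan starts at $10 per month.",
    ("refund", "cancel"): "You can cancel any time; refunds take 5-7 days.",
    ("hours", "open"): "Support is available 9am-6pm, Monday to Friday.",
}

def faq_bot(message):
    text = message.lower()
    for keywords, answer in RULES.items():
        if any(k in text for k in keywords):
            return answer
    return "Sorry, I did not understand. Connecting you to a human agent."

print(faq_bot("How much does it cost?"))
print(faq_bot("Can I get a refund?"))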
Chatbots can be powerful tools in text and speech analysis due to their ability to process large
amounts of data quickly and efficiently. Here's how they're used:
1.Text Analysis: Chatbots can analyze text data to extract valuable insights such as sentiment
analysis, topic modeling, keyword extraction, and named entity recognition. They can
understand the context of the conversation and provide relevant responses or take appropriate
actions based on the analysis.
2.Speech Recognition: With advancements in natural language processing (NLP) and speech
recognition technology, chatbots can transcribe spoken language into text. This text data can
then be further analyzed using text analysis techniques mentioned above.
3.Sentiment Analysis: Chatbots can analyze the sentiment expressed in text or speech, helping
businesses gauge customer satisfaction, detect issues, or monitor public opinion about their
products or services.
4.Customer Support: Chatbots are commonly used in customer support to analyze customer
queries and provide appropriate responses. They can understand the intent behind the
customer's message and either provide a solution or escalate the query to a human agent if
necessary.
5.Market Research: Chatbots can be deployed to gather and analyze textual data from social
media, forums, or surveys to understand consumer preferences, trends, and feedback on
products or services.
6.Language Translation: Chatbots equipped with language translation capabilities can analyze
and translate text or speech from one language to another, facilitating communication across
linguistic barriers.
Chatbots Terminologies:
Quick reply
Hybrid Chat
Intent
Sentiment analysis
Compulsory input
Optional input
Decision trees
A spoken dialog system (SDS) is a computer system able to converse with a human with voice.
It has two essential components that do not exist in a written text dialog system: a speech
recognizer and a text-to-speech module (written text dialog systems usually use other input
systems provided by an OS).
Examples of dialogue systems in action include chatbots, food ordering apps, website AI
assistants, automated customer support service, self-checkout systems, etc.
A dialogue system has mainly seven components. These components are the following:
1.Natural Language Understanding (NLU):
NLU is crucial for dialogue systems to comprehend user inputs accurately. It involves tasks
such as intent classification, entity recognition, and sentiment analysis.
In text analysis, techniques like natural language processing (NLP) and machine learning
models are used to parse and understand the meaning of user messages.
In speech analysis, automatic speech recognition (ASR) systems convert spoken language into
text, which is then processed using NLU techniques.
2.Dialogue Management:
Dialogue management is responsible for determining the system's response based on the user's
input and the current context of the conversation.
In speech-based systems, dialogue management may also incorporate speech recognition results
and handle interruptions or errors in speech input.
3.Response Generation:
Response generation involves creating human-like responses to user inputs. This can be
achieved using templates, rule-based systems, or machine learning models like neural networks.
In text-based systems, response generation may involve generating text using language
generation techniques such as neural language models (e.g., GPT).
4.User Experience (UX) Design:
UX design focuses on creating a smooth and intuitive interaction between users and dialogue
systems.
5.Learning and Adaptation:
Dialogue systems should be able to learn and adapt based on user feedback to improve their
performance over time.
Techniques such as reinforcement learning can be used to optimize dialogue policies based on
user interactions.
In text-based systems, sentiment analysis can be used to gauge user satisfaction and adjust
system behavior accordingly.
In speech-based systems, user feedback can be collected through voice commands or post-
interaction surveys.
6.Multi-Modality:
Some dialogue systems incorporate both text and speech modalities to provide a more versatile
user experience.
Multi-modal systems must seamlessly integrate text and speech processing components while
maintaining consistency across modalities.
7.Privacy and Security:
Dialogue systems often handle sensitive information, so ensuring privacy and security is
paramount.
Techniques such as end-to-end encryption and secure data handling practices should be
implemented to protect user data.
Dialog Design
Evaluating dialogue systems, whether they operate through text or speech analysis, involves
assessing various aspects of their performance, including accuracy, effectiveness, user
satisfaction, and scalability. Here are some common evaluation metrics and methodologies for
both text and speech-based dialogue systems:
Accuracy Metrics:
Intent Classification Accuracy: Measures how accurately the system classifies user intents
based on their input.
Entity Recognition F1 Score: Evaluates the system's ability to correctly identify and extract
entities from user messages.
Response Generation Quality: Assess the coherence, relevance, and grammatical correctness of
the generated responses using metrics like BLEU, ROUGE, or human judgment.
Effectiveness Metrics:
Task Completion Rate: Determines the percentage of user queries or tasks successfully
completed by the system without errors.
Response Latency: Measures the time taken by the system to respond to user inputs, aiming for
low latency to improve user experience.
User Satisfaction:
User Surveys: Collect feedback from users through surveys to assess their satisfaction with the
dialogue system's performance, usability, and helpfulness.
User Ratings: Users can rate their interactions with the system on a scale, providing quantitative
feedback on their satisfaction levels.
Robustness: Evaluate how well the system handles variations in user input, including typos, slang, or
ambiguous language.
Speech Recognition and Synthesis Metrics:
Word Error Rate (WER): Measures the accuracy of the system's speech recognition component
by comparing the transcribed text with the ground truth (a minimal WER sketch appears at the
end of this section).
Phoneme Error Rate (PER): Evaluates the accuracy of phoneme-level transcription in speech
recognition.
Mean Opinion Score (MOS): Collect subjective ratings from human listeners on the
naturalness, intelligibility, and overall quality of synthesized speech.
Similar to text-based systems, assess the percentage of user queries or tasks successfully
completed by the system through spoken interactions.
Measure the time taken by the system to process spoken input, recognize intents, generate
responses, and synthesize speech, aiming for minimal latency.
Noise Robustness:
Evaluate the system's performance in noisy environments by introducing background noise and
assessing its impact on speech recognition accuracy and speech synthesis intelligibility.
Multimodal Integration:
Assess the effectiveness of integrating speech recognition and synthesis with other modalities,
such as text-based input and output, to provide a seamless user experience across multiple
channels.
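As referenced under the Word Error Rate entry above, here is a minimal sketch of the WER computation: the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the number of reference words (the example strings are illustrative).

def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("turn left at the next junction", "turn left at next station"))  # ~0.33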
TSA UNIT-4
TEXT-TO-SPEECH ANALYSIS
Introduction:
The primary goal of text normalization, achieved through techniques like stemming and
lemmatization, is to bring diverse linguistic forms closer to a common base form. This
process minimizes variations and aids machines in better understanding and processing
human language.
Example:
Consider the sentence by Jaron Lanier, and how text normalization can be applied to it:
Original Sentence:
“It would be unfair to demand that people cease pirating files when those same people
aren't paid for their participation in very lucrative network schemes..."
Expanding Contractions:
Contractions like "it'll" are expanded to their full forms using a dictionary of
contractions and regular expressions. For example, "it'll" becomes "it will."
CODE:
import re

# A small illustrative dictionary of contractions; in practice this would be much larger
contractions_dict = {"it'll": "it will", "aren't": "are not", "can't": "cannot"}

contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))

def expand_contractions(s, contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)

sentence = "it'll be unfair if those same people aren't paid"  # sample input
sentence = expand_contractions(sentence)
print(sentence)  # it will be unfair if those same people are not paid
Tokenization:
The sentence is segmented into words and sentences using tokenization. This step
involves breaking down the text into smaller units called tokens. For instance, "cease"
and "pirating" become separate tokens
Removing Punctuations:
Punctuation removal ensures that only alphabetic words are retained, contributing to
text standardization. This step helps in streamlining the dataset for further analysis.
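A short sketch of the tokenization and punctuation-removal steps, assuming NLTK is installed and its 'punkt' tokenizer data has been downloaded (nltk.download('punkt')); the sample text follows the example above.

from nltk.tokenize import sent_tokenize, word_tokenize

text = "It would be unfair to demand that people cease pirating files."
sentences = sent_tokenize(text)      # split the text into sentences
tokens = word_tokenize(text)         # split the text into word tokens

# keep only alphabetic tokens, dropping punctuation marks
words = [t for t in tokens if t.isalpha()]
print(sentences)
print(words)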
Stemming:
The application of stemming involves reducing words to their word stem or root form.
Porter's algorithm is a common approach, but it may lead to over-stemming or under-
stemming.
Over-stemming: where a much larger part of a word is chopped off than what is
required, which in turn leads to words being reduced to the same root word or stem
incorrectly .For example, the words “university” and “universe” that get reduced to
“univers”.
Under-stemming: occurs when two or more words could be wrongly reduced to more
than one root word when they actually should be reduced to the same root word. For
example, the words “data” and “datum” that get reduced to “dat” and “datu” respectively
(instead of the same stem “dat”).
The Snowball Stemmer is an algorithm for stemming words, aiming to reduce them to
their base or root form. It is an extension of the Porter Stemmer algorithm and was
developed by Martin Porter. The Snowball Stemmer is designed to be more aggressive
and efficient in stemming words in various languages.
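A small sketch comparing NLTK's Porter and Snowball stemmers, including the over-stemming example mentioned above (assuming NLTK is installed):

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for word in ["university", "universe", "pirating", "participation"]:
    # both "university" and "universe" collapse to the same stem: over-stemming
    print(word, "->", porter.stem(word), "/", snowball.stem(word))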
Lemmatization:
Unlike stemming, lemmatization reduces words to their base form, ensuring the root
word belongs to the language. The WordNetlemmatizer is applied, and performance can
be enhanced further by incorporating parts-of-speech (POS) tagging.
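A brief sketch of WordNet lemmatization with NLTK, assuming the WordNet data has been downloaded (nltk.download('wordnet')); passing the part of speech improves the result.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("pirating"))           # unchanged when treated as a noun
print(lemmatizer.lemmatize("pirating", pos="v"))  # 'pirate' when tagged as a verb
print(lemmatizer.lemmatize("files", pos="n"))     # 'file'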
LETTER TO SOUND:
"Letter-to-sound" conversion refers to the process of converting written or typed text,
specifically the letters or characters, into their corresponding sounds or phonetic
representations. This conversion is essential in various applications, such as speech
synthesis, where a computer-generated voice needs to pronounce words accurately.
To make the output more authentic, these systems analyse the sounds of each letter and make the
output tones similar to a human voice. To achieve this, they do the following:
Context Sensitivity: The pronunciation of a word can depend on its context within a
sentence. Therefore, advanced letter-to-sound systems take contextual information into
account to enhance accuracy.
Here we use two different kinds of analysis: rule-based approaches and statistical
approaches.
Rule-Based Approaches:
Rule Sets:
Rule sets are developed by linguists and phonologists based on the analysis of the target
language's phonological and orthographic characteristics. These rule sets encompass
various pronunciation patterns and phonological rules governing the language.
Linguistic principles, such as phonotactics (permissible phoneme sequences), allophony
(variation of phonemes in different contexts), and syllable structure, inform the
development of rule sets.
Examples:
Advantages:
Interpretability: The rules used in rule-based systems are interpretable, allowing for
easy debugging and customization based on linguistic knowledge.
Control: Developers have control over the pronunciation process, enabling fine-tuning
of the system to produce desired phonetic outputs.
Limitations:
PROSODY:
Prosody refers to the rhythm, pitch, loudness, and intonation patterns of speech. It plays
a crucial role in conveying meaning, emotions, and the speaker's attitude. Prosody
encompasses various elements that contribute to the melodic and rhythmic aspects of
spoken language.
Pitch:
• Pitch refers to the perceived frequency of a speaker's voice. It can be high or low.
• Pitch variations can indicate emphasis, emotion, or changes in meaning. For
example, rising pitch at the end of a sentence can turn a statement into a question.
Rhythm:
Loudness:
Tempo:
• For instance, a faster tempo might indicate excitement, while a slower tempo could
express sadness.
Intonation:
• Intonation refers to the rising and falling patterns of pitch in connected speech.
• Intonation patterns can convey information about sentence type (statement,
question, command) and the speaker's emotional state.
• They also help listeners interpret the speaker's intended meaning.
Pauses:
Voice Quality:
• Voice quality relates to the characteristics of the speaker's voice, such as breathiness
or roughness.
• Voice quality can convey the speaker's emotional state and add nuance to the
message.
Emotional Expression:
Prosody is a powerful tool for expressing emotions. It can add warmth, enthusiasm, or
seriousness to the spoken words, making communication more engaging and effective.
Semantic Emphasis:
MOS:
A Mean Opinion Score (MOS) serves as a quantitative measure of the overall quality of a
particular event or experience, commonly used in telecommunications to assess the
quality of voice and video sessions. Traditionally, MOS ratings range from 1 (poor) to 5
(excellent), derived as averages from individual parameters scored by human observers
or approximated by objective measurement methods.
To get a MOS, people used to listen to or watch stuff and give their opinions. Nowadays,
we have machines that try to give scores like humans do. Different standards and
methods, like ITU-T's guidelines, help decide how to score things like phone calls and
video quality.
The commonly employed Absolute Category Ranking (ACR) scale, ranging from 1 to 5,
classifies quality levels as Excellent (5), Good (4), Fair (3), Poor (2), and Bad (1). An
MOS of approximately 4.3 to 4.5 is deemed excellent, while quality becomes
unacceptable below a MOS of around 3.5.
Low MOS ratings in video and voice calls may result from various factors along the
transmission chain, including hardware and software issues, network-related
impairments such as jitter, latency, and packet loss, which significantly impact
perceived call quality.
PESQ:
Typically, PESQ scores are categorized into six bands. The audio samples provided
below for each band represent actual recordings of audio quality tests conducted by
Cyara on our customers’ international contact numbers:
It’s easy to see how frustration can lead a customer to abandon a call when they
encounter a low audio quality score. In such cases, a productive conversation between
both parties becomes virtually impossible. For conversations falling within the range of
2.00 to 2.79, there might be a slight improvement, but it’s also likely to include phrases
such as, “Could you repeat that?” or “Sorry, I can’t hear you“. Typically, this leads to
significant delays in resolving issues and, eventually, customer frustration that might
lead to call abandonment.
It’s worth remembering, of course, that ‘good audio quality’ can vary from one country
to the next and even from one carrier line or contact number to the next. What may be
considered as acceptable in the United States could be deemed unachievable in Brazil.
This is where Cyara’s in-country benchmarks add value. You have the flexibility to create
the dialing patterns that align with your business needs, dialing from within the country
where you want to measure quality.
By proactively assessing your audio quality and benchmarking over time, you can make
more informed decisions. You can also determine which telecommunications providers
to choose and how best to route your calls. This approach enables you to adapt and
optimize your telecommunications approach to meet the distinct needs and expectations
of each region in which you operate.
ToBI:
The ToBI (Tones and Break Indices) framework may indeed be relevant, particularly in
the field of speech technology and natural language processing.
Text Speech Analysis involves the examination and interpretation of various linguistic
features present in spoken language, including prosody (intonation, rhythm, and stress
patterns), phonetics, syntax, semantics, and pragmatics. The ToBI framework, as
mentioned earlier, provides a standardized system for annotating and modeling
prosodic features in spoken language.
Therefore, while ToBI may not be explicitly mentioned in the context of TSA, its
principles and methodologies for prosodic analysis could be applied in TSA-related
tasks, such as speech recognition, sentiment analysis, dialogue systems, and other text
and speech processing applications.
1. Pitch Analysis:
Application: These techniques are crucial for accurately identifying and analyzing pitch
variations in speech.
2. Spectral Analysis:
Signal Processing Techniques: Spectral analysis, including Fourier Transform and
Short-Time Fourier Transform (STFT), is employed to examine the spectral components
of speech.
Application: This helps extract features related to voice quality, pitch, and other acoustic
characteristics.
3. Waveform Analysis:
Signal Processing Techniques: Time-domain waveform analysis involves examining the
characteristics of speech waveforms, including loudness and duration.
Application: This analysis provides insights into the temporal aspects of speech.
4. Prosody Modeling:
Signal Processing Techniques: Prosody modeling often employs techniques such as
Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) to capture
dynamic patterns of pitch, duration, and intensity.
Application: These models are essential for understanding and synthesizing prosodic
features.
Application: These techniques are valuable for creating natural and expressive synthetic
speech.
Concatenative TTS
Concatenative TTS relies on high-quality audio clips recordings, which are combined
together to form the speech. At the first step voice actors are recorded saying a range of
speech units, from whole sentences to syllables that are further labeled and segmented by
linguistic units from phones to phrases and sentences forming a huge database. During
speech synthesis, a Text-to-Speech engine searches such database for speech units that
match the input text, concatenates them together and produces an audio file.
The navigation app has a large database of recorded speech units, including phonemes,
diphones, and words or phrases related to navigation instructions (e.g., "Turn left",
These recordings are performed by professional voice actors and cover a wide range of
Text Analysis:
As you navigate, the app analyzes the route and upcoming maneuvers to determine the
The text is broken down into smaller linguistic units, such as phonemes, syllables, or
words, based on the granularity of the speech database.
Unit Selection:
The navigation app selects appropriate speech units from the database based on the
It considers factors such as the complexity of the maneuver, road conditions, and traffic
information when selecting units.
Unit Concatenation:
The selected speech units are concatenated together to form the synthesized speech
The app ensures smooth transitions between adjacent units to maintain naturalness and
Prosody Generation:
Prosodic features, such as pitch contour, duration, and intensity variations, are
incorporated into the synthesized speech to match the intended linguistic and emotional
The prosody of the concatenated speech units is adjusted dynamically based on the
Post-processing:
The synthesized speech is evaluated to ensure it meets the desired standards for clarity and naturalness.
Output Generation:
The synthesized speech is integrated into the navigation app for playback, allowing the driver to hear each instruction at the right moment.
Cons
- Such systems are very time consuming to build because they require huge databases and hard-coding of the combinations that form words and sentences.
- The resulting speech may sound less natural and emotionless, because it is nearly impossible to record all possible words spoken in all possible combinations of emotion, prosody, and stress.
Parametric approaches:
Text Analysis:
Analyze the input text to extract linguistic features, such as phonemes, prosodic cues,
and contextual information.
Tokenize the text into smaller linguistic units, such as words, syllables, or phonemes,
depending on the granularity of the synthesis model.
Example: Consider the input text "The quick brown fox jumps over the lazy dog."
Tokenize the text into words: ["The", "quick", "brown", "fox", "jumps", "over", "the",
"lazy", "dog"].
Feature Extraction:
Extract relevant linguistic features from the analyzed text, including phonetic content,
stress patterns, syntactic structure, and semantic information.
Use text analysis techniques, such as part-of-speech tagging, language modeling, and
phonetic transcription, to infer linguistic features from the input text.
Parameter Generation:
Map the extracted linguistic features to speech parameters that describe the
characteristics of the synthesized speech, such as pitch, duration, and spectral envelope.
Adapt the synthesis model to account for variations in speaker characteristics, speaking
styles, and linguistic contexts.
Example: For each unit we determine how high or low its pitch should be, how long it should last, and what its spectral shape should sound like.
Synthesis:
Use the generated speech parameters to synthesize speech waveforms that match the
desired characteristics of the input text.
Apply signal processing techniques, such as filtering, modulation, and envelope shaping,
to manipulate the speech parameters and produce natural-sounding speech.
Control the synthesis process to ensure appropriate prosody, timing, and intonation in
the synthesized speech output.
Example: For the word "jumps", we adjust its pitch (making it rise and fall), its duration (how long it lasts), and its overall sound to match the parameters decided earlier.
Post-processing:
Apply any additional processing or modifications to the synthesized speech output, such
as dynamic range compression, equalization, or noise reduction.
Adjust the synthesized speech output to meet quality and intelligibility standards, and
make any necessary enhancements or corrections.
Example: After creating the synthesized speech, we tweak it to sound better: we make
sure it's not too loud or too quiet (dynamic range compression), adjust the sound to
make it clearer and easier to understand (equalization), and remove any background
noise (noise reduction) to make it sound cleaner.
Evaluation:
Evaluate the quality and naturalness of the synthesized speech output using subjective
and/or objective measures.
Collect feedback from listeners or use automated evaluation metrics to assess the
performance of the parametric synthesis model.
Fine-tune the synthesis model based on evaluation results and user feedback to improve
the quality of the synthesized speech output.
Example: People listen to it to see if it sounds natural and clear (subjective evaluation),
and we also use machines to measure things like pitch and sound quality (objective
evaluation).
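As a rough illustration of this parametric pipeline, the sketch below turns a handful of made-up per-unit parameters (pitch contour, duration, amplitude) into a waveform using a simple sinusoidal source. Real parametric systems use full vocoders and far richer parameter sets; this is only a toy.

```python
# Toy parametric synthesis: per-unit parameters -> waveform via a sine "source".
import numpy as np

fs = 16000

def synth_unit(f0_start, f0_end, duration, amplitude=0.3):
    """Generate a waveform whose pitch glides from f0_start to f0_end (Hz)."""
    n = int(duration * fs)
    f0 = np.linspace(f0_start, f0_end, n)     # pitch contour parameter
    phase = 2 * np.pi * np.cumsum(f0) / fs    # integrate frequency to get phase
    env = np.hanning(n)                       # simple amplitude envelope
    return amplitude * env * np.sin(phase)

# Parameters chosen per unit (e.g. per word) during parameter generation
word_params = [
    ("jumps", 180, 140, 0.35),                # falling pitch, 350 ms
    ("over",  150, 130, 0.30),
]
audio = np.concatenate([synth_unit(a, b, d) for _, a, b, d in word_params])
```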
Comparison:
WaveNet:
The network consists of many layers of dilated convolutions, which allow the network to
capture long-range dependencies in the input data while maintaining computational
efficiency.
WaveNet uses a causal convolutional structure, where each output depends only on
previous input values, making it suitable for sequential data generation tasks like speech
synthesis.
Researchers typically avoid modeling raw audio because it changes very quickly, with a
lot of information packed into each second.
For example, there can be up to 16,000 pieces of sound data per second. Creating a
model that predicts each of these pieces based on all the previous ones is really hard.
However, our earlier PixelRNN and PixelCNN models showed that it was possible to
generate detailed images one tiny piece at a time. Even more impressively, they could do
this for each color in the image separately.
This success encouraged the adaptation of these two-dimensional PixelNets to one-dimensional audio data, which resulted in WaveNet.
WaveNet is structured as a fully convolutional neural network.
In WaveNet, the convolutional layers have different dilation factors, allowing the
network to cover thousands of time steps and capture complex patterns in the input
audio data.
During training, real audio waveforms from human speakers are used as input
sequences. After training, the network can generate synthetic speech by sampling from
the probability distribution computed by the network at each step.
This sampled value is then fed back into the input, and the process repeats to generate
the next sample. While this step-by-step sampling approach is computationally
expensive, it is crucial for generating realistic-sounding audio with complex patterns.
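A minimal sketch of the core WaveNet idea, causal dilated convolutions, is given below, assuming PyTorch is available. Gated activations, residual and skip connections, and the softmax output over quantized samples are deliberately omitted; only the causality and dilation mechanics are shown.

```python
# Sketch of WaveNet-style causal dilated convolutions (PyTorch assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        # Left-pad by (k - 1) * dilation so output[t] sees only inputs up to t
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                     # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))           # pad on the left only (causal)
        return self.conv(x)

# Dilations 1, 2, 4, ..., 128: the receptive field grows exponentially
stack = nn.Sequential(*[CausalDilatedConv(16, kernel_size=2, dilation=2 ** i)
                        for i in range(8)])
x = torch.randn(1, 16, 16000)                 # one second of 16-channel features
y = stack(x)                                  # same length as the input, causal
# Total receptive field: 1 + (2 - 1) * (1 + 2 + ... + 128) = 256 time steps
```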
WaveNet was trained on Google's TTS datasets to assess its performance. In Mean Opinion Score (MOS) tests, its quality, rated from 1 to 5, was compared with Google's best TTS systems (parametric and concatenative) and with human speech. MOS are standard subjective tests for sound quality, obtained from blind tests with human subjects, comprising over 500 ratings on 100 test sentences.
WaveNets significantly reduce the gap between state-of-the-art systems and human-
level performance by over 50% for both US English and Mandarin Chinese.
Given that Google's current TTS systems are considered some of the best worldwide for
both Chinese and English, improving upon them with a single model represents a
significant accomplishment.
To use WaveNet for text-to-speech, we need to provide it with the text we want it to
speak. We transform the text into a sequence of linguistic and phonetic features,
containing information about phonemes, syllables, words, etc.
These features are then fed into WaveNet. As a result, WaveNet's predictions are
conditioned not only on previous audio samples but also on the text input.
If we train WaveNet without the text sequence, it can still generate speech. However, in
this case, it has to make up what to say, lacking the contextual information provided by
the text input.
Working of Tacotron:
Input Representation:
The input text is first converted into a sequence of character or phoneme embeddings that the network can process.
Encoder:
The encoded text sequence is fed into an encoder network. The encoder's role is to
process the input text and capture its contextual information, such as linguistic features
and syntactic structure.
The encoder network often consists of convolutional or recurrent layers, such as Long
Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells, to capture temporal
dependencies in the input sequence.
Decoder:
The output of the encoder serves as the initial hidden state of the decoder network. The
decoder's task is to generate a spectrogram representation of the speech signal based on
the encoded text information.
The decoder is typically implemented using recurrent layers, such as LSTM or GRU
cells, and it operates autoregressively, generating one spectrogram frame at a time.
At each time step, the decoder attends to relevant parts of the encoded text sequence
using an attention mechanism. This allows the decoder to focus on different portions of
the input text dynamically as it generates the speech output.
Post-processing:
The spectrogram output from the decoder represents the spectral characteristics of the
speech signal over time. This spectrogram is then passed through a post-processing
stage to convert it into a time-domain waveform.
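As an illustration of this post-processing step, the sketch below converts a magnitude spectrogram back into a waveform with the Griffin-Lim algorithm, which the original Tacotron used for this stage. It assumes librosa is installed, and a bundled example clip stands in for a predicted spectrogram.

```python
# Post-processing sketch: magnitude spectrogram -> waveform via Griffin-Lim.
import numpy as np
import librosa

# Any audio clip stands in here; in Tacotron this would be the decoder's
# predicted spectrogram rather than one derived from real audio.
y, sr = librosa.load(librosa.ex("trumpet"))
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))   # magnitude spectrogram

# Griffin-Lim iteratively estimates the missing phase, then inverts the STFT
y_rec = librosa.griffinlim(S, n_iter=60, hop_length=256, n_fft=1024)
```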
Training:
Tacotron is trained in a supervised manner using pairs of input text and corresponding
speech audio data.
During training, the model learns to minimize the difference between the predicted
spectrogram and the ground truth spectrogram derived from the audio data.
Training typically involves optimizing a loss function, such as mean squared error
(MSE) or a combination of spectrogram-based losses and adversarial losses, through
techniques like gradient descent.
Once trained, the Tacotron model can be used for inference, where it takes a text input
and generates the corresponding speech waveform.
During inference, the model utilizes the learned parameters to predict the spectrogram
representation of the speech signal based on the input text.
The predicted spectrogram is then converted into a waveform using the same post-
processing techniques employed during training.
TSA-UNIT-V
SPEECH RECOGNITION
Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or
speech-to-text, is a capability which enables a program to process human speech into a written format. While
it’s commonly confused with voice recognition, speech recognition focuses on the translation of speech from a
verbal format to a text one whereas voice recognition just seeks to identify an individual user’s voice.
Speech recognition, or speech-to-text, is the ability of a machine or program to identify words spoken aloud and
convert them into readable text. Rudimentary speech recognition software has a limited vocabulary and may
only identify words and phrases when spoken clearly. More sophisticated software can handle natural speech,
different accents and various languages.
Speech recognition uses a broad array of research in computer science, linguistics and computer engineering.
Many modern devices and text-focused programs have speech recognition functions in them to allow for easier
or hands-free use of a device.
Speech recognition and voice recognition are two different technologies and should not be confused:
Speech recognition is used to identify words in spoken language.
Voice recognition is a biometric technology for identifying an individual's voice.
What are examples of speech recognition applications?
Dictation. Speech recognition can be found in word processing applications like Microsoft Word, where users can dictate words to be turned into text.
Education. Speech recognition software is used in language instruction. The software hears the user's speech
and offers help with pronunciation.
Customer service. Automated voice assistants listen to customer queries and provide helpful resources.
Healthcare applications. Doctors can use speech recognition software to transcribe notes in real time into
healthcare records.
Disability assistance. Speech recognition software can translate spoken words into text using closed captions to
enable a person with hearing loss to understand what others are saying. Speech recognition can also enable
those with limited use of their hands to work with computers, using voice commands instead of typing.
Court reporting. Software can be used to transcribe courtroom proceedings, precluding the need for human
transcribers.
Emotion recognition. This technology can analyze certain vocal characteristics to determine what emotion the
speaker is feeling. Paired with sentiment analysis, this can reveal how someone feels about a product or service.
Hands-free communication. Drivers use voice control for hands-free communication, controlling phones,
radios and global positioning systems, for instance.
What are the key features of speech recognition software?
Language weighting. This feature tells the algorithm to give special attention to certain words, such as those spoken frequently or that are unique to the conversation or subject. For example, the software can be trained to listen for specific product references.
Acoustic training. The software tunes out ambient noise that pollutes spoken audio. Software
programs with acoustic training can distinguish speaking style, pace and volume amid the din of
many people speaking in an office.
Speaker labeling. This capability enables a program to label individual participants and identify
their specific contributions to a conversation.
Profanity filtering. Here, the software filters out undesirable words and language.
What are the different speech recognition algorithms?
The power behind speech recognition features comes from a set of algorithms and technologies. They include
the following:
Hidden Markov model. HMMs are used in autonomous systems where a state is partially
observable or when all of the information necessary to make a decision is not immediately available
to the sensor (in speech recognition's case, a microphone). An example of this is in acoustic
modeling, where a program must match linguistic units to audio signals using statistical probability.
Natural language processing. NLP eases and accelerates the speech recognition process.
N-grams. This simple approach to language models creates a probability distribution for a sequence. For example, an algorithm can look at the last few words spoken, approximate the history of the speech sample, and use that to estimate the probability of the next word or phrase to be spoken (see the bigram sketch after this list).
Artificial intelligence. AI and machine learning methods like deep learning and neural networks are
common in advanced speech recognition software. These systems use grammar, structure, syntax
and composition of audio and voice signals to process speech. Machine learning systems gain
knowledge with each use, making them well suited for nuances like accents.
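The bigram sketch below, referenced in the N-grams item above, estimates next-word probabilities from a tiny made-up phrase using maximum-likelihood counts. Real recognizers use far larger corpora and smoothing techniques.

```python
# Toy bigram language model: P(next word | previous word) from counts.
from collections import Counter

corpus = "turn left at the next junction then turn right at the signal".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def next_word_probs(prev):
    """Probability distribution over words that follow `prev`."""
    return {w2: c / unigrams[prev]
            for (w1, w2), c in bigrams.items() if w1 == prev}

print(next_word_probs("turn"))   # {'left': 0.5, 'right': 0.5}
print(next_word_probs("at"))     # {'the': 1.0}
```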
What are the advantages of speech recognition?
There are several advantages to using speech recognition software, including the following:
Machine-to-human communication. The technology enables electronic devices to communicate with
humans in natural language or conversational speech.
Readily accessible. This software is frequently installed in computers and mobile devices, making it
accessible.
Easy to use. Well-designed software is straightforward to operate and often runs in the background.
Continuous, automatic improvement. Speech recognition systems that incorporate AI become more
effective and easier to use over time. As systems complete speech recognition tasks, they generate more
data about human speech and get better at what they do.
What are the disadvantages of speech recognition?
While convenient, speech recognition technology still has a few issues to work through. Limitations include:
Inconsistent performance. The systems may be unable to capture words accurately because of
variations in pronunciation, lack of support for some languages and inability to sort through background
noise. Ambient noise can be especially challenging. Acoustic training can help filter it out, but these
programs aren't perfect. Sometimes it's impossible to isolate the human voice.
Speed. Some speech recognition programs take time to deploy and master. The speech processing may
feel relatively slow.
Source file issues. Speech recognition success depends on the recording equipment used, not just the
software.
Acoustic Modelling
Acoustic modelling of speech typically refers to the process of establishing statistical representations for the feature vector sequences computed from the speech waveform. The Hidden Markov Model (HMM) is one of the most common types of acoustic model. Modern speech recognition systems use both an acoustic model and
a language model to represent the statistical properties of speech. The acoustic model models the
relationship between the audio signal and the phonetic units in the language. The language model is responsible
for modeling the word sequences in the language. These two models are combined to get the top-ranked word
sequences corresponding to a given audio segment.
Speech audio characteristics
Audio can be encoded at different sampling rates (i.e. samples per second – the most common being: 8, 16, 32,
44.1, 48, and 96 kHz), and different bits per sample (the most common being: 8-bits, 16-bits, 24-bits or 32-bits).
Speech recognition engines work best if the acoustic model they use was trained with speech audio which was
recorded at the same sampling rate/bits per sample as the speech being recognized.
The limiting factor for telephony based speech recognition is the bandwidth at which speech can be transmitted.
For example, a standard land-line telephone only has a bandwidth of 64 kbit/s at a sampling rate of 8 kHz and 8-
bits per sample (8000 samples per second * 8-bits per sample = 64000 bit/s). Therefore, for telephony based
speech recognition, acoustic models should be trained with 8 kHz/8-bit speech audio files.
In the case of Voice over IP, the codec determines the sampling rate/bits per sample of speech transmission.
Codecs with a higher sampling rate/bits per sample for speech transmission (which improve the sound quality)
necessitate acoustic models trained with audio data that matches that sampling rate/bits per sample.
For speech recognition on a standard desktop PC, the limiting factor is the sound card. Most sound cards today can record at sampling rates between 16 kHz and 48 kHz, with bit depths of 8 to 16 bits per sample, and play back at up to 96 kHz.
As a general rule, a speech recognition engine works better with acoustic models trained with speech audio data
recorded at higher sampling rates/bits per sample. But using audio with too high a sampling rate/bits per sample
can slow the recognition engine down. A compromise is needed. Thus for desktop speech recognition, the
current standard is acoustic models trained with speech audio data recorded at sampling rates of 16 kHz/16 bits
per sample.
Acoustic modeling is a crucial component in the field of automatic speech recognition (ASR) and various other
applications involving spoken language processing. It is the process of creating a statistical representation of the
relationship between acoustic features and phonemes, words, or other linguistic units in a spoken language.
Acoustic models play a central role in converting spoken language into text and are a key part of the larger ASR
system. Here's how acoustic modeling works:
1. **Feature Extraction:** The process starts with capturing audio input, which is typically sampled at a high
rate. Feature extraction is performed to convert this raw audio into a more compact and informative
representation. Common acoustic features include Mel-frequency cepstral coefficients (MFCCs) or filterbank
energies. These features capture the spectral characteristics of the audio signal over time.
2. **Training Data:** Acoustic modeling requires a significant amount of training data, typically consisting of
transcribed audio recordings. This data is used to establish statistical patterns between acoustic features and the
corresponding linguistic units (e.g., phonemes, words).
3. **Phoneme or State Modeling:** In traditional Hidden Markov Models (HMMs), which have been widely
used in ASR, the acoustic modeling process involves modeling phonemes or states. An HMM represents a
sequence of states, each associated with a specific acoustic observation probability distribution. These states
correspond to phonemes or sub-phonetic units.
4. **Building Gaussian Mixture Models (GMMs):** For each state or phoneme, a Gaussian Mixture Model (GMM) is constructed. GMMs are a set of Gaussian distributions that model the likelihood of observing specific acoustic features given a phoneme or state. These GMMs capture the variation in acoustic features associated with each phoneme (a toy fitting sketch follows this list).
5. **Training the Models:** During training, the GMM parameters are estimated to maximize the likelihood
of the observed acoustic features given the transcribed training data. This training process adjusts the means and
covariances of the Gaussian components to fit the observed acoustic data.
6. **Decoding:** When transcribing new, unseen audio, the acoustic model is used in combination with
language and pronunciation models. The ASR system uses these models to search for the most likely sequence
of phonemes or words that best matches the observed acoustic features. Decoding algorithms like the Viterbi
algorithm are commonly used for this task.
7. **Integration:** The output of the acoustic model is combined with language and pronunciation models to
generate a final transcription or understanding of the spoken input.
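As a toy illustration of step 4, the sketch below fits one diagonal-covariance GMM per "phoneme" using scikit-learn, which is an assumed dependency rather than something the notes prescribe. Random vectors stand in for real, aligned MFCC frames.

```python
# Toy GMM acoustic models: one mixture per phoneme, scored on a new frame.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
mfcc_aa = rng.normal(loc=0.0, scale=1.0, size=(500, 13))   # frames aligned to /aa/
mfcc_iy = rng.normal(loc=2.0, scale=1.0, size=(500, 13))   # frames aligned to /iy/

# One diagonal-covariance GMM per phoneme, trained on its own frames
gmm_aa = GaussianMixture(n_components=4, covariance_type="diag").fit(mfcc_aa)
gmm_iy = GaussianMixture(n_components=4, covariance_type="diag").fit(mfcc_iy)

# Decoding uses per-state log-likelihoods of each incoming frame
frame = mfcc_aa[:1]
print(gmm_aa.score_samples(frame), gmm_iy.score_samples(frame))
# The /aa/ model assigns the higher log-likelihood to an /aa/ frame.
```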
Modern ASR systems have evolved beyond HMM-based approaches, with deep learning techniques, such as
deep neural networks (DNNs) and recurrent neural networks (RNNs), becoming more prevalent in acoustic
modeling. Deep learning models can directly map acoustic features to phonemes or words, bypassing the need
for GMMs and HMMs. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often
used for this purpose. These deep learning models have significantly improved the accuracy of ASR systems,
making them more robust to various accents, noise, and speaking styles.
In summary, acoustic modeling is a crucial step in automatic speech recognition, responsible for establishing
the statistical relationship between acoustic features and linguistic units. This process enables the conversion of
spoken language into text, and advances in deep learning techniques have greatly improved the accuracy and
efficiency of acoustic models in ASR systems.
Feature Extraction
Feature extraction is a fundamental step in acoustic modeling for tasks like automatic speech recognition (ASR)
and speaker identification. Its primary goal is to convert the raw audio signal into a more compact and
informative representation that captures relevant acoustic characteristics. The choice of acoustic features greatly
impacts the performance of the acoustic model. Here are some common techniques for feature extraction in
acoustic modeling:
1. **Mel-Frequency Cepstral Coefficients (MFCCs):** MFCCs are one of the most widely used acoustic features in ASR. They mimic the human auditory system's sensitivity to different frequencies (a short extraction sketch follows at the end of this list). The MFCC extraction process typically involves the following steps:
- Pre-emphasis: Boosts high-frequency components to compensate for the naturally weaker high-frequency energy (spectral tilt) in speech.
- Framing: The audio signal is divided into short overlapping frames, often around 20-30 milliseconds in
duration.
- Windowing: Each frame is multiplied by a windowing function (e.g., Hamming window) to reduce spectral
leakage.
- Fast Fourier Transform (FFT): The power spectrum of each frame is computed using the FFT.
- Mel-filterbank: A set of triangular filters on the Mel-scale is applied to the power spectrum. The resulting
filterbank energies capture the distribution of energy in different frequency bands.
- Logarithm: The logarithm of filterbank energies is taken to simulate the human perception of loudness.
- Discrete Cosine Transform (DCT): DCT is applied to decorrelate the log filterbank energies and produce a
set of MFCC coefficients.
2. **Filterbank Energies:** These are similar to the intermediate step of MFCC computation but without the logarithm and DCT steps. Filterbank energies are a set of values that represent the energy in different frequency bands over time. They are often used in conjunction with MFCCs or as a simpler alternative when the benefits of MFCCs are not required.
3. **Spectrogram:** The spectrogram is a visual representation of the spectrum of frequencies in the audio
signal over time. It is often used as a feature for tasks that benefit from a time-frequency representation, such as
music genre classification and environmental sound recognition.
4. **Pitch and Fundamental Frequency (F0):** Extracting pitch information can be important for certain applications. Pitch is the perceived frequency of a sound and is often associated with prosody and intonation in speech.
5. **Linear Predictive Coding (LPC):** LPC analysis models the speech signal as the output of an all-pole
filter and extracts coefficients that represent the vocal tract's resonances. LPC features are used in speech coding
and sometimes ASR.
6. **Perceptual Linear Prediction (PLP) Cepstral Coefficients:** PLP is an alternative to MFCCs that
incorporates psychoacoustic principles, modeling the human auditory system's response more closely.
7. **Deep Learning-Based Features:** In recent years, deep neural networks have been used to learn features
directly from the raw waveform. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs)
can be used to capture high-level representations from audio data.
8. **Gammatone Filters:** These are designed to more closely mimic the response of the human auditory
system to different frequencies.
The choice of feature extraction method depends on the specific task and the characteristics of the data. For
ASR, MFCCs and filterbank energies are the most commonly used features. However, as deep learning
techniques become more prevalent in acoustic modeling, end-to-end systems that operate directly on raw audio
data are gaining popularity, and feature extraction is becoming integrated into the model architecture.
In signal processing, a filter bank (or filterbank) is an array of bandpass filters that separates the input signal
into multiple components, each one carrying a single frequency sub-band of the original signal.
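The sketch referenced in the MFCC item above is given here: it extracts MFCCs and log Mel-filterbank energies with librosa, which is assumed to be installed. The bundled example clip is only a stand-in for real speech.

```python
# Feature extraction sketch: MFCCs and log Mel-filterbank energies with librosa.
import librosa

# Any bundled clip stands in for speech here; load at 16 kHz
y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)

# 13 MFCCs per ~25 ms frame with a 10 ms hop
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# Log Mel-filterbank energies (the intermediate representation before the DCT)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)

print(mfcc.shape, log_mel.shape)    # (13, n_frames), (40, n_frames)
```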
HMM
A Hidden Markov Model (HMM) is a statistical model used for modeling sequential data, where the underlying
system is assumed to be a Markov process with hidden states. HMMs are widely used in various fields,
including speech recognition, natural language processing, bioinformatics, and more.
1. Markov Process:
A Markov process, also known as a Markov chain, is a stochastic model that describes a system's
transitions from one state to another over discrete time steps.
In a simple Markov chain, the future state of the system depends only on the current state and is
independent of all previous states. This property is called the Markov property.
2. Hidden States:
In an HMM, there are two sets of states: hidden states and observable states.
Hidden states represent the unobservable underlying structure of the system. They are
responsible for generating the observable data.
3. Observable Data:
Observable states are the data that can be directly measured or observed.
For example, in speech recognition, the hidden states might represent phonemes, while the
observable data are the audio signals.
4. State Transitions:
An HMM defines the probabilities of transitioning from one hidden state to another. These
transition probabilities are often represented by a transition matrix.
Transition probabilities can be time-dependent (time-inhomogeneous) or time-independent
(time-homogeneous).
5. Emission Probabilities:
Emission probabilities specify the likelihood of emitting observable data from a particular hidden
state.
In the context of speech recognition, these probabilities represent the likelihood of generating a
certain audio signal given the hidden state (e.g., a phoneme).
6. Initialization Probabilities:
An HMM typically includes initial probabilities, which represent the probability distribution over
the initial hidden states at the start of the sequence.
7. Observations and Inference:
Given a sequence of observations (observable data), the goal is to infer the most likely sequence
of hidden states.
This is typically done using algorithms like the Viterbi algorithm, which finds the most probable sequence of hidden states that generated the observations (a small numeric sketch follows this section).
8. Learning HMM Parameters:
Training an HMM involves estimating its parameters, including transition probabilities, emission
probabilities, and initial state probabilities.
This can be done using methods like the Baum-Welch algorithm, which is a variant of the
Expectation-Maximization (EM) algorithm.
9. Applications:
HMMs have a wide range of applications, such as speech recognition, where they can model
phonemes, natural language processing for part-of-speech tagging, bioinformatics for gene
prediction, and more.
10. Limitations:
HMMs assume that the system is a first-order Markov process, which means it depends only on
the current state. More complex dependencies might require more advanced models.
HMMs are also sensitive to their initial parameter estimates and might get stuck in local optima
during training.
In summary, Hidden Markov Models are a powerful tool for modeling and analyzing sequential data with
hidden structure. They are used in a variety of fields to uncover underlying patterns and make predictions based
on observed data.
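The sketch below gives the small numeric Viterbi example referenced in item 7, using made-up transition, emission, and initial probabilities for a two-state HMM with three observation symbols.

```python
# Viterbi decoding for a toy 2-state HMM with 3 observation symbols.
import numpy as np

pi = np.array([0.6, 0.4])                 # initial state probabilities
A = np.array([[0.7, 0.3],                 # transition probabilities A[i, j]
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],            # emission probabilities B[state, symbol]
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2]                        # observed symbol sequence

def viterbi(pi, A, B, obs):
    n_states, T = A.shape[0], len(obs)
    delta = np.zeros((T, n_states))       # best log-probability per state
    psi = np.zeros((T, n_states), dtype=int)
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)     # (from_state, to_state)
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + np.log(B[:, obs[t]])
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):         # backtrack the best path
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

print(viterbi(pi, A, B, obs))             # most likely hidden-state sequence
```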
HMM-DNN
The hybrid HMM-DNN approach in speech recognition makes use of the strong learning power of the DNN and the sequential modelling capability of the HMM. Because a DNN accepts only fixed-size inputs, it is difficult for it to handle speech signals directly, as they are variable-length, time-varying signals. In this approach the HMM handles the dynamic characteristics of the speech signal, while the DNN is responsible for the observation probabilities. Given the acoustic observations, each output neuron of the DNN is trained to estimate the posterior probability of a continuous-density HMM state. A DNN trained in the usual supervised way does not automatically produce good results and can be very difficult to drive to an optimal point.
When a set of data is given as input, emphasis should be placed on the variety of the data rather than its quantity, because diverse training data later enables better classification.
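A minimal numeric sketch of the hybrid trick described above: the DNN's state posteriors are divided by the state priors to obtain scaled likelihoods that can stand in for the HMM's emission probabilities during decoding. All numbers here are made up.

```python
# Hybrid HMM-DNN sketch: convert DNN state posteriors to scaled likelihoods.
import numpy as np

# Hypothetical DNN softmax output for one acoustic frame over 4 HMM states
posteriors = np.array([0.70, 0.15, 0.10, 0.05])   # P(state | frame)
priors = np.array([0.40, 0.30, 0.20, 0.10])       # P(state) from training alignments

# Scaled likelihood: p(frame | state) is proportional to posterior / prior
scaled_likelihood = posteriors / priors
log_likelihood = np.log(scaled_likelihood)        # fed into Viterbi decoding
print(log_likelihood)
```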
DNN-HMM systems, also known as Deep Neural Network-Hidden Markov Model systems, are a type of
technology used in automatic speech recognition (ASR) and other sequential data modeling tasks. These
systems combine deep neural networks (DNNs) with Hidden Markov Models (HMMs) to improve the accuracy
and robustness of speech recognition and other related applications. Here's a detailed explanation of DNN-
HMM systems:
3. **Acoustic Modeling**:
- Acoustic modeling in ASR is the process of estimating the likelihood of observing a given acoustic feature
(e.g., a frame of audio) given a particular state in the HMM.
- In DNN-HMM systems, DNNs are used to model this likelihood. They take acoustic features as input and
produce the probability distribution over the set of states.
4. **Phoneme or Subword Modeling**:
- DNN-HMM systems typically model phonemes, context-dependent phonemes, or subword units (e.g.,
triphones) as the hidden states in HMMs.
- The DNNs are trained to predict which phoneme or subword unit corresponds to a given acoustic frame,
given the surrounding context.
5. **Training**:
- DNN-HMM systems are trained using large datasets of transcribed speech. The DNNs are trained to
minimize the error between their predicted state probabilities and the true state labels in the training data.
- DNNs can be trained using supervised learning techniques, and backpropagation is used to update the
model's weights.
7. **Decoding**:
- During the decoding phase, DNN-HMM systems use algorithms like the Viterbi algorithm to find the most
likely sequence of hidden states (phonemes or subword units) that best explain the observed acoustic features.
8. **Benefits**:
- DNN-HMM systems have significantly improved ASR accuracy, especially in challenging environments
with background noise and variations in speech.
- They capture complex acoustic patterns and can model a wide range of speakers and accents effectively.
9. **Challenges**:
- Training deep neural networks requires large amounts of labeled data and significant computational
resources.
- DNN-HMM systems can be complex to design and optimize, and there's a risk of overfitting the model to
the training data.
DNN-HMM systems have been a major breakthrough in ASR technology and have significantly improved the
accuracy of speech recognition systems, making them more practical for real-world applications, including
voice assistants, transcription services, and more.