Unit 1 2 3 4 5 NLP Notes Merged
NLP is used in a wide variety of everyday products and services. Some of the most
common ways NLP is used are through voice-activated digital assistants on
smartphones, email-scanning programs used to identify spam, and translation apps
that decipher foreign languages.
The ability to analyse both structured and unstructured data, such as speech,
text messages, and social media posts.
Improving customer satisfaction and experience by identifying insights using
sentiment analysis.
Reducing costs by employing NLP-enabled AI to perform specific tasks, such
as chatting with customers via chatbots or analysing large amounts of text data.
Better understanding a target market or brand by conducting NLP analysis on
relevant data like social media posts, focus group surveys, and reviews.
NLP Limitations:
NLP can be used for a wide variety of applications but it's far from perfect. In fact,
many NLP tools struggle to interpret sarcasm, emotion, slang, context, errors, and
other types of ambiguous statements.
This means that NLP is mostly limited to unambiguous situations that don't require
a significant amount of interpretation.
NLP Examples:
Although natural language processing might sound like something out of a science
fiction novel, the truth is that people already interact with countless NLP-powered
devices and services every day.
Online chatbots, for example, use NLP to engage with consumers and direct them
toward appropriate resources or products. While chatbots can't answer every
question that customers may have, businesses like them because they offer cost-
effective ways to troubleshoot common problems or questions that consumers have
about their products.
Another common use of NLP is for text prediction and autocorrect, which you’ve
likely encountered many times before while messaging a friend or drafting a
document. This technology allows texters and writers alike to speed up their writing process and correct common typos.
NLP Applications:
Natural Language Processing (NLP) has a wide range of applications across
various industries and domains. Here are some real-world examples of how NLP
is being used:
2. Chatbots and Virtual Assistants: Virtual assistants like Siri, Alexa, and
Google Assistant rely heavily on NLP to understand and respond to user voice
commands and natural language queries. Chatbots are also used in customer
support to answer common questions and handle basic tasks.
Linguists tend to use the term grammar in an extended sense to cover all the
structure of human languages: phonology, morphology, syntax, and their
contribution to meaning.
However, even if you know the grammar of a language, in this sense, you still
need more knowledge to interpret many utterances.
All of the following sentences are underspecified in this sense. Pronouns, ellipsis (incomplete sentences) and other ambiguities of various kinds all require additional non-grammatical information to select an appropriate interpretation given the (extra)linguistic context.
Regular Expression (RE):
A regular expression (RE) is a language for specifying text search strings. RE helps
us to match or find other strings or sets of strings, using a specialized syntax held
in a pattern.
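As an illustration, here is a minimal sketch of regular-expression search using Python's re module (the pattern and text below are illustrative assumptions, not examples from the notes):

```python
# A minimal regular-expression sketch: find all 10-digit numbers and a keyword.
import re

text = "Call 9876543210 or 9123456789 for NLP course details."
pattern = r"\b\d{10}\b"                 # a 10-digit number bounded by word boundaries
print(re.findall(pattern, text))        # ['9876543210', '9123456789']
print(re.search(r"NLP", text).group())  # 'NLP'
```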
A corpus can be derived in different ways, such as from text that was originally electronic, transcripts of spoken language, optical character recognition (OCR) output, etc.
Corpus Size: Another important element of corpus design is its size. How large should the corpus be? There is no specific answer to this question. The size of the corpus depends upon the purpose for which it is intended as well as on some practical considerations, as follows:
Tokenization is the process of dividing a text into individual units called tokens.
Tokens are typically words, sub-words, or symbols, and they serve as the basic
building blocks for NLP tasks. Tokenization helps in extracting meaningful
information from text and is a necessary step for various applications in NLP. Here
are some key points about tokenization:
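For illustration, a minimal tokenization sketch using NLTK (this assumes the nltk package is installed and its 'punkt' tokenizer data has been downloaded via nltk.download('punkt')):

```python
# A minimal NLTK tokenization sketch: split text into sentences and word tokens.
from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP is fun. Tokenization splits text into tokens!"
print(sent_tokenize(text))   # e.g. ['NLP is fun.', 'Tokenization splits text into tokens!']
print(word_tokenize(text))   # e.g. ['NLP', 'is', 'fun', '.', 'Tokenization', ...]
```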
Segmentation is the process of dividing text into meaningful units, often used in
languages with little to no explicit word boundaries. Asian languages, such as
Chinese, Japanese, and Korean, rely heavily on segmentation because their writing
systems don't use spaces between words. Here are some key points about
segmentation:
A. Contractions:
B. Removing Punctuation:
Noise Reduction: Punctuation marks can introduce noise in the text data,
making it more challenging for NLP algorithms to extract meaningful
information.
Consistency: Removing punctuation ensures that text is consistent in its
representation, which can be important for tasks like text classification and
sentiment analysis.
Tokenization: Tokenization, which is the process of splitting text into words
or tokens, is simplified when punctuation is removed because words are
separated more cleanly.
However, it's important to note that some punctuation marks, such as hyphens,
apostrophes in contractions, and certain symbols in technical documents, should
be handled with care and not removed indiscriminately.
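A minimal sketch of punctuation removal using Python's string module and str.translate (note that, as cautioned above, this strips apostrophes and hyphens too, so contractions need separate handling):

```python
# A minimal punctuation-removal sketch using str.translate.
import string

text = "Hello, world! Isn't NLP great?"
clean = text.translate(str.maketrans("", "", string.punctuation))
print(clean)   # "Hello world Isnt NLP great"
```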
C. Handling Capitalization:
Capitalization refers to the use of uppercase and lowercase letters in text. Managing
capitalization is essential in NLP for the following reasons:
Stop words are common words such as "the," "a," "an," "in," "to," and "and" that
occur frequently in text but often carry little meaningful information. Managing
stop words is essential for several reasons:
Noise Reduction: Removing stop words helps reduce noise in the text data and
focuses the analysis on more meaningful content.
Computational Efficiency: Stop words consume memory and computational
resources but don't contribute significantly to NLP tasks like text classification,
sentiment analysis, or information retrieval. Removing them can improve
efficiency.
Customization: In some cases, depending on the specific NLP task, you may
choose to retain certain stop words that are contextually relevant.
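A minimal stop-word removal sketch using NLTK's English stop-word list (assumes the 'stopwords' and 'punkt' data have been downloaded; the example sentence is an illustrative assumption):

```python
# A minimal stop-word removal sketch with NLTK.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
stop_words.add("etc")   # custom words can be added to the set

tokens = word_tokenize("This is a simple example of stop word removal in NLP")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)   # e.g. ['simple', 'example', 'stop', 'word', 'removal', 'NLP']
```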
E. Stemming:
What is Stemming?
Stemming is the process of reducing a word to its stem by removing inflectional endings. For example, "laughing", "laughed", "laughs", and "laugh" will all become "laugh", which is their stem, because their inflectional endings are removed.
After stemming, the tokens that we get are "hi", "team", "are", "not", "winn". Notice that the keyword "winn" is not a regular word and "hi" has changed the context of the entire sentence.
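A minimal stemming sketch using NLTK's PorterStemmer (one common stemmer; other stemmers may produce slightly different stems for some words):

```python
# A minimal stemming sketch with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["laughing", "laughed", "laughs", "laugh"]:
    print(word, "->", stemmer.stem(word))   # all reduce to the stem "laugh"
```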
Applications of Stemming:
What is Lemmatization?
The output of lemmatization is the root word called a lemma. For example, lemmatizing "rocks" gives "rock" and lemmatizing "corpora" gives "corpus".
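A minimal lemmatization sketch using NLTK's WordNetLemmatizer (assumes the 'wordnet' data has been downloaded; for verbs the part of speech must be supplied):

```python
# A minimal lemmatization sketch with NLTK's WordNet lemmatizer.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("rocks"))             # rock
print(lemmatizer.lemmatize("corpora"))           # corpus
print(lemmatizer.lemmatize("running", pos="v"))  # run (needs the verb POS tag)
```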
Applications of Lemmatization:
Stemming vs. Lemmatization:
Stemming is a process that stems or removes the last few characters from a word, often leading to incorrect meanings and spelling. Lemmatization, in contrast, considers the context and converts the word to its meaningful base form, which is called a lemma.
For instance, stemming the word 'Caring' would return 'Car', whereas lemmatizing the word 'Caring' would return 'Care'.
Stemming is used in the case of large datasets where performance is an issue. Lemmatization is computationally expensive since it involves look-up tables and so on.
G. Part of Speech(POS) Tags in NLP:
Part of speech tags, or POS tags, are the properties of words that define their main context, their function, and their usage in a sentence. Some of the commonly used parts of speech tags are: Nouns, which define any object or entity; Verbs, which
define some action; and Adjectives or Adverbs, which act as the modifiers,
quantifiers, or intensifiers in any sentence. In a sentence, every word will be
associated with a proper part of the speech tag, for example,
In the below sentence, every word is associated with a part of the speech tag
which defines their functions.
In this case, "David" has the NNP tag, which means it is a proper noun; "has" and "purchased" belong to the verb category, indicating that they are actions; "laptop" and "Apple store" are nouns; and "new" is the adjective whose role is to modify the context of laptop.
Part of speech tags are defined by the relations of words with the other words in the sentence. Machine learning models or rule-based models are applied to obtain the part of speech tags of a word. The most commonly used part of speech tagging notation is provided by the Penn Treebank POS tag set.
Part of speech tags have a large number of applications and they are used in a
variety of tasks such as text cleaning, feature engineering tasks, and word
sense disambiguation. For example, consider these two sentences-
In both sentences, the keyword “book” is used but in sentence one, it is used as a
verb while in sentence two it is used as a noun.
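A minimal POS-tagging sketch using NLTK's pos_tag (assumes the 'punkt' and 'averaged_perceptron_tagger' data have been downloaded; the sentence is the one used above):

```python
# A minimal POS-tagging sketch with NLTK (Penn Treebank tags).
from nltk import pos_tag, word_tokenize

sentence = "David has purchased a new laptop from the Apple store"
print(pos_tag(word_tokenize(sentence)))
# e.g. [('David', 'NNP'), ('has', 'VBZ'), ('purchased', 'VBN'), ('a', 'DT'),
#       ('new', 'JJ'), ('laptop', 'NN'), ...]
```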
Minimum Edit Distance:
Many NLP tasks are concerned with measuring how similar two strings are.
Spell correction:
The user typed "graffe".
Which is closest: graf, grail, or giraffe?
The word giraffe, which differs by only one letter from graffe, seems intuitively to be more similar than, say, grail or graf.
The minimum edit distance between two strings is defined as the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into another.
The minimum edit distance between "intention" and "execution" can be visualized using their alignment.
Given two sequences, an alignment is a correspondence between substrings of the two sequences. (In the alignment figure, each aligned character pair is labelled with the operation used: substitution (S), insertion (I), or deletion (D).)
How do we find the minimum edit distance?
– We can think of this as a search task, in which we are searching for the shortest path (a sequence of edits) from one string to another.
The value of D(i, j) is computed by taking the minimum of the three possible paths through the matrix which arrive there:
D(i, j) = min( D(i-1, j) + del-cost,        (deletion)
               D(i, j-1) + ins-cost,        (insertion)
               D(i-1, j-1) + sub-cost )     (substitution)
Edit distance isn't sufficient
– We often need to align each character of the two strings to each other.
We do this by keeping a "backtrace": in each cell we record which of the three operations (deletion, insertion, or substitution) produced the minimum, so that the alignment can be recovered by tracing the pointers back from the final cell.
Adding Backtrace to Minimum Edit Distance:
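A minimal Python sketch of the dynamic-programming table with a backtrace (assuming a cost of 1 for each insertion, deletion, and substitution; some formulations instead use a substitution cost of 2):

```python
# Minimum edit distance with a backtrace (unit costs assumed).
def min_edit_distance(source, target):
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]      # D[i][j]: distance source[:i] -> target[:j]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], back[i][0] = i, "del"
    for j in range(1, m + 1):
        D[0][j], back[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            choices = [
                (D[i - 1][j] + 1, "del"),             # deletion
                (D[i][j - 1] + 1, "ins"),             # insertion
                (D[i - 1][j - 1] + sub_cost, "sub"),  # substitution (or copy if chars match)
            ]
            D[i][j], back[i][j] = min(choices)
    # Follow the backtrace from (n, m) back to (0, 0) to recover the alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append(op)
        if op == "del":
            i -= 1
        elif op == "ins":
            j -= 1
        else:
            i, j = i - 1, j - 1
    return D[n][m], list(reversed(ops))

print(min_edit_distance("intention", "execution"))   # distance 5 with unit costs, plus the operations
```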
The N-gram model is a statistical language model that estimates the probability
of the next word in a sequence based on the previous N-1 words. It works on the
assumption that the probability of a word depends only on the preceding N-1
words, which is known as the Markov assumption. In simpler terms, it predicts
the likelihood of a word occurring based on its context.
The N-gram language model works by calculating the frequency of each N-gram
in a large corpus of text. The frequency of these N-grams is then used to estimate
the probability of a particular word given its context.
For example, let's consider the sentence, "I am going to the grocery store". We
can generate bigrams from this sentence by taking pairs of adjacent words, such
as "I am", "am going", "going to", "to the", and "the grocery". The frequency of
each bigram in a large corpus of text can be calculated, and these frequencies can
be used to estimate the probability of the next word given the previous word.
A bigram model is a type of n-gram model where n=2. This means that the model
calculates the probability of each word in a sentence based on the previous word.
To calculate the probability of the next word given the previous word(s) in a bigram model, we use the following formula:
P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1})
where P(w_n | w_{n-1}) is the probability of the nth word given the previous word, count(w_{n-1}, w_n) is the count of the bigram (w_{n-1}, w_n) in the text, and count(w_{n-1}) is the count of the previous word in the text.
For example, let's say we have the sentence: "The cat sat on the mat." In a bigram
model, we would calculate the probability of each word based on the previous
word:
The probability of each word given the previous word can be used to generate
new sentences or to evaluate the likelihood of a given sentence.
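A minimal sketch of bigram counting and maximum-likelihood estimation on a toy corpus (the corpus and the sentence-boundary markers are illustrative assumptions):

```python
# A minimal bigram MLE sketch: count unigrams and bigrams, then estimate P(w_n | w_{n-1}).
from collections import Counter

corpus = ["the cat sat on the mat", "the cat ate the fish"]
tokens = []
for sentence in corpus:
    tokens += ["<s>"] + sentence.split() + ["</s>"]   # add sentence boundary markers

unigram_counts = Counter(tokens)
# For simplicity, bigrams spanning sentence boundaries are not filtered out here.
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev_word, word):
    # P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1})
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("the", "cat"))   # 2 / 4 = 0.5 on this toy corpus
```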
A trigram model (n = 3) conditions on the previous two words instead:
P(w_n | w_{n-2}, w_{n-1}) = count(w_{n-2}, w_{n-1}, w_n) / count(w_{n-2}, w_{n-1})
where P(w_n | w_{n-2}, w_{n-1}) is the probability of the nth word given the
previous two words, and count(w_{n-2}, w_{n-1}, w_n) is the count of the
trigram (w_{n-2}, w_{n-1}, w_n) in the text, and count(w_{n-2}, w_{n-1}) is
the count of the previous two words in the text.
For example, let's say we have the sentence: "I love to eat pizza." In a trigram
model, we would calculate the probability of each word based on the previous
two words:
Again, the probability of each word given the previous two words can be used to
generate new sentences or to evaluate the likelihood of a given sentence.
However, trigram models may be less reliable than bigram models when data is limited, because they require more context (and hence more data) to estimate the probability of each word.
There is a problem with out-of-vocabulary words: these words appear during testing but not in training. One solution is to use a fixed vocabulary and then convert out-of-vocabulary words in the training data to pseudowords.
When implemented for sentiment analysis, the bigram model outperformed the unigram model, but the number of features then doubled. So, scaling the N-gram model to larger data sets or moving to higher orders needs better feature selection approaches.
The N-gram model captures long-distance context poorly. It has been shown that beyond 6-grams, the performance gain is limited.
Language Model Evaluation:
To answer the above questions for language models, we first need to answer the
following intermediary question: Does our language model assign a higher
probability to grammatically correct and frequent sentences than those sentences
which are rarely encountered or have some grammatical error? To train parameters
of any model we need a training dataset. After training the model, we need to
evaluate how well the model’s parameters have been trained; for which we use a
test dataset which is utterly distinct from the training dataset and hence unseen by
the model. After that, we define an evaluation metric to quantify how well our
model performed on the test dataset.
Language model evaluation is of two kinds: extrinsic (in-vivo) evaluation of language models and intrinsic evaluation of language models.
For comparing two language models A and B, pass both the language models
through a specific natural language processing task and run the job. After that
compare the accuracies of models A and B to evaluate the models in comparison
to one another. The natural language processing task may be text summarization,
sentiment analysis and so on.
As a result, better language models will have lower perplexity values or higher
probability values for a test set.
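For reference, the standard intrinsic metric is perplexity: the inverse probability of the test set W = w_1 w_2 ... w_N, normalized by the number of words N (stated here in its usual form; minimizing perplexity is equivalent to maximizing the test-set probability):

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}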
Smoothing Techniques:
What is Smoothing in NLP?
Smoothing refers to the technique we use to adjust the probabilities used in the
model so that our model can perform more accurately and even handle the words
absent in the training set.
As you can see, P("I like mathematics") comes out to be 0 even though it can be a proper sentence; due to limited training data, our model didn't do well.
Now, we’ll see how smoothing can solve this issue.
Types of Smoothing in NLP:
2. Add K Smoothing:
P_add-k(w_i | w_{i-1}) = (count(w_{i-1}, w_i) + k) / (count(w_{i-1}) + k·V)
Choosing the right value of k is a tedious job, so this method is not preferred much.
3. Backoff:
It says that using less context can be good when evidence is insufficient:
Start with the n-gram;
if there are insufficient observations, check the (n-1)-gram;
if there are still insufficient observations, check the (n-2)-gram.
4. Interpolation:
Try a mixture of (multiple) n-gram models
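To make the add-k formula above concrete, here is a minimal sketch (the value of k, the toy corpus, and the example words are illustrative assumptions):

```python
# A minimal add-k smoothing sketch for bigram probabilities.
from collections import Counter

k = 0.5                                  # assumed smoothing constant
tokens = "the cat sat on the mat".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)                  # vocabulary size

def addk_bigram_prob(prev_word, word):
    # P_add-k(w_i | w_{i-1}) = (count(w_{i-1}, w_i) + k) / (count(w_{i-1}) + k*V)
    return (bigram_counts[(prev_word, word)] + k) / (unigram_counts[prev_word] + k * V)

print(addk_bigram_prob("the", "cat"))   # seen bigram
print(addk_bigram_prob("the", "sat"))   # unseen bigram still gets a probability > 0
```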
1. Word Similarity:
Word similarity refers to the degree of resemblance or likeness between
two or more words in terms of their meaning. It quantifies how closely
related or interchangeable words are in a given context. Word similarity
can be measured using various techniques, including computational
methods like cosine similarity or word embeddings (e.g., Word2Vec or
GloVe) that capture semantic relationships between words based on their
co-occurrence patterns in large text corpora. High word similarity suggests
that words share similar meanings or are semantically related, while low
similarity implies less semantic overlap.
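A minimal sketch of cosine similarity between word vectors (the three-dimensional vectors below are made-up toy numbers, not real embeddings):

```python
# A minimal cosine-similarity sketch with NumPy.
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

v_car = np.array([0.8, 0.1, 0.3])
v_truck = np.array([0.7, 0.2, 0.4])
v_banana = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(v_car, v_truck))    # high similarity (~0.98)
print(cosine_similarity(v_car, v_banana))   # low similarity (~0.29)
```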
2. Word Senses:
In natural language, many words have multiple meanings or senses. Word
senses refer to the different interpretations or definitions that a word can
have, depending on the context in which it is used. These different senses
can be quite distinct or subtly related. For example, the word "bank" can
refer to a financial institution (e.g., a bank where you deposit money) or
the side of a river (e.g., the bank of a river). Identifying and disambiguating
word senses is crucial for tasks like natural language understanding,
machine translation, and information retrieval.
3. Lexical Semantics:
Lexical semantics is a subfield of linguistics and computational linguistics
that focuses on the study of word meaning, particularly how words convey
meaning in context. It explores how words relate to each other, their sense
distinctions, and the semantic relationships that exist between words in a
language. Lexical semantics delves into questions such as synonymy (the
relationship between synonyms), antonymy (the relationship between
antonyms), hypernymy (the relationship between a more general word and
its more specific instances), and hyponymy (the relationship between a
specific instance and the general category it belongs to). Lexical semantics
also examines phenomena like polysemy (a word having multiple related
senses) and homonymy (a word having multiple unrelated senses).
Computational techniques and resources, such as lexical databases and
semantic similarity measures, are used to model and study lexical
semantics in computational linguistics.
How can we build a computational model that explains the meaning of a word? The answer is vector semantics. Vector semantics defines semantics and interprets word meaning to explain features such as word similarity. Its central idea is: two words are similar if they have similar word contexts.
In its current form, the vector model draws its inspiration from linguistic and philosophical work of the 1950s. Vector semantics represents a word in a multi-dimensional vector space. The vector model is also called embeddings, due to the fact that a word is embedded in a particular vector space. The vector model offers many advantages in NLP. For example, in sentiment analysis, it sets up a boundary class and predicts if the sentiment is positive or negative (a binomial
classification). Another key practical advantage with vector semantics is that it
can learn automatically from text without complex labelling or supervision. As a
result of these advantages, the vector semantics has become a de-facto standard
for NLP applications such as Sentiment Analysis, Named Entity Recognition
(NER), topic modelling, and so on.
Type of Word Embedding:
Table 1
The term-document (TD) matrix was originally devised for Information Retrieval (IR), wherein you are required to find similar documents for a given query. For example, in the above table, you can see that the plays "As You Like It" and "Twelfth Night" might be similar based on their column vector entries — comparing As You Like It [1, 114, 36, 20] with Twelfth Night [0, 80, 58, 15].
Of course, here we have sampled only 4 words. That is, the vocabulary size is 4 and each vector will have a dimension of |V| = 4. In a typical NLP application, where you consider thousands of documents, you will see that this matrix will be long and sparse — mostly with entries of 0.
In this case, the vocabulary size is |V| = 6 and the dimension of the matrix is |V| x
|V| — in this case, 6 x 6. For this toy example, the TC matrix with a window size
of 1 is:
Table 2
Of course, a typical NLP application will have a vocabulary size in the thousands. So the TC matrix, too, is very large and sparse.
Both these co-occurrence matrices show the frequency of an item associated with documents. It turns out that word frequency alone isn't enough to understand a word's importance, i.e., whether a word is discriminative or not. Traditionally, we believe that more frequent words are more important than infrequent words. However, there are frequent words, such as "good" in Table 1 above, that are unimportant. How do you balance these two contrasting expectations: capture all frequent words, yet require them to be discriminative? TF-IDF answers this paradox.
2. Term Frequency, Inverse Document Frequency (TF-IDF):
The tf-idf is a product of two terms: term frequency (tf) and inverse document
frequency (idf). The TF defines the frequency of a given term in a given document.
In an NLP application, since the frequency of a term might be too high, we down-weight the frequency by applying a log10 scale. So, TF is defined as:
tf(t, d) = log10(count(t, d) + 1)
Inverse Document Frequency(idf) assigns higher weights to words that occur only
in few documents. Such words are quite useful for discriminating those documents
from the rest of the collection. The idf is based on N/df_t, where N is the number of documents and df_t is the document frequency, i.e., the number of documents in which the term t occurs. Here too, we apply the log10 scale to down-weight the IDF value:
idf(t) = log10(N / df_t)
The tf-idf weight of a term t in a document d is then the product w(t, d) = tf(t, d) × idf(t). As you can see, tf-idf appropriately measures the importance of a word and helps us identify whether a given word is discriminative or not.
Though TF-IDF is an improvement over the simple bag of words approach and
yields better results for common NLP tasks, the overall pros and cons remain the
same. We still need to create a huge sparse matrix, which also takes a lot more
computation than the simple bag of words approach.
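A minimal TF-IDF sketch using scikit-learn's TfidfVectorizer (assumes scikit-learn is installed; its internal weighting and normalization differ slightly from the plain log10 formulas above):

```python
# A minimal TF-IDF sketch: build a sparse document-term matrix of tf-idf weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term matrix
print(vectorizer.get_feature_names_out())    # learned vocabulary
print(X.toarray().round(2))                  # tf-idf weights per document
```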
3. Word2Vec:
Word2Vec model comes in two flavors: Skip Gram Model and Continuous Bag
of Words Model (CBOW).
Word2Vec has several advantages over the bag of words and TF-IDF schemes.
Word2Vec retains the semantic meaning of different words in a document. The
context information is not lost.
Another great advantage of Word2Vec approach is that the size of the embedding
vector is very small. Each dimension in the embedding vector contains
information about one aspect of the word. We do not need huge sparse vectors,
unlike the bag of words and TF-IDF approaches.
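A minimal Word2Vec sketch using the gensim library (assumes gensim 4.x is installed; the toy corpus is far too small to learn meaningful embeddings and is shown only to illustrate the API):

```python
# A minimal skip-gram Word2Vec sketch with gensim.
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "natural", "language", "processing"],
    ["i", "love", "machine", "learning"],
    ["deep", "learning", "powers", "modern", "nlp"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram
print(model.wv["learning"][:5])             # first few dimensions of the embedding
print(model.wv.most_similar("learning"))    # nearest neighbours in this toy space
```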
Pointwise Mutual Information (PMI):
Use cases of NLP can be seen across industries, like understanding customers' issues, predicting the next word a user is planning to type on the keyboard, automatic text summarization, etc. Many researchers across the world have trained NLP models in several human languages like English, Spanish, French, Mandarin, etc. so that the benefit of NLP can be seen in every society. Pointwise Mutual Information (PMI) is a very useful NLP metric for identifying words that tend to go together.
PMI helps us to find related words. In other words, it quantifies how much more likely the co-occurrence of two words is than we would expect by chance. For example, the phrase "Data Science" has a specific meaning when the two words "Data" and "Science" go together; otherwise, the meanings of these two words are independent. Similarly, "Great Britain" is meaningful since we know the word "Great" can be used with several other words but is not so relevant in meaning in combinations like "Great UK", "Great London", "Great Dubai", etc.
When words w1 and w2 are independent, their joint probability is equal to the product of their individual probabilities:
PMI(w1, w2) = log2( p(w1, w2) / (p(w1) · p(w2)) )
When the PMI formula above returns 0, it means the numerator and denominator are the same, and taking the log of 1 produces 0. In simple words, it means the words together have no specific meaning or relevance beyond chance. What are we trying to achieve here? We are focusing on word pairs which have a high joint probability but whose individual probabilities of occurrence are not so high when the words are considered separately. This implies that such a word pair has a specific meaning.
Steps to compute PMI:
Yes, PMI can be negative. Remember log2(0) is -Inf, so the PMI score lies between −∞ and +∞. For demonstration, let's assume the joint probability p(w1, w2) is 0.001 while the product of the individual probabilities p(w1)·p(w2) is 0.002. PMI in that case would be log2(0.5) = -1. Negative PMI means the words are co-occurring less than we would expect by chance.
PPMI builds on PMI but addresses some of its limitations, particularly when dealing with sparse datasets (common in NLP). It focuses on emphasizing positive associations and reduces the impact of low-frequency and uninformative word pairs. The formula for PPMI is:
PPMI(w1, w2) = max( PMI(w1, w2), 0 )
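A minimal sketch of computing PMI and PPMI from probabilities (the probability values are illustrative, not estimated from a corpus):

```python
# A minimal PMI / PPMI sketch.
import math

def pmi(p_joint, p_w1, p_w2):
    return math.log2(p_joint / (p_w1 * p_w2))

def ppmi(p_joint, p_w1, p_w2):
    return max(pmi(p_joint, p_w1, p_w2), 0.0)

# Co-occurring more often than chance -> positive PMI
print(pmi(0.001, 0.01, 0.01))                      # log2(10) ≈ 3.32
# Co-occurring less often than chance -> negative PMI, clipped to 0 by PPMI
print(pmi(0.001, 0.1, 0.02), ppmi(0.001, 0.1, 0.02))   # -1.0  0.0
```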
Information retrieval (IR):
5. Ranking: Once the index is built, the system ranks the documents in the
collection based on their relevance to the query. Various ranking algorithms are
used, with common approaches including:
6. Retrieval: The ranked list of documents is presented to the user, with the most
relevant documents appearing at the top. Users can then select and review the
documents that are likely to contain the information they need.
2. Lower Case: If all of the text is in the same case, it is easier for a machine to interpret the words, because lower case and upper case are treated differently by the machine; for example, words like "Ball" and "ball" are treated as different words. So, we need to convert the text to the same case, and the most preferred case is lower case, to avoid such problems.
5. Remove Stop Words: Stop words are the most commonly occurring words in a text which do not provide any valuable information. Words like "they", "there", "this", "where", etc. are some of the stop words. The NLTK library is commonly used to remove stop words and includes approximately 180 stop words that it removes. If we want to add any new word to the set of stop words, it is easy using the add method.
6. Rephrase Text: We may need to modify some text or change a particular pattern to a standard string which makes it easy to identify; for example, we can match the pattern of email ids and change them to a string like "email address".
8. Remove Extra (White) Spaces: Most of the time, text data contains extra spaces, or while performing the above preprocessing techniques more than one space is left between words, so we need to handle this problem. The regular expression library performs well for solving this problem.
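A minimal sketch of the rephrasing and extra-whitespace steps above with the re library (the email pattern is a simplified, illustrative one):

```python
# A minimal text-cleanup sketch: replace email ids and collapse extra spaces.
import re

text = "Contact me  at   john.doe@example.com   for   details "
text = re.sub(r"\S+@\S+\.\S+", "email address", text)   # rephrase: replace email ids
text = re.sub(r"\s+", " ", text).strip()                 # collapse extra whitespace
print(text)   # "Contact me at email address for details"
```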
Context-Free Grammars:
What is Grammar?
Syntax also refers to the way words are arranged together. Let us see some basic
ideas related to syntax:
Regular languages and parts of speech refer to the way words are arranged together, but they cannot easily capture richer syntactic structure. Key ideas that go beyond them are constituency, grammatical relations, and subcategorization and dependency relations.
Let us move on to discuss the types of grammar in NLP. We will cover three types
of grammar: context-free, constituency, and dependency.
The symbols that express abstractions over these terminals are called non-
terminals.
In each context-free rule, the item to the right of the arrow (→) is an ordered list of one or more terminals and non-terminals, and to the left of the arrow is a single non-terminal symbol expressing some cluster or generalization. The non-terminal associated with each word in the lexicon is its lexical category or part of speech.
Context Free Grammar consists of a finite set of grammar rules that have
four components: a Set of Non-Terminals, a Set of Terminals, a Set of
Productions, and a Start Symbol.
CFG can also be seen as a notation used for describing languages, and it is a superset of regular grammar.
CFG consists of a finite set of grammar rules having the following four
components:
Set of Non-terminals: It is represented by V. The non-terminals are
syntactic variables that denote the sets of strings, which help define the
language generated with the help of grammar.
Set of Terminals: Terminals are also known as tokens and are represented by Σ. Strings are formed from the basic symbols of terminals.
Set of Productions: It is represented by P. The set explains how the
terminals and non-terminals can be combined.
Every production consists of the following components:
Non-terminals are also called variables or placeholders as they stand for other symbols, either terminals or non-terminals; they are symbols representing the structure of the language being described. A production rule specifies how to replace a non-terminal symbol with a string of symbols, which can include terminals (words or characters) and other non-terminals.
Start Symbol: The formal language defined by a CFG is the set of strings
derivable from the designated start symbol. Each grammar must have one
designated start symbol, which is often called S.
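A minimal sketch of a context-free grammar and a parse using NLTK's chart parser (a toy grammar that covers only the example sentence "the dog chased the cat" used below):

```python
# A minimal CFG and chart-parsing sketch with NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det Noun
VP -> V NP
Det -> 'the'
Noun -> 'dog' | 'cat'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)   # (S (NP (Det the) (Noun dog)) (VP (V chased) (NP (Det the) (Noun cat))))
```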
The properties are derived generally with the help of other NLP concepts
like part of speech tagging, a noun or Verb phrase identification, etc. For
example, Constituency grammar can organize any sentence into its three
constituents - a subject, a context, and an object.
Look at a sample parse tree: Example sentence - "The dog chased the cat."
In this parse tree, the sentence is represented by the root node S (for
sentence). The sentence is divided into two main constituents: NP (noun
phrase) and VP (verb phrase).
The NP is further broken down into Det (determiner) and Noun, and the
VP is further broken down into V (verb) and NP.
Each of these constituents can be further broken down into smaller
constituents.
Constituency grammar is better equipped to handle context-free and
dependency grammar limitations. Let us look at them:
Constituency grammar is not language-specific, making it easy to use
the same model for multiple languages or switch between languages,
hence handling the multilingual issue plaguing the other two types of
grammar.
Deep Learning Based NER: Deep learning NER is much more accurate than the previous methods, as it is capable of assembling words. This is because it uses a method called word embedding, which is capable of understanding the semantic and syntactic relationships between various words. It is also able to learn and analyse topic-specific as well as high-level words automatically. This makes deep learning NER applicable for performing multiple tasks. Deep learning can do most of the repetitive work itself, hence researchers, for example, can use their time more efficiently.
Syntactic Analysis:
Syntactic analysis, or parsing, or syntax analysis is the third phase of NLP. The purpose of this phase is to draw exact meaning, or you can say dictionary meaning, from the text. Syntax analysis checks the text for meaningfulness by comparing it to the rules of formal grammar. For example, a sentence like "hot ice-cream" would be rejected by the semantic analyzer.
In this sense, syntactic analysis or parsing may be defined as the process of
analyzing the strings of symbols in natural language conforming to the rules of
formal grammar. The origin of the word ‘parsing’ is from Latin word
‘pars’ which means ‘part’.
Concept of Parser: A parser is used to implement the task of parsing. It may be defined as the software component designed for taking input data (text) and giving a structural representation of the input after checking for correct syntax as per formal grammar. It also builds a data structure, generally in the form of a parse tree, an abstract syntax tree, or another hierarchical structure.
For example, news articles can be organized by topics; support tickets can be organized by urgency; chat conversations can be organized by language; brand mentions can be organized by sentiment; and so on.
It’s estimated that around 80% of all information is unstructured, with text being
one of the most common types of unstructured data. Because of the messy nature
of text, analysing, understanding, organizing, and sorting through text data is hard
and time-consuming, so most companies fail to use it to its full potential.
This is where text classification with machine learning comes in. Using text
classifiers, companies can automatically structure all manner of relevant text,
from emails, legal documents, social media, chatbots, surveys, and more in a fast
and cost-effective way. This allows companies to save time analysing text data,
automate business processes, and make data-driven business decisions.
Why use machine learning text classification? Some of the top reasons:
Manual text classification involves a human annotator, who interprets the content
of text and categorizes it accordingly. This method can deliver good results but
it’s time-consuming and expensive.
There are many approaches to automatic text classification, but they all fall under
three types of systems:
Rule-based systems
Machine learning-based systems
Hybrid systems
1. Rule-based systems:
Next, when you want to classify a new incoming text, you’ll need to count
the number of sport-related words that appear in the text and do the same
for politics-related words. If the number of sports-related word
appearances is greater than the politics-related word count, then the text is
classified as Sports and vice versa.
For example, this rule-based system will classify the headline “When is
LeBron James' first game with the Lakers?” as Sports because it counted
one sports-related term (LeBron James) and it didn’t count any politics-
related terms.
The first step towards training a machine learning NLP classifier is feature
extraction: a method is used to transform each text into a numerical
representation in the form of a vector. One of the most frequently used
approaches is bag of words, where a vector represents the frequency of a
word in a predefined dictionary of words.
For example, if we have defined our dictionary to have the following words
{This, is, the, not, awesome, bad, basketball}, and we wanted to vectorize
the text “This is awesome,” we would have the following vector
representation of that text: (1, 1, 0, 0, 1, 0, 0).
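A minimal bag-of-words sketch using scikit-learn's CountVectorizer (note that by default it lowercases text and drops single-character tokens, so the learned vocabulary differs slightly from the hand-built dictionary above):

```python
# A minimal bag-of-words sketch: turn texts into word-count vectors.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["This is awesome", "This is not awesome", "Basketball is bad"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())   # learned dictionary of words
print(X.toarray())                          # word-count vector for each text
```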
Then, the machine learning algorithm is fed with training data that consists
of pairs of feature sets (vectors for each text example) and tags
(e.g., sports, politics) to produce a classification model:
Once it’s trained with enough training samples, the machine learning
model can begin to make accurate predictions. The same feature extractor
is used to transform unseen text to feature sets, which can be fed into the
classification model to get predictions on tags (e.g., sports, politics):
Text classification with machine learning is usually much more accurate
than human-crafted rule systems, especially on complex NLP classification
tasks. Also, classifiers with machine learning are easier to maintain and
you can always tag new examples to learn new tasks.
Some of the most popular text classification algorithms include the Naive Bayes
family of algorithms, support vector machines (SVM), and deep learning.
1. Naive Bayes:
The Naive Bayes family of statistical algorithms are some of the most used
algorithms in text classification and text analysis, overall.
One of the members of that family is Multinomial Naive Bayes (MNB) with
a huge advantage, that you can get really good results even when your dataset
isn’t very large (~ a couple of thousand tagged samples) and computational
resources are scarce.
This means that any vector that represents a text will have to contain
information about the probabilities of the appearance of certain words within
the texts of a given category, so that the algorithm can compute the likelihood
of that text belonging to the category.
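A minimal Multinomial Naive Bayes text classifier using scikit-learn (the tiny labelled dataset is an illustrative assumption):

```python
# A minimal Multinomial Naive Bayes text-classification sketch.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "LeBron scored 40 points in the game",
    "The team won the championship final",
    "Parliament passed the new budget bill",
    "The minister announced election dates",
]
labels = ["sports", "sports", "politics", "politics"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["When is the next basketball game?"]))   # likely 'sports' on this toy data
```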
2. Support Vector Machines:
In short, SVM draws a line or “hyperplane” that divides a space into two
subspaces. One subspace contains vectors (tags) that belong to a group, and
another subspace contains vectors that do not belong to that group.
The optimal hyperplane is the one with the largest distance (margin) to the nearest vectors of each tag.
In two dimensions it looks like this:
Those vectors are representations of your training texts, and a group is a tag
you have tagged your texts with.
As data gets more complex, it may not be possible to classify vectors/tags into
only two categories. So, it looks like this:
But that’s the great thing about SVM algorithms – they’re “multi-
dimensional.” So, the more complex the data, the more accurate the results
will be. Imagine the above in three dimensions, with an added Z-axis, to create
a circle.
Mapped back to two dimensions the ideal hyperplane looks like this:
3. Deep Learning:
Deep learning is a set of algorithms and techniques inspired by how the human
brain works, called neural networks. Deep learning architectures offer huge
benefits for text classification because they perform at super high accuracy
with lower-level engineering and computation.
Deep learning algorithms do require much more training data than traditional
machine learning algorithms (at least millions of tagged examples). However,
they don't have a threshold for learning from training data like traditional machine learning algorithms such as SVM and NB do; deep learning classifiers continue to get better the more data you feed them with:
Deep learning algorithms, like Word2Vec or GloVe are also used in order to
obtain better vector representations for words and improve the accuracy of
classifiers trained with traditional machine learning algorithms.
4. Hybrid Systems:
With these results, you can build performance metrics that are useful for a
quick assessment on how well a classifier works:
Text classification has thousands of use cases and is applied to a wide range of
tasks. In some cases, data classification tools work behind the scenes to enhance
app features we interact with on a daily basis (like email spam filtering). In some
other cases, classifiers are used by marketers, product managers, engineers, and
salespeople to automate business processes and save hundreds of hours of manual
data processing.
Some of the top applications and use cases of text classification include:
With the help of text classification, businesses can make sense of large
amounts of data using techniques like aspect-based sentiment
analysis to understand what people are talking about and how they’re
talking about each aspect. For example, they can surface a potential PR crisis, a customer that's about to churn, or complaints about a bug issue or downtime affecting more than a handful of customers.
Logistic regression assumes a linear relationship between the input features and
the class labels.
Why use logistic regression?
Logistic regression is a popular algorithm for text classification and is also our
go-to favourite for several reasons:
1. Simplicity: Logistic regression is a relatively simple algorithm that is easy
to implement and interpret. It can be trained efficiently even on large
datasets, making it a practical choice for many real-world applications.
2. Easily understood: Logistic regression models can be understood by
looking at the coefficients of the input features, which can show which
words or phrases are most important for classification.
3. Works well with sparse data: Text data is often very high-
dimensional and sparse, meaning many features are zero for most data
points. Logistic regression can handle sparse data well and can be
regularised to prevent overfitting.
4. Versatile: Logistic regression works well for both binary and multi-class
classification. It is a versatile algorithm for text classification that can be
used for binary and multi-class classification tasks.
5. Baseline model: Logistic regression can be used as a baseline model for
classifying text. This lets you compare how well more complicated
algorithms work with a simple model that is easy to understand.
Logistic regression is a practical algorithm for classifying text that can give good
results in many situations, especially for more straightforward classification tasks
or as a starting point for more complicated algorithms.
How to use logistic regression for text classification:
Logistic regression is a commonly used statistical method for binary
classification tasks, including text classification.
In text classification, the goal is to assign a given piece of text to one or more
predefined categories or classes.
To use logistic regression for text classification, we first need to represent the text
as numerical features that can be used as input to the model. One popular
approach for this is to use the bag-of-words representation, where we represent
each document as a vector of word frequencies.
Once we have our numerical feature representation of the text, we can use logistic
regression to learn a model to predict the probability of each document belonging
to a given class. The logistic regression model learns a set of weights for each
feature and uses these weights to make predictions based on the input features.
During training, we adjust the weights to minimise a loss function, such as cross-
entropy, that measures the difference between the predicted probabilities and the
actual labels. Once the model is trained, we can use it to predict the class labels
for new text inputs.
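A minimal logistic-regression text classifier using scikit-learn with a TF-IDF bag-of-words representation (the toy data and labels are illustrative assumptions):

```python
# A minimal logistic-regression text-classification sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I love this product, it works great",
    "Absolutely fantastic experience, highly recommend",
    "Terrible quality, it broke after one day",
    "Worst purchase ever, very disappointed",
]
labels = [1, 1, 0, 0]   # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["great product, love it"]))         # expected [1] on this toy data
print(clf.predict_proba(["it broke, disappointed"]))   # class probabilities
```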
Overall, logistic regression is a simple but effective method for text classification
tasks and can be used as a baseline model or combined with more complex models
in ensemble approaches. However, it may struggle with more complex relationships between features and labels and may not capture the full range of patterns in natural language data.
Multinomial Logistic Regression:
Multinomial Logistic Regression is a statistical technique used for modelling
relationships between multiple categories of a dependent variable and one or
more independent variables. In the context of Natural Language Processing
(NLP), it's often used for tasks like text classification or sentiment analysis where
there are multiple classes to predict.
3. Probability Distribution:
4. Mathematical Representation:
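For reference, the class probabilities in multinomial logistic regression are computed with the softmax function (notation assumed here: x is the feature vector, w_c and b_c are the weight vector and bias for class c, and K is the number of classes):

P(y = c \mid x) = \exp(w_c \cdot x + b_c) / \sum_{k=1}^{K} \exp(w_k \cdot x + b_k)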
During training, the model learns the weights (coefficients) associated with
each feature for each class.
Optimization techniques such as gradient descent or variants are used to
find the optimal weights that minimize the error between predicted
probabilities and actual classes.
6. Prediction:
To predict the class of a new instance, the model calculates the probabilities
for each class using the learned weights and the input features.
The class with the highest probability is assigned as the predicted class for
the instance.
Sentiment Classification
From the name itself, we can understand that the task is to identify the sentiment based on the review.
The task is to simply classify the tweets into positive and negative sentiment. Here the input tweets can have various lengths, but in a recurrent neural network we always get an output with the same length as the input.
Image Captioning: Image captioning is a very interesting project where you will
have an image and for that particular image, you need to generate a textual
description.
So here,
1. The input will be single input – the image,
2. And the output will be a series or sequence of words
Here the image might be of a fixed size, but the description will vary in length.
Suppose you have some text in a particular language, say English, that you want to translate into French. For this we use a language translator (machine translation).
Deep Networks:
Stacked and Bidirectional RNNs:
They are often used in natural language processing tasks, such as language
translation, text classification, and named entity recognition. They can capture
contextual dependencies in the input data by considering past and future
contexts.
where φ is the activation function, W is the weight matrix, and b is the bias.
The final hidden state is the concatenation of A_t(forward) and A_t(backward).
Here, ⊕ denotes vector concatenation. There are some other ways to combine the forward and backward hidden states, like element-wise addition or multiplication.
The hidden state at time t is given by the combination of A_t(forward) and A_t(backward). The output of any given hidden state is given by:
In a Bidirectional RNN however, since there are forward and backward passes
happening simultaneously, updating the weights for the two processes could
happen at the same point in time. This leads to erroneous results. Thus, to
accommodate forward and backward passes separately, the following algorithm
is used for training a Bidirectional RNN:
Forward Pass
Forward states (from t = 1 to N) and backward states (from t = N to 1) are
passed.
Output neuron values are passed (from t = 1 to N)
Backward Pass
Output neuron values are passed (from t = N to 1)
Forward states (from t = N to 1) and backward states (from t = 1 to N) are
passed.
Both the forward and backward passes together train a Bidirectional RNN.
Traditional RNNs like GRUs and LSTMs grasp context only from preceding
words, unable to anticipate future ones. Bidirectional RNNs tackle this by
processing sequences in both directions, using two RNNs. Their hidden states
merge into a single one for decoding—this could be either the whole sequence
or the last time step's state, impacting the neural network's design.
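A minimal sketch of a bidirectional LSTM sentiment classifier in Keras (assumes TensorFlow 2.x; the vocabulary size, sequence length, and random data are placeholder assumptions used only to show the expected shapes):

```python
# A minimal bidirectional LSTM classifier sketch with Keras.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, max_len = 10000, 100

model = tf.keras.Sequential([
    layers.Embedding(vocab_size, 64),         # integer word ids -> dense vectors
    layers.Bidirectional(layers.LSTM(64)),    # forward and backward hidden states, concatenated
    layers.Dense(1, activation="sigmoid"),    # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy integer-encoded data just to demonstrate the shapes.
X = np.random.randint(0, vocab_size, size=(32, max_len))
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X[:2]))   # two probabilities between 0 and 1
```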
Managing Context in RNNs:
Gradients are the values used to update a neural network's weights. In other words, we can say that the gradient carries information.
If we apply an RNN to a paragraph, the RNN may leave out necessary information due to gradient problems and may not be able to carry information from the initial time step to later time steps.
One reason for the exploding/vanishing gradient problem is that the network captures both relevant and irrelevant information. We need a model which can decide what information from a paragraph is relevant, remember only that relevant information, and throw away all the irrelevant information.
This is achieved by using gates. The LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) have gates as an internal mechanism, which control what information to keep and what information to throw out. By doing this, LSTM and GRU networks solve the exploding and vanishing gradient problems.
Almost every state-of-the-art (SOTA) model based on RNNs uses LSTM or GRU networks for prediction.
LSTMs /GRUs are implemented in speech recognition, text generation, caption
generation, etc.
LSTM networks:
Every LSTM network basically contains three gates to control the flow of information and a cell state to hold information. The cell state carries information from the initial time steps to later time steps without it vanishing.
Gates
Gates make use of the sigmoid activation, whose values range between 0 and 1. (The tanh activation, whose values range from -1 to 1, is used for the candidate cell state.)
1. Forget Gate
2. Input Gate
3. Output Gate
GRU ( Gated Recurrent Units ) are similar to the LSTM networks. GRU is a kind
of newer version of RNN. However, there are some differences between GRU
and LSTM.
1. Update Gate: The update gate is a combination of the forget gate and the input gate. It decides what information to ignore and what information to add to memory.
2. Reset Gate: This gate resets the past information in order to get rid of gradient explosion. The reset gate determines how much past information should be forgotten.
The Encoder-Decoder Model with RNNs:
In tasks like machine translation, we must map from a sequence of input words to a sequence of output words. The reader must note that this is not the same as "sequence labelling", where the task is to map each word in the sequence to one of a set of predefined classes, as in part-of-speech tagging or named entity recognition.
In the above two examples, the models are tasked with mapping each word in the sequence to a tag or class.
Google translation
But in tasks like machine translation, the length of the input sequence is not necessarily the same as the length of the output sequence. As you can see in the Google translation example, the input length is 5 and the output length is 4. Since we are mapping an input sequence to an output sequence, these models are called sequence-to-sequence models. Not only do the lengths of the input and output sequences differ, but the order of words can also differ. This is a very complex task in NLP, and encoder-decoder networks are very successful at handling these sorts of complicated sequence-to-sequence mapping tasks.
One more important task that can be solved with encoder-decoder networks is text summarisation, where we map a long text to a short summary or abstract. Here we will try to understand the architecture of encoder-decoder networks and how they work.
These networks have been applied to a very wide range of applications including machine translation, text summarisation, question answering and dialogue.
Let’s try to understand the idea underlying the encoder-decoder networks. The
encoder takes the input sequence and creates a contextual representation (which is
also called context) of it and the decoder takes this contextual representation as
input and generates output sequence.
Encoder and Decoder with RNN’s:
Encoder: The encoder takes the input sequence and generates a context, which is the essence of the input passed to the decoder.
Decoder: Decoder takes the context as input and generates a sequence of output.
When we employ RNN as decoder, the context is the final hidden state of the RNN
encoder.
The first decoder RNN cell takes the CONTEXT as its prior hidden state. The decoder then generates output until the end-of-sequence marker is produced. Each cell in the RNN decoder takes input autoregressively, i.e., the decoder uses its own estimated output at time t as the input for the next time step, x_{t+1}. One important drawback, if the context is made available only to the first decoder RNN cell, is that the influence of the context wanes as more and more of the output sequence is generated. To overcome this drawback, the CONTEXT can be made available at each decoding RNN time step. There is a little deviation from the vanilla RNN here; let's look at the updated equations for the decoder RNN.
Let’s take a look at the architecture of the Transformer below. It might look
intimidating but don’t worry, we will break it down and understand it block by
block.
Now focus on the below image. The Encoder block has 1 layer of a Multi-Head
Attention followed by another layer of Feed Forward Neural Network. The
decoder, on the other hand, has an extra Masked Multi-Head Attention.
The encoder and decoder blocks are actually multiple identical encoders and
decoders stacked on top of each other. Both the encoder stack and the decoder
stack have the same number of units.
Let’s see how this setup of the encoder and the decoder stack works:
The word embeddings of the input sequence are passed to the first encoder
These are then transformed and propagated to the next encoder
The output from the last encoder in the encoder-stack is passed to all the
decoders in the decoder-stack as shown in the figure below:
An important thing to note here – in addition to the self-attention and feed-
forward layers, the decoders also have one more layer of Encoder-Decoder
Attention layer. This helps the decoder focus on the appropriate parts of the input
sequence.
You might be thinking – what exactly does this “Self-Attention” layer do in the
Transformer? Excellent question! This is arguably the most crucial component in
the entire setup so let’s understand this concept.
Take a look at the above image. Can you figure out what the term “it” in this
sentence refers to?
Is it referring to the street or to the animal? It’s a simple question for us but not
for an algorithm. When the model is processing the word “it”, self-attention tries
to associate “it” with “animal” in the same sentence.
Self-attention allows the model to look at the other words in the input sequence
to get a better understanding of a certain word in the sequence. Now, let’s see
how we can calculate self-attention.
Calculating Self-Attention:
I have divided this section into various steps for ease of understanding.
First, we need to create three vectors from each of the encoder’s input vectors:
1. Query Vector
2. Key Vector
3. Value Vector.
These vectors are trained and updated during the training process. We’ll know
more about their roles once we are done with this section
Next, we will calculate self-attention for every word in the input sequence
Consider this phrase – “Action gets results”. To calculate the self-attention for
the first word “Action”, we will calculate scores for all the words in the phrase
with respect to “Action”. This score determines the importance of other words
when we are encoding a certain word in an input sequence
1. The score for the first word is calculated by taking the dot product of the Query
vector (q1) with the keys vectors (k1, k2, k3) of all the words:
2. Then, these scores are divided by 8 which is the square root of the dimension
of the key vector:
3. Next, these scores are normalized using the SoftMax activation function:
4. These normalized scores are then multiplied by the value vectors (v1, v2, v3)
and sum up the resultant vectors to arrive at the final vector (z1). This is the output
of the self-attention layer. It is then passed on to the feed-forward network as
input.
So, z1 is the self-attention vector for the first word of the input sequence “Action
gets results”. We can get the vectors for the rest of the words in the input sequence
in the same fashion:
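A minimal NumPy sketch of the four steps above for a three-word sequence (the query, key, and value vectors are random stand-ins for the learned projections):

```python
# A minimal scaled dot-product self-attention sketch for a 3-word sequence.
import numpy as np

np.random.seed(0)
d_k = 64
Q = np.random.randn(3, d_k)   # query vectors q1, q2, q3
K = np.random.randn(3, d_k)   # key vectors   k1, k2, k3
V = np.random.randn(3, d_k)   # value vectors v1, v2, v3

scores = Q @ K.T / np.sqrt(d_k)                                         # steps 1-2: dot products, scaled by sqrt(d_k) = 8
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # step 3: softmax over each row
Z = weights @ V                                                         # step 4: weighted sum of value vectors
print(Z.shape)   # (3, 64): one self-attention vector (z1, z2, z3) per word
```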
Attention can only deal with fixed-length text strings. The text has to be
split into a certain number of segments or chunks before being fed into the
system as input
This chunking of text causes context fragmentation. For example, if a
sentence is split from the middle, then a significant amount of context is
lost. In other words, the text is split without respecting the sentence or any
other semantic boundary
So how do we deal with these pretty major issues?
Pretrained Language Models:
A pretrained model is a model that has been trained on a large dataset and can
be used as a starting point for other tasks. Pretrained models have already
learned the general patterns and features of the data they were trained on, so
they can be fine-tuned for other tasks with relatively little additional training
data.
In natural language processing (NLP), pre-trained models are often used as the
starting point for a wide range of NLP tasks, such as language translation,
sentiment analysis, and text summarization. By using a pre-trained model, NLP
practitioners can save time and resources, as they don’t have to train a model
from scratch on a large dataset. Some popular pre-trained models for NLP
include BERT, GPT-2, ELMo, and RoBERTa. These models are trained on
large datasets of text and can be fine-tuned for specific tasks.
One of the key features of GPT-2 is its ability to generate human-like text.
This is useful for applications such as text summarization, language
translation, and content generation. GPT-2 can generate text that is coherent
and fluent, making it a powerful tool for natural language generation tasks.
In addition to text generation, GPT-2 can also be fine-tuned for a wide range
of NLP tasks, such as sentiment analysis and text classification. It has
achieved state-of-the-art performance on a variety of NLP benchmarks,
making it a powerful tool for NLP practitioners.
3. Procedure:
Initialize with Pre-trained Weights: Start with the parameters learned
during pre-training.
Task-Specific Data: Use a smaller dataset related to the task of interest.
Learning Rate and Layers: Adjust learning rates, unfreeze and train
specific layers or the entire model.
Iterative Training: Fine-tune the model on the task-specific data,
allowing it to learn task-specific patterns.
4. Benefits:
Utilizes Pre-trained Knowledge: Saves time and resources by
leveraging knowledge from pre-training.
Better Performance: Adapts the model to specific tasks, improving its
performance on those tasks.
1. Objective: MLMs are a type of pre-trained language model where the model
learns to predict missing or masked words within a sentence.
2. Training Procedure:
Masking Tokens: Randomly mask some of the tokens in the input text.
Prediction Task: Task the model with predicting the masked tokens
based on the context provided by the surrounding words.
Objective Function: The model is trained to minimize the difference
between the predicted and actual masked tokens.
3. BERT as an Example:
BERT employs a bidirectional Transformer architecture.
It masks 15% of the tokens in a sequence.
The model aims to predict these masked tokens based on the rest of the
input.
4. Benefits:
1. Captures Contextual Information: MLMs learn rich contextual
representations by understanding relationships between words in a
sentence.
2. Language Understanding: Learns semantics, syntax, and linguistic
relationships within a sentence.
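To illustrate the masked-token prediction described above, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline (assumes the transformers package is installed and the bert-base-uncased weights can be downloaded):

```python
# A minimal masked-language-model sketch: predict the [MASK] token with BERT.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The cat sat on the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```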
Relationship:
2. Implementation of ChatGPT:
Integration of ChatGPT: Implemented ChatGPT within the platform's
customer support system.
Real-time Assistance: Automated responses for common queries,
reducing response time and improving customer satisfaction.
Personalized Interaction: Tailored responses based on user queries,
enhancing the user experience and engagement.
4. Sentiment Classification:
Sentiment Analysis Tool: Developed an AI-driven sentiment
classification system.
Customer Feedback Analysis: Automated analysis of customer reviews
and feedback to gauge sentiment trends.
Customer Insights: Enabled proactive responses to negative sentiments,
enhancing brand reputation.
5. Language Translation:
Multilingual Support: Integrated AI-powered language translation
across the platform.
Global Expansion: Enabled users to access content and communicate in
their preferred language.
Seamless Communication: Facilitated cross-border transactions and
interactions with localized content.
6. Results and Impact:
Enhanced User Engagement: Increased user interactions and retention
rates by 30%.
Improved Customer Support: Reduced response time by 50%, leading
to higher customer satisfaction.
Better Decision Making: Insights from sentiment analysis aided in
strategic decision-making and product improvements.
Global Reach: Expanded user base by 40% in non-native English-
speaking regions due to language translation support.
7. Conclusion:
Future Prospects: Continuous refinement and updates to AI-powered
tools for better accuracy and performance.
Potential Expansion: Plan to integrate AI for more personalized
experiences and predictive analytics.
8. Key Takeaways:
AI-Powered Tools: Significantly improve user experiences, customer
support, and engagement.
Sentiment Analysis and Translation: Crucial for understanding user
sentiments and expanding global reach.
Continuous Innovation: Essential to stay ahead in delivering enhanced
AI-driven services.
This case study showcases the transformative impact of AI-powered tools like
ChatGPT, GPT models, sentiment classification, and language translation in
enhancing user experiences, expanding global outreach, and improving
customer interactions for a multinational e-commerce platform.