
Chapter Five: Natural Language Processing

Introduction
The development of large language models has been one of the most significant
breakthroughs in the field of natural language processing (NLP) in recent years. These
models have demonstrated impressive performance on a variety of language tasks,
including language translation, text generation, and question-answering. The
Transformer architecture has played a key role in enabling the development of these
models. In this chapter, we will explore the evolution of the Transformer architecture and
its impact on the field of NLP.

https://arxiv.org/pdf/2010.15036.pdf


Section 1: A Brief History of NLP, NLU, and NLG

In-Depth Evolution of Natural Language Processing (NLP)

Section 1: Traditional NLP Techniques


The inception of NLP was characterized by rule-based and statistical methods designed to
interpret, understand, and generate human language. These techniques, though foundational,
were constrained by their reliance on linguistic rules crafted by experts, limiting their adaptability
and scalability.

● Syntax Analysis and Parsing: This involved breaking down sentences into their
grammatical components, such as nouns, verbs, and adjectives, to extract their syntactic
structures. Tools like the Stanford Parser utilized algorithms to understand sentence
structure, which was crucial for tasks such as translating languages and answering
questions. However, these methods struggled with ambiguities inherent in human
language.
● Named Entity Recognition (NER): This process identified key information in text, such as
names of people, organizations, and locations. It was crucial for information extraction
and data analysis applications. Traditional NER systems used hand-crafted rules or lists
of known entities, which were not easily scalable or adaptable to new domains or
languages.
● Part-of-Speech Tagging: This involved labeling words with their corresponding part of
speech, based on both their definition and context. This tagging was vital for parsing
sentences and understanding language structure. Early systems used rule-based
methods or probabilistic models, which could be inaccurate when faced with complex or
ambiguous sentence structures.

These traditional methods were a crucial step in building the foundation of NLP. However, their
reliance on manual rule creation made them labor-intensive and unable to cope with the
nuances and variability of natural language effectively.


Section 2: Word Embeddings


Word embeddings represented a paradigm shift in NLP, moving away from sparse,
high-dimensional representations like one-hot encoding to dense, low-dimensional vectors.
These vectors captured semantic similarities between words, enabling models to understand
linguistic nuances based on context.

● Semantic Representation: Word2Vec, introduced by Mikolov et al., and GloVe, by Pennington et al., were breakthroughs that allowed words with similar meanings to have similar representations in vector space. This facilitated models' understanding of semantic relationships and analogies, overcoming a significant limitation of traditional methods.
● Dimensionality Reduction: Word embeddings reduced the dimensionality of word
representations, significantly improving computational efficiency and enabling the
processing of large text corpora with neural networks.
● Contextual Limitations: Initial embeddings were static, offering a single representation for
each word regardless of its use in different contexts. This limitation was addressed by
context-sensitive embeddings like ELMo, which generated dynamic representations for
words based on their surrounding text, using bidirectional LSTM networks.

The development of word embeddings was a critical advancement in NLP, enabling models to
capture the semantic richness of language more effectively than ever before.
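
To make this concrete, the short sketch below trains a tiny Word2Vec model with the Gensim library on a toy corpus and queries it for nearest neighbours. The corpus, hyperparameters, and library choice are illustrative assumptions; with so little data the neighbours are not meaningful, but the workflow is the same at scale.

```python
# Minimal sketch (assumed setup): training dense word embeddings with Gensim's Word2Vec.
from gensim.models import Word2Vec

corpus = [  # toy corpus for illustration only
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chased", "the", "cat"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,  # dimensionality of the dense vectors
    window=2,        # context window size
    min_count=1,     # keep every word in this tiny corpus
    sg=1,            # skip-gram (0 would select CBOW)
)

print(model.wv["king"].shape)         # -> (50,)
print(model.wv.most_similar("king"))  # nearest neighbours in the embedding space
```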


Section 3: The Transformer Architecture


The introduction of the Transformer model by Vaswani et al. represented a significant leap in
NLP technology. Its novel architecture, centered around self-attention mechanisms, offered a
new way to model relationships within text without the sequential dependencies of previous
RNN and LSTM models.

● Parallelization: Unlike RNNs and LSTMs, the Transformer architecture allowed for
significant parallelization, reducing training times and enabling the processing of longer
sequences of text in a single step.
● Attention Mechanism: The self-attention mechanism enabled the model to weigh the
importance of different parts of the input text differently, allowing it to capture nuanced
relationships and dependencies across the entire sequence. This capability was
particularly beneficial for tasks requiring an understanding of context, such as translation
and summarization.
● Flexibility and Scalability: The Transformer's architecture proved to be incredibly
versatile, serving as the foundation for a range of subsequent models tailored to diverse
NLP tasks. Its scalability was demonstrated by its capacity to handle models of varying
sizes, from hundreds of millions to billions of parameters.

The Transformer model set new standards for efficiency, effectiveness, and applicability in NLP,
paving the way for the development of highly sophisticated language models.
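
To give a rough sense of the computation, the NumPy sketch below implements single-head scaled dot-product attention, the core operation described above. The random matrices simply stand in for learned projections of an input sequence; this is an illustration, not a full multi-head Transformer layer.

```python
# Minimal sketch: scaled dot-product attention for a single head, in NumPy.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                      # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
Q, K, V = (rng.normal(size=(seq_len, d_model)) for _ in range(3))

output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)  # (4, 8) (4, 4); each attention row sums to 1
```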


Section 4: Large Language Models (LLMs)


The era of LLMs, epitomized by models like GPT and BERT, marked a pinnacle in the evolution
of NLP. These models leveraged the Transformer architecture to unprecedented scales,
demonstrating remarkable capabilities in understanding and generating human language.

● Training Techniques: LLMs are pre-trained on vast corpora of text data using
unsupervised learning techniques, allowing them to capture a broad understanding of
language patterns, syntax, and semantics. This pre-training is followed by fine-tuning,
where the model is adapted to specific tasks with smaller, task-specific datasets.
● Reinforcement Learning from Human Feedback (RLHF): To further refine the outputs of
LLMs and align them with human values and ethical standards, techniques such as
RLHF have been employed. This involves using human feedback to guide the learning
process, adjusting the model's outputs to be more aligned with human judgment and
preferences.
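
As a small illustration of this pre-train-then-fine-tune recipe, the sketch below loads an already fine-tuned sentiment checkpoint through the Hugging Face transformers pipeline API (assuming that library is installed; the default checkpoint is downloaded on first use). The expensive steps, pre-training on a large corpus followed by task-specific fine-tuning, have already been performed for this checkpoint, so it can be applied directly.

```python
# Minimal sketch: applying a pre-trained, fine-tuned Transformer via Hugging Face.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default fine-tuned model
print(classifier("The Transformer architecture changed NLP for the better."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```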

The development and refinement of LLMs have not only significantly advanced the state of NLP
but also raised important considerations regarding their ethical use, interpretability, and impact
on society.

● Ethical Considerations and Societal Impact: As LLMs became more powerful, concerns
about their potential to generate misleading information, reinforce biases, and impact
human labor markets intensified. Addressing these concerns requires careful design,
training methodologies that mitigate bias, and transparent usage guidelines to ensure
that these models are used responsibly and for the benefit of society.
● Interpretability and Transparency: Despite their impressive capabilities, LLMs' complexity
makes understanding how they arrive at specific outputs challenging. Efforts to improve
the interpretability of these models include techniques for model visualization,
explanation frameworks, and research into simpler models that can offer comparable
performance with greater transparency.
● Continual Learning and Adaptation: Another area of focus is making LLMs adaptable to
new information without requiring extensive retraining. Techniques like few-shot learning,
where models learn from a minimal number of examples, and continual learning, where
models update their knowledge base without forgetting previously learned information,
are critical for developing more efficient and adaptable NLP systems.

Looking Forward
The evolution of NLP from traditional techniques to LLMs represents a remarkable journey of
innovation and discovery. Each stage of this evolution has addressed the limitations of previous
methods while opening new avenues for exploration and application. The future of NLP lies in
addressing the current challenges faced by LLMs, including ethical considerations,
interpretability, and adaptability, while exploring new architectures and training methods that can
further advance our ability to understand and generate human language.


As the field continues to evolve, collaboration between researchers, ethicists, and practitioners
will be essential to ensure that the benefits of NLP technology are realized across society,
enhancing communication, accessibility, and information sharing while mitigating potential risks
and biases. The journey of NLP is far from complete, and the next chapters promise to be as
exciting and impactful as those that have preceded them.


Section 1: Steps in Traditional NLP

Preliminaries related to Text Pre-processing


• Tokenization: Tokenization is the process of transforming text into tokens or words. Documents can be tokenized into sentences, and sentences into tokens; more generally, a sequence of text is divided into words, symbols, phrases, or other tokens [6]. The prime objective of tokenization is to identify the words in a sentence, and it is usually applied as the first, standard pre-processing step in any NLP task [40].
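
A minimal sketch of sentence and word tokenization with NLTK (one of several suitable toolkits; the library choice here is an assumption) is shown below.

```python
# Minimal sketch: sentence and word tokenization with NLTK.
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models; only needed once
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization is the first step. It splits text into tokens."
print(sent_tokenize(text))  # two sentences
print(word_tokenize(text))  # ['Tokenization', 'is', 'the', 'first', 'step', '.', ...]
```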

• Removal of Noise, URLs, Hashtags and User-mentions: Unwanted strings and Unicode characters are left over from the crawling process; they are of no use to the machine and add noise to the data. In addition, most tweets contain URLs that provide extra information, user-mentions (@), and the hashtag symbol "#", which associates a tweet with a particular topic and can also express sentiment. This extra information is useful to human readers but tells a machine little, so it is treated as noise that must be handled. Researchers have presented different techniques for handling it: URLs, for example, are replaced with placeholder tags [1], whereas user-mentions (@) are removed [13, 65].
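
A regular-expression sketch of this cleaning step follows; the patterns and the <URL> placeholder are illustrative and would be tuned to the data at hand.

```python
# Minimal sketch: stripping URLs, user-mentions and hashtag symbols with regexes.
import re

def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", " <URL> ", text)  # replace URLs with a tag
    text = re.sub(r"@\w+", " ", text)                # drop user-mentions
    text = re.sub(r"#(\w+)", r" \1 ", text)          # keep hashtag text, drop '#'
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace

print(clean_tweet("Great talk by @ali https://example.com #NLP #MachineLearning"))
# -> "Great talk by <URL> NLP MachineLearning"
```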

• Word Segmentation: Word segmentation is the process of separating the phrases, content, and keywords packed into a hashtag. This step helps machines understand and classify the content of tweets without human intervention. As mentioned earlier, Twitter users add # (hashtags) to most tweets to associate them with a particular topic; the phrase or keyword starting with # is known as a hashtag. Various techniques for word segmentation are presented in the literature [22, 136].
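
One option for splitting hashtag text into words is the third-party wordsegment package (an assumption; any unigram/bigram segmentation model would serve), sketched below.

```python
# Minimal sketch: segmenting hashtag text with the `wordsegment` package.
from wordsegment import load, segment

load()  # load the word-frequency tables once
print(segment("machinelearning"))  # -> ['machine', 'learning']
print(segment("thisisatest"))      # -> ['this', 'is', 'a', 'test']
```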

• Replacing Emoticons and Emojis: Twitter users use many different emoticons and emojis, such as :) and :(, to express their sentiments and opinions, so it is important to capture this information in order to classify tweets correctly. A few tokenizers can detect such expressions and emotions and replace them with their associated meanings [41].
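
A small sketch of this replacement step follows, using a hand-written emoticon dictionary plus the third-party emoji package (an assumption) for Unicode emojis.

```python
# Minimal sketch: mapping emoticons and emojis to words.
import emoji  # third-party package that converts Unicode emojis to text aliases

EMOTICONS = {":)": "happy", ":(": "sad", ":D": "laugh"}  # tiny illustrative map

def replace_emoticons(text: str) -> str:
    for emoticon, meaning in EMOTICONS.items():
        text = text.replace(emoticon, f" {meaning} ")
    return emoji.demojize(text, delimiters=(" ", " "))  # e.g. a smiley becomes its name

print(replace_emoticons("Loved the movie :)"))
```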

• Replacement of Abbreviations and Slang: Twitter's character limit pushes online users toward abbreviations, shortened words, and slang. An abbreviation is a shortened form or acronym of a word or phrase, such as MIA for "missing in action". Slang, in contrast, is an informal way of expressing thoughts that is often restricted to a particular group or context. It is therefore important to handle this informal text by replacing such terms with their actual meanings, improving performance without losing information. Researchers have proposed different methods for this, but the most useful technique is to convert them to actual words, which are easier for a machine to understand [68, 100].

• Replacing Elongated Characters: Social media users sometimes intentionally elongate words by repeating characters, such as loooovvveee or greeeeat. It is important to map such words back to their base forms so that the classifier does not treat them as different words; a common approach is to replace elongated words with their original base words. Detection and replacement of elongated words have been studied by [97] and [5].
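
One common heuristic, sketched below, collapses any run of three or more identical characters down to two and leaves the final correction to a spell-checker.

```python
# Minimal sketch: collapsing elongated character runs to at most two repeats.
import re

def reduce_elongation(word: str) -> str:
    # "loooovvveee" -> "loovvee"; a spell-checker can then recover "love".
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

print(reduce_elongation("loooovvveee"), reduce_elongation("greeeeat"))  # loovvee greeat
```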

• Correction of Spelling Mistakes: Incorrect spellings and grammatical mistakes are very common in text, especially on social media platforms such as Twitter and Facebook. Correcting them reduces the number of different ways the same word is written. TextBlob is one library that can be used for this purpose, and Norvig's spell-correction method (http://norvig.com/spell-correct.html) is also widely used.
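
A minimal sketch using TextBlob's built-in corrector (assuming the textblob package and its corpora are installed) is shown below.

```python
# Minimal sketch: spelling correction with TextBlob.
from textblob import TextBlob

noisy = "I havv goood speling"
print(str(TextBlob(noisy).correct()))  # -> "I have good spelling"
```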

• Expanding Contractions: A contraction is a shortened form of a word or word pair, widely used online, with an apostrophe in place of the missing letter(s). Because we want to standardize the text so that machines can process it easily, contractions are expanded to their original root/base words. For example, how's, I'm, can't, and don't are contractions of how is, I am, cannot, and do not, respectively. In the study conducted by [14], contractions were replaced with their original words or by the relevant word. If contractions are not replaced, the tokenization step will split "can't" into the tokens "can" and "t".
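
A simple lookup-and-regex sketch follows; the mapping covers only a handful of contractions for illustration.

```python
# Minimal sketch: expanding contractions with a small lookup table.
import re

CONTRACTIONS = {"can't": "cannot", "don't": "do not", "i'm": "i am", "how's": "how is"}
PATTERN = re.compile("|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE)

def expand_contractions(text: str) -> str:
    return PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm sure we can't skip this step"))
# -> "i am sure we cannot skip this step"
```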

• Removing Punctuation: Social media users employ various punctuation marks to express sentiment and emotion. These may be useful to humans but are of little use to machines when classifying short texts, so removing punctuation is common practice in classification tasks such as sentiment analysis. Some punctuation symbols, such as "!" and "?", do convey sentiment, however; although removal is the common practice [82], replacing question marks or exclamation marks with tags has also been studied [5].
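
A one-line approach using Python's built-in string machinery is sketched below; if "!" or "?" matter for the task, they can simply be left out of the removal table.

```python
# Minimal sketch: stripping punctuation with str.translate.
import string

def remove_punctuation(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("Wow!!! Is this really, truly, the end?"))
# -> "Wow Is this really truly the end"
```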

• Removing Numbers: A text corpus usually contains numbers that are meaningful to human readers but of little use to machines, and they can lower classification performance. The simple, standard method is to remove them [47, 58]. However, useful information can be lost if numbers are removed before slang and abbreviations have been expanded; words like "2maro", "4 u", and "gr8" should first be converted to actual words, and only then should this pre-processing step be applied.
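
A regex sketch of this step is shown below; as noted above, it should run only after slang such as "gr8" has already been expanded.

```python
# Minimal sketch: removing standalone digit sequences.
import re

def remove_numbers(text: str) -> str:
    return re.sub(r"\s+", " ", re.sub(r"\d+", " ", text)).strip()

print(remove_numbers("the meeting moved from room 204 to room 7"))
# -> "the meeting moved from room to room"
```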

• Lower-casing All Words: A corpus contains many words with varying capitalization. Lower-casing helps avoid multiple copies of the same word, since this diversity of capitalization can hurt classification performance. Converting every capital letter to lower case is the most common way to handle the issue in text data. However, projecting all tokens into a single feature space in this way causes problems of interpretation for words like "US" in the raw corpus: "US" could be the pronoun "us" or the country name, so lower-casing it in all cases can be problematic. The study conducted by [33] lower-cased the words in the corpus to obtain clean tokens.
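
The sketch below lower-cases tokens while leaving multi-letter all-caps tokens untouched, one possible (assumed) way to soften the "US" versus "us" problem mentioned above.

```python
# Minimal sketch: lower-casing, optionally preserving all-caps acronyms such as "US".
def lowercase_tokens(tokens, keep_acronyms=True):
    return [t if keep_acronyms and t.isupper() and len(t) > 1 else t.lower()
            for t in tokens]

print(lowercase_tokens(["The", "US", "Economy", "Grew"]))  # ['the', 'US', 'economy', 'grew']
```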


• Removing Stop-words: In text classification tasks, many words carry little significance yet occur with high frequency, words such as a, the, is, and, am, and are. Because they add little information for sentiment classification, it is recommended to remove stop words before the feature-selection step. A popular and straightforward method is simply to remove them; stop-word lists are available in libraries such as NLTK, scikit-learn, and spaCy.
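
A short sketch using NLTK's English stop-word list (scikit-learn or spaCy lists would work equally well) follows.

```python
# Minimal sketch: removing English stop-words with NLTK.
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))
tokens = ["this", "movie", "is", "a", "great", "piece", "of", "work"]
print([t for t in tokens if t not in STOP_WORDS])  # -> ['movie', 'great', 'piece', 'work']
```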

• Stemming: A word can appear in many different forms while its semantic meaning stays the same. Stemming is the technique of removing or replacing suffixes and affixes to obtain the root, base, or stem word; its importance was studied by [92]. Several stemming algorithms, such as the Porter, Lancaster, and Snowball stemmers, help consolidate different forms of a word into the same feature space. Stemming can also serve as a form of feature reduction.
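
A brief sketch comparing the Porter and Snowball stemmers in NLTK is shown below; note that stems such as "studi" are not dictionary words, which is expected behaviour.

```python
# Minimal sketch: Porter and Snowball stemming with NLTK.
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

words = ["running", "studies", "easily", "connected"]
print([porter.stem(w) for w in words])    # e.g. ['run', 'studi', 'easili', 'connect']
print([snowball.stem(w) for w in words])  # Snowball gives very similar stems
```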

• Lemmatization: The purpose of lemmatization is the same as that of stemming: to reduce words to their base or root forms. However, in lemmatization the inflections are not simply chopped off; lexical knowledge is used to transform each word into its base (dictionary) form. Many libraries support lemmatization, including NLTK (the WordNet lemmatizer), Gensim, Stanford CoreNLP, spaCy, and TextBlob.
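
A minimal sketch with NLTK's WordNet lemmatizer follows; passing the part of speech (pos) noticeably improves the results.

```python
# Minimal sketch: lemmatization with NLTK's WordNet lemmatizer.
import nltk
nltk.download("wordnet", quiet=True)  # WordNet data; only needed once
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("mice"))              # noun by default -> 'mouse'
print(lemmatizer.lemmatize("running", pos="v"))  # verb -> 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # adjective -> 'good'
```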

• Part-of-Speech (POS) Tagging: The purpose of part-of-speech (POS) tagging is to assign a part of speech to each word in the text, grouping together words that play the same grammatical role.

• Handling Negations: For humans it is easy to pick up the context when a negation is present in a sentence, but machines can struggle to capture and classify it accurately, so handling negation is a challenging task in word-level text analysis. Replacing the words affected by a negation with the prefix 'NEG_' has been studied by [103]; similarly, handling negations by substituting antonyms has been studied by [124].
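
The sketch below implements the 'NEG_' marking idea in a few lines: every token after a negation word is prefixed until the next punctuation mark ends its scope.

```python
# Minimal sketch: marking the scope of a negation with a 'NEG_' prefix.
NEGATIONS = {"not", "no", "never", "cannot"}
PUNCT = {".", ",", ";", "!", "?"}

def mark_negation(tokens):
    negating, out = False, []
    for tok in tokens:
        if tok in PUNCT:
            negating = False
            out.append(tok)
        elif tok.lower() in NEGATIONS:
            negating = True
            out.append(tok)
        else:
            out.append("NEG_" + tok if negating else tok)
    return out

print(mark_negation("i did not like the movie , but the cast was great".split()))
# -> ['i', 'did', 'not', 'NEG_like', 'NEG_the', 'NEG_movie', ',', 'but', 'the', ...]
```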


Here is a summary of the text pre-processing techniques described above:

1. Tokenization: splitting a text into individual words or tokens.
2. Removal of noise, URLs, hashtags, and user-mentions: removing unwanted or irrelevant strings from the data.
3. Word segmentation: splitting hashtags into their individual words.
4. Replacement of emoticons and emojis: replacing emoticons and emojis with their corresponding text meanings.
5. Replacement of abbreviations and slang: replacing abbreviations and slang with their full forms.
6. Replacement of elongated characters: collapsing repeated characters so elongated words map back to their base words.
7. Correction of spelling mistakes: correcting misspelled words in the text.
8. Expanding contractions: expanding contractions into their full forms.
9. Removal of punctuation: removing punctuation marks from the text.
10. Removal of numbers: removing numbers from the text.
11. Lower-casing all words: converting all words to lowercase.
12. Removal of stop-words: removing high-frequency, low-information words from the text.
13. Stemming: reducing a word to its stem by stripping suffixes and affixes.
14. Lemmatization: reducing a word to its dictionary (lemma) form using lexical knowledge.
15. Part-of-speech (POS) tagging: assigning a part of speech to each word in the text.
16. Handling negations: identifying negations and marking or transforming the words they affect.

These text preprocessing techniques are used to clean and prepare text data for further
processing. They can help to improve the accuracy and performance of natural language
processing (NLP) tasks such as text classification, sentiment analysis, and machine translation.

© 2020-2023, Ali Arsanjani

