Chapter Four - NLP
Introduction
The development of large language models has been one of the most significant
breakthroughs in the field of natural language processing (NLP) in recent years. These
models have demonstrated impressive performance on a variety of language tasks,
including language translation, text generation, and question-answering. The
Transformer architecture has played a key role in enabling the development of these
models. In this chapter, we will explore the evolution of the Transformer architecture and
its impact on the field of NLP.
https://arxiv.org/pdf/2010.15036.pdf
Before the advent of deep learning, NLP relied largely on rule-based and statistical techniques, including the following.
● Syntax Analysis and Parsing: This involved breaking down sentences into their grammatical components, such as nouns, verbs, and adjectives, to extract their syntactic structures. Tools like the Stanford Parser used algorithms to understand sentence structure, which was crucial for tasks such as translating languages and answering questions. However, these methods struggled with the ambiguities inherent in human language.
● Named Entity Recognition (NER): This process identified key information in text, such as
names of people, organizations, and locations. It was crucial for information extraction
and data analysis applications. Traditional NER systems used hand-crafted rules or lists
of known entities, which were not easily scalable or adaptable to new domains or
languages.
● Part-of-Speech Tagging: This involved labeling words with their corresponding part of
speech, based on both their definition and context. This tagging was vital for parsing
sentences and understanding language structure. Early systems used rule-based
methods or probabilistic models, which could be inaccurate when faced with complex or
ambiguous sentence structures.
These traditional methods were a crucial step in building the foundation of NLP. However, their
reliance on manual rule creation made them labor-intensive and unable to cope with the
nuances and variability of natural language effectively.
The development of word embeddings was a critical advancement in NLP, enabling models to
capture the semantic richness of language more effectively than ever before.
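As a concrete illustration of word embeddings, the short sketch below trains a tiny Word2Vec model with the gensim library; the toy corpus, hyperparameters, and variable names are illustrative assumptions rather than details taken from the chapter's sources.

# A minimal sketch of learning word embeddings with gensim's Word2Vec
# (toy corpus and hyperparameters are illustrative only).
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["cat"][:5])                   # first few dimensions of the dense vector for "cat"
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours in the embedding space

Even on such a toy corpus, words that appear in similar contexts end up with similar vectors, which is the property that later architectures build on.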
Building on these distributed representations, the Transformer architecture introduced several decisive advantages over earlier sequence models:
● Parallelization: Unlike RNNs and LSTMs, the Transformer architecture allowed for significant parallelization, reducing training times and enabling the processing of longer sequences of text in a single step.
● Attention Mechanism: The self-attention mechanism enabled the model to weigh the importance of different parts of the input text differently, allowing it to capture nuanced relationships and dependencies across the entire sequence (a minimal sketch of this operation follows the list). This capability was particularly beneficial for tasks requiring an understanding of context, such as translation and summarization.
● Flexibility and Scalability: The Transformer's architecture proved to be incredibly
versatile, serving as the foundation for a range of subsequent models tailored to diverse
NLP tasks. Its scalability was demonstrated by its capacity to handle models of varying
sizes, from hundreds of millions to billions of parameters.
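To make the self-attention idea above concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside the Transformer; the random matrices and their dimensions are illustrative assumptions.

# A minimal sketch of scaled dot-product attention (illustrative, not an
# implementation of any specific model discussed in this chapter).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the keys
    return weights @ V                                     # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 token positions, model dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)         # (4, 8)

Because every position attends to every other position in one matrix multiplication, the whole sequence can be processed in parallel, which is exactly the parallelization advantage noted above.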
The Transformer model set new standards for efficiency, effectiveness, and applicability in NLP,
paving the way for the development of highly sophisticated language models.
● Training Techniques: LLMs are pre-trained on vast corpora of text data using unsupervised learning techniques, allowing them to capture a broad understanding of language patterns, syntax, and semantics. This pre-training is followed by fine-tuning, where the model is adapted to specific tasks with smaller, task-specific datasets (a brief illustrative sketch of this pattern follows the list).
● Reinforcement Learning from Human Feedback (RLHF): To further refine the outputs of
LLMs and align them with human values and ethical standards, techniques such as
RLHF have been employed. This involves using human feedback to guide the learning
process, adjusting the model's outputs to be more aligned with human judgment and
preferences.
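As a hedged illustration of the pre-train-then-fine-tune pattern described above, the sketch below adapts an already pre-trained checkpoint to a tiny sentiment task using the Hugging Face transformers library; the checkpoint name, the two-example "dataset", and the hyperparameters are illustrative assumptions, not details from this chapter's sources.

# A minimal sketch of fine-tuning a pre-trained model on a toy task
# (checkpoint, data, and settings are illustrative only).
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"        # already pre-trained on a large corpus
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

texts, labels = ["great movie", "terrible plot"], [1, 0]   # toy task-specific data
enc = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ToyDataset(),
)
trainer.train()   # only this small supervised step is task-specific

RLHF adds a further alignment stage on top of this pattern, typically involving a reward model and policy optimization, and is not shown in this sketch.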
The development and refinement of LLMs have not only significantly advanced the state of NLP
but also raised important considerations regarding their ethical use, interpretability, and impact
on society.
● Ethical Considerations and Societal Impact: As LLMs became more powerful, concerns
about their potential to generate misleading information, reinforce biases, and impact
human labor markets intensified. Addressing these concerns requires careful design,
training methodologies that mitigate bias, and transparent usage guidelines to ensure
that these models are used responsibly and for the benefit of society.
● Interpretability and Transparency: Despite their impressive capabilities, LLMs' complexity
makes understanding how they arrive at specific outputs challenging. Efforts to improve
the interpretability of these models include techniques for model visualization,
explanation frameworks, and research into simpler models that can offer comparable
performance with greater transparency.
● Continual Learning and Adaptation: Another area of focus is making LLMs adaptable to
new information without requiring extensive retraining. Techniques like few-shot learning,
where models learn from a minimal number of examples, and continual learning, where
models update their knowledge base without forgetting previously learned information,
are critical for developing more efficient and adaptable NLP systems.
Looking Forward
The evolution of NLP from traditional techniques to LLMs represents a remarkable journey of
innovation and discovery. Each stage of this evolution has addressed the limitations of previous
methods while opening new avenues for exploration and application. The future of NLP lies in
addressing the current challenges faced by LLMs, including ethical considerations,
interpretability, and adaptability, while exploring new architectures and training methods that can
further advance our ability to understand and generate human language.
As the field continues to evolve, collaboration between researchers, ethicists, and practitioners
will be essential to ensure that the benefits of NLP technology are realized across society,
enhancing communication, accessibility, and information sharing while mitigating potential risks
and biases. The journey of NLP is far from complete, and the next chapters promise to be as
exciting and impactful as those that have preceded them.
Text Preprocessing Techniques
The following steps are commonly applied to clean and normalize raw text, particularly social media text, before it is fed to a model.
1. Tokenization: This is the process of splitting a text into individual words or tokens.
2. Removal of noise, URLs, hashtags, and user-mentions: This is the process of removing any unwanted or irrelevant text from the data.
3. Word segmentation: This is the process of splitting hashtags into individual words.
4. Replacement of emoticons and emojis: This is the process of replacing emoticons and emojis with their corresponding text representations.
5. Replacement of abbreviations and slang: This is the process of replacing abbreviations and slang with their corresponding full forms.
6. Replacement of elongated characters: This is the process of normalizing words with repeated characters back to their standard forms.
7. Correction of spelling mistakes: This is the process of correcting any spelling mistakes in the text.
8. Expanding contractions: This is the process of expanding contractions into their corresponding full forms.
9. Removal of punctuation: This is the process of removing any punctuation from the text.
10. Removal of numbers: This is the process of removing any numbers from the text.
11. Lower-casing all words: This is the process of converting all words to lowercase.
12. Removal of stop-words: This is the process of removing any stop words from the text.
13. Stemming: This is the process of reducing a word to its root or stem form.
14. Lemmatization: This is the process of reducing a word to its dictionary base form (lemma) using lexical knowledge.
15. Part of speech (POS) tagging: This is the process of assigning a part of speech to each word in the text.
16. Handling negations: This is the process of identifying and handling negations in the text.
These text preprocessing techniques are used to clean and prepare text data for further
processing. They can help to improve the accuracy and performance of natural language
processing (NLP) tasks such as text classification, sentiment analysis, and machine translation.
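As a starting point for the pipeline listed above, the sketch below shows the first step, tokenization, using NLTK; the sample tweet is an illustrative assumption.

# A minimal tokenization sketch using NLTK (the sample tweet is illustrative).
import nltk
nltk.download("punkt", quiet=True)   # tokenizer data; newer NLTK versions may also need "punkt_tab"
from nltk.tokenize import word_tokenize

tweet = "Loving the new phone, battery lasts all day! #happy"
print(word_tokenize(tweet))
# e.g. ['Loving', 'the', 'new', 'phone', ',', 'battery', 'lasts', 'all', 'day', '!', '#', 'happy']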
• Removal of Noise, URLs, Hashtags and User-mentions: Unwanted strings and Unicode characters left over from the crawling process are not useful to machines and add noise to the data. In addition, most tweets contain URLs that provide extra information, user-mentions (@), and the hashtag symbol "#", which users attach to associate a tweet with a particular topic or to express sentiment. These elements carry extra information that is useful to human readers, but they provide little signal to machines and are treated as noise that needs to be handled. Researchers have presented different techniques for handling this extra information: URLs, for example, have been replaced with tags [1], whereas user-mentions (@) are simply removed [13, 65].
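A hedged regex-based sketch of this cleaning step is shown below; the patterns and the replace-versus-remove choices are illustrative assumptions rather than the exact procedures used in the cited studies.

# Illustrative regex cleaning: replace URLs with a tag, drop user-mentions,
# and strip the "#" symbol while keeping the hashtag text for later segmentation.
import re

def clean_tweet(text):
    text = re.sub(r"https?://\S+|www\.\S+", "<URL>", text)  # URLs -> tag
    text = re.sub(r"@\w+", "", text)                        # remove user-mentions
    text = re.sub(r"#(\w+)", r"\1", text)                   # keep hashtag words
    return re.sub(r"\s+", " ", text).strip()                # collapse extra whitespace

print(clean_tweet("Check this https://t.co/xyz @user #MachineLearning rocks!"))
# -> "Check this <URL> MachineLearning rocks!"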
• Word Segmentation: Word segmentation is the process of separating the phrases, content words, and keywords packed into a hashtag. This step helps machines understand and classify the content of tweets without any human intervention. As mentioned earlier, Twitter users add a hashtag to almost every tweet to associate it with a particular topic; a phrase or keyword starting with "#" is known as a hashtag. Various techniques for word segmentation are presented in the literature [22, 136].
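A minimal sketch of hashtag segmentation follows; it uses a tiny hand-made vocabulary and a simple dynamic-programming split, which stands in for the more sophisticated techniques cited above.

# Illustrative hashtag segmentation with a small vocabulary and a simple
# dynamic-programming split (real systems use large word-frequency lists).
VOCAB = {"machine", "learning", "deep", "love", "i", "nlp", "great", "day"}

def segment_hashtag(tag):
    word = tag.lstrip("#").lower()
    n = len(word)
    best = [None] * (n + 1)          # best[i] = a segmentation of word[:i], if any
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and word[j:i] in VOCAB:
                best[i] = best[j] + [word[j:i]]
                break
    return best[n] or [word]          # fall back to the raw string if no split is found

print(segment_hashtag("#machinelearning"))   # ['machine', 'learning']
print(segment_hashtag("#ilovenlp"))          # ['i', 'love', 'nlp']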
• Replacing Emoticons and Emojis: Twitter users use many different emoticons and emojis, such as :) and :(, to express their sentiments and opinions, so it is important to capture this useful information in order to classify tweets correctly. A few tokenizers are available that can recognize such expressions and replace them with their associated meanings [41].
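A small dictionary-based sketch of this replacement is shown below; the mapping itself is an illustrative assumption (third-party libraries such as emoji can provide fuller coverage of Unicode emojis).

# Illustrative emoticon replacement using a small hand-made mapping.
import re

EMOTICONS = {":)": "happy", ":(": "sad", ":D": "laugh", ":'(": "crying"}

def replace_emoticons(text):
    # Sort by length so that ":'(" is matched before ":(".
    pattern = "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
    return re.sub(pattern, lambda m: EMOTICONS[m.group(0)], text)

print(replace_emoticons("loved the match :) but the ending :("))
# -> "loved the match happy but the ending sad"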
• Replacement of Abbreviations and Slang: Twitter's character limit pushes online users toward abbreviations, shortened words, and slang in their posts. An abbreviation is a shortened form or acronym of a phrase, such as MIA, which stands for "missing in action". Slang, in contrast, is an informal way of expressing thoughts or meanings that is sometimes restricted to a particular group of people or context. It is therefore crucial to handle this informal text by replacing such terms with their actual meanings, so that performance improves without information being lost. Researchers have proposed different methods for this issue, but the most useful technique is simply to convert these terms into the actual words, which are easier for a machine to understand [68, 100].
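A minimal lookup-based sketch follows; the slang dictionary is a tiny illustrative assumption, whereas real systems rely on much larger curated lists.

# Illustrative slang/abbreviation expansion with a small lookup table.
SLANG = {"mia": "missing in action", "gr8": "great", "2maro": "tomorrow", "u": "you"}

def expand_slang(text):
    return " ".join(SLANG.get(tok.lower(), tok) for tok in text.split())

print(expand_slang("gr8 match, see u 2maro"))
# -> "great match, see you tomorrow"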
• Replacing Elongated Characters: Social media users sometimes intentionally use elongated words in which characters are repeated many times, such as "loooovvveee" or "greeeeat". It is important to map these words back to their base forms so that a classifier does not treat them as different words. Elongated words are therefore typically replaced with their original base words; their detection and replacement have been studied by [97] and [5].
• Correction of Spelling Mistakes: Incorrect spellings and grammatical mistakes are very common in text, especially on social media platforms such as Twitter and Facebook. Correcting spelling and grammatical mistakes helps reduce the number of different ways the same word is written. TextBlob is one library that can be used for this purpose, and Norvig's spell-correction method (http://norvig.com/spell-correct.html) is also widely used to correct spelling mistakes.
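A hedged sketch of both steps follows: a regex that truncates long character runs, followed by TextBlob's spelling correction; the example words are illustrative assumptions.

# Illustrative handling of elongated words and spelling mistakes.
import re
from textblob import TextBlob   # third-party library; pip install textblob

def shorten_elongated(word):
    # Collapse any character repeated 3+ times down to two occurrences,
    # e.g. "loooovvveee" -> "loovvee"; a spell checker can then finish the job.
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

print(shorten_elongated("greeeeat"))                            # "greeat"
print(str(TextBlob(shorten_elongated("greeeeat")).correct()))   # likely "great"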
• Expanding Contractions: A contraction is a shortened form of one or more words, widely used by online users, in which an apostrophe takes the place of the missing letter(s). Because we want to standardize the text so that machines can process it easily, contractions are expanded back to their original base words. For example, "how's", "I'm", "can't" and "don't" are contractions of "how is", "I am", "cannot" and "do not" respectively. In the study conducted by [14], contractions were replaced with their original words or with the relevant word. If contractions are not replaced, the tokenization step will split a word such as "can't" into the tokens "can" and "t".
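A minimal dictionary-based sketch of contraction expansion is shown below; the mapping covers only the examples above and is an illustrative assumption (dedicated libraries offer fuller coverage).

# Illustrative contraction expansion with a small lookup table.
import re

CONTRACTIONS = {"how's": "how is", "i'm": "i am", "can't": "cannot", "don't": "do not"}

def expand_contractions(text):
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm sure she can't come, don't wait"))
# -> "i am sure she cannot come, do not wait"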
• Removing Punctuations: Social media users use various punctuation marks to express their sentiments and emotions; these may be useful to humans but are not as useful to machines when classifying short texts. Removing punctuation is therefore common practice in classification tasks such as sentiment analysis [82]. However, some punctuation symbols, such as "!" and "?", do convey sentiment, and replacing question marks or exclamation marks with tags has also been studied [5].
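A minimal sketch using Python's string.punctuation follows; keeping "!" and "?" as sentiment markers is shown as an option, reflecting the caveat above.

# Illustrative punctuation removal; "!" and "?" can optionally be kept
# because they often carry sentiment.
import string

def remove_punctuation(text, keep="!?"):
    drop = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", drop))

print(remove_punctuation("Wow!!! Is this real? (unbelievable...)"))
# -> "Wow!!! Is this real? unbelievable"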
• Removing Numbers: A text corpus usually contains numbers that are useful for human readers but of little use to machines, and they can lower the results of the classification task. The simple and standard method is to remove them [47, 58]. However, we could lose useful information if numbers are removed before slang and abbreviations are transformed into their actual words. For example, words like "2maro", "4 u" and "gr8" should first be converted to actual words, and only then should this pre-processing step be applied.
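A short regex sketch follows; note, per the caveat above, that it should run only after slang such as "2maro" or "gr8" has already been expanded.

# Illustrative removal of digit-only tokens (run after slang expansion so that
# tokens like "2maro" or "gr8" are not affected).
import re

def remove_numbers(text):
    text = re.sub(r"\b\d+\b", " ", text)       # drop digit-only tokens
    return re.sub(r"\s+", " ", text).strip()   # tidy whitespace

print(remove_numbers("ordered 2 phones for 999 dollars"))
# -> "ordered phones for dollars"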
• Lower-casing All Words: A sentence in a corpus contains many words with different capitalization. This pre-processing step helps to avoid different copies of the same word; such diversity of capitalization within the corpus can cause problems during the classification task and lower performance. Converting every capital letter to lower case is the most common way to handle this issue in text data. However, projecting all tokens in a corpus into one feature space can also cause problems in the interpretation of some words: the token "US" in a raw corpus could be the pronoun "us" or the country name, so converting it to lower case in all cases can be problematic. The study conducted by [33] lower-cased the words in the corpus to obtain clean words.
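A two-line sketch of lower-casing follows, with a comment noting the "US" caveat described above.

# Illustrative lower-casing; note that case-sensitive tokens such as "US"
# (country) and "us" (pronoun) lose their distinction after this step.
text = "The US team met us at the airport"
print(text.lower())   # "the us team met us at the airport"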
• Removing Stop-words: In text classification tasks, many words carry little significance yet appear with high frequency. Because these words do not contain much information for the sentiment classification task and therefore do not help improve performance, it is recommended to remove stop words before the feature selection step. Examples include "a", "the", "is", "and", "am" and "are". A popular and straightforward way to handle such words is simply to remove them, and stop-word lists are available in different libraries such as NLTK, scikit-learn and spaCy.
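A minimal NLTK-based sketch of stop-word removal follows; the token list is an illustrative assumption and NLTK's English stop-word list is used.

# Illustrative stop-word removal using NLTK's English stop-word list.
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))

tokens = ["the", "battery", "is", "really", "good", "and", "cheap"]
print([t for t in tokens if t not in STOP])
# -> ['battery', 'really', 'good', 'cheap']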
• Stemming: One word can appear in many different forms while the semantic meaning remains the same. Stemming is a technique that removes suffixes and affixes to obtain the root, base, or stem word. The importance of stemming was studied by [92]. Several stemming algorithms, such as the Porter, Lancaster and Snowball stemmers, help consolidate different forms of words into the same feature space, and feature reduction can be achieved by using this technique.
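A short sketch comparing two of the NLTK stemmers mentioned above follows; the word list is an illustrative assumption.

# Illustrative stemming with NLTK's Porter and Snowball stemmers.
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for word in ["studies", "studying", "connected", "connection"]:
    print(word, porter.stem(word), snowball.stem(word))
# e.g. "studies" -> "studi", "connection" -> "connect"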
• Lemmatization: The purpose of lemmatization is the same as that of stemming, namely to cut words down to their base or root forms. However, in lemmatization the inflections of words are not simply chopped off; instead, lexical knowledge is used to transform words into their base forms. Many libraries support lemmatization; some of the best known are NLTK (WordNet lemmatizer), gensim, Stanford CoreNLP, spaCy and TextBlob.
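A minimal sketch using NLTK's WordNet lemmatizer follows; the words and part-of-speech hints are illustrative assumptions.

# Illustrative lemmatization with NLTK's WordNet lemmatizer (needs the WordNet data).
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))           # "study"  (defaults to noun)
print(lemmatizer.lemmatize("running", pos="v"))  # "run"    (a verb hint changes the lemma)
print(lemmatizer.lemmatize("better", pos="a"))   # "good"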
• Part of Speech (POS) Tagging: The purpose of part-of-speech (POS) tagging is to assign a part of speech to each word in a text. It groups together words that play the same grammatical role.
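A short NLTK sketch of POS tagging follows; the sentence is an illustrative assumption.

# Illustrative POS tagging with NLTK (requires the tokenizer and tagger data;
# newer NLTK versions may also need the "_eng"/"_tab" variants of these packages).
import nltk
for pkg in ("punkt", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ..., ('jumps', 'VBZ'), ..., ('dog', 'NN')]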
• Handling Negations: For humans it is simple to grasp the context when a negation is present in a sentence, but machines sometimes fail to capture and classify it accurately, so handling negation can be a challenging task in word-level text analysis. Marking the words affected by a negation with the prefix "NEG_" has been studied by [103]; similarly, handling negations with antonyms has been studied by [124].
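A minimal sketch of the "NEG_" prefixing strategy follows; the scope rule used here (prefix every token after a negation word until the next punctuation mark) is a simplified illustrative assumption rather than the exact procedure of the cited study.

# Illustrative negation handling: prefix tokens that follow a negation word
# with "NEG_" until the next punctuation mark ends the negation scope.
import re

NEGATIONS = {"not", "no", "never", "cannot"}

def mark_negations(text):
    out, negating = [], False
    for token in re.findall(r"[\w']+|[.,!?;]", text.lower()):
        if token in NEGATIONS:
            negating = True
            out.append(token)
        elif token in ".,!?;":
            negating = False
            out.append(token)
        else:
            out.append("NEG_" + token if negating else token)
    return " ".join(out)

print(mark_negations("The plot was not good, acting was fine"))
# -> "the plot was not NEG_good , acting was fine"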