Chapter Four - NLP
Introduction
The development of large language models has been one of the most significant
breakthroughs in the field of natural language processing (NLP) in recent years. These
models have demonstrated impressive performance on a variety of language tasks,
including language translation, text generation, and question-answering. The
Transformer architecture has played a key role in enabling the development of these
models. In this chapter, we will explore the evolution of the Transformer architecture and
its impact on the field of NLP.
https://arxiv.org/pdf/2010.15036.pdf
Before the advent of deep learning, NLP relied largely on rule-based and statistical techniques, including the following.
● Syntax Analysis and Parsing: This involved breaking down sentences into their grammatical components, such as nouns, verbs, and adjectives, to extract their syntactic structures. Tools like the Stanford Parser used algorithms to understand sentence structure, which was crucial for tasks such as translating languages and answering questions. However, these methods struggled with the ambiguities inherent in human language.
● Named Entity Recognition (NER): This process identified key information in text, such as
names of people, organizations, and locations. It was crucial for information extraction
and data analysis applications. Traditional NER systems used hand-crafted rules or lists
of known entities, which were not easily scalable or adaptable to new domains or
languages.
● Part-of-Speech Tagging: This involved labeling words with their corresponding part of
speech, based on both their definition and context. This tagging was vital for parsing
sentences and understanding language structure. Early systems used rule-based
methods or probabilistic models, which could be inaccurate when faced with complex or
ambiguous sentence structures.
These traditional methods were a crucial step in building the foundation of NLP. However, their
reliance on manual rule creation made them labor-intensive and unable to cope with the
nuances and variability of natural language effectively.
The development of word embeddings was a critical advancement in NLP, enabling models to
capture the semantic richness of language more effectively than ever before.
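As a concrete illustration of word embeddings, the short sketch below trains a tiny Word2Vec model with the gensim library; the toy corpus, hyperparameters, and variable names are illustrative assumptions rather than details taken from the chapter's sources.

# A minimal sketch of learning word embeddings with gensim's Word2Vec
# (toy corpus and hyperparameters are illustrative only).
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["cat"][:5])                   # first few dimensions of the dense vector for "cat"
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours in the embedding space

Even on such a toy corpus, words that appear in similar contexts end up with similar vectors, which is the property that later architectures build on.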
Building on these distributed representations, the Transformer architecture introduced several decisive advantages over earlier sequence models:
● Parallelization: Unlike RNNs and LSTMs, the Transformer architecture allowed for significant parallelization, reducing training times and enabling the processing of longer sequences of text in a single step.
● Attention Mechanism: The self-attention mechanism enabled the model to weigh the importance of different parts of the input text differently, allowing it to capture nuanced relationships and dependencies across the entire sequence (a minimal sketch of this operation follows the list). This capability was particularly beneficial for tasks requiring an understanding of context, such as translation and summarization.
● Flexibility and Scalability: The Transformer's architecture proved to be incredibly
versatile, serving as the foundation for a range of subsequent models tailored to diverse
NLP tasks. Its scalability was demonstrated by its capacity to handle models of varying
sizes, from hundreds of millions to billions of parameters.
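To make the self-attention idea above concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside the Transformer; the random matrices and their dimensions are illustrative assumptions.

# A minimal sketch of scaled dot-product attention (illustrative, not an
# implementation of any specific model discussed in this chapter).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the keys
    return weights @ V                                     # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 token positions, model dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)         # (4, 8)

Because every position attends to every other position in one matrix multiplication, the whole sequence can be processed in parallel, which is exactly the parallelization advantage noted above.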
The Transformer model set new standards for efficiency, effectiveness, and applicability in NLP,
paving the way for the development of highly sophisticated language models.
● Training Techniques: LLMs are pre-trained on vast corpora of text data using unsupervised learning techniques, allowing them to capture a broad understanding of language patterns, syntax, and semantics. This pre-training is followed by fine-tuning, where the model is adapted to specific tasks with smaller, task-specific datasets (a brief illustrative sketch of this pattern follows the list).
● Reinforcement Learning from Human Feedback (RLHF): To further refine the outputs of
LLMs and align them with human values and ethical standards, techniques such as
RLHF have been employed. This involves using human feedback to guide the learning
process, adjusting the model's outputs to be more aligned with human judgment and
preferences.
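As a hedged illustration of the pre-train-then-fine-tune pattern described above, the sketch below adapts an already pre-trained checkpoint to a tiny sentiment task using the Hugging Face transformers library; the checkpoint name, the two-example "dataset", and the hyperparameters are illustrative assumptions, not details from this chapter's sources.

# A minimal sketch of fine-tuning a pre-trained model on a toy task
# (checkpoint, data, and settings are illustrative only).
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"        # already pre-trained on a large corpus
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

texts, labels = ["great movie", "terrible plot"], [1, 0]   # toy task-specific data
enc = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ToyDataset(),
)
trainer.train()   # only this small supervised step is task-specific

RLHF adds a further alignment stage on top of this pattern, typically involving a reward model and policy optimization, and is not shown in this sketch.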
The development and refinement of LLMs have not only significantly advanced the state of NLP
but also raised important considerations regarding their ethical use, interpretability, and impact
on society.
● Ethical Considerations and Societal Impact: As LLMs became more powerful, concerns
about their potential to generate misleading information, reinforce biases, and impact
human labor markets intensified. Addressing these concerns requires careful design,
training methodologies that mitigate bias, and transparent usage guidelines to ensure
that these models are used responsibly and for the benefit of society.
● Interpretability and Transparency: Despite their impressive capabilities, LLMs' complexity
makes understanding how they arrive at specific outputs challenging. Efforts to improve
the interpretability of these models include techniques for model visualization,
explanation frameworks, and research into simpler models that can offer comparable
performance with greater transparency.
● Continual Learning and Adaptation: Another area of focus is making LLMs adaptable to
new information without requiring extensive retraining. Techniques like few-shot learning,
where models learn from a minimal number of examples, and continual learning, where
models update their knowledge base without forgetting previously learned information,
are critical for developing more efficient and adaptable NLP systems.
Looking Forward
The evolution of NLP from traditional techniques to LLMs represents a remarkable journey of
innovation and discovery. Each stage of this evolution has addressed the limitations of previous
methods while opening new avenues for exploration and application. The future of NLP lies in
addressing the current challenges faced by LLMs, including ethical considerations,
interpretability, and adaptability, while exploring new architectures and training methods that can
further advance our ability to understand and generate human language.
As the field continues to evolve, collaboration between researchers, ethicists, and practitioners
will be essential to ensure that the benefits of NLP technology are realized across society,
enhancing communication, accessibility, and information sharing while mitigating potential risks
and biases. The journey of NLP is far from complete, and the next chapters promise to be as
exciting and impactful as those that have preceded them.
Text Preprocessing Techniques
The following steps are commonly applied to clean and normalize raw text, particularly social media text, before it is fed to a model.
1. Tokenization: This is the process of splitting a text into individual words or tokens.
2. Removal of noise, URLs, hashtags, and user-mentions: This is the process of removing any unwanted or irrelevant text from the data.
3. Word segmentation: This is the process of splitting hashtags into individual words.
4. Replacement of emoticons and emojis: This is the process of replacing emoticons and emojis with their corresponding text representations.
5. Replacement of abbreviations and slang: This is the process of replacing abbreviations and slang with their corresponding full forms.
6. Replacement of elongated characters: This is the process of normalizing words with repeated characters back to their standard forms.
7. Correction of spelling mistakes: This is the process of correcting any spelling mistakes in the text.
8. Expanding contractions: This is the process of expanding contractions into their corresponding full forms.
9. Removal of punctuation: This is the process of removing any punctuation from the text.
10. Removal of numbers: This is the process of removing any numbers from the text.
11. Lower-casing all words: This is the process of converting all words to lowercase.
12. Removal of stop-words: This is the process of removing any stop words from the text.
13. Stemming: This is the process of reducing a word to its root or stem form.
14. Lemmatization: This is the process of reducing a word to its dictionary base form (lemma) using lexical knowledge.
15. Part of speech (POS) tagging: This is the process of assigning a part of speech to each word in the text.
16. Handling negations: This is the process of identifying and handling negations in the text.
These text preprocessing techniques are used to clean and prepare text data for further
processing. They can help to improve the accuracy and performance of natural language
processing (NLP) tasks such as text classification, sentiment analysis, and machine translation.
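As a starting point for the pipeline listed above, the sketch below shows the first step, tokenization, using NLTK; the sample tweet is an illustrative assumption.

# A minimal tokenization sketch using NLTK (the sample tweet is illustrative).
import nltk
nltk.download("punkt", quiet=True)   # tokenizer data; newer NLTK versions may also need "punkt_tab"
from nltk.tokenize import word_tokenize

tweet = "Loving the new phone, battery lasts all day! #happy"
print(word_tokenize(tweet))
# e.g. ['Loving', 'the', 'new', 'phone', ',', 'battery', 'lasts', 'all', 'day', '!', '#', 'happy']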
• Removal of Noise, URLs, Hashtags and User-mentions: Unwanted strings and Unicode characters left over from the crawling process are not useful to machines and add noise to the data. In addition, most tweets contain URLs that provide extra information, user-mentions (@), and the hashtag symbol "#", which users attach to associate a tweet with a particular topic or to express sentiment. These elements carry extra information that is useful to human readers, but they provide little signal to machines and are treated as noise that needs to be handled. Researchers have presented different techniques for handling this extra information: URLs, for example, have been replaced with tags [1], whereas user-mentions (@) are simply removed [13, 65].
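A hedged regex-based sketch of this cleaning step is shown below; the patterns and the replace-versus-remove choices are illustrative assumptions rather than the exact procedures used in the cited studies.

# Illustrative regex cleaning: replace URLs with a tag, drop user-mentions,
# and strip the "#" symbol while keeping the hashtag text for later segmentation.
import re

def clean_tweet(text):
    text = re.sub(r"https?://\S+|www\.\S+", "<URL>", text)  # URLs -> tag
    text = re.sub(r"@\w+", "", text)                        # remove user-mentions
    text = re.sub(r"#(\w+)", r"\1", text)                   # keep hashtag words
    return re.sub(r"\s+", " ", text).strip()                # collapse extra whitespace

print(clean_tweet("Check this https://t.co/xyz @user #MachineLearning rocks!"))
# -> "Check this <URL> MachineLearning rocks!"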
• Word Segmentation: Word segmentation is the process of separating the phrases, content words, and keywords packed into a hashtag. This step helps machines understand and classify the content of tweets without any human intervention. As mentioned earlier, Twitter users add a hashtag to almost every tweet to associate it with a particular topic; a phrase or keyword starting with "#" is known as a hashtag. Various techniques for word segmentation are presented in the literature [22, 136].
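A minimal sketch of hashtag segmentation follows; it uses a tiny hand-made vocabulary and a simple dynamic-programming split, which stands in for the more sophisticated techniques cited above.

# Illustrative hashtag segmentation with a small vocabulary and a simple
# dynamic-programming split (real systems use large word-frequency lists).
VOCAB = {"machine", "learning", "deep", "love", "i", "nlp", "great", "day"}

def segment_hashtag(tag):
    word = tag.lstrip("#").lower()
    n = len(word)
    best = [None] * (n + 1)          # best[i] = a segmentation of word[:i], if any
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and word[j:i] in VOCAB:
                best[i] = best[j] + [word[j:i]]
                break
    return best[n] or [word]          # fall back to the raw string if no split is found

print(segment_hashtag("#machinelearning"))   # ['machine', 'learning']
print(segment_hashtag("#ilovenlp"))          # ['i', 'love', 'nlp']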
• Replacing Emoticons and Emojis: Twitter users use many different emoticons and emojis, such as :) and :(, to express their sentiments and opinions, so it is important to capture this useful information in order to classify tweets correctly. A few tokenizers are available that can recognize such expressions and replace them with their associated meanings [41].
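A small dictionary-based sketch of this replacement is shown below; the mapping itself is an illustrative assumption (third-party libraries such as emoji can provide fuller coverage of Unicode emojis).

# Illustrative emoticon replacement using a small hand-made mapping.
import re

EMOTICONS = {":)": "happy", ":(": "sad", ":D": "laugh", ":'(": "crying"}

def replace_emoticons(text):
    # Sort by length so that ":'(" is matched before ":(".
    pattern = "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
    return re.sub(pattern, lambda m: EMOTICONS[m.group(0)], text)

print(replace_emoticons("loved the match :) but the ending :("))
# -> "loved the match happy but the ending sad"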
• Replacement of Abbreviations and Slang: Twitter's character limit pushes online users toward abbreviations, shortened words, and slang in their posts. An abbreviation is a shortened form or acronym of a phrase, such as MIA, which stands for "missing in action". Slang, in contrast, is an informal way of expressing thoughts or meanings that is sometimes restricted to a particular group of people or context. It is therefore crucial to handle this informal text by replacing such terms with their actual meanings, so that performance improves without information being lost. Researchers have proposed different methods for this issue, but the most useful technique is simply to convert these terms into the actual words, which are easier for a machine to understand [68, 100].
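A minimal lookup-based sketch follows; the slang dictionary is a tiny illustrative assumption, whereas real systems rely on much larger curated lists.

# Illustrative slang/abbreviation expansion with a small lookup table.
SLANG = {"mia": "missing in action", "gr8": "great", "2maro": "tomorrow", "u": "you"}

def expand_slang(text):
    return " ".join(SLANG.get(tok.lower(), tok) for tok in text.split())

print(expand_slang("gr8 match, see u 2maro"))
# -> "great match, see you tomorrow"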
• Replacing Elongated Characters: Social media users sometimes intentionally use elongated words in which characters are repeated many times, such as "loooovvveee" or "greeeeat". It is important to map these words back to their base forms so that a classifier does not treat them as different words. Elongated words are therefore typically replaced with their original base words; their detection and replacement have been studied by [97] and [5].
• Correction of Spelling Mistakes: Incorrect spellings and grammatical mistakes are very common in text, especially on social media platforms such as Twitter and Facebook. Correcting spelling and grammatical mistakes helps reduce the number of different ways the same word is written. TextBlob is one library that can be used for this purpose, and Norvig's spell-correction method (http://norvig.com/spell-correct.html) is also widely used to correct spelling mistakes.
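A hedged sketch of both steps follows: a regex that truncates long character runs, followed by TextBlob's spelling correction; the example words are illustrative assumptions.

# Illustrative handling of elongated words and spelling mistakes.
import re
from textblob import TextBlob   # third-party library; pip install textblob

def shorten_elongated(word):
    # Collapse any character repeated 3+ times down to two occurrences,
    # e.g. "loooovvveee" -> "loovvee"; a spell checker can then finish the job.
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

print(shorten_elongated("greeeeat"))                            # "greeat"
print(str(TextBlob(shorten_elongated("greeeeat")).correct()))   # likely "great"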
• Expanding Contractions: A contraction is a shortened form of one or more words, widely used by online users, in which an apostrophe takes the place of the missing letter(s). Because we want to standardize the text so that machines can process it easily, contractions are expanded back to their original base words. For example, "how's", "I'm", "can't" and "don't" are contractions of "how is", "I am", "cannot" and "do not" respectively. In the study conducted by [14], contractions were replaced with their original words or with the relevant word. If contractions are not replaced, the tokenization step will split a word such as "can't" into the tokens "can" and "t".
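A minimal dictionary-based sketch of contraction expansion is shown below; the mapping covers only the examples above and is an illustrative assumption (dedicated libraries offer fuller coverage).

# Illustrative contraction expansion with a small lookup table.
import re

CONTRACTIONS = {"how's": "how is", "i'm": "i am", "can't": "cannot", "don't": "do not"}

def expand_contractions(text):
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm sure she can't come, don't wait"))
# -> "i am sure she cannot come, do not wait"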
• Removing Punctuations: Social media users use various punctuation marks to express their sentiments and emotions; these may be useful to humans but are not as useful to machines when classifying short texts. Removing punctuation is therefore common practice in classification tasks such as sentiment analysis [82]. However, some punctuation symbols, such as "!" and "?", do convey sentiment, and replacing question marks or exclamation marks with tags has also been studied [5].
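A minimal sketch using Python's string.punctuation follows; keeping "!" and "?" as sentiment markers is shown as an option, reflecting the caveat above.

# Illustrative punctuation removal; "!" and "?" can optionally be kept
# because they often carry sentiment.
import string

def remove_punctuation(text, keep="!?"):
    drop = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", drop))

print(remove_punctuation("Wow!!! Is this real? (unbelievable...)"))
# -> "Wow!!! Is this real? unbelievable"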
• Removing Numbers: A text corpus usually contains numbers that are useful for human readers but of little use to machines, and they can lower the results of the classification task. The simple and standard method is to remove them [47, 58]. However, we could lose useful information if numbers are removed before slang and abbreviations are transformed into their actual words. For example, words like "2maro", "4 u" and "gr8" should first be converted to actual words, and only then should this pre-processing step be applied.
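A short regex sketch follows; note, per the caveat above, that it should run only after slang such as "2maro" or "gr8" has already been expanded.

# Illustrative removal of digit-only tokens (run after slang expansion so that
# tokens like "2maro" or "gr8" are not affected).
import re

def remove_numbers(text):
    text = re.sub(r"\b\d+\b", " ", text)       # drop digit-only tokens
    return re.sub(r"\s+", " ", text).strip()   # tidy whitespace

print(remove_numbers("ordered 2 phones for 999 dollars"))
# -> "ordered phones for dollars"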
• Lower-casing All Words: A sentence in a corpus contains many words with different capitalization. This pre-processing step helps to avoid different copies of the same word; such diversity of capitalization within the corpus can cause problems during the classification task and lower performance. Converting every capital letter to lower case is the most common way to handle this issue in text data. However, projecting all tokens in a corpus into one feature space can also cause problems in the interpretation of some words: the token "US" in a raw corpus could be the pronoun "us" or the country name, so converting it to lower case in all cases can be problematic. The study conducted by [33] lower-cased the words in the corpus to obtain clean words.
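A two-line sketch of lower-casing follows, with a comment noting the "US" caveat described above.

# Illustrative lower-casing; note that case-sensitive tokens such as "US"
# (country) and "us" (pronoun) lose their distinction after this step.
text = "The US team met us at the airport"
print(text.lower())   # "the us team met us at the airport"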
• Removing Stop-words: In text classification tasks, many words carry little significance yet appear with high frequency. Because these words do not contain much information for the sentiment classification task and therefore do not help improve performance, it is recommended to remove stop words before the feature selection step. Examples include "a", "the", "is", "and", "am" and "are". A popular and straightforward way to handle such words is simply to remove them, and stop-word lists are available in different libraries such as NLTK, scikit-learn and spaCy.
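A minimal NLTK-based sketch of stop-word removal follows; the token list is an illustrative assumption and NLTK's English stop-word list is used.

# Illustrative stop-word removal using NLTK's English stop-word list.
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))

tokens = ["the", "battery", "is", "really", "good", "and", "cheap"]
print([t for t in tokens if t not in STOP])
# -> ['battery', 'really', 'good', 'cheap']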
• Stemming: One word can appear in many different forms while the semantic meaning remains the same. Stemming is a technique that removes suffixes and affixes to obtain the root, base, or stem word. The importance of stemming was studied by [92]. Several stemming algorithms, such as the Porter, Lancaster and Snowball stemmers, help consolidate different forms of words into the same feature space, and feature reduction can be achieved by using this technique.
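A short sketch comparing two of the NLTK stemmers mentioned above follows; the word list is an illustrative assumption.

# Illustrative stemming with NLTK's Porter and Snowball stemmers.
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for word in ["studies", "studying", "connected", "connection"]:
    print(word, porter.stem(word), snowball.stem(word))
# e.g. "studies" -> "studi", "connection" -> "connect"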
• Lemmatization: The purpose of lemmatization is the same as that of stemming, namely to cut words down to their base or root forms. However, in lemmatization the inflections of words are not simply chopped off; instead, lexical knowledge is used to transform words into their base forms. Many libraries support lemmatization; some of the best known are NLTK (WordNet lemmatizer), gensim, Stanford CoreNLP, spaCy and TextBlob.
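A minimal sketch using NLTK's WordNet lemmatizer follows; the words and part-of-speech hints are illustrative assumptions.

# Illustrative lemmatization with NLTK's WordNet lemmatizer (needs the WordNet data).
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))           # "study"  (defaults to noun)
print(lemmatizer.lemmatize("running", pos="v"))  # "run"    (a verb hint changes the lemma)
print(lemmatizer.lemmatize("better", pos="a"))   # "good"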
• Part of Speech (POS) Tagging: The purpose of part-of-speech (POS) tagging is to assign a part of speech to each word in a text. It groups together words that play the same grammatical role.
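A short NLTK sketch of POS tagging follows; the sentence is an illustrative assumption.

# Illustrative POS tagging with NLTK (requires the tokenizer and tagger data;
# newer NLTK versions may also need the "_eng"/"_tab" variants of these packages).
import nltk
for pkg in ("punkt", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ..., ('jumps', 'VBZ'), ..., ('dog', 'NN')]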
• Handling Negations: For humans it is simple to grasp the context when a negation is present in a sentence, but machines sometimes fail to capture and classify it accurately, so handling negation can be a challenging task in word-level text analysis. Marking the words affected by a negation with the prefix "NEG_" has been studied by [103]; similarly, handling negations with antonyms has been studied by [124].
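A minimal sketch of the "NEG_" prefixing strategy follows; the scope rule used here (prefix every token after a negation word until the next punctuation mark) is a simplified illustrative assumption rather than the exact procedure of the cited study.

# Illustrative negation handling: prefix tokens that follow a negation word
# with "NEG_" until the next punctuation mark ends the negation scope.
import re

NEGATIONS = {"not", "no", "never", "cannot"}

def mark_negations(text):
    out, negating = [], False
    for token in re.findall(r"[\w']+|[.,!?;]", text.lower()):
        if token in NEGATIONS:
            negating = True
            out.append(token)
        elif token in ".,!?;":
            negating = False
            out.append(token)
        else:
            out.append("NEG_" + token if negating else token)
    return " ".join(out)

print(mark_negations("The plot was not good, acting was fine"))
# -> "the plot was not NEG_good , acting was fine"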