SEMINAR ON
Natural Language Processing - Theory and Application
PRESENTED BY
2203046
SUBMITTED TO THE:
SUPERVISED BY:
MR. OBASADE
IN PARTIAL FULFILMENT FOR THE AWARD OF
NATIONAL DIPLOMA (ND) IN COMPUTER SCIENCE
SEPTEMBER, 2024
TABLE OF CONTENTS
CHAPTER ONE
INTRODUCTION
CHAPTER TWO
Sentiment Analysis
Toxicity classification
Text generation
CHAPTER THREE
CONCLUSION
REFERENCES
CHAPTER ONE
INTRODUCTION
1.1 Overview of Natural Language Processing
Whether it's Alexa, Siri, Google Assistant, Bixby, or Cortana, everyone with a smartphone or
smart speaker has a voice-activated assistant nowadays. Every year, these voice assistants
seem to get better at recognizing and executing the things we tell them to do. But have you
ever wondered how these assistants process the things we're saying? They manage to do this
thanks to Natural Language Processing, or NLP.
Natural Language Processing (NLP) is one of the hottest areas of artificial intelligence (AI)
thanks to applications like text generators that compose coherent essays, chatbots that fool
people into thinking they’re sentient, and text-to-image programs that produce photorealistic
images of anything you can describe. Recent years have brought a revolution in the ability of
computers to understand human languages, programming languages, and even biological and
chemical sequences, such as DNA and protein structures, that resemble language. The latest
AI models are unlocking these areas to analyze the meanings of input text and generate
meaningful, expressive output (DeepLearning.AI, 2023).
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses on
the interaction between computers and human languages. The goal of NLP is to enable
machines to understand, interpret, and respond to human languages in a way that is both
meaningful and useful. As humans increasingly rely on digital technologies, the demand for
machines to effectively process and interpret large volumes of textual and spoken language
data has grown. NLP bridges this gap by applying various computational techniques to
analyze and understand human language (Rohit Kumar Yadav et al., 2024).
NLP encompasses various tasks, including but not limited to text classification, machine
translation, question-answering systems, speech recognition, and sentiment analysis. The
field covers a broad spectrum, from simple keyword-matching systems to advanced deep-
learning algorithms capable of understanding the nuances and complexities of human
language.
The journey of NLP can be traced back to the 1950s when researchers began exploring how
computers could understand language. Early systems were largely rule-based and involved
manually encoding grammatical rules and vocabulary. For instance, one of the earliest efforts
was the Georgetown-IBM experiment of 1954, which involved the
automatic translation of Russian sentences into English. Over the next few decades,
advancements were made with the introduction of statistical models in the 1980s and
machine learning algorithms in the 1990s. However, the major leap in NLP came with the
advent of deep learning techniques and large datasets, which allowed for more accurate and
scalable language models. The introduction of neural networks, particularly the Transformer
architecture, revolutionized the field, enabling state-of-the-art performance in various NLP
tasks (Vaswani et al., 2017).
In essence, NLP seeks to address one of the most profound challenges in computer science:
enabling machines to process and generate human language in a way that mimics human
understanding. This challenge stems from the intricacies of human language, such as syntax
(structure), semantics (meaning), and pragmatics (context). Each of these linguistic
components plays a role in how humans communicate, and NLP aims to model these
complexities computationally.
1.2 Components of Human Language
1.2.1 Syntax
Syntax refers to the structure of language, which dictates how words are arranged to form
grammatically correct sentences. In human languages, each word has a specific part of speech
—nouns, verbs, adjectives, etc.—and the rules of syntax govern the permissible combinations
of these parts. For instance, in English, a basic sentence follows a Subject-Verb-Object (SVO)
structure, as in “The cat (subject) eats (verb) fish (object).”
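To make this concrete, part-of-speech tagging is a common first step toward analyzing syntax. The snippet below is a minimal sketch using the NLTK library to tag the example sentence above; it assumes NLTK is installed and that its tokenizer and tagger resources can be downloaded (resource names may vary slightly across NLTK versions).
```python
# Minimal sketch: part-of-speech tagging with NLTK (assumes the nltk package
# is installed; resource names may differ across NLTK versions).
import nltk

nltk.download("punkt", quiet=True)                        # tokenizer data
nltk.download("averaged_perceptron_tagger", quiet=True)   # POS tagger data

tokens = nltk.word_tokenize("The cat eats fish")
print(nltk.pos_tag(tokens))
# Expected output along the lines of:
# [('The', 'DT'), ('cat', 'NN'), ('eats', 'VBZ'), ('fish', 'NN')]
```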
1.2.2 Semantics
While syntax focuses on structure, semantics is concerned with meaning. It seeks to capture
the meaning of words, phrases, and sentences, allowing a machine to understand the context
in which language is used. One major challenge in NLP is that words often have multiple
meanings, known as polysemy. For example, the word “bank” can refer to the edge of a river
or a financial institution, depending on the context.
In traditional NLP systems, meaning is often captured through techniques like Word Sense
Disambiguation (WSD), which involves identifying the correct meaning of a word based on
its context. However, modern NLP systems use deep learning models to represent word
meanings more effectively. These models generate word embeddings, such as Word2Vec or
GloVe, which represent words as continuous vectors in a high-dimensional space. This allows
words with similar meanings to be placed closer together in this vector space, thereby
capturing semantic similarity. More advanced models, like BERT (Bidirectional Encoder
Representations from Transformers), capture contextualized word meanings by analyzing
entire sentences, rather than words in isolation (Devlin et al., 2019).
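As a simple illustration of word embeddings, the sketch below trains a tiny Word2Vec model with the gensim library on a toy corpus. The corpus and hyperparameters are illustrative assumptions; a real model would be trained on far more text.
```python
# Minimal sketch: learning word embeddings with gensim's Word2Vec on a toy
# corpus (the corpus and hyperparameters are illustrative assumptions).
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "eats", "fish"],
    ["the", "dog", "eats", "meat"],
    ["the", "cat", "chases", "the", "dog"],
    ["a", "dog", "chases", "a", "cat"],
]

# vector_size sets the dimensionality of the embedding space;
# window sets how many neighbouring words count as context.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=100)

# Words used in similar contexts end up close together in the vector space
# (results on such a tiny corpus are necessarily noisy).
print(model.wv.most_similar("cat", topn=2))
```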
1.2.3 Pragmatics
Pragmatics goes beyond syntax and semantics by focusing on the use of language in different
contexts. Human language is highly dependent on context, which includes the speaker's
intentions, the surrounding conversation, and even the social and cultural background.
Understanding pragmatics is critical for machines to engage in more natural conversations.
For example, the sentence “Can you pass the salt?” is a request for action in a dining setting,
despite being framed as a yes/no question. Pragmatic understanding would allow an NLP
system, such as a conversational agent, to respond appropriately to such requests. Tasks like
coreference resolution, which determines when different words refer to the same entity (e.g.,
“John” and “he”), are essential for pragmatics. Additionally, tasks such as sentiment
analysis, which gauges the emotional tone of a text, and speech act recognition, which
identifies the speaker’s intent (e.g., question, command, statement), fall under the realm of
pragmatics.
These linguistic components also give rise to persistent challenges. One is polysemy, where words carry multiple meanings depending on the context. Another is context dependency, since understanding a sentence often
requires knowledge of the broader discourse. For instance, in a conversation about weather,
the pronoun "it" in "It is raining" refers to the weather, but in a different context, "it" could
mean something entirely different. NLP systems need to be designed to capture such nuances
to avoid misinterpretation (Daniel Jurafsky & James H. Martin, 2018).
Together, natural language understanding (NLU) and natural language generation (NLG) form the foundation of many modern applications such as chatbots,
automated translation systems, and virtual assistants. They are fundamental in making
machines not just process language but also interact with humans in a meaningful and
coherent way (Devlin et al., 2019).
CHAPTER TWO
Data preprocessing: Before a model processes text for a specific task, the text often needs to
be pre-processed to improve model performance or to turn words and characters into a format
the model can understand. Data-centric AI is a growing movement that prioritizes data preprocessing. Common techniques include tokenization, lowercasing, stop-word removal, and stemming or lemmatization.
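The following sketch shows a few of these preprocessing steps using NLTK; the sample sentence is an illustrative assumption, and the stop-word list is assumed to be downloadable.
```python
# Minimal sketch of common preprocessing steps: lowercasing, tokenization,
# stop-word removal, and stemming (assumes nltk is installed; the stop-word
# list is downloaded on first use).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())           # lowercase + crude tokenization
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]    # drop common stop words
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]               # reduce words to their stems

print(preprocess("The cats are running quickly around the old house!"))
# e.g. ['cat', 'run', 'quickli', 'around', 'old', 'hous']
```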
2.2.1 Classical Machine Learning NLP Techniques:
Naive Bayes is a supervised classification algorithm that finds the conditional probability distribution P(label | text) using the Bayes formula

P(label | text) = P(text | label) × P(label) / P(text)

and predicts the label for which this probability is highest. The naive assumption in the Naive Bayes model is that the individual words are conditionally independent given the label, so P(text | label) is the product of the per-word probabilities P(word | label).
Decision trees are supervised classification models that split the dataset based on different
features to maximize information gain in those splits.
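As a hedged sketch of how these two classical algorithms are applied to text, the example below uses scikit-learn with a tiny made-up dataset; the texts, labels, and parameter choices are illustrative assumptions.
```python
# Minimal sketch: the two classical classifiers above applied to a toy
# text-classification task with scikit-learn (the toy data is an assumption).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

texts = ["great movie, loved it", "terrible plot and bad acting",
         "wonderful performance", "awful, a waste of time"]
labels = ["positive", "negative", "positive", "negative"]

# Turn each document into a bag-of-words count vector.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Naive Bayes: combines per-word conditional probabilities under the
# independence assumption described above.
nb = MultinomialNB().fit(X, labels)

# Decision tree: splits on word-count features to maximize information gain.
tree = DecisionTreeClassifier(criterion="entropy").fit(X, labels)

test = vectorizer.transform(["loved the acting"])
print(nb.predict(test), tree.predict(test))
```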
2.2.2 Deep Learning NLP Techniques:
Convolutional Neural Network (CNN): The idea of using a CNN to classify text was first
presented in the paper “Convolutional Neural Networks for Sentence Classification” by Yoon
Kim. The central intuition is to see a document as an image. However, instead of pixels, the
input is sentences or documents represented as a matrix of words.
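A minimal sketch of such a CNN text classifier is shown below in PyTorch; the vocabulary size, embedding dimension, and filter settings are illustrative assumptions rather than the exact settings from Kim's paper.
```python
# Minimal sketch of a Kim-style CNN text classifier in PyTorch (vocabulary
# size, embedding size, and filter settings are illustrative assumptions).
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, num_classes=2,
                 kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per kernel size, sliding over word positions.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Convolve, apply ReLU, then max-pool over the sequence dimension.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))       # (batch, num_classes)

model = TextCNN()
dummy_batch = torch.randint(0, 10000, (8, 50))   # 8 "documents" of 50 token ids
print(model(dummy_batch).shape)                  # torch.Size([8, 2])
```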
Recurrent Neural Network (RNN): Many deep learning techniques for text classification process nearby words using n-grams or a sliding window, as CNNs do. Such models can treat “New York” as a single unit, but they cannot capture the context provided by a longer text sequence: they do not learn the sequential structure of the data, in which every word depends on the previous word or on a word in a previous sentence. RNNs remember
previous information using hidden states and connect it to the current task. The architectures
known as Gated Recurrent Unit (GRU) and long short-term memory (LSTM) are types of
RNNs designed to remember information for an extended period. Moreover, the bidirectional
LSTM/GRU keeps contextual information in both directions, which is helpful in text
classification. RNNs have also been used to generate mathematical proofs and translate
human thoughts into words.
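The sketch below shows a bidirectional LSTM classifier of the kind described here, written in PyTorch; the layer sizes and the choice to classify from the final time step are illustrative assumptions.
```python
# Minimal sketch of a bidirectional LSTM text classifier in PyTorch
# (vocabulary size and layer sizes are illustrative assumptions).
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # The hidden state carries information along the sequence in both directions.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)  # *2 for the two directions

    def forward(self, token_ids):                 # (batch, seq_len)
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(embedded)          # (batch, seq_len, hidden_dim * 2)
        # Use the representation of the last time step for classification.
        return self.fc(outputs[:, -1, :])         # (batch, num_classes)

model = BiLSTMClassifier()
dummy_batch = torch.randint(0, 10000, (8, 50))
print(model(dummy_batch).shape)                   # torch.Size([8, 2])
```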
Autoencoders are deep learning encoder-decoders that approximate a mapping from X to X,
i.e., input=output. They first compress the input features into a lower-dimensional
representation (sometimes called a latent code, latent vector, or latent representation) and
learn to reconstruct the input. The representation vector can be used as input to a separate
model, so this technique can be used for dimensionality reduction. Among specialists in many
other fields, geneticists have applied autoencoders to spot mutations associated with diseases
in amino acid sequences.
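Below is a minimal autoencoder sketch in PyTorch illustrating the compress-then-reconstruct idea; the input dimensionality and layer sizes are illustrative assumptions.
```python
# Minimal sketch of an autoencoder in PyTorch: the encoder compresses the
# input to a low-dimensional latent vector and the decoder reconstructs the
# input from it (sizes are illustrative assumptions).
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        latent = self.encoder(x)             # compressed (latent) representation
        return self.decoder(latent), latent

model = Autoencoder()
x = torch.rand(16, 784)                      # a batch of 16 feature vectors
reconstruction, latent = model(x)
# Training would minimize the reconstruction error, e.g. mean squared error.
loss = nn.functional.mse_loss(reconstruction, x)
print(latent.shape, loss.item())             # torch.Size([16, 32]) and a scalar loss
```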
2.3 Important Natural Language Processing (NLP) Models
Over the years, many NLP models have made waves within the AI community, and some
have even made headlines in the mainstream news. The most famous of these have been
chatbots and language models.
These models share the ability to analyze both structured and unstructured data, such as speech, text messages, and social media posts.
Eliza was developed in the mid-1960s to try to pass the Turing Test; that is, to fool
people into thinking they were conversing with another human being rather than a
machine. Eliza used pattern matching and a series of rules without encoding the
context of the language.
Tay was a chatbot that Microsoft launched in 2016. It was supposed to tweet like
a teen and learn from conversations with real users on Twitter. The bot adopted
phrases from users who tweeted sexist and racist comments, and Microsoft
deactivated it not long afterward. Tay illustrates some points made by the “Stochastic
Parrots” paper, particularly the danger of not debiasing data.
BERT and his Muppet friends: Many deep learning models for NLP are named after Muppet characters, including ELMo, BERT, BigBIRD, ERNIE, Kermit, Grover, RoBERTa, and Rosita.
Most of these models are good at providing contextual embeddings and enhanced
knowledge representation.
Generative Pre-Trained Transformer 3 (GPT-3) is a 175 billion-parameter model that
can write original prose with human-equivalent fluency in response to an input
prompt. The model is based on the transformer architecture. The previous version,
GPT-2, is open source. Microsoft acquired an exclusive license to access GPT-3’s
underlying model from its developer OpenAI, but other users can interact with it via
an application programming interface (API). Several groups
including EleutherAI and Meta have released open source interpretations of GPT-3.
Mixture of Experts (MoE): While most deep learning models use the same set of
parameters to process every input, MoE models aim to provide different parameters
for different inputs based on efficient routing algorithms to achieve higher
performance. Switch Transformer is an example of the MoE approach that aims to
reduce communication and computational costs.
Toxicity classification
Toxicity classification is a branch of sentiment analysis where the aim is not just to classify hostile intent but also to
classify particular categories such as threats, insults, obscenities, and hatred towards certain
identities. The input to such a model is text, and the output is generally the probability of
each class of toxicity. Toxicity classification models can be used to moderate and improve
online conversations by silencing offensive comments, detecting hate speech, or scanning
documents for defamation.
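A minimal sketch of such a toxicity classifier appears below, using scikit-learn to predict a probability for each toxicity category. The tiny training set and the two categories (insult, threat) are illustrative assumptions; production systems rely on large labelled corpora and deep models.
```python
# Minimal sketch: multi-label toxicity classification where the model outputs
# a probability per toxicity category (the toy training set is an assumption).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

texts = ["you are an idiot", "I will hurt you", "have a nice day",
         "what a stupid idea", "I'm going to find you", "thanks for the help"]
# One binary indicator per category: [insult, threat]
labels = np.array([[1, 0], [0, 1], [0, 0], [1, 0], [0, 1], [0, 0]])

model = make_pipeline(TfidfVectorizer(),
                      OneVsRestClassifier(LogisticRegression()))
model.fit(texts, labels)

# predict_proba returns one probability per toxicity category for each input.
print(model.predict_proba(["you stupid fool"]))
```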
Grammatical error correction models encode grammatical rules to correct the grammar
within text. This is viewed mainly as a sequence-to-sequence task, where a model is trained
on an ungrammatical sentence as input and a correct sentence as output. Online grammar
checkers like Grammarly and word-processing systems like Microsoft Word use such systems
to provide a better writing experience to their customers. Schools also use them to grade
student essays.
Topic modeling is an unsupervised text-mining task that takes a corpus of documents and
discovers abstract topics within that corpus. The input to a topic model is a collection of
documents, and the output is a list of topics that defines words for each topic as well as
assignment proportions of each topic in a document. Latent Dirichlet Allocation (LDA), one
of the most popular topic modeling techniques, tries to view a document as a collection of
topics and a topic as a collection of words. Topic modeling is being used commercially to
help lawyers find evidence in legal documents.
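The sketch below runs LDA with scikit-learn on a toy corpus to show the inputs and outputs described above; the documents and the choice of two topics are illustrative assumptions.
```python
# Minimal sketch: topic discovery with Latent Dirichlet Allocation in
# scikit-learn (the tiny corpus and the two-topic choice are assumptions).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the court heard evidence from the witness",
    "the judge reviewed the contract and the evidence",
    "the patient received treatment at the hospital",
    "doctors studied the patient symptoms and treatment",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(doc_term_matrix)   # topic proportions per document

# Show the top words that define each discovered topic.
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-4:]]
    print(f"Topic {topic_idx}: {top_words}")

print(doc_topic)   # each row is the topic mixture assigned to that document
```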
Text generation
Text Generation, more formally known as natural language generation (NLG), produces text
that’s similar to human-written text. Such models can be fine-tuned to produce text in
different genres and formats — including tweets, blogs, and even computer code. Text
generation has been performed using Markov processes, LSTMs, BERT, GPT-2, LaMDA,
and other approaches. It’s particularly useful for autocomplete and chatbots.
Autocomplete predicts what word comes next, and autocomplete systems of varying
complexity are used in chat applications like WhatsApp. Google uses autocomplete to
predict search queries. One of the most famous models for autocomplete is GPT-2,
which has been used to write articles, song lyrics, and much more.
Chatbots automate one side of a conversation while a human conversant generally
supplies the other side. They can be divided into the following two categories:
Database query: We have a database of questions and answers, and we would
like a user to query it using natural language.
Conversation generation: These chatbots can simulate dialogue with a human
partner. Some are capable of engaging in wide-ranging conversations. A high-
profile example is Google’s LaMDA, which provided such human-like
answers to questions that one of its developers was convinced that it had
feelings.
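As a hedged example of the text generation described above, the sketch below uses the open-source GPT-2 model through the Hugging Face transformers pipeline. It assumes the transformers library is installed and the gpt2 checkpoint can be downloaded; the prompt is an illustrative assumption.
```python
# Minimal sketch: text generation / autocomplete with the open-source GPT-2
# model via the Hugging Face transformers pipeline (assumes transformers is
# installed and the "gpt2" checkpoint can be downloaded).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Natural language processing is"
outputs = generator(prompt, max_length=30, num_return_sequences=2, do_sample=True)

for out in outputs:
    print(out["generated_text"])
```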
Information retrieval finds the documents that are most relevant to a query. This is a
problem every search and recommendation system faces. The goal is not to answer a
particular query but to retrieve, from a collection of documents that may number in the
millions, a set that is most relevant to the query. Document retrieval systems mainly execute
two processes: indexing and matching. In most modern systems, indexing is done by a vector
space model through Two-Tower Networks, while matching is done using similarity or
distance scores. Google recently integrated its search function with a multimodal information
retrieval model that works with text, image, and video data.
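The sketch below illustrates the two retrieval steps, indexing and matching, using a TF-IDF vector space and cosine similarity with scikit-learn. This is a simplified stand-in for the neural two-tower systems mentioned above, and the documents and query are illustrative assumptions.
```python
# Minimal sketch of document retrieval: documents and the query are embedded
# in a TF-IDF vector space (indexing) and ranked by cosine similarity
# (matching). The corpus and query are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "how to train a neural network for text classification",
    "recipes for a quick weeknight dinner",
    "transformer models for machine translation",
    "best hiking trails near the coast",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)      # indexing step

query = "neural models for translation"
query_vector = vectorizer.transform([query])           # embed the query

scores = cosine_similarity(query_vector, doc_vectors)[0]   # matching step
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```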
CHAPTER THREE
CONCLUSION
Natural Language Processing (NLP) makes many tasks easier, but it still demands human involvement. People and the industry fear that NLP will start a trend of job displacement, and this is true to a certain extent, but NLP certainly cannot function the way it does without human input. Identifying and fixing the loopholes or bugs in a machine remains the task of the human who is handling it. Notwithstanding the concerns NLP may raise in the arena of jobs, right now it is the knight in shining armor of the industry.
After exploring the foundational aspects of Natural Language Processing (NLP), it is clear
that NLP is a critical component in the development of intelligent systems capable of
understanding and generating human language. The first chapter provided an introduction to
NLP, outlining its core concepts such as syntax, semantics, and pragmatics, and highlighting
the field's historical evolution from rule-based systems to modern machine learning
techniques.
We also delved deeper into the theories and techniques that drive NLP, focusing on the
algorithms, models, and methodologies that allow machines to interpret language
meaningfully. This chapter examined key approaches like statistical models, deep learning,
and neural network architectures, including state-of-the-art models such as BERT and GPT.
We explored their applications in tasks like machine translation, text classification, and
sentiment analysis, and how these models have revolutionized the way machines process
language.
NLP as a Bridge Between Humans and Machines: NLP enables more intuitive and efficient
human-computer interaction, making technology more accessible by allowing users to
communicate in natural language.
Rapid Evolution and Advancements: NLP has evolved from simple rule-based systems to
advanced deep learning models that understand the complexities of language. With
developments like the Transformer architecture, the field continues to push the boundaries of
language understanding.
Broad Applications Across Industries: NLP has transformed industries such as healthcare,
finance, customer service, and education. Its ability to process and analyze large volumes of
unstructured text has made it indispensable in automating tasks and generating insights.