LO1. Introduction To NLP
Natural Language Processing
What is NLP?
By “natural language” we mean a language that is used for everyday
communication by humans; languages such as English, Hindi, or Portuguese.
It is the area of research and applications that explores how computers can be
used to understand and manipulate natural text or speech to do useful tasks.
https://www.sas.com/en_id/insights/analytics/what-is-natural-language-processing-nlp.html
Timeline: a brief history of NLP, from the 1960s and 1980s to the 2000s and 2020s.
https://spotintelligence.com/2023/06/23/history-natural-language-processing/
https://www.dataversity.net/a-brief-history-of-natural-language-processing-nlp/
Applications of NLP:
◦ Voice Assistants (e.g., Siri, Alexa, Google Assistant)
◦ Microsoft Azure Text Analytics
◦ Autocorrect on Smartphones
◦ Email Spam Filters
◦ Smart Reply in Messaging Apps
Levels of language analysis: Phonetics & Phonology, Morphology, Syntax, Semantics, Discourse, Pragmatics
Phonetics & Phonology
This level deals with the interpretation of speech sounds within and across words. It is the basic level in speech recognition.
Six ways to pronounce t in English: top, stop, pot, little, kitten, hunter.
In phonetics we can see infinite realisations: for example, every time you say a ‘p’ it will be slightly different from the other times you have said it.
In phonology, however, all of these productions count as the same sound within the language’s phoneme inventory; even though every ‘p’ is produced slightly differently, the underlying sound is the same.
This highlights a key difference between phonetics and phonology: even though no two ‘p’s are identical, they represent the same sound in the language.
https://www.sheffield.ac.uk/linguistics/home/all-about-linguistics/about-website/branches-linguistics/phonology
Morphology
Morphology is the first crucial step in NLP.
This level deals with the structure of words and their componential nature; words are composed of morphemes – the smallest units of meaning.
For example, the word preregistration can be
morphologically analyzed into three separate components
(morphemes):
◦ the prefix pre,
◦ the root registra,
◦ the suffix tion.
For example, adding the suffix –ed to a verb, conveys that the action of the verb took
place in the past.
Morphology is mainly useful for identifying the parts of speech in a sentence and words
that interact together.
It describes a set of relations between words’ surface forms and lexical forms.
A word’s surface form is its graphical (in written text) or spoken form.
◦ Example: Disconnecting -> connect
◦ Example (Arabic): سيعلمون (“they will know”) -> علم (“know”)
Lexical Analysis
Lexemes and lemma:
◦ Lexeme refers to the set of all the forms that have the same meaning, and
◦ lemma refers to the particular form that is chosen by convention to represent the lexeme (the canonical form, dictionary form, or citation form; the headword of the set of words).
Example 1: run, runs, ran and running are forms of the same lexeme, with run as
the lemma
Lexical analysis is the analysis of a word into its lemma (also known as its dictionary form) and its grammatical description.
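As an illustration, here is a minimal sketch of lemmatization with NLTK's WordNet lemmatizer (assuming the WordNet data has been downloaded); it maps the surface forms of the lexeme run to the lemma run:

```python
# Minimal lemmatization sketch with NLTK's WordNet lemmatizer
# (assumes the 'wordnet' corpus has been downloaded).
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

# All of these surface forms reduce to the lemma "run" when treated as verbs.
for form in ["run", "runs", "ran", "running"]:
    print(form, "->", lemmatizer.lemmatize(form, pos="v"))
```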
Syntax
Part-of-speech tagging: label each word with a unique tag that indicates its syntactic role. Examples of tags: noun, verb, article, preposition, … For instance, it is part of English syntax that a determiner such as “the” will come before a noun.
This level focuses on analyzing the words in a sentence to uncover its grammatical structure.
The output of this level of processing is a representation of the sentence that reveals
the structural dependency relationships between the words.
There are various grammars that can be utilized, and which will, in turn, impact the
choice of a parser.
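A minimal sketch of part-of-speech tagging with NLTK's default tagger (assuming the punkt and averaged_perceptron_tagger resources are installed; the exact tags depend on the tagger and tag set used):

```python
# Minimal part-of-speech tagging sketch with NLTK
# (assumes the 'punkt' and 'averaged_perceptron_tagger' resources are available).
import nltk

tokens = nltk.word_tokenize("The animal is in the pen")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('animal', 'NN'), ('is', 'VBZ'), ('in', 'IN'), ('the', 'DT'), ('pen', 'NN')]
# A determiner (DT) preceding a noun (NN) reflects the English syntax rule noted above.
```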
Semantics
This process gathers information vital to the pragmatic analysis to determine which meaning
was intended by the user.
Example: Semantic processing determines the differences between such sentences as
The animal is in the pen
and
The ink is in the pen
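As a sketch of this kind of word-sense disambiguation, NLTK's simplified Lesk algorithm can be run on the two pen sentences (assuming the WordNet data is available; such a simple dictionary-overlap heuristic will not always pick the intended sense from so little context):

```python
# Minimal word-sense disambiguation sketch using NLTK's simplified Lesk algorithm
# (assumes the 'wordnet' and 'punkt' resources are available).
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

for sentence in ["The animal is in the pen", "The ink is in the pen"]:
    sense = lesk(word_tokenize(sentence), "pen")
    print(sentence, "->", sense, "-", sense.definition() if sense else "no sense found")
```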
Discourse
The entities involved in the sentence must either have been introduced
explicitly or they must be related to entities that were introduced previously.
Pragmatics
Pragmatic analysis is the sequence of steps taken to expose the overall purpose of the statement being analyzed.
Types of ambiguity include:
◦ Lexical ambiguity
◦ Syntactic ambiguity
◦ Semantic ambiguity
◦ Referential ambiguity
◦ Local ambiguity
World Models are needed for a good disambiguation system, to allow for
the selection of the most practical meaning of a given sentence.
This world model needs to be as broad as the scenarios the system would
encounter in its normal operation.
Compiling lexicons: Morphological analysis is crucial for compiling dictionaries and lexicons
Stemming for IR: helps in retrieving documents containing different inflected forms of the
query terms.
Word Formation
There are many ways to combine morphemes to create
words. Four of these methods are common and play
important roles in speech and language processing:
◦ Inflection (Inflectional morphology)
◦ Derivation (Derivational Morphology)
◦ Compounding,
◦ Cliticization.
For example, the noun doghouse is the concatenation of the morpheme dog with the
morpheme house.
A clitic is a morpheme that acts syntactically like a word but is reduced in form and attached to another word.
For example, the English morpheme ’ve in the word I’ve is a clitic, as is the French
definite article l’ in the word l’opera.
Example from Arabic: ... ، أذهبت للمدرسة، تاهلل... ، السيما، اينما، ريثما،ربما
Tokenization
Tokenization is a fundamental step in processing textual data preceding the tasks of
information retrieval, text mining, and NLP.
Tokenization is typically the first task in a pipeline of natural language processing tools. It
usually involves two sub-tasks, which are often performed at the same time:
◦ separating punctuation symbols from words;
◦ detecting sentence boundaries.
Tokenization is closely related to morphological analysis. It is the task of separating out words from running text.
The function of a tokenizer is to split a running text into tokens, so that they can be fed into a morphological transducer or POS tagger for further processing.
The tokenizer is responsible for defining word boundaries and demarcating clitics, multiword expressions, named entities, abbreviations and numbers.
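A minimal sketch of such a tokenizer, using NLTK's sentence and word tokenizers (assuming the punkt models are available); note how punctuation and clitics such as 's and 've are separated from the words they attach to:

```python
# Minimal tokenization sketch with NLTK (assumes the 'punkt' models are available).
from nltk.tokenize import sent_tokenize, word_tokenize

text = "I've visited Finland's capital. It was great!"
print(sent_tokenize(text))   # sentence boundary detection
print(word_tokenize(text))   # e.g. ['I', "'ve", 'visited', 'Finland', "'s", 'capital', '.', ...]
```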
Tokenization
In the output of this process, white space is typically used as the separator between tokens, and sentences are usually separated by new lines.
Problem: many punctuation symbols are ambiguous in their use.
Example:
◦ a hyphen in a football score, in a range of numbers, in a compound
word, or to divide a word at the end of line.
◦ Full stop: in abbreviations and the end of a sentence.
Issues in tokenization:
◦ Finland’s capital -> Finland? Finlands? Finland’s?
◦ New York / San Francisco: one token or two? How do you decide it is
one token?
◦ USA and U.S.A
◦ Score (sport) 3-4 / Range of values 1-10
◦ In Arabic, other problems occur, example: وسيأكولون
Example (stop word removal):
◦ Before: If you want to apply for a scholarship abroad and in a specific university you require IELTS and TOEFL
◦ After: If want apply scholarship abroad university require IELTS TOEFL
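A minimal sketch of this kind of stop-word filtering with NLTK's English stop-word list (assuming the stopwords corpus has been downloaded; the exact output depends on the stop-word list used):

```python
# Minimal stop-word removal sketch with NLTK
# (assumes the 'stopwords' corpus and 'punkt' models have been downloaded).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
text = ("If you want to apply for a scholarship abroad and in a specific "
        "university you require IELTS and TOEFL")
filtered = [t for t in word_tokenize(text) if t.lower() not in stop_words]
print(filtered)
```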
For example, the words ‘run’, ‘running’, and ‘runs’ all convert into the word ‘run’ after stemming is applied.
One crucial point about stem words is that they need not be meaningful. For example, the stem of the word ‘traditional’ is ‘tradi’, which has no meaning.
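A minimal sketch of stemming with NLTK's Porter stemmer; as noted above, the stems it produces are truncated forms and need not be meaningful dictionary words:

```python
# Minimal stemming sketch with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["run", "running", "runs", "traditional"]:
    print(word, "->", stemmer.stem(word))
```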
Jurafsky, D. and Martin, J.H., Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
https://www.projectpro.io/article/stemming-in-nlp/780
Habash, N.Y., 2022. Introduction to Arabic natural language processing. Springer Nature.
Natural Language Processing, Prof. Arafat Awajan
Exercises: أنلزمكموها،كالمهندسون
Non-derivative words and stop words can receive affixes and clitics.
Other challenges:
◦ Some letters may be dropped when a word combines with clitics in certain circumstances: للرجال، الرجال
◦ Different word decompositions are possible: كامل
Tokenization
There is not a single possible or obvious tokenization scheme: a tokenization scheme is an analytical tool devised by the researcher.
Different tokenizations imply different amounts of information and, in turn, influence the options for linguistic generalization.
For a word to be analyzed, each of its parts must have an entry in the corresponding lexicon, assuming that both a null prefix and a null suffix are possible.
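A minimal, hypothetical sketch of this idea: toy prefix, stem, and suffix lexicons (with the empty string standing for the null affixes) and a function that accepts a word only if it can be split into parts that each appear in their lexicon. The word lists are illustrative, not taken from the slides:

```python
# Hypothetical lexicon-based word analysis: a word is accepted only if it can be
# split into prefix + stem + suffix, each of which appears in its lexicon.
# The empty string "" represents the null prefix / null suffix.
prefixes = {"", "dis", "pre", "re"}
stems = {"connect", "register", "play"}
suffixes = {"", "ing", "ed", "s"}

def analyze(word):
    """Return all (prefix, stem, suffix) splits licensed by the lexicons."""
    analyses = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            pre, stem, suf = word[:i], word[i:j], word[j:]
            if pre in prefixes and stem in stems and suf in suffixes:
                analyses.append((pre, stem, suf))
    return analyses

print(analyze("disconnecting"))  # [('dis', 'connect', 'ing')]
print(analyze("plays"))          # [('', 'play', 's')]
```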
Privacy
◦ NLP systems often rely on large amounts of personal data, such as text messages, emails, and social
media posts, to provide insights and make predictions.
◦ This data can be sensitive and personal, and individuals may not be aware that it is being collected or
used by NLP systems.
◦ To protect privacy, it is crucial to ensure that NLP systems are designed with privacy in mind. This
includes using data minimization techniques to reduce the amount of personal data collected, providing
clear and transparent information about how data is being used, and implementing appropriate security
measures to protect data from unauthorized access or theft.
https://glair.ai/post/bias-in-natural-language-processing-nlp
Bias from annotations: the labels chosen for training and the procedure used for annotating the labels introduce annotation bias. Selection bias is introduced by the samples chosen for training or testing an NLP model.
Hovy, D. and Prabhumoye, S., 2021. Five sources of bias in natural language
processing. Language and Linguistics Compass, 15(8), p.e12432.
https://compass.onlinelibrary.wiley.com/doi/full/10.1111/lnc3.12432
NLP systems reflect biases in the language data used for training them.
Models trained on these data sets treat language as if it resembles this restricted training data, creating demographic bias.
The results are ageist, racist, or sexist models that are biased against the respective user groups. This is the issue of selection bias, which is rooted in data.
When choosing a text data set to work with, we are also making decisions about the demographic groups represented in the data.
If our data set is dominated by the ‘dialect’ of a specific demographic group, we should not be surprised that our models have problems
understanding others.
Most data sets have some built-in bias, and in many cases, it is benign.
It becomes problematic when this bias negatively affects certain groups or disproportionately advantages others.
On biased data sets, statistical models overfit to the presence of specific linguistic signals that are particular to the dominant group. As a
result, the model will work less well for other groups, that is, it excludes demographic groups.
Annotation can introduce bias in various forms through a mismatch of the annotator population with the data. This is the issue of label bias.
Label and selection bias can—and most often do—interact, so it can be challenging to distinguish them. It does, however, underscore how important it is to address them jointly. There are
several ways in which annotations introduce bias.
In its simplest form, bias arises because annotators are distracted, uninterested, or lazy about the annotation task. As a result, they choose the ‘wrong’ labels.
More problematic is label bias from informed and well-meaning annotators that systematically disagree.
For example, the term ‘social media’ can be validly analysed as either a noun phrase composed of an adjective and a noun, or a noun compound, composed of two
nouns.
◦ Which label an annotator chooses depends on their interpretation of how lexicalized the term ‘social media’ is.
◦ If they perceive it as fully lexicalized, they will choose a noun compound.
◦ If they believe the process is still ongoing, that is, the phrase is analytical, they will choose an ‘adjective plus noun’ construct.
◦ Two annotators with these opposing views will systematically label ‘social’ as an adjective or a noun, respectively. While we can spot the disagreement, we cannot
discount either of them as wrong or malicious.
Finally, label bias can result from a mismatch between authors' and annotators' linguistic and social norms.
For example, annotators may rate the utterances of different ethnic groups differently and mistake innocuous banter for hate speech because they are unfamiliar with the communication norms of the original speakers.
Even balanced, well-labelled data sets contain bias: the most common text input representations in NLP systems, word embeddings, have been shown to pick up racial and gender biases in the training data.
For example, ‘woman’ is associated with ‘homemaker’ in the same way ‘man’ is associated with ‘programmer’.
There has been some justified scepticism over whether these analogy tasks are the best way to evaluate
embedding models, but there is plenty of evidence that (1) embeddings do capture societal attitudes, and that
(2) these societal biases are resistant to many correction methods. This is the issue of semantic bias.
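A minimal sketch of probing such analogy-style associations in pretrained embeddings with gensim (assuming the publicly downloadable word2vec-google-news-300 vectors; the neighbours returned depend on the embedding model, and the analogy method itself is, as noted above, a debated evaluation):

```python
# Minimal sketch of probing analogy-style associations in pretrained word embeddings
# (assumes gensim and the downloadable 'word2vec-google-news-300' vectors).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

# "man is to programmer as woman is to ...?"
print(vectors.most_similar(positive=["woman", "programmer"], negative=["man"], topn=5))
```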
These biases hold not just for word embeddings but also for the contextual representations of big pre-trained
language models that are now widely used in different NLP systems.
As they are pre-trained on almost the entire available internet, they are even more prone to societal biases.
Simply using ‘better’ training data is not a feasible long-term solution: languages evolve continuously, so even a representative sample
can only capture a snapshot—at best a short-lived solution.
Systems trained on biased data exacerbate that bias even further when applied to new data.
Sentiment analysis tools pick up on societal prejudices, leading to different outcomes for different demographic groups. For example, by
merely changing the gender of a pronoun, the systems classified the sentence differently.
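A minimal sketch of this kind of probe: run an off-the-shelf sentiment classifier on sentences that differ only in the pronoun (assuming the transformers library and its default sentiment-analysis pipeline; whether and how strongly the scores differ depends on the model):

```python
# Minimal sketch: probe a sentiment classifier by swapping only the pronoun
# (uses the default model of the transformers 'sentiment-analysis' pipeline).
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
for sentence in ["He is a nurse.", "She is a nurse."]:
    print(sentence, classifier(sentence))
```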
Machine translation systems changed the perceived user demographics to make samples sound older and more male in translation. This
issue is bias overamplification, which is rooted in the models themselves.
Models can overamplify existing biases, contributing to incorrect outcomes even when the answers are technically correct.
The choice of loss objective in model training can unintentionally reinforce biases, causing models to provide correct answers for the
wrong reasons.
Machine learning models often provide predictions even when uncertain or unable to offer accurate responses, potentially resulting in
biased or misleading outcomes.
Models should ideally report uncertainty rather than delivering potentially biased or incorrect results.
Unsupervised models produce word embeddings, which are numerical depictions of text data.
◦ An unsupervised model searches through a lot of text and generates vectors to represent the words in
the text.
◦ Unfortunately, our models are exposed to more than just semantic information because we look for
hidden patterns and use them to build embeddings (which automatically organize data).
◦ Models are subjected to biases similar to those seen in human culture while digesting the text. The biases then spread to our supervised learning models, which would need unbiased data in order to avoid producing biased outputs.
• One of the main reasons that NLP algorithms are biased is that the original dataset to train
the model is unbalanced.
• For example, there could be more data associating “doctors” with “male”, and so the
resultant model would have more probability to predict “doctors” as “male”.
• Therefore, one of the best ways to eliminate bias in NLP is to solve the problem of
unbalanced data. There are many ways to achieve so.
• For instance, one can utilize data augmentation algorithms such as SMOTE to synthesize more data for the minority group in the dataset (see the sketch after this list).
• Also, if the total dataset is very large, one can choose to remove some data from the majority group to make the dataset more balanced.
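A minimal sketch of rebalancing with SMOTE from the imbalanced-learn package (the toy feature matrix below stands in for text that has already been converted to numeric vectors, since SMOTE operates on features rather than raw text):

```python
# Minimal SMOTE rebalancing sketch with imbalanced-learn.
# The synthetic dataset stands in for already-vectorized text features.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_resampled))
```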
• Another method employs the transfer-learning concept to fine-tune an unbiased model on a more biased dataset.
• Such an approach enables the model to avoid learning biases from the training data while still being sufficiently trained to tackle the target tasks.
• A diverse AI and ethics audit team could be a crucial part in the development of machine
learning technologies that are beneficial to societies.
• By having a diverse audit group review the trained NLP models, participants from different backgrounds could help consider the models from multiple perspectives and help the development team spot potential biases against minority groups.
• Additionally, the diverse development team could offer insights through their lived
experiences to suggest how to modify the model.