
Introduction to Natural Language Processing
NATURAL LANGUAGE PROCESSING

AL HUSSEIN TECHNICAL UNIVERSITY (HTU)
DATA SCIENCE AND ARTIFICIAL INTELLIGENCE DEPARTMENT
Bird, S., Klein, E. and Loper, E., 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

What is NLP?
By “natural language” we mean a language that is used for everyday communication by humans; languages such as English, Hindi, or Portuguese.

In contrast to artificial languages such as programming languages and mathematical notations, natural languages have evolved as they pass from generation to generation and are hard to pin down with explicit rules.

Natural Language Processing (NLP) is the scientific discipline concerned with making natural language accessible to machines.

It is the area of research and applications that explores how computers can be used to understand and manipulate natural text or speech to do useful tasks.

Natural Language Processing is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, for the purpose of achieving human-like language processing for a range of tasks or applications.

Why NLP?
Large volumes of textual data
◦ Natural language processing helps computers communicate with humans in their own language and scales other language-related
tasks.
◦ For example, NLP makes it possible for computers to read text, hear speech, interpret it, measure sentiment and determine which
parts are important.
◦ Today’s machines can analyze more language-based data than humans, without fatigue and in a consistent, unbiased way.
◦ Considering the staggering amount of unstructured data that’s generated every day, from medical records to social media,
automation will be critical to fully analyze text and speech data efficiently.

Structuring a highly unstructured data source


◦ Human language is astoundingly complex and diverse. We express ourselves in infinite ways, both verbally and in writing.
◦ Not only are there hundreds of languages and dialects, but within each language is a unique set of grammar and syntax rules, terms
and slang. When we write, we often misspell or abbreviate words, or omit punctuation. When we speak, we have regional accents,
and we mumble, stutter and borrow terms from other languages.
◦ While supervised and unsupervised learning, and specifically deep learning, are now widely used for modeling human language,
there’s also a need for syntactic and semantic understanding and domain expertise that are not necessarily present in these machine
learning approaches.
◦ NLP is important because it helps resolve ambiguity in language and adds useful numeric structure to the data for many downstream
applications, such as speech recognition or text analytics.

https://www.sas.com/en_id/insights/analytics/what-is-natural-language-processing-nlp.html

The History of NLP
1950s, Foundations of Inquiry: The decade when pioneers like Alan Turing (1950) and IBM (1954) laid the conceptual groundwork for artificial intelligence and machine translation.

1960s, Early Explorations: A period marked by early experiments in machine translation and the development of the LISP (1960) programming language, setting the stage for AI research.

1970s, Rule-Based Ascent: An era characterized by the exploration of rule-based approaches to NLP, with a focus on creating linguistic rules for language understanding.

1980s, Statistical Awakening: The rise of statistical methods in NLP, with a shift toward using data-driven approaches and the emergence of statistical language models, including n-gram models.

https://spotintelligence.com/2023/06/23/history-natural-language-processing/
https://www.dataversity.net/a-brief-history-of-natural-language-processing-nlp/

The History of NLP
1990s, Corpus and Collaboration: A decade marked by the release of large annotated corpora, such as the Penn Treebank, and increased collaboration in the NLP research community. Statistical language models, including n-gram models, become widely used.

2000s, Machine Learning Surge: The proliferation of machine learning techniques, including the introduction of the MaxEnt model (2003) and the increasing influence of statistical language models.

2010s, Deep Learning Revolution: A transformative decade with the ascent of deep learning techniques, including the introduction of Word2Vec (2013), GloVe (2014), FastText (2016), attention mechanisms (2014), and the dominance of neural networks. Bidirectional Encoder Representations from Transformers (BERT) is released, achieving state-of-the-art results in various NLP tasks (2018).

2020s, Transformer Triumph: A period of continued advancements in transformer-based models, exemplified by the release of GPT-3 (2020) and BERT, showcasing unprecedented language understanding and generation capabilities.

https://spotintelligence.com/2023/06/23/history-natural-language-processing/
https://www.dataversity.net/a-brief-history-of-natural-language-processing-nlp/

Techniques
Part-of-Speech Tagging
Named Entity Recognition (NER)
Text Classification
Sentiment Analysis
Sequence-to-Sequence Models
Large Language Models

Applications
ChatGPT
Google Translate
Grammarly
Google Search Engine
Voice Assistants (e.g., Siri, Alexa, Google Assistant)
Microsoft Azure Text Analytics
Autocorrect on Smartphones
Email Spam Filters
Smart Reply in Messaging Apps

NLP Categories of Knowledge

Levels (phases) of natural language processing

Phonetics & Phonology -> Morphology -> Syntax -> Semantics -> Discourse -> Pragmatics

Natural Language Processing, Prof. Arafat Awajan

Phonetics & Phonology

This level deals with the interpretation of speech sounds within and across words.
This level is the basic one in speech recognition.

Natural Language Processing, Prof. Arafat Awajan

Phonetics
Phonetics: language sounds, how they are physically formed (θ: think, ʒ: vision, ɑ: farm, æ: hat).
It is the study of the actual sounds of the language.

Natural Language Processing, Prof. Arafat Awajan

Phonology
Phonology: systems of discrete sounds, e.g. languages’ syllable structure. It studies which sounds make a difference in a language.

Phonology is the study of the categorical organisation of speech sounds in languages; how speech sounds are organised in the mind and used to convey meaning.

Six ways to pronounce t in English: top, stop, pot, little, kitten, hunter.

Natural Language Processing, Prof. Arafat Awajan


https://www.sheffield.ac.uk/linguistics/home/all-about-linguistics/about-website/branches-linguistics/phonology

Phonetics vs. Phonology
Phonetics is concerned with the physical properties of sounds (actual production and perception of sounds), whereas phonology is concerned with the abstract (the way sounds function in the system of a particular language).

In phonetics we can see infinite realisations; for example, every time you say a ‘p’ it will be slightly different from the other times you have said it.

However, in phonology all productions are the same sound within the language’s phoneme inventory: even though every ‘p’ is produced slightly differently every time, the actual sound is the same.

This highlights a key difference between phonetics and phonology: even though no two ‘p’s are the same, they represent the same sound in the language.

https://www.sheffield.ac.uk/linguistics/home/all-about-linguistics/about-website/branches-linguistics/phonology

Natural Language Processing, Prof. Arafat Awajan

Morphology
Morphology is the first crucial step in NLP.
This level deals with the structure of words and their componential nature: words are composed of morphemes, the smallest units of meaning.
For example, the word preregistration can be
morphologically analyzed into three separate components
(morphemes):
◦ the prefix pre,
◦ the root registra,
◦ the suffix tion.

Morphology
An NLP system should recognize the meaning carried by each morpheme in order to gain and represent meaning -> this requires linguistic resources (a lexicon).

For example, adding the suffix –ed to a verb, conveys that the action of the verb took
place in the past.

Morphology is mainly useful for identifying the parts of speech in a sentence and words
that interact together.

A morphology is a systematic description of words in a natural language.

It describes a set of relations between words’ surface forms and lexical forms.

A word’s surface form is its graphical (in written text) or spoken form.
◦ Example: Disconnecting-> connect
◦ Example: ‫ سيعلمون‬-> ‫علم‬

Natural Language Processing, Prof. Arafat Awajan

Lexical Analysis
At the lexical level, humans, as well as NLP systems, interpret the meaning of
individual words and their components.
The lexical level requires a lexicon (dictionary / glossary).
The lexicon (wordstock) of a language is its vocabulary: the total inventory of morphemes in a given language. It is a book containing an alphabetical arrangement of the words in a language and their definitions.
Lexeme: a meaningful linguistic unit that is an item in the vocabulary of a language.
Dictionary: a reference source in print or electronic form containing words, usually alphabetically arranged, along with information about their forms, pronunciations, functions, etymologies, meanings, syntactical and idiomatic uses, or translations into a different language.

Natural Language Processing, Prof. Arafat Awajan

Natural Language Processing, Prof. Arafat Awajan

Lexical Analysis
Lexemes and lemma:
◦ Lexeme refers to the set of all the forms that have the same meaning, and
◦ lemma refers to the particular form that is chosen by convention to represent the lexeme (the canonical form, dictionary form, or citation form; the headword of a set of words).

Example 1: run, runs, ran and running are forms of the same lexeme, with run as
the lemma

Example 2 : ‫ درس‬/ ‫دراسة‬

Lexical analysis is the analysis of a word into its lemma (also known as its dictionary form) and its grammatical description.

Syntax
Syntactic Analysis: the study of the structural relationships between words, or the way words are used to form phrases (parsing).

Part-of-Speech tagging: label each word with a unique tag that indicates its syntactic role. Examples of tags: noun, verb, article, preposition, etc. For example, it is part of English syntax that a determiner such as “the” will come before a noun.

Chunking: label segments of a sentence with syntactic constituents such as noun or verb phrases (NP or VP), as in the sketch below.

This level focuses on analyzing the words in a sentence to uncover the grammatical structure of the sentence.

The output of this level of processing is a representation of the sentence that reveals the structural dependency relationships between the words.

There are various grammars that can be utilized, and the choice of grammar will, in turn, impact the choice of a parser.
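To make the tagging and chunking steps concrete, here is a minimal Python sketch using NLTK, the toolkit cited on the title slide. It is illustrative only: the sentence, the single NP rule, and the assumption that the required NLTK data packages ('punkt', 'averaged_perceptron_tagger') are installed are not part of the original slides.

```python
# A rough sketch of POS tagging and NP chunking with NLTK (illustrative only).
import nltk

sentence = "The little dog chased the cat"
tokens = nltk.word_tokenize(sentence)      # split the running text into tokens
tagged = nltk.pos_tag(tokens)              # label each token with its syntactic role

# Chunk rule: an NP is an optional determiner, any number of adjectives, then a noun.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
print(tagged)
print(chunker.parse(tagged))               # groups tagged tokens into NP constituents
```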

Natural Language Processing, Prof. Arafat Awajan

Syntactic Analysis: Parsing
Linear sequences of words are transformed into structures
that show how the words relate to one another.
This parsing step converts the flat list of words of the sentence
into a structure that defines the units represented by that list.
Example: Syntactic processing interprets the difference between "John hit Mary" and "Mary hit John."
Not all NLP applications require a full parse of sentences, therefore
the remaining challenges in parsing of prepositional phrase
attachment and conjunction scoping no longer confuse those
applications for which phrasal and clausal dependencies are
sufficient.

Natural Language Processing, Prof. Arafat Awajan

Syntax
Example: I know that you and Frank were planning to
disconnect me.
Examples of symbols used:
◦ S: Sentence (Ali went to the school) → NP + VP
◦ NP: Noun Phrase (Ali went to the school) → Det + Noun
◦ VP: Verb Phrase (Ali prefers a morning flight) → Verb + NP
◦ DET: Determiner (Ali went to the school)
◦ PP: Preposition (to, from, …)
◦ CONJ: Conjunction (and, or, but, ...)
◦ … more with context-free grammars (see the sketch below)
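As a rough, hedged illustration of these symbols (not taken from the slides), the following NLTK sketch defines a toy context-free grammar and parses one of the example sentences; the production rules and vocabulary are assumptions chosen only to cover this sentence.

```python
# Toy context-free grammar built from the symbols above; illustrative only.
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> DET N | N
VP -> V NP | V PP
PP -> P NP
DET -> 'the'
N  -> 'Ali' | 'school'
V  -> 'went'
P  -> 'to'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Ali went to the school".split()):
    print(tree)   # (S (NP (N Ali)) (VP (V went) (PP (P to) (NP (DET the) (N school)))))
```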

Natural Language Processing, Prof. Arafat Awajan

Syntax
Syntax and Meaning
◦ Syntax expresses meaning in most languages because order and
dependency contribute to meaning.

For example, the two sentences:
‘The dog chased the cat.’
and
‘The cat chased the dog.’
differ only in terms of syntax, yet convey quite different meanings.

Natural Language Processing, Prof. Arafat Awajan

Semantics
Semantics is the examination of the meaning of words and sentences.
The study of the literal meaning.
Semantics conveys useful information relevant to the scenario as a whole.
This process builds up a representation of the objects and actions that a sentence is describing and includes the details provided by adjectives, adverbs and prepositions.
In semantic analysis, the structures created by the syntactic analyzer are assigned meanings.
This step must map individual words into appropriate objects in the
knowledge base, and must create the correct structures to correspond to
the way the meanings of the individual words combine with each other.

Natural Language Processing, Prof. Arafat Awajan

Semantics
Example:
I know that you and Frank were planning to disconnect me.
ACTION = disconnect
ACTOR = you and Frank
OBJECT = me

This process gathers information vital to the pragmatic analysis to determine which meaning
was intended by the user.
Example: Semantic processing determines the differences between such sentences as
The animal is in the pen

and
The ink is in the pen

Natural Language Processing, Prof. Arafat Awajan

Discourse
The meaning of an individual sentence may depend on the sentences that
precede it and may influence the sentences yet to come.

The entities involved in the sentence must either have been introduced
explicitly or they must be related to entities that were introduced previously.

The overall discourse must be coherent.

Natural Language Processing, Prof. Arafat Awajan

Pragmatics
The structure representing what was said is reinterpreted to determine what was actually meant.

Pragmatics is the sequence of steps taken that exposes the overall purpose of the statement being analyzed.

Pragmatics is “the analysis of the real meaning of an utterance (expression/speech) in a human language, by disambiguating and contextualizing the utterance”.

It is the study of how language is used to accomplish goals.
◦ What should you conclude from the fact that I said something?
◦ How should you react?

Example: including notions of polite and indirect styles.
I’m sorry, I’m afraid I can’t do that.

Natural Language Processing, Prof. Arafat Awajan

Discourse vs. Pragmatics
Discourse analysis studies the organisation and
structure of texts, whereas pragmatics studies
how context, considering speaker intentions and
other pragmatic elements, affects how linguistic
utterances are understood.

Ambiguity
Computational linguists are obsessed with ambiguity; it is a fundamental problem of computational linguistics, and resolving ambiguity is a crucial goal.

This is accomplished by identifying ambiguities encountered by the system and resolving them using one or more types of disambiguation techniques.

Ambiguity is explained as “the problem that an utterance in a human language can have more than one possible meaning”.

Types of ambiguity include:
◦ Lexical ambiguity
◦ Syntactic ambiguity
◦ Semantic ambiguity
◦ Referential ambiguity
◦ Local ambiguity

Natural Language Processing, Prof. Arafat Awajan

Ambiguity
Lexical Ambiguity results when a word has more than one possible meaning such as in the case
of “board”, it could mean the verb “to get on” or it could refer to a “flat slab of wood”.
Syntactic Ambiguity is present when more than one parse of a sentence exists. Example: “He lifted the branch with the red leaf.” The verb phrase may contain “with the red leaf” as part of the embedded noun phrase describing the branch, or “with the red leaf” may be interpreted as a prepositional phrase describing the action instead of the branch, implying that he used the red leaf to lift the branch.
Ambiguity at the syntactic level: He understands you like your mother
Ambiguity at the syntactic level:
)‫– أكلت التفاحة (أنا‬
)‫– أكلت التفاحة (أنت‬
)‫– أكلت التفاحة (أنت‬
)‫– أكلت التفاحة (هي‬

Natural Language Processing, Prof. Arafat Awajan

Ambiguity
Semantic Ambiguity is existent when more than one possible meaning exists for a sentence as in
He lifted the branch with the red leaf.
It may mean that the person in question used a red leaf to lift the branch or that he lifted a branch
that had a red leaf on it.
Referential Ambiguity is the result of referring to something without explicitly naming it by using words like “it”, “he” and “they”.
These words require the target to be looked up and may be impossible to resolve such as in the sentence:
The interface sent the peripheral device data which caused it to break
it could mean the peripheral device, the data, or the interface.
Local Ambiguity occurs when a part of a sentence is unclear but is resolved when the sentence as a
whole is examined.
The sentence: “this hall is colder than the room,” exemplifies local ambiguity as the phrase: “is colder
than” is indefinite until “the room” is defined.

Natural Language Processing, Prof. Arafat Awajan

Disambiguation
There are many techniques and tools to decide which interpretation of a
word to use, some of these techniques are listed below:
◦ Prior probabilities are rules that tell the system that a certain word phrase nearly always means a certain thing without looking at anything else; this is a purely statistical approach to disambiguation.
◦ Conditional probability examines the scenario in reference to the
origin of the phrase to make the decision on the meaning of a word
phrase.
◦ Context looks at the environment and incidents surrounding the
phrase to make a decision on which interpretation to use.

World Models are needed for a good disambiguation system, to allow for
the selection of the most practical meaning of a given sentence.

This world model needs to be as broad as the scenarios the system would
encounter in its normal operation.

Natural Language Processing, Prof. Arafat Awajan

Morphological Analysis

Bickmore, T. and Giorgino, T., 2021. Methodological review: health dialog systems for patients and consumers. J. Biomed. Inform.-JBI.

Morphology
A morphology is a systematic description of words in a natural language. It describes a set of relations between words’ surface forms and lexical forms.

A word’s surface form is its graphical (in written text) or spoken form.

Morphology is the study of word form and structure: the way words are built up from smaller units called morphemes.

A morpheme is a meaningful linguistic unit that contains no smaller meaningful parts. It is the minimal meaning-bearing unit in a language.

Example
◦ the word cat consists of a single morpheme (the morpheme cat)
◦ the word cats consists of two: the morpheme cat and the morpheme -s.

Morphology tries to formulate rules about:
◦ Their internal structure: Washing -> wash + -ing
◦ How they are formed: bat -> bats :: rat -> rats; Write -> writer :: browse -> browser

Natural Language Processing, Prof. Arafat Awajan

Morphological Analysis
MORPHOLOGICAL ANALYSIS
Morphological analysis is the computational process which provides information about the structure of a given surface word.
Individual words are analyzed into their components, and non-word tokens (such as punctuation) are separated from the words.
Morphology is mainly useful for identifying the parts of speech in a sentence and words that interact together.

MORPHOLOGICAL GENERATION
Morphological processing may be bidirectional: i.e., parsing and generation.
Morphological generation is the inverse process of producing a surface word given a morphological analysis.

Natural Language Processing, Prof. Arafat Awajan

Natural Language Processing, Prof. Arafat Awajan

Why do we need Morphological Analysis and Generation?
Productivity: going, drinking, running, playing
◦ Create a vast array of expressions from a relatively small set of morphemes.
◦ Storing every form leads to inefficiency.

Addition of new words


◦ Verb: To fax. Forms: fax, faxes, faxed, faxing
◦ New Word: "Podcast". Forms: "Podcasts," "Podcaster," "Podcasting," "Podcasted," etc.

Compiling lexicons: Morphological analysis is crucial for compiling dictionaries and lexicons

Stemming for IR: helps in retrieving documents containing different inflected forms of the
query terms.

Lemmatization: words need to be standardized for consistency and accuracy

Spell Checking: identify misspelled words and suggest corrections

Morphemes Classes
Two broad classes of morphemes: stems and affixes.
The stem: the “main” morpheme of the word, supplying the main meaning.
The affixes: add “additional” meanings of various kinds.

Examples:
◦ (foxes) breaks down into (fox and -es)
◦ (cats) breaks down into (cat and -s)
◦ (الرجل) breaks down into (ال and رجل)
◦ The word eats is composed of a stem eat and the suffix -s.
◦ The word unbuckle is composed of a stem buckle and the prefix un-.
◦ A word can have more than one affix. For example, the word rewrites has the prefix re-, the stem write, and the suffix -s.
◦ The word unbelievably has a stem (believe) plus three affixes (un-, -able, and -ly).

Natural Language Processing, Prof. Arafat Awajan

Natural Language Processing, Prof. Arafat Awajan

Word Formation
There are many ways to combine morphemes to create
words. Four of these methods are common and play
important roles in speech and language processing:
◦ Inflection (Inflectional morphology)
◦ Derivation (Derivational Morphology)
◦ Compounding,
◦ Cliticization.

Word Formation: Inflection
Inflection is the combination of a word stem with a grammatical morpheme -> a word of the same class as the original stem, giving information about tense, gender, number, …

For example, English has the inflectional morpheme (-s) for marking the plural on nouns, and the inflectional morpheme (-ed) for marking the past tense on verbs.

Examples:
◦ Cats -> stem (lemma): cat + affix “s” (plural)
◦ Asked -> ask (stem): ask + affix “ed” (past tense)

Example from Arabic generated by inflection:
يدرسون = ي (prefix) + درس (stem) + ون (suffix)

Inflectional morphology concerns properties such as tense, aspect, number, person, gender, and case, although not all languages code all of these.

English has a relatively simple inflectional system; only nouns, verbs, and sometimes adjectives can be inflected, and the number of possible inflectional affixes is quite small. English has very little morphological marking of case and gender.

Exceptions and irregularities?
◦ Woman -> women (plural)

Inflection typically does not change the core meaning or part of speech of a word.

Natural Language Processing, Prof. Arafat Awajan

Word Formation: Derivation
Derivation is the combination of a word stem with a grammatical morpheme -> a word of a different class, often with a meaning hard to predict exactly.

Example:
◦ Computerization -> lemma (computerize) is a verb + suffix (-ation) to produce a noun.
◦ Unhappy -> prefix (un-) + stem (happy) to form the negative.

Example from Arabic generated by derivation:
الدراسة = ال (prefix) + درس (stem) + ا (infix) + ة (suffix)

Derivation changes the meaning or part of speech of a word.

Verbs and adjectives to nouns:
◦ -ation: Computerize -> Computerization
◦ -ee: Appoint -> Appointee
◦ -er: Kill -> Killer
◦ -ness: Fuzzy -> Fuzziness

Nouns and verbs to adjectives:
◦ -al: Computation -> Computational
◦ -able: Embrace -> Embraceable
◦ -less: Clue -> Clueless

Natural Language Processing, Prof. Arafat Awajan

Word Formation: Compounding
Compounding is the combination of multiple word stems together.

For example, the noun doghouse is the concatenation of the morpheme dog with the
morpheme house.

Example from Arabic? Arabic is generally assumed not to have compound nouns (even if ancient Arabic did have them): برمائي، رأسمال

Natural Language Processing, Prof. Arafat Awajan

Word Formation: Cliticization
Cliticization is the combination of a word stem with a clitic.

A clitic is a morpheme that acts syntactically like a word but is reduced in form and attached to another word.

For example, the English morpheme ’ve in the word I’ve is a clitic, as is the French
definite article l’ in the word l’opera.

Example from Arabic: ... ،‫ أذهبت للمدرسة‬،‫ تاهلل‬... ،‫ السيما‬،‫ اينما‬،‫ ريثما‬،‫ربما‬

Natural Language Processing, Prof. Arafat Awajan

Morphology and Spelling Rules
English morphology is essentially concatenative, where a word is a sequence of prefixes, stems and suffixes.
Some words have irregular morphology and their inflectional
forms simply have to be listed.
Examples:
◦ Car -> Cars / Box -> Boxes
◦ ‫ مدرس‬-> ‫ مدرسون‬/ ‫ سيارة‬-> ‫ سيارات‬/ ‫ أمين‬-> ‫أمناء‬
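A hedged sketch of what concatenative spelling rules plus a list of irregular forms can look like in code; the rules and the small exception table below are illustrative assumptions, not an actual morphological analyzer.

```python
# Regular plural spelling rules plus listed irregular forms (illustrative only).
IRREGULAR_PLURALS = {"woman": "women", "child": "children", "mouse": "mice"}

def pluralize(noun: str) -> str:
    if noun in IRREGULAR_PLURALS:                    # irregular forms are simply listed
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                           # spelling rule: box -> boxes
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"                     # spelling rule: city -> cities
    return noun + "s"                                # default concatenative rule: car -> cars

print([pluralize(w) for w in ["car", "box", "city", "woman"]])
# ['cars', 'boxes', 'cities', 'women']
```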

Natural Language Processing, Prof. Arafat Awajan

Natural Language Processing, Prof. Arafat Awajan

Tokenization
Tokenization is a fundamental step in processing textual data preceding the tasks of
information retrieval, text mining, and NLP.

Tokenization is typically the first task in a pipeline of natural language processing tools. It
usually involves two sub-tasks, which are often performed at the same time:
◦ separating punctuation symbols from words;
◦ detecting sentence boundaries.

Tokenization is closely related to the morphological analysis. It is the task of separating out
words from running text.

The function of a tokenizer is to split a running text into tokens, so that they can be fed into a
morphological transducer or POS tagger for further processing.

The tokenizer is responsible for defining word boundaries, demarcating clitics, multiword expressions, named entities, abbreviations and numbers.
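A minimal sketch of these two sub-tasks with NLTK is shown below; the example text is invented, and it assumes the 'punkt' tokenizer data has been downloaded (nltk.download('punkt')).

```python
# Sentence boundary detection and word/punctuation separation with NLTK.
import nltk

text = "Mr. Smith moved to New York. He paid $3,000 for the flat."
sentences = nltk.sent_tokenize(text)                    # detect sentence boundaries
tokens = [nltk.word_tokenize(s) for s in sentences]     # separate punctuation from words

print(sentences)   # the full stop after 'Mr.' should not end the first sentence
print(tokens)
```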

Natural Language Processing, Prof. Arafat Awajan

Tokenization
In the output of this process white space is typically used as separation
marks between tokens, and sentences are usually separated by new lines.
Problem : many punctuation symbols are ambiguous in their use.
Example:
◦ a hyphen in a football score, in a range of numbers, in a compound
word, or to divide a word at the end of line.
◦ Full stop: in abbreviations and the end of a sentence.
Issues in tokenization:
◦ Finland’s capital -> Finland? Finlands? Finland’s?
◦ New York / San Francisco: one token or two? How do you decide it is
one token?
◦ USA and U.S.A
◦ Score (sport) 3-4 / Range of values 1-10
◦ In Arabic, other problems occur, example: ‫وسيأكولون‬

Stop List
A stop list is a list of words that do not represent a
document’s contents and may include prepositions,
pronouns, and conjunctions. (the, on, under, of, ….)
Such a list is also referred to as a functional or structural
word list
Examples:
◦ a, the , they, about, above, across, after again, all, almost, alone,
and, at, or, if, but, can, did, ever, I, on, for, ….
◦ ‫في كل لم لن له من هو هي كما لها منذ قد ال هناك قال كان كانت فيه لكن في لم من‬
‫هو يوم فيها منها يكون يمكن حيث اما االتى التي اكثر‬

Natural Language Processing, Prof. Arafat Awajan

Example
Statement:
◦ If you want to apply for a scholarship abroad and in a specific university you require IELTS and TOEFL

Processing:
◦ If you want to apply for a scholarship abroad and in a specific university you require IELTS and TOEFL

After:
◦ If want apply scholarship abroad university require IELTS TOEFL
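A rough sketch of the same filtering step with NLTK's built-in English stop list (an assumption; the slide's own stop list may differ, so the output need not match the slide word for word). It assumes the 'stopwords' and 'punkt' data packages are available.

```python
# Remove stop words from the example statement using NLTK's stop-word list.
import nltk
from nltk.corpus import stopwords

sentence = ("If you want to apply for a scholarship abroad and in a specific "
            "university you require IELTS and TOEFL")

stop_words = set(stopwords.words("english"))
tokens = nltk.word_tokenize(sentence)
content_words = [t for t in tokens if t.lower() not in stop_words]
print(content_words)
# e.g. ['want', 'apply', 'scholarship', 'abroad', 'specific', 'university', 'require', 'IELTS', 'TOEFL']
```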

Stemming
The process of removing affixes from a word so that we are left with the stem of
that word is called stemming.

For example, the words ‘run’, ‘running’, and ‘runs’ all convert into the word ‘run’ after stemming is applied.

One crucial point about stem words is that they need not be meaningful. For example, the stem of the word ‘traditional’ is ‘tradi’, which has no meaning.

This naive version of morphological analysis is called stemming
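A minimal stemming sketch with NLTK's Porter stemmer, one possible stemmer (the slides do not name a specific algorithm), so exact outputs depend on the stemmer chosen.

```python
# Naive morphological analysis by suffix stripping with the Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["run", "running", "runs", "easily", "traditional"]
print([stemmer.stem(w) for w in words])
# 'running' and 'runs' both reduce to 'run'; some stems are not real words
```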

Jurafsky, D. and Martin, J.H., Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
https://www.projectpro.io/article/stemming-in-nlp/780

Lemmatization
For some language processing tasks, we need to know that two words have a similar root, despite their surface differences.

Example: the words sang, sung, and sings are all forms of the verb sing. The word sing is sometimes called the common lemma of these words, and mapping from all of these to sing is called lemmatization.

Example: teach, teacher, teachers, teaching
درس، دراسة، مدرسة، مدرسين، تدريس، ...

Lemmatization is reducing inflectional/variant forms to the base form.

Examples:
◦ am, are, is -> be
◦ car, cars, car's, cars’ -> car
◦ The boy's cars are different colors -> the boy car be different color

Lemmatization implies doing “proper” reduction to the dictionary headword form.

Natural Language Processing, Prof. Arafat Awajan
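A small lemmatization sketch using NLTK's WordNet lemmatizer, one possible implementation (it assumes the 'wordnet' data package is available); note that the lemmatizer needs a part-of-speech hint to reduce verbs correctly.

```python
# Reduce inflected forms to their dictionary headword (lemma) with WordNet.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))           # -> 'car'
print(lemmatizer.lemmatize("are", pos="v"))   # -> 'be'
print(lemmatizer.lemmatize("sang", pos="v"))  # -> 'sing'
```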

Stemming vs. Lemmatization

Lexical Requirements for Morphological Analysis
Some lexical information is needed for full, high-precision morphological processing:
◦ affixes, plus the associated information carried by the affix
◦ irregular forms, with associated information similar to that for
affixes
◦ stems / lemma with syntactic categories (plus more detailed
information if derivational morphology is to be treated as
productive)
◦ Spelling rules
◦ Stop words

Natural Language Processing, Prof. Arafat Awajan

Part-of-Speech Tagging
Words can be used in a variety of grammatical roles, for example nouns, adjectives, prepositions, verbs, and so on.

These categories are the basic grammatical units of language and are called parts of speech.

Part-of-speech tagging, or POS tagging, is the task of automatically labeling each token in the sentence with its part of speech. This is a crucial early step in understanding a sentence.

Example:
◦ Text: The function of sleep, according to one school of thought, is to consolidate memory.
◦ POS Tagging: The|DET function|NOUN of|PREP sleep|NOUN ,|PUN according|VERB to|PREP one|DET school|NOUN of|PREP thought|NOUN ,|PUN is|VERB to|PREP consolidate|VERB memory|NOUN .|PUN
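A rough POS-tagging sketch with NLTK's default tagger is given below; it produces Penn Treebank tags (DT, NN, VB, ...) rather than the coarse tag names used in the slide's example, and it assumes the tagger's data packages are installed.

```python
# Automatically label each token of the slide's example sentence with a POS tag.
import nltk

text = "The function of sleep, according to one school of thought, is to consolidate memory."
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('function', 'NN'), ('of', 'IN'), ('sleep', 'NN'), ...]
```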

Natural Language Processing, Prof. Arafat Awajan

Part-of-Speech Tagging
Ambiguity is what makes POS tagging, and many NLP tasks, difficult. Many words have more than one possible syntactic category.

Tagset: which and how many tags should we use?

Example:
◦ The back door = adjective
◦ On my back = noun
◦ Win the voters back = adverb
◦ Promised to back the bill = verb

Natural Language Processing, Prof. Arafat Awajan

POS Tagset
Main tags of part-of-speech:
◦ Traditional: Noun, Verb, Particle
◦ Computational: Noun (N), Proper Noun (PN), Verb (V), Adjective (Adj), Adverb (Adv), Preposition (P), Pronoun (Pron), Numeral (Num), Conjunction (Conj), Determiner (Det), Auxiliary (Aux), Punctuation (Pun), Interjection (IJ), and others

Verb-specific features:
◦ Aspect: perfective, imperfective, imperative
◦ Voice: active, passive
◦ Tense: past, present, future
◦ Mood: indicative, subjunctive, jussive
◦ Subject (Person, Number, Gender)

Noun-specific features:
◦ Number: singular, dual, plural, collective
◦ Gender: masculine, feminine, neutral
◦ Definiteness: definite, indefinite
◦ Case: nominative, accusative, genitive
◦ Possessive clitic: indicates possession or ownership

Natural Language Processing, Prof. Arafat Awajan

Morphological Analysis – Arabic Language –

Arabic Morphology
The morphological features of a word provide crucial information to enable understanding of text and information extraction.

The possible meanings of individual words depend mainly on their morphology (morphemes) and their position in a sentence.

The possible meanings of a word must be determined first to accomplish the understanding of text written in a natural language.

The Arabic language is a Semitic language with a rich and complex morphology, both derivational and inflectional.

The morphology of the Arabic language has two types of morphemes:
◦ templatic morphemes and
◦ concatenative morphemes.

Templatic morphemes: most Arabic words (stems) are derivative words that are generated according to the root-and-pattern scheme.

Concatenative morphemes include:
◦ Stems,
◦ Affixes, and
◦ Clitics.

Natural Language Processing, Prof. Arafat Awajan

Arabic Templatic Morphology
The templatic morphology of the Arabic language is based on the Semitic root-and-pattern scheme.

The majority of words are generated from basic entities called roots or radicals according to a predefined list of patterns called morphological balances or patterns (الوزن الصرفي).

The root morpheme is a sequence of (mostly) three, (less so) four, or very rarely five consonants (termed radicals).

This mechanism of Arabic word generation is called ‘AL-ISHTIQAQ’.

This mechanism is performed by adding letters and/or diacritical marks (التشكيل) to the roots.

These additional letters and diacritical marks may be added at the beginning, at the middle or at the end of the root -> stem.

Habash, N.Y., 2022. Introduction to Arabic natural language processing. Springer Nature.
Natural Language Processing, Prof. Arafat Awajan

Arabic Templatic Morphology
All classes of words (verbs, nouns, adjectives and adverbs) can be generated from roots according to the appropriate patterns.

The pattern used for generating a word determines its various attributes, such as:
◦ gender (masculine / feminine) -> كَتَبَ، كَتَبَت
◦ number (singular / plural) -> كَتَبَ، كَتَبُوا
◦ tense (past, present) -> كَتَبَ، يَكتُبُ
◦ imperatives -> اكتُب
◦ etc.

An Arabic word can be represented lexically by its root, along with its morphological pattern. This representation captures the fundamental meaning of the word (encoded in the root) as well as its structural and grammatical characteristics (determined by the pattern).

Morphological balance or pattern is a model used to study the internal structure of Arabic words. It consists of the three basic Arabic pattern letters that correspond to the first, second, and third letters of the Arabic triliteral root, respectively.

Root: ق-ر-أ, Pattern: فَعَلَ, Applied Pattern: قَرَأَ
◦ The first letter of the pattern (ف) corresponds to the first root letter (ق).
◦ The second letter of the pattern (ع) corresponds to the second root letter (ر).
◦ The third letter of the pattern (ل) corresponds to the third root letter (أ).

For quadriliteral roots, the third letter is duplicated to represent the fourth root letter.
Root: ع-ر-ق-ل, Pattern: فَعلَلَ, Applied Pattern: عَرقَلَ
Root: ع-ر-ق-ل, Pattern: يُفَعلِلُ, Applied Pattern: يُعَرقِلُ

Natural Language Processing, Prof. Arafat Awajan

Arabic Templatic Morphology
Example:
◦ The word لاعب is generated from the root لعب (play) according to the pattern فاعل.
◦ This pattern indicates that the word is a noun, its gender is masculine, and it is singular. The final meaning will be player: (play: noun; singular; masculine).
◦ Other words generated from the same root: ملعب، لعبة، ملعوب، لعيّب، يلاعب ...

The pattern is one element of a countable set of limited size.

A pattern is defined by a set of additive letters and/or a set of diacritical marks and their positions in the generated word.

Pattern, pattern structure, prefix, infix and suffix:
◦ سيفعلون -> structure [سي***ون], prefix [سي], infix [], suffix [ون]
◦ فاعلون -> structure [*ا**ون], prefix [], infix [ا], suffix [ون]
◦ مفاعيل -> structure [م*ا*ي*], prefix [م], infix [اي], suffix []

The patterns يَفعَلُون، يُفعَلُون، يَفعّلُون all share the same structure [ي***ون] and differ only in their diacritical marks.

Natural Language Processing, Prof. Arafat Awajan

Arabic Concatenative Morphology
The concatenative morphology of the Arabic language uses the concatenative morphemes:
◦ Stems,
◦ Affixes, and
◦ Clitics.

There are three types of affixes:
◦ Prefixes,
◦ Suffixes, and
◦ Circumfixes.

Affixes:
◦ A prefix can consist of as many as four concatenated prefixes or could be null.
◦ The suffix consists of as many as three concatenated suffixes or could be null.
◦ The circumfixes are generally combinations of a prefix with suffixes.

Clitics:
◦ A clitic is a symbol of one to three letters that represents another token, such as a preposition (بالمعلم، كالمعلم), a conjunction (كتاب وقلم), the definite article (البيت), or an object pronoun (كتبه).
◦ There are two types of clitics: proclitics and enclitics.
◦ Proclitics precede a word, for example, the definite article.
◦ Enclitics follow a word, for example, object pronouns.

Terminology Alert: The terms prefix and suffix are sometimes used to refer to proclitics and enclitics, respectively. Prefix and suffix have also been used to refer to the whole sequence of affixes and clitics attaching to a stem.


Habash, N.Y., 2022. Introduction to Arabic natural language processing. Springer Nature.
Natural Language Processing, Prof. Arafat Awajan

Clitics
Habash, N.Y., 2022. Introduction to Arabic natural language
processing. Springer Nature.

Clitic
https://www.semanticscholar.org/paper/Clitics-in-Arabic-Language%3A-A-Statistical-Study-Alotaiby-Foda/f8a0eb585082d9f14e2315f15818e1183a06ddfb

Example: ‫وسيكتبونها‬
Proclitics: ‫ س‬،‫و‬
Prefixes: ‫ي‬
Root: ‫ك ت ب‬
Infixes: -
Suffixes: ‫ون‬
Enclitic: ‫ها‬

Exercises: ‫ أنلزمكموها‬،‫كالمهندسون‬

General Structure of Arabic Words


Natural Language Processing, Prof. Arafat Awajan
Tan, T.P., Xiao, X., Tang, E.K., Chng, E.S. and Li, H., 2009, August. MASS: A Malay language LVCSR corpus resource. In 2009 Oriental COCOSDA International Conference
on Speech Database and Assessments (pp. 25-30). IEEE.
https://www.researchgate.net/publication/224600045_MASS_A_Malay_language_LVCSR_corpus_resource/figures?lo=1&utm_source=google&utm_medium=organic

Arabic Words Classes
Arabic words (stems) may be:
◦ Derivative words (templatic)
◦ Non-derivative words (non-templatic)
◦ Stop words

Non-derivative words do not obey the standard templatic derivation rules. Examples of non-derivative words are words borrowed from foreign languages and proper names.

Stop words, sometimes called function words, include pronouns, prepositions, conjunctions, question words, and so on.

Non-derivative words and stop words can receive affixes and clitics.

Examples: الديمقراطيون، فيها، فعليهم، الأميركي ...

Natural Language Processing, Prof. Arafat Awajan

Arabic Features and Challenges
The formation of Arabic words presents specific features and challenges that must be taken into
consideration when fixing the rules used by the morphological analyzer.
The first challenge: some letters of the root may be dropped or modified during the generation of words from roots. Examples: عاد، يعود، قام، يقوم، قم، وعظ، عظه، دعى ... Defective or weak roots are the roots with one or more long vowels.

The second challenge is the presence of eight different types of diacritical marks, used to represent short vowels.

In fully diacriticized text a diacritical mark is added after each consonant of the word. These diacritical marks play a very important role in fixing the meaning of words.

The eight diacritics are:
◦ three diacritical marks to indicate the short vowels (ـَ ـُ ـِ)
◦ three double diacritical marks which combine the single ones (ـً ـٌ ـٍ)
◦ one diacritical mark to indicate the absence of vowelization (ـْ)
◦ and a single diacritical mark to indicate the duplicate occurrence of a consonant (ـّ).

Natural Language Processing, Prof. Arafat Awajan

Arabic Features and Challenges
According to the extent that diacritics have been used, Arabic
text may be classified into three different categories
◦ Undiacriticized
◦ Partially diacriticized, and
◦ Fully diacriticized text.
Vowelization or diacritization is the process of putting
diacritical marks or short vowels above or under letters of
Arabic words.
Nunation ‫ التنوين‬is the process of putting one of the set of
vowels at the end of the word to produce a phonetic effect
that adds the sound of the letter.
Gemination or tashdeed is the process of putting the mark (ـّ) above a letter to duplicate it phonetically.

Natural Language Processing, Prof. Arafat Awajan

Arabic Features and Challenges
The third challenge is that not all the words in Arabic text are generated from a root. For example, some
words such as the tools and foreign words cannot be broken down into a root and pattern. Example:
‫ تلفزيون‬، ‫كمبيوتر‬
The fourth challenge: Arabic orthography concatenates certain word forms with the preceding or the following ones, possibly changing their spelling: وسوف نعمل على ذلك
An example of an ambiguous word is “F+H+M” ‫ فهم‬which has at least 5 interpretations. It can be
interpreted as
◦ a perfect verb that means understand,
◦ a perfect verb that means make (him) understand,
◦ a noun that means understanding,
◦ a concatenation of a conjunction and a pronoun that means and + they,
◦ and finally a conjunction and a verb that means and + (he) intend.

Other challenges:
◦ some letters may be dropped when occurring with others in some circumstances: ‫ للرجال‬،‫الرجال‬
◦ Different word decompositions are possible: كامل

Natural Language Processing, Prof. Arafat Awajan

Roots and Stem
The root is a single morpheme that provides the basic meaning of a word.

In English, the root is sometimes called the word base or stem; it is the part of the word that remains after the removal of affixes.

In Arabic, the root is the original form of the word before any transformation process, and it plays an important role in language studies.

A stem is a morpheme that can accept affixes.

Stems are formed by a derivational combination of a root morpheme and a set of vowels; the two are arranged according to canonical patterns. Roots are said to interdigitate with patterns to form stems.

The stem expresses some central idea or meaning. For example, the Arabic stem katab (he wrote) is composed of the morpheme ktb (notion of writing) and the vowel melody morpheme 'a-a'. The two are coordinated according to the pattern CVCVC (C = consonant, V = vowel).

An affix can be added before or after, or inserted inside, a root or a stem as a prefix, suffix or infix, respectively, to form new words or meanings.
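To make the interdigitation idea concrete, here is a tiny illustrative sketch (not from the slides): the consonants of a triliteral root are slotted into the C positions of a CV-style pattern. The helper function and the patterns beyond the slide's katab example are assumptions.

```python
# Interdigitate a consonantal root with a CV pattern (illustrative sketch only).
def interdigitate(root: str, pattern: str) -> str:
    """Fill each 'C' slot of the pattern with the next root consonant."""
    radicals = iter(root)
    return "".join(next(radicals) if ch == "C" else ch for ch in pattern)

print(interdigitate("ktb", "CaCaC"))    # -> 'katab'   (the slide's example stem)
print(interdigitate("ktb", "CaaCiC"))   # -> 'kaatib'  (a different, assumed pattern)
print(interdigitate("ktb", "maCCuuC"))  # -> 'maktuub' (a different, assumed pattern)
```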

Natural Language Processing, Prof. Arafat Awajan

Inflectional Morphology
INFLECTIONAL MORPHOLOGY ARABIC MORPHOLOGY

Perfect verb subject conjugation (suffixes only)

Natural Language Processing, Prof. Arafat Awajan

Natural Language Processing, Prof. Arafat Awajan

Tokenization
Tokenization is a non-trivial problem as it is closely related to the
morphological analysis. It is the task of separating out words from
running text.
The function of a tokenizer is to split a running text into tokens, so that
they can be fed into a morphological transducer or POS tagger for
further processing.
The tokenizer is responsible for defining word boundaries, demarcating
clitics, multiword expressions, abbreviations and numbers.
There is not a single possible or obvious tokenization scheme: a
tokenization scheme is an analytical tool devised by the researcher.
Different tokenizations imply different amounts of information, and further influence the options for linguistic generalization.

Arabic Word Analysis Requirements
Because the Arabic prefixes and suffixes are finite in number, their respective lexicons could be considered complete.

The three-part approach entails the use of six lexicons, in addition to the spelling rules of the language:
◦ Prefixes lexicon
◦ Stem lexicon (roots and morphological patterns)
◦ Suffixes lexicon
◦ Lexicon of proclitics
◦ Lexicon of enclitics
◦ Lexicon of stop words

For a word to be analyzed, its parts must have an entry in each lexicon, assuming that both a null prefix and a null suffix are possible.

Natural Language Processing, Prof. Arafat Awajan

Bias in NLP models

https://analyticssteps.com/blogs/ethical-considerations-natural-language-processing-nlp

Ethical concerns of NLP


Bias
◦ NLP models are trained on large datasets, and the quality of their output depends on the quality and
diversity of the data they are trained on. If the training data is biased, the NLP model may learn and
perpetuate that bias, leading to unfair or discriminatory outcomes.
◦ For example, an NLP-based recruitment system may discriminate against candidates based on their race
or gender, even if unintentionally.
◦ To address this issue, it is essential to ensure that NLP models are designed and trained on diverse and
representative datasets that are free from bias. Additionally, it is crucial to conduct regular audits of NLP
systems to identify and address any bias that may exist in the models or the data they are trained on.

Privacy
◦ NLP systems often rely on large amounts of personal data, such as text messages, emails, and social
media posts, to provide insights and make predictions.
◦ This data can be sensitive and personal, and individuals may not be aware that it is being collected or
used by NLP systems.
◦ To protect privacy, it is crucial to ensure that NLP systems are designed with privacy in mind. This
includes using data minimization techniques to reduce the amount of personal data collected, providing
clear and transparent information about how data is being used, and implementing appropriate security
measures to protect data from unauthorized access or theft.

https://analyticssteps.com/blogs/ethical-considerations-natural-language-processing-nlp

Ethical concerns of NLP


Transparency
◦ It is often difficult to understand how NLP models arrive at their predictions
or recommendations, which can lead to distrust and confusion among users.
◦ To address this issue, it is essential to ensure that NLP systems are
transparent and explainable, with clear documentation and visualizations
that enable users to understand how the models are making decisions.

NLP can also be used to promote ethical communication and empathy.


◦ For example, NLP-based chatbots can be used to provide mental health
support and counseling, enabling individuals to access help and support
when they need it most.
◦ NLP can also be used to analyze social media posts and identify instances of
hate speech or bullying, enabling organizations to take action to promote
social justice and equality.

What does machine learning bias look
like?
In essence, machine learning bias occurs when algorithms display latent biases that frequently go undetected during testing, because most publications evaluate their models for pure accuracy.

For example:
◦ It is more likely that "He is a doctor" than "She is a doctor."
◦ As computer programmer is to homemaker, man is to woman.
◦ Female nouns tend to convey rage more strongly in sentences.
◦ When "She is a doctor" is translated from English to Hungarian and back to English, it becomes "He is a doctor"; likewise, "He is a nurse" comes back as "She is a nurse."

In contrast to an example like "man is to woman as king is to queen," where king and queen have a literal gender definition, the algorithm in these examples is effectively conveying preconceptions. Queens are defined as being female and kings as being male, but the statement "man is to woman as computer programmer is to homemaker" is prejudiced, because neither computer programmers nor homemakers are designated as male or female.

NLP models also show a strong presence of biases other than gender prejudice. Some instances of additional biases:
◦ Machine learning models suggest that Black people are to crime what Caucasians are to law enforcement.
◦ Machine learning models suggest that legal is to Christianity what terrorism is to Islam.
◦ AI is more likely to mark tweets made by African Americans as offensive.
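To make the analogy probes above concrete, here is a minimal Python sketch, assuming the gensim library and its downloadable "word2vec-google-news-300" vectors; the underscore token computer_programmer is an assumption about that model's vocabulary, not something stated on this slide.

```python
# Minimal sketch: probing analogy-style bias in pretrained word vectors.
# Assumes gensim and an internet connection to download the vectors (~1.6 GB).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # pretrained KeyedVectors

# Classic probe: "man is to doctor as woman is to ?"
# Vector arithmetic: doctor - man + woman, then nearest neighbours.
print(vectors.most_similar(positive=["doctor", "woman"], negative=["man"], topn=5))

# Same probe for an occupation pair discussed on this slide (token name assumed).
print(vectors.most_similar(positive=["computer_programmer", "woman"], negative=["man"], topn=5))
```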

https://glair.ai/post/bias-in-natural-language-processing-nlp

Sources of Bias
Bias from data: The first entry point for bias in the NLP pipeline is the choice of data used for experimentation.

Bias from annotations: The labels chosen for training and the procedure used for annotating them introduce annotation bias. Selection bias is introduced by the samples chosen for training or testing an NLP model.

Bias from input representations: The third type of bias is introduced by the choice of representation used for the data.

Bias from models: The choice of models or machine learning algorithms used also introduces the issue of bias amplification.

Bias from research design: Finally, the entire research design process can introduce bias if researchers are not careful with their choices in the NLP pipeline.

Hovy, D. and Prabhumoye, S., 2021. Five sources of bias in natural language
processing. Language and Linguistics Compass, 15(8), p.e12432.
https://compass.onlinelibrary.wiley.com/doi/full/10.1111/lnc3.12432

Hovy, D. and Prabhumoye, S., 2021. Five sources of bias in natural language processing. Language and Linguistics Compass, 15(8), p.e12432. https://compass.onlinelibrary.wiley.com/doi/full/10.1111/lnc3.12432

Bias from data
NLP systems reflect biases in the language data used for training them.
Models trained on these data sets treat all language as if it resembled this restricted training data, creating demographic bias.
The result is ageist, racist, or sexist models that are biased against the respective user groups. This is the issue of selection bias, which is rooted in the data.
When choosing a text data set to work with, we are also making decisions about the demographic groups represented in the data.
If our data set is dominated by the ‘dialect’ of a specific demographic group, we should not be surprised that our models have problems
understanding others.
Most data sets have some built-in bias, and in many cases, it is benign.
It becomes problematic when this bias negatively affects certain groups or disproportionately advantages others.
On biased data sets, statistical models overfit to the presence of specific linguistic signals that are particular to the dominant group. As a
result, the model will work less well for other groups, that is, it excludes demographic groups.
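One practical way to detect this kind of exclusion is to evaluate the model separately for each demographic group. The following is a minimal Python sketch of such a disaggregated evaluation; the DataFrame, column names, and group labels are illustrative assumptions, not a real dataset.

```python
# Minimal sketch: disaggregated evaluation across demographic groups.
import pandas as pd
from sklearn.metrics import accuracy_score

results = pd.DataFrame({
    "group":      ["A", "A", "A", "B", "B", "B"],   # demographic group of each author (assumed)
    "true_label": [1, 0, 1, 1, 0, 0],               # gold labels
    "predicted":  [1, 0, 1, 0, 1, 0],               # model predictions
})

# A large accuracy gap between groups suggests the training data under-represents one of them.
for group, subset in results.groupby("group"):
    acc = accuracy_score(subset["true_label"], subset["predicted"])
    print(f"group {group}: accuracy = {acc:.2f}")
```

On real data the same loop would run over thousands of examples per group, but the idea is unchanged: report performance per group, not only in aggregate.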

Hovy, D. and Prabhumoye, S., 2021. Five sources of bias in natural language processing. Language and Linguistics Compass, 15(8), p.e12432. https://compass.onlinelibrary.wiley.com/doi/full/10.1111/lnc3.12432

Bias from annotations
Annotation can introduce bias in various forms through a mismatch of the annotator population with the data. This is the issue of label bias.

Label and selection bias can, and most often do, interact, so it can be challenging to distinguish them; this interaction, however, underscores how important it is to address them jointly. There are several ways in which annotations introduce bias.

In its simplest form, bias arises because annotators are distracted, uninterested, or lazy about the annotation task. As a result, they choose the ‘wrong’ labels.

More problematic is label bias from informed and well-meaning annotators that systematically disagree.

For example, the term ‘social media’ can be validly analysed as either a noun phrase composed of an adjective and a noun, or a noun compound, composed of two
nouns.
◦ Which label an annotator chooses depends on their interpretation of how lexicalized the term ‘social media’ is.
◦ If they perceive it as fully lexicalized, they will choose a noun compound.
◦ If they believe the process is still ongoing, that is, the phrase is analytical, they will choose an ‘adjective plus noun’ construct.
◦ Two annotators with these opposing views will systematically label ‘social’ as an adjective or a noun, respectively. While we can spot the disagreement, we cannot
discount either of them as wrong or malicious.

Finally, label bias can result from a mismatch between authors' and annotators' linguistic and social norms.

For example, annotators may rate the utterances of different ethnic groups differently, or mistake innocuous banter for hate speech because they are unfamiliar with the communication norms of the original speakers.
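Systematic annotator disagreement of the kind described above can be made visible with standard agreement statistics. Below is a minimal Python sketch using Cohen's kappa; the two hypothetical annotators and their labels for the 'social media' example are assumptions for illustration.

```python
# Minimal sketch: measuring inter-annotator agreement to surface label bias.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["ADJ", "NOUN", "ADJ", "ADJ", "NOUN"]    # reads 'social' as an adjective
annotator_2 = ["NOUN", "NOUN", "NOUN", "NOUN", "NOUN"]  # reads 'social media' as a noun compound

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # a low kappa means the annotators agree barely more than chance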

Hovy, D. and Prabhumoye, S., 2021. Five sources of bias in natural language processing. Language and Linguistics Compass, 15(8), p.e12432. https://compass.onlinelibrary.wiley.com/doi/full/10.1111/lnc3.12432

Bias from input representations
Even balanced, well-labelled data sets contain bias: the most common text input representations in NLP systems, word embeddings, have been shown to pick up on racial and gender biases in the training data.
For example, ‘woman’ is associated with ‘homemaker’ in the same way ‘man’ is associated with ‘programmer’.
There has been some justified scepticism over whether these analogy tasks are the best way to evaluate
embedding models, but there is plenty of evidence that (1) embeddings do capture societal attitudes, and that
(2) these societal biases are resistant to many correction methods. This is the issue of semantic bias.
These biases hold not just for word embeddings but also for the contextual representations of big pre-trained
language models that are now widely used in different NLP systems.
As they are pre-trained on almost the entire available internet, they are even more prone to societal biases.
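A simple way to observe this semantic bias is to compare how strongly occupation words associate with gendered words in a pretrained embedding space. The sketch below, in the spirit of WEAT-style association tests, assumes gensim and its downloadable "glove-wiki-gigaword-100" vectors; the word list is illustrative.

```python
# Minimal sketch: quantifying gendered associations in pretrained embeddings.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

occupations = ["homemaker", "programmer", "nurse", "engineer"]
for word in occupations:
    to_she = vectors.similarity(word, "she")
    to_he = vectors.similarity(word, "he")
    # A positive gap means the occupation sits closer to 'she' than to 'he'.
    print(f"{word:12s} sim(she)={to_she:.3f}  sim(he)={to_he:.3f}  gap={to_she - to_he:+.3f}")
```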

Hovy, D. and Prabhumoye, S., 2021. Five sources of bias in natural language processing. Language and Linguistics Compass, 15(8), p.e12432. https://compass.onlinelibrary.wiley.com/doi/full/10.1111/lnc3.12432

Bias from models
Simply using ‘better’ training data is not a feasible long-term solution: languages evolve continuously, so even a representative sample
can only capture a snapshot—at best a short-lived solution.
Systems trained on biased data exacerbate that bias even further when applied to new data.
Sentiment analysis tools pick up on societal prejudices, leading to different outcomes for different demographic groups. For example, by
merely changing the gender of a pronoun, the systems classified the sentence differently.
Machine translation systems have been shown to shift perceived user demographics, making translated samples sound older and more male than the original. This issue is bias overamplification, which is rooted in the models themselves.
Models can overamplify existing biases, contributing to incorrect outcomes even when the answers are technically correct.
The choice of loss objective in model training can unintentionally reinforce biases, causing models to provide correct answers for the
wrong reasons.
Machine learning models often provide predictions even when uncertain or unable to offer accurate responses, potentially resulting in
biased or misleading outcomes.
Models should ideally report uncertainty rather than delivering potentially biased or incorrect results.
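As a minimal illustration of reporting uncertainty rather than always answering, the Python sketch below lets a classifier abstain when its predicted probability falls below a threshold; the toy data, the 0.8 threshold, and the logistic-regression choice are assumptions for illustration.

```python
# Minimal sketch: a classifier that abstains when uncertain instead of guessing.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature training data: class 0 near 0.0, class 1 near 1.0.
X_train = np.array([[0.0], [0.2], [0.8], [1.0]])
y_train = np.array([0, 0, 1, 1])
clf = LogisticRegression(C=100).fit(X_train, y_train)

def predict_or_abstain(x, threshold=0.8):
    """Return a class label only when the model's confidence exceeds the threshold."""
    probs = clf.predict_proba(np.array(x).reshape(1, -1))[0]
    if probs.max() < threshold:
        return "abstain"              # report uncertainty instead of guessing
    return int(probs.argmax())

print(predict_or_abstain([0.5]))      # near the decision boundary: likely 'abstain'
print(predict_or_abstain([0.98]))     # far from the boundary: likely class 1
```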

Hovy, D. and Prabhumoye, S., 2021. Five sources of bias in natural language processing. Language and Linguistics Compass, 15(8), p.e12432. https://compass.onlinelibrary.wiley.com/doi/full/10.1111/lnc3.12432

Bias from research design
NLP research predominantly focuses on English, leading to linguistic bias.


Factors such as resource availability and overexposure to English contribute to this bias.
Implications include limited representation of marginalized communities and potential harm from undisclosed biases.
Diversification of research focus beyond English and towards underrepresented languages is necessary.
Efforts to develop and promote tools, datasets, and resources for underrepresented languages can help mitigate
linguistic bias.
Encouraging interdisciplinary collaboration and incorporating insights from sociolinguistics and cultural studies can
enhance understanding and representation of diverse linguistic contexts.
Ethical considerations, including transparency and accountability, are crucial for mitigating bias and promoting
inclusivity.
Collaboration with diverse communities and stakeholders is essential for equitable NLP research.

https://glair.ai/post/bias-in-natural-language-processing-nlp

How Does Bias Affect NLP Models?
The majority of the deep learning models we employ are "black-box" models.
◦ We create a model, build it, train it using certain data, and then utilize it to address a specific issue.
◦ As designers, we frequently stop there and don't go into great detail about the reasoning behind a
model's choices.
◦ This doesn't necessarily imply that the fundamental ideas behind a model are unknown, though.

Unfortunately, the foundation of NLP is unsupervised learning.


◦ Although the models used in natural language processing are primarily supervised learning models, the
data they use was produced by models that were trained in an unsupervised manner.
◦ We do this because we cannot feed raw text directly into our models. Instead, we must transform the text into language representations that our models can understand.
◦ These representations are called word embeddings.

Unsupervised models produce word embeddings, which are numerical representations of text data.
◦ An unsupervised model scans a large amount of text and generates vectors to represent the words in that text.
◦ Unfortunately, because these models look for hidden patterns and use them to build the embeddings (which automatically organize the data), they absorb more than just semantic information.
◦ While digesting the text, the models pick up biases similar to those found in human culture. Those biases then spread to our supervised learning models, even though those models are meant to learn from unbiased data so that they do not produce biased outputs.
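The following Python sketch shows, under toy assumptions, how an unsupervised embedding model learns only from co-occurrence in its corpus: if the corpus pairs professions with one gender, that pairing is the main signal the vectors can encode. The four-sentence corpus is an illustrative assumption; real systems train on billions of tokens, and numbers from a corpus this small are noisy.

```python
# Minimal sketch: training word embeddings on a (deliberately skewed) toy corpus.
from gensim.models import Word2Vec

corpus = [
    ["he", "is", "a", "doctor"],
    ["she", "is", "a", "nurse"],
    ["he", "works", "as", "an", "engineer"],
    ["she", "works", "as", "a", "homemaker"],
]

# Word2Vec sees only co-occurrence statistics; whatever associations the corpus
# contains (here, professions paired with one gender) are what it can learn.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=100, seed=42)

# With a corpus this small the numbers are noisy; the point is that co-occurrence
# is the only signal available to the unsupervised model.
print(model.wv.similarity("doctor", "he"))
print(model.wv.similarity("doctor", "she"))
```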

https://funginstitute.berkeley.edu/news/op-ed-tackling-biases-in-natural-language-processing/

Solutions

Solution 1: Data Manipulation

• One of the main reasons that NLP algorithms are biased is that the original dataset used to train the model is unbalanced.
• For example, there could be more data associating “doctors” with “male”, so the resulting model would be more likely to predict “doctors” as “male”.
• Therefore, one of the best ways to eliminate bias in NLP is to address the problem of unbalanced data, and there are many ways to do so.
• For instance, one can use data augmentation algorithms such as SMOTE to synthesize more data for the minority group in the dataset (see the sketch after this list).
• If the overall dataset is very large, one can instead remove some data from the majority group to make the dataset more balanced.
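Here is a minimal sketch of the rebalancing step, assuming the scikit-learn and imbalanced-learn packages; the synthetic numeric features stand in for vectorised text (e.g. TF-IDF), since SMOTE operates on numeric feature vectors.

```python
# Minimal sketch: rebalancing an imbalanced dataset with SMOTE before training.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Imbalanced toy data: roughly 90% of examples belong to class 0, 10% to class 1.
X, y = make_classification(n_samples=200, n_features=20, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE synthesises new minority-class examples by interpolating between neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```

The opposite strategy mentioned above, removing majority-group examples, can be sketched the same way with an undersampler instead of SMOTE.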

Solution 2: Bias Fine-Tuning

• The method employs the transfer-learning concept: a model pretrained on less biased data is fine-tuned on a more biased, task-specific dataset.
• Such an approach helps the model avoid picking up the biases of the target training data while still being sufficiently trained to tackle the target task (a sketch follows below).
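As a minimal Python (PyTorch) sketch of this transfer-learning idea, the encoder below stands in for a model pretrained on curated, less-biased data; it is frozen, and only a small task head is trained on the possibly biased target data. The layer sizes and the random stand-in data are assumptions for illustration, not a published recipe.

```python
# Minimal sketch: freeze a "pretrained" encoder, fine-tune only a small task head.
import torch
import torch.nn as nn

# Stand-in pretrained encoder: imagine it was trained on a curated, less-biased corpus.
pretrained_encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU())
for param in pretrained_encoder.parameters():
    param.requires_grad = False        # freeze: the biased target data cannot rewrite it

task_head = nn.Linear(128, 2)          # only this small head is fine-tuned on the target task
model = nn.Sequential(pretrained_encoder, task_head)

optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random stand-in features and labels.
x, y = torch.randn(8, 300), torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```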

Solution 3: Form Diverse AI Development & Audit Teams

• A diverse AI and ethics audit team could be a crucial part in the development of machine
learning technologies that are beneficial to societies.
• By having a diverse audit group review the trained NLP models, participants from different backgrounds can help evaluate the models from multiple perspectives and help the development team spot potential biases against minority groups.
• Additionally, a diverse development team can draw on insights from their lived experiences to suggest how the model should be modified.
