Lecture 2 NLP

Surah Taha: 25-26
BY
DR. BELAL BADAWY AMIN

Introduction
Structured Data vs. Unstructured Data
◦ Structured data can be displayed in rows and columns in a relational database; unstructured data cannot.
◦ Structured data consists of numbers, dates, and strings; unstructured data includes images, audio, video, Word files, e-mail, etc.
◦ Structured data is an estimated 20% of enterprise data; unstructured data is an estimated 80%.
◦ Structured data requires less storage; unstructured data requires more.
◦ Structured data is easier to manage and protect with legacy solutions; unstructured data is more difficult.
Main Approaches in NLP (Timeline)
1. Rule-based approaches (slow, manual)
Regular expressions
Context-free grammars
2. Machine learning / traditional approaches (improved performance and accuracy)
Linear classifiers
Likelihood maximization
3. Deep Learning
Convolutional neural networks
Recurrent neural networks (more efficient, better performance)
Why is NLP important?
 NLP is everywhere even if we don’t realize it
 The majority of activities performed by humans are done through
language
 There are millions of gigabytes of data generated by social media,
messaging apps, and so on
 All these channels are generating large amounts of text data every
second
 Because of these large volumes of highly unstructured text data, we
can no longer use the common approach to understand the text, and this
is where NLP comes in.
Why is NLP difficult?
 It’s the nature of human language that makes NLP difficult
 Humans have the edge due to the communication skills they possess
 There are hundreds of natural languages, each of which has different
syntax rules; words can be ambiguous, with their meaning dependent
on their context.
 The rules that dictate the passing of information using natural
languages are not easy for computers to understand.
Techniques of NLP
1. Syntax analysis: refers to the arrangement of words in a sentence such that they
make grammatical sense. It is used to assess how natural language aligns with
the grammatical rules. Here are some syntax techniques that can be used (a short
sketch follows this list):
Lemmatization: it entails reducing the various inflected forms of a word into a single form for
easy analysis.
Stemming: it involves cutting inflected words down to their root form.
Morphological segmentation: it involves dividing words into individual units called
morphemes.
Word segmentation: it involves dividing a large piece of continuous text into distinct units.
Part-of-speech tagging: it involves identifying the part of speech for every word.
Parsing: it involves undertaking grammatical analysis of the provided sentence.
Sentence breaking: it involves placing sentence boundaries within a large span of text.
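A small sketch of a few of these techniques in Python, using NLTK (an assumption; the slides do not prescribe a toolkit, and the required tokenizer, tagger, and WordNet resources must be downloaded with nltk.download() first):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

sentence = "The striped bats were hanging on their feet."
tokens = nltk.word_tokenize(sentence)             # word segmentation
print(nltk.pos_tag(tokens))                       # part-of-speech tagging

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])          # stemming: crude root forms

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])  # lemmatization: dictionary forms
```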
Techniques of NLP
2. Semantic analysis: refers to the meaning that is conveyed by a text. It is one of
the difficult aspects of NLP that has not been fully resolved yet. It involves
applying computer algorithms to understand the meaning of words and how
sentences are structured and interrelated. Here are some techniques that can be
used (a short sketch follows this list):
Named entity recognition (NER): it involves determining the parts of a text that can be
identified and categorized into preset groups.
Word sense disambiguation: it involves giving meaning to a word based on its context.
Natural language generation: it involves using databases to derive semantic intentions and
convert them into human language.
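A minimal named entity recognition sketch, here using spaCy (an assumption, since the slides name no library; the en_core_web_sm model must be installed separately):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Yao Ming played basketball for the Houston Rockets in the NBA.")
for ent in doc.ents:
    # Each detected span is assigned to a preset group such as PERSON or ORG
    print(ent.text, ent.label_)
```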
Words and Corpora
Corpus: a computer-readable collection of text or speech (plural: corpora).
Types are the number of distinct words in a corpus; if the set of words
in the vocabulary is V, |V| is the size of the vocabulary.
Tokens are the total number N of running words.
Herdan's Law: |V| = kN^β, where k is a positive constant and often 0.67 < β < 0.75,
i.e., vocabulary size grows faster than the square root of the number of word
tokens.
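A quick numerical illustration of the law; the values of k and β below are hypothetical and chosen only to show the shape of the growth, since the slide gives only the typical range of β:

```python
# Hypothetical constants: Herdan's Law only fixes the form |V| = k * N**beta
k, beta = 10, 0.7

for n_tokens in (10_000, 1_000_000, 100_000_000):
    vocab_size = k * n_tokens ** beta
    print(f"N = {n_tokens:>11,}  predicted |V| ~ {vocab_size:,.0f}")
```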
How many words in a corpus?
 Example:
they lay back on the San Francisco grass and looked at the stars
and their
How many?
 tokens: ----------
 types: ----------
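One way to check the counts is to split the sentence on whitespace, as in this small sketch (it treats "San" and "Francisco" as separate tokens and does no case folding):

```python
text = "they lay back on the San Francisco grass and looked at the stars and their"

tokens = text.split()          # every running word, repeats included
types = set(tokens)            # distinct word forms only

print(len(tokens), "tokens")   # "the" and "and" are counted each time they occur
print(len(types), "types")
```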
Corpora
Words don't appear out of nowhere!
A text is produced by
• a specific writer(s),
• at a specific time,
• in a specific variety,
• of a specific language,
• for a specific function.
Corpora vary along dimensions like
◦ Language: 7097 languages in the world
◦ Variety, like African American Language varieties.
◦ AAE Twitter posts might include forms like "iont" (I don't)
◦ Code switching, e.g., Spanish/English, Hindi/English:
S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! it was beautiful:) ]
H/E: dost tha or rahega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
◦ Genre: newswire, fiction, scientific articles, Wikipedia
◦ Author Demographics: writer's age, gender, ethnicity, socioeconomic status (SES)
Pre-processing text data
Cleaning up the text data is necessary to highlight the attributes that we want
our model to pick up on. Cleaning the data typically consists of a number of
steps (steps 1-5 are sketched in the example after this list):
1. Remove Punctuations
2. Converting text to lowercase
3. Tokenization
4. Remove stop words
5. Lemmatization / stemming
6. Vectorization
7. Feature Engineering
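A minimal sketch of steps 1-5 using NLTK (an assumption about the toolkit; the required tokenizer, stop-word, and WordNet resources must be downloaded with nltk.download() first). Vectorization and feature engineering depend on the downstream model, so they are left out:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    # 1. Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Convert text to lowercase
    text = text.lower()
    # 3. Tokenization
    tokens = nltk.word_tokenize(text)
    # 4. Remove stop words
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]
    # 5. Lemmatization
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("The cats were sitting on the mats, watching the birds!"))
```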
Tokenization
Space-based tokenization
A very simple way to tokenize
◦ For languages that use space characters between words
◦ Arabic, Cyrillic, Greek, Latin, etc., based writing systems
◦ Segment off a token between instances of spaces
Unix tools for space-based tokenization
◦ The "tr" command
◦ Inspired by Ken Church's UNIX for Poets
◦ Given a text file, output the word tokens and their frequencies
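A Python analogue of that tr-based pipeline (the file name input.txt is a placeholder): lowercase the text, split on non-alphabetic characters, and print the most frequent tokens:

```python
import re
from collections import Counter

with open("input.txt", encoding="utf-8") as f:          # placeholder file name
    text = f.read().lower()

tokens = [t for t in re.split(r"[^a-z]+", text) if t]   # split on non-letters
counts = Counter(tokens)

for word, freq in counts.most_common(10):
    print(freq, word)
```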
Issues in Tokenization
Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses ([email protected])
Clitic: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
◦ New York, rock ’n’ roll
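A rule-based sketch that keeps several of these cases intact with one regular expression; the pattern is illustrative, not a complete tokenizer:

```python
import re

# Illustrative pattern: URLs, e-mail addresses, hashtags, prices/dates,
# abbreviations such as m.p.h. or Ph.D., words with internal apostrophes
# (clitics like we're or j'ai), and leftover punctuation as separate tokens.
PATTERN = r"""(?x)
    https?://\S+                     # URLs
  | \w+@\w+(?:\.\w+)+                # e-mail addresses
  | \#\w+                            # hashtags
  | \$?\d+(?:[./]\d+)*               # prices and dates, e.g. $45.55, 01/02/06
  | [A-Za-z]+(?:\.[A-Za-z]+)+\.?     # abbreviations, e.g. m.p.h., Ph.D.
  | \w+(?:['’]\w+)*                  # words, keeping internal apostrophes
  | [^\w\s]                          # any remaining punctuation mark
"""

def tokenize(text):
    return re.findall(PATTERN, text)

print(tokenize("Email someone@example.com about the $45.55 Ph.D. fee, we're at #nlproc!"))
```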
Tokenization in languages without spaces
Many languages (like Chinese, Japanese) don't use
spaces to separate words!
姚明进入总决赛
“Yao Ming reaches the finals”
3 words?
姚明 进入 总决赛
YaoMing reaches finals
5 words?
姚 明 进入 总 决赛
Yao Ming reaches overall finals
Word tokenization / segmentation
So in Chinese it's common to just treat each character
(zi) as a token.
• So the segmentation step is very simple
In other languages (like Thai and Japanese), more
complex word segmentation is required.
• The standard algorithms are neural sequence models
trained by supervised machine learning.
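For the character-as-token approach in Chinese, the segmentation step really is a one-liner:

```python
# Treat each character (zi) as its own token
sentence = "姚明进入总决赛"       # "Yao Ming reaches the finals"
tokens = list(sentence)
print(tokens)                    # ['姚', '明', '进', '入', '总', '决', '赛']
```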
Another option for text tokenization
Instead of
• white-space segmentation
• single-character segmentation
Use the data to tell us how to tokenize.
Subword tokenization: a technique used in natural language
processing (NLP) that involves breaking words down into smaller
subwords or pieces.
Ex: “football” might be split into “foot” and “ball”
Subword tokenization
Three common algorithms:
◦ Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
◦ Unigram language modeling tokenization (Kudo,
2018)
◦ WordPiece (Schuster and Nakajima, 2012)
All have 2 parts:
◦ A token learner that takes a raw training corpus and
induces a vocabulary (a set of tokens).
◦ A token segmenter that takes a raw test sentence and
tokenizes it according to that vocabulary
Byte Pair Encoding (BPE) token learner
Let vocabulary be the set of all individual characters
= {A, B, C, D,…, a, b, c, d….}
Repeat:
◦ Choose the two symbols that are most frequently
adjacent in the training corpus (say 'A', 'B')
◦ Add a new merged symbol 'AB' to the vocabulary
◦ Replace every adjacent 'A' 'B' in the corpus with 'AB'.
Until k merges have been done.
BPE token learner algorithm
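A compact Python sketch of the learner loop just described. The toy corpus (word frequencies, with each word split into characters plus the end-of-word symbol '_') and the function names are illustrative, in the spirit of the textbook example:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count how often each pair of symbols is adjacent, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every adjacent occurrence of the pair with one merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_word = " ".join(out)
        merged[new_word] = merged.get(new_word, 0) + freq
    return merged

def bpe_learn(corpus, k):
    """Do k merges; return the learned merges in the order they were made."""
    merges = []
    for _ in range(k):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequently adjacent pair
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges

# Toy training corpus: 5 x "low", 2 x "lowest", 6 x "newer", 3 x "wider", 2 x "new"
corpus = {
    "l o w _": 5,
    "l o w e s t _": 2,
    "n e w e r _": 6,
    "w i d e r _": 3,
    "n e w _": 2,
}
print(bpe_learn(corpus, 8))
# First merges: ('e', 'r'), then ('er', '_'), then ('n', 'e'), ...
```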
Byte Pair Encoding (BPE) Addendum
Most subword algorithms are run inside space-separated tokens.
So we commonly first add a special end-of-word symbol '_' before each
space in the training corpus.
Next, separate each word into individual letters.
BPE token segmenter algorithm
On the test data, run each merge learned from the
training data:
◦ Greedily
◦ In the order we learned them
◦ (test frequencies don't play a role)
So: merge every e r to er, then merge er _ to er_, etc.
Result:
◦ Test set "n e w e r _" would be tokenized as a full word
◦ Test set "l o w e r _" would be two tokens: "low er_"
Properties of BPE tokens
Usually include frequent words
And frequent subwords
• Which are often morphemes like -est or -er
A morpheme is the smallest meaning-bearing unit of a
language
• unlikeliest has 3 morphemes un-, likely, and -est
Word Normalization
Putting words/tokens in a standard format
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are
Case folding
Applications like IR: reduce all letters to lower case
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
◦ e.g., General Motors
◦ Fed vs. fed
◦ SAIL vs. sail
For sentiment analysis, MT, Information extraction
◦ Case is helpful (US versus us is important)