Lecture 2 NLP

Uploaded by

Youssef Mohamed

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

3 views

Lecture 2 NLP

Uploaded by

Youssef Mohamed

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 27

‫سورة طه‪26-25 :‬‬

‫‪1‬‬
BY
DR. BELAL BADAWY AMIN

2
Introduction
Structured Data Unstructured Data
Can be displayed in rows, columns and Can't be displayed in rows, columns and
relational data base relational data base

Numbers, dates and strings Images, audio, video, word files, e-

mail…......etc
Estimated 20% of enterprise data Estimated 80% of enterprise data
Require less storage Require more storage
Easier to manage and protect with More difficult to manage and protect
legacy solutions with legacy solutions
Main Approaches in NLP (TimeLine)
1.Rule Based Approaches (slow manner)
Regular expression
Context –free grammars
2. Machine learning or Traditional (increase performance and accuracy)
Linear classifier
Likelihood maximization
3. Deep Learning
Convolution neural network
Recurrent neural network (more efficient and performance)
Why NLP is important?
 NLP is everywhere even if we don’t realize it
 The majority of activities performed by humans are done through
language
 There are millions of gigabytes of data generating by social media,
Apps messages and so on
 All these channels are generating large amount of text data every
second
 And because of the large volumes of text data as well as highly
unstructured data. We can no longer use the common approach to
understand the text and this is where NLP comes.
Why NLP is difficult?
 It’s the nature of human language that makes NLP difficult
 Humans gets the edge due to the communication skills he has
 There are hundreds of natural languages each of which has different
syntax rules, words can be ambiguous where there meaning it
dependent on their context.
 The rules that dictate the passing of information using natural
languages are not easy for computers to understand.
Techniques of NLP
1. Syntax analysis: refer to the arrangement of words in a sentence such that they
make grammatical sense. It is used to assess how the natural language align with
the grammatical rules. Here are some syntax rules techniques that can be used:
Lemmatization: it entails reducing the various inflected forms of a word into a single form for
easy analysis.
Stemming: it involves cutting the inflected words to their root form.
Morphological segmentation: it involves dividing words into individual units called
morphemes.
Word segmentation: it involves dividing a large piece of continuous text into distinct units.
Part-of-speech tagging: it involves identifying the part of speech for every word.
Parsing: it involves undertaking grammatical analysis for the provided sentence.
Sentence breaking: it involves placing sentence boundaries on a large space of text.
Techniques of NLP
2. Semantic analysis: refer to the meaning that is conveyed by a text. It is one of
the difficult aspects of NLP that has not been fully resolved yet . It involves
applying computer algorithms to understand the meaning and interrelations of
words and sentences are structured. Here are some techniques can be used:
Named entity recognition (NER): it involves determining the parts of a text that can be
identified and categorized into preset groups.
Word sense disambiguation: it involves giving meaning to a word based on the context .
Natural language generation: it involves using databases to derive semantic intentions and
convert them into human language.
Words and Corpora
Corpora: is a computer-readable collection of text or speech.
Types are the number of distinct words in a corpus; if the set of words
in the vocabulary is V, |V| is size of vocabulary.
Tokens are the total number N of running words.
Herdan's Law = where often .67 < β < .75
i.e., vocabulary size grows with > square root of the number of word
tokens
How many words in a corpus?
 Example:
they lay back on the San Francisco grass and looked at the stars
and their
How many?

tokens ----------.
types ----------.
Corpora
Words don't appear out of nowhere!
A text is produced by
• a specific writer(s),
• at a specific time,
• in a specific variety,
• of a specific language,
• for a specific function.
Corpora vary along dimension like
◦ Language: 7097 languages in the world
◦ Variety, like African American Language varieties.
◦ AAE Twitter posts might include forms like "iont" (I don't)
◦ Code switching, e.g., Spanish/English, Hindi/English:
S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! it was beautiful:) ]
H/E: dost tha or ra- hega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
◦ Genre: newswire, fiction, scientific articles, Wikipedia
◦ Author Demographics: writer's age, gender, ethnicity, SES
Pre-processing text data
Cleaning up the text data is necessary to highlight attributes that we are
going to want our model to pick up on. Cleaning the data typically consist of
number of steps:
1. Remove Punctuations
2. Converting text to lowercase
3. Tokenization
4. Remove stop – word
5. Lemmatization / stemming
6. Vectorization
7. Feature Engineering
Tokenization
Space-based tokenization
A very simple way to tokenize
◦ For languages that use space characters between words
◦ Arabic, Cyrillic, Greek, Latin, etc., based writing systems
◦ Segment off a token between instances of spaces
Unix tools for space-based tokenization
◦ The "tr" command
◦ Inspired by Ken Church's UNIX for Poets
◦ Given a text file, output the word tokens and their frequencies
Issues in Tokenization
Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses ([email protected])
Clitic: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
◦ New York, rock ’n’ roll
Tokenization in languages without spaces
Many languages (like Chinese, Japanese) don't use
spaces to separate words!
姚明进入总决赛
“Yao Ming reaches the finals”
3 words?
姚明进入总决赛
YaoMing reaches finals 5 words?
姚明进入总决赛
Yao Ming reaches overall finals
Word tokenization / segmentation
So in Chinese it's common to just treat each character
(zi) as a token.
• So the segmentation step is very simple
In other languages (like Thai and Japanese), more
complex word segmentation is required.
• The standard algorithms are neural sequence models
trained by supervised machine learning.
Another option for text tokenization
Instead of
• white-space segmentation
• single-character segmentation
Use the data to tell us how to tokenize.
Subword tokenization (is a technique used in natural language
processing (NLP) that involves breaking down words into smaller
subwords or pieces.)
Ex: “football” might be split into “foot”, and “ball”
Sub word tokenization
Three common algorithms:
◦ Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
◦ Unigram language modeling tokenization (Kudo,
2018)
◦ Word Piece (Schuster and Nakajima, 2012)
All have 2 parts:
◦ A token learner that takes a raw training corpus and
induces a vocabulary (a set of tokens).
◦ A token segmenter that takes a raw test sentence and
tokenizes it according to that vocabulary
Byte Pair Encoding (BPE) token
learner
Let vocabulary be the set of all individual characters
= {A, B, C, D,…, a, b, c, d….}
Repeat:
◦ Choose the two symbols that are most frequently
adjacent in the training corpus (say 'A', 'B')
◦ Add a new merged symbol 'AB' to the vocabulary
◦ Replace every adjacent 'A' 'B' in the corpus with 'AB'.
Until k merges have been done.
BPE token learner algorithm
Byte Pair Encoding (BPE) Addendum
Most subword algorithms are run inside space-
separated tokens.
So we commonly first add a special end-of-word
symbol '__' before space in training corpus
Next, separate into letters.
BPE token segmenter algorithm
On the test data, run each merge learned from the
training data:
◦ Greedily
◦ In the order we learned them
◦ (test frequencies don't play a role)
So: merge every e r to er, then merge er _ to er_, etc.
Result:
◦ Test set "n e w e r _" would be tokenized as a full word
◦ Test set "l o w e r _" would be two tokens: "low er_"
Properties of BPE tokens
Usually include frequent words
And frequent subwords
• Which are often morphemes like -est or –er
A morpheme is the smallest meaning-bearing unit of a
language
• unlikeliest has 3 morphemes un-, likely, and -est
Word Normalization
Putting words/tokens in a standard format
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are
Case folding
Applications like IR: reduce all letters to lower case
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
◦ e.g., General Motors
◦ Fed vs. fed
◦ SAIL vs. sail
For sentiment analysis, MT, Information extraction
◦ Case is helpful (US versus us is important)

Project +Sweta+Kumari+ +FRA+Milestone+1+ July+ 2021
100% (2)
Project +Sweta+Kumari+ +FRA+Milestone+1+ July+ 2021
31 pages
AI6122 Topic 1.2 - WordLevel
No ratings yet
AI6122 Topic 1.2 - WordLevel
63 pages
NLP m2
No ratings yet
NLP m2
71 pages
Week 2
No ratings yet
Week 2
90 pages
C10_AI_UNIT 3_NLP_ HALF YEARLY
No ratings yet
C10_AI_UNIT 3_NLP_ HALF YEARLY
37 pages
Introduction To NLP
No ratings yet
Introduction To NLP
68 pages
Unit-6 Natural Language Processing
No ratings yet
Unit-6 Natural Language Processing
7 pages
Intro To NLP: Natural Language Toolkit
No ratings yet
Intro To NLP: Natural Language Toolkit
11 pages
AI_NLP
No ratings yet
AI_NLP
9 pages
Natural Language Processing (NLP) & Computational Linguistics
No ratings yet
Natural Language Processing (NLP) & Computational Linguistics
60 pages
NLPNotes
No ratings yet
NLPNotes
12 pages
Kuhlmann - Introduction To Computational Linguistics (Slides) (2015)
100% (1)
Kuhlmann - Introduction To Computational Linguistics (Slides) (2015)
66 pages
text-processing
No ratings yet
text-processing
114 pages
5 BASIC TEXT PROCESSING
No ratings yet
5 BASIC TEXT PROCESSING
6 pages
Natural Language Process (NLP)
No ratings yet
Natural Language Process (NLP)
29 pages
Natural Language Processing Revision Notes
No ratings yet
Natural Language Processing Revision Notes
4 pages
NLP Final
No ratings yet
NLP Final
4 pages
Unit I Inroduction
No ratings yet
Unit I Inroduction
52 pages
AI_M3_Merged.pdf
No ratings yet
AI_M3_Merged.pdf
98 pages
Lectures 2 - MS CLASS Words and Text Classification
No ratings yet
Lectures 2 - MS CLASS Words and Text Classification
102 pages
2 TextProc 2023
No ratings yet
2 TextProc 2023
35 pages
NLP Unit 1
No ratings yet
NLP Unit 1
15 pages
Seminar Report1
No ratings yet
Seminar Report1
17 pages
Bhawini NLP Practical
No ratings yet
Bhawini NLP Practical
98 pages
NLP - Srilakshmi H - PPT Assignment
No ratings yet
NLP - Srilakshmi H - PPT Assignment
29 pages
Natural Language Processing Dossier 20231110 141736 0000
No ratings yet
Natural Language Processing Dossier 20231110 141736 0000
114 pages
Introduction To Natural Language Processing and NLTK
No ratings yet
Introduction To Natural Language Processing and NLTK
23 pages
NLP Notes Unit-1
No ratings yet
NLP Notes Unit-1
20 pages
NLP
No ratings yet
NLP
16 pages
NLP_AI_X
No ratings yet
NLP_AI_X
6 pages
Natural Language Processing
No ratings yet
Natural Language Processing
30 pages
Chapter-1 Introduction To NLP
No ratings yet
Chapter-1 Introduction To NLP
12 pages
nlp
No ratings yet
nlp
19 pages
Chapter 6.
No ratings yet
Chapter 6.
31 pages
NLP m1
No ratings yet
NLP m1
148 pages
Natural Language Processing Notes by Prof. Suresh R. Mestry: L I L L L I
No ratings yet
Natural Language Processing Notes by Prof. Suresh R. Mestry: L I L L L I
41 pages
NLP Insem Notes
No ratings yet
NLP Insem Notes
13 pages
Natural Language Processing
No ratings yet
Natural Language Processing
21 pages
lec2
No ratings yet
lec2
21 pages
Natural Language Processing
No ratings yet
Natural Language Processing
6 pages
UNIT I
No ratings yet
UNIT I
12 pages
Text Preprocessing
No ratings yet
Text Preprocessing
59 pages
AIML-HC Mod 04
No ratings yet
AIML-HC Mod 04
71 pages
NLP-UNIT-1-1
No ratings yet
NLP-UNIT-1-1
67 pages
Seminar On Natural Language Processing
No ratings yet
Seminar On Natural Language Processing
21 pages
NLP Digital Notes
No ratings yet
NLP Digital Notes
128 pages
Natural Language Processing CS 1462: Some Slides Borrows From Carl Sable
No ratings yet
Natural Language Processing CS 1462: Some Slides Borrows From Carl Sable
54 pages
NLP Steps Basic
No ratings yet
NLP Steps Basic
26 pages
404-BA-Chapter V
No ratings yet
404-BA-Chapter V
22 pages
Module 1.1
No ratings yet
Module 1.1
9 pages
1 Introduction
No ratings yet
1 Introduction
99 pages
Ch1 B NLP Introduction
No ratings yet
Ch1 B NLP Introduction
43 pages
تعلم ML4 (1)
No ratings yet
تعلم ML4 (1)
42 pages
38. Natural Language Processing (1) Copy
No ratings yet
38. Natural Language Processing (1) Copy
30 pages
Assignment of AI Finished
No ratings yet
Assignment of AI Finished
16 pages
NLP Unit-1 - 1
No ratings yet
NLP Unit-1 - 1
24 pages
Unit 1 Extra
No ratings yet
Unit 1 Extra
6 pages
5.2 Natural Language Processing
No ratings yet
5.2 Natural Language Processing
43 pages
Week 6: Introduction To Natural Language Processing
No ratings yet
Week 6: Introduction To Natural Language Processing
18 pages
Corpora
No ratings yet
Corpora
48 pages
Natural Language Processing
From Everand
Natural Language Processing
Ajit Singh
No ratings yet
Email Classification Using Naive Bayes Classifier: Domain Algorithms Framework Platform
No ratings yet
Email Classification Using Naive Bayes Classifier: Domain Algorithms Framework Platform
7 pages
NP wd365 2021 Eom6-1 Jalessaleacock Report 1 1
No ratings yet
NP wd365 2021 Eom6-1 Jalessaleacock Report 1 1
5 pages
Burner F Systems DOC V2 2 en
No ratings yet
Burner F Systems DOC V2 2 en
94 pages
Computer Networking
No ratings yet
Computer Networking
5 pages
MetServiceAWS Network
No ratings yet
MetServiceAWS Network
1 page
ACN Research Assignments
No ratings yet
ACN Research Assignments
5 pages
Black Teal Dark Simple Digital Lifestyle Pitch Deck Presentation - 20231103 - 161929 - 0000
No ratings yet
Black Teal Dark Simple Digital Lifestyle Pitch Deck Presentation - 20231103 - 161929 - 0000
20 pages
Calibration Slider Diagnosis Programs - 95587956
No ratings yet
Calibration Slider Diagnosis Programs - 95587956
42 pages
Utas Thesis Guidelines
100% (3)
Utas Thesis Guidelines
7 pages
MM in SAP PDF
No ratings yet
MM in SAP PDF
64 pages
Database Management Systems
No ratings yet
Database Management Systems
75 pages
Quantopian Platform
No ratings yet
Quantopian Platform
63 pages
TWIX-1 Datasheet
No ratings yet
TWIX-1 Datasheet
2 pages
Blockchain technology_Unit-2
No ratings yet
Blockchain technology_Unit-2
19 pages
MDM System Management Training
No ratings yet
MDM System Management Training
199 pages
Creates A Phone Application
No ratings yet
Creates A Phone Application
11 pages
CCD CMOS WhitePaper
No ratings yet
CCD CMOS WhitePaper
28 pages
Aeb4101 Engineering and Design: Module - 3
No ratings yet
Aeb4101 Engineering and Design: Module - 3
14 pages
Class Documentation Chemserv DLL enUS
No ratings yet
Class Documentation Chemserv DLL enUS
16 pages
Research Social Media Addiction
89% (9)
Research Social Media Addiction
37 pages
Inspection Report-Siemens Sensation 64
No ratings yet
Inspection Report-Siemens Sensation 64
3 pages
How To Change Directory Within Ubuntu WSL in Windows Format - Stack Overflow
No ratings yet
How To Change Directory Within Ubuntu WSL in Windows Format - Stack Overflow
3 pages
Lab 4 All
No ratings yet
Lab 4 All
68 pages
Empowerment Technologies: Quarter 1 Weekly Home Learning Plan - Modular Distance Learning
No ratings yet
Empowerment Technologies: Quarter 1 Weekly Home Learning Plan - Modular Distance Learning
3 pages
Workforce Upskilling For The AI Era
No ratings yet
Workforce Upskilling For The AI Era
16 pages
PP QM Tcodes
No ratings yet
PP QM Tcodes
10 pages
T-Shaped Skills Builder Guide in 2020 For End-To-End Data Scientist
No ratings yet
T-Shaped Skills Builder Guide in 2020 For End-To-End Data Scientist
11 pages
Mining Frequent Patterns Without Candidate Generation
No ratings yet
Mining Frequent Patterns Without Candidate Generation
44 pages
17IT603 - Python & R Programming
No ratings yet
17IT603 - Python & R Programming
3 pages

Lecture 2 NLP

Uploaded by

Lecture 2 NLP

Uploaded by

‫سورة طه‪26-25 :‬‬

Numbers, dates and strings Images, audio, video, word files, e-

You might also like