
Deep Learning for NLP:

Introduction to NLP

Ashish Anand
Professor, Dept. of CSE, IIT Guwahati
Associated Faculty, Mehta Family School of Data Science and AI, IIT Guwahati
Day Agenda

• Talk 1: Introduction to Natural Language Processing (NLP)


• Speaker: Dr Ashish Anand

• Talk 2: Neural Models for NLP


• Speaker: Dr Ashish Anand

• Hands-on Session
• Speaker: Dr Aparajita Dutta
Objective

• A short and quick introduction to NLP

• Enable the self-motivated learner to get started working with NLP
Outline: Introduction to NLP

• What is NLP?
• Daily life NLP Applications
• Generic Formulation
• Getting Started with NLP
• Statistical Language Models
Defining NLP
What do we mean by NLP?

• Natural Language – Written or Spoken language used by humans.


Example: Assamese, Bengali, Hindi, Sanskrit, English, German, …

• NLP – Computational methods to learn, understand, and generate natural language content

• Multiple distinct fields study human language: Linguistics, Speech Recognition, Computational Linguistics, etc.
Different Levels of NLP
• Word level
• Phonetics and Phonology: study of linguistic sounds
• Morphology: study of the meaningful components of words [example]

• Syntax: structural relationships between words

• Semantics: study of meaning
• Lexical semantics: study of the meanings of individual words
• Compositional semantics: how word meanings combine

• Pragmatics and Discourse: dealing with units larger than a sentence: paragraphs, documents
Daily Life Applications
Application I: Automatic text completion
Application II: Spelling Correction

• Spelling correction: "Study was conducted by students" vs. "study was conducted be students"
Application III: Sentiment Classification

I like this laptop

My new laptop is not good for computationally intensive tasks

Watching a lecture on my new laptop
Application IV: Named Entity Recognition
(NER)
Application V: Machine Translation

• Source Sentence: I have asked him to do homework

• Target Sentence: मैंने उसे होमवर्क करने के लिए कहा [I asked him to do the homework]
VI: Words, Meaning and Representation

• Similar Words

• Synonyms

• Word Sense Disambiguation


• I went to a bank to deposit money
• I went to a bank to see calm water currents
Many more applications …..
Formulation
Generic Formulation
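Putting together the Search and Learning pieces on the following slides, one common way to write this generic formulation, using the scoring function Ψ and parameters θ referred to below (the exact notation is an assumption), is:

$$\hat{y} = \underset{y \in \mathcal{Y}(x)}{\arg\max}\; \Psi(x, y; \theta)$$

where x is the input text, y ranges over the candidate outputs (labels, tag sequences, translations, ...), Ψ scores how well output y fits input x, and θ denotes the model parameters.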
Search

• Computes the argmax of the function Ψ

• Often relies on the machinery of combinatorial optimization, as the outputs are discrete variables

• Ranges from simple search algorithms to dynamic programming and beam search
Learning

• Finding the model parameters θ

• Mostly, again, an optimization problem

• Relies on numerical optimization, as the parameters are often continuous
Basic Formulation

• Word => Vector Representation

• Text Classification

• Sequence Labelling Problems

• Sequence to Sequence Learning Problems


Getting Started with
NLP
Source: Corpus
• Corpus (plural : corpora)
• Special collection of texts collected according to a predefined set of criteria
• May be available pre-processed and linguistically marked up, or in raw format

• Different types of corpora


• Monolingual
• Parallel: bilingual or multilingual [Vary at the alignment level]
• Comparable: bilingual or multilingual
• Learner Corpus
• Diachronic Corpus
Examples of Corpus

Corpus                              Tokens         Types
Switchboard phone conversations     2.4 million    20,000
Shakespeare                         884,000        31,000
Brown                               1 million      38,000
Google N-grams                      1 trillion     13 million

Two ways to talk about words:


1. Tokens: each occurrence of all words is counted
2. Types: number of distinct words
More Examples of Corpora
• Access to multiple corpora through tools like NLTK (see the sketch after this list)
• Building corpora from databases such as PubMed, free text from the web, Wikipedia, social media platforms, etc.
• Task specific
• Shared task challenges: ACE, CoNLL, SemEval, BioASQ, SQuAD, CORD-19

• Caution: One shoe does not fit all.


• Caution: Ethical and Bias Issues
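As a concrete starting point, here is a minimal sketch of accessing a corpus through NLTK and counting tokens vs. types; it assumes the nltk package is installed and that downloading the Brown corpus is permitted.

```python
# Minimal sketch: load the Brown corpus via NLTK and count tokens vs. types.
# Assumes the `nltk` package is installed and corpus download is allowed.
import nltk

nltk.download("brown")                    # one-time download of the corpus
from nltk.corpus import brown

tokens = brown.words()                    # every word occurrence (tokens)
types = set(w.lower() for w in tokens)    # distinct words (types)
print(len(tokens), len(types))
```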
Text Preprocessing
• Removing non-text (e.g. tags, ads)
• Text Normalization
• Segmentation: Word and Sentence Segmentation
• Normalizing Word Formats
• Spelling Variations: Labeled/labelled
• Capitalization: Led/LED
• Lemmatization
• Stemming
• Morphological analysis: dealing with smallest meaning-bearing units
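To illustrate the normalization steps above, a minimal sketch contrasting stemming and lemmatization with NLTK; it assumes nltk and its WordNet data are available, and the word list is purely illustrative.

```python
# Minimal sketch: stemming vs. lemmatization with NLTK.
# Assumes the `nltk` package is installed and WordNet data can be downloaded.
import nltk

nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "labelled", "crashed"]:
    # stemming chops suffixes heuristically; lemmatization maps to a dictionary form
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))
```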
Text normalization
Tokenization: Word Segmentation
Definition
• Process of dividing the input text into units, also called tokens, where each unit is a word, a number, or a punctuation mark.
What counts as a word?

I am interested in Natural Language Processing, but I’m not sure of the required prerequisites.
What counts as a word?

• Should I count punctuation as a word?


• Should I treat I’m as one word or break it into three tokens: I, ’, m? [Clitic]
• Should I consider “Natural Language Processing” as one word or three words?
Challenges in defining a word as contiguous alphanumeric characters
• Too restrictive
• Should we consider “$12.20” or “Micro$oft” or “:)” as a word?

• We can expect several variants, especially in forums like Twitter, that may not obey the exact definition but should still be counted as words

• Simple heuristic: Whitespace
• “a space or tab or a new line” between words
• Still leaves several issues to deal with
Some challenges with simple
heuristics
• Periods
• Wash. vs wash
• Abbreviations at the end vs. in the middle – e.g. etc.
• More on this while discussing sentence segmentation

• Single apostrophes
• Contractions such as I’ll, I’m, etc.: should they be treated as one word or two?
• The Penn Treebank splits such contractions.
• Phrases such as dog’s vs. yesterday’s in “The house I rented yesterday’s
garden is really big”.
• Orthographic-word-final single quotation (often comes at the end of
sentence/quoted fragment) and cases like (plural possessive) “boys’ toys”.
Defining words: Problems: Spoken
Corpora
• This lecture umm is main- mainly divided into two components

• Two types of disfluencies


• Fragments: main-
• Fillers/Filled pauses: uh.. Umm..
Some other issues

• Quite a large vocabulary
• Restricting the vocabulary size aggravates the OOV problem
• No implicit notion of similar words
• Each word is given a distinct id
Tokenization in Practice

• Deterministic algorithms based on regular expressions


• Compiled into efficient finite state automata
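A minimal sketch of such a rule-based tokenizer using Python regular expressions; the pattern below is illustrative, not the exact one compiled by any particular toolkit.

```python
# Minimal sketch: a rule-based tokenizer built from a regular expression.
# The pattern is illustrative and intentionally small.
import re

TOKEN_RE = re.compile(r"""
    \$?\d+(?:\.\d+)?      # currency and numbers, e.g. $12.20
  | \w+(?:'\w+)?          # words with an optional internal apostrophe, e.g. I'm
  | [^\w\s]               # any remaining punctuation mark
""", re.VERBOSE)

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("I'm not sure of the required prerequisites, but it costs $12.20."))
```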
Word segmentation in other
languages
• 请将这句话翻译成中文 [Please translate this sentence into Chinese]
• Languages like Chinese, Japanese have no spaces between words
• Japanese is further complicated with multiple alphabets intermingled

• Compound nouns written as a single word


• Lebensversicherungsgesellschaftsangestellter [life insurance company employee]
Word Tokenization in Chinese

• Chinese words are composed of characters


• Characters are generally 1 syllable and 1 morpheme.
• Average word is 2.4 characters long.
• Standard baseline segmentation algorithm:
• Maximum Matching (also called Greedy)

Source: SLP-Slides-Chap2
Maximum Matching
Word Segmentation Algorithm
• Given a wordlist of Chinese, and a string.
1) Start a pointer at the beginning of the string
2) Find the longest word in dictionary that matches the string starting
at pointer
3) Move the pointer over the word in string
4) Go to 2

Source: SLP-Slides-Chap2
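A minimal sketch of the MaxMatch procedure above; the wordlists are toy examples, while real systems use a full dictionary.

```python
# Minimal sketch of MaxMatch (greedy) word segmentation.
def max_match(text: str, wordlist: set[str]) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # try the longest dictionary word starting at position i
        for j in range(len(text), i, -1):
            if text[i:j] in wordlist or j == i + 1:  # fall back to a single character
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(max_match("thecatinthehat", {"the", "cat", "in", "hat"}))
# ['the', 'cat', 'in', 'the', 'hat']
print(max_match("thetabledownthere", {"the", "theta", "table", "bled", "down", "own", "there"}))
# ['theta', 'bled', 'own', 'there']  -- the greedy failure case shown on the next slide
```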
Max-match segmentation
illustration
• Thecatinthehat → the cat in the hat
• Thetabledownthere → theta bled own there [intended: the table down there]
• Doesn’t generally work in English!
• But works astonishingly well in Chinese
• 莎拉波娃现在居住在美国东南部的佛罗里达。
• 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
• Modern probabilistic segmentation algorithms even better

Source: SLP-Slides-Chap2
Subword Tokenization: Motivation
• Frequent words should be identified as a token

• Rare words should be broken into meaningful subword tokens:

• Unknowingly : “un”, “know”, “ing”, “ly”


• Helps in taking care of OOV, rare and related words

• Reasonable vocabulary size

• To make it language independent


Subword Tokenization: Popular
Methods
• Byte Pair Encoding (BPE)1
• Wordpiece2
• Similar to BPE, except the merging criterion is different
• Unigram3 and Sentencepiece4
• Rely on unigram language model
• Language independent

1. Sennrich et al. 2015. Neural machine translation of rare words with subword units. ACL 2016
2. Schuster and Nakajima. 2012. Japanese and Korean voice search. ICASSP 2012
3. Kudo. 2018. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018
4. Kudo et al. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text
Processing. EMNLP 2018 (demo paper)
Byte Pair Encoding

• Used for data compression in Information theory

• Idea: Iteratively merge the most frequent byte pair into a new byte not present in the data.
BPE for Word Tokenization

• Assumption: the corpus has already been tokenized into words


• Step 1: Count the frequency of each word appearing in the given
corpus.
• Step 2: Append a special token "<E>" to each word, signifying the end of a word.
• Step 3: Break each word into their constituent characters. So a word
"exam" will be converted into a sequence of characters
["e","x","a","m","<E>"].
BPE for Word Tokenization

• Step 4: In each iteration, count the frequency of every consecutive symbol pair and merge the most frequent pair into a single new symbol.

• Step 5: Stop after a fixed number of iterations (i.e. merge operations)


or after obtaining a maximum number of tokens.
BPE Tokenization: Illustration

• Dictionary
• {'low<E>': 5, 'lower<E>': 2, 'newest<E>': 6, 'widest<E>': 3}
• Vocabulary on characters
• {'d','e','i','l','n','o','r','s','t','w','<E>'}
• 1st iteration: {'d','e','i','l','n','o','r','s','t','w','<E>','es'} [e and s occurred together 9 times]
• 2nd iteration: {'d','e','i','l','n','o','r','s','t','w','<E>','es','est'}
• And So on.
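A minimal sketch of learning BPE merges on the toy dictionary above; it is a simplified version in the spirit of Sennrich et al. (2016), and the function names are illustrative.

```python
# Minimal sketch: learning BPE merge rules on a toy dictionary.
from collections import Counter

def get_pair_counts(vocab: dict[tuple, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair: tuple, vocab: dict) -> dict:
    """Replace every occurrence of the chosen pair with the merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# words are pre-split into characters plus the end-of-word marker <E>
vocab = {('l','o','w','<E>'): 5, ('l','o','w','e','r','<E>'): 2,
         ('n','e','w','e','s','t','<E>'): 6, ('w','i','d','e','s','t','<E>'): 3}

for step in range(3):                      # number of merge operations
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print(step + 1, best)                  # 1 ('e','s'), 2 ('es','t'), ...
```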
BPE Tokenization: Encoding: Text
Data Tokenization

• Question: How to tokenize a given sequence of words into


learned tokens?

• Answer
• Idea: Apply the learned merge rules in the order they were learned.
• Segment each test word into characters
• Apply first merge rule [Our example, merge ‘e’ and ‘s’]
• Then second and so on…
• Example: newer -> “new” “er_”
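A minimal sketch of this encoding step, continuing the toy example above: a new word is split into characters plus the end-of-word marker, and the learned merges are applied in order.

```python
# Minimal sketch: encode a new word with a list of learned BPE merges,
# applied in the order they were learned.
def encode(word: str, merges: list[tuple]) -> list[str]:
    symbols = list(word) + ["<E>"]
    for a, b in merges:                       # apply each merge rule in order
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == (a, b):
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# first three merges from the toy example above
print(encode("newest", [("e", "s"), ("es", "t"), ("est", "<E>")]))
# ['n', 'e', 'w', 'est<E>']
```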
Tools to get started with NLP

Source: https://medium.com/microsoftazure/7-amazing-open-source-nlp-tools-to-try-with-notebooks-in-2019-c9eec058d9f1
Defining Language Model
Let’s look at some examples
• Predicting next word
• I am planning ……..

• Speech Recognition
• I saw a van vs eyes awe an
Example continued
• Spelling correction
• "Study was conducted by students" vs. "study was conducted be students"
• "Their are two exams for this course" vs. "There are two exams for this course"

• Machine Translation
• I have asked him to do homework
• मैंने उससे पूछा कि होमवर्क करने के लिए [awkward, incomplete translation]
• मैंने उसे होमवर्क करने के लिए कहा [I asked him to do the homework]
In each of the examples, the objective is either

• To find the most probable next word
• To find which sentence is more likely


Language Models (LM)
• Models assigning probabilities to a sequence of words

• P(I saw a van) > P(eyes awe an)

• P(मैंने उससे पूछा कि होमवर्क करने के लिए) < P(मैंने उसे होमवर्क करने के लिए कहा)
Defining LM formally
Estimating Probability of a
sequence
• Our task is to compute
P(I, am, fascinated, with, recent, advances, in, AI)

• Chain Rule
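The chain rule decomposition referred to above takes the standard form:

$$P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$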
Estimating P(w1, w2, .., wn)
• Too many possible sentences
• Data sparseness
• Poor generalizability
Markov Assumption
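In its standard form, the Markov assumption approximates the full history by the most recent k words:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \ldots, w_{i-1})$$

with $k = 1$ giving the bigram model and $k = 2$ the trigram model.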

MLE of N-gram models
• Unigram (Simplest Model)

• Bigram (1st order Markov Model)

• Trigram (2nd order Markov Model)
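The corresponding maximum-likelihood estimates take the standard form, consistent with the count notation used in the smoothing slides later:

$$p_{mle}(w_i) = \frac{c(w_i)}{\sum_{w} c(w)}, \qquad
p_{mle}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}, \qquad
p_{mle}(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$$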


Trigram Model in Summary
Problem with MLE

• Works well if the test corpus is very similar to the training corpus, which is generally not the case

• Sparsity issue
• OOV: can be handled with an <UNK> category
• Some words are present in the corpus, but the relevant counts are zero
• Leads to underestimation of such probabilities
N-gram Model: Issue
• Long-distance dependencies

“The computer which I had just put into the lab on the fifth floor
crashed”
Smoothing Techniques
Simplest Approach: Additive
Smoothing
• Add-1 Smoothing

$$p_{add}(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i) + 1}{c(w_{i-2}, w_{i-1}) + |\mathcal{V}|}$$

• Generalized version (add-$\delta$)

$$p_{add}(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i) + \delta}{c(w_{i-2}, w_{i-1}) + \delta\,|\mathcal{V}|}$$
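The same idea for bigrams, as a minimal sketch on a toy corpus; the corpus and the δ value are illustrative.

```python
# Minimal sketch: add-delta smoothed bigram estimates on a toy corpus.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def p_add(w, prev, delta=1.0):
    """Add-delta smoothed estimate of P(w | prev)."""
    return (bigrams[(prev, w)] + delta) / (unigrams[prev] + delta * len(vocab))

print(p_add("cat", "the"))   # seen bigram:   (2 + 1) / (3 + 6)
print(p_add("on", "cat"))    # unseen bigram: (0 + 1) / (2 + 6)
```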
Take the help of lower order models
• Bigram example
• Suppose c(w1, w2) = 0 = c(w1, w2’)

• Then padd (w2 | w1) = padd (w2’ | w1)

• Let’s assume p(w2’) < p(w2)

• We would instead expect padd (w2 | w1) > padd (w2’ | w1)


Take the help of lower order models

• Linear Interpolation Models

• Discounting Models
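For reference, a common form of the linear interpolation model for bigrams, which mixes in the lower-order unigram estimate:

$$p_{interp}(w_i \mid w_{i-1}) = \lambda\, p_{mle}(w_i \mid w_{i-1}) + (1 - \lambda)\, p_{mle}(w_i), \qquad 0 \le \lambda \le 1$$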
References

• Jurafsky and Martin, Speech and Language Processing, 3rd Ed. Draft [Available at https://web.stanford.edu/~jurafsky/slp3/]
Thanks!
Questions and Comments!

[email protected]
https://www.iitg.ac.in/anand.ashish
