
Lecture 2: Text Preprocessing

Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA

01 Introduction to NLP
02 Lexical Analysis
03 Syntax Analysis
04 Other Topics in NLP
Syntax Analysis
• Syntax Analysis
✓ The process of analyzing a string of symbols according to the rules of a formal grammar

• Parser
✓ An algorithm that computes a structure for an input string given a grammar
✓ All parsers have two fundamental properties
▪ Directionality: the sequence in which the structures are constructed (e.g., top-down or
bottom-up)
▪ Search strategy: the order in which the search space of possible analyses is explored (e.g.,
depth-first, breadth-first)
Syntax Analysis
• Parsing Representation
✓ Tree vs List

✓ Meaning
▪ S (Sentence) consists of NP (Noun Phrase) and VP (Verb Phrase)
▪ NP consists of Name (John)
▪ VP consists of VERB (ate) and another NP
▪ NP consists of ART (the) and Noun (apple) (see the parsing sketch below)
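A minimal sketch of this parse using NLTK (an assumption for illustration: the lecture does not name a toolkit; the toy grammar below simply encodes the four rules listed above):

# A toy context-free grammar for "John ate the apple", parsed with NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
    S    -> NP VP
    NP   -> NAME | ART N
    VP   -> VERB NP
    NAME -> 'John'
    VERB -> 'ate'
    ART  -> 'the'
    N    -> 'apple'
""")

parser = nltk.ChartParser(grammar)                  # a chart parser over the CFG
for tree in parser.parse(['John', 'ate', 'the', 'apple']):
    tree.pretty_print()                             # the tree representation
    print(tree)                                     # the bracketed list representation

Printing the tree object directly gives the bracketed list form, i.e., the "list" representation contrasted with the tree drawing above.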
Syntax Analysis
• A sentence may have more than one parse tree due to language ambiguity

• Lexical ambiguity
✓ One word can be used for multiple parts of speech
✓ Lexical ambiguity causes structural ambiguity

(Figure: two parse trees in which the word "flies" is assigned different parts of speech)
Syntax Analysis
• Structural Ambiguity
✓ One sentence can be understood in different ways

(Figure: two parse trees for a sentence involving "John", "Mary", and "Park", each corresponding to a different interpretation)
AGENDA

01 Introduction to NLP
02 Lexical Analysis
03 Syntax Analysis
04 Other Topics in NLP
Language Modeling
Jurafsky, Language Modeling

• Probabilistic Language Model


✓ Assign a probability to a sentence (not POS tags, but the sentence itself)

• Applications
✓ Machine Translation
▪ P(high wind tonight) > P(large wind tonight)

✓ Spell correction
▪ The office is about fifteen minuets from my house
▪ P(about fifteen minutes from) > P(about fifteen minuets from)

✓ Speech recognition
▪ P(I saw a van) >> P(eyes awe of an)

✓ Summarization, question-answering, etc.


Language Modeling
Jurafsky, Language Modeling

• Probabilistic Language Modeling


✓ Compute the probability of a sentence or sequence of words

✓ Related task: probability of an upcoming word

▪ ex) I love you more than I can ______. (swim? say?)

• How to compute P(W)


✓ What is P(its, water, is, so, transparent, that)?
✓ Chain rule of probability:
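The formula itself appears only as a figure in the source; written out in standard notation (shown here in LaTeX), the chain rule and its application to the example above are:

\begin{align*}
P(w_1, w_2, \dots, w_n) &= \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \\
P(\text{its, water, is, so, transparent, that})
  &= P(\text{its}) \times P(\text{water} \mid \text{its}) \times P(\text{is} \mid \text{its, water}) \\
  &\quad \times P(\text{so} \mid \text{its, water, is}) \times P(\text{transparent} \mid \text{its, water, is, so}) \\
  &\quad \times P(\text{that} \mid \text{its, water, is, so, transparent})
\end{align*}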
Language Modeling Jurafsky, Language Modeling

• Markov Assumption
✓ Consider only the k previous words when estimating the conditional probability: P(wi | w1, …, wi−1) ≈ P(wi | wi−k, …, wi−1)

✓ Simplest case: Unigram model

✓ An example of automatically generated sentences from a unigram model
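Under the unigram model the sentence probability factors as P(w1, …, wn) ≈ P(w1) P(w2) … P(wn), so generated text is just words sampled independently by frequency. A minimal sketch of such generation (the toy corpus below is an assumption; the slide's original generated sentences, an image from the Jurafsky materials, are not reproduced here):

# Toy unigram language model: estimate P(w) by relative frequency and sample
# words independently. Word order is ignored, which is why unigram-generated
# "sentences" read like word salad.
import random
from collections import Counter

corpus = "the water of walden pond is so beautifully blue the water is blue".split()
counts = Counter(corpus)
total = sum(counts.values())
words = list(counts)
probs = [counts[w] / total for w in words]      # maximum-likelihood estimates P(w)

sentence = random.choices(words, weights=probs, k=8)
print(" ".join(sentence))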


Language Modeling Jurafsky, Language Modeling

• Bigram model
✓ Condition on the previous word only: P(wi | w1, …, wi−1) ≈ P(wi | wi−1)

• N-gram models
✓ Can extend to trigrams, 4-grams, 5-grams
▪ Insufficient model of language, because language has long-distance dependencies
▪ “The computer which I had just put into the machine room on the fifth floor crashed.”

✓ We can often get away with N-gram models
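A minimal bigram sketch, estimating P(wi | wi−1) from counts (the tiny corpus and test sentence below are assumptions for illustration, not from the lecture):

# Toy bigram model: P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1}),
# with <s> and </s> as sentence-boundary markers.
from collections import Counter

sentences = [["i", "love", "you"], ["i", "love", "nlp"], ["you", "love", "nlp"]]
unigrams, bigrams = Counter(), Counter()
for s in sentences:
    tokens = ["<s>"] + s + ["</s>"]
    unigrams.update(tokens[:-1])                     # contexts (denominators)
    bigrams.update(zip(tokens[:-1], tokens[1:]))     # adjacent word pairs

def p_bigram(prev, word):
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(words):
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_sentence(["i", "love", "nlp"]))   # 2/3 * 1 * 2/3 * 1 = 4/9

In practice the products of many small probabilities are computed as sums of log probabilities, and smoothing handles unseen n-grams.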


Language Modeling Jurafsky, Language Modeling
Language Modeling
• Google Books N-Gram
✓ 1,024 billion words & 1.1 billion 5-grams that appeared at least 40 times (2006)
Language Modeling Bengio et al. (2003)

• Neural Network-based Language Model


Language Modeling Mikolov et al. (2010)

• Recurrent Neural Network (RNN)-based Language Model


✓ A simplified RNN structure for a character-level language model

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
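A bare-bones character-level RNN language model in PyTorch, in the spirit of the Karpathy post linked above (a hedged sketch; the toy text, model size, and training setup are assumptions, and the lecture does not prescribe a framework):

# Train a tiny character-level RNN to predict the next character of a toy string.
import torch
import torch.nn as nn

text = "hello world hello world "
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}           # character -> index

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, h=None):
        y, h = self.rnn(self.embed(x), h)            # hidden state at every step
        return self.out(y), h                        # logits over the next character

model = CharRNN(len(chars))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

ids = torch.tensor([[stoi[c] for c in text]])        # shape (1, seq_len)
x, target = ids[:, :-1], ids[:, 1:]                  # shift by one: predict next char

for step in range(300):
    logits, _ = model(x)
    loss = loss_fn(logits.reshape(-1, len(chars)), target.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Each step is trained to output P(next character | history), the same quantity the n-gram models above approximate with counts; a word-level RNN only changes the vocabulary from characters to words.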
Language Modeling
• Recurrent Neural Network (RNN)-based Language Model
✓ Character-level RNN vs Word-level RNN

https://github.com/hunkim/word-rnn-tensorflow
Language Modeling Lee (2017)

• Recurrent Neural Network (RNN)-based Language Model


✓ Character-level RNN (Korean)
▪ Lee Kwang-su's novel Mujeong (The Heartless): 323,660 syllables and 1,680 words in total
▪ Characteristics: written in 1917, so Sino-Korean (Hanja-based) words are common; much of the text is dialogue, with double quotation marks and line breaks; and since the novel is widely read by middle and high school students, editorial notes appear in parentheses throughout
▪ Excerpt from the training text:

형식은, 아뿔싸! 내가 어찌하여 이러한 생각을 하는가, 내 마음이 이렇게 약하던가 하면서 두 주먹을
불끈 쥐고 전신에 힘을 주어 이러한 약한 생각을 떼어 버리려 하나, 가슴속에는 이상하게 불길이
확확 일어난다. 이때에,
“미스터 리, 어디로 가는가” 하는 소리에 깜짝 놀라 고개를 들었다. (omitted) 형식은 얼마큼 마음에
수치한 생각이 나서 고개를 돌리며, “아직 그런 말에 익숙지를 못해서……” 하고 말끝을 못 맺는다.
“대관절 어디로 가는 길인가? 급지 않거든 점심이나 하세그려.”
“점심은 먹었는걸.”
“그러면 맥주나 한잔 먹지.”
“내가 술을 먹는가.”
(omitted)
“요― 오메데토오(아― 축하하네). 이이나즈케(약혼한 사람)가 있나 보네그려. 음 나루호도(그러려니).
그러구도 내게는 아무 말도 없단 말이야. 에, 여보게” 하고 손을 후려친다.
Language Modeling Lee (2017)

✓ Samples generated after increasing numbers of training iterations:

Iter 0 :
랫萬게좁뉘쁠름끈玄른작밭裸觀갈나맡文플조바늠헝伍下잊볕홀툽뤘혈調記운피悲렙司狼독벗칼둡걷착날完잣老
엇낫業4改‘촉수릎낯깽잊쯤죽道넌友련친씌았융타雲채發造거크휘탁亨律與命텐암먼헝평琵헤落유 리벤産이馨텐

Iter 1300 : 를 옷 사가 려만다밤 말어변 대니 심로 려이, 순 과 이을 죄사글를 . 사람을 영채와 이니아베을 니러,
다가 달고 면 를 아잘 하 기 성구을 을 실튿으루 아잠 고 이 그와 매못 더 (word spacing)

Iter 4900 : 를 왔다내 루방덩이종 은 얼에는 집어흔영채는 아무 우선을 에서가며 건들하아버전는 애양을 자에
운 모양이 랐다. 은 한다선과 ‘마는 .식세식가들어 ,
형식다
“내었다.있이 문 (line breaks)

Iter 100000 : 면서 치현분들더 중 한통 선교잤다.


“처럼 우셨다시가…… 것이 말사도? 여자려겠습니다” 하는 마음(裸生)은 이런 적드렸다. 그 말이 얼굴이 딸로
나고 얼굴이 마음불 하고 따라 선

Iter 300000 : 씻었다. 선형은 형식의 형식은 빛이 가슴을 오고 걸현감에는 일이 는 눈과 의고 아이얗어 알으로
자기의 구원을 내어려가 여러 짓을 쾌처게 안아 말고였는 악한 순간에 속으로 두 학교에

Iter 500000 : 본다. 성학과 평양으로 새로도 처음의 타던 공격하였다. ‘영채의 꽁인의 생각을 하면 때에 기생의
이는 것 보더니 나는 듯이 제인 소세건과 영채의 모양이를 대하였다. 형식을 생각하여

Iter 750000 : 으로 유안하였다. 더할까 하는 세상이 솔이요, 알고 게식도 들어울는 듯하였다. 태에그려 깔깔고
웃는 듯이 흔반다. 우선형은 사람을 어려보낸다.
“그려가?” (indirect quotation)
한다. 영채는 손을 기쁘

Iter 1000000 : 에 돌내면서,


“여러 넣어오습데다. 그 말대 아무도 좀 집림과 시오 백매, 저는 열녀더러, 기런 소년이가 아니라.”
“어리지요.”
노파도 놀라며,
“저희마다가 말없습니까.”
“아니 (dialogue style)
Language Modeling
• Sequence to Sequence (Seq2Seq) Learning

https://medium.com/@Synced/history-and-frontier-of-the-neural-machine-translation-dc981d25422d
Language Modeling
• Performance Improvements
✓ GPT-2 (OpenAI): Too good to release the model??
Our model, called GPT-2 (a successor to GPT), was trained simply to predict the next
word in 40GB of Internet text. Due to our concerns about malicious applications of the
technology, we are not releasing the trained model. As an experiment in responsible
disclosure, we are instead releasing a much smaller model for researchers to experiment
with, as well as a technical paper.

https://github.com/graykode/gpt-2-Pytorch?fbclid=IwAR0mHAR1cEPpuJ7QT9TBir-37_32tkQnpMjvsu57qPacEZz2-zbTm3Iibj8
Language Modeling
• Performance Improvements
✓ GPT-2 (OpenAI): Too good to release the model??

System prompt (human-written)


In a shocking finding, scientist discovered a herd of unicorns living in a remote,
previously unexplored valley, in the Andes Mountains. Even more surprising to the
researchers was the fact that the unicorns spoke perfect English.

Model completion (machine-written, 10 tries)


The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were
previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes
Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to
be a natural fountain, surrounded by two peaks of rock and silver snow.
Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue,
with some crystals on top,” said Pérez.
Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move
too much to see them – they were so close they could touch their horns.
While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez
stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”
