02 Text Preprocessing Part1
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
01 Introduction to NLP
02 Lexical Analysis
03 Syntax Analysis
04 Other Topics in NLP
Natural Language Processing
• Natural language processing sequence
Natural Language Processing Witte (2006)
http://biz.chosun.com/site/data/html_dir/2016/06/21/2016062100053.html
Natural Language Processing
• Speech to Text (STT)
https://github.com/kaldi-asr/kaldi
Natural Language Processing
• Top 16 Speech Recognition Startups (2020. 02. 06)
https://www.ai-startups.org/top/speech_recognition/
Natural Language Processing
• Text to Speech (TTS) Example
Natural Language Processing Witte (2006)
• An example of NLP
“A teacher comes”
Pragmatic Analysis
➔ Be quiet!
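The example above walks the sentence "A teacher comes" through the classic analysis stages, ending with the pragmatic reading "Be quiet!". A minimal sketch of that staged pipeline is shown below; every stage is hard-coded for this one sentence, and all function names and the tag/context values are hypothetical stand-ins for trained models.

```python
# Toy illustration of the NLP analysis stages applied to "A teacher comes".
# All stage logic is hypothetical and hard-coded for the example sentence;
# a real system would use trained models at every step.

def lexical_analysis(sentence):
    """Split into tokens and attach part-of-speech tags."""
    tags = {"A": "DET", "teacher": "NOUN", "comes": "VERB"}
    return [(tok, tags.get(tok, "X")) for tok in sentence.split()]

def syntax_analysis(tagged):
    """Group tagged tokens into a (subject, verb) structure."""
    subject = [t for t, p in tagged if p in ("DET", "NOUN")]
    verb = [t for t, p in tagged if p == "VERB"]
    return {"subject": " ".join(subject), "verb": " ".join(verb)}

def semantic_analysis(parse):
    """Map the parse to a literal meaning."""
    return f"{parse['subject']} is arriving"

def pragmatic_analysis(meaning, context="noisy classroom"):
    """Interpret the literal meaning in context: what the utterance *does*."""
    if "teacher" in meaning and context == "noisy classroom":
        return "Be quiet!"
    return meaning

tagged = lexical_analysis("A teacher comes")
parse = syntax_analysis(tagged)
meaning = semantic_analysis(parse)
print(pragmatic_analysis(meaning))  # -> Be quiet!
```

The point of the sketch is only that each stage consumes the previous stage's output, and that the pragmatic stage depends on context outside the sentence itself.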
Natural Language Processing
• Is Pragmatic Analysis Possible?
https://github.com/google-research/bert/blob/master/optimization.py
Is NLP Easy? No!
• How to annoy graduate students with four lines of Python code
Is NLP Easy? No! Witte (2006)
vs.
Research Trends in NLP Witte (2006)
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
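The linked demo runs a recursive neural tensor network over parse trees. For contrast, a much simpler baseline is a lexicon-based scorer; the sketch below is such a baseline, not the Stanford model, and its tiny word lists are invented purely for illustration.

```python
# A minimal lexicon-based sentiment scorer -- a deliberately simple
# baseline, nothing like the recursive neural tensor network behind
# the Stanford demo. The tiny lexicons below are made up for illustration.

POSITIVE = {"good", "great", "excellent", "love", "wonderful"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "boring"}

def sentiment(sentence):
    """Count positive vs. negative words and return a coarse label."""
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("This movie was great and I love it"))  # positive
print(sentiment("What a terrible and boring film"))     # negative
```

Such a bag-of-words baseline cannot handle negation or composition ("not good"), which is exactly the gap tree-structured models like the RNTN were designed to close.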
Research Trends in NLP Wu et al. (2016)
http://blog.aylien.com/leveraging-deep-learning-for-multilingual/
Research Trends in NLP
• Performance Improvements
https://paperswithcode.com/area/natural-language-processing
Research Trends in NLP
• Performance Improvements with a Huge Model
https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
Research Trends in NLP
• An Era of Optimization?
MO is similar to AI in that it performs complex computations on high-performance computers. Unlike AI, however, which derives the best answer by analyzing past big data, MO uses mathematical formulations to find the ideal solution under the limited conditions currently given. Where AI suggests the best option based on experience, MO delivers the optimal answer using mathematical theory. Because MO does not require a big-data analysis step the way AI does, it takes relatively little time and cost.
Research Trends in NLP
• 10 Exciting ideas of 2018 in NLP (http://ruder.io/10-exciting-ideas-of-2018-in-nlp/)
✓ Unsupervised Machine Translation
✓ Pretrained language models
✓ Common sense inference datasets
✓ Meta-learning
✓ Robust unsupervised methods
✓ Understanding representations
✓ Clever auxiliary tasks
✓ Combining semi-supervised learning with transfer learning
✓ QA and reasoning with large documents
✓ Inductive bias
Research Trends in NLP
• Major NLP Achievements & Papers from 2019
✓ Language Models Are Unsupervised Multitask Learners
✓ XLNet: Generalized Autoregressive Pretraining for Language Understanding
✓ RoBERTa: A Robustly Optimized BERT Pretraining Approach
✓ Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts
✓ Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems
✓ Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks
✓ Probing the Need for Visual Context in Multimodal Machine Translation
✓ Bridging the Gap between Training and Inference for Neural Machine Translation
✓ On Extractive and Abstractive Neural Document Summarization with Transformer Language Models
✓ CTRL: A Conditional Transformer Language Model For Controllable Generation
✓ ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
https://www.topbots.com/top-ai-nlp-research-papers-2019/
Research Trends in NLP
• 14 NLP research breakthroughs you can apply to your business
✓ BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
✓ Sequence Classification with Human Attention
✓ Phrase-Based & Neural Unsupervised Machine Translation
✓ What you can cram into a single vector: Probing sentence embeddings for linguistic properties
✓ SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
✓ Deep contextualized word representations
✓ Meta-Learning for Low-Resource Neural Machine Translation
✓ Linguistically-Informed Self-Attention for Semantic Role Labeling
✓ A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks
✓ Know What You Don’t Know: Unanswerable Questions for SQuAD
✓ An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
✓ Universal Language Model Fine-tuning for Text Classification
✓ Improving Language Understanding by Generative Pre-Training
✓ Dissecting Contextual Word Embeddings: Architecture and Representation
(https://www.topbots.com/most-important-ai-nlp-research/?utm_campaign=meetedgar&utm_medium=social&utm_source=meetedgar.com)
Research Trends in NLP
• Statistical translation vs. deep learning-based translation
https://language-translator-demo.mybluemix.net/
Research Trends in NLP
• Statistical translation vs. deep learning-based translation
Research Trends in NLP
• Provide your inputs to improve the machine translator!
• 2018. 03. 06
• 2019. 03. 02
• 2020. 03. 02
Data Quality in NLP
• ExoBrain Project
Data Quality in NLP
• Data Annotation as a Business Model
✓ Scale AI: https://scale.com/
✓ Basic AI: https://www.basic.ai/
Data Quality in NLP
• Data Annotation as a Business Model
✓ Amazon SageMaker Ground Truth: https://aws.amazon.com/ko/sagemaker/groundtruth/
▪ Data labeling platform
Initially, labeling is performed by human annotators
During this process, Amazon Mechanical Turk is used to match the work to workers, or labeling vendors are recommended
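When several crowd workers (e.g., recruited through Amazon Mechanical Turk) label the same item, their answers must be aggregated into a single label. The sketch below shows the simplest scheme, majority voting with ties flagged for re-annotation; the example data and function name are invented for illustration, and platforms like SageMaker Ground Truth use more sophisticated confidence-weighted consolidation.

```python
from collections import Counter

# Aggregate redundant crowd labels per item by majority vote.
# Ties return None, flagging the item for re-annotation.
# The annotation data below is invented for illustration.

def majority_vote(labels):
    """Return the winning label, or None on a tie (needs re-labeling)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

annotations = {
    "sent-001": ["positive", "positive", "negative"],
    "sent-002": ["negative", "positive"],  # tie -> re-annotate
}
gold = {item: majority_vote(labs) for item, labs in annotations.items()}
print(gold)  # {'sent-001': 'positive', 'sent-002': None}
```

Collecting three or more labels per item and resolving disagreements this way is a common, cheap quality-control step before the data is used for training.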