02 Text Preprocessing Part1
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
01 Introduction to NLP
02 Lexical Analysis
03 Syntax Analysis
04 Other Topics in NLP
Natural Language Processing
• Natural language processing sequence
Natural Language Processing Witte (2006)
http://biz.chosun.com/site/data/html_dir/2016/06/21/2016062100053.html
Natural Language Processing
• Speech to Text (STT)
https://github.com/kaldi-asr/kaldi
Natural Language Processing
• Top 16 Speech Recognition Startups (2020. 02. 06)
https://www.ai-startups.org/top/speech_recognition/
Natural Language Processing
• Text to Speech (TTS) Example
Natural Language Processing Witte (2006)
• An example of NLP
“A teacher comes”
Pragmatic Analysis
➔ Be quiet!
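The example above walks the sentence "A teacher comes" through the classic analysis stages, ending with the pragmatic reading "Be quiet!". A minimal sketch of that staged pipeline is shown below; every stage is hard-coded for this one sentence, and all function names and the tag/context values are hypothetical stand-ins for trained models.

```python
# Toy illustration of the NLP analysis stages applied to "A teacher comes".
# All stage logic is hypothetical and hard-coded for the example sentence;
# a real system would use trained models at every step.

def lexical_analysis(sentence):
    """Split into tokens and attach part-of-speech tags."""
    tags = {"A": "DET", "teacher": "NOUN", "comes": "VERB"}
    return [(tok, tags.get(tok, "X")) for tok in sentence.split()]

def syntax_analysis(tagged):
    """Group tagged tokens into a (subject, verb) structure."""
    subject = [t for t, p in tagged if p in ("DET", "NOUN")]
    verb = [t for t, p in tagged if p == "VERB"]
    return {"subject": " ".join(subject), "verb": " ".join(verb)}

def semantic_analysis(parse):
    """Map the parse to a literal meaning."""
    return f"{parse['subject']} is arriving"

def pragmatic_analysis(meaning, context="noisy classroom"):
    """Interpret the literal meaning in context: what the utterance *does*."""
    if "teacher" in meaning and context == "noisy classroom":
        return "Be quiet!"
    return meaning

tagged = lexical_analysis("A teacher comes")
parse = syntax_analysis(tagged)
meaning = semantic_analysis(parse)
print(pragmatic_analysis(meaning))  # -> Be quiet!
```

The point of the sketch is only that each stage consumes the previous stage's output, and that the pragmatic stage depends on context outside the sentence itself.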
Natural Language Processing
• Is Pragmatic Analysis Possible?
https://github.com/google-research/bert/blob/master/optimization.py
Is NLP Easy? No!
• How to annoy graduate students with four lines of Python code
Is NLP Easy? No! Witte (2006)
vs.
Research Trends in NLP Witte (2006)
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
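The linked demo runs a recursive neural tensor network over parse trees. For contrast, a much simpler baseline is a lexicon-based scorer; the sketch below is such a baseline, not the Stanford model, and its tiny word lists are invented purely for illustration.

```python
# A minimal lexicon-based sentiment scorer -- a deliberately simple
# baseline, nothing like the recursive neural tensor network behind
# the Stanford demo. The tiny lexicons below are made up for illustration.

POSITIVE = {"good", "great", "excellent", "love", "wonderful"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "boring"}

def sentiment(sentence):
    """Count positive vs. negative words and return a coarse label."""
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("This movie was great and I love it"))  # positive
print(sentiment("What a terrible and boring film"))     # negative
```

Such a bag-of-words baseline cannot handle negation or composition ("not good"), which is exactly the gap tree-structured models like the RNTN were designed to close.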
Research Trends in NLP Wu et al. (2016)
http://blog.aylien.com/leveraging-deep-learning-for-multilingual/
Research Trends in NLP
• Performance Improvements
https://paperswithcode.com/area/natural-language-processing
Research Trends in NLP
• Performance Improvements with a Huge Model
https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
Research Trends in NLP
• An Era of Optimization?
MO is similar to AI in that it performs complex computations on high-performance computers. Unlike AI, however, which derives the best answer by analyzing past big data, MO uses mathematical formulations to find the ideal solution under the limited conditions currently given. Where AI suggests the best option based on experience, MO delivers the optimal answer using mathematical theory. Because MO does not require a big-data analysis step the way AI does, it takes relatively little time and cost.
Research Trends in NLP
• 10 Exciting ideas of 2018 in NLP (http://ruder.io/10-exciting-ideas-of-2018-in-nlp/)
✓ Unsupervised Machine Translation
✓ Pretrained language models
✓ Common sense inference datasets
✓ Meta-learning
✓ Robust unsupervised methods
✓ Understanding representations
✓ Clever auxiliary tasks
✓ Combining semi-supervised learning with transfer learning
✓ QA and reasoning with large documents
✓ Inductive bias
Research Trends in NLP
• Major NLP Achievements & Papers from 2019
✓ Language Models Are Unsupervised Multitask Learners
✓ XLNet: Generalized Autoregressive Pretraining for Language Understanding
✓ RoBERTa: A Robustly Optimized BERT Pretraining Approach
✓ Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts
✓ Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems
✓ Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks
✓ Probing the Need for Visual Context in Multimodal Machine Translation
✓ Bridging the Gap between Training and Inference for Neural Machine Translation
✓ On Extractive and Abstractive Neural Document Summarization with Transformer Language Models
✓ CTRL: A Conditional Transformer Language Model For Controllable Generation
✓ ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
https://www.topbots.com/top-ai-nlp-research-papers-2019/
Research Trends in NLP
• 14 NLP research breakthroughs you can apply to your business
✓ BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
✓ Sequence Classification with Human Attention
✓ Phrase-Based & Neural Unsupervised Machine Translation
✓ What you can cram into a single vector: Probing sentence embeddings for linguistic properties
✓ SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
✓ Deep contextualized word representations
✓ Meta-Learning for Low-Resource Neural Machine Translation
✓ Linguistically-Informed Self-Attention for Semantic Role Labeling
✓ A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks
✓ Know What You Don’t Know: Unanswerable Questions for SQuAD
✓ An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
✓ Universal Language Model Fine-tuning for Text Classification
✓ Improving Language Understanding by Generative Pre-Training
✓ Dissecting Contextual Word Embeddings: Architecture and Representation
(https://www.topbots.com/most-important-ai-nlp-research/?utm_campaign=meetedgar&utm_medium=social&utm_source=meetedgar.com)
Research Trends in NLP
• Statistical translation vs. deep learning-based translation
https://language-translator-demo.mybluemix.net/
Research Trends in NLP
• Statistical translation vs. deep learning-based translation
Research Trends in NLP
• Provide your inputs to improve the machine translator!
• 2018. 03. 06
• 2019. 03. 02
• 2020. 03. 02
Data Quality in NLP
• ExoBrain Project
Data Quality in NLP
• Data Annotation as a Business Model
✓ Scale AI: https://scale.com/
✓ Basic AI: https://www.basic.ai/
Data Quality in NLP
• Data Annotation as a Business Model
✓ Amazon SageMaker Ground Truth: https://aws.amazon.com/ko/sagemaker/groundtruth/
▪ Data labeling platform
Initially, labeling is performed by human annotators
During this process, Amazon Mechanical Turk is used to match the work to workers, or labeling vendors are recommended
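When several crowd workers (e.g., recruited through Amazon Mechanical Turk) label the same item, their answers must be aggregated into a single label. The sketch below shows the simplest scheme, majority voting with ties flagged for re-annotation; the example data and function name are invented for illustration, and platforms like SageMaker Ground Truth use more sophisticated confidence-weighted consolidation.

```python
from collections import Counter

# Aggregate redundant crowd labels per item by majority vote.
# Ties return None, flagging the item for re-annotation.
# The annotation data below is invented for illustration.

def majority_vote(labels):
    """Return the winning label, or None on a tie (needs re-labeling)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

annotations = {
    "sent-001": ["positive", "positive", "negative"],
    "sent-002": ["negative", "positive"],  # tie -> re-annotate
}
gold = {item: majority_vote(labs) for item, labs in annotations.items()}
print(gold)  # {'sent-001': 'positive', 'sent-002': None}
```

Collecting three or more labels per item and resolving disagreements this way is a common, cheap quality-control step before the data is used for training.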