01_Introduction to Text Analytics_part1
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
http://www.zdnet.com/within-two-years-80-percent-of-medical-data-will-be-unstructured-7000013707/
http://www.computerweekly.com/feature/How-to-manage-unstructured-data-for-business-benefit
Text Analytics: Background
• AI vs. Lawyers: The ultimate showdown
✓ Task: to spot issues in five Non-Disclosure Agreements (NDAs)
https://www.lawgeex.com/resources/AIvsLawyer/
Example: AI papers in arXiv
• The number of papers in the “artificial intelligence” section
✓ Can you read them all?
https://www.technologyreview.com/s/612768/we-analyzed-16625-papers-to-figure-out-where-ai-is-headed-next/?utm_source=facebook&utm_campaign=site_visitor.unpaid.engagement&utm_medium=tr_social
Example: AI papers in arXiv
• Let’s do some text mining!
✓ Actually, it was just a simple word frequency analysis
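A simple word frequency analysis of this kind can be sketched in a few lines. The snippet below is only an illustration: the three "abstracts" and the stop-word list are hypothetical stand-ins, not the data or method of the cited analysis.

```python
import re
from collections import Counter

# Hypothetical stand-ins for paper abstracts (not data from the cited analysis).
ABSTRACTS = [
    "Deep neural networks improve image recognition accuracy.",
    "Reinforcement learning agents learn policies from rewards.",
    "Neural networks and reinforcement learning for robot control.",
]

# A tiny illustrative stop-word list; real analyses use much larger ones.
STOP_WORDS = {"and", "for", "from", "the", "of", "a", "in"}


def word_frequencies(texts):
    """Count how often each non-stop-word appears across all texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


freq = word_frequencies(ABSTRACTS)
for word, count in freq.most_common(5):
    print(word, count)
```

Counting the same words per publication year instead of over the whole corpus would give a rough trend line for each term.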
Extract meaningful information and knowledge from unstructured text data using various analytical methods
Text Analytics: Applications
• Information Abstraction/Summarization/Visualization
✓ Central Bank speech analysis: Similarities between the central banks in the world
Text Analytics: Applications
Seo et al. (2020)
• Information Abstraction/Summarization/Visualization
✓ Unusual customer response identification and visualization
Text Analytics: Applications
• Document Clustering
✓ Cluster documents and extract representative keywords for each cluster
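One possible way to cluster documents and pull out representative keywords is sketched below. It assumes TF-IDF weighting and a small cosine-similarity k-means, neither of which is named on the slide; the four documents, the stop-word list, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus: two finance documents and two sports documents.
DOCS = [
    "stock market prices fall as investors sell shares",
    "investors buy shares as stock market prices rise",
    "the team won the football match with a late goal",
    "a late goal decided the football match for the home team",
]
STOP_WORDS = {"a", "the", "as", "with", "for"}


def tokenize(text):
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def normalize(vec):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def tfidf_vectors(docs):
    """Return one unit-length {term: tf-idf weight} dict per document."""
    token_lists = [tokenize(d) for d in docs]
    df = Counter(t for tokens in token_lists for t in set(tokens))
    n = len(docs)
    return [
        normalize({t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                   for t, c in Counter(tokens).items()})
        for tokens in token_lists
    ]


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def mean_vector(vectors):
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return normalize({t: w / len(vectors) for t, w in total.items()})


def kmeans(vectors, k, iterations=10):
    # Farthest-first initialization keeps this sketch deterministic.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assign each document to its most similar centroid, then update.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels, centroids


def top_keywords(centroid, n=3):
    """Terms with the largest weight in a cluster centroid."""
    return [t for t, _ in sorted(centroid.items(), key=lambda kv: -kv[1])[:n]]


labels, centroids = kmeans(tfidf_vectors(DOCS), k=2)
for j, centroid in enumerate(centroids):
    print("cluster", j, "keywords:", top_keywords(centroid))
```

Reading keywords off the centroid is a design choice: terms shared by most members of a cluster end up with the largest average weight, so they act as a rough label for it.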
Text Analytics: Applications
• Topic Extraction
✓ Analyze documents and extract latent topics in the corpus
Text Analytics: Applications
Kim et al. (2016)
• Topic Extraction
✓ 30 Topics discovered by LDA
▪ Fault detection with DBN: deep, belief, network, dbn, fault
▪ Convolutional neural network: neural, convolutional, pool, convolution, convnet
▪ Network learning: layer, input, output, unit, hide, function
▪ Representation learning: feature, level, extract, learn, extraction
▪ Face recognition: face, recognition, estimation, facial, shape
▪ Speech recognition: speaker, speech, noise, adaptation, source
▪ Acoustic modeling: speech, recognition, acoustic, hmm, neural
▪ Extreme learning: deep, learn, algorithm, structure, extreme
▪ Deep learning architecture: deep, architecture, neural, standard, explore
▪ Image segmentation: image, scene, scale, segmentation, pixel
▪ Long-short term memory: term, recurrent, long, lstm, network
▪ Predictive analytics: data, prediction, technique, information, research
▪ Signal processing: analysis, filter, signal, component, audio
▪ Classification models: classification, classifier, class, vector, support
▪ Large-scale computing: application, implementation, efficient, process, power
▪ Image quality assessment: domain, state, quality, resolution, relationship
▪ Visual recognition: pattern, process, compute, visual, field
▪ NLP: word, text, language, representation, semantic
▪ Detection using CNN: cnn, detection, convolutional, neural, detect
▪ Action recognition: video, human, temporal, action, track
• Topic Extraction
✓ Relations between topics
[Figure: relations between topics; group labels: Scalability, Applications, Object/Signal Recognition, Image Processing, Optimization & Learning Strategies, Advanced Learning, NLP/Autoencoder]
1 loan 0 2 0 … 0
2 jackpot 0 0 0 … 0
3 blind date 0 0 2 … 0
4 ideal type 0 0 2 … 0
5 money 0 2 0 … 0
6 lonely 0 0 3 … 1
Spam? N Y Y … N
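A term count table with one spam label per document, like the one above, is a typical input for a text classifier. The sketch below assumes a multinomial Naive Bayes model with Laplace smoothing, which the slides do not name; the terms, counts, and labels are hypothetical and are not the values from the table.

```python
import math

# Hypothetical vocabulary and per-document term counts (one row per document).
TERMS = ["loan", "jackpot", "date", "money", "meeting", "report"]
DOCS = [
    [2, 1, 0, 2, 0, 0],
    [0, 0, 3, 1, 0, 0],
    [0, 0, 0, 0, 2, 1],
    [0, 0, 0, 0, 1, 2],
]
LABELS = ["Y", "Y", "N", "N"]  # Y = spam, N = not spam


def train(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log term likelihoods per class."""
    model = {}
    for cls in set(labels):
        rows = [d for d, lab in zip(docs, labels) if lab == cls]
        totals = [sum(col) for col in zip(*rows)]  # term totals within the class
        denom = sum(totals) + alpha * len(totals)
        model[cls] = (
            math.log(len(rows) / len(docs)),
            [math.log((t + alpha) / denom) for t in totals],
        )
    return model


def predict(model, counts):
    """Return the class with the highest posterior log score."""
    def score(cls):
        prior, loglik = model[cls]
        return prior + sum(c * ll for c, ll in zip(counts, loglik))
    return max(model, key=score)


model = train(DOCS, LABELS)
# A new message that mentions "loan" and "money" once each.
print(predict(model, [1, 0, 0, 1, 0, 0]))
```

Smoothing matters here because a term never seen in one class would otherwise get zero probability and rule that class out entirely.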
Text Analytics: Applications
Kim et al. (2016)
• Document Categorization/Classification
✓ Sport player evaluation
Text Analytics: Applications
Lee et al. (2017)
• Document Categorization/Classification
✓ Sentiment Analysis
Text Analytics: Applications
• Document Categorization/Classification
✓ Sentiment Analysis
https://techxplore.com/news/2016-08-deep-neural-network-approach-sarcasm.html
Text Analytics: Applications
Mo et al. (2017)
• Document Categorization/Classification
✓ Sentiment Analysis
Text Analytics: Applications
• Recommendation
✓ Analyze texts from Daum Café, blogs, and SNS content
✓ Uses named entity recognition/extraction (NER/NEE), a natural language processing technique
✓ For 60,000 keywords
Text Analytics: Applications
• Recommendation
✓ Dining Code: restaurant recommendation service
▪ Analyze restaurant reviews from the top 3 blog services (Naver, Daum, Tistory)
▪ Assign higher weights to opinion leaders' posts
▪ Filter advertising blog posts by analyzing the comments on a post
Text Analytics: Applications
• Natural Language Understanding: Question Answering
https://github.com/facebookresearch/DrQA/blob/master/img/drqa.png
https://ai.googleblog.com/2019/01/natural-questions-new-corpus-and.html
https://paperswithcode.com/task/question-answering
Text Analytics: Applications
• Conversing like a human being: chatbot (dialogue system)
https://chatbotslife.com/chatbots-are-the-future-of-marketing-31fd285f37d9
Text Analytics: Challenges
• Challenges
✓ High number of possible “dimensions” (word, phrases, etc.)
✓ Complex and subtle relationships between concepts in texts
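The first challenge above, the high number of possible dimensions, can be made concrete by counting distinct words and two-word phrases. The snippet is a minimal sketch on a hypothetical three-document corpus; even at this scale the number of candidate features exceeds the number of documents several times over.

```python
import re

# Hypothetical mini-corpus used only for illustration.
CORPUS = [
    "text analytics extracts knowledge from text",
    "text mining finds patterns in unstructured text data",
    "analytics methods turn unstructured data into knowledge",
]


def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(corpus, n):
    """Collect the distinct n-grams that occur anywhere in the corpus."""
    vocab = set()
    for doc in corpus:
        vocab.update(ngrams(re.findall(r"[a-z]+", doc.lower()), n))
    return vocab


unigrams = vocabulary(CORPUS, 1)
bigrams = vocabulary(CORPUS, 2)
print(len(CORPUS), "documents")
print(len(unigrams), "distinct words,", len(bigrams), "distinct two-word phrases")
```

Each distinct word or phrase becomes one column of the document representation, which is why dimensionality grows so quickly as phrases are added.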
Text Analytics: Text Structures
• Structure of text data
http://www.slideshare.net/pierluca.lanzi/machine-learning-and-data-mining-19-mining-text-and-web-data
Text Analytics: Text Structures
Abbott (2013)
✓ Semi-structured
▪ Extensive format elements, metadata, field labels: e-mail, HTML/XML web pages, PDF files, etc.
✓ Document Clustering
▪ Grouping and categorizing terms, snippets, paragraphs, or documents using data mining clustering methods
✓ Concept Extraction
▪ Grouping of words and phrases into semantically similar groups
Mining Text Data
Abbott (2013)
✓ Web Mining
▪ Data and text mining on the internet with a specific focus on the scale and interconnectedness of the web
A Simplified Process of Text Analytics
Step 1: Define what to mine & collect text data
Step 2: Preprocess & transform the data
Word S1 S2
John 1 1
Likes 2 1
To 1 1
Games 0 1
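The transformation in Step 2 can be sketched as turning raw sentences into a word count table with one column per sentence. The two sentences below are an assumption: they were chosen so that the counts agree with the table rows that survive on the slide (John, Likes, To, Games), and the slide's actual sentences are not shown.

```python
import re
from collections import Counter

# Assumed example sentences, chosen to match the surviving table rows.
S1 = "John likes to watch movies. Mary likes movies too."
S2 = "John also likes to watch football games."


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_table(sentences):
    """Map each word to its list of counts, one count per sentence."""
    counts = [Counter(tokenize(s)) for s in sentences]
    vocab = sorted(set().union(*counts))
    return {word: [c[word] for c in counts] for word in vocab}


table = term_document_table([S1, S2])
print("Word", "S1", "S2")
for word, row in table.items():
    print(word, *row)
```

Word order is discarded in this representation, so only the counts per sentence are kept.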