01_Introduction to Text Analytics_part1

Lecture 1: Introduction to Text Analytics

Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA

01 Text Analytics: Overview


02 TA Process 1: Collection & Preprocessing
03 TA Process 2: Transformation
04 TA Process 3: Dimensionality Reduction
05 TA Process 4: Learning & Evaluation
Text Analytics: Background
• Motivation
✓ Approximately 80% of the world’s data is held in unstructured formats

✓ Simple document retrieval is not enough; knowledge discovery is required!

http://www.zdnet.com/within-two-years-80-percent-of-medical-data-will-be-unstructured-7000013707/
http://www.computerweekly.com/feature/How-to-manage-unstructured-data-for-business-benefit
Text Analytics: Background
• AI vs. Lawyers: The ultimate showdown
✓ Task: to spot issues in five Non-Disclosure Agreements (NDAs)

https://www.lawgeex.com/resources/AIvsLawyer/
Example: AI papers in arXiv
• The number of papers in the “artificial intelligence” section
✓ Can you read them all?

https://www.technologyreview.com/s/612768/we-analyzed-16625-papers-to-figure-out-where-ai-is-headed-next/?utm_source=facebook&utm_campaign=site_visitor.unpaid.engagement&utm_medium=tr_social
Example: AI papers in arXiv
• Let’s do some text mining!
✓ Actually, it was just a simple word frequency analysis

• Discovery 1: Machine learning eclipses knowledge-based reasoning


Example: AI papers in arXiv
• Discovery 2: The Neural-Network Boom
Example: AI papers in arXiv
• Discovery 3: The rise of reinforcement learning
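
The "simple word frequency analysis" mentioned in this example can be sketched in a few lines. This is a minimal illustration, not the method used in the cited article: the abstracts below are invented, and the tokenizer (lowercase alphabetic runs) is an assumption.

```python
import re
from collections import Counter

# Hypothetical paper abstracts, invented for illustration only.
abstracts = [
    "We propose a neural network for reinforcement learning.",
    "A knowledge-based reasoning system with logic rules.",
    "Deep neural network models improve machine learning benchmarks.",
]


def word_frequencies(texts):
    """Count how often each lowercase word occurs across all texts."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: runs of ASCII letters, case-folded.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


freq = word_frequencies(abstracts)
print(freq.most_common(5))
```

Tracking such counts per year, rather than over the whole corpus at once, is what would reveal trends like the three discoveries listed above.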
Text Analytics: Definition

For Unstructured Text Data
Using Various Analytical Methods
Extract Meaningful Information and Knowledge
Text Analytics: Applications
• Information Abstraction/Summarization/Visualization
Text Analytics: Applications
• Information Abstraction/Summarization/Visualization
✓ Central Bank speech analysis: Similarities between the central banks in the world
Text Analytics: Applications
Seo et al. (2020)

• Information Abstraction/Summarization/Visualization
✓ Unusual customer response identification and visualization
Text Analytics: Applications
• Document Clustering
✓ Cluster documents and extract representative keywords for each cluster
Text Analytics: Applications
• Topic Extraction
✓ Analyze documents and extract latent topics in the corpus
Text Analytics: Applications
Kim et al. (2016)

• Topic Extraction
✓ 30 Topics discovered by LDA

▪ Fault detection with DBN: deep, belief, network, dbn, fault
▪ Convolutional neural network: neural, convolutional, pool, convolution, convnet
▪ Network Learning: layer, input, output, unit, hide, function
▪ Representation learning: feature, level, extract, learn, extraction
▪ Face Recognition: face, recognition, estimation, facial, shape
▪ Speech Recognition: speaker, speech, noise, adaptation, source
▪ Acoustic Modeling: speech, recognition, acoustic, hmm, neural
▪ Extreme Learning: deep, learn, algorithm, structure, extreme
▪ Deep learning architecture: deep, architecture, neural, standard, explore
▪ Image Segmentation: image, scene, scale, segmentation, pixel
▪ Long-short term memory: term, recurrent, long, lstm, network
▪ Predictive analytics: data, prediction, technique, information, research
▪ Signal processing: analysis, filter, signal, component, audio
▪ Classification models: classification, classifier, class, vector, support
▪ Large-scale computing: application, implementation, efficient, process, power
▪ Image quality assessment: domain, state, quality, resolution, relationship
▪ Visual recognition: pattern, process, compute, visual, field
▪ NLP: word, text, language, representation, semantic
▪ Detection using CNN: cnn, detection, convolutional, neural, detect
▪ Action recognition: video, human, temporal, action, track
▪ Image retrieval: image, visual, retrieval, descriptor, attribute
▪ Medical image diagnosis: image, segmentation, disease, cell, medical
▪ Reinforcement learning: learn, question, state, answer, reinforcement
▪ Parameter optimization: train, algorithm, gradient, sample, optimization
▪ Auto encoder: representation, learn, sparse, encode, stack
▪ RBM and variations: machine, boltzmann, rbm, restrict, distribution
▪ Learning with few labeled data: train, data, label, few, transfer
▪ Fast learning complexity reduction: fast, reduce, parameter, weight, complexity
▪ Applications for vehicles & robots: time, real, application, drive, Vehicle
▪ Character recognition: recognition, system, character, network, neural
Text Analytics: Applications
Kim et al. (2016)

• Topic Extraction
✓ Relations between topics

[Figure: map of relations between topics. Group labels: Scalability; Applications; Object/Signal Recognition; Image Processing; Optimization & Advanced Learning; Learning Strategies; NLP/Autoencoder; Deep Learning Structures; Independent Topics]
Text Analytics: Applications
• Document Categorization/Classification
✓ Spam mail filtering

No.  Keyword            Mail 1  Mail 2  Mail 3  …  Mail N
1    loan (대출)          0       2       0       …  0
2    jackpot (대박)       0       0       0       …  0
3    blind date (미팅)    0       0       2       …  0
4    ideal type (이상형)  0       0       2       …  0
5    money (머니)         0       2       0       …  0
6    lonely (외로)        0       0       3       …  1
     Spam?              N       Y       Y       …  N
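
One column of the keyword table above could be produced roughly as follows. This is a sketch under assumptions: the keywords are English renderings of the table's keywords, the mail body is invented, and plain substring counting stands in for whatever tokenization the original system used.

```python
# Keywords from the table above, rendered in English for illustration.
KEYWORDS = ["loan", "jackpot", "blind date", "ideal type", "money", "lonely"]


def keyword_counts(mail_text, keywords=KEYWORDS):
    """Return how many times each keyword occurs in one mail (case-insensitive)."""
    text = mail_text.lower()
    # Assumption: simple substring counting, no stemming or word boundaries.
    return [text.count(keyword) for keyword in keywords]


# Hypothetical mail body, invented for this example.
mail = "Need money? Get a loan today. Fast loan, easy money!"
print(dict(zip(KEYWORDS, keyword_counts(mail))))
```

Stacking such count vectors for all mails, together with the spam label, gives the kind of structured data set on which a classifier can be trained.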
Text Analytics: Applications
Kim et al. (2016)

• Document Categorization/Classification
✓ Sport player evaluation
Text Analytics: Applications
Lee et al. (2017)

• Document Categorization/Classification
✓ Sentiment Analysis
Text Analytics: Applications
• Document Categorization/Classification
✓ Sentiment Analysis

https://techxplore.com/news/2016-08-deep-neural-network-approach-sarcasm.html
Text Analytics: Applications
Mo et al. (2017)

• Document Categorization/Classification
✓ Sentiment Analysis
Text Analytics: Applications
• Recommendation
✓ Analyze texts from Daum Café, blogs, and SNS content
✓ Named entity recognition/extraction (NEE/NER) techniques from natural language processing are used
✓ For 60,000 keywords
Text Analytics: Applications
• Recommendation
✓ Dining code: restaurant recommendation service
▪ Analyze restaurant reviews from the top 3 blog services (Naver, Daum, Tistory)
▪ Assign higher weights to opinion leaders’ posts
▪ Filter advertising blog posts by analyzing the comments on a post

Developed by HS Shin, KKU http://www.diningcode.com/


Text Analytics: Applications
Kim et al. (2015)

• Improve forecasting accuracy combined with structured data


✓ Forecasting the box office scores based on the polarity of SNS posts
Text Analytics: Applications
Song et al. (2019)

• Improve forecasting accuracy combined with structured data


✓ Early warning model for financial firms
Text Analytics: Applications
• Natural Language Understanding: Question Answering

https://github.com/facebookresearch/DrQA/blob/master/img/drqa.png
Text Analytics: Applications
• Natural Language Understanding: Question Answering

https://ai.googleblog.com/2019/01/natural-questions-new-corpus-and.html
Text Analytics: Applications
• Natural Language Understanding: Question Answering

https://paperswithcode.com/task/question-answering
Text Analytics: Applications
• Conversing like Human Beings: ChatBot (Dialogue System)

https://chatbotslife.com/chatbots-are-the-future-of-marketing-31fd285f37d9
Text Analytics: Challenges
• Challenges
✓ High number of possible “dimensions” (words, phrases, etc.)
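
To make the dimensionality point concrete, the sketch below counts distinct words and distinct two-word phrases in a toy corpus. The corpus is invented for illustration, and whitespace splitting is an assumed tokenizer. Each distinct word or phrase would become one dimension of the document representation.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical toy corpus, invented for illustration.
corpus = [
    "text analytics extracts knowledge from text",
    "text mining extracts patterns from documents",
]

unigrams, bigrams = set(), set()
for doc in corpus:
    tokens = doc.split()  # assumed tokenizer: whitespace split
    unigrams.update(ngrams(tokens, 1))
    bigrams.update(ngrams(tokens, 2))

print(len(unigrams), "distinct words;", len(bigrams), "distinct two-word phrases")
```

Even in two short sentences the phrase vocabulary is larger than the word vocabulary, and on a real corpus both grow far beyond the number of documents.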
Text Analytics: Challenges
• Challenges
✓ Complex and subtle relationship between concepts in texts

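“Myeongjun Jang was happily playing Overwatch when he was caught by his advisor.”

“Professor Pilsung Kang happened to stop by Room 220 of the New Engineering Building and saw a student playing a game.”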
Text Analytics: Challenges
• Challenges
✓ Ambiguity and context sensitivity
▪ automobile = car = vehicle = Hyundai

Text Analytics: Text Structures
• Structure of text data

http://www.slideshare.net/pierluca.lanzi/machine-learning-and-data-mining-19-mining-text-and-web-data
Text Analytics: Text Structures Abbott (2013)

• How Unstructured is “Unstructured”? (by Feldman and Sanger)


✓ Weakly structured
▪ Few structural cues from text-based layout or markup: research papers, legal memoranda, news stories, etc.

✓ Semi-structured
▪ Extensive format elements, metadata, field labels: e-mail, HTML/XML web pages, PDF files, etc.

• Why is Text Mining Hard?


✓ Language itself is ambiguous
▪ Context is needed to clarify meaning
▪ Same word with different meanings, different words with same meaning
▪ Misspellings, abbreviations, etc.
Text Analytics: Areas Abbott (2013)

• Active areas in text processing


Types of Text Analytics Abbott (2013)

• Seven Types of Text Mining (by Elder et al.)


✓ Document Classification
▪ Grouping and categorizing snippets, paragraphs, or documents using data mining classification methods, based on models trained on labeled examples

✓ Document Clustering
▪ Grouping and categorizing terms, snippets, paragraphs, or documents using data mining clustering methods

✓ Concept Extraction
▪ Grouping of words and phrases into semantically similar groups
Mining Text Data Abbott (2013)

• Seven Types of Text Mining (by Elder et al.)


✓ Search and Information Retrieval (IR)
▪ Storage and retrieval of text documents, including search engines and keyword search

✓ Information Extraction (IE)


▪ Identification and extraction of relevant facts and relationships from unstructured texts; the process of making structured data from unstructured and semi-structured texts

✓ Web Mining
▪ Data and text mining on the internet with a specific focus on the scale and interconnectedness of the web

✓ Natural Language Processing (NLP)
▪ Low-level language processing and understanding tasks (e.g., part-of-speech tagging)
▪ Often used synonymously with computational linguistics
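
As a small illustration of Information Extraction, the sketch below turns free text into structured records with a regular expression. The sentence pattern, the field names, and the example text are all assumptions made for this example. Real IE systems rely on far more robust techniques.

```python
import re

# Hypothetical pattern: "<First Last> joined <Organization> in <year>".
PATTERN = re.compile(
    r"([A-Z][a-z]+ [A-Z][a-z]+) joined ([A-Z][A-Za-z ]+?) in (\d{4})"
)


def extract_facts(text):
    """Return a list of structured records found in free text."""
    return [
        {"person": person, "organization": org, "year": int(year)}
        for person, org, year in PATTERN.findall(text)
    ]


# Invented example text.
text = "Alice Kim joined Acme Labs in 2015. Later, Bob Lee joined Example Corp in 2018."
records = extract_facts(text)
for record in records:
    print(record)
```

Each record is a row of structured data (person, organization, year) derived from unstructured text, which is the transformation the IE bullet describes.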
A Simplified Process of Text Analytics

Source of text data
• Digital library
• Corporate document archive
• World Wide Web (WWW)
• SNS

Step 1: Decide what to mine & Collect text data
A Simplified Process of Text Analytics

Step 1: Define what to mine & Collect text data
Step 2: Preprocess & Transform the data

From unstructured to structured!

S1: John likes to watch movies. Mary likes too.
S2: John also likes to watch football games.

Word      S1  S2
John       1   1
Likes      2   1
To         1   1
Watch      1   1
Movies     1   0
Also       0   1
Football   0   1
Games      0   1
Mary       1   0
too        1   0

A Simplified Process of Text Analytics

Step 1: Define what to mine & Collect text data
Step 2: Preprocess & Transform the data
Step 3: Select/Extract features

Reduce the number of features

Before:
Word      S1  S2
John       1   1
Likes      2   1
To         1   1
Watch      1   1
Movies     1   0
Also       0   1
Football   0   1
Games      0   1
Mary       1   0
too        1   0

After:
Word      S1  S2
Likes      2   1
Watch      1   1
Movies     1   0
Football   0   1
Games      0   1
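
The reduction from ten words to five can be reproduced with a simple filter. The slide does not state which selection rule was applied, so the drop list below (function words and personal names) is a hypothetical choice that happens to yield the same reduced table.

```python
# Term counts per word as [S1, S2], taken from the full table above.
term_counts = {
    "john": [1, 1], "likes": [2, 1], "to": [1, 1], "watch": [1, 1],
    "movies": [1, 0], "also": [0, 1], "football": [0, 1], "games": [0, 1],
    "mary": [1, 0], "too": [1, 0],
}

# Hypothetical drop list: function words and personal names. This is an
# assumption; the source only shows the result, not the rule.
DROP = {"john", "mary", "to", "also", "too"}


def select_features(counts, drop):
    """Keep only the rows whose word is not in the drop list."""
    return {word: row for word, row in counts.items() if word not in drop}


reduced = select_features(term_counts, DROP)
print(list(reduced))
```

In practice the drop list would come from a stopword list or a statistical criterion rather than being written by hand.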
A Simplified Process of Text Analytics

Step 1: Define what to mine & Collect text data
Step 2: Preprocess & Transform the data
Step 3: Select/Extract features
Step 4: Algorithm Learning & Evaluation

Select appropriate algorithm
• Vector space model vs. Probabilistic model
• Classification vs. Clustering vs. Association
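
As one illustration of the vector space model, the sketch below treats the S1 and S2 word-count columns from the earlier example as vectors and compares them with cosine similarity. Cosine similarity is a common choice in vector space models, but the slides do not prescribe it, so treat it as an assumption of this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Rows: John, Likes, To, Watch, Movies, Also, Football, Games, Mary, too.
s1 = [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
s2 = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

similarity = cosine_similarity(s1, s2)
print(round(similarity, 3))
```

A probabilistic model would instead estimate word distributions for each document or class, and the same count vectors could feed a classifier or a clustering algorithm.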
