Slides - Text Mining

The document discusses various techniques in text analytics including text cleaning, bag-of-words modeling, n-gram modeling, and applications such as sentiment analysis, spam detection, and text summarization. Text analytics is the process of extracting meaning from unstructured written text through techniques like removing stop words, stemming words, and representing documents as frequencies of words or consecutive pairs of words.

Uploaded by

Sunil Sharma

Text Analytics

Machine Learning

Proprietary content. ©Great Learning. All Rights Reserved.


This file is meant for personal use by [email protected] only. Unauthorized use or distribution prohibited.
Sharing or publishing the contents in part or full is liable for legal action.
Agenda

• Introduction to Text Analytics

• Text Analytics and Applications

• Unstructured Vs Structured Data, Cleaning

• Bag of words, Word Frequencies


• Hierarchical Clustering

• Sentiment Analysis

• Word Embeddings

• Ensemble Methods

Text Analytics

• The process of drawing meaning out of written communication.

• Used to understand online reviews, tweets, call center agent notes, survey results, and other types of written feedback that capture insight into your customers.

• Spam detection

• Translation

• Search and crawl

• Sentiment analysis

• Entity modeling to support fact-based decision making

• Text summarization
Text: Unstructured Data

• Structured data: Data is organized into a pre-defined structure, like a database table with rows and columns.

• Unstructured data: Data does not have a pre-defined structure. Think of a collection of emails, a bunch of satellite images, or the entire text of speeches from the British Parliament since 1803.

Modeling/representing text

• Bag of words - Documents are represented simply by the words in the document and their frequencies. Grammar and word order are disregarded.

• Example application: Bayesian spam filter

• Semantic - Mapping natural-language rules to get a formal representation of the meaning of the text

• Named entity identification
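
To make the Bayesian spam filter concrete, here is a minimal sketch of a naive Bayes classifier over bag-of-words counts. The slides only name the technique, so everything specific here is an assumption for illustration: the multinomial model with add-one smoothing, the made-up training messages, and the function names `train` and `classify`.

```python
import math
from collections import Counter

# Toy, made-up training messages (not from the slides).
TRAIN = [
    ("win cash prize now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch tomorrow with john", "ham"),
]

def train(examples):
    """Count word frequencies per label: one bag of words per class."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(text, word_counts, doc_counts, vocab):
    """Return the label with the highest log posterior (add-one smoothing)."""
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(classify("free cash prize", *model))
print(classify("meeting with john tomorrow", *model))
```

Word order plays no role in the score: only the per-class word counts do, which is why the bag-of-words representation is enough for this kind of filter.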

Bag of words
• Corpus:

• A: John likes to play soccer

• B: John is reading a book

    John  likes  soccer  play  book  reading  a  is  to
A   1     1      1       1     0     0        0  0   1
B   1     0      0       0     1     1        1  1   0
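
The count table for this corpus can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the helper name `bag_of_words` is hypothetical, and the column order is copied from the slide.

```python
from collections import Counter

corpus = {
    "A": "John likes to play soccer",
    "B": "John is reading a book",
}

# Column order copied from the slide's table.
vocabulary = ["John", "likes", "soccer", "play", "book", "reading", "a", "is", "to"]

def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary term in text, ignoring word order."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]

vectors = {name: bag_of_words(text, vocabulary) for name, text in corpus.items()}
for name, vector in vectors.items():
    print(name, vector)
# A [1, 1, 1, 1, 0, 0, 0, 0, 1]
# B [1, 0, 0, 0, 1, 1, 1, 1, 0]
```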

n-gram model

• The bag-of-words model is an orderless document representation. Only the counts of words matter.

• The same representation can be built from consecutive pairs of words (2-grams), counting each pair instead of each word.

• A: John likes to play soccer

• B: John is reading a book

• 2-gram (bigram):

    John likes | likes to | play soccer | to play | John is | is reading | reading a | a book
A   1          | 1        | 1           | 1       | 0       | 0          | 0         | 0
B   0          | 0        | 0           | 0       | 1       | 1          | 1         | 1
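
A minimal sketch of bigram extraction for the same two sentences, using only the standard library. The helper name `ngrams` is hypothetical, and the columns come out in first-seen order, which differs slightly from the column order on the slide.

```python
from collections import Counter

def ngrams(text, n=2):
    """Return the consecutive n-word sequences in text, in order."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = {
    "A": "John likes to play soccer",
    "B": "John is reading a book",
}

bigram_counts = {name: Counter(ngrams(text)) for name, text in corpus.items()}

# Collect every bigram seen in the corpus, in first-seen order.
columns = []
for text in corpus.values():
    for gram in ngrams(text):
        if gram not in columns:
            columns.append(gram)

for name, counts in bigram_counts.items():
    print(name, [counts[gram] for gram in columns])
```

Passing `n=1` gives back the plain bag-of-words tokens, so the bigram model is the same counting scheme applied to longer units.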

Cleaning text

• Stop words: Common words that add little value or context, e.g. 'the', 'an', 'in'. These are removed.

• Stemming: Reducing words to their stem, e.g. 'Chopping' and 'Chopped' are both replaced with 'Chop'.

• Lower case conversion

• Remove punctuation

• Strip extra whitespace

• Remove numbers
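
The cleaning steps above can be chained into one small pipeline. This is a sketch under stated assumptions: the stop-word list is a tiny illustrative set, and `naive_stem` is a hypothetical toy suffix stripper, not a real stemming algorithm such as Porter's. In practice a library stemmer and stop-word list would normally be used instead.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "an", "in", "a", "is", "to", "of", "and"}

def naive_stem(word):
    """Toy stemmer: strip 'ing'/'ed' and undo a doubled final consonant."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "chopp" -> "chop"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

def clean(text):
    """Apply the cleaning steps from the slide and return a list of tokens."""
    text = text.lower()                                               # lower case conversion
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    tokens = text.split()                                             # split() also drops extra whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]               # remove stop words
    return [naive_stem(t) for t in tokens]                            # stemming

print(clean("  Chopping the 3 onions,   then Chopped an apple!  "))
# ['chop', 'onions', 'then', 'chop', 'apple']
```

Stop words are removed before stemming here so that short function words are never passed to the stemmer; the order of the other steps is a design choice rather than something the slides prescribe.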

Example

Term-Document Matrix (TDM)
            Doc 1   Doc 2   …   Doc N
Term 1
Term 2
…
Term M

Document-Term Matrix (DTM)

            Term 1   Term 2   …   Term M
Doc 1
Doc 2
…
Doc N
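
The two layouts hold the same counts; one is the transpose of the other. A minimal sketch in plain Python, assuming a two-document toy corpus reused from the earlier slides (lowercased) and alphabetical term order:

```python
docs = {
    "Doc 1": "john likes to play soccer",
    "Doc 2": "john is reading a book",
}

# Vocabulary: every distinct term in the corpus, in alphabetical order.
terms = sorted({term for text in docs.values() for term in text.split()})

# Document-term matrix: one row per document, one column per term.
dtm = [[text.split().count(term) for term in terms] for text in docs.values()]

# Term-document matrix: the transpose, one row per term.
tdm = [list(column) for column in zip(*dtm)]

print(terms)
for name, row in zip(docs, dtm):
    print(name, row)
for term, row in zip(terms, tdm):
    print(term, row)
```

With N documents and M terms, the DTM has shape N x M and the TDM has shape M x N.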

• Each document is represented by a vector in the term-document matrix (a column of the TDM, or equivalently a row of the DTM)

• This lends itself to a number of ML techniques

• For example, these vectors (documents) can be clustered to identify similar documents
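
As an illustration of clustering document vectors, here is a minimal sketch of agglomerative (hierarchical) clustering written in plain Python. The slides do not specify a method, so the choices here are assumptions: cosine distance between count vectors, single linkage, a made-up four-sentence corpus, and the hypothetical function names `cosine_distance` and `agglomerative`.

```python
import math

# Made-up corpus: two sentences about soccer, two about reading.
docs = [
    "john likes to play soccer",
    "mary likes to play soccer",
    "john is reading a book",
    "mary is reading a book",
]

terms = sorted({term for text in docs for term in text.split()})
vectors = [[text.split().count(term) for term in terms] for text in docs]

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means no shared terms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def agglomerative(vectors, n_clusters):
    """Merge the two closest clusters (single linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    cosine_distance(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters

print(agglomerative(vectors, n_clusters=2))
# [[0, 1], [2, 3]]
```

The two soccer sentences share four of their five words, as do the two reading sentences, so each pair is merged first and the result groups documents by topic rather than by the name they mention.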
