Text Mining
The document discusses the exponential growth of data, particularly unstructured text data, and highlights various natural language processing (NLP) techniques such as text classification, sentiment analysis, and text summarization. It explains concepts like stemming, lemmatization, and the importance of stop words in processing textual data. Additionally, it covers methods for measuring word similarity and the use of semantic dictionaries like WordNet to enhance natural language understanding.
Text is Everywhere!
Data continues to grow exponentially
Estimated at 2.5 exabytes (2.5 million TB) created per day
Projected to grow to 40 zettabytes (40 billion TB) by 2020, about 50 times the 2010 volume
Approximately 80% of all data is estimated to be unstructured, text-rich data
40 million articles (5 million in English) in Wikipedia
>4.5 billion Web pages
>500 million tweets a day, 200 billion a year
>1.5 trillion queries / searches on Google a year

What can we do with text?
Parse text
Find / identify / extract relevant information from text
Classify text documents
Search for relevant text documents
Sentiment analysis
Topic modeling
Text summarization
…

Text can be handled at different units (see the tokenization sketch below):
Sentences / input strings
Words or tokens
Characters
Documents and larger files

Natural language is the language used for everyday communication by humans: English, 中文 (Chinese), русский язык (Russian), español (Spanish). Natural language processing is any computation or manipulation of natural language.
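The sketch below splits a small piece of raw text into these units with NLTK's tokenizers; the library choice and the sample text are assumptions, not something the slides prescribe.

```python
import nltk

nltk.download("punkt", quiet=True)  # sentence/word tokenizer models (assumed one-time setup)

text = "Text is everywhere. Data continues to grow exponentially!"

sentences = nltk.sent_tokenize(text)  # sentence / input-string level
words = nltk.word_tokenize(text)      # word / token level
characters = list(text)               # character level

print(sentences)   # ['Text is everywhere.', 'Data continues to grow exponentially!']
print(words)       # ['Text', 'is', 'everywhere', '.', 'Data', 'continues', ...]
print(len(characters), "characters")
```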
Natural languages evolve
new words get added
old words lose popularity
meanings of words change
language rules themselves may change
Basic NLP tasks include:
Counting words, counting the frequency of words
Finding sentence boundaries
Part-of-speech tagging (see the sketch after this list)
Parsing the sentence structure
Identifying semantic roles
Identifying entities in a sentence (Named Entity Recognition)
Finding which pronoun refers to which entity (Coreference Resolution)
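A minimal sketch of a few of these tasks, namely word counting, part-of-speech tagging, and named entity recognition, using NLTK; the library choice and the example sentence are assumptions.

```python
import nltk
from collections import Counter

# one-time model downloads for the tokenizer, tagger, and NE chunker (assumed setup)
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

sentence = "Dr. Smith prescribed aspirin at Boston General Hospital."

tokens = nltk.word_tokenize(sentence)
print(Counter(tokens))            # word frequency counts

tagged = nltk.pos_tag(tokens)     # part-of-speech tagging
print(tagged)                     # e.g. [('Dr.', 'NNP'), ('Smith', 'NNP'), ...]

entities = nltk.ne_chunk(tagged)  # named entity recognition
print(entities)                   # tree with PERSON / ORGANIZATION / GPE subtrees
```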
Which medical specialty does this document relate to? This is a classification task: given a set of classes, assign the correct class label to the given input.
Topic identification: Is this news article about Politics, Sports, or Technology?
Spam detection: Is this email spam or not?
Sentiment analysis: Is this movie review positive or negative?
Spelling correction: weather or whether? color or colour?
Humans learn from past experiences; machines learn from past instances!
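A minimal sketch of learning one of the classification tasks above (spam detection) from past instances, using a bag-of-words Naive Bayes classifier from scikit-learn; the library, the toy training data, and the feature choices are assumptions rather than anything prescribed in the slides.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy past instances (assumed data): 1 = spam, 0 = not spam
emails = [
    "Win a free prize now",
    "Lowest price on meds, click here",
    "Meeting rescheduled to Monday",
    "Here are the lecture notes you asked for",
]
labels = [1, 1, 0, 0]

# bag-of-words features; common English stop words are dropped
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(emails)

classifier = MultinomialNB().fit(X, labels)

new_email = ["Claim your free prize today"]
print(classifier.predict(vectorizer.transform(new_email)))  # [1] -> flagged as spam
```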
Textual data presents a unique set of challenges. All the information you need is in the text, but features can be pulled out of text at different granularities.

Words
• By far the most common class of features
• Handling commonly occurring words: stop words
• Normalization: make lower case vs. leave as-is
• Stemming / lemmatization

Characteristics of words
• Capitalization
• Parts of speech of words in a sentence
• Grammatical structure, sentence parsing
• Grouping words of similar meaning or semantics: {buy, purchase}; {Mr., Ms., Dr., Prof.}; numbers / digits; dates

Depending on the classification task, features may also come from inside words and from word sequences:
• bigrams, trigrams, n-grams: "White House"
• character sub-sequences in words: "ing", "ion", …

Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in natural language processing. Stemming is the process of reducing inflected words to their root forms, mapping a group of words to the same stem even if the stem itself is not a valid word in the language. A computer program or subroutine that stems words may be called a stemming program, stemming algorithm, or stemmer. Common stemming algorithms include the Porter, Snowball, and Lancaster stemmers, compared in the sketch below.
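A minimal sketch comparing those three stemmers on a handful of words, using NLTK's implementations (the word list is an assumption); note that stems such as "univers" are not valid English words.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

words = ["universal", "university", "studies", "studying", "cries"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for w in words:
    # each stemmer may map a word to a different (possibly non-word) stem
    print(f"{w:12s} porter={porter.stem(w):10s} "
          f"snowball={snowball.stem(w):10s} lancaster={lancaster.stem(w)}")
```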
Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called a lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
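A minimal sketch of lemmatization with NLTK's WordNet-based lemmatizer (the library choice and example words are assumptions); unlike the stems above, every lemma is a real dictionary word.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database the lemmatizer relies on

lemmatizer = WordNetLemmatizer()

# the part-of-speech hint matters: with the wrong hint a word may be left unchanged
print(lemmatizer.lemmatize("studies", pos="n"))   # study
print(lemmatizer.lemmatize("studying", pos="v"))  # study
print(lemmatizer.lemmatize("better", pos="a"))    # good
print(lemmatizer.lemmatize("cries", pos="v"))     # cry
```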
Stop words are words that do not carry significant meaning for use in search queries. Most NLP libraries and toolkits provide their own list of stop words; they are mostly words that occur very frequently in English, such as 'as', 'the', 'be', 'are', etc.

Tf-idf example: consider a document containing 100 words in which the word cat appears 3 times. The term frequency (tf) for cat is then 3 / 100 = 0.03. Now assume we have 10 million documents and the word cat appears in one thousand of these. The inverse document frequency (idf) is then log(10,000,000 / 1,000) = 4. The tf-idf weight is the product of these quantities: 0.03 × 4 = 0.12.
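The same arithmetic as a small Python sketch; the function name is made up for illustration, and the base-10 logarithm is assumed to match the worked example.

```python
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """Tf-idf weight of one term in one document (tf = raw count / document length)."""
    tf = term_count / doc_length
    idf = math.log10(num_docs / docs_with_term)
    return tf * idf

# the "cat" example above: 3 occurrences in a 100-word document,
# 1,000 of 10,000,000 documents contain the word
print(tf_idf(3, 100, 10_000_000, 1_000))  # 0.12
```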
Consider word pairs such as (deer, elk), (deer, giraffe), (deer, horse), (deer, mouse). Some pairs are intuitively closer in meaning than others. How can we quantify such similarity?
Grouping similar words into semantic concepts
As a building block in natural language understanding tasks such as paraphrasing.
WordNet is a semantic dictionary of (mostly) English words, interlinked by semantic relations.
Includes rich linguistic information
part of speech, word senses, synonyms, hypernyms/hyponyms, meronyms, derivationally related forms, …
Machine-readable, freely available.
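A minimal sketch of looking up this information through NLTK's WordNet interface (the example word is an assumption).

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# word senses: each synset is one sense of "deer"
for synset in wn.synsets("deer"):
    print(synset.name(), "-", synset.definition())

deer = wn.synset("deer.n.01")
print(deer.lemma_names())     # synonyms within this sense
print(deer.hypernyms())       # more general concepts (hypernyms)
print(deer.part_meronyms())   # recorded parts, if any, for this sense
```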
WordNet organizes information in a hierarchy
Many similarity measures use the hierarchy in some way
Verbs, nouns, adjectives all have separate hierarchies
Find the shortest path between the two concepts
Similarity measure inversely related to path distance
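A minimal sketch of quantifying the deer/elk/giraffe/horse/mouse comparisons above with NLTK's path-based similarity over the WordNet noun hierarchy; the specific senses chosen are assumptions.

```python
from nltk.corpus import wordnet as wn

deer = wn.synset("deer.n.01")

# path_similarity = 1 / (shortest path length + 1), so closer concepts score higher
for name in ["elk.n.01", "giraffe.n.01", "horse.n.01", "mouse.n.01"]:
    print(name, deer.path_similarity(wn.synset(name)))
```

Higher scores indicate a shorter path in the hierarchy; deer and elk should come out more similar than deer and mouse.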