Text Mining

The document discusses the exponential growth of data, particularly unstructured text data, and highlights various natural language processing (NLP) techniques such as text classification, sentiment analysis, and text summarization. It explains concepts like stemming, lemmatization, and the importance of stop words in processing textual data. Additionally, it covers methods for measuring word similarity and the use of semantic dictionaries like WordNet to enhance natural language understanding.


Text is Everywhere!

 Data continues to grow exponentially

 Estimated at 2.5 exabytes (2.5 million TB) generated per day
 Expected to grow to 40 zettabytes (40 billion TB) by 2020 (50 times the 2010 volume)

 Approximately 80% of all data is estimated to be unstructured, text-rich data

 40 million articles (5 million in English) in Wikipedia
 >4.5 billion Web pages
 >500 million tweets a day, 200 billion a year
 >1.5 trillion queries / searches on Google a year
 Parse text
 Find / Identify / Extract relevant information from text
 Classify text documents
 Search for relevant text documents
 Sentiment analysis
 Topic modeling
 Text summarization
 ……………
 Sentences / input strings
 Words or Tokens
 Characters
 Document, larger files
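As a concrete illustration, here is a minimal sketch, assuming the NLTK library (not prescribed by the slides), of splitting raw text at each of these granularities; the sample sentence is made up:

```python
# Minimal sketch: breaking text into units with NLTK (assumes
# nltk is installed plus the "punkt" tokenizer models).
import nltk

nltk.download("punkt", quiet=True)

document = "Text is everywhere. It keeps growing!"

sentences  = nltk.sent_tokenize(document)   # sentences / input strings
tokens     = nltk.word_tokenize(document)   # words or tokens
characters = list(document)                 # characters

print(sentences)       # ['Text is everywhere.', 'It keeps growing!']
print(tokens)          # ['Text', 'is', 'everywhere', '.', 'It', ...]
print(characters[:5])  # ['T', 'e', 'x', 't', ' ']
```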
 Natural language: language used for everyday communication by humans
 English
 中文 (Chinese)
 русский язык (Russian)
 español (Spanish)
 Natural Language Processing (NLP): any computation, manipulation of natural language

 Natural languages evolve
 new words get added
 old words lose popularity
 meanings of words change
 language rules themselves may change

 Counting words, counting frequency of words
 Finding sentence boundaries
 Part-of-speech (POS) tagging
 Parsing the sentence structure
 Identifying semantic roles
 Identifying entities in a sentence (Named Entity Recognition)
 Finding which pronoun refers to which entity (Coreference Resolution)
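A minimal sketch of three of these tasks, again assuming NLTK (the sample text is made up, and newer NLTK releases may name the tagger resource differently):

```python
# Minimal sketch: word counts, sentence boundaries, and POS tagging
# with NLTK (assumes the "punkt" and "averaged_perceptron_tagger"
# resources; resource names may vary across NLTK versions).
from collections import Counter

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "Text mining turns raw text into features. Features feed models."

sentences = nltk.sent_tokenize(text)                    # sentence boundaries
counts = Counter(nltk.word_tokenize(text.lower()))      # word frequencies
tags = nltk.pos_tag(nltk.word_tokenize(sentences[0]))   # POS tagging

print(sentences)
print(counts.most_common(3))  # e.g. [('text', 2), ('features', 2), ...]
print(tags)                   # e.g. [('Text', 'NN'), ('mining', 'NN'), ...]
```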
 Given a set of classes:

 Classification: Assign the correct class label to the given input

 Topic identification: Is this news article about Politics, Sports, or Technology?
 Spam detection: Is this email spam or not?
 Sentiment analysis: Is this movie review positive or negative?
 Spelling correction: weather or whether? color or colour?
 Which medical specialty does this document relate to?
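To make the classification setting concrete, here is a minimal sketch using scikit-learn (an assumption; the slides do not prescribe a library), with a tiny made-up training set for the topic identification example:

```python
# Minimal sketch: topic identification as text classification,
# using bag-of-words features and Naive Bayes via scikit-learn.
# The training data below is made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "the team won the match in overtime",
    "the striker scored twice last night",
    "parliament passed the new budget bill",
    "the startup released a faster chip",
]
train_labels = ["Sports", "Sports", "Politics", "Technology"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["the senate debated the new bill"]))  # ['Politics']
```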
 Humans learn from past experiences, machines learn from past instances!

 Textual data presents a unique set of challenges:

 All the information you need is in the text
 But features can be pulled out from text at different granularities!
Words
• By far the most common class of features
• Handling commonly-occurring words: stop words
• Normalization: make lower case vs. leave as-is
• Stemming / Lemmatization

 Characteristics of words: capitalization
 Parts of speech of words in a sentence
 Grammatical structure, sentence parsing
 Grouping words of similar meaning, semantics
• {buy, purchase}
• {Mr., Ms., Dr., Prof.}; Numbers / Digits; Dates

Depending on the classification task, features may come from inside words and from word sequences:
• bigrams, trigrams, n-grams: “White House”
• character sub-sequences in words: “ing”, “ion”, …
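A minimal sketch of pulling features at these different granularities, assuming scikit-learn's CountVectorizer (one option among many; the sentence is illustrative):

```python
# Minimal sketch: word n-gram and character sub-sequence features
# with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The President visited the White House"]

# Lower-cased word unigrams + bigrams, English stop words removed
word_vec = CountVectorizer(ngram_range=(1, 2), stop_words="english")
word_vec.fit(docs)
print(word_vec.get_feature_names_out())
# e.g. [... 'president' ... 'white house' ...]

# Character 3-grams drawn from inside word boundaries
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
char_vec.fit(docs)
print(char_vec.get_feature_names_out()[:8])
```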
 Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in the field of Natural Language Processing.

 Stemming is the process of reducing inflected words to their root forms, mapping a group of words to the same stem even if the stem itself is not a valid word in the language.

 A computer program or subroutine that stems words may be called a stemming program, stemming algorithm, or stemmer.

 Stemming algorithms: Porter stemmer, Snowball stemmer, Lancaster stemmer
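A minimal sketch of the Porter stemmer via NLTK (an assumption; Snowball and Lancaster are used the same way), showing that stems need not be valid words:

```python
# Minimal sketch: stemming with NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "ponies", "flies", "universal"]:
    print(word, "->", stemmer.stem(word))
# running -> run, ponies -> poni, flies -> fli, universal -> univers
# "poni" and "fli" are not valid English words; stems need not be.
```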

 Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called a lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
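A minimal sketch of lemmatization via NLTK's WordNetLemmatizer (an assumption; any dictionary-backed lemmatizer would illustrate the point):

```python
# Minimal sketch: lemmatization with NLTK's WordNetLemmatizer
# (assumes the "wordnet" resource has been downloaded).
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

wnl = WordNetLemmatizer()
print(wnl.lemmatize("mice"))              # mouse (noun is the default)
print(wnl.lemmatize("running", pos="v"))  # run
print(wnl.lemmatize("better", pos="a"))   # good
# Unlike stems, every lemma is a valid dictionary word.
```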
 Stop words are words that carry little significance for search queries or other text-processing tasks. Most NLP libraries provide their own list of stop words. They are typically the most commonly used words in a language, such as 'as', 'the', 'be', 'are', etc.
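A minimal sketch of stop-word removal, assuming NLTK's English stop-word list (the sample sentence is made up):

```python
# Minimal sketch: filtering stop words with NLTK's English list
# (assumes the "stopwords" resource has been downloaded).
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

stops = set(stopwords.words("english"))
tokens = "the cat sat on the mat as quietly as it could".split()
print([t for t in tokens if t not in stops])
# roughly: ['cat', 'sat', 'mat', 'quietly', 'could']
```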
Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (tf) for cat is then 3 / 100 = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (idf) is calculated as log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 × 4 = 0.12.
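The same arithmetic as a runnable sketch (base-10 log, matching the worked example; production libraries often apply extra smoothing):

```python
# Minimal sketch: the tf-idf arithmetic from the example above.
import math

tf = 3 / 100                          # "cat" appears 3 times in 100 words
idf = math.log10(10_000_000 / 1_000)  # 10M docs, "cat" in 1,000 of them
print(tf, idf, tf * idf)              # 0.03 4.0 0.12
```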
 Deer, Elk
 Deer, Giraffe
 Deer, Horse
 Deer, Mouse

 How can we quantify such similarity?

 Grouping similar words into semantic concepts
 As a building block in natural language understanding tasks such as paraphrasing
 WordNet: a semantic dictionary of (mostly) English words, interlinked by semantic relations.

 Includes rich linguistic information:
 part of speech, word senses, synonyms, hypernyms/hyponyms, meronyms, derivationally related forms, …

 Machine-readable, freely available.

 WordNet organizes information in a hierarchy
 Many similarity measures use the hierarchy in some way
 Verbs, nouns, adjectives all have separate hierarchies

 Path similarity: find the shortest path between the two concepts in the hierarchy
 Similarity measure inversely related to path distance: PathSim(u, v) = 1 / (1 + shortest path distance between u and v)

 PathSim(deer, elk) = 0.5
 PathSim(deer, giraffe) = 0.33
 PathSim(deer, horse) = 0.14
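These numbers can be reproduced with NLTK's WordNet interface (a sketch, assuming the first noun sense of each word, i.e. synsets like 'deer.n.01'):

```python
# Minimal sketch: WordNet path similarity via NLTK
# (assumes the "wordnet" resource has been downloaded).
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

deer = wn.synset("deer.n.01")

# path_similarity = 1 / (1 + shortest path distance in the hierarchy)
print(deer.path_similarity(wn.synset("elk.n.01")))      # 0.5
print(deer.path_similarity(wn.synset("giraffe.n.01")))  # ~0.33
print(deer.path_similarity(wn.synset("horse.n.01")))    # ~0.14
```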
