Text Mining
The document discusses the exponential growth of data, particularly unstructured text data, and highlights various natural language processing (NLP) techniques such as text classification, sentiment analysis, and text summarization. It explains concepts like stemming, lemmatization, and the importance of stop words in processing textual data. Additionally, it covers methods for measuring word similarity and the use of semantic dictionaries like WordNet to enhance natural language understanding.
Text is Everywhere!
Data continues to grow exponentially
Estimated at 2.5 exabytes (2.5 million TB) created per day
Projected to grow to 40 zettabytes (40 billion TB) by 2020, about 50 times the 2010 volume
Approximately 80% of all data is estimated to be unstructured, text-rich data
40 million articles (5 million in English) in Wikipedia
>4.5 billion Web pages
>500 million tweets a day, 200 billion a year
>1.5 trillion queries / searches on Google a year

What can we do with text?
Parse text
Find / identify / extract relevant information from text
Classify text documents
Search for relevant text documents
Sentiment analysis
Topic modeling
Text summarization
…

Text can be handled at different units (see the tokenization sketch below):
Sentences / input strings
Words or tokens
Characters
Documents and larger files

Natural language is the language used for everyday communication by humans: English, 中文 (Chinese), русский язык (Russian), español (Spanish). Natural language processing is any computation or manipulation of natural language.
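The sketch below splits a small piece of raw text into these units with NLTK's tokenizers; the library choice and the sample text are assumptions, not something the slides prescribe.

```python
import nltk

nltk.download("punkt", quiet=True)  # sentence/word tokenizer models (assumed one-time setup)

text = "Text is everywhere. Data continues to grow exponentially!"

sentences = nltk.sent_tokenize(text)  # sentence / input-string level
words = nltk.word_tokenize(text)      # word / token level
characters = list(text)               # character level

print(sentences)   # ['Text is everywhere.', 'Data continues to grow exponentially!']
print(words)       # ['Text', 'is', 'everywhere', '.', 'Data', 'continues', ...]
print(len(characters), "characters")
```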
Natural languages evolve
new words get added
old words lose popularity
meanings of words change
language rules themselves may change
Basic NLP tasks include:
Counting words, counting the frequency of words
Finding sentence boundaries
Part-of-speech tagging (see the sketch after this list)
Parsing the sentence structure
Identifying semantic roles
Identifying entities in a sentence (Named Entity Recognition)
Finding which pronoun refers to which entity (Coreference Resolution)
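A minimal sketch of a few of these tasks, namely word counting, part-of-speech tagging, and named entity recognition, using NLTK; the library choice and the example sentence are assumptions.

```python
import nltk
from collections import Counter

# one-time model downloads for the tokenizer, tagger, and NE chunker (assumed setup)
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

sentence = "Dr. Smith prescribed aspirin at Boston General Hospital."

tokens = nltk.word_tokenize(sentence)
print(Counter(tokens))            # word frequency counts

tagged = nltk.pos_tag(tokens)     # part-of-speech tagging
print(tagged)                     # e.g. [('Dr.', 'NNP'), ('Smith', 'NNP'), ...]

entities = nltk.ne_chunk(tagged)  # named entity recognition
print(entities)                   # tree with PERSON / ORGANIZATION / GPE subtrees
```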
Which medical specialty does this document relate to? This is a classification task: given a set of classes, assign the correct class label to the given input.
Topic identification: Is this news article about Politics, Sports, or Technology?
Spam detection: Is this email spam or not?
Sentiment analysis: Is this movie review positive or negative?
Spelling correction: weather or whether? color or colour?
Humans learn from past experiences; machines learn from past instances!
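A minimal sketch of learning one of the classification tasks above (spam detection) from past instances, using a bag-of-words Naive Bayes classifier from scikit-learn; the library, the toy training data, and the feature choices are assumptions rather than anything prescribed in the slides.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy past instances (assumed data): 1 = spam, 0 = not spam
emails = [
    "Win a free prize now",
    "Lowest price on meds, click here",
    "Meeting rescheduled to Monday",
    "Here are the lecture notes you asked for",
]
labels = [1, 1, 0, 0]

# bag-of-words features; common English stop words are dropped
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(emails)

classifier = MultinomialNB().fit(X, labels)

new_email = ["Claim your free prize today"]
print(classifier.predict(vectorizer.transform(new_email)))  # [1] -> flagged as spam
```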
Textual data presents a unique set of challenges. All the information you need is in the text, but features can be pulled out of text at different granularities.

Words
• By far the most common class of features
• Handling commonly occurring words: stop words
• Normalization: make lower case vs. leave as-is
• Stemming / lemmatization

Characteristics of words
• Capitalization
• Parts of speech of words in a sentence
• Grammatical structure, sentence parsing
• Grouping words of similar meaning or semantics: {buy, purchase}; {Mr., Ms., Dr., Prof.}; numbers / digits; dates

Depending on the classification task, features may also come from inside words and from word sequences:
• bigrams, trigrams, n-grams: "White House"
• character sub-sequences in words: "ing", "ion", …

Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in natural language processing. Stemming is the process of reducing inflected words to their root forms, mapping a group of words to the same stem even if the stem itself is not a valid word in the language. A computer program or subroutine that stems words may be called a stemming program, stemming algorithm, or stemmer. Common stemming algorithms include the Porter, Snowball, and Lancaster stemmers, compared in the sketch below.
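A minimal sketch comparing those three stemmers on a handful of words, using NLTK's implementations (the word list is an assumption); note that stems such as "univers" are not valid English words.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

words = ["universal", "university", "studies", "studying", "cries"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for w in words:
    # each stemmer may map a word to a different (possibly non-word) stem
    print(f"{w:12s} porter={porter.stem(w):10s} "
          f"snowball={snowball.stem(w):10s} lancaster={lancaster.stem(w)}")
```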
Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called a lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
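A minimal sketch of lemmatization with NLTK's WordNet-based lemmatizer (the library choice and example words are assumptions); unlike the stems above, every lemma is a real dictionary word.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database the lemmatizer relies on

lemmatizer = WordNetLemmatizer()

# the part-of-speech hint matters: with the wrong hint a word may be left unchanged
print(lemmatizer.lemmatize("studies", pos="n"))   # study
print(lemmatizer.lemmatize("studying", pos="v"))  # study
print(lemmatizer.lemmatize("better", pos="a"))    # good
print(lemmatizer.lemmatize("cries", pos="v"))     # cry
```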
Stop words are words that do not carry significant meaning for use in search queries. Most NLP libraries and toolkits provide their own list of stop words; they are mostly words that occur very frequently in English, such as 'as', 'the', 'be', 'are', etc.

Tf-idf example: consider a document containing 100 words in which the word cat appears 3 times. The term frequency (tf) for cat is then 3 / 100 = 0.03. Now assume we have 10 million documents and the word cat appears in one thousand of these. The inverse document frequency (idf) is then log(10,000,000 / 1,000) = 4. The tf-idf weight is the product of these quantities: 0.03 × 4 = 0.12.
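The same arithmetic as a small Python sketch; the function name is made up for illustration, and the base-10 logarithm is assumed to match the worked example.

```python
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """Tf-idf weight of one term in one document (tf = raw count / document length)."""
    tf = term_count / doc_length
    idf = math.log10(num_docs / docs_with_term)
    return tf * idf

# the "cat" example above: 3 occurrences in a 100-word document,
# 1,000 of 10,000,000 documents contain the word
print(tf_idf(3, 100, 10_000_000, 1_000))  # 0.12
```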
Consider word pairs such as (deer, elk), (deer, giraffe), (deer, horse), (deer, mouse). Some pairs are intuitively closer in meaning than others. How can we quantify such similarity?
Grouping similar words into semantic concepts
As a building block in natural language understanding tasks such as paraphrasing.
WordNet is a semantic dictionary of (mostly) English words, interlinked by semantic relations.
Includes rich linguistic information
part of speech, word senses, synonyms, hypernyms/hyponyms, meronyms, derivationally related forms, …
Machine-readable, freely available.
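A minimal sketch of looking up this information through NLTK's WordNet interface (the example word is an assumption).

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# word senses: each synset is one sense of "deer"
for synset in wn.synsets("deer"):
    print(synset.name(), "-", synset.definition())

deer = wn.synset("deer.n.01")
print(deer.lemma_names())     # synonyms within this sense
print(deer.hypernyms())       # more general concepts (hypernyms)
print(deer.part_meronyms())   # recorded parts, if any, for this sense
```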
WordNet organizes information in a hierarchy
Many similarity measures use the hierarchy in some way
Verbs, nouns, adjectives all have separate hierarchies
Find the shortest path between the two concepts
Similarity measure inversely related to path distance
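A minimal sketch of quantifying the deer/elk/giraffe/horse/mouse comparisons above with NLTK's path-based similarity over the WordNet noun hierarchy; the specific senses chosen are assumptions.

```python
from nltk.corpus import wordnet as wn

deer = wn.synset("deer.n.01")

# path_similarity = 1 / (shortest path length + 1), so closer concepts score higher
for name in ["elk.n.01", "giraffe.n.01", "horse.n.01", "mouse.n.01"]:
    print(name, deer.path_similarity(wn.synset(name)))
```

Higher scores indicate a shorter path in the hierarchy; deer and elk should come out more similar than deer and mouse.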