AFM_Module 4
Software/tools
Week 4
Learning Objectives
• Describe text mining and understand the need for text mining
• Differentiate between text mining, Web mining, and data mining
• Understand the different application areas for text mining
• Know the process of carrying out a text mining project
• Describe Web mining, its objectives, and its benefits
• Understand the three different branches of Web mining:
• Web content mining
• Web structure mining
• Web usage mining
Text Analytics
Applying data analytics to derive knowledge from text
Key concepts: stop words, synonyms, morphology, the term-by-document matrix, and singular value decomposition (SVD).
Morphology:
Morphology is the study of the internal structure of words, including prefixes, suffixes, and root words. In topic modeling, considering morphology can help identify different forms of the same word (e.g., "play," "playing," "played") as representing the same underlying topic. This can lead to a more accurate understanding of the thematic content.
Term-by-document Matrix:
This is a fundamental data structure used in topic modeling. Imagine a table where rows represent documents and columns represent the unique words encountered across all documents. Each cell contains a value that represents the weight or importance of a specific word in a particular document. This weight can be simply the word count (term frequency) or a more sophisticated measure such as TF-IDF.
Singular Value Decomposition (SVD):
SVD is a mathematical technique used for dimensionality reduction. In topic modeling, the term-by-document matrix can be very large, with many words and documents. SVD helps decompose this matrix into a more manageable form by identifying the most significant underlying themes (topics) within the data.
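As a concrete illustration, here is a minimal sketch of building a term-by-document matrix in Python with scikit-learn. The three-document corpus is made up for this example, and scikit-learn is assumed to be available.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny illustrative corpus (made-up documents)
docs = [
    "text mining extracts knowledge from text documents",
    "web mining applies data mining to web data",
    "topic models summarize the themes in large document collections",
]

# Rows of the resulting matrix are documents, columns are unique terms,
# and each cell holds the raw term frequency (word count).
vectorizer = CountVectorizer(stop_words="english")
tdm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the column (term) labels
print(tdm.toarray())                       # the term-by-document matrix
```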
Text analytics: Tasks
Text analytics: Text normalization
Applying stemming or lemmatization. Stemming reduces words to their root form by stripping affixes (e.g., "running" → "run"), while lemmatization considers the grammatical context and reduces words to their dictionary form (e.g., "better" → "good", "mice" → "mouse").
Lemmatization:
Dictionary based: lemmatization uses a dictionary to map words to their dictionary form, also called a lemma. It considers the grammatical context of the word to ensure the resulting lemma is an actual word.
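A short sketch contrasting stemming and lemmatization in Python with NLTK (assuming the nltk package and its WordNet data are installed; the example words are arbitrary):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time download of the lexical resources the lemmatizer needs
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes by rule; the result need not be a real word.
print(stemmer.stem("running"))   # 'run'
print(stemmer.stem("studies"))   # 'studi' (not a dictionary word)

# Lemmatization maps a word to its dictionary form (lemma),
# using the part of speech as grammatical context.
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("studies", pos="v"))  # 'study'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
print(lemmatizer.lemmatize("mice", pos="n"))     # 'mouse'
```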
Text analytics: more basics
Text Mining Process
Context diagram for the text mining process: the constraints on the process include software/hardware limitations, privacy issues, and linguistic limitations; the mechanisms that drive it are domain expertise and tools and techniques.
Text Mining Process
The process consists of three tasks (Task 1 → Task 2 → Task 3), with feedback loops between them.
• Inputs to the process: a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.
• Output of Task 1: a collection of documents in some digitized format for computer processing
• Output of Task 2: a flat file called the term–document matrix, where the cells are populated with term frequencies
• Output of Task 3: a number of problem-specific classification, association, and clustering models and visualizations
[Example term–document matrix excerpt: rows are documents, columns are terms, cells hold term frequencies]
Text Mining Process
• Step 2: Create the Term–by–Document Matrix (TDM)
• Should all terms be included?
• Stop words, include words
• Synonyms, homonyms
• Stemming
• What is the best representation of the indices (values in cells)?
• Raw counts; binary frequencies; log frequencies; inverse document frequencies
Inverse document frequency (IDF) is a metric used in text analysis, particularly in conjunction with term frequency (TF), to calculate a word's importance within a document relative to a collection of documents (corpus).
Focuses on rare words: IDF emphasizes words that are uncommon across the entire document collection. These uncommon words are likely more informative and specific to the document's content than frequent words.
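A small sketch of these cell-value representations computed by hand with NumPy (the counts are made up; the IDF here uses the common log(N / document frequency) form, which is one of several variants):

```python
import numpy as np

# Raw term counts: rows = documents, columns = terms (made-up numbers)
tf = np.array([
    [2, 0, 1],
    [0, 3, 1],
    [1, 1, 0],
])

binary = (tf > 0).astype(int)      # binary frequencies: does the term occur at all?
log_tf = np.log1p(tf)              # log frequencies: dampen very frequent terms

n_docs = tf.shape[0]
doc_freq = (tf > 0).sum(axis=0)    # number of documents containing each term
idf = np.log(n_docs / doc_freq)    # rare terms get a high IDF, common terms a low one

# TF-IDF: a term scores highest when it is frequent in this document
# but rare across the corpus.
tf_idf = tf * idf
print(tf_idf)
```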
Text Mining Process
• Step 2: Create the Term–by–Document Matrix (TDM)
• The TDM is a sparse matrix. How can we reduce the dimensionality of the TDM?
• Manual: a domain expert goes through it
• Eliminate terms with very few occurrences in very few documents (?)
• Transform the matrix using SVD
• SVD is similar to principal component analysis
Domain expert review: while possible, manually going through a TDM to identify relevant terms is impractical for large datasets. It is time-consuming, subjective, and prone to human bias.
Automatic approaches:
Thresholding: this method eliminates terms that occur in fewer than a certain number of documents or have a very low frequency within the corpus. It can be effective for removing noisy terms that carry little information, but the threshold value must be chosen carefully to avoid eliminating potentially valuable terms.
Singular Value Decomposition (SVD): this is a powerful and widely used dimensionality reduction technique for the TDM. It decomposes the matrix into three components: a term-topic matrix, a diagonal matrix of singular values, and a topic-document matrix; keeping only the largest singular values yields a compact, lower-dimensional representation of the original TDM.
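A sketch of both automatic approaches in Python with scikit-learn (the corpus is made up; min_df plays the role of the occurrence threshold, and TruncatedSVD performs the SVD-based reduction, often called latent semantic analysis):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "stock market prices rise on earnings news",
    "earnings reports move stock prices",
    "rainfall and temperature affect crop yields",
    "crop yields depend on rainfall patterns",
]

# Thresholding: min_df=2 drops any term appearing in fewer than two documents.
vectorizer = TfidfVectorizer(stop_words="english", min_df=2)
tdm = vectorizer.fit_transform(docs)

# SVD: project the sparse term-document matrix onto 2 latent "topics".
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(tdm)

print(vectorizer.get_feature_names_out())  # terms that survived thresholding
print(doc_topics)                          # each document as a mixture of the two topics
```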
Natural Language Processing (NLP)
• Dream of the AI community: to have algorithms that are capable of automatically reading and obtaining knowledge from text
NLP Task Categories
• Information retrieval/recovery
• Information extraction
• Named-entity recognition
• Question answering
• Automatic summarization
• Natural language generation and understanding
• Machine translation
• Foreign language reading and writing
• Text proofing
Web Mining Overview
• Web is the largest repository of data
• Data is in HTML, XML, text format
• Challenges (of processing Web data)
• The Web is too big for effective data mining
• The Web is too complex
• The Web is too dynamic
• The Web is not specific to a domain
• The Web has everything
Web Mining
• Web Analytics
• Voice of Customer
• Customer Experience Management