Web and Text Mining
Text Mining
Text mining, also known as text data mining or text analytics, is the process of extracting
meaningful information and insights from unstructured text data. This involves various
techniques to analyze and interpret text, transforming it into a structured format that can be easily
understood and used for decision-making.
Natural language processing (NLP), which evolved from computational linguistics, uses methods from
various disciplines, such as computer science, linguistics, and data science, to enable computers
to understand human language in both written and spoken forms. By analyzing sentence structure
and grammar, NLP sub-tasks allow computers to “read” text. Common sub-tasks include:
Part-of-Speech (PoS) tagging: assigns a tag to every token in a document based on its
part of speech, for example noun, verb, or adjective.
Text categorization: also known as text classification, analyzes text documents and
classifies them based on predefined topics or categories.
Sentiment analysis: detects positive or negative sentiment from internal or external data
sources, allowing you to track changes in customer attitudes over time.
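
As a rough illustration of the sentiment analysis sub-task, the sketch below counts words from two small word lists and labels each text by the difference. The word lists, the function names sentiment_score and sentiment_label, and the sample reviews are all made up for this example; practical systems rely on much larger lexicons or trained models.

```python
import re

# Hypothetical toy lexicons, chosen only for this illustration.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}


def sentiment_score(text):
    """Return (number of positive words) minus (number of negative words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def sentiment_label(text):
    """Map the score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up sample reviews.
reviews = [
    "Great phone, I love the camera",
    "Terrible battery and slow updates",
    "It arrived on Tuesday",
]
labels = [sentiment_label(review) for review in reviews]
print(labels)
```

A word-count approach like this ignores negation ("not good") and context, which is one reason trained classifiers are usually preferred for real data.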
Process of Text Mining:
1. Data Collection: Gathering text from various sources like documents, emails, social
media, and web pages.
2. Preprocessing: Cleaning and preparing the data. Common steps include:
o Tokenization: Splitting text into words or phrases.
o Stopword Removal: Eliminating common words (e.g., "and," "the") that add
little value.
o Stemming/Lemmatization: Reducing words to their base forms (e.g., "running"
to "run").
3. Feature Extraction: Transforming text into a structured format. Common
representations include:
o Bag of Words: Representing each document by its word counts, ignoring word order.
o TF-IDF (Term Frequency-Inverse Document Frequency): Weighting words by
how often they occur in a document, discounted by how many documents in the
corpus contain them.
4. Modeling: Applying statistical and machine learning methods to identify patterns or
make predictions. Common approaches include:
o Classification: Categorizing text into predefined labels.
o Clustering: Grouping similar texts together.
o Topic Modeling: Discovering abstract topics within a collection of documents.
5. Evaluation: Assessing the performance of the models using metrics such as accuracy,
precision, recall, and F1 score.
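
The preprocessing and feature extraction steps (2 and 3) can be sketched in plain Python as follows. This is a minimal sketch, not a reference implementation: the stopword list, the crude stem function, and the sample documents are assumptions made for illustration, and the TF-IDF variant shown (relative term frequency times log(N / document frequency)) is only one of several common formulations.

```python
import math
import re
from collections import Counter

# Hypothetical tiny stopword list; real lists contain hundreds of words.
STOPWORDS = {"and", "the", "a", "is", "of", "on", "in", "to"}


def stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]


def bag_of_words(tokens):
    """Bag of words: map each term to its count in one document."""
    return Counter(tokens)


def tf_idf(docs_tokens):
    """Return one {term: weight} dict per document."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up sample corpus.
docs = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "Dogs and cats playing in the garden",
]
tokens_per_doc = [preprocess(doc) for doc in docs]
bags = [bag_of_words(tokens) for tokens in tokens_per_doc]
weights = tf_idf(tokens_per_doc)

print(tokens_per_doc[0])
print(weights[0])
```

In this corpus the stem "cat" appears in every document, so its TF-IDF weight is zero, while terms unique to one document receive positive weights. That is the sense in which TF-IDF favors words that distinguish a document from the rest of the corpus.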
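
The evaluation metrics named in step 5 can be computed directly from true and predicted labels. The sketch below assumes a two-class problem with one class treated as "positive"; the function names, the "spam"/"ham" labels, and the sample predictions are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 score for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for six documents.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="spam")
print(acc, precision, recall, f1)
```

Precision asks how many of the predicted positives were correct, recall asks how many of the actual positives were found, and F1 is their harmonic mean. Accuracy alone can be misleading when one class is much more frequent than the other.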