ASTMA Explanations m1 Stuff

Output
- Key phrases
- Patterns
- Sentiments
- Trends
- Topics
- Relationships between data points
- Named entities

Tools/Software
- R
- Python
- SAS
- SQL
- NLTK
- SpaCy
- GATE
- TextBlob
- RapidMiner (also used for text mining)

Data Sources
- Databases
- Data Warehouses
- Logs
- Transaction Records
- Documents
- Emails
- Web Pages
- Social Media Posts
What is Social Media Analytics?
Definition
Social media analytics refers to the process of collecting, analyzing, and interpreting data from
social media platforms to gain insights into various aspects of business performance. This
involves evaluating data from platforms like Facebook, Twitter, and Instagram to understand
user behavior, preferences, and trends.
Explanation
Social media analytics helps businesses make informed decisions by providing a clear picture of
how their brand, products, and competitors are perceived online. By leveraging data from social
media interactions, organizations can optimize their strategies in product development, customer
experience, branding, competitive analysis, and operational efficiency.
Sub-Points
1. Product Development
Explanation: Analyzing aggregated data from social media posts, tweets, and Amazon
product reviews helps companies understand customer pain points, evolving needs, and
desired features.
Example: If numerous users on Twitter express frustration with a feature in a product,
this feedback can guide product updates or the development of new features.
Benefits: Identifies and tracks trends to manage existing product lines and guide new
product development.
2. Customer Experience
3. Branding
Explanation: Social media serves as a vast focus group where natural language
processing (NLP) and sentiment analysis are used to monitor brand health and refine
positioning.
Example: If sentiment analysis reveals a surge in negative comments about a brand, the
company can address the issue promptly and adjust its branding strategy.
Benefits: Maintains brand health, refines positioning, and develops new brand attributes
based on ongoing feedback.
4. Competitive Analysis
5. Operational Efficiency
Explanation: Deep analysis of social media data can enhance how organizations gauge
demand, manage inventory, and optimize resources.
Example: Retailers can use social media trends to predict demand for products, adjust
inventory levels, and manage supplier relationships more effectively.
Benefits: Reduces costs and improves resource allocation by aligning operations with
real-time market demands.
Summary
By integrating these insights, organizations can make data-driven decisions that enhance their
market position and operational efficiency.
Document Collections
Definition
Static Collection: A fixed set of documents that does not change over time. For example,
the PubMed database, which contains a constant repository of medical research articles.
Dynamic Collection: An evolving set of documents where new items can be added and
existing items updated or removed, for example a live news feed or a stream of incoming emails.
Document
Definition
A document is informally defined as a unit of discrete textual data within a collection. It often
correlates with real-world documents such as business reports, legal memoranda, emails,
research papers, manuscripts, articles, press releases, or news stories, though this correlation is
not always necessary.
Prototypical Document
Document Types
Semi-Structured Document
Definition: Documents with extensive and consistent formatting elements that make
field-type metadata more easily inferable. Examples include emails, HTML web pages,
PDF files, and word-processing files with heavy templating or style sheets.
Characteristics: Easier to extract structured information due to consistent formatting.
Document Features
1. Character-Level Features
Definition: The individual letters, numerals, special characters, and spaces that form the
basic building blocks of higher-level semantic features.
Example: A character-level representation might include all characters in a document or
a filtered subset.
2. Word-Level Features
Definition: Individual words drawn directly from the document text; they sit one level above characters and one level below multiword terms.
3. Term-Level Features
Definition: Single words or multiword phrases selected from the document corpus using
term-extraction methodologies.
Example: For a document mentioning "President Abraham Lincoln," terms might
include "Lincoln," "President Abraham Lincoln," and "White House."
4. Concept-Level Features
Summary
2. Feature Dimensionality:
Large Feature Sets: Text documents can have a very large number of features. The
dimensionality of feature representation in natural language processing (NLP) is
significantly higher than in traditional databases. This is due to the extensive vocabulary
and complex combinations of terms possible in textual data.
3. Feature Sparsity:
Any single document contains only a small fraction of all possible features, so document feature vectors are sparse, with most entries equal to zero.
In text mining:
1. Domain Definition:
o A domain is a specific area of interest where specialized ontologies, lexicons, and
taxonomies are developed. These can be broad, such as in biology, or more
focused, like in genomics or proteomics [2].
2. Role of Domain Knowledge:
o Domain knowledge, or background knowledge, enhances text mining by
improving concept extraction and validation. This knowledge helps in creating
meaningful, consistent, and normalized concept hierarchies, making text analysis
more effective [1].
3. Applications:
o It aids in preprocessing text data, refining feature selection, and improving the
accuracy of text classification and pattern discovery.
Overall, these methods help uncover insights from the overall data rather than from isolated
documents.
1. Browsing:
o Traditional browsing involves navigating through text-based search results. Users
can manually sift through documents or use search filters to locate relevant
information.
2. Visualization Tools:
o Advanced text mining systems now incorporate highly interactive graphical
representations. These tools allow users to:
Drag, pull, click: Users can manipulate visual elements directly, enabling
them to explore relationships between concepts dynamically.
Interactive Exploration: Visualization tools often include features like
zooming in on data points, highlighting specific patterns, and filtering
results by various criteria.
These interactive visualizations significantly enhance the user's ability to discover and analyze
concept patterns within large text datasets, making the data more accessible and insightful.
1. Preprocessing Tasks:
o These tasks prepare data for core knowledge discovery operations. They include
data source preprocessing and categorization activities, transforming information
into a canonical format, and applying feature extraction methods. Tasks may also
involve document date stamping and fetching raw data from various sources.
2. Core Mining Operations:
o These operations are central to the text mining system, focusing on pattern
discovery, trend analysis, and incremental knowledge discovery. Patterns like
distributions, frequent concept sets, and associations are identified. Advanced
systems may leverage background knowledge to enhance these operations,
collectively known as knowledge distillation processes.
3. Presentation Layer:
o This layer includes components like graphical user interfaces (GUIs), pattern
browsing tools, and query editors. It enables users to interact with data visually
and textually, creating or modifying concept clusters and annotated profiles for
specific patterns.
4. Refinement Techniques:
o These techniques filter redundant information, cluster related data, and may
involve comprehensive processes like suppression, ordering, and generalization.
They are aimed at optimizing discovery and are sometimes referred to as
postprocessing.
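To make the four components above concrete, here is a minimal Python sketch of how they might fit together. Every function name, the co-occurrence heuristic, and the sample documents are hypothetical placeholders for illustration, not part of any real text mining system.

from collections import Counter
from itertools import combinations

def preprocess(raw_docs):
    """Preprocessing layer: normalize text into a canonical token format."""
    return [[w.lower().strip(".,!?") for w in doc.split()] for doc in raw_docs]

def mine_patterns(token_docs, min_support=2):
    """Core mining layer: find concept pairs co-occurring in >= min_support documents."""
    pair_counts = Counter()
    for tokens in token_docs:
        for pair in combinations(sorted(set(tokens)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

def refine(patterns, stopwords=frozenset({"the", "a", "of"})):
    """Refinement layer: suppress patterns that involve uninformative terms."""
    return {p: n for p, n in patterns.items() if not (set(p) & stopwords)}

def present(patterns):
    """Presentation layer: a plain textual report standing in for a GUI."""
    for (a, b), n in sorted(patterns.items(), key=lambda kv: -kv[1]):
        print(f"{a} <-> {b}: co-occurs in {n} documents")

docs = ["Crude oil prices rise", "Oil exports rise again", "Crude oil demand falls"]
present(refine(mine_patterns(preprocess(docs))))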
1. Data Sources
WWW & FTP Resources: Web pages and files accessible via FTP serve as primary data
sources.
News and Email: Structured and unstructured data from news feeds and emails.
Other Online Resources: Includes databases, social media, and other internet-based
repositories.
3. Preprocessing Tasks
Overview: Preprocessing is essential for transforming raw data into a format suitable for
analysis.
Categorization: Data is categorized based on predefined classes or clusters.
Feature/Term Extraction: Key features or terms are extracted from the text using
methods like tokenization, stemming, lemmatization, and stop-word removal. This step
often includes document time-stamping, where documents are tagged with the date and
time of creation or last modification.
Canonical Format Conversion: The data is normalized into a standard format to ensure
consistency across different data sources.
Initial Data Fetching: Techniques specifically designed for collecting and integrating
data from multiple, disparate sources are sometimes used.
Data Representation: This involves compressing data for efficient storage or creating
hierarchical structures (e.g., taxonomies, ontologies) to represent the relationships
between concepts.
6. Knowledge Sources
7. Parsing Routines
Parsing: The text is parsed to identify structures, such as sentences, clauses, and other
grammatical elements, which are crucial for understanding the meaning and context of
the text.
9. Refinement Techniques
Overview: These techniques refine the output from core mining operations to improve
the quality and relevance of the results.
Suppression and Filtering: Removal of redundant or irrelevant information.
Ordering and Pruning: Organizing data into a more useful structure and removing
unnecessary elements.
Generalization and Clustering: Aggregating similar data points to simplify the analysis.
GUI and Browsing Functionality: User interfaces that allow interaction with the text
mining system, often including dashboards and visualization tools.
Query Language Access: Tools that allow users to formulate and submit queries to the
system.
Visualization Tools: These tools represent the mining results graphically, such as in
charts or graphs, making it easier to interpret patterns and trends.
Search and Query Interpreters: These components help users search the processed data
and interpret the results effectively.
User Role: The user interacts with the system through the presentation layer, utilizing the
GUI, visualization tools, and query functions to explore the data and discover insights.
Enhancement: The system can use external knowledge sources to refine its operations
and improve the accuracy of the extracted information.
Conclusion
This architecture outlines the comprehensive process of text mining, from data collection to user
interaction, emphasizing the importance of preprocessing, mining operations, refinement, and
user-facing tools to discover and present meaningful patterns and insights in textual data.
Distributions
I. Concept Selection
Explanation:
Concept selection involves identifying a subset of documents from a larger collection that are
labeled with one or more specified concepts.
Concept proportion measures the fraction of documents in a collection labeled with a specific set
of concepts.
Conditional concept proportion measures the fraction of documents labeled with one set of
concepts that are also labeled with another set.
Concept distribution assigns a value between 0 and 1 to each concept in a set, representing its
distribution in a subset of documents.
Concept proportion distribution measures the proportion of documents labeled with a specific
concept within a subset of documents.
VI. Conditional Proportion Distribution
VII. Average Proportion Distribution
Explanation:
Average proportion distribution averages the concept proportions across sibling nodes in a
hierarchy.
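The prose definitions above can be written more compactly. The following is a sketch using assumed notation (D for the collection, K, K1, K2 for concept sets, and D_K for the subset of documents in D labeled with every concept in K); the exact symbols are not taken from the text.

f(D, K) = \frac{|D_K|}{|D|}                                 % concept proportion
f(D, K_1 \mid K_2) = f(D_{K_2}, K_1)                        % conditional concept proportion
F_K(D, x) = f(D, \{x\}), \quad x \in K                      % concept proportion distribution over K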
1. Vertices (Nodes): Each vertex in the graph represents a concept. For instance, in our
example, a vertex might represent a specific country.
2. Edges: The edges between vertices represent the relationships between these concepts.
The edges can be weighted, meaning that they carry a numerical value representing the
strength or affinity of the relationship between two concepts. For example, if two
countries have a strong economic relationship concerning crude oil, the edge connecting
these two country nodes will have a higher weight.
3. Similarity Function: To determine whether an edge should exist between two vertices
(concepts), a similarity function is used. This function measures how similar two
concepts are within the given context. If the similarity exceeds a certain threshold, an
edge is created between those concepts.
D: A collection of documents.
C: A set of concepts.
P: A context phrase, which specifies the context under which the relationships are of
interest (e.g., crude oil).
Temporal context relationships involve analyzing how these relationships between concepts
change over time. For example, how the relationship between two countries concerning crude oil
changes over the years. By breaking down the corpus (the collection of documents) into temporal
segments, one can track and analyze these changes, providing a dynamic view of the context
graph over time.
Context graphs are widely used in fields like text analytics, social network analysis, and AI.
They help in understanding and visualizing complex relationships in large datasets, enabling
more informed decision-making.
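As a rough illustration, the sketch below builds a small context graph with the networkx package. The concepts, co-occurrence counts, similarity function, and threshold are all invented for the example; in practice the counts would come from the corpus D restricted to the context phrase P.

import networkx as nx  # assumes: pip install networkx

# Hypothetical co-occurrence counts of country concepts in "crude oil" documents.
co_occurrence = {
    ("Saudi Arabia", "USA"): 42,
    ("Saudi Arabia", "Russia"): 35,
    ("USA", "Canada"): 8,
}
doc_frequency = {"Saudi Arabia": 60, "USA": 70, "Russia": 50, "Canada": 20}

def similarity(c1, c2):
    # Simple similarity: co-occurrences relative to the rarer concept's frequency.
    return co_occurrence.get((c1, c2), 0) / min(doc_frequency[c1], doc_frequency[c2])

G = nx.Graph()
threshold = 0.3
for (c1, c2) in co_occurrence:
    s = similarity(c1, c2)
    if s >= threshold:              # only create an edge above the threshold
        G.add_edge(c1, c2, weight=s)

print(G.edges(data=True))           # weighted edges of the context graph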
For example, if you're processing a PDF document, task-oriented preprocessing might involve
extracting titles, author names, and other metadata. This approach can be especially useful when
dealing with natural language texts, where the data is inherently complex and unstructured.
Example
Consider a scenario where you have a scanned document containing various sections like the
title, author, abstract, and body text. The task-oriented preprocessing approach would first
convert the scanned document into a stream of text using Optical Character Recognition (OCR).
Next, it would identify and extract specific sections like the title and author by recognizing their
visual position in the document. Finally, these sections would be labeled and structured for
further analysis.
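A rough sketch of this OCR-plus-positional-extraction idea is shown below. It assumes the Tesseract OCR engine along with the pytesseract and Pillow packages are installed, "scan.png" is a hypothetical input file, and the "first line is the title, second is the author" rule is only a stand-in for real layout analysis.

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("scan.png"))   # OCR: image -> text stream
lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

# Crude positional heuristic mirroring "recognize sections by visual position".
metadata = {
    "title": lines[0] if lines else "",
    "author": lines[1] if len(lines) > 1 else "",
    "body": "\n".join(lines[2:]),
}
print(metadata["title"], "-", metadata["author"])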
General-purpose Natural Language Processing (NLP) tasks involve analyzing text documents
using common linguistic knowledge. These tasks are not specific to any particular problem but
are fundamental steps in understanding and processing language.
1. Tokenization
Tokenization is the process of breaking down text into smaller units, like sentences or words,
called tokens. This step is critical because it simplifies the text and prepares it for more complex
analysis.
Example: Consider the sentence, "Dr. Smith is a renowned scientist." The tokenizer must
distinguish between "Dr." as a title and the period that typically ends a sentence. The sentence
would be split into tokens like ["Dr.", "Smith", "is", "a", "renowned", "scientist"].
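A minimal tokenization sketch with NLTK is shown below; it assumes the 'punkt' tokenizer models have been downloaded via nltk.download("punkt"), and the exact token output may vary by NLTK version.

import nltk

text = "Dr. Smith is a renowned scientist. He works in genomics."
sentences = nltk.sent_tokenize(text)    # the Punkt model treats "Dr." as an abbreviation
tokens = nltk.word_tokenize(sentences[0])
print(sentences)
print(tokens)   # e.g. ['Dr.', 'Smith', 'is', 'a', 'renowned', 'scientist', '.']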
2. Part-of-Speech (POS) Tagging
POS tagging involves labeling each word in a sentence with its corresponding part of speech
(noun, verb, adjective, etc.). This helps in understanding the grammatical structure of the
sentence and can provide insights into the sentence's meaning.
Example: In the sentence, "The quick brown fox jumps over the lazy dog," POS tagging would
label "The" as a determiner, "quick" as an adjective, "fox" as a noun, "jumps" as a verb, and so
on.
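A short POS-tagging sketch with NLTK follows; it assumes nltk.download("punkt") and nltk.download("averaged_perceptron_tagger") have been run, and the exact tags may vary slightly by tagger version.

import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#       ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]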
3. Syntactic Parsing
Syntactic parsing assigns a structure to a sequence of text based on grammatical rules. There are
two primary types of syntactic parsing:
Constituency Parsing: This approach breaks sentences down into phrases, such as noun
phrases or verb phrases. For example, in the sentence "The cat sat on the mat," "The cat"
would be identified as a noun phrase, and "sat on the mat" as a verb phrase.
Dependency Parsing: This method focuses on the relationships between words in a
sentence. For instance, in the same sentence "The cat sat on the mat," dependency parsing
would highlight that "sat" is the main verb, with "cat" as the subject and "on the mat" as a
prepositional phrase dependent on "sat."
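A dependency-parsing sketch with spaCy is shown below; it assumes the small English model has been installed (python -m spacy download en_core_web_sm).

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat")
for token in doc:
    # each token points to its syntactic head and carries a dependency label
    print(f"{token.text:4} --{token.dep_:6}--> {token.head.text}")
# "sat" is the root verb, "cat" its nominal subject, and "on the mat" attaches to "sat".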
4. Shallow Parsing
Shallow parsing, also known as chunking, breaks down text into small, easily interpretable
chunks like simple noun and verb phrases. It doesn't provide a full parse of the sentence, making
it faster and more robust for certain applications like information extraction.
Example: In the sentence "He bought a car," shallow parsing might break it down into two
chunks: [He] [bought a car], where "He" is a noun phrase and "bought a car" is a verb phrase.
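A shallow-parsing (chunking) sketch using NLTK's regular-expression chunker is given below; the grammar is a simplified assumption covering only this example, and it relies on the same NLTK resources as the tagging sketch above.

import nltk

grammar = r"""
  NP: {<DT|PRP|JJ|NN.*>+}   # noun phrase: pronoun, or determiner/adjectives/nouns
  VP: {<VB.*><NP>*}         # verb phrase: a verb followed by noun-phrase chunks
"""
chunker = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize("He bought a car"))
print(chunker.parse(tagged))   # roughly: (S (NP He) (VP bought (NP a car)))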
Problem-dependent tasks, such as text categorization and information extraction, create
representations of documents that are meaningful either for more sophisticated
processing phases or for direct interaction by the user.
Example: Imagine you have a large corpus of emails. Using text categorization, you could
automatically sort these emails into categories like "Work," "Personal," and "Spam." Information
extraction could then identify important details within each email, like the sender's name or the
date of the meeting mentioned.
Conclusion
Preprocessing is the foundation of text mining, transforming raw data into structured, meaningful
representations. By understanding and applying these techniques, you can significantly improve
the accuracy and efficiency of your text mining projects.
1. Decision Tree
Definition:
A Decision Tree is a non-parametric supervised learning algorithm used for classification and
regression tasks. It models decisions and their possible consequences as a tree structure, where
each internal node represents a "test" on an attribute, each branch represents the outcome of the
test, and each leaf node represents a class label or a decision.
Formula:
The core of a Decision Tree involves measures like Information Gain (IG) or Gini Index (GI) to
determine the best attribute to split the data. For binary classification, with p the proportion of
positive examples in a node S:
Entropy: H(S) = −p·log2(p) − (1 − p)·log2(1 − p)
Gini Index: GI(S) = 1 − p² − (1 − p)²
Information Gain: IG(S, A) = H(S) − Σv (|Sv| / |S|)·H(Sv), where the sum runs over the subsets Sv
produced by splitting S on attribute A.
Explanation:
The Decision Tree algorithm starts with the entire dataset and selects the attribute that maximizes
Information Gain or minimizes Gini Index to split the dataset. This process continues recursively
for each subset, creating a tree where each leaf node represents a class label. Pruning may be
applied to avoid overfitting.
Example:
Suppose you have a dataset to predict whether a person buys a computer based on attributes like
age, income, student status, and credit rating. The Decision Tree might split the data first on
"age" and then on "income," forming a tree that predicts the class labels "buys computer" or
"doesn't buy computer."
OR
Definition:
A Decision Tree is a hierarchical model used for making decisions or predictions by breaking
down a dataset into smaller subsets. It does this by recursively splitting the data based on feature
values. The tree consists of nodes where each node represents a decision based on a feature,
branches representing the outcomes of these decisions, and leaves representing the final
prediction or classification.
Explanation:
1. Root Node: The top node of the tree, representing the entire dataset.
2. Splitting: The process of dividing the dataset into subsets based on the value of a feature.
The aim is to increase the homogeneity of the target variable within each subset.
3. Decision Nodes: Nodes where the dataset is split. These nodes test different features to
decide the best way to split the data.
4. Leaf Nodes: The terminal nodes that provide the final output or classification. They
represent the decision or prediction outcome for the dataset subset that reaches them.
Choosing Splits:
Information Gain (IG): Measures how much information is gained about the target
variable by splitting the data based on a particular feature. It helps in choosing the feature
that will yield the most significant reduction in uncertainty.
Gini Index: Measures the impurity of a node. A node with a Gini Index of 0 is perfectly
pure, meaning all data points in that node belong to a single class.
Example: Imagine a Decision Tree is used to decide whether to play tennis based on weather
conditions. The root node might test if it's sunny, followed by branches for temperature and
humidity. The final leaf nodes could predict outcomes such as "Play" or "Don't Play" based on
these conditions.
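A toy sketch of the "play tennis" idea with scikit-learn's DecisionTreeClassifier follows; the encoded weather data and labels below are invented purely for illustration (outlook: 0 = sunny, 1 = overcast, 2 = rain; humidity: 0 = normal, 1 = high).

from sklearn.tree import DecisionTreeClassifier

X = [[0, 1], [0, 0], [1, 1], [2, 1], [2, 0], [1, 0]]
y = ["Don't Play", "Play", "Play", "Don't Play", "Play", "Play"]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # entropy -> information gain splits
clf.fit(X, y)
print(clf.predict([[0, 1]]))   # prediction for a sunny, high-humidity day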
2. Naive Bayes
Definition:
Naive Bayes is a probabilistic classifier based on applying Bayes' theorem with strong (naive)
independence assumptions between the features. It is particularly used for text classification.
Explanation:
Naive Bayes assumes that the presence of a particular feature in a class is independent of the
presence of any other feature. Despite this assumption, it performs well in many real-world
situations, especially in text classification problems like spam detection, where the order of
words does not influence the outcome.
Example:
Consider a spam filter that classifies emails based on the occurrence of certain words. If the
words "win" and "prize" are frequent in spam emails, the classifier might predict an incoming
email containing these words as spam based on the learned probabilities.
OR
Definition:
Naive Bayes is a probabilistic classifier that applies Bayes' theorem with the assumption that
features are conditionally independent given the class label. Despite this simplification, Naive
Bayes often performs well in practice, especially in text classification and spam detection.
Explanation:
1. Bayes' Theorem: Provides a way to calculate the probability of a class given the
features. It combines prior knowledge (prior probability) with the likelihood of features
occurring given the class: P(C | X) = P(X | C) · P(C) / P(X).
2. Naive Assumption: Assumes that all features are independent given the class label. This
simplifies the computation but might not always hold true in real data.
Classification Process:
Compute Prior Probability: The likelihood of each class based on the training data.
Compute Likelihood: The probability of observing the given feature values for each
class.
Apply Bayes' Theorem: Calculate the posterior probability for each class and choose the
class with the highest probability.
Example: In spam email classification, Naive Bayes would use the frequency of words in spam
and non-spam emails to predict the probability that a new email is spam based on its word
content.
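A minimal spam-filter sketch with scikit-learn is shown below; the four training emails and their labels are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "claim your prize", "meeting agenda attached", "lunch tomorrow?"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)        # word-count features
clf = MultinomialNB().fit(X, labels)

new_email = vectorizer.transform(["you win a prize"])
print(clf.predict(new_email))               # likely ['spam'] given the word overlap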
3. Linear Regression
Definition:
Linear Regression is a linear approach to modeling the relationship between a dependent variable
and one or more independent variables. The goal is to fit a linear equation to observed data.
Explanation:
Linear Regression aims to find the line (or hyperplane in higher dimensions) that best fits the
data by minimizing the sum of squared errors between the observed and predicted values. This
line represents the relationship between the dependent and independent variables.
Example:
Consider predicting house prices based on features like square footage, number of bedrooms, and
location. Linear Regression would provide a formula that combines these features to estimate the
price, allowing for predictions on new data.
OR
Definition:
Linear Regression models the relationship between a dependent variable and one or more
independent variables by fitting a linear equation to the observed data. It aims to predict the
dependent variable based on the values of the independent variables.
Explanation:
1. Model: The equation of a line (or hyperplane in multiple dimensions) is used to make
predictions. For a single independent variable, the model is y = β0 + β1·x + ε, where β0 is
the intercept, β1 the slope, and ε the error term.
2. Objective: Minimize the sum of squared errors (differences between observed and
predicted values). This is achieved using techniques like Ordinary Least Squares (OLS).
3. Assumptions:
o Linearity: The relationship between dependent and independent variables is
linear.
o Independence: The residuals (errors) are independent.
o Homoscedasticity: Constant variance of residuals.
o Normality: Residuals should be normally distributed.
Example: Predicting house prices based on features such as square footage and number of
bedrooms. Linear Regression would provide a formula that estimates the price based on these
features, allowing predictions for new properties.
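A short scikit-learn sketch of this house-price example follows; the (square footage, bedrooms) -> price data is made up for illustration.

from sklearn.linear_model import LinearRegression

X = [[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]]
y = [245000, 312000, 279000, 308000, 419000]

model = LinearRegression().fit(X, y)        # ordinary least squares fit
print(model.intercept_, model.coef_)        # estimated intercept and coefficients
print(model.predict([[2000, 4]]))           # price estimate for a new property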
4. K-Nearest Neighbors (KNN)
Explanation:
KNN works by storing all available cases and classifying new cases based on a similarity
measure (e.g., distance functions). It assigns the most common label among its nearest neighbors
to the new instance. It is highly effective for small datasets but can be computationally expensive
for large ones.
Example:
Imagine you have a dataset of flowers labeled by species based on features like petal length and
width. Given a new flower, KNN would classify it by identifying the closest k flowers in the
dataset and assigning the most frequent species among those neighbors.
OR
Definition:
K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm that classifies
a new instance (or predicts its value) from the k most similar instances in the training data.
Explanation:
1. Distance Metric: KNN uses distance metrics like Euclidean distance to find the k
nearest neighbors of a new instance. For k-NN, the distance between two points p and q
in n-dimensional space is calculated as d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²).
2. Classification: The new instance is assigned to the class that is most frequent among its
k nearest neighbors.
3. Regression: The value of the new instance is predicted as the average (or weighted
average) of the values of its k nearest neighbors.
4. Choosing k: The value of k is crucial. A small k can lead to overfitting, while a
large k might smooth out the classification or prediction.
Example: For classifying a new flower as one of several species, KNN would look at the k
closest flowers in the training set and assign the species based on the majority class among these
neighbors.
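A scikit-learn sketch of k-NN on the built-in iris flower dataset follows; the choice of k = 5 and the 70/30 split are arbitrary illustrative settings.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5, Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))            # accuracy on the held-out flowers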
5. Support Vector Machine (SVM)
Definition:
Support Vector Machine (SVM) is a supervised learning model used for classification and
regression tasks. It works by finding the hyperplane that best separates the classes in the feature
space.
Explanation:
SVM searches for the separating hyperplane with the largest margin, that is, the decision
boundary lying farthest from the nearest training points (the support vectors) of each class.
Example:
In a binary classification problem, like distinguishing between cats and dogs based on features
like ear length and tail length, SVM would find the hyperplane that maximizes the separation
between the two classes, thereby classifying any new animal based on which side of the
hyperplane it falls.
OR
Definition:
Support Vector Machine (SVM) is a supervised learning algorithm that finds the optimal
hyperplane to separate different classes in the feature space. It aims to maximize the margin
between the classes to achieve better generalization.
Explanation:
1. Hyperplane: A hyperplane in an n-dimensional space is a flat affine subspace of
dimension n − 1 that separates the data into two classes. SVM seeks the hyperplane
that maximizes the margin between the closest points (support vectors) of each class.
2. Margin: The distance between the hyperplane and the closest data points from either
class. Maximizing this margin helps improve the model's generalization to unseen data.
3. Kernel Trick: SVM can handle non-linearly separable data by mapping it into a higher-
dimensional space using a kernel function (e.g., polynomial, radial basis function). This
allows for finding a linear separating hyperplane in the new space.
Example: In a dataset with two classes that are not linearly separable, SVM can use a kernel
function to transform the data into a higher dimension where a linear hyperplane can be used to
separate the classes effectively.
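A small scikit-learn sketch of this kernel idea follows, using a synthetic "two concentric circles" dataset that is not linearly separable; the sample size, noise level, and C/gamma settings are arbitrary illustrative choices.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))   # struggles on the circles
print("RBF kernel accuracy:", rbf_svm.score(X, y))          # near-perfect separation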