ASTMA Explanations m1 Stuff

Output
- Key phrases
- Patterns
- Sentiments
- Trends
- Topics
- Relationships between data points
- Named entities

Tools/Software
- R
- Python
- SAS
- SQL
- NLTK
- SpaCy
- GATE
- TextBlob
- RapidMiner (also used for text mining)

Data Sources
- Databases
- Data Warehouses
- Logs
- Transaction Records
- Documents
- Emails
- Web Pages
- Social Media Posts
What is Social Media Analytics?
Definition
Social media analytics refers to the process of collecting, analyzing, and interpreting data from
social media platforms to gain insights into various aspects of business performance. This
involves evaluating data from platforms like Facebook, Twitter, and Instagram to understand
user behavior, preferences, and trends.
Explanation
Social media analytics helps businesses make informed decisions by providing a clear picture of
how their brand, products, and competitors are perceived online. By leveraging data from social
media interactions, organizations can optimize their strategies in product development, customer
experience, branding, competitive analysis, and operational efficiency.
Sub-Points
1. Product Development
Explanation: Analyzing aggregated data from social media posts, tweets, and Amazon
product reviews helps companies understand customer pain points, evolving needs, and
desired features.
Example: If numerous users on Twitter express frustration with a feature in a product,
this feedback can guide product updates or the development of new features.
Benefits: Identifies and tracks trends to manage existing product lines and guide new
product development.
2. Customer Experience
3. Branding
Explanation: Social media serves as a vast focus group where natural language
processing (NLP) and sentiment analysis are used to monitor brand health and refine
positioning.
Example: If sentiment analysis reveals a surge in negative comments about a brand, the
company can address the issue promptly and adjust its branding strategy.
Benefits: Maintains brand health, refines positioning, and develops new brand attributes
based on ongoing feedback.
4. Competitive Analysis
5. Operational Efficiency
Explanation: Deep analysis of social media data can enhance how organizations gauge
demand, manage inventory, and optimize resources.
Example: Retailers can use social media trends to predict demand for products, adjust
inventory levels, and manage supplier relationships more effectively.
Benefits: Reduces costs and improves resource allocation by aligning operations with
real-time market demands.
Summary
By integrating these insights, organizations can make data-driven decisions that enhance their
market position and operational efficiency.
Document Collections
Definition
Static Collection: A fixed set of documents that does not change over time. For example,
the PubMed database, which contains a constant repository of medical research articles.
Dynamic Collection: An evolving set of documents where new items can be added and
existing items updated or removed, for example a live news feed or a stream of incoming emails.
Document
Definition
A document is informally defined as a unit of discrete textual data within a collection. It often
correlates with real-world documents such as business reports, legal memoranda, emails,
research papers, manuscripts, articles, press releases, or news stories, though this correlation is
not always necessary.
Prototypical Document
Document Types
Semi-Structured Document
Definition: Documents with extensive and consistent formatting elements that make
field-type metadata more easily inferable. Examples include emails, HTML web pages,
PDF files, and word-processing files with heavy templating or style sheets.
Characteristics: Easier to extract structured information due to consistent formatting.
Document Features
1. Character-Level Features
Definition: The individual letters, numerals, special characters, and spaces that form the
basic building blocks of higher-level semantic features.
Example: A character-level representation might include all characters in a document or
a filtered subset.
2. Word-Level Features
Definition: Individual words drawn directly from the document text; they sit one level above characters and one level below multiword terms.
3. Term-Level Features
Definition: Single words or multiword phrases selected from the document corpus using
term-extraction methodologies.
Example: For a document mentioning "President Abraham Lincoln," terms might
include "Lincoln," "President Abraham Lincoln," and "White House."
4. Concept-Level Features
Summary
2. Feature Dimensionality:
Large Feature Sets: Text documents can have a very large number of features. The
dimensionality of feature representation in natural language processing (NLP) is
significantly higher than in traditional databases. This is due to the extensive vocabulary
and complex combinations of terms possible in textual data.
3. Feature Sparsity:
Any single document contains only a small fraction of all possible features, so document feature vectors are sparse, with most entries equal to zero.
In text mining:
1. Domain Definition:
o A domain is a specific area of interest where specialized ontologies, lexicons, and
taxonomies are developed. These can be broad, such as in biology, or more
focused, like in genomics or proteomics [2].
2. Role of Domain Knowledge:
o Domain knowledge, or background knowledge, enhances text mining by
improving concept extraction and validation. This knowledge helps in creating
meaningful, consistent, and normalized concept hierarchies, making text analysis
more effective [1].
3. Applications:
o It aids in preprocessing text data, refining feature selection, and improving the
accuracy of text classification and pattern discovery.
Overall, these methods help uncover insights from the overall data rather than from isolated
documents.
1. Browsing:
o Traditional browsing involves navigating through text-based search results. Users
can manually sift through documents or use search filters to locate relevant
information.
2. Visualization Tools:
o Advanced text mining systems now incorporate highly interactive graphical
representations. These tools allow users to:
Drag, pull, click: Users can manipulate visual elements directly, enabling
them to explore relationships between concepts dynamically.
Interactive Exploration: Visualization tools often include features like
zooming in on data points, highlighting specific patterns, and filtering
results by various criteria.
These interactive visualizations significantly enhance the user's ability to discover and analyze
concept patterns within large text datasets, making the data more accessible and insightful.
1. Preprocessing Tasks:
o These tasks prepare data for core knowledge discovery operations. They include
data source preprocessing and categorization activities, transforming information
into a canonical format, and applying feature extraction methods. Tasks may also
involve document date stamping and fetching raw data from various sources.
2. Core Mining Operations:
o These operations are central to the text mining system, focusing on pattern
discovery, trend analysis, and incremental knowledge discovery. Patterns like
distributions, frequent concept sets, and associations are identified. Advanced
systems may leverage background knowledge to enhance these operations,
collectively known as knowledge distillation processes.
3. Presentation Layer:
o This layer includes components like graphical user interfaces (GUIs), pattern
browsing tools, and query editors. It enables users to interact with data visually
and textually, creating or modifying concept clusters and annotated profiles for
specific patterns.
4. Refinement Techniques:
o These techniques filter redundant information, cluster related data, and may
involve comprehensive processes like suppression, ordering, and generalization.
They are aimed at optimizing discovery and are sometimes referred to as
postprocessing.
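To make the four components above concrete, here is a minimal Python sketch of how they might fit together. Every function name, the co-occurrence heuristic, and the sample documents are hypothetical placeholders for illustration, not part of any real text mining system.

from collections import Counter
from itertools import combinations

def preprocess(raw_docs):
    """Preprocessing layer: normalize text into a canonical token format."""
    return [[w.lower().strip(".,!?") for w in doc.split()] for doc in raw_docs]

def mine_patterns(token_docs, min_support=2):
    """Core mining layer: find concept pairs co-occurring in >= min_support documents."""
    pair_counts = Counter()
    for tokens in token_docs:
        for pair in combinations(sorted(set(tokens)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

def refine(patterns, stopwords=frozenset({"the", "a", "of"})):
    """Refinement layer: suppress patterns that involve uninformative terms."""
    return {p: n for p, n in patterns.items() if not (set(p) & stopwords)}

def present(patterns):
    """Presentation layer: a plain textual report standing in for a GUI."""
    for (a, b), n in sorted(patterns.items(), key=lambda kv: -kv[1]):
        print(f"{a} <-> {b}: co-occurs in {n} documents")

docs = ["Crude oil prices rise", "Oil exports rise again", "Crude oil demand falls"]
present(refine(mine_patterns(preprocess(docs))))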
1. Data Sources
WWW & FTP Resources: Web pages and files accessible via FTP serve as primary data
sources.
News and Email: Structured and unstructured data from news feeds and emails.
Other Online Resources: Includes databases, social media, and other internet-based
repositories.
3. Preprocessing Tasks
Overview: Preprocessing is essential for transforming raw data into a format suitable for
analysis.
Categorization: Data is categorized based on predefined classes or clusters.
Feature/Term Extraction: Key features or terms are extracted from the text using
methods like tokenization, stemming, lemmatization, and stop-word removal. This step
often includes document time-stamping, where documents are tagged with the date and
time of creation or last modification.
Canonical Format Conversion: The data is normalized into a standard format to ensure
consistency across different data sources.
Initial Data Fetching: Techniques specifically designed for collecting and integrating
data from multiple, disparate sources are sometimes used.
Data Representation: This involves compressing data for efficient storage or creating
hierarchical structures (e.g., taxonomies, ontologies) to represent the relationships
between concepts.
6. Knowledge Sources
7. Parsing Routines
Parsing: The text is parsed to identify structures, such as sentences, clauses, and other
grammatical elements, which are crucial for understanding the meaning and context of
the text.
9. Refinement Techniques
Overview: These techniques refine the output from core mining operations to improve
the quality and relevance of the results.
Suppression and Filtering: Removal of redundant or irrelevant information.
Ordering and Pruning: Organizing data into a more useful structure and removing
unnecessary elements.
Generalization and Clustering: Aggregating similar data points to simplify the analysis.
GUI and Browsing Functionality: User interfaces that allow interaction with the text
mining system, often including dashboards and visualization tools.
Query Language Access: Tools that allow users to formulate and submit queries to the
system.
Visualization Tools: These tools represent the mining results graphically, such as in
charts or graphs, making it easier to interpret patterns and trends.
Search and Query Interpreters: These components help users search the processed data
and interpret the results effectively.
User Role: The user interacts with the system through the presentation layer, utilizing the
GUI, visualization tools, and query functions to explore the data and discover insights.
Enhancement: The system can use external knowledge sources to refine its operations
and improve the accuracy of the extracted information.
Conclusion
This architecture outlines the comprehensive process of text mining, from data collection to user
interaction, emphasizing the importance of preprocessing, mining operations, refinement, and
user-facing tools to discover and present meaningful patterns and insights in textual data.
Distributions
I. Concept Selection
Explanation:
Concept selection involves identifying a subset of documents from a larger collection that are
labeled with one or more specified concepts.
Concept proportion measures the fraction of documents in a collection labeled with a specific set
of concepts.
Conditional concept proportion measures the fraction of documents labeled with one set of
concepts that are also labeled with another set.
Concept distribution assigns a value between 0 and 1 to each concept in a set, representing its
distribution in a subset of documents.
Concept proportion distribution measures the proportion of documents labeled with a specific
concept within a subset of documents.
VI. Conditional Proportion Distribution
VII. Average Proportion Distribution
Explanation:
Average proportion distribution averages the concept proportions across sibling nodes in a
hierarchy.
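The prose definitions above can be written more compactly. The following is a sketch using assumed notation (D for the collection, K, K1, K2 for concept sets, and D_K for the subset of documents in D labeled with every concept in K); the exact symbols are not taken from the text.

f(D, K) = \frac{|D_K|}{|D|}                                 % concept proportion
f(D, K_1 \mid K_2) = f(D_{K_2}, K_1)                        % conditional concept proportion
F_K(D, x) = f(D, \{x\}), \quad x \in K                      % concept proportion distribution over K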
1. Vertices (Nodes): Each vertex in the graph represents a concept. For instance, in our
example, a vertex might represent a specific country.
2. Edges: The edges between vertices represent the relationships between these concepts.
The edges can be weighted, meaning that they carry a numerical value representing the
strength or affinity of the relationship between two concepts. For example, if two
countries have a strong economic relationship concerning crude oil, the edge connecting
these two country nodes will have a higher weight.
3. Similarity Function: To determine whether an edge should exist between two vertices
(concepts), a similarity function is used. This function measures how similar two
concepts are within the given context. If the similarity exceeds a certain threshold, an
edge is created between those concepts.
D: A collection of documents.
C: A set of concepts.
P: A context phrase, which specifies the context under which the relationships are of
interest (e.g., crude oil).
Temporal context relationships involve analyzing how these relationships between concepts
change over time. For example, how the relationship between two countries concerning crude oil
changes over the years. By breaking down the corpus (the collection of documents) into temporal
segments, one can track and analyze these changes, providing a dynamic view of the context
graph over time.
Context graphs are widely used in fields like text analytics, social network analysis, and AI.
They help in understanding and visualizing complex relationships in large datasets, enabling
more informed decision-making.
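As a rough illustration, the sketch below builds a small context graph with the networkx package. The concepts, co-occurrence counts, similarity function, and threshold are all invented for the example; in practice the counts would come from the corpus D restricted to the context phrase P.

import networkx as nx  # assumes: pip install networkx

# Hypothetical co-occurrence counts of country concepts in "crude oil" documents.
co_occurrence = {
    ("Saudi Arabia", "USA"): 42,
    ("Saudi Arabia", "Russia"): 35,
    ("USA", "Canada"): 8,
}
doc_frequency = {"Saudi Arabia": 60, "USA": 70, "Russia": 50, "Canada": 20}

def similarity(c1, c2):
    # Simple similarity: co-occurrences relative to the rarer concept's frequency.
    return co_occurrence.get((c1, c2), 0) / min(doc_frequency[c1], doc_frequency[c2])

G = nx.Graph()
threshold = 0.3
for (c1, c2) in co_occurrence:
    s = similarity(c1, c2)
    if s >= threshold:              # only create an edge above the threshold
        G.add_edge(c1, c2, weight=s)

print(G.edges(data=True))           # weighted edges of the context graph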
For example, if you're processing a PDF document, task-oriented preprocessing might involve
extracting titles, author names, and other metadata. This approach can be especially useful when
dealing with natural language texts, where the data is inherently complex and unstructured.
Example
Consider a scenario where you have a scanned document containing various sections like the
title, author, abstract, and body text. The task-oriented preprocessing approach would first
convert the scanned document into a stream of text using Optical Character Recognition (OCR).
Next, it would identify and extract specific sections like the title and author by recognizing their
visual position in the document. Finally, these sections would be labeled and structured for
further analysis.
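A rough sketch of this OCR-plus-positional-extraction idea is shown below. It assumes the Tesseract OCR engine along with the pytesseract and Pillow packages are installed, "scan.png" is a hypothetical input file, and the "first line is the title, second is the author" rule is only a stand-in for real layout analysis.

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("scan.png"))   # OCR: image -> text stream
lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

# Crude positional heuristic mirroring "recognize sections by visual position".
metadata = {
    "title": lines[0] if lines else "",
    "author": lines[1] if len(lines) > 1 else "",
    "body": "\n".join(lines[2:]),
}
print(metadata["title"], "-", metadata["author"])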
General-purpose Natural Language Processing (NLP) tasks involve analyzing text documents
using common linguistic knowledge. These tasks are not specific to any particular problem but
are fundamental steps in understanding and processing language.
1. Tokenization
Tokenization is the process of breaking down text into smaller units, like sentences or words,
called tokens. This step is critical because it simplifies the text and prepares it for more complex
analysis.
Example: Consider the sentence, "Dr. Smith is a renowned scientist." The tokenizer must
distinguish between "Dr." as a title and the period that typically ends a sentence. The sentence
would be split into tokens like ["Dr.", "Smith", "is", "a", "renowned", "scientist"].
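A minimal tokenization sketch with NLTK is shown below; it assumes the 'punkt' tokenizer models have been downloaded via nltk.download("punkt"), and the exact token output may vary by NLTK version.

import nltk

text = "Dr. Smith is a renowned scientist. He works in genomics."
sentences = nltk.sent_tokenize(text)    # the Punkt model treats "Dr." as an abbreviation
tokens = nltk.word_tokenize(sentences[0])
print(sentences)
print(tokens)   # e.g. ['Dr.', 'Smith', 'is', 'a', 'renowned', 'scientist', '.']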
2. Part-of-Speech (POS) Tagging
POS tagging involves labeling each word in a sentence with its corresponding part of speech
(noun, verb, adjective, etc.). This helps in understanding the grammatical structure of the
sentence and can provide insights into the sentence's meaning.
Example: In the sentence, "The quick brown fox jumps over the lazy dog," POS tagging would
label "The" as a determiner, "quick" as an adjective, "fox" as a noun, "jumps" as a verb, and so
on.
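A short POS-tagging sketch with NLTK follows; it assumes nltk.download("punkt") and nltk.download("averaged_perceptron_tagger") have been run, and the exact tags may vary slightly by tagger version.

import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#       ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]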
3. Syntactic Parsing
Syntactic parsing assigns a structure to a sequence of text based on grammatical rules. There are
two primary types of syntactic parsing:
Constituency Parsing: This approach breaks sentences down into phrases, such as noun
phrases or verb phrases. For example, in the sentence "The cat sat on the mat," "The cat"
would be identified as a noun phrase, and "sat on the mat" as a verb phrase.
Dependency Parsing: This method focuses on the relationships between words in a
sentence. For instance, in the same sentence "The cat sat on the mat," dependency parsing
would highlight that "sat" is the main verb, with "cat" as the subject and "on the mat" as a
prepositional phrase dependent on "sat."
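A dependency-parsing sketch with spaCy is shown below; it assumes the small English model has been installed (python -m spacy download en_core_web_sm).

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat")
for token in doc:
    # each token points to its syntactic head and carries a dependency label
    print(f"{token.text:4} --{token.dep_:6}--> {token.head.text}")
# "sat" is the root verb, "cat" its nominal subject, and "on the mat" attaches to "sat".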
4. Shallow Parsing
Shallow parsing, also known as chunking, breaks down text into small, easily interpretable
chunks like simple noun and verb phrases. It doesn't provide a full parse of the sentence, making
it faster and more robust for certain applications like information extraction.
Example: In the sentence "He bought a car," shallow parsing might break it down into two
chunks: [He] [bought a car], where "He" is a noun phrase and "bought a car" is a verb phrase.
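A shallow-parsing (chunking) sketch using NLTK's regular-expression chunker is given below; the grammar is a simplified assumption covering only this example, and it relies on the same NLTK resources as the tagging sketch above.

import nltk

grammar = r"""
  NP: {<DT|PRP|JJ|NN.*>+}   # noun phrase: pronoun, or determiner/adjectives/nouns
  VP: {<VB.*><NP>*}         # verb phrase: a verb followed by noun-phrase chunks
"""
chunker = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize("He bought a car"))
print(chunker.parse(tagged))   # roughly: (S (NP He) (VP bought (NP a car)))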
Problem-dependent tasks, such as text categorization and information extraction, create
representations of documents that are meaningful either for more sophisticated
processing phases or for direct interaction by the user.
Example: Imagine you have a large corpus of emails. Using text categorization, you could
automatically sort these emails into categories like "Work," "Personal," and "Spam." Information
extraction could then identify important details within each email, like the sender's name or the
date of the meeting mentioned.
Conclusion
Preprocessing is the foundation of text mining, transforming raw data into structured, meaningful
representations. By understanding and applying these techniques, you can significantly improve
the accuracy and efficiency of your text mining projects.
1. Decision Tree
Definition:
A Decision Tree is a non-parametric supervised learning algorithm used for classification and
regression tasks. It models decisions and their possible consequences as a tree structure, where
each internal node represents a "test" on an attribute, each branch represents the outcome of the
test, and each leaf node represents a class label or a decision.
Formula:
The core of a Decision Tree involves measures like Information Gain (IG) or Gini Index (GI) to
determine the best attribute to split the data. For binary classification, with p the proportion of
positive examples in a node S:
Entropy: H(S) = −p·log2(p) − (1 − p)·log2(1 − p)
Gini Index: GI(S) = 1 − p² − (1 − p)²
Information Gain: IG(S, A) = H(S) − Σv (|Sv| / |S|)·H(Sv), where the sum runs over the subsets Sv
produced by splitting S on attribute A.
Explanation:
The Decision Tree algorithm starts with the entire dataset and selects the attribute that maximizes
Information Gain or minimizes Gini Index to split the dataset. This process continues recursively
for each subset, creating a tree where each leaf node represents a class label. Pruning may be
applied to avoid overfitting.
Example:
Suppose you have a dataset to predict whether a person buys a computer based on attributes like
age, income, student status, and credit rating. The Decision Tree might split the data first on
"age" and then on "income," forming a tree that predicts the class labels "buys computer" or
"doesn't buy computer."
OR
Definition:
A Decision Tree is a hierarchical model used for making decisions or predictions by breaking
down a dataset into smaller subsets. It does this by recursively splitting the data based on feature
values. The tree consists of nodes where each node represents a decision based on a feature,
branches representing the outcomes of these decisions, and leaves representing the final
prediction or classification.
Explanation:
1. Root Node: The top node of the tree, representing the entire dataset.
2. Splitting: The process of dividing the dataset into subsets based on the value of a feature.
The aim is to increase the homogeneity of the target variable within each subset.
3. Decision Nodes: Nodes where the dataset is split. These nodes test different features to
decide the best way to split the data.
4. Leaf Nodes: The terminal nodes that provide the final output or classification. They
represent the decision or prediction outcome for the dataset subset that reaches them.
Choosing Splits:
Information Gain (IG): Measures how much information is gained about the target
variable by splitting the data based on a particular feature. It helps in choosing the feature
that will yield the most significant reduction in uncertainty.
Gini Index: Measures the impurity of a node. A node with a Gini Index of 0 is perfectly
pure, meaning all data points in that node belong to a single class.
Example: Imagine a Decision Tree is used to decide whether to play tennis based on weather
conditions. The root node might test if it's sunny, followed by branches for temperature and
humidity. The final leaf nodes could predict outcomes such as "Play" or "Don't Play" based on
these conditions.
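A toy sketch of the "play tennis" idea with scikit-learn's DecisionTreeClassifier follows; the encoded weather data and labels below are invented purely for illustration (outlook: 0 = sunny, 1 = overcast, 2 = rain; humidity: 0 = normal, 1 = high).

from sklearn.tree import DecisionTreeClassifier

X = [[0, 1], [0, 0], [1, 1], [2, 1], [2, 0], [1, 0]]
y = ["Don't Play", "Play", "Play", "Don't Play", "Play", "Play"]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # entropy -> information gain splits
clf.fit(X, y)
print(clf.predict([[0, 1]]))   # prediction for a sunny, high-humidity day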
2. Naive Bayes
Definition:
Naive Bayes is a probabilistic classifier based on applying Bayes' theorem with strong (naive)
independence assumptions between the features. It is particularly used for text classification.
Explanation:
Naive Bayes assumes that the presence of a particular feature in a class is independent of the
presence of any other feature. Despite this assumption, it performs well in many real-world
situations, especially in text classification problems like spam detection, where the order of
words does not influence the outcome.
Example:
Consider a spam filter that classifies emails based on the occurrence of certain words. If the
words "win" and "prize" are frequent in spam emails, the classifier might predict an incoming
email containing these words as spam based on the learned probabilities.
OR
Definition:
Naive Bayes is a probabilistic classifier that applies Bayes' theorem with the assumption that
features are conditionally independent given the class label. Despite this simplification, Naive
Bayes often performs well in practice, especially in text classification and spam detection.
Explanation:
1. Bayes' Theorem: Provides a way to calculate the probability of a class given the
features. It combines prior knowledge (prior probability) with the likelihood of features
occurring given the class: P(C | X) = P(X | C) · P(C) / P(X).
2. Naive Assumption: Assumes that all features are independent given the class label. This
simplifies the computation but might not always hold true in real data.
Classification Process:
Compute Prior Probability: The likelihood of each class based on the training data.
Compute Likelihood: The probability of observing the given feature values for each
class.
Apply Bayes' Theorem: Calculate the posterior probability for each class and choose the
class with the highest probability.
Example: In spam email classification, Naive Bayes would use the frequency of words in spam
and non-spam emails to predict the probability that a new email is spam based on its word
content.
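A minimal spam-filter sketch with scikit-learn is shown below; the four training emails and their labels are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "claim your prize", "meeting agenda attached", "lunch tomorrow?"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)        # word-count features
clf = MultinomialNB().fit(X, labels)

new_email = vectorizer.transform(["you win a prize"])
print(clf.predict(new_email))               # likely ['spam'] given the word overlap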
3. Linear Regression
Definition:
Linear Regression is a linear approach to modeling the relationship between a dependent variable
and one or more independent variables. The goal is to fit a linear equation to observed data.
Explanation:
Linear Regression aims to find the line (or hyperplane in higher dimensions) that best fits the
data by minimizing the sum of squared errors between the observed and predicted values. This
line represents the relationship between the dependent and independent variables.
Example:
Consider predicting house prices based on features like square footage, number of bedrooms, and
location. Linear Regression would provide a formula that combines these features to estimate the
price, allowing for predictions on new data.
OR
Definition:
Linear Regression models the relationship between a dependent variable and one or more
independent variables by fitting a linear equation to the observed data. It aims to predict the
dependent variable based on the values of the independent variables.
Explanation:
1. Model: The equation of a line (or hyperplane in multiple dimensions) is used to make
predictions. For a single independent variable, the model is y = β0 + β1·x + ε, where β0 is
the intercept, β1 the slope, and ε the error term.
2. Objective: Minimize the sum of squared errors (differences between observed and
predicted values). This is achieved using techniques like Ordinary Least Squares (OLS).
3. Assumptions:
o Linearity: The relationship between dependent and independent variables is
linear.
o Independence: The residuals (errors) are independent.
o Homoscedasticity: Constant variance of residuals.
o Normality: Residuals should be normally distributed.
Example: Predicting house prices based on features such as square footage and number of
bedrooms. Linear Regression would provide a formula that estimates the price based on these
features, allowing predictions for new properties.
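A short scikit-learn sketch of this house-price example follows; the (square footage, bedrooms) -> price data is made up for illustration.

from sklearn.linear_model import LinearRegression

X = [[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]]
y = [245000, 312000, 279000, 308000, 419000]

model = LinearRegression().fit(X, y)        # ordinary least squares fit
print(model.intercept_, model.coef_)        # estimated intercept and coefficients
print(model.predict([[2000, 4]]))           # price estimate for a new property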
4. K-Nearest Neighbors (KNN)
Explanation:
KNN works by storing all available cases and classifying new cases based on a similarity
measure (e.g., distance functions). It assigns the most common label among its nearest neighbors
to the new instance. It is highly effective for small datasets but can be computationally expensive
for large ones.
Example:
Imagine you have a dataset of flowers labeled by species based on features like petal length and
width. Given a new flower, KNN would classify it by identifying the closest k flowers in the
dataset and assigning the most frequent species among those neighbors.
OR
Definition:
K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm that classifies
a new instance (or predicts its value) from the k most similar instances in the training data.
Explanation:
1. Distance Metric: KNN uses distance metrics like Euclidean distance to find the k
nearest neighbors of a new instance. For k-NN, the distance between two points p and q
in n-dimensional space is calculated as d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²).
2. Classification: The new instance is assigned to the class that is most frequent among its
k nearest neighbors.
3. Regression: The value of the new instance is predicted as the average (or weighted
average) of the values of its k nearest neighbors.
4. Choosing k: The value of k is crucial. A small k can lead to overfitting, while a
large k might smooth out the classification or prediction.
Example: For classifying a new flower as one of several species, KNN would look at the k
closest flowers in the training set and assign the species based on the majority class among these
neighbors.
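A scikit-learn sketch of k-NN on the built-in iris flower dataset follows; the choice of k = 5 and the 70/30 split are arbitrary illustrative settings.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5, Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))            # accuracy on the held-out flowers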
5. Support Vector Machine (SVM)
Definition:
Support Vector Machine (SVM) is a supervised learning model used for classification and
regression tasks. It works by finding the hyperplane that best separates the classes in the feature
space.
Explanation:
SVM searches for the separating hyperplane with the largest margin, that is, the decision
boundary lying farthest from the nearest training points (the support vectors) of each class.
Example:
In a binary classification problem, like distinguishing between cats and dogs based on features
like ear length and tail length, SVM would find the hyperplane that maximizes the separation
between the two classes, thereby classifying any new animal based on which side of the
hyperplane it falls.
OR
Definition:
Support Vector Machine (SVM) is a supervised learning algorithm that finds the optimal
hyperplane to separate different classes in the feature space. It aims to maximize the margin
between the classes to achieve better generalization.
Explanation:
1. Hyperplane: A hyperplane in an n-dimensional space is a flat affine subspace of
dimension n − 1 that separates the data into two classes. SVM seeks the hyperplane
that maximizes the margin between the closest points (support vectors) of each class.
2. Margin: The distance between the hyperplane and the closest data points from either
class. Maximizing this margin helps improve the model's generalization to unseen data.
3. Kernel Trick: SVM can handle non-linearly separable data by mapping it into a higher-
dimensional space using a kernel function (e.g., polynomial, radial basis function). This
allows for finding a linear separating hyperplane in the new space.
Example: In a dataset with two classes that are not linearly separable, SVM can use a kernel
function to transform the data into a higher dimension where a linear hyperplane can be used to
separate the classes effectively.
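A small scikit-learn sketch of this kernel idea follows, using a synthetic "two concentric circles" dataset that is not linearly separable; the sample size, noise level, and C/gamma settings are arbitrary illustrative choices.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))   # struggles on the circles
print("RBF kernel accuracy:", rbf_svm.score(X, y))          # near-perfect separation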