
Advanced Topics in Data Mining

Farhad Muhammad Riaz

[email protected]
Ensemble Philosophy

• Build many models and combine them
• Only through averaging do we get at the truth!
• It's too hard (impossible?) to build a single model that works best
• Two types of approaches:
  – Models that don't use randomness
  – Models that incorporate randomness

Ensemble Approaches

• Bagging
– Bootstrap aggregating

• Boosting

• Random Forests
– Bagging reborn

Bagging
• Main Assumption:
  – Combining many unstable predictors produces a stable ensemble predictor.
  – Unstable predictor: small changes in the training data produce large changes in the model.
    • e.g. neural nets, trees
    • Stable: SVM (sometimes), nearest neighbor
• Hypothesis Space
  – Variable size (nonparametric): can model any function if you use an appropriate predictor (e.g. trees)

The Bagging Algorithm

Given data: $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

For $m = 1, \ldots, M$:
• Obtain a bootstrap sample $D_m$ from the training data $D$
• Build a model $G_m(x)$ from the bootstrap data $D_m$

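A minimal Python sketch of this loop (assuming scikit-learn is available), using full decision trees as the unstable base models; `bagging_fit` and `bagging_predict` are illustrative names, not library functions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, M=30):
    """Fit M trees, each on a bootstrap sample D_m of the training data D."""
    rng = np.random.default_rng(0)
    models = []
    for _ in range(M):
        idx = rng.integers(0, len(X), size=len(X))  # draw N rows with replacement
        tree = DecisionTreeRegressor()              # full-length tree: unstable on purpose
        models.append(tree.fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Regression: average the M model outputs (the formula on the next slide)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```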
The Bagging Model

• Regression:

  $\hat{y} = \frac{1}{M} \sum_{m=1}^{M} G_m(x)$

• Classification:
  – Vote over the classifier outputs $G_1(x), \ldots, G_M(x)$
Bagging Details

• A bootstrap sample of N instances is obtained by drawing N examples at random, with replacement.
• On average, each bootstrap sample contains about 63% of the distinct training instances.
  – Encourages predictors to have uncorrelated errors
    • This is why it works

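A quick sanity check of the 63% figure: the probability that a given instance appears at least once in a bootstrap sample of size $N$ is $1 - (1 - 1/N)^N \to 1 - 1/e \approx 0.632$. A small NumPy sketch:

```python
import numpy as np

# Probability that a given instance appears at least once in a
# bootstrap sample of size N: 1 - (1 - 1/N)^N -> 1 - 1/e ~ 0.632
for N in (10, 100, 10_000):
    print(N, 1 - (1 - 1 / N) ** N)

# Empirical check: fraction of distinct instances in one bootstrap sample
rng = np.random.default_rng(0)
N = 10_000
sample = rng.integers(0, N, size=N)
print("empirical:", np.unique(sample).size / N)
```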
Bagging Details 2

• Usually set $M \approx 30$
  – Or use validation data to pick $M$
• The models $G_m(x)$ need to be unstable
  – Usually full-length (or slightly pruned) decision trees

Boosting

• Main Assumption:
  – Combining many weak predictors (e.g. tree stumps or 1-R predictors) produces an ensemble predictor
  – The weak predictors or classifiers need to be stable
• Hypothesis Space
  – Variable size (nonparametric): can model any function if you use an appropriate predictor (e.g. trees)

Commonly Used Weak Predictor
(or classifier)

A Decision Tree Stump (1-R)

Boosting

Each classifier $G_m(x)$ is trained on a weighted sample of the training data.

Boosting (Continued)

• Each predictor is created using a biased sample of the training data
  – Instances (training examples) with high error are weighted more heavily than those with lower error
    • Difficult instances get more attention
  – This is the motivation behind boosting

Background Notation

• The indicator function $I(s)$ is defined as:

  $I(s) = \begin{cases} 1 & \text{if } s \text{ is true} \\ 0 & \text{otherwise} \end{cases}$

• The function $\log(x)$ is the natural logarithm

The AdaBoost Algorithm
(Freund and Schapire, 1996)
Given data: $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

1. Initialize weights $w_i = 1/N$, $i = 1, \ldots, N$
2. For $m = 1, \ldots, M$:
   a) Fit classifier $G_m(x) \in \{-1, 1\}$ to the data using weights $w_i$
   b) Compute $\mathrm{err}_m = \dfrac{\sum_{i=1}^{N} w_i \, I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$
   c) Compute $\alpha_m = \log\big((1 - \mathrm{err}_m) / \mathrm{err}_m\big)$
   d) Set $w_i \leftarrow w_i \exp\big[\alpha_m \, I(y_i \neq G_m(x_i))\big]$, $i = 1, \ldots, N$
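A minimal NumPy sketch of this algorithm with decision stumps as the weak classifiers; `fit_stump`, `adaboost`, and `predict` are illustrative names (the stump search is brute force for clarity), and the final prediction implements the weighted vote shown on the next slide:

```python
import numpy as np

def fit_stump(X, y, w):
    """Find the best decision stump (feature, threshold, sign) under weights w."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= t, sign, -sign)
                err = np.sum(w * (pred != y)) / np.sum(w)   # step 2b
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, M=20):
    """AdaBoost with decision stumps; y must be in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                        # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(M):
        err, j, t, sign = fit_stump(X, y, w)       # step 2a
        err = max(err, 1e-10)                      # guard against log(0)
        alpha = np.log((1 - err) / err)            # step 2c
        pred = np.where(X[:, j] <= t, sign, -sign)
        w = w * np.exp(alpha * (pred != y))        # step 2d: upweight errors
        stumps.append((j, t, sign))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    """Weighted vote: y_hat = sgn(sum_m alpha_m * G_m(x))."""
    F = np.zeros(len(X))
    for (j, t, sign), alpha in zip(stumps, alphas):
        F += alpha * np.where(X[:, j] <= t, sign, -sign)
    return np.sign(F)
```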
The AdaBoost Model

$\hat{y} = \mathrm{sgn}\left[ \sum_{m=1}^{M} \alpha_m G_m(x) \right]$

AdaBoost is NOT used for Regression!

The Updates in Boosting

[Figure: two panels plotted against $\mathrm{err}_m \in [0, 1]$: "Alpha for Boosting" ($\alpha_m$, ranging from $-5$ to $5$) and "Re-weighting Factor for Boosting" ($w \cdot \exp(\alpha_m)$, ranging from $0$ to $100$).]
Boosting Characteristics
[Figure: Simulated data: test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rates for a single stump and for a 400-node tree.]

Loss Functions for $y \in \{-1, +1\}$, $f \in \mathbb{R}$

• Misclassification: $I(\mathrm{sgn}(f) \neq y)$
• Exponential (Boosting): $\exp(-yf)$
• Binomial Deviance (Cross Entropy): $\log(1 + \exp(-2yf))$
• Squared Error: $(y - f)^2$
• Support Vector (hinge): $(1 - yf) \cdot I(yf < 1)$

(The margin $yf$ is negative for incorrect classifications and positive for correct ones.)
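To make the comparison concrete, here is a small NumPy sketch that evaluates each loss on a grid of margins $m = yf$ (variable names are illustrative):

```python
import numpy as np

# Losses as functions of the margin m = y * f
# (negative margin = incorrect classification, positive = correct).
m = np.linspace(-2.0, 2.0, 9)

misclassification = (m <= 0).astype(float)        # I(sgn(f) != y)
exponential       = np.exp(-m)                    # boosting loss
binomial_deviance = np.log(1 + np.exp(-2 * m))    # cross entropy
squared_error     = (1 - m) ** 2                  # equals (y - f)^2 when y in {-1,+1}
hinge             = np.maximum(0, 1 - m)          # support vector loss (1 - yf)_+

for name, loss in [("misclass", misclassification), ("exp", exponential),
                   ("deviance", binomial_deviance), ("squared", squared_error),
                   ("hinge", hinge)]:
    print(f"{name:9s}", np.round(loss, 2))
```

Note that all the surrogate losses upper-bound the misclassification loss near $m = 0$, but the exponential loss penalizes large negative margins most aggressively, which is one reason boosting can overfit noisy data.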
Other Variations of Boosting

• Gradient Boosting
  – Can use any cost function
• Stochastic (Gradient) Boosting
  – Bootstrap sample: uniform random sampling (with replacement)
  – Often outperforms the non-random version

Gradient Boosting

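A minimal sketch using scikit-learn's gradient boosting implementation; setting `subsample < 1.0` gives the stochastic variant (note that scikit-learn subsamples without replacement, a common variation on the with-replacement bootstrap described above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=100,   # M: number of boosting iterations
    max_depth=1,        # stumps as the weak learners
    subsample=0.5,      # < 1.0 turns this into stochastic gradient boosting
    random_state=0,
)
gb.fit(X_tr, y_tr)
print("test accuracy:", gb.score(X_te, y_te))
```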
Boosting Summary

• Good points
– Fast learning
– Capable of learning any function (given appropriate weak learner)
– Feature weighting
– Very little parameter tuning
• Bad points
– Can overfit data
– Only for binary classification
• Learning parameters (picked via cross validation)
– Size of tree
– When to stop
• Software
– http://www-stat.stanford.edu/~jhf/R-MART.html

PCA

Sentiment Analysis

• Sentiment Analysis is a use case of Natural Language Processing (NLP) and comes under the category of text classification.
• To put it simply, Sentiment Analysis involves classifying a text into various sentiments, such as positive or negative, happy, sad, or neutral, etc.
• Thus, the ultimate goal of sentiment analysis is to decipher the underlying mood, emotion, or sentiment of a text.
• This is also referred to as Opinion Mining.
How Does Sentiment Analysis Work?

• Text Preprocessing:
  – Text cleaning removes irrelevant information from the text data, such as special characters, punctuation, and stop words.
• Tokenization:
  – The text is divided into individual words or tokens to facilitate analysis.
• Feature Extraction:
  – Relevant features are extracted from the text, such as words, n-grams, or even parts of speech.
• Sentiment Classification:
  – Machine learning algorithms or pre-trained models are used to classify the sentiment of each text instance. This is achieved through supervised learning, where models are trained on labeled data, or through pre-trained models that have learned sentiment patterns from large datasets.
• Post-processing:
  – The sentiment analysis results may undergo additional processing, such as aggregating sentiment scores or applying threshold rules to classify sentiments as positive, negative, or neutral.
• Evaluation:
  – The performance of the sentiment analysis model is assessed using evaluation metrics such as accuracy, precision, recall, or F1 score. (A minimal sketch of the full pipeline follows this list.)
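A minimal end-to-end sketch of these steps with scikit-learn; the tiny corpus and labels below are invented purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled corpus (1 = positive sentiment, 0 = negative).
texts = ["I loved this movie, wonderful acting!",
         "Terrible plot and worse dialogue.",
         "An absolute delight from start to finish.",
         "Boring, I walked out halfway."]
labels = [1, 0, 1, 0]

# TfidfVectorizer covers tokenization, stop-word removal, and n-gram
# feature extraction; LogisticRegression does the supervised
# sentiment classification step.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(texts, labels)
print(model.predict(["what a wonderful, delightful film"]))
```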
Types of Sentiment Analysis

• Document-Level Sentiment Analysis:
  – Determines the overall sentiment expressed in a document, such as a review or an article. It aims to classify the entire text as positive, negative, or neutral.
• Sentence-Level Sentiment Analysis:
  – The sentiment of each sentence within a document is analyzed. This provides a more granular understanding of the sentiment expressed in different parts of the text.
• Aspect-Based Sentiment Analysis:
  – Focuses on identifying and extracting the sentiment associated with specific aspects or entities mentioned in the text. For example, in a product review, the sentiment towards different features of the product (e.g., performance, design, usability) can be analyzed separately.
• Entity-Level Sentiment Analysis:
  – Identifies the sentiment expressed towards specific entities or targets mentioned in the text, such as people, companies, or products. It helps in understanding the sentiment associated with different entities within the same document.
• Comparative Sentiment Analysis:
  – Compares the sentiment between different entities or aspects mentioned in the text. It aims to identify the relative sentiment or preferences expressed towards various entities or features.
Ways to Perform Sentiment Analysis

• Using TextBlob
• Using VADER
• Using Bag-of-Words Vectorization-based Models
• Using LSTM-based Models
• Using Transformer-based Models

The TF-IDF pipeline sketched earlier is an example of a bag-of-words vectorization-based model; a quick sketch of the first two options follows.
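A minimal sketch of the TextBlob and VADER options, assuming the `textblob` and `nltk` packages are installed:

```python
# Off-the-shelf sentiment scorers: TextBlob (lexicon-based polarity)
# and NLTK's VADER (rule-based, tuned for social-media text).
from textblob import TextBlob
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

text = "The movie was surprisingly good, I really enjoyed it!"

# TextBlob: polarity in [-1, 1], subjectivity in [0, 1]
print(TextBlob(text).sentiment)

# VADER: neg/neu/pos proportions plus a normalized 'compound' score in [-1, 1]
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores(text))
```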
