Advanced Topic: Data Mining
[email protected]
Ensemble Philosophy
Ensemble Approaches
• Bagging
– Bootstrap aggregating
• Boosting
• Random Forests
– Bagging reborn
Bagging
• Main Assumption:
– Combining many unstable predictors produces a stable ensemble predictor.
– Unstable predictor: small changes in the training data produce large changes in the model.
• e.g., neural nets, decision trees
• Stable: SVM (sometimes), nearest neighbor
• Hypothesis Space
– Variable size (nonparametric):
• Can model any function if you use an appropriate predictor (e.g., trees)
The Bagging Algorithm
For m = 1 : M
• Obtain a bootstrap sample Dm from the training data D
• Build a model Gm(x) from the bootstrap data Dm
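A minimal sketch of this loop in Python, assuming scikit-learn's DecisionTreeRegressor as the unstable base predictor (the function name fit_bagging and the default M = 30 are illustrative choices, not part of the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagging(X, y, M=30, random_state=0):
    """Fit M base models, each on its own bootstrap sample Dm of D."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for m in range(M):
        # Bootstrap sample Dm: n indices drawn uniformly with replacement
        idx = rng.integers(0, n, size=n)
        G_m = DecisionTreeRegressor()  # an unstable base predictor
        G_m.fit(X[idx], y[idx])
        models.append(G_m)
    return models
```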
The Bagging Model
• Regression: average the predictions of the M models,
$$\hat{y} = \frac{1}{M}\sum_{m=1}^{M} G_m(x)$$
• Classification:
– Vote over the classifier outputs $G_1(x), \ldots, G_M(x)$
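Combining the M fitted models is equally short. The sketch below (hypothetical helper names; labels assumed to be in {-1, +1} for the voting case) averages for regression and takes a majority vote for classification:

```python
import numpy as np

def bagging_predict_regression(models, X):
    # y_hat = (1/M) * sum_m G_m(x)
    preds = np.stack([G_m.predict(X) for G_m in models])
    return preds.mean(axis=0)

def bagging_predict_classification(models, X):
    # Majority vote over G_1(x), ..., G_M(x), assuming labels in {-1, +1};
    # a tie yields 0 here and would need a tie-breaking rule in practice.
    preds = np.stack([G_m.predict(X) for G_m in models])
    return np.sign(preds.sum(axis=0))
```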
Bagging Details
Bagging Details 2
• Usually set M ≈ 30
– Or use validation data to pick M
• The models Gm(x) need to be unstable
– Usually full-length (or slightly pruned) decision trees
Boosting
• Main Assumption:
– Combining many weak predictors (e.g., tree stumps or 1-R predictors) produces an ensemble predictor
– The weak predictors or classifiers need to be stable
• Hypothesis Space
– Variable size (nonparametric):
• Can model any function if you use an appropriate predictor (e.g., trees)
Commonly Used Weak Predictor (or Classifier)
Boosting
Boosting (Continued)
Background Notation
• The indicator function:
$$I(s) = \begin{cases} 1 & \text{if } s \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$
• The function $\log(x)$ is the natural logarithm.
The AdaBoost Algorithm
(Freund and Schapire, 1996)
Given data: $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$
1. Initialize the observation weights $w_i = 1/N$, $i = 1, \ldots, N$
2. For m = 1 : M
a) Fit a classifier $G_m(x) \in \{-1, +1\}$ to the data using weights $w_i$
b) Compute the weighted error
$$\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i \, I\left(y_i \neq G_m(x_i)\right)}{\sum_{i=1}^{N} w_i}$$
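A sketch of the full loop in Python. Steps (c) and (d) below follow the standard discrete AdaBoost formulation (classifier weight $\alpha_m = \log\left((1 - \mathrm{err}_m)/\mathrm{err}_m\right)$ and up-weighting of misclassified points); the stump depth and the default M = 50 are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """Discrete AdaBoost; labels y are assumed to be in {-1, +1}."""
    N = len(X)
    w = np.full(N, 1.0 / N)                      # 1. initialize weights w_i = 1/N
    models, alphas = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # a) fit G_m using weights w_i
        pred = stump.predict(X)
        miss = (pred != y)
        err_m = np.sum(w * miss) / np.sum(w)     # b) weighted error err_m
        alpha_m = np.log((1.0 - err_m) / err_m)  # c) classifier weight alpha_m
        w = w * np.exp(alpha_m * miss)           # d) up-weight misclassified points
        models.append(stump)
        alphas.append(alpha_m)
    return models, alphas
```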
The AdaBoost Model
$$\hat{y} = \mathrm{sgn}\left[\sum_{m=1}^{M} \alpha_m G_m(x)\right]$$
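Prediction is a weighted vote of the fitted classifiers; a short sketch continuing the hypothetical adaboost_fit above:

```python
import numpy as np

def adaboost_predict(models, alphas, X):
    # y_hat = sgn( sum_m alpha_m * G_m(x) )
    scores = sum(a * G.predict(X) for a, G in zip(alphas, models))
    return np.sign(scores)
```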
The Updates in Boosting
[Figure: two panels plotted as functions of err_m (from 0 to 1), showing how the boosting weight update w · exp(α_m) behaves as the weighted error err_m varies.]
Boosting Characteristics
Simulated data: test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rates for a single stump and for a 400-node tree.
Loss Functions for $y \in \{-1, +1\}$, $f \in \mathbb{R}$
• Misclassification: $I(\mathrm{sgn}(f) \neq y)$
• Exponential (Boosting): $\exp(-yf)$
• Binomial Deviance (Cross Entropy): $\log\left(1 + \exp(-2yf)\right)$
• Squared Error: $(y - f)^2$
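These four losses are trivial to evaluate; a small vectorized sketch in Python (function names are illustrative):

```python
import numpy as np

def misclassification_loss(y, f):
    return (np.sign(f) != y).astype(float)

def exponential_loss(y, f):        # the loss minimized by AdaBoost
    return np.exp(-y * f)

def binomial_deviance(y, f):       # cross-entropy on the {-1, +1} label scale
    return np.log(1.0 + np.exp(-2.0 * y * f))

def squared_error(y, f):
    return (y - f) ** 2
```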
Other Variations of Boosting
• Gradient Boosting
– Can use any differentiable cost function
• Stochastic (Gradient) Boosting
– Each round fits the weak learner to a bootstrap sample: uniform random sampling (with replacement)
– Often outperforms the non-random version
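A minimal sketch of stochastic gradient boosting for squared-error loss, following the bullets above (a bootstrap sample drawn with replacement each round); the tree depth, the learning rate nu, and the function names are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stochastic_gradient_boost(X, y, M=100, nu=0.1, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(X)
    init = y.mean()                       # initial constant model
    F = np.full(n, init)                  # current ensemble prediction
    trees = []
    for m in range(M):
        residual = y - F                  # negative gradient of squared-error loss
        idx = rng.integers(0, n, size=n)  # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X[idx], residual[idx])
        F = F + nu * tree.predict(X)      # shrunken update on all points
        trees.append(tree)
    return init, trees, nu

def gradient_boost_predict(init, trees, nu, X):
    return init + nu * sum(t.predict(X) for t in trees)
```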
Gradient Boosting
Boosting Summary
• Good points
– Fast learning
– Capable of learning any function (given an appropriate weak learner)
– Feature weighting
– Very little parameter tuning
• Bad points
– Can overfit data
– Only for binary classification
• Learning parameters (picked via cross validation)
– Size of tree
– When to stop
• Software
– http://www-stat.stanford.edu/~jhf/R-MART.html
PCA
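The PCA slides themselves are figures; as a reminder of the mechanics, here is a minimal sketch of PCA via the SVD of the centered data matrix (the function name and the choice of k are illustrative):

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                   # top-k principal directions
    scores = X_centered @ components.T    # coordinates in the reduced space
    explained_variance = (S[:k] ** 2) / (len(X) - 1)
    return scores, components, explained_variance
```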
Sentiment Analysis
• Text Preprocessing:
– Text cleaning removes irrelevant information from the text data, such as special characters, punctuation, and stop words.
• Tokenization:
– The text is divided into individual words or tokens to facilitate analysis.
• Feature Extraction:
– Relevant features are extracted from the text, such as words, n-grams, or even parts of speech.
• Sentiment Classification:
– Machine learning algorithms or pre-trained models classify the sentiment of each text instance, either through supervised learning on labeled data or with pre-trained models that have learned sentiment patterns from large datasets.
• Post-processing:
– The sentiment analysis results may undergo additional processing, such as aggregating sentiment scores or applying threshold rules to classify sentiments as positive, negative, or neutral.
• Evaluation:
– The performance of the sentiment analysis model is assessed with evaluation metrics such as accuracy, precision, recall, or F1 score.
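A compact end-to-end sketch of this pipeline in Python using scikit-learn (the toy texts and labels are hypothetical, and TfidfVectorizer stands in for the cleaning, tokenization, and feature-extraction steps):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical labeled data: 1 = positive, 0 = negative
texts = ["great product, works perfectly", "terrible, broke after a day",
         "absolutely love it", "waste of money"]
labels = [1, 0, 1, 0]

# Preprocessing + tokenization + feature extraction: lowercase, drop English
# stop words, and build unigram/bigram TF-IDF features
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# Sentiment classification via supervised learning
clf = LogisticRegression()
clf.fit(X, labels)

# Evaluation (on the training data here, only to illustrate the metrics)
pred = clf.predict(X)
print(classification_report(labels, pred))
```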
Types of Sentiment Analysis