Advanced Topic: Data Mining
[email protected]
Ensemble Philosophy
Ensemble Approaches
• Bagging
– Bootstrap aggregating
• Boosting
• Random Forests
– Bagging reborn
Bagging
• Main Assumption:
– Combining many unstable predictors produces a stable ensemble predictor.
– Unstable predictor: small changes in the training data produce large changes in the model.
• e.g., neural nets, decision trees
• Stable: SVM (sometimes), nearest neighbor
• Hypothesis Space
– Variable size (nonparametric):
• Can model any function if you use an appropriate predictor (e.g., trees)
The Bagging Algorithm
For m = 1 : M
• Obtain a bootstrap sample Dm from the training data D
• Build a model Gm(x) from the bootstrap data Dm
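A minimal sketch of this loop in Python, assuming scikit-learn's DecisionTreeRegressor as the unstable base predictor (the function name fit_bagging and the default M = 30 are illustrative choices, not part of the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagging(X, y, M=30, random_state=0):
    """Fit M base models, each on its own bootstrap sample Dm of D."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for m in range(M):
        # Bootstrap sample Dm: n indices drawn uniformly with replacement
        idx = rng.integers(0, n, size=n)
        G_m = DecisionTreeRegressor()  # an unstable base predictor
        G_m.fit(X[idx], y[idx])
        models.append(G_m)
    return models
```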
The Bagging Model
• Regression: average the predictions of the M models,
$$\hat{y} = \frac{1}{M}\sum_{m=1}^{M} G_m(x)$$
• Classification:
– Vote over the classifier outputs $G_1(x), \ldots, G_M(x)$
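Combining the M fitted models is equally short. The sketch below (hypothetical helper names; labels assumed to be in {-1, +1} for the voting case) averages for regression and takes a majority vote for classification:

```python
import numpy as np

def bagging_predict_regression(models, X):
    # y_hat = (1/M) * sum_m G_m(x)
    preds = np.stack([G_m.predict(X) for G_m in models])
    return preds.mean(axis=0)

def bagging_predict_classification(models, X):
    # Majority vote over G_1(x), ..., G_M(x), assuming labels in {-1, +1};
    # a tie yields 0 here and would need a tie-breaking rule in practice.
    preds = np.stack([G_m.predict(X) for G_m in models])
    return np.sign(preds.sum(axis=0))
```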
Bagging Details
Bagging Details 2
• Usually set M ≈ 30
– Or use validation data to pick M
• The models Gm(x) need to be unstable
– Usually full-length (or slightly pruned) decision trees
Boosting
• Main Assumption:
– Combining many weak predictors (e.g., tree stumps or 1-R predictors) produces an ensemble predictor
– The weak predictors or classifiers need to be stable
• Hypothesis Space
– Variable size (nonparametric):
• Can model any function if you use an appropriate predictor (e.g., trees)
Commonly Used Weak Predictor (or Classifier)
Boosting
Boosting (Continued)
Background Notation
• The indicator function:
$$I(s) = \begin{cases} 1 & \text{if } s \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$
• The function $\log(x)$ is the natural logarithm.
The AdaBoost Algorithm
(Freund and Schapire, 1996)
Given data: $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$
1. Initialize the observation weights $w_i = 1/N$, $i = 1, \ldots, N$
2. For m = 1 : M
a) Fit a classifier $G_m(x) \in \{-1, +1\}$ to the data using weights $w_i$
b) Compute the weighted error
$$\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i \, I\left(y_i \neq G_m(x_i)\right)}{\sum_{i=1}^{N} w_i}$$
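A sketch of the full loop in Python. Steps (c) and (d) below follow the standard discrete AdaBoost formulation (classifier weight $\alpha_m = \log\left((1 - \mathrm{err}_m)/\mathrm{err}_m\right)$ and up-weighting of misclassified points); the stump depth and the default M = 50 are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """Discrete AdaBoost; labels y are assumed to be in {-1, +1}."""
    N = len(X)
    w = np.full(N, 1.0 / N)                      # 1. initialize weights w_i = 1/N
    models, alphas = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # a) fit G_m using weights w_i
        pred = stump.predict(X)
        miss = (pred != y)
        err_m = np.sum(w * miss) / np.sum(w)     # b) weighted error err_m
        alpha_m = np.log((1.0 - err_m) / err_m)  # c) classifier weight alpha_m
        w = w * np.exp(alpha_m * miss)           # d) up-weight misclassified points
        models.append(stump)
        alphas.append(alpha_m)
    return models, alphas
```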
The AdaBoost Model
$$\hat{y} = \mathrm{sgn}\left[\sum_{m=1}^{M} \alpha_m G_m(x)\right]$$
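Prediction is a weighted vote of the fitted classifiers; a short sketch continuing the hypothetical adaboost_fit above:

```python
import numpy as np

def adaboost_predict(models, alphas, X):
    # y_hat = sgn( sum_m alpha_m * G_m(x) )
    scores = sum(a * G.predict(X) for a, G in zip(alphas, models))
    return np.sign(scores)
```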
The Updates in Boosting
[Figure: two panels plotted as functions of err_m (from 0 to 1), showing how the boosting weight update w · exp(α_m) behaves as the weighted error err_m varies.]
Boosting Characteristics
Simulated data: test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rates for a single stump and for a 400-node tree.
Loss Functions for $y \in \{-1, +1\}$, $f \in \mathbb{R}$
• Misclassification: $I(\mathrm{sgn}(f) \neq y)$
• Exponential (Boosting): $\exp(-yf)$
• Binomial Deviance (Cross Entropy): $\log\left(1 + \exp(-2yf)\right)$
• Squared Error: $(y - f)^2$
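These four losses are trivial to evaluate; a small vectorized sketch in Python (function names are illustrative):

```python
import numpy as np

def misclassification_loss(y, f):
    return (np.sign(f) != y).astype(float)

def exponential_loss(y, f):        # the loss minimized by AdaBoost
    return np.exp(-y * f)

def binomial_deviance(y, f):       # cross-entropy on the {-1, +1} label scale
    return np.log(1.0 + np.exp(-2.0 * y * f))

def squared_error(y, f):
    return (y - f) ** 2
```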
Other Variations of Boosting
• Gradient Boosting
– Can use any differentiable cost function
• Stochastic (Gradient) Boosting
– Each round fits the weak learner to a bootstrap sample: uniform random sampling (with replacement)
– Often outperforms the non-random version
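A minimal sketch of stochastic gradient boosting for squared-error loss, following the bullets above (a bootstrap sample drawn with replacement each round); the tree depth, the learning rate nu, and the function names are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stochastic_gradient_boost(X, y, M=100, nu=0.1, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(X)
    init = y.mean()                       # initial constant model
    F = np.full(n, init)                  # current ensemble prediction
    trees = []
    for m in range(M):
        residual = y - F                  # negative gradient of squared-error loss
        idx = rng.integers(0, n, size=n)  # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X[idx], residual[idx])
        F = F + nu * tree.predict(X)      # shrunken update on all points
        trees.append(tree)
    return init, trees, nu

def gradient_boost_predict(init, trees, nu, X):
    return init + nu * sum(t.predict(X) for t in trees)
```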
Gradient Boosting
Boosting Summary
• Good points
– Fast learning
– Capable of learning any function (given an appropriate weak learner)
– Feature weighting
– Very little parameter tuning
• Bad points
– Can overfit data
– Only for binary classification
• Learning parameters (picked via cross validation)
– Size of tree
– When to stop
• Software
– http://www-stat.stanford.edu/~jhf/R-MART.html
PCA
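The PCA slides themselves are figures; as a reminder of the mechanics, here is a minimal sketch of PCA via the SVD of the centered data matrix (the function name and the choice of k are illustrative):

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                   # top-k principal directions
    scores = X_centered @ components.T    # coordinates in the reduced space
    explained_variance = (S[:k] ** 2) / (len(X) - 1)
    return scores, components, explained_variance
```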
Sentiment Analysis
• Text Preprocessing:
– Text cleaning removes irrelevant information from the text data, such as special characters, punctuation, and stop words.
• Tokenization:
– The text is divided into individual words or tokens to facilitate analysis.
• Feature Extraction:
– Relevant features are extracted from the text, such as words, n-grams, or even parts of speech.
• Sentiment Classification:
– Machine learning algorithms or pre-trained models classify the sentiment of each text instance, either through supervised learning on labeled data or with pre-trained models that have learned sentiment patterns from large datasets.
• Post-processing:
– The sentiment analysis results may undergo additional processing, such as aggregating sentiment scores or applying threshold rules to classify sentiments as positive, negative, or neutral.
• Evaluation:
– The performance of the sentiment analysis model is assessed with evaluation metrics such as accuracy, precision, recall, or F1 score.
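A compact end-to-end sketch of this pipeline in Python using scikit-learn (the toy texts and labels are hypothetical, and TfidfVectorizer stands in for the cleaning, tokenization, and feature-extraction steps):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical labeled data: 1 = positive, 0 = negative
texts = ["great product, works perfectly", "terrible, broke after a day",
         "absolutely love it", "waste of money"]
labels = [1, 0, 1, 0]

# Preprocessing + tokenization + feature extraction: lowercase, drop English
# stop words, and build unigram/bigram TF-IDF features
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# Sentiment classification via supervised learning
clf = LogisticRegression()
clf.fit(X, labels)

# Evaluation (on the training data here, only to illustrate the metrics)
pred = clf.predict(X)
print(classification_report(labels, pred))
```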
Types of Sentiment Analysis