Sentiment Analysis Poster
v Text Classification:
[Illustration: accuracy of the two models under three test–train splits (25-75%, 20-80%, 10-90%); y-axis from 0.82 to 0.85. Legend: Dependency model (% accuracy), Bag of Words model (% accuracy).]
Methodology
• The p-value for the difference in the accuracy rates under all three splits is less than 0.00001.
• Since α = 0.05 was chosen, any p-value below 0.05 indicates that the results are statistically significant.
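The significance claim can be sketched as a two-proportion z-test on the accuracy rates. This is illustrative only: the poster does not state the test-set size, so the sample sizes below (and the 0.85 vs 0.82 accuracies, read off the chart) are hypothetical.

```python
import math

def two_proportion_p_value(acc_a, acc_b, n_a, n_b):
    """Two-sided two-proportion z-test for a difference in accuracy rates."""
    # Pooled proportion of correct predictions across both models.
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (acc_a - acc_b) / se
    # Two-sided p-value via the standard normal survival function:
    # 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical test-set sizes; the poster does not state the corpus size.
p = two_proportion_p_value(0.85, 0.82, 10_000, 10_000)
print(p < 0.05)  # → True
```

With hypothetical test sets of 10,000 reviews each, a three-point accuracy gap yields a p-value far below α = 0.05.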
v Traditional Method of Text Representation: Bag of Words (BoW) approach
• What it is: A set of words that is chosen before the text classification. The selection can be made in multiple ways, e.g. it could be the n most frequent words in the entire training corpus.
• How it is used: The words from the text are matched to the existing words (and the sentiments that they denote) in the BoW, and the classifier then gives a prediction of the sentiment.
• Why it is used: Since this approach does not involve the employment of any linguistic structure, it is simple, and this simplicity makes BoW popular.
• Example: Classification of 'The movie was great' (extraction of nouns and adjectives):
BoW = {movie, film, great, horrible, tedious}
Text representation = {movie:1, film:0, great:1, horrible:0, tedious:0}. This information is passed on to the training algorithm, which will be trained to associate individual features (e.g. great:1) with sentiment labels – in this case "positive".

• Steps taken to add the dependency feature and calculate the accuracy score:
v Step 1: Movie review with sentiment → Dependency Parser → movie review representation: dependency parsed. Example: Dependent: 'never'; Governor: 'failed'.
v Step 2: Dependency-parsed movie review representation → Feature Extraction → formation of dependency pairs. Example: dependent + governor: 'never + failed'.
v Step 3: Movie review representation (dependency pairs, individual words) → Scikit-Learn Naïve Bayes Classifier.
v Step 4: Test–Train Split → Classifier → Accuracy rates.

Cross-validation scores (10 folds):
Fold  Dependency model  BoW model
1     0.877             0.843
2     0.842             0.849
3     0.836             0.852
4     0.850             0.853
5     0.857             0.849
6     0.854             0.844
7     0.839             0.858
8     0.838             0.844
9     0.860             0.848
10    0.828             0.845

• The observations under the cross validation are puzzling for the dependency model because the deviations from the mean accuracy score are high.
• The same cross validation for the BoW model has very small deviations in accuracy scores, if any.

v Conclusion:
• Adding the feature of dependencies significantly improved the accuracy rates.
• Further research should look into:
  • Why the cross validation shows anomalous behavior in the case of the dependency model.
  • Tokenizing the corpus for HTML tags and re-running the experiment.
  • The most informative features, to see how the dependency pairs are classified.
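The feature-building part of the pipeline (Steps 1–3) can be sketched as follows. The toy vocabulary and the hard-coded dependency pair are stand-ins for the real corpus and for actual parser output; a real run would feed the resulting dicts to a Naïve Bayes classifier.

```python
# Minimal sketch of Steps 1-3: a combined feature dict built from
# individual words (binary BoW features) plus 'dependent+governor' pairs.
# The dependency pair is hard-coded here; a real pipeline would obtain
# (dependent, governor) pairs from a dependency parser.

BOW = {"movie", "film", "great", "horrible", "tedious", "failed", "never"}

def extract_features(tokens, dependency_pairs):
    """Binary BoW features plus dependency-pair features."""
    features = {word: int(word in tokens) for word in sorted(BOW)}
    for dependent, governor in dependency_pairs:
        features[f"{dependent}+{governor}"] = 1
    return features

tokens = ["the", "movie", "never", "failed", "to", "amaze"]
pairs = [("never", "failed")]   # stand-in for parser output
feats = extract_features(tokens, pairs)
print(feats["never+failed"], feats["movie"], feats["horrible"])  # → 1 1 0
```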
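The "high deviations" observation can be checked directly from the ten fold scores in the cross-validation tables: the sample standard deviation of the dependency model's scores is roughly three times that of the BoW model's, even though the fold means are nearly identical.

```python
from statistics import mean, stdev

# Fold scores copied from the cross-validation tables.
dep = [0.877, 0.842, 0.836, 0.850, 0.857, 0.854, 0.839, 0.838, 0.860, 0.828]
bow = [0.843, 0.849, 0.852, 0.853, 0.849, 0.844, 0.858, 0.844, 0.848, 0.845]

# Means are close, but the fold-to-fold spread differs sharply.
print("dependency:", round(mean(dep), 4), "+/-", round(stdev(dep), 4))
print("bow:       ", round(mean(bow), 4), "+/-", round(stdev(bow), 4))
```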