NLP Presentation
Analysis PES1201701420
UE17CS333 Project
Submission
ABOUT THE PROJECT
- The main aim of the project is to develop a sentiment analyzer
that can be used on Twitter data to classify each tweet as
positive or negative.
UE17CS333-PROJECT_2020 2
UNIQUENESS AND ANALYSIS
- We created an aggregated model consisting of all the classifiers
used during the process. The ensemble model worked to our
advantage: as we saw in the previous slides, it provided one of
the highest accuracies compared to the other classifiers.
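The aggregation idea can be illustrated with a minimal majority-vote sketch. It assumes each base classifier has already produced one label per tweet; the function name `majority_vote` and the label strings are hypothetical and not taken from the project code.

```python
from collections import Counter
from typing import Sequence, Tuple


def majority_vote(predictions: Sequence[str]) -> Tuple[str, float]:
    """Return the most frequent label and the share of votes it received.

    `predictions` holds one label per base classifier (assumed input shape).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    votes = Counter(predictions)
    label, count = votes.most_common(1)[0]
    return label, count / len(predictions)


# Seven hypothetical base-classifier outputs for a single tweet.
label, confidence = majority_vote(
    ["pos", "neg", "pos", "pos", "neg", "pos", "neg"]
)
print(label, round(confidence, 3))
```

With an odd number of voters and two labels there are no ties; for other setups a tie-breaking rule would have to be chosen explicitly.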
DATASET SOURCE
- The format of the Tweet column was not useful and had to
be cleaned and tokenized. We also limited the number of
tweets to 40 thousand.
DATASET PREPROCESSING
- Chose the columns required for our study: the tweet and its
associated sentiment.
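A minimal sketch of this step, assuming the dataset is a CSV file: it keeps only the text and label columns and stops after a fixed number of rows. The column names `tweet` and `sentiment`, the function name `load_tweets`, and the sample rows are made up for illustration; the real dataset's headers may differ.

```python
import csv
import io
from itertools import islice
from typing import List, TextIO, Tuple


def load_tweets(
    csv_file: TextIO,
    text_col: str = "tweet",        # hypothetical column name
    label_col: str = "sentiment",   # hypothetical column name
    limit: int = 40_000,
) -> List[Tuple[str, str]]:
    """Read (tweet, sentiment) pairs, ignoring every other column."""
    reader = csv.DictReader(csv_file)
    pairs = ((row[text_col], row[label_col]) for row in reader)
    return list(islice(pairs, limit))


# Made-up sample data standing in for the real file.
sample = io.StringIO(
    "id,tweet,sentiment,date\n"
    "1,movie mast thi,positive,2020-01-01\n"
    "2,bahut boring,negative,2020-01-02\n"
)
print(load_tweets(sample, limit=1))
```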
DATASET PREPROCESSING
- Removal of numbers, URLs, HTML tags, symbols, and mentions (the
“@” symbol followed by the account handle).
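The cleaning rules listed here could be sketched with regular expressions as follows. This is an illustrative approximation, not the project's `start(text)` module: the patterns and the name `clean_tweet` are assumptions, and it assumes the tweets are written in Roman script, since every non-ASCII letter is dropped.

```python
import html
import re
from typing import List

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^A-Za-z\s]")  # digits, punctuation, other symbols


def clean_tweet(text: str) -> List[str]:
    """Strip URLs, HTML tags, mentions, numbers and symbols, then tokenize."""
    text = html.unescape(text)           # turn entities such as "&amp;" into "&"
    text = URL_RE.sub(" ", text)         # URLs first, before "/" and ":" are removed
    text = HTML_TAG_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)     # "@" plus the account handle
    text = NON_LETTER_RE.sub(" ", text)
    return text.lower().split()          # whitespace tokenization


print(clean_tweet("@user123 I <b>love</b> this movie!! 10/10 https://t.co/abc &amp; more"))
```

The order matters: URLs and mentions are removed before the generic symbol filter, otherwise fragments such as `https` or the handle text would survive as tokens.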
LITERATURE REVIEW - TABLE 1
Paper | Title | Authors | Description
Paper 3 | A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection | Aditya Bohra, Deepanshu Vijay, Vinay Singh, Syed S. Akhtar and Manish Shrivastava | Makes use of a system that classifies a tweet containing a combination of Hindi and English as negative or not.
Paper 4 | Resource Creation for Hindi-English Code Mixed Social Media Text | Sakshi Gupta, Piyush Bansal and Radhika Mamidi | Proposes a method to successfully aggregate data to form a dataset of words that have a multilingual characteristic.
Paper 5 | Sentiment classification of Hinglish text | Kumar Ravi and Vadlamani Ravi | Made use of different combinations of feature selection methods and a host of classifiers using term frequency-inverse document frequency feature representation.
LITERATURE REVIEW - TABLE 2
Papers | Accuracy | Benefits | Drawbacks
Paper 1 | 90% | The strategy described here is equally applicable to all Indian languages, as these are verb-ending languages and have a similar mixture of lexicons as in the case of Hindi. | Elaborate testing is not possible as these languages are used in verbal communication.
Paper 2 | 69.7% | Sub-Word LSTM interprets sentiment based on morpheme-like structures, and the results thus produced are significantly better than baselines. | The lexicon lookup approach didn’t perform well owing to the heavily misspelt words in the text, which led to incorrect transliterations.
LITERATURE REVIEW - TABLE 2 (continued)
Papers | Accuracy | Benefits | Drawbacks
Paper 3 | 71.7% | The features used in the classification system are character n-grams, word n-grams, punctuation, negation words and a hate lexicon, which are integrated in the SVM as the classification system. | The corpus was not annotated with part-of-speech tags at word level, which would have yielded better results.
Paper 4 | 89.94% | They used an existing language identification system and improved a normalisation system, achieving a higher accuracy than the base system. | Did not take into consideration the sentence-level context for word disambiguation.
Paper 5 | AUC = 0.8601 | Proposed a triumvirate of TF-IDF, GR, and RBFNN, which was found to be the best combination for classifying sentiment expressed in Hinglish text. | Did not employ a sentence parser for considering the relation between different parts of speech of a sentence.
BLOCK DIAGRAM FOR
IMPLEMENTATION
QUANTITY OF WORK – THE MAIN CODE MODULES
Sl. No. | Code Module Description | Status (% completed) | Comments
1. | func(test_text) | 100% | The master module
2. | hinglish(test_text) | 100% | Takes care of text translation
3. | text_classify(text) | 100% | Classifies text using all 8 models
4. | hybrid(test_set_formatted) | 100% | Builds the hybrid model classifier
5. | features(test_text) | 100% | Filters features from the text
6. | start(text) | 100% | Preprocessing module
QUALITY OF WORK – MILESTONES THAT ARE DONE AND WORKING
Serial no | Milestone description | Status (% complete) | Comments
1. | Dataset Selection | 100% | A better dataset can be used.
2. | Preprocessing | 100% | Cleaning done efficiently.
3. | Feature Selection | 100% | Adjectives, Abstract Nouns, Adverbs
4. | Choice of Classifiers | 100% | 7 Classifiers chosen.
5. | Building Classifiers | 100% | Successfully built
6. | Training Classifiers | 100% | Trained on 85% data.
7. | Creation of Hybrid Model | 100% | Voting Based Ensemble Model.
8. | Translation Challenge | 100% | Google Translate Machine, TextBlob
9. | Creating a controller module | 100% | func module combines all functionality.
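Milestone 6 mentions training on 85% of the data. A minimal sketch of such a split is given below, assuming a random shuffle with the remaining 15% held out for testing; the function name `train_test_split` and the fixed seed are illustrative choices, not details from the project.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    samples: Sequence[T],
    train_fraction: float = 0.85,
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(40_000))
print(len(train), len(test))
```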
RESULTS OBTAINED - Accuracy
Comparison of Accuracies
Classifier Used | Accuracy (%)
SGD | 61.2397
RESULTS OBTAINED - Confusion
Matrix
For Hybrid Model:
RESULTS OBTAINED - F1 Score
Naive Bayes’
Classifier:
RESULTS OBTAINED - F1 Score
Multinomial Naive Bayes’
Classifier:
Logistic Regression
Classifier:
RESULTS OBTAINED - F1 Score
Stochastic Gradient
Descent Classifier:
RESULTS OBTAINED - F1 Score
Maximum Entropy
Classifier:
Hybrid Model:
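For reference, an F1 score is derived from confusion-matrix counts as the harmonic mean of precision and recall. The sketch below shows that calculation for one class; the counts in the usage line are made up and are not the project's results.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for the positive class.
print(f1_score(tp=6, fp=2, fn=2))
```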
OUR TOP THREE LEARNINGS IN
THIS PROJECT
1. We were able to get familiar with the usage and
implementation of different classifiers.
TOP CHALLENGES
UNRESOLVED SO FAR
1. Test accuracy of the models remained around 60%, even after
several efforts to increase it.
OUR GOING FORWARD PLAN
(IF ANY)
1. Find a better dataset to work with.