
NLP Presentation


Bilingual Sentiment Analysis
UE17CS333 Project Submission

Sakshi Goel, PES1201700148
Suhail Rahman, PES1201701420

ABOUT THE PROJECT
- The main aim of the project is to develop a sentiment analyzer
that classifies Twitter data as positive or negative.

- The project handles the challenge of bilingual tweets, where
people write in two languages, in this case Hindi and English,
using the English alphabet.

UE17CS333-PROJECT_2020 2
UNIQUENESS AND ANALYSIS
- We created an aggregated (ensemble) model that combines all the
classifiers used during the process. The ensemble worked to our
advantage: as the accuracy comparison in the results shows, it
achieved one of the highest accuracies among the classifiers.

- When a sentence is in Hindi, we use Google Translate to convert
it directly to English. If the sentence combines Hindi and
English, we use TextBlob to identify that.

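The aggregated model described above can be sketched as a simple majority vote. This is a minimal illustration, not the project's code: the names VotingEnsemble and majority_vote and the three toy classifiers are hypothetical, and each classifier is assumed to be a callable that maps a list of tokens to a label.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label and the fraction of voters that chose it."""
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)


class VotingEnsemble:
    """Hypothetical wrapper that combines several classifiers by voting."""

    def __init__(self, classifiers):
        # Each classifier is assumed to be a callable: tokens -> label.
        self.classifiers = list(classifiers)

    def classify(self, tokens):
        return majority_vote([clf(tokens) for clf in self.classifiers])


# Toy stand-ins for trained models (assumptions, for illustration only).
def always_positive(tokens):
    return "pos"


def keyword_rule(tokens):
    return "neg" if "bad" in tokens else "pos"


def length_rule(tokens):
    return "neg" if len(tokens) > 3 else "pos"


ensemble = VotingEnsemble([always_positive, keyword_rule, length_rule])
label, confidence = ensemble.classify(["bad", "movie"])
print(label, round(confidence, 2))
```

With an odd number of voters and two labels, as with the seven classifiers in this project, a vote cannot end in a tie.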
DATASET SOURCE
- We used the Sentiment140 dataset, obtained from Kaggle.

- It contains 1,600,000 tweets extracted using the Twitter API.
The tweets have been annotated (0 = negative, 4 = positive) and
can be used to detect sentiment.

- The two columns that we mainly need are:
- The Label
- The Tweet

DATASET SOURCE
- The raw text in the Tweet column was not usable as is and had to
be cleaned and tokenized. We also limited the number of tweets to
40 thousand.

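A minimal sketch of selecting the label and tweet columns and limiting the number of tweets is given below. It assumes the usual Sentiment140 CSV layout (target, ids, date, flag, user, text, with no header row). The slides do not say how the 40 thousand tweets were chosen, so the equal cap per label used here is an assumption, as are the function name load_tweets and the sample rows.

```python
import csv
import io

# Sentiment140 annotates tweets with 0 (negative) or 4 (positive).
LABELS = {"0": "negative", "4": "positive"}


def load_tweets(handle, limit=40_000):
    """Keep only (label, tweet) pairs, capped equally per label (assumption)."""
    per_label = limit // len(LABELS)
    counts = {name: 0 for name in LABELS.values()}
    rows = []
    for record in csv.reader(handle):
        if len(record) < 6 or record[0] not in LABELS:
            continue  # skip malformed rows or unexpected labels
        label = LABELS[record[0]]
        if counts[label] >= per_label:
            continue
        counts[label] += 1
        rows.append((label, record[5]))  # column 0 = label, column 5 = tweet
        if all(count >= per_label for count in counts.values()):
            break
    return rows


# Made-up rows in the assumed layout, for illustration only.
SAMPLE = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so sad today"\n'
    '"0","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","missed the bus, again"\n'
    '"4","3","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_c","loving this weather"\n'
)
rows = load_tweets(SAMPLE, limit=2)
print(rows)
```

Capping per label keeps both classes present even if the file happens to be ordered by label.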
DATASET PREPROCESSING
- We kept only the columns relevant to our study: the tweet and
its associated sentiment.

- Emoticons were converted into the emotion they signify, while
emojis were removed.

- We also expanded contractions; for example, “Can’t” was changed
to “Can not”.

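The emoticon and contraction steps above could look roughly like the following sketch. The lookup tables are small hypothetical examples, not the project's actual lists, and the emoji pattern covers only some common Unicode ranges.

```python
import re

# Hypothetical lookup tables; a real run would need much larger ones.
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad", ":D": "laugh"}
CONTRACTIONS = {"can't": "can not", "don't": "do not", "isn't": "is not", "i'm": "i am"}

# Approximate emoji ranges (pictographs and miscellaneous symbols).
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def normalize(text):
    """Map emoticons to emotion words, drop emojis, expand contractions."""
    text = text.replace("\u2019", "'")  # treat curly apostrophes like straight ones
    text = EMOJI.sub(" ", text)
    words = []
    for token in text.split():
        if token in EMOTICONS:
            words.append(EMOTICONS[token])
            continue
        expanded = CONTRACTIONS.get(token.lower())
        if expanded is None:
            words.append(token)
        elif token[0].isupper():
            words.append(expanded.capitalize())
        else:
            words.append(expanded)
    return " ".join(words)


print(normalize("Can\u2019t wait :) \U0001F600"))
```

Tokens are matched whole, so a contraction followed directly by punctuation (for example "can't.") is left unchanged in this sketch.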
DATASET PREPROCESSING
- We removed numbers, URLs, HTML tags, symbols, and mentions (the
“@” symbol followed by the account handle).

- These cleaning steps were important for the study to work
effectively. Finally, the cleaned tweets were converted to
lowercase for simplicity.

- We focused on certain features, namely adjectives, abstract
nouns, and adverbs; the remaining words were removed because
they added little to the sentiment.

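The cleaning and feature-filtering steps listed above can be sketched as follows. The regular expressions and function names are illustrative assumptions. The filter expects tokens that already carry Penn Treebank part-of-speech tags from some tagger; since those tags do not separate abstract nouns from other nouns, keeping every noun tag here is only an approximation.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG = re.compile(r"<[^>]+>")
MENTION = re.compile(r"@\w+")
NON_LETTER = re.compile(r"[^a-z\s]")  # applied after lowercasing


def clean_tweet(text):
    """Strip URLs, HTML tags, mentions, numbers and symbols; lowercase the rest."""
    text = URL.sub(" ", text)
    text = HTML_TAG.sub(" ", text)
    text = MENTION.sub(" ", text)
    text = text.lower()
    text = NON_LETTER.sub(" ", text)
    return " ".join(text.split())


# Penn Treebank prefixes: JJ* adjectives, RB* adverbs, NN* nouns (approximation).
KEPT_TAG_PREFIXES = ("JJ", "RB", "NN")


def keep_sentiment_words(tagged_tokens):
    """Keep words whose tag marks an adjective, adverb, or noun."""
    return [word for word, tag in tagged_tokens if tag.startswith(KEPT_TAG_PREFIXES)]


print(clean_tweet("@user Check <b>THIS</b> out: http://t.co/abc 100% great!!"))
print(keep_sentiment_words([("the", "DT"), ("movie", "NN"), ("was", "VBD"),
                            ("really", "RB"), ("good", "JJ")]))
```

URLs are removed before the symbol pass so that fragments of a link are not left behind as stray words.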
LITERATURE REVIEW - TABLE 1
Columns: Paper, Title, Authors, Methodology Used

Paper 1
Title: Machine translation of bi-lingual Hindi-English (Hinglish) text
Authors: R. Mahesh K. Sinha, Anil Thakur
Methodology Used: Uses a system designed specifically to separate
out the Hindi and English parts of a word that combines the two.

Paper 2
Title: Towards Sub-Word Level Compositions for Sentiment Analysis
of Hindi-English Code Mixed Text
Authors: Aditya Joshi, Ameya Prabhu Pandurang, Manish Shrivastava
and Vasudeva Varma
Methodology Used: Introduces learning of sub-word level
representations in an LSTM (Subword-LSTM) architecture, instead of
character-level or word-level representations.

Paper 3
Title: A Dataset of Hindi-English Code-Mixed Social Media Text for
Hate Speech Detection
Authors: Aditya Bohra, Deepanshu Vijay, Vinay Singh, Syed S. Akhtar
and Manish Shrivastava
Methodology Used: Uses a system that classifies a tweet combining
Hindi and English as negative or not.

Paper 4
Title: Resource Creation for Hindi-English Code Mixed Social Media
Text
Authors: Sakshi Gupta, Piyush Bansal and Radhika Mamidi
Methodology Used: Proposes a method to aggregate data into a
dataset of words that have a multilingual characteristic.

Paper 5
Title: Sentiment classification of Hinglish text
Authors: Kumar Ravi and Vadlamani Ravi
Methodology Used: Used different combinations of feature selection
methods and a host of classifiers, with a term frequency-inverse
document frequency (TF-IDF) feature representation.

LITERATURE REVIEW - TABLE 2
Columns: Paper, Accuracy, Benefits, Drawbacks

Paper 1
Accuracy: 90%
Benefits: The strategy described is equally applicable to all
Indian languages, as these are verb-ending languages and have a
similar mixture of lexicons as in the case of Hindi.
Drawbacks: Elaborate testing is not possible, as these languages
are used in verbal communication.

Paper 2
Accuracy: 69.7%
Benefits: Sub-Word LSTM interprets sentiment based on
morpheme-like structures, and the results produced are
significantly better than the baselines.
Drawbacks: The lexicon lookup approach did not perform well owing
to the heavily misspelt words in the text, which led to incorrect
transliterations.

Paper 3
Accuracy: 71.7%
Benefits: The features used in the classification system are
character n-grams, word n-grams, punctuation, negation words and a
hate lexicon, which are integrated in an SVM as the classification
system.
Drawbacks: The corpus was not annotated with part-of-speech tags
at the word level, which would have yielded better results.

Paper 4
Accuracy: 89.94%
Benefits: They used an existing language identification system and
improved a normalisation system, achieving a higher accuracy than
the base system.
Drawbacks: Did not take into consideration the sentence-level
context for word disambiguation.

Paper 5
Accuracy: AUC = 0.8601
Benefits: Proposed a triumvirate of TF-IDF, GR, and RBFNN, which
was found to be the best combination for classifying sentiment
expressed in Hinglish text.
Drawbacks: Did not employ a sentence parser to consider the
relation between different parts of speech in a sentence.

BLOCK DIAGRAM FOR IMPLEMENTATION
[Figure: block diagram of the implementation; image not available]

QUANTITY OF WORK – THE MAIN CODE MODULES
Sl. No.  Code Module                 Status (% completed)  Comments
1.       func(test_text)             100%                  The master module
2.       hinglish(test_text)         100%                  Takes care of text translation
3.       text_classify(text)         100%                  Classifies text using all 8 models
4.       hybrid(test_set_formatted)  100%                  Builds the hybrid model classifier
5.       features(test_text)         100%                  Filters features from the text
6.       start(text)                 100%                  Preprocessing module

QUALITY OF WORK – MILESTONES THAT ARE DONE AND WORKING
Serial no.  Milestone description          Status (% complete)  Comments
1.          Dataset Selection              100%                 A better dataset can be used.
2.          Preprocessing                  100%                 Cleaning done efficiently.
3.          Feature Selection              100%                 Adjectives, Abstract Nouns, Adverbs
4.          Choice of Classifiers          100%                 7 Classifiers chosen.
5.          Building Classifiers           100%                 Successfully built
6.          Training Classifiers           100%                 Trained on 85% data.
7.          Creation of Hybrid Model       100%                 Voting Based Ensemble Model.
8.          Translation Challenge          100%                 Google Translate, TextBlob
9.          Creating a controller module   100%                 func module combines all functionality.

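The "Trained on 85% data" milestone can be illustrated with a simple hold-out split. This is only a sketch: the slides do not say whether the data was shuffled or which seed was used, so the shuffling, the seed, and the function name split_train_test are assumptions.

```python
import random


def split_train_test(examples, train_fraction=0.85, seed=0):
    """Shuffle a copy of the examples and cut it into train and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# With the 40 thousand tweets mentioned earlier, 85% leaves 6,000 for testing.
train_set, test_set = split_train_test(range(40_000))
print(len(train_set), len(test_set))
```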
RESULTS OBTAINED - Accuracy
[Figure: Comparison of Accuracies, a chart of accuracy by classifier; image not available]

Classifier Used            Accuracy (%)
Naive Bayes                62.0729
Multinomial Naive Bayes    62.2062
Bernoulli Naive Bayes      62.2062
Logistic Regression        62.2562
SGD                        61.2397
SVC Classifier             61.3897
Max Entropy                61.3897
Hybrid Model               62.2563

RESULTS OBTAINED - Confusion Matrix
[Figure: confusion matrix for the Hybrid Model; image not available]

RESULTS OBTAINED - F1 Score
[Figures: F1 score reports for each classifier; images not available]
- Naive Bayes Classifier
- Bernoulli Naive Bayes Classifier
- Multinomial Naive Bayes Classifier
- Logistic Regression Classifier
- Stochastic Gradient Descent Classifier
- Support Vector Machines Classifier
- Maximum Entropy Classifier
- Hybrid Model

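For reference, the metrics named in the results slides (accuracy, confusion matrix, F1 score) can be computed from gold and predicted labels as in the sketch below. The labels and predictions are made-up examples, not the project's outputs, and the function names are hypothetical.

```python
def confusion_counts(gold, predicted, positive="positive"):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif g == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from the confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only.
gold = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]
counts = confusion_counts(gold, predicted)
print(counts)
print(scores(*counts))
```

F1 is the harmonic mean of precision and recall, so it stays low when either one is low, which accuracy alone does not reveal.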
OUR TOP THREE LEARNINGS IN THIS PROJECT
1. We became familiar with the usage and implementation of
different classifiers.

2. We learned which classifiers work well on a given type of
data, along with the advantages and drawbacks of the
classification models used.

3. We got the opportunity to create an ensemble model to give us
the best results we could obtain.

TOP CHALLENGES UNRESOLVED SO FAR
1. Test accuracy of the models stayed around 60%, even after
several efforts to increase it.

2. Two separate modules, instead of one, are used for translation.

3. A better dataset could be used for training.

OUR GOING FORWARD PLAN (IF ANY)
1. Find a better dataset to work with.

2. Try more complex machine learning models for the
classification of text.

3. Use better translation techniques.
