Youtube Comments Sentiment Analysis 2
which is able to gather useful information from the Twitter website and efficiently perform sentiment analysis of tweets regarding the smartphone war. The system uses an efficient scoring system for predicting the user’s age. The user’s gender is predicted using a well-trained Naïve Bayes classifier. A sentiment classifier model labels each tweet with a sentiment. Krisztian Balog et al. proposed a way to gather useful information from the Twitter website and efficiently perform sentiment analysis of tweets regarding the smartphone war. The system uses an efficient rating system for predicting the user’s age. The paper “Twitter Sentiment Analysis: The Good the Bad and the OMG!” by Efthymios Kouloumpis et al. deals with the utility of linguistic features for detecting the sentiment of Twitter messages. They evaluate the usefulness of existing lexical resources as well as features that capture information about the informal and creative language used in microblogging. Another sentiment analysis of web text was done on blog posts by Gilad Mishne et al.

One of the most prominent works in web page classification was done by Daniele Riboni in the paper “Feature Selection for Web Page Classification”[44]. They conducted various experiments on a corpus of 8000 documents belonging to 10 Yahoo! categories using Kernel Perceptron and Naive Bayes classifiers. Their experiments show the usefulness of dimensionality reduction and of a new structure-oriented weighting technique. They also introduce a new method for representing linked pages using local information that makes hypertext categorization feasible for real-time applications.

Other classification work includes that of Eibe Frank et al.[46] In their paper they propose an appropriate correction by adjusting attribute priors. This correction can be implemented as another data normalization step, and they show that it can significantly improve the area under the ROC curve. They also show that the modified version of MNB is very closely related to the simple centroid-based classifier and compare the two methods empirically.

Another work on the sentiment analysis of social media uses a multimodal approach, discussed in the paper by Diana Maynard et al.[47]. They examine a specific use case, which is to help archivists select material for inclusion in an archive of social media for preserving community memories, moving towards structured preservation around semantic categories. The textual approach they take is rule-based and builds on a number of subcomponents, taking into consideration issues inherent in social media such as noisy ungrammatical text, use of swear words, sarcasm, etc.

[1] Athar, A. (2014). Sentiment analysis of scientific citations (No. UCAM-CL-TR-856). University of Cambridge, Computer Laboratory. The author used NB and SVM classifiers and computed the accuracies of the system using an F-score. The macro F-score reported for uni-grams in that work is 48 percent.

[2] Pang, B., Lee, L., Vaithyanathan, S. (2002, July). Thumbs up? Sentiment classification using machine learning techniques. The authors used labeled data for classification and preferred the supervised learning approach. For the purpose of classification, the Naïve Bayes classifier is used. In this work, they used a dataset of movie reviews.

[3] Sentiment analysis and opinion mining (Liu, 2012): Sentiment analysis and opinion mining is the field of study that analyses people’s opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the most active research areas in natural language processing and is also widely studied in data pre-processing, Web mining, and text mining.

[4] Deep Learning for Hate Speech Detection in Tweets by Pinkesh Badjatiya (IIIT-H), Shashank Gupta (IIIT-H), Manish Gupta (Microsoft), Vasudeva Varma (IIIT-H) (June 1st,
2017): One of the most useful applications of sentiment classification models is the detection of hate speech. Recently, there have been numerous reports of the tough lives of content moderation staff. The authors' experiments on a benchmark dataset of 16K annotated tweets show that such deep learning methods outperform state-of-the-art char/word n-gram methods.

[5] Mehmood, K., Essam, D., Shafi, K. (2018, July). Sentiment Analysis System for Roman Urdu. In Science and Information Conference (pp. 29-42). Springer, Cham. They used their own data set, which is based on Urdu reviews related to movies, politics, mobile, dramas and miscellaneous domains, extracted using scrapers as well as manually. The data set was then classified using different supervised learning classifiers, and the results were compared with each other.

3. METHODOLOGY
This section describes our methodology, which is depicted in Fig. 1. First, we used the annotated dataset. We used the Python-based machine learning library Scikit-Learn to implement the system. Scikit-Learn is a well-known machine learning library that is tightly integrated with the Python language and provides an easy-to-use interface.
Our system first reads the data stored in a file in TSV (Tab Separated Values) format. After reading, a pre-processing phase is applied to clean and prepare the data for use by machine learning algorithms. Text data cannot be given directly to machine learning algorithms; it must be converted into a suitable type. Using the Scikit-Learn class named “CountVectorizer”, the text data is first converted into numeric format as a matrix of token counts.
Now the data is ready for machine learning algorithms. Then 60% of the data is randomly selected to train the classifier, and the remaining 40% is used to test the classifier’s accuracy.
We perform our experiments in two phases. First, we apply only N-gram (length 1-3) features to the data and compute accuracies using the F-score and accuracy score. Second, in order to improve the accuracy scores, we apply other features (stop word and punctuation removal, lemmatization, etc.) along with n-grams and then compute the accuracies again. The latter approach helps to reduce the noise and complexity of the data. Thirty iterations of each experiment were conducted to compute average results, and a total of six experiments were performed. After computing the accuracies of each phase, we select the feature that gives the best result and determine which classifier is better in a specific scenario.

Fig. 1. Step by Step Flow of System Working.
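The first experimental phase described in the methodology (read a TSV file, build a token-count matrix, split 60/40, train, score, and average over thirty runs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy data, the helper names `load_tsv`, `run_once` and `average_scores`, the choice of Multinomial Naive Bayes, the stratified split, and the reading of "length 1-3" as three separate n-gram settings are all assumptions made for the example.

```python
import csv
import io

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the annotated TSV file: comment text <TAB> label.
TOY_TSV = "\n".join([
    "great video loved it\tpositive",
    "really helpful and clear\tpositive",
    "awesome content keep going\tpositive",
    "best tutorial ever\tpositive",
    "loved the clear explanation\tpositive",
    "terrible audio hated it\tnegative",
    "boring and too long\tnegative",
    "worst video ever\tnegative",
    "hated the bad editing\tnegative",
    "too long and really boring\tnegative",
])


def load_tsv(handle):
    """Read (text, label) pairs from a tab-separated file-like object."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if len(row) == 2]
    return [r[0] for r in rows], [r[1] for r in rows]


def run_once(texts, labels, ngram_range, seed):
    """Vectorize, split 60/40, train Naive Bayes, and score one run."""
    # Token-count matrix, built before the split to mirror the order in the text.
    counts = CountVectorizer(ngram_range=ngram_range).fit_transform(texts)
    # Stratification is an assumption that keeps the tiny toy split balanced.
    x_train, x_test, y_train, y_test = train_test_split(
        counts, labels, test_size=0.4, random_state=seed, stratify=labels
    )
    predicted = MultinomialNB().fit(x_train, y_train).predict(x_test)
    accuracy = accuracy_score(y_test, predicted)
    macro_f = f1_score(y_test, predicted, average="macro", zero_division=0)
    return accuracy, macro_f


def average_scores(texts, labels, ngram_range, n_runs=30):
    """Average accuracy and macro F-score over n_runs random splits."""
    runs = [run_once(texts, labels, ngram_range, seed) for seed in range(n_runs)]
    mean_accuracy = sum(a for a, _ in runs) / n_runs
    mean_macro_f = sum(f for _, f in runs) / n_runs
    return mean_accuracy, mean_macro_f


texts, labels = load_tsv(io.StringIO(TOY_TSV))
for n in (1, 2, 3):
    acc, f1 = average_scores(texts, labels, ngram_range=(n, n))
    print(f"{n}-gram: accuracy={acc:.2f}, macro-F={f1:.2f}")
```

On real data the file would be opened from disk instead of `io.StringIO`, and the second phase would add the cleaning steps before vectorization.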
A. Features Selection
For the sake of developing a system for sentiment analysis, machine learning offers different features. We have used various features, e.g. lemmatization, n-grams, stop words and term-document frequency, to evaluate the classifier’s accuracy. The evaluation results are presented later.
B. Lemmatization
3.3 ALGORITHMS USED
This work attempted to utilize six machine learning techniques for the task of sentiment analysis. The modeling of all techniques is briefly discussed below.
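As a rough illustration of how the n-gram and stop-word features and the six classifiers named in this paper (NB, SVM, DT, LR, KNN, RF) could be set up in Scikit-Learn, consider the sketch below. The helper names `build_vectorizer`, `build_classifiers` and `predict_all`, the toy comments, and every hyperparameter are assumptions for the example; the paper does not state its settings. Lemmatization is left out because it needs an external lexicon.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy comments, used only to make the sketch runnable.
TRAIN_TEXTS = [
    "great video loved it",
    "really helpful and clear",
    "best tutorial ever",
    "terrible audio hated it",
    "boring and too long",
    "worst video ever",
]
TRAIN_LABELS = ["positive", "positive", "positive",
                "negative", "negative", "negative"]


def build_vectorizer(n, remove_stop_words=False):
    """Count n-grams of exactly length n, optionally dropping English stop words."""
    stop_words = "english" if remove_stop_words else None
    return CountVectorizer(ngram_range=(n, n), stop_words=stop_words)


def build_classifiers(seed=0):
    """Return the six classifier families; all hyperparameters are assumed."""
    return {
        "NB": MultinomialNB(),
        "SVM": LinearSVC(random_state=seed),
        "DT": DecisionTreeClassifier(random_state=seed),
        "LR": LogisticRegression(max_iter=1000),
        "KNN": KNeighborsClassifier(n_neighbors=3),
        "RF": RandomForestClassifier(n_estimators=50, random_state=seed),
    }


def predict_all(train_texts, train_labels, test_texts, n=1, remove_stop_words=False):
    """Fit every classifier on the same features and collect its predictions."""
    vectorizer = build_vectorizer(n, remove_stop_words)
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)
    return {
        name: list(clf.fit(x_train, train_labels).predict(x_test))
        for name, clf in build_classifiers().items()
    }


for name, predicted in predict_all(
    TRAIN_TEXTS, TRAIN_LABELS, ["loved this great video"]
).items():
    print(name, predicted)
```

Keeping the vectorizer and the classifier dictionary separate makes it straightforward to repeat the same comparison for each n-gram length, with or without the extra cleaning features.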
without other features, while RF performs best, the same as KNN.

The overall discussion shows that uni-gram, bi-gram, and tri-gram features without other features perform best, with uni-grams in first position. Table II shows that, overall, SVM, LR, and RF performed best, with the highest accuracy scores. N-grams play a significant role for NB; SVM gives the best accuracy using uni-grams; LR performance is significant in the case of bi-grams and tri-grams; KNN performs best with n-grams without other features and worst with other features. The overall discussion shows that uni-grams, bi-grams, and tri-grams without other features perform best and give significant accuracy scores.
Table I. F-scores
Figure II. Results of word cloud visualizations on our dataset

5. CONCLUSION
In this research work, we presented a sentiment analysis system for YouTube comments. We have used different machine learning classifiers, namely NB, SVM, DT, LR, KNN and RF, along with different features to process the data and optimize the classification results. Experiments are performed on the data set. The data set is partitioned into training and testing sets in the ratio 80:20. The accuracies of the classifiers are computed using evaluation metrics such as F-score and accuracy score. The results show that SVM performs better than the other classifiers, followed by Naïve Bayes. In the case of the macro average, the SVM classifier performs best on both F-score and accuracy, while the random forest is best in the case of the micro average. Uni-gram, bi-gram, and tri-gram features performed very well and helped the classifiers achieve the highest accuracy scores. We also used the n-grams approach together with the lemmatization process to reduce data dimensions.
In this paper, we have implemented six classifiers: NB, SVM, LR, DT, KNN, and RF. We computed the accuracies using the F-score and accuracy metrics, and used the F-score to compare our results with the base system. Our results showed significant improvement: for example, with Naïve Bayes using the uni-gram feature we achieved micro-F = 87%, while the base system reported micro-F = 78%, so our results are approximately 9 percentage points better. We achieved macro-F = 49% by reducing the data dimensions using the lemmatization process and the stop word removal mechanism. Based on bi-gram and tri-gram features, our system achieved the same result of micro-F = 87%. In our case, the micro-F based on bi-gram and tri-gram features increased by 11%. Overall, we improved our results to some extent and achieved a maximum of (micro-F = 87%, macro-F = 49%), which shows the significant improvement of our work.
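Since the conclusion reports both micro-F and macro-F, a small worked example may help show why the two averages can differ widely on imbalanced labels. The sketch below uses plain Python and invented toy labels (`Y_TRUE`, `Y_PRED` and the helper names are assumptions for the example, not data from this study): micro-F pools the counts over all classes, while macro-F averages the per-class F-scores, so a poorly predicted minority class pulls macro-F down more.

```python
def per_class_counts(y_true, y_pred, label):
    """Return (true positives, false positives, false negatives) for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [f1_from_counts(*per_class_counts(y_true, y_pred, lab)) for lab in labels]
    return sum(scores) / len(scores)


def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = [per_class_counts(y_true, y_pred, lab) for lab in labels]
    tp, fp, fn = (sum(column) for column in zip(*counts))
    return f1_from_counts(tp, fp, fn)


# Invented, imbalanced toy labels: eight negative and two positive comments.
Y_TRUE = ["neg"] * 8 + ["pos", "pos"]
# One of the two positive comments is misclassified as negative.
Y_PRED = ["neg"] * 8 + ["neg", "pos"]

print(f"micro-F = {micro_f1(Y_TRUE, Y_PRED):.3f}")
print(f"macro-F = {macro_f1(Y_TRUE, Y_PRED):.3f}")
```

In this toy case a single error on the minority class leaves micro-F at 0.900 but lowers macro-F to about 0.804.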