GitHub Training
Juan C. Pichel
Pablo Gamallo
Sentiment Analysis on Multilingual Tweets using Big Data Technologies
Centro Singular de Investigación en Tecnoloxías da Información
Universidade de Santiago de Compostela
Index
1. Introduction
2. Background & Related Work
3. Architecture of the System
4. Tweet Mining Module
5. Sentiment Analysis Module
6. Performance Results
7. Conclusions
8. Evolution of the System
Machine learning
Learning algorithms over a known dataset
Training features such as bag of words or PoS tags
Lexicon-based
Polarity lexicons (dictionaries)
Main strategy: machine learning + polarity lexicons + shallow syntactic information to detect polarity shifters
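The lexicon and polarity-shifter parts of this strategy can be illustrated with a minimal sketch. It is not the system's classifier: the machine-learning component is omitted, and the tiny `LEXICON`, the `SHIFTERS` set and the function name `lexicon_polarity` are all hypothetical, chosen only to show how a shifter such as "not" can invert the polarity of the next sentiment word.

```python
# Hypothetical toy lexicon; a real polarity lexicon would be far larger.
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "awful": -1, "sad": -1}
# Hypothetical list of polarity shifters (negation words).
SHIFTERS = {"not", "no", "never"}


def lexicon_polarity(text: str) -> int:
    """Return -1, 0 or 1 by summing word polarities, flipping after a shifter."""
    score = 0
    flip = False  # True while a shifter is waiting for its sentiment word
    for token in text.lower().split():
        word = token.strip(".,!?;:")
        if word in SHIFTERS:
            flip = True
            continue
        value = LEXICON.get(word, 0)
        if value:
            score += -value if flip else value
            flip = False  # the shifter is consumed by this sentiment word
    # Collapse the summed score to its sign.
    return (score > 0) - (score < 0)
```

In this sketch "not good" is scored as negative and "not bad" as positive, while text without lexicon words stays neutral.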
Web scraper
Acquires tweets from the Twitter web interface
Multi-thread
Loop queries based on a term list
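A multi-threaded query loop over a term list might look like the sketch below. It makes several assumptions: `fetch_tweets` is a hypothetical stand-in that returns fake data rather than contacting Twitter, the round-robin split of terms across threads is one possible distribution, and `rounds` is an invented parameter that bounds the loop so the example terminates.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_tweets(term: str) -> list[str]:
    """Hypothetical stand-in for the scraper request; returns fake tweets."""
    return [f"example tweet about {term}"]


def query_loop(terms: list[str], rounds: int) -> list[str]:
    """Query every term in order, repeating the whole list `rounds` times."""
    collected = []
    for _ in range(rounds):
        for term in terms:
            collected.extend(fetch_tweets(term))
    return collected


def scrape(terms: list[str], n_threads: int, rounds: int = 1) -> list[str]:
    """Split the term list across threads and gather what each one collects."""
    # Round-robin split so each thread owns a disjoint slice of the term list.
    slices = [terms[i::n_threads] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = pool.map(lambda s: query_loop(s, rounds), slices)
        return [tweet for batch in results for tweet in batch]
```

Threads suit this kind of work because each query spends most of its time waiting on the network rather than computing.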
MapReduce application
Mappers
- Every tweet to be processed must match the query terms
- If so, the text is processed through the classifier modules
- Two key-value pairs are emitted
  - One to increment the counter of successfully processed tweets
  - One to record the polarity of the processed tweet (-1, 0 or 1)
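The mapper behaviour described above can be sketched in plain Python, without any MapReduce framework. The key layout `(term, "count")` / `(term, "polarity")`, the keyword-based `classify` stub and the summing `reducer` are assumptions made for illustration; the slides only state that two key-value pairs are emitted per processed tweet.

```python
from collections import defaultdict


def classify(text: str) -> int:
    """Hypothetical stand-in for the classifier modules; returns -1, 0 or 1."""
    lowered = text.lower()
    if "good" in lowered:
        return 1
    if "bad" in lowered:
        return -1
    return 0


def mapper(tweet: str, terms: list[str]):
    """Yield two key-value pairs for each query term the tweet matches."""
    lowered = tweet.lower()
    for term in terms:
        if term in lowered:
            # Pair 1: increments the counter of successfully processed tweets.
            yield (term, "count"), 1
            # Pair 2: records the polarity of the processed tweet.
            yield (term, "polarity"), classify(tweet)


def reducer(pairs):
    """Assumed reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

Tweets that match no query term emit nothing, so they contribute to neither the counter nor the polarity sum.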
Positivity ratio
A normalized value between 0 and 1
- Negative < 0.45
- Positive > 0.55
Contrasts the positive tweets with the negative ones
\[
\text{positivity ratio} = \frac{\dfrac{\sum \text{polarities}}{\text{No. of tweets}} + 1}{2}
\]
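As a worked example, the ratio can be read as the mean polarity of the processed tweets (a value in [-1, 1]) shifted and scaled into [0, 1], that is, (sum of polarities / number of tweets + 1) / 2. The sketch below follows that reading. The function names are hypothetical, and the "neutral" label for ratios between the two thresholds is an assumption, since the slides only name the negative and positive bands.

```python
def positivity_ratio(polarities: list[int]) -> float:
    """Map the mean of polarities in {-1, 0, 1} onto the range [0, 1]."""
    if not polarities:
        raise ValueError("at least one processed tweet is required")
    mean_polarity = sum(polarities) / len(polarities)  # in [-1, 1]
    return (mean_polarity + 1) / 2                     # in [0, 1]


def ratio_label(ratio: float) -> str:
    """Apply the thresholds from the slide: < 0.45 negative, > 0.55 positive."""
    if ratio < 0.45:
        return "negative"
    if ratio > 0.55:
        return "positive"
    return "neutral"  # assumed label for the band in between
```

For instance, two positive tweets, one negative and one neutral give a mean polarity of 0.25 and a ratio of 0.625, which falls in the positive band.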
Centro de Investigación en Tecnoloxías da Información (CiTIUS)
Web interface
Lists of terms
5,000 most frequent words in Spanish
5,000 most frequent words in English
Terms distributed over 72 threads
Much higher performance with the full TMM system
F-score and ranking of our system (CitiusSentiment) in SemEval task 9, which focused on sentiment analysis in Twitter (English microtexts only).
Successfully processed tweets, matches and positivity ratio for the selected Spanish terms.
[Figure: chart for the Spanish terms "corrupción", "gobierno" and "elecciones"; axis values 1, 2, 4, 8, 16, 32]