This document describes a system for performing sentiment analysis on tweets in multiple languages using big data technologies. The system includes a tweet mining module that collects tweets from Twitter's API and via web scraping. It stores the tweets in Apache Hadoop's HBase database. A sentiment analysis module uses a Naive Bayes classifier with lexicon and part-of-speech features to classify tweets in several languages as positive, negative or neutral. MapReduce programs integrate the classifier into the big data infrastructure to analyze millions of tweets quickly. Evaluation results show the classifier performs well compared to other methods and the system scales to handle large volumes of tweets efficiently using Hadoop clusters.


Rodrigo Martínez-Castaño

Juan C. Pichel
Pablo Gamallo

Sentiment Analysis on Multilingual Tweets using Big Data Technologies
Centro Singular de Investigación en Tecnoloxías da Información
Universidade de Santiago de Compostela
Index
1. Introduction
2. Background & Related Work
3. Architecture of the System
4. Tweet Mining Module
5. Sentiment Analysis Module
6. Performance Results
7. Conclusions
8. Evolution of the System

Centro de Investigación en Tecnoloxías da Información (CiTIUS) 2


Introduction 1/2
Sentiment Analysis
• Consists of identifying the opinion expressed in a text
• Twitter is a large source of short texts
• Useful conclusions can be drawn from huge amounts of text
Analysing tweets is a big challenge
• Human subjectivity
• Tweets are too short to be analysed linguistically

Introduction 2/2
Parallel architecture using Big Data
• Standard solutions cannot handle GBs or TBs of text in a reasonable time
• Apache Hadoop cluster with HBase
Goals
• The sentiment classifier should perform as well as other state-of-the-art classifiers in different languages
• Millions of tweets should be processed in a short time

Background & Related Work 1/2
Big Data processing

MapReduce programming model
• Two phases: map and reduce
• Inputs and outputs are key-value pairs
Apache Hadoop
• HDFS (filesystem)
• YARN (resource-management platform)
• MapReduce Framework
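
The two phases and the key-value contract listed above can be illustrated with a minimal, in-memory sketch. It is plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_word_count`) and the word-count job are illustrative assumptions, chosen only to show how key-value pairs flow from mappers to reducers.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input (key, value) pair and stream its output pairs."""
    for key, value in records:
        yield from mapper(key, value)


def shuffle(pairs):
    """Group intermediate values by key, as a framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values; collect the output pairs."""
    output = {}
    for key, values in groups.items():
        for out_key, out_value in reducer(key, values):
            output[out_key] = out_value
    return output


def word_count_mapper(_line_number, line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reducer(word, counts):
    """Emit (word, total) for one word."""
    yield word, sum(counts)


def run_word_count(lines):
    """Run the toy job: input keys are line numbers, input values are lines."""
    records = enumerate(lines)
    intermediate = map_phase(records, word_count_mapper)
    return reduce_phase(shuffle(intermediate), word_count_reducer)
```

In a real Hadoop job the shuffle step is performed by the framework across the cluster; here it is a single dictionary so the data flow stays visible.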

Background & Related Work 2/2
Sentiment analysis

• Machine learning
  - Learning algorithms over a known dataset
  - Training features such as bag of words or PoS tags
• Lexicon-based
  - Polarity lexicons (dictionaries)
• Main strategy: machine learning + polarity lexicons + shallow syntactic information to detect polarity shifters

Architecture of the System

Tweet Mining Module
Streaming API consumer
• Consumes a sample stream (around 1% of the tweets)
• A sample of this size is not enough

Web scraper
• Acquires tweets from the Twitter web interface
• Multi-threaded
• Loops over queries built from a term list

Storage in Apache HBase
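
The multi-threaded, term-driven loop of the web scraper can be sketched as follows. This is only an illustration under stated assumptions: `fetch` is a hypothetical callable standing in for whatever retrieves tweets for one term (no real Twitter endpoint is used), the round-robin split of terms is my choice, and the results are kept in a dictionary rather than HBase. Keying by tweet identifier is what keeps only unique tweets.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_terms(terms, n_workers):
    """Round-robin split of the term list into at most n_workers non-empty chunks."""
    chunks = [terms[i::n_workers] for i in range(n_workers)]
    return [chunk for chunk in chunks if chunk]


def mine_once(terms, fetch, n_workers=4):
    """Run one pass over the term list.

    fetch(term) is a hypothetical callable returning an iterable of
    (tweet_id, text) pairs for that query term.
    """
    def worker(chunk):
        # Each thread loops over its own share of the terms.
        found = {}
        for term in chunk:
            for tweet_id, text in fetch(term):
                found[tweet_id] = text
        return found

    unique = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for found in pool.map(worker, partition_terms(terms, n_workers)):
            unique.update(found)  # duplicates across threads collapse by tweet_id
    return unique
```

A continuously running miner would call `mine_once` in a loop and write each new tweet to storage instead of returning a dictionary.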


Sentiment Analysis Module 1/4
CitiusSentiment

Naive Bayes classifier
• Optimal time performance
• Reasonable accuracy
• Assumes independence among linguistic features
Multilingual
• Spanish, English, Portuguese and Galician
Lexicon-based features
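
To make the independence assumption concrete, here is a minimal Naive Bayes sketch over token features. It is not the CitiusSentiment implementation: the multinomial variant, the add-one (Laplace) smoothing and the names `train_naive_bayes` and `predict` are assumptions made for illustration. Because features are treated as independent, the score of a label is simply the log prior plus the sum of per-token log likelihoods.

```python
import math
from collections import Counter


def train_naive_bayes(examples):
    """Train on a list of (tokens, label) pairs; the model is a tuple of plain counters."""
    doc_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in doc_counts}
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {token for tokens, _ in examples for token in tokens}
    return doc_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log posterior for the given tokens."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for token in tokens:
            if token in vocab:  # tokens never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
                score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training and prediction are both linear in the number of tokens, which fits the time-performance argument made on this slide.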

Sentiment Analysis Module 2/4
Strategy & features
• The annotated corpus only contains positive and negative examples of tweets
• A tweet is considered neutral if it does not contain any word in the polarity lexicon
• Precision higher than 80%
• Pre-processing (URLs, hashtags, emoticons, etc.)
• Considered features
  - Lemmas
  - Multiwords
  - Polarity lexicons (10,850 entries for English, 4,564 for Spanish)
  - Valence shifters
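
The pre-processing step and the neutral rule above can be sketched together. Everything specific here is an assumption for illustration: the regular expressions, the tiny emoticon table, the placeholder tokens `emo_pos` and `emo_neg`, and the `polar_classifier` callable that stands in for the trained positive/negative classifier. The source only states that URLs, hashtags and emoticons are handled and that a tweet with no lexicon word is neutral.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
# Hypothetical emoticon table; a real system would use a much larger one.
EMOTICONS = {":)": "emo_pos", ":-)": "emo_pos", ":(": "emo_neg", ":-(": "emo_neg"}


def preprocess(text):
    """Drop URLs, keep hashtag words without '#', map emoticons to tokens, lowercase."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)
    for emoticon, token in EMOTICONS.items():
        text = text.replace(emoticon, f" {token} ")
    return re.findall(r"\w+", text.lower())


def classify(text, lexicon, polar_classifier):
    """Return 0 (neutral) when no token is in the polarity lexicon.

    Otherwise delegate to polar_classifier(tokens), a stand-in for the
    positive/negative classifier, which returns 1 or -1.
    """
    tokens = preprocess(text)
    if not any(token in lexicon for token in tokens):
        return 0
    return polar_classifier(tokens)
```

This split matches the slide: the statistical classifier only ever chooses between positive and negative, and neutrality is decided beforehand by the lexicon lookup.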

Sentiment Analysis Module 3/4
Integration into a Big Data infrastructure

Perldoop is used to translate the classifier from Perl to Java

MapReduce application
• Mappers
  - Every tweet to be processed must match the query terms
  - If so, the text is processed through the classifier modules
  - Two key-value pairs are emitted
    - One to increment the counter of successfully processed tweets
    - One to indicate the polarity of the processed tweet (-1, 0 or 1)
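
The mapper logic described above might look like the following sketch. It is plain Python rather than the Java MapReduce code of the actual system, and several details are assumptions: the key names `"processed"` and `"polarity"`, matching on any (rather than all) query terms, whitespace tokenization, and the `classify` callable that stands in for the classifier modules.

```python
def make_mapper(query_terms, classify):
    """Build a mapper that only handles tweets containing at least one query term.

    classify(text) is a stand-in for the classifier and must return -1, 0 or 1.
    """
    terms = {term.lower() for term in query_terms}

    def mapper(tweet_id, text):
        words = set(text.lower().split())
        if not (terms & words):
            return  # the tweet does not match the query: nothing is emitted
        polarity = classify(text)
        yield "processed", 1         # increments the processed-tweets counter
        yield "polarity", polarity   # carries the polarity of this tweet

    return mapper
```

For example, `list(make_mapper(["gobierno"], my_classifier)(1, "El gobierno anuncia medidas"))` yields one pair per key, while a tweet without the term yields an empty list.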

Sentiment Analysis Module 4/4
Integration into a Big Data infrastructure
• Reducer
  - Computes the total number of processed tweets
  - Sums the polarity scores
  - Calculates the positivity ratio

• Positivity ratio
  - A normalized value between 0 and 1
    - Negative: < 0.45
    - Positive: > 0.55
  - Contrasts the positive tweets with the negative ones

\[ \text{positivity ratio} = \frac{\dfrac{\sum \text{polarities}}{\text{No. of tweets}} + 1}{2} \]
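
The reducer steps and the positivity ratio can be checked with a short sketch. The key names `"processed"` and `"polarity"` and the `"neutral"` label for ratios between the two thresholds are assumptions; the slide only gives the thresholds for negative and positive. Handling both keys in one function is also a simplification of how a real reducer receives its input.

```python
def reduce_pairs(pairs):
    """Aggregate (key, value) pairs into (processed, score, positivity_ratio)."""
    processed = 0
    score = 0
    for key, value in pairs:
        if key == "processed":
            processed += value
        elif key == "polarity":
            score += value
    if processed == 0:
        return 0, 0, None  # no tweets matched, so the ratio is undefined
    # Mean polarity lies in [-1, 1]; shifting and halving maps it to [0, 1].
    ratio = (score / processed + 1) / 2
    return processed, score, ratio


def label(ratio, low=0.45, high=0.55):
    """Apply the thresholds from the slide; the middle band is assumed neutral."""
    if ratio < low:
        return "negative"
    if ratio > high:
        return "positive"
    return "neutral"
```

With polarities 1, 1, -1 and 0, the score is 1 over 4 tweets, so the ratio is (0.25 + 1) / 2 = 0.625, which falls in the positive range.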
Web interface

Performance results 1/5
Tweet Mining Module (TMM) evaluation

Average number of unique collected tweets per second, filtered by language.

Lists of terms
• 5,000 most frequent words in Spanish
• 5,000 most frequent words in English
Terms distributed over 72 threads
Much higher performance with the full TMM system

Performance results 2/5
Sentiment analysis evaluation

F-score and ranking of our system (CitiusSentiment) in the TASS competition: sentiment analysis on Spanish tweets.

F-score and ranking of our system (CitiusSentiment) in the SemEval competition, namely Task 9, focused on sentiment analysis in Twitter (English microtexts only).

Performance results 3/5
Evaluation of the Big Data infrastructure 1/3

Successfully processed tweets, matches and positivity ratio for the selected Spanish terms.

Processing time (in minutes) for the selected Spanish terms.

Performance results 4/5
Evaluation of the Big Data infrastructure 2/3

Figure: System speedup with respect to the number of HBase regions (1, 2, 4, 8, 16 and 32) for the terms corrupción, gobierno and elecciones.
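
For reference, speedup is conventionally defined as the ratio between the baseline time and the time of the scaled configuration. Assuming the baseline is the table with a single HBase region (the slides do not state the baseline explicitly), and writing $T(n)$ for the processing time with $n$ regions, a symbol introduced here for illustration:

```latex
\[ S(n) = \frac{T(1)}{T(n)} \]
```

Under this definition, $S(n) = n$ would mean ideal linear scaling, and values below $n$ reflect overhead.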

Performance results 5/5
Evaluation of the Big Data infrastructure 3/3

• 68 popular Spanish terms selected as targets for the TMM
• Manual splits of the original HBase table, which had a single region
• 50 GiB of RAM per node for YARN containers (5 nodes)
• The system scales well when there are enough physical resources to handle the launched tasks
• A high number of splits combined with a low number of matches causes overhead

Conclusions
• Twitter is a large source of short texts with opinions
• Performing sentiment analysis on tweets is challenging
• Our classifier performs above average in two competitions
• Big Data technologies help speed up the sentiment analysis process
• The Tweet Mining Module increases the number of captured tweets

Evolution of the System
A real-time system
• Apache Storm for real-time processing
• Inter-module RAM-based buffers for faster I/O
• Apache Spark for real-time queries on the polarized tweets
• RESTful API & web interface
  - Exploration of popular terms
  - Custom real-time queries
  - Results shown as charts, divided by custom time intervals

Thank you!

Rodrigo Martínez-Castaño: [email protected]


Juan C. Pichel: [email protected]
Pablo Gamallo: [email protected]
