This document describes a system for performing sentiment analysis on tweets in multiple languages using big data technologies. The system includes a tweet mining module that collects tweets from Twitter's API and via web scraping. It stores the tweets in Apache Hadoop's HBase database. A sentiment analysis module uses a Naive Bayes classifier with lexicon and part-of-speech features to classify tweets in several languages as positive, negative or neutral. MapReduce programs integrate the classifier into the big data infrastructure to analyze millions of tweets quickly. Evaluation results show the classifier performs well compared to other methods and the system scales to handle large volumes of tweets efficiently using Hadoop clusters.

Rodrigo Martínez-Castaño

Juan C. Pichel
Pablo Gamallo

Sentiment Analysis on Multilingual Tweets using Big Data Technologies
Centro Singular de Investigación en Tecnoloxías da Información
Universidade de Santiago de Compostela
Index
1. Introduction
2. Background & Related Work
3. Architecture of the System
4. Tweet Mining Module
5. Sentiment Analysis Module
6. Performance Results
7. Conclusions
8. Evolution of the System

Centro de Investigación en Tecnoloxías da Información (CiTIUS) 2

Introduction 1/2
Sentiment Analysis
• Consists of identifying the opinion expressed in a text
• Twitter is a large source of short texts
• Useful conclusions can be drawn from huge amounts of text
Analysing tweets is a big challenge
• Human subjectivity
• Too short to be linguistically analysed


Introduction 2/2
Parallel architecture using Big Data
• Standard solutions cannot handle GBs or TBs of text in a reasonable time
• Apache Hadoop cluster with HBase
Goals
• The sentiment classifier should perform as well as other state-of-the-art classifiers in different languages
• Millions of tweets should be processed in a short time


Background & Related Work 1/2
Big Data processing

MapReduce programming model
• Two phases: map and reduce
• Inputs and outputs are key-value pairs
Apache Hadoop
• HDFS (distributed filesystem)
• YARN (resource-management platform)
• MapReduce framework
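
The two phases and the key-value flow between them can be illustrated with a small in-memory sketch. This is not Hadoop code: the helper names (`map_phase`, `shuffle`, `reduce_phase`) and the word-count job are assumptions chosen only to show how mappers emit pairs, how pairs are grouped by key, and how a reducer is applied once per key.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator

KeyValue = tuple[str, int]


def map_phase(records: Iterable[str],
              mapper: Callable[[str], Iterator[KeyValue]]) -> list[KeyValue]:
    """Apply the mapper to every input record and collect the emitted pairs."""
    return [pair for record in records for pair in mapper(record)]


def shuffle(pairs: Iterable[KeyValue]) -> dict[str, list[int]]:
    """Group values by key, as a framework does between the two phases."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: dict[str, list[int]],
                 reducer: Callable[[str, list[int]], KeyValue]) -> dict[str, int]:
    """Apply the reducer once per key and collect its output pairs."""
    return dict(reducer(key, values) for key, values in grouped.items())


def word_count_mapper(line: str) -> Iterator[KeyValue]:
    # Emit (word, 1) for every whitespace-separated token.
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word: str, counts: list[int]) -> KeyValue:
    # Sum all the ones emitted for this word.
    return word, sum(counts)


def run_word_count(lines: Iterable[str]) -> dict[str, int]:
    pairs = map_phase(lines, word_count_mapper)
    return reduce_phase(shuffle(pairs), word_count_reducer)
```

In a real cluster the map and reduce calls run in parallel on different nodes; here they run sequentially so the data flow is easy to follow.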


Background & Related Work 2/2
Sentiment analysis
• Machine learning
  - Learning algorithms over a known dataset
  - Training features such as bag of words or PoS tags
• Lexicon-based
  - Polarity lexicons (dictionaries)
• Main strategy: machine learning + polarity lexicons + shallow syntactic information to detect polarity shifters


Architecture of the System


Tweet Mining Module
Streaming API consumer
• Consumes a sample stream (around 1%)
• Not enough on its own

Web scraper
• Acquires tweets from the Twitter web interface
• Multi-threaded
• Runs queries in a loop over a list of terms
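
A minimal sketch of how a term list might be split across worker threads is given below. It is an illustration under assumptions, not the authors' scraper: `fetch` is a hypothetical callable standing in for whatever retrieves tweets for one term, the tweet shape (a dict with an `id` key) is invented, and the round-robin split is only one possible way to distribute terms.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

Tweet = dict  # hypothetical shape: {"id": ..., "text": ...}


def partition_terms(terms: list[str], n_threads: int) -> list[list[str]]:
    """Deal the terms round-robin into one sub-list per thread."""
    return [terms[i::n_threads] for i in range(n_threads)]


def scrape_terms(terms: Iterable[str],
                 fetch: Callable[[str], list[Tweet]]) -> list[Tweet]:
    """Query every term assigned to one worker and return all results."""
    found: list[Tweet] = []
    for term in terms:
        found.extend(fetch(term))
    return found


def scrape_once(terms: list[str], n_threads: int,
                fetch: Callable[[str], list[Tweet]]) -> dict[str, Tweet]:
    """Run one pass over the whole term list, keeping unique tweets by id."""
    unique: dict[str, Tweet] = {}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        batches = pool.map(lambda part: scrape_terms(part, fetch),
                           partition_terms(terms, n_threads))
        for batch in batches:
            for tweet in batch:
                # The same tweet can match several terms; keep the first copy.
                unique.setdefault(tweet["id"], tweet)
    return unique
```

Calling `scrape_once` repeatedly would give the looping behaviour described on the slide; deduplicating by id matters because one tweet can be returned for several terms.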

Storage under Apache HBase

Sentiment Analysis Module 1/4
CitiusSentiment

Naive Bayes classifier
• Optimal time performance
• Reasonable accuracy
• Assumes independence among linguistic features
Multilingual
• Spanish, English, Portuguese and Galician
Lexicon-based features
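
To make the independence assumption concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is a generic textbook formulation, not the CitiusSentiment implementation; the class name, the token-only features and the smoothing choice are assumptions made for illustration.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over token features."""

    def fit(self, documents: list[list[str]], labels: list[str]) -> "NaiveBayes":
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.token_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(documents, labels):
            self.token_counts[label].update(tokens)
        self.vocabulary = {t for counts in self.token_counts.values()
                           for t in counts}
        return self

    def log_score(self, tokens: list[str], label: str) -> float:
        # log P(class) + sum of log P(token | class). Adding the per-token
        # terms is where independence among features is assumed.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.token_counts[label]
        denominator = sum(counts.values()) + len(self.vocabulary)
        for token in tokens:
            if token in self.vocabulary:  # unseen tokens carry no evidence
                score += math.log((counts[token] + 1) / denominator)
        return score

    def predict(self, tokens: list[str]) -> str:
        return max(self.classes, key=lambda c: self.log_score(tokens, c))
```

Training and prediction are both a single pass over the tokens, which is consistent with the time-performance argument made for choosing this classifier.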


Sentiment Analysis Module 2/4
Strategy & features
• The annotated corpus only contains positive and negative examples of tweets
• A tweet is considered neutral if it does not contain any word from the polarity lexicon
• Precision higher than 80%
• Pre-processing (URLs, hashtags, emoticons, etc.)
• Considered features
  - Lemmas
  - Multiwords
  - Polarity lexicons (10,850 entries for English, 4,564 for Spanish)
  - Valence shifters
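
The neutral rule and the pre-processing step can be sketched as follows. The four-word lexicon, the emoticon mapping and the tokenisation rules are toy assumptions for illustration; lemmas, multiwords and valence shifters are deliberately left out.

```python
import re

# Toy polarity lexicon; the real lexicons hold thousands of entries.
TOY_LEXICON = {"good": 1, "happy": 1, "bad": -1, "sad": -1}
# Hypothetical emoticon mapping used only for this illustration.
EMOTICONS = {":)": "happy", ":(": "sad"}

URL_PATTERN = re.compile(r"https?://\S+")


def preprocess(text: str) -> list[str]:
    """Drop URLs, map emoticons to words, strip '#' and punctuation, lowercase."""
    text = URL_PATTERN.sub(" ", text)
    tokens = []
    for raw in text.split():
        if raw in EMOTICONS:
            tokens.append(EMOTICONS[raw])
            continue
        word = raw.lstrip("#").strip(".,;:!?\"'()").lower()
        if word:
            tokens.append(word)
    return tokens


def is_neutral(tokens: list[str], lexicon: dict[str, int] = TOY_LEXICON) -> bool:
    """A tweet is neutral when none of its tokens appears in the lexicon."""
    return not any(token in lexicon for token in tokens)
```

Only tweets for which `is_neutral` is false would need to go on to a positive/negative classifier, which fits a training corpus that has no neutral examples.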


Sentiment Analysis Module 3/4
Integration into a Big Data infrastructure

Perldoop for the translation of the classifier to Java

MapReduce application
• Mappers
  - Every tweet to be processed must match the query terms
  - If so, the text is processed through the classifier modules
  - Two key-value pairs are emitted
    • One to increment the counter of successfully processed tweets
    • One to report the polarity of the processed tweet (-1, 0 or 1)
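
The mapper logic described above could look roughly like the sketch below. It is written in Python for brevity rather than as Hadoop Java code; the key names, the `classify` callable and the choice to treat a tweet as matching when it contains at least one query term are all assumptions.

```python
from typing import Callable, Iterator

COUNT_KEY = "processed"    # hypothetical key names
POLARITY_KEY = "polarity"


def sentiment_mapper(tweet_text: str, query_terms: set[str],
                     classify: Callable[[str], int]) -> Iterator[tuple[str, int]]:
    """Emit two key-value pairs for each tweet that matches the query terms."""
    tokens = set(tweet_text.lower().split())
    if not tokens & query_terms:
        return  # tweets that do not match the query are skipped
    polarity = classify(tweet_text)  # expected to be -1, 0 or 1
    yield COUNT_KEY, 1           # increments the processed-tweets counter
    yield POLARITY_KEY, polarity  # carries the polarity of this tweet
```

Emitting the count and the polarity under separate keys lets the reduce side total each one independently.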


Sentiment Analysis Module 4/4
Integration into a Big Data infrastructure
• Reducer
  - Computes the total number of processed tweets
  - Sums the polarity scores
  - Calculates the positivity ratio
• Positivity ratio
  - A normalized value between 0 and 1
    • Negative: < 0.45
    • Positive: > 0.55
  - Contrasts the positive tweets with the negative ones
\[
\text{positivity ratio} = \frac{\dfrac{\sum \text{polarities}}{\text{No. of tweets}} + 1}{2}
\]
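
As a worked illustration, the ratio can be computed by taking the mean polarity (sum of polarities divided by the number of tweets), adding one and halving the result, which maps the range [-1, 1] onto [0, 1]. The function names are assumptions, and calling the band between the two thresholds "neutral" is an interpretation rather than something the slide states.

```python
def positivity_ratio(polarities: list[int]) -> float:
    """Map the mean polarity from [-1, 1] onto [0, 1]."""
    if not polarities:
        raise ValueError("at least one processed tweet is required")
    mean_polarity = sum(polarities) / len(polarities)
    return (mean_polarity + 1) / 2


def label_ratio(ratio: float) -> str:
    """Apply the 0.45 and 0.55 thresholds; the middle band is assumed neutral."""
    if ratio < 0.45:
        return "negative"
    if ratio > 0.55:
        return "positive"
    return "neutral"
```

For example, polarities of 1, 1, 0 and -1 have a mean of 0.25, giving a ratio of 0.625, which falls above the positive threshold.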
Web interface


Performance results 1/5
Tweet Mining Module evaluation

Average number of unique collected tweets per second, filtered by language.

Lists of terms
• 5,000 most frequent words in Spanish
• 5,000 most frequent words in English
Terms distributed over 72 threads
Much higher performance with the full TMM system


Performance results 2/5
Sentiment analysis evaluation

F-score and ranking of our system (CitiusSentiment) in the TASS competition: Sentiment analysis on Spanish tweets.

F-score and ranking of our system (CitiusSentiment) in the SemEval competition, namely Task 9, focused on sentiment analysis in Twitter (only English microtexts).


Performance results 3/5
Evaluation of the Big Data infrastructure 1/3

Successfully processed tweets, matches and positivity ratio for the selected Spanish terms.

Processing time (in minutes) for the selected Spanish terms.


Performance results 4/5
Evaluation of the Big Data infrastructure 2/3

[Figure: System speedup with respect to the number of HBase regions (1, 2, 4, 8, 16 and 32) for the terms corrupción, gobierno and elecciones.]
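
For reference, speedup in a plot of this kind is conventionally defined as the ratio between the baseline time and the time obtained with more parallelism. Assuming the single-region table is the baseline (the symbols $S$, $T$ and $r$ are introduced here and do not appear in the slides):

```latex
\[
S(r) = \frac{T(1)}{T(r)}
\]
```

where $T(r)$ is the processing time with $r$ HBase regions. Under this definition, $S(r) = r$ would correspond to ideal linear scaling.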


Performance results 5/5
Evaluation of the Big Data infrastructure 3/3

• 68 popular Spanish terms selected as targets for the TMM
• Manual splits of the original HBase table, which had a single region
• 50 GiB of RAM per node for YARN containers (5 nodes)
• The system scales well when there are enough physical resources to handle the launched tasks
• A high number of splits combined with a low number of matches causes overhead


Conclusions
• Twitter is a large source of short texts with opinions
• Performing sentiment analysis on tweets is challenging
• Our classifier performs above average in two competitions
• Big Data technologies help speed up the sentiment analysis process
• The Tweet Mining Module increases the number of captured tweets


Evolution of the System
A real-time system
• Apache Storm for real-time processing
• Inter-module RAM-based buffers for faster I/O
• Apache Spark for real-time queries on the polarized tweets
• RESTful API & web interface
  - Exploration of popular terms
  - Custom real-time queries
  - Results shown as charts, split by custom time intervals


Thank you!

Rodrigo Martínez-Castaño: [email protected]

Juan C. Pichel: [email protected]
Pablo Gamallo: [email protected]
