Assignment 3-2

Uploaded by

Austin Azenga

CS1026: Assignment 3 - Sentiment Analysis

Due: November 17th, 2021. 9:00pm.

Weight: 12%

Learning Outcomes:
By completing this assignment, you will gain skills relating to
• Using functions
• Complex data structures
• Text processing
• File input and output
• Exceptions in Python
• Using Python modules
• Testing programs and developing test cases; adhering to specifications
• Writing code that is used by other programs.

Background:
With the emergence of Internet companies such as Google, Facebook, and Twitter, more and more of the
data accessible online consists of text. Textual data, and the computational means of processing it and
extracting information from it, are also increasingly important in areas such as business, humanities,
social sciences, etc. In this assignment, you will deal with textual analysis.

Twitter has become very popular, with many people “tweeting” aspects of their daily lives. This “flow
of tweets” has recently become a way to study or guess how people feel about various aspects of the
world or their own life. For example, analysis of tweets has been used to try to determine how certain
geographical regions may be voting – this is done by analyzing the content (the words and phrases) of
tweets. Similarly, analysis of keywords or phrases in tweets can be used to determine how popular or
unpopular a movie might be. This is often referred to as sentiment analysis.

Task:
In this assignment, you will write a Python module, called sentiment_analysis.py (this is the name of
the file that you should use) and a main program, main.py, that uses the module to analyze Twitter
information. In the module sentiment_analysis.py, you will create a function that will perform simple
sentiment analysis on Twitter data. The Twitter data contains comments from individuals about how
they feel about their lives and comes from individuals across the continental United States. The
objective is to determine which timezone (Eastern, Central, Mountain, Pacific; see below for more
information on how to do this) is the “happiest”. To do this, your program will need to:

• Analyze each individual tweet to determine a score – a “happiness score” – for the individual
tweet.
• The “happiness score” for a single tweet is found by looking for certain keywords (which are
given) in the tweet and totaling the “sentiment values” of the keywords found. In this
assignment, each value is an integer from 1 to 10.
The happiness score for the tweet is simply the sum of the “sentiment values” for keywords
found in the tweet divided by the number of keywords found in the tweet.
If a tweet contains none of the given keywords, it is just ignored, i.e., you do NOT count it.

To determine the words in a tweet, you should do the following:

o Separate a tweet into words based on white space. A “word” is any sequence of characters
surrounded by white space (blank, tab, end of line, etc.).
o You should remove any punctuation from the beginning or end of the word (do NOT worry
about punctuation within a word). So, “#lonely” would become “lonely” and “happy!!”
would become “happy”; but “not-so-happy” is just “not-so-happy”.
o You should convert the “word” into just lower case letters. This gives you a “word” from
the tweet.
o If you match the “word” to any of the sentiment keywords (see below), you add the score of
that sentiment keyword to a total for the tweet; you should just do exact matches. For
example, if the word “hats” is in the tweet and the word “hat” is a sentiment keyword, then
they DO NOT MATCH. Of course, if “hats” is in the list of sentiment keywords, then there
is a match.
o A tweet that has at least 1 matched keyword (exact match) is called a keyword tweet.
• For each region, you should count:
o the number of tweets in that region, and
o the number of “keyword tweets”. [Note: the number of “keyword tweets” is always less than
or equal to the total number of tweets in a region].
• The “happiness score” for a timezone is just the sum of the happiness scores for all the
keyword tweets in the region divided by the number of keyword tweets in that region; again, if
a tweet has NO keywords, then it is NOT to be counted as a “keyword tweet” in that timezone,
i.e., it is just skipped as a “keyword tweet” but counted in the total number of tweets in that
region.
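
The word-processing steps above can be sketched as follows. This is an illustration only, not a required design: the function names clean_word and tweet_happiness are hypothetical, the keywords are assumed to be held in a dictionary, string.punctuation covers ASCII punctuation only, and a keyword that appears twice in a tweet is assumed to count twice.

```python
import string


def clean_word(raw_word):
    """Strip punctuation from both ends of a word and lower-case it."""
    # Punctuation inside the word (e.g. "not-so-happy") is left alone.
    return raw_word.strip(string.punctuation).lower()


def tweet_happiness(text, keywords):
    """Return the happiness score of one tweet, or None if no keyword matches.

    keywords is assumed to be a dict mapping keyword -> sentiment value.
    """
    total = 0
    matches = 0
    for raw_word in text.split():      # split on any white space
        word = clean_word(raw_word)
        if word in keywords:           # exact matches only
            total += keywords[word]
            matches += 1
    if matches == 0:
        return None                    # not a keyword tweet
    return total / matches
```

Returning None for a tweet without keywords lets the caller still count it in the total number of tweets for its region while leaving it out of the keyword-tweet count.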

A file called tweets.txt contains the tweets and a file called keywords.txt contains keywords and
scores for determining the “sentiment” of an individual tweet. These files are described in more detail
below.

File tweets.txt
The file tweets.txt contains the tweets; one per line (some lines are quite long). The format of a tweet
is:
[lat, long] value date time text
where:
• [lat, long] - the latitude and longitude of where the tweet originated. You will need these values
to determine the timezone in which the tweet originated.
• value – not used; this can be skipped.
• date – the date of the tweet; not used, this can be skipped.
• time – the time of day that the tweet was sent; not used, this can be skipped.
• text – the text in the tweet.
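
One possible way to break a line of this format apart is sketched below. The function name parse_tweet_line is hypothetical, and the sample line in the usage note is invented for illustration; it is not taken from tweets.txt. A line that does not follow the format raises a ValueError or IndexError, which the caller would need to decide how to handle.

```python
def parse_tweet_line(line):
    """Split one tweet line into (latitude, longitude, text)."""
    # Everything before the first "]" holds the coordinates.
    coords, rest = line.strip().split("]", 1)
    lat_text, long_text = coords.lstrip("[ ").split(",")
    # rest is "value date time text"; split off the three unused fields.
    fields = rest.split(None, 3)
    text = fields[3] if len(fields) > 3 else ""
    return float(lat_text), float(long_text), text
```

For example, the made-up line "[41.30, -81.92] 6 2011-08-28 19:02:36 Feeling happy today!" would give the latitude 41.3, the longitude -81.92, and the text "Feeling happy today!".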

File keywords.txt
The file keywords.txt contains sentiment keywords and their “happiness scores”; one per line. The
format of a line is:
keyword, value
where:
• keyword - the keyword to look for.
• value – the value of the keyword; values are from 1 to 10, where 1 represents “very unhappy”
and 10 represents “very happy”.
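
As a sketch of how lines in this format might be stored, the keywords can be placed in a dictionary. The data structure is your choice; the dictionary, the function name parse_keywords, and the decision to lower-case the keywords and skip blank lines are assumptions made for this example.

```python
def parse_keywords(lines):
    """Build a {keyword: value} dictionary from 'keyword, value' lines."""
    keywords = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue                          # skip blank lines
        # Split at the last comma so the value is always the final field.
        word, value = line.rsplit(",", 1)
        keywords[word.strip().lower()] = int(value)
    return keywords
```

A dictionary makes the exact-match lookup for each word in a tweet a single membership test.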

Determining timezones across the continental United States
Given a latitude and longitude, the task of determining exactly the location that it corresponds to can be
very challenging given the geographical boundaries of the United States. For this assignment, we
simply approximate the regions corresponding to the timezones by rectangular areas defined by latitude
and longitude points. Our approximation looks like:

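p9          p7          p5          p3          p1
 +-----------+-----------+-----------+-----------+
 |  Pacific  | Mountain  |  Central  |  Eastern  |
 +-----------+-----------+-----------+-----------+
p10         p8          p6          p4          p2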

So the Eastern timezone, for example, is defined by latitude-longitude points p1, p2, p3, and p4. To
determine the origin of a tweet, then, one simply has to determine in which region the latitude and
longitude of the tweet belong. The values of the points are:

p1 = (49.189787, -67.444574)
p2 = (24.660845, -67.444574)
p3 = (49.189787, -87.518395)
p4 = (24.660845, -87.518395)
p5 = (49.189787, -101.998892)
p6 = (24.660845, -101.998892)
p7 = (49.189787, -115.236428)
p8 = (24.660845, -115.236428)
p9 = (49.189787, -125.242264)
p10 = (24.660845, -125.242264)

Note: if the latitude-longitude of a tweet is outside of all these regions, it is to be skipped; if a tweet is
on the border between regions, then choose one of the regions.
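
Because all four rectangles share the same latitude range, the region test reduces to one latitude check and one longitude check per region. The sketch below uses the point values given above; the names MIN_LATITUDE, MAX_LATITUDE, REGIONS, and find_region are hypothetical. A tweet exactly on a border is assigned to the first matching region in the list, which is one acceptable way of "choosing one of the regions".

```python
# Latitude band shared by all four regions (from points p1 and p2).
MIN_LATITUDE = 24.660845
MAX_LATITUDE = 49.189787

# (region name, western longitude, eastern longitude), listed east to west.
REGIONS = [
    ("Eastern", -87.518395, -67.444574),
    ("Central", -101.998892, -87.518395),
    ("Mountain", -115.236428, -101.998892),
    ("Pacific", -125.242264, -115.236428),
]


def find_region(latitude, longitude):
    """Return the region name for a point, or None if it is outside all regions."""
    if not (MIN_LATITUDE <= latitude <= MAX_LATITUDE):
        return None
    for name, west, east in REGIONS:
        if west <= longitude <= east:
            return name
    return None
```

Keeping the boundary values in constants means the comparison code is written once and reused for every region.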

Functional Specifications:
1. Your module sentiment_analysis.py must include a function compute_tweets that has two
parameters. The first parameter will be the name of the file with the tweets and the second
parameter will be the name of the file with the keywords. This function will use these two files to
process the tweets and output the results. This function should also check to make sure that both
files exist and if either does not exist, then your program should generate an exception and the
function compute_tweets should return an empty list (see part 1.c below).

a. The function should input the keywords and their “happiness values” and store them in a data
structure in your program (the data structure is of your choice).

b. Your function should then process the file of tweets, computing the “happiness score” for each
tweet and computing the “happiness score” for each timezone. You will need to read the file of
tweets line by line as text and break it apart. The string processing functions in Python are very
useful for doing this. Your program should not duplicate code. It is important to determine
places that code can be reused and create functions. Your program should ignore tweets from
outside the time zones.

c. Your function, compute_tweets, should return a list of tuples:

I. The list should contain the results in a tuple for each of the regions, in order: Eastern,
Central, Mountain, Pacific.
II. Each tuple should contain three values: (average, count_of_keyword_tweets,
count_of_tweets), where average is the average “happiness value” of that region,
count_of_keyword_tweets is the number of tweets found in that region with keywords
and count_of_tweets is the number of tweets found in that region. These values should
be in the order specified.
III. Note: if there is an exception from a file name that does not exist, then an empty list
should be returned.
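
The shape of the returned list can be illustrated with a small sketch. It assumes each tweet has already been reduced to a (region, score) pair, where score is None for a tweet without keywords and region is None for a tweet outside all regions; the names REGION_ORDER and summarize are hypothetical. The specification does not say what the average should be for a region with no keyword tweets, so the value 0 used here is an assumption.

```python
REGION_ORDER = ["Eastern", "Central", "Mountain", "Pacific"]


def summarize(scored_tweets):
    """Turn (region, score) pairs into the list of result tuples."""
    # Per region: [sum of scores, keyword tweet count, tweet count].
    totals = {name: [0.0, 0, 0] for name in REGION_ORDER}
    for region, score in scored_tweets:
        if region not in totals:
            continue                   # tweet is outside all regions
        stats = totals[region]
        stats[2] += 1                  # every tweet in the region counts
        if score is not None:          # only keyword tweets are averaged
            stats[0] += score
            stats[1] += 1
    results = []
    for name in REGION_ORDER:
        score_sum, keyword_count, tweet_count = totals[name]
        # Assumption: report 0 when a region has no keyword tweets.
        average = score_sum / keyword_count if keyword_count else 0
        results.append((average, keyword_count, tweet_count))
    return results
```

Building the list by looping over REGION_ORDER guarantees the tuples come out in the required order: Eastern, Central, Mountain, Pacific.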

2. Your main program, main.py, will prompt the user for the names of the two files – the file
containing the keywords and the file containing the tweets. It will then call the function
compute_tweets with the two files to process the tweets using the given keywords. Your main
program will get the results from compute_tweets and print the results; it should print the
results in a readable fashion (i.e., not just numbers).
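
One way to make the output readable is to label each tuple with its region name. The sketch below is only a suggestion: the function name format_results and the wording of the messages are hypothetical, and the function expects the list returned by compute_tweets (an empty list when a file could not be opened).

```python
REGION_NAMES = ["Eastern", "Central", "Mountain", "Pacific"]


def format_results(results):
    """Return a printable report for the list returned by compute_tweets."""
    if not results:
        return "No results: one of the files could not be opened."
    lines = []
    for name, (average, keyword_tweets, tweets) in zip(REGION_NAMES, results):
        lines.append(
            f"{name}: average happiness {average:.2f} "
            f"({keyword_tweets} keyword tweets out of {tweets} tweets)"
        )
    return "\n".join(lines)
```

In main.py, the result of compute_tweets would be passed to this function and the returned string printed.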

3. You are also given a program, driver.py, and some test files. The test files are small files of tweets
and keywords that driver.py uses to test your program – that is, it will import your program,
sentiment_analysis.py, and will make use of the function compute_tweets. The files tweets1.txt
and tweets2.txt are small files with tweets and the files key1.txt and key2.txt contain keywords
and “happiness values”. The program driver.py will use these to test your function; these files are
small enough that you can compute the results by hand to test your program. You should use the
program and these files to test your code. Note: while driver.py does some testing, it is your
responsibility to design your own test cases to test it thoroughly.

4. An automated testing program will run a number of test cases against your program.

5. Note: For both files, it is advised that when you read in the files you use one of the following open
statements to avoid encoding errors: open("fileName.txt","r",encoding="utf-8") or
open('fileName.txt', encoding='utf-8', errors='ignore').
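
A minimal sketch of opening a file this way while handling a missing file is shown below. The helper name read_lines is hypothetical; it returns None when the file does not exist, so that compute_tweets can check for that case and return an empty list.

```python
def read_lines(file_name):
    """Return the lines of a text file, or None if the file does not exist."""
    try:
        # errors="ignore" drops bytes that are not valid UTF-8.
        with open(file_name, "r", encoding="utf-8", errors="ignore") as source:
            return source.readlines()
    except FileNotFoundError:
        return None
```

The same helper can be used for both the tweets file and the keywords file, which avoids duplicating the file-handling code.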

Non-functional Specifications:
1. The program should strictly adhere to the input and output requirements and parameters for the
function compute_tweets, particularly the order of the parameters.

2. Include brief comments in your code identifying yourself, describing the program, and describing
key portions of the code.

3. Assignments are to be done individually and must be your own work. Software may be used to
detect academic dishonesty (cheating).

4. Use Python coding conventions and good programming techniques, for example:
• Meaningful variable names
• Conventions for naming variables and constants
• Use of constants where appropriate
• Readability: indentation, white space, consistency

5. You should submit the files main.py and sentiment_analysis.py (others are not required). Make
sure you upload your Python file to your assignment; DO NOT put the code inline in the textbox.

Marking of the Assignment:

1. Your program will be executed by an automated testing program. This testing program
assumes that:
a. The modules are named main.py and sentiment_analysis.py.
b. You are using Python 3.9 and everything executes in PyCharm Edu.
c. You have submitted it via OWL by uploading it.

Failure to adhere to these constraints will likely cause the testing program
to fail. This may require a remarking of your program, which will include
a 20% penalty.

2. Is the program named correctly for testing, i.e., is the module correctly named
sentiment_analysis.py? Is there a function compute_tweets and are the parameters in the
correct order?
• Is there a program main.py which imports and makes use of the module
sentiment_analysis.py?
• Does the program behave according to specifications? Does it work with the test
program, driver.py?
• Is there an effective use of functions beyond compute_tweets?
• Note: A program like driver.py and other test files will be used to test your program as
well.
