ML Sentiment Analysis
Introduction
In this guide, you will learn how to perform dictionary-based sentiment analysis on a corpus
of documents using the programming language Python, with a practical example to illustrate the
process. You are provided with links to the example dataset, and you are encouraged to replicate
this example. An additional practice example is suggested at the end of this guide. This example
assumes that you have the data file stored in the working directory being used by Python.
This example demonstrates how to assess sentiment computationally from a large corpus of
economic news articles. The analysis can help researchers, investors, and policymakers understand
how news coverage portrays the U.S. economy without reading every article; the
sentiment measures can also be used as summary statistics in further quantitative analysis.
This example uses a subset of data from the 2016 Economic News Article Tone dataset
(https://data.world/crowdflower/economic-news-article-tone) released by user CrowdFlower
under the CC0: Public Domain license through the platform data.world. The news articles were
collected from major news outlets, were published between 1951 and 2014, and are about the U.S.
economy. For each article, the creators of this dataset had human judges rate the sentiment of the
article on a 9-point scale (1 = most negative and 9 = most positive); the judges were also asked
how confident they were in their ratings on a scale between 0 and 1. Hence, this dataset
provides the “ground truth” sentiment for each article, which can be compared to the
computational measures.
There are 1,420 rows in the dataset, with each row corresponding to a news article. The dataset
contains five columns:
articleid: article ID
text: text content of each article
date: publication date
positivity: human-rated sentiment (1–9)
positivity.confidence: confidence of the human rating (0–1)
Python is an open-source programming language. Python does not operate with pull-down
menus. Rather, you must submit lines of code that execute functions and operations built into
Python. It is best to save your code in a simple text file that Python users generally refer to as a
script file. We provide a script file with this example that executes all of the operations described
here. If you are not familiar with Python, we suggest you start with the beginner’s guide
located at https://wiki.python.org/moin/BeginnersGuide. While most computer systems come
with a basic Python installation, we recommend installing the distribution made by Anaconda
(https://www.anaconda.com/download/), as it contains many packages that are commonly used.
This software guide uses this distribution and will point out any package used here that is
not included in the Anaconda distribution.
For this example, we need the “nltk” package for tokenizing the documents (see SAGE Research
Methods Dataset on Basics in Text Analysis for tokenization). For the installation of this
package, please visit its official website (https://www.nltk.org/install.html). The Anaconda
distribution of Python should have this package installed already. From this package, we use the
“treebank” tokenizer and the “opinion_lexicon” dictionary of positive and negative words. The
dictionary must first be downloaded:
import nltk
nltk.download('opinion_lexicon')
This dictionary only needs to be downloaded once. After it is downloaded, it and the tokenizer
can be loaded with the following imports (matching the functions used later in this guide) every
time you need them:
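# the opinion lexicon of positive/negative words downloaded above
from nltk.corpus import opinion_lexicon
# the Treebank word tokenizer used to split documents into words
from nltk.tokenize import treebank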
We also need the “ggplot” package for visualization later in this guide. Note that the package
should be installed from a specific GitHub link because that is where the most updated “ggplot”
is hosted; other versions of “ggplot” might not be compatible with the other packages used in
this guide.
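With the package installed, it can be loaded as follows (using the “gg” alias that the plotting
code later in this guide assumes):
import ggplot as gg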
We also need the package “pandas” for data handling. The installation instructions for this
package can be found at its website (https://pandas.pydata.org/pandas-docs/stable/install.html).
If you are using the Anaconda distribution of Python, then the package should be installed
already. With the package installed, we can load it as:
import pandas as pd
To begin with the analysis, we must first load the data into Python. This can be done with the
following code (assuming the data file is already saved in your working directory):
dataset = pd.read_csv('dataset-econnews-2016-subset1.csv')
The dataframe loaded above is a table where each row corresponds to a news article, and the
column “text” contains the content of the articles.
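As an optional check that the file loaded correctly, you can inspect the dimensions of the
dataframe; they should match the 1,420 rows and five columns described above:
print(dataset.shape)  # expected: (1420, 5)
dataset.head()        # preview the first few rows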
First, we generate a set of positive words and a set of negative words from the dictionary
downloaded above, and create the tokenizer that will split each document into words:
# sets of positive and negative words, stored as Python sets for fast lookup
pos_list = set(opinion_lexicon.positive())
neg_list = set(opinion_lexicon.negative())
# tokenizer that splits each document into words
tokenizer = treebank.TreebankWordTokenizer()
Now, we define a function that takes a string as input, tokenizes it, counts the number of positive
and negative words in the string, and calculates their difference as sentiment:
def sentiment(sentence):
    senti = 0
    words = [word.lower() for word in tokenizer.tokenize(sentence)]
    for word in words:
        if word in pos_list:
            senti += 1
        elif word in neg_list:
            senti -= 1
    return senti
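As a quick sanity check, you can apply the function to a short made-up sentence (not from the
dataset); “good” and “great” appear in the positive word list and “bad” in the negative word
list:
print(sentiment('This is good, this is great, but this is bad.'))
# prints 1: two positive words ('good', 'great') minus one negative word ('bad')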
After defining the sentiment function, we apply it to every document (i.e., every entry in the
“text” column):
dataset['sentiment'] = dataset['text'].apply(sentiment)
To evaluate the performance of our approach, we calculate the correlation between our
computational sentiment measure and human ratings as follows:
dataset.loc[dataset['positivity.confidence'] >= 0.8, ['positivity', 'sentiment']].corr()
The entries in the dataframe can be accessed by the “.loc” attribute with square brackets. The
first input in the square brackets, “dataset['positivity.confidence'] >= 0.8”, selects the rows
whose “positivity.confidence” is at least 0.8. We only consider human ratings with a confidence
of at least 0.8 for comparison because ratings with low confidence are noisy and do not provide
a fair evaluation. The second input, “['positivity', 'sentiment']”, selects the two columns. In the
end, the “corr()” function calculates the correlation between the two columns just selected.
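Equivalently, the filtering and the correlation can be done in separate steps (an alternative
sketch using pandas’ Series.corr, not part of the original script):
# keep only the articles whose human rating has confidence of at least 0.8
confident = dataset[dataset['positivity.confidence'] >= 0.8]
# correlate the human ratings with the computational sentiment measure
confident['positivity'].corr(confident['sentiment'])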
Finally, with the computational sentiment measure, we then calculate the average sentiment of
the articles for each day and visualize it:
dataset['date'] = pd.to_datetime(dataset['date'])
gg.ggplot(gg.aes(x='date', y='sentiment'), data=dataset) + \
    gg.stat_smooth(method='loess', span=1/3) + \
    gg.scale_x_date(labels='%Y')
The first line converts the date column, which was read in as strings, to the “datetime” type in
Python, so that other functions will treat it as dates instead of characters. The second statement
plots the daily average sentiment over time. The stat_smooth() function automatically calculates
the daily average sentiment across all articles and fits a smooth curve to the daily averages.
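If you would like to inspect the daily averages directly rather than only through the smoothed
curve, a standard pandas groupby can compute them (an optional sketch, not part of the original
script):
# average computational sentiment of all articles published on each date
daily_avg = dataset.groupby('date')['sentiment'].mean()
daily_avg.head()  # preview the first few dates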
For each command, Python will return its output immediately. Here, we focus on the main
results.
The correlation between our computational sentiment measure and the human judgments of
sentiment is 0.63, which is not perfect but reasonably high.
The change of sentiment over time is shown in Figure 1. The black curve is the smoothed curve
of the daily averages, and the dark gray band around the curve denotes the confidence interval
around the average. Note that there are two abrupt drops in sentiment, one around 1990 and the
other around 2008. The first is probably due to the early 1990s recession and the second to the
2008 financial crisis. It makes sense that the sentiment of the news articles is extremely negative
during these crises. It is also interesting to see that sentiment recovers in the years after each
crisis as the economy improves.
Figure 1: Daily Average Sentiment of the News Articles Published Between 1951 and 2014.
Your Turn
You can download this sample dataset and see whether you can reproduce the results presented
here. Then, try retrieving the 10 most positive and the 10 most negative articles and see what
they are about.
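As a starting point, pandas’ nlargest() and nsmallest() can rank the articles by the
computational sentiment column created above (one possible approach):
# the 10 most positive and the 10 most negative articles
dataset.nlargest(10, 'sentiment')['text']
dataset.nsmallest(10, 'sentiment')['text']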