
How-to Guide for Python

Introduction

In this guide, you will learn how to perform the dictionary-based sentiment analysis on a corpus
of documents using the programming software Python with a practical example to illustrate the
process. You are provided with links to the example dataset, and you are encouraged to replicate
this example. An additional practice example is suggested at the end of this guide. This example
assumes that you have the data file stored in the working directory being used by Python.

Contents

 1.

Dictionary-Based Sentiment Analysis

 2.

An Example in Python: Sentiment of Economic News Articles

o 2.1 The Python Procedure


o 2.2 Exploring the Python Output
 3.

Your Turn

1 Dictionary-Based Sentiment Analysis

Dictionary-based sentiment analysis is a computational approach to measuring the feeling that a
text conveys to the reader. In the simplest case, sentiment is a binary classification, positive or
negative, but it can be extended to multiple dimensions such as fear, sadness, anger, and joy.
This method relies on a pre-defined list (or dictionary) of sentiment-laden words.
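
The core idea can be sketched in a few lines. The two word lists below are tiny illustrative
stand-ins, not a real sentiment lexicon (the example later in this guide uses the full
opinion lexicon from "nltk"):

```python
# Minimal sketch of dictionary-based sentiment scoring.
# POSITIVE and NEGATIVE are tiny illustrative stand-ins for a real lexicon.
POSITIVE = {"good", "great", "gain", "improve"}
NEGATIVE = {"bad", "loss", "decline", "weak"}

def score(text):
    """Count positive words minus negative words in a whitespace-tokenized text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(score("a great gain despite a weak quarter"))  # 2 positive - 1 negative = 1
```

A real analysis replaces the hand-made sets with a published lexicon and a proper tokenizer,
but the positive-minus-negative count is the same.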

2 An Example in Python: Sentiment of Economic News Articles

This example demonstrates how to assess sentiment computationally from a large corpus of
economic news articles. The analysis can help researchers, investors, and government agencies
understand how news articles portray the U.S. economy without reading every one of them; the
sentiment measures can also be used as summary statistics in further quantitative analysis.

This example uses a subset of data from the 2016 Economic News Article Tone dataset
(https://data.world/crowdflower/economic-news-article-tone) released by user CrowdFlower
under the CC0: Public Domain license through the platform data.world. The news articles were
collected from major news outlets, published between 1951 and 2014, and concern the U.S.
economy. For each article, the researchers behind this dataset had a human judge rate the
sentiment of the article on a 9-point scale (1 = most negative and 9 = most positive); the
researchers also asked the judges how confident they were in their ratings on a scale between
0 and 1. Hence, this dataset provides the “ground truth” sentiment for each article, which can
be compared to the computational measures.

There are 1,420 rows in the dataset with each row corresponding to a news article. The dataset
contains five columns:

 articleid: article ID
 text: text content for each article
 date: publication date
 positivity: human-rated sentiment
 positivity.confidence: confidence of human rating

2.1 The Python Procedure

Python is an open-source programming language. Python does not operate with pull-down
menus; rather, you must submit lines of code that execute functions and operations built into
Python. It is best to save your code in a simple text file that Python users generally refer to as a
script file. We provide a script file with this example that executes all of the operations described
here. If you are not familiar with Python, we suggest you start with the beginner's guide
located at https://wiki.python.org/moin/BeginnersGuide. While most computer systems come
with a vanilla Python, we recommend installing the distribution made by Anaconda
(https://www.anaconda.com/download/), as it contains many packages that are commonly used.
This guide uses this distribution and notes how to install any package used here that is not
included in it.

For this example, we need the “nltk” package for tokenizing the documents (see the SAGE
Research Methods Dataset on Basics in Text Analysis for tokenization). For installation
instructions, please visit its official website (https://www.nltk.org/install.html). The Anaconda
distribution of Python should have this package installed already. The particular module needed
from this package is “treebank”, and it can be loaded as:

from nltk.tokenize import treebank

We use a dictionary of sentiment words from Bing Liu and collaborators
(https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), which categorizes words in a
binary fashion into positive and negative categories. This dictionary can be downloaded using
“nltk”:

import nltk
nltk.download('opinion_lexicon')

This dictionary only needs to be downloaded once. After it is downloaded, it can be loaded as
follows whenever you need it:

from nltk.corpus import opinion_lexicon


We also need the package “ggplot” for visualization, which can be installed as follows (the
command is executed in a system terminal, not in Python):

pip install git+https://github.com/yhat/ggpy.git

Note that the package is installed from a specific GitHub link because that is where the most
updated “ggplot” is hosted. Other versions of “ggplot” might not be compatible with the other
packages used in this guide. Once installed, it can be loaded as:

import ggplot as gg

We also need the package “pandas” for data handling. The installation instructions for this
package can be found on its website (https://pandas.pydata.org/pandas-docs/stable/install.html).
If you are using the Anaconda distribution of Python, the package should be installed
already. With the package installed, we can load it as:

import pandas as pd

To begin the analysis, we must first load the data into Python. This can be done with the
following code (assuming the data file is already saved in your working directory):

dataset = pd.read_csv('dataset-econnews-2016-subset1.csv')

The dataframe loaded above is a table where each row corresponds to a news article, and the
column “text” contains the content of the articles.
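
After loading, it is worth a quick sanity check on the frame's shape and columns. The two-row
frame below is a hypothetical stand-in with the same five columns, for readers who want to try
the commands without downloading the file:

```python
import pandas as pd

# Hypothetical two-row stand-in mirroring the columns of the real dataset.
dataset = pd.DataFrame({
    "articleid": [1, 2],
    "text": ["The economy improved.", "Markets fell sharply."],
    "date": ["1/2/2004", "9/30/2008"],
    "positivity": [7, 2],
    "positivity.confidence": [0.9, 0.85],
})

print(dataset.shape)             # number of (rows, columns)
print(dataset.columns.tolist())  # column names
```

On the real data, dataset.shape should report 1,420 rows and 5 columns.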

First, we generate a list of positive words and a list of negative words from the dictionary
downloaded above:

pos_list = set(opinion_lexicon.positive())
neg_list = set(opinion_lexicon.negative())

We then construct a tokenizer for use later:

tokenizer = treebank.TreebankWordTokenizer()

Now, we define a function that takes a string as input, tokenizes it, counts the number of positive
and negative words in the string, and calculates their difference as sentiment:

def sentiment(sentence):
    senti = 0
    words = [word.lower() for word in tokenizer.tokenize(sentence)]
    for word in words:
        if word in pos_list:
            senti += 1
        elif word in neg_list:
            senti -= 1
    return senti

After defining the sentiment function, we apply it to every document (i.e., every entry in the
“text” column):

dataset['sentiment'] = dataset['text'].apply(sentiment)

To evaluate the performance of our approach, we calculate the correlation between our
computational sentiment measure and human ratings as follows:

dataset.loc[dataset['positivity.confidence'] >= 0.8, ['positivity', 'sentiment']].corr()

The entries of the dataframe can be accessed through the “.loc” attribute with square brackets.
The first input in the square brackets, dataset['positivity.confidence'] >= 0.8, selects the rows
whose “positivity.confidence” is at least 0.8. We only consider human ratings with a confidence
of at least 0.8 for comparison because ratings with low confidence are noisy and do not provide
a fair evaluation. The second input, ['positivity', 'sentiment'], selects the two columns. Finally,
the corr() function calculates the correlation between the two columns just selected.
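
The filter-then-correlate pattern can be tried on a small hypothetical frame (the numbers below
are made up for illustration and are not from the real dataset):

```python
import pandas as pd

# Toy frame illustrating row filtering with .loc plus column selection and corr().
df = pd.DataFrame({
    "positivity": [8, 2, 5, 9, 1],
    "sentiment": [3, -2, 1, 4, -3],
    "positivity.confidence": [0.9, 0.95, 0.5, 0.85, 0.9],
})

# Keep only confident ratings, then correlate the two sentiment measures.
subset = df.loc[df["positivity.confidence"] >= 0.8, ["positivity", "sentiment"]]
print(subset.corr())
```

The low-confidence row (0.5) is dropped before the correlation is computed, exactly as in the
command above.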

Finally, with the computational sentiment measure, we then calculate the average sentiment of
the articles for each day and visualize it:

dataset['date'] = pd.to_datetime(dataset['date'])
gg.ggplot(gg.aes(x='date', y='sentiment'), data=dataset) + \
    gg.stat_smooth(method='loess', span=1/3) + \
    gg.scale_x_date(labels='%Y')

The first line converts the date column, which was read in as character strings, to the “datetime”
type so that other functions will treat it as dates instead of text. The second line plots the daily
average sentiment over time. The stat_smooth() function automatically calculates the daily
average sentiment across all articles and fits a smooth curve to the daily averages.
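
The “ggplot” package used above is no longer actively maintained, so as a fallback the averages
can be computed directly with “pandas” and passed to any charting library. The four-row frame
below is a hypothetical stand-in for the real dataset:

```python
import pandas as pd

# Hypothetical stand-in data; with the real dataset, the same groupby call
# yields one average sentiment value per year, ready to plot.
dataset = pd.DataFrame({
    "date": pd.to_datetime(["1990-01-05", "1990-06-10", "2008-09-15", "2008-10-01"]),
    "sentiment": [1, 2, -4, -3],
})

# Average the computed sentiment within each calendar year.
yearly = dataset.groupby(dataset["date"].dt.year)["sentiment"].mean()
print(yearly)
```

Averaging by year rather than by day gives a coarser but more stable trend line when, as here,
some days have few articles.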

2.2 Exploring the Python Output

For each command, Python will return its output immediately. Here, we focus on the main
results.

The correlation between our computational sentiment measure and human judgments of
sentiment is computed to be 0.63, which is not perfect but reasonably good.

The change in sentiment over time is shown in Figure 1. The black curve is the smoothed curve
of the daily averages, and the dark gray band around it denotes the confidence interval around
the average. Note that there are two abrupt drops in sentiment, one around 1990 and the other
around 2008. The first is probably due to the early 1990s recession and the second to the
disastrous 2008 financial crisis. It makes sense that the sentiment of news articles is extremely
negative during financial crises. It is also interesting that sentiment recovers within several years
after each crisis as the economy improves.

Figure 1: Daily Average Sentiment of the News Articles Published Between 1951 and 2014.
3 Your Turn

You can download the sample dataset and see whether you can reproduce the results presented
here. Then, try retrieving the 10 most positive and the 10 most negative articles and see what
they are about.
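
One way to approach the exercise, sketched on a hypothetical four-row frame (on the real data,
replace 2 with 10 and read the "text" column of the result):

```python
import pandas as pd

# Hypothetical stand-in; "sentiment" plays the role of the computed scores.
dataset = pd.DataFrame({
    "text": ["strong growth", "mild dip", "deep recession", "record profits"],
    "sentiment": [4, -1, -6, 5],
})

top = dataset.nlargest(2, "sentiment")      # most positive articles
bottom = dataset.nsmallest(2, "sentiment")  # most negative articles
print(top["text"].tolist())
print(bottom["text"].tolist())
```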
