Tutorial: Text Analysis to Test a Hypothesis
People often complain about important subjects being covered too little in the
news. One such subject is climate change. The scientific consensus is that this is
an important problem, and it stands to reason that the more people are aware
of it, the better our chances may be of solving it. But how can we assess how
widely covered climate change is by various media outlets? We can use Python
to do some text analysis!
Specifically, in this post, we'll try to answer some questions about which news
outlets are giving climate change the most coverage. At the same time, we'll
learn some of the programming skills required to analyze text data in Python
and test a hypothesis related to that data.
This tutorial assumes that you’re fairly familiar with Python and the popular data
science package pandas. If you'd like to brush up on pandas, check
out this post, and if you need to build a more thorough foundation, Dataquest's
interactive courses cover all of the Python and pandas fundamentals in more
depth.
Here's a quick summary of the ownership and political leaning of the outlets in our data set:

Atlantic: Owner: Atlantic Media; majority stake recently sold to Emerson Collective, a non-profit founded by Powell Jobs, widow of Steve Jobs. Lean left.
Breitbart: Right.
Business Insider: Center / left-center.
Buzzfeed News: Owner: Private; Jonah Peretti, CEO, and Kenneth Lerer, executive chair (the latter also a co-founder of the Huffington Post). Lean left.
CNN: Lean left.
Fox News: Owner: Fox Entertainment Group, mass media. Right.
Guardian: Lean left.
National Review: Right.
New York Times: Owner: NY Times Company. Lean left.
NPR: Owner: Non-profit. Center / left-center.
Reuters: Center.
Talking Points Memo: Left.
Washington Post: Lean left.
Vox: Left.
We can turn this into a formal hypothesis statement and will do that later in the
post. But first, let’s dive deeper into the data. A terminology note: in the
computational linguistics and NLP communities, a text collection such as this is
called a corpus, so we'll use that terminology here when talking about our text
data set.
Exploratory Data Analysis, or EDA, is an important part of any Data Science
project. It usually involves analyzing and visualizing the data in various ways to
look for patterns before proceeding with more in-depth analysis. In this case,
though, we're working with text data rather than numerical data, which makes
things a bit different.
For example, in numerical exploratory data analysis, we'd often want to look at
the mean values for our data features. But there’s no such thing as an
“average” word in a textual database, which makes our task a bit more
complex. However, there are still both quantitative and qualitative explorations
we can perform to sanity check our corpus’s integrity.
First, let’s reproduce the chart above to ensure that we're not missing any data,
and then sort by article count. We'll start by covering all of our imports, reading
the data set, and checking the length of each of its three parts.
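A minimal set of imports for everything we'll do below, plus loading the first part of the data set (following the same file pattern as parts two and three), looks something like this:

import re
import string
from collections import Counter

import numpy as np
import pandas as pd
from nltk.corpus import stopwords

# Read the first part of the corpus from its zipped CSV
pt1 = pd.read_csv('data/articles1.csv.zip', compression='zip', index_col=0)
pt1.head()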
      id  title                                               publication     author                         date        year    month  url  content
0  17283  House Republicans Fret About Winning Their Hea...  New York Times  Carl Hulse                     2016-12-31  2016.0  12.0   NaN  WASHINGTON — Congressional Republicans have...
1  17284  Rift Between Officers and Residents as Killing...  New York Times  Benjamin Mueller and Al Baker  2017-06-19  2017.0  6.0    NaN  After the bullet shells get counted, the blood...
2  17285  Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...  New York Times  Margalit Fox                   2017-01-06  2017.0  1.0    NaN  When Walt Disney’s “Bambi” opened in 1942, cri...
3  17286  Among Deaths in 2016, a Heavy Toll in Pop Musi...  New York Times  William McDonald               2017-04-10  2017.0  4.0    NaN  Death may be the great equalizer, but it isn’t...
4  17287  Kim Jong-un Says North Korea Is Preparing to T...  New York Times  Choe Sang-Hun                  2017-01-02  2017.0  1.0    NaN  SEOUL, South Korea — North Korea’s leader, ...
len(pt1)
50000
pt2 = pd.read_csv('data/articles2.csv.zip', compression='zip', index_col=0)
len(pt2)
49999
pt3 = pd.read_csv('data/articles3.csv.zip', compression='zip', index_col=0)
len(pt3)
42571
Working with three separate data sets isn't going to be convenient, though.
We'll combine all three DataFrames into a single one so we can work with our
entire corpus more easily:
articles = pd.concat([pt1,pt2,pt3])
len(articles)
142570
Next, we'll make sure we have the same publication names as in the original
data set, and check the earliest and latest years of the articles.
articles.publication.unique()
array(['New York Times', 'Breitbart', 'CNN', 'Business Insider',
       'Atlantic', 'Fox News', 'Talking Points Memo', 'Buzzfeed News',
       'National Review', 'New York Post', 'Guardian', 'NPR', 'Reuters',
       'Vox', 'Washington Post'], dtype=object)
print(articles['year'].min())
articles['year'].max()
2000.0
2017.0
It’s unusual to store dates as floats like we see above, but that is how they are
stored in our CSV file. We're not planning to use dates for anything too
important anyway, so for the purposes of this tutorial we'll just leave them as
floats. If we were doing a different analysis, though, we might want to convert
them to a different format.
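For example, converting the date strings to real datetime objects (assuming they're well formed) is a one-liner:

articles['date'] = pd.to_datetime(articles['date'])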
Let's take a quick look at when our articles are from using
pandas' value_counts() function.
articles['year'].value_counts()
2016.0 85405
2017.0 50404
2015.0 3705
2013.0 228
2014.0 125
2012.0 34
2011.0 8
2010.0 6
2009.0 3
2008.0 3
2005.0 2
2004.0 2
2003.0 2
2007.0 1
2000.0 1
Name: year, dtype: int64
We can see that most of the articles are from recent years, but a few older articles are included, too. That serves our purposes fine, as we're mostly concerned with coverage over the past few years.
Now, let's sort the publications by name to reproduce the original plot from
Kaggle.
ax = articles['publication'].value_counts().sort_index().plot(kind='bar', fontsize=14, figsize=(12,10))
ax.set_title('Article Count\n', fontsize=20)
ax.set_xlabel('Publication', fontsize=18)
ax.set_ylabel('Count', fontsize=18);
This plot order is helpful if you want to find a specific outlet quickly, but it may
be more helpful for us to sort it by article count so that we get a better idea of
where our data is coming from.
ax = articles['publication'].value_counts().plot(kind='bar', fontsize=14, figsize=(12,10))
ax.set_title('Article Count - most to least\n', fontsize=20)
ax.set_xlabel('Publication', fontsize=18)
ax.set_ylabel('Count', fontsize=18);
We want to check the average article length in words, but equally important is
the diversity of those words. Let’s look at both.
We'll start by defining a function that removes punctuation and converts all the
text to lower case. (We’re not doing any complicated syntactic analysis, so we
don’t need to preserve the sentence structure or capitalization).
def clean_text(article):
    clean1 = re.sub(r'[' + string.punctuation + '’—”' + ']', "", article.lower())
    return re.sub(r'\W+', ' ', clean1)
Now we'll create a new column in our dataframe with the cleaned up text.
articles['tokenized'] = articles['content'].map(lambda x: clean_text(x))
articles['tokenized'].head()
0 washington congressional republicans have a ne...
1 after the bullet shells get counted the blood ...
2 when walt disneys bambi opened in 1942 critics...
3 death may be the great equalizer but it isnt n...
4 seoul south korea north koreas leader kim said...
Name: tokenized, dtype: object
Let's take a look at the average (mean) number of words in each article, and the
longest and shortest articles in our data set.
articles['num_wds'] = articles['tokenized'].apply(lambda x: len(x.split()))
articles['num_wds'].mean()
732.36012485095046
articles['num_wds'].max()
49902
articles['num_wds'].min()
0
An article with zero words isn't any use to us, so let's see how many of those
there are. We'll want to remove articles with no words from our data set.
len(articles[articles['num_wds']==0])
97
Let's get rid of those empty articles and then see what that does to the mean
number of words per article in our data set, and what our new minimum word
count is.
articles = articles[articles['num_wds']>0]
articles['num_wds'].mean()
732.85873814687693
articles['num_wds'].min()
1
At this point, it might be helpful for us to visualize the distribution of article word counts to see how much our average might be skewed by outliers. Let's generate another plot to take a look:
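A histogram sketch that mirrors the one we'll build for unique words below:

ax = articles['num_wds'].plot(kind='hist', bins=50, fontsize=14, figsize=(12,10))
ax.set_title('Words Per Article\n', fontsize=20)
ax.set_ylabel('Frequency', fontsize=18)
ax.set_xlabel('Number of Words', fontsize=18);

Given that the longest article runs to nearly 50,000 words while the mean is around 732, we'd expect a long right tail, which explains why the mean sits well above the bulk of the distribution. Next, let's turn to unique words.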
Operating on the series from the tokenized column that we created earlier, we'll split each article into words using the split() method via pandas' .str accessor. Then we'll turn each list of words into a set to eliminate duplicates, and measure the size of that set with len().
Finally, we’ll add the result as a new column that contains the number of unique
words in each article.
articles['uniq_wds'] = articles['tokenized'].str.split().apply(lambda x: len(set(x)))
articles['uniq_wds'].head()
0 389
1 1403
2 920
3 1037
4 307
Name: uniq_wds, dtype: int64
We also want to take a look at the average (mean) number of unique words per
article, and the minimum and maximum unique word counts.
articles['uniq_wds'].mean()
336.49826282874648
articles['uniq_wds'].min()
1
articles['uniq_wds'].max()
4692
When we plot this into a chart, we can see that while the distribution of unique
words is still skewed, it looks a bit more like a normal (Gaussian) distribution
than the distribution based on total word counts we generated earlier.
ax = articles['uniq_wds'].plot(kind='hist', bins=50, fontsize=14, figsize=(12,10))
ax.set_title('Unique Words Per Article\n', fontsize=20)
ax.set_ylabel('Frequency', fontsize=18)
ax.set_xlabel('Number of Unique Words', fontsize=18);
Let's also look at how these two measures of article length differ by publication. To do that, we'll use pandas' groupby function. The full documentation for this powerful function can be found here, but for our purposes, we just need to know that it lets us compute aggregate metrics (counts, means, and so on) over the rows that share a value in another column.
In this case, that column is publication. This first plot simply counts the number of articles in each group by aggregating with len. We could have used any column other than title in the code below; since len just counts rows, the result would be the same.
art_grps = articles.groupby('publication')
ax = art_grps['title'].aggregate(len).plot(kind='bar', fontsize=14, figsize=(12,10))
ax.set_title('Articles per Publication (repeated)\n', fontsize=20)
ax.set_ylabel('Number of Articles', fontsize=18)
ax.set_xlabel('Publication', fontsize=18);
Now we'll aggregate over the mean number of words and number of unique
words, respectively.
ax = art_grps['num_wds'].aggregate(np.mean).plot(kind='bar', fontsize=14, figsize=(12,10))
ax.set_title('Mean Number of Words per Article\n', fontsize=20)
ax.set_ylabel('Mean Number of Words', fontsize=18)
ax.set_xlabel('Publication', fontsize=18);

ax = art_grps['uniq_wds'].aggregate(np.mean).plot(kind='bar', fontsize=14, figsize=(12,10))
ax.set_title('Mean Number of Unique Words per Article\n', fontsize=20)
ax.set_ylabel('Mean Number of Unique Words', fontsize=18)
ax.set_xlabel('Publication', fontsize=18);
Finally, let’s look at the most common words over the entire corpus.
We'll use a Python Counter, which is a special kind of dictionary that keeps an integer count for each key. Here, we iterate through all the articles, updating the counter with the tokenized version of each one.
wd_counts = Counter()
for i, row in articles.iterrows():
    wd_counts.update(row['tokenized'].split())
When we're counting the most common words, though, we don’t want to
include all words in our count. There are a number of words so common in
written English that they're likely to appear as the most common words in any
analysis. Counting them won't tell us anything about the article's content. In NLP
and text processing, these words are called "stopwords." The list of common
English stopwords includes words such as “and,” “or,” and “such.”
Remember that we imported the stopwords module from nltk.corpus at the beginning of this project, so now let's take a look at the words this pre-made stopwords list contains:
stopwords.words('english')
['i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
"you're",
"you've",
"you'll",
"you'd",
'your',
'yours',
'yourself',
'yourselves',
'he',
'him',
'his',
'himself',
'she',
"she's",
'her',
'hers',
'herself',
'it',
"it's",
'its',
'itself',
'they',
'them',
'their',
'theirs',
'themselves',
'what',
'which',
'who',
'whom',
'this',
'that',
"that'll",
'these',
'those',
'am',
'is',
'are',
'was',
'were',
'be',
'been',
'being',
'have',
'has',
'had',
'having',
'do',
'does',
'did',
'doing',
'a',
'an',
'the',
'and',
'but',
'if',
'or',
'because',
'as',
'until',
'while',
'of',
'at',
'by',
'for',
'with',
'about',
'against',
'between',
'into',
'through',
'during',
'before',
'after',
'above',
'below',
'to',
'from',
'up',
'down',
'in',
'out',
'on',
'off',
'over',
'under',
'again',
'further',
'then',
'once',
'here',
'there',
'when',
'where',
'why',
'how',
'all',
'any',
'both',
'each',
'few',
'more',
'most',
'other',
'some',
'such',
'no',
'nor',
'not',
'only',
'own',
'same',
'so',
'than',
'too',
'very',
's',
't',
'can',
'will',
'just',
'don',
"don't",
'should',
"should've",
'now',
'd',
'll',
'm',
'o',
're',
've',
'y',
'ain',
'aren',
"aren't",
'couldn',
"couldn't",
'didn',
"didn't",
'doesn',
"doesn't",
'hadn',
"hadn't",
'hasn',
"hasn't",
'haven',
"haven't",
'isn',
"isn't",
'ma',
'mightn',
"mightn't",
'mustn',
"mustn't",
'needn',
"needn't",
'shan',
"shan't",
'shouldn',
"shouldn't",
'wasn',
"wasn't",
'weren',
"weren't",
'won',
"won't",
'wouldn',
"wouldn't"]
As we can see, this is quite a long list, but none of these words can really tell us
anything about the meaning of an article. Let's use this list to delete the
stopwords from our Counter.
for sw in stopwords.words('english'):
    del wd_counts[sw]
To further filter our word counts down into useful information, Counter has a handy most_common method we can use to look at just the most commonly used words it found. Using this function, we can specify the number of results we'd like to see. Here, we'll ask it for the top 20 most common words.
wd_counts.most_common(20)
[('said', 571476),
('trump', 359436),
('would', 263184),
('one', 260552),
('people', 246748),
('new', 205187),
('also', 181491),
('like', 178532),
('president', 161541),
('time', 144047),
('could', 143626),
('first', 132971),
('years', 131219),
('two', 126745),
('even', 124510),
('says', 123381),
('state', 118926),
('many', 116965),
('u', 116602),
('last', 115748)]
Above, we can see some pretty predictable words, but also a bit of a surprise:
the word u is apparently among the most common. This may seem strange, but
it comes from the fact that acronyms like "U.S." and "U.N." are used frequently
in these articles.
That's a bit odd, but remember that at the moment we're just exploring the
data. The actual hypothesis we want to test is that climate change coverage
might be correlated with certain aspects of a media outlet, like its ownership or
political leanings. The existence of u as a word in our corpus isn't likely to affect
this analysis at all, so we can leave it as-is.
We could do a lot more cleaning and refining for this data set in other areas as
well, but it's probably not necessary. Instead, let's move on to the next step:
testing whether our initial hypothesis is correct.
How can we tell whether an article is talking about climate change? There are
several ways we could do this. We could identify concepts using advanced text
analytics techniques such as clustering or topic modeling. But for the purposes
of this article, let's keep it simple: let's just identify keywords that might
correlate with the topic, and search for them in the articles. Just brainstorming
some words and phrases of interest should do the trick.
So here's how we're going to filter: we want the string chang followed by the string climate within one to five words, or the string climate followed by the string chang within one to five words (in regular expressions, \w+ matches one or more word characters, and \W+ matches one or more non-word characters). The "within one to five words" part of the pattern looks like this: (?:\w+\W+){1,5}?, and we use | to represent a logical or between the two alternatives.
Together, these two patterns should help us identify any articles that mention climate change, the changing climate, and so on. One detail worth noting: because the proximity pattern requires at least one word between the two terms, a simple way to also catch the exact phrase "climate change" is to check for it directly first and fall back to the regex only when it isn't found. Let's wrap this logic in a function, find_cc_wds, that we can apply to each article:
def find_cc_wds(content, cc_wds=('climate change',)):
    # First, check for the exact phrase(s) of interest
    found = any(w in content for w in cc_wds)
    if not found:
        # Otherwise, look for "chang..." and "climate" within one to five words of each other;
        # search() scans the whole article rather than just its start
        disj = re.compile(r'(chang\w+\W+(?:\w+\W+){1,5}?climate)|(climate\W+(?:\w+\W+){1,5}?chang)')
        if disj.search(content):
            found = True
    return found
disj = re.compile(r'(chang\w+\W+(?:\w+\W+){1,5}?climate)|(climate\W+(?:\w+\W+){1,5}?chang)')
disj.match('climate is changing')
<_sre.SRE_Match object; span=(0, 16), match='climate is chang'>
disj.match('change in extreme climate')
<_sre.SRE_Match object; span=(0, 26), match='change in extreme climate'>
disj.match('nothing changing here except the weather')
As we can see, this is working as intended — it's matching the real references to
climate change, and not being thrown off by the use of the term "change" in
other contexts.
Now let's use our function to create a new Boolean field indicating whether
we've found relevant words, and then see if there are any mentions of climate
change in the first five articles of our data set:
articles['cc_wds'] = articles['tokenized'].apply(find_cc_wds)
articles['cc_wds'].head()
0 False
1 False
2 False
3 False
4 False
Name: cc_wds, dtype: bool
The first five articles in our data set don't contain any mentions of climate
change, but we know our function is working as intended from our earlier test,
so now we can start to do some analysis of the news coverage.
articles['cc_wds'].sum() / len(articles)
0.030826893516666315
We see that the proportion of climate coverage over all articles is about 3.1%. That's fairly low, but it still leaves us with several thousand climate-related articles, which is plenty to work with statistically.
Next, we want to compute the relative proportion of climate change articles for each group. Let's illustrate how this works by looking at the proportion per publication source. We'll again use our groupby object: sum gives us the number of climate change articles in each group, and count gives us the total number of articles in each group:
art_grps['cc_wds'].sum()
publication
Atlantic 366.0
Breitbart 471.0
Business Insider 107.0
Buzzfeed News 128.0
CNN 274.0
Fox News 58.0
Guardian 417.0
NPR 383.0
National Review 245.0
New York Post 124.0
New York Times 339.0
Reuters 573.0
Talking Points Memo 76.0
Vox 394.0
Washington Post 437.0
Name: cc_wds, dtype: float64
art_grps.count()

                       id  title  author  date  year  month   url  content  tokenized  num_wds
publication
Atlantic             7178   7178    6198  7178  7178   7178     0     7178       7178     7178
Business Insider     6695   6695    4926  6695  6695   6695     0     6695       6695     6695
Buzzfeed News        4835   4835    4834  4835  4835   4835  4835     4835       4835     4835
Fox News             4351   4351    1117  4349  4349   4349  4348     4351       4351     4351
Guardian             8680   8680    7249  8640  8640   8640  8680     8680       8680     8680
National Review      6195   6195    6195  6195  6195   6195  6195     6195       6195     6195
New York Times       7803   7803    7767  7803  7803   7803     0     7803       7803     7803
Talking Points Memo  5214   5213    1676  2615  2615   2615  5214     5214       5214     5214
Vox                  4947   4947    4947  4947  4947   4947  4947     4947       4947     4947
...
Now, let's break that down into proportions and sort the list so that we can
quickly see at a glance which outlets are doing the most coverage of climate
change:
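One way to compute this, using the same sum-over-count pattern we'll apply to the bias groups later (cc_proportions is just an illustrative name):

cc_proportions = art_grps['cc_wds'].sum() / art_grps['cc_wds'].count()
cc_proportions.sort_values()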
The proportion ranges from 0.7% for the New York Post to 8% for Vox. Let's plot
this, sorted by publication name, and then again sorted by value.
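Reusing the bar-chart pattern from earlier with the cc_proportions series, the two plots might look like this:

ax = cc_proportions.sort_index().plot(kind='bar', fontsize=14, figsize=(12,10))
ax.set_title('Proportion of Climate Change Articles per Publication\n', fontsize=20)
ax.set_xlabel('Publication', fontsize=18)
ax.set_ylabel('Proportion', fontsize=18);

ax = cc_proportions.sort_values().plot(kind='bar', fontsize=14, figsize=(12,10))
ax.set_title('Proportion of Climate Change Articles per Publication - least to most\n', fontsize=20)
ax.set_xlabel('Publication', fontsize=18)
ax.set_ylabel('Proportion', fontsize=18);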
If we look at the left vs. right political leanings of the publications, and call the
group of publications that lean left “Left” and the right-leaning group “Right,”
our null hypothesis is that the population climate change article proportion for
Left equals the population climate change article proportion for Right. Our
alternative hypothesis is that the two population proportions are unequal. We
can substitute other population groupings and state similar hypotheses for other
political leaning comparisons or for other publication characteristics.
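Written compactly: if p_Left and p_Right denote the population proportions of climate change articles in the two groups, we're testing H0: p_Left = p_Right against Ha: p_Left ≠ p_Right.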
Let’s start with political leanings. You can revisit the top of this post to remind
yourself of how we collected the information about outlets' political leanings.
The below code uses a dictionary to assign left, right, and center values to
each publication name based on the information we collected.
# liberal, conservative, and center
bias_assigns = {'Atlantic': 'left', 'Breitbart': 'right', 'Business Insider': 'left',
                'Buzzfeed News': 'left', 'CNN': 'left', 'Fox News': 'right',
                'Guardian': 'left', 'National Review': 'right', 'New York Post': 'right',
                'New York Times': 'left', 'NPR': 'left', 'Reuters': 'center',
                'Talking Points Memo': 'left', 'Washington Post': 'left', 'Vox': 'left'}
articles['bias'] = articles['publication'].apply(lambda x: bias_assigns[x])
articles.head()
[Output: the same first five New York Times rows shown earlier, now with the additional tokenized, num_wds, uniq_wds, cc_wds, and bias columns.]
We again use groupby() to find the proportion of climate change articles within
each political group.
bias_groups = articles.groupby('bias')
bias_proportions = bias_groups['cc_wds'].sum() / bias_groups['cc_wds'].count()
Let's look at how many articles there are in each group, and chart it:
bias_groups['cc_wds'].count()
bias
center 10710
left 79943
right 51820
Name: cc_wds, dtype: int64
ax = bias_proportions.plot(kind='bar', fontsize=14, figsize=(12,10))
ax.set_title('Proportion of climate change articles by Political Bias\n', fontsize=20)
ax.set_xlabel('Bias', fontsize=18)
ax.set_ylabel('Proportion', fontsize=18);
From the above chart, it seems obvious that the proportion of climate change
related articles differs for the different political leaning groups, but let's formally
test our hypothesis. To do this, for a given pair of article groupings, we state the
null hypothesis, which is to assume that there is no difference in the population
proportion of climate-related articles. Let’s also establish a 95% confidence level
for our test.
Once we gather our statistics, we can use either P-values or confidence intervals
to determine whether our results are statistically significant. We’ll use
confidence intervals here because we're interested in what range of values of
the difference are likely to reflect the population proportion differences. The
statistic of interest in our hypothesis test is the difference in the proportion of
climate change articles in two samples. Recall that there is a close relationship
between confidence intervals and significance tests. Specifically, if a statistic is
significantly different than zero at the 0.05 level, then the 95% confidence
interval will not contain 0.
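We'll use a small helper, calc_ci_range, to compute these intervals; a minimal version based on the normal approximation for the difference of two proportions (z = 1.96 for 95% confidence) looks like this:

import math

def calc_ci_range(p1, n1, p2, n2, z=1.96):
    # Standard error of the difference between two sample proportions
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    # Lower and upper bounds of the confidence interval for p1 - p2
    return diff - z * se, diff + z * se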
Let's calculate the confidence intervals for our leaning groups, looking first at
left vs. right.
center = bias_groups.get_group('center')
left = bias_groups.get_group('left')
right = bias_groups.get_group('right')
calc_ci_range(bias_proportions['left'], len(left), bias_proportions['right'], len(right))
(0.017490570656831184, 0.02092806371626154)
calc_ci_range(bias_proportions['center'], len(center), bias_proportions['left'], len(left))
(0.012506913377622272, 0.021418820332295894)
Next, we can look at publication ownership, using the same approach. We divide
our population into four groups, LLC, corporation, non-profit, and private.
Now let's plot that data to see whether different types of companies cover
climate change in different proportions.
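Assuming we've added an owner column mapping each publication to 'LLC', 'corp', 'non-profit', or 'private' (built the same way as bias_assigns above; the full per-publication mapping is omitted here), the groups, proportions, and plot follow the same pattern:

owner_groups = articles.groupby('owner')
owner_proportions = owner_groups['cc_wds'].sum() / owner_groups['cc_wds'].count()

llc = owner_groups.get_group('LLC')
corp = owner_groups.get_group('corp')
non_profit = owner_groups.get_group('non-profit')
private = owner_groups.get_group('private')

ax = owner_proportions.plot(kind='bar', fontsize=14, figsize=(12,10))
ax.set_title('Proportion of climate change articles by Ownership\n', fontsize=20)
ax.set_xlabel('Ownership', fontsize=18)
ax.set_ylabel('Proportion', fontsize=18);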
calc_ci_range(owner_proportions['LLC'], len(llc), owner_proportions['corp'], len(corp))
(0.0031574852345019415, 0.0072617257208337279)
Here, the confidence interval is 0.3% to 0.7%, much closer to zero than our
earlier differences, but still not including zero. We would expect the non-profit to
LLC interval to also not include zero:
calc_ci_range(owner_proportions['non-profit'], len(non_profit), owner_proportions['LLC'], len(llc))
(0.0058992390642172241, 0.011661788182388525)
calc_ci_range(owner_proportions['private'], len(private), owner_proportions['non-profit'], len(non_profit))
(-0.003248922257497777, 0.004627808917174475)
Thus, in this case, we can conclude that there is not a significant difference in the proportion of climate change related articles between these two populations, unlike the other populations we've compared.
In terms of working with this corpus, though, we've barely touched the tip of the
iceberg. There are many interesting analyses you could attempt with this data,
so download the data from Kaggle for yourself and start writing your own text
analysis project!
Further Reading:
Olteanu, A., et al. "Comparing events coverage in online news and social media: The case of climate change." Proceedings of the Ninth International AAAI Conference on Web and Social Media. 2015.