Python for Text Analysis
Vic Anand
University of Illinois at Urbana-Champaign
[email protected]
Khrystyna Bochkay
University of Miami
[email protected]
Roman Chychyla
University of Miami
[email protected]
Andrew Leone
Northwestern University
[email protected]
Boston — Delft
Contents

1 Introduction
3 Jupyter Notebooks
3.1 Motivating Example
3.2 JupyterLab: A Development Environment for Jupyter Notebooks
3.3 How to Launch JupyterLab
3.4 Working in JupyterLab
3.5 The Markdown Language and Formatted Text Cells
Acknowledgements
References
ABSTRACT
The prominence of textual data in accounting research has in-
creased dramatically. To assist researchers in understanding
and using textual data, this monograph defines and describes
common measures of textual data and then demonstrates the
collection and processing of textual data using the Python
programming language. The monograph is replete with sam-
ple code that replicates textual analysis tasks from recent
research papers.
In the first part of the monograph, we provide guidance on
getting started in Python. We first describe Anaconda, a
distribution of Python that provides the requisite libraries
for textual analysis, and its installation. We then introduce
the Jupyter notebook, a programming environment that
improves research workflows and promotes replicable re-
search. Next, we teach the basics of Python programming
and demonstrate the basics of working with tabular data in
the Pandas package.
Vic Anand, Khrystyna Bochkay, Roman Chychyla and Andrew Leone (2020), "Using Python for Text Analysis in Accounting Research" (forthcoming), Foundations and Trends® in Accounting: Vol. xx, No. xx, pp. 1–18. DOI: 10.1561/XXXXXXXXX.
We make our companion materials available online and kindly ask researchers who use them to cite this paper.
While the base language is very capable, much of Python's value derives from the rich ecosystem of available packages (aka libraries). Packages extend the base language with additional functionality.
1 Confusingly, the term Anaconda has multiple meanings. Anaconda is the name
of a company. It is also the name of a software distribution that includes Python,
a package manager (called conda), and Jupyter notebooks. Finally, just to make
things extra confusing, Anaconda also refers to a “meta package,” i.e., a set of
specific versions of specific packages curated by the Anaconda company that work
well together. At the time of this writing, the Anaconda meta-package is at version
2020.07; this meta-package includes specific versions of the aforementioned 300 or so
packages for data science. When the Anaconda company releases the next version of
the Anaconda meta-package, many of the bundled packages will be upgraded.
The Anaconda installer will download and install many packages, and this takes some time.
There are many ways to write and execute Python code. In this chapter,
we introduce the Jupyter Notebook, a popular tool in the data science
community, and recommend that readers use it in their text analysis
research projects. Jupyter Notebooks are live documents that contain
code, outputs from the code, visualizations, equations, and formatted
text. As we will show, because Jupyter Notebooks support all of these
features and store them in one place, they can greatly simplify and
improve research workflows.
Consider the following research task: load a data set, create some new
columns, make plots, compute summary statistics, and share the results
with coauthors. With many popular statistics packages, one would need
to copy the outputs (tables and figures) into a Word document or email,
add some annotations or comments, and send these to coauthors. When
it is time to revise the analysis, the Word document or email must be
updated manually and shared again.
For illustrative purposes, let us assume we use Stata, a popular statistics package. Suppose the Stata code loads return data for three stocks, computes summary statistics, and creates plots by stock
and by year; the plots are displayed on the screen and also saved as
PNG files. Finally, the code computes the annual return by year of an
equally-weighted portfolio of these three stocks.
Consider the problem of saving these results and sharing them with
coauthors. A researcher needs to save the code in a Stata Do file, save
the log file, and save the plot files. To share these results with coauthors,
the researcher would need to copy the summary table from the Stata
results window (or from the log file) and the plots into another program
(e.g. email, Word, OneNote), annotate the findings, and send the results
to the coauthors.
A Jupyter Notebook is a single file that provides functionality for all
these tasks. It is a container for the code and output, and allows the user
to easily annotate the output. Consider the sample Jupyter Notebook
that accompanies this chapter; this sample notebook implements in
Python the same functionality as in the Stata code above. We have
reproduced the sample notebook in Figures 3.1, 3.2, and 3.3. Notice
the notebook contains results in tabular form, figures, headings, and
commentary on the results. Since it is a single file, this notebook can
easily be shared with coauthors; if the coauthors do not have Python,
the notebook can be exported to HTML or PDF format, or hosted on a
website. Additionally, if the source data or the code changes, the cells in
the notebook can simply be run again and the notebook re-shared. In
sum, a Jupyter Notebook is a container for the entire research workflow
and can easily be shared.
In this monograph, we use JupyterLab, “the next generation web-based user interface for Project Jupyter”
(Project Jupyter, 2018).
2 Both the traditional Jupyter Notebook app and JupyterLab are installed by
default with Anaconda (see Figure 2.3).
Upon launching JupyterLab, you will see a browser tab like that shown
in Figure 3.4. JupyterLab runs entirely within the browser. The left
sidebar contains a file explorer. You can navigate through the folder tree
on your computer and find existing notebooks. Jupyter Notebook files,
which have the file extension .ipynb, have an orange icon (notice the
orange icon in Figure 3.4 next to the notebook Ch3_Sample_notebook).
Launch a notebook by double-clicking on the filename next to its orange
icon. Alternatively, you can create a new notebook by clicking on the
Python 3 tile.
To create a new, empty notebook, click the Python 3 tile that is shown
after launching JupyterLab (see Figure 3.4). Sometimes, upon launch,
JupyterLab will open the last edited notebook. To create a new notebook,
click on the File menu, then New, then Notebook. Alternatively, click
File, then New Launcher, and a window such as that shown in Figure 3.4 will appear.
We show an example of a new, empty notebook in Figure 3.7. The
new notebook contains a single, empty cell.
To enter code in a cell, click inside the cell with the mouse. Alternatively,
if the cell is selected (as shown in Figure 3.8), press Enter and a cursor
will be shown inside the cell. At this point, we can type in our code. To
run the code in the active cell, press CTRL + ENTER. Python will execute
the code and show the output, if any, beneath the cell. Additionally, a
number will appear in the brackets to the left of the cell. This number
indicates the order in which cells were executed. Cells need not be executed in order.
Figure 3.5: JupyterLab with two views of the same notebook. The red circle
highlights the file explorer icon; clicking on this icon hides or displays the file explorer
in JupyterLab.
CTRL+ENTER executes the currently selected cell and does not advance
the cell selection. SHIFT+ENTER executes code and selects the next cell.
If there is no next cell, one is created. ALT+ENTER executes the current
cell and creates a new empty cell beneath the current cell.
Figure 3.6: JupyterLab permits users to open CSV and image files. It also allows
for sophisticated window layouts; these are useful when users need to view their data
and code simultaneously.
Notice that the cell in Figure 3.8 contains two lines of code. A cell can
contain one or more lines of code. It is entirely up to the user. We could
write all of our code in a single cell. Or, we could only have one line of
code per cell. Common practice is to put all of the code that performs one action in a single cell. For example, a cell might open a file and display the
results. Another cell might contain multiple lines of code that create
a graph. Notice that some cells in Figure 3.1 contain multiple lines of
code and others contain a single line of code. The first cell contains all
import statements. The second opens a data file and sorts it. The third
cell displays the first 5 rows of the data. The fourth computes return
measures, and the fifth drops rows with missing values. We could have
Figure 3.8: Example of code execution. We typed code into the cell and pressed
CTRL + ENTER. The output is shown beneath the cell, and the cell counter shows
1 since this was the first cell executed since the notebook was opened.
Keystroke Action
c Copy currently selected cell
x Cut currently selected cell
v Paste below currently selected cell
To move a cell, click to the left of the cell, hold down the mouse button, then drag the cell up or down to the desired location.
All of the commands above can work on multiple cells. To select more
than one cell, hold down the SHIFT key and then press the up or down arrow keys.
Press the m key to change the selected cell's type to Markdown. To revert to a code cell, press the y key or use the dropdown menu.
Keystroke Action
m Change cell type to markdown
y Change cell type to code
Within a Markdown cell, you can type text. However, you still need
to use CTRL+ENTER or an alternative keystroke to render the text in the
cell.
Emphasis

To render text in italics, place it between single asterisks. For example, *sample italicized text* will appear as sample italicized text. To see this, create a new cell in a
notebook, type an asterisk, some text, and another asterisk. Then press
CTRL+ENTER.
To render text in boldface, place the text between pairs of asterisks.
For example:
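**sample bold text**

will appear as sample bold text in boldface.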
Headings
Markdown supports six levels of headings. Heading 1 is the largest. To
render text as heading 1, place a single # before the text. To render
text as heading 2, place two # characters before the text. And so on.
For example, entering the following code into a markdown cell:
# Sample heading 1
## Sample heading 2
### Sample heading 3
#### Sample heading 4
##### Sample heading 5
###### Sample heading 6
will render as six headings of decreasing size.
Lists
Markdown supports unordered (i.e. bulleted) and ordered (i.e. num-
bered) lists. To create an unordered list, place one list item on each
line. Before each list item, type a single asterisk at the beginning of a
line and a space.3 For example, the following markdown would create a
3 Markdown also permits pluses or minuses instead of asterisks to denote list
items.
bulleted list:
* Python
* R
* SAS
* Stata
To create an ordered list, place one list item on each line. Before
each list item, type a number and a period at the beginning of a line.
The numbers must be sequential. For example, the following markdown
would create a numbered list.
1. Python
2. R
3. SAS
4. Stata
4.1 Fundamentals
Python is case-sensitive. The variables ROA and roa are distinct. If you
try to call the built-in absolute value function abs using capital letters
(e.g., ABS), an error will result.
Variables
Variables are containers for data. Variables are commonly used to store
the result of a computation or an entire dataset.
Python makes it very easy to create a variable. Simply type the desired
variable name, an equals sign, and a valid value. The following code
illustrates this by creating a new variable named z and storing the value
14 in it.
z = 14
Out:
The CEO said, "Earnings growth will exceed 5% next quarter."

Out:
Harley's ticker symbol is HOG.
When Python sees a single quote, it looks for the next single quote.
Everything in between the two single quotes is treated as a string. The
same rule applies for double quotes.
It is possible to include a single quote inside a single-quoted string,
or to include a double quote inside a double-quoted string. To do so,
prefix the quote inside the string with a backslash character (\). When
Python sees a backslash character, it treats the next character as a
literal, meaning it does not assign a special meaning to it. Here are
some examples:
In:
myString = 'Harley\'s ticker symbol is HOG.'
print(myString)

myString = "The CEO said, \"Earnings growth will exceed 5% next quarter.\""
print(myString)
Out:
Harley's ticker symbol is HOG.
The CEO said, "Earnings growth will exceed 5% next quarter."
Empty Strings
Python allows “empty strings,” i.e., zero-length strings. The empty
string is entered as '' (two single quotes in succession). The empty
string is often used to represent missing values in columns of text data.
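For example (a minimal illustration of ours):

In:
s = ''
print(len(s))

Out:
0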
if x > 3:
    print('Something')
else:
    print('Something else')
Out:
Something else
In:
type(14)

Out:
int
The type() function can also be used to check the type of data
stored in a variable. For example:
In:
y = 'Hello there.'
type(y)
Out:
str
Table 4.2: Functions to convert values from one data type to another
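The most common of these conversion functions are int, float, and str. For example (our own minimal illustration):

In:
print(int('2019'))
print(float('3.14'))
print(str(42))

Out:
2019
3.14
42

If a conversion is impossible, Python raises an error. For example, int('EBITDA is 1.03 million.') produces: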
Out:
ValueError                          Traceback (most recent call last)
<ipython-input-4-7a251f5038b9> in <module>
----> 1 int('EBITDA is 1.03 million.')
4.3 Operators
Operators are symbols that act on values. Python provides the following types of operators; knowledge of them will be helpful throughout the remainder of this monograph.
Floor division (//) performs division and then truncates any frac-
tional portion of the result. The mod operator (%) performs division
and returns the remainder.
Generally, if both operands are integers, the result will be an integer and
if either value is a float, the result will be a float. The only exception is
the division operator which always returns a float. The division operator
behaves this way to guarantee that it does not throw away information.
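For example (ours):

In:
print(7 // 2)   # floor division truncates the fractional part
print(7 % 2)    # mod returns the remainder
print(7 / 2)    # the division operator always returns a float

Out:
3
1
3.5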
The in operator tests whether a value is contained in a collection. For example:
In:
myList = [1, 2, 3]   # list of integers 1, 2, and 3
if 5 in myList:
    print('5 is in the list.')
else:
    print('5 is not in the list.')

Out:
5 is not in the list.
myList = [1, 2, 3]
for x in myList:
    print(x)
Out:
1
2
3
Python evaluates the expression step by step:

not 4 + 3 * 4 * 2 == 28   # starting expression
not 4 + 12 * 2 == 28      # 3*4 --> 12
not 4 + 24 == 28          # 12*2 --> 24
not 28 == 28              # 4+24 --> 28
not True                  # 28==28 --> True
False                     # not True --> False
We caution the reader that the syntax for print has changed sub-
stantially in recent years. In this paper, we demonstrate the newest
syntax, f-strings. However, many online code samples use older styles.
For a thorough overview of all printing styles, we refer the reader to
this article.
In:
FiscalYear = 2019
CompanyName = 'General Motors'
NetIncome = 6581000000
print(f'In fiscal year {FiscalYear}, {CompanyName.upper()} had net '
      f'income of {NetIncome / 1000000000} billion.')

Out:
In fiscal year 2019, GENERAL MOTORS had net income of 6.581 billion.
When Python sees the letter f before a string, it treats the string
differently. Inside the string, Python looks for expressions inside curly
braces. It evaluates those expressions and replaces them with the string
equivalents of their values. Notice that the expression {FiscalYear}
was replaced with the value 2019, the value of the variable FiscalYear.
Also notice that the expressions inside f-strings need not contain vari-
ables; they can contain any valid Python expression. Notice the second
expression, {CompanyName.upper()}. The upper() method converts a
string to all upper-case. Also notice that the third expression divides
the variable NetIncome by one billion.3
4.5.1 if Statements
The syntax for a Python if statement is:
if condition_A:
    A_statements
elif condition_B:
    B_statements
elif condition_C:
    C_statements
else:
    else_statements
In:
age = 25
if age < 13:
    print('child')       # the labels in branches other than
elif age < 20:
    print('teenager')    # 'young adult' are illustrative
elif age < 30:
    print('young adult')
else:
    print('adult')

Out:
young adult
4 elif means "else if".
If the variable age equals 25, Python will check the first condition
(age < 13). This will evaluate to False. Python will then check the
first elif, which contains the condition (age < 20). This will evaluate
to False, so Python will check the second elif condition, (age < 30).
This will evaluate to True, so Python will execute the corresponding
statement and print 'young adult'. Python will then stop execution.
4.5.2 Loops
A loop executes an action while a condition is met, then terminates.
while Loops
As its name suggests, a while loop executes while a condition is true.
In Python, the while loop first checks whether its condition is true. If
it is, the loop takes some actions and then rechecks the condition. If the
condition is still true, the loop executes the actions again. This process
repeats until the condition is false.5
The syntax of a while loop is shown below. The colon and the
indentation are mandatory. Note that the statements can include if
statements and other while loops.
while condition:
    statement 1
    statement 2
    ...
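For example, the following loop (our own minimal example) counts down from 3; the last statement in the body updates the condition variable, which prevents an infinite loop:

In:
n = 3
while n > 0:
    print(n)
    n = n - 1

Out:
3
2
1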
for Loops
A for loop is very similar to a while loop. It will test a condition and
execute the body of the loop if the condition is satisfied. The difference
is that a for loop will iterate, or loop over, something automatically. In
other words, a for loop is a while loop with some built-in conveniences.
The syntax of a for loop is:
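for variable in sequence:
    statement 1
    statement 2
    ...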
5 A common programming mistake is to accidentally run an "infinite loop". This
results when the actions do not update the condition and the condition always stays
true.
In:
myList = [1, 2, 3, 7]
for x in myList:
    print(x)

Out:
1
2
3
7
The code above creates a list with the values, 1, 2, 3, and 7. The
for loop iterates over the list. It does so by automatically creating
a new variable that we named x. The loop sets x to the first item in
the list, which in this case is 1. It then executes the body of the loop,
which prints x. The loop then sets x to the next thing on the list, 2,
and executes the body of the loop. This continues until the loop has
iterated over every value in the list.6,7
Another common scenario is to execute an action a given number of
times. This is typically accomplished using the built-in range() function.
In the example below, the range function returns the sequence 0, 1, 2,
3, 4, 5, 6. The for loop will iterate over this sequence and print every
value in the sequence.
In:
for i in range(7):
    print(i)
6 It is possible to implement the same functionality through a while loop. However,
doing so requires more machinery. The code would need to keep track of the index of
each item in the list and ensure that the number of iterations exactly equals the list
length. Additionally, the code would need to use indexing to retrieve the list element
that corresponds to the iteration number. The for loop is much simpler and less
error-prone.
7 When iterating over a list, you may need the list indexes as well as the elements.
Python provides the enumerate function for this purpose. We provide an example of
this function in section 7.3. We do not discuss this function here since it requires
knowledge of lists (section 4.7.1) and tuples (section 4.7.2).
Out:
0
1
2
3
4
5
6
4.6 Functions
8 Under the hood, programming languages keep track of the memory address of
the first element of a collection. The index is then used as an offset to that memory
address. That is why the index 0, which implies an offset of 0, returns the first
element.
9 The terms package and library are synonyms for module and we will use these
terms interchangeably.
def MyAverage(x, y):
    return (x + y) / 2
A function definition begins with the keyword def, which stands for
define. Following the def keyword is a name for the function. Next is an
argument list; this list must be enclosed in parentheses and separated
by commas. Our MyAverage function takes two arguments, x and y.
These arguments will create variables, but these variables will only “live”
inside the function. The first line of the function definition must end
with a colon. The next lines are the body of the function and they must
be indented. Python uses the indentation to determine which statements
constitute the “body” of the function. Python will execute the body of
the function, line by line, until it reaches a return statement, or until it
executes the last indented statement. A return statement, if provided,
“returns” a value to whatever line of code called the function.
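For example, calling MyAverage with the arguments 4 and 10 returns their average:

In:
print(MyAverage(4, 10))

Out:
7.0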
In:
def SampleFunction():
    x = 0   # this x lives only inside the function
    print(f'Inside SampleFunction, x is {x}')

x = 15
SampleFunction()
print(f'Outside SampleFunction, x is {x}')
Out:
Inside SampleFunction, x is 0
Outside SampleFunction, x is 15
In:
L = [3, 8, 9, 5]   # 8 and 9 are the only elements greater than 7
filteredList = list(filter(lambda x: x > 7, L))
print(filteredList)

Out:
[8, 9]
The above example creates a list L. The second line of code extracts
all elements of L greater than 7 and saves them to a new list. We used a
lambda expression to perform the filtering. The expression, lambda x:
x > 7, begins with the lambda keyword and is followed by an argument
x and a colon.10 The body of the function follows the colon. In this case,
the function body is simply x > 7, which evaluates to a Boolean. The
filter() function applies the anonymous function to every element
of the list L and keeps list elements for which the anonymous function
evaluates to True. Notice that the anonymous function is created inline,
used once, and discarded.
10 Lambda expressions allow multiple arguments. Simply separate the arguments
with commas.
import math
That statement tells Python to load all of the functions from its math
module into the environment. After we execute this import statement,
we can use any of the functions in the module. For example, the following
code computes the factorial of a number using a function from the math
library.
In:
import math
print(math.factorial(5))
Out:
120
from math import factorial

This code imports specific functions from the math library. Functions imported this way do not require the math. prefix. If someone wishes to import all functions from a library, they can use something like the following:

from math import *
import statistics as st

mydata = [1, 3, 5, 7, 9]
my_median = st.median(mydata)
print(f'The median of my data is {my_median}')

Out:
The median of my data is 5
The above code tells Python to use st as an alias for the statistics
module. An alias can be any valid variable name.
Out:
Operating Income for FY 2019 was 12.4 billion, up more than eight percent from Operating Income in FY 2018.
Positional Arguments
If we do not tell it otherwise, re.sub assumes that the first argument is
the search text, the second argument is the replacement text, and the
third argument is the text to search. Functions called in this manner rely
on positional arguments: the position of the argument in the function
call has meaning. When using positional arguments, we must read the function's documentation to learn the expected order of the arguments.
11 In the remainder of this section, we will work with the function re.sub from
Python’s built-in regular expression library. Regular expressions are a powerful tool
for finding patterns in text. We introduce regular expressions in Chapter 6.
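For example, the following call passes all three arguments positionally (the sentence stored in text is the one from the earlier output):

In:
import re

text = ("OI for FY 2019 was 12.4 billion, up more than "
        "eight percent from OI in FY 2018.")
pretty_text = re.sub("OI", "Operating Income", text)
print(pretty_text)

Out:
Operating Income for FY 2019 was 12.4 billion, up more than eight percent from Operating Income in FY 2018.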
Keyword Arguments
In the modified code, we explicitly told Python that the value of the
pattern argument is "OI", the value of repl is "Operating Income",
and the value of string is text. A function called this way relies on
keyword arguments. Keyword arguments provide many advantages over
positional arguments. Code readability is increased since the arguments
are clearly specified. Additionally, arguments can be passed in any order.
Thus, this function call would yield an identical result:
pretty_text = re.sub(string=text,
                     repl="Operating Income",
                     pattern="OI")
Out:
'Operating Income for FY 2019 was 12.4 billion, up more than eight percent from Operating Income in FY 2018.'
Out:
File "<ipython-input-37-1a96811d74d8>", line 1
    re.sub(pattern="OI", "Operating Income", text)
                                                 ^
SyntaxError: positional argument follows keyword argument
4.7.1 Lists
Creating a List
Create a list by enclosing data inside square brackets and separating
each data item with a comma. Spaces between commas are optional.
The following example creates a list and saves the list into a variable, L.
The list contains three elements, a string, a float, and a list. Note that
the type of the variable is list. Also note that it is possible to nest
lists within lists.
In:
L = ['GM', -3.14, [1, 2, 3]]
type(L)
Out:
list
In:
L = [1, 2, 3, 4]
doubles = [x*2 for x in L]
print(doubles)

Out:
[2, 4, 6, 8]
Notice that the new list, doubles, is created in one line of code.
The code inside the square brackets, [x*2 for x in L], is a list com-
prehension. The general form of a list comprehension is:
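new_list = [expression for item in list]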
In:
L[1]
Out:
-3.14
In:
L[2]

Out:
[1, 2, 3]
Slicing
Python makes it very easy to retrieve more than one element of a list.
This process is called slicing. In our opinion, slicing is one of the most
useful features of Python.
The syntax for slicing a list is:
list_name [ start : end : step ]
where start is the index of the first element we wish to retrieve (inclusive
lower bound), end is one more than the index of the last element we
wish to retrieve (exclusive upper bound), and step is the step size, or
gap between indexes. Note that start, end, and step are optional. If
omitted, Python assumes 0 for start, the list length for end, and 1 for
step. Slicing is best illustrated with examples:
In:
L = ['a', 'b', 'c', 'd', 'e', 'f']
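For example, given the list above, the following slices (our own illustrations) behave as described:

In:
print(L[1:4])   # indexes 1 through 3
print(L[:2])    # start omitted, so it defaults to 0
print(L[::2])   # every second element

Out:
['b', 'c', 'd']
['a', 'b']
['a', 'c', 'e']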
Lists are mutable; assigning to an index changes the element.

In:
L[1] = 6
print(L)

Out:
['a', 6, 'c', 'd', 'e', 'f']

The + operator concatenates lists.

In:
print(['a', 'b', 'c'] + [1, 2, 3])

Out:
['a', 'b', 'c', 1, 2, 3]
Duplicating Lists
The * operator makes copies of a list and concatenates them into a new
list.
In:
L = [1, 2, 3]
L * 3
Out:
[1, 2, 3, 1, 2, 3, 1, 2, 3]
Copying Lists
To demonstrate list copying, we must first introduce the concepts of
shallow and deep copies. Say that a list is stored in the variable L. The
statement L2 = L creates a shallow copy. It creates a new symbol in the
environment, L2, but that symbol points to the same data as L. Thus,
a change made to L2 will affect L. To see this, consider the following
code:
In:
L = [1, 2, 3]
L2 = L
L2[0] = 'text'
print(f'L = {L}')
print(f'L2 = {L2}')

Out:
L = ['text', 2, 3]
L2 = ['text', 2, 3]
To make a deep copy of a list, you must use a list’s copy method.
This method duplicates the list in memory and prevents behavior like
that shown in the previous example. The following code demonstrates
the copy method of list.
In:
L = [1, 2, 3]
L2 = L.copy()
L2[0] = 'text'
print(f'L = {L}')
print(f'L2 = {L2}')

Out:
L = [1, 2, 3]
L2 = ['text', 2, 3]
In:
L = [1, 2, 3]
L.append('cat')
print(L)

Out:
[1, 2, 3, 'cat']
Python also provides the insert and remove methods to insert and
remove list elements. L.insert(i, x) inserts x at the index given by
i. L.remove(x) removes the first item from L where L[i] is equal to x.
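For example (our own minimal illustration):

In:
L = [1, 2, 3]
L.insert(0, 'dog')   # insert 'dog' at index 0
L.remove(3)          # remove the first occurrence of 3
print(L)

Out:
['dog', 1, 2]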
4.7.2 Tuples
A tuple is an immutable list - it cannot be changed after it is created. The
syntax for tuples is nearly identical to that for lists. The main difference
is that tuples use parentheses ( ) whereas lists use square brackets
[ ]. The process for retrieving an element from a tuple is identical to
that for lists. We mention tuples because many Python functions either
require tuples as arguments or return tuples. For example, the Pandas
DataFrame, which we introduce in the next chapter, stores a table of
data. Pandas provides a function that retrieves the dimensions of a
DataFrame and this function returns a tuple containing the number of
rows and columns.
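For example (a minimal illustration of ours):

In:
t = ('GM', 2019, 39.06)
print(t[0])     # retrieval works exactly as with lists
print(len(t))

Out:
GM
3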
4.7.3 Dictionaries
A dictionary is a list of key-value pairs. Dictionaries are very useful
data structures since they allow users to assign data items (values) to
keys of their choice. This makes it easier to store and retrieve data. For
example, consider the following code that stores an income statement
as a dictionary.
income_stmt = {'Revenue': 100,
               'COGS': 52,
               'Gross margin': 48,
               'SG&A': 40,
               'Net Income': 8}
In:
income_stmt['Revenue']

Out:
100
In many ways, Python strings behave like lists. Strings can be sliced,
joined using the + operator, duplicated using the * operator, just like
lists. That is why this section appears after the section on lists.
In:
s = 'Hello world.'
print(list(s))

Out:
['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '.']
In:
print(s[-1])   # last character
print(s[:5])   # first five characters

Out:
.
Hello
In:
'Hello' + ' ' + 'World.'

Out:
'Hello World.'
Repeating Strings
Use the * operator to repeat a string.

In:
s1 = 'Hello'
s1 * 3
Out:
'HelloHelloHello'
In:
s = 'Hello world.'
print(s.upper())
print(s.lower())

Out:
HELLO WORLD.
hello world.
A common need when cleaning text data is to remove white space from
strings. Python provides the strip method for this purpose. strip
removes all white space characters from the beginning and end of a
string. White space includes space, tab, and newline characters.
In:
s = '   Dirty string with unnecessary spaces at beginning and end.   '
s.strip()

Out:
'Dirty string with unnecessary spaces at beginning and end.'
Notice that strip removed the spaces from the beginning and the
end of the string. If we wish to remove white space from only the
beginning (end) of a string, use lstrip (rstrip).
By default, these methods remove white space, but we can pass an
optional argument to tell these methods which characters to strip.
In:
s = 'Hello.'
s.rstrip('.')

Out:
'Hello'
These functions return a new string and do not modify the original
string.
find
find searches for a substring within a string. If the substring is found,
find returns the index of the first occurrence. If the substring is not
found, find returns -1.
In:
s = 'text analysis'
s.find('xt a')

Out:
2
replace
Use replace to replace one substring with another. By default, it
replaces all occurrences of a substring. An optional count argument
allows us to specify the number of replacements.
In:
s = 'text analysis'
s.replace('text', 'TEXT')

Out:
'TEXT analysis'
This function returns a new string and does not modify the original
string.
split
The split method splits a string and returns a list of substrings. By
default, it splits using spaces but we can tell Python which delimiter
we wish to use.
In:
s = 'text analysis is fun.'
s.split()

Out:
['text', 'analysis', 'is', 'fun.']
The Pandas library contains two main objects, the DataFrame and the
Series.
Figure 5.1: Sample DataFrame containing selected Compustat data for General
Motors Corporation. GVKEY is the unique Compustat identifier. FYEAR is the fiscal
year. TIC is the ticker symbol. IB is income before extraordinary items. PRCC_F is
closing stock price at the end of the fiscal year.
Figure 5.2: Sample Series containing selected Compustat data for General Motors
Corporation. This Series is the fourth (IB) column of the DataFrame depicted in
figure 5.1 above.
The most common method of importing the Pandas library is with this
call to the import statement. It is common practice to use the alias pd
for Pandas.
import pandas as pd
import numpy as np
import pandas as pd

df = pd.read_excel('Ch5_Data.xlsx',
                   sheet_name='MSFT No Header',
                   header=None,
                   names=['Fiscal Year', 'Net Income'])  # illustrative column names
This code imports data from the worksheet MSFT No Header in the
Excel file Ch5_Data.xlsx. The output is shown in Figure 5.4.
Figure 5.4: Sample DataFrame containing selected financial data for Microsoft
Corporation. The raw data lacked a header row. The keyword argument header=None
informed pd.read_excel that the data lacked a header row. Column names were
supplied using the names keyword argument.
Many Excel users place blank lines and text above the data in a work-
sheet. To handle this use case, use the optional keyword argument
skiprows when calling pd.read_excel. To skip the first n lines of the
file, use skiprows=n. Alternatively, to skip specific rows, pass a list of
row numbers; unlike Excel, skiprows assumes row numbers begin at
zero.
The following sample code skips the first five lines from the appro-
priate worksheet of the data file that accompanies this chapter. An
alternative to skiprows=5 is skiprows=[0,1,2,3,4].
# Assume pandas has already been imported, and the Excel
# file is located in the same folder as the code file.
df = pd.read_excel('Ch5_Data.xlsx',
                   sheet_name='MSFT Extraneous Lines',
                   skiprows=5)
2 Regular expressions are discussed in Chapter 6. We explain the other lines of
code in more detail later in this chapter.
df['Income'] = df['Income'].str.replace('(', '-')
# Convert column to float
df['Income'] = df['Income'].astype(float)
Parsing Dates
The pd.read_csv function can parse dates. Simply tell the function which columns contain dates through the parse_dates keyword argument, and Pandas usually imports the dates correctly. To see this, consider the following sample data file, Ch5_Dates.csv, that contains stock information for General Motors Corporation:
Date1,Date2,Date3,Company Name,Ticker,Closing Price
4-8-2019,8/4/2019,08 Apr 19,General Motors,GM,39.06
4-9-2019,9/4/2019,09 Apr 19,General Motors,GM,38.86
4-10-2019,10/4/2019,10 Apr 19,General Motors,GM,39.25
To import the Date1 and Date3 columns as date values, and not
text strings, use the following code.
dfDates = pd.read_csv('Ch5_Dates.csv',
                      parse_dates=['Date1', 'Date3'])
Use the function pd.read_stata to read Stata .dta files. At the time
of this writing, the latest version of Pandas (version 1.1.2), supports
Stata files up to and including version 16. If, for some reason, you have
difficulty reading a Stata file into Pandas, simply export it from Stata
to Excel or CSV format and then read it into Pandas.
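For example (the file name below is hypothetical):

df = pd.read_stata('my_data.dta')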
Use the function pd.read_sas to read SAS files. This function can
read SAS xport (.XPT) files and SAS7BDAT files. In our experience,
this function can be finicky. If you have a valid SAS license and SAS
installation on your computer and plan to regularly pass data between
SAS and Python, we highly recommend the package SASPy. This
package was written by The SAS Institute and is officially supported.
In:
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({'X': [1, 2, 3, 4],
                   'X_sq': [1, 4, 9, 16]})   # values match the output below
print(f'Series s:\n{s}\n')
print(f'DataFrame df:\n{df}')

Out:
Series s:
0 1
1 2
2 3
3 4
dtype : int64
DataFrame df:
X X_sq
0 1 1
1 2 4
2 3 9
3 4 16
The following code operates on one of the example files that ac-
companies this chapter. It loads a data file containing selected finan-
cial data for Microsoft Corporation and then selects a subset of the
columns. Specifically, the code df[['Fiscal Year', 'Net Income']]
selects the Fiscal Year and Net Income columns and returns a new
DataFrame containing those columns. We then show the last three rows
(using the tail method).
In:
import numpy as np
import pandas as pd

df = pd.read_excel('Ch5_Data.xlsx', sheet_name='MSFT Clean')
df2 = df[['Fiscal Year', 'Net Income']]
df2.tail(3)
Out:
Fiscal Year Net Income
31 2017 21204.0
32 2018 16571.0
33 2019 39240.0
In:
print(type(df2))
print(type(df2['Net Income']))

Out:
< class ' pandas . core . frame . DataFrame ' >
< class ' pandas . core . series . Series ' >
Suppose we wish to keep only the rows in which the Fiscal Year column is greater than 2014. To filter the data this way, we need to take the Fiscal Year column and compare every value to 2014. Let us see what happens when we do that:
In:
import numpy as np
import pandas as pd

df = pd.read_excel('Ch5_Data.xlsx', sheet_name='MSFT Clean')
df['Fiscal Year'] > 2014
Out:
0 False
1 False
...
32 True
33 True
Name: Fiscal Year, dtype: bool
In the above code, Pandas took the column Fiscal Year (which is
a Series) and compared every value in the column to 2014. It returned
a new Series of Boolean values. Each value corresponds to one fiscal
year in the data. If the fiscal year is greater than (less than) 2014, the
corresponding value in the returned Series is True (False). The new
Series has exactly the same length as the original DataFrame.
In:
df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': [1, 2, 3]})   # a small example DataFrame
print(f'df, unfiltered:\n{df}')

Out:
df, unfiltered:
  col1  col2
0    a     1
1    b     2
2    c     3
Out:
Company Name Ticker Closing Price
0 Microsoft MSFT 138.43
Table 5.1: Pandas logical operators for joining multiple conditions and their Python
equivalents.
Joining Conditions
When filtering a Pandas DataFrame with multiple conditions, each
condition must be enclosed in parentheses. Briefly, the reason for this
is the order of operations. If the parentheses are omitted, Pandas will
attempt to perform operations in the wrong order. This will cause an
error or yield incorrect results.
The following examples work with the Microsoft financial data that
accompanies this chapter. The first example filters the data, so it only
includes firm-year observations for which fiscal year is greater than 2013
and common equity exceeds $80 billion. The second example filters the
data so it includes fiscal years before 1990 and after 2015.
In:
import pandas as pd
# Assumes the Excel file is in the same folder
df = pd.read_excel('Ch5_Data.xlsx',
                   sheet_name='MSFT Clean')

# fiscal year after 2013 AND common equity above $80 billion
# (the equity column name below is illustrative)
df1 = df[(df['Fiscal Year'] > 2013) & (df['Common Equity'] > 80000)]

# fiscal year before 1990 OR after 2015
df2 = df[(df['Fiscal Year'] < 1990) | (df['Fiscal Year'] > 2015)]
Pandas provides the .loc function that allows users to filter rows and
columns simultaneously. Consider the following code that filters the
Microsoft financial data. This code restricts years 2011 and up and
retrieves only the assets and equity columns.
import pandas as pd
# Assumes the Excel file is in the same folder
df = pd.read_excel('Ch5_Data.xlsx',
                   sheet_name='MSFT Clean')

# rows for fiscal years 2011 and up; assets and equity columns only
# (the column names below are illustrative)
df.loc[df['Fiscal Year'] >= 2011, ['Assets - Total', 'Common Equity']]
Notice that .loc has the syntax of an indexer (i.e., square brackets
[]), not that of a function. The first argument is the indexes of the
desired rows and the second argument is the names of the desired
columns. If the second argument is a list, .loc returns a DataFrame.
However, if the second argument is the name of a single column, not in
a list, then .loc returns a Series.
Pandas provides many other related functions. .iloc behaves simi-
larly but accepts zero-based index numbers for rows and columns. .at
and .iat allow users to retrieve a single value from a DataFrame. We
do not describe these accessors in detail and leave it to the reader to
consult the Pandas documentation.
In:
df = pd.DataFrame({'col1': [1, 2, 3]})
df['newcol'] = 5   # a scalar is broadcast to every row
print(df)

Out:
col1 newcol
0 1 5
1 2 5
2 3 5
In:
df = pd.DataFrame({'col1': [1, 2, 3]})
df['newcol'] = ['a', 'b', 'c']
print(df)
Out:
col1 newcol
0 1 a
1 2 b
2 3 c
Let us analyze the code on the right-hand side of the equals sign,
df['City'].str.strip(). That tells Pandas to take the City column
as a Series, run the string method strip on every value in the column,
and return a new Series.6 We then save that new Series in the existing
column df['City'].
Note that we used a for loop to iterate over the columns, and an
f-string to create a new column name from an existing column name.
This is analogous to creating a new variable in a SAS data step. Note that the second example combines values from
four columns, as well as three strings. It also uses the Pandas astype
method to convert a Series of integers to a Series of strings.
# Compute ROA
df['ROA'] = df['Net Income'] / df['Assets - Total']
In:
df = pd.DataFrame({'X': [1, 2, 3]})
df['Xsq'] = df['X'] ** 2
print(df)

Out:
X Xsq
0 1 1
1 2 4
2 3 9
In:
df = pd.DataFrame({'Y': ['b', 'a', 'a', 'b'],
                   'X': [2, 4, 3, 1]})
# sort order chosen to reproduce the output below
df.sort_values(by=['Y', 'X'], ascending=[False, True])

Out:
Y X
3 b 1
0 b 2
2 a 3
1 a 4
In:
dfLeft = pd.DataFrame({'myIndex': [1, 2, 3],
                       'LeftVals': [10, 20, 30]})
dfRight = pd.DataFrame({'myIndex': [1, 3, 5],
                        'RightVals': [100, 300, 500]})
# values for unmatched rows are illustrative
pd.merge(dfLeft, dfRight, on='myIndex')

Out:
myIndex LeftVals RightVals
0 1 10 100
1 3 30 300
Note that the merge function creates and returns a new DataFrame.
• re.search(pattern, string) scans through a string, looking for the first location where the regex pattern matches; the output is a Match object if there is a match, or None otherwise;
• re.findall(pattern, string) finds all substrings where the
regex pattern matches and returns them as a list;
• re.split(pattern, string) splits a string at every match of
the regex pattern and returns a list of strings. For example, one
can retrieve the individual words in a sentence by splitting at the
spaces;
• re.sub(pattern, repl, string) finds all matches of pattern
in string and replaces them with repl.
Regular expression patterns can be specified in either single or double quotation marks, '' and "". It is good practice to insert the letter "r" before regex patterns in Python's re operations (e.g., re.search(r'pattern', string)). Prefixing with "r" indicates that backslashes \ should be treated literally and not as escape characters in Python. In other words, the "r" prefix indicates that the string is a "raw string." For example, the following code demonstrates how to match a basic regex r"OI" in a sentence.
In:
# load Python's regular expressions module
import re

text = ("OI for FY 2019 was 12.4 billion, up more than "
        "eight percent from OI in FY 2018.")

# returns a Match object of the first match, if it exists
x1 = re.search(r"OI", text)
Out:
<re.Match object; span=(0, 2), match='OI'>
['OI', 'OI']
['OI for FY 2019 was 12.4 billion', ' up more than eight percent from OI in FY 2018.']
Operating Income for FY 2019 was 12.4 billion, up more than eight percent from Operating Income in FY 2018.
Out:
['MD&a', 'md&A']
['md&a', 'md&a']
There are many excellent online resources on regular expressions, e.g.:
http://www.regular-expressions.info/,
https://docs.python.org/3/howto/regex.html,
https://docs.python.org/3.4/library/re.html,
https://www.w3schools.com/python/python_regex.asp.
Also, there are many interactive websites that allow users to test their regular expressions for correctness. For instance, https://regex101.com/ allows users to perform regex testing on sample texts; it also provides useful explanations of what each regex element captures.
Out:
['7', '0', '2', '0', '1', '9']
['7', '0', '%', '2', '0', '1', '9']
['70%']
In:
import re

date = "09/14/2020"

# specify three named groups: 'Month', 'Day', and 'Year'
regex = r"(?P<Month>\d{1,2})/(?P<Day>\d{1,2})/(?P<Year>\d{2,4})"

# identify regex matches in date
date_matches = re.search(regex, date)

print(f"Month: {date_matches.group('Month')}")
print(f"Day: {date_matches.group('Day')}")
print(f"Year: {date_matches.group('Year')}")

Out:
Month: 09
Day: 14
Year: 2020
Out:
['70%', '9%', '12%', '12.5 percent']
print(cik)
print(company_name)
print(filing_date)
print(sic)

Out:
['0000080424']
['PROCTER & GAMBLE Co']
['20181019']
['2840']
print(risk_words)
print(text_riskiness)

Out:
['risk', 'risks', 'risks', 'Risk', 'risks', 'risky', 'riskiness', 'risks']
11.428571428571429
A common first step is to convert both the input dictionary and the text to lowercase so that capitalization does not affect the word count.2,3
If an input dictionary includes base form words only, then counting
the frequencies of such words in a text will result in significant under-
statement of the ‘true’ word count in the document. This will happen
because a simple regular expression in the form r'\b' + word + r'\b'
will find matches of base words only, and all words that have different
endings will be ignored. For example, if an input dictionary contains
‘damage’ (and no other words with the same beginning) as one of its
negative words, then regex r'\bdamage\b' will return zero matches in
a sentence “Our business could be damaged” because the ending ‘ed’ is
not specified in the regex pattern.
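The following short demonstration (ours) shows both behaviors:

In:
import re

sentence = 'Our business could be damaged'
print(re.findall(r'\bdamage\b', sentence))     # base form only: no match
print(re.findall(r'\bdamage\w*\b', sentence))  # allowing endings: matches

Out:
[]
['damaged']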
There are several ways to deal with this problem. First, we can
develop a more complex regex that will allow different endings in regex
word matching. This approach works relatively well; however, we have
to be careful with matching words that may have different meanings
depending on the word ending (e.g., careful vs. careless). Second, we
can perform word stemming or lemmatization of input text documents,
so that all input document words are in their base form.4 This approach
also works well; however, stemming or lemmatization are not always
100% accurate, introducing some noise in the subsequent word count.
Finally, we can modify the original input dictionary by manually adding
derivative words to the dictionary, i.e., including ‘damage’, ‘damages’,
‘damaging’, ‘damaged’ as negative words in the dictionary. This approach
is feasible if working with relatively short lists of words, but may become
increasingly costly if working with thousands of words in the dictionary.
We have to be even more careful when input dictionaries contain
both single words and multi-word phrases. Then, in addition to dif-
ferent word endings, we should consider whether other words may be
present in the middle of a given phrase. For example, if an input dic-
tionary contains 'economic environment' as one of its entries, then a regex matching only that exact phrase will miss variants in which other words intervene (e.g., 'economic and regulatory environment').
2 Python's string.lower() method converts all characters in an input string into
lowercase characters (e.g., "Higher returns".lower() returns “higher returns”).
3 Pandas' .str.lower() method converts all characters in a series/column of data
to lowercase (e.g., df['colname'] = df['colname'].str.lower() replaces text in
column colname of DataFrame df with its lowercase equivalent).
4 We cover stemming and lemmatization methods in Section 7.4.
5 Again, we highly recommend testing all regex expressions prior to performing large-scale counts. https://regex101.com/ is a useful website for that.

Out:
['We', 'invested', 'in', 'six', 'areas', 'of', 'the', 'business', 'that',
 'account', 'for', 'nearly', 'of', 'total', "Macy's", 'sales', 'Dresses',
 'fine', 'jewelry', 'big', 'ticket', "men's", 'tailored', "women's",
 'shoes', 'and', 'beauty', 'these', 'investments', 'were', 'aimed', 'at',
 'driving', 'growth', 'through', 'great', 'products', 'top-performing',
 'colleagues', 'improved', 'environment', 'and', 'enhanced', 'marketing',
 'All', 'six', 'areas', 'continued', 'to', 'outperform', 'the', 'balance',
 'of', 'the', 'business', 'on', 'market', 'share', 'return', 'on',
 'investment', 'and', 'profitability', 'And', 'we', 'capture',
 'approximately', 'of', 'the', 'market', 'in', 'these', 'categories']
Out:
1 "We invested in six areas of the business that account for nearly 40% of total Macy's sales."
2 "Dresses, fine jewelry, big ticket, men's tailored, women's shoes and beauty, these investments were aimed at driving growth through great products, top-performing colleagues, improved environment and enhanced marketing."
3 "All six areas continued to outperform the balance of the business on market share, return on investment and profitability."
4 "And we capture approximately 9% of the market in these categories."
import spacy

# load the English language model in spacy
nlp = spacy.load('en_core_web_sm')

# create an "nlp" object that parses a textual document
a_text = nlp(text)
Out:
['We', 'invested', 'in', 'six', 'areas', 'of', 'the', 'business', 'that',
 'account', 'for', 'nearly', '40', '%', 'of', 'total', 'Macy', "'s",
 'sales', '.', 'Dresses', ',', 'fine', 'jewelry', ',', 'big', 'ticket',
 ',', 'men', "'s", 'tailored', ',', 'women', "'s", 'shoes', 'and',
 'beauty', ',', 'these', 'investments', 'were', 'aimed', 'at', 'driving',
 'growth', 'through', 'great', 'products', ',', 'top', '-', 'performing',
 'colleagues', ',', 'improved', 'environment', 'and', 'enhanced',
 'marketing', '.', 'All', 'six', 'areas', 'continued', 'to', 'outperform',
 'the', 'balance', 'of', 'the', 'business', 'on', 'market', 'share', ',',
 'return', 'on', 'investment', 'and', 'profitability', '.', 'And', 'we',
 'capture', 'approximately', '9', '%', 'of', 'the', 'market', 'in',
 'these', 'categories', '.']
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
Out:
Stemming for 'increasing' is increas
Stemming for 'increases' is increas
Stemming for 'increased' is increas
Lemmatization for 'increasing' is increase
Lemmatization for 'increases' is increase
Lemmatization for 'increased' is increase
Out:
"We deliv adjust earn per share of $2.12. for the year, compar sale were down 0.7% on an own plu licens basi, and we deliv adjust earn per share of $2.91."
positive_dict_regex = create_dict_regex_list(positive_words_dict)
negative_dict_regex = create_dict_regex_list(negative_words_dict)
Out:
[re.compile('\\bable\\b'), re.compile('\\babundance\\b'), re.compile('\\babundant\\b')]
[re.compile('\\babandon\\b'), re.compile('\\babandoned\\b'), re.compile('\\babandoning\\b')]
Out:
(114 , 7 , 0 , 6.140350877192983)
# negation-aware regexes (the variable name is ours)
negated_dict_regex = [re.compile(r"(not|never|no|none|nobody|nothing|don\'t|doesn\'t|won\'t|shan\'t|didn\'t|shouldn\'t|wouldn\'t|couldn\'t|can\'t|cannot|neither|nor)?\s(" + term + r")\b") for term in dict_terms]
Out:
re . compile ( " ( not | never | no | none | nobody | nothing | don \\ ' t |
doesn \\ ' t | won \\ ' t | shan \\ ' t | didn \\ ' t | shouldn \\ ' t |
wouldn \\ ' t | couldn \\ ' t | can \\ ' t | cannot | neither | nor )
?\\ s ( able ) \\ b " )
re . compile ( " ( not | never | no | none | nobody | nothing | don \\ ' t |
doesn \\ ' t | won \\ ' t | shan \\ ' t | didn \\ ' t | shouldn \\ ' t |
wouldn \\ ' t | couldn \\ ' t | can \\ ' t | cannot | neither | nor )
?\\ s ( abandon ) \\ b " )
# Positive Words #
# To account for negators, we can separately count
# positive and negated positive words
positive_word_count = 0
negated_positive_word_count = 0

# Then, Tone is:
tone = 100 * (positive_words_sum - negative_words_sum) / total_words_count
return (total_words_count, positive_words_sum,
        negative_words_sum, tone)
Out:
('', 'advantage')
('', 'advantage')
('', 'advantage')
('', 'leading')
('not', 'pleased')
Out:
Number of words in text: 143

Out:
Number of sentences in text: 5
The fog index (or Gunning fog index) is a numerical score assigned to an
input text where larger values indicate greater difficulty of reading the
text. Reading levels by grade approximately correspond to the following
fog index scores:
For a given piece of text, the score is calculated as the weighted sum
of the average number of words per sentence and the average number
of complex words (i.e., words with three or more syllables) per word:
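$$\text{fog index} = 0.4 \times \left(\frac{\#\text{words}}{\#\text{sentences}} + 100 \times \frac{\#\text{complex words}}{\#\text{words}}\right)$$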
The negative lookahead (?!e$) in the middle part ensures that our regular expression will not capture the vowel e if it is the last character in the given word.
We can use the regular expression above to write a function
count_syllables that counts the number of syllables in a given word,
and function is_complex_word that for a given word returns True if
the word has more than three syllables in it, and False otherwise.
# regex pattern that matches vowels in a word
# (case-insensitive); used for syllable count
re_syllables = re.compile(r'(^|[^aeuoiy])(?!e$)[aeouiy]',
                          re.IGNORECASE)
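A minimal sketch of the two functions, based on the regex above (our own implementation):

def count_syllables(word):
    # each regex match approximates one syllable
    return len(re_syllables.findall(word))

def is_complex_word(word):
    # complex words have three or more syllables
    return count_syllables(word) >= 3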
Out:
Number of syllables in word "Text": 1
Is word "Text" complex?: False
Number of syllables in word "analysis": 4
Is word "analysis" complex?: True
1 Note that in the last line of the calculate_fog definition we use the float function
to convert the integer (i.e., whole digit number) output of the list length function,
len, to a continuous value (i.e., number with fraction). While in Python 3, dividing
integer numbers yields continuous results (see section 4.3.1), in Python 2 the division
operator, /, yields a continuous number only if either the numerator or denominator is a continuous number. For example, since 5.0 and 2.0 are continuous numbers, 5.0 / 2.0 yields 2.5 in Python 2, as expected. Since 5 and 2 are integer numbers, 5 / 2 yields 2 in Python 2; that is, Python 2 yields only the integer (whole number) part of the division when both the numerator and denominator are integers.
Out:
score: 21.78965034965035,
grade_level: 'college_graduate'
2 The Bog index was developed for the StyleWriter software package, designed
to help people improve their writing. See: http://www.stylewriter-usa.com for more
details.
FLS are not guarantees of future performance and different risks and
uncertainties may cause a company’s actual results to differ significantly
from management’s expectations.
To identify forward-looking statements, we will use the methodology
developed in Muslu et al. (2015).1 Specifically, Muslu et al. (2015)
classify a sentence as a forward-looking statement if it includes one of
the following:
• references to the future: will, future, next fiscal, next month,
next period, next quarter, next year, incoming fiscal, incoming
month, incoming period, incoming quarter, incoming year, com-
ing fiscal, coming month, coming period, coming quarter, coming
year, upcoming fiscal, upcoming month, upcoming period, upcom-
ing quarter, upcoming year, subsequent fiscal, subsequent month,
subsequent period, subsequent quarter, subsequent year, following
fiscal, following month, following period, following quarter, and
following year;
• future-oriented verbs and their conjugations: aim, anticipate, as-
sume, commit, estimate, expect, forecast, foresee, hope, intend,
plan, project, seek, target, etc.;
• reference to a year that comes after the year of the filing. For
example, 2022 if the filing’s fiscal year is 2020.
Therefore, to automatically capture a forward-looking statement,
we need to write code that tests whether at least one of the three FLS
conditions is true. We will start with generating regular expressions
that correspond to future-oriented terms as per Appendix “Identifying
Forward-Looking Disclosures” in Muslu et al. (2015). To facilitate
exposition of the code, we include explanatory comments in every line.
In:
import re
1 Bozanic et al. (2018) is another excellent example of FLS classification. The
methodology we review here can be used to replicate FLS measures as per Bozanic
et al. (2018).
def is_fls(sentence):
    # returns True if the sentence contains at least
    # one FLS term (the function name is ours)
    for fls_term in fls_terms_with_future_years:
        # fls_term.search(sentence) returns a match
        # object if there is a match, and None
        # if there is no FLS term match in the sentence
        if fls_term.search(sentence):
            return True
    return False
Out:
False: Finally, we launched a completely new website experience for Atlanta.
False: The new online experience provides a modern and fresh brand look and includes enhanced simplicity and flexibility for shopping and buying that easily transitions to a home delivery or in-store experience.
False: We are excited to put the customer in the driver seat.
False: This experience is a unique and powerful integration of our own in-store and online capabilities.
True: Keep in mind, we will continue to improve both the customer and associate experience in Atlanta and use these learnings to inform how we roll out into other markets.
    return sentences

earn_terms = ["earnings", "EPS", "income", "loss",
              "losses", "profit", "profits"]
quant_terms = ["thousand", "thousands", "million",
               "millions", "billion", "billions",
               "percent", "%", "dollar", "dollars",
               "$"]

# input text
text = """Operating income margins, excluding the
restructuring charges, are projected to be in the
range of 4.5% to 4.8%, and interest expense and
other income are forecasted to be approximately
$18 million and $6 million, respectively. While
operating performance is expected to remain
strong, Agribusiness profits are expected to be
lower in the third and fourth quarters as pricing
for subsequent sales will not match the high level
# a sample text
text = """Q1 revenue reached $12.7 billion. We are
thrilled with the continued growth of Apple Card.
We experienced some product shortages due to very
strong customer demand for both Apple Watch and
AirPod during the quarter. Apple is looking at
buying U.K. startup for $1 billion."""
2 See Chapter 7 for instructions on how to install the spacy library.
Out:
Sentence 1: Q1 revenue reached $12.7 billion.
Sentence 2: We are thrilled with the continued growth of Apple Card.
Sentence 3: We experienced some product shortages due to very strong customer demand for both Apple Watch and AirPod during the quarter.
Sentence 4: Apple is looking at buying U.K. startup for $1 billion.
# print subjects and objects for each sentence
for i, sentence in enumerate(sentences):
    print(f'Sentence {i + 1}:', sentence_subj_obj(sentence))
Out:
Sentence 1: [{'Token': 'revenue', 'Dependency': 'nsubj'}, {'Token': 'billion', 'Dependency': 'dobj'}]
Sentence 2: [{'Token': 'We', 'Dependency': 'nsubjpass'}, {'Token': 'growth', 'Dependency': 'pobj'}, {'Token': 'Card', 'Dependency': 'pobj'}]
Sentence 3: [{'Token': 'We', 'Dependency': 'nsubj'}, {'Token': 'shortages', 'Dependency': 'dobj'}, {'Token': 'demand', 'Dependency': 'pobj'}, {'Token': 'Watch', 'Dependency': 'pobj'}, {'Token': 'quarter', 'Dependency': 'pobj'}]
Sentence 4: [{'Token': 'Apple', 'Dependency': 'nsubj'}, {'Token': 'startup', 'Dependency': 'dobj'}, {'Token': 'billion', 'Dependency': 'pobj'}]
Out:
[{'Token': 'Q1', 'Lemma_Token': 'Q1', 'POS': 'PROPN', 'Dependency': 'compound', 'Stop_word': False},
 {'Token': 'revenue', 'Lemma_Token': 'revenue', 'POS': 'NOUN', 'Dependency': 'nsubj', 'Stop_word': False},
 {'Token': 'reached', 'Lemma_Token': 'reach', 'POS': 'VERB', 'Dependency': 'ROOT', 'Stop_word': False},
 {'Token': '$', 'Lemma_Token': '$', 'POS': 'SYM', 'Dependency': 'quantmod', 'Stop_word': False},
 {'Token': '12.7', 'Lemma_Token': '12.7', 'POS': 'NUM', 'Dependency': 'compound', 'Stop_word': False},
 {'Token': 'billion', 'Lemma_Token': 'billion', 'POS': 'NUM', 'Dependency': 'dobj', 'Stop_word': False},
 {'Token': '.', 'Lemma_Token': '.', 'POS': 'PUNCT', 'Dependency': 'punct', 'Stop_word': False}]
Out:
Q1             CARDINAL  Numerals that do not fall under another type.
$12.7 billion  MONEY     Monetary values, including unit.
Apple Card     ORG       Companies, agencies, institutions, etc.
Apple Watch    ORG       Companies, agencies, institutions, etc.
AirPod         ORG       Companies, agencies, institutions, etc.
the quarter    DATE      Absolute or relative dates or periods.
Apple          ORG       Companies, agencies, institutions, etc.
U.K.           GPE       Countries, cities, states.
$1 billion     MONEY     Monetary values, including unit.
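Hope et al. (2016) build their specificity measure from counts of named entities relative to document length. A sketch consistent with the figures below, in which the reported score equals the number of words divided by the number of named entities (this is an interpretation of the output, not the authors' exact code):
In:
# counts named entities and (non-punctuation) words in the document
num_entities = len(doc.ents)
num_words = len([t for t in doc if not t.is_punct])
print('Number of named entities:', num_entities)
print('Number of words:', num_words)
print('Specificity score:', num_words / num_entities)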
Out:
Number of named entities: 9
Number of words: 52
Specificity score: 5.777777777777778
9.5 Using Stanford NLP for part-of-speech and named entity recognition tasks
In the code above, we show how to use the spacy library to tokenize text and identify named entities. Another popular set of tools for natural language analysis is Stanford NLP. For example, Hope et al. (2016) use Stanford NLP to calculate their specificity measure. Stanford NLP's official Python library is called Stanza. It includes tools for sentence and word recognition, multi-word token expansion, lemmatization, part-of-speech tagging, dependency parsing, and named entity recognition. Below, we demonstrate how to use Stanford NLP in Python for part-of-speech and NER applications.
Stanza can be installed using either conda or pip as follows:
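For example (a sketch; exact channel names may vary):
In:
# in a terminal or the Anaconda prompt
pip install stanza
# or
conda install -c conda-forge stanza

Once installed, we download the English models, build a processing pipeline, and apply it to the sample text from above (a minimal sketch):
In:
import stanza

# downloads the English models (one-time step)
stanza.download('en')
# builds an English pipeline and processes the sample text
nlp = stanza.Pipeline('en')
doc = nlp(text)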
# extracts sentences
sentences = doc.sentences
Out:
Sentences:
Q1 revenue reached $ ...
We are thrilled with ...
We experienced some ...
Apple is looking at ...
Words:
Q1
revenue
reached
$
12.7
billion
.
Out:
We PRON
are AUX
thrilled ADJ
with ADP
the DET
continued VERB
growth NOUN
of ADP
Apple PROPN
Card PROPN
. PUNCT
Out:
$12.7 billion MONEY
Apple Card ORG
Apple Watch ORG
AirPod ORG
the quarter DATE
Apple ORG
U.K. GPE
$1 billion MONEY
Note that the output of the code above is very similar to the output
of spacy’s NER tool, except for one entity (spacy also recognized “Q1”
as a cardinal numeral).
similarity measures.
Finally, the length of text is an important factor when considering
which similarity metric to use. If an input text is relatively long (five
words or more), it is more appropriate to choose a similarity measure
that operates on a word level. Conversely, if the text is rather short, it
is more appropriate to work with measures that operate on a character
level. In this chapter, we demonstrate how to use both long- and short-
text similarity measures.
There are various similarity measures for relatively long pieces of text
such as the Euclidean distance, cosine similarity, and the Jaccard similar-
ity index. Most accounting and finance studies use the cosine similarity
measure to compare texts. Therefore, in this chapter, we show how to
compute textual cosine similarity in Python.
Note that the order of words, their parts of speech, sentence structure, and other linguistic information are not recorded in bag-of-words vectors.
Cosine similarity between two vectors u and v is defined as the cosine
of angle between these two vectors. It can be calculated as follows:
$$
\cos(u, v) = \frac{u \cdot v}{|u|\,|v|}
           = \frac{\sum_{i=1}^{N} u_i v_i}{\sqrt{\sum_{i=1}^{N} u_i^2}\;\sqrt{\sum_{i=1}^{N} v_i^2}},
$$
where u_i and v_i are the components of the vectors, and |u| and |v| are the lengths of vectors u and v, respectively. Simply put, cosine similarity measures how closely two vectors point in the same direction: values close to 1 indicate a high degree of similarity between the two vectors, and values close to 0 indicate little similarity. Since we can represent texts as vectors using the bag-of-words approach, we can calculate the distance (similarity) between pieces of text.
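As a quick illustration of the formula, we can compute the cosine similarity of two small made-up vectors directly (a sketch using NumPy):
In:
import numpy as np

# two small example vectors
u = np.array([1, 2, 0])
v = np.array([2, 1, 1])
# dot product divided by the product of the vector lengths
print(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))  # 0.7302967...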
NLTK’s word tokenizer extracts words from a given text and outputs them as a list of tokens. However, if the input text includes punctuation or apostrophe characters (e.g., commas, exclamation marks, or apostrophes), NLTK’s word tokenizer yields these characters as separate tokens (in addition to words). When calculating text similarity, we should exclude these punctuation tokens, as they introduce noise into bag-of-words vectors. Conveniently, Python includes a list of punctuation characters; we only need to add the apostrophe character to that list.
In:
# Python includes a collection of all punctuation
# characters
from string import punctuation
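# a sketch of the assumed next steps: append the curly apostrophe
# (common in EDGAR filings) and print the full list
punctuation = punctuation + '’'
print(punctuation)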
Out:
! " # $ %& '() *+ , -./:; <= >? @ [\]^ _ `{|}~ ’
Now, we can write a custom word tokenizer using NLTK's list of stop words and the Porter stemmer:
# imports word tokenizer from NLTK
from nltk import word_tokenize
# imports list of stop words from NLTK
from nltk.corpus import stopwords
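One plausible implementation of the tokenizer (a sketch, not necessarily the authors' exact code): lower-case the text, tokenize it with word_tokenize, drop stop words and punctuation tokens, and stem the remaining words with the Porter stemmer:
In:
# imports the Porter stemmer from NLTK
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def custom_tokenizer(text):
    """Tokenizes text, drops stop words and punctuation tokens,
    and stems the remaining words."""
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(token) for token in tokens
            if token not in stop_words and token not in punctuation]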
Let us demonstrate how this tokenizer works using text excerpts from
business description sections of 10-K filings of three telecommunication
companies:
In:
# excerpt from Verizon Communications Inc. 2018 10-K
doc_verizon = """Verizon Communications Inc. (Verizon
or the Company) is a holding company that, acting
through its subsidiaries, is one of the world’s
leading providers of communications, information
and entertainment products and services to
consumers, businesses and governmental agencies."""
# excerpt from AT&T Inc. 2018 10-K
doc_att = """We are a leading provider of
communications and digital entertainment services
in the United States and the world. We offer our
services and products to consumers in the U.S.,
Mexico and Latin America and to businesses and
other providers of telecommunications services
worldwide."""
# excerpt from Sprint Corporation 2018 10-K
doc_sprint = """Sprint Corporation, including its
consolidated subsidiaries, is a communications
company offering a comprehensive range of wireless
and wireline communications products and services
that are designed to meet the needs of individual
consumers, businesses, government subscribers and
resellers."""
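# applies the custom tokenizer (sketched above) to each excerpt
for document in (doc_verizon, doc_att, doc_sprint):
    print(custom_tokenizer(document))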
Out:
['verizon', 'commun', 'inc.', 'verizon', 'compani', 'hold',
 'compani', 'act', 'subsidiari', 'one', 'world', 'lead', 'provid',
 'commun', 'inform', 'entertain', 'product', 'servic', 'consum',
 'busi', 'government', 'agenc']
['lead', 'provid', 'commun', 'digit', 'entertain', 'servic', 'unit',
 'state', 'world', 'offer', 'servic', 'product', 'consum', 'u.s.',
 'mexico', 'latin', 'america', 'busi', 'provid', 'telecommun',
 'servic', 'worldwid']
['sprint', 'corpor', 'includ', 'consolid', 'subsidiari', 'commun',
 'compani', 'offer', 'comprehens', 'rang', 'wireless', 'wirelin',
 'commun', 'product', 'servic', 'design', 'meet', 'need', 'individu',
 'consum', 'busi', 'govern', 'subscrib', 'resel']
Note that words in the output lists are stemmed and do not include
stop words. Finally, we can use Scikit-learn’s CountVectorizer class
to convert text documents to bag-of-words vectors:
In:
# CountVectorizer converts text to bag-of-words vectors
from sklearn.feature_extraction.text import CountVectorizer
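# a sketch of the vectorization step: builds bag-of-words count
# vectors for the three excerpts using the custom tokenizer
vectorizer = CountVectorizer(tokenizer=custom_tokenizer)
bow_vectors = vectorizer.fit_transform(
    [doc_verizon, doc_att, doc_sprint])
# prints the first ten (alphabetically sorted) vocabulary terms
# and the corresponding count columns
print(sorted(vectorizer.vocabulary_)[:10])
print(bow_vectors.toarray()[:, :10])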
Out:
['act', 'agenc', 'america', 'busi', 'commun', 'compani',
 'comprehens', 'consolid', 'consum', 'corpor']
[[1 1 0 1 2 2 0 0 1 0]
 [0 0 1 1 1 0 0 0 1 0]
 [0 0 0 1 2 1 1 1 1 1]]
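Scikit-learn's cosine_similarity can then compute all pairwise similarities between these vectors; a minimal sketch:
In:
from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine similarities between the three count vectors
print(cosine_similarity(bow_vectors))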
Out:
[[1. 0.44854261 0.40768712]
[0.44854261 1. 0.32225169]
[0.40768712 0.32225169 1. ]]
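The next pair of outputs appears to use TF-IDF weights, which discount terms that occur in many documents. A sketch using scikit-learn's TfidfVectorizer (an assumption on our part, as the listing is not shown):
In:
from sklearn.feature_extraction.text import TfidfVectorizer

# builds TF-IDF vectors with the same custom tokenizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer)
tfidf_vectors = tfidf_vectorizer.fit_transform(
    [doc_verizon, doc_att, doc_sprint])
# prints the first four vocabulary terms and their TF-IDF weights
print(sorted(tfidf_vectorizer.vocabulary_)[:4])
print(tfidf_vectors.toarray()[:, :4])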
Out:
['act', 'agenc', 'america', 'busi']
[[0.22943859 0.22943859 0. 0.13551013]
 [0. 0. 0.23464902 0.13858749]
 [0. 0. 0. 0.13365976]]
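The corresponding cosine similarities are lower than the count-based ones because TF-IDF down-weights terms shared by all three excerpts:
In:
# pairwise cosine similarities between the TF-IDF vectors
print(cosine_similarity(tfidf_vectors))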
Out:
[[1. 0.30593809 0.23499515]
[0.30593809 1. 0.17890296]
[0.23499515 0.17890296 1. ]]
Out:
1
2
Out:
Levenshtein distance: 11
Levenshtein similarity score: 0.7105263157894737
The similarity score between the two versions of the company name is 0.71, indicating a high level of similarity.
Google Word2Vec
Deep Learning with Word2Vec
1 Chapter 12 discusses how to extract specific sections in 10-K filings.
and by extracting individual words from the input text. The code below
summarizes these data preprocessing steps.
In:
import re
import nltk
# downloads NLTK's stopwords module
nltk.download('stopwords')
# imports word tokenizer from NLTK
from nltk import word_tokenize
# imports list of stop words from NLTK
from nltk.corpus import stopwords
# path to the input txt file with Apple's 2018 MD&A
input_file = r".../Apple_MDNA.txt"
2 Python's Gensim library offers an excellent implementation of Latent Dirichlet Allocation.
Gensim can be installed using either conda or pip as follows:
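In:
# in a terminal or the Anaconda prompt
pip install gensim
# or
conda install -c conda-forge gensim

With the preprocessing complete, we can train a Word2Vec model on the MD&A text and list the words closest to a term of interest. A minimal sketch, assuming sentences holds the tokenized sentences produced by the preprocessing above, and using the Gensim 4 API (the vector_size parameter was called size in earlier versions):
In:
from gensim.models import Word2Vec

# trains a Word2Vec model on the tokenized MD&A sentences
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2)
# prints the ten words most similar to "sales"
print(model.wv.most_similar('sales'))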
Out:
[('rate', 0.9937257766723633),
 ('interest', 0.9924227595329285),
 ('assets', 0.992143988609314),
 ('risk', 0.99197918176651),
 ('million', 0.9914211630821228),
 ('capital', 0.9911070466041565),
 ('mortgage', 0.9908864498138428),
 ('securities', 0.9907544255256653),
 ('financial', 0.9906978011131287),
 ('december', 0.9902232885360718)]
These words often coexist with the word “sales”, as indicated by their high similarity scores.
In the example above, we used only one textual document to train the Word2Vec model. However, the model's performance in identifying word clusters and similarities improves greatly as the training corpus grows. A popular alternative is the model pre-trained on the Google News dataset. It consists of 300-dimensional embeddings for around three million words and phrases (see https://code.google.com/archive/p/word2vec/ for details and to download the ‘GoogleNews-vectors-negative300.bin.gz’ file (∼1.5GB)). With the pre-trained model, we can access the word vectors and get the similarity scores as follows:
In:
from gensim.models import KeyedVectors
# loads embeddings directly from the downloaded file
# "GoogleNews-vectors-negative300.bin"
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
# similarity between pairs of words
a = model.similarity('confident', 'uncertain')
b = model.similarity('recession', 'crisis')
# most similar words
c = model.most_similar('accounting')
# identifies a word that does not belong in the list
d = model.doesnt_match("good great amazing bad".split())
print(a)
print(b)
print(c)
print(d)
Out:
0.38531393
0.59829676
bad
section. All the text in-between is the content of the MD&A section:
Now we can write a function that extracts the MD&A section from a given 10-K filing's text. Please note that it is not unusual for the MD&A section to be located in an exhibit (e.g., Exhibit 13) as opposed to the main 10-K file. Therefore, we recommend searching for the MD&A section in the “complete” submission filing, as opposed to the main 10-K file only. Moreover, it is important to check whether the extracted MD&A section is sufficiently long, because the MD&A can be incorporated by reference in the main document.
def extract_mdna(plain_text: str, min_mdna_length=500):
    """Attempts to extract the MD&A section from a plain-text
    document. The extracted MD&A section should be of the minimum
    specified length (500 characters by default)."""
    # tries to find the position of the Item 7 heading
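    # the remainder of this listing is a sketch (an assumption, not
    # the authors' exact implementation): take the text between an
    # "Item 7" heading and the next "Item 8" heading, keeping it only
    # if it meets the minimum length
    item7_positions = [m.start() for m in
                       re.finditer(r"item\s*7\b", plain_text,
                                   re.IGNORECASE)]
    item8_positions = [m.start() for m in
                       re.finditer(r"item\s*8\b", plain_text,
                                   re.IGNORECASE)]
    for start in item7_positions:
        ends = [pos for pos in item8_positions if pos > start]
        if ends and ends[0] - start >= min_mdna_length:
            return plain_text[start:ends[0]]
    # no sufficiently long MD&A section was found
    return None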
Out:
foods have been introduced to international markets. Principal
international markets include Brazil, France, Mexico, Poland, the
Netherlands, South Africa, Spain and the United Kingdom.
COMPETITION
In:
# extracts the MD&A section from PepsiCo's 10-K filing text
text_mdna_only = extract_mdna(text_complete_10k)
Out:
Item 7. Management's Discussion and Analysis of Results of
Operations, Cash Flows and Liquidity and Capital Resources
[...]
Figure 11.2: Example: Identifying the MD&A section in an HTML document using
hyperlinks in table of contents.
In this example, inside the font tag there is a keyword style; style is an attribute of the font tag. Attributes are used to define various properties of element tags. In this case, the font tag specifies how to display the section heading, and its style attribute describes the properties of the font. Specifically, it states that the font used to display the heading should be of the Arial family (font-family:Arial), its default size should be 10 points (font-size:10pt), and it should be rendered in bold (font-weight:bold). That is, the property font-weight:bold here achieves the same effect as the <b> tag in the previous example.
We can also observe that between the words “Item” and “7” there is the HTML code “&#160;”. This code instructs a web browser to render in its place a non-breaking space character, a type of space that does not allow line breaks before or after its position.
Now that we know how HTML documents render bold text, we can write a Python function that searches for section headings displayed in bold font. In fact, HTML documents render text that is centered, underlined, or otherwise emphasized in a manner similar to bold text. For example, just as the tag <b> and the property font-weight:bold render bold text, the tag <u> and the property text-decoration:underline render underlined text. Therefore, we can write a function that searches for section captions displayed using font styles commonly used for section headings. Consider the code below:
import re
# regular expressions matching text rendered in styles commonly used
# for section headings; html_styles[4], used below, is the
# font-weight:bold pattern
html_styles = [
    # b tag; bold text
    r"<b[^>]*>(?P<value>.+?)</b>",
    # u tag; underlined text
    r"<u[^>]*>(?P<value>.+?)</u>",
    # strong tag; strongly emphasized text
    r"<strong[^>]*>(?P<value>.+?)</strong>",
    # center tag; centered text
    r"<center[^>]*>(?P<value>.+?)</center>",
    # any tag that has an attribute ("style") with
    # 'font-weight:bold' value
    r"<(?P<tag>[\w-]+)\b[^>]*font-weight:\s*bold[^>]*>(?P<value>.+?)</(?P=tag)>",
    # any tag that has an attribute ("style") with
    # 'text-decoration:underline' value
    r"<(?P<tag>[\w-]+)\b[^>]*text-decoration:\s*underline[^>]*>(?P<value>.+?)</(?P=tag)>",
    # em tag; emphasized text
    r"<em>(?P<value>.+?)</em>"]
Out:
dding-bottom:2px;padding-right:2px;"><div style="text-align:center;
font-size:7pt;"><font style="font-family:Arial;font-size:7pt;
font-weight:bold;">(Zip Code)</font></div></td></tr></table></div>
</div><div style="line-height:120%;padding-top:2px;text-align:center;
font-size:9pt;"><font style="font-fam
Now, we search for all centered text in the HTML 10-K filing that
is rendered using the font-weight:bold attribute, and print the first
three instances of such text:
In:
# gets all text from the HTML 10-K filing defined using the
# font-weight:bold attribute
style_values = get_html_style_values(html_styles[4],
                                     html_complete_10k)
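# prints the first three matches (a sketch; get_html_style_values is
# assumed to return dictionaries holding each match's text and
# position, as in the output below)
for match in style_values[:3]:
    print(match)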
Out:
{'text': 'UNITED STATES', 'position': 753}
{'text': 'SECURITIES AND EXCHANGE COMMISSION', 'position': 906}
{'text': 'Washington, D.C. 20549', 'position': 1080}
Out:
<font style="font-family:Arial;font-size:10pt;font-weight:bold;">
Item&#160;7. Management’s Discussion and Analysis of Financial
Condition and Results of Operations</font></div><a name="sAE854BB7
Notice that although the extraction was successful, the output is the HTML source code of the MD&A section. We can convert this HTML code to plain text using lxml, a popular Python library for processing XML and HTML files. This library is included by default in Anaconda, but can be installed via conda or pip as follows:
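In:
# in a terminal or the Anaconda prompt
pip install lxml
# or
conda install lxml

The conversion itself can then be done in two lines. A minimal sketch, where mdna_html is a hypothetical variable holding the extracted MD&A HTML source:
In:
from lxml import html

# parses the HTML fragment and prints its plain-text content
tree = html.fromstring(mdna_html)  # mdna_html: hypothetical variable
print(tree.text_content())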
Out:
Item 7. Management's Discussion and Analysis of
Financial Condition and Results of Operations
https://xbrl.us/home/learn/us-taxonomies/,
http://xbrlview.fasb.org.
Out:
<?xml version="1.0" encoding="US-ASCII"?>
<!-- XBRL Document Created with WebFilings -->
<!-- p:ee88abc9bc7d42cca8267c157db83ce8,x:1ee976a4cce447438165efc656d0aac8 -->
<xbrli:xbrl xmlns:country="http://xbrl.sec.gov/country/2013-01-31"
xmlns:dei="http://xbrl.sec.gov/dei/2013-01-31"
xmlns:hd="http://www.homedepot.com/20140202"
xmlns:invest="http://xbrl.sec.gov/invest/2013-01-31"
xmlns:iso4217="http://ww
Out:
INCOME TAXES
The components of Earnings before Provision for Income Taxes for
fiscal 2013, 2012 and 2011 were as follows (amounts in millions):
The SEC’s EDGAR site has served as a treasure trove of data for accounting researchers for over 15 years. One of the first studies to extract and analyze this data with a scripting language (as opposed to hand-collecting) was Butler et al. (2004). In that study, the authors extracted audit opinions from 10-Ks and classified modified audit opinions by type (e.g., going concern opinion). Later studies, such as Li (2008), began to apply methods from computer science (e.g., computational linguistics and machine learning) to analyze the readability and information content of various disclosures. In this section we cover the steps to obtain data directly from the EDGAR system.
https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx
To download index files for a time period, we can create a loop that simply changes the year and the quarter number in the http address above and downloads the index file for the given quarter and year.1
1 The SEC encourages researchers to download files after business hours.
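A minimal sketch of such a loop (an illustration, not the authors' exact script; the get_files discussion below refers to a get_index function like this one, and the folder layout follows the output shown below; note that the SEC asks automated clients to identify themselves via a User-Agent header):
In:
import requests

def get_index(start_year, end_year, odirect):
    """Downloads EDGAR quarterly master index files."""
    print("Downloading Index Files")
    for year in range(start_year, end_year + 1):
        for qtr in range(1, 5):
            url = ("https://www.sec.gov/Archives/edgar/full-index/"
                   f"{year}/QTR{qtr}/master.idx")
            # hypothetical contact info; replace with your own
            response = requests.get(
                url,
                headers={"User-Agent": "Sample Name [email protected]"})
            filename = f"{odirect}/master{year}{qtr}.idx"
            with open(filename, "wb") as outfile:
                outfile.write(response.content)
            print("Downloaded", filename)
    print("Downloading of Index Files Complete")

get_index(2018, 2019, "/edgar/indexfiles")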
Following is the output generated from the above script. Each downloaded file is listed.
Out:
Downloading Index Files
Downloaded /edgar/indexfiles/master20181.idx
Downloaded /edgar/indexfiles/master20182.idx
Downloaded /edgar/indexfiles/master20183.idx
Downloaded /edgar/indexfiles/master20184.idx
Downloaded /edgar/indexfiles/master20191.idx
Downloaded /edgar/indexfiles/master20192.idx
Downloaded /edgar/indexfiles/master20193.idx
Downloaded /edgar/indexfiles/master20194.idx
Downloading of Index Files Complete
Now that the index files have been downloaded, we are ready to download the filings. In the following example, we will download 10-Ks filed from 2018 through 2019. For illustration purposes, we will only download the first five 10-Ks for each quarter, but this limit can easily be changed or removed by modifying or removing the “if statement” that checks the count.
The get_files function takes five parameters: start_year, end_year, reform, inddirect, and odirect. As with the get_index function, start_year and end_year specify the year range to be downloaded. reform contains the regular expression specifying the filings to be downloaded (e.g., 10-K). inddirect is the folder containing the index files, and odirect is the directory the filings will be downloaded to. To limit the total number of files in a given folder, the filings are downloaded into separate year folders.
Most of the logic below is straightforward, but there are two regular expressions that need further explanation. The first regular expression is used to identify the forms to download, which are 10-Ks in this example. Following is a line containing a 10-K from the index file for the first quarter of 2018 (master20181.idx):
1000228|HENRY SCHEIN INC|10-K|2018-02-21|edgar/data/1000228/0001000228-18-000012.txt
In the example above, we are looking for a '10' that follows the delimiter '|'. It is a good idea to start with the delimiter because there are some filings that we want to ignore, such as 'NT 10-K', which a company files if it does not expect to file its 10-K on a timely basis. By requiring the '10' to come right after the delimiter, we are able to skip over 'NT' filings. In most cases, the '10' is followed by a '-', but it is possible that the '-' is omitted, so we follow the '-' with a '?', which makes the '-' optional. Next we require 'K', which we follow with '(sb|sb40|405)?'. This part of the expression means that the 'K' is optionally followed by 'sb', 'sb40', or '405'. Finally, '\s*\|' means we need zero or more spaces followed by '|'. Note that this expression intentionally excludes amended 10-Ks, which have a form type of '10-K/A'. If you want to download amended 10-Ks as well, simply remove '\s*\|' from the end of the regular expression.
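Putting this description together, the first regular expression is presumably along the following lines (a reconstruction from the description above; the variable name re_forms is ours):
In:
re_forms = re.compile(r"\|10-?K(sb|sb40|405)?\s*\|", re.IGNORECASE)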
The second regular expression we will use is:
re_fullfilename = re.compile(r"\|(edgar/data.*\/([\d-]+\.txt))",
                             re.IGNORECASE)
This expression gets the location and name of the filing to download. Using the HENRY SCHEIN example above, the expression will obtain 'edgar/data/1000228/0001000228-18-000012.txt' as the file location and '0001000228-18-000012.txt' as the file name.
Now that the filings have been downloaded, we can draw on some of the previous chapters to extract the data we are interested in. Let's first take a look at what is contained in the files we downloaded, using the 2018 10-K submission (0001000228-18-000012.txt) by Henry Schein, Inc. If you view this submission on the SEC website in your browser, it looks as shown in Figure 12.2.
Using a dictionary like this allows us to process all the regular expressions
in a loop, which reduces the amount of code we have to write.
Read File Names into a List: To read in and process all the files in a
folder, we first need to create a list containing all the file names. There
is a very useful package, called glob, that makes this step easy. The following statement reads all .txt files contained in the folder at the '/foldername' path into a list named files:
files = glob.glob('/foldername/*.txt')
Once the list is created, it is easy to loop through each file using
a for loop. The following script reads 10-K filings, creates a pandas
dataframe containing the header information and writes the dataframe
out to a csv file.
In:
import os
import re
import pandas as pd
import glob
}
# create a regular expression representing the
# last row of the file you want to read. The tag
# '</SEC-HEADER>' represents the end of the
# header information in the .txt file. All the
# header information should be found before this
# line
regex_endheader = re.compile(r'</SEC-HEADER>', re.IGNORECASE)
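A minimal sketch of the remaining steps, assuming regexes is the dictionary of compiled header patterns begun above (its listing is truncated) and that each pattern captures its value in group 1:
In:
rows = []
for filename in files:
    # reads each file only up to the end-of-header tag
    header_lines = []
    with open(filename, errors='ignore') as infile:
        for line in infile:
            if regex_endheader.search(line):
                break
            header_lines.append(line)
    header = ''.join(header_lines)
    # applies each header regex and stores the captured value
    row = {'file': os.path.basename(filename)}
    for name, regex in regexes.items():  # regexes: assumed dictionary
        match = regex.search(header)
        row[name] = match.group(1) if match else None
    rows.append(row)
# builds the dataframe and writes it out to a csv file
df = pd.DataFrame(rows)
df.to_csv('headers.csv', index=False)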
After the script executes, it should produce output like that shown
below.
There are times when we want to obtain data from a website, and the
process of doing so is often referred to as web scraping. The process
requires some familiarity with HTML, which was briefly described in
the previous chapter, and a lot of patience! The ease with which we can
extract data depends on the consistency with which data is reported
on a website. The challenge is that we need to extract the data from
HTML files, and some websites change the HTML fairly often. That means we could write code to scrape data from a website that works one day but not the next. In this section, we will introduce the basics of web scraping with the task of obtaining all Accounting and Auditing Enforcement Releases (AAERs).
2 It is important to respect website policies on scraping. Many sites prohibit certain types of scraping, especially when the process consumes significant resources and slows down the site. There is a file at the root of most websites, “robots.txt”, that specifies what is and is not allowed.
three columns. Note that line 45 removes the “AAER-” from the AAER number so that we are left with just the integer. For example, we remove “AAER-” from “AAER-1213” and are left with “1213”. In lines 47 and 48, we convert the Date into YYYYMMDD format so that it can be saved as an integer. This is done using the dateutil package.
After the column information is obtained, it is added to the pandas dataframe (line 51). The last step in the routine is to download the specific AAER. Each AAER is saved under its AAER number followed by an extension, .pdf or .html depending on the file type. Lines 53-56 get the appropriate file extension. Line 60 checks to see if the file has already been downloaded and, if it has not, line 61 downloads it. Finally, line 63 writes the pandas dataframe to a csv file.
This is a relatively simple example of web scraping that demonstrates
the potential for collecting data from web pages on the internet. For
more advanced web scraping, have a look at Scrapy, a comprehensive
Python framework for web scraping.
The most convenient and reliable way to obtain data on the internet is through an Application Programming Interface (API). Many popular websites (e.g., Twitter, Facebook, Lexis/Nexis) offer APIs to developers and researchers, which provide more direct access to data on the website's server. Essentially, we can pass the server a query and, rather than rendering the results in HTML in our browser, the results are returned directly to our (Python) program. One will typically be required to request API access for a specific website or service. For example, we need to apply for developer access to use the Twitter API, which usually takes a few days for approval. We will also likely need to pay an access fee if we want to obtain a large amount of data.
Most APIs have a Python “wrapper”, or package, that simplifies
programming. Because API implementations vary greatly from website
to website, we will not cover a specific example. However, if you need
to collect a significant amount of data, we strongly suggest that you