Info 19
http://www.johnny-lin.com/infosys
2019
© 2012–2019 Johnny Wei-Bing Lin. All rights reserved.
Contents
Preface
Notices
2 X-Y Plots
2.1 Matplotlib: Python’s basic plotting package
2.2 My first X-Y plot
2.2.1 A line plot
2.2.2 A scatter plot
2.3 ⌨ Python Sidebar: Lists and tuples
2.4 Customizing an X-Y plot
2.4.1 Controlling the axes ranges
2.4.2 Controlling line and marker formatting
2.4.3 Annotation and adjusting the font size of labels
2.4.4 Plotting multiple figures
2.4.5 Plotting multiple curves
2.4.6 Adding a legend
2.4.7 Adjusting the plot size
2.4.8 Saving figures to a file
2.5 ⌨ Python Sidebar: Commenting
2.6 Summary
5.3.3 Example of how objects work: Arrays
5.4 Writing a file
5.5 Processing file contents
5.6 Catching file opening errors
5.7 ⌨ Python Sidebar: More on exception handling
5.8 Better ways of reading a file
5.9 Summary
Glossary
Acronyms
Bibliography
Preface
• Anaconda: https://www.anaconda.com/download/
• Canopy: https://www.enthought.com/products/canopy/
Both distributions have a version that is free and can be installed without administrator privileges.
preface I discuss the different versions of the book). Some of the special typesetting conventions I
use include:
• Source code: Typeset in a serif, non-proportional font, as in a = 4.
• Commands to type on your keyboard or printed to the screen: Typeset in a serif, non-proportional
font, as in print('hello').
• Generic arguments: Typeset in a serif, proportional, italicized font, in between a less than
sign and a greater than sign, as in <condition>.
• File, directory, and executable names: Typeset in a serif, proportional, italicized font, as in
/usr/bin.
Please note that general references to application, library, module, and package names are not
typeset any differently from regular text. Thus, references to the matplotlib package are typeset
just as in this sentence. As most packages have unique names, this should not be confusing. In the
few cases where the package names are regular English words (e.g., the time module), references
to the module will hopefully be clear from the context.
Usually, the first time a key word is used and/or explained, it will be bold in the text like this.
Key words are found in the glossary, and when useful, occurrences of those words are hyperlinked
to the glossary. Many acronyms are hyperlinked to the acronym list. The glossary and acronym
lists start on p. 126.
All generic text is in black. All hyperlinks (whether to locations internal or external to the
document), if provided, are in blue.
Finally, because the focus of this book is on how to do MIS using Python, not on the Python
language itself, when I do focus on the nitty-gritty of the Python language as a language, it will be
in sections with headers beginning with “⌨ Python Sidebar.” The symbol in front is a keyboard
☺. This will make it easier to find these sections in the Table of Contents, if you’re looking only
for the parts where I focus on Python syntax and structure.
Personal Acknowledgments
While I often use first person throughout this book, I am acutely aware of the debt I owe to family,
friends, and colleagues who, over many years, generously nurtured many of the ideas in this book:
Indeed, we all do stand on the shoulders of giants, as Newton said. All praise I happily yield to
them; any mistakes and errors are my own. Much of this book is based on my book, A Hands-On
Introduction to Using Python in the Atmospheric and Oceanic Sciences,1 and the acknowledgments
I made there equally apply to this text.
1 Lin (2012)
In addition to those people, I also want to acknowledge Hannah Aizenman for her contributions
to the present work. Thanks to Evangeline Abrigo for correction help and to students who provided
anonymous feedback.
I am personally grateful for those who gave me permission to use material they created: These
are acknowledged in the Notices section starting on p. viii and in the captions of the included or
adapted figures.
Finally, I thank my wife Karen, my children Timothy, James, and Christianne, for their encour-
agement and love, and my Lord and Savior Jesus Christ for giving me life itself, both physically
and spiritually: “… I have come that they may have life, and have it to the full” (John 10:10b,
NIV).
Notices
Trademark Acknowledgments
ArcGIS is a registered trademark of Environmental Systems Research Institute, Inc. Debian is
a registered trademark of Software in the Public Interest, Inc. IDL is a registered trademark of
Exelis Corporation. Linux is a trademark owned by Linus Torvalds. Mac, Mac OS, and OS X are
registered trademarks of Apple Inc. Mathematica is a trademark of Wolfram Research, Inc. Matlab
and MathWorks are registered trademarks of The MathWorks, Inc. Perl is a registered trademark of
Yet Another Society. Python is a registered trademark of the Python Software Foundation. Solaris is
a trademark of Oracle. Swiss Army is a registered trademark of Victorinox AG, Ibach, Switzerland
and its related companies. Ubuntu is a registered trademark of Canonical Ltd. PowerShell and
Windows are registered trademarks of Microsoft Corporation in the United States and/or other
countries. All other marks mentioned in this book are the property of their respective owners. Any
errors or omissions in trademark and/or other mark attribution are not meant to be assertions of
trademark and/or other mark rights.
Copyright Acknowledgments
Scripture taken from the HOLY BIBLE, NEW INTERNATIONAL VERSION®. Copyright ©
1973, 1978, 1984 Biblica. Used by permission of Zondervan. All rights reserved. The “NIV” and
“New International Version” trademarks are registered in the United States Patent and Trademark
Office by Biblica. Use of either trademark requires the permission of Biblica.
All figures not created by myself are used by permission and are noted either in this acknowl-
edgments section or in the respective figure captions. Use in this book of information from all other
resources is believed to be covered under Fair Use doctrine.
Part I
Chapter 1
Chapter Contents
1.1 Starting a Python interpreter session
1.1.1 The Canopy IDE
1.1.2 Using a terminal window
1.1.3 ⌨ Python Sidebar: Arithmetic operators in Python
1.2 Using functions
1.2.1 ⌨ Python Sidebar: Optional input into functions
1.2.2 ⌨ Python Sidebar: Importing modules and using module items
1.2.3 ⌨ Python Sidebar: Writing and using your own functions
1.2.4 ⌨ Python Sidebar: Keyword input parameters
1.3 Saving values
1.3.1 ⌨ Python Sidebar: Naming, creating, and deleting variables
1.3.2 ⌨ Python Sidebar: A quick introduction to strings
1.3.3 ⌨ Python Sidebar: A programmable calculator
1.4 Summary
Python is a multi-paradigm language, which is a computer science-ese way of saying that you
can use it in a bunch of ways. The simplest (and sometimes most useful) way is as a calculator. In
fact, when I’m on my laptop, I don’t usually bother with firing up a calculator app if all I need is
to do some simple (or not-so-simple) arithmetic. Instead, I start a Python session and just type in
what I want to calculate. In the process, if I need to do more heavy-duty calculations, I have all the
power of Python’s mathematical and statistical libraries at my disposal.
1.1. STARTING A PYTHON INTERPRETER SESSION
compiled language like Java where you have to write the code, process the code using the compiler,
then run the code. In Python, it’s instant gratification ☺.
But you first have to start up the Python interpreter. I’ll describe two ways of getting Python’s
interpreter up and running: using the Canopy environment’s Integrated Development Environment
(IDE) and running a session in a terminal window.
1. Start Canopy the usual way, by finding the application on your computer and double clicking
it to start it. You should get a welcome window similar to Figure 1.1. Click the Editor button
in the welcome window.
2. Next, you’ll get a window that says “Create a new file” or “Select files from your computer”.
Instead, select the menu choice View → Python. In the bottom window, you’ll get a Python
interpreter, as shown in Figure 1.2.
3. In the interpreter, you can type in Python commands where it says In (e.g., In [1]). When
you press Enter, the Python calculation runs and the result is printed out after the Out prompt
(e.g., Out [1]). Figure 1.3 gives an example.
One final note: We’ll find later on when we discuss saving and reading files and modules
that the default location where we save and read files is somewhere called the “current working
directory.” In Canopy, that location is set by a drop-down menu in the upper-right corner of the
Python shell, as shown in Figure 1.4.
Figure 1.4: Canopy’s current working directory and the drop-down menu for changing it are
circled in red.
Terminal windows are created on a Mac OS X computer by running the Terminal application and
on a Linux machine by running xterm, GNOME Terminal, or any of a number of other terminal
creating applications. On a Windows computer, you’d start an MS-DOS shell.
Why would you want to do this? We’ll talk more about this later when we discuss automating
Management Information Systems (MIS) tasks. For now, using a terminal is a quick way of getting
access to the Python interpreter. In contrast with starting up an IDE, which sometimes takes a
while, a terminal window usually pops up quickly.
To start the Python interpreter using a terminal window, after you’ve opened that window, do
the following (everything you type will be in that window):
1. Type python. You should get something that looks like Figure 1.5. If this doesn’t happen,
here are some possible fixes:
• You may have to type in the full path name to your Python application. On my machine,
Operation        Symbol
Add              +
Subtract         -
Multiply         *
Divide           /
Exponentiation   **
The three greater-than signs (>>>) on the left of the line tell you that you are now in the Python
interpreter.
2. An alternate way of opening a terminal that you can use to run Canopy is to open the Canopy
application and then select the Tools → Canopy Terminal menu option. This will open up
a terminal window that has Canopy set as the default Python. If you type python in that
terminal window, you’ll launch Canopy’s Python.1
3. You can now type in whatever you want to calculate, press Enter, and the answer will be
printed. For instance:
>>> (4+5)*3
27
>>>
4. The interpreter immediately executes the command, printing the string hello world! to
screen.
5. To exit the Python interpreter, type Ctrl-d. To exit from the terminal window, type exit once
you’re outside the Python interpreter. Typing exit while you’re in the Python interpreter
will do nothing but tell you what you need to do to leave the Python interpreter.
1.2. USING FUNCTIONS
Arithmetic operation order in Python follows the standard order of parenthetical blocks first, then
exponentiation, then multiplication/division, and finally addition/subtraction. Operations at the same
level of priority are executed left-to-right (the exception is exponentiation: a chain such as 2**3**2
is evaluated right-to-left). See https://en.wikibooks.org/wiki/Python_Programming/Operators for
more details.
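A quick way to convince yourself of these ordering rules is to try a few expressions. The sketch below is illustrative only; the numbers and variable names are made up for this example:

```python
# Parentheses first, then exponentiation, then multiplication/division,
# and finally addition/subtraction.
no_parens = 2 + 3 * 4        # multiplication happens before addition
with_parens = (2 + 3) * 4    # parentheses override the default order
power_first = 2 * 3 ** 2     # exponentiation happens before multiplication
left_to_right = 10 - 4 - 3   # same priority: evaluated left to right

print(no_parens, with_parens, power_first, left_to_right)
```

When in doubt about the order, adding parentheses costs nothing and makes the intent obvious to whoever reads the line later.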
Note that in Python 2.7.x, division of two integers will default to integer division. That is, it
will return the quotient and discard the remainder and the output will be of integer type:
>>> 1/2
0
>>> 3/2
1
>>> 7/3
2
>>> type(7/3)
<type 'int'>
To force Python to use the kind of division we’re all used to (floating point division), make at least
one of the numbers floating point by putting a period (or a period and a decimal) after it:
>>> 1./2
0.5
>>> 1.0/2
0.5
In Python 3.x, the default is to include the remainder in the output, with the output being floating
point type:
>>> 1/2
0.5
>>> 3/2
1.5
>>> 7/3
2.3333333333333335
>>> type(7/3)
<class 'float'>
To do integer division in Python 3.x, use “//” instead of “/”.2 Note that the // operator also works
as integer division in Python 2.7.x.
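As a small sketch of the two division operators side by side (assuming Python 3.x; the variable names are made up for illustration):

```python
# Python 3 behavior: "/" is true division, "//" is integer (floor) division.
true_quotient = 7 / 3     # a float that keeps the fractional part
floor_quotient = 7 // 3   # an int: the remainder is discarded
remainder = 7 % 3         # the part that "//" throws away

print(true_quotient, type(true_quotient))
print(floor_quotient, type(floor_quotient))
print(remainder)
```

One thing to keep in mind is that "//" rounds down rather than toward zero, so -7 // 3 gives -3, not -2.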
Python also has such functions, but instead of typing in some numbers and then pressing a button,
you type in the name of the function to execute it. Thus, to take the sine of π/2 radians (i.e.,
90°):
>>> sin(3.1415/2)
0.99999999892691405
The sine function is called sin and the input into the function is put in between two parentheses
immediately after the name of the function. After you press Enter, the answer or result of calling
the sine function is returned.
But wait, don’t type that in! What I wrote above is not what will happen; what will happen if
you type in what I wrote above is:
>>> sin(3.1415/2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'sin' is not defined
Hmmm, so Python doesn’t know anything about the sine function. The sine function, it turns out,
is not built into the interpreter but rather is part of a module called SciPy (the name used in the
interpreter is “scipy”, without capitalization). In order to use sin, I first need to import it. If I add
an import line before the call to sin, everything works fine:
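A minimal sketch of the import-then-call pattern, assuming the standard-library math module as a stand-in for SciPy (a top-level sin may not be available in every SciPy release, so math.sin is used here purely for illustration):

```python
import math  # standard-library module, used here in place of SciPy (assumption)

# Call the sine function through the module name using period notation.
result = math.sin(3.1415 / 2)
print(result)  # very close to 1, because 3.1415/2 is approximately pi/2
```

Without the import line, the same call raises the NameError shown earlier.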
If there is more than one input in a function call, each input is listed one at a time, separated by
commas. Each input is called a “parameter” (or “positional input parameter”) and the sequence of
multiple inputs is called a “parameter list.” The fv function in SciPy, which calculates future value,
takes four required inputs, in this order: the interest rate, the number of compounding periods, the
payment paid (by default) at the end of each period, and the present value.3 Thus, if you have an
investment account with an annual interest rate of 3% (compounded annually), worth $1000 now,
and you add $500 every year for ten years, how much will you have at the end of the ten years?
Note that a negative sign means “cash flow out (i.e. money not available today).”4
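One way to check the arithmetic behind this example is to write the future-value formula out by hand. The sketch below assumes end-of-period payments and a nonzero interest rate; future_value is a hypothetical helper written for this illustration, not SciPy's fv, although it takes its inputs in the same order and uses the same sign convention described above:

```python
def future_value(rate, nper, pmt, pv):
    """Future value with payments made at the end of each period.

    Hypothetical helper for illustration only (not SciPy's fv).
    Cash paid out, such as a deposit, is entered as a negative number.
    Assumes rate is not zero.
    """
    growth = (1 + rate) ** nper
    return -(pv * growth + pmt * (growth - 1) / rate)


# 3% annual interest, 10 years, $500 added each year, $1000 invested now.
balance = future_value(0.03, 10, -500, -1000)
print(round(balance, 2))
```

The first term grows the starting balance for ten periods, and the second term adds up the yearly deposits together with the interest each one earns.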
By the way, if you want to learn more about a function, you can use the help command. In the
case of fv, you can type in help(fv) in the Python interpreter and Python will tell you about the
function and how to use it. If you are running the interpreter from the command-line, type Space to
page down the help display, j to scroll down the display, k to scroll up the display, and q to leave
the display.
3 http://docs.scipy.org/doc/numpy/reference/generated/numpy.fv.html (accessed September 1, 2016).
4 http://docs.scipy.org/doc/numpy/reference/generated/numpy.fv.html (accessed September 1, 2016).
In this case, when is the keyword input parameter and we set it to the string 'begin'. (We’ll talk
more about strings in Section 1.3.2.) In Section 1.3, we talk more about setting variables.
Keyword input parameters are nice for inputs to functions that are optional rather than required.
Since they have a default setting, you only need to specify them in the function’s calling line if
you want to pass in a value different than the default. (Positional input parameters are used for
required inputs.) The function’s documentation will tell you what is required (positional) and what
is optional (keyword) input.
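The difference between required positional inputs and an optional keyword input can be sketched with a hand-written function. Here future_value is a hypothetical helper (not SciPy's fv) that assumes a nonzero rate; its when parameter defaults to 'end' and can be overridden with 'begin':

```python
def future_value(rate, nper, pmt, pv, when='end'):
    """Hypothetical helper for illustration (not SciPy's fv).

    rate, nper, pmt, and pv are required positional inputs.
    when is an optional keyword input that defaults to 'end'.
    Assumes rate is not zero.
    """
    growth = (1 + rate) ** nper
    # A payment made at the beginning of a period earns one extra period of interest.
    timing = (1 + rate) if when == 'begin' else 1
    return -(pv * growth + pmt * timing * (growth - 1) / rate)


at_end = future_value(0.03, 10, -500, -1000)                  # default is used
at_begin = future_value(0.03, 10, -500, -1000, when='begin')  # default is overridden
print(round(at_end, 2), round(at_begin, 2))
```

Because when has a default, the first call can leave it out entirely; only the second call, which wants something other than the default, has to mention it.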
Finally, modules do not only contain functions but may also contain variables set to certain
values. For instance, the SciPy module has a variable called pi that is set to the value of π. When
you import a module using import, you can reference the variables defined in the module in a way
similar to how you reference functions, by using the period notation:
Notice that when we make use of pi, we do not put parentheses after the variable name, whereas
when we use a function (like fv), we do put parentheses after the function name. This is because
with the function, we are calling the function. Python conveys this through appending the parameter
list (in a pair of parentheses) to the function name. In the case of pi, because it’s a variable, it
isn’t being called. (Remember that calling a function means giving input to a function, running the
function, and getting whatever’s returned.) In Section 1.3, we talk more about setting variables, and
when we talk about objects in more detail in Section 5.3, we’ll see some more of how this period
notation is used.
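The contrast between referencing a module variable and calling a module function can be sketched as follows, again assuming the standard-library math module as a stand-in for SciPy:

```python
import math  # standard-library stand-in for the SciPy module (assumption)

pi_value = math.pi                   # a variable: referenced, no parentheses
sine_value = math.sin(math.pi / 2)   # a function: called with a parameter list

print(pi_value)
print(sine_value)
print(callable(math.pi), callable(math.sin))
```

The built-in callable function reports whether a name refers to something that can be called, which is a handy check when you are unsure whether parentheses are needed.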
When defining a function, the first line begins with def, followed by the function name, then
the parameter list. The number of variables in the parameter list is the number of parameters we
need to provide when we call the function. In the above example, there is one parameter named
input. All references in the body of the function definition to the name input refer to whatever
value is passed into the function when it is called. The def line ends with a colon.
The second thing to note in the above function definition is that the body of the definition is
delineated by an indentation of four spaces (which pressing Tab will give me when I’m in the Python
interpreter). The “...” characters are automatically included by the interpreter as I’m typing; they
just mean that Python recognizes you’re typing in a function definition and is waiting for you to
provide the body of the function.
Finally, in the example we see that the return value of a function is given by the return followed
by whatever is being returned. In the above example, the variable output is returned, but the return
value can itself be an expression. For instance, the code below works exactly the same:
and saves us an extra line of typing. (Note that we make the 100. a decimal to ensure we never
have integer division operating on input.)
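The two forms of the function described above can be sketched side by side. The name percent_to_decimal_short is made up here so that both versions can coexist in one session; the parameter name input follows the text:

```python
def percent_to_decimal(input):
    # "input" is the parameter name used in the text; inside this function
    # it shadows Python's built-in input function.
    output = input / 100.
    return output


def percent_to_decimal_short(input):
    # Same behavior, but the expression is returned directly.
    return input / 100.


print(percent_to_decimal(42))
print(percent_to_decimal_short(42))
```

When typing these at the interpreter prompt, the indented body lines appear after the "..." continuation prompt, and a blank line ends the definition.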
Typing in a function in an interpreter is easy and straightforward but as soon as you exit the
interpreter, the function definition is lost. Thus, it often makes sense to instead define a function in
a file and then use import to give you access to the function.
In our percent_to_decimal example, we put the code we would have typed into the inter-
preter in a file called myfuncs.py, as seen in Figure 1.6. Then, as long as myfuncs.py is in the current
working directory for the Python interpreter,5 I can import it (as myfuncs, without the .py extension) and use its contents:
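The whole round trip (save a function in a file, then import and use it) can be sketched in one self-contained script. To avoid touching your real working directory, this sketch writes myfuncs.py into a temporary directory and adds that directory to the import search path; those two steps are assumptions made for illustration and are not needed when the file already sits in the current working directory:

```python
import os
import sys
import tempfile

# Write a tiny module file into a scratch directory (this stands in for
# saving myfuncs.py in the current working directory).
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "myfuncs.py"), "w") as f:
    f.write("def percent_to_decimal(input):\n")
    f.write("    return input / 100.\n")

# Make the scratch directory visible to import, then import by module
# name (no ".py" suffix) and call the function using period notation.
sys.path.insert(0, module_dir)
import myfuncs

result = myfuncs.percent_to_decimal(42)
print(result)
```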
But wait, isn’t import used for importing modules? Well, yes, it is. But what this shows us is
that a module is just a file containing Python commands. If you have a bunch of function, variable,
or class definitions, just put them into a file and voilà, you have a module you can import and use!
It’s that easy to create a library of your own functions.6
5 The more complete answer is that myfuncs.py has to be on the path specified by the PYTHONPATH environment
variable, but for our purposes right now, as long as myfuncs.py is in the same directory we started python in, it’ll work.
Figure 1.6: The file myfuncs.py.
def percent_to_decimal(input):
    return input / 100.
>>> annual_percent_to_period_decimal(6)
0.005
On the other hand, if we want to specify a different value for num_periods_per_year, we can
do so in the call to annual_percent_to_period_decimal. In the case below, we assume the
sub-annual period is a half-year:
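A sketch of how such a function and both kinds of call might look. This is a reconstruction for illustration: the positional parameter name annual_percent and the default of 12 periods per year are assumptions (the default is chosen because the text shows an input of 6 giving 0.005):

```python
def annual_percent_to_period_decimal(annual_percent, num_periods_per_year=12):
    """Convert an annual percentage rate to a per-period decimal rate.

    Illustrative reconstruction; the parameter name annual_percent and
    the default value of 12 are assumptions.
    """
    return annual_percent / 100. / num_periods_per_year


# Leaving out the keyword input uses the default number of periods.
monthly = annual_percent_to_period_decimal(6)

# Naming the keyword input overrides the default: two half-year periods.
half_yearly = annual_percent_to_period_decimal(6, num_periods_per_year=2)

print(monthly, half_yearly)
```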
1.3. SAVING VALUES
>>> a = 2
>>> 3*a
6
>>> b = 5
>>> a*b
10
To see what the contents of a variable are, just type in the name of the variable:
>>> a = 2
>>> a
2
You can have any number of variables. You do not need the blank spaces on either side of the
equal sign. Thus, a = 2 and a=2 work equally well (pun intended ☺).
int a = 4;
a = 4
In contrast with Java, Python is a dynamically typed language, meaning that the type of a
variable can change with time. The type of a variable is automatically set by Python based upon
whatever is on the right-hand side of the assignment. Thus, in the code below:
>>> a = 3
>>> type(a)
<type 'int'>
>>> a = 6.4
>>> type(a)
<type 'float'>
>>> a = "hello"
>>> type(a)
<type 'str'>
we see from our calls to the type function that the type of the variable a changes each time we
assign a new value to a.
If you want to delete a variable so that Python doesn’t recognize it, use the built-in del com-
mand:
>>> a = 3
>>> a
3
>>> del(a)
>>> a
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'a' is not defined
The triple quotes allow you to include newline characters and spaces through typing Enter and
pressing the space bar and having those remembered as part of the string. This makes it easier to
enter in more complexly formatted strings:
The character “\n” is the newline character. Tab is “\t”. The print function prints out the contents
of the string variable to the screen but represents formatting characters (such as newline) the
way they should look.
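A short sketch of triple-quoted strings and the formatting characters just described (the message text is made up for illustration):

```python
# A triple-quoted string keeps the line breaks and spaces you type.
message = """Dear customer,
    your order has shipped.
Thank you!"""

# The same string written on one line with explicit newline characters.
same_message = "Dear customer,\n    your order has shipped.\nThank you!"

print(message)          # formatting characters are shown the way they should look
print(repr(message))    # repr shows the \n characters themselves
print("Item\tPrice")    # \t is a tab
```

Typing just the variable name at the interpreter prompt shows the repr form, which is why the newline appears there as \n instead of as a line break.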
Two final points about strings to mention, and then we’ll leave more about strings to the discus-
sion in Section 5.3.2. First, if you want to concatenate two strings together, use the “+” operator:
>>> a = "Boeing"
>>> b = "Airbus"
>>> a + " or " + b
'Boeing or Airbus'
Second, if you want to convert a number into a string, use the built-in str function:
>>> a = "Boeing"
>>> b = "Airbus"
>>> i = 1
>>> a + " or " + b + " is number " + str(i)
'Boeing or Airbus is number 1'
>>> a + " or " + b + " is number " + i
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'int' objects
If you leave out the str call, Python doesn’t know how to “add” together a string and an integer,
and so complains to you.
import myfuncs
myfuncs.percent_to_decimal(42)
myfuncs.percent_to_decimal(2)
myfuncs.percent_to_decimal(83)
myfuncs.percent_to_decimal(12)
But if we exited the interpreter and wanted to later redo these four recalculations, we’d have to
type it in again. So, let’s copy-and-paste these lines into a file, as seen in Figure 1.7. (We don’t
copy-and-paste the output or the interpreter prompt >>>.) We’ll call this file myscript1.py.
To run the script from the terminal, type in:
python myscript1.py
(Using Canopy, to do the same thing, you would first open the file myscript1.py and then choose
the menu item Run → Run File. Or, you can just click the green arrowhead icon in the task bar.)
But nothing happens! The results of the percent_to_decimal calls are not printed to screen!
This is one of the adjustments you have to make when writing a script instead of typing in commands
at the interpreter prompt. In the interpreter, when you type the name of a variable or call a function,
the value of the variable or the function’s return value is automatically displayed to the screen. In
a script, this does not occur. In order to see the contents of the return value, you have to pass the
function call as an argument to the print function, as seen in the myscript2.py file in Figure 1.8.
Figure 1.8: The file myscript2.py.
import myfuncs
print(myfuncs.percent_to_decimal(42))
print(myfuncs.percent_to_decimal(2))
print(myfuncs.percent_to_decimal(83))
print(myfuncs.percent_to_decimal(12))
Figure 1.9: The file myscript3.py.
def percent_to_decimal(input):
    return input / 100.
print(percent_to_decimal(42))
print(percent_to_decimal(2))
print(percent_to_decimal(83))
print(percent_to_decimal(12))
Now, when you run the revised script, the expected output is produced:
$ python myscript2.py
0.42
0.02
0.83
0.12
Notice that when we run the script in the terminal, we automatically leave the Python interpreter
at the end of running the script. If you want to stay in the interpreter at the end of running the script,
add a “-i” between python and the script filename, as in:
$ python -i myscript2.py
0.42
0.02
0.83
0.12
>>>
Finally, if we don’t want to keep the percent_to_decimal function in a separate file from our
script, that’s fine too. We can define functions and use them in the same script, as well as setting
variables, etc. By putting the function in our script, we also do not need to import the myfuncs
module. Figure 1.9 shows this revised script, myscript3.py.
1.4 Summary
Well, that’s it! We now know how to use Python as a really fancy programmable business and
financial calculator! As we’ll see later on in this book, this is only a fraction of the powers of
Python, but it’s still very useful nonetheless!
Chapter 2
X-Y Plots
Chapter Contents
2.1 Matplotlib: Python’s basic plotting package
2.2 My first X-Y plot
2.2.1 A line plot
2.2.2 A scatter plot
2.3 ⌨ Python Sidebar: Lists and tuples
2.4 Customizing an X-Y plot
2.4.1 Controlling the axes ranges
2.4.2 Controlling line and marker formatting
2.4.3 Annotation and adjusting the font size of labels
2.4.4 Plotting multiple figures
2.4.5 Plotting multiple curves
2.4.6 Adding a legend
2.4.7 Adjusting the plot size
2.4.8 Saving figures to a file
2.5 ⌨ Python Sidebar: Commenting
2.6 Summary
It is easy to underestimate the power of graphs. When we first learn about graphing, it’s easy
to think graphs are “merely” picture representations of functions or equations. In reality, graphs
or visualizations are a key way of enabling us to understand what the data is telling us. If you
have millions of point-of-sale customer transactions you are analyzing, it is unlikely you can find
a single function to describe that data (or that that function will tell you much about the patterns in
the data). If you need to display the raw data, you either have to do so in a table or in a graph that
provides you insight as to the meaning of the data. Thus, graphs are essential analytical tools for
business data.
Before we begin, credit where credit is due: I owe the matplotlib documentation
(http://matplotlib.org) for a lot of the material in this chapter.
2.1. MATPLOTLIB: PYTHON’S BASIC PLOTTING PACKAGE
The graph created is shown in Figure 2.1. You can either type in each line of code in a Python
interpreter or in a file and run the script file (see Section 1.3.3 on how to write and run a script). If
you do run the lines of code as a script from a terminal window, you have to keep the interpreter
session open in order to see the plot.
Based on what the lines of code above say, what do you think each line does? Here’s my
description:
• Line 1: Import the pyplot module.
2.2. MY FIRST X-Y PLOT
• Line 2: Create the plot. The first argument is the list of x-values (the decimal times) and the
second argument is the list of the corresponding y-values. See Section 2.3 for a discussion
of what a list is.
Once a plot is created (line 2), matplotlib keeps track of what plot is the “current” plot. Subsequent
commands (e.g., to make a label) are applied to the current plot.
Line 5 indicates that as you put each element of a graph onto matplotlib’s virtual canvas, mat-
plotlib does not render your addition but instead waits until you call the show command. Usually,
you care only about how a plot looks when it’s all done. By waiting until show is called, matplotlib
avoids the extra computation involved in rendering intermediate steps. (If you have more than one
figure, call show after all plots are defined to visualize all the plots at once.)
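The pattern described above (import pyplot, create the plot, add labels to the current plot, then render) can be sketched as follows. The data values and label text are hypothetical, and the sketch assumes the non-interactive "Agg" backend and writes the image to an in-memory buffer so that it runs without a display; in an interactive session you would call plt.show() as the last step instead:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption, so no display is needed)
import matplotlib.pyplot as plt

# Hypothetical data: decimal times (x) and a measured value at each time (y).
times = [9.0, 9.5, 10.0, 10.5, 11.0]
values = [120, 135, 128, 150, 162]

lines = plt.plot(times, values)       # create the plot; it becomes the "current" plot
plt.xlabel("Time (decimal hours)")    # later commands apply to the current plot
plt.ylabel("Value")

# Render the finished figure. With an interactive backend, plt.show() goes here.
image_buffer = io.BytesIO()
plt.savefig(image_buffer, format="png")
```

Nothing is drawn until the final rendering call, which matches the behavior described for show: the earlier lines only add elements to matplotlib's virtual canvas.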
Matplotlib does pretty well using intelligent defaults for the graphs you ask it to make. But much
of the time, we’ll want to customize what our graphs look like. We’ll talk about such customization
in a second, but first we take a side-trip to introduce Python lists.
2.3. PYTHON SIDEBAR: LISTS AND TUPLES
Lists are ordered sequences. What that means is they are a collection of items where the first
item has the first position, the second the second position, and so on. They are like arrays, except
the items in a list do not all have to be of the same type. A given list element can also be
set to anything, even another list. Square brackets (“[]”) delimit (i.e., start and stop) a list, and
commas between list elements separate elements from one another.
List element addresses start with zero, so the first element of list a is a[0], the second is a[1],
etc. Because the ordinal value (i.e., first, second, third, etc.) of an element differs from the address
of an element (i.e., zero, one, two, etc.), when we refer to an element by its address we will append
a “th” to the end of the address. That is, the “zeroth” element by address is the first element by
position in the list, the “oneth” element by address is the second element by position, the “twoth”
element by address is the third element by position, and so on.
Finally, the length of a list can be obtained using the built-in len function, e.g., len(a) to find
the length of the list a.
Example 1 (A list):
Type in the following in the Python interpreter:
What is len(a)? What does a[1] equal? How about a[3]? a[3][1]?
Solution and discussion: The len(a) is 4, a[1] equals 3.2, a[3] equals the list [-1.2, 'there', 5.5],
and a[3][1] equals the string 'there'. I find the easiest way to read a complex reference like
a[3][1] is from left to right, that is, “in the threeth element of the list a, take the oneth element.”
In Python, list elements can also be addressed starting from the end; thus, a[-1] is the last
element in list a, a[-2] is the next to last element, etc.
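The listing for Example 1 is not reproduced above. The sketch below uses a list that is consistent with every answer in the discussion (and with the method examples later in this sidebar); the exact contents of a are an assumption, not the original listing.

```python
# Assumed contents of the example list, chosen to match the answers in the text.
a = [2, 3.2, 'hello', [-1.2, 'there', 5.5]]

print(len(a))     # number of elements in the list
print(a[1])       # the "oneth" element
print(a[3])       # the "threeth" element, which is itself a list
print(a[3][1])    # the oneth element of that nested list
print(a[-1])      # negative addresses count from the end of the list
print(a[-2])      # the next-to-last element
```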
You can create new lists that are slices of an existing list. Slicing follows these rules:
• The lower limit of the range is inclusive, and the upper limit of the range is exclusive.
Solution and discussion: You should get the following if you print out the list slice a[1:3]:
Because the upper-limit is exclusive in the slice, the threeth element (i.e., the fourth element) is not
part of the slice; only the oneth and twoth (i.e., second and third) elements are part of the slice.
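As a quick check of the inclusive lower limit and exclusive upper limit, here is a small sketch. It reuses the assumed contents of the list a from Example 1; the second and third slices rely on the standard Python behavior that an omitted limit means "from the start" or "to the end".

```python
# Assumed contents of the example list (the original listing is not shown).
a = [2, 3.2, 'hello', [-1.2, 'there', 5.5]]

middle = a[1:3]       # elements at addresses 1 and 2; address 3 is excluded
print(middle)
print(a[:2])          # omitted lower limit: start from address 0
print(a[2:])          # omitted upper limit: go through the end of the list
print(len(a))         # slicing creates a new list; a itself is unchanged
```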
Lists are mutable (i.e., you can add and remove items, change the size of the list). One way of
changing elements in a list is by assignment (just like you would change an element in an array):
How would we go about replacing the value of the second element with the string 'goodbye'?
Solution and discussion: We refer to the second element as a[1], so using variable assignment,
we change that element by:
a[1] = 'goodbye'
Python lists, however, also have special “built-in” functions that allow you to insert items into
the list, pop off items from the list, etc. We’ll discuss the nature of those functions (which are called
methods; this relates to object-oriented programming) in more detail in Ch. 5. Even without that
discussion, however, it is still fruitful to consider a few examples of using list methods to alter lists:
What do the following commands give you when typed into the Python interpreter?:
• a.insert(2,'everyone')
• a.remove(2)
• a.append(4.5)
• a.index('hello')
• a.count('hello')
Solution and discussion: The first command insert inserts the string 'everyone' into the
list just before the twoth (i.e., third) element of the list, so that 'everyone' becomes the new
twoth element. The second command remove removes the first occurrence of the value given in the
argument (here the value 2, not the element at address 2). The third command append adds the
argument to the end of the list.
For the list a, if we printed out the contents of a after each of the first three lines above were
executed one after the other, we would get:
The fourth command, after running the first three commands, would then return the location in
the list a whose value was 'hello'; that index is 2. The index method returns the address of the
first element in the list that matches the value passed in via the parameter list of the method’s
call. The final command counts the number of occurrences of 'hello' and returns the value 1. If any
list elements are themselves lists (such as the nested list [-1.2, 'there', 5.5] in a), count does
not look into the sub-lists to match the search target.
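The printed output for these commands is not reproduced above. The following sketch runs the five commands in order, again assuming the contents of the list a from Example 1, so you can compare the intermediate lists with the discussion.

```python
# Assumed contents of the example list (the original listing is not shown).
a = [2, 3.2, 'hello', [-1.2, 'there', 5.5]]

a.insert(2, 'everyone')    # 'everyone' becomes the element at address 2
print(a)
a.remove(2)                # removes the first element whose VALUE equals 2
print(a)
a.append(4.5)              # adds 4.5 to the end of the list
print(a)

where = a.index('hello')   # address of the first 'hello'
how_many = a.count('hello')
nested = a.count('there')  # 'there' only occurs inside the sub-list
print(where, how_many, nested)
```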
A little on tuples and strings: Tuples are nearly identical to lists with the exception that tuples
cannot be changed (i.e., they are immutable). That is to say, if you try to insert an element in a
tuple, Python will raise an error. Tuples are defined exactly as lists except you use parentheses as
delimiters instead of square brackets, e.g., b = (3.2, 'hello').
You can, to an extent, treat strings as lists. Thus, if:
a = "hello"
then:
"h" in a
will return True. The command:
"el" in a
will also return True, because the membership operator in will work on contiguous substrings.
You can also slice strings as if each character were a list element. a[1:3] will return the substring
"el". See Section 4.8 for more on indexing strings.
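A short sketch of the tuple and string behavior just described. The variable names s and tuple_changed are only for illustration; the tuple b and the string "hello" come from the text.

```python
b = (3.2, 'hello')

# Trying to change a tuple element raises an error, because tuples are immutable.
try:
    b[0] = 1.0
    tuple_changed = True
except TypeError as err:
    tuple_changed = False
    print("Cannot change a tuple:", err)

s = "hello"
print("h" in s)     # single characters can be tested for membership
print("el" in s)    # so can contiguous substrings
print(s[1:3])       # slicing treats each character like a list element
```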
2.4. CUSTOMIZING AN X-Y PLOT
The single argument to the axis function is a list where the first two elements are the lower and
upper x-axis bounds and the third and fourth elements are the lower and upper y-axis bounds. The
resulting graph is shown in Figure 2.3.
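The plotting call that this paragraph describes is not shown here, so the following is only a sketch with made-up data and bounds. It selects the non-interactive "Agg" backend (an assumption, so that the example runs without opening a window) and then reads the limits back to confirm the order of the four list elements.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: draw off-screen, no window needed
import matplotlib.pyplot as plt

plt.figure()
plt.plot([1, 2, 3, 4], [1, 2.1, 1.8, 4.3])   # made-up data for illustration
plt.axis([0, 10, -2, 20])        # [x lower, x upper, y lower, y upper]

limits = plt.axis()              # called with no arguments, axis returns the current bounds
print(limits)
```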
Table 2.1: Some linestyle codes in pyplot and a high-resolution line plot showing the lines generated
by the linestyle codes. See http://matplotlib.sourceforge.net/api/pyplot_api.html.
Table 2.2: Some marker codes in pyplot and a high-resolution line plot showing the markers
generated by the marker codes. See http://matplotlib.sourceforge.net/api/pyplot_api.html.
Line 1 creates a figure and gives it the name “3”. Lines 2–3 (which are a single logical line to the
interpreter) add a line plot with a circle as the marker to the figure named “3”. Line 4 creates
a figure named “4”, and lines 5–6 add a line plot with a dash-dot linestyle to that figure. Line 7
makes figure “3” the current figure again, and the final line adds a title to figure “3”.
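The eight-line listing discussed above is missing from this copy. The sketch below follows the same structure, counting lines from the first plt.figure call; the data values and the title text are made up, and the "Agg" backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: draw off-screen, no window needed
import matplotlib.pyplot as plt

plt.figure(3)
plt.plot([5, 6, 7, 8], [1, 1.8, -0.4, 4.3],
         marker='o')
plt.figure(4)
plt.plot([0.1, 0.2, 0.3, 0.4], [8, -2, 5.3, 4.2],
         linestyle='-.')
plt.figure(3)
plt.title("First Plot")
```

Calling plt.figure with the number of an existing figure does not create a new one; it makes that figure current, which is why the title ends up on figure “3”.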
The first three arguments specify the x- and y-locations of the first curve, which will be plotted
using a dashed line and a circle as the marker. The second three arguments specify the x- and
y-locations of the second curve, which will be plotted with a solid line and a diamond as the marker.
Both curves will be on the same figure.
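The plot call being described is not shown in this copy. A sketch of a single call with two groups of three arguments might look like the following; the data values are made up, and the format strings '--o' (dashed line, circle marker) and '-D' (solid line, diamond marker) are my reading of the description.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: draw off-screen, no window needed
import matplotlib.pyplot as plt

plt.figure()
# Two curves in one call: (x, y, format) for the first, then (x, y, format) for the second.
lines = plt.plot([0, 1, 2, 3], [1, 2, 3, 4], '--o',
                 [1, 3, 5, 9], [8, -2, 5.3, 4.2], '-D')
print(len(lines))                # plot returns one line object per curve
```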
produces the plot in Figure 2.4.6. Note the "r" and "b" strings in the plot calls produce a red
and blue line/marker, respectively. For more information on legends, see
http://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes.legend (the documentation is pretty
verbose, though, so you might find it more useful once you have some more experience with Python).
2.5. PYTHON SIDEBAR: COMMENTING
plt.savefig('testplot.png', dpi=300)
Here we specify an output resolution using the optional dpi keyword parameter; if left out, the
matplotlib default resolution will be used. Note that it is not enough for you to set dpi in your
figure command to get an output file at a specific resolution. The dpi setting in figure will
control what resolution show displays at while the dpi setting in savefig will control the output
file’s resolution; however, the figsize parameter in figure controls the figure size for both show
and savefig.
You can also save figures to a file using the graphical user interface (GUI) save button that is
part of the plot window displayed on the screen when you execute the show function. If you save
the plot using the save button, it will save at the default resolution, even if you specify a different
resolution in your figure command; use savefig if you want to write out your file at a specific
resolution.
Comments begin with the hash symbol (“#”). Whether the hash is found at the beginning of a line or
midway through a line, the Python interpreter considers everything after and including the hash as a
comment.
Let’s take the code from Section 2.4.4 and add some commenting to it:
These comment lines don’t really say much more than is already clear from the code, but they
illustrate how the comment symbol works.
2.6 Summary
Matplotlib makes it easy to make x-y plots of various types. We’ll be using this package a lot as
we learn more data analysis tools.
Chapter 3
Chapter Contents
3.1 My first dataset
3.2 Choosing which groups of the data to examine and manipulate
3.3 Python Sidebar: More on for looping
3.4 Asking questions of the data
3.5 Python Sidebar: Booleans
3.6 Python Sidebar: Looping an indefinite number of times
3.7 Changing the data
3.8 Analyzing data
3.9 Python Sidebar: Docstrings
3.10 Python Sidebar: Writing code a little bit at a time
3.11 Summary
Calculations that are suitable for a handheld calculator, even a programmable one, represent
only a smattering of the kinds of calculations we can do today and are interested in with regard
to business data. Full-fledged programming languages like Python have the capability to bring
business insights from large collections of data, whether they be from finance, human resources,
manufacturing, sales, operations, or any aspect of a business. In this chapter, we make our first
foray into using Python as a data analysis tool.
3.1. MY FIRST DATASET
35384
36453
36644
36211
37208
37727
37436
36568
34945
34517
33834
33703
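Let’s first calculate the mean of these values. A function for calculating the mean is called, not
surprisingly, mean, and is part of the SciPy package. It takes a single list of input values (see
Section 2.3 for more on lists) as an input parameter. (Newer SciPy releases have deprecated this
alias of NumPy’s mean function; if scipy.mean is unavailable in your installation, import numpy
and call numpy.mean instead.) Thus, to calculate the mean and save the value as the variable
mean_2015 we would type: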
import scipy
mean_2015 = scipy.mean([35384, 36453, 36644, 36211, 37208, 37727, \
37436, 36568, 34945, 34517, 33834, 33703])
We could also first save the year’s worth of input values as a list variable and pass the variable into
the function:
import scipy
data_2015 = [35384, 36453, 36644, 36211, 37208, 37727, \
37436, 36568, 34945, 34517, 33834, 33703]
mean_2015 = scipy.mean(data_2015)
This is all fine and good, but what if we want to do more than just pass the data into a function?
What if we want to manipulate or change the data values in some way to help us get the calculations
we are interested in? That is to say, how do we go about:
• Choosing which groups of the data to examine and manipulate (e.g., “each item, one at a
time,” “just the first five items,” etc.).
• Asking questions of the data (e.g., “which values are greater than $36,000 million,” “what
months are sales less than $37,000 million,” etc.).
3.2. CHOOSING WHICH GROUPS OF THE DATA TO EXAMINE AND MANIPULATE
Let’s unpack this. The for statement in line 2 tells Python to go through each element of data_2015
and set the variable idata to that element, one at a time, in the order they are given in the list
data_2015. Each time idata is set, Python executes the body of the loop, that is, all the
lines of code that are indented four spaces under the for line. In this case, there is only one line
in the body of the loop, and so nothing else is done except printing idata.
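The listing being unpacked is not reproduced above. A minimal sketch of such a loop, using the 2015 values from Section 3.1, could be (the line numbers in the discussion refer to the original listing, not to this sketch):

```python
data_2015 = [35384, 36453, 36644, 36211, 37208, 37727,
             37436, 36568, 34945, 34517, 33834, 33703]

# idata takes on each element of the list in turn, January first.
for idata in data_2015:
    print(idata)
```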
What if you want to go through the list in a different order? You can loop through the list by
referencing the indices of the list elements rather than the elements themselves. For instance, if
you wanted to print out each element of data_2015, backwards (i.e., starting with the December
value and ending with the January value), you could do the following:
We could also store the list of indices as a list variable and loop through that list in the for loop
immediately above. Remember the first element of a list has index 0.
We can use the same strategy of looping through indices to enable us to examine only a subset
of the data. For instance, to print out only the data for February and March:
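Neither index-based listing survives in this copy. The two loops described above might be sketched as follows; writing the index lists out by hand is an assumption about how the original did it.

```python
data_2015 = [35384, 36453, 36644, 36211, 37208, 37727,
             37436, 36568, 34945, 34517, 33834, 33703]

# Backwards: start at the December value (index 11) and end at January (index 0).
for i in [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]:
    print(data_2015[i])

# Only February and March, which are at indices 1 and 2.
for i in [1, 2]:
    print(data_2015[i])
```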
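Because going through a list of indices is such a common operation, Python includes a built-in
function called range which produces such a sequence of indices: range(n) yields the integers
0, 1, 2, …, n − 1 (in Python 3, wrap the call in list, as in list(range(n)), if you want to see
them as the list [0, 1, 2, …, n − 1]). Thus: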
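The listing that followed is missing here; a sketch of using range to loop over every index of the 2015 data could look like this:

```python
data_2015 = [35384, 36453, 36644, 36211, 37208, 37727,
             37436, 36568, 34945, 34517, 33834, 33703]

first_five = list(range(5))   # list(...) shows the indices that range produces
print(first_five)

for i in range(12):           # i runs from 0 through 11
    print(data_2015[i])
```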
3.3. PYTHON SIDEBAR: MORE ON FOR LOOPING
In line 4, we have a branching statement or if statement. It takes the value of idata, checks to
see if idata is greater than 35000, and if so, prints the value of idata to the screen.
Note that as with a for loop, if the true/false test the if statement examines is true, the statement
executes the block of lines after the if statement, and this block of lines is denoted by four spaces
of indentation. In the above code, there is only one line executed when idata is greater than 35000;
if we had more lines, we’d list them one after the other, all indented four spaces in.
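The code being discussed is not shown in this copy. A sketch consistent with the description, laid out so that the if statement falls on line 4, is:

```python
data_2015 = [35384, 36453, 36644, 36211, 37208, 37727,
             37436, 36568, 34945, 34517, 33834, 33703]
for idata in data_2015:
    if idata > 35000:
        print(idata)   # only runs when the test above is true
```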
What if we decided we wanted to check to see which values of the data are over 35000 and for
those values to print both the value of the data and the month number? To do this, we’ll make use
of the indices of the list, since the index values are the same as the month number, minus one. We
thus loop over the indices instead of the data values themselves:
Note that in line 3, the input parameter for range is the return value of the call to len, using
data_2015 as the input parameter to that call. This way, if the length of data_2015 changes, we
don’t have to change the for loop. Also remember that when you build the output message by joining
strings with +, every piece has to be a string, so you have to use the str function to convert the
month number (i.e., i+1 in line 6) and the value from data_2015 (i.e., data_2015[i] in line 6).
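The listing referred to is missing here. The sketch below follows the description; the wording of the printed message and the way the print call is split over two lines are assumptions.

```python
data_2015 = [35384, 36453, 36644, 36211, 37208, 37727,
             37436, 36568, 34945, 34517, 33834, 33703]
for i in range(len(data_2015)):
    if data_2015[i] > 35000:
        print("Month " + \
              str(i+1) + ": " + str(data_2015[i]))
```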
Finally, what if we wanted to ask a question and do one thing if the answer is true and something
else if the answer is false? To do so, we use if paired with else. In the code below,
we print the month number and amount if the value of the data is greater than 35000 and print a
message "Too low" otherwise:
Note that there is a colon after the else, and the lines of code you want to execute if the if
condition is false are also indented four spaces. As you’ve noticed, every block of code that forms
a distinct grouping (e.g., a function definition, a for loop body, an if statement body) has its
lines indented one level, that is, four spaces.
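The if/else listing is not reproduced in this copy. A sketch matching the description (the exact message format is an assumption) is:

```python
data_2015 = [35384, 36453, 36644, 36211, 37208, 37727,
             37436, 36568, 34945, 34517, 33834, 33703]
for i in range(len(data_2015)):
    if data_2015[i] > 35000:
        print(str(i+1) + ": " + str(data_2015[i]))
    else:
        print("Too low")   # runs only when the if test is false
```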
3.5. PYTHON SIDEBAR: BOOLEANS
Expressions like idata > 35000 are called boolean expressions because the Python interpreter
returns a “true” or “false” value after evaluating the expression. For instance, typing the
following in the interpreter gives:
Notice that the values that are returned from evaluating the true/false expression are the values True
and False. (The capitalized first letter is meaningful here, as in all Python code because Python is
case-sensitive.) These are actual values, just like a number or a string. And like actual values, you
can set variables to them:
Variables whose values are either True or False are called boolean variables. You can set a
variable directly to one of those values or, as we did above, set them to the result of a boolean
expression.
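The interpreter sessions for this passage are missing, so here is a small sketch of both ideas: evaluating boolean expressions and storing their results. The specific values and the names is_high and flag are made up for illustration.

```python
idata = 36453

print(idata > 35000)      # a boolean expression that evaluates to True
print(idata > 40000)      # and one that evaluates to False

is_high = idata > 35000   # store the result of a boolean expression
flag = False              # or set a variable directly to True or False
print(is_high, flag)
print(type(is_high))      # boolean values have their own type, bool
```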
The operators we use to operate on boolean variables are called logical or boolean operators.
The main ones are: and, or, and not.
a = True
b = False
print(a and b)
print(a or b)
print(not a)
Solution and discussion: The first two lines assign a and b as boolean variables. The first two
print calls display False and True, respectively, and the third displays False. Remember that and
requires both operands to be True in order to return True, while or only requires one of the
operands be True to return
True. The not operator is a unary operator, meaning it operates on one operand rather than two (like
and and or).
Membership testing: The in operator is another useful operator that returns a boolean value,
though it is not as well-known as the traditional comparison operators like equality, greater than,
etc. The in operator can be used on Python lists and strings to check whether some value is one of
the elements (for a list) or is a substring (for a string).
Solution and discussion: An expression using the in operator returns a boolean value. The
first command returns True because the string “hi” is in the list words. The next two commands
return False because neither “okay” nor “y” is an element of the list. Note that “y” exists
as a substring of “yes”, but that doesn’t count as far as the in operator is concerned in the above
commands, since we’re using the operator on a list; the value to the left of in has to completely
match at least one of the elements of the list to the right of in. The final command returns False.
Since “bye” is in the list, it is false that “bye” is not in words. Note that the usual way to say
“not in” is the single operator not in, rather than a not applied to the result of an in expression
(although that also works; boy, that’s a tongue twister!).
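The commands for this example are not shown in this copy. The sketch below assumes the list words holds 'hi', 'bye', and 'yes', which is all the discussion tells us about it, and runs the four tests in the order described, followed by the same kind of test on a string for contrast.

```python
words = ['hi', 'bye', 'yes']   # assumed contents, inferred from the discussion

print('hi' in words)        # 'hi' matches an element exactly
print('okay' in words)      # no element equals 'okay'
print('y' in words)         # a substring of an element does not count for a list
print('bye' not in words)   # 'bye' is an element, so "not in" is false

print('y' in 'yes')         # on a string, substrings do count
```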
String membership: When in is used on strings, Python looks for any occurrence of the entire
string to the left of in anywhere in the string to the right of in. Thus:
>>> 'us' in 'business'
will return True but:
>>> 'using' in 'business'
will return False.
Note that the syntax of in for membership testing is not the same as in in a for…in statement.
Even though the same term (“in”) is being used, in the latter case the in functions to tell Python to
iterate through what is to the right of the in.
3.6. PYTHON SIDEBAR: LOOPING AN INDEFINITE NUMBER OF TIMES
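The loop that the following discussion refers to is missing from this copy. A loop consistent with the description (printing one through nine; the starting value is an assumption) would be:

```python
a = 1
while a < 10:      # the condition is checked before every pass through the block
    print(a)
    a = a + 1      # without this line the condition would never become False
```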
Solution and discussion: This will print out the integers one through nine, with each integer
on its own line. Prior to executing the code block underneath the while statement, the interpreter
checks whether the condition (a < 10) is true or false. If the condition evaluates as True, the code
block executes; if the condition evaluates as False, the code block is not executed. Thus:
a = 10
while a < 10:
    print(a)
    a = a + 1
will do nothing. Likewise:
a = 10
while False:
    print(a)
    a = a + 1
will also do nothing. In this last code snippet, the value of the variable a is immaterial; as the
condition is always False, the body of the while loop will never execute. (Conversely, a while True:
statement will never terminate on its own. It is a bad idea to write such a statement.)
Please see your favorite Python reference if you’d like more information about while loops.
3.7. CHANGING THE DATA
The second way is to create a separate, new list, loop through the elements of the old list, make
changes to the values of the old list, then save the changed values into the new list. The following
does the same as the previous code snippet except the “changed” list of data is in a list called
new_data_2015. Note that the old list of data data_2015 is unchanged:
1 new_data_2015 = []
2 data_2015 = [35384, 36453, 36644, 36211, 37208, 37727, \
3              37436, 36568, 34945, 34517, 33834, 33703]
4 for idata in data_2015:
5     if idata > 35000:
6         new_data_2015.append(0)
7     else:
8         new_data_2015.append(idata)
In line 1, we create a new list as an empty list. In lines 6 and 8, we use the append method to add
either 0 or the value of idata into the new list new_data_2015.
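The first approach is not shown in this copy. Assuming it changes data_2015 itself rather than building a new list, a sketch of it would loop over the indices and overwrite elements in place, applying the same rule as the listing above:

```python
data_2015 = [35384, 36453, 36644, 36211, 37208, 37727,
             37436, 36568, 34945, 34517, 33834, 33703]

# Overwrite each value above 35000 with 0, directly in the original list.
for i in range(len(data_2015)):
    if data_2015[i] > 35000:
        data_2015[i] = 0

print(data_2015)
```

The trade-off is that the original values are lost after the loop runs, which is why the second approach keeps them in a separate list.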
3.8. ANALYZING DATA
and
\[ \sigma = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} \]
respectively.
We can, of course, use the SciPy mean and std functions to calculate the mean and standard
deviation, respectively:
import scipy
data_2015 = [35384, 36453, 36644, 36211, 37208, 37727, \
37436, 36568, 34945, 34517, 33834, 33703]
data_mean = scipy.mean(data_2015)
data_std = scipy.std(data_2015)
print(data_mean)
print(data_std)
35885.8333333
1322.68998845
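which are the annual mean and standard deviation of U.S. gasoline retail sales (adjusted) for 2015.
(Note that std divides by N by default, which is what produced the value printed above; to divide
by N − 1 as in the definition given earlier, pass the keyword ddof=1.)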
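As a cross-check, here is a sketch of the same calculation using NumPy directly (NumPy is introduced in Ch. 4; using it here is my substitution for the SciPy aliases). It also shows the ddof keyword, which controls whether the sum of squared deviations is divided by N (the default) or by N − 1.

```python
import numpy as np

data_2015 = [35384, 36453, 36644, 36211, 37208, 37727,
             37436, 36568, 34945, 34517, 33834, 33703]

data_mean = np.mean(data_2015)
std_n = np.std(data_2015)                 # default ddof=0: divide by N
std_n_minus_1 = np.std(data_2015, ddof=1) # ddof=1: divide by N - 1

print(data_mean)
print(std_n)
print(std_n_minus_1)
```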
But, using a pre-written function isn’t very instructive. Let’s start with the definition of mean
given above and calculate the 2015 annual mean directly from the mathematical definition, using
what we’ve learned about looping.
In our previous loops, we hadn’t done any arithmetic with the numbers. To calculate the
mean, we see from the presence of a summation operator that we need to create a running sum of
the data values as well as figure out how many points we’re averaging over (N). To do
the former, we create a variable called running_sum, initialize it to 0, and add to it in the loop. To
do the latter, we can use the len function on the list of data:
which gives 35885.8333333, just as the SciPy function mean did. Note that Python supports
augmented assignment operators such as +=, so we could rewrite line 5 as:
running_sum += idata
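The running-sum listing itself is missing from this copy. The sketch below follows the description and is laid out so that the line that adds to running_sum is line 5; the name data_mean for the result is an assumption.

```python
data_2015 = [35384, 36453, 36644, 36211, 37208, 37727, \
             37436, 36568, 34945, 34517, 33834, 33703]
running_sum = 0
for idata in data_2015:
    running_sum = running_sum + idata
data_mean = running_sum / len(data_2015)   # N comes from len
print(data_mean)
```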
Of course, SciPy has many more statistical functions. Some are at the SciPy module level like
mean while others are in the stats submodule of SciPy (and are accessed by an import scipy.stats
command). To learn more about all that SciPy has to offer, see the SciPy Reference Guide
(http://docs.scipy.org/doc/scipy/reference/) and the NumPy Reference Guide
(http://docs.scipy.org/doc/numpy/reference/). Many of the NumPy functions (such as mean) are
available at the SciPy module level, but the documentation is in the NumPy Reference Guide. (We’ll
talk more about the NumPy package in Ch. 4.)
3.9. PYTHON SIDEBAR: DOCSTRINGS
def percent_to_decimal(input):
    """Convert a percent to its decimal equivalent.

    Examples:
    >>> percent_to_decimal(42)
    0.42
    """
    return input / 100.
The docstring is a triple-quote string (see Section 1.3.2 regarding triple-quote strings) that
immediately follows the def line of the function. Note that the docstring is indented in four
spaces, so Python knows it’s part of the function’s definition block.
The first line of the docstring is a single sentence summarizing what the function does. This
line is followed by a blank line and then one or more paragraphs describing the algorithm used in
the function, general details about dependencies, general details about the return value, etc. (In
this case, we wouldn’t normally say how the decimal equivalent is calculated since it’s a trivial
calculation, but we put such a description here for illustrative purposes.) Finally, the docstring
has sections that describe the input parameters, provide examples of how to use the function, etc.
Information is provided on the type of variables, what those variables correspond to, etc.
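To see the docstring at work, the sketch below defines the function with the summary line, blank line, and a one-sentence description (that sentence is my own filler, not the original text) and then reads the docstring back through the function’s __doc__ attribute. Typing help(percent_to_decimal) in the interpreter displays the same text.

```python
def percent_to_decimal(input):
    """Convert a percent to its decimal equivalent.

    The percent value is divided by 100.

    Examples:
    >>> percent_to_decimal(42)
    0.42
    """
    return input / 100.


# The docstring is stored with the function and can be read back.
print(percent_to_decimal.__doc__)
print(percent_to_decimal(42))
```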
3.10. PYTHON SIDEBAR: WRITING CODE A LITTLE BIT AT A TIME
Notice how we put returns in the docstring to make the text nicely formatted if the code were
printed on a piece of letter-sized paper. (The triple-quotes make this automatic.) I highly
recommend this practice! It makes your documentation much easier to read for someone who is looking
through your code file.
3.11 Summary
With looping and branching, we can efficiently examine datasets and manipulate the data.
Variables, looping, and branching are the three core constructs of a programming language. With
these, you can get the computer to do incredibly complex calculations. How to do so, and the
additional tools Python gives us to make our lives easier in more advanced data analysis, are the
topic of the next chapter.
Chapter 4
Chapter Contents
4.1 Revisiting the gasoline retail dataset
4.2 Basic statistics
4.3 Python Sidebar: More on creating and using NumPy arrays
    4.3.1 Creating arrays
    4.3.2 Array indexing
    4.3.3 Array inquiry
    4.3.4 Array manipulation
    4.3.5 Calculations using array contents
4.4 Correlation and lag autocorrelation
4.5 More complicated manipulations
4.6 Python Sidebar: Testing inside an array
    4.6.1 Testing inside an array: Method 1 (loops)
    4.6.2 Testing inside an array: Method 2 (array syntax)
    4.6.3 Additional array functions
4.7 Python Sidebar: Floating point comparison
4.8 Python Sidebar: String indexing
4.9 Python Sidebar: A simple way of seeing how fast your code runs
4.10 Summary
Taking a list of data and processing it is useful. But most of the time, we’re interested in not
only a single list of data but a whole table (or bunch of tables) of data. Many times in business
scenarios, we are looking to see how many factors (e.g., costs from many suppliers, pricing in
different markets, etc.) relate to each other and tell us about how a business is doing and how to
grow the business. Clearly, single lists of data won’t get us very far.
4.1. REVISITING THE GASOLINE RETAIL DATASET
The key to using Python to wrangle these kinds of business datasets is the NumPy array. NumPy
arrays (which we’ll just call arrays from now on) are tables of numbers. These tables can be
one-dimensional (as in a list of data), two-dimensional (like a spreadsheet), three-dimensional
(like a workbook of spreadsheets), or four, five, or more-dimensional. Arrays differ from lists
mainly in two ways: every element in a given array has to be of the same type, and the syntax for
dealing with multi-dimensional arrays is a lot more straightforward and powerful.
In this chapter, we’ll look at data analysis that deals with this more complex data. Along the
way, we’ll describe how we create and manipulate arrays.
YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
1992 12803 12601 12639 12710 12870 12886 12966 13092 13246 13304 13358 13475
1993 13398 13596 13487 13539 13521 13461 13502 13372 13416 13755 13754 13627
1994 13656 13875 13911 13867 13773 14059 14322 14685 14649 14709 14878 14885
1995 14915 14963 14921 15021 15215 15333 15249 15227 15194 15007 14981 15246
1996 15479 15457 15951 16252 16488 16352 16075 16066 16088 16360 16602 16759
1997 16938 16984 16969 16613 16367 16378 16396 16682 16757 16711 16638 16536
1998 16225 16054 15703 15798 16057 15933 16087 15851 15812 15948 16014 16399
1999 16394 16309 16529 17115 17186 17070 17704 18135 18339 18632 18992 20095
2000 19358 20214 20819 19907 20242 20885 20905 20574 21327 21412 21906 22124
2001 21931 21615 20652 21618 22628 21928 20686 20878 21336 20033 19245 18950
2002 19130 19068 19813 20856 20930 20733 21438 21182 21272 21659 22009 22464
2005 28438 29015 29528 29950 29113 29917 30976 33368 36008 36088 33185 33119
2006 34298 34174 34126 35752 36126 36127 37064 37765 34865 33010 33393 34858
2007 34168 34973 35963 36386 38031 37207 36996 37155 38327 39425 41897 41637
2008 42425 42411 43140 43025 44313 46429 46869 45701 45747 40640 31890 27952
2009 28768 29910 28811 28848 30507 33241 33057 34781 34868 34955 36655 37232
2010 37372 37009 37184 37208 36700 35761 35778 36261 37423 38534 39269 41172
2011 41731 42226 43654 44506 45191 44501 44638 45047 44935 44755 45596 45142
2012 45673 47216 47358 47023 45918 43711 43467 46411 47115 48015 46690 45827
2013 45932 48776 47161 45760 45212 45471 45384 44847 45035 44381 43963 45909
2014 46399 46816 46041 46317 45976 44973 44563 44416 44220 43305 42282 39199
2015 35384 36453 36644 36211 37208 37727 37436 36568 34945 34517 33834 33703
2016 32653 30967 32113 32984 33538 34265 33329
Table 4.1: Adjusted (for season, holiday, and trading-days) total monthly sales at gasoline stations, for the U.S. as a whole, from calendar
years 1992–2016 (in millions of dollars). Source: https://www.census.gov/retail/marts/www/adv44700.txt (accessed September 8, 2016).
4.2. BASIC STATISTICS
Immediately, we see that this table of data is best understood as a two-dimensional grid, rather
than a list. How do we hold all this data in a single variable in Python? We could make a list of
lists, but the syntax of specifying elements of such a variable sounds like it’d be a little hairy.
Instead, the NumPy package gives Python a nice data structure called a NumPy array that makes
handling two- (and higher) dimensional data very straightforward. NumPy arrays are similar to Java
arrays but are much more useful and powerful. Because “NumPy arrays” is a mouthful, we’ll often just
call them “arrays.”
In the next section, we’ll take a look at some basic statistical calculations using the gasoline
retail data from calendar years 1992–2015 (since the 2016 data ends with July in Table 4.1). Through
that example, and later ones, we’ll see how to create and manipulate arrays.
To utilize the NumPy array functions, we need to first import the NumPy package, as shown in line
1. We give an alias to the package, so instead of having to type numpy before each NumPy function
we use, we only need to type np. We enter the values as a list of lists, in lines 2–6. Finally, we
create an array version of data_as_list in the last line, by using the array function.
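The listing described here is not reproduced in this copy. The sketch below shows the same three steps on a small subset of Table 4.1 (only the 1992 and 1993 rows, to keep it short); the original listing presumably held more years.

```python
import numpy as np

# Each inner list is one year (row) of monthly values from Table 4.1.
data_as_list = [[12803, 12601, 12639, 12710, 12870, 12886,
                 12966, 13092, 13246, 13304, 13358, 13475],
                [13398, 13596, 13487, 13539, 13521, 13461,
                 13502, 13372, 13416, 13755, 13754, 13627]]

data_as_array = np.array(data_as_list)   # convert the list of lists to a 2-D array
print(data_as_array)
```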
With our data in array form, we can more easily slice and dice the data to do the calculations
we want. To calculate the annual means for each year in this dataset, we need to go through each
row of the array (since each row holds a year of data), and calculate the mean of all the values in
that row; for the monthly means, we do the same for each column. The following code will make these
calculations, using the data_as_array array, and fill the lists annual_means and monthly_means
with the respective means:
1 import numpy as np
2
3 annual_means = []
4 num_rows = np.shape(data_as_array)[0]
5 for irow in range(num_rows):
6     annual_means.append( np.mean(data_as_array[irow,:]) )
7
8 monthly_means = []
9 num_cols = np.shape(data_as_array)[1]
10 for icol in range(num_cols):
11     monthly_means.append( np.mean(data_as_array[:,icol]) )
You might be looking at the code above and be saying, “Whoa!” It is a bit dense. But based on the
names of the variables, we can make some reasoned guesses as to what is going on in the code.
In line 3, we initialize an empty list to hold all the values of the annual means that we will
calculate below. Line 4 appears to calculate the number of rows in data_as_array, but how does
it do that? It makes use of a function shape which, when we compare to line 9, apparently returns
a tuple that gives the number of rows then the number of columns. We’ll talk more about shape and
an array’s shape in Section 4.3.
Once we know how many rows there are, we iterate through the indices of the rows of data_as_array,
starting with the loop definition in line 5. The loop body, in line 6, needs a little unpacking. In
Python, you’ll often see these kinds of seemingly complicated, composite statements, because you
can use the return value of an operation or function call as input into another operation or function
call. The way to read these composite statements is to start in the inner-most parenthesis or
operation nesting level and work outwards from there. In line 6, this is the data_as_array[irow,:]
statement. We’ll talk more about this in Section 4.3, but what this statement does is select the row
with index irow and returns all elements of that row as a subarray. That return value is then fed
into the NumPy mean function, which takes the mean of that subarray. The return value from the
mean function call is a scalar that is appended to the annual_means list.
Lines 8–11 behave similarly to lines 2–7. The difference here is that in line 10 we iterate over all
the indices of the columns of data_as_array, and the subarray we create using
data_as_array[:,icol] in line 11 is an array consisting of all the elements in the column icol
(i.e., all the Januarys, all the Februarys, etc.).
As we can see, arrays are like lists but also different, and provide additional functionality that
makes our code more compact and powerful. We now turn to Section 4.3 to spend more time talking
about making and using NumPy arrays and to do so with a little more structure.
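To make the pieces of that code easier to see in isolation, here is a small sketch on a two-year subset of Table 4.1 (1992 and 1993 only, an assumption made to keep the example short). It shows what shape returns and what the row and column selections look like.

```python
import numpy as np

data_as_array = np.array([[12803, 12601, 12639, 12710, 12870, 12886,
                           12966, 13092, 13246, 13304, 13358, 13475],
                          [13398, 13596, 13487, 13539, 13521, 13461,
                           13502, 13372, 13416, 13755, 13754, 13627]])

dims = np.shape(data_as_array)        # (number of rows, number of columns)
print(dims)

first_year = data_as_array[0, :]      # every column of row 0: all of 1992
januaries = data_as_array[:, 0]       # every row of column 0: all the Januarys
print(first_year)
print(januaries)
print(np.mean(first_year))            # the 1992 annual mean
```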
4.3. PYTHON SIDEBAR: MORE ON CREATING AND USING NUMPY ARRAYS
import numpy as np
a = np.array(mylist)
The array function will match the array type to the contents of the list. Note that the elements of
mylist have to be convertible to the same type. Thus, if the list elements are all numbers (floating
point or integer), the array function will work fine. Otherwise, things could get dicey.
Sometimes you will want to make sure your NumPy array elements are of a specific type. To
force a certain numerical type for the array, set the dtype keyword to a type code:
import numpy as np
a = np.array(mylist, dtype='d')
where the string 'd' is the typecode for double-precision floating point. Some common typecodes
(which are all strings) include:
• 'd': Double precision floating
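A short sketch of how the array function picks an element type, and how dtype overrides it. The contents of mylist are made up for illustration; the exact integer type NumPy chooses depends on your platform, so the sketch only checks that it is some integer type.

```python
import numpy as np

mylist = [2, 3, -5, 21]               # made-up values for illustration

a = np.array(mylist)                  # all integers, so an integer array
b = np.array([2, 3.5, -5, 21])        # one float makes every element a float
c = np.array(mylist, dtype='d')       # force double-precision floating point

print(a.dtype)
print(b.dtype)
print(c.dtype)
```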
Often you will want to create an array of a given size and shape, but you will not know in
advance what the element values will be. To create an array of a given shape filled with zeros,
use the zeros function, which takes the shape of the array (a tuple) as the single positional input
argument (with dtype being optional, if you want to specify it):
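Here is a minimal sketch of such a call. The shape of three rows and two columns is an assumption chosen only for illustration, as is the choice of double-precision elements:

```python
import numpy as np

# Assumed shape: 3 rows by 2 columns, passed as a tuple.
a = np.zeros((3, 2), dtype='d')

print(np.shape(a))
```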
Print out the array you made by typing in print(a). Did you get what you expected?
>>> print(a)
[[ 0. 0.]
[ 0. 0.]
[ 0. 0.]]
Note that you don’t have to type import numpy as np prior to every use of a function from
NumPy, as long as earlier in your source code file you have done that import. In the examples in this
section, I will periodically include this line to remind you that np is now an alias for the imported
NumPy module. However, in your own code file, if you already have the import numpy as np
statement near the beginning of your file, you do not have to type it in again as per the example.
Likewise, if I do not tell you to type in the import numpy as np statement, and I ask you to use
a NumPy function, I’m assuming you already have that statement earlier in your code file.
Also note that while the input shape into zeros is a tuple, which all array shapes are, if you
type in a list, the function call will still work.
Another array you will commonly create is the array that corresponds to the output of range,
that is, an array that starts at 0 and increments upwards by 1. NumPy provides the arange function
for this purpose. The syntax is the same as range, but it optionally accepts the dtype keyword
parameter if you want to select a specific type for your array elements:
a = np.arange(10)
Print out the array you made by typing in print(a). Did you get what you expected?
>>> print(a)
[0 1 2 3 4 5 6 7 8 9]
Note that because the argument of arange is an integer, the resulting array has integer elements.
If, instead, you had typed in arange(10.0), the elements in the resulting array would have been
floating point. You can accomplish the same effect by using the dtype keyword input parameter,
of course, but I mention this because sometimes it can be a gotcha: you intend an integer array but
accidentally pass in a floating point value for the number of elements in the array, or vice versa.
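A short sketch of this gotcha follows. The variable names are made up for illustration; the dtype attribute's kind and char values are used only to show which element type resulted:

```python
import numpy as np

int_array = np.arange(10)           # integer argument -> integer elements
float_array = np.arange(10.0)       # floating point argument -> floating point elements
forced = np.arange(10, dtype='d')   # dtype keyword forces double precision

print(int_array.dtype.kind)     # 'i' means integer
print(float_array.dtype.kind)   # 'f' means floating point
print(forced.dtype.char)        # 'd' means double precision
```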
>>> print(a[1])
3.2
>>> print(a[1:4])
[ 3.2 5.5 -6.4]
>>> print(a[2:])
[ 5.5 -6.4 -2.2 2.4]
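For reference, here is a sketch of 1-D slicing that is consistent with the output above. The first element of the array (2) is an assumption; the remaining values are taken from the printed results, and the variable names on the left are made up for illustration:

```python
import numpy as np

a = np.array([2, 3.2, 5.5, -6.4, -2.2, 2.4])

second = a[1]       # single element at index 1
middle = a[1:4]     # indices 1, 2, 3 (the upper bound 4 is excluded)
tail = a[2:]        # index 2 through the end
last_two = a[-2:]   # negative indices count from the end
```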
import numpy as np
a = np.array([[2, 3.2, 5.5, -6.4, -2.2, 2.4],
              [1, 22, 4, 0.1, 5.3, -9],
              [3, 1, 2.1, 21, 1.1, -2]])
>>> print(a[1,2])
4.0
>>> print(a[:,3])
[ -6.4 0.1 21. ]
>>> print(a[1,:])
[ 1. 22. 4. 0.1 5.3 -9. ]
>>> print(a[1,1:4])
[ 22. 4. 0.1]
Note that when I typed in the array I did not use the line continuation character at the end of
each line because I was entering in a list, and by starting another line after I typed in a comma,
Python automatically understood that I had not finished entering the list and continued reading the
line for me.
Solution and discussion: Here are some results using the example array from Example 13:
>>> print(np.shape(a))
(3, 6)
>>> print(np.ndim(a))
2
>>> print(np.size(a))
18
>>> print(a.dtype.char)
d
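Note that you should not use len to get the number of elements in an array: for a multidimensional
array, len returns only the length of the first dimension (3 for the array above). Use the size function
instead, which returns the total number of elements in an array. Finally, a.dtype.char is an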
example of an array attribute; notice there are no parentheses at the end of the specification because
an attribute variable is a piece of data, not a function that you call.
The neat thing about array inquiry functions (and attributes) is that you can write code to operate
on an array in general instead of a specific array of given size, shape, etc. This allows you to write
code that can be used on arrays of all types, with the exact array determined at run time.
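As an illustration, here is a sketch of a function that uses an inquiry function to adapt to whatever 2-D array it is given. The function name row_means and the sample arrays are hypothetical, made up for this example:

```python
import numpy as np

def row_means(arr):
    """Return the mean of each row of a 2-D array of any size."""
    n_rows, n_cols = np.shape(arr)          # determined at run time
    means = np.zeros(n_rows, dtype='d')
    for irow in range(n_rows):
        means[irow] = np.sum(arr[irow, :]) / n_cols
    return means

small = np.array([[1.0, 3.0],
                  [2.0, 6.0]])
wide = np.array([[0.0, 1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0, 7.0],
                 [8.0, 9.0, 10.0, 11.0]])

print(row_means(small))
print(row_means(wide))
```

The same function body handles both the two-by-two and the three-by-four array because nothing about the size is written into the code.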
import numpy as np
a = np.array([[2, 3.2, 5.5, -6.4, -2.2, 2.4],
              [1, 22, 4, 0.1, 5.3, -9],
              [3, 1, 2.1, 21, 1.1, -2]])
If we type np.shape(a), we find this is a (3, 6) shape array (i.e., three rows and six columns).
That equals 18 elements. We can reshape this into a 2 slice, 3 row, and 3 column array using
reshape. This reshaped array has the shape (2, 3, 3). Note that the total number of elements
of this reshaped array is the same as the original array, i.e., 18 elements. The syntax for using
reshape is reshape(input_array, new_shape). Thus, the example described in this paragraph,
when typed into the interpreter, would be:
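A sketch of that call, using the array a defined above and the name b for the result (the name used in the discussion that follows):

```python
import numpy as np

a = np.array([[2, 3.2, 5.5, -6.4, -2.2, 2.4],
              [1, 22, 4, 0.1, 5.3, -9],
              [3, 1, 2.1, 21, 1.1, -2]])
b = np.reshape(a, (2, 3, 3))   # 2 slices, 3 rows, 3 columns

print(np.shape(b))
print(np.size(b))
```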
Note that when you re-shape an array, you do not move the elements around in memory. All
reshaping does is change when you decide to stop a row, a slice, etc. Put another way, the 4 and
0.1 in array a above are stored next to each other in the computer’s memory, but so too are the 4
and 0.1 in array b, even though in array b the 4 ends a row and slice. In that sense, row, slice, etc.
breaks are not “hard-wired” into the way the computer stores the array but are rather “bookmarks”
the computer keeps track of to allow us to use array slicing (as we saw in Example 13).
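One way to check this for yourself is sketched below. It assumes the (3, 6) array a and its (2, 3, 3) reshaped version b from the discussion above. Note that NumPy returns a view on the same memory only when it can; in other situations reshape may have to make a copy.

```python
import numpy as np

a = np.array([[2, 3.2, 5.5, -6.4, -2.2, 2.4],
              [1, 22, 4, 0.1, 5.3, -9],
              [3, 1, 2.1, 21, 1.1, -2]])
b = np.reshape(a, (2, 3, 3))

# Flattening either array gives the same values in the same order.
same_order = np.array_equal(np.ravel(a), np.ravel(b))

# For an array created this way, the reshaped array uses the same memory.
shares = np.shares_memory(a, b)

# The 4 sits at a[1, 2] in the original and at b[0, 2, 2] after reshaping.
print(a[1, 2], b[0, 2, 2])
```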
 1 import numpy as np
 2 a = np.array([[2, 3.2, 5.5, -6.4],
 3               [3, 1, 2.1, 21]])
 4 b = np.array([[4, 1.2, -4, 9.1],
 5               [6, 21, 1.5, -27]])
 6 shape_a = np.shape(a)
 7 product_ab = np.zeros(shape_a, dtype='f')
 8 for i in range(shape_a[0]):
 9     for j in range(shape_a[1]):
10         product_ab[i,j] = a[i,j] * b[i,j]
Can you describe what is happening in each line?
Solution and discussion: In the first four lines after the import line (lines 2–5), I create arrays
a and b. They are both two row, four column arrays. In the sixth line, I read the shape of array a and
save it as the variable shape_a. Note that shape_a is the tuple (2,4). In the seventh line, I create
a results array of the same shape as a and b, of single-precision floating point type, and with each
element filled with zeros. In the last three lines (lines 8–10), I loop through all rows (the number of
which is given by shape_a[0]) and all columns (the number of which is given by shape_a[1]),
by index. Thus, i and j are set to the element addresses for rows and columns, respectively, and
line 10 does the multiplication operation and sets the product in the results array product_ab using
the element addresses.
One other note: In this example, I make the assumption that the shape of a and the shape of b
are the same, but I should instead add a check that this is actually the case. While a check using an
if statement condition such as:
np.shape(a) != np.shape(b)
will work, because equality between sequences is true if all corresponding elements are equal,2
things get tricky, fast, if you are interested in more complex logical comparisons and boolean oper-
ations for arrays. For instance, the logic that works for != doesn’t apply to built-in Python boolean
operators such as and. We’ll see later on in Section 4.6.2 how to do element-wise boolean opera-
tions on arrays.
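A minimal sketch of such a shape check is given below. Raising a ValueError with this particular message is just one possible way to respond to a mismatch, chosen here for illustration:

```python
import numpy as np

a = np.array([[2, 3.2, 5.5, -6.4],
              [3, 1, 2.1, 21]])
b = np.array([[4, 1.2, -4, 9.1],
              [6, 21, 1.5, -27]])

# Shapes are tuples, so != compares them as whole sequences.
if np.shape(a) != np.shape(b):
    raise ValueError("a and b must have the same shape")

shapes_match = (np.shape(a) == np.shape(b))
```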
So, why wouldn’t you want to use the looping method for general array operations? In three
and a half words: Loops are (relatively) s-l-o-w. Thus, if you can at all help it, it’s better to use
array syntax for general array operations: your code will be faster, more flexible, and easier to read
and test.
explicitly, the loops and element-wise operations are done implicitly in the operator. That is to say,
instead of writing this code (assume arrays a and b are 1-D arrays of the same size):
c = np.zeros(np.shape(a), dtype='f')
for i in range(np.size(a)):
    c[i] = a[i] * b[i]
array syntax means you can write this code:
c = a * b
Let’s try this with a specific example using actual numbers:
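Here is one such sketch, reusing the two-row, four-column arrays from the looping example earlier in this section. The name looped and the comparison at the end are additions made for illustration:

```python
import numpy as np

a = np.array([[2, 3.2, 5.5, -6.4],
              [3, 1, 2.1, 21]])
b = np.array([[4, 1.2, -4, 9.1],
              [6, 21, 1.5, -27]])

# Array syntax: one line, no explicit loops.
product_ab = a * b

# The explicit-loop version, for comparison.
looped = np.zeros(np.shape(a), dtype='d')
for i in range(np.shape(a)[0]):
    for j in range(np.shape(a)[1]):
        looped[i, j] = a[i, j] * b[i, j]

print(np.array_equal(product_ab, looped))
```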
There are three more key benefits of array syntax. First, operand shapes are automatically
checked for compatibility, so there is no need to check for that explicitly. Second, you do not need
to know the number of dimensions of the arrays ahead of time, so the same line of code works
on 1-D, 2-D, 3-D, etc. arrays. Finally, the array syntax formulation runs faster than the equivalent
code using loops! Simpler, better, faster: pretty cool, eh?
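The first two benefits can be seen in a short sketch. The function double_plus_one is hypothetical, made up for this example. Also note that "compatible" is looser than "identical": NumPy's broadcasting rules accept some differently shaped operands, such as an array and a scalar.

```python
import numpy as np

def double_plus_one(arr):
    # The same line works for any number of dimensions.
    return arr * 2 + 1

one_d = double_plus_one(np.arange(3))
three_d = double_plus_one(np.zeros((2, 3, 4)))

# Incompatible operand shapes are caught automatically.
try:
    np.zeros((2, 3)) * np.zeros((4, 5))
    shapes_ok = True
except ValueError:
    shapes_ok = False
```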
Let’s try another array syntax example:
4.4. CORRELATION AND LAG AUTOCORRELATION
import numpy as np
a = np.arange(10)
b = a * 2
c = a + b
d = c * 2.0
What results? Predict what you think a, b, c, and d will be, then print out those arrays to confirm
whether you were right.
>>> print(a)
[0 1 2 3 4 5 6 7 8 9]
>>> print(b)
[ 0 2 4 6 8 10 12 14 16 18]
>>> print(c)
[ 0 3 6 9 12 15 18 21 24 27]
>>> print(d)
[ 0. 6. 12. 18. 24. 30. 36. 42. 48. 54.]
Arrays a, b, and c are all integer arrays because the operands that created those arrays are all
integers. Array d, however, is floating point because it was created by multiplying an integer array
by a floating point scalar. NumPy automatically chooses the type of the new array to retain, as much
as possible, the information found in the operands.
The correlation coefficient tells us the degree to which two variables are linearly related to
each other. If the value of the correlation coefficient is 1, the two variables are tightly related and
increases in one variable are connected to increases in the other (“correlated”). If the value of the
correlation coefficient is −1, the two variables are also tightly related, but increases in one
variable are connected to decreases in the other (“anti-correlated”). If the value of the correlation
coefficient is 0, the two variables are not linearly related to each other (“uncorrelated”).
3
https://en.wikipedia.org/w/index.php?title=Correlation_and_dependence&oldid=739479314 (accessed September 24, 2016)
NumPy has a function called corrcoef that returns the correlation matrix. In the case of two one-dimensional
time series (that is, two single sequences of data values), the correlation coefficient is
the value of the “off-diagonal” elements, of which there are two, and both of which are the same
value. Thus, the following code returns the correlation coefficient between two one-dimensional
arrays x and y (both arrays of the same length):
import numpy as np
np.corrcoef(x, y)[0,1]
You can verify that this result is the same as generated by the equation above.
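Here is a sketch of one way to do that check. It assumes the equation in question is the standard sample (Pearson) correlation formula, and it uses made-up sample data:

```python
import numpy as np

# Hypothetical sample data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Sum of products of deviations from the mean, divided by the square
# root of the product of the summed squared deviations.
dx = x - np.mean(x)
dy = y - np.mean(y)
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

r_numpy = np.corrcoef(x, y)[0, 1]
print(np.allclose(r_manual, r_numpy))
```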
In this chapter, however, we are dealing with the question of how to use NumPy arrays to help
us do more complex data analysis; calling a function isn’t really more “complex.” Well, the reason
we introduced the idea of correlation is to provide background for something that is a little more
complex, and illustrates another use of slicing. And that something is lag autocorrelation.
The definition of the correlation coefficient above gives us the correlation coefficient between
two variables, x and y. But what if x and y weren’t two different variables but the same variable,
except shifted in time? That is to say, what if x and y were time series but where y were values of
x shifted earlier or later? Then, the correlation coefficient we calculated would tell us the degree
to which an earlier value is related to a later value. If we calculated the correlation coefficient for
various degrees of shifting, we would have calculated the lag autocorrelation for the time series.
(When we do this correlation calculation between two different shifted time series x and y it is called
lag correlation. The “auto” in “lag autocorrelation” means “same time series.” For our purposes,
we’ll focus on lag autocorrelation, but the more interesting and useful measure is lag correlation. I
encourage you to learn more about that technique someday on your own.)
Let’s take the subset of gasoline retail data in Section 4.2 and calculate and plot the lag autocorrelation
for various numbers of months of lag. Here is code that would do the trick, using
loops:
 1 import numpy as np
 2 import scipy
 3 data_flat = np.ravel(data_as_array)
 4 lags = np.arange(6)
 5 correl = []
 6 for ilag in lags:
 7     x = np.zeros( (np.size(data_flat)-ilag) )
 8     y = np.zeros( (np.size(data_flat)-ilag) )
 9     for imonth in range(np.size(x)):
10         x[imonth] = data_flat[imonth]
11         y[imonth] = data_flat[imonth+ilag]
12     correl.append( np.corrcoef(x, y)[0,1] )
In line 3, we use the ravel function to turn the two-dimensional array into a one-dimensional array.
In line 4 we make the lags we’re considering to be 0–5 months. The correlation coefficient will
be between f (t) vs. f (t + lag), and we hold the correlation coefficient values in the list correl.
In lines 6–11, we loop through each lag value and fill the temporary arrays x and y with values of
data_flat, shifted by the lag value. Once we have the x and y arrays, we calculate the correlation
coefficient for that lag (line 12).
Figure 4.1: Lag autocorrelation plot for the data set in Section 4.4.
A graph of the lag autocorrelation coefficient vs. the lag value is given in Figure 4.1. As ex-
pected, at no lag, the correlation coefficient is 1, since a time series is perfectly related to itself at
the same time. As the lag increases, the correlation drops. This is consistent with what we’d expect,
that is, that the relationship between the current month’s gasoline retail sales and sales in the future
will become weaker the farther out in the future you go. (Note, however, that we can’t really say
what these coefficient values mean without conducting a test of statistical significance. When doing
statistical calculations, only statistically significant values compared to some reasonable baseline
have meaning, in the statistical sense.)
Here’s the solution using array slicing:
4.5. MORE COMPLICATED MANIPULATIONS
 1 import numpy as np
 2 import scipy
 3 data_flat = np.ravel(data_as_array)
 4 lags = np.arange(6)
 5 correl = []
 6 for ilag in lags:
 7     if ilag == 0:
 8         correl.append(1.0)
 9     else:
10         x = data_flat[0:-ilag]
11         y = data_flat[ilag:]
12         correl.append( np.corrcoef(x, y)[0,1] )
It’s similar to the solution using loops except that the portion where we select the values from data_flat
is done using array slicing, with adjustments for the lag value. Note that we pre-calculate the lag
0 case because the lag autocorrelation for no lag is always 1. We have to do this because the slicing
syntax in lines 10–11 does not work if ilag equals 0: the slice 0:-0 is the same as 0:0, which selects
no elements. The slicing syntax in lines 10–11, however, is
cleaner than the code in the loops solution (lines 7–11 in that solution), which does the same thing.
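As an aside, here is a sketch of a variation that is not used in the listing above: if the upper bound of the first slice is written as a count from the start of the array rather than as a negative index, the lag 0 case no longer needs special handling. The data values below are hypothetical stand-ins for the flattened monthly data.

```python
import numpy as np

# Hypothetical stand-in for the flattened monthly data.
data_flat = np.array([10.0, 12.0, 11.0, 15.0, 14.0,
                      18.0, 17.0, 21.0, 20.0, 24.0])

lags = np.arange(4)
correl = []
n = np.size(data_flat)
for ilag in lags:
    x = data_flat[:n - ilag]   # for ilag == 0 this is the whole array
    y = data_flat[ilag:]
    correl.append(np.corrcoef(x, y)[0, 1])
```

Whether this reads more clearly than the explicit if test is a matter of taste.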
1 import numpy as np
2 data_gt_40k = []
3 data_shape = np.shape(data_as_array)
4 for iyear in range(data_shape[0]):
5     for imonth in range(data_shape[1]):
6         if data_as_array[iyear, imonth] > 40000:
7             data_gt_40k.append(data_as_array[iyear, imonth])
In this code, we make use of nested loops. That is, the loop defined in line 5 is run for every iteration
of the loop defined in line 4. In this way we go through all the data in the two-dimensional array
data_as_array. The if test in line 6 is conducted on every element in data_as_array. Note
that in Python, the nesting of loops is defined by the indentation. If you do not indent the inner
loop, you won’t have a nested loop but rather two separate loops.
4.6. PYTHON SIDEBAR: TESTING INSIDE AN ARRAY
Note that the above code would also work on the full data in Table 4.1, if that data was set to
data_as_array, since we make use of array inquiry functions to determine how many times to
loop through the array.
We saw in Section 4.3.5 that with array syntax we can implicitly loop through all elements of an
array and make calculations with each element of the array, one element at a time, without writing
an actual for statement. But what about if tests? Is there a way we can do testing inside an array
while using array syntax? That way, we can get the benefits of simpler code, the flexibility of code
that works on arrays of any number of dimensions, and speed. The answer is, yes! NumPy has
comparison and boolean operators that act element-wise and array inquiry and selection functions.
Here’s the code to do the above task without explicit for loops:
1 import numpy as np
2 is_gt_40k_indices = np.where(data_as_array > 40000)
3 data_gt_40k = data_as_array[is_gt_40k_indices]
The expression data_as_array > 40000 produces an array of the same size and shape as data_as_array
but whose elements are either True or False, depending on whether or not the corresponding el-
ement in data_as_array is greater than 40000. The NumPy where function, when called with
a single boolean array as a parameter, returns a data structure that specifies the indices where the
input parameter is true. That data structure can then be used when specifying indices in an array
(line 3) to select those elements of the array.
This use of boolean array expressions and where only touches on ways we can do tests while
using array syntax. Section 4.6.2 goes into more detail on this topic.
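As a small preview, the sketch below shows the where approach next to a closely related one: NumPy also accepts the boolean array itself inside the square brackets. The data values are hypothetical stand-ins for data_as_array.

```python
import numpy as np

# Hypothetical sales figures standing in for data_as_array.
data_as_array = np.array([[38000, 41000, 39500],
                          [42000, 40000, 45500]])

is_gt_40k = data_as_array > 40000                # boolean array, same shape as the data
via_where = data_as_array[np.where(is_gt_40k)]   # selection with indices from where
via_mask = data_as_array[is_gt_40k]              # selection with the boolean array itself
```

Both selections give the same 1-D array of values greater than 40000.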
The pass command is used when you have a block statement (e.g., a block if statement, etc.)
where you want the interpreter to do nothing. In this case, because answer is filled with all zeros
on initialization, if the if test condition returns False, we want that element of answer to be zero.
But, all elements of answer start out as zero, so the else block has nothing to do; thus, we pass.
Again, while this code works, loops are slow, and the if statement makes it even slower. The
nested for loops also mean that this code will only work for a 2-D version of the array a.
import numpy as np
a = np.arange(6)
print(a > 3)
print(np.greater(a, 3))
gives you the same thing. Other comparison functions are similarly defined for the other standard
comparison operators; those functions also act element-wise on NumPy arrays.
Once you have arrays of booleans, you can operate on them using boolean operator NumPy
functions. You cannot use Python’s built-in and, or, etc. operators; those will not act element-
wise. Instead, use the NumPy functions logical_and, logical_or, etc. Thus, if we have this
code:
a = np.arange(6)
print(np.logical_and(a>1, a<=3))
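The sketch below shows the element-wise functions at work and what happens if you try the built-in and instead. The variable names both, either, and builtin_and_worked are made up for illustration:

```python
import numpy as np

a = np.arange(6)
both = np.logical_and(a > 1, a <= 3)   # element-wise "and"
either = np.logical_or(a < 1, a > 4)   # element-wise "or"

# Python's built-in "and" needs a single True/False value for each operand,
# so using it on multi-element arrays raises an error.
try:
    (a > 1) and (a <= 3)
    builtin_and_worked = True
except ValueError:
    builtin_and_worked = False
```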
Example 19 (Using where to directly select corresponding values from another array or scalar):
Consider the following case:
import numpy as np
a = np.arange(8)
condition = np.logical_and(a>3, a<6)
answer = np.where(condition, a*2, 0)
>>> print(a)
[0 1 2 3 4 5 6 7]
>>> print(condition)
[False False False False True True False False]
>>> print(answer)
[ 0 0 0 0 8 10 0 0]
The array condition shows which elements of the array a are greater than 3 and less than 6. The
where call takes every element of array a where that is true and doubles the corresponding value
of a; elsewhere, the output element from where is set to 0.
The second way of using where is to return a tuple of array element indices for which a condition
is true, which then can be used to select the corresponding values by selection with indices.4 For
1-D arrays, the tuple is a one-element tuple whose value is an array listing the indices where the
condition is true. For 2-D arrays, the tuple is a two-element tuple whose first value is an array
listing the row index where the condition is true and the second value is an array listing the column
index where the condition is true. In terms of syntax, you tell where to return indices instead of an
array of selected values by calling where with only a single argument, the <condition> array. To
select those elements in an array, pass in the tuple as the argument inside the square brackets (i.e.,
[]) when you are selecting elements. Here is an example:
import numpy as np
a = np.arange(8)
condition = np.logical_and(a>3, a<6)
answer_indices = np.where(condition)
answer = (a*2)[answer_indices]
Solution and discussion: You should have obtained results similar to those of Example 19, except the
zero elements are absent in answer and now you also have a tuple of the indices where condition
is true:
4
This is like the behavior of IDL’s WHERE function.
>>> print(a)
[0 1 2 3 4 5 6 7]
>>> print(condition)
[False False False False True True False False]
>>> print(answer_indices)
(array([4, 5]),)
>>> print(answer)
[ 8 10]
The array condition shows which elements of the array a are greater than 3 and less than 6. The
where call returns the indices where condition is true, and since condition is 1-D, there is only
one element in the tuple answer_indices. The last line multiplies array a by two (which is also
an array) and selects the elements from that array with addresses given by answer_indices.
Note that selection with answer_indices will give you a 1-D array, even if condition is not
1-D. Let’s turn array a into a 3-D array, do everything else the same, and see what happens:
import numpy as np
a = np.reshape( np.arange(8), (2,2,2) )
condition = np.logical_and(a>3, a<6)
answer_indices = np.where(condition)
answer = (a*2)[answer_indices]
>>> print(a)
[[[0 1]
[2 3]]
[[4 5]
[6 7]]]
>>> print(condition)
[[[False False]
[False False]]
[[ True True]
[False False]]]
>>> print(answer_indices)
(array([1, 1]), array([0, 0]), array([0, 1]))
>>> print(answer)
[ 8 10]
Note how condition is 3-D and the answer_indices tuple now has three elements (for the
three dimensions of condition), but answer is again 1-D.
import numpy as np
a = np.arange(8)
condition = np.logical_and(a>3, a<6)
answer = ((a*2)*condition) + \
(0*np.logical_not(condition))
>>> print(a)
[0 1 2 3 4 5 6 7]
>>> print(condition)
[False False False False True True False False]
>>> print(answer)
[ 0 0 0 0 8 10 0 0]
But how does this code produce this solution? Let’s go through it step-by-step. The condition
line is the same as in Example 19, so we won’t say more about that. But what about the answer
line? First, we multiply array a by two and then multiply that by condition. Every element that is
True in condition will then equal double the corresponding element of a, but every element that is False in condition will
equal zero. We then add that to zero times the logical_not of condition, which is condition
but with all Trues as Falses, and vice versa. Again, any value that multiplies by True will be that
value and any value that multiplies by False will be zero. Because condition and its “logical not”
are mutually exclusive—if one is true the other is false—the sum of the two terms to create answer
will select either a*2 or 0. (Of course, the array generated by 0*np.logical_not(condition) is
an array of zeros, but you can see how multiplying by something besides 0 will give you a different
replacement value.)
Also, note the continuation line character is a backslash at the end of the line (as seen in the line
that assigns answer).
This method of testing inside arrays using arithmetic operations on boolean arrays is also faster
than loops.
4.7. PYTHON SIDEBAR: FLOATING POINT COMPARISON
The answer is not −7.35, as you’d expect. The inexactness of floating point representation in a
computer can cause real problems when doing tests for equality. Consider these three tests:
The computer may think that two floating point numbers, if they’re close enough to each other, are
the same. And, this behavior is not consistent between different computers and operating systems.
What then should we do?
The answer is to never do logical equality comparisons between floating point numbers, but
instead, we should compare whether two floating point numbers are “close to” each other. The
NumPy array package has a function allclose that does this. The allclose function allows you
to set what constitutes two numbers being “close,” so you can do the closeness testing the same
way every time, regardless of what computer you’re using. Here’s an example:
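A minimal sketch of this kind of test is shown below. The numbers and the tolerance values are chosen only for illustration:

```python
import numpy as np

total = 0.1 + 0.2

# Exact equality fails because the stored values differ very slightly.
exactly_equal = (total == 0.3)

# With the default tolerances the two values count as close.
close_default = np.allclose(total, 0.3)

# With a tolerance tighter than the representation error, they do not.
close_strict = np.allclose(total, 0.3, rtol=0.0, atol=1e-20)
```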
5
See Bruce Bush’s article “The Perils of Floating Point,” http://www.lahey.com/float.htm (accessed March 17,
2012).
4.8. PYTHON SIDEBAR: STRING INDEXING
When the two input values to allclose are arrays, the comparison is done element-wise between
corresponding elements of the two arrays. If all such comparisons yield True, the function returns
True. If even one pair of elements is not close, the function returns False:
To find out which element(s) is not “close” between the two arrays, use the NumPy function
isclose:
The resulting boolean array tells us which corresponding elements in the two arrays are close to
each other. It takes the same atol and rtol keyword input parameters to change what counts as
“close.” See the documentation for the two routines for details as to how these parameters work.6
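Here is a sketch that puts allclose and isclose side by side. The arrays and the tolerance passed in the last call are made up for illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0 + 1e-10, 3.5])

all_close = np.allclose(a, b)                  # False: the last pair differs by 0.5
which_close = np.isclose(a, b)                 # element-wise result
loose = np.isclose(a, b, rtol=0.0, atol=1.0)   # absolute tolerance of 1.0
```

With the default tolerances only the last pair fails the test; with an absolute tolerance of 1.0, all three pairs count as close.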
6
See https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html and
https://docs.scipy.org/doc/numpy/reference/generated/numpy.isclose.html.
4.9. PYTHON SIDEBAR: A SIMPLE WAY OF SEEING HOW FAST YOUR CODE RUNS
What is mysaying[2]? mysaying[0:5]? mysaying[-5:]? How would you extract the word
“you’ll”?
Solution and discussion: mysaying[2] returns a “y”. mysaying[0:5] returns “Buy l”. And
mysaying[-5:] returns “fine.”. To extract the word “you’ll”, mysaying[24:30] will work. The
indexing syntax is the same for substrings as for 1-D arrays, including the meaning of ranges (the
colon) and negative indices. Remember that blank spaces and the apostrophe each count as characters in
the string mysaying.
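The same rules can be tried on any string. The sketch below uses a different, hypothetical string so that you can count the positions yourself:

```python
# A hypothetical string, used only to illustrate the indexing rules.
greeting = "Hello, world!"

third_char = greeting[2]     # single character at index 2
first_five = greeting[0:5]   # indices 0 through 4
last_six = greeting[-6:]     # the final six characters
word = greeting[7:12]        # the comma and the space occupy indices 5 and 6
```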
import time
begin_time = time.time()
for i in range(1000000):
    a = 2*3
print(time.time() - begin_time)
Solution and discussion: The code prints out the amount of time (in seconds) it takes to mul-
tiply two times three and assign the product to the variable a one million times. (Of course, it also
includes the time to do the looping, which in this simple case probably is a substantial fraction of
the total time of execution.)
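The same pattern can be used to compare two ways of doing a calculation. The sketch below times an explicit loop against array syntax for an element-wise multiplication; the array size of 100000 is an arbitrary choice, and the times you see will vary from run to run and from computer to computer:

```python
import time
import numpy as np

a = np.arange(100000, dtype='d')
b = np.arange(100000, dtype='d')

# Time the explicit loop.
begin_time = time.time()
c_loop = np.zeros(np.shape(a), dtype='d')
for i in range(np.size(a)):
    c_loop[i] = a[i] * b[i]
loop_seconds = time.time() - begin_time

# Time the array syntax version.
begin_time = time.time()
c_array = a * b
array_seconds = time.time() - begin_time

print(loop_seconds, array_seconds)
```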
4.10 Summary
The NumPy array is a really powerful tool. It enables us to slice-and-dice data in any number of
ways, enabling us to analyze data quickly. Especially for datasets that are more complicated than
a single sequence of values, NumPy arrays enable us to get a handle on that data.
Chapter 5
Chapter Contents
5.1 File objects
5.2 Reading a file
5.3 Python Sidebar: An introduction to objects
5.3.1 The nuts and bolts of objects
5.3.2 Example of how objects work: Strings
5.3.3 Example of how objects work: Arrays
5.4 Writing a file
5.5 Processing file contents
5.6 Catching file opening errors
5.7 Python Sidebar: More on exception handling
5.8 Better ways of reading a file
5.9 Summary
We’re almost done with the foundations of doing business data analysis using the tools in
Python. In the previous chapters, the datasets we’ve looked at have been pretty small. Partly,
this is because it’s easier to get a handle on small datasets, and when you’re learning a new tool,
this helps out a lot. However, what makes using a programming language (as opposed to Excel) to
analyze data worthwhile is the ease with which a program can be scaled up
to handle a large dataset. In Excel, it’s not so easy to go from, say, 500 rows of data to 500 million
rows of data.
The key to getting a computer to operate on large amounts of data is in storing the data in a file
and reading the data into an array (or similar variable). After you’ve put the data into an array, you
can manipulate and do calculations on that data. For really large datasets, the data will be stored in
a database in a format that helps you manage the data. SQL is perhaps the best-known general use
database language, but there are plenty of others, customized for other use cases.
In this chapter, we’ll look at reading data that is stored in a text file. Text files are handy for
storing data on the order of tens to hundreds of thousands of lines. Once you start hitting millions
of lines, text files will still work, but they start becoming cumbersome. The ideas behind handling
data in text files, however, are similar to those for other file formats. In addition, this gives us an opportunity
to revisit string handling in Python, which in and of itself is one of Python’s “superpowers.”
5.1. FILE OBJECTS
The first argument in the open statement gives the filename. The second argument sets the mode
for the file: 'r' for reading-only from the file, 'w' for writing a file, and 'a' for appending to the
file.
Once you’ve created the text file object, you can use various methods attached to the file object
to interact with the file.
When you are done reading from or writing to a file, you can close it using the close method.
Thus, to close a file object fileobj, execute:
fileobj.close()
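Putting the three modes and close together, a sketch of a full round trip might look like the following. The file name, the use of a temporary directory, and the write and read methods are assumptions made for this illustration:

```python
import os
import tempfile

# Hypothetical file name inside a temporary directory.
filename = os.path.join(tempfile.mkdtemp(), "example.txt")

fileobj = open(filename, 'w')   # 'w': create the file (or overwrite it) for writing
fileobj.write("first line\n")
fileobj.close()

fileobj = open(filename, 'a')   # 'a': add to the end of the existing file
fileobj.write("second line\n")
fileobj.close()

fileobj = open(filename, 'r')   # 'r': read only
text = fileobj.read()
fileobj.close()
```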
aline = fileobj.readline()
Because the file object is connected to a text file, aline will be a string. Note that aline contains
the newline character, because each line in a file is terminated by the newline character.
To read the rest of a file that you already started reading, or to read an entire file you haven’t
started reading, and then put the read contents into a list, use the readlines method:
contents = fileobj.readlines()
Here, contents is a list of strings, and each element in contents is a line in the fileobj file.
Each element also contains the newline character, from the end of each line in the file.
Note that the variable names aline and contents are not special; use whatever variable name
you would like to hold the strings you are reading in from the text file.
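The sketch below shows readline and readlines working together on a small file. The file name and contents are hypothetical, and the file is written first only so that the example is self-contained:

```python
import os
import tempfile

# Hypothetical three-line text file in a temporary directory.
filename = os.path.join(tempfile.mkdtemp(), "example.txt")
fileobj = open(filename, 'w')
fileobj.write("alpha\nbeta\ngamma\n")
fileobj.close()

fileobj = open(filename, 'r')
aline = fileobj.readline()       # first line, newline character included
contents = fileobj.readlines()   # the remaining lines, as a list of strings
fileobj.close()

first_word = aline.rstrip("\n")  # one way to drop the trailing newline
```

Because readline has already consumed the first line, readlines returns only the two lines that remain.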
5.3. PYTHON SIDEBAR: AN INTRODUCTION TO OBJECTS
Example 24 (Viewing attributes and methods attached to strings and trying out a few meth-
ods):
In the Python interpreter, type in:
a = "hello"
Now type: dir(a). What do you see? Type a.title() and a.upper() and see what you get.
Solution and discussion: The dir(a) command gives a list of (nearly) all the attributes and
methods attached to the object a, which is the string "hello". (An Internet search for “python
string attributes methods” would also probably give you a similar list.) Note that there is more data
attached to the object than just the word “hello”, e.g., the attributes a.__doc__ and a.__class__
also show up in the dir listing.
Methods can act on the data in the object. Thus, a.title() applies the title method to the
data of a and returns the string "hello" in title case (i.e., the first letter of the word capitalized);
a.upper() applies the upper method to the data of a and returns the string all in uppercase. Notice
these methods do not require additional input arguments between the parentheses, because all the
data needed is already in the object (i.e., "hello").
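If you would rather run this as a script than type it into the interpreter, a sketch might look like this (the variable names on the left are made up for illustration):

```python
a = "hello"

titled = a.title()    # returns a new string in title case
uppered = a.upper()   # returns a new string in uppercase

# dir returns the attribute and method names as a list of strings.
has_title_method = "title" in dir(a)

print(titled, uppered, a)
```

Note that a itself is unchanged: these two methods return new strings rather than altering the original.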
The syntax for objects looks very similar to the syntax for data and functions in modules. First,
to refer to attributes or methods of an instance, you add a period after the object name and then
put the attribute or method name. To set an attribute, the reference should be on the lefthand side
of the equal sign; the opposite is the case to read an attribute. Method calls require you to have
parentheses after the name, with or without arguments, just like a function call. Finally, methods
can produce a return value (like a function), act on attributes of the object in-place, or both.
a = np.reshape(np.arange(12), (4,3))
Now type: dir(a). What do you see? Based on their names, and your understanding of what
arrays are, what do you think some of these attributes and methods do?
Solution and discussion: The dir command should give you a list of a lot of stuff. I’m not
going to list all the output here but instead will discuss the output in general terms.
We first notice that there are two types of attribute and method names: those with double-
underscores in front and in back of the name and those without any pre- or post-pended double-
underscores. We consider each type of name in turn.
A very few double-underscore names sound like data. The a.__doc__ variable is one such
attribute and refers to documentation of the object. Most of the double-underscore names suggest
operations on or with arrays (e.g., add, div, etc.), which is what they are: Those names are of the
methods of the array object that define what Python will do to your data when the interpreter sees
a “+”, “/”, etc. Thus, if you want to redefine how operators operate on arrays, you can do so. It is
just a matter of redefining that method of the object.
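To make the connection between operators and double-underscore methods concrete, here is a small sketch. The arrays are made up for illustration. (In Python 3, the "/" operator is tied to the method named __truediv__ rather than __div__.)

```python
import numpy as np

a = np.arange(4)           # array([0, 1, 2, 3])
b = np.arange(4) * 10      # array([ 0, 10, 20, 30])

via_operator = a + b           # what you would normally write
via_method = a.__add__(b)      # the method the interpreter uses for "+"

print(via_operator)                                # [ 0 11 22 33]
print(np.array_equal(via_operator, via_method))    # True
```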
That being said, I do not, in general, recommend you do so. In Python, the double-underscore in
front means that attribute or method is “very private.” (A variable with a single underscore in front
is private, but not as private as a double-underscore variable.) That is to say, it is an attribute or
method that normal users should not access, let alone redefine. Python does not, however, do much
to prevent you from doing so, so advanced users who need to access or redefine those attributes
and methods can do so.
The non-double-underscore names are names of “public” attributes and methods, i.e., attributes
and methods normal users are expected to access and (possibly) redefine. A number of the methods
and attributes of a are duplicates of functions (or the output of functions) that act on arrays (e.g.,
transpose, T), so you can use either the method version or the function version.
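As a quick sketch of that duplication, the transpose of an array can be obtained through an attribute, a method, or a NumPy function. The array is the one from the example above; the names t_attr, t_method, and t_func are made up for this sketch.

```python
import numpy as np

a = np.reshape(np.arange(12), (4, 3))

t_attr = a.T                  # attribute: no parentheses
t_method = a.transpose()      # method version
t_func = np.transpose(a)      # function version

print(a.shape, t_attr.shape)  # (4, 3) (3, 4)
print(np.array_equal(t_attr, t_method))   # True
print(np.array_equal(t_method, t_func))   # True
```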
And now let’s look at some examples of accessing and using array object attributes and methods:
Solution and discussion: The giveaway as to whether we are accessing attributes or calling
methods is whether there are parentheses after the name; if not, it’s an attribute, otherwise, it’s a
method. Of course, you could type the name of the method without parentheses following, but then
the interpreter would just say you specified the method itself, as you did not call the method:
>>> print(a.astype)
<built-in method astype of numpy.ndarray object at 0x20d5100>
That is to say, the above syntax prints the method itself; since you can’t meaningfully print the
method itself, Python’s print function just says “this is a method.”
The astype call produces a version of array a that converts the values of a into single-character
strings. The shape attribute gives the shape of the array. The cumsum method returns a flattened
version of the array where each element is the cumulative sum of all the elements up to and
including that one. Finally, the attribute T is the transpose of the array a.
While it’s nice to have a bunch of array attributes and methods attached to the array object, in
practice, I find I seldom access array attributes and find it easier to use NumPy functions instead
of the corresponding array methods. One exception with regard to attributes is the dtype.char
attribute, which we’ve already seen; that’s very useful since it tells you the type of the elements of
the array.
5.4. WRITING A FILE
fileobj.write(astr)
Here, astr is the string you want to write to the file. Note that a newline character is not automatically
written to the file after the string is written. If you want a newline character to be added, you
have to append it to the string prior to writing (e.g., astr+'\n').
To write a list of strings to the file, use the writelines method:
fileobj.writelines(contents)
Here, contents is a list of strings, and, again, a newline character is not automatically written after
the string (so you have to explicitly add it if you want it written to the file).
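Here is a minimal sketch of both methods in action. The file name example.txt and the temporary directory are assumptions made only so the sketch can run without touching your own files.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

astr = 'first line'
contents = ['second line', 'third line']

fileobj = open(path, 'w')
fileobj.write(astr + '\n')                                  # newline added by hand
fileobj.writelines([aline + '\n' for aline in contents])    # same for each item
fileobj.close()

fileobj = open(path, 'r')
lines_read = fileobj.readlines()
fileobj.close()
print(lines_read)   # ['first line\n', 'second line\n', 'third line\n']
```

Without the explicit '\n' additions, all three strings would end up on a single line of the file.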
5.5. PROCESSING FILE CONTENTS
will take the string a, look for a blank space (which is passed in as the argument to split), and use
that blank space as the delimiter or separator with which one can split up the string. If split has no
parameters, the method splits the string wherever there is whitespace of any kind, treating a run of
consecutive whitespace characters as a single separator, so the returned list holds only the
non-whitespace pieces of the string.
The join method takes a separator string and puts it between items of a list (or an array) of
strings. For instance:
will take the list of strings a and concatenate these elements together, using the tab string ('\t')
to separate two elements from each other.
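The behavior of split and join described above can be sketched as follows. The string line is made up for illustration.

```python
line = '3.4  2.1\t-2.6\n'     # two spaces, a tab, and a trailing newline

by_space = line.split(' ')    # split at every single blank space
by_any = line.split()         # no argument: split on runs of any whitespace

print(by_space)               # ['3.4', '', '2.1\t-2.6\n']
print(by_any)                 # ['3.4', '2.1', '-2.6']

joined = '\t'.join(by_any)    # put a tab between the items
print(repr(joined))           # '3.4\t2.1\t-2.6'
```

Notice that splitting on a single blank space leaves an empty string where two spaces are adjacent, while the no-argument form does not.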
Finally, once we have the strings we desire, we can convert them to numerical types in order to
make calculations. Here are two ways of doing so:
• If you loop through a list of strings, you can use the float and int functions on the string
to get a number. For instance:
import numpy as np
a = ['3.4', '2.1', '-2.6']
anum = np.zeros(len(a), 'd')
for i in range(len(a)):
    anum[i] = float(a[i])
takes a list of strings a and turns it into a NumPy array of double-precision floating point
numbers anum.1
• If you make the list of strings a NumPy array of strings, you can use the astype method for
type conversion to floating point or integer. For instance:
anum = np.array(a).astype('d')
takes a list of strings a, converts it from a list to an array of strings using the array function,
and turns that array of strings into an array of double-precision floating point numbers anum
using the astype method of the array of strings.
1
Note that you can specify the array dtype without actually writing the dtype keyword; NumPy array constructors
like zeros will understand a typecode given as the second positional input parameter.
A gotcha: Different operating systems may set the end-of-line character to something besides
'\n'. Make sure you know what your text file uses. (For instance, MS-DOS uses '\r\n', which
is a carriage return followed by a line feed.) By the way, Python has a platform independent way
of referring to the end-of-line character: the attribute linesep in the module os. Note, however, that
when you write to a file opened in text mode (the default), Python already translates each '\n' into
the end-of-line character for the system you’re running on, so in that case you should write '\n'
rather than os.linesep.
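A short sketch can show how the end-of-line character is handled when a file is opened in text mode (the default), where Python translates '\n' on writing. The file name eol_example.txt is hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'eol_example.txt')   # hypothetical file

fileobj = open(path, 'w')        # text mode is the default
fileobj.write('first line\n')    # '\n' is translated on the way out
fileobj.close()

fileobj = open(path, 'rb')       # binary mode shows the bytes actually stored
raw = fileobj.read()
fileobj.close()

print(repr(os.linesep))          # '\n' on Linux and macOS, '\r\n' on Windows
print(raw == ('first line' + os.linesep).encode('ascii'))    # True
```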
Solution and discussion: This code will do the trick (note I use comment lines to help guide
the reader):
1 import numpy as np
2
Note you don’t have to strip off the newline character before converting the number to floating
point using float.
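A two-line check of this point, using made-up strings:

```python
aline = '3.4\n'            # a line as returned by readlines
value = float(aline)       # float ignores leading and trailing whitespace
count = int(' 42 \n')      # int does the same

print(value, count)        # 3.4 42
```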
While a multi-column text file of data seems more formidable than a single column text file,
with the split method, it’s not all that more complex, as we see in the following example.
5.6. CATCHING FILE OPENING ERRORS
file in and put the data into a two-dimensional array? To keep things simple, we’ll discard the
2016 data, since the file does not cover that year completely.
Solution and discussion: In preparing to write our code, we first note that the first two lines
contain header information rather than data. Second, we note that columns are separated by variable
amounts of whitespace. That won’t pose a problem because the split method will intelligently
eliminate the whitespace between the columns.
With this as background, here’s a solution:
1 import numpy as np
2
The array data holds all the numerical data in the file from 1992–2015 and has the same structure as
Table 4.1, without the headers. The first column of data are the years, and the subsequent columns
are the data for each month over all the years.
A few items: In lines 7 and 8, the number of rows in the data array and, subsequently, the
number of rows we loop through are three less than the number of lines in the file. This is because
we discard the first two lines as headers and the last line as an incomplete year. In line 9, the row
index we are examining is not i but i+2 because we are skipping the first two elements (i.e., rows)
in data_str.
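The bookkeeping described above (two header lines, one discarded final line, and the i+2 offset) can be sketched on a tiny stand-in for the file. The data lines below are made up, the list data_str stands in for what readlines would return, and the line numbers of this sketch do not correspond to those cited in the discussion.

```python
import numpy as np

# Hypothetical file contents: two header lines, two complete years, and one
# incomplete final year that we discard.
data_str = ['Made-up monthly data\n',
            'Year   Jan   Feb\n',
            '1992   1.0   2.0\n',
            '1993   3.0   4.0\n',
            '2016   5.0\n']

nrows = len(data_str) - 3             # skip two headers and the last line
ncols = len(data_str[2].split())      # number of columns in a data line
data = np.zeros((nrows, ncols), 'd')

for i in range(nrows):
    fields = data_str[i+2].split()    # i+2 skips the two header lines
    for j in range(ncols):
        data[i, j] = float(fields[j])

print(data)
```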
and the file not_a_file.txt does not exist, the following will be returned by the interpreter:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'not_a_file.txt'
5.7. PYTHON SIDEBAR: MORE ON EXCEPTION HANDLING
try:
    fileobj = open('data.txt', 'r')
except IOError:
    fileobj = open('DATA.TXT', 'r')
will try to open data.txt and if it fails with an IOError, will execute the code in the except block,
which tries to open the file DATA.TXT.
The try…except structure is a useful way of gracefully handling errors of many kinds, not
just file errors. In Section 5.7 below, we look a little more closely at Python’s exception handling
mechanism.
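Here is a runnable sketch of the same fallback pattern. The file names primary.txt and backup.txt and the temporary directory are assumptions made for this sketch. (In Python 3, IOError is an alias of OSError, and a missing file raises FileNotFoundError, a subclass of OSError, so the except line below still catches it.)

```python
import os
import tempfile

# Hypothetical setup: only the backup file exists.
workdir = tempfile.mkdtemp()
primary = os.path.join(workdir, 'primary.txt')
backup = os.path.join(workdir, 'backup.txt')
fileobj = open(backup, 'w')
fileobj.write('from backup\n')
fileobj.close()

try:
    fileobj = open(primary, 'r')     # fails: this file does not exist
except IOError:
    fileobj = open(backup, 'r')      # fall back to the second file

contents = fileobj.read()
fileobj.close()
print(contents)
```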
The syntax for raise is the command raise followed by an exception class (in this case I used
the built-in exception class ValueError, which is commonly used to denote errors that have to do
with bad variable values), called with a string argument that will be output by the interpreter when the
exception is raised. (In Python 2, the class and the string could instead be separated by a comma;
that form is a syntax error in Python 3.)
So far, I’ve been loosely saying that an exception “stops” the execution of a program. Actually,
raising an exception is not exactly the same as stopping the execution of a program. In a traditional
“stop,” execution of the program terminates and you are returned to the operating system level.
The program can’t do anything more. When an exception is raised, execution stops and sends
the interpreter up one level to see if there is some code that will properly handle the error. This
means that in using raise, you can gracefully handle expected errors without terminating the entire
program.
In Section 5.6, we saw how to handle a file opening exception using the try…except handler.
In that handler, you execute the block under the try, then execute the excepts if an exception is
raised. Here’s another non-file opening example:
rad = -2.5
try:
    a = area(rad)
except ValueError:
    a = area(abs(rad))
When the interpreter enters the try block, it executes all the statements in the block one by one. If
one of the statements raises an exception (as the first area call will because rad is negative), the
interpreter looks for an except statement at the calling level (one level up from the first area call,
which is the level of calling) that recognizes the exception class (in this case ValueError). If the
interpreter finds such an except statement, the interpreter executes the block under that except.
In this example, that block repeats the area call but with the absolute value of rad instead of rad
itself. If the interpreter does not find such an except statement, it looks another level up for a
statement that will handle the exception; this occurs all the way up to the main level, and if no
handler is found there, execution of the entire program stops.
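To see the whole mechanism run, here is a self-contained sketch. The area function below is a hypothetical stand-in (the area of a circle, raising ValueError for a negative radius); treat its body and its error message as assumptions.

```python
import math

def area(radius):
    """Return the area of a circle (hypothetical stand-in for area)."""
    if radius < 0:
        raise ValueError('radius negative')
    return math.pi * radius**2

rad = -2.5
try:
    a = area(rad)              # raises ValueError because rad is negative
except ValueError:
    a = area(abs(rad))         # handler: retry with the absolute value

print(a)
```

If the except block were removed, the ValueError would travel up to the main level and stop the program with a traceback showing the string 'radius negative'.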
In the examples in this chapter, I used the exception classes IOError and ValueError. These
are examples of built-in exception classes; you can find these and other built-in exception classes
(e.g., TypeError, ZeroDivisionError, etc.) listed in a good Python reference, and you
can use them to handle the specific type of error you are protecting against.2 I should note, however, the
better and more advanced approach is to define your own exception classes to customize handling,
but this topic is beyond the scope of this book.
5.8. BETTER WAYS OF READING A FILE
The skip_header keyword input parameter sets how many lines at the beginning of the file to
ignore and the skip_footer keyword input parameter does the same for the bottom of the file. The
dtype keyword input parameter tells the function to convert the file’s values to long integer; if I had left out
that keyword input parameter, the function would have converted the values to double-precision
floating point type. Details about using genfromtxt are found here:
http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html. You might also find the function’s reference manual entry
to be useful: http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html.
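As a rough sketch of these keywords, the code below reads made-up text through genfromtxt. The contents of text are hypothetical, and io.StringIO is used only so the sketch runs without a file on disk; with a real file you would pass the filename instead.

```python
import io
import numpy as np

# Made-up file contents: two header lines, two data rows, one footer row.
text = ("Hypothetical monthly counts\n"
        "Year Jan Feb\n"
        "1992 10 20\n"
        "1993 30 40\n"
        "2016 50 60\n")

data = np.genfromtxt(io.StringIO(text), skip_header=2, skip_footer=1, dtype=int)
print(data.shape)
print(data)
```

Here skip_header=2 drops the two header lines, skip_footer=1 drops the final row, and dtype=int requests integer values rather than the default floating point.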
Files with comma-separated values: We might think, at first glance, that to read in a comma-
separated values (CSV) file, all we need to do is apply the string method split using the comma
(',') as the delimiter. And this would work fine, if CSV files contained only numbers. In fact, often
times you’ll find CSV files that have fields that themselves have newline characters or commas
inside them. For instance, consider the following lines that might be the first four lines from a CSV
file:
1 Date Sold,Item and Details,Purchase Price,Retail Price
2 01/01/2015,"Party Mix: Nuts, chips, and salsa",2.14,8.19
3 02/04/2016,"Game: Rock, scissors, and paper",3.43,9.99
4 01/22/2013,"Stack-Of-Cars: Chevy\nHonda\nKia\nJeep",2.33,6.87
The first line is a header that gives the names for each column. Lines 2–4 are data lines. The quotation
marks tell us that the second element in that list is a string "Party Mix: Nuts, chips, and salsa"
and not three separate items, "Party Mix: Nuts", "chips", "and salsa". Unfortunately, a
plain-old split(',') call won’t figure that out. Note that in a CSV file, there is no whitespace
between commas. If there were, the whitespace would be considered part of the values for the
column after the comma.
2
See http://docs.python.org/library/exceptions.html for a listing of built-in exception classes (accessed August 17,
2012).
We could, using string variables and Python’s programming constructs, write a parser that would
properly deal with these and similar special cases. It would, however, be a lot of work. Fortunately,
there is a module called csv that handles these special cases for us.
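To see the difference on a single line, here is a small sketch comparing a plain split(',') with the csv module. The line is the Party Mix data line from above; the names naive and parsed are made up.

```python
import csv

line = '01/01/2015,"Party Mix: Nuts, chips, and salsa",2.14,8.19'

naive = line.split(',')               # breaks the quoted field apart
parsed = next(csv.reader([line]))     # csv.reader accepts any iterable of lines

print(len(naive))                     # 6
print(parsed)
# ['01/01/2015', 'Party Mix: Nuts, chips, and salsa', '2.14', '8.19']
```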
Below is code that will read the contents of a CSV text file myfile.csv and put the contents into
the list data:
1 import csv
2 fileobj = open('myfile.csv', 'r')
3 readerobj = csv.reader(fileobj)
4 data = []
5 for row in readerobj:
6     data.append(row)
7 fileobj.close()
Note that the list data is not a list of strings but a list of lists. That is, each element in data is a
single row in myfile.csv, parsed so that each column is a separate element.
Underneath the hood, what’s happening in the above code is that the reader function creates
the iterable readerobj. Iterables are objects that you can loop through using a for loop (as in
line 5 in the code above). The syntax of looping through an iterable is the same as looping through
a list (which itself is an iterable).
If the CSV text file myfile.csv contained the four lines shown earlier in this section, printing each row would result
in:
Note that every element in these lists is a string. If we want to do arithmetic on the prices, we
have to convert the prices into floating point values. But, we were able to handle nicely the string
elements in the file that had commas and newline characters in them. Details on the csv module
are available here: https://docs.python.org/3.5/library/csv.html.3
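A sketch of the conversion step follows. The text is a shortened, made-up version of the file, with real newline characters inside the quoted field of the last record, and io.StringIO stands in for an open file object. The idea of computing a markup is only an illustration.

```python
import csv
import io

# Made-up file contents based on the lines shown earlier.
text = ('Date Sold,Item and Details,Purchase Price,Retail Price\n'
        '01/01/2015,"Party Mix: Nuts, chips, and salsa",2.14,8.19\n'
        '01/22/2013,"Stack-Of-Cars: Chevy\nHonda\nKia\nJeep",2.33,6.87\n')

rows = list(csv.reader(io.StringIO(text)))
header, records = rows[0], rows[1:]

# Columns 2 and 3 hold the prices as strings; convert before doing arithmetic.
markups = [float(rec[3]) - float(rec[2]) for rec in records]

print(len(records))                    # 2
print(records[1][1].split('\n'))       # the four parts of the quoted field
print(markups)
```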
5.9 Summary
In this chapter we saw that Python conceptualizes files as objects, with attributes and methods
attached to them. To manipulate and access those files, you use the file object’s methods. For the
contents of text files, we found string methods to be useful. For most daily work, however, we’ll
want to use special packages designed for reading in data, such as found in NumPy (which we
discussed above).
3
The discussion in this section also owes credit to http://stackoverflow.com/a/13472940 (accessed October 13,
2016).
As a postscript: Though we won’t talk about it in this course, if you decide you really want
to use Python to do data analysis, the package you actually want to use is pandas (http://pandas.pydata.org/).
It handles Excel workbook files really well. (The openpyxl module also enables you
to access Excel workbooks.4) Pandas is amazingly powerful, but its learning curve is a little steep.
For the purposes of this course, we won’t need to use the power it offers, but I wanted to give you
a heads-up as to something to learn in the future.
4
http://openpyxl.readthedocs.io/en/default/ (accessed October 7, 2016).
Part II
Chapter 6
Managing Files
Chapter Contents
6.1 Segue: Creating a lot of files
6.2 Moving and filing lots of files
6.3 Renaming lots of files
6.4 Copying lots of files
6.5 Deleting lots of files
6.6 Another segue: Testing to see what kind of file something is
6.7 Summary
You’ve done a bunch of data analysis and have generated a host of files. Some are data files,
some are plots, and maybe some others are text. Or, maybe the various parts of your company are
generating data—point of sale logs, invoices and expenses, customer surveys databases, etc.—and
all those other units are sending files your way (perhaps automatically) to manage. How do we
handle all these files?
Today’s graphically-based operating systems (Windows, Mac OS X) are easy to use because
they make sense, visually and kinesthetically. If you want to move a file from one folder to another
folder, you click the file’s icon with your mouse and drag the icon to the destination folder. Drag
means “move.” But, this operation no longer is easy when you have thousands of files to move
into hundreds of different folders.
In this chapter, we examine some common file management tasks and describe how Python can
help us automate these tasks. We won’t cover tasks dealing with directories (e.g., getting a list of
files in a directory, directory paths, creating and deleting directories, etc.), instead waiting until
Chapter 7 to cover those operations; in the present chapter, we’ll assume the directories we’re
interested in already exist. We’ll find both in the present chapter and in future chapters that with
very few lines of code, Python can tackle the management of five or five million files, and can do
so in an operating system independent way. That is, the Python code we write to manage files on
a Windows system can also work on a Mac OS X system, with no change. Very cool!
A note of warning: Before you begin using Python’s file and directory manipulation tools, I
strongly encourage you to create a scratch directory and do all your work in there, until you’ve
gotten the hang of using these functions and commands. Change your current working directory
to this scratch directory. If you’re running Python from a terminal, just change to this newly created
scratch directory and start Python from in there. If you’re using Canopy, Figure 7.2 shows where
the switch is to change the working directory. This will help decrease the potential of inadvertently
messing up your file system.
6.1. SEGUE: CREATING A LOT OF FILES
import numpy as np
hours = np.array([11, 12, 13, 14, 15, 16, 17, 18, 19], dtype='d')
norm_customers = np.sin( (hours-hours[0]+1)/(hours[-1]-hours[0]+2)*np.pi )
where each element of hours is the start of an hour range, based on a 24-hour clock (thus, hour “13” represents
the hour from 1–2 pm) and norm_customers is the normalized number of customers, which varies
from 0 to 1, with 1 being the maximum number of customers in any given day.
Through trial-and-error, you also discover that the number of customers per hour scales with
the day’s high temperature in the following way:
High temperature (°F)    Scaling factor
50                       6
60                       10
70                       25
80                       41
90                       55
100                      64
For the temperatures listed above, you can obtain the number of actual customers for each hour in
hours by multiplying norm_customers by the scaling factor above. Thus, when it is 50°F, the
actual number of customers per hour is:
>>> norm_customers * 6
array([ 1.85410197, 3.52671151, 4.85410197, 5.7063391 , 6. ,
5.7063391 , 4.85410197, 3.52671151, 1.85410197])
1
This example is entirely fictitious. Don’t try to create a business based upon these numbers.
Given this model of customers at the ice cream cart, what would be the code to write out files
listing the hour in the day and the actual customers for each hour (i.e., each file has two columns),
for each temperature value we’re given above? We’ll want to give the files names that are related
to the temperature they describe, such as hrly_customers-50F.txt. The code to do so is below, using
tabs as delimiters:
import numpy as np
temps = [50, 60, 70, 80, 90, 100]
scaling_factors = [6, 10, 25, 41, 55, 64]
hours = np.array([11, 12, 13, 14, 15, 16, 17, 18, 19], dtype='d')
norm_customers = np.sin( (hours-hours[0]+1)/(hours[-1]-hours[0]+2)*np.pi )
for i in range(len(temps)):
    fileobj = open("hrly_customers-" + str(temps[i]) + "F.txt", 'w')
    actual_customers = norm_customers * scaling_factors[i]
    for j in range(len(hours)):
        fileobj.write(str(hours[j]) + '\t' + str(actual_customers[j]) + "\n")
    fileobj.close()
And instantly we have six files of data! The same idea of using loops to generate multiple files can
be used to create any number of files. And, now we have a bunch of files that we can manage in
the rest of this chapter.
6.2. MOVING AND FILING LOTS OF FILES
1 import shutil
2
3 filenames = [ 'hrly_customers-50F.txt',
4 'hrly_customers-60F.txt',
5 'hrly_customers-70F.txt',
6 'hrly_customers-80F.txt',
7 'hrly_customers-90F.txt',
8 'hrly_customers-100F.txt']
9
The shutil module (imported in line 1) has a number of functions that simulate shell commands. A
shell is a text-based interface to an operating system, the software that manages your files, directories,
etc.2 Shells contain commands to enable you to copy, move, delete, etc. your files. In shutil,
the move function (lines 14 and 16) moves files. The function takes two arguments, the source
and its destination (both strings). If the destination is a directory, the source file is moved into the
destination directory. If the destination is not a directory, the file is renamed.3
Lines 3–8 define the names of the files we will be managing. In this program, we write the
names of the files by hand. But what if there were a way to get a list of the files in the current
directory without typing them in? We’ll talk about such ways in Chapter 7.
In lines 11–12, we use Python’s string manipulation methods to extract the temperature value
corresponding to the file. The if statement in lines 13–16 enables us to move the file to the correct
location.
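Here is a self-contained sketch of the move-and-file idea, run inside a temporary scratch directory. Several details are assumptions made for this sketch rather than taken from the program discussed above: the folder name hot, the 75 degree cutoff, and the use of only two of the files.

```python
import os
import shutil
import tempfile

# Hypothetical setup: a scratch directory holding two empty files and two
# destination folders.
workdir = tempfile.mkdtemp()
filenames = ['hrly_customers-50F.txt', 'hrly_customers-100F.txt']
for name in filenames:
    open(os.path.join(workdir, name), 'w').close()
for folder in ['mild', 'hot']:
    os.mkdir(os.path.join(workdir, folder))

for name in filenames:
    temp_part = name.split('-')[1]           # e.g., '50F.txt'
    temp = int(temp_part.split('F')[0])      # e.g., 50
    if temp < 75:                            # assumed cutoff
        shutil.move(os.path.join(workdir, name), os.path.join(workdir, 'mild'))
    else:
        shutil.move(os.path.join(workdir, name), os.path.join(workdir, 'hot'))

print(os.listdir(os.path.join(workdir, 'mild')))   # ['hrly_customers-50F.txt']
print(os.listdir(os.path.join(workdir, 'hot')))    # ['hrly_customers-100F.txt']
```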
What general lessons can we take from this example? First, file management commands use
the string representations of filenames to operate on the files. This means we can use Python’s
powerful string manipulation tools to work with files. Second, because Python is a full-fledged
programming language, we can make use of constructs like loops and if statements to help us
intelligently manage our files.4 Finally, we see in this short program that Python has packages that
give us the ability to interface with the operating system. This means we aren’t stuck with managing
files by hand nor with writing a separate program (called a shell script) to do this managing. Instead,
we can do this in one environment!
2
A common shell for Windows is PowerShell. For Linux, bash (or the Bourne Again SHell) is widely used. Mac
OS X, as a Unix-based operating system, also has access to the common shells available to Linux.
3
Reference for parts of this paragraph: https://docs.python.org/2/library/shutil.html (accessed November 7, 2016).
4
Shells also have this ability, but the syntax of many shells is much less readable than Python’s.
6.3. RENAMING LOTS OF FILES
1 import shutil
2
3 filenames = [ 'hrly_customers-50F.txt',
4 'hrly_customers-60F.txt',
5 'hrly_customers-70F.txt',
6 'hrly_customers-80F.txt',
7 'hrly_customers-90F.txt',
8 'hrly_customers-100F.txt']
9
In line 11, the split method separates the filename based upon the presence of the hyphen. Because
the filenames have only one hyphen, the filenames are separated into two parts. These two parts
are returned as elements in a list (which has two elements). Everything to the left of the hyphen
is the zeroth element of the list and everything to the right of the hyphen is the oneth element of
the list. The hyphen itself does not appear in the list. Thus, when we use the list indexing syntax
to select the oneth element of the list generated from the split, we get only that portion with the
temperature and the “.txt” extension.
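A sketch of renaming with split and shutil.move follows, again inside a temporary scratch directory. The new prefix customers_per_hour is a made-up choice for illustration.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
filenames = ['hrly_customers-50F.txt', 'hrly_customers-60F.txt']
for name in filenames:
    open(os.path.join(workdir, name), 'w').close()

for name in filenames:
    parts = name.split('-')                        # ['hrly_customers', '50F.txt']
    new_name = 'customers_per_hour-' + parts[1]    # hypothetical new prefix
    shutil.move(os.path.join(workdir, name), os.path.join(workdir, new_name))

print(sorted(os.listdir(workdir)))
# ['customers_per_hour-50F.txt', 'customers_per_hour-60F.txt']
```

Because the destination is not an existing directory, move renames each file rather than filing it somewhere else.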
1 import shutil
2
3 filenames = [ 'hrly_customers-50F.txt',
4 'hrly_customers-60F.txt',
5 'hrly_customers-70F.txt',
6 'hrly_customers-80F.txt',
7 'hrly_customers-90F.txt',
8 'hrly_customers-100F.txt']
9
6.5. DELETING LOTS OF FILES
In the above example, the copies will be in the same directory as the original files.
However, as powerful as copy2 is, some metadata attached to the file will not transfer. This
can include, for instance, ACLs (Access Control Lists) on Mac OS X.5
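As a minimal sketch of copying with copy2, the code below copies each file to a new name in the same temporary directory. The copy_of_ prefix and the file contents are assumptions made for this sketch.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
filenames = ['hrly_customers-50F.txt', 'hrly_customers-60F.txt']
for name in filenames:
    fileobj = open(os.path.join(workdir, name), 'w')
    fileobj.write('data for ' + name + '\n')
    fileobj.close()

for name in filenames:
    src = os.path.join(workdir, name)
    dst = os.path.join(workdir, 'copy_of_' + name)   # hypothetical naming
    shutil.copy2(src, dst)           # copies contents plus metadata it can keep

print(sorted(os.listdir(workdir)))
```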
1 import os
2
3 filenames = [ 'hrly_customers-50F.txt',
4 'hrly_customers-60F.txt',
5 'hrly_customers-70F.txt',
6 'hrly_customers-80F.txt',
7 'hrly_customers-90F.txt',
8 'hrly_customers-100F.txt']
9
Note how we import a different module in line 1, and use a different function in the for loop.
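Deleting files in a loop can be sketched as below. I am assuming here that the function in question is os.remove, which is the os module's function for deleting a single file; the files live in a temporary scratch directory so nothing of yours is touched.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
filenames = ['hrly_customers-50F.txt', 'hrly_customers-60F.txt']
for name in filenames:
    open(os.path.join(workdir, name), 'w').close()

print(sorted(os.listdir(workdir)))   # both files are present

for name in filenames:
    os.remove(os.path.join(workdir, name))   # assumed deletion function

print(os.listdir(workdir))           # []
```

Files deleted this way do not go to a trash or recycle bin, which is one more reason to practice in a scratch directory.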
For both isfile and isdir, if the file is a link, the link is followed until its termination. (You can,
for instance, make an alias of an alias of an alias, etc. Following a link means following that chain
until you reach the original file or directory.) If there is a file or directory at the end of the chain,
the function will return True or False as appropriate.
5
Reference for parts of this paragraph: https://docs.python.org/2/library/shutil.html (accessed November 7, 2016).
6
Reference for parts of this paragraph: https://docs.python.org/2/library/os.path.html (accessed November 7, 2016).
Here are the results from using these functions in the Python interpreter on the hrly_customers-50F.txt
file and the mild directory, as from our Section 6.2 example. Remember, the script assumes
the files are in the same directory as the script:
Both the file and directory are regular items, not links (i.e., not aliases), and so the islink call
returns False.
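A sketch of these three tests, using a freshly made file and directory in a temporary scratch location (the names mirror the Section 6.2 example, but the setup is hypothetical):

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
afile = os.path.join(workdir, 'hrly_customers-50F.txt')
adir = os.path.join(workdir, 'mild')
open(afile, 'w').close()
os.mkdir(adir)

print(os.path.isfile(afile), os.path.isdir(afile), os.path.islink(afile))
# True False False
print(os.path.isfile(adir), os.path.isdir(adir), os.path.islink(adir))
# False True False
```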
Why do we care about testing what a file is? Now, as we go into Chapter 7, we can test to see
whether a file is a directory or not and choose the right file/directory handling function to use.
6.7 Summary
We’ve seen that functions in the shutil, os, and os.path modules can be used to manage files, and
that the use of looping gives us a powerful way of managing many files in very few lines of code. In
addition, the commands we’ve seen are the same regardless of the operating system we’re on. File
management using Python, as a result, has the potential of enabling us to write a single program
that can manage our files whether the computer we’re on is running Windows, Mac OS X, Linux,
or any one of a number of other operating systems. While a pure and complete “write once, run
anywhere” ability is hard to come by, the file management features in Python help move us in that
direction.
(For more details on these modules, please see the Python documentation: https://docs.python.org/2/library/shutil.html,
https://docs.python.org/2/library/os.html, and https://docs.python.org/2/library/os.path.html.)
One final note: As powerful as automation is, it’s not always worth doing. Here are two xkcd
comics to help us think more clearly about when to automate and when not to: https://xkcd.com/1205/
and https://xkcd.com/1319/.
Chapter 7
Chapter Contents
7.1 Prelude: Specifying directory paths
7.2 Creating and removing empty directories
7.3 Renaming and moving directories
7.4 Copying and deleting directories
7.5 Returning and changing the working directory
7.6 Listing the contents of a directory
7.7 Python Sidebar: Dictionaries
7.8 Summary
In Chapter 6, we took a look at some of the functions Python has that we can use to move,
rename, copy, and delete files. These functions are particularly useful because we can embed
calls to them in a loop and automatically operate on many files.
No one today (or at least no one should), however, stores all of their files on their desktop.
Instead, we put collections of files into directories or folders and place groups of folders into other
folders, and so on. Using directories, we’re able to group related files with each other and move
entire collections of files around at one time.
But, what if you have a large collection of directories? We then have the same problem as in
Chapter 6, except in this case instead of needing to manage many files we need to manage many
directories (which may in turn contain many files). Can Python help us with this task?
In this chapter, we look at some of the tools Python provides to help us manage directories and
their contents. There are two classes of tools we’ll look at: those used in creating, naming, moving,
and copying directories and those used in accessing the contents of the directory. We start, however,
with something of a prelude: how can we use Python to help us specify the path to a directory?
Before we do, I want to reiterate what I said at the beginning of Chapter 6: Before you begin
using Python’s file and directory manipulation tools, I strongly encourage you to create a scratch
directory and do all your work in there, until you’ve gotten the hang of using these functions and
commands. Change your current working directory to this scratch directory. This will help decrease
the potential of inadvertently messing up your file system.
7.1. PRELUDE: SPECIFYING DIRECTORY PATHS
MyFiles/BusinessData/June/Day15.txt
MyFiles\BusinessData\June\Day15.txt
Immediately, we see that depending on the operating system we use, the character that separates
different directories in the path may differ. On Windows, the character is a backslash while in Linux
the character is a forward slash. (Note that a backslash, when represented in a string, is a “backslash
backslash”, i.e., “\\”. When the string is printed out, only a single backslash occurs:
>>> a = "\\"
>>> print(a)
\
In the next paragraph we’ll see how the os.path module helps us to avoid having to think about
this.)
1
Here “list” does not mean a Python list but a sequence or listing of directories.
Python’s os.path module (the path submodule of the os module) provides tools to enable us
to construct and manipulate paths that account for the difference between the directory separation
characters used in different operating systems. The join function of os.path takes a set of string
arguments and joins them together into a single string with the correct character separating directories
for the operating system being used. For instance, for our Day15.txt path example above,
join would give:
The above code was executed on a Linux system, so the path separation character used is a forward
slash.
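A minimal sketch of join, using the directory names from the Day15.txt example (the variable names mypath and expected are made up):

```python
import os

mypath = os.path.join('MyFiles', 'BusinessData', 'June', 'Day15.txt')
print(mypath)    # MyFiles/BusinessData/June/Day15.txt on Linux or Mac OS X

# The separator that join inserts is available as os.sep.
expected = os.sep.join(['MyFiles', 'BusinessData', 'June', 'Day15.txt'])
print(mypath == expected)    # True
```

On Windows, the same code would print the path with backslashes instead, with no change to the program.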
In addition to having functions to create a path, the os.path module also has functions to ma-
nipulate paths (quite a few, in fact). Some functions include: abspath, which gives the absolute
path (a.k.a. full path) for a given path:
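For instance, a sketch that assumes the Day15.txt path from above is stored in a variable named mypath:

```python
import os
import os.path

# A relative path: it does not start at the root of the file system.
mypath = os.path.join("MyFiles", "BusinessData", "June", "Day15.txt")

# abspath resolves a relative path against the current working directory.
full_path = os.path.abspath(mypath)
print(full_path)
```

The printed value depends on where Python was started. The path does not need to exist for abspath to work.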
In this example, the path given by mypath is a relative path, that is, a path specification relative
to the present location (called the current working directory). By default, when you start Python
from the command line, the current working directory is the location you were in when you started
Python. In this example, I started Python while I was in the /home/jlin directory, so the full path
to mypath is /home/jlin/MyFiles/BusinessData/June/Day15.txt, which is what abspath gives. In
Canopy, you can set the working directory via the Python interpreter window. Figure 7.2 shows
the Canopy Python interpreter of Figure 1.2, except with the working directory status bar and edit
menu switch circled by a red circle. The status bar shows what the current working directory in
Canopy is set to and the edit menu (accessed by a right-click) gives you options for changing that
working directory. You can, however, get and change the current working directory from within
Python, while a Python program is running; we discuss how in Section 7.5.
The function basename returns the file or directory that is at the end of the path. For the above
example:
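A minimal sketch, again assuming the Day15.txt path is stored in mypath:

```python
import os.path

mypath = os.path.join("MyFiles", "BusinessData", "June", "Day15.txt")

# basename keeps only the last component of the path.
last_part = os.path.basename(mypath)
print(last_part)    # prints Day15.txt
```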
If you are looking for a straightforward way to figure out what kind of file is specified by path,
via the file’s extension, you can use the splitext function to split off the file extension from the
rest of the path. For example:
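A sketch under the same assumption about mypath. The variable names root, extension, and kind, and the labels stored in kind, are illustrative only:

```python
import os.path

mypath = os.path.join("MyFiles", "BusinessData", "June", "Day15.txt")

# splitext returns (everything before the extension, the extension).
root, extension = os.path.splitext(mypath)
print(extension)    # prints .txt

# A path with no extension gives an empty string as the second element.
no_ext = os.path.splitext("README")

# One way to respond to the extension that comes back.
if extension.lower() == ".txt":
    kind = "text file"
else:
    kind = "something else"
```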
7.2. CREATING AND REMOVING EMPTY DIRECTORIES
Figure 7.2: The Canopy Python interpreter, with the working directory status bar and edit menu
switch shown. Right-click the switch to get menu options that include changing the working direc-
tory.
splitext returns a two-element tuple with the extension in the second element and the rest of the
path in the first. If the path does not have a file extension, the second element of the tuple is an
empty string. You can run tests on what file extension is returned and have your program respond
accordingly.
The Python documentation on os.path is a must read if you want to see all the different tools
you have at your disposal for handling paths: https://docs.python.org/2/library/os.path.html.
>>> import os
>>> import os.path
>>> mypath = os.path.join("MyFiles", "BusinessData", "August")
>>> os.mkdir(mypath)
2
Reference for parts of this paragraph: https://docs.python.org/2/library/os.html (accessed November 9, 2016).
7.3. RENAMING AND MOVING DIRECTORIES
The mkdir command, however, assumes that all intermediate directories to the directory you are
creating already exist. In the above example, if the MyFiles and BusinessData directories do not
already exist, the call to mkdir will fail and raise an error.
If you want to create all the intermediate directories in addition to the final (or leaf) directory,
you should use makedirs:3
>>> import os
>>> import os.path
>>> mypath = os.path.join("MyFiles", "Car", "Manuals")
>>> os.makedirs(mypath)
In this example above, the makedirs command creates not only the leaf directory, Manuals, but
also the intermediate directory Car, which does not exist in the Figure 7.1 directory tree. makedirs
does all this in one fell swoop.
Notice that for both the mkdir and makedirs calls above, no message is printed in the interpreter.
The commands create the directories on your file system and that’s that. No news means the command
executed “successfully.” Of course, you have to make sure that you typed everything in
correctly, in order for “successful” to be the same as “desired outcome.”
Of course, for both mkdir and makedirs, we’re creating empty directories that have nothing
inside them (except directories that are part of the path). We’ll address the case of copying and
deleting directory trees with files in them in Section 7.4.
Given makedirs, why would you ever want to use mkdir? Admittedly, makedirs is powerful,
but with such power comes the potential to mess things up. Unless I know that I want to
create all the intermediate directories in a path, I’d use mkdir to limit any inadvertent mistakes.
If I’m in the BusinessData directory, that is, if my current working directory is the BusinessData
directory in Figure 7.1, the rename is even easier (with the imports removed):
3
Reference for parts of this paragraph: https://docs.python.org/2/library/os.html (accessed November 9, 2016).
4
Reference for parts of this paragraph: https://docs.python.org/2/library/shutil.html (accessed November 9, 2016).
7.4. COPYING AND DELETING DIRECTORIES
There’s no need to bother with os.path.join since I’m only referring to paths made up of a single
directory name and level.
In addition to renaming directories, we can also use shutil’s move to move directories (and
everything in that directory) into another directory. Thus, if we wanted to create a directory called
Media inside MyFiles (in the Figure 7.1 tree) and move AudioClips and VideoClips into Media,
we’d execute the following code (I’m assuming our current working directory is MyFiles):
>>> import os
>>> import shutil
>>> os.mkdir("Media")
>>> shutil.move("AudioClips", "Media")
>>> shutil.move("VideoClips", "Media")
Note that copytree does not copy over all the metadata. (If you care about copying all the meta-
data, you’ll probably have to write a shell script.)
To remove a directory and all sub-directories and files, use the shutil function rmtree. The
following will remove BusinessDataCopy and everything in it from the current working directory:6
>>> import shutil
>>> shutil.rmtree("BusinessDataCopy")
7.6. LISTING THE CONTENTS OF A DIRECTORY
>>> import os
>>> os.getcwd()
'/home/jlin'
Remember that getcwd is a function so you have to put the empty set of parentheses in order to call
the function. In the example above, I had started Python in a terminal window while I was in my
home directory, /home/jlin, so this is the value of my current working directory.
To change my current working directory, I use the chdir function in the os module. The fol-
lowing will change my current working directory to the temp directory in my home directory:
>>> import os
>>> os.chdir("/home/jlin/temp")
In the above example, I specified the full or absolute path for my own computer, which runs Linux.
Since I had started my Python session from my home directory, and I’m trying to change my current
working directory to a sub-directory of my home directory, the following would have also worked:
>>> import os
>>> os.chdir("temp")
So, I can specify a directory in chdir using a relative path instead of an absolute path.
>>> import os
>>> os.listdir('.')
['AudioClips', 'Report.pdf', 'Flier.pdf', 'VideoClips', 'BusinessData']
>>> os.listdir(os.getcwd())
['AudioClips', 'Report.pdf', 'Flier.pdf', 'VideoClips', 'BusinessData']
In the first example of listdir, I pass in the current directory symbol in Linux (a period) and the
function returns me the list of the contents of the directory I am currently in (which is MyFiles). In
the second example, I pass in the return value of a call to the getcwd function, which returns the
full path of the directory I am currently in.
Sometimes, I’m interested in listing only a subset of the contents of a directory. The glob module
has a function called glob that will do basic pattern matching when doing a directory listing. If
I’m in the MyFiles directory of the directory tree in Figure 7.1, I’ll get the following:8
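A runnable sketch of two such calls follows. It assumes a throwaway directory built with tempfile as a stand-in for MyFiles (the scratch setup and variable names are mine, not part of the original listing), so the patterns include the directory path rather than relying on the current working directory:

```python
import glob
import os
import shutil
import tempfile

# Build a small scratch stand-in for MyFiles so the sketch is safe to run.
scratch = tempfile.mkdtemp()
for name in ["Report.pdf", "Flier.pdf"]:
    open(os.path.join(scratch, name), "w").close()
os.mkdir(os.path.join(scratch, "AudioClips"))

# "*" matches items of any name; "*.pdf" matches names ending in .pdf.
everything = glob.glob(os.path.join(scratch, "*"))
pdfs_only = glob.glob(os.path.join(scratch, "*.pdf"))

# glob makes no promise about ordering, so sort before printing.
all_names = sorted(os.path.basename(p) for p in everything)
pdf_names = sorted(os.path.basename(p) for p in pdfs_only)
print(all_names)    # ['AudioClips', 'Flier.pdf', 'Report.pdf']
print(pdf_names)    # ['Flier.pdf', 'Report.pdf']

shutil.rmtree(scratch)    # remove the scratch directory again
```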
7
Reference for parts of this paragraph: https://docs.python.org/2/library/os.html (accessed November 10, 2016).
8
Reference for parts of this paragraph: https://docs.python.org/2/library/glob.html (accessed November 10, 2016).
7.7. PYTHON SIDEBAR: DICTIONARIES
In both calls to glob, the asterisk is a wildcard character meaning “any character or set of char-
acters.” Thus, in the line 2 glob call, the function looks for all contents of MyFiles, i.e., items
of any name. In the line 4 call, the function looks for all contents whose name ends in .pdf.
More complex pattern matching is possible; see the glob module documentation for details
(https://docs.python.org/2/library/glob.html).
But isn’t this a little cumbersome? If we make any changes to one variable we have to be very
careful to synchronize the change with the other variable. If files changes but num_lines does
not, there’s no way for us to connect the two together again. Wouldn’t it be nice if there were a
way to directly associate the filename with the number of lines value? The dictionary can solve
this problem. We’ll first look at what dictionaries are, how to create them, and what kinds of tools
they offer us. Then we’ll come back to our files and num_lines example above and store the
information in a dictionary.
Like lists and tuples, dictionaries are also collections of elements, but dictionaries, instead of
being ordered, are unordered lists whose elements are referenced by keys, not by position. Keys
can be anything that can be uniquely named and sorted. In practice, keys are usually integers or
strings. Values can be anything. (And when I say “anything,” I mean anything, just like lists and
tuples.)
Curly braces (“{}”) delimit a dictionary. The elements of a dictionary are “key:value” pairs:
a colon separates each key from its value, and commas separate the pairs from one another.
Dictionary elements are referenced like lists, except the key is given in place
of the element address. The example below will make this all clearer:
Example 31 (A dictionary):
Type the following in the Python interpreter:
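A sketch of such a session, reconstructed from the values quoted in the discussion that follows (the exact wording of the original listing is an assumption):

```python
# Keys are strings; values are an integer, a float, and a list.
a = {'a': 2, 'b': 3.2, 'c': [-1.2, 'there', 5.5]}

print(a['b'])       # the value stored under key 'b'
print(a['c'])       # the whole list stored under key 'c'
print(a['c'][1])    # index into that list
```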
Solution and discussion: a['b'] returns the floating point number 3.2. a['c'] returns the list
[-1.2, 'there', 5.5], so a['c'][1] returns the oneth element of that list, the string 'there'.
Like lists, dictionaries come with methods that enable you to find out all the keys in the dictio-
nary, find out all the values in the dictionary, etc. Here are some useful dictionary methods:
If you typed the following into the Python interpreter, what would you get for each line?:
• d = a.keys()
• d = a.values()
• a.has_key('c')
Solution and discussion: The first line executes the command keys, which returns a list of
all the keys of a, and sets that list to the variable d. The second command does the same thing
as the first command, except d is a list of the values in the dictionary a. The third command tests
if dictionary a has an element with the key 'c', returning True if true and False if not. For the
dictionary a, the first line returns the list ['a', 'c', 'b'] and sets that to the variable d while
the third line returns True.
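Under Python 3, keys and values return view objects rather than lists, and the has_key method no longer exists. A minimal sketch of equivalent calls, assuming Python 3 and the Example 31 dictionary (rebuilt here from the values quoted above; the names v and has_c are mine):

```python
a = {'a': 2, 'b': 3.2, 'c': [-1.2, 'there', 5.5]}

d = list(a.keys())      # wrap the view in list() to get a real list
v = list(a.values())    # likewise for the values
has_c = 'c' in a        # the in operator replaces has_key

print(d)
print(v)
print(has_c)            # prints True
```

The in test works in Python 2 as well, so it is a reasonable default if the code may run under both.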
Note that the keys and values methods do not return a sorted list of items. Because dictionaries
are unordered collections, you must not assume the key:value pairs are stored in the dictionary in
any particular order. If you want to access the dictionary values in a specific order, you should first
order the dictionary’s keys (or, in some cases, values) in the desired order using a sorting function
like sorted.
Once a dictionary is created, you can add key:value pairs to the dictionary by assignment. Thus,
if we wanted to add the value 'goodbye' with the key 'd' to the dictionary in Example 31 (repro-
duced below as a reminder), we would type:
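A sketch of that assignment, with the Example 31 dictionary rebuilt from the values quoted earlier:

```python
a = {'a': 2, 'b': 3.2, 'c': [-1.2, 'there', 5.5]}

# Assigning to a key that does not exist yet creates a new entry.
a['d'] = 'goodbye'
print(a['d'])    # prints goodbye
```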
To replace the value of an existing dictionary entry, we use the same assignment syntax. To delete
an entry, we use the del command. On the above dictionary:
>>> del(a['c'])
>>> a
{'a': 2, 'b': 3.2, 'd': 'goodbye'}
Let’s return to the file system example from the beginning of this section. How
might we use dictionaries to connect the number of lines in each file with the file’s name? By using
the filename as the key and the number of lines as the value. That is, we can define this dictionary
num_lines_dict:
num_lines_dict = {'Day1.txt':12, 'Day15.txt':35, 'Day30.txt':92}
We obtain the number of lines values by referencing a filename, which is a key, such as:
and we can add new elements by assignment. For instance, let’s say a new file Day10.txt was
created in this directory and that the file was 103 lines long. We add the entry to the dictionary by
typing in:
num_lines_dict['Day10.txt'] = 103
If we want to loop through all the files and print out the number of lines for each file, the
following code will do the trick:
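A minimal sketch of such a loop, assuming the num_lines_dict entries defined above (the loop variable name filename is mine):

```python
num_lines_dict = {'Day1.txt': 12, 'Day15.txt': 35, 'Day30.txt': 92}
num_lines_dict['Day10.txt'] = 103

# keys gives every filename; the dictionary lookup gives its line count.
# The order in which the lines are printed should not be relied on.
for filename in num_lines_dict.keys():
    print(filename + ": " + str(num_lines_dict[filename]))
```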
Note, however, the files will not be printed in any (humanly discernable) order. As dictionaries are
unordered collections, we must think of the keys method as returning the dictionary’s keys in any
way it wants to. If we want to go through the keys in a particular order, we have to sort the keys
first:
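A sketch of the sorted version, under the same assumptions about num_lines_dict (the name ordered is mine):

```python
num_lines_dict = {'Day1.txt': 12, 'Day15.txt': 35,
                  'Day30.txt': 92, 'Day10.txt': 103}

# sorted returns a new list of the keys in ascending string order.
ordered = sorted(num_lines_dict.keys())
for filename in ordered:
    print(filename + ": " + str(num_lines_dict[filename]))
```

Because the keys are strings, the default ordering compares them character by character rather than numerically; a hypothetical Day2.txt would come after Day15.txt.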
The above code uses the built-in sorted function to sort the keys using sorted’s default ordering
algorithm.
What does this use of dictionaries buy us? Some thoughts regarding the above example:
• We can now refer to elements associated with a filename by the filename itself. There’s no
ambiguity, for instance, as to what the “threeth” element refers to.
• Instead of dealing with two lists and making sure we keep them synchronized with each other,
all the information is now stored in a single variable.
• We can add and remove items from this collection of data without having to worry about
messing up the order of the collection because the data is stored unordered and referenced
without respect to order.
One final note: If we want to store more data about a file than a single number, we can make the
value part of the dictionary’s key:value pairs to be a collection of some sort, say a list or another
dictionary. Dictionaries are very flexible! If the structure of the data we want to store is even more
complex, it’s probably a good idea to start thinking about defining our own classes of objects to
store the data. We’ll talk some about this topic in Chapter 9.
7.8 Summary
Through the os, shutil, and glob modules, Python gives us many of the file and directory manage-
ment tools that a shell gives us. However, because Python is a full-fledged programming language
with many scientific, mathematical, and business packages, we can write much more complex
programs for managing and interacting with our files and directories than are possible in a shell
language. The dictionary data structure is an example of a tool that makes storing information
related to files (whose names are straightforwardly represented by strings) a piece of cake (well,
maybe two pieces). One common application of these tools is searching for files on a computer,
and it’s to that topic we turn in Chapter 8.
Chapter 8
Chapter Contents
8.1 Review of Python routines for searching
8.2 Searching in a directory tree
8.3 Searching using recursion
8.3.1 What is recursion?
8.3.2 Implementing a search using recursion
8.4 Summary
We’ve talked about creating, copying, and moving files. We have one more foundational MIS
file and directory management activity to discuss: searching. File systems can contain hundreds of
thousands to millions of files and directories. How do we find the file or files we want? It could
be that set of sales data files or the brochures from that ad campaign ten years ago.
In this chapter, we describe some of the tools Python provides to help us find files on our
computer, automatically. We’ll start out by reviewing searching tools we’ve already seen in Python
and how they can be used to search for files (among other things). Next, we’ll look at Python tools
specifically designed for searching directory trees. Finally, we’ll examine another way of searching
directory trees using the concept of recursion.
8.1. REVIEW OF PYTHON ROUTINES FOR SEARCHING
Note that in the above code, loc will be set to the index of the last occurrence of the number six.
The index method of a list: As we saw in Example 4, the index method of a list will return
the index where the first occurrence of the search target is found. Below, we search the data list
for the index where the first occurrence of the number six is located and store that index in loc:
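A minimal sketch, assuming a data list of my own choosing that contains the number six twice:

```python
data = [3, 6, 1, -8, 6, 9]    # illustrative values, not from the text

# index stops at the first match, so the second six is never reported.
loc = data.index(6)
print(loc)    # prints 1
```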
The count method of a list: As we also saw in Example 4, the count method of a list will
count the number of occurrences in the list of a search target. Below, we search the data list to
find how many times the number six occurs and store that number in count_six:
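A minimal sketch, again assuming an illustrative data list that contains the number six twice:

```python
data = [3, 6, 1, -8, 6, 9]    # illustrative values, not from the text

# count tallies every element equal to the search target.
count_six = data.count(6)
print(count_six)    # prints 2
```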
Array syntax and testing for equality: Array syntax automatically applies logical operators
like equality element-wise across an array. Below we look in data for where it equals the number
six:
import numpy as np
data = np.array([3, 6, 1, -8, 9])
mask = data == 6
The variable mask is a boolean array showing which elements of data equal the number six and
which do not:
>>> mask
array([False, True, False, False, False], dtype=bool)
import numpy as np
data = np.array([3, 6, 1, -8, 9])
mask = np.where(data == 6, np.arange(np.size(data)), -1)
>>> mask
array([-1, 1, -1, -1, -1])
The isclose function on arrays: We saw how to use allclose and isclose in Section 4.7
to do “equality” tests on floating point numbers. Below we look in data for where it is close to the
number six:
import numpy as np
data = np.array([3., 6., 1., -8., 9.])
mask = np.isclose(data, 6.0)
and, as earlier, the variable mask is a boolean array showing which elements of data are close to
the number six and which are not:
>>> mask
array([False, True, False, False, False], dtype=bool)
Membership testing: Starting on p. 39, we discussed how the in operator can be used for
membership testing in lists and strings. This doesn’t tell us where a search target is in the list or
string but it does tell us if the target is present in the list or string. Below we test whether the number
six is in the list data:
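A minimal sketch, with the list contents and the names found_six and found_seven assumed for illustration:

```python
data = [3, 6, 1, -8, 9]    # illustrative values

# in gives back True or False; it says nothing about where the match is.
found_six = 6 in data
found_seven = 7 in data
print(found_six)      # prints True
print(found_seven)    # prints False
```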
If a dictionary has a specific key: The has_key method, as we saw in Example 32, lets us
see if the dictionary has a given key. Below we construct a dictionary data and ask whether the
dictionary has a key named 'six':
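A sketch with made-up dictionary contents. It uses the in operator, which works in both Python 2 and Python 3, since has_key exists only in Python 2:

```python
data = {'one': 1, 'six': 6, 'ten': 10}    # illustrative contents

# Membership testing on a dictionary looks at its keys, not its values.
has_six = 'six' in data    # Python 2 only: data.has_key('six')
has_two = 'two' in data
print(has_six)    # prints True
print(has_two)    # prints False
```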
You can also use the dictionary method values to obtain all the values in data as a list and use
the list index method to get the location of the search target. The values are not in any human-
understood order, but the index will work to get the value of interest. Here, we look for the number
six:
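A sketch with the same made-up dictionary contents. Wrapping values in list() is needed under Python 3, where values returns a view object that has no index method:

```python
data = {'one': 1, 'six': 6, 'ten': 10}    # illustrative contents

values = list(data.values())
loc = values.index(6)    # position of the number six among the values

# keys and values come back in matching order as long as the
# dictionary is not modified in between, so loc also finds the key.
key_at_loc = list(data.keys())[loc]
print(key_at_loc)    # prints six
```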
The glob function for wildcard searching in a directory: In Section 7.6, we saw that the
glob module function glob can search a directory for filenames matching a given pattern. The
following searches the current working directory for all files that have the number six somewhere
in their name and returns the list of those files as files6:
import glob
files6 = glob.glob("*6*")
8.2. SEARCHING IN A DIRECTORY TREE
import os
for dirpath, dirnames, filenames in os.walk("MyFiles"):
    print( str(dirpath) + ":\n " + str(dirnames) + \
           "\n " + str(filenames) )
1 MyFiles:
2 ['AudioClips', 'VideoClips', 'BusinessData']
3 ['Report.pdf', 'Flier.pdf']
4 MyFiles/AudioClips:
5 []
6 ['Jingle.mp3', 'Testimony.mp3']
7 MyFiles/VideoClips:
8 []
9 ['Ad.mov', 'Poster.jpg']
10 MyFiles/BusinessData:
11 ['July', 'June']
12 ['Old.txt']
13 MyFiles/BusinessData/July:
14 []
15 ['Day1.txt']
16 MyFiles/BusinessData/June:
17 []
18 ['Day15.txt', 'Day1.txt', 'Day30.txt']
Let’s interpret this. On each iteration of the for loop, the variables dirpath, dirnames, and
filenames are filled. On the first iteration of the loop, dirpath is set to 'MyFiles', dirnames
is set to the list ['AudioClips', 'VideoClips', 'BusinessData'], and filenames is set to
the list ['Report.pdf', 'Flier.pdf']. (These contents are shown in lines 1–3 above.) That
is to say, on the first iteration, dirpath is set to the topmost directory in the tree (MyFiles) and
dirnames and filenames are set to the names of the sub-directories and files that are in MyFiles.
On the second iteration, dirpath is set to MyFiles/AudioClips, that is, the AudioClips sub-
directory in MyFiles. The dirnames variable is set to an empty list, and the filenames variable
is set to the list ['Jingle.mp3', 'Testimony.mp3']. (These contents are shown in lines 4–6
above.) That is, because there are no sub-directories in AudioClips, the dirnames list has nothing
in it and only the files in AudioClips show up (in the list filenames).
The third iteration sets dirpath, dirnames, filenames to what’s shown in lines 7–9 above,
and so on until all sub-directories in the MyFiles directory tree are visited. As a side note, from the
output, we see that when we reach a “leaf” directory, i.e., a directory that only contains files, the
dirnames list will be empty (and thus have a length of zero). If we’re interested in looking only at
leaf directories, we can use len(dirnames) == 0 as a test to see whether we’re in a leaf directory.
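A sketch of that leaf-directory test. It assumes a small scratch tree built with tempfile (a stand-in for part of the Figure 7.1 tree; the setup and the name leaf_dirs are mine) so it can be run safely anywhere:

```python
import os
import shutil
import tempfile

# Build a scratch tree: two leaf directories under BusinessData, one beside it.
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "BusinessData", "June"))
os.makedirs(os.path.join(top, "BusinessData", "July"))
os.mkdir(os.path.join(top, "AudioClips"))

leaf_dirs = []
for dirpath, dirnames, filenames in os.walk(top):
    if len(dirnames) == 0:    # no sub-directories: this is a leaf
        leaf_dirs.append(os.path.relpath(dirpath, top))

shutil.rmtree(top)    # remove the scratch tree again
print(sorted(leaf_dirs))
```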
Now, let’s use walk to search for a file. Let’s say we want to find the location of all occurrences
of a file named Day1.txt. From Figure 7.1, we know there are two files in this tree by that name.
The following code will store the directory paths to all occurrences of Day1.txt:
import os
file_occurrence_locations = []
for dirpath, dirnames, filenames in os.walk("MyFiles"):
    if 'Day1.txt' in filenames:
        file_occurrence_locations.append(dirpath)
and the file_occurrence_locations list will have the following values:
>>> file_occurrence_locations
['MyFiles/BusinessData/July', 'MyFiles/BusinessData/June']
which are the paths of the sub-directories where there are files named Day1.txt.
One final note about finding files: It is an unfortunate truth that different operating systems not
only have different directory separation characters but also different rules about case-sensitivity. On
some operating systems (like Linux), filenames are case-sensitive: upper- and lower-case characters are distinct.
On other operating systems (like Mac OS X), there is no real distinction between a file named
Foo.txt and one named foo.txt.3 This issue with case makes it possible that a plain string equality
or membership test will fail, if you’re working across operating systems. Python provides the
fnmatch module to enable pattern matching for filenames that takes into account the case handling
features of the operating system one is working on. See the fnmatch documentation for details
(https://docs.python.org/2/library/fnmatch.html).
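The sketch below contrasts the fnmatch module’s two matching functions. The filenames in names and the variable names are made up for illustration:

```python
import fnmatch

names = ["Day1.txt", "DAY1.TXT", "Day15.txt", "Report.pdf"]

# fnmatchcase always compares case-sensitively, on every operating system.
strict = [n for n in names if fnmatch.fnmatchcase(n, "Day1*.txt")]

# Lower-casing both sides forces a case-insensitive comparison everywhere.
relaxed = [n for n in names if fnmatch.fnmatchcase(n.lower(), "day1*.txt")]

# fnmatch.fnmatch first normalizes case the way the operating system does,
# so its answer for "DAY1.TXT" can differ from one platform to another.
platform_dependent = fnmatch.fnmatch("DAY1.TXT", "Day1*.txt")

print(strict)     # ['Day1.txt', 'Day15.txt']
print(relaxed)    # ['Day1.txt', 'DAY1.TXT', 'Day15.txt']
```

If a search has to behave identically on every machine, the explicit lower-casing approach is the more predictable of the two.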
8.3. SEARCHING USING RECURSION
A mathematical example might be helpful to describe both what recursion is and why it doesn’t
have to lead to infinite regress. Let’s say you want to write a function that takes a list of numbers
and adds the numbers up. Using a for loop, the function add_up would be:
def add_up(input_list):
    total = 0.0
    for i in input_list:
        total = total + i
    return total
When we go through the logic of this code in our head, we probably hear something like, “take
one number, add it to total, take another number, add it to total, and so on until we’re out of
numbers.”
But, we could think of the process of adding up these numbers in another way: “the sum of a
list of numbers is the sum of the list of numbers not including the last number plus the last number.”
But how do we get the “sum of the list of numbers not including the last number?” Using the same
logic, that sum is also, “the sum of the list of numbers not including the last number plus the last
number,” but the “list of numbers” isn’t the original list of numbers but the original list without the
last item on the list. The logic would continue until “the sum of the list of numbers not including
the last number” is zero, because there are no numbers left in such a list.
In Python, this would look like:
1 def add_up(input_list):
2     if len(input_list) == 0:
3         return 0.0
4     else:
5         return add_up(input_list[:-1]) + input_list[-1]
Some things to point out in this code. First, we see that our recursive call, i.e., our call to the add_up
function inside the add_up definition, occurs in line 5. It’s important to note, however, that while
we are calling add_up in line 5, the input parameter is not the same as the original, full list of
numbers. (If it were, we would get infinite regress.) Instead, the input argument we’re
passing into the line 5 call to add_up is the current input_list without the last element. This is
how we implement the “not including the last number” phrase in our above word description of our
logic.
Second, our worries about infinite regress are solved by lines 2–3. The if statement checks to
see whether the list of numbers input_list actually has any numbers in it and if not, the function
returns zero. This makes sense because the sum of nothing is zero. But from the standpoint of infi-
nite regress, this means that there comes a point when we no longer will make calls to add_up. The
situation tested for in lines 2–3 is called the stopping case or base case. As the name suggests,
this condition terminates recursive calls to the function, preventing infinite regress. Every recursive
function has to have at least one stopping case. The lack of one (or enough) stopping cases will
result in a recursive function eventually failing and causing a “stack overflow.” We’re not going
to go into what a stack overflow is beyond saying if it happens, your program will terminate. So
you’ll know you made a mistake.
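A small sketch of what a missing stopping case looks like in practice. The function name add_up_no_stop is hypothetical, and the behavior described is that of the standard CPython interpreter, which raises an exception once its recursion limit is exceeded:

```python
import sys

def add_up_no_stop(input_list):
    # Deliberately broken: the stopping case has been left out, so the
    # function keeps calling itself even after the list is empty.
    return add_up_no_stop(input_list[:-1]) + input_list[-1]

limit = sys.getrecursionlimit()    # the interpreter's recursion ceiling

try:
    add_up_no_stop([1.0, 2.0, 3.0])
    outcome = "finished"
except RuntimeError as err:        # RecursionError is a kind of RuntimeError
    outcome = "stopped: " + type(err).__name__

print(limit)
print(outcome)
```

Catching the exception here is only for demonstration; in real code the fix is to add the stopping case, not to trap the error.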
In this example, the recursive function doesn’t seem more concise than the one using a loop. It’s
not. For simple problems, this is typical: the looping solution is more straightforward and shorter
than the recursive version. For more complex problems, however, this is not the case. In the next
section, Section 8.3.2, we look at a more complex problem and compare a recursive solution with
pieces of an iterative solution that doesn’t have the benefit of the os module’s walk generator.
Recursion is, at least to me, a tough idea to grasp. I’d encourage you to read more resources on
recursion and, most importantly, to practice writing recursive functions. Here are a few resources
I’d recommend:
• Python Course: A reasonably accessible introduction, but its even more mathematical examples
may not be motivating to MIS professionals: http://www.python-course.eu/recursive_functions.php.
• CodeStepByStep: Scroll down for two recursion practice problems:
http://www.codestepbystep.com/problem/list.
• Python Data Structures and Algorithms: Recursion — Exercises, Practice, Solution: A bunch
of exercises with solutions:
http://www.w3resource.com/python-exercises/data-structures-and-algorithms/python-recursion.php.
• Recursion Examples (Python): A bunch of small problems with solutions:
https://frankanya.wordpress.com/2013/04/18/recursion-examples-python/.
A recursive solution
How might we do this recursively? In terms of logic, we might say, “look into the directory, return
a list of all .txt files and if you find a directory, look into the directory, etc.” In terms of code, we
could write the following find_txt function which would implement that logic:
1  import os
2  import os.path
3
4  def find_txt(top_dir):
5      list_items = os.listdir(top_dir)
6      list_txt = []
7      for item in list_items:
8          item_path = os.path.join(top_dir, item)
9          if os.path.isdir(item_path):
10             list_txt = list_txt + find_txt(item_path)
11         else:
12             if os.path.splitext(item)[-1].lower() == '.txt':
13                 list_txt = list_txt + [item_path]
14     return list_txt
['MyFiles/BusinessData/Old.txt',
'MyFiles/BusinessData/July/Day1.txt',
'MyFiles/BusinessData/June/Day15.txt',
'MyFiles/BusinessData/June/Day1.txt',
'MyFiles/BusinessData/June/Day30.txt']
(I hand-formatted the output above so it’d look readable but the values are what a print statement
produces on the return value from the call. Also, my solution assumes there are no symbolic links
in the directory tree.)
Let’s go through each line of find_txt and describe what’s happening. In lines 1–2 we import
the modules we’ll need. Line 4 defines the function. Note that the function takes a single input
string, top_dir, which is the name of the directory in which we will look for all .txt files. Line 5
saves a listing of all the items (files and directories) in top_dir. Note that the listdir function
only gives the names (i.e., the basename) of the items; it does not provide the absolute path to the
items. In line 6 I initialize an empty list of all the .txt files in top_dir. For a given top_dir, this
list will be initially empty since we haven’t processed the listing of the items in top_dir yet. (You
may ask how we’ll account for .txt files in sub-directories of top_dir; I’ll discuss that in just a
sec.)
In line 7, we loop through each item and for each item that ends in .txt (line 12), we add that
to list_txt (line 13). Note that in the test in line 12, we use the lower string method to ensure
our comparison with the .txt suffix will work, even if the operating system is case-sensitive (e.g.,
we’re considering .txt and .TXT files as both .txt text files). Remember that the splitext function
breaks any path into the file extension portion and the non-extension portion, with the extension
portion being the last item of the returned tuple.
If an item in top_dir is a directory, we make a call to find_txt (this is the recursive call)
to obtain the list of .txt files from that directory. Note that the path you pass into the recursive
find_txt call (as well as in the line 9 check with isdir) has to start with the initial top_dir. That
is, it has to start with the top_dir you passed in on the original call to find_txt; in the example
above, that was find_txt("MyFiles"). The reason the path needs to begin with MyFiles is be-
cause the function doesn’t change the current working directory. As a result, the current working
directory is still the parent directory of MyFiles, and any item inside the directory tree will remain
hidden if the path doesn’t go through MyFiles.
Note also that line 8 will automatically keep on adding whatever the current item is to the
directory path that began with MyFiles and continues to whatever sub-directory contains the item
referred to in item_path. This is because the line 10 find_txt call’s argument becomes the new
top_dir in that recursive find_txt call, and join adds on new sub-directories to that top_dir
in line 8.
In lines 10 and 13, we don’t use append to extend the list_txt list because the list returned
from the find_txt call in line 10 may contain more than one element. The addition sign
concatenates two lists, which is why we use that in lines 10 and 13. (We could use append in line
13, but we use the addition operator for the sake of symmetry.) Here’s a simpler example of list
concatenation versus appending:
>>> a = [1, 2, 3]
>>> b = [4, 5, 6]
>>> a + b
[1, 2, 3, 4, 5, 6]
>>> a.append(b)
>>> a
[1, 2, 3, [4, 5, 6]]
from which we see that appending a list to a list does not concatenate the two lists together.
Finally, in line 14, we return the output from find_txt, which is the list of all .txt files in the
directory tree given by top_dir.
import os
import os.path

def find_txt(top_dir):
    list_items = os.listdir(top_dir)
    list_txt = []
    for item in list_items:
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            list_items_2 = os.listdir(item_path)
            for item_2 in list_items_2:
                item_2_path = os.path.join(item_path, item_2)  # item_2 lives inside item_path
                [... more code ...]
In lines 9–12, when we hit a sub-directory, we create another for loop to go through those items.
If that list of items includes another directory, we would add another nested for loop to go through
that sub-directory, and so on.
The problem with this approach is that we don’t know, ahead of time, how many levels of
directories there are. With two directory levels we need two nested loops, with three directory
levels we need three nested loops, and so on. But there could be one level of directories, or two
levels, or twenty levels. Do we write a twenty-level nested for loop structure in case there are that
many directory levels? And how does one write a twenty-level nested for loop structure? I can’t
imagine any way of doing so that is remotely readable.
So, what are the pluses and minuses for recursion versus loops? Loops are generally more
readable and understandable and the number of iterations you can make with loops is more or less
unlimited. Recursive solutions, in contrast, can be confusing and there is a limit to the number
of recursion levels available (on my computer it’s 1000).4 However, we’ve seen there are some
problems where the loops-only solution is so complex that recursion is, by far, the best way to
go. For the walk-through-a-directory tree problem, Python provides a generator function walk that
makes recursion unnecessary for most such uses of directory trees, but there are other problems
besides directory traversal ones where recursion is the only way to solve the problem in a reasonable
number of lines of code. If it only makes sense to state the problem logic in such a way that there
is an element where an action is defined in terms of itself, it’s worth trying a recursive solution.
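As a rough illustration of the two points in this paragraph, the sketch below finds .txt files with os.walk and a single for loop over the tree, and asks the sys module for the recursion limit. The function name find_txt_walk and the throwaway demo tree are hypothetical; the printed limit depends on your installation.

```python
import os
import sys
import tempfile


def find_txt_walk(top_dir):
    """Collect .txt file paths under top_dir without writing any recursion."""
    list_txt = []
    # os.walk yields one (directory path, sub-directory names, file names)
    # tuple for every directory in the tree.
    for dir_path, dir_names, file_names in os.walk(top_dir):
        for name in file_names:
            if os.path.splitext(name)[1] == ".txt":
                list_txt.append(os.path.join(dir_path, name))
    return list_txt


# Build a small throwaway directory tree to try the function on.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "level1", "level2"))
for parts in [("top.txt",), ("level1", "mid.txt"),
              ("level1", "level2", "deep.txt"), ("level1", "data.csv")]:
    open(os.path.join(demo_root, *parts), "w").close()

for path in sorted(find_txt_walk(demo_root)):
    print(os.path.relpath(path, demo_root))

# The maximum recursion depth for this Python installation.
print(sys.getrecursionlimit())
```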
8.4 Summary
Python provides a number of tools to help us find things, whether in a list of items or a tree of items.
With directory trees, the os module’s walk function makes it possible to loop through the entire tree
using a single for loop. Directory trees, however, can also be accessed using a recursive solution.
The recursive technique, while somewhat tricky to understand, can enable us to solve certain kinds
of problems with concise and elegant code instead of the spaghetti-like mess that a loops-only
solution would force on us.
4 The sys module has a function getrecursionlimit that tells you what that limit is. See
https://docs.python.org/2/library/sys.html (accessed November 14, 2016).
Chapter 9
Chapter Contents
9.1 What is object-oriented programming
9.2 Case Study: Managing a bookstore’s inventory
9.3 Python Sidebar: The NoneType data type
9.4 Summary
With what we’ve learned so far—using Python to make calculations, plot graphs, access data
in files, manage files and file processes—we can write powerful Python functions and programs to
provide the kinds of data analysis, business intelligence, and information system management that
can contribute towards all aspects of a business. However, for highly complex tasks, our current
toolkit, while very powerful, could use a little boost. It’s like building a house with a hammer and
hand saw. You can do it, and human beings did exactly that for thousands of years, but it really
makes a difference if you have a nail gun and power saw with you.
In this chapter, we pick up these power tools by learning about object-oriented programming
(OOP) and how to use it to create more complex programs. In Chapter 5, we learned about objects
and how to use them. Here, we’ll learn how to write our own objects in Python and, more
importantly, see how to use these objects to manage more involved information system-related tasks.
9.2. CASE STUDY: MANAGING A BOOKSTORE’S INVENTORY
at OOP in action; from that description, hopefully some sense of how OOP makes programs more
organized, readable, maintainable, and usable will come through. Through these examples, I hope
to describe both how to write object-oriented programs as well as why object-oriented programs
work the way they do.
• Title
• Author
• Publisher
• Year
• Price
• Number of pages
• Linear dimensions
• Weight
and some methods that might act on those and related attributes might include:
How about for a magazine? Attributes might include information similar to a book but will
probably also include:
• Volume number
• Issue
• Publication date
In terms of methods, besides those associated with a book, we might also have methods that:
With that as background, let’s write our class definitions. Here is a definition of the Book class
that holds information about a book. The class definition provides a single method (besides the
initialization method) that returns a formatted summary description for Book objects. In addition,
the code below creates two instances of the class (note line continuations are added to fit the
code on the page):
 1 class Book(object):
 2     def __init__(self, authorlast, authorfirst, \
 3                  title, place, publisher, year):
 4         self.authorlast = authorlast
 5         self.authorfirst = authorfirst
 6         self.title = title
 7         self.place = place
 8         self.publisher = publisher
 9         self.year = year
10
11     def summary_descrip(self):
12         return self.authorlast \
13             + ', ' + self.authorfirst \
14             + ', ' + self.title \
15             + ', ' + self.place \
16             + ': ' + self.publisher \
17             + ', ' + self.year + '.'
18
What does each line do? Line 1 begins the class definition. Class definitions start with a class
statement. The block following the class line is the class definition.
The argument in the class statement is a special class called object. This has to do with the
OOP idea of inheritance, which is a topic beyond the scope of this book. Suffice it to say that classes
you create can inherit or incorporate attributes and methods from other classes. Base classes (classes
that do not depend on other classes) inherit from object, which is a built-in class in Python that
provides the foundational tools for all other classes.
Notice how attributes and methods are defined, set, and used in the class definition. Within the
class definition, you refer to the instance of the class as self. So, for example, the instance attribute
year is called self.year in the class definition. If I decided to make use of the summary_descrip
method elsewhere in the class, I’d refer to it as self.summary_descrip in the class definition (and
it would be called by self.summary_descrip()).
When you actually create an instance, the instance name is the name of the object (e.g., beauty,
pynut), so the instance attribute title of the instance beauty is referred to as beauty.title,
and every instance attribute is separate from every other instance attribute (e.g., beauty.title
and pynut.title are separate variables, not aliases for one another).
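A small sketch may make the two naming conventions concrete. It assumes a reduced, hypothetical version of the Book class with only two attributes, placeholder titles, and an extra method shelf_label that is not part of the original class; it exists only to show one method reaching another through self.

```python
class Book(object):
    # Reduced, hypothetical version of the Book class.
    def __init__(self, title, year):
        self.title = title
        self.year = year

    def summary_descrip(self):
        return self.title + ', ' + self.year + '.'

    def shelf_label(self):
        # Inside the class definition, another method is reached via self.
        return 'SHELF: ' + self.summary_descrip()


# Placeholder values, not the book's actual data.
beauty = Book('Placeholder Title One', '1999')
pynut = Book('Placeholder Title Two', '2003')

# Outside the class, attributes are reached through the instance name.
beauty.title = 'Retitled'
print(beauty.title)          # Retitled
print(pynut.title)           # Placeholder Title Two (unaffected)
print(beauty.shelf_label())  # SHELF: Retitled, 1999.
```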
Attributes are created and set by assignment. If you want to create and set the title attribute,
type in:
self.title =
and then put what you want it to equal on the right-hand side of the equal sign. Methods are
defined using the def statement. The first parameter in any method definition is self; this syntax
is how Python tells a method “make use of all the previously defined attributes and methods in this
instance.” However, you never type self in the parameter list when you call the method. So, the
summary_descrip method definition in lines 11–17 has self in the parameter list but if we call
that method, the parameter list will be empty.
Usually, the first method you define will be the __init__ method. This method is called whenever
you create an instance of the class, so it is where you usually put code that handles the
arguments passed in when you create (or instantiate) an instance of a class and that conducts any
kind of initialization for the object instance. The argument list of __init__ is the list of
arguments passed in to the constructor of the class, which is called when you use the class name
with calling syntax.
Thus, in lines 4–9, I assign each of the positional input parameters in the def __init__ line
to an instance attribute of the same name. (The attributes and input parameters don’t have to be
of the same name, but it’s convenient here to do so.) Once assigned, these attributes can be used
anywhere in the class definition by reference to self, as in the definition of the summary_descrip
method.
In lines 19–22, I create an instance beauty of the Book class. Note how the arguments that are
passed in are the same arguments as in the def __init__ argument list. In the last four lines, I
create another instance of the Book class, the object pynut. Both beauty and pynut are instances
or specific realizations of the class (or template) Book.
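The instance-creation lines themselves do not appear in the listing as reproduced here, so the sketch below shows what such calls can look like. The instance names follow the text, but every argument value is a placeholder chosen for illustration, not the original data. The last line relies on standard Python behavior: calling a method on an instance is equivalent to calling it through the class with the instance passed as self.

```python
class Book(object):
    def __init__(self, authorlast, authorfirst,
                 title, place, publisher, year):
        self.authorlast = authorlast
        self.authorfirst = authorfirst
        self.title = title
        self.place = place
        self.publisher = publisher
        self.year = year

    def summary_descrip(self):
        return (self.authorlast
                + ', ' + self.authorfirst
                + ', ' + self.title
                + ', ' + self.place
                + ': ' + self.publisher
                + ', ' + self.year + '.')


# The arguments match the __init__ parameter list, minus self.
# All values below are placeholders.
beauty = Book("Lastname", "Firstname", "An Example Title",
              "Example City", "Example Press", "2001")
pynut = Book("Othername", "Given", "Another Example Title",
             "Sample Town", "Sample Media", "2003")

# self is never typed in the call; the parameter list is empty.
print(beauty.summary_descrip())

# Equivalent spelling that makes the role of self visible.
print(Book.summary_descrip(pynut))
```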
Now that we’ve seen an example of defining a class, let’s look at an example of using instances
of the Book class:
1. How would you print out the author attribute of the pynut instance (at the interpreter, after
running the file)?
3. How would you change the publication year for the beauty book to "2010"?
2. You will print out the bibliography-formatted version of the information in beauty.
3. Type: beauty.year = "2010". Remember that you can change instance attributes of classes
you have designed just like you can change instance attributes of any class; just use assign-
ment.
Note that in Python, you don’t generally have to write getters and setters (special methods to
return and set the values of attributes) like you do in languages like Java, where privacy is typically
strongly controlled.
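One consequence of setting attributes by plain assignment is that nothing checks the value you assign. The sketch below assumes a reduced, hypothetical Book class with placeholder values. It shows why the year is given as the string "2010": summary_descrip builds its result with string concatenation, so assigning the integer 2010 makes the method fail.

```python
class Book(object):
    # Reduced, hypothetical version of the Book class.
    def __init__(self, title, year):
        self.title = title
        self.year = year

    def summary_descrip(self):
        return self.title + ', ' + self.year + '.'


beauty = Book('An Example Title', '2001')  # placeholder values

# Direct assignment; no getter or setter method is needed.
beauty.year = "2010"
print(beauty.summary_descrip())  # An Example Title, 2010.

# Assigning an integer is allowed, but the string concatenation in
# summary_descrip then raises a TypeError.
beauty.year = 2010
try:
    beauty.summary_descrip()
except TypeError as err:
    print("TypeError:", err)
```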
class Magazine(object):
    def __init__(self, journaltitle, \
                 volume, monthday, year):
        self.journaltitle = journaltitle
        self.volume = volume
        self.monthday = monthday
        self.year = year

    def summary_descrip(self):
        return self.journaltitle \
            + ', ' + self.volume \
            + ', ' + self.monthday \
            + ', ' + self.year + '.'
This code is similar to that for the Book class, with these exceptions: some attributes differ between
the two classes (books, for instance, do not have months and days) and the method summary_descrip
is different between the two classes (to accommodate the different formatting between book and
magazine entries).
So, using OOP, we now have objects that store elements of our bookstore’s inventory. That’s
handy, in and of itself. While lists of numbers lend themselves nicely to arrays, most objects in the
real world are described by a motley assortment of characteristics. Python objects allow us to store
all these characteristics and operate on them in a clear and tidy way.
But let’s not just let the objects sit there; let’s do something with these objects. Since we have
a method to return a summary description of each object, let’s write some code to print out that
summary description. In the code below, I first define some Book objects and Magazine objects,
then print out summary descriptions of the objects (I didn’t duplicate the Book and Magazine
definitions):
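As a minimal sketch of this idea (not the original listing, so any line references in the surrounding text will not line up with it), the code below repeats compact versions of the two classes so that it runs on its own, creates one object of each kind with placeholder values, and prints every summary description with a single loop.

```python
class Book(object):
    def __init__(self, authorlast, authorfirst,
                 title, place, publisher, year):
        self.authorlast = authorlast
        self.authorfirst = authorfirst
        self.title = title
        self.place = place
        self.publisher = publisher
        self.year = year

    def summary_descrip(self):
        return (self.authorlast + ', ' + self.authorfirst
                + ', ' + self.title + ', ' + self.place
                + ': ' + self.publisher + ', ' + self.year + '.')


class Magazine(object):
    def __init__(self, journaltitle, volume, monthday, year):
        self.journaltitle = journaltitle
        self.volume = volume
        self.monthday = monthday
        self.year = year

    def summary_descrip(self):
        return (self.journaltitle + ', ' + self.volume
                + ', ' + self.monthday + ', ' + self.year + '.')


# Placeholder values, not the book's actual data.
pynut = Book("Lastname", "Firstname", "An Example Title",
             "Example City", "Example Press", "2001")
good = Magazine("Example Monthly", "12", "March 1", "2002")

# Each object supplies its own formatting, so the loop needs no if test.
inventory = [pynut, good]
for item in inventory:
    print(item.summary_descrip())
```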
What are we to make of this bit of code? What does using objects buy us? To answer this, let’s
ask how we would have written a program without objects that did the same task of writing out the
summary descriptions of the items in our bookstore inventory, as we do in lines 14–16. Let’s
say the information for each book or magazine, instead of being stored in an object, was stored in
a list. For instance, pynut and good might be given as:
But in order to properly write out the summary description for each item, we need to know what
kind of item it is. Then, when we loop through all the items to write out the summary descriptions,
we can use an if test to see what kind of item it is and choose the correct format to write out the
descriptions. Thus, our code might look like this:
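A hedged sketch of such a list-based version follows. The list layout is an assumption made for illustration: the first element is a tag naming the kind of item, the remaining elements are placeholder values, and the function name summary_descrip is reused only for readability.

```python
# Hypothetical layout: [kind, field, field, ...] with placeholder values.
pynut = ['book', 'Lastname', 'Firstname', 'An Example Title',
         'Example City', 'Example Press', '2001']
good = ['magazine', 'Example Monthly', '12', 'March 1', '2002']


def summary_descrip(item):
    # The if test must know the format for every kind of item.
    if item[0] == 'book':
        return (item[1] + ', ' + item[2] + ', ' + item[3]
                + ', ' + item[4] + ': ' + item[5]
                + ', ' + item[6] + '.')
    elif item[0] == 'magazine':
        return (item[1] + ', ' + item[2] + ', ' + item[3]
                + ', ' + item[4] + '.')
    else:
        raise ValueError('unknown kind of item: ' + str(item[0]))


for item in [pynut, good]:
    print(summary_descrip(item))
```

Adding a new kind of inventory item in this style means editing the if test (and any similar test elsewhere in the program), and the meaning of each list position has to be remembered rather than read from an attribute name.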
9.3. PYTHON SIDEBAR: THE NONETYPE DATA TYPE
a = None
print(a is None)
print(a == 4)
Solution and discussion: The first print statement will print True while the second print
statement will print False.
The is operator compares “equality” not in the sense of value (like == does) but in the sense
of memory location, that is, whether two names refer to the same object. Although you can type in
“a == None”, the better syntax for comparing to None is “a is None”.1 The a == 4 test is false
because the number 4 is not equal to None.
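The difference between the two kinds of comparison is easiest to see with two lists that hold the same values; the variable names here are only illustrative.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a second list with the same contents
c = a           # another name for the very same list

print(a == b)   # True: the values are equal
print(a is b)   # False: two separate objects in memory
print(a is c)   # True: both names refer to one object

x = None
print(x is None)  # True: there is only one None object
```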
So what is the use of a variable of NoneType? I use it to “safely” initialize a keyword input
parameter or attribute. That is to say, I initialize a variable to None, and if later on my program
tries to do an operation (arithmetic, for instance) with the variable before the variable has been
reassigned to a non-NoneType value, Python will give an error. This is a simple way to make sure I
did not forget to set the variable to a real value. Remember variables are dynamically typed, so
replacing a NoneType variable with some other value later on is no problem!
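Here is a minimal sketch of that safeguard. The function total_price_cents and its parameters are hypothetical; the point is only that arithmetic on a value still set to None raises a TypeError instead of silently producing a wrong number.

```python
def total_price_cents(price_cents, quantity=None):
    # quantity defaults to None, so forgetting to supply it shows up
    # as an error at the multiplication below.
    return price_cents * quantity


try:
    total_price_cents(999)
except TypeError as err:
    print("TypeError:", err)

# Once a real value is supplied, the calculation goes through.
print(total_price_cents(999, quantity=3))  # 2997
```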
9.4 Summary
You could, I think, fairly summarize this chapter as addressing one big question: Why should an
MIS student bother with object-oriented programming? In short, code written using OOP is less
prone to error. OOP enables you to mostly eliminate lengthy argument lists, and it is much more
difficult for a function to accidentally process data it should not process. Additionally, OOP deals
with long series of conditional tests much more compactly; there is no need to duplicate if tests in
multiple places. Finally, objects enable you to test smaller pieces of your program (e.g., individual
attributes and methods), which makes your tests more productive and effective.
Second, programs written using OOP are more easily extended. New cases are easily added by
creating new classes that have the interface methods defined for them. Additional functionality is
also easily added by just adding new methods/attributes. Finally, any changes to class definitions
automatically propagate to all instances of the class.
For short, quick-and-dirty programs, procedural programming is still the better option; there
is no reason to spend the time coding the additional OOP infrastructure. But for more involved
business applications, things can very quickly become complex. As soon as that happens, the
object decomposition can really help. Here’s the rule-of-thumb I use: For a one-off, short program,
I write it procedurally, but for any program I may extend someday (even if it is a tentative “may”),
I write it using objects.
1 The reason is a little esoteric; see the web page
http://jaredgrubb.blogspot.com/2009/04/python-is-none-vs-none.html if you’re interested in the
details (accessed August 16, 2012).
Glossary
absolute path the path to a directory or file starting from the root directory (on Linux) or the drive
letter.
argument an item passed into a function as input; there is a subtle distinction from a parameter,
but the two have similar meanings.
attribute data bound to an object that are designed to be acted on by methods also bound to that
object; sometimes attributes are called “instance variables”.
current working directory the directory you are currently in and that Python will base all relative
file and directory references from.
data coordinates a coordinate system for a plot where locations are specified by the values of the
x- and y-axes data ranges.
docstring a triple-quote delimited string that goes right after the def statement (or similar con-
struct) and which provides a “help”-like description of the function.
dynamically typed variables take on the type of whatever value they are set to when they are
assigned.
exception an error state in the program that cannot be processed by the current scope.
inherit incorporate the attribute and method definitions of another class into a definition of a new
class of objects.
inheritance dealing with inheriting attribute and method definitions of another class into a defini-
tion of a new class of objects.
iterable a data structure that one can go through, one element at a time; in such a structure, after
you’ve looked at one element of it, it will move you on to the next element.
keyword input parameter a parameter set by reference to a name or keyword rather than by po-
sition in a list.
method functions bound to an object that are designed to act on the data also bound to that object.
module an importable Python source code file that typically contains function, class, and variable
object definitions.
newline character a special text code that specifies a new line; the specific code is operating
system dependent.
object a “variable” that has attached to it both data (attributes) and functions designed to act on
that data (methods).
package a directory of importable Python source code files (and, potentially, subpackages) that
typically contains function, class, and variable object definitions.
path the listing of what directories you need to go through to reach the file or directory at the end
of the path.
recursive related to the idea that we can define some tasks in terms of themselves; this is expressed
in terms of a function partially being defined by calls to itself.
relative path the path to a directory or file starting from the current directory.
shape a tuple whose elements are the number of elements in each dimension of an array; in Python,
the elements are arranged so the fastest varying dimension is the last element in the tuple and
the slowest varying dimension is the first element in the tuple.
shell a text-based interface to an operating system, the software that manages your files,
directories, etc.
terminal window a text window in which you can directly type in operating system and other
commands.
typecode a single character string that specifies the type of the elements of a NumPy array.
Acronyms
Bibliography
Lin, J. W.-B. (2012). A Hands-On Introduction to Using Python in the Atmospheric and Oceanic
Sciences. Chicago, IL.