Python for Data Science
Introduction to Notebooks
All animals are equal, but some animals are more equal than others.
George Orwell
In This Chapter
Running Python statements
Introduction to Jupyter notebooks
Introduction to Google Colab hosted notebooks
Text and code cells
Uploading files to the Colab environment
Using a system alias to run shell commands
Magic functions
Note
For code in this book, we use bold text for user input (the code you
would type), and non-bold text for any output that results.
You can then type Python statements and run them by pressing Enter:
print("Hello")
Hello
As shown here, you see the result of each statement displayed directly after
the statement’s line.
When Python commands are stored in a text file with the extension .py, you
can run them on the command line by typing python followed by the
filename. If you have a file named hello.py, for example, and it contains the
statement print("Hello"), you can invoke this file on the command line as
follows and see its output displayed on the next line:
python hello.py
Hello
For traditional Python software projects, the interactive shell was adequate
as a place to figure out syntax or do simple experiments. The file-based
code was where the real development took place and where software was
written. These files could be distributed to whatever environment needed to
run the code. For scientific computing, neither of these solutions was ideal.
Scientists wanted to have interactive engagement with data while still being
able to persist and share in a document-based format. Notebook-based
development emerged to fill the gap.
Jupyter Notebooks
The IPython project is a more feature-rich version of the Python interactive
shell. The Jupyter project sprang from the IPython project. Jupyter
notebooks combine the interactive nature of the Python shell with the
persistence of a document-based format. A notebook is a
document that combines executable code with formatted text. A notebook is
composed of cells, which contain code or text. When a code cell is
executed, any output is displayed directly below the cell. Any state changes
performed by a code cell are shared by any cells executed subsequently.
This means you can build up your code cell by cell, without having to rerun
the whole document when you make a change. This is especially useful
when you are exploring and experimenting with data.
Jupyter notebooks have been widely adopted for data science work. You
can run these notebooks locally from your machine or from hosted services
such as those provided by AWS, Kaggle, Databricks, or Google.
Google Colab
Colab (short for Colaboratory) is Google’s hosted notebook service. Using
Colab is a great way to get started with Python, as you don’t need to install
anything or deal with library dependencies or environment management.
This book uses Colab notebooks for all of its examples. To use Colab, you
must be signed in to a Google account and go to
https://colab.research.google.com (see Figure 1.1). From here you can
create new notebooks or open existing notebooks. The existing notebooks
can include examples supplied by Google, notebooks you have previously
created, or notebooks you have copied to your Google Drive.
Figure 1.1 The Initial Google Colab Dialog
When you choose to create a new notebook, it opens in a new browser tab.
The first notebook you create has the default title Untitled0.ipynb. To
change its name, double-click on the title and type a new name (see Figure
1.2).
Colab automatically saves your notebooks to your Google Drive, which you
can access by going to drive.google.com. The default location is a
directory named Colab Notebooks (see Figure 1.3).
Figure 1.3 The Colab Notebooks Folder at Google Drive
As shown in Figure 1.6, you can create a numbered list by prefacing items
with numbers, and you can create a bulleted list by prefacing items with
stars.
Figure 1.6 Creating Lists in a Google Colab Notebook
As shown in Figure 1.7, you can create headings by preceding text with
hash signs. A single hash sign creates a top-level heading, two hash signs
create a second-level heading, and so forth.
A heading that is at the top of a cell determines the cell’s hierarchy in the
document. You can view this hierarchy by opening the table of contents,
which you do by clicking the Menu button at the top left of the notebook
interface, as shown in Figure 1.8.
You can use the table of contents to navigate the document by clicking on
the displayed headings. A heading cell that has child cells has a triangle
next to the heading text. You can click this triangle to hide or view the child
cells (see Figure 1.9).
Figure 1.9 Hiding Cells in a Google Colab Notebook
LaTeX
The LaTeX language (see https://www.latex-project.org/about/), which is
designed for preparing technical documents, excels at presenting
mathematical text. LaTeX uses a code-based approach that is designed to
allow you to concentrate on content rather than layout. You can insert
LaTeX code into Colab notebook text cells by surrounding it with dollar
signs. Figure 1.10 shows an example from the LaTeX documentation
embedded in a Colab notebook text cell.
Figure 1.10 LaTeX Embedded in a Google Colab Notebook
Subsequent chapters of this book use only code cells for Colab notebooks.
Colab Files
To see the files and folders available in Colab, click the Files button on the
left of the interface (see Figure 1.11). By default, you have access to the
sample_data folder supplied by Google.
You can also click the Upload button to upload files to the session (see
Figure 1.12).
Figure 1.12 Uploading Files in Google Colab
Files that you upload are available only in the current session of your
document. If you come back to the same document later, you need to upload
them again. All files available in Colab have the path root /content/, so if
you upload a file named heights.over.time.csv, its path is
/content/heights.over.time.csv.
You can mount your Google Drive by clicking the Mount Drive button (see
Figure 1.13). The contents of your drive have the root path /content/drive.
Figure 1.13 Mounting Your Google Drive
System Aliases
You can run a shell command from within a Colab notebook code cell by
prepending the command with an exclamation point. For example, the
following command prints the working directory:
!pwd
/content
You can capture any output from a shell command in a Python variable, as
shown here, and use it in subsequent code:
var = !ls sample_data
print(var)
Note
Don't worry about variables yet. You will learn about them in
Chapter 2, “Fundamentals of Python.”
Magic Functions
Magic functions are functions that change the way a code cell is run. For
example, you can time a Python statement by using the magic function
%timeit, which you place at the start of the line, before the statement.
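As a rough illustration of timing a statement: in a notebook cell the magic would be written on one line, for example %timeit sum(range(1000)). Magics run only inside IPython or a notebook, so the sketch below uses the standard library timeit module as a plain-Python stand-in. The statement being timed and the repetition count are arbitrary choices, not taken from the original example.

```python
import timeit

# Time a small statement, roughly what `%timeit sum(range(1000))` reports in a notebook.
# The statement and the number of repetitions are arbitrary choices for illustration.
elapsed = timeit.timeit("sum(range(1000))", number=1000)
print(f"1000 runs took {elapsed:.4f} seconds")
```

The reported time varies from machine to machine and from run to run.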
As another example, you can have a cell run HTML code by using the
magic function %%html:
%%html
<marquee style='width: 30%; color: blue;'><b>Whee!</b></marquee>
Note
You can find more information about magic functions in the Cell Magics
example notebook, which is part of the Jupyter documentation, at
https://nbviewer.jupyter.org/github/ipython/ipython/blob/1.x/examples/notebooks/Cell%20Magics.ipynb.
Summary
Jupyter notebooks are documents that combine formatted text with
executable code. They have become a very popular format for scientific
work, and many examples are available around the web. Google Colab
offers hosted notebooks and includes many popular libraries used in data
science. A notebook is made up of text cells, which are formatted in
Markdown, and code cells, which can execute Python code. The following
chapters present many examples of Colab notebooks.
Questions
1. What kind of notebooks are hosted in Google Colab?
2. What cell types are available in Google Colab?
3. How do you mount your Google Drive in Colab?
4. What language runs in Google Colab code cells?
2
Fundamentals of Python
In This Chapter
Python built-in types
Introduction to statements
Expression statements
Assert statements
Assignment statements and variables
Import statements
Printing
Basic math operations
Dot notation
This chapter looks at some of the building blocks you can use to create a
Python program. It introduces the basic built-in data types, such as integers
and strings. It also introduces various simple statements you can use to
direct your computer’s actions. This chapter covers statements that assign
values to variables and statements to ensure that code evaluates as expected.
It also discusses how to import modules to extend the functionality
available to you in your code. By the end of this chapter, you will have
enough knowledge to write a program that performs simple math operations
on stored values.
At the simplest, integers (or ints) are represented in code as ordinary digits.
Floating point numbers, referred to as floats, are represented as a group of
digits including a dot separator. You can use the type function to see the
type of an integer and a float:
type(13)
int
type(4.1)
float
If you want a number to be a float, you must ensure that it has a dot and a
number to the right, even if that number is zero:
type(1.0)
float
Booleans are represented by the two constants, True and False, both of
which evaluate to the type bool, which, behind the scenes, is a specialized
form of int:
type(True)
bool
type(False)
bool
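A minimal sketch of what it means for bool to be a specialized form of int: True and False behave like 1 and 0 in arithmetic. The variable names and the sample list are made up for illustration.

```python
# bool is a subclass of int, so True and False act like 1 and 0 in arithmetic.
is_subclass = issubclass(bool, int)
total = True + True

# Summing a list of booleans counts the True values.
flags = [True, False, True, True]   # hypothetical sample data
true_count = sum(flags)

print(is_subclass, total, true_count)
```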
Note
You will learn much more about strings and binary strings in Chapter
4.
A special type, NoneType, has only one value, None. It is used to represent
something that has no value:
type(None)
NoneType
Statements
A Python program is constructed of statements. Each statement can be
thought of as an action that the computer should perform. If you think of a
software program as being akin to a recipe from a cookbook, a statement is
a single instruction, such as “beat the egg yolks until they turn white” or
“bake for 15 minutes.”
At the simplest, a Python statement is a single line of code with the end of
the line signifying the end of the statement. A simple statement could, for
example, call a single function, as in this expression statement:
print("hello")
Python allows for both simple and complex statements. Simple Python
statements include expression, assert, assignment, pass, delete, return, yield,
raise, break, continue, import, future, global, and nonlocal statements. This
chapter covers some of these simple statements, and later chapters cover
most of the rest of them. Chapter 5, “Execution Control,” and Chapter 6,
“Functions,” cover complex statements.
Multiple Statements
While using a single statement is enough to define a program, most useful
programs consist of multiple statements. The results of one statement can
be used by the statements that follow, building functionality by combining
actions. For example, you can use the following statements to assign a
variable the result of an integer division, use that result to calculate a value
for another variable, and then use both variables in a third statement as
inputs to a print statement:
x = 23//3
y = x**2
print(f"x is {x}, y is {y}")
x is 7, y is 49
Expression Statements
A Python expression is a piece of code that evaluates to a value (or to None).
This value could be, among other things, a mathematical expression or a
call to a function or method. An expression statement is simply a statement
that just has an expression but does not capture its output for further use.
Expression statements are generally useful only in interactive environments,
such as an IPython shell. In such an environment, the result of an
expression is displayed to the user after it is run. This means that if you are
in a shell and you want to know what a function returns or what 12344
divided by 12 is, you can see the output without coding a means to display
it. You can also use an expression statement to see the value of a variable
(as shown in the following example) or just to echo the display value of any
type. Here are some simple expression statements and the output of each
one:
23 * 42
966
"Hello"
'Hello'
import os
os.getcwd()
'/content'
Assert Statements
An assert statement takes an expression as an argument and ensures that the
result evaluates to True. Expressions that return False, None, zero, empty
containers, and empty strings evaluate to False; all other values evaluate to
True. (Containers are discussed in Chapter 3, “Sequences,” and Chapter 4,
“Other Data Structures.”) An assert statement throws an error if the
expression evaluates to False, as shown in this example:
assert(False)
-----------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-5-8808c4021c9c> in <module>()
----> 1 assert(False)
Otherwise, the assert statement evaluates the expression and continues on to the
next statement, as shown in this example:
assert(True)
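The truthiness rules described above can be checked directly with the built-in bool function. This is a small sketch; the sample values and the temperature variable are chosen only for illustration. It also shows the optional message that an assert statement can carry after a comma.

```python
# Values of the kinds the text lists as evaluating to False
falsy_values = [False, None, 0, 0.0, "", [], (), {}]
# A few values that evaluate to True
truthy_values = [True, 1, -1, "a", [0], (None,)]

all_falsy = all(not bool(value) for value in falsy_values)
all_truthy = all(bool(value) for value in truthy_values)
print(all_falsy, all_truthy)

# An assert can carry a message that is shown if the check fails.
temperature = 21  # hypothetical value
assert temperature > 0, "temperature should be positive"
```

Writing assert without parentheses, as in the last line, is the more common style, because assert is a statement rather than a function.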
You can use assert statements when debugging to ensure that some
condition you assume to be true is indeed the case. These statements do
have an impact on performance, though, so if you are using them
generously when you develop, you might want to disable them when
running your code in a production environment. If you are running your
code from the command line, you can add the -O (optimize) flag to disable
them:
python -O my_script.py
Assignment Statements
A variable is a name that points to some piece of data. It is important to
understand that, in an assignment statement, the variable points to the data
and is not the data itself. The same variable can be pointed to different
items—even items that are of different types. In addition, you can change
the data at which a variable points without changing the variable. As in the
earlier examples in this chapter, a variable is assigned a value using the
assignment operator (a single equals sign). The variable name appears to
the left of the operator, and the value appears to the right. The following
example shows how to assign the value 12 to the variable x and the text
'Hello' to the variable y:
x = 12
y = 'Hello'
Once the variables are assigned values, you can use the variable names in
place of the values. So, you can perform math by using the x variable or use
the y variable to construct a larger piece of text, as shown in this example:
answer = x - 3
print(f"{y} Jeff, the answer is {answer}")
Hello Jeff, the answer is 9
You can see that the values for x and y are used where the variables have
been inserted. You can assign multiple values to multiple variables in a
single statement by separating the variable names and values with commas:
x, y, z = 1,'a',3.0
Here x is assigned the value 1, y the value 'a', and z the value 3.0.
It is a best practice to give your variables meaningful names that help
explain their use. Using x for a value on the x-axis of a graph is fine, for
example, but using x to hold the value for a client’s first name is confusing;
first_name would be a much clearer variable name for a client’s first name.
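To illustrate the earlier point that a variable is a name pointing to data rather than the data itself, here is a small sketch. The names value, scores, and same_scores are made up for this example, and it uses a list, which Chapter 3 introduces.

```python
value = 12              # value points to an int
first_type = type(value)
value = 'Hello'         # the same name now points to a str
second_type = type(value)

# Two names can point to the same piece of data.
scores = [1, 2, 3]
same_scores = scores
same_scores.append(4)   # changes the data; neither variable is reassigned

print(first_type, second_type, scores)
```

Because scores and same_scores point to the same list, the appended item is visible through both names.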
Pass Statements
Pass statements are placeholders. They perform no action themselves, but
when there is code that requires a statement to be syntactically correct, a
pass statement can be used. A pass statement consists of the keyword pass
and nothing else. Pass statements are generally used for stubbing out
functions and classes when laying out code design (that is, putting in the
names without functionality). You’ll learn more about functions in Chapter
6, “Functions,” and classes in Chapter 14.
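A minimal sketch of stubbing out code with pass, assuming a hypothetical function load_data and class Report that are to be filled in later:

```python
def load_data():
    pass  # placeholder: the body will be written later


class Report:
    pass  # placeholder: attributes and methods will be added later


# A function whose body is only pass does nothing and returns None.
result = load_data()
print(result)
```

Without the pass statements, the empty function and class bodies would be syntax errors.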
Delete Statements
A delete statement deletes something from the running program. It consists
of the del keyword followed by the item to be deleted (parentheses around
the item are optional). Once
the item is deleted, it cannot be referenced again unless it is redefined. The
following example shows a value being assigned to a variable and then
deleted:
polly = 'parrot'
del(polly)
print(polly)
-----------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-6-c0525896ade9> in <module>()
1 polly = 'parrot'
2 del(polly)
----> 3 print(polly)
Note
Python has its own garbage collection system, and generally you
don’t need to delete objects to free up memory, but there may be
times when you want to remove them anyway.
Return Statements
A return statement defines the return value of a function. You will see how
to write functions, including using return statements, in Chapter 6.
Yield Statements
Yield statements are used in writing generator functions, which provide a
powerful way to optimize for performance and memory usage. We cover
generators in Chapter 13, “Functional Programming.”
Raise Statements
Some of the examples so far in this chapter have demonstrated code that
causes errors. Such errors that occur during the running of a program (as
opposed to errors in syntax that prevent a program from running at all) are
called exceptions. Exceptions interrupt the normal execution of a program,
and, unless they are handled, cause the program to exit. Raise statements are
used both to re-invoke an exception that has been caught and to raise either
a built-in exception or an exception that you have designed specifically for
your program. Python has many built-in exceptions, covering many
different use cases (see
https://docs.python.org/3/library/exceptions.html#bltin-exceptions). If you
want to invoke one of these built-in exceptions, you can use a raise
statement, which consists of the raise keyword followed by the exception.
For example, NotImplementedError is an error used in class hierarchies to
indicate that a child class should implement a method (see Chapter 14). The
following example raises this error with a raise statement:
raise NotImplementedError
-----------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-1-91639a24e592> in <module>()
----> 1 raise NotImplementedError
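The two uses of raise mentioned above, raising an exception designed for your own program and re-raising one that has been caught, might look like the following sketch. LowBatteryError, check_battery, and recheck are hypothetical names invented for this illustration, and the try and except keywords used to catch the exception are an assumption here, since this chapter has not introduced them.

```python
class LowBatteryError(Exception):
    """Hypothetical custom exception used only for this illustration."""


def check_battery(level):
    if level < 10:
        # Raise the custom exception with a descriptive message.
        raise LowBatteryError(f"battery at {level}%")
    return level


def recheck(level):
    try:
        return check_battery(level)
    except LowBatteryError:
        print("noting the problem, then re-raising")
        raise  # a bare raise re-invokes the exception that was caught


try:
    check_battery(5)
except LowBatteryError as error:
    message = str(error)
    print(f"caught: {message}")
```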
Break Statements
You use a break statement to end a loop before its normal looping condition
is met. Looping and break statements are covered in Chapter 5.
Continue Statements
You use a continue statement to skip a single iteration of a loop. These
statements are also covered in Chapter 5.
Import Statements
One of the most powerful features of writing software is the ability to reuse
pieces of code in different contexts. Python code can be saved in files (with
the .py extension); if these files are designed for reuse, they are referred to
as modules. When you run Python, whether in an interactive session or as a
standalone program, some features are available as core language features,
which means you can use them directly, without additional setup. When you
install Python, these core features are installed, and so is the Python
Standard Library. This library is a series of modules that you can bring into
your Python session to extend functionality. To have access to one of these
modules in your code, you use an import statement, which consists of the
keyword import and the name of the module to import. The following
example shows how to import the os module, which is used to interact with
the operating system:
import os
You can also give a module an alias during import. For example, it is a
common convention to import Pandas as pd:
import pandas as pd
You can then reference the module by using the alias rather than the module
name, as shown in this example:
pd.read_excel('/some_excel_file.xls')
You can also import specific parts of a module by using the from keyword
with import:
from os import path
path
<module 'posixpath' from '/usr/lib/python3.6/posixpath.py'>
This example imports the submodule path from the module os. You can
now use path in your program as if it were defined by your own code.
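As a brief sketch of using names imported with from, the following joins two path components and also imports a single function under an alias. The folder names and the alias square_root are arbitrary choices for illustration.

```python
from os import path                    # import one name from a module
from math import sqrt as square_root   # an alias works with from imports too

# path is now used directly, without the os. prefix.
joined = path.join("content", "sample_data")
root = square_root(49)

print(joined, root)
```

The separator that path.join inserts depends on the operating system, so the printed path can differ between machines.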
Future Statements
Future statements allow you to use certain language features that are
scheduled to become standard in a future release. This book does not cover
them, as they are rarely used in data science.
Global Statements
Scope in a program refers to the environment that shares definitions of
names and values. Earlier you saw that when you define a variable in an
assignment statement, that variable retains its name and value for future
statements. These statements are said to share scope. When you start
writing functions (in Chapter 6) and classes (in Chapter 14), you will
encounter scopes that are not shared. Using a global statement is a way to
share variables across scopes. (You will learn more about global statements
in Chapter 13.)
Nonlocal Statements
Using nonlocal statements is another way of sharing variables across scopes.
Whereas a global statement refers to a variable shared across a whole
module, a nonlocal statement refers to a variable in the nearest enclosing
function scope. Nonlocal statements are valuable only with nested scopes,
and you should not need them outside of very specialized situations, so this
book does not cover them.
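For readers curious about the distinction, here is a minimal sketch contrasting the two statements. It assumes familiarity with function definitions, which Chapter 6 covers, and all names in it are invented for illustration.

```python
counter = 0  # module-level (global) name


def increment_global():
    global counter   # assign to the module-level name, not a new local one
    counter += 1


def make_counter():
    count = 0

    def increment():
        nonlocal count   # assign to the name in the enclosing function
        count += 1
        return count

    return increment


increment_global()
next_count = make_counter()
first, second = next_count(), next_count()
print(counter, first, second)
```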
Print Statements
When you are working in an interactive environment such as the Python
shell, IPython, or, by extension, a Colab notebook, you can use expression
statements to see the value of any Python expression. (An expression is a
piece of code that evaluates to a value.) In some cases, you may need to
output text in other ways, such as when you run a program at the command
line or in a cloud function. The most basic way to display output in such
situations is to use a print statement. By default, the print function outputs
text to the standard-out stream. You can pass any of the built-in types or
most other objects as arguments to be printed. Consider these examples:
print(1)
1
print('a')
a
You can also pass multiple arguments, and they are printed on the same
line:
print(1,'b')
1 b
You can use an optional argument to define the separator used between
items when multiple arguments are provided:
print(1,'b',sep='->')
1->b
5 - 6
-1
3*4
12
9/3
3.0
2**3
8
We will look at more math operations in Part II, “Data Science Libraries.”
You access object methods in a similar way, but with parentheses following.
The following example uses the to_bytes() method of the same integer:
a_number.to_bytes(8, 'little')
b'\x02\x00\x00\x00\x00\x00\x00\x00'
Summary
Programming languages provide a means of translating human instructions
to computer instructions. Python uses different types of statements to give a
computer instructions, with each statement describing an action. You can
combine statements together to create software. The data on which actions
are taken is represented in Python by a variety of types, including both
built-in types and types defined by developers and third parties. These types
have their own characteristics, attributes, and, in many cases, methods that
can be accessed using the dot syntax.
Questions
1. With Python, what is the output of type(12)?
2. When using Python, what is the effect of using assert(True) on the
statements that follow it?
3. How would you use Python to invoke the exception LastParamError?
4. How would you use Python to print the string "Hello"?
5. How do you use Python to raise 2 to the power of 3?
3
Sequences
Errors using inadequate data are much less than those using no data at
all.
Charles Babbage
In This Chapter
Shared sequence operations
Lists and tuples
Strings and string methods
Ranges
Shared Operations
The sequences family shares quite a bit of functionality. Specifically, there
are ways of using sequences that are applicable to most of the group
members. There are operations that relate to sequences having a finite
length, for accessing the items in a sequence, and for creating a new
sequence based on a sequence’s content.
Testing Membership
You can test whether an item is a member of a sequence by using the in
operation. This operation returns True if the sequence contains an item that
evaluates as equal to the item in question, and it returns False otherwise.
The following are examples of using in with different sequence types:
'first' in ['first', 'second', 'third']
True
23 in (23,)
True
'b' in 'cat'
False
b'a' in b'ieojjza'
True
You can use the keyword not in conjunction with in to check whether
something is absent from a sequence:
'b' not in 'cat'
True
The two places you are most likely to use in and not in are in an interactive
session to explore data and as part of an if statement (see Chapter 5,
“Execution Control”).
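A small sketch of a membership test inside an if statement, using hypothetical file extensions as the data. It also shows that, for strings, in matches whole substrings and not only single characters.

```python
allowed_extensions = ('.csv', '.json')   # hypothetical values
extension = '.xls'

if extension not in allowed_extensions:
    status = 'rejected'
else:
    status = 'accepted'
print(status)

# For strings, in tests for a substring of any length.
has_substring = 'at' in 'cat'
print(has_substring)
```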
Indexing
Because a sequence is an ordered series of items, you can access an item in
a sequence by using its position, or index. Indexes start at zero and go up to
one less than the number of items. In an eight-item sequence, for example,
the first item has an index of zero, and the last item an index of seven.
To access an item by using its index, you use square brackets around the
index number. The following example defines a string and accesses its first
and fifth characters using their index numbers:
name = "Ignatius"
name[0]
'I'
name[4]
't'
You can also index counting back from the end of a sequence by using
negative index numbers:
name[-1]
's'
name[-2]
'u'
Slicing
You can use indexes to create new sequences that represent subsequences of
the original. In square brackets, supply the beginning and ending index
numbers of the subsequence separated by a colon, and a new sequence is
returned:
name = "Ignatius"
name[2:5]
'nat'
The subsequence that is returned contains items starting from the first index
and up to, but not including, the ending index. If you leave out the
beginning index, the subsequence starts at the beginning of the parent
sequence; if you leave out the end index, the subsequence goes to the end of
the sequence:
name[:5]
'Ignat'
name[4:]
'tius'
You can use negative index numbers to create slices counting from the end
of a sequence. This example shows how to grab the last three letters of a
string:
name[-3:]
'ius'
If you want a slice to skip items, you can provide a third argument that
indicates what to count by. So, if you have a list of integers, as
shown earlier, you can create a slice just by using the starting and ending
index numbers:
scores = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
scores[3:15]
[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
But you can also indicate the step to take, such as counting by threes:
scores[3:15:3]
[3, 6, 9, 12]
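The step can also be combined with omitted or negative values. The following sketch reuses the string from the earlier examples. The specific slices are chosen only for illustration.

```python
name = "Ignatius"

every_other = name[::2]      # start to end, counting by twos
reversed_name = name[::-1]   # a negative step walks backward through the sequence

numbers = list(range(10))
countdown = numbers[8:2:-2]  # from index 8 down to, but not including, index 2

print(every_other, reversed_name, countdown)
```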
You can use the min and max functions to find the minimum and maximum
items, respectively:
scores = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
min(scores)
0
max(name)
'u'
You can find out how many times an item appears in a sequence by using
the count method:
name.count('a')
1
You can get the index of an item in a sequence by using the index method:
name.index('s')
7
You can use the result of the index method to create a slice up to an item,
such as a letter in a string:
name[:name.index('u')]
'Ignati'
Math Operations
You can perform addition and multiplication with sequences of the same
type. When you do, you conduct these operations on the sequence, not on
its contents. So, for example, adding the list [1] to the list [2] will produce
the list [1,2], not [3]. Here is an example of using the plus (+) operator to
create a new string from three separate strings:
"prefix" + "-" + "postfix"
'prefix-postfix'
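The list example described above, together with multiplication, can be sketched as follows. The values are arbitrary, and the try and except keywords are used here only to show that mixing sequence types fails.

```python
combined = [1] + [2]     # a new list holding the items of both lists
repeated = [0, 1] * 3    # the list's contents repeated three times
rule = "-" * 10          # multiplication works for strings too
print(combined, repeated, rule)

# Adding sequences of different types is not allowed.
try:
    [1] + (2,)
except TypeError as error:
    mixed_error = str(error)
    print("TypeError:", mixed_error)
```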
You can create tuples by using the tuple constructor, tuple(), or using
parentheses. If you want to create a tuple with a single item, you must
follow that item with a comma, or Python will interpret the parentheses not
as indicating a tuple but as indicating a logical grouping. You can also
create a tuple without parentheses by just putting a comma after an item.
Listing 3.1 provides examples of tuple creation.
Listing 3.1 Creating Tuples
tup = (1,2)
tup
(1, 2)
tup = (1,)
tup
(1,)
tup = 1,2,
tup
(1, 2)
Warning
A common but subtle bug occurs when you leave a trailing comma
at the end of an assignment. It turns the value into a tuple
containing the original value. So the statement x = 2, assigns (2,)
to x, not 2. (A trailing comma after the last argument in a function
call, as in my_function(1, 2,), is harmless and does not create a tuple.)
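A quick check of where a trailing comma does and does not create a tuple. The function second_argument is a hypothetical helper written only for this illustration.

```python
value = 2,        # trailing comma in an assignment: value is the tuple (2,)
grouped = (2)     # parentheses alone only group: grouped is the int 2


def second_argument(first, second):
    return second


# A trailing comma after the last argument of a call is ignored.
result = second_argument(1, 2,)

print(value, grouped, result)
```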
You can also use the list or tuple constructors with a sequence as an
argument. The following example uses a string and creates a list of the
items the string contains:
name = "Ignatius"
letters = list(name)
letters
['I', 'g', 'n', 'a', 't', 'i', 'u', 's']
flavours.append('SuperFudgeNutPretzelTwist')
flavours
['Chocolate', 'Vanilla', 'SuperFudgeNutPretzelTwist']
flavours.insert(0,"sourMash")
flavours
['sourMash', 'Chocolate', 'Vanilla', 'SuperFudgeNutPretzelTwist']
To remove an item from a list, you use the pop method. With no argument,
this method removes the last item. By using an optional index argument,
you can specify a specific item. In either case, the item is removed from the
list and returned.
The following example pops the last item off the list and then pops off the
item at index 0. You can see that both items are returned when they are
popped and that they are then gone from the list:
flavours.pop()
'SuperFudgeNutPretzelTwist'
flavours.pop(0)
'sourMash'
flavours
['Chocolate', 'Vanilla']
To add the contents of one list to another, you use the extend method:
desserts = ['Cookies', 'Water Melon']
desserts
['Cookies', 'Water Melon']
desserts.extend(flavours)
desserts
['Cookies', 'Water Melon', 'Chocolate', 'Vanilla']
This method modifies the first list so that it now has the contents of the
second list appended to its contents.
This appears to have worked, until you modify one of the sublists:
lists[-1].append(4)
lists
[[4], [4], [4], [4]]
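The behavior shown here is what you would see if lists had been built by multiplying a list that contains one empty list. That construction is an assumption in this sketch. Multiplication repeats references to the same inner list rather than copying it. Building each sublist separately, here with a list comprehension, avoids the sharing.

```python
# Assumed construction: four references to ONE inner list.
lists = [[]] * 4
lists[-1].append(4)
print(lists)

# A list comprehension creates four separate inner lists.
independent = [[] for _ in range(4)]
independent[-1].append(4)
print(independent)

shares_sublist = lists[0] is lists[-1]
```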
Unpacking
You can assign values to multiple variables from a list or tuple in one line:
a, b, c = (1,3,4)
a
1
b
3
c
4
Or, if you want to assign multiple values to one variable while assigning
single ones to the others, you can use a * next to the variable that will take
multiple values. Then that variable will absorb all the items not assigned to
other variables:
*first, middle, last = ['horse', 'carrot', 'swan', 'burrito', 'f
first
['horse', 'carrot', 'swan']
last
'fly'
middle
'burrito'
Sorting Lists
For lists you can use built-in sort and reverse methods that can change the
order of the contents. Much like the sequence min and max functions, these
methods work only if the contents are comparable, as shown in these
examples:
name = "Ignatius"
letters = list(name)
letters
['I', 'g', 'n', 'a', 't', 'i', 'u', 's']
letters.sort()
letters
['I', 'a', 'g', 'i', 'n', 's', 't', 'u']
letters.reverse()
letters
['u', 't', 's', 'n', 'i', 'g', 'a', 'I']
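A short sketch of what happens when the contents are not comparable, alongside two related tools: the reverse argument of sort and the built-in sorted function, which returns a new list and leaves the original alone. The sample values are arbitrary.

```python
mixed = [3, 'a', 1.5]
try:
    mixed.sort()   # int and str cannot be ordered against each other
    sort_failed = False
except TypeError:
    sort_failed = True

numbers = [3, 1, 2]
ordered_copy = sorted(numbers)   # returns a new, sorted list
numbers.sort(reverse=True)       # sorts the list in place, largest first

print(sort_failed, ordered_copy, numbers)
```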
Strings
A string is a sequence of characters. In Python, strings are Unicode by
default, and any Unicode character can be part of a string. Strings are
represented as characters surrounded by quotation marks. Single or double
quotations both work, and strings made with them are equal:
'Here is a string'
'Here is a string'
a_very_large_phrase = """
Wikipedia is hosted by the Wikimedia Foundation,
a non-profit organization that also hosts a range of other proje
"""
With Python strings you can use special characters, each preceded by a
backslash. The special characters include \t for tab, \r for carriage return,
and \n for newline. These characters are interpreted with special meaning
during printing. While these characters are generally useful, they can be
inconvenient if you are representing a Windows path:
windows_path = "c:\row\the\boat\now"
print(windows_path)
ow heoat
ow
For such situations, you can use Python’s raw string type, which interprets
all characters literally. You signify the raw string type by prefixing the
string with an r:
windows_path = r"c:\row\the\boat\now"
print(windows_path)
c:\row\the\boat\now
captain.capitalize()
'Patrick tayluer'
captain.lower()
'patrick tayluer'
captain.upper()
'PATRICK TAYLUER'
captain.swapcase()
'pATRICK tAYLUER'
Python 3.6 introduced format strings, or f-strings. You can insert values into
f-strings at runtime by using replacement fields, which are delimited by
curly braces. You can insert any expression, including variables, into the
replacement field. An f-string is prefixed with either an F or an f, as shown
in this example:
strings_count = 5
frets_count = 24
f"Noam Pikelny's banjo has {strings_count} strings and {frets_count} frets"
"Noam Pikelny's banjo has 5 strings and 24 frets"
This example shows how to insert items from a list into the replacement
field:
players = ["Tony Trischka", "Bill Evans", "Alan Munde"]
f"Performances will be held by {players[1]}, {players[0]}, and {players[2]}"
'Performances will be held by Bill Evans, Tony Trischka, and Alan Munde'
Ranges
Using range objects is an efficient way to represent a series of numbers,
ordered by value. They are largely used for specifying the number of times
a loop should run. Chapter 5 introduces loops. Range objects can take start
(optional), end, and step (optional) arguments. Much as with slicing, the
start is included in the range, and the end is not. Also as with slicing, you
can use negative steps to count down. Ranges calculate numbers as you
request them, so a large range takes no more memory than a small one.
Listing 3.4 demonstrates how to create ranges with and without the
optional arguments. This listing makes lists from the ranges so that you can
see the full contents that the range would supply.
Listing 3.4 Creating Ranges
range(10)
range(0, 10)
list(range(1, 10))
[1, 2, 3, 4, 5, 6, 7, 8, 9]
list(range(0,10,2))
[0, 2, 4, 6, 8]
list(range(10, 0, -2))
[10, 8, 6, 4, 2]
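Because a range computes its values on request, you can ask questions of a very large range without building a list from it. This sketch assumes a 64-bit Python build (so that len() can report the size) and uses a made-up variable named big:

```python
# A range of 200 billion numbers; nothing is generated at this point.
big = range(0, 10**12, 5)

print(len(big))                 # 200000000000
print(big[3])                   # 15
print(999_999_999_995 in big)   # True, computed arithmetically

# Slicing a range gives another range, which is cheap to turn into a list.
print(list(big[:4]))            # [0, 5, 10, 15]
```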
Summary
This chapter covers the important group of types known as sequences. A
sequence is an ordered, finite collection of items. Lists and tuples can
contain mixed types. Lists can be modified after creation, but tuples cannot.
Strings are sequences of text. Range objects are used to describe ranges of
numbers. Lists, strings, and ranges are among the most commonly used
types in Python.
Questions
1. How would you test whether a is in the list my_list?
2. How would you find out how many times b appears in a string named
my_string?
In This Chapter
Creating dictionaries
Accessing and updating dictionary contents
Creating sets
Set operations
Dictionaries
Imagine that you are doing a study to determine if there is a correlation
between student height and grade point average (GPA). You need a data
structure to represent the data for an individual student, including the
person’s name, height, and GPA. You could store the information in a list or
tuple. You would have to keep track of which index represented which
piece of data, though. A better representation would be to label the data so
that you wouldn’t need to track the translation from index to attribute. You
can use dictionaries to store data as key/value pairs. Every item, or value, in
a dictionary is accessed using a key. This lookup is very efficient and is
much faster than searching a long sequence.
With a key/value pair, the key and the value are separated by a colon.
You can present multiple key/value pairs, separated by commas and
enclosed in curly brackets. So, a dictionary for the student record might
look like this:
{ 'name': 'Betty', 'height': 62,'gpa': 3.6 }
The keys for this dictionary are the strings 'name', 'height', and 'gpa'.
Each key points to a piece of data: 'name' points to the string 'Betty',
'height' points to the integer 62, and 'gpa' points to the floating-point
number 3.6. The values can be of any type, though there are some
restrictions on the key type, as discussed later in the chapter.
Creating Dictionaries
You can create dictionaries with or without initial data. You can create an
empty dictionary by using the dict() constructor method or by simply
using curly braces:
dictionary = dict()
dictionary
{}
dictionary = {}
dictionary
{}
A third option is to create a dictionary by using curly braces, with the keys
and values paired using colons and separated with commas:
subject_3 = {'name':'Paula', 'height':64, 'gpa':3.8, 'ranking':1}
These three methods all create dictionaries that evaluate the same way, as
long as the same keys and values are used:
subject_1 == subject_2 == subject_3
True
student_record['height']
64
student_record['gpa']
3.8
If you want to add a new key/value pair to an existing dictionary, you can
assign the value to the slot by using the same syntax:
student_record['applied'] = '2019-10-31'
student_record
{'name':'Paula',
'height':64,
'gpa':3.8,
'applied': '2019-10-31'}
Note
Of course, to really protect the subject’s identity, you would want to
remove the person’s name as well as any other PII.
Dictionary Views
Dictionary views are objects that offer insights into a dictionary. There are
three views: dict_keys, dict_values, and dict_items. Each view type lets
you look at the dictionary from a different perspective.
Dictionaries have a keys() method, which returns a dict_keys object. This
object gives you access to the current keys of the dictionary:
keys = subject_1.keys()
keys
dict_keys(['name', 'height', 'gpa', 'ranking'])
The values() method returns a dict_values object, which gives you access
to the values stored in the dictionary:
values = subject_1.values()
values
dict_values(['Paula', 64, 4.0, 1])
You can test membership in any of these views by using the in operator.
This example shows how to check whether the key 'ranking' is used in this
dictionary:
'ranking' in keys
True
This example shows how to check whether the integer 1 is one of the values
in the dictionary:
1 in values
True
This example shows how to check whether the key/value pair mapping
'ranking' is 1:
('ranking',1) in items
True
Dictionary views are dynamic. This means that if you change a dictionary
after acquiring a view, the view reflects the changes. For example, say
that you want to delete a key/value pair from the dictionary whose views
are accessed above, as shown here:
del(subject_1['ranking'])
subject_1
{'name': 'Paula', 'height': 64, 'gpa': 4.0}
1 in values
False
('ranking',1) in items
False
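The same behavior holds when keys are added. The following sketch uses a small, made-up dictionary named record (not one of the chapter's variables) to show a view picking up both an addition and a deletion that happen after the view was acquired:

```python
record = {'name': 'Paula', 'height': 64}
keys = record.keys()      # the view is acquired once

record['gpa'] = 3.8       # the dictionary changes afterward
print('gpa' in keys)      # True
print(len(keys))          # 3

del record['height']
print(list(keys))         # ['name', 'gpa']
```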
Every dictionary view type has a length, which you can access by using the
same len function used with sequences:
len(keys)
3
len(values)
3
len(items)
3
As of Python 3.8, you can use the reversed function on a dict_keys view to
iterate over the keys in reverse order:
keys
dict_keys(['name', 'height', 'gpa'])
list(reversed(keys))
['gpa', 'height', 'name']
The dict_keys views are set-like objects, which means that many set
operations will work on them. This example shows how to create two
dictionaries and take the union of their keys:
admission_record = {'first':'Julia',
'last':'Brown',
'id': 'ax012E4',
'admitted': '2020-03-14'}
student_record = {'first':'Julia',
'last':'Brown',
'id': 'ax012E4',
'gpa':3.8,
'major':'Data Science',
'minor': 'Math',
'advisor':'Pickerson'}
admission_record.keys() | student_record.keys()
{'admitted', 'advisor', 'first', 'gpa', 'id', 'last', 'major', 'minor'}
Note
You will learn more about sets and set operations in the next section.
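Other set operators work on key views as well. As a minimal sketch, using shortened, illustrative versions of the two records (the names admission, student, shared, and only_admission are assumptions for this example), you can find the keys two dictionaries share and the keys that appear in only one of them:

```python
admission = {'first': 'Julia', 'id': 'ax012E4', 'admitted': '2020-03-14'}
student = {'first': 'Julia', 'id': 'ax012E4', 'gpa': 3.8}

# Keys present in both dictionaries
shared = admission.keys() & student.keys()

# Keys present only in the admission record
only_admission = admission.keys() - student.keys()

print(sorted(shared))           # ['first', 'id']
print(sorted(only_admission))   # ['admitted']
```

The results are ordinary sets, so the sketch sorts them before printing to get a predictable order.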
The most common use for dict_items views is to iterate through a dictionary
and perform an operation with each key/value pair. The following example
uses a for loop (see Chapter 5, “Execution Control”) to print each pair:
for k,v in student_record.items():
    print(f"{k} => {v}")
first => Julia
last => Brown
id => ax012E4
gpa => 3.8
major => Data Science
minor => Math
advisor => Pickerson
As a shortcut, you can also test for a key without explicitly calling the
dict_keys view. Instead, you just use in directly with the dictionary:
'last' in student_record
True
This also works if you want to iterate through the keys of a dictionary. You
don’t need to access the dict_keys view directly:
for key in student_record:
    print(f"key: {key}")
key: first
key: last
key: id
key: gpa
key: major
key: minor
key: advisor
This type of error stops the execution of a program that is run outside a
notebook. One way to avoid these errors is to test whether the key is in the
dictionary before accessing it:
if 'name' in student_record:
    student_record['name']
This example uses an if statement that accesses the key ‘name’ only if it is
in the dictionary. (For more on if statements, see Chapter 5.)
As a convenience, dictionaries have a method, get(), that is designed for
safely accessing missing keys. By default, this method returns the None
constant if the key is missing:
print( student_record.get('name') )
None
You can also provide a second argument, which is the value to return in the
event of missing keys:
student_record.get('name', 'no-name')
'no-name'
This example tries to get the value for the key 'name' from the dictionary
student_record, and if it is missing, it tries to get the value for the key
'first' from the dictionary admission_record, and if that key is missing, it
returns the default value 'no-name'.
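A lookup of the kind described above can be sketched with nested get() calls. The two records here are shortened, illustrative versions of the dictionaries used earlier, and the variable name is an assumption for this example:

```python
student_record = {'first': 'Julia', 'last': 'Brown', 'gpa': 3.8}
admission_record = {'first': 'Julia', 'last': 'Brown', 'id': 'ax012E4'}

# The inner get() supplies the fallback value for the outer get().
name = student_record.get('name',
                          admission_record.get('first', 'no-name'))
print(name)   # Julia
```

Note that Python evaluates the inner get() before calling the outer one, whether or not the first key is missing. That is harmless for dictionary lookups, but it matters if the fallback is expensive to compute.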
Mutable objects, such as lists, are not valid keys for dictionaries. If you try
to use a list as a key, you experience an error:
{('item',): 'a tuple',
1: 'an integer',
b'binary': 'a binary string',
range(0, 12): 'a range',
'string': 'a string',
['a', 'list'] : 'a list key' }
-------------------------------------------------------------
TypeError Traceback (most rece
<ipython-input-31-1b0e555de2b5> in <module>()
----> 1 { ['a', 'list'] : 'a list key' }
TypeError: unhashable type: 'list'
A tuple whose contents are immutable can be used as a dictionary key. So,
tuples of numbers, strings, and other tuples are all valid as keys:
tuple_key = (1, 'one', 1.0, ('uno',))
{ tuple_key: 'some value' }
{(1, 'one', 1.0, ('uno',)): 'some value'}
If a tuple contains a mutable object, such as a list, then the tuple is not a
valid key:
bad_tuple = ([1, 2], 3)
{ bad_tuple: 'some value' }
----------------------------------------------------------------
TypeError Traceback (most recent
<ipython-input-28-b2cddfdda91e> in <module>()
1 bad_tuple = ([1, 2], 3)
----> 2 { bad_tuple: 'some value' }
TypeError: unhashable type: 'list'
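One way to check in advance whether an object can serve as a key is to try hashing it. The helper below, is_hashable, is a hypothetical name introduced for illustration; it relies only on the built-in hash() function:

```python
def is_hashable(obj):
    """Return True if obj can be used as a dictionary key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable((1, 'one', 1.0, ('uno',))))   # True
print(is_hashable(([1, 2], 3)))                 # False: the tuple holds a list
print(is_hashable(['a', 'list']))               # False
```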
a_tuple = 'a','b',
a_tuple.__hash__()
7273358294597481374
a_number = 13
a_number.__hash__()
13
Dictionaries and lists are among the most commonly used data structures in
Python. They give you great ways to structure data for meaningful, fast
lookups.
Note
Although the key/value lookup mechanism does not rely on an order
of the data, as of Python 3.7, the order of the keys reflects the order
in which they were inserted.
Sets
The Python set data structure is an implementation of the sets you may be
familiar with from mathematics. A set is an unordered collection of unique
items. You can think of a set as a magic bag that does not allow duplicate
objects. The items in sets can be any hashable type.
A set is represented in Python as a list of comma-separated items enclosed
in curly braces:
{ 1, 'a', 4.0 }
You can create a set either by using the set() constructor or by using curly
braces directly. However, when you use empty curly braces, you create an
empty dictionary, not an empty set. If you want to create an empty set, you
must use the set() constructor:
empty_set = set()
empty_set
set()
empty_set = {}
empty_set
{}
You can create a set with initial values by using either the constructor or the
curly braces.
You can provide any type of sequence as the argument, and a set will be
returned based on the unique items from the sequence:
letters = 'a', 'a', 'a', 'b', 'c'
unique_letters = set(letters)
unique_letters
{'a', 'b', 'c'}
unique_chars = set('mississippi')
unique_chars
{'i', 'm', 'p', 's'}
unique_num = {1, 1, 2, 3, 4, 5, 5}
unique_num
{1, 2, 3, 4, 5}
Much like dictionary keys, sets hash their contents to determine uniqueness.
Therefore, the contents of a set must be hashable and, hence, immutable. A
list cannot be a member of a set:
bad_set = { ['a','b'], 'c' }
----------------------------------------------------------------
TypeError Traceback (most recent
<ipython-input-12-1179bc4af8b8> in <module>()
----> 1 bad_set = { ['a','b'], 'c' }
TypeError: unhashable type: 'list'
3 not in unique_num
False
You can use the len() function to see how many items a set contains:
len(unique_num)
6
As with lists, you can remove and return an item from a set by using the
pop() method:
This method does not return the item removed. If you try to remove an item
that is not found in the set, you get an error:
students.remove('Barb')
----------------------------------------------------------------
KeyError Traceback (most recent
<ipython-input-3-a36a5744ac05> in <module>()
----> 1 students.remove('Barb')
KeyError: 'Barb'
You could write code to test whether an item is in a set before removing it,
but there is a convenience function, discard(), that does not throw an error
when you attempt to remove a missing item:
students.discard('Barb')
students.discard('Tik')
students
{'Max'}
You can remove all of the contents of a set by using the clear() method:
students.clear()
students
set()
Remember that because sets are unordered, they do not support indexing:
unique_num[3]
----------------------------------------------------------------
TypeError Traceback (most recent
<ipython-input-16-fecab0cd5f95> in <module>()
----> 1 unique_num[3]
TypeError: 'set' object does not support indexing
You can test equality by using the equals, ==, and not equals, !=, operators
(which are discussed in Chapter 5). Because sets are unordered, sets created
from sequences with the same items in different orders are equal:
first = {'a','b','c','d'}
second = {'d','c','b','a'}
first == second
True
first != second
False
Set Operations
You can perform a number of operations with sets. Many set operations are
offered both as methods on the set objects and as separate operators (<, <=,
>, >=, &, |, and ^). The set methods can be used to perform operations
between sets and other sets, and they can also be used between sets and
other iterables (that is, data types that can be iterated over). The set
operators work only between sets and other sets (or frozensets).
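The difference between the methods and the operators is easy to see with a list as the second operand. This is a minimal sketch; the names vowels, with_method, with_operator, and operator_error are assumptions for this example:

```python
vowels = {'a', 'e', 'i'}

# The method accepts any iterable, here a list.
with_method = vowels.union(['o', 'u'])
print(sorted(with_method))     # ['a', 'e', 'i', 'o', 'u']

# The operator insists on a set (or frozenset) on both sides.
operator_error = None
try:
    vowels | ['o', 'u']
except TypeError as error:
    operator_error = error
    print(f"Operator failed: {error}")

# Converting the list to a set first makes the operator usable.
with_operator = vowels | set(['o', 'u'])
print(sorted(with_operator))   # ['a', 'e', 'i', 'o', 'u']
```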
Disjoint
Two sets are disjoint if they have no items in common. With Python sets,
you can use the isdisjoint() method to test this. If you test a set of even
numbers against a set of odd numbers, they share no numbers, and hence
the result of isdisjoint() is True:
even = set(range(0,10,2))
even
{0, 2, 4, 6, 8}
odd = set(range(1,11,2))
odd
{1, 3, 5, 7, 9}
even.isdisjoint(odd)
True
Subset
If all the items in a set, Set B, can be found in another set, Set A, then Set B
is a subset of Set A. The issubset() method tests whether the current set is a
subset of another. The following example tests whether the set of positive
multiples of 3 below 21 is a subset of the set of non-negative integers below 21:
nums = set(range(21))
nums
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
threes = set(range(3,21,3))
threes
{3, 6, 9, 12, 15, 18}
threes.issubset(nums)
True
You can use the <= operator to test whether a set to the left is a subset of a
set to the right:
threes <= nums
True
Proper Subsets
If all the items of a set are contained in a second set, but not all the items in
the second set are in the first set, then the first set is a proper subset of the
second set. This is equivalent to saying that the first set is a subset of the
second and that they are not equal. You use the < operator to test for proper
subsets:
threes < nums
True
nums.issuperset([1,2,3,4])
True
You use the greater-than-or-equal-to operator, >=, to test for supersets and
the greater-than operator, >, to test for proper supersets:
nums >= threes
True
Union
The union of two sets results in a set containing all the items in both sets.
For Python sets you can use the union() method, which works with sets and
other iterables, and the standalone bar operator, |, which returns the union
of two sets:
evens = set(range(0,12,2))
evens
{0, 2, 4, 6, 8, 10}
odds = set(range(1,13,2))
odds
{1, 3, 5, 7, 9, 11}
evens.union(odds)
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
evens.union(range(0,12))
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
evens | odds
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
Intersection
The intersection of two sets is a set containing all items shared by both sets.
You can use the intersection() method or the ampersand operator, &, to perform
intersections:
under_ten = set(range(10))
odds = set(range(1,21,2))
under_ten.intersection(odds)
{1, 3, 5, 7, 9}
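The operator form gives the same result as the method when both operands are sets. As a short sketch (the names by_operator and with_range are assumptions for this example), the method form can also take a non-set iterable such as a range:

```python
under_ten = set(range(10))
odds = set(range(1, 21, 2))

# Operator form: both operands must be sets.
by_operator = under_ten & odds
print(sorted(by_operator))   # [1, 3, 5, 7, 9]

# Method form: the argument can be any iterable.
with_range = under_ten.intersection(range(0, 20, 3))
print(sorted(with_range))    # [0, 3, 6, 9]
```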
Difference
The difference between two sets is all of the items in the first set that are not
in the second set. You can use the difference() method or the minus
operator, -, to perform set difference:
odds.difference(under_ten)
{11, 13, 15, 17, 19}
odds - under_ten
{11, 13, 15, 17, 19}
Symmetric Difference
The symmetric difference of two sets is a set containing any items
contained in only one of the original sets. Python sets have a
symmetric_difference() method, and the caret operator, ^, for calculating
the symmetric difference:
under_ten = set(range(10))
over_five = set(range(5, 15))
under_ten.symmetric_difference(over_five)
{0, 1, 2, 3, 4, 10, 11, 12, 13, 14}
under_ten ^ over_five
{0, 1, 2, 3, 4, 10, 11, 12, 13, 14}
Updating Sets
Python sets offer a number of ways to update the contents of a set in place.
In addition to using update(), which adds the contents to a set, you can use
variations that update based on the various set operations.
The following example shows how to update from another set:
unique_num = {0, 1, 2}
unique_num.update( {3, 4, 5, 7} )
unique_num
{0, 1, 2, 3, 4, 5, 7}
The following example shows how to update the difference from a range:
unique_num.difference_update( range(0,12,2) )
unique_num
{1, 3, 5, 7}
unique_letters |= set("Arkansas")
unique_letters
{'A', 'a', 'i', 'k', 'm', 'n', 'p', 'r', 's'}
Frozensets
Because sets are mutable, they cannot be used as dictionary keys or even as
items in sets. In Python, frozensets are set-like objects that are immutable.
You can use frozensets in place of sets for any operation that does not
change their contents, as in these examples:
froze = frozenset(range(10))
froze
frozenset({0, 1, 2, 3, 4, 5, 6, 7, 8, 9})
froze | set(range(5,15))
frozenset({0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14})
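Because a frozenset is hashable, it can serve as a dictionary key or as a member of another set. The sketch below uses a made-up dictionary, duet_counts, keyed by unordered pairs of names; all names and values are assumptions for illustration:

```python
# Hypothetical lookup keyed by unordered pairs of players.
duet_counts = {
    frozenset({'Tony', 'Bill'}): 3,
    frozenset({'Bill', 'Alan'}): 1,
}

# The order of the items does not matter when building the key.
print(duet_counts[frozenset({'Bill', 'Tony'})])   # 3

# A regular set is rejected as a key because it is mutable.
set_key_error = None
try:
    {{'Tony', 'Bill'}: 3}
except TypeError as error:
    set_key_error = error
    print(error)

# Frozensets can also be members of a regular set.
pairs = {frozenset({1, 2}), frozenset({2, 1}), frozenset({2, 3})}
print(len(pairs))   # 2
```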
Summary
Python’s built-in data structures offer a variety of ways to represent and
organize your data. Dictionaries and sets are both complements to the
sequence types. Dictionaries map keys to values in an efficient way. Sets
implement mathematical set operations as data structures. Both dictionaries
and sets are great choices where order is not the best operating principle.
Questions
1. What are three ways to create a dictionary with the following
key/value pairs:
{'name': 'Smuah', 'height':62}
2. How would you update the value associated with the key gpa in the
dictionary student to be '4.0'?
3. Given the dictionary data, how would you safely access the value for
the key settings if that key might be missing?
4. What is the difference between a mutable object and immutable
object?
5. How would you create a set from the string "lost and lost again"?
5
Execution Control
In This Chapter
Introduction to compound statements
Equality operations
Comparison operations
Boolean operations
if statements
while loops
for loops
Up until this point in the book, you’ve seen statements as individual units,
executing sequentially one line at a time. Programming becomes much
more powerful and interesting when you can group statements together so
that they execute as a unit. Simple statements that are joined together can
perform more complex behaviors.
Compound Statements
Chapter 2, “Fundamentals of Python,” introduces simple statements, each
of which performs an action. This chapter looks at compound statements,
which allow you to control the execution of a group of statements. This
execution can occur only when a condition is true. The compound
statements covered in this chapter include for loops, while loops, if
statements, try statements, and with statements.
The controlled statements can be grouped in one of two ways. The first,
more common, way is to group them as a code block, which is a group of
statements that are run together. In Python, code blocks are defined using
indentation. A group of statements that share the same indentation are
grouped into the same code block. The group ends when there is a
statement that is not indented as far as the others. That final statement is not
part of the code block and will execute regardless of the control statement.
This is what a code block looks like:
<control statement>:
    <controlled statement 1>
    <controlled statement 2>
    <controlled statement 3>
<statement ending block>
You should use this second style only when you have very few controlled
statements and you feel that limiting the compound statement to one line
will enhance, not detract from, the readability of the program.
Equality Operations
Python offers the equality operator, ==, the inequality operator, !=, and the
identity operator, is. The equality and inequality operators both compare
the value of two objects and return one of the constants True or False.
Listing 5.1 assigns two variables with integer values of 1, and another with
the value 2. It then uses the equality operator to show that the first two
variables are equal, and the third is not. It does the same with the inequality
operator, whose results are opposite those of the equality operator with the
same inputs.
Listing 5.1 Equality Operations
# Assign values to variables
a, b, c = 1, 1, 2
# Check if value is equal
a == b
True
a == c
False
a != b
False
a != c
True
Web forms often report all user input as strings. A common problem occurs
when you compare form input that represents a number, but is of type
string, with an actual number. With the equality operator, a string never
equals a number, so the comparison evaluates to False even if the string
is a text version of the same value.
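A short sketch of the problem and one common remedy follows. The variable form_value stands in for input received from a form and is an assumption for this example:

```python
form_value = '42'   # hypothetical input, delivered as a string

as_text = form_value == 42          # a str never equals an int
as_number = int(form_value) == 42   # convert first, then compare

print(as_text)     # False
print(as_number)   # True
```

If the input might not be a valid number, int() raises a ValueError, so real code usually guards the conversion.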
Comparison Operations
You use comparison operators to compare the order of objects. What “the
order” means depends on the type of objects compared. For numbers, the
comparison is the order on a number line, and for strings, the Unicode value
of the characters is used. The comparison operators are less than (<), less
than or equal to (<=), greater than (>), and greater than or equal to (>=).
Listing 5.2 demonstrates the behavior of various comparison operators.
Listing 5.2 Comparison Operations
a, b, c = 1, 1, 2
a < b
False
a < c
True
a <= b
True
a > b
False
a >= b
True
There are certain cases where you can use comparison operators between
objects of different types, such as with the numeric types, but most cross-
type comparisons are not allowed. If you use a comparison operator with
noncomparable types, such as a string and a list, an error occurs.
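The following sketch contrasts an allowed cross-type comparison with one that fails. The variable names are assumptions introduced for this example:

```python
# An int and a float are both numeric, so they can be compared.
numeric_result = 1 < 2.5
print(numeric_result)   # True

# Two strings are compared character by character, by Unicode value.
string_result = 'apple' < 'banana'
print(string_result)    # True

# A string and a list have no defined ordering.
comparison_error = None
try:
    'apple' < ['banana']
except TypeError as error:
    comparison_error = error
    print(error)
```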
Boolean Operations
The Boolean operators are based on Boolean math, which you may have
studied in a math or philosophy course. These operations were first
formalized by the mathematician George Boole in the 19th century. In
Python, the Boolean operators are and, or, and not. The and and or
operators each take two arguments; the not operator takes only one.
The and operator evaluates to True if both of its arguments evaluate to True;
otherwise, it evaluates to False. The or operator evaluates to True if either
of its arguments evaluates to True; otherwise, it evaluates to False. The not
operator returns True if its argument evaluates to False; otherwise, it
evaluates to False. Listing 5.3 demonstrates these behaviors.
Listing 5.3 Boolean Operations
True and True
True
True or False
True
False or False
False
not False
True
not True
False
Both the and and or operators are short-circuit operators. This means they
will only evaluate their input expression as much as is needed to determine
the output. For example, say that you have two methods, returns_false()
and returns_true(), and you use them as inputs to the and operator as
follows:
returns_false() and returns_true()
In this case, the second method will not be called, because the first returns False.
The not operator always returns one of the Boolean constants True or False.
The other two Boolean operators return the result of the last expression
evaluated. This is very useful with object evaluation.
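You can observe short-circuiting by recording which functions actually run. In this sketch, the bodies of returns_false() and returns_true() and the calls list are assumptions added for illustration:

```python
calls = []   # records the name of every function that runs

def returns_false():
    calls.append('returns_false')
    return False

def returns_true():
    calls.append('returns_true')
    return True

# and stops at the first False operand.
and_result = returns_false() and returns_true()
and_calls = list(calls)
print(and_result, and_calls)   # False ['returns_false']

# or stops at the first True operand.
calls.clear()
or_result = returns_true() or returns_false()
or_calls = list(calls)
print(or_result, or_calls)     # True ['returns_true']
```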
Object Evaluation
All objects in Python evaluate to True or False. This means you can use
objects as arguments to Boolean operations. The objects that evaluate to
False are the constants None and False, any numeric with a value of zero, or
anything with a length of zero. This includes empty sequences, such as an
empty string ("") or an empty list ([]). Almost anything else evaluates to
True.
Because the or operator returns the last expression it evaluates, you can use
it to create a default value when a variable evaluates to False:
a = ''
b = a or 'default value'
b
'default value'
Because this example assigns the first variable to an empty string, which
has a length of zero, this variable evaluates to False. The or operator
evaluates this and then evaluates and returns the second expression.
if Statements
The if statement is a compound statement. if statements let you branch the
behavior of your code depending on the current state. You can use an if
statement to take an action only when a chosen condition is met or use a
more complex one to choose among multiple actions, depending on
multiple conditions. The control statement starts with the keyword if
followed by an expression (which evaluates to True or False) and then a
colon. The controlled statements follow either on the same line separated by
semicolons:
if True: message = "It's True!"; print(message)
It's True!
This example checks whether the value of the variable snack is in the set
fruit. If it is, an encouraging message is printed.
If you want to have multiple branches in your code, you can nest if and
else statements as shown in Listing 5.5. In this case, three choices are
made: one if the balance is positive, one if it is zero, and one if it is
negative.
Listing 5.5 Nested else Statements
balance = 2000.32
account_status = None
if balance > 0:
    account_status = 'Positive'
else:
    if balance == 0:
        account_status = 'Empty'
    else:
        account_status = 'Overdrawn'
print(account_status)
Positive
While this code is legitimate and will work the way it is supposed to, it is a
little hard to read. To perform the same branching logic in a more concise
way, you can use an elif statement. This type of statement is added after an
initial if statement. It has a controlling expression of its own, which will be
evaluated only if the previous statement’s expression evaluates to False.
Listing 5.6 performs the same logic as Listing 5.5, but has the nested else
and if statements replaced by elif.
Listing 5.6 elif Statements
balance = 2000.32
account_status = None
if balance > 0:
    account_status = 'Positive'
elif balance == 0:
    account_status = 'Empty'
else:
    account_status = 'Overdrawn'
print(account_status)
Positive
if fav_num in (3,7):
    print(f"{fav_num} is lucky")
elif fav_num == 0:
    print(f"{fav_num} is evocative")
elif fav_num > 20:
    print(f"{fav_num} is large")
elif fav_num == 13:
    print(f"{fav_num} is my favorite number too")
else:
    print(f"I have no opinion about {fav_num}")
13 is my favorite number too
while Loops
A while loop consists of the keyword while followed by a controlling
expression, a colon, and then a controlled code block. The controlled
statement in a while loop executes only if the controlling statement
evaluates to True; in this way, it is like an if statement. Unlike an if
statement, however, the while loop repeatedly continues to execute the
controlled block as long as its control statement remains True. Here is a
while loop that executes as long as the variable counter is below five:
Notice that the variable is incremented with each iteration. This guarantees
that the loop will exit. Here is the output from running this loop:
I've counted 0 so far, I hope there aren't more
I've counted 1 so far, I hope there aren't more
I've counted 2 so far, I hope there aren't more
I've counted 3 so far, I hope there aren't more
I've counted 4 so far, I hope there aren't more
You can see that the loop runs five times, incrementing the variable each
time.
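A loop consistent with the description and output above can be sketched as follows. The exact statements are an assumption reconstructed from that output:

```python
counter = 0
while counter < 5:
    print(f"I've counted {counter} so far, I hope there aren't more")
    counter += 1   # without this line, the condition would never become False
```

When the loop exits, counter holds 5, which is the first value that fails the test counter < 5.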
Note
It is important to provide an exit condition, or your loop will repeat
infinitely.
for Loops
for loops are used to iterate through some group of objects. This group can
be a sequence, a generator, or any other object that is iterable.
An iterable object is any object that returns a series of items one at a time.
for loops are commonly used to perform a block of code a set number of
times or perform an action on each member of a sequence. The controlling
statement of a for loop consists of the keyword for, a variable, the keyword
in, and the iterable followed by a colon:
The variable is assigned the first value from the iterable, the controlled
block is executed with that value, and then the variable is assigned the next
value. This continues as long as the iterable has values to return.
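As a minimal sketch of this process, the loop below iterates over a made-up list named instruments (an assumption for this example) and collects a message for each item:

```python
# Hypothetical list used only for illustration.
instruments = ['banjo', 'fiddle', 'mandolin']
messages = []

for instrument in instruments:
    # instrument is assigned the next item on each pass
    messages.append(f"Tuning the {instrument}")

print(messages)
# The loop variable keeps its last value after the loop ends.
print(instrument)   # mandolin
```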
A common way to run a block of code a set number of times is to use a for
loop with a range object as the iterable:
for i in range(6):
    j = i + 1
    print(j)
1
2
3
4
5
6
Each item in the list is used in the code block, and when there are no items
left, the loop exits.
while True:
    beast = beasts[i]
    if beast not in fish:
        print(f"Oh no! It's not a fish, it's a {beast}")
        break
    print(f"I caught a {beast} with my fishing net")
    i += 1
I caught a salmon with my fishing net
I caught a pike with my fishing net
Oh no! It's not a fish, it's a bear
Summary
Compound statements such as if statements, while loops, and for loops are
a fundamental part of code beyond simple scripts. With the ability to branch
and repeat your code, you can form blocks of action that describe complex
behavior. You now have tools to structure more complex software.
Questions
1. What is printed by the following code if the variable a is set to an
empty list?
if a:
    print(f"Hiya {a}")
else:
    print(f"Biya {a}")
In This Chapter
Defining functions
Docstrings
Positional and keyword parameters
Wildcard parameters
Return statements
Scope
Decorators
Anonymous functions
The last and perhaps most powerful compound statement that we discuss is
the function. Functions give you a way to name a code block wrapped as an
object. That code can then be invoked by use of that name, allowing the
same code to be called multiple times and in multiple places.
Defining Functions
A function definition defines a function object, which wraps the executable
block. The definition does not run the code block but just defines the
function. The definition describes how the function can be called, what it is
named, what parameters can be passed to it, and what will be executed
when it is invoked. The building blocks of a function are the controlling
statement, an optional docstring, the controlled code block, and a return
statement.
Control Statement
The first line of a function definition is the control statement, which takes
the following form:
def <Function Name> (<Parameters>):
The code block in this case consists of a single pass statement, which does
nothing. The Python style guide, PEP8, has conventions for naming
functions (see
https://www.python.org/dev/peps/pep-0008/#function-and-variable-names).
Docstrings
The next part of a function definition is the documentation string, or
docstring, which contains documentation for the function. It can be omitted,
and the Python compiler will not object. However, it is highly
recommended to supply a docstring for all but the most obvious functions.
The docstring communicates your intentions in writing a function, what the
function does, and how it should be called. PEP8 provides guidance
regarding the content of docstrings (see
https://www.python.org/dev/peps/pep-0008/#documentation-strings). The
docstring consists of a single-line or multiline string, enclosed in triple
double quotes ("""), that immediately follows the control statement:
def do_nothing(not_used):
    """This function does nothing."""
    pass
For a single-line docstring, the quotes are on the same line as the text. For a
multiline docstring, the quotes are generally above and below the text, as in
Listing 6.1.
Listing 6.1 Multiline Docstring
def do_nothing(not_used):
    """
    This function does nothing.

    This function uses a pass statement to
    avoid doing anything.
    Parameters:
        not_used - a parameter of any type,
                   which is not used.
    """
    pass
The first line of the docstring should be a statement summarizing what the
function does. With a more detailed explanation, a blank line is left after the
first statement. There are many different possible conventions for what is
contained after the first line of a docstring, but generally you want to offer
an explanation of what the function does, what parameters it takes, and
what it is expected to return. The docstring is useful both for someone
reading your code and for various utilities that read and display either the
first line or the whole docstring. For example, if you call the help()
function on the function do_nothing(), the docstring is displayed as shown
in Listing 6.2.
Listing 6.2 Docstring from help
help(do_nothing)
Help on function do_nothing in module __main__:
do_nothing(not_used)
Parameters:
not_used - a parameter of any type,
which is not used.
Parameters
Parameters allow you to pass values into a function, which can be used in
the function’s code block. A parameter is like a variable given to a function
when it is called, where the parameter can be different every time you call
the function. A function does not have to accept any parameters. For a
function that should not accept parameters, you leave the parentheses after
the function name empty:
def no_params():
    print("I don't listen to nobody")
When you call a function, you pass the values for the parameters within the
parentheses following the function name. Parameter values can be set based
on the position at which they are passed or based on keywords. Functions
can be defined to require that their parameters be passed by position, by
keyword, or by a combination of the two. The values passed to a function are
bound to variables with the names given in the function definition. Listing
6.3 defines three parameters: first, second, and third. These variables are
then available to the code block that follows, which prints out the value of
each parameter.
Listing 6.3 Parameters by Position or Keyword
def does_order(first, second, third):
    '''Prints parameters.'''
    print(f'First: {first}')
    print(f'Second: {second}')
    print(f'Third: {third}')
does_order(1, 2, 3)
First: 1
Second: 2
Third: 3
Listing 6.3 defines the function does_order() and then calls it three times.
The first time, it uses the position of the arguments, (1, 2, 3), to assign the
variable values. It assigns the first value to the first parameter, first, the
second value to the second parameter, second, and the third value to the
third parameter, third.
The second time the listing calls the function does_order(), it uses keyword
assignment, explicitly assigning the values using the parameter names,
(first=1, second=2, third=3). In the third call, the first parameter is
assigned by position, and the other two are assigned using keyword
assignment. Notice that in all three cases, the parameters are assigned the
same values.
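The three call styles described above can be sketched side by side. This version of does_order() returns its parameters instead of printing them, purely so the results can be compared. The exact form of the mixed call is an assumption.

```python
def does_order(first, second, third):
    '''Returns the parameters as a tuple so the calls can be compared.'''
    return (first, second, third)

by_position = does_order(1, 2, 3)
by_keyword = does_order(first=1, second=2, third=3)
# Assumed form of the mixed call; keywords may appear in any order.
mixed = does_order(1, third=3, second=2)

print(by_position == by_keyword == mixed)
```

All three calls bind the same values to first, second, and third, so the comparison is true.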
Keyword assignments do not rely on the position of the keywords. For
example, you can assign third=3 in the position before second=2 without
issue. You cannot use a keyword assignment to the left of a positional
assignment, however:
does_order(second=2, 1, 3)
File "<ipython-input-9-eed80203e699>", line 1
does_order(second=2, 1, 3)
^
SyntaxError: positional argument follows keyword argument
You can require that a parameter be called only using the keyword method
by putting a * to its left in the function definition. All parameters to the
right of the star can only be called using keywords. Listing 6.4 shows how
to make the parameter third a required keyword parameter and then call it
using the keyword syntax.
Listing 6.4 Parameters Requiring Keywords
def does_keyword(first, second, *, third):
    '''Prints parameters.'''
    print(f'First: {first}')
    print(f'Second: {second}')
    print(f'Third: {third}')
does_keyword(1, 2, third=3)
First: 1
Second: 2
Third: 3
If you try to call a required keyword parameter using positional syntax, you
get an error:
does_keyword(1, 2, 3)
----------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-15-88b97f8a6c32> in <module>
----> 1 does_keyword(1, 2, 3)
does_defaults(1, 2, 3)
First: 1
Second: 2
Third: 3
does_defaults(1, 2)
First: 1
Second: 2
Third: 3
Much as with the restriction regarding the order of keyword and position
arguments during a function call, you cannot define a function with a
default value parameter to the left of a non-default value parameter:
def does_defaults(first=1, second, third=3):
    '''Prints parameters.'''
    print(f'First: {first}')
    print(f'Second: {second}')
    print(f'Third: {third}')
File "<ipython-input-19-a015eaeb01be>", line 1
def does_defaults(first=1, second, third=3):
^
SyntaxError: non-default argument follows default argument
Default values are defined in the function definition, not in the function
call. This means that if you use a mutable object, such as a list or dictionary,
as a default value, it will be created once for the function. Every time you
call that function using that default, the same list or dictionary object will
be used. This can lead to subtle problems if it is not expected. Listing 6.6
defines a function with a list as the default argument. The code block
appends 1 to the list. Notice that every time the function is called, the list
retains the values from previous calls.
Listing 6.6 Mutable Defaults
def does_list_default(my_list=[]):
    '''Uses list as default.'''
    my_list.append(1)
    print(my_list)
does_list_default()
[1]
does_list_default()
[1, 1]
does_list_default()
[1, 1, 1]
does_list_param()
[1]
does_list_param()
[1]
does_list_param()
[1]
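The output above shows a list that does not accumulate values between calls. One common way to get that behavior, sketched here as an assumption rather than as the listing that produced the output, is to use None as the default and create the list inside the function. The name does_list_fresh is hypothetical.

```python
def does_list_fresh(my_list=None):
    '''Uses None as a sentinel default (hypothetical example).'''
    if my_list is None:
        my_list = []      # a new list is created on every call
    my_list.append(1)
    return my_list

first_call = does_list_fresh()
second_call = does_list_fresh()
print(first_call, second_call)
```

Because the list is created in the code block rather than in the definition, each call that relies on the default gets its own list object.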
does_positional(1, 2, 3)
First: 1
Second: 2
Third: 3
does_positional(1, 2, third=3)
First: 1
Second: 2
Third: 3
does_wildcard_keywords(one=1, name='Martha')
one : 1
name : Martha
You can use both positional and keyword wildcard parameters in the same
function: Just define the positional parameters first and the keyword
parameters second. Listing 6.12 demonstrates a function using both
positional and keyword parameters.
Listing 6.12 Positional and Keyword Wildcard Parameters
def does_wildcards(*args, **kwargs):
    '''Demonstrates wildcard parameters.'''
    print(f'Positional: {args}')
    print(f'Keyword: {kwargs}')
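A call to this function might look like the following sketch. The argument values are made up for illustration, and the return statement is added only so the collected values can be inspected afterward.

```python
def does_wildcards(*args, **kwargs):
    '''Demonstrates wildcard parameters.'''
    print(f'Positional: {args}')
    print(f'Keyword: {kwargs}')
    return args, kwargs   # returned only so the result can be inspected

# Hypothetical argument values.
args, kwargs = does_wildcards(1, 2, name='Martha')
```

The positional arguments arrive as a tuple and the keyword arguments as a dictionary.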
adds_one(1)
2
Every Python function has a return value. If you do not define a return
statement explicitly, the function returns the special value None:
def returns_none():
    '''Demonstrates default return value.'''
    pass
returns_none() is None
True
This example omits a return statement and then tests that the value returned
is None.
Scope in Functions
Scope refers to the availability of objects defined in code. A variable
defined in the global scope is available throughout your code, whereas a
variable defined in a local scope is available only in that scope. Listing 6.14
defines a variable outer and a variable inner. Both variables are available
in the code block of the function shows_scope, where you print them both.
Listing 6.14 Local and Global Scope
outer = 'Global scope'
def shows_scope():
    '''Demonstrates local variable.'''
    inner = 'Local scope'
    print(outer)
    print(inner)
shows_scope()
Global scope
Local scope
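To see the other side of this, the following sketch tries to read the local variable from the global scope. It reuses the names from Listing 6.14, but returns the values rather than printing them. The flag inner_visible is a hypothetical name introduced for the demonstration.

```python
outer = 'Global scope'

def shows_scope():
    '''Defines a local variable and reads a global one.'''
    inner = 'Local scope'
    return outer, inner

values = shows_scope()

# inner exists only while shows_scope() runs, so reading it here fails.
try:
    inner
    inner_visible = True
except NameError:
    inner_visible = False

print(values, inner_visible)
```

The function can read outer, but code outside the function cannot read inner.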
Decorators
A decorator enables you to design functions that modify other functions.
Decorators are commonly used for tasks such as setting up logging in a
consistent way, and many third-party libraries provide them. While you may
not need to write your own decorators, it is useful to understand how they
work. This section walks through the concepts involved.
In Python, everything is an object, including functions. This means you can
point a variable to a function. Listing 6.15 defines the function add_one(n),
which takes a number and adds 1 to it. Next, it creates the variable my_func,
which has the function add_one() as its value.
Note
When you are not calling a function, you do not use parentheses in
the variable assignment. By omitting the parentheses, you are
referring to the function object and not to a return value. You can see
this where Listing 6.15 prints my_func, which is indeed a function
object. You can then call the function by adding the parentheses and
argument to my_func, which returns the argument plus 1.
my_func = add_one
print(my_func)
<function add_one at 0x1075953a0>
my_func(2)
3
Because functions are objects, you can use them with data structures such
as dictionaries or lists. Listing 6.16 defines two functions and puts them in a
list pointed to by the variable my_functions. It then iterates through the list,
assigning each function to the variable my_func during its iteration and
calling the function during the for loop’s code block.
Listing 6.16 Calling a List of Functions
def add_one(n):
    '''Adds one to a number.'''
    return n + 1
def add_two(n):
    '''Adds two to a number.'''
    return n + 2
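The rest of the idea described for Listing 6.16 can be sketched as follows: the two functions go into a list, and a for loop calls each one in turn. The argument 1 and the results list are assumptions added for illustration.

```python
def add_one(n):
    '''Adds one to a number.'''
    return n + 1

def add_two(n):
    '''Adds two to a number.'''
    return n + 2

my_functions = [add_one, add_two]   # function objects, not calls

results = []
for my_func in my_functions:
    results.append(my_func(1))      # the value 1 is an arbitrary choice
print(results)
```

Note that the list holds the functions themselves, without parentheses. The parentheses appear only where each function is called.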
def call_nested():
    print('outer')
    def nested():
        '''Prints a message.'''
        print('nested')
    return nested
my_func = call_nested()
outer
my_func()
nested
You can also wrap one function with another, adding functionality before or
after. Listing 6.18 wraps the function add_one(number) with the function
wrapper(number). The wrapping function takes a parameter, number, which
it then passes to the wrapped function. It also has statements before and
after calling add_one(number). When you call wrapper(1), you can see the
order of the print statements, including the message Adding 1 printed by
add_one, and see that it returns the expected value from add_one: 2.
def wrapper(number):
    '''Wraps another function.'''
    print('Before calling function')
    retval = add_one(number)
    print('After calling function')
    return retval
wrapper(1)
Before calling function
Adding 1
After calling function
2
def do_wrapping(some_func):
    '''Returns a wrapped function.'''
    print('wrapping function')
    def wrapper(number):
        '''Wraps another function.'''
        print('Before calling function')
        retval = some_func(number)
        print('After calling function')
        return retval
    return wrapper
my_func = do_wrapping(add_one)
wrapping function
my_func(1)
Before calling function
Adding 1
After calling function
2
You can use do_wrapping(some_func) to wrap any function that you like.
For example, if you have the function add_two(number), you can pass it as
an argument just as you did add_one(number):
my_func = do_wrapping(add_two)
my_func(1)
wrapping function
Before calling function
Adding 2
After calling function
3
Decorators provide syntax that can simplify this type of function wrapping.
Instead of calling do_wrapping(some_func), assigning it to a variable, and
then invoking the function from the variable, you can simply put
@do_wrapping at the top of the function definition. Then the function
add_one(number) can be called directly, and the wrapping happens behind
the scenes.
You can see in Listing 6.20 that add_one(number) is wrapped in a similar
fashion as in Listing 6.18, but with the simpler decorator syntax.
Listing 6.20 Decorator Syntax
def do_wrapping(some_func):
    '''Returns a wrapped function.'''
    print('wrapping function')
    def wrapper(number):
        '''Wraps another function.'''
        print('Before calling function')
        retval = some_func(number)
        print('After calling function')
        return retval
    return wrapper
@do_wrapping
def add_one(number):
    '''Adds to a number.'''
    print('Adding 1')
    return number + 1
wrapping function
add_one(1)
Before calling function
Adding 1
After calling function
2
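The wrapper in Listing 6.20 accepts exactly one parameter. As a sketch of one way to generalize it, the following variant passes along any positional and keyword arguments. It also uses functools.wraps from the standard library, which is not covered in this chapter, to keep the wrapped function's name and docstring. The function add is a hypothetical example.

```python
import functools

def do_wrapping(some_func):
    '''Returns a wrapped function that accepts any arguments.'''
    @functools.wraps(some_func)          # keeps some_func's name and docstring
    def wrapper(*args, **kwargs):
        print('Before calling function')
        retval = some_func(*args, **kwargs)
        print('After calling function')
        return retval
    return wrapper

@do_wrapping
def add(a, b=1):
    '''Adds two numbers (hypothetical example function).'''
    return a + b

result = add(2, b=3)
print(result, add.__name__)
```

Without functools.wraps, add.__name__ would report wrapper, which can make help() output and debugging harder to follow.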
Anonymous Functions
The vast majority of the time you define functions, you will want to use the
syntax for named functions. This is what you have seen up to this point.
There is an alternative, however: the unnamed, anonymous function. In
Python, anonymous functions are known as lambda functions, and they
have the following syntax:
lambda <Parameter>: <Expression>
where lambda is the keyword designating a lambda function, <Parameter> is
an input parameter, and <Expression> is a single expression evaluated using
the parameter. The result of <Expression> is the return value. This is how
you define a lambda function that adds one to an input value:
lambda x: x + 1
In general, your code will be easier to read, use, and debug if you avoid
lambda functions, but one useful place for them is when a simple function
is applied as an argument to another. Listing 6.21 defines the function
apply_to_list(data, my_func), which takes a list and a function as
arguments. When you call this function with the intention of adding 1 to
each member of the list, the lambda function is an elegant solution.
Listing 6.21 Lambda Function
def apply_to_list(data, my_func):
    '''Applies a function to items in a list.'''
    for item in data:
        print(f'{my_func(item)}')
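A call that adds 1 to each member of a list might look like the following sketch. The input list [1, 2, 3] is an arbitrary choice, and the name incremented is introduced only to show the same lambda used outside apply_to_list().

```python
def apply_to_list(data, my_func):
    '''Applies a function to items in a list.'''
    for item in data:
        print(f'{my_func(item)}')

# The list [1, 2, 3] is an arbitrary example input.
apply_to_list([1, 2, 3], lambda x: x + 1)

# The same lambda can also be bound to a name and called directly.
add_one = lambda x: x + 1
incremented = [add_one(item) for item in [1, 2, 3]]
```

Passing the lambda inline avoids defining a named function that is used only once.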
Summary
Functions, which are important building blocks in constructing complex
programs, are reusable named blocks of code. Functions are documented
with docstrings. Functions can accept parameters in a number of ways. A
function uses a return statement to pass a value at the end of its execution.
Decorators are special functions that wrap other functions. Anonymous, or
lambda, functions are unnamed.
Questions
For Questions 1–3, refer to Listing 6.22.
Listing 6.22 Functions for Questions 1–3
def add_prefix(word, prefix='before-'):
    '''Prepend a word.'''
    return f'{prefix}{word}'
def return_one():
    return 1
def wrapper():
    print('a')
    retval = return_one()
    print('b')
    print(retval)
In This Chapter
Introducing third-party libraries
Creating NumPy arrays
Indexing and slicing arrays
Filtering array data
Array methods
Broadcasting
This is the first of this book’s chapters on Data Science Libraries. The
Python functionality explored so far in this book makes Python a powerful
generic language. The libraries covered in this part of the book make
Python dominant in data science. The first library we will look at, NumPy,
is the backbone of many of the other data science libraries. In this chapter,
you will learn about the NumPy array, which is an efficient
multidimensional data structure.
Third-Party Libraries
Python code is organized into libraries. All of the functionality you
have seen so far in this book is available in the Python Standard
Library, which is part of any Python installation. Third-party
libraries give you capabilities far beyond this. They are developed
and maintained by groups outside the organization that maintains
Python itself. The existence of these groups and libraries creates a
vibrant ecosystem that has kept Python a dominant player in the
programming world. Many of these libraries are available in the
Colab environment, and you can easily import them into a file. If
you are working outside Colab, you may need to install them, which
generally is done using the Python package manager, pip.
Once you have NumPy installed, you can import it. When you import any
library, you can change what it is called in your environment by using the
keyword as. NumPy is typically renamed np during import:
import numpy as np
When you have the library installed and imported, you can then access any
of NumPy’s functionality through the np object.
Creating Arrays
A NumPy array is a data structure that is designed to efficiently handle
operations on large data sets. These data sets can be of varying dimensions
and can contain numerous data types—though not in the same object.
NumPy arrays are used as input and output to many other libraries and are
used as the underpinning of other data structures that are important to data
science, such as those in Pandas and SciPy.
You can create arrays from other data structures or initialize them with set
values. Listing 7.1 demonstrates different ways to create a one-dimensional
array. You can see that the array object is displayed as having an internal list
as its data. Data is not actually stored in lists, but this representation makes
arrays easy to read.
Listing 7.1 Creating an Array
np.array([1,2,3]) # Array from list
array([1, 2, 3])
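Beyond building an array from a list, NumPy has functions that initialize arrays with set values. The following sketch shows a few of them. The variable names and the sizes used are arbitrary choices for illustration.

```python
import numpy as np

from_list = np.array([1, 2, 3])   # array from a list
zeros = np.zeros(3)               # three 0.0 values
ones = np.ones(3)                 # three 1.0 values
counting = np.arange(3)           # 0, 1, 2, like range()
filled = np.full(3, 7)            # three copies of the value 7

print(from_list, zeros, ones, counting, filled)
```

Note that np.zeros() and np.ones() produce floating-point values by default, whereas np.arange() with integer arguments produces integers.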
If you check the data type of the array, you see that it is np.ndarray:
type(oned)
numpy.ndarray
Note
ndarray is short for n-dimensional array.
As mentioned earlier, you can make arrays of many dimensions.
Two-dimensional arrays are used as matrices. Listing 7.3 creates a
two-dimensional array from a list of three three-element lists. You can see
that the resulting array has 3×3 shape and two dimensions.
Listing 7.3 Matrix from Lists
list_o_lists = [[1,2,3],
                [4,5,6],
                [7,8,9]]
twod = np.array(list_o_lists)
twod
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
twod.shape
(3, 3)
twod.ndim
2
You can produce an array with the same elements but different dimensions
by using the reshape method. This method takes the new shape as
arguments. Listing 7.4 demonstrates using a one-dimensional array to
produce a two-dimensional one and then producing one-dimensional and
three-dimensional arrays from the two-dimensional one.
Listing 7.4 Using reshape
oned = np.arange(12)
oned
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
twod = oned.reshape(3,4)
twod
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
twod.reshape(12)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
twod.reshape(2,2,3)
array([[[ 0, 1, 2],
[ 3, 4, 5]],
[[ 6, 7, 8],
[ 9, 10, 11]]])
The shape you provide for an array must be consistent with the number of
elements in it. For example, if you take the 12-element array twod and try to
set its dimensions with a shape that does not include 12 elements, you get
an error:
twod.reshape(2,3)
----------------------------------------------------------------
ValueError Traceback (most recent
<ipython-input-295-0b0517f762ed> in <module>
----> 1 twod.reshape(2,3)
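As a small sketch of handling this case in code, the following example catches the error raised by a mismatched shape. It also shows -1, which asks NumPy to work out one dimension from the number of elements. The names reshape_failed and inferred are hypothetical.

```python
import numpy as np

twod = np.arange(12).reshape(3, 4)

try:
    twod.reshape(2, 3)            # 2 * 3 = 6 elements, but twod has 12
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print(f'ValueError: {err}')

# Passing -1 for one dimension asks NumPy to infer it from the element count.
inferred = twod.reshape(2, -1)
print(inferred.shape)
```

Using -1 can save you from computing a dimension by hand, but the remaining dimensions must still divide the element count evenly.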
oned[3]
3
oned[-1]
20
oned[3:9]
array([3, 4, 5, 6, 7, 8])
For multidimensional arrays, you can supply one argument for each
dimension. If you omit the argument for a dimension, it defaults to all
elements of that dimension. So, if you supply a single number as an
argument to a two-dimensional array, that number will indicate which row
to return. If you supply single-number arguments for all dimensions, a
single element is returned. You can also supply a slice for any dimension. In
return you get a subarray of elements, whose dimensions are determined by
the length of your slices. Listing 7.6 demonstrates various options for
indexing and slicing a two-dimensional array.
Listing 7.6 Indexing and Slicing a Two-Dimensional Array
twod = np.arange(21).reshape(3,7)
twod
array([[ 0, 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12, 13],
[14, 15, 16, 17, 18, 19, 20]])
twod[2] # Accessing row 2
array([14, 15, 16, 17, 18, 19, 20])
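A few more of the options described above can be sketched with the same array. The particular indexes and slices here are arbitrary choices for illustration.

```python
import numpy as np

twod = np.arange(21).reshape(3, 7)

element = twod[2, 3]        # row 2, column 3 -> a single value
column = twod[:, 3]         # every row, column 3
block = twod[:2, 3:5]       # rows 0-1, columns 3-4
print(element, column, block.shape)
```

A single number for every dimension gives one element, whereas a slice in any dimension gives a subarray whose shape follows the lengths of the slices.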
You can assign new values to an existing array, much as you would with a
list, by using indexing and slicing. If you assign a value to a slice, the
whole slice is updated with the new value. Listing 7.7 demonstrates how to
update a single element and a slice of a two-dimensional array.
Listing 7.7 Changing Values in an Array
twod = np.arange(21).reshape(3,7)
twod
array([[ 0, 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12, 13],
[14, 15, 16, 17, 18, 19, 20]])
twod[0,0] = 33
twod
array([[33, 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12, 13],
[14, 15, 16, 17, 18, 19, 20]])
twod[1:,:3] = 0
twod
array([[33, 1, 2, 3, 4, 5, 6],
[ 0, 0, 0, 10, 11, 12, 13],
[ 0, 0, 0, 17, 18, 19, 20]])
Element-by-Element Operations
An array is not a sequence. Arrays do share some characteristics with lists,
and on some level it is easy to think of the data in an array as a list of lists.
There are many differences between arrays and sequences, however. One
area of difference is when performing operations between the items in two
arrays or two sequences.
Remember that when you do an operation such as multiplication with a
sequence, the operation is done to the sequence, not to its contents. So, if
you multiply a list by zero, the result is a list with a length of zero:
[1, 2, 3]*0
[]
You cannot multiply two lists, even if they are the same length:
[1, 2, 3]*[4, 5, 6]
----------------------------------------------------------------
TypeError Traceback (most recent
<ipython-input-325-f525a1e96937> in <module>
----> 1 [1, 2, 3]*[4, 5, 6]
You can write code to perform operations between the elements of lists. For
example, Listing 7.8 demonstrates looping through two lists in order to
create a third list that contains the results of multiplying pairs of elements.
The zip() function is used to pair up the two lists, yielding tuples that each
contain one element from each of the original lists.
Listing 7.8 Element-by-Element Operations with Lists
L1 = list(range(10))
L2 = list(range(10, 0, -1))
L1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
L2
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
L3 = []
for i, j in zip(L1, L2):
    L3.append(i*j)
L3
[0, 9, 16, 21, 24, 25, 24, 21, 16, 9]
array1 + array2
array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
array1 / array2
array([0. , 0.11111111, 0.25 , 0.42857143, 0.66666667,
1. , 1.5 , 2.33333333, 4. , 9. ])
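With arrays, the same element-by-element work needs no explicit loop. The following sketch assumes that array1 and array2 hold the same values as the lists L1 and L2 from Listing 7.8, which is consistent with the sums and quotients shown above.

```python
import numpy as np

# Assumed definitions, chosen to match the values used in the list example.
array1 = np.arange(10)
array2 = np.arange(10, 0, -1)

product = array1 * array2     # element-by-element, no explicit loop
total = array1 + array2
print(product)
print(total)
```

The product holds the same values as the list L3, but the operator is applied to the contents of the arrays rather than to the containers.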
Filtering Values
One of the most used aspects of NumPy arrays and the data structures built
on top of them is the ability to filter values based on conditions of your
choosing. In this way, you can use an array to answer questions about your
data.
Listing 7.10 shows a two-dimensional array of integers, called twod. A
second array, mask, has the same dimensions as twod, but it contains
Boolean values. mask specifies which elements from twod to return. The
resulting array contains the elements from twod whose corresponding
positions in mask have the value True.
Listing 7.10 Filtering Using Booleans
twod = np.arange(21).reshape(3,7)
twod
array([[ 0, 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12, 13],
[14, 15, 16, 17, 18, 19, 20]])
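As a minimal sketch of filtering with a Boolean array, the following example builds a mask by hand and uses it as an index. The choice of mask, which selects only the first column, is an assumption made for illustration.

```python
import numpy as np

twod = np.arange(21).reshape(3, 7)

# A hand-built mask (an assumption for illustration): True only in column 0.
mask = np.zeros(twod.shape, dtype=bool)
mask[:, 0] = True

selected = twod[mask]     # one-dimensional array of the chosen elements
print(selected)
```

The result is one-dimensional, because the selected elements need not form a rectangular block.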
Comparison operators that you have seen returning single Booleans before
return arrays when used with arrays. So, if you use the less-than operator (<)
against the array twod as follows, the result will be an array with True for
every item that is below five and False for the rest:
twod < 5
You can use this result as a mask to get only the values that are True with
the comparison. For example, Listing 7.11 creates a mask and then returns
only the values of twod that are less than 5.
Listing 7.11 Filtering Using Comparison
mask = twod < 5
mask
array([[ True,  True,  True,  True,  True, False, False],
       [False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False]])
twod[mask]
array([0, 1, 2, 3, 4])
As you can see, you can use comparison and order operators to easily
extract knowledge from data. You can also combine these comparisons to
create more complex masks. Listing 7.12 uses & to join two conditions to
create a mask that evaluates to True only for items meeting both conditions.
Listing 7.12 Filtering Using Multiple Comparisons
mask = (twod < 5) & (twod%2 == 0)
mask
array([[ True, False,  True, False,  True, False, False],
       [False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False]])
twod[mask]
array([0, 2, 4])
Note
Filtering using masks is a process that you will use time and time
again, especially with Pandas DataFrames, which are built on top of
NumPy arrays. You will learn about DataFrames in Chapter 9,
“Pandas.”
data2 = data1[:2,3:]
data2
array([[ 3, 4, 5],
[ 9, 10, 11]])
data2[1,2] = -1
data2
array([[ 3, 4, 5],
[ 9, 10, -1]])
data1
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, -1],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
This behavior can lead to bugs and miscalculations, but if you understand it,
you can gain some important benefits when working with large data sets. If
you want to change data from a slice or filtering operation without changing
it in the original array, you can make a copy. For example, in Listing 7.14,
notice that when an item is changed in the copy, the original array remains
unchanged.
Listing 7.14 Changing Values in a Copy
data1 = np.arange(24).reshape(4,6)
data1
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
data2 = data1[:2,3:].copy()
data2
array([[ 3, 4, 5],
[ 9, 10, 11]])
data2[1,2] = -1
data2
array([[ 3, 4, 5],
[ 9, 10, -1]])
data1
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
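If you are unsure whether two arrays share data, one way to check, sketched here, is the function np.shares_memory(). The names view and copied are hypothetical.

```python
import numpy as np

data1 = np.arange(24).reshape(4, 6)
view = data1[:2, 3:]             # slicing returns a view onto data1
copied = data1[:2, 3:].copy()    # copy() allocates separate data

print(np.shares_memory(data1, view))     # view and original overlap
print(np.shares_memory(data1, copied))   # copy does not

view[1, 2] = -1                  # writes through to data1[1, 5]
print(data1[1, 5])
```

A change made through the view shows up in the original array, whereas the copy keeps its own values.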
Listing 7.16 demonstrates some of the matrix operations that are available
with arrays. These include returning the transpose, returning matrix
products, and returning the diagonal. Remember that you can use the
multiplication operator (*) between arrays to perform element-by-element
multiplication. If you want to calculate the dot product of two matrices, you
need to use the @ operator or the .dot() method.
Listing 7.16 Matrix Operations
A1 = np.arange(9).reshape(3,3)
A1
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
A1.T # Transpose
array([[0, 3, 6],
[1, 4, 7],
[2, 5, 8]])
A2 = np.ones(9).reshape(3,3)
A2
array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
A1 @ A2 # Matrix product
array([[ 3., 3., 3.],
[12., 12., 12.],
[21., 21., 21.]])
A1.diagonal() # Diagonal
array([0, 4, 8])
An array, unlike many sequence types, can contain only one data type. You
cannot have an array that contains both strings and integers. If you do not
specify the data type, NumPy guesses the type, based on the data. Listing
7.17 shows that when you start with integers, NumPy sets the data type to
int64 (the default on most platforms). You can also see, by checking the
nbytes attribute, that the data for this array takes 800 bytes of memory:
100 elements at 8 bytes each.
Listing 7.17 Setting Type Automatically
darray = np.arange(100)
darray
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
darray.dtype
dtype('int64')
darray.nbytes
800
For larger data sets, you can control the amount of memory used by setting
the data type explicitly. The int8 data type can represent numbers from
–128 to 127, so it is adequate for a data set of 0–99. You can set an
array’s data type at creation by using the parameter dtype. Listing 7.18 does
this to bring the size of the data down to 100 bytes.
Listing 7.18 Setting Type Explicitly
darray = np.arange(100, dtype=np.int8)
darray
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99],
      dtype=int8)
darray.nbytes
100
Note
You can see the many available NumPy data types at
https://numpy.org/devdocs/user/basics.types.html.
Because an array can store only one data type, you cannot insert data that
cannot be cast to that data type. For example, if you try to add a string to the
int8 array, you get an error:
A subtle error with array type occurs if you add to an array data of a finer
granularity than the array’s data type; this can lead to data loss. For
example, say that you add the floating-point number 0.5 to the int8 array:
darray[14] = 0.5
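Both situations can be sketched in a few lines. The string 'a' and the flag string_rejected are assumptions introduced for the demonstration. The array matches the int8 array from Listing 7.18.

```python
import numpy as np

darray = np.arange(100, dtype=np.int8)

# A string that cannot be parsed as an integer is rejected.
try:
    darray[14] = 'a'
    string_rejected = False
except ValueError:
    string_rejected = True

# A float is accepted, but its fractional part is dropped.
darray[14] = 0.5
print(string_rejected, darray[14])
```

The string assignment fails loudly, but the float assignment succeeds silently and stores 0, which is why this kind of data loss is easy to miss.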
Broadcasting
You can perform operations between arrays of different dimensions.
Operations can be done when, for each dimension, the sizes are the same or
the size is one for at least one of the arrays. Listing 7.19 adds 1 to each
element of the array A1 three different ways: first with an array of ones with
the same dimensions (3, 3), then with a one-dimensional array of three ones,
which is treated as having the dimensions (1, 3), and finally by using the
integer 1.
Listing 7.19 Broadcasting
A1 = np.array([[1,2,3],
               [4,5,6],
               [7,8,9]])
A2 = np.array([[1,1,1],
               [1,1,1],
               [1,1,1]])
A1 + A2
array([[ 2, 3, 4],
[ 5, 6, 7],
[ 8, 9, 10]])
A2 = np.array([1,1,1])
A1 + A2
array([[ 2, 3, 4],
[ 5, 6, 7],
[ 8, 9, 10]])
A1 + 1
array([[ 2, 3, 4],
[ 5, 6, 7],
[ 8, 9, 10]])
In all three cases, the result is the same: an array of dimension (3, 3). This is
called broadcasting because a dimension of one is expanded to fit the
higher dimension. So if you do an operation with arrays of dimensions (1,
3, 4, 4) and (5, 3, 4, 1), the resulting array will have the dimensions (5, 3, 4,
4). Broadcasting does not work with dimensions that are different but not
one.
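The shape rule described above can be checked with a short sketch. The arrays are filled with ones only because the values do not matter here. The names a, b, c, d, and broadcast_failed are hypothetical.

```python
import numpy as np

a = np.ones((1, 3, 4, 4))
b = np.ones((5, 3, 4, 1))
print((a + b).shape)              # size-one dimensions are stretched

# Sizes 3 and 4 differ and neither is one, so these cannot be combined.
c = np.ones((2, 3))
d = np.ones((2, 4))
try:
    c + d
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)
```

The first addition produces an array with the dimensions (5, 3, 4, 4), whereas the second raises a ValueError.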
Listing 7.20 does an operation on arrays with the dimensions (2, 1, 5) and
(2, 7, 1). The resulting array has the dimensions (2, 7, 5).
Listing 7.20 Expanding Dimensions
A4 = np.arange(10).reshape(2,1,5)
A4
array([[[0, 1, 2, 3, 4]],
[[5, 6, 7, 8, 9]]])
A5 = np.arange(14).reshape(2,7,1)
A5
array([[[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6]],
[[ 7],
[ 8],
[ 9],
[10],
[11],
[12],
[13]]])
A6 = A4 - A5
A6
array([[[ 0, 1, 2, 3, 4],
[-1, 0, 1, 2, 3],
[-2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1],
[-4, -3, -2, -1, 0],
[-5, -4, -3, -2, -1],
[-6, -5, -4, -3, -2]],
[[-2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1],
[-4, -3, -2, -1, 0],
[-5, -4, -3, -2, -1],
[-6, -5, -4, -3, -2],
[-7, -6, -5, -4, -3],
[-8, -7, -6, -5, -4]]])
A6.shape
(2, 7, 5)
NumPy Math
In addition to the NumPy array, the NumPy library offers many
mathematical functions, including trigonometric functions, logarithmic
functions, and arithmetic functions. These functions are designed to be
performed with NumPy arrays and are often used in conjunction with data
types in other libraries. This section takes a quick look at NumPy
polynomials.
NumPy offers the class poly1d for modeling one-dimensional polynomials.
To use this class, you need to import it from NumPy:
from numpy import poly1d
Then create a polynomial object, giving the coefficients as an argument:
poly1d((4,5))
poly1d([4, 5])
If for a second argument you supply the value True, the first argument is
interpreted as roots rather than coefficients. The following example models
the polynomial resulting from the calculation (x – 4)(x – 3)(x – 2)(x – 1):
r = poly1d([4,3,2,1], True)
print(r)
   4      3      2
1 x - 10 x + 35 x - 50 x + 24
The poly1d class is just one of the many specialized mathematical tools
offered in the NumPy toolkit. These tools are used in conjunction with
many of the other specialized tools that you will learn about in the coming
chapters.
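A poly1d object can also be called like a function and combined with other polynomials. The following is a minimal sketch that reuses the two polynomials shown above; the variable names other than r are assumptions:

```python
from numpy import poly1d

p = poly1d([4, 5])               # coefficients: models 4x + 5
r = poly1d([4, 3, 2, 1], True)   # roots: models (x - 4)(x - 3)(x - 2)(x - 1)

value_at_2 = p(2)     # evaluate 4x + 5 at x = 2
r_at_root = r(3)      # evaluating at a root gives zero
product = p * r       # multiplying polynomials returns a new poly1d

print(value_at_2)
print(r.coeffs)       # coefficients, highest power first
print(product.order)  # degree of the product polynomial
```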
Summary
The third-party library NumPy is a workhorse for doing data science in
Python. Even if you don’t use NumPy arrays directly, you will encounter
them because they are building blocks for many other libraries. Libraries
such as SciPy and Pandas build directly on NumPy arrays. NumPy arrays
can be made in many dimensions and data types. You can tune them to
control memory consumption by controlling their data type. They are
designed to be efficient with large data sets.
Questions
1. Name three differences between NumPy arrays and Python lists.
2. Given the following code, what would you expect for the final value
of d2?
d1 = np.array([[0, 1, 3],
[4, 2, 9]])
d2 = d1[:, 1:]
3. Given the following code, what would you expect for the final value
of d1[0,2]?
d1 = np.array([[0, 1, 3],
[4, 2, 9]])
d2 = d1[:, 1:]
d2[0,1] = 0
4. If you add two arrays of dimensions (1, 2, 3) and (5, 2, 1), what will
be the resulting array’s dimensions?
5. Use the poly1d class to model the following polynomial:
   4     3     2
6 x + 2 x + 5 x + x - 10
8
SciPy
Most people use statistics like a drunk man uses a lamppost; more for
support than illumination.
Andrew Lang
In This Chapter
Math with NumPy
Introduction to SciPy
scipy.misc submodule
scipy.special submodule
scipy.stats submodule
SciPy Overview
The SciPy library is a collection of packages that build on NumPy to
provide tools for scientific computing. It includes submodules that deal with
optimization, Fourier transformations, signal processing, linear algebra,
image processing, and statistics, among others. This chapter touches on
three submodules: the scipy.misc submodule, the scipy.special
submodule, and scipy.stats, which is the submodule most useful for data
science.
This chapter also uses the library matplotlib for some examples. It has
visualization capabilities for numerous plot types as well as images. The
convention is to import its plotting module, pyplot, under the name plt:
import matplotlib.pyplot as plt
special.perm(10,2)
90.0
Note
scipy.special includes some raw statistical functions, but they are not
meant for direct use. Rather, you use the scipy.stats submodule for your
statistics needs. This submodule is discussed next.
Discrete Distributions
SciPy offers some discrete distributions that share some common methods.
These common methods are demonstrated in Listing 8.2 using a binomial
distribution. A binomial distribution involves some number of trials, with
each trial having either a success or failure outcome.
Listing 8.2 Binomial Distribution
from scipy import stats
B = stats.binom(20, 0.3) # Define a binomial distribution consisting of
# 20 trials, each with a 30% chance of success
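The common methods mentioned above can be sketched as follows. This is an illustrative selection, not a complete list; the sample size and the random_state seed are assumptions chosen to keep the example small and repeatable:

```python
from scipy import stats

B = stats.binom(20, 0.3)   # 20 trials, 30% chance of success per trial

mean = B.mean()   # expected number of successes (20 * 0.3)
std = B.std()     # standard deviation of the number of successes
p6 = B.pmf(6)     # probability of exactly 6 successes
c6 = B.cdf(6)     # probability of 6 or fewer successes

# Draw random variates; each value is the success count of one experiment
samples = B.rvs(size=1000, random_state=0)

print(mean, std)
print(p6, c6)
print(samples[:10])
```

The same method names are available on the other discrete distributions in scipy.stats.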
You can use matplotlib to plot it and get a sense of its shape (see Figure
8.2):
import matplotlib.pyplot as plt
rvs = B.rvs(size=10000) # Draw random variates (sample size chosen for illustration)
plt.hist(rvs)
plt.show()
Figure 8.2 Binomial Distribution
The numbers along the bottom of the distribution in Figure 8.2 represent the
number of successes in each 20-trial experiment. You can see that 6 out of
20 is the most common result, and it matches the 30% success rate.
Another distribution modeled in the scipy.stats submodule is the Poisson
distribution. This distribution models the probability of a certain number of
individual events happening within some interval of time. The shape of the
distribution is controlled by its mean, which you can set by using the mu
keyword. For example, a lower mean, such as 3, concentrates the distribution
toward the left side of the plot, as shown in Figure 8.3:
P = stats.poisson(mu=3)
rvs = P.rvs(size=10000)
rvs
array([4, 4, 2, ..., 1, 0, 2])
plt.hist(rvs)
plt.show()
Figure 8.3 Poisson Distribution Skewed Left
A higher mean, such as 15, pushes the distribution to the right, as you can
see in Figure 8.4:
P = stats.poisson(mu=15)
rvs = P.rvs(size=100000)
plt.hist(rvs)
plt.show()
Figure 8.4 Poisson Distribution Skewed Right
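The effect of the mean can also be checked numerically, without a plot. This sketch compares the two means used above; the names low and high are assumptions chosen for illustration:

```python
from scipy import stats

low = stats.poisson(mu=3)
high = stats.poisson(mu=15)

# For a Poisson distribution, the mean and the variance both equal mu
print(low.mean(), low.var())
print(high.mean(), high.var())

# Probability of seeing zero events: far more likely with the lower mean
p_zero_low = low.pmf(0)
p_zero_high = high.pmf(0)
print(p_zero_low, p_zero_high)
```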
Continuous Distributions
The scipy.stats submodule includes many more continuous than discrete
distributions; it has 87 continuous distributions as of this writing. These
distributions all take arguments for location (loc) and scale (scale). They
all default to a location of 0 and scale of 1.0.
One continuous distribution modeled is the Normal distribution, which may
be familiar to you as the bell curve. In this symmetric distribution, half of
the data is to the left of the mean and half to the right. Here’s how you can
make a normal distribution using the default location and scale:
N = stats.norm()
rvs = N.rvs(size=100000)
plt.hist(rvs, bins=1000)
plt.show()
N1.var() # Variance
2500.0
N1.median()# Median
30.0
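The location and scale arguments described earlier determine summary values such as these. The following sketch assumes a Normal distribution with a location of 30 and a scale of 50; these values and the name dist are chosen for illustration:

```python
from scipy import stats

# For the Normal distribution, loc is the mean and scale is the
# standard deviation
dist = stats.norm(loc=30, scale=50)

mean = dist.mean()
std = dist.std()
var = dist.var()        # the variance is the scale squared
median = dist.median()  # equals the mean because the curve is symmetric

print(mean, std, var, median)
```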
Note
If you try the examples shown here, some of your values may differ
due to random number generation.
You can see that Figure 8.7 displays a curve as you would expect from an
exponential function. The following is a uniform distribution, which has a
constant probability and is also known as a rectangular distribution:
U = stats.uniform()
rvs = U.rvs(size=10000)
rvs
array([8.24645026e-01, 5.02358065e-01, 4.95390940e-01, ...,
8.63031657e-01, 1.05270200e-04, 1.03627699e-01])
plt.hist(rvs, bins=1000)
plt.show()
This distribution gives an even probability over a set range. Its plot is
shown in Figure 8.8.
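The range of a uniform distribution is set by the same location and scale arguments: it runs from loc to loc plus scale. The following is a small sketch; the shifted distribution and its values are assumptions chosen for illustration:

```python
from scipy import stats

U = stats.uniform()                       # defaults cover the range 0 to 1
shifted = stats.uniform(loc=2, scale=3)   # covers the range 2 to 5

# The density is constant everywhere inside the range
density_a = U.pdf(0.25)
density_b = U.pdf(0.75)

# Outside the range, the density is zero
density_outside = shifted.pdf(1.0)

# Half of the probability lies below the midpoint of the range
below_midpoint = shifted.cdf(3.5)

print(density_a, density_b, density_outside, below_midpoint)
```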
Summary
The NumPy and SciPy libraries both offer utilities for solving complex
mathematical problems. These two libraries cover a great breadth and depth
of resources, and entire books have been devoted to their application. You
have seen only a few of the many capabilities. These libraries are the first
places you should look when you embark on solving or modeling complex
mathematical problems.
Questions
1. Use the scipy.stats submodule to model a Normal distribution with a
mean of 15.
2. Generate 25 random samples from the distribution modeled in
Question 1.
3. Which scipy submodule has utilities designed for mathematical
physics?
4. What method is provided with a discrete distribution to calculate its
standard deviation?
9
Pandas
In This Chapter
Introduction to Pandas DataFrames
Creating DataFrames
DataFrame introspection
Accessing data
Manipulating DataFrames
Manipulating DataFrame data
The Pandas DataFrame, which is built on top of the NumPy array, is probably
the most commonly used data structure. DataFrames are like supercharged
spreadsheets in code. They are one of the primary tools used in data science.
This chapter looks at creating DataFrames, manipulating DataFrames,
accessing data in DataFrames, and manipulating that data.
About DataFrames
A Pandas DataFrame, like a spreadsheet, is made up of columns and rows.
Each column is a pandas.Series object. A DataFrame is, in some ways,
similar to a two-dimensional NumPy array, with labels for the columns and
index. Unlike a NumPy array, however, a DataFrame can contain different
data types. You can think of a pandas.Series object as a one-dimensional
NumPy array with labels. The pandas.Series object, like a NumPy array, can
contain only one data type. The pandas.Series object can use many of the
same methods you have seen with arrays, such as min(), max(), mean(), and
median().
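The relationship between a Series and a DataFrame can be sketched as follows. The labels, column names, and values are assumptions chosen for illustration:

```python
import pandas as pd

# A Series is a one-dimensional set of values of one data type, with labels
s = pd.Series([3, 1, 4, 1, 5], index=["a", "b", "c", "d", "e"])

print(s.min(), s.max(), s.mean(), s.median())
print(s["c"])   # access a value by its label

# A DataFrame can hold columns of different data types
df = pd.DataFrame({"count": [1, 2, 3],
                   "name": ["x", "y", "z"]})

column = df["count"]   # each column is a pandas.Series object
print(type(column))
print(df.dtypes)       # one data type per column
```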
Creating DataFrames
You can create DataFrames with data from many sources, including
dictionaries and lists and, more commonly, by reading files. You can create an
empty DataFrame by using the DataFrame constructor:
df = pd.DataFrame()
print(df)
Empty DataFrame
Columns: []
Index: []