Data Science Essentials
Foundations of Applied Mathematics
Jeffrey Humpherys & Tyler J. Jarvis, managing editors
List of Contributors
All contributors are affiliated with Brigham Young University, except J. Grout (Drake University).
B. Barker, E. Evans, R. Evans, J. Grout, J. Humpherys, T. Jarvis, J. Whitehead, J. Adams, J. Bejarano, Z. Boyd, M. Brown, A. Carr, C. Carter, T. Christensen, M. Cook, R. Dorff, B. Ehlert, K. Harmer, J. Hendricks, A. Henriksen, I. Henriksen, C. Hettinger, S. Horst, K. Jacobson, R. Jenkins, J. Leete, J. Lytle, E. Manner, R. McMurray, S. McQuarrie, D. Miller, J. Morrise, M. Morrise, A. Morrow, R. Murray, J. Nelson, E. Parkinson, M. Fabiano, K. Finlinson, J. Fisher, R. Flores, R. Fowers, A. Frandsen, R. Fuhriman, T. Gledhill, S. Giddens, C. Gigena, M. Graham, F. Glines, C. Glover, M. Goodwin, R. Grout, D. Grundvig, E. Hannesson, M. Probst, M. Proudfoot, D. Reber, H. Ringer, C. Robertson, M. Russell, R. Sandberg, C. Sawyer, M. Stauffer, E. Steadman, J. Stewart, S. Suggs, A. Tate, T. Thompson, M. Victors, E. Walker, J. Webb, R. Webb, J. West, A. Zaitzeff
This project is funded in part by the National Science Foundation, grant no. TUES Phase II
DUE-1323785.
Preface
This lab manual is designed to accompany the textbook Foundations of Applied Mathematics
by Humpherys and Jarvis. While the Volume 3 text focuses on statistics and rigorous data analysis,
these labs aim to introduce experienced Python programmers to common tools for obtaining, cleaning,
organizing, and presenting data. The reader should be familiar with Python [VD10] and its NumPy
[Oli06, ADH+ 01, Oli07] and Matplotlib [Hun07] packages before attempting these labs. See the
Python Essentials manual for introductions to these topics.
©This work is licensed under the Creative Commons Attribution 3.0 United States License.
You may copy, distribute, and display this copyrighted work only if you give credit to Dr. J. Humpherys.
All derivative works must include an attribution to Dr. J. Humpherys as the owner of this work as
well as the web address to
https://github.com/Foundations-of-Applied-Mathematics/Labs
as the original source of this work.
To view a copy of the Creative Commons Attribution 3.0 License, visit
http://creativecommons.org/licenses/by/3.0/us/
or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105,
USA.
Contents
Preface
I Labs
1 Unix Shell 1: Introduction
2 Unix Shell 2
3 SQL 1: Introduction
4 SQL 2 (The Sequel)
5 Regular Expressions
6 Web Technologies
7 Web Scraping
8 Web Crawling
9 Pandas 1: Introduction
12 Geopandas
14 MongoDB
II Appendices
Bibliography
Part I
Labs
1
Unix Shell 1: Introduction
Lab Objective: Unix is a popular operating system that is commonly used for servers and is the basis of most open-source software. Using Unix for writing and submitting labs will develop a foundation
for future software development. In this lab we explore the basics of the Unix shell, including how
to navigate and manipulate files, access remote machines with Secure Shell, and use Git for basic
version control.
Unix was first developed by AT&T Bell Labs in the 1970s. In the 1990s, Unix became the
foundation of the Linux and Mac OS X operating systems. Most servers are Linux-based, so knowing
how to use Unix shells allows us to interact with servers and other Unix-based machines.
A Unix shell is a program that takes commands from a user and executes those commands on
the operating system. We interact with the shell through a terminal (also called a command line), a
program that lets you type in commands and gives those commands to the shell for execution.
Note
Windows is not built on Unix, but it does come with a terminal called PowerShell, which uses a different command syntax. We will not cover the equivalent commands in the Windows terminal, but you can download a Unix-style terminal such as Git Bash or Cygwin to complete this lab on a Windows machine (though certain commands will still be unavailable). Alternatively, Windows 10 offers the Windows Subsystem for Linux (WSL), which runs a Linux environment within Windows.
Note
For this lab we will be working in the UnixShell1 directory provided with the lab materials.
If you have not yet downloaded the code repository, follow steps 1 through 6 in the Getting
Started guide found at https://foundations-of-applied-mathematics.github.io/ before
proceeding with this lab. Make sure to run the download_data.sh script as described in step
5 of Getting Started; otherwise you will not have the necessary files to complete this lab.
#!/bin/bash
echo "Hello World!"
The first line, #!/bin/bash, tells the computer to use the bash interpreter to run the script, and where this interpreter is located. The #! is called the shebang or hashbang character sequence. It is followed by the absolute path to the bash interpreter.
To run a bash script, type bash <script name> into the terminal. Alternatively, you can execute any script by typing ./<script name>, but note that the script must have executable permissions for this to work. (We will learn more about permissions later in the lab.)
$ bash hello_world.sh
Hello World!
Navigation
Typically, people navigate computers by clicking on icons to open folders and programs. In the
terminal, instead of point and click we use typed commands to move from folder to folder. In the
Unix shell, we call folders directories. The file system is a set of nested directories containing files
and other directories.
Begin by opening a terminal. The text you see in the upper left of the terminal is called the
prompt. Before you start creating or deleting files, you’ll want to know where you are. To see what
directory you are currently working in, type pwd into the prompt. This command stands for print
working directory, and it prints out a string telling you your current location.
To see all the contents of your current directory, type the command ls, list segments.
~$ pwd
/home/username
~$ ls
Desktop Downloads Public Videos
Documents Pictures
The command cd, change directory, allows you to navigate directories. To change to a new directory, type the cd command followed by the name of the directory to which you want to move (if you cd into a file, you will get an error). You can move up one directory by typing cd .. (two periods).
Two important directories are the root directory and the home directory. You can navigate to the home directory by typing cd ~ or just cd. You can navigate to the root directory by typing cd /.
Problem 1. To begin, open a terminal and navigate to the UnixShell1/ directory provided
with this lab. Use ls to list the contents. There should be a file called Shell1.zip and a script
called unixshell1.sh. a
Run unixshell1.sh. This script will do the following:
3. Execute various shell commands, to be added in the next few problems in this lab
Now, open the unixshell1.sh script in a text editor. Add commands to the script to do
the following:
• Print a string telling you the directory you are currently working in.
Test your commands by running the script again and checking that it prints a string
ending in the location Shell1/.
a If the necessary data files are not in your directory, cd one directory up by typing cd .. and type bash
download_data.sh to download the data files for each lab.
$ man ls
LS(1) User Commands LS(1)
NAME
ls - list directory contents
SYNOPSIS
ls [OPTION]... [FILE]...
DESCRIPTION
List information about the FILEs (the current directory by default).
-a, --all
do not ignore entries starting with .
The apropos <keyword> command will list all Unix commands that have <keyword> contained
somewhere in their manual page names and descriptions. For example, if you forget how to copy
files, you can type in apropos copy and you’ll get a list of all commands that have copy in their
description.
Flags
When you use man, you will see a list of options such as -a, -A, --author, etc. that modify how a command functions. These are called flags. You can use one flag on a command by typing <command> -<flag>, like ls -a, or combine multiple flags by typing <command> -<flag1><flag2>, etc., as in ls -alt.
For example, sometimes directories contain hidden files, which are files whose names begin with
a dot character like .bash. The ls command, by default, does not list hidden files. Using the -a
flag specifies that ls should not ignore hidden files. Find more common flags for ls in Table 1.1.
Flags Description
-a Do not ignore hidden files and folders
-l List files and folders in long format
-r Reverse order while sorting
-R Print files and subdirectories recursively
-s Print item name and size
-S Sort by size
-t Sort output by date modified
$ ls
file1.py file2.py
$ ls -a
. .. file1.py file2.py .hiddenfile.py
Problem 2. Within the script, add a command using ls to print one list of the contents of
Shell1/ with the following criteria:
• List the files and folders in long format (include the permissions, date last modified, etc.)
Test your command by entering it into the terminal within Shell1/ or by running the script
and checking for the desired output.
~/Test$ ls
file1.py NewDirectory newfile.py
To copy a file into a directory, use cp <filename> <dir_name>. When making a copy of a
directory, use the -r flag to recursively copy files contained in the directory. If you try to copy a
directory without the -r, the command will return an error.
Moving files and directories follows a similar format, except no -r flag is used when moving one
directory into another. The command mv <filename> <dir_name> will move a file to a folder and
mv <dir1> <dir2> will move the first directory into the second.
If you want to rename a file, use mv <file_old> <file_new>; the same goes for directories.
~/Test$ ls
file1.py NewDirectory newfile.py
When deleting files, use rm <filename>, and when deleting a directory, use rm -r <dir_name
>. The -r flag tells the terminal to recursively remove all the files and subfolders within the targeted
directory.
If you want to make sure your command is doing what you intend, the -v flag tells rm, cp, or
mkdir to print strings in the terminal describing what it is doing.
When your terminal gets too cluttered, use clear to clean it up.
Commands Description
clear Clear the terminal screen
cp file1 dir1 Create a copy of file1 and move it to dir1/
cp file1 file2 Create a copy of file1 and name it file2
cp -r dir1 dir2 Create a copy of dir1/ and all its contents into dir2/
mkdir dir1 Create a new directory named dir1/
mkdir -p path/to/new/dir1 Create dir1/ and all intermediate directories
mv file1 dir1 Move file1 to dir1/
mv file1 file2 Rename file1 as file2
rm file1 Delete file1 [-i, -v]
rm -r dir1 Delete dir1/ and all items within dir1/ [-i, -v]
touch file1 Create an empty file named file1
Table 1.2 contains all the commands we have discussed so far. Commonly used flags for some
commands are contained in square brackets; use man or --help to see what these mean.
Problem 3. Add commands to the unixshell1.sh script to make the following changes in
Shell1/:
Test your commands by running the script and then using ls within Shell1/ to check that
each directory was deleted, created, or changed correctly.
Wildcards
As we are working in the file system, there will be times that we want to perform the same command
to a group of similar files. For example, you may need to move all text files within a directory to a
new directory. Rather than copy each file one at a time, we can apply one command to several files
using wildcards. We will use the * and ? wildcards. The * wildcard represents any string and the ?
wildcard represents any single character. Though these wildcards can be used in almost every Unix
command, they are particularly useful when dealing with files.
$ ls
File1.txt File2.txt File3.jpg text_files
$ mv -v *.txt text_files/
File1.txt -> text_files/File1.txt
File2.txt -> text_files/File2.txt
$ ls
File3.jpg text_files
Command Description
*.txt All files that end with .txt.
image* All files that have image as the first 5 characters.
*py* All files that contain py in the name.
doc*.txt All files of the form doc1.txt, doc2.txt, docA.txt, etc.
Problem 4. Within the Shell1/ directory, there are many files. Add commands to the script
to organize these files into directories using wildcards. Organize by completing the following:
Command Description
find dir1 -type f -name "word" Find all files in dir1/ (and its subdirectories) called word
(-type f is for files; -type d is for directories)
grep "word" filename Find all occurrences of word within filename
grep -nr "word" dir1 Find all occurrences of word within the files inside dir1/
(-n lists the line number; -r performs a recursive search)
Table 1.4 contains basic syntax for using these two commands. There are many more variations of syntax for grep and find, however. You can use man grep and man find to explore other options for using these commands.
$ ls -l
-rw-rw-r-- 1 username groupname 194 Aug 5 20:20 calc.py
drw-rw-r-- 1 username groupname 373 Aug 5 21:16 Documents
-rwxr-x--x 1 username groupname 27 Aug 5 20:22 mult.py
-rw-rw-r-- 1 username groupname 721 Aug 5 20:23 project.py
The first character of each line denotes the type of the item whether it be a normal file, a
directory, a symbolic link, etc. The next nine characters denote the permissions associated with that
file.
For example, look at the output for mult.py. The first character - denotes that mult.py is a
normal file. The next three characters, rwx, tell us the owner can read, write, and execute the file.
The next three characters, r-x, tell us members of the same group can read and execute the file, but
not edit it. The final three characters, --x, tell us other users can execute the file and nothing more.
Permissions can be modified using the chmod command. There are multiple notations used
to modify permissions, but the easiest to use when we want to make small modifications to a file’s
permissions is symbolic permissions notation. See Table 1.5 for more examples of using symbolic
permissions notation, as well as other useful commands for working with permissions.
$ ls -l script1.sh
total 0
-rw-r--r-- 1 c c 0 Aug 21 13:06 script1.sh
Command Description
chmod u+x file1 Add executing (x) permissions to user (u)
chmod g-w file1 Remove writing (w) permissions from group (g)
chmod o-r file1 Remove reading (r) permissions from other users (o)
chmod a+w file1 Add writing permissions to everyone (a)
chown change owner
chgrp change group
getfacl view all permissions of a file in a readable format.
Running Files
To run a file for which you have execution permissions, type the file name preceded by ./.
$ ./hello.sh
bash: ./hello.sh: Permission denied
$ ls -l hello.sh
-rw-r--r-- 1 username groupname 31 Jul 30 14:34 hello.sh

# Add executing permissions for the user, then run the script again.
$ chmod u+x hello.sh
$ ./hello.sh
Hello World!
Problem 5. Within Shell1/, there is a script called organize_photos.sh. First, use find
to locate the script. Once you know the file location, add commands to your script so that it
completes the following tasks:
Test that the script has been executed by checking that additional files have been moved into
the Photos/ directory. Check that permissions have been updated on the script by using ls -l.
Secure Shell
Secure Shell (SSH) allows you to remotely access other computers or servers securely. SSH is a net-
work protocol encrypted using public-key cryptography. It ensures that all communication between
your computer and the remote server is secure and encrypted.
The system you are connecting to is called the host, and the system you are connecting from
is called the client. The first time you connect to a host, you will receive a warning saying the
authenticity of the host can’t be established. This warning is a default, and appears when you are
connecting to a host you have not connected to before. When asked if you would like to continue
connecting, select yes.
When prompted for your password, type your password as normal and press enter. No charac-
ters will appear on the screen, but they are still being logged. Once the connection is established,
there is a secure tunnel through which commands and files can be exchanged between the client and
host. To end a secure connection, type exit.
Secure Copy
To copy files from one computer to another, you can use the Unix command scp, which stands for
secure copy protocol. The syntax for scp is essentially the same as the syntax for cp.
To copy a file from your computer to a specific location on a remote machine, use the syntax scp <file1> <user@remote_host:file_path>. As with cp, to copy a directory and all of its contents, use the -r flag.
Commands Description
ssh username@remote_host Establish a secure connection with remote_host
scp file1 user@remote_host:file_path/ Create a copy of file1 on host
scp -r dir1 user@remote_host:file_path/ Create a copy of dir1 and its contents on host
scp user@remote_host:file_path/file1 file_path2 Create a local copy of file on client
Git
Git is a version control system, meaning that it keeps a record of changes in a file. Git also facilitates
collaboration between people working on the same code. It does both these things by managing
updates between an online code repository and copies of the repository, called clones, stored locally
on computers.
We will be using git to submit labs and return feedback on those labs. If git is not already
installed on your computer, download it at http://git-scm.com/downloads.
Using Git
Git manages the history of a file system through commits, or checkpoints. Each time a new commit
is added to the online repository, a checkpoint is created so that if need be, you can use or look back
at an older version of the repository. You can use git log to see a list of previous commits. You
can also use git status to see the files that have been changed in your local repository since the
last commit.
Before making your own changes, you’ll want to add any commits from other clones into your
local repository. To do this, use the command git pull origin master.
Once you have made changes and want to make a new commit, there are normally three steps.
To save these changes to the online repository, first add the changed files to the staging area, a list of
files to save during the next commit, with git add <filename(s)>. If you want to add all changes
that you have made to tracked files (files that are already included in the online repository), use
git add -u.
Next, save the changes in the staging area with git commit -m "<A brief message describing
the changes>".
Finally, add the changes in this commit to the online repository with git push origin master.
Figure 1.1: Exchanging git commits between the online repository and a local clone.
Merge Conflicts
Git maintains order by raising an alert when changes are made to the same file in different clones and
neither clone contains the changes made in the other. This is called a merge conflict, which happens
when someone else has pushed a commit that you do not yet have, while you have also made one or
more commits locally that they do not have.
Achtung!
When pulling updates with git pull origin master, your terminal may sometimes display a merge conflict message.
Note
Vim is a terminal text editor available on essentially any computer you will use. When working with remote machines through ssh, vim is often the only text editor available. To exit vim, press esc, then type :wq and press enter. To learn more about vim, visit the official documentation at https://vimhelp.org.
Command Explanation
git status Display the staging area and untracked changes.
git pull origin master Pull changes from the online repository.
git push origin master Push changes to the online repository.
git add <filename(s)> Add a file or files to the staging area.
git add -u Add all modified, tracked files to the staging area.
git commit -m "<message>" Save the changes in the staging area with a given message.
git checkout <filename> Revert changes to an unstaged file since the last commit.
git reset HEAD <filename> Remove a file from the staging area, but keep changes.
git diff <filename> See the changes to an unstaged file since the last commit.
git diff --cached <filename> See the changes to a staged file since the last commit.
git config --local <option> Record your credentials (user.name, user.email, etc.).
Problem 7. Using git commands, push unixshell1.sh and UnixShell1.tar.gz to your on-
line git repository. Do not add anything else in the UnixShell1/ directory to the online
repository.
2
Unix Shell 2
Lab Objective: Introduce system management, calling Unix Shell commands within Python, and
other advanced topics. As in the last Unix lab, most of the learning comes not from finishing the problems but from following the examples.
Note that you will need to have administrative rights to download this package. To unzip a file, use
unzip.
Note
To begin this lab, unzip the Shell2.zip file into your UnixShell2/ directory using a terminal
command.
extracting: Shell2/Scripts/fifteen_secs
extracting: Shell2/Scripts/script3
extracting: Shell2/Scripts/hello.sh...
While the zip file format is more popular on the Windows platform, the tar utility is more
common in the Unix environment.
Note
When submitting this lab, you will need to archive and compress your entire Shell2/ directory
into a file called Shell2.tar.gz and push Shell2.tar.gz as well as shell2.py to your online
repository.
If you are doing multiple submissions, make sure to delete your previous Shell2.tar.gz
file before creating a new one from your modified Shell2/ directory. Refer to Unix1 for more
information on deleting files.
As a final note, please do not push the entire directory to your online repository. Only push Shell2.tar.gz and shell2.py.
The example below demonstrates how to archive and compress our Shell2/ directory. The -z flag calls for the gzip compression tool, the -v flag calls for verbose output, the -p flag tells the tool to preserve file permissions, and the -f flag indicates that the next parameter will be the name of the archive file. Note that the -f flag must always come last.
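The sketch below is one way to issue such a command (it assumes the archive is to be called Shell2.tar.gz, that Shell2/ sits in the current working directory, and it adds the -c flag, which tar needs in order to create a new archive). Here the command is launched through Python's subprocess module, introduced later in this lab; the same command can also be typed directly at the prompt.

>>> import subprocess
# -c creates the archive; -z, -v, -p, and -f behave as described above.
>>> subprocess.call(["tar", "-zcvpf", "Shell2.tar.gz", "Shell2/"])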
The Unix file system presents many opportunities for the manipulation, viewing, and editing of files. Before moving on to more complex commands, we will look at some of the commands available to view the contents of a file.
The cat command, followed by the filename, will display all the contents of a file on the
terminal screen. This can be problematic if you are dealing with a large file. There are a few
available commands to control the output of cat in the terminal. See Table 2.1.
As an example, use less <filename> to restrict the number of lines that are shown. With this
command, use the arrow keys to navigate up and down and press q to exit.
Command Description
cat Print all of the file contents
more Print the file contents one page at a time, navigating forwards
less Like more, but you navigate forward and backwards
head Print the first 10 lines of a file
head -nk Print the first k lines of a file
tail Print the last 10 lines of a file
tail -nk Print the last k lines of a file
$ cd Shell2/Files/Feb
# Output the number of lines in assignments.txt.
$ cat assignments.txt | wc -l
9
# Sort the files by file size and output file names and their size.
$ ls -s | sort -nr
4 project3.py
4 project2.py
4 assignments.txt
4 pics
total 16
In addition to piping commands together, when working with files specifically, we can use
redirects. A redirect, represented as < in the terminal, passes the file to a terminal command.
To save a command’s output to a file, we can use > or >>. The > operator will overwrite anything
that may exist in the output file whereas >> will append the output to the end of the output file.
Examples of redirects and writing to a file are given below.
# Gets the same result as the first command in the above example.
$ wc -l < assignments.txt
9
# Writes the number of lines in the assignments.txt file to word_count.txt.
$ wc -l < assignments.txt >> word_count.txt
Problem 1. The words.txt file in the Documents/ directory contains a list of words that are
not in alphabetical order. Write an alphabetically sorted list of words in words.txt to a new
file in your Documents/ called sortedwords.txt using pipes and redirects. After you write the
alphabetized words to the designated file, also write the number of words in words.txt to the
end of sortedwords.txt. Save this file in the Documents/ directory. Try to accomplish this
with a total of two commands or fewer.
Resource Management
To be able to optimize performance, it is valuable to be aware of the resources, specifically hard drive
space and computer memory, being used.
Job Control
One way to monitor and optimize performance is through job control. Any time you start a program in the terminal (running a script, opening IPython, etc.), that program is called a job.
You can run a job in the foreground and also in the background. When we run a program in the
foreground, we see and interact with it. Running a script in the foreground means that we will not
be able to enter any other commands in the terminal while the script is running. However, if we
choose to run it in the background, we can enter other commands and continue interacting with other
programs while the script runs.
Consider the scenario where we have multiple scripts that we want to run. If we know that these scripts will take a while, we can run them all in the background while we work on something else. Table 2.2 lists some common commands that are used in job control. We strongly encourage you to experiment with some of these commands.
Command Description
COMMAND & Adding an ampersand to the end of a command
runs the command in the background
bg %N Restarts the Nth interrupted job in the background
fg %N Brings the Nth job into the foreground
jobs Lists all the jobs currently running
kill %N Terminates the Nth job
ps Lists all the current processes
Ctrl-C Terminates current job
Ctrl-Z Interrupts current job
nohup Run a command that will not be killed if the user logs out
The fifteen_secs and five_secs scripts in the Scripts/ directory take fifteen seconds and five seconds to execute, respectively. The Python file fifteen_secs.py in the Python/ directory also takes fifteen seconds to execute; it counts to fifteen and then outputs "Success!". These will be particularly useful as you experiment with these commands.
Remember that when you run scripts with ./ instead of other commands, you will probably need to change permissions first. For more information on changing permissions, review Unix Shell 1. Run the following command sequence from the Shell2 directory.
$ ./Scripts/fifteen_secs &
$ ps
PID TTY TIME CMD
6 tty1 00:00:00 bash
59 tty1 00:00:00 fifteen_secs
60 tty1 00:00:00 sleep
61 tty1 00:00:00 ps
# Stop fifteen_secs
$ kill 59
$ ps
PID TTY TIME CMD
6 tty1 00:00:00 bash
60 tty1 00:00:00 sleep
61 tty1 00:00:00 ps
[1]+ Terminated ./fifteen_secs
Problem 2. In addition to the five_secs and fifteen_secs scripts, the Scripts/ folder
contains three scripts (named script1, script2, and script3) that each take about forty-
five seconds to execute. From the Scripts directory, execute each of these commands in the
background in the following order: script1, script2, and script3. Do this so all three are
running at the same time. While they are all running, write the output of jobs to a new file
log.txt saved in the Scripts/ directory.
(Hint: In order to get the same output as the solutions file, you need to run the ./ command
and not the bash command.)
In addition to these, Python has a few extra functions that are useful for file management and
shell commands. See Table 2.4. The two functions os.walk() and glob.glob() are especially useful
for doing searches like find and grep. Look at the example below and then try out a few things on
your own to try to get a feel for them.
Function Description
os.walk() Iterate through the subfolders and subfolder files of a given directory.
os.path.isdir() Return True if the input is a directory.
os.path.isfile() Return True if the input is a file.
os.path.join() Join several folder names or file names into one path.
glob.glob() Return a list of file names that match a pattern.
subprocess.call() Execute a shell command.
subprocess.check_output() Execute a shell command and return its output as a string.
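For instance, a rough sketch (the results depend entirely on the contents of your own directories):

>>> import os
>>> from glob import glob
# File names in the current directory that end in .py.
>>> glob("*.py")
# Match .py files in subdirectories as well.
>>> glob("**/*.py", recursive=True)
# Walk the directory tree: each step yields a directory path, its
# subdirectories, and the files it contains.
>>> for directory, subfolders, filenames in os.walk('.'):
...     print(directory, filenames)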
Problem 3. Write a Python function grep() that accepts the name of a target string and
a file pattern. Find all files in the current directory or its subdirectories that match the file
pattern. Next, check within the contents of the matched file for the target string. For example,
grep("range()", "*.py") should search Python files for the command range(). Return a
list of the filenames that matched the file pattern and the target string.
$ cd Shell2/Scripts
$ python
>>> import subprocess
>>> subprocess.call(["ls", "-l"])
total 40
-rw-r--r-- 1 username groupname 20 Aug 26 2016 five_secs
-rw-r--r-- 1 username groupname 21 Aug 26 2016 script1
-rw-r--r-- 1 username groupname 21 Aug 26 2016 script2
-rw-r--r-- 1 username groupname 21 Aug 26 2016 script3
-rw-r--r-- 1 username groupname 21 Aug 26 2016 fifteen_secs
0
Function Description
subprocess.call() Run a Unix command
subprocess.check_output() Run a Unix command and record its output
subprocess.check_output().decode() Translate Unix command output to a string
subprocess.Popen() Pipe together Unix commands
# Decode() translates the result to a string.
>>> file_info = subprocess.check_output(["ls", "-l"]).decode()
>>> file_info.split('\n')
['total 40',
'-rw-r--r-- 1 username groupname 20 Aug 26 2016 five_secs',
'-rw-r--r-- 1 username groupname 21 Aug 26 2016 script1',
'-rw-r--r-- 1 username groupname 21 Aug 26 2016 script2',
'-rw-r--r-- 1 username groupname 21 Aug 26 2016 script3',
'-rw-r--r-- 1 username groupname 21 Aug 26 2016 fifteen_secs',
'']
Popen is a class of the subprocess module, with its own attributes and methods. It pipes together a few commands, similar to what we did at the beginning of the lab. This allows for more versatility in the shell input commands. If you wish to know more about the Popen class, see the subprocess documentation online.
$ cd Shell2
$ python
>>> import subprocess
>>> args = ["cat Files/Feb/assignments.txt | wc -l"]
# shell = True indicates to open a new shell process
# note that task is now an object of the Popen class
>>> task = subprocess.Popen(args, shell=True)
>>> 9
Achtung!
If shell commands depend on user input, the program is vulnerable to a shell injection attack.
This applies to Unix Shell commands as well as other situations like web browser interaction
with web servers. Be extremely careful when creating a shell process from Python. There are
specific functions, like shlex.quote(), that quote specific strings that are used to construct shell
commands. But, when possible, it is often better to avoid user input altogether. For example,
consider the following function.
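The function below is a hypothetical sketch of this kind of vulnerable code (the name inspect_file and its behavior are illustrative): it splices user input directly into a shell command.

>>> import subprocess
>>> def inspect_file(filename):
...     """List details about the given file by passing its name straight to the shell."""
...     return subprocess.check_output("ls -l " + filename, shell=True).decode()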
If inspect_file() is given the input ".; rm -rf /", then ls -l . is executed innocently,
and then rm -rf / destroys the computer by force deleting everything in the root directory.a
Be careful not to execute a shell command from within Python in a way that a malicious user
could potentially take advantage of.
a See https://en.wikipedia.org/wiki/Code_injection#Shell_injection for more example attacks.
Problem 4. Write a Python function that accepts an integer n. Search the current directory
and all subdirectories for the n largest files. Then sort the list of filenames from the largest to
the smallest files. Next, write the line count of the smallest file to a file called smallest.txt
into the current directory. Finally, return the list of filenames, including the file path, in order
from largest to smallest.
(Hint: the shell command ls -s shows file sizes.)
As a note, to get this problem correct, you need to return not just the filenames but the entire file paths. For example, instead of returning 'data.txt' as part of your list, return 'Files/Mar/docs/data.txt'.
Downloading Files
The Unix shell has tools for downloading files from the internet. The most popular are wget and
curl. At its most basic, curl is the more robust of the two while wget can download recursively.
This means that wget is capable of following links and directory structure when downloading content.
When we want to download a single file, we just need the URL for the file we want to download.
This works for PDF files, HTML files, and other content simply by providing the right URL.
$ wget https://github.com/Foundations-of-Applied-Mathematics/Data/blob/master/Volume1/dream.png
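# Download a file in the background with the -b flag.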
$ wget -b URL
Problem 5. The file urls.txt in the Documents/ directory contains a list of URLs. Download
the files in this list using wget and move them to the Photos/ directory.
sed s/str1/str2/g
This command will replace every instance of str1 with str2. More specific examples follow.
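For instance, a minimal sketch (file.txt is a hypothetical file; in keeping with this lab, the command is launched from Python, but it can equally be typed at the prompt). By default sed prints the edited text to the terminal and leaves the file itself unchanged; with GNU sed, adding the -i flag edits the file in place.

>>> import subprocess
# Replace every occurrence of 'str1' with 'str2' in file.txt and print the result.
>>> subprocess.call(["sed", "s/str1/str2/g", "file.txt"])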
Problem 6. prob6() is a function that accepts an integer n as input and returns three
different lists in the following order: a list of integers from 0 to n in increments of 1; a list of
integers from n to 0 in increments of −2; a list of integers from 0 to n in increments of 3.
It contains two syntax errors that are repeated in multiple locations. Look in your
shell2.py file and identify the syntax errors, but do not fix them yet. After you find them, use
sed commands to replace those errors with the correct commands. To test if your commands
worked, you can review your lab file that you edited, or just simply run prob6().
$ cd Shell2/Documents
$ ls -l | awk ' {print $1, $9} '
total
-rw-r--r--. assignments.txt
-rw-r--r--. doc1.txt
-rw-r--r--. doc2.txt
-rw-r--r--. doc3.txt
-rw-r--r--. doc4.txt
-rw-r--r--. files.txt
-rw-r--r--. lines.txt
-rw-r--r--. newfiles.txt
-rw-r--r--. people.txt
-rw-r--r--. review.txt
-rw-r--r--. urls.txt
-rw-r--r--. words.txt
Notice we pipe the output of ls -l to awk. When calling a command using awk, we have to
use quotation marks. It is a common mistake to forget to add these quotation marks. Inside these
quotation marks, commands always take the same format.
In the remaining examples we will not be using any of the options, but we will address various
actions.
In the Documents/ directory, you will find a people.txt file that we will use for the following
examples. In our first example, we use the print action. The $1 and $9 mean that we are going to
print the first and ninth fields.
Beyond specifying which fields we wish to print, we can also choose how many characters to
allocate for each field. This is done using the % command within the printf command, which allows
us to edit how the relevant data is printed. Look at the last part of the example below to see how it
is done.
# contents of people.txt
$ cat people.txt
male,John,23
female,Mary,31
female,Sally,37
male,Ted,19
male,Jeff,41
female,Cindy,25
# Change the field separator (FS) to a comma at the beginning of the run using BEGIN.
# Printing each field individually proves we have successfully separated the fields.
$ awk ' BEGIN{ FS = "," }; {print $1,$2,$3} ' < people.txt
male John 23
female Mary 31
female Sally 37
male Ted 19
male Jeff 41
female Cindy 25
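As a sketch using the same people.txt file (again launched from Python; the awk program inside the string can also be run directly in the terminal), printf controls how much space each field receives:

>>> import subprocess
# Gender in 6 characters (left justified), age in 2 characters (right justified),
# then the name at its full length.
>>> cmd = """awk ' BEGIN{ FS = "," }; {printf("%-6s %2s %s\\n", $1, $3, $2)} ' < people.txt"""
>>> subprocess.call(cmd, shell=True)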
The statement "%-6s %2s %s\n" formats the columns of the output. This says to set aside six
characters left justified, then two characters right justified, then print the last field to its full length.
Problem 7. Inside the Documents/ directory, you should find a file named files.txt. This
file contains details on approximately one hundred files. The different fields in the file are
separated by tabs. Using awk, sort, pipes, and redirects, write it to a new file in the current
directory named date_modified.txt with the following specifications:
• in the first column, print the date the file was modified
• sort the file from newest to oldest based on the date last modified
We have barely scratched the surface of what awk can do. Performing an internet search for
awk one-liners will give you many additional examples of useful commands you can run using awk.
Note
Remember to archive and compress your Shell2 directory before pushing it to your online
repository for grading.
Additional Material
Customizing the Shell
Though there are multiple Unix shells, one of the most popular is the bash shell. The bash shell
is highly customizable. In your home directory, you will find a hidden file named .bashrc. All
customization changes are saved in this file. If you are interested in customizing your shell, you can
customize the prompt using the PS1 environment variable. As you become more and more familiar
with the Unix shell, you will come to find there are commands you run over and over again. You
can save commands you use frequently with alias. If you would like more information on these and
other ways to customize the shell, you can find many quality reference guides and tutorials on the
internet.
System Management
In this section, we will address some of the basics of system management. As an introduction, the
commands in Table 2.6 are used to learn more about the computer system.
Command Description
passwd Change user password
uname View operating system name
uname -a Print all system information
uname -m Print machine hardware
w Show who is logged in and what they are doing
whoami Print userID of current user
3
SQL 1: Introduction
Lab Objective: Being able to store and manipulate large data sets quickly is a fundamental part of
data science. The SQL language is the classic database management system for working with tabular
data. In this lab we introduce the basics of SQL, including creating, reading, updating, and deleting
SQL tables, all via Python’s standard SQL interaction modules.
Relational Databases
A relational database is a collection of tables called relations. A single row in a table, called a tuple,
corresponds to an individual instance of data. The columns, called attributes or features, are data
values of a particular category. The collection of column headings is called the schema of the table,
which describes the kind of information stored in each entry of the tuples.
For example, suppose a database contains demographic information for M individuals. If a table
had the schema (Name, Gender, Age), then each row of the table would be a 3-tuple corresponding
to a single individual, such as (Jane Doe, F, 20) or (Samuel Clemens, M, 74.4). The table
would therefore be M × 3 in shape. Note that including a person’s age in a database means that
the data would quickly be outdated since people get older every year. A better choice would be to
use birth year. Another table with the schema (Name, Income) would be M × 2 if it included all M
individuals.
Figure 3.1: Illustration of a relation (table), its attributes (columns), and its tuples (rows). See https://en.wikipedia.org/wiki/Relational_database.
SQLite
The most common database management systems (DBMS) for relational databases are based on
Structured Query Language, commonly called SQL (pronounced1 “sequel”). Though SQL is a language
in and of itself, most programming languages have tools for executing SQL routines. In Python, the
most common variant of SQL is SQLite, implemented as the sqlite3 module in the standard library.
A SQL database is stored in an external file, usually marked with the file extension db or mdf.
These files should not be opened in Python with open() like text files; instead, any interactions with
the database—creating, reading, updating, or deleting data—should occur as follows.
1. Create a connection to the database with sqlite3.connect(). This creates a database file if
one does not already exist.
2. Get a cursor, an object that manages the actual traversal of the database, with the connection’s
cursor() method.
3. Alter or read data with the cursor’s execute() method, which accepts an actual SQL command
as a string.
4. Save any changes with the connection's commit() method, or revert changes with its rollback() method.
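A minimal sketch of this workflow is given below (my_database.db and MyTable are illustrative names, and the SELECT command stands in for any SQL command):

>>> import sqlite3 as sql
# Establish a connection to the database file (creating it if it doesn't exist).
>>> conn = sql.connect("my_database.db")
>>> try:
...     cur = conn.cursor()                      # Get a cursor.
...     cur.execute("SELECT * FROM MyTable")     # Execute a SQL command.
...     conn.commit()                            # Save the changes,
... except sql.Error:                            # or revert them if something goes wrong,
...     conn.rollback()
...     raise
... finally:
...     conn.close()                             # then close the connection.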
Achtung!
Some changes, such as creating and deleting tables, are automatically committed to the database
as part of the cursor’s execute() method. Be extremely cautious when deleting tables, as
the action is immediate and permanent. Most changes, however, do not take effect in the
database file until the connection’s commit() method is called. Be careful not to close the
connection before committing desired changes, or those changes will not be recorded.
The with statement can be used with open() so that file streams are automatically closed, even
in the event of an error. Likewise, combining the with statement with sql.connect() automatically
rolls back changes if there is an error and commits them otherwise. However, the actual database
connection is not closed automatically. With this strategy, the previous code block can be reduced
to the following.
>>> try:
... with sql.connect("my_database.db") as conn:
... cur = conn.cursor() # Get the cursor.
... cur.execute("SELECT * FROM MyTable") # Execute a SQL command.
... finally: # Commit or revert, then
... conn.close() # close the connection.
The CREATE TABLE command, together with a table name and a schema, adds a new table to
a database. The schema is a comma-separated list where each entry specifies the column name, the
column data type,2 and other optional parameters. For example, the following code adds a table
called MyTable with the schema (Name, ID, Age) with appropriate data types.
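A sketch of such a command follows (the column types INTEGER, REAL, and TEXT are standard SQLite types, chosen here for illustration):

>>> import sqlite3 as sql
>>> with sql.connect("my_database.db") as conn:
...     cur = conn.cursor()
...     cur.execute("CREATE TABLE MyTable (Name TEXT, ID INTEGER, Age REAL);")
>>> conn.close()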
The DROP TABLE command deletes a table. However, using CREATE TABLE to try to create a
table that already exists or using DROP TABLE to remove a nonexistent table raises an error. Use
DROP TABLE IF EXISTS to remove a table without raising an error if the table doesn’t exist. See
Table 3.1 for more table management commands.
2 Though SQLite does not force the data in a single column to be of the same type, most other SQL systems enforce
uniform column types, so it is good practice to specify data types in the schema.
Note
SQL commands like CREATE TABLE are often written in all caps to distinguish them from other
parts of the query, like the table name. This is only a matter of style: SQLite, along with most
other versions of SQL, is case insensitive. In Python’s SQLite interface, the trailing semicolon
is also unnecessary. However, most other database systems require it, so it’s good practice to
include the semicolon in Python.
Problem 1. Write a function that accepts the name of a database file. Connect to the database
(and create it if it doesn’t exist). Drop the tables MajorInfo, CourseInfo, StudentInfo, and
StudentGrades from the database if they exist. Next, add the following tables to the database
with the specified column names and types.
Remember to commit and close the database. You should be able to execute your function
more than once with the same input without raising an error.
To check the database, use the following commands to get the column names of a specified
table. Assume here that the database file is called students.db.
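For instance, a sketch for the StudentInfo table (repeat with the other table names); after a query, cur.description holds one tuple of metadata per column, with the column name first:

>>> import sqlite3 as sql
>>> with sql.connect("students.db") as conn:
...     cur = conn.cursor()
...     cur.execute("SELECT * FROM StudentInfo;")
...     print([description[0] for description in cur.description])
>>> conn.close()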
New rows are added to a table with the INSERT INTO command. With the basic syntax, SQLite assumes that the values match sequentially with the schema of the table; the schema can also be written explicitly for clarity.
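A minimal sketch of both forms (reusing the MyTable table from the CREATE TABLE example above; the row values are illustrative):

>>> import sqlite3 as sql
>>> with sql.connect("my_database.db") as conn:
...     cur = conn.cursor()
...     # Values listed in the same order as the schema (Name, ID, Age)...
...     cur.execute("INSERT INTO MyTable VALUES('Samuel Clemens', 1910421, 74.4);")
...     # ...or with the schema written out explicitly.
...     cur.execute("INSERT INTO MyTable(Name, ID, Age) "
...                 "VALUES('Samuel Clemens', 1910421, 74.4);")
>>> conn.close()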
Achtung!
Never use Python’s string operations to construct a SQL query from variables. Doing so
makes the program susceptible to a SQL injection attack.a Instead, use parameter substitution
to construct dynamic commands: use a ? character within the command, then provide the
sequence of values as a second argument to execute().
To insert several rows at a time to the same table, use the cursor object’s executemany()
method and parameter substitution with a list of tuples. This is typically much faster than using
execute() repeatedly.
# Insert (John Smith, 456, 40.5) and (Jane Doe, 123, 20) into MyTable.
>>> with sql.connect("my_database.db") as conn:
... cur = conn.cursor()
... rows = [('John Smith', 456, 40.5), ('Jane Doe', 123, 20)]
... cur.executemany("INSERT INTO MyTable VALUES(?,?,?);", rows)
Problem 2. Expand your function from Problem 1 so that it populates the tables with the
data given in Tables 3.2a–3.2d.
To validate your database, use the following command to retrieve the rows from a table.
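For instance, a sketch that prints every row of the MajorInfo table (swap in any other table name):

>>> import sqlite3 as sql
>>> with sql.connect("students.db") as conn:
...     cur = conn.cursor()
...     for row in cur.execute("SELECT * FROM MajorInfo;"):
...         print(row)
>>> conn.close()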
Problem 3. The data file us_earthquakes.csva contains data from about 3,500 earthquakes
in the United States since 1769. Each row records the year, month, day, hour, minute,
second, latitude, longitude, and magnitude of a single earthquake (in that order). Note that
latitude, longitude, and magnitude are floats, while the remaining columns are integers.
Write a function that accepts the name of a database file. Drop the table USEarthquakes
if it already exists, then create a new USEarthquakes table with schema (Year, Month, Day,
Hour, Minute, Second, Latitude, Longitude, Magnitude). Populate the table with the
data from us_earthquakes.csv. Remember to commit the changes and close the connection.
(Hint: using executemany() is much faster than using execute() in a loop.)
a Retrieved from https://datarepository.wolframcloud.com/resources/Sample-Data-US-Earthquakes.
Deleting or altering existing data in a database requires some searching for the desired row or rows.
The WHERE clause is a predicate that filters the rows based on a boolean condition. The operators
==, !=, <, >, <=, >=, AND, OR, and NOT all work as expected to create search conditions.
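For instance, rows are removed with DELETE FROM and modified with UPDATE; the sketch below uses MyTable with illustrative conditions:

>>> import sqlite3 as sql
>>> with sql.connect("my_database.db") as conn:
...     cur = conn.cursor()
...     # Remove every row with an ID less than 100.
...     cur.execute("DELETE FROM MyTable WHERE ID < 100;")
...     # Change the recorded age in the row (or rows) where the name is 'Jane Doe'.
...     cur.execute("UPDATE MyTable SET Age=21 WHERE Name == 'Jane Doe';")
>>> conn.close()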
If the WHERE clause were omitted from either of the previous commands, every record in MyTable
would be affected. Always use a very specific WHERE clause when removing or updating data.
Table 3.3: SQLite commands for inserting, removing, and updating rows.
Problem 4. Modify your function from Problems 1 and 2 so that in the StudentInfo table,
values of −1 in the MajorID column are replaced with NULL values.
Also modify your function from Problem 3 in the following ways.
1. Remove rows from USEarthquakes that have a value of 0 for the Magnitude.
2. Replace 0 values in the Day, Hour, Minute, and Second columns with NULL values.
Method Description
execute() Execute a single SQL command
executemany() Execute a single SQL command over different values
executescript() Execute a SQL script (multiple SQL commands)
fetchone() Return a single tuple from the result set
fetchmany(n) Return the next n rows from the result set as a list of tuples
fetchall() Return the entire result set as a list of tuples
# Get tuples of the form (StudentID, StudentName) from the StudentInfo table.
>>> cur.execute("SELECT StudentID, StudentName FROM StudentInfo;")
>>> cur.fetchone() # List the first match (a tuple).
(401767594, 'Michelle Fernandez')
>>> conn.close()
The WHERE predicate can also refine a SELECT command. If the condition depends on a column in a different table from the data that is being selected, create a table alias with the AS command to specify columns in the form table.column.
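For instance, a sketch on the students database (assuming StudentGrades has StudentID and CourseID columns) that gets the names of the students with a grade recorded for the course whose CourseID is 1:

>>> import sqlite3 as sql
>>> with sql.connect("students.db") as conn:
...     cur = conn.cursor()
...     cur.execute("SELECT SI.StudentName "
...                 "FROM StudentInfo AS SI, StudentGrades AS SG "
...                 "WHERE SI.StudentID == SG.StudentID AND SG.CourseID == 1;")
...     print(cur.fetchall())
>>> conn.close()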
Problem 5. Write a function that accepts the name of a database file. Assuming the database
to be in the format of the one created in Problems 1 and 2, query the database for all tuples
of the form (StudentName, CourseName) where that student has an “A” or “A+” grade in that
course. Return the list of tuples.
Aggregate Functions
A result set can be analyzed in Python using tools like NumPy, but SQL itself provides a few tools
for computing a few very basic statistics: AVG(), MIN(), MAX(), SUM(), and COUNT() are aggregate
functions that compress the columns of a result set into the desired quantity.
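For instance, a sketch that counts the rows of StudentGrades and finds the smallest StudentID in the table:

>>> import sqlite3 as sql
>>> with sql.connect("students.db") as conn:
...     cur = conn.cursor()
...     cur.execute("SELECT COUNT(*), MIN(StudentID) FROM StudentGrades;")
...     print(cur.fetchone())
>>> conn.close()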
Problem 6. Write a function that accepts the name of a database file. Assuming the database
to be in the format of the one created in Problem 3, query the USEarthquakes table for the
following information.
Create a single figure with two subplots: a histogram of the magnitudes of the earthquakes in
the 19th century, and a histogram of the magnitudes of the earthquakes in the 20th century.
Show the figure, then return the average magnitude of all of the earthquakes in the database.
Be sure to return an actual number, not a list or a tuple.
(Hint: use np.ravel() to convert a result set of 1-tuples to a 1-D array.)
Note
Problem 6 raises an interesting question: is the number of earthquakes in the United States
increasing with time, and if so, how drastically? A closer look shows that only 3 earthquakes
were recorded (in this data set) from 1700–1799, 208 from 1800–1899, and a whopping 3049
from 1900–1999. Is the increase in earthquakes due to there actually being more earthquakes, or
to the improvement of earthquake detection technology? The best answer without conducting
additional research is “probably both.” Be careful to question the nature of your data—how it
was gathered, what it may be lacking, what biases or lurking variables might be present—before
jumping to strong conclusions.
See the following for more info on the sqlite3 module and SQL in general.
• https://docs.python.org/3/library/sqlite3.html
• https://www.w3schools.com/sql/
• https://en.wikipedia.org/wiki/SQL_injection
Additional Material
Shortcuts for WHERE Conditions
Complicated WHERE conditions can be simplified with the following commands.
• IN: check for equality to one of several values quickly, similar to Python's in operator. In other words, a condition like value IN (a, b, c) is equivalent to value == a OR value == b OR value == c.
• BETWEEN: check two (inclusive) inequalities quickly. The following are equivalent.
SELECT * FROM MyTable WHERE AGE >= 20 AND AGE <= 60;
SELECT * FROM MyTable WHERE AGE BETWEEN 20 AND 60;
4
SQL 2 (The Sequel)
Lab Objective: Since SQL databases contain multiple tables, retrieving information about the
data can be complicated. In this lab we discuss joins, grouping, and other advanced SQL query
concepts to facilitate rapid data retrieval.
We will use the following database as an example throughout this lab, found in students.db.
Joining Tables
A join combines rows from different tables in a database based on common attributes. In other
words, a join operation creates a new, temporary table containing data from 2 or more existing
tables. Join commands in SQLite have the following general syntax.
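Roughly, a join takes the form SELECT <columns> FROM <table> AS <alias> JOIN <table> AS <alias> ON <matching condition>, optionally followed by a WHERE clause. A concrete sketch on the students database (assuming StudentGrades has StudentID and Grade columns) pairs each student's name with his or her grades:

>>> import sqlite3 as sql
>>> with sql.connect("students.db") as conn:
...     cur = conn.cursor()
...     cur.execute("SELECT SI.StudentName, SG.Grade "
...                 "FROM StudentInfo AS SI INNER JOIN StudentGrades AS SG "
...                 "ON SI.StudentID == SG.StudentID;")
...     print(cur.fetchall())
>>> conn.close()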
The ON clause tells the query how to join tables together. Typically if there are N tables being
joined together, there should be N − 1 conditions in the ON clause.
Inner Joins
An inner join creates a temporary table with the rows that have exact matches on the attribute(s)
specified in the ON clause. Inner joins intersect two or more tables, as in Figure 4.1a.
Figure 4.1: (a) An inner join of A, B, and C. (b) A left outer join of A with B and C.
For example, Table 4.1c (StudentInfo) and Table 4.1a (MajorInfo) both have a MajorID column, so the tables can be joined by pairing rows that have the same MajorID. Such a join temporarily creates a combined table (Table 4.2) that pairs each student with the information for his or her major.
Notice that this table is missing the rows where MajorID was NULL in the StudentInfo table.
This is because there was no match for NULL in the MajorID column of the MajorInfo table, so the
inner join throws those rows away.
Because joins deal with multiple tables at once, it is important to assign table aliases with the
AS command. Join statements can also be supplemented with WHERE clauses like regular queries.
Problem 1. Write a function that accepts the name of a database file. Assuming the database
to be in the format of Tables 4.1a–4.1d, query the database for the list of the names of students
who have a B grade in any course (not a B– or a B+).
Outer Joins
A left outer join, sometimes called a left join, creates a temporary table with all of the rows from the
first (left-most) table, and all the “matched” rows on the given attribute(s) from the other relations.
Rows from the left table that don’t match up with the columns from the other tables are supplemented
with NULL values to fill extra columns. Compare the code and its result below to the inner join in Table 4.2.
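A sketch of a left outer join on the same tables (every row of StudentInfo is kept, including those whose MajorID is NULL):

>>> import sqlite3 as sql
>>> with sql.connect("students.db") as conn:
...     cur = conn.cursor()
...     cur.execute("SELECT * "
...                 "FROM StudentInfo AS SI LEFT OUTER JOIN MajorInfo AS MI "
...                 "ON SI.MajorID == MI.MajorID;")
...     print(cur.fetchall())
>>> conn.close()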
Some flavors of SQL also support the RIGHT OUTER JOIN command, but sqlite3 does not
recognize the command since T1 RIGHT OUTER JOIN T2 is equivalent to T2 LEFT OUTER JOIN T1.
To use different kinds of joins in a single query, append one join statement after another. The
join closest to the beginning of the statement is executed first, creating a temporary table, and the
next join attempts to operate on that table. The following example performs an additional join on
Table 4.3 to find the name and major of every student who got a C in a class.
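A sketch of such a chained join (assuming the column names StudentName, MajorName, StudentID, and Grade); the left outer join is executed first so that students without a major are kept, and the inner join then matches students to their grades:

>>> import sqlite3 as sql
>>> with sql.connect("students.db") as conn:
...     cur = conn.cursor()
...     cur.execute("SELECT SI.StudentName, MI.MajorName "
...                 "FROM StudentInfo AS SI "
...                 "LEFT OUTER JOIN MajorInfo AS MI ON SI.MajorID == MI.MajorID "
...                 "INNER JOIN StudentGrades AS SG ON SI.StudentID == SG.StudentID "
...                 "WHERE SG.Grade == 'C';")
...     print(cur.fetchall())
>>> conn.close()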
In this last example, note carefully that Alfonso Phelps would have been excluded from the
result set if an inner join was performed first instead of an outer join (since he lacks a major).
Problem 2. Write a function that accepts the name of a database file. Query the database for
all tuples of the form (Name, MajorName, Grade) where Name is a student’s name and Grade
is their grade in Calculus. Only include results for students that are actually taking Calculus,
but be careful not to exclude students who haven’t declared a major.
Grouping Data
Many data sets can be naturally sorted into groups. The GROUP BY command gathers rows from a
table and groups them by a certain attribute. The groups are then combined by one of the aggregate
functions AVG(), MIN(), MAX(), SUM(), or COUNT(). The following code groups the rows in Table
4.1d by studentID and counts the number of entries in each group.
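A sketch of such a query (one row per student: the StudentID and the number of entries in that student's group):

>>> import sqlite3 as sql
>>> with sql.connect("students.db") as conn:
...     cur = conn.cursor()
...     cur.execute("SELECT StudentID, COUNT(*) "
...                 "FROM StudentGrades "
...                 "GROUP BY StudentID;")
...     print(cur.fetchall())
>>> conn.close()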
GROUP BY can also be used in conjunction with joins. The join creates a temporary table like
Tables 4.2 or 4.3, the results of which can then be grouped.
Just like the WHERE clause chooses rows in a relation, the HAVING clause chooses groups from the
result of a GROUP BY based on some criteria related to the groupings. For this particular command,
it is often useful (but not always necessary) to create an alias for the columns of the result set with
the AS operator. For instance, the result set of the previous example can be filtered down to only
contain students who are taking 3 courses.
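A sketch of that filtering step (aliasing the count as num_courses, then keeping only the groups with exactly 3 entries):

>>> import sqlite3 as sql
>>> with sql.connect("students.db") as conn:
...     cur = conn.cursor()
...     cur.execute("SELECT StudentID, COUNT(*) AS num_courses "
...                 "FROM StudentGrades "
...                 "GROUP BY StudentID "
...                 "HAVING num_courses == 3;")
...     print(cur.fetchall())
>>> conn.close()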
Problem 3. Write a function that accepts a database file. Query the database for the list of
the names of courses that have at least 5 students enrolled in them.
Problem 4. Write a function that accepts a database file. Query the given database for tuples
of the form (MajorName, N) where N is the number of students in the specified major. Sort the
results in ascending order by the count N, and then in alphabetical order by MajorName. Include
NULL majors.
Problem 5. Write a function that accepts a database file. Query the database for tuples of
the form (StudentName, MajorName) where the last name of the specified student begins with
the letter C. Include Null majors.
Case Expressions
A case expression maps the values in a column using boolean logic. There are two forms of a case
expression: simple and searched. A simple case expression matches and replaces specified attributes.
A searched case expression involves using a boolean expression at each step, instead of listing
all of the possible values for an attribute.
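A sketch of both forms, with the same assumed table and column names as before:
# Simple case expression: match Grade against listed values and replace them.
>>> cur.execute("SELECT StudentID, CASE Grade "
...             "    WHEN 'A' THEN 4.0 "
...             "    WHEN 'B' THEN 3.0 "
...             "    WHEN 'C' THEN 2.0 "
...             "    ELSE 0.0 END "
...             "FROM StudentGrades;").fetchall()

# Searched case expression: evaluate a boolean condition at each step.
>>> cur.execute("SELECT StudentID, CASE "
...             "    WHEN Grade IN ('A+', 'A', 'A-') THEN 'Excellent' "
...             "    WHEN Grade IS NULL THEN 'Unknown' "
...             "    ELSE 'Other' END "
...             "FROM StudentGrades;").fetchall()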
Chaining Queries
The result set of any SQL query is really just another table with data from the original database.
Separate queries can be made from result sets by enclosing the entire query in parentheses. For these
sorts of operations, it is very important to carefully label the columns resulting from a subquery.
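For instance, the grouped result from before can itself be queried; note the aliases given to the
subquery and to its count column (the schema is still assumed).
# Query the result set of the grouped query as if it were its own table.
>>> cur.execute("SELECT results.StudentID, results.num_courses "
...             "FROM (SELECT StudentID, COUNT(*) AS num_courses "
...             "      FROM StudentGrades "
...             "      GROUP BY StudentID) AS results "
...             "WHERE results.num_courses >= 3;").fetchall()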
Problem 6. Write a function that accepts the name of a database file. Query the database for
tuples of the form (StudentName, N, GPA) where N is the number of courses that the specified
student is enrolled in and GPA is their grade point average based on the following point system.
5
Regular Expressions
Lab Objective: Cleaning and formatting data are fundamental problems in data science. Regular
expressions are an important tool for working with text carefully and efficiently, and are useful for both
gathering and cleaning data. This lab introduces regular expression syntax and common practices,
including an application to a data cleaning problem.
A regular expression or regex is a string of characters that follows a certain syntax to specify a
pattern. Strings that follow the pattern are said to match the expression (and vice versa). A single
regular expression can match a large set of strings, such as the set of all valid email addresses.
Achtung!
There are some universal standards for regular expression syntax, but the exact syntax varies
slightly depending on the program or language. However, the syntax presented in this lab
(for Python) is sufficiently similar to any other regex system. Consider learning to use regular
expressions in Vim or your favorite text editor, keeping in mind that there will be slight syntactic
differences from what is presented here.
>>> import re
>>> pattern = re.compile("cat") # Make a pattern object for finding 'cat'.
>>> bool(pattern.search("cat")) # 'cat' matches 'cat', of course.
True
>>> bool(pattern.match("catfish")) # 'catfish' starts with 'cat'.
True
>>> bool(pattern.match("fishcat")) # 'fishcat' doesn't start with 'cat'.
False
>>> bool(pattern.search("fishcat")) # but it does contain 'cat'.
True
>>> bool(pattern.search("hat")) # 'hat' does not contain 'cat'.
False
Most of the functions in the re module are shortcuts for compiling a pattern object and calling
one of its methods. Using re.compile() is good practice because the resulting object is reusable,
while each call to re.search() compiles a new (but redundant) pattern object. For example, the
following lines of code are equivalent.
>>> bool(re.compile("cat").search("catfish"))
True
>>> bool(re.search("cat", "catfish"))
True
Problem 1. Write a function that compiles and returns a regular expression pattern object
with the pattern string "python".
The following string characters (separated by spaces) are metacharacters in Python’s regular expres-
sions, meaning they have special significance in a pattern string: . ^ $ * + ? { } [ ] \ | ( ).
A regular expression that matches strings with one or more metacharacters requires two things.
1. Use raw strings instead of regular Python strings by prefacing the string with an r, such as
r"cat". The resulting string interprets backslashes as actual backslash characters, rather than
the start of an escape sequence like \n or \t.
2. Preface any metacharacters with a backslash to indicate a literal character. For example, to
match the string "$3.99? Thanks.", use r"\$3\.99\? Thanks\.".
Without raw strings, every backslash has to be written as a double backslash, which makes many
regular expression patterns hard to read ("\\$3\\.99\\? Thanks\\.").
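For example, the escaped pattern above matches the literal text, and only the literal text:
>>> dollar = re.compile(r"\$3\.99\? Thanks\.")
>>> bool(dollar.search("$3.99? Thanks."))
True
>>> bool(dollar.search("$3x99? Thanks."))   # The '.' is literal, so 'x' does not match.
False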
Problem 2. Write a function that compiles and returns a regular expression pattern object
that matches the string "^{@}(?)[%]{.}(*)[_]{&}$".
Hint: There are online sites like https://regex101.com/ that can help you check your
answers.
The regular expressions of Problems 1 and 2 only match strings that are or include the exact
pattern. The metacharacters allow regular expressions to have much more flexibility and control
so that a single pattern can match a wide variety of strings, or a very specific set of strings. The
line anchor metacharacters ^ and $ are used to match the start and the end of a line of text,
respectively. This shrinks the matching set, even when using the search() method instead of the
match() method. For example, the only single-line string that the expression '^x$' matches is 'x',
whereas the expression 'x' can match any string with an 'x' in it.
The pipe character | is a logical OR in a regular expression: A|B matches A or B. The parentheses
() create a group in a regular expression. A group establishes an order of operations in an expression.
For example, in the regex "^one|two fish$", precedence is given to the invisible string concatenation
between "two" and "fish", while "^(one|two) fish$" gives precedence to the '|' metacharacter.
Notice that the pipe is inside the group.
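A small example of the grouped pattern in action:
>>> fish = re.compile(r"^(one|two) fish$")
>>> for test in ["one fish", "two fish", "red fish", "one two fish"]:
...     print(test + ':', bool(fish.search(test)))
one fish: True
two fish: True
red fish: False
one two fish: False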
Problem 3. Write a function that compiles and returns a regular expression pattern object
that matches the following strings, and no other strings, even with re.search().
"Book store" "Mattress store" "Grocery store"
"Book supplier" "Mattress supplier" "Grocery supplier"
Character Classes
The hard bracket metacharacters [ and ] are used to create character classes, a part of a regular
expression that can match a variety of characters. For example, the pattern [abc] matches any of
the characters a, b, or c. This is different than a group delimited by parentheses: a group can match
multiple characters, while a character class matches only one character. For instance, [abc] does
not match ab or abc, and (abc) matches abc but not ab or even a.
Within character classes, there are two additional metacharacters. When ^ appears as the
first character in a character class, right after the opening bracket [, the character class matches
anything not specified instead. In other words, ^ is the set complement operation on the character
class. Additionally, the dash - specifies a range of values. For instance, [0-9] matches any digit,
and [a-z] matches any lowercase letter. Thus [^0-9] matches any character except for a digit, and
[^a-z] matches any character except for lowercase letters. Keep in mind that the dash -, when at
the beginning or end of the character class, will match the literal '-'. Note that [0-27-9] acts like
[(0-2)|(7-9)].
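As a sketch, the following pairs two anchored two-character patterns with several test strings; the
exact patterns here are assumptions, chosen to be consistent with the annotated output below.
>>> p1 = re.compile(r"^[a-z][^0-7]$")        # A lowercase letter, then anything but 0-7.
>>> p2 = re.compile(r"^[^abcA-C][0-27-9]$")  # Anything but a, b, c, or A-C, then 0-2 or 7-9.
>>> for test in ["d5", "e7", "d8", "aa", "E9", "EE", "d88"]:
...     print(test + ':', bool(p1.search(test)), bool(p2.search(test)))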
...
d8: True True
aa: True False # a is not in [^abcA-C] or [0-27-9].
E9: False True # E is not in [a-z].
EE: False False # E is not in [a-z] or [0-27-9].
d88: False False # Too many characters.
There are also a variety of shortcuts that represent common character classes, listed in Table
5.1. Familiarity with these shortcuts makes some regular expressions significantly more readable.
Character Description
\b Matches the empty string, but only at the start or end of a word.
\s Matches any whitespace character; equivalent to [ \t\n\r\f\v].
\S Matches any non-whitespace character; equivalent to [^\s].
\d Matches any decimal digit; equivalent to [0-9].
\D Matches any non-digit character; equivalent to [^\d].
\w Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
\W Matches any non-alphanumeric character; equivalent to [^\w].
Any of the character class shortcuts can be used within other custom character classes. For
example, [_A-Z\s] matches an underscore, capital letter, or whitespace character.
Finally, a period . matches any character except for a line break. This is a very powerful
metacharacter; be careful to only use it when part of the regular expression really should match any
character.
The following table is a useful recap of some common regular expression metacharacters.
Character Description
. Matches any character except a newline.
^ Matches the start of the string.
$ Matches the end of the string or just before the newline at the end of the string.
| A|B creates a regular expression that will match either A or B.
[...] Indicates a set of characters. A ^ as the first character indicates a complementing set.
(...) Matches the regular expression inside the parentheses.
The contents can be retrieved or matched later in the string.
Repetition
The remaining metacharacters are for matching a specified number of characters. This allows a single
regular expression to match strings of varying lengths.
Character Description
* Matches 0 or more repetitions of the preceding regular expression.
+ Matches 1 or more repetitions of the preceding regular expression.
? Matches 0 or 1 of the preceding regular expression.
{m,n} Matches from m to n repetitions of the preceding regular expression.
*?, +?, ??, {m,n}? Non-greedy versions of the previous four special characters.
Each of the repetition operators acts on the expression immediately preceding it. This could
be a single character, a group, or a character class. For instance, (abc)+ matches abc, abcabc,
abcabcabc, and so on, but not aba or cba. On the other hand, [abc]* matches any sequence of a,
b, and c, including abcabc and aabbcc.
The curly braces {} specify a custom number of repetitions allowed. {,n} matches up to n
instances, {m,} matches at least m instances, {k} matches exactly k instances, and {m,n} matches
from m to n instances. Thus the ? operator is equivalent to {,1} and + is equivalent to {1,}.
Be aware that line anchors are especially important when using repetition operators. Consider
the following (bad) example and compare it to the previous example.
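A sketch of such a comparison (the pattern and test strings here are our own illustration):
>>> bad = re.compile(r"a{3}")      # Without anchors: matches anywhere in the string.
>>> good = re.compile(r"^a{3}$")   # With anchors: must be the entire line.
>>> for test in ["aaa", "aaaa", "aaaab"]:
...     print(test + ':', bool(bad.search(test)), bool(good.search(test)))
aaa: True True
aaaa: True False
aaaab: True False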
The unexpected matches occur because "aaa" is at the beginning of each of the test strings. With
the line anchors ^ and $, the search truly only matches the exact string "aaa".
Problem 4. A valid Python identifier (a valid variable name) is any string starting with an al-
phabetic character or an underscore, followed by any (possibly empty) sequence of alphanumeric
characters and underscores.
A valid Python parameter definition is defined as the concatenation of the following strings:
• any valid Python identifier
• any number of spaces
• (optional) an equals sign followed by any number of spaces and ending with one of the
following: any real number, a single quote followed by any number of non-single-quote
characters followed by a single quote, or any valid python identifier
Define a function that compiles and returns a regular expression pattern object that
matches any valid Python parameter definition.
(Hint: Use the \w character class shortcut to keep your regular expression clean.)
To help in debugging, the following examples may be useful. These test cases are a good
start, but are not exhaustive. The first table should match valid Python identifiers. The second
should match a valid python parameter definition, as defined in this problem. Note that some
strings which would be valid in python will not be for this problem.
Matches: "Mouse" "compile" "_123456789" "__x__" "while"
Non-matches: "3rats" "err*r" "sq(x)" "sleep()" " x"
Matches: "max=4.2" "string= ''" "num_guesses"
Non-matches: "300" "is_4=(value==4)" "pattern = r'^one|two fish$'"
Method Description
match() Match a regular expression pattern to the beginning of a string.
fullmatch() Match a regular expression pattern to all of a string.
search() Search a string for the presence of a pattern.
sub() Substitute occurrences of a pattern found in a string.
subn() Same as sub, but also return the number of substitutions made.
split() Split a string by the occurrences of a pattern.
findall() Find all occurrences of a pattern in a string.
finditer() Return an iterator yielding a match object for each match.
Some substitutions require remembering part of the text that the regular expression matches.
Groups are useful here: each group in the regular expression can be represented in the substitution
string by \n, where n is an integer (starting at 1) specifying which group to use.
# Find words that start with 'cat', remembering what comes after the 'cat'.
>>> pig_latin = re.compile(r"\bcat(\w*)")
>>> target = "Let's catch some catfish for the cat"
The repetition operators ?, +, *, and {m,n} are greedy, meaning that they match the largest
string possible. On the other hand, the operators ??, +?, *?, and {m,n}? are non-greedy, meaning they
match the smallest strings possible. This is very often the desired behavior for a regular expression.
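For example, compare the greedy and non-greedy versions of the same pattern:
>>> re.findall(r"<.*>", "<a> <b> <c>")    # Greedy: grab as much as possible.
['<a> <b> <c>']
>>> re.findall(r"<.*?>", "<a> <b> <c>")   # Non-greedy: stop at the first '>'.
['<a>', '<b>', '<c>']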
Finally, there are a few customizations that make searching larger texts manageable. Each of
these flags can be used as keyword arguments to re.compile().
Flag Description
re.DOTALL . matches any character at all, including the newline.
re.IGNORECASE Perform case-insensitive matching.
re.MULTILINE ^ matches the beginning of lines (after a newline) as well as the string;
$ matches the end of lines (before a newline) as well as the end of the string.
The re.MULTILINE flag makes '^' and '$' match at line breaks as well, so a pattern can be
anchored to individual lines within a longer string. For example, how would we match "World" in
the string "Hello\nWorld"? Compiling the pattern with re.MULTILINE allows it to match at the
beginning of each new line, instead of just the beginning of the string. The following shows how to
implement multiline searching:
>>> pattern1 = re.compile("^W")
>>> pattern2 = re.compile("^W", re.MULTILINE)
>>> bool(pattern1.search("Hello\nWorld"))
False
>>> bool(pattern2.search("Hello\nWorld"))
True
Problem 5. A Python block is composed of several lines of code with the same indentation
level. Blocks are delimited by key words and expressions, followed by a colon. Possible key
words are if, elif, else, for, while, try, except, finally, with, def, and class. Some of
these keywords require an expression to precede the colon (if, elif, for, etc.). Some require
no expressions to precede the colon (else, finally), and except may or may not have an
expression before the colon.
Write a function that accepts a string of Python code and uses regular expressions to
place colons in the appropriate spots. Assume that every colon is missing in the input string.
Return the string of code with colons in the correct places.
"""
k, i, p = 999, 1, 0
while k > i
i *= 2
p += 1
if k != 999
print("k should not have changed")
else
pass
print(p)
"""
If you wish to extract the characters that match some groups, but not others, you can exclude
a group from being captured by using the non-capturing group syntax (?:...).
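For example:
# Capture what follows 'cat', but do not capture the 'cat' itself.
>>> re.findall(r"(?:cat)(fish|nap)", "catfish catnap cat")
['fish', 'nap']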
Problem 6. The file fake_contacts.txt contains poorly formatted contact data for 2000
fictitious individuals. Each line of the file contains data for one person, including their name
and possibly their birthday, email address, and/or phone number. The formatting of the data
is not consistent, and much of it is missing. Each contact name includes a first and last name.
Some names have middle initials, in the form Jane C. Doe. Each birthday lists the month, then
the day, and then the year, though the format varies from 1/1/11, 1/01/2011, etc. If century
is not specified for birth year, as in 1/01/XX, birth year is assumed to be 20XX. Remember, not
all information is listed for each contact.
Use regular expressions to extract the necessary data and format it uniformly, writing
birthdays as mm/dd/yyyy and phone numbers as (xxx)xxx-xxxx. Return a dictionary where the
key is the name of an individual and the value is another dictionary containing their information.
Each of these inner dictionaries should have the keys "birthday", "email", and "phone". In
the case of missing data, map the key to None.
The first two entries of the completed dictionary are given below.
{
"John Doe": {
"birthday": "01/01/2099",
"email": "[email protected]",
"phone": "(123)456-7890"
},
"Jane Smith": {
"birthday": None,
"email": None,
"phone": "(222)111-3333"
},
# ...
}
Additional Material
Regular Expressions in the Unix Shell
As we have seen, regular expressions are very useful when we want to match patterns. Regular
expressions can be used when matching patterns in the Unix Shell. Though there are many Unix
commands that take advantage of regular expressions, we will focus on grep and awk.
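A sketch of such a command (the username freddy is just an example) is shown below.
$ ls -l | awk ' {if ($3 ~ /freddy/) print $9} '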
Because there is a lot going on in this command, we will break it down piece-by-piece. The
output of ls -l is getting piped to awk. Then we have an if statement. The syntax here means if
the condition inside the parenthesis holds, print field 9 (the field with the filename). The condition
is where we use regular expressions. The ~ checks to see if the contents of field 3 (the field with the
username) matches the regular expression found inside the forward slashes. To clarify, freddy is the
regular expression in this example and the expression must be surrounded by forward slashes.
Consider a similar example. In this example, we will list the names of the directories inside the
current directory. (This replicates the behavior of the Unix command ls -d */)
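A sketch of one way to do this, using the fact that the permissions field of ls -l begins with d for
directories:
$ ls -l | awk ' {if ($1 ~ /^d/) print $9} '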
Notice that in this example we printed only the names of the directories, whereas in one of the
examples using grep, we printed all the details of the directories as well.
Achtung!
Some of the definitions for character classes we used earlier in this lab will not work in the Unix
Shell. For example, \w and \d are not defined. Instead of \w, use [[:alnum:]]. Instead of
\d, use [[:digit:]]. For a complete list of similar character classes, search the internet for
POSIX Character Classes or Bracket Character Classes.
6
Web Technologies
Lab Objective: The Internet is a term for the collective grouping of all publicly accessible computer
networks in the world. This network can be traversed to access services such as social communication,
maps, video streaming, and large datasets, all of which are hosted on computers across the world.
Using these technologies requires an understanding of data serialization, data transportation protocols,
and how programs such as servers, clients, and APIs are created to facilitate this communication.
Data Serialization
Serialization is the process of packaging data in a form that makes it easy to transmit the data
and quickly reconstruct it on another computer or in a different programming language. Many
serialization metalanguages exist, such as Python’s pickle, YAML, XML, and JSON. JSON, which
stands for JavaScript Object Notation, is the dominant format for serialization in web applications.
Despite having “JavaScript” in its name, JSON is a language-independent format and is frequently
used for transmitting data between different programming languages. It stores information about
objects as a specially formatted string that is easy for both humans and machines to read and write.
Deserialization is the process of reconstructing an object from the string.
JSON is built on two types of data structures: a collection of key/value pairs similar to Python’s
built-in dict, and an ordered list of values similar to Python’s built-in list.
Note
To see a longer example of what JSON looks like, try opening a Jupyter Notebook (a .ipynb
file) in a plain text editor. The file lists the Notebook cells, each of which has attributes like
"cell_type" (usually code or markdown) and "source" (the actual code in the cell).
The JSON libraries of various languages have a fairly standard interface. The Python standard
library module for JSON is called json. If performance speed is critical, consider using the ujson
or simplejson modules that are written in C. A string written in JSON format that represents a
piece of data is called a JSON message. The json.dumps() function generates the JSON message
for a single Python object, which can be stored and used within the Python program. Alternatively,
the json encoder json.dump() generates the same object, but writes it directly to a file. To load a
JSON string or file, use the json decoder json.loads() or json.load(), respectively.
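For example:
>>> import json

>>> my_data = {"name": "Sally", "age": 14, "interests": ["math", "art"]}
>>> my_string = json.dumps(my_data)        # Serialize the object to a JSON message.
>>> my_string
'{"name": "Sally", "age": 14, "interests": ["math", "art"]}'
>>> json.loads(my_string) == my_data       # Deserializing recovers the original object.
True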
Problem 1. The file nyc_traffic.json contains information about 1000 traffic accidents in
New York City during the summer of 2017.a Each entry lists one or more reasons for the
accident, such as “Unsafe Speed” or “Fell Asleep.”
Write a function that loads the data from the JSON file. Make a readable, sorted bar
chart showing the total number of times that each of the 7 most common reasons for accidents
are listed in the data set.
(Hint: the collections.Counter data structure and plt.tight_layout() may be useful
here.)
To check your work, the 6th most common reason is “Backing Unsafely,” listed 59 times.
a See https://opendata.cityofnewyork.us/.
It is good practice to check for errors to ensure that custom encoders and decoders are only
used when intended.
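One common approach is to subclass json.JSONEncoder and to supply an object_hook function to
the decoder; the complex-number sketch below is only an illustration of that pattern.
>>> import json

>>> class ComplexEncoder(json.JSONEncoder):
...     """(Illustration only) Encode a complex number as a JSON message."""
...     def default(self, obj):
...         if not isinstance(obj, complex):
...             raise TypeError("expected a complex number")
...         return {"dtype": "complex", "real": obj.real, "imag": obj.imag}
...
>>> message = json.dumps(2 + 3j, cls=ComplexEncoder)
>>> message
'{"dtype": "complex", "real": 2.0, "imag": 3.0}'

>>> def complex_decoder(obj):
...     """(Illustration only) Rebuild a complex number from its JSON message."""
...     if obj.get("dtype") == "complex":
...         return complex(obj["real"], obj["imag"])
...     return obj
...
>>> json.loads(message, object_hook=complex_decoder)
(2+3j)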
Problem 2. The following class facilitates a regular 3 × 3 game of tic-tac-toe, where the boxes
in the board have the following coordinates.
Write a custom encoder and decoder for the TicTacToe class. If the custom encoder
receives anything other than a TicTacToe object, raise a TypeError.
class TicTacToe:
    def __init__(self):
        """Initialize an empty board. The O's go first."""
        self.board = [[' ']*3 for _ in range(3)]
        self.turn, self.winner = "O", None

    def empty_spaces(self):
        """Return the list of coordinates for the empty boxes."""
        return [(i,j) for i in range(3) for j in range(3)
                      if self.board[i][j] == ' ']

    def __str__(self):
        return "\n---------\n".join(" | ".join(r) for r in self.board)
Creating a Server
One simple way to create a server in Python is via the socket module. The server socket must
first be initialized by specifying the type of connection and the address at which clients can find the
server. The server socket then listens and waits for a connection from a client, receives and processes
data, and eventually sends a response back to the client. After exchanges between the server and the
client are finished, the server closes the connection to the client.
Name Description
socket Create a new socket using the given address family, socket type and protocol number.
bind Bind the socket to an address. The socket must not already be bound.
listen Enable a server to accept connections.
accept Accept a connection. Must be bound to an address and listening for connections.
connect Connect to a remote socket at address.
sendall Send data to the socket. The socket must be connected to a remote socket.
Continues to send data until either all data has been sent or an error occurs.
recv Receive data from the socket. Must be given a buffer size; use 1024.
close Mark the socket closed.
The socket.socket() method receives two parameters, which specify the socket type. The
server address is a (host, port) tuple. The host is the IP address, which in this case is "localhost"
or "0.0.0.0"—the default address that specifies the local machine and allows connections on all
interfaces. The port number is an integer from 0 to 65535. About 250 port numbers are commonly
used, and certain ports have pre-defined uses. Only use port numbers greater than 1023 to avoid
interrupting standard system services, such as email and system updates.
After setting up the server socket, the server program waits for a client to connect. The
accept() method returns a new socket object and the client’s address. Data is received through the
connection socket’s recv() method, which takes an integer specifying the number of bytes of data to
receive. The data is transferred as a raw byte stream (of type bytes), so the decode() method is
necessary to translate the data into a string. Likewise, data that is sent back to the client through
the connection socket’s sendall() method must be encoded into a byte stream via the encode()
method.
Finally, try-finally blocks in the server ensure that the connection is always closed securely.
Put these blocks within an infinite while True loop to ensure that your server will be ready for
any client request. Note that the accept() method does not return until a connection is made with
a client. Therefore, this server program cannot be executed in its entirety without a client. To stop
a server, raise a KeyboardInterrupt (press ctrl+c) in the terminal where it is running.
Note that server-client communication is the reason that JSON serialization and deserialization
are so important. For example, information such as an image or a family tree can be sent much
more simply as a serialized object.
import socket

# Assign an address for the server; the port number here is just an example.
server_address = ("0.0.0.0", 33498)

# Specify the socket type, which determines how clients will connect.
server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.bind(server_address)    # Assign this socket to an address.
server_sock.listen(1)               # Start listening for clients.

while True:
    # Wait for a client to connect to the server.
    print("\nWaiting for a connection...")
    connection, client_address = server_sock.accept()
    try:
        # Receive data from the client.
        print("Connection accepted from {}.".format(client_address))
        in_data = connection.recv(1024).decode()    # Receive data.
        print("Received '{}' from client".format(in_data))
        # Process the received data and send something back to the client.
        out_data = in_data[::-1]
        print("Sending '{}' back to the client".format(out_data))
        connection.sendall(out_data.encode())       # Send data.
    finally:
        # Always close the connection to the client, even if an error occurs.
        connection.close()
Achtung!
It often takes some time for a computer to reopen a port after closing a server connection. This
is due to the timeout functionality of specific protocols that check connections for errors and
disruptions. While testing code, wait a few seconds before running the program again, or use
different ports for each test.
Problem 3. Write a function that accepts a (host, port) tuple and starts up a tic-tac-toe
server at the specified location. Wait to accept a connection, then while the connection is open,
repeat the following operations.
1. Receive a JSON serialized TicTacToe object (serialized with your custom encoder from
Problem 2) from the client.
2. Deserialize the TicTacToe object using your custom decoder from Problem 2.
3. If the client has just won the game, send "WIN" back to the client and close the connection.
4. If there is no winner but the board is full, send "DRAW" to the client and close the connection.
5. If the game still isn’t over, make a random move on the tic-tac-toe board and serialize
the updated TicTacToe object. If this move wins the game, send "LOSE" to the client,
then send the serialized object separately (as proof), and close the connection. Otherwise,
send only the updated TicTacToe object back to the client but keep the connection open.
(Hint: print information at each step so you can see what the server is doing.)
Ensure that the connection closes securely even if an exception is raised. Note that you
will not be able to fully test your server until you have written a client (see Problem 4).
Creating a Client
The socket module also has tools for writing client programs. First, create a socket object with
the same settings as the server socket, then call the connect() method with the server address as
a parameter. Once the client socket is connected to the server socket, the two sockets can transfer
information between themselves.
Unlike the server socket, the client socket sends and reads the data itself instead of creating a
new connection socket. When the client program is complete, close the client socket. The server will
keep running, waiting for another client to serve.
To see a client and server communicate, open a terminal and run the server. Then run the
client in a separate terminal. Try this with the provided examples.
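A minimal client sketch is given below; the address values are examples and must match whatever
address the server bound to.
import socket

server_address = ("0.0.0.0", 33498)     # Example address; must match the server's.

client_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client_sock.connect(server_address)     # Connect to the server socket.
try:
    client_sock.sendall("Hello, server!".encode())      # Send some data.
    response = client_sock.recv(1024).decode()          # Read the server's reply.
    print("Received '{}' from the server".format(response))
finally:
    client_sock.close()                 # Close the client socket when finished.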
Problem 4. Write a client function that accepts a (host, port) tuple and connects to the tic-
tac-toe server at the specified location. Start by initializing a new TicTacToe object, then
repeat the following steps until the game is over.
1. Print the board and prompt the player for a move. Continue prompting the player until
they provide valid input.
2. Update the board with the player’s move, then serialize it using your custom encoder
from Problem 2, and send the serialized version to the server.
3. Receive a response from the server. If the game is over, congratulate or mock the player
appropriately. If the player lost, receive a second response from the server (the final game
board), deserialize it, and print it out.
APIs
An Application Program Interface (API) is a particular kind of server that listens for requests from
authorized users and responds with data. For example, a list of locations can be sent with the proper
request syntax to a Google Maps API, and it will respond with the calculated driving time from
start to end, including each location. Every API has endpoints where clients send their requests.
Though standards exist for creating and communicating with APIs, most APIs have a unique syntax
for authentication and requests that is documented by the organization providing the service.
The requests module is the standard way to send a download request to an API in Python.
Achtung!
Each website and API has a policy that specifies appropriate behavior for automated data
retrieval and usage. If data is requested without complying with these requirements, there can
be severe legal consequences. Most websites detail their policies in a file called robots.txt on
their main page. See, for example, https://www.google.com/robots.txt.
Additional Material
Other Internet Protocols
There are many protocols in the Internet Protocol Suite other than TCP that are used for different
purposes. The Protocol Suite can be divided into four categorical layers:
1. Application: Software that utilizes transport protocols to move information between comput-
ers. This layer includes protocols important for email, file transfers, and browsing the web.
2. Transport: Protocols that assist in basic high level communication between two computers in
areas such as data-streaming, reliability control, and flow control.
3. Internet: Protocols that handle routing, assignment of addresses, and movement of data on a
network.
4. Link: Protocols that communicate with local networking hardware such as routers and switches.
Although these examples are simple, every data transfer with TCP follows a similar pattern.
For basic connections, these interactions are simple processes. However, requesting a webpage would
require management of possibly hundreds of connections. In order to make this more feasible, there
are higher level protocols that handle smaller TCP/IP details. The most predominant of these
protocols is HTTP.
HTTP
HTTP stands for Hypertext Transfer Protocol, which is an application layer networking protocol.
It is a higher level protocol than TCP but uses TCP protocols to manage connections and provide
network capabilities. The protocol is centered around a request and response paradigm in which a
client makes a request to a server and the server replies with a response. There are several methods,
or requests, defined for HTTP servers, the three most common of which are GET, POST, and PUT.
GET requests request information from a server, POST requests modify the state of the server, and
PUT requests add new pieces of data to the server.
Every HTTP request or response consists of two parts: a header and a body. The headers
contain important information about the request including: the type of request, encoding, and a
timestamp. Custom headers may be added to any request to provide additional information. The
body of the request or response contains the appropriate data or may be empty.
An HTTP connection can be set up in Python by using the standard library module http.
Though it is the standard, the process can be greatly simplified by using an additional library called
requests. The following demonstrates a simple GET request with the http library.
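A sketch of such a request with http.client from the standard library (the host is just an example):
>>> import http.client

>>> conn = http.client.HTTPSConnection("www.example.com")
>>> conn.request("GET", "/")                 # Send a GET request for the root page.
>>> response = conn.getresponse()
>>> print(response.status, response.reason)  # Status information from the response header.
200 OK
>>> html = response.read()                   # The body of the response, as bytes.
>>> conn.close()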
This process is how a web browser (a client program) retrieves a webpage. It first sends an
HTTP request to the web server (a server program) and receives the HTML, CSS, and other code
files for a webpage, which are compiled and run in the web browser.
Requests also often include parameters, key/value pairs that tell the server exactly what is being
requested or placed. These parameters can be included directly in the URL sent to the server, or
passed through the params argument of the requests library. For example:
>>> r = requests.get("http://httpbin.org/get?key2=value2&key1=value1")
>>> print(r.text)
{
"args": {
"key1": "value1",
"key2": "value2"
},
...
},
"origin": "128.187.116.7",
"url": "http://httpbin.org/get?key2=value2&key1=value1"
}
>>> r = requests.get("http://httpbin.org/get", params={'key1':'value1','key2':'←-
value2'})
>>> print(r.url)
http://httpbin.org/get?key2=value2&key1=value1
>>> print(r.text)
{
"args": {
"key1": "value1",
"key2": "value2"
},
...
},
"origin": "128.187.116.7",
"url": "http://httpbin.org/get?key2=value2&key1=value1"
}
A similar format to GET requests can also be used for PUT or POST requests. These special
requests alter the state of the server or send a piece of data to the server, respectively. In addition,
for PUT and POST requests, a data string or dictionary may be sent as a binary stream attachment.
The requests library attaches these data objects with the data parameter. For example:
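A sketch of such a request, sending a JSON string to the httpbin.org testing service:
>>> import json
>>> import requests

>>> info = json.dumps({"name": "Jane", "age": 34})
>>> r = requests.put("http://httpbin.org/put", data=info)
>>> r.status_code
200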
Note that the data parameter accepts input in the form of a JSON string.
Frequently, when these requests arrive at the server, they are in the form of a binary stream,
which can be read with similar notation to the Python open function. Below is an example of reading
the previous PUT request with a data attachment as a binary stream using read.
For more information on the requests library, see the documentation at http://docs.python-requests.
org/.
7
Web Scraping
Lab Objective: Web Scraping is the process of gathering data from websites on the internet.
Since almost everything rendered by an internet browser as a web page uses HTML, the first step in
web scraping is being able to extract information from HTML. In this lab, we introduce the requests
library for scraping web pages, and BeautifulSoup, Python’s canonical tool for efficiently and cleanly
navigating and parsing HTML.
# Make a request and check the result. A status code of 200 is good.
>>> response = requests.get("http://www.byu.edu")
>>> print(response.status_code, response.ok, response.reason)
200 True OK
1 Though requests is not part of the standard library, it is recognized as a standard tool in the data science community.
<head>
<meta charset="utf-8" />
# ...
Note that some websites aren’t built to handle large amounts of traffic or many repeated
requests. Most are built to identify web scrapers or crawlers that initiate many consecutive GET
requests without pauses, and retaliate or block them. When web scraping, always make sure to
store the data that you receive in a file and include error checks to prevent retrieving the same data
unnecessarily. We won’t spend much time on that in this lab, but it’s especially important in larger
applications.
Problem 1. Use the requests library to get the HTML source for the website http://www.
example.com. Save the source as a file called example.html. If the file already exists, make
sure not to scrape the website or overwrite the file. You will use this file later in the lab.
Achtung!
Scraping copyrighted information without the consent of the copyright owner can have severe
legal consequences. Many websites, in their terms and conditions, prohibit scraping parts or all
of the site. Websites that do allow scraping usually have a file called robots.txt (for example,
www.google.com/robots.txt) that specifies which parts of the website are off-limits and how
often requests can be made according to the robots exclusion standard.a
Be careful and considerate when doing any sort of scraping, and take care when writing
and testing code to avoid unintended behavior. It is up to the programmer to create a scraper
that respects the rules found in the terms and conditions and in robots.txt.b
We will cover this more in the next lab.
a See www.robotstxt.org/orig.html and en.wikipedia.org/wiki/Robots_exclusion_standard.
b Python provides a parsing library called urllib.robotparser for reading robots.txt files. For more
information, see https://docs.python.org/3/library/urllib.robotparser.html.
HTML
Hyper Text Markup Language, or HTML, is the standard markup language—a language designed for
the processing, definition, and presentation of text—for creating webpages. It structures a document
using pairs of tags that surround and define content. Opening tags have a tag name surrounded
by angle brackets (<tag-name>). The companion closing tag looks the same, but with a forward
slash before the tag name (</tag-name>). A list of all current HTML tags can be found at
http://htmldog.com/reference/htmltags.
Most tags can be combined with attributes to include more data about the content, help identify
individual tags, and make navigating the document much simpler. In the following example, the <a>
tag has id and href attributes.
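(The snippet below is a reconstruction: the <a> tag matches the attributes and text used later in
this lab, while the surrounding tags are assumed.)
<html>
    <body>
        <p>
            Click <a id="info" href="http://www.example.com">here</a> for more information.
        </p>
    </body>
</html>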
In HTML, href stands for hypertext reference, a link to another website. Thus the above
example would be rendered by a browser as a single line of text, with here being a clickable link to
http://www.example.com.
Unlike Python, HTML does not enforce indentation (or any whitespace rules), though inden-
tation generally makes HTML more readable. The previous example can be written in a single
line.
Special tags, which don’t contain any text or other tags, are written without a closing tag
and in a single pair of brackets. A forward slash is included between the name and the closing
bracket. Examples of these include <hr/>, which describes a horizontal line, and <img/>, the tag for
representing an image.
Note
You can open .html files using a text editor or any web browser. In a browser, you can inspect
the source code associated with specific elements. Right click the element and select Inspect.
If you are using Safari, you may first need to enable “Show Develop menu” in “Preferences”
under the “Advanced” tab.
BeautifulSoup
BeautifulSoup (bs4) is a package that makes it simple to navigate and extract data from HTML
documents. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html for the
full documentation.
The bs4.BeautifulSoup class accepts two parameters to its constructor: a string of HTML
code and an HTML parser to use under the hood. The HTML parser is technically a keyword
argument, but the constructor prints a warning if one is not specified. The standard choice for the
parser is "html.parser", which means the object uses the standard library’s html.parser module
as the engine behind the scenes.
Note
Depending on project demands, a parser other than "html.parser" may be useful. A couple of
other options are "lxml", an extremely fast parser written in C, and "html5lib", a slower parser
that treats HTML in much the same way a web browser does, allowing for irregularities. Both
must be installed independently; see https://www.crummy.com/software/BeautifulSoup/
bs4/doc/#installing-a-parser for more information.
A BeautifulSoup object represents an HTML document as a tree. In the tree, each tag is a
node with nested tags and strings as its children. The prettify() method returns a string that can
be printed to represent the BeautifulSoup object in a readable format that reflects the tree structure.
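As a small sketch using the <a> tag from the earlier example (the variable names are ours):
>>> from bs4 import BeautifulSoup

>>> html_doc = '<a id="info" href="http://www.example.com">here</a>'
>>> soup = BeautifulSoup(html_doc, "html.parser")
>>> print(soup.prettify())
<a href="http://www.example.com" id="info">
 here
</a>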
Each tag in a BeautifulSoup object’s HTML code is stored as a bs4.element.Tag object, with
actual text stored as a bs4.element.NavigableString object. Tags are accessible directly through
the BeautifulSoup object.
# Get the <a> tag from the soup above; tags are attributes of the BeautifulSoup object.
>>> a_tag = soup.a

# Get just the name, attributes, and text of the <a> tag.
>>> print(a_tag.name, a_tag.attrs, a_tag.string, sep="\n")
a
{'id': 'info', 'href': 'http://www.example.com'}
here
Attribute Description
name The name of the tag
attrs A dictionary of the attributes
string The single string contained in the tag
strings Generator for strings of children tags
stripped_strings Generator for strings of children tags, stripping whitespace
text Concatenation of strings from all children tags
Problem 2. The BeautifulSoup class has a find_all() method that, when called with True
as the only argument, returns a list of all tags in the HTML source code.
Write a function that accepts a string of HTML code as an argument. Use BeautifulSoup
to return a list of the names of the tags in the code.
# The HTML used in the rest of this lab, stored as a string. (The assignment and
# opening tags here are assumed; the <head> contents match output shown later.)
pig_html = """
<html><head><title>Three Little Pigs</title></head>
<body>
<p class="title"><b>The Three Little Pigs</b></p>
<p class="story">Once upon a time, there were three little pigs named
<a href="http://example.com/larry" class="pig" id="link1">Larry,</a>
<a href="http://example.com/mo" class="pig" id="link2">Mo</a>, and
<a href="http://example.com/curly" class="pig" id="link3">Curly.</a>
<p>The three pigs had an odd fascination with experimental construction.</p>
<p>...</p>
</body></html>
"""
>>> pig_soup.a
<a class="pig" href="http://example.com/larry" id="link1">Larry,</a>
Since the HTML in this example has several <p> and <a> tags, only the first tag of each name
is accessible directly from pig_soup. The other tags can be accessed by manually navigating through
the HTML tree.
Every HTML tag (except for the topmost tag, which is usually <html>) has a parent tag. Each
tag also has zero or more sibling and children tags or text. Following a true tree structure, every
bs4.element.Tag in a soup has multiple attributes for accessing or iterating through parent, sibling,
or child tags.
Attribute Description
parent The parent tag
parents Generator for the parent tags up to the top level
next_sibling The tag immediately after to the current tag
next_siblings Generator for sibling tags after the current tag
previous_sibling The tag immediately before the current tag
previous_siblings Generator for sibling tags before the current tag
contents A list of the immediate children tags
children Generator for immediate children tags
descendants Generator for all children tags (recursively)
# Get the names of all of <a>'s parent tags, traveling up to the top.
# The name '[document]' means it is the top of the HTML code.
>>> [par.name for par in a_tag.parents] # <a>'s parent is <p>, whose
['p', 'body', 'html', '[document]'] # parent is <body>, and so on.
Note carefully that newline characters are considered to be children of a parent tag. Therefore
iterating through children or siblings often requires checking which entries are tags and which are
just text. In the next example, we use a tag’s attrs attribute to access specific attributes within the
tag (see Table 7.1).
# Get to the <p> tag that has class="story" using these commands.
>>> p_tag = pig_soup.body.p.next_sibling.next_sibling
>>> p_tag.attrs["class"] # Make sure it's the right tag.
['story']
# Iterate through the child tags of <p> and print hrefs whenever they exist.
>>> for child in p_tag.children:
... # Skip the children that are not bs4.element.Tag objects
... # These don't have the attribute "attrs"
... if hasattr(child, "attrs") and "href" in child.attrs:
... print(child.attrs["href"])
http://example.com/larry
http://example.com/mo
http://example.com/curly
Note that the "class" attribute of the <p> tag is a list. This is because the "class" attribute
can take on several values at once; for example, the tag <p class="story book"> is of class 'story'
and of class 'book'.
The behavior of the string attribute of a bs4.element.Tag object depends on the structure
of the corresponding HTML tag.
1. If the tag has a string of text and no other child elements, then string is just that text.
2. If the tag has exactly one child tag and the child tag has only a string of text, then the tag has
the same string as its child tag.
3. If the tag has more than one child, then string is None. In this case, use strings to iterate
through the child strings. Alternatively, the get_text() method returns all text belonging to
a tag and to all of its descendants. In other words, it returns anything inside a tag that isn’t
another tag.
>>> pig_soup.head
<head><title>Three Little Pigs</title></head>
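Since <head> has exactly one child tag, and that child contains a single string, the second rule
applies:
>>> pig_soup.head.string
'Three Little Pigs'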
Problem 3. Write a function that reads a file of the same format as the output from Problem
1 and loads it into BeautifulSoup. Find the first <a> tag, and return its text along with a
boolean value indicating whether or not it has a hyperlink (href attribute).
>>> pig_soup.find(string='Mo')
'Mo' # The result is the actual string,
>>> pig_soup.find(string='Mo').parent # so go up one level to get the tag.
<a class="pig" href="http://example.com/mo" id="link2">Mo</a>
Problem 4. The file san_diego_weather.html contains the HTML source for an old page
from Weather Underground.a Write a function that reads the file and loads it into Beautiful-
Soup.
Return a list of the following tags:
2. The tags which contain the links “Previous Day” and “Next Day.”
3. The tag which contains the number associated with the Actual Max Temperature.
a See http://www.wunderground.com/history/airport/KSAN/2015/1/1/DailyHistory.html?req_city=San+Diego&req_state=CA&req_statename=California&reqdb.zip=92101&reqdb.magic=1&reqdb.wmo=99999&MR=1
>>> pig_soup.find(href="http://example.com/curly")
<a class="pig" href="http://example.com/curly" id="link3">Curly.</a>
This approach works, but it requires entering in the entire URL. To perform generalized
searches, the find() and find_all() method also accept compiled regular expressions from the
re module. This way, the methods locate tags whose name, attributes, and/or string matches a
pattern.
>>> import re
# Find the first tag with a string that starts with 'Cu'.
>>> pig_soup.find(string=re.compile(r"^Cu")).parent
<a class="pig" href="http://example.com/curly" id="link3">Curly.</a>
Finally, to find a tag that has a particular attribute, regardless of the actual value of the
attribute, use True in place of search values.
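For example:
# Find the first tag that has an id attribute, whatever its value.
>>> pig_soup.find(id=True)
<a class="pig" href="http://example.com/larry" id="link1">Larry,</a>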
Symbol Meaning
= Matches an attribute value exactly
*= Partially matches an attribute value
^= Matches the beginning of an attribute value
$= Matches the end of an attribute value
+ Matches the next sibling of an element
> Matches the direct children of an element
You can do many other useful things with CSS selectors. A helpful guide can be found at
https://www.w3schools.com/cssref/css_selectors.asp. The code below gives an example using
arguments described above.
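For instance, BeautifulSoup's select() method accepts CSS selectors like those in the table above;
a small sketch with the pig soup:
# Find all <a> tags whose href attribute ends with 'curly'.
>>> pig_soup.select('a[href$="curly"]')
[<a class="pig" href="http://example.com/curly" id="link3">Curly.</a>]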
Problem 6. The file large_banks_data.html is one of the pages from the index in Problem
5.a Write a function that reads the file and loads the source into BeautifulSoup. Create a single
figure with two subplots:
1. A sorted bar chart of the seven banks with the most domestic branches.
2. A sorted bar chart of the seven banks with the most foreign branches.
8
Web Crawling
Lab Objective: Gathering data from the internet often requires information from several web
pages. In this lab, we present two methods for crawling through multiple web pages without violating
copyright laws or straining the load on a server. We also demonstrate how to scrape data from
asynchronously loaded web pages and how to interact programmatically with web pages when needed.
Scraping Etiquette
There are two main ways that web scraping can be problematic for a website owner.
1. The scraper doesn’t respect the website’s terms and conditions or gathers private or proprietary
data.
2. The scraper imposes too much extra server load by making requests too often or in quick
succession.
These are extremely important considerations in any web scraping program. Scraping copyrighted
information without the consent of the copyright owner can have severe legal consequences. Many
websites, in their terms and conditions, prohibit scraping parts or all of the site. Websites that do
allow scraping usually have a file called robots.txt (for example, www.google.com/robots.txt)
that specifies which parts of the website are off-limits, and how often requests can be made according
to the robots exclusion standard.1
Achtung!
Be careful and considerate when doing any sort of scraping, and take care when writing and
testing code to avoid unintended behavior. It is up to the programmer to create a scraper that
respects the rules found in the terms and conditions and in robots.txt. Make sure to scrape
websites legally.
Recall that consecutive requests without pauses can strain a website’s server and provoke retal-
iation. Most servers are designed to identify such scrapers, block their access, and sometimes even
blacklist the user. This is especially common in smaller websites that aren’t built to handle enormous
amounts of traffic. To briefly pause the program between requests, use time.sleep().
1 See www.robotstxt.org/orig.html and en.wikipedia.org/wiki/Robots_exclusion_standard.
The amount of necessary wait time depends on the website. Sometimes, robots.txt contains
a Crawl-delay directive which gives a number of seconds to wait between successive requests. If
this doesn’t exist, pausing for a half-second to a second between requests is typically sufficient. An
email to the site’s webmaster is always the safest approach and may be necessary for large scraping
operations.
Python provides a parsing library called urllib.robotparser for reading robots.txt files.
Below is an example of using this library to check where robots are allowed on arxiv.org. A website’s
robots.txt file will often include different instructions for specific crawlers. These crawlers are
identified by a User-agent string. For example, Google’s webcrawler, User-agent Googlebot, may
be directed to index only the pages the website wants to have listed on a Google search. We will use
the default User-agent, "*".
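A sketch of that check is given below; the results depend on the current contents of
arxiv.org/robots.txt, so no output is shown.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://arxiv.org/robots.txt")
rp.read()                                        # Download and parse robots.txt.
print(rp.can_fetch("*", "https://arxiv.org/"))   # May the default agent crawl this page?
print(rp.crawl_delay("*"))                       # The Crawl-delay for the default agent, if any.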
Problem 1. Write a program that accepts a web address defaulting to the site
http://example.webscraping.com and a list of pages defaulting to ["/", "/trap",
"/places/default/search"]. For each page, check if the robots.txt file permits access. Return
a list of boolean values corresponding to each page. Also return the crawl delay time.
# (Inside scrape_books(): page, base_url, titles, and next_page_finder are
# defined earlier in the function; this excerpt shows the crawling loop.)
current = None
for _ in range(4):
    while current is None:      # Try downloading until it works.
        # Download the page source and PAUSE before continuing.
        page_source = requests.get(page).text
        time.sleep(1)           # PAUSE before continuing.
        soup = BeautifulSoup(page_source, "html.parser")
        current = soup.find_all(class_="product_pod")
    # Find the URL for the page with the next data.
    if "page-4" not in page:
        new_page = soup.find(string=next_page_finder).parent["href"]
        page = base_url + new_page      # New complete page URL.
        current = None
return titles
In this example, the for loop cycles through the pages of books, and the while loop ensures
that each website page loads properly: if the downloaded page_source doesn’t have a tag whose
class is product_pod, the request is sent again. After recording all of the titles, the function locates
the link to the next page. This link is stored in the HTML as a relative website path (page-2.html);
the complete URL of the next page is the concatenation of the base URL
http://books.toscrape.com/catalogue/category/books/mystery_3/ with this relative link.
Problem 2. Modify scrape_books() so that it gathers the price for each fiction book and
returns the mean price, in £, of a fiction book.
Note
Selenium requires an executable driver file for each kind of browser. The following examples
use Google Chrome, but Selenium supports Firefox, Internet Explorer, Safari, Opera, and
PhantomJS (a special browser without a user interface). See https://seleniumhq.github.io/
selenium/docs/api/py or http://selenium-python.readthedocs.io/installation.html
for installation instructions and driver download instructions.
If your program still can’t find the driver after you’ve downloaded it, add the argument
executable_path = "path/to/driver/file" when you call webdriver. If this doesn’t work,
you may need to add the location to your system PATH. On a Mac, open the file /etc/paths and
add the new location. On Linux, add export PATH="path/to/driver/file:$PATH" to the file
~/.bashrc. For Windows, follow a tutorial such as this one:
https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/.
To use Selenium, start up a browser using one of the drivers in selenium.webdriver. The
browser has a get() method for going to different web pages, a page_source attribute containing
the HTML source of the current page, and a close() method to exit the browser.
>>> from selenium import webdriver

# Start up a browser using the Chrome driver and visit a web page.
>>> browser = webdriver.Chrome()
>>> browser.get("https://www.example.com")

# Feed the HTML source code for the page into BeautifulSoup for processing.
>>> soup = BeautifulSoup(browser.page_source, "html.parser")
>>> print(soup.prettify())
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
Example Domain
</title>
<meta charset="utf-8"/>
Selenium can deliver the HTML page source to BeautifulSoup, but it also has its own tools for
finding tags in the HTML.
Method Returns
find_element_by_tag_name() The first tag with the given name
find_element_by_name() The first tag with the specified name attribute
find_element_by_class_name() The first tag with the given class attribute
find_element_by_id() The first tag with the given id attribute
find_element_by_link_text() The first tag with a matching href attribute
find_element_by_partial_link_text() The first tag with a partially matching href attribute
Each of the find_element_by_*() methods returns a single object representing a web element
(of type selenium.webdriver.remote.webelement.WebElement), much like a BeautifulSoup tag (of
type bs4.element.Tag). If no such element can be found, a Selenium NoSuchElementException is
raised. If you want to find more than just the first matching object, each webdriver also has several
find_elements_by_*() methods (elements, plural) that return a list of all matching elements, or
an empty list if there are no matches.
Web element objects have methods that allow the program to interact with them: click()
sends a click, send_keys() enters in text, and clear() deletes existing text. This functionality
makes it possible for Selenium to interact with a website in the same way that a human would. For
example, the following code opens up https://www.google.com, types “Python Selenium Docs” into
the search bar, and hits enter.
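A sketch of such a program is shown below; the CSS selector for the search bar is an assumption
and may need adjusting if the page changes.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Chrome()
try:
    browser.get("https://www.google.com")
    try:
        # Locate the search bar with a CSS selector (assumed to be the input named 'q').
        search_bar = browser.find_element_by_css_selector("input[name='q']")
        search_bar.clear()                          # Delete any existing text.
        search_bar.send_keys("Python Selenium Docs")
        search_bar.send_keys(Keys.RETURN)           # Hit enter to run the search.
    except NoSuchElementException:
        print("Could not find the search bar!")
finally:
    browser.close()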
In the above example, we could have used find_element_by_class_name(), but when you need
more precision than that, CSS selectors can be very useful. Remember that to view specific HTML
associated with an object in Chrome or Firefox, you can right click on the object and click “Inspect.”
For Safari, you need to first enable “Show Develop menu” in “Preferences” under “Advanced.” Keep
in mind that you can also search through the source code (ctrl+f or cmd+f) to make sure you’re
using a unique identifier.
Note
Using Selenium to access a page’s source code is typically much safer, though slower, than
using requests.get(), since Selenium waits for each web page to load before proceeding. For
instance, some websites are somewhat defensive about scrapers, but Selenium can sometimes
make it possible to gather info without offending the administrators.
9
Pandas 1: Introduction
Lab Objective: Though NumPy and SciPy are powerful tools for numerical computing, they lack
some of the high-level functionality necessary for many data science applications. Python’s pandas
library, built on NumPy, is designed specifically for data management and analysis. In this lab
we introduce pandas data structures, syntax, and explore its capabilities for quickly analyzing and
presenting data.
Pandas Basics
Pandas is a Python library used primarily to analyze data. It combines the functionality of NumPy,
Matplotlib, and SQL to create an easy-to-use library that allows for the manipulation of data in
various ways. In this lab we focus on the use of pandas to analyze and manipulate data in ways
similar to NumPy and SQL.
Series
The first pandas data structure is a Series. A Series is a one-dimensional array that can hold any
datatype, similar to an ndarray. However, a Series also has an index that gives a label to each entry,
which is generally used to identify the data.
Typically a Series contains information about one feature of the data. For example, the data
in a Series might show a class’s grades on a test and the Index would indicate each student in the
class. To initialize a Series, the first parameter is the data and the second is the index.
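For instance, a small Series of made-up test grades, labeled by student name, can be built as follows.

>>> import pandas as pd

# The first argument is the data, the second is the index.
>>> grades = pd.Series([87, 92, 75], index=['Alice', 'Bob', 'Carlos'])
>>> grades
Alice     87
Bob       92
Carlos    75
dtype: int64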
DataFrame
The second key pandas data structure is a DataFrame. A DataFrame is a collection of multiple
Series. It can be thought of as a 2-dimensional array, where each row is a separate datapoint and
each column is a feature of the data. The rows are labeled with an index (as in a Series) and the
columns are labeled in the attribute columns.
There are many different ways to initialize a DataFrame. One way to initialize a DataFrame is
by passing in a dictionary as the data of the DataFrame. The keys of the dictionary will become the
labels in columns and the values are the Series associated with the label.
Notice that pd.DataFrame automatically lines up data from both Series that have the same
index. If the data only appears in one of the Series, the entry for the second Series is NaN.
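A sketch of this behavior with two small Series whose indices only partially overlap:

>>> math = pd.Series([80.0, 95.0], index=['Alice', 'Bob'])
>>> english = pd.Series([91.0, 78.0], index=['Bob', 'Carlos'])
>>> pd.DataFrame({'Math': math, 'English': english})
        Math  English
Alice   80.0      NaN
Bob     95.0     91.0
Carlos   NaN     78.0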
We can also initialize a DataFrame with a NumPy array. In this way, the data is passed in as
a 2-dimensional NumPy array, while the column labels and index are passed in as parameters. The
first column label goes with the first column of the array, the second with the second, etc. The same
holds for the index.
A DataFrame can also be viewed as a NumPy array using the attribute values.
Data I/O
The pandas library has functions that make importing and exporting data simple. The functions
allow for a variety of file formats to be imported and exported, including CSV, Excel, HDF5, SQL,
JSON, HTML, and pickle files.
Method Description
to_csv() Write the index and entries to a CSV file
read_csv() Read a csv and convert into a DataFrame
to_json() Convert the object to a JSON string
to_pickle() Serialize the object and store it in an external file
to_sql() Write the object data to an open SQL database
read_html() Read a table in an html page and convert to a DataFrame
The CSV (comma separated values) format is a simple way of storing tabular data in plain
text. Because CSV files are one of the most popular file formats for exchanging data, we will explore
the read_csv() function in more detail. To learn to read other types of file formats, see the online
pandas documentation. To read a CSV data file into a DataFrame, call the read_csv() function
with the path to the CSV file, along with the appropriate keyword arguments. Below we list some
of the most important keyword arguments:
• delimiter: The character that separates data fields. It is often a comma or a whitespace
character.
• header: The row number (0 indexed) in the CSV file that contains the column names.
• index_col: The column (0 indexed) in the CSV file that is the index for the DataFrame.
• skiprows: If an integer n, skip the first n rows of the file, and then start reading in the data.
If a list of integers, skip the specified rows.
• names: If the CSV file does not contain the column names, or you wish to use other column
names, specify them in a list.
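For example, a call combining several of these arguments might look like the following (the filename and column names are placeholders):

# Skip two rows, supply our own column names, and index by the first column.
>>> df = pd.read_csv("temperatures.csv", delimiter=',', skiprows=2,
...                  header=None, names=["City", "High", "Low"], index_col=0)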
The read_html() function is useful when scraping data. It takes in a URL or an HTML file and an
optional match, a string or regular expression, and returns a list of DataFrames containing the tables
that match. While the resulting data will probably need some cleaning, this is much faster than
scraping a website by hand.
Data Manipulation
Accessing Data
While array slicing can be used to access data in a DataFrame, it is always preferable to use the
loc and iloc indexers. Accessing Series and DataFrame objects with these indexers is more efficient
than bracket slicing because bracket indexing has to check many cases before it can determine how
to slice the data structure; using loc or iloc explicitly bypasses those extra checks. The loc indexer
selects rows and columns based on their labels, while iloc selects them based on their integer
position. When using these indexers, the first and second arguments refer to the rows and columns,
respectively, just as in array slicing.
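A short illustration with a small DataFrame (the values are arbitrary):

>>> df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
...                   index=['x', 'y', 'z'], columns=['a', 'b'])

# Select rows 'x' and 'y' of column 'b' by label.
>>> df.loc[['x', 'y'], 'b']
x    2
y    4
Name: b, dtype: int64

# Select the same entries by integer position.
>>> df.iloc[[0, 1], 1]
x    2
y    4
Name: b, dtype: int64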
An entire column of a DataFrame can be accessed using simple square brackets and the name
of the column. In addition, to create a new column or reset the values of an entire column, simply
call this column in the same fashion and set the value.
Often datasets can be very large and difficult to visualize. Pandas offers various methods to
make the data easier to visualize. The methods head and tail will show the first or last n data
points, respectively, where n defaults to 5. The method sample will draw n random entries of the
dataset, where n defaults to 1.
It may also be useful to re-order the columns or rows or sort according to a given column.
# Re-order columns
>>> grades.reindex(columns=['English','Math','History'])
English Math History
Barbara 73.0 52.0 100.0
David 39.0 10.0 100.0
Eleanor NaN 35.0 100.0
Greg 26.0 NaN 100.0
Lauren 99.0 NaN 100.0
Mark 68.0 81.0 100.0
Other methods for manipulating DataFrame and Series pandas structures can be found
in Table 9.2.
Method Description
append() Concatenate two or more Series.
drop() Remove the entries with the specified label or labels
drop_duplicates() Remove duplicate values
dropna() Drop null entries
fillna() Replace null entries with a specified value or strategy
reindex() Replace the index
sample() Draw a random entry
shift() Shift the index
unique() Return unique values
Table 9.2: Methods for managing or modifying data in a pandas Series or DataFrame.
Problem 1. The file budget.csv contains the budget of a college student over the course of 4
years. Write a function that performs the following operations in this order:
1. Reindex the columns such that amount spent on groceries is the first column and all other
columns maintain the same ordering.
2. Sort the DataFrame in descending order based on how much money was spent on Groceries.
Read in budget.csv as a DataFrame with the index as column 0 and perform each of these
operations on the DataFrame. Return the values of the updated DataFrame as a NumPy array.
Hint: Use index_col=0 to set the first column as the index when reading in the CSV.
Series also support elementwise arithmetic. Operations such as addition align entries by their index
labels, and any label that is missing from one of the operands produces a NaN in the result.
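For example, adding two Series with made-up scores:

>>> midterm = pd.Series([80, 60], index=['David', 'Mark'])
>>> final = pd.Series([70, 82, 90], index=['Eleanor', 'Mark', 'Greg'])
>>> midterm + final
David        NaN
Eleanor      NaN
Greg         NaN
Mark       142.0
dtype: float64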
In addition to arithmetic, Series have a variety of other methods similar to NumPy arrays. A
collection of these methods is found in Table 9.3.
Method Returns
abs() Object with absolute values taken (of numerical data)
idxmax() The index label of the maximum value
idxmin() The index label of the minimum value
count() The number of non-null entries
cumprod() The cumulative product over an axis
cumsum() The cumulative sum over an axis
max() The maximum of the entries
mean() The average of the entries
median() The median of the entries
min() The minimum of the entries
mode() The most common element(s)
prod() The product of the elements
sum() The sum of the elements
var() The variance of the elements
Table 9.3: Numerical methods of the Series and DataFrame pandas classes.
Functions for calculating means and variances, the covariance and correlation matrices, and
other basic statistics are also available.
The method rank() gives a ranking based on methods such as average, minimum, and maximum.
By default it ranks in ascending order: the smallest value receives rank 1 and the largest receives
the highest rank.
# Rank each student's performance based on their highest grade in any class
# in descending order
>>> grades.rank(axis=1,method='max',ascending=False)
Math English History
Barbara 3.0 2.0 1.0
David 3.0 2.0 1.0
Eleanor 2.0 NaN 1.0
Greg NaN 2.0 1.0
Lauren NaN 2.0 1.0
Mark 2.0 3.0 1.0
These methods can be very effective in interpreting data. For example, the rank example above
shows us that Barbara does best in History, then English, and then Math.
When dealing with missing data, make sure you are aware of the behavior of the pandas
functions you are using. For example, sum() and mean() ignore NaN values in the computation.
Achtung!
Always consider missing data carefully when analyzing a dataset. It may not always be helpful
to drop the data or fill it in with a random number. Consider filling the data with the mean
of surrounding data or the mean of the feature in question. Overall, the choice for how to fill
missing data should make sense with the dataset.
Problem 2. Write a function which uses budget.csv to answer the questions "Which category
affects living expenses the most? Which affects other expenses the most?" Perform the following
manipulations:
2. Create two new columns, 'Living Expenses' and 'Other'. Sum the columns 'Rent',
'Groceries', 'Gas', and 'Utilities' and set the result as the value of 'Living Expenses'.
Sum the columns 'Dining Out', 'Out With Friends', and 'Netflix' and set the result as the
value of 'Other'.
3. Identify which column, other than 'Living Expenses', correlates most with 'Living
Expenses', and which column, other than 'Other', correlates most with 'Other'. This
can indicate which columns in the budget affect the overarching categories the most.
Return the names of each of those columns as a tuple. The first entry should be the column
corresponding to 'Living Expenses' and the second to 'Other'.
Before querying our data, it is important to know some of its basic properties, such as number of
columns, number of rows, and the datatypes of the columns. This can be done by simply calling the
info() method on the desired DataFrame:
>>> mathInfo.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 3 columns):
Grade 5 non-null float64
ID 5 non-null int64
Math_Major 5 non-null object
dtypes: float64(1), int64(1), object(1)
Masks
Sometimes, we only want to access data from a single column. For example, if we wanted to access
only the IDs of the students in the studentInfo DataFrame, we would use the following syntax.
If we wanted to access multiple columns at once we can use a list of column names.
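A sketch of both forms, assuming studentInfo has columns 'ID' and 'Age':

# A single column is returned as a Series.
>>> ids = studentInfo['ID']

# A list of column names returns a DataFrame with just those columns.
>>> subset = studentInfo[['ID', 'Age']]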
Now we can access the specific columns that we want. However some of these columns may still
contain data points that we don’t want to consider. In this case we can build a mask. Each mask
that we build will return a pandas Series object with a bool value at each index indicating if the
condition is satisfied.
We can also create compound masks with multiple statements. We do this using the same
syntax you would use for a compound mask in a normal NumPy array. Useful operators are: &, the
AND operator; |, the OR operator; and ~, the NOT operator.
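A sketch of a simple mask and a compound mask, again assuming studentInfo has 'Age' and 'Math_Major' columns (the value 'y' is a placeholder):

# Boolean Series that is True wherever the student is older than 20.
>>> mask = studentInfo['Age'] > 20
>>> older = studentInfo[mask]

# Compound mask: students older than 20 who are also math majors.
>>> older_majors = studentInfo[(studentInfo['Age'] > 20) &
...                            (studentInfo['Math_Major'] == 'y')]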
Problem 3. Read in the file crime_data.csv as a pandas object. The file contains data on
types of crimes in the U.S. from 1960 to 2016. Set the index as the column 'Year'. Answer
the following questions using the pandas methods learned in this lab. The answer of each
question should be saved as indicated. Return the answers to each question as a tuple (i.e.
(answer_1,answer_2,answer_3)).
1. Identify the three crimes that have a mean over 1,500,000. Of these three crimes, which
two are very correlated? Which of these two crimes has a greater maximum value? Save
the title of this column as a variable to return as the answer.
2. Examine the data since 2000. Sort this data (in ascending order) according to number
of murders. Find the years where Aggravated Assault is greater than 850,000. Save the
indices (the years) of the masked and reordered DataFrame as a NumPy array to return
as the answer.
3. What year had the highest crime rate? In this year, which crime was committed the
most? What percentage of the total crime that year was it? Save this value as a float.
The datetime.datetime object has a parser method, strptime(), that converts a string into
a new datetime.datetime object. The parser is flexible, so the user must specify the format that
the dates are in. For example, if the dates are in the format "Month/Day//Year::Hour", specify
format="%m/%d//%Y::%H" to parse the string appropriately. See Table 9.4 for formatting options.
Pattern Description
%Y 4-digit year
%y 2-digit year
%m 1- or 2-digit month
%d 1- or 2-digit day
%H Hour (24-hour)
%I Hour (12-hour)
%M 2-digit minute
%S 2-digit second
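For example, parsing a string in the format described above:

>>> from datetime import datetime

>>> datetime.strptime("3/14//2008::11", "%m/%d//%Y::%H")
datetime.datetime(2008, 3, 14, 11, 0)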
Problem 4. The file DJIA.csv contains daily closing values of the Dow Jones Industrial Av-
erage from 2006–2016. Read the data into a Series or DataFrame with a DatetimeIndex as
the index. Drop any rows without numerical values, cast the "VALUE" column to floats, then
return the updated DataFrame.
Hint: You can change the column type the same way you would change a NumPy array type.
New DatetimeIndex objects can be generated with the pd.date_range() function, which accepts the
following parameters.
Parameter Description
start Starting date
end End date
periods Number of dates to include
freq Amount of time between consecutive dates
normalize Normalizes the start and end times to midnight
Exactly three of the parameters start, end, periods, and freq must be specified to gen-
erate a range of dates. The freq parameter accepts a variety of string representations, referred
to as offset aliases. See Table 9.6 for a sampling of some of the options. For a complete list of
the options, see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases.
Parameter Description
"D" calendar daily (default)
"B" business daily (every business day)
"H" hourly
"T" minutely
"S" secondly
"MS" first day of the month (Month Start)
"BMS" first business day of the month (Business Month Start)
"W-MON" every Monday (Week-Monday)
"WOM-3FRI" every 3rd Friday of the month (Week of the Month - 3rd Friday)
# Create a DatetimeIndex with the first weekday of every other month in 2016.
>>> pd.date_range(start='1/1/2016', end='1/1/2017', freq="2BMS" )
DatetimeIndex(['2016-01-01', '2016-03-01', '2016-05-02', '2016-07-01',
'2016-09-01', '2016-11-01'],
dtype='datetime64[ns]', freq='2BMS')
# Create a DatetimeIndex for 10-minute intervals between 4:00 PM and 4:30 PM
# on September 28, 2016.
>>> pd.date_range(start='9/28/2016 16:00',
end='9/28/2016 16:30', freq="10T")
DatetimeIndex(['2016-09-28 16:00:00', '2016-09-28 16:10:00',
'2016-09-28 16:20:00', '2016-09-28 16:30:00'],
dtype='datetime64[ns]', freq='10T')
Problem 5. The file paychecks.csv contains values of an hourly employee’s last 93 paychecks.
Paychecks are given every other Friday, starting on March 14, 2008, and the employee started
working on March 13, 2008.
Read in the data, using pd.date_range() to generate the DatetimeIndex. Set this as
the new index of the DataFrame and return the DataFrame.
The shift() method of a Series or DataFrame shifts the data forward or backward along the index,
as in the following example.
>>> df = pd.DataFrame(dict(VALUE=np.random.rand(5)),
index=pd.date_range("2016-10-7", periods=5, freq='D'))
>>> df
VALUE
2016-10-07 0.127895
2016-10-08 0.811226
2016-10-09 0.656711
2016-10-10 0.351431
2016-10-11 0.608767
>>> df.shift(1)
VALUE
2016-10-07 NaN
2016-10-08 0.127895
2016-10-09 0.811226
2016-10-10 0.656711
2016-10-11 0.351431
>>> df.shift(-2)
VALUE
2016-10-07 0.656711
2016-10-08 0.351431
2016-10-09 0.608767
2016-10-10 NaN
2016-10-11 NaN
Shifting data makes it easy to gather statistics about changes from one timestamp or period to
the next.
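For instance, subtracting a shifted copy of the data from the original gives the change from one day to the next, which is exactly what Problem 6 needs:

# Change in value from the previous day.
>>> gains = df - df.shift(1)

# Index labels of the largest single-day gain and loss (one entry per column).
>>> gains.idxmax()
>>> gains.idxmin()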
Problem 6. Compute the following information about the DJIA dataset from Problem 4, which
has a DatetimeIndex: the date of the day with the largest gain and the date of the day with the
largest loss. Return both dates.
(Hint: Call your function from Problem 4 to get the DataFrame already cleaned and with a
DatetimeIndex.)
More information on how to use datetime with Pandas is in the additional material section.
This includes working with Periods and more analysis with time series.
Additional Material
SQL Operations in pandas
DataFrames are tabular data structures bearing an obvious resemblance to a typical relational
database table. SQL is the standard for working with relational databases; however, pandas can
accomplish many of the same tasks as SQL. The SQL-like functionality of pandas is one of its
biggest advantages, eliminating the need to switch between programming languages for different
tasks. Within pandas, we can handle both the querying and data analysis.
For the examples below, we will use the mathInfo and otherInfo DataFrames of student data.
SQL SELECT statements can be done by column indexing, and WHERE statements can be
included by adding masks (just as in a NumPy array). The method isin() can also provide a useful
WHERE statement. This method accepts a list, dictionary, or Series containing possible values of
the DataFrame or Series. When called, it returns a Series of booleans indicating whether each
entry contains one of the values passed into isin().
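For example, a query of this form keeps only the rows whose ID appears in a given list (the specific IDs below are illustrative):

# SELECT ID, GPA FROM otherInfo WHERE ID IN (0, 3, 7)
>>> otherInfo[otherInfo['ID'].isin([0, 3, 7])][['ID', 'GPA']]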
ID GPA
0 0 3.8
3 3 3.9
7 7 3.4
Next, let’s look at JOIN statements. In pandas, this is done with the merge function. merge
takes the two DataFrame objects to join as parameters, as well as keyword arguments specifying the
column on which to join, along with the type (left, right, inner, outer).
# SELECT GPA, Grade FROM otherInfo FULL OUTER JOIN mathInfo
# ON otherInfo.ID = mathInfo.ID
>>> pd.merge(otherInfo, mathInfo, on='ID', how='outer')[['GPA', 'Grade']]
GPA Grade
0 3.8 4.0
1 3.5 3.0
2 3.0 NaN
3 3.9 4.0
4 2.8 NaN
5 2.9 3.5
6 3.8 3.0
7 3.4 NaN
8 3.7 NaN
[9 rows x 2 columns]
Like the pd.date_range() method, the pd.period_range() method is useful for generating a
PeriodIndex for unindexed data. The syntax is essentially identical to that of pd.date_range().
When using pd.period_range(), remember that the freq parameter marks the end of the period.
After creating a PeriodIndex, the freq parameter can be changed via the asfreq() method.
# Get every three months from March 2010 to the start of 2011.
>>> p = pd.period_range("2010-03", "2011", freq="3M")
>>> p
PeriodIndex(['2010-03', '2010-06', '2010-09', '2010-12'],
dtype='period[3M]', freq='3M')
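For instance, converting p to a quarterly frequency (a sketch assuming the default end-of-period convention):

>>> p = p.asfreq('Q-DEC')
>>> p
PeriodIndex(['2010Q2', '2010Q3', '2010Q4', '2011Q1'], freq='Q-DEC')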
# Shift the index by 1.
>>> p -= 1
>>> p
PeriodIndex(['2010Q1', '2010Q2', '2010Q3', '2010Q4'],
dtype='int64', freq='Q-DEC')
If for any reason you need to switch from periods to timestamps, pandas provides the to_timestamp()
method to do so. Its how parameter can be 'start' or 'end', determining whether the timestamp is
placed at the beginning or the end of the period. Similarly, the to_period() method switches from
timestamps to periods.
>>> p.to_period("Q-DEC")
PeriodIndex(['2010Q1', '2010Q2', '2010Q3', '2010Q4'],
dtype='int64', freq='Q-DEC')
Slicing
Slicing is much more flexible in pandas for time series. We can slice by year, by month, or even use
traditional slicing syntax to select a range of dates.
Resampling
Some datasets do not have datapoints at a fixed frequency. For example, a dataset of website traffic
has datapoints that occur at irregular intervals. In situations like these, resampling can help provide
insight on the data.
The two main forms of resampling are downsampling, aggregating data into fewer intervals, and
upsampling, adding more intervals.
To downsample, use the resample() method of the Series or DataFrame. This method is
similar to groupby() in that it groups different entries together. Then aggregation produces a new
data set. The first parameter to resample() is an offset string from Table 9.6: "D" for daily, "H" for
hourly, and so on.
Many time series are inherently noisy. To analyze general trends in data, we use rolling functions
and exponentially-weighted moving (EWM) functions. Rolling functions, or moving window functions,
perform a calculation on a window of data. There are a few rolling functions that come standard
with pandas.
One of the most commonly used rolling functions is the rolling average, which takes the average value
over a window of data.
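For example, if s is a (hypothetical) noisy Series indexed by time, a rolling average over a 200-entry window can be computed and plotted as follows.

# Plot the raw data together with its 200-entry rolling average.
>>> s.plot(color="gray", lw=.3)
>>> s.rolling(window=200).mean().plot(color='r', lw=1)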
Whereas a moving window function gives equal weight to the whole window, an exponentially-weighted
moving function gives more weight to the most recent data points.
In the case of an exponentially-weighted moving average (EWMA), each data point is calculated as
\[ z_i = \alpha \bar{x}_i + (1 - \alpha) z_{i-1}, \]
where $z_i$ is the value of the EWMA at time $i$, $\bar{x}_i$ is the average for the $i$-th window, and $\alpha$ is the
decay factor that controls the importance of previous data points. Notice that $\alpha = 1$ reduces to the
rolling average.
More commonly, the decay is expressed as a function of the window size. In fact, the span for
an EWMA is nearly analogous to window size for a rolling average.
Notice the syntax for EWM functions is very similar to that of rolling functions.
ax2 = plt.subplot(122)
s.plot(color="gray", lw=.3, ax=ax2)
s.ewm(span=200).mean().plot(color='g', lw=1, ax=ax2)
ax2.legend(["Actual", "EWMA"], loc="lower right")
ax2.set_title("EWMA")
10
Pandas 2: Plotting
Lab Objective: Clear, insightful visualizations are a crucial part of data analysis. To facilitate
quick data visualization, pandas includes several tools that wrap around matplotlib. These tools make
it easy to compare different parts of a data set, explore the data as a whole, and spot patterns and
correlations in the data.
Table 10.1: Types of plots in pandas. The plot ID is the value of the keyword argument kind. That
is, df.plot(kind="scatter") creates a scatter plot. The default kind is "line".
The plot() method calls plt.plot(), plt.hist(), plt.scatter(), and other matplotlib
plotting functions, but it also assigns axis labels, tick marks, legends, and a few other things based
on the index and the data. Most calls to plot() specify the kind of plot and which Series to use
as the x and y axes. By default, the index of the Series or DataFrame is used for the x axis.
In this case, the call to the plot() method is essentially equivalent to the following code.
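For instance, a call like budget.plot(y="Groceries") is roughly equivalent to the following matplotlib code (a sketch of the idea, not the exact internals):

>>> from matplotlib import pyplot as plt

>>> plt.plot(budget.index, budget["Groceries"], label="Groceries")
>>> plt.legend(loc="best")
>>> plt.show()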
The plot() method also takes in many keyword arguments for matplotlib plotting and an-
notation functions. For example, setting legend=False disables the legend, providing a value for
title sets the figure title, grid=True turns a grid on, and so on. For more customizations, see
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html.
A good way to start analyzing an unfamiliar data set is to visualize as much of the data as possible
to determine which parts are most important or interesting. For example, since the columns in a
DataFrame share the same index, the columns can all be graphed together using the index as the
x-axis. By default, the plot() method attempts to plot every Series (column) in a DataFrame.
This is especially useful with sequential data, like the budget data set.
Figure 10.1: (a) All columns of the budget data set on the same figure, using the index as the x-axis.
(b) All columns of the budget data set except "Living Expenses" and "Rent".
While plotting every Series at once can give an overview of all the data, the resulting plot is
often difficult for the reader to understand. For example, the budget data set has 9 columns, so the
resulting figure, Figure 10.1a, is fairly cluttered.
One way to declutter a visualization is to examine less data. Notice that 'Living Expenses'
has values much bigger than the other columns. Dropping this column, as well as 'Rent', gives a
better overview of the data, shown in Figure 10.1b.
Achtung!
Often plotting all data at once is unwise because columns have different units of measure.
Be careful not to plot parts of a data set together if those parts do not have the same units or
are otherwise incomparable.
Another way to declutter a plot is to use subplots. To quickly plot several columns in sepa-
rate subplots, use subplots=True and specify a shape tuple as the layout for the plots. Subplots
automatically share the same x-axis. Set sharey=True to force them to share the same y-axis as
well.
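For example, the nine budget columns fit on a 3-by-3 grid:

# One subplot per column, sharing both axes.
>>> budget.plot(subplots=True, layout=(3, 3), sharey=True)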
As mentioned previously, the plot() method can be used to plot different kinds of plots. One
possible kind of plot is a histogram. Since plots made by the plot() method share an x-axis by
default, histograms turn out poorly whenever there are columns with very different data ranges or
when more than one column is plotted at once.
Figure 10.2: Two examples of histograms that are difficult to understand because multiple columns
are plotted.
Thus, histograms are good for examining the distribution of a single column in a data set. For
histograms, use the hist() method of the DataFrame instead of the plot() method. Specify the
number of bins with the bins parameter. Choose a number of bins that accurately represents the
data; the wrong number of bins can create a misleading or uninformative visualization.
Problem 1. Create 3 visualizations for the data in crime_data.csv. Make one of the visual-
izations a histogram. The visualizations should be well labeled and easy to understand. Include
a short description of your plots as a caption.
# Plot 'Dining Out' and 'Out With Friends' as lines against the index.
>>> budget.plot(y=["Dining Out", "Out With Friends"],
...             title="Amount Spent on Dining Out and Out with Friends per Day")
Figure 10.4: Correlations between "Dining Out" and "Out With Friends".
The first plot shows us that more money is spent on dining out than being out with friends
overall. However, both categories stay in the same range for most of the data. This is confirmed in
the scatter plot by the block in the upper right corner, indicating the common range spent on dining
out and being out with friends.
Achtung!
When analyzing data, especially while searching for patterns and correlations, always ask
yourself if the data makes sense and is trustworthy. What lurking variables could have influenced
the data measurements as they were being gathered?
The crime data set from Problem 1 is somewhat suspect in this regard. The murder rate is
likely accurate, since murder is conspicuous and highly reported, but what about the rape rate?
Are the number of rapes increasing, or is the percentage of rapes being reported increasing?
It’s probably both! Be careful about drawing conclusions for sensitive or questionable data.
Another useful visualization for understanding correlations in a data set is a scatter matrix.
The function pd.plotting.scatter_matrix() produces a table of plots where each column is plotted
against each other column in separate scatter plots. The plots on the diagonal, instead of plotting
a column against itself, display a histogram of that column. This provides a very quick method for
an initial analysis of the correlation between different columns.
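For example, a scatter matrix of the two aggregate budget columns:

>>> pd.plotting.scatter_matrix(budget[["Living Expenses", "Other"]])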
Figure 10.5: Scatter matrix of the "Living Expenses" and "Other" columns of the budget data set.
Bar Graphs
Different types of graphs help to identify different patterns. Note that the data set budget gives
monthly expenses. It may be beneficial to look at one specific month. Bar graphs are a good way to
compare small portions of the data set.
As a general rule, horizontal bar charts (kind="barh") are better than the default vertical bar
charts (kind="bar") because most humans can detect horizontal differences more easily than vertical
differences. If the labels are too long to fit on a normal figure, use plt.tight_layout() to adjust
the plot boundaries to fit the labels in.
# Plot all data for the last month without 'Rent' and 'Living Expenses'
>>> budget.drop(['Rent','Living Expenses'],axis=1).iloc[-1,:].plot(kind='barh')
>>> plt.tight_layout()
Figure 10.6: Bar graphs showing expenses paid in the last month of budget.
Problem 2. Using the crime data from the previous problem, identify if a trend exists between
Forcible Rape and the following variables:
1. Violent
2. Burglary
3. Aggravated Assault
Make sure each graph is clearly labelled and readable. Include a caption explaining
whether there is a visual trend between the variables.
Distributional Visualizations
While histograms are good at displaying the distribution of one column, a different visualization is
needed to show the distribution of an entire data set. A box plot, sometimes called a "box-and-whisker"
plot, shows the five number summary: the minimum, first quartile, median, third quartile, and maximum
of the data. Box plots are useful for comparing the distributions of related data. However, box
plots are a basic summary, meaning that they can miss important information such as
how many points were in each distribution.
# Compare the distributions of all columns but 'Rent' and 'Living Expenses'.
>>> budget.drop(["Rent", "Living Expenses"], axis=1).plot(kind="box",
... vert=False)
Figure 10.7: Box plots comparing the distributions of the budget columns (excluding "Rent" and
"Living Expenses").
Hexbin Plots
A scatter plot is essentially a plot of samples from the joint distribution of two columns. However,
scatter plots can be uninformative for large data sets when the points in a scatter plot are closely
clustered. Hexbin plots solve this problem by plotting point density in hexagonal bins—essentially
creating a 2-dimensional histogram.
The file sat_act.csv contains 700 self-reported scores on the SAT Verbal, SAT Quantitative,
and ACT, collected as part of the Synthetic Aperture Personality Assessment (SAPA) web-based
personality assessment project. The obvious question with this data set is "how correlated are ACT
and SAT scores?" The scatter plot of ACT scores versus SAT Quantitative scores, Figure 10.8a, is
highly cluttered, even though the points have some transparency. A hexbin plot of the same data,
Figure 10.8b, reveals the frequency of points in binned regions.
# Plot the ACT scores against the SAT Quant scores in a regular scatter plot.
>>> satact.plot(kind="scatter", x="ACT", y="SATQ", alpha=.8)
# Plot the densities of the ACT vs. SATQ scores with a hexbin plot.
>>> satact.plot(kind="hexbin", x="ACT", y="SATQ", gridsize=20)
(a) ACT vs. SAT Quant scores. (b) Frequency of ACT vs. SAT Quant scores.
Figure 10.8: Scatter plots and hexbin plot of SAT and ACT scores.
Just as choosing a good number of bins is important for a good histogram, choosing a good
gridsize is crucial for an informative hexbin plot. A large gridsize creates many small bins and a
small gridsize creates fewer, larger bins.
Note
Since hexbins are based on frequencies, they can be misleading if the dataset is not well understood.
For example, when plotting information that deals with geographic position, increases in frequency
may simply reflect higher populations rather than the actual quantity being plotted.
Attention to Detail
Consider the plot in Figure 10.9. It is a scatter plot of positively correlated data of some kind, with
temp (likely temperature) on the x-axis and cons on the y-axis. However, the picture does not really
communicate anything about the dataset. It does not specify the units for the x or the y axis, nor
does it say what cons is. There is no title, and the source of the data is unknown.
Figure 10.9: Non-specific data.
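A plot like Figure 10.9 is typically the result of calling plot() with nothing but the column names, for example (the DataFrame name icecream and the filename are assumptions):

>>> icecream = pd.read_csv("icecream.csv")
>>> icecream.plot(kind="scatter", x="temp", y="cons")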
This code produces the rather substandard plot in Figure 10.9. Examining the source of the
dataset can give important details to create better plots. When plotting data, make sure to under-
stand what the variable names represent and where the data was taken from. Use this information
to create a more effective plot.
The ice cream data used in Figure 10.9 is better understood with the following information:
1. The dataset details ice cream consumption via 30 four-week periods from March 1951 to July
1953 in the United States.
2. cons corresponds to “consumption of ice cream per capita” and is measured in pints.
6. The listed source is: “Hildreth, C. and J. Lu (1960) Demand relations with autocorrelated
disturbances, Technical Bulletin No 2765, Michigan State University.”
This information gives important details that can be used in the following code. As seen in
previous examples, pandas automatically generates legends when appropriate. Pandas also automat-
ically labels the x and y axes; however, our DataFrame column titles may be insufficient. Appropriate
titles for the x and y axes must also list appropriate units. For example, the y axis should specify
that the consumption is in units of pints per head, in place of the ambiguous label cons.
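Using that information, a more informative version of the figure might be produced as follows (again assuming the DataFrame is called icecream; the temperature units are an assumption):

>>> from matplotlib import pyplot as plt

>>> icecream.plot(kind="scatter", x="temp", y="cons")
>>> plt.xlabel("Mean temperature (degrees Fahrenheit)")
>>> plt.ylabel("Ice cream consumption (pints per head)")
>>> plt.title("Ice cream consumption in the U.S., March 1951 - July 1953")
>>> plt.show()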
To add the necessary text to the figure, use either plt.annotate() or plt.text(). Alter-
natively, add text immediately below wherever the figure is displayed. The first two parameters of
plt.text are the x and y coordinates to place the text. The third parameter is the text to write.
For instance, using plt.text(0.5, 0.5, "Hello World") will center the Hello World string in the
axes.
Both of these methods are imperfect but can normally be easily replaced by a caption attached
to the figure. Again, we reiterate how important it is that you source any data you use; failing to do
so is plagiarism.
Finally, we have a clear and demonstrative graphic in Figure 10.10.
Achtung!
Visualizations can inherit the biases of the visualizer and, as a result, can be intentionally
misleading. Examples include, but are not limited to, visualizing subsets of the data that do not
represent the whole and purposely distorting the axes. Every data visualizer has the responsibility
to avoid introducing bias so that the data is represented informatively and accurately.
Problem 4. The dataset college.csv contains information from 1995 on universities in the
United States. To access information on variable names, go to https://cran.r-project.org/
web/packages/ISLR/ISLR.pdf. Create 3 plots that compare variables or universities. These
plots should answer questions about the data, e.g. what is the distribution of graduation rates
or do schools with lower student to faculty ratios have higher tuition costs. These three plots
should be easy to understand and have clear variable names and citations.
11
Pandas 3: Grouping
Lab Objective: Many data sets contain categorical values that naturally sort the data into groups.
Analyzing and comparing such groups is an important part of data analysis. In this lab we explore
pandas tools for grouping data and presenting tabular data more compactly, primarily through groupby
and pivot tables.
Groupby
The file mammal_sleep.csv1 contains data on the sleep cycles of different mammals, classified by
order, genus, species, and diet (carnivore, herbivore, omnivore, or insectivore). The "sleep_total"
column gives the total number of hours that each animal sleeps (on average) every 24 hours. To get
an idea of how many animals sleep for how long, we start off with a histogram of the "sleep_total"
column.
1 Proceedings of the National Academy of Sciences, 104 (3):1051–1056, 2007. Updates from V. M. Savage and G. B.
West, with additional variables supplemented by Wikipedia. Available in pydataset (with a few more columns) under
the key "msleep".
While this visualization is a good start, it doesn’t provide any information about how different
kinds of animals have different sleeping habits. How long do carnivores sleep compared to herbivores?
Do mammals of the same genus have similar sleep patterns?
A powerful tool for answering these kinds of questions is the groupby() method of the pandas
DataFrame class, which partitions the original DataFrame into groups based on the values in one
or more columns. The groupby() method does not return a new DataFrame; it returns a pandas
GroupBy object, an interface for analyzing the original DataFrame by groups.
For example, the columns "genus", "vore", and "order" in the mammal sleep data all have a
discrete number of categorical values that could be used to group the data. Since the "vore" column
has only a few unique values, we start by grouping the animals by diet.
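Assuming the data has been read into a DataFrame called msleep, the grouping used below is created like this:

>>> msleep = pd.read_csv("mammal_sleep.csv")

# Group the data by the "vore" column.
>>> vores = msleep.groupby("vore")
>>> list(vores.groups)
['carni', 'herbi', 'insecti', 'omni']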
# Get a single group and sample a few rows. Note vore='carni' in each entry.
>>> vores.get_group("carni").sample(5)
name genus vore order sleep_total sleep_rem sleep_cycle
80 Genet Genetta carni Carnivora 6.3 1.3 NaN
50 Tiger Panthera carni Carnivora 15.8 NaN NaN
8 Dog Canis carni Carnivora 10.1 2.9 0.333
0 Cheetah Acinonyx carni Carnivora 12.1 NaN NaN
82 Red fox Vulpes carni Carnivora 9.8 2.4 0.350
As shown above, groupby() is useful for filtering a DataFrame by column values; the command
df.groupby(col).get_group(value) returns the rows of df where the entry of the col column is
value. The real advantage of groupby(), however, is how easily it compares groups of data. Standard
DataFrame methods like describe(), mean(), std(), min(), and max() all work on GroupBy objects
to produce a new data frame that describes the statistics of each group.
Multiple columns can be used simultaneously for grouping. In this case, the get_group()
method of the GroupBy object requires a tuple specifying the values for each of the grouping columns.
Problem 1. Read in the data college.csv containing information on various United States
universities in 1995. To access information on variable names, go to https://cran.r-project.
org/web/packages/ISLR/ISLR.pdf. Use a groupby object to group the colleges by private and
public universities. Read in the data as a DataFrame object and use groupby and describe to
examine the following columns by group:
2. How many students from the top 10% of their high school class,
3. How many students from the top 25% of their high school class.
Determine whether private or public universities have a higher mean for each of these columns.
For the type of university with the higher mean, save the values of the describe function on
said column as an array using .values. Return a tuple with these arrays in the order described
above.
For example, if I were comparing whether the number of professors with PhDs was higher
at private or public universities, I would return the following array:
Visualizing Groups
There are a few ways that groupby() can simplify the process of visualizing groups of data. First of
all, groupby() makes it easy to visualize one group at a time using the plot method. The following
visualization improves on Figure 11.1 by grouping mammals by their diets.
Figure 11.2: "sleep_total" histograms for two groups in the mammalian sleep data set.
The statistical summaries from the GroupBy object’s mean(), std(), or describe() methods
also lend themselves well to certain visualizations for comparing groups.
Figure 11.3: Mean "sleep_total" (in hours, with standard deviations) for the herbivore and carnivore
groups.
Box plots are well suited for comparing similar distributions. The boxplot() method of the
GroupBy class creates one subplot per group, plotting each of the columns as a box plot.
Figure 11.4: Box plots of "sleep_total", "sleep_rem", and "sleep_cycle" for each diet group (carni,
herbi, insecti, and omni), as produced by the GroupBy boxplot() method.
Alternatively, the boxplot() method of the DataFrame class creates one subplot per column,
plotting each of the columns as a box plot. Specify the by keyword to group the data appropriately.
Like groupby(), the by argument can be a single column label or a list of column labels. Similar
methods exist for creating histograms (GroupBy.hist() and DataFrame.hist() with by keyword),
but generally box plots are better for comparing multiple distributions.
Problem 2. Create visualizations that give relevant information answering the following ques-
tions (using college.csv):
1. How do the number of applicants, number of accepted students, and number of enrolled
students compare between private and public universities?
2. How wide is the range of money spent on room and board at both private and public
universities?
Pivot Tables
One of the downfalls of groupby() is that a typical GroupBy object has too much information to
display coherently. A pivot table intelligently summarizes the results of a groupby() operation
by aggregating the data in a specified way. The standard tool for making a pivot table is the
pivot_table() method of the DataFrame class. As an example, consider the "HairEyeColor" data
set from pydataset.
>>> for col in ["Hair", "Eye", "Sex"]: # Get unique values per column.
... print("{}: {}".format(col, ", ".join(set(str(x) for x in hec[col]))))
...
Hair: Brown, Black, Blond, Red
Eye: Brown, Blue, Hazel, Green
Sex: Male, Female
There are several ways to group this data with groupby(). However, since there is only one
entry per unique hair-eye-sex combination, the data can be completely presented in a pivot table.
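A sketch of that table, assuming the data is stored in a DataFrame hec with a "Freq" column holding the counts (as in the pydataset version):

>>> hec.pivot_table(values="Freq", index=["Hair", "Eye"], columns="Sex")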
Listing the data in this way makes it easy to locate data and compare the female and male
groups. For example, it is easy to see that brown hair is more common than red hair and that about
twice as many females have blond hair and blue eyes as males.
Unlike "HairEyeColor", many data sets have more than one entry in the data for each grouping.
An example in the previous dataset would be if there were two or more rows in the original data for
females with blond hair and blue eyes. To construct a pivot table, data of similar groups must be
aggregated together in some way.
By default entries are aggregated by averaging the non-null values. You can use the keyword
argument aggfunc to choose among different ways to aggregate the data. For example, if you use
aggfunc='min', the value displayed will be the minimum of all the values. Other arguments include
'max', 'std' for standard deviation, 'sum', or 'count' to count the number of occurrences. You
also may pass in any function that reduces to a single float, like np.argmax or even np.linalg.norm
if you wish. A list of functions can also be passed into the aggfunc keyword argument.
Consider the Titanic data set found in titanic.csv2 . For this analysis, take only the "
Survived", "Pclass", "Sex", "Age", "Fare", and "Embarked" columns, replace null age values
with the average age, then drop any rows that are missing data. To begin, we examine the average
survival rate grouped by sex and passenger class.
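A sketch of that table, assuming the cleaned data is stored in a DataFrame called titanic:

# Average survival rate by sex (rows) and passenger class (columns).
>>> titanic.pivot_table(values="Survived", index="Sex", columns="Pclass")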
Note
The pivot_table() method is a convenient way of performing a potentially complicated
groupby() operation with aggregation and some reshaping. The following code is equivalent
to the previous example.
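A sketch of that equivalent groupby() version:

>>> titanic.groupby(["Sex", "Pclass"])["Survived"].mean().unstack()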
The stack(), unstack(), and pivot() methods provide more advanced shaping options.
Among other things, this pivot table clearly shows how much more likely females were to survive
than males. To see how many entries fall into each category, or how many survived in each category,
aggregate by counting or summing instead of taking the mean.
Pclass    1.0   2.0    3.0
Sex
female  137.0  94.0  106.0
male     61.0  25.0   75.0
Problem 3. The file Ohio_1999.csv contains data on workers in Ohio in the year 1999. Use
pivot tables to answer the following questions:
1. Which race/sex combination makes the most Usual Weekly Earnings in aggregate?
2. Which race/sex combination worked the least amount of cumulative Usual Hours Worked?
3. What race/sex combination worked the most Usual Hours Worked per week per person?
Return a tuple for each question (in order of the questions) where the first entry is the
numerical code corresponding to the race and the second entry is corresponding to the sex.
Some useful keys for understanding the data are as follows:
From this table, it appears that male children (ages 0 to 12) in the 1st and 2nd class were very
likely to survive, whereas those in 3rd class were much less likely to. This clarifies the claim that
males were less likely to survive than females. However, there are a few oddities in this table: zero
percent of the female children in 1st class survived, and zero percent of teenage males in second class
survived. To further investigate, count the number of entries in each group.
This table shows that there was only 1 female child in first class and only 10 male teenagers in
second class, which sheds light on the previous table.
Achtung!
The previous pivot table brings up an important point about partitioning datasets. The Titanic
dataset includes data for about 1300 passengers, which is a somewhat reasonable sample size,
but half of the groupings include less than 30 entries, which is not a healthy sample size for
statistical analysis. Always carefully question the numbers from pivot tables before making any
conclusions.
Pandas also supports multi-indexing on the columns. As an example, consider the price of a
passenger's ticket. This is another continuous feature that can be discretized with pd.cut(). Instead,
we use pd.qcut() to split the prices into 2 equal quantiles. Some of the resulting groups are empty;
to improve readability, specify fill_value as the empty string or a dash.
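A sketch of the binning that the table below relies on; the bin edges follow the intervals shown in the output.

# Discretize Age into children, teenagers, and adults; split Fare into 2 quantiles.
>>> age = pd.cut(titanic['Age'], [0, 12, 18, 80])
>>> fare = pd.qcut(titanic['Fare'], 2)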
>>> titanic.pivot_table(values="Survived", index=["Sex", age], columns=[fare, "Pclass"],
aggfunc="count", fill_value='-')
Fare (-0.001, 14.454] (14.454, 512.329]
Pclass 1.0 2.0 3.0 1.0 2.0 3.0
Sex Age
female (0, 12] - - 7 1 13 23
(12, 18] - 4 23 12 4 5
(18, 80] - 31 101 129 54 57
male (0, 12] - - 8 4 11 27
(12, 18] - 5 26 4 5 11
(18, 80] 8 94 350 163 56 70
Not surprisingly, most of the cheap tickets went to passengers in 3rd class.
Problem 4. Use the employment data from Ohio in 1999 to answer the following questions:
1. The column Educational Attainment contains numbers 0-46. Any number less than 39
means the person did not get any form of degree. 39-42 refers to either a high-school or
associate’s degree. A number greater than or equal to 43 means the person got at least a
bachelor’s degree. What is the most common degree among workers?
2. Partition the Age column into 4 evenly spaced intervals. Which interval has the most
workers?
Return the answer to each question (in order) as an Interval. For part three, the answer
should be a tuple where the first entry is the Interval of the age and the second is the Interval
of the degree.
An Interval is the object returned by pd.cut and pd.qcut. An example of getting an
Interval from a pivot table is shown below.
>>> # Create pivot table used in last example with titanic dataset
>>> table = titanic.pivot_table(values="Survived",
index=[age], columns=[fare, "Pclass"],
aggfunc="count")
>>> # Get index of maximum interval
>>> table.sum(axis=1).idxmax()
Interval(0, 12, closed='right')
Problem 5. Examine the college dataset using pivot tables and groupby objects. Deter-
mine the answer to the following questions. If the answer is yes, save the answer as True. If
the answer is no, save the answer as False. For the last question, save the answer as a string
giving your explanation. Return a tuple containing your answers to the questions in order.
1. Is there a correlation between the percent of alumni that donate and the amount the
school spends per student in BOTH private and public universities?
2. Partition Grad.Rate into evenly spaced intervals of 20%. Is the partition with the greatest
number of schools the same for private and public universities?
3. Does having a lower acceptance rate correlate with having more students from the top 10
percent of their high school class being admitted on average for BOTH private and public
universities?
4. Why is the average percentage of students admitted from the top 10 percent of their high
school class so high in private universities with very low acceptance rates? Use only the
data to explain why; do not extrapolate.
12
Geopandas
Lab Objective: Geopandas is a package designed to organize and manipulate geographic data.
It combines the data manipulation tools of Pandas with the geometric capabilities of the Shapely
package. In this lab, we explore the basic data structures of GeoSeries and GeoDataFrames and their
functionalities.
Installation
Geopandas is a new package designed to combine the functionalities of Pandas and Shapely, a package
used for geometric manipulation. Using Geopandas with geographic data is very useful as it allows
the user to compare not only numerical data but also geometric attributes. Since Geopandas is currently
under development, the installation procedure requires that all dependencies are up to date. It is
possible to install Geopandas through pip using pip install geopandas.
A particular package needed for Geopandas is Fiona. Geopandas will not run without the
correct version of this package. To check the current version of Fiona that is installed, run the
following code. If the version is not at least 1.7.13, update Fiona.
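For example:

>>> import fiona
>>> print(fiona.__version__)       # Should be at least 1.7.13.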
GeoSeries
A GeoSeries is a Pandas Series where each entry is a set of geometric objects. There are three classes
of geometric objects inherited from the Shapely package:
1. Points / Multi-Points
2. Lines / Multi-Lines
3. Polygons / Multi-Polygons
A point is used to identify objects like coordinates, where there is one small instance of the object. A
line could be used to describe a road. A polygon could be used to identify regions, such as a country.
Multipoints, multilines, and multipolygons contain lists of points, lines, and polygons, respectively.
Since each object in the GeoSeries is also a Shapely object, the GeoSeries inherits many methods
and attributes of Shapely objects. Some of the key attributes and methods are listed in Table 12.1.
These attributes and methods can be used to calculate distances, find the sizes of countries, and de-
termine whether coordinates are within a country's boundaries. The example below uses the attribute
bounds to find the maximum and minimum coordinates of Egypt in a built-in GeoDataFrame.
Method/Attribute Description
distance(other) returns minimum distance from GeoSeries to other
contains(other) returns True if shape contains other
intersects(other) returns True if shape intersects other
area returns shape area
convex_hull returns convex shape around all points in the object
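A sketch of that example, using the small world dataset that ships with Geopandas (the dataset key and the column name "name" are details of that dataset):

>>> import geopandas as gpd

# Load the built-in world map and select Egypt.
>>> world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
>>> egypt = world[world["name"] == "Egypt"]

# bounds returns the minimum and maximum x (longitude) and y (latitude) values.
>>> egypt.bounds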
Creating GeoDataFrames
The main structure used in GeoPandas is a GeoDataFrame, which is similar to a Pandas DataFrame.
A GeoDataFrame has one special column called geometry. This GeoSeries column is used when a
spatial method, like distance(), is used on the GeoDataFrame.
To make a GeoDataFrame, first create a Pandas DataFrame. At least one of the columns
in the DataFrame should contain geometric information. Convert a column containing geometric
information to a GeoSeries using the apply method. At this point, the Pandas DataFrame can be
cast as a GeoDataFrame. When creating a GeoDataFrame, if more than one column has geometric
data, assign which column will be the geometry using the set_geometry() method.
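A sketch of the steps leading up to the cast below, using a couple of made-up city coordinates:

>>> import pandas as pd
>>> import geopandas as gpd
>>> from shapely.geometry import Point

>>> df = pd.DataFrame({'City': ['Cairo', 'Lima'],
...                    'Longitude': [31.24, -77.03],
...                    'Latitude': [30.05, -12.05]})

# Combine the coordinates into Shapely Points with apply().
>>> df['Coordinates'] = list(zip(df.Longitude, df.Latitude))
>>> df['Coordinates'] = df['Coordinates'].apply(Point)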
# Cast as GeoDataFrame
>>> gdf = gpd.GeoDataFrame(df, geometry='Coordinates')
Note
Longitude is the angular measurement starting at the Prime Meridian, 0°, and going to 180°
to the east and −180° to the west. Latitude is the angle between the equatorial plane and
the normal line at a given point; a point along the Equator has latitude 0, the North Pole has
latitude +90° or 90°N , and the South Pole has latitude −90° or 90°S.
Plotting GeoDataFrames
Information from a GeoDataFrame is plotted based on the geometry column. Data points are dis-
played as geometry objects. The following example plots the shapes in the world GeoDataFrame.
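A sketch of that plot, again using the built-in world dataset:

>>> from matplotlib import pyplot as plt

>>> world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
>>> world.plot()
>>> plt.show()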
Multiple GeoDataFrames can be plotted at once. This can be done by setting one Geo-
DataFrame as the base of the plot and ensuring that each layer uses the same axes. In the follow-
ing example, a GeoDataFrame containing the coordinates of world airports is plotted on top of a
GeoDataFrame containing the polygons of country boundaries, resulting in a world map of airport
locations.
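A sketch of that layering, assuming the airport coordinates have already been loaded into a GeoDataFrame called airports with the same CRS as world:

# Use the country outlines as the base layer.
>>> base = world.plot(color="white", edgecolor="black")

# Plot the airports on the same axes.
>>> airports.plot(ax=base, marker="o", color="green", markersize=1)
>>> base.set_title("World Airports")
>>> plt.show()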
Figure: world map of airport locations, titled "World Airports".
Problem 1. Read in the file airports.csv as a Pandas DataFrame. Create three convex
hulls around the three sets of airports listed below. This can be done by passing in lists of the
airports’ coordinates to a shapely.geometry.Polygon object.
Create a new GeoDataFrame with these three Polygons as entries. Plot this GeoDataFrame
on top of an outlined world map.
• Oiapoque Airport, Maio Airport, Zhezkazgan Airport, Walton Airport, RAF Ascension
Island, Usiminas Airport, Piloto Osvaldo Marques Dias Airport
• Zhezkazgan Airport, Khanty Mansiysk Airport, Novy Urengoy Airport, Kalay Airport,
Biju Patnaik Airport, Walton Airport
GeoDataFrames can utilize many Pandas functionalities, and they can also be parsed by geo-
metric manipulations. For example, a useful way to index GeoDataFrames is with the cx indexer.
This splits the GeoDataFrame by the coordinates of each geometric object. It is used by calling the
method cx on a GeoDataFrame, followed by a slicing argument, where the first element refers to the
longitude and the second refers to latitude.
GeoSeries in a GeoDataFrame can also be dissolved, or merged, together into one GeoSeries
based on their geometry data. For example, all countries on one continent could be merged to create
a GeoSeries containing the information of that continent. The method designed for this is called
dissolve. It receives two parameters, by and aggfunc. by indicates which column to dissolve along,
and aggfunc tells how to combine the information in all other columns. The default aggfunc is first,
which returns the first entry of each group. In the following example, we use sum as the aggfunc so
that each continent is the combination of its countries.
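A sketch of the continent example:

# Merge the countries of each continent, summing the numerical columns.
>>> continents = world.dissolve(by="continent", aggfunc="sum")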
Projections
When plotting, GeoPandas uses the CRS (coordinate reference system) of a GeoDataFrame. This
reference system informs how coordinates should be spaced on a plot. GeoPandas accepts many
different CRSs, and references to them can be found at www.spatialreference.org. Two of the
most commonly used CRSs are EPSG:4326 and EPSG:3395. EPSG:4326 uses the standard latitude-
longitude projection used by GPS. EPSG:3395, also known as Mercator, is the standard navigational
projection.
When creating a new GeoDataFrame, it is important to set the crs attribute of the Geo-
DataFrame. This allows the plot to be shown correctly. GeoDataFrames being layered need to have
the same CRS. To change the CRS, use the method to_crs().
GeoDataFrames can also be plotted using the values in the other attributes of the GeoSeries.
The map colors each geometry object according to the value of the selected column. This is done
by passing the parameter column into the plot() method.
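For example, coloring each country by its GDP estimate (the column name gdp_md_est is taken from the built-in world dataset):

>>> world.plot(column="gdp_md_est", cmap="OrRd", legend=True)
>>> plt.show()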
Merging GeoDataFrames
Just as multiple Pandas DataFrames can be merged, multiple GeoDataFrames can be merged with
attribute joins or spatial joins. An attribute join is similar to a merge in Pandas. It combines two
GeoDataFrames on a column (not the geometry column) and then combines the rest of the data into
one GeoDataFrame.
A spatial join merges two GeoDataFrames based on their geometry data. The function used
for this is sjoin. sjoin accepts two GeoDataFrames and keyword arguments specifying how to merge.
It is imperative that the two GeoDataFrames have the same CRS. In the example below, we merge
using an inner join with the option intersects. The inner join means that we will only use keys in the
intersection of both geometry columns, and we will retain only the left geometry column. intersects
tells the GeoDataFrames to merge on GeoSeries that intersect each other. Other options include
contains and within.
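A sketch of such a spatial join, pairing each airport with the country it falls in (using the world and airports GeoDataFrames from before):

>>> import geopandas as gpd

# Inner join: keep airports that intersect a country's polygon.
>>> airports_by_country = gpd.sjoin(airports, world, how="inner", op="intersects")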
Problem 3. Load in the file nytimes.csva as a DataFrame. This file includes county-level
data for the cumulative cases and deaths of Covid-19 in the US, starting with the first case in
Snohomish County, Washington, on January 21, 2020. Begin by converting the date column
into a DatetimeIndex.
Next, use county FIPS codes to merge your GeoDataFrame from Problem 2 with the
DataFrame you just created. A FIPS code is a 5-digit unique identifier for geographic locations.
Ignore rows in the Covid-19 DataFrame with unknown FIPS codes as well as all data from
Hawaii and Alaska.
Note that the fips column of the Covid-19 DataFrame stores entries as floats, but the
county GeoDataFrame stores FIPS codes as strings, with the first two digits in the STATEFP
column and the last three in the COUNTYFP column.
Once you have completed the merge, plot the cases from March 21, 2020 on top of your
state outline map from Problem 2, using the CRS of EPSG:5071. Finally, print out the name
of the county with the most cases on March 21, 2020 along with its case count.
a Source: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
Sometimes data can be much more informative when plotted on a logarithmic scale. See how
the world map changes when we add a norm argument in the code below. Depending on the purpose
of the graph, Figure 12.5 may be more informative than Figure 12.4.
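The snippet below is a sketch under the same assumptions as before (a world GeoDataFrame with a gdp_md_est column); the key change is the LogNorm passed as the norm argument.

>>> from matplotlib.colors import LogNorm

>>> fig, ax = plt.subplots(figsize=(10, 5))
# Color on a log scale; vmin is clipped at 1 so the log norm is well-defined.
>>> world.plot(column='gdp_md_est', cmap='viridis', legend=True, ax=ax,
...            norm=LogNorm(vmin=1, vmax=world['gdp_md_est'].max()))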
>>> plt.show()
Figure 12.5: World map showing country GDP using a log scale
Problem 4. As in Problem 3, plot your state outline map from Problem 2 on top of a map of
the Covid-19 cases from March 21, 2020. This time, however, use a log scale. Use EPSG:5071 for
the CRS. Pick a good colormap (the counties with the most cases should generally be darkest)
and be sure to display a colorbar.
Problem 5. In this problem, you will create an animation of the spread of Covid-19 through
US counties from January 21, 2020 to June 21, 2020. Use a log scale and a good colormap, and
be sure that you’re using the same norm and colorbar for the whole animation. Use EPSG:5071
for the projection.
As a reminder, below is a summary of what you will need in order to animate this map.
You may also find it helpful to refer to the animation section included with the Volume 4 lab
manual.
1. Set up your figure and norm. Be sure to use the highest case count for your vmax so that
the scale remains uniform.
2. Write your update function. This should plot the cases from a given day.
3. Set up your colorbar. Do this outside the update function to avoid adding a new colorbar
each day.
4. Create the animation. Check to make sure everything displays properly before you save
it.
Lab Objective: The quality of a data analysis or model is limited by the quality of the data used. In
this lab we learn techniques for cleaning data, creating features, and determining feature importance.
Almost every dataset has problems that make it unsuitable for regression or other modeling.
At a basic level, these problems might cause simple functions to error out. More substantially, data
problems could significantly change the result of your model or analysis.
Data cleaning is the process of identifying and correcting bad data. This could be data that
is missing, duplicated, irrelevant, inconsistent, incorrect, in the wrong format, or does not make
sense. Though it can be tedious, data cleaning is the most important step of data analysis. Without
accurate and legitimate data, any results or conclusions are suspect and may be incorrect.
We will demonstrate common issues with data and how to correct them using the following
dataset. It consists of family members and some basic details.
# Example dataset
>>> df = pd.read_csv('toy_dataset.csv')
>>> df
Name Age name DOB Marital_Status
0 John Doe 30 john 01/01/2010 Divorcee
1 Jane Doe 29 jane 12/02/1990 Divorced
2 Jill smith 40 NaN 03/04/1980 married
3 Jill smith 40 jill 03/04/1980 married
4 jack smith 100 jack 4/4/1980 marrieed
5 Jenny Smith 5 NaN 05/05/2015 NaN
6 JAmes Smith 2 NaN 20/06/2018 single
7 Rover 2 NaN 05/05/2018 NaN
Inspection
The first step of data cleaning is to analyze the quality of the data. If the quality is poor, the data
might not be worth using. Knowing the quality of the data will also give you an idea of how long
it will take to clean it. A quality dataset is one in which the data is valid, accurate, complete,
consistent, and uniform. Some of these issues, like uniformity, are fairly easy to fix during cleaning,
while other aspects like accuracy are more difficult, if not impossible.
Validity is the degree to which the data conforms to given rules. If a column corresponds to the
temperature in Salt Lake City, measured in degrees Fahrenheit, then a value over 110 or below 0
should make you suspicious, since those would be extreme values for Salt Lake City. In fact, checking
the all-time temperature records for Salt Lake shows that the values in this column should never be
more than 107 and never less than −30. Any values outside that range are almost certainly errors
and should probably be reset to NaN, unless you have special information that allows you to impute
more accurate values.
Some standard rules are
• data type: The data types of each column should all be the same.
• data range: The data of a column, typically numbers or dates, should all be in the same
range.
• regular expression patterns: A text column must be in a consistent format; for example, phone
numbers must be in the form 999-999-9999.
• cross-field validation: Conditions must hold across multiple columns; for example, a hospital
discharge date can't be earlier than the admittance date.
• duplicated data: Rows or columns that are repeated. In some cases, they may not be exact.
We can check the data types in Pandas using the dtypes attribute. A dtype of object means that the data
in that column contains either strings or mixed types. These fields should be investigated to
determine whether they contain mixed datatypes. In our toy example, we would expect Marriage_Len
to be numerical, so an object dtype is suspicious. Looking at the data, we see that James has Not
Applicable, which is a string.
>>> df.dtypes
name object
DOB object
Marital_Status object
Height float64
Weight int64
Marriage_Len object
Spouse object
dtype: object
Duplicates can be easily identified in Pandas using the duplicated() method. When no
parameters are passed, it returns a boolean Series that marks each row which is an exact duplicate
of an earlier row. We can identify rows that are duplicated in only some columns by passing in the
column names. The keep parameter has three possible values: 'first', 'last', and False. Passing
False marks every duplicated row, while 'first' and 'last' leave one occurrence of each duplicate
unmarked, the first or the last respectively.
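For example, on the toy dataset (a sketch):

# Boolean Series marking rows that exactly duplicate an earlier row.
>>> df.duplicated()

# Show every row involved in a duplicate of the Name column.
>>> df[df.duplicated(['Name'], keep=False)]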
We can check the range of values in a numeric column using the min and max methods. Other
options for looking at the values include line plots, histograms, and boxplots. Some other useful
Pandas commands for evaluating data include pd.unique() and df.nunique(), which identify and
count unique values, and value_counts(), which counts how many times each value occurs in a
column, like a histogram.
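A few quick checks on the toy dataset (a sketch):

# Range of the Age column.
>>> df['Age'].min(), df['Age'].max()

# Frequency of each value in Marital_Status.
>>> df['Marital_Status'].value_counts()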
The accuracy of the data, how close the data is to reality, is harder to confirm. Just because
a data point is valid, doesn’t mean that it is true. For example, a valid street address doesn’t have
to exist, or a person might lie about their weight. The first case could be checked using mapping
software, but the second could be unverifiable.
The percentage of missing data measures the completeness of the data. All uncleaned data will have
missing values, but datasets with large amounts of missing data, or lots of missing data in key columns,
are not going to be as useful. In Pandas, missing data is represented as NaN (note that a NaN in an
otherwise integer column forces that column to a float dtype). Pandas has several functions to help
identify and count missing values: df.isna() returns a boolean DataFrame indicating whether each
value is missing, and df.notnull() returns a boolean DataFrame with True where a value is not missing.
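For example, counting missing values (a sketch):

# Number of missing values in each column.
>>> df.isna().sum()

# Total number of missing values in the DataFrame.
>>> df.isna().sum().sum()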
Consistency measures how consistent the data is within the dataset and across multiple datasets.
For example, in our toy dataset, Jack Smith is 100 years old, but his birth year is 1980. Data is
inconsistent across datasets when data points that should be the same are different. This could be
due to incorrect entries or differing syntax. An example is using multiple finance datasets to build a
predictive model. The dates in each dataset should have the same format so that they can all be used
equally in the model, and any columns that contain monetary data should all be in the same unit,
like dollars or pesos.
Lastly, uniformity is the measure of how similarly the data is formatted. Data that has the
same units of measure and syntax are considered uniform. Looking at the Height column in our
dataset, we see values ranging from 1.8 to 105. This is likely the result of different units of measure.
When looking at the quality of the data, there are no set rules on how to measure these concepts
or at what point the data is considered bad data. Sometimes, even if the data is bad, it is the only
data available and has to be used. Having an idea of the quality of the data will help you know what
cleaning steps are needed and help with analyzing the results. Creating summary statistics,
also known as data profiling, is a good way to get a general idea of the quality of the data. The
summary statistics should be specific to the dataset and describe the aspects of the data discussed
in this section. They could also include visualizations and basic statistics, like the mean and standard
deviation.
Visualization is an important aspect of the inspection phase. Histograms, box plots, and
hexbin plots can help identify outliers in the data. Outliers should be investigated to determine whether
they are accurate. Removing outliers can improve your model, but you should only remove an outlier if you
have a legitimate reason. Columns that have a small distribution or variance, or that consist of a single value,
could be worth removing since they might contribute little to the model.
Problem 1. The g_t_results.csv file is a set of parent-reported scores on their child’s Gifted
and Talented tests. The two tests, OLSAT and NNAT, are used by NYC to determine if children
are qualified for gifted programs. The OLSAT Verbal has 16 questions for Kindergartners and
30 questions for first and second graders. The NNAT has 48 questions. Using this dataset,
answer the following questions.
1) What column has the highest number of null values, and what percent of its values are
null? Print the answer as a tuple of (column name, percentage).
2) List the columns that have mixed types but should be numeric. Print the answer as
a tuple.
3) How many third graders have scores outside the valid range for the OLSAT Verbal
Score? Print the answer
4) How many data values are missing (NaN)? Print the number.
Cleaning
After the data has been inspected, it’s time to start cleaning. There are many aspects and methods
of cleaning; not all of them will be used in every dataset. Which ones you choose should be based
on your dataset and the goal of the project.
Unwanted Data
Removing unwanted data typically falls into two categories, duplicated data and irrelevant data.
Duplicated observations usually occur when data is scraped, combined from multiple datasets, or a
user submits the data twice. Irrelevant data consists of observations that don’t fit the specific problem
you are trying to solve or don’t have enough variation to affect the model. We can drop duplicated
data using the duplicated() function described above with drop() or using drop_duplicates,
which has the same parameters as duplicated.
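For example (a sketch):

# Remove exact duplicate rows, keeping the first occurrence.
>>> df = df.drop_duplicates(keep='first')

# Equivalently, drop the rows flagged by duplicated().
>>> df = df[~df.duplicated(keep='first')]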
Validity Errors
After removing unwanted data, we correct any validity errors found during inspection. All features
should have a consistent type, standard formatting (like capitalization), and the same units. Syntax
errors should be fixed, and white space at the beginning and ends of strings should be removed.
Some data might need to be padded so that it’s all the same length.
Method Description
series.str.lower() Convert to all lower case
series.str.upper() Convert to all upper case
series.str.strip() Remove all leading and trailing white space
series.str.lstrip() Remove leading white space
series.str.replace(" ","") Remove all spaces
series.str.pad() Pad strings
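For example, a sketch of standardizing the Name column of the toy dataset:

# Strip stray whitespace and standardize capitalization.
>>> df['Name'] = df['Name'].str.strip().str.upper()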
Validity also includes correcting or removing contradicting values. This might be two values
in a row or values across datasets. For example, a child shouldn’t have a marital status of married.
Another example is if two columns should sum to a third but don’t for a specific row.
Missing Data
There will always be missing data in any uncleaned dataset. Some commonly suggested methods
for handling data are removing the missing data and setting the missing values to some value based
on other observations. However, missing data can be informative and removing or replacing missing
data erases that information. Also, removing missing values from a dataset might result in significant
amounts of data being lost. Removing missing data could also make your model less accurate if you
need to predict on data with missing values, so retaining the missing values can help increase accuracy.
So how can we handle missing data? Dropping missing data is the easiest method. Dropping
rows should only be done if there are a small number of missing data points in a column or if the row is
missing a significant amount of data. If a column is very sparse, consider dropping the entire column.
Another option is to estimate the missing data’s value and replace it. There are many ways to do
this, including mean, mode, median, randomly choosing from the distribution, linear regression, and
hot-decking.
Hot-deck is when you fill in the data based on similar observations. It can be applied to
numerical and categorical data, unlike most of the other options listed above. The easiest hot-deck
method is to fill in the data with random numbers after dividing the data into groups based on
some characteristic, like gender. Sequential hot-deck sorts the column with missing data based on
an auxiliary column and then fills in the data with the value from the next available data point.
K-Nearest Neighbors can also be used to identify similar data points.
The last option is to flag the data as missing. This retains the information from missing data
and removes the missing data (by replacing it). For categorical data, simply replace the data with
a new category. For numerical data, we can fill the missing data with 0, or some value that makes
sense, and add an indicator variable for missing data. This allows the algorithm to estimate the
constant for missing data instead of just using the mean.
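As a sketch (the exact choices here are assumptions), the missing values in the toy dataset could be flagged rather than dropped; the cleaned version of the toy dataset shown below reflects this kind of approach:

# Flag missing categorical data with a new category.
>>> df['Marital_Status'] = df['Marital_Status'].fillna('missing')

# Fill missing numerical data with 0 and add an indicator column.
>>> df['Marriage_Len_missing'] = df['Marriage_Len'].isna().astype(int)
>>> df['Marriage_Len'] = df['Marriage_Len'].fillna(0)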
>>> df
Name Age DOB Marital_Status
0 JOHN DOE 30 01/01/2010 divorcee
1 JANE DOE 29 12/02/1990 divorced
2 JILL SMITH 40 03/04/1980 married
3 JACK SMITH 40 4/4/1980 married
4 JENNY SMITH 5 05/05/2015 missing
More dangerous, in many ways, than numerical errors are entries that are recorded as a numerical
value (float or int) when they should be recorded as nonnumerical data, that is, in a format that
cannot be summed, multiplied, or averaged. One example is missing data recorded as 0. Missing
data should always be stored in a form that cannot accidentally be incorporated into the model.
Typically this is done by storing NaN as the value. However, the above method of using missing
as the value can be more useful, since some algorithms will not run on data containing NaN. Unfortunately,
many datasets have recorded missing values with a 0 or some other number. You should verify that
this does not occur in your dataset. Similarly, a survey with a scale from 1 to 5 will sometimes have
the additional choice of “N/A” (meaning “not applicable”), which could be coded as 6, not because
the value 6 is meaningful, but just because that is the next thing after 5. Again, this should be fixed
so that the “N/A” choice cannot accidentally be used for any computations.
Categorical data are also often encoded as numerical values. These values should not be left
as numbers that can be computed with. For example, postal codes are shorthand for locations,
and there is no numerical meaning to the code. It makes no sense to add, subtract, or multiply
postal codes, so it is important not to let those accidentally be added, subtracted, or multiplied, for
example by inadvertently including them in the design matrix (unless they are one-hot encoded or
given some other meaningful numerical value). It is good practice to convert postal codes, area codes,
ID numbers, and other non-numeric data into strings or other data types that cannot be computed
with.
Ordinal Data
Ordinal data is data that has a meaningful order, but the differences between the values aren’t
consistent, or maybe aren’t even meaningful at all. For example, a survey question might ask about
your level of education, with 1 being high-school graduate, 2 bachelor’s degree, 3 master’s degree,
and 4 doctoral degree. These values are called ordinal data because it is meaningful to talk about an
answer of 1 being less than an answer of 2. However, the difference between 1 and 2 is not necessarily
the same as the difference between 3 and 4, and it would not make sense to compute an average
answer—the average of a high school diploma and a master's degree is not a bachelor's degree, despite
the fact that the average of 1 and 3 is 2. Treating these like categorical data loses the information of
the ordering, but treating it like regular numerical data implies that a difference of 2 has the same
meaning whether it comes as 3 − 1 or 4 − 2. If that last assumption is approximately true, then it
may be ok to treat these data as numerical in your model, but if that assumption is not correct, it
may be better to treat the variable as categorical.
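If you want to keep the ordering without imposing numerical differences, pandas' ordered categorical type is one option (a sketch with a hypothetical education column):

>>> levels = ['high school', 'bachelor', 'master', 'doctorate']
>>> df['education'] = pd.Categorical(df['education'],
...                                  categories=levels, ordered=True)

# Comparisons respect the ordering, but arithmetic is not allowed.
>>> (df['education'] >= 'master').sum()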
Problem 2. imdb.csv contains a small set of information about 99 movies. Clean the data
set by doing the following in order:
1. Remove duplicate rows. Print the shape of the dataframe after removing the rows.
2. Drop all rows that contain missing data. Print the shape of the dataframe after removing
the rows.
3. Remove rows that have data outside valid data ranges and explain briefly how you deter-
mined your ranges for each column.
4. Identify and drop columns with three or fewer different values. Print a tuple with the
names of the columns dropped.
Feature Engineering
One often needs to construct new columns, commonly referred to as features in the context of
machine learning, for a dataset, because the dependent variable is not necessarily a linear function
of the features in the original dataset. Constructing new features is called feature engineering. Once
new features are created, we can analyze how much a model depends on each feature. Features with
low importance probably do not contribute much and could potentially be removed.
Fognets are fine mesh nets that collect water that condenses on the netting. These are used in
some desert cities in Morocco to produce drinking water. Consider a dataset measuring the amount of
water Y collected from fognets, where one of the features WindDir is the wind direction, measured in
degrees. This feature is not likely to contribute meaningfully in a linear model because the direction
359 is almost the same as the direction 0, but no nonzero linear multiple of WindDir will reflect this
relation. One way to improve the situation is to replace WindDir with two new (engineered)
features: sin(π·WindDir/180) and cos(π·WindDir/180).
Discrete Fourier transforms and wavelet decomposition often reveal important properties of
data collected over time (called time-series), like sound, video, economic indicators, etc. In many
such settings it is useful to engineer new features from a wavelet decomposition, the DFT, or some
other function of the data.
Problem 3. basketball.csv contains data for all NBA players between 2001 and 2018. Each
row represents a player’s stats for a year. The features in this data set are
• per (float): player efficiency rating, how much a player produced in one minute of play
• bpm (float): box plus/minus is the estimated number of points a player contributed to
over 100 possessions
(float):
Create two new features:
• target (str): The target team if the player is leaving. If the player is retiring, the target
should be ’retires’.
Remove all rows except those where a player changes teams, that is, where target is neither null nor
'retires'. Drop the player, year, and team_id columns.
Use the provided function, identify_importance(), to determine how important each fea-
ture is in a Random Forest algorithm by passing in the dataframe. It will return a dictionary
of features with the feature importance (in percentages) as values. Sort the resulting dictionary
from most important feature to least and print the results.
Categorical features are those that take only a finite number of values, and usually no categorical
value has a numerical meaning, even if it happens to be a number. For example, in an election dataset,
the names of the candidates in the race are categorical, and there is no numerical meaning (neither
ordering nor size) to numbers assigned to candidates based solely on their names.
Consider the following election data.
Ballot number For Governor For President
001 Herbert Romney
002 Cooke Romney
003 Cooke Obama
004 Herbert Romney
005 Herbert Romney
006 Cooke Stein
A common mistake occurs when someone assigns a number to each categorical entry (say 1
for Cooke, 2 for Herbert, 3 for Romney, etc.). While this assignment is not, in itself, inherently
incorrect, it is incorrect to use the value of this number in a statistical model. Any such model would
be fundamentally wrong because a vote for Cooke cannot, in any reasonable way, be considered
half of a vote for Herbert or a third of a vote for Romney. Many researchers have accidentally used
categorical data in this way (and some have been very publicly embarrassed) because their categorical
data was encoded numerically, which made it hard to recognize as categorical data.
Whenever you encounter categorical data that is encoded numerically like this, immediately
change it either to non-numerical form (“Cooke,” “Herbert,” “Romney,”. . . ) or apply a one-hot
encoding as described below.
In order to construct a meaningful model with categorical data, one normally applies a one-hot
encoding or dummy variable encoding. 1 To do this construct a new feature for every possible value
of the categorical variable, and assign the value 1 to that feature if the variable takes that value and
zero otherwise. Pandas makes one-hot encoding simple:
# one-hot encoding
df = pd.get_dummies(df, columns=['For President'])
The previous dataset, when the presidential race is one-hot encoded, becomes
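Roughly (pandas names the new columns with its default prefix_value convention; the For Governor column is unchanged):

Ballot number For Governor For President_Obama For President_Romney For President_Stein
001           Herbert      0                   1                    0
002           Cooke        0                   1                    0
003           Cooke        1                   0                    0
004           Herbert      0                   1                    0
005           Herbert      0                   1                    0
006           Cooke        0                   0                    1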
1 Yes, these are silly names, but they are the most common names for it. Unfortunately, it is probably too late to change them now.
Achtung!
When performing linear regression, it is good practice to add a constant column to your dataset
and to remove one column of the one-hot encoding of each categorical variable.
To see why, notice that summing terms in one row corresponding to the one-hot encoding
of a specific categorical variable (for example the presidential candidate) always gives 1. If the
dataset already has a constant column (which you really always should add if it isn’t there
already), then the constant column is a linear combination of the one-hot encoded columns.
This causes the matrix to fail to be invertible and can cause identifiability problems.
The standard way to deal with this is to remove one column of the one-hot encoding
for each categorical variable. For example, with the elections dataset above, we could remove
the Cooke and Romney columns. Doing that means that in the new dataset a row sum of 0
corresponds to a ballot with a vote for Cooke and a vote for Romney, while a 1 in any column
indicates how the ballot differed from the base choice of Cooke and Romney.
When using pandas, you can drop the first column of a one-hot encoding by passing in
drop_first=True.
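For example (a sketch):

# One-hot encode and drop one column per categorical variable.
df = pd.get_dummies(df, columns=['For President'], drop_first=True)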
Problem 4. Load housing.csv into a dataframe with index_col=0. Descriptions of the features
are in housing_data_description.txt. The goal is to construct a regression model that
predicts SalePrice using the other features of the dataset. Do this as follows:
1. Identify and handle the missing data. Hint: Dropping every row with some missing data
is not a good choice because it gives you an empty dataframe. What can you do instead?
2. Identify the variable with nonnumerical values that are misencoded as numbers. One-hot
encode it. Hint: don’t forget to remove one of the encoded columns to prevent collinearity
with the constant column).
5. Choose four categorical features that seem very important in predicting SalePrice. One-
hot encode these features, and remove all other categorical features.
Print the ten features that have the highest coefficients in your model and the summary.
To run an OLS model in Python, use the following code.
import statsmodels.api as sm
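# A rough sketch of a typical fit (the column split and the added constant
# column below are assumptions, not part of the original code).
X = sm.add_constant(df.drop('SalePrice', axis=1))
y = df['SalePrice']

results = sm.OLS(y, X).fit()
print(results.summary())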
Problem 5. Using the copy of the dataframe you created in Problem 4, one-hot encode all the
categorical variables. Print the shape of your dataframe, and run OLS.
Print the ten features that have the highest coefficients in your model and the summary. Write
a couple of sentences discussing which model is better and why.
14
MongoDB
Lab Objective: Relational databases, including those managed with SQL or pandas, require data to
be organized into tables. However, many data sets have an inherently dynamic structure that cannot
be efficiently represented as tables. MongoDB is a non-relational database management system that is
well-suited to large, fast-changing datasets. In this lab we introduce the Python interface to MongoDB,
including common commands and practices.
Database Initialization
Suppose the manager of a general store has all sorts of inventory: food, clothing, tools, toys, etc.
There are some common attributes shared by all items: name, price, and producer. However, other
attributes are unique to certain items: sale price, number of wheels, or whether or not the product
is gluten-free. A relational database housing this data would be full of mostly-blank rows, which is
extremely inefficient. In addition, adding new items to the inventory requires adding new columns,
causing the size of the database to rapidly increase. To efficiently store the data, the whole database
would have to be restructured and rebuilt often.
To avoid this problem, NoSQL databases like MongoDB avoid using relational tables. Instead,
each item is a JSON-like object, and thus can contain whatever attributes are relevant to the specific
item, without including any meaningless attribute columns.
Note
MongoDB is a database management system (DBMS) that runs on a server, which should be
running in its own dedicated terminal. Refer to the Additional Material section for installation
instructions.
The Python interface to MongoDB is called pymongo. After installing pymongo and with the
MongoDB server running, use the following code to connect to the server.
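(A minimal sketch; the database and collection names below are placeholders.)

>>> from pymongo import MongoClient

# Create a client object connected to the default host and port.
>>> client = MongoClient('localhost', 27017)

# Create (or access) a database, then a collection within it.
>>> db = client.db1
>>> col = db.collection1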
Documents in MongoDB are represented as JSON-like objects, and therefore do not adhere to
a set schema. Each document can have its own fields, which are completely independent of the fields
in other documents.
# Insert one document with fields 'name' and 'age' into the collection.
>>> col.insert_one({'name': 'Jack', 'age': 23})
# Insert another document. Notice that the value of a field can be a string,
# integer, truth value, or even an array.
>>> col.insert_one({'name': 'Jack', 'age': 22, 'student': True,
... 'classes': ['Math', 'Geography', 'English']})
Note
Once information has been added to the database it will remain there, even if the python
environment you are working with is shut down. It can be accessed anytime using the same
commands as before.
Problem 1. The file trump.json, located in trump.json.zip, contains posts from
http://www.twitter.com (tweets) over the course of an hour that have the key word “trump”.a
Each line in the file is a single JSON message that can be loaded with json.loads().
Create a MongoDB database and initialize a collection in the database. Use the collection’s
delete_many() method with an empty set as input to clear existing contents of the collection,
then fill the collection one line at a time with the data from trump.json. Check that your
collection has 67,859 entries with its count() method.
a See the Additional Materials section for an example of using the Twitter API.
Querying a Collection
MongoDB uses a query by example pattern for querying. This means that to query a database, an
example must be provided for the database to use in matching other documents.
# Find all the documents that have a 'name' field containing the value 'Jack'.
>>> data = col.find({'name': 'Jack'})
# Find the FIRST document with a 'name' field containing the value 'Jack'.
>>> data = col.find_one({'name': 'Jack'})
The find_one() method returns the first matching document as a dictionary. The find()
query may find any number of objects, so it will return a Cursor, a Python object that is used to
iterate over the query results. There are many useful functions that can be called on a Cursor, for
more information see http://api.mongodb.com/python/current/api/pymongo/cursor.html.
'age': 22,
'classes': ['Math', 'Geography', 'English'],
'name': 'Jack',
'student': True},
{'_id': ObjectId('59260028617410748cc7b8ca'),
'name': 'Jeremy',
'occupation': 'waiter',
'student': True}]
The Logical operators listed in the following table can be used to do more complex queries.
Operator Description
$lt, $gt <, >
$lte,$gte <=, >=
$eq, $ne ==, !=
$in, $nin in, not in
$or, $and, $not or, and, not
$exists Match documents with a specific field
$type Match documents with values of a specific type
$all Match arrays that contain all queried elements
$size Match arrays with a specified number of elements
$regex Search documents with a regular expression
# Query for everyone that is a student (those that have a 'student' attribute
# and haven't been expelled).
>>> results = col.find({'student': {'$not': {'$in': [False, 'Expelled']}}})
It is likely that a database will hold more complex JSON entries than these, with many nested
attributes and arrays. For example, an entry in a database for a school might look like this.
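(The document below is made up for illustration.)

>>> col.insert_one({'name': 'Jill',
...                 'student': True,
...                 'address': {'city': 'Provo', 'state': 'UT'},
...                 'classes': [{'name': 'Math', 'grade': 'A'},
...                             {'name': 'English', 'grade': 'B+'}]})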
To query the nested attributes and arrays use a dot, as in the following examples.
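(Sketches using the made-up document above.)

# Match on a nested attribute.
>>> col.find_one({'address.city': 'Provo'})

# Match on an attribute of the elements inside an array.
>>> col.find_one({'classes.name': 'Math'})

# Match on a specific array position (the first class listed).
>>> col.find_one({'classes.0.grade': 'A'})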
The Twitter JSON files are large and complex. To see what they look like, either look at the
JSON file used to populate the collection or print any tweet from the database. The following
website also contains useful information about the fields in the JSON file:
https://dev.twitter.com/overview/api/tweets.
The distinct function is also useful in seeing what the possible values are for a given field.
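For example, on the Twitter collection (lang is a field of the tweet JSON):

# List every distinct language code that appears in the collection.
>>> col.distinct('lang')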
Problem 2. Query the Twitter collection from Problem 1 for the following information.
• How many tweets came from one of the main continental US time zones? These are listed
as "Central Time (US & Canada)", "Pacific Time (US & Canada)", "Eastern Time
(US & Canada)", and "Mountain Time (US & Canada)".
• How often did each language occur? Construct a dictionary with each language and its
frequency count.
(Hint: use distinct() to get the language options.)
# Delete the first person from the database whose name is Jack.
>>> col.delete_one({'name':'Jack'})
Another useful function is the sort function, which can sort the data by some attribute. It
takes in the attribute by which the data will be sorted, and then the direction (1 for ascending and
-1 for descending). Ascending is the default. The following code is an example of sorting.
# Sort the students oldest to youngest, ignoring those whose age is not listed.
>>> results = col.find({'age': {'$exists': True}}).sort('age', -1)
>>> for person in results:
... print(person['name'])
...
Jill
Jack
Jack
Problem 3. Query the Twitter collection from Problem 1 for the following information.
• What are the usernames of the 5 most popular (defined as having the most followers)
tweeters? Don’t include repeats.
• Of the tweets containing at least 5 hashtags, sort the tweets by how early the 5th hashtag
appears in the text. What is the earliest spot (character count) it appears?
• What are the coordinates of the tweet that came from the northernmost location? Use
the latitude and longitude point in "coordinates".
Updating Documents
Another useful attribute of MongoDB is that data in the database can be updated. It is possible
to change values in existing fields, rename fields, delete fields, or create new fields with new values.
This gives much more flexibility than a relational database, in which the structure of the database
must stay the same. To update a database, use either update_one or update_many, depending on
whether one or more documents should be changed (the same as with delete). Both of these take
two parameters: a find query, which finds the documents to change, and the update parameters,
which specify what to update. The syntax is update_many({find query},{update parameters}).
The update parameters must contain update operators. Each update operator is followed by
the field it is changing and the value to change it. The syntax is the same as with query operators.
The operators are shown in the table below.
Operator Description
$inc , $mul +=, *=
$min, $max min(), max()
$rename Rename a specified field to the given new name
$set Assign a value to a specified field (creating the field if necessary)
$unset Remove a specified field
$currentDate Set the value of the field to the current date.
With "$type": "date", use a datetime format;
with "$type": "timestamp", use a timestamp.
# Update the first person from the database whose name is Jack to include a
# new field 'lastModified' containing the current date.
>>> col.update_one({'name':'Jack'},
... {'$currentDate': {'lastModified': {'$type': 'date'}}})
# Give the first John a new field 'best_friend' that is set to True.
>>> col.update_one({'name':'John'}, {'$set': {'best_friend': True}})
• Update every tweet from someone with at least 1000 followers to include a popular field
whose value is True. Report the number of popular tweets.
Additional Material
Installation of MongoDB
MongoDB runs as an isolated program with a path directed to its database storage. To run a practice
MongoDB server on your machine, complete the following steps:
To begin, navigate to an appropriate directory on your machine and create a folder called data.
Within that folder, create another folder called db. Make sure that you have read, write, and execute
permissions for both folders.
To run a server on your machine, you will need the proper executable files from MongoDB. The
following instructions are individualized by operating system. For all of them, download your binary
files from https://www.mongodb.com/download-center?jmp=nav#community.
1. For Linux/Mac:
Extract the necessary files from the downloaded package. In the terminal, navigate into the
bin directory of the extracted folder. You may then start a Mongo server by running in a
terminal: ./mongod --dbpath /pathtoyourdatafolder.
2. For Windows:
Go into your Downloads folder and run the Mongo .msi file. Follow the installation instruc-
tions. You may install the program at any location on your machine, but do not forget where
you have installed it. You may then start a Mongo server by running in command prompt:
C:\locationofmongoprogram\mongod.exe --dbpath C:\pathtodatafolder\data\db.
MongoDB servers are set by default to run at address:port 127.0.0.1:27017 on your machine.
You can also run Mongo commands through a mongo terminal shell. More information on this
can be found at https://docs.mongodb.com/getting-started/shell/introduction/.
Twitter API
Pulling information from the Twitter API is simple. First you must get a Twitter account and register
your app with them on apps.twitter.com. This will enable you to have a consumer key, consumer
secret, access token, and access secret, all required by the Twitter API.
You will also need to install tweepy, an open source library that allows Python to easily work
with the Twitter API. It can be installed from the command line with pip:

$ pip install tweepy

The data for this lab was then pulled using the following code on May 26, 2017.
import tweepy
from tweepy import OAuthHandler
from tweepy import Stream

# Authenticate with the keys from apps.twitter.com (the placeholder strings
# below must be replaced with your own credentials).
my_auth = OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
my_auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)

stream_listener = StreamListener()
stream = tweepy.Stream(auth=my_auth, listener=stream_listener)
# This pulls all tweets that include the keyword "trump". Any number of
# keywords can be searched for.
stream.filter(track=["trump"])
15
Introduction to Parallel
Computing
Lab Objective: Many modern problems involve so many computations that running them on
a single processor is impractical or even impossible. There has been a consistent push in the past
few decades to solve such problems with parallel computing, meaning computations are distributed to
multiple processors. In this lab, we explore the basic principles of parallel computing by introducing
the cluster setup, standard parallel commands, and code designs that fully utilize available resources.
Parallel Architectures
A serial program is executed one line at a time in a single process. Since modern computers have
multiple processor cores, serial programs only use a fraction of the computer’s available resources.
This can be beneficial for smooth multitasking on a personal computer because programs can run
uninterrupted on their own core. However, to reduce the runtime of large computations, it is beneficial
to devote all of a computer’s resources (or the resources of many computers) to a single program. In
theory, this parallelization strategy can allow programs to run N times faster where N is the number
of processors or processor cores that are accessible. Communication and coordination overhead
prevents the improvement from being quite that good, but the difference is still substantial.
A supercomputer or computer cluster is essentially a group of regular computers that share their
processors and memory. There are several common architectures that combine computing resources
for parallel processing, and each architecture has a different protocol for sharing memory and
processors between computing nodes, the individual simultaneous processing units. Each architecture offers
unique advantages and disadvantages, but the general commands used with each are very similar.
• Controller : Receives directions from the client and distributes instructions and data to the
computing nodes. Consists of a hub to manage communications and schedulers to assign
processes to the engines.
• Engines: The individual computing nodes. Each engine is a separate Python process with its
own namespace and computing resources.
Command Description
ipcontroller start Initialize a controller process.
ipengine start Initialize an engine process.
ipcluster start Initialize a controller process and several engines simultaneously.
Each of these processes can be stopped with a keyboard interrupt (Ctrl+C). By default, the
controller uses JSON files in UserDirectory/.ipython/profile_default/security/ to determine
its settings. Once a controller is running, it acts like a server listening for client connections from
engine processes. Engines connect by default to a controller with the settings defined in the afore-
mentioned JSON files. There is no limit to the number of engines that can be started in their own
terminal windows and connected to the controller, but it is recommended to use only as many engines
as there are cores to maximize efficiency. Once started, each engine has its own ID number on
the controller that is used for communication.
Achtung!
The directory that the controller and engines are started from matters. To facilitate connections,
navigate to the same folder as your source code before using ipcontroller, ipengine, or
ipcluster. Otherwise, the engines may not connect to the controller or may not be able to
find auxiliary code as directed by the client.
Starting a controller and engines in individual terminal windows with ipcontroller and
ipengine is a little inconvenient, but having separate terminal windows for the engines allows the
user to see individual errors in detail. It is also actually more convenient when starting a cluster of
multiple computers. For now, we use ipcluster to get the entire cluster started quickly.
Note
Jupyter notebooks also have a Clusters tab in which clusters can be initialized using an
interactive GUI. To enable the tab, run the following command. This operation may require
root permissions.
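The command is likely of the following form (an assumption about the classic ipyparallel notebook extension; consult the ipyparallel documentation for your version):

$ ipcluster nbextension enable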
Once the client object has been created, it can be used to create one of two classes: a DirectView
or a LoadBalancedView. These views allow for messages to be sent to collections of engines simul-
taneously. A DirectView allows for total control of task distribution while a LoadBalancedView
automatically tries to spread out the tasks equally on all engines. The remainder of the lab will be
focused on the DirectView class.
Since each engine has its own namespace, modules must be imported in every engine. There
is more than one way to do this, but the easiest way is to use the DirectView object’s execute()
method, which accepts a string of code and executes it in each engine.
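A minimal sketch, assuming a cluster has already been started with ipcluster:

>>> from ipyparallel import Client

>>> client = Client()      # Connect to the running controller.
>>> dview = client[:]      # A DirectView over all available engines.
>>> dview.block = True     # Wait for results by default.

# Import a module in every engine's namespace.
>>> dview.execute("import numpy as np")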
Problem 1. Write a function that initializes a Client object, creates a DirectView with all
available engines, and imports scipy.sparse as sparse on all engines.
The push() and pull() methods of a DirectView object manage variable values in the engines. Use
push() to set variable values and pull() to get variables. Each method has an easy shortcut via
indexing.
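For example, with four engines and blocking turned on (a sketch):

# Send variables to every engine, then retrieve one of them from each engine.
>>> dview.push({'a': 10, 'b': 5})      # Returns a list of None, one per engine.
>>> dview.pull('a')
[10, 10, 10, 10]

# The same operations through the indexing shortcut.
>>> dview['c'] = 20
>>> dview['c']
[20, 20, 20, 20]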
Parallelization almost always involves splitting up collections and sending different pieces to each
engine for processing. The process is called scattering and is usually used for dividing up arrays or
lists. The inverse process of pasting a collection back together is called gathering. This method of
distributing and collecting a dataset is the foundation of the prominent MapReduce algorithm, a
common model for processing large data sets using parallelization.
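For example, splitting a list across four engines and reassembling it (a sketch):

>>> dview.scatter('x', list(range(8)))
>>> dview['x']                  # Each engine holds one piece of the list.
[[0, 1], [2, 3], [4, 5], [6, 7]]

>>> dview.gather('x')           # Reassemble the pieces in their original order.
[0, 1, 2, 3, 4, 5, 6, 7]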
The execute() method is the simplest way to run commands on parallel engines. It accepts a string
of code (with exact syntax) to be executed. Though simple, this method works well for small tasks.
Apply
The apply() method accepts a function and arguments to plug into it, and distributes them to the
engines. Unlike execute(), apply() returns the output from the engines directly.
Note that the engines can access their local variables in either of the execution methods.
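For example (a sketch with four engines):

# Run a function with the given arguments on every engine.
>>> dview.apply(lambda x, y: x + y, 3, 4)
[7, 7, 7, 7]

# Engines can also use the variables in their own namespaces.
>>> dview['a'] = 10
>>> dview.apply(lambda x: a + x, 2)
[12, 12, 12, 12]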
Map
The built-in map() function applies a function to each element of an iterable. The iPyParallel
equivalent, the map() method of the DirectView class, combines apply() with scatter() and
gather(). Simply put, it accepts a dataset, splits it between the engines, executes a function on the
given elements, returns the results, and combines them into one object. This function also represents
a key component in the MapReduce algorithm.
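For example (a sketch):

# Split the list among the engines, square each element, and recombine.
>>> dview.map(lambda x: x**2, list(range(8)))
[0, 1, 4, 9, 16, 25, 36, 49]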
• Blocking: The controller places commands on the specified engines’ execution queues, then
“blocks” execution until every engine finishes its task. The main program halts until the answer
is received from the controller. This mode is usually best for problems in which each node is
performing the same task.
• Non-Blocking: The controller places commands on the specified engines’ execution queues,
then immediately returns an AsyncResult object that can be used to check the execution
status and eventually retrieve the actual result. The main program continues without waiting
for responses.
The execution methods execute(), apply(), and map(), as well as push(), pull(), scatter()
, and gather(), each have a keyword argument block that specifies whether or not to use blocking.
If not specified, the argument defaults to the block attribute of the DirectView. Alternatively, the
methods apply_sync() and map_sync() always use blocking, and apply_async() and map_async()
always use non-blocking.
# The non-blocking method is faster, but we still need to get its results.
>>> block_results[1] # This list holds actual result values.
[5.9734047365913572,
5.1895936886345959,
4.9088268102823909,
4.8920224621657855]
>>> responses[10] # This list holds AsyncResult objects.
<AsyncResult: <lambda>:finished>
>>> %time nonblock_results = [r.get() for r in responses]
CPU times: user 3.52 ms, sys: 11 µs, total: 3.53 ms
Wall time: 3.54 ms # Getting the responses takes little time.
As was demonstrated above, when non-blocking is used, commands can be continuously sent
to engines before they have finished their previous task. This allows them to begin their next task
without waiting to send their calculated answer and receive a new command. However, this requires
a design that incorporates check points to retrieve answers and enough memory to store response
objects.
Table 15.1 details the methods of the AsyncResult object.
There are additional magic methods supplied by iPyParallel that make some of these opera-
tions easier. These methods are contained in the Additional Material section. More information on
iPyParallel architecture, interface, and methods can be found at https://ipyparallel.readthedocs.io/en/
latest/index.html.
Problem 3. Write a function that accepts an integer n. Instruct each engine to make n draws
from the standard normal distribution, then hand back the minimum, maximum, and mean
draw to the client. Print the results. If you have four engines running, your output should
resemble the following:
Problem 4. Use your function from Problem 3 to compare serial and parallel execution times.
For n = 1000000, 5000000, 10000000, 15000000,
1. Time how long it takes to run your function from Problem 3 in parallel for each n.
2. Time how long it takes to do the same process serially. Make n draws and then calculate
and record the statistics, but use a for loop with N iterations, where N is the number
of engines running.
Plot the execution times against n. You should notice an increase in efficiency in the parallel
version as the problem size increases.
Applications
Parallel computing, when used correctly, is one of the best ways to speed up the run time of an
algorithm. As a result, it is very commonly used today and has many applications, such as the
following:
• Graphic rendering
• Numerical integration
In fact, there are many problems that are only possible to solve through parallel computing because
solving them serially would take too long. In these types of problems, even the parallel solution could
take years. Some brute-force algorithms, like those used to crack simple encryptions, are examples
of this type of problem.
The problems mentioned above are well suited to parallel computing because they can be
manipulated in such a way that running them on multiple processors results in a significant run time
improvement. Manipulating an algorithm to be run with parallel computing is called parallelizing
the algorithm. When a problem only requires very minor manipulations to parallelize, it is often
called embarrassingly parallel. Typically, an algorithm is embarrassingly parallel when there is little
to no dependency between results. Algorithms that do not meet this criterion can still be parallelized,
but there is not always a significant enough improvement in run time to make it worthwhile. For
example, calculating the Fibonacci sequence using the usual formula, F(n) = F(n − 1) + F(n − 2),
is poorly suited to parallel computing because each element of the sequence is dependent on the
previous two elements.
The trapezoidal rule approximates the integral of f over [a, b] as
∫_a^b f(x) dx ≈ (h/2) Σ_{n=1}^{N−1} [f(x_n) + f(x_{n+1})],
where a = x_1 < x_2 < · · · < x_N = b and h = x_{n+1} − x_n for each n. See Figure 15.2.
Note that estimation of the area of each interval is independent of all other intervals. As
a result, this problem is considered embarrassingly parallel.
Write a function that accepts a function handle to integrate, bounds of integration, and
the number of points to use for the approximation. Parallelize the trapezoid rule in order to
estimate the integral of f . That is, evenly divide the points among all available processors
and run the trapezoid rule on each portion simultaneously. The sum of the results of all the
processors will be the estimation of the integral over the entire interval of integration. Return
this sum.
[Figure 15.2: the trapezoidal rule approximates the area under y = f(x) using trapezoids over the subintervals determined by the points x_1, . . . , x_5.]
Intercommunication
The phrase parallel computing refers to designing an architecture and code that makes the best use
of computing resources for a problem. Occasionally, this will require nodes to depend on
each other for previous results. This contributes to a slower result because it requires a great deal
of communication latency, but is sometimes the only method to parallelize a function. Although im-
portant, the ability to effectively communicate between engines has not been added to iPyParallel.
It is, however, possible in an MPI framework and will be covered in a later lab.
Additional Material
Installation and Initialization
If you have not already installed ipyparallel, you may do so using the conda package manager.
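For example:

$ conda install ipyparallel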
SSH Connection
When using engines and controllers that are on separate machines, their communication will most
likely be using an SSH tunnel. This Secure Shell allows messages to be passed over the network.
In order to enable this, an SSH user and IP address must be established when starting the
controller. An example of this follows.
Magic Methods
The iPyParallel module has a few magic methods that are very useful for quick commands in
iPython or in a Jupyter Notebook. The most important are as follows. Additional methods are
found at https://ipyparallel.readthedocs.io/en/latest/magics.html.
%px - This magic method runs the corresponding Python command on the engines specified
in dview.targets.
%autopx - This magic method enables a boolean that runs any code run on every engine until
%autopx is run again.
Examples of these magic methods with a client and four engines are as follows.
# %px
In [4]: with dview.sync_imports():
...: import numpy
...:
importing numpy on engine(s)
In [5]: %px a = numpy.random.random(2)
In [6]: dview['a']
Out[6]:
[array([ 0.30390162, 0.14667075]),
array([ 0.95797678, 0.59487915]),
array([ 0.20123566, 0.57919846]),
array([ 0.87991814, 0.31579495])]
# %autopx
In [7]: %autopx
%autopx enabled
In [8]: max_draw = numpy.max(a)
In [10]: %autopx
%autopx disabled
Decorators
The iPyParallel module also has a few decorators that are very useful for quick commands. The
two most important are as follows:
@remote - This decorator creates methods that are executed simultaneously on every target engine.
@parallel - This decorator creates methods on remote engines that break up element-wise
operations and recombine results.
# Remote decorator
>>> @dview.remote(block=True)
>>> def plusone():
... return a+1
>>> dview['a'] = 5
>>> plusone()
[6, 6, 6, 6]
# Parallel decorator
>>> import numpy as np
>>> @dview.parallel(block=True)
>>> def combine(A,B):
... return A+B
>>> ex1 = np.random.random((3,3))
>>> ex2 = np.random.random((3,3))
>>> print(ex1+ex2)
[[ 0.87361929 1.41110357 0.77616724]
[ 1.32206426 1.48864976 1.07324298]
[ 0.6510846 0.45323311 0.71139272]]
>>> print(combine(ex1,ex2))
[[ 0.87361929 1.41110357 0.77616724]
[ 1.32206426 1.48864976 1.07324298]
[ 0.6510846 0.45323311 0.71139272]]
Lab Objective: In the world of parallel computing, MPI is the most widespread and standardized
message passing library. As such, it is used in the majority of parallel computing programs. In this
lab, we explore and practice the basic principles and commands of MPI to further recognize when and
how parallelization can occur.
MPI can be thought of as “the assembly language of parallel computing,” because of this
generality.1 MPI is important because it was the first portable and universally available standard for
programming parallel systems and continues to be the de facto standard today.
For more information on how MPI works and how to get it installed on your machine, see the
additional material for this lab.
Note
Most modern personal computers now have multicore processors. Programs that are designed
for these multicore processors are “parallel” programs and are typically written using OpenMP
or POSIX threads. MPI, on the other hand, is designed for any general architecture.
Using MPI
We will start with a Hello World program.
#hello.py
from mpi4py import MPI

COMM = MPI.COMM_WORLD
RANK = COMM.Get_rank()

print("Hello world! I'm process number {}.".format(RANK))
hello.py
Save this program as hello.py and execute it from the command line as follows:

$ mpiexec -n 5 python hello.py
Notice that when you try this on your own, the lines will not necessarily print in order. This is
because there will be five separate processes running autonomously, and we cannot know beforehand
which one will execute its print() statement first.
Achtung!
It is usually bad practice to perform I/O (e.g., call print()) from any process besides the root
process (rank 0), though it can be a useful tool for debugging.
How does this program work? First, the mpiexec program is launched. This is the program
which starts MPI, a wrapper around whatever program you pass into it. The -n 5 option specifies
the desired number of processes. In our case, 5 processes are run, with each one being an instance of
the program “python”. To each of the 5 instances of python, we pass the argument hello.py which
is the name of our program’s text file, located in the current directory. Each of the five instances of
python then opens the hello.py file and runs the same program. The difference in each process’s
execution environment is that the processes are given different ranks in the communicator. Because
of this, each process prints a different number when it executes.
MPI and Python combine to make succinct source code. In the above program, the line
from mpi4py import MPI loads the MPI module from the mpi4py package. The line COMM = MPI
.COMM_WORLD accesses a static communicator object, which represents a group of processes which
can communicate with each other via MPI commands. The next line, RANK = COMM.Get_rank(),
accesses the process's rank number. A rank is the process's unique ID within a communicator, and
they are essential to learning about other processes. When the program mpiexec is first executed,
it creates a global communicator and stores it in the variable MPI.COMM_WORLD. One of the main
purposes of this communicator is to give each of the five processes a unique identifier, or rank. When
each process calls COMM.Get_rank(), the communicator returns the rank of that process. RANK points
to a local variable, which is unique for every calling process because each process has its own separate
copy of local variables. This gives us a way to distinguish different processes while writing all of the
source code for the five processes in a single file.
Here is the syntax for Get_size() and Get_rank(), where Comm is a communicator object:
Comm.Get_size() Returns the number of processes in the communicator. It will return the same
number to every process. Parameters:
Example:
#Get_size_example.py
from mpi4py import MPI
SIZE = MPI.COMM_WORLD.Get_size()
print("The number of processes is {}.".format(SIZE))
Get_size_example.py
Comm.Get_rank() Determines the rank of the calling process in the communicator. Parameters:
Example:
#Get_rank_example.py
from mpi4py import MPI
RANK = MPI.COMM_WORLD.Get_rank()
print("My rank is {}.".format(RANK))
Get_rank_example.py
The Communicator
A communicator is a logical unit that defines which processes are allowed to send and receive mes-
sages. In most of our programs we will only deal with the MPI.COMM_WORLD communicator, which
contains all of the running processes. In more advanced MPI programs, you can create custom com-
municators to group only a small subset of the processes together. This allows processes to be part
of multiple communicators at any given time. By organizing processes this way, MPI can physically
rearrange which processes are assigned to which CPUs and optimize your program for speed. Note
that within two different communicators, the same process will most likely have a different rank.
Note that one of the main differences between mpi4py and MPI in C or Fortran, besides being
array-based, is that mpi4py is largely object oriented. Because of this, there are some minor changes
between the mpi4py implementation of MPI and the official MPI specification.
For instance, the MPI Communicator in mpi4py is a Python class and MPI functions like
Get_size() or Get_rank() are instance methods of the communicator class. Throughout these MPI
labs, you will see functions like Get_rank() presented as Comm.Get_rank() where it is implied that
Comm is a communicator object.
#separateCode.py
from mpi4py import MPI
RANK = MPI.COMM_WORLD.Get_rank()

a = 2
b = 3
if RANK == 0:
    print(a + b)
elif RANK == 1:
    print(a*b)
elif RANK == 2:
    print(max(a, b))
separateCode.py
Problem 1. Write a program in which processes with an even rank print “Hello” and processes
with an odd rank print “Goodbye.” Print the process number along with the “Hello” or
“Goodbye” (for example, “Goodbye from process 3”).
#passValue.py
import numpy as np
from mpi4py import MPI

COMM = MPI.COMM_WORLD
RANK = COMM.Get_rank()
passValue.py
To illustrate simple message passing, we have one process choose a random number and then
pass it to the other. Inside the receiving process, we have it print out the value of the variable
num_buffer before it calls Recv() to prove that it really is receiving the variable through the message
passing interface.
Here is the syntax for Send() and Recv(), where Comm is a communicator object:
Comm.Send(buf, dest=0, tag=0): Performs a basic send from one process to another. Parameters:
buf (array-like) : data to send
dest (integer) : rank of destination
tag (integer) : message tag
The buf object is not as simple as it appears: it must contain a pointer to a NumPy array.
You cannot, for example, simply pass a string; the string would have to be packaged inside an
array first.
Example:

#Send_example.py
from mpi4py import MPI
import numpy as np

RANK = MPI.COMM_WORLD.Get_rank()
Send_example.py
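As a minimal sketch of the exchange described above (the filename and variable names here are
only illustrative), process 1 draws a random value and sends it to process 0, which prints its
receive buffer before and after calling Recv(). Run it with mpiexec -n 2 python send_sketch.py.

#send_sketch.py (illustrative)
import numpy as np
from mpi4py import MPI

COMM = MPI.COMM_WORLD
RANK = COMM.Get_rank()

if RANK == 1:
    num_buffer = np.random.rand(1)                        # choose a random number
    COMM.Send(num_buffer, dest=0)                         # send it to process 0
elif RANK == 0:
    num_buffer = np.zeros(1)                              # allocate space for the message
    print("Process 0 before Recv: {}".format(num_buffer))
    COMM.Recv(num_buffer, source=1)                       # block until the message arrives
    print("Process 0 after Recv: {}".format(num_buffer))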
Problem 2. Write a script that runs on two processes and passes an n by 1 vector of ran-
dom values from one process to the other. Write it so that the user passes the value of n in
as a command-line argument. The following code demonstrates how to access command-line
arguments.
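For instance, a small sketch using Python's sys.argv (the script name and variable are
illustrative):

#print_n.py (illustrative)
import sys

n = int(sys.argv[1])    # sys.argv[0] is the script name; sys.argv[1] is the first argument
print("n = {}".format(n))

This could be run with, for example, mpiexec -n 2 python print_n.py 100.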
Note
Send() and Recv() are referred to as blocking functions. That is, if a process calls Recv(), it
will sit idle until it has received a message from a corresponding Send() before it will proceed.
(However, in Python the process that calls Comm.Send will not necessarily block until the
message is received, though in C, MPI_Send does block.) There are corresponding non-blocking
functions Isend() and Irecv() (the I stands for immediate). In essence, Irecv() will return
immediately. If a process calls Irecv() and doesn’t find a message ready to be picked up, it
will indicate to the system that it is expecting a message, proceed beyond the Irecv() to do
other useful work, and then check back later to see if the message has arrived. This can be used
to dramatically improve performance.
Problem 3. Write a script in which the process with rank i sends a random value to the
process with rank i + 1 in the global communicator. The process with the highest rank will
send its random value to the root process. Notice that we are communicating in a ring. For
communication, only use Send() and Recv(). The program should work for any number of
processes. Does the order in which Send() and Recv() are called matter?
Note
When calling Comm.Recv, you can allow the calling process to accept a message from any
process that happened to be sending to the receiving process. This is done by setting source
to a predefined MPI constant, source=ANY_SOURCE (note that you would first need to import
this with from mpi4py.MPI import ANY_SOURCE or use the syntax source=MPI.ANY_SOURCE).
# pi.py
import numpy as np
from scipy import linalg as la

pi.py
$ python pi.py
3.166
Problem 4. The n-dimensional open unit ball is the set U_n = {x ∈ R^n | ‖x‖₂ < 1}. Write
a script that accepts integers n and N on the command line. Estimate the volume of U_n
by drawing N points over the n-dimensional domain [−1, 1] × [−1, 1] × · · · × [−1, 1] on each
available process except the root process (for a total of (r − 1)N draws, where r is the number
of processes). Have the root process print the volume estimate.
(Hint: the volume of [−1, 1] × [−1, 1] × · · · × [−1, 1] is 2^n.)
When n = 2, this is the same experiment outlined above, so your function should return an
approximation of π. The volume of U_3 is (4/3)π ≈ 4.18879, and the volume of U_4 is π²/2 ≈ 4.9348.
Try increasing the number of sample points N or processes r to see if your estimates improve.
Note
Good parallel code should pass as little data as possible between processes. Sending large or
frequent messages requires a level of synchronization and causes some processes to pause as they
wait to receive or send messages, negating the advantages of parallelism. It is also important
to divide work evenly between simultaneous processes, as a program can only be as fast as its
slowest process. This is called load balancing, and can be difficult in more complex algorithms.
Additional Material
Installation of MPI
MPI is a library of functions that interface with your computer's hardware to provide optimal parallel
computing performance. In order to use mpi4py, we need to have an MPI library installed on
the computer as well as the mpi4py package. When you invoke mpi4py in your Python code, mpi4py
applies what you have written using that MPI library, so installing mpi4py alone
is not enough to use MPI.
1. For Linux/Mac: We recommend using OpenMPI for your MPI Library installation, though it
is not the only library available.
The following is a bash script written for Linux that will install OpenMPI version 4.0.2. It will
take about 15 minutes to complete.
#!/bin/bash
# download openMPI
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.2.tar.gz
# extract the files
tar -zxf openmpi-4.0.2.tar.gz
cd openmpi-4.0.2
# configure the files
./configure --prefix=/usr/local/openmpi
# compile openMPI
make all
# install openMPI
sudo make install
Finally, you must add OpenMPI to your PATH variable. This is so your computer knows where
to look when it wants to execute a certain MPI command. Here is a link that describes how to
edit the PATH variable https://gist.github.com/nex3/c395b2f8fd4b02068be37c961301caa7.
On Linux, open the file called .bashrc; on Mac, the file is called .bash_profile. Both are in
the home directory. Add the following line, save the file, and restart your terminal.
export PATH=/usr/local/openmpi/bin:$PATH
2. For Windows: There is only one free MPI library available for Windows at https://msdn.
microsoft.com/en-us/library/bb524831(v=vs.85).aspx. Download the appropriate .exe
or .msi file to install on your machine.
Installing mpi4py
1. For All Systems: The easiest installation is using conda install mpi4py. You may also run
pip install mpi4py
17
Ethics in Machine Learning
Lab Objective: Machine learning algorithms can be extremely useful and convenient, but they can
also have negative consequences. In this lab, we’ll explore the impact people have on machine learning
algorithms and some of the unintended effects of machine learning that can result.
Introduction
Machine learning can be an extremely powerful tool in helping us interpret large datasets and then
making decisions based on our data. A well-designed algorithm can use a dataset to train a model
and identify patterns as well as make decisions with minimal human contact. As machine learning
continues to advance and gain power and validity in the world, more and more datasets are being
compiled for training and then implemented in some kind of model. These models are used for
everything from predicting the next word as you type on your phone or computer to powerful and
accurate facial recognition software. Even though machine learning is exceptionally useful, it has
already demonstrated some drawbacks, and their negative impacts can outweigh the power and
helpfulness of a well-designed model. In this lab we look at a few of these drawbacks, as well as
some of their ethical repercussions, to help us recognize and avoid them in our own future work.
Understanding Bias
To begin, we need to make a quick distinction about the term bias. Bias has many different meanings
depending on the circumstance and field of study. From a mathematical perspective, bias is defined
simply as a measure of the average error of an estimator, a statistical estimate of some quantity. In
mathematical terms,
bias(θ̂) = E[θ̂] − θ
where E is the expected value, θ is the quantity we are estimating, and θ̂ is the estimator. In
machine learning, one of the goals is often to minimize this type of bias, which we will refer to as
statistical bias.
This idea of minimizing bias also applies to other types of biases, including cognitive and data
bias. These kinds of biases often result in predictions from our machine learning model being partial
to a certain subset of the data. In this lab, we will investigate problems that involve several types of
biases.
Achtung!
Not every type of bias has a clear or standard definition, and some types of bias go by
different names. For instance, Wikipedia defines statistical bias more broadly, as "a systematic
tendency in the process of data collection, which results in lopsided, misleading results." In this
lab we use terms and definitions common in the machine learning field, but note
that they are not universal. With sensitive topics like bias, it is important to be clear about
definitions and meanings so that misunderstandings do not occur.
See https://en.wikipedia.org/wiki/Bias#Statistical_biases for Wikipedia’s defi-
nitions of different biases.
If g(x) is the function the model predicts and f(x) is the actual target function, then the
variance and statistical bias of the model at a point x can be written as
Var[g(x)] = E[(g(x) − E[g(x)])²]  and  bias[g(x)] = E[g(x)] − f(x).
By definition, a model with high variance means that small changes in the training set will
result in large changes of the target function, so the model is overfitted. Low variance is just the
opposite; changes in the training set will hardly affect the target function. Statistical bias, on the
other hand, deals more specifically with the general form of the target function. High statistical bias
implies that we are making large assumptions about the form of the target function, and small bias
implies that we are making small assumptions about the form of the target function. Models with
high bias are often classified as underfitted, meaning they will not generalize well to other
data. Simply put, high bias assumes a model and tries to fit the data to that model, while low bias
tries to fit a model to the data. Making the statistical bias smaller often makes the variance of the
model go up, and making the variance smaller will result in a larger bias. The relationship between statistical
bias and variance, or overfitting and underfitting, is inescapable in machine learning. In the end, the
best algorithm is achieved by finding the middle ground.
Table 17.1: Common machine learning algorithms and whether they tend to have low or high bias
and variance.
A common example for evaluating this relationship is that of fitting a dataset to a polynomial.
Based on our definitions of statistical bias and variance, we can conclude that as the degree of the
polynomial gets larger, the statistical bias decreases (we assume less about the form of the target
function) while the variance increases, because the fit begins to depend more
and more on the specific data points given to the training set.
We will be using the mean square error to calculate the error. The mean square error is defined
as
MSE = (1/n) ∑_{i=1}^{n} (Y_i − Ŷ_i)²        (17.1)
>>> import numpy as np
>>> x = np.linspace(-1, 1, 25)
>>> degree = 3 # polynomial degree
Figure 17.1: Plot of cos(x) (red) and a third-degree polynomial approximation (blue).
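As a rough sketch of how such an approximation and its error can be computed with NumPy
(the variable names below are illustrative):

>>> y = np.cos(x)                          # values of the target function
>>> coeffs = np.polyfit(x, y, degree)      # fit a polynomial of the given degree
>>> y_hat = np.polyval(coeffs, x)          # evaluate the polynomial at x
>>> mse = np.mean((y - y_hat)**2)          # mean squared error, as in (17.1)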
To simulate the idea of generalization, we will generate 100 datasets. The better the algorithm
performs across all the sets, the more we can assume that it will fit well on different datasets that
were not in the training data.
Problem 1. Approximate sin(πx) with polynomials. To do this, write a function that ac-
cepts as parameters min_degree and max_degree. Inside it, generate data using the provided
generate_sets() function. x_test and x_train are 1-D arrays with the x-values needed to
test and train. y_test and y_train each contain 100 subarrays; each subarray contains the
y-values for one dataset.
For each dataset in y_train,
3. Calculate the mean squared error of the training and testing data.
Return the mean squared error of both the training and testing data for each dataset,
as well as an array containing the predicted values for x_test for each degree and dataset, and
the generated x_test values (needed to calculate the bias).
Plot the mean test error and the mean train error for each degree with the corre-
sponding polynomial degree as the horizontal axis and compare with Figure 17.2.
Hint: The ndarrays outputted by your function should have the following shapes:
Now that we have the predictions and the error values, we can also evaluate the variance and
bias. Recall that variance is calculated solely using the prediction values. In this case, the estimator
g is the function generated when we used numpy.polyfit().
Bias, on the other hand, is calculated using the estimator and the desired target function, which
in this case is sin(πx), using the equation for bias given above. For the returned x_test and
results_list from test_polyfits(), we can calculate the mean variance across all the datasets
for degree i using the method given below.
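A minimal sketch of that computation (assuming results_list[i] is an array of shape
(num_datasets, len(x_test)) holding the degree-i predictions for every dataset):

>>> predictions = results_list[i]
>>> mean_variance = np.mean(np.var(predictions, axis=0))   # variance across datasets, averaged over x
>>> mean_bias = np.mean(np.mean(predictions, axis=0) - np.sin(np.pi * x_test))  # average bias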
Problem 2. Use the results of Problem 1 to calculate the mean of the bias, variance, test error, and
train error of each polynomial degree in range {0, . . . , 8}. Plot the mean test error, variance,
and bias with the corresponding polynomial degree as the horizontal axis. Make sure to include
axis labels, a title, and plot legends for readability. Your plot should look like the right plot in
Figure 17.2.
Create a single sample dataset with 500 samples using generate_sets with num_datasets=1
and num_samples=500. Choose the polynomial of degree n that showed the smallest mean test
error in the previous problems and fit your training set to that model.
Evaluate your results by plotting sin(πx), the sample points, and the best fit polynomial.
Include plot legends, display the mean squared error of the test set in the title of the plot,
and display the degree of the polynomial as the label on the horizontal axis.
Figure 17.2
The way that you evaluated and chose the best algorithm for the given datasets is common in
evaluating and choosing models to use in machine learning. If you continue to decrease the bias the
training error will often continue to decrease, but there will be a point where the testing error starts
to increase.
In Problem 1 we generated our own dataset, but if we are given a dataset from which we wanted
to create a model, we can use cross validation, which is similar to what we did in the previous problem,
to evaluate and choose the best model. Cross validation takes the given data and splits it into multiple
training and testing sets. Using these new splits, we can then train and test on every one of these
train-test-splits and evaluate the results. The model that showed the best performance across all of
these splits is the one you will probably want to use, because it will perform better on datasets that
were not in the training set. Computational complexity and size are other factors that could also
influence your choice.
This kind of model evaluation can be both tedious and time consuming, but it is
often the key to choosing the best model for a specific machine learning task as well as future
model development. Even the smallest change in your model can result in a significant change in
accuracy. Faulty data might look good initially but will not transfer over to different datasets very
well. Evaluating the statistical bias and variance in conjunction with the error of your model will help
you avoid these issues.
Measurement Bias
Measurement bias occurs when some of the data used as a variable is inaccurate. This could be
because the equipment measuring the data cannot detect small changes, or because the measuring
device itself changes the data. One example is taking pictures on a camera that increases the
brightness of the photo or
has a spot on the lens.
Cancer detection is a well-known machine learning problem. A model will train on images of cells,
some of which are cancerous, and then predict whether new samples are cancerous. Some melanoma
classification algorithms have been shown to predict melanoma better than trained physicians. This
success has led developers to release software that allows a user to take a picture, which the
software then passes to a previously trained algorithm to predict the presence of melanoma. While this can
be extremely useful and helpful, if there is a measurement error, such as a damaged camera lens,
there can be issues in correct classification.
We will examine a model created to predict whether a specific skin lesion is melanomic. The
files melanoma_test.pkl and melanoma_modified.pkl contain pandas DataFrames that will be
used to test the model, which will be trained on melanoma_train.pkl. The file melanoma.pkl is
a flattened array of a black and white image of a skin lesion. The other testing file,
melanoma_modified.pkl, contains the same data, with a black square added to the image to simulate
a faulty camera or damaged machine.
Problem 3. In this problem we will compare the differences between good data and faulty
data in predicting if a person has melanoma. First, use get_melanoma_data() to get the
relevant training and testing data. Notice that there are two different testing datasets, the test
and modified datasets as explained above. Next, train a Random Forest model using SkLearn’s
RandomForestClassifier with random_state=15. After the model is trained, use the model to
predict the test and modified datasets.
Compute the accuracy and the percentage of false positive results for both datasets.
Display all four of these numbers in a plt.bar graph so that the reader can understand which
ones are from the modified test data and which ones are from the original test data.
Write a few sentences explaining the results of the graph and how this can affect the
usefulness of the model.
Sampling Bias
Sampling bias is a type of bias that results from the way the data, or a sample, is collected. More
specifically, sampling bias is when the sample is collected in a way that results in some members of
the intended population having a smaller sampling probability than others. This can happen when
a sample is taken in a specific area that is not representative of the entire population. Sampling bias
will result in many issues for prediction algorithms. In these next couple of sections and problems
we will be focusing on only a few of them.
Minority groups have a history of facing discrimination in many forms, including legal discrimination.
Because of the power and usefulness of machine learning, predictive algorithms have already made
their way into the judicial system as recidivism tools. One example is a machine learning algorithm
that uses facial recognition with other factors, such as whether a defendant has a job and their
education level, to produce a score called a risk assessment. The risk assessment score is then used
to help determine things like sentence length, bond amount, and parole.
An analysis1 showed that the algorithm had a 20% false positive rate for violent crime and of
those predicted to commit another crime, about 61% did. It also showed that black defendants were
77% more likely to receive a higher risk for future violent crime and 45% more likely to receive a
higher risk score for future crime of any kind. Many of these algorithms have not been independently
evaluated for accuracy and racial bias, and the defendants are not allowed to see how their scores
are calculated.
1 https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Problem 4. Identify the importance of the RACE, FACILITY, AGE, and OFFENSE features in
sentence_labeled.csv in predicting SENTENCE YEARS. To do this, implement the following
steps:
• Load sentence_labeled.csv. Create the labels and split into test and training data
(use SkLearn’s train_test_split()) with a 70/30 train/test split and random_state=21.
• Using the model, predict the labels for the test data. Calculate and print the R-squared
score of the predicted test labels. It should be between 0.8 and 1.0.
• Remove RACE from the features. Retrain the model, predict the labels, compute the
R-squared score and feature importance, and create a new bar plot.
Figure 17.3
In this case, removing the RACE feature will not eliminate prejudice because of the FACILITY
feature. The facilities that generally have longer sentences are those that are used to incarcerate the
individuals of racial minority communities in the United States. More information on the facilities can
be found in the facility_map.p file. It is common for other, less obvious features to also carry
sampling bias. In other words, a column such as FACILITY, which does not necessarily indicate
race, will also help to propagate negative prejudice into our machine learning models because it may
disguise underlying characteristics. For example, since neighborhoods can be segregated, using zip
codes in these areas can support a racial bias. It is extremely important to look at feature importance
and compare results across demographics to prevent prejudice and discrimination.
Problem 5. Using the model with all four features, make predictions on the unlabeled data,
sentence_unlabeled.p. Create histograms for the SENTENCE YEARS of black convicts for the
following OFFENSE numbers: 27, 42, 95, 64. Overlay the SENTENCE YEARS of the same OFFENSE
for white convicts. Include plot legends and use 51 bins, each representing a single year from
0 to 50. Use the OFFENSE number as a key to get its description using offense_map.p and use
it as the title of the plot. See Figure 17.4 for an example.
Hint: Set the hist parameter alpha to a value which allows for you to see both plots,
even if they overlap.
Note: These offenses have been chosen because the sample size was significant and almost
equivalent for both black and white convicts.
Figure 17.4: The percent of white and black convicts sentenced to various lengths of time (in years).
The company grew and began receiving more skilled applicants outside of their specific veteran
demographic. To reduce the time it took to sort through the growing stack of resumes, they decided
to create a model that could do the preliminary sorting for them. Using previous hiring data, they
created an algorithm that selected keywords that were common on applicants' resumes and represented
all the previous and current applicants in terms of those keywords. After this, they created the model to
make the prediction that they wanted.
Problem 6. Analyze the algorithm that the company created. The first steps of identifying
key words, simplifying the resumes and model creation have all been done for you. To get the
accuracy and results of the model call,
This function will return a cross validation mean accuracy score called accuracy, and a
pandas DataFrame, called results. results contains the training data. Each row represents
an applicant and the columns represent a keyword that could appear on the applicant’s resume.
The last column, ’Interview’, is the model’s prediction of whether that applicant received an
interview.
Create a function called get_percentages() that accepts the results and calculates and
returns the following.
Finally, use a plt.bar graph to display the cross validation accuracy score, the percentage
of people who received an interview, the percentage of women that received an interview, the
percentage of women who were veterans that received an interview, and the percentage of
women who are not veterans that received an interview. Compare your graph with Figure 17.5.
If we only looked at three of the results: the percentage of applicants that received an interview,
the percentage of women that received an interview, and the accuracy, we would think that the model
performed well. The percentage of women who received an interview was similar to the percentage of
total people who got an interview. However, if you look at more specific results, like the percentage
of women veterans that got the interview versus the percentage of non-veteran women that received
an interview, the algorithm shows some intense favoritism to veterans.
As discussed previously, the model to select keywords was trained on the company’s previous
data. This data was gathered from a time when they previously received applications from and hired
exclusively veterans. Some of these key words in skills.txt are not relevant for a software engineer, but
the algorithm deemed them important because of the frequency of their appearance in the training
data. The training data did not always represent what the employers were actually looking for.
Discussion on Ethics
In this lab, we discussed many important techniques for examining and determining multiple types of
bias in machine learning algorithms. Statistical bias and variance can be optimized and measurement
error can be avoided, but other kinds of cognitive and social bias are more nuanced. Being aware of
these issues is important as machine learning becomes ingrained in how the world and its machines
function. We will mention three ethical concerns regarding machine learning that were more or less
a part of this lab. Be aware that this list of three does not contain all possible ethical questions when
it comes to machine learning.
The first concern is of the fairness of the model. We examined this idea in both the resume and
the incarceration problems. Data based on previous prejudice or sampling bias may be replicated
in your algorithm, propagating error and resulting in unintentional discrimination. Avoiding this
problem requires conscious effort and a proper understanding of the sources of your data and the
impact of the model.
The second concern is dehumanization. Machine learning can take humans out of many im-
portant decision-making processes, which can be a good thing if the algorithms are extremely well
designed, but it also poses a significant risk if the model contains any amount of accidental bias or
discrimination, as in the incarceration problem.
The final idea is consent. It is important to consider what data the model uses, whose data
it is, and if it was gathered appropriately. Privacy and permission must be respected. This is an
important concern to consider before beginning a project and while considering potential effects of
the analysis.
There is not a single answer to any of these questions that will please everyone; such is the
nature of ethics. But before implementing a machine learning algorithm, consider what you are doing
and how it might affect those around you. As this issue gains awareness, people are starting to create
tools to identify and mitigate bias and discrimination.
• AI Fairness 360³: Created by IBM, this well-known open-source package both identifies and
helps mitigate discrimination and bias.
• Skater⁴: Developed by Oracle, Skater can be used on complex algorithms and black-box models
to detect bias.
• Audit-AI⁵: A Python library that integrates with pandas and SkLearn to measure and mitigate
discriminatory patterns.
Other Python packages for identifying bias include FairML, aequitas, fairNN, and parity-fairness.
For more information on fairness, accountability, transparency, and ethics in machine learning,
visit https://www.microsoft.com/en-us/research/theme/fate/.
3 https://aif360.mybluemix.net/
4 https://pypi.org/project/skater/
5 https://github.com/pymetrics/audit-ai
18
Apache Spark
Lab Objective: Dealing with massive amounts of data often requires parallelization and cluster
computing; Apache Spark is an industry standard for doing just that. In this lab we introduce the
basics of PySpark, Spark’s Python API, including data structures, syntax, and use cases. Finally, we
conclude with a brief introduction to the Spark Machine Learning Package.
Apache Spark
Apache Spark is an open-source, general-purpose distributed computing system used for big data
analytics. Spark is able to complete jobs substantially faster than previous big data tools (e.g.,
Apache Hadoop) because of its in-memory caching and optimized query execution. Spark provides
development APIs in Python, Java, Scala, and R. On top of the main computing framework, Spark
provides machine learning, SQL, graph analysis, and streaming libraries.
Spark’s Python API can be accessed through the PySpark package. Installation for local exe-
cution or remote connection to an existing cluster can be done with conda or pip commands.1
If you use python3 in your terminal, you will need to set the PYSPARK_PYTHON environment
variable to python3. When using an IDE, you must call it from the terminal or set the variables
inside the editor so that PySpark can be found.
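For example, in a bash shell this variable could be set as follows (a sketch; adjust for your own
setup):

$ export PYSPARK_PYTHON=python3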
PySpark
One major benefit of using PySpark is the ability to run it in an interactive environment. One such
option is the interactive Spark shell that comes prepackaged with PySpark. To use the shell, simply
run pyspark in the terminal. In the Spark shell you can run code one line at a time without the
need to have a fully written program. This is a great way to get a feel for Spark. To get help with a
function use help(function); to exit the shell simply run quit().
1 See https://runawayhorse001.github.io/LearningApacheSpark/setup.html for detailed installation instructions.
In the interactive shell, the SparkSession object - the main entrypoint to all Spark functionality
- is available by default as spark. When running Spark in a standard Python script (or in IPython)
you need to define this object explicitly. The code box below outlines how to do this. It is standard
practice to name your SparkSession object spark.
It is important to note that when you are finished with a SparkSession you should end it by
calling spark.stop().
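A minimal sketch of this pattern (the application name is arbitrary):

>>> from pyspark.sql import SparkSession

# Instantiate the SparkSession with builder syntax.
>>> spark = SparkSession.builder.appName("my_app").getOrCreate()

# ... perform Spark operations here ...

# End the SparkSession when finished.
>>> spark.stop()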
Note
While the interactive shell is very robust, it may be easier to learn Spark in an environment that
you are more familiar with (like IPython). To do so, just use the code given below. Help can be
accessed in the usual way for your environment. Just remember to stop() the SparkSession!
Note
The syntax
is somewhat unusual. While this code can be written on a single line, it is often more readable
to break it up when dealing with many chained operations; this is standard styling for Spark.
Note that you cannot write a comment after a line continuation character '\'.
>>> titanic.take(2)
['0,3,Mr. Owen Harris Braund,male,22,1,0,7.25',
'1,1,Mrs. John Bradley (Florence Briggs Thayer) Cumings,female,38,1,0,71.283']
>>> titanic_parallelize.take(2)
[array(['0', '3', ..., 'male', '22', '1', '0', '7.25'], dtype=object),
array(['1', '1', ..., 'female', '38', '1', '0', '71.2833'], dtype=object)]
Achtung!
Because Apache Spark partitions and distributes data, calling for the first n objects using the
same code (such as take(n)) may yield different results on different computers (or even each
time you run it on one computer). This is not something you should worry about; it is the
result of variation in partitioning and will not affect data analysis.
2 https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html
RDD Operations
Transformations
There are two types of operations you can perform on RDDs: transformations and actions. Trans-
formations are functions that produce new RDDs from existing ones. Transformations are also lazy;
they are not executed until an action is performed. This allows Spark to boost performance by
optimizing how a sequence of transformations is executed at runtime.
One of the most commonly used transformations is the map(func), which creates a new RDD
by applying func to each element of the current RDD. This function, func, can be any callable
python function, though it is often implemented as a lambda function. Similarly, flatMap(func)
creates an RDD with the flattened results of map(func).
The filter(func) transformation returns a new RDD containing only the elements that satisfy
func. In this case, func should be a callable python function that returns a Boolean. The elements
of the RDD that evaluate to True are included in the new RDD while those that evaluate to False
are excluded.
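For instance, a small sketch of these transformations on a toy RDD (assuming spark is an active
SparkSession):

>>> rdd = spark.sparkContext.parallelize(range(10))
>>> squared = rdd.map(lambda x: x**2)             # lazily square each element
>>> evens = squared.filter(lambda x: x % 2 == 0)  # keep only the even squares
>>> evens.collect()                               # an action triggers the computation
[0, 4, 16, 36, 64]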
Note
A great transformation to help validate or explore your dataset is distinct(). This will return
a new RDD containing only the distinct elements of the original. In the case of the Titanic
dataset, if you did not know how many classes there were, you could do the following:
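A rough sketch, assuming each element of the titanic RDD is still a raw comma-separated line
(as above) with the passenger class in position 1:

# The three distinct passenger classes (order may vary across partitionings).
>>> titanic.map(lambda row: row.split(',')[1]).distinct().collect()
['3', '1', '2']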
Problem 1. Write a function that accepts the name of a text file with default filename=
huck_finn.txt.a Load the file as a PySpark RDD, and count the number of occurrences of
each word. Sort the words by count, in descending order, and return a list of the (word, count)
pairs for the 20 most used words.
a https://www.gutenberg.org/files/76/76-0.txt
Actions
Actions are operations that return non-RDD objects. Two of the most common actions, take(n)
and collect(), have already been seen above. The key difference between the two is that take(n)
returns the first n elements from one (or more) partition(s) while collect() returns the contents of
the entire RDD. When working with small datasets this may not be an issue, but for larger datasets
running collect() can be very expensive.
Another important action is reduce(func). Generally, reduce() combines (reduces) the data
in each row of the RDD using func to produce some useful output. Note that func must be an asso-
ciative and commutative binary operation; otherwise the results will vary depending on partitioning.
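As a short illustration (again assuming spark is an active SparkSession):

>>> rdd = spark.sparkContext.parallelize(range(1, 11))
>>> rdd.reduce(lambda x, y: x + y)    # addition is associative and commutative
55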
Problem 2. Since the area of a circle of radius r is A = πr2 , one way to estimate π is to
estimate the area of the unit circle. A Monte Carlo approach to this problem is to uniformly
sample points in the square [−1, 1] × [−1, 1] and then count the percentage of points that land
within the unit circle. The percentage of points within the circle approximates the percentage
of the area occupied by the circle. Multiplying this percentage by 4 (the area of the square
[−1, 1] × [−1, 1]) gives an estimate for the area of the circle. a
Write a function that uses Monte Carlo methods to estimate the value of π. Your function
should accept two keyword arguments: n=10**5 and parts=6. Use n*parts sample points and
partition your RDD with parts partitions. Return your estimate.
a See Example 7.1.1 in the Volume 2 textbook
DataFrames
While RDDs offer granular control, they can be slower than their Scala and Java counterparts when
implemented in Python. The solution to this was the creation of a new data structure: Spark
DataFrames. Just like RDDs, DataFrames are immutable distributed collections of objects; however,
unlike RDDs, DataFrames are organized into named (and typed) columns. In this way they are
conceptually similar to a relational database (or a pandas DataFrame).
The most important difference between a relational database and Spark DataFrames is in
the execution of transformations and actions. When working with DataFrames, Spark’s Catalyst
Optimizer creates and optimizes a logical execution plan before sending any instructions to the
drivers. After the logical plan has been formed, an optimal physical plan is created and executed.
This provides significant performance boosts, especially when working with massive amounts of
data. Since the Catalyst Optimizer functions the same across all language APIs, DataFrames bring
performance parity to all of Spark’s APIs.
Creating a DataFrame from an existing text, csv, or JSON file is generally easier than creating an
RDD. The DataFrame API also has arguments to deal with file headers or to automatically infer the
schema.
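A brief sketch of reading a csv file into a DataFrame (the filename here is only a placeholder):

>>> df = spark.read.csv('my_data.csv', header=True, inferSchema=True)
>>> df.show(3)    # display the first three rows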
Note
To convert a DataFrame to an RDD use my_df.rdd; to convert an RDD to a DataFrame use
spark.createDataFrame(my_rdd). You can also use spark.createDataFrame() on numpy
arrays and pandas DataFrames.
DataFrames can be easily updated, queried, and analyzed using SQL operations. Spark allows
you to run queries directly on DataFrames similar to how you perform transformations on RDDs.
Additionally, the pyspark.sql.functions module contains many additional functions to further
analysis. Below are many examples of basic DataFrame operations; further examples involving the
pyspark.sql.functions module can be found in the additional materials section. Full documenta-
tion can be found at https://spark.apache.org/docs/latest/api/python/pyspark.sql.html.
# filter the DataFrame for passengers between 20-30 years old (inclusive)
>>> titanic.filter(titanic.age.between(20, 30)).show(3)
+--------+------+--------------------+------+----+-----+-----+------+
|survived|pclass| name| sex| age|sibsp|parch| fare|
+--------+------+--------------------+------+----+-----+-----+------+
| 0| 3|Mr. Owen Harris B...| male|22.0| 1| 0| 7.25|
| 1| 3|Miss. Laina Heikk...|female|26.0| 0| 0| 7.925|
| 0| 3| Mr. James Moran| male|27.0| 0| 0|8.4583|
+--------+------+--------------------+------+----+-----+-----+------+
only showing top 3 rows
Note
If you prefer to use traditional SQL syntax you can use spark.sql("SQL QUERY"). Note that
this requires you to first create a temporary view of the DataFrame.
# create the temporary view so we can access the table through SQL
>>> titanic.createOrReplaceTempView("titanic")
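With the view registered, queries can then be written in ordinary SQL; for example (using
column names from the titanic schema shown above):

>>> spark.sql("SELECT name, age FROM titanic WHERE age < 18").show(3)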
Problem 4. In this problem, you will be using the london_income_by_borough.csv and the
london_crime_by_lsoa.csv files to visualize the relationship between income and the fre-
quency of crime.a The former contains estimated mean and median income data for each
London borough, averaged over 2008-2016; the first line of the file is a header with columns
borough, mean-08-16, and median-08-16. The latter contains over 13 million lines of crime
data, organized by borough and LSOA (Lower Super Output Area) code, for London between
2008 and 2016; the first line of the file is a header, containing the following seven columns:
lsoa_code: LSOA code (think area code) where the crime was committed
borough: London borough where the crime was committed
major_category: major (read: general) category of the crime
minor_category: minor (read: specific) category of the crime
value: number of occurrences of this crime in the given lsoa_code, month, and year
year: year the crime was committed
month: month the crime was committed
# prepare data
# convert the 'sex' column to binary categorical variable
>>> from pyspark.ml.feature import StringIndexer, OneHotEncoder
>>> sex_binary = StringIndexer(inputCol='sex', outputCol='sex_binary')
# drop unnecessary columns for cleaner display (note the new columns)
>>> titanic = titanic.drop('pclass', 'name', 'sex')
>>> titanic.show(2)
+--------+----+-----+-----+----+----------+-------------+----------+
|survived| age|sibsp|parch|fare|sex_binary|pclass_onehot| features|
+--------+----+-----+-----+----+----------+-------------+----------+
| 0|22.0| 1| 0|7.25| 0.0| (3,[],[])|(8,[4,5...|
| 1|38.0| 1| 0|71.3| 1.0| (3,[1],...|[0.0,1....|
+--------+----+-----+-----+----+----------+-------------+----------+
# we train the classifier by fitting our tvs object to the training data
>>> clf = tvs.fit(train)
0.7527272727272727
>>> results.weightedRecall
0.7527272727272727
>>> results.weightedPrecision
0.751035147726004
Below is a broad overview of the pyspark.ml ecosystem. It should help give you a starting
point when looking for a specific functionality.
Some of Spark’s available classifiers are listed below. For complete documentation, visit
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html.
Use randomSplit([0.75, 0.25], seed=11) to split your data into train and test sets
before fitting the model. Return the accuracy, weightedRecall, and weightedPrecision for
your model, in the given order.
Hint: to calculate the accuracy of a classifier in PySpark, use accuracy = MCE(labelCol
='survived',metricName='accuracy').evaluate(predictions).
Additional Material
Further DataFrame Operations
There are a few other functions built directly on top of DataFrames to aid further analysis. Additionally,
the pyspark.sql.functions module expands the available functions significantly.3
pyspark.sql.functions                  Operation
ceil(col)                              computes the ceiling of each element in col
floor(col)                             computes the floor of each element in col
min(col), max(col)                     returns the minimum/maximum value of col
mean(col)                              returns the average of the values of col
stddev(col)                            returns the unbiased sample standard deviation of col
var_samp(col)                          returns the unbiased variance of the values in col
rand(seed=None)                        generates a random column with i.i.d. samples from [0, 1]
randn(seed=None)                       generates a random column with i.i.d. samples from the standard normal distribution
exp(col)                               computes the exponential of col
log(arg1, arg2=None)                   returns the arg1-based logarithm of arg2; with only one argument, returns the natural logarithm
cos(col), sin(col), asin(col), etc.    computes the given trigonometric or inverse trigonometric function of col
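For example, a couple of these functions applied to the titanic DataFrame (a rough sketch):

>>> from pyspark.sql import functions as sqlf
>>> titanic.select(sqlf.mean('age'), sqlf.stddev('fare')).show()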
Part II
Appendices
A
Getting Started
The labs in this curriculum aim to introduce computational and mathematical concepts, walk through
implementations of those concepts in Python, and use industrial-grade code to solve interesting,
relevant problems. Lab assignments are usually about 5–10 pages long and include code examples
(yellow boxes), important notes (green boxes), warnings about common errors (red boxes), and
about 3–7 exercises (blue boxes). Get started by downloading the lab manual(s) for your course from
http://foundations-of-applied-mathematics.github.io/.
Submitting Assignments
Labs
Every lab has a corresponding specifications file with some code to get you started and to make your
submission compatible with automated test drivers. Like the lab manuals, these materials are hosted
at http://foundations-of-applied-mathematics.github.io/.
Download the .zip file for your course, unzip the folder, and move it somewhere where it
won’t get lost. This folder has some setup scripts and a collection of folders, one per lab, each of
which contains the specifications file(s) for that lab. See Student-Materials/wiki/Lab-Index for
the complete list of labs, their specifications and data files, and the manual that each lab belongs to.
Achtung!
Do not move or rename the lab folders or the enclosed specifications files; if you do, the test
drivers will not be able to find your assignment. Make sure your folder and file names match
Student-Materials/wiki/Lab-Index.
To submit a lab, modify the provided specifications file and use the file-sharing program
specified by your instructor (discussed in the next section). The instructor will drop feedback
files in the lab folder after grading the assignment. For example, the Introduction to Python lab
has the specifications file PythonIntro/python_intro.py. To complete that assignment, modify
PythonIntro/python_intro.py and submit it via your instructor’s file-sharing system. After grad-
ing, the instructor will create a file called PythonIntro/PythonIntro_feedback.txt with your score
and some feedback.
Homework
Non-lab coding homework should be placed in the _Homework/ folder and submitted like a lab
assignment. Be careful to name your assignment correctly so the instructor (and test driver) can find
it. The instructor may drop specifications files and/or feedback files in this folder as well.
Setup
Achtung!
We strongly recommend using a Unix-based operating system (Mac or Linux) for the labs.
Unix has a true bash terminal, works well with git and python, and is the preferred platform
for computational and data scientists. It is possible to do this curriculum with Windows, but
expect some road bumps along the way.
There are two ways to submit code to the instructor: with git (http://git-scm.com/), or with
a file-syncing service like Google Drive. Your instructor will indicate which system to use.
There are many websites for hosting online git repositories. Your instructor will indicate which
web service to use, but we only include instructions here for setup with Bitbucket.
1. Sign up. Create a Bitbucket account at https://bitbucket.org. If you use an academic email
address (ending in .edu, etc.), you will get free unlimited public and private repositories.
2. Make a new repository. On the Bitbucket page, click the + button from the menu on the
left and, under CREATE, select Repository. Provide a name for the repository, mark the
repository as private, and make sure the repository type is Git. For Include a README?,
select No (if you accidentally include a README, delete the repository and start over). Un-
der Advanced settings, enter a short description for your repository, select No forks un-
der forking, and select Python as the language. Finally, click the blue Create repository
button. Take note of the URL of the webpage that is created; it should be something like
https://bitbucket.org/<name>/<repo>.
3. Give the instructor access to your repository. On your newly created Bitbucket repository
page (https://bitbucket.org/<name>/<repo> or similar), go to Settings in the menu to
the left and select User and group access, the second option from the top. Enter your
instructor’s Bitbucket username under Users and click Add. Select the blue Write button so
your instructor can read from and write feedback to your repository.
4. Connect your folder to the new repository. In a shell application (Terminal on Linux or Mac,
or Git Bash (https://gitforwindows.org/) on Windows), enter the following commands.
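# Initialize this folder as a git repository and connect it to the online
# repository created above (standard git commands; URL pattern as in step 2).
$ cd <path-to-folder>
$ git init
$ git remote add origin https://bitbucket.org/<name>/<repo>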
# Add the contents of this folder to git and update the repository.
$ git add --all
$ git commit -m "initial commit"
$ git push origin master
For example, if your Bitbucket username is greek314, the repository is called acmev1, and the
folder is called Student-Materials/ and is on the desktop, enter the following commands.
# Record credentials.
$ git config --local user.name "archimedes"
# Add the contents of this folder to git and update the repository.
$ git add --all
$ git commit -m "initial commit"
$ git push origin master
At this point you should be able to see the files on your repository page from a web browser. If
you enter the repository URL incorrectly in the git remote add origin step, you can reset
it with the following line.
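# Point the existing 'origin' remote at the correct URL (standard git usage).
$ git remote set-url origin https://bitbucket.org/<name>/<repo>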
5. Download data files. Many labs have accompanying data files. To download these files, navi-
gate to your clone and run the download_data.sh bash script, which downloads the files and
places them in the correct lab folder for you. You can also find individual data files through
Student-Materials/wiki/Lab-Index.
6. Install Python package dependencies. The labs require several third-party Python packages
that don’t come bundled with Anaconda. Run the following command to install the necessary
packages.
7. (Optional) Clone your repository. If you want your repository on another computer after
completing steps 1–4, use the following commands.
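# A typical clone command (the URL pattern matches the repository created above).
$ git clone https://bitbucket.org/<name>/<repo>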
Using Git
Git manages the history of a file system through commits, or checkpoints. Use git status to see
the files that have been changed since the last commit. These changes are then moved to the staging
area, a list of files to save during the next commit, with git add <filename(s)>. Save the changes
in the staging area with git commit -m "<A brief message describing the changes>".
Figure A.1: Git commands to stage, unstage, save, or discard changes. Commit messages are recorded
in the log.
All of these commands are done within a clone of the repository, stored somewhere on a com-
puter. This repository must be manually synchronized with the online repository via two other git
commands: git pull origin master, to pull updates from the web to the computer; and git
push origin master, to push updates from the computer to the web.
Figure A.2: Exchanging git commits between the repository and a local clone.
Command Explanation
git status Display the staging area and untracked changes.
git pull origin master Pull changes from the online repository.
git push origin master Push changes to the online repository.
git add <filename(s)> Add a file or files to the staging area.
git add -u Add all modified, tracked files to the staging area.
git commit -m "<message>" Save the changes in the staging area with a given message.
git checkout -- <filename> Revert changes to an unstaged file since the last commit.
git reset HEAD -- <filename> Remove a file from the staging area.
git diff <filename> See the changes to an unstaged file since the last commit.
git diff --cached <filename> See the changes to a staged file since the last commit.
git config --local <option> Record your credentials (user.name, user.email, etc.).
Note
When pulling updates with git pull origin master, your terminal may sometimes display
the following message.
This means that someone else (the instructor) has pushed a commit that you do not yet have,
while you have also made one or more commits locally that they do not have. This screen,
displayed in vim (https://en.wikipedia.org/wiki/Vim_(text_editor)), is asking you to
enter a message (or use the default message) to create a merge commit that will reconcile both
changes. To close this screen and create the merge commit, type :wq and press enter.
$ cd ~/Desktop/Student-Materials/
$ git pull origin master # Pull updates.
### Make changes to a file.
$ git add -u # Track changes.
$ git commit -m "Made some changes." # Commit changes.
$ git push origin master # Push updates.
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
PythonIntro/python_intro.py
modified: PythonIntro/python_intro.py
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean
B
Installing and Managing Python
Lab Objective: One of the great advantages of Python is its lack of overhead: it is relatively easy
to download, install, start up, and execute. This appendix introduces tools for installing and updating
specific packages and gives an overview of possible environments for working efficiently in Python.
Achtung!
This curriculum uses Python 3.6, not Python 2.7. With the wrong version of Python, some
example code within the labs may not execute as intended or result in an error.
Managing Packages
A Python package manager is a tool for installing or updating Python packages, which involves
downloading the right source code files, placing those files in the correct location on the machine,
and linking the files to the Python interpreter. Never try to install a Python package without using
a package manager (see https://xkcd.com/349/).
Conda
Many packages are not included in the default Anaconda download but can be installed via Ana-
conda’s package manager, conda. See https://docs.anaconda.com/anaconda/packages/pkg-docs
for the complete list of available packages. When you need to update or install a package, always
try using conda first.
Command Description
conda install <package-name> Install the specified package.
conda update <package-name> Update the specified package.
conda update conda Update conda itself.
conda update anaconda Update all packages included in Anaconda.
conda --help Display the documentation for conda.
For example, the following terminal commands attempt to install and update matplotlib.
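# Install matplotlib, then update it (commands follow the table above).
$ conda install matplotlib
$ conda update matplotlib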
Note
The best way to ensure a package has been installed correctly is to try importing it in IPython.
Achtung!
Be careful not to attempt to update a Python package while it is in use. It is safest to update
packages while the Python interpreter is not running.
Pip
The most generic Python package manager is called pip. While it has a larger package list, conda is
the cleaner and safer option. Only use pip to manage packages that are not available through conda.
Command Description
pip install package-name Install the specified package.
pip install --upgrade package-name Update the specified package.
pip freeze Display the version number on all installed packages.
pip --help Display the documentation for pip.
Workflows
There are several different ways to write and execute programs in Python. Try a variety of workflows
to find what works best for you.
• Atom: https://atom.io/
• Geany: https://www.geany.org/
• Vim: https://www.vim.org/
• Emacs: https://www.gnu.org/software/emacs/
Once Python code has been written in a text editor and saved to a file, that file can be executed
in the terminal or command line.
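For example (the filename here is just a placeholder):

$ python my_script.py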
IPython is an enhanced version of Python that is more user-friendly and interactive. It has
many features that cater to productivity such as tab completion and object introspection.
Note
While Mac and Linux computers come with a built-in bash terminal, Windows computers do
not. Windows does come with Powershell, a terminal-like application, but some commands in
Powershell are different than their bash analogs, and some bash commands are missing from
Powershell altogether. There are two good alternatives to the bash terminal for Windows:
Jupyter Notebook
The Jupyter Notebook (previously known as IPython Notebook) is a browser-based interface for
Python that comes included as part of the Anaconda Python Distribution. It has an interface similar
to the IPython interpreter, except that input is stored in cells and can be modified and re-evaluated
as desired. See https://github.com/jupyter/jupyter/wiki/ for some examples.
To begin using Jupyter Notebook, run the command jupyter notebook in the terminal. This
will open your file system in a web browser in the Jupyter framework. To create a Jupyter Notebook,
click the New drop down menu and choose Python 3 under the Notebooks heading. A new tab
will open with a new Jupyter Notebook.
Jupyter Notebooks differ from other forms of Python development in that notebook files contain
not only the raw Python code, but also formatting information. As such, Jupyter Notebook files
cannot be run in any other development environment. They also have the file extension .ipynb
rather than the standard Python extension .py.
Jupyter Notebooks also support Markdown—a simple text formatting language—and LATEX,
and can embed images, sound clips, videos, and more. This makes Jupyter Notebook the ideal
platform for presenting code.
• JupyterLab: http://jupyterlab.readthedocs.io/en/stable/
• PyCharm: https://www.jetbrains.com/pycharm/
• Spyder: http://code.google.com/p/spyderlib/
C
NumPy Visual Guide
Lab Objective: NumPy operations can be difficult to visualize, but the concepts are straightforward.
This appendix provides visual demonstrations of how NumPy arrays are used with slicing syntax,
stacking, broadcasting, and axis-specific operations. Though these visualizations are for 1- or 2-
dimensional arrays, the concepts can be extended to n-dimensional arrays.
Data Access
The entries of a 2-D array are the rows of the matrix (as 1-D arrays). To access a single entry, enter
the row index, a comma, and the column index. Remember that indexing begins with 0.
    A[0]    extracts the entire first row of A (as a 1-D array).
    A[2,1]  extracts the single entry in row 2, column 1 (the third row, second column).
Slicing
A lone colon extracts an entire row or column from a 2-D array. The syntax [a:b] can be read as
“the ath entry up to (but not including) the bth entry.” Similarly, [a:] means “the ath entry to the
end” and [:b] means “everything up to (but not including) the bth entry.”
    A[1] = A[1,:]  extracts the entire second row of A.
    A[:,2]         extracts the entire third column of A.
    A[1:,:2]       extracts the block made of every row after the first and the first two columns.
    A[1:-1,1:-1]   extracts the interior block, excluding the first and last rows and columns.
Stacking
np.hstack() stacks sequence of arrays horizontally and np.vstack() stacks a sequence of arrays
vertically.
[Figure: A is a 3×3 block of × entries and B is a 3×3 block of ∗ entries. np.hstack((A,B,A)) lays the blocks side by side to form a 3×9 array, while np.vstack((A,B,A)) stacks them on top of one another to form a 9×3 array.]
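A quick sketch of the same stacking operations, using small arrays of ones and zeros chosen only for illustration:

>>> import numpy as np
>>> A = np.ones((3, 3))
>>> B = np.zeros((3, 3))
>>> np.hstack((A, B, A)).shape      # The blocks sit side by side.
(3, 9)
>>> np.vstack((A, B, A)).shape      # The blocks are stacked on top of each other.
(9, 3)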
Because 1-D arrays are flat, np.hstack() concatenates them end to end, while np.vstack() stacks
them as the rows of a 2-D array. To make several 1-D arrays the columns of a 2-D array, use
np.column_stack().
[Figure: x is a 1-D array of four × entries and y is a 1-D array of four ∗ entries. np.hstack((x,y,x)) concatenates them into a single longer 1-D array, np.vstack((x,y,x)) makes them the rows of a 2-D array, and np.column_stack((x,y,x)) makes them the columns of a 2-D array.]
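For example (with values invented for this sketch), the 1-D versions behave as follows:

>>> import numpy as np
>>> x = np.array([1, 2, 3])
>>> y = np.array([4, 5, 6])
>>> np.hstack((x, y, x))            # 1-D arrays are concatenated end to end.
array([1, 2, 3, 4, 5, 6, 1, 2, 3])
>>> np.vstack((x, y, x))            # 1-D arrays become the rows of a 2-D array.
array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3]])
>>> np.column_stack((x, y, x))      # 1-D arrays become the columns of a 2-D array.
array([[1, 4, 1],
       [2, 5, 2],
       [3, 6, 3]])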
The functions np.concatenate() and np.stack() are more general versions of np.hstack() and
np.vstack(), and np.row_stack() is an alias for np.vstack().
Broadcasting
NumPy automatically aligns arrays for component-wise operations whenever possible. See
http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html for more in-depth examples and
broadcasting rules.
For example, let

A = [[1, 2, 3],
     [1, 2, 3],          and          x = [10, 20, 30].
     [1, 2, 3]]

Adding x to A broadcasts x across each row of A:

A + x = [[11, 22, 33],
         [11, 22, 33],
         [11, 22, 33]]

Reshaping x into a column vector broadcasts it across each column of A instead:

A + x.reshape((-1,1)) = [[11, 12, 13],
                         [21, 22, 23],
                         [31, 32, 33]]
Operations along an Axis

Many array methods take an axis argument that applies the operation along one dimension of the
array: axis=0 acts down each column, while axis=1 acts across each row. For example, let

A = [[1, 2, 3, 4],
     [1, 2, 3, 4],
     [1, 2, 3, 4],
     [1, 2, 3, 4]].

Then

A.sum(axis=0) = [4, 8, 12, 16]          (the sum of each column)

A.sum(axis=1) = [10, 10, 10, 10]        (the sum of each row)
Appendix D. Introduction to Scikit-Learn
Lab Objective: Scikit-learn is one of the fundamental tools in Python for machine learning.
In this appendix we highlight and give examples of some popular scikit-learn tools for classification
and regression, training and testing, data normalization, and constructing complex models.
Note
This guide corresponds to scikit-learn version 0.20, which has a few significant differences from
previous releases. See http://scikit-learn.org/stable/whats_new.html for current release
notes. Install scikit-learn (the sklearn module) with conda install scikit-learn.
Scikit-learn [PVG+ 11, BLB+ 13] takes a highly object-oriented approach to machine learning
models. Every major scikit-learn class inherits from sklearn.base.BaseEstimator and conforms to
the following conventions:
1. The constructor __init__() receives hyperparameters for the classifier, which are parameters
for the model f that are not dependent on data. Each hyperparameter must have a default
value (i.e., every argument of __init__() is a keyword argument), and each argument must
be saved as an instance variable of the same name as the parameter.
2. The fit() method constructs the model f. It receives an N × D matrix X and, optionally,
a vector y with N entries. Each row x_i of X is one sample with corresponding label y_i. By
convention, fit() always returns self.
Along with the BaseEstimator class, there are several other “mix in” base classes in sklearn.base
that define specific kinds of models. The three listed below are the most common.1
1 See http://scikit-learn.org/stable/modules/classes.html#base-classes for the complete list.
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.model_selection import train_test_split

# Load the breast cancer dataset and split it into training and testing groups.
>>> cancer = load_breast_cancer()
>>> X_train, X_test, y_train, y_test = train_test_split(cancer.data,
...                                                     cancer.target)
>>> print(X_train.shape, y_train.shape)
(426, 30) (426,)            # There are 426 training points, each with 30 features.
The KNeighborsClassifier object could easily be replaced with a different classifier, such as a
GaussianNB object from sklearn.naive_bayes. Since GaussianNB also inherits from BaseEstimator
and ClassifierMixin, it has fit(), predict(), and score() methods that take in the same kinds
of inputs as the corresponding methods for the KNeighborsClassifier. The only difference, from
an external perspective, is the hyperparameters that the constructor accepts.
Roughly speaking, the GaussianNB classifier assumes all features in the data are independent
and normally distributed, then uses Bayes’ rule to compute the likelihood of a new point belonging to
a label for each of the possible labels. To do this, the fit() method computes the mean and variance
of each feature, grouped by label. These quantities are saved as the attributes theta_ (the means)
and sigma_ (the variances), then used in predict(). Parameters like these that are dependent
on data are only defined in fit(), not the constructor, and they are always named with a trailing
underscore. These “non-hyper” parameters are often simply called model parameters.
The fit() method should do all of the “heavy lifting” by calculating the model parameters.
The predict() method should then use these parameters to choose a label for test data.
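As a brief illustration (this snippet is not from the original text; it reuses X_train and y_train from the split above), fitting a GaussianNB classifier computes exactly these kinds of model parameters:

>>> from sklearn.naive_bayes import GaussianNB

>>> gnb = GaussianNB()              # Hyperparameters (if any) go to the constructor.
>>> gnb.fit(X_train, y_train)       # Model parameters are computed from the data here.
>>> print(gnb.theta_.shape)         # Mean of each feature, grouped by label.
(2, 30)
>>> print(gnb.sigma_.shape)         # Variance of each feature, grouped by label.
(2, 30)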
Table D.1: Naming and initialization conventions for scikit-learn model parameters.
Achtung!
Both PopularClassifier and ConstRegressor wait until predict() to validate the strategy
hyperparameter. The check could easily be done in the constructor, but that goes against scikit-
learn conventions: in order to cooperate with automated validation tools, the constructor of any
class inheriting from BaseEstimator must store the arguments of __init__() as attributes—
with the same names as the arguments—and do nothing else.
Note
The first input to fit() and predict() is always a two-dimensional N × D NumPy array,
where N is the number of observations and D is the number of features. To fit or predict on
one-dimensional data (D = 1), reshape the input array into a “column vector” before feeding it
into the estimator. One-dimensional problems are somewhat rare in machine learning, but the
following example shows how to do a simple one-dimensional linear regression.
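The original example is not reproduced here; the following minimal sketch (with synthetic data invented for illustration) shows the reshaping step together with scikit-learn's LinearRegression.

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression

# Synthetic one-dimensional data: y is roughly linear in x.
>>> x = np.linspace(0, 5, 20)
>>> y = 2*x + 1 + np.random.normal(scale=0.5, size=20)

# Reshape the 1-D input into a column vector (one feature per observation).
>>> lin = LinearRegression().fit(x.reshape((-1, 1)), y)
>>> y_predicted = lin.predict(x.reshape((-1, 1)))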
[Figure: scatter plot of the one-dimensional data and the resulting regression line, with legend entries "data" and "predicted".]
Transformers
A scikit-learn transformer processes data to make it better suited for estimation. This may involve
shifting and scaling data, dropping columns, replacing missing values, and so on.
Classes that inherit from the TransformerMixin base class have a fit() method that accepts
an N × D matrix X (like an estimator) and an optional set of labels. The labels are not needed—in
fact the fit() method should do nothing with them—but the parameter for the labels remains as a
keyword argument to be consistent with the fit(X,y) syntax of estimators. Instead of a predict()
method, the transform() method accepts data, modifies it (usually via a copy), and returns the
result. The new data may or may not have the same number of columns as the original data.
One common transformation is shifting and scaling the features (columns) so that they each
have a mean of 0 and a standard deviation of 1. The following example implements a basic version
of this transformer.
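The original listing is not shown here; the sketch below is one way such a NormalizingTransformer might look, following only the conventions described above (the attribute names mu_ and sig_ are this sketch's own choices).

>>> import numpy as np
>>> from sklearn.base import BaseEstimator, TransformerMixin

>>> class NormalizingTransformer(BaseEstimator, TransformerMixin):
...     def fit(self, X, y=None):
...         # Record the mean and standard deviation of each column of the
...         # training data (model parameters get a trailing underscore).
...         self.mu_ = np.mean(X, axis=0)
...         self.sig_ = np.std(X, axis=0)
...         return self
...     def transform(self, X):
...         # Shift and scale by the statistics computed in fit().
...         return (X - self.mu_) / self.sig_

Fitting this transformer on X_train and then calling transform() on both X_train and X_test scales the two sets consistently.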
Achtung!
The transform() method should only rely on model parameters derived from the training
data in fit(), not on the data that is worked on in transform(). For example, if the
NormalizingTransformer is fit with the input X̂, then transform() should shift and scale
any input X by the mean and standard deviation of X̂, not by the mean and standard devia-
tion of X. Otherwise, the transformation is different for each input X.
Table D.2: Common scikit-learn classifiers, regressors, and transformers. For full documentation on
these classes, see http://scikit-learn.org/stable/modules/classes.html.
Validation Tools
Knowing how to determine whether or not an estimator performs well is an essential part of machine
learning. This often turns out to be a surprisingly sophisticated issue that largely depends on the type
of problem being solved and the kind of data that is available for training. Scikit-learn has validation
tools for many situations; for brevity, we restrict our attention to the simple (but important) case of
binary classification, where the range of the desired model is Y = {0, 1}.
Evaluation Metrics
The score() method of a scikit-learn estimator representing the model f : X → {0, 1} returns the
accuracy of the model, which is the percent of labels that are predicted correctly. However, accuracy
isn’t always the best measure of success. Consider the confusion matrix for a classifier, the matrix
where the (i, j)th entry is the number of observations with actual label i but that are classified as
label j. In binary classification, calling the class with label 0 the negatives and the class with label
1 the positives, this becomes the following.
Predicted: 0 Predicted: 1
Actual: 0 True Negatives (T N ) False Positives (F P )
Two other important metrics are precision, TP/(TP + FP), the fraction of predicted positives that
are actually positive, and recall, TP/(TP + FN), the fraction of actual positives that are correctly
identified. The Fβ score combines them into a single number,

Fβ = (1 + β²) · precision · recall / (β² · precision + recall).

Choosing β < 1 weighs precision more than recall, while β > 1 prioritizes recall over precision.
The choice of β = 1 yields the common F1 score, which weighs precision and recall equally. This is
an important alternative to accuracy when, for example, the training set is heavily unbalanced with
respect to the class labels.
Scikit-learn implements these metrics in sklearn.metrics, as well as functions for evaluating
regression, non-binary classification, and clustering models. The general syntax for such functions
is some_score(actual_labels, predicted_labels). For the complete list and further discussion,
see http://scikit-learn.org/stable/modules/model_evaluation.html.
# Fit the estimator to training data and predict the test labels.
>>> knn.fit(X_train, y_train)
>>> knn_predicted = knn.predict(X_test)
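For instance (not part of the original listing), the functions in sklearn.metrics can evaluate these predictions; the resulting values depend on the random train/test split, so no output is shown.

>>> from sklearn.metrics import (confusion_matrix, accuracy_score,
...                              precision_score, recall_score, f1_score)

>>> confusion_matrix(y_test, knn_predicted)    # Rows: actual; columns: predicted.
>>> accuracy_score(y_test, knn_predicted)      # Same value as knn.score(X_test, y_test).
>>> precision_score(y_test, knn_predicted)
>>> recall_score(y_test, knn_predicted)
>>> f1_score(y_test, knn_predicted)            # The F1 score (beta = 1).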
Cross Validation
The sklearn.model_selection module has utilities to streamline and improve model evaluation.
• train_test_split() randomly splits data into training and testing sets (we already used this).
• cross_val_score() randomly splits the data and trains and scores the model a set number
of times. Each trial uses different training data and results in a different model. The function
returns the score of each trial.
• cross_validate() does the same thing as cross_val_score(), but it also reports the time it
took to fit, the time it took to score, and the scores for the test set as well as the training set.
Doing multiple evaluations with different testing and training sets is extremely important. If the
scores on a cross validation test vary wildly, the model is likely overfitting to the training data.
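For instance (a sketch reusing the breast cancer training data from earlier; the hyperparameter value is arbitrary), cross_val_score() can be used as follows.

>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.neighbors import KNeighborsClassifier

# Train and score the classifier on 4 different splits of the training data.
>>> scores = cross_val_score(KNeighborsClassifier(n_neighbors=3),
...                          X_train, y_train, cv=4)
>>> print(scores.mean(), scores.std())   # Similar scores across trials is a good sign.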
Note
Any estimator, even a user-defined class, can be evaluated with the scikit-learn tools presented
in this section as long as that class conforms to the scikit-learn API discussed previously (i.e.,
inheriting from the correct base classes, having fit() and predict() methods, managing
hyperparameters and parameters correctly, and so on). Any time you define a custom estimator,
following the scikit-learn API gives you instant access to tools such as cross_val_score().
Grid Search
Recall that the hyperparameters of a machine learning model are user-provided parameters that
do not depend on the training data. Finding the optimal hyperparameters for a given model is a
challenging and active area of research.2 However, brute-force searching over a small hyperparameter
space is simple in scikit-learn: a sklearn.model_selection.GridSearchCV object is initialized with
an estimator, a dictionary of hyperparameters, and cross validation parameters (such as cv and
scoring). When its fit() method is called, it does a cross validation test on the given estimator
with every possible hyperparameter combination.
For example, a k-neighbors classifier has a few important hyperparameters that can have a
significant impact on the speed and accuracy of the model: n_neighbors, the number of nearest
neighbors allowed to vote; and weights, which specifies a strategy for weighting the distances between
points. The following code tests various combinations of these hyperparameters.
2 Intelligent hyperparameter selection is sometimes called metalearning. See, for example, [SGCP+ 18].
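The original listing is not reproduced here, but such a search might be set up as in the following sketch; the candidate hyperparameter values are assumptions chosen only for illustration.

>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.neighbors import KNeighborsClassifier

# Candidate values for each hyperparameter (illustrative choices).
>>> param_grid = {"n_neighbors": [2, 3, 4, 5, 6],
...               "weights": ["uniform", "distance"]}
>>> knn_gs = GridSearchCV(KNeighborsClassifier(), param_grid,
...                       cv=4).fit(X_train, y_train)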
# After fitting, the gridsearch object has data about the results.
>>> print(knn_gs.best_params_, knn_gs.best_score_)
{'n_neighbors': 5, 'weights': 'uniform'} 0.9532526583188765
The cost of a grid search rapidly increases as the hyperparameter space grows. However,
the outcomes of each trial are completely independent of each other, so the problem of training
each classifier is embarrassingly parallel. To parallelize the grid search over n cores, set the n_jobs
parameter to n, or set it to −1 to divide the labor between as many cores as are available.
In some circumstances, the parameter grid can also be organized in a way that eliminates
redundancy. Consider an SVC classifier from sklearn.svm, an estimator that works by lifting the
data into a high-dimensional space, then constructing a hyperplane to separate the classes. The SVC
has a hyperparameter, kernel, that determines how the lifting into higher dimensions is done, and
for each choice of kernel there are additional corresponding hyperparameters. To search the total
hyperparameter space without redundancies, enter the parameter grid as a list of dictionaries, each
of which defines a different section of the hyperparameter space. In the following code, doing so
reduces the number of trials from 3 × 2 × 3 × 4 = 72 to only 1 + (1 × 2 × 3) + (1 × 4) = 11.
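The original listing is not shown here; a structured grid of that kind might look like the following sketch, where the particular kernels and candidate values are illustrative assumptions.

>>> from sklearn.svm import SVC
>>> from sklearn.model_selection import GridSearchCV

# One dictionary per region of the hyperparameter space: hyperparameters that
# only matter for one kernel are listed only in that kernel's dictionary.
>>> svc_param_grid = [
...     {"kernel": ["linear"]},
...     {"kernel": ["rbf"], "shrinking": [True, False], "gamma": [.001, .01, .1]},
...     {"kernel": ["poly"], "degree": [2, 3, 4, 5]}]
>>> svc_gs = GridSearchCV(SVC(), svc_param_grid, cv=4).fit(X_train, y_train)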
Pipelines
Most machine learning problems require at least a little data preprocessing before estimation in
order to get good results. A scikit-learn pipeline (sklearn.pipeline.Pipeline) chains together one
or more transformers and one estimator into a single object, complete with fit() and predict()
methods. For example, it is often a good idea to shift and scale data before feeding it into a classifier.
The StandardScaler transformer can be combined with a classifier with a pipeline. Calling fit()
on the resulting object calls fit_transform() on each successive transformer, then fit() on the
estimator at the end. Likewise, calling predict() on the Pipeline object calls transform() on
each transformer, then predict() on the estimator.
Since Pipeline objects behave like estimators (following the fit() and predict() conven-
tions), they can be used with tools like cross_val_score() and GridSearchCV. To specify which
hyperparameters belong to which steps of the pipeline, precede each hyperparameter name with
<stepname>__. For example, knn__n_neighbors corresponds to the n_neighbors hyperparameter of
the part of the pipeline that is labeled knn.
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.model_selection import GridSearchCV

# Pass the Pipeline object to the GridSearchCV and fit it to the data.
>>> pipe = Pipeline([("scaler", StandardScaler()),
...                  ("knn", KNeighborsClassifier())])
# pipe_param_grid is a parameter grid keyed by step name, e.g. "knn__n_neighbors".
>>> pipe_gs = GridSearchCV(pipe, pipe_param_grid,
...                        cv=4, n_jobs=-1, verbose=1).fit(X_train, y_train)
Fitting 4 folds for each of 40 candidates, totalling 160 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed: 0.3s finished
Pipelines can also be used to compare different transformations or estimators. For example, a
pipeline could end in either a KNeighborsClassifier() or an SVC(), even though they have different
hyperparameters. As before, use a list of dictionaries to specify the hyperparameter space.
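The original listing is not reproduced here; as a sketch (the step names and candidate values are this sketch's own choices), the final step can itself be swapped out through the parameter grid:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.svm import SVC
>>> from sklearn.model_selection import GridSearchCV

>>> pipe = Pipeline([("scaler", StandardScaler()),
...                  ("classifier", KNeighborsClassifier())])
# Each dictionary substitutes a different estimator into the "classifier" step
# and lists hyperparameters that make sense for that estimator.
>>> param_grid = [
...     {"classifier": [KNeighborsClassifier()],
...      "classifier__n_neighbors": [2, 3, 4, 5]},
...     {"classifier": [SVC()], "classifier__C": [.1, 1, 10]}]
>>> pipe_gs = GridSearchCV(pipe, param_grid, cv=4).fit(X_train, y_train)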
Additional Material
Exercises
Problem 1. Writing custom scikit-learn transformers is a convenient way to organize the data
cleaning process. Consider the data in titanic.csv, which contains information about passen-
gers on the maiden voyage of the RMS Titanic in 1912. Write a custom transformer class to
clean this data, implementing the transform() method as follows:
1. Extract a copy of the data frame with just the "Pclass", "Sex", and "Age" columns.
2. Replace NaN values in the "Age" column (of the copied data frame) with the mean age.
The mean age of the training data should be calculated in fit() and used in transform()
(compare this step to using sklearn.preprocessing.Imputer).
Ensure that your transformer matches scikit-learn conventions (it inherits from the correct base
classes, fit() returns self, etc.).
Problem 2. Read the data from titanic.csv with pd.read_csv(). The "Survived" column
indicates which passengers survived, so the entries of the column are the labels that we would
like to predict. Drop any rows in the raw data that have NaN values in the "Survived" column,
then separate the column from the rest of the data. Split the data and labels into training and
testing sets. Use the training data to fit a transformer from Problem 1, then use that transformer
to clean the training set, then the testing set. Finally, train a LogisticRegression classifier
and a RandomForestClassifier on the cleaned training data, and score them using the cleaned
test set.
Problem 4. Make a pipeline with at least two transformers to further process the Titanic
dataset. Do a gridsearch on the pipeline and report the hyperparameters of the best estimator.
Bibliography
[ADH+ 01] David Ascher, Paul F Dubois, Konrad Hinsen, Jim Hugunin, Travis Oliphant, et al.
Numerical python, 2001.
[BLB+ 13] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller,
Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler,
Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API
design for machine learning software: experiences from the scikit-learn project. In ECML
PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122,
2013.
[Oli06] Travis E Oliphant. A guide to NumPy, volume 1. Trelgol Publishing USA, 2006.
[Oli07] Travis E Oliphant. Python for scientific computing. Computing in Science & Engineering,
9(3), 2007.
[SGCP+ 18] Brandon Schoenfeld, Christophe Giraud-Carrier, Mason Poggemann, Jarom Christensen,
and Kevin Seppi. Preprocessor selection for machine learning pipelines. arXiv preprint
arXiv:1810.09942, 2018.
[VD10] Guido VanRossum and Fred L Drake. The python language reference. Python software
foundation Amsterdam, Netherlands, 2010.