FIN3028 Python Companion Notes
Dr Alan Hanna
Richard Feynman
Preface
This document is intended as a companion quick reference guide to the ‘Python for Finance’
module (FIN3028) at Queen’s Business School.
Feedback on the content is welcome and should be sent to [email protected]. Knuth-style rewards are offered to students currently undertaking the module who identify mistakes (grammatical, typographical, mathematical, factual, numerical, logical or otherwise) in the manuscript.
Propagated or compounded errors will be treated as a single observation. Terms and conditions
apply :-)
http://xkcd.com/1195/
Contents
I Python Fundamentals 1
1 Introduction 2
1.1 What is Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Anaconda Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Installing Packages using pip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.3 Installing Packages using conda (preferred method) . . . . . . . . . . . . . . . . . . . 3
1.2.4 Upgrading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.5 Python Versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Running Python via a browser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Becoming a better developer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Getting Started 6
2.1 Code Editors & IDEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Command Prompt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 IDLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Spyder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.1 Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.2 Console . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.3 Variable Explorer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.4 Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.5 Help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Jupyter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.1 Launching Jupyter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.2 Jupyter Notebooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.3 Code Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.4 Markdown Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.5 nbextensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Other IDEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.1 VS Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.2 RStudio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Online Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Ethical Considerations 18
3.1 Security and Ownership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Societal Contribution and Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Professional Conduct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4 Python Basics 21
4.1 Code Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.1 assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.2 Variables as memory locations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.3 underscore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.4 print . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.4 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.5.1 String methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5.2 String formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5.3 f-string . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.5.4 Regular expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Dates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.7 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.8 Containers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.8.1 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.8.2 Tuples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.8.3 Ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.8.4 Working with arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.8.5 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.8.6 Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.8.7 List Comprehensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.9 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.9.1 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.9.2 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.10 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.10.1 Standard Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.10.2 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.10.3 Import . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.11 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.11.1 Positional arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.11.2 Optional arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.11.3 Arbitrary argument lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.11.4 Multiple return values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.11.5 Lambda Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.11.6 Modifying input parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.11.7 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.11.8 Inner Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.11.9 Generator Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.11.10 Type annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.12 Error Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.13 Enumerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.14 Other python features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.14.1 Assignment expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.14.2 Decorators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Development 53
5.1 Life cycle models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3 Coding Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4 Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4.1 Logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4.2 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4.3 Line by line execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.5 Good developer practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
II Data Processing 63
7 Libraries 64
7.1 NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.1.1 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.1.2 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.1.3 Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.1.4 Reshaping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.1.5 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.2 SciPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.3 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.3.1 Getting started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.3.2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.3.3 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.3.4 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.3.5 Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.3.6 Dates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.4 scikit-learn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.5 tkinter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.5.1 Widgets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8 I/O Operations 78
8.1 User input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.2 File based input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.2.1 File paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.3 Text files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.4 CSV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
8.5 Pickle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.6 Excel workbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.7 Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.7.1 Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.8 JSON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.9 XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.9.1 XPath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.10 HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.10.1 Web scraping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.10.2 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.10.3 Downloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.10.4 Browser based web scraping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.10.5 Responsible web scraping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.11 API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.11.1 Using an API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.11.2 Creating an API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.12 Financial Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.12.1 pandas-datareader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.12.2 yfinance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.12.3 WRDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
9 Data Handling 97
9.1 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.1.1 Descriptive Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.1.2 Categorical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.1.3 Numerical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
9.2 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
9.3 Data Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
9.3.1 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.3.2 Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.4 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
9.5 Duplicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
9.6 Data Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
9.6.1 Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
9.6.2 Binning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.6.3 Dummy variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
9.6.4 Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
9.6.5 Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.6.6 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.7 Reshaping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
9.7.1 Stack/Unstack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
9.7.2 Melt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
9.8 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
9.8.1 Group by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
9.8.2 Pivot table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
9.9 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
IV Appendix 160
Index 161
Bibliography 164
Part I
Python Fundamentals
1 Introduction
1.2 Installation
Python can be downloaded as a stand-alone installation, or as part of a larger package. For
this module we will use the Anaconda installation. Default installation configurations should be
fine.
• https://www.python.org/downloads/
  – Stand-alone installation
• https://www.anaconda.com/download/
  – Comes with additional packages and tools
  – Can make it easier to manage package installations
  – Installed on all PCs throughout the QUB network
  – A requirement for this module!
Note that you may require administrator rights to install new packages. In Windows, you should right click and choose ‘run as administrator’ when launching the command prompt (cmd). Within Anaconda, you should use the ‘Anaconda Prompt’ application from the Start Menu (again running with admin rights) as shown in Figure 1.1. Installing one package may prompt you to upgrade others.
1.2.4 Upgrading
The easiest way to upgrade your Anaconda installation is from the Anaconda prompt with the following command:

conda update anaconda

Individual packages (with necessary dependencies) may be updated with commands such as:

conda update pandas
Normally new versions will be backward compatible (i.e. old code will still run under new versions)
but this is not always the case (for example when moving from version 2.x to 3.x).
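A well-known example of such a breaking change is the redefinition of the division operator between Python 2 and 3. The snippet below shows the behaviour under Python 3 (the values in the comments are for illustration):

```python
# In Python 2, 1 / 2 evaluated to 0 (integer division).
# Python 3 made / true division and introduced // for floor division.
print(1 / 2)    # 0.5
print(1 // 2)   # 0
print(7 // 2)   # 3
```

Code relying on the old behaviour of / therefore produces different results when run under Python 3, which is why such version changes are not backward compatible.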
Note that it is possible to have multiple versions of Python on one computer. In Windows, the version that is used can be configured via the PATH environment variable. Python helpfully provides a guide to installations on Windows.
2.2 Command Prompt

You can run the Python interpreter in interactive mode from a command prompt. Typing ‘cmd’ into the Windows start box will launch a new command prompt as shown in Figure 2.1. If necessary, navigate to the appropriate folder using the ‘cd’ (change directory) command. Relevant commands are shown in Table 2.1.
Command Action
cd.. Change directory - move to parent directory
cd temp Change directory - move to subfolder temp
python Start an interactive python session
python filename.py Execute the python script filename.py
ctrl + z <enter> Quit the python session (Windows)
quit() <enter> Quit the python session
2.3 IDLE
IDLE is a basic IDE that ships with many versions of Python. For Windows installations, it will appear in the Start menu under Python.
Statements can be executed at the prompt, or it is possible to create (and save) code modules
that can be executed in full by pressing F5.
2.4 Spyder
Spyder is a richer IDE that is part of the default Anaconda installation. Typical of IDEs, it features
dockable panels as shown in Figure 2.3. This allows developers to simultaneously develop and
execute commands, while inspecting the results.
In the following sections we will briefly outline key features of the main components. Full documentation is available from https://www.spyder-ide.org/.
Before beginning make sure to set the working directory. This is where Python will look for files that
you might attempt to import and where it will export files (when not supplied an explicit path). In
Spyder, you can quickly set the working directory to match the location of an open code module
by right clicking its tab and choosing ‘set console working directory’.
2.4.1 Editor
The editor allows developers to work across multiple code modules. Each tab represents a text file saved with a ‘.py’ extension. The Spyder code editor incorporates a number of standard developer features:
• Colour coding of text (defaults are grey for comments, blue for keywords, magenta for built-in
words, red for numbers, green for strings)
• IntelliSense can be used to auto-complete text (CTRL+space)
• Tooltips that explain function prototypes
• Code tidying features (see source menu)
• ‘Compiler’ warnings - visual clues that code may contain errors
The editor also provides shortcuts for quickly commenting/uncommenting or indenting/unindenting lines of code (see the Edit menu). Many of these settings can be customised from Tools > Preferences. By default, one tab or indent is set to 4 spaces. By opening a file under different configuration settings it is easy to create files with inconsistent indentation. To check whether you are using spaces or tabs, turn on ‘Show blank spaces’ under the Source menu. The latest version of Spyder is compatible with Kite, a plug-in that can be installed to enhance auto-completion features.
A further feature of the editor is that it will highlight potential errors by adding a warning icon in
the margin beside the offending line of code. Hovering over the icon with the mouse will reveal a
short description of the detected issue.
Code within modules can be broken into cells to allow them to be executed as single stand-alone blocks. This is achieved by lines starting with either ‘#%%’ or ‘# %%’. Cells can also be used for code outlining, which is useful for quickly navigating around your code (enable via View > Panes > Outline). Code can be run in different ways:
exec(open('./filename.py').read())
Shortcut Action
F5 Run current script
F9 Run current line or selection
CTRL+RETURN Run active cell
SHIFT+RETURN Run active cell and advance
2.4.2 Console
The console is used for both input and output. Commands can be executed directly in the console, and output will appear here by default. It is possible to create multiple console windows (from the Consoles menu). Output from running code in the editor will appear in the active console window. See also Preferences > Run options.
A ‘History log’ shows a list of previously executed statements. If you wish to re-execute commands, you can copy them from here. Alternatively, when the console window is active, the up and down cursor keys will cycle through previously executed statements.
2.4.3 Variable Explorer

The variable explorer lists the variables defined in the active console, together with their type, size and current value. This will include objects and functions. For variables representing more complicated data structures (e.g. matrices, or classes) it may be possible to double click the variable to see an expanded view of the underlying data.
Some configuration changes may be required to explicitly include (or exclude) certain variable
types. For example, to inspect objects (instances of classes) follow the configuration change
shown in Figure 2.7.
2.4.4 Debugging
Coding mistakes that cause incorrect results are known as ‘bugs’. Locating and correcting such
mistakes is referred to as ‘debugging’.
Normally, code is executed as quickly as your computer architecture allows. One way to debug your code is to instead execute it one statement at a time, tracing the sequence of execution and inspecting intermediate results to detect the point of failure. Since the point of failure may occur only after thousands of lines of code, breakpoints allow you to pause code execution at critical points in your code.
Breakpoints appear as red dots in the editor margin and are activated (and deactivated) by
clicking in the margin or via keyboard shortcuts. A full list of breakpoints can be found in the
breakpoint window (View > Panes > Breakpoints).
Commands for debugging can be accessed via the Debug menu, toolbar or keyboard shortcuts (Table 2.3). When running code in debug mode, the next statement to be executed will be highlighted in the editor.
Shortcut Action
F12 Breakpoint toggle
(or just click in margin beside the line number)
SHIFT+F12 Set/edit conditional breakpoint
CTRL+F5 Debug script
CTRL+F10 Debug one line at a time
CTRL+F11 Step into (function)
CTRL+SHIFT+F11 Step out
CTRL+F12 Continue execution
CTRL+SHIFT+F12 End execution
2.4.5 Help
Spyder’s help system can be activated by CTRL+I after placing the cursor adjacent to text in the editor. If the text is recognised as part of the Python language or a package, help will appear as shown in Figure 2.8.
An interactive help tool is also built into Python. To invoke it, simply issue the command help().

# To exit
quit
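The text displayed by these help tools comes from an object’s documentation string, which can also be read directly from its __doc__ attribute. A short sketch:

```python
# help() prints the documentation for any object, e.g. the built-in len
help(len)

# The underlying help text is stored on the object's __doc__ attribute
print(len.__doc__)
```

This is why Spyder’s help pane, help() and __doc__ all show the same information for a given function.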
While documentation and built-in help functions are valuable resources, a host of online developer communities have already encountered and solved most problems you are ever likely to meet. Google and forums such as https://stackoverflow.com/ are normally instant sources of solutions.
Further ‘help’ is provided by the following Spyder features:
• Linting - a background process that analyses your code, checking for syntax errors and suspicious code. Issues are indicated in the margin with colour coded symbols to denote invalid syntax and warnings. Hover over the symbol for an explanation.
• Kite (separate installation required) provides AI powered code completions. Press Control +
Space for a popup list of potential options to complete the text you are typing.
• The Code Analysis panel provides a list of potential errors, bad practices, quality issues and style violations. See https://docs.spyder-ide.org/current/panes/pylint.html for more details.
2.5 Jupyter
Jupyter is a browser-based environment that allows you to create interactive notebooks containing text, code, results and visualisations, using a range of languages including Python. Jupyter is included with Anaconda. Full documentation is available from https://jupyter.org/documentation.
2.5.1 Launching Jupyter

Jupyter can be launched from an Anaconda prompt with the command:

jupyter notebook
Once Jupyter has launched, navigate to the directory where your notebooks are saved or where you wish to create a new one. Existing notebooks can be opened by clicking on them. Further options become available after selection (checking the box beside their name). To create a new notebook, use the ‘New’ dropdown and select Python 3 (Notebook). A second tab ‘Running’ on the home page provides a list of currently active notebooks and terminal instances.
2.5.4 Markdown Cells

Markdown cells allow formatted text to be included alongside code. Headings are created with leading ‘#’ characters (one per heading level), and a line of three hyphens produces a horizontal rule:
# Heading 1
## Heading 2
### Heading 3
#### Heading 4
---
Bullet points use ‘*’ and numbered lists just use numbers (the actual numbers you type are not important):
* bullet 1
* bullet 2
* bullet 3
1. item 1
1. item 2
1. item 3
Bold and italic typefaces are produced by enclosing text within paired *’s. Text representing code is enclosed in backticks (`), the character normally found to the left of the ‘1’ key on a standard keyboard:
**bold**
*italic*
It is possible to include the typesetting system LaTeX, which is useful for embedding formulae and other mathematical notation. Various web sites allow formulae to be created using tools similar to Microsoft’s equation editor, from which the LaTeX can then be copied and pasted into a Jupyter notebook. See for example https://www.hostmath.com/.
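For example, a hypothetical markdown cell might mix inline and display mathematics as follows (the log-return and portfolio-variance formulae are illustrative choices, not taken from the module):

```latex
The log return is $r_t = \ln(P_t / P_{t-1})$, while display
equations sit between double dollar signs:

$$ \sigma_p^2 = w^\top \Sigma\, w $$
```

Running the cell renders the LaTeX as typeset mathematics in the notebook.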
While rudimentary tables can be created with simple markup using pipe (|) characters and hyphens for column separators and horizontal rules respectively, superior formatting is achieved when tables are created using HTML syntax (see https://www.w3schools.com/html/html_tables.asp):
<table>
<tr>
<th>First name</th>
<th>Last name</th>
<th>Student Number</th>
</tr>
<tr>
<td>Conor</td>
<td>McNally</td>
<td>12345678</td>
</tr>
<tr>
<td>Jingyi</td>
<td>Zhang</td>
<td>23456789</td>
</tr>
</table>
# Hyperlinks
<a href="http://qub.ac.uk/qms">Queen's Management School</a>
Jupyter also provides an insert image option under the Edit menu.
Markdown is often the preferred way of providing documentation (see for example the home page of any GitHub project). There are many stand-alone (i.e. not part of any coding environment) markdown editors, including:
• Typora - a desktop editor (small payment required)
• StackEdit - an in-browser editor
• Notepad++ - via free extension MarkdownViewer++
2.5.5 nbextensions
The Jupyter Notebook Extensions package is a collection of community-contributed tools that add functionality to Jupyter notebooks. As an unofficial package, it must be manually installed. To do this, it is recommended that you use a conda command from an Anaconda prompt:

conda install -c conda-forge jupyter_contrib_nbextensions
Once installed, various extensions can be enabled (see Figure 2.13). One extension is the Accessibility Toolbar. Another is the variable inspector, which allows the value of variables to be seen without having to print them in the console.
2.6.2 RStudio
R’s reticulate package allows Python scripts to be run within RStudio. While this can be convenient for heavy users of R who wish to run some Python code, there is currently limited support for Python development within RStudio. For example, Python variables created will not appear in the Environment pane (i.e. the variable explorer).
Note that before attempting to run Python code, you must ensure that the location of your preferred Python installation has been set, for example:
Sys.setenv(RETICULATE_PYTHON = "C:/Users/yourname/Anaconda3/")
3 Ethical Considerations

Hopefully you will enjoy learning how to use Python, manipulate data, and develop models to inform decision making. With your new-found skills also come some new responsibilities relating to the ethics of working with data.
The Association for Computing Machinery (ACM) sets out general ethical principles for
computing professionals.
• Contribute to society and to human well-being, acknowledging that all people are
stakeholders in computing
• Avoid harm
• Be honest and trustworthy
• Respect the work required to produce new ideas, inventions, creative works, and computing artifacts
• Respect privacy
• Honor confidentiality
https://www.acm.org/code-of-ethics
3.2 Privacy
Advances in computing make it easy to collect, store, and distribute data. For individuals, this can mean that data relating to them is stored, processed, and monitored without their knowledge or consent. Anyone collecting or processing such data should be sensitive to the concerns of individuals. Indeed, legislation recognises the ‘data subject’ by granting them rights and imposing responsibilities on those that collect, store, or process such data.
In the EU, the General Data Protection Regulation (GDPR) applies. Failure to comply with the GDPR can result in a fine of up to €20 million or 4% of annual worldwide turnover, whichever is greater. In the UK this is implemented as the Data Protection Act 2018.
It sets out strict rules called ‘data protection principles’ to ensure that data is:
trial a large number of such models, and then only present their successful results, or fail to ac-
knowledge that the model’s success may only be due to chance.
As an example, think about the huge number of time-series datasets collected in different geographical locations from different fields (e.g. finance, economics, health, agriculture). While the chance that any two randomly-selected series will exhibit correlation is low, in a large enough collection of datasets, finding ones that are highly correlated (where there is no underlying causal
relationship) becomes a statistical certainty. To see the sometimes amusing consequences visit
https://tylervigen.com/discover.
"""
This is a multiline string created using triple quotes
Known as a docstring, it is used to help document code
"""
Comments are defined to be read by other developers to help them understand (or remember)
why code has been written (its purpose) or why it has been written in a particular way. Code
comments should generally be brief and non-obvious. In this manual they are used to explain
new concepts; such comments would not normally be necessary when the expected reader is
another developer.
4.2 Keywords
Keywords are those reserved for a special purpose and essentially define the vocabulary of Python.
For this reason, you should not attempt to use them as variable names. These include keywords
relating to:
• Boolean: True, False, and, or, not
• Conditional logic: if, elif, else, assert
• Iteration: for, while, in, break, continue
• Definitions: def, class, return (note that print and range are built-in functions rather than keywords)
• Exception handling: raise, try, except, finally
You can ask to see the full list by running:
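One way, using the standard library’s keyword module:

```python
import keyword

#The full list of reserved words for the running interpreter
print(keyword.kwlist)
```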
Together with variables and operators, they can be used to write expressions and statements. An
expression is a piece of code that can be evaluated, resulting in some value. A statement is a
single syntactic unit of code defining an instruction to perform a specific action.
4.3 Variables
Variables are a means to store and refer to data values. The concept of a variable is similar to
using (Greek) letters to represent values in algebraic equations. The variable name is just a way to
refer to the value it represents, which of course, may change depending on the problem being
solved. In programming variables are associated with a particular location in the memory of the
computer on which the code is being run. Their names can be anything you choose subject to
some basic rules:
• can contain alphabetic or numeric characters or the underscore character
• they cannot contain spaces
• they cannot start with a number
• they should not start with an underscore (this is used for specific purposes in Python)
• they should not clash with keywords reserved as part of the Python language
These names are referred to as identifiers but the term also includes names given to functions
and classes. Unlike some other languages, variables do not need to be declared before they are
used.
4.3.1 assignment
Variables are created at the instant when data is first assigned to them. A statement which con-
tains a single equals sign (=) is an assignment operation. This should be interpreted as ‘evaluate
the expression on the right hand side and assign its value to the variable on the left hand side’.
After the operation is performed the two sides will be equal (but not necessarily before). Variables
can then be used in expressions, with their current value used to determine the result. Any attempt
to use an uninitialised variable in an expression on the right hand side of an assignment operator
will fail.
a = 2
#Output will be 4
print(a + 2)
Note that python is case sensitive. Variable names should be thoughtfully chosen to make them
meaningful and suggestive of their intended purpose. When implementing formulae, variable
names should attempt to mirror those used in standard notation. For example, an interest rate
would be better represented using the letter r than the letter x.
Just as two people may share the same name, it is entirely possible for developers to use the same
name to refer to different values. Such names have the potential to be ambiguous, but this can
be resolved through the concept of scope.
Variables will continue to persist during the python session unless they are explicitly destroyed.
#List variables
dir()
#Delete variables
del x,y,z
In programming, constants are variables with a value that cannot be changed. Unlike some other
languages python does not explicitly support constants. Literals are fixed values that appear in
statements (e.g. numbers and text). Literals should be used sparingly outside of scripts. Instead try
to keep code generic by avoiding the practice of hard coding (using problem-specific values in
what is otherwise a general solution to a class of problems).
a = 3
b = a
Clearly b will have the value 3, but what if we then assign b the value 4? Does this also change the
value of a? In this case, the answer is no, but the general behaviour depends on the data type
that the variable contains (and whether it is immutable). Officially ‘assignment statements in
Python do not copy objects, they create bindings between a target and an object.’
If you need to make a copy (that can be independently modified), you can use the copy library (see https://docs.python.org/3/library/copy.html). Note that there are two types: shallow copy
and deep copy. The difference is important when working with compound objects (objects that contain other objects, such as a list of lists).
• A shallow copy constructs a new compound object and then (to the extent possible) inserts
references into it to the objects found in the original.
• A deep copy constructs a new compound object and then, recursively, inserts copies into it
of the objects found in the original.
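The distinction can be seen with a short sketch using a nested list:

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)      #new outer list, shared inner lists
deep = copy.deepcopy(original)     #new outer list and new inner lists

original[0][0] = 99
print(shallow[0][0])   #99: the inner list is shared
print(deep[0][0])      #1: the inner list was copied recursively
```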
4.3.3 underscore
We have seen that the underscore character can be included in the name of an identifier. When
used on its own, the underscore is considered as a throwaway (or "I don’t care") variable. By
convention its use indicates to other developers that the variable itself is not important.
4.3.4 print
You will probably already have noticed how the print function is used to display the values of variables. This function accepts a comma separated list of values and literals that it will
concatenate together for output. Note that all values will be converted to strings automatically:
no explicit casting is required. The optional sep parameter (defaulted to a space character) is
placed between each of the string values being joined. The optional end parameter (defaulted to
a newline character) is placed at the end. Thus by default, for two consecutive print statements,
the second will begin on a new line.
The display function is similar but will produce more richly formatted output in frontends like Jupyter. For example, DataFrame objects are rendered with alternately coloured rows rather than in plain text.
From python version 3.10, it is possible to use the pipe (‘|’) symbol to union two or more types. This
allows you to test if a variable is one of the types. For example, to test if something was an integer
or a float:
4.5 Strings
In programming, text is referred to as a string. Python strings can be enclosed in single, double, or
even triple quotes. Single quotes can appear as part of a text string created with double quotes
and vice versa. Any appearance of quoted text within code is referred to as a string literal.
In Python, strings can be treated like an array of characters, and indexed and sliced like other array data types.
#sub-strings
print(text1[0:4],"\n")
#string concatenation
text4 = text1 + ' ' + text2 + ' ' + text3
print(">",text4,"\n")
The backslash character is used to modify the meaning of certain characters. For example ‘\n’ is used to denote a linefeed. The backslash character therefore needs to be ‘escaped’ and written as ‘\\’ if you actually want the backslash character to appear in text. Python provides a
way (the prefix ‘r’) to indicate you want the string to be treated exactly the way you type it. This is
particularly helpful when you need a string to represent a Windows path.
The same approach can also be used to escape quotation marks within a string that would oth-
erwise prematurely mark its end.
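A brief sketch of both cases (the path shown is illustrative):

```python
#With the r prefix, backslashes are kept exactly as typed
path = r'C:\temp\new_folder'
print(path)

#Escaping a quotation mark that would otherwise end the string
quote = 'It\'s escaped'
print(quote)
```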
Method Description
text.upper() converts to upper case
text.lower() converts to lower case
text.isdigit() returns True if all characters are digits (i.e. an integer)
text.isalpha() returns True if all characters are alphabetic
text.rstrip() removes all trailing whitespace
text.find(sub) returns the start index of a substring (-1 if not found)
text.format() substitutes values into a string using curly bracket placeholders
sep.join(strings) concatenates any number of strings using the separator sep
text.zfill(width) pads a numeric string with zeros to the given width
#But you can specify your own separator with the 'sep' optional parameter
print('answer',value,sep="==>")
Strings can also be concatenated manually using the ‘+’ operator, but non-string variables may
need to be converted.
#String concatenation
text = "hello" + " " + "world"
Results can also be placed in strings via placeholders using the format string method. The placeholders can either be empty, enumerated or named, and can also include formatting commands.
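A short sketch of each placeholder style (the values are illustrative):

```python
#Empty placeholders are filled in order
s1 = '{} plus {} equals {}'.format(1, 2, 3)
#Enumerated placeholders can reorder (or repeat) the arguments
s2 = '{1} before {0}'.format('profit', 'revenue')
#Named placeholders, here with a formatting command (2 decimal places)
s3 = 'rate = {r:.2f}'.format(r=0.04567)
print(s1, s2, s3, sep='\n')
```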
Note that this replaces older style string formatting that you may still come across. For full details
see:
• old style formatting: https://docs.python.org/2/library/stdtypes.html#string-formatting
• new style formatting: https://docs.python.org/3/library/string.html#string-formatting
• examples of both: https://www.w3schools.com/python/ref_string_format.asp
• example of new: https://pyformat.info/
4.5.3 f-string
Formatted string literals or f-strings were introduced in Python 3.6. Variables placed inside curly
braces within a string literal (prefixed with ‘f’) are automatically evaluated at run time.
name = 'Jenny'
print(f'hello {name}') hello Jenny
Variable names and their values can now be output without providing two inputs into a print
function. So the variable name itself need only be typed once but appears as both its name and
value in the output.
print(f"{a=}")
print(f"{b=}") a=3
b='how cool is this?'
Pattern Description
[] a set of (specified) characters
() a sub-expression
\w any alphanumeric character
\d any digit
. (dot) matches any character except a newline
^ (caret) anchors to the start of the string
$ matches to the end of a line
* matches zero or more of the preceding expression
+ matches one or more of the preceding expression
? matches zero or one of the preceding expression
{m} matches exactly m copies of the preceding expression
{m,n} matches between m and n copies of the preceding expression
| A|B is a match to either A or B
(?:...) non-capturing brackets (use for sub-patterns that you don’t want to capture)
#Extract a dollar amount with or without comma separator and decimal point
text = 'the share closed at $1,234.56 yesterday'
re.findall(r'[\$]\d{1,}(?:,\d{3})*(?:\.\d+)?', text)
It is also possible to create a ‘compiled’ regular expression which can then be used to perform
pattern matching operations, so the regular expression no longer needs to be passed as an input
into a function.
yearmatch = re.compile(r'(?:19)\d\d')
print(yearmatch.findall('1875, 1956, 1999, 2012, 2130'))
4.6 Dates
The datetime module provides support for working with dates (see https://docs.python.org/3/library/datetime.html for full documentation).
To make dates human readable for input and output they are treated as strings. However, because string dates come in a myriad of formats they are difficult to manipulate. It is therefore convenient to convert them into a standardised (numerical) datetime representation. In particular, beware of American dates, which follow a month-day-year system and often appear in financial datasets.
from datetime import datetime
#Create a date
p2 = datetime(2022, 12, 25, 0, 0)
print(p2)
Date formatting syntax is outlined in Table 4.3 using date p2 above as an example. For a full listing see strftime under https://docs.python.org/3/library/datetime.html.
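As a sketch, a string date can be parsed with strptime and reformatted with strftime (the date shown is illustrative):

```python
from datetime import datetime

#Parse a UK-style string date into a datetime object
d = datetime.strptime('25/12/2022', '%d/%m/%Y')
#Output it in the compact '%Y%m%d' encoding
print(d.strftime('%Y%m%d'))
```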
Note that when printed (without formatting), dates reveal that they are really datetime objects,
and as python objects, they have useful properties and methods.
#Decompose date
print("weekday =",p1.weekday()) weekday = 4
print("day =",p1.day) day = 25
print("month =",p1.month) month = 12
A useful format for encoding dates is the one defined by ‘%Y%m%d’, which produces dates such as ‘20221225’.
4.7 Operators
Operators can be applied to variables or literals. They include standard arithmetic operations, such as addition and multiplication, which are not listed here.
Expressions are evaluated using standard operator precedence (e.g. multiplication before addition). If in doubt about the sequence in which the operators are applied, use parentheses (round brackets). Note that in Python ** is used for raising to a power.
• Boolean operators
– not, and, or
– To test for equality use ==
– To test for inequality use !=
• Bitwise operators (applied to the ‘bits’ of the binary representation of a number):
– | (or)
– & (and)
– ^ (xor – exclusive or)
While some operations can be performed on variables of different types, explicit conversion may be required.
#Illustrative values
i = 3
pi = 3.14159
pie = ' pies'
print(i+pi)
#This will fail even though + can be used for string concatenation
print(i+pie)
#This is fine
print(str(i)+pie)
4.8 Containers
A container is a particular compound data type that stores a collection of items. Containers mimic structures that should already be familiar to you such as lists, sets and queues. Most programming
languages provide support for containers which manage storage requirements and provide func-
tions to access and manipulate elements.
4.8.1 Lists
Lists are ordered arrays that store mixed data types and allow duplicates. In Python they are
created using square brackets.
#Create list
list1 = []
list2 = list()
list3 = [2, 3, 5, 7, 11]
list4 = ["finance", "economics", "accounting"]
list5 = [list3, list4]
list6 = [5, 'e']
list7 = ['na'] * 9 + ['hey Jude']
#Find length
len(list7)
Accessing elements from the list is also performed using square brackets []. Note that Python is
zero based so the first element is at index position 0 (Figure 4.2). A list with n elements is therefore
indexed from 0 to n − 1. The number of elements in a list (or other array types) can be found using
the len function.
Python allows arrays to be accessed from the start or, by using negative indexing, from the end. For an array of size n, index values of −n to n − 1 are therefore valid.
#Create list
A = [2, 3, 5, 7, 11]
A[2]
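Negative indexing on the same list (a brief sketch):

```python
A = [2, 3, 5, 7, 11]
print(A[-1])   #11: the last element
print(A[-5])   #2: equivalent to A[0]
```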
A list is one example of an iterable, a structure that is capable of returning its members one at a
time. Iterables can be used in a for loop (see Section 4.9.2), when each element needs to be considered in turn. See https://docs.python.org/3/glossary.html#term-iterable.
4.8.2 Tuples
Tuples are ordered arrays but cannot be changed (immutable) once initialised. They are created
using round brackets, but their elements are accessed via square brackets.
#Create tuple
tuple1 = ()
tuple2 = tuple()
tuple3 = (2, 3, 5, 7, 11)
tuple4 = ("finance", "economics", "accounting")
tuple5 = (tuple3, tuple4)
tuple6 = (5, 'e')
4.8.3 Ranges
Ranges are immutable integer sequences, created using the range function. They are particularly
useful when using for loops. Each integer sequence is defined by its start value, its stop value, and
the increment between consecutive values in the sequence. The start value defaults to zero, and
the increment defaults to 1. Note that in keeping with Python’s zero-based approach, the range sequence continues up to, but does not include, the stop value.
#Create
range(stop)
range(start, stop [,step])
#Sequence of integers 0 to 9
r = range(10)
#Sequence of integers 1 to 10
r = range(1,11)
#Convert to list
a = list(r)
#Sequence of integers 10 to 1
r = range(10,0,-1)
a:b:c
represents the elements indexed from a up to (but not including) b advancing in steps of c. This
works inside square bracket notation and is a short cut for using the slice function. The following
logic also applies:
• If c is omitted its value defaults to 1
• If a is omitted its value defaults to 0
• If b is omitted its value defaults to the length of the array being accessed.
• Negative values of a or b count backwards from the end of the array.
#Every second element from index 2 up to (but not including) index 10
A[2:10:2]
When the step value is negative the default logic reverses, so that you start from the end and work backwards.
#Numbers 0 to 9
A = list(range(10))
#Numbers 9 to 0
A[::-1]
#Numbers 5 to 0
A[5::-1]
#Numbers 9 to 6
A[:5:-1]
myslice = slice(1,3)
print(A[myslice])
Modification of mutable arrays involves updating, inserting, appending and removing elements.
#Join
A.extend([12,13])
#Remove elements
del A[2:4]
4.8.5 Sets
Sets are unordered and unindexed collections that prohibit duplicates (just like their mathemat-
ical equivalents). They are created using curly brackets. An immutable version of a set is a
frozenset.
#create set
myset = {"alpha", "beta", "gamma"}
#check membership
"alpha" in myset
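The usual mathematical set operations are available via operators (a brief sketch):

```python
a = {1, 2, 3}
b = {3, 4, 5}
print(a | b)   #union
print(a & b)   #intersection
print(a - b)   #difference
```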
4.8.6 Dictionaries
Dictionaries are used to store key/value pairs, in a way that allows the value to be accessed via
its key. Dictionaries are mutable and indexed. From Python 3.7 they are guaranteed to preserve the order in which keys are inserted into the dictionary. They are created using curly brackets and
their elements are not single values, but rather key/value pairs. The keys must be unique but this
allows associated values to be ‘looked up’. The values can be any other variable type such as
int, string, list or even dict.
#Create dict
mydict = {"red": 1, "amber": 2, "green": 3}
#Update
mydict["stop"] = 0
The get method of a dictionary can also be used to retrieve values in a way that specifies a default value should the key not be found in the dictionary.
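A sketch of get with a default value (the key "blue" is deliberately absent):

```python
mydict = {"red": 1, "amber": 2, "green": 3}
#The key is absent, so the default (-1) is returned
missing = mydict.get("blue", -1)
#The key is present, so its value is returned
present = mydict.get("red", -1)
print(missing, present)
```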
#Create list
A = list(range(1,11))
#Find squares of A
B = [x*x for x in A]
#Discount illustrative cash flows CF at yield y (enumerate provides the index i)
CF = [5, 5, 105]
y = 0.04
D = [c/(1+y)**(i+1) for i, c in enumerate(CF)]
Similar results can also be achieved using the map function, which applies a function to every item in an iterable (container). Note that this returns an iterator object which you may wish to convert to a list. See https://docs.python.org/3/library/functions.html#map for more details. The example below uses a special type of function called a lambda function (see section 4.11).
A = [1,2,3]
f = lambda x: x**2
B = map(f,A)
print(list(B))
>> [1, 4, 9]
Python’s zip function can be used to combine iterables (e.g. lists, dictionaries, etc) by generat-
ing a series of tuples, where each tuple contains one element from each iterable being zipped
together.
A = [1, 2, 3]
B = ['x', 'y', 'z']
C = zip(A,B)
print(list(C))
>> [(1, 'x'), (2, 'y'), (3, 'z')]
4.9.1 Selection
Selection in programming languages is coded using if statements. Here code is executed (or not) based on the state of particular variables, as determined by a Boolean test that evaluates as either True or False. In its simplest form, the syntax for an if statement is as follows:
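A minimal sketch (the variable and test are illustrative):

```python
x = 20
if x > 18:
    message = "test passed"
#Execution resumes here whatever the outcome
print(message)
```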
All lines indented after the if statement will only be executed if the test evaluates as True. If the
test evaluates as false, the indented statements will not be executed and code execution will
resume at the next unindented line. An alternative formulation adds an else clause:
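For instance (illustrative values):

```python
x = 10
if x % 2 == 0:
    parity = "even"
else:
    parity = "odd"
print(parity)
```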
It is possible to extend the logic to have multiple branching possibilities by using additional elif
clauses:
if testA:
    statementsA
elif testB:
    statementsB
elif testC:
    statementsC
else:
    statementsZ
Such logic should be used for mutually exclusive tests since at most one clause will be executed.
Efficient code should avoid unnecessary testing, so prefer elif clauses to sequential if statements
(for example, where passing the first test guarantees that future tests would fail).
Note that every else clause must have some executable code. If no statements are necessary but you wish to retain the clause for clarity, use the pass statement (nothing happens when this is executed).
Nesting
Nesting of if statements (i.e. sub-branching) can be achieved with careful indentation.
if testA:
    if testB:
        statementsB
    else:
        statementsC
else:
    statementsZ
Selection tests
Selection tests can take several forms.
#Test of equality
if (a==3):
#Test of inequality
if (x>18):
if (x!=0):
#Tests of membership
if x in range(100):
if subject in ('Finance', 'Economics'):
Inline if statements
For simple selection cases, consider ‘in-lining’ code.
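For example, Python’s conditional expression places the whole selection on a single line (values are illustrative):

```python
balance = -50
#value_if_true if test else value_if_false
status = "overdrawn" if balance < 0 else "in credit"
print(status)
```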
Match
In many languages, ‘switch/case’ statements provide an alternative to if statements. Somewhat
controversially, this functionality was made available in python from version 3.10.
test = 1
match test:
    case 1 | 2 | 3:
        print("plan A")
    case 4 | 5:
        print("plan B")
This approach can be much faster than the equivalent if statement approach (e.g. for sequence matching). Most other languages also support an ‘else’ case clause to be executed when none of the previous cases are matched. In Python this is performed with the underscore ‘_’ which is considered to be a ‘soft keyword’ denoting a wildcard.
test = 6
match test:
    case 1 | 2 | 3:
        print("plan A")
    case 4 | 5:
        print("plan B")
    case _:
        print("the no plan plan")
4.9.2 Iteration
Iteration is used whenever you want to repeat a set of operations multiple times in a loop.
For loops
When the number of times the operation should be repeated is known in advance, a for loop is
normally preferred. Just as with an if construct, all indented lines are considered to be part of the
loop. For loops rely on the use of a variable to keep track of which iteration is currently underway.
By convention, the letters i, j, k are typically used. In python, loops that can be indexed by
predictable integer sequences can use the range function.
#Print numbers 0 to 9
for i in range(10):
    print(i)
The loop variable can either be used as an index (keeping track of which loop code execution is
on) or can assume the values from a container.
#First 5 primes
A = [2, 3, 5, 7, 11]
for p in A:
    print(p)
A combined approach uses the enumerate function to populate two variables: one counter to
index the loops and another variable that assumes the values of interest for each iteration. This
removes the need to access array elements by index.
#An illustrative list of bond cash flows
bond = [5, 5, 105]
for i, cf in enumerate(bond):
    print(i, cf)
    #instead of
    print(i, bond[i])
Note that enumerate has a second optional input parameter start which defaults to 0. The reversed function returns the elements in the reverse order (i.e. last to first).
The itertools module provides further functionality for creating iterators for efficient looping. See
https://docs.python.org/3/library/itertools.html for more details. From Python 3.10 it provides a pairwise function. This allows you to loop over the n − 1 adjacent pairs of values from an array of length n.
Two keywords provide extra control options.
• break terminates iteration entirely; execution resumes after the loop
• continue terminates that iteration; execution resumes on the next loop
• If using break you can add an else clause
• This will only be executed if the break never is
#Print numbers 0 to 4
for i in range(10):
    if i==5:
        break
    print(i)
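A sketch of the else clause, which only executes when no break occurs (the search target is illustrative):

```python
A = [2, 3, 5, 7, 11]
target = 4
found = False
for x in A:
    if x == target:
        found = True
        break
else:
    #only executed because the break never was
    print(target, "not found")
print(found)
```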
While loops
With a while loop, code will continue to execute as long as a condition remains true. It is essential that statements within the loop eventually cause the test to evaluate as false. Otherwise, the loop will continue to run forever: an infinite loop. Equally, if the condition evaluates to false initially, the code within the loop will not be executed even once.
Many while loops use a counter to keep track of the number of iterations. This must be initialised
to a starting value before the loop, and updated at some point during the loop, normally by
incrementing (adding one) at the end of the loop.
#Print numbers 0 to 9
i = 0
while i < 10:
    print(i)
    i = i + 1
4.10 Libraries
4.10.1 Standard Library
Python’s standard library contains built-in modules that provide solutions for many problems that
occur in everyday programming.
4.10.2 Packages
Outside of the standard library a vast and growing collection of packages are available from the
Python Package Index (PyPI). To search this repository visit https://pypi.org/.
Packages consist of related code modules organised into a particular structure. This can be used
to create a hierarchy of namespaces which help to address problems with scope. A package
is identified by an __init__.py file, which indicates that the containing folder is a Python package
directory. The folder hierarchy might look like this:
package/
    __init__.py
    subpackage1/
        __init__.py
        module1.py
        module2.py
    subpackage2/
        __init__.py
        module1.py
        module2.py
4.10.3 Import
Before calling functions from a package, or otherwise utilising its functionality, it must first be im-
ported.
import math
The structure of a package determines how to import the components that you wish to use.
#import a package
import package
#import a module
import package.subpackage1.module2
How you import determines how to reference the functions that you imported. When importing it is possible to assign an alias. After defining the alias, this new name can be used to refer to whatever was imported in place of its originally defined name.
#using import
package.subpackage1.module2.functionname
#with an alias
import package as myalias
myalias.functionname()
Python will only import a given module once in a given python session (no matter how many times
you run the import command). Importing a module does not mean that all variables created in
the module will be imported into the current one. Importing creates a module object which can
then be used to access the imported module’s variables. Variables defined in a module therefore
have module (not global) scope.
import math
print(math.pi)
import numpy as np
#Confirm name
print(np.__name__)
#Confirm version
print(np.__version__)
4.11 Functions
Python comes with built-in functions and a range of packages. These can be explored using the
dir function.
Most functions accept inputs and produce outputs. When defining a function, the inputs are called parameters. When calling a function, the inputs supplied are called arguments. Functions have identifiers (i.e. names) which follow the same rules as variables. To determine if something is a function, you can use callable.
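A brief sketch:

```python
print(callable(len))      #True: a built-in function
print(callable("text"))   #False: a string is not callable
```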
#create a function
def myfunctionname(arg1, arg2, arg3, arg4):
    statements
#call a function
myfunctionname(arg1, arg2, arg3, arg4)
myfunctionname(x, y, 3, 'test')
When a function is defined in this way, each argument must be supplied (and in the expected
order).
Functions will normally return a result but don’t have to. This is achieved using the keyword return. When this statement is reached, the function ceases, and execution continues from the point where the function was called. If you choose not to return a value, the function will terminate when it reaches its last statement and will return None.
#create a function
def square(x):
    """this text will appear in calls to help(square)"""
    return x**2
#call a function
square(2)
Notice the first line of the function appears as a string inside three double quotes (this allows it to
be split over multiple lines). This line is a bit like a comment (it doesn’t do anything) and is known
as a docstring because it provides documentation for the function that is accessible through the
help function. It must appear as the first line of the function. The use of three double quotes is just
a convention. Modules and classes can also have docstrings.
Note that the names given to functions are also identifiers so they can be treated like variables
and even passed into other functions.
#Call function
bondPrice(0.05, 0.04, 3, 1000)
bondPrice(0.05, 0.04, 3)
Non-defaulted arguments are considered to be positional arguments, while those that are defaulted are called keyword arguments. Positional arguments must be passed in using the expected order. Keyword arguments can be passed in different orders, but must be preceded by the keyword. See https://docs.python.org/3/tutorial/controlflow.html#keyword-arguments.
Similarly, when a parameter is preceded by **, it receives a dictionary (i.e. a collection of key/value pairs). This allows an arbitrary number of named arguments to be passed in without the function having to formally define them or require them to be passed in using a specific order.
def describe(**kwargs):
    return "; ".join([f'{k}={str(v)}' for k,v in kwargs.items()])
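Calling it with arbitrary named arguments (the function is repeated here so the snippet stands alone; the argument names are illustrative):

```python
def describe(**kwargs):
    return "; ".join([f'{k}={str(v)}' for k, v in kwargs.items()])

#Any named arguments are collected into the kwargs dictionary
result = describe(ticker='ABC', price=1.25)
print(result)
```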
def test():
    return 'the answer to life the universe and everything is', 42
#x will be a tuple
x = test()
They can be used as inputs into higher-order functions and are used alongside built-in functions such as filter and reduce.
A = [1, 2, 3, 4, 5, 6]
B = list(filter(lambda x: (x%2 == 0) , A))
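The reduce function (found in the functools module) repeatedly applies a two-argument function to accumulate a single result, for example:

```python
from functools import reduce

A = [1, 2, 3, 4, 5, 6]
#Successively add pairs: ((((1+2)+3)+4)+5)+6
total = reduce(lambda x, y: x + y, A)
print(total)
```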
def test1(A):
    for i in range(len(A)):
        A[i] = A[i]*2
#B will be modified
B = [1, 2, 3]
test1(B)
print(B)
In the second example, an assignment operation means that A now points at a different list.
def test2(A):
    A = [0]
test2(B)
print(B)
For more details see the FAQ on ‘How do I write a function with output parameters (call by reference)?’ from https://docs.python.org/3/faq/programming.html.
4.11.7 Scope
Variables declared within a function have local scope. Scope refers to the domain within which a particular name has meaning. They are created as temporary variables and their value is unknown outside of the function. This means it is possible to use the same variable names in different contexts. Variables declared outside functions are global to the module. By convention, prefixing a variable name with an underscore marks it as private to the module.
#The enclosing function signature is implied by the calls below
def setcase(text, uppercase=True):
    def upper(text):
        return text.upper()
    def lower(text):
        return text.lower()
    if uppercase:
        return upper(text)
    else:
        return lower(text)

print(setcase('Test'))
print(setcase('Test', False))
They can also be called indirectly after being returned as outputs from the parent function. In
such cases, they even have access to variables that were passed into the parent function when
it was called.
def changecase(uppercase=True):
    def upper(text):
        return text.upper()
    def lower(text):
        return text.lower()
    if uppercase:
        return upper
    else:
        return lower
f = changecase()
print(f('Test'))
try:
    x = 1/0   #an illustrative statement that may raise an exception
except:
    print('something went wrong')
Different types of exception can be handled in different ways. A full list of exception types can be found at https://docs.python.org/3/library/exceptions.html.
It is also possible to create your own custom errors using raise.
try:
    x = 1/0   #an illustrative statement that may raise an exception
except ZeroDivisionError:
    print('I know what went wrong')
except:
    print('something else went wrong')

#Raising a custom error
raise RuntimeError('my custom exception')
Extra else and finally clauses can be added to deal with the case when no exceptions occur
and to enact final cleanup operations, respectively.
try:
    x = 1/0   #an illustrative statement that may raise an exception
except ZeroDivisionError:
    print('I know what went wrong')
except:
    print('something else went wrong')
    raise RuntimeError('my custom exception')
else:
    print('No exception, return, continue or break encountered')
finally:
    print('this code will get executed on the way out, error or not')
The with statement is used to wrap the execution of a block with methods defined by a con-
text manager. It is designed to simplify code that would otherwise be included in a try block to
ensure certain operations are performed successfully. The classic example is when performing in-
put/output operations with files which must be opened and subsequently closed. At the end of
the with block, the file will be automatically closed even if an exception is raised within the with
block.
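As a minimal sketch (the filename is illustrative), opening a file inside a with block guarantees it is closed when the block exits:

```python
# Sketch: the file is closed automatically at the end of the with block,
# even if an exception is raised inside it (filename is illustrative)
with open('example.txt', 'w') as f:
    f.write('line 1\n')

print(f.closed)   # True: the context manager closed the file
```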
From Python 3.10, multiple open operations can be grouped in parentheses within a single with statement.
4.13 Enumerations
An enumeration is a set of symbolic names that are linked to unique constant values. They are
iterable and immutable and provide a way to eliminate hard coding. By convention, upper case
characters are used to denote that they are constant.
from enum import Enum

#Define the enumeration (member values are illustrative)
class trafficlight(Enum):
    RED = 1
    AMBER = 2
    GREEN = 3

#Sample usage
x = trafficlight.RED
print(x == trafficlight.AMBER)
print(x.value)
A = [1, 2, 3]
if (n := len(A)) > 1:
    print(n)
The syntax (:=) is affectionately known as the ‘walrus operator’ (think eyes and tusks).
4.14.2 Decorators
Decorators provide a way to modify the behaviour of a function. The decorator function both
accepts a function as input and returns a function.
def tracker(func):
    def wrapper():
        print("Calling function", func.__name__)
        func()
        print("Exited function ", func.__name__)
    return wrapper

@tracker
def hello():
    print("Hello")

hello()
The effect of @tracker here is to replace hello with tracker(hello). The decorator function essen-
tially wraps the decorated function, allowing additional operations to be performed before and
after it is called. For functions taking inputs, the decorator can be extended as follows.
def friendly(func):
    def wrapper(*args, **kwargs):
        func(*args, **kwargs)
        print("Nice to meet you")
    return wrapper

@friendly
def greeting(name):
    print("Hello", name)

greeting('Kelly')
And for functions returning values, the wrapper should return the value or a modified version of
it.
def logoutput(func):
    def wrapper(*args, **kwargs):
        x = func(*args, **kwargs)
        print("function", func.__name__, "returned", x)
        return x
    return wrapper

@logoutput
def simpleinterest(P, r, T):
    return P*(1 + r*T)

FV = simpleinterest(100, 0.06, 1)
5 Development
5.2 Testing
Testing is just as important as implementation. Poorly implemented code not only leads to in-
correct results, but requires developers to expend time revisiting tasks that they believed to be
complete.
Test driven development is an approach that tightly links the process of design, implementation
and testing. Tests are developed simultaneously to ensure that the implementation satisfies the
requirements of the design. In this way tests accumulate over time, and can be run at any point of
the development cycle to determine the degree to which the implementation satisfies expected
behaviour.
Regardless of the approach used, developers should expend some time thinking about how they
will test their code. The development of rigorous test plans is necessary to validate their implemen-
tation. Tests can take different forms, as outlined in Table 5.1.
Table 5.1: Types of testing

Unit testing: testing is performed at the lowest level on single logical ‘units’.
Integration testing: testing is performed as modules are combined together.
System testing: testing is performed at the level of a complete application or system.
Black box testing: testing software without knowledge of how it works internally.
White box testing: testing using knowledge of the internal workings; tests are designed to exercise all possible code paths.
Regression testing: results match those of earlier versions, with only expected deviations.
User acceptance testing: testing that the software meets user expectations.
Stress testing: testing how the software performs under extreme conditions, heavy loads, high volumes, etc.
Unit tests can be incorporated into Spyder using a plugin. More details are available from https://
www.spyder-ide.org/blog/introducing-unittest-plugin/.
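As a sketch of what a unit test looks like, the following uses Python's built-in unittest module on a simple interest function (the function and expected values are illustrative, echoing the simpleinterest example used elsewhere in these notes):

```python
import unittest

# An illustrative function under test
def simpleinterest(P, r, T):
    return P*(1 + r*T)

class TestSimpleInterest(unittest.TestCase):
    def test_one_year(self):
        self.assertAlmostEqual(simpleinterest(100, 0.06, 1), 106.0)

    def test_zero_rate(self):
        self.assertAlmostEqual(simpleinterest(100, 0.0, 5), 100.0)

# argv/exit arguments allow the tests to run outside the command line (e.g. in Spyder)
unittest.main(argv=['ignored'], exit=False)
```

Each test method makes assertions about expected behaviour; a failing assertion is reported without stopping the remaining tests.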
Testing tips:
5.3 Coding Errors
Logical: the code does not do what it was meant to. Prevention: well thought out algorithms;
use your cortex compiler!; good test plans.
Runtime: errors occur when code is unable to perform the required operation (e.g. division by
zero). Prevention: code for the unexpected; avoid assumptions; error handling; edge case
testing.
Common Python error messages and their likely causes are outlined in Table 5.3.
5.4 Debugging
Errors are somewhat inevitable. After all, to err is human. A good developer will:
• adopt practices that reduce the incidence of errors
• familiarise themselves with the types of errors that coders typically make
• design and implement test plans that expose errors
• develop their ability to systematically detect the source of errors
Remember that the observed error may only be a symptom or side effect of the issue presenting
itself. Developers must pinpoint the underlying root cause of the actual problem. Such detective
work is part art, part science: sometimes investigating an intuitive hunch will quickly pay off but
trial and error is rarely an efficient approach. A key element is determining the point in a program
where it first deviates from its expected behaviour or state.
AttributeError: ‘xyz’ object has no attribute ‘abc’: the method or property (as typed) does not
exist. Check for an exact match. After class changes, be sure to clear instances of old versions
of the class.
5.4.1 Logging
The state of a program at any point of execution can be determined by inspecting its variables.
Stepping through code one line at a time can be tedious, so it is often more convenient to output
variable state during execution. This can be achieved with print statements that are added
temporarily and later deleted or commented out.
A more formal approach is to build logging into your program that can be used more generally
for maintenance and support. This logs messages indicating progress and warnings either to a file
or to the console. By incorporating a ‘debug mode’, such logging can be configured to include
extra output that would only be of interest to a developer. Python’s logging module can be used
to achieve this.
import logging

#By default only warnings and above are shown; lower the threshold to see all messages
logging.basicConfig(level=logging.DEBUG)

logging.info('log a message')
logging.debug('log a message if in debug mode')
logging.warning('log a message as a warning')
5.4.2 Assertions
Assert statements are simply Boolean tests that terminate code execution if they evaluate to False.
As such they can be used to include tests that confirm a program is in the expected state before
continuing. This allows a developer to embed regular checkpoints in their code to alert them if a
checkpoint is not reached successfully.
A = []
assert len(A) != 0, "List is empty."
print(max(A)) #This won't get the chance to run
5.5 Good Developer Practices
Developing coding principles and following good developer practices will not only make you a
better programmer, but means that you (and others) will spend less time debugging, correcting,
maintaining, and rewriting your code, not to mention figuring out what it was supposed to do.
Some basic guidelines to follow are:
1. Choose meaningful variable, function and class names, that are suggestive of their intended
usage. Mirror standard mathematical notation where possible.
2. Use white space to separate distinct steps in your algorithms.
3. Use comments to document your code.
4. Write generic code that can be reused rather than rewritten. If you find yourself copying and
pasting code (even with minor edits) you should probably rethink.
5. Avoid hard coding. Think carefully whenever you key in a literal value. If you find yourself
typing it a second time, consider converting it to a variable.
6. Functions should not reveal their inner workings. Users should see them as ‘black boxes’.
7. Functions should not rely on global variables. Any data they process should be passed in
as input parameters.
8. Design tests to validate your code. Where possible tests should be simple and have intuitive
answers. Rerun these tests after code changes.
9. Imports and anything configurable (e.g. file paths) should appear (once) at the top of the
code module.
10. Avoid variable proliferation. If you find yourself repeatedly creating variables that store similar
things, consider using a list, dictionary or other data structure. This will make your code easier
to maintain and simplify the processing of the data.
11. Avoid single use variables. Avoid them entirely, or reuse the variable name when it is safe
to do so.
For historic reasons the maximum length of a line of code is considered to be 80 characters. Going beyond
this limit can cause problems when printing code, or will force you to scroll from left to right when
viewing code. When statements exceed this limit, consider using line continuations. A backslash
‘\’ character at the end of a line allows the statement to continue onto the next line.
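A minimal sketch of a backslash continuation (the values are illustrative):

```python
# Sketch: a long statement split across lines with a backslash continuation
total = 100 * (1 + 0.06) \
        + 50
print(total)
```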
For more general advice, refer to PEP 8 - a style guide for python code: https://www.python.org/
dev/peps/pep-0008/. Alternatively, consider a code formatting tool such as ‘Black’.
6 Object Oriented Programming
6.1 Principles
Object Oriented Programming (OOP) is a programming framework based around ‘objects’ which
couple data with related operations. Classes are used to define generic object types. They can
be thought of as the design or blueprint from which actual instances (called objects) will be cre-
ated.
Objects are defined by their properties and methods. Properties relate to attributes or data associ-
ated with each object instance. Methods are essentially functions, but define operations that the
class is capable of performing. OOP is supported (to some degree) by many languages including
C++, Java, C#, VBA and MATLAB.
Key to OOP are the concepts of:
• Encapsulation: hiding the internal workings of an object (use of public, private, protected
designations)
• Inheritance: allowing one class to inherit the properties and methods of another
• Polymorphism: allowing different objects to be treated similarly even though they will behave
differently
Polymorphism is seen through inherited classes sharing a common interface. Class benefits in-
clude:
• Logical organisation of code
• Code re-use
• Run time object typing
• Supportable
• Extensible
• Data persistence between calls
6.2 Classes
For an introduction to classes in Python see Hilpisch (2019). A full description can be found at
https://docs.python.org/3/tutorial/classes.html. The basic structure of a class takes the follow-
ing form:
class classname:
    #constructor
    def __init__(self, name):
        self.name = name

    #destructor
    def __del__(self):
        print("I'm being destroyed")

    #methods
    def mymethod(self, inputvalue):
        print("someone called my method")
First note that variables defined within the class are shared by all instances of the class. Class prop-
erties that are specific to each instance of the class are defined by the __init__ constructor.
Constructors
A constructor is a special method that is invoked when an instance of a class is created. In Python
constructors are defined by __init__ methods (note the double underscore before and after the
init). Like other methods, constructors can take zero or more inputs, but the first input is always a
reference to the specific instance of the object. Convention dictates that this should be named
self.
A constructor is a convenient place to include code that initialises the object before it is first
used.
Destructor
A destructor is a special method that is invoked when an instance of a class is being destroyed. In
Python destructors are defined by __del__ methods. A destructor provides an opportunity to take
any final actions that might be required before an object is destroyed.
__str__
The __str__ method determines the output when an instance of the class is passed into a print
statement. Thus the class can provide the most helpful description of itself and the specific data it
might hold. It is intended to be readable and helpful.

def __str__(self):
    return "My name is " + self.name
__repr__
The __repr__ method is used to produce an unambiguous string representation of a class and
tends to be used for debugging.
def __repr__(self):
    return str(self.uniqueid)
6.2.2 Methods
Methods are just functions defined within a class. A distinguishing feature is that the first input pa-
rameter is always a reference to the specific instance of the object. Convention dictates that this
should be named self. By default, methods relate to a specific instance of a class.
class myclass:
    def method(self):
        print("regular method")

    @classmethod
    def classmethod(cls):
        print("class method")

    @staticmethod
    def staticmethod():
        return 'static method'
Class methods have the @classmethod decorator, and accept a cls parameter that points to the
generic class, rather than a specific instance. Such methods cannot modify the state of specific
instances, but they can modify the state of attributes shared across all instances.
Static methods have the @staticmethod decorator and accept neither self nor cls parameters.
They therefore cannot be used to modify any instance of the class or any class attributes.
6.2.3 Instantiation
Instantiation is the name given to creating an instance of a class. These instances are referred to
as objects. Objects are created by treating the class name like a function. If a constructor has
been defined, and includes non-defaulted parameters, then these must be supplied.
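As a sketch, instantiating a simplified version of the classname example above (destructor omitted for brevity; the argument value is illustrative):

```python
# A class whose constructor takes one non-defaulted parameter
class classname:
    def __init__(self, name):
        self.name = name

# Instantiation: the class name is called like a function,
# and the constructor argument must be supplied
obj = classname('FX model')
print(obj.name)
```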
6.2.4 Encapsulation
Methods and properties can be made public (i.e. available to all), or private (i.e. for internal
use only). Adding a double underscore prefix to the name makes a property or method private.
Protected attributes are those which can only be accessed within the class or by classes which
inherit from it. Adding a single underscore prefix to the name makes a property protected.
class somewhatcoy:
    def __init__(self, commonknowledge, familysecret, secret):
        #public
        self.commonknowledge = commonknowledge
        #protected
        self._familysecret = familysecret
        #private
        self.__secret = secret
6.2.5 Inheritance
Inheritance allows one class (the child) to inherit the properties and methods of another (the
parent), without having to write additional code. The child implementation can be modified
to include new properties and methods, and can even override parent implementations. The
concept is illustrated in Figure 6.1.
With the exception of anything designated private, a child has access to everything that belongs
to the parent.
class parent:
    def __init__(self):
        self.name = "parent"

    def saymyname(self):
        print(self.name)

    def oldtrick(self):
        print("anything you can do")

class child(parent):
    def __init__(self):
        self.name = "child"

    def oldtrick(self):
        print("i can also do")

    def newtrick(self):
        print("but look what else I can do")
Inheritance occurs when the child class is defined by including the name of the class from which
it should inherit in brackets.
A = parent()
B = child()
Note that a child’s constructor can invoke the parent’s constructor as follows:
class child(parent):
    def __init__(self):
        super().__init__()
Part II
Data Processing
7 Libraries
7.1 NumPy
NumPy is a popular package for scientific computing with Python. In particular it provides support
for working with multi-dimensional arrays, matrix algebra and random numbers. By convention it
is imported and aliased as np.
import numpy as np
See https://numpy.org/ for documentation or Harris et al. (2020) for a discussion of the NumPy
ecosystem.
7.1.1 Arrays
NumPy arrays can hold different data types, and can be created in different ways.

#Create an array from a list (values are illustrative)
A = np.array([1, 2, 3, 4])

These commands create an object of type numpy.ndarray, which has a number of useful meth-
ods.

A.sum()
A.max()
A.mean()
A further major benefit of NumPy arrays over Python lists is the ease with which operations can be
performed pointwise on each element.
A = np.arange(10)

#Double elements of A
A * 2

#Square elements of A
A ** 2
7.1.2 Matrices
While NumPy has a specific numpy.matrix class that was originally intended for linear algebra, this
has been deprecated. It is now recommended to simply use (two-dimensional) arrays. These can
be created as follows:
#Create a two-dimensional array (values are illustrative)
A = np.array([[1, 2], [3, 4]])

#Check dimensions
print(A.shape)
7.1.3 Broadcasting
Broadcasting describes how NumPy treats arrays (with different shapes) during arithmetic opera-
tions. In mathematics, when working with one dimensional arrays (vectors) and two dimensional
arrays (matrices), there are strict rules about when operations (such as addition and multiplication)
are valid. When NumPy encounters an operation between two arrays of different sizes, the smaller
one is ‘broadcast across the larger one’ so that they have compatible shapes. This means that in
most cases where a reasonable interpretation exists for the intended operation, it will produce a
result. For more details see https://numpy.org/doc/stable/user/basics.broadcasting.html.
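A short sketch of broadcasting (array contents are illustrative): a one dimensional vector is stretched across each row of a two dimensional array.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # shape (2, 3)
b = np.array([10, 20, 30])       # shape (3,)

# b is broadcast across each row of A
print(A + b)
```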
7.1.4 Reshaping
Reshaping operations take the values from one array and repack them into the values in another
equally sized but differently shaped array. Note that reshaping returns a new array. It does not
reshape the original array.
One standard reshaping operation is the transpose operation. Note that the transpose operator
has no effect on one dimensional arrays.
#Transpose B
D = B.T
Resizing operations, by contrast, are applied to the original object. If the overall size of a numeric
array is increased, it will be padded with zeros. If the overall size is decreased, the elements will be
truncated.
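A sketch of in-place resizing (values are illustrative): enlarging pads with zeros, shrinking truncates.

```python
import numpy as np

A = np.array([1, 2, 3])
A.resize(5)      # in place; extra entries padded with zeros
print(A)         # [1 2 3 0 0]
A.resize(2)      # in place; elements truncated
print(A)         # [1 2]
```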
x = np.array([[1],[2]])
A = np.arange(1,5).reshape(2,2)
#Matrix multiplication
np.matmul(A,x)
#Equivalently
A@x
Matrix operations are generally only defined if the vectors or matrices to which they are applied
satisfy particular dimensional requirements. However, NumPy will apply broadcasting rules when
operators are applied to variables of different shape, making some operations permissible.
x = np.array([[2],[5]])
A = np.arange(1,5).reshape(2,2)

#Pointwise multiplication: x is broadcast across the columns of A
A * x
See https://numpy.org/doc/stable/user/basics.broadcasting.html for further details. NumPy is
also quite forgiving with one dimensional vectors, treating them as column vectors when it appears
reasonable to do so.
Further examples of standard linear algebra operations are shown below. A full listing is available
from https://numpy.org/doc/stable/reference/routines.linalg.html.
7.2 SciPy
SciPy provides additional mathematical algorithms for scientific computing. Sub-packages (which
must be imported individually) include:
• interpolate - interpolation functions
• linalg - linear algebra
• optimize - root-finding and optimization
• sparse - sparse matrix manipulation
See https://scipy.org/scipylib/ for more details. A particular sub-package of interest is stats
which provides random numbers and statistical distribution functions. In particular it includes
classes for the continuous distributions:
• uniform
• norm
• lognorm
which each have the following methods:
• cdf - cumulative distribution function
• pdf - probability density function
• ppf - inverse cdf
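A sketch using the standard normal distribution class (norm) and two of these methods:

```python
from scipy import stats

# cdf: probability that a standard normal variable is below 0
print(stats.norm.cdf(0))        # 0.5

# ppf: inverse cdf, here the 97.5th percentile
print(stats.norm.ppf(0.975))    # approximately 1.96
```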
7.3 Pandas
7.3.1 Getting started
Pandas is a library designed specifically for data analysis. A key feature is its DataFrame object,
which can be thought of as a database table consisting of record (rows) and fields (columns).
Pandas provides tools which make it easy to import, export, merge, join, filter, aggregate and
manipulate these data sets. Full documentation is available from https://pandas.pydata.org/. By
convention Pandas is imported and aliased as pd.
import pandas as pd
A DataFrame can be created in different ways. One way uses Python’s dict to pass in a series of
key/value pairs, where each key defines a field, and each value contains an equally-sized array
with which to populate the field for each record.
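A sketch of this construction (the currencies and rates shown are illustrative):

```python
import pandas as pd

# Each dict key becomes a column; each value is an equally-sized array of records
df = pd.DataFrame({
    'currency': ['GBP', 'EUR', 'JPY'],
    'rate': [0.8244, 0.9172, 106.98],
})
print(df)
```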
Crucially, each field can be a different data type (within Pandas these are dtype). The data type
of each column can be inspected as follows:

df.dtypes
There are various commands that can be run to inspect the structure and content of a DataFrame.
#Statistical summary
df.describe()
Each DataFrame has an index which represent the row labels. By default, the labels are a simple
numerical enumeration from 0, but can be redefined.
#Use a column as the index
df = df.set_index('currency')

#Equivalently
df.index = df.currency
The index can be created from one or more columns. While index values do not have to be
unique, creating such a primary key or unique identifier may make it easier to merge one DataFrame
corresponding set of data labels (i.e. its index). A column from a dataframe is therefore a pandas
Series.
7.3.2 Selection
To select particular rows, columns or subsets of the DataFrame it is possible to use Python-style in-
dexing, or optimised methods such as loc (primarily using labels) and iloc (primarily using integer
position). Note that slices using loc are inclusive of the end record.
#Select specific rows and columns (rows currently have integer index)
df.iloc[0:2,[1,2]]
df.loc[0:2,'rate']
df.take([1,2])
The at and iat methods can be used to retrieve single values in a similar manner. Note that when
retrieving a single column, what is returned is actually a pandas series (with any corresponding
index). To extract just a list of values, use the tolist method.
#Panadas series
print(type(df['rate']))
#Python list
print(type(df['rate'].tolist()))
In pandas, the symbols & (and), | (or) and ~ (not) are used as Boolean operators. Note that these
bind more tightly than comparison operators, so each condition must be wrapped in parentheses.

#Find rows for which the exchange rate is between 10 and 100
df[(10 < df.rate) & (df.rate < 100)]

#Find rows for which the exchange rate is below 10 or above 100
df[(df.rate < 10) | (df.rate > 100)]

#Find rows for which the exchange rate is less than or equal to 10
#i.e. not greater than 10
df[~(df.rate > 10)]
A filter method can also be used for row and column sub-setting:
7.3.3 Iteration
To loop over the rows in a DataFrame use its iterrows method (similar to enumerate).
An alternative (and faster) approach is to use itertuples. The fields in each row are placed into
a named tuple (accessible by position, or by name where the column names are valid identifiers),
with or without the index as the first element.
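A sketch of both iteration approaches (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'currency': ['GBP', 'EUR'], 'rate': [0.8244, 0.9172]})

# iterrows yields (index, Series) pairs
for idx, row in df.iterrows():
    print(idx, row['currency'], row['rate'])

# itertuples yields named tuples; the index is the first element by default
rows = list(df.itertuples())
for t in rows:
    print(t[0], t.currency, t.rate)
```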
The pandas query method can be used to simplify some selection operations by writing simple
string expressions. For example, the following two statements are equivalent.
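The equivalent pair might look like the following (column name and data illustrative):

```python
import pandas as pd

df = pd.DataFrame({'currency': ['GBP', 'EUR', 'JPY'],
                   'rate': [0.8244, 0.9172, 106.98]})

# Boolean indexing and query produce the same selection
a = df[df.rate < 10]
b = df.query('rate < 10')
print(a.equals(b))   # True
```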
7.3.4 Sorting
Dataframes can be easily sorted by their index using the sort_index method.
#Sort by index
print(df.sort_index())
df.rate.sort_values()
Of course this will only return a single column (i.e. a series). Fortunately, this method also exists for
dataframes.
#Sort the dataframe using multiple columns (i.e. primary, secondary keys)
df.sort_values(by=['unit','currency'])
The na_position parameter can be set to determine the treatment of missing values.
A similar requirement is to determine the rank, assigning the value 1 to the element in first place, 2
to the next element and so on, based on the sort order. The default behaviour will average ranks
for elements in tied positions. Thus two elements in joint second position would each receive ranks
of 2.5 (the average of 2 and 3). This behaviour can be adjusted by setting the method parame-
ter.
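A sketch of the default averaging of tied ranks, and the effect of the method parameter (values are illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])

# Tied 20s share the average of ranks 2 and 3
print(s.rank().tolist())               # [1.0, 2.5, 2.5, 4.0]

# method='min' assigns ties the lowest of their ranks
print(s.rank(method='min').tolist())   # [1.0, 2.0, 2.0, 4.0]
```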
7.3.5 Manipulation
DataFrame updates can be achieved with assignment operations provided, of course, that the
objects on the left and right hand side are similar in size.
#Update a column
df.loc[:,'rate'] = [0.8244, 0.9172, 106.98, 1.5434, 7.1322]
#Rename a column (the columns Index is immutable, so use rename)
df = df.rename(columns={'rate': 'name'})
Appending rows iteratively, adding one row at a time, is considered inefficient. In such circum-
stances it is better to add rows to a list, which can then be appended as a single operation.
The concat method can also be used to combine DataFrames, along either axis (0 for rows, 1 for
columns).
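A sketch of a row-wise concat (the data is illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'currency': ['GBP'], 'rate': [0.8244]})
df2 = pd.DataFrame({'currency': ['EUR'], 'rate': [0.9172]})

# axis=0 stacks rows; ignore_index renumbers the combined index
combined = pd.concat([df1, df2], axis=0, ignore_index=True)
print(combined)
```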
It is possible to combine DataFrames using methods similar to database-style joins (see section 8.7).
Indeed, pandas has a join method, which by default matches rows from one DataFrame to rows
of another using their indexes.
#delete a column
del df["rate"]
#delete a column
df = df.drop(["rate"], axis=1)
7.3.6 Dates
When working with time series, rows in a DataFrame can be referenced by date. Pandas provides
support for this through its DatetimeIndex object. These can be created with the date_range function:
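A sketch of date_range (the start date is illustrative; a daily frequency is shown here):

```python
import pandas as pd

# Five consecutive daily dates, returned as a DatetimeIndex
dates = pd.date_range(start='2024-01-01', periods=5, freq='D')
print(dates[0], dates[-1])
```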
Note that monthly and annual dates default to the end of month and year respectively. Once
created they can be assigned as the index of a DataFrame:
df.index = dates
7.4 scikit-learn
Scikit-learn (Pedregosa et al., 2011) is a machine learning package built on NumPy, SciPy and
matplotlib. It incorporates multiple tools for problems relating to classification, regression and clus-
tering. It will be used in Chapters 11 & 12. See https://scikit-learn.org/ for more details.
7.5 tkinter
Tkinter is a cross-platform toolkit used by many different programming languages to build a GUI
(Graphical User Interface). It provides access to standard windows components such as text
boxes, buttons, listboxes, etc. See https://docs.python.org/3/library/tkinter.html for docu-
mentation or tutorials such as https://realpython.com/python-gui-tkinter/. It can be used in
combination with packages such as pyinstaller to create standalone executables. Simple tkinter
projects can be found in Eramo (2020).
To see how easy it is to begin, the following code creates and launches a window with a specific
title. The window automatically has standard features such as the ability to reposition, resize,
maximise and minimise. The output is shown in Figure 7.1.
The main window (here) is called root (remember that one window can spawn another). The mainloop
method creates the window and runs it (like an infinite loop) until it is terminated.
import tkinter

#Define window
root = tkinter.Tk()
root.title('First window')
root.mainloop()
7.5.1 Widgets
Tkinter widgets are used to build the user interface. Each component must be created, associ-
ated with a window (considered to be its parent), configured (in terms of its properties) and then
positioned.
Labels
The label widget allows non-editable text to be displayed. We use this simple component to
illustrate each of the approaches to positioning the widgets on the window.
The pack approach simply packs each element in order of appearance.
#Define window
root = tkinter.Tk()
root.title('Labels (pack)')
#Create label
label1 = tkinter.Label(root, text = "Hello")
label2 = tkinter.Label(root, text = "World")
#Pack approach
label1.pack()
label2.pack()
root.mainloop()
The place approach requires each widget to be positioned relative to the top left hand cor-
ner.
#Define window
root = tkinter.Tk()
root.title('Labels (place)')
#Create label
label1 = tkinter.Label(root, text = "Hello")
label2 = tkinter.Label(root, text = "World")
#Place approach
label1.place(x = 10, y = 10)
label2.place(x = 50, y = 100)
root.mainloop()
The grid approach requires each widget to be positioned in a specified row and column. Rows
or columns that contain no widgets will be ignored. The size of the largest widget determines the
overall sizing of the row/column in which it is positioned.
#Define window
root = tkinter.Tk()
root.title('Labels (grid)')
#Create label
label1 = tkinter.Label(root, text = "Hello")
label2 = tkinter.Label(root, text = "World")
#Grid approach
label1.grid(row = 0, column = 0)
label2.grid(row = 1, column = 1)
root.mainloop()
Buttons
Buttons are widgets that users can click to initiate some action. In addition to creating the control,
the user must ‘plumb in’ the widget by associating it with the code that should run when the button
is clicked.
import tkinter
from tkinter import messagebox

def showmessage():
    messagebox.showinfo(message="you clicked the button")

#Define window
root = tkinter.Tk()
root.title('Buttons')

#Create the button and wire it to the handler
button = tkinter.Button(root, text="Click me", command=showmessage)
button.pack()

root.mainloop()
Processing of data starts and ends with input and output respectively. The source or destination
of data could include transient hardware devices such as keyboards and computer monitors, live
data feeds such as stock price tickers, or more permanent records such as files stored on a hard
drive or fields in a database.
The input function reads a line of text from the user:

value = input('Enter a value: ')

Whatever is typed will be assigned to the variable. By default this will be of type string (even if the
user keys a numeric value).
#Absolute location
filename = r'c:\temp\test.csv'
Note the ‘r’ character before the string literal. This denotes a raw string that treats backslashes as
literal characters. This overcomes the fact that the backslash character used in windows file paths
would otherwise be treated as an escape character (see Section 4.5).
Note also that ‘.’ denotes the current folder and ‘..’ denotes the parent directory.
The os module provides access to operating system dependent functionality, such as determining
the current working directory.
import os

#delete file
os.remove(filename)
f = open(filename, 'w')
f.write("line 1\r")
f.write("line 2\r")
f.close()

#Append to the existing file
f = open(filename, 'a')
f.write("line 3\r")
f.close()
Unfortunately working with text gets complicated by the many languages, symbols and even emo-
jis in common usage. For this reason Python provides support for unicode (https://www.unicode
.org/), a system that aims to list every character used by human languages, providing each with
a unique code point (an integer value). There are already over 150,000 such codes, which goes
well beyond the range of 8-bit ASCII character codes (https://www.ascii-code.com/).
Text files can therefore be encoded in different ways. One common method is ‘utf-8’. You may
notice that this appears in the status bar of Spyder, indicating the default encoding of the python
files it creates. Note that not all applications will cope with non-standard encodings: a character
that can be typed in one application may not be compatible with another. For conversion,
python provides support via its codecs (encoders/decoders) library.
To illustrate the potential difficulties, consider the following code:
Reading from a file follows a similar approach. The file must be opened in ‘read’ mode.

f = open(filename, 'r')
text = f.read()
f.close()

Alternatively a for loop can be used to read the file one line at a time.

f = open(filename, 'r')
for line in f:
    print(line)
f.close()
See also section 4.12 for how to use the with keyword to ensure files are safely closed.
8.4 CSV
One popular format for outputting tabular data is the csv file (comma separated values). These
can be created manually by writing every record to a new line, and adding a comma (or other
separator) between each field of each record.
f = open('outputfile.csv', 'w')
for row in FX:
    f.write(','.join(str(v) for v in row) + '\n')
f.close()

This uses the join method of strings to concatenate together a container of values with a given
separator. Note that here, care is required to convert numeric values to strings. Had all the values
been strings, it would have been possible to write this as:

f.write(','.join(row))
Also note that while csv is a popular format, it can cause problems when the data fields themselves
naturally contain commas. If used as input, such commas will be interpreted as indicating the end
of one field and the beginning of the next. Therefore it may be better to use separators that do
not commonly appear in text such as the pipe symbol (|).
Reading csv files manually
FX = []
f = open('outputfile.csv', 'r')
for line in f:
    #Remove carriage return
    line = line.strip()
    #Split the line into its fields
    FX.append(line.split(','))
f.close()
import csv

#Import
with open('outputfile.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

#Export (again)
with open('outputfile2.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(FX)
The pandas DataFrame object has its own method for csv export.

df.to_csv('outputfile2.csv')
Importing works in exactly the same manner, with optional parameters set according to the pre-
cise format of the file.
df_in = pd.read_csv('outputfile2.csv')
8.5 Pickle
The pickle module allows data objects to be serialised into (and un-serialised from) a byte stream,
and can therefore be used as a way to persist data between python sessions. As a binary format,
the files are not human readable (try opening one with a text editor).
In the code that follows, notice that the read and write modes are supplemented with ‘b’ to
denote a binary format.
import pickle

#Example data (values are illustrative)
mydict = {'GBP': 0.8244, 'EUR': 0.9172}

#Serialise to file ('wb' = write binary)
with open('mydict.pkl', 'wb') as f:
    pickle.dump(mydict, f)

del(mydict)

#Un-serialise from file ('rb' = read binary)
with open('mydict.pkl', 'rb') as f:
    mydict = pickle.load(f)
df.to_excel("output.xlsx")
More involved operations over multiple sheets can be performed with the help of the ExcelWriter
object. This is akin to opening a text stream and then performing multiple write operations.
While the read function has many optional parameters, reading in tables is relatively simple.
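A sketch of a multi-sheet export followed by a re-import (sheet names made up; an Excel engine such as openpyxl must be installed):

```python
import pandas as pd

df1 = pd.DataFrame({'rate': [0.8246, 0.9171]})
df2 = pd.DataFrame({'price': [12.23, 56.71]})

# Write two dataframes to separate sheets of a single workbook
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='rates', index=False)
    df2.to_excel(writer, sheet_name='prices', index=False)

# Read a single sheet back in
df_in = pd.read_excel('output.xlsx', sheet_name='prices')
print(df_in)
```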
8.7 Databases
While there are many different database vendors and implementations (e.g. Sybase, SQL Server,
Oracle, PostgreSQL), most relational databases are based on ANSI SQL (pronounced as ‘sequel’)
which stands for Structured Query Language. As the name suggests, SQL commands can be used
to query the tables within a database to extract information. As shown in Figure 8.1, tables consist
of rows (records) and columns (fields). Additionally, SQL commands can be used to insert, update
or delete data records, or indeed to change the database schema itself (i.e. create, modify or
delete tables). An introduction to the SQL language can be found at https://www.w3schools.com/
sql.
The sqlite3 module provides an interface to SQLite, a library that implements a self-contained
SQL database engine. Documentation can be found at https://docs.python.org/3.8/library/
sqlite3.html.
Just as a text stream must be opened to read from or write to a file, a connection must first be
established with a database before queries can be executed. The database can be a file, a
server, or even one held entirely in memory.
#Close connection
con.close()
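A minimal sketch using an in-memory database (table and values made up):

```python
import sqlite3

# ':memory:' creates a temporary database held entirely in RAM
con = sqlite3.connect(':memory:')
cur = con.cursor()

cur.execute("CREATE TABLE assets (country TEXT, value REAL)")
cur.execute("INSERT INTO assets VALUES ('UK', 100.0)")
con.commit()

cur.execute("SELECT country, value FROM assets")
rows = cur.fetchall()
print(rows)

#Close connection
con.close()
```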
Long queries with multiple clauses are best typeset across several lines inside triple quotes. Pandas
provides support for importing data directly from a database.
import pandas as pd
df = pd.read_sql(query, con)
8.7.1 Queries
Select statements form the basis for all queries to retrieve data.
Where clauses are used to be selective about which rows to return, and act like filters on the query.
Standard Boolean operators (AND, OR, NOT) and parentheses can be used to modify the intended
logic. In SQL, the percent sign (%) is used as the wildcard character (with the LIKE operator).
The results of queries can be manipulated by sorting (ORDER BY) or aggregating (GROUP BY).
The modifiers ‘ASC’ (the default) and ‘DESC’ can be used for alphabetic/numeric or reverse al-
phabetic/numeric sorts. Common aggregation functions include COUNT, SUM, MAX, MIN and
AVG.
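These clauses can be sketched end-to-end with sqlite3 (table and data made up):

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("CREATE TABLE assets (country TEXT, value REAL)")
cur.executemany("INSERT INTO assets VALUES (?, ?)",
                [('UK', 100.0), ('US', 250.0), ('FR', 80.0)])

# WHERE filters the rows; ORDER BY ... DESC sorts largest first
cur.execute("""SELECT country, value FROM assets
               WHERE value > 90
               ORDER BY value DESC""")
rows = cur.fetchall()
print(rows)

con.close()
```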
#Aggregate results
#Selected fields must be used for grouping or aggregated in some way.
SELECT country, SUM(value) from assets GROUP BY country
Updates to tables are covered by INSERT, UPDATE and DELETE queries. Adding new rows to existing
tables is performed by INSERT queries. Depending on how tables have been created, it may or
may not be necessary to supply values for all fields. For example, certain fields may be automat-
ically populated as an auto-incremented index. Other fields may be permitted to hold no value
(i.e. allowed to be null).
Queries can be dynamically constructed in Python using string concatenation and formatting.
The execute method of the cursor object also permits ‘?’ to be used as a placeholder into which
values supplied in a tuple will be substituted. Note that queries that alter the data in a database
must be committed before they take permanent effect.
con = sqlite3.connect(dbname)
cur = con.cursor()
query = """UPDATE transactions set amount = ?
WHERE account_id = ?"""
cur.execute(query, (123.45, 1234567))
con.commit()
For relational databases (where data in one table is related to data in another), the concept
of normalisation is applied to reduce inefficiency (storing the same information more than once)
and inconsistency. To extract useful information from a database it is therefore often necessary to
combine data from different tables. In SQL this is performed using JOIN operations of which there
are several types. Suppose we wish to combined two tables: table1 (the left table) and table2
(the right table).
• (inner) joins combine records with matching values in both tables (akin to set intersection)
• left (outer) joins return all records from the left table and matching records from the right
table (a bit like a lookup)
• right (outer) joins return all records from the right table and matching records from the left
table (a bit like a lookup)
• full outer joins return all records from both tables whether there is a match or not (akin to set
union)
Inner joins are by far the most common. When combining tables, it is possible that field names
become ambiguous. To resolve this, field names can be prefixed by their table name using a
period ‘.’. As in the example below, table names can be aliased to avoid repeatedly typing their
full name.
#Return all fields from table1 and one specific column from table2
#matching records based on two fields
SELECT t1.*, t2.columnZ
FROM table1 t1
INNER JOIN table2 t2
ON t1.columnA = t2.columnX AND t1.columnB = t2.columnY
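A join of this kind can be run end-to-end in sqlite3; here is a sketch with two made-up tables:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("CREATE TABLE accounts (account_id INTEGER, name TEXT)")
cur.execute("CREATE TABLE transactions (account_id INTEGER, amount REAL)")
cur.execute("INSERT INTO accounts VALUES (1, 'Alice'), (2, 'Bob')")
cur.execute("INSERT INTO transactions VALUES (1, 123.45), (1, -50.0), (3, 9.99)")

# Inner join: only records with matching account_id values in both tables
# (Bob has no transactions; transaction for account 3 has no matching account)
cur.execute("""SELECT a.name, t.amount
               FROM accounts a
               INNER JOIN transactions t
               ON a.account_id = t.account_id""")
rows = cur.fetchall()
print(rows)
con.close()
```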
These SQL statements form the core building blocks of most database queries, and can be used
in combination to perform more complicated operations (e.g. joining three tables and simultane-
ously filtering the results, or using the results from one query as the input into another).
8.8 JSON
JSON (JavaScript Object Notation) is a human-readable data format that, despite the name, is lan-
guage independent. There are two basic building blocks. One of these is based on name/value
pairs and is similar in concept to a python dict:
{ "one": 1, "two": "ii" }
Note that names must always be strings, while values can be any valid JSON type (strings, numbers,
objects, arrays, booleans or null). The other is an ordered collection of values similar to a python
list:
[ "one", 2, "iii" ]
Any combination of these two object types is valid. The dumps and loads functions respectively
convert (valid) python objects to json strings and json strings to python objects. To remember that
they work with strings, note that their names end in ‘s’.
import json
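A round-trip sketch with a made-up dictionary:

```python
import json

d = {'one': 1, 'two': 'ii', 'flag': True}

s = json.dumps(d)   # python object -> JSON string
print(s)

d2 = json.loads(s)  # JSON string -> python object
print(d2 == d)      # True
```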
To save or load json text to/from file use dump and load.
#save to file
with open('json.txt', 'w') as f:
    json.dump(d,f)
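Loading reverses the operation; a self-contained sketch that writes a file and then reads it back:

```python
import json

d = {'one': 1, 'two': 'ii'}

#save to file
with open('json.txt', 'w') as f:
    json.dump(d, f)

#load from file
with open('json.txt') as f:
    d2 = json.load(f)

print(d2)
```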
In addition to list and dict objects, tuples can also be converted to JSON (but will become lists
when reloaded). The keywords True, False and None are also automatically converted to true, false
and null respectively. See for example https://www.w3schools.com/python/python_json.asp.
Given how closely the JSON building blocks map onto list and dict objects, file IO in JSON format
is common in practice.
8.9 XML
XML (eXtensible Markup Language) is a human-readable format for the storage and transfer of
data. XML documents appear similar in nature to the HTML that is used to encode web pages,
and consist of content and markup. Just as round and curly brackets are used to demarcate the
structure of JSON, a main feature that defines the structure and syntax of XML is the tag:
<tagname>content</tagname>
<empty-tag />
Each tag begins with < and ends with >. Tags normally appear in pairs, with an extra ‘/’ character
used to distinguish the end tag from the start tag. Unlike HTML, XML is not limited to a predefined
set of recognised tags: the tag names can be freely defined by the XML document author. A pair
of tags with its intermediate content is referred to as an element. The content can be individual
values or other valid xml. Attributes are name/value pairs contained within start or empty tags:
<asset currency="GBP">100</asset>
XML documents often begin with a declaration describing the precise format of the XML that fol-
lows. Just as for Python, indentation aids readability, with tabs indicating the nested structure of
complex hierarchical objects. Note that XML tags are also case sensitive. Suppose the following
XML file (‘portfolio.xml’) has been created.
<portfolio>
    <stock ticker="ABC">
        <shares>100</shares>
    </stock>
    <stock ticker="XYZ">
        <shares>50</shares>
    </stock>
</portfolio>
It is possible to open an XML document and traverse its structure using the Document Object
Model (DOM).
import xml.etree.ElementTree as ET
tree = ET.parse('portfolio.xml')
portfolio = tree.getroot()
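The tree can then be walked element by element. A self-contained sketch (parsing a made-up portfolio document from a string for brevity):

```python
import xml.etree.ElementTree as ET

xml_text = """<portfolio>
    <stock ticker="ABC"><shares>100</shares></stock>
    <stock ticker="XYZ"><shares>50</shares></stock>
</portfolio>"""

# fromstring returns the root element directly
portfolio = ET.fromstring(xml_text)

# Visit each child element, reading an attribute and some text content
holdings = {}
for stock in portfolio:
    holdings[stock.get('ticker')] = int(stock.find('shares').text)

print(holdings)
```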
Alternatively, it is possible to create a Python structure (based on the OrderedDict) that replicates
the XML document, using the xmltodict project (must first be installed).
import xmltodict
#Parse the file into a nested OrderedDict-style structure
with open('portfolio.xml') as f:
    doc = xmltodict.parse(f.read())
print(doc)
8.9.1 XPath
XPath is part of the XSLT standard. Without getting bogged down in new abbreviations1, it is a tool
that can be used to navigate XML documents. The hierarchy of elements within an XML document
can be thought to describe a path (similar to a path to a file within a folder structure). XPath allows
elements belonging to a particular part of the path to be directly retrieved.
from lxml import etree
tree = etree.parse('portfolio.xml')
#Retrieve all elements directly beneath the root
elements = tree.xpath('/portfolio/*')
8.10 HTML
HTML (Hypertext Markup Language) is the encoding used for content designed to be displayed
in a web browser. On most browsers it is possible to right click and choose ‘view source’ to reveal
the actual web page that the browser is rendering. As the HTML abbreviation suggests, HTML
documents contain both content (text) and markup (formatting instructions) necessary to do this.
1 XSL (Extensible Stylesheet Language) is a language for expressing style sheets. Style sheets are designed to describe
the formatting rules that should be applied to data stored in XML format. XSLT (XSL Transformations) allow XML documents
to be converted into other markup formats.
While a thorough understanding of HTML is not strictly necessary, it is helpful to have an idea of the
basic building blocks that are used to provide the structure and markup in an HTML document.
Using any text editor, try creating a document called ‘page.html’ with the following content. Double
click the file; it should open in your default browser and display only the content (with formatting
as defined by the markup).
<!DOCTYPE html>
<html>
<head>
<title>Title</title>
</head>
<body>
<h1>Heading One</h1>
<p>My first paragraph.</p>
<table>
<tr>
<th>col 1 header</th>
<th>col 2 header</th>
<th>col 3 header</th>
</tr>
<tr>
<td>row 1 col 1</td>
<td>row 1 col 2</td>
<td>row 1 col 3</td>
</tr>
<tr>
<td>row 2 col 1</td>
<td>row 2 col 2</td>
<td>row 2 col 3</td>
</tr>
</table>
</body>
</html>
8.10.1 BeautifulSoup
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/ISO_4217'
req = requests.get(url)
#Parse the page source
soup = BeautifulSoup(req.text, 'html.parser')
#Find tables within the HTML - the one of interest is the second one
tabs = soup.find_all('table')
tab = tabs[1]
8.10.2 Pandas
The Pandas read_html function extracts all tables from an HTML page and returns them as a list of
DataFrames.
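A self-contained sketch using a literal HTML fragment (read_html also accepts a URL, and requires an HTML parser such as lxml to be installed):

```python
import io
import pandas as pd

html = """<table>
  <tr><th>currency</th><th>rate</th></tr>
  <tr><td>GBP</td><td>0.8246</td></tr>
  <tr><td>EUR</td><td>0.9171</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per table found
tables = pd.read_html(io.StringIO(html))
df = tables[0]
print(df)
```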
8.10.3 Downloads
File downloads (from a known url) can be performed with the urllib library.
import urllib.request
8.11 API
8.11.1 Using an API
An application programming interface (API) defines the functionality provided by a particular
piece of software, and the protocols with which to interact with it. There are a multitude of soft-
ware services that can be accessed via their APIs. These can be web services accessed via HTTP or
other language-specific wrappers. While some are designed to be open and free-to-use, others
require access tokens and may incur usage costs.
As an example, we consider https://date.nager.at/Api, a free-to-use Public Holiday API.
To use this service manually, you can take a base url and append a four-digit year and a two-
character ISO country code. Thus, the 2021 US holidays could be retrieved via:
https://date.nager.at/api/v2/publicholidays/2021/US
If you navigate to this link in a browser you will see a page of plain text in JSON format. Of course,
such a manual process can also be coded in Python as follows:
import requests
import json
url = 'https://date.nager.at/api/v2/publicholidays/2021/US'
#Access url
req = requests.get(url)
#Parse the JSON response into a list of dicts (one per holiday)
holidays = json.loads(req.text)
Alternatively, a dedicated wrapper package such as holidayapi can be used (an API key is
required).
import holidayapi
hapi = holidayapi.v1('your-api-key')
parameters = {
    'country': 'US',
    'year': 2021,
}
holidays = hapi.holidays(parameters)
Quandl provides access to financial data via an api with free and premium services (https://www
.quandl.com/tools/api). For a list of other public APIs see https://github.com/public-apis/public
-apis.
from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello():
    return "Hello World!"

if __name__ == '__main__':
    app.run()
Running the code in Spyder or from an Anaconda prompt will create a local server, accessible
from your browser at the url http://127.0.0.1:5000/ as shown in Figure 8.2. Via the decorator
@app.route (see section 4.14.2), the function hello is registered for the route / so that it is called
whenever this precise URL is accessed.
This example can be extended to make the return vary depending on the precise URL visited.
Notice in the code below that two additional @app.route’s have been added.
from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello():
    return "Hello World!"

@app.route('/hi')
def hi():
    return "Hi Everyone!"

@app.route('/<name>/<int:age>')
def parsetextandnumber(name, age):
    return name + " is " + str(age) + " years old"

if __name__ == '__main__':
    app.run()
Notice how the final @app.route defines the format (and type) of the function inputs. In addition
to the text and integer inputs required here it is also possible to configure:
• string - text without a slash (the default)
• int - positive integers
• float - positive floating point values
• path - like string but also accepts slashes
• uuid - UUID strings
Query Arguments
URLs can embed query arguments to allow for a more flexible range of possible input values. For
example:
domain.com?name1=value1&name2=value2
The query string begins after the ‘?’ character and ampersand (&) is used to separate each
key-value pair. The following code snippet shows how to extract their values:
from flask import request

@app.route('/query')
def queryexample():
    value1 = request.args.get('name1')
    value2 = request.args.get('name2')
    return "name1={}, name2={}".format(value1, value2)
8.12 Financial Data
start = datetime.datetime(2019, 1, 1)
end = datetime.datetime(2023, 1, 1)
8.12.2 yfinance
Ran Aroussi’s yfinance is another library that facilitates dynamic sourcing of financial data,
providing access to the public API offered by Yahoo Finance. The example below shows
how to obtain historic stock prices, but it is also possible to source company financials and much
more.
import yfinance as yf
ticker = yf.Ticker("MSFT")
#Historic daily prices over the past month, returned as a DataFrame
hist = ticker.history(period="1mo")
8.12.3 WRDS
At Queen’s you have access to various data via Wharton Research Data Services (WRDS).
Account setup
1. Browse to wrds-www.wharton.upenn.edu
2. Select the Register tab.
3. Complete the Account Request form selecting ‘Queen’s University Belfast’.
4. Once you submit an Account Request, an email will be sent to your WRDS Representatives.
After receiving approval, an account will be created and you will receive an e-mail message
with a special URL and instructions for setting the account password and logging into WRDS.
Accessing WRDS via python requires the wrds package to be installed. From an anaconda prompt
run:
pip install wrds
A connection can then be established within Python (you will be prompted for your login details):
import wrds
con = wrds.Connection()
This approach will work fine in Jupyter but not with Spyder (v5.15) because of the prompt to get
login details. This can be solved by explicitly supplying this information as input to the connection
function.
Alternatively, create a file called ‘.pgpass’ (no file name, just a file extension) in your working
directory and edit it to contain a line as follows (with valid login details):
wrds-pgdata.wharton.upenn.edu:9737:wrds:myusername:mypassword
To get started retrieving data, see the following documentation and sample code:
• pypi.org/project/wrds
• wrds-www.wharton.upenn.edu/documents/1443/wrds_connection.html
9 Data Handling
Data in its raw form is often in a state that makes it unfit for immediate analysis. Real world data is
messy. It may:
• contain missing or unexpected values
• be inconsistently encoded
• be in an unsuitable format
• only tell part of the story
For all these reasons and more, a step by step process, illustrated in Figure 9.1, is necessary to
prepare the data for analysis.
#Sample data
df = pd.DataFrame({'company': ['A', 'B', 'C', 'D'],
'mktcap': [975452, 123455678, 36087532, 230975486]})
currency,rate,unit
GBP,0.8246,pound
EUR,0.9171,euro
JPY,107.64,yen
AUD,1.5301,
CNY,7.1264,yuan
DEM,,mark
When imported as a Pandas dataframe, missing values are treated as NaN (Not a Number -
np.nan). In other languages such as R, the term NA (not available) is preferred, and this also
creeps into Python. Values created with Python’s None are also considered null or missing, and can
be detected using the isnull (equivalently notnull) method.
import pandas as pd
#Read the currency data shown above (filename assumed)
df = pd.read_csv('rates.csv')
#Add a record with a missing field (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, pd.DataFrame([{'currency': 'ZAR',
                                   'rate': 12.2134,
                                   'unit': None}])], ignore_index=True)
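Detection can then be sketched as follows (data made up to mirror the csv above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'currency': ['GBP', 'AUD', 'DEM'],
                   'rate': [0.8246, 1.5301, np.nan],
                   'unit': ['pound', None, 'mark']})

# True wherever a value is missing (np.nan or None)
print(df.isnull())

# Count missing values per column
print(df.isnull().sum())
```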
One may reasonably decide that missing fields require an entire record to be discarded. This can
be achieved by selection or using the dropna method.
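A minimal sketch (made-up data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'currency': ['GBP', 'AUD', 'DEM'],
                   'rate': [0.8246, 1.5301, np.nan]})

# Drop any row containing at least one missing value
df.dropna(inplace=True)
print(df)
```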
Note the use of the optional inplace parameter here. This updates the existing dataframe rather
than returning a modified dataframe as a result. Before considering alternative treatments for
missing data, we consider how missing records can be inserted into the dataframe. In the following
example, the dataframe’s index is used to make sure every required index value is present even if
it wasn’t previously represented.
#reindex - retain rows with index in the list and add missing ones
df = df.reindex(required_currencies)
The reindex method provides many more options. For example a fill_value can be supplied as
the default for missing values. Alternatively, a rudimentary ‘interpolation’ method allows values to
be backward or forward filled.
print(df)
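Both options might be sketched as follows (currencies made up; note that filling by method requires a sorted index):

```python
import pandas as pd

df = pd.DataFrame({'rate': [0.9171, 0.8246]},
                  index=['EUR', 'GBP'])

required_currencies = ['EUR', 'GBP', 'JPY']

# Missing rows appear as NaN by default; fill_value overrides this
df2 = df.reindex(required_currencies, fill_value=0.0)
print(df2)

# method='ffill' instead carries the previous value forward
df3 = df.reindex(required_currencies, method='ffill')
print(df3)
```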
To assign specific values to replace missing ones, use the fillna method.
It’s possible to fill missing values with an average value, but this operation can only be performed
on numeric columns (in earlier versions pandas ignored non-numeric columns).
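A sketch of both treatments (made-up data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'rate': [0.8246, 0.9171, np.nan],
                   'unit': ['pound', 'euro', None]})

# Replace missing text values with a fixed label
df['unit'] = df['unit'].fillna('missing')

# Replace missing numeric values with the column mean
df['rate'] = df['rate'].fillna(df['rate'].mean())

print(df)
```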
It may make more sense to use group-based means (see section 9.8.1). Should more sophisticated
logic be required, one can always use the apply method in combination with a custom function
or lambda function.
def fixname(name):
    if pd.isnull(name):
        return 'missing'
    else:
        return name
df['unit'] = df['unit'].apply(fixname)
#For non-numeric data describe includes an overall and unique value count
df['label'].describe()
While not foolproof, inspecting the results of such commands should reveal any outliers or unex-
pected data. Inspecting a sample of rows (say using df.head()) may also quickly reveal if the data
is not as expected.
9.3.1 Correlation
Correlation and covariance matrices can be calculated using the corresponding methods avail-
able for dataframes. Non-numerical attributes will be ignored. The return type will be a pandas
dataframe, so it must be converted to a numpy array (e.g. with its to_numpy method) before
being used for matrix calculations.
cor = df.corr(method='pearson')
cov = df.cov()
The issue of multicollinearity occurs when two or more explanatory variables are highly linearly re-
lated. While this may not be a problem for certain machine learning models, coefficient estimates
in regression models can be unstable in such circumstances.
To visualise correlation, it is possible to use seaborn to produce a heatmap (Figure 9.2) so that high
levels of positive and negative correlation are apparent.
import seaborn as sns
#C is the correlation matrix calculated above
C = df.corr()
sns.heatmap(C,annot=True,cmap="RdYlGn")
9.3.2 Normality
A QQ plot (quantile-quantile) plot is a graphical method for visually comparing two distributions.
Each (x,y) pair corresponds to the corresponding quantile for each of the distributions. In this way,
random variables can be compared with the actual distribution to see if the data is a reasonable
fit to the assumed distribution.
import statsmodels.api as sm
from matplotlib import pyplot as plt
import numpy as np
#QQ plot of 500 standard normal random values against the normal distribution
sm.qqplot(np.random.normal(size=500), line='45')
plt.show()
[Figure 9.2: seaborn heatmap of a 12 × 12 correlation matrix, with each coefficient annotated in
its cell.]
Figure 9.3 shows two such plots. The first (9.3a) is generated by the code above, with randomly
generated values from the normal distribution being plotted against the normal distribution (the
default of the qqplot function). Here the data is a good fit to the 45° line, indicating a reasonable
expectation of normality. The second plot (9.3b) shows randomly generated values from the
exponential distribution being plotted against the normal distribution and clearly shows a poor fit.
[Figure 9.3: QQ plots of (a) a normally distributed sample and (b) an exponentially distributed
sample against the normal distribution, with theoretical quantiles on the x-axis and sample
quantiles on the y-axis.]
9.4 Outliers
Treatment of outliers requires special consideration and should not be performed without thought.
Inappropriate treatment of outliers may bias or invalidate the conclusions of subsequent analysis.
As with all troublesome data, it is tempting to simply remove it. Such operations can be performed
using filtering techniques described earlier. An alternative to removing outliers is to constrain their
values to lie within a bounded interval, with values lying outside the interval being shifted to the
boundary values. This can be thought of as applying a function f : R → [a, b] defined as:

           a   if x < a
    f(x) = x   if a ≤ x ≤ b                                  (9.1)
           b   if b < x
Winsorizing data (or winsorization) determines the lower and upper bounds based on specified
percentiles (for example the 5th and 95th percentiles).
#Winsorise data
df[cols] = df[cols].clip(lower=df.quantile(0.05),
upper=df.quantile(0.95), axis=1)
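The same idea on a single made-up column, showing the bounds explicitly:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 100.0]})

# Bounds from the 5th and 95th percentiles
lower = df['x'].quantile(0.05)
upper = df['x'].quantile(0.95)

# Values outside [lower, upper] are shifted to the boundary
df['x_w'] = df['x'].clip(lower=lower, upper=upper)
print(df)
```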
9.5 Duplicates
Pandas provides several tools for identifying duplicates and removing them if required. In the
example below, the second and last records are considered to be duplicates even though they
have different index values.
print(df.duplicated())
The Boolean array returned by the duplicated method marks only the later occurrences as True;
a value of False is recorded for the first record of any duplicated set. Duplicates can be
removed from a dataframe as follows:
#Remove duplicates
df.drop_duplicates(inplace=True)
By default, all fields are required to be identical to be considered duplicates. To apply similar logic
to a subset of columns use:
#Only consider the named columns when identifying duplicates (names illustrative)
df.drop_duplicates(subset=['colA', 'colB'], inplace=True)
9.6 Data Transformation

#Define mapping
sizemap = {'small':0, 'medium':1, 'large':2, 'huge':2}
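Applied via the Series map method (data made up), each label is translated to its group number:

```python
import pandas as pd

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'huge']})

#Define mapping
sizemap = {'small': 0, 'medium': 1, 'large': 2, 'huge': 2}

#Translate each label via the dict
df['group'] = df['size'].map(sizemap)
print(df['group'].tolist())  # [0, 2, 1, 2]
```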
An alternative approach is to write a custom function to perform the transformation which is then
applied to a column of the dataframe.
def assigngroup(label):
    label = label.strip().lower()
    if label in ['small']:
        return 0
    if label in ['medium']:
        return 1
    if label in ['large', 'huge']:
        return 2
    return None
df['group'] = df['size'].apply(assigngroup)
Substitutions for specific values can be performed using the replace method.
Replace can also be used to perform multiple search and replace operations simultaneously by
supplying a list of replacement values or by supplying a dict that defines the substitutions.
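A sketch of both forms (values made up):

```python
import pandas as pd

s = pd.Series(['Pound', 'euro', 'Yen'])

# Single substitution
print(s.replace('Pound', 'pound').tolist())

# Multiple substitutions supplied as a dict
print(s.replace({'Pound': 'pound', 'Yen': 'yen'}).tolist())
```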
Pandas allows columns to be treated as a categorical variable type. For large datasets, this may
reduce memory requirements, as text strings can be encoded as integer values. Performance of
some operations (such as grouping) may also be improved.
df['size'] = df['size'].astype('category')
9.6.2 Binning
Numerical data can also be used for grouping by defining discrete bins (or buckets) into which
to assign each point. One way to do this is using the cut function. The bins input parameter
can either be defined as an integer (the number of bins required) or as an array specifying the
bin boundaries. In the former case, the bins will be equally sized (but not necessarily equally
populated).
Note that by default the bins are defined by intervals of the form (a, b], open on the left and closed
on the right. This means that the left hand point a is not considered to belong to the bucket, but
the right hand point b is (i.e. a < x ≤ b).
An alternative is to use the qcut function which automatically calculates the boundary points to
ensure each bucket is uniformly populated.
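A sketch of both functions on made-up values:

```python
import pandas as pd

sizes = pd.Series([1.2, 0.5, 3.3, 0.2, 2.0, 2.6, 0.9, 5.1])

# cut with explicit boundaries: bins (0, 1], (1, 3] and (3, 6]
bins = pd.cut(sizes, [0, 1, 3, 6])
print(bins.value_counts().sort_index())

# qcut: three bins holding (roughly) equal numbers of points
bins3 = pd.qcut(sizes, 3)
print(bins3.value_counts().sort_index())
```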
9.6.3 Dummy Variables

Categorical data can be expanded into dummy (or indicator) variables using the get_dummies
function. For each of the category values, a new column is added, populated with 1 when the
row matches that category and zero otherwise.
The drop_first parameter should be set to True to remove the first level to get k − 1 dummies
for the k category values. Scikit-learn provides similar functionality for ‘one hot encoding’ (see
OneHotEncoder) but returns an array (i.e. matrix) rather than a dataframe.
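A sketch using pandas get_dummies (data made up):

```python
import pandas as pd

df = pd.DataFrame({'sector': ['F', 'I', 'T', 'F']})

# One indicator column per category value
dummies = pd.get_dummies(df['sector'])
print(dummies)

# k-1 columns: the dropped first level is implied when all others are zero
dummies2 = pd.get_dummies(df['sector'], drop_first=True)
print(list(dummies2.columns))
```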
9.6.4 Scaling
The fields within a data set can vary by an order of magnitude. For certain types of analysis where
any linear transformation of the data has no material impact on the results, it can be convenient
to scale numerical data to a common interval such as [0, 1] or [−1, 1]. In the former case, the
maximum value in each column will be scaled to 1, and the minimum value to 0.
When using the sklearn package such scaling can be performed as follows:
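A sketch using MinMaxScaler (assuming scikit-learn is installed; the default feature_range is (0, 1)):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [3.0], [5.0]])

# Each column is scaled so its minimum maps to 0 and its maximum to 1
scaler = MinMaxScaler()
Xs = scaler.fit_transform(X)
print(Xs.ravel())
```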
9.6.5 Differencing
For time series data (or any data where each row is related to the previous one), new columns can
be created by performing operations across rows. One common operation is lagging a variable.
In pandas this is performed using the shift method. For a time series, this transforms A(t) to A(t − s),
where s is the integer shift.
#Lagged variable
df['lagA1'] = df['A'].shift(1)
#First difference
df['dA'] = df['A'] - df['A'].shift(1)
#Percentage change
df['B_pct_change'] = (df['B'] - df['B'].shift(1))/df['B'].shift(1)
Rather than calculating the first difference and percentage changes manually, it is possible to use
the pandas methods diff and pct_change. Note that such operations can only be applied to an
entire DataFrame if all columns are numeric.
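A sketch of both built-in methods (prices made up):

```python
import pandas as pd

df = pd.DataFrame({'A': [100.0, 110.0, 99.0]})

df['dA'] = df['A'].diff()            # first difference
df['A_pct'] = df['A'].pct_change()   # percentage change
print(df)
```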
One technique for doing so is Principal Component Analysis (PCA) which, as the name suggests,
attempts to simplify the description of a data set by identifying its ‘principal’ components. One
way to think of the data points is as coordinates in n dimensional space. One set of coordinates
can be expressed in different ways depending on the axis system employed. Matrices can be
viewed as a means to transform data points between axis systems.
The equation below represents such a transformation. Data points x can be considered as co-
ordinates in a regular Cartesian coordinate system. The axes of this system are represented as
columns of the identity matrix In . The columns of matrix A represent an alternative axis system.
Data points y expressed in this system of coordinates can be transformed into the regular coor-
dinate system by multiplication by A. For PCA, the columns of A would be referred to as factor
loadings, and the y values as factor scores for x.
    ⎡ a11  a12  ···  a1n ⎤ ⎡ y1 ⎤   ⎡ 1  0  ···  0 ⎤ ⎡ x1 ⎤   ⎡ x1 ⎤
    ⎢ a21  a22  ···  a2n ⎥ ⎢ y2 ⎥   ⎢ 0  1  ···  0 ⎥ ⎢ x2 ⎥   ⎢ x2 ⎥
    ⎢  ⋮    ⋮         ⋮  ⎥ ⎢ ⋮  ⎥ = ⎢ ⋮  ⋮       ⋮ ⎥ ⎢ ⋮  ⎥ = ⎢ ⋮  ⎥
    ⎣ am1  am2  ···  amn ⎦ ⎣ yn ⎦   ⎣ 0  0  ···  1 ⎦ ⎣ xn ⎦   ⎣ xn ⎦

i.e. Ay = In x = x.
PCA attempts to find a new, more efficient set of axes. Any such system must have orthogonal
axes. The first axis (principal component) is chosen to maximise the variance of the data points
projected onto it, the second axis to have the next greatest variance, and so on. Data points
consisting of many attributes can therefore be approximated by their first few coordinates under
the new axis system.
PCA can be performed on dataframes and matrices using scikit-learn.
from sklearn.decomposition import PCA
#Number of components to retain (value assumed for illustration)
pca = PCA(n_components=3)
#Fit to data
pca.fit(df)
Once fitted it is possible to inspect how much of the variance is captured by each of the factors,
and therefore the total variance captured by the reduced factor set.
print(pca.explained_variance_ratio_)
print("total variance explained = ",sum(pca.explained_variance_ratio_))
#Inspect factors
print(pca.components_)
Once fitted, the observations can be transformed into the reduced dimensional space.
X2 = pca.transform(X)
9.7 Reshaping
Data can be displayed in long or wide formats. In the wide format, different characteristics for the
same subject are placed in separate columns. In the long format, each characteristic of each
subject occupies its own row.
9.7.1 Stack/Unstack
To convert from long to wide, one can use a hierarchical index (MultiIndex). One way to think
about a MultiIndex is as a composite key: a combination of columns that together can be used
to suitably identify each row. They can be created from a list of arrays. The code below creates
a DataFrame matching the long table shown in Table 9.1. The first column ‘company’ has
repeated values so makes a poor index until it is combined with the second column ‘label’ to
make a MultiIndex. The order in which they are combined determines the index levels.
#Create multi-index
df.set_index(['company', 'label'], inplace=True)
#Inspect levels
print(df.index.names)
The unstack method acts like a pivot to return a new dataframe (or modify the existing one) with
additional inner column labels, based on the index level pivoted. By default this will be -1 (i.e. the
last level). As shown by the output below, the first time the ‘label’ values become columns, but
the second time the ‘company’ values become columns.
df3 value
company A B
label
est 1984 2018
listed yes no
size L S
To undo this operation, one can use the stack function, which returns a reshaped DataFrame with
a multi-index with additional inner levels. This retains the existing outer index (currently the
company).
df = pd.DataFrame({'company':['A', 'B'],
'est':['1984', '2018'],
'listed':['yes', 'no'],
'size':['L', 'S']})
#Set index
df.set_index('company', inplace = True)
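Starting from this wide dataframe, a stack/unstack round trip can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'company': ['A', 'B'],
                   'est': ['1984', '2018'],
                   'listed': ['yes', 'no'],
                   'size': ['L', 'S']})
df.set_index('company', inplace=True)

# Wide -> long: the column labels move into an inner index level
long_df = df.stack()
print(long_df)

# Long -> wide: the inner index level pivots back into columns
wide_df = long_df.unstack()
print(wide_df)
```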
9.7.2 Melt
Melting a DataFrame converts it from wide to long. In the example below we begin by recreating
the wide table from earlier. Note that no index is set.
df = pd.DataFrame({'company':['A', 'B'],
'est':['1984', '2018'],
'listed':['yes', 'no'],
'size':['L', 'S']})
#Melt, keeping 'company' as the identifier column
df2 = df.melt(id_vars=['company'])
print(df2)
Note that melt is also a DataFrame method so the following lines are equivalent:
df2 = pd.melt(df, id_vars=['company'])
df2 = df.melt(id_vars=['company'])
To reverse this operation, one can use unstack, or use the pivot method as shown here.
df3 = df2.pivot(index='company', columns='variable', values='value')
9.8 Aggregation
Aggregation is a process where records are first grouped based on a chosen set of characteristics,
and then summary information is calculated on other selected characteristics. For example, one
could first group stocks by their market sector and then find their average size within each of these
groups.
The concept is similar to ‘group by’ SQL operations or pivot tables in Excel, and pandas supports
similar concepts. We’ll use this simple DataFrame in the examples that follow.
sector = ['F', 'I', 'F', 'I', 'I', 'T', 'T', 'F', 'I', 'T']
size = [1.23, 0.56, 3.25, 0.17, 1.98, 2.62, 0.95, 5.11, 2.09, 2.75]
listed = ['Y', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y']
price = [12.23, 56.71, 0.53, 1.23, 5.47, 27.91, 5.22, 1.45, 3.90, 5.55]
#Combine into a DataFrame
import pandas as pd
df = pd.DataFrame({'sector': sector, 'size': size,
                   'listed': listed, 'price': price})
9.8.1 Group by
The groupby method groups DataFrame rows based on a list of column names. Aggregation func-
tions can then be applied to each sub-group. For example, here all remaining numerical columns
will be summed (numeric_only excludes the text column ‘listed’).
gb = df.groupby(['sector']).sum(numeric_only=True)
size price
sector
F 9.59 14.21
I 4.80 67.31
T 6.32 38.68
The output is a DataFrame with an index for both its rows and columns. Alternatively a dict can be
used to supply the columns to aggregate and the function (or list of functions) to use for each.
Applying the agg method to our example:
gb = df.groupby(['sector']).agg({'size': ['count', 'min', 'max'],
                                 'price': 'mean'})
print(gb)
size price
count min max mean
sector
F 3 1.23 5.11 4.736667
I 4 0.17 2.09 16.827500
T 3 0.95 2.75 12.893333
Here the columns of the output form a MultiIndex comprised of the aggregated columns names
and the functions used to aggregate them.
A sort key can also be passed to the groupby method. For a list of supported functions see
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html. Note that aggregat-
ing functions ignore NA values.
Other functions (custom or otherwise) can also be used for aggregation. The example below uses
a lambda function to count ’Y’ values.
gb = df.groupby(['sector']).agg({'listed': lambda x: sum(x == 'Y')})
print(gb)
Although we have used groupby for aggregation, the groupby method actually returns a special
DataFrameGroupBy object.
gb = df.groupby(['listed','sector'])

#Show the size of each group
print(gb.size())

listed  sector
N       I         3
        T         1
Y       F         3
        I         1
        T         2
dtype: int64
In this case (since we have grouped by two columns) the index is now a MultiIndex. The
DataFrameGroupBy object cannot be printed in a meaningful way, but it can be iterated over.
size price
min max mean min max mean
listed sector
N I 0.17 2.09 1.413333 1.23 5.47 3.533333
T 0.95 0.95 0.950000 5.22 5.22 5.220000
Y F 1.23 5.11 3.196667 0.53 12.23 4.736667
I 0.56 0.56 0.560000 56.71 56.71 56.710000
T 2.62 2.75 2.685000 5.55 27.91 16.730000
You may recall the apply method. This can also be used to perform the summarise step before
combining. In the example below a custom function accepts a DataFrame and returns a DataFrame
containing the rows for which a particular column holds its largest value. This is
applied to the DataFrameGroupBy object (on the size column) to return the biggest company (or
companies, should there be a tie) in each sub-group.
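The custom function itself is not reproduced here; a minimal sketch (the function name and exact form are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    'sector': ['F', 'I', 'F', 'I', 'I', 'T', 'T', 'F', 'I', 'T'],
    'size':   [1.23, 0.56, 3.25, 0.17, 1.98, 2.62, 0.95, 5.11, 2.09, 2.75],
})

def biggest(group):
    # Return the row(s) whose 'size' equals the group maximum
    return group[group['size'] == group['size'].max()]

# Apply the function to each sector sub-group
top = df.groupby('sector', group_keys=False).apply(biggest)
print(top)
```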
Such an approach can therefore be used to modify a DataFrame on a group by group basis. One
application is addressing missing values using group means.
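For example, missing sizes could be replaced by sector means via transform (a sketch with assumed data):

```python
import numpy as np
import pandas as pd

# Illustrative data containing a missing value
df = pd.DataFrame({'sector': ['F', 'F', 'I', 'I'],
                   'size': [1.0, np.nan, 2.0, 4.0]})

# Replace each NaN with the mean of its own sector group
df['size'] = df.groupby('sector')['size'].transform(
    lambda s: s.fillna(s.mean()))
print(df)
```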
pt = df.pivot_table(index=['sector'], columns=['listed'],
                    values=['price', 'size'], aggfunc='max')
print(pt)
price size
listed N Y N Y
sector
F NaN 12.23 NaN 5.11
I 5.47 56.71 2.09 0.56
T 5.22 27.91 0.95 2.75
Of course you are not limited to built-in functions and can supply your own. The aggregating
function will be supplied with a Series object, which can be treated like an array. Here a lambda
function is used to aggregate strings by concatenating them together.
pt = df.pivot_table(index=['sector'],
values=['listed'],
aggfunc = lambda x: ' '.join(str(v) for v in x))
print(pt)
listed
sector
F Y Y Y
I Y N N N
T Y N Y
For simple counts (i.e. frequencies) one might also consider using pd.crosstab.
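For instance, the listed counts per sector can be tabulated as follows (a sketch re-creating only the relevant columns):

```python
import pandas as pd

df = pd.DataFrame({
    'sector': ['F', 'I', 'F', 'I', 'I', 'T', 'T', 'F', 'I', 'T'],
    'listed': ['Y', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y'],
})

# Frequency table of sector against listed status
ct = pd.crosstab(df['sector'], df['listed'])
print(ct)
```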
For mixed format dates it is more convenient to (at least attempt to) rely on a library function to work
out how to perform the conversion. The dateutil library provides such a function.
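For example (a sketch; the date strings are illustrative):

```python
from dateutil import parser

# parser.parse makes a best-effort guess at each format
dates = ['1 Jan 2024', '2024-02-15', 'March 3, 2024']
parsed = [parser.parse(d) for d in dates]
print(parsed)
```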
Rather than working with lists of dates, pandas defines a DatetimeIndex type. Lists of dates, in
string or python format, can be converted into a DatetimeIndex with the pandas to_datetime func-
tion.
dti = pd.to_datetime(dates)
The return type of this function depends on the input. Data supplied as a Python list will be returned
as a DatetimeIndex, while input supplied as a pandas Series will be returned as a Series. Once
converted, it becomes easy to extract components such as the year, quarter or month, although
the exact syntax depends on the return type.
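A sketch of both cases (illustrative dates; note the .dt accessor is needed for a Series):

```python
import pandas as pd

dates = ['2024-01-15', '2024-04-01', '2024-07-20']

# A list input returns a DatetimeIndex: components are attributes
dti = pd.to_datetime(dates)
print(dti.year, dti.quarter)

# A Series input returns a Series: use the .dt accessor instead
s = pd.to_datetime(pd.Series(dates))
print(s.dt.month)
```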
Alternatively the pandas date_range function can create dates spanning a particular period, with
a specified frequency. A full list of configurable frequencies can be found at https://pandas
.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
startdate = datetime(2023,10,1)
enddate = datetime(2024,10,1)
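For instance (the daily frequency here is an assumption for illustration):

```python
import pandas as pd
from datetime import datetime

startdate = datetime(2023, 10, 1)
enddate = datetime(2024, 10, 1)

# Daily dates spanning the period, endpoints included
dates = pd.date_range(start=startdate, end=enddate, freq='D')
print(len(dates))
```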
We can use such DatetimeIndex objects as the index of a Series or DataFrame. In the example that
follows, a random walk is created to represent a stock price.
vol = 0.2
p = np.ones(n)*10
z = np.random.normal(0,1,n)
for i in range(1,n):
p[i] = p[i-1] * (1+z[i]*vol*(1/365)**0.5)
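The snippet above assumes n and a date index are already defined; a self-contained version that also builds the ts object used below (the variable names are assumptions):

```python
import numpy as np
import pandas as pd

# Rebuild the random walk and wrap it as a time series
dates = pd.date_range('2023-10-01', '2024-10-01', freq='D')
n = len(dates)
vol = 0.2
p = np.ones(n) * 10
z = np.random.normal(0, 1, n)
for i in range(1, n):
    p[i] = p[i-1] * (1 + z[i] * vol * (1/365)**0.5)

ts = pd.DataFrame({'price': p}, index=dates)
print(ts.head())
```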
We can visually inspect the series using the plot method. Note that since this data is randomly
generated, any rerun will look different.
ts.plot(y='price')
Such series and dataframes benefit from flexibility in selecting particular sub-series.
#Select between two dates (here the first date is the entire month)
ts2 = ts['2023-11':'2023-12-15']
Note that exact matches (i.e. attempts to select a single row) for a DataFrame will be treated
column-wise and will raise a KeyError.
#Select a specific date (will work for a series but not a dataframe)
#print(ts['2023-11-05'])
Part III
Data Analysis
10 Data Visualisation
10.2 Matplotlib
Matplotlib is a library for generating charts that makes extensive use of NumPy. Its name reveals
its origins, which stem from MATLAB; anyone familiar with creating charts in MATLAB will
recognise Matplotlib's commands. By convention it is imported and aliased as follows:
import matplotlib.pyplot as plt
Note that not all ax methods work with plt; however, you can get the current axes instance as
follows:
ax = plt.gca()
10.3 Scatterplot
Here we present the commands necessary to generate a simple 2D line chart. At a minimum
each series can be specified by three things: the x-points, the y-points, and its format. It is not
necessary to specify any other chart components (e.g. title, legend, axis labels) but the production
of high-quality data visualisations will require such customisation. The output of the code below is
shown in Figure 10.1a.
import numpy as np
import matplotlib.pyplot as plt
#create data
x = np.linspace(0, 2*np.pi, 50)
y = np.sin(x)
z = np.cos(x)
#add series
plt.plot(x, y, '-', label='sin', color='#888888')
plt.plot(x, z, '*r', label='cos')
#add titles
plt.title('My first chart')
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')
#add legend
plt.legend(loc='lower left')
#display chart
plt.show()
Formatting of the series line style and colour can be encoded using:
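For example, a format string combines a colour code with a line or marker style (a sketch; the Agg backend is selected only so the example runs without a display):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 20)
# Format strings: colour code followed by line/marker style
plt.plot(x, x, 'r--')      # red dashed line
plt.plot(x, x**2, 'g^')    # green triangle markers
plt.plot(x, x**3, 'b-.')   # blue dash-dot line
plt.savefig('formats.png')
```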
Full colour control is available via the color parameter using RGB colours.
Rather than selecting your own formats, predefined styles can be adopted using the command be-
low. Styles should be applied before creating chart components. A list of styles is available from
https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html.
#Format
plt.style.use('ggplot')
In Spyder (version 3.7), plots appear on the 'Plots' tab alongside the variable explorer. To export a
plot, right click and choose ‘save plot as...’. Exporting can also be achieved in code using the savefig
command. This allows other options to be configured, including output format and quality. When
embedding images in other documents remember to ensure they are of a high quality (increase
the dots per inch (dpi)) and that all text can be easily read.
#Save to file
plt.savefig('chart.png', dpi=600)
Scatter plots can be created as line series (with just markers and no lines) or using the scatter
method (see Figure 10.1b) which also permits a third variable to be represented using a colour
scale.
plt.clf()
plt.scatter(y,z*x, marker='o', c = x)
plt.colorbar()
plt.show()
Multiple plots can be combined into a single figure by using subplots (in fact all plots reside inside
a figure object). These can be arranged in any regular grid shape and configured to have
shared axes (sharex, sharey). The overall grid shape, and indexing of each subplot, uses matrix no-
tation (row then column). For a full range of options see https://matplotlib.org/3.1.0/gallery/
subplots_axes_and_figures/subplots_demo.html.
plt.show()
When working with large numbers of automatically generated subplots, it becomes cumbersome
to give them all different variable names. In such cases it is better to treat them as an array of
subplots which can be referenced by index (working from left to right, and top to bottom).
n=3
fig, axs = plt.subplots(n, n)
for i in range(n):
for j in range(n):
axs[i,j].set_title(str(n*i+j))
fig.tight_layout(pad=3.0)
An alternative approach is to add them one at a time as follows. Here only the 1st and 4th plots
are added in a figure configured to contain 2x2 subplots.
fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,4)
plt.show()
Secondly, we must ensure that the tick labels are suitably spaced and formatted. Fortunately
matplotlib.dates provides a number of convenient functions to perform such operations. In
the following code MaxNLocator is used to indicate the maximum number of ticks you would like to
see, and a format is created (using DateFormatter) and then applied using the set_major_formatter
method of the axis.
import matplotlib.dates as mdates
ax.xaxis.set_major_locator(plt.MaxNLocator(5))
myFmt = mdates.DateFormatter('%d-%b')
ax.xaxis.set_major_formatter(myFmt)
For more control it is possible to create arrays of dates to define the tick labels you want to
see.
#Create data
CPI = [1.2, 1.4, 0.9, 0.4, -0.2, 0.3 ]
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
#Create components
plt.bar(months, CPI, color='red')
plt.xlabel('Month')
plt.ylabel('Inflation (%)')
plt.title("UK Inflation")
plt.show()
#Create chart
fig, ax = plt.subplots()
#Add components
ax.bar(months, CPI, color='red')
ax.set_xlabel('Month')
ax.set_ylabel('Inflation (%)')
ax.set_title('UK Inflation')
With two or more series, care must be taken to control the position of the bars. These could be set
side by side, deliberately made to overlap, or stacked (stacked=True). The code below generates
the chart shown in Figure 10.2b. To illustrate the options available, it has been created with
horizontal bars.
plt.ylabel('Month')
plt.xlabel('Inflation (%)')
plt.title("UK Inflation")
plt.legend(loc='upper right')
plt.show()
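A minimal self-contained sketch of such a two-series chart (the second series values are assumed for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
cpi = [1.2, 1.4, 0.9, 0.4, -0.2, 0.3]
rpi = [1.5, 1.6, 1.1, 0.6, 0.1, 0.5]  # illustrative second series

y = np.arange(len(months))
h = 0.4  # bar height; offsets place the two series side by side
plt.barh(y - h/2, cpi, height=h, label='CPI')
plt.barh(y + h/2, rpi, height=h, label='RPI')
plt.yticks(y, months)
plt.ylabel('Month')
plt.xlabel('Inflation (%)')
plt.title('UK Inflation')
plt.legend(loc='upper right')
plt.savefig('bars.png')
```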
• https://scc.ms.unimelb.edu.au/resources-list/data-visualisation-and-exploration/no_pie
-charts
If you really must, the following code will create the pie chart in Figure 10.3a. Notice that the
values are converted into percentages by default. Full documentation is available from https://
matplotlib.org/3.1.1/gallery/pie_and_polar_charts/pie_features.html.
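A minimal sketch of such a pie chart (data as used in the donut example below):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

values = [5, 7, 2, 8]
labels = ['A', 'B', 'C', 'D']

# autopct converts the raw values into displayed percentages
plt.pie(values, labels=labels, autopct='%1.1f%%')
plt.savefig('pie.png')
```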
A way to avoid the angle problem is to switch to a donut plot (Figure 10.3b). This isn’t really a
separate chart type. As the code reveals, it is created by effectively blanking out an inner circle
from the pie chart.
# create data
values = [5,7,2,8]
labels = ['A','B','C','D']
plt.show()
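The blanking-out step is sketched below: a white circle is drawn over the centre of an ordinary pie chart (circle radius assumed):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

values = [5, 7, 2, 8]
labels = ['A', 'B', 'C', 'D']

plt.pie(values, labels=labels, autopct='%1.1f%%')
# Blank out the middle with a white circle to form the donut
centre = plt.Circle((0, 0), 0.5, color='white')
plt.gca().add_artist(centre)
plt.savefig('donut.png')
```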
10.6 Histogram
Here we generate a histogram, shown in Figure 10.4a.
#Create a histogram
z = np.random.randn(1000)
plt.clf()
plt.hist(z, bins='auto')
plt.title("Normal?")
plt.show()
10.7 3D-plots
Three dimensional plots require Axes3D to be imported. An example is provided below with output
as shown in Figure 10.4b. See https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html for
further examples.
#Mesh Grid
n = 101
x = np.linspace(1,3,n)
y = np.linspace(1,3,n)
z = np.zeros((n,n))
for i in range(0,n):
for j in range(0,n):
z[i,j] = x[i]*y[j]**2 / (x[i]**2 + y[j]**2)
plt.clf()
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
x, y = np.meshgrid(x, y)
surf = ax.plot_wireframe(x, y, z)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
fig.show()
10.8 Boxplots
A boxplot or Box & Whisker plot displays the distribution of a variable. The box (i.e. the rectangle)
shows the interquartile range with the median marked as a bar in the middle (but not necessarily
the midpoint). The whiskers (i.e. the lines extending out of the rectangle) extend to show some
proportion of the remaining distribution. How far they extend is normally configurable and can be
set to a certain percentile, or a multiple of the interquartile range. Any values beyond the whiskers
are plotted as individual points representing outliers.
The code below shows how to create a boxplot with the result shown in Figure 10.5.
#Create a figure
fig, ax = plt.subplots(dpi=300)
plt.show()
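The plotting step between those two lines is not shown; a minimal self-contained sketch (illustrative data):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Two illustrative samples to compare side by side
np.random.seed(0)
data = [np.random.normal(0, 1, 200), np.random.normal(1, 2, 200)]

fig, ax = plt.subplots(dpi=300)
ax.boxplot(data)
ax.set_xticklabels(['A', 'B'])
ax.set_ylabel('value')
fig.savefig('boxplot.png')
```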
Pandas dataframes have a boxplot method to generate such plots using different columns, or the
same columns grouped in different ways. See https://pandas.pydata.org/pandas-docs/stable/
reference/api/pandas.DataFrame.boxplot.html for details.
pd.plotting.scatter_matrix(df, c=df.outcome)
[Figure: scatter matrix of fedfunds, cpi, ind and pay, with histograms on the diagonal and points coloured by outcome]
Notice how a categorical variable has been used to colour each observation. This can be useful
when investigating which variables might be used to classify future unseen observations.
import folium
country_shapes = r'.\maps\world-countries.json'
the_map = folium.Map(tiles="cartodbpositron", location=[20, 10],
                     zoom_start=3)
the_map.choropleth(
geo_data=country_shapes,
name='choropleth',
data=df,
columns=['country', 'downloads'],
key_on='properties.name',
bins = [0,10,50,100,500],
fill_color='Reds',
nan_fill_color='white',
fill_opacity=0.8,
line_opacity=0.1,
)
the_map
the_map.save('SSRN_choropleth.html')
10.11.1 seaborn
Seaborn is based on matplotlib and is designed to work with pandas. For convenience it pro-
vides features to set styles and colour themes for consistent formatting across charts. See https://
seaborn.pydata.org/introduction.html for full details.
import pandas as pd
import seaborn as sns
#example data (assumed for illustration)
df = pd.DataFrame({'label': ['A', 'B', 'C'], 'value': [3, 7, 5]})
sns.set_style("whitegrid")
ax = sns.barplot(data=df, x="label", y="value")
fig = ax.get_figure()
fig.savefig("seaborn.png", dpi=400)
10.11.2 plotly
Plotly (now part of the Anaconda installation) produces interactive charts displayed within browsers.
Its functionality is documented at https://plotly.com/python/.
import plotly.graph_objects as go
import numpy as np
np.random.seed(1)
N = 100
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
sz = np.random.rand(N) * 30
fig = go.Figure()
fig.add_trace(go.Scatter(
x=x,
y=y,
mode="markers",
marker=go.scatter.Marker(
size=sz,
color=colors,
opacity=0.6,
colorscale="Viridis"
)
))
Note that figures generated using plotly will not display in Spyder. Setting plotly's default renderer,
and calling show, directs the output to your browser instead:
import plotly.io as pio
pio.renderers.default = 'browser'
fig.show()
Plotly supports the creation of choropleth maps.
Machine learning texts often refer to attributes and observations. Those more familiar with database
terminology may prefer to think in terms of fields and records. In supervised learning, the observa-
tion attributes (i.e. the regressor, co-variate or independent variables) are used to ascertain the
value of the target variable (i.e. the regressand or dependent variable). In general, any model
must first be trained on a training set and then evaluated using a distinct testing set. The training
set can be subdivided into data that will be used to construct the model, and data that will be
used to validate and tune the model.
In finance, machine learning techniques have been used for fraud detection, credit decisions and
time series estimation. For a Python linked introduction to the topic of machine learning see Müller,
Guido, et al. (2016). In the rest of this chapter we will heavily utilise scikit-learn.
11.2 Sampling
Train and test subsets must be extracted from a larger dataset. Test data should be out-of-sample,
that is, previously unseen by the model. While a portion of the training data may be used for
validation, this only tests how well the model fits the training data. Testing with an independent
dataset validates how well the fitted model performs on unseen data.
The idea is similar to student examination: a meaningful assessment should be unseen. If students
were assessed using only past papers to which they had access, one would not be surprised if
they performed well. In such circumstances, what is being tested is not that which is intended: the
ability to apply knowledge rather than recall the solutions to previous problems.
There are different ways that this can be achieved and the appropriate method will depend on
the goal of the modeling.
#First 5 rows
X = df.iloc[:5,]
#Random 5 rows
X = df.sample(5, replace = False)
Scikit-learn provides a sample_without_replacement function that selects a random subset from the
integers {0, 1, . . . , n−1}. These values can be used to select a subset of rows from a dataframe.
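For example (a sketch; the dataframe is illustrative):

```python
import pandas as pd
from sklearn.utils.random import sample_without_replacement

df = pd.DataFrame({'x': range(100)})

# Pick 5 distinct row positions from 0..99, then select those rows
rows = sample_without_replacement(n_population=len(df),
                                  n_samples=5, random_state=42)
sample = df.iloc[rows]
print(sample)
```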
When working with time series data, a random sampling approach may not be appropriate. For
example, if a modeler was building a real-time forecasting model, at the point of prediction, they
would have no access to information from the future. Building a model that relies on data that
would not have been available at the time is fraudulent. One would dismiss a forecaster who told
us that they could accurately predict the price of oil in 6 months provided we first supply them with
the price of oil in 5 months and 7 months time.
In such instances the goal should be to use out-of-time sampling (Stein, 2002). The idea is illustrated
in Figure 11.2 which represents a time line, with older observations to the left and more recent
observations to the right. An intermediate point is used to divide the dataset into pieces. The
shaded portions represent time buffer that may need to be excluded:
1. Used to calculate backward differences which is not possible for observations at the start of
the dataset.
2. Used for training data outcomes; training data should not include outcomes that would be
unknown at the point of the first test observation.
3. Outcomes are in the future and therefore unknown so cannot be used for testing (yet).
Assuming the dataset is sorted chronologically, it may be possible to partition the dataset by taking
the first k rows and the last n − k rows. Alternatively pandas indexing can be used to select based
on dates.
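A sketch of both options (illustrative data; k is assumed):

```python
import pandas as pd

# Illustrative chronologically sorted data
df = pd.DataFrame({'x': range(10)},
                  index=pd.date_range('2024-01-01', periods=10))

# Option 1: first k rows for training, the remainder for testing
k = 7
train, test = df.iloc[:k], df.iloc[k:]

# Option 2: select by date (slice endpoints are inclusive)
train2 = df.loc[:'2024-01-07']
print(len(train), len(test))
```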
Using its optional stratify parameter, the scikit-learn train_test_split function (introduced below)
can also be used for stratified sampling, which samples
from within subgroups, rather than the entire population. This can be used to make sure that
observations from different groups are appropriately represented in the random sample. This can
be preferable when working with imbalanced sets, for example, when the ‘positive’ cases in a
clinical trial are a small proportion of the population.
Partitioning the data, and randomly selecting testing and training data, can be performed in a
single step using the scikit-learn train_test_split function.
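A minimal sketch (illustrative data; a 30% test proportion is assumed):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'x': range(10), 'y': [0, 1] * 5})

# Hold back 30% of rows for testing; stratify keeps the class
# balance of y similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    df[['x']], df['y'], test_size=0.3, stratify=df['y'],
    random_state=0)
print(len(X_train), len(X_test))
```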
but it could easily be a rating from 1 to 5. The process begins by choosing an attribute Xi that can
be used to divide the data into two disjoint (but not necessarily equally sized) groups such that
B ∪ C = A. The idea is illustrated in Figure 11.3.
Each question in an interview reveals more information about a candidate, but not all information
is equally valuable in making a hiring decision. In our example the attribute X3 has been chosen
precisely because it is considered to maximise the information gain. This may relate to an interview
question that records the number of years of relevant experience (i.e. a numerical value), and the
test might be X3 ≥ 4. Candidates with less than 4 years experience are assigned to group B (where
only 15% of future employees are considered good hires), while candidates with 4 or more years
experience are assigned to group C (where 65% of candidates are considered good hires). Note
that the threshold 4 has been just as carefully selected as the attribute X3 .
To further partition group C, attribute X4 is chosen. This may relate to a question about qualifications
that records the candidate's degree subject (i.e. categorical data). Candidates where
X4 ∈ {Finance, Economics} are assigned to group G (where 90% of candidates are considered
good hires), with other candidates being assigned to group F (where only 20% of candidates are
considered good hires).
For more background, see for example Witten, Frank, and Hall (2005). To fit a decision tree model
in scikit-learn:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)
A more colourful visualisation (making it easier to distinguish between classifications at each node)
can be achieved with the following code. Note that this requires installation of pydotplus.
import pydotplus
from sklearn.tree import export_graphviz
text = export_graphviz(clf, filled=True, rounded=True,
        feature_names=feature_cols, class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(text)
graph.write_pdf('tree2.pdf')
The entropy of a set S whose members fall into n classes with proportions p_i is:

E(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)    (11.1)

The information gain from splitting S into two subsets S_1 and S_2 is:

G(S) = E(S) - \sum_{j=1}^{2} \frac{|S_j|}{|S|} E(S_j)    (11.2)

An alternative impurity measure is the Gini index:

\mathrm{Gini}(S) = 1 - \sum_{i=1}^{n} p_i^2 = \sum_{i=1}^{n} p_i (1 - p_i)    (11.3)
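These measures can be computed directly from class proportions; a minimal sketch:

```python
import numpy as np

def entropy(p):
    # E(S) = -sum p_i log2 p_i, ignoring empty classes
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    # Gini(S) = 1 - sum p_i^2
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([0.5, 0.5]))  # 1.0 for an evenly split node
print(gini([0.5, 0.5]))     # 0.5
```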
11.3.3 Algorithm
The decision tree algorithm thus proceeds by repeatedly partitioning the data in a manner that
maximizes information gain as defined by a chosen function such as entropy or the Gini index. In
theory this partitioning can continue until:
• all observations at a node are of the same class
• a node contains only a single observation (a subcase of above)
• the attributes do not allow observations to be partitioned in a way that increases information
gain.
Such terminal nodes are referred to as the leaf nodes. However, there is a danger of over-fitting:
this is where the model may provide an excellent fit to the training data but fails to generalise well
(i.e. performs poorly when applied to unseen data).
11.4 Evaluation
We consider now how to evaluate a model.
11.4.1 Prediction
Once a model has been trained, it can be applied to the (unseen) test data. Scikit-learn provides
a sklearn.metrics module to help analyse the results.
Predicted
A B C
A 15 6 1
Actual B 4 13 2
C 0 4 19
How such results are interpreted will be context dependent. Consider, a classification problem with
only two possible outcomes. Classic examples include medical diagnostic tests (where patients
test positive or negative) and criminal trials (where the defendant is guilty or not guilty). An abstract
confusion matrix for such problems is shown in Table 11.2.
Predicted
P N
P TP FN
Actual
N FP TN
False negatives are referred to as type II errors. For medical patients being tested for a harm-
ful disease this falsely gives them the all clear. For defendants on trial this is acquitting a guilty
person.
As these examples illustrate, not all errors are equal. Instinctively a type I error seems of much greater
consequence in a court case. Calibrating diagnostic tests and classification models requires a
trade-off between different outcomes based on their relative significance.
Suppose the total number of actual positive cases is P and the total number of actual negative
cases is N. The True Positive Rate (TPR) is given by:

\mathrm{TPR} = \frac{TP}{P} = \frac{TP}{TP + FN}
This is also known as the sensitivity or hit rate. The True Negative Rate (TNR) is given by:

\mathrm{TNR} = \frac{TN}{N} = \frac{TN}{TN + FP}
This is also known as the specificity. The False Positive Rate (FPR) is:

\mathrm{FPR} = \frac{FP}{N} = \frac{FP}{FP + TN} = 1 - \mathrm{TNR}

The overall accuracy (ACC) is:

\mathrm{ACC} = \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + TN + FP + FN}
In scikit-learn, the confusion matrix can be derived from the predicted results as follows.
cm = confusion_matrix(y_test, y_pred)
print(cm)
Clearly if a high bar is set (the threshold moves to the right), fewer observations will be classified as
positive and more as negative. Taken to an extreme, if the threshold is set above the maximum
score, all observations will be classified as negative. This gives a TPR of 0 and an FPR of 0. Taken
to the other extreme, if the threshold is set below the minimum score, all observations will be
classified as positive. This gives a TPR of 1 and an FPR of 1. For intermediate values of the threshold,
0 ≤ TPR, FPR ≤ 1. Plotting this for different thresholds will therefore generate a plot similar to
Figure 11.6.
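The (FPR, TPR) pairs underlying such a curve can be computed with scikit-learn's roc_curve function (a sketch with illustrative scores):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Illustrative classifier scores for a two-class problem
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) pair per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(fpr, tpr)

# Area under the ROC curve summarises the trade-off
print(auc(fpr, tpr))
```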