
Python for Finance

Dr Alan Hanna

FIN3028 Companion Notes 2023/24


"If the purpose of education is to score well on a test, we’ve lost sight of the real reason
for learning."

Richard Feynman

Preface
This document is intended as a companion quick reference guide to the ‘Python for Finance’
module (FIN3028) at Queen’s Business School.
Feedback on the content is welcome and should be sent to [email protected]. Knuth-style re-
wards are offered to students currently undertaking the module who identify mistakes (grammat-
ical, typographical, mathematical, factual, numerical, logical or otherwise) in the manuscript.
Propagated or compounded errors will be treated as a single observation. Terms and conditions
apply :-)

http://xkcd.com/1195/

Contents

I Python Fundamentals 1
1 Introduction 2
1.1 What is Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Anaconda Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Installing Packages using pip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.3 Installing Packages using conda (preferred method) . . . . . . . . . . . . . . . . . . . 3
1.2.4 Upgrading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.5 Python Versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Running Python via a browser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Becoming a better developer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Getting Started 6
2.1 Code Editors & IDEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Command Prompt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 IDLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Spyder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.1 Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.2 Console . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.3 Variable Explorer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.4 Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.5 Help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Jupyter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.1 Launching Jupyter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.2 Jupyter Notebooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.3 Code Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.4 Markdown Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.5 nbextensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Other IDEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.1 VS Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.2 RStudio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Online Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Ethical Considerations 18
3.1 Security and Ownership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Societal Contribution and Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Professional Conduct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4 Python Basics 21
4.1 Code Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.3 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.1 assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.2 Variables as memory locations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.3 underscore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.4 print . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.4 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.5.1 String methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5.2 String formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5.3 f-string . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.5.4 Regular expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Dates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.7 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.8 Containers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.8.1 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.8.2 Tuples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.8.3 Ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.8.4 Working with arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.8.5 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.8.6 Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.8.7 List Comprehensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.9 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.9.1 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.9.2 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.10 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.10.1 Standard Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.10.2 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.10.3 Import . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.11 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.11.1 Positional arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.11.2 Optional arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.11.3 Arbitrary argument lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.11.4 Multiple return values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.11.5 Lambda Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.11.6 Modifying input parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.11.7 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.11.8 Inner Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.11.9 Generator Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.11.10 Type annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.12 Error Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.13 Enumerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.14 Other python features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.14.1 Assignment expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.14.2 Decorators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5 Development 53
5.1 Life cycle models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3 Coding Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4 Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4.1 Logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4.2 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4.3 Line by line execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.5 Good developer practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

6 Object Oriented Programming 58


6.1 Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2 Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2.1 Dunder Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2.3 Instantiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2.4 Encapsulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2.5 Inheritance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

II Data Processing 63
7 Libraries 64
7.1 NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.1.1 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.1.2 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.1.3 Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.1.4 Reshaping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.1.5 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.2 SciPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.3 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.3.1 Getting started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.3.2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.3.3 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.3.4 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.3.5 Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.3.6 Dates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.4 scikit-learn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.5 tkinter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.5.1 Widgets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

8 I/O Operations 78
8.1 User input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.2 File based input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.2.1 File paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.3 Text files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.4 CSV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
8.5 Pickle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.6 Excel workbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.7 Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.7.1 Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.8 JSON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.9 XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.9.1 XPath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.10 HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.10.1 Web scraping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.10.2 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.10.3 Downloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.10.4 Browser based web scraping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.10.5 Responsible web scraping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.11 API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.11.1 Using an API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.11.2 Creating an API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.12 Financial Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

8.12.1 pandas-datareader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.12.2 yfinance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.12.3 WRDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

9 Data Handling 97
9.1 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.1.1 Descriptive Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.1.2 Categorical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.1.3 Numerical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
9.2 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
9.3 Data Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
9.3.1 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.3.2 Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.4 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
9.5 Duplicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
9.6 Data Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
9.6.1 Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
9.6.2 Binning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.6.3 Dummy variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
9.6.4 Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
9.6.5 Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.6.6 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.7 Reshaping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
9.7.1 Stack/Unstack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
9.7.2 Melt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
9.8 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
9.8.1 Group by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
9.8.2 Pivot table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
9.9 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

III Data Analysis 118


10 Data Visualisation 119
10.1 Creating beautiful evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
10.2 Matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
10.3 Scatterplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
10.3.1 Working with time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
10.4 Bar chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
10.5 Pie Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
10.6 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
10.7 3D-plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
10.8 Boxplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
10.9 Scatter Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
10.10 Choropleth Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
10.11 Other libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
10.11.1 seaborn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
10.11.2 plotly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
10.11.3 Interactive Dashboards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

11 Machine Learning 133


11.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
11.2 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
11.2.1 Without replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

11.2.2 With replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135


11.2.3 Features and Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
11.3 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
11.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
11.3.2 Information Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
11.3.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
11.3.4 Model Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
11.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
11.4.1 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
11.4.2 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
11.4.3 ROC Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
11.4.4 AUC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
11.4.5 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.5 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.5.1 Model Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.5.2 Variable Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.6 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.7 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
11.8 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
11.8.1 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
11.8.2 Low variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
11.8.3 Random substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
11.8.4 Variable importance approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
11.8.5 Univariate feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
11.8.6 Recursive feature elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

12 Natural Language Processing 151


12.1 NLTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
12.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
12.2.1 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
12.2.2 Stop Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
12.2.3 Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
12.3 Document Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
12.4 Document Term Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
12.5 Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
12.6 Topic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
12.6.1 LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
12.6.2 NMF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
12.7 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
12.7.1 SpaCy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

IV Appendix 160
Index 161

Bibliography 164

Part I

Python Fundamentals
1 Introduction

1.1 What is Python?


Python is a powerful, high-level, multipurpose programming language. It is an interpreted lan-
guage, meaning that code can be executed immediately without first being converted by a com-
piler into executable machine-level code. This can mean that Python is slower to execute (than
say, Java or C++) and more prone to run time errors. This is one of the reasons why Python is often
considered a ‘glue’ language that is useful for connecting different software components.
It can be used for a wide range of purposes including scripting, file I/O, text manipulation, number
crunching and web services. The core language is complemented by a plethora of available
packages.
A major benefit of Python is that it is much easier to learn than other languages and places an
emphasis on readability, reducing the cost of developing and maintaining code. It is particularly
suitable for Rapid Application Development and prototyping. In Python, novice developers are
not forced to think about things such as static typing of variables or garbage collection (freeing
memory space after usage).
Further benefits include:
• Python is open-source, meaning that it is free to use and distribute
• The language supports Object-Oriented Programming
• The code is portable, meaning that it can run across a range of platforms such as Windows,
Mac OS, Unix/Linux, etc

1.2 Installation
Python can be downloaded as a stand-alone installation, or as part of a larger package. For
this module we will use the Anaconda installation. Default installation configurations should be
fine.

• https://www.python.org/downloads/
    – Stand-alone installation

• https://www.anaconda.com/download/
    – Comes with additional packages and tools
    – Can make it easier to manage package installations
    – Installed on all PCs throughout the QUB network
    – A requirement for this module!

See https://docs.anaconda.com/anaconda/install/ for installation instructions for different operating systems.
Note that all required software is pre-installed on PCs throughout QUB. You will not have sufficient
privileges to install software and may not be able to launch command prompts.

Figure 1.1: Anaconda Prompt

1.2.1 Anaconda Packages


Anaconda supports a large number of packages already so installing additional ones may not
be something you need to do. To see those that are supported and those which are included in
each Anaconda version visit the online documentation at https://docs.anaconda.com/anaconda/packages/pkg-docs/ and select the relevant installer.

1.2.2 Installing Packages using pip


pip is the package installer for Python and is included with modern versions of Python. New pack-
ages can be installed from the command line with the command:

pip install "packagename"

Note that you may require administrator rights to install new packages. In Windows, you should
right click and choose ‘run as administrator’ when launching the command prompt (cmd). Within
Anaconda, you should use the ‘Anaconda Prompt’ application from the Start Menu (again run-
ning with admin rights) as shown in Figure 1.1. Installing one package may prompt you to upgrade
others.

1.2.3 Installing Packages using conda (preferred method)


Conda is a package management system that installs and updates packages based on their de-
pendencies (see https://www.anaconda.com/blog/understanding-conda-and-pip). It can also be
used to create and manage different ‘environments’ (separate installations of Python and pack-
age configurations). For personal installations it is not recommended that you perform any up-
grades until the module is complete. The default installation should be sufficiently up-to-date and
contain all required packages.
Upgrading of applications (i.e. Spyder and Jupyter) can also be performed from the Anaconda
Navigator application by clicking the gear icon beside each application. Anaconda Navigator
can also be used to review, create and amend environments (configurations of specific versions
of packages that you want to use for particular projects). For example, in general it is desirable
to keep all packages updated to the latest version. However, not all releases are backward com-
patible which can lead to breaks in code that was previously working.
Individual packages can be installed using the conda install command. For help with commands
to install specific packages, first search for the package name on https://anaconda.org/. As
shown in figure 1.2, this will provide additional package details and the command to install.

Figure 1.2: conda install

1.2.4 Upgrading
The easiest way to upgrade your Anaconda installation is from the Anaconda prompt with the
following command:

conda update --all

Individual packages (with necessary dependencies) may be updated with commands such as:

conda update numpy

1.2.5 Python Versions


A number of revisions were made to Python between version 2 and 3. While all the code in this
manual will run under Python 3, this may not be the case with Python code you may source online or from older textbooks.

print "No brackets required in Python 2"


print("Bracket required in Python 3")

To check which version of Python you are running, run:



#Confirm Python version


import sys
print(sys.version)

Normally new versions will be backward compatible (i.e. old code will still run under new versions)
but this is not always the case (for example when moving from version 2.x to 3.x).
Note that it is possible to have multiple versions of Python on one computer. In Windows, the
version that is used can be configured in the Windows path. Python helpfully provides a guide to
installations on Windows.
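
Where code depends on a particular Python version, a simple guard can make the requirement explicit. The following sketch is illustrative only (the version threshold is an arbitrary example):

#Guard against running under an unexpectedly old interpreter
import sys

if sys.version_info < (3, 8):
    raise RuntimeError("This code expects Python 3.8 or later")
print(sys.version_info.major, sys.version_info.minor)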

1.3 Running Python via a browser


It is now possible to execute Python code within a browser without a local Python installation. The
following sites permit Python (among other languages) to be executed using their servers:
• https://colab.research.google.com
• https://repl.it/
• https://ideone.com/
• https://trinket.io/

1.4 Becoming a better developer


Developers constantly have to refresh their skills as they solve new problems and technologies
evolve. It is naïve to think that one can truly master a programming language or technology
by reading a book or completing a training course: mastery requires practice and a willingness
to keep learning. There are a multitude of online resources that can be accessed to improve
your Python proficiency. There is also an active community of developers who have faced
and overcome many of the problems you are likely to encounter (see for example stackoverflow).
Efficient developers will draw on these communities to quickly resolve technical problems. In this
respect, Google (or ChatGPT) may be a developer’s best friend.
2 Getting Started

2.1 Code Editors & IDEs


Code can be written using any basic text editor (e.g. notepad) but it is better to use one with
some basic support for coding and knowledge of language syntax. Decent text editors include
Notepad++ (free to download) and Sublime Text (a basic version is free).
Better still is an IDE (Integrated Development Environment). The basic components of any IDE
are:
• Source code editor
• Compiler/Interpreter
• Debugging tools
but desirable features also include:
• Syntax highlighting
• Code completion
• Code Analysis
• Flexible configuration, dockable panels
• Support for projects, version control, etc
Python files (i.e. text files containing code) should be given a ‘.py’ file extension.

2.2 Command Prompt


Make sure Python is set on the path variable of the Windows environment, otherwise Windows may
not know where to find Python. To find the Python install location to add to the path, you can
run:

#Confirm location of installation


import sys
print(sys.executable)

You can run the Python interpreter in interactive mode from a command prompt. Typing ‘cmd’
into the Windows start box will launch a new command prompt as shown in Figure 2.1. If neces-
sary, navigate to the appropriate folder using the ‘cd’ (change directory) command. Relevant
commands are shown in Table 2.1.

Command Action
cd.. Change directory - move to parent directory
cd temp Change directory - move to subfolder temp
python Start an interactive python session
python filename.py Execute the python script filename.py
ctrl + c Quit the python session
quit() <enter> Quit the python session

Table 2.1: Commands for use with a Windows command prompt



Figure 2.1: Windows command prompt

2.3 IDLE
IDLE is a basic IDE that ships with many versions of Python. For Windows installations, it will appear
in the Start menu under Python.

Figure 2.2: IDLE

Statements can be executed at the prompt, or it is possible to create (and save) code modules
that can be executed in full by pressing F5.

2.4 Spyder
Spyder is a richer IDE that is part of the default Anaconda installation. Typical of IDEs, it features
dockable panels as shown in Figure 2.3. This allows developers to simultaneously develop and
execute commands, while inspecting the results.
In the following sections we will briefly outline key features of the main components. Full documen-
tation is available from https://www.spyder-ide.org/.
Before beginning make sure to set the working directory. This is where Python will look for files that
you might attempt to import and where it will export files (when not supplied an explicit path). In
Spyder, you can quickly set the working directory to match the location of an open code module
by right clicking its tab and choosing ‘set console working directory’.
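
The working directory can also be checked, or changed, from code, as in the following sketch (the path shown is purely illustrative):

#Confirm (and optionally change) the current working directory
import os

print(os.getcwd())
#os.chdir(r"C:\Users\yourname\FIN3028")   #hypothetical path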

2.4.1 Editor
The editor allows developers to work across multiple code modules. Each tab represents a text
file saved with a ‘.py’ extension. The Spyder code editor incorporates a number of standard

Figure 2.3: Spyder

developer features:
• Colour coding of text (defaults are grey for comments, blue for keywords, magenta for built-in
words, red for numbers, green for strings)
• IntelliSense can be used to auto-complete text (CTRL+space)
• Tooltips that appear to explain function prototypes
• Code tidying features (see source menu)
• ‘Compiler’ warnings - visual clues that code may contain errors
There are shortcuts for quickly commenting/uncommenting or indenting/unindenting lines of code (see the edit
menu). Many of these settings can be customised from Tools > Preferences. By default one tab
or indent is set to 4 spaces. By opening a file under different configuration settings it is easy to
create files with inconsistent indentation. To check whether you are using spaces or tabs, turn on
‘Show blank spaces’ under the source menu. The latest version of Spyder is compatible with kite,
a plug-in that can be installed to enhance auto-completion features.
A further feature of the editor is that it will highlight potential errors by adding a warning icon in
the margin beside the offending line of code. Hovering over the icon with the mouse will reveal a
short description of the detected issue.
Code within modules can be broken into cells to allow them to be executed as single stand-alone
blocks. This is achieved by lines starting with either:

#%% your text


# %% your text

Cells can also be used for code outlining which is useful for quickly navigating around your code
(enable via View > Panes > Outline). Code can be run in different ways:

• From the run or debug menus


• From toolbar buttons (figure 2.4)
• Using shortcut (Function) keys (table 2.2)
• From the console window using the command:

exec(open('./filename.py').read())

Figure 2.4: Spyder: Code execution toolbar

Shortcut Action
F5 Run current script
F9 Run current line or selection
CTRL+RETURN Run active cell
SHIFT+RETURN Run active cell and advance

Table 2.2: Spyder: Code execution shortcuts

2.4.2 Console
The console is used for both input and output. Commands can be executed directly in the con-
sole. Output will appear here by default. It is possible to create multiple console windows (from
the Consoles menu). Output from running code in the editor will appear in the active console
window. See also Preferences > Run options.

Figure 2.5: Spyder Console

A ‘History log’ will show a list of previously executed statements. If you wish to re-execute com-
mands, you can copy them from here. Alternatively, when the console window is active, the up and
down cursor keys will cycle through previously executed statements.

2.4.3 Variable Explorer


The variable explorer window (figure 2.6) will list variables that have been created (and not yet
destroyed) within the current Python session. Here you can see their name and type, and inspect their
current value. This will include objects and functions. For variables representing more complicated
data structures (e.g. matrices or classes) it may be possible to double click the variable to see an
expanded view of the underlying data.

Figure 2.6: Spyder Variable Explorer

Some configuration changes may be required to explicitly include (or exclude) certain variable
types. For example, to inspect objects (instances of classes) follow the configuration change
shown in figure 2.7.

Figure 2.7: Spyder Preferences

2.4.4 Debugging
Coding mistakes that cause incorrect results are known as ‘bugs’. Locating and correcting such
mistakes is referred to as ‘debugging’.
Normally, code is executed as quickly as your computer architecture allows. One way to debug
your code is to instead execute your code one statement at a time, tracing the sequence of
execution and inspecting intermediate results to detect the point of failure. Since the point of

failure may occur only after thousands of lines of code, breakpoints allow you to pause code
execution at critical points in your code.
Breakpoints appear as red dots in the editor margin and are activated (and deactivated) by
clicking in the margin or via keyboard shortcuts. A full list of breakpoints can be found in the
breakpoint window (view > panes > breakpoints).
Commands for debugging can be accessed via the Debug menu, toolbar or keyboard short-
cuts (Table 2.3). When running code in debug mode, the next statement to be executed will be
highlighted in the editor.

Shortcut Action
F12 Breakpoint toggle
(or just click in margin beside the line number)
SHIFT+F12 Set/edit conditional breakpoint
CTRL+F5 Debug script
CTRL+F10 Debug one line at a time
CTRL+F11 Step into (function)
CTRL+SHIFT+F11 Step out
CTRL+F12 Continue execution
CTRL+SHIFT+F12 End execution

Table 2.3: Spyder: Code debug shortcuts

2.4.5 Help
Spyder’s help system can be activated by CTRL+I, after placing the cursor adjacent to text in
the editor. If the text is recognised as part of the Python language or a package, help will appear as
shown in figure 2.8.

Figure 2.8: Spyder Help

An interactive help tool is also built into Python. To invoke, simply issue the command help().

#At the standard >>> prompt, launch the help utility
>>> help()

#The prompt changes to help> - now type commands at this prompt

#Find a list of keywords
keywords

#Find available modules
modules

#To exit
quit

While documentation and built-in help functions are valuable resources, a host of online developer
communities has already encountered and solved most problems you can ever think of. Google
and forums such as https://stackoverflow.com/ are normally instant sources of solutions.
Further ‘help’ is provided by the following Spyder features:
• Linting - a background process that analyses your code checking for syntax errors and suspi-
cious code. Issues are indicated in the margin with colour coded symbols to denote invalid
syntax and warnings. Hover over the symbol for an explanation.
• Kite (separate installation required) provides AI powered code completions. Press Control +
Space for a popup list of potential options to complete the text you are typing.
• The Code Analysis panel provides a list of potential errors, bad practices, quality issues and
style violations. See https://docs.spyder-ide.org/current/panes/pylint.html for more details.

2.5 Jupyter
Jupyter is a browser-based environment that allows you to create interactive notebooks, containing
text, code, results and visualisations using a range of languages including Python. Jupyter is in-
cluded with Anaconda. Full documentation is available from https://jupyter.org/documentation.

2.5.1 Launching Jupyter


Once launched, Jupyter should activate a default browser. You may also notice a console window
appear as shown in figure 2.9: you should ignore this. However, if you would prefer to run Jupyter
in a different browser, you may wish to copy the ‘localhost’ url from the console and paste this
into the browser address bar. Closing this console window will terminate the kernels that facilitate
Python sessions.
The Jupyter Home page (figure 2.10) allows you to navigate through the folders on your com-
puter that Jupyter has access to. For a standard windows setup this will default to the directory
‘c:\users\youruserprofile’. To work under a different folder, open an Anaconda prompt, and either
‘cd’ to the folder and run:

jupyter notebook

or explicitly set the directory you want to use:

jupyter notebook --notebook-dir c:\preferredfolder

Once jupyter has launched, navigate to the directory where your notebooks are saved or where
you wish to create a new one. Existing notebooks can be opened by clicking on them. Further

Figure 2.9: Jupyter Console

Figure 2.10: Jupyter Home Page

options become available after selection (checking the box beside their name). To create a new
one, use the ‘New’ dropdown to select Python 3 (Notebook). A second tab ‘Running’ on the
home page provides a list of currently active notebooks and terminal instances.

2.5.2 Jupyter Notebooks


Jupyter Notebooks are comprised of cells, of which there are two primary types: code and mark-
down. Cells can be inserted from the toolbar or ‘insert’ menu (figure 2.11). Their type can be
set or modified using the toolbar dropdown. Once the cell contents are complete, they can be
‘run’ from the toolbar or using the keyboard shortcut CTRL+Enter or SHIFT+Enter. Running a code
cell executes the code. Running a markdown cell displays the contents as rich text (i.e. with the
markup applied). To revert to edit mode, simply double click the cell.
Jupyter notebooks can be saved and shared as ‘.ipynb’ files. Like Python files, these are just text
files, but formatted according to the JSON schema. After saving, a notebook can be closed using the
‘Close and Halt’ option from the File menu.
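
Because a notebook is just JSON-formatted text, it can be inspected with Python itself. The following sketch (the filename example.ipynb is purely illustrative) loads a notebook and prints the type of each cell:

#Inspect a notebook file as plain JSON
import json

with open("example.ipynb", encoding="utf-8") as f:
    nb = json.load(f)

#A notebook is a dictionary containing a list of cells
for i, cell in enumerate(nb["cells"]):
    print(i, cell["cell_type"])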

Figure 2.11: Jupyter Notebook

2.5.3 Code Cells


A code cell (just like in Spyder) is a container for code, from single-line statements to functions or
entire algorithms. When run, the code will execute within the kernel, which persists state over time
and between cells. The kernel can be reset (removing all variables from memory) from the Kernel
menu.
When run, the output (if any) will be displayed immediately below the executed cell as shown in
Figure 2.12. A re-opened notebook will display the output of cells run during prior sessions. The output
can be cleared from the Cell menu using the Current Outputs (or All Output) > Clear option.
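
To illustrate how the kernel persists state between cells, the two cells sketched below could be run one after the other; the second can use the variable created in the first because both execute within the same kernel session.

#Cell 1: create a variable
price = 102.5

#Cell 2: run later in the same session - the variable is still in memory
print(price * 1.2)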

Figure 2.12: Jupyter Code Cells

2.5.4 Markdown Cells


A markdown cell is one that is intended to display text. Text is composed in a markup language
which combines both the content and intended formatting (mark up) as text. A guide to mark-
down syntax is available from sites including:
• https://www.markdownguide.org/basic-syntax/
• https://daringfireball.net/projects/markdown/syntax
Markup for heading levels uses the ‘#’ symbol, with three hyphens used for a horizontal rule:

# Heading 1

## Heading 2

### Heading 3

#### Heading 4

---

Bullet points use ‘*’ and numbered lists just use numbers (the numbers you type themselves are not
important):

* bullet 1
* bullet 2
* bullet 3

1. item 1
1. item 2
1. item 3

Bold and italic typefaces are produced by enclosing text within paired *’s. Text representing code
uses the backtick character (normally found to the left of the ‘1’ key on a standard keyboard):

**bold**

*italic*

`text representing code`

It is possible to include the typesetting system LaTeX, which is useful for embedding formulae and
other mathematical notation. Various web sites allow formulae to be created using tools similar
to Microsoft’s equation editor, from which the LaTeX can then be copied and pasted into a Jupyter
notebook. See for example https://www.hostmath.com/.

$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^{2}} $$

While rudimentary tables can be created with simple markup using pipe (|) characters and hy-
phens for column lines and horizontal rules respectively, superior formatting is achieved where ta-
bles are created using HTML syntax (see https://www.w3schools.com/html/html_tables.asp).

<table>
<tr>
<th>First name</th>
<th>Last name</th>
<th>Student Number</th>
</tr>
<tr>
<td>Conor</td>
<td>McNally</td>
<td>12345678</td>
</tr>
<tr>
<td>Jingyi</td>
<td>Zhang</td>
<td>23456789</td>
</tr>
</table>

Images and hyperlinks can be included using Markdown or HTML code:

#Hyperlinks
<a href="http://qub.ac.uk/qms">Queen's Management School</a>

#Embed images from online


![Wrong jupyter](https://photojournal.jpl.nasa.gov/jpeg/PIA00343.jpg)

#Embed images from file (note relative path)


![Local file](./Images/Python.png)

#Alternatively using HTML


<img width="150" src="./Images/Python.png">

Jupyter also provides an insert image option under the Edit menu.
Markdown is often the preferred way of providing documentation (see for example the home
page of any GitHub project). There are many stand-alone (i.e. not part of any coding environment) markdown editors including:
• Typora - a desktop editor (small payment required)
• StackEdit - an in-browser editor
• Notepad++ - via free extension MarkdownViewer++

2.5.5 nbextensions
The Jupyter Notebook Extensions package is a collection of community-contributed tools that add
functionality to Jupyter notebook. As an unofficial package, it must be manually installed. To do
this, it is recommended that you use a conda command from an anaconda prompt:

conda install -c conda-forge jupyter_contrib_nbextensions

Once installed, various extensions can be enabled (see Figure 2.13). One extension is the Accessi-
bility Toolbar. Another is the variable inspector that allows the value of variables to be seen without

the need to print them.

Figure 2.13: nbextension options

2.6 Other IDEs


2.6.1 VS Code
Visual Studio Code is Microsoft’s free source-code editor and supports development across many
different languages, Python included. It supports standard code development features including
syntax highlighting, code completion, debugging and source control.

2.6.2 RStudio
R’s reticulate package allows Python scripts to be run within RStudio. While this can be conve-
nient for heavy users of R who wish to run some Python code, there is currently limited support for
Python development within RStudio. For example, Python variables created will not appear in the
Environment (i.e. variable explorer)1 .
Note that before attempting to run Python code, you must ensure that the location of your pre-
ferred Python installation has been set.

Sys.setenv(RETICULATE_PYTHON = "C:/Users/yourname/Anaconda3/")

2.7 Online Resources


• https://docs.python.org/3/glossary.html - official python glossary
• https://www.pythonmorsels.com/terms/ - unofficial (less technical) glossary
• https://stackoverflow.com/ - developer community question-and-answer website
• https://www.w3schools.com/python/ - python tutorial

1 This may be remedied in version 1.4 of RStudio


3 Ethical Considerations

Hopefully you will enjoy learning how to use Python, manipulate data, and develop models to
inform decision making. With your new-found skills also come some new responsibilities relating to
the ethics of working with data.

ACM ethical principles

The Association for Computing Machinery (ACM) sets out general ethical principles for
computing professionals.

• Contribute to society and to human well-being, acknowledging that all people are
stakeholders in computing
• Avoid harm
• Be honest and trustworthy
• Respect the work required to produce new ideas, inventions, creative works, and com-
puting artifacts
• Respect privacy
• Honor confidentiality

https://www.acm.org/code-of-ethics

3.1 Security and Ownership


When working with data (and code) one should consider its source, who should have access to
it, and what it can be used for.
While much data and code is freely available and may be freely used, there can be restrictions on
its legal or fair usage. Before making derived data or code publicly available, one should check
for:
• limits on commercial (for-profit) exploitation
• restrictions on re-sharing
• requirements to acknowledge the original source or contributor
Data sourced from paid-for services (e.g. Bloomberg) under licensing agreements may not nor-
mally be sold or made publicly available in its raw form.
A related issue is that of data security. Even if you have no intention of sharing data, you may
need to actively take preventative measures to ensure this does not happen. This is particularly
important when the data is private or confidential. For example, proprietary data that may be
commercially sensitive, or data relating to individuals who have a reasonable expectation of pri-
vacy. Consideration here must be given to:
• Storage - how will the data be held (i.e. database, file)
• Location - where will the data be held (i.e. pendrive, harddrive, cloud)
• Access - who can access the data
• Protection - what measures are in place to prevent unauthorized access (e.g. physical re-
strictions, passwords)
• Transmission - will the data be physically or electronically shared (encryption may be re-
quired)

3.2 Privacy
Advances in computing make it easy to collect, store, and distribute data. For individuals, this can
mean that data relating to them is stored, processed, and monitored, without their knowledge
or consent. Anyone collecting or processing such data should be sensitive to the concerns of
individuals. Indeed, legislation recognises the ‘data subject’ by granting them rights and imposing
responsibilities on those that collect, store, or process such data.

General Data Protection Regulations

In the EU, General Data Protection Regulations (GDPR) apply. Failure to comply with the
GDPR can result in a fine of up to €20 million or 4% of annual worldwide turnover, whichever
is greater. In the UK this is implemented as the Data Protection Act 2018.

It sets out strict rules called ‘data protection principles’ to ensure that data is:

• used fairly, lawfully and transparently


• used for specified, explicit purposes
• used in a way that is adequate, relevant and limited to only what is necessary
• accurate and, where necessary, kept up to date
• kept for no longer than is necessary
• handled in a way that ensures appropriate security, including protection against
unlawful or unauthorised processing, access, loss, destruction or damage

3.3 Societal Contribution and Impact


Data analysis is ultimately used to inform decision making, and decisions can lead to real-world
consequences for individuals and society. Artificial intelligence poses serious ethical dilemmas:
• facial recognition (‘smart policing’, racial profiling)
• autonomous vehicles (the trolley problem)
• medical diagnosis (false negatives)
• social media targeting (elections)
• social credit scores
This places a responsibility on programmers and developers to maximise social welfare and min-
imise the negative consequences of their work. Particular care should be taken to avoid discrimi-
nation and promote inclusivity.

3.4 Professional Conduct


Special considerations apply to programmers and data analysts in respect of professional con-
duct.
When compiling results, care should be taken to be as transparent as possible. This includes ac-
knowledging authorship of utilised resources, and the source of all data. A key goal is reproducibil-
ity. By making your methodology as clear as possible, others can validate your results or identify
weaknesses in your approach.
White (2000) defines data snooping as when ‘a given set of data is used more than once for pur-
poses of ... model selection’. This occurs when, in an effort to find a satisfactory model, researchers

trial a large number of such models, and then only present their successful results, or fail to ac-
knowledge that the model’s success may only be due to chance.
As an example, think about the huge number of time-series datasets collected in different geo-
graphical locations from different fields (i.e. finance, economics, health, agriculture, etc). While
the chances that any two randomly-selected series will exhibit correlation are low, in a large enough
collection of datasets, finding ones that are highly correlated (where there is no underlying causal
relationship) becomes a statistical certainty. To see the sometimes amusing consequences visit
https://tylervigen.com/discover.

3.5 Further Reading


• ACM Code of Ethics - https://www.acm.org/code-of-ethics
• The Institute for Ethical AI & Machine Learning - https://ethical.institute/
• Creative Commons - https://creativecommons.org/
4 Python Basics

4.1 Code Comments


When writing code it is good practice to make use of code comments. These lines are not executed,
but are used to make the code self-documenting. They can also be used to set reminders or to
temporarily disable code. By default most code editors will format such text in green or grey. In
Python a hash ‘#’ marks the start of a comment. All subsequent text on the line is ignored during
execution. Three quotation marks can also be used to create something called a docstring.

#Short comments begin with a hash (#)

#TODO – simple (unique) term I can search for

"""
This is a multiline string created using triple quotes
Known as a docstring it is used to help document code
"""

Comments are intended to be read by other developers to help them understand (or remember)
why code has been written (its purpose) or why it has been written in a particular way. Code
comments should generally be brief and non-obvious. In this manual they are used to explain
new concepts; such comments would not normally be necessary when the expected reader is
another developer.

4.2 Keywords
Keywords are those reserved for a special purpose and essentially define the vocabulary of python.
For this reason, you should not attempt to use them as variable names. These include keywords
relating to:
• Boolean: True, False, and, or, not
• Conditional logic: if, elif, else, assert
• Iteration: for, while, in, break, continue
• Definitions: def, class, return (note that print and range are built-in functions rather than keywords)
• Exception handling: raise, try, except, finally
You can ask to see the full list by running:

#Find list of Python keywords


import keyword
keyword.kwlist

Together with variables and operators, they can be used to write expressions and statements. An
expression is a piece of code that can be evaluated, resulting in some value. A statement is a
single syntactic unit of code defining an instruction to perform a specific action.
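
For example (a brief illustration, with values chosen arbitrarily):

#'principal * (1 + r)' is an expression: it evaluates to a value (105.0)
#the complete line below is a statement: an instruction to assign that value to balance
principal = 100; r = 0.05
balance = principal * (1 + r)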

4.3 Variables
Variables are a means to store and refer to data values. The concept of a variable is similar to
using (Greek) letters to represent values in algebraic equations. The variable name is just a way to
refer to the value it represents, which of course, may change depending on the problem being
solved. In programming variables are associated with a particular location in the memory of the
computer on which the code is being run. Their names can be anything you choose subject to
some basic rules:
• can contain alphabetic or numeric characters or the underscore character
• they cannot contain spaces
• they cannot start with a number
• they should not start with an underscore (this is used for specific purposes in Python)
• they should not clash with keywords reserved as part of the Python language
These names are referred to as identifiers but the term also includes names given to functions
and classes. Unlike some other languages, variables do not need to be declared before they are
used.

4.3.1 assignment
Variables are created at the instant when data is first assigned to them. A statement which con-
tains a single equals sign (=) is an assignment operation. This should be interpreted as ‘evaluate
the expression on the right hand side and assign its value to the variable on the left hand side’.
After the operation is performed the two sides will be equal (but not necessarily before). Variables
can then be used in expressions, with their current value used to determine the result. Any attempt
to use an uninitialised variable in an expression on the right hand side of an assignment operator
will fail.

#This creates a variable called 'a' the first time it is run


a = 2

#This creates a second (and different) variable


A = 'hello'

#Variable names can be long or short


my_long_named_variable = 3.14

#Output will be 4
print(a + 2)

Note that python is case sensitive. Variable names should be thoughtfully chosen to make them
meaningful and suggestive of their intended purpose. When implementing formulae, variable
names should attempt to mirror those used in standard notation. For example, an interest rate
would be better represented using the letter r than the letter x.
Just as two people may share the same name, it is entirely possible for developers to use the same
name to refer to different values. Such names have the potential to be ambiguous, but this can
be resolved through the concept of scope.
Variables will continue to persist during the python session unless they are explicitly destroyed.

#List variables
dir()

#Delete variables
del x,y,z

In programming, constants are variables with a value that cannot be changed. Unlike some other
languages python does not explicitly support constants. Literals are fixed values that appear in
statements (e.g. numbers and text). Literals should be used sparingly outside of scripts. Instead try
to keep code generic by avoiding the practice of hard coding (using problem-specific values in
what is otherwise a general solution to a class of problems).

4.3.2 Variables as memory locations


All values that are stored as variables are held in memory. The variable name or identifier reveals
where to find the value in memory and is really an address. This is illustrated in Figure 4.1. Variables
can therefore be considered as ‘pointers’ that point to objects.

Figure 4.1: Variables. Source: Trey Hunner’s www.pythonmorsels.com

For a more detailed discussion see https://www.pythonmorsels.com/pointers/.


It is possible to assign one variable to another, but consider the following code.

a = 3
b = a

Clearly b will have the value 3, but what if we then assign b the value 4. Does this also change the
value of a? In this case, the answer is no, but the general behaviour depends on the data type
that the variable contains (and whether they are immutable). Officially ‘assignment statements in
Python do not copy objects, they create bindings between a target and an object.’
If you need to make a copy (that can be independently modified), you can use the copy library
(see https://docs.python.org/3/library/copy.html). Note that there are two types: shallow copy
and deep copy. The difference is important when working with compound objects (see Section
??).
• A shallow copy constructs a new compound object and then (to the extent possible) inserts
references into it to the objects found in the original.
• A deep copy constructs a new compound object and then, recursively, inserts copies into it
of the objects found in the original.
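
A minimal sketch of the difference, using a nested list as the compound object:

import copy

A = [[1, 2], [3, 4]]

#Shallow copy: a new outer list, but the inner lists are shared with A
B = copy.copy(A)
B[0][0] = 99      #also changes A[0][0]

#Deep copy: the inner lists are copied recursively
C = copy.deepcopy(A)
C[1][0] = -1      #A is unaffected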

4.3.3 underscore
We have seen that the underscore character can be included in the name of an identifier. When
used on its own, the underscore is considered as a throwaway (or "I don’t care") variable. By
convention its use indicates to other developers that the variable itself is not important.
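
For example:

#The loop variable is not needed, so use an underscore
for _ in range(3):
    print("hello")

#Ignore the middle value when unpacking
low, _, high = (1, 2, 3)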

4.3.4 print
You will probably already have noticed how the print function is used to display the val-
ues of variables. This function accepts a comma separated list of values and literals that it will
concatenate together for output. Note that all values will be converted to strings automatically:
no explicit casting is required. The optional sep parameter (defaulted to a space character) is
placed between each of the string values being joined. The optional end parameter (defaulted to
a newline character) is placed at the end. Thus by default, for two consecutive print statements,
the second will begin on a new line.

#Print three values separated by a space character.


print(value1,value2,value3, sep=' ', end='\n')

#Print three values separated by a tab character.


print(1,'two',[3], sep='\t')

The display function is similar but will produce more richly formatted output in frontends like
Jupyter. For example, DataFrame objects are rendered with alternately coloured rows rather than
in plain text.

4.4 Data types


Python variables do not need to be declared nor do they have to be typed. For example, in
C++, a variable could be defined as being of type integer. Once so defined, attempts to treat it
as a different type (say a real number) would generate warnings during compilation or run-time
errors. Python variables are effectively variant data types: they can hold any data type assigned
to them. Never-the-less, it is worth having some appreciation of the different data types that exist,
and how they will be treated. Basic data types include:
• Numeric types
– int - integer
– float - real
– complex - complex

• bool (i.e. Boolean: True or False)


– Zero or empty containers are also considered False
– Non-zero numeric values are considered True
• str (i.e. text or string)
• Containers
– list, tuple, set, dict, etc
The type of a variable can be checked in code using:

#Check data type of variable A


type(A)

#Test if variable is of a particular data type (returns Boolean)


isinstance(A,int)
isinstance(A,bool)
isinstance(A,str)

From python version 3.10, it is possible to use the pipe (‘|’) symbol to union two or more types. This
allows you to test if a variable is one of the types. For example, to test if something was an integer
or a float:

#Test if variable is an integer or a float


isinstance(A, int | float)

4.5 Strings
In programming, text is referred to as a string. Python strings can be enclosed in single, double, or
even triple quotes. Single quotes can appear as part of a text string created with double quotes
and vice versa. Any appearance of quoted text within code is referred to as a string literal.
In Python, strings can be treated like an array of characters, and manipulated like other array data
types.

text1 = 'this is a string;'


text2 = "so is this;"
text3 = "this is an \"interesting\" string;"

#sub-strings
print(text1[0:4],"\n")

#string concatenation
text4 = text1 + ' ' + text2 + ' ' + text3
print(">",text4,"\n")

#convert numbers to string


text5 = 'convert numbers to strings ' + str(4)
print(text5,"\n")

The backslash character is used to modify the meaning of certain characters. For example ‘\n’
is used to denote a linefeed. The backslash character therefore needs to be ‘escaped’ and
written as ‘\\’ if you actually want the backslash character to appear in text. Python provides a
way (the prefix ‘r’) to indicate you want the string to be treated exactly the way you type it. This is
particularly helpful when you need a string to represent a windows path.

#special characters many need to be escaped


text1 = 'this is text \n with special characters \\'
print(text1)

#but not if you prefix with r


text2 = r'so \n is this \\'
print(text2)

The same approach can also be used to escape quotation marks within a string that would oth-
erwise prematurely mark its end.

4.5.1 String methods


Useful string methods include those listed in Table 4.1.

Method Description
text.upper() converts to upper case
text.lower() converts to lower case
text.isdigit() returns True if all characters are digits (i.e. an integer)
text.isalpha() returns True if all characters are alphabetic
text.rstrip() removes all trailing whitespace
text.find(sub) finds the start index of a substring (-1 returned if not there)
text.format() substitutes values into a string using curly bracket placeholders
text.join() concatenates any number of strings using a given separator
text.zfill() pads a numeric string with zeros to a given width

Table 4.1: String methods
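
A few of these methods in action (the example values are purely illustrative):

text = ' Vodafone Group PLC '
print(text.rstrip())               #' Vodafone Group PLC'
print(text.upper())                #' VODAFONE GROUP PLC '
print('42'.isdigit())              #True
print('finance'.find('nan'))       #2
print('-'.join(['a', 'b', 'c']))   #a-b-c
print('7'.zfill(3))                #007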

4.5.2 String formatting


The print statement can be used to output text and other results to the console.

#Multiple values can be listed in a single print


#As output they will be separated by default by a space
print('answer',value)

#But you can specify you own separator with the 'sep' optional parameter
print('answer',value,sep="==>")

Strings can also be concatenated manually using the ‘+’ operator, but non-string variables may
need to be converted.

#String concatenation
text = "hello" + " " + "world"

#Concatenate items from list A using a comma separator


text = ",".join(A)
4.5. STRINGS 27

#Convert numeric types to strings before concatenation


text = "answer=" + str(value)

Results can also be placed in strings via placeholders using the format string method. The
placeholders can either be empty, enumerated or named, and can also include formatting com-
mands.

name = "VOD"; price = 192.20; previous = 182


text = "The price of {} is {}".format(name, price)
text = "Return is {:%}".format(price/previous)

Note that this replaces older style string formatting that you may still come across. For full details
see:
• old style formatting: https://docs.python.org/2/library/stdtypes.html#string-formatting
• new style formatting: https://docs.python.org/3/library/string.html#string-formatting
• examples of both: https://www.w3schools.com/python/ref_string_format.asp
• example of new: https://pyformat.info/

4.5.3 f-string
Formatted string literals or f-strings were introduced in Python 3.6. Variables placed inside curly
braces within a string literal (prefixed with ‘f’) are automatically evaluated at run time.

name = 'Jenny'
print(f'hello {name}')
>> hello Jenny

Variable names and their values can now be output without providing two inputs into a print
function. So the variable name itself need only be typed once but appears as both its name and
value in the output.

print(f"{a=}")
print(f"{b=}") a=3
b='how cool is this?'
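
Format specifiers can also be included after a colon inside the braces, mirroring the format method (the values below are purely illustrative):

price = 192.2; previous = 182

print(f"price={price:.2f}")                #price=192.20
print(f"return={price/previous - 1:.2%}")  #return=5.60%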

4.5.4 Regular expressions


A regular expression (regex) specifies a set of strings that matches it. You can use this as a kind
of search using wildcards, except that it gives you huge amounts of flexibility to configure what is
considered a match.
Regular expressions are supported through the built-in package re. A non-exhaustive list of the
syntax is listed in Table 4.2. Note that it may be necessary to escape certain characters.
One way to use regular expressions is with the re.findall function which looks for pattern matches
within a given string.

text = 'd0es this strange str1ng contain 33 numbers?'



Method Description
[] a set of (specified) characters
() a sub-expression
\w any alphanumeric character
\d any digit
. (dot) matches any character except a newline
ˆ (caret) anchors to the start of the string
$ matches to the end of a line
* matches zero or more of the preceding expression
+ matches one or more of the preceding expression
? matches zero or one of the preceding expression
{m} matches exactly m copies of the preceding expression
{m,n} matches between m and n copies of the preceding expression
| A|B is a match to either A or B
(?:...) non-capturing brackets (use for sub-patterns that you don’t want to
capture)

Table 4.2: Regular Expressions

#Find digits in a range


re.findall(r'[1-9]', text)

#Find all digits


re.findall(r'\d+', text)

#Find words containing str


re.findall(r'str\w*', text)

#Escaping periods and anchoring text to end


url = 'facebook.co.uk amazon.co.uk'
re.findall(r'\w+\.co\.uk$', url)

#Extract a dollar amount with or without comma separator and decimal point
text = 'the share closed at $1,234.56 yesterday'
re.findall(r'[\$]\d{1,}(?:,\d{3})*(?:\.\d+)?', text)

It is also possible to create a ‘compiled’ regular expression which can then be used to perform
pattern matching operations, so the regular expression no longer needs to be passed as an input
into a function.

yearmatch = re.compile(r'(?:19)\d\d')
print(yearmatch.findall('1875, 1956, 1999, 2012, 2130'))

Full documentation is available from https://docs.python.org/3/howto/regex.html and
https://docs.python.org/3/library/re.html.

4.6 Dates
The datetime module provides support for working with dates (see https://docs.python.org/3/
library/datetime.html for full documentation).

To make dates human readable for input and output they are treated as strings. However be-
cause string dates come in a myriad of formats they are difficult to manipulate. It is therefore
convenient to convert them into a standardised (numerical) datetime representation. In particular
beware of American dates which follow a month-day-year system and often appear in financial
datasets.

from datetime import *

#Convert string date in given
#format into a datetime
p1 = datetime.strptime('25-12-2022', '%d-%m-%Y')
print(p1)
>> 2022-12-25 00:00:00

#Convert a datetime object into
#a string in a given format.
s = p1.strftime('%d-%m-%Y')
print(s)
>> 25-12-2022

#Create a date
p2 = datetime(2022, 12, 25, 0, 0)
print(p2)
>> 2022-12-25 00:00:00

Date formatting syntax is outlined in Table 4.3 using date p2 above as an example. For a full listing
see strftime under https://docs.python.org/3/library/datetime.html.

Code Representation Example Output


%d 2-digit day of month 25
%D Forward slash separated date 12/25/22 (American style!)
%F Hyphen separated date 2022-12-25
%m 2-digit month of year 12
%B Month of year as text in full December
%b 3-character month of year Dec
%y 2-digit year 22
%Y 4-digit year 2022
%w 1-digit day of week (Sunday is 0) 0
%A Day of week as text in full Sunday
%a 3-character day of week Sun
%H 2-digit hour of day (24 hour) 00
%I 2-digit hour of day (12 hour) 12
%p Locale’s equivalent of either AM or PM AM
%M 2-digit minute of hour 00
%S 2-digit second of minute 00

Table 4.3: Date formatting

Note that when printed (without formatting), dates reveal that they are really datetime objects,
and as python objects, they have useful properties and methods.

#Decompose date
print("weekday =",p1.weekday())
>> weekday = 6
print("day =",p1.day)
>> day = 25
print("month =",p1.month)
>> month = 12

A useful format for encoding dates is the one defined by ‘%Y%m%d’, which produces dates such
as 20221225, which has the following advantages:


• it can be stored as an integer
• it is human readable
• it is easily parsed into its day, month and year components
• it can be sorted chronologically (as text or a number)
When included as part of a file naming convention, the files can easily be sorted in chronological
order, in a way that other date naming conventions would not permit.
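
For example, an output file could be stamped with today's date (the filename used here is purely illustrative):

from datetime import datetime

#e.g. 'report_20231225.csv'
filename = 'report_' + datetime.today().strftime('%Y%m%d') + '.csv'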
Date arithmetic is also supported: ‘differencing’ two dates and adding (positive or negative) units
of time onto a date.

#Date difference (days elapsed)
diff = p2 - datetime(2021, 12, 25)
print("days elapsed=",diff.days)
>> days elapsed= 365

#Add units of time
from dateutil.relativedelta import relativedelta
print("+1Y", p1 + relativedelta(years=1))
>> +1Y 2023-12-25 00:00:00
print("+3M", p1 + relativedelta(months=3))
>> +3M 2023-03-25 00:00:00
print("+7D", p1 + relativedelta(days=7))
>> +7D 2023-01-01 00:00:00

4.7 Operators
Operators can be applied to variables or literals. They include standard arithmetic operations
such as addition and multiplication not listed here.
Expressions are executed using standard operator precedence (e.g. multiplication before addi-
tion). If in doubt about the sequence in which the operators are applied use parentheses (round
brackets). Note that in Python ** is used for raising to the power of.
• Boolean operators
– not, and, or
– To test for equality use ==
– To test for inequality use !=
• Bitwise operators (applied to the ‘bits’ of the binary representation of a number):
– | (or)
– & (and)
– ˆ (xor – exclusive or)
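
A few illustrative examples of these operators (values chosen arbitrarily):

#Exponentiation and precedence
print(2 ** 3)       #8
print(2 + 3 * 4)    #14
print((2 + 3) * 4)  #20

#Boolean operators
print((5 > 3) and not (2 == 2))   #False

#Bitwise operators on the binary representations
print(6 & 3)   #2 (110 & 011 = 010)
print(6 | 3)   #7 (110 | 011 = 111)
print(6 ^ 3)   #5 (110 ^ 011 = 101)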
While some operations can be performed by variables of different types, explicit conversion may
be required.

#Explicitly create value of specific types


i = int(3)
pi = float(3.14)
pie = str('apple')

#Adding integers and floats is fine



print(i+pi)

#This will fail even though + can be used for string concatenation
print(i+pie)

#This is fine
print(str(i)+pie)

4.8 Containers
A container is a particular compound data type that stores a collection of items. They mimic struc-
tures that should already be familiar to you such as lists, sets and queues. Most programming
languages provide support for containers which manage storage requirements and provide func-
tions to access and manipulate elements.

4.8.1 Lists
Lists are ordered arrays that store mixed data types and allow duplicates. In Python they are
created using square brackets.

#Create list
list1 = []
list2 = list()
list3 = [2, 3, 5, 7, 11]
list4 = ["finance", "economics", "accounting"]
list5 = [list3, list4]
list6 = [5, 'e']
list7 = ['na'] * 9 + ['hey Jude']

#Find length
len(list7)

Accessing elements from the list is also performed using square brackets []. Note that Python is
zero based so the first element is at index position 0 (Figure 4.2). A list with n elements is therefore
indexed from 0 to n − 1. The number of elements in a list (or other array types) can be found using
the len function.

Figure 4.2: Python indexing

Python allows arrays to be accessed from the start or, by using negatives indexing, from the end.
For an array of size n, index values of −n to n − 1 are therefore valid.

#Create list
A = [2, 3, 5, 7, 11]

#Access 3rd element - count from zero!



A[2]

#Access 3rd and 4th elements


A[2:4]

#update first value


A[0] = 1

A list is one example of an iterable, a structure that is capable of returning its members one at a
time. Iterables can be used in a for loop (see Section 4.9.2), when each element needs to be
considered in turn. See https://docs.python.org/3/glossary.html#term-iterable.

4.8.2 Tuples
Tuples are ordered arrays but cannot be changed (immutable) once initialised. They are created
using round brackets, but their elements are accessed via square brackets.

#Create tuple
tuple1 = ()
tuple2 = tuple()
tuple3 = (2, 3, 5, 7, 11)
tuple4 = ("finance", "economics", "accounting")
tuple5 = (tuple3, tuple4)
tuple6 = (5, 'e')

#Access 3rd element - count from zero!


tuple3[2]

#This won't work – it's a tuple!


tuple3[0] = 1

4.8.3 Ranges
Ranges are immutable integer sequences, created using the range function. They are particularly
useful when using for loops. Each integer sequence is defined by its start value, its stop value, and
the increment between consecutive values in the sequence. The start value defaults to zero, and
the increment defaults to 1. Note that in keeping with Python’s zero-based approach, the range
sequence continues up to, but does not include, the stop value.

#Create
range(start)
range(start, stop [,step])

#Sequence of integers 0 to 9
r = range(10)

#Sequence of integers 1 to 10
r = range(1,11)

#Convert to list
a = list(r)

#Sequence of integers 10 to 1
r = range(10,0,-1)

4.8.4 Working with arrays


Sequence Operations
For all array type containers, it is possible to access individual elements or sub-sequences (known
as a slice). Similar to the range function, a sequence defined by:

a:b:c

represents the elements indexed from a up to (but not including) b advancing in steps of c. This
works inside square bracket notation and is a short cut for using the slice function. The following
logic also applies:
• If c is omitted its value is defaulted to 1
• If a is omitted its value is defaulted to 0
• If b is omitted its value is defaulted to the length of the array being accessed.
• Negative b values indicate counting backwards from the end of the array.

#Values at 3 & 4 (remember its zero based)


A[2:4]

#Values at 3, 5, 7 & 9
A[2:10:2]

#First three values


A[:3]

#Fourth value onwards


A[3:]

#All but last three values


A[:-3]

When the step value is negative the default logic reverses, so that you start from the end and work
backwards.

#Numbers 0 to 9
A = list(range(10))

#Numbers 9 to 0
A[::-1]

#Numbers 5 to 0
A[5::-1]

#Numbers 9 to 6
A[:5:-1]

The built-in slice function can be used as follows:

myslice = slice(1,3)
print(A[myslice])

Modification of non-immutable arrays involves updating, inserting, appending and removing ele-
ments.

#Update second value


A[1] = 4

#Add value onto end


A.append(11)

#Join
A.extend([12,13])

#Insert into specified position


A.insert(1,14)

#Remove elements
del A[2:4]

#Remove element at end of array


A.pop()

4.8.5 Sets
Sets are unordered and unindexed collections that prohibit duplicates (just like their mathemat-
ical equivalents). They are created using curly brackets. An immutable version of a set is a
frozenset.

#Create empty set


myset = set()

#create set
myset = {"alpha", "beta", "gamma"}

#add new member


myset.add("eta")

#remove member – raises an error if it does not exist


myset.remove("beta")

#remove member – no error if it does not exist


myset.discard("beta")

#check membership
"alpha" in myset

4.8.6 Dictionaries
Dictionaries are used to store key/value pairs, in a way that allows the value to be accessed via
its key. Dictionaries are mutable and indexed. From python v3.6, they are ordered based on the
order in which keys are inserted into the dictionary. They are created using curly brackets and
their elements are not single values, but rather key/value pairs. The keys must be unique but this
allows associated values to be ‘looked up’. The values can be any other variable type such as
int, string, list or even dict.

#Create dict
mydict = {"red": 1, "amber": 2, "green": 3}

#Find value associated with key


mydict["amber"]

#Check key exists


"amber" in mydict

#Update
mydict["stop"] = 0

The get method of a dictionary can also be used to retrieve values in a way that specifies a default
value should the key not be found in the dictionary.

#Key is in the dictionary: returns value 1


mydict.get("red")

#Key is not in the dictionary: returns default -1


mydict.get("blue", -1)

It is possible to iterate over the keys or values of a dictionary.

#Iterate over dictionary keys


for item in mydict:
    print(item)

#Iterate over dictionary keys


for k in mydict.keys():
    print(k)

#Iterate over dictionary values


for v in mydict.values():
    print(v)

#Iterate over dictionary unpacked key/value pairs


for k, v in mydict.items():
    print(k, v)

Python's collections module also provides specialized container datatypes including:


• Counter which counts (hashable) objects (a bit like keeping tallies).
• OrderedDict which remembers the order in which the objects were added (this was not the
case for a dict prior to python v3.6).
• defaultdict which allows the value of new entries to be created via a supplied function.
See https://docs.python.org/3/library/collections.html for more details.
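
As a brief sketch of two of these (the example data is purely illustrative):

from collections import Counter, defaultdict

#Counter keeps tallies of hashable objects
votes = Counter(['buy', 'sell', 'buy', 'hold', 'buy'])
print(votes['buy'])          #3
print(votes.most_common(1))  #[('buy', 3)]

#defaultdict creates a default value for missing keys (here an empty list)
groups = defaultdict(list)
groups['tech'].append('AAPL')   #no KeyError even though 'tech' was not present
print(groups['tech'])           #['AAPL']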

4.8.7 List Comprehensions


Comprehension allows lists to quickly be created from other lists with minimal lines of code. This
process can simultaneously apply value manipulation and filtering logic.

#Create list
A = list(range(1,11))

#Find squares of A
B = [x*x for x in A]

#Find even numbers in A


C = [x for x in A if (x % 2 == 0)]

#Use enumerate in a comprehension (e.g. discounting cashflows CF at rate y)
D = [c/(1+y)**(i+1) for i, c in enumerate(CF)]

Similar results can also be achieved by using the map function which can be used to apply a
function to every item in an iterable (container). Note that this returns an iterator object which
you may wish to convert to a list. See https://docs.python.org/3/library/functions.html#map for
more details. The example below uses a special type of function called a lambda function (see
section 4.11).

A = [1,2,3]
f = lambda x: x**2

B = map(f,A)
print(list(B))
>> [1, 4, 9]

Python’s zip function can be used to combine iterables (e.g. lists, dictionaries, etc) by generat-
ing a series of tuples, where each tuple contains one element from each iterable being zipped
together.

A = [1, 2, 3]
B = ['x', 'y', 'z']
C = zip(A,B)

print(list(C))
>> [(1, 'x'), (2, 'y'), (3, 'z')]

4.9 Flow Control


Python uses indentation (tabs) to create code structure and define the sequence of operation
of statements. Other languages use braces (curly brackets) to demarcate the start and end of
particular blocks of code. Indentation quickly reveals to seasoned programmers, the general
intended logic of an algorithm. It therefore aids readability and makes code easier to maintain.
Thus while indented typesetting of code is considered good practice in other languages, Python
enforces its use. Indenting is preceded with a statement ending in a colon. This statement defines
what type of construct the code that follows is part of (for example, a function, a class, etc).
In coding there are two primary coding constructs for controlling the sequence of operation of
statements. Both are considered to form compound statements. The first deals with algorithms
that perform certain operations only when specific conditions are met. Drawn as a flowchart this
would result in two branches: one when the condition is met, and another when the condition
is not met. In coding this is known as selection. The second main construct is iteration: when a
particular set of steps is repeated multiple times in a loop.

4.9.1 Selection
Selection in programming languages is coded using if statements. Here code is executed (or not)
depending on the state of particular variables, based on a Boolean test that evaluates as either true or
false. In its simplest form, the syntax for an if statement is as follows:

#Execute statements only if test is true


if test:
    statements

All lines indented after the if statement will only be executed if the test evaluates as True. If the
test evaluates as false, the indented statements will not be executed and code execution will
resume at the next unindented line. An alternative formulation adds an else clause:

#Execute statementsA if test is true otherwise execute statementsB


if test:
    statementsA
else:
    statementsB

It is possible to extend the logic to have multiple branching possibilities by using additional elif
clauses:

#At most one set of statements will be executed


if testA:
    statementsA

elif testB:
    statementsB

elif testC:
    statementsC

else:
    statementsZ

Such logic should be used for mutually exclusive tests since at most one clause will be executed.
Efficient code should avoid unnecessary testing, so prefer elif clauses to sequential if statements
(for example, where passing the first test guarantees that future tests would fail).

Figure 4.3: if statement configurations: (a) if, (b) if else, (c) if elif else

Note that every else clause must have some executable code. If no statements are necessary
but you wish to retain the clause for clarity, use the pass statement (nothing happens when it is
executed).

Nesting
Nesting of if statements (i.e. sub-branching) can be achieved with careful indentation.

if testA:

    if testB:
        statementsB

    else:
        statementsC

else:
    statementsZ

Selection tests
Selection test can take several forms.

#Test of equality
if (a==3):

#Test of inequality
if (x>18):
if (x!=0):

#Test of memberships
if x in range(100):
if subject in ('Finance', 'Economics'):

Inline if statements
For simple selection cases, consider ‘in-lining’ code.

discount = (price * 0.2) if onSale else 0

Match
In many languages, ‘switch/case’ statements provide an alternative to if statements. Somewhat
controversially, this functionality was made available in python from version 3.10.

test = 1

match test:
    case 1 | 2 | 3:
        print("plan A")
    case 4 | 5:
        print("plan B")

This approach can be much faster than the equivalent if statement approach (e.g. for sequence
matching). Most other languages also support an ‘else’ case clause to be executed when none
of the previous cases are matched. In python this is performed with the underscore ‘_’ which is
considered to be a ‘soft keyword’ denoting a wildcard.

test = 6

match test:
    case 1 | 2 | 3:
        print("plan A")
    case 4 | 5:
        print("plan B")
    case _:
        print("the no plan plan")

4.9.2 Iteration
Iteration is used whenever you want to repeat a set of operations multiple times in a loop.

For loops
When the number of times the operation should be repeated is known in advance, a for loop is
normally preferred. Just as with an if construct, all indented lines are considered to be part of the
loop. For loops rely on the use of a variable to keep track of which iteration is currently underway.
By convention, the letters i, j, k are typically used. In python, loops that can be indexed by
predictable integer sequences can use the range function.

#Print numbers 0 to 9
for i in range(10):
    print(i)

The loop variable can either be used as an index (keeping track of which loop code execution is
on) or can assume the values from a container.

#First 5 primes
A = [2, 3, 5, 7, 11]

# use i to index the elements of A


for i in range(len(A)):
    print(A[i])

#use v to represent the value of each element of A


for v in A:
    print(v)

A combined approach uses the enumerate function to populate two variables: one counter to
index the loops and another variable that assumes the values of interest for each iteration. This
removes the need to access array elements by index.

#using enumerate to iterate over an array


bond = [5, 5, 5, 105]

for i, cf in enumerate(bond):
    print(i, cf)
    #instead of
    #print(i, bond[i])

Note that enumerate has a second optional input parameter start which defaults to 0. A reversed
function returns the elements in the reverse order (i.e. last to first).
The itertools module provides further functionality for creating iterators for efficient looping. See
https://docs.python.org/3/library/itertools.html for more details. From python 3.10 it also
provides a pairwise function. This allows you to loop over the n − 1 adjacent pairs of values from an
array of length n.
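
For illustration (pairwise requires python 3.10 or later):

from itertools import pairwise

A = [2, 3, 5, 7]

#enumerate starting the counter at 1
for i, v in enumerate(A, start=1):
    print(i, v)

#iterate in reverse order
for v in reversed(A):
    print(v)

#adjacent pairs: (2, 3), (3, 5), (5, 7)
for x, y in pairwise(A):
    print(x, y)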
Two keywords provide extra control options.
• break terminates iteration entirely; execution resumes after the loop
• continue terminates that iteration; execution resumes on the next loop
• If using break you can add an else clause
• This will only be executed if the break never is

#Print numbers 0 to 9 except 5


for i in range(10):
    if i==5:
        continue
    print(i)

#Print numbers 0 to 4
for i in range(10):
    if i==5:
        break
    print(i)

#Else for a loop


for i in range(10):
    if f(i) < 0:
        break
else:
    print("all positive values")

While loops
With a while loop, code will continue to execute as long as a condition remains true. It is essential
that statements within the loop will eventually mean the test will evaluate as false. Otherwise, the
loop will continue to run forever: an infinite loop. Equally, if the condition evaluates to false initially,
the code within the loop will not be executed even once.

Figure 4.4: while loop

Many while loops use a counter to keep track of the number of iterations. This must be initialised
to a starting value before the loop, and updated at some point during the loop, normally by
incrementing (adding one) at the end of the loop.

#Print numbers 0 to 9
i=0
while i<10:
    print(i)
    i=i+1

4.10 Libraries
4.10.1 Standard Library
Python’s standard library contains built-in modules that provide solutions for many problems that
occur in everyday programming.

Module Description Documentation


math Basic math/trig https://docs.python.org/3/library/math.html
string Text manipulation https://docs.python.org/3/library/string.html
datetime Date/time manipulation https://docs.python.org/3/library/datetime.html
os Files, folders and paths https://docs.python.org/3/library/os.html

4.10.2 Packages
Outside of the standard library a vast and growing collection of packages are available from the
Python Package Index (PyPI). To search this repository visit https://pypi.org/.
Packages consist of related code modules organised into a particular structure. This can be used
to create a hierarchy of namespaces which help to address problems with scope. A package
is identified by a __init__.py file, which indicates that the containing folder is a Python package
directory. The folder hierarchy might look like this:

package/
    __init__.py

    subpackage1/
        __init__.py
        module1.py
        module2.py

    subpackage2/
        __init__.py
        module1.py
        module2.py

4.10.3 Import
Before calling functions from a package, or otherwise utilising its functionality, it must first be im-
ported.

import math

#Print a list of imported functions


print(dir(math))

The structure of a package determines how to import the components that you wish to use.

#import a package
import package

#import a module
import package.subpackage1.module2

#import all modules


from package.subpackage import *

#Import specific modules into the namespace


from package.subpackage1 import module2

#Import specific names into the namespace


from package.subpackage1.module2 import specificname

How you import determines how to reference the functions that you imported. When importing it is
possible to assign an alias. After defining the alias, this new name can be used to refer to whatever
was imported in place of its originally defined name.

#using import
package.subpackage1.module2.functionname

#after 'from package.subpackage1 import module2'


module2.functionname

#with an alias
import package as myalias
myalias.functionname()

Python will only import a given module once in a given python session (no matter how many times
you run the import command). Importing a module does not mean that all variables created in
the module will be imported into the current one. Importing creates a module object which can
then be used to access the imported module’s variables. Variables defined in a module therefore
have module (not global) scope.

import math
print(math.pi)

Built-in properties can be accessed via the package name or alias.

import numpy as np

#Confirm name
print(np.__name__)

#Confirm version
print(np.__version__)

#Confirm file location


print(np.__file__)

4.11 Functions
Python comes with built-in functions and a range of packages. These can be explored using the
dir function.

#Find built in functions


dir(__builtins__)

#Find functions within a package


import math
dir(math)

Most functions accept inputs and produce outputs. When defining a function, the inputs are
called parameters. When calling a function, the inputs supplied are called arguments. Functions
have identifiers (i.e. names) which follow the same rules as variables. To determine if something
is a function, you can use callable.

#Integers are not callable


a = 2
callable(a)
>> False

#Built in functions like abs (the absolute value of a number) are


callable(abs)
>> True

4.11.1 Positional arguments


While developers should prefer to reuse rather than reinvent, there will be situations when it is more
convenient to create your own functions. In python these are defined using the keyword def.
Functions may take input parameters (arguments), which are provided as a comma separated list
inside round parentheses. Note that the names of the input parameters represent placeholders for
the values that the caller of the function wishes to provide. The names of the values used when
calling the function can be different.

#create a function
def myfunctionname(arg1, arg2, arg3, arg4):
    statements

#call a function
myfunctionname(arg1, arg2, arg3, arg4)
myfunctionname(x, y, 3, 'test')

When a function is defined in this way, each argument must be supplied (and in the expected
order).
Functions will normally return a result but don’t have to. This is achieved by the keyword return.
When this statement of the function is reached, the function ceases, and execution continues
from the point where the function was called. If you choose not to return a value, the function will
terminate when it reaches its last statement and will return None.

#create a function
def square(x):
"""this text will appear in calls to help(square)"""
return x**2

#call a function
square(2)

Notice the first line of the function appears as a string inside three double quotes (this allows it to
be split over multiple lines). This line is a bit like a comment (it doesn’t do anything) and is known
as a docstring because it provides documentation for the function that is accessible through the
help function. It must appear as the first line of the function. The use of three double quotes is just
a convention. Modules and classes can also have docstrings.
Note that the names given to functions are also identifiers so they can be treated like variables
and even passed into other functions.

4.11.2 Optional arguments


Functions can be adapted to let them be called in different ways. It is possible to default some or
all of the arguments: this allows the function to be called without providing every argument.

#Default input parameters
#(note that 'yield' is a reserved keyword, so it cannot be used as a parameter name)

def bondPrice(coupon, ytm, maturity, facevalue = 100):

#Call function
bondPrice(0.05, 0.04, 3, 1000)
bondPrice(0.05, 0.04, 3)

Non-defaulted arguments are considered to be positional arguments, while those that are de-
faulted are called keyword arguments. Positional arguments must be passed in using the ex-
pected order. Keyword arguments can be passed in any order, but must be identified by their
keyword. See https://docs.python.org/3/tutorial/controlflow.html#keyword
-arguments.

4.11.3 Arbitrary argument lists


Functions that can take an arbitrary number of arguments can be accommodated by wrapping
up multiple parameters in a tuple. These variadic arguments will normally appear after all other
inputs, with only optional parameters (with supplied defaults) appearing after them.

#function with any number of inputs


def flexiblefunction(x, *args, value=3):
    for item in args:
        print(item)
    print(value)

flexiblefunction('colours', 'red', 'green', 'blue', value=4)

Similarly, when an argument parameter is preceded by **, it receives a dictionary (i.e. a col-
lection of key/value pairs). This allows an arbitrary number of named arguments to be passed
in without the function having to formally define them or require them to be passed in using a
specific order.

def describe(**kwargs):
return "; ".join([f'{k}={str(v)}' for k,v in kwargs.items()])

print(describe(a='1', b=2, c='three'))

> a=1; b=2; c=three

4.11.4 Multiple return values


It is possible for a function to return more than one result, but the individual results must be part of
a single structure such as a container or other object. Simply providing a comma separated list of
return values results in a tuple being returned.

def test():
    return 'the answer to life the universe and everything is', 42

#x will be a tuple
x = test()

#but the tuple can be unpacked into individual variables


a, b = test()

4.11.5 Lambda Expressions


Short single-line functions can be created with the lambda keyword. Functions created with lambda
rather than def are anonymous: the function itself has no name, but it returns a function object
that can be assigned to a variable which thereafter acts like a function.

sqme = lambda x : x**2


print(sqme(5))

They can be used as inputs into higher order functions and are used alongside functions
such as filter (built-in) and reduce (from the functools module).

A = [1, 2, 3, 4, 5, 6]
B = list(filter(lambda x: (x%2 == 0) , A))
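
For example, reduce repeatedly applies a two-argument function to accumulate a single result from an iterable:

from functools import reduce

A = [1, 2, 3, 4, 5, 6]

#Sum the list: ((((1+2)+3)+4)+5)+6
total = reduce(lambda x, y: x + y, A)
print(total)
>> 21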

4.11.6 Modifying input parameters


You may wonder what happens if the function attempts to change the value of the function
inputs. In other languages, this depends on whether you pass a copy of the parameter (pass by
value) or pass a pointer to the memory address (pass by reference). In Python, arguments are
passed by assignment. In practice this means that the type of data (and whether it is mutable) and
how attempts are made to modify it will determine the behaviour.
Let’s take two examples which both pass a list to a function. In both cases the value assigned to A
is a reference, not a copy of the list it refers to. Using list methods to change it inside the function
therefore updates the original values.

def test1(A):
    for i in range(len(A)):
        A[i] = A[i]*2

#B will be modified
B = [1, 2, 3]
test1(B)
print(B)

In the second example, an assignment operation means that A now points at a different list.

def test2(A):
    A = [0]

#B will not be modified


B = [1, 2, 3]

test2(B)
print(B)

For more details see the FAQ on ‘How do I write a function with output parameters (call by refer-
ence)?’ from https://docs.python.org/3/faq/programming.html.

4.11.7 Scope
Variables declared within a function have local scope. Scope refers to the domain within which
a particular name has meaning. They are created as temporary variables and their value is un-
known outside of the function. This means it is possible to use the same variable names in different
contexts. Variables declared outside functions are global to the module. Prefixing a variable
name with __ (double underscore) makes it private to the module.

#Define global (private) variable


__myNumber = 5

#Need to declare use of global inside function


def myFunction(x):
    global __myNumber
    return x + __myNumber

4.11.8 Inner Functions


It is possible to define one function inside another. Such functions cannot be called directly, but
they can be called within their parent function.

def setcase(text, uppercase=True):

    def upper(text):
        return text.upper()

    def lower(text):
        return text.lower()

    if uppercase:
        return upper(text)
    else:
        return lower(text)

print(setcase('Test'))
print(setcase('Test', False))

They can also be called indirectly after being returned as outputs from the parent function. In
such cases, they even have access to variables that were passed into the parent function when
it was called.

def changecase(uppercase=True):

    def upper(text):
        return text.upper()

    def lower(text):
        return text.lower()

    if uppercase:
        return upper
    else:
        return lower

f = changecase()
print(f('Test'))

4.11.9 Generator Functions


An alternative to return is yield, which suspends execution of a function and supplies a result to
the caller, but will resume execution where it left off when next called. It is reserved for use in
generator functions which are called differently to regular functions.

#Define a generator function (must use yield)

def countTo3():
    i=0
    while i<20:
        yield (i % 3) +1
        i=i+1

#Create a generator object

f = countTo3()

#Iterate over the generated items

for i in f:
    print(i)

#Or advance one at a time using next (on a fresh generator object)

f = countTo3()
for i in range(10):
    print(next(f))

4.11.10 Type annotations


Unlike other languages, python associates types with values rather than identifiers. Normally when
writing functions, there is an expectation that values passed into the function will be of specific
types. This can be enforced by performing checks inside the function.
Type annotations allow developers to specify the types of input parameters and the return values
of functions. While this is not enforced, it offers a useful hint to developers attempting to use the
function. For example, the function below indicates that it expects two inputs of type float and
will return a variable of type float.

def add(x: float, y: float) -> float:


return x+y

See https://docs.python.org/3/library/typing.html for more details.

4.12 Error Handling


If you anticipate potential runtime errors, you can add code in an attempt to handle them more
gracefully. A full guide to errors and error handling can be found at https://docs.python.org/3/
tutorial/errors.html. The basic idea of error handling is ‘trial and error’: you try to run your code,
and if it doesn’t work you deal with the error. This is achieved using a try except block.
If no error is encountered inside the try block, code execution will skip the except block entirely. If
an error is encountered, execution will jump to the statements inside the except block.

try:
    #code where an error might occur

except:
    print('something went wrong')

Different types of exception can be handled in different ways. A full list of exception types can be
found at https://docs.python.org/3/library/exceptions.html.
It is also possible to create your own custom errors using raise.

try:
    #code where an error might occur

except ZeroDivisionError:
    print('I know what went wrong')

except:
    print('something else went wrong')
    raise RuntimeError('my custom exception')

Extra else and finally clauses can be added to deal with the case when no exceptions occur
and to enact final cleanup operations, respectively.

try:
    #code where an error might occur

except ZeroDivisionError:
    print('I know what went wrong')

except:
    print('something else went wrong')
    raise RuntimeError('my custom exception')

else:
    print('No exception, return, continue or break encountered')

finally:
    print('this code will get executed on the way out error or not')

The with statement is used to wrap the execution of a block with methods defined by a con-
text manager. It is designed to simplify code that would otherwise be included in a try block to
ensure certain operations are performed successfully. The classic example is when performing in-
put/output operations with files which must be opened and subsequently closed. At the end of
the with block, the file will be automatically closed even if an exception is raised within the with
block.

#Open textstream assigning the name f


with open('output.txt', 'w') as f:
    f.write('I better not forget to close the file')

f.write("this won't work: the file is already closed!")

From python 3.10, it is possible to combine multiple open operations in a single with statement.

#Open two (or more) files with a single with statement


with (open('file1.txt', 'r') as f1,
      open('file2.txt', 'r') as f2):
    pass

4.13 Enumerations
An enumeration is a set of symbolic names that are linked to unique constant values. They are
iterable and immutable and provide a way to eliminate hard coding. By convention, upper case
characters are used to denote that they are constant.

from enum import Enum

#Define new enumeration


class trafficlight(Enum):
    RED = 1
    AMBER = 2
    GREEN = 3

#Sample usage
x = trafficlight.RED
print(x==trafficlight.AMBER)
print(x.value)

Full documentation describing their usage can be found at https://docs.python.org/3/library/
enum.html.

4.14 Other python features


4.14.1 Assignment expressions
An assignment expression allows something that is evaluated as part of an expression to also be
assigned to a variable. This can be used to reduce the number of times expressions are evaluated
and reduce the need for separate assignment statements.

if (n := len(A))>1:
    print(n)

The syntax (:=) is affectionately known as the ‘walrus operator’ (think eyes and tusks).

4.14.2 Decorators
Decorators provide a way to modify the behaviour of a function. The decorator function both
accepts a function as input and returns a function.

def tracker(func):
    def wrapper():
        print("Calling function", func.__name__)
        func()
        print("Exited function ", func.__name__)
    return wrapper

@tracker
def hello():
    print("Hello")

hello()

The effect of @tracker here is to replace hello with tracker(hello). The decorator function essen-
tially wraps the decorated function, allowing additional operations to be performed before and
after it is called. For functions taking inputs, the decorator can be extended as follows.

def friendly(func):
    def wrapper(*args, **kwargs):
        func(*args, **kwargs)
        print("Nice to meet you")
    return wrapper

@friendly
def greeting(name):
    print("Hello", name)

greeting('Kelly')

And for functions returning values, the wrapper should return the value or a modified version of
it.

def logoutput(func):
    def wrapper(*args, **kwargs):
        x = func(*args, **kwargs)
        print("function", func.__name__, "returned", x)
        return x
    return wrapper

@logoutput
def simpleinterest(P,r,T):
    return P*(1+r*T)

FV = simpleinterest(100,0.06,1)
5 Development

5.1 Life cycle models


While there are many models for development, most include the following steps:
• Analysis
• Design
• Implementation (coding)
• Documentation
• Testing
• Release
• Maintenance
Novice developers often rush to the coding stage with no thought given to the other stages. This
often proves costly in the end. Good developers know that there is more to development than
coding.

5.2 Testing
Testing is just as important as implementation. Poorly implemented code not only leads to in-
correct results, but requires developers to expend time revisiting tasks that they believed to be
complete.
Test driven development is an approach that tightly links the process of design, implementation
and testing. Tests are developed simultaneously to ensure that the implementation satisfies the
requirements of the design. In this way tests accumulate over time, and can be run at any point of
the development cycle to determine the degree to which the implementation satisfies expected
behaviour.
Regardless of the approach used, developers should expend some time thinking about how they
will test their code. The development of rigorous test plans is necessary to validate their implemen-
tation. Tests can take different forms, as outlined in Table 5.1.

Type Description
Unit testing Testing is performed at the lowest level on single logical ‘units’
Integration testing Testing is performed as modules are combined together.
System testing Testing is performed at the level of a complete application or system.
Black box testing Testing software without knowledge of how it works internally.
White box testing Testing using knowledge of the internal workings. Tests designed to ex-
ercise all possible code paths.
Regression testing Results match those of earlier versions with only expected deviations
User acceptance testing Testing the software meets user expectations
Stress testing Testing how the software performs under extreme conditions, heavy
loads, high volumes, etc

Table 5.1: Testing types

Unit tests can be incorporated into Spyder using a plugin. More details are available from
https://www.spyder-ide.org/blog/introducing-unittest-plugin/.
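
As a minimal sketch of what such unit tests might look like, using the built-in unittest module (the function under test is assumed to be the simpleinterest function defined earlier):

import unittest

def simpleinterest(P, r, T):
    return P*(1+r*T)

class TestSimpleInterest(unittest.TestCase):

    def test_routine_case(self):
        self.assertAlmostEqual(simpleinterest(100, 0.06, 1), 106.0)

    def test_zero_rate(self):
        #Edge case: no interest accrues
        self.assertAlmostEqual(simpleinterest(100, 0.0, 5), 100.0)

if __name__ == '__main__':
    unittest.main()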
Testing tips:

• When designing test plans, choose tests that:


– cover routine usage
– cover ‘edge’ cases (unusual/rare but still valid cases)
– exercise all possible code paths
– test for unexpected input
– can be quickly rerun
• Where possible:
– use realistic input data
– have intuitive results (obviously right or wrong)
– compare with known results (for equality/inequality)
– visualise the results
• Repeat testing after any code change!

5.3 Coding Errors


There are different types of error that developers can make. Table 5.2 outlines how to avoid the
major types.

Error Type Nature of Error How to avoid


Syntax Invalid code that won’t run Use the IDE

Logical Code does not do what it was meant to Well thought out algorithms
Use your cortex compiler!
Good test plans

Runtime Errors when code is unable to perform the Code for the unexpected
required operation (e.g. division by zero) Avoid assumptions
Error handling
Edge case testing

Table 5.2: Error types

Common python error messages, and their likely causes are outlined in Table 5.3.

5.4 Debugging
Errors are somewhat inevitable. After all, to err is human. A good developer will:
• adopt practices that reduce the incidence of errors
• familiarise themselves with the types of errors that coders typically make
• design and implement test plans that expose errors
• develop their ability to systematically detect the source of errors
Remember that the observed error may only be a symptom or side effect of the issue presenting
itself. Developers must pinpoint the underlying root cause of the actual problem. Such detective
work is part art, part science: sometimes investigating an intuitive hunch will quickly pay off but
trial and error is rarely an efficient approach. A key element is determining the point in a program
where it first deviates from its expected behaviour or state.

Error Cause and remedy


NameError: name ‘b’ is not The function cannot be found in the current working direc-
defined tory or Python path.
The variable is not an input or has not been initialised (as-
signed a value before it is used).
Check precise spelling (Python is case sensitive!)

IndexError: list index out of Python is attempting to access an element of a list/array


range that does not exist

AttributeError: ‘xyz’ object The method (as typed) does not exist.
has no attribute ‘abc’
Check exact match.
After class changes be sure to clear instances of old ver-
sions of the class.

Table 5.3: Python Error Messages

5.4.1 Logging
The state of a program at any point of execution can be determined by inspecting its variables.
Stepping through code one line at a time can be tedious, so it is often more convenient to output
variable state during execution. This can be achieved with print statements that are added
temporarily and later deleted or commented out.
A more formal approach is to build logging into your programme that can be used more generally
for maintenance and support. This logs messages indicating progress and warnings either to a file
or to the console. By incorporating a ‘debug mode’, such logging can be configured to include
extra output that would only be of interest to a developer. Python’s logging module can be used
to achieve this.

import logging

logging.basicConfig(filename='debug.log',level=logging.DEBUG, force = True)

logging.info('log a message')
logging.debug('log a message if in debug mode')
logging.warning('log a message as a warning')

5.4.2 Assertions
Assert statements are simply Boolean tests that terminate code execution if they evaluate to False.
As such they can be used to include tests that confirm a program is in the expected state before
continuing. This allows a developer to embed regular checkpoints in their code to alert them if a checkpoint is not reached successfully.

A = []
assert len(A) != 0, "List is empty."
print(max(A)) #This won't get the chance to run

5.4.3 Line by line execution


A last resort for debugging is line by line execution. Here the developer executes one line of code
at a time until the error is encountered. As execution proceeds the developer can check that
execution follows the expected path through the code and that all variables are in their expected
state.
Breakpoints can be used to pause execution at certain key points in a program. This allows a
developer to fast-forward to the area of code that is deemed most likely to contain the source of
the error.
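Python also ships with an interactive debugger. As a minimal sketch (the function and data are purely illustrative), calling the built-in breakpoint() function pauses execution and drops into the pdb debugger, from which variables can be inspected and the code stepped through one line at a time (n to step, c to continue):

def running_total(values):
    total = 0
    for v in values:
        #Pause here and inspect 'total' and 'v' interactively
        breakpoint()
        total += v
    return total

running_total([1, 2, 3])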

5.5 Good developer practices


The guiding principles for Python’s design (‘The Zen of Python’) can be found by executing the
command import this as shown in figure 5.1.

Figure 5.1: Zen of Python

Developing coding principles and following good developer practices will not only make you a
better programmer, but means that you (and others) will spend less time debugging, correcting,
maintaining, and rewriting your code, not to mention figuring out what it was supposed to do.
Some basic guidelines to follow are:
1. Choose meaningful variable, function and class names, that are suggestive of their intended
usage. Mirror standard mathematical notation where possible.
2. Use white space to separate distinct steps in your algorithms.
3. Use comments to document your code.
4. Write generic code that can be reused rather than rewritten. If you find yourself copying and
pasting code (even with minor edits) you should probably rethink.
5. Avoid hard coding. Think carefully whenever you key in a literal value. If you find yourself
typing it a second time, consider converting it to a variable.
6. Functions should not reveal their inner workings. Users should see them as ‘black boxes’.
7. Functions should not rely on global variables. Any data they process should be passed in as input parameters.

8. Design tests to validate your code. Where possible tests should be simple and have intuitive
answers. Rerun these tests after code changes.
9. Imports and anything configurable (e.g. file paths) should appear (once) at the top of the
code module.
10. Avoid variable proliferation. If you find yourself repeatedly creating variables that store similar
things, consider using a list, dictionary or other data structure. This will make your code easier
to maintain and simplify the processing of the data.
11. Avoid single use variables. Avoid using entirely or reuse the variable name when it is safe to
do so.
For historic reasons the length of a line of code is considered to be 80 characters. Going beyond
this limit can cause problems when printing code, or will force you to scroll from left to right when
viewing code. When statements exceed this limit, consider using line continuations. A backslash
‘\’ character at the end of a line allows the statement to continue onto the next line.
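For example (the values here are illustrative only). Note that statements inside brackets can also be continued onto the next line without a backslash:

#Explicit continuation with a backslash
total_cost = 100.0 + 250.0 + 75.5 \
    + 12.25 + 8.0

#Implicit continuation inside brackets (no backslash needed)
total_cost = (100.0 + 250.0 + 75.5
              + 12.25 + 8.0)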
For more general advice, refer to PEP 8 - a style guide for python code: https://www.python.org/
dev/peps/pep-0008/. Alternatively, consider a code formatting tool such as ‘Black’.
6 Object Oriented Programming

6.1 Principles
Object Oriented Programming (OOP) is a programming framework based around ‘objects’ which
couple data with related operations. Classes are used to define generic object types. They can
be thought of as the design or blueprint from which actual instances (called objects) will be cre-
ated.
Objects are defined by their properties and methods. Properties relate to attributes or data associ-
ated with each object instance. Methods are essentially functions, but define operations that the
class is capable of performing. OOP is supported (to some degree) by many languages including
C++, Java, C#, VBA and MATLAB.
Key to OOP are the concepts of:
• Encapsulation: hiding the internal workings of an object (use of public, private, protected
designations)
• Inheritance: allowing one class to inherit the properties and methods of another
• Polymorphism: allowing different objects to be treated similarly even though they will behave
differently
Polymorphism is seen through inherited classes sharing a common interface. Class benefits in-
clude:
• Logical organisation of code
• Code re-use
• Run time object typing
• Supportable
• Extensible
• Data persistence between calls

6.2 Classes
For an introduction to classes in Python see Hilpisch (2019). A full description can be found at
https://docs.python.org/3/tutorial/classes.html. The basic structure of a class takes the follow-
ing form:

class classname:

#class variables (shared by all instances)


mytype = 'example'

#constructor
def __init__(self, name):
self.name = name

#destructor
def __del__(self):
print("I'm being destroyed")

#methods
def mymethod(self, inputvalue):
print("someone called my method")

First note that variables defined within the class are shared by all instances of the class. Class prop-
erties that are specific to each instance of the class are defined by the __init__ constructor.
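As a brief illustration of the difference, using the classname class defined above, the class variable mytype is shared by every instance, whereas each instance receives its own name:

A = classname('first')
B = classname('second')

#Class variable - the same for every instance
print(A.mytype, B.mytype)   #example example

#Instance properties - specific to each object
print(A.name, B.name)       #first second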

6.2.1 Dunder Methods


Dunder methods are names that are preceded and succeeded by double underscores (hence
the name). They can be used to create custom behaviour for standard operations.

Constructors
A constructor is a special method that is invoked when an instance of a class is created. In Python
constructors are defined by __init__ methods (note the double underscore before and after the
init). Like other methods, constructors can take zero or more inputs, but the first input is always a
reference to the specific instance of the object. Convention dictates that this should be named
self.
A constructor is a convenient place to include code that initialises the object before it is first
used.

Destructor
A destructor is a special method that is invoked when an instance of a class is being destroyed. In
Python destructors are defined by __del__ methods. A destructor provides an opportunity to take
any final actions that might be required before an object is destroyed.

str
The __str__ method will determine the output when an instance of the class is passed into a print
statement. Thus the class can provide the most helpful description of itself and the specific data it might hold. It is intended to be readable and helpful.

def __str__(self):
return "My name is "+self.name

repr
The __repr__ method is used to produce an unambiguous string representation of a class and
tends to be used for debugging.

def __repr__(self):
return str(self.uniqueid)

Python’s documentation states:


For many types, this function makes an attempt to return a string that would yield an ob-
ject with the same value when passed to eval(); otherwise, the representation is a string
enclosed in angle brackets that contains the name of the type of the object together
with additional information often including the name and address of the object.

6.2.2 Methods
Methods are just functions defined within a class. A distinguishing feature is that the first input pa-
rameter is always a reference to the specific instance of the object. Convention dictates that this
should be named self. By default methods relate to a specific instance of a class.

class myclass:

def method(self):
print("regular method")

@classmethod
def classmethod(cls):
print("class method")

@staticmethod
def staticmethod():
return 'static method'

Class methods have the @classmethod decorator, and accept a cls parameter that points to the
generic class, rather than a specific instance. Such methods cannot modify the state of specific
instances, but they can modify the state of attributes shared across all instances.
Static methods have the @staticmethod decorator and accept neither self nor cls parameters. They cannot therefore be used to modify any instance of the class or any class attributes.
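For completeness, these methods can be called either via an instance or on the class itself:

obj = myclass()

#A regular method requires an instance
obj.method()

#Class and static methods can be called on the class directly
myclass.classmethod()
print(myclass.staticmethod())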

6.2.3 Instantiation
Instantiation is the name given to creating an instance of a class. These instances are referred to
as objects. Objects are created by treating the class name like a function. If a constructor has
been defined, and includes non-defaulted parameters, then these must be supplied.

#Oops! A is just another way to refer to the class


A = classname

#Instantiate an object of type classname


A = classname('dummy')

6.2.4 Encapsulation
Methods and properties can be made public (i.e. available to all), or private (i.e. for internal
use only). Adding a double underscore prefix to the name makes a property or method private.
Protected attributes are those which can only be accessed within the class or by classes which
inherit from it. Adding a single underscore prefix to the name makes a property protected.

class somewhatcoy:

def __init__(self, commonknowledge, familysecret, secret):

#public
self.commonknowledge = commonknowledge

#protected

self._familysecret = familysecret

#private
self.__secret = secret

6.2.5 Inheritance
Inheritance allows one class (the child) to inherit the properties and methods of another (the
parent), without having to write additional code. The child implementation can be modified
to include new properties and methods, and can even override parent implementations. The
concept is illustrated in Figure 6.1.

Figure 6.1: Class inheritance

With the exception of anything designated private, a child has access to everything that belongs
to the parent.

class parent:

def __init__(self):
self.name = "parent"

def saymyname(self):
print(self.name)

def oldtrick(self):
print("anything you can do")

class child(parent):

def __init__(self):
self.name = "child"

def oldtrick(self):
print("i can also do")

def newtrick(self):
print("but look what else I can do")

Inheritance occurs when the child class is defined by including the name of the class from which
it should inherit in brackets.

A = parent()
B = child()

#Implemented at parent level only - called by parent


A.saymyname()

#Implemented at parent level only - called by child so inherit from parent


B.saymyname()

#Implemented at parent level - called by parent


A.oldtrick()

#Overridden at child level - called by child so the overriding method is used


B.oldtrick()

#Implemented at child level - called by child


B.newtrick()

Note that a child’s constructor can invoke the parent’s constructor as follows:

class child(parent):

def __init__(self):
super().__init__()

Part II

Data Processing
7 Libraries

SciPy describes itself as ‘a Python-based ecosystem of open-source software for mathematics,


science, and engineering’. In this chapter we consider three of its libraries:
• NumPy - arrays
• SciPy - scientific computing
• pandas - data structures and analysis
Additionally, scikit-learn, a machine learning toolkit, is introduced.

7.1 NumPy
NumPy is a popular package for scientific computing with Python. In particular it provides support
for working with multi-dimensional arrays, matrix algebra and random numbers. By convention it
is imported and aliased as np.

import numpy as np

See https://numpy.org/ for documentation or Harris et al. (2020) for a discussion of the NumPy
ecosystem.

7.1.1 Arrays
Numpy arrays can be of different data types, and can be created in different ways.

#Create from a list


A = np.array(['a', 'b', 'c'])

#Create array with given dimensions and initialise to zero


A = np.zeros(5)

#Create array with given dimensions and initialise to one


A = np.ones(5)

#Odd numbers from 1 to 7


A = np.arange(1,9,2)

#Evenly spaced numbers (from, to, number of points)


A = np.linspace(0,1,11)

These commands create an object of type numpy.ndarray, which has a number of useful meth-
ods.

A.sum()
A.max()
A.mean()

A further major benefit of NumPy arrays over Python lists is the ease with which operations can be performed pointwise on each element.

#Double elements of A
A * 2

#Square elements of A
A ** 2

#Square root of elements of A


np.sqrt(A)

#Find exponential of each element of A


np.exp(A)

Tests of equality and inequality can also be performed.

A = np.arange(10)

#Generate a Boolean array indicating which values are greater than 5


A > 5

#Count how many values are greater than 5


sum(A > 5)

7.1.2 Matrices
While NumPy has a specific numpy.matrix class that was originally intended for linear algebra, this
has been deprecated. It is now recommended to simply use (two-dimensional) arrays. These can
be created as follows:

#Create a 2-d array


A = np.array([[1,2],
[3,4]])

#Create a 3x3 matrix of ones


A = np.ones((3,3))

#Create a 3x3 identity matrix (also np.eye)


A = np.identity(3)

#Check number of elements (in total)


print(A.size)

#Check dimensions
print(A.shape)

#sum along first dimension (rows)


A.sum(axis=0)

#sum along second dimension (columns)


A.sum(axis=1)

7.1.3 Broadcasting
Broadcasting describes how NumPy treats arrays (with different shapes) during arithmetic opera-
tions. In mathematics, when working with one dimensional arrays (vectors) and two dimensional
arrays (matrices), there are strict rules about when operations (such as addition and multiplication)
are valid. When NumPy encounters an operation between two arrays of different sizes, the smaller
one is ‘broadcast across the larger one’ so that they have compatible shapes. This means that in
most cases where a reasonable interpretation exists for the intended operation, it will produce a
result. For more details see https://numpy.org/doc/stable/user/basics.broadcasting.html.
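A brief sketch of broadcasting in action:

import numpy as np

A = np.ones((3, 3))
v = np.array([1, 2, 3])

#The scalar is broadcast to every element of A
print(A + 10)

#The 1D array is broadcast across each row of the 3x3 array
print(A + v)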

7.1.4 Reshaping
Reshaping operations take the values from one array and repack them into the values in another
equally sized but differently shaped array. Note that reshaping returns a new array. It does not
reshape the original array.

#Create a 1D 10 element array


A = np.arange(10)

#Reshape into a 2D array with 5 rows and 2 columns


B = A.reshape(5,2)

#Reshape into a 1D array


C = B.flatten()

#Create an array by creating 3 vertical and 2 horizontal copies of A


D = np.tile(A,(3,2))

One standard reshaping operation is the transpose operation. Note that the transpose operator
has no effect on one dimensional arrays.

#Transpose B
D = B.T

Resizing operations, by contrast, are applied to the original object. If the overall size of a numeric
array is increased, it will be padded with zeros. If the overall size is decreased, the elements will be
truncated.

#Create a 1D 10 element array


A = np.arange(10)
A.resize((10,2))

7.1.5 Linear Algebra


Matrix multiplication can be performed using NumPy’s matmul function.

x = np.array([[1],[2]])
A = np.arange(1,5).reshape(2,2)

#Matrix multiplication
np.matmul(A,x)

#Equivalently
A@x

#This is NOT matrix multiplication


A*x

Matrix operations are generally only defined if the vectors or matrices to which they are applied
satisfy particular dimensional requirements. However, NumPy will apply broadcasting rules when
operators are applied to variables of different shape, making some operations permissible.

x = np.array([[2],[5]])
A = np.arange(1,5).reshape(2,2)

#This is NOT matrix multiplication


#It will multiply the first row of A by 2 and the second by 5
A*x

See the discussion of broadcasting in Section 7.1.3 for further details. NumPy is also quite forgiving with one dimensional vectors, treating them
as column vectors when it appears reasonable to do so.
Further examples of standard linear algebra operations are shown below. A full listing is available
from https://numpy.org/doc/stable/reference/routines.linalg.html.

#find the matrix inverse


B = np.linalg.inv(A)

#extract (as an array) the diagonal


d = np.diagonal(A)

#create a diagonal matrix from an array


C = np.diag(d)

#Solve a system of Equations


np.linalg.solve(A, x)

7.2 SciPy
SciPy provides additional mathematical algorithms for scientific computing. Sub-packages (which
must be imported individually) include:
• interpolate - interpolation functions
• linalg - linear algebra
• optimize - root-finding and optimization
• sparse - sparse matrix manipulation
See https://scipy.org/scipylib/ for more details. A particular sub-package of interest is stats
which provides random numbers and statistical distribution functions. In particular it includes
classes for the continuous distributions:
• uniform
• norm
• lognorm
which each have the following methods:
• cdf - cumulative distribution function
• pdf - probability density function
• ppf - inverse cdf
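As a short sketch of how these classes are used (here for the standard normal distribution):

from scipy.stats import norm

#Probability of a standard normal variable being below 1.96
print(norm.cdf(1.96))

#Density at zero
print(norm.pdf(0))

#Inverse cdf: the value below which 97.5% of the distribution lies
print(norm.ppf(0.975))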

7.3 Pandas
7.3.1 Getting started
Pandas is a library designed specifically for data analysis. A key feature is its DataFrame object,
which can be thought of as a database table consisting of records (rows) and fields (columns).
Pandas provides tools which make it easy to import, export, merge, join, filter, aggregate and
manipulate these data sets. Full documentation is available from https://pandas.pydata.org/. By
convention Pandas is imported and aliased as pd.

import pandas as pd

A DataFrame can be created in different ways. One way uses Python’s dict to pass in a series of
key/value pairs, where each key defines a field, and each value contains an equally-sized array
with which to populate the field for each record.

df = pd.DataFrame({'currency': ['GBP', 'EUR', 'JPY', 'AUD', 'CNY'],


'rate': [0.8246, 0.9171, 107.64, 1.5301, 7.1264]})

Crucially, each field can be a different data type (within Pandas these are dtypes). The data type
of each column can be inspected as follows:

#Find data type of each field


print(df.dtypes)

#Summary of data frame


df.info()

#Find number of rows and columns


r, c = df.shape

There are various commands that can be run to inspect the structure and content of a DataFrame.

#Inspect first n rows (defaults to 5)


df.head()

#Inspect last n rows (defaults to 5)


df.tail(3)

#Find column labels (can be indexed like a list)


df.columns

#Find row labels (can be indexed like a list)


df.index

#Statistical summary
df.describe()

Each DataFrame has an index which represent the row labels. By default, the labels are a simple
numerical enumeration from 0, but can be redefined.

#Set index but retain as a column


df = df.set_index('currency', drop=False)

#Equivalently
df.index = df.currency

#Reset index to enumeration


df = df.reset_index()

The index can be created from one or more columns. While index values do not have to be
unique, creating such a primary key or unique identifier may make it easier to merge one DataFrame
with another. Pandas also offers a Series data structure, which is a one dimensional array, with a
corresponding set of data labels (i.e. its index). A column from a dataframe is therefore a pandas
Series.

FX = pd.Series([0.8246, 0.9171, 107.64])


FX.index = ['GBP', 'EUR', 'JPY']
print(FX)
print(FX['JPY'])

7.3.2 Selection
To select particular rows, columns or subsets of the DataFrame it is possible to use Python-style in-
dexing, or optimised methods such as loc (primarily using labels) and iloc (primarily using integer
position). Note that slices using loc are inclusive of the end record.

#Selection of rows by position


df[1:3]

#Select column(s) by label


df['rate']
df[['rate','currency']]

#Alternatively - but column name must be a valid Python variable name


df.rate

#Select specific rows and columns (rows currently have integer index)
df.iloc[0:2,[1,2]]
df.loc[0:2,'rate']

#Extract elements in given positions (along an axis, default=0)



df.take([1,2])

The at and iat methods can be used to retrieve single values in a similar manner. Note that when
retrieving a single column, what is returned is actually a pandas series (with any corresponding
index). To extract just a list of values, use the tolist method.

#Pandas series
print(type(df['rate']))

#Python list
print(type(df['rate'].tolist()))

Selection using Boolean arrays is supported.

#Find rows for which the exchange rate is above 10


df[df.rate>10]

#Find rows where the unit contains 'y'


df[df.unit.str.contains('y')]

In pandas, the symbols & (and), | (or) and ~ (not) are used as Boolean operators.

#Note that parentheses are required because &, | and ~ bind more
#tightly than the comparison operators

#Find rows for which the exchange rate is between 10 and 100
df[(df.rate > 10) & (df.rate < 100)]

#Find rows for which the exchange rate is below 10 or above 100
df[(df.rate < 10) | (df.rate > 100)]

#Find rows for which the exchange rate is less than or equal to 10
#i.e. not greater than 10
df[~(df.rate > 10)]

A filter method can also be used for row and column sub-setting:

#Filter on specific columns


df.filter(items=['currency', 'rate'])

#Set row index


df = df.set_index('currency')

#Filter for rows based on the index


#Here currency names containing the letter J
df.filter(like='J', axis=0)

7.3.3 Iteration
To loop over the rows in a DataFrame use its iterrows method (similar to enumerate).

for index, row in df.iterrows():


print(index, row)

An alternative (and faster) approach is to use itertuples. The fields in each row are placed into
a tuple (and must therefore be accessed by their position) with or without the index as the first
element.

for tup in df.itertuples(index=True):


print(tup)

The pandas query method can be used to simplify some selection operations by writing simple
string expressions. For example, the following two statements are equivalent.

#Select rows where the value of A is greater than the value of B


df[df.A > df.B]

#Equivalent selection written as a query


df.query('A > B')

7.3.4 Sorting
Dataframes can be easily sorted by their index using the sort_index method.

#Sort by index
print(df.sort_index())

#Sort by index in reverse order


print(df.sort_index(ascending=False))

To sort on the values of a particular column use the sort_values method.

df.rate.sort_values()

Of course this will only return a single column (i.e. a series). Fortunately, this method also exists for
dataframes.

#Sort the dataframe using multiple columns (i.e. primary, secondary keys)
df.sort_values(by=['unit','currency'])

The na_position parameter can be set to determine the treatment of missing values.
A similar requirement is to determine the rank, assigning the value 1 to the element in first place, 2
to the next element and so on, based on the sort order. The default behaviour will average ranks
for elements in tied positions. Thus two elements in joint second position would each receive ranks
of 2.5 (the average of 2 and 3). This behaviour can be adjusted by setting the method parame-
ter.

#Create a column to define some rank order


df['rank'] = df.rate.rank()
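The method parameter mentioned above can, for example, be set as follows (a brief sketch):

#Tied values all receive the lowest rank in the tied group
df['rank_min'] = df.rate.rank(method='min')

#Rank in descending order instead
df['rank_desc'] = df.rate.rank(ascending=False)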

7.3.5 Manipulation
DataFrame updates can be achieved with assignment operations provided, of course, that the
objects on the left and right hand side are similar in size.

#Add a new column (or override an existing one)


df['unit'] = ('pound', 'euro', 'yen', 'dollar', 'renminbi')

#Create a new row using None


newrow = pd.DataFrame({'currency': ['ZAR'],
'rate': [12.2134],
'unit': [None]})

#Append another row


df = pd.concat([df,newrow] , ignore_index=True)

#Update a single value


df.loc[4,'unit'] = 'yuan'

#Update a column
df.loc[:,'rate'] = [0.8244, 0.9172, 106.98, 1.5434, 7.1322]

#Rename a column (the columns index itself cannot be modified in place)
df = df.rename(columns={df.columns[-1]: 'name'})

Appending rows iteratively, adding one row at a time, is considered inefficient. In such circumstances it is better to add rows to a list, which can then be appended as a single operation.

#Creating extra rows in a new DataFrame


df2 = pd.DataFrame([['DKK', 6.60, 'krone'],
['THB',31.17, 'baht']], columns=df.columns)

#Append one DataFrame to another


df = pd.concat([df,df2], ignore_index = True)

The concat method can also be used to combine DataFrames, along either axis (0 for rows, 1 for
columns).

#Combine dataframe rows as above


df = pd.concat([df, df2], ignore_index = True)

#Add dataframe rows as columns (index matching)


df3 = pd.DataFrame({"country": \
["UK", "Eurozone", "Japan", "Australia", "China", \
"South Africa", "Denmark", "Thailand"]})
print( pd.concat([df, df3], axis=1))

It is possible to combine DataFrames using methods similar to database-style joins (see section 8.7). Indeed, pandas has a join method, which by default matches rows from one DataFrame to rows in another using each DataFrame’s index.

#Create a DataFrame of currency amounts


df4 = pd.DataFrame({'currency': ['GBP', 'EUR'],
'amount': [100, 50]})

#Join with the currency table to get exchange rates


df5 = df4.set_index('currency').join(df.set_index('currency'))

#Convert amounts to dollar values


df5['dollarvalue'] = df5['amount']/df5['rate']

An alternative, and perhaps more versatile, method is merge.

#Merge with the currency table to get exchange rates


df5 = df4.merge(df, left_on='currency', right_on='currency')

#Convert amounts to dollar values


df5['dollarvalue'] = df5['amount']/df5['rate']

To remove rows and columns use either del or drop.

#delete a column
del df["rate"]

#delete a column
df = df.drop(["rate"], axis=1)

#delete a row by index


df = df.drop([2])

#delete two rows, updating the dataframe without assignment


df.drop([2,3], inplace=True)

7.3.6 Dates
When working with time series, rows in a DataFrame can be referenced by date. Pandas provides support for this through its DatetimeIndex object. These can be created with the date_range function:

#Daily dates between start and end date


dates = pd.date_range('2020-1-1','2020-12-31', freq='D')

#Monthly dates from start date and given count


dates = pd.date_range('2020-1-1', periods = 12, freq='M')

#Annual dates from start date and given count


dates = pd.date_range('2000-1-1', periods = 20, freq='A')

Note that monthly and annual dates default to the end of month and year respectively. Once
created they can be assigned as the index of a DataFrame:

df.index = dates

Note that DatetimeIndex objects are immutable.
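One benefit of a DatetimeIndex is that rows can then be selected by date, month or date range. A brief sketch (the price data here is illustrative only):

#Build a small time series indexed by daily dates
dates = pd.date_range('2020-1-1', '2020-12-31', freq='D')
ts = pd.DataFrame({'price': range(len(dates))}, index=dates)

#Select a single day
print(ts.loc['2020-03-15'])

#Select all rows in a given month
print(ts.loc['2020-03'])

#Select a date range (inclusive)
print(ts.loc['2020-03-01':'2020-03-07'])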

7.4 scikit-learn
Scikit-learn (Pedregosa et al., 2011) is a machine learning package built on NumPy, SciPy and
matplotlib. It incorporates multiple tools for problems relating to classification, regression and clus-
tering. It will be used in Chapters 11 & 12. See https://scikit-learn.org/ for more details.

7.5 tkinter
Tkinter is cross-platform toolkit used by many different programming languages to build a GUI
(Graphical User Interface). It provides access to standard windows components such as text
boxes, buttons, listboxes, etc. See https://docs.python.org/3/library/tkinter.html for docu-
mentation or tutorials such as https://realpython.com/python-gui-tkinter/. It can be used in
combination with packages such as pyinstaller to create standalone executables. Simple tkinter
projects can be found in Eramo (2020).
To see how easy it is to begin, the following code creates and launches a window with a spe-
cific title. The window automatically has standard features such as the ability to reposition, resize,
maximise and minimise. The output is shown in figure 7.1.
The main window (here) is called root (remember that one window can spawn another). The mainloop method creates the window and runs it (like an infinite loop) until it is terminated.

import tkinter

#Define window
root = tkinter.Tk()
root.title('First window')

#Run root window


root.mainloop()

Figure 7.1: Simple tkinter window

7.5.1 Widgets
Tkinter widgets are used to build the user interface. Each component must be created, associ-
ated with a window (considered to be its parent), configured (in terms of its properties) and then
positioned.

Widgets can be positioned on their window using one of three approaches:


• Pack positions the widgets in relation to one another. It is the easiest approach but provides
limited functionality to carefully position each component according to individual stylistic
preferences.
• Place uses absolute x/y coordinates.
• Grid uses a two dimension grid with widgets arranged in horizontal and vertical rows.

Labels
The label widget allows non-editable text to be displayed. We use this simple component to
illustrate each of the approaches to positioning the widgets on the window.
The pack approach simply needs to pack each element in the sequence of appearance.

#Define window
root = tkinter.Tk()
root.title('Labels (pack)')

#Set size of initial window


root.geometry('250x150')

#Create label
label1 = tkinter.Label(root, text = "Hello")
label2 = tkinter.Label(root, text = "World")

#Pack approach
label1.pack()
label2.pack()

root.mainloop()

The place approach requires each widget to be positioned relative to the top left hand cor-
ner.

#Define window
root = tkinter.Tk()
root.title('Labels (place)')

#Set size of initial window


root.geometry('250x150')

#Create label
label1 = tkinter.Label(root, text = "Hello")
label2 = tkinter.Label(root, text = "World")

#Place approach
label1.place(x = 10, y = 10)
label2.place(x = 50, y = 100)

root.mainloop()

The grid approach requires each widget to be positioned in a specified row and column. Rows or columns that contain no widgets will be ignored. The size of the largest widget will determine the overall sizing of the row/column in which it is positioned.

Figure 7.2: Tkinter Label Positioning: (a) pack, (b) place, (c) grid

#Define window
root = tkinter.Tk()
root.title('Labels (grid)')

#Set size of initial window


root.geometry('250x150')

#Create label
label1 = tkinter.Label(root, text = "Hello")
label2 = tkinter.Label(root, text = "World")

#Grid approach
label1.grid(row = 0, column = 0)
label2.grid(row = 1, column = 1)

root.mainloop()

Buttons
Buttons are widgets that users can click to initiate some action. In addition to creating the control, the user must ‘plumb in’ the widget by associating it with the code that should run when the button is clicked.

from tkinter import messagebox

def showmessage():
    messagebox.showinfo(message="you clicked the button")

#Define window
root = tkinter.Tk()
root.title('Buttons')

#Set size of initial window


root.geometry('250x150')

#Create buttons, adding it to the root window


button = tkinter.Button(root, text = 'Click me', command = showmessage)
button.grid(row = 0, column = 0)

root.mainloop()

Figure 7.3: Buttons


8 I/O Operations

Processing of data starts and ends with input and output respectively. The source or destination
of data could include transient hardware devices such as keyboards and computer monitors, live
data feeds such as stock price tickers, or more permanent records such as files stored on a hard
drive or fields in a database.

8.1 User input


Input from the user can be requested using the input command. This will cause code execution
to pause until the user types something at the console prompt and presses ‘Enter’.

value = input("prompt for user to enter something:")

Whatever is typed will be assigned to the variable. By default this will be of type string (even if the
user keys a numeric value).
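If a numeric value is required, the returned string must be converted explicitly, for example:

#Convert user input to a number (raises ValueError if it is not numeric)
amount = float(input("Enter an amount: "))
print(amount * 2)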

8.2 File based input


When accessing files (of whatever type) it will be necessary to specify their name and file extension
(e.g. ‘test.csv’). Python will attempt to locate the file in the current working directory or other
designated folders defined as part of the python path.

8.2.1 File paths


Files located elsewhere must be supplied with a full or relative path.

#This will work if the file is in the working directory


filename = 'test.csv'

#Absolute location
filename = r'c:\temp\test.csv'

#Relative path: file located in a subfolder of working directory


filename = r'.\data\test.csv'

#Relative path: file located in parent folder of working directory


filename = r'..\test.csv'

Note the ‘r’ character before the string literal. This denotes a raw string that treats backslashes as
literal characters. This overcomes the fact that the backslash character used in windows file paths
would otherwise be treated as an escape character (see Section 4.5).
Note also that ‘.’ denotes the current folder and ‘..’ denotes the parent directory.
The os module provides a way of using operating system dependent functionality such as determining
the current working directory.

import os

#get current working directory
cwd = os.getcwd()

#change current working directory


os.chdir(folder)

#create a full file path from a folder and file name


fullfilepath = os.path.join(directory, filename)

#iterate over files in current working directory


for filename in os.listdir():
print(filename)

#delete file
os.remove(filename)

The pathlib module provides some alternative approaches.

8.3 Text files


File output first requires a text stream to be opened. The mode in which the file is opened will
determine whether an existing file is overwritten or whether additional text is appended to the
end. Notice the newline characters that have been added in the example below.

#Open file in write mode


f = open('outputfile.txt', "w")

f.write("line 1\r")
f.write("line 2\r")

f.close()

#Open file in append mode


f = open('outputfile.txt', "a")

f.write("line 3\r")

f.close()

Unfortunately working with text gets complicated by the many languages, symbols and even emo-
jis in common usage. For this reason Python provides support for unicode (https://www.unicode
.org/), a system that aims to list every character used by human languages, providing each with
a unique code point (an integer value). There are already 150,000 such codes, which clearly goes well beyond the 8-bit characters supported by ASCII codes (https://www.ascii-code.com/).
Text files can therefore be encoded in different ways. One common method is ‘utf-8’. You may
notice that this appears in the status bar of Spyder, indicating the default encoding of the python files it creates. Note that not all applications will cope with non-standard encodings: a character that can be typed in one application may not be compatible with another. For conversion,
python provides support via its codecs (encoders/decoders) library.
To illustrate the potential difficulties, consider the following code:

#Text containing a character outside the standard ASCII character set


name = "Áine"

#This won't work - the text contains a character outside the ASCII set


f = open('outputfile2.txt', "w", encoding = 'ascii')
f.write(name)
f.close()

#But this will work fine


f = open('outputfile3.txt', "w", encoding = 'utf-8')
f.write(name)
f.close()

Reading from a file follows a similar approach. The file must be opened in ‘read’ mode.

#Open file for read


f = open('inputfilefile.txt', 'r', encoding = 'utf-8')

#Read the first line


text = f.readline()
print(text)

#Read the next 5 characters


text = f.read(5)
print(text)

#Read all the way to the end


text = f.read()
print(text)

f.close()

Alternatively a for loop can be used to read the file one line at a time.

f = open('inputfilefile.txt', 'r', encoding = 'utf-8')

for line in f:
print(line)

f.close()

See also section 4.12 for how to use the with keyword to ensure files are safely closed.

8.4 CSV
One popular format for outputting tabular data is the csv file (comma separated values). These
can be created manually by writing every record to a new line, and adding a comma (or other
separator) between each field of each record.

FX = [['GBP', 1.2345], ['EUR', 1.0987]]

f = open('outputfile.csv', 'w')

for row in FX:


f.write(','.join(str(i) for i in row)+'\r')

f.close()

This uses the join method of strings to concatenate together a container of values with a given
separator. Note that here, care is required to convert numeric values to strings. Had all the values
been strings, it would have been possible to write this as:

f.write(','.join(row))

Also note that while csv is a popular format, it can cause problems when the data fields themselves
naturally contain commas. If used as input, such commas will be interpreted as indicating the end
of one field and the beginning of the next. Therefore it may be better to use separators that do
not commonly appear in text such as the pipe symbol (|).
Reading csv files manually can be done as follows:

f = open('outputfile.csv', 'r', encoding = 'utf-8')

FX = []

for line in f:
#Remove carriage return
line = line.strip()

#Split fields into an array and add to list


FX.append(line.split(','))

f.close()

Unsurprisingly, Python offers a library to make this a little easier.

import csv

#Import
with open('outputfile.csv', newline='') as f:
reader = csv.reader(f)
for row in reader:
print(row)

#Export (again)
with open('outputfile2.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerows(FX)

The pandas DataFrame object has its own method for csv export.

#Export with default options


df.to_csv('outputfile2.csv')

#Export with tab separator but without field names


df.to_csv('outputfile3.csv', header=False, sep="\t")

Importing works in exactly the same manner, with optional parameters set according to the pre-
cise format of the file.

df_in = pd.read_csv('outputfile2.csv')

df_in2 = pd.read_csv('outputfile3.csv', header=None, sep="\t")

8.5 Pickle
The pickle module allows data objects to be serialised into (and un-serialised from) a byte stream,
and can therefore be used as a way to persist data between python sessions. As a binary format,
the files are not human readable (try opening one with a text editor).
In the code that follows, notice that the read and write modes are supplemented with ‘b’ to
denote a binary format.

import pickle

mydict = { "python": "good", "r": "ok" }

#Export pickled object


pickle.dump( mydict, open( "pickle.p", "wb" ) )

del(mydict)

#Re-import pickled object


mydict = pickle.load( open( "pickle.p", "rb" ) )

8.6 Excel workbooks


Pandas provides support for reading and writing data from Excel. Output can be as simple as
writing one line of code.

df.to_excel("output.xlsx")

More involved operations over multiple sheets can be performed with the help of the ExcelWriter
object. This is akin to opening a text stream and then performing multiple write operations.

with pd.ExcelWriter('output.xlsx') as writer:


df.to_excel(writer, sheet_name='Sheet1')
df2.to_excel(writer, sheet_name='Sheet2')

While the read function has many optional parameters, reading in tables is relatively simple.

#Read a simple Excel file (single sheet and table)


df_in = pd.read_excel('input.xlsx')

#Read table from specific location on specific sheet


df_in2 = pd.read_excel('input.xlsx', sheet_name = 'Sheet2',
skiprows = 2, usecols = 'B:C')

8.7 Databases
While there are many different database vendors and implementations (e.g. Sybase, SQL Server,
Oracle, PostgreSQL), most relational databases are based on ANSI SQL (pronounced as ‘sequel’)
which stands for Structured Query Language. As the name suggests, SQL commands can be used
to query the tables with a database to extract information. As shown in Figure 8.1, tables consist
of rows (records) and columns (fields). Additionally, SQL commands can be used to insert, update
or delete data records, or indeed to change the database schema itself (i.e. create, modify or
delete tables). An introduction to the SQL language can be found at https://www.w3schools.com/
sql.

Figure 8.1: Database table

The sqlite3 module provides an interface to SQLite, a library that implements a self-contained
SQL database engine. Documentation can be found at https://docs.python.org/3.8/library/
sqlite3.html.
Just as a text stream must be opened to read from or write to a file, a connection must first be
established with a database before queries can be executed. The database can be a file, a
server, or even one held entirely in memory.

import sqlite3 as sq3

#Create connection to a named database


con = sq3.connect('mydb.db')

#Create a cursor object to run queries


cur = con.cursor()

#Define and execute a query


query = 'SELECT * FROM users'
cur.execute(query)

#Retrieve the results (a list of tuples)


rows = cur.fetchall()

for row in rows:


print(row)

#Close connection
con.close()

Long queries with multiple clauses are best typeset across several lines inside triple quotes. Pandas
provides support for importing data directly from a database.

import pandas as pd
df = pd.read_sql(query, con)
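As a sketch of the triple-quoted style mentioned above (the table and field names here are illustrative only):

query = """
SELECT country, SUM(value) AS total
FROM assets
WHERE currency = 'EUR'
GROUP BY country
ORDER BY total DESC
"""
df = pd.read_sql(query, con)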

DB Browser for SQLite (https://sqlitebrowser.org/) is a visual, open-source application that can be used to inspect, edit, and create database files compatible with SQLite.

8.7.1 Queries
Select statements form the basis for all queries to retrieve data.

#Select all rows and columns from a table


SELECT * FROM tablename

#Limit output to a defined number of rows


SELECT * FROM tablename limit 10

#Select all rows but only particular columns from a table


SELECT column1, column2 FROM tablename

Where clauses are used to be selective about which rows to return, and act like filters on the query.
Standard Boolean operators (AND, OR, NOT) and parentheses can be used to modify the intended
logic. In SQL, the percent sign (%) is used as the wildcard character.

#Select a subset of rows from a table


SELECT * FROM assets WHERE currency = 'EUR'
SELECT * FROM assets WHERE currency IN ('NOK', 'DKK')
SELECT * FROM assets WHERE currency = 'EUR' AND value > 1000000
SELECT * FROM assets WHERE name like 'A%'

#Find unique values for a given field


SELECT DISTINCT(country) FROM assets WHERE currency = 'EUR'

#Find number of records matching filter criteria


SELECT COUNT(*) FROM assets WHERE currency = 'EUR'

The results of queries can be manipulated by sorting (ORDER BY) or aggregating (GROUP BY).
The modifiers ‘ASC’ (the default) and ‘DESC’ can be used for alphabetic/numeric or reverse al-
phabetic/numeric sorts. Common aggregation functions include COUNT, SUM, MAX, MIN and
AVG.

#Sort results based on two sort keys


SELECT * FROM assets ORDER BY name, value DESC

#Aggregate results
#Selected fields must be used for grouping or aggregated in some way.
SELECT country, SUM(value) from assets GROUP BY country

Updates to tables are covered by INSERT, UPDATE and DELETE queries. Adding new rows to existing
tables is performed by INSERT queries. Depending on how tables have been created, it may or
may not be necessary to supply values for all fields. For example, certain fields may be automatically populated as an auto-incremented index. Other fields may be permitted to hold no values (i.e. allowed to be null).

#Insert new row into a table


INSERT INTO tablename VALUES (value1, value2, value3, ...)
INSERT INTO tablename (field1, field2) VALUES (value1, value2)

#Update selected fields for selected records


UPDATE tablename
SET column1 = value1, column2 = value2
WHERE column3 = value

#Delete selected rows


DELETE FROM tablename WHERE column1 > value;

Queries can be dynamically constructed in Python using string concatenation and formatting.
The execute method of the cursor object also permits ‘?’ to be used as a placeholder into which
values supplied in a tuple will be substituted. Note that queries that alter the data in a database
must be committed before they take permanent effect.

con = sqlite3.connect(dbname)
cur = con.cursor()
query = """UPDATE transactions set amount = ?
WHERE account_id = ?"""
cur.execute(query, (123.45, 1234567))
con.commit()

For relational databases (where data in one table is related to data in another), the concept
of normalisation is applied to reduce inefficiency (storing the same information more than once)
and inconsistency. To extract useful information from a database it is therefore often necessary to
combine data from different tables. In SQL this is performed using JOIN operations of which there
are several types. Suppose we wish to combined two tables: table1 (the left table) and table2
(the right table).
• (inner) joins combine records with matching values in both tables (akin to set intersection)
• left (outer) joins return all records from the left table and matching records from the right
table (a bit like a lookup)

• right (outer) joins return all records from the right table and matching records from the left
table (a bit like a lookup)
• full outer joins return all records from both tables whether there is a match or not (akin to set
union)
Inner joins are by far the most common. When combining tables, it is possible that field names
become ambiguous. To resolve this, field names can be prefixed by their table name using a
period ‘.’. As in the example below, table names can be aliased to avoid repeatedly typing their
full name.

#Return all fields from table1 and one specific column from table 2
#matching records based to two fields
SELECT t1.*, t2.columnZ
FROM table1 t1
INNER JOIN table2 t2
ON t1.columnA = t2.columnX AND t1.columnB = t2.columnY

These SQL statements form the core building blocks of most database queries, and can be used
in combination to perform more complicated operations (e.g. joining three tables and simultane-
ously filtering the results, or using the results from one query as the input into another).

8.8 JSON
JSON (JavaScript Object Notation) is a human-readable data format that, despite the name, is lan-
guage independent. There are two basic building blocks. One of these is based on name/value
pairs and is similar in concept to a python dict:

{"id": 1234, "name": "Smith, Roberta", "dob": "19/11/1995"}

Note that names must always be strings, while values can be strings, numbers or other valid JSON types. The other is an
ordered collection of values similar to a python list:

[ "one", 2, "iii" ]

Any combination of these two object types is valid. The dumps and loads functions respectively
convert (valid) python objects to json strings and json strings to python objects. To help you re-
member that they work with strings, remember that they end in ‘s’.

import json

#Create JSON text


text = '{"id": 1234, "name": "Smith, Roberta", "accounts": [123, 456]}'

#Load JSON creating object structure


d = json.loads(text)

#Convert structure to JSON text


print(json.dumps(d))

To save or load json text to/from file use dump and load.

#save to file
with open('json.txt', 'w') as f:
json.dump(d,f)

#read from file


with open('json.txt', 'r') as f:
d2 = json.load(f)

In addition to list and dict objects, tuples can also be converted to JSON (but will become lists
when reloaded). The keywords True, False and None are also automatically converted to true, false
and null respectively. See for example https://www.w3schools.com/python/python_json.asp.
Given the similarity of JSON formats to list and dict objects, file IO in JSON format is common in practice.

8.9 XML
XML (eXtensible Markup Language) is a human-readable format for the storage and transfer of
data. XML documents appear similar in nature to the HTML that is used to encode web pages,
and consist of content and markup. Just as round and curly brackets are used to demark the
structure of JSON, a main feature that defines the structure and syntax of XML is the tag:

<tagname>content</tagname>
<empty-tag />

Each tag begins with < and ends with >. They normally appear in pairs, with an extra ‘/’ character
used to distinguish the start from end tag. Unlike HTML, XML is not limited to a predefined set of
recognised tags: the tag names can be freely defined by the XML document author. A pair of
tags with its intermediate content is referred to as an element. The content can be individual values
or other valid xml. Attributes are name/value pairs contained within start or empty tags:

<tagname name = "roberta" id = "1234"> </tagname>


<empty-tag value="42"/>

XML documents often begin with a declaration describing the precise format of the XML that follows. Just as for Python, indentation aids readability, with tabs indicating the nested structure of
complex hierarchical objects. Note that XML tags are also case sensitive. Suppose the following
XML file (‘portfolio.xml’) has been created.

<?xml version="1.0" encoding="UTF-8"?>


<portfolio>
<asset type="equity">
<ticker>FB</ticker>
<close>123.45</close>
</asset>
<!-- This is a comment -->
<asset type="equity">
<ticker>C</ticker>
<close>56.37</close>
</asset>

</portfolio>

It is possible to open an XML document and traverse its structure using the Document Object
Model (DOM).

import xml.etree.ElementTree as ET

tree = ET.parse('portfolio.xml')

portfolio = tree.getroot()

for asset in portfolio:


print(asset.attrib['type'])
for detail in asset:
print(detail.tag, detail.text)

Alternatively, it is possible to create a Python structure (based on the OrderedDict) that replicates
the XML document, using the xmltodict project (which must first be installed).

import xmltodict

with open('portfolio.xml') as fd:


doc = xmltodict.parse(fd.read())

print(doc)

8.9.1 XPath
XPath is part of the XSLT standard. Without getting bogged down in new abbreviations1 it is a tool
that can be used to navigate XML documents. The hierarchy of elements within a XML document
can be thought to describe a path (similar to a path to a file within a folder structure). XPath allows
elements belonging to a particular part of the path to be directly retrieved.

from lxml import etree

tree = etree.parse('portfolio.xml')

#Retrieve 'asset' elements under the root 'portfolio' element


matches = tree.xpath('/portfolio/asset')

8.10 HTML
HTML (Hypertext Markup Language) is the encoding used for content designed to be displayed
in a web browser. On most browsers it is possible to right click and choose ‘view source’ to reveal
the actual web page that the browser is rendering. As the HTML abbreviation suggests, HTML
documents contain both content (text) and markup (formatting instructions) necessary to do this.
1. XSL (Extensible Stylesheet Language) is a language for expressing style sheets. Style sheets are designed to describe the formatting rules that should be applied to data stored in XML format. XSLT (XSL Transformations) allows XML documents to be converted into other markup formats.

While a thorough understanding of HTML is not strictly necessary, it is helpful to have an idea of the
basic building blocks that are used to provide the structure and markup in an HTML document.
Using any text editor, try creating a document called ‘page.html’ with the following content. Double click
the file; it should open and display only the content (with formatting as defined by the markup) in
your default browser.

<!DOCTYPE html>
<html>
<head>
<title>Title</title>
</head>
<body>

<h1>Heading One</h1>
<p>My first paragraph.</p>

<table>
<tr>
<th>col 1 header</th>
<th>col 2 header</th>
<th>col 3 header</th>
</tr>
<tr>
<td>row 1 col 1</td>
<td>row 1 col 2</td>
<td>row 1 col 3</td>
</tr>
<tr>
<td>row 2 col 1</td>
<td>row 2 col 2</td>
<td>row 2 col 3</td>
</tr>
</table>

</body>
</html>

8.10.1 Web scraping


Our interest in HTML stems from the fact that the internet is a veritable treasure trove of data. Ex-
tracting data (particularly in an automated fashion) from the internet is referred to as web scrap-
ing.
As an example of what is possible, we consider the Wikipedia page that lists ISO currency codes.
Here we use the requests library to retrieve the HTML, and the BeautifulSoup library that makes it
easy to parse the HTML.

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/ISO_4217'

#Retrieve the HTML from the website



req = requests.get(url)

#Create an object representation for processing.


soup = BeautifulSoup(req.text, 'lxml')
print(soup)

#Find tables within the HTML - the one of interest is the second one
tabs = soup.find_all('table')
tab = tabs[1]

#Find the rows from the table structure


rows = tab.findAll('tr')

#Skip the header row


for row in rows[1:]:
#Find the columns in each row - print the 1st and 4th columns
cols = row.findAll("td")
print(cols[0].get_text(), cols[3].get_text())

8.10.2 Pandas
The Pandas read_html function extracts all tables from an HTML page and returns them as a list of
DataFrames.

#Extract all tables from a webpage as a list of dataframes


dfs = pd.read_html("https://en.wikipedia.org/wiki/ISO_4217")

8.10.3 Downloads
File downloads (from a known url) can be performed with the urllib library

import urllib.request

#Download an xml file with currency codes


url = 'https://www.currency-iso.org/dam/downloads/lists/list_one.xml'
urllib.request.urlretrieve(url, 'currencies.xml')

#Download the python logo


url = 'https://www.python.org/static/community_logos/' \
+ 'python-logo-master-v3-TM.png'
urllib.request.urlretrieve(url, 'python.png')

8.10.4 Browser based web scraping


Packages such as selenium work in combination with drivers that allow programmatic control of
browsers. While designed for automated web-based testing, they can be used to mimic the steps
(such as typing) that one might take to manually access and retrieve information from a website.
For example usage please see https://pypi.org/project/selenium/.

8.10.5 Responsible web scraping


While it is possible to scrape data, such an approach is not always welcomed by the or-
ganisation behind the website. Note that many websites, recognising the value of their content, try
to make it difficult for wholesale harvesting of their data, particularly if it is for commercial exploita-
tion. Some web sites are designed deliberately to prevent such activities. This can mean limiting
the number or frequency (throttling) of connections from a single IP address, or using sophisticated
techniques such that content does not appear in raw HTML. Thus, while it may now be technically
possible to extract data from a website, one should always pause to ask if it is legitimate to do
so.
To begin with, one should consult the website for terms of service. This can clarify, for example,
if personal, educational, or commercial usage is permitted. If in doubt, consider reaching out
to the company to check. Most domains also include a robots.txt file (see for example https://
www.yahoo.com/robots.txt). While primarily aimed at search engine crawlers, this can indicate
parts of a website where automated requests are unwelcome.
Some websites are happy to share their data to the point of facilitating information requests via an
application programming interfaces (API), often with accompanying documentation and sample
code. These define protocols for requesting data (or performing other operations) and allow com-
panies to better marshal such requests. Where available, these should be the preferred mode of
access (see Section 8.11).
A responsible web scraper can choose to share additional information via the user-agent request
header. This allows servers to check the application (normally a browser), version, and operating
system from where requests have been made. Some automated requests can be distinguished
(and blocked) in this way. To promote transparency, the header can be customised to provide
additional information (such as a contact email address) that would allow the domain owner to
understand or query unexpected usage.
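A sketch of how such a header might be supplied using the requests library (the project name and contact details are of course illustrative):

import requests

headers = {'User-Agent': 'FIN3028-student-project (contact: me@example.com)'}

req = requests.get('https://en.wikipedia.org/wiki/ISO_4217', headers=headers)
print(req.status_code)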
A further consideration is fair usage of a shared resource. While a single user running a single pro-
cess is unlikely to overload a server, sending too many requests could impact the service available
to others. Taken to an extreme, this could result in a situation similar to a denial-of-service (DoS) at-
tack, rendering the website unavailable to users. A simple solution is to slow the speed of requests
by adding sleep commands to periodically pause the execution of code.
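For example (the URLs and delay below are illustrative):

import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
    response = requests.get(url)
    #process the response here
    time.sleep(2) #pause for two seconds before the next request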
Post-scraping, the developer is also faced with responsible storage, processing, and interpretation
of the data. One should also consider how the data itself will be used: for example, if content is
subject to copyright, can it be reshared in raw or derived formats, what attribution is required, and
is commercial exploitation permitted?
See Heydt (2018) for more advice on how to web scrape responsibly.

8.11 API
8.11.1 Using an API
An application programming interface (API) defines the functionality provided by a particular
piece of software, and the protocols with which to interact with it. There are a multitude of software
services that can be accessed via their APIs. These can be web services accessed via HTTP or
other language-specific wrappers. While some are designed to be open and free-to-use, others
require access tokens and may incur usage costs.
As an example, we consider https://date.nager.at/Api, a free-to-use public holiday API.
To use this service manually, you can take a base URL and append a four-digit year and a two-character
ISO country code. Thus, the 2021 US holidays could be retrieved via:
https://date.nager.at/api/v2/publicholidays/2021/US

If you navigate to this link in a browser you will see a page of plain text in JSON format. Of course,
such a manual process can also be coded in Python as follows:

import requests
import json

url = 'https://date.nager.at/api/v2/publicholidays/2021/US'

#Access url
req = requests.get(url)

#Inspect what the website has returned


print(req.text)

#Convert the JSON text into a Python structure (a list)


holidays = json.loads(req.text)

#Extract (some of) the details


for h in holidays:
    print(h['date'],'-',h['name'])

A similar service is provided by https://holidayapi.com/ but as a paid-for service. They provide
a Python API client library holidayapi available from https://github.com/holidayapi/holidayapi-python.
The code below shows how this would work, but crucially, it requires an API key to authenticate
valid access to the service.

import holidayapi

hapi = holidayapi.v1('_YOUR_API_KEY_') #Alas this is a paid for service!

parameters = {
'country': 'US',
'year': 2021,
}

holidays = hapi.holidays(parameters)

Quandl provide access to financial data via an API with free and premium services
(https://www.quandl.com/tools/api). For a list of other public APIs see
https://github.com/public-apis/public-apis.

8.11.2 Creating an API


Creating your own API running as a local service is easily achieved using a library such as flask.
Here’s a classic example from https://realpython.com/flask-by-example-part-1-project-setup/.

from flask import Flask


app = Flask(__name__)

@app.route('/')
def hello():
    return "Hello World!"

if __name__ == '__main__':
    app.run()

Running the code in Spyder or from an Anaconda prompt will create a local server, accessible
from your browser at the URL http://127.0.0.1:5000/ as shown in Figure 8.2. Via the decorator
@app.route (see section 4.14.2) the function hello is registered for the route / so that it
is called whenever this precise URL is accessed.

Figure 8.2: Flask - example 1

This example can be extended to make the return vary depending on the precise URL visited.
Notice in the code below that two additional @app.route’s have been added.

from flask import Flask


app = Flask(__name__)

#Add code to initialise variables here

@app.route('/')
def hello():
    return "Hello World!"

@app.route('/hi')
def hi():
    return "Hi Everyone!"

@app.route('/<name>/<int:age>')
def parsetextandnumber(name,age):
    return name + " is " + str(age) + " years old"

if __name__ == '__main__':
    app.run()

Notice how the final @app.route defines the format (and type) of the function inputs. In addition
to the text and integer inputs required here it is also possible to configure:
• string - text (default)

• int - unsigned integers


• float - unsigned real numbers
• path - text but accepts slashes
• uuid - for UUID strings
In this way the server can be made to respond dynamically to requests for information. The output
from accessing different URLs is now shown in Figure 8.3.

Figure 8.3: Flask - example 2

Query Arguments
URLs can embed query arguments to allow for a more flexible range of possible input values. For
example:

domain.com?name1=value1&name2=value2

The query string begins after the ‘?’ character and ampersand (&) is used to separate each
key-value pair. The following code snippet shows how to extract their values:

from flask import Flask, request

#Create the app as before
app = Flask(__name__)

@app.route('/query')
def queryexample():
    value1 = request.args.get('name1')
    value2 = request.args.get('name2')

    return f'value1={value1}; value2={value2}'

8.12 Financial Data


8.12.1 pandas-datareader
Support for sourcing data is provided by pandas-datareader. This allows data from various internet
sources (including FRED, World Bank, Eurostat) to be accessed, although in some cases an API key
is required. For a full list of sources and further documentation see
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html.
In this example, economic data is sourced from FRED.

import pandas_datareader.data as web


import datetime

start = datetime.datetime(2019, 1, 1)
end = datetime.datetime(2023, 1, 1)

#Retrieve a dataframe of monthly GDP values


gdp = web.DataReader('GDP', 'fred', start, end)

#Retrieve a dataframe of monthly CPI values


inflation = web.DataReader(['CPIAUCSL', 'CPILFESL'], 'fred', start, end)

In this example, market data is sourced from Yahoo Finance.

#Price/volume data for Facebook


df = web.DataReader('FB', 'yahoo', start, end)

#Corporate actions for Citi


actions = web.DataReader('C', 'yahoo-actions', start, end)

#Dividends for IBM


dividends = web.DataReader('IBM', 'yahoo-dividends', start, end)

8.12.2 yfinance
Ran Aroussi’s yfinance offers another library that facilitates dynamic sourcing of financial data.
This provides access to the public API provided by Yahoo Finance. The example below shows
how to obtain historic stock prices but it is also possible to source company financials and much
more.

import yfinance as yf

ticker = yf.Ticker("MSFT")

# get historical market data


hist = ticker.history(period="1y")

8.12.3 WRDS
At Queen’s you have access to various data via Wharton Research Data Services (WRDS).

Account setup
1. Browse to wrds-www.wharton.upenn.edu
2. Select the Register tab.
3. Complete the Account Request form selecting ‘Queen’s University Belfast’.
4. Once you submit an Account Request, an email will be sent to your WRDS Representatives.
After receiving approval, an account will be created and you will receive an e-mail message
with a special URL and instructions for setting the account password and logging into WRDS.

5. You may log into WRDS via /wrds-www.wharton.upenn.edu/login


6. Review the WRDS Terms of Use.
7. You may begin using your new account.
To get started review wrds-www.wharton.upenn.edu/pages/classroom/.

Accessing via python


Note: WRDS is currently only accessible via python from outside the QUB network

Accessing WRDS via python requires the wrds package to be installed. From an anaconda prompt
run:

pip install wrds

You can then connect to WRDS via the commands:

import wrds

#Connection with prompt for username and password


conn = wrds.Connection()

This approach will work fine in Jupyter but not with Spyder (v5.15) because of the prompt to get
login details. This can be solved by explicitly supplying this information as input to the connection
function.

#Connection with explicit username and password (no prompt)


conn = wrds.Connection(wrds_username='myusername',
                       wrds_password='mypassword')

Alternatively, if you create a file called ‘.pgpass’ (no file name, just a file extension) in your working
directory and edit it to have text as follows (with valid login details):

wrds-pgdata.wharton.upenn.edu:9737:wrds:myusername:mypassword

You can then connect without exposing your password in code.

#Connection with preconfigured .pgpass file (no prompt)


conn = wrds.Connection(wrds_username='myusername')

To get started retrieving data, see the following documentation and sample code:
• pypi.org/project/wrds
• wrds-www.wharton.upenn.edu/documents/1443/wrds_connection.html
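Once connected, a typical workflow might look like the sketch below. Which libraries and tables are available depends on your institution's subscription; the djones query is taken from the wrds documentation.

#List the data libraries available to your account
libs = conn.list_libraries()

#List the tables within a particular library
tables = conn.list_tables(library='djones')

#Retrieve data with a SQL query
data = conn.raw_sql('select date, dji from djones.djdaily', date_cols=['date'])

#Close the connection when finished
conn.close()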
9 DATA HANDLING

Data in its raw form is often in a state that makes it unfit for immediate analysis. Real world data is
messy. It may:
• contain missing or unexpected values
• be inconsistently encoded
• be in an unsuitable format
• only tell part of the story
For all these reasons and more, a step by step process, illustrated in Figure 9.1 is necessary to
prepare the data for analysis.

Figure 9.1: Data processing

9.1 Data Types


9.1.1 Descriptive Data
Descriptive data can be thought of as free-form text, where a potentially infinite number of responses
are possible. In many cases, data of this type is of little real value in analysis, in the sense that it
is purely random and has no meaningful relationship with other data fields. If distinctive, it may
be used as a label that uniquely identifies each data point. For example, a company name or
security identifier is extremely useful for investigators, but of little quantitative value.

9.1.2 Categorical Data


Categorical data is used to assign each data point a particular category or label. It describes
some characteristic of the data point that can be used to identify similarities or differences between
data points. Nominal data refers to instances where no meaningful order can be imposed
on the data based on the category. A dichotomous variable is a nominal variable where only two
values are possible (akin to a Boolean property). By contrast, ordinal data allows some meaningful
order to be imposed between data points from different categories, but does not permit any
meaningful measure of distance between the points to be derived. For example:
• Eye colour (brown, blue, green, other) is nominal
• Fee status (child, adult, senior citizen) is ordinal
• Company status (public, private) is nominal
• Credit rating (A, B, C) is ordinal
Of course, nominal data can still be encoded in a manner that implies order (public=0, private=1)
even where none exists.

9.1.3 Numerical Data


Just as in Python data types, a distinction is made between discrete data and continuous data.
Discrete data refers to data which can only take a finite number of outcomes (and can therefore
be enumerated using integers). Continuous data refers to values that can be measured but can
take an unlimited number of values; such values would naturally be stored as floats (i.e. real
numbers).
Of course it is possible to create categorical data from numerical data by placing numerical data
into buckets/bins as the following example illustrates.

#Sample data
df = pd.DataFrame({'company': ['A', 'B', 'C', 'D'],
'mktcap': [975452, 123455678, 36087532, 230975486]})

#Lambda function to categorise based on size


f = lambda x: ('S' if x<10**6 else ('M' if x<10**8 else 'L'))

#Create an ordinal field: S < M < L


df['size'] = [f(x) for x in df.mktcap]

9.2 Missing Data


One should always inspect a dataset before beginning to analyse it to make sure the data is as
expected, and the analysis is valid. A common issue is with missing data. In the case of a database
table, a field can be defined as non-null. This prohibits the creation of a record if that specific field
is not supplied with a value. For example, an employee table may designate the surname field
as non-null (i.e. every employee must have a name) but permit null values for a mobile phone
number. One must therefore decide whether empty fields pose a problem, and if so, what should
be done.
Consider the following file ‘FX.csv’:

currency,rate,unit
GBP,0.8246,pound
EUR,0.9171,euro
JPY,107.64,yen
AUD,1.5301,
CNY,7.1264,yuan
DEM,,mark

When imported as a Pandas dataframe, missing values are treated as NaN (Not a Number -
np.nan). In other languages such as R, the term NA (not available) is preferred, and this also
creeps into Python. Values created with Python’s None are also considered null or missing, and can
be detected using the isnull (equivalently notnull) method.

import pandas as pd

#Read file into a dataframe


df = pd.read_csv('FX.csv')

#Append another row using None



df = df.append({'currency': 'ZAR',
'rate': 12.2134,
'unit': None}, ignore_index = True)

#Test for null values using series method


print(df.rate.isnull())

#Test for non-null values using pandas function


print(pd.notnull(df.rate))

#Count null values


print(sum(df.unit.isnull()))

One may reasonably decide that missing fields require an entire record to be discarded. This can
be achieved by selection or using the dropna method.

#Extract rows with non-null rates


df_clean = df[df.rate.notnull()]

#Drop all rows with null values


df.dropna(inplace=True)

#Drop rows where particular columns have null values


df.dropna(subset=['rate'], inplace=True)

#Drop all columns with null values


df.dropna(axis=1, inplace=True)

Note the use of the optional inplace parameter here. This updates the existing dataframe rather
than returning a modified dataframe as a result. Before considering alternative treatments for
missing data, we consider how missing records can be inserted into the dataframe. In the following
example, the dataframe's index is used to make sure every required index value is present even if
it wasn't previously represented.

required_currencies = ['GBP', 'EUR', 'JPY', 'AUD', 'USD']

#make currency the index
df = df.set_index('currency')

#reindex - retain rows with index in the list and add missing ones
df = df.reindex(required_currencies)

The reindex method provides many more options. For example a fill_value can be supplied as
the default for missing values. Alternatively, a rudimentary ‘interpolation’ method allows values to
be backward or forward filled.

df = pd.DataFrame({'month': [1, 4, 7, 10],


'value': [3, 5, 2, 1]})

df.set_index('month', inplace = True)



#Use a forward fill - missing values take last non-null value


df = df.reindex(range(1,13), method = 'ffill')

print(df)

To assign specific values to replace missing ones, use the fillna method.

#Fill all missing values with a constant


df.fillna(0)

#Fill missing values by column using a dict


df.fillna({'rate':0, 'unit':'unset'})

#Fill missing values with values from another column


df['columnA'].fillna(df['columnB'])

It’s possible to fill missing values with an average value, but this operation can only be performed
on numeric columns (in earlier versions pandas ignored non-numeric columns).

#Form a list of columns to apply to (the numeric ones)


numcols = list(df.select_dtypes(include='number'))

#Now use mean to fill missing values


# (not actually appropriate in this example!)
df[numcols] = df[numcols].fillna(df[numcols].mean())

It may make more sense to use group-based means (see section 9.8.1). Should more sophisticated
logic be required, one can always use the apply method in combination with a custom function
or lambda function.

def fixname(name):
    if pd.isnull(name):
        return 'missing'
    else:
        return name

df['unit'] = df['unit'].apply(fixname)

9.3 Data Statistics


A second step to familiarising oneself with a dataset is to generate some basic descriptive statistics.
Particularly when working with large datasets, this can quickly reveal potential problems with the
data.

#Basic individual statistics for numeric columns


df.max()
df.sum()
df.mean()

#Count, mean, std dev, and quartiles


df.describe()

#For non-numeric data describe includes an overall and unique value count
df['label'].describe()

#Find unique values


df['label'].unique()

While not foolproof, inspecting the results of such commands should reveal any outliers or unex-
pected data. Inspecting a sample of rows (say using df.head()) may also quickly reveal if the data
is not as expected.

9.3.1 Correlation
Correlation and covariance matrices can be calculated using the corresponding methods available
for dataframes. Non-numerical attributes will be ignored. The return type will be a pandas
dataframe, so it must be converted to a numpy array before being used for matrix calculations.

cor = df.corr(method='pearson')
cov = df.cov()

#Convert to numpy array (matrix)


C = cor.to_numpy()

The issue of multicollinearity occurs when two or more explanatory variables are highly linearly re-
lated. While this may not be a problem for certain machine learning models, coefficient estimates
in regression models can be unstable in such circumstances.
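One possible diagnostic is the variance inflation factor provided by statsmodels; a minimal sketch, assuming df contains only numeric explanatory variables, is shown below.

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

#Compute a VIF for each column; values well above 5-10 suggest multicollinearity
X = df.select_dtypes(include='number').to_numpy()
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(vifs)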
To visualise correlation, it is possible to use seaborn to produce a heatmap (Figure 9.2) so high
levels of positive and negative correlation are apparent.

import seaborn as sns


import matplotlib.pyplot as plt

sns.heatmap(C,annot=True,cmap="RdYlGn")

9.3.2 Normality
A QQ (quantile-quantile) plot is a graphical method for visually comparing two distributions.
Each (x, y) pair corresponds to the same quantile in each of the two distributions. In this way,
sample values can be compared with a theoretical distribution to see if the data is a reasonable
fit to the assumed distribution.

import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt

#generate random values to test
values = np.random.normal(0, 1, 100)

Figure 9.2: heatmap

#By default will compare against the normal distribution


fig = sm.qqplot(values, line ='45')
plt.show()

Figure 9.3 shows two such plots. The first (9.3a) is generated by the code above, with randomly
generated values from the normal distribution being plotted against the normal distribution (the
default of the qqplot function). Here the data is a good fit to the 45° line, indicating a reasonable
expectation of normality. The second plot (9.3b) shows randomly generated values from
the exponential distribution being plotted against the normal distribution and clearly shows a poor
fit.

(a) Normal vs Normal (b) Exponential vs Normal

Figure 9.3: QQ plot



See the statsmodels documentation for more details.

9.4 Outliers
Treatment of outliers requires special consideration and should not be performed without thought.
Inappropriate treatment of outliers may bias or invalidate the conclusions of subsequent analysis.
As with all troublesome data, it is tempting to simply remove it. Such operations can be performed
using the filtering techniques described earlier. An alternative to removing outliers is to constrain their
values to lie within a bounded interval, with values lying outside the interval being shifted to the
boundary values. This can be thought of as applying a function f : R → [a, b] defined as:

f(x) = \begin{cases} a & \text{if } x < a \\ x & \text{if } a \le x \le b \\ b & \text{if } b < x \end{cases} \qquad (9.1)

In pandas this can be performed with the clip method.

df = pd.DataFrame({'label': ['A', 'B', 'C', 'D'],


'value': [0.98, 1.02, -0.87, -1.1],
'value2': [0.56, 1.05, -1.27, -0.99],})

#clip one column


df['value'] = df['value'].clip(-1,1)

#clip named columns


cols = ['value', 'value2']
df[cols] = df[cols].clip(-1,1)

#clip all numeric columns (lower bound only)


cols = df.select_dtypes(np.number).columns
df[cols] = df[cols].clip(lower = 0)

Winsorizing data (or winsorization) determines the lower and upper bounds based on some specified
percentiles (for example the 5th and 95th percentiles).

#Winsorise data
df[cols] = df[cols].clip(lower=df.quantile(0.05),
upper=df.quantile(0.95), axis=1)

9.5 Duplicates
Pandas provides several tools for identifying duplicates and removing them if required. In the
example below, the second and last records are considered to be duplicates even though they
have different index values.

df = pd.DataFrame({'label': ['alpha', 'beta', 'beta', 'beta'],


'value': [2,4,6,4]}, index = ['a','b','b','c'])

print(df.duplicated())

The Boolean array returned by the duplicated method only returns True for actual duplicates; a
value of False is recorded for the first record that the others are considered duplicates of. Duplicates can be
removed from a dataframe as follows:

#Remove duplicates
df.drop_duplicates(inplace=True)

By default, all fields are required to be identical to be considered duplicates. To apply similar logic
to a subset of columns use:

#Find duplicates based on one column only


print(df.duplicated(['label']))

#Drop duplicates based on one column only


df.drop_duplicates(['label'], inplace=True)

9.6 Data Transformation


9.6.1 Categories
There are many ways in which raw data can be transformed prior to analysis. When dealing with
categorical data, one may need to standardise the categories, or otherwise regroup them. As a
first step for keyed values this can mean harmonising the capitalisation and removing any leading
or trailing spaces. In the following example, a string description of size is encoded using a map
onto three possible values (encoded as 0,1,2).

df = pd.DataFrame({'company': ['A', 'B', 'C', 'D'],


'size': ['small', 'large', 'medium', 'Huge ']})

#Standardize strings (strip spaces, convert to lower case)


df['size'] = df['size'].apply(lambda x: x.strip().lower())

#Define mapping
sizemap = {'small':0, 'medium':1, 'large':2, 'huge':2}

#Extract text to match keys


text = df['size'].astype("string")

#Apply map and create new field


df['group'] = text.map(sizemap)

An alternative approach is to write a custom function to perform the transformation which is then
applied to a column of the dataframe.

def assigngroup(label):

    label = label.strip().lower()

    if label in ['small']:
        return 0
    elif label in ['medium']:
        return 1
    elif label in ['large', 'huge']:
        return 2

    return None

df['group'] = df['size'].apply(assigngroup)

Substitutions for specific values can be performed using the replace method.

#Replace a single value


df['size'] = df['size'].replace(['Huge '], 'large')

#Replace a multiple values


df['size'] = df['size'].replace(['Huge', 'Huge '], 'large')

Replace can also be used to perform multiple search and replace operations simultaneously by
supplying a list of replacement values or by supplying a dict that defines the substitutions.
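For example (the 'tiny' key below is purely illustrative):

#Perform several substitutions at once using a dict
df['size'] = df['size'].replace({'Huge ': 'large', 'tiny': 'small'})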
Pandas allows columns to be treated as a categorical variable type. For large datasets, this may
reduce memory requirements, as text strings can be encoded as integer values. Performance of
some operations (such as grouping) may also be improved.

df['size'] = df['size'].astype('category')

#Inspect the encoding


print(df['size'].cat.codes)
print(df['size'].cat.categories)
print(df['size'].value_counts())

#Create a field using a pre-defined category encoding


categories = ['public', 'private']
df['type'] = pd.Categorical.from_codes([0,1,1,0], categories)

9.6.2 Binning
Numerical data can also be used for grouping by defining discrete bins (or buckets) into which
to assign each point. One way to do this is using the cut function. The bins input parameter
can either be defined as an integer (the number of bins required) or as an array specifying the
bin boundaries. In the former case, the bins will be equally sized (but not necessarily equally
populated).

df = pd.DataFrame({'company': ['A', 'B', 'C', 'D'],


'mktcap': [975452, 123455678, 36087532, 230975486]})

#Create required boundaries bounding all datapoints


boundaries = [0, 10**6, 10**8, 10**12]

#Here the bucket labels have been enumerated 0,1,2


buckets = pd.cut(df['mktcap'], boundaries,
                 labels = range(len(boundaries)-1))

#Encode groups using bucket label


df['group'] = buckets

Note that by default the bins are defined by intervals of the form (a, b], open on the left and closed
on the right. This means that the left hand point a is not considered to belong to the bucket,
but the right hand point b is (i.e. a < x ≤ b).
An alternative is to use the qcut function which automatically calculates the boundary points to
ensure each bucket is uniformly populated.

#Divide the data into four groups of equal size


buckets = pd.qcut(df['mktcap'], q=4)

#Inspect allocations to each bucket


print(pd.value_counts(buckets))

As an alternative, the sklearn.preprocessing module contains a KBinsDiscretizer function.
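A minimal sketch, applied to the market capitalisation example above:

from sklearn.preprocessing import KBinsDiscretizer

#Four quantile-based bins, encoded as the ordinal values 0-3
kbd = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
df['group'] = kbd.fit_transform(df[['mktcap']])[:, 0]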

9.6.3 Dummy variables


Dummy variables which act as indicator functions for each value within a category can be cre-
ated with the get_dummies function.

df = pd.DataFrame({'company': ['A', 'B', 'C', 'D'],


'group': ['S', 'L', 'M', 'L']})

df2 = pd.get_dummies(df, columns = ['group'])

For each of the category values, a new column is added populated with 1 when the row matches
that category and zero otherwise.

company group_L group_M group_S


0 A 0 0 1
1 B 1 0 0
2 C 0 1 0
3 D 1 0 0

The drop_first parameter should be set to True to remove the first level to get k − 1 dummies
for the k category values. Scikit-learn provides similar functionality for ‘one hot encoding’ (see
OneHotEncoder) but returns an array (i.e. matrix) rather than a dataframe.
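For example, a sketch of both approaches (continuing the example above; note that OneHotEncoder's defaults differ slightly between scikit-learn versions):

#k-1 dummies using pandas
df3 = pd.get_dummies(df, columns = ['group'], drop_first=True)

#One hot encoding using scikit-learn (returns a matrix rather than a dataframe)
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
X = enc.fit_transform(df[['group']]).toarray()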

9.6.4 Scaling
The fields within a data set can vary by an order of magnitude. For certain types of analysis where
any linear transformation of the data has no material impact on the results, it can be convenient
to scale numerical data to a common interval such as [0, 1] or [−1, 1]. In the former case, the
maximum value in each column will be scaled to 1, and the minimum value to 0.

#Scale numeric columns to (0,1)


cols = df.select_dtypes(np.number).columns

df[cols] = df[cols] - df[cols].min()


df[cols] = df[cols] / df[cols].max()

When using the sklearn package such scaling can be performed as follows:

from sklearn import preprocessing

cols = ['value', 'value2']

scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))


scaler.fit(df[cols])
df[cols] = scaler.transform(df[cols])

9.6.5 Differencing
For time series data (or any data where each row is related to the previous one), new columns can
be created by performing operations across rows. One common operation is lagging a variable.
In pandas this is performed using the shift method. For a time series, this transforms A(t) to A(t − s), where s
is the integer shift.

#Lagged variable
df['lagA1'] = df['A'].shift(1)

#First difference
df['dA'] = df['A'] - df['A'].shift(1)

#Percentage change
df['B_pct_change'] = (df['B'] - df['B'].shift(1))/df['B'].shift(1)

Rather than calculating the first difference and percentage changes manually, it is possible to
use pandas functions. Note that such operations can only be applied to an entire DataFrame if all
columns are numeric.

#First backward difference for a single column


df.A.diff()

#Backward difference for entire dataframe


df.diff()

#Percentage change a single column


df.B.pct_change()

#Percentage change for entire dataframe


print(df.pct_change())

9.6.6 Dimensionality Reduction


For large data sets (i.e. those with many attributes), it is possible to reduce the dimensionality of
data, when the data exhibits correlation.

One technique for doing so is Principal Component Analysis (PCA) which, as the name suggests,
attempts to simplify the description of a data set by identifying its ‘principal’ components. One
way to think of the data points is as coordinates in n dimensional space. One set of coordinates
can be expressed in different ways depending on the axis system employed. Matrices can be
viewed as a means to transform data points between axis systems.
The equation below represents such a transformation. Data points x can be considered as coordinates
in a regular Cartesian coordinate system. The axes of this system are represented as
columns of the identity matrix I_n. The columns of matrix A represent an alternative axis system.
Data points y expressed in this system of coordinates can be transformed into the regular coordinate
system by multiplication by A. For PCA, the columns of A would be referred to as factor
loadings, and the y values as factor scores for x.

\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}

PCA attempts to find a new, more efficient set of axes. Any such system must have orthogonal
axes. The first axis (principal component) is chosen to maximise the variance of the data points
projected onto it, the second axis to have the next greatest variance and so on. Data points
consisting of many attributes can therefore be approximated by their first few coordinates under
the new axis system.
PCA can be performed on dataframes and matrices using scikit-learn.

from sklearn.decomposition import PCA

#Decide how many factors are required


pca = PCA(n_components=3)

#Fit to data
pca.fit(df)

Once fitted it is possible to inspect how much of the variance is captured by each of the factors,
and therefore the total variance captured by the reduced factor set.

print(pca.explained_variance_ratio_)
print("total variance explained = ",sum(pca.explained_variance_ratio_))

#Inspect factors (note the trailing underscore)
pca.components_

Once fitted, the observations can be transformed into the reduced dimensional space.

X2 = pca.transform(df)

9.7 Reshaping
Data can be displayed in long or wide formats. In the wide format, different characteristics for the
same subject are placed in separate columns. In the long format, each characteristic of each

subject is stored in a separate row.

Long format:

company  label   value
A        listed  yes
A        size    L
A        est     1984
B        listed  no
B        size    S
B        est     2018

Wide format:

company  listed  size  est
A        yes     L     1984
B        no      S     2018

Table 9.1: long vs wide

9.7.1 Stack/Unstack
To convert from long to wide, one can use a hierarchical index (MultiIndex). One way to think
about a MultiIndex is as a composite key: a combination of columns that together can be used
to suitably identify each row. They can be created from a list of arrays. The code below creates
a DataFrame matching the long table shown in Table 9.1. The first column ‘company’ has
repeated values so makes a poor index until it is combined with the second column ‘label’ to
make a MultiIndex. The order in which they are combined determines the index levels.

df = pd.DataFrame({'company': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'label': ['listed', 'size', 'est']*2,
                   'value': ['yes', 'L', 1984, 'no', 'S', 2018]})

#Create multi-index
df.set_index(['company', 'label'], inplace=True)

#Inspect levels
print(df.index.names)

#Inspect unique values of second level (remember its zero based)


#These will define the inner column labels (if we pivot on this index part)
print(df.index.get_level_values(1).unique())

#Access using a tuple


df.loc[('A','size'),]

The unstack method acts like a pivot to return a new dataframe (or modify the existing one) with
additional inner column labels, based on the index level pivoted. By default this will be -1 (i.e. the
last level). As shown by the output below, the first time the ‘label’ values become columns, but
the second time the ‘company’ values become columns.

#Unstack based on inner most level
df2 = df.unstack()
print("df2", df2)

#The columns are now a multi-index!
print(df2.columns)

#Unstack based on top level
df3 = df.unstack(0)
print("df3", df3)

This produces output as follows:

df2       value
label       est listed size
company
A          1984    yes    L
B          2018     no    S

MultiIndex([('value', 'est'),
            ('value', 'listed'),
            ('value', 'size')],
           names=[None, 'label'])

df3       value
company       A     B
label
est        1984  2018
listed      yes    no
size          L     S
To undo this operation, one can use the stack method, which returns a reshaped DataFrame
with a multi-index with additional inner levels. This retains the existing outer index (currently the
company).

df = pd.DataFrame({'company':['A', 'B'],
'est':['1984', '2018'],
'listed':['yes', 'no'],
'size':['L', 'S']})

#Set index
df.set_index('company', inplace = True)

#Convert from wide to long


df2 = df.stack()

9.7.2 Melt
Melting a DataFrame converts it from wide to long. In the example below we begin by creating the
wide table from earlier. Note that no index was created.

df = pd.DataFrame({'company':['A', 'B'],
'est':['1984', '2018'],
'listed':['yes', 'no'],
'size':['L', 'S']})

#Identify columns which act as keys (id_vars)


#and those which contain values (value_vars)
#since we're creating a column we can assign a name (var_name)
df2 = pd.melt(df, id_vars=['company'],
value_vars = ['est', 'listed', 'size'],
var_name = 'label')

print(df2)

Note that melt is also a DataFrame method so the following lines are equivalent:

df2 = pd.melt(df, <parameters>)


df2 = df.melt(<parameters>)

To reverse this operation, one can use unstack, or use the pivot method as shown here.

#Shape dataframe with given index and columns


df3 = df2.pivot(index='company', columns='label')

#Remove the column multi-index (if required)


df3.columns = df3.columns.droplevel()

9.8 Aggregation
Aggregation is a process where records are first grouped based on a chosen set of characteristics,
and then summary information is calculated on other selected characteristics. For example, one
could first group stocks by their market sector and then find their average size within each of these
groups.

Figure 9.4: Aggregation

The concept is similar to ‘group by’ SQL operations or pivot tables in Excel, and pandas supports
similar concepts. We’ll use this simple DataFrame in the examples that follow.

sector = ['F', 'I', 'F', 'I', 'I', 'T', 'T', 'F', 'I', 'T']
size = [1.23, 0.56, 3.25, 0.17, 1.98, 2.62, 0.95, 5.11, 2.09, 2.75]
listed = ['Y', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y']
price = [12.23, 56.71, 0.53, 1.23, 5.47, 27.91, 5.22, 1.45, 3.90, 5.55]

df = pd.DataFrame({'sector': sector, 'size': size,


'listed': listed, 'price': price})

9.8.1 Group by
The groupby method groups DataFrame rows based on a list of column names. Aggregation func-
tions can then be applied to each sub-group. For example, here all remaining numerical columns
will be summed.

gb = df.groupby(['sector']).sum()
size price
sector
F 9.59 14.21
I 4.80 67.31
T 6.32 38.68
The output is a DataFrame with an index for both its rows and columns. Alternatively a dict can be
used to supply columns to aggregate and a list of functions to use for aggregation. Applying the
method to our example:

gb = df.groupby(['sector']).agg({'size': ['count', 'min', 'max'],


'price':['mean']})

print(gb)

Produces the following output:

size price
count min max mean
sector
F 3 1.23 5.11 4.736667
I 4 0.17 2.09 16.827500
T 3 0.95 2.75 12.893333

Here the columns of the output form a MultiIndex comprised of the aggregated column names
and the functions used to aggregate them.
A sort key can also be passed to the groupby method. For a list of supported functions see
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html. Note that aggregating
functions ignore NA values.
Other functions (custom or otherwise) can also be used for aggregation. The example below uses
a lambda function to count ’Y’ values.

#Create custom (lambda) function


#This counts 'Y' values in an array
f = lambda x: len([v for v in x if v=='Y'])

#Apply custom function in aggregation


gb = df.groupby(['sector']).agg({'size': ['count', 'min', 'max'],
'listed':[f]})

print(gb)

Although we have used groupby for aggregation, the groupby method actually returns a special
DataFrameGroupBy object.

gb = df.groupby(['listed','sector'])

#Show the size of each group
print(gb.size())

This produces output as follows:

listed  sector
N       I         3
        T         1
Y       F         3
        I         1
        T         2
dtype: int64

In this case (since we have grouped by two columns) the index is now a MultiIndex. Unfortunately
the DataFrameGroupBy cannot be printed, but it can be iterated over.

#Iterate over keys (tuples) and data
for key, group in gb:
    print(key)
    print(group)

Producing output as follows:

('N', 'I')
  sector  size listed  price
3      I  0.17      N   1.23
4      I  1.98      N   5.47
8      I  2.09      N   3.90
('N', 'T')
  sector  size listed  price
6      T  0.95      N   5.22
('Y', 'F')
  sector  size listed  price
0      F  1.23      Y  12.23
2      F  3.25      Y   0.53
7      F  5.11      Y   1.45
('Y', 'I')
  sector  size listed  price
1      I  0.56      Y  56.71
('Y', 'T')
  sector  size listed  price
5      T  2.62      Y  27.91
9      T  2.75      Y   5.55
DataFrameGroupBy objects are ready for aggregation: just apply the required function (or list of
functions) as a string name or as a function pointer (which of course allows you to supply custom
functions).

print(gb.agg(['min', 'max', np.mean]))

Producing output as follows:

size price
min max mean min max mean
listed sector
N I 0.17 2.09 1.413333 1.23 5.47 3.533333
T 0.95 0.95 0.950000 5.22 5.22 5.220000
Y F 1.23 5.11 3.196667 0.53 12.23 4.736667
I 0.56 0.56 0.560000 56.71 56.71 56.710000
T 2.62 2.75 2.685000 5.55 27.91 16.730000

You may recall the apply method. This can also be used to perform the summarise step before
combining. In the example below a custom function accepts a DataFrame and returns a DataFrame

containing the rows for which a particular column contains the largest value in that column. This is
applied to the DataFrameGroupBy object (on the size column) to return the biggest company (or
companies, should there be a tie) in each sub-group.

def findlargest(df, col):
    return df[df[col]==max(df[col])]

gb = df.groupby(['listed','sector'])

bigfish = gb.apply(findlargest, 'size')

Producing output as follows:

  sector  size listed  price
8      I  2.09      N   3.90
6      T  0.95      N   5.22
7      F  5.11      Y   1.45
1      I  0.56      Y  56.71
9      T  2.75      Y   5.55

Such an approach can therefore be used to modify a DataFrame on a group by group basis. One
application is addressing missing values using group means.

9.8.2 Pivot table


Pivot tables allow aggregation results to be shown by row and column. In the example below, the
DataFrame is pivoted and aggregated so that the maximum of each numeric column is shown,
broken down by its two categorical variables.

#Choose rows (index), columns and what to aggregate (values)


pt = df.pivot_table(index=['sector'], columns = ['listed'],
values=['size', 'price'], aggfunc=max)

print(pt)

This produces output:

price size
listed N Y N Y
sector
F NaN 12.23 NaN 5.11
I 5.47 56.71 2.09 0.56
T 5.22 27.91 0.95 2.75

Common aggregation functions include those shown in Table 9.2.

aggregation method python function


count len
sum sum
max max
min min
mean np.mean
standard deviation np.std

Table 9.2: Aggregation methods

Of course you are not limited to built-in functions and can supply your own. The aggregating
function will be supplied with a Series object which can be treated like an array. Here a lambda
function is used to aggregate strings by concatenating them together.

pt = df.pivot_table(index=['sector'],
values=['listed'],
aggfunc = lambda x: ' '.join(str(v) for v in x))

print(pt)

This produced output as follows:

listed
sector
F Y Y Y
I Y N N N
T Y N Y

For simple counts (i.e. frequencies) one might also consider using pd.crosstab.
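For example, using the dataframe from this section:

#Frequency of listed status within each sector
counts = pd.crosstab(df['sector'], df['listed'])
print(counts)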

9.9 Time Series


In finance, it is almost impossible to avoid working with time series data. We briefly introduced
python dates in section 4.6 but revisit them now that we are familiar with pandas. Recall that it is
convenient to convert dates represented as text into datetime variables.

from datetime import *

dates = ['21-12-23', '18-03-24', '19-06-24']
format = '%d-%m-%y'

for x in dates:
    print(datetime.strptime(x, format))

Producing output as follows:

2023-12-21 00:00:00
2024-03-18 00:00:00
2024-06-19 00:00:00

For mixed format dates it is more convenient to (at least attempt to) rely on a library function to work
out how to perform the conversion. The dateutil library provides such a function.

#Mixed format date/times
dates = ['Fri 21-12-23', '18-03-2024', '19-06-24 8:10AM']

from dateutil.parser import parse
for x in dates:
    print(parse(x))

Producing output as follows:

2023-12-21 00:00:00
2024-03-18 00:00:00
2024-06-19 08:10:00

Rather than working with lists of dates, pandas defines a DatetimeIndex type. Lists of dates, in
string or python format, can be converted into a DatetimeIndex with the pandas to_datetime func-
tion.

dti = pd.to_datetime(dates)

The return type of this function depends on the input. Data supplied as a Python list will be returned as
a DatetimeIndex. Input supplied as a pandas Series will also be returned as a Series. Once converted, it
becomes easy to extract things like the year, quarter or month, but the syntax depends on the exact type.

#Extract the year from a DatetimeIndex
dti.year

#Extract the year from a series of datetimes
df.date.dt.year

Alternatively the pandas date_range function can create dates spanning a particular period, with
a specified frequency. A full list of configurable frequencies can be found at
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.

startdate = datetime(2023,10,1)
enddate = datetime(2024,10,1)

#Given start date, frequency and number of periods


pdates = pd.date_range(start=startdate, freq = 'D', periods=7)

#Given start date, frequency and enddate


pdates = pd.date_range(start=startdate, end=enddate, freq = 'M')

We can use such DatetimeIndex objects as the index of a Series or DataFrame. In the example that
follows, a random walk is created to represent a stock price.

#Create dates as a DatetimeIndex


n = 100
pdates = pd.date_range(startdate ,freq = 'D', periods=n)

#Create stock prices (and random walk)


import numpy as np

vol = 0.2
p = np.ones(n)*10
z = np.random.normal(0,1,n)

for i in range(1,n):
    p[i] = p[i-1] * (1+z[i]*vol*(1/365)**0.5)

#Create dataframe with DatetimeIndex index


ts = pd.DataFrame({'price':p}, index = pdates)

We can visually inspect the series using the plot method. Note that since this data is randomly
generated, any rerun will look different.

ts.plot(y='price')

Such series and dataframes benefit from flexibility in selecting particular sub-series.

#Select rows before a given date


ts2 = ts[ts.index < datetime(2023,11,5)]

#Select dates in a given month


ts2 = ts['2023-11']

#Select between two dates (here the first date is the entire month)
ts2 = ts['2023-11':'2023-12-15']

#Skip the first 10 dates


ts2 = ts[10:]

#Skip the last 10 dates


ts2 = ts[:-10]

#Every 7th date


ts2 = ts[::7]

Note that exact matches (i.e. attempts to select a single row) for a dataframe will be treated column-wise
and will raise a KeyError.

#Select a specific date (will work for a series but not a dataframe)
#print(ts['2023-11-05'])

#Select a specific date (will work for a series or a dataframe)


print(ts.loc['2023-11-05'])

Part III

Data Analysis
10 DATA VISUALISATION

10.1 Creating beautiful evidence


Charts are a great way to communicate information. A well crafted visualisation can communicate
more information that is quicker to process and easier to interpret than alternatives such
as tables and text. Learning how to technically produce different types of visualisation is therefore
a useful skill, but less so than understanding how to elevate your chart to qualify as ‘beautiful
evidence’ (see the work of Edward Tufte). In practice this means expending just as much (if not
more) effort in designing and formatting your chart as in its construction.
Every element of a chart (titles, axes, labels, legends, gridlines, lines, markers, colours, fonts, etc)
is a design choice. All too often these are unthinkingly accepted in their default state, to the point
where it is easy to identify a chart produced using Excel or R. Similarly, many people are bound
by rules that they were taught as children (e.g. you must label each axis). These rules are not
followed by professional designers. Rather, consider what purpose each chart component serves:
is it necessary; what does it add; is it sufficiently prominent; does it distract from the primary focus
of the reader? The Financial Times is a great place to start if you want to see professional chart
producers (‘chartographers’ if you will) at work:
• FT’s Chart Doctor
• Visual guide to chart types
• Axes of Evil Series (crimes against charting)
For a gallery of python charts see https://python-graph-gallery.com/all-charts/.

10.2 Matplotlib
Matplotlib is a library for generating charts that makes extensive use of NumPy. Its name reveals
its origins, which stem from MATLAB; anyone familiar with creating charts in MATLAB will
recognise Matplotlib's commands. By convention it is imported and aliased as follows:

import matplotlib as mpl


import matplotlib.pyplot as plt

A full gallery of plot types can be found at https://matplotlib.org/gallery/index.html.


Charts are created using a Figure containing one or more Axes (individual plots). The plt alias will
access the current axis or create one if it doesn’t exist.

#Create a figure with a single axes


fig, ax = plt.subplots()

#Create a figure with 3 (stacked) subplots arranged in a square figure


fig, ax = plt.subplots(3,1, figsize=(8,8))

#tight_layout provides control over spacing between subplots


plt.tight_layout(pad=2.0)

Note that not all ax methods work with plt; however, you can get the current axes instance as
follows:

#Get the current axes instance on the current figure


plt.gca()

10.3 Scatterplot
Here we present the commands necessary to generate a simple 2D-line chart. At a minimum
each series can be specified by three things: the x-points, the y-points, and its format. It is not
necessary to specify any other chart components (i.e. title, legends, axis, etc) but the production
of high-quality data visualisations will require such customisation. The output of the code below is
shown in Figure 10.1a.

import numpy as np
from math import *
import matplotlib.pyplot as plt

#Plot function at discrete points of your choice


x = np.linspace(0,2*pi,50)
y = np.sin(x)
z = np.cos(x)

#clear previous plot


plt.clf()

#add series
plt.plot(x, y, '-', label='sin', color='#888888')
plt.plot(x, z, '*r', label='cos')

#add titles
plt.title('My first chart')
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')

#add legend
plt.legend(loc='lower left')

#set axis limits


plt.axis([0, 2*pi, -1.1, 1.1])

#add grid lines


plt.grid(True)

#display chart
plt.show()

Formatting of the series line style and colour can be encoded using:

(a) Line chart (b) Scatterplot

Figure 10.1: Series plots

Line styles:
• ‘-’ for solid line
• ‘--’ for dashed line
• ‘-.’ for dash-dotted line
• ‘:’ for dotted line

Colours:
• ‘b’ for blue
• ‘g’ for green
• ‘r’ for red
• ‘c’ for cyan
• ‘m’ for magenta
• ‘y’ for yellow
• ‘k’ for black
• ‘w’ for white

Full colour control is available via the color parameter using RGB colours.
Rather than selecting your own formats, predefined styles can be adopted using the command
below. Styles should be applied before creating chart components. A list of styles is available from
https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html.

#Format
plt.style.use('ggplot')

In Spyder (version 3.7), plots appear on the ‘Plot’ tab alongside the variable explorer. To export a plot,
right click and choose ‘save plot as...’. Exporting can also be achieved in code using the savefig
command. This allows other options to be configured including output format and quality. When
embedding images in other documents remember to ensure they are of a high quality (increase
the dots per inch (dpi)) and that all text can be easily read.

#Save to file
plt.savefig('chart.png', dpi=600)

Scatter plots can be created as line series (with just markers and no lines) or using the scatter
method (see Figure 10.1b) which also permits a third variable to be represented using a colour
scale.

plt.clf()
plt.scatter(y,z*x, marker='o', c = x)
plt.colorbar()
plt.show()

Multiple plots can be combined into a single figure by using subplots (in fact all plots reside inside
a figure object). These can be configured in any regular grid shape and be configured to have
shared axes (sharex, sharey). The overall grid shape, and the indexing of each subplot, uses matrix
notation (row then column). For a full range of options see
https://matplotlib.org/3.1.0/gallery/subplots_axes_and_figures/subplots_demo.html.

#Ask for a 2x1 layout (two rows, one column)


fig, (ax1, ax2) = plt.subplots(2, 1)

#Define the first subplot


ax1.plot(x, y, '-', label='sin')
ax1.set_title('sub plot 1')

#Define the second subplot


ax2.plot(x, z, '-', label='cos')
ax2.set_title('sub plot 2')

#Set figure properties


fig.suptitle('Main Title')
fig.tight_layout(pad=3.0)

plt.show()

When working with large numbers of automatically generated subplots, it becomes cumbersome
to give them all different variables names. In such cases it is better to treat them as an array of
subplots which can be referenced by index (working from left to right, and top to bottom).

n=3
fig, axs = plt.subplots(n, n)

for i in range(n):
    for j in range(n):
        axs[i,j].set_title(str(n*i+j))

fig.tight_layout(pad=3.0)

An alternative approach is to add them one at a time as follows. Here only the 1st and 4th plots
are added in a figure configured to contain 2x2 subplots.

fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,4)
plt.show()

10.3.1 Working with time series


Working with time series plots requires a little more effort. Firstly, to ensure that date values are
treated as dates and can therefore be plotted chronologically. This can be performed using
datetime.strptime as in Section 4.6, or in pandas using the code shown below.

#Ensure date is treated late a date


df.date = pd.to_datetime(df.date, format='%d/%m/%Y')

Secondly, to ensure that the tick labels are suitably spaced and formatted. Fortunately within
matplotlib.dates a number of convenient functions to perform such operations are provided. In
the following code MaxNLocator is used to indicate the maximum number of ticks you would like to
see, and a format is created (using DateFormatter) and then applied using the set_major_formatter
method of the axis.

import matplotlib.dates as mdates


fig, ax = plt.subplots()

#plot will write to the current axis


ax.plot(df.date, df.price)

ax.xaxis.set_major_locator(plt.MaxNLocator(5))

myFmt = mdates.DateFormatter('%d-%b')
ax.xaxis.set_major_formatter(myFmt)

For more control it is possible to create arrays of dates to define the tick labels you want to
see.

#Select every 4th Monday


from matplotlib.dates import MO, TU, WE, TH, FR, SA, SU
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=MO, interval=4))

#Select the first day of every other month


ax.xaxis.set_major_locator(mdates.MonthLocator(bymonthday=1,
interval=2, tz=None))

For more possibilities see https://matplotlib.org/3.1.1/api/dates_api.html.

10.4 Bar chart


A basic one series bar chart can be created from a list of values and a list of labels. The code
below generates the chart shown in Figure 10.2a.

#Create data
CPI = [1.2, 1.4, 0.9, 0.4, -0.2, 0.3 ]
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']

#Create components
plt.bar(months, CPI, color='red')
plt.xlabel('Month')

plt.ylabel('Inflation (%)')
plt.title("UK Inflation")
plt.show()

Note that if using the ax approach, labels must be set as follows:

#Create chart
fig, ax = plt.subplots()

#Add components
ax.bar(months, CPI, color='red')
ax.set_xlabel('Month')
ax.set_ylabel('Inflation (%)')
ax.set_title('UK Inflation')

With two or more series, care must be taken to control the position of the series. These could be set
side by side, deliberately made to overlap, or stacked (using the bottom parameter, or stacked=True
when plotting via pandas). The code below generates the chart shown in Figure 10.2b. To
illustrate how, it has been created with horizontal bars.

#Create second series


RPI = [1.5, 1.7, 1.1, 0.8, -0.5, 0.4 ]

#Create variables for sizing and positioning the bars


width = 0.4
pos = np.array(range(len(months)))

#Add bars - horizontal this time


#Offset position of second series
plt.barh(pos, CPI, width, color='red', label='CPI')
plt.barh(pos+width, RPI, width, color='grey', label='RPI')

#Centre y-axis labels


plt.yticks(pos+width/2,months)

plt.ylabel('Month')
plt.xlabel('Inflation (%)')
plt.title("UK Inflation")
plt.legend(loc='upper right')
plt.show()

10.5 Pie Chart


The use of pie charts is discouraged for several reasons. Firstly, they are considered difficult to
accurately interpret since angles are very hard to estimate and compare. This problem is com-
pounded for 3D pie charts. Secondly, there are nearly always better alternatives. Thirdly, they are
frequently abused (especially if the combined segments are supposed to represent 100%). For a
discussion of why, read the following articles.
• https://www.businessinsider.com/pie-charts-are-the-worst-2013-6
• https://ig.ft.com/science-of-charts/
• https://scc.ms.unimelb.edu.au/resources-list/data-visualisation-and-exploration/no_pie-charts

(a) Vertical (b) Horizontal
Figure 10.2: Bar charts

(a) Pie chart (b) Donut plot
Figure 10.3: Pie chart variants
If you really must, the following code will create the pie chart in Figure 10.3a. Notice that the
values are converted into percentages by default. Full documentation is available from https://
matplotlib.org/3.1.1/gallery/pie_and_polar_charts/pie_features.html.

labels = ['Yes', 'No']


sizes = [1,99]

plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)


plt.title("Should I use a pie chart?")
plt.show()

A way to avoid the angle problem is to switch to a donut plot (Figure 10.3b). This isn’t really a
separate chart type. As the code reveals, it is created by effectively blanking out an inner circle
from the pie chart.

# create data
values = [5,7,2,8]

labels = ['A','B','C','D']

#create the pie chart as before


plt.pie(values, labels=labels)
plt.title("Donut plot")

#create a (white) circle and place it on the 'current figure'


circle=plt.Circle( (0,0), 0.75, color='white')
p=plt.gcf()
p.gca().add_artist(circle)

plt.show()

10.6 Histogram
Here we generate a histogram, shown in Figure 10.4a.

#Create a histogram
z = np.random.randn(1000)

plt.clf()
plt.hist(z, bins='auto')
plt.title("Normal?")
plt.show()

10.7 3D-plots
Three dimensional plots require Axes3D to be imported. An example is provided below with output
as shown in Figure 10.4b. See https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html for
further examples.

import matplotlib.pyplot as plt


from mpl_toolkits.mplot3d import Axes3D

#Mesh Grid
n = 101
x = np.linspace(1,3,n)
y = np.linspace(1,3,n)
z = np.zeros((n,n))

for i in range(0,n):
    for j in range(0,n):
        z[i,j] = x[i]*y[j]**2 / (x[i]**2 + y[j]**2)

plt.clf()
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
x, y = np.meshgrid(x, y)
surf = ax.plot_wireframe(x, y, z)

(a) Histogram (b) 3D plot

Figure 10.4: Matplotlib

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
fig.show()

10.8 Boxplots
A boxplot or Box & Whisker plot displays the distribution of a variable. The box (i.e. the rectangle)
shows the interquartile range with the median marked as a bar in the middle (but not necessarily
the midpoint). The whiskers (i.e. the lines extending out of the rectangle) extend to show some
proportion of the remaining distribution. How far they extend is normally configurable and can be
set to a certain percentile, or a multiple of the interquartile range. Any values beyond the whiskers
are plotted as individual points representing outliers.
The code below shows how to create a boxplot with the result shown in Figure 10.5.

#Generate random data


n=500
z1 = np.random.normal(0,2,n)
z2 = np.random.normal(2,1,n)

#Create a figure
fig, ax = plt.subplots(dpi=300)

#Create the boxplot, here with two lists of data


ax.boxplot([z1, z2], labels=['a','b'])
ax.set_title('Normal data')

plt.show()

Pandas dataframes have a boxplot method to generate such plots using different columns, or the
same columns grouped in different ways. See https://pandas.pydata.org/pandas-docs/stable/
reference/api/pandas.DataFrame.boxplot.html for details.

Figure 10.5: Boxplot

10.9 Scatter Matrix


When working with dataframes, it can be helpful to understand the relationships between vari-
ables. One way to do this is to plot every pair of variables in a single figure (see Figure 10.6). Since
plotting one variable against itself is trivial, a histogram is provided instead so the distribution can
be inspected.

pd.plotting.scatter_matrix(df, c=df.outcome)

Figure 10.6: Scatter Matrix

Notice how a categorical variable has been used to colour each observation. This can be useful
when investigating which variables must be used to classify future unseen observations.

10.10 Choropleth Map


A choropleth map uses colour intensity to represent geographical data. The data being analysed
must first be mapped to specific areas (represented as polygons) that could represent a country,
state, zip code, electoral district, etc. Data representing the polygons (shapefiles) is often in the
form of a json file where the shape of each region is described by a list of coordinates (i.e. latitude
and longitude) which form a (rough) outline of the geographical area.
For example, the following snippet of code uses the folium package to create the image shown
in Figure 10.7.

import folium

country_shapes = r'.\maps\world-countries.json'
the_map = folium.Map(tiles="cartodbpositron", location=[20, 10], zoom_start=3)

the_map.choropleth(
geo_data=country_shapes,
name='choropleth',
data=df,
columns=['country', 'downloads'],
key_on='properties.name',
bins = [0,10,50,100,500],
fill_color='Reds',
nan_fill_color='white',
fill_opacity=0.8,
line_opacity=0.1,
)
the_map

the_map.save('SSRN_choropleth.html')

10.11 Other libraries


Matplotlib is just one of several charting libraries. In this section we briefly introduce some other
popular libraries.

10.11.1 seaborn
Seaborn is based on matplotlib and is designed to work with pandas. For convenience it pro-
vides features to set styles and colour themes for consistent formatting across charts. See https://
seaborn.pydata.org/introduction.html for full details.

import pandas as pd
import seaborn as sns

#Apply the default seaborn theme, scaling, and color palette.


Figure 10.7: Choropleth Map

df = pd.DataFrame({'label': ['A', 'B', 'C', 'D', 'E'],
                   'value': [12, 7, 5, 7, 10]})

sns.set_style("whitegrid")
ax = sns.barplot(data=df, x="label", y="value")
fig = ax.get_figure()

fig.savefig("seaborn.png", dpi=400)

Figure 10.8: seaborn

10.11.2 plotly
Plotly (now part of the Anaconda installation) produces interactive charts displayed within browsers.
Its functionality is documented at https://plotly.com/python/.

#Example from https://plotly.com/python/static-image-export/

import plotly.graph_objects as go

import plotly.io as pio


pio.renderers.default = "browser"

import numpy as np
np.random.seed(1)

N = 100
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
sz = np.random.rand(N) * 30

fig = go.Figure()
fig.add_trace(go.Scatter(
x=x,
y=y,
mode="markers",
marker=go.scatter.Marker(
size=sz,
color=colors,
opacity=0.6,
colorscale="Viridis"
)
))

#requires orca to be installed


fig.write_image("plotly.png")

#Should now appear in a browser


fig.show()

Note that figures generated using plotly will not display in Spyder. Two lines of code have
been added above so that the output should instead appear in your browser. Plotly supports the
creation of choropleth maps.

10.11.3 Interactive Dashboards


An alternative way to visualise data is to make it interactive so that the reader has the option to
explore it in more detail. Various packages provide features such as allowing the user to vary what
is being plotted and displaying additional information as users hover over particular data points.
Examples of such libraries are:
• bokeh
• ipywidgets
• voila
• dash
• streamlit

Figure 10.9: Plotly


11 MACHINE LEARNING

11.1 Basic concepts


While machine learning may conjure science-fiction images of computers taking over the world,
very often the term is used to describe much more mundane situations where algorithms are
used to make predictions from data. Thus, a simple linear regression model that has been fitted
by a computer programme to a particular dataset, and then used to make a prediction for a
new data point, is an example of machine learning. The dataset has allowed the algorithm to
‘learn’ something (i.e. infer a relationship under the imposed model assumptions) and then use
this ‘knowledge’ to do something independently (make a prediction). One supposes that the
more sample data the algorithm has access to, the better its predictions would become.
Supervised learning relates to problems where knowledge has already been accumulated about
the outcome or answer to a problem. This prior knowledge is used to learn about the relationship
between observable variables and the outcome of interest. This could be a numerical value
(a forecast problem) or a categorical value (a classification problem). Like a teacher helping a
student to learn, a supervisor can help the algorithm to improve by providing it with more problems
with known answers. Unsupervised learning seeks to find relationships, structure or meaning within
data. One of the most common approaches is to use clustering, where observations are grouped
based on their common characteristics.

Figure 11.1: Machine Learning

Machine learning texts often refer to attributes and observations. Those more familiar with database
terminology may prefer to think in terms of fields and records. In supervised learning, the observa-
tion attributes (i.e. the regressor, co-variate or independent variables) are used to ascertain the
value of the target variable (i.e. the regressand or dependent variable). In general, any model
must first be trained on a training set and then evaluated using a distinct testing set. The training
set can be subdivided into data that will be used to construct the model, and data that will be
used to validate and tune the model.

In finance, machine learning techniques have been used for fraud detection, credit decisions and
time series estimation. For a Python linked introduction to the topic of machine learning see Müller,
Guido, et al. (2016). In the rest of this chapter we will heavily utilise scikit-learn.

11.2 Sampling
Train and test subsets must be extracted from a larger dataset. Test data should be out-of-sample,
that is, previously unseen by the model. While a portion of the training data may be used for
validation, this is testing how well the model fits the training data. Testing with an independent
dataset, validates how well the fitted model performs on unseen data.
The idea is similar to student examination: a meaningful assessment should be unseen. If students
were assessed using only past papers to which they had access, one would not be surprised if
they performed well. In such circumstances, what is being tested is not that which is intended: the
ability to apply knowledge rather than recall the solution to previous problems.
There are different ways that this can be achieved and the appropriate method will depend on
the goal of the modeling.

11.2.1 Without replacement


The most straightforward approach is to divide the dataset into disjoint subsets. This could be per-
formed using a simple partition (e.g. the first k observations) or using random selection. Dataframe
methods for selecting subsets in this manner include:

#First 5 rows
X = df.iloc[:5,]

#First 5 percent of rows


X = df.iloc[:(int(0.05*len(df))),]

#Random 5 rows
X = df.sample(5, replace = False)

Scikit-learn provides a sample_without_replacement function that selects a random subset from the
integers {0, 1, . . . , n−1}. These values can be used to select a subset of rows from a dataframe.

from sklearn.utils.random import sample_without_replacement

#Establish number of rows in dataframe


n = len(df)

#Randomly sample (70 percent of) row numbers


#Using a seed to ensure the same results are returned every time
A = sample_without_replacement(n, int(0.7*n), random_state=0)

#Form a second list of row numbers not in A


B = [i for i in range(n) if i not in A]

#Use the integer subsets to select particular rows


df_train = df.iloc[A,]
df_test = df.iloc[B,]

When working with time series data, a random sampling approach may not be appropriate. For
example, if a modeler was building a real-time forecasting model, at the point of prediction, they
would have no access to information from the future. Building a model that relies on data that
would not have been available at the time is fraudulent. One would dismiss a forecaster who told
us that they could accurately predict the price of oil in 6 months provided we first supply them with
the price of oil in 5 months and 7 months time.
In such instances the goal should be to use out-of-time sampling (Stein, 2002). The idea is illustrated
in Figure 11.2 which represents a time line, with older observations to the left and more recent
observations to the right. An intermediate point is used to divide the dataset into pieces. The
shaded portions represent time buffers that may need to be excluded:
1. Used to calculate backward differences which is not possible for observations at the start of
the dataset.
2. Used for training data outcomes; training data should not include outcomes that would be
unknown at the point of the first test observation.
3. Outcomes are in the future and therefore unknown so cannot be used for testing (yet).

Figure 11.2: Out-of-time sampling

Assuming the dataset is sorted chronologically, it may be possible to partition the dataset by taking
the first k rows and the last n − k rows. Alternatively pandas indexing can be used to select based
on dates.

#Create an index based on a date/time column


df.index = pd.to_datetime(df['date'].tolist())

#Partition dataset using time slices (many formats supported)


df_train = df['1900-01':'1989-12']
df_test = df['1990-01':'2019-09']

11.2.2 With replacement


Scikit-learn also provides a resample function which implements resampling with replacement. It
can be applied to arrays, lists and dataframes.

from sklearn.utils import resample


A = resample(range(n),replace=True, n_samples=k)

Using one of its optional parameters, it can also be used for stratified sampling which samples
from within subgroups, rather than the entire population. This can be used to make sure that
observations from different groups are appropriately represented in the random sample. This can
be preferable when working with imbalanced sets, for example, when the ‘positive’ cases in a
clinical trial are a small proportion of the population.

11.2.3 Features and Target


As well as sub-setting the rows of a dataframe, it is also necessary to subset the columns. For
classification and regression problems one column needs to be identified as the target. One
must also decide which columns from the dataframe can/should be used as predictors. This may
mean excluding some because they are only weak predictors and ultimately detract from the
performance of the model. Others must be excluded because they
are of a type incompatible with the model (e.g. text or categorical) or have no predictive power
(e.g. data labels, dates).

#Define feature columns - here by excluding those that are unsuitable


exclude_cols = ['date', 'name']
feature_cols = [x for x in df.columns if x not in exclude_cols]

#Extract features and target for training data


x_train = df_train[feature_cols]
y_train = df_train.target

#Extract features and target for test data


x_test = df_test[feature_cols]
y_test = df_test.target

Partitioning the data, and randomly selecting testing and training data, can be performed in a
single step using the scikit-learn train_test_split function.

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(
    df[feature_cols], df['target'], random_state=0)

11.3 Decision Trees


11.3.1 Introduction
Decision trees are one of the simplest machine learning models to understand. While they can be
used for regression, they are introduced here as a model for classification.
A decision tree is normally visualised as an inverted tree, starting with a root node, which represents
the entire data to be ‘fitted’. The process for building the tree proceeds recursively. At each node,
the dataset associated with it is split into two subsets, creating two new branches of the tree. Thus
at each step, the original dataset is partitioned into smaller and smaller pieces. At each node
the partitioning is based on a binary test performed on a single attribute. The idea is to carefully
choose the attribute and the precise nature of the test in such a way that classification in each of
the two subgroups is easier. Breiman et al. (1984) note that, “[t]he fundamental idea is to select
each split of a subset so that the data in each of the descendant subsets are ‘purer’ than the
data in the parent subset”.
As an analogy, imagine a company that regularly interviews for job openings. Their goal is to find
‘good hires’, of which only 50% of applicants are believed to be. Of course, one can only know
a ‘good hire’ has been made with hindsight. However, they have access to interview records (for
which a standard set of questions are used) and employee appraisal data, and decide to build a
decision tree to guide future hiring decisions.
Their dataset A is a list of records, with fields X1 , X2 , . . . , Xn which represent the answer to interview
questions, and Y , their employee rating. In this example we’ll treat this as being ‘good’ or ‘bad’

but it could easily be a rating from 1 to 5. The process begins by choosing an attribute Xi that can
be used to divide the data into two disjoint (but not necessarily equally sized) groups such that
B ∪ C = A. The idea is illustrated in Figure 11.3.

Figure 11.3: Decision Tree

Each question in an interview reveals more information about a candidate, but not all information
is equally valuable in making a hiring decision. In our example the attribute X3 has been chosen
precisely because it is considered to maximise the information gain. This may relate to an interview
question that records the number of years of relevant experience (i.e. a numerical value), and the
test must be X3 ≥ 4. Candidates with less than 4 years’ experience are assigned to group B (where
only 15% of future employees are considered good hires), while candidates with 4 or more years
experience are assigned to group C (where 65% of candidates are considered good hires). Note
that the threshold 4 has been just as carefully selected as the attribute X3 .
To further partition group C attribute X4 is chosen. This may relate to a question about qualifi-
cations that records the candidate’s degree subject (i.e. categorical data). Candidates where
X4 ∈ {Finance, Economics} are assigned to group G (where 90% of candidates are considered
good hires), with other candidates being assigned to group F (where only 20% of candidates are
considered good hires).
For more background, see for example Witten, Frank, and Hall (2005). To fit a decision tree model
in scikit-learn:

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)

The fitted tree can be visualised as shown in Figure 11.4.

import matplotlib.pyplot as plt


tree.plot_tree(clf)

#Save to file in different formats


fig = plt.gcf()
fig.savefig('tree.png')
fig.savefig('tree.pdf')

A more colourful visualisation (making it easier to distinguish between classifications at each node)
can be achieved with the following code. Note that this requires installation of pydotplus.

Figure 11.4: Decision Tree Example

from sklearn.tree import export_graphviz
import pydotplus

#Export the fitted tree in Graphviz 'dot' format
text = export_graphviz(clf, filled=True, rounded=True,
            feature_names = feature_cols, class_names=['0','1'])

graph = pydotplus.graph_from_dot_data(text)
graph.write_pdf('tree2.pdf')

11.3.2 Information Gain


To understand how the decision tree algorithm is implemented, it is necessary to introduce the con-
cepts of information gain and entropy. One way to think about entropy is as the impurity of the data
(i.e. where purity is the degree to which observations share a common classification). For decision trees, a
node with an entropy of zero requires no further partitioning.
More formally, a set S with observations from n classes, where pi represents the portion of observa-
tions from class i, has entropy:

E(S) = -\sum_{i=1}^{n} p_i \log_2(p_i) \qquad (11.1)

Suppose S is partitioned into S1 and S2 , the information gain is calculated as:


G(S) = E(S) - \sum_{j=1}^{2} \frac{|S_j|}{|S|} E(S_j) \qquad (11.2)

where |S| denotes the size of the set.


The Gini Index is an alternative measure that is sometimes used. This is defined as:

Gini(S) = 1 - \sum_{i=1}^{n} p_i^2 = \sum_{i=1}^{n} p_i (1 - p_i) \qquad (11.3)
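As a quick numerical illustration (the class proportions below are chosen purely for exposition), the
following snippet evaluates both measures for a node containing 60% of one class and 40% of the
other; the entropy is approximately 0.971 and the Gini index is 0.48.

import numpy as np

#Class proportions at a node (illustrative values only)
p = np.array([0.6, 0.4])

entropy = -np.sum(p * np.log2(p))
gini = 1 - np.sum(p**2)
print(entropy, gini)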

11.3.3 Algorithm
The decision tree algorithm thus proceeds by repeatedly partitioning the data in a manner that
maximises information gain as defined by a chosen function such as entropy or the Gini index. In
theory this partitioning can continue until:
• all observations at a node are of the same class
• a node contains only a single observation (a subcase of above)
• the attributes do not allow observations to be partitioned in a way that increases information
gain.
Such terminal nodes are referred to as the leaf nodes. However, there is a danger of over-fitting:
this is where the model may provide an excellent fit to the training data but fails to generalise well
(i.e. performs poorly when applied to unseen data).

11.3.4 Model Tuning


To overcome such issues, decision trees may be configured in the following ways:
• Set the maximum depth to cap the maximum number of nodes along any branch from the
root to a leaf node.
• Set the minimum samples split to prohibit further partitioning once the number of observations
at a node falls below the threshold
• Set the minimum samples leaf to prevent a split that would result in a leaf node with fewer
observations than the threshold.
• Set the maximum leaf nodes to limit the total number of nodes in a tree. The next split is chosen
based on the maximum reduction in impurity achieved.
• Set the minimum impurity decrease to prevent splits that result in low information gain.
• Set the minimum impurity split to prevent splits once a given level of purity is achieved.
Reducing the size of the tree to prevent overfitting is referred to as pruning. Attempts to constrain
the size of the tree during construction (as above) could be considered ‘pre-pruning’. Post-pruning
involves fitting the tree to the training data, and then reducing its size afterwards by pruning (i.e.
removing leaf nodes) to a point where validation error (against a separate validation data set) is
minimised.

11.4 Evaluation
We consider now how to evaluate a model.

11.4.1 Prediction
Once a model has been trained, it can be applied to the (unseen) test data. Scikit-learn provides
a sklearn.metrics module to help analyse the results.

from sklearn import metrics


y_pred = clf.predict(x_test)

#Compare predicted outcomes with actual known values


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

#Output a table of performance metrics


from sklearn.metrics import classification_report
np.set_printoptions(precision=2)
print (classification_report(y_test, y_pred))

11.4.2 Confusion Matrix


A confusion matrix is a table that contrasts the predicted classification with the actual one. Table
11.1 provides an example of a classification problem with 3 classes A, B, & C. The classification
model is applied to test data, with known classifications. A perfect classifier (i.e. one that made
only accurate predictions), would have all entries along the diagonal. Any values off the diagonal,
represent mis-classifications.

                  Predicted
               A      B      C
 Actual   A   15      6      1
          B    4     13      2
          C    0      4     19

Table 11.1: Confusion Matrix (3 × 3)

How such results are interpreted will be context dependent. Consider, a classification problem with
only two possible outcomes. Classic examples include medical diagnostic tests (where patients
test positive or negative) and criminal trials (where the defendant is guilty or not guilty). An abstract
confusion matrix for such problems is shown in Table 11.2.

                  Predicted
               P      N
 Actual   P   TP     FN
          N   FP     TN

Table 11.2: Confusion Matrix (2 × 2)

Four outcomes are possible:


• True Positive (TP): prediction = positive = actual
• False Positive (FP): prediction = positive ̸= actual
• False Negative (FN): prediction = negative ̸= actual
• True Negative (TN): prediction = negative = actual
False positives are referred to as type I errors. For medical patients being tested for a harmful
disease this is a false alarm. For defendants on trial this is convicting an innocent person.
False negatives are referred to as type II errors. For medical patients being tested for a harm-
ful disease this falsely gives them the all clear. For defendants on trial this is acquitting a guilty
person.
As these examples illustrate, not all errors are equal. Instinctively a type I error seems of much greater
consequence in a court case. Calibrating diagnostic tests and classification models requires a
trade-off between the different error types based on the significance of their consequences.
Suppose the total number of actual positive cases is P and actual negative cases is N . The True
Positive Rate (TPR) is given by:

TPR = \frac{TP}{P} = \frac{TP}{TP + FN}

This is also known as the sensitivity or hit rate. The True Negative Rate (TNR) is given by:

TNR = \frac{TN}{N} = \frac{TN}{TN + FP}

This is also known as the specificity. The False Positive Rate (FPR) is:

FPR = \frac{FP}{N} = \frac{FP}{FP + TN} = 1 - TNR

The accuracy is given by:

ACC = \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + TN + FP + FN}
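To make these definitions concrete, the short sketch below computes the four rates for a hypothetical
2 × 2 confusion matrix (the counts are purely illustrative).

#Hypothetical counts (illustrative only)
TP, FN, FP, TN = 40, 5, 10, 45

P = TP + FN   #actual positives
N = TN + FP   #actual negatives

TPR = TP / P                 #0.889
TNR = TN / N                 #0.818
FPR = FP / N                 #0.182 (equals 1 - TNR)
ACC = (TP + TN) / (P + N)    #0.85
print(TPR, TNR, FPR, ACC)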

In scikit-learn, the confusion matrix can be derived from the predicted results as follows.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)

#Normalise the results (rows sum to 1)


cm = confusion_matrix(y_test, y_pred, normalize = 'true')
print(cm)

11.4.3 ROC Curve


For binary classifiers, a Receiver Operating Characteristic curve is a plot of the True Positive Rate
against the False Positive Rate for different model thresholds. Fawcett (2006) describes it as a
depiction of the tradeoff between benefits (getting it right: true positives) and costs (getting it
wrong: false positives).
Consider a test that assigns a score to every observation, allowing the observations to be ranked
from least likely to be positive, to most likely to be positive. This is illustrated in Figure 11.5 where
observations least likely to be positive appear on the left, and those most likely to be positive appear
on the right. Those that are actually positive are shaded red, while those that are actually negative
are shaded grey. Once the threshold is set, all points to the right will be classified as positive, and
those to the left negative.

Figure 11.5: Classifier Threshold

Clearly if a high bar is set (threshold moves to the right), fewer observations will be classified as
positive and more as negative. Taken to an extreme, if the threshold is set above the maximum
score, all observations will be classified as negative. This gives a TPR of 0 and a FPR of 0. Taken
to the other extreme, if the threshold is set below the minimum score, all observations will be
classified as positive. This gives a TPR of 1 and a FPR of 1. For intermediate values of the threshold
0 ≤ TPR, FPR ≤ 1. Plotting this for different thresholds will therefore generate a plot similar to
Figure 11.6.



Figure 11.6: ROC Curve

The diagonal, represents a random classifier: one that classifies according to a random guess. If
such a classifier guesses positive 50% of the time, it will get 50% of the positives right and 50% of
the negatives correct. Similarly, if it guesses positive 90% of the time, it will get 90% of the positives
right and 90% of the negatives will be false positives. Clearly any useful classifier must outperform
the random classifier, so its ROC curve should remain above the diagonal. A classifier with a ROC
curve below the diagonal is performing very poorly. In fact, greater success would be achieved
by making the opposite prediction of such a model!
Informally, the closer the curve approaches the top left hand corner, the better the classifier. The
classifier illustrated in Figure 11.7 has managed to score/order the observations to perfectly sepa-
rate the positive and negative ones. Regardless of where the threshold is set, it will guarantee not
to include any false positives before all true positives have been detected. Thus 100% of all true
positives are detected before the false positive rate rises above zero.
To visualise the ROC curve using scikit-learn, use the metrics module to first extract the true positive
rate and false positive rate.

Figure 11.7: ‘Perfect’ Classifier

#Extract true/false positive rates


from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

#Create random classifier line


x1 = [0, 1]
y1 = [0, 1]

#add series
plt.plot(x1, y1, '--b', label='random', lw=3)
plt.plot(fpr, tpr, '-r', label='classifier', lw=3)

#add titles
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

#add legend
plt.legend(loc='lower right')

#set axis limits


plt.axis([0, 1, 0, 1.005])

#display chart
plt.show()
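Note that roc_curve expects a score for each observation. Hard 0/1 predictions (as used above)
yield only a very coarse curve, so it is usually preferable to pass predicted probabilities where the
classifier supports them (most scikit-learn classifiers expose predict_proba). A minimal sketch, taking
the second column as the probability of the positive class:

#Use the probability of the positive class as the score
y_score = clf.predict_proba(x_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score)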

11.4.4 AUC
Comparing two ROC curves is not always straightforward. If one curve is consistently higher than
the other it appears to represent a better classifier, but what if they overlap? To make comparison
easier, metrics can be derived. One such metric is the Area Under the Curve (AUC), which effec-
tively integrates the ROC as a function over the unit interval [0, 1]. Since the plot is contained within
the unit square, this produces a value between 0 and 1, with the random classifier (the diagonal)
achieving a score of 0.5.
The AUC has an important statistical property: as Fawcett (2006) notes, it is ‘the probability that
the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative
instance’. It is also related to the Mann-Whitney test. An alternative approach
is the Cumulative Accuracy Profiles (CAP) curve with its associated accuracy ratio, equivalent to
twice the area under the ROC curve lying above the diagonal.
Note that it is dangerous to make claims about the relative performance of two models based
on a single test. As a statistic, one must also consider its variance. The AUC scores from compet-
ing models can be compared statistically using the approach of DeLong, DeLong, and Clarke-
Pearson (1988).
Hand (2009) raises concerns relating to the misclassification costs inherent in AUC (since all errors
are considered equal). However Lessmann, Baesens, Seow, and Thomas (2015) find that among
alternatives, such as Hand’s H-measure, classifier ranking is similar for credit scoring.
In scikit-learn, the auc value can be extracted from the true positive rate and false positive rates
using the metrics module.

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)


auc = metrics.auc(fpr, tpr)

11.4.5 Cross Validation


So far we have considered partitioning the data into separate training and testing sets. However,
problems can arise when data is limited or if training data is unrepresentative. A general way
to address this problem is to repeat the process multiple times, using randomly selected training
sets, with the remainder of data used for testing. An overall error rate is then calculated
as the average error across all these tests. The k-fold approach divides the data into k equal
parts. On each iteration one of the partitions is used for testing, and the rest to build the model.
The bootstrap method (discussed below) randomly selects data with replacement. For large data
sets, the chance that a particular observation will be included is approximately two-thirds (actually
1 − e^{-1}).
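A minimal sketch of such repeated evaluation using scikit-learn’s cross_val_score (the model, the
number of folds and the column names follow the earlier examples and are illustrative):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

#Fit and score the model on 5 different train/test partitions
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         df[feature_cols], df['target'], cv=5)
print(scores, scores.mean())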

11.5 Random Forests


Random forests generate a large number of predictor models (decision trees) which, while individ-
ually may represent poor predictors, combine to create an aggregate model with high predictive
accuracy, low bias and reduced variance (Breiman, 2001). The use of multiple classifiers in such
an ‘ensemble’ compensates for the high variance that arises when only one classifier is used. Any
ensemble approach can be compared to a panel of experts: each contributes their wisdom
so that the panel can reach a collective decision. It can also be compared to the ‘wisdom of
crowds’: each individual member may actually lack real expertise, yet the aggregated judgement can still prove accurate.
In a forest with 1000 trees, each individual tree makes a prediction. In the case of a binary classi-
fier, the ensemble score can be defined as the number of individual trees that predict a positive
outcome for a particular observation. The higher the score, the more strongly the model is pre-
dicting a positive outcome. The overall ensemble prediction can be based on a simple majority
(positive if the score is 500 or more), or based on a more or less exacting threshold (depending on
the cost/benefit of type I and type II errors).
One of the innovations of the random forest is the use of bootstrap aggregation (or ‘bagging’) of
the data set. Each tree within the forest is constructed using a random sample from the training
set with replacement (those observations selected are ‘in the bag’). Some of the observations will
be present more than once, whilst other observations will not feature (see also section 11.2.2).
Oversampling can be used in cases where classes are unbalanced, with one class having far
fewer observations than another. A measure of performance is the ‘out of bag error’, that is,
how well the decision tree fits the observations omitted from the training set.
A second source of randomness for each tree is through the use of attribute subsetting. A third
is through random selection of the attribute selected for each branching decision. While indi-
vidual trees may generate different classification predictions for new instances, the overall forest
prediction is determined by a simple majority vote or defined voting threshold.
Much in the same way a committee of experts are able to arrive at a correct decision versus
an individual, the ensemble of trees predict the correct classification better than an individual
decision tree: “[w]hat one loses, with the trees, is a simple and interpretable structure. What one
gains is increased accuracy” (Breiman, 1996, p137).
An ensemble of decision trees can comprise hundreds or thousands of individual decision trees.
The random elements of the procedure make the algorithm robust to noise and outliers. Bias
is reduced by allowing the trees to grow to maximal depth and correlation is reduced by the
randomness present in the procedure, so the ensemble tends not to overfit the data.
Finally, while random forests have been discussed here in the context of classification models, they
can also be used for regression.
A random forest classifier can be fitted in scikit-learn as follows.

from sklearn.ensemble import RandomForestClassifier

#Create RF. Set number of trees, and ask for oob errors to be calculated
clf = RandomForestClassifier(oob_score=True, n_estimators=250)
clf.fit(x_train, y_train)

print(clf.oob_score_)

In determining how large the forest should be, it is useful to inspect at what point the error stops
decreasing as more trees are added. This can be achieved with a loop where forests of various
sizes are fitted and their out of bag error determined. An example plot is shown in Figure 11.8.

min_estimators = 15
max_estimators = 175
error_rate= []

clf = RandomForestClassifier(random_state=0, oob_score=True)

for i in range(min_estimators, max_estimators + 1):


clf.set_params(n_estimators=i)
clf.fit(x_train, y_train)

# Record the OOB error for each `n_estimators=i` setting.


oob_error = 1 - clf.oob_score_
error_rate.append(oob_error)
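The recorded error rates can then be plotted against the forest size to produce a chart similar to
Figure 11.8, for example:

#Plot the out of bag error against the number of trees
plt.plot(range(min_estimators, max_estimators + 1), error_rate)
plt.xlabel('Forest Size')
plt.ylabel('Error')
plt.title('OOB Error')
plt.show()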

A note on random numbers

Working with random numbers can be frustrating: different results are generated every time
code is executed. Attempts to reproduce earlier results will therefore fail, and the source
of differences cannot be readily identified (what has changed? the code, the data, the
software?). To avoid such frustration, it is normally possible to seed the random number
generators used by models with elements of randomness.

In scikit-learn whenever randomization is part of the algorithm, a random_state parameter


controls the random number generator used. Setting this to an integer will produce the same
results with each call. See https://scikit-learn.org/stable/glossary.html#term-random
-state for more details.

Figure 11.8: Out of bag error

11.5.1 Model Tuning


Configuration options can be used to fine tune Random Forests to increase performance. These
include:
• number of trees in the ensemble
• the number of variables to consider for splitting each node
• sample size - how many observations to use for each bag (and whether these should be
weighted by class size)
• replacement - enable or disable

11.5.2 Variable Importance


With a single decision tree, it is easy to observe which variables are used to partition the data
at each node, and therefore which ones have greater discriminatory power (predictive ability).
When a forest consists of many hundreds of trees this is more challenging. Variable importance
can be inferred from statistics that can be generated when fitting the random forest.
Simple counts of the number of times a variable has been used for node splitting overlook the
fact that the importance of the variable is also dependent on where in the tree the split occurs.
Preferred variable importance scores therefore incorporate some measure of how much improve-
ment has been achieved by the split (such as the Gini index). An alternative approach is to see
how the accuracy score changes based on permuting variables across observations. The larger
the impact, the more important the variable is deemed to be. See section 11.8.4 for further de-
tails.

11.6 Boosting
Boosting is a generic technique that takes an ensemble of weak classifiers and combines them to
create a strong classifier. The key is to adjust the weights during the process, to increase attention
(i.e. boost) on observations that are misclassified, relative to those that are correctly classified. A
boosting model generates a series of models, but adjusts its approach to address deficiencies in
early iterations of the process.
With AdaBoost (Adaptive Boosting), introduced by Freund and Schapire (1997), the adjustment of
weights at each iteration forces new classifiers being added to focus on what the ensemble to
date has difficulty classifying correctly.
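An AdaBoost classifier can be fitted in scikit-learn in much the same way as the earlier models; a
minimal sketch (the number of estimators is illustrative):

from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(x_train, y_train)
print(clf.score(x_test, y_test))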

11.7 Neural Networks


The concept of a neural network is inspired by the human brain. In a human brain, signals are
propagated by synapses between neurons. MRI scans reveal that different parts of the brain
‘light up’ based on the stimulus applied. An image of an artificial neural network (ANN) is provided
in Figure 11.9.

Figure 11.9: Neural Network

The stimulus of the ANN is provided by input data: here a single observation with 4 attributes.
These values are passed to the first layer of neurons or nodes (called the input layer). At this point
in the ANN, the variables can be transformed by the application of a function. Referred to as
activation functions, they are normally sigmoid (S-shaped) functions that map the real line to a
bounded interval as an increasing function. Examples of such functions include:

f(x) = \frac{1}{1 + e^{-x}} \qquad (11.4)
f(x) = \tanh(x) \qquad (11.5)

These signals are then propagated through the network to the next layer of nodes (the hidden
layer) along the synapses or edges, but with different weights. The signal received by each inner
node is therefore a different linear combination of the (transformed) input variables. The inner
nodes perform further transformations before propagating the signal forward, to the outer layer of
nodes, again using different weights. Note that while the diagram shows a single hidden layer of
nodes, and a single node in the output layer, this need not be the case.
A linear regression model is a special case: with inputs connected directly to outputs (no hidden
layer) and with linear activation functions. Feedforward neural networks (as described here) are
those where signals move through the network in the same direction (with no loops). The term
Multi-layer Perceptron is also used for feedforward neural networks assuming at least one hidden
layer, and non-linear activation functions.
ANN models are trained (either for regression or classification) by passing inputs with known results
through the network. Each ANN prediction generates an error. The training process attempts to
minimise this error across the combined training set by adjusting the weights across the network.
Backpropagation refers to the use of gradient-based techniques to allows solvers to minimise the
error function.
As the neural network produces a numerical output it can readily be applied to regression type
problems, but they can also be used for classification. In the case of a dichotomous model, the output
can be treated as a score (similar to the votes in a random forest) with higher values indicating
membership of one class is more likely.
A neural network model can be implemented and evaluated in scikit-learn in a similar manner to
earlier models. Note that such models are sensitive to scale in a way that decision trees are not.
Pre-scaling is therefore advised.

from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler(feature_range=(-1, 1))


scaler.fit(x_train)
xs_train = scaler.transform(x_train)

#Apply the same scaling to the test data


xs_test = scaler.transform(x_test)

A Multi-layer Perceptron classifier can be fitted as follows.

from sklearn.neural_network import MLPClassifier


clf = MLPClassifier(solver='lbfgs', activation='logistic',
hidden_layer_sizes=(8), random_state=1)

clf.fit(xs_train, y_train)

11.8 Feature Selection


Deciding which predictors to include in a model is a difficult problem. Including less important
predictors can increase computational effort and may reduce overall accuracy. In this section we
briefly consider some approaches that can be used to determine which predictors to include and
which to drop. This process is referred to as feature selection. Of course, this should be informed
by academic literature and will be dependent on the model itself and how it copes with large
numbers of (correlated) predictors.
A module called feature_selection is provided by scikit-learn. For full documentation see https://
scikit-learn.org/stable/modules/feature_selection.html

11.8.1 Correlation
Understanding correlation is helpful for two main reasons. Firstly, for detecting relationships be-
tween the predictor variables and the target variable (when numeric). Secondly, for detecting
multicollinearity which may mean that there is redundancy in your predictors, or worse, that
your model will be unstable. See section 9.3.1 for details of how to calculate and inspect correlation
data.

11.8.2 Low variance


Features with low variance (and therefore with many observations that have similar values) may
also have low discriminatory power. The sklearn.feature_selection module contains a VarianceThreshold
class that can be used to remove attributes with variance below a certain threshold.
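A minimal sketch of its use (the threshold value is purely illustrative):

from sklearn.feature_selection import VarianceThreshold

#Drop features whose variance falls below the threshold
selector = VarianceThreshold(threshold=0.01)
x_reduced = selector.fit_transform(x_train)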

11.8.3 Random substitution


One way to determine the importance of a variable is to investigate the impact of removing
it. Simply replace the feature with random numbers, leaving all other features untouched. If the
model performance degrades, this suggests the feature is of some importance. Conversely, if there
is no noticeable performance degradation, the feature is contributing little to the performance of
the model and is a candidate for omission.
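A sketch of the idea, assuming a fitted classifier clf and (hypothetically) a feature named 'cpi':

import numpy as np

#Replace one feature with pure noise, leaving the others untouched
x_noisy = x_test.copy()
x_noisy['cpi'] = np.random.randn(len(x_noisy))   #'cpi' is a hypothetical column name

#Compare accuracy with and without the informative feature
print(clf.score(x_test, y_test), clf.score(x_noisy, y_test))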

11.8.4 Variable importance approaches


Where models have built in feature importance metrics these should be used to inspect the rela-
tive importance of predictors. These can be extracted and plotted as shown in Figure 11.10.

feat_importances = pd.Series(model.feature_importances_, index=feature_cols)
feat_importances.nlargest(10).plot(kind='barh')


Figure 11.10: Variable Importance

11.8.5 Univariate feature selection


This approach uses univariate statistical tests to eliminate certain attributes. Two simple such meth-
ods involve keeping only the k best performing features (SelectKBest in scikit-learn) and keeping only
a given percentage of the highest scoring features (SelectPercentile in scikit-learn).
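A minimal sketch using SelectKBest with the ANOVA F-statistic (the scoring function and the value
of k are illustrative):

from sklearn.feature_selection import SelectKBest, f_classif

#Keep only the 5 highest scoring features
selector = SelectKBest(f_classif, k=5)
x_new = selector.fit_transform(x_train, y_train)
print(selector.get_support())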

11.8.6 Recursive feature elimination


Rather than discarding all unwanted features in a single step, under this approach a repeated
process of pruning is applied. The model is first built using all available features. The worst perform-
ing feature (or features failing to meet some cutoff) is discarded and the model is built again.
Pruning stops when the required number of features is reached, or model performance starts to
deteriorate. This allows some features that were initially low ranked a second chance. For exam-
ple, the importance of a variable may be understated if it is correlated to many others, but its
importance will rise as others are removed.
The idea can be likened to eliminating the lowest ranked contestant in an election system using a
single transferable vote.
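Scikit-learn implements this idea in its RFE class; a minimal sketch (the choice of estimator and the
number of features to keep are illustrative):

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

#Repeatedly fit the model, discarding the weakest features each time
rfe = RFE(estimator=RandomForestClassifier(random_state=0), n_features_to_select=5)
rfe.fit(x_train, y_train)
print(rfe.ranking_)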
12 NATURAL LANGUAGE PROCESSING

For many years text has largely been considered a purely qualitative source of information which
requires humans to review and interpret. The mainstay of scientific investigation has therefore fo-
cused on the analysis of quantitative data. There are a number of contributing factors that have
seen a paradigm shift in attitudes to text over the past two decades. Firstly, advances in computing
have made the capture and processing of large text-based datasets feasible. Secondly, through
the rise of the internet and social media, companies have recognised that commercial advantage
can be gained from insights gleaned from data. Thirdly, the proliferation of tools that make such
investigations possible for those with limited technical capability.
In Finance, applications have included news analytics (see Mitra and Mitra (2011)), trading strate-
gies (e.g. based on Twitter) and fraud detection.

12.1 NLTK
The Natural Language Toolkit (NLTK) is a library of Python programmes for processing human lan-
guage. Its developers have released an accompanying book (Bird, Klein, & Loper, 2009), which is
also available from http://www.nltk.org/book/ in HTML format.
In the examples that follow, the word ‘document’ is used to refer to any single piece of text to be analysed.
It will also be assumed that a variable text has been created and populated with some text from
a file if not explicitly defined.

with open('commentary.txt', 'r', encoding = 'utf-8') as f:


text = f.read()

print(text)

Recall from section 8.3 that text files can come in different formats. A quick way to determine
or change the format of a text file is to use a free text editor such as Notepad++. As shown in
Figure 12.1, the format of the active file is shown in the status bar. In Notepad++ the format can be
changed from the ‘Encoding’ menu. Another useful feature of Notepad++ is the ability to ‘Show
all characters’. This option is available on the toolbar and reveals non-printing characters. When
this toggle is enabled as in Figure 12.1, you can clearly see the locations of carriage returns, line
feeds, spaces, tabs, etc.

12.2 Preprocessing
12.2.1 Tokenization
A starting point for text analysis is to identify the components of a document. Python’s text objects
provide some methods that can be used to do this.

#Split into words


print(text.split())

#Split into lines


for line in text.splitlines():
print(line)

Figure 12.1: Notepad++

Notice that punctuation has been retained, and in some cases, treated as part of a word. In
text analysis the term token is used to represent a series of related characters that form a distinct
grouping. Often this means a word, but it could also mean characters representing a numeric
value or punctuation (commas, question marks, etc). Spaces, tabs and carriage returns are nor-
mally taken as demarking the end of one token and the beginning of another. Consider the
following sentence.

text = "The U.S.A. has $26,000bn of national debt; who's paying the bill?"

Regular expressions (see section 4.5.4) can be used to define precisely those tokens which to
extract from a piece of text. Here we use \w+ to capture groups of one or more alphanumeric
characters.

import re
print(re.findall(r"\w+", text))

This makes a reasonable attempt at parsing the text into tokens but also:
• splits the abbreviation into separate letters (which are individually meaningless)
• splits on the apostrophe
• splits on the comma thousands separator and removes the currency symbol.
How can this be remedied? We address this one step at a time, starting with the abbreviation.
Here we create a regex that searches for either alphanumeric characters or (|) groups of capital
letters followed by full stops.

pattern = "(?:[A-Z]\.)+|\w+"
print(re.findall(pattern, text))

Next we address the apostrophe, but the regex is getting complicated so we write it in a slightly
different way.

pattern = r"""(?x) #allow verbose regex (with spaces)


(?:[A-Z]\.)+ #abbreviations, e.g. U.S.A.
| \w+(?:[']\w+)* #words with an apostrophe
| \w+"""

print(re.findall(pattern, text))

The possible combinations of currency symbols, digits, comma and decimal point separators, and
abbreviations for millions or billions, makes the regex rather complicated looking.

pattern = r"""(?x) # set flag to allow verbose regexps


(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(?:[']\w+)* # words with an apostrophe
| [\$\£]?\d+(?:\,\d+){0,}(?:\.\d+)?(?:m|bn)? #currency
| \w+"""

print(re.findall(pattern, text))

Deciding how to parse a document into tokens of interest can quickly become quite involved.
Fortunately libraries such as NLTK provide tools to achieve this. NLTK’s tokenizer provides a default
pattern which can be custom defined if required.

#Split into sentences
#Note: requires the 'punkt' tokenizer data, e.g. nltk.download('punkt')

from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)

for s in sentences:
print(s)

#Split into words


from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens)

If the results of tokenization do not capture specific cases of interest, you can define your own
regex pattern instead.

from nltk.tokenize import regexp_tokenize


print(regexp_tokenize(text, pattern))

Once the individual tokens of interest have been identified, further processes can be applied to
standardise the text and reduce the total number of unique words (and therefore the complexity
and computational effort of analysing them). In some situations, it may be appropriate to convert
all words to lower case. This means that words that are capitalised as the first word of a sentence
do not require separate treatment.

sentence = "The share price of Apple is bananas"


tokens = regexp_tokenize(sentence, pattern)
tokens = [t.lower() for t in tokens]
print(tokens)

Of course, words that are capitalised as abbreviations or proper nouns potentially lose some
meaning (e.g. ‘Apple’ the company, versus apple the fruit).

12.2.2 Stop Words


Of course not all words will convey meaning. The term ‘stop words’ is used to describe the words
that are most commonly used in a particular language (for example the conjunctions like ‘and’,
‘if’, or ‘when’, and prepositions like ‘to’, ‘with’ or ‘in’). Such words are generally unhelpful in un-
derstanding the content of the corpus and are therefore ignored.

#First make sure that the stopwords data has been downloaded
nltk.download('stopwords')

from nltk.corpus import stopwords

swlist = set(stopwords.words('english'))

text = 'all the news is bad but share prices are still rising'
tokens = regexp_tokenize(text, pattern)

whatsleft = [t for t in tokens if t not in swlist]


print(whatsleft)

Alternative stop words lists (in different languages) can be found at sites such as https://www.ranks
.nl/stopwords. Of course you can create your own (or adapt an existing one). One reason for
doing so would be to eliminate specific words that provide no useful information within a particular
context.

12.2.3 Stemming
Stemming is a process that attempts to substitute a word with the base word from which it is
derived. The goal is to reduce the total number of unique words but doing so in such a way
that the meaning or substance of the word is unaltered. Thus words that differ only by tense or
pluralisation, would all be mapped to the same stem word.

from nltk.stem import PorterStemmer

pst = PorterStemmer()

wordlist = ['investing', 'invested', 'investors', 'invest']

for word in wordlist:


print(pst.stem(word))

Details (and even the Python code) for the PorterStemmer (named after its author, Martin Porter)
can be found at https://tartarus.org/martin/PorterStemmer/. Other stemmers such as ‘Lancast-
erStemmer’ and ‘Snowball’ also form part of the NLTK stem package. See https://www.nltk.org/
api/nltk.stem.html for a full list.

12.3 Document Analysis


To get a sense of the content of the documents, a useful starting point is to look at the frequency
with which each word appears.

from nltk.probability import FreqDist

freq_dist = FreqDist(tokens)

#Output 10 most frequently occurring words


print(freq_dist.most_common()[:10])

Of course, individual words are devoid of context: only in combination do words become mean-
ingful. Another analysis technique is therefore to look for ‘n-grams’, that is, sequences of n words.

from nltk import ngrams

#Find all bigrams (2 word sequences)


bigrams = ngrams(tokens, 2)

freq_dist = FreqDist(bigrams)

#Output 10 most frequently occurring bigrams


print(freq_dist.most_common()[:10])

12.4 Document Term Frequency


Taken together, a collection of related articles represents a corpus. One basic approach to under-
standing the corpus is simply to tabulate the frequency of words appearing therein. Frequencies
provide a basic description of the language used throughout the corpus, and reveal words that
commonly reappear between articles. This provides some insights into the contents of the arti-
cles.
Consider a collection of m documents D = {di : i = 1, . . . , m} which form a corpus of n words
C = {wj : j = 1, . . . , n}. This can be presented as a matrix A where m rows represent the documents,
and n columns represent the words. Let f (di , wj ) be the number of times word wj appears in
document di . The information within the documents can be encoded in element Ai,j in different
ways.

Scheme        Weight
Binary        \mathbf{1}_{f(w_j, d_i) > 0}
Count         f(w_j, d_i)
Log           1 + \log(f(w_j, d_i))
Weighted      f(w_j, d_i) / \sum_j f(w_j, d_i)
Normalised    f(w_j, d_i) / \left( \sum_j f(w_j, d_i)^2 \right)^{1/2}

Table 12.1: Encoding of the document term frequency

The inverse document frequency attempts to measure how commonly a word is used across doc-
uments. One such measure is:

idf(w_j) = \log \frac{m}{\sum_i \mathbf{1}_{f(d_i, w_j) > 0}} \qquad (12.1)

The term frequency–inverse document frequency (tf-idf) is calculated as:

tfidf(d_i, w_j) = tf(d_i, w_j) \cdot idf(w_j) \qquad (12.2)

In this way words which appear frequently in one document, but also appear frequently across
documents, lose their potency of meaning. Dimensionality can be reduced by eliminating words
appearing below a given threshold. For example, O’Callaghan, Greene, Carthy, and Cunning-
ham (2015) set the minimum document frequency (i.e. in how many different documents the
word must appear to be included) to be max(10, n/1000) where n is the total number of docu-
ments.
Here we will use the Python library scikit-learn (Pedregosa et al., 2011) to create such a matrix
representation of a set of documents (rows) and their constituent words (columns).

#Import the feature extraction library


import sklearn.feature_extraction.text as txt

#Define a set of documents; these could be file names


text1 = 'Unemployment falls. Stock prices soar!'
text2 = 'Political turmoil. Market falls and falls.'
text3 = 'Volatility falls as confidence begins to soar.'
documents = [text1, text2, text3]

#Create the vectoriser that will transform the documents


vectorizer = txt.CountVectorizer(stop_words = 'english')

#Perform the transformation and extract the document term matrix


dtm = vectorizer.fit_transform(documents).toarray()

#Extract the words (columns)


#(in newer versions of scikit-learn use get_feature_names_out() instead)
words = vectorizer.get_feature_names()

print(dtm)
print(words)

The output of this code is shown below. The third column indicates that the word ‘falls’ appears in
each document, and twice in the second one.

[[0 0 1 0 0 1 1 1 0 1 0]
[0 0 2 1 1 0 0 0 1 0 0]
[1 1 1 0 0 0 1 0 0 0 1]]

['begins', 'confidence', 'falls', 'market', 'political', 'prices', 'soar',


'stock', 'turmoil', 'unemployment', 'volatility']

Parameters such as min_df and max_df (left at their default values above) can be set to exclude
words that appear below or above chosen thresholds. Alternative vectorisation approaches can be
used. Normalisation seeks to give equal weight to each document by normalising the rows so that
the sum of their squares adds to one.

#Apply normalisation
vectorizer = txt.TfidfVectorizer(stop_words = 'english', use_idf = False)

#Apply inverse document frequency


vectorizer = txt.TfidfVectorizer(stop_words = 'english')

See https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
for exact details of scikit-learn's implementation.

12.5 Sentiment Analysis


Sentiment analysis (sometimes opinion mining) attempts to interpret articles based purely on the
frequencies of particular words. By defining ‘dictionaries’ of words assumed to express a partic-
ular sentiment, the frequencies are taken to reveal the extent to which the sentiment is being
expressed in a particular article. With basic word count approaches, no attempt is made to relate
these words to the holder of the sentiment, or the subject of the sentiment, nor is particular effort
expended dealing with the nuances of natural language (e.g. sarcasm, synonyms, homographs,
etc). The overall sentiment of an article is quantified by converting the frequencies into a score, for
example by dividing the number of words expressing a ‘positive’ sentiment by the total number
of words in the article.
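A minimal sketch of such a word-count score follows; the word lists are invented purely for illustration and are not taken from any published dictionary.

import re

#Toy sentiment dictionaries (illustrative only)
positive_words = {'soar', 'confidence', 'gain', 'growth'}
negative_words = {'falls', 'turmoil', 'weak', 'crisis'}

def sentiment_scores(text):
    #Crude tokenisation: lower-case words only
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in positive_words for t in tokens)
    neg = sum(t in negative_words for t in tokens)
    n = len(tokens)
    return (pos / n, neg / n) if n else (0.0, 0.0)

print(sentiment_scores('Political turmoil. Market falls and falls.'))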
What words should be taken to convey such sentiments? To avoid bias, researchers prefer to
use independently established dictionaries such as Harvard’s Psychosocial Dictionary (see
http://www.wjh.harvard.edu/inquirer/homecat.htm), which provides word listings across a large
number of categories, including those taken to be ‘positive’ and ‘negative’, and words relating to
such areas as conflict, politics and economics.
However, when applied to specific subject areas, such dictionaries may not reflect the intended
meaning of the author. Loughran and McDonald (2015) demonstrate that financial terms would
frequently be misinterpreted by such generic dictionaries.
By analysing reporting in the Wall Street Journal, Tetlock (2007) found evidence that the level of
negative tone had predictive power for daily stock returns. Garcia (2013) reported similar findings
(concentrated in recessions) for the New York Times, with both the negative and positive tone of
reporting predicting daily stock returns.
NLTK’s VADER tool is specifically attuned to sentiments expressed in social media (Hutto & Gilbert,
2014).

#Requires the VADER lexicon, available via nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentence = "It's a wonderful day and I'm very happy and positive"
sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores(sentence)
print(scores)

Output comes in the form of a dictionary of four polarity scores: negative, neutral, positive and
compound.

{'neg': 0.0, 'neu': 0.342, 'pos': 0.658, 'compound': 0.911}

Similar functionality is provided by the library TextBlob.
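For example, a minimal sketch (assuming the textblob package and its corpora have been installed):

from textblob import TextBlob

blob = TextBlob("It's a wonderful day and I'm very happy and positive")
#sentiment is a namedtuple of polarity (-1 to 1) and subjectivity (0 to 1)
print(blob.sentiment)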



12.6 Topic Analysis


When writing about a particular issue, one might expect certain words to appear. For example,
when writing about the economy one might expect to see words such as demand, inflation, em-
ployment, growth or credit. Topic models seek to identify patterns of words that have a tendency
to appear together. Indeed the premise is that any article will discuss a number of topics to vary-
ing degrees, and that discussion of each topic will use a particular distribution of words. Zipf’s law
(Zipf, 1949) would imply that each topic will be dominated by a relatively small number of high
frequency words. One difficulty that arises is the potential for homographs: words with the same
spelling but different meaning. Such words have the potential to link unrelated topics. For exam-
ple, the word ‘bear’ in finance is used both as an adjective and a verb, and may relate to ‘bear
markets’ or to those who ‘bear risk’. Similarly ‘power’ could relate to energy or politics.

12.6.1 LDA
Latent Dirichlet Allocation (LDA), introduced by Blei, Ng, and Jordan (2003), is a statistical tech-
nique that seeks to identify patterns of words occurring within a collection of articles (the corpus).
Sentiment approaches are typically based on word frequencies and require pre-defined dictio-
naries. Compilation of these dictionaries is a non-trivial task requiring significant judgement. By
contrast, LDA identifies the related content inherent in the articles without explicit direction.
Suppose that articles are created from a distribution of topics, each with its own distribution of
associated words. LDA seeks to find these distributions using Bayesian estimation and Gibbs
sampling. The number of topics is an input into this process requiring subjective judgement. When
applied successfully, researchers should be able to apply a meaningful label (or labels) to capture
the essence of each topic. Implementation can be performed using Python packages including
lda, gensim and scikit-learn.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

num_topics = 3

model = LatentDirichletAllocation(n_components=num_topics)
A = model.fit_transform(dtm)
T = model.components_

Here, the output A has one row per document and one column per topic. The score at a_{i,j} can
be thought of as the dominance of topic j in document i. The output T has one row per topic
and one column per word. The score at t_{j,k} can be thought of as the dominance of word k in
topic j.
One will typically be interested in the topics that dominate a particular document, so topic scores should
be sorted in descending order so that the highest scoring topics are listed first. Similarly, to understand
each topic, one will be interested in the dominant words. Again, sorting in descending order and
taking the top 5–10 words should be sufficient for an actual topic to be identified. For example, the
top 10 words of a particular set of topics may appear as:
1. dollar, currency, exchange, foreign, sterling, gold, deficit, euro, weak, central
2. industry, produce, steel, capacity, demand, order, competitive, chemicals, import, engine
3. dividend, cover, distribution, interim, paid, final, scrip, income, payment, expectation
In such a case one might label the topics as ‘foreign exchange’, ‘industry’ and ‘dividends’ re-
spectively.
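A minimal sketch of this sorting, reusing the fitted model from above (A, T and the words list) together with NumPy's argsort, might look like the following; num_words is an arbitrary choice for illustration.

import numpy as np

num_words = 5

#Most dominant topic for each document
for i, doc in enumerate(A):
    print(f'Document {i}: dominant topic {np.argmax(doc)}')

#Highest scoring words for each topic, largest first
for j, topic in enumerate(T):
    top = np.argsort(topic)[::-1][:num_words]
    print(f'Topic {j}:', [words[k] for k in top])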

12.6.2 NMF
An alternative approach is Non-Negative Matrix Factorization (NMF) (Lee & Seung, 1999). As the
name suggests, a matrix representing the frequency of words appearing in the corpus as articles
(rows) and words (columns) is decomposed (approximately) into the product of two non-negative
matrices of lower dimension. The articles can then be represented in terms of scores relating to
each topic, and each topic by scores relating to their use of words.
Consider a corpus containing m articles comprising n unique words. Such a corpus can be
encoded as an m by n matrix C, where c_{i,j} represents the number of occurrences of word j within
document i. NMF attempts to factorize this matrix by approximating it as the product of two
smaller non-negative matrices:

AT ≈ C        (12.3)

Here T ≥ 0 is a p by n matrix where t_{j,k} represents the prevalence of word k in topic j, and A ≥ 0
is an m by p matrix where a_{i,j} represents the extent to which article i relates to topic j. Thus each
article is expressed as a linear combination of topics, which in turn are each expressed as linear
combinations of words.

from sklearn.decomposition import NMF

num_topics = 3
model = NMF(n_components=num_topics)

A = model.fit_transform(dtm)
T = model.components_

The choice of A and T for a given rank p is based on optimisation, typically minimising a measure of reconstruction error. See Xu, Liu, and Gong (2003)
for NMF applied to document clustering. O’Callaghan et al. (2015) suggest that NMF may provide
more coherent topics particularly when analyzing niche or non-mainstream content, while LDA
topics may have higher levels of generality and redundancy.
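The closeness of the approximation in equation (12.3) can be inspected directly. A minimal sketch (reusing A, T, dtm and the fitted model from above) compares the Frobenius norm of the residual with the value scikit-learn stores after fitting:

import numpy as np

#Rebuild the approximation to the document term matrix from the two factors
approx = A @ T

#Frobenius norm of the residual; with the default settings this should match
#(approximately) the model's reconstruction_err_ attribute
print(np.linalg.norm(dtm - approx))
print(model.reconstruction_err_)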

12.7 Packages
12.7.1 SpaCy
SpaCy (see Vasiliev (2020)) provides a range of text processing capabilities including:
• Tokenization
• Part-of-speech Tagging - identification of nouns, verbs, dates, etc.
• Word2Vec and Doc2Vec - creating vector based representations of strings.
• Similarity - providing a measure of how similar two strings are.
For more information see https://spacy.io/.
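A minimal sketch of these capabilities (assuming spaCy is installed along with the small English model, e.g. via python -m spacy download en_core_web_sm; the example sentences are invented):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Stock prices soared after the central bank held rates in March.')

#Tokenization and part-of-speech tagging
for token in doc:
    print(token.text, token.pos_)

#Named entities (dates, organisations, etc.)
for ent in doc.ents:
    print(ent.text, ent.label_)

#Similarity between two texts; most meaningful with models that include
#word vectors (e.g. en_core_web_md)
print(doc.similarity(nlp('Equity markets rallied on the rate decision.')))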

Part IV

Appendix
Index

accuracy, 140
alias, 42
Anaconda, 2
API, 90
assertions, 54
assignment, 22
assignment expression, 50
AUC, 142
backpropagation, 147
bagging, 143
BeautifulSoup, 88
binning, 104
bitwise operator, 30
Boolean operator, 30, 69
bootstrap aggregation, 143
boxplot, 126
break, 40
broadcasting, 65, 66
cells, 8
choropleth, 128, 130
class, 57
command prompt, 6
comment, 21
commit, 84
compound statements, 37
comprehension, 36
confusion matrix, 139
constant, 23
constructor, 58
containers, 31
continue, 40
corpus, 154
Counter, 35
crosstab, 114
dashboard, 130
data snooping, 19
data types, 24
database, 82
DataFrame, 67
dates, 28
DatetimeIndex, 72, 114
debugging, 10
decision tree, 135
decorator, 50, 92
deep copy, 24
defaultdict, 36
destructor, 58
dictionary, 35
display, 24
docstring, 21, 44
DOM, 87
dummy variables, 105
dunder, 58
elif, 37
else, 37
encapsulation, 57, 59
entropy, 137
enum, 50
enumerate, 40
environments, 3
error handling, 48
except, 48
expression, 21
f-string, 27
feature elimination, 148
feature selection, 147
feedforward, 146
flask, 91
folium, 128
for, 39
functions, 43
generator, 47
Gini index, 138
global, 46
groupby, 111
GUI, 73
hard coding, 23
help, 11
HTML, 87
IDE, 6
identifiers, 22
IDLE, 7
if, 37
immutable, 32
import, 42
information gain, 137
inheritance, 57, 60
inline if, 38
inplace, 98
installation, 2
iterable, 32, 36
iteration, 39
iterrows, 69
itertools, 40
itertuples, 70
join, 84
JSON, 85
Jupyter, 12
keywords, 21
kite, 12
lambda expressions, 45
LDA, 157
line continuation, 56
linting, 12
lists, 31
literal, 23
logging, 54
map, 36
markdown, 14
match, 39
Matplotlib, 118
melt, 109
methods, 57
multi-layer perceptron, 147
MultiIndex, 108
n-gram, 154
namespace, 42
NaN, 97
Navigator, 3
nbextensions, 16
nesting, 38
neural network, 146
NLTK, 150
NMF, 158
nominal, 96
normalisation, 84
notebooks, 13
null, 84
NumPy, 63
OOP, 57
operators, 30
OrderedDict, 35
ordinal, 96
os, 77
out-of-sample, 133
out-of-time, 134
pairwise, 40
Pandas, 67
pandas-datareader, 93
pass, 37
pass by reference, 46
pathlib, 78
PCA, 107
pivot_table, 113
plotly, 129
polymorphism, 57
print, 24
properties, 57
protected, 59
pruning, 138
QQ plot, 100
query, 83
r-string, 77
raise, 49
random forest, 143
range, 32
raw string, 77
regex, 27, 151
reindex, 98
repr, 58
requests, 88
reversed, 40
ROC curve, 140
RStudio, 17
SciPy, 66
scope, 46
seaborn, 128
select, 83
selection, 37
selenium, 89
sensitivity, 140
Series, 68
sets, 34
shallow copy, 24
shapefile, 128
slice, 33
SpaCy, 158
specificity, 140
Spyder, 7
SQL, 82
stack, 108
stackoverflow, 12
statement, 21
stemming, 153
stop word, 153
stratified sampling, 134
string, 25
style, 56
subplots, 121
supervised learning, 132
TextBlob, 156
tkinter, 73
token, 151
transpose, 65
try, 48
tuples, 32
type annotations, 48
underscore, 24

unstack, 108
unsupervised learning, 132
urllib, 89

vader, 156
variable explorer, 9
variable importance, 148
variadic, 45
VS Code, 16

walrus, 50
web scraping, 88
while, 41
widget, 73
wildcard, 83
working directory, 7, 77
WRDS, 94

XML, 86
XPath, 87
XSL, 87
XSLT, 87

yfinance, 94
yield, 47

Zen, 55
zip, 36

References
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with python: analyzing text with
the natural language toolkit. O’Reilly Media, Inc.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning
Research, 3(Jan), 993–1022.
Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5–32.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees.
Wadsworth Int. Group, 37(15), 237–251.
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two
or more correlated receiver operating characteristic curves: a nonparametric approach.
Biometrics, 837–845.
Eramo, M. (2020). The art of doing: Create 10 python GUIs with tkinter today! Packt Publishing.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern recognition letters, 27(8), 861–874.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of computer and system sciences, 55(1), 119–139.
Garcia, D. (2013). Sentiment during recessions. The Journal of Finance, 68(3), 1267–1300.
Hand, D. J. (2009). Measuring classifier performance: a coherent alternative to the area under
the roc curve. Machine learning, 77(1), 103–123.
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., . . . Smith,
N. J. (2020). Array programming with NumPy. Nature, 585, 357–362.
Heydt, M. (2018). Python web scraping cookbook (1st ed.). Packt Publishing.
Hilpisch, Y. (2019). Python for finance (2nd ed.). O’Reilly.
Hutto, C., & Gilbert, E. (2014). Vader: A parsimonious rule-based model for sentiment analysis of
social media text. In Proceedings of the international AAAI conference on web and social
media (Vol. 8).
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization.
Nature, 401(6755), 788.
Lessmann, S., Baesens, B., Seow, H.-V., & Thomas, L. C. (2015). Benchmarking state-of-the-art classifi-
cation algorithms for credit scoring: An update of research. European Journal of Operational
Research, 247(1), 124–136.
Loughran, T., & McDonald, B. (2015). The use of word lists in textual analysis. Journal of Behavioral
Finance, 16(1), 1–11.
Mitra, G., & Mitra, L. (2011). The handbook of news analytics in finance. John Wiley & Sons.
Müller, A. C., Guido, S., et al. (2016). Introduction to machine learning with python: a guide for
data scientists. O’Reilly Media, Inc.


O’Callaghan, D., Greene, D., Carthy, J., & Cunningham, P. (2015). An analysis of the coherence of
descriptors in topic modeling. Expert Systems with Applications, 42(13), 5645–5657.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay, E.
(2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12,
2825–2830.
Stein, R. M. (2002). Benchmarking default prediction models: Pitfalls and remedies in model vali-
dation. Moody’s KMV, New York, 20305.
Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market.
The Journal of finance, 62(3), 1139–1168.
Vasiliev, Y. (2020). Natural language processing with python and spacy: A practical introduction.
No Starch Press.
White, H. (2000). A reality check for data snooping. Econometrica, 68(5), 1097–1126.
Witten, I. H., Frank, E., & Hall, M. A. (2005). Data mining: Practical machine learning tools and
techniques. Elsevier.
Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factoriza-
tion. In Proceedings of the 26th annual international ACM SIGIR conference on research and
development in information retrieval (pp. 267–273).
Zipf, G. K. (1949). Human behavior and the principle of least effort. Addison-Wesley.
Dr Alan Hanna
[email protected]
Queen’s Management School
Riddel Hall, 185 Stranmillis Road, Belfast, Northern Ireland, BT9 5EE
http://www.qub.ac.uk/mgt
(September 2023)
